Analogies between the crossing number and the tangle crossing number

Tanglegrams are special graphs that consist of a pair of rooted binary trees with the same number of leaves, and a perfect matching between the two leaf-sets. These objects are of use in phylogenetics and are represented with straight-line drawings where the leaves of the two plane binary trees are on two parallel lines and only the matching edges can cross. The tangle crossing number of a tanglegram is the minimum crossing number over all such drawings and is related to biologically relevant quantities, such as the number of times a parasite switched hosts. Our main results for tanglegrams, which parallel known theorems for crossing numbers, are as follows. The removal of a single matching edge in a tanglegram with $n$ leaves decreases the tangle crossing number by at most $n-3$, and this is sharp. Additionally, if $\gamma(n)$ is the maximum tangle crossing number of a tanglegram with $n$ leaves, we prove $\frac{1}{2}\binom{n}{2}(1-o(1))\le\gamma(n)<\frac{1}{2}\binom{n}{2}$. Further, we provide an algorithm for computing non-trivial lower bounds on the tangle crossing number in $O(n^4)$ time. This lower bound may be tight, even for tanglegrams with tangle crossing number $\Theta(n^2)$.


Introduction
A drawing D(G) of a graph G in the plane is a set of distinct points in the plane, one for each vertex of G, and a collection of simple open arcs, one for each edge of the graph, such that if e is an edge of G with endpoints v and w, then the closure (in the plane) of the arc α representing e consists precisely of α and the two points representing v and w. We further require that no edge-arc intersects any vertex point. The (standard) crossing number cr(D(G)) of D(G) is the number of pairs (x, {α, β}), where x is a point of the plane, α, β are arcs of D representing distinct edges of G such that x ∈ α ∩ β. The crossing number cr(G) of a graph G is defined to be the minimum crossing number over all of its drawings.
Tanglegrams are well studied in the phylogenetics and computer science literature. A tanglegram of size $n$ is a triplet containing two rooted binary trees ($L$ and $R$), each with $n$ leaves, and a fixed perfect matching $M$ between the two sets of leaves. Two tanglegrams $T_1 = (L_1, R_1, M_1)$ and $T_2 = (L_2, R_2, M_2)$ are the same if there is a pair of tree-isomorphisms $(\varphi, \psi)$, from $L_1$ to $L_2$ and from $R_1$ to $R_2$, that maps each pair of matched leaves to a pair of matched leaves. A layout of a tanglegram is a straight-line plane drawing of the trees, the first drawn in the half plane $x \le 0$ with its leaves on the line $x = 0$ and the second in the half plane $x \ge 1$ with its leaves on the line $x = 1$, together with a straight-line drawing of the matching edges between the leaves. The tangle crossing number $\mathrm{crt}(T)$ of a tanglegram $T$ is the minimum crossing number over all of its layouts, i.e. the minimum number of unordered pairs of crossing edges over all layouts. The tangle crossing number is related to the number of times parasites switch hosts [10] as well as the number of horizontal gene transfers [4].
Though tangle crossing numbers are crossing numbers of a very specific kind of drawing of a very specific class of graphs, a number of analogies are known between tangle crossing numbers and crossing numbers. As with the crossing numbers of general graphs [9], computing the tangle crossing number is NP-hard [8], even when both trees are complete binary trees [3]. Testing whether a graph is planar can be done in polynomial (in fact linear) time [12]. Analogously, testing for tangle crossing number 0 can also be done in linear time [8]. Recently, Czabarka, Székely, and Wagner [6] gave an analogue of Kuratowski's Theorem [13] for tanglegrams, characterizing tangle crossing number 0. Clearly, for a graph $G$ with $e$ edges we have $\mathrm{cr}(G) = O(e^2)$, while for a tanglegram $T$ of size $n$, $\mathrm{crt}(T) = O(n^2)$. The expected crossing number of an Erdős-Rényi random graph $G \in G(n, p)$ with $p = c/n$ for any $c > 1$ is $\Theta(e^2)$, where $e = p\binom{n}{2}$ is the expected number of edges [14], and the expected tangle crossing number of a random and uniformly selected tanglegram with $n$ leaves is $\Theta(n^2)$ [5], i.e. both of these quantities are as large as possible in order of magnitude.
We continue the study of the tangle crossing number with results which parallel results for graph crossing numbers. Hliněný and Salazar [11] studied the crossing number of 1-edge planar graphs (i.e. graphs in which there exists an edge whose removal results in a planar graph). For each $k \ge 1$, they define a 1-edge planar graph $G_k$ with $2k + 4$ vertices, $6k + 7$ edges, and crossing number $k$. We find that the behavior is quite similar for the tangle crossing number. First we establish an upper bound for $\mathrm{crt}(T) - \mathrm{crt}(T - e)$ given any tanglegram $T$ and any matching edge $e$. Then for each $n \ge 4$, we define a tanglegram of size $n$ with tangle crossing number $n - 3$ for which there is a single matching edge whose removal yields a planar subtanglegram. In summary, we prove the following theorem in Section 3:

Theorem 1. For any tanglegram $T$ of size $n \ge 3$ and any matching edge $e$ in $T$, let $T - e$ be the tanglegram induced by deleting the endpoints of $e$ and suppressing their (now degree two) neighbors. Then $\mathrm{crt}(T) - \mathrm{crt}(T - e) \le n - 3$. This inequality is best possible, even when $T - e$ is planar.
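We then examine the largest tangle crossing number of a tanglegram of size $n$ (an analogue of the crossing number of the complete graph on $n$ vertices). It is well known (e.g. by the crossing lemma or by the counting method) that the crossing number of the complete graph $K_n$ is $\Theta(n^4) = \Theta\left(\binom{n}{2}^2\right)$. In Section 4 we prove the analogous statement for tanglegrams:

Theorem 2. Let $\gamma(n)$ be the maximum tangle crossing number among all tanglegrams of size $n$. Then $\frac{1}{2}\binom{n}{2}(1-o(1)) \le \gamma(n) < \frac{1}{2}\binom{n}{2}$.

Interestingly, the structure of a size $n$ tanglegram with maximum tangle crossing number remains unknown.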
We conclude with a polynomial time algorithm for computing lower bounds on the tangle crossing number in Section 5. Drawing random tanglegrams of size $n$ from a uniform distribution, we give computational evidence that these lower bounds are $\Theta(n^2)$ with high probability, thus matching the result of Czabarka, Székely, and Wagner [5] that such a tanglegram has tangle crossing number $\Theta(n^2)$ with high probability.

Preliminaries
Before delving into the proofs of our main theorems, we need to establish some terminology and more formal definitions. A rooted binary tree is a tree in which one vertex is designated as the root and each vertex has either 0 or 2 children. A vertex with 0 children is a leaf. A vertex with 2 children is called an internal vertex. Thinking of the root as a common ancestor to all other vertices, the notions of descendant, parent, children and sibling become clear. If $B$ is a rooted binary tree, a subset of the leaves of $B$ induces a binary subtree $B'$, obtained from the smallest subtree of $B$ containing these leaves by suppressing all degree 2 vertices and choosing as the root of $B'$ the vertex which was closest to the root of $B$. For any internal vertex $v$ of $B$, the subtree induced by the leaves which are descendants of $v$ is a clade of $B$ at $v$. If the two children of $v$ are leaves, then the corresponding clade is called a cherry.
A tanglegram layout is a straight-line drawing in the plane of two rooted binary trees, $L$ and $R$, each with $n$ leaves, and a perfect matching $M$ between their leaves (each leaf of $L$ paired with a unique leaf of $R$), having the following properties:
• A plane drawing of $L$ appears in the half plane $x \le 0$ with only the leaves of $L$ on the line $x = 0$.
• A plane drawing of $R$ appears in the half plane $x \ge 1$ with only the leaves of $R$ on the line $x = 1$.
• The matching is represented by a (straight-line) drawing of edges connecting each leaf of $L$ with the appropriate leaf of $R$.
The crossing number of such a layout is precisely the number of unordered pairs of matching edges which cross. As there are $n$ matching edges, the crossing number is clearly at most $\binom{n}{2}$. To transform one layout into another, we define a switch. First observe that a layout induces a total order on the leaves of $L$ by the $y$-coordinate of the leaves on the line $x = 0$. Now each internal vertex $v$ of $L$ has two children $v_1$ and $v_2$. To make a switch at $v$, redraw the tree $L$ so that in the new layout, the order of two leaves is reversed if and only if one is a leaf in the clade at $v_1$ and the other is a leaf in the clade at $v_2$. The resulting tanglegram layout displays the new drawing of $L$, an unchanged drawing of $R$, and the corresponding straight-line drawing of the matching edges connecting the appropriate pairs of leaves. Switch operations at internal vertices of $R$ are defined analogously. Observe that the switch operation defines an equivalence relation on the set of tanglegram layouts; each equivalence class is called a tanglegram, denoted by the triple $(L, R, M)$.
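As a concrete illustration of these two notions, the following Python sketch counts the crossings of a layout and performs a switch. The nested-pair encoding of plane trees and the names `leaf_order`, `switch`, and `layout_crossings` are assumptions made for this example, not notation from the paper.

```python
from itertools import combinations

def leaf_order(tree):
    """Top-to-bottom leaf order of a plane tree given as nested pairs."""
    if not isinstance(tree, tuple):
        return [tree]
    return leaf_order(tree[0]) + leaf_order(tree[1])

def switch(tree, path=()):
    """Swap the two children of the internal vertex reached from the root
    by following `path`, a sequence of child indices (0 or 1)."""
    if not path:
        return (tree[1], tree[0])
    children = list(tree)
    children[path[0]] = switch(tree[path[0]], path[1:])
    return tuple(children)

def layout_crossings(left_tree, right_tree, matching):
    """Number of unordered pairs of matching edges that cross."""
    left = leaf_order(left_tree)
    pos_right = {leaf: i for i, leaf in enumerate(leaf_order(right_tree))}
    image = [pos_right[matching[leaf]] for leaf in left]
    # Two straight edges cross exactly when their endpoints appear in
    # opposite relative order on the lines x = 0 and x = 1.
    return sum(1 for i, j in combinations(range(len(image)), 2)
               if image[i] > image[j])

# A hypothetical tanglegram of size 4 (labels chosen for illustration).
L = (("a", "b"), ("c", "d"))
R = (("w", "x"), ("y", "z"))
M = {"a": "x", "b": "w", "c": "y", "d": "z"}

print(layout_crossings(L, R, M))                # initial layout
print(layout_crossings(L, switch(R, (0,)), M))  # switch at the parent of w, x
print(layout_crossings(switch(L), R, M))        # switch at the root of L
```

For these inputs the three layouts have 1, 0, and 5 crossings: the switch at the cherry changes only the pair of edges ending in that cherry, while the switch at the root of `L` changes exactly the four pairs with one edge in each clade below the root.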
Let $T = (L, R, M)$ be a tanglegram. The size of $T$ is the size of the matching $M$ (also the number of leaves in $L$ and the number of leaves in $R$). The tangle crossing number of $T$, denoted $\mathrm{crt}(T)$, is the minimum number of pairs of edges that cross, among all layouts of $T$. If $T$ has size $n$ then one can easily deduce that $\mathrm{crt}(T) \le \binom{n}{2}$. Given a tanglegram $T = (L, R, M)$, a subset $M'$ of $M$ induces a subtanglegram $T' = (L', R', M')$, where $L'$ is the subtree of $L$ induced by the leaves of $L$ which are endpoints of edges in $M'$ and $R'$ is defined similarly.
We let $\gamma(n)$ denote the maximum tangle crossing number among all tanglegrams of size $n$. In addition, we use the now standard notation $[n]$ for the set $\{1, 2, \ldots, n\}$.

Subtanglegrams of one size smaller
In a tanglegram of size $n$, the tangle crossing number is at most $\frac{1}{2}\binom{n}{2}$ (Theorem 2). Given a tanglegram with tangle crossing number close to this upper bound, about half of all pairs of matching edges cross in an optimal layout, so on average each matching edge crosses about half of the other matching edges. We explore the maximum number of crossings a single edge could contribute to the overall tangle crossing number. Phrased another way, for any tanglegram $T$ of size $n$ and subtanglegram $T'$ of size $n - 1$, we determine the maximum value of $\mathrm{crt}(T) - \mathrm{crt}(T')$. The result is given in Theorem 3, an upper bound which Theorem 4 shows to be tight, even for tanglegrams with $T'$ planar. These two theorems together complete the proof of Theorem 1.
Throughout this section, given a tanglegram $T = (L, R, M)$ and $e \in M$, we use $T - e$ to denote the subtanglegram of $T$ induced by the edges in $M - e$.

Theorem 3. For any tanglegram $T$ of size $n \ge 3$ and any matching edge $e$ of $T$, $\mathrm{crt}(T) - \mathrm{crt}(T - e) \le n - 3$.

Proof. We proceed by induction on $n$. First observe that if $T$ is a tanglegram of size at most 3 then it is planar, and if $T$ is a tanglegram of size 4 then $\mathrm{crt}(T) \le 1$ [6]; so the theorem is trivial when $n \le 4$. Let $n \ge 5$ and suppose that in every tanglegram of size $n - 1$, each edge contributes at most $(n - 1) - 3$ to the tangle crossing number. Fix a tanglegram $T = (L, R, M)$ of size $n$, and let $e \in M$ be an arbitrary matching edge of $T$. Say $e$ has endpoints $u$ in $L$ and $v$ in $R$. Fix an optimal layout $D'$ of $T - e = (L_u, R_v, M - e)$, that is, one with the fewest crossings.
In $L$, let $w_L$ be the parent of $u$ and let $L'$ be the clade at the other child of $w_L$. (Similarly, define $w_R$ and $R'$.) There are two planar drawings of $L$ whose subdrawings of $L_u$ agree with the drawing of $L_u$ in $D'$, one with $u$ immediately above the leaves of $L'$ and one with $u$ immediately below the leaves of $L'$. The ordering of the leaves of $L_u$ in each of these drawings of $L$ is exactly the same as the ordering of the leaves in the drawing of $L_u$ in $D'$. Further, one of these drawings of $L$ can be obtained from the other by making a switch at $w_L$. A similar claim can be made about $R$, $v$, $R_v$, $w_R$ and $R'$. Figure 1 uses dashed lines to indicate the two potential positions of $u$ and of $v$ in a drawing of $T$.
We claim that there is a drawing $D$ of $T$, using one of these two drawings of $L$ and one of these two drawings of $R$, in which the matching edge $e$ crosses at most $n - 3$ edges. This is sufficient to complete the proof, as the number of crossings between two edges of $M - e$ in $D$ is exactly $\mathrm{crt}(T - e)$ (because the underlying drawing of $T - e$ is the optimal layout $D'$).

First observe that $L'$ and $R'$ each have at least one leaf. There are two cases to consider: (1) $L'$ and $R'$ each have exactly one leaf and they are matched in $M - e$, or (2) there is a leaf in $L'$ and a leaf in $R'$ which are not matched with one another.
For the first case, let $f$ be the edge matching the single leaf in $L'$ with the single leaf in $R'$. Consider the drawing of $T$ with $u$ above $L'$ and $v$ above $R'$ so that $e$ is above $f$. Suppose, for contradiction, that $e$ participates in strictly more than $n - 3$ crossings in this drawing of $T$. As $e$ does not cross itself or $f$, the matching edge $e$ must cross every other edge in $M$. Since there are no leaves between the left endpoints of $e$ and $f$ and no leaves between the right endpoints of $e$ and $f$, it follows that $f$ also participates in $n - 2$ crossings in this drawing of $T$. As the drawing of $T - e$ was optimal, we see that $f$ contributes $n - 2$ to the tangle crossing number of the tanglegram $T - e$, which has size $n - 1$. However, by the induction hypothesis, each edge in $T - e$ contributes at most $(n - 1) - 3$ crossings to $\mathrm{crt}(T - e)$, a contradiction.
For the second case, let $u_L$ be a leaf in $L'$ and $v_R$ be a leaf in $R'$ which are not matched to each other. We say $u_L$ (respectively, $v_R$) is "matched upward" if the leaf to which it is matched is at least as high as the lowest leaf of $R'$ (respectively, $L'$). The leaf $u_L$ (respectively, $v_R$) is "matched downward" if the leaf to which it is matched is no higher than the highest leaf of $R'$ (respectively, $L'$).
Let $f_1$ and $f_2$ be the matching edges, one with endpoint $u_L$ and the other with endpoint $v_R$. If $u_L$ and $v_R$ are both matched upward (respectively, downward), draw the vertex $u$ below (respectively, above) $L'$ and the vertex $v$ below (respectively, above) $R'$. On the other hand, if $u_L$ is matched to a leaf higher (lower) than the leaves of $R'$ and $v_R$ is matched to a leaf lower (higher) than the leaves of $L'$, then draw $u$ directly below (above) the leaves of $L'$ and $v$ directly above (below) the leaves of $R'$. In each of these cases, the edge $e$ crosses neither $f_1$ nor $f_2$, and therefore crosses at most $n - 3$ other edges, from which $\mathrm{crt}(T) - \mathrm{crt}(T - e) \le n - 3$ follows.

Now we prove that the inequality in Theorem 3 is best possible. To do so, we present an infinite family of tanglegrams $\{P_n : n \ge 4\}$ such that $P_n$ has size $n$, tangle crossing number $n - 3$, and there exists a matching edge $e$ such that $\mathrm{crt}(P_n - e) = 0$. We say $P_n$ is 1-edge tangle planar as $P_n$ is not planar but there is a matching edge $e$ such that the subtanglegram $P_n - e$ is planar. The two binary trees in $P_n$ are rooted caterpillars.

Definition 1. The rooted caterpillar $C_n$ of size $n$ is the unique rooted binary tree with $n$ leaves such that there are two leaves of distance $n - 1$ from the root and for each $i \in [n - 2]$ there is one leaf of distance $i$ from the root. (See Figure 2 for an example.)

Figure 2. The caterpillar $C_6$

Definition 2. For each $n \ge 4$, we define the caterpillar tanglegram $P_n = (L_n, R_n, M_n)$ as follows: $L_n$ and $R_n$ are copies of the rooted caterpillar $C_n$. We label the leaves of $L_n$ as $u_i$, where $i$ is the leaf's distance from the root. Since there are precisely two leaves at distance $n - 1$, we arbitrarily label one of these $u_n$ instead. Similarly, the leaves of $R_n$ are labeled using $v_i$. Finally, we construct the matching $M_n = \{u_i v_{n-i} : i \in [n - 1]\} \cup \{u_n v_n\}$. (See Figure 3 for an example.)

Figure 3. The caterpillar tanglegram $P_8$

Theorem 4. For each $n \ge 4$, the caterpillar tanglegram $P_n$ is 1-edge tangle planar and has tangle crossing number $n - 3$.
Proof. Note that the tanglegram $P_n - u_n v_n$ is clearly a planar tanglegram (see Figure 3), so $P_n$ is 1-edge tangle planar. The same drawing demonstrates that $\mathrm{crt}(P_n) \le n - 3$. It remains to show that $\mathrm{crt}(P_n) \ge n - 3$. Suppose, for contradiction, that there is some $k$ for which $\mathrm{crt}(P_k) < k - 3$. Furthermore, let $k$ be the least index witnessing this strict inequality. One can check by computer that $\mathrm{crt}(P_n) = n - 3$ for $4 \le n \le 7$, so $k \ge 8$. Since $P_k$ contains a subtanglegram isomorphic to $P_7$, $\mathrm{crt}(P_k) \ge 7 - 3 = 4$. There are two cases for a fixed optimal drawing of $P_k$: at least one matching edge in the set $\{u_i v_{k-i} : 2 \le i \le k - 2\}$ is part of a crossing, or else none of them are.
In the latter case, only the edges $u_1 v_{k-1}$, $u_{k-1} v_1$, and $u_k v_k$ have crossings, so there are at most $\binom{3}{2} = 3$ crossings and therefore $3 \ge \mathrm{crt}(P_k) \ge 4$, a contradiction.
In the former case, say the edge $u_j v_{k-j}$ is part of a crossing. The subtanglegram induced by $M_k - u_j v_{k-j}$ is isomorphic to $P_{k-1}$ and has tangle crossing number at most $\mathrm{crt}(P_k) - 1$. It follows that $\mathrm{crt}(P_{k-1}) \le \mathrm{crt}(P_k) - 1 < (k - 1) - 3$, which contradicts the minimality of $k$.
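The small cases mentioned in the proof can be checked by exhaustive search. The sketch below is one possible brute-force check, not the authors' program: it enumerates all $2^{n-1}$ leaf orders of each caterpillar and takes the minimum number of crossings. The matching is the one read off from the edges named in the proof ($u_i v_{n-i}$ for $i \in [n-1]$ together with $u_n v_n$); the function names and string labels are assumptions of this example.

```python
from itertools import combinations

def caterpillar(labels):
    """Rooted caterpillar as nested pairs: labels[i-1] is at depth i,
    and the last two labels form the deepest cherry."""
    tree = (labels[-2], labels[-1])
    for leaf in reversed(labels[:-2]):
        tree = (leaf, tree)
    return tree

def leaf_orders(tree):
    """Yield every leaf order realizable by a plane drawing of the tree."""
    if not isinstance(tree, tuple):
        yield (tree,)
        return
    for a in leaf_orders(tree[0]):
        for b in leaf_orders(tree[1]):
            yield a + b
            yield b + a  # the same drawing after a switch at this vertex

def crossings(left_order, right_order, matching):
    pos = {leaf: i for i, leaf in enumerate(right_order)}
    image = [pos[matching[leaf]] for leaf in left_order]
    return sum(1 for i, j in combinations(range(len(image)), 2)
               if image[i] > image[j])

def tangle_crossing_number(left, right, matching):
    """Minimum number of crossings over all layouts (exponential time)."""
    right_orders = list(leaf_orders(right))
    return min(crossings(lo, ro, matching)
               for lo in leaf_orders(left) for ro in right_orders)

def caterpillar_tanglegram(n, with_last_edge=True):
    """P_n, or P_n with the edge u_n v_n removed when with_last_edge=False."""
    m = n if with_last_edge else n - 1
    left = caterpillar([f"u{i}" for i in range(1, m + 1)])
    right = caterpillar([f"v{i}" for i in range(1, m + 1)])
    matching = {f"u{i}": f"v{n - i}" for i in range(1, n)}
    if with_last_edge:
        matching[f"u{n}"] = f"v{n}"
    return left, right, matching

for n in range(4, 8):
    full = tangle_crossing_number(*caterpillar_tanglegram(n))
    reduced = tangle_crossing_number(*caterpillar_tanglegram(n, False))
    print(n, full, reduced)
```

The search is only practical for small $n$, since the number of layouts grows as $4^{n-1}$; it is meant as a sanity check of the base cases rather than as a general method.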

Maximizing the crossing number
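While a single edge in a tanglegram of size $n$ can contribute up to $n - 3$ to the tangle crossing number, not all matching edges can realize this many crossings in a drawing which minimizes the tangle crossing number. The aim of this section is to better understand the maximum tangle crossing number among tanglegrams of the same size, and to prove Theorem 2.

We begin with the first part, the upper bound $\gamma(n) < \frac{1}{2}\binom{n}{2}$. Let $T$ be a tanglegram of size $n$ with $\mathrm{crt}(T) = k$, and let $D$ be a layout of $T$ with exactly $k$ crossings. Making a switch at every internal vertex of one of the trees reverses the order of its leaves, so two matching edges cross in the resulting layout $D'$ if and only if they do not cross in $D$. This implies that $D'$ has exactly $\binom{n}{2} - k$ crossings. Since $\mathrm{crt}(T) = k$, every layout has at least $k$ crossings. Consequently, $\binom{n}{2} - k \ge k$ and $\mathrm{crt}(T) = k \le \frac{1}{2}\binom{n}{2}$.

Suppose that, contrary to our statement, $\mathrm{crt}(T) = \frac{1}{2}\binom{n}{2}$. It follows from our proof so far that any layout of $T$ has exactly $\frac{1}{2}\binom{n}{2}$ crossings, and for any unordered pair $\{e, f\}$ of matching edges there is a layout in which they cross. Let $C$ be a cherry of $R$ with leaves $\ell_1$ and $\ell_2$ incident with matching edges $e, f \in M$. As noted above, $e$ and $f$ must cross in some layout $D$ of $T$. From $D$, we create a new layout $D'$ by making a switch at the parent of $\ell_1$ and $\ell_2$. The switch changes only whether $e$ and $f$ cross, so the number of crossings in $D'$ is one less than in $D$, a contradiction.

We prove the second part of Theorem 2 by constructing for each $n \ge 4$ a family $\mathcal{T}_n$ of tanglegrams of size $n$ such that for any $\varepsilon > 0$ and large enough $n$, for all $T \in \mathcal{T}_n$, $\mathrm{crt}(T)/\binom{n}{2} \ge \frac{1}{2} - \varepsilon$.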
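One way to see where a count of the form $\binom{n}{2} - k$ comes from is to reflect one tree: making a switch at every internal vertex reverses its leaf order, so every crossing pair becomes a non-crossing pair and vice versa. The Python sketch below illustrates this on a randomly chosen matching; the tree shape, the seed, and the helper names are assumptions of this example.

```python
import random
from itertools import combinations
from math import comb

def leaf_order(tree):
    if not isinstance(tree, tuple):
        return [tree]
    return leaf_order(tree[0]) + leaf_order(tree[1])

def mirror(tree):
    """Make a switch at every internal vertex (reflect the drawing)."""
    if not isinstance(tree, tuple):
        return tree
    return (mirror(tree[1]), mirror(tree[0]))

def crossings(left_order, right_order, matching):
    pos = {leaf: i for i, leaf in enumerate(right_order)}
    image = [pos[matching[leaf]] for leaf in left_order]
    return sum(1 for i, j in combinations(range(len(image)), 2)
               if image[i] > image[j])

n = 8
left_tree = (((0, 1), (2, 3)), ((4, 5), (6, 7)))
right_order = list(range(n))          # any fixed drawing of the right tree
partners = list(range(n))
random.Random(0).shuffle(partners)    # an arbitrary perfect matching
matching = dict(enumerate(partners))

k = crossings(leaf_order(left_tree), right_order, matching)
k_mirror = crossings(leaf_order(mirror(left_tree)), right_order, matching)
print(k, k_mirror, comb(n, 2))
```

Whatever matching is drawn, `k + k_mirror` equals `comb(n, 2)`, so the smaller of the two layouts has at most half of all pairs crossing.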
We begin by constructing a family $\mathcal{T}_{k^2}$ for each integer $k \ge 2$. Any $T \in \mathcal{T}_{k^2}$ is the result of the following procedure: Take an arbitrary $(2k + 2)$-tuple of rooted binary trees $(L_0, \ldots, L_k, R_0, \ldots, R_k)$, each with $k$ leaves. Label the $k$ leaves of $L_0$ with labels $\{1, 2, \ldots, k\}$ arbitrarily. For each $i \in [k]$, identify the root of $L_i$ with leaf $i$ in $L_0$ and assign labels $\{v_{ij} : j \in [k]\}$ to the $k$ leaves of $L_i$. The result is the rooted binary tree $L$ with $k^2$ leaves. Similarly, $R$ is built from $(R_0, R_1, \ldots, R_k)$ with leaf labels $\{w_{ij} : i, j \in [k]\}$. The matching is defined as $M = \{v_{ij} w_{ji} : i, j \in [k]\}$. Figure 4 shows a tanglegram in $\mathcal{T}_9$. The binary trees $L_1, L_2, L_3$ are marked by dashed rectangles. The tree $L_0$ is the subtree of $L$ consisting of the roots of $L_1, L_2, L_3$, and their ancestors. Note that the trees $L_0$, $L_1$, $L_2$, and $L_3$ need not be isomorphic. They are only isomorphic here because there is only one binary tree, up to isomorphism, with 3 leaves. Further, for any choice of two clades in $L$ and two clades in $R$, there is at least one pair of edges which cross.

Figure 4. A tanglegram in $\mathcal{T}_9$
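With a well-defined set of tanglegrams $\mathcal{T}_{k^2}$ for each $k \ge 2$, we now define $\mathcal{T}_n$ for any integer $n \ge 4$. Fix $n$ and choose $k$ such that $k^2 \le n < (k + 1)^2$. Let $\mathcal{T}_n$ be the set of tanglegrams of size $n$ such that $T \in \mathcal{T}_n$ if and only if there is a tanglegram $T' \in \mathcal{T}_{k^2}$ with $T'$ a subtanglegram of $T$. Figure 5 shows a tanglegram in $\mathcal{T}_5$. The tanglegram with bold edges is a subtanglegram in $\mathcal{T}_4$.

Proof. First we show that for each $k \ge 2$ and $T \in \mathcal{T}_{k^2}$, $\mathrm{crt}(T) \ge \binom{k}{2}^2$. In any layout, each choice of two of the clades $L_1, \ldots, L_k$ and two of the clades $R_1, \ldots, R_k$ forces a crossing between a pair of edges joining these clades, and the $\binom{k}{2}^2$ choices give distinct pairs of edges. As the tangle crossing number of each tanglegram in $\mathcal{T}_{k^2}$ is at least $\binom{k}{2}^2$, the tangle crossing number of each tanglegram in $\mathcal{T}_n$ with $n > k^2$ is also at least $\binom{k}{2}^2$.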
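Let $n \ge 4$ and $k = \lfloor \sqrt{n} \rfloor$, so $k^2 \le n < (k + 1)^2$. Observe that for each tanglegram $T \in \mathcal{T}_n$, $\mathrm{crt}(T)/\binom{n}{2} \ge \binom{k}{2}^2 / \binom{(k+1)^2}{2}$, and the right-hand side tends to $\frac{1}{2}$ as $n \to \infty$. As a result, Theorems 5 and 6 complete the proof of Theorem 2.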
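The limiting behavior can be checked numerically. With $k = \lfloor\sqrt{n}\rfloor$, the sketch below evaluates the ratio of the lower bound $\binom{k}{2}^2$ to $\binom{n}{2}$ for a few values of $n$; the function name `density_lower_bound` is an assumption of this example.

```python
from math import comb, isqrt

def density_lower_bound(n):
    """Ratio of the lower bound C(k, 2)^2 to C(n, 2), with k = floor(sqrt(n))."""
    k = isqrt(n)  # k^2 <= n < (k + 1)^2
    return comb(k, 2) ** 2 / comb(n, 2)

for n in (10**2, 10**4, 10**6, 10**8):
    print(n, round(density_lower_bound(n), 4))
```

When $n = k^2$ the ratio simplifies to $\frac{k-1}{2(k+1)}$, which is about 0.409 for $n = 100$ and about 0.499 for $n = 10^6$, approaching $\frac{1}{2}$ from below.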

Lower bound of the tangle crossing number
Let $T = (L, R, M)$ be a tanglegram of size $n$. In this section, we present an algorithm which outputs a non-trivial lower bound for the tangle crossing number of $T$ in $O(n^4)$ time. As we will show, this lower bound is tight for some tanglegrams with quadratic tangle crossing number. The algorithm runs in two phases. First it partitions the leaves of each tree into clades. In the second phase the clades are used to compute the lower bound for $\mathrm{crt}(T)$. Now we describe the algorithm for partitioning the leaves of a given tree into clades, given a restriction on their size. Note that we use this algorithm independently for $L$ and $R$.
Algorithm 1 Partition leaves into clades
Input: A binary tree $B$ and a number $C > 1$.
Output: A partition of the leaves of $B$ into clades of size at most $C$.
1: Label each vertex $v$ in $B$ with the number of leaves in the clade of $B$ at $v$.
2: Let $\{v_i\}_{i=1}^{k}$ be the set of vertices such that the label at $v_i$ is at most $C$ and whose parent has label greater than $C$.
3: For each $i \in [k]$, let $V_i$ be the set of leaves in the clade of $B$ at $v_i$, and return $\{V_i\}_{i=1}^{k}$.

Algorithm 1 can be implemented in $O(n)$ time. This follows from noting that step 1 requires a post-order traversal of $B$ and each of steps 2 and 3 requires a pre-order traversal of $B$. Let $W = \{v_i\}_{i=1}^{k}$ be the set of vertices from step 2. Note that if $v_i, v_j \in W$, then $v_i$ is not an ancestor of $v_j$ and vice versa. It is easy to see that a consequence of this property is that the collection $\{V_i\}_{i=1}^{k}$ from step 3 is a partition of the leaves of $B$.
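The partitioning procedure can be sketched in Python as follows. This is an illustrative reading of Algorithm 1, not the authors' SageMath code: trees are encoded as nested pairs, the name `partition_into_clades` is hypothetical, and returning the whole leaf set when the entire tree fits within the cap is an assumption made here for the case where no vertex has a parent with a larger label. The version below favors brevity over the linear running time.

```python
def partition_into_clades(tree, cap):
    """Split the leaves of `tree` into maximal clades with at most `cap` leaves."""
    blocks = []

    def visit(t):
        """Return the leaves below t, or None once the clade at t exceeds cap."""
        if not isinstance(t, tuple):
            return [t]
        kids = [visit(child) for child in t]
        if all(k is not None for k in kids) and sum(len(k) for k in kids) <= cap:
            return kids[0] + kids[1]
        # The clade at t is too large, so every child clade that still
        # fits is maximal (step 2): record its leaves as one block (step 3).
        blocks.extend(k for k in kids if k is not None)
        return None

    top = visit(tree)
    if top is not None:  # assumption: the whole tree fits within cap
        blocks.append(top)
    return blocks

example = ((("a", "b"), "c"), (("d", "e"), ("f", "g")))
print(partition_into_clades(example, 2))
print(partition_into_clades(example, 3))
```

With a cap of 2 the example tree splits into the blocks {a, b}, {c}, {d, e}, {f, g}; with a cap of 3 the clade {a, b, c} is kept whole while the other side is still split into two cherries.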
Note that Algorithm 2 runs in $O(n^4)$ time. This follows since steps 1 and 2 take $O(n)$ time, step 3 takes $O(n^2)$ time, and step 4 takes $O(n^4)$ time. To prove correctness, suppose $U_a$ and $U_b$ are clades in $L$ and suppose $V_c$ and $V_d$ are clades in $R$, and write $M_{ac}$ for the set of matching edges between $U_a$ and $V_c$. Because these are clades, any layout of $T$ will have either the $M_{ac}$ edges cross the $M_{bd}$ edges or the $M_{ad}$ edges cross the $M_{bc}$ edges. As a result, these 4 clades will contribute at least $\min\{|M_{ac}||M_{bd}|, |M_{ad}||M_{bc}|\}$ to the tangle crossing number. Thus, as done in step 4, summing these minimums over all $\binom{\ell}{2}$ pairs of clades from $L$ and $\binom{r}{2}$ pairs of clades from $R$, where $\ell$ and $r$ are the numbers of clades in $L$ and $R$, we obtain a lower bound on $\mathrm{crt}(T)$.
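The counting in the correctness argument can be sketched in Python. The code below takes the two clade partitions as input, tabulates the number of matching edges between each pair of clades, and sums the minimums over all pairs of left clades and pairs of right clades. It is a sketch under stated assumptions: the function names, the label format, and the example built from the matching $v_{ij} w_{ji}$ of Section 4 are chosen here for illustration.

```python
from itertools import combinations

def clade_lower_bound(left_blocks, right_blocks, matching):
    """Sum of min(|M_ac||M_bd|, |M_ad||M_bc|) over pairs of clades."""
    block_of = {leaf: c for c, block in enumerate(right_blocks) for leaf in block}
    # counts[a][c] = number of matching edges between left clade a and right clade c
    counts = [[0] * len(right_blocks) for _ in left_blocks]
    for a, block in enumerate(left_blocks):
        for leaf in block:
            counts[a][block_of[matching[leaf]]] += 1
    total = 0
    for a, b in combinations(range(len(left_blocks)), 2):
        for c, d in combinations(range(len(right_blocks)), 2):
            total += min(counts[a][c] * counts[b][d],
                         counts[a][d] * counts[b][c])
    return total

def grid_example(k):
    """Clades and matching v_ij -> w_ji of a tanglegram with k^2 leaves."""
    idx = range(1, k + 1)
    left_blocks = [[f"v{i},{j}" for j in idx] for i in idx]
    right_blocks = [[f"w{i},{j}" for j in idx] for i in idx]
    matching = {f"v{i},{j}": f"w{j},{i}" for i in idx for j in idx}
    return left_blocks, right_blocks, matching

print(clade_lower_bound(*grid_example(3)))
print(clade_lower_bound(*grid_example(4)))
```

In the grid example every left clade sends exactly one edge to every right clade, so each of the $\binom{k}{2}^2$ terms equals 1 and the bound is 9 for $k = 3$ and 36 for $k = 4$.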
One may notice that Algorithm 2 depends on the choice of $C_L$ and $C_R$. When $n = k^2$, the choice of $C_L = C_R = \sqrt{n}$ is optimal for the tanglegrams in $\mathcal{T}_{k^2}$ from Section 4 described for the proof of Theorem 6. For each tree in these tanglegrams, Algorithm 1 finds the $k$ clades with $k$ leaves that were used to build these trees. With this clade partition, $|M_{ij}| = 1$ for all $i, j \in [k]$. So the tangle crossing number is at least $\binom{k}{2}^2$ by Algorithm 2. It is not hard to find tanglegrams in $\mathcal{T}_{k^2}$ with tangle crossing number exactly $\binom{k}{2}^2$. Thus the output of Algorithm 2 for the family of tanglegrams $\mathcal{T}_{k^2}$ is tight.

We ran simulations for different choices of $C_L$ and $C_R$ with random tanglegrams drawn from a uniform distribution. Figure 6 shows the average lower bounds when $C_L, C_R \in \{4, \sqrt{n}, \frac{n}{2}\}$. For each $n \in \{10, \ldots, 100\}$, we picked 100 tanglegrams of size $n$ uniformly at random. The random sampling algorithm is a SageMath [7] implementation of Algorithm 3 from [2, p. 253]. The source code for our implementation is available at [1]. Based on the simulations, it appears that $C_L = C_R = \frac{n}{2}$ yields better lower bounds.
In [5] it is shown that there exists $C > 0$ such that a random tanglegram has tangle crossing number at least $Cn^2$ with high probability. Fitting the ll curve from Figure 6, the curve corresponding to $C_L = C_R = n/2$, to a quadratic function via least squares yields $0.055n^2 + O(n)$. This suggests that the tangle crossing number of the random tanglegram is at least $0.055n^2$. For the same sample, a plot of the maximum lower bounds is fit by the curve $0.08n^2 + O(n)$. These two growth rates are to be compared with the upper bound of $0.25n^2$ from Theorem 2.

Figure 6. Average lower bounds for random tanglegrams. The curve labeled ml represents the average output with $C_L = \sqrt{n}$ and $C_R = n/2$.

Another way to view this process is to create an auxiliary bipartite multigraph with a vertex for each clade, in which the number of edges between two clades is the number of matching edges which join a vertex of one clade to a vertex of the other clade. We then restrict to straight-line drawings where the vertices of one partite set remain on the line $x = 0$ and the vertices of the other partite set lie on the line $x = 1$. The minimum crossing number over all such drawings of this multigraph provides a lower bound on the crossing number of the tanglegram. However, Garey and Johnson [9] proved that even this problem on the auxiliary bipartite multigraph is NP-complete.

Open Questions and Further Work
Although the lower bound provided in Section 5 is tight for many small tanglegrams, we do not expect it to be close to the true value in general, since it is a polynomial time bound for an NP-hard problem. One may notice that the lower bound depends on the choice of clades. While we made an arbitrary choice, we are interested in polynomial time algorithms that choose the clades so as to optimize the lower bound.
In Section 4, we provided a family of tanglegrams with tangle crossing number asymptotically $\frac{1}{2}\binom{n}{2}$. While the tangle crossing number of tanglegrams in $\mathcal{T}_n$ is at least $\binom{k}{2}^2$ with $k = \lfloor \sqrt{n} \rfloor$, there are tanglegrams of size $n$ with larger tangle crossing number. Is it perhaps true that $\max\{\mathrm{crt}(T) : T \in \mathcal{T}_n\} = \gamma(n)$, at least for $n = k^2$? We remain interested in the maximum tangle crossing number over all tanglegrams of size $n$.