Maximal clades in random binary search trees

We study maximal clades in random phylogenetic trees with the Yule-Harding model or, equivalently, in binary search trees. We use probabilistic methods to reprove and extend earlier results on moment asymptotics and asymptotic normality. In particular, we give an explanation of the curious phenomenon observed by Drmota, Fuchs and Lee (2014) that asymptotic normality holds, but one should normalize using half the variance.


Introduction
Recall that there are two types of binary trees; we fix the notation as follows. A full binary tree is a rooted tree where each node has either 0 or 2 children; in the latter case the two children are designated as left child and right child. A binary tree is a rooted tree where each node has 0, 1 or 2 children; moreover, each child is designated as either left child or right child, and each node has at most one child of each type. (Both versions can be regarded as ordered trees, with the left child before the right when there are two children.) It is convenient to regard also the empty tree ∅ as a binary tree (but not as a full binary tree). In a full binary tree, the leaves (nodes with no children) are called external nodes; the other nodes (having 2 children) are internal nodes. There is a simple, well-known bijection between full binary trees and binary trees: Given a full binary tree, its internal nodes form a binary tree; this is a bijection, with inverse given by adding, to any given binary tree, external nodes as children at all free places.
Note that a full binary tree with n internal nodes has n+1 external nodes, and thus 2n + 1 nodes in total. In particular, the bijection just described yields a bijection between the full binary trees with 2n + 1 nodes and the binary trees with n nodes.
If T is a binary, or full binary, tree, we let T_L and T_R be the subtrees rooted at the left and right child of the root, with T_L = ∅ [T_R = ∅] if the root has no left [right] child.
A phylogenetic tree is the same as a full binary tree. In this context, the clade of an external node v is defined to be the set of external nodes that are descendants of the parent of v. (This is called a minimal clade by Blum and François [3] and Chang and Fuchs [6].) Note that two clades are either nested or disjoint; furthermore, each external node belongs to some clade (for example its own). Hence, the set of maximal clades forms a partition of the set of external nodes. We let F (T ) denote the number of maximal clades of a phylogenetic tree T . (Except that for technical reasons, see Section 2, we define F (T ) = 0 for a phylogenetic tree T with only one external node. Obviously, this does not affect asymptotics.) The maximal clades, and the number of them, were introduced by Durand, Blum and François [11], together with a biological motivation, and further studied by Drmota, Fuchs and Lee [10].
The phylogenetic trees that we consider are random; more precisely, we consider the Yule-Harding model of a random phylogenetic tree T̂_n with a given number n of internal, and thus n+1 external, nodes. These can be defined recursively, with T̂_0 the unique phylogenetic tree with 1 node (the root), and T̂_{n+1} obtained from T̂_n (n ≥ 0) by choosing an external node uniformly at random and converting it to an internal node with two external children. (Alternatively, we obtain the same random model by constructing the tree bottom-up by Kingman's coalescent [17], see further Aldous [2], Blum and François [3] and Chang and Fuchs [6].) Recall that, for any n ≥ 1, the number of internal nodes in the left subtree T̂_{n,L} (or the right subtree T̂_{n,R}) is uniformly distributed on {0, . . . , n − 1}, and that conditioned on this number being m, T̂_{n,L} has the same distribution as T̂_m; see also Remark 5.1.
Under the bijection above, the Yule-Harding random tree T̂_n corresponds to the random binary search tree T_n with n nodes, see e.g. Blum, François and Janson [4] and Drmota [9].
The random variable that we study is thus X_n := F(T_n), the number of maximal clades in the Yule-Harding model. It was proved by Durand and François [12] that the mean number of maximal clades satisfies E X_n ∼ αn, where
α := (1 − e^{−2})/4. (1.1)
This was reproved by Drmota, Fuchs and Lee [10], in a sharper form:
Theorem 1.1 ([12; 10]). As n → ∞,
E X_n = αn + O(1). (1.2)
Moreover, Drmota, Fuchs and Lee [10] found also corresponding results for the variance and higher central moments:
Theorem 1.2 ([10]). As n → ∞,
Var X_n ∼ 4α² n log n, (1.3)
and for any fixed integer k ≥ 3,
E(X_n − E X_n)^k ∼ (−1)^k (2k/(k − 2)) α^k n^{k−1}. (1.4)
As a consequence of (1.3)-(1.4), the limit distribution of F(T_n) (after centering and normalization) cannot be found by the method of moments. Nevertheless, [10] further proved asymptotic normality, where, unusually, the normalizing uses (the square root of) half the variance:
Theorem 1.3 ([10]). As n → ∞,
(X_n − E X_n) / √(2α² n log n) →_d N(0, 1). (1.5)
Here and below, →_d denotes convergence in distribution; similarly, →_p will denote convergence in probability. Unspecified limits (including implicit ones such as ∼ and o(1)) will be as n → ∞. Furthermore, Y_n = o_p(a_n), for random variables Y_n and positive numbers a_n, means Y_n/a_n →_p 0. We let C, C_1, C_2, . . . denote some unspecified positive constants.
The purpose of the present paper is to use probabilistic methods to reprove these theorems, together with some further results; we hope that this can give additional insight, and it might perhaps also suggest future generalizations to other types of random trees.
In particular, we can explain the appearance of half the variance in Theorem 1.3 as follows: Fix a sequence of numbers N = N(n), and say that a clade is small if it has at most N + 1 elements, and large otherwise. (We use N + 1 in the definition only for later notational convenience; the subtree corresponding to a small clade has at most N internal nodes.) Let X_n^N be the number of maximal small clades, i.e., the small clades that are not contained in any other small clade. It turns out that a suitable choice of N is about √n; we give two versions in the next theorem.
and E X_n − E X_n^N = o((Var X_n^N)^{1/2}), so we may replace X_n^N by X_n in the numerator of (1.6). However, n log n, for example N := n log log n. Then the conclusions of (i) still hold; moreover, P(X_n ≠ X_n^N) → 0.
The theorem thus shows that the large clades are rare, and do not contribute to the asymptotic distribution; however, when they appear, the large clades give a large (actually negative) contribution to X_n, and as a result, half the variance of X_n comes from the large clades. (When there is a large clade, there is less room for other clades, so X_n tends to be smaller than usual. See also (2.4) and (2.2) below.) For higher moments, the large clades play a similar, but even more extreme, role.
Note that (for n ≥ 2) with probability 2/n, the root of the Yule-Harding tree has one internal and one external child, and then there is a clade consisting of all external nodes; this is obviously the unique maximal clade, and thus X_n = 1. Since E X_n = αn + O(1) by Theorem 1.1, we thus have X_n − E X_n = −αn + O(1) with probability 2/n, and this single exceptional event gives a contribution ∼ (2/n)(−αn)^k = (−1)^k 2α^k n^{k−1} to E(X_n − E X_n)^k, which explains a fraction (k − 2)/k of the moment (1.4); in particular, this explains why the moment is of order n^{k−1}.
We shall see later that, roughly speaking, the moment asymptotic in (1.4) is completely explained by extremely large clades of size Θ(n), which appear in the O(1) first generations of the tree.
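The interplay between X_n and the number X_n^N of maximal small clades can be explored numerically. The following Python sketch is an illustration only and plays no role in the proofs. It assumes the split description recalled above (the left subtree size is uniform on {0, . . . , n − 1}) and the correspondence, set up in Section 2, between clades and nodes with at most one child; the names sample_pair and cutoff are hypothetical.

```python
import random

def sample_pair(n, cutoff, rng):
    """Sample one random binary search tree with n nodes and return
    (X, X_small): the number of maximal clades and the number of
    maximal small clades (subtree size at most `cutoff`, cutoff >= 1)."""
    if n == 0:
        return 0, 0
    left = rng.randrange(n)          # uniform split at the root
    right = n - 1 - left
    x_l, s_l = sample_pair(left, cutoff, rng)
    x_r, s_r = sample_pair(right, cutoff, rng)
    green = left == 0 or right == 0  # root has at most one child
    # A green root is the unique maximal clade of its subtree.
    x = 1 if green else x_l + x_r
    # A large green root is ignored; the small clades below it are counted.
    s = 1 if (green and n <= cutoff) else s_l + s_r
    return x, s

if __name__ == "__main__":
    rng = random.Random(1)
    n = 400
    cutoff = int(n ** 0.5)
    pairs = [sample_pair(n, cutoff, rng) for _ in range(2000)]
    differ = sum(x != s for x, s in pairs) / len(pairs)
    print("fraction of samples with X_n != X_n^N:", differ)
```

In every sample X_n^N ≥ X_n, since a large maximal clade is replaced by the (at least one) maximal small clades inside it; this is the sense in which a large clade gives a negative contribution to X_n.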
This will also lead to a version of (1.4) for absolute central moments:
Theorem 1.5. For any fixed real p > 2, as n → ∞,
E|X_n − E X_n|^p ∼ (2p/(p − 2)) α^p n^{p−1}. (1.8)
In Section 2, we transfer the problem from random phylogenetic trees to random binary search trees, which we shall use in the proofs. The theorems above are proved in Sections 3-7.

Binary trees
We find it technically convenient to work with binary trees instead of full binary trees (phylogenetic trees), so we use the bijection in Section 1 to define F (T ) also for binary trees T . (We use the same notation F ; this should not cause any confusion.) With this translation, our problem is thus to study X n := F (T n ), where T n is the binary search tree with n nodes.
The clades in a phylogenetic tree correspond to the internal nodes that have at least one external child, i.e., the nodes in the corresponding binary tree that have outdegree at most 1. We call such nodes green. For a binary tree T, the number F(T) is thus the number of maximal green nodes, i.e., the number of green nodes that have no green ancestor. (This holds also for the phylogenetic tree T with a single node, and thus for the empty binary tree, with our definition F(T) = 0 in this case.) It follows that, for any binary tree T,
F(T) = 1 if T has a green root, and F(T) = F(T_L) + F(T_R) otherwise. (2.1)
Define, for a binary tree T,
f(T) := 1 − F(T_L) − F(T_R) if T has a green root, and f(T) := 0 otherwise. (2.2)
Then F(T) is given by the recursion
F(T) = f(T) + F(T_L) + F(T_R), (2.3)
and thus
F(T) = Σ_{v∈T} f(T_v), (2.4)
where T_v is the subtree rooted at v, consisting of v and all its descendants. In other words, F(T) is the additive functional defined by the toll function f(T). The advantage of this point of view is that we have eliminated the maximality condition and now sum over all subtrees T_v, and that we can use general results for this type of sums, see Holmgren and Janson [16].
We let T denote the random binary search tree with a random number of elements such that P(|T| = n) = 2/((n + 1)(n + 2)), n ≥ 1. The random binary tree T can be constructed by a continuous-time branching process: Let (T_t)_{t ≥ 0} be the growing tree that starts with an isolated root at time t = 0 and such that each existing node gets a left and a right child after random waiting times that are independent and Exp(1); we stop the process at a random time τ ∼ Exp(1), independent of everything else, and can take T = T_τ, see Aldous [1] (where it is also proved that T is the limit in distribution of a random fringe tree in a binary search tree).
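The equivalence between counting maximal green nodes and summing a toll function over all fringe subtrees can be checked mechanically. The Python sketch below is an illustration under stated assumptions: binary trees are represented as nested tuples (left, right) with None for the empty tree, and the toll is taken to be 1 − F(T_L) − F(T_R) at a green root and 0 otherwise, which is the value forced by comparing the recursion for maximal green nodes with the additive form. All function names are hypothetical.

```python
import random

def random_shape(n, rng):
    """Random binary tree shape with n nodes (uniform split at each node).
    The empty tree is None; a node is the pair (left, right)."""
    if n == 0:
        return None
    k = rng.randrange(n)
    return (random_shape(k, rng), random_shape(n - 1 - k, rng))

def is_green(t):
    """A node is green if it has at most one child."""
    return t[0] is None or t[1] is None

def count_maximal_green(t):
    """F(T): number of green nodes without a green ancestor."""
    if t is None:
        return 0
    if is_green(t):
        return 1
    return count_maximal_green(t[0]) + count_maximal_green(t[1])

def toll(t):
    """f(T): assumed toll, nonzero only when the root is green."""
    if t is None or not is_green(t):
        return 0
    return 1 - count_maximal_green(t[0]) - count_maximal_green(t[1])

def additive_functional(t):
    """Sum of the toll over all fringe subtrees T_v."""
    if t is None:
        return 0
    return toll(t) + additive_functional(t[0]) + additive_functional(t[1])

if __name__ == "__main__":
    rng = random.Random(7)
    for _ in range(100):
        t = random_shape(rng.randrange(1, 60), rng)
        assert count_maximal_green(t) == additive_functional(t)
    print("both ways of computing F agree on all sampled trees")
```

The additive version is quadratic in the worst case because each toll recomputes F on a child subtree; that is acceptable for a check of this size but would need memoization for large trees.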

The mean
Recall that T_n is the random binary search tree with n nodes. Define ν_n := E F(T_n) and µ_n := E f(T_n), with F and f as in Section 2. (In particular, ν_0 = µ_0 = 0, while ν_1 = µ_1 = 1 since F(T_1) = f(T_1) = 1.) For n ≥ 2, T_{n,L} is empty with probability 1/n, and conditioned on this event, T_{n,R} has the same distribution as T_{n−1}. The same holds if we interchange L and R. Hence, taking the expectation in (2.2),
µ_n = (2/n)(1 − ν_{n−1}), n ≥ 2. (3.1)
Furthermore, we see that (2.2) implies
P(f(T_n) ≠ 0) ≤ 2/n. (3.2)
Moreover,
|f(T)| ≤ |T| (3.3)
for any binary tree T. In particular, this and (3.2) yield It is now a simple consequence of general results that ν_n := E F(T_n) is asymptotically linear in n. Recall the random binary tree T defined in Section 2. Proof.
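The means ν_n can be computed exactly for small n, which gives a quick sanity check of the linear growth. The sketch below is a hedged illustration, not the method of proof: it assumes the toll f(T) = 1 − F(T_L) − F(T_R) at a green root (and 0 otherwise), so that µ_n = (2/n)(1 − ν_{n−1}) for n ≥ 2, and it uses the uniform split at the root to write ν_n = µ_n + (2/n)(ν_0 + · · · + ν_{n−1}). The function name is hypothetical.

```python
from fractions import Fraction
import math

ALPHA = (1 - math.exp(-2)) / 4  # the constant (1 - e^{-2})/4

def mean_maximal_clades(n_max):
    """Return the list [nu_0, nu_1, ..., nu_{n_max}] as exact fractions
    (at least nu_0 and nu_1 are always returned)."""
    nu = [Fraction(0), Fraction(1)]
    partial_sum = Fraction(1)  # nu_0 + ... + nu_{n-1}, starting at n = 2
    for n in range(2, n_max + 1):
        # Root is green with probability 2/n; then the toll is 1 - F(T_{n-1}).
        mu_n = Fraction(2, n) * (1 - nu[n - 1])
        # Additive recursion with a uniform split at the root.
        nu_n = mu_n + Fraction(2, n) * partial_sum
        nu.append(nu_n)
        partial_sum += nu_n
    return nu

if __name__ == "__main__":
    nu = mean_maximal_clades(200)
    for n in (3, 4, 50, 200):
        print(n, float(nu[n]), float(nu[n]) / n, ALPHA)
```

For example, ν_3 = 4/3: with probability 2/3 the root has a single child and F = 1, and with probability 1/3 the tree is balanced and F = 2.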
In order to prove Theorem 1.1, it remains to show that α defined in (3.6) equals (1−e −2 )/4 as asserted in (1.1). In other words, we need the following.
We can prove Lemma 3.2 by probabilistic methods, using the construction of T by a branching process in Section 2. However, this proof is considerably longer than the proof of Theorem 1.1 by singularity analysis of generating functions in [12] and [10]; we nevertheless find the probabilistic proof interesting, and perhaps useful for future generalizations, but since the methods in it are not needed for other results in the present paper, we postpone our proof of Lemma 3.2 to Section 7.
Note that g(T ) depends only on the sizes |T L | and |T R |. This enables us to easily estimate the variance of G(T n ).

Asymptotic normality
We prove the central limit theorem Theorem 1.3 by a martingale central limit theorem for a suitable martingale that we construct in this section.
Consider the infinite binary tree T ∞ , where each node has two children, and denote its root by o. We may regard any binary tree T as a subtree of T ∞ with the same root o. (In the general sense that the node set V (T ) is a subset of V ∞ := V (T ∞ ), and that the left and right children are the same as in T ∞ , when they exist.) In particular we regard the random binary search tree T n as a subtree of T ∞ .
Order the nodes in T_∞ in breadth-first order as v(1) = o, v(2), . . . , and let V_j := {v(1), . . . , v(j)} be the set of the first j nodes. Let F_j be the σ-field generated by the sizes |T_{n,v,L}| and |T_{n,v,R}| of the two child subtrees of T_n at each node v ∈ V_j. Equivalently, we may regard V_j as the internal nodes in a full binary tree; let ∂V_j be the corresponding set of j + 1 external nodes. Then F_j is generated by the subtree sizes |T_{n,v}|, v ∈ ∂V_j. Then, conditioned on F_j, T_n consists of some given subtree of V_j together with attached subtrees T_{n,v} at all nodes v ∈ ∂V_j; these are independent binary search trees of some given orders.
We allow here j = 0; V 0 = ∅ and F 0 is the trivial σ-field.
Remark 5.1. As is well-known, see e.g. [9], another construction of the random binary search tree T_n (n ≥ 1) is to let the random variable I_n be uniformly distributed on {0, . . . , n − 1}, and to let T_n be defined recursively such that, given I_n, T_{n,L} and T_{n,R} are independent binary search trees with |T_{n,L}| = I_n and |T_{n,R}| = n − 1 − I_n. (When the tree is used to sort n keys, I_n tells how many of the keys are assigned to the left subtree.) The pair (I_n, n − 1 − I_n) thus tells how the tree is split at the root, and there is a similar pair for each node. Then F_j is generated by these pairs (i.e., splits) for the nodes v(1), . . . , v(j).
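The split construction in Remark 5.1 gives a direct way to sample X_n without building the tree, since whether a node is green depends only on the split at that node. The following Python sketch is an illustration under that reading; the function name is hypothetical and an explicit stack is used only to avoid deep recursion.

```python
import random

def sample_num_maximal_clades(n, rng):
    """Sample X_n = F(T_n) using only the splits (I, k - 1 - I)."""
    total = 0
    stack = [n]  # sizes of subtrees still to be explored
    while stack:
        k = stack.pop()
        if k == 0:
            continue
        i = rng.randrange(k)  # left subtree size, uniform on {0, ..., k-1}
        if i == 0 or i == k - 1:
            # The root of this subtree has at most one child: it is green,
            # hence a maximal green node, and nothing below it is counted.
            total += 1
        else:
            stack.append(i)
            stack.append(k - 1 - i)
    return total

if __name__ == "__main__":
    rng = random.Random(2024)
    n = 1000
    xs = [sample_num_maximal_clades(n, rng) for _ in range(2000)]
    print("empirical mean of X_n / n:", sum(xs) / len(xs) / n)
```

For n = 4 the four equally likely splits at the root give X_4 = 1, 2, 2, 1, so the empirical mean should be close to 3/2.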
Recall that g(T) by (4.4) depends only on the sizes |T_L| and |T_R|. Hence, F_j specifies the value of g(T_{n,v}) for every v ∈ V_j, and it follows that Since the sequence of σ-fields (F_j)_{j ≥ 0} is increasing, the sequence M_{n,j} := E(G(T_n) | F_j), j ≥ 0, is a martingale (for any fixed n). It follows from (5.1) that the martingale differences are where v(j)_L and v(j)_R are the children of v(j). It follows easily that, with ψ_k defined in (4.17), Consequently, the conditional square function is given by (It suffices to sum over v ∈ T_n, since ψ_0 = 0.) This is again a sum of the same type as (2.4) and (4.9), for the random tree T_n. (Note that the toll function ψ_{|T|} here depends only on the size of T.) In particular, [16, Theorem 3.4] applies (in this case we can also use [7], [8] or [13]); this yields If j is large enough, say j ≥ 2^n, then V(T_n) ⊆ V_j and thus M_{n,j} = G(T_n). In particular, G(T_n) = M_{n,∞}. Thus, by a standard (and simple) martingale identity, Var G(T_n) = Var M_{n,∞} = E W_n; hence (5.5) yields the first equality in (4.18). (This is no coincidence; the proof just given of (5.5) is essentially the same as the proof of [16, Lemma 7.1] that was used in (4.18), but stated in martingale formulation.)
We now split the sum G(T_n) into two parts, roughly corresponding to small and large clades. We fix a cut-off N = N(n); for definiteness and simplicity we choose N = N(n) := √n, but we note that the arguments below hold with a few minor modifications for any N ≥ √n with N = o(√n log n). We then define, for binary trees T, thus G(T) = G′(T) + G′′(T). We shall see that, asymptotically, both G′(T_n) and G′′(T_n) contribute to the variance with equal amounts, but nevertheless G′′(T_n) is negligible (in probability). We begin with the main term G′(T_n).
Moreover, the representation (5.18) and [16,Theorem 3.9] (again summing only to n, as we may) yield, noting that the toll function ψ ′ |T | depends only on the size of T , using (5.16),  Remark 5.3. We used the breadth-first order above as just one convenient order. It is perhaps more natural to consider instead of the sets V j arbitrary node sets V of (finite) subtrees of T ∞ that include the root o. This would give us, instead of (M n,j ) j , a martingale indexed by binary trees. However, we have no use for this exotic object here, and use instead the standard martingales above.
(ii). The conclusions of (i) hold by the same proofs (with some minor modifications in some estimates).
Moreover, let Z n,k be the number of clades of size k + 1. Then, for n 2, the expected number is given by which completes the proof.

Higher moments
We begin the proof of Theorem 1.5 by proving a weaker estimate. We let ‖X‖_p := (E|X|^p)^{1/p} for any random variable X. Recall that ν_n := E F(T_n).
Lemma 6.1. For any fixed real p > 2, and all n 1, Proof. Fix p > 2 and let m 1 be chosen below. (The constants C i below may depend on p but not on m.) Let V j and F j be as in Section 5, and write V ′ m := V 2 m −1 , F ′ m := F 2 m −1 . Thus ∂V ′ m consists of the 2 m nodes in T ∞ of depth m, and V ′ m consists of the 2 m − 1 nodes of smaller depth. It follows from (2.4) that, for any binary tree T , Hence, by combining (6.3) and (6.4), We shall use this decomposition for the binary search tree T n . Note first that by (3.2)-(3.3), and thus E |f (T n,v )| p 2n p−1 . Let Z := v∈∂V ′ m F (T n,v )−ν |Tn,v| be the second sum in (6.5) for T = T n . The σ-field F ′ m specifies the sizes of the subtrees T n,v for v ∈ ∂V ′ m , and conditioned on F ′ m , these subtrees are independent and distributed as T n(v) of the given sizes n(v). Hence, conditionally on F ′ m , the terms in the sum Z are independent and have means zero, so we can apply Rosenthal's inequality [14,Theorem 3.9.1], which yields We note first that by (1.3), (6.11) and thus |T n,v | log n C 2 n log n. (6.12) Hence the second term on the right-hand side in (6.10) is C 3 (n log n) p/2 . Taking the expectation in (6.10) we thus obtain We can write (6.5) for T = T n as Thus, by Minkowski's inequality, (6.9) and (6.13), Furthermore, (6.13) can be written We prove the lemma by induction, and assume that A k Ck p−1 for all k < n. Since |T n,v | < n for every v ∈ ∂V ′ m , (6.16) and the inductive hypothesis yield with U 1 , . . . , U m independent and U (0, 1). Consequently, There are 2 m nodes in ∂V ′ m , and thus (6.17) yields which together with (6.15) yields, since (n log n) p/2 = O(n p−1 ) when p > 2, A n C 6 C 1 C(2/p) m n p−1 + C 6 C 4 (n log n) p/2 + C 8 2 mp n p−1 C 6 C 1 C(2/p) m n p−1 + C 9 2 mp n p−1 . (6.21) Now choose m such that (2/p) m C 6 C 1 < 1/2 (which is possible because p > 2). Then choose C := 2 mp+1 C 9 . 
With these choices, (6.21) yields In other words, we have proved the inductive step: A k Ck p−1 for k < n implies A n Cn p−1 . Consequently, this is true for all n 0, i.e., (6.1) holds. (The initial cases n = 0 and n = 1 are trivial, since A 0 = A 1 = 0.) Lemma 6.2. For any fixed real p > 2, as n → ∞, Proof. By Minkowski's inequality, (6.2) and (1.2), which is (6.23). For n 2, it follows from (2.2) that The idea in the proof of Theorem 1.5 is to approximate The heuristic reason for this is that the p is dominated by the event when there is one large term (corresponding to one large clade, cf. the discussion before Theorem 1.5), and then (6.27) We shall justify this in several steps. We begin by finding the expectation of the final sum in (6.27), cf. the sought result (1.8). Lemma 6.3. As n → ∞, Proof. We apply again [16,Theorem 3.4] and obtain E v∈Tn |f (T n,v )| p = (n + 1) (6.29) By (6.26), as k → ∞, and it follows that, as n → ∞, using p > 2, Next we take again some m 1 and use the notation in the proof of Lemma 6.1. Since we now have proved (6.1), the proof of Lemma 6.1 shows that (6.20) holds for every n, and thus, since p > 2, Consequently, by (6.14) and Minkowski's inequality, (6.32) In particular, (6.32) and (6.2) imply Y p = O(n 1−1/p ). By the mean value theorem, |x p − y p | p|x − y| max{x p−1 , y p−1 } (6.33) for any x, y 0; hence (6.32) implies, using also (6.2) again, (6.34) Let δ > 0 be a small positive number to be chosen later, and let J v be the indicator of the event that v is green and |T n,v | δn. (The idea is that the significant contributions only come from nodes v with J v = 1.) Lemma 6.4. For each fixed m 1 and δ > 0, and all n 1, Proof. We use again the σ-fields F j from Section 5. Since F j−1 specifies |T n,v j |, but not how this subtree is split at v j , we have and thus, by taking the expectation, P(J v j = 1) 2/(δn). Since there are < 2 m nodes in V ′ m , (6.35) follows. 
Furthermore, for any two nodes v i and v j with i < j, J v i is determined by F j−1 , and (6.37) thus gives also Thus, by taking the expectation and using (6.37) again, P(J v i J v j = 1) 4/(δn) 2 . Summing over the less than 2 m Proof of Theorem 1.5. We show this in several steps.
Step 1. Define For each v, it follows from (6.6) by conditioning on |T n,v | that Hence, (6.40) and Minkowski's inequality yield Step 2. Similarly, using (6.41) again, Step 3. By (6.39), and in the latter case we have by (3.3) the trivial bounds |Y 1 | p (2 m n) p and Consequently, by (6.36), Thus, for fixed m 1 and δ > 0, Step 4. Define F (p) (T ) := v∈T |f (T v )| p . Then, in analogy with (6.3), Note that Lemma 6.3 implies E F (p) (T n ) = O(n p−1 ). Hence, by first conditioning on F ′ m , and using (6.19), Taking T = T n in (6.47) and taking the expectation, we thus find Step 5. Finally, combining (6.34), (6.43), (6.46), (6.44), (6.49) and (6.28), we obtain For any ε > 0, we can make each of the error terms on the right-hand side less than εn p−1 by first choosing m large and then δ small, and finally n large.
Proof of (1.4). Now p = k is an integer. If k is even, then (1.4) is the same as (1.8), so we may assume that p = k ≥ 3 is odd. In this case, (6.33) holds for all real x, y. Thus for any random variables X and Y, using also Hölder's inequality, (6.51) It is now easy to modify the proof of Theorem 1.5 and obtain The estimate (1.4) now follows from (6.52), (6.53) and (6.28).

Proof of Lemma 3.2
Define a chain of length k in a (binary) tree T to be a sequence of k nodes v_1 · · · v_k such that v_{i+1} is a (strict) descendant of v_i for each i = 1, . . . , k − 1. In other words, v_1, . . . , v_k are some nodes (in order) on some path from the root. We say that the chain v_1 · · · v_k is green if all nodes v_1, . . . , v_k are green. (The nodes between the v_i's may have any colour.) For a binary tree T and k ≥ 1, let F_k(T) be the number of green chains v_1 · · · v_k in T, and let f_k(T) be the number of such chains where v_1 is the root. Obviously, cf. (2.4),
F_k(T) = Σ_{v∈T} f_k(T_v).
These functionals are useful to us because of the following simple relations, which are cases of inclusion-exclusion.
Lemma 7.1. For any binary tree T,
F(T) = Σ_{k ≥ 1} (−1)^{k−1} F_k(T). (7.3)
Proof.
Let v be a node in T and consider the contribution to the sum in (7.3) of all chains with final node v_k = v. This is clearly 0 if v is not green, and it is 1 if v is a maximal green node; furthermore, if v is green but has j ≥ 1 green ancestors, then the contribution is easily seen to be Σ_{i=0}^{j} (−1)^i C(j, i) = (1 − 1)^j = 0, since a chain with final node v is obtained by choosing any i of the j green ancestors.
Proof. We use the construction of T = T_τ in Section 2, which we formulate as follows. Consider again the infinite binary tree T_∞, and grow T_t as a subtree of T_∞, cf. Section 5. To do this, we equip each node v in T_∞ with two clocks C_L(v) and C_R(v). These are started when v is added to the growing tree T_t, and each chimes after a random time with an exponential distribution with mean 1; when the clock chimes we add a left or right child, respectively, to v. There is also a doomsday clock C_0, started at 0 and with the same Exp(1) distribution; when it chimes (at time τ), the process is stopped and the tree T_τ is output. All clocks are independent of each other.
Fix a chain v_1 · · · v_k in the infinite tree T_∞, with v_1 = o, the root. Let ℓ_i ≥ 0 be the number of nodes between v_i and v_{i+1}. We compute the probability that v_1 · · · v_k is a green chain in T = T_τ by following the construction of T_t as time progresses, checking in several steps whether v_1 · · · v_k is still a candidate for a green chain, and computing the probability of this. (We use throughout the proof the Markov property and the memoryless property of the exponential distribution.) We assume for notational convenience that the path from v_1 to v_k always uses the left child of each node. (By symmetry, this does not affect the result.)
1. If k > 1, we first need that v_1 = o has a left child but no right child (in order to be green); in particular, of the three clocks C_L(v_1), C_R(v_1), C_0 that run from the beginning, C_L(v_1) has to chime first. This has probability 1/3.
2. Given that Step 1 succeeds, v_1 gets a left child w_1. If ℓ_1 > 0, we need a left child of w_1, and still no right child at v_1. (But we do not care whether we get a right child at w_1 or not.) Hence we need that C_L(w_1) chimes first among the three clocks C_L(w_1), C_R(v_1), C_0 (ignoring all other clocks). This has probability 1/3. This is repeated for ℓ_1 nodes; thus, the total probability that Steps 1 and 2 succeed is 3^{−(ℓ_1+1)}.
3. This takes us to v_2. If k > 2, we need a left child but no right child at v_2, and still no right child at v_1. Hence, the next chime from the four clocks C_L(v_2), C_R(v_2), C_R(v_1), C_0 has to come from C_L(v_2). This has probability 1/4.
4. Similarly for each of the ℓ_2 nodes between v_2 and v_3; again the probability of success at each of these nodes is 1/4. Hence the probability that Steps 3 and 4 succeed is 4^{−(ℓ_2+1)}.
5. Steps 3 and 4 are repeated for v_i for each i < k, yielding a probability (i + 2)^{−(ℓ_i+1)} of success for each i.
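The clock construction used in these steps can be simulated directly, which gives an empirical check on the size distribution P(|T| = n) = 2/((n + 1)(n + 2)) quoted in Section 2. The Python sketch below is an illustration only: it tracks just the birth times of the nodes, draws a fresh Exp(1) waiting time for each of the two child clocks of every node, and discards children that would be born after the doomsday time τ. The function name is hypothetical.

```python
import random

def sample_stopped_tree_size(rng):
    """Size of the growing tree at the doomsday time tau ~ Exp(1)."""
    tau = rng.expovariate(1.0)      # doomsday clock C_0
    size = 0
    pending = [0.0]                 # birth times of nodes not yet processed
    while pending:
        birth = pending.pop()
        size += 1
        for _ in range(2):          # clocks C_L(v) and C_R(v)
            child_birth = birth + rng.expovariate(1.0)
            if child_birth < tau:   # the child appears before the process stops
                pending.append(child_birth)
    return size

if __name__ == "__main__":
    rng = random.Random(99)
    sizes = [sample_stopped_tree_size(rng) for _ in range(20000)]
    for n in (1, 2, 3):
        empirical = sizes.count(n) / len(sizes)
        print(n, empirical, 2 / ((n + 1) * (n + 2)))
```

The case n = 1 matches Step 1 above in spirit: the tree is a single node exactly when the doomsday clock chimes before both clocks at the root, which has probability 1/3 among three independent Exp(1) clocks.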