Towards Degree Distribution of a Duplication-Divergence Graph Model

We present a rigorous and precise analysis of the degree distribution in a dynamic graph model introduced by Solé, Pastor-Satorras et al., in which nodes are added according to a duplication-divergence mechanism. This model is discussed in numerous publications with only very few recent rigorous results, especially for the degree distribution. In this paper we focus on two related problems: the expected value and variance of the degree of a given node over the evolution of the graph, and the expected value and variance of the average degree over all nodes. We present exact and precise asymptotic results showing that both quantities may decrease or increase over time, depending on the model parameters. Our findings are a step towards a better understanding of graph behaviors such as degree distribution, symmetry, power law, and structural compression.


Introduction
Many real-world networks, such as protein-protein and citation networks, are widely viewed as driven by an internal evolution mechanism based on duplication and mutation [19]. New nodes are added to the network as copies of existing nodes together with some random divergence, resulting in differences between the original nodes and their copies. It has been claimed that graphs generated from these models exhibit many properties characteristic of real-world networks, such as a power-law degree distribution, a large clustering coefficient, and a large amount of symmetry [3,11]. However, some of these results turned out not to be correct (e.g., the power-law degree distribution was disproved in [6]) or not to be proved rigorously. In this paper we focus on presenting exact and precise asymptotic results for the expected degree of a given node over time and the average degree in the graph. We show that these two quantities exhibit phase transitions over the parameter space.
The most widely known duplication-divergence model was introduced by Solé, Pastor-Satorras et al. [14], denoted here as DD(t, p, r). It is defined as follows: starting from a given graph G_{t_0} on t_0 vertices (labeled from 1 to t_0), we repeat the following procedure until we obtain a graph on t vertices: (i) Duplication: select a node u from the current graph G (on k vertices) uniformly at random. Add node v (with label k + 1) to the graph and add edges between v and all neighbors of u; (ii) Divergence: each connection from v is retained independently with probability p (otherwise it is deleted). Furthermore, for every node w not adjacent to u we add an edge between v and w, independently at random with probability r/k. Note that the nodes of the graph are labeled from 1 to t according to their order of appearance in the graph.
This model is a generalization of the pure duplication model, in which no edges are added in the divergence step (r = 0). It is also similar to the Chung-Lu model, in which the duplication step and the first part of the divergence step (retaining edges with probability p) are exactly the same as in our model; however, instead of adding r-edges, only a single edge between v and u is added with some fixed probability q [3].
Additionally, note that the case p = 0 of our model somewhat resembles Erdős–Rényi graphs in the sense that all pairs of vertices correspond to independent Bernoulli variables. However, their respective probabilities are different, as at the k-th step we add a vertex with k − 1 possible incident edges, each with probability Ber(r/k).
It has been shown that graphs generated by this model for a suitable set of parameters fit very well the structure of some real-world networks (e.g., protein-protein and citation networks) in terms of the degree distribution [4] and small subgraph (graphlet) counts [13]. It was also shown that this model may exhibit a large amount of symmetry (measured by the number of automorphisms) [15], and this distinguishes it from other graph models such as Erdős–Rényi and preferential attachment [10]. We formally showed in [17] that for the special case p = 1, r = 0 the expected logarithm of the number of automorphisms of graphs on t vertices is asymptotically Θ(t log t), which indicates a lot of symmetry. However, extending this result to all p and r is a difficult open problem.
The most interesting open problem in the duplication-divergence model DD(t, p, r) is the quest to uncover the behavior of the degree distribution, that is, the number of nodes of a given degree. For r = 0 it was recently proved by Hermann and Pfaffelhuber in [6] that, depending on the value of p, either there exists a limiting distribution with almost all vertices isolated or there is no limiting distribution as t → ∞. It should be mentioned that the most interesting problem, namely the rate of convergence, is still open. They also asserted (but without proof) that this holds also for r > 0. Our findings in this paper indicate that this is not the case as the size of the graph grows to infinity. Moreover, it is shown in [9] that the number of vertices of degree one is Ω(ln t), but again the precise rates of growth of the number of vertices of degree k > 0 are yet unknown. Recently, also for r = 0, Jordan [8] showed that a power law of the degree distribution exists for the connected component when p < e^{−1}. In this case the exponent is equal to γ, the solution of γ + p^{γ−2} = 3. The case p ⩾ e^{−1} still remains an open problem.
In this paper we approach the problem of the degree distribution from a different perspective. We investigate the behavior of two closely related variables: the degree of a vertex s in G_t, denoted deg_t(s), and the average degree D(G_t) in G_t. We present in Theorems 5-17 exact and precise asymptotics of the expected values and variances of the parameters under investigation as t → ∞. We show that all parameters exhibit a phase transition as a function of p. In particular, we find that E[deg_t(s)] grows respectively like (t/s)^p, (t/s)^{1/2} log s, or (t/s)^p s^{2p−1}, depending on whether p < 1/2, p = 1/2 or p > 1/2.
Furthermore, E[D(G_t)] is either Θ(1), Θ(log t) or Θ(t^{2p−1}) for the same ranges of p. We also determine the exact constants of the leading terms, which depend strictly on p, r, t_0 and the structure of the seed graph G_{t_0}. This confirms the empirical findings of [7] regarding the influence of the seed graph on the structure of G_t.
These findings allow us to better understand why the DD(t, p, r) model differs quite substantially from other graph models such as the preferential attachment model [18]. In particular, we observe that the expected degree behaves differently as t → ∞ for various values of s and p. For example, if p > 1/2, then for s = O(1) (that is, for very old nodes) we observe that E[deg_t(s)] = O(t^p), while for s = Θ(t) (i.e., very young nodes) we have E[deg_t(s)] = O(t^{2p−1}). This behavior is very different from the degree distribution of, say, the preferential attachment model, for which the expected degree of a vertex s in a graph on t vertices is of order t/s [10], and may lead to a better understanding of some graph behaviors such as degree distribution, existence of the power-law phenomenon, symmetry, and structural compression.

Main results
In this section we present our main results with proofs and auxiliary lemmas delayed until the next section.
We use the standard graph notation, e.g. from [5]: V(G) denotes the set of vertices of a graph G, N_G(u) the set of neighbors of a vertex u in G, and deg_G(u) = |N_G(u)| the degree of u in G. For brevity we abbreviate the notation for G_t, e.g. deg_t(u) instead of deg_{G_t}(u). All graphs are simple. Let us also introduce the average degree of G_t,

    D(G_t) = (1/t) Σ_{u ∈ V(G_t)} deg_t(u),

and the average squared degree of G_t,

    D²(G_t) = (1/t) Σ_{u ∈ V(G_t)} deg_t²(u).

They are also known in the literature as the first and second moments of the degree distribution, respectively. Formally, we define the DD(t, p, r) model as follows: let 0 ⩽ p ⩽ 1 and 0 ⩽ r ⩽ t_0 be the parameters of the model. Let also G_{t_0} be a graph on t_0 vertices, with V(G_{t_0}) = {1, . . . , t_0}. Now, for every t = t_0, t_0 + 1, . . . we create G_{t+1} from G_t according to the following rules:
1. add a new vertex t + 1 to the graph,
2. pick a vertex u from V(G_t) = {1, . . . , t} uniformly at random and denote u as parent(t + 1),
3. for every i ∈ V(G_t):
   (a) if i ∈ N_t(parent(t + 1)), then add an edge between i and t + 1 with probability p,
   (b) if i ∉ N_t(parent(t + 1)), then add an edge between i and t + 1 with probability r/t.
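The generation procedure above can be sketched in code. The following simulation is ours, for illustration only (function names such as dd_graph are our own, and vertices are 0-indexed for convenience):

```python
import random

def dd_graph(adj0, t, p, r):
    """Grow a DD(t, p, r) graph from a seed given as a list of adjacency sets.

    At each step a uniformly random parent is duplicated: each of its edges is
    retained by the copy with probability p, and each non-neighbor of the parent
    (including the parent itself) is linked to the copy with probability r/k,
    where k is the current number of vertices.
    """
    adj = [set(s) for s in adj0]
    while len(adj) < t:
        k = len(adj)
        u = random.randrange(k)           # duplication: pick parent uniformly
        new = set()
        for w in range(k):                # divergence: decide each potential edge
            if w in adj[u]:
                if random.random() < p:   # retain a duplicated edge
                    new.add(w)
            elif random.random() < r / k: # add a divergence edge
                new.add(w)
        for w in new:                     # keep adjacency symmetric
            adj[w].add(k)
        adj.append(new)
    return adj

def avg_degree(adj):
    """Average degree D(G) = (1/n) * sum of degrees."""
    return sum(len(s) for s in adj) / len(adj)
```

For example, `avg_degree(dd_graph([{1, 2}, {0, 2}, {0, 1}], 1000, 0.9, 1.0))` produces one sample of D(G_1000) started from a triangle seed; averaging many such samples lets one check the asymptotic regimes discussed below empirically.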

Expected value
First, we derive exact and asymptotic expressions for the expected values of the average degree, the degree of the last node, and the degree of a given node. We start our considerations by deriving a recurrence for E[deg_t(s)] from the definition of the model.
Note that in order to solve this recurrence we need to know the behavior of E[deg_s(s)] for s ⩾ t_0.
Next, we establish a relation between the expected degree of the newest vertex and the expected average degree of the previous graph.
Now, we use the model definition again to find a recurrence for E[D(G_t)]. This leads us to the exact and asymptotic expressions for E[D(G_t)].
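The displayed recurrence is lost in this version; its shape can be reconstructed from the model definition as follows (our sketch; the exact published display should be checked against the original). When vertex t + 1 is added, the degree sum grows by twice the degree of the new vertex, so:

```latex
% the degree sum gains twice the new vertex's degree:
D(G_{t+1}) = \frac{t\,D(G_t) + 2\deg_{t+1}(t+1)}{t+1},
% taking expectations and using
% E[deg_{t+1}(t+1) | G_t] = (p - r/t) D(G_t) + r (Lemma 2):
\mathbb{E}[D(G_{t+1})]
  = \mathbb{E}[D(G_t)]\cdot\frac{t + 2p - 2r/t}{t+1} + \frac{2r}{t+1}.
```

The multiplicative factor is 1 + (2p − 1)/t + O(1/t²), which already hints at the three regimes Θ(1), Θ(log t) and Θ(t^{2p−1}) for p < 1/2, p = 1/2 and p > 1/2.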
Combining Lemmas 2 and 4 we obtain the following result, where c_3, c_4 are as above.
Similarly, from Lemma 4 and Theorem 5 we get the final formula for E[deg_t(t)].
Now we are in a position to state the exact and asymptotic expressions for E[deg_t(s)], with c_3 and c_4 as above.
Theorem 9. Asymptotically as t → ∞ it holds that:

the electronic journal of combinatorics 28(1) (2021), #P1.18

Thus we have presented a complete characterization of the expected average degree and the expected degree of a vertex s at time t.

Variance
The procedure described above can be extended to find also the second moment of the degree distribution.
First, reasoning analogous to that in the previous subsection leads to a recurrence. To solve this recurrence we need to know the behavior of E[deg²_s(s)] for all s ⩾ t_0. To do this we will need the following lemma connecting E[deg²_t(t)] with the first two moments of the degree distribution.
Lemma 11. For any t ⩾ t_0 it holds that

Similarly as before, here we need to find E[D²(G_t)], that is, the average squared degree of G_t. We find the following recurrence.
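The content of Lemma 11, whose display is lost in this version, can be sketched as follows (our reconstruction from the binomial decomposition used in the proof of Lemma 2; the constants should be checked against the published statement). Conditionally on the parent's degree d, we have deg_{t+1}(t + 1) = X + Y with independent X ∼ Bin(d, p) and Y ∼ Bin(t − d, r/t), so:

```latex
\mathbb{E}\bigl[(X+Y)^2 \mid d\bigr]
  = \operatorname{Var}(X) + \operatorname{Var}(Y) + (\mathbb{E}X + \mathbb{E}Y)^2
  = d\,p(1-p) + (t-d)\,\tfrac{r}{t}\Bigl(1-\tfrac{r}{t}\Bigr)
    + \Bigl(\bigl(p - \tfrac{r}{t}\bigr)d + r\Bigr)^2.
```

Averaging over the uniformly random parent (so that d averages to D(G_t) and d² to D²(G_t)) expresses E[deg²_{t+1}(t + 1)] through E[D(G_t)] and E[D²(G_t)] only, which is exactly the connection the lemma provides.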
Here E[D²(G_{t+1})] should not be confused with E[(D(G_{t+1}))²]; however, the latter may be derived in a similar fashion. Finally, we may obtain both exact and asymptotic formulas for the variances of the investigated random variables. However, due to the complicated form of the constants in the exact formulas, we omit the exact formulas from the paper and present only the asymptotic rates of growth.
Theorem 14. The following holds:

Applying Theorems 5 and 14 to Lemma 11, we observe that E[D²(G_t)] asymptotically dominates E[D(G_t)], and therefore:

Theorem 15. It holds that

Now we present the asymptotic expressions for Var[deg_t(s)]. Here we have two cases with different regimes; however, it should be noted that the leading terms may have different exact leading constants for different ranges of p and r, the same as in Theorem 9.
Theorem 16. Asymptotically as t → ∞ it holds that:

We conclude by stating the formula for Var[D(G_t)].
Theorem 17. It is true that asymptotically as t → ∞

It is worth noting that for both deg_t(t) and deg_t(s) the order of growth of the variance is completely dominated by the second moment (unless it is O(1)). However, with D(G_t) the situation is different: both E[(D(G_t))²] and (E[D(G_t)])² have the same order, although with different leading constants.

Analysis
In this section we provide proofs of our main results. We start with a sequence of lemmas that allow us to solve a particular type of recurrence encountered in this analysis, and then we extract asymptotics using analytic tools.

Useful lemmas
We begin our analysis by deriving a series of lemmas useful for the analysis of the following type of recurrence for some nonnegative functions g_1(n), g_2(n) and a Markov process G_n. It should be noted that our recurrences for E[deg_t(s)] and E[D(G_t)] (see, e.g., Lemmas 1 and 3) fall under this pattern. The next lemma is a generalization of a result obtained in [6], where only the case g_1(n) = 1 + a/n, a > 0, was analyzed. Lemma 18. Let (G_n)_{n=n_0}^∞ be a Markov process for which E f(G_{n_0}) > 0 and (1) holds with g_1(n) > 0, g_2(n) ⩾ 0 for all n = n_0, n_0 + 1, . . .. Then (ii) the process (M_n)_{n=n_0}^∞ defined by M_{n_0} = f(G_{n_0}) and is a martingale.
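The displays for the recurrence (1), its solution (2) and the martingale of Lemma 18 are lost in this version; their standard form can be reconstructed as follows (our sketch, consistent with the statement of Lemma 18; the indexing should be checked against the original):

```latex
% Recurrence of type (1):
\mathbb{E}[f(G_{n+1}) \mid G_n] = g_1(n)\, f(G_n) + g_2(n),
% which iterates to the closed form (2):
\mathbb{E}[f(G_n)] = \Biggl( \prod_{k=n_0}^{n-1} g_1(k) \Biggr)
  \Biggl( \mathbb{E}[f(G_{n_0})]
        + \sum_{k=n_0}^{n-1} \frac{g_2(k)}{\prod_{j=n_0}^{k} g_1(j)} \Biggr),
% with the associated martingale of Lemma 18:
M_n = \frac{f(G_n)}{\prod_{k=n_0}^{n-1} g_1(k)}
      - \sum_{k=n_0}^{n-1} \frac{g_2(k)}{\prod_{j=n_0}^{k} g_1(j)}.
```

One checks directly that E[M_{n+1} | G_n] = M_n: dividing the recurrence by the product through step n makes the g_1 factor cancel, and the g_2(n) term is exactly offset by the last summand.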
(ii) For all n ⩾ n_0 Proof. Observe that which proves (i). Furthermore, after some algebra and taking expectation with respect to G_n we arrive at which completes the proof.
We now observe that (2), as a solution of recurrences of type (1), contains sophisticated products and sums of products with which we must deal to find the asymptotics. The next lemma shows how to handle such products.
Proof. We have which completes the proof.
The next lemma presents the well-known asymptotic expansion of the gamma function but we include it here for the sake of completeness.
Lemma 20 (Abramowitz, Stegun [1]). For any a, b ∈ R, if n → ∞, then

    Γ(n + a) / Γ(n + b) = n^{a−b} (1 + O(1/n)),

where the coefficients of the full expansion are expressed via the generalized Bernoulli polynomials B_k^{(x)}. Now we deal with the sums of products seen in (2). In particular, we are interested in the following sum of products:

    Σ_{j} Γ(j + a) / Γ(j + b).

In the next three lemmas we consider three cases: a + 1 > b, a + 1 = b and a + 1 < b.
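As an illustrative numerical check (ours, not from the paper) of the leading-order behavior Γ(n + a)/Γ(n + b) ∼ n^{a−b} stated in Lemma 20, one can use the standard library's log-gamma function:

```python
import math

def gamma_ratio(n, a, b):
    """Gamma(n + a) / Gamma(n + b), computed via log-gamma for numerical stability."""
    return math.exp(math.lgamma(n + a) - math.lgamma(n + b))

# Leading order of Lemma 20: Gamma(n + a) / Gamma(n + b) ~ n**(a - b) as n grows.
a, b = 0.7, 0.2
for n in (10**3, 10**6):
    print(n, gamma_ratio(n, a, b) / n ** (a - b))  # ratio tends to 1
```

The relative error shrinks like 1/n, in line with the first correction term of the expansion.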
Proof. We estimate the sum using Lemma 20 and the Euler-Maclaurin formula [16, p. 294], which completes the proof.
Proof. We proceed as before, where pFq[ · ; z] is the generalized hypergeometric function. Moreover, it is true that asymptotically

Proof. The proof of the first formula follows directly from the definition of the generalized hypergeometric function. The second formula follows from Lemma 20, as we know that for n → ∞:

Proofs for expected values
Proof of Lemma 1. Observe that for any t ⩾ s, vertex s may become connected to vertex t + 1 in one of the following two cases:
• either s ∈ N_t(parent(t + 1)) (which holds with probability deg_t(s)/t) and we add an edge between s and t + 1 (with probability p),
• or s ∉ N_t(parent(t + 1)) (with probability (t − deg_t(s))/t) and we add an edge between s and t + 1 (with probability r/t).
Therefore, we obtain the following recurrence for E[deg_t(s)]. After applying the law of total expectation we get the desired result.
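Putting the two cases together, the recurrence of Lemma 1, whose display is lost in this version, can be reconstructed as:

```latex
\mathbb{E}[\deg_{t+1}(s) \mid G_t]
  = \deg_t(s) + p\,\frac{\deg_t(s)}{t}
    + \frac{r}{t}\cdot\frac{t - \deg_t(s)}{t}
  = \deg_t(s)\left(1 + \frac{p - r/t}{t}\right) + \frac{r}{t},
```

so, taking expectations, E[deg_{t+1}(s)] = (1 + (p − r/t)/t) E[deg_t(s)] + r/t, a recurrence of type (1) with g_1(t) = 1 + (p − r/t)/t and g_2(t) = r/t.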
Proof of Lemma 2. We first observe that it follows from the definition of the model that the degree of the new vertex t + 1 is the total number of edges from t + 1 to N_t(parent(t + 1)) (chosen independently with probability p) and to all other vertices (chosen independently with probability r/t). Note that it can be expressed as a sum of two binomial variables:

    deg_{t+1}(t + 1) ∼ Bin(deg_t(parent(t + 1)), p) + Bin(t − deg_t(parent(t + 1)), r/t).
These variables are independent conditionally on the choice of the parent, hence

    E[deg_{t+1}(t + 1) | G_t] = (p − r/t) Σ_k k · Pr(deg_t(parent(t + 1)) = k) + r.
By combining the two equations above with the law of total expectation we finally establish the lemma.
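Since the parent is chosen uniformly at random, Σ_k k · Pr(deg_t(parent(t + 1)) = k) = D(G_t), so the relation established by Lemma 2 reads (our reconstruction; the published display should be checked):

```latex
\mathbb{E}[\deg_{t+1}(t+1) \mid G_t] = \left(p - \frac{r}{t}\right) D(G_t) + r,
\qquad
\mathbb{E}[\deg_{t+1}(t+1)] = \left(p - \frac{r}{t}\right) \mathbb{E}[D(G_t)] + r.
```

This is the bridge between the degree of the newest vertex and the average degree of the previous graph used throughout the analysis.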
Proof of Lemma 3. Again we turn directly to the definition of the model, from which we derive the following recurrence for the average degree of G t+1 : where the last equality follows from Lemma 2. The final result is a direct consequence of applying the law of total expectation.
To finish the proof, we again apply the law of total expectation.
Proof of Lemma 13. From the model definition we get the stated recurrence. Now it is sufficient to substitute the respective parts of the formula by D²(G_t) and D(G_t) and once again apply the law of total expectation.
The rate of growth of B_1(t) is of course Θ(t^{p²+2p−1}). The rates of growth of B_3(t) can be found by applying Lemmas 21 to 23 to the respective cases: for example, when p >

Finding the asymptotics of B_2(t) is more complicated. To solve it we substitute E[D(G_j)] using Theorem 5 and note that the summands can be written as ratios of gamma functions of the form

    Γ(j + c_3) Γ(j + c_4) Γ(j + 1)² / (Γ(j + c_5 + 1) Γ(j + c_6 + 1) Γ(j + c_7 + 1) Γ(j)).

Here B_21(t) poses no problem, as it can be analyzed similarly to B_3(t). Moreover, we bound
Now if p > √2 − 1, the right-hand side is upper bounded by a constant, as the first sum is of order Θ(t^{1−2p}) and the second sum is of order Θ(t^{−p²}); therefore in total B_22(t) = O(t^{1−2p−p²}), and since 1 − 2p − p² < 0 for p > √2 − 1, it is bounded from above by a constant.
All other cases are treated in exactly analogous way. By putting them all together in the way presented above we obtain the final result.
Proof of Theorem 16. Lemma 10 combined with Lemma 18 leads us to the desired expression. From Lemmas 19 and 20 we find that
• the first part grows like Θ(log t) for p = 0 and Θ(t^{2p}) otherwise,
• the second part grows like Θ(1), Θ(log t) or Θ(t^{2p−1}) for p less than, equal to, or greater than 1/2,
so the first part is always asymptotically greater than the second part. The variance is obtained in the usual way, by subtracting the square of the respective formulas from Theorem 5. Note that here, in contrast to Theorem 9, we do not need to distinguish the case s = ct − o(t), as it differs only in the leading constant, but not in the rate of growth.
This establishes the behavior of E[(D(G_t))²], and the only remaining part is to combine this with Theorem 5 to find Var[D(G_t)].

Discussion
In this paper we focused on a rigorous and precise analysis of the expected value and variance of the degree of a given node in the network, as well as of the average degree over all nodes. We presented exact and asymptotic results showing phase transitions of these quantities as a function of p.
It is worth noting that the parameter p alone drives the rate of growth of both the first and second moments of the variables D(G_t), deg_t(t) and deg_t(s). The parameter r impacts only the leading constant and lower-order terms. The proposed methodology can be easily extended to obtain higher moments of the above quantities, if needed.
Future work may include investigating both the large deviations of the degree distribution and the complete spectrum of the degree distribution (i.e., the number of nodes of degree k) as a function of k, t, G_{t_0}, p and r.