The phase transition in site percolation on pseudo-random graphs

We establish the existence of the phase transition in site percolation on pseudo-random $d$-regular graphs. Let $G=(V,E)$ be an $(n,d,\lambda)$-graph, that is, a $d$-regular graph on $n$ vertices in which all eigenvalues of the adjacency matrix, but the first one, are at most $\lambda$ in their absolute values. Form a random subset $R$ of $V$ by putting every vertex $v\in V$ into $R$ independently with probability $p$. Then for any small enough constant $\epsilon>0$, if $p=\frac{1-\epsilon}{d}$, then with high probability all connected components of the subgraph of $G$ induced by $R$ are of size at most logarithmic in $n$, while for $p=\frac{1+\epsilon}{d}$, if the eigenvalue ratio $\lambda/d$ is small enough as a function of $\epsilon$, then typically $R$ spans a connected component of size at least $\frac{\epsilon n}{d}$ and a path of length proportional to $\frac{\epsilon^2n}{d}$.


Introduction and main results
Let G = (V, E) be a d-regular graph on n vertices. Form a random vertex subset R ⊆ V by putting every vertex v ∈ V into R independently with probability p. What can be said about the properties of the random subgraph of G induced by R? How large a connected component does it typically contain? How long a path can one find with high probability (whp) in G[R]?
Of course, the above model is nothing else but the site percolation, sometimes also called the vertex percolation, on G. Although it is perhaps somewhat less popular than its sister model of bond (edge) percolation, it has been quite extensively studied for various graphs and probability regimes.
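The model is easy to experiment with numerically. The following is a minimal sketch, not part of the argument: the function names and the choice of the hypercube as a test graph are illustrative assumptions. It samples $R$ and lists the component sizes of $G[R]$.

```python
import random
from collections import deque


def sample_site_percolation(vertices, p, rng):
    """Return R: each vertex is kept independently with probability p."""
    return {v for v in vertices if rng.random() < p}


def component_sizes(adj, R):
    """Sizes of the connected components of the induced subgraph G[R], largest first."""
    seen, sizes = set(), []
    for start in R:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:  # breadth-first search restricted to vertices of R
            v = queue.popleft()
            size += 1
            for u in adj[v]:
                if u in R and u not in seen:
                    seen.add(u)
                    queue.append(u)
        sizes.append(size)
    return sorted(sizes, reverse=True)


def hypercube(d):
    """Adjacency lists of the d-dimensional cube (illustrative d-regular graph on 2**d vertices)."""
    return {v: [v ^ (1 << i) for i in range(d)] for v in range(2 ** d)}


if __name__ == "__main__":
    d, eps = 12, 0.5  # illustrative parameters only
    adj = hypercube(d)
    rng = random.Random(0)
    for p in ((1 - eps) / d, (1 + eps) / d):
        R = sample_site_percolation(adj, p, rng)
        sizes = component_sizes(adj, R)
        print(f"p={p:.4f}  |R|={len(R)}  largest component={sizes[0] if sizes else 0}")
```

At such small sizes the output only hints at the asymptotic behavior; the sketch is meant for exploring the model, not for verifying the theorems.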
A well-tested intuition suggests that interesting things start happening when the expected vertex degree in the random subgraph so formed crosses the value 1. This should correspond to the vertex probability $p = \frac{1}{d}$. For this regime, we expect the cardinality of $R$ to be about $n/d$, and it is thus natural to scale the obtained structures relative to this size. There are several results of this type, showing the typical emergence of a connected component $C$ whose size is proportional to $n/d$ for several concrete graphs, like the $d$-dimensional cube $Q^d$ ([4], [8]), or the $n$-dimensional Hamming torus [9] (in fact, statements much more accurate than those to be presented here have been obtained for these models). We, however, aim to obtain a result applicable to a large class of $d$-regular graphs.
Certainly some further assumptions on the ground graph $G$ have to be made if we aim to get a positive result, that is, to claim the typical existence of a large connected component spanned by $R$. Indeed, we can start with the graph $G$ being a collection of vertex disjoint cliques of size $d + 1$, in which case of course all connected components in $R$ are of size at most $d + 1$, much smaller than $n/d$ for small degree $d = d(n)$. Thus, it is natural to impose some restrictions on the edge distribution of $G$.
Here we assume that $G$ is a pseudo-random graph. Informally speaking, a pseudo-random graph is a graph $G = (V, E)$ whose edge distribution closely resembles that of a truly random graph $G(n, p)$ of the same edge density $p = |E| / \binom{|V|}{2}$. There are several commonly used models of pseudo-random graphs. In this paper we adopt the notion of $(n, d, \lambda)$-graphs. A graph $G$ is an $(n, d, \lambda)$-graph if $G$ has $n$ vertices, is $d$-regular, and all eigenvalues of the adjacency matrix of $G$, but the first one, are at most $\lambda$ in their absolute values. (We assume that the eigenvalues of (the adjacency matrix of) $G$ are ordered in the non-increasing order $\lambda_1 \ge \ldots \ge \lambda_n$. The largest eigenvalue of any $d$-regular graph is easily seen to be $d$, sometimes referred to as the trivial eigenvalue of $G$.) The reader can consult the survey [6] for an extensive discussion of this notion.
It is well known that the pseudo-randomness of the edge distribution in $(n, d, \lambda)$-graphs can be controlled through the so-called eigenvalue ratio $\lambda/d$: the smaller the ratio is, the closer the edge distribution of $G$ approaches that of a random graph with edge probability $p = \frac{d}{n}$. We will state a standard result establishing this connection later (Lemma 2.1).
Equipped with this formalism, we can now state our main results.
Theorem 1. Let $\epsilon > 0$. Let $G = (V, E)$ be a graph of maximum degree at most $d$ on $n$ vertices. Form a random subset $R \subseteq V$ by including each vertex $v \in V$ in $R$ independently and with probability $p$. If $p = \frac{1-\epsilon}{d}$, then whp all connected components of the induced subgraph $G[R]$ are of size less than $\frac{4}{\epsilon^2} \ln n$.
Theorem 2. For every small enough $\epsilon > 0$ there exists $\delta > 0$ such that the following is true. Let $G = (V, E)$ be an $(n, d, \lambda)$-graph. Assume that $d = o(n)$ and $\frac{\lambda}{d} < \delta$. Let $p = \frac{1+\epsilon}{d}$. Form a random subset $R \subseteq V$ by including each vertex $v \in V$ in $R$ independently and with probability $p$. Then whp $R$ contains a path of length at least $\frac{\epsilon^2 n}{5d}$ in $G$.
Some comments are in order here. First, observe that Theorem 1 holds unconditionally, i.e., without any further assumptions on the edge distribution of $G$; it applies to every graph $G$ of maximum degree $d$. This means that if the vertex probability $p = p(d)$ is a notch below the critical value $1/d$, then even for the best ground graphs $G$ the random induced subgraph $G[R]$ typically shatters into relatively small pieces. On the positive side, for $p(d)$ above the critical probability and assuming that $G$ is pseudo-random, $R$ typically contains a component of size linear in $|R|$, and even a path of linear size. This phenomenon can be viewed as the phase transition in this site percolation model. It is pretty much in line with the familiar situation in the random graph $G(n, p)$. There, for $p = \frac{1-\epsilon}{n}$ all components are of size $O_\epsilon(\log n)$, while for $p = \frac{1+\epsilon}{n}$ there is whp a connected component of size linear in $n$. This is, in one sentence, the essence of the fundamental discovery of Erdős and Rényi [5]. As for paths, Ajtai, Komlós and Szemerédi proved some 20 years later [1] that $G(n, p)$ with $p = \frac{1+\epsilon}{n}$ contains whp a path of length linear in $n$ as well (see [7] for a recent simple proof of this fact). Actually, even the order of the dependence on $\epsilon$ in our theorems matches the corresponding results for $G(n, p)$.
Let us say a few general words about the proofs. We use the Depth First Search algorithm (DFS) for all three theorems above. We run the DFS algorithm on our random instance, allowing it to uncover the random set $R$ along the algorithm execution. In this respect, our arguments are somewhat similar to those of [7], with the most substantial difference being that in our setting the algorithm exposes random decisions on the vertices of $G$, rather than on its edges as in [7]. Another key ingredient in the proof is an estimate of the number of non-expanding sets of a given size in an $(n, d, \lambda)$-graph; here we are pretty much inspired by a similar argument from the paper of Alon and Rödl [2].
The notation we use here is fairly standard. In particular, for a graph $G = (V, E)$ and disjoint vertex subsets $U, W \subset V$, we denote by $N_G(U)$ the external neighborhood of $U$, i.e., $N_G(U) = \{v \in V \setminus U : v \text{ has a neighbor in } U\}$; let also $e_G(U)$ and $e_G(U, W)$ denote the number of edges of $G$ spanned by $U$, and between $U$ and $W$, respectively. For $v \in V$ and $U \subseteq V$, let $d(v, U)$ be the number of neighbors of $v$ in $U$. If the graph $G$ is clear from the context, we often allow ourselves not to put it in the indices of the above notation. We omit rounding signs occasionally for the sake of clarity of presentation.

Eigenvalues and edge distribution
We will apply the following standard estimate (see, e.g., Chapter 9 of [3]), sometimes called the expander mixing lemma; it postulates that the edge distribution in an (n, d, λ)-graph G with small eigenvalue ratio λ/d is quite close to that of a truly random graph of edge density d/n. In fact this will be the only tool about graph eigenvalues used in our proof.
where e(B, C) denotes the number of ordered pairs Then |C| ≤ 2 and from here (the last inequality is due to the assumption |B| ≥ n/2), and the claim follows.
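For orientation, the expander mixing lemma is often quoted in the following form. We record it here as an assumption about the textbook version, which may differ in constants or correction factors from the precise statement used in this paper; $e(B, C)$ counts ordered pairs $(u, v)$ with $u \in B$, $v \in C$ and $uv \in E$.

```latex
\[
  \left|\, e(B, C) - \frac{d\,|B|\,|C|}{n} \,\right|
  \;\le\; \lambda \sqrt{|B|\,|C|}
  \qquad \text{for all } B, C \subseteq V .
\]
```

The main term $\frac{d|B||C|}{n}$ is what one expects in a random graph of edge density $d/n$. Dividing the error term by it gives a relative error of at most $\frac{\lambda}{d} \cdot \frac{n}{\sqrt{|B||C|}}$, so the estimate is informative for sets with $|B||C|$ much larger than $(\lambda n / d)^2$.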

Depth first search on random vertex subgraphs
The Depth First Search, or DFS for brevity, is a standard graph search algorithm, usually used to uncover the connected components of an input graph $G = (V, E)$. In this paper we use it in a somewhat unusual context, revealing also a random subset $R \subset V$ of the vertex set $V$ as the algorithm proceeds. Here is a brief description of the algorithm. It maintains and updates a partition of $V$ into four sets of vertices: $S$ is the set of vertices whose exploration is complete, $T$ is the set of unvisited vertices, $W$ is the set of vertices discovered to fall outside of the random set $R$, and $U = V \setminus (S \cup T \cup W)$, where the vertices of $U$ are kept in a stack (the last in, first out data structure). It is also assumed that some order $\sigma$ on the vertices of $G$ is fixed, and the algorithm prioritizes vertices according to $\sigma$.
The algorithm starts with $S = U = W = \emptyset$ and $T = V$, and runs till $U \cup T = \emptyset$. At each round of the algorithm, if the set $U$ is non-empty, the algorithm queries $T$ for neighbors of the last vertex $v$ that has been added to $U$, scanning these neighbors according to $\sigma$. If $v$ has a neighbor $u$ in $T$, the algorithm flips a coin that comes up heads with probability $p$. If the coin comes up heads, the algorithm deletes $u$ from $T$ and inserts it into $U$; otherwise $u$ moves from $T$ to $W$. If $v$ does not have a neighbor in $T$, then $v$ is popped out of $U$ and is moved to $S$. If $U$ is empty, the algorithm chooses the first vertex $u$ of $T$ according to $\sigma$, deletes it from $T$, flips a coin for $u$, and either pushes it into $U$ or moves it to $W$ based on the result of the coin flip. Once the algorithm execution is complete, the set $S$ coincides with the random set $R$, while $W$ is its complement $W = V \setminus R$.
Observe that the DFS algorithm starts revealing a connected component C of the induced subgraph G[R] at the moment the first vertex of C gets into (empty beforehand) U and completes discovering all of C when U becomes empty again. We call a period of time between two consecutive emptyings of U an epoch, each epoch corresponding to one connected component of G[R].
The following basic properties of the DFS algorithm will be useful to us:
• at any stage of the algorithm, it has been revealed already that the graph $G$ has no edges between the current set $S$ and the current set $T$, and thus $N_G(S) \subseteq U \cup W$;
• the set $U$ always spans a path (indeed, when a vertex $u$ is added to $U$, it happens because $u$ is a neighbor of the last vertex $v$ in $U$; thus, $u$ augments the path spanned by $U$, of which $v$ is the last vertex).
We will run the DFS on an $n$-vertex input $G$, fixing some order $\sigma$ on $V(G)$. When the DFS algorithm is fed with a sequence of i.i.d. Bernoulli($p$) random variables $\bar{X} = (X_i)_{i=1}^n$, so that it gets its $i$-th query answered positively if $X_i = 1$ and answered negatively otherwise, the final subset $S$ of the algorithm is distributed exactly like a random subset $R$, formed by including each vertex of $V$ independently and with probability $p$. Thus, studying the structure of $G[R]$ can be reduced to studying the properties of the random sequence $\bar{X}$, a much more accessible task.
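The procedure described in this section can be sketched in a few lines. This is a minimal illustration under our own naming (the function `dfs_site_percolation` and its return values are hypothetical, not taken from the paper). It takes the answers to the vertex queries as an explicit 0/1 sequence, records one component per epoch, and tracks the largest size reached by the stack $U$, which is the number of vertices of the longest path found.

```python
import random


def dfs_site_percolation(adj, order, coin_flips):
    """DFS that reveals the random set while exploring.

    adj        : dict mapping each vertex to a list of its neighbours
    order      : list of all vertices, giving the fixed priority order sigma
    coin_flips : iterable of 0/1 answers, one consumed per vertex query
    Returns (S, W, components, longest_stack).
    """
    rank = {v: i for i, v in enumerate(order)}
    flips = iter(coin_flips)
    T = set(order)         # unvisited (not yet queried) vertices
    S, W = set(), set()    # fully explored vertices / vertices found outside R
    U = []                 # stack; its contents always span a path in G
    components, current = [], []
    longest_stack = 0
    while U or T:
        if U:
            v = U[-1]
            candidates = [u for u in adj[v] if u in T]
            if not candidates:
                S.add(U.pop())          # exploration of v is complete
                if not U:               # U emptied: the epoch ends
                    components.append(current)
                    current = []
                continue
            u = min(candidates, key=rank.__getitem__)
        else:
            # Linear scan for simplicity; a real implementation would keep a pointer.
            u = min(T, key=rank.__getitem__)
        T.remove(u)
        if next(flips):                 # positive answer: u belongs to R
            U.append(u)
            current.append(u)
            longest_stack = max(longest_stack, len(U))
        else:                           # negative answer: u falls outside R
            W.add(u)
    return S, W, components, longest_stack


if __name__ == "__main__":
    # Illustrative run on a cycle with i.i.d. Bernoulli(p) answers.
    n, p = 20, 0.6
    cycle = {v: [(v - 1) % n, (v + 1) % n] for v in range(n)}
    rng = random.Random(1)
    X = [1 if rng.random() < p else 0 for _ in range(n)]
    S, W, comps, longest = dfs_site_percolation(cycle, list(range(n)), X)
    print(sorted(S), comps, longest)
```

Every vertex is queried exactly once, so a sequence of $n$ answers always suffices, and the returned `S` is the set of vertices that received a positive answer.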

Concentration of random variables
As we indicated, our argument allows us to study the properties of a random vector $\bar{X} = (X_i)_{i=1}^n$ instead of studying directly the subgraph of $G$ spanned by a random subset $R$. Since for a subset $I \subseteq [n]$ the sum $\sum_{i \in I} X_i$ is distributed binomially with parameters $|I|$ and $p$, we can use standard large deviation estimates for binomial random variables. In particular, we have: , in which at least $k$ of the random variables $X_i$ take value 1.
(1) For a given interval $I$ of length $kd$ in $[n]$, the sum $\sum_{i \in I} X_i$ is distributed binomially with parameters $kd$ and $p$. Applying the standard Chernoff-type bound (see, e.g., Theorem A.1.11 of [3]) to the upper tail of $\mathrm{Bin}(kd, p)$, and then the union bound, we see that the probability of the existence of an interval violating the assertion of the lemma is at most for small enough $\epsilon > 0$.
(2) This follows by applying Chernoff to the upper tail of $\sum_{i=1}^{\epsilon^3 n} X_i \sim \mathrm{Bin}\left(\epsilon^3 n, \frac{1+\epsilon}{d}\right)$.
(3) This follows by applying Chernoff to the upper tail of $\sum_{i=1}^{\epsilon n} X_i \sim \mathrm{Bin}\left(\epsilon n, \frac{1+\epsilon}{d}\right)$.
(4) Partition $[\epsilon n]$ into $1/\epsilon^3$ intervals $I_j$ of length $\epsilon^4 n$ each. Applying Chernoff to the lower tails of the interval sums $\sum_{i \in I_j} X_i$ and then the union bound, we derive that whp for all $j$ $\sum_{i \in I_j} X_i$ say. Assume this to be true. Then for $\epsilon^3 n \le t \le \epsilon n$,

Proofs

Proof of Theorem 1
Assume to the contrary that $R$ contains a connected component $C$ with at least $k = \frac{4}{\epsilon^2} \ln n$ vertices. Let us look at the epoch of the DFS when $C$ was created. Consider the moment inside this epoch when the algorithm found the $k$-th vertex of $C$ and has just moved it to $U$. Denote $C_0 = (S \cup U) \cap C$ at that moment. Then $|C_0| = k$, and the subgraph $G[C_0]$ is connected and thus spans at least $k - 1$ edges. Since every vertex of $G$ has degree at most $d$, it follows that $C_0$ has at most $kd - 2(k - 1)$ neighbors outside of $C_0$. The algorithm got exactly $k$ positive answers to its queries to random variables $X_i$ during the epoch, with each positive answer being responsible for revealing yet another vertex of $C_0$. At this moment during the epoch only the vertices in $C_0$ and those neighboring them in $G$ have been queried, and the number of these vertices is therefore at most $k + kd - 2(k - 1) \le kd$. It thus follows that the sequence $\bar{X}$ contains an interval of length at most $kd$ with at least $k$ 1's inside, a contradiction to Property 1 of Lemma 2.3.
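To see why the threshold $k = \frac{4}{\epsilon^2} \ln n$ is large enough, the tail estimate behind Property 1 can be checked by hand. The sketch below uses the common multiplicative Chernoff bound $\Pr[X \ge (1+\eta)\mu] \le \exp(-\eta^2 \mu / 3)$ for $0 < \eta \le 1$; this is an assumption made for illustration and not necessarily the form of Theorem A.1.11 of [3]. It also assumes $\epsilon \le 1/2$ and an interval of length exactly $kd$.

```latex
\begin{align*}
X &\sim \mathrm{Bin}\!\left(kd,\ \tfrac{1-\epsilon}{d}\right),
  \qquad \mu = \mathbb{E}[X] = (1-\epsilon)k, \\
\Pr[X \ge k] &= \Pr\!\left[X \ge (1+\eta)\mu\right],
  \qquad \eta = \tfrac{\epsilon}{1-\epsilon} \le 1, \\
&\le \exp\!\left(-\tfrac{\eta^2 \mu}{3}\right)
  = \exp\!\left(-\tfrac{\epsilon^2 k}{3(1-\epsilon)}\right)
  \le \exp\!\left(-\tfrac{\epsilon^2 k}{3}\right)
  = n^{-4/3}
  \quad \text{for } k = \tfrac{4}{\epsilon^2}\ln n .
\end{align*}
```

There are at most $n$ intervals of length $kd$ in $[n]$, so the union bound gives a failure probability of at most $n \cdot n^{-4/3} = n^{-1/3} = o(1)$.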

Proofs of Theorems 2 and 3
The proofs of Theorems 2 and 3 are based on the same lemma, which we present and prove next. Observe that in an $(n, d, \lambda)$-graph $G = (V, E)$, every vertex subset $S$ expands itself outside by at most the factor of $d$. The assumption $\lambda/d \le \delta$ we put on the eigenvalue ratio is fairly mild and cannot guarantee such expansion for all sets. The key lemma below asserts that sets $S \subset V$ of relevant size $|S| = \Theta(n/d)$, not expanding themselves by nearly the factor of $d$, are very rare in $G$ even under this weak eigenvalue ratio assumption, and thus are unlikely to fall into a random subset $R$ of size proportional to $n/d$.
Lemma 3.1. For every $0 < \alpha_0 \le 1/2$, $0 < c \le 1/3$ there exists $\delta > 0$ such that the following is true. Let $G = (V, E)$ be an $(n, d, \lambda)$-graph. Assume that $d = o(n)$ and $\frac{\lambda}{d} < \delta$. Let $p \le \frac{2}{d}$. Form a random subset $R \subseteq V$ by including each vertex $v \in V$ in $R$ independently and with probability $p$. Then whp $R$ does not contain a set $S$ with $|S| = m$, cn , and is expanding otherwise. We estimate from above the number of non-expanding $m$-sets in $G$.
, and is good otherwise. Each good vertex $v_i$ appended to $S_{i-1}$ increases substantially the external neighborhood of the prefix. Suppose that $\tau$ has at most $\alpha m$ bad vertices. Then (We subtracted $m$ in the first line above to account for the $m$ vertices of $S$ itself, not contributing to the external neighborhood of $S$; also, we assumed $d$ to be large enough, which can be guaranteed by our choice of $\delta$). It follows that in this case the set $S$ is expanding. Hence, in order to produce a non-expanding set $S$, the sequence $\tau$ should contain at least $\alpha m$ bad vertices. Let us zoom in on the $i$-th vertex $v_i$ of $\tau$. Given $v_1, \ldots, v_{i-1}$, the set $S_{i-1} \cup N_{i-1}$ has obviously at most $(i - 1)(d + 1) < m(d + 1) < n/2$ vertices. Then by Corollary 2.2 the number of bad choices for $v_i$ is at most $\frac{2}{\alpha^2} \left(\frac{\lambda}{d}\right)^2 n \le \frac{2}{\alpha^2} \delta^2 n$. Therefore the number of sequences $\tau$ with at least $\alpha m$ bad vertices is at most Dividing by $m!$ to get the number of unordered non-expanding $m$-sets, and then multiplying by $p^m$ we get that the probability that $R$ contains a non-expanding $m$-set is at most Recall that we assumed $p \le \frac{2}{d}$ and $m \ge \frac{cn}{d}$, implying $\frac{np}{m} \le \frac{2}{c}$. Choosing $\delta > 0$ from the lemma statement small enough guarantees that the expression in (2) is, say, at most $2^{-m}$. Applying the union bound over all possible values of $m$ establishes the lemma.
Proof of Theorem 2. Set $\alpha_0 = \frac{\epsilon}{25}$, $c = \epsilon$, and choose $\delta$ in the theorem statement to be $\delta(\alpha_0, c)$ from Lemma 3.1. Run the DFS algorithm on $G$ and feed it with a sequence $\bar{X} = (X_i)_{i=1}^n$ of i.i.d. Bernoulli($p$) random variables. Assume that $\bar{X}$ satisfies the properties stated in Lemma 2.3. We claim that after the first $\epsilon n$ vertex queries (of the type "Is $v \in R$?") of the DFS algorithm, the set $U$ contains at least $\frac{\epsilon^2 n}{5d}$ vertices, with the contents of $U$ forming a path of desired length at that moment. At that point, all $\epsilon n$ queried vertices reside in $S \cup U \cup W$, implying $|S \cup U \cup W| = \epsilon n$. Also, each positive answer to a query put a vertex in $U$ (that possibly has migrated further to $S$).
for small enough $\epsilon > 0$, a contradiction. Hence $U$ never empties in the interval $[\epsilon^3 n, \epsilon n]$. This means that all vertices added to $U$ during this period belong to the same connected component $C$, whose epoch contains this interval; their number is $\sum_{i=\epsilon^3 n}^{\epsilon n} X_i$

Concluding remarks
We have proven that in the site percolation model for an $(n, d, \lambda)$-graph $G$, under rather mild assumptions on the spectral ratio $\lambda/d$, the phase transition occurs at $p = \frac{1}{d}$: for $p = \frac{1-\epsilon}{d}$, whp all connected components of the subgraph of $G$ induced by a random subset $R$ are of size at most logarithmic in $n$, while for $p = \frac{1+\epsilon}{d}$, the random set $R$ spans whp a connected component of size at least $\frac{\epsilon n}{d}$ and a path of length proportional to $\frac{\epsilon^2 n}{d}$. Although we have established the existence of the phase transition for this model of pseudo-random graphs in this paper, many further natural questions about site percolation on pseudo-random graphs have not been resolved here, and it would be nice to address them. Particular issues include, for the super-critical regime $p = \frac{1+\epsilon}{d}$:
• the uniqueness of the giant component, bounding sizes of all other components spanned by the random subset $R$;
• accurate (in $\epsilon$) asymptotics of the size of the giant component in $G[R]$;
• upper bounding the length of a longest path/cycle spanned by $R$.
And of course, it would be very interesting to look into the critical regime $p = \frac{1+o(1)}{d}$, aiming to understand the continuous evolution of the size of the giant component spanned by $R$ from logarithmic to linear in $|R|$.