Estimating global subgraph counts by sampling

We give a simple proof of a generalization of an inequality for homomorphism counts by Sidorenko. A special case of our inequality says that if $d_v$ denotes the degree of a vertex $v$ in a graph $G$ and $\hom_\Delta(H, G)$ denotes the number of homomorphisms from a connected graph $H$ on $h$ vertices to $G$ which map a particular vertex of $H$ to a vertex $v$ in $G$ with $d_v \ge \Delta$, then $\hom_\Delta(H, G) \le \sum_{v \in G} d_v^{h-1} \mathbf{1}_{d_v \ge \Delta}$. We use this inequality to study the minimum sample size needed to estimate the number of copies of $H$ in $G$ by sampling vertices of $G$ at random.


Introduction
This paper consists of two main parts. In Section 2 we present a simple proof of an inequality that generalizes Sidorenko's inequality on homomorphism counts. Our motivation for this result comes from an application to estimating global subgraph counts by sampling, which is discussed in Section 3.
Theorem 1 (Sidorenko, 1994). For any connected graph $H$ on $h \ge 1$ vertices and any graph $G$,
$$\hom(H, G) \le \hom(K_{1,h-1}, G). \tag{2.1}$$
In fact, Sidorenko [11] showed this for trees $H$, but this is immediately equivalent to our formulation, since $\hom(H, G) \le \hom(T, G)$ for any spanning tree $T$ of $H$.
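As a sanity check of Theorem 1, the inequality can be verified by brute force on a toy example. The sketch below is illustrative only: the helper names `hom_count` and `star_hom_count`, the adjacency-dictionary representation and the example graph are our own assumptions, and the enumeration is exponential in $h$, so it is only suitable for very small graphs. It uses the standard identity $\hom(K_{1,h-1}, G) = \sum_{v} d_v^{h-1}$.

```python
from itertools import product


def hom_count(h_edges, h_size, g_adj):
    """Count homomorphisms from H (vertices 0..h_size-1) to G by brute force."""
    g_vertices = list(g_adj)
    count = 0
    for image in product(g_vertices, repeat=h_size):
        # A map is a homomorphism if every edge of H lands on an edge of G.
        if all(image[b] in g_adj[image[a]] for a, b in h_edges):
            count += 1
    return count


def star_hom_count(h_size, g_adj):
    """hom(K_{1,h-1}, G), computed as the sum of d_v^(h-1) over vertices v."""
    return sum(len(nbrs) ** (h_size - 1) for nbrs in g_adj.values())


# Toy graph G: a triangle 0-1-2 with a pendant vertex 3 attached to 0.
G = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
# H: the path on 4 vertices (a tree), given by its edge list.
P4_EDGES = [(0, 1), (1, 2), (2, 3)]

lhs = hom_count(P4_EDGES, 4, G)  # number of walks with 3 steps in G
rhs = star_hom_count(4, G)       # sum of cubed degrees
print(lhs, rhs, lhs <= rhs)
```

For this graph the degrees are 3, 2, 2, 1, so the right-hand side is $27 + 8 + 8 + 1 = 44$, while the path has 38 homomorphisms.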
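If $H$ is a rooted graph, with root $o$, and $\Delta \ge 0$, we also define
$$\hom_\Delta(H, G) := \bigl|\{\varphi : H \to G \text{ a homomorphism with } d_{\varphi(o)} \ge \Delta\}\bigr|,$$
where $d_v$ here and below denotes the degree of a vertex $v$ in a graph. (The graph will be clear from the context; in this section it is always $G$.) We show the following extension of Theorem 1.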
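Theorem 2. For any connected rooted graph $H$ on $h \ge 1$ vertices, any graph $G$, and any $\Delta \ge 0$,
$$\hom_\Delta(H, G) \le \sum_{v \in G} d_v^{h-1} \mathbf{1}_{d_v \ge \Delta}.$$
Note that Sidorenko's theorem is the special case $\Delta = 0$ of our theorem. We will use induction to prove a more general statement. Let $\alpha := (\alpha_w)_{w \in H}$ be a vector of non-negative real numbers $\alpha_w$ indexed by the vertices in $H$, and let
$$\hom_{\Delta,\alpha}(H, G) := \sum_{\varphi} \mathbf{1}_{d_{\varphi(o)} \ge \Delta} \prod_{w \in H} d_{\varphi(w)}^{\alpha_w}, \tag{2.4}$$
where the sum is over all homomorphisms $\varphi : H \to G$. In particular, taking all $\alpha_w = 0$, we have $\hom_{\Delta,0}(H, G) = \hom_\Delta(H, G)$. Hence, Theorem 2 is a special case of the following result.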
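Theorem 3. For any connected rooted graph $H$ on $h \ge 1$ vertices, any graph $G$, any $\Delta \ge 0$, and any non-negative vector $\alpha = (\alpha_w)_{w \in H}$,
$$\hom_{\Delta,\alpha}(H, G) \le \sum_{v \in G} d_v^{\,h-1+\sum_{w \in H} \alpha_w} \mathbf{1}_{d_v \ge \Delta}. \tag{2.5}$$
Proof. We may assume that $H$ is a tree; otherwise we can replace it by a spanning tree, which does not decrease the left-hand side of (2.5).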
To prove (2.5), we use a double induction over the number of vertices $h$ in $H$ and the number of non-root vertices $w$ such that the weight $\alpha_w > 0$. Hence, we may assume that (2.5) holds if we replace the pair $(H, \alpha)$ by a pair $(H', \alpha')$ such that either (i) $H'$ has fewer vertices than $H$, or (ii) $H'$ has the same number of vertices as $H$, but there are fewer non-root vertices $w \in H'$ with $\alpha'_w > 0$ than non-root $w \in H$ with $\alpha_w > 0$. The base case $h = 1$ is trivial, since in this case (2.5) is an identity. To prove the induction step, we consider three cases separately.
Case 1: $H$ has a leaf $w \ne o$ with $\alpha_w = 0$. Let $v$ be the unique neighbour of $w$ in $H$. Define $H' := H \setminus \{w\}$, and let $\alpha'_v := \alpha_v + 1$ and $\alpha'_u := \alpha_u$ for all other $u \in H'$. Then $\hom_{\Delta,\alpha}(H, G) = \hom_{\Delta,\alpha'}(H', G)$, because every homomorphism $\varphi$ of $H'$ extends to $H$ in exactly $d_{\varphi(v)}$ ways. Thus (2.5) follows by the induction hypothesis, since $H'$ has one vertex fewer than $H$.
Case 2: $H$ has (at least) two distinct non-root vertices $v$ and $w$ with $\alpha_v, \alpha_w > 0$. Here we use Hölder's inequality, in a way that is essentially the same as in Sidorenko [11] (although he does it in a more general way).
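By decomposing the sum in (2.4) according to the values of $\varphi(v)$ and $\varphi(w)$, we obtain
$$\hom_{\Delta,\alpha}(H, G) = \sum_{x, y \in G} d_x^{\alpha_v} d_y^{\alpha_w} \mu_{x,y} \tag{2.7}$$
for some numbers $\mu_{x,y} \ge 0$ that do not depend on $\alpha_v$ and $\alpha_w$. We regard the numbers $\mu_{x,y}$ as a measure $\mu$ on the finite set $V(G) \times V(G)$, and rewrite (2.7) as
$$\hom_{\Delta,\alpha}(H, G) = \int d_x^{\alpha_v} d_y^{\alpha_w} \, d\mu(x, y).$$
Hölder's inequality now bounds this integral by a product of powers of two integrals of the same type, which correspond to two modified weight vectors $\alpha'$ and $\alpha''$. Hence (2.5) follows from the induction hypothesis, since both $\alpha'$ and $\alpha''$ have one fewer non-root vertex with positive weight than $\alpha$.
Case 3: The remaining case. If none of the cases above applies, then every non-root leaf has positive weight, and there is at most one non-root vertex with positive weight. In particular, there is at most one non-root leaf. If also $|V(H)| \ge 2$, then $H$ must have exactly one non-root leaf, say $v$, and thus $H$ is a path with end vertices $o$ and $v$.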
In this case, we use a variant of an argument that has been used to show other inequalities (see, e.g., [7, Theorems 43 and 236] and [6, Theorem 2.4]). We write, for $x \in G$, [...] (2.14) Then both $f(x)$ and $g(x)$ are (weakly) increasing functions of $d_x$, and thus, for all $x, y \in G$,
$$\bigl(f(x) - f(y)\bigr)\bigl(g(x) - g(y)\bigr) \ge 0.$$
Consequently, using also the symmetry of $H$ interchanging $o$ and $v$, [...], and thus (2.5) follows by induction, since $\alpha'$ has one fewer non-root vertex with positive weight than $\alpha$.
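The two analytic steps of the proof, Hölder's inequality in Case 2 and the monotonicity argument in Case 3, can be sketched as follows. This is a reconstruction under stated assumptions and not necessarily the authors' exact formulas. It assumes that $\hom_{\Delta,\alpha}(H, G)$ is the sum of $\mathbf{1}_{d_{\varphi(o)} \ge \Delta} \prod_w d_{\varphi(w)}^{\alpha_w}$ over homomorphisms $\varphi$, and that the right-hand side of (2.5) depends on $\alpha$ only through $\sum_w \alpha_w$. The quantity $a$, the vectors $\alpha'$ and $\alpha''$, the walk counts $W(x, y)$ and the concrete functions $f$ and $g$ are our own illustrative choices.

```latex
% Case 2 (Hoelder). Put a := \alpha_v + \alpha_w > 0 and use the conjugate
% exponents a/\alpha_v and a/\alpha_w:
\begin{align*}
\int d_x^{\alpha_v} d_y^{\alpha_w}\,d\mu(x,y)
 &\le \Bigl(\int d_x^{a}\,d\mu(x,y)\Bigr)^{\alpha_v/a}
      \Bigl(\int d_y^{a}\,d\mu(x,y)\Bigr)^{\alpha_w/a} \\
 &= \hom_{\Delta,\alpha'}(H,G)^{\alpha_v/a}\,
    \hom_{\Delta,\alpha''}(H,G)^{\alpha_w/a},
\end{align*}
% where \alpha' moves the whole weight a to v (\alpha'_v = a, \alpha'_w = 0)
% and \alpha'' moves it to w (\alpha''_v = 0, \alpha''_w = a).

% Case 3 (monotonicity). H is a path o = u_0, ..., u_k = v with k = h - 1.
% Let W(x,y) = W(y,x) be the number of walks with k steps from x to y in G,
% and put f(x) := \mathbf{1}_{d_x \ge \Delta} d_x^{\alpha_o}, g(x) := d_x^{\alpha_v}.
\begin{align*}
\hom_{\Delta,\alpha}(H,G)
 &= \sum_{x,y \in G} W(x,y)\, f(x)\, g(y)
  = \tfrac12 \sum_{x,y \in G} W(x,y)\,\bigl(f(x)g(y) + f(y)g(x)\bigr) \\
 &\le \tfrac12 \sum_{x,y \in G} W(x,y)\,\bigl(f(x)g(x) + f(y)g(y)\bigr)
  = \sum_{x,y \in G} W(x,y)\, f(x)\, g(x)
  = \hom_{\Delta,\alpha'}(H,G),
\end{align*}
% with \alpha'_o = \alpha_o + \alpha_v and \alpha'_v = 0.
```

In both cases the total weight $\sum_w \alpha_w$ is unchanged, while the number of non-root vertices with positive weight drops by one, which is what the induction hypothesis requires. In Case 2 the two exponents $\alpha_v/a$ and $\alpha_w/a$ sum to 1, so a common upper bound for both factors is also an upper bound for their product.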
Proof of Theorem 2. As mentioned above, this is the special case α = 0 of Theorem 3.
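Theorems 2 and 3 can also be checked numerically on a toy example. The sketch below assumes that $\hom_{\Delta,\alpha}(H, G)$ sums $\mathbf{1}_{d_{\varphi(o)} \ge \Delta} \prod_w d_{\varphi(w)}^{\alpha_w}$ over homomorphisms $\varphi$ and that the bound in (2.5) is $\sum_v d_v^{h-1+\sum_w \alpha_w} \mathbf{1}_{d_v \ge \Delta}$. The function names `hom_weighted` and `star_bound` and the example graph are hypothetical, chosen only for illustration.

```python
from itertools import product


def hom_weighted(h_edges, alpha, root, delta, g_adj):
    """Brute-force hom_{Delta,alpha}(H, G) under the assumed definition:
    the sum, over homomorphisms phi with d_{phi(root)} >= delta,
    of the product of d_{phi(w)} ** alpha[w] over the vertices w of H."""
    deg = {v: len(nbrs) for v, nbrs in g_adj.items()}
    total = 0.0
    for image in product(list(g_adj), repeat=len(alpha)):
        if deg[image[root]] < delta:
            continue  # the root must land on a vertex of degree >= delta
        if not all(image[b] in g_adj[image[a]] for a, b in h_edges):
            continue  # not a homomorphism
        weight = 1.0
        for w, a in enumerate(alpha):
            weight *= deg[image[w]] ** a
        total += weight
    return total


def star_bound(h_size, alpha, delta, g_adj):
    """Assumed right-hand side: sum of d_v^(h-1+sum(alpha)) over d_v >= delta."""
    exponent = h_size - 1 + sum(alpha)
    return sum(float(len(n)) ** exponent
               for n in g_adj.values() if len(n) >= delta)


# Toy graph: a triangle 0-1-2 with a pendant vertex 3 attached to 0.
G = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
P3_EDGES = [(0, 1), (1, 2)]  # path on 3 vertices, rooted at the end vertex 0

plain = hom_weighted(P3_EDGES, (0, 0, 0), 0, 3, G)     # Theorem 2 setting
weighted = hom_weighted(P3_EDGES, (0, 0, 1), 0, 3, G)  # weight 1 on the far end
print(plain, star_bound(3, (0, 0, 0), 3, G))
print(weighted, star_bound(3, (0, 0, 1), 3, G))
```

With $\Delta = 3$ only vertex 0 qualifies as the image of the root, so the unweighted count is 5 against a bound of $3^2 = 9$, and the weighted count is 13 against a bound of $3^3 = 27$.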

Estimating subgraph counts
Let now $H$ be a fixed connected graph on $h$ vertices and $G$ an arbitrary (large) graph on $n$ vertices. Let $\operatorname{Emb}(H, G)$ denote the set of embeddings (injective homomorphisms) $H \to G$; we will be interested in estimating the number $\operatorname{emb}(H, G) := |\operatorname{Emb}(H, G)|$ by sampling a rather small number of vertices of $G$ and exploring small neighbourhoods of them. A similar problem for sequences of graphs with a weak local limit has been studied in [8]; there a uniform integrability condition on the $(h-1)$th power of the random vertex degree was used. Uniform integrability of graph degrees or their powers is natural for sequences of graphs, and has been used both in theoretical work and in applications, see, e.g., [1, 2]. In our setting, we use instead the related condition (3.3) below. The general problem of estimating small subgraph counts in a given graph has been considered by many authors for a variety of applications, see, e.g., the survey paper [10].
To estimate $\operatorname{emb}(H, G)$, we may assume that $H$ is a rooted graph with a root $o$ ($o$ can be chosen arbitrarily). For a vertex $v \in G$, let $X(v) = X(H, G, v)$ denote the number of embeddings $\sigma \in \operatorname{Emb}(H, G)$ such that $\sigma(o) = v$. We may then estimate $\operatorname{emb}(H, G)$ from the numbers $X(v_i^*)$ for some randomly sampled vertices $v_i^*$ in $G$. However, since vertices of high degree in $G$ may give outliers with exceptionally high numbers of such embeddings, we use truncation in order to obtain our error bounds.
Choose an arbitrary rooted spanning tree $T$ of $H$ with the same root $o$. Say that a vertex $u \in T$ is internal if it has at least one child in the rooted tree $T$. Denote by $i_T$ the number of internal vertices in $T$. Choose also a positive integer $\Delta$. For a vertex $v \in G$, let $\tilde X(v) = \tilde X(H, G, v, T, \Delta)$ denote the number of embeddings $\sigma \in \operatorname{Emb}(H, G)$ such that $\sigma(o) = v$ and $d_{\sigma(u)} < \Delta$ for all internal vertices $u \in T$.
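To make the two counts concrete, the following brute-force sketch computes $X(v)$ and the truncated count $\tilde X(v)$ for every vertex of a toy graph. The function name `rooted_counts`, the convention that the root of $H$ is vertex 0, and the encoding of the spanning tree $T$ as a child-to-parent dictionary are our own assumptions. The enumeration over all injective maps is only feasible for very small graphs.

```python
from itertools import permutations


def rooted_counts(h_edges, h_size, tree_parent, g_adj, delta):
    """Return (X, X_trunc): for each vertex v of G, the number of embeddings
    of H with the root (vertex 0 of H) mapped to v, and the number of those
    in which every internal vertex of T is mapped to a vertex of degree < delta."""
    internal = set(tree_parent.values())  # vertices of T with at least one child
    deg = {v: len(nbrs) for v, nbrs in g_adj.items()}
    x = {v: 0 for v in g_adj}
    x_trunc = {v: 0 for v in g_adj}
    for image in permutations(list(g_adj), h_size):  # injective maps H -> G
        if all(image[b] in g_adj[image[a]] for a, b in h_edges):
            x[image[0]] += 1
            if all(deg[image[u]] < delta for u in internal):
                x_trunc[image[0]] += 1
    return x, x_trunc


# Toy graph: a triangle 0-1-2 with a pendant vertex 3 attached to 0.
G = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
# H = K_3 rooted at 0; spanning tree T is the path 0-1-2 (child -> parent).
K3_EDGES = [(0, 1), (1, 2), (0, 2)]
T_PARENT = {1: 0, 2: 1}

X, X_trunc = rooted_counts(K3_EDGES, 3, T_PARENT, G, delta=3)
print(X)        # untruncated counts; they sum to emb(K_3, G)
print(X_trunc)  # embeddings touching the degree-3 vertex at an internal position are dropped
```

Here the internal vertices of $T$ are 0 and 1, so with $\Delta = 3$ an embedding is discarded whenever its first or second vertex lands on the vertex of degree 3.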
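Let $N \ge 1$ and let $v_1^*, \dots, v_N^*$ be drawn from $V(G)$ independently and uniformly at random. Consider the following estimate for $n^{-1} \operatorname{emb}(H, G)$:
$$\hat X_N := \frac{1}{N} \sum_{i=1}^{N} \tilde X(v_i^*).$$
Each summand satisfies
$$0 \le \tilde X(v_i^*) \le (\Delta - 1)^{h-1}. \tag{3.2}$$
(Here and below we define $(\Delta - 1)^0 := 1$ in the special case $\Delta = 1$.) The mean $E \tilde X(v_1^*)$ can be estimated by $\hat X_N$, with an error that can be bounded using, for example, Hoeffding's bound, which we do to prove the following result.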
Theorem 4. Let $H$ be a connected graph on $h \ge 1$ vertices with a rooted spanning tree $T$, let $G$ be a graph on $n \ge 1$ vertices, and let $D$ be the degree of a uniformly random vertex in $G$. Suppose a positive integer $\Delta$ and a non-negative $\lambda$ satisfy
$$E\bigl[D^{h-1} \mathbf{1}_{D \ge \Delta}\bigr] \le \lambda. \tag{3.3}$$
Let $s > 0$ and $p \in (0, 1]$. If [...] (3.4) then [...]
Proof. We have
$$\operatorname{emb}(H, G) = \sum_{v \in G} \tilde X(v) + \Bigl|\bigcup_{u} \{\sigma \in \operatorname{Emb}(H, G) : d_{\sigma(u)} \ge \Delta\}\Bigr|, \tag{3.6}$$
where the union is over the internal vertices of $T$.
the electronic journal of combinatorics 30(2) (2023), #P2.24
The first term on the right is equal to $n E \tilde X(v_1^*)$. By the union bound and Theorem 2 applied to each tree that can be obtained from $T$ by rerooting at an internal vertex, the second term is at most $i_T n E[D^{h-1} \mathbf{1}_{D \ge \Delta}] \le i_T \lambda n$. Therefore
$$0 \le n^{-1} \operatorname{emb}(H, G) - E \tilde X(v_1^*) \le i_T \lambda. \tag{3.7}$$
Hoeffding's classical inequality says that for a sum of independent random variables $X_1, \dots, X_N$ with values in $[0, 1]$ and $\mu = N^{-1} E(X_1 + \dots + X_N)$ we have, for every $t > 0$,
$$P\bigl(|N^{-1}(X_1 + \dots + X_N) - \mu| \ge t\bigr) \le 2 e^{-2 N t^2}.$$
The claim follows by applying this to the random variables $X_i := \tilde X(v_i^*) / (\Delta - 1)^{h-1}$ and using (3.7) (or using $\tilde X(v_i^*) = 0$ if $\Delta = 1$).
In particular, we obtain by choosing $s = \varepsilon E D^{h-1}$ in Theorem 4 the following corollary. (Choosing $s$ in this way makes sense when $n^{-1} \operatorname{emb}(H, G)$ is of the same order as its upper bound $n^{-1} \hom(K_{1,h-1}, G) = E D^{h-1}$, which often may be reasonable to expect in practice.)
Corollary 5. With notation as above, assume that (3.3) holds. If $\varepsilon > 0$, $p \in (0, 1]$, and [...]
If we are able to draw edges $K_2$, "wedges" $P_3$ or other small subgraphs in $G$ uniformly at random, we can estimate certain small subgraph densities with a much smaller sample size using the following generalization of Theorem 4. Other authors have used other methods to obtain some practical results in estimating densities of graphs $H$ on $h \le 5$ vertices using algorithms which sample random paths on up to 5 vertices as their first step, see, e.g., [10, Section 4.3].
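The sampling scheme behind Theorem 4 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The names `truncated_count`, `hoeffding_sample_size` and `estimate_density` are hypothetical, and the sample size is the one obtained from the standard two-sided Hoeffding bound for variables with values in $[0, (\Delta - 1)^{h-1}]$, namely $N \ge (\Delta - 1)^{2(h-1)} \ln(2/p) / (2 s^2)$. The constants in condition (3.4) may differ from this.

```python
import math
import random
from itertools import permutations


def truncated_count(v, h_edges, h_size, internal, g_adj, delta):
    """Brute-force truncated count: embeddings of H with root 0 mapped to v
    and every internal tree vertex mapped to a vertex of degree < delta."""
    deg = {u: len(nbrs) for u, nbrs in g_adj.items()}
    others = [u for u in g_adj if u != v]
    count = 0
    for rest in permutations(others, h_size - 1):
        image = (v,) + rest
        if any(deg[image[u]] >= delta for u in internal):
            continue
        if all(image[b] in g_adj[image[a]] for a, b in h_edges):
            count += 1
    return count


def hoeffding_sample_size(delta, h_size, s, p):
    """Assumed sample size: two-sided Hoeffding with summands bounded by
    (delta - 1)^(h - 1), additive error s and failure probability p."""
    value_range = (delta - 1) ** (h_size - 1)
    return max(1, math.ceil(value_range ** 2 * math.log(2 / p) / (2 * s ** 2)))


def estimate_density(h_edges, h_size, internal, g_adj, delta, n_samples, rng):
    """Sample mean of the truncated counts at uniformly random vertices."""
    vertices = list(g_adj)
    total = 0
    for _ in range(n_samples):
        v = rng.choice(vertices)  # independent uniform draws with replacement
        total += truncated_count(v, h_edges, h_size, internal, g_adj, delta)
    return total / n_samples


# Toy example: G = K_4, H = K_3 rooted at 0, spanning tree 0-1-2 (internal: 0, 1).
K4 = {v: {u for u in range(4) if u != v} for v in range(4)}
K3_EDGES = [(0, 1), (1, 2), (0, 2)]
n_needed = hoeffding_sample_size(delta=3, h_size=3, s=1.0, p=0.05)
estimate = estimate_density(K3_EDGES, 3, {0, 1}, K4, delta=4,
                            n_samples=20, rng=random.Random(0))
print(n_needed, estimate)
```

In a regular graph every vertex has the same truncated count, so on $K_4$ with $\Delta = 4$ the estimate equals $n^{-1} \operatorname{emb}(K_3, K_4) = 24/4 = 6$ for any sample. In practice the brute-force count would be replaced by a local exploration of the neighbourhood of the sampled vertex.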
To state the generalization, let now $O$ be a non-empty subgraph of $H$. Let, as above, [...] Similarly as above we have:
Theorem 6. With notation and assumptions as above and in Theorem 4, including (3.3), assume also $\operatorname{emb}(O, G) \ge 1$ and, instead of (3.4), [...]
Proof. Using the same argument as for (3.7) we get by Theorem 2: [...] (3.14) To finish the proof, we again apply Hoeffding's inequality.
Finally, Theorem 2 allows us to generalize (the difficult part of) Theorem 2.1 of [8], with a simpler proof that does not require the local weak convergence assumption.
Theorem 7. Let $H$ be a fixed connected graph on $h$ vertices, and pick an arbitrary vertex $o \in H$ as its root. Let $(G_n)_{n \ge 1}$ be a sequence of graphs. Let $V_n$ be a uniformly random vertex from $V(G_n)$ and let $D_n := d_{V_n}$ be its degree in $G_n$.
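If $X(H, G_n, V_n)$ converges in distribution to a random variable $X^*$ as $n \to \infty$, and $D_n^{h-1}$ is uniformly integrable, then
$$\frac{\operatorname{emb}(H, G_n)}{|V(G_n)|} \to E X^*.$$
Proof. Write $X_n := X(H, G_n, V_n)$. Clearly
$$E X_n = \frac{\operatorname{emb}(H, G_n)}{|V(G_n)|}. \tag{3.16}$$
Hence we need to prove that $E X_n \to E X^*$. Since $X_n \xrightarrow{d} X^*$, we have $E X^* \le \liminf_{n \to \infty} E X_n$ by Fatou's lemma.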
To show the opposite inequality, fix a rooted spanning tree $T$ of $H$ and a positive integer $\Delta$, and write $\tilde X_n := \tilde X(H, G_n, V_n, T, \Delta)$. By (3.16) and (3.6) (with $G_n$ instead of $G$), Theorem 2 implies, as above for (3.7), [...] Furthermore, by (3.2), writing $x \wedge y := \min(x, y)$, [...] Since $X_n$ converges in distribution to $X^*$, we have [...] This holds for every $\Delta > 0$, and $\lim_{\Delta \to \infty} \varepsilon_\Delta = 0$ by the uniform integrability assumption; thus $\limsup_{n \to \infty} E X_n \le E X^*$, which completes the proof.
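The chain of inequalities behind the upper bound can be sketched as follows. This is our reconstruction of the argument from the surrounding text, not the authors' exact displays. In particular, the definition of $\varepsilon_\Delta$ as a supremum over $n$ is an assumption, chosen so that uniform integrability of $D_n^{h-1}$ gives $\varepsilon_\Delta \to 0$.

```latex
% Assumed definition of the truncation error:
%   \varepsilon_\Delta := \sup_n E[ D_n^{h-1} 1_{D_n \ge \Delta} ].
\begin{align*}
E X_n &\le E \tilde X_n + i_T\, \varepsilon_\Delta
  && \text{(as for (3.7), with $G_n$ instead of $G$)} \\
\tilde X_n &\le X_n \wedge (\Delta - 1)^{h-1}
  && \text{(since $\tilde X_n \le X_n$ and by (3.2))} \\
E\bigl[X_n \wedge (\Delta - 1)^{h-1}\bigr]
  &\to E\bigl[X^* \wedge (\Delta - 1)^{h-1}\bigr] \le E X^*
  && \text{($x \mapsto x \wedge c$ is bounded and continuous on $[0,\infty)$)} \\
\limsup_{n \to \infty} E X_n &\le E X^* + i_T\, \varepsilon_\Delta .
\end{align*}
```

Letting $\Delta \to \infty$ in the last line removes the error term, which matches the final step of the proof.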

A note on real-world experiments
We tested Corollary 5 with some real-world degree distributions. For our first experiment, we considered survey data on self-reported human contact count distributions during a single day in the USA in four COVID-19 pandemic waves [5]. For our second experiment, we considered degree distributions of more than 500 empirical networks of various types and sizes made available as part of the supporting code of [3].
Although the lower bounds in our first experiment seem to be interesting for further exploration, we believe that the survey sizes in [5] (several thousand respondents in each of the COVID-19 waves) were too small to determine whether there exists a practically useful choice of $\lambda$ and $\Delta$ in (3.3), even for $h = 3$. In the second experiment, for 75% of the degree distributions we obtained a lower bound for $N$ exceeding the size of the underlying network.
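A calculation of this kind can be sketched as follows for a given degree sequence. The code is a hedged illustration, not the code from the experiments. The function names are hypothetical, the degree sequence is synthetic, and the sample-size formula is the one implied by the standard two-sided Hoeffding bound with $s = \varepsilon E D^{h-1}$, which may differ in its constants from the bound stated in Corollary 5.

```python
import math


def tail_moment(degrees, h_size, delta):
    """E[D^(h-1) 1{D >= delta}] for the degree D of a uniformly random vertex."""
    return sum(d ** (h_size - 1) for d in degrees if d >= delta) / len(degrees)


def smallest_delta(degrees, h_size, lam):
    """Smallest positive integer delta whose tail moment is at most lam.
    delta = max degree + 1 always works, because the tail is then empty."""
    for delta in range(1, max(degrees) + 2):
        if tail_moment(degrees, h_size, delta) <= lam:
            return delta


def sample_size_bound(degrees, h_size, delta, eps, p):
    """Assumed bound: N >= (delta-1)^(2(h-1)) ln(2/p) / (2 s^2) with
    s = eps * E[D^(h-1)]. Requires a graph with at least one edge."""
    moment = tail_moment(degrees, h_size, 0)  # E[D^(h-1)]
    s = eps * moment
    value_range = (delta - 1) ** (h_size - 1)
    return max(1, math.ceil(value_range ** 2 * math.log(2 / p) / (2 * s ** 2)))


# Synthetic degree sequence: 90 vertices of degree 1 and 10 hubs of degree 10.
degrees = [1] * 90 + [10] * 10
delta = smallest_delta(degrees, h_size=3, lam=1.0)
n_bound = sample_size_bound(degrees, h_size=3, delta=delta, eps=0.5, p=0.05)
print(delta, n_bound, n_bound > len(degrees))
```

In this synthetic example the hubs force $\Delta = 11$, and the resulting bound on $N$ is larger than the number of vertices. This is the same qualitative outcome as the one reported for most of the empirical networks.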
We believe that in the first case establishing a better understanding of the degree tails, simply collecting more data, or applying the adaptive mean estimation methods [4, 9] might help. In the second case, since the full network data is available, the methods mentioned in [10] and Theorem 6 seem to be more suitable.
The code and the results of our experiments are available at https://github.com/valentas-kurauskas/subgraph-counts-hoeffding.

Concluding remarks
We extended Sidorenko's inequality and used it to derive bounds on the sample size needed to estimate the number of small subgraphs in a large graph using only the weak assumption (3.3).
Like Hoeffding's bound, our estimate works for worst-case graphs; therefore the lower bound from Theorem 4 may be too pessimistic for specific real-world graphs. Nevertheless, it would be interesting to understand better whether our results and assumptions of type (3.3) can be useful in practice.
It would also be interesting to find an interpretation or applications of the general case of our inequality (2.5).