Hidden Words Statistics for Large Patterns

We study here the so-called subsequence pattern matching, also known as hidden pattern matching, in which one searches for a given pattern $w$ of length $m$ as a subsequence in a random text of length $n$. The quantity of interest is the number of occurrences of $w$ as a subsequence (i.e., occurring in not necessarily consecutive text locations). This problem finds many applications, from intrusion detection to trace reconstruction, the deletion channel, and DNA-based storage systems. In all of these applications the pattern $w$ is of variable length. To the best of our knowledge, this problem was only tackled for patterns of fixed length $m=O(1)$ [Flajolet, Szpankowski and Vall\'ee, 2006]. In our main result we prove that for $m=o(n^{1/3})$ the number of subsequence occurrences is normally distributed. In addition, we show that under some constraints on the structure of $w$ the asymptotic normality can be extended to $m=o(\sqrt{n})$. For a special pattern $w$ consisting of $m$ copies of the same symbol, we show that for $m=o(n)$ the distribution of the number of subsequence occurrences is either asymptotically normal or asymptotically log-normal. We conjecture that this dichotomy is true for all patterns. We use Hoeffding's projection method for $U$-statistics to prove our findings.


Introduction and Motivation
One of the most interesting and least studied problems in pattern matching is known as the subsequence string matching or the hidden pattern matching [12]. In this case, we search for a pattern $w = w_1 w_2 \cdots w_m$ of length $m$ in the text $\Xi_n = \xi_1 \ldots \xi_n$ of length $n$ as a subsequence; that is, we are looking for indices $1 \le i_1 < i_2 < \cdots < i_m \le n$ such that $\xi_{i_1} = w_1, \xi_{i_2} = w_2, \ldots, \xi_{i_m} = w_m$. We say that $w$ is hidden in the text $\Xi_n$. We do not put any constraints on the gaps $i_{j+1} - i_j$, so in the language of [8] this is known as the unconstrained hidden pattern matching. The most interesting quantity for such a problem is the number of subsequence occurrences in a text generated by a random source. In this paper, we study the limiting distribution of this quantity when $m$, the length of the pattern, grows with $n$.
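For concreteness, the number of subsequence occurrences can be computed in $O(nm)$ time and $O(m)$ space by a standard dynamic program over prefixes of $w$; a minimal sketch (the function name is ours):

```python
def count_subsequence_occurrences(text, w):
    """Number of index tuples i_1 < ... < i_m with text[i_j] == w[j],
    i.e. occurrences of w as a (not necessarily contiguous) subsequence."""
    m = len(w)
    # dp[j] = number of occurrences of the prefix w[:j] in the text read so far
    dp = [0] * (m + 1)
    dp[0] = 1  # the empty prefix occurs exactly once
    for ch in text:
        # scan j downwards so the current position extends each prefix at most once
        for j in range(m, 0, -1):
            if w[j - 1] == ch:
                dp[j] += dp[j - 1]
    return dp[m]
```

For example, `count_subsequence_occurrences("abab", "ab")` returns 3, corresponding to the index pairs $(1,2)$, $(1,4)$ and $(3,4)$.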
Hereafter, we assume that a memoryless source generates the text $\Xi$; that is, all symbols are generated independently, with probability $p_a$ for symbol $a \in A$, where the alphabet $A$ is assumed to be finite. We denote by $p_w = \prod_j p_{w_j}$ the probability of the pattern $w$. Our goal is to understand the probabilistic behavior, in particular the limiting distribution, of the number of subsequence occurrences, which we denote by $Z := Z_\Xi(w)$. It is known that the behavior of $Z$ depends on the order of magnitude of the pattern length $m$. For example, for exact pattern matching (i.e., when the pattern $w$ must occur as a string in consecutive positions of the text), the limiting distribution is normal for $m = O(1)$ (more precisely, as long as $np_w \to \infty$, hence up to $m = O(\log n)$), but it becomes a Pólya-Aeppli distribution when $np_w \to \lambda > 0$ for some constant $\lambda$, and finally (conditioned on being non-zero) it turns into a geometric distribution when $np_w \to 0$ [12] (see also [2]). We might expect a similar behaviour for subsequence pattern matching. In [8] it was proved by analytic combinatorial methods that the number of subsequence occurrences, $Z_\Xi(w)$, is asymptotically normal when $m = O(1)$, and not much is known beyond this regime (see also [3]). Asymptotic normality for fixed $m$ follows also from general results for $U$-statistics [10]. However, in many applications, as discussed below, we need to consider patterns $w$ whose lengths grow with $n$.
In this paper, we prove two main results. In Theorem 6 we establish that for $m = o(n^{1/3})$ the number of subsequence occurrences is asymptotically normal. Furthermore, in Theorem 7 we show that under some constraints on the structure of $w$ the asymptotic normality can be extended to $m = o(\sqrt{n})$. Moreover, for the special pattern $w = a^m$, consisting of a single symbol repeated, we show in Theorem 4 that for $m = o(\sqrt{n})$ the distribution of the number of occurrences is asymptotically normal, while for larger $m$ (up to $cn$ for some $c > 0$) it is asymptotically log-normal. We study more special patterns in Section 4 and conjecture that this dichotomy holds for a large class of patterns. Finally, for a typical random $w$ we establish in Corollary 20 that $Z$ is asymptotically normal for $m = o(n^{2/5})$.
Regarding methodology, unlike [8] we use probabilistic tools. We first observe that $Z$ can be represented as a $U$-statistic (see (2.3) and Section 3.2). This suggests applying the Hoeffding projection method [10] to prove asymptotic normality of $Z$ for some large patterns. Indeed, we decompose $Z$ into a sum of orthogonal random variables whose variances are of decreasing order in $n$ (for $m$ not too large), and show that the variable with the largest variance converges to a normal distribution, proving our main results, Theorems 6 and 7.
The hidden pattern matching problem, especially for large patterns, finds many applications, from intrusion detection to trace reconstruction, the deletion channel, and DNA-based storage systems [1,4,5,6,12,17]. We discuss two of them below in some detail, namely the deletion channel and the trace reconstruction problem. A deletion channel [5,6,7,14,17,20] with parameter $d$ takes a binary sequence $\Xi_n = \xi_1 \cdots \xi_n$, where $\xi_i \in A$, as input and deletes each symbol in the sequence independently with probability $d$. The output of such a channel is then a subsequence $\zeta = \zeta(\Xi_n) = \xi_{i_1} \ldots \xi_{i_M}$ of $\Xi_n$, where $M$ follows the binomial distribution $\mathrm{Binom}(n, 1-d)$, and the indices $i_1, \ldots, i_M$ correspond to the bits that are not deleted. Despite significant effort [6,14,15,17,20], the mutual information between the input and output of the deletion channel, and its capacity, are still unknown. However, it turns out that the mutual information $I(\Xi_n; \zeta(\Xi_n))$ can be expressed exactly in terms of subsequence pattern matching: in [5] it was proved that $I(\Xi_n; \zeta(\Xi_n))$ equals a sum, over all binary sequences $w$ of length smaller than $n$, of terms involving $Z_{\Xi_n}(w)$, the number of subsequence occurrences of $w$ in the text $\Xi_n$. As one can see, to find precise asymptotics of the mutual information we need to understand the probabilistic behavior of $Z$ for $m$ up to order $n$ and typical $w$. The trace reconstruction problem [4,11,16,18] is related to the deletion channel problem, since there we ask how many copies of the deletion channel output we need to see before we can reconstruct the input sequence with high probability.

The rest of the paper is structured as follows. Section 2 contains detailed definitions and some other preliminaries, followed first (Theorem 4) by detailed results for the simple example of a pattern $w = a^m$, which can be treated by elementary methods. Then we present our main results (Theorems 6 and 7). The proofs of the main results are given in Section 3.
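The deletion channel itself is straightforward to simulate, and the simulation makes the connection to subsequences visible: every channel output is, by construction, a subsequence of the input. A minimal sketch (function names are ours):

```python
import random

def deletion_channel(xs, d, rng=None):
    """Delete each symbol of xs independently with probability d.
    The output length M is Binomial(len(xs), 1 - d), and the output
    is always a subsequence of the input."""
    rng = rng or random.Random()
    return [x for x in xs if rng.random() >= d]

def is_subsequence(sub, xs):
    """Check that sub occurs as a subsequence of xs."""
    it = iter(xs)
    return all(any(s == x for x in it) for s in sub)
```

With `d = 0` the channel is the identity, and with `d = 1` it outputs the empty sequence.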
Section 4 discusses some special cases; in particular, we consider a random pattern w (Theorem 19 and Corollary 20). In the concluding section we comment on the sharpness of our results and conditions, and state a conjecture for a possible extension to larger m.

Main Results
In this section we formulate our problem precisely and present our main results. Proofs are deferred to the next section.

Problem formulation and notation
We consider a random string $\Xi_n = \xi_1 \ldots \xi_n$ of length $n$. We assume that $\xi_1, \xi_2, \ldots$ are i.i.d. random letters from a finite alphabet $A$; each letter $\xi_i$ has the distribution $\Pr(\xi_i = a) = p_a$, $a \in A$, for some given vector $p = (p_a)_{a \in A}$; we assume $p_a > 0$ for each $a$. We may also use $\xi$ for a random letter with this distribution. Let $w = w_1 \cdots w_m$ be a fixed string of length $m$ over the same alphabet $A$. We assume $n \ge m$. Let $p_w := \prod_{j=1}^m p_{w_j}$, which is the probability that $\xi_1 \cdots \xi_m$ equals $w$. Let $Z = Z_{n,w}(\xi_1 \cdots \xi_n)$ be the number of occurrences of $w$ as a subsequence of $\xi_1 \cdots \xi_n$. For a set $S$ (in our case $[n]$ or $[m]$) and $k \ge 0$, let $\binom{S}{k}$ be the collection of sets $\alpha \subseteq S$ with $|\alpha| = k$; thus $\bigl|\binom{S}{k}\bigr| = \binom{|S|}{k}$. For $k = 0$, $\binom{S}{0}$ contains just the empty set $\emptyset$. For $k = 1$, we identify $\binom{S}{1}$ and $S$ in the obvious way. We write $\alpha \in \binom{[n]}{k}$ as $\{\alpha_1, \ldots, \alpha_k\}$, where we assume that $\alpha_1 < \cdots < \alpha_k$. Then
$$Z = \sum_{\alpha \in \binom{[n]}{m}} I_\alpha, \qquad \text{where} \qquad I_\alpha := \prod_{j=1}^m \mathbf{1}\{\xi_{\alpha_j} = w_j\}. \qquad (2.3)$$
Remark 1. In the limit theorems, we study the asymptotic distribution of $Z$. We then assume that $n \to \infty$ and (usually) $m \to \infty$; we thus implicitly consider a sequence of words $w^{(n)}$ of lengths $m_n = |w^{(n)}|$. For simplicity we do not show this in the notation.
We have $\mathrm{E}\, I_\alpha = p_w$ for every $\alpha$. Hence $\mathrm{E}\, Z = \binom{n}{m} p_w$, and with $Z^* := Z/p_w$ we have $\mathrm{E}\, Z^* = \binom{n}{m}$. We also write $\|Y\|_p := \bigl(\mathrm{E}\,|Y|^p\bigr)^{1/p}$ for the $L^p$ norm of a random variable $Y$, while $\|x\|$ is the usual Euclidean norm of a vector $x$ in some $\mathbb{R}^m$. Moreover, $\overset{d}{\longrightarrow}$ and $\overset{p}{\longrightarrow}$ denote convergence in distribution and in probability, respectively. Finally, $C$ denotes constants that may be different at different occurrences; they may depend on the alphabet $A$ and $(p_a)_{a\in A}$, but not on $n$, $m$ or $w$.
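The identity $\mathrm{E}\, Z = \binom{n}{m} p_w$ (linearity of expectation over the $\binom{n}{m}$ index sets $\alpha$) can be verified exactly on tiny instances by enumerating all texts of a memoryless source; a sketch in exact rational arithmetic (helper names are ours):

```python
from fractions import Fraction
from itertools import combinations, product
from math import comb

def count_occ(text, w):
    # brute-force count of occurrences of w as a subsequence of text
    return sum(1 for idx in combinations(range(len(text)), len(w))
               if all(text[i] == c for i, c in zip(idx, w)))

def exact_mean_Z(n, w, p):
    """E Z for a memoryless source with letter probabilities p, by enumeration."""
    total = Fraction(0)
    for text in product(list(p), repeat=n):
        prob = Fraction(1)
        for ch in text:
            prob *= p[ch]
        total += prob * count_occ(text, w)
    return total

p = {"a": Fraction(2, 3), "b": Fraction(1, 3)}
p_w = p["a"] * p["b"]                                  # p_w for w = "ab"
assert exact_mean_Z(4, "ab", p) == comb(4, 2) * p_w    # E Z = C(n, m) p_w
```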
We are now ready to present our main results regarding the limiting distribution of $Z$, the number of occurrences of $w = w_1 \cdots w_m$ as a subsequence, when $m \to \infty$. We start with a simple example, namely $w = a^m = a \cdots a$ for some $a \in A$, and show that, depending on whether $m = o(\sqrt{n})$ or not, the number of subsequence occurrences asymptotically follows either the normal or the log-normal distribution.
Before we present our results we consider asymptotically normal and log-normal distributions in general, and discuss their relation.

Asymptotic normality and log-normality
If $X_n$ is a sequence of random variables and $a_n$ and $b_n$ are sequences of real numbers with $b_n > 0$, then we write $X_n \sim \mathrm{AsN}(a_n, b_n)$ if $(X_n - a_n)/b_n^{1/2} \overset{d}{\longrightarrow} N(0,1)$. We say that $X_n$ is asymptotically normal if $X_n \sim \mathrm{AsN}(a_n, b_n)$ for some $a_n$ and $b_n$, and asymptotically log-normal if $\ln X_n \sim \mathrm{AsN}(a_n, b_n)$ for some $a_n$ and $b_n$ (this assumes $X_n > 0$). Note that these notions are equivalent when the asymptotic variance $b_n$ is small, as made precise by the following lemma.
Lemma 2. If $b_n \to 0$, and $a_n$ are arbitrary, then
$$\ln X_n \sim \mathrm{AsN}(a_n, b_n) \iff X_n \sim \mathrm{AsN}\bigl(e^{a_n}, b_n e^{2a_n}\bigr). \qquad (2.11)$$
Proof. By replacing $X_n$ by $X_n/e^{a_n}$, we may assume that $a_n = 0$. If $\ln X_n \sim \mathrm{AsN}(0, b_n)$ with $b_n \to 0$, then $\ln X_n \overset{p}{\longrightarrow} 0$, and thus $X_n \overset{p}{\longrightarrow} 1$. It follows that $\ln X_n/(X_n - 1) \overset{p}{\longrightarrow} 1$ (with $0/0 := 1$), and thus
$$\frac{X_n - 1}{b_n^{1/2}} = \frac{X_n - 1}{\ln X_n} \cdot \frac{\ln X_n}{b_n^{1/2}} \overset{d}{\longrightarrow} N(0,1),$$
and thus $X_n \sim \mathrm{AsN}(1, b_n)$. The converse is proved by the same argument.
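Numerically, the equivalence in Lemma 2 reflects the fact that for small $b$ the exponential map barely distorts a $N(0,b)$ variable: on the scale $b^{1/2}$, the values $(e^{\sqrt{b}z} - 1)/\sqrt{b}$ and $z$ differ by roughly $\sqrt{b}\,z^2/2$, uniformly for bounded $z$. A quick deterministic check of this (an illustration, not part of the proof):

```python
import math

def max_standardized_gap(b, zs):
    """max over z of |(e^{sqrt(b) z} - 1)/sqrt(b) - z|: the gap between
    standardized log-normal and standardized normal quantile values."""
    s = math.sqrt(b)
    return max(abs((math.exp(s * z) - 1.0) / s - z) for z in zs)

zs = [k / 10.0 for k in range(-30, 31)]   # grid of z-values in [-3, 3]
# the gap is about sqrt(b) * z^2 / 2, so it vanishes as b -> 0 ...
assert max_standardized_gap(1e-4, zs) < 0.05
# ... but not when b stays bounded away from 0 (cf. Remark 3)
assert max_standardized_gap(1.0, zs) > 1.0
```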
Remark 3. Lemma 2 is best possible. Suppose that $\ln X_n \sim \mathrm{AsN}(a_n, b_n)$. If $b_n \to b > 0$, then $\ln\bigl(X_n/e^{a_n}\bigr) = \ln X_n - a_n \overset{d}{\longrightarrow} N(0, b)$, and thus $X_n/e^{a_n}$ converges in distribution to the log-normal distribution $e^{N(0,b)}$. In this case (and only in this case), $X_n$ thus converges in distribution, after scaling, to a log-normal distribution. If $b_n \to \infty$, then no linear scaling of $X_n$ can converge in distribution to a non-degenerate limit, as is easily seen.

A simple example
We first consider a simple example where the asymptotic distribution can be found easily by explicit calculations. Fix $a \in A$ and let $w = a^m = a \cdots a$, a string of $m$ identical letters. Then, if $N = N_a$ is the number of occurrences of $a$ in $\xi_1 \cdots \xi_n$, we have
$$Z = \binom{N}{m}. \qquad (2.14)$$
We will show that $Z$ is asymptotically normal if $m$ is small, and log-normal for larger $m$.
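The identity (2.14) is elementary: an occurrence of $a^m$ is precisely a choice of $m$ of the $N$ positions carrying the letter $a$, so $Z = \binom{N}{m}$. A quick brute-force check (helper names are ours):

```python
from itertools import combinations
from math import comb

def count_occ(text, w):
    # brute-force count of occurrences of w as a subsequence of text
    return sum(1 for idx in combinations(range(len(text)), len(w))
               if all(text[i] == c for i, c in zip(idx, w)))

text = "abbaabab"
N = text.count("a")              # N = number of occurrences of the letter a
for m in range(6):
    assert count_occ(text, "a" * m) == comb(N, m)   # Z = C(N, m)
```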
Note that $N \sim \mathrm{Binom}(n, p_a)$; let $Y := N - np_a$, and thus $\mathrm{E}\, Y = 0$ and $\operatorname{Var} Y = np_a(1-p_a)$. Then, by the central limit theorem,
$$n^{-1/2}\, Y \overset{d}{\longrightarrow} N\bigl(0,\, p_a(1-p_a)\bigr). \qquad (2.19)$$
By (2.14), we have
$$Z = \binom{N}{m} = \frac{\Gamma(N+1)}{\Gamma(m+1)\,\Gamma(N-m+1)}, \qquad (2.20)$$
where $\Gamma(x)$ is the Euler gamma function. We fix a sequence $\omega_n \to \infty$ such that $np_a - m \gg \omega_n \gg n^{1/2}$; this is possible by the assumption. Note that (2.19) implies that $Y/\omega_n \overset{p}{\longrightarrow} 0$, and thus $\Pr(|Y| \le \omega_n) \to 1$. We may thus in the sequel assume $|Y| \le \omega_n$. We assume also that $n$ is so large that $np_a - m - 2\omega_n > 0$.
Stirling's formula implies, by taking the logarithm of (2.20) and differentiating twice (in the complex half-plane $\operatorname{Re} z > \frac12$, say), a quadratic approximation of $\ln Z$ as a function of $Y$. Consequently, (2.20) yields, noting that the assumptions just made imply $|Y| \le \omega_n = o(np_a - m)$, an expansion of $\ln Z$ around its value at $Y = 0$; using also (2.19), we conclude that $\ln(Z_n/z_n)$, suitably normalized, converges to a normal distribution. Hence, $Z_n/z_n$ converges in distribution to a log-normal distribution, so $Z_n$ is asymptotically log-normal but not asymptotically normal. See also Remark 3.

General results
We now present our main results; first, however, we outline the road map of our approach. The representation (2.3) shows that $Z$ can be viewed as a $U$-statistic. For convenience, we consider $Z^*$ in (2.7), which differs from $Z$ by a constant factor only, and decompose it in (3.18) into a sum of orthogonal terms $V_\ell$; in Lemma 14 we prove that $V_1$, appropriately normalized, converges to the standard normal distribution. This will allow us to conclude the asymptotic normality of $Z$.
Here, we only consider the region $m = o(n^{1/2})$. First, for $m = o(n^{1/3})$, we show that the number of subsequence occurrences is always asymptotically normal.

Theorem 6. Suppose that $m = o(n^{1/3})$. Then we have the asymptotic normality
$$\frac{Z - \binom{n}{m} p_w}{p_w \sigma_1} \overset{d}{\longrightarrow} N(0,1),$$
where $\sigma_1^2$ is given by (2.29).
Furthermore, $\mathrm{E}\, Z = \binom{n}{m} p_w$ and $\operatorname{Var} Z \sim p_w^2 \sigma_1^2$. In the second main result, we restrict attention to patterns $w$ that are not typical for the random text; in exchange, we allow $m = o(n^{1/2})$.

Theorem 7. Let $q = (q_a)_{a \in A}$ be the vector of proportions of the letters in $w$, i.e., $q_a := \frac{1}{m}\,\#\{j \le m : w_j = a\}$. Suppose that $\liminf_{n \to \infty} \|q - p\| > 0$. If further $m = o(n^{1/2})$, then we have the asymptotic normality
$$\frac{Z - \binom{n}{m} p_w}{p_w \sigma_1} \overset{d}{\longrightarrow} N(0,1),$$
where $\sigma_1^2$ is given by (2.29). Furthermore, $\mathrm{E}\, Z = \binom{n}{m} p_w$ and $\operatorname{Var} Z \sim p_w^2 \sigma_1^2$.

We prove both theorems in Section 3.5, after some preliminary results presented in the next section.

Analysis and Proofs
In this section we will prove our main results. We start with some preliminaries.

Preliminaries and more notation
Thus, letting $\xi$ be any random variable with the distribution of the $\xi_i$, we define $\varphi_a(x) := \mathbf{1}\{x = a\}/p_a - 1$ for $a, x \in A$, so that $\mathrm{E}\, \varphi_a(\xi) = 0$. Let $p_* := \min_a p_a$ and let $B := 1/p_* - 1$, so that $|\varphi_a(x)| \le B$ for all $a, x \in A$.

A decomposition
The representation (2.3) shows that $Z$ is a special case of a $U$-statistic. (Recall that, in general, a $U$-statistic is a sum over subsets $\alpha$ as in (2.3) of $f(\xi_{\alpha_1}, \ldots, \xi_{\alpha_k})$ for some function $f$.) For fixed $m$, the general theory of [10] applies and yields asymptotic normality. (Cf. [13, Section 4] for a related problem.) For $m \to \infty$ (our main interest), we can still use the orthogonal decomposition of [10], which in our case takes the following form. By the definitions in Section 2.1 and (3.1),
$$\frac{I_\alpha}{p_w} = \prod_{j=1}^m \bigl(1 + \varphi_{w_j}(\xi_{\alpha_j})\bigr).$$
By multiplying out this product, we obtain
$$\frac{I_\alpha}{p_w} = \sum_{\gamma \subseteq [m]} \prod_{k=1}^{|\gamma|} \varphi_{w_{\gamma_k}}(\xi_{\alpha_{\gamma_k}}). \qquad (3.14)$$
Consequently, combining the terms in (3.12) with the same $\alpha_\gamma$, we obtain the orthogonal decomposition $Z^* = \sum_{\ell=0}^m V_\ell$ of (3.18), with $V_\ell = \sum_{\beta \in \binom{[n]}{\ell}} V_{\ell,\beta}$ as in (3.17); the coefficients $c(\beta, \gamma)$ are given by the combinatorial definition before (3.14).
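The algebra behind this expansion can be checked mechanically. Writing $\mathbf{1}\{x=a\} = p_a\bigl(1+\varphi_a(x)\bigr)$ with $\varphi_a(x) = \mathbf{1}\{x=a\}/p_a - 1$ (our reading of (3.1); it is consistent with $\varphi_a(x) = ax$ in the symmetric binary case of Section 4.1), the product form of $I_\alpha/p_w$ and its expansion over subsets $\gamma \subseteq [m]$ agree exactly. A sketch in exact arithmetic:

```python
from fractions import Fraction
from itertools import combinations
from math import prod

p = {"a": Fraction(2, 3), "b": Fraction(1, 3)}

def phi(a, x):
    # phi_a(x) = 1{x=a}/p_a - 1, so that 1{x=a} = p_a (1 + phi_a(x))
    return (Fraction(1) if x == a else Fraction(0)) / p[a] - 1

text, w = "abba", "ab"
m = len(w)
p_w = prod((p[c] for c in w), start=Fraction(1))
for alpha in combinations(range(len(text)), m):
    I = 1 if all(text[i] == c for i, c in zip(alpha, w)) else 0
    # product form: I_alpha / p_w = prod_j (1 + phi_{w_j}(xi_{alpha_j}))
    lhs = prod((1 + phi(w[j], text[alpha[j]]) for j in range(m)), start=Fraction(1))
    assert lhs == Fraction(I) / p_w
    # the same product multiplied out over all subsets gamma of [m]
    rhs = sum(prod((phi(w[j], text[alpha[j]]) for j in gamma), start=Fraction(1))
              for r in range(m + 1) for gamma in combinations(range(m), r))
    assert rhs == lhs
```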

The projection method
We use the projection method of [10] to prove asymptotic normality for $U$-statistics. Translated to the present setting, the idea of the projection method is to approximate $Z^* - \mathrm{E}\, Z^* = Z^* - V_0$ by $V_1$, thus ignoring all terms with $\ell \ge 2$ in the sum in (3.18). In order to do this, we estimate variances.
First, by (3.4) and the independence of the $\xi_i$, the summands $V_{\ell,\beta}$ are orthogonal, and their second moments are given by (3.25). This leads to the following estimates.
Lemma 9. For $1 \le \ell \le m$,

Proof. The definition of $V_\ell$ in (3.17) and (3.25) yield, since the summands $V_{\ell,\beta}$ are orthogonal, the asserted bound, as needed.
Note that, for $1 \le \ell < m$, (3.29) holds.

Proof. By (3.28) and the assumption, for $1 \le \ell < m$, the terms decay geometrically, and thus, summing a geometric series, the claim follows.

Remark 11. For later use, we define also
$$\pi(i,j) := \frac{\binom{i-1}{j-1}\binom{n-i}{m-j}}{\binom{n-1}{m-1}}, \qquad 1 \le i \le n,\ 1 \le j \le m.$$
Then, for fixed $i$, $(\pi(i,j))_j$ is a (shifted) hypergeometric distribution: it is the distribution of $X + 1$ with $X \sim \mathrm{HGe}(n-1, m-1, i-1)$, whose mean and variance, using (3.32), are given by the standard hypergeometric formulas. Note that $V_{1,i}$ is a function of $\xi_i$, and thus the random variables $V_{1,i}$ are independent. Furthermore, (3.2) implies $\mathrm{E}\, V_{1,i} = 0$. Let $\tau_i^2 := \mathrm{E}\, V_{1,i}^2$ and $\sigma_1^2 := \operatorname{Var} V_1 = \sum_{i=1}^n \tau_i^2$; see (3.20). Observe that it follows from (3.37) and (3.1) that (3.40) holds.
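The shifted hypergeometric distribution here can be made concrete. Our reading is $\pi(i,j) = \Pr(X = j-1)$ for $X \sim \mathrm{HGe}(n-1, m-1, i-1)$, with the explicit binomial-coefficient pmf below; each row then sums to 1 (by Vandermonde's identity) and has the standard hypergeometric mean. A sanity check of this reading, in exact arithmetic:

```python
from fractions import Fraction
from math import comb

def pi(n, m, i, j):
    """pi(i, j) = P(X = j - 1), X ~ HGe(n-1, m-1, i-1): of the m-1 remaining
    pattern positions placed among the n-1 remaining text positions,
    j-1 fall strictly below position i."""
    return Fraction(comb(i - 1, j - 1) * comb(n - i, m - j), comb(n - 1, m - 1))

n, m = 12, 5
for i in range(1, n + 1):
    row = [pi(n, m, i, j) for j in range(1, m + 1)]
    assert sum(row) == 1                       # (pi(i, j))_j is a distribution
    mean = sum(j * row[j - 1] for j in range(1, m + 1))
    # E(X + 1) = 1 + (m - 1)(i - 1)/(n - 1), the standard hypergeometric mean
    assert mean == 1 + Fraction((m - 1) * (i - 1), n - 1)
```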
We will see in Lemma 16 that the bound (3.42) is sharp within a constant factor much more generally.

Lemma 14. Suppose that $m = o(n)$. Then $V_1$ is asymptotically normal:
$$\frac{V_1}{\sqrt{\operatorname{Var} V_1}} \overset{d}{\longrightarrow} N(0,1). \qquad (3.56)$$
Proof. We show that the central limit theorem applies to the sum $V_1 = \sum_i V_{1,i}$ in (3.36). The terms $V_{1,i}$ are independent and have means $\mathrm{E}\, V_{1,i} = 0$. We verify Lyapunov's condition.
The random variable $\xi$ is defined on some probability space $(\Omega, \mathcal{F}, \Pr)$ and takes values in the finite set $A$. Thus the linear space $\mathcal{V}$ of functions $\Omega \to \mathbb{R}$ of the form $f(\xi)$ has finite dimension at most $|A|$. Moreover, every function in $\mathcal{V}$ is bounded. The $L^2$ and $L^3$ norms $\|\cdot\|_2$ and $\|\cdot\|_3$ are thus finite on $\mathcal{V}$, and are thus both norms on the finite-dimensional vector space $\mathcal{V}$; hence there exists a constant $C$ such that, for any function $f$,
$$\|f(\xi)\|_3 \le C\, \|f(\xi)\|_2. \qquad (3.57)$$
In particular, since the definition (3.37) shows that $V_{1,i}$ is a function of $\xi_i \overset{d}{=} \xi$, we have $\|V_{1,i}\|_3 \le C \|V_{1,i}\|_2$. Consequently, using (3.58), (3.39) and (3.59), the Lyapunov ratio $\sum_i \mathrm{E}\,|V_{1,i}|^3 / (\operatorname{Var} V_1)^{3/2}$ tends to 0. This verifies the Lyapunov condition, and thus a standard form of the central limit theorem, [9, Theorem 7.2.4 or 7.6.2], yields (3.56).
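On a finite alphabet, the constant in (3.57) can even be made explicit: $\|f(\xi)\|_3^3 = \sum_a p_a |f(a)|^3 \le \max_a |f(a)| \cdot \|f(\xi)\|_2^2$ and $\max_a |f(a)| \le p_*^{-1/2} \|f(\xi)\|_2$, so $C = p_*^{-1/6}$ works, where $p_* = \min_a p_a$. A numeric check of this explicit constant (our choice; any such $C$ suffices for Lyapunov's condition):

```python
import random

p = {"a": 0.5, "b": 0.3, "c": 0.2}
p_star = min(p.values())
C = p_star ** (-1.0 / 6.0)       # explicit constant for ||f||_3 <= C ||f||_2

def lp_norm(f, q):
    # L^q norm of f(xi) when xi has distribution p
    return sum(pa * abs(f[a]) ** q for a, pa in p.items()) ** (1.0 / q)

rng = random.Random(0)
for _ in range(1000):
    f = {a: rng.uniform(-10.0, 10.0) for a in p}
    assert lp_norm(f, 3) <= C * lp_norm(f, 2) + 1e-12
```

The bound is attained (up to floating-point error) by a function supported on the rarest letter, so the constant $p_*^{-1/6}$ is sharp for this argument.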

Proofs of Theorems 6 and 7
We next prove a general theorem showing asymptotic normality under some conditions.
Now we are ready to prove our main results.
Proof of Theorem 6. By Lemma 13, the required variance estimate holds, and Theorem 15 applies.

Recall that in Theorem 7 the range of $m$ is improved, assuming that $w$ is not typical for the random source with probabilities $p = (p_a)_{a \in A}$ that we consider.

Proof of Theorem 7. Let $q$ be as in the statement of the theorem. Then, by the Cauchy-Schwarz inequality, we obtain (3.72). Furthermore, by (3.70) and (3.21), the conclusion follows.

Some Special Cases
In this section we consider two interesting special cases: in the first we assume that the pattern $w$ is alternating, and in the second we consider a random $w$.

Alternating w
As an extreme example, we consider an alternating $w$, that is, $w = 010101\ldots$ for $A = \{0,1\}$. We prove that this case matches the general lower bound (3.45) in Lemma 13.

Proof. It is slightly more convenient to let $A = \{\pm 1\}$; thus we consider $w = w_1 \cdots w_m$ with $w_j = (-1)^j$ (say), in the unbiased case $p_1 = p_{-1} = \frac12$. Then, by (3.1), for $x \in A$, $\varphi_a(x) = \mathbf{1}\{x = a\}/p_a - 1 = 2 \cdot \mathbf{1}\{x = a\} - 1$, and thus, for $a, x \in A$, $\varphi_a(x) = ax$.
where we thus define $\tau_i$ via (4.7). Note that (4.6) gives $\mathrm{E}\, V_{1,i}^2 = \tau_i^2$, so (4.7) is consistent with our earlier definition (3.38). (The sign of $\tau_i$ is irrelevant for our purposes.) By (4.7) and (3.33)-(3.34), we have, with $\pi(i,j)$ and $X \sim \mathrm{HGe}(n-1, m-1, i-1)$ as defined in Remark 11, the identity (4.8), which expresses $\tau_i$, up to sign, as $\mathrm{E}\,(-1)^X$. By Lemma 18 below, this implies, for $2 \le m \le n/2$ and $1 \le i \le (n+1)/2$, that $|\tau_i|$ is exponentially small in $\operatorname{Var} X$. This enables us to conclude, using the symmetry $|\tau_i| = |\tau_{n+1-i}|$ and still assuming $2 \le m \le n/2$, the asserted estimate for $\sigma_1^2$.

Lemma 18. Suppose that $X$ is a hypergeometric random variable, $X \sim \mathrm{HGe}(n, k, \ell)$. Then
$$\bigl|\mathrm{E}\,(-1)^X\bigr| \le e^{-2 \operatorname{Var} X}. \qquad (4.11)$$
Note that the expectation in (4.11) is the difference of the probabilities that $X$ is even or that $X$ is odd.
Proof. By (a special case of) a theorem of [19], the probability generating function of $X$ has only negative real zeroes, and thus there exist probabilities $r_i \in [0,1]$, $i = 1, \ldots, k$, such that if $I_i \sim \mathrm{Be}(r_i)$ are independent indicator variables, then $\sum_i I_i$ has the same distribution as $X$, i.e.,
$$X \overset{d}{=} \sum_{i=1}^k I_i. \qquad (4.12)$$
Hence, with $s_i := 1 - r_i$,
$$\mathrm{E}\,(-1)^X = \prod_{i=1}^k \mathrm{E}\,(-1)^{I_i} = \prod_{i=1}^k (s_i - r_i),$$
and thus, using also $\operatorname{Var} X = \sum_i \operatorname{Var} I_i = \sum_i r_i s_i$ by (4.12),
$$\bigl(\mathrm{E}\,(-1)^X\bigr)^2 = \prod_{i=1}^k (s_i - r_i)^2 = \prod_{i=1}^k (1 - 4 r_i s_i).$$
This yields (4.11) by the standard formula $1 - x \le e^{-x}$. This completes the proof.
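The quantity bounded in Lemma 18, $\mathrm{E}\,(-1)^X = \Pr(X \text{ even}) - \Pr(X \text{ odd})$, can be computed exactly from the hypergeometric pmf and checked against brute-force enumeration over all $\ell$-subsets. A sanity check (names are ours; we read $\mathrm{HGe}(n, k, \ell)$ as $\ell$ draws from $n$ items of which $k$ are marked):

```python
from fractions import Fraction
from itertools import combinations
from math import comb

def signed_parity(n, k, l):
    """E(-1)^X for X ~ HGe(n, k, l), i.e. P(X even) - P(X odd)."""
    return sum(Fraction((-1) ** x * comb(k, x) * comb(n - k, l - x), comb(n, l))
               for x in range(min(k, l) + 1))

def signed_parity_brute(n, k, l):
    # enumerate all l-subsets of n items; items 0..k-1 are the marked ones
    total = Fraction(0)
    for s in combinations(range(n), l):
        x = sum(1 for i in s if i < k)
        total += Fraction((-1) ** x, comb(n, l))
    return total

for (n, k, l) in [(8, 3, 4), (9, 5, 2), (10, 4, 7)]:
    e = signed_parity(n, k, l)
    assert e == signed_parity_brute(n, k, l)
    assert abs(e) <= 1
```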

A random w
Theorem 7 applies when $w$ is far from a typical string $\Xi_m$ from our random source. In this subsection we consider the opposite case, i.e., when $w$ looks like $\Xi_m$. More precisely, we consider the case when $w = W$ is a random string of a given length $m$, drawn from the same source; thus $W \overset{d}{=} \Xi_m$, but $W$ is independent of $\Xi_n$. (We use the capital $W$ to emphasize that the string is random.) We think of this as a two-stage random experiment: first we sample $W$; then we sample $\Xi$. Conditioned on $W = w$, we thus have the same situation as before.
We write, for example, σ 2 1 (w) to indicate the dependence on w; thus σ 2 1 (W ) is a random variable. The next theorem shows that σ 2 1 (W ) is concentrated about a value that is roughly the geometric mean of the upper and lower bounds in (3.42) and (3.45).
Then, for $n \ge 1$ and $1 \le m \le n/2$, the estimate (4.16) for $\mathrm{E}\, \sigma_1^2(W)$ holds; furthermore, if also $m, n \to \infty$, then $\sigma_1^2(W)$ is concentrated around its mean.

We have already computed $\rho(a,a) = p_a^{-1} - 1$ in (3.7). Similarly, in general, recalling (3.1),
$$\rho(a,b) = \operatorname{Cov}\bigl(\varphi_a(\xi), \varphi_b(\xi)\bigr) = \frac{\delta_{ab}}{p_a} - 1, \qquad a, b \in A.$$
By (3.37), for a given string $w$, $V_{1,i}$ is a linear combination of the $\varphi_{w_j}(\xi_i)$, where $\rho(w_j, w_k) = \operatorname{Cov}\bigl(\varphi_{w_j}(\xi_i), \varphi_{w_k}(\xi_i)\bigr)$. Thus, by (3.39),
$$\tau_i^2 = \sum_{j,k} c(i,j)\, c(i,k)\, \rho(w_j, w_k). \qquad (4.22)$$
For random $W$ we have $\mathrm{E}\, \rho(W_j, W_k) = 0$ when $j \ne k$, while
$$\mathrm{E}\, \rho(W_j, W_j) = \sum_{a} p_a \bigl(p_a^{-1} - 1\bigr) = |A| - 1.$$
Consequently, taking the expectation in (4.22) and recalling (3.33),
$$\mathrm{E}\, \sigma_1^2(W) = (|A| - 1) \sum_i \sum_j \pi(i,j)^2, \qquad (4.25)$$
where $\pi(i,j)$ is defined in Remark 11. We thus want to estimate the final double sum. First, fix $i$ and recall from (3.34) that $(\pi(i,j))_j$ is the probability distribution of $X + 1$ with $X \sim \mathrm{HGe}(n-1, m-1, i-1)$. Let $\mu := \mathrm{E}\, X + 1$ and $\gamma^2 := \operatorname{Var} X$. By Chebyshev's inequality,
$$\sum_{j : |j - \mu| \le 2\gamma} \pi(i,j) \ge \frac{3}{4},$$
and thus, by the Cauchy-Schwarz inequality applied to the at most $4\gamma + 1$ terms with $|j - \mu| \le 2\gamma$,
$$\sum_j \pi(i,j)^2 \ge \frac{9}{16\,(4\gamma + 1)}.$$
Furthermore, see (4.15), $\gamma^2 = \operatorname{Var} X \le im/n$; hence the lower bound (4.29) follows. It follows from (3.32) that the ratio of consecutive values $\pi(i,j+1)/\pi(i,j)$ has an explicit form, and it follows easily that the maximum in (4.30) is attained at the point $j_0$ given in (4.32). It is then easy to see, by Stirling's formula and some calculations, that for $i \le \lfloor n/2 \rfloor$,
$$\max_j \pi(i,j) \le C \Bigl(\frac{n}{mi}\Bigr)^{1/2}. \qquad (4.34)$$
The result (4.16) for the expectation follows by (4.25), (4.29) and (4.34). Next, we estimate the variance of $\sigma_1^2(W)$; it is bounded by the double sum
$$C \sum_{j,k} \Bigl(\sum_i \pi(i,j)\, \pi(i,k)\Bigr)^2. \qquad (4.37)$$
To estimate (4.37), we split the inner sum into the ranges $i \le \lfloor n/2 \rfloor$ and $i > \lfloor n/2 \rfloor$, using $(x+y)^2 \le 2(x^2 + y^2)$; by symmetry it suffices to consider the case $i \le \lfloor n/2 \rfloor$. It follows from (4.31), after some calculations, that then
$$\pi(i,j) \le C e^{-c (j - j_0)^2/(j + j_0)}\, \pi(i, j_0),$$
where $j_0$ is defined in (4.32). It follows, omitting the details, that for $1 \le j \le k \le m$,
$$\sum_{i=1}^{\lfloor n/2 \rfloor} \pi(i,j)\, \pi(i,k) \le C \Bigl(\frac{n}{mk}\Bigr)^{1/2} e^{-c (j-k)^2/m}.$$
Consequently, the variance of $\sigma_1^2(W)$ is of smaller order than the square of its mean, so $\sigma_1^2(W)$ is concentrated; Theorem 15 then applies and shows asymptotic normality when $m = o(n^{1/2})$, which we already knew, see Theorems 4(iii) and 7. This example shows that Theorems 15 and 7 are sharp, in the sense that the range of $m$ for which they yield asymptotic normality cannot be extended; see Example 5.

the electronic journal of combinatorics 28(2) (2021), #P2.36
Remark 22. The argument in the proof of Theorem 7 applies also in other cases where $\sigma_1^2$ is of the same order as the upper bound in (3.42). Then Theorem 15 applies and shows asymptotic normality for $m = o(n^{1/2})$. A simple example is $w = 0 \cdots 0 1 \cdots 1$; more generally, it suffices that, say, the first and second halves of $w$ have different distributions of the letters, even if the average proportions in the entire string satisfy $q = p$. (This can be seen by a modification of the argument in the proof of Lemma 16.)

Based on these examples, we conjecture the following.

Conjecture 23. For every pattern $w$ with $m = o(n)$, we have $\ln(Z/a_n) \sim \mathrm{AsN}(0, b_n)$ for some sequence $a_n$ (and some sequence $b_n$); thus, by Lemma 2 and Remark 3, $Z$ is asymptotically normal when $b_n \to 0$ and asymptotically log-normal when $b_n$ is bounded away from 0.
In particular, by (4.1), if Conjecture 23 holds, then for an alternating string $w = 0101\cdots$, $Z$ is asymptotically normal for any $m = o(n)$. Moreover, for a random $w$ as discussed in Section 4.2, Theorem 19 together with Conjecture 23 suggests that asymptotic normality holds for $m = o(n^{2/3})$, and log-normality beyond that.
Note that this conjecture implies that if $\sigma_1^2$ is of smaller order than the upper bound in (3.42) (for $n^{1/3} \ll m \ll n$, say), then asymptotic normality holds for a larger range of $m$ than $o(n^{1/2})$, while our proofs above, on the contrary, verify it only in a range smaller than $m = o(n^{1/2})$.