Application of entropy compression in pattern avoidance

In combinatorics on words, a word $w$ over an alphabet $\Sigma$ is said to avoid a pattern $p$ over an alphabet $\Delta$ if there is no factor $f$ of $w$ such that $f=h(p)$ where $h: \Delta^*\to\Sigma^*$ is a non-erasing morphism. A pattern $p$ is said to be $k$-avoidable if there exists an infinite word over a $k$-letter alphabet that avoids $p$. We give a positive answer to Problem 3.3.2 in Lothaire's book "Algebraic combinatorics on words", that is, every pattern with $k$ variables of length at least $2^k$ (resp. $3\times2^{k-1}$) is 3-avoidable (resp. 2-avoidable). This improves previous bounds due to Bell and Goh, and Rampersad.


Introduction
A pattern p is a non-empty word over an alphabet ∆ = {A, B, C, . . . } of capital letters called variables. A word x over Σ is an instance of p if there exists a non-erasing morphism $h : \Delta^* \to \Sigma^*$ such that h(p) = x. The avoidability index λ(p) of a pattern p is the size of the smallest alphabet Σ such that there exists an infinite word w over Σ containing no instance of p as a factor. Bean, Ehrenfeucht, and McNulty [1] and Zimin [15] characterized unavoidable patterns, i.e., patterns such that λ(p) = ∞. We say that a pattern p is t-avoidable if λ(p) ≤ t. For more information on pattern avoidability, we refer to Chapter 3 of Lothaire's book [7].
In this paper, we consider upper bounds on the avoidability index of long enough patterns with k variables. Bell and Goh [2] and Rampersad [11] used a method based on power series and obtained the following bounds:
Theorem 1 ([2,11]) Let p be a pattern with exactly k variables.
(a) If p has length at least $2^k$ then λ(p) ≤ 4. [2]
(b) If p has length at least $3^k$ then λ(p) ≤ 3. [11]
(c) If p has length at least $4^k$ then λ(p) = 2. [11]
Our main result improves these bounds:
Theorem 2 Let p be a pattern with exactly k variables.
(a) If p has length at least $2^k$ then λ(p) ≤ 3.
(b) If p has length at least $3 \times 2^{k-1}$ then λ(p) ≤ 2.
Theorem 2 gives a positive answer to Problem 3.3.2 of Lothaire's book [7]. The bound $2^k$ in Theorem 2.(a) is tight in the sense that for every k ≥ 1, the pattern $p_k$ with k variables in the family {A, ABA, ABACABA, ABACABADABACABA, . . . } has length $2^k - 1$ and is unavoidable. Similarly, the bound $3 \times 2^{k-1}$ in Theorem 2.(b) is tight in the sense that for every k ≥ 1, the pattern with k variables in the family {AA, AABAA, AABAACAABAA, AABAACAABAADAABAACAABAA, . . . } has length $3 \times 2^{k-1} - 1$ and is not 2-avoidable. Hence, this shows that the upper bound 3 of Theorem 2.(a) is best possible.
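The two extremal families can be generated mechanically, which makes the length formulas easy to check. The sketch below assumes the recursion suggested by the listed members (each pattern is the previous one, then a fresh variable, then the previous one again); the function name `family` is only illustrative.

```python
import string

def family(seed: str, k: int) -> str:
    """k-th pattern of the family starting at `seed`, where each step
    maps p to p + (fresh variable) + p (assumed recursion)."""
    p = seed
    for j in range(1, k):
        fresh = string.ascii_uppercase[j]  # B, C, D, ...
        p = p + fresh + p
    return p

for k in range(1, 5):
    zimin = family("A", k)     # A, ABA, ABACABA, ...
    squares = family("AA", k)  # AA, AABAA, AABAACAABAA, ...
    # lengths just below the two thresholds of Theorem 2
    assert len(zimin) == 2 ** k - 1
    assert len(squares) == 3 * 2 ** (k - 1) - 1
    print(k, zimin, squares)
```

Each step maps a length n to 2n + 1, which is why both families stay exactly one letter below their respective thresholds.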
The avoidability index of every pattern with at most 3 variables is known, thanks to various results in the literature. In particular, Theorem 2 is proved for k ≤ 3:
• For k = 1, the famous results of Thue [13,14] give λ(AA) = 3 and λ(AAA) = 2.
• For k = 2, every binary pattern of length at least 4 contains a square, and is thus 3-avoidable. Moreover, Roth [12] proved that every binary pattern of length at least 6 is 2-avoidable.
• For k = 3, Cassaigne [4] began and the first author [9] finished the determination of the avoidability index of every pattern with at most 3 variables. Every ternary pattern of length at least 8 is 3-avoidable and every ternary pattern of length at least 12 is 2-avoidable.
So it remains to prove the cases k ≥ 4.
Section 2 is devoted to some preliminary results. We prove Theorem 2.(a) in Section 3 as a corollary of a result of Bell and Goh [2]. In Section 4, we prove Theorem 2.(b) using the so-called entropy compression method.

Preliminary results
Let p be a pattern over ∆ = {A, B, C, . . .}. An occurrence y of p is an assignment of a non-empty word over Σ to every variable of p; the image of p under this assignment is the factor formed by y. Note that two distinct occurrences of p may form the same factor. For example, if p = ABA then the occurrence y = (A = 00; B = 1) of p forms the factor 00100; on the other hand, y′ = (A = 0; B = 010) is a distinct occurrence of p which forms the same factor 00100.
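The distinction between an occurrence and the factor it forms can be made concrete by enumerating all assignments that produce a given factor. This is a minimal brute-force sketch; the function name `occurrences` is an illustrative choice, and patterns are encoded as strings of capital letters.

```python
from itertools import product

def occurrences(pattern: str, factor: str):
    """Yield every assignment {variable: non-empty word} whose image
    under `pattern` is exactly `factor`."""
    variables = sorted(set(pattern))
    counts = [pattern.count(v) for v in variables]
    # try every tuple of variable lengths (each length is at least 1)
    for lengths in product(range(1, len(factor) + 1), repeat=len(variables)):
        if sum(c * n for c, n in zip(counts, lengths)) != len(factor):
            continue
        size = dict(zip(variables, lengths))
        assignment, pos, ok = {}, 0, True
        for v in pattern:
            piece = factor[pos:pos + size[v]]
            # the same variable must always be mapped to the same word
            if assignment.setdefault(v, piece) != piece:
                ok = False
                break
            pos += size[v]
        if ok:
            yield assignment

print(list(occurrences("ABA", "00100")))
```

On the example of the text, the enumeration finds exactly the two occurrences y′ and y.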
A pattern p is doubled if every variable of p appears at least twice in p.
A pattern p is balanced if it is doubled and every variable of p appears both in the prefix and the suffix of length $\lfloor |p|/2 \rfloor$ of p. Note that if the pattern has odd length, then the variable X that appears in the middle of p (i.e. in position $\lfloor |p|/2 \rfloor + 1$) must also appear in the prefix and in the suffix in order to make p balanced.

Claim 1
For every integer f ≥ 2, every pattern with at most k variables and length at least $f \times 2^{k-1}$ contains a balanced pattern p′ with at most k′ ≥ 1 variables and length at least $f \times 2^{k'-1}$ as a factor.
Proof. We prove this claim by induction on k. If k = 1, then p has size at least f ≥ 2 and is clearly balanced. Suppose this is true for some k = n, i.e. every pattern p with at most n variables and length at least $f \times 2^{n-1}$ contains a balanced pattern p′ as a factor with at most k′ variables and length at least $f \times 2^{k'-1}$. Let k = n + 1 and let $p_1$ (resp. $p_2$) be the prefix (resp. the suffix) of p of size $\lfloor |p|/2 \rfloor$. If p is balanced, we are done. If p is not balanced, then there exists a variable X in p that does not occur in $p_i$ for some i ∈ {1, 2}. Thus, $p_i$ has at most k − 1 = n variables and length at least $f \times 2^{n-1}$. Therefore, by the induction hypothesis, $p_i$, and hence p, contains a balanced pattern with at most k′ variables and length at least $f \times 2^{k'-1}$ as a factor. ✷
In the following, we will only use the fact that the pattern p′ in Claim 1 is doubled instead of balanced.
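The induction in the proof of Claim 1 is effectively a halving procedure. The following sketch spells it out; the function names are illustrative, patterns are strings of capital letters, and the length guarantee of the claim only holds when the input satisfies its hypothesis (otherwise the loop may shrink the pattern down to the empty word).

```python
def is_doubled(p: str) -> bool:
    """Every variable of p appears at least twice."""
    return all(p.count(v) >= 2 for v in set(p))

def is_balanced(p: str) -> bool:
    """p is doubled and every variable appears in both halves of
    length floor(|p| / 2)."""
    half = len(p) // 2
    prefix, suffix = p[:half], p[len(p) - half:]
    return is_doubled(p) and all(v in prefix and v in suffix for v in set(p))

def balanced_factor(p: str) -> str:
    """Follow the induction of Claim 1: while p is not balanced, replace
    p by a half that misses at least one variable of p."""
    while not is_balanced(p):
        half = len(p) // 2
        prefix, suffix = p[:half], p[len(p) - half:]
        # a non-balanced p always has a half with fewer variables
        p = prefix if len(set(prefix)) < len(set(p)) else suffix
    return p

print(balanced_factor("AABBCCCC"))
```

With f = 2 and k = 3, the input AABBCCCC has length 8 = f × 2^(k−1); the procedure keeps AABB, then AA, which is balanced with k′ = 1 and length 2 = f × 2^(k′−1).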

3-avoidable long patterns
We prove Theorem 2.(a) as a corollary of the following result of Bell and Goh [2]:
Lemma 3 ([2]) Every doubled pattern with at least 6 variables is 3-avoidable.
Proof of Theorem 2.(a). We want to prove that every pattern with exactly k variables and length at least $2^k$ is 3-avoidable, or equivalently, that every pattern with at most k variables and length at least $2^k$ is 3-avoidable. By Claim 1, every such pattern contains a doubled pattern p′ as a factor with at most k′ ≥ 1 variables and length at least $2^{k'}$. So it remains to show that every doubled pattern with at most k variables and length at least $2^k$ is 3-avoidable. As discussed in the introduction, the case of patterns with at most 3 variables has been settled. Now, it is sufficient to prove that doubled patterns of length at least $2^4 = 16$ are 3-avoidable.
Suppose that $p_1$ is a doubled pattern containing a variable X that appears at least 4 times. Replace 2 occurrences of X with a new variable to obtain a pattern $p_2$.
Example: We replace the first and third occurrence of B in $p_1$ = ABBCDBCABDDCB by a new variable E to obtain $p_2$ = AEBCDECABDDCB.
Then $p_2$ is a doubled pattern such that $|p_1| = |p_2|$ and $\lambda(p_1) \le \lambda(p_2)$, since every occurrence of $p_1$ is also an occurrence of $p_2$.
Given a doubled pattern p of length at least 16, we make such replacements as long as we can. We thus obtain a doubled pattern p′ of length at least 16 such that λ(p) ≤ λ(p′). Moreover, every variable in p′ appears either 2 or 3 times and therefore p′ contains at least ⌈16/3⌉ = 6 variables. So p′ is 3-avoidable by Lemma 3. Thus p is 3-avoidable, which finishes the proof. ✷
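The replacement step can be iterated mechanically. The sketch below makes one arbitrary choice that the argument does not need, namely always renaming the first and third occurrence of the alphabetically smallest frequent variable; the function name is illustrative, and variables are limited to the 52 ASCII letters.

```python
import string

ALPHABET = string.ascii_uppercase + string.ascii_lowercase

def split_frequent_variables(p: str) -> str:
    """While some variable occurs at least 4 times, rename its first and
    third occurrences to a fresh variable (arbitrary illustrative choice)."""
    letters = list(p)
    while True:
        frequent = [v for v in sorted(set(letters)) if letters.count(v) >= 4]
        if not frequent:
            return "".join(letters)
        v = frequent[0]
        fresh = next(c for c in ALPHABET if c not in letters)
        positions = [i for i, c in enumerate(letters) if c == v]
        letters[positions[0]] = fresh
        letters[positions[2]] = fresh

print(split_frequent_variables("ABBCDBCABDDCB"))
```

On the example of the text, a single replacement suffices and yields AEBCDECABDDCB. Each replacement keeps the pattern doubled, because the old variable keeps at least two occurrences and the new one gets exactly two.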

2-avoidable long patterns
We want to prove that every pattern with exactly k variables and length at least $3 \times 2^{k-1}$ is 2-avoidable, or equivalently, that every pattern with at most k variables and length at least $3 \times 2^{k-1}$ is 2-avoidable. By Claim 1, every such pattern contains a doubled pattern p′ as a factor with at most k′ ≥ 1 variables and length at least $3 \times 2^{k'-1}$. So it remains to show that every doubled pattern with at most k variables and length at least $3 \times 2^{k-1}$ is 2-avoidable.
As discussed in the introduction, the case of patterns with at most 3 variables has been settled. Now, it is sufficient to prove Theorem 2.(b) for doubled patterns and k ≥ 4.
Let $q(k) = 3 \times 2^{k-1}$ and Σ = {0, 1}. Suppose by contradiction that there exists a doubled pattern p on k variables and length at least q(k) that is not 2-avoidable. Then there exists an integer n such that any word $w \in \Sigma^n$ contains p. We put an arbitrary order on the k variables of p and call $A_j$ the j-th variable of p.

The algorithm AvoidPattern
Let $V \in \{0, 1\}^t$ be a vector of length t. The following algorithm takes the vector V as input and returns a word w avoiding p and a data structure R that is called a record in the remainder of the paper.
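Algorithm 1: AvoidPattern
Input: V. Output: w (a word avoiding p) and R (a data structure).
Starting from the empty word w and the empty record R, the algorithm performs t steps. At step i, it appends V[i] to w and encodes in R that a letter was appended to w; if w then contains an occurrence of p that forms a factor f, it encodes f in R and erases f from w. After t steps, it returns R and w.
The way we encode information in R at lines 5 and 7 will be explained in Subsection 4.2.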
In the algorithm AvoidPattern, let $w_i$ be the word w after i steps. Clearly, $w_i$ avoids p at each step. By the contradiction hypothesis, the resulting word w of the algorithm (that is, $w_t$) has length less than n. We will prove that each output of the algorithm allows us to determine the input. Then we obtain a contradiction by showing that the number of possible outputs is strictly smaller than the number of possible inputs when t is chosen large enough compared to n. This implies that every pattern p with at most k variables and length at least q(k) is 2-avoidable.
To analyze the algorithm, we borrow ideas from graph coloring problems [5,6]. These results are based on the Moser-Tardos [8] entropy-compression method which is an algorithmic proof of the Lovász Local Lemma.

The record R
An important part of the algorithm is to keep the record R of each step of the algorithm. Let $R_i$ be the record after i steps of the algorithm AvoidPattern.
On one hand, given V as input of the algorithm, this produces a pair $(R_t, w_t)$.
On the other hand, given a pair $(R_t, w_t)$, we will show in Lemma 4 that we can recover the entire input vector V. So, each input vector V produces a distinct pair $(R_t, w_t)$.
Let $\mathcal{V}$ be the set of input vectors V of size t, let $\mathcal{R}$ be the set of records R produced by the algorithm AvoidPattern and let $\mathcal{O}$ be the set of different outputs $(R_t, w_t)$. After the execution of the algorithm (t steps), $w_t$ avoids p by definition and therefore $|w_t| < n$ by the contradiction hypothesis. Hence, the number of possible final words $w_t$ is independent of t (it is at most $2^n$). We then clearly have $|\mathcal{O}| \le 2^n \times |\mathcal{R}|$. We will prove that $|\mathcal{V}| \le |\mathcal{O}|$ and that $|\mathcal{R}| = o(2^t)$ to obtain the contradiction $2^t = |\mathcal{V}| \le |\mathcal{O}| \le 2^n \times |\mathcal{R}| = o(2^t)$.
The record R is a triplet R = (D, L, X) where D is a binary word (each element is 0 or 1), L is a vector of (k − 1)-sets of non-zero integers and X is a vector of binary words. At the beginning, D, L and X are empty. At step i of the algorithm, we append V[i] to $w_{i-1}$ to get $w'_i$. If $w'_i$ contains no occurrence of p, then we append 0 to D to get $R_i$ and we set $w_i = w'_i$. Otherwise, suppose that $w'_i$ contains an occurrence y of p that forms a factor f of length ℓ (f consists of the last ℓ letters of $w'_i$); the word $w_i$ is then obtained from $w'_i$ by erasing f. Recall that $A_j$ is the j-th variable of p. Let $\ell_j = |A_j|$ in the factor f, let $L' = \{L_1, L_2, \ldots, L_{k-1}\}$ with $L_j = \ell_1 + \ell_2 + \cdots + \ell_j$, and let X′ be the binary word obtained from $A_1 \cdot A_2 \cdot \ldots \cdot A_k$ (where "·" is the concatenation operator) followed by as many 0's as necessary to get length $\lfloor \ell/2 \rfloor$. Note that we necessarily have $|A_1 \cdot A_2 \cdot \ldots \cdot A_k| \le \lfloor \ell/2 \rfloor$ since the pattern is doubled. Finally, to get $R_i$, we append the factor $01^\ell$ to D, we add L′ as the last element of L, and we add X′ as the last element of X.
Let $V_i$ be the vector V restricted to its first i elements. We will show that the pair $(R_i, w_i)$ at some step i allows us to recover $V_i$.

Lemma 4
After i steps of the algorithm AvoidPattern, the pair $(R_i, w_i)$ allows us to recover $V_i$.
Proof. We proceed by induction on i; for i = 0, the record, the word and the vector are all empty. Let $R_i = (D, L, X)$ with i ≥ 1. We distinguish two cases according to the suffix of D.
• Suppose that 0 is a suffix of D. This means that at step i, no occurrence of p was found: the algorithm appended V[i] to $w_{i-1}$ to get $w_i$. Therefore V[i] is the last letter of $w_i$, say x. Then the word $w_{i-1}$ is obtained from $w_i$ by erasing the last letter and the record $R_{i-1}$ is obtained from $R_i$ by removing the suffix 0 of D. We recover $V_{i-1}$ from $(R_{i-1}, w_{i-1})$ by the induction hypothesis and we obtain $V_i = V_{i-1} \cdot x$.
• Suppose now that $01^\ell$ is a suffix of D. This means that an occurrence y of p which forms a factor f of length ℓ has been created during step i. The last element L′ of L is a (k − 1)-set $L' = \{L_1, L_2, \ldots, L_{k-1}\}$ and the last element X′ of X is a binary word of length $\lfloor \ell/2 \rfloor$. Let $\ell_1 = L_1$ and for 2 ≤ s ≤ k − 1, let $\ell_s = L_s - L_{s-1}$. So, for 1 ≤ s ≤ k − 1, $\ell_s$ is the length of the variable $A_s$ of p in the occurrence y by construction of L′. We know the pattern p, the total length of the factor f (that is ℓ) and the lengths of the first k − 1 variables of p in f, so we are able to compute the length $\ell_k$ of the last variable $A_k$. So we are now able to recover the occurrence y of p: the first $\ell_1$ letters of X′ correspond to $A_1$, the next $\ell_2$ letters correspond to $A_2$ and so on (X′ may contain some 0's at the end which are not relevant). It follows that the factor f is completely determined. So $w_{i-1}$ is obtained from $w_i \cdot f$ by removing the last letter x of f, this letter x being V[i] (the letter appended to $w_{i-1}$ at step i to get $w'_i$). The record $R_{i-1}$ is obtained from $R_i$ as follows: remove the suffix $01^\ell$ from D; remove the last element of L and the last element of X. We recover $V_{i-1}$ from $(R_{i-1}, w_{i-1})$ by the induction hypothesis and we obtain $V_i = V_{i-1} \cdot x$.

✷
The previous lemma proves that distinct input vectors cannot correspond to the same pair $(R_t, w_t)$. So we get $|\mathcal{V}| \le |\mathcal{O}|$.
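The encoding of the record and the decoding argument of Lemma 4 can be checked on small inputs. The sketch below is an illustrative implementation under stated assumptions, not the authors' code: words are Python strings over "0" and "1", the variables are ordered alphabetically, each set L′ is stored as the increasing list of its elements, and the occurrence search is a naive backtracking routine. All function names are hypothetical.

```python
def match(pattern, factor, i=0, pos=0, env=None):
    """Backtracking search for an assignment env with h(pattern) == factor."""
    env = {} if env is None else env
    if i == len(pattern):
        return dict(env) if pos == len(factor) else None
    v = pattern[i]
    if v in env:
        word = env[v]
        if factor.startswith(word, pos):
            return match(pattern, factor, i + 1, pos + len(word), env)
        return None
    for end in range(pos + 1, len(factor) + 1):
        env[v] = factor[pos:end]
        found = match(pattern, factor, i + 1, end, env)
        if found is not None:
            return found
        del env[v]
    return None

def avoid_pattern(pattern, V):
    """Run the t steps on input V and return ((D, L, X), w)."""
    variables = sorted(set(pattern))      # assumed order A_1, ..., A_k
    w, D, L, X = "", "", [], []
    for bit in V:
        w += bit
        D += "0"                          # a letter was appended
        hit = None
        # w avoided p before this step, so an occurrence must be a suffix
        for start in range(len(w) - len(pattern), -1, -1):
            env = match(pattern, w[start:])
            if env is not None:
                hit = (start, env)
                break
        if hit is None:
            continue
        start, env = hit
        length = len(w) - start
        sizes = [len(env[v]) for v in variables]
        L.append([sum(sizes[:j]) for j in range(1, len(variables))])
        X.append("".join(env[v] for v in variables).ljust(length // 2, "0"))
        D += "1" * length                 # the factor f is erased
        w = w[:start]
    return (D, L, X), w

def recover(pattern, record, w):
    """Rebuild V from (R_t, w_t), undoing one step at a time (Lemma 4)."""
    variables = sorted(set(pattern))
    counts = [pattern.count(v) for v in variables]
    D, L, X = record[0], list(record[1]), list(record[2])
    V = []
    while D:
        if D.endswith("0"):               # no occurrence at this step
            V.append(w[-1])
            w, D = w[:-1], D[:-1]
            continue
        length = len(D) - len(D.rstrip("1"))
        D = D[:-(length + 1)]             # remove the suffix 0 1^length
        sums, x = L.pop(), X.pop()
        sizes = [b - a for a, b in zip([0] + sums, sums)]
        used = sum(c * n for c, n in zip(counts, sizes))
        sizes.append((length - used) // counts[-1])   # length of A_k
        env, pos = {}, 0
        for v, n in zip(variables, sizes):
            env[v] = x[pos:pos + n]
            pos += n
        f = "".join(env[v] for v in pattern)
        V.append(f[-1])                   # the last letter of f is V[i]
        w = w + f[:-1]
    return "".join(reversed(V))
```

For instance, with the pattern AA and the input 0010, the run erases the square 00 at step 2 and ends with w = 10 and D = 001100, from which `recover` rebuilds 0010.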

Analysis of R
Now we compute $|\mathcal{R}|$. Let $R = R_t = (D, L, X)$ be a given record produced by an execution of AvoidPattern. Let $\mathcal{D}$, $\mathcal{L}$ and $\mathcal{X}$ be the set of such binary words D, of such vectors L of (k − 1)-sets of non-zero integers, and of such vectors of binary words X, respectively. We thus have $|\mathcal{R}| \le |\mathcal{D}| \times |\mathcal{L}| \times |\mathcal{X}|$.
Let us give some useful information in order to get upper bounds on $|\mathcal{D}|$, $|\mathcal{X}|$, and $|\mathcal{L}|$. The algorithm runs in t steps. At each step, one letter is appended to w, so t letters have been appended and therefore the number of erased letters during the execution of the algorithm is $t - |w_t|$. At some steps, an occurrence of p appears and forms a factor which is immediately erased. Let m be the number of erased factors during the execution of the algorithm. Let $f_i$, 1 ≤ i ≤ m, be the m erased factors. We have $|f_i| \ge q(k)$ since each variable of p is a non-empty word and p has length at least q(k). Moreover, we have $\sum_{1 \le i \le m} |f_i| = t - |w_t|$. Each time a factor $f_i$ is erased, we add an element to L and X, so |L| = |X| = m.

Analysis of D
In the binary word D, each 0 corresponds to a letter appended during the execution of the algorithm and each 1 corresponds to an erased letter. Therefore, D has length $2t - |w_t|$. Observe that every prefix of D contains at least as many 0's as 1's. Indeed, since a 1 corresponds to an erased letter x, this letter x had to be added first and thus there is an earlier 0 that corresponds to this 1. The word D is therefore a partial Dyck word. Since any erased factor $f_i$ has length at least q(k), any maximal sequence of 1's (which is called a descent in the sequel) in D has length at least q(k). So D is a partial Dyck word with t 0's such that each descent has length at least q(k). The following two lemmas due to Esperet and Parreau [6] give an upper bound on $|\mathcal{D}|$.
Let $C_{t,r,d}$ (resp. $C_{t,d}$) be the number of partial Dyck words with t 0's and t − r 1's (resp. Dyck words of length 2t) such that all descents have length at least d. Hence, we have $|\mathcal{D}| \le C_{t,|w_t|,q(k)}$.
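For small parameters, the quantities just defined can be counted by exhaustive enumeration, which helps to see the effect of the descent constraint. This is a brute-force sketch (exponential in t, for illustration only); the function names are hypothetical.

```python
from itertools import product

def descents_ok(word: str, d: int) -> bool:
    """True if every maximal run of 1's in `word` has length at least d."""
    return all(len(run) >= d for run in word.split("0") if run)

def count_partial_dyck(t: int, r: int, d: int) -> int:
    """Brute-force count of words with t 0's and t - r 1's in which every
    prefix has at least as many 0's as 1's and all descents have length
    at least d. With r = 0 this counts Dyck words of length 2t."""
    total = 0
    for letters in product("01", repeat=2 * t - r):
        word = "".join(letters)
        if word.count("0") != t:
            continue
        height, valid = 0, True
        for c in word:
            height += 1 if c == "0" else -1
            if height < 0:        # some prefix has more 1's than 0's
                valid = False
                break
        if valid and descents_ok(word, d):
            total += 1
    return total

print([count_partial_dyck(t, 0, 1) for t in range(1, 6)])
print([count_partial_dyck(t, 0, 2) for t in range(1, 6)])
```

With d = 1 the constraint is void and the count is the Catalan number; requiring longer descents reduces the count sharply, which is what makes the bound on D useful.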
The radius of convergence r of $\varphi_d$ is 1 and since P(0) = 1 and P(r) = −1, P(x) = 0 has a solution τ in the open interval (0, r). By Lemma 6, this solution is unique and, for some constant $c_d$, $C_{t,d} \le c_d \gamma_d^t t^{-3/2}$ where $\gamma_d = \varphi_d(\tau)/\tau$. So, we can compute $\gamma_d$ for d fixed. We will use the following bounds: $\gamma_{24} \le 1.27575$, $\gamma_{48} \le 1.15685$, and $\gamma_{100} \le 1.08603$. Note that when d increases, $\gamma_d$ decreases.
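The constants $\gamma_d$ can be approximated numerically by bisection. The sketch below rests on assumptions that are not spelled out in the passage above and should be checked against Esperet and Parreau [6]: it takes $\varphi_d(x) = 1 + \sum_{i \ge d} x^i = 1 + x^d/(1-x)$, $P(x) = (1-x)^2(\varphi_d(x) - x\varphi_d'(x))$, which does satisfy P(0) = 1 and P(1) = −1, and $\gamma_d = \varphi_d(\tau)/\tau$. The function name `gamma` is illustrative.

```python
def gamma(d: int) -> float:
    """Approximate gamma_d = phi_d(tau) / tau, where tau is the root in
    (0, 1) of P(x) = (1 - x)^2 + (1 - d) x^d (1 - x) - x^(d + 1).
    The forms of phi_d and P are assumptions (see the lead-in)."""
    def P(x: float) -> float:
        return (1 - x) ** 2 + (1 - d) * x ** d * (1 - x) - x ** (d + 1)

    lo, hi = 0.0, 1.0              # P(0) = 1 > 0 and P(1) = -1 < 0
    for _ in range(200):           # plain bisection on the sign of P
        mid = (lo + hi) / 2
        if P(mid) > 0:
            lo = mid
        else:
            hi = mid
    tau = (lo + hi) / 2
    phi = 1 + tau ** d / (1 - tau)  # assumed phi_d(x) = 1 + sum_{i>=d} x^i
    return phi / tau

for d in (24, 48, 100):
    print(d, gamma(d))
```

Under these assumptions the computed values for d = 24, 48 and 100 are approximately 1.2757, 1.1569 and 1.0860, and for d = 1 the value is 4, the growth rate of the Catalan numbers.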

Analysis of X
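Each element X′ of X corresponds to an erased factor $f_i$ and by construction $|X'| = \lfloor |f_i|/2 \rfloor$. So the sum of the lengths of the elements of X is $\sum_{1 \le i \le m} \lfloor |f_i|/2 \rfloor \le \frac{1}{2} \sum_{1 \le i \le m} |f_i| = \frac{t - |w_t|}{2} \le \frac{t}{2}$. Thus, the vector X corresponds to a binary word of length at most $\lfloor t/2 \rfloor$. Therefore $|\mathcal{X}| \le 2^{\lfloor t/2 \rfloor} \le (\sqrt{2})^t$.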

Analysis of L
Each element $L' = \{L_1, L_2, \ldots, L_{k-1}\}$ of L corresponds to an erased factor $f_i$ and by construction each $L_j \in L'$ corresponds to the sum of the lengths of the first j variables of p in $f_i$.
Let $h_k(\ell)$ be the number of such (k − 1)-sets L′ that correspond to factors of length ℓ. Recall that $|f_i| \ge q(k)$, so $h_k(\ell)$ is defined for k ≥ 4 and ℓ ≥ q(k). Each of the m elements of L corresponds to an erased factor, so $|\mathcal{L}| \le \prod_{1 \le i \le m} h_k(|f_i|)$. Let $g_k(\ell) = h_k(\ell)^{1/\ell}$, so that $|\mathcal{L}| \le \prod_{1 \le i \le m} g_k(|f_i|)^{|f_i|}$. If we are able to upper-bound $g_k(\ell)$ by some constant c for all ℓ ≥ q(k), then we get $|\mathcal{L}| \le c^{\sum_{1 \le i \le m} |f_i|} = c^{t - |w_t|} \le c^t$.
Now we bound $g_k(\ell)$ using two different methods depending on the value of k and the length q(k) of p. For any factor $f_i$, we have $|A_1 \cdot A_2 \cdot \ldots \cdot A_k| \le \lfloor |f_i|/2 \rfloor$ since p is doubled (each variable appears at least twice). For a given $L' = \{L_1, L_2, \ldots, L_{k-1}\}$ that corresponds to some factor $f_i$, we have $L_{k-1} = |A_1 \cdot A_2 \cdot \ldots \cdot A_{k-1}| \le \lfloor |f_i|/2 \rfloor$. Therefore, L′ is a (k − 1)-set of distinct non-zero integers at most $\lfloor |f_i|/2 \rfloor$, i.e. k − 1 integers chosen among the integers between 1 and $\lfloor |f_i|/2 \rfloor$; thus $h_k(\ell) \le \binom{\lfloor \ell/2 \rfloor}{k-1}$ and so $g_k(\ell) \le \binom{\lfloor \ell/2 \rfloor}{k-1}^{1/\ell}$. This gives an upper bound on $g_k(\ell)$ for all ℓ ≥ q(k). Moreover, we can see that $g_k(q(k)) \le g_5(48)$ for all k ≥ 5.
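The counting bound on $g_k(\ell)$ is easy to evaluate numerically. The sketch below assumes $q(k) = 3 \times 2^{k-1}$ (the length threshold of Theorem 2.(b)) and reads the bound as a binomial coefficient, in line with the "k − 1 integers chosen among" argument; it evaluates the upper bound only, not $g_k$ itself, and the function names are illustrative.

```python
from math import comb, exp, log

def q(k: int) -> int:
    """Assumed length threshold q(k) = 3 * 2^(k - 1)."""
    return 3 * 2 ** (k - 1)

def g_upper(k: int, length: int) -> float:
    """Upper bound binom(floor(length / 2), k - 1) ** (1 / length) on
    g_k(length), computed through logarithms to avoid huge floats."""
    return exp(log(comb(length // 2, k - 1)) / length)

for k in range(5, 10):
    print(k, q(k), g_upper(k, q(k)))
```

For k = 5 and ℓ = 48 the binomial coefficient is 10626 and the bound is about 1.213; for larger k the value of the bound at ℓ = q(k) keeps decreasing.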

Bound on g_4(ℓ)
The second method to bound the size of $g_4(\ell)$ is based on ordinary generating functions (OGF). Here, k = 4, so let $A_1, A_2, A_3, A_4$ be the four variables of p and let $a_i$ be the number of times $A_i$ appears in p. Therefore, $a_1 + a_2 + a_3 + a_4 = |p|$. Recall that each variable appears at least twice in p since p is doubled, so $a_i \ge 2$. Moreover, a factor of length ℓ, with 24 ≤ ℓ ≤ 99, is necessarily an occurrence of a pattern of length between 24 and 99. So we just have to consider patterns p with 24 ≤ |p| ≤ 99.
Given $L' = \{L_1, L_2, L_3\}$ an element of L that corresponds to some factor $f_i$, we can compute the length $\ell_i$ of each variable $A_i$ in $f_i$ ($\ell_1 = L_1$, $\ell_i = L_i - L_{i-1}$ for i ∈ {2, 3} and $\ell_4 = (|f_i| - (a_1\ell_1 + a_2\ell_2 + a_3\ell_3))/a_4$). Recall that $\ell_i \ge 1$ since each variable of p is a non-empty word. Let $A_{|p|} = \sum_{i \ge |p|} b_i x^i$ be the OGF of such sets L′, i.e. $b_i$ is the number of 3-sets $\{L_1, L_2, L_3\}$ that correspond to a factor of length i formed by an occurrence of a pattern of length |p| (that is, $b_i$ is the number of 4-tuples $(\ell_1, \ell_2, \ell_3, \ell_4)$ with $\ell_i \ge 1$ such that $a_1\ell_1 + a_2\ell_2 + a_3\ell_3 + a_4\ell_4 = i$). So by definition of $h_4$, we have $h_4(\ell) = b_\ell$.
This kind of OGF has been studied and is similar to the well-known problem of counting the number of ways you can change a dollar [10]: you have only five types of coins (pennies, nickels, dimes, quarters, and half dollars) and you want to count the number of ways you can change any amount of cents. So, let $C = \sum_{i \ge 0} c_i x^i$ be the OGF of the problem, where $c_i$ is the number of ways you can change i cents (with $c_0 = 1$). Then, for example, $c_{100}$ corresponds to the number of ways you can change a dollar. Here, $C = \frac{1}{1-x} \times \frac{1}{1-x^5} \times \frac{1}{1-x^{10}} \times \frac{1}{1-x^{25}} \times \frac{1}{1-x^{50}}$. In our case, we have four coins, each of them has value $a_i$ (so we can have different types of coins with the same value) and each type of coin appears at least once (since $\ell_i \ge 1$). Thus we get $A_{|p|} = \frac{x^{a_1}}{1-x^{a_1}} \times \frac{x^{a_2}}{1-x^{a_2}} \times \frac{x^{a_3}}{1-x^{a_3}} \times \frac{x^{a_4}}{1-x^{a_4}}$.
We use Maple for our computation. For each 24 ≤ |p| ≤ 99, for each 4-tuple $(a_1, a_2, a_3, a_4)$ such that $\sum a_i = |p|$, we consider the associated OGF $A_{|p|}$ and we compute, using Maple, the truncated series expansion up to the order 100, which gives $A_{|p|} = b_{24}x^{24} + b_{25}x^{25} + \ldots + b_{99}x^{99} + O(x^{100})$ with explicit values for the coefficients $b_i$. So, for any 24 ≤ ℓ ≤ 99, $g_4(\ell)$ is upper-bounded by the maximum of $(b_\ell)^{1/\ell}$ taken over all $A_{|p|}$.
Maple gives that $(b_\ell)^{1/\ell}$ is maximal for |p| = 24, $(a_1, a_2, a_3, a_4) = (2, 2, 2, 18)$, and ℓ = 46: in this case, $b_{46} = 84$ (i.e. there exist 84 distinct 3-sets L′ that correspond to some factor of length 46 formed by an occurrence of a pattern of length 24). So, $g_4(\ell) \le 84^{1/46}$. If k = 4, then $g_4(\ell) < 1.10112$ for 24 ≤ ℓ ≤ 99 and $g_4(\ell) < 1.10456$ for ℓ ≥ 100. So for k = 4, we have $|\mathcal{L}| < (1.10456)^t$.
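The series coefficients used in this argument can be reproduced without a computer algebra system, by the usual coin-change dynamic programming. This is a sketch in Python rather than Maple; the function name and its parameters are illustrative. It expands the product of the factors 1/(1 − x^v), optionally multiplied by x^v for each coin to force every coin type to be used at least once.

```python
def series_coefficients(values, max_degree, at_least_one=False):
    """Coefficients up to x^max_degree of prod_v 1 / (1 - x^v), or of
    prod_v x^v / (1 - x^v) when at_least_one is True. Equal values are
    treated as distinct coin types."""
    coeffs = [1] + [0] * max_degree
    for v in values:
        # multiplying by 1 / (1 - x^v) is a running sum with stride v
        for i in range(v, max_degree + 1):
            coeffs[i] += coeffs[i - v]
    if at_least_one:
        # multiplying by x^(sum of values) shifts the coefficients
        shift = sum(values)
        coeffs = ([0] * shift + coeffs)[:max_degree + 1]
    return coeffs

dollar = series_coefficients([1, 5, 10, 25, 50], 100)
print(dollar[100])            # ways to change a dollar

b = series_coefficients([2, 2, 2, 18], 99, at_least_one=True)
print(b[46], b[46] ** (1 / 46))
```

The first call gives the classical 292 ways to change a dollar, and the second gives $b_{46} = 84$ for $(a_1, a_2, a_3, a_4) = (2, 2, 2, 18)$, whose 46-th root is just below 1.10112. Looping the second call over all admissible 4-tuples would reproduce the full search described above.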

Conclusion
In our results, we heavily use the fact that the patterns are doubled. The fact that the patterns are long is convenient for our proofs but does not seem so important. So we ask whether every doubled pattern is 3-avoidable. By the remarks in Section 1 and by Lemma 3, the only remaining cases are doubled patterns with 4 and 5 variables. Also, does there exist a finite k such that every doubled pattern with at least k variables is 2-avoidable? We know that such a k is at least 5 since ABCCBADD is not 2-avoidable.