Solving problems with Advanced String Data Structures

A problem solving based approach to understand advanced string data structures better

String related problems used to scare the shit out of me. Then I was introduced to some string suffix structures and it absolutely blew my mind. I hope my approach to solve the following problems will help you get some insight.

Please keep in mind that this article is in no way an introduction to these data structures. I'll attach some links at the end. Read them to understand how these data structures work.

Let's start then.

Codeforces GYM 101991E- Exciting Menus

Problem statement:

Problem Statement

Problem link: Exciting Menus - Codeforces

Let's think about the popularity of a substring first. If this was the only thing that we were asked to find, then a trie would suffice. Length of the prefix is also easy to maintain in a trie. The only thing remaining is the joy level.

It's pretty apparent that if we have multiple occurrences of a substring, then we should take the highest joy level corresponding to that substring. I don't think we can do that with a normal trie. Here's where Aho-Corasick comes into play.

Think what the fail link of an Aho-Corasick node means. It points to the node with the maximum prefix match with the suffix of the original node. Suppose we have a prefix 'aab''aab' . Then the node corresponding to this prefix must be a fail link/suffix link to a prefix ending with 'aab''aab' , right? Then that node will be a suffix link to another node ending with 'aab''aab' and so on.

Now, we will make another graph where we will add a directed edge from the suffix link node to the original node. For example, if the suffix link of a prefix 'bcaab''bcaab' goes to the prefix 'aab''aab', then we will add a directed edge from 'aab''aab' to 'bcaab''bcaab'. This graph will be a rooted tree. In this newly formed graph, the descendants of 'aab''aab' will be the prefixes which have 'aab''aab' as their suffix. So, if we know the maximum joy level of the descendants of 'aab''aab', we can assign that value to 'aab''aab' , pretty cool right?

Now, we'll just traverse through each node of the Aho-Corasick automaton and use the formula on each of them to find out the answer.

Here's my code:

#include <bits/stdc++.h>
#define N 100007
#define ll long long
#define pii pair<int,int>
#define ff first
#define ss second
using namespace std;
 
std::vector<int> adj[N];
string s[N];
int node[N][27];
int vis[N];
int val[N];
int backnode[N];
int cnt[N];
bool ending[N];
int id;
int ln[N];
int arr[N];
ll ans;
int depth[N];
inline void init()
{
    id = 0;
    for (int i = 0; i < 26; i++)
        node[id][i] = 0;
}
 
inline int newnode()
{
    id++;
    for (int i = 0; i < 26; i++)
    {
        node[id][i] = 0;
    }
    backnode[id] = 0;
    cnt[id] = 0;
    val[id] = 0;
    ending[id] = 0;
    depth[id] = 0;
    return id;
}
inline void Insert(string &st)
{
    int u = 0;
    int n = st.size();
    for (int i = 0; i < n; i++)
    {
        int x = st[i] - 'a';
        if (!node[u][x]) node[u][x] = newnode();
        u = node[u][x];
        val[u] = max(val[u], arr[i]);
        cnt[u]++;
        depth[u] = i + 1;
    }
 
    ending[u] = 1;
}
 
inline void AhoCorasik()
{
    queue<int>q;
    for (int i = 0; i < 26; i++)
    {
        if (node[0][i])
        {
            q.push(node[0][i]);
            backnode[node[0][i]] = 0;
        }
    }
    while (!q.empty())
    {
        int u = q.front();
        int w = backnode[u];
        adj[w].push_back(u);
        q.pop();
 
        for (int i = 0; i < 26; i++)
        {
            int v = node[u][i];
            if (v)
            {
                q.push(v);
                backnode[v] = node[backnode[u]][i];
            }
            else
            {
                node[u][i] = node[backnode[u]][i];
            }
        }
    }
}
 
void dfs(int x)
{
    ans = max( ans, 1LL * depth[x] * val[x] * cnt[x] );
 
    for ( auto v : adj[x] )
    {
        dfs(v);
    }
}
 
int traverse(int x)
{
    for ( auto v : adj[x] )
    {
        val[x] = max(val[x], traverse(v));
    }
 
    return val[x];
}
 
int main()
{
    ios_base::sync_with_stdio(0);
    cin.tie(0);
 
    freopen("exciting.in", "r", stdin);
 
    int tc;
    cin >> tc;
 
    while (tc--)
    {
        init();
 
        ans = 0;
        int n;
        cin >> n;
 
        for ( int i = 0; i < n; i++ )
        {
            cin >> s[i];
            ln[i] = s[i].size();
        }
 
        for ( int i = 0; i < n; i++ )
        {
            for ( int j = 0; j < ln[i]; j++ )
                cin >> arr[j];
 
            Insert(s[i]);
        }
 
        AhoCorasik();
        int k = traverse(0);
        dfs( 0 );
 
        for( int i=0;i<=id;i++ )
            adj[i].clear();
 
        cout << ans << "\n";
    }
}

Codeforces 1202E- You Are Given Some Strings...

Problem statement:

Problem Statement

Problem link: You Are Given Some Strings.. - Codeforces

Since the constraints are in the $2 \cdot 10^5$ ballpark, we can't even dare to think of bruteforce here. Even though the problem deals with string concatenation, we'll now see why we don't need to concatenate anything.

Let, $f(i) =$ Number of query strings starting at position i

$g(i) =$ Number of query strings ending at position i

Then the answer is $\sum_{i=1}^n f(i) \cdot g(i)$

Now, our next challenge is to figure out how we can calculate $f(i)$ and $g(i)$ .

Let's have a look at $f(i)$ first. We can define $f(i)$ like this as well:

$f(i) =$ Number of query strings that are prefixes of the suffix starting at position i

We can find out the number of times a prefix of a suffix repeats in a string using suffix array. Since suffix array is always sorted, the occurrence of any prefix in a suffix will always be consecutive. So let's build a suffix array using the string $t$ . Now for every query string, we can find out how many times it has occurred in string $t$ using binary search. We will get a range in the suffix array where a query string occurs. We'll then increment the counter for each position in the range. That's it! We're done with $f(i)$ .

Now getting back to $g(i)$ , we can observe that this is exactly $g(i)$ for the reversed version of the string $t$ ! To calculate $g(i)$ , we can just reverse $t$ and perform the same operations as we did for $f(i)$ .

Now we can just loop through each position and calculate the answer.

My code:

#include<bits/stdc++.h>
#define ll long long
#define M 200007
#define mod 1000000007
using namespace std;
const int kinds = 256;///maximum ASCII value of any character of the string
int nn;
char str[M];
int K, buc[M], r[M], sa[M], X[M], Y[M], high[M];
bool cmp(int *r, int a, int b, int x)
{
    return (r[a] == r[b] && r[a + x] == r[b + x]);
}
vector<int>saa;
int lcp[M];
 
string t, s;
int n, m;
 
void suffix_array_DA(int n, int m)
{
    int *x = X, *y = Y, i, j, k = 0, l;
    memset(buc, 0, sizeof(buc));
    for (i = 0; i < n; i++)
        buc[ x[i] = str[i] ]++;
    for (i = 1; i < m; i++)
        buc[i] += buc[i - 1];
    for (i = n - 1; i >= 0; i--)
        sa[--buc[x[i]]] = i;
    for (l = 1, j = 1; j < n; m = j, l <<= 1)
    {
        j = 0;
        for (i = n - l; i < n; i++)
            y[j++] = i;
        for (i = 0; i < n; i++)
            if (sa[i] >= l)
                y[j++] = sa[i] - l;
        for (i = 0; i < m; i++)
            buc[i] = 0;
        for (i = 0; i < n; i++)
            buc[ x[y[i]] ]++;
        for (i = 1; i < m; i++)
            buc[i] += buc[i - 1];
        for (i = n - 1; i >= 0; i--)
            sa[ --buc[ x[y[i]] ]] = y[i];
        for (swap(x, y), x[sa[0]] = 0, i = 1, j = 1; i < n; i++)
            x[sa[i]] = cmp(y, sa[i - 1], sa[i], l) ? j - 1 : j++;
    }
    for (i = 1; i < n; i++)
        r[sa[i]] = i;
    for (i = 0; i < n - 1; high[r[i++]] = k)
        for (k ? k-- : 0, j = sa[r[i] - 1]; str[i + k] == str[j + k]; k++);
}
void suffix_array_construction(string s)
{
    int n = s.size();
    for (int i = 0; i < n; i++)
        str[i] = s[i];
    str[n] = '\0';
    suffix_array_DA(n + 1, kinds);
 
    for (int i = 1; i <= n; i++)
        saa.push_back(sa[i]);
}
void lcp_construction(string const& s, vector<int> const& p)
{
    int n = s.size();
    vector<int> rank(n, 0);
    for (int i = 0; i < n; i++)
        rank[p[i]] = i;
    int k = 0;
 
    for (int i = 0; i < n; i++)
    {
        if (rank[i] == n - 1)
        {
            k = 0;
            continue;
        }
        int j = p[rank[i] + 1];
        while (i + k < n && j + k < n && s[i + k] == s[j + k])
            k++;
        lcp[rank[i]] = k;
        if (k)
            k--;
    }
}
 
bool greaterEqual(int sp, int n, const string &ps) {
    int spl = n - sp, sps = ps.size(), o = min(spl, sps);
    for (int i = 0; i < o; i++) {
        if (t[sp + i] > ps[i]) return 1;
        else if (t[sp + i] < ps[i]) return 0;
    }
    return spl >= sps;
}
 
bool greaterByVal(int sp, int n, const string &ps) {
    int spl = n - sp, sps = ps.size(), o = min(spl, sps);
    for (int i = 0; i < o; i++) {
        if (t[sp + i] > ps[i]) return 1;
        else if (t[sp + i] < ps[i]) return 0;
    }
    return 0;
}
 
pair<int, int> rangeString(int n, const string &ps) {  // returns [a,b) , not [a,b]
    int lo = 0, hi = n - 1, l = n, r = n;
    while (lo <= hi) {
        auto mid = (lo + hi) >> 1;
        if (greaterEqual(saa[mid], n, ps)) l = mid, hi = mid - 1;
        else lo = mid + 1;
    }
    lo = l, hi = n - 1;
    while (lo <= hi) {
        auto mid = (lo + hi) >> 1;
        if (greaterByVal(saa[mid], n, ps)) r = mid, hi = mid - 1;
        else lo = mid + 1;
    }
    return {l, r};
}
 
bool match(int mid)
{
    int cnt = 0;
 
    for ( int j = 0, k = mid; j < m and k < n; j++, k++ )
    {
        if ( t[k] == s[j] )
            cnt++;
        else
            break;
    }
 
    if (cnt >= m)
        return 1;
    return 0;
}
 
ll st[M], en[M];
ll shuru[M], shesh[M];
string ss[M];
 
int main()
{
    ios_base::sync_with_stdio(0);
    cin.tie(0);
 
    cin >> t;
    n = t.size();
    suffix_array_construction(t);
    lcp_construction(t, saa);
    nn = saa.size();
 
    int q;
    cin>>q;
 
    for ( int i = 0; i < q; i++ )
    {
        cin >> ss[i];
 
        pair<int,int>strRange= rangeString(n,ss[i]);
 
        st[strRange.first]++;
        st[strRange.second]--;  // [a,b)
    }
 
    for( int i=1;i<n;i++ )
        st[i]+= st[i-1];
    for( int i=0;i<n;i++ )
        shuru[saa[i]]= st[i];
 
    saa.clear();
 
    reverse(t.begin(), t.end());
    suffix_array_construction(t);
    lcp_construction(t, saa);
 
    for ( int i = 0; i < q; i++ )
    {
        reverse( ss[i].begin(), ss[i].end() );
 
        pair<int,int>strRange= rangeString(n,ss[i]);
 
        en[strRange.first]++;
        en[strRange.second]--;  // [a,b)
    }
 
    for( int i=1;i<n;i++ )
        en[i]+= en[i-1];
    for( int i=0;i<n;i++ )
        shesh[n-saa[i]-1]= en[i];
 
    ll ans= 0;
 
    for( int i=1;i<n;i++ )
        ans+= ( shuru[i]*shesh[i-1] );
 
    cout<< ans ;
}

Toph- Distinctness

Problem statement:

Problem Statement

Problem link: Distinctness- Toph

Let's get straight to the point. If you know how suffix automaton actually works, this is a straightforward problem for you.

Each node in a suffix automaton corresponds to one or more unique substrings of the original string. What do they have in common? All of those substrings end in the same position(s) in the original string. Everytime we add a new character at the end of the original string, we create at least 1 new substring (why?). Suppose we have a string $'aab'$ and we have created a suffix automaton using this string. If we add a $'c'$ at the end, we know for sure that we're creating a new unique substring of length 4 ( $'aabc'$ ). If we add this character to the automaton, the string $'aabc'$ will belong to a node that corresponds to all the unique substrings that end at position 4. This means that we know how many unique substrings end at position 4; which implies that we have found out the number of new unique substrings we have created. If the substring $'aabc'$ belongs to the node $curr$ , then

The number of new substrings = $len(curr)-len(suffixLink(curr))len(curr)-len(suffixLink(curr))$ .

ans(i) = ans(i-1)+len(curr)-len(suffixLink(curr))

My code:

#include<bits/stdc++.h>
#define ll long long
#define sc second
#define mod 1000000007
#define mx 400010
#define inf 2e9
using namespace std;
 
char s[mx];
bool f;
ll ans;
 
class Automata
{
public:
    struct data
    {
        int link, len, cnt, dist;
        int valid = 0;
        int next[27] ;
 
        data() {}
        data(int link, int len) : link(link), len(len) {}
    };
 
    data *node ;
    bool visfordfs[mx];
    int num, last ;
    set< pair<int, int> > st;
 
    void reset()
    {
        num = 1 ;
 
        node[0].valid = 0;
        node[0].link = -1 ;
        node[0].len = 0 ;
        node[0].cnt = 0 ;
        last = 0 ;
        memset(node[0].next, 0, sizeof(node[0].next));
    }
 
    Automata() {}
    Automata(int mx_len)
    {
        mx_len += 7 ;
        mx_len = mx_len * 2 ;
        node = new data[mx_len] ;
 
        memset(visfordfs, 0, sizeof(visfordfs));
 
        reset();
    }
 
    void addLetter(char ch)
    {
        int cur = num++;
        int let = ch - 'a' ;
        int p = last ;
 
        if (f)
            node[cur].valid= 1;
 
        node[cur].len = node[last].len + 1 ;
        node[cur].cnt = 1;
        st.insert({node[cur].len, cur});
        memset(node[cur].next, 0, sizeof(node[cur].next));
 
        for (p = last ; p != -1 && !node[p].next[let] ; p = node[p].link)
            node[p].next[let] = cur ;
 
        if (p == -1)
            node[cur].link = 0 ;
 
        else
        {
            int q = node[p].next[let] ;
 
            if (node[p].len + 1 == node[q].len)
                node[cur].link = q ;
 
            else
            {
                int clone = num++;
 
                node[clone] = node[q] ;
                node[clone].len = node[p].len + 1 ;
                node[clone].valid= 0;
                node[clone].cnt = 0;
                st.insert({node[clone].len, clone});
 
                for (; p != -1 && node[p].next[let] == q ; p = node[p].link)
                    node[p].next[let] = clone ;
 
                node[q].link = node[cur].link = clone ;
            }
        }
 
        if( node[cur].link>-1 )
            ans+= node[cur].len-node[node[cur].link].len; // Adding the new ones
 
        last = cur ;
    }
 
    void count_sub_str()
    {
        for ( auto it = st.rbegin(); it != st.rend(); it++ )
        {
            node[ node[it->sc].link ].cnt += node[it->sc].cnt;
        }
    }
    void count_valid_sub_str()
    {
        for ( auto it = st.rbegin(); it != st.rend(); it++ )
        {
            node[ node[it->sc].link ].valid += node[it->sc].valid;
        }
    }
 
    void count_dist_str( int x )
    {
        if ( node[x].dist )
            return;
 
        node[x].dist = 1;
 
        for ( int i = 0; i < 26; i++ )
            if ( node[x].next[i] )
            {
                count_dist_str( node[x].next[i] );
                node[x].dist += node[ node[x].next[i] ].dist ;
            }
    }
};
 
Automata sa;
 
int main()
{
    int sz;
 
    scanf("%s", s);
    sz= strlen(s);
 
    sa = Automata( sz );
 
    for ( int i = 0; i < sz; i++ )
    {
        sa.addLetter( s[i] );
        cout<<ans<<"\n";
    }
 
}

Resources: