A problem solving based approach to understand advanced string data structures better
String related problems used to scare the shit out of me. Then I was introduced to some string suffix structures and it absolutely blew my mind. I hope my approach to solve the following problems will help you get some insight.
Please keep in mind that this article is in no way an introduction to these data structures. I'll attach some links at the end. Read them to understand how these data structures work.
Let's think about the popularity of a substring first. If this was the only thing that we were asked to find, then a trie would suffice. Length of the prefix is also easy to maintain in a trie. The only thing remaining is the joy level.
It's pretty apparent that if we have multiple occurrences of a substring, then we should take the highest joy level corresponding to that substring. I don't think we can do that with a normal trie. Here's where Aho-Corasick comes into play.
Think what the fail link of an Aho-Corasick node means. It points to the node with the maximum prefix match with the suffix of the original node. Suppose we have a prefix 'aab''aab' . Then the node corresponding to this prefix must be a fail link/suffix link to a prefix ending with 'aab''aab' , right? Then that node will be a suffix link to another node ending with 'aab''aab' and so on.
Now, we will make another graph where we will add a directed edge from the suffix link node to the original node. For example, if the suffix link of a prefix 'bcaab''bcaab' goes to the prefix 'aab''aab', then we will add a directed edge from 'aab''aab' to 'bcaab''bcaab'. This graph will be a rooted tree. In this newly formed graph, the descendants of 'aab''aab' will be the prefixes which have 'aab''aab' as their suffix. So, if we know the maximum joy level of the descendants of 'aab''aab', we can assign that value to 'aab''aab' , pretty cool right?
Now, we'll just traverse through each node of the Aho-Corasick automaton and use the formula on each of them to find out the answer.
Since the constraints are in the 2⋅105 ballpark, we can't even dare to think of bruteforce here. Even though the problem deals with string concatenation, we'll now see why we don't need to concatenate anything.
Let, f(i)=
Number of query strings starting at position i
g(i)= Number of query strings ending at position i
Then the answer
is ∑i=1nf(i)⋅g(i)
Now, our next challenge is to figure out how we can calculate f(i) and g(i).
Let's have a look at f(i) first. We can define f(i) like this as well:
f(i)= Number of query strings that are prefixes of the suffix starting
at position i
We can find out the number of times a prefix of a suffix repeats in
a string using suffix array. Since suffix array is always sorted, the occurrence
of any prefix in a suffix will always be consecutive. So let's build a suffix array
using the string t. Now for every query string, we can find out how many times it
has occurred in string t using binary search. We will get a range in the suffix
array where a query string occurs. We'll then increment the counter for each position
in the range. That's it! We're done with f(i).
Now getting back to g(i), we can observe that this is exactly g(i) for the reversed version of the string t! To calculate g(i), we can just reverse t and perform the same operations as we did for f(i).
Now we can just loop through each position and calculate the answer.
Let's get straight to the point. If you know how suffix automaton actually works, this is a straightforward problem for you.
Each node in a suffix automaton corresponds to one or more unique substrings of the original string. What do they have in common? All of those substrings end in the same position(s) in the original string. Everytime we add a new character at the end of the original string, we create at least 1 new substring (why?). Suppose we have a string ′aab′ and we have created a suffix automaton using this string. If we add a ′c′ at the end, we know for sure that we're creating a new unique substring of length 4 (′aabc′). If we add this character to the automaton, the string ′aabc′ will belong to a node that corresponds to all the unique substrings that end at position 4. This means that we know how many unique substrings end at position 4; which implies that we have found out the number of new unique substrings we have created. If the substring ′aabc′ belongs to the node curr , then
The number of new substrings = len(curr)−len(suffixLink(curr))len(curr)−len(suffixLink(curr)).