SA-IS algorithm
What is SA-IS?
SA-IS stands for Suffix Array Induced Sorting. It is an algorithm that constructs suffix arrays in linear time. A suffix array lists the starting indices of all the suffixes of a sequence, sorted w.r.t. a certain order (usually lexicographic). For example, the suffixes of the string “banana” are “banana”, “anana”, “nana”, “ana”, “na”, and “a”, and the corresponding suffix array is [5, 3, 1, 0, 4, 2], where each number is the starting index of a suffix and the suffixes appear in alphabetical order (i.e. 5 is the index of suffix “a”, ranked 1st; 3 is the index of suffix “ana”, ranked 2nd). Suffix arrays are important because they allow for quick binary search of patterns within the original sequence (O(log N) suffix comparisons, so O(M log N) character comparisons for a pattern of length M) and form the foundation of certain more sophisticated operations. They are especially important when the data is big, as with large text corpora in NLP or large DNA sequences in bioinformatics.
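To make the definition concrete, here is a minimal sketch that builds a suffix array the naive way (sorting the suffixes directly) and then searches it with binary search. This is not SA-IS and is far from linear time; the function names `naive_suffix_array` and `contains` are made up for this illustration.

```rust
/// Builds a suffix array by sorting all suffix start positions directly.
/// This is the naive approach, shown only to pin down the definition.
fn naive_suffix_array(text: &str) -> Vec<usize> {
    let bytes = text.as_bytes();
    let mut sa: Vec<usize> = (0..bytes.len()).collect();
    // Compare two suffixes byte by byte; a shorter suffix that is a prefix
    // of a longer one sorts first.
    sa.sort_by(|&a, &b| bytes[a..].cmp(&bytes[b..]));
    sa
}

/// Binary-searches the suffix array for a suffix that starts with `pattern`.
fn contains(text: &str, sa: &[usize], pattern: &str) -> bool {
    let (t, p) = (text.as_bytes(), pattern.as_bytes());
    sa.binary_search_by(|&start| {
        // Only the first `p.len()` bytes of each suffix matter.
        let end = (start + p.len()).min(t.len());
        t[start..end].cmp(p)
    })
    .is_ok()
}
```

Calling `naive_suffix_array("banana")` gives [5, 3, 1, 0, 4, 2], the array from the example above, and `contains` then finds “nan” but not “nab”.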
Understanding SA-IS
I find SA-IS to be one of the most difficult algorithms that I have encountered, both in terms of its conceptual ingenuity and the implementation that I came across. The original paper can be found here, a more accessible paper explaining it can be found here, and the implementation can be found here.
The essence of SA-IS is the following 2 ideas:
- If a sequence is fully ascending (e.g. aaabc) or fully descending (e.g. cbaaa), then construction of the suffix array is simple. What makes it hard are the “knots” within the sequence where the ascending/descending trend switches (e.g. cbaaabc, where ‘a’ changes the trend from descending to ascending). If we can rank the “knots”, then we can rank the rest of the sequence easily.
- In order to rank the “knots”, we can divide the original sequence into substrings containing these “knots” and label them using a new alphabet, a process called lexical naming. This way, we recursively reduce the original problem to a smaller one, and the total size of these recursions is bounded by the original sequence’s length.
There are several important definitions to cover before we can formalize these ideas. To follow along, one should first read the “A Close look of SA_IS” section of the 2nd paper, especially figures 3, 4, 5, 6, and 7. One should understand what the ‘w’ substring means and what the 3 steps of the algorithm are. The paper also has a supplementary section with proofs of why the algorithm works. I won’t repeat them here as it would be repetitive and a waste of time. What I will present, however, is what I found to be fuzzy even after reading the papers: mainly the recursion, and how the ‘w’ substrings are ranked. I will do so by carefully debugging the implementation.
Implementation
The implementation I came across is written in Rust, a wonderful language that I didn’t have much experience with. I used VSCode and the rust-analyzer extension to debug Rust on Windows. The MS C/C++ extension also has to be installed for rust-analyzer’s debugger to work, as it provides the necessary debugging interfaces for LLVM-generated binaries shared by both languages. Once this is done, we can set up a launch configuration using “rust-analyzer: Generate launch configuration” from the VSCode command palette. Specific to this implementation, we also need to change the optimization level from opt-level=3 to opt-level=0 in Cargo.toml. Otherwise, the debugger will skip quite a lot of breakpoints.
The main implementation of the SA-IS algorithm is in the fn sais<T>(sa: &mut [u64], stypes: &mut SuffixTypes, bins: &mut Bins, text: &T) function of table.rs. I created an example string “lartistartist” and will use it to highlight all the phases of the suffix array creation. The array creation can be divided into 2 big steps: 1. sorting the ‘w’ substrings 2. sorting the rest of the substrings. Each big step can be further divided into smaller substeps. Below is the code and its effect on the example string.
|String        |Trend         |
|--------------|--------------|
|lartistartist |DVADVADVADVAD |
where D means descending, V means valley, and A means ascending. Refer to paper 2 above.
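The trend string can be reproduced with a short right-to-left scan. The sketch below is my own illustration, not the crate’s `SuffixTypes` code; the name `suffix_trend` is hypothetical, and it assumes the last character is descending because it is followed by an implicit sentinel smaller than every character.

```rust
/// Labels each position 'D' (descending), 'A' (ascending) or 'V' (valley).
fn suffix_trend(text: &[u8]) -> String {
    let n = text.len();
    // The last position is descending: it is followed by the implicit
    // smallest sentinel.
    let mut types = vec![b'D'; n];
    // Scan right to left: a position is ascending if its character is
    // smaller than the next one, or equal to it while the next is ascending.
    for i in (0..n.saturating_sub(1)).rev() {
        let ascending = text[i] < text[i + 1]
            || (text[i] == text[i + 1] && types[i + 1] == b'A');
        types[i] = if ascending { b'A' } else { b'D' };
    }
    // A valley is an ascending position directly preceded by a descending one.
    for i in 1..n {
        if types[i] == b'A' && types[i - 1] == b'D' {
            types[i] = b'V';
        }
    }
    String::from_utf8(types).expect("labels are ASCII")
}
```

For `b"lartistartist"` this returns DVADVADVADVAD, so the valleys (the starts of the ‘w’ substrings) sit at indices 1, 4, 7 and 10.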
SORTING THE WSTRINGS
- Step 1: create suffix array (sa) partitioned with bins and pointers. This forms the foundation of bucket sort that we will use later. In “lartistartist”, the buckets will be ‘a’, ‘i’, ‘l’, ‘r’, ‘s’ and ‘t’.
a i l r s t
__|__|_|__|__|____|
0 2 4 5 7 9
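The bucket boundaries in the diagram come from counting characters and taking a running sum. A minimal sketch, assuming a byte alphabet (the function name `bucket_starts` is made up here and is not the crate’s `Bins` API):

```rust
/// Returns, for each distinct byte in `text`, the index at which its bucket
/// starts in the suffix array. Buckets are laid out in increasing byte order.
fn bucket_starts(text: &[u8]) -> Vec<(char, usize)> {
    // Count how many suffixes start with each byte value.
    let mut counts = [0usize; 256];
    for &c in text {
        counts[c as usize] += 1;
    }
    // A running sum of the counts gives the start of each bucket.
    let mut starts = Vec::new();
    let mut next = 0;
    for (c, &count) in counts.iter().enumerate() {
        if count > 0 {
            starts.push((c as u8 as char, next));
            next += count;
        }
    }
    starts
}
```

For “lartistartist” the buckets start at a: 0, i: 2, l: 4, r: 5, s: 7, t: 9, which is the bottom row of the diagram.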
- Step 2: insert valley suffixes following the bucket sort process described in step 0 of paper 2. This is only a guess, based on the assumption that longer substrings are lower in ranking.
a i l r s t
7 1|10 4|_|__|__|____|
0 2 4 5 7 9
- Step 3: insert descending suffixes based on the ‘w’ substrings and the induced descending suffixes themselves, following the bucket sort process described in step 1 of paper 2.
a i l r s t
7 1|10 4|0|__|__|12 6 9 3|
0 2 4 5 7 9
- Step 4: insert ascending suffixes based on the descending suffixes and the induced ascending suffixes themselves, following the bucket sort process described in step 2 of paper 2. These first 4 substeps are almost identical to the substeps of the 2nd major step (sorting the rest of the substrings). The only difference is that these are conditioned on guessed positions of the ‘w’ substrings, while in the 2nd major step the positions of the ‘w’ substrings are known to be correct.
a i l r s t
7 1|10 4|0|8 2|11 5|12 6 9 3|
0 2 4 5 7 9
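Steps 2 to 4 can be sketched end to end as follows. This is my own simplified version for byte strings, written to reproduce the diagrams above; it is not the code in table.rs, and all names (`EMPTY`, `ascending_flags`, `bucket_bounds`, `first_pass`) are hypothetical. It uses usize::MAX as the “empty slot” marker.

```rust
const EMPTY: usize = usize::MAX;

/// true = ascending (valleys included), false = descending.
fn ascending_flags(text: &[u8]) -> Vec<bool> {
    let n = text.len();
    let mut asc = vec![false; n]; // the last position is descending
    for i in (0..n.saturating_sub(1)).rev() {
        asc[i] = text[i] < text[i + 1] || (text[i] == text[i + 1] && asc[i + 1]);
    }
    asc
}

/// Start (inclusive) and end (exclusive) of every bucket, by byte value.
fn bucket_bounds(text: &[u8]) -> (Vec<usize>, Vec<usize>) {
    let mut counts = vec![0usize; 256];
    for &c in text {
        counts[c as usize] += 1;
    }
    let (mut heads, mut tails) = (vec![0; 256], vec![0; 256]);
    let mut sum = 0;
    for c in 0..256 {
        heads[c] = sum;
        sum += counts[c];
        tails[c] = sum;
    }
    (heads, tails)
}

/// Steps 2-4: place valleys at bucket tails, then induce the rest.
fn first_pass(text: &[u8]) -> Vec<usize> {
    let n = text.len();
    let mut sa = vec![EMPTY; n];
    if n == 0 {
        return sa;
    }
    let asc = ascending_flags(text);

    // Step 2: insert valley suffixes, in text order, at the bucket tails.
    let (_, mut tails) = bucket_bounds(text);
    for i in 1..n {
        if asc[i] && !asc[i - 1] {
            let c = text[i] as usize;
            tails[c] -= 1;
            sa[tails[c]] = i;
        }
    }

    // Step 3: left-to-right scan, inserting descending suffixes at bucket
    // heads. The last suffix is induced by the implicit sentinel.
    let (mut heads, _) = bucket_bounds(text);
    let c = text[n - 1] as usize;
    sa[heads[c]] = n - 1;
    heads[c] += 1;
    for i in 0..n {
        let s = sa[i];
        if s != EMPTY && s > 0 && !asc[s - 1] {
            let c = text[s - 1] as usize;
            sa[heads[c]] = s - 1;
            heads[c] += 1;
        }
    }

    // Step 4: right-to-left scan, inserting ascending suffixes at bucket tails.
    let (_, mut tails) = bucket_bounds(text);
    for i in (0..n).rev() {
        let s = sa[i];
        if s != EMPTY && s > 0 && asc[s - 1] {
            let c = text[s - 1] as usize;
            tails[c] -= 1;
            sa[tails[c]] = s - 1;
        }
    }
    sa
}
```

On “lartistartist” this produces 7 1|10 4|0|8 2|11 5|12 6 9 3, matching the diagram after step 4. In general the result of this first pass is only guaranteed to order the wstrings, not all suffixes, which is why the remaining steps exist.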
- Step 5: find and move all wstrings to the beginning (in this example the array is unchanged, as all the wstrings (num_wstrs = 4) are already at the beginning after step 4). This is the start of a series of manipulations (steps 5 - 10) involving lexical naming and potential recursions, with the goal of making sure the ‘w’ substrings are correctly ranked in the end.
a i l r s t
7 1|10 4|0|8 2|11 5|12 6 9 3|
0 2 4 5 7 9
- Step 6: replace all non-wstrings with the max value (m), then put the lexical name of each wstring at index ‘sa[num_wstrs + cur_sufi / 2]’, where ‘cur_sufi’ is the index of the wstring in the original text. This step basically lays out the lexical names of the wstrings from left to right, in text order.
a i l r s t
7 1|10 4|0|m 2|0 m|1 m m m|
0 2 4 5 7 9
- Step 7: smush the inserted lexical names at the end of the array. This is a preparatory step for the potential split and recursion of the next step.
a i l r s t
7 1|10 4|0|m 2|0 m|0 2 0 1|
0 2 4 5 7 9
- Step 8: (OPTIONAL) if there are repeated lexical names (indicating identical wstrings), split and recurse. In our example, we do have a repeated lexical name (‘0’), so we enter the recursion code inside the “if” block. The “else” block will be explained later.
In this example, we have:
|wstring index|wstring |suffix |lexical name|
|-------------|--------------|--------------|------------|
|7 |arti |artist |0 |
|1 |arti |artistartist |0 |
|10 |ist<sentinel> |ist |1 |
|4 |ista |istartist |2 |
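The table can be reproduced with a small sketch of lexical naming. This is an illustration under my own assumptions, not the crate’s code: the names `wstring` and `reduced_text` are hypothetical, a wstring that runs to the end of the text is flagged so that it behaves as if it ended in the sentinel, and the slot of each name is found by binary search rather than the ‘cur_sufi / 2’ trick.

```rust
/// Returns the wstring starting at valley `start`: it runs up to and including
/// the next valley, or to the end of the text (flagged `true`) if there is none.
fn wstring<'a>(text: &'a [u8], valleys: &[usize], start: usize) -> (&'a [u8], bool) {
    match valleys.iter().find(|&&v| v > start) {
        Some(&next) => (&text[start..=next], false),
        None => (&text[start..], true),
    }
}

/// Takes the valleys in their (tentatively) sorted order and returns their
/// lexical names arranged in text order, i.e. the reduced text.
fn reduced_text(text: &[u8], sorted_valleys: &[usize]) -> Vec<usize> {
    let mut in_text_order = sorted_valleys.to_vec();
    in_text_order.sort_unstable();

    let mut names = vec![0usize; sorted_valleys.len()];
    let mut name = 0;
    for (rank, &v) in sorted_valleys.iter().enumerate() {
        // A new name is issued only when the wstring differs from the
        // previous one in sorted order; identical wstrings share a name.
        let differs = rank > 0
            && wstring(text, &in_text_order, v)
                != wstring(text, &in_text_order, sorted_valleys[rank - 1]);
        if differs {
            name += 1;
        }
        let slot = in_text_order
            .binary_search(&v)
            .expect("valley is present in the text-order list");
        names[slot] = name;
    }
    names
}
```

With the text “lartistartist” and the sorted valleys [7, 1, 10, 4], `reduced_text` returns [0, 2, 0, 1]: wstrings 7 and 1 (“arti”) share name 0, wstring 10 gets 1, and wstring 4 gets 2.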
The recursion starts by splitting the original sa, taking its head (containing wstrings 7 1 10 4) as the new sa and its tail (containing lexical names 0 2 0 1) as the new text, and discarding the middle (0 m 2 0 m). The lexical names keep the order of the original text, which guarantees that sorting these names is equivalent to sorting the original suffixes.
a i l r s t
7 1|10 4|0|m 2|0 m | 0 2 0 1| //split at 'sa.len() - num_wstrs'
0 2 4 5 7 9
The recursion returns (2 0 3 1) which is the suffix array of the lexical names (0 2 0 1). This updates the original sa to:
a i l r s t
2 0|3 1|0|m 2|0 m|0 2 0 1|
0 2 4 5 7 9
- Step 9: replace the lexical names with their corresponding suffix index in the original text
a i l r s t
2 0|3 1|0|m 2|0 m|1 4 7 10|
0 2 4 5 7 9
- Step 10: map the suffix indices from the reduced text to suffix indices in the original text using the information we got in step 9, and zero out everything after the wstrings. At this point, we can be sure that the wstrings’ positions are correct.
a i l r s t
7 1|10 4|0|0 0|0 0|0 0 0 0|
0 2 4 5 7 9
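Steps 8 to 10 boil down to “sort the reduced text, then translate ranks back into positions of the original text”. The sketch below illustrates that with the numbers from this example; to stay short it sorts the reduced text naively instead of recursing into SA-IS, and the function names are hypothetical.

```rust
/// Suffix array of the reduced text. The real algorithm recurses into SA-IS
/// here; a naive sort is swapped in to keep the sketch short.
fn sort_reduced(names: &[usize]) -> Vec<usize> {
    let mut sa: Vec<usize> = (0..names.len()).collect();
    sa.sort_by(|&a, &b| names[a..].cmp(&names[b..]));
    sa
}

/// Maps each entry of the reduced suffix array back to the position of the
/// corresponding valley in the original text (steps 9 and 10).
fn sorted_valleys(names: &[usize], valleys_in_text_order: &[usize]) -> Vec<usize> {
    sort_reduced(names)
        .into_iter()
        .map(|reduced_index| valleys_in_text_order[reduced_index])
        .collect()
}
```

Here `sort_reduced(&[0, 2, 0, 1])` gives [2, 0, 3, 1], and mapping through the valley positions [1, 4, 7, 10] gives [7, 1, 10, 4], the head of the array after step 10.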
SORTING THE REST OF THE SUBSTRINGS
- Step 11: insert the valley suffixes again using bucket sort. This is different from step 2, as this time we know that the wstrings are sorted rather than guessed. In this example, the suffix array doesn’t change, as the wstrings already fill the ‘a’ and ‘i’ buckets at the start of the array.
a i l r s t
7 1|10 4|0|0 0|0 0|0 0 0 0|
0 2 4 5 7 9
- Step 12: insert the descending suffixes again.
a i l r s t
7 1|10 4|0|0 0|0 0|12 6 9 3|
0 2 4 5 7 9
- Step 13: insert the ascending suffixes again. This gives the final suffix array [7, 1, 10, 4, 0, 8, 2, 11, 5, 12, 6, 9, 3] for the string “lartistartist” in linear time.
a i l r s t
7 1|10 4|0|8 2|11 5|12 6 9 3|
0 2 4 5 7 9
And finally, the SA-IS algorithm completes its mission.
What is left to discuss is the base case of the recursion: when there are no duplicate lexical names (no identical wstrings) in step 8. In this case, we enter the else block of step 8.
This code basically gives us the lexical names of the non-duplicate wstrings from left to right, smushed at the right of the suffix array. The suffix array of the lexical names is then in the correct order, by the properties of bucket sort and of the wstrings themselves: if two wstrings start with different characters, bucket sort ensures the order; if they start with the same character but are not identical, the property of wstrings ensures that, in the longer wstring, the character at the position of the shorter wstring’s last character is descending, so the longer wstring ranks after the shorter one.
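One way to see why no recursion is needed in the base case: when every lexical name is distinct, each suffix of the reduced text is ranked by its first name alone, so its suffix array is simply the inverse of the name sequence. The sketch below shows that general fact; it is my own illustration rather than the else block itself, and the function name is made up.

```rust
/// Suffix array of a reduced text whose names are all distinct, i.e. a
/// permutation of 0..k. The suffix starting with name r has rank r.
fn suffix_array_of_distinct_names(names: &[usize]) -> Vec<usize> {
    let mut sa = vec![0; names.len()];
    for (position, &name) in names.iter().enumerate() {
        sa[name] = position;
    }
    sa
}
```

For a hypothetical reduced text [2, 0, 1], this gives [1, 2, 0]: the suffix starting at index 1 (name 0) ranks first, then index 2 (name 1), then index 0 (name 2).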