Answers file: Text indexing
===========================

Replace the placeholders "..." with your answers.

0. Who are the group members?

Hussein Mansour, Gustav Eurén, John Croft

00. Did you run your tests using PyPy?
(You can answer "no" if you don't know about PyPy)

No

Part 1: Perform some linear searches
------------------------------------

Search for the following strings using the linear search option, in the
largest text file you have (e.g., bnc-larger.txt.gz). For each search,
write down how many matches you get and how long each query takes.
(If there are many matches it's enough to write that there are many of
them.)

1a. Search for "and":

Text file: bnc-larger.txt.gz
N:o matches: many (132208)
Query time: 0.0 seconds

1b. Search for "20th-century":

Text file: bnc-larger.txt.gz
N:o matches: 4
Query time: 2.58 seconds

1c. Search for "this-does-not-exist":

Text file: bnc-larger.txt.gz
N:o matches: 0
Query time: 2.58 seconds

Part 2: Create a suffix array manually
--------------------------------------

2a. Create a suffix array from the string "SIRAPIPARIS". How does it look?

[3, 'APIPARIS']
[7, 'ARIS']
[5, 'IPARIS']
[1, 'IRAPIPARIS']
[9, 'IS']
[6, 'PARIS']
[4, 'PIPARIS']
[2, 'RAPIPARIS']
[8, 'RIS']
[10, 'S']
[0, 'SIRAPIPARIS']

2b. Now create a suffix array from the string "AAAAAAAAAA". How does it look?

[9, 'A']
[8, 'AA']
[7, 'AAA']
[6, 'AAAA']
[5, 'AAAAA']
[4, 'AAAAAA']
[3, 'AAAAAAA']
[2, 'AAAAAAAA']
[1, 'AAAAAAAAA']
[0, 'AAAAAAAAAA']

Part 3: Insertion sort
----------------------

3a. How long does it take to insertion sort the suffix array for the
following files?

bnc-tinyest: 0.7 s / 0.45 s
bnc-tinyer: 10 s / 7.78 s
bnc-tiny: 72 s / 59.5 s
bnc-smallest: 627 s (optional)

Part 4: Quicksort
-----------------

4a. How long does it take to quicksort the suffix array for the following
files?

bnc-smaller: 3.3 s
bnc-small: 13.7 s
bnc-medium: 35.4 s

Part 5: Binary search in the suffix array
-----------------------------------------

5a.
Why do you think the search results are not shown in increasing order of
position?

The suffix array is sorted alphabetically, so the search results also come
out in alphabetical order. For example, "and he" < "and we": the former
might first appear later in the text, but alphabetically it is the earlier
match.

5b. How long does it take to search for a non-existing string in the
largest text file, using linear search and binary search, respectively?
(Don't include the time it takes to read the text and the index.)

Text file: bnc-larger.txt.gz
Linear search: 2.58 s
Binary search: 0.00 s

Part 6 (optional): Multi-key quicksort
--------------------------------------

6a (optional). How long does it take for quicksort and multi-key
quicksort, respectively, to sort the suffix array for the following files?
(You can of course take the quicksort timings from part 4 above.)

bnc-smaller: ... s (quicksort), ... s (multikey)
bnc-medium: ... s (quicksort), ... s (multikey)
bnc-large: ... s (quicksort), ... s (multikey)

Part 7: Empirical complexity analysis
-------------------------------------

Determine empirically the *asymptotic average-case complexity* of your
three sorting implementations. You can, e.g., use a curve-fitting tool
such as the `scipy` package in Python, or an online tool.

Below:
* "expected complexity" is the order of growth predicted by theory (no
  justification needed),
* "actual complexity" is the order of growth you determined empirically.

7a. InsertionSort

Expected complexity: quadratic
Actual complexity: quadratic

How did you determine this?

We tested the algorithm on arrays ranging from 1000 to 12000 elements and
recorded the times. We immediately noticed that the time grew roughly
4-fold every time the input size doubled, which fits the expected
complexity. We then used scipy's curve_fit to check the result, and the
data matched the quadratic curve better than any of the other candidate
curves.

7b.
Quicksort

Expected complexity: n log(n)
Actual complexity: n log(n)

How did you determine this?

We ran quicksort on the various text files and plotted the average runtime
against the character length of each. Relative to a straight line of best
fit, the runtime curves slightly upwards, which is consistent with
n log(n) complexity. Additionally, if we roughly plot the derivative of
the results, we obtain a logarithmic curve, which is consistent with
(n log n)' = log n + c.

7c (optional). MultikeyQuicksort

Expected complexity: ...
Actual complexity: ...

How did you determine this?

...

7d. Finally, vary the size of the alphabet. You can do this for only
quicksort if you want. How does the sorting time change depending on the
size of the alphabet? (Note: don't try 1-letter alphabets for this
question.)

How much faster/slower is it if you use ten letters (e.g. "ABCDEFGHIK"),
as compared to a 4-letter alphabet?

Using a 10-letter alphabet is about 38% faster than the 4-letter case.

How much faster/slower is it if you use only two letters (e.g., "AB"), as
compared to a 4-letter alphabet?

The 2-letter case is only about 60% as fast as the 4-letter case.

7e. Why do you think sorting becomes faster/slower when you use different
alphabet sizes?

Whenever two suffixes are compared, the comparison walks along both
suffixes character by character until they differ. With a smaller alphabet
there are fewer distinct strings of each length, so suffixes share longer
common prefixes and more traversal is needed. In summary, the suffix
comparison time grows as the alphabet shrinks. Plotting alphabet size vs.
runtime gives a curve of roughly exp(-alphabet_size), i.e. exponential
decay, trending towards a constant at which the first character of each
compared suffix already differs and only a single comparison is needed.

7f (optional). Now you can try a 1-letter alphabet (e.g. "A") if you want.

What happens when you sort a text consisting of only "A"s?
The suffixes will simply be sorted by the length of the remaining text
(starting from the suffix index). The suffix comparison function is forced
to traverse the entire length of the shorter suffix, linearly, since the
two suffixes cannot differ in any character before the shorter one ends.

Why do you think it behaves so differently from other alphabet sizes?

See above.

Appendix: general information
-----------------------------

A1. How many hours did you spend on the assignment?

John: 12 hours total. Others: approximately 5 hours each.

A2. Do you know of any bugs or limitations?

No

A3. Did you collaborate with any other students on this lab? If so, write
with whom and in what way you collaborated. Also list any resources
(including the web) you have used in creating your design.

No resources except for the course book were used.

A4. Did you encounter any serious problems?

The most time-consuming part was probably the quicksort algorithm, where
we had some minor problems tracking the index within each subarray and the
corresponding index in the suffix array. Another problem was that it was
hard to see the linearithmic curve visually in the complexity analysis of
quicksort, but this was resolved by plotting the rate of change in sorting
time against the rate of change in text length. Generally, no serious
problems. Taking the derivative (or multiple derivatives) of n vs. runtime
seems to be a powerful way to determine empirical complexity.

A5. Do you have other comments or feedback?

No
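Footnote: the suffix-array construction from Part 2 and the binary search
from Part 5 can be sketched in a few lines of Python. This is an
illustrative sketch, not our submitted implementation, and the function
names are made up for this note:

```python
def build_suffix_array(text):
    # Sort the suffix start positions by the suffix each one points at.
    # Slicing out whole suffixes is fine for toy inputs like Part 2;
    # real implementations avoid materialising the suffixes.
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_matches(text, sa, pattern):
    """Return start positions of all occurrences of `pattern`, in
    suffix-array (i.e. alphabetical) order -- which is why the results
    in Part 5a are not in increasing order of position."""
    m = len(pattern)

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sa[first:lo]
```

On the Part 2 example, `build_suffix_array("SIRAPIPARIS")` reproduces the
order [3, 7, 5, 1, 9, 6, 4, 2, 8, 10, 0], and searching for a non-existing
pattern returns an empty list after only O(log n) string comparisons,
matching the near-zero binary-search time in 5b.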