Answers file
============

Replace the placeholders "..." with your answers.

0. Who are the group members?

...

00. Did you run your tests using PyPy?
    (You can answer "no" if you don't know about PyPy.)

...

Part 1: Complexity analysis
---------------------------

1a. What is the asymptotic complexity of 'compute_intersections'?
    Justify your answer.

    Answer in terms of N, the total number of 5-grams in the input files.
    You may make the following assumptions:

    - There are D documents.
    - Each document has the same number of 5-grams K.
    - K is larger than D.
    - There is not much plagiarised text; that is, most 5-grams occur in
      only one file. Specifically, the number of duplicate occurrences of
      5-grams is a small *constant*.

Complexity: ...
Justification: ...

1b. How long did the program take to compute the intersections for the
    'tiny', 'small' and 'medium' directories?
    (You can skip the 'medium' directory if it takes more than 5 minutes.)

tiny: ... s
small: ... s
medium: ... s

1c. Is the ratio between the times what you would expect, given the
    asymptotic complexity? Explain very briefly why.

...

1d. How long do you predict the program would take to compute the
    intersections for the 'big' and 'huge' directories, respectively?
    Show your calculations. Do the same for the 'medium' directory too
    if you didn't run it in 1b.

    Write your answer in something human-readable
    (e.g., minutes/hours/days/months/years).

medium: ...
big: ...
huge: ...
Calculations: ...

Part 2: Using an intermediate data structure
--------------------------------------------

2a. How long does it take to (1) build the n-gram index, (2) compute the
    intersections, for the 'small' directory?

build_ngram_index: ... s
compute_intersections: ... s

2b. Was this an improvement compared to the original implementation
    (see 1b)?

...

Part 3: Implementing a BST
--------------------------

3a.
How long does it take to (1) build the n-gram index, (2) compute the
    intersections, for the 'tiny', 'small' and 'medium' directories?
    If you get a stack overflow or recursion error, just say so.

    Note: You might see a *slowdown* compared to above...
    Don't be alarmed, you will solve all this in part 4.

tiny:
- build_ngram_index: ... s
- compute_intersections: ... s

small:
- build_ngram_index: ... s
- compute_intersections: ... s

medium:
- build_ngram_index: ... s
- compute_intersections: ... s

3b. Which of the BSTs appearing in the program usually become unbalanced?
    (The BSTs are 'file_ngrams', 'ngram_index', 'intersections'.)

...

3c (optional). Is there a simple way to stop these trees becoming
    unbalanced? (I.e., without using a self-balancing data structure.)

...

Part 4: Implementing an AVL tree
--------------------------------

4a. How long does it take to (1) build the n-gram index, (2) compute the
    intersections, for the 'small', 'medium' and 'big' directories?

small:
- build_ngram_index: ... s
- compute_intersections: ... s

medium:
- build_ngram_index: ... s
- compute_intersections: ... s

big:
- build_ngram_index: ... s
- compute_intersections: ... s

For the questions below, we denote by N the total number of 5-grams.
We assume there is a (small) constant number of duplicate occurrences
of 5-grams.

4b. What is the asymptotic complexity of 'build_ngram_index'?
    Justify briefly.

Complexity: ...
Justification: ...

4c. What is the asymptotic complexity of 'compute_intersections'?
    Justify briefly.

Complexity: ...
Justification: ...

4d. The 'huge' directory contains 4 times as many n-grams as the 'big'.
    Approximately how long will it take to run the program on 'huge',
    given that you know how long it took to run on 'big' (or 'medium')?
    Justify briefly.

    If you have the patience you can also run it to see how close your
    calculations were.

Theoretical time to run 'huge': ...
Justification: ...
(Optional) Actual time to run 'huge': ...

4e.
Briefly compare your answer in 4d with your answer in 1d.

...

4f (optional). Instead of the previous assumption, we now allow an
    arbitrary total similarity score S. What is the asymptotic complexity
    of the two functions in terms of both N and S (at the same time)?

Complexity of 'build_ngram_index': ...
Justification: ...
Complexity of 'compute_intersections': ...
Justification: ...

Appendix: general information
-----------------------------

A1. How many hours did you spend on the assignment?

...

A2. Do you know of any bugs or limitations?

...

A3. Did you collaborate with any other students on this lab? If so,
    write with whom and in what way you collaborated. Also list any
    resources (including the web) you have used in creating your design.

...

A4. Did you encounter any serious problems?

...

A5. Do you have other comments or feedback?

...
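As a point of reference for Parts 1-2, the intermediate-data-structure
approach the questions refer to can be sketched with a plain Python dict.
The function names below match those mentioned in the questions
('build_ngram_index', 'compute_intersections'), but the bodies are an
illustrative assumption, not the lab's actual implementation:

```python
from collections import defaultdict
from itertools import combinations

def ngrams(tokens, n=5):
    """Return all consecutive n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_ngram_index(file_tokens):
    """Map each 5-gram to the list of files it occurs in.

    'file_tokens' maps a file path to its token list (hypothetical input
    shape, chosen for the sketch).
    """
    index = defaultdict(list)
    for path, tokens in file_tokens.items():
        for gram in ngrams(tokens):
            index[gram].append(path)
    return index

def compute_intersections(index):
    """Count, for each pair of files, how many 5-grams they share."""
    similarity = defaultdict(int)
    for paths in index.values():
        # Most grams occur in one file only, so this inner loop is
        # usually empty under the "small constant duplicates" assumption.
        for a, b in combinations(sorted(set(paths)), 2):
            similarity[(a, b)] += 1
    return dict(similarity)
```

With a hash-based dict, both functions touch each 5-gram a constant
number of times, giving roughly O(N) expected time under the "small
constant number of duplicates" assumption; this is the kind of reasoning
questions 4b-4c ask for (with log factors instead when the index is a
balanced search tree).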