You may have wondered how the sorting algorithms we’ve seen in this course compare to Python’s built-in
list.sort. It turns out that they are quite a bit worse on average, which is perhaps not too
surprising: there are many possible optimizations that could make these algorithms more efficient. In
this lab, we’ll look at timsort, the algorithm that Python uses for its
list.sort, developed by Tim Peters in 2002.
By the end of this lab, students will be able to
This lab is fairly technical, and is a nice programming exercise that requires careful attention to detail—good practice for your final exam and good preparation for future courses.
Download timsort.py into your labs/lab10 folder.
Fundamentally, timsort is a highly optimized version of mergesort. Recall the basic idea of mergesort: recursively keep dividing up the list into small pieces, sort each piece, then merge the sorted pieces.
Open timsort.py and take a few minutes with your partner to go over
the provided mergesort2 and _merge functions. Make sure you understand them
before moving on! Note that these functions do not return new lists, but instead operate on the input list,
making them a little different from the version you just worked on. (However, _merge does
create a temporary list: the list concatenation in the last line creates a new list before storing it in
lst[start:end].)
Standard mergesort is oblivious to the actual data; like selection sort, it behaves the same regardless of whether the data is already sorted or not.
Intuitively, however, if the data is already partially sorted, then mergesort should be able to take advantage of that: any time it reaches an already-sorted piece, it should stop immediately rather than making new recursive calls. Note that it takes linear time to detect whether a slice of the list is sorted, which is much more efficient than running mergesort on it directly.
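To see why this check is linear, here is one way it could be written (is_sorted is our own illustrative name, not part of the starter code):

```python
def is_sorted(lst, start, end):
    """Return True if the slice lst[start:end] is sorted (non-decreasing).

    Makes a single pass over the slice, so it runs in time linear
    in the length of the slice.
    """
    return all(lst[i] <= lst[i + 1] for i in range(start, end - 1))
```

Each adjacent pair is compared exactly once, which is where the linear running time comes from.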
This is the biggest idea in timsort: look for, and take advantage of, partial sorting in the input data. In this task, you’ll write a function that does the first part.
A run in a list is defined as a maximal list slice that’s already sorted. For example, the
list [1, 4, 7, 10, 2, 5, 3, -1] has four runs: [1, 4, 7, 10], [2,
5], [3], and [-1]. Duplicates may occur in a run. Note
that the slice [4, 7, 10] is not a run, because even though it’s sorted, it isn’t
maximal: we could add the initial 1 to it and still have a sorted slice.
In the starter code, implement find_runs, which is given a list and outputs an ordered list of
tuples, where each tuple represents the “slice indexes” of the runs in the list. For example, if we gave
this function the list [1, 4, 7, 10, 2, 5, 3, -1], the runs are [0:4],
[4:6], [6:7], and [7:8], so the function would return [(0, 4),
(4, 6), (6, 7), (7, 8)].
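As a rough illustration of the single-pass idea, a sketch of this function might look like the following (your implementation should follow the hints in the starter code, which may structure the loop differently):

```python
def find_runs(lst):
    """Return a list of (start, end) tuples giving the slice indexes
    of the maximal ascending runs in lst, in order."""
    runs = []
    start = 0
    for i in range(1, len(lst)):
        if lst[i] < lst[i - 1]:        # the current run ends just before i
            runs.append((start, i))
            start = i                  # a new run begins at i
    if lst:                            # close off the final run
        runs.append((start, len(lst)))
    return runs
```

Each element is compared to its predecessor exactly once, so the whole function is a single pass over the list.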
We’ve provided some hints in the starter code inside the loop, but keep in mind that this can be done by processing each element of the list just once. This is already not easy, so spend some time with your partner discussing a strategy before coding (as always), and don’t rush. Ask your TA for help, too!
Now that we have the runs, we can merge them, one pair at a time, and this will result in a
sorted list! This is done in timsort by treating the list of run indexes returned by find_runs
as a stack:
runs = find_runs(lst)
while <runs> has more than one tuple:
pop the top two runs off the stack <runs>
merge the runs in <lst>, which produces a new run
push the new run onto <runs>
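The loop above might be sketched in Python as follows; this version bundles in simple stand-ins for find_runs and _merge so it runs on its own, but in the lab you would of course call your own versions.

```python
def find_runs(lst):
    """Return (start, end) tuples for the maximal ascending runs in lst."""
    runs = []
    start = 0
    for i in range(1, len(lst)):
        if lst[i] < lst[i - 1]:
            runs.append((start, i))
            start = i
    if lst:
        runs.append((start, len(lst)))
    return runs

def _merge(lst, start, mid, end):
    """Merge the sorted slices lst[start:mid] and lst[mid:end]."""
    result = []
    i, j = start, mid
    while i < mid and j < end:
        if lst[i] <= lst[j]:
            result.append(lst[i]); i += 1
        else:
            result.append(lst[j]); j += 1
    lst[start:end] = result + lst[i:mid] + lst[j:end]

def timsort(lst):
    """Sort lst in place by repeatedly merging adjacent runs off a stack."""
    runs = find_runs(lst)
    while len(runs) > 1:
        s2, e2 = runs.pop()        # top run
        s1, e1 = runs.pop()        # the run just below it; note e1 == s2
        _merge(lst, s1, e1, e2)    # merging adjacent runs produces a new run
        runs.append((s1, e2))      # push the merged run back on the stack
```

The key observation is that two runs adjacent on the stack are also adjacent in the list, so popping two tuples and merging always produces one larger run.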
Implement this algorithm in the timsort function; note that because you’re working with a list
and not our Stack class, you should use the pop and append list
methods.
After you complete this, you should have a fully functional sorting algorithm in timsort. Your
algorithm is still a long way from the fully optimized timsort, and in fact is almost certainly slower than
mergesort at this point. Let’s keep going!
Consider the reverse-sorted list [5, 4, 3, 2, 1]. If we gave this to find_runs,
we’d get [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)], which is the longest list we could possibly
get. But it seems funny that a reverse-sorted list is a “bad” input, given that it has just as much
structure as a sorted list.
So here’s the next idea: we can detect descending runs in the same loop that detects ascending runs.
Complete find_runs2, which should be the same as find_runs, except in addition to
finding ascending runs, it finds descending runs. See the doctests for some examples.
After you’re confident in your solution, modify your code so that when a descending run is found, but
before it is added to the runs variable, you reverse the run in place, so that
it becomes an ascending run. Note that this might result in two adjacent runs that could be merged into
one, but that’s okay.
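One possible shape for this combined detection and reversal is sketched below; your find_runs2 should match the doctests in the starter code, which may differ in details. Note that the descending case uses a strict comparison: treating equal elements as part of a descending run and then reversing them would change their relative order, breaking stability.

```python
def find_runs2(lst):
    """Find maximal runs in lst, treating strictly descending stretches
    as runs and reversing them in place so every reported run is ascending."""
    runs = []
    i = 0
    n = len(lst)
    while i < n:
        start = i
        i += 1
        if i < n and lst[i] < lst[i - 1]:
            # Strictly descending run: extend it, then reverse it in place.
            while i < n and lst[i] < lst[i - 1]:
                i += 1
            lst[start:i] = lst[start:i][::-1]
        else:
            # Ascending (non-decreasing) run: extend it.
            while i < n and lst[i] >= lst[i - 1]:
                i += 1
        runs.append((start, i))
    return runs
```

On the reverse-sorted list from above, this now finds a single run instead of five singletons.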
It turns out that merging many short runs is relatively slow, and it’s actually more efficient to artificially create longer runs by doing some manual sorting.
Modify find_runs2 into find_runs3 to do the following:
We’ve provided an insertion sort algorithm for you, but it’s up to you to read its docstring to determine how to use it correctly.
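One common version of this idea (it’s what the real timsort does) is to enforce a minimum run length: whenever a natural run is too short, extend it to a fixed length and sort the extended slice with insertion sort. The sketch below illustrates that approach with a made-up threshold MIN_RUN and its own insertion sort stand-in; the lab’s find_runs3 spec and the provided insertion sort may differ, so read the docstrings rather than copying this.

```python
MIN_RUN = 8  # hypothetical threshold; not a value specified by the lab

def insertion_sort(lst, start, end):
    """Sort lst[start:end] in place (a stand-in for the provided version)."""
    for i in range(start + 1, end):
        key = lst[i]
        j = i - 1
        while j >= start and lst[j] > key:
            lst[j + 1] = lst[j]
            j -= 1
        lst[j + 1] = key

def find_runs3(lst):
    """Like find_runs, but extend runs shorter than MIN_RUN elements
    using insertion sort, so every reported run is reasonably long."""
    runs = []
    i = 0
    n = len(lst)
    while i < n:
        start = i
        i += 1
        while i < n and lst[i] >= lst[i - 1]:   # natural ascending run
            i += 1
        if i - start < MIN_RUN:                 # too short: lengthen it
            i = min(start + MIN_RUN, n)
            insertion_sort(lst, start, i)
        runs.append((start, i))
    return runs
```

Insertion sort is a good fit here because it is fast on short, partially sorted slices and, like the rest of timsort, it is stable.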
The provided _merge algorithm builds a new list that grows to the same size as the slice being merged,
and then copies its contents back into the original list.
Let’s optimize that a little bit by making the following observation: suppose we want to merge two consecutive runs A and B in the list. We can copy A to a temporary storage, and then just start filling in entries where A used to be (running the standard merge algorithm on B and the copy of A).
This works because the number of “unused” spaces in the list is equal to the number of entries of A still remaining in the temporary storage.
Modify _merge into _merge2 to make use of this observation.
Even with the optimizations done in Tasks 3 and 4, it’s quite possible for the number of runs to grow linearly with the size of the list. If we store all of the runs, this can require a large amount of memory if the input is very large.
So rather than finding all of the runs first and then doing all of the merges, timsort interleaves these two actions to keep the stack small. Specifically, it performs the following check and actions each time a run is pushed onto the stack while the stack has size >= 3:
Spend your remaining lab time designing and implementing timsort2 to use this
approach.
It’s a really nice exercise in design to think about how you would need to modify your work in the previous parts to make them work here, so we strongly encourage you to talk with your partner and your TA before writing any code at all!
You’ll find that even if you completed every part of this lab, your timsort2 algorithm is still
far slower than the built-in list.sort. If you’re interested, the full timsort algorithm is
covered in depth by its creator Tim Peters over here.