The last three weeks has been amazing. Thanks for teaching us and tolerating us and believing in us and making cookies for us… You’re the best.
Tell me about your favorite type of cookie and any dietary restrictions you have!
Here are the resources for today’s lecture:
Today’s lab is meant to just be a simple review of the day’s activities with most of the day left open for projects
Part 0: Variable Bit Encoding for Information Retrieval
Although I stressed it in class, it is worth saying again: IR is very concerned with space efficiency! So much so that significant work goes into the careful compression of postings lists. Consider, for example, the Reuter’s corpus, a commonly used and cited collection of around 800,000 documents. In general, each document in this corpus has around 200 tokens per document, 6 characters per token, and 100,000,000 postings. Uncompressed, this gives a corpus size of nearly 960MB and a postings list of 250MB.
Given a positional index
we see that storing the gaps between the positions of term incidences in a document often means storing much smaller numbers. In these cases we can use fewer bits to encode the difference, but not in all cases. We need a variable bit encoding in order to store the odd large gap, here and there.
Variable Byte Encryption:
Recall that an integer is stored as four bytes. We will pretend that they are unsigned, or only taking on non-negative values. This means that we can use them to represent numbers from to . However, we could certainly use fewer bits for numbers like 10 or 15 or even 511. A simple method for this is variable byte encryption. Given a number, we get its variable byte encryption by writing it first in binary, padding it at the front with 0s so that the number of bits used is a multiple of seven, then chunking splitting the bits into groups of seven. The first group has a 1 appended to it to show that it is the first byte of the representation, then all other groups have a 0 appended to them. Take for example, 2056 is 100000001000 in binary. We chunk this as 10000 0001000 then pad it to 0010000 0001000. Finally we get 10010000 00001000, a two byte representation for 2056.
Write the following numbers in variable byte notation. Note that Google can convert numbers to binary for you!
For example, the number 15 is 1111 in binary, so it’s gamma code is 1110111.
What is the following sequence of gaps?
How many bits would we have needed to encode this with the standard 4 byte integer encoding?
A cool idea from the IR textbook.
codes are relatively inefficient for large numbers (e.g., 1025 in Table 5.5 ) as they encode the length of the offset in inefficient unary code. codes differ from codes in that they encode the first part of the code (length) in code instead of unary code. The encoding of offset is the same. For example, the code of 7 is 10,0,11 (again, we add commas for readability). 10,0 is the code for length (2 in this case) and the encoding of offset (11) is unchanged. (i) Compute the codes for the other numbers in Table 5.5 . For what range of numbers is the code shorter than the code? (ii) code beats variable byte code in Table 5.6 because the index contains stop words and thus many small gaps. Show that variable byte code is more compact if larger gaps dominate. (iii) Compare the compression ratios of code and variable byte code for a distribution of gaps dominated by large gaps.
Part 1: ImageClusterer (optional)
Recall the demos that we saw of using clustering to find image features? We are going to do that ourselves! Get excited to implement k-means clustering! Download the starter code here. Read through the project and look at the part you need to implement, the clusterPixels() method. The general steps are outlined in the comments, but as far as actual implementation is concerned, you need to think carefully about a few parts, especially how you will keep track of which pixels wind up clustered to which centers. Remember, this is totally optional! If you want to devote your time to your final project, work on that instead.
Part 2: Final Projects
Part 3: Daily Feedback
Your friend Ramsey invites you to a potluck at his house. Wanting to know how many people you’ll need to cook for, you ask about how many people will be there. Rather than give a straight answer, Ramsey tells you that the number of attendees will be such the smallest number such that he is guaranteed that among the guests, either:
- At least three of the guests will bring the same type of food (entree, appetizer, drink, dessert, etc…), or
- At least three of the guests will bring three different types of food
How many guests will there be?
For posterity, here is record of lab being fully work on final projects and play catch-up!
You are sending four bits over a network to a friend, but you suspect that the evil bit flipper man might try to flip a bit. To be cautious, you allow yourself three extra bits to help encode the values of the original four. Given that the evil bit flipper man may flip one of your bits, devise a scheme for using the seven bits to make sure that your friend receives the four original bits, regardless of which of the seven bits the evil bit flipper man might flip.