Contact information

Hi all. Please fill out this form so I can get your contact info.

Thank you Ben!

The last three weeks has been amazing. Thanks for teaching us and tolerating us and believing in us and making cookies for us…  You’re the best.

Puzzle 12: The most difficult puzzle!

Tell me about your favorite type of cookie and any dietary restrictions you have!

Better PageRank Paper

Better PageRank Paper

Here is a link to the better paper describing PageRank which I found in my search for good information about TrustRank for my final project.

Resources for Lecture 12

Howdy gang!

Here are the resources for today’s lecture:


Lab 11: Information Retrieval and ImageClusterer

Today’s lab is meant to just be a simple review of the day’s activities with most of the day left open for projects

Part 0: Variable Bit Encoding for Information Retrieval

Although I stressed it in class, it is worth saying again: IR is very concerned with space efficiency! So much so that significant work goes into the careful compression of postings lists. Consider, for example, the Reuter’s corpus, a commonly used and cited collection of around 800,000 documents. In general, each document in this corpus has around 200 tokens per document, 6 characters per token, and 100,000,000 postings. Uncompressed, this gives a corpus size of nearly 960MB and a postings list of 250MB.

Given a positional index

encoding postings list
the docIDs 283042 283043 283044 283045
gaps 1 1 1
computer docIDs 283047 283154 283159 283202
gaps 107 5 43
arachnocentric docIDs 252000 500100
gaps 252000 248100

we see that storing the gaps between the positions of term incidences in a document often means storing much smaller numbers. In these cases we can use fewer bits to encode the difference, but not in all cases. We need a variable bit encoding in order to store the odd large gap, here and there.

Variable Byte Encryption:

Recall that an integer is stored as four bytes. We will pretend that they are unsigned, or only taking on non-negative values. This means that we can use them to represent numbers from 0 to 2^{32}-1. However, we could certainly use fewer bits for numbers like 10 or 15 or even 511. A simple method for this is variable byte encryption. Given a number, we get its variable byte encryption by writing it first in binary, padding it at the front with 0s so that the number of bits used is a multiple of seven, then chunking splitting the bits into groups of seven. The first group has a 1 appended to it to show that it is the first byte of the representation, then all other groups have a 0 appended to them. Take for example, 2056 is 100000001000 in binary. We chunk this as 10000 0001000 then pad it to 0010000 0001000. Finally we get 10010000 00001000, a two byte representation for 2056.

Write the following numbers in variable byte notation. Note that Google can convert numbers to binary for you!

  1. 1
  2. 1023
  3. 12345678
What is the smallest number for which variable byte encoding is less efficient than standard four byte notation? In other words, for what number n does the variable byte representation of n require more bits than the standard four byte representation?
Variable Bit Codes:
The simplest variable bit code is unary coding. The number x is represented by x 1s followed by a single 0. This clearly not optimal! However, it does allow us to specify a number of bits to read. We can now represent each positive integer (and note that we don’t need zero if we are only representing gaps) as its binary encoding without a leading 1, preceded by the length of of this representation in unary. This is called a gamma code.

For example, the number 15 is 1111 in binary, so it’s gamma code is 1110111.


What is the following sequence of gaps?



How many bits would we have needed to encode this with the standard 4 byte integer encoding?


A cool idea from the IR textbook.

$\gamma $ codes are relatively inefficient for large numbers (e.g., 1025 in Table 5.5 ) as they encode the length of the offset in inefficient unary code. $\delta$ codes differ from $\gamma $ codes in that they encode the first part of the code (length) in $\gamma $ code instead of unary code. The encoding of offset is the same. For example, the $\delta$ code of 7 is 10,0,11 (again, we add commas for readability). 10,0 is the $\gamma $ code for length (2 in this case) and the encoding of offset (11) is unchanged. (i) Compute the $\delta$ codes for the other numbers in Table 5.5 . For what range of numbers is the $\delta$ code shorter than the $\gamma $ code? (ii) $\gamma $ code beats variable byte code in Table 5.6 because the index contains stop words and thus many small gaps. Show that variable byte code is more compact if larger gaps dominate. (iii) Compare the compression ratios of $\delta$code and variable byte code for a distribution of gaps dominated by large gaps.

Part 1: ImageClusterer (optional)

Recall the demos that we saw of using clustering to find image features? We are going to do that ourselves! Get excited to implement k-means clustering! Download the starter code here. Read through the project and look at the part you need to implement, the clusterPixels() method. The general steps are outlined in the comments, but as far as actual implementation is concerned, you need to think carefully about a few parts, especially how you will keep track of which pixels wind up clustered to which centers. Remember, this is totally optional! If you want to devote your time to your final project, work on that instead.

Part 2: Final Projects

Part 3: Daily Feedback

Puzzle 10: Ramsey’s party

Your friend Ramsey invites you to a potluck at his house. Wanting to know how many people you’ll need to cook for, you ask about how many people will be there. Rather than give a straight answer, Ramsey tells you that the number of attendees will be such the smallest number such that he is guaranteed that among the guests, either:

  • At least three of the guests will bring the same type of food (entree, appetizer, drink, dessert, etc…), or
  • At least three of the guests will bring three different types of food

How many guests will there be?

Lab 10: Final Project time

For posterity, here is record of lab being fully work on final projects and play catch-up!

Puzzle 9: An error correcting code

You are sending four bits over a network to a friend, but you suspect that the evil bit flipper man might try to flip a bit. To be cautious, you allow yourself three extra bits to help encode the values of the original four. Given that the evil bit flipper man may flip one of your bits, devise a scheme for using the seven bits to make sure that your friend receives the four original bits, regardless of which of the seven bits the evil bit flipper man might flip.

A good walkthrough of k-means by the esteemed Sebastian Thrun