I wrote this tweet about my Amazon internship in 2007 as I rolled out of bed yesterday morning and it went viral.
That internship was my first work experience in software and I got incredibly lucky to be placed on the Similarities team.
One could do way worse as a 20-year-old landing an internship at Amazon: the Similarities team was doing cutting-edge recommendations and experimentation work, I got to deploy and see feedback from my work on-site every week, and I was paired with a kind and helpful mentor on a great team.
My experience that summer is the reason I continued doing data & systems work in college and after I graduated.
The Similarities team was responsible for producing the dataset that powered "Customers who bought this also bought". The dataset was also used in personalized recommendations they'd surface to customers elsewhere: on-site and in emails.
~20% of revenue was attributed to similarities at the time if I remember correctly (my mentor was kind enough to terrify me with an estimate of the revenue Amazon lost from an outage I initiated). So they were an effective and important part of the site. But there were still a number of unintuitive similarities that would appear in the "Customers who bought this also bought" widget.
While there was a long tail of random issues in different product categories (Amazon had begun a rapid expansion beyond books and was working on fixing them), the biggest problem, by far, was Harry Potter.
More specifically, in the summer of 2007 it was Harry Potter and the Deathly Hallows.
It would show up as a similarity everywhere. Like, you'd be on Amazon buying a mop and the similarities widget would show a recommendation for Harry Potter and the Deathly Hallows, followed by Pine-Sol and a bucket.
The team computed item-to-item similarities using collaborative filtering with customer order baskets as the input. Put simply, if customers bought products A and B together frequently enough, then Amazon would present A and B as similar items.
But what happens when the same product appears in virtually every order basket? The Harry Potter problem.
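To make the failure mode concrete, here is a minimal sketch of item-to-item similarity built from raw basket co-occurrence counts. It is an illustration only, not the algorithm Amazon ran: the function name `cooccurrence_similarities`, the `top_n` parameter, and the baskets are all made up, and a real system would presumably normalize for item popularity in some way.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cooccurrence_similarities(baskets, top_n=3):
    """Rank, for each item, the items most often bought in the same basket."""
    pair_counts = defaultdict(Counter)
    for basket in baskets:
        # Count each unordered pair once per basket, in both directions.
        for a, b in combinations(sorted(set(basket)), 2):
            pair_counts[a][b] += 1
            pair_counts[b][a] += 1
    return {
        item: [other for other, _ in counts.most_common(top_n)]
        for item, counts in pair_counts.items()
    }


# Hypothetical baskets: one bestseller rides along with unrelated orders.
baskets = [
    ["mop", "pine-sol", "harry-potter"],
    ["mop", "bucket", "harry-potter"],
    ["mop", "harry-potter"],
    ["litter", "scoop", "harry-potter"],
]
similar = cooccurrence_similarities(baskets)
print(similar["mop"][0])  # the bestseller outranks Pine-Sol and the bucket
```

Because the bestseller co-occurs with everything, unnormalized counts push it to the top of every list, which is the Harry Potter problem in miniature.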
The Similarities team deployed me on one idea they had to address the Harry Potter problem: could they use feedback from users to cull unintuitive similarities?
Amazon got feedback in two ways:
- Implicitly: clickstream and conversion data from the similarities widget ("Customers who bought this also bought")
- Explicitly: users could provide feedback on personalized recommendations they received in emails/on-site. I don't know if this still exists.
So I spent the summer trying to use the output of Amazon's collaborative filtering algorithm + all this clickstream/conversion/feedback data to address unintuitive similarities. The thinking was that if a similarity was unintuitive, then presumably it'd underperform by some measure based on user feedback.
My primary target was Harry Potter and the Deathly Hallows: it stuck out like a sore thumb and it was an easy way to see if an approach was working qualitatively.
Amazon similarities were served by a Berkeley DB (BDB) file at the time. BDBs are embedded key-value stores – a file you can ship around with a format optimized for key-value lookups. Amazon would crunch numbers and emit a new similarities BDB nightly or weekly (I don't remember which).
The similarities BDB mapped ASINs (product IDs in Amazon parlance) to lists of ASINs, like this:
That's the ASIN for this great cat litter mapped to different sizes of pee pad refills in the "Compare with similar items" section. This is also an example of an unintuitive similarity: if I use clumping litter for my cat, it is unlikely that I will use pee pads too.
So each week I'd write a Perl script to crunch clickstream/conversion/feedback data and then remove or reorder some of the mappings in the BDB file based on my algorithm that week. We'd now have two sets of similarities:
- A: the similarities emitted by Amazon's collaborative filtering algorithm
- B: vikrams_algorithm_of_the_week(A), Amazon's similarities from A with mappings removed or reordered using whatever approach I was trying.
Then I'd send an email to the team with a link to a CGI script I wrote that allowed us to qualitatively assess B against A. It was a little web page with an input box at the top where you could enter an ASIN and it would show you similarities from A compared to similarities from B.
We'd exchange emails about the quality of my similarities or talk about them in a meeting. Then my mentor would decide whether or not we would push it to production.
Most weeks we'd push something to production and A/B test it. I don't know what percentage of the site saw my similarities, but I suspect it was low: it would take a few days for us to get conclusive results and even in 2007 Amazon got tons of traffic.
I threw a lot of things at the wall.
Two approaches that I remember relied on the idea that users will simply click more on items on the left side of a page. It is well-known that page position is a massive determinant of clickthrough rates. The "Customers who bought this also bought" widget was laid out from left to right, so items in the first slot had a baked-in "boost".
So, if that is true yet we see a similarity in the first slot "underperform", maybe we should reorder it. Here are two different ways I did that:
- A basic approach: if an ASIN in slot 1 has a lower clickthrough rate than an ASIN in slot 2, swap slots 1 and 2.
- A more complicated approach: if an ASIN in slot 1 has a statistically significantly lower clickthrough rate than the ASIN in slot 2, swap slots 1 and 2.
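The basic approach can be sketched in a few lines. This is a hypothetical reconstruction, not the original Perl: the function name, the `ctr` dictionary (ASIN to clickthrough rate), and the placeholder ASIN strings are all assumptions.

```python
def swap_if_slot1_underperforms(similar_asins, ctr):
    """Return a copy with slots 1 and 2 swapped when slot 1 has the lower CTR."""
    if len(similar_asins) < 2:
        return list(similar_asins)
    first, second = similar_asins[0], similar_asins[1]
    # ASINs with no recorded clicks are treated as having a CTR of 0.
    if ctr.get(first, 0.0) < ctr.get(second, 0.0):
        return [second, first, *similar_asins[2:]]
    return list(similar_asins)


# Placeholder ASINs and made-up clickthrough rates.
slots = ["ASIN_POTTER", "ASIN_PINESOL", "ASIN_BUCKET"]
ctr = {"ASIN_POTTER": 0.01, "ASIN_PINESOL": 0.04, "ASIN_BUCKET": 0.03}
reordered = swap_if_slot1_underperforms(slots, ctr)
```

Since slot 1 gets a positional boost, an item that still loses to slot 2 on raw clickthrough rate is a reasonable candidate for demotion.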
I don't remember the specifics for the more complicated approach. It might have been something like:
- Get the difference in clickthrough rate between slots 1 and 2.
- See where it falls on the distribution of "difference in clickthrough rate between slots 1 and 2".
- Decide it's underperforming if it is some distance away from the mean.
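One way to read those three steps is as a z-score test on the slot 1 minus slot 2 clickthrough gap. The sketch below is a guess at the shape of the idea, not the original implementation: the function name, the `threshold` of two standard deviations, and the sample numbers are all assumptions.

```python
from statistics import mean, stdev


def slot1_underperforms(ctr_slot1, ctr_slot2, observed_diffs, threshold=2.0):
    """Flag a pair whose slot-1-minus-slot-2 CTR gap sits far below the mean gap.

    observed_diffs holds the same gap measured across many other widgets
    and needs at least two values.
    """
    mu = mean(observed_diffs)
    sigma = stdev(observed_diffs)
    if sigma == 0:
        return False  # no spread in the data, so nothing counts as an outlier
    z = ((ctr_slot1 - ctr_slot2) - mu) / sigma
    return z < -threshold


# Made-up gaps: slot 1 usually beats slot 2 by about 2.5 points of CTR.
typical_diffs = [0.020, 0.025, 0.030, 0.015, 0.035]
flagged = slot1_underperforms(0.010, 0.030, typical_diffs)
not_flagged = slot1_underperforms(0.050, 0.030, typical_diffs)
```

A pair flagged this way would then have its slots swapped, as in the basic approach.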
I spent the entire summer trying stuff like this and published a log of what I did on Amazon's internal wiki.
Some takeaways I recall:
- None of these approaches improved conversions over the baseline: I do remember thinking my similarities were better in some of our qualitative assessments, but, site-wide, customers voted with their $ and showed that they were worse.
- Simple approaches are better than complicated ones: any time I got fancy, like in the more complicated approach above, the performance of my similarities tanked.
- Invalidating a bunch of approaches and documenting them is still useful: with this work, the team had a set of hypotheses they could either ignore or approach again in the future with more clarity. It's a great project to deploy an intern to.
- None of these approaches came close to solving the Harry Potter problem: that book's Amazon detail page will be forever etched in my brain.
There are many wildly qualified people who have worked on similarities and recommendations at Amazon, Netflix and elsewhere over the last 15 years. Greg Linden started and led a lot of the personalization work at Amazon and has written about some of it on his blog.
I don't know if Greg was there in 2007. My mentor and the team's manager during my internship keep a much lower profile, but were also extremely talented.
👋 to Wes & Brent if you see this! Thank you for setting me up with a delightful and impactful experience 16 years ago.