Reddit is Making Data Great Again
Back in June, Reddit made changes to the way 3rd parties interact with the site. In the process, the forum-based website incited a riot. Moderators of the forums led a revolt, changing 8K forums from public to private (seriously hurting website traffic, and even causing an outage). They demanded Reddit roll back the changes or face continued sabotage, and the ongoing saga has been like a slow-motion car crash.
The revolt has also sparked heated debate on a couple of important issues:
Is Reddit being stupid?
What does this have to do with AI?
I have thoughts on these I need to get off my chest. But first, let me explain the technical change Reddit made.
Rather than letting 3rd parties interact with Reddit’s website and data for free via APIs, Reddit announced a fee for API use.
What are APIs? Great question.
Understanding APIs
API stands for application programming interface, and APIs are essentially just a way for applications to “talk” to each other, exchanging data and instructions. One example that’s easy to understand comes from Expedia, the travel booking site. When you search for a hotel in Scottsdale, it’s able to show you hotels from a bunch of different brands. Those brands (e.g., Marriott) have their own websites you could search, but Expedia helps speed things up by searching those websites for you. And instead of literally filling in the website search form the way you or I would, Expedia communicates with those websites via an API. In technical parlance, Expedia “calls” the Marriott API, and Marriott sends back a “response”. The “call” says the same kind of stuff that you say when you fill out the form: these dates, this location, this # of rooms. The “response” gives you the results: these options, these prices, etc.
In this relationship between Marriott and Expedia, Marriott gets to decide whether to allow API access and whether they will charge Expedia a fee for it.
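If it helps to see the “call” and “response” idea in code, here is a minimal sketch in Python. Everything in it is made up for illustration: the hotel names, the prices, the field names, and both functions are hypothetical, and this is not how Expedia's or Marriott's real systems work. To keep it runnable without a network connection, the “hotel brand” is just a function in the same file; in real life the call would travel over the internet.

```python
import json

# Hypothetical inventory sitting on the hotel brand's side (made-up data).
HOTEL_INVENTORY = [
    {"hotel": "Example Resort", "city": "Scottsdale", "rooms_left": 4, "price_per_night": 210},
    {"hotel": "Example Suites", "city": "Scottsdale", "rooms_left": 1, "price_per_night": 165},
    {"hotel": "Example Inn", "city": "Phoenix", "rooms_left": 9, "price_per_night": 120},
]


def handle_search_call(request_body: str) -> str:
    """Stand-in for the hotel brand's API: read a JSON call, return a JSON response."""
    call = json.loads(request_body)
    # Keep only hotels in the requested city with enough rooms available.
    options = [
        {"hotel": h["hotel"], "price_per_night": h["price_per_night"]}
        for h in HOTEL_INVENTORY
        if h["city"] == call["location"] and h["rooms_left"] >= call["rooms"]
    ]
    return json.dumps({"options": options})


def search_hotels(location: str, check_in: str, check_out: str, rooms: int) -> list:
    """Stand-in for the travel site: build the call, send it, unpack the response."""
    # The call carries the same details you would type into a search form.
    call = json.dumps(
        {"location": location, "check_in": check_in, "check_out": check_out, "rooms": rooms}
    )
    # Assumption: a direct function call replaces what would be a network request.
    response = handle_search_call(call)
    return json.loads(response)["options"]


results = search_hotels("Scottsdale", "2023-09-01", "2023-09-03", rooms=2)
for option in results:
    print(option["hotel"], option["price_per_night"])
```

Notice that the hotel side decides what comes back. It could just as easily refuse the call or charge for it, which is exactly the lever Reddit pulled.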
So let’s get back to the Reddit example… In the past, they had defined a set of APIs that 3rd party developers could call for free. The 3rd parties could do anything they want with that data… and I mean anything. For example:
Provide upgraded or customized views of Reddit content
Give moderators tools to quickly remove spam
Laser etch the data onto kale, and use the kale to make overpriced smoothies
Why Reddit is Charging for API Usage
When you boil it down, there are really just two reasons.
Reason #1: They want to be profitable
Even though Reddit has been around since 2005, has raised $1.3B+ privately, and has a valuation of ~$10B, it hasn’t turned a profit. Co-Founder and CEO Steve Huffman was even quoted as saying:
“Unlike some of the apps [3rd parties using Reddit APIs], we’re not profitable.”
To change that, Reddit has done a few things to cut costs, including a 5% layoff in June. They probably see the API change as a way to keep cutting costs and bring some of the 3rd party app revenue onto their platform instead. Here’s what I mean:
Reddit will still incur the computing cost of sending responses to each API call, but now each call/response likely generates a small profit
Reddit can build internal features that mimic the functionality of the most popular 3rd party apps, and it could even charge a premium to Reddit users for some of those features
For 3rd party apps that were essentially just upgraded views of Reddit, those 3rd parties had their own ads. Now, Reddit’s ad revenue should go up as viewers go back to the main site.
It’s also worth noting that part of why the mods revolted is that many of the 3rd party apps made mods’ lives easier. It would be smart for Reddit to prioritize mod-friendly features, which it sounds like they’re doing. Here’s another Huffman quote from a leaked internal memo:
“There’s a lot of noise with this one. Among the noisiest we’ve seen. Please know that our teams are on it, and like all blowups on Reddit, this one will pass as well. The most important things we can do right now are stay focused, adapt to challenges, and keep moving forward. We absolutely must ship what we said we would. The only long term solution is improving our product, and in the short term we have a few upcoming critical mod tool launches we need to nail.”
Reason #2: They just discovered oil right under their feet
While large language models have gotten a ton of press, they are basically ubiquitous. Any startup team can grab OpenAI’s latest GPT model or Meta’s new “Llama” model and start building an AI-based solution. The real “oil” is the troves of data that are used to help train the models. That data is sitting inside Reddit’s servers. Until these API changes took effect, Reddit was effectively letting anyone have free training data.
Now, Reddit can take its time licensing access to training data to any companies that want to train models. It makes sense to price training data differently from normal API calls because the value to buyers is very different. AI trainers would be fine paying a lump sum for a single, huge dump of data, whereas the fee on any individual API call will be small.
Wrapping Up
Reddit data is especially appetizing to would-be AI trainers because it’s mostly text-based, which is really helpful to the LLMs (e.g., ChatGPT) that learn to predict the next word in a series of words. But Reddit isn’t the only website with data like that. What are some others?
Twitter has all that tweet data and threaded conversations – sounds delicious.
StackOverflow has technical conversations about coding and example snippets of code – yummy.
Quora has experts answering tricky questions with paragraphs of detailed information – scrumptious.
And how are those companies reacting?
They’re doing the same damn thing. Twitter shut off its free API in April, and that same month StackOverflow announced they’ll start charging for training data. Quora seems to be going a step further: in February they released their own AI chat bot called “Poe”. For now, it’s not trained on their expert Q&A data (they probably need to build some opt-in/out features and communicate the change to their users), but I will be SHOCKED if they don’t start doing that soon.
For years, most of the discussion about valuable data was in relation to advertising models – having more data meant better targeting of ads and higher ad revenue. It’s awesome to see companies making data great again, but as a completely new revenue stream.
Bonus Bullets
Quote of the Week:
“I'm not a big believer in second-guessing decisions.”
— Bob Iger, 2x CEO at Disney
Quick News Reactions:
SF’s 24/7 Robotaxis: Rather than restricted hours, it’s now a free-for-all in San Francisco, and I’m jealous.
YouTube “Samples”: They’re doing a TikTok-style feed of music videos and content from artists. Premium short-form doesn’t work! Heard of Quibi? Exactly.
Twitter (X) is Throttling Traffic: To other social media platforms and news sites. Free speech platform?