cross-posted from: https://lemmy.world/post/76533
One of the arguments made for Reddit’s API changes is that Reddit is now the go-to place for LLM training data (e.g. for ChatGPT).
I haven’t seen a whole lot of discussion around this and would like to hear people’s opinions. Are you concerned about your posts being used for LLM training? Do you not care? Do you prefer that your comments are available to train open source LLMs?
(I will post my personal opinion in a comment so it can be up/down voted separately)
My posts are going to be used for LLM training regardless.
Reddit has every right to charge for their API, but the amount they wanted to charge was too high.
Other use cases aren’t relevant here either. They could have come to an agreement with Apollo, etc., that charged them reasonable rates while charging more to data scrapers. They could have done ads and dev share on the mobile apps. Most people wouldn’t have objected to that.
That part’s not a Reddit-specific problem though. I’ve seen a similar pattern play out at several companies I’ve worked for:
I think another huge problem that you didn’t mention was the timeframe. Had they given the apps even 6 months from announcing the price, they might have been able to pivot to subscriptions. The short timeframe (combined with the gaslighting from the CEO) makes it hard to want to try, though.
Scraping open content is OK. Search engines have been doing that, it’s their main job.
LLMs wouldn’t exist without large inputs, hehe, and the internet is a good source for a big volume of language, most of which even makes sense.
I don’t feel like Reddit should be against LLMs, ignoring their bogus claims. At least I hope GitHub doesn’t share private and licenced repos.
I was wondering if someone would bring up search engine indexing. Google certainly has the upper hand for LLM training data with Reddit’s new API change, since they have the comments anyway. This is a big reason I fear these API changes: they very much concentrate power in the hands of already powerful companies.
Always has been meme.jpg
I really don’t think Reddit made this change because of AI; it’s just for the IPO, trying to pump and dump it sky high.
It’s really sad when you imagine what we could do as a species, if we could work together instead of trying to one-up each other.
It kind of brings me back to decentralized services, which for me is the ultimate freedom model, and I’m loving this alternative to Reddit.
I think the claim is nonsense. If that were their concern they would rather change the usage agreement and maybe take some of them to court.
What they actually did was everything in their power to drive mobile users to their own mobile app. They want old-fashioned user tracking data for advertising and for selling on, together with more in-app ads.
I totally agree that Reddit’s motivation is probably not related to LLMs and the link I posted is more of an excuse than anything. However, I am curious what people think about data scraping and LLMs in general.
I hope cross posts are OK. I am curious about Experienced Devs’ perspective on this as well, since the question is rather technical.
Copying my opinion from the other thread in case you don’t want to look it up:
I’d tend to agree. There are enough barriers to training large models without artificially increasing them just because the largest players can afford it.