A Glimmer of Hope in the AI Replacing Jobs Argument
Published on September 4, 2023
News stories continue to emerge claiming that AI is going to take away most jobs. Generative AI can do the work of multiple writers, so the argument goes. AI can replace artists, voice-over talent, actors, and so on.
But here's an interesting twist. A research paper submitted to arXiv suggests that as successive models are trained on the output of previous generations of generative AI, the information becomes increasingly diluted. In other words, when the training content itself comes from AI, the models degrade, a process the authors call model collapse.
[2305.17493v2] The Curse of Recursion: Training on Generated Data Makes Models Forget (arxiv.org)
Disclaimer: arXiv is a repository for research papers. The organization (run by Cornell University) does not peer review submissions, although articles are moderated for completeness. It's important to keep that in mind when weighing the content of the research. That doesn't necessarily make it wrong, but take it for what it's worth.
If you decide to forge ahead and read the research, be forewarned that it is a bit mathematically intensive. I have not gotten through the entire paper yet, but I did submit it to ChatGPT for a summary. Here is the result:
----------- Start of ChatGPT results --------------
This paper explores the phenomenon of "Model Collapse" in the context of generative models, particularly Large Language Models (LLMs) like GPT. Model collapse is a process where generative models progressively lose the ability to represent the true underlying data distribution when they are trained on data generated by their predecessors. The paper makes several key points:
1. **Background**: The paper begins by discussing the importance of human communication online, mentioning the role of search engines and smart text editors. It highlights that LLMs, such as GPT-4, are capable of generating text that is often indistinguishable from human-written text.
2. **Model Collapse**: The paper introduces the concept of model collapse, which is a degenerative process observed in generative models. It occurs as models are trained on data generated by previous generations of models, leading to a progressive loss of information about the true data distribution.
3. **Causes of Model Collapse**: The paper identifies two primary causes of model collapse:
- **Statistical Approximation Error**: This error arises due to the finite number of samples used in training, leading to the loss of information.
- **Functional Approximation Error**: This error stems from limitations in the expressiveness of the function approximators (e.g., neural networks) used in the models.
4. **Discrete Distributions**: The paper discusses the case of discrete probability distributions and how model collapse affects them. It explains that model collapse in this scenario is primarily due to statistical errors.
5. **Single-Dimensional Gaussian**: The paper explores the case of a single-dimensional Gaussian distribution. It shows that even with perfect functional approximation (i.e., fitting a Gaussian to the data), model collapse can still occur due to sampling errors. Variance increases linearly with each generation.
6. **Wasserstein Distance**: The paper suggests using the Wasserstein-2 distance as a measure to quantify the distance between the true distribution and the approximated distribution at each generation.
7. **Implications**: The paper concludes by discussing the broader implications of model collapse. It emphasizes the importance of having access to genuine human-generated content to avoid model collapse, particularly when the tails of the data distribution matter. It suggests that the use of LLMs to generate content on the internet can contaminate training data with model-generated data.
In summary, the paper presents a theoretical exploration of model collapse in generative models, illustrating how both statistical and functional approximation errors contribute to the phenomenon. It emphasizes the need for careful consideration of data sources when training generative models to avoid model collapse.
----------- End of ChatGPT results --------------
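If you want an intuitive feel for the single-dimensional Gaussian example (point 5 in the summary above), here is a small Python sketch of my own, not code from the paper. It repeatedly fits a Gaussian to samples drawn from the previous generation's fitted Gaussian and tracks the drift from the true distribution using the closed-form Wasserstein-2 distance between two one-dimensional Gaussians. The sample size, number of generations, and random seed are arbitrary choices for illustration.

```python
# Toy simulation of "model collapse" for a 1-D Gaussian (my own sketch, not the paper's code).
# Each generation trains only on data generated by the previous generation's fitted model.

import numpy as np

rng = np.random.default_rng(seed=0)

true_mu, true_sigma = 0.0, 1.0     # the "real" data distribution
n_samples = 100                    # finite sample size -> statistical approximation error
n_generations = 50

mu, sigma = true_mu, true_sigma
for gen in range(1, n_generations + 1):
    # Draw training data from the previous generation's model, then re-fit a Gaussian to it.
    data = rng.normal(mu, sigma, size=n_samples)
    mu, sigma = data.mean(), data.std()

    # Wasserstein-2 distance between two 1-D Gaussians:
    # W2^2 = (mu1 - mu2)^2 + (sigma1 - sigma2)^2
    w2 = np.sqrt((mu - true_mu) ** 2 + (sigma - true_sigma) ** 2)
    if gen % 10 == 0:
        print(f"generation {gen:3d}: mu={mu:+.3f}, sigma={sigma:.3f}, W2={w2:.3f}")
```

Run it a few times with different seeds: in most runs the fitted mean wanders and the fitted standard deviation drifts, so the distance from the original distribution tends to grow as each generation compounds the sampling errors of the one before it.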
The research seems to conclude that humans will continue to be needed in the creative process for the models to "stay fresh." Even though the summary already suggested as much, I decided to ask ChatGPT the following question:
Me: Does the paper suggest that we will continue to require human-created content to keep the information fresh for the models?
----------- Start of ChatGPT results --------------
Yes, the paper suggests that access to genuine human-created content is essential to keep the information fresh for generative models and to avoid model collapse. The paper emphasizes that when models are trained on data generated by their predecessors, they can progressively lose touch with the true underlying data distribution. This means that relying solely on model-generated content for training new generations of models can lead to a degradation in the quality and diversity of the generated content.
To prevent or mitigate model collapse, the paper suggests that continuing to incorporate real human-generated data into the training process is crucial. Human-generated content serves as a valuable source of diverse and authentic data that can help maintain the models' ability to generate high-quality, relevant, and up-to-date content.
In essence, the paper highlights the ongoing importance of human-created content in training and fine-tuning generative models to ensure that they remain reliable and capable of producing fresh and contextually accurate output.
----------- End of ChatGPT results --------------
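To illustrate that point about keeping genuine data in the mix, here is a variant of the earlier sketch. Again, this is my own toy experiment, not one from the paper: each generation's training set blends a hypothetical 20% of fresh samples from the true distribution with 80% model-generated samples.

```python
# Variant of the earlier toy simulation: each generation mixes some genuine data
# with model-generated data (my own illustration of the mitigation the paper points to).

import numpy as np

rng = np.random.default_rng(seed=1)

true_mu, true_sigma = 0.0, 1.0
n_samples = 100
n_generations = 50
real_fraction = 0.2                # hypothetical share of genuine data per generation

mu, sigma = true_mu, true_sigma
for gen in range(1, n_generations + 1):
    n_real = int(real_fraction * n_samples)
    real = rng.normal(true_mu, true_sigma, size=n_real)          # genuine data
    synthetic = rng.normal(mu, sigma, size=n_samples - n_real)   # model-generated data
    data = np.concatenate([real, synthetic])
    mu, sigma = data.mean(), data.std()

    w2 = np.sqrt((mu - true_mu) ** 2 + (sigma - true_sigma) ** 2)
    if gen % 10 == 0:
        print(f"generation {gen:3d}: mu={mu:+.3f}, sigma={sigma:.3f}, W2={w2:.3f}")
```

In my runs, the drift from the true distribution stays far more contained than in the pure self-training loop above, which matches the intuition behind the paper's emphasis on genuine human-generated content.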
AI is here to stay, and it will improve remarkably over the next few years. There may even come a time when AI becomes so human-like that it can supply the diversity in the "gene pool" of content on its own. Who knows? But at least in the short term, it seems that human-generated content is still held in high regard.
Best Regards,
Jim
