How Web Data Restrictions Could Cripple AI—and Your Marketing Efforts!
- Shivendra Lal

- Sep 16, 2024
- 6 min read
Updated: Sep 17, 2024
Recently, the Data Provenance Initiative, a volunteer collective of AI researchers, published a study based on a large-scale audit of the datasets used to train AI models. The results are surprising, and they raise questions about the freshness and contextual relevance of AI-generated content, which is something marketers and businesses should think about. I'm excited to share this with you...
What were the AI researchers looking for?
Artificial intelligence systems are built on massive sets of data sourced from the web. The researchers examined three of the most popular collections of web data. So what were they looking for? They observed that the Internet has become a primary ingredient in training AI, and that, over time, AI developers have run into growing restrictions on accessing web data for training because of ethical and legal concerns. As part of their research, they also looked at what website owners do to tell web crawlers which data they are allowed to capture. In case you're not familiar, web crawlers are bots that visit web pages, grab their content, and store it on their own servers. Search engines like Google, Yahoo!, and Bing all run web crawlers that capture data automatically.
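For context, the main mechanism site owners use for this is robots.txt, a plain-text file served at the root of a domain. Here's a minimal illustrative sketch (the site and paths are made up, not a recommendation):

```
# robots.txt, served at https://example.com/robots.txt (hypothetical site)
User-agent: *        # these rules apply to every crawler
Allow: /blog/        # public content that crawlers may fetch
Disallow: /private/  # content crawlers are asked not to fetch
```

Keep in mind that compliance is voluntary: robots.txt is a request to crawlers, not a technical barrier.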
How was this research of AI models done?
To better understand how AI companies collect their training data, the team did a deep dive on the websites whose content was used to build the three popular collections of web data. The researchers randomly selected web domains from each collection and worked with a team of annotators who recorded specific properties of the selected domains.
They used three popular web data collections, C4, RefinedWeb, and Dolma, for this detailed study. They ranked 3,950 websites by how much text each contributed. In addition, the team picked out 10,000 domains across the three data collections that were especially popular or high-quality.
Among these 10,000 domains, they randomly picked 2,000 for the annotators to describe in detail. These people manually labeled each website for the type of content it hosts, like text and images; its purpose, like news or e-commerce; the presence of paywalls and ads; its terms of service or use; and its metadata. Metadata is additional descriptive information about a page that isn't shown in the page's main content.
Further, they wanted to know how website owners intended to allow their sites to be crawled and their content used, including for training AI models. And not just at a single point in time: they tracked how each site's terms of service and robots.txt changed month by month!
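The paper's own tooling isn't described here, but as a rough sketch of what a recurring robots.txt check could look like, Python's standard library ships a parser for these files. The domain and crawler name below are hypothetical, and this is an assumption about approach, not the study's actual method:

```python
# Sketch: check whether a given crawler is allowed to fetch a URL under the
# site's current robots.txt. Re-running this on the same domains each month
# is one simple way to observe how crawling permissions change over time.
from urllib.robotparser import RobotFileParser

def may_crawl(robots_url: str, user_agent: str, page_url: str) -> bool:
    parser = RobotFileParser()
    parser.set_url(robots_url)
    parser.read()  # downloads and parses the robots.txt file
    return parser.can_fetch(user_agent, page_url)

# Hypothetical example: would "ExampleAIBot" be allowed to fetch a blog post?
if __name__ == "__main__":
    allowed = may_crawl(
        "https://example.com/robots.txt",
        "ExampleAIBot",
        "https://example.com/blog/some-article",
    )
    print("Allowed to crawl:", allowed)
```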
What did the AI researchers find?
The findings are really interesting. Let's go over each one first, and then I'll explain why they give businesses and marketers food for thought.
The amount of data that AI can learn from online is shrinking. The Internet has been a valuable resource for teaching artificial intelligence (AI) systems: because it holds so much information, it has helped AI systems learn a wide range of skills and generate multi-modal content for a ton of use cases. However, some websites are making it harder for AI developers to access their data, and there's a chance AI development will slow down as a result. The research found a 5% jump in website content that web crawlers were blocked from accessing and a nearly 45% increase in website content restricted by terms of service. In other words, websites are becoming more selective about who can access their information, making it harder for AI developers to find good data. And all of this was observed within a single year!
With more restrictions, the training data becomes less accurate and less complete. It also becomes less up-to-date, because fresh material from news sites and forums stops flowing in, leaving the data stale and less relevant; and it becomes harder to scale AI systems, because they lose access to the wide range of perspectives and topics shared on the Internet. The findings also indicate, though not conclusively, that the companies developing these AI systems, like OpenAI and Google, may be ignoring opt-outs when scraping website content, whether those opt-outs are expressed through robots.txt or terms of service.
Better rules are needed for the Internet, so people know what's allowed and what's not, especially when it comes to privacy, sharing information, and online interactions. Today the burden falls on website owners, who must anticipate every possible web crawler in robots.txt and specify whether each one may access and use their content for AI training.
And because website owners who don't want their information used for training algorithms tend to restrict access broadly rather than selectively, these restrictions also affect non-profits, archives, and educational researchers.
The study makes a very astute observation: content on the Internet wasn't made to train AI models. Now that AI companies are using web data to train their models, they're discouraging people from creating their own content, especially since AI can now produce content that's similar to, or even better than, what humans create, which could reduce the value of and demand for human-created content.
Last but not least, the researchers found a mismatch between the use cases AI models like ChatGPT are actually put to and the web data used to train them. In other words, the AI chatbots trained on web data often don't directly compete with their training sources. This isn't conclusive, but it does suggest that how artificial intelligence is used in real-life situations might matter for copyright and fair use. The debate is ongoing, and the research findings reinforce the need for it to continue.
What are the takeaways from the study on web data restrictions for businesses and marketers?
Based on these findings, marketers and businesses might find some food for thought. Let's go through each point one by one.
According to the study, website owners are restricting or limiting access to and use of their content for training AI models. There's a very good chance your robots.txt and terms of service currently let web crawlers access your content. By limiting them, you compromise the discoverability of your brand and your products or services; by allowing them, your website content might appear in the AI Overviews of generative search. There are a few important questions for business owners and marketers here. Is letting web crawlers collect your website content to train AI models aligned with your overall strategy? If yes, do you know which content you'd like web crawlers to access and capture? If not, can you give up your ranking in generative search results, and how would you compensate for that loss of search visibility?
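To make that trade-off concrete, here is a hedged sketch of a robots.txt that keeps ordinary search indexing open while opting out of AI training crawlers. The user-agent tokens shown (GPTBot for OpenAI, Google-Extended for Google's AI training, CCBot for Common Crawl) were publicly documented around the time of writing; the list is illustrative, not exhaustive, which is exactly the burden the study highlights.

```
# Hypothetical robots.txt: keep search indexing, opt out of AI training crawlers.

User-agent: Googlebot        # regular search indexing stays allowed
Allow: /

User-agent: GPTBot           # OpenAI's crawler
Disallow: /

User-agent: Google-Extended  # token controlling use of content for Google's AI models
Disallow: /

User-agent: CCBot            # Common Crawl, whose corpus feeds many training sets
Disallow: /
```

Any crawler not named here falls back to whatever default rules you set (for example under User-agent: *), so this approach only covers bots you know about and that choose to comply.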
It's possible that limits on the use of web domain data will make training data less accurate, less current, and less diverse, which could have a big impact on AI-driven content marketing strategies. Additionally, the study points out gaps between the use cases AI chatbots serve and the underlying training data. That might mean marketers need to rethink how they review AI-generated content: publishing inaccurate, outdated, or contextually limited content could hurt their brand. As a side note, this reinforces Google's thinking behind the March 2024 Core Update.
Next, the study points to a possible decline in motivation to produce human-generated content. But there's another side to this: if the trend is toward restricting access to fresh data for training AI, there's an opportunity in creating more human-generated content. Does that mean businesses relying on AI-generated content for their marketing need to reevaluate their budget for human-generated content? Is AI-generated content still the right choice once we factor in the possibility of less accurate, less fresh, and less diverse training data?
Also, how do businesses and marketers plan to use AI without compromising on transparency? Will it affect their brand perception and presence? Last but not least, there's no standardized way to communicate consent, whether through robots.txt or terms of service, and there have been recent accusations that AI crawlers are accessing content even when websites explicitly disallow them in robots.txt. Marketers need to figure out how to communicate, and enforce, crawler consent preferences for AI training that are fully aligned with their business priorities. The study suggests an alternative: a standardized system for defining different use cases for website content and how that content can be used once it's accessed. That way, website owners could decide who uses their content and for what purpose.
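On the enforcement side, one low-tech check is to look at your server access logs for requests from known AI crawler user agents. A minimal sketch is below; the bot names and log location are assumptions you'd adapt to your own setup, not a definitive audit.

```python
# Sketch: count requests from known AI crawler user agents in a web server
# access log, as a rough signal of whether stated opt-outs are being respected.
# Bot names and the log file path are illustrative assumptions.
import re
from collections import Counter

AI_BOT_NAMES = ["GPTBot", "CCBot", "ClaudeBot"]
BOT_REGEX = re.compile("|".join(AI_BOT_NAMES), re.IGNORECASE)

def count_ai_bot_hits(log_path: str) -> Counter:
    hits = Counter()
    with open(log_path, encoding="utf-8", errors="replace") as log:
        for line in log:
            match = BOT_REGEX.search(line)
            if match:
                hits[match.group(0)] += 1  # tally by the matched user-agent token
    return hits

# Hypothetical usage:
# print(count_ai_bot_hits("/var/log/nginx/access.log"))
```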
Web data restrictions are a complex problem with far-reaching implications for businesses and marketers. Based on the takeaways from this AI research, marketing organizations can develop better strategies and tactics to mitigate any potential impact on their outreach and achieve their marketing goals. It's important to remember that this study focuses on the challenges of collecting data for training AI models, not necessarily the challenges of content discoverability for users. Each search engine and social media platform has its own algorithm for ranking and surfacing content, and these algorithms are constantly changing. In order to stay competitive, marketers need to stay up-to-date on the latest technological developments in content creation and discovery.




