August 31, 2024
By Burton Kelso
The Technology Expert
AI is a wonderful tool, but it isn’t perfect. Along with its internal databases, AI scrapes information from websites all over the planet, including your personal and business websites, to get the data it needs to answer your questions. If you don’t want AI to use your website and intellectual property to train its large language model, here’s what you need to know.
What are AI scrapers? Bots are crawling all over the Internet. The best-known are search engine bots, which collect data to index websites and determine where they rank on Google and other search engines. Spam bots are designed to harvest data such as your email address. Bots used with generative AI tools such as Copilot, Gemini, and ChatGPT collect your website’s information and use that data to train their AI to respond to your text and photorealistic image prompts. This is controversial because web creators aren’t happy about the intellectual property and privacy violations that come with bots taking that information.
Should you block AI from scraping your website? On one hand, if you allow AI to scrape your website, you are helping train AI and make it better. On the other hand, you could see your website content show up on other websites without credit to you. It’s important to know that Google-Extended is separate from Google’s search crawler: blocking it will not stop Google’s SGE from crawling your website, and it poses no risk to your organic search rankings.
How to stop AI bots from scraping your website. There are multiple ways to try to stop AI bots from scraping your website. Some of these suggestions may require advanced knowledge of web design. You might have to consult with your website creator to implement some of these tips.
Change the settings in your site builder tools. If you or your web designer created your website with Wix, Squarespace, GoDaddy, or another website builder tool, you can go into settings and instruct your website to block AI scraping.
Using the robots.txt protocol. AI developers say their crawlers honor robots.txt, a file that lets you tell AI crawlers not to scrape data from your website. You can add a robots.txt file to your website with the following lines.
User-agent: GPTBot
Disallow: /
This tells GPTBot, the crawler OpenAI uses to collect training data for ChatGPT, not to crawl any page on your website. To block only specific pages or subfolders, replace the / with the path you want to block.
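If you want to check that a rule does what you expect before you publish it, here is a minimal sketch using Python’s built-in urllib.robotparser module. It assumes OpenAI’s crawler identifies itself as GPTBot; the /members/ folder and the example.com address are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: keep GPTBot out of the /members/ folder only.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /members/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # example.com is a placeholder domain, not a real site to test against.
    for path in ("/members/profile.html", "/blog/post.html"):
        url = "https://www.example.com" + path
        print(path, is_allowed(ROBOTS_TXT, "GPTBot", url))
```

Run as a script, this should report that the /members/ page is off limits to GPTBot while the blog post is still allowed. Keep in mind that it only checks what the rules say; it cannot force a bot to obey them.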
You have to add a separate rule for each AI crawler. If you want to stop Google from using your website to train Gemini, use the following lines in your robots.txt file.
User-agent: Google-Extended
Disallow: /
Blocking other AI bots
If you’d rather keep your website information away from other brands of AI, then you may also want to consider Common Crawl. Common Crawl is one of the largest datasets used for AI training, and ChatGPT and other large language models draw on it. Because of this, its crawler, CCBot, is the second most blocked AI bot.
As with GPTBot and Google-Extended, you can prevent CCBot from scraping your content by using the robots.txt exclusion protocol. Add the lines below to your robots.txt file to stop its crawling activities:
User-agent: CCBot
Disallow: /
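Since every crawler needs its own rule, typing them by hand gets tedious as the list grows. Here is a small sketch of how you might generate the file instead. The function name build_robots_txt is my own invention, and GPTBot is assumed to be the token for OpenAI’s crawler.

```python
# Crawler tokens discussed in this article. GPTBot is assumed to be the
# token for OpenAI's crawler; add or remove names to suit your site.
AI_CRAWLERS = ["GPTBot", "Google-Extended", "CCBot"]


def build_robots_txt(user_agents: list[str], path: str = "/") -> str:
    """Return robots.txt text with one Disallow group per user agent."""
    groups = [f"User-agent: {agent}\nDisallow: {path}\n" for agent in user_agents]
    # A blank line separates one group from the next.
    return "\n".join(groups)


if __name__ == "__main__":
    print(build_robots_txt(AI_CRAWLERS))
```

Save the printed text as robots.txt in the top-level folder of your website. If your site already has a robots.txt file, add these groups to it rather than replacing what is there.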
Blocking AI crawlers from larger companies is easy, but new, smaller bots are always popping up, which means that blocking them via robots.txt isn’t always the answer.
Use CAPTCHAs. You probably hate CAPTCHAs when you visit websites. Who wants to click on squares to show which pictures have a fire hydrant? But they work: CAPTCHAs deter automated bots by requiring a human-like response or a computational proof before granting access to a website.
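To give you a feel for what a “computational proof” means, here is a toy sketch of the idea: the website hands the visitor a puzzle that takes real computing effort to solve but almost none to check. This is an illustration only, not how any particular CAPTCHA product works, and the challenge text and difficulty setting are made-up values.

```python
import hashlib
from itertools import count


def is_valid(challenge: str, nonce: int, difficulty: int) -> bool:
    """Check that SHA-256 of "challenge:nonce" starts with `difficulty` zero hex digits."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)


def solve(challenge: str, difficulty: int) -> int:
    """Visitor side: try nonces 0, 1, 2, ... until one passes the check."""
    for nonce in count():
        if is_valid(challenge, nonce, difficulty):
            return nonce


if __name__ == "__main__":
    challenge = "example-session-token"  # hypothetical value issued by the website
    nonce = solve(challenge, difficulty=4)
    # Website side: verifying the answer takes a single hash.
    print(nonce, is_valid(challenge, nonce, 4))
```

One visitor barely notices the delay, but a bot requesting thousands of pages has to pay that cost every time, which is what makes mass scraping more expensive.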
Install and configure a WAF to filter website traffic. A WAF, or Web Application Firewall, watches all the traffic coming to and from your website and blocks anything that looks dangerous or harmful. This helps keep the website safe from hackers and other bad guys who might try to cause trouble.
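A real WAF is a product you install or subscribe to, but one kind of rule it can apply is easy to picture: refuse any request whose User-Agent header names a crawler you don’t want. Here is a minimal sketch of that single rule as a Python WSGI wrapper. The blocklist and the function names are my own assumptions, and a bot can get around this by lying about its User-Agent.

```python
# Hypothetical blocklist: lowercase substrings matched against the User-Agent header.
BLOCKED_AGENTS = ("gptbot", "ccbot")


def is_blocked(user_agent: str) -> bool:
    """Return True if the User-Agent contains a blocked crawler token."""
    ua = user_agent.lower()
    return any(token in ua for token in BLOCKED_AGENTS)


def block_ai_crawlers(app):
    """Wrap a WSGI app so that blocked crawlers receive 403 Forbidden."""
    def middleware(environ, start_response):
        if is_blocked(environ.get("HTTP_USER_AGENT", "")):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        # Everyone else is passed through to the real website.
        return app(environ, start_response)
    return middleware
```

If your website runs on a Python framework that exposes a WSGI app, you would wrap that app with block_ai_crawlers. On other platforms, your WAF or hosting provider will have its own way to write the same kind of rule.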
Website IP Blocking. Website IP blocking is a method used to restrict access to your website or online service based on the IP address of the user trying to connect.
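The check behind IP blocking is simple: compare the visitor’s address against a list of addresses or ranges you have decided to refuse. Here is a minimal sketch using Python’s built-in ipaddress module. The ranges below are reserved documentation addresses used as placeholders, not real crawler addresses, and treating an unreadable address as blocked is my own design choice.

```python
import ipaddress

# Hypothetical blocklist built from reserved documentation ranges;
# replace these with the addresses or ranges you actually want to block.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),    # a whole range of 256 addresses
    ipaddress.ip_network("198.51.100.17/32"),  # a single address
]


def is_ip_blocked(client_ip: str) -> bool:
    """Return True if client_ip falls inside any blocked network."""
    try:
        address = ipaddress.ip_address(client_ip)
    except ValueError:
        # Assumption: an address that cannot be parsed is treated as blocked.
        return True
    return any(address in network for network in BLOCKED_NETWORKS)


if __name__ == "__main__":
    for ip in ("203.0.113.45", "198.51.100.18"):
        print(ip, is_ip_blocked(ip))
```

Bots can switch addresses, so a list like this needs regular upkeep and works best alongside the other methods here.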
Take Legal Measures. If you feel that AI crawling is infringing on any of your intellectual property rights, then seeking legal advice may be the best course of action.
Hopefully, these tips will help you prevent AI chatbots from scraping data from your website. What are your thoughts on AI using your content? Are you happy to help these bots learn and become more useful, or do you feel that it is a threat to content producers? Let me know in the comments below. If you have any questions, please reach out. I’m always available.
Want to ask me a tech question? Send it to [email protected]. If you prefer to connect with me on social media, you can find me on Facebook, Instagram, LinkedIn, and Twitter and watch great tech tip videos on my YouTube channel. I love technology. I’ve read all of the manuals and I want to make technology fun and easy to use for everyone! If you need on-site or remote tech support for your Windows/Macintosh computers, laptops, Android/Apple smartphones, tablets, printers, routers, smart home devices, and anything else that connects to the Internet, please feel free to contact my team at Integral. My team of friendly tech experts is always standing by to answer your questions and help make your technology useful and fun. Reach out to us at www.callintegralnow.com or by phone at 888.256.0829.