Making a Blocklist to Remove Spam from Search Engines
Intro
I made a blocklist to remove AI-generated articles, misleading ads, and unwanted content from search engine results! Find it here!
It's no secret that search engines have gotten so much worse over the years. There's an outrageous amount of spam flooding the internet, and nobody seems to have a real solution for it. There are companies trying, like Kagi, but they're often expensive and few people use them.
A lot of this is obviously due to AI language models making it very easy to mass-produce garbage, but honestly, that's only part of the problem. Writing content has recently been less about proper writing and more about engineering articles to target SEO (Search Engine Optimization), packing in all the keywords and structure needed to trick Google into showing that website first.
Everybody has been abusing SEO for ages now, which is why most articles feel like they're trying to sell you something or just want clicks. The only difference now is that AI can do it instantly for free rather than having to pay poor freelancers $2 per article.
I reached a breaking point earlier this year when I was searching for how to do something on Windows and found that MOST of the results on the first page of DuckDuckGo were AI-generated spam sites. Enough was enough!
I started work on BadWebsiteBlocklist, a manually curated filter that removes hundreds of domains and subdomains from search results via an extension called uBlacklist. As of now, the repo has almost 500 stars and the text file contains over 250 lines!
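To give a rough idea of how it works: uBlacklist reads a list of match patterns, one per line, and hides any search result whose URL matches one of them. A hypothetical entry (the domain is a placeholder, not an actual line from the list) looks like this:

    *://*.example.com/*

The *. prefix covers the domain and every subdomain, so the whole site disappears from results.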
Why?
First off, why did I make yet another blocklist? They're nothing new - there are already plenty of efforts like Hosts on GitHub and the Huge AI Blocklist.
Most of these block malware and ad tracking, but don't focus on spam or low-effort content. Some, like the AI blocklist, are really focused on one thing and tend to block a bunch of domains people may actually want to see, including popular subreddits like /r/machinelearning.
Blocklists also tend to be thousands of lines of domains and blocking rules. They're intimidating to look through, and it's never really explained how any of those domains made it onto the list. Even I wouldn't remember why I blocked something a week ago.
My goal was to make a blocklist that's well organized, general enough that anyone could use it, and easy to contribute to. It focuses on removing the low-effort trash you'll find in search engines rather than just blocking objectively malicious links.
Not just AI-generated articles, but also misleading ones and articles that exist mainly to funnel you toward advertisements. Have you ever tried to do anything hard-disk related and gotten linked to an article telling you to buy some software to fix it? Yeah, those get blocked too. That means a fair bit of the list can be biased, but trying to be 100% objective means you'd end up with nothing blocked. I also try to block only the offending articles rather than the whole domain.
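Since uBlacklist match patterns can include a path, a single offending article can be hidden without touching the rest of the site. A hypothetical example (the domain and path are placeholders, not real entries):

    *://example.com/recover-your-hard-drive*

That rule only hides search results whose URL starts with that path; everything else on example.com stays visible.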
Some search engines let you block sites or create your own rules for search, but the neat thing about a blocklist is that it can be shared with everyone and used however you like, whether that's a browser extension, a Pi-hole, or rolling it into your own set of rules. My list is public domain (CC0), which means anyone can use it however they want!
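For instance, a whole-domain uBlacklist pattern such as *://*.example.com/* could be repurposed as a rough hosts-style entry that Pi-hole (or a plain hosts file) would accept, again with a placeholder domain:

    # blocks example.com at the DNS level, not just in search results
    0.0.0.0 example.com

Keep in mind that DNS blocking is much blunter: the site becomes unreachable everywhere, and each subdomain needs its own entry (or a wildcard/regex rule in Pi-hole).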
Isn't Making a Blocklist Futile?
It's safe to assume >90% of the internet is spam and advertising, and there's always more being made. A fair number of people think that trying to manually block spam is an impossible task, and I agree!
The goal of this blocklist isn't to remove all spam; it's to remove the top 0.001% of it, the sites that make it to the first few pages on Google and the like. If we can just do that, search results will be SO much more tolerable.
My theory turned out to be right! After the list grew to ~100 domains, I started feeling a real difference on DuckDuckGo. The few really popular AI-generated spam sites were gone, and the malicious articles that always rear their heads, like PartitionWizard's, weren't showing up anymore. With less spam in the first two pages, more helpful articles started rising to the top. It was working!
Methodology for the Blocklist
Here's what gets added to the blocklist:
- AI spam or very low-effort SEO spam websites
- Misleading or thinly-veiled advertisements acting as information (ex: Top 10 VPNs article written by a VPN provider where they put themselves as #1)
- Malicious websites that trick you into installing unwanted garbage (ex: Softonic)
- Websites that just re-host articles from other sources (ex: MSN)
🛈 Why MSN is blocked: Blocking MSN from showing up in search has been controversial, but I stand by it. MSN just re-hosts articles from other sources and often outranks the original source, because Microsoft forces MSN on everybody, from the news tab in Windows to the front page of Edge. The goal of blocking MSN is to help the original sources float back to the top of search results.
In terms of keeping everything organized, every website that gets added to the list has an issue attached to it explaining why it was blocked. There's an issue template to help with this, asking people to briefly explain why the domain (or subdomain) should be blocked, along with examples of offending links.
The other important thing is that I try not to block an entire domain just because it happens to publish some spammy articles. Most products online have a blog to advertise their stuff, so I tend to only block the blog subdomain.
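Concretely, that's just a matter of how broad the match pattern is. A hypothetical pair of rules (placeholder domain):

    *://blog.example.com/*
    *://*.example.com/*

The first line hides only the blog subdomain; the second would hide the entire site, subdomains and all, which is usually overkill.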
Is There No Other Solution?
I don't think so. The problem with search engines is that they try to be as objective as reasonably possible, which means they won't pick Website A over Website B just because it has better content. What matters are metrics and how favorable a site looks to the algorithm.
AI-generated spam and blatant SEO manipulation have shown that the algorithms we rely on are fallible. I'm tired of terrible and misleading articles, but there doesn't seem to be much interest in taking an actual stance and doing anything about it. I can understand why, though: who gets to decide which websites are better than others?
Meanwhile, efforts to remove AI spam have had mixed results, and a couple of fairly popular spam sites still make it through. With the abundance of different fine-tuned language models and AI getting exponentially cheaper, this is only going to get worse.
I decided to start blocking stuff on my own. Since I was already doing it, I also decided to make it public and hopefully get help from the community along the way. Making and maintaining a blocklist is annoying work, but one must imagine Sisyphus happy or something.
Outro
I have more to say about the state of the internet and how it's only a matter of time before it starts collapsing in on itself, but I'll get to that in another post down the line... Maybe...
For now, I could use your help! If you know any spam sites that should get added to the list, please open an issue here. Make sure it isn't already in the list first, though!
If you found this useful and want to see more, please consider donating. It really helps a lot.