Protect Yourself Against The Duplicate Content Threat Caused by Scrapers

First, let me answer the obvious question for those who aren't up on net lingo: what is a scraper? Without getting too technical, a scraper is a person or program that crawls websites and steals their content. That content is then republished on a separate site, usually copy/pasted wholesale and occasionally badly rewritten.

The express purpose of a scraper site is to spam search engines and catch the attention of users who (theoretically) will see it in the results. Thanks to high keyword density and social spam such as mass blog commenting, scrapers were once able to push these sites up through Google's rankings with fair regularity.

That kind of SEO manipulation is much harder to pull off today, but it hasn't eliminated the threat of scrapers, nor has it significantly reduced the number of sites built on stolen content.

Scrapers and Duplicate Content


Beyond the obvious plagiarism headache caused by web scraping, there is another serious risk: duplicate content, and how it can affect the original creator. Believe it or not, you can actually be flagged for duplicating your own work.

This happens when a crawler comes across multiple wholesale copies of your content, with nothing on the hosting site to mark the duplication as legitimate, such as a syndicated news piece or openly licensed material.

When a search engine catches signs of duplicate content, the most popular site hosting it is usually treated as the original, or at least given precedence. But what happens if the scraper's site has managed to overtake yours in views? Or its rankings have been spammed higher? Or your page simply gets lost in the crowd as the content is reposted again and again?

There is a common myth that Google penalizes sites for duplicate content, but this isn't technically true. There is no explicit punishment; instead, Google relegates duplicate pages to the omitted results at the end of a search. Few people ever click through to see those omitted results, so each page flagged as a duplicate can be banished into obscurity.
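To get an intuition for the kind of comparison a duplicate-content filter performs, here is a toy Python sketch. This is purely an illustration using the standard library's `difflib`; the function name, sample strings, and threshold are my own inventions, not Google's actual algorithm.

```python
from difflib import SequenceMatcher

def similarity(original: str, suspect: str) -> float:
    """Return a 0.0-1.0 ratio of how similar two blocks of text are."""
    return SequenceMatcher(None, original.lower(), suspect.lower()).ratio()

original = "Scrapers steal content and repost it wholesale on spam sites."
scraped  = "Scrapers steal content and repost it wholesale on spam sites!"
rewrite  = "Totally unrelated text about gardening and houseplants."

# A copy with trivial changes scores close to 1.0; unrelated text scores low.
print(similarity(original, scraped))
print(similarity(original, rewrite))

# A hypothetical filter might flag anything above some cutoff as a duplicate.
DUPLICATE_THRESHOLD = 0.8
print(similarity(original, scraped) > DUPLICATE_THRESHOLD)  # True
```

Real crawlers use far more robust techniques (shingling, fingerprinting, popularity signals), but the basic question is the same: how much of this page already exists elsewhere?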

How To Protect Yourself

Google Authorship

Protecting yourself from this threat is easy, as long as you use the tools Google has given you. In this case, that tool is Google Authorship. With it in place, every page you post is automatically associated with you as its author.

Of course, this doesn't mean people won't steal your content. But when they do, you will have a trail of breadcrumbs that Google can follow, proving that you are the original creator and that the other site is a scraper.

Even if the scraper's site outranks yours, it will take a hit for being spam, which can lead to a duplicate content strike, or several if it happens with multiple items from your content pool or other users report the site for spam.

Setting It Up


Everything is linked through Google+, so you will need an account to get started. Set up your profile (if you don't have one already) and make sure its photo is a clear headshot. That is a slightly odd requirement, but Google is very big on eliminating anonymity on the web.

Next, include your byline on every page of content you create. The byline has to match the name on your Google+ profile.
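One common way to attach that byline was to link it to your Google+ profile with `rel="author"` markup, which Google's Authorship program recognized. A minimal sketch, where the author name and profile ID are placeholders you would replace with your own:

```html
<!-- Byline on the content page; the name must match the Google+ profile. -->
<p>Written by
  <a href="https://plus.google.com/YOUR_PROFILE_ID" rel="author">Jane Doe</a>
</p>
```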

The email address you use should also be on the same domain as the site where your content is posted. This is the fussier rule that annoys people; if you don't have an email address under that domain, Google provides an alternate method of signing up.

Once all of that is set up, go to the official Google Authorship page and enter the email address you use on your content pages. This will tie everything together.


It is frustrating, but if you don't protect your content, you won't have any recourse when someone steals it. If the thief manages to boost the popularity of their site (which spammers often do through dirty tricks), they can end up treated as the original source by default.

Unfair? Absolutely. That is why Google came up with this method of linking content to its author in a way that makes it immediately apparent who wrote it. Take advantage of it, or you could be left in the dust.

Image Credit: 1.
