Straight from Google Webmaster Central, the search engineers claim that they correctly identify duplicate content and scraped text 99% of the time.
The Google Search Quality Team has posted about the underlying concerns, detailing how their spider handles scraped content.
If you have ever had your content brazenly scraped and repurposed by other websites, and then seen it outrank your own pages in the search results, you need to understand the following tips.
Creating high-quality content without proper attribution, links, and Google+ authorship is simply a license for someone else to gain an advantage over you.
- Duplicate content within your domain can be controlled. Google recommends requiring a link back from any site that syndicates your content; to make sure your site is identified as the original source, verify that the backlink is actually present (a minimal backlink check is sketched after this list).
- Cross-domain / scraped duplicate content – Ensure your site is following all Google guidelines. If scraped content ranks higher than your site, the problem may be a technical issue on your end: check that the content is not blocked by robots.txt, examine your sitemap file, and check whether the site has any errors or warning flags in Google Webmaster Central (see the robots.txt sketch after this list).
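For the syndication point above, here is a minimal sketch of how you might confirm that a partner's republished copy actually links back to your original article. The two URLs are hypothetical placeholders, not anything from the post itself.

```python
# Sketch only: check whether a syndicated page links back to the original article.
import urllib.request
from html.parser import HTMLParser

ORIGINAL_URL = "https://www.example.com/my-original-article"            # hypothetical
SYNDICATED_URL = "https://partner-site.example.net/republished-copy"    # hypothetical


class LinkCollector(HTMLParser):
    """Collects the href value of every anchor tag on the page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def has_backlink(syndicated_url: str, original_url: str) -> bool:
    """Return True if the syndicated page contains a link to the original article."""
    with urllib.request.urlopen(syndicated_url) as response:
        html = response.read().decode("utf-8", errors="replace")
    parser = LinkCollector()
    parser.feed(html)
    return any(original_url in link for link in parser.links)


if __name__ == "__main__":
    if has_backlink(SYNDICATED_URL, ORIGINAL_URL):
        print("Backlink found - the syndicated copy credits the original source.")
    else:
        print("No backlink found - ask the syndication partner to add one.")
```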
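And for the technical check on your own end, a quick way to confirm that Googlebot is not being blocked by robots.txt is Python's built-in robots.txt parser. This is a sketch assuming a hypothetical domain and page, not a substitute for reviewing Google Webmaster Central itself.

```python
# Sketch only: confirm Googlebot is allowed to crawl a given page per robots.txt.
import urllib.robotparser

SITE = "https://www.example.com"                        # hypothetical domain
PAGE = "https://www.example.com/my-original-article"    # hypothetical page

parser = urllib.robotparser.RobotFileParser()
parser.set_url(f"{SITE}/robots.txt")
parser.read()  # fetches and parses the live robots.txt file

# can_fetch() answers: may this user agent crawl this URL?
if parser.can_fetch("Googlebot", PAGE):
    print("Googlebot is allowed to crawl the page.")
else:
    print("Googlebot is blocked by robots.txt - fix this before anything else.")
```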
They post this rather less-than-reassuring paragraph:
When encountering such duplicate content on different sites, we look at various signals to determine which site is the original one, which usually works very well. This also means that you shouldn’t be very concerned about seeing negative effects on your site’s presence on Google if you notice someone scraping your content.
So Google will correctly identify cross-domain content problems 99% of the time, but the real problem is the remaining cases where the algorithm fails.
Google Webmaster Central has more about the search engine's continuing battle with an overflowing array of content.