To scrape or not to scrape

According to Wikipedia, web scraping is “a computer software technique of extracting information from websites.” Google has now taken measures to penalize site scrapers in an effort to reduce what it considers webspam.

Many websites offer an RSS feed of their content. In the early days, many sites provided just a headline and a paragraph, in the hope that people would follow the link back to the website. Now it’s more common to include the full content along with images. While that provides a better experience for people using RSS aggregators such as Google Reader or Bloglines, it also gives a site scraper everything they need to repost an entire article on their own website.
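
To illustrate just how little effort that leaves a scraper, here is a minimal sketch in Python using the feedparser library. The feed URL is hypothetical; any full-content RSS or Atom feed would behave the same way:

    import feedparser

    # Hypothetical feed URL -- stands in for any full-content RSS/Atom feed.
    FEED_URL = "http://example.com/feed"

    feed = feedparser.parse(FEED_URL)

    for entry in feed.entries:
        title = entry.get("title", "Untitled")
        link = entry.get("link", "")
        # Full-content feeds carry the entire article body; fall back to
        # the summary when the feed only offers an excerpt.
        if "content" in entry:
            body = entry.content[0].value
        else:
            body = entry.get("summary", "")
        print("%s (%s): %d characters of ready-to-repost HTML"
              % (title, link, len(body)))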

With an inexpensive piece of software, it isn’t difficult to build an entire website out of articles and content from other sources, a practice commonly referred to as ‘auto-blogging’. Depending on your point of view, this technique is morally dubious. One can argue that if a website owner provides an RSS feed without any stipulations, people should be free to use it as they wish. Netiquette takes it one step further and recommends adding a link back to the originating site. Many site owners are aware of this practice and state in their Terms of Use that content may not be used in this fashion.

Google is stepping in because many of these scraper sites have dramatically improved their SEO rankings and, in some cases, rank higher in search results than the original source website.

The reality is that many of these site scrapers will look for workarounds, and it’ll turn into a cat-and-mouse game. Virus and spambot writers have been playing that game for years, but it’s in Google’s best interest to clean up their search results, so I suspect they’ll bring their sizeable resources to bear on this issue.

What do you think…

Is site scraping morally acceptable? Should Google do whatever is necessary to combat it?

Author: Craig Berry

Craig Berry is a Catholic web developer and musician.
Connect with him online.
  • Anonymous

    Speaking for myself, I’m fairly convinced that it would not be a good use of the time and talent God has given me to develop a scraping website. I would not want to face God and say that was what I had done with my career.

    In terms of Google’s actions, I would say it is justified in protecting the integrity of its search results, which also has the effect of incentivizing (or removing a disincentive for) original content.

    Now, “whatever’s necessary” can be pretty expansive. I do not think they would be justified in, say, assassinating those who put these sites together, or even launching DDoS attacks against them. But as far as its own resources like PageRank go, I’d say Google is justified.

  • http://www.facebook.com/matt.korger Matt J Korger

    There are different use cases for scraping websites. I know someone who uses Perl to scrape the National Weather Service website for details on the local forecast, etc., since the data isn’t available in a consumable format. I’ll admit I use scrapers at work to help determine whether a word is a person’s name or not (it’s a bit more complicated than it sounds with multinational data). It’s one thing to take someone’s intellectual creation and pass it off as your own, like plagiarism, but it’s another to use scraping in a more referential manner. I would be happy to see Google cracking down on this. Sorry to say to all the anti-authoritarian types, but a regulated internet is far superior to an anarchical internet.
