HTTP screen-scraping and caching
The technique of stealing... er... leveraging content from other people's web pages is commonly referred to as "screen-scraping". Screen-scraping is how you get content from a website that doesn't offer a more natural way to get it, such as an RSS feed or an API.
Many companies that are common targets of screen-scraping (such as Amazon.com and IMDB.com) have strict policies against screen-scraping. So please respect the usage policies of any site you wish to screen-scrape.
The script and techniques in this article provide an ASP class to make screen-scraping easy and powerful.
Since many sites don't appreciate you scraping their content, and those that do don't want you to hit them every 10 seconds, it's a good idea to incorporate some kind of caching mechanism in your screen-scraping code. This way you can scrape the other site's content once per hour rather than every time you get a hit on your own site. For example, if you scrape stock market data from finance.yahoo.com to place content on your home page, you'd prefer (and so would Yahoo) that you don't hit Yahoo for every hit to your site. Instead, hit Yahoo once, cache the result, then use that cached result to serve additional hits to your site for whatever period of time makes sense.
For example, let's say you want to grab the Google home page and display it in your own site. Using the httpcache class, you could simply do the following:
The code above will go pull the page source for http://www.google.com then display it as the output of your page.
Now, let's add caching. The code above will hit Google every time you refresh the page. Let's change that so you hit Google at most once per hour. The code below shows how to use httpcache's caching feature.
You did two things. First, you provided a cache filename for the class to use when saving the scraped results to disk. Second, you provided a Time To Live (TTL) of 3600 seconds. Now, when you refresh the page, you pull the data from your web server's hard drive rather than hitting Google, until the cache file is more than one hour old. The first hit to your page after that point causes a new hit to Google and a new cache file save.
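If you want to see the cache-file-plus-TTL idea outside of ASP, here is a minimal sketch in Python. It is not the httpcache class; the function names (fetch_url, is_fresh, get_cached) and the fetch parameter are made up for this illustration. It assumes the cache is "fresh" while the file's last-modified time is less than the TTL in the past.

```python
import os
import time
import urllib.request


def fetch_url(url):
    """Download the raw bytes of a page (hypothetical helper, not httpcache)."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read()


def is_fresh(cache_path, ttl_seconds):
    """Return True if the cache file exists and is younger than the TTL."""
    if not os.path.exists(cache_path):
        return False
    age_seconds = time.time() - os.path.getmtime(cache_path)
    return age_seconds < ttl_seconds


def get_cached(url, cache_path, ttl_seconds, fetch=fetch_url):
    """Serve from disk while the file is fresh; otherwise re-fetch and re-save."""
    if is_fresh(cache_path, ttl_seconds):
        with open(cache_path, "rb") as cache_file:
            return cache_file.read()
    data = fetch(url)  # the only place the remote site gets hit
    with open(cache_path, "wb") as cache_file:
        cache_file.write(data)
    return data
```

A call such as get_cached("http://www.google.com", "google.cache", 3600) would hit Google on the first request and then read google.cache from disk until the file is an hour old. The fetch parameter is only there so you can swap in a stand-in function while experimenting.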
In this next example, we'll pull only a specific table of data.
We used httpcache's table_dump() method to show a default table view of the data we extracted from the external web page. In reality, you'd probably not use table_dump(). You'd probably want to work with the individual cells of data. For example, if you wanted to output just the 3rd column of the 5th data row:
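To make the "individual cells" idea concrete, here is a rough Python sketch of pulling tables out of scraped HTML and indexing a single cell. It is not how httpcache does it; TableExtractor and extract_tables are hypothetical names, and the sketch assumes simple, non-nested tables whose first row is a header row.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect every <table> on a page as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []   # each table is a list of rows; each row a list of strings
        self._row = None   # row currently being built, if any
        self._cell = None  # text fragments of the cell currently open, if any

    def _close_cell(self):
        if self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._close_cell()  # tolerate a missing closing tag
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag == "tr" and self._row is not None:
            self._close_cell()
            self.tables[-1].append(self._row)
            self._row = None


def extract_tables(html):
    """Return all tables found in an HTML string."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables
```

With the first row treated as a header, the 3rd column of the 5th data row of the first table would be extract_tables(html)[0][1:][4][2], since Python lists count from zero.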
Another powerful feature of httpcache is the ability to output binary data directly. You can use this to output data directly within an <img> tag. You need to create a simple ASP page that uses httpcache to return the image. That code is shown at the end of this article. I name this file img_cache.asp.
The above will output an image from www.snippetedit.com. The image will be cached locally on your webserver's hard drive for 60 seconds. Actually, it will be saved there indefinitely, but any new hits to your page after 60 seconds will cause a new hit to the source site.
Here is the code for img_cache.asp:
Notice in the code above there is an include for md5.asp. httpcache uses the md5 function to automatically create a cache filename if you don't supply one. I did not write md5.asp, but it works well. Here it is. Copy the code and save it as a file named md5.asp.
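The idea of deriving a cache filename from an MD5 hash can be sketched in a few lines of Python. This is an illustration only: the article does not say exactly what httpcache hashes or how it names the file, so hashing the URL and adding a ".cache" suffix are assumptions, and cache_filename_for is a hypothetical name.

```python
import hashlib


def cache_filename_for(url, suffix=".cache"):
    """Build a filesystem-safe cache filename from the MD5 hex digest of a URL.

    Assumption for this sketch: the URL is the hash input and ".cache" is the suffix.
    """
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()  # 32 hex characters
    return digest + suffix
```

The same URL always maps to the same filename, and characters such as "/" or "?" in the URL never end up in the filename. MD5 is fine for naming files like this, but it should not be relied on for anything security related.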
Finally, here is the httpcache_class.asp code that you include in your pages like the examples in this article.
This article was a bit more complex than some, but if you put all the pieces together and play with the examples, you'll find httpcache to be a powerful and simple-to-use class for screen-scraping. The caching feature will make you a good neighbor, as it keeps you from causing unnecessary hits to those external sites.
Hope you enjoyed this article and the code. Troy Wolf is the author of SnippetEdit, a website editor written in PHP. SnippetEdit is as simple as it gets for non-technical users to edit pre-defined snippets of content in their websites.