HTTP screen-scraping and caching

Article by: Troy Wolf (5/31/2005)
Summary: An ASP class that makes it easy to leverage data from web pages that may be your own or pages from other websites. The class has methods that make it easy to pull entire pages, specific sections of content, tables of data, or even just a specific image from third-party websites.

The technique of stealing... er... leveraging content from other people's web pages is commonly referred to as "screen-scraping". Screen-scraping is how you get content from a website that doesn't offer a more natural way to get it, such as an RSS feed or an API.

Many companies that are common targets of screen-scraping (such as Amazon.com and IMDB.com) have strict policies against screen-scraping. So please respect the usage policies of any site you wish to screen-scrape.

The script and techniques in this article provide an ASP class to make screen-scraping easy and powerful.

Since many sites don't appreciate you scraping their content, and those that do don't want you to hit them every 10 seconds, it's a good idea to build some kind of caching mechanism into your screen-scraping code. This way you can scrape the other site's content once per hour rather than every time you get a hit on your own site. For example, if you scrape stock market data from finance.yahoo.com to place content on your home page, you'd prefer (and so would Yahoo) that you don't hit Yahoo for every hit to your site. Instead, hit Yahoo once, cache the result, then use that cached result to serve additional hits to your site for whatever period of time makes sense.

For example, let's say you want to grab the Google home page and display it in your own site. Using the httpcache class, you could simply do the following:


[code listing not preserved]


The code above pulls the page source for http://www.google.com, then displays it as the output of your page.

Now, let's add caching. The code above will hit Google every time you refresh the page. Instead, let's hit Google at most once per hour. The code below shows how to use httpcache's caching feature.

[code listing not preserved]


You did two things. First, you provided a cache filename for the class to use when saving the scraped results to disk. Second, you provided a time to live (TTL). Now, when you refresh the page, the data comes from your web server's hard drive rather than from Google, until the cache file is more than one hour (3600 seconds) old. The first hit to your page after that causes a new hit to Google and a new cache file save.
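
To make the cache-file-plus-TTL idea concrete, here is a minimal sketch in Python. It is not the httpcache class and says nothing about how that class is written; the names `fetch_url` and `cached_fetch` and the idea of passing the fetcher in as a parameter are assumptions made for illustration. It only shows the rule described above: serve from disk while the file is younger than the TTL, otherwise hit the source once and save the result.

```python
import os
import time
import urllib.request


def fetch_url(url):
    """Download the raw bytes of a page (this is the real network hit)."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read()


def cached_fetch(url, cache_file, ttl_seconds, fetch=fetch_url):
    """Return the body for url, reusing cache_file while it is younger than ttl_seconds.

    All names here are hypothetical; they are not part of httpcache.
    """
    if os.path.exists(cache_file):
        age = time.time() - os.path.getmtime(cache_file)
        if age < ttl_seconds:
            # Cache is still fresh: no request to the source site.
            with open(cache_file, "rb") as f:
                return f.read()

    # Cache is missing or stale: hit the source once and save the result.
    body = fetch(url)
    with open(cache_file, "wb") as f:
        f.write(body)
    return body
```

A call such as `cached_fetch("http://www.google.com", "google.cache", 3600)` would then reach Google at most once per hour, no matter how often your own page is requested.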

In this next example, we'll pull only a specific table of data.

[code listing not preserved]


We used httpcache's table_dump() method to show a default table view of the data we extracted from the external web page. In reality, you'd probably not use table_dump(). You'd probably want to work with the individual cells of data. For example, if you wanted to output just the 3rd column of the 5th data row:

[code listing not preserved]
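
The row-and-column idea can be sketched roughly as follows in Python, using only the standard library. This is an illustration of the technique, not httpcache's table methods; `TableGrabber` and `extract_tables` are hypothetical names, and the sketch assumes reasonably well-formed markup with no nested tables.

```python
from html.parser import HTMLParser


class TableGrabber(HTMLParser):
    """Collect the cell text of every <table> in a page as lists of rows."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per table found
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Close the cell and store its text, trimmed of whitespace.
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def extract_tables(html):
    """Return every table in html as a list of rows, each row a list of cell strings."""
    grabber = TableGrabber()
    grabber.feed(html)
    return grabber.tables
```

With the tables held as nested lists, a single cell is plain indexing: `extract_tables(html)[0][4][2]` is the third cell of the fifth row of the first table, counting a header row (if there is one) as a row.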


Finally, another powerful feature of httpcache is the ability to output binary data directly. You can use this to output data directly within an <img> tag. You need to create a simple ASP page that uses httpcache to return the image. That code is shown later in this article. I name this file img_cache.asp.

[code listing not preserved]


The above will output an image from www.snippetedit.com. The image will be cached locally on your webserver's hard drive for 60 seconds. Actually, it will be saved there indefinitely, but any new hits to your page after 60 seconds will cause a new hit to the source site.
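
The general shape of such an image pass-through page can be sketched in Python as below. This is an assumption-laden illustration, not img_cache.asp: `image_response` is a hypothetical helper, the fetcher is passed in so that it could be a caching one, and guessing the content type from the URL's file extension is a choice made for this sketch.

```python
import mimetypes


def image_response(url, fetch):
    """Build (headers, body) for passing a remote image through to the browser.

    fetch is any callable that takes a URL and returns raw bytes.
    """
    body = fetch(url)
    # Guess the MIME type from the extension in the URL, e.g. ".gif" -> "image/gif".
    content_type, _encoding = mimetypes.guess_type(url)
    headers = {
        "Content-Type": content_type or "application/octet-stream",
        "Content-Length": str(len(body)),
    }
    return headers, body
```

The page that serves the `<img>` request would send those headers and then write the bytes unchanged, so the browser sees an ordinary image even though it came from your cache.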

Here is the code for img_cache.asp:

[img_cache.asp listing not preserved]


Notice that the code above includes md5.asp. httpcache uses the md5 function to automatically create a cache filename if you don't supply one. I did not write md5.asp, but it works well. Here it is. Copy the code and save it as a file named md5.asp.

[md5.asp listing not preserved]
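
As a rough idea of why a hash makes a handy automatic cache filename, here is a short Python sketch. It assumes the URL is what gets hashed, which the article does not state; `cache_filename` and the `.cache` suffix are hypothetical.

```python
import hashlib


def cache_filename(url, suffix=".cache"):
    """Derive a filesystem-safe cache filename from a URL via its MD5 digest."""
    # The hex digest is always 32 characters from 0-9 and a-f, so it is safe
    # as a filename regardless of slashes or query strings in the URL.
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return digest + suffix
```

The same URL always maps to the same file, and different URLs map to different files for all practical purposes, which is all a cache needs here. MD5 is used only as a naming scheme, not for security.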


Finally, here is the httpcache_class.asp code that you include in your pages like the examples in this article.

[httpcache_class.asp listing not preserved]


This article was a bit more complex than some, but if you put all the pieces together and play with the examples, you'll find httpcache to be a powerful and simple-to-use class for screen-scraping. The caching feature will make you a good neighbor, as it keeps you from causing unnecessary hits to those external sites.

Hope you enjoyed this article and the code. Troy Wolf is the author of SnippetEdit, a website editor written in PHP. SnippetEdit is as simple as it gets for non-technical users to edit pre-defined snippets of content in their websites.



© Copyright codetoad.com 2001-2015