HTTP screen-scraping and caching
Article by: Troy Wolf (5/31/2005)
Summary: An ASP class that makes it easy to leverage data from web pages that may be your own or pages from other websites. The class has methods that make it easy to pull entire pages, specific sections of content, tables of data, or even just a specific image from third-party websites.
The technique of stealing... er... leveraging content from other people's web pages is commonly referred to as "screen-scraping". Screen-scraping is how you get content from a website that doesn't offer a more natural way to get it, such as an RSS feed or an API.
Many companies that are common targets of screen-scraping (such as Amazon.com and IMDB.com) have strict policies against screen-scraping. So please respect the usage policies of any site you wish to screen-scrape.
The script and techniques in this article provide an ASP class to make screen-scraping easy and powerful.
Since many sites don't appreciate you scraping their content, and even those that tolerate it don't want you to hit them every 10 seconds, it's a good idea to build some kind of caching mechanism into your screen-scraping code. This way you can scrape the other site's content once per hour rather than every time you get a hit on your own site. For example, if you scrape stock market data from finance.yahoo.com to place content on your home page, you'd prefer (and so would Yahoo) not to hit Yahoo for every hit to your site. Instead, hit Yahoo once, cache the result, then use that cached result to serve additional hits to your site for whatever period of time makes sense.
For example, let's say you want to grab the Google home page and display it in your own site. Using the httpcache class, you could simply do the following:
The code above pulls the page source for http://www.google.com, then displays it as the output of your page.
Now, let's add caching. The code above will hit Google every time you refresh the page. Let's change that so you hit Google at most once per hour. The code below shows how to use httpcache's caching feature.
You did two things. First, you provided a cache filename for the class to use when saving the scraped results to disk. Second, you provided a Time To Live (TTL). Now, when you refresh the page, you will pull the data from your web server's hard drive rather than hit Google, at least until the cache file is one hour old. The first hit to your page after the cache file is one hour (3600 seconds) old will cause a new hit to Google and a new cache file save.
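The cache-file-plus-TTL behavior described above can be sketched independently of httpcache. The Python below is a minimal, language-neutral illustration and not the class's own code: `cached_fetch`, `http_get`, and all parameter names are invented for this sketch, and it assumes the age of the cache is judged by the file's modification time.

```python
import os
import time
from typing import Callable
from urllib.request import urlopen


def cached_fetch(url: str, cache_file: str, ttl_seconds: int,
                 fetch: Callable[[str], bytes]) -> bytes:
    """Return content for url, reusing cache_file while it is younger than ttl_seconds."""
    if os.path.exists(cache_file):
        age = time.time() - os.path.getmtime(cache_file)
        if age < ttl_seconds:
            # Cache is still fresh: serve from disk, no remote hit.
            with open(cache_file, "rb") as f:
                return f.read()
    # Cache is missing or stale: hit the remote site once and save the result.
    data = fetch(url)
    with open(cache_file, "wb") as f:
        f.write(data)
    return data


def http_get(url: str) -> bytes:
    """Hypothetical fetcher for this sketch: a plain HTTP GET returning raw bytes."""
    with urlopen(url, timeout=10) as resp:
        return resp.read()
```

With these assumptions, `cached_fetch("http://www.google.com", "google.html", 3600, http_get)` would contact the remote site at most once per hour and serve every other request from `google.html`. Passing the fetcher in as a parameter is only a design choice for the sketch; it makes the caching logic easy to exercise without touching the network.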
In this next example, we'll pull only a specific table of data.
We used httpcache's table_dump() method to show a default table view of the data we extracted from the external web page. In reality, you'd probably not use table_dump(). You'd probably want to work with the individual cells of data. For example, if you wanted to output just the 3rd column of the 5th data row:
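As a language-neutral illustration of working with individual cells (this is not httpcache's code, and `TableRows` and `extract_rows` are names invented for the sketch), the Python below parses a single, non-nested HTML table into a list of rows and then indexes one cell. It assumes the first row of the table is a header row.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being parsed
        self._cell = None   # text fragments of the cell currently being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_rows(html):
    """Return the table in html as a list of rows of cell text (flat tables only)."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows
```

Under the header-row assumption, the data rows are `extract_rows(html)[1:]`, so the 3rd column of the 5th data row is `extract_rows(html)[1:][4][2]` (indexes are zero-based).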
Finally, another powerful feature of httpcache is the ability to output binary data directly. You can use this to output data directly within an <img> tag. You need to create a simple ASP page that uses httpcache to return the image. That code is shown at the end of this article. I name this file img_cache.asp.
The above will output an image from www.snippetedit.com. The image will be cached locally on your webserver's hard drive for 60 seconds. Actually, it will be saved there indefinitely, but any new hits to your page after 60 seconds will cause a new hit to the source site.
Here is the code for img_cache.asp:
Notice that in the code above there is an include for md5.asp. httpcache uses the md5 function to automatically create a cache filename if you don't supply one. I did not write md5.asp, but it works well, so here it is. Copy the code and save it as a file named md5.asp.
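The idea of deriving a cache filename from a hash can be sketched in a few lines. This Python is an illustration only, not the httpcache implementation: the function name `cache_filename`, the `cache` directory default, the `.cache` extension, and the choice to hash the URL are all assumptions made for the sketch.

```python
import hashlib
import os


def cache_filename(url: str, cache_dir: str = "cache") -> str:
    """Derive a stable, filesystem-safe cache filename from a URL."""
    # MD5 is used here only to get a fixed-length name, not for security.
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()  # 32 hex characters
    return os.path.join(cache_dir, digest + ".cache")
```

The same URL always maps to the same file, and characters that are awkward in filenames (slashes, colons, query strings) never reach the disk.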
Finally, here is the httpcache_class.asp code that you include in your pages like the examples in this article.
This article was a bit more complex than some, but if you put all the pieces together and play with the examples, you'll find httpcache to be a powerful and simple-to-use class for screen-scraping. The caching feature will make you a good neighbor, as it keeps you from causing unnecessary hits to those external sites.
Hope you enjoyed this article and the code. Troy Wolf is the author of SnippetEdit, a website editor written in PHP. SnippetEdit is as simple as it gets for non-technical users to edit pre-defined snippets of content in their websites.
User Comments on 'HTTP screen-scraping and caching'
|
Posted by :
ghubbell at 15:11 on Tuesday, October 25, 2005
|
Error in Win Server 2003, IIS6:
Microsoft VBScript runtime error '800a01fa'
Class not defined: 'httpcache'
test1.asp, line 4
Any help?
| |
Posted by :
acedish at 15:10 on Sunday, April 23, 2006
|
in the examples you need to change the first 2 lines from;
<%
<!-- #include file="httpcache_class.asp" -->
to
<!-- #include file="httpcache_class.asp" -->
<%
| |
Posted by :
lenehanj at 10:25 on Tuesday, May 23, 2006
|
Hi
I'm trying to get the above script to work. It is loading the external URL and extracting the info I need OK! No problems there. However, it will not cache the info.
In the httpcache_class.asp, I have set: cachePath = "cache/".
I have a folder called "cache" at the same level as the asp files in your example.
I keep getting "cache save failed" before the scraped information.
Any ideas? I have never used cache before, so any help is appreciated.
John L
| |
Posted by :
!aborabi at 09:19 on Monday, July 24, 2006
|
Very good tutorial. I like it.
---------------------
http://www.astrawebhosting.net
http://www.visionwebhosting.net
| |
Posted by :
rizwanbhutta at 02:18 on Saturday, September 22, 2007
|
It is a very good tutorial and it works very well, but I have a problem with the example that grabs table info from a URL. At the line
.table_extract "column heading to sort", 0, true
what is "column heading to sort"? Is it the name of a column or something else?
| |