Programmatically iterating over a database via basic HTTP requests
Let me be more specific: I'm going to be parsing a database based solely on an incrementing integer in a URL, such as site.com/?entry=1 -> site.com/?entry=2 -> site.com/?entry=3, and so on. I just need to know how to send the request and how to receive the resulting HTML page. I'm looking for the least intrusive way to do this, since I don't want to take up much of the site's bandwidth. This is a one-time job, so it doesn't have to be elegant or maintainable or even in any specific language, but I would prefer a language with powerful string-manipulation facilities.
I don't know much about this type of programming, so I am looking for as simple a solution as possible. I'm fairly sure I've seen Perl scripts that can do this sort of thing in a few lines, but Google has failed me so far. Thank you.
wget seems like a useful tool here. If you have access to a *nix shell, the loop below should do the job. I'm using zsh for the loop, since it expands ranges. If there's a proxy in the way, wget can use that too, although you'll need to check the manual.
Assuming you know the highest value, replace n with that. Otherwise, wget will just report 404 errors if it runs over, and they should be easy on the server.
mkdir data_html
cd data_html
for i in {1..n}; do   # replace n with the highest entry number
  wget "http://site.com/?entry=$i"
  wait
done
Running through:
- we want to download to a specific directory, so we move into it.
- in the loop we get the file, let it finish, then move to the next one.
If you want, you can add 'sleep 2' or similar after 'wait' to keep load down.
I realise that this doesn't really meet all of your criteria, but I think it should get you the pages you're looking for, and that will make parsing easy.
Why would you use that for something like this when there are libraries that will handle the HTTP protocol bits, or solutions like the one linden posted? Making the HTTP request by manipulating things at the socket level is serious overkill and extra effort.
If you want the downloading and parsing done all in one shot instead of broken up into two steps (request all the pages, then parse the data later), then PHP has libraries for making HTTP requests, though I haven't touched PHP in so long that I can't point you in the right direction there. If you want to do it in Perl, since you mentioned it, you want to look at making HTTP GET requests using LWP. The LWP package may need to be installed from CPAN, although probably not if Perl gets much use on the machine you are doing this on.
In Perl it would go something like this. I'm sure I missed something, as I did this from memory and without testing.
use LWP::UserAgent;
use HTTP::Request;

my $base_url = 'http://www.whatever.com/';
my $ua = LWP::UserAgent->new();

# set $last_num to the highest entry number
foreach my $entry (1 .. $last_num) {
    # the entry number belongs in the query string, not in the request body
    my $request = HTTP::Request->new(GET => "$base_url?entry=$entry");
    my $result = $ua->request($request);
    if ($result->is_success) {
        my $html = $result->content;
        # parse the html here
    } else {
        # didn't get an HTTP 200 response; log the failure or print it
        warn "entry $entry failed: " . $result->status_line . "\n";
    }
}
What's the language that you tend to use the most? Most languages have some sort of library (either built in or available online) that you can use for this purpose. For example, if you're into the .NET languages, you could use the classes in System.Net to retrieve web pages as a stream.
So what do you usually use? That'll let us help you in the way that'll be most natural for you.
Thanks for the suggestions. I ended up finding LuaSocket, a module for Lua that adds network support. It's the most convenient option, since the database is being implemented in Lua as well.
One more question, though. With this kind of parsing, what is an appropriate wait time between requests, assuming each page is about 10KB?
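For reference, fetching a single page with LuaSocket looks roughly like the sketch below. This is untested against the real site; socket.http is LuaSocket's HTTP module, and the URL is just the placeholder from the first post.

local http = require("socket.http")

-- fetch one entry; on success, body holds the HTML and code is 200
local body, code = http.request("http://site.com/?entry=1")
if body then
  -- parse the HTML in body here
else
  -- on failure, body is nil and code carries the error message
  print("request failed: " .. tostring(code))
end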
Hmm, I think that might depend on the server. Honestly though, unless they've set some kind of explicit restriction, you should be good to go as soon as the last one is received... or even before that.
Three seconds is considered good Internet citizenship.
Oh, I didn't know there was a de facto rule, my bad. I'm kind of embarrassed now... I've written programs like this before and never put a pause on it. Never thought about the fact that it might be in bad taste to do so. =(
Orgs that run their own servers sometimes ban anything that looks like bot activity. It's generally not a good idea to model your programs after DoS apps.
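Putting the LuaSocket pieces together with the three-second pause suggested above, the whole loop might look something like this. Again, a rough sketch only: last_entry is a made-up placeholder for the highest entry number, the URL is still the placeholder from the first post, and socket.sleep does the waiting.

local http = require("socket.http")
local socket = require("socket")

local last_entry = 100   -- placeholder: set this to the highest entry number
for entry = 1, last_entry do
  local body, code = http.request("http://site.com/?entry=" .. entry)
  if body then
    -- parse the HTML in body here
  else
    print("entry " .. entry .. " failed: " .. tostring(code))
  end
  socket.sleep(3)   -- three seconds between requests
end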
Job done.