
Grabbing a bunch of sequential pages in Windows

FyreWulff Registered User, ClubPA regular
edited March 2011 in Help / Advice Forum
Alright, I've tried googling this. Apparently Google either thinks I'm trying to make AJAX listings with PHP, or points me to Linux tutorials that use a bash shell, which I don't have in Windows.

What I'd like to do is save all these pages in order, starting with

http://www.bungie.net/stats/PlayerStatsHalo2.aspx?player=FyreWulff&sg=3&ctl00_mainContent_bnetpgl_recentgamesChangePage=1

and ending with

http://www.bungie.net/stats/PlayerStatsHalo2.aspx?player=FyreWulff&sg=3&ctl00_mainContent_bnetpgl_recentgamesChangePage=267

The list is finite, as Halo 2 is no longer playable online, so no new pages will be showing up here. I need to download them because I'll be changing my gamertag, which may delete these listings.

So short of using Firefox's "Save Page" 267 times or installing Ubuntu again, is there a really nice way to grab all those pages and download them sequentially in Windows?

FyreWulff on

Posts

  • Spam Registered User regular
    edited March 2011
    Dump the first URL into Excel, drag it down to create a list of all 267 URLs, and save it as a text or HTML file (you'll probably need to find/replace in "a href=" tags in the HTML one to make sure the links generate properly).

    Then open the list with a download manager like Flashget to download them all.
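
    If you'd rather skip Excel, a cmd.exe one-liner can generate the same list (a sketch - urls.txt is just a name picked for illustration, and the &s need caret-escaping so cmd doesn't treat them as command separators):

    (for /L %i in (1,1,267) do @echo http://www.bungie.net/stats/PlayerStatsHalo2.aspx?player=FyreWulff^&sg=3^&ctl00_mainContent_bnetpgl_recentgamesChangePage=%i)>urls.txt

    Use %%i instead of %i if you put that line in a .bat file.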

    *Edit*

    In fact, here you go - http://dl.dropbox.com/u/4197966/Book1.htm - just drop that list into a download manager

    Spam on
  • Echo ski-bap ba-dap Moderator, Administrator admin
    edited March 2011
    Are you only interested in the HTML page, no external images/resources?

    Find a Windows version of curl, run
    curl http://www.bungie.net/stats/PlayerStatsHalo2.aspx?player=FyreWulff&sg=3&ctl00_mainContent_bnetpgl_recentgamesChangePage=[1-267] -o page#1.html
    

    edit: here's a Windows binary
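
    For reference, curl expands a [1-267] range in the URL into one request per number, and #1 in the -o filename is replaced by the current range value, so the command above should produce page1.html through page267.html. A stand-in example:

    curl http://example.com/page[1-3].html -o page#1.html

    That would fetch page1.html, page2.html, and page3.html (example.com being a placeholder host here).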

    Echo on
  • FyreWulff Registered User, ClubPA regular
    edited March 2011
    Yeah. I just need the HTML.

    Grabbed that, ran it.


    It prints a bunch of HTML to the command line (which looks like the correct page I'm looking for) and then spits out
     - bunch of html that looks correct -
          </div>
    </body>
    </html>
    'sg' is not recognized as an internal or external command,
    operable program or batch file.
    'ctl00_mainContent_bnetpgl_recentgamesChangePage' is not recognized as an intern
    al or external command,
    operable program or batch file.
    


    So I was like "wait, I think the & is messing it up" and converted all the &s, ?s, and =s to URL percent-escapes, saved it as a .bat, and ran:
    curl http://www.bungie.net/stats/PlayerStatsHalo2.aspx%3Fplayer%3DFyreWulff%26sg%3D3%26ctl00_mainContent_bnetpgl_recentgamesChangePage%3D[1-5] -o page#1.html
    

    Output:
    C:\curl>curl http://www.bungie.net/stats/PlayerStatsHalo2.aspxFplayerDFyreWulff6
    sgD36ctl00_mainContent_bnetpgl_recentgamesChangePageD[1-5] -o page#1.html
    
    [1/5]: http://www.bungie.net/stats/PlayerStatsHalo2.aspxFplayerDFyreWulff6sgD36c
    tl00_mainContent_bnetpgl_recentgamesChangePageD1 --> page1.html
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100  2119  100  2119    0     0  11271      0 --:--:-- --:--:-- --:--:-- 22542
    
    [2/5]: http://www.bungie.net/stats/PlayerStatsHalo2.aspxFplayerDFyreWulff6sgD36c
    tl00_mainContent_bnetpgl_recentgamesChangePageD2 --> page2.html
    100  2119  100  2119    0     0  22542      0 --:--:-- --:--:-- --:--:-- 22542
    
    [3/5]: http://www.bungie.net/stats/PlayerStatsHalo2.aspxFplayerDFyreWulff6sgD36c
    tl00_mainContent_bnetpgl_recentgamesChangePageD3 --> page3.html
    100  2119  100  2119    0     0  34177      0 --:--:-- --:--:-- --:--:-- 34177
    
    [4/5]: http://www.bungie.net/stats/PlayerStatsHalo2.aspxFplayerDFyreWulff6sgD36c
    tl00_mainContent_bnetpgl_recentgamesChangePageD4 --> page4.html
    100  2119  100  2119    0     0  27166      0 --:--:-- --:--:-- --:--:-- 27166
    
    [5/5]: http://www.bungie.net/stats/PlayerStatsHalo2.aspxFplayerDFyreWulff6sgD36c
    tl00_mainContent_bnetpgl_recentgamesChangePageD5 --> page5.html
    100  2119  100  2119    0     0  34177      0 --:--:-- --:--:-- --:--:-- 34177
    

    It's now downloading the pages, but apparently I'm a fucktard and don't know how to escape percent signs in the Windows command line: it's obviously stripping each % plus the first character after it and leaving the rest, resulting in 404s (I've reduced the number of pages it attempts to pull down while debugging this, so don't worry about that part).

    What am I doing wrong here?

    FyreWulff on
  • Echo ski-bap ba-dap Moderator, Administrator admin
    edited March 2011
    Oh yeah. Try just wrapping the URL in quotes - curl "http://..."
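
    For what it's worth, the .bat was eating those characters because % starts variable expansion in batch scripts - %3 means "third script argument", so %3F collapses to just F. If you ever do need a literal % in a .bat, double it:

    rem sketch: in a .bat file, %% emits a literal percent sign
    echo 100%% done

    In this case the quotes are the better fix anyway, since percent-encoding the ? and & would change the URL the server actually sees.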

    Echo on
  • FyreWulff Registered User, ClubPA regular
    edited March 2011
    curl "http://www.bungie.net/stats/PlayerStatsHalo2.aspx?player=FyreWulff&sg=3&ctl00_mainContent_bnetpgl_recentgamesChangePage="[1-267] -o page#1.html
    

    Seems to be working! Checked the output of a smaller batch and it's matching up with what I see on Bungie.net, so we can consider this solved as I slowly grab all 267 pages. Thanks for the help, Echo.
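
    For the record, putting the [1-267] range inside the quotes should work just as well - cmd strips the quotes before curl sees the argument, and curl does its range globbing either way:

    curl "http://www.bungie.net/stats/PlayerStatsHalo2.aspx?player=FyreWulff&sg=3&ctl00_mainContent_bnetpgl_recentgamesChangePage=[1-267]" -o page#1.html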

    Spam, I'll save your strategy for times when I can't download or access curl. Thank you for that suggestion.

    FyreWulff on