
Grabbing a bunch of sequential pages in Windows

FyreWulff Registered User, ClubPA regular
edited March 2011 in Help / Advice Forum
Alright, I've tried googling this. Apparently Google either thinks I'm trying to make AJAX listings with PHP, or points me to Linux tutorials that use a bash shell, which I don't have in Windows.

What I'd like to do is save all these pages in order, starting with

http://www.bungie.net/stats/PlayerStatsHalo2.aspx?player=FyreWulff&sg=3&ctl00_mainContent_bnetpgl_recentgamesChangePage=1

and ending with

http://www.bungie.net/stats/PlayerStatsHalo2.aspx?player=FyreWulff&sg=3&ctl00_mainContent_bnetpgl_recentgamesChangePage=267

The list is finite, as Halo 2 is no longer playable online, so no new pages will be showing up here. I need to download them because I'll be changing my gamertag, which may delete these listings.

So short of using Firefox's "Save Page" 267 times or installing Ubuntu again, is there a really nice way to grab all those pages and download them sequentially in Windows?

FyreWulff on

Posts

  • Spam Registered User regular
    edited March 2011
    Dump the first URL into Excel, drag it down to create a list of all 267 URLs, and save it as a text or HTML file (you'll probably need to find/replace in "a href=" tags in the HTML one to make sure the links generate properly).

    Then open the list with a download manager like Flashget to download them all.
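
    If you'd rather skip Excel, a cmd.exe one-liner can generate the same list (a sketch - urls.txt is just a name picked for illustration, and the &s need caret-escaping so cmd doesn't treat them as command separators):

    (for /L %i in (1,1,267) do @echo http://www.bungie.net/stats/PlayerStatsHalo2.aspx?player=FyreWulff^&sg=3^&ctl00_mainContent_bnetpgl_recentgamesChangePage=%i)>urls.txt

    Use %%i instead of %i if you put that line in a .bat file.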

    *Edit*

    In fact, here you go - http://dl.dropbox.com/u/4197966/Book1.htm - just drop that list into a download manager

    Spam on
  • Echo ski-bap ba-dap Moderator, Administrator admin
    edited March 2011
    Are you only interested in the HTML page, no external images/resources?

    Find a Windows version of curl, run
    curl http://www.bungie.net/stats/PlayerStatsHalo2.aspx?player=FyreWulff&sg=3&ctl00_mainContent_bnetpgl_recentgamesChangePage=[1-267] -o page#1.html
    

    edit: here's a Windows binary
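
    For reference, curl expands a [1-267] range in the URL into one request per number, and #1 in the -o filename is replaced by the current range value, so the command above should produce page1.html through page267.html. A stand-in example:

    curl http://example.com/page[1-3].html -o page#1.html

    That would fetch page1.html, page2.html, and page3.html (example.com being a placeholder host here).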

    Echo on
  • FyreWulff Registered User, ClubPA regular
    edited March 2011
    Yeah. I just need the HTML.

    Grabbed that, ran it.


    It prints a bunch of HTML to the command line (which looks like the correct page I'm looking for) and then spits out
     - bunch of html that looks correct -
          </div>
    </body>
    </html>
    'sg' is not recognized as an internal or external command,
    operable program or batch file.
    'ctl00_mainContent_bnetpgl_recentgamesChangePage' is not recognized as an intern
    al or external command,
    operable program or batch file.
    


    So I was like "wait, I think the & is messing it up" and converted all the &s, ?s, and =s to URL percent-escapes, saved it as a .bat, and ran:
    curl http://www.bungie.net/stats/PlayerStatsHalo2.aspx%3Fplayer%3DFyreWulff%26sg%3D3%26ctl00_mainContent_bnetpgl_recentgamesChangePage%3D[1-5] -o page#1.html
    

    Output:
    C:\curl>curl http://www.bungie.net/stats/PlayerStatsHalo2.aspxFplayerDFyreWulff6
    sgD36ctl00_mainContent_bnetpgl_recentgamesChangePageD[1-5] -o page#1.html
    
    [1/5]: http://www.bungie.net/stats/PlayerStatsHalo2.aspxFplayerDFyreWulff6sgD36c
    tl00_mainContent_bnetpgl_recentgamesChangePageD1 --> page1.html
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100  2119  100  2119    0     0  11271      0 --:--:-- --:--:-- --:--:-- 22542
    
    [2/5]: http://www.bungie.net/stats/PlayerStatsHalo2.aspxFplayerDFyreWulff6sgD36c
    tl00_mainContent_bnetpgl_recentgamesChangePageD2 --> page2.html
    100  2119  100  2119    0     0  22542      0 --:--:-- --:--:-- --:--:-- 22542
    
    [3/5]: http://www.bungie.net/stats/PlayerStatsHalo2.aspxFplayerDFyreWulff6sgD36c
    tl00_mainContent_bnetpgl_recentgamesChangePageD3 --> page3.html
    100  2119  100  2119    0     0  34177      0 --:--:-- --:--:-- --:--:-- 34177
    
    [4/5]: http://www.bungie.net/stats/PlayerStatsHalo2.aspxFplayerDFyreWulff6sgD36c
    tl00_mainContent_bnetpgl_recentgamesChangePageD4 --> page4.html
    100  2119  100  2119    0     0  27166      0 --:--:-- --:--:-- --:--:-- 27166
    
    [5/5]: http://www.bungie.net/stats/PlayerStatsHalo2.aspxFplayerDFyreWulff6sgD36c
    tl00_mainContent_bnetpgl_recentgamesChangePageD5 --> page5.html
    100  2119  100  2119    0     0  34177      0 --:--:-- --:--:-- --:--:-- 34177
    

    It's now downloading the pages, but apparently I'm a fucktard and don't know how to escape percent signs in the Windows command line: it's obviously stripping each % plus the first character after it and leaving the rest, resulting in 404s (I've reduced the number of pages it attempts to pull down while debugging this, so don't worry about that part).

    What am I doing wrong here?

    FyreWulff on
  • Echo ski-bap ba-dap Moderator, Administrator admin
    edited March 2011
    Oh yeah. Try just wrapping the URL in quotes - curl "http://..."
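
    For what it's worth, the .bat was eating those characters because % starts variable expansion in batch scripts - %3 means "third script argument", so %3F collapses to just F. If you ever do need a literal % in a .bat, double it:

    rem sketch: in a .bat file, %% emits a literal percent sign
    echo 100%% done

    In this case the quotes are the better fix anyway, since percent-encoding the ? and & would change the URL the server actually sees.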

    Echo on
  • FyreWulff Registered User, ClubPA regular
    edited March 2011
    curl "http://www.bungie.net/stats/PlayerStatsHalo2.aspx?player=FyreWulff&sg=3&ctl00_mainContent_bnetpgl_recentgamesChangePage="[1-267] -o page#1.html
    

    Seems to be working! Checked the output of a smaller batch and it's matching up with what I see on Bungie.net, so we can consider this solved as I slowly grab all 267 pages. Thanks for the help, Echo.
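
    For the record, putting the [1-267] range inside the quotes should work just as well - cmd strips the quotes before curl sees the argument, and curl does its range globbing either way:

    curl "http://www.bungie.net/stats/PlayerStatsHalo2.aspx?player=FyreWulff&sg=3&ctl00_mainContent_bnetpgl_recentgamesChangePage=[1-267]" -o page#1.html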

    Spam, I'll save your strategy for times when I can't download or access curl. Thank you for that suggestion.

    FyreWulff on