Posted By

noah on 02/16/08


Tagged

auth archive recursive download cookies mirror wget automation agent scraping authorization one-liners


Download an entire site with wget -pkr


Published in: Bash


URL: http://lifehacker.com/software/top/geek-to-live--mastering-wget-161202.php

Where to Get Even More WGet Hacks

See also these killer wget hacks by Jeff Veen.

The WGet Hacks

Here are a few recipes to download and archive an entire Web site, starting at the given page and recursing down 1 level. To change how many levels deep wget goes, change the number given after -l.
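The short flags used in these recipes can also be spelled out, which makes the depth argument easier to spot. A minimal sketch, assuming a hypothetical helper named `archive_cmd` (not part of wget) that only prints the command instead of running it:

```shell
# Hypothetical helper: print (do not run) the wget command for a given depth and URL.
# -p = --page-requisites, -k = --convert-links, -r = --recursive, -l = --level
archive_cmd() {
  local depth="${1:-1}"   # recursion depth, defaults to 1
  local url="$2"          # start page
  printf 'wget --page-requisites --convert-links --recursive --level=%s %s\n' "$depth" "$url"
}

archive_cmd 2 http://site
# prints: wget --page-requisites --convert-links --recursive --level=2 http://site
```

Once the printed command looks right, paste it into the shell to run it.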

Pitfalls

As of 2008, WGet doesn't follow @import links in CSS.
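One way to work around this is to list the @import rules in the stylesheets wget did download, then fetch the missing files by hand. A rough sketch, assuming a hypothetical function name `list_css_imports` and that the mirror sits in a local directory. As far as I know, wget 1.12 and later parse CSS themselves, so this should only matter for older builds:

```shell
# List the distinct @import rules found in downloaded .css files.
# The argument is the directory wget saved the site into (defaults to the current directory).
# Usage: list_css_imports site.com
list_css_imports() {
  find "${1:-.}" -type f -name '*.css' \
    -exec grep -hoE '@import[^;]*;' {} + | sort -u
}
```

Each line of output is one rule whose target was probably not fetched.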

# Get http://site and each page it links to, plus linked assets like images and CSS.
# Rewrite hyperlinks to point to the locally downloaded pages.
wget -pkr -l 1 http://site


# Same as above, but also follow links to other domains.
wget -Hpkr -l 1 http://site

# Same as the first example, but send a session cookie.
wget -pkr -l 1 --no-cookies --header "Cookie: JSESSIONID=12345" https://securesite


# Mirror an HTML site.
# Compare time-stamps before overwriting files that already exist (-m already implies -N).
# Wait about 10 seconds between requests, with the wait randomized.
wget -m -N -w10 --random-wait http://site

# Behave very badly by ignoring the robots.txt directive.
# And spoof Mozilla.
# Output is appended to site.com.log
wget -m -N -w10 --random-wait -erobots=off -a site.com.log --user-agent="Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.14) Gecko/2009090214" http://site.com/

# Same bad behaviour, but recurse only 1 level, fetch page assets, and convert links.
wget -pkr -l 1 -N -w10 --random-wait -erobots=off -a site.com.log --user-agent="Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.14) Gecko/2009090214" http://site.com/

# Then of course you can watch the current output from wget with
tail -f site.com.log
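
After a run finishes, the same log can be checked for failed requests. A small sketch, assuming wget's usual "ERROR <code>: <reason>" log lines and a hypothetical function name `log_error_codes`; adjust the pattern if your log looks different:

```shell
# Print the distinct HTTP error codes recorded in a wget log file.
# The argument is the log path, e.g. the file passed to -a above.
# Usage: log_error_codes site.com.log
log_error_codes() {
  grep -oE 'ERROR [0-9]{3}' "$1" | sort -u
}
```

An empty result means no lines in the log matched that pattern.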


Comments

Posted By: hemanthhm on January 11, 2009

More tips

http://ubunt2.blogspot.com/2009/01/wget-tircks-and-tips.html
