Published in: Bash
URL: http://lifehacker.com/software/top/geek-to-live--mastering-wget-161202.php
Where to Get Even More WGet Hacks
See also these killer wget hacks by Jeff Veen.
Here are a few recipes for downloading and archiving a Web site, starting from a given page and recursing down 1 level. To change how many levels deep wget recurses, change the number given after -l.
Pitfalls
As of 2008, WGet doesn't follow @import links in CSS.
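One possible workaround for the @import pitfall, sketched under assumptions: after a download finishes, scan the saved stylesheets for @import rules and print the referenced names so they can be fetched by hand. The function name list_css_imports is made up for this example, and the pattern only covers the common `@import "x.css"` and `@import url(x.css)` forms. It needs a grep that supports -o (GNU and BSD grep do).

```shell
#!/bin/sh
# list_css_imports: hypothetical helper, not part of wget.
# Prints the target of every @import rule found in the given CSS files.
list_css_imports() {
    # Step 1: grep -o keeps only the matched text, from "@import" up to the
    #         end of the file name (stops at a quote, ")", ";" or whitespace).
    # Step 2: sed strips the leading "@import", optional "url(" and quote.
    grep -ohE "@import[[:space:]]+(url\\()?[\"']?[^\"')[:space:];]+" "$@" |
        sed -E "s/@import[[:space:]]+(url\\()?[\"']?//"
}
```

For example, `list_css_imports site/css/*.css` lists the imported stylesheets, one per line. The names are printed as written in the CSS, so relative ones still have to be resolved against the URL of the stylesheet that imports them.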
# Get page.com and each page it links to, as well as linked assets like images and CSS.
# Change hyperlinks to point to the locally downloaded pages.
wget -pkr -l 1 http://site

# Same as above, but also follow links to other domains.
wget -Hpkr -l 1 http://site

# Same as the first example, but send a session cookie.
wget -pkr -l 1 --no-cookies --header "Cookie: JSESSIONID=12345" https://securesite

# Mirror an HTML site.
# Read time-stamps when overwriting files that already exist.
# Wait about 10 seconds between retrievals.
wget -m -N -w10 --random-wait http://site

# Behave very badly by ignoring the robots.txt directive.
# And spoof Mozilla.
# Also, output is appended to site.com.log.
wget -m -N -w10 --random-wait -erobots=off -a site.com.log --user-agent="Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.14) Gecko/2009090214" http://site.com/

wget -pkr -l 1 -N -w10 --random-wait -erobots=off -a site.com.log --user-agent="Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.14) Gecko/2009090214" http://site.com/

# Then of course you can see the current output from wget with
tail -f site.com.log
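Since the recursion depth after -l is the knob most likely to change between runs, the first recipe can be wrapped in a small function that takes the depth as an argument. This is only a sketch: archive_site and the WGET override variable are names invented here, not wget features. Setting WGET=echo prints the arguments instead of downloading, which is handy for a dry run.

```shell
#!/bin/sh
# archive_site: hypothetical wrapper around the "wget -pkr -l N" recipe.
# Usage: archive_site DEPTH URL
archive_site() {
    # Accept only a non-negative integer as the depth.
    case "$1" in
        ''|*[!0-9]*)
            echo "usage: archive_site DEPTH URL" >&2
            return 1
            ;;
    esac
    # -p  fetch page requisites (images, CSS)
    # -k  convert links to point at the local copies
    # -r  recurse, limited to $1 levels by -l
    # WGET is an assumed override for this sketch; it defaults to wget.
    "${WGET:-wget}" -pkr -l "$1" "$2"
}
```

For example, `archive_site 2 http://site` runs the first recipe two levels deep, and `WGET=echo; archive_site 2 http://site` only shows the options that would be passed.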

More tips
http://ubunt2.blogspot.com/2009/01/wget-tircks-and-tips.html