See also these killer wget hacks by Jeff Veen.
Here are a couple of recipes to download and archive an entire Web site, starting with the given page and recursing down 1 level. To change how many levels deep the recursion goes, change the numeric argument given after -l.
Pitfalls
As of 2008, Wget doesn't follow @import links in CSS, so stylesheets pulled in that way are missing from the archive.
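One rough way to see what was missed is to scan the downloaded stylesheets for @import rules and fetch those files by hand. The sketch below assumes a POSIX shell and a grep that supports -o (GNU and BSD grep do). The function name list_css_imports and its directory argument are made up for illustration.

```shell
# Hypothetical helper: print each distinct @import rule found in the
# CSS files under a download directory (defaults to the current one).
list_css_imports() {
    # '*.css*' also catches names wget saved with a query string, e.g. style.css?v=2
    # grep: -o print only the matched text, -h omit file names, -E extended regex
    find "${1:-.}" -type f -name '*.css*' \
        -exec grep -ohE '@import[^;]+;' {} + | sort -u
}
```

Running list_css_imports site after a download prints the @import rules it found, so the referenced stylesheets can be fetched separately.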
- # Get the page at http://site and each page it links to, as well as linked assets like images and CSS. Change hyperlinks to point to the locally downloaded pages.
- wget -pkr -l 1 http://site
- # Same as above, but also follow links to other domains.
- wget -Hpkr -l 1 http://site
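- # Same as the first example, but send a session cookie. --no-cookies turns off wget's own cookie handling so the hand-written Cookie header is sent as-is.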
- wget -pkr -l 1 --no-cookies --header "Cookie: JSESSIONID=12345" https://securesite
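- # Mirror an HTML site.
- # Compare time-stamps so that files that already exist are only re-downloaded when the remote copy is newer.
- # Wait about 10 seconds between retrievals (--random-wait varies the delay between 0.5 and 1.5 times that value).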
- wget -m -N -w10 --random-wait http://site
- # Behave very badly by ignoring robots.txt.
- # Also spoof a Mozilla user agent.
- # Output is appended to site.com.log.
- wget -m -N -w10 --random-wait -erobots=off -a site.com.log --user-agent="Mozilla/5.0 (X11; U; Linux i686; en-US; rv:18.104.22.168) Gecko/2009090214" http://site.com/
- # The same options applied to the one-level recipe instead of a full mirror.
- wget -pkr -l 1 -N -w10 --random-wait -erobots=off -a site.com.log --user-agent="Mozilla/5.0 (X11; U; Linux i686; en-US; rv:22.214.171.124) Gecko/2009090214" http://site.com/
- # You can then watch wget's output as it is written with
- tail -f site.com.log
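The depth argument mentioned at the top can be pulled out into a parameter. This is only a sketch around the first recipe, assuming a POSIX shell. The function name archive_site and the DRY_RUN variable are hypothetical and not part of wget.

```shell
# Hypothetical wrapper around the first recipe with a configurable depth.
# Usage: archive_site URL [DEPTH]
archive_site() {
    url="$1"
    depth="${2:-1}"   # value passed to -l; defaults to 1 level
    # Reject anything that is not a non-negative integer.
    # Note: wget treats -l 0 as unlimited depth, the same as -l inf.
    case "$depth" in
        ''|*[!0-9]*)
            echo "depth must be a non-negative integer" >&2
            return 1
            ;;
    esac
    if [ -n "${DRY_RUN:-}" ]; then
        # Dry run: only print the command that would be executed.
        echo "wget -pkr -l $depth $url"
    else
        wget -pkr -l "$depth" "$url"
    fi
}
```

With DRY_RUN=1 set, archive_site http://site 2 prints the wget command instead of running it, which is a cheap way to check the depth before starting a long download.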