Web scraping with wget
The following wget commands let you download content:
wget link                                            # download a single URL
wget -i file_with_links.txt                          # download every URL listed in a file (one per line)
wget -i links.txt -P ./folderYouWantToSaveTo/ -nc    # download into a folder; -nc (no-clobber) skips files that already exist
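For example, a minimal sketch of the third form (the file and folder names here are only placeholders):

echo "http://www.example.com/page1.xml" >  links.txt
echo "http://www.example.com/page2.xml" >> links.txt
wget -i links.txt -P ./downloads/ -nc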
Practice web scraping by downloading the following database collection:
http://www.perseus.tufts.edu/hopper/collection?collection=Perseus:collection:RichTimes
Steps:
- The first step is to locate the original link where the data is stored. In this case every article links to its source at the bottom of the page; the source is a .xml file.
- Identify the pattern of the link. In this case the main link looks like this: http://www.perseus.tufts.edu/hopper/dltext?doc=Perseus%3Atext%3A2006.05.0001, while the links to the various articles differ and look like this: http://www.perseus.tufts.edu/hopper/text?doc=Perseus%3Atext%3A2006.05.0001%3Aarticle%3D17. As you can see, the main link follows the pattern …hopper/dltext… while the article links follow …hopper/text…
- Download the source code of the webpage that contains all the links to the various articles (see the sketch after this list).
- Clean the document with regular expressions to extract the links.
- Rewrite each link so that it contains …dltext… instead of …text…, since the …dltext… form points to the .xml source.
- Save all the links in a .txt file.
- Open a command-line program like Git Bash (or another terminal) and run the wget command as shown above to download the files.
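Putting the steps together, here is a minimal sketch of the whole pipeline as shell commands. The output file and folder names are placeholders, and the grep pattern assumes the article links appear in the downloaded HTML in exactly the URL-encoded form shown above; adjust it to match what the page source actually contains.

# Download the source code of the collection page
wget -O collection.html "http://www.perseus.tufts.edu/hopper/collection?collection=Perseus:collection:RichTimes"

# Extract the article links, rewrite text -> dltext so the .xml
# source is fetched, prepend the host, and save the list
grep -o 'hopper/text?doc=Perseus%3Atext%3A[0-9.]*%3Aarticle%3D[0-9]*' collection.html |
  sed 's|hopper/text|hopper/dltext|' |
  sed 's|^|http://www.perseus.tufts.edu/|' |
  sort -u > links.txt

# Download the .xml sources, skipping files already present
wget -i links.txt -P ./RichTimes/ -nc

Here sort -u removes duplicate links, and -nc makes the final download safe to re-run if it is interrupted.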