Chapter 6 Data Transfer

In this chapter, we will explore commands that will allow us to download files from the internet.

Command     Description
wget        Download files from the web
curl        Transfer data from or to a server
hostname    Name of the current host
ping        Ping a remote host
nslookup    Name server details

We have not executed the commands in this ebook, as downloading multiple files from the internet takes a lot of time and may result in errors. We have, however, checked all the commands offline to ensure that they work.

6.1 wget

The wget command downloads the contents of a URL, or files, from the internet. Using additional options, we can:

  • download contents/files to a file
  • continue incomplete downloads
  • download multiple files
  • limit download speed and number of retries

Command                            Description
wget url                           Download contents of a URL
wget -O file url                   Download contents of a URL to a file
wget -c                            Continue an incomplete download
wget -P folder_name -i urls.txt    Download all URLs stored in a text file to a specific directory
wget --limit-rate                  Limit download speed
wget --tries                       Limit number of retries
wget --quiet                       Turn off output
wget --no-verbose                  Print basic information
wget --progress=dot                Change progress bar type to dot
wget --timestamping                Check if the timestamp of the file has changed before downloading
wget --wait                        Wait between retrievals
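
The table lists -c, --limit-rate, --tries and --wait, which the sections below do not demonstrate. As a minimal sketch, they can be combined in a small shell function. The function name polite_download and the specific values (200 KB/s, 3 tries, 2 seconds) are assumptions chosen for illustration, not recommendations from this chapter.

```shell
# polite_download: hypothetical helper that resumes partial downloads
# and goes easy on the network. All values below are illustrative.
polite_download() {
  # -c                 continue a partially downloaded file
  # --limit-rate=200k  cap the download speed at about 200 KB/s
  # --tries=3          stop after 3 attempts per URL
  # --wait=2           wait 2 seconds between retrievals
  wget -c --limit-rate=200k --tries=3 --wait=2 "$@"
}

# Example call (not run here):
# polite_download http://cran-logs.rstudio.com/2019/2019-09-29.csv.gz
```

Since --wait applies between retrievals, it only has a visible effect when more than one URL is passed to the function.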

6.1.1 Download URL

Let us first use wget to download the contents of a URL. Note that we are not downloading a file as such, just the content served at the URL. We will use the URL of the R project home page.

wget https://www.r-project.org/

If you look at the list of files, you can see a new file, index.html, which we just downloaded using wget. Downloading contents this way will lead to confusion if we are dealing with multiple URLs, so let us learn to save the contents to a file whose name we specify.

6.1.2 Specify Filename

In this example, we download contents from the same URL and, in addition, specify the name of the file in which the content must be saved. Here we save it in a new file, rhomepage.html, using the -O option (uppercase; the lowercase -o sets wget's log file instead) followed by the filename.

wget -O rhomepage.html https://www.r-project.org/
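
Option case matters in wget: uppercase -O names the file that receives the downloaded content, while lowercase -o names a file that receives wget's own log messages. The sketch below uses both together. The helper name save_page and the log file name wget.log are assumptions for illustration.

```shell
# save_page: hypothetical helper.
#   $1 = URL
#   $2 = file that receives the downloaded content
#   $3 = file that receives wget's log messages
save_page() {
  wget -O "$2" -o "$3" "$1"
}

# Example call (not run here):
# save_page https://www.r-project.org/ rhomepage.html wget.log
```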

6.1.3 Download File

How about downloading a file instead of a URL? In this example, we will download a log file from the RStudio CRAN mirror. It contains the details of R downloads and individual package downloads. If you are a package developer and want to know the countries in which your packages are downloaded, you will find this useful. We will download the file for 29th September 2019 and save it as sep_29.csv.gz.

wget -O sep_29.csv.gz http://cran-logs.rstudio.com/2019/2019-09-29.csv.gz

6.1.4 Download Multiple URLs

How do we download multiple URLs? One way is to specify the URLs one after the other, separated by spaces. Another is to save all the URLs in a file, one per line, and let wget read them from there. In the example below, we have saved multiple URLs in the file urls.txt.

cat urls.txt
## http://cran-logs.rstudio.com/2019/2019-09-26.csv.gz
## http://cran-logs.rstudio.com/2019/2019-09-27.csv.gz
## http://cran-logs.rstudio.com/2019/2019-09-28.csv.gz

We will download all the above URLs and save them in a new folder, downloads. The -i option indicates that the URLs must be read from a file (local or external). The -P option specifies the directory into which all the files will be downloaded.

wget -P downloads -i urls.txt     
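
Since the log files follow a date-based naming pattern, a file such as urls.txt can be generated rather than typed by hand. The sketch below assumes that the pattern seen in the three URLs above holds for the days listed. The loop and variable names are illustrative.

```shell
# Write one URL per line into urls.txt (overwrites an existing file).
base="http://cran-logs.rstudio.com/2019"
: > urls.txt
for day in 26 27 28; do
  printf '%s/2019-09-%s.csv.gz\n' "$base" "$day" >> urls.txt
done

# Show the generated list.
cat urls.txt

# Then download everything listed in the file (not run here):
# wget -P downloads -i urls.txt
```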

6.1.5 Quiet

The --quiet option will turn off wget output. It will not show any of the following details:

  • name of the file being saved
  • file size
  • download speed
  • ETA, etc.

wget --quiet http://cran-logs.rstudio.com/2019/2019-10-06.csv.gz
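
With output turned off, the exit status is the main signal of whether a download worked: wget exits with status 0 on success and a non-zero status otherwise. A minimal sketch, assuming a hypothetical helper named quiet_download:

```shell
# quiet_download: hypothetical helper that reports the result of a
# silent download based on wget's exit status.
quiet_download() {
  # $1 = URL to fetch without any wget output
  if wget --quiet "$1"; then
    echo "downloaded: $1"
  else
    status=$?   # exit status of the failed wget call
    echo "wget failed (exit status $status): $1" >&2
    return "$status"
  fi
}

# Example call (not run here):
# quiet_download http://cran-logs.rstudio.com/2019/2019-10-06.csv.gz
```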

6.1.6 No Verbose

Using the -nv or --no-verbose option, we can turn off verbose output without being completely quiet (as we were in the previous example). Error messages and basic information will still be printed.

wget --no-verbose http://cran-logs.rstudio.com/2019/2019-10-13.csv.gz

6.1.7 Check Timestamp

Let us say we have already downloaded a file from a URL. The file is updated from time to time and we intend to keep the local copy updated as well. With the --timestamping option, wget compares the timestamp of the remote file with that of the local copy and downloads the file only if the remote file is newer; the downloaded file is given the timestamp of the remote file. This is very useful for large files that you do not want to download again unless they have been updated.

wget --timestamping http://cran-logs.rstudio.com/2019/2019-10-13.csv.gz

6.2 curl

The curl command will transfer data from or to a server. We will only look at downloading files from the internet.

Command             Description
curl url            Download contents of a URL
curl url -o file    Download contents of a URL to a file
curl url > file     Download contents of a URL to a file
curl -s             Download in silent or quiet mode

6.2.1 Download URL

Let us download the home page of the R project using curl. Unlike wget, curl prints the downloaded content to the terminal instead of saving it to a file.

curl https://www.r-project.org/

6.2.2 Specify File

Let us download another log file from the RStudio CRAN mirror and save it into a file using the -o option.

curl http://cran-logs.rstudio.com/2019/2019-09-08.csv.gz -o sept_08.csv.gz 

Another way to save a downloaded file is to redirect curl's output using > followed by the name of the file, as shown in the example below.

curl http://cran-logs.rstudio.com/2019/2019-09-01.csv.gz > sep_01.csv.gz
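
Because these log files are named by date, the URL and the local file name can both be derived from a single date. The sketch below assumes the URL pattern seen in the examples above holds for other dates. The helper name fetch_log is hypothetical.

```shell
# fetch_log: hypothetical helper that downloads the log for one day
# and saves it under the same date-based name.
fetch_log() {
  # $1 = date in YYYY-MM-DD form, e.g. 2019-09-08
  day=$1
  year=${day%%-*}   # text before the first hyphen, e.g. 2019
  curl "http://cran-logs.rstudio.com/${year}/${day}.csv.gz" -o "${day}.csv.gz"
}

# Example call (not run here):
# fetch_log 2019-09-08
```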

6.2.3 Download Silently

The -s option lets you download files silently. It mutes curl, so neither the progress meter nor error messages are displayed.

curl http://cran-logs.rstudio.com/2019/2019-09-01.csv.gz -o sept_01.csv.gz -s
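
Fully silent mode also hides failures. curl's -S (--show-error) option, used together with -s, keeps error messages while still hiding the progress meter, and -f (--fail) makes curl return a non-zero exit status on HTTP errors such as 404. A minimal sketch follows. The helper name silent_fetch is an assumption.

```shell
# silent_fetch: hypothetical helper that downloads without a progress
# meter but still reports failures.
silent_fetch() {
  # $1 = URL, $2 = destination file
  # -s hides the progress meter, -S still prints errors,
  # -f turns HTTP errors into a non-zero exit status
  if curl -sS -f -o "$2" "$1"; then
    echo "saved $2"
  else
    echo "download failed: $1" >&2
    return 1
  fi
}

# Example call (not run here):
# silent_fetch http://cran-logs.rstudio.com/2019/2019-09-01.csv.gz sept_01.csv.gz
```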

6.3 R Functions

In R, we can use download.file() to download files from the internet. The table below lists an R counterpart for each command introduced in this chapter; several of them come from add-on packages (curl, R.utils and pingr).

Command     R
wget        download.file()
curl        curl::curl_download()
hostname    R.utils::getHostname.System()
ping        pingr::ping()
nslookup    curl::nslookup()