Chapter 6 Data Transfer
In this chapter, we will explore commands that will allow us to download files from the internet.
Command | Description |
---|---|
wget | Download files from the web |
curl | Transfer data from or to a server |
hostname | Name of the current host |
ping | Ping a remote host |
nslookup | Name server details |
We have not executed the commands in this ebook, as downloading multiple files from the internet would take a lot of time or could result in errors. We have, however, checked all the commands offline to ensure that they work.
6.1 wget
The wget command will download contents of a URL and files from the internet.
Using additional options, we can:
- download contents/files to a file
- continue incomplete downloads
- download multiple files
- limit download speed and number of retries
Command | Description |
---|---|
wget url | Download contents of a url |
wget -O file url | Download contents of url to a file |
wget -c | Continue an incomplete download |
wget -P folder_name -i urls.txt | Download all urls stored in a text file to a specific directory |
wget --limit-rate | Limit download speed |
wget --tries | Limit number of retries |
wget --quiet | Turn off output |
wget --no-verbose | Print basic information |
wget --progress=dot | Change progress bar type to dot |
wget --timestamping | Check if the timestamp of the file has changed before downloading |
wget --wait | Wait between retrievals |
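As a minimal sketch of how several of these options combine, the shell function below resumes, throttles, and retries a download. The function name fetch_politely and the values 200k, 3, and 2 are assumptions chosen for illustration, not recommended settings.

```shell
# Hypothetical helper: resume, throttle, and retry one or more downloads.
fetch_politely() {
  # -c            resume a partially downloaded file if one exists
  # --limit-rate  cap the download speed (200k is an assumed value)
  # --tries       give up after 3 attempts (wget's default is 20)
  # --wait        pause 2 seconds between retrievals
  wget -c --limit-rate=200k --tries=3 --wait=2 "$@"
}
```

Calling fetch_politely with one or more URLs, for example the log files used later in this chapter, would then apply all four options at once. Note that --wait only matters when more than one URL is retrieved.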
6.1.1 Download URL
Let us first use wget to download the contents of a URL. Note that we are not downloading a file as such, just the content of the URL. We will use the URL of the home page of the R project.
wget https://www.r-project.org/
If you look at the list of files, you can see a new file, index.html, which we just downloaded using wget. Downloading contents this way will lead to confusion if we are dealing with multiple URLs. Let us learn to save the contents to a file whose name we specify, which should help avoid confusion.
6.1.2 Specify Filename
In this example, we download contents from the same URL and, in addition, specify the name of the file in which the content must be saved. Here we save it in a new file, rhomepage.html, using the -O option (uppercase) followed by the filename. The lowercase -o option would instead write wget's log messages to that file.
wget -O rhomepage.html https://www.r-project.org/
6.1.3 Download File
How about downloading a file instead of a URL? In this example, we will download a log file from the RStudio CRAN mirror. It contains the details of R downloads and individual package downloads. If you are a package developer and want to know the countries in which your packages are downloaded, you will find this useful. We will download the file for 29th September 2019 and save it as sep_29.csv.gz.
wget -O sep_29.csv.gz http://cran-logs.rstudio.com/2019/2019-09-29.csv.gz
6.1.4 Download Multiple URLs
How do we download multiple URLs? One way is to specify the URLs one after the other, separated by spaces; another is to save all the URLs in a file and read them one by one. In the example below, we have saved multiple URLs in the file urls.txt.
cat urls.txt
## http://cran-logs.rstudio.com/2019/2019-09-26.csv.gz
## http://cran-logs.rstudio.com/2019/2019-09-27.csv.gz
## http://cran-logs.rstudio.com/2019/2019-09-28.csv.gz
We will download all the above URLs and save them in a new folder, downloads. The -i option indicates that the URLs must be read from a file (local or external). The -P option allows us to specify the directory into which all the files will be downloaded.
wget -P downloads -i urls.txt
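If the URLs follow a pattern, as the dated log files above do, urls.txt does not have to be typed by hand. The following sketch assumes the same three dates and the same URL layout as the example above and writes the file with a loop; the variable names base and day are our own.

```shell
# Base URL of the log files, taken from the example above
base="http://cran-logs.rstudio.com/2019"

# Start with an empty urls.txt, then append one URL per day
: > urls.txt
for day in 26 27 28; do
  echo "${base}/2019-09-${day}.csv.gz" >> urls.txt
done

# Inspect the result before handing it to wget
cat urls.txt
```

The resulting file can then be passed to wget with the -i option exactly as shown above.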
6.1.5 Quiet
The --quiet option will turn off wget output. It will not show any of the following details:
- name of the file being saved
- file size
- download speed
- ETA, etc.
wget --quiet http://cran-logs.rstudio.com/2019/2019-10-06.csv.gz
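Because --quiet suppresses error messages as well, nothing on screen indicates that a download has failed. One way to handle this, sketched below, is to check the exit status of wget, which is zero on success and non-zero otherwise. The function name download_or_report is hypothetical.

```shell
# Hypothetical helper: download quietly, but report success or failure.
download_or_report() {
  # $1 = URL to download
  if wget --quiet "$1"; then
    echo "downloaded: $1"
  else
    # In the else branch, $? still holds the exit status of wget
    echo "failed (wget exit status $?): $1" >&2
    return 1
  fi
}
```

A call such as download_or_report http://cran-logs.rstudio.com/2019/2019-10-06.csv.gz would then print a single line describing the outcome.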
6.1.6 No Verbose
Using the -nv or --no-verbose option, we can turn off verbose output without being completely quiet (as we were in the previous example). Any error messages and basic information will still be printed.
wget --no-verbose http://cran-logs.rstudio.com/2019/2019-10-13.csv.gz
6.1.7 Check Timestamp
Let us say we have already downloaded a file from a URL. The file is updated from time to time, and we want to keep the local copy up to date as well. With the --timestamping option, wget compares the timestamp of the remote file with that of the local copy and downloads the file only if the remote version is newer; the downloaded file is given the same timestamp as the remote one. This is very useful for large files that you do not want to download again unless they have been updated.
wget --timestamping http://cran-logs.rstudio.com/2019/2019-10-13.csv.gz
6.2 curl
The curl command will transfer data from or to a server. We will only look at downloading files from the internet.
Command | Description |
---|---|
curl url | Download contents of a url |
curl url -o file | Download contents of url to a file |
curl url > file | Download contents of url to a file |
curl -s | Download in silent or quiet mode |
6.2.1 Download URL
Let us download the home page of the R project using curl. Unlike wget, curl prints the content to the terminal instead of saving it to a file.
curl https://www.r-project.org/
6.2.2 Specify File
Let us download another log file from the RStudio CRAN mirror and save it into a file using the -o option.
curl http://cran-logs.rstudio.com/2019/2019-09-08.csv.gz -o sept_08.csv.gz
Another way to save a downloaded file is to use > followed by the name of the file, as shown in the example below.
curl http://cran-logs.rstudio.com/2019/2019-09-01.csv.gz > sep_01.csv.gz
6.2.3 Download Silently
The -s option will allow you to download files silently. It will mute curl and will not display the progress meter or error messages.
curl http://cran-logs.rstudio.com/2019/2019-09-01.csv.gz -o sept_01.csv.gz -s
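Hiding error messages can make a failed download hard to notice. As a hedged sketch, the -s option can be paired with -S (--show-error), which keeps the progress meter hidden but lets curl print an error message when something goes wrong. The function name fetch_quietly is an assumption made for this example.

```shell
# Hypothetical helper: silent download that still reports errors.
fetch_quietly() {
  # $1 = URL, $2 = name of the file to write
  # -s hides the progress meter and error messages;
  # -S (--show-error) turns error messages back on when used with -s
  curl -sS -o "$2" "$1"
}
```

For instance, fetch_quietly http://cran-logs.rstudio.com/2019/2019-09-01.csv.gz sept_01.csv.gz would save the same log file as above and print nothing unless the transfer fails.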
6.3 R Functions
In R, we can use download.file() to download files from the internet. The following functions and packages offer related functionality that you will find useful.
Command | R |
---|---|
wget | download.file() |
curl | curl::curl_download() |
hostname | R.utils::getHostname.System() |
ping | pingr::ping() |
nslookup | curl::nslookup() |