Bandwidth saving and process acceleration by using cURL


In our daily work we come across tasks that require an automated job, such as a cron job or a continuous integration process with Jenkins, and that job often has to download an external static file, like a third-party library. We may also need the latest version of that external resource, which means downloading it over and over again.

Downloading the file on every run has two consequences:

1. Wasting bandwidth when we already have the current version of the file.

2. Wasting time re-processing the same content, which slows down the main process.

The best approach to solve this problem is to download the file only when it has been modified on the server. To achieve this, we can take advantage of the If-Modified-Since HTTP header.
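
To see what that header does in isolation (the date below is only an example), you can send a conditional request by hand with curl's -H option; a server that honors the header replies 304 Not Modified when its copy has not changed since the given date:

curl -v -H "If-Modified-Since: Sat, 17 May 2014 12:37:04 GMT" http://www.example.com -o index.html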

Saving the bandwidth

A very common approach would be to use curl to perform the download. Fortunately, curl has a time condition option:

-z, --time-cond <date expression>|<file>

              (HTTP/FTP) Request a file that has been modified later than the given time and date, or one that has been modified before that time. The <date expression> can be all sorts of date strings or, if it doesn't match any internal ones, it is taken as a filename and tries to get the modification date (mtime) from <file> instead. See the curl_getdate(3) man pages for date expression details.

              Start the date expression with a dash (-) to make it request for a document that is older than the given date/time, default is a document that is newer than the specified date/time.
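
Both forms of the option in action (the dates here are only placeholders):

# download only if the remote file is newer than the given date
curl -z "Sat, 17 May 2014 12:37:04 GMT" http://www.example.com -o index.html

# leading dash: download only if the remote file is OLDER than the given date
curl -z "-Sat, 17 May 2014 12:37:04 GMT" http://www.example.com -o index.html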

 

Let’s take a look at our first run:

 [iviamontes@localhost ]$ curl -v -z index.html http://www.example.com -o index.html

Warning: Illegal date format for -z, --timecond (and not a file name).

Warning: Disabling time condition. See curl_getdate(3) for valid date syntax.

* Rebuilt URL to: http://www.example.com/

* Adding handle: conn: 0x1c518c0

* Adding handle: send: 0

* Adding handle: recv: 0

* Curl_addHandleToPipeline: length: 1

* - Conn 0 (0x1c518c0) send_pipe: 1, recv_pipe: 0

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current

                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* About to connect() to www.example.com port 80 (#0)

*   Trying 93.184.216.119...

* Connected to www.example.com (93.184.216.119) port 80 (#0)

> GET / HTTP/1.1

> User-Agent: curl/7.32.0

> Host: www.example.com

> Accept: */*

>

< HTTP/1.1 200 OK

< Accept-Ranges: bytes

< Cache-Control: max-age=604800

< Content-Type: text/html

< Date: Sat, 17 May 2014 12:37:04 GMT

< Etag: "359670651"

< Expires: Sat, 24 May 2014 12:37:04 GMT

< Last-Modified: Fri, 09 Aug 2013 23:54:35 GMT

* Server ECS (fll/0761) is not blacklisted

< Server: ECS (fll/0761)

< X-Cache: HIT

< x-ec-custom-error: 1

< Content-Length: 1270

<

{ [data not shown]

100  1270  100  1270    0     0   2519      0 --:--:-- --:--:-- --:--:--  2524

* Connection #0 to host www.example.com left intact

 

The important parts of the output are the warning at the beginning and the 200 OK response. The warning about the date syntax is safe to ignore: the index.html file has not been created yet, so -z has no file to take a modification time from, and curl simply disables the time condition and downloads the file.

 

On the second run:

 [iviamontes@localhost kk]$ curl -v -z index.html http://www.example.com -o index.html

* Rebuilt URL to: http://www.example.com/

* Adding handle: conn: 0x10cb8c0

* Adding handle: send: 0

* Adding handle: recv: 0

* Curl_addHandleToPipeline: length: 1

* - Conn 0 (0x10cb8c0) send_pipe: 1, recv_pipe: 0

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current

                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* About to connect() to www.example.com port 80 (#0)

*   Trying 93.184.216.119...

* Connected to www.example.com (93.184.216.119) port 80 (#0)

> GET / HTTP/1.1

> User-Agent: curl/7.32.0

> Host: www.example.com

> Accept: */*

> If-Modified-Since: Sat, 17 May 2014 12:37:04 GMT

>

< HTTP/1.1 304 Not Modified

< Accept-Ranges: bytes

< Cache-Control: max-age=604800

< Date: Sat, 17 May 2014 12:37:13 GMT

< Etag: "359670651"

< Expires: Sat, 24 May 2014 12:37:13 GMT

< Last-Modified: Fri, 09 Aug 2013 23:54:35 GMT

* Server ECS (fll/0761) is not blacklisted

< Server: ECS (fll/0761)

< X-Cache: HIT

< x-ec-custom-error: 1

<

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0

* Connection #0 to host www.example.com left intact

 

The 304 HTTP status code means that no content was transferred: Last-Modified: Fri, 09 Aug 2013 23:54:35 GMT predates If-Modified-Since: Sat, 17 May 2014 12:37:04 GMT, so the index.html file remains untouched. The bandwidth was saved!
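
One optional refinement: by default curl stamps the saved file with the local download time, so the next -z comparison uses the time of the previous download. Adding the -R (--remote-time) option makes curl set the file's mtime to the server's Last-Modified value instead, so the comparison uses the server's own timestamp:

curl -v -R -z index.html http://www.example.com -o index.html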


No re-processing

The next question is: how do we know when to run the script that processes the download? The answer is in the HTTP response code from the server. A 200 response code means that the file has been updated, and we should run the processing step. On the other hand, a 304 response code means that the file on the server is exactly the same as the one previously downloaded, so we don't need to process everything again.

Once again, curl comes in handy: the -w option formats the output using a list of variables. To avoid noisy logs, combine it with the -s flag, which runs curl in silent mode, so that only the HTTP response code is printed:

[iviamontes@localhost ]$ curl http://example.com -z index.html -o index.html -s -w %{http_code}

304[iviamontes@localhost ]$

Note that %{http_code} is printed without a trailing newline, which is why the shell prompt appears right after the 304. Finally, the complete script should be:

if [[ "$(curl http://example.com -z index.html -o index.html -s -L -w %{http_code})" == "200" ]]; then
    # code here to process index.html, because 200 means it was updated
fi
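
As a sketch of how this could look inside a cron job (process_index.sh is a hypothetical placeholder for whatever processing step you need):

#!/bin/bash
# Fetch index.html only if it is newer than the local copy; -w prints the
# HTTP status code, which we capture to decide whether to re-process.
status="$(curl http://example.com -z index.html -o index.html -s -L -w '%{http_code}')"

if [[ "$status" == "200" ]]; then
    # New content was downloaded: run the (hypothetical) processing step.
    ./process_index.sh index.html
elif [[ "$status" == "304" ]]; then
    echo "index.html unchanged on the server, skipping processing"
fi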

 

If you follow this little trick, you will save not only bandwidth, which can be invested in other processes, but also a lot of precious time!