June 05, 2018

Querying Access Logs on AWS

My favorite type of webapp is a static one.

Here are a few reasons:

  • Your costs are your domain name plus file hosting, and maybe a CDN (which for most sites amounts to pennies a month),
  • Your site’s uptime is hard to beat,
  • and ultimately, your analytics are easy.

“How easy, Max?”

Very easy, I’ll show you how!

To take an example, maxmautner.com is hosted on Amazon S3 behind CloudFront.

To track the amount of traffic your website receives, you can use a 3rd-party analytics provider like Google Analytics.

Tools like Google Analytics suffer from a couple of big problems:

  • they under-count your real traffic due to client-side tampering (e.g. adblockers)
  • they keep you from accessing your raw data, which limits how you can use your traffic data

However, there is an approach that is both easier and more accurate.

Enabling CloudFront Access Logs

I’ve enabled a CloudFront feature that logs all requests to log files on S3:

Enable Access Logs to S3 for a CloudFront Distribution

Log files will appear in your designated location on S3:

View Access Logs on S3
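
If you want to peek inside one of these files yourself, here is a minimal Python sketch. It assumes the standard CloudFront log layout: gzip-compressed text with a `#Fields:` header line naming the columns, followed by tab-separated rows. The function names, the local path argument, and the sample line with its shortened field list are all made up for illustration; real files carry more columns.

```python
import gzip


def parse_cloudfront_log(text):
    """Yield one dict per request from the text of a CloudFront standard log file."""
    fields = []
    for line in text.splitlines():
        if line.startswith("#Fields:"):
            # The header names the columns, separated by spaces.
            fields = line[len("#Fields:"):].split()
        elif line and not line.startswith("#"):
            # Data rows are tab-separated, in the same order as the header.
            yield dict(zip(fields, line.split("\t")))


def read_log_file(path):
    """Parse a gzip-compressed log file already downloaded to a (hypothetical) local path."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return list(parse_cloudfront_log(f.read()))


# Made-up sample with a shortened field list, for illustration only.
SAMPLE = (
    "#Version: 1.0\n"
    "#Fields: date time x-edge-location sc-bytes c-ip cs-method cs(Host) cs-uri-stem sc-status\n"
    "2018-06-05\t01:13:11\tSFO5\t1045\t192.0.2.10\tGET\t"
    "d111111abcdef8.cloudfront.net\t/index.html\t200\n"
)

rows = list(parse_cloudfront_log(SAMPLE))
print(rows[0]["cs-uri-stem"], rows[0]["sc-status"])  # prints: /index.html 200
```

Reading the column names from the `#Fields:` header, rather than hard-coding positions, means the sketch keeps working if the field list differs from the sample.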

There are a couple of techniques for making use of the data that you are now collecting:

  • querying the data as it is
  • performing an Extract-Transform-Load (ETL) of the data into a format that’s better suited to certain types of queries, e.g. loading it into Elasticsearch or Redshift (AKA sharded, columnar Postgres)
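
To make the difference between the two techniques concrete, here is a rough Python sketch. It is not the Athena or Redshift setup itself, just the same two ideas in miniature: `hits_per_path` answers a question straight from the parsed rows, while `to_json_lines` reshapes the rows into newline-delimited JSON, one plausible "transform" step before loading them elsewhere. The rows, function names, and output keys are hypothetical.

```python
import json
from collections import Counter

# Hypothetical rows, shaped like parsed access-log lines.
rows = [
    {"date": "2018-06-05", "cs-uri-stem": "/index.html", "sc-status": "200"},
    {"date": "2018-06-05", "cs-uri-stem": "/index.html", "sc-status": "304"},
    {"date": "2018-06-05", "cs-uri-stem": "/favicon.ico", "sc-status": "404"},
]


def hits_per_path(rows):
    """Query the data as it is: count requests per URL path."""
    return Counter(row["cs-uri-stem"] for row in rows)


def to_json_lines(rows):
    """Transform step of an ETL: keep a few fields, cast types, emit one JSON object per line."""
    return "\n".join(
        json.dumps(
            {
                "date": row["date"],
                "path": row["cs-uri-stem"],
                "status": int(row["sc-status"]),  # status arrives as text in the log
            }
        )
        for row in rows
    )


print(hits_per_path(rows).most_common(1))  # prints: [('/index.html', 2)]
print(to_json_lines(rows))
```

Querying in place is the least work up front; the transform step pays off once the same questions are asked repeatedly over a lot of data.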

I’ll be showing how to use both approaches:

Using Athena to Query Access Logs

Using Redshift to Query Access Logs

A demo with maxmautner.com data

  • Querying using Athena - forthcoming…
  • Querying using Redshift - forthcoming…

Want me to complete the blog post? Let me know!