Querying Access Logs on AWS
My favorite type of webapp is a static one.
Here are a few reasons:
- Your costs are your domain name + file-hosting, and maybe a CDN (which for most sites amounts to pennies a month),
- Your site up-time is hard to beat,
- and ultimately, your analytics is easy.
“How easy, Max?”
Very easy, I’ll show you how!
To take an example, maxmautner.com is hosted on Amazon S3 behind Cloudfront.
In order to track the amount of traffic your website receives you can use a 3rd party analytics provider like Google Analytics.
Tools like Google Analytics suffer from a couple big problems:
- they under-count your real traffic due to client-side tampering (e.g. adblockers)
- they impede you from accessing your raw data–imposing limitations on how you can use your traffic data
However there is an approach that is even easier/more accurate.
- Enabling Cloudfront Access Logs
- Using Athena to Query Access Logs
- Using Redshift to Query Access Logs
- A demo with maxmautner.com data
Enabling Cloudfront Access Logs
I’ve enabled a feature of Cloudfront to log all requests to logfiles on S3:
Log files will appear in your designated location on S3:
There are a couple techniques for making use of the data that you are now collecting:
- query the data as it is
- performing Extract-Transform-Load of the data to a query data format that’s more optimal for certain types of queries, e.g. Elasticsearch or Redshift (AKA shared, columnar Postgres)
I’ll be showing how to perform both approaches:
Using Athena to Query Access Logs
- What is it? “…a serverless, interactive query service that makes it easy to analyze big data in S3 using standard SQL.”
- What is it really? hosted Presto, an open source distributed SQL query engine for running interactive analytic queries on data where it is located (e.g. on S3)
- How to use it?
- Pricing: $5 per TB of data scanned (by query)
- When does it make sense? When you have an infrequent and predictable query load, or you a small dataset (e.g. <50TB–e.g. a full-table scan would be 50 x $5 = $250)
Using Redshift to Query Access Logs
- What is it? “…a fast, fully managed data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL”
- How to use it?
- ETL (Glue?)
- Lambdas (to perform ETL)
- Pricing: $180+/month (on-demand), $113+/month (12-month reserve pricing)
- When does it make sense? When you have frequent or undpredictable query load, and you have a large dataset
A demo with maxmautner.com data
- Querying using Athena - forthcoming….
- Querying using Redshift - forthcoming….
Want me to complete the blog post? Let me know!