Processing data from Contest Website using Athena (AWS)
Based on the post titled “AWS Architecture for an IPL Cricket Contest Website”
Athena uses AWS S3 as it’s persistent store and supports files in the following formats — CSV, TSV, JSON, or Textfiles and also supports open source columnar formats such as Apache ORC and Apache Parquet. Athena also supports compressed data in Snappy, Zlib, LZO, and GZIP formats.
In our use case, we log data in plain JSON and also use GZIP compression to utilize the storage space optimally. Athena recognizes whether the files are GZIP compressed based on their extension.
I will now explain how you can write SQL queries on the S3 data collected from the contest website and process it using Athena.
Open Athena from AWS console
Click on Settings from Top right option which will open below screen
Enter result location from S3. Basically this location is where Athena stores query results so for our demo, i am keeping the same S3 location as used in my original post where we created S3 Website for Cricket Contest.
Keep Encrypt query results and autocomplete as unchecked and hit save button.
Create query as below
CREATE EXTERNAL TABLE IF NOT EXISTS ipl_cricket_contest_data(
`first_name` string,
`last_name` string,
`phone` string,
`email` string,
`gender` string,
`ip_address` string,
`time` string
)
ROW FORMAT SERDE ‘org.openx.data.jsonserde.JsonSerDe’ LOCATION ‘s3://iplcricketcontest.com/2021/’
Above query will create table with name “ipl_cricket_contest_data”.
A SerDe (Serializer/Deserializer) is a way in which Athena interacts with data in various formats.
Location is where Athena should read data from, here we have mentioned our S3 bucket location.
Once you run above query, it will show a Query successful message and a table will be created on left hand side list
Click on Preview table as per below screen, this menu will appear of you click on 3 vertical dots at the end of table name.
It will execute below query and data will be shown as below
This usually works like normal SQL query, you can even drop table. Its basically a Presto SQL — Query Language.
In case your S3 bucket data is encrypted, use below which will allow Athena to read Encrypted data and store it in table.
TBLPROPERTIES (‘has_encrypted_data’=’true’)
So that’s how you can read data from S3 using Athena. There are couple of more features for Athena that i will discuss in my upcoming post.
If the article has given you some value, or you know someone who might need it, please share it on the Internet. Don’t let him idle on his blog and waste his potential.
I would also be pleased if you write a comment.