Analyzing Apache Access Logs in Hive Without RegexSerDe
Parsing Apache logs is a well documented Hive use case, but when I tried to implement it recently using RegexSerDe I could not get it to work. I copied the gzipped Apache logs to S3 and used Amazon Elastic MapReduce to create the Hive table as documented, but when I tried to select from the table all of the values came out as NULL. I tried some different versions of the Hive DDL that I found online, but none of them worked.
Rather than continuing to debug that implementation I decided to take a different approach. Rather than using RegexSerDe, I just created a table with one column representing the entire log line and used the Hive regexp_extract function to select various virtual columns.
For example, here’s the table in Hive:
1 2 3 4 5 6 | |
And here’s a query to extract the HTTP response code from each log line so we can count how many times each response code occurs in a given day:
1 2 3 4 5 | |
The rest of my project mostly involved parsing the request. I was going to have to use regexp_extract to do that anyways, so using that function to do all of the parsing was pretty easy.
