Analyzing Apache Access Logs in Hive Without RegexSerDe

Parsing Apache logs is a well documented Hive use case, but when I tried to implement it recently using RegexSerDe I could not get it to work. I copied the gzipped Apache logs to S3 and used Amazon Elastic MapReduce to create the Hive table as documented, but when I tried to select from the table all of the values came out as NULL. I tried some different versions of the Hive DDL that I found online, but none of them worked.

Rather than continuing to debug that implementation I decided to take a different approach. Rather than using RegexSerDe, I just created a table with one column representing the entire log line and used the Hive regexp_extract function to select various virtual columns.

For example, here’s the table in Hive:

CREATE EXTERNAL TABLE IF NOT EXISTS apache_log (
  log_line STRING
)
PARTITIONED BY (dt STRING)
STORED AS TEXTFILE
LOCATION '${BUCKET}/apache';

And here’s a query to extract the HTTP response code from each log line so we can count how many times each response code occurs in a given day:

select regexp_extract(log_line,' ([0-9]{3}) (?:.*[0-9]{1,6})',1) as http_response_code, 
  count(1)
from apache_log 
where dt='2013-08-12'
group by regexp_extract(log_line,' ([0-9]{3}) (?:.*[0-9]{1,6})',1);

The rest of my project mostly involved parsing the request. I was going to have to use regexp_extract to do that anyways, so using that function to do all of the parsing was pretty easy.

Aug 13th, 2013

Hive

Comments

Analyzing Apache Access Logs in Hive Without RegexSerDe

Posted by Ike Walker

Comments