external table on flume log files in S3

Søren Tue, 24 Apr 2012 07:21:00 -0700

Hi Hive community

We are collecting huge amounts of data into Amazon S3 using Flume.

In Elastic MapReduce, we have so far managed to create an external Hive table on JSON-formatted, gzipped log files in S3 using a customized SerDe. The log files are collected and stored in one single folder, with file names following this pattern:

usr-20120423-012725137+0000.2392780833002846.00000029.gz
usr-20120423-012928765+0000.2392904461259123.00000029.gz
usr-20120423-013032368+0000.2392968063991639.00000029.gz

There are thousands to millions of these files. Is there a way to make Hive benefit from the datetime stamp in the file names? For example, to run queries on smaller subsets, or to filter files when creating the external table.

If I filter on INPUT__FILE__NAME, the job gets done, but there is no significant performance gain. I guess this is due to the evaluation order of the SQL statement, i.e. processing the entire repository takes the same time as processing only one day's logs, with the same large total number of open-file jobs.


SELECT *
FROM mytable
WHERE INPUT__FILE__NAME LIKE 's3://myflume-logs/usr-20120423%';
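
One direction I have been considering, as an untested sketch only: since the date is already in each file name, the files could be grouped by day outside Hive, and each day's group could then be copied under its own S3 prefix and registered as a partition of the external table (ALTER TABLE ... ADD PARTITION ... LOCATION). The Python helper below only does the grouping step. The partition column name `dt` and the function names are my own assumptions, not anything Hive or Flume prescribes.

```python
import re
from collections import defaultdict

# Matches the Flume file names shown above, e.g.
# usr-20120423-012725137+0000.2392780833002846.00000029.gz
FILENAME_RE = re.compile(r"^usr-(\d{4})(\d{2})(\d{2})-\d{9}[+-]\d{4}\..+\.gz$")


def partition_key(filename):
    """Return a 'dt=YYYY-MM-DD' key for a log file name, or None if it does not match."""
    m = FILENAME_RE.match(filename)
    if m is None:
        return None
    year, month, day = m.groups()
    return "dt={}-{}-{}".format(year, month, day)


def group_by_day(filenames):
    """Group file names by their (hypothetical) daily partition key."""
    groups = defaultdict(list)
    for name in filenames:
        key = partition_key(name)
        if key is not None:  # skip files that do not follow the pattern
            groups[key].append(name)
    return dict(groups)
```

With a layout like that, a query restricted to one `dt` value would only need to open that day's files, which is the effect I am after. Whether this is the recommended approach is exactly what I would like to hear about.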

Best-practice advice from others who have been down this road is very welcome.


Thanks in advance,
Soren
