Hi Soren
      If you can collect or order the log files into date based sub dirs in S3. 
Then you can partition the table based on date. With partitions you can query a 
subset of your data based on date. You can organize the data into date  folders 
during flume ingestion itself.

Regards
Bejoy KS

Sent from handheld, please excuse typos.

-----Original Message-----
From: Søren <[email protected]>
Date: Tue, 24 Apr 2012 16:20:09 
To: [email protected]<[email protected]>
Reply-To: [email protected]
Subject: external table on flume log files in S3

Hi Hive community

We are collecting huge amounts of data into Amazon S3 using Flume.

In Elastic Mapreduce, we have so far managed to create an external Hive 
table on JSON formatted gzipped log files in S3 using a customized 
serde. The log files are collected  and stored in one single folder with 
file names following this pattern:
usr-20120423-012725137+0000.2392780833002846.00000029.gz
usr-20120423-012928765+0000.2392904461259123.00000029.gz
usr-20120423-013032368+0000.2392968063991639.00000029.gz

There are thousands to millions of these files. Is there a way to make 
HIVE benefit from the datetime stamp in the filenames? For example to 
make  queries on smaller subsets. Or filtering when creating the 
external table.

If using the INPUT__FILE__NAME, the job gets done but there is no 
significant performance gain. I guess, due the the evaluation order of 
the SQL statement. I.e. processing the entire repository takes the same 
time as only one day's logs. Same large number of total open-file jobs.

SELECT *
FROM mytable
WHERE INPUT__FILE__NAME LIKE 's3://myflume-logs/usr-20120423%';

Best practise knowledge from others who have been down this road is very 
welcomed.

thanks in advance
Soren


Reply via email to