We have logs stored in HDFS in the following layout: /YEAR/MONTH/DAY. It's not 
guaranteed that every single day will be present, so there will be gaps. We 
have some jobs that need to retrieve the last X days of data, counting only 
days that actually exist and contain data. 

We have something like the following: https://gist.github.com/anonymous/5364554 
(The naming is a little off since it's technically not an InputFormat. Any 
ideas on a proper name?) Basically, it retrieves all directories for a given 
path, sorts them in descending order, and keeps only the most recent X. It 
then delegates to FileInputFormat's setInputPaths. In case you are wondering 
how we use it, here is an example of a custom PigStorage class that calls it: 
https://gist.github.com/anonymous/5364601
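
For anyone who doesn't want to open the gists, the selection step boils down 
to roughly the following. This is a simplified sketch on plain path strings, 
not the actual gist code: the class and method names are made up for 
illustration, and it assumes the month and day components are zero-padded so 
that string order matches date order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, for illustration only (not the class from the gist).
class LatestDays {

    // Returns the newest `limit` day directories, newest first.
    // Assumes zero-padded /YEAR/MONTH/DAY paths, so lexicographic order
    // is the same as chronological order.
    static List<String> latestDays(List<String> dayDirs, int limit) {
        List<String> sorted = new ArrayList<>(dayDirs);
        sorted.sort(Comparator.reverseOrder());
        int count = Math.max(0, Math.min(limit, sorted.size()));
        return new ArrayList<>(sorted.subList(0, count));
    }

    public static void main(String[] args) {
        // Only days that exist are listed, so gaps are skipped naturally.
        List<String> existing = List.of(
                "/2013/04/01", "/2013/03/28", "/2013/04/09", "/2013/04/05");
        System.out.println(latestDays(existing, 3));
        // prints [/2013/04/09, /2013/04/05, /2013/04/01]
    }
}
```

In the real job the input list comes from listing the directories in HDFS, 
and the result is what gets handed to FileInputFormat.setInputPaths.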

Although this works, I am thinking there may be a better/easier way to 
accomplish the same thing. Any ideas?

Thanks

- M


