Re: Number of files to load

2015-05-05 Thread Jonathan Coveney
You should check out Parquet. Even if you can't avoid 5-minute log files, you can have an hourly (or daily!) MR job that compacts them. Another nice thing about Parquet is that it has filter pushdown, so if you only want a smaller range of time you can avoid deserializing most of the other data.
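
A minimal sketch of what that compaction could look like in Spark itself (rather than a separate MR job), assuming the DataFrame API (Spark 1.3+), JSON input files, and the /analytics/... layout described later in the thread; the schema, column name and output paths are illustrative only:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object CompactToParquet {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("compact-to-parquet"))
    val sqlContext = new SQLContext(sc)

    // Read every 5-minute file for one hour (JSON input is an assumption;
    // the thread only specifies the directory layout).
    val raw = sqlContext.jsonFile("/analytics/2015/05/02/partition-2015-05-02-13-*")

    // Rewrite the hour as a small number of larger, columnar Parquet files.
    raw.repartition(4).saveAsParquetFile("/analytics/parquet/2015/05/02/13")

    // Later jobs get filter pushdown: Parquet row-group statistics let Spark
    // skip deserializing data outside the requested time range.
    val hour = sqlContext.parquetFile("/analytics/parquet/2015/05/02/13")
    hour.filter(hour("timestamp") >= "2015-05-02 13:50:00").count()

    sc.stop()
  }
}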

Re: Number of files to load

2015-05-05 Thread Rendy Bambang Junior
Thanks, I wasn't aware of splittable file formats. If that is the case, does the number of files affect Spark performance, maybe because of the overhead of opening each file? And is that problem solved by having large files in a splittable file format? Any suggestion from your experience on how to organize the data?
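
For illustration only (paths assumed from the original post): with plain text files, sc.textFile creates at least one partition per input file, which is where the per-file overhead shows up.

import org.apache.spark.{SparkConf, SparkContext}

object SmallFileOverhead {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("small-file-overhead"))

    // One day of 5-minute text files (288 of them): textFile gives at least
    // one partition per file, so each file adds a task plus open/seek overhead.
    val day = sc.textFile("/analytics/2015/05/02/partition-*")
    println(s"partitions from small files: ${day.partitions.length}")

    // Coalescing cuts the number of downstream tasks, but the per-file open
    // cost at read time only goes away once the files themselves are compacted.
    val fewer = day.coalesce(16)
    println(s"partitions after coalesce: ${fewer.partitions.length}")

    sc.stop()
  }
}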

Re: Number of files to load

2015-05-05 Thread Jonathan Coveney
"As per my understanding, storing 5minutes file means we could not create RDD more granular than 5minutes." This depends on the file format. Many file formats are splittable (like parquet), meaning that you can seek into various points of the file. 2015-05-05 12:45 GMT-04:00 Rendy Bambang Junior

Number of files to load

2015-05-05 Thread Rendy Bambang Junior
Let's say I am storing my data in HDFS with the folder structure and file partitioning below: /analytics/2015/05/02/partition-2015-05-02-13-50- Note that a new file is created every 5 minutes. As per my understanding, storing 5-minute files means we could not create an RDD more granular than 5 minutes.
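
For context, a minimal sketch of loading one hour from that layout with plain textFile and a glob (hypothetical, not from the original mail):

import org.apache.spark.{SparkConf, SparkContext}

object LoadHour {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("load-hour"))

    // All 5-minute files for 2015-05-02, hour 13 (a dozen files); the glob
    // selects whole files, so with a non-splittable format the finest RDD
    // granularity really is one 5-minute file.
    val hour = sc.textFile("/analytics/2015/05/02/partition-2015-05-02-13-*")
    println(hour.count())

    sc.stop()
  }
}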