You should check out Parquet.
If you can't avoid 5-minute log files, you can have an hourly (or daily!) MR
job that compacts them. Another nice thing about Parquet is filter
pushdown: if you only want a smaller range of time, you can avoid
deserializing most of the other data.
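To make the pushdown point concrete, here is a minimal stdlib-only sketch of the idea (not Spark's or Parquet's actual reader code): Parquet stores per-row-group min/max statistics, so a reader with a predicate on, say, a timestamp column can discard whole row groups without deserializing their rows. The `row_groups` layout and `read_range` helper below are hypothetical illustrations.

```python
# Hypothetical sketch of Parquet-style filter pushdown. Each "row group"
# carries min/max stats for a timestamp column; a reader can discard
# whole groups using only the stats, never touching their rows.

row_groups = [
    {"min_ts": 0,   "max_ts": 299, "rows": list(range(0, 300))},
    {"min_ts": 300, "max_ts": 599, "rows": list(range(300, 600))},
    {"min_ts": 600, "max_ts": 899, "rows": list(range(600, 900))},
]

def read_range(groups, lo, hi):
    """Return rows with lo <= ts <= hi, skipping groups whose stats
    prove they cannot match (the essence of predicate pushdown)."""
    out = []
    for g in groups:
        if g["max_ts"] < lo or g["min_ts"] > hi:
            continue  # pruned: this group's rows are never deserialized
        out.extend(r for r in g["rows"] if lo <= r <= hi)
    return out

# Only the middle group is actually scanned here:
print(len(read_range(row_groups, 350, 400)))  # 51
```

In real Spark you would just express the predicate on the DataFrame (e.g. a `.filter(...)` on a Parquet-backed source) and let the reader do this pruning for you.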
On Tuesday, May 5
Thanks, I wasn't aware of splittable file formats.
If that's the case, does the number of files affect Spark performance,
maybe because of the overhead of opening each file? And is that problem
solved by having big files in a splittable file format?
Any suggestions from your experience on how to organize the data?
"As per my understanding, storing 5minutes file means we could not create
RDD more granular than 5minutes."
This depends on the file format. Many file formats are splittable (like
Parquet), meaning that you can seek to arbitrary points within the file.
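A rough illustration of why splittability matters, under the assumption of HDFS-style block-aligned splits (the `byte_splits` helper is hypothetical, not a Hadoop API): if a format supports seeking, one big file can be cut into independent byte ranges, one per task, so a large file does not force a single reader.

```python
# Sketch of how a splittable file is divided into byte-range splits,
# one per task, following the HDFS-style rule of one split per
# block-sized chunk. Non-splittable formats (e.g. gzipped text) must
# instead be read start-to-finish by a single task.

def byte_splits(file_size, block_size):
    """Return (offset, length) pairs covering the whole file."""
    splits = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        splits.append((offset, length))
        offset += length
    return splits

print(byte_splits(300, 128))  # [(0, 128), (128, 128), (256, 44)]
```

So with a splittable format, compacting many 5-minute files into one large file does not cost you read parallelism, and the RDD's granularity is set by your filter, not by the file boundaries.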
2015-05-05 12:45 GMT-04:00 Rendy Bambang Junior
Let's say I am storing my data in HDFS with the folder structure and file
partitioning below:
/analytics/2015/05/02/partition-2015-05-02-13-50-
Note that a new file is created every 5 minutes.
As per my understanding, storing 5-minute files means we could not create
an RDD more granular than 5 minutes.