Andries,

This proves once again that seek time is very expensive.

I have found that big files, around 100 GB in size, give good performance if they 
can be split properly without records spanning MFS/HDFS block boundaries.
You could give it a try by increasing the file size.
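
For what it's worth, one way to experiment with this is to load a big file with an 
explicit block size. Below is a rough, untested sketch; it assumes the hdfs CLI is 
on the PATH, and the file name, target directory, and the 256 MB value are just 
placeholders to adjust for your cluster.

#!/usr/bin/env python
"""Sketch: copy one large JSON file into HDFS with an explicit block size."""
import subprocess

LOCAL_FILE = "events-2015-02-02.json"   # placeholder large JSON file
HDFS_DIR = "/data/json/2015/02/02/"     # placeholder target directory
BLOCK_SIZE = 256 * 1024 * 1024          # 256 MB, illustrative only

# -D dfs.blocksize overrides the cluster default for this copy only, so the
# file is stored with larger blocks.
subprocess.check_call(
    ["hdfs", "dfs",
     "-D", "dfs.blocksize={0}".format(BLOCK_SIZE),
     "-put", LOCAL_FILE, HDFS_DIR])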

Thanks
Sudhakar Thota




On Feb 2, 2015, at 6:11 PM, Andries Engelbrecht <[email protected]> 
wrote:

> Sharing my experience and looking for input/other experiences when it comes 
> to working with directories with multiple JSON files.
> 
> The current JSON data is located in a subdirectory structure with 
> year/month/day/hour in over 400 directories. Each directory contained over 120 
> relatively small JSON files of 1-10 MB, around 50k files in total. This led to 
> substantial query startup overhead due to the large number of files for a 
> relatively small total data set. Most queries took 230 seconds to 
> complete and were in a pending state for over 120 seconds.
> 
> Concatenating all the files in each directory into a single JSON file reduced the 
> total number of files to just over 400 (as expected) and resulted in the same 
> queries executing in less than 47 seconds.
> 
> Which brings up the question: what is the maximum advisable size for a single 
> JSON file? At some point there will be a tradeoff between a reduced number of 
> files and the maximum size of a single file.
> 
> Something to consider when using Flume or another tool as the data source for 
> eventual Drill consumption.
> 
> —Andries
> 
> 
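
For reference, the per-directory concatenation you describe above could be scripted 
roughly like this (untested sketch; the root path and merged.json name are 
placeholders, and it assumes newline-delimited JSON files on a locally mounted path):

#!/usr/bin/env python
"""Sketch: merge the small JSON files in each leaf directory into one file."""
import os

ROOT = "/data/json"  # placeholder root of the year/month/day/hour tree

# Walk every directory and concatenate its small JSON files into a single
# merged.json, so each directory contributes one file instead of ~120.
for dirpath, dirnames, filenames in os.walk(ROOT):
    parts = sorted(f for f in filenames
                   if f.endswith(".json") and f != "merged.json")
    if not parts:
        continue
    with open(os.path.join(dirpath, "merged.json"), "w") as out:
        for name in parts:
            with open(os.path.join(dirpath, name)) as src:
                for line in src:
                    out.write(line if line.endswith("\n") else line + "\n")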
