Sharing my experience, and looking for input and other experiences, when it
comes to working with directories containing many JSON files.

The current JSON data is located in a year/month/day/hour subdirectory
structure spanning over 400 directories. Each directory contained over 120
relatively small JSON files of 1-10 MB each, so around 50k files in total.
This led to substantial query startup overhead due to the large number of
files for a relatively small total data set: most queries would take 230
seconds to complete and sit in a pending state for over 120 seconds.

Concatenating the files in each directory into a single JSON file reduced the
total number of files to just over 400 (as expected) and resulted in the same
queries executing in less than 47 seconds.
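For reference, a minimal sketch of the concatenation step, assuming the files
live on a local filesystem and that each source file holds a sequence of JSON
objects rather than a single top-level array (so plain concatenation still
yields input Drill can read). The ROOT path and merged.json name are just
placeholders:

import os

ROOT = "/data/json"  # placeholder root of the year/month/day/hour tree

for dirpath, _dirnames, filenames in os.walk(ROOT):
    parts = sorted(f for f in filenames
                   if f.endswith(".json") and f != "merged.json")
    if not parts:
        continue  # skip intermediate year/month/day directories
    with open(os.path.join(dirpath, "merged.json"), "w") as out:
        for name in parts:
            path = os.path.join(dirpath, name)
            with open(path) as src:
                out.write(src.read())
                out.write("\n")  # record separator between source files
            os.remove(path)  # drop the small file once merged

On HDFS the same effect should be achievable by piping hadoop fs -cat output
into hadoop fs -put.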

Which brings up the question: what would be the maximum advisable size for a
single JSON file? At some point there is a tradeoff between reducing the
number of files and capping the size of any single file, since fewer, larger
files also mean less opportunity for parallel reads.

Something to consider when using Flume or another tool as the data source for
eventual Drill consumption.
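As an illustration, Flume's HDFS sink can be told to roll files by size
instead of emitting many small ones. A sketch, assuming an agent named
"agent" with a sink named "k1"; the 256 MB roll size is just an example
value:

agent.sinks.k1.type = hdfs
agent.sinks.k1.hdfs.path = /data/json/%Y/%m/%d/%H
agent.sinks.k1.hdfs.fileType = DataStream
agent.sinks.k1.hdfs.fileSuffix = .json
# roll on size only: one ~256 MB file at a time instead of many small ones
agent.sinks.k1.hdfs.rollSize = 268435456
agent.sinks.k1.hdfs.rollInterval = 0
agent.sinks.k1.hdfs.rollCount = 0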

—Andries

