Sharing my experience and looking for input/other experiences when it comes to working with directories containing many small JSON files.
The current JSON data is located in a year/month/day/hour subdirectory structure spanning over 400 directories. Each directory contained over 120 relatively small JSON files (1-10 MB each), around 50k files total. This led to substantial query startup overhead due to the large number of files for a relatively small total data set: most queries took around 230 seconds to complete and sat in a pending state for over 120 seconds.

Concatenating all the files in each directory into a single JSON file (rough sketch below) reduced the total number of files to just over 400 (as expected), and the same queries then executed in less than 47 seconds.

Which brings up the question: what would be the maximum advisable size for a single JSON file? At some point there is a tradeoff between a reduced number of files and the maximum size of a single file. Something to consider when using Flume or another tool as a data source for eventual Drill consumption. —Andries
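
P.S. For anyone wanting to do the same, here's a rough sketch of the concatenation step. The `data` root, the `combined.json` output name, and the exact layout are illustrative, not the actual paths I used:

```python
import os

# Assumed layout: data/<year>/<month>/<day>/<hour>/*.json  (illustrative)
ROOT = "data"

for dirpath, dirnames, filenames in os.walk(ROOT):
    # Collect the small source files; skip the combined output so the
    # script can be re-run safely on the same tree.
    parts = [f for f in sorted(filenames)
             if f.endswith(".json") and f != "combined.json"]
    if not parts:
        continue

    # Concatenate every small JSON file in this directory into one file.
    combined = os.path.join(dirpath, "combined.json")
    with open(combined, "wb") as out:
        for name in parts:
            with open(os.path.join(dirpath, name), "rb") as src:
                out.write(src.read())
                out.write(b"\n")  # keep records newline-separated

    # The original small files can then be removed or archived.
```

Since plain concatenation worked for the queries above, no re-wrapping of the records into a single enclosing JSON array was needed.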
