Andries,

This proves again that seek time is very expensive.
I have found that big files (around 100GB) give good performance if they can be split properly without spanning MFS/HDFS block boundaries. You could try increasing the file size further.

Thanks
Sudhakar Thota

On Feb 2, 2015, at 6:11 PM, Andries Engelbrecht <[email protected]> wrote:

> Sharing my experience and looking for input/other experiences when it comes
> to working with directories containing multiple JSON files.
>
> The current JSON data is located in a subdirectory structure of
> year/month/day/hour with over 400 directories. Each contained over 120
> relatively small JSON files of 1-10MB, so around 50k files in total. This led to
> substantial query startup overhead due to the large number of files for a
> relatively small total data set. Most queries took 230 seconds to
> complete and were in a pending state for over 120 seconds.
>
> Concatenating all the files in each directory into a single JSON file reduced the
> total number of files to just over 400 (as expected), and the same
> queries then executed in less than 47 seconds.
>
> Which brings up the question of what the maximum advisable size for
> a single JSON file would be, as at some point there is a tradeoff between a reduced
> number of files and the maximum size of a single file.
>
> Something to consider when using Flume or another tool as a data source for
> eventual Drill consumption.
>
> —Andries
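For reference, a minimal sketch of the per-directory concatenation described above, assuming the files are newline-delimited JSON records (as Drill reads them) and are reachable through a local or NFS mount. The ROOT path and the "merged.json" name are assumptions, not from this thread; adjust to the real layout.

# Merge the small JSON files in each hour directory into one file per
# directory, so Drill sees roughly 400 files instead of ~50k.
import os

ROOT = "/data/json"  # assumed root of the year/month/day/hour tree

for dirpath, _dirnames, filenames in os.walk(ROOT):
    parts = sorted(f for f in filenames
                   if f.endswith(".json") and f != "merged.json")
    if not parts:
        continue
    with open(os.path.join(dirpath, "merged.json"), "w") as out:
        for name in parts:
            path = os.path.join(dirpath, name)
            with open(path) as src:
                data = src.read()
            out.write(data)
            if not data.endswith("\n"):
                out.write("\n")  # keep records newline-separated for Drill
            os.remove(path)      # drop the small file once it is merged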
