That will be very helpful for speeding up queries over very large numbers of 
files. In this case the file size can be increased to reduce the number of 
files, but that may not be an option for other use cases where data is kept 
for a much longer timeframe.

Another observation is that the number of issues with complex queries dropped 
when the number of files went from approximately 50k to just over 400.
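For what it's worth, a rough sketch of the per-directory concatenation step, assuming a hypothetical year/month/day/hour layout under a root directory and newline-delimited JSON records (both assumptions, not confirmed in this thread):

```shell
# Hypothetical helper: merge the small JSON files in each hour directory
# (layout ROOT/YYYY/MM/DD/HH/ is an assumption) into a single
# ROOT/YYYY/MM/DD/HH.json, then remove the originals. This is safe for
# newline-delimited JSON records concatenated back to back.
compact_json_dirs() {
  root="$1"
  for dir in "$root"/*/*/*/*/; do
    [ -d "$dir" ] || continue          # skip if the glob matched nothing
    out="${dir%/}.json"
    cat "$dir"*.json > "$out"          # one file per former hour directory
    rm "$dir"*.json && rmdir "$dir"
  done
}
```

Run once per ingest root, e.g. `compact_json_dirs data`; it collapses each hour directory of small files into one file alongside it.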

—Andries


On Feb 2, 2015, at 8:02 PM, Steven Phillips <[email protected]> wrote:

> I think we need to fix this issue. We need to come up with a way to
> initialize the queries that have lots of files more quickly.
> 
> On Mon, Feb 2, 2015 at 6:39 PM, Sudhakar Thota <[email protected]> wrote:
> 
>> Andries,
>> 
>> This proves again that seek time is very expensive.
>> 
>> I have found that big files, around 100GB in size, give good performance if
>> they can be split properly without spanning MFS/HDFS blocks.
>> You can try increasing the file size.
>> 
>> Thanks
>> Sudhakar Thota
>> 
>> 
>> 
>> 
>> On Feb 2, 2015, at 6:11 PM, Andries Engelbrecht <[email protected]>
>> wrote:
>> 
>>> Sharing my experience, and looking for input/other experiences, when it
>>> comes to working with directories containing many JSON files.
>>> 
>>> The current JSON data is located in a subdirectory structure with
>>> year/month/day/hour, over 400 directories in total. Each contained over 120
>>> relatively small JSON files (1-10MB each), around 50k files in total. This
>>> led to substantial query startup overhead due to the large number of files
>>> for a relatively small total data set. Most queries took 230 seconds to
>>> complete and were in a pending state for over 120 seconds.
>>> 
>>> Concatenating the files in each directory into a single JSON file reduced
>>> the total number of files to just over 400 (as expected) and resulted in
>>> the same queries executing in less than 47 seconds.
>>> 
>>> This brings up the question: what is the maximum advisable size for a
>>> single JSON file? At some point there is a tradeoff between a reduced
>>> number of files and the maximum size of a single file.
>>> 
>>> Something to consider when using Flume or another tool as a data source
>>> for eventual Drill consumption.
>>> 
>>> —Andries
>>> 
>>> 
>> 
>> 
> 
> 
> -- 
> Steven Phillips
> Software Engineer
> 
> mapr.com
