It may be worth looking at an option to parallelize some of the DFS metadata 
gathering; that would open the door for very large clusters to handle large 
data sets with a big number of individual files. A rough sketch of the idea is 
below.
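
This is not Drill internals, just an illustration of the general approach using 
the plain Hadoop FileSystem API: list each leaf directory on its own thread 
instead of walking them sequentially. The paths and thread count below are made 
up for the example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelListing {

  // List the files under each leaf directory concurrently instead of from a
  // single thread; each listStatus() call is an independent metadata request.
  public static List<FileStatus> listInParallel(final FileSystem fs,
      List<Path> leafDirs, int threads) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Callable<FileStatus[]>> tasks = new ArrayList<Callable<FileStatus[]>>();
      for (final Path dir : leafDirs) {
        tasks.add(new Callable<FileStatus[]>() {
          public FileStatus[] call() throws Exception {
            return fs.listStatus(dir);
          }
        });
      }
      List<FileStatus> all = new ArrayList<FileStatus>();
      for (Future<FileStatus[]> f : pool.invokeAll(tasks)) {
        for (FileStatus status : f.get()) {
          if (status.isFile()) {
            all.add(status);
          }
        }
      }
      return all;
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Hypothetical year/month/day/hour leaf directories.
    List<Path> leafDirs = new ArrayList<Path>();
    leafDirs.add(new Path("/data/json/2015/02/02/22"));
    leafDirs.add(new Path("/data/json/2015/02/02/23"));
    List<FileStatus> files = listInParallel(fs, leafDirs, 16);
    System.out.println("Found " + files.size() + " files");
  }
}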

In the meantime it is something people should consider when dealing with 
data sets with large numbers of files.
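
For anyone wanting to apply the workaround described below in the thread 
(merging the small JSON files in each hour directory into one file), here is a 
rough sketch using Hadoop 2.x's FileUtil.copyMerge (removed in Hadoop 3); the 
/data/json paths are made up:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

// Merge the small JSON files under each hour directory into one file per
// directory, so a query only has to enumerate ~400 files instead of ~50K.
public class MergeSmallFiles {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path root = new Path("/data/json");            // year/month/day/hour below here
    Path mergedRoot = new Path("/data/json_merged");

    // Assumes only directories at the year/month/day levels.
    for (FileStatus year : fs.listStatus(root)) {
      for (FileStatus month : fs.listStatus(year.getPath())) {
        for (FileStatus day : fs.listStatus(month.getPath())) {
          for (FileStatus hour : fs.listStatus(day.getPath())) {
            Path target = new Path(mergedRoot,
                year.getPath().getName() + "_" + month.getPath().getName() + "_"
                + day.getPath().getName() + "_" + hour.getPath().getName() + ".json");
            // copyMerge concatenates every file in the hour directory into
            // target, appending "\n" after each file (fine for line-delimited JSON).
            FileUtil.copyMerge(fs, hour.getPath(), fs, target, false, conf, "\n");
          }
        }
      }
    }
  }
}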

 
On Feb 2, 2015, at 11:58 PM, Ted Dunning <[email protected]> wrote:

> Finding 50K files given only a directory name is unlikely to ever be
> efficient.  Reading small files is also unlikely to be efficient unless the
> contents are linearized (huge luck if so, unlikely to always work).
> 
> Caching the contents of recursive directory structures would make things go
> faster, but it is easy to miss new files that way.  Looking at mtime on 400
> directories might work.
> 
> 
> On Mon, Feb 2, 2015 at 8:02 PM, Steven Phillips <[email protected]>
> wrote:
> 
>> I think we need to fix this issue. We need to come up with a way to
>> initialize the queries that have lots of files more quickly.
>> 
>> On Mon, Feb 2, 2015 at 6:39 PM, Sudhakar Thota <[email protected]>
>> wrote:
>> 
>>> Andries,
>>> 
>>> This proves again that seek time is very expensive.
>>> 
>>> I have found that big files, around 100GB in size, give good performance
>>> if they can be split properly without spanning across MFS/HDFS blocks.
>>> You can give it a try by increasing the file size.
>>> 
>>> Thanks
>>> Sudhakar Thota
>>> 
>>> 
>>> 
>>> 
>>> On Feb 2, 2015, at 6:11 PM, Andries Engelbrecht <[email protected]> wrote:
>>> 
>>>> Sharing my experience and looking for input/other experiences when it
>>>> comes to working with directories with multiple JSON files.
>>>> 
>>>> The current JSON data is located in a subdirectory structure with
>>>> year/month/day/hour in over 400 directories. Each directory contained over
>>>> 120 relatively small JSON files of 1-10MB, around 50K files total. This led
>>>> to substantial query startup overhead due to the large number of files for
>>>> a relatively small total data set. Most queries would take 230 seconds to
>>>> complete and be in a pending state for over 120 seconds.
>>>> 
>>>> Concatenating all the files in each directory into a single JSON file
>>>> reduced the total number of files to just over 400 (as expected), and
>>>> resulted in the same queries executing in less than 47 seconds.
>>>> 
>>>> Which brings up the question: what would be the maximum advisable size
>>>> for a single JSON file? At some point there will be a tradeoff between a
>>>> reduced number of files and the maximum size of a single file.
>>>> 
>>>> Something to consider when using Flume or another tool as a data source
>>>> for eventual Drill consumption.
>>>> 
>>>> —Andries
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> --
>> Steven Phillips
>> Software Engineer
>> 
>> mapr.com
>> 
