I think we need to fix this issue. We need to come up with a way to speed
up query initialization for queries that span a large number of files.
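
For anyone hitting the same problem, here is a minimal sketch of the
per-directory concatenation Andries describes below. It assumes the
year/month/day/hour layout, that each file holds newline-separated JSON
records (not a single top-level array), and that the tree is reachable
through a local or NFS mount (as with MapR-FS); the root path is
hypothetical.

import os

ROOT = "/data/events"  # hypothetical root of the year/month/day/hour tree

# Merge the small .json files in each leaf directory into one combined
# file, then delete the originals so each directory holds a single file.
for dirpath, _, filenames in os.walk(ROOT):
    parts = sorted(f for f in filenames
                   if f.endswith(".json") and f != "combined.json")
    if len(parts) < 2:
        continue
    combined = os.path.join(dirpath, "combined.json")
    with open(combined, "w") as out:
        for name in parts:
            path = os.path.join(dirpath, name)
            with open(path) as src:
                data = src.read()
            out.write(data)
            if data and not data.endswith("\n"):
                out.write("\n")  # keep record boundaries between files
            os.remove(path)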

On Mon, Feb 2, 2015 at 6:39 PM, Sudhakar Thota <[email protected]> wrote:

> Andries,
>
> This proves once again that seek time is very expensive.
>
> I learned that big files (around 100GB) give good performance if they
> can be split properly without spanning MFS/HDFS block boundaries.
> You can give it a try by increasing the file size.
>
> Thanks
> Sudhakar Thota
>
>
>
>
> On Feb 2, 2015, at 6:11 PM, Andries Engelbrecht <[email protected]>
> wrote:
>
> > Sharing my experience and looking for input/other experiences when it
> comes to working with directories containing many JSON files.
> >
> > The current JSON data is located in a subdirectory structure of
> year/month/day/hour, spanning over 400 directories. Each directory
> contained over 120 relatively small JSON files (1-10MB each), around 50k
> files in total. This led to substantial query startup overhead due to the
> large number of files for a relatively small total data set. Most queries
> took 230 seconds to complete and sat in a pending state for over 120
> seconds.
> >
> > Concatenating all the files in each directory into a single JSON file
> reduced the total number of files to just over 400 (as expected) and
> resulted in the same queries executing in less than 47 seconds.
> >
> > Which brings up the question: what is the maximum advisable size for a
> single JSON file? At some point there will be a tradeoff between reducing
> the number of files and the maximum size of a single file.
> >
> > Something to consider when using Flume or another tool as a data source
> for eventual Drill consumption.
> >
> > —Andries
> >
> >
>
>


-- 
 Steven Phillips
 Software Engineer

 mapr.com
