This worked perfectly. Thanks, Jason. (It also shrank my small 1.5 MB per
day table to 600 KB per day... so double win.)

So, I like this approach and will use it. It makes sense how it works, but
obviously this is something I had to come to the Drill user group for, and
you, being an expert on Drill internals, had a solid answer. Turnaround
time on this was awesome, so when I look to "improve" on your answer below,
please note that I appreciate your fast response and am not trying to take
away from it.

First of all, as a "perhaps knows just enough to be dangerous"
power/intermediate user, it barely registered with me that I should, for
optimal performance, do something about the 400 files. I saw them and
almost moved on before remembering an issue I've had with this in the past.
So even bringing it up and looking for a way to reduce the file count
almost slipped by.

Second, having many files isn't just a performance issue at planning time,
as in my case; it's a very serious issue for folks running HDFS clusters.
Perhaps a callout in the documentation, on the CTAS page, would be handy
here: explain how the files get created, and that if you want to reduce the
file count, a simple hack is to run the query through an ORDER BY. This
could benefit HDFS users, normal users, and the Drill project, because
people reading the docs will do it and see better performance.
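
For illustration, a minimal sketch of the trick (the table and column names
here are made up):

  CREATE TABLE dfs.tmp.`daily_agg` AS
  SELECT event_dt, metric, SUM(cnt) AS total_cnt
  FROM dfs.tmp.`raw_events`
  GROUP BY event_dt, metric
  ORDER BY event_dt;
  -- the sort collapses the data to a single stream, so a single writer
  -- produces one (or very few) parquet files instead of hundreds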

Third: could/should we provide a sys.options flag, defaulting to false (for
backwards compatibility and no surprises), that when set to true in a query
"mimics" the effect of an ORDER BY? Something like
store.parquet.merge.files, to ask the process to go through a single stream
at the end, if possible, regardless of any ORDER BY. While this may not add
performance, it would be a handy way to document the behavior. (Perhaps not
a single stream? Perhaps we could say that if the value is > 0 (0 being the
default, thus not adding a merging step), it would add a step to merge the
data into N streams based on the value. So in my case I could set it to 1,
but others may have reasons to set it higher.)
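
To sketch what I mean (the option name and its behavior are purely my
invention here; nothing like it exists in Drill today, though the ALTER
SESSION syntax is real):

  -- hypothetical option: merge CTAS output down to N streams
  ALTER SESSION SET `store.parquet.merge.files` = 1;

  CREATE TABLE dfs.tmp.`daily_agg2` AS
  SELECT event_dt, metric, SUM(cnt) AS total_cnt
  FROM dfs.tmp.`raw_events`
  GROUP BY event_dt, metric;
  -- no ORDER BY needed; the writer would consolidate output to 1 stream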

I'd be curious to hear thoughts on this.


John


On Thu, Jun 23, 2016 at 12:26 PM, Jason Altekruse <ja...@dremio.com> wrote:

> Apply a sort in your CTAS; this will force the data down to a single
> stream before writing.
>
> Jason Altekruse
> Software Engineer at Dremio
> Apache Drill Committer
>
> On Thu, Jun 23, 2016 at 10:23 AM, John Omernik <j...@omernik.com> wrote:
>
> > I have a small query writing smaller data (like aggregate tables for
> > faster aggregates for dashboards, etc.). It appears to write a ton of
> > small files. Not sure why; maybe it's just how the join worked out. I
> > have a "day" that is 1.5 MB in total size, but 400 files total. This
> > seems excessive.
> >
> > While I don't have the "small files" issue because I run MapR-FS,
> > having 400 files that make up 1.5 MB of total data kills me in the
> > planning phase. How can I get Drill, when doing a CTAS, to go through a
> > round of consolidation on the parquet files?
> >
> > Thanks
> >
> > John
> >
>
