The slowness you saw with Parquet can depend heavily on how your CTAS was written. Did you cast columns to the types you needed? Drill could be making some fast-and-loose assumptions about your data and typing it incorrectly. When I was in a similar scenario, I added stronger typing and saw quite a bit of improvement with the Parquet files. This can be difficult if everything is nested, though, so your mileage may vary.
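As a minimal sketch of what I mean by stronger typing, a CTAS with explicit casts might look like this (table and column names here are hypothetical, not from your data):

```sql
-- Hypothetical example: cast each column explicitly so Drill writes
-- well-typed Parquet instead of inferring types from the JSON.
CREATE TABLE dfs.tmp.`trades_parquet` AS
SELECT
  CAST(symbol AS VARCHAR) AS symbol,
  CAST(price  AS DOUBLE)  AS price,
  CAST(qty    AS INT)     AS qty
FROM dfs.root.`/data/trades.json`;
```

Without the casts, Drill infers types per file, which can produce wider or inconsistent Parquet schemas and slower scans.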
As to Jacques' comment, the profile.json is important: if these are large files and the planner is going through lots of them, planning may account for the bulk of your query time. A partitioning strategy, should you be able to find one, can help here, but some issues can still crop up. I think the planning inefficiencies are being worked on.

John

On Mon, Mar 7, 2016 at 9:58 PM, Ted Dunning <[email protected]> wrote:

> On Mon, Mar 7, 2016 at 3:02 PM, Eric Pederson <[email protected]> wrote:
>
> > I also tried converting the JSON files to Parquet using CTAS. The Parquet
> > queries took much longer than the JSON queries. Is that expected as well?
>
> No. That is not expected.
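For reference, the partitioning strategy mentioned above could be sketched like this (column and path names are hypothetical):

```sql
-- Hypothetical example: write the Parquet partitioned by a column so
-- the planner can prune files rather than examining every one.
CREATE TABLE dfs.tmp.`trades_by_day`
PARTITION BY (trade_date) AS
SELECT
  CAST(trade_date AS DATE)    AS trade_date,
  CAST(symbol     AS VARCHAR) AS symbol,
  CAST(price      AS DOUBLE)  AS price
FROM dfs.root.`/data/trades.json`;
```

Queries that filter on trade_date should then touch far fewer files, which cuts both planning and scan time.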
