Considering your description of the data, 1.5 GB per file with only 500
records in each gives you records of roughly 3 MB apiece. That in itself
doesn't necessarily cause an issue, but the structure of your example
record makes me think you may have many individual columns in the nested
structure, rather than a record size dominated by long lists or strings.

If this is the case, you might just be hitting the overhead of storing a
very complex structure in a columnar system. Parquet is columnar on
disk, and Drill also uses a columnar in-memory structure to store
records during processing. Even with a conservative estimate of 1 KB for
each value stored somewhere in your nested structure, a 3 MB record
works out to around 3,000 columns. Because of how Drill stores them,
nested fields are treated much like fields at the root of your schema.
While this works, it might require some tuning work to make it
efficient.
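
To make that concrete, here is a minimal sketch (the field names are
hypothetical, not taken from your data) of how a nested record maps to
columns in Drill:

    {"order": {"id": 17, "ship": {"city": "NYC", "zip": "10001"}}}

Drill materializes a value vector per leaf path, so this one record
already contributes three columns: order.id, order.ship.city, and
order.ship.zip. A record with thousands of leaf values means thousands
of vectors allocated and managed per batch, which is where the
per-column overhead adds up.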

Can you give us an idea of how many groups and fields are in a typical
record? How many of the fields are lists? Are the lists that appear in your
data much larger than in your example here?


On Tue, Mar 8, 2016 at 5:51 AM, John Omernik <j...@omernik.com> wrote:

> The slowness you saw with Parquet can be heavily dependent on how your
> CTAS was written.  Did you cast to types as needed? Drill could be
> making some fast and loose assumptions about your data, and thus typing
> columns incorrectly.  When I was in a similar scenario, I used stronger
> typing and saw quite a bit of improvement with the Parquet files. This
> can be difficult if everything is nested, though, so your mileage may
> vary.
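>
> As a rough sketch (the table and column names here are hypothetical,
> not from this thread), a CTAS with explicit casts might look like:
>
>     CREATE TABLE dfs.tmp.`trades_parquet` AS
>     SELECT CAST(t.id AS INT)               AS id,
>            CAST(t.ts AS TIMESTAMP)         AS ts,
>            CAST(t.detail.amount AS DOUBLE) AS amount
>     FROM dfs.`/data/trades.json` t;
>
> Without explicit casts, Drill infers types from the values it happens
> to see, which can produce wider or less efficient Parquet columns than
> necessary.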
>
> As to Jacques' comment, the profile.json is important: if the files are
> large and the planner is going through lots of them, planning may
> account for the bulk of your query time.  A partitioning strategy,
> should you be able to find one, can help here, but there are still some
> issues that can crop up.  I think the planning inefficiencies are being
> worked on.
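>
> If you can find a good partition key, Drill's CTAS supports PARTITION
> BY. A sketch (column names hypothetical, matching the example above):
>
>     CREATE TABLE dfs.tmp.`trades_by_day` PARTITION BY (trade_date) AS
>     SELECT CAST(t.ts AS DATE) AS trade_date,
>            t.id,
>            t.detail
>     FROM dfs.`/data/trades.json` t;
>
> which lets the planner prune Parquet files for queries that filter on
> trade_date.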
>
> John
>
> On Mon, Mar 7, 2016 at 9:58 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
> > > On Mon, Mar 7, 2016 at 3:02 PM, Eric Pederson <eric...@gmail.com> wrote:
> > > I also tried converting the JSON files to Parquet using CTAS.  The
> > > Parquet queries took much longer than the JSON queries.  Is that
> > > expected as well?
> >
> > No. That is not expected.
> >
>
