Thanks Steven. That's really handy to know.

When it comes to "partitioning" files with Parquet, are there any limits
we should be aware of? I.e. how many levels, and how many unique values
per level? If I have 3 levels, each with 1500 unique values, is that
going to break things even with hash_distribute, or will it just make my
"create table" slow while subsequent queries stay fast? This is one of
those topics that's difficult for people to wrap their heads around, and
it would be interesting to know if there are some "guidelines" to follow
for best performance... or am I just thinking about this from a Hive-like
standpoint that needs to go away? (I.e. in Hive, too many partitions can
be a very bad thing.)
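
To make that concrete, the kind of statement I have in mind looks roughly
like this (the table, view, and column names are made up for illustration):

alter session set `store.partition.hash_distribute` = true;

-- hypothetical table with three partition columns, each having ~1500
-- distinct values
create table dfs.tmp.`wide_parquet`
partition by (level1, level2, level3)
as select * from dfs.tmp.`wide_view`;

As I understand it, the writer creates at least one file per distinct
(level1, level2, level3) combination, so if those columns are fairly
independent the file count could blow up well past 1500. Is that the kind
of layout I should avoid?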

John




On Thu, Oct 8, 2015 at 6:00 PM, Steven Phillips <[email protected]> wrote:

> In answer to the other part of your question: yes, by default each
> fragment writes into its own set of files, so you could be looking at
> (# unique values) * (# of fragments) files being created. There is an
> option to shuffle the data before writing, so that each value is
> written by only one writer:
>
> alter session set `store.partition.hash_distribute` = true
>
> That should reduce the number of files.
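>
> To put rough numbers on it (assuming the 30 files you get without
> PARTITION BY correspond to roughly 30 writer fragments):
>
> without the shuffle:  ~1216 dates * ~30 fragments = 36,000+ files
> with the shuffle:     roughly one file per date, so ~1216 files
>
> which is consistent with the 20K+ files you found in the output
> directory before the query failed.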
>
> On Thu, Oct 8, 2015 at 11:12 AM, John Omernik <[email protected]> wrote:
>
> > That helped my memory leak! Thanks all!
> >
> > On Thu, Oct 8, 2015 at 10:59 AM, John Omernik <[email protected]> wrote:
> >
> > > Sweet! I'll go check it out. Thanks!
> > >
> > > On Thu, Oct 8, 2015 at 10:53 AM, Paul Ilechko <[email protected]> wrote:
> > >
> > >> Yes, Drill 1.2 is available at package.mapr.com as of yesterday.
> > >>
> > >> On Thu, Oct 8, 2015 at 11:48 AM, John Omernik <[email protected]> wrote:
> > >>
> > >> > Does MapR have a package yet? :) When I compiled Drill with the
> > >> > MapR profile myself, I couldn't get MapR Tables working, so I
> > >> > reverted back to the Drill 1.1 packaged by MapR.
> > >> >
> > >> > On Thu, Oct 8, 2015 at 10:42 AM, Abdel Hakim Deneche <[email protected]> wrote:
> > >> >
> > >> > > We fixed a similar issue as part of Drill 1.2. Can you give it
> > >> > > a try and see if your problem is resolved?
> > >> > >
> > >> > > Thanks
> > >> > >
> > >> > > On Thu, Oct 8, 2015 at 8:33 AM, John Omernik <[email protected]> wrote:
> > >> > >
> > >> > > > I am on the MapR-packaged version of 1.1. Do you still need
> > >> > > > the sys.version output?
> > >> > > >
> > >> > > > On Thu, Oct 8, 2015 at 10:13 AM, Abdel Hakim Deneche <[email protected]> wrote:
> > >> > > >
> > >> > > > > Hey John,
> > >> > > > >
> > >> > > > > The error you are seeing is a memory leak. Drill's
> > >> > > > > allocator found that about 1MB of allocated memory wasn't
> > >> > > > > released at the end of the fragment's execution.
> > >> > > > >
> > >> > > > > What version of Drill are you using? Can you share the
> > >> > > > > result of:
> > >> > > > >
> > >> > > > > select * from sys.version;
> > >> > > > >
> > >> > > > > Thanks
> > >> > > > >
> > >> > > > > On Thu, Oct 8, 2015 at 7:35 AM, John Omernik <[email protected]> wrote:
> > >> > > > >
> > >> > > > > > I am trying to complete a test case on some data. I took a
> > >> > > > > > schema and used log-synth (thanks Ted) to create a fairly
> > >> > > > > > wide table (89 columns). I then output my data as CSV files
> > >> > > > > > and created a Drill view, so far so good.
> > >> > > > > >
> > >> > > > > > One of the columns is a "date" column in YYYY-MM-DD format
> > >> > > > > > with 1216 unique values. To me this is like about 4 years of
> > >> > > > > > daily partitioned data in Hive, so I tried to partition my
> > >> > > > > > data on that field.
> > >> > > > > >
> > >> > > > > > If I create a Parquet table partitioned on that column,
> > >> > > > > > eventually things hork on me and I get the error below. If I
> > >> > > > > > don't use the PARTITION BY clause, it creates the table just
> > >> > > > > > fine with 30 files.
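> > >> > > > > >
> > >> > > > > > Roughly, the two statements I ran looked like this (the
> > >> > > > > > view and table names here are placeholders, not my real
> > >> > > > > > ones):
> > >> > > > > >
> > >> > > > > > -- works, produces 30 files
> > >> > > > > > create table dfs.tmp.`synth_parquet`
> > >> > > > > > as select * from dfs.tmp.`synth_csv_view`;
> > >> > > > > >
> > >> > > > > > -- eventually fails with the error below
> > >> > > > > > create table dfs.tmp.`synth_parquet_by_date`
> > >> > > > > > partition by (trans_date)
> > >> > > > > > as select * from dfs.tmp.`synth_csv_view`;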
> > >> > > > > >
> > >> > > > > > Looking in the folder where it was supposed to create the
> > >> > > > > > partitioned table, there are over 20K files. Is this
> > >> > > > > > expected? Would we expect (# partitions) * (# fragments)
> > >> > > > > > files? Could that be what the error is trying to tell me?
> > >> > > > > > I guess I am just lost on what the error means and what I
> > >> > > > > > should expect from something like this. Is this a bug or
> > >> > > > > > expected behavior?
> > >> > > > > >
> > >> > > > > > Error:
> > >> > > > > >
> > >> > > > > > java.lang.RuntimeException: java.sql.SQLException: SYSTEM ERROR:
> > >> > > > > > IllegalStateException: Failure while closing accountor.  Expected
> > >> > > > > > private and shared pools to be set to initial values.  However,
> > >> > > > > > one or more were not.  Stats are
> > >> > > > > >
> > >> > > > > > zone init allocated delta
> > >> > > > > > private 1000000 1000000 0
> > >> > > > > > shared 9999000000 9997806954 1193046.
> > >> > > > > >
> > >> > > > > > Fragment 1:25
> > >> > > > > >
> > >> > > > > > [Error Id: cad06490-f93e-4744-a9ec-d27cd06bc0a1 on
> > >> > > > > > hadoopmapr1.mydata.com:31010]
> > >> > > > > >
> > >> > > > > > at sqlline.IncrementalRows.hasNext(IncrementalRows.java:73)
> > >> > > > > > at sqlline.TableOutputFormat$ResizingRowsProvider.next(TableOutputFormat.java:87)
> > >> > > > > > at sqlline.TableOutputFormat.print(TableOutputFormat.java:118)
> > >> > > > > > at sqlline.SqlLine.print(SqlLine.java:1583)
> > >> > > > > > at sqlline.Commands.execute(Commands.java:852)
> > >> > > > > > at sqlline.Commands.sql(Commands.java:751)
> > >> > > > > > at sqlline.SqlLine.dispatch(SqlLine.java:738)
> > >> > > > > > at sqlline.SqlLine.begin(SqlLine.java:612)
> > >> > > > > > at sqlline.SqlLine.start(SqlLine.java:366)
> > >> > > > > > at sqlline.SqlLine.main(SqlLine.java:259)
> > >> > > > > >
> > >> > > > >
> > >> > > > >
> > >> > > > >
> > >> >
> > >>
> > >>
> > >>
> > >> --
> > >> ----------------------------------
> > >> Paul Ilechko
> > >> Senior Systems Engineer
> > >> MapR Technologies
> > >> 908 331 2207
> > >>
> > >
> > >
> >
>
