Hey Parth, thanks for the response!

I tried fetching the metadata with parquet-tools in Hadoop mode instead, and
I get OOM errors: Java heap space / GC overhead limit exceeded.
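
For reference, this is roughly how I was invoking it, and the next thing I
will try is giving the client JVM more heap: in Hadoop mode the launcher
picks its heap up from HADOOP_CLIENT_OPTS (the jar name/version below is
just what my local setup happens to have):

    # bump the client-side heap before re-running the metadata read
    export HADOOP_CLIENT_OPTS="-Xmx4g"
    hadoop jar parquet-tools-1.9.0.jar meta \
        hdfs:///tmp/37454954-3c0a-47c5-9793-1c333d87fbbb/185d3076-v_docker_node0001-140526122190592.parquet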

It seems my problem is actually resource related, though it is still a bit
odd that a parquet metadata read is so memory-hungry.

Even after a restart (clean state, no queries running), only ~4GB of memory
is free on the 16GB machine.

I am going to rerun the tests on a bigger machine, tweak the JVM options,
and let you know.
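
Concretely, the plan is to raise the Drillbit's limits in conf/drill-env.sh,
something like this (the sizes are just a first guess for the bigger box):

    # conf/drill-env.sh -- first-guess values, to be tuned
    export DRILL_HEAP="8G"
    export DRILL_MAX_DIRECT_MEMORY="16G"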

Regards,
Carlos.

On Wed, May 9, 2018 at 9:04 PM, Parth Chandra <[email protected]> wrote:

> The most common reason I know of for this error is not having enough CPU.
> Both Drill and the distributed file system will be using CPU, and sometimes
> the file system, especially if it is distributed, will take too long. With
> your configuration and data set size, reading the file metadata should take
> no time at all (I'll assume the metadata in the files is reasonable and not
> many MB by itself). Is your system by any chance overloaded?
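>
> A quick way to check while the query is running is something like:
>
>     vmstat 1 5   # watch the run queue (r) and idle cpu (id) columns
>
> A run queue consistently higher than your core count, or idle pinned near
> zero, would point at CPU starvation.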
>
> Also, call me paranoid, but seeing /tmp in the path makes me suspicious.
> Can we assume the files are completely written by the time the metadata
> read occurs? They probably are, since you can query the files
> individually, but I'm just checking to make sure.
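>
> One quick sanity check, off the top of my head: a completely written
> parquet file ends with the 4-byte magic "PAR1", so tailing the last few
> bytes should tell you (file name taken from your mail below):
>
>     hdfs dfs -tail /tmp/37454954-3c0a-47c5-9793-1c333d87fbbb/185d3076-v_docker_node0001-140526122190592.parquet | tail -c 4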
>
> Finally, there is a similar JIRA
> https://issues.apache.org/jira/browse/DRILL-5908, that looks related.
>
>
>
>
> On Wed, May 9, 2018 at 4:15 PM, Carlos Derich <[email protected]>
> wrote:
>
> > Hello guys,
> >
> > Asking this question here because I think I've hit a wall with this
> > problem: I am consistently getting the same error when running a query
> > against a directory of parquet files.
> >
> > The directory contains six 158MB parquet files.
> >
> > RESOURCE ERROR: Waited for 15000ms, but tasks for 'Fetch parquet
> > metadata' are not complete. Total runnable size 6, parallelism 6.
> >
> >
> > Both queries fail:
> >
> > *select count(*) from dfs.`/tmp/37454954-3c0a-47c5-9793-1c333d87fbbb/`*
> >
> > *select * from dfs.`/tmp/37454954-3c0a-47c5-9793-1c333d87fbbb/`
> > limit 1*
> >
> > But if I run a query against any one of the six parquet files inside the
> > directory, it works fine, e.g.:
> > *select * from
> > dfs.`/tmp/37454954-3c0a-47c5-9793-1c333d87fbbb/185d3076-v_docker_node0001-140526122190592.parquet`*
> >
> > Running *`refresh table metadata`* gives me the exact same error.
> >
> > I also tried setting *planner.enable_hashjoin* to false, which made no
> > difference.
> >
> > Checking the Drill source, it seems the 'fetch parquet metadata' wait
> > timeout is hardcoded and not configurable.
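> >
> > (I found it by grepping for the error text in a Drill source checkout,
> > roughly:
> >
> >     grep -rn "but tasks for" exec/java-exec/src/main/java/
> >
> > and the 15000ms wait appears to be a hardcoded constant there.)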
> >
> > Have any of you faced a similar situation?
> >
> > I am running this locally on a 16GB RAM machine, with HDFS on a single
> > node.
> >
> > I also found an open ticket with the same error message:
> > https://issues.apache.org/jira/browse/DRILL-5903
> >
> > Thank you in advance.
> >
>
