Re: [DISCUSSION] current project state

2018-08-27 Thread Carlos Derich
Hello guys,

Thanks for bringing up this discussion. I may be a little late, but I would
like to add a use case I ran into recently.

I think Drill should be able to use ZK for storing session data. In a
multi-Drillbit scenario, if a second Drillbit receives a request with a
session token attached, it should try to retrieve that session from ZK
before generating a new one. I have seen this topic pop up every now and
then, and I'd be happy to work on it as a first contribution if we decide
how this should work.
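
To make the idea a bit more concrete, here is a rough sketch of the lookup I
have in mind, written directly against Curator (which Drill already bundles).
The znode layout and class/method names are just placeholders, and the real
implementation would presumably go through Drill's existing ZK abstractions
rather than raw Curator:

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class ZkSessionLookupSketch {

  // Hypothetical znode layout: /drill/sessions/<session-token>
  private static final String SESSION_ROOT = "/drill/sessions";

  /** Returns the serialized session for this token, or null if no Drillbit stored one yet. */
  static byte[] resolveSession(CuratorFramework zk, String sessionToken) throws Exception {
    String path = SESSION_ROOT + "/" + sessionToken;
    if (zk.checkExists().forPath(path) != null) {
      return zk.getData().forPath(path);   // reuse the session created by another Drillbit
    }
    return null;                           // caller falls back to creating a brand new session
  }

  public static void main(String[] args) throws Exception {
    CuratorFramework zk = CuratorFrameworkFactory.newClient(
        "localhost:2181", new ExponentialBackoffRetry(1000, 3));
    zk.start();
    byte[] existing = resolveSession(zk, "some-session-token");
    System.out.println(existing == null
        ? "no session in ZK, create a new one and persist it"
        : "found an existing session, reuse it");
    zk.close();
  }
}

The write path would be symmetric: when a session is created or updated, the
owning Drillbit would serialize it to the same znode so that any other
Drillbit can pick it up.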

I would also like to suggest that we focus on improving the documentation,
more specifically on docs that do not exist yet, such as a guide to writing
custom storage plugins. I recently had to work on a custom storage plugin
for GeoJSON files, and I think the only resource I could find on this was
Charles's log-file plugin (https://github.com/cgivre/drill-logfile-plugin),
which I am incredibly grateful for. I would be more than happy to work on
these docs.

Derich.

On Sun, Aug 19, 2018 at 4:49 AM Uwe L. Korn  wrote:

> Hello Arina
>
> On Tue, Aug 14, 2018, at 4:08 PM, Arina Yelchiyeva wrote:
> > 3. Drill vs. Arrow is a topic I have heard about since I started working
> > with Drill, but so far nobody has dared to tackle it. I suspect Drill
> > would first have to contribute changes to Arrow to be able to migrate,
> > which could be a show-stopper if the Arrow community does not accept them.
>
> What would be the changes that need to go into Arrow to make it usable for
> Drill? I suspect that many of them would also align with the Arrow
> project's own goals, especially as Arrow's Java code started out from
> Drill's ValueVector code. If you already know some of the issues, it would
> be really helpful to open tickets in the ARROW JIRA (feel free to add a
> drill label to them so one can search for them). Even if there is no plan
> to implement them currently, it definitely gives us Arrow developers
> visibility into what users of the Arrow library need and what prevents
> adoption.
>
> Uwe
>


Re: CT from parquet to CSV seems to not properly encode to UTF8

2018-07-17 Thread Carlos Derich
Hey guys,

Adding this JVM flag to the drill-env.sh file made it work.

export JAVA_TOOL_OPTIONS="-Dfile.encoding=UTF8"

Thank you very much.
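
For anyone else hitting this, roughly what I did (the install path below is
just an example for a plain tarball install, adjust to your setup):

echo 'export JAVA_TOOL_OPTIONS="-Dfile.encoding=UTF8"' >> /opt/apache-drill/conf/drill-env.sh
/opt/apache-drill/bin/drillbit.sh restart    # the flag only takes effect after the Drillbit restarts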


On Tue, Jul 17, 2018 at 1:49 AM, Kunal Khatua  wrote:

> Hi Carlos
>
> It looks similar to an issue reported previously:
> https://lists.apache.org/thread.html/1f3d4c427690c06f1992bc5070f355689ccc5b1ed8cc3678ad8e9106@
>
> Could you try setting the JVM's file encoding to UTF-8 and retry? If it
> does not work, please file a JIRA in https://issues.apache.org
>
> Thanks
> Kunal
> On 7/16/2018 1:25:45 PM, Carlos Derich  wrote:
> It seems to be an issue only with CSV/TSV files.
>
> Tried writing the output as JSON and it handles the encoding properly.
>
> alter session set `store.format`='json'
> create table dfs.tmp.test3 as select `city` from dfs.parquets.`file`
>
> Returns:
>
> {"city": "Montréal"}
>
>
> additional info:
>
> parquet-tools schema:
>
> message root {
>   optional binary city (UTF8);
> }
>
>
> On Mon, Jul 16, 2018 at 2:49 PM, Carlos Derich
> wrote:
>
> > Hello guys, hope everyone is well.
> >
> > I am having an encoding issue when converting a table from Parquet into
> > CSV files; I wonder if someone could shed some light on it?
> >
> > One of my data sets has data in French with lots of accented characters,
> > and it is persisted in HDFS as Parquet.
> >
> >
> > When I query the parquet table with *select `city` from
> > dfs.parquets.`file`*, it returns the data properly encoded.
> >
> >
> > *city*
> >
> > *Montréal*
> >
> >
> > Then I convert this table into a CSV file with the following query:
> >
> > *alter session set `store.format`='csv'*
> > *create table dfs.csvs.`converted` as select * from dfs.parquets.`file`*
> >
> >
> > Then when I run a select query on it, it returns data not properly
> > encoded:
> >
> > *select columns[0] from dfs.csvs.`converted`*
> >
> > Returns:
> >
> > *Montr?al*
> >
> >
> > My storage plugin is pretty standard:
> >
> > "csv" : {
> > "type" : "text",
> > "extensions" : [ "csv" ],
> > "delimiter" : ",",
> > "skipFirstLine": true
> > },
> >
> > Should I explicitly add a charset option somewhere? I couldn't find
> > anything helpful in the docs.
> >
> > I tried adding *export DRILL_JAVA_OPTS="$DRILL_JAVA_OPTS
> > -Dsaffron.default.charset=UTF-8"* to the drill-env.sh file, but no luck.
> >
> > Has anyone run into similar issues?
> >
> > Thank you !
> >
>


Re: CT from parquet to CSV seems to not properly encode to UTF8

2018-07-16 Thread Carlos Derich
It seems to be an issue only with CSV/TSV files.

Tried writing the output as JSON and it handles the encoding properly.

alter session set `store.format`='json'
create table dfs.tmp.test3 as select `city` from dfs.parquets.`file`

Returns:

{"city": "Montréal"}


additional info:

parquet-tools schema:

message root {
  optional binary city (UTF8);
}


On Mon, Jul 16, 2018 at 2:49 PM, Carlos Derich 
wrote:

> Hello guys, hope everyone is well.
>
> I am having an encoding issue when converting a table from Parquet into
> CSV files; I wonder if someone could shed some light on it?
>
> One of my data sets has data in French with lots of accented characters,
> and it is persisted in HDFS as Parquet.
>
>
> When I query the parquet table with *select `city` from
> dfs.parquets.`file`*, it returns the data properly encoded.
>
>
> *city*
>
> *Montréal*
>
>
> Then I convert this table into a CSV file with the following query:
>
> *alter session set `store.format`='csv'*
> *create table dfs.csvs.`converted` as select * from dfs.parquets.`file`*
>
>
> Then when I run a select query on it, it returns data not properly encoded:
>
> *select columns[0] from dfs.csvs.`converted`*
>
> Returns:
>
> *Montr?al*
>
>
> My storage plugin is pretty standard:
>
> "csv" : {
> "type" : "text",
> "extensions" : [ "csv" ],
> "delimiter" : ",",
> "skipFirstLine": true
> },
>
> Should I explicitly add a charset option somewhere? I couldn't find
> anything helpful in the docs.
>
> I tried adding *export DRILL_JAVA_OPTS="$DRILL_JAVA_OPTS
> -Dsaffron.default.charset=UTF-8"* to the drill-env.sh file, but no luck.
>
> Has anyone run into similar issues?
>
> Thank you !
>


CT from parquet to CSV seems to not properly encode to UTF8

2018-07-16 Thread Carlos Derich
Hello guys, hope everyone is well.

I am having an encoding issue when converting a table from Parquet into CSV
files; I wonder if someone could shed some light on it?

One of my data sets has data in French with lots of accented characters, and
it is persisted in HDFS as Parquet.


When I query the parquet table with *select `city` from
dfs.parquets.`file`*, it returns the data properly encoded.


*city*

*Montréal*


Then I convert this table into a CSV file with the following query:

*alter session set `store.format`='csv'*
*create table dfs.csvs.`converted` as select * from dfs.parquets.`file`*


Then when I run a select query on it, it returns data not properly encoded:

*select columns[0] from dfs.csvs.`converted`*

Returns:

*Montr?al*


My storage plugin is pretty standard:

"csv" : {
"type" : "text",
"extensions" : [ "csv" ],
"delimiter" : ",",
"skipFirstLine": true
},

Should I explicitly add a charset option somewhere? I couldn't find
anything helpful in the docs.

I tried adding *export DRILL_JAVA_OPTS="$DRILL_JAVA_OPTS
-Dsaffron.default.charset=UTF-8"* to the drill-env.sh file, but no luck.

Has anyone run into similar issues?

Thank you !


Re: Failed to fetch parquet metadata after 15000ms

2018-05-11 Thread Carlos Derich
Hey Kunal, thanks for the help !

I've found the problem: my parquet files were being generated by a
third-party warehouse DB with an export-to-parquet command.

It seems that the parquet files being generated are somehow corrupted,
causing the parquet reader to eat all the memory and then die.

I'm opening a ticket with them.

I tested creating the parquet files from Drill itself instead of from the
warehouse, and it worked fine and was super fast.
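
For reference, this is roughly how I sanity-checked the exported files with
parquet-tools before blaming the exporter (the file name below is just a
placeholder):

parquet-tools meta   /tmp/export/part-00000.parquet    # footer and row-group metadata
parquet-tools schema /tmp/export/part-00000.parquet    # logical schema
parquet-tools head -n 5 /tmp/export/part-00000.parquet # first few records

If any of these hang or blow up on heap for a modestly sized file, the file
itself is the first suspect.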

Thanks for all the help
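
For anyone landing on this thread later: Kunal's suggestions below map to
session options along these lines (the values are only what I would try on a
16GB single-node box, not general recommendations):

alter session set `planner.width.max_per_node` = 1;
alter session set `planner.memory.max_query_memory_per_node` = 2147483648;  -- 2GB, the default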



On Thu, May 10, 2018 at 2:57 PM, Kunal Khatua <ku...@apache.org> wrote:

> That doesn't look too big. Are the queries failing during the planning or
> the execution phase?
>
> Also, you mentioned that you are running this on a machine with 16GB RAM.
> How much memory have you given to Drill? A typical minimum config is about
> 8GB for Drill alone, and with 2-3GB for the OS, not a whole lot is left for
> other services (incl. HDFS).
>
> If you need to run barebones, here are a couple of things worth doing.
> 1. Check the memory allocated to Drill. I'm not sure how much HDFS, etc.
> have been allocated, but you should be able to get away with as low as
> 4-5GB (1-2GB heap and the rest for direct memory).
> 2. Reduce 'planner.width.max_per_node' to as low as 1. This will force
> Drill to run, in effect, at a parallelization of 1 (i.e. 1 thread per
> major fragment).
> 3. There is also the option "planner.memory.max_query_memory_per_node",
> which defaults to 2GB and should be sufficient even in your environment
> when coupled with #2.
>
> The current Apache Drill master also carries real-time information about
> heap and direct memory usage. If you are using that, you should be able
> to see whether you are quickly running out of memory (which, I suspect, is
> the source of your troubles).
>
> Hope that helps.
> On 5/10/2018 10:05:29 AM, Carlos Derich <carlosder...@gmail.com> wrote:
> It is relatively big ?
>
> parquet-tools schema output:
>
> message schema {
>   optional int64 id;
>   optional int64 cbd_id;
>   optional binary company_name (UTF8);
>   optional binary category (UTF8);
>   optional binary subcategory (UTF8);
>   optional binary description (UTF8);
>   optional binary full_address_source (UTF8);
>   optional binary street_address (UTF8);
>   optional binary neighborhood (UTF8);
>   optional binary city (UTF8);
>   optional binary administrative_area_level_3 (UTF8);
>   optional binary administrative_area_level_2 (UTF8);
>   optional binary administrative_area_level_1 (UTF8);
>   optional binary postal_code (UTF8);
>   optional binary country (UTF8);
>   optional binary formatted_address (UTF8);
>   optional binary geometry;
>   optional binary telephone (UTF8);
>   optional binary website (UTF8);
>   optional int32 retrieved_at;
>   optional binary source_url (UTF8);
> }
>
> Thanks for the help, I will keep you posted; this will help me better
> understand Drill's hardware requirements.
>
> On Thu, May 10, 2018 at 12:59 PM, Parth Chandra wrote:
>
> > That might be it. How big is the schema of your data? Do you have lots of
> > fields? If parquet-tools cannot read the metadata, there is little chance
> > anybody else will be able to do so either.
> >
> >
> > On Thu, May 10, 2018 at 9:57 AM, Carlos Derich
> > wrote:
> >
> > > Hey Parth, thanks for the response !
> > >
> > > I tried fetching the metadata using parquet-tools Hadoop mode instead,
> > > and I get OOM errors: Heap and GC limit exceeded.
> > >
> > > It seems that my problem is actually resource-related; still, it is a
> > > bit weird that the parquet metadata read is so memory-hungry.
> > >
> > > It seems that even after a restart (clean state/no queries running),
> > > only ~4GB of memory is free on a 16GB machine.
> > >
> > > I am going to run the tests on a bigger machine, tweak the JVM options,
> > > and let you know.
> > >
> > > Regards.
> > > Carlos.
> > >
> > > On Wed, May 9, 2018 at 9:04 PM, Parth Chandra wrote:
> > >
> > > > The most common reason I know of for this error is if you do not have
> > > > enough CPU. Both Drill and the distributed file system will be using
> > > > CPU, and sometimes the file system, especially if it is distributed,
> > > > will take too long. With your configuration and data set size, reading
> > > > the file metadata should take no time at all (I'll assume the metadata
> > > > in the files is reasonable and not many MB itself). Is your system by
> > > > any chance overloaded?
> > > >
> > > > Also, call me paranoid, but seeing /tmp in the path makes me suspicious.

Re: Failed to fetch parquet metadata after 15000ms

2018-05-10 Thread Carlos Derich
It is relatively big ?

parquet-tools schema output:

message schema {
  optional int64 id;
  optional int64 cbd_id;
  optional binary company_name (UTF8);
  optional binary category (UTF8);
  optional binary subcategory (UTF8);
  optional binary description (UTF8);
  optional binary full_address_source (UTF8);
  optional binary street_address (UTF8);
  optional binary neighborhood (UTF8);
  optional binary city (UTF8);
  optional binary administrative_area_level_3 (UTF8);
  optional binary administrative_area_level_2 (UTF8);
  optional binary administrative_area_level_1 (UTF8);
  optional binary postal_code (UTF8);
  optional binary country (UTF8);
  optional binary formatted_address (UTF8);
  optional binary geometry;
  optional binary telephone (UTF8);
  optional binary website (UTF8);
  optional int32 retrieved_at;
  optional binary source_url (UTF8);
}

Thanks for the help, I will keep you posted; this will help me better
understand Drill's hardware requirements.

On Thu, May 10, 2018 at 12:59 PM, Parth Chandra <par...@apache.org> wrote:

> That might be it. How big is the schema of your data? Do you have lots of
> fields? If parquet-tools cannot read the metadata, there is little chance
> anybody else will be able to do so either.
>
>
> On Thu, May 10, 2018 at 9:57 AM, Carlos Derich <carlosder...@gmail.com>
> wrote:
>
> > Hey Parth, thanks for the response !
> >
> > I tried fetching the metadata using parquet-tools Hadoop mode instead,
> > and I get OOM errors: Heap and GC limit exceeded.
> >
> > It seems that my problem is actually resource-related; still, it is a bit
> > weird that the parquet metadata read is so memory-hungry.
> >
> > It seems that even after a restart (clean state/no queries running), only
> > ~4GB of memory is free on a 16GB machine.
> >
> > I am going to run the tests on a bigger machine, tweak the JVM options,
> > and let you know.
> >
> > Regards.
> > Carlos.
> >
> > On Wed, May 9, 2018 at 9:04 PM, Parth Chandra <par...@apache.org> wrote:
> >
> > > The most common reason I know of for this error is if you do not have
> > > enough CPU. Both Drill and the distributed file system will be using
> > > CPU, and sometimes the file system, especially if it is distributed,
> > > will take too long. With your configuration and data set size, reading
> > > the file metadata should take no time at all (I'll assume the metadata
> > > in the files is reasonable and not many MB itself). Is your system by
> > > any chance overloaded?
> > >
> > > Also, call me paranoid, but seeing /tmp in the path makes me
> > > suspicious. Can we assume the files are written completely when the
> > > metadata read is occurring? They probably are, since you can query the
> > > files individually, but I'm just checking to make sure.
> > >
> > > Finally, there is a similar JIRA
> > > https://issues.apache.org/jira/browse/DRILL-5908, that looks related.
> > >
> > >
> > >
> > >
> > > On Wed, May 9, 2018 at 4:15 PM, Carlos Derich <carlosder...@gmail.com>
> > > wrote:
> > >
> > > > Hello guys,
> > > >
> > > > Asking this question here because I think I've hit a wall with this
> > > > problem: I am consistently getting the same error when running a
> > > > query on a directory of parquet files.
> > > >
> > > > The directory contains six 158MB parquet files.
> > > >
> > > > RESOURCE ERROR: Waited for 15000ms, but tasks for 'Fetch parquet
> > > > metadata' are not complete. Total runnable size 6, parallelism 6.
> > > >
> > > >
> > > > Both queries fail:
> > > >
> > > > *select count(*) from dfs.`/tmp/37454954-3c0a-47c5-9793-1c333d87fbbb/`*
> > > >
> > > > *select * from dfs.`/tmp/37454954-3c0a-47c5-9793-1c333d87fbbb/` limit 1*
> > > >
> > > > BUT if I try running any query against any of the six parquet files
> > > > inside the directory individually, it works fine, e.g.:
> > > > *select * from
> > > > dfs.`/tmp/37454954-3c0a-47c5-9793-1c333d87fbbb/185d3076-v_docker_node0001-140526122190592.parquet`*
> > > >
> > > > Running *`refresh table metadata`* gives me the exact same error.
> > > >
> > > > Also tried to set *planner.hashjoin* to false.
> > > >
> > > > Checking the Drill source, it seems that the metadata wait timeout is
> > > > not configurable.
> > > >
> > > > Have any of you faced a similar situation?
> > > >
> > > > I am running this locally on my 16GB RAM machine, with HDFS on a
> > > > single node.
> > > >
> > > > I also found an open ticket with the same error message:
> > > > https://issues.apache.org/jira/browse/DRILL-5903
> > > >
> > > > Thank you in advance.
> > > >
> > >
> >
>


Re: Failed to fetch parquet metadata after 15000ms

2018-05-10 Thread Carlos Derich
Hey Parth, thanks for the response !

I tried fetching the metadata using parquet-tools Hadoop mode instead, and
I get OOM errors: Heap and GC limit exceeded.

It seems that my problem is actually resource-related; still, it is a bit
weird that the parquet metadata read is so memory-hungry.

It seems that even after a restart (clean state/no queries running), only
~4GB of memory is free on a 16GB machine.

I am going to run the tests on a bigger machine, tweak the JVM options, and
let you know.

Regards.
Carlos.

On Wed, May 9, 2018 at 9:04 PM, Parth Chandra <par...@apache.org> wrote:

> The most common reason I know of for this error is if you do not have
> enough CPU. Both Drill and the distributed file system will be using CPU,
> and sometimes the file system, especially if it is distributed, will take
> too long. With your configuration and data set size, reading the file
> metadata should take no time at all (I'll assume the metadata in the files
> is reasonable and not many MB itself).  Is your system by any chance
> overloaded?
>
> Also, call me paranoid, but seeing /tmp in the path makes me suspicious.
> Can we assume the files are written completely when the metadata read is
> occurring? They probably are, since you can query the files individually,
> but I'm just checking to make sure.
>
> Finally, there is a similar JIRA
> https://issues.apache.org/jira/browse/DRILL-5908, that looks related.
>
>
>
>
> On Wed, May 9, 2018 at 4:15 PM, Carlos Derich <carlosder...@gmail.com>
> wrote:
>
> > Hello guys,
> >
> > Asking this question here because I think I've hit a wall with this
> > problem: I am consistently getting the same error when running a query
> > on a directory of parquet files.
> >
> > The directory contains six 158MB parquet files.
> >
> > RESOURCE ERROR: Waited for 15000ms, but tasks for 'Fetch parquet
> > metadata' are not complete. Total runnable size 6, parallelism 6.
> >
> >
> > Both queries fail:
> >
> > *select count(*) from dfs.`/tmp/37454954-3c0a-47c5-9793-1c333d87fbbb/`*
> >
> > *select * from dfs.`/tmp/37454954-3c0a-47c5-9793-1c333d87fbbb/`
> > limit 1*
> >
> > BUT if I try running any query against any of the six parquet files
> > inside the directory individually, it works fine, e.g.:
> > *select * from
> > dfs.`/tmp/37454954-3c0a-47c5-9793-1c333d87fbbb/185d3076-v_docker_node0001-140526122190592.parquet`*
> >
> > Running *`refresh table metadata`* gives me the exact same error.
> >
> > Also tried to set *planner.hashjoin* to false.
> >
> > Checking the Drill source, it seems that the metadata wait timeout is not
> > configurable.
> >
> > Have any of you faced a similar situation?
> >
> > I am running this locally on my 16GB RAM machine, with HDFS on a single
> > node.
> >
> > I also found an open ticket with the same error message:
> > https://issues.apache.org/jira/browse/DRILL-5903
> >
> > Thank you in advance.
> >
>


Failed to fetch parquet metadata after 15000ms

2018-05-09 Thread Carlos Derich
Hello guys,

Asking this question here because I think I've hit a wall with this
problem: I am consistently getting the same error when running a query on
a directory of parquet files.

The directory contains six 158MB parquet files.

RESOURCE ERROR: Waited for 15000ms, but tasks for 'Fetch parquet
metadata' are not complete. Total runnable size 6, parallelism 6.


Both queries fail:

*select count(*) from dfs.`/tmp/37454954-3c0a-47c5-9793-1c333d87fbbb/`*

*select * from dfs.`/tmp/37454954-3c0a-47c5-9793-1c333d87fbbb/`
limit 1*

BUT if I try running any query against any of the six parquet files inside
the directory individually, it works fine, e.g.:
*select * from
dfs.`/tmp/37454954-3c0a-47c5-9793-1c333d87fbbb/185d3076-v_docker_node0001-140526122190592.parquet`*

Running *`refresh table metadata`* gives me the exact same error.

Also tried to set *planner.hashjoin* to false.
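
(If I remember the exact name right, the option is `planner.enable_hashjoin`,
set the usual way:

alter session set `planner.enable_hashjoin` = false;)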

Checking the Drill source, it seems that the metadata wait timeout is not
configurable.

Have any of you faced a similar situation?

I am running this locally on my 16GB RAM machine, with HDFS on a single node.

I also found an open ticket with the same error message:
https://issues.apache.org/jira/browse/DRILL-5903

Thank you in advance.