Re: Drill + parquet

2020-02-06 Thread Vishal Jadhav (BLOOMBERG/ 731 LEX)
Thank you very much for your help! From: vvo...@gmail.com At: 02/06/20 04:13:19 To: Vishal Jadhav (BLOOMBERG/ 731 LEX), user@drill.apache.org Subject: Re: Drill + parquet Hi Vishal, Pull request with the fix for DRILL-5733 is opened and will be merged soon. Kind regards, Volodymyr Vysotskyi

Re: Drill + parquet

2020-02-06 Thread Vova Vysotskyi
works fine on my local file system, but fails on HDFS. > Not sure, I am running into the issue mentioned here - > https://issues.apache.org/jira/browse/DRILL-5733 > > From: user@drill.apache.org At: 02/04/20 15:48:23 To: Vishal Jadhav > (BLOOMBERG/ 731 LEX), user@drill.apache.org

Re: Drill + parquet

2020-02-04 Thread Vishal Jadhav (BLOOMBERG/ 731 LEX)
From: user@drill.apache.org At: 02/04/20 11:28:25 To: Vishal Jadhav >> (BLOOMBERG/ 731 LEX), user@drill.apache.org >> Subject: Re: Drill + parquet >> >> Parquet is the default file format for Apache Drill >> so you do not need to give a parquet file for a drill query. Instead give >

Re: Drill + parquet

2020-02-04 Thread Arina Yelchiyeva
From: user@drill.apache.org At: 02/04/20 11:28:25 To: Vishal Jadhav >> (BLOOMBERG/ 731 LEX), user@drill.apache.org >> Subject: Re: Drill + parquet >> >> Parquet is the default file format for Apache Drill >> so you do not need to give a parquet file for a drill query. Instead give >> the folder

Re: Drill + parquet

2020-02-04 Thread Nitin Pawar
. > https://drill.apache.org/docs/querying-parquet-files/ > As per it, I can query an individual parquet file, so why is it failing with > the 'not a directory' error? > > From: user@drill.apache.org At: 02/04/20 11:28:25 To: Vishal Jadhav > (BLOOMBERG/ 731 LEX), user@drill.a

Re: Drill + parquet

2020-02-04 Thread Vishal Jadhav (BLOOMBERG/ 731 LEX)
), user@drill.apache.org Subject: Re: Drill + parquet Parquet is the default file format for Apache Drill, so you do not need to give a parquet file for a drill query. Instead give the folder path which contains the files. eg: select * from hdfs_storage.`folder1` will query all the parquet files in folder1

Re: Drill + parquet

2020-02-04 Thread Nitin Pawar
Parquet is the default file format for Apache Drill, so you do not need to give a parquet file for a drill query. Instead give the folder path which contains the files. eg: select * from hdfs_storage.`folder1` will query all the parquet files in folder1 On Tue, Feb 4, 2020 at 9:55 PM Vishal Jadhav
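Cleaned up, the query shape Nitin describes looks like the following sketch (the storage plugin name `hdfs_storage` and the folder name are placeholders taken from the thread):

```sql
-- Point the query at the directory, not at an individual file;
-- Drill reads every parquet file found underneath it.
SELECT * FROM hdfs_storage.`folder1`;
```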

Drill + parquet

2020-02-04 Thread Vishal Jadhav (BLOOMBERG/ 731 LEX)
Hello Drillers, Need some help with the hdfs + parquet files. I have configured the HDFS storage with parquet & csv format plugins. I can query the - ..`*.csv` correctly. Also, I have a similar directory structure for the parquet files (in a different directory), but I am not able to query it.

Re: drill parquet - create table as ... partition by ... non present column

2018-12-10 Thread Anton Gozhiy
Benj, I meant that without the metadata Drill won't recognize files as partitions, although that should not be a problem for the optimization mechanisms. As for your original question, I think it's not implemented rather than an intended limitation. Feel free to submit a feature request to

Re: drill parquet - create table as ... partition by ... non present column

2018-12-07 Thread benj.dev
Hi, Thanks for the details. That's the point: I don't want to write additional metadata, but just organize the parquet file to have more useful stats. In a simple GROUP BY it's possible to not SELECT some of the "grouped" columns. (Example: SELECT a, b FROM ... GROUP BY a, b, c;) In the same way, I think it

Re: drill parquet - create table as ... partition by ... non present column

2018-12-06 Thread Anton Gozhiy
Hi Benj, Creating partitions as in your first example won't work. From the docs: "During partitioning, Drill creates separate files, but not separate directories, for different partitions." ( https://drill.apache.org/docs/how-to-partition-data/). Also, Drill doesn't write additional metadata
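The behavior Anton quotes can be seen with a minimal CTAS sketch (table and column names here are made up for illustration; in Drill's supported form, the partition column must appear in the SELECT list):

```sql
-- Drill writes one parquet file per distinct value of yr,
-- but all files land in the same output directory.
CREATE TABLE dfs.tmp.`users_by_yr` PARTITION BY (yr)
AS SELECT yr, name FROM dfs.tmp.`users`;
```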

drill parquet - create table as ... partition by ... non present column

2018-12-05 Thread benj . dev
I would like to create a parquet table with a partition on computed data (without having to put the result of the computation in the parquet file): the goal is to optimize the parquet for typical expected queries. Imaginary example: CREATE TABLE `mytable` PARTITION BY (substr(name,1,1)) AS SELECT
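Drill's CTAS does not accept an expression inside PARTITION BY; the closest supported form is a sketch like the one below (column names are illustrative), which has exactly the side effect benj wants to avoid: the computed column is written into the parquet files.

```sql
-- Alias the expression in the SELECT list, then partition by the alias.
CREATE TABLE dfs.tmp.`mytable` PARTITION BY (first_letter)
AS SELECT substr(name, 1, 1) AS first_letter, name
FROM dfs.tmp.`source_table`;
```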

Re: Drill Parquet Partitioning Method

2017-04-04 Thread John Omernik
When I post questions like that I take a very user-centric mindset. For me, abstracting what needs to be done here to help users seamlessly work with other tools makes Drill a friendly addition to any Data Science team. If admins have to do more, or train users to handle how Drill does stuff

Re: Drill Parquet Partitioning Method

2017-04-03 Thread Jinfeng Ni
That's a good idea. Let me clarify one thing first. Drill has two kinds of partitions: auto partition, or directory-based partition. The first one is a result of using drill's CTAS partition by statement [1]. Both partition column name and column value are written and encoded in the output
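The second kind Jinfeng mentions, directory-based partitioning, is purely a matter of file layout: subdirectory names surface as the implicit columns dir0, dir1, and so on. A sketch, with illustrative paths:

```sql
-- For files laid out as /data/logs/2017/04/...,
-- dir0 binds to '2017' and dir1 to '04'; filtering on them
-- lets Drill prune whole directories from the scan.
SELECT * FROM dfs.`/data/logs` WHERE dir0 = '2017' AND dir1 = '04';
```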

Drill Parquet Partitioning Method

2017-04-03 Thread John Omernik
So as a user of Drill now for a while, I have gotten used to the idea of partitions just being values, instead of key=value like other things (Hive, Impala, others). From a user/analyst perspective, the dir0, dir1, dirN methodology provides quite a bit of flexibility, but to be intuitive, we

Re: First impressions with Drill+Parquet+S3

2016-10-28 Thread Uwe Korn
Hello Parth, I filed JIRAs for S3 performance: * https://issues.apache.org/jira/browse/DRILL-4977 * https://issues.apache.org/jira/browse/DRILL-4976 * https://issues.apache.org/jira/browse/DRILL-4978 and one for execution of drillbits inside Apache Mesos+Aurora: *

Re: First impressions with Drill+Parquet+S3

2016-10-06 Thread Uwe Korn
Yes. Performance was much better with a real file system (i.e. I ran locally on my laptop using the SSD installed there). I don’t expect to have the exact same performance with S3 as I don’t have things like data locality there. My use case is mainly querying „cold“ datasets, i.e. ones that are

Re: First impressions with Drill+Parquet+S3

2016-10-06 Thread Ted Dunning
Have you tried running against a real file system interface? Or even just against HDFS? On Thu, Oct 6, 2016 at 12:35 PM, Uwe Korn wrote: > Hello, > > We had some test runs with Drill 1.8 in the last days and wanted to share > the experience with you as we've made some

First impressions with Drill+Parquet+S3

2016-10-06 Thread Uwe Korn
Hello, We had some test runs with Drill 1.8 in the last days and wanted to share the experience with you as we've made some interesting findings that astonished us. We did run on our internal company cluster and thus used the S3 API to access our internal storage cluster, not AWS (the behavior

Re: IPv6 in Drill/Parquet

2015-07-24 Thread Jim Scott
let me clarify... If you were grouping by household, you may want to group on the left side. If it is stored in a single-valued field, then you would have to manipulate the value in some way to get the portion you want to group by. Thus, storing it in two parts would be optimal for the use

Re: IPv6 in Drill/Parquet

2015-07-24 Thread Stefán Baxter
Well, that is only true if you don't have a BigInteger to hold it :) see: https://java-ipv6.googlecode.com/svn/artifacts/0.14/doc/apidocs/com/googlecode/ipv6/IPv6Address.html Regards, -Stefan On Fri, Jul 24, 2015 at 2:39 PM, Jim Scott jsc...@maprtech.com wrote: an IPv6 address is actually two

Re: IPv6 in Drill/Parquet

2015-07-24 Thread Jim Scott
an IPv6 address is actually two longs. Depending on the type of analysis you are doing you may prefer to store them that way. e.g. the range on the left side is a home / location and the range on the right side are sub values (devices within the home). Depending on your use case you may want to
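Jim's suggestion maps naturally onto two BIGINT columns: store the 128-bit IPv6 address as a high half and a low half, and grouping by household then only touches the high half. A sketch; the table and column names (`ip_hi`, `ip_lo`) are hypothetical:

```sql
-- ip_hi holds the upper 64 bits (network / household part),
-- ip_lo the lower 64 bits (devices within the household).
SELECT ip_hi, COUNT(*) AS addr_count
FROM dfs.`/data/events`
GROUP BY ip_hi;
```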

Re: IPv6 in Drill/Parquet

2015-07-24 Thread Stefán Baxter
thank you! On Fri, Jul 24, 2015 at 3:23 PM, Jim Scott jsc...@maprtech.com wrote: let me clarify... If you were grouping by household, you may want to group on the left side. If it is stored in a single valued field, then you would have to manipulate the value in some way to get the