Re: storage structure - querying directories - sanity check and UDF assistance

2015-07-24 Thread Jason Altekruse
This is actually a known issue: constant folding is not working in the select clause because of a costing problem. Constant folding currently works only in the where clause. https://issues.apache.org/jira/browse/DRILL-2218 On Fri, Jul 24, 2015 at 4:13 PM, Ted Dunning wrote: > I think that

Re: storage structure - querying directories - sanity check and UDF assistance

2015-07-24 Thread Ted Dunning
I think that constant reduction isn't entirely working in the presence of joins. For example, I removed the isRandom annotation from my random number generator. You can see constant reduction working if I give a literal number: 0: jdbc:drill:zk=local> select b.x,a.y,random(1, 3) from (values > (

Re: storage structure - querying directories - sanity check and UDF assistance

2015-07-24 Thread Stefán Baxter
Hi, I understand how this can be useful to deal with both row/record and directory should for a result but then there is huge optimization potential left unexploited. (I'm not fully understanding if this "directory failing" happens with more proof or not). - If this does not eventually fail direc

Re: storage structure - querying directories - sanity check and UDF assistance

2015-07-24 Thread Jason Altekruse
A little clarification on that point. The directory filters are not syntactically separated from filters on regular columns that we read out of the files themselves. Without optimization, the easiest way to think about the directory columns is as data that is added to each record coming out of the s

Re: storage structure - querying directories - sanity check and UDF assistance

2015-07-24 Thread Stefán Baxter
Hi Jason, I will share this code tomorrow on GitHub so you can review it there if that helps. When I was testing this earlier today, I saw, to my surprise, that the query sometimes returned results. This was not constant and I could run exactly the same statement with two different results (

Re: storage structure - querying directories - sanity check and UDF assistance

2015-07-24 Thread Jason Altekruse
I'm not sure; it is possible that it is being evaluated during planning to prune the scan, but the filter above the scan is not being removed as it should be. I'll try to re-create the case to take a look. Stefan, earlier you had mentioned that it was not only inefficient, but it was also givin

Re: storage structure - querying directories - sanity check and UDF assistance

2015-07-24 Thread Jacques Nadeau
- This is being called for *every record for every file in every directory* Are you sure? Constant reduction should take care of this. @Jason, any ideas why it might be failing? -- Jacques Nadeau CTO and Co-Founder, Dremio On Fri, Jul 24, 2015 at 10:45 AM, Stefán Baxter wrote: > Hi, > >

Re: IPv6 in Drill/Parquet

2015-07-24 Thread Stefán Baxter
thank you! On Fri, Jul 24, 2015 at 3:23 PM, Jim Scott wrote: > let me clarify... > > If you were grouping by household, you may want to group on the left side. > If it is stored in a single valued field, then you would have to manipulate > the value in some way to get the portion you want to g

Re: storage structure - querying directories - sanity check and UDF assistance

2015-07-24 Thread Stefán Baxter
Hi, thanks for the tips. Observation: - This is being called for *every record for every file in every directory* Can you please tell me what needs to be done to make sure this is called only once for each directory, preferably before the files in that directory are opened/scanned. Regards, -S

Re: storage structure - querying directories - sanity check and UDF assistance

2015-07-24 Thread Jacques Nadeau
Two quick notes: - If you switch to internal null handling, you have to define separate UDFs for each possible combination of nullable and non-nullable values. - isSet is an integer, so your if clause would actually be: if (! (yearDir.isSet == 1) ) { // yearDir is NULL, handle this here } -- J
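
The two notes above can be sketched in plain Java. This is only an illustration: `NullableTextHolder` and `TextHolder` are hypothetical stand-ins, not Drill's holder classes, and `yearOrPlaceholder` is an invented name. It shows the integer `isSet` check and why nullable and non-nullable inputs each need their own variant.

```java
final class IsSetSketch {
    /** Hypothetical stand-in for a nullable holder: isSet is an int, 1 = value present, 0 = NULL. */
    static final class NullableTextHolder {
        int isSet;
        String value;
    }

    /** Hypothetical stand-in for a non-nullable holder: a value is always present. */
    static final class TextHolder {
        String value;
    }

    /** Variant for nullable input: isSet must be compared with 1, it is not a boolean. */
    static String yearOrPlaceholder(NullableTextHolder yearDir) {
        if (!(yearDir.isSet == 1)) {
            // yearDir is NULL, handle this here
            return "-";
        }
        return yearDir.value;
    }

    /** Separate variant for non-nullable input: no NULL check is needed. */
    static String yearOrPlaceholder(TextHolder yearDir) {
        return yearDir.value;
    }
}
```

With two arguments that can each be nullable or not, this pattern would need four such variants, which is the cost Jacques points out.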

Re: IPv6 in Drill/Parquet

2015-07-24 Thread Jim Scott
Let me clarify... If you were grouping by household, you may want to group on the left side. If it is stored in a single-valued field, then you would have to manipulate the value in some way to get the portion you want to group by. Thus, storing it in two parts would be optimal for the use case.

Re: storage structure - querying directories - sanity check and UDF assistance

2015-07-24 Thread Abdel Hakim Deneche
Hi Stefan, I think when you specify your UDF as NULL_IF_NULL it means Drill will handle null values automatically: if any passed argument to your UDF is NULL, the UDF won't be evaluated and Drill will return NULL instead. In your case your UDF needs to handle NULL values by setting: nulls = NullH
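
The difference between the two behaviours can be imitated in plain Java. This is a simulation under stated assumptions, not Drill code: `nullIfNull` and `joinDirs` are hypothetical names, and Java `null` stands in for SQL NULL.

```java
import java.util.function.BinaryOperator;

final class NullHandlingSketch {
    /**
     * Imitates NULL_IF_NULL: the caller short-circuits, so the function body
     * is never evaluated when any argument is null.
     */
    static String nullIfNull(BinaryOperator<String> udf, String a, String b) {
        if (a == null || b == null) {
            return null;
        }
        return udf.apply(a, b);
    }

    /**
     * Imitates internal null handling: the function body itself sees nulls
     * and decides what to do with them (here: substitute "-").
     */
    static String joinDirs(String dir0, String dir1) {
        String left = (dir0 == null) ? "-" : dir0;
        String right = (dir1 == null) ? "-" : dir1;
        return left + "/" + right;
    }
}
```

A directory column such as dir1 is NULL for files that sit directly under dir0, so under the first behaviour the function would never run for those rows.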

Re: IPv6 in Drill/Parquet

2015-07-24 Thread Stefán Baxter
Well, that is only true if you don't have a BigInteger to hold it :) see: https://java-ipv6.googlecode.com/svn/artifacts/0.14/doc/apidocs/com/googlecode/ipv6/IPv6Address.html Regards, -Stefan On Fri, Jul 24, 2015 at 2:39 PM, Jim Scott wrote: > an IPv6 address is actually two longs. Depending o
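
As a rough illustration of the BigInteger view, the sketch below uses only the JDK (not the java-ipv6 library linked above). The class and method names are hypothetical. It turns an IPv6 literal into an unsigned 128-bit BigInteger and back into a fixed 16-byte array, which is one shape a binary column value could take.

```java
import java.math.BigInteger;
import java.net.InetAddress;
import java.net.UnknownHostException;

final class Ipv6BigIntegerSketch {
    /** Parses an IPv6 literal (pass literals only; a host name would trigger a DNS lookup). */
    static BigInteger toBigInteger(String literal) {
        byte[] bytes;
        try {
            bytes = InetAddress.getByName(literal).getAddress();
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("not an IP literal: " + literal, e);
        }
        if (bytes.length != 16) {
            throw new IllegalArgumentException("not an IPv6 address: " + literal);
        }
        // signum 1 makes BigInteger treat the 16 bytes as an unsigned magnitude.
        return new BigInteger(1, bytes);
    }

    /** Converts a value in [0, 2^128) back to exactly 16 big-endian bytes. */
    static byte[] toFixed16Bytes(BigInteger value) {
        if (value.signum() < 0 || value.bitLength() > 128) {
            throw new IllegalArgumentException("value does not fit in 128 bits");
        }
        byte[] raw = value.toByteArray(); // may be shorter than 16 or carry a leading zero byte
        byte[] out = new byte[16];
        int copy = Math.min(raw.length, 16);
        System.arraycopy(raw, raw.length - copy, out, 16 - copy, copy);
        return out;
    }
}
```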

Re: IPv6 in Drill/Parquet

2015-07-24 Thread Jim Scott
An IPv6 address is 128 bits, which is effectively two longs. Depending on the type of analysis you are doing, you may prefer to store it that way, e.g. the range on the left side is a home / location and the range on the right side is sub-values (devices within the home). Depending on your use case you may want to s
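
A minimal sketch of the two-longs split, using only the JDK. The class and method names are hypothetical. The first long holds the high 64 bits (the "left side") and the second holds the low 64 bits (the "right side").

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.nio.ByteBuffer;

final class Ipv6TwoLongsSketch {
    /** Returns {high 64 bits, low 64 bits} of an IPv6 literal (pass literals only, not host names). */
    static long[] split(String literal) {
        byte[] bytes;
        try {
            bytes = InetAddress.getByName(literal).getAddress();
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("not an IP literal: " + literal, e);
        }
        if (bytes.length != 16) {
            throw new IllegalArgumentException("not an IPv6 address: " + literal);
        }
        ByteBuffer buf = ByteBuffer.wrap(bytes); // ByteBuffer is big-endian by default
        long high = buf.getLong(); // first 8 bytes: the "left side"
        long low = buf.getLong();  // last 8 bytes: the "right side"
        return new long[] { high, low };
    }
}
```

Stored as two columns, a group-by on the high half needs no per-row manipulation. Note that Java longs are signed, so range comparisons on either half would need unsigned semantics (for example `Long.compareUnsigned`).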

Re: storage structure - querying directories - sanity check and UDF assistance

2015-07-24 Thread Stefán Baxter
Hi, I have this running now: "select occurred_at, dir0, dir1, dir2 from dfs.tmp.`/analytics/processed/test/events` as t where dir0 = dirInRange(cast('2015-04-10' as timestamp),cast('2015-07-11' as timestamp),COALESCE(dir0,'-'),COALESCE(dir1,'-'),COALESCE(dir2,'-')) order by occurred_at;" Observ
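
The body of `dirInRange` is not shown in this thread, so the following is only a guess at the kind of logic such a function might contain, written as plain Java rather than as a Drill UDF. It assumes dir0/dir1/dir2 are year/month/day directories, that "-" marks a missing level (matching the COALESCE calls in the query), and that the function returns dir0 on a match and a sentinel otherwise so that `dir0 = dirInRange(...)` acts as a filter. All names here are hypothetical, and dates stand in for the timestamps used in the query.

```java
import java.time.LocalDate;
import java.time.YearMonth;

final class DirInRangeSketch {
    static final String MISSING = "-";   // placeholder produced by COALESCE in the query
    static final String NO_MATCH = "!";  // assumed sentinel that never equals a real dir0

    /** Returns dir0 if the directory's date span overlaps [start, end], else NO_MATCH. */
    static String dirInRange(LocalDate start, LocalDate end,
                             String dir0, String dir1, String dir2) {
        if (MISSING.equals(dir0)) {
            return NO_MATCH;
        }
        int year = Integer.parseInt(dir0);
        LocalDate first;
        LocalDate last;
        if (MISSING.equals(dir1)) {
            // Only the year is known: the directory may cover the whole year.
            first = LocalDate.of(year, 1, 1);
            last = LocalDate.of(year, 12, 31);
        } else {
            YearMonth month = YearMonth.of(year, Integer.parseInt(dir1));
            if (MISSING.equals(dir2)) {
                first = month.atDay(1);
                last = month.atEndOfMonth();
            } else {
                first = month.atDay(Integer.parseInt(dir2));
                last = first;
            }
        }
        boolean overlaps = !last.isBefore(start) && !first.isAfter(end);
        return overlaps ? dir0 : NO_MATCH;
    }
}
```

Because the result depends only on the directory names and two literals, it is the same for every record in a directory, which is why evaluating it once per directory rather than once per record matters.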

storage structure - querying directories - sanity check and UDF assistance

2015-07-24 Thread Stefán Baxter
Hi, I would like to share our intentions for organizing our data and how we plan to construct queries for it. There are four main reasons for sharing this: a) I would like to sanity check the approach b) I'm having a hard time writing a UDF to optimize this and need a bit of help. c) This can p

IPv6 in Drill/Parquet

2015-07-24 Thread Stefán Baxter
Hi, Does anyone here have opinions/ideas on how IPv6 addresses might be stored efficiently in Parquet via Drill? The Java BigInteger class handles the 128-bit variant, but the BigIntHolder in Drill relies on a long. Storing it in two longs is not optimal and it would surprise me if the variable binary field