Re: [DISCUSS] Buffer Layouts and Dictionary Vectors

2017-11-08 Thread Wes McKinney
The dictionary batches simply wrap a record batch with one “column”. There should be no code difference (e.g. buffer layouts are the same) between the code handling the data in a dictionary and in a normal record batch. In general, a dictionary may contain a null. On Wed, Nov 8, 2017 at 4:05 PM
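A minimal sketch of the point made in the message above, written in plain Python rather than against any Arrow library API. The batch model (a mapping from column name to a list of values), the column name `values`, and the helper `decode` are all hypothetical and exist only for illustration. It shows that a dictionary can be treated as a one-column batch like any other, and that one of its entries may be null.

```python
# Hypothetical model: a "record batch" is a mapping of column name -> list of values.
# Under that model, a dictionary batch is a record batch with exactly one column.
dictionary_batch = {"values": ["apple", None, "cherry"]}  # the dictionary may contain a null

# A dictionary-encoded data column stores integer indices into the dictionary.
indices = [0, 2, 1, 0]


def decode(indices, dictionary_batch):
    """Replace each index with the dictionary entry it points at."""
    (column,) = dictionary_batch.values()  # unpacking fails unless there is exactly one column
    return [column[i] for i in indices]


decoded = decode(indices, dictionary_batch)
print(decoded)
```

Index 1 points at the null entry, so the decoded column carries a null in that position without any special handling.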

Re: [DISCUSS] Expanding Arrow interval type metadata, changing Java memory representation

2017-11-08 Thread Jacques Nadeau
My previous analysis (if I recall, I did it on the similar Parquet PR) was basically that no system truly supported N fields for all operations (Postgres was closest, I believe). Some basic operations would maintain them but they would quickly not behave differently (e.g. 24 hours is

Re: [DISCUSS] Buffer Layouts and Dictionary Vectors

2017-11-08 Thread Brian Hulette
Agreed, that sounds like a great solution to this problem - the layout information is redundant and it doesn't make sense to include it in every schema. Although I would argue we should write down exactly what buffers are supposed to go on the wire in the dictionary batches (i.e. value

Re: [DISCUSS] Expanding Arrow interval type metadata, changing Java memory representation

2017-11-08 Thread Julian Hyde
I have argued before on this list, and still believe, that you should represent an interval as you would a number. If intervals are 64 bit signed, then sure, use the 64 bit integer representation; if you were to allow intervals with fixed precision and scale, then use the same representation as

Re: [DISCUSS] Expanding Arrow interval type metadata, changing Java memory representation

2017-11-08 Thread Wes McKinney
Makes sense. The key question is whether the data is represented as a single 64-bit integer or as effectively a C struct: struct { int32_t days; int32_t milliseconds; }. The struct representation cannot accommodate higher resolution units like microseconds and nanoseconds. From my perspective,
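The two candidate representations can be compared with a small sketch using Python's standard `struct` module. The little-endian byte order, the helper names, and the choice of milliseconds or microseconds as the unit of the 64-bit form are assumptions made for illustration; the message does not specify them.

```python
import struct

MS_PER_HOUR = 60 * 60 * 1000
MS_PER_DAY = 24 * MS_PER_HOUR


def pack_as_struct(days, milliseconds):
    """Two int32 fields, mirroring struct { int32_t days; int32_t milliseconds; }."""
    return struct.pack("<ii", days, milliseconds)  # little-endian is an assumption


def pack_as_int64(total_units):
    """One signed 64-bit count of a single unit (the unit itself is an assumption)."""
    return struct.pack("<q", total_units)


# The same interval, 2 days and 12 hours, in both forms.
as_struct = pack_as_struct(2, 12 * MS_PER_HOUR)
as_int64_ms = pack_as_int64(2 * MS_PER_DAY + 12 * MS_PER_HOUR)

# With a finer unit the 64-bit form can also carry one extra microsecond,
# which the milliseconds field of the struct form has no way to express.
as_int64_us = pack_as_int64((2 * MS_PER_DAY + 12 * MS_PER_HOUR) * 1000 + 1)

print(len(as_struct), len(as_int64_ms), len(as_int64_us))
```

Both forms occupy eight bytes per value; the difference is only in how those bytes are interpreted.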

Re: [DISCUSS] Expanding Arrow interval type metadata, changing Java memory representation

2017-11-08 Thread Julian Hyde
I don't know many examples of interval being used in the real world. But here's the kind of thing: the policy is that an offer is open for 60 hours, so if the offer is made to a particular customer at 12:34pm on Sunday, you want to compute that it ends at 12:34am on Wednesday. The interval "60
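The arithmetic in this example can be checked with Python's standard `datetime` module. The calendar date is an assumption: the message names only the weekday and time, so 2017-11-05 (a Sunday) is used here as a stand-in.

```python
from datetime import datetime, timedelta

# Hypothetical date: any Sunday works; the message gives only "12:34pm on Sunday".
offer_made = datetime(2017, 11, 5, 12, 34)

# The policy from the message: the offer stays open for 60 hours.
offer_ends = offer_made + timedelta(hours=60)

print(offer_made.strftime("%A %H:%M"), "->", offer_ends.strftime("%A %H:%M"))
```

Sixty hours is two days and twelve hours, so the offer ends at 00:34 on the following Wednesday, matching the 12:34am stated in the message.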

Re: Faster PySpark UDFs using Apache Arrow in Spark 2.3.0

2017-11-08 Thread Jacques Nadeau
Totally awesome. Nice job Li and everyone else!
On Mon, Oct 30, 2017 at 2:22 PM, Phillip Cloud wrote:
> Congrats Li! This is awesome.
>
> On Mon, Oct 30, 2017 at 2:05 PM Wes McKinney wrote:
> > hi all,
> >
> > One of our newest committers, Li Jin, has

Re: [DISCUSS] Buffer Layouts and Dictionary Vectors

2017-11-08 Thread Wes McKinney
Per Jacques' comment in ARROW-1693 https://issues.apache.org/jira/browse/ARROW-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16244812#comment-16244812, I think we should remove the buffer layout from the metadata. It would be a good idea to do this for 0.8.0 since

Re: [DISCUSS] Expanding Arrow interval type metadata, changing Java memory representation

2017-11-08 Thread Wes McKinney
Pleading ignorance on use of the SQL interval type, my prior would be that many algorithms would first convert the interval components into an absolute timedelta. Is that not the case? My preference right now would be to have a single Interval type, where the DAY_TIME type actually contains an
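The conversion asked about here, from day and time components into one absolute duration, might look like the following sketch. The function name and the (days, milliseconds) component pair are assumptions for illustration, not an Arrow API.

```python
from datetime import timedelta


def day_time_to_timedelta(days, milliseconds):
    """Collapse hypothetical (days, milliseconds) components into one absolute duration."""
    return timedelta(days=days, milliseconds=milliseconds)


delta = day_time_to_timedelta(2, 43_200_000)

# Dividing by a one-millisecond timedelta yields the duration as a single integer count.
total_ms = delta // timedelta(milliseconds=1)

print(delta, total_ms)
```

Once collapsed this way, an algorithm works with a single number and no longer sees the separate components.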

Re: Arrow sync today

2017-11-08 Thread Jacques Nadeau
We spent a bunch of time trying to figure it out and as far as I can tell, there is no way of supporting a larger number of users AND accepting all entrants.
On Fri, Nov 3, 2017 at 2:58 PM, Wes McKinney wrote:
> @Jacques, is there a way the meeting can be configured so that

Re: [DISCUSS] Expanding Arrow interval type metadata, changing Java memory representation

2017-11-08 Thread Jacques Nadeau
I'm all for moving interval to the new definition. I think we should avoid introducing a timedelta type until it is really important. We need several users demanding a type before we should implement it. Otherwise, we have huge amounts of type bloat (which means nothing will fully implement the

[jira] [Created] (ARROW-1781) [CI] OSX Builds on Travis-CI time out often

2017-11-08 Thread Phillip Cloud (JIRA)
Phillip Cloud created ARROW-1781:
Summary: [CI] OSX Builds on Travis-CI time out often
Key: ARROW-1781
URL: https://issues.apache.org/jira/browse/ARROW-1781
Project: Apache Arrow
Issue Type:

[jira] [Created] (ARROW-1780) JDBC Adapter for Apache Arrow

2017-11-08 Thread Atul Dambalkar (JIRA)
Atul Dambalkar created ARROW-1780:
Summary: JDBC Adapter for Apache Arrow
Key: ARROW-1780
URL: https://issues.apache.org/jira/browse/ARROW-1780
Project: Apache Arrow
Issue Type: New Feature