Re: [RESULT] [VOTE] Alter Arrow binary protocol to address 8-byte Flatbuffer alignment requirements (2nd vote)

2019-08-27 Thread Wes McKinney
Here's the C++ changes

https://github.com/apache/arrow/pull/5211

I'm going to create an integration branch where we can merge each patch
before merging to master

On Fri, Aug 23, 2019 at 9:03 AM Wes McKinney  wrote:
>
> It isn't implemented in C++ yet but I will try to get a patch up for
> that soon (today maybe). I think we should create a branch where we
> can stack the patches that implement this for each language.
>
> On Fri, Aug 23, 2019 at 4:04 AM Paul Taylor  wrote:
> >
> > I'll do the JS updates. Is it safe to validate against the Arrow C++
> > integration tests?
> >
> >
> > On 8/22/19 7:28 PM, Micah Kornfield wrote:
> > > I created https://issues.apache.org/jira/browse/ARROW-6313 as a tracking
> > > issue with sub-issues on the development work.  So far no one has claimed
> > > the Java and JavaScript tasks.
> > >
> > > Would it make sense to have a separate dev branch for this work?
> > >
> > > Thanks,
> > > Micah
> > >
> > > On Thu, Aug 22, 2019 at 3:24 PM Wes McKinney  wrote:
> > >
> > >> The vote carries with 4 binding +1 votes and 1 non-binding +1
> > >>
> > >> I'll merge the specification patch later today and we can begin
> > >> working on implementations so we can get this done for 0.15.0
> > >>
> > >> On Tue, Aug 20, 2019 at 12:30 PM Bryan Cutler  wrote:
> > >>> +1 (non-binding)
> > >>>
> > >>> On Tue, Aug 20, 2019, 7:43 AM Antoine Pitrou wrote:
> >  Sorry, had forgotten to send my vote on this.
> > 
> >  +1 from me.
> > 
> >  Regards
> > 
> >  Antoine.
> > 
> > 
> >  On Wed, 14 Aug 2019 17:42:33 -0500
> >  Wes McKinney  wrote:
> > > hi all,
> > >
> > > As we've been discussing [1], there is a need to introduce 4 bytes of
> > > padding into the preamble of the "encapsulated IPC message" format to
> > > ensure that the Flatbuffers metadata payload begins on an 8-byte
> > > aligned memory offset. The alternative to this would be for Arrow
> > > implementations where alignment is important (e.g. C or C++) to copy
> > > the metadata (which is not always small) into memory when it is
> > > unaligned.
> > >
> > > Micah has proposed to address this by adding a
> > > 4-byte "continuation" value at the beginning of the payload
> > > having the value 0xFFFFFFFF. The reason to do it this way is that
> > > old clients will see an invalid length (what is currently the
> > > first 4 bytes of the message -- a 32-bit little endian signed
> > > integer indicating the metadata length) rather than potentially
> > > crashing on a valid length. We also propose to expand the "end of
> > > stream" marker used in the stream and file format from 4 to 8
> > > bytes. This has the additional effect of aligning the file footer
> > > defined in File.fbs.
> > >
> > > This would be a backwards incompatible protocol change, so older Arrow
> > > libraries would not be able to read these new messages. Maintaining
> > > forward compatibility (reading data produced by older libraries) would
> > > be possible as we can reason that a value other than the continuation
> > > value was produced by an older library (and then validate the
> > > Flatbuffer message of course). Arrow implementations could offer a
> > > backward compatibility mode for the sake of old readers if they desire
> > > (this may also assist with testing).
> > >
> > > Additionally with this vote, we want to formally approve the change to
> > > the Arrow "file" format to always write the (new 8-byte) end-of-stream
> > > marker, which enables code that processes Arrow streams to safely read
> > > the file's internal messages as though they were a normal stream.
> > >
> > > The PR making these changes to the IPC documentation is here
> > >
> > > https://github.com/apache/arrow/pull/4951
> > >
> > > Please vote to accept these changes. This vote will be open for at
> > > least 72 hours.
> > >
> > > [ ] +1 Adopt these Arrow protocol changes
> > > [ ] +0
> > > [ ] -1 I disagree because...
> > >
> > > Here is my vote: +1
> > >
> > > Thanks,
> > > Wes
> > >
> > > [1]: https://lists.apache.org/thread.html/8440be572c49b7b2ffb76b63e6d935ada9efd9c1c2021369b6d27786@%3Cdev.arrow.apache.org%3E
> > 
> > 
> > 
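
For reference, a minimal sketch of how a reader can distinguish the new
framing from the old one, following the forward-compatibility reasoning in
the quoted vote (Python, illustrative only; the function name and structure
are assumptions, while the 0xFFFFFFFF marker value is from the spec change):

import struct

CONTINUATION = 0xFFFFFFFF  # reads as -1 when taken as a signed int32

def read_metadata_length(stream):
    """Return the Flatbuffer metadata length of the next IPC message, or
    None at end-of-stream; handles both the old and the new framing."""
    head = stream.read(4)
    if struct.unpack('<I', head)[0] == CONTINUATION:
        # New framing: marker first, then the real length. The extra 4
        # bytes put the Flatbuffer metadata on an 8-byte aligned offset.
        length = struct.unpack('<i', stream.read(4))[0]
    else:
        # Anything else must be a metadata length written by an older library.
        length = struct.unpack('<i', head)[0]
    return None if length == 0 else length

An old reader hitting the marker sees a length of -1, which is invalid, so
it fails cleanly instead of misreading the stream.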


[jira] [Created] (ARROW-6372) [Rust][DataFusion] Predicate push down optimization can break query plan

2019-08-27 Thread Paddy Horan (Jira)
Paddy Horan created ARROW-6372:
--

 Summary: [Rust][DataFusion] Predicate push down optimization can 
break query plan
 Key: ARROW-6372
 URL: https://issues.apache.org/jira/browse/ARROW-6372
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust - DataFusion
Affects Versions: 0.14.1
Reporter: Paddy Horan
 Fix For: 0.15.0


The following code reproduces the issue:

[https://gist.github.com/paddyhoran/598db6cbb790fc5497320613e54a02c6]

If you disable the predicate push down optimization it works fine.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


AppVeyor re-build by committers is enabled

2019-08-27 Thread Sutou Kouhei
Hi,

I asked INFRA to add the "arrow committers" GitHub team
to the teams of the "ApacheSoftwareFoundation" AppVeyor account:

  https://issues.apache.org/jira/browse/INFRA-18930

Now, Apache Arrow committers can re-build AppVeyor jobs from
the AppVeyor Web UI. AppVeyor jobs sometimes fail due to network
trouble, so re-building will be useful in those cases.

Note that we need to sign out and sign in again for
this change to take effect.


Thanks,
--
kou


Re: [DISCUSS][Format][C++] Improvement of sparse tensor format and implementation

2019-08-27 Thread Rok Mihevc
On Wed, Aug 28, 2019 at 1:18 AM Wes McKinney  wrote:

> null/NA. But, as far as I'm aware, this component of pandas is
> relatively unique and was never intended as an alternative to sparse
> matrix libraries.
>

Another example is
https://sparse.pydata.org/en/latest/generated/sparse.SparseArray.html?highlight=fill%20value#sparse.SparseArray.fill_value,
but it might have been influenced by Pandas.

I'm ok with dropping this for now.


Re: [DISCUSS][Format][C++] Improvement of sparse tensor format and implementation

2019-08-27 Thread Wes McKinney
On Tue, Aug 27, 2019 at 6:07 PM Neal Richardson
 wrote:
>
> Forgive me if this is off topic; I haven't been following this closely
> and I haven't used scipy.sparse. But there are some very reasonable
> cases where you might want to fill sparse data with a value other than
> 0:
>
> * The sparseness is missing data, and 0 is not the same as NA
> * Better compression: figure out which value is most common in the
> data and make that the default that gets filled. E.g. how many fingers
> a person has.
>

Definitely. I am the original author of pandas's Sparse* family of
types, and they were created for the case where the data is mostly
null/NA. But, as far as I'm aware, this component of pandas is
relatively unique and was never intended as an alternative to sparse
matrix libraries.

It seems like the sparse-with-fill-value case might be better discussed on
Micah's thread regarding Array compression and encoding.
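
For the archives, a minimal illustration of the fill-value idea in pandas
(a sketch assuming the 0.25-era pd.SparseArray API):

import numpy as np
import pandas as pd

# Mostly-null case: with fill_value=NaN only the non-null values are stored.
arr = pd.SparseArray([1.0, np.nan, np.nan, np.nan, 2.0], fill_value=np.nan)
print(arr.density)        # 0.4 -- two stored values out of five

# The compression case raised below: fill with the most common value, not 0.
fingers = pd.SparseArray([10, 10, 9, 10, 10], fill_value=10)
print(fingers.sp_values)  # [9] -- only the deviation is physically stored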

> Neal
>
> On Tue, Aug 27, 2019 at 3:46 PM Rok Mihevc  wrote:
> >
> > On Tue, Aug 27, 2019 at 11:05 PM Wes McKinney  wrote:
> >
> > > I don't think this has been discussed. I think the SparseTensor
> > > discussions have been intended to reach compatibility with "sparse
> > > matrix" projects like scipy.sparse. pandas's "SparseArray" objects are
> > > a distinct thing -- I don't know many examples of sparse matrices with
> > > fill values other than 0
> >
> >
> > The reason for implementing fill_value would be the case where a user wants
> > another value to be '0' and it's practical for the SparseTensor object to
> > keep that value for them. I am not sure how common such a case would be, and
> > since scipy.sparse is time-tested I'd agree with compatibility as the
> > current goal.


Re: [DISCUSS][Format][C++] Improvement of sparse tensor format and implementation

2019-08-27 Thread Neal Richardson
Forgive me if this is off topic; I haven't been following this closely
and I haven't used scipy.sparse. But there are some very reasonable
cases where you might want to fill sparse data with a value other than
0:

* The sparseness is missing data, and 0 is not the same as NA
* Better compression: figure out which value is most common in the
data and make that the default that gets filled. E.g. how many fingers
a person has.

Neal

On Tue, Aug 27, 2019 at 3:46 PM Rok Mihevc  wrote:
>
> On Tue, Aug 27, 2019 at 11:05 PM Wes McKinney  wrote:
>
> > I don't think this has been discussed. I think the SparseTensor
> > discussions have been intended to reach compatibility with "sparse
> > matrix" projects like scipy.sparse. pandas's "SparseArray" objects are
> > a distinct thing -- I don't know many examples of sparse matrices with
> > fill values other than 0
>
>
> The reason for implementing fill_value would be the case where a user wants
> another value to be '0' and it's practical for the SparseTensor object to
> keep that value for them. I am not sure how common such a case would be, and
> since scipy.sparse is time-tested I'd agree with compatibility as the
> current goal.


Re: [Format] Semantics for dictionary batches in streams

2019-08-27 Thread Micah Kornfield
I was thinking the file format must satisfy one of two conditions:
1.  Exactly one DictionaryBatch per encoded column
2.  DictionaryBatches are interleaved correctly.
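
Concretely, either of the following (hypothetical) message orders would be
valid under these conditions -- the file Footer indexes all dictionary and
record batch blocks, so position within the file is otherwise unconstrained:

  Condition 1: Schema, RecordBatch, DictionaryBatch(id=0), RecordBatch, Footer
               (exactly one DictionaryBatch per encoded column, anywhere)
  Condition 2: Schema, DictionaryBatch(id=0), RecordBatch,
               DictionaryBatch(id=0, replacement), RecordBatch, Footer
               (each replacement written before the first batch that uses it)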

On Tuesday, August 27, 2019, Wes McKinney  wrote:

> On Tue, Aug 27, 2019 at 3:55 PM Antoine Pitrou  wrote:
> >
> >
> > Le 27/08/2019 à 22:31, Wes McKinney a écrit :
> > > So the current situation we have right now in C++ is that if we tried
> > > to create an IPC stream from a sequence of record batches that don't
> > > all have the same dictionary, we'd run into two scenarios:
> > >
> > > * Batches that either have a prefix of a prior-observed dictionary, or
> > > the prior dictionary is a prefix of their dictionary. For example,
> > > suppose that the dictionary sent for an id was ['A', 'B', 'C'] and
> > > then there's a subsequent batch with ['A', 'B', 'C', 'D', 'E']. In
> > > such a case we could compute and send a delta batch
> > >
> > > * Batches with a dictionary that is a permutation of values, and
> > > possibly new unique values.
> > >
> > > In this latter case, without the option of replacing an existing ID in
> > > the stream, we would have to do a unification / permutation of indices
> > > and then also possibly send a delta batch. We should probably have
> > > code at some point that deals with both cases, but in the meantime I
> > > would like to allow dictionaries to be redefined in this case. Seems
> > > like we might need a vote to formalize this?
> >
> > Isn't the stream format deviating from the file format then?  In the
> > file format, IIUC, dictionaries can appear after the respective record
> > batches, so there's no way to tell whether the original or redefined
> > version of a dictionary is being referred to.
>
> You make a good point -- we can consider changes to the file format to
> allow for record batches to have different dictionaries. Even handling
> delta dictionaries with the current file format would be a bit tedious
> (though not indeterminate)
>
> > Regards
> >
> > Antoine.
>


Re: [DISCUSS][Format][C++] Improvement of sparse tensor format and implementation

2019-08-27 Thread Rok Mihevc
On Tue, Aug 27, 2019 at 11:05 PM Wes McKinney  wrote:

> I don't think this has been discussed. I think the SparseTensor
> discussions have been intended to reach compatibility with "sparse
> matrix" projects like scipy.sparse. pandas's "SparseArray" objects are
> a distinct thing -- I don't know many examples of sparse matrices with
> fill values other than 0


The reason for implementing fill_value would be the case where a user wants
another value to be '0' and it's practical for the SparseTensor object to
keep that value for them. I am not sure how common such a case would be, and
since scipy.sparse is time-tested I'd agree with compatibility as the
current goal.


Re: [Format] Semantics for dictionary batches in streams

2019-08-27 Thread Wes McKinney
On Tue, Aug 27, 2019 at 3:55 PM Antoine Pitrou  wrote:
>
>
> Le 27/08/2019 à 22:31, Wes McKinney a écrit :
> > So the current situation we have right now in C++ is that if we tried
> > to create an IPC stream from a sequence of record batches that don't
> > all have the same dictionary, we'd run into two scenarios:
> >
> > * Batches that either have a prefix of a prior-observed dictionary, or
> > the prior dictionary is a prefix of their dictionary. For example,
> > suppose that the dictionary sent for an id was ['A', 'B', 'C'] and
> > then there's a subsequent batch with ['A', 'B', 'C', 'D', 'E']. In
> > such a case we could compute and send a delta batch
> >
> > * Batches with a dictionary that is a permutation of values, and
> > possibly new unique values.
> >
> > In this latter case, without the option of replacing an existing ID in
> > the stream, we would have to do a unification / permutation of indices
> > and then also possibly send a delta batch. We should probably have
> > code at some point that deals with both cases, but in the meantime I
> > would like to allow dictionaries to be redefined in this case. Seems
> > like we might need a vote to formalize this?
>
> Isn't the stream format deviating from the file format then?  In the
> file format, IIUC, dictionaries can appear after the respective record
> batches, so there's no way to tell whether the original or redefined
> version of a dictionary is being referred to.

You make a good point -- we can consider changes to the file format to
allow for record batches to have different dictionaries. Even handling
delta dictionaries with the current file format would be a bit tedious
(though not indeterminate)

> Regards
>
> Antoine.


Re: [DISCUSS][Format][C++] Improvement of sparse tensor format and implementation

2019-08-27 Thread Wes McKinney
I'm also OK with these changes. Since we have not established a
versioning or compatibility policy with regards to "Other" data
structures like Tensor and SparseTensor, I don't know that a vote is
needed, just a pull request.
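
For reference, the CSC layout that the proposed SparseCSCIndex (item (3) in
the quoted proposal below) would describe, illustrated with scipy.sparse --
the stated compatibility target; the snippet itself is just an illustration:

import numpy as np
from scipy import sparse

dense = np.array([[1, 0, 2],
                  [0, 0, 3]])
csc = sparse.csc_matrix(dense)
# CSC keeps one pointer array over columns, plus row indices and values:
print(csc.indptr)   # [0 1 1 3] -- per-column offsets into indices/data
print(csc.indices)  # [0 0 1]   -- row index of each stored value
print(csc.data)     # [1 2 3]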

On Wed, Aug 21, 2019 at 1:11 PM Rok Mihevc  wrote:
>
> Hi,
>
> On Mon, Aug 19, 2019 at 11:30 AM Kenta Murata  wrote:
>
> > (3) Adding SparseCSCIndex
> >
>
> I'd be interested to help with (Python) part of this SparseCSCIndex.
>
>
> I’d appreciate any comments or suggestions.
> >
>
> I missed previous discussion, so this might have already been discussed,
> but did we ever consider adding fill value attribute? So parameter that
> sets the 'zero' value to some arbitrary value instead of zero if we need it.
> See fill_value here
> https://sparse.pydata.org/en/latest/generated/sparse.SparseArray.html?highlight=fill%20value#sparse.SparseArray.fill_value
> and
> here
> https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.SparseArray.html
> .
>

I don't think this has been discussed. I think the SparseTensor
discussions have been intended to reach compatibility with "sparse
matrix" projects like scipy.sparse. pandas's "SparseArray" objects are
a distinct thing -- I don't know many examples of sparse matrices with
fill values other than 0

>
> Best,
> Rok


Re: [Format] Semantics for dictionary batches in streams

2019-08-27 Thread Antoine Pitrou


Le 27/08/2019 à 22:31, Wes McKinney a écrit :
> So the current situation we have right now in C++ is that if we tried
> to create an IPC stream from a sequence of record batches that don't
> all have the same dictionary, we'd run into two scenarios:
> 
> * Batches that either have a prefix of a prior-observed dictionary, or
> the prior dictionary is a prefix of their dictionary. For example,
> suppose that the dictionary sent for an id was ['A', 'B', 'C'] and
> then there's a subsequent batch with ['A', 'B', 'C', 'D', 'E']. In
> such a case we could compute and send a delta batch
> 
> * Batches with a dictionary that is a permutation of values, and
> possibly new unique values.
> 
> In this latter case, without the option of replacing an existing ID in
> the stream, we would have to do a unification / permutation of indices
> and then also possibly send a delta batch. We should probably have
> code at some point that deals with both cases, but in the meantime I
> would like to allow dictionaries to be redefined in this case. Seems
> like we might need a vote to formalize this?

Isn't the stream format deviating from the file format then?  In the
file format, IIUC, dictionaries can appear after the respective record
batches, so there's no way to tell whether the original or redefined
version of a dictionary is being referred to.

Regards

Antoine.


Re: Timeline for 0.15.0 release

2019-08-27 Thread Wes McKinney
hi,

I think we should try to release the week of September 9, so
development work should be completed by end of next week.

Does that seem reasonable?

I plan to get a patch up for the protocol alignment changes for C++ in
the next couple of days -- I think that getting the alignment work
done is the main barrier to releasing.

Thanks
Wes

On Mon, Aug 19, 2019 at 12:25 PM Ji Liu  wrote:
>
> Hi Wes, on the Java side, I can think of several bugs that need to be fixed
> or flagged.
>
> i. ARROW-6040: Dictionary entries are required in IPC streams even when 
> empty[1]
> This one is under review now; however, through this PR we found that there
> seems to be a bug in Java reading and writing dictionaries in IPC which is
> inconsistent with the spec[2], since it assumes all dictionaries are at the start
> of the stream (see details in the PR comments, and this fix may not make it into
> version 0.15). @Micah Kornfield
>
> ii. ARROW-1875: Write 64-bit ints as strings in integration test JSON files[3]
> The Java-side code is already checked in; other implementations seem not to be.
>
> iii. ARROW-6202: OutOfMemory in JdbcAdapter[4]
> Caused by trying to load all records in one contiguous batch; fixed by
> providing an iterator API for iterative reading in ARROW-6219[5].
>
> Thanks,
> Ji Liu
>
> [1] https://github.com/apache/arrow/pull/4960
> [2] https://arrow.apache.org/docs/ipc.html
> [3] https://issues.apache.org/jira/browse/ARROW-1875
> [4] https://issues.apache.org/jira/browse/ARROW-6202
> [5] https://issues.apache.org/jira/browse/ARROW-6219
>
>
>
> --
> From:Wes McKinney 
> Send Time: Monday, August 19, 2019, 23:03
> To:dev 
> Subject:Re: Timeline for 0.15.0 release
>
> I'm going to work on organizing the 0.15.0 backlog some this
> week, if anyone wants to help with grooming (particularly for
> languages other than C++/Python where I'm focusing) that would be
> helpful. There have been almost 500 JIRA issues opened since the
> 0.14.0 release, so we should make sure to check whether there's any
> regressions or other serious bugs that we should try to fix for
> 0.15.0.
>
> On Thu, Aug 15, 2019 at 6:23 PM Wes McKinney  wrote:
> >
> > The Windows wheel issue in 0.14.1 seems to be
> >
> > https://issues.apache.org/jira/browse/ARROW-6015
> >
> > I think the root cause could be the Windows changes in
> >
> > https://github.com/apache/arrow/commit/223ae744cc2a12c60cecb5db593263a03c13f85a
> >
> > I would be appreciative if a volunteer would look into what was wrong
> > with the 0.14.1 wheels on Windows. Otherwise 0.15.0 Windows wheels
> > will be broken, too
> >
> > The bad wheels can be found at
> >
> > https://bintray.com/apache/arrow/python#files/python%2F0.14.1
> >
> > On Thu, Aug 15, 2019 at 1:28 PM Antoine Pitrou  wrote:
> > >
> > > On Thu, 15 Aug 2019 11:17:07 -0700
> > > Micah Kornfield  wrote:
> > > > >
> > > > > In C++ they are
> > > > > independent, we could have 32-bit array lengths and variable-length
> > > > > types with 64-bit offsets if we wanted (we just wouldn't be able to
> > > > > have a List child with more than INT32_MAX elements).
> > > >
> > > > I think the point is we could do this in C++ but we don't.  I'm not sure
> > > > we would have introduced the "Large" types if we did.
> > >
> > > 64-bit offsets take twice as much space as 32-bit offsets, so if you're
> > > storing lots of small-ish lists or strings, 32-bit offsets are
> > > preferrable.  So even with 64-bit array lengths from the start it would
> > > still be beneficial to have types with 32-bit offsets.
> > >
> > > > Going with the limited address space in Java and calling it a reference
> > > > implementation seems suboptimal. If a consumer uses a "Large" type
> > > > presumably it is because they need the ability to store more than 
> > > > INT32_MAX
> > > > child elements in a column, otherwise it is just wasting space [1].
> > >
> > > Probably. Though if the individual elements (lists or strings) are
> > > large, not much space is wasted in proportion, so it may be simpler in
> > > such a case to always create a "Large" type array.
> > >
> > > > [1] I suppose theoretically there might be some performance benefits on
> > > > 64-bit architectures to using the native word sizes.
> > >
> > > Concretely, common 64-bit architectures don't do that, as 32-bit is an
> > > extremely common integer size even in high-performance code.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
>


Re: [Format] Semantics for dictionary batches in streams

2019-08-27 Thread Wes McKinney
So the current situation we have right now in C++ is that if we tried
to create an IPC stream from a sequence of record batches that don't
all have the same dictionary, we'd run into two scenarios:

* Batches that either have a prefix of a prior-observed dictionary, or
the prior dictionary is a prefix of their dictionary. For example,
suppose that the dictionary sent for an id was ['A', 'B', 'C'] and
then there's a subsequent batch with ['A', 'B', 'C', 'D', 'E']. In
such a case we could compute and send a delta batch

* Batches with a dictionary that is a permutation of values, and
possibly new unique values.

In this latter case, without the option of replacing an existing ID in
the stream, we would have to do a unification / permutation of indices
and then also possibly send a delta batch. We should probably have
code at some point that deals with both cases, but in the meantime I
would like to allow dictionaries to be redefined in this case. Seems
like we might need a vote to formalize this?
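
To make the two scenarios concrete, a hypothetical Python sketch (not the
C++ API -- the function and its name are invented for illustration):

def classify_dictionary_update(old, new):
    """Illustrative only: can `new` be sent as a delta batch on top of
    `old`, or does it force a replacement?"""
    if new[:len(old)] == old:
        return ('delta', new[len(old):])
    # A permutation (and/or new values): indices already written against
    # `old` no longer line up, so replace -- or unify and remap indices.
    return ('replacement', new)

print(classify_dictionary_update(['A', 'B', 'C'], ['A', 'B', 'C', 'D', 'E']))
# -> ('delta', ['D', 'E'])
print(classify_dictionary_update(['A', 'B', 'C'], ['B', 'C', 'A']))
# -> ('replacement', ['B', 'C', 'A'])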

Independent from this decision, I would strongly recommend that all
implementations handle dictionaries in-memory as data and not metadata
(i.e. do not have dictionaries in the schema). It was lucky (see
ARROW-3144) that this problematic early design in the C++ library
could be fixed with less than a week of work.

Thanks
Wes

On Sun, Aug 11, 2019 at 9:17 PM Micah Kornfield  wrote:
>
> I'm not sure what you mean by record-in-dictionary-id, so it is possible
> this is a solution that I just don't understand :)
>
> The only two references to dictionary IDs that I could find are: one in
> Schema.fbs [1], which is attached to a column in a schema, and the one
> referenced above in the DictionaryBatch definition in Message.fbs [2] for
> transmitting dictionaries.  It is quite possible I missed something.
>
>  The indices into the dictionary are Int Arrays in a normal record batch.
> I suppose the other option is to reset the stream by sending a new schema,
> but I don't think that is supported either. This is what led to my
> original question.
>
> Does no one do this today?
>
> I think Wes did some recent work on the C++ Parquet in reading
> dictionaries, and might have faced some of these issues; I'm not sure how
> he dealt with it (haven't gotten back to the Parquet code yet).
>
> [1] https://github.com/apache/arrow/blob/master/format/Schema.fbs#L271
> [2] https://github.com/apache/arrow/blob/master/format/Message.fbs#L72
>
> On Sun, Aug 11, 2019 at 6:32 PM Jacques Nadeau  wrote:
>
> > Wow, you've shown how little I've thought about Arrow dictionaries for a
> > while. I thought we had a dictionary id and a record-in-dictionary-id.
> > Wouldn't that approach make more sense? Does no one do this today? (We
> > frequently use compound values for this type of scenario...)
> >
> > On Sat, Aug 10, 2019 at 4:20 PM Micah Kornfield wrote:
> >
> >> Reading data from two different parquet files sequentially with different
> >> dictionaries for the same column.  This could be handled by re-encoding
> >> data but that seems potentially sub-optimal.
> >>
> >>> On Sat, Aug 10, 2019 at 12:38 PM Jacques Nadeau wrote:
> >>
> >>> What situation are you anticipating where you're going to be restating ids
> >>> mid stream?
> >>>
> >>> On Sat, Aug 10, 2019 at 12:13 AM Micah Kornfield wrote:
> >>>
>  The IPC specification [1] defines behavior when isDelta on a
>  DictionaryBatch [2] is "true".  I might have missed it in the
>  specification, but I couldn't find the interpretation for what the expected
>  behavior is when isDelta=false and two dictionary batches with the
>  same ID are sent.
> 
>  It seems like there are two options:
>  1.  Interpret the new dictionary batch as replacing the old one.
>  2.  Regard this as an error condition.
> 
>  Based on the fact that in the "file format" dictionaries are allowed to be
>  placed in any order relative to the record batches, I assume it is the
>  second, but just wanted to make sure.
> 
>  Thanks,
>  Micah
> 
>  [1] https://arrow.apache.org/docs/ipc.html
>  [2] https://github.com/apache/arrow/blob/master/format/Message.fbs#L72
> 
> >>>


[jira] [Created] (ARROW-6371) [Doc] Row to columnar conversion example mentions arrow::Column in comments

2019-08-27 Thread Omer Ozarslan (Jira)
Omer Ozarslan created ARROW-6371:


 Summary: [Doc] Row to columnar conversion example mentions 
arrow::Column in comments
 Key: ARROW-6371
 URL: https://issues.apache.org/jira/browse/ARROW-6371
 Project: Apache Arrow
  Issue Type: Bug
  Components: Documentation
Reporter: Omer Ozarslan


https://arrow.apache.org/docs/cpp/examples/row_columnar_conversion.html

{code:cpp}
// The final representation should be an `arrow::Table` which in turn is made up of
// an `arrow::Schema` and a list of `arrow::Column`. An `arrow::Column` is again a
// named collection of one or more `arrow::Array` instances. As the first step, we
// will iterate over the data and build up the arrays incrementally.
{code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6370) [JS] Table.from adds 0 on int columns

2019-08-27 Thread Sascha Hofmann (Jira)
Sascha Hofmann created ARROW-6370:
-

 Summary: [JS] Table.from adds 0 on int columns
 Key: ARROW-6370
 URL: https://issues.apache.org/jira/browse/ARROW-6370
 Project: Apache Arrow
  Issue Type: Bug
  Components: JavaScript
Affects Versions: 0.14.1
Reporter: Sascha Hofmann


I am generating an Arrow table in pyarrow and sending it via gRPC like this:
{code:python}
sink = pa.BufferOutputStream()
writer = pa.RecordBatchStreamWriter(sink, batch.schema)
writer.write_batch(batch)
writer.close()
yield ds.Response(
    status=200,
    loading=False,
    response=[sink.getvalue().to_pybytes()]
)
{code}
On the JavaScript end, I parse it like this:
{code:java}
 Table.from(response.getResponseList()[0])
{code}
That works, but when I look at the actual table, int columns have a 0 for every 
other row. String columns seem to be parsed just fine. 

The Python byte array created from to_pybytes() has the same length as received 
in JavaScript. I am also able to recreate the original table from the byte array 
in Python. 
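
For reference, a round-trip check on the Python side might look like the
sketch below (a hedged example assuming the pyarrow 0.14-era API; the column
name and values are made up):

{code:python}
import pyarrow as pa

batch = pa.RecordBatch.from_arrays([pa.array([1, 2, 3], type=pa.int64())], ['x'])
sink = pa.BufferOutputStream()
writer = pa.RecordBatchStreamWriter(sink, batch.schema)
writer.write_batch(batch)
writer.close()

# Round-trip the raw bytes exactly as the JS client receives them.
buf = sink.getvalue().to_pybytes()
table = pa.ipc.open_stream(pa.py_buffer(buf)).read_all()
print(table.to_pydict())  # {'x': [1, 2, 3]} -- ints intact on the Python side
{code}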



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


Re: [Discuss] Support read/write interleaved dictionaries and batches in IPC stream

2019-08-27 Thread Wes McKinney
hi all,

I agree with handling the interleaved dictionary case. While we are at
it, I think we should formally allow dictionary replacements.

Note that I just opened

https://github.com/apache/arrow/pull/5202

to revamp / consolidate the format documents. So any changes will need
to be based on that.

One problem with our integration testing strategy -- the JSON files
don't exactly simulate a stream, they are more like the current File
format. If we want to properly generate integration test cases, we
should also slightly generalize the integration test JSON format to
contain a collection of messages like

[
  {"message_type": "Schema", ... },
  {"message_type": "DictionaryBatch", ... },
  {"message_type": "RecordBatch", ... },
  {"message_type": "RecordBatch", ... },
  {"message_type": "DictionaryBatch", ... },
  ...
]

I don't know that it will be all that difficult to make this change
but we would have 4 implementations that have to get fixed up.

- Wes

On Wed, Aug 21, 2019 at 10:58 PM Micah Kornfield  wrote:
>
> Hi Ji Liu,
> Thanks for getting the conversation started.  I think a few things need to
> happen:
> 1.  We need to clarify in the specification that not all dictionaries need
> to be present at the beginning.  I plan on creating a PR for discussion
> that clarifies this point, as well as handling of non-delta dictionary
> batches discussed earlier [1].
> 2.  Java needs to support both of these once we vote on the
> clarification. One way of doing this, is what I think Jacques alluded to in
> [1].  For the in memory representation of dictionary encoding we will need
> to not only track a notion of "dictionary" id for the Vector, but also an
> addition
> 3.  Lastly, we should have integration tests around these cases to make
> sure they are handled consistently.
>
> Thanks,
> Micah
>
> [1]
> https://lists.apache.org/thread.html/9734b71bc12aca16eb997388e95105bff412fdaefa4e19422f477389@%3Cdev.arrow.apache.org%3E
>
> On Wed, Aug 21, 2019 at 7:48 AM Ji Liu  wrote:
>
> > Hi all,
> >
> > Recently, when we worked on fixing an IPC-related bug on both the Java and C++
> > sides[1][2],
> > @emkornfield found that the stream reader assumes that all dictionaries
> > are at the start of the stream, which is inconsistent with the spec[3],
> > which says that as long as a record batch doesn't reference a dictionary they
> > can be interleaved.
> >
> >
> > The cases below should be supported; however, they will crash in current
> > implementations.
> > i. have a record batch of one dictionary encoded column S
> >  1>Schema
> >  2>RecordBatch: S=[null, null, null, null]
> >  3>DictionaryBatch: ['abc', 'efg']
> >  4>Recordbatch: S=[0, 1, 0, 1]
> > ii. have a record batch of two dictionary encoded column S1, S2
> >  1>Schema
> >  2->DictionaryBatch S1: ['ab', 'cd']
> >  3->RecordBatch: S1 = [0,1,0,1] S2 =[null, null, null,]
> >  4->DictionaryBatch S2: ['cc', 'dd']
> >  5->RecordBatch: S1 = [0,1,0,1] S2 =[0,1,0,1]
> >
> > We already did some work on the Java side via [1]
> > to make it possible to parse interleaved dictionaries and batches:
> >  i. In ArrowStreamReader, do not read all dictionaries at the start.
> >  ii. When loadNextBatch is called, read the message to decide whether to
> > read dictionaries first or directly read a batch; if the former, read all
> > dictionaries out before this batch.
> >
> >  iii. When we read a batch, check whether the dictionaries it needs have
> > already been read; if not, check whether it is an all-null column and
> > decide whether an exception needs to be thrown.
> >
> > In this way, whether they are interleaved or not, we can parse the stream properly.
> >
> > In the future, I think we should also support writing interleaved
> > dictionaries and batches in the IPC stream (created an
> > issue to track this[4]), but it's not quite clear how to implement this.
> > Any opinions about this are appreciated, thanks!
> >
> > Thanks,
> > Ji Liu
> >
> > [1] https://issues.apache.org/jira/browse/ARROW-6040,
> > [2] https://issues.apache.org/jira/browse/ARROW-6126,
> > [3] http://arrow.apache.org/docs/format/IPC.html#streaming-format,
> > [4] https://issues.apache.org/jira/browse/ARROW-6308
> >


[jira] [Created] (ARROW-6369) [Python] Support list-of-boolean in Array.to_pandas conversion

2019-08-27 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6369:
---

 Summary: [Python] Support list-of-boolean in Array.to_pandas 
conversion
 Key: ARROW-6369
 URL: https://issues.apache.org/jira/browse/ARROW-6369
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Wes McKinney
 Fix For: 0.15.0


See

{code}
In [4]: paste
a = pa.array(np.array([[True, False], [True, True, True]]))

## -- End pasted text --

In [5]: a
Out[5]:

[
  [
true,
false
  ],
  [
true,
true,
true
  ]
]

In [6]: a.to_pandas()
---
ArrowNotImplementedError  Traceback (most recent call last)
 in 
> 1 a.to_pandas()

~/code/arrow/python/pyarrow/array.pxi in pyarrow.lib._PandasConvertible.to_pandas()
439 deduplicate_objects=deduplicate_objects)
440 
--> 441 return self._to_pandas(options, categories=categories,
442 ignore_metadata=ignore_metadata)
443 

~/code/arrow/python/pyarrow/array.pxi in pyarrow.lib.Array._to_pandas()
815 
816 with nogil:
--> 817 check_status(ConvertArrayToPandas(c_options, self.sp_array,
818   self, ))
819 return wrap_array_output(out)

~/code/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()
 84 raise ArrowKeyError(message)
 85 elif status.IsNotImplemented():
---> 86 raise ArrowNotImplementedError(message)
 87 elif status.IsTypeError():
 88 raise ArrowTypeError(message)

ArrowNotImplementedError: Not implemented type for lists: bool
In ../src/arrow/python/arrow_to_pandas.cc, line 1910, code: 
VisitTypeInline(*data_->type(), this)
{code}

as reported in https://github.com/apache/arrow/issues/5203



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6368) [C++] Add RecordBatch projection functionality

2019-08-27 Thread Benjamin Kietzman (Jira)
Benjamin Kietzman created ARROW-6368:


 Summary: [C++] Add RecordBatch projection functionality
 Key: ARROW-6368
 URL: https://issues.apache.org/jira/browse/ARROW-6368
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Benjamin Kietzman
Assignee: Benjamin Kietzman


Define classes RecordBatchProjector (which projects from one schema to another, 
augmenting with null/constant columns where necessary) and a subtype of 
RecordBatchIterator which projects each batch yielded by a wrapped iterator.
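
A rough Python sketch of the intended projection behavior (illustrative only:
the function, its name, and the pyarrow-based shape are assumptions; the
ticket itself describes C++ classes):

{code:python}
import pyarrow as pa

def project(batch, target_schema):
    """Project `batch` to `target_schema`, filling fields that are
    absent from the input with all-null columns (hypothetical sketch)."""
    arrays = []
    for field in target_schema:
        i = batch.schema.get_field_index(field.name)
        if i != -1:
            arrays.append(batch.column(i))
        else:
            # Field missing from the input: augment with an all-null column,
            # as the RecordBatchProjector description suggests.
            arrays.append(pa.array([None] * batch.num_rows, type=field.type))
    return pa.RecordBatch.from_arrays(arrays, [f.name for f in target_schema])
{code}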



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6367) [C++][Gandiva] Implement string reverse

2019-08-27 Thread Prudhvi Porandla (Jira)
Prudhvi Porandla created ARROW-6367:
---

 Summary: [C++][Gandiva] Implement string reverse
 Key: ARROW-6367
 URL: https://issues.apache.org/jira/browse/ARROW-6367
 Project: Apache Arrow
  Issue Type: Task
Reporter: Prudhvi Porandla
Assignee: Prudhvi Porandla






--
This message was sent by Atlassian Jira
(v8.3.2#803003)