Re: [RESULT] [VOTE] Alter Arrow binary protocol to address 8-byte Flatbuffer alignment requirements (2nd vote)
Here are the C++ changes https://github.com/apache/arrow/pull/5211 I'm going to create an integration branch where we can merge each patch before merging to master On Fri, Aug 23, 2019 at 9:03 AM Wes McKinney wrote: > > It isn't implemented in C++ yet but I will try to get a patch up for > that soon (today maybe). I think we should create a branch where we > can stack the patches that implement this for each language. > > On Fri, Aug 23, 2019 at 4:04 AM Paul Taylor wrote: > > > > I'll do the JS updates. Is it safe to validate against the Arrow C++ > > integration tests? > > > > > > On 8/22/19 7:28 PM, Micah Kornfield wrote: > > > I created https://issues.apache.org/jira/browse/ARROW-6313 as a tracking > > > issue with sub-issues on the development work. So far no-one has claimed > > > Java and Javascript tasks. > > > > > > Would it make sense to have a separate dev branch for this work? > > > > > > Thanks, > > > Micah > > > > > > On Thu, Aug 22, 2019 at 3:24 PM Wes McKinney wrote: > > > > > >> The vote carries with 4 binding +1 votes and 1 non-binding +1 > > >> > > >> I'll merge the specification patch later today and we can begin > > >> working on implementations so we can get this done for 0.15.0 > > >> > > >> On Tue, Aug 20, 2019 at 12:30 PM Bryan Cutler wrote: > > >>> +1 (non-binding) > > >>> > > >>> On Tue, Aug 20, 2019, 7:43 AM Antoine Pitrou > > >> wrote: > > Sorry, had forgotten to send my vote on this. > > > > +1 from me. > > > > Regards > > > > Antoine. > > > > > > On Wed, 14 Aug 2019 17:42:33 -0500 > > Wes McKinney wrote: > > > hi all, > > > > > > As we've been discussing [1], there is a need to introduce 4 bytes of > > > padding into the preamble of the "encapsulated IPC message" format to > > > ensure that the Flatbuffers metadata payload begins on an 8-byte > > > aligned memory offset. The alternative to this would be for Arrow > > > implementations where alignment is important (e.g. 
C or C++) to copy > > > the metadata (which is not always small) into memory when it is > > > unaligned. > > > > > > Micah has proposed to address this by adding a > > > 4-byte "continuation" value at the beginning of the payload > > > having the value 0xFFFFFFFF. The reason to do it this way is that > > > old clients will see an invalid length (what is currently the > > > first 4 bytes of the message -- a 32-bit little endian signed > > > integer indicating the metadata length) rather than potentially > > > crashing on a valid length. We also propose to expand the "end of > > > stream" marker used in the stream and file format from 4 to 8 > > > bytes. This has the additional effect of aligning the file footer > > > defined in File.fbs. > > > > > > This would be a backwards incompatible protocol change, so older > > >> Arrow > > > libraries would not be able to read these new messages. Maintaining > > > forward compatibility (reading data produced by older libraries) > > >> would > > > be possible as we can reason that a value other than the continuation > > > value was produced by an older library (and then validate the > > > Flatbuffer message of course). Arrow implementations could offer a > > > backward compatibility mode for the sake of old readers if they > > >> desire > > > (this may also assist with testing). > > > > > > Additionally with this vote, we want to formally approve the change > > >> to > > > the Arrow "file" format to always write the (new 8-byte) > > >> end-of-stream > > > marker, which enables code that processes Arrow streams to safely > > >> read > > > the file's internal messages as though they were a normal stream. > > > > > > The PR making these changes to the IPC documentation is here > > > > > > https://github.com/apache/arrow/pull/4951 > > > > > > Please vote to accept these changes. This vote will be open for at > > > least 72 hours > > > > > > [ ] +1 Adopt these Arrow protocol changes > > > [ ] +0 > > > [ ] -1 I disagree because... 
> > > > > > Here is my vote: +1 > > > > > > Thanks, > > > Wes > > > > > > [1]: > > >> https://lists.apache.org/thread.html/8440be572c49b7b2ffb76b63e6d935ada9efd9c1c2021369b6d27786@%3Cdev.arrow.apache.org%3E > > > > > >
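The message framing being voted on can be sketched in Python. This is only an illustration of the proposal, not the Arrow implementation: it assumes the continuation value is 0xFFFFFFFF and that metadata is zero-padded to an 8-byte multiple so the Flatbuffer starts on an aligned offset, with the old-style (length-first) preamble kept as the forward-compatibility path:

```python
import struct

CONTINUATION = 0xFFFFFFFF  # assumed value of the 4-byte continuation marker

def write_message(metadata: bytes, body: bytes = b"") -> bytes:
    # New-style preamble: 4-byte continuation marker + 4-byte
    # little-endian metadata length. The preamble is 8 bytes, so the
    # Flatbuffer metadata that follows starts 8-byte aligned.
    padded_len = (len(metadata) + 7) & ~7  # pad metadata to an 8-byte multiple
    return struct.pack("<II", CONTINUATION, padded_len) + \
        metadata.ljust(padded_len, b"\x00") + body

def read_metadata(buf: bytes) -> bytes:
    first, = struct.unpack_from("<I", buf, 0)
    if first != CONTINUATION:
        # Old-style message: the first word is the metadata length
        # itself (the forward-compatibility reasoning in the thread).
        meta_len, offset = first, 4
    else:
        meta_len, = struct.unpack_from("<I", buf, 4)
        offset = 8
    return buf[offset:offset + meta_len]
```

An old reader seeing a new-style message would interpret 0xFFFFFFFF as the signed length -1 and fail cleanly rather than misreading the payload, which is the point of the continuation value.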
[jira] [Created] (ARROW-6372) [Rust][DataFusion] Predicate push down optimization can break query plan
Paddy Horan created ARROW-6372: -- Summary: [Rust][DataFusion] Predicate push down optimization can break query plan Key: ARROW-6372 URL: https://issues.apache.org/jira/browse/ARROW-6372 Project: Apache Arrow Issue Type: Bug Components: Rust - DataFusion Affects Versions: 0.14.1 Reporter: Paddy Horan Fix For: 0.15.0 The following code reproduces the issue: [https://gist.github.com/paddyhoran/598db6cbb790fc5497320613e54a02c6] If you disable the predicate push down optimization it works fine. -- This message was sent by Atlassian Jira (v8.3.2#803003)
AppVeyor re-build by committers is enabled
Hi, I asked INFRA to add the "arrow committers" GitHub team to the teams of the "ApacheSoftwareFoundation" AppVeyor account: https://issues.apache.org/jira/browse/INFRA-18930 Now, Apache Arrow committers can re-build AppVeyor jobs from the AppVeyor Web UI. AppVeyor jobs sometimes fail due to network trouble. Re-building will be useful in such cases. Note that we need to sign out and sign in again to reflect this change. Thanks, -- kou
Re: [DISCUSS][Format][C++] Improvement of sparse tensor format and implementation
On Wed, Aug 28, 2019 at 1:18 AM Wes McKinney wrote: > null/NA. But, as far as I'm aware, this component of pandas is > relatively unique and was never intended as an alternative to sparse > matrix libraries. > Another example is https://sparse.pydata.org/en/latest/generated/sparse.SparseArray.html?highlight=fill%20value#sparse.SparseArray.fill_value, but it might have been influenced by Pandas. I'm ok with dropping this for now.
Re: [DISCUSS][Format][C++] Improvement of sparse tensor format and implementation
On Tue, Aug 27, 2019 at 6:07 PM Neal Richardson wrote: > > Forgive me if this is off topic; I haven't been following this closely > and I haven't used scipy.sparse. But there are some very reasonable > cases where you might want to fill sparse data with a value other than > 0: > > * The sparseness is missing data, and 0 is not the same as NA > * Better compression: figure out which value is most common in the > data and make that the default that gets filled. E.g. how many fingers > a person has. > Definitely. I am the original author of pandas's Sparse* family of types, and they were created for the case where the data is mostly null/NA. But, as far as I'm aware, this component of pandas is relatively unique and was never intended as an alternative to sparse matrix libraries. It seems like the sparse-with-fill-value might be better discussed on Micah's thread regarding Array compression and encoding. > Neal > > On Tue, Aug 27, 2019 at 3:46 PM Rok Mihevc wrote: > > > > On Tue, Aug 27, 2019 at 11:05 PM Wes McKinney wrote: > > > > > I don't think this has been discussed. I think the SparseTensor > > > discussions have been intended to reach compatibility with "sparse > > > matrix" projects like scipy.sparse. pandas's "SparseArray" objects are > > > a distinct thing -- I don't know many examples of sparse matrices with > > > fill values other than 0 > > > > > > The reason for implementing fill_value would be the case where user wants > > another value to be '0' and it's practical for the SparseTensor object to > > keep that value for them. I am not sure how common would such a case be and > > since scipy.sparse is time tested I'd agree with compatibility as the > > current goal.
Re: [DISCUSS][Format][C++] Improvement of sparse tensor format and implementation
Forgive me if this is off topic; I haven't been following this closely and I haven't used scipy.sparse. But there are some very reasonable cases where you might want to fill sparse data with a value other than 0: * The sparseness is missing data, and 0 is not the same as NA * Better compression: figure out which value is most common in the data and make that the default that gets filled. E.g. how many fingers a person has. Neal On Tue, Aug 27, 2019 at 3:46 PM Rok Mihevc wrote: > > On Tue, Aug 27, 2019 at 11:05 PM Wes McKinney wrote: > > > I don't think this has been discussed. I think the SparseTensor > > discussions have been intended to reach compatibility with "sparse > > matrix" projects like scipy.sparse. pandas's "SparseArray" objects are > > a distinct thing -- I don't know many examples of sparse matrices with > > fill values other than 0 > > > The reason for implementing fill_value would be the case where user wants > another value to be '0' and it's practical for the SparseTensor object to > keep that value for them. I am not sure how common would such a case be and > since scipy.sparse is time tested I'd agree with compatibility as the > current goal.
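Neal's compression argument can be sketched with a hypothetical helper (the names here are mine, not any proposed Arrow API): pick the most common value as the fill value and store only the exceptions, COO-style.

```python
from collections import Counter

def to_sparse_with_fill(values):
    # Hypothetical sketch: choose the most common value as the fill
    # value, then keep only the indices/data that differ from it.
    fill_value = Counter(values).most_common(1)[0][0]
    indices = [i for i, v in enumerate(values) if v != fill_value]
    data = [values[i] for i in indices]
    return fill_value, indices, data

# "How many fingers a person has": fill with 10, store the exceptions.
fv, idx, data = to_sparse_with_fill([10, 10, 10, 9, 10, 10, 8, 10])
```

Eight values collapse to a fill value plus two exceptions, which is the space saving being argued for.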
Re: [Format] Semantics for dictionary batches in streams
I was thinking the file format must satisfy one of two conditions: 1. Exactly one dictionarybatch per encoded column 2. DictionaryBatches are interleaved correctly. On Tuesday, August 27, 2019, Wes McKinney wrote: > On Tue, Aug 27, 2019 at 3:55 PM Antoine Pitrou wrote: > > > > > > Le 27/08/2019 à 22:31, Wes McKinney a écrit : > > > So the current situation we have right now in C++ is that if we tried > > > to create an IPC stream from a sequence of record batches that don't > > > all have the same dictionary, we'd run into two scenarios: > > > > > > * Batches that either have a prefix of a prior-observed dictionary, or > > > the prior dictionary is a prefix of their dictionary. For example, > > > suppose that the dictionary sent for an id was ['A', 'B', 'C'] and > > > then there's a subsequent batch with ['A', 'B', 'C', 'D', 'E']. In > > > such case we could compute and send a delta batch > > > > > > * Batches with a dictionary that is a permutation of values, and > > > possibly new unique values. > > > > > > In this latter case, without the option of replacing an existing ID in > > > the stream, we would have to do a unification / permutation of indices > > > and then also possibly send a delta batch. We should probably have > > > code at some point that deals with both cases, but in the meantime I > > > would like to allow dictionaries to be redefined in this case. Seems > > > like we might need a vote to formalize this? > > > > Isn't the stream format deviating from the file format then? In the > > file format, IIUC, dictionaries can appear after the respective record > > batches, so there's no way to tell whether the original or redefined > > version of a dictionary is being referred to. > > You make a good point -- we can consider changes to the file format to > allow for record batches to have different dictionaries. 
Even handling > delta dictionaries with the current file format would be a bit tedious > (though not indeterminate) > > > Regards > > > > Antoine. >
Re: [DISCUSS][Format][C++] Improvement of sparse tensor format and implementation
On Tue, Aug 27, 2019 at 11:05 PM Wes McKinney wrote: > I don't think this has been discussed. I think the SparseTensor > discussions have been intended to reach compatibility with "sparse > matrix" projects like scipy.sparse. pandas's "SparseArray" objects are > a distinct thing -- I don't know many examples of sparse matrices with > fill values other than 0 The reason for implementing fill_value would be the case where the user wants another value to be '0' and it's practical for the SparseTensor object to keep that value for them. I am not sure how common such a case would be, and since scipy.sparse is time tested I'd agree with compatibility as the current goal.
Re: [Format] Semantics for dictionary batches in streams
On Tue, Aug 27, 2019 at 3:55 PM Antoine Pitrou wrote: > > > Le 27/08/2019 à 22:31, Wes McKinney a écrit : > > So the current situation we have right now in C++ is that if we tried > > to create an IPC stream from a sequence of record batches that don't > > all have the same dictionary, we'd run into two scenarios: > > > > * Batches that either have a prefix of a prior-observed dictionary, or > > the prior dictionary is a prefix of their dictionary. For example, > > suppose that the dictionary sent for an id was ['A', 'B', 'C'] and > > then there's a subsequent batch with ['A', 'B', 'C', 'D', 'E']. In > > such case we could compute and send a delta batch > > > > * Batches with a dictionary that is a permutation of values, and > > possibly new unique values. > > > > In this latter case, without the option of replacing an existing ID in > > the stream, we would have to do a unification / permutation of indices > > and then also possibly send a delta batch. We should probably have > > code at some point that deals with both cases, but in the meantime I > > would like to allow dictionaries to be redefined in this case. Seems > > like we might need a vote to formalize this? > > Isn't the stream format deviating from the file format then? In the > file format, IIUC, dictionaries can appear after the respective record > batches, so there's no way to tell whether the original or redefined > version of a dictionary is being referred to. You make a good point -- we can consider changes to the file format to allow for record batches to have different dictionaries. Even handling delta dictionaries with the current file format would be a bit tedious (though not indeterminate) > Regards > > Antoine.
Re: [DISCUSS][Format][C++] Improvement of sparse tensor format and implementation
I'm also OK with these changes. Since we have not established a versioning or compatibility policy with regards to "Other" data structures like Tensor and SparseTensor, I don't know that a vote is needed, just a pull request. On Wed, Aug 21, 2019 at 1:11 PM Rok Mihevc wrote: > > Hi, > > On Mon, Aug 19, 2019 at 11:30 AM Kenta Murata wrote: > > > (3) Adding SparseCSCIndex > > > > I'd be interested to help with (Python) part of this SparseCSCIndex. > > > I’d appreciate any comments or suggestions. > > > > I missed previous discussion, so this might have already been discussed, > but did we ever consider adding fill value attribute? So parameter that > sets the 'zero' value to some arbitrary value instead of zero if we need it. > See fill_value here > https://sparse.pydata.org/en/latest/generated/sparse.SparseArray.html?highlight=fill%20value#sparse.SparseArray.fill_value > and > here > https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.SparseArray.html > . > I don't think this has been discussed. I think the SparseTensor discussions have been intended to reach compatibility with "sparse matrix" projects like scipy.sparse. pandas's "SparseArray" objects are a distinct thing -- I don't know many examples of sparse matrices with fill values other than 0 > > Best, > Rok
Re: [Format] Semantics for dictionary batches in streams
Le 27/08/2019 à 22:31, Wes McKinney a écrit : > So the current situation we have right now in C++ is that if we tried > to create an IPC stream from a sequence of record batches that don't > all have the same dictionary, we'd run into two scenarios: > > * Batches that either have a prefix of a prior-observed dictionary, or > the prior dictionary is a prefix of their dictionary. For example, > suppose that the dictionary sent for an id was ['A', 'B', 'C'] and > then there's a subsequent batch with ['A', 'B', 'C', 'D', 'E']. In > such case we could compute and send a delta batch > > * Batches with a dictionary that is a permutation of values, and > possibly new unique values. > > In this latter case, without the option of replacing an existing ID in > the stream, we would have to do a unification / permutation of indices > and then also possibly send a delta batch. We should probably have > code at some point that deals with both cases, but in the meantime I > would like to allow dictionaries to be redefined in this case. Seems > like we might need a vote to formalize this? Isn't the stream format deviating from the file format then? In the file format, IIUC, dictionaries can appear after the respective record batches, so there's no way to tell whether the original or redefined version of a dictionary is being referred to. Regards Antoine.
Re: Timeline for 0.15.0 release
hi, I think we should try to release the week of September 9, so development work should be completed by end of next week. Does that seem reasonable? I plan to get up a patch for the protocol alignment changes for C++ in the next couple of days -- I think that getting the alignment work done is the main barrier to releasing. Thanks Wes On Mon, Aug 19, 2019 at 12:25 PM Ji Liu wrote: > > Hi, Wes, on the java side, I can think of several bugs that need to be fixed > or flagged. > > i. ARROW-6040: Dictionary entries are required in IPC streams even when > empty[1] > This one is under review now, however through this PR we find that there > seems to be a bug in java reading and writing dictionaries in IPC which is > inconsistent with the spec[2] since it assumes all dictionaries are at the start > of the stream (see details in PR comments, and this fix may not catch up with > version 0.15). @Micah Kornfield > > ii. ARROW-1875: Write 64-bit ints as strings in integration test JSON files[3] > Java-side code is already checked in; the other implementations seem not to be yet. > > iii. ARROW-6202: OutOfMemory in JdbcAdapter[4] > Caused by trying to load all records in one contiguous batch, fixed by > providing an iterator API for iteratively reading in ARROW-6219[5]. > > Thanks, > Ji Liu > > [1] https://github.com/apache/arrow/pull/4960 > [2] https://arrow.apache.org/docs/ipc.html > [3] https://issues.apache.org/jira/browse/ARROW-1875 > [4] https://issues.apache.org/jira/browse/ARROW-6202 > [5] https://issues.apache.org/jira/browse/ARROW-6219 > > > > -- > From: Wes McKinney > Send Time: Aug 19, 2019 (Monday) 23:03 > To: dev > Subject: Re: Timeline for 0.15.0 release > > I'm going to work on organizing the 0.15.0 backlog some this > week, if anyone wants to help with grooming (particularly for > languages other than C++/Python where I'm focusing) that would be > helpful. 
There have been almost 500 JIRA issues opened since the > 0.14.0 release, so we should make sure to check whether there are any > regressions or other serious bugs that we should try to fix for > 0.15.0. > > On Thu, Aug 15, 2019 at 6:23 PM Wes McKinney wrote: > > > > The Windows wheel issue in 0.14.1 seems to be > > > > https://issues.apache.org/jira/browse/ARROW-6015 > > > > I think the root cause could be the Windows changes in > > > > https://github.com/apache/arrow/commit/223ae744cc2a12c60cecb5db593263a03c13f85a > > > > I would be appreciative if a volunteer would look into what was wrong > > with the 0.14.1 wheels on Windows. Otherwise 0.15.0 Windows wheels > > will be broken, too > > > > The bad wheels can be found at > > > > https://bintray.com/apache/arrow/python#files/python%2F0.14.1 > > > > On Thu, Aug 15, 2019 at 1:28 PM Antoine Pitrou wrote: > > > > > > On Thu, 15 Aug 2019 11:17:07 -0700 > > > Micah Kornfield wrote: > > > > > > > > > > In C++ they are > > > > > independent, we could have 32-bit array lengths and variable-length > > > > > types with 64-bit offsets if we wanted (we just wouldn't be able to > > > > > have a List child with more than INT32_MAX elements). > > > > > > > > I think the point is we could do this in C++ but we don't. I'm not > > > > sure we > > > > would have introduced the "Large" types if we did. > > > > > > 64-bit offsets take twice as much space as 32-bit offsets, so if you're > > > storing lots of small-ish lists or strings, 32-bit offsets are > > > preferable. So even with 64-bit array lengths from the start it would > > > still be beneficial to have types with 32-bit offsets. > > > > > > > Going with the limited address space in Java and calling it a reference > > > > implementation seems suboptimal. If a consumer uses a "Large" type > > > > presumably it is because they need the ability to store more than > > > > INT32_MAX > > > > child elements in a column, otherwise it is just wasting space [1]. > > > > > > Probably. 
Though if the individual elements (lists or strings) are > > > large, not much space is wasted in proportion, so it may be simpler in > > > such a case to always create a "Large" type array. > > > > > > > [1] I suppose theoretically there might be some performance benefits on > > > > 64-bit architectures to using the native word sizes. > > > > > > Concretely, common 64-bit architectures don't do that, as 32-bit is an > > > extremely common integer size even in high-performance code. > > > > > > Regards > > > > > > Antoine. > > > > > > >
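Antoine's space argument is easy to make concrete. Under Arrow's variable-length layout (an array of n values carries n + 1 offsets), a sketch of the offsets-buffer cost:

```python
def offsets_buffer_bytes(num_values: int, large: bool = False) -> int:
    # A list/string array stores num_values + 1 offsets; "Large"
    # types use int64 offsets (8 bytes), regular types int32 (4 bytes).
    width = 8 if large else 4
    return (num_values + 1) * width
```

For a million short strings the offsets buffer alone roughly doubles, from ~4 MB to ~8 MB, when switching to a Large type, which is why 32-bit offsets remain worthwhile even with 64-bit array lengths.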
Re: [Format] Semantics for dictionary batches in streams
So the current situation we have right now in C++ is that if we tried to create an IPC stream from a sequence of record batches that don't all have the same dictionary, we'd run into two scenarios: * Batches that either have a prefix of a prior-observed dictionary, or the prior dictionary is a prefix of their dictionary. For example, suppose that the dictionary sent for an id was ['A', 'B', 'C'] and then there's a subsequent batch with ['A', 'B', 'C', 'D', 'E']. In such case we could compute and send a delta batch * Batches with a dictionary that is a permutation of values, and possibly new unique values. In this latter case, without the option of replacing an existing ID in the stream, we would have to do a unification / permutation of indices and then also possibly send a delta batch. We should probably have code at some point that deals with both cases, but in the meantime I would like to allow dictionaries to be redefined in this case. Seems like we might need a vote to formalize this? Independent from this decision, I would strongly recommend that all implementations handle dictionaries in-memory as data and not metadata (i.e. do not have dictionaries in the schema). It was lucky (see ARROW-3144) that this problematic early design in the C++ library could be fixed with less than a week of work. Thanks Wes On Sun, Aug 11, 2019 at 9:17 PM Micah Kornfield wrote: > > I'm not sure what you mean by record-in-dictionary-id, so it is possible > this is a solution that I just don't understand :) > > The only two references to dictionary IDs that I could find, are one in > schema.fbs [1] which is attached a column in a schema and the one > referenced above in DictionaryBatches define Message.fbs [2] for > transmitting dictionaries. It is quite possible I missed something. > > The indices into the dictionary are Int Arrays in a normal record batch. > I suppose the other option is to reset the stream by sending a new schema, > but I don't think that is supported either. 
This is what led to my > original question. > > Does no one do this today? > > I think Wes did some recent work on the C++ Parquet in reading > dictionaries, and might have faced some of these issues, I'm not sure how > he dealt with it (haven't gotten back to the Parquet code yet). > > [1] https://github.com/apache/arrow/blob/master/format/Schema.fbs#L271 > [2] https://github.com/apache/arrow/blob/master/format/Message.fbs#L72 > > On Sun, Aug 11, 2019 at 6:32 PM Jacques Nadeau wrote: > > > Wow, you've shown how little I've thought about Arrow dictionaries for a > > while. I thought we had a dictionary id and a record-in-dictionary-id. > > Wouldn't that approach make more sense? Does no one do this today? (We > > frequently use compound values for this type of scenario...) > > > > On Sat, Aug 10, 2019 at 4:20 PM Micah Kornfield > > wrote: > > > >> Reading data from two different parquet files sequentially with different > >> dictionaries for the same column. This could be handled by re-encoding > >> data but that seems potentially sub-optimal. > >> > >> On Sat, Aug 10, 2019 at 12:38 PM Jacques Nadeau > >> wrote: > >> > >>> What situation are you anticipating where you're going to be restating ids > >>> mid stream? > >>> > >>> On Sat, Aug 10, 2019 at 12:13 AM Micah Kornfield > >>> wrote: > >>> > The IPC specification [1] defines behavior when isDelta on a > DictionaryBatch [2] is "true". I might have missed it in the > specification, but I couldn't find the interpretation for what the > expected > behavior is when isDelta=false and two dictionary batches with the > same ID are sent. > > It seems like there are two options: > 1. Interpret the new dictionary batch as replacing the old one. > 2. Regard this as an error condition. > > Based on the fact that in the "file format" dictionaries are allowed to > be > placed in any order relative to the record batches, I assume it is the > second, but just wanted to make sure. 
> > Thanks, > Micah > > [1] https://arrow.apache.org/docs/ipc.html > [2] https://github.com/apache/arrow/blob/master/format/Message.fbs#L72 > > >>>
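The two scenarios Wes describes can be sketched as a small decision helper (the function name is mine; isDelta refers to the flag on DictionaryBatch in Message.fbs):

```python
def classify_dictionary_update(prior, new):
    # If the previously sent dictionary is a prefix of the new one,
    # only the suffix needs to be sent as a delta DictionaryBatch
    # (isDelta=true). A permutation or change of existing values
    # cannot be expressed as a delta and needs a full replacement
    # (or a unification/re-encoding pass on the indices).
    if len(new) >= len(prior) and new[:len(prior)] == prior:
        return ("delta", new[len(prior):])
    return ("replacement", new)
```

For the thread's example, ['A', 'B', 'C'] followed by ['A', 'B', 'C', 'D', 'E'] classifies as a delta carrying only ['D', 'E'].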
[jira] [Created] (ARROW-6371) [Doc] Row to columnar conversion example mentions arrow::Column in comments
Omer Ozarslan created ARROW-6371: Summary: [Doc] Row to columnar conversion example mentions arrow::Column in comments Key: ARROW-6371 URL: https://issues.apache.org/jira/browse/ARROW-6371 Project: Apache Arrow Issue Type: Bug Components: Documentation Reporter: Omer Ozarslan https://arrow.apache.org/docs/cpp/examples/row_columnar_conversion.html {code:cpp} // The final representation should be an `arrow::Table` which in turn is made up of // an `arrow::Schema` and a list of `arrow::Column`. An `arrow::Column` is again a // named collection of one or more `arrow::Array` instances. As the first step, we // will iterate over the data and build up the arrays incrementally. {code} -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Created] (ARROW-6370) [JS] Table.from adds 0 on int columns
Sascha Hofmann created ARROW-6370: - Summary: [JS] Table.from adds 0 on int columns Key: ARROW-6370 URL: https://issues.apache.org/jira/browse/ARROW-6370 Project: Apache Arrow Issue Type: Bug Components: JavaScript Affects Versions: 0.14.1 Reporter: Sascha Hofmann I am generating an arrow table in pyarrow and sending it via gRPC like this: {code:java} sink = pa.BufferOutputStream() writer = pa.RecordBatchStreamWriter(sink, batch.schema) writer.write_batch(batch) writer.close() yield ds.Response( status=200, loading=False, response=[sink.getvalue().to_pybytes()] ) {code} On the javascript end, I parse it like that: {code:java} Table.from(response.getResponseList()[0]) {code} That works, but when I look at the actual table, int columns have a 0 for every other row. String columns seem to be parsed just fine. The Python byte array created from to_pybytes() has the same length as received in javascript. I am also able to recreate the original table from the byte array in Python. -- This message was sent by Atlassian Jira (v8.3.2#803003)
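One plausible explanation for the symptom, offered here as a hypothesis rather than a confirmed diagnosis: if a 64-bit integer column's buffer is read through a 32-bit view, each small value is followed by its (zero) high word, which looks exactly like "a 0 for every other row":

```python
import struct

def int64_buffer_as_int32(values):
    # Pack little-endian int64 values, then reinterpret the buffer
    # as int32 words, as a reader using the wrong bit width would.
    buf = b"".join(struct.pack("<q", v) for v in values)
    return [struct.unpack_from("<i", buf, off)[0]
            for off in range(0, len(buf), 4)]
```

String columns would be unaffected by such a mismatch, which matches the report.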
Re: [Discuss] Support read/write interleaved dictionaries and batches in IPC stream
hi all, I agree with handling the interleaved dictionary case. While we are at it, I think we should formally allow dictionary replacements. Note that I just opened https://github.com/apache/arrow/pull/5202 to revamp / consolidate the format documents. So any changes will need to be based on that. One problem with our integration testing strategy -- the JSON files don't exactly simulate a stream, they are more like the current File format. If we want to properly generate integration test cases, we should also slightly generalize the integration test JSON format to contain a collection of messages like [ {"message_type": "Schema", ... }, {"message_type": "DictionaryBatch", ... }, {"message_type": "RecordBatch", ... }, {"message_type": "RecordBatch", ... }, {"message_type": "DictionaryBatch", ... }, ... ] I don't know that it will be all that difficult to make this change but we would have 4 implementations that have to get fixed up. - Wes On Wed, Aug 21, 2019 at 10:58 PM Micah Kornfield wrote: > > Hi Ji Liu, > Thanks for getting the conversation started. I think a few things need to > happen: > 1. We need to clarify in the specification that not all dictionaries need > to be present at the beginning. I plan on creating a PR for discussion > that clarifies this point, as well as handling of non-delta dictionary > batches discussed earlier [1]. > 2. Java needs to support both of these once we vote on the > clarification. One way of doing this, is what I think Jacques alluded to in > [1]. For the in memory representation of dictionary encoding we will need > to not only track a notion of "dictionary" id for the Vector, but also an > addition > 3. Lastly, we should have integration tests around these cases to make > sure they are handled consistently. 
> > Thanks, > Micah > > [1] > https://lists.apache.org/thread.html/9734b71bc12aca16eb997388e95105bff412fdaefa4e19422f477389@%3Cdev.arrow.apache.org%3E > > On Wed, Aug 21, 2019 at 7:48 AM Ji Liu wrote: > > > Hi all, > > > > Recently when we worked on fixing an IPC-related bug on both Java/C++ > > sides[1][2], > > @emkornfield found that the stream reader assumes that all dictionaries > > are at the start of the stream, which is inconsistent with the spec[3] > > which says as long as a record batch doesn't reference a dictionary they > > can be interleaved. > > > > > > Cases below should be supported, however they will crash with current > > implementations. > > i. have a record batch of one dictionary encoded column S > > 1>Schema > > 2>RecordBatch: S=[null, null, null, null] > > 3>DictionaryBatch: ['abc', 'efg'] > > 4>RecordBatch: S=[0, 1, 0, 1] > > ii. have a record batch of two dictionary encoded columns S1, S2 > > 1>Schema > > 2->DictionaryBatch S1: ['ab', 'cd'] > > 3->RecordBatch: S1 = [0,1,0,1] S2 =[null, null, null,] > > 4->DictionaryBatch S2: ['cc', 'dd'] > > 5->RecordBatch: S1 = [0,1,0,1] S2 =[0,1,0,1] > > > > We already did some work on the Java side via[1] > > to make it possible to parse interleaved dictionaries and batches: > > i. In ArrowStreamReader, do not read all dictionaries at the start > > ii. > > When calling loadNextBatch, read the message to decide whether to read dictionaries first > > or directly read a batch; if the former, read all dictionaries out before this > > batch. > > > > iii. When we read a batch, check whether the dictionaries it needs have > > already been read; if not, check whether it's an all-null column and decide whether > > to throw an exception. > > > > In this way, whether they are interleaved or not, we can parse it properly. > > > > In the future, I think we should also support writing interleaved > > dictionaries and batches in the IPC stream (created an > > issue to track this[4]), but it's not quite clear how to implement this. 
> > Any opinions about this are appreciated, thanks! > > > > Thanks, > > Ji Liu > > > > [1] https://issues.apache.org/jira/browse/ARROW-6040, > > [2] https://issues.apache.org/jira/browse/ARROW-6126, > > [3] http://arrow.apache.org/docs/format/IPC.html#streaming-format, > > [4]https://issues.apache.org/jira/browse/ARROW-6308 > >
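The reading side described in points i–iii above can be sketched as a toy stream loop (the tuple encoding of messages here is invented for illustration; a real reader works with Message.fbs messages):

```python
def load_batches(messages):
    # Toy reader that tolerates interleaved messages: dictionaries
    # are registered as they arrive rather than being required up
    # front, and an all-null column needs no dictionary yet.
    dictionaries = {}
    for kind, payload in messages:
        if kind == "dictionary":
            dict_id, values = payload
            dictionaries[dict_id] = values  # first delivery or replacement
        elif kind == "record_batch":
            # Decode each dictionary-encoded column; a None index is
            # a null slot and needs no dictionary lookup.
            yield {
                name: [None if i is None else dictionaries[dict_id][i]
                       for i in indices]
                for name, (dict_id, indices) in payload.items()
            }

# Case i from the thread: a batch of all nulls, then the dictionary.
msgs = [
    ("record_batch", {"S": (0, [None, None, None, None])}),
    ("dictionary", (0, ["abc", "efg"])),
    ("record_batch", {"S": (0, [0, 1, 0, 1])}),
]
```

A reader that insists on dictionaries-first would fail on the first record batch here, even though that batch never dereferences the dictionary.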
[jira] [Created] (ARROW-6369) [Python] Support list-of-boolean in Array.to_pandas conversion
Wes McKinney created ARROW-6369: --- Summary: [Python] Support list-of-boolean in Array.to_pandas conversion Key: ARROW-6369 URL: https://issues.apache.org/jira/browse/ARROW-6369 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Wes McKinney Fix For: 0.15.0 See {code} In [4]: paste a = pa.array(np.array([[True, False], [True, True, True]])) ## -- End pasted text -- In [5]: a Out[5]: [ [ true, false ], [ true, true, true ] ] In [6]: a.to_pandas() --- ArrowNotImplementedError Traceback (most recent call last) in > 1 a.to_pandas() ~/code/arrow/python/pyarrow/array.pxi in pyarrow.lib._PandasConvertible.to_pandas() 439 deduplicate_objects=deduplicate_objects) 440 --> 441 return self._to_pandas(options, categories=categories, 442ignore_metadata=ignore_metadata) 443 ~/code/arrow/python/pyarrow/array.pxi in pyarrow.lib.Array._to_pandas() 815 816 with nogil: --> 817 check_status(ConvertArrayToPandas(c_options, self.sp_array, 818 self, )) 819 return wrap_array_output(out) ~/code/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status() 84 raise ArrowKeyError(message) 85 elif status.IsNotImplemented(): ---> 86 raise ArrowNotImplementedError(message) 87 elif status.IsTypeError(): 88 raise ArrowTypeError(message) ArrowNotImplementedError: Not implemented type for lists: bool In ../src/arrow/python/arrow_to_pandas.cc, line 1910, code: VisitTypeInline(*data_->type(), this) {code} as reported in https://github.com/apache/arrow/issues/5203 -- This message was sent by Atlassian Jira (v8.3.2#803003)
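The mechanics the missing C++ conversion needs are simple; a stdlib sketch (not the pyarrow API) of how a list<bool> array's offsets buffer slices its flat child values, nulls omitted for brevity:

```python
def list_array_to_pylist(offsets, values):
    # Each sublist i spans values[offsets[i]:offsets[i + 1]].
    return [values[offsets[i]:offsets[i + 1]]
            for i in range(len(offsets) - 1)]
```

For the arrays in the report, offsets [0, 2, 5] over five flat booleans recover the two original sublists.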
[jira] [Created] (ARROW-6368) [C++] Add RecordBatch projection functionality
Benjamin Kietzman created ARROW-6368: Summary: [C++] Add RecordBatch projection functionality Key: ARROW-6368 URL: https://issues.apache.org/jira/browse/ARROW-6368 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Benjamin Kietzman Assignee: Benjamin Kietzman define classes RecordBatchProjector (which projects from one schema to another, augmenting with null/constant columns where necessary) and a subtype of RecordBatchIterator which projects each batch yielded by a wrapped iterator. -- This message was sent by Atlassian Jira (v8.3.2#803003)
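A minimal sketch of the projection idea (plain dicts stand in for RecordBatch and Schema; the C++ classes described in the issue also handle constant columns and project each batch from a wrapped iterator):

```python
def project_batch(batch, target_fields, num_rows):
    # Columns the source batch already has pass through; fields of
    # the target schema that are missing are augmented with nulls.
    return {name: batch.get(name, [None] * num_rows)
            for name in target_fields}
```

Projecting a one-column batch to a two-field target schema yields the original column plus an all-null column for the missing field.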
[jira] [Created] (ARROW-6367) [C++][Gandiva] Implement string reverse
Prudhvi Porandla created ARROW-6367: --- Summary: [C++][Gandiva] Implement string reverse Key: ARROW-6367 URL: https://issues.apache.org/jira/browse/ARROW-6367 Project: Apache Arrow Issue Type: Task Reporter: Prudhvi Porandla Assignee: Prudhvi Porandla -- This message was sent by Atlassian Jira (v8.3.2#803003)