[jira] [Created] (ARROW-5916) [C++] Allow RecordBatch.length to be less than array lengths

2019-07-11 Thread John Muehlhausen (JIRA)
John Muehlhausen created ARROW-5916: --- Summary: [C++] Allow RecordBatch.length to be less than array lengths Key: ARROW-5916 URL: https://issues.apache.org/jira/browse/ARROW-5916 Project: Apache

Re: [Discuss] Support an alternative memory layout for varchar/varbinary vectors

2019-07-11 Thread Fan Liya
@Wes McKinney, Thanks a lot for the brainstorming. I think your ideas are reasonable and feasible. About IPC, my idea is that we can send the vector as a PointerStringVector, and receive it as a VarCharVector, so that the overhead of memory compaction can be hidden. What do you think? Best, Liya

Re: [Discuss] Support an alternative memory layout for varchar/varbinary vectors

2019-07-11 Thread Fan Liya
@Uwe L. Korn Thanks a lot for the suggestion. I think this is exactly what we are doing right now. Best, Liya Fan On Thu, Jul 11, 2019 at 9:44 PM Wes McKinney wrote: > hi Liya -- have you thought about implementing this as an > ExtensionType / ExtensionVector? You actually can already do

Re: [DISCUSS] Need for 0.14.1 release due to Python package problems, Parquet forward compatibility problems

2019-07-11 Thread Francois Saint-Jacques
I just merged PARQUET-1623, I think it's worth including since it fixes an invalid memory write. Note that I couldn't resolve/close the Parquet issue; do I have to be a contributor to the project? François On Thu, Jul 11, 2019 at 6:10 PM Wes McKinney wrote: > > I just merged Eric's 2nd patch

Re: [DISCUSS] Need for 0.14.1 release due to Python package problems, Parquet forward compatibility problems

2019-07-11 Thread Wes McKinney
I just merged Eric's 2nd patch ARROW-5908 and I went through all the patches since the release commit and have come up with the following list of 32 fix-only patches to pick into a maintenance branch: https://gist.github.com/wesm/1e4ac14baaa8b27bf13b071d2d715014 Note there's still unresolved

[jira] [Created] (ARROW-5915) [C++] [Python] Set up testing for backwards compatibility of the parquet reader

2019-07-11 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-5915: Summary: [C++] [Python] Set up testing for backwards compatibility of the parquet reader Key: ARROW-5915 URL: https://issues.apache.org/jira/browse/ARROW-5915

Re: [DISCUSS] Need for 0.14.1 release due to Python package problems, Parquet forward compatibility problems

2019-07-11 Thread Wes McKinney
Eric -- you are free to set the Fix Version prior to the patch being merged On Thu, Jul 11, 2019 at 3:01 PM Eric Erhardt wrote: > > The two C# fixes I'd like in the 0.14.1 release are: > > https://issues.apache.org/jira/browse/ARROW-5887 - already marked with 0.14.1 > fix version. >

RE: [DISCUSS] Need for 0.14.1 release due to Python package problems, Parquet forward compatibility problems

2019-07-11 Thread Eric Erhardt
The two C# fixes I'd like in the 0.14.1 release are: https://issues.apache.org/jira/browse/ARROW-5887 - already marked with 0.14.1 fix version. https://issues.apache.org/jira/browse/ARROW-5908 - hasn't been resolved yet. The PR https://github.com/apache/arrow/pull/4851 has one approver and the

Re: [Python] Wheel questions

2019-07-11 Thread Wes McKinney
On Thu, Jul 11, 2019 at 11:26 AM Antoine Pitrou wrote: > > > On 11/07/2019 at 17:52, Krisztián Szűcs wrote: > > Hi All, > > > > I have a couple of questions about the wheel packaging: > > - why do we build an arrow namespaced boost on linux and osx, could we link > > statically like with the

Re: [Discuss] IPC Specification, flatbuffers and unaligned memory accesses

2019-07-11 Thread Wes McKinney
Hi Francois -- copying the metadata into memory isn't the end of the world but it's a pretty ugly wart. This affects every IPC protocol message everywhere. We have an opportunity to address the wart now but such a fix post-1.0.0 will be much more difficult. On Thu, Jul 11, 2019, 2:05 PM Francois

[jira] [Created] (ARROW-5914) [CI] Build bundled dependencies in docker build step

2019-07-11 Thread Francois Saint-Jacques (JIRA)
Francois Saint-Jacques created ARROW-5914: - Summary: [CI] Build bundled dependencies in docker build step Key: ARROW-5914 URL: https://issues.apache.org/jira/browse/ARROW-5914 Project: Apache

Re: [Discuss] IPC Specification, flatbuffers and unaligned memory accesses

2019-07-11 Thread Francois Saint-Jacques
If the data buffers are still aligned, then I don't think we should add a breaking change just to avoid the copy on the metadata. I'd expect said metadata to be small enough that zero-copy doesn't really affect performance. François On Sun, Jun 30, 2019 at 4:01 AM Micah Kornfield wrote: > >

Re: [Discuss] IPC Specification, flatbuffers and unaligned memory accesses

2019-07-11 Thread Wes McKinney
Hi Bryan -- it wouldn't be forward compatible when using the 8 byte prefix, but using the scheme we are proposing old clients would see the new prefix as malformed (metadata length 0xFFFFFFFF = -1) rather than crashing. We could possibly expose a forward compatibility option to write the 4 byte
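The framing being discussed can be sketched as follows. This is a hypothetical reader in Python, not Arrow's actual API; only the 0xFFFFFFFF continuation marker and little-endian int32 length come from the proposal:

```python
import struct

# 0xFFFFFFFF reads as metadata length -1 to old 4-byte-prefix clients,
# so they reject the message as malformed instead of crashing.
CONTINUATION = 0xFFFFFFFF

def read_message_prefix(buf):
    """Return (metadata_length, metadata_offset) for either framing."""
    first, = struct.unpack_from("<I", buf, 0)
    if first == CONTINUATION:
        # New style: continuation marker, then the real 4-byte length.
        length, = struct.unpack_from("<I", buf, 4)
        return length, 8
    # Old style: the first 4 bytes are the metadata length itself.
    return first, 4
```

A reader written this way accepts both the old 4-byte prefix and the proposed 8-byte one, which is the backward-compatibility story in the thread.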

Re: Adding a new encoding for FP data - unsubscribe

2019-07-11 Thread Bryan Cutler
Mani, please send a reply to dev-unsubscr...@arrow.apache.org to remove yourself from the list. On Thu, Jul 11, 2019 at 11:10 AM mani vannan wrote: > All, > > Can someone please help me to unsubscribe to this group? > > Thank you. > > -Original Message- > From: Radev, Martin > Sent:

RE: Adding a new encoding for FP data - unsubscribe

2019-07-11 Thread mani vannan
All, Can someone please help me to unsubscribe to this group? Thank you. -Original Message- From: Radev, Martin Sent: Thursday, July 11, 2019 2:08 PM To: dev@arrow.apache.org; emkornfi...@gmail.com Cc: Raoofy, Amir ; Karlstetter, Roman Subject: Re: Adding a new encoding for FP data

Re: Adding a new encoding for FP data

2019-07-11 Thread Radev, Martin
Hello Micah, the changes will go to the C++ implementation of Parquet within Arrow. In that sense, if Arrow uses the compression and encoding methods available in Parquet in any way, I expect a benefit. My plan is to add the new encoding to parquet-cpp and parquet-mr (Java). If you have

Re: [Discuss] IPC Specification, flatbuffers and unaligned memory accesses

2019-07-11 Thread Bryan Cutler
So the proposal here will still be backwards compatible with a 4 byte prefix? Can you explain a little more how this might work if I have an older version of Java using 4 byte prefix and a new version of C++/Python with an 8 byte one for a roundtrip Java -> Python -> Java? On Wed, Jul 10, 2019 at

Re: [DISCUSS] Need for 0.14.1 release due to Python package problems, Parquet forward compatibility problems

2019-07-11 Thread Neal Richardson
I just moved https://issues.apache.org/jira/browse/ARROW-5850 from 1.0.0 to 0.14.1. On Thu, Jul 11, 2019 at 8:12 AM Wes McKinney wrote: > To limit uncertainty, I'm going to start preparing a 0.14.1 patch > release branch. I will update the list with the patches that are being > cherry-picked.

Re: [Python] Wheel questions

2019-07-11 Thread Antoine Pitrou
On 11/07/2019 at 17:52, Krisztián Szűcs wrote: > Hi All, > > I have a couple of questions about the wheel packaging: > - why do we build an arrow namespaced boost on linux and osx, could we link > statically like with the windows wheels? No idea. Boost shouldn't leak in the public APIs, so

[Python] Wheel questions

2019-07-11 Thread Krisztián Szűcs
Hi All, I have a couple of questions about the wheel packaging: - why do we build an arrow namespaced boost on linux and osx, could we link statically like with the windows wheels? - do we explicitly say somewhere in the linux wheels to link the 3rdparty dependencies statically or just

Re: Adding a new encoding for FP data

2019-07-11 Thread Micah Kornfield
Hi Martin, Can you clarify whether you were expecting the encoding to only be used in Parquet, or more generally in Arrow? Thanks, Micah On Thu, Jul 11, 2019 at 7:06 AM Wes McKinney wrote: > hi folks, > > If you could participate in Micah's discussion about compression and > encoding generally at > >

Re: [DISCUSS] Need for 0.14.1 release due to Python package problems, Parquet forward compatibility problems

2019-07-11 Thread Wes McKinney
To limit uncertainty, I'm going to start preparing a 0.14.1 patch release branch. I will update the list with the patches that are being cherry-picked. If other folks could give me a list of other PRs that need to be backported I will add them to the list. Any JIRA that needs to be included should

[jira] [Created] (ARROW-5913) Add support for Parquet's BYTE_STREAM_SPLIT encoding

2019-07-11 Thread Martin Radev (JIRA)
Martin Radev created ARROW-5913: --- Summary: Add support for Parquet's BYTE_STREAM_SPLIT encoding Key: ARROW-5913 URL: https://issues.apache.org/jira/browse/ARROW-5913 Project: Apache Arrow

Re: Adding a new encoding for FP data

2019-07-11 Thread Wes McKinney
hi folks, If you could participate in Micah's discussion about compression and encoding generally at https://lists.apache.org/thread.html/a99124e57c14c3c9ef9d98f3c80cfe1dd25496bf3ff7046778add937@%3Cdev.arrow.apache.org%3E it would be helpful. I personally think that Arrow would benefit from an

Re: New CI system: Ursabot

2019-07-11 Thread Krisztián Szűcs
Hi Eric! On Thu, Jul 11, 2019 at 3:34 PM Eric Erhardt wrote: > My apologies if this is already covered in the docs, but I couldn't find > it. > > How do I re-run a single leg in the Ursabot tests? The 'AMD64 Debian 9 > Rust 1.35' failed on my PR, and I wanted to try re-running just that leg, >

Re: [Discuss] Support an alternative memory layout for varchar/varbinary vectors

2019-07-11 Thread Wes McKinney
hi Liya -- have you thought about implementing this as an ExtensionType / ExtensionVector? You actually can already do this, so if this helps you reference strings stored in some external memory then that seems reasonable. Such a PointerStringVector could have a method that converts it into the

RE: New CI system: Ursabot

2019-07-11 Thread Eric Erhardt
My apologies if this is already covered in the docs, but I couldn't find it. How do I re-run a single leg in the Ursabot tests? The 'AMD64 Debian 9 Rust 1.35' failed on my PR, and I wanted to try re-running just that leg, but the only option I found was to re-run all Ursabot legs. Eric

Re: [Discuss] Support an alternative memory layout for varchar/varbinary vectors

2019-07-11 Thread Uwe L. Korn
Hello Liya Fan, here your best approach is to copy into the Arrow format as you can then use this as the basis for working with the Arrow-native representation as well as your internal representation. You will have to use two different offset vectors, as those two will always differ, but in the

Re: [Discuss] Support an alternative memory layout for varchar/varbinary vectors

2019-07-11 Thread Antoine Pitrou
Same as Uwe. Regards Antoine. On 11/07/2019 at 14:05, Uwe L. Korn wrote: > Hello Liya, > > I'm quite -1 on this type as Arrow is about efficient columnar structures. We > have opened the standard also to matrix-like types but always keep the > constraint of consecutive memory. Now also

Re: [Discuss] Support an alternative memory layout for varchar/varbinary vectors

2019-07-11 Thread Fan Liya
Hi Korn, Thanks a lot for your comments. Your comments make sense to me. Allowing non-consecutive memory segments would break some good design choices of Arrow. However, there are widespread user requirements for non-consecutive memory segments. I am wondering how we can help such

Re: [Discuss] Support an alternative memory layout for varchar/varbinary vectors

2019-07-11 Thread Uwe L. Korn
Hello Liya, I'm quite -1 on this type as Arrow is about efficient columnar structures. We have opened the standard also to matrix-like types but always keep the constraint of consecutive memory. Now also adding types where memory is no longer consecutive but spread in the heap will make the

[jira] [Created] (ARROW-5911) [Java] Make ListVector and MapVector create reader lazily

2019-07-11 Thread Liya Fan (JIRA)
Liya Fan created ARROW-5911: --- Summary: [Java] Make ListVector and MapVector create reader lazily Key: ARROW-5911 URL: https://issues.apache.org/jira/browse/ARROW-5911 Project: Apache Arrow Issue

[jira] [Created] (ARROW-5910) read_tensor() fails on non-seekable streams

2019-07-11 Thread Karsten Krispin (JIRA)
Karsten Krispin created ARROW-5910: -- Summary: read_tensor() fails on non-seekable streams Key: ARROW-5910 URL: https://issues.apache.org/jira/browse/ARROW-5910 Project: Apache Arrow Issue

Re: Adding a new encoding for FP data

2019-07-11 Thread Fan Liya
Hi Radev, Thanks a lot for providing so many technical details. I need to read them carefully. I think FP encoding is definitely a useful feature. I hope this feature can be implemented in Arrow soon, so that we can use it in our system. Best, Liya Fan On Thu, Jul 11, 2019 at 5:55 PM Radev,

Re: Adding a new encoding for FP data

2019-07-11 Thread Radev, Martin
Hello Liya Fan, this explains the technique but for a more complex case: https://fgiesen.wordpress.com/2011/01/24/x86-code-compression-in-kkrunchy/ For FP data, the approach which seemed to be the best is the following. Say we have a buffer of two 32-bit floating point values: buf = [af, bf]
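The byte-scattering idea described above can be sketched in a few lines. This is a pure-Python illustration of the technique for 32-bit floats, not the parquet-cpp implementation; the function names are mine:

```python
import struct

def byte_stream_split(values):
    """Encode 32-bit floats by scattering byte k of every value into
    stream k. Bytes in one stream (e.g. the sign/exponent bytes) are
    similar across values, so the result compresses much better."""
    raw = struct.pack(f"<{len(values)}f", *values)
    return b"".join(raw[k::4] for k in range(4))

def byte_stream_join(encoded, n):
    """Inverse transform: gather byte k of value i from stream k."""
    streams = [encoded[k * n:(k + 1) * n] for k in range(4)]
    raw = bytes(streams[k][i] for i in range(n) for k in range(4))
    return list(struct.unpack(f"<{n}f", raw))

vals = [1.5, -2.25, 3.0]
assert byte_stream_join(byte_stream_split(vals), len(vals)) == vals
```

The transform itself is lossless and nearly free; the gain comes from running a general-purpose compressor (zstd, gzip) over the rearranged bytes afterwards.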

Adding a new encoding for FP data

2019-07-11 Thread Radev, Martin
Hello people, there has been discussion in the Apache Parquet mailing list on adding a new encoder for FP data. The reason for this is that the supported compressors by Apache Parquet (zstd, gzip, etc) do not compress well raw FP data. In my investigation it turns out that a very simple