Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

2020-07-06 Thread Wes McKinney
On Mon, Jul 6, 2020 at 11:08 AM Antoine Pitrou wrote: > > > On 06/07/2020 at 17:57, Steve Kim wrote: > > The Parquet format specification is ambiguous about the exact details of > > LZ4 compression. However, the *de facto* reference implementation in Java > > (parquet-mr) uses the Hadoop LZ4

Re: language independent representation of filter expressions

2020-07-06 Thread Wes McKinney
I would also be interested in having a reusable serialized format for filter- and projection-like expressions. I think going all the way to full logical query plans suitable for building a SQL engine is perhaps too far, but we could start small with the use case from the JNI Datasets PR as
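As a rough illustration of the kind of reusable, language-independent filter serialization being discussed, here is a minimal Python sketch that encodes a simple comparison expression as JSON. The node layout and field names ("call", "field_ref", "literal") are hypothetical, not an existing Arrow or Gandiva format.

```python
import json

# Hypothetical JSON encoding of the filter `cost > 100 AND region == "EU"`.
# The shape of these nodes is an illustrative sketch only.
expr = {
    "call": "and",
    "args": [
        {"call": "greater",
         "args": [{"field_ref": "cost"}, {"literal": 100, "type": "int64"}]},
        {"call": "equal",
         "args": [{"field_ref": "region"}, {"literal": "EU", "type": "utf8"}]},
    ],
}

# Any language with a JSON parser (Java via JNI, C++, Rust, ...) could decode
# this payload and rebuild a native dataset filter from it.
payload = json.dumps(expr)
print(payload)
```

A protobuf schema (as used in Gandiva) would serve the same role with a more compact wire format; the point is only that the expression tree crosses the language boundary as plain bytes.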

Re: language independent representation of filter expressions

2020-07-06 Thread Andy Grove
This is something that I am also interested in. My current approach in my personal project that uses Arrow is to use protobuf to represent expressions (as well as logical and physical query plans). I used the Gandiva protobuf definition as a starting point. Protobuf works for going between

language independent representation of filter expressions

2020-07-06 Thread Steve Kim
I have been following the discussion on a pull request ( https://github.com/apache/arrow/pull/7030) by Hongze Zhang to use the high-level dataset API via JNI. An obstacle that was encountered in this PR is that there is not a good way to pass a filter expression via JNI. Expressions have a

Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

2020-07-06 Thread Steve Kim
> Would that keep compatibility with existing files produced by Parquet C++? Changing the LZ4 implementation to be compatible with parquet-mr/Hadoop would break compatibility with any existing files that were written by Parquet C++ using LZ4 compression. I believe that it is not possible to

Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

2020-07-06 Thread Antoine Pitrou
On 06/07/2020 at 17:57, Steve Kim wrote: > The Parquet format specification is ambiguous about the exact details of > LZ4 compression. However, the *de facto* reference implementation in Java > (parquet-mr) uses the Hadoop LZ4 codec. > > I think that it is important for Parquet C++ to have

Re: Question: How to pass data between two languages interprocess without extra libraries?

2020-07-06 Thread Neal Richardson
Could you clarify what you mean by "without external libraries"? Do you mean without using pyarrow and the arrow R package? Neal On Mon, Jul 6, 2020 at 1:40 AM Fan Liya wrote: > Hi Teng, > > Arrow provides two formats for IPC between different languages: streaming > and file. > This article

Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

2020-07-06 Thread Steve Kim
The Parquet format specification is ambiguous about the exact details of LZ4 compression. However, the *de facto* reference implementation in Java (parquet-mr) uses the Hadoop LZ4 codec. I think that it is important for Parquet C++ to have compatibility and feature parity with parquet-mr when
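For context, this is roughly how the LZ4 codec is selected from Python via Parquet C++ (pyarrow). A minimal sketch assuming pyarrow is installed; the file name is made up, and this says nothing about which LZ4 framing the writer actually emits, which is exactly the ambiguity under discussion.

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"x": [1, 2, 3], "y": ["a", "b", "c"]})

# Parquet C++ (via pyarrow) accepts "LZ4" as a codec name. Whether the
# resulting column chunks use the Hadoop LZ4 framing expected by parquet-mr
# or a raw/block LZ4 stream is the compatibility question in this thread.
pq.write_table(table, "lz4_example.parquet", compression="LZ4")

# Reading back with the same C++ implementation round-trips fine; reading
# the same file from parquet-mr may fail if the two sides disagree on framing.
print(pq.read_table("lz4_example.parquet"))
```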

Re: [Integration] Errors running archery integration on Windows

2020-07-06 Thread Neville Dipale
Thanks Rok and Antoine, I couldn't see what the issue could have been, so the SO link was very helpful and informative. I'll try it out, and submit a PR if I get it right. On Mon, 6 Jul 2020 at 14:30, Antoine Pitrou wrote: > > Yes, that's certainly the case. > Changing: > values =

Re: [Integration] Errors running archery integration on Windows

2020-07-06 Thread Antoine Pitrou
Yes, that's certainly the case. Changing: values = np.random.randint(lower, upper, size=size) to: values = np.random.randint(lower, upper, size=size, dtype=np.int64) would hopefully fix the issue. Neville, could you try it out? Thank you. Antoine. On 06/07/2020 at 14:16, Rok Mihevc
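A self-contained reproduction of the failure and the proposed fix (a sketch; the bounds below are illustrative, not the ones archery actually uses):

```python
import numpy as np

lower, upper, size = -2**40, 2**40, 8

# On Windows, NumPy's default integer type is 32-bit (C long), so bounds
# outside the int32 range raise "ValueError: low is out of bounds for int32".
try:
    values = np.random.randint(lower, upper, size=size)
except ValueError as exc:
    print("default dtype failed:", exc)

# Requesting int64 explicitly makes the call behave the same on all platforms.
values = np.random.randint(lower, upper, size=size, dtype=np.int64)
print(values.dtype)  # int64 everywhere
```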

Re: [Integration] Errors running archery integration on Windows

2020-07-06 Thread Rok Mihevc
NumPy on Windows has a different default integer bitwidth than on Linux. Perhaps this is causing the issue? (see: https://stackoverflow.com/questions/36278590/numpy-array-dtype-is-coming-as-int32-by-default-in-a-windows-10-64-bit-machine ) Rok On Mon, Jul 6, 2020 at 12:57 PM Neville Dipale wrote: > Hi

[Integration] Errors running archery integration on Windows

2020-07-06 Thread Neville Dipale
Hi Arrow devs, I'm trying to run archery integration tests on Windows 10 (Python 3.7.7; conda 4.8.3), but I'm getting an error *ValueError: low is out of bounds for int32* (https://gist.github.com/nevi-me/4946eabb2dc111e10b98c074b45b73b1 ). Has someone else encountered this problem before?

[NIGHTLY] Arrow Build Report for Job nightly-2020-07-06-0

2020-07-06 Thread Crossbow
Arrow Build Report for Job nightly-2020-07-06-0 All tasks: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0 Failed Tasks: - homebrew-cpp: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-06-0-travis-homebrew-cpp -

Re: Question: How to pass data between two languages interprocess without extra libraries?

2020-07-06 Thread Fan Liya
Hi Teng, Arrow provides two formats for IPC between different languages: streaming and file. This article gives a tutorial for Java: https://arrow.apache.org/docs/java/ipc.html For other languages, it may be helpful to read the test cases. Best, Liya Fan On Sun, Jul 5, 2020 at 4:24 PM Teng
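To make the two IPC formats concrete, here is a minimal pyarrow sketch (Python shown for brevity; the Java and R APIs mirror it) that writes and reads the Arrow IPC file format. The file name and schema are made up for the example.

```python
import pyarrow as pa
import pyarrow.ipc as ipc

# Example data; any Arrow implementation (Java, R, C++, ...) that understands
# the IPC file format can consume the file written below.
table = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})

# Arrow IPC "file" (random-access) format. The "stream" format is analogous
# but uses ipc.new_stream / ipc.open_stream and has no footer.
with pa.OSFile("data.arrow", "wb") as sink:
    writer = ipc.new_file(sink, table.schema)
    writer.write_table(table)
    writer.close()

# Read it back; in another process or language this would be the only step.
with pa.OSFile("data.arrow", "rb") as source:
    roundtrip = ipc.open_file(source).read_all()
print(roundtrip)
```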