Re: parquet 2 incompatibility between 0.16 and 0.17?

2020-04-30 Thread Micah Kornfield
Sorry I didn't get to this, will try again tomorrow. On Thu, Apr 30, 2020 at 11:09 AM Wes McKinney wrote: > I'd be fine with a patch release addressing this so long as it's > binary-only (to save us all time). > > On Thu, Apr 30, 2020, 12:30 PM Micah Kornfield > wrote: > >> This sounds like som

How to include arrow and parquet in another project's CMakeLists.txt

2020-04-30 Thread Zhuo Jia Dai
Hi all, I am trying to write a Julia parquet writer by leveraging the C++ arrow library. I can build arrow and arrow/parquet and can write out a parquet file successfully. The next part I need to do is to use the [CxxWrap.jl]( https://github.com/JuliaInterop/CxxWrap.jl) Julia package to call the C

Pyarrow building from source along with CPP Libraries to link to another Cython API

2020-04-30 Thread Vibhatha Abeykoon
Hi, I am trying to integrate Arrow with an application that I am developing. Here I build Arrow from the source (CPP) and use the API to develop some custom functions to do a scientific calculation after data loaded with Arrow table API. On top of this, I develop a Cython API to design a python AP

[jira] [Created] (ARROW-8661) [C++][Gandiva] Reduce number of files and headers

2020-04-30 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8661: --- Summary: [C++][Gandiva] Reduce number of files and headers Key: ARROW-8661 URL: https://issues.apache.org/jira/browse/ARROW-8661 Project: Apache Arrow Issue Ty

[jira] [Created] (ARROW-8660) [C++][Gandiva] Reduce dependence on Boost

2020-04-30 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8660: --- Summary: [C++][Gandiva] Reduce dependence on Boost Key: ARROW-8660 URL: https://issues.apache.org/jira/browse/ARROW-8660 Project: Apache Arrow Issue Type: Impr

[jira] [Created] (ARROW-8659) ListBuilder and FixedSizeListBuilder capacity

2020-04-30 Thread Raphael Taylor-Davies (Jira)
Raphael Taylor-Davies created ARROW-8659: Summary: ListBuilder and FixedSizeListBuilder capacity Key: ARROW-8659 URL: https://issues.apache.org/jira/browse/ARROW-8659 Project: Apache Arrow

[C++] Heads up about breaking API change with Interval types

2020-04-30 Thread Wes McKinney
Hi folks, In https://github.com/apache/arrow/pull/7060 I proposed an (unavoidable) C++ API change related to the two types of intervals that are in the Arrow columnar format. As context, in the C++ library in almost all cases we use different Type enum values for each "subtype" that has a differe

[RESULT] [VOTE] Add "trivial" RecordBatch body compression to Arrow IPC protocol

2020-04-30 Thread Wes McKinney
The vote carries with 7 binding +1 votes and 1 non-binding +1 On Fri, Apr 24, 2020 at 7:40 AM Francois Saint-Jacques wrote: > > +1 (binding) > > On Fri, Apr 24, 2020 at 5:41 AM Krisztián Szűcs > wrote: > > > > +1 (binding) > > > > On 2020. Apr 24., Fri at 1:51, Micah Kornfield > > wrote: > > >

Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-04-30 Thread David Li
Francois, Thanks for the pointers. I'll see if I can put together a proof-of-concept, might that help discussion? I agree it would be good to make it format-agnostic. I'm also curious what thoughts you'd have on how to manage cross-file parallelism (coalescing only helps within a file). If we just

[jira] [Created] (ARROW-8658) [C++][Dataset] Implement subtree pruning for FileSystemDataset::GetFragments

2020-04-30 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-8658: --- Summary: [C++][Dataset] Implement subtree pruning for FileSystemDataset::GetFragments Key: ARROW-8658 URL: https://issues.apache.org/jira/browse/ARROW-8658 Project: Apa

Re: parquet 2 incompatibility between 0.16 and 0.17?

2020-04-30 Thread Wes McKinney
I'd be fine with a patch release addressing this so long as it's binary-only (to save us all time). On Thu, Apr 30, 2020, 12:30 PM Micah Kornfield wrote: > This sounds like something we might want to do and issue a patch release. > It seems bad to default to a non-production version? > > I can t

Re: parquet 2 incompatibility between 0.16 and 0.17?

2020-04-30 Thread Micah Kornfield
This sounds like something we might want to do and issue a patch release. It seems bad to default to a non-production version? I can try to take a look tonight at a patch if no one gets to it before. Thanks, Micah On Wednesday, April 29, 2020, Wes McKinney wrote: > On Wed, Apr 29, 2020 at 6:15 PM

[jira] [Created] (ARROW-8657) Distinguish parquet version 2 logical type vs DataPageV2

2020-04-30 Thread Pierre Belzile (Jira)
Pierre Belzile created ARROW-8657: - Summary: Distinguish parquet version 2 logical type vs DataPageV2 Key: ARROW-8657 URL: https://issues.apache.org/jira/browse/ARROW-8657 Project: Apache Arrow

[jira] [Created] (ARROW-8656) [Python] Switch to VS2017 in the windows wheel builds

2020-04-30 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-8656: -- Summary: [Python] Switch to VS2017 in the windows wheel builds Key: ARROW-8656 URL: https://issues.apache.org/jira/browse/ARROW-8656 Project: Apache Arrow

[jira] [Created] (ARROW-8655) [C++][Dataset][Python][R] Preserve partitioning information for a discovered Dataset

2020-04-30 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8655: Summary: [C++][Dataset][Python][R] Preserve partitioning information for a discovered Dataset Key: ARROW-8655 URL: https://issues.apache.org/jira/browse/ARROW-8655

[jira] [Created] (ARROW-8654) [Python] pyarrow 0.17.0 fails reading "wide" parquet files

2020-04-30 Thread Mike Macpherson (Jira)
Mike Macpherson created ARROW-8654: -- Summary: [Python] pyarrow 0.17.0 fails reading "wide" parquet files Key: ARROW-8654 URL: https://issues.apache.org/jira/browse/ARROW-8654 Project: Apache Arrow

[jira] [Created] (ARROW-8653) [C++] Add support for gflags version detection

2020-04-30 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-8653: -- Summary: [C++] Add support for gflags version detection Key: ARROW-8653 URL: https://issues.apache.org/jira/browse/ARROW-8653 Project: Apache Arrow Issue

[jira] [Created] (ARROW-8652) [Python] Test error message when discovering dataset with invalid files

2020-04-30 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8652: Summary: [Python] Test error message when discovering dataset with invalid files Key: ARROW-8652 URL: https://issues.apache.org/jira/browse/ARROW-8652

[jira] [Created] (ARROW-8651) [Python][Dataset] Support pickling of Dataset objects

2020-04-30 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8651: Summary: [Python][Dataset] Support pickling of Dataset objects Key: ARROW-8651 URL: https://issues.apache.org/jira/browse/ARROW-8651 Project: Apache Ar

[jira] [Created] (ARROW-8650) [Rust] [Website] Add documentation to Arrow website

2020-04-30 Thread Andy Grove (Jira)
Andy Grove created ARROW-8650: - Summary: [Rust] [Website] Add documentation to Arrow website Key: ARROW-8650 URL: https://issues.apache.org/jira/browse/ARROW-8650 Project: Apache Arrow Issue Type

[jira] [Created] (ARROW-8649) [Java] [Website] Java documentation on website is hidden

2020-04-30 Thread Andy Grove (Jira)
Andy Grove created ARROW-8649: - Summary: [Java] [Website] Java documentation on website is hidden Key: ARROW-8649 URL: https://issues.apache.org/jira/browse/ARROW-8649 Project: Apache Arrow Issue

[jira] [Created] (ARROW-8648) [Rust] Optimize Rust CI Build Times

2020-04-30 Thread Mark Hildreth (Jira)
Mark Hildreth created ARROW-8648: Summary: [Rust] Optimize Rust CI Build Times Key: ARROW-8648 URL: https://issues.apache.org/jira/browse/ARROW-8648 Project: Apache Arrow Issue Type: Improvem

Re: [C++][Python] Highlighting some known problems with our Arrow C++ and Python packages

2020-04-30 Thread Wes McKinney
The proposal is for any BUNDLED dependency to be merged into libarrow.a (or another one of the static libraries if the dependency is only used in e.g. one subcomponent), so this applies to the AWS SDK also On Thu, Apr 30, 2020 at 3:02 AM Rémi Dettai wrote: > > Hi! > > Does your point 1 also apply

[jira] [Created] (ARROW-8647) [C++][Dataset] Optionally encode partition field values as dictionary type

2020-04-30 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8647: Summary: [C++][Dataset] Optionally encode partition field values as dictionary type Key: ARROW-8647 URL: https://issues.apache.org/jira/browse/ARROW-8647

Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-04-30 Thread Antoine Pitrou
If we want to discuss IO APIs we should do that comprehensively. There are various ways of expressing what we want to do (explicit readahead, fadvise-like APIs, async APIs, etc.). Regards Antoine. On 30/04/2020 at 15:08, Francois Saint-Jacques wrote: > One more point, > > It would seem ben

Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-04-30 Thread Francois Saint-Jacques
One more point, It would seem beneficial if we could express this in a `RandomAccessFile::ReadAhead(vector)` method: no async buffering/coalescing would be needed. In the case of Parquet, we'd get the _exact_ ranges computed from the metadata. This method would also possibly benefit other filesystems s
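
For readers following this thread, here is a rough sketch of what "coalescing" byte ranges within a single file could look like. It is not Arrow's implementation and does not use any Arrow API; the function name `coalesce_ranges` and the parameter `hole_size_limit` are hypothetical, chosen only for illustration. The assumption is that each read is described by an (offset, length) pair and that two reads are worth merging when the gap between them is small.

```python
from typing import List, Tuple

# A read request, described as (offset, length) in bytes.
Range = Tuple[int, int]


def coalesce_ranges(ranges: List[Range], hole_size_limit: int = 0) -> List[Range]:
    """Merge byte ranges whose gap is at most hole_size_limit bytes.

    Hypothetical helper for illustration only; not part of Arrow.
    """
    merged: List[Range] = []
    # Sort by offset and ignore empty reads.
    for offset, length in sorted(r for r in ranges if r[1] > 0):
        if merged:
            prev_offset, prev_length = merged[-1]
            prev_end = prev_offset + prev_length
            # Overlapping or close enough: extend the previous range.
            if offset - prev_end <= hole_size_limit:
                new_end = max(prev_end, offset + length)
                merged[-1] = (prev_offset, new_end - prev_offset)
                continue
        merged.append((offset, length))
    return merged


if __name__ == "__main__":
    column_chunks = [(0, 10), (12, 8), (100, 5)]
    print(coalesce_ranges(column_chunks, hole_size_limit=4))
```

With a larger `hole_size_limit`, fewer but bigger requests are issued, which tends to suit high-latency object stores; the right value is workload-dependent and is left open here.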

Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-04-30 Thread Francois Saint-Jacques
Hello David, I think that what you ask is achievable with the dataset API without much effort. You'd have to insert the pre-buffering at ParquetFileFormat::ScanFile [1]. The top-level Scanner::Scan method is essentially a generator that looks like flatmap(Iterator>). It consumes the fragment in-or

Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-04-30 Thread David Li
Sure, and we are still interested in collaborating. The main use case we have is scanning datasets in order of the partition key; it seems ordering is the only missing thing from Antoine's comments. However, from briefly playing around with the Python API, an application could manually order the fr

Re: [Discuss] Proposal for optimizing Datasets over S3/object storage

2020-04-30 Thread Joris Van den Bossche
On Thu, 30 Apr 2020 at 04:06, Wes McKinney wrote: > On Wed, Apr 29, 2020 at 6:54 PM David Li wrote: > > > > Ah, sorry, so I am being somewhat unclear here. Yes, you aren't > > guaranteed to download all the files in order, but with more control, > > you can make this more likely. You can also pr

Re: [NIGHTLY] Arrow Build Report for Job nightly-2020-04-29-0

2020-04-30 Thread Krisztián Szűcs
I suggest creating a GitHub Actions workflow to trigger these integration tests on pull requests when the relevant modules have changed: parquet.py, dataset.pyx etc. We have plenty of build failures, I'm trying to go through them. Given the regularly occurring nightly errors we should move some o

[NIGHTLY] Arrow Build Report for Job nightly-2020-04-30-0

2020-04-30 Thread Crossbow
Arrow Build Report for Job nightly-2020-04-30-0 All tasks: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-30-0 Failed Tasks: - centos-6-amd64: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-30-0-github-centos-6-amd64 - centos-7-amd64:

[jira] [Created] (ARROW-8646) Allow UnionListWriter to write null values

2020-04-30 Thread Thippana Vamsi Kalyan (Jira)
Thippana Vamsi Kalyan created ARROW-8646: Summary: Allow UnionListWriter to write null values Key: ARROW-8646 URL: https://issues.apache.org/jira/browse/ARROW-8646 Project: Apache Arrow

[jira] [Created] (ARROW-8645) [C++] Missing gflags dependency for plasma

2020-04-30 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-8645: -- Summary: [C++] Missing gflags dependency for plasma Key: ARROW-8645 URL: https://issues.apache.org/jira/browse/ARROW-8645 Project: Apache Arrow Issue Typ

Re: [NIGHTLY] Arrow Build Report for Job nightly-2020-04-29-0

2020-04-30 Thread Joris Van den Bossche
I opened issues to track the failing dask and pandas-master integration tests: https://issues.apache.org/jira/browse/ARROW-8643 https://issues.apache.org/jira/browse/ARROW-8644 On Wed, 29 Apr 2020 at 12:09, Crossbow wrote: > > Arrow Build Report for Job nightly-2020-04-29-0 > > All tasks: > ht

[jira] [Created] (ARROW-8644) [Python] Dask integration tests failing due to change in not including partition columns

2020-04-30 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8644: Summary: [Python] Dask integration tests failing due to change in not including partition columns Key: ARROW-8644 URL: https://issues.apache.org/jira/browse/ARROW-

[jira] [Created] (ARROW-8643) [Python] Tests with pandas master failing due to freq assertion

2020-04-30 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8643: Summary: [Python] Tests with pandas master failing due to freq assertion Key: ARROW-8643 URL: https://issues.apache.org/jira/browse/ARROW-8643 Projec

Re: [C++][Python] Highlighting some known problems with our Arrow C++ and Python packages

2020-04-30 Thread Rémi Dettai
Hi! Does your point 1 also apply to the AWS SDK dependency? Currently it seems that it cannot be built in BUNDLED mode. As stated in https://issues.apache.org/jira/browse/ARROW-8565 I struggled a lot to make a static build with the S3 dependency activated! I would really like to help on this bec

[jira] [Created] (ARROW-8642) Is there a good way to convert data types from numpy types to pyarrow DataType?

2020-04-30 Thread Anish Biswas (Jira)
Anish Biswas created ARROW-8642: --- Summary: Is there a good way to convert data types from numpy types to pyarrow DataType? Key: ARROW-8642 URL: https://issues.apache.org/jira/browse/ARROW-8642 Project: