[jira] [Created] (ARROW-7668) [Packaging][RPM] Use Ninja if possible to reduce build time

2020-01-23 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-7668: --- Summary: [Packaging][RPM] Use Ninja if possible to reduce build time Key: ARROW-7668 URL: https://issues.apache.org/jira/browse/ARROW-7668 Project: Apache Arrow

Re: [DISCUSS][JAVA] Correct the behavior of ListVector isEmpty

2020-01-23 Thread Micah Kornfield
I would vote for treating nulls as empty. On Fri, Jan 10, 2020 at 12:36 AM Ji Liu wrote: > Hi all, > > Currently the isEmpty API always returns false in BaseRepeatedValueVector, > and its subclass ListVector does not override this method. > This will lead to incorrect results, for example, a

Re: [Format] Make fields required?

2020-01-23 Thread Micah Kornfield
Looking at this, it seems like the main change is to require empty lists instead of null values? I think this might potentially be too strict for existing degenerate cases (e.g. empty files; I also don't remember if we said the null type requires a buffer). Most of the others like MessageHeader make

Re: [Java] Large Memory Allocators (Taking a dependency on JNA?)

2020-01-23 Thread Micah Kornfield
Sounds good, I'll leave it up to you which to implement. Thanks for taking it on. On Sun, Jan 19, 2020 at 8:47 PM Fan Liya wrote: > Hi Jacques and Micah, > > Thanks for the fruitful discussion. > > It seems netty based allocator and unsafe based allocator have their > specific advantages. >

Re: [DISCUSS] Format additions for encoding/compression

2020-01-23 Thread Micah Kornfield
Hi John, Not Wes, but my thoughts on this are as follows: 1. Alternate bit/byte arrangements can also be useful for processing [1] in addition to compression. 2. I think they are quite a bit more complicated than the existing schemes proposed in [2], so I think it would be more expedient to get

[Java] PR Reviewers

2020-01-23 Thread Micah Kornfield
I mentioned this elsewhere, but my intent is to stop doing Java reviews for the immediate future once I wrap up the few that I have requested changes on. I'm happy to try to triage incoming Java PRs, but in order to do this, I need to know which committers have some bandwidth to do reviews (some of

[Format] Array/RowBatch filters

2020-01-23 Thread Micah Kornfield
One of the things that I think got overlooked in the conversation on having a slice offset in the C API was a suggestion from Jacques of perhaps generalizing the concept to an arbitrary "filter" for arrays/record batches. I believe this point was also discussed in the past as well. I'm not

[jira] [Created] (ARROW-7667) [Packaging][deb] ubuntu-eoan is missing in nightly jobs

2020-01-23 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-7667: --- Summary: [Packaging][deb] ubuntu-eoan is missing in nightly jobs Key: ARROW-7667 URL: https://issues.apache.org/jira/browse/ARROW-7667 Project: Apache Arrow

[jira] [Created] (ARROW-7666) [Packaging][deb] Always use Ninja to reduce build time

2020-01-23 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-7666: --- Summary: [Packaging][deb] Always use Ninja to reduce build time Key: ARROW-7666 URL: https://issues.apache.org/jira/browse/ARROW-7666 Project: Apache Arrow

Re: PySpark failure [RE: [NIGHTLY] Arrow Build Report for Job nightly-2020-01-15-0]

2020-01-23 Thread Bryan Cutler
Thanks for investigating this and the quick fix Joris and Wes! I just have a couple questions about the behavior observed here. The pyspark code assigns either the same series back to the pandas.DataFrame or makes some modifications if it is a timestamp. In the case there are no timestamps, is

Re: [DISCUSS] Format additions for encoding/compression

2020-01-23 Thread John Muehlhausen
Wes, what do you think about Arrow supporting a new suite of fixed-length data types that unshuffle on column->Value(i) calls? This would allow memory/swap compressors and memory maps backed by compressing filesystems (ZFS) or block devices (VDO) to operate more efficiently. By doing it with new

[jira] [Created] (ARROW-7665) [R] linuxLibs.R should build in parallel

2020-01-23 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-7665: - Summary: [R] linuxLibs.R should build in parallel Key: ARROW-7665 URL: https://issues.apache.org/jira/browse/ARROW-7665 Project: Apache Arrow Issue Type:

[jira] [Created] (ARROW-7664) [C++] Extract localfs default from FileSystemFromUri

2020-01-23 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-7664: --- Summary: [C++] Extract localfs default from FileSystemFromUri Key: ARROW-7664 URL: https://issues.apache.org/jira/browse/ARROW-7664 Project: Apache Arrow

Re: [DISCUSS] Format additions for encoding/compression

2020-01-23 Thread Wes McKinney
On Thu, Jan 23, 2020 at 12:42 PM John Muehlhausen wrote: > > Again, I know very little about Parquet, so your patience is appreciated. > > At the moment I can Arrow/mmap a file without having anywhere near as > much available memory as the file size. I can visit random places in the > file

Re: [DISCUSS] Format additions for encoding/compression

2020-01-23 Thread John Muehlhausen
Again, I know very little about Parquet, so your patience is appreciated. At the moment I can Arrow/mmap a file without having anywhere near as much available memory as the file size. I can visit random places in the file (such as a binary search if it is ordered) and only the locations visited

[jira] [Created] (ARROW-7663) from_pandas gives TypeError instead of ArrowTypeError in some cases

2020-01-23 Thread David Li (Jira)
David Li created ARROW-7663: --- Summary: from_pandas gives TypeError instead of ArrowTypeError in some cases Key: ARROW-7663 URL: https://issues.apache.org/jira/browse/ARROW-7663 Project: Apache Arrow

Re: [DISCUSS] Format additions for encoding/compression

2020-01-23 Thread Wes McKinney
Parquet is most relevant in scenarios where filesystem IO is constrained (spinning rust HDD, network FS, cloud storage / S3 / GCS). For those use cases memory-mapped Arrow is not viable. Against local NVMe (> 2000 MB/s read throughput) your mileage may vary. On Thu, Jan 23, 2020 at 12:06 PM Francois

Re: [DISCUSS] Format additions for encoding/compression

2020-01-23 Thread John Muehlhausen
This could also have utility in memory via things like zram/zswap, right? Mac also has a memory compressor? I don't think Parquet is an option for me unless the integration with Arrow is tighter than I imagine (i.e. zero-copy). That said, I confess I know next to nothing about Parquet. On Thu,

Re: [DISCUSS] Format additions for encoding/compression

2020-01-23 Thread Antoine Pitrou
Forgot to give the URL: https://github.com/apache/arrow/pull/6005 Regards Antoine. On 23/01/2020 at 18:23, Antoine Pitrou wrote: > > On 23/01/2020 at 18:16, John Muehlhausen wrote: >> Perhaps related to this thread, are there any current or proposed tools to >> transform columns for

Re: [DISCUSS] Format additions for encoding/compression

2020-01-23 Thread Antoine Pitrou
On 23/01/2020 at 18:16, John Muehlhausen wrote: > Perhaps related to this thread, are there any current or proposed tools to > transform columns for fixed-length data types according to a "shuffle?" > For precedent see the implementation of the shuffle filter in hdf5. >

Re: [DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)

2020-01-23 Thread John Muehlhausen
Perhaps related to this thread, are there any current or proposed tools to transform columns for fixed-length data types according to a "shuffle?" For precedent see the implementation of the shuffle filter in hdf5.
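
For readers unfamiliar with the "shuffle" transform mentioned in this message, here is a minimal sketch of the general idea, assuming fixed-width values stored back to back in one buffer. The function names `shuffle_bytes` and `unshuffle_bytes` are hypothetical and are not part of the Arrow or HDF5 APIs; this illustrates the byte rearrangement only, not either project's implementation.

```python
import struct


def shuffle_bytes(data: bytes, width: int) -> bytes:
    """Group byte 0 of every value, then byte 1 of every value, and so on.

    `width` is the size in bytes of each fixed-length value (an assumption
    supplied by the caller, e.g. 4 for 32-bit integers).
    """
    if width <= 0 or len(data) % width != 0:
        raise ValueError("buffer length must be a multiple of the value width")
    # data[i::width] picks the i-th byte of each value.
    return b"".join(data[i::width] for i in range(width))


def unshuffle_bytes(data: bytes, width: int) -> bytes:
    """Invert shuffle_bytes, restoring the original value-by-value layout."""
    if width <= 0 or len(data) % width != 0:
        raise ValueError("buffer length must be a multiple of the value width")
    count = len(data) // width
    # data[j::count] picks byte 0, byte 1, ... of the j-th value.
    return b"".join(data[j::count] for j in range(count))


# Four little-endian 32-bit integers with small magnitudes.
values = struct.pack("<4I", 1, 2, 3, 4)
shuffled = shuffle_bytes(values, 4)
restored = unshuffle_bytes(shuffled, 4)
```

In this example the low-order bytes end up adjacent and the twelve zero high-order bytes form one long run, which is the kind of layout a general-purpose compressor tends to handle better than the interleaved original.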

[NIGHTLY] Arrow Build Report for Job nightly-2020-01-23-0

2020-01-23 Thread Crossbow
Arrow Build Report for Job nightly-2020-01-23-0 All tasks: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0 Failed Tasks: - conda-win-vs2015-py36: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0-azure-conda-win-vs2015-py36 -

[jira] [Created] (ARROW-7662) Support for auto-inferring list column->array in write_parquet

2020-01-23 Thread Michael Chirico (Jira)
Michael Chirico created ARROW-7662: -- Summary: Support for auto-inferring list column->array in write_parquet Key: ARROW-7662 URL: https://issues.apache.org/jira/browse/ARROW-7662 Project: Apache

[jira] [Created] (ARROW-7660) [C++][Gandiva] Optimise castVarchar(string, int) function for single byte characters

2020-01-23 Thread Projjal Chanda (Jira)
Projjal Chanda created ARROW-7660: - Summary: [C++][Gandiva] Optimise castVarchar(string, int) function for single byte characters Key: ARROW-7660 URL: https://issues.apache.org/jira/browse/ARROW-7660