Re: [Discuss] [Java] Implement vector diff functionality

2020-03-12 Thread Micah Kornfield
It probably pays to have at least two issues: one for porting the diff, and the existing one I made. On Wed, Mar 11, 2020 at 1:45 AM Ji Liu wrote: > Hi Micah, > Thanks for your feedback, you have opened an issue for Google's Truth[1] > and it was assigned to me, I'll try to use it. > > Thanks, >

Re: Summary of RLE and other compression efforts?

2020-03-12 Thread Micah Kornfield
Hi Evan, Seems like we are mostly on the same page. Some more notes below. > For example, encoding nulls in dictionary values helps reduce the need for both bitmap storage and lookup. I'm not sure if this was provided as an example of something to add, but I believe this is already

Re: Coordinating / scheduling C++ Parquet-Arrow nested data work (ARROW-1644 and others)

2020-03-12 Thread Micah Kornfield
Maarten, I don't expect regressions for flat cases (I'm going to try to run a benchmark comparison tonight). In terms of the flag, I'm more concerned about some corner case I didn't think of in testing or a workload that for some reason is better with the prior code. If either of these arise I

[jira] [Created] (ARROW-8109) [Packaging][APT] Drop support for Ubuntu Disco

2020-03-12 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-8109: --- Summary: [Packaging][APT] Drop support for Ubuntu Disco Key: ARROW-8109 URL: https://issues.apache.org/jira/browse/ARROW-8109 Project: Apache Arrow Issue

[jira] [Created] (ARROW-8108) [Java] Extract a common interface for dictionary encoders

2020-03-12 Thread Liya Fan (Jira)
Liya Fan created ARROW-8108: --- Summary: [Java] Extract a common interface for dictionary encoders Key: ARROW-8108 URL: https://issues.apache.org/jira/browse/ARROW-8108 Project: Apache Arrow Issue

[jira] [Created] (ARROW-8107) [Packaging][APT] Use HTTPS for LLVM APT repository for Debian GNU/Linux stretch

2020-03-12 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-8107: --- Summary: [Packaging][APT] Use HTTPS for LLVM APT repository for Debian GNU/Linux stretch Key: ARROW-8107 URL: https://issues.apache.org/jira/browse/ARROW-8107 Project:

[jira] [Created] (ARROW-8106) [Python] Builds on master broken by pandas 1.0.2 release

2020-03-12 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8106: --- Summary: [Python] Builds on master broken by pandas 1.0.2 release Key: ARROW-8106 URL: https://issues.apache.org/jira/browse/ARROW-8106 Project: Apache Arrow

[jira] [Created] (ARROW-8105) [Python] pyarrow.array segfaults when passed masked array with shrunken mask

2020-03-12 Thread Daniel Nugent (Jira)
Daniel Nugent created ARROW-8105: Summary: [Python] pyarrow.array segfaults when passed masked array with shrunken mask Key: ARROW-8105 URL: https://issues.apache.org/jira/browse/ARROW-8105 Project:

[jira] [Created] (ARROW-8104) [C++] Don't install bundled Thrift

2020-03-12 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-8104: --- Summary: [C++] Don't install bundled Thrift Key: ARROW-8104 URL: https://issues.apache.org/jira/browse/ARROW-8104 Project: Apache Arrow Issue Type:

Re: [DISCUSS] Leveraging cloud computing resources for Arrow test workloads

2020-03-12 Thread Brian Hulette
* What kind of devops tooling would be appropriate to provision and manage the instances, scaling up and down based on need? * What CI/CD platform would be appropriate to dispatch work to the cloud nodes (taking into consideration the high costs of sysadmin, and seeking to minimize nodes sitting

Re: TimestampMilliTZ usage question

2020-03-12 Thread Micah Kornfield
Hi Soojin, > Why have the timezone info in the TimeStampMilliTZHolder if it's never set? My guess is this is an oversight (it should probably be removed). > - What if each of the rows have different TZ info? Are we to shift to the tz on the TimeStampMilliTZVector on the write? Currently there

Re: TimestampMilliTZ usage question

2020-03-12 Thread Soojin Jeong
Cc Anthony On Thu, Mar 12, 2020 at 5:27 PM Soojin Jeong wrote: > Hi team, > > I would like to know what the recommended way of using TimeStampMilliTZ > is. > > I see that > - TimeStampMilliTZVector contains timestamp arrow type with timezone > - TimeStampMilliTZVector's set doesn't actually

Re: New Comment Bot

2020-03-12 Thread Wes McKinney
Great work! GitHub Actions has been a huge boon to the project on so many fronts. Let's hope that GitHub / Microsoft keep up the free open source project resources. On Thu, Mar 12, 2020 at 1:40 PM Neal Richardson wrote: > > Thanks Krisztián! > > On Thu, Mar 12, 2020 at 11:07 AM Krisztián Szűcs

[jira] [Created] (ARROW-8103) [R] Make default Linux build more minimal

2020-03-12 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-8103: -- Summary: [R] Make default Linux build more minimal Key: ARROW-8103 URL: https://issues.apache.org/jira/browse/ARROW-8103 Project: Apache Arrow Issue

[jira] [Created] (ARROW-8102) [Dev] Crossbow's version detection doesn't work in the comment bot's scenario

2020-03-12 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-8102: -- Summary: [Dev] Crossbow's version detection doesn't work in the comment bot's scenario Key: ARROW-8102 URL: https://issues.apache.org/jira/browse/ARROW-8102

[jira] [Created] (ARROW-8101) [FlightRPC][Java] Can't read/write only an empty null array

2020-03-12 Thread David Li (Jira)
David Li created ARROW-8101: --- Summary: [FlightRPC][Java] Can't read/write only an empty null array Key: ARROW-8101 URL: https://issues.apache.org/jira/browse/ARROW-8101 Project: Apache Arrow

RE: [jira] [Created] (ARROW-8100) timestamp[ms] and date64 data types not working as expected on write

2020-03-12 Thread Lee, David
I've never used cast(). I've converted Python datetimes to pa.timestamp(s) using: pyarrow.array(obj, type=None, mask=None, size=None, from_pandas=None, bool safe=True, MemoryPool memory_pool=None) where type is pa.timestamp("ms") -Original Message- From: paul hess (Jira) Sent:

[jira] [Created] (ARROW-8100) timestamp[ms] and date64 data types not working as expected on write

2020-03-12 Thread paul hess (Jira)
paul hess created ARROW-8100: Summary: timestamp[ms] and date64 data types not working as expected on write Key: ARROW-8100 URL: https://issues.apache.org/jira/browse/ARROW-8100 Project: Apache Arrow

[jira] [Created] (ARROW-8098) [go] Checkptr Failures on Go 1.14

2020-03-12 Thread Kevin Conaway (Jira)
Kevin Conaway created ARROW-8098: Summary: [go] Checkptr Failures on Go 1.14 Key: ARROW-8098 URL: https://issues.apache.org/jira/browse/ARROW-8098 Project: Apache Arrow Issue Type: Bug

[jira] [Created] (ARROW-8097) [Dev] Comment bot's crossbow command acts on the master branch

2020-03-12 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-8097: -- Summary: [Dev] Comment bot's crossbow command acts on the master branch Key: ARROW-8097 URL: https://issues.apache.org/jira/browse/ARROW-8097 Project: Apache

Re: New Comment Bot

2020-03-12 Thread Neal Richardson
Thanks Krisztián! On Thu, Mar 12, 2020 at 11:07 AM Krisztián Szűcs wrote: > Hi, > > Since the Ursa-labs machines are down @ursabot comment > bot was not operational. Luckily Github Actions is able to > listen on more kinds of Github events like the issue_comment, > so I've ported [1] the

New Comment Bot

2020-03-12 Thread Krisztián Szűcs
Hi, Since the Ursa-labs machines are down, the @ursabot comment bot was not operational. Luckily, GitHub Actions is able to listen on more kinds of GitHub events, like issue_comment, so I've ported [1] the comment bot to work without the buildbot buildmaster. So the comment bot is available again

[jira] [Created] (ARROW-8096) [C++][Gandiva] Create null node of Interval type

2020-03-12 Thread Prudhvi Porandla (Jira)
Prudhvi Porandla created ARROW-8096: --- Summary: [C++][Gandiva] Create null node of Interval type Key: ARROW-8096 URL: https://issues.apache.org/jira/browse/ARROW-8096 Project: Apache Arrow

Re: [NIGHTLY] Arrow Build Report for Job nightly-2020-03-11-0

2020-03-12 Thread Neal Richardson
I made several tickets for the failures, ARROW-8091 through ARROW-8095, and have a patch up for 8091. I did not ticket conda-win, centos-8, debian-stretch, or gandiva-jar. Seems like they may be flaky or already resolved. Neal On Wed, Mar 11, 2020 at 5:35 PM Crossbow wrote: > > Arrow Build Report for Job

[jira] [Created] (ARROW-8095) [CI][Crossbow] Nightly turbodbc job fails

2020-03-12 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-8095: -- Summary: [CI][Crossbow] Nightly turbodbc job fails Key: ARROW-8095 URL: https://issues.apache.org/jira/browse/ARROW-8095 Project: Apache Arrow Issue

[jira] [Created] (ARROW-8094) [CI][Crossbow] Nightly valgrind test fails

2020-03-12 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-8094: -- Summary: [CI][Crossbow] Nightly valgrind test fails Key: ARROW-8094 URL: https://issues.apache.org/jira/browse/ARROW-8094 Project: Apache Arrow Issue

[jira] [Created] (ARROW-8093) [CI][Crossbow] Pandas integration test fails

2020-03-12 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-8093: -- Summary: [CI][Crossbow] Pandas integration test fails Key: ARROW-8093 URL: https://issues.apache.org/jira/browse/ARROW-8093 Project: Apache Arrow Issue

[jira] [Created] (ARROW-8092) [CI][Crossbow] OSX wheels fail on bundled bzip2

2020-03-12 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-8092: -- Summary: [CI][Crossbow] OSX wheels fail on bundled bzip2 Key: ARROW-8092 URL: https://issues.apache.org/jira/browse/ARROW-8092 Project: Apache Arrow

[jira] [Created] (ARROW-8091) [CI][Crossbow] Fix nightly homebrew and R failures

2020-03-12 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-8091: -- Summary: [CI][Crossbow] Fix nightly homebrew and R failures Key: ARROW-8091 URL: https://issues.apache.org/jira/browse/ARROW-8091 Project: Apache Arrow

[jira] [Created] (ARROW-8090) [C++][Compute] Implement stateful TopK operator node

2020-03-12 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8090: --- Summary: [C++][Compute] Implement stateful TopK operator node Key: ARROW-8090 URL: https://issues.apache.org/jira/browse/ARROW-8090 Project: Apache Arrow

[jira] [Created] (ARROW-8089) [C++] Port the toolchain build from Appveyor to Github Actions

2020-03-12 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-8089: -- Summary: [C++] Port the toolchain build from Appveyor to Github Actions Key: ARROW-8089 URL: https://issues.apache.org/jira/browse/ARROW-8089 Project: Apache

Re: Coordinating / scheduling C++ Parquet-Arrow nested data work (ARROW-1644 and others)

2020-03-12 Thread Wes McKinney
Maarten -- AFAIK Micah's work only affects nested / non-flat column paths, so flat data should not be impacted. Since we have a partial implementation of writes for nested data (lists-of-lists and structs-of-structs, but no mix of the two) that was the performance difference I was referencing. On

Re: Coordinating / scheduling C++ Parquet-Arrow nested data work (ARROW-1644 and others)

2020-03-12 Thread Maarten Ballintijn
Hi Micah, How does the performance change for “flat” schemas? (particularly in the case of a large number of columns) Thanks, Maarten > On Mar 11, 2020, at 11:53 PM, Micah Kornfield wrote: > > Another status update. I've integrated the level generation code with the > parquet writing code

Re: Coordinating / scheduling C++ Parquet-Arrow nested data work (ARROW-1644 and others)

2020-03-12 Thread Wes McKinney
hi Micah, Great to hear about the progress, I'll help with code review. FWIW, if the new code passes the existing unit tests I would be in favor of deleting the old code so that we're fully invested in making the new code suitably fast. Jump in with two feet, so to speak. Thanks Wes On Wed,

[jira] [Created] (ARROW-8088) [C++][Dataset] Partition columns with specified dictionary type result in all nulls

2020-03-12 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8088: Summary: [C++][Dataset] Partition columns with specified dictionary type result in all nulls Key: ARROW-8088 URL: https://issues.apache.org/jira/browse/ARROW-8088

[jira] [Created] (ARROW-8087) [C++][Dataset] Order of keys with HivePartitioning is lost in resulting schema

2020-03-12 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8087: Summary: [C++][Dataset] Order of keys with HivePartitioning is lost in resulting schema Key: ARROW-8087 URL: https://issues.apache.org/jira/browse/ARROW-8087

[jira] [Created] (ARROW-8086) [Java] Support writing decimal from big endian byte array in UnionListWriter

2020-03-12 Thread Projjal Chanda (Jira)
Projjal Chanda created ARROW-8086: - Summary: [Java] Support writing decimal from big endian byte array in UnionListWriter Key: ARROW-8086 URL: https://issues.apache.org/jira/browse/ARROW-8086

[jira] [Created] (ARROW-8085) [Dev] Set JIRA ticket's status to in progress once a pull request available

2020-03-12 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-8085: -- Summary: [Dev] Set JIRA ticket's status to in progress once a pull request available Key: ARROW-8085 URL: https://issues.apache.org/jira/browse/ARROW-8085

[jira] [Created] (ARROW-8084) [Crossbow] Port crossbow to Archery and eliminate libgit2 dependency

2020-03-12 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-8084: -- Summary: [Crossbow] Port crossbow to Archery and eliminate libgit2 dependency Key: ARROW-8084 URL: https://issues.apache.org/jira/browse/ARROW-8084 Project:

[jira] [Created] (ARROW-8083) [GLib] Add support for Peek() to GIOInputStream

2020-03-12 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-8083: --- Summary: [GLib] Add support for Peek() to GIOInputStream Key: ARROW-8083 URL: https://issues.apache.org/jira/browse/ARROW-8083 Project: Apache Arrow Issue

[jira] [Created] (ARROW-8082) [Plasma]Add JNI list() interface

2020-03-12 Thread KunshangJi (Jira)
KunshangJi created ARROW-8082: - Summary: [Plasma]Add JNI list() interface Key: ARROW-8082 URL: https://issues.apache.org/jira/browse/ARROW-8082 Project: Apache Arrow Issue Type: Improvement