Re: Timeline for 0.15.0 release

2019-09-24 Thread Andy Grove
I found a last-minute issue with DataFusion (Rust) and would appreciate it
if we could merge ARROW-6086 (PR is
https://github.com/apache/arrow/pull/5494) before cutting the RC.

Thanks,

Andy.


On Tue, Sep 24, 2019 at 6:19 PM Micah Kornfield 
wrote:

> OK, I'm going to postpone cutting a release until tomorrow (hoping we can
> get the issues resolved by then). I'll also try to review the third-party
> additions since 0.14.x.

[jira] [Created] (ARROW-6683) [Python] Add unit tests that validate cross-compatibility with pyarrow.parquet when fastparquet is installed

2019-09-24 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6683:
---

 Summary: [Python] Add unit tests that validate cross-compatibility 
with pyarrow.parquet when fastparquet is installed
 Key: ARROW-6683
 URL: https://issues.apache.org/jira/browse/ARROW-6683
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Wes McKinney
 Fix For: 1.0.0


This will help prevent such issues as ARROW-6678 from recurring



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6682) Arrow Hangs on Large Files (10-12gb)

2019-09-24 Thread Anthony Abate (Jira)
Anthony Abate created ARROW-6682:


 Summary: Arrow Hangs on Large Files (10-12gb)
 Key: ARROW-6682
 URL: https://issues.apache.org/jira/browse/ARROW-6682
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, R
Affects Versions: 0.14.1
Reporter: Anthony Abate


I get random hangs on arrow_read in R (Windows) when using a very large file
(10-12 GB).

I have memory dumps; all threads seem to be in wait handles.

Are there debug symbols somewhere? 

Is there a way to get the C++ code to produce diagnostic logging from R? 





[jira] [Created] (ARROW-6681) [C# -> R] - Record Batches in reverse order?

2019-09-24 Thread Anthony Abate (Jira)
Anthony Abate created ARROW-6681:


 Summary: [C# -> R] - Record Batches in reverse order?
 Key: ARROW-6681
 URL: https://issues.apache.org/jira/browse/ARROW-6681
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C#
Affects Versions: 0.14.1
Reporter: Anthony Abate


Are 'RecordBatches' in C# being written in reverse order?

I made a simple test which writes 100 record batches, each containing a single 
row with a value from 0 to 99, and attempted to read the file in R. To my 
surprise, batch(0) in R had the value 99, not 0.

This may not seem like a big deal; however, when dealing with huge files, it's 
more efficient to use record batches / index lookup than to attempt to load the 
entire file into memory.

Having the order consistent across the different language APIs only makes 
sense. For now I can work around this by reversing the order before 
writing.

 

https://github.com/apache/arrow/issues/5475

 





[jira] [Created] (ARROW-6680) [Python] Add Array ctor microbenchmarks

2019-09-24 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6680:
---

 Summary: [Python] Add Array ctor microbenchmarks
 Key: ARROW-6680
 URL: https://issues.apache.org/jira/browse/ARROW-6680
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Wes McKinney
 Fix For: 1.0.0


Since more unavoidable validation is being added (e.g. in
https://github.com/apache/arrow/pull/5488), it would be useful to track Array
constructor performance with microbenchmarks.





Re: Timeline for 0.15.0 release

2019-09-24 Thread Micah Kornfield
OK, I'm going to postpone cutting a release until tomorrow (hoping we can
get the issues resolved by then). I'll also try to review the third-party
additions since 0.14.x.

On Tue, Sep 24, 2019 at 4:20 PM Wes McKinney  wrote:

> I found a licensing issue
>
> https://issues.apache.org/jira/browse/ARROW-6679
>
> It might be worth examining third party code added to the project
> since 0.14.x to make sure there are no other such issues.

Re: Timeline for 0.15.0 release

2019-09-24 Thread Wes McKinney
I found a licensing issue

https://issues.apache.org/jira/browse/ARROW-6679

It might be worth examining third party code added to the project
since 0.14.x to make sure there are no other such issues.

On Tue, Sep 24, 2019 at 6:10 PM Wes McKinney  wrote:
>
> I have diagnosed the problem (Thrift "string" data must be UTF-8,
> cannot be arbitrary binary) and am working on a patch right now

[jira] [Created] (ARROW-6679) [RELEASE] autobrew license in LICENSE.txt is not acceptable

2019-09-24 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6679:
---

 Summary: [RELEASE] autobrew license in LICENSE.txt is not 
acceptable
 Key: ARROW-6679
 URL: https://issues.apache.org/jira/browse/ARROW-6679
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Reporter: Wes McKinney
 Fix For: 0.15.0


{code}
This project includes code from the autobrew project.

* r/tools/autobrew and dev/tasks/homebrew-formulae/autobrew/apache-arrow.rb
  are based on code from the autobrew project.

Copyright: Copyright (c) 2017 - 2019, Jeroen Ooms.
All rights reserved.
Homepage: https://github.com/jeroen/autobrew
{code}

This code needs to be made available under a Category A license

https://apache.org/legal/resolved.html#category-a





Re: Timeline for 0.15.0 release

2019-09-24 Thread Wes McKinney
I have diagnosed the problem (Thrift "string" data must be UTF-8; it
cannot be arbitrary binary) and am working on a patch right now

On Tue, Sep 24, 2019 at 6:02 PM Wes McKinney  wrote:
>
> I just opened
>
> https://issues.apache.org/jira/browse/ARROW-6678
>
> Please don't cut an RC until I have an opportunity to diagnose this,
> will report back.

Re: Timeline for 0.15.0 release

2019-09-24 Thread Wes McKinney
I just opened

https://issues.apache.org/jira/browse/ARROW-6678

Please don't cut an RC until I have an opportunity to diagnose this,
will report back.


On Tue, Sep 24, 2019 at 5:51 PM Wes McKinney  wrote:
>
> I'm investigating a possible Parquet-related compatibility bug that I
> encountered through some routine testing / benchmarking. I'll report
> back once I figure out what is going on (if anything)
>
> On Sun, Sep 22, 2019 at 11:51 PM Micah Kornfield  
> wrote:
> >>
> >> It's ideal if your GPG key is in the web of trust (i.e. you can get it
> >> signed by another PMC member), but is not 100% essential.
> >
> > That won't be an option for me this week (it seems like I would need to 
> > meet one face-to-face).  I'll try to get the GPG checked in and the rest of 
> > the pre-requisites done tomorrow (Monday) to hopefully start the release on 
> > Tuesday (hopefully we can solve the last blocker/integration tests by then).
> >
> > On Sat, Sep 21, 2019 at 7:12 PM Wes McKinney  wrote:
> >>
> >> It's ideal if your GPG key is in the web of trust (i.e. you can get it
> >> signed by another PMC member), but is not 100% essential.
> >>
> >> Speaking of the release, there are at least 2 code changes I still
> >> want to get in
> >>
> >> ARROW-5717
> >> ARROW-6353
> >>
> >> I just pushed updates to ARROW-5717, will merge once the build is green.
> >>
> >> There are a couple of Rust patches still marked for 0.15. The rest
> >> seems to be documentation and a couple of integration test failures we
> >> should see about fixing in time.
> >>
> >> On Fri, Sep 20, 2019 at 11:26 PM Micah Kornfield  
> >> wrote:
> >> >
> >> > Thanks Krisztián and Wes,
> >> > I've gone ahead and started registering myself on all the packaging 
> >> > sites.
> >> >
> >> > Is there any review process when adding my GPG key to the SVN file? [1]
> >> > doesn't seem to mention explicitly.
> >> >
> >> > Thanks,
> >> > Micah
> >> >
> >> > [1] https://www.apache.org/dev/version-control.html#https-svn
> >> >
> >> > On Fri, Sep 20, 2019 at 5:01 PM Krisztián Szűcs 
> >> > 
> >> > wrote:
> >> >
> >> > > On Thu, Sep 19, 2019 at 5:52 PM Wes McKinney  
> >> > > wrote:
> >> > >
> >> > >> On Thu, Sep 19, 2019 at 12:13 AM Micah Kornfield 
> >> > >> 
> >> > >> wrote:
> >> > >> >>
> >> > >> >> The process should be well documented at this point but there are a
> >> > >> >> number of steps.
> >> > >> >
> >> > >> > Is [1] the up-to-date documentation for the release?  Are there
> >> > >> instructions for adding the code signing key to SVN?
> >> > >> >
> >> > >> > I will make a go of it.  I will try to mitigate any internet issues
> >> > >> > by
> >> > >> doing the process from a cloud instance (I assume that isn't a
> >> > >> problem?).
> >> > >> >
> >> > >>
> >> > >> Setting up a new cloud environment suitable for producing an RC may be
> >> > >> time consuming, but you are welcome to try. Krisztian -- are you
> >> > >> available next week to help Micah and potentially take over producing
> >> > >> the RC if there are issues?
> >> > >>
> >> > > Sure, I'll be available next week. We can also grant access to
> >> > > https://github.com/ursa-labs/crossbow because configuring all
> >> > > the CI backends can be time consuming.
> >> > >
> >> > >>
> >> > >> > Thanks,
> >> > >> > Micah
> >> > >> >
> >> > >> > [1]
> >> > >> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide
> >> > >> >
> >> > >> > On Wed, Sep 18, 2019 at 8:29 AM Wes McKinney 
> >> > >> wrote:
> >> > >> >>
> >> > >> >> The process should be well documented at this point but there are a
> >> > >> >> number of steps. Note that you need to add your code signing key to
> >> > >> >> the KEYS file in SVN (that's not very hard to do). I think it's 
> >> > >> >> fine
> >> > >> >> to hand off the process to others after the VOTE but it would be
> >> > >> >> tricky to have multiple RMs involved with producing the source and
> >> > >> >> binary artifacts for the vote
> >> > >> >>
> >> > >> >> On Tue, Sep 17, 2019 at 10:55 PM Micah Kornfield <
> >> > >> emkornfi...@gmail.com> wrote:
> >> > >> >> >
> >> > >> >> > SGTM, as well.
> >> > >> >> >
> >> > >> >> > I should have a little bit of time next week if I can help as RM 
> >> > >> >> > but
> >> > >> I have
> >> > >> >> > a couple of concerns:
> >> > >> >> > 1.  In the past I've had trouble downloading and validating
> >> > >> releases. I'm a
> >> > >> >> > bit worried, that I might have similar problems doing the 
> >> > >> >> > necessary
> >> > >> uploads.
> >> > >> >> > 2.  My internet connection will likely be not great, I don't 
> >> > >> >> > know if
> >> > >> this
> >> > >> >> > would make it even less likely to be successful.
> >> > >> >> >
> >> > >> >> > Does it become problematic if somehow I would have to abandon the
> >> > >> process
> >> > >> >> > mid-release?  Is there anyone who could serve as a backup?  Are 
> >> > >> >> > the
> >> > >> steps
> >> > >> >> > well documented?
> >> > >> >> >
> >> > >> >> > Thanks,
> >> 

[jira] [Created] (ARROW-6678) [C++] Regression in Parquet file compatibility introduced by ARROW-3246

2019-09-24 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6678:
---

 Summary: [C++] Regression in Parquet file compatibility introduced 
by ARROW-3246
 Key: ARROW-6678
 URL: https://issues.apache.org/jira/browse/ARROW-6678
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.15.0


I randomly discovered that this script fails after applying the patch for 
ARROW-3246

https://github.com/apache/arrow/commit/2ba0566b29312e84fafc987fd8dc9664748be96a

{code}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import fastparquet as fp

df = pd.util.testing.makeDataFrame()

pq.write_table(pa.table(df), 'test.parquet')

fp.ParquetFile('test.parquet')
{code}

with the following traceback:

{code}
Traceback (most recent call last):
  File 
"/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/api.py",
 line 110, in __init__
with open_with(fn2, 'rb') as f:
  File 
"/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/util.py",
 line 38, in default_open
return open(f, mode)
NotADirectoryError: [Errno 20] Not a directory: 'test.parquet/_metadata'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "test.py", line 10, in 
fp.ParquetFile('test.parquet')
  File 
"/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/api.py",
 line 116, in __init__
self._parse_header(f, verify)
  File 
"/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/api.py",
 line 135, in _parse_header
fmd = read_thrift(f, parquet_thrift.FileMetaData)
  File 
"/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/thrift_structures.py",
 line 25, in read_thrift
obj.read(pin)
  File 
"/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/site-packages/fastparquet/parquet_thrift/parquet/ttypes.py",
 line 1929, in read
iprot._fast_decode(self, iprot, [self.__class__, self.thrift_spec])
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb4 in position 0: invalid 
start byte
{code}

I don't recall making any metadata-related changes, but I'm going to review the 
patch to narrow down where the problem is and determine whether it's a bug in 
Arrow/parquet-cpp or in the third-party library
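The failure mode can be reproduced in miniature without Parquet at all: the byte 0xb4 from the traceback is not a valid UTF-8 start byte, so any strict decoder that treats the field as text will raise. A stdlib-only sketch (the byte string is a hypothetical stand-in for binary statistics stored in a Thrift "string" field):

```python
# Thrift "string" fields are expected to hold UTF-8 text; if raw binary
# (e.g. min/max statistics bytes) is stored there instead, a strict
# reader fails exactly like the fastparquet traceback above.
raw = b"\xb4\x01\x02"  # hypothetical non-UTF-8 bytes in a string field

try:
    raw.decode("utf-8")
    outcome = "decoded"
except UnicodeDecodeError:
    outcome = "UnicodeDecodeError"

print(outcome)  # prints: UnicodeDecodeError
```

This is why a writer that permits arbitrary binary in such fields can produce files that lenient readers accept but strict readers reject.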





Re: Timeline for 0.15.0 release

2019-09-24 Thread Wes McKinney
I'm investigating a possible Parquet-related compatibility bug that I
encountered through some routine testing / benchmarking. I'll report
back once I figure out what is going on (if anything)

On Sun, Sep 22, 2019 at 11:51 PM Micah Kornfield  wrote:
>>
>> It's ideal if your GPG key is in the web of trust (i.e. you can get it
>> signed by another PMC member), but is not 100% essential.
>
> That won't be an option for me this week (it seems like I would need to meet 
> one face-to-face).  I'll try to get the GPG checked in and the rest of the 
> pre-requisites done tomorrow (Monday) to hopefully start the release on 
> Tuesday (hopefully we can solve the last blocker/integration tests by then).
>
> On Sat, Sep 21, 2019 at 7:12 PM Wes McKinney  wrote:
>>
>> It's ideal if your GPG key is in the web of trust (i.e. you can get it
>> signed by another PMC member), but is not 100% essential.
>>
>> Speaking of the release, there are at least 2 code changes I still
>> want to get in
>>
>> ARROW-5717
>> ARROW-6353
>>
>> I just pushed updates to ARROW-5717, will merge once the build is green.
>>
>> There are a couple of Rust patches still marked for 0.15. The rest
>> seems to be documentation and a couple of integration test failures we
>> should see about fixing in time.
>>
>> On Fri, Sep 20, 2019 at 11:26 PM Micah Kornfield  
>> wrote:
>> >
>> > Thanks Krisztián and Wes,
>> > I've gone ahead and started registering myself on all the packaging sites.
>> >
>> > Is there any review process when adding my GPG key to the SVN file? [1]
>> > doesn't seem to mention explicitly.
>> >
>> > Thanks,
>> > Micah
>> >
>> > [1] https://www.apache.org/dev/version-control.html#https-svn
>> >
>> > On Fri, Sep 20, 2019 at 5:01 PM Krisztián Szűcs 
>> > wrote:
>> >
>> > > On Thu, Sep 19, 2019 at 5:52 PM Wes McKinney  wrote:
>> > >
>> > >> On Thu, Sep 19, 2019 at 12:13 AM Micah Kornfield 
>> > >> wrote:
>> > >> >>
>> > >> >> The process should be well documented at this point but there are a
>> > >> >> number of steps.
>> > >> >
>> > >> > Is [1] the up-to-date documentation for the release?   Are there
>> > >> instructions for the adding the code signing Key to SVN?
>> > >> >
> > >> > I will make a go of it.  I will try to mitigate any internet issues by
> > >> doing the process from a cloud instance (I assume that isn't a problem?).
>> > >> >
>> > >>
>> > >> Setting up a new cloud environment suitable for producing an RC may be
>> > >> time consuming, but you are welcome to try. Krisztian -- are you
>> > >> available next week to help Micah and potentially take over producing
>> > >> the RC if there are issues?
>> > >>
>> > > Sure, I'll be available next week. We can also grant access to
>> > > https://github.com/ursa-labs/crossbow because configuring all
>> > > the CI backends can be time consuming.
>> > >
>> > >>
>> > >> > Thanks,
>> > >> > Micah
>> > >> >
>> > >> > [1]
>> > >> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide
>> > >> >
>> > >> > On Wed, Sep 18, 2019 at 8:29 AM Wes McKinney 
>> > >> wrote:
>> > >> >>
>> > >> >> The process should be well documented at this point but there are a
>> > >> >> number of steps. Note that you need to add your code signing key to
>> > >> >> the KEYS file in SVN (that's not very hard to do). I think it's fine
>> > >> >> to hand off the process to others after the VOTE but it would be
>> > >> >> tricky to have multiple RMs involved with producing the source and
>> > >> >> binary artifacts for the vote
>> > >> >>
>> > >> >> On Tue, Sep 17, 2019 at 10:55 PM Micah Kornfield <
>> > >> emkornfi...@gmail.com> wrote:
>> > >> >> >
>> > >> >> > SGTM, as well.
>> > >> >> >
>> > >> >> > I should have a little bit of time next week if I can help as RM 
>> > >> >> > but
>> > >> I have
>> > >> >> > a couple of concerns:
>> > >> >> > 1.  In the past I've had trouble downloading and validating
>> > >> releases. I'm a
>> > >> >> > bit worried, that I might have similar problems doing the necessary
>> > >> uploads.
>> > >> >> > 2.  My internet connection will likely be not great, I don't know 
>> > >> >> > if
>> > >> this
>> > >> >> > would make it even less likely to be successful.
>> > >> >> >
>> > >> >> > Does it become problematic if somehow I would have to abandon the
>> > >> process
>> > >> >> > mid-release?  Is there anyone who could serve as a backup?  Are the
>> > >> steps
>> > >> >> > well documented?
>> > >> >> >
>> > >> >> > Thanks,
>> > >> >> > Micah
>> > >> >> >
>> > >> >> > On Tue, Sep 17, 2019 at 4:25 PM Neal Richardson <
>> > >> neal.p.richard...@gmail.com>
>> > >> >> > wrote:
>> > >> >> >
>> > >> >> > > Sounds good to me.
>> > >> >> > >
>> > >> >> > > Do we have a release manager yet? Any volunteers?
>> > >> >> > >
>> > >> >> > > Neal
>> > >> >> > >
>> > >> >> > > On Tue, Sep 17, 2019 at 4:06 PM Wes McKinney 
>> > >> >> > > 
>> > >> wrote:
>> > >> >> > >
>> > >> >> > > > hi all,
>> > >> >> > > >
>> > >> >> > > > It looks like we're drawing close to be able to make the 

Re: Parquet file reading performance

2019-09-24 Thread Maarten Ballintijn
Hi,

The code to show the performance issue with DateTimeIndex is at:

https://gist.github.com/maartenb/256556bcd6d7c7636d400f3b464db18c

It shows three cases: 0) an int index, 1) a datetime index, 2) a datetime index 
created in a slightly roundabout way.

I’m a little confused by the two datetime cases. Case 2) is much slower, but 
the df compares identical to case 1).
(I originally used something like 2) to match our specific data. I don’t see 
why it behaves differently?)
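The difference between the two constructions may come down to index metadata rather than the values themselves. A guess at the shape of the two cases (the actual code is in the gist above -- this is only an illustrative sketch):

```python
import pandas as pd

# Case 1: a regular datetime index created directly
idx1 = pd.date_range("2019-01-01", periods=8, freq="min")

# Case 2: a "roundabout" construction (hypothetical -- see the gist for the
# real one): rebuilding the index from its individual timestamps
idx2 = pd.DatetimeIndex(list(idx1))

# The two indexes compare identical...
assert idx1.equals(idx2)

# ...but metadata such as the inferred frequency differs, so downstream
# conversions may take different code paths.
print(idx1.freq, idx2.freq)  # e.g. "<Minute>" vs None
```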

The timings I find are:

1073741824 float64 8388608 16
0: make_dataframe :   2390.830 msec,  428 MB/s 
0: write_arrow_parquet:   2486.463 msec,  412 MB/s 
0: read_arrow_parquet :813.946 msec,  1258 MB/s <<<
1: make_dataframe :   2579.815 msec,  397 MB/s 
1: write_arrow_parquet:   2708.151 msec,  378 MB/s 
1: read_arrow_parquet :   1413.999 msec,  724 MB/s <<<
2: make_dataframe :  15126.520 msec,  68 MB/s 
2: write_arrow_parquet:   9205.815 msec,  111 MB/s 
2: read_arrow_parquet :   5929.346 msec,  173 MB/s <<<

Case 0, int index. This is all great.
Case 1, datetime index. We lose almost half the speed. Given that a datetime 
is only scaled from Pandas IIRC, that seems like a lot?
Case 2, the other datetime index. No idea what is going on.

Any insights are much appreciated.

Cheers,
Maarten.

> On Sep 24, 2019, at 11:25 AM, Wes McKinney  wrote:
> 
> hi
> 
> On Tue, Sep 24, 2019 at 9:26 AM Maarten Ballintijn wrote:
>> 
>> Hi Wes,
>> 
>> Thanks for your quick response.
>> 
>> Yes, we’re using Python 3.7.4, from miniconda and conda-forge, and:
>> 
>> numpy:   1.16.5
>> pandas:  0.25.1
>> pyarrow: 0.14.1
>> 
>> It looks like 0.15 is close, so I can wait for that.
>> 
>> Theoretically I see three components driving the performance:
>> 1) The cost of locating the column (directory overhead)
>> 2) The overhead of reading a single column. (reading and processing meta 
>> data, setting up for reading)
>> 3) Bulk reading and unmarshalling/decoding the data.
>> 
>> Only 1) would be impacted by the number of columns, but if you’re reading 
>> everything ideally this would not be a problem.
> 
> The problem is more nuanced than that. Parquet's metadata is somewhat
> "heavy" at the column level. So when you're writing thousands of
> columns, the fixed overhead associated with reading a single column
> becomes problematic. There are several data structures associated with
> decoding a column that have a fixed setup and teardown cost. Even if there
> is 1 millisecond of fixed overhead related to reading a column (I
> don't know what the number is exactly) then reading 10,000 columns has
> 10 seconds of unavoidable overhead. It might be useful for us to
> quantify and communicate the expected overhead when metadata and
> decoding is taken into account. Simply put, having more than 1000
> columns is not advisable.
> 
>> Based on an initial cursory look at the Parquet format I guess the index and 
>> the column meta-data might need to be read in full so I can see how that 
>> might slow down reading only a few columns out of a large set. But that was 
>> not really the case here?
>> 
>> What would you suggest for looking into the date index slow-down?
> 
> Can you show a code example to make things easier for us to see what
> you're seeing?
> 
>> 
>> Cheers,
>> Maarten.
>> 
>> 
>> 
>>> On Sep 23, 2019, at 7:07 PM, Wes McKinney  wrote:
>>> 
>>> hi Maarten,
>>> 
>>> Are you using the master branch or 0.14.1? There are a number of
>>> performance regressions in 0.14.0/0.14.1 that are addressed in the
>>> master branch, to appear as 0.15.0 relatively soon.
>>> 
>>> As a file format, Parquet (and columnar formats in general) is not
>>> known to perform well with more than 1000 columns.
>>> 
>>> On the other items, we'd be happy to work with you to dig through the
>>> performance issues you're seeing.
>>> 
>>> Thanks
>>> Wes
>>> 
>>> On Mon, Sep 23, 2019 at 5:52 PM Maarten Ballintijn  
>>> wrote:
 
 Greetings,
 
 We have Pandas DataFrames with typically about 6,000 rows using 
 DateTimeIndex.
 They have about 20,000 columns with integer column labels, and data with a 
 dtype of float32.
 
 We’d like to store these dataframes with parquet, using the ability to 
 read a subset of columns and to store meta-data with the file.
 
 We’ve found the reading performance less than expected compared to the 
 published benchmarks (e.g. Wes’ blog post).
 
 Using a modified version of his script we did reproduce his results (~ 
 1GB/s for high entropy, no dict on MacBook pro)
 
 But there seem to be three factors that contribute to the slowdown for our 
 datasets:
 
 - DateTimeIndex is much slower than an int index (we see about a factor 5).
 - The number of columns impacts reading speed significantly (factor ~2 going 
 from 16 to 16,000 columns)

Re: [NIGHTLY] Arrow Build Report for Job nightly-2019-09-24-0

2019-09-24 Thread Bryan Cutler
I'm able to pass Spark integration tests locally with the build patch from
https://github.com/apache/arrow/pull/5465, so I'm reasonably confident all
the issues have been resolved and it's just flaky timeouts now. We are
trying some things to fix the timeouts, but nothing to hold up the release
for.

On Tue, Sep 24, 2019 at 8:54 AM Micah Kornfield 
wrote:

> Hi Wes,
> Thanks, that makes sense, I'll pick a commit in a little bit to get started
> with.  Somehow I thought we had done so in the past.
>
> Thanks,
> Micah
>
> On Tue, Sep 24, 2019 at 7:59 AM Wes McKinney  wrote:
>
> > hi Micah -- we should not stop merging PRs. That's been our policy
> > with past releases. If you want to pick a commit to base your release
> > branch off that's fine -- we rebase master later after the release
> > vote closes.
> >
> > On Tue, Sep 24, 2019 at 9:39 AM Micah Kornfield 
> > wrote:
> > >
> > > OK at least Spark and Wheel builds look like they might just be flaky
> > > timeouts.  I agree with Fuzzit not being a blocker.  Are there any
> other
> > > blockers I should be aware of?  Otherwise, I will try to start the
> build
> > > process later today.
> > >
> > > On Tue, Sep 24, 2019 at 8:33 AM Antoine Pitrou 
> > wrote:
> > >
> > > >
> > > > At least for Fuzzit and the OS X Python wheel, I don't think those
> are
> > > > blockers.
> > > >
> > > > (IMHO the others shouldn't block the release either)
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > > >
> > > > Le 24/09/2019 à 16:29, Micah Kornfield a écrit :
> > > > > Have the failures already been fixed (i.e. is this a timing
> > issue?).  If
> > > > > not could people chime in if they are looking at some of them?  I
> > assume
> > > > > these are blockers until 0.15.0?
> > > > >
> > > > > If people are OK with it, it might make sense to stop merging
> > > > non-blocking
> > > > > PRs until 0.15.0 is out the door.  Thoughts?
> > > > >
> > > > > On Tue, Sep 24, 2019 at 8:25 AM Crossbow 
> > wrote:
> > > > >
> > > > >>
> > > > >> Arrow Build Report for Job nightly-2019-09-24-0
> > > > >>
> > > > >> All tasks:
> > > > >>
> > > >
> >
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-24-0
> > > > >>
> > > > >> Failed Tasks:
> > > > >> - docker-cpp-fuzzit:
> > > > >>   URL:
> > > > >>
> > > >
> >
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-24-0-circle-docker-cpp-fuzzit
> > > > >> - docker-spark-integration:
> > > > >>   URL:
> > > > >>
> > > >
> >
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-24-0-circle-docker-spark-integration
> > > > >> - gandiva-jar-osx:
> > > > >>   URL:
> > > > >>
> > > >
> >
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-24-0-travis-gandiva-jar-osx
> > > > >> - docker-dask-integration:
> > > > >>   URL:
> > > > >>
> > > >
> >
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-24-0-circle-docker-dask-integration
> > > > >> - wheel-osx-cp27m:
> > > > >>   URL:
> > > > >>
> > > >
> >
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-24-0-travis-wheel-osx-cp27m
> > > > >>
> > > > >> Succeeded Tasks:
> > > > >> - wheel-manylinux2010-cp37m:
> > > > >>   URL:
> > > > >>
> > > >
> >
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-24-0-travis-wheel-manylinux2010-cp37m
> > > > >> - docker-python-3.6:
> > > > >>   URL:
> > > > >>
> > > >
> >
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-24-0-circle-docker-python-3.6
> > > > >> - docker-clang-format:
> > > > >>   URL:
> > > > >>
> > > >
> >
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-24-0-circle-docker-clang-format
> > > > >> - homebrew-cpp:
> > > > >>   URL:
> > > > >>
> > > >
> >
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-24-0-travis-homebrew-cpp
> > > > >> - docker-cpp-static-only:
> > > > >>   URL:
> > > > >>
> > > >
> >
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-24-0-circle-docker-cpp-static-only
> > > > >> - wheel-osx-cp36m:
> > > > >>   URL:
> > > > >>
> > > >
> >
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-24-0-travis-wheel-osx-cp36m
> > > > >> - homebrew-cpp-autobrew:
> > > > >>   URL:
> > > > >>
> > > >
> >
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-24-0-travis-homebrew-cpp-autobrew
> > > > >> - docker-python-3.7:
> > > > >>   URL:
> > > > >>
> > > >
> >
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-24-0-circle-docker-python-3.7
> > > > >> - docker-python-2.7-nopandas:
> > > > >>   URL:
> > > > >>
> > > >
> >
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-24-0-circle-docker-python-2.7-nopandas
> > > > >> - wheel-win-cp35m:
> > > > >>   URL:
> > > > >>
> > > >
> >
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-24-0-appveyor-wheel-win-cp35m
> > 

[jira] [Created] (ARROW-6677) [FlightRPC][C++] Document using Flight in C++

2019-09-24 Thread lidavidm (Jira)
lidavidm created ARROW-6677:
---

 Summary: [FlightRPC][C++] Document using Flight in C++
 Key: ARROW-6677
 URL: https://issues.apache.org/jira/browse/ARROW-6677
 Project: Apache Arrow
  Issue Type: Bug
  Components: Documentation, FlightRPC
Reporter: lidavidm
Assignee: lidavidm
 Fix For: 1.0.0


Similarly to ARROW-6390 for Python, we should have C++ documentation for Flight.





[jira] [Created] (ARROW-6676) [C++] [Parquet] Refactor encoding/decoding APIs for clarity

2019-09-24 Thread Benjamin Kietzman (Jira)
Benjamin Kietzman created ARROW-6676:


 Summary: [C++] [Parquet] Refactor encoding/decoding APIs for 
clarity
 Key: ARROW-6676
 URL: https://issues.apache.org/jira/browse/ARROW-6676
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Benjamin Kietzman
Assignee: Benjamin Kietzman


{{encoding.h}} and {{encoding.cc}} are difficult to read and rewrite. I think 
there are also lost opportunities for more generic implementations. 
Simplify/winnow the interfaces while keeping an eye on the benchmarks for 
performance regressions.





[jira] [Created] (ARROW-6675) Add scanReverse function

2019-09-24 Thread Malcolm MacLachlan (Jira)
Malcolm MacLachlan created ARROW-6675:
-

 Summary: Add scanReverse function
 Key: ARROW-6675
 URL: https://issues.apache.org/jira/browse/ARROW-6675
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Malcolm MacLachlan


* Add scanReverse function to dataFrame and filteredDataframe
 * Update tests





Re: [NIGHTLY] Arrow Build Report for Job nightly-2019-09-24-0

2019-09-24 Thread Micah Kornfield
Hi Wes,
Thanks, that makes sense, I'll pick a commit in a little bit to get started
with.  Somehow I thought we had done so in the past.

Thanks,
Micah

On Tue, Sep 24, 2019 at 7:59 AM Wes McKinney  wrote:

> hi Micah -- we should not stop merging PRs. That's been our policy
> with past releases. If you want to pick a commit to base your release
> branch off that's fine -- we rebase master later after the release
> vote closes.
>
> On Tue, Sep 24, 2019 at 9:39 AM Micah Kornfield 
> wrote:
> >
> > OK at least Spark and Wheel builds look like they might just be flaky
> > timeouts.  I agree with Fuzzit not being a blocker.  Are there any other
> > blockers I should be aware of?  Otherwise, I will try to start the build
> > process later today.
> >
> > On Tue, Sep 24, 2019 at 8:33 AM Antoine Pitrou 
> wrote:
> >
> > >
> > > At least for Fuzzit and the OS X Python wheel, I don't think those are
> > > blockers.
> > >
> > > (IMHO the others shouldn't block the release either)
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > > Le 24/09/2019 à 16:29, Micah Kornfield a écrit :
> > > > Have the failures already been fixed (i.e. is this a timing
> issue?).  If
> > > > not could people chime in if they are looking at some of them?  I
> assume
> > > > these are blockers until 0.15.0?
> > > >
> > > > If people are OK with it, it might make sense to stop merging
> > > non-blocking
> > > > PRs until 0.15.0 is out the door.  Thoughts?
> > > >
> > > > On Tue, Sep 24, 2019 at 8:25 AM Crossbow 
> wrote:

Re: Parquet file reading performance

2019-09-24 Thread Wes McKinney
hi

On Tue, Sep 24, 2019 at 9:26 AM Maarten Ballintijn  wrote:
>
> Hi Wes,
>
> Thanks for your quick response.
>
> Yes, we’re using Python 3.7.4, from miniconda and conda-forge, and:
>
> numpy:   1.16.5
> pandas:  0.25.1
> pyarrow: 0.14.1
>
> It looks like 0.15 is close, so I can wait for that.
>
> Theoretically I see three components driving the performance:
> 1) The cost of locating the column (directory overhead)
> 2) The overhead of reading a single column. (reading and processing meta 
> data, setting up for reading)
> 3) Bulk reading and unmarshalling/decoding the data.
>
> Only 1) would be impacted by the number of columns, but if you’re reading 
> everything ideally this would not be a problem.

The problem is more nuanced than that. Parquet's metadata is somewhat
"heavy" at the column level. So when you're writing thousands of
columns, the fixed overhead associated with reading a single column
becomes problematic. There are several data structures associated with
decoding a column that have a fixed setup and teardown cost. Even if there
is 1 millisecond of fixed overhead related to reading a column (I
don't know what the number is exactly) then reading 10,000 columns has
10 seconds of unavoidable overhead. It might be useful for us to
quantify and communicate the expected overhead when metadata and
decoding is taken into account. Simply put, having more than 1000
columns is not advisable.
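The arithmetic above can be sketched as a toy cost model (the 1 ms overhead and 1 GB/s throughput figures are placeholders, as in the paragraph, not measurements of Arrow):

```python
# Illustrative-only model: total read time = per-column fixed overhead
# + bulk decode time at some throughput.
def est_read_seconds(n_cols, n_rows, bytes_per_value,
                     col_overhead_s=1e-3, throughput_b_per_s=1e9):
    payload = n_cols * n_rows * bytes_per_value
    return n_cols * col_overhead_s + payload / throughput_b_per_s

# 20,000 float32 columns x 6,000 rows (the dataset shape in this thread):
wide = est_read_seconds(20_000, 6_000, 4)      # ~20.5 s, dominated by overhead
# the same ~480 MB payload spread over only 16 columns:
narrow = est_read_seconds(16, 7_500_000, 4)    # ~0.5 s, dominated by decoding
print(f"wide: {wide:.2f}s, narrow: {narrow:.2f}s")
```

Under this model the wide layout spends roughly 20 of its ~20.5 seconds on fixed per-column costs, which is why fewer, longer columns read so much faster.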

> Based on an initial cursory look at the Parquet format I guess the index and 
> the column meta-data might need to be read in full so I can see how that  
> might slow down reading only a few columns out of a large set. But that was 
> not really the case here?
>
> What would you suggest for looking into the date index slow-down?

Can you show a code example to make things easier for us to see what
you're seeing?

>
> Cheers,
> Maarten.
>
>
>
> > On Sep 23, 2019, at 7:07 PM, Wes McKinney  wrote:
> >
> > hi Maarten,
> >
> > Are you using the master branch or 0.14.1? There are a number of
> > performance regressions in 0.14.0/0.14.1 that are addressed in the
> > master branch, to appear as 0.15.0 relatively soon.
> >
> > As a file format, Parquet (and columnar formats in general) is not
> > known to perform well with more than 1000 columns.
> >
> > On the other items, we'd be happy to work with you to dig through the
> > performance issues you're seeing.
> >
> > Thanks
> > Wes
> >
> > On Mon, Sep 23, 2019 at 5:52 PM Maarten Ballintijn  
> > wrote:
> >>
> >> Greetings,
> >>
> >> We have Pandas DataFrames with typically about 6,000 rows using 
> >> DateTimeIndex.
> >> They have about 20,000 columns with integer column labels, and data with a 
> >> dtype of float32.
> >>
> >> We’d like to store these dataframes with parquet, using the ability to 
> >> read a subset of columns and to store meta-data with the file.
> >>
> >> We’ve found the reading performance less than expected compared to the 
> >> published benchmarks (e.g. Wes’ blog post).
> >>
> >> Using a modified version of his script we did reproduce his results (~ 
> >> 1GB/s for high entropy, no dict on MacBook pro)
> >>
> >> But there seem to be three factors that contribute to the slowdown for our 
> >> datasets:
> >>
> >> - DateTimeIndex is much slower than an int index (we see about a factor 5).
> >> - The number of columns impacts reading speed significantly (factor ~2 
> >> going from 16 to 16,000 columns)
> >> - The ‘use_pandas_metadata=True’ slows down reading significantly and 
> >> appears unnecessary? (about 40%)
> >>
> >> Are there ways we could speedup the reading? Should we use a different 
> >> layout?
> >>
> >> Thanks for your help and insights!
> >>
> >> Cheers,
> >> Maarten
> >>
> >>
> >> ps. the routines we used:
> >>
> >> def write_arrow_parquet(df: pd.DataFrame, fname: str) -> None:
> >>     table = pa.Table.from_pandas(df)
> >>     pq.write_table(table, fname, use_dictionary=False, compression=None)
> >>     return
> >>
> >> def read_arrow_parquet(fname: str) -> pd.DataFrame:
> >>     table = pq.read_table(fname, use_pandas_metadata=False, use_threads=True)
> >>     df = table.to_pandas()
> >>     return df
> >>
> >>
>


Re: [NIGHTLY] Arrow Build Report for Job nightly-2019-09-24-0

2019-09-24 Thread Micah Kornfield
OK at least Spark and Wheel builds look like they might just be flaky
timeouts.  I agree with Fuzzit not being a blocker.  Are there any other
blockers I should be aware of?  Otherwise, I will try to start the build
process later today.

On Tue, Sep 24, 2019 at 8:33 AM Antoine Pitrou  wrote:

>
> At least for Fuzzit and the OS X Python wheel, I don't think those are
> blockers.
>
> (IMHO the others shouldn't block the release either)
>
> Regards
>
> Antoine.
>
>
> Le 24/09/2019 à 16:29, Micah Kornfield a écrit :
> > Have the failures already been fixed (i.e. is this a timing issue?).  If
> > not could people chime in if they are looking at some of them?  I assume
> > these are blockers until 0.15.0?
> >
> > If people are OK with it, it might make sense to stop merging
> non-blocking
> > PRs until 0.15.0 is out the door.  Thoughts?
> >
> > On Tue, Sep 24, 2019 at 8:25 AM Crossbow  wrote:

Re: [NIGHTLY] Arrow Build Report for Job nightly-2019-09-24-0

2019-09-24 Thread Micah Kornfield
Have the failures already been fixed (i.e. is this a timing issue?).  If
not could people chime in if they are looking at some of them?  I assume
these are blockers until 0.15.0?

If people are OK with it, it might make sense to stop merging non-blocking
PRs until 0.15.0 is out the door.  Thoughts?

On Tue, Sep 24, 2019 at 8:25 AM Crossbow  wrote:

> - debian-buster-arm64:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-24-0-azure-debian-buster-arm64
> - ubuntu-bionic:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-24-0-azure-ubuntu-bionic
> - centos-7-aarch64:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-24-0-azure-centos-7-aarch64
> - wheel-win-cp37m:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-24-0-appveyor-wheel-win-cp37m
> - debian-stretch-arm64:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-24-0-azure-debian-stretch-arm64
> - ubuntu-xenial:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-24-0-azure-ubuntu-xenial
> - ubuntu-bionic-arm64:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-24-0-azure-ubuntu-bionic-arm64
> - conda-osx-clang-py36:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-24-0-azure-conda-osx-clang-py36
> - conda-win-vs2015-py37:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-24-0-azure-conda-win-vs2015-py37
> - docker-turbodbc-integration:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-24-0-circle-docker-turbodbc-integration
> - wheel-manylinux2010-cp27m:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-24-0-travis-wheel-manylinux2010-cp27m
> - conda-osx-clang-py27:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-24-0-azure-conda-osx-clang-py27
> - 

Re: Parquet file reading performance

2019-09-24 Thread Maarten Ballintijn
Hi Wes,

Thanks for your quick response.

Yes, we’re using Python 3.7.4, from miniconda and conda-forge, and:

numpy:   1.16.5
pandas:  0.25.1
pyarrow: 0.14.1

It looks like 0.15 is close, so I can wait for that.

Theoretically I see three components driving the performance:
1) The cost of locating the column (directory overhead)
2) The overhead of reading a single column (reading and processing metadata,
setting up for reading)
3) Bulk reading and unmarshalling/decoding the data.

Only 1) would be impacted by the number of columns, but if you’re reading
everything, ideally this would not be a problem.

Based on an initial cursory look at the Parquet format, I guess the index and
the column metadata might need to be read in full, so I can see how that might
slow down reading only a few columns out of a large set. But that was not
really the case here?

What would you suggest for looking into the date index slow-down?

Cheers,
Maarten.



> On Sep 23, 2019, at 7:07 PM, Wes McKinney  wrote:
> 
> hi Maarten,
> 
> Are you using the master branch or 0.14.1? There are a number of
> performance regressions in 0.14.0/0.14.1 that are addressed in the
> master branch, to appear as 0.15.0 relatively soon.
> 
> As a file format, Parquet (and columnar formats in general) is not
> known to perform well with more than 1000 columns.
> 
> On the other items, we'd be happy to work with you to dig through the
> performance issues you're seeing.
> 
> Thanks
> Wes
> 
> On Mon, Sep 23, 2019 at 5:52 PM Maarten Ballintijn  wrote:
>> 
>> Greetings,
>> 
>> We have Pandas DataFrames with typically about 6,000 rows using 
>> DateTimeIndex.
>> They have about 20,000 columns with integer column labels, and data with a 
>> dtype of float32.
>> 
>> We’d like to store these dataframes with parquet, using the ability to read 
>> a subset of columns and to store meta-data with the file.
>> 
>> We’ve found the reading performance less than expected compared to the 
>> published benchmarks (e.g. Wes’ blog post).
>> 
>> Using a modified version of his script we did reproduce his results (~ 1GB/s 
>> for high entropy, no dict on MacBook pro)
>> 
>> But there seem to be three factors that contribute to the slowdown for our 
>> datasets:
>> 
>> - DateTimeIndex is much slower than an Int index (we see about a factor of 5).
>> - The number of columns impacts reading speed significantly (a factor of ~2
>> going from 16 to 16,000 columns).
>> - ‘use_pandas_metadata=True’ slows down reading significantly (about 40%) and
>> appears unnecessary?
>> 
>> Are there ways we could speedup the reading? Should we use a different 
>> layout?
>> 
>> Thanks for your help and insights!
>> 
>> Cheers,
>> Maarten
>> 
>> 
>> ps. the routines we used:
>> 
>> import pandas as pd
>> import pyarrow as pa
>> import pyarrow.parquet as pq
>>
>> def write_arrow_parquet(df: pd.DataFrame, fname: str) -> None:
>>     table = pa.Table.from_pandas(df)
>>     pq.write_table(table, fname, use_dictionary=False, compression=None)
>>
>> def read_arrow_parquet(fname: str) -> pd.DataFrame:
>>     table = pq.read_table(fname, use_pandas_metadata=False, use_threads=True)
>>     return table.to_pandas()
>> 
>> 



[jira] [Created] (ARROW-6673) [Python] Consider separating libarrow.pxd into multiple definition files

2019-09-24 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-6673:
--

 Summary: [Python] Consider separating libarrow.pxd into multiple 
definition files
 Key: ARROW-6673
 URL: https://issues.apache.org/jira/browse/ARROW-6673
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Krisztian Szucs


See discussion https://github.com/apache/arrow/pull/5423#discussion_r327522836



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[NIGHTLY] Arrow Build Report for Job nightly-2019-09-24-0

2019-09-24 Thread Crossbow


Arrow Build Report for Job nightly-2019-09-24-0

All tasks: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-24-0

Failed Tasks:
- docker-cpp-fuzzit:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-24-0-circle-docker-cpp-fuzzit
- docker-spark-integration:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-24-0-circle-docker-spark-integration
- gandiva-jar-osx:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-24-0-travis-gandiva-jar-osx
- docker-dask-integration:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-24-0-circle-docker-dask-integration
- wheel-osx-cp27m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-24-0-travis-wheel-osx-cp27m

Succeeded Tasks:
- wheel-manylinux2010-cp37m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-24-0-travis-wheel-manylinux2010-cp37m
- docker-python-3.6:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-24-0-circle-docker-python-3.6
- docker-clang-format:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-24-0-circle-docker-clang-format
- homebrew-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-24-0-travis-homebrew-cpp
- docker-cpp-static-only:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-24-0-circle-docker-cpp-static-only
- wheel-osx-cp36m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-24-0-travis-wheel-osx-cp36m
- homebrew-cpp-autobrew:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-24-0-travis-homebrew-cpp-autobrew
- docker-python-3.7:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-24-0-circle-docker-python-3.7
- docker-python-2.7-nopandas:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-24-0-circle-docker-python-2.7-nopandas
- wheel-win-cp35m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-24-0-appveyor-wheel-win-cp35m
- docker-r:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-24-0-circle-docker-r
- docker-c_glib:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-24-0-circle-docker-c_glib
- docker-iwyu:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-24-0-circle-docker-iwyu
- wheel-manylinux1-cp36m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-24-0-travis-wheel-manylinux1-cp36m
- docker-cpp-release:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-24-0-circle-docker-cpp-release
- debian-buster-arm64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-24-0-azure-debian-buster-arm64
- ubuntu-bionic:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-24-0-azure-ubuntu-bionic
- centos-7-aarch64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-24-0-azure-centos-7-aarch64
- wheel-win-cp37m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-24-0-appveyor-wheel-win-cp37m
- debian-stretch-arm64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-24-0-azure-debian-stretch-arm64
- ubuntu-xenial:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-24-0-azure-ubuntu-xenial
- ubuntu-bionic-arm64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-24-0-azure-ubuntu-bionic-arm64
- conda-osx-clang-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-24-0-azure-conda-osx-clang-py36
- conda-win-vs2015-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-24-0-azure-conda-win-vs2015-py37
- docker-turbodbc-integration:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-24-0-circle-docker-turbodbc-integration
- wheel-manylinux2010-cp27m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-24-0-travis-wheel-manylinux2010-cp27m
- conda-osx-clang-py27:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-24-0-azure-conda-osx-clang-py27
- wheel-manylinux1-cp27m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-24-0-travis-wheel-manylinux1-cp27m
- wheel-osx-cp35m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-24-0-travis-wheel-osx-cp35m
- debian-stretch:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-24-0-azure-debian-stretch
- docker-docs:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-24-0-circle-docker-docs
- 

Re: [DISCUSS][Java] Design of the algorithm module

2019-09-24 Thread Fan Liya
Hi Micah,

Thanks for your effort and precious time.
Looking forward to receiving more valuable feedback from you.

Best,
Liya Fan

On Tue, Sep 24, 2019 at 2:12 PM Micah Kornfield 
wrote:

> Hi Liya Fan,
> I started reviewing but haven't gotten all the way through it. I will try
> to leave more comments over the next few days.
>
> Thanks again for the write-up; I think it will help frame a productive
> conversation.
>
> -Micah
>
> On Tue, Sep 17, 2019 at 1:47 AM Fan Liya  wrote:
>
>> Hi Micah,
>>
>> Thanks for your kind reminder. Comments are enabled now.
>>
>> Best,
>> Liya Fan
>>
>> On Tue, Sep 17, 2019 at 12:45 PM Micah Kornfield 
>> wrote:
>>
>>> Hi Liya Fan,
>>> Thank you for this writeup, it doesn't look like comments are enabled on
>>> the document.  Could you allow for them?
>>>
>>> Thanks,
>>> Micah
>>>
>>> On Sat, Sep 14, 2019 at 6:57 AM Fan Liya  wrote:
>>>
>>> > Dear all,
>>> >
>>> > We have prepared a document for discussing the requirements, design and
>>> > implementation issues for the algorithm module of Java:
>>> >
>>> >
>>> >
>>> https://docs.google.com/document/d/17nqHWS7gs0vARfeDAcUEbhKMOYHnCtA46TOY_Nls69s/edit?usp=sharing
>>> >
>>> > So far, we have finished the initial draft for sort, search and
>>> dictionary
>>> > encoding algorithms. Discussions for more algorithms may be added in
>>> the
>>> > future. This document will keep evolving to reflect the latest
>>> discussion
>>> > results in the community and the latest code changes.
>>> >
>>> > Please give your valuable feedback.
>>> >
>>> > Best,
>>> > Liya Fan
>>> >
>>>
>>


[jira] [Created] (ARROW-6672) [Java] Extract a common interface for dictionary builders

2019-09-24 Thread Liya Fan (Jira)
Liya Fan created ARROW-6672:
---

 Summary: [Java] Extract a common interface for dictionary builders
 Key: ARROW-6672
 URL: https://issues.apache.org/jira/browse/ARROW-6672
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


We need a common interface for dictionary builders to support more 
sophisticated scenarios, like collecting dictionary statistics.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6671) [C++] Sparse tensor naming

2019-09-24 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-6671:
-

 Summary: [C++] Sparse tensor naming
 Key: ARROW-6671
 URL: https://issues.apache.org/jira/browse/ARROW-6671
 Project: Apache Arrow
  Issue Type: Wish
  Components: C++
Reporter: Antoine Pitrou


Currently there's {{SparseCOOIndex}} and {{SparseCSRIndex}}, but also 
{{SparseTensorCOO}} and {{SparseTensorCSR}}.

For consistency, it would be nice to rename the latter {{SparseCOOTensor}} and 
{{SparseCSRTensor}}.

Also, it's not obvious the {{SparseMatrixCSR}} alias is useful.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [DISCUSS][Java] Design of the algorithm module

2019-09-24 Thread Micah Kornfield
Hi Liya Fan,
I started reviewing but haven't gotten all the way through it. I will try
to leave more comments over the next few days.

Thanks again for the write-up; I think it will help frame a productive
conversation.

-Micah

On Tue, Sep 17, 2019 at 1:47 AM Fan Liya  wrote:

> Hi Micah,
>
> Thanks for your kind reminder. Comments are enabled now.
>
> Best,
> Liya Fan
>
> On Tue, Sep 17, 2019 at 12:45 PM Micah Kornfield 
> wrote:
>
>> Hi Liya Fan,
>> Thank you for this writeup, it doesn't look like comments are enabled on
>> the document.  Could you allow for them?
>>
>> Thanks,
>> Micah
>>
>> On Sat, Sep 14, 2019 at 6:57 AM Fan Liya  wrote:
>>
>> > Dear all,
>> >
>> > We have prepared a document for discussing the requirements, design and
>> > implementation issues for the algorithm module of Java:
>> >
>> >
>> >
>> https://docs.google.com/document/d/17nqHWS7gs0vARfeDAcUEbhKMOYHnCtA46TOY_Nls69s/edit?usp=sharing
>> >
>> > So far, we have finished the initial draft for sort, search and
>> dictionary
>> > encoding algorithms. Discussions for more algorithms may be added in the
>> > future. This document will keep evolving to reflect the latest
>> discussion
>> > results in the community and the latest code changes.
>> >
>> > Please give your valuable feedback.
>> >
>> > Best,
>> > Liya Fan
>> >
>>
>


Re: [Discuss] [Java] DateMilliVector.getObject() return type (LocalDateTime vs LocalDate)

2019-09-24 Thread Micah Kornfield
Hi David,
Is the suggestion to add something like a LocalDate getDate method?

Thanks,
Micah

On Tue, Sep 17, 2019 at 7:39 AM David Li  wrote:

> Maybe a utility method to get a date instead of a datetime at least would
> be useful? And/or documentation of the fact that the default behavior is
> semantically incorrect, and what it does (return a datetime at midnight for
> the date).
>
> Best,
> David
>
> On Tue, Sep 17, 2019, 04:08 Fan Liya  wrote:
>
>> I think there are similar problems with other time related vectors.
>>
>> Best,
>> Liya Fan
>>
>>
>> On Tue, Sep 17, 2019 at 1:02 PM Micah Kornfield 
>> wrote:
>>
>> > Anyone have an opinion on this?  Personally, I'm leaning toward keeping the
>> > existing API compatibility, but I don't feel too strongly about it.
>> >
>> > On Mon, Sep 9, 2019 at 7:39 PM Micah Kornfield 
>> > wrote:
>> >
>> > > Yongbo Zhang opened a pull request to have DateMilliVector return a
>> > > LocalDate instead of a LocalDateTime object.
>> > >
>> > > Do people have opinions if this breaking change is worth the
>> correctness?
>> > >
>> > > Thanks,
>> > > Micah
>> > >
>> > > [1] https://github.com/apache/arrow/pull/5315
>> > >
>> > > On Sat, Sep 7, 2019 at 4:14 PM Yongbo Zhang <
>> zhangyongbo0...@gmail.com>
>> > > wrote:
>> > >
>> > >> Summary: [Java] DateMilliVector.getObject() should return a
>> LocalDate,
>> > >> not a LocalDateTime
>> > >> Key: ARROW-1984
>> > >> URL: https://issues.apache.org/jira/browse/ARROW-1984
>> > >> Pull Request: https://github.com/apache/arrow/pull/5315
>> > >> Project: Apache Arrow
>> > >> Issue Type: Bug
>> > >> Components: Java
>> > >> Reporter: Vanco Buca
>> > >> Assignee: Yongbo Zhang
>> > >> Fix For: 0.15.0
>> > >>
>> > >> This is an API-breaking change, so we may want to discuss it before
>> > >> merging any PRs.
>> > >>
>> > >
>> >
>>
>