Re: Timeline for 0.15.0 release

2019-09-20 Thread Micah Kornfield
Thanks Krisztián and Wes,
I've gone ahead and started registering myself on all the packaging sites.

Is there any review process when adding my GPG key to the SVN file? [1]
doesn't seem to mention one explicitly.

Thanks,
Micah

[1] https://www.apache.org/dev/version-control.html#https-svn

On Fri, Sep 20, 2019 at 5:01 PM Krisztián Szűcs 
wrote:

> On Thu, Sep 19, 2019 at 5:52 PM Wes McKinney  wrote:
>
>> On Thu, Sep 19, 2019 at 12:13 AM Micah Kornfield 
>> wrote:
>> >>
>> >> The process should be well documented at this point but there are a
>> >> number of steps.
>> >
>> > Is [1] the up-to-date documentation for the release?   Are there
>> instructions for adding the code signing key to SVN?
>> >
>> > I will make a go of it.  I will try to mitigate any internet issues by
>> doing the process from a cloud instance (I assume that isn't a problem?).
>> >
>>
>> Setting up a new cloud environment suitable for producing an RC may be
>> time consuming, but you are welcome to try. Krisztian -- are you
>> available next week to help Micah and potentially take over producing
>> the RC if there are issues?
>>
> Sure, I'll be available next week. We can also grant access to
> https://github.com/ursa-labs/crossbow because configuring all
> the CI backends can be time consuming.
>
>>
>> > Thanks,
>> > Micah
>> >
>> > [1]
>> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide
>> >
>> > On Wed, Sep 18, 2019 at 8:29 AM Wes McKinney 
>> wrote:
>> >>
>> >> The process should be well documented at this point but there are a
>> >> number of steps. Note that you need to add your code signing key to
>> >> the KEYS file in SVN (that's not very hard to do). I think it's fine
>> >> to hand off the process to others after the VOTE but it would be
>> >> tricky to have multiple RMs involved with producing the source and
>> >> binary artifacts for the vote
>> >>
>> >> On Tue, Sep 17, 2019 at 10:55 PM Micah Kornfield <
>> emkornfi...@gmail.com> wrote:
>> >> >
>> >> > SGTM, as well.
>> >> >
>> >> > I should have a little bit of time next week if I can help as RM but
>> I have
>> >> > a couple of concerns:
>> >> > 1.  In the past I've had trouble downloading and validating
>> releases. I'm a
>> >> > bit worried, that I might have similar problems doing the necessary
>> uploads.
>> >> > 2.  My internet connection will likely be not great, I don't know if
>> this
>> >> > would make it even less likely to be successful.
>> >> >
>> >> > Does it become problematic if somehow I would have to abandon the
>> process
>> >> > mid-release?  Is there anyone who could serve as a backup?  Are the
>> steps
>> >> > well documented?
>> >> >
>> >> > Thanks,
>> >> > Micah
>> >> >
>> >> > On Tue, Sep 17, 2019 at 4:25 PM Neal Richardson <
>> neal.p.richard...@gmail.com>
>> >> > wrote:
>> >> >
>> >> > > Sounds good to me.
>> >> > >
>> >> > > Do we have a release manager yet? Any volunteers?
>> >> > >
>> >> > > Neal
>> >> > >
>> >> > > On Tue, Sep 17, 2019 at 4:06 PM Wes McKinney 
>> wrote:
>> >> > >
>> >> > > > hi all,
>> >> > > >
>> >> > > > It looks like we're drawing close to be able to make the 0.15.0
>> >> > > > release. I would suggest "pencils down" at the end of this week
>> and
>> >> > > > see if a release candidate can be produced next Monday September
>> 23.
>> >> > > > Any thoughts or objections?
>> >> > > >
>> >> > > > Thanks,
>> >> > > > Wes
>> >> > > >
>> >> > > > On Wed, Sep 11, 2019 at 11:23 AM Wes McKinney <
>> wesmck...@gmail.com>
>> >> > > wrote:
>> >> > > > >
>> >> > > > > hi Eric -- yes, that's correct. I'm planning to amend the
>> Format docs
>> >> > > > > today regarding the EOS issue and also update the C++ library
>> >> > > > >
>> >> > > > > On Wed, Sep 11, 2019 at 11:21 AM Eric Erhardt
>> >> > > > >  wrote:
>> >> > > > > >
>> >> > > > > > I assume the plan is to merge the
>> ARROW-6313-flatbuffer-alignment
>> >> > > > branch into master before the 0.15 release, correct?
>> >> > > > > >
>> >> > > > > > BTW - I believe the C# alignment changes are ready to be
>> merged into
>> >> > > > the alignment branch -
>> https://github.com/apache/arrow/pull/5280/
>> >> > > > > >
>> >> > > > > > Eric
>> >> > > > > >
>> >> > > > > > -----Original Message-----
>> >> > > > > > From: Micah Kornfield 
>> >> > > > > > Sent: Tuesday, September 10, 2019 10:24 PM
>> >> > > > > > To: Wes McKinney 
>> >> > > > > > Cc: dev ; niki.lj 
>> >> > > > > > Subject: Re: Timeline for 0.15.0 release
>> >> > > > > >
>> >> > > > > > I should have a little more bandwidth to help with some of
>> the
>> >> > > > packaging starting tomorrow and going into the weekend.
>> >> > > > > >
>> >> > > > > > On Tuesday, September 10, 2019, Wes McKinney <
>> wesmck...@gmail.com>
>> >> > > > wrote:
>> >> > > > > >
>> >> > > > > > > Hi folks,
>> >> > > > > > >
>> >> > > > > > > With the state of nightly packaging and integration builds
>> things
>> >> > > > > > > aren't looking too good for being in release readiness by
>> the end
>> >> > > of
>> >> > > > > > 

[jira] [Created] (ARROW-6650) [Rust] [Integration] Add method to generate JSON from RecordBatch

2019-09-20 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-6650:
-

 Summary: [Rust] [Integration] Add method to generate JSON from 
RecordBatch
 Key: ARROW-6650
 URL: https://issues.apache.org/jira/browse/ARROW-6650
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Integration, Rust
Affects Versions: 0.14.1
Reporter: Neville Dipale


[~emkornfi...@gmail.com] recommended that we use the integration IPC files. To 
be able to compare against the JSON files that are used, we need to be able to 
generate a JSON representation of Arrow data in Rust.

We can already do this for schemas, and this ticket is for supporting 
converting RecordBatch to JSON.
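
A rough sketch of the kind of output this ticket is after, written in Python
with only the standard library rather than in Rust. The batch_to_json helper
and its dict-of-lists input are hypothetical stand-ins for a RecordBatch, and
the count / columns / VALIDITY / DATA keys are an assumption about the
integration JSON layout that should be checked against the integration files
themselves. Integer columns are assumed, so that 0 can serve as the null
placeholder.

```python
import json


def batch_to_json(columns):
    """Render a dict of equal-length columns as an integration-style JSON batch.

    `columns` maps column name -> list of values, with None marking a null.
    This is a hypothetical stand-in for a RecordBatch, not an Arrow API.
    """
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same length")
    count = lengths.pop() if lengths else 0

    json_columns = []
    for name, values in columns.items():
        json_columns.append({
            "name": name,
            "count": count,
            # 1 marks a valid slot, 0 marks a null slot.
            "VALIDITY": [0 if v is None else 1 for v in values],
            # Null slots get a placeholder so DATA lines up with VALIDITY
            # (assumes integer columns).
            "DATA": [0 if v is None else v for v in values],
        })
    return {"count": count, "columns": json_columns}


batch = batch_to_json({"a": [1, None, 3], "b": [10, 20, None]})
print(json.dumps(batch, indent=2))
```

Comparing two implementations then reduces to comparing the parsed JSON
structures rather than the raw IPC bytes.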



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: I want to unsubscribe from the dev mailing list

2019-09-20 Thread Wes McKinney
Write to

dev-unsubscr...@arrow.apache.org

I would recommend setting up an email filter instead.

On Fri, Sep 20, 2019, 5:21 PM Zhuo Jia Dai  wrote:

> Hi,
>
> I can't seem to figure out how to do it online.
>
> Please help.
>
> Regards
>
> --
> ZJ
>
> zhuojia@gmail.com
>


Re: Timeline for 0.15.0 release

2019-09-20 Thread Krisztián Szűcs
On Thu, Sep 19, 2019 at 5:52 PM Wes McKinney  wrote:

> On Thu, Sep 19, 2019 at 12:13 AM Micah Kornfield 
> wrote:
> >>
> >> The process should be well documented at this point but there are a
> >> number of steps.
> >
> > Is [1] the up-to-date documentation for the release?   Are there
> instructions for adding the code signing key to SVN?
> >
> > I will make a go of it.  I will try to mitigate any internet issues by
> doing the process from a cloud instance (I assume that isn't a problem?).
> >
>
> Setting up a new cloud environment suitable for producing an RC may be
> time consuming, but you are welcome to try. Krisztian -- are you
> available next week to help Micah and potentially take over producing
> the RC if there are issues?
>
Sure, I'll be available next week. We can also grant access to
https://github.com/ursa-labs/crossbow because configuring all
the CI backends can be time consuming.

>
> > Thanks,
> > Micah
> >
> > [1]
> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide
> >
> > On Wed, Sep 18, 2019 at 8:29 AM Wes McKinney 
> wrote:
> >>
> >> The process should be well documented at this point but there are a
> >> number of steps. Note that you need to add your code signing key to
> >> the KEYS file in SVN (that's not very hard to do). I think it's fine
> >> to hand off the process to others after the VOTE but it would be
> >> tricky to have multiple RMs involved with producing the source and
> >> binary artifacts for the vote
> >>
> >> On Tue, Sep 17, 2019 at 10:55 PM Micah Kornfield 
> wrote:
> >> >
> >> > SGTM, as well.
> >> >
> >> > I should have a little bit of time next week if I can help as RM but
> I have
> >> > a couple of concerns:
> >> > 1.  In the past I've had trouble downloading and validating releases.
> I'm a
> >> > bit worried, that I might have similar problems doing the necessary
> uploads.
> >> > 2.  My internet connection will likely be not great, I don't know if
> this
> >> > would make it even less likely to be successful.
> >> >
> >> > Does it become problematic if somehow I would have to abandon the
> process
> >> > mid-release?  Is there anyone who could serve as a backup?  Are the
> steps
> >> > well documented?
> >> >
> >> > Thanks,
> >> > Micah
> >> >
> >> > On Tue, Sep 17, 2019 at 4:25 PM Neal Richardson <
> neal.p.richard...@gmail.com>
> >> > wrote:
> >> >
> >> > > Sounds good to me.
> >> > >
> >> > > Do we have a release manager yet? Any volunteers?
> >> > >
> >> > > Neal
> >> > >
> >> > > On Tue, Sep 17, 2019 at 4:06 PM Wes McKinney 
> wrote:
> >> > >
> >> > > > hi all,
> >> > > >
> >> > > > It looks like we're drawing close to be able to make the 0.15.0
> >> > > > release. I would suggest "pencils down" at the end of this week
> and
> >> > > > see if a release candidate can be produced next Monday September
> 23.
> >> > > > Any thoughts or objections?
> >> > > >
> >> > > > Thanks,
> >> > > > Wes
> >> > > >
> >> > > > On Wed, Sep 11, 2019 at 11:23 AM Wes McKinney <
> wesmck...@gmail.com>
> >> > > wrote:
> >> > > > >
> >> > > > > hi Eric -- yes, that's correct. I'm planning to amend the
> Format docs
> >> > > > > today regarding the EOS issue and also update the C++ library
> >> > > > >
> >> > > > > On Wed, Sep 11, 2019 at 11:21 AM Eric Erhardt
> >> > > > >  wrote:
> >> > > > > >
> >> > > > > > I assume the plan is to merge the
> ARROW-6313-flatbuffer-alignment
> >> > > > branch into master before the 0.15 release, correct?
> >> > > > > >
> >> > > > > > BTW - I believe the C# alignment changes are ready to be
> merged into
> >> > > > the alignment branch -
> https://github.com/apache/arrow/pull/5280/
> >> > > > > >
> >> > > > > > Eric
> >> > > > > >
> >> > > > > > -----Original Message-----
> >> > > > > > From: Micah Kornfield 
> >> > > > > > Sent: Tuesday, September 10, 2019 10:24 PM
> >> > > > > > To: Wes McKinney 
> >> > > > > > Cc: dev ; niki.lj 
> >> > > > > > Subject: Re: Timeline for 0.15.0 release
> >> > > > > >
> >> > > > > > I should have a little more bandwidth to help with some of the
> >> > > > packaging starting tomorrow and going into the weekend.
> >> > > > > >
> >> > > > > > On Tuesday, September 10, 2019, Wes McKinney <
> wesmck...@gmail.com>
> >> > > > wrote:
> >> > > > > >
> >> > > > > > > Hi folks,
> >> > > > > > >
> >> > > > > > > With the state of nightly packaging and integration builds
> things
> >> > > > > > > aren't looking too good for being in release readiness by
> the end
> >> > > of
> >> > > > > > > this week but maybe I'm wrong. I'm planning to be working
> to close
> >> > > as
> >> > > > > > > many issues as I can and also to help with the ongoing
> alignment
> >> > > > fixes.
> >> > > > > > >
> >> > > > > > > Wes
> >> > > > > > >
> >> > > > > > > On Thu, Sep 5, 2019, 11:07 PM Micah Kornfield <
> >> > > emkornfi...@gmail.com
> >> > > > >
> >> > > > > > > wrote:
> >> > > > > > >
> >> > > > > > >> Just for reference [1] has a dashboard of the current
> issues:
> >> > > > > > >>
> >> > > > > > >>
> >> 

[jira] [Created] (ARROW-6649) [R] print() methods for Table, RecordBatch, etc.

2019-09-20 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-6649:
--

 Summary: [R] print() methods for Table, RecordBatch, etc.
 Key: ARROW-6649
 URL: https://issues.apache.org/jira/browse/ARROW-6649
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Neal Richardson
 Fix For: 1.0.0


Inspired by tibble: show schema, head of data, etc.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6648) Go: Expose the bitutil package

2019-09-20 Thread Jonathan A Sternberg (Jira)
Jonathan A Sternberg created ARROW-6648:
---

 Summary: Go: Expose the bitutil package
 Key: ARROW-6648
 URL: https://issues.apache.org/jira/browse/ARROW-6648
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Go
Reporter: Jonathan A Sternberg


Please allow the {{bitutil}} package to be exposed to external developers. The 
package provides useful utilities for constructing a bitmap and it is needed if 
you want to create an external builder implementation that handles null values.

Thank you.
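
To illustrate why an external builder needs such helpers, here is a minimal
sketch of building a validity bitmap, written in Python rather than Go. The
function names are illustrative only and are not the bitutil package's API.
The one assumption carried over from the Arrow format is least-significant-bit
numbering: slot i lives in byte i // 8 at bit i % 8, and a set bit means the
slot is non-null.

```python
def bytes_for_bits(n):
    """Number of bytes needed to hold n bits."""
    return (n + 7) // 8


def set_bit(bitmap, i):
    """Mark slot i as valid (least-significant-bit numbering within each byte)."""
    bitmap[i // 8] |= 1 << (i % 8)


def bit_is_set(bitmap, i):
    """Return True if slot i is marked valid."""
    return bool(bitmap[i // 8] & (1 << (i % 8)))


def build_validity(values):
    """Build a validity bitmap for a list where None marks a null."""
    bitmap = bytearray(bytes_for_bits(len(values)))
    for i, value in enumerate(values):
        if value is not None:
            set_bit(bitmap, i)
    return bitmap


# Nine slots need two bytes; slots 1, 3 and 4 are null.
validity = build_validity([7, None, 9, None, None, 1, 2, 3, 4])
```

A builder that handles nulls has to maintain exactly this kind of bitmap
alongside its data buffer, which is why reimplementing the helpers outside
the package is tedious.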



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6647) [C++] Can't build with g++ 4.8.5 on CentOS 7 by member initializer for shared_ptr

2019-09-20 Thread Sutou Kouhei (Jira)
Sutou Kouhei created ARROW-6647:
---

 Summary: [C++] Can't build with g++ 4.8.5 on CentOS 7 by member 
initializer for shared_ptr
 Key: ARROW-6647
 URL: https://issues.apache.org/jira/browse/ARROW-6647
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Sutou Kouhei
Assignee: Sutou Kouhei


{noformat}
% g++ --version
g++ (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39)
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
{noformat}

Error message:

{noformat}
/root/rpmbuild/BUILD/apache-arrow-0.15.0/cpp/src/arrow/python/python_to_arrow.cc:
 In instantiation of 'arrow::Status arrow::py::GetConverterFlat(const 
std::shared_ptr&, bool, 
std::unique_ptr*) [with arrow::py::NullCoding 
null_coding = (arrow::py::NullCoding)1]':
/root/rpmbuild/BUILD/apache-arrow-0.15.0/cpp/src/arrow/python/python_to_arrow.cc:1001:5:
   required from here
/root/rpmbuild/BUILD/apache-arrow-0.15.0/cpp/src/arrow/python/python_to_arrow.cc:864:7:
 error: conversion from 'std::nullptr_t' to non-scalar type 
'std::shared_ptr' requested
 class DecimalConverter
   ^
/root/rpmbuild/BUILD/apache-arrow-0.15.0/cpp/src/arrow/python/python_to_arrow.cc:894:10:
 note: synthesized method 
'arrow::py::DecimalConverter<(arrow::py::NullCoding)1>::DecimalConverter()' 
first required here 
 *out = std::unique_ptr(new TYPE_CLASS); \
  ^
/root/rpmbuild/BUILD/apache-arrow-0.15.0/cpp/src/arrow/python/python_to_arrow.cc:915:5:
 note: in expansion of macro 'SIMPLE_CONVERTER_CASE'
 SIMPLE_CONVERTER_CASE(DECIMAL, DecimalConverter);
 ^
/root/rpmbuild/BUILD/apache-arrow-0.15.0/cpp/src/arrow/python/python_to_arrow.cc:
 In instantiation of 'arrow::Status arrow::py::GetConverterFlat(const 
std::shared_ptr&, bool, 
std::unique_ptr*) [with arrow::py::NullCoding 
null_coding = (arrow::py::NullCoding)0]':
/root/rpmbuild/BUILD/apache-arrow-0.15.0/cpp/src/arrow/python/python_to_arrow.cc:1004:5:
   required from here
/root/rpmbuild/BUILD/apache-arrow-0.15.0/cpp/src/arrow/python/python_to_arrow.cc:864:7:
 error: conversion from 'std::nullptr_t' to non-scalar type 
'std::shared_ptr' requested
 class DecimalConverter
   ^
/root/rpmbuild/BUILD/apache-arrow-0.15.0/cpp/src/arrow/python/python_to_arrow.cc:894:10:
 note: synthesized method 
'arrow::py::DecimalConverter<(arrow::py::NullCoding)0>::DecimalConverter()' 
first required here 
 *out = std::unique_ptr(new TYPE_CLASS); \
  ^
/root/rpmbuild/BUILD/apache-arrow-0.15.0/cpp/src/arrow/python/python_to_arrow.cc:915:5:
 note: in expansion of macro 'SIMPLE_CONVERTER_CASE'
 SIMPLE_CONVERTER_CASE(DECIMAL, DecimalConverter);
 ^
{noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [DISCUSS] IPC buffer layout for Null type

2019-09-20 Thread Wes McKinney
Thanks. I committed it and opened some new JIRA issues that have been
attached to

https://issues.apache.org/jira/browse/ARROW-1636

On Fri, Sep 20, 2019 at 4:37 AM Micah Kornfield  wrote:
>
> I think committing as is, is the better of the two options.
>
> On Thu, Sep 19, 2019 at 12:35 PM Wes McKinney  wrote:
>
> > OK, my preference, therefore, would be to rebase and merge my patch
> > without bothering with backwards compatibility code. The situations
> > where there would be an issue are fairly esoteric.
> >
> > https://github.com/apache/arrow/pull/5287
> >
> > On Thu, Sep 19, 2019 at 2:29 PM Antoine Pitrou 
> > wrote:
> > >
> > >
> > > Well, this is an incompatible IPC change, so ideally it should be done
> > > now, not later.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > > On Thu, 19 Sep 2019 14:08:37 -0500
> > > Wes McKinney  wrote:
> > >
> > > > I'm concerned about rushing through any patch for this for 0.15.0, but
> > > > each release with the status quo increases the risk of making changes.
> > > > Thoughts?
> > > >
> > > > On Fri, Sep 6, 2019 at 12:59 PM Wes McKinney 
> > wrote:
> > > > >
> > > > > On Fri, Sep 6, 2019 at 12:57 PM Micah Kornfield <
> > emkornfi...@gmail.com> wrote:
> > > > > >
> > > > > > >
> > > > > > > We can't because the buffer layout is not transmitted --
> > implementations
> > > > > > > make assumptions about what Buffer values correspond to each
> > field. The
> > > > > > > only thing we could do to signal the change would be to increase
> > the
> > > > > > > metadata version from V4 to V5.
> > > > > >
> > > > > > If we do this within 0.15.0 we could infer from the padding of
> > messages.
> > > > > >
> > > > >
> > > > > That's true. I'd be OK adding backward compatibility code (that we
> > can
> > > > > probably remove later) to my patch...
> > > > >
> > > > > I'm not sure about the other implementations. I think for non-C++
> > > > > implementations because they don't have much application code that
> > can
> > > > > produce Null arrays that they should simply use the no-buffers layout
> > > > >
> > > > > > On Fri, Sep 6, 2019 at 10:16 AM Wes McKinney 
> > wrote:
> > > > > >
> > > > > > > On Fri, Sep 6, 2019, 12:08 PM Antoine Pitrou 
> > wrote:
> > > > > > >
> > > > > > > >
> > > > > > > > Null can also come up when converting a column with only NA
> > values in a
> > > > > > > > CSV file.  I don't remember for sure, but I think the same can
> > happen
> > > > > > > > with JSON files as well.
> > > > > > > >
> > > > > > > > Can't we accept both forms when reading?  It sounds like it
> > should be
> > > > > > > > reasonably easy.
> > > > > > > >
> > > > > > >
> > > > > > > We can't because the buffer layout is not transmitted --
> > implementations
> > > > > > > make assumptions about what Buffer values correspond to each
> > field. The
> > > > > > > only thing we could do to signal the change would be to increase
> > the
> > > > > > > metadata version from V4 to V5.
> > > > > > >
> > > > > > >
> > > > > > > > Regards
> > > > > > > >
> > > > > > > > Antoine.
> > > > > > > >
> > > > > > > >
> > > > > > > > > On 06/09/2019 at 17:36, Wes McKinney wrote:
> > > > > > > > > hi Micah,
> > > > > > > > >
> > > > > > > > > Null wouldn't come up that often in practice. It could
> > happen when
> > > > > > > > > converting from pandas, for example
> > > > > > > > >
> > > > > > > > > In [8]: df = pd.DataFrame({'col1': np.array([np.nan] * 10,
> > > > > > > > dtype=object)})
> > > > > > > > >
> > > > > > > > > In [9]: t = pa.table(df)
> > > > > > > > >
> > > > > > > > > In [10]: t
> > > > > > > > > Out[10]:
> > > > > > > > > pyarrow.Table
> > > > > > > > > col1: null
> > > > > > > > > metadata
> > > > > > > > > 
> > > > > > > > > {b'pandas': b'{"index_columns": [{"kind": "range", "name":
> > null,
> > > > > > > > "start": 0, "'
> > > > > > > > > b'stop": 10, "step": 1}], "column_indexes":
> > [{"name": null,
> > > > > > > > "field'
> > > > > > > > > b'_name": null, "pandas_type": "unicode",
> > "numpy_type":
> > > > > > > > "object", '
> > > > > > > > > b'"metadata": {"encoding": "UTF-8"}}],
> > "columns": [{"name":
> > > > > > > > "col1"'
> > > > > > > > > b', "field_name": "col1", "pandas_type": "empty",
> > > > > > > > "numpy_type": "o'
> > > > > > > > > b'bject", "metadata": null}], "creator":
> > {"library":
> > > > > > > > "pyarrow", "v'
> > > > > > > > > b'ersion": "0.14.1.dev464+g40d08a751"},
> > "pandas_version":
> > > > > > > > "0.24.2"'
> > > > > > > > > b'}'}
> > > > > > > > >
> > > > > > > > > I'm inclined to make the change without worrying about
> > backwards
> > > > > > > > > compatibility. If people have been persisting data against
> > the
> > > > > > > > > recommendations of the project, the remedy is to use an
> > older version
> > > > > > > > > of the library to read the files and write them to something
> > else
> > > > > > > > > (like Parquet format) in the 

[jira] [Created] (ARROW-6646) [Go] Amend NullType IPC implementation to append no buffers in RecordBatch message

2019-09-20 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6646:
---

 Summary: [Go] Amend NullType IPC implementation to append no 
buffers in RecordBatch message
 Key: ARROW-6646
 URL: https://issues.apache.org/jira/browse/ARROW-6646
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Go
Reporter: Wes McKinney
 Fix For: 1.0.0


per ARROW-6379



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6645) [Python] Dictionary indices are boundschecked unconditionally in CategoricalBlock.to_pandas

2019-09-20 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6645:
---

 Summary: [Python] Dictionary indices are boundschecked 
unconditionally in CategoricalBlock.to_pandas
 Key: ARROW-6645
 URL: https://issues.apache.org/jira/browse/ARROW-6645
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Wes McKinney
 Fix For: 1.0.0


This was added at some point to fix a bug. I suspect we might want to move this 
check somewhere else rather than do it every time {{to_pandas}} is called
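
One possible shape for "somewhere else", sketched in plain Python: validate
the indices once when the dictionary-encoded data is constructed, so that
later conversions can skip the check. The CheckedDictionaryIndices class is
hypothetical and is not how pyarrow is implemented; it only shows the idea of
hoisting the check out of the hot path.

```python
class CheckedDictionaryIndices:
    """Hypothetical holder that validates dictionary indices once, on construction."""

    def __init__(self, indices, dictionary):
        size = len(dictionary)
        for position, index in enumerate(indices):
            # None stands for a null slot and needs no bounds check.
            if index is not None and not 0 <= index < size:
                raise IndexError(
                    "index %d at position %d is out of bounds for dictionary of size %d"
                    % (index, position, size)
                )
        self.indices = list(indices)
        self.dictionary = list(dictionary)

    def decode(self):
        """Map indices to values; safe to call repeatedly without re-checking."""
        return [None if i is None else self.dictionary[i] for i in self.indices]


checked = CheckedDictionaryIndices([0, 2, None, 1], ["a", "b", "c"])
decoded = checked.decode()
```

The trade-off is that data arriving from an untrusted source would still
need the check at the point where it enters the process.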



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6644) [JS] Amend NullType IPC protocol to append no buffers

2019-09-20 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6644:
---

 Summary: [JS] Amend NullType IPC protocol to append no buffers
 Key: ARROW-6644
 URL: https://issues.apache.org/jira/browse/ARROW-6644
 Project: Apache Arrow
  Issue Type: Improvement
  Components: JavaScript
Reporter: Wes McKinney
 Fix For: 1.0.0


Per ARROW-6379



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [DISCUSS] Changing C++ build system default options to produce more barebones builds

2019-09-20 Thread Antoine Pitrou


Yes, I don't think we should go the full way of separating Arrow into
micro-components.  The IO and IPC layers aren't heavyweight.  We should
simply address the most often-quoted annoyances.

Regards

Antoine.


On 20/09/2019 at 17:41, Wes McKinney wrote:
> Implementing the format fully requires memory management and IO
> interfaces (i.e. arrow/io/{file.h, interfaces.h, memory.h}). So those
> parts are not separable.
> 
> On Fri, Sep 20, 2019 at 10:36 AM Neal Richardson
>  wrote:
>>
>> I wonder if having a core "format" C++ library, which the io, compute,
>> etc. library/libraries would depend on, is a natural step.
>> Particularly since we're coming up on 1.0 and the format is being
>> declared stable.
>>
>> Neal
>>
>> On Fri, Sep 20, 2019 at 8:28 AM Wes McKinney  wrote:
>>>
>>> We would have to be even more careful about managing symbol exports.
>>> Third party projects would need to link more libraries in their
>>> applications (not unlike the way that Boost works now -- I suppose
>>> that Boost is the closest analogue to what we're going for)
>>>
>>> On Fri, Sep 20, 2019 at 2:30 AM Micah Kornfield  
>>> wrote:
>>>>>
>>>>> We could indeed split up libarrow into more shared libraries. This
>>>>> would mean accepting a lot more maintenance effort though, on a team
>>>>> that is already overburdened. I'm not too keen on that in the short
>>>>> term.
>>>>
>>>>
>>>> Something for longer term to think about.  What are you seeing as the
>>>> added maintenance here?
>>>>
>>>>
>>>> On Thu, Sep 19, 2019 at 5:38 PM Wes McKinney  wrote:
>>>>>
>>>>> hi Micah,
>>>>>
>>>>>
>>>>> On Thu, Sep 19, 2019 at 12:41 AM Micah Kornfield
>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>> * Should optional components be "opt in", "opt out", or a mix?
>>>>>>> Currently it's a mix, and that's confusing for people. I think we
>>>>>>> should make them all "opt in".
>>>>>>
>>>>>> Agreed they should all be opt in by default.  I think active developers
>>>>>> are
>>>>>> quite adept at flipping the appropriate CMake flags.
>>>>>>
>>>>>
>>>>> Cool. I opened a tracking JIRA
>>>>> https://issues.apache.org/jira/browse/ARROW-6637 and attached many
>>>>> issues. Sorry for the new JIRA flood
>>>>>
>>>>>>
>>>>>>> * Do we want to bring the out-of-the-box core build down to zero
>>>>>>> dependencies, including not depending on boost::filesystem and
>>>>>>> possibly checking the compiled Flatbuffers files.
>>>>>>
>>>>>>  While it may be
>>>>>>> slightly more maintenance work, I think the optics of a
>>>>>>> "dependency-free" core build would be beneficial and help the project
>>>>>>> marketing-wise.
>>>>>>
>>>>>> I'm -.5 on checking in generated artifacts but this is mostly stylistic.
>>>>>> In the case of flatbuffers it seems like we might be able to get-away
>>>>>> with
>>>>>> vendoring since it should mostly be headers only.
>>>>>>
>>>>>> I would prefer to try come up with more granular components and be
>>>>>> very conservative on what is "core".  I think it should be possible have
>>>>>> a
>>>>>> zero dependency build if only MemoryPool, Buffers, Arrays and
>>>>>> ArrayBuilders
>>>>>> in a core package [1].  This combined with discussion Antoine started on
>>>>>> an
>>>>>> ABI compatible C-layer would make basic inter-op within a process
>>>>>> reasonable.  Moving up the stack to IPC and files, there is probably a
>>>>>> way
>>>>>> to package headers separately from implementations.  This would allow
>>>>>> other
>>>>>> projects wishing to integrate with Arrow to bring their own
>>>>>> implementations
>>>>>> without the baggage of boost::filesystem. Would this leave anything
>>>>>> besides
>>>>>> "flatbuffers" as a hard dependency to support IPC?
>>>>>>
>>>>>
>>>>> We could indeed split up libarrow into more shared libraries. This
>>>>> would mean accepting a lot more maintenance effort though, on a team
>>>>> that is already overburdened. I'm not too keen on that in the short
>>>>> term.
>>>>>
>>>>>> Thanks,
>>>>>> Micah
>>>>>>
>>>>>>
>>>>>> [1] It probably makes sense to go even further and separate out
>>>>>> MemoryPool
>>>>>> and Buffer, so we can break the circular relationship between parquet and
>>>>>> arrow.
>>>>>
>>>>> Don't think this is possible even then, particularly in light of my
>>>>> recent work reading and writing Arrow columnar data "closer to the
>>>>> metal"  inside Parquet, yielding beneficial performance improvements.
>>>>>
>>>>>>
>>>>>> On Wed, Sep 18, 2019 at 8:03 AM Wes McKinney  wrote:
>>>>>>
>>>>>>> To be clear I think we should make these changes right after 0.15.0 is
>>>>>>> released so we aren't playing whackamole with our packaging scripts.
>>>>>>> I'm happy to take the lead on the work...
>>>>>>>
>>>>>>> On Wed, Sep 18, 2019 at 9:54 AM Antoine Pitrou
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> On Wed, 18 Sep 2019 09:46:54 -0500
>>>>>>>> Wes McKinney  wrote:
>>>>>>>>> I think these are both interesting areas to explore further. I'd like
>>>>>>>>> to focus on the couple of immediate items I think we should address

[jira] [Created] (ARROW-6643) [C#] Write no IPC buffer metadata for NullType

2019-09-20 Thread Eric Erhardt (Jira)
Eric Erhardt created ARROW-6643:
---

 Summary: [C#] Write no IPC buffer metadata for NullType
 Key: ARROW-6643
 URL: https://issues.apache.org/jira/browse/ARROW-6643
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C#
Reporter: Eric Erhardt


We need to align the C# writer (and test the reader) for NullType. See 
[https://github.com/apache/arrow/pull/5287] and ARROW-6379.

 

>The C++ implementation has been writing 2 {{Buffer}} Flatbuffer struct values 
>with length 0 for NullType. Rather than having dummy/placeholder Buffer I 
>think it is more consistent to write no metadata for this type.
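
A small sketch of what the two layouts mean for a reader, in plain Python.
The type names, the BUFFERS_PER_FIELD table and the helper functions are
illustrative assumptions, not Arrow APIs; only the counts for NullType (two
zero-length placeholders before, none after) come from the description above.

```python
# Buffers listed per field in a RecordBatch message, for a few flat types.
BUFFERS_PER_FIELD = {
    "int32": ["validity", "data"],
    "utf8": ["validity", "offsets", "data"],
    "null": [],  # proposed layout: no buffers at all
}


def buffer_layout(type_name, legacy_null=False):
    """Return the buffer names written for one field.

    legacy_null=True reproduces the older C++ behaviour described in the
    ticket, where NullType wrote two zero-length placeholder buffers.
    """
    if type_name == "null" and legacy_null:
        return ["placeholder", "placeholder"]
    return list(BUFFERS_PER_FIELD[type_name])


def total_buffers(field_types, legacy_null=False):
    """Count the Buffer entries a reader should expect for a flat schema."""
    return sum(len(buffer_layout(t, legacy_null)) for t in field_types)


schema = ["int32", "null", "utf8"]
new_count = total_buffers(schema)
old_count = total_buffers(schema, legacy_null=True)
```

Because the message carries only a flat list of buffers and not the layout
itself, a reader expecting old_count entries would assign the wrong buffers
to the utf8 field when handed new_count entries, so writer and reader have
to agree on one convention.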



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [DISCUSS] Changing C++ build system default options to produce more barebones builds

2019-09-20 Thread Wes McKinney
Implementing the format fully requires memory management and IO
interfaces (i.e. arrow/io/{file.h, interfaces.h, memory.h}). So those
parts are not separable.

On Fri, Sep 20, 2019 at 10:36 AM Neal Richardson
 wrote:
>
> I wonder if having a core "format" C++ library, which the io, compute,
> etc. library/libraries would depend on, is a natural step.
> Particularly since we're coming up on 1.0 and the format is being
> declared stable.
>
> Neal
>
> On Fri, Sep 20, 2019 at 8:28 AM Wes McKinney  wrote:
> >
> > We would have to be even more careful about managing symbol exports.
> > Third party projects would need to link more libraries in their
> > applications (not unlike the way that Boost works now -- I suppose
> > that Boost is the closest analogue to what we're going for)
> >
> > On Fri, Sep 20, 2019 at 2:30 AM Micah Kornfield  
> > wrote:
> > >>
> > >> We could indeed split up libarrow into more shared libraries. This
> > >> would mean accepting a lot more maintenance effort though, on a team
> > >> that is already overburdened. I'm not too keen on that in the short
> > >> term.
> > >
> > >
> > > Something for longer term to think about.  What are you seeing as the 
> > > added maintenance here?
> > >
> > >
> > > On Thu, Sep 19, 2019 at 5:38 PM Wes McKinney  wrote:
> > >>
> > >> hi Micah,
> > >>
> > >>
> > >> On Thu, Sep 19, 2019 at 12:41 AM Micah Kornfield  
> > >> wrote:
> > >> >
> > >> > >
> > >> > > * Should optional components be "opt in", "opt out", or a mix?
> > >> > > Currently it's a mix, and that's confusing for people. I think we
> > >> > > should make them all "opt in".
> > >> >
> > >> > Agreed they should all be opt in by default.  I think active developers
> > >> > are
> > >> > quite adept at flipping the appropriate CMake flags.
> > >> >
> > >>
> > >> Cool. I opened a tracking JIRA
> > >> https://issues.apache.org/jira/browse/ARROW-6637 and attached many
> > >> issues. Sorry for the new JIRA flood
> > >>
> > >> >
> > >> > > * Do we want to bring the out-of-the-box core build down to zero
> > >> > > dependencies, including not depending on boost::filesystem and
> > >> > > possibly checking the compiled Flatbuffers files.
> > >> >
> > >> >  While it may be
> > >> > > slightly more maintenance work, I think the optics of a
> > >> > > "dependency-free" core build would be beneficial and help the project
> > >> > > marketing-wise.
> > >> >
> > >> > I'm -.5 on checking in generated artifacts but this is mostly 
> > >> > stylistic.
> > >> > In the case of flatbuffers it seems like we might be able to get-away 
> > >> > with
> > >> > vendoring since it should mostly be headers only.
> > >> >
> > >> > I would prefer to try come up with more granular components and be
> > >> > very conservative on what is "core".  I think it should be possible 
> > >> > have a
> > >> > zero dependency build if only MemoryPool, Buffers, Arrays and 
> > >> > ArrayBuilders
> > >> > in a core package [1].  This combined with discussion Antoine started 
> > >> > on an
> > >> > ABI compatible C-layer would make basic inter-op within a process
> > >> > reasonable.  Moving up the stack to IPC and files, there is probably a 
> > >> > way
> > >> > to package headers separately from implementations.  This would allow 
> > >> > other
> > >> > projects wishing to integrate with Arrow to bring their own 
> > >> > implementations
> > >> > without the baggage of boost::filesystem. Would this leave anything 
> > >> > besides
> > >> > "flatbuffers" as a hard dependency to support IPC?
> > >> >
> > >>
> > >> We could indeed split up libarrow into more shared libraries. This
> > >> would mean accepting a lot more maintenance effort though, on a team
> > >> that is already overburdened. I'm not too keen on that in the short
> > >> term.
> > >>
> > >> > Thanks,
> > >> > Micah
> > >> >
> > >> >
> > >> > [1] It probably makes sense to go even further and separate out 
> > >> > MemoryPool
> > >> > and Buffer, so we can break the circular relationship between parquet 
> > >> > and
> > >> > arrow.
> > >>
> > >> Don't think this is possible even then, particularly in light of my
> > >> recent work reading and writing Arrow columnar data "closer to the
> > >> metal"  inside Parquet, yielding beneficial performance improvements.
> > >>
> > >> >
> > >> > On Wed, Sep 18, 2019 at 8:03 AM Wes McKinney  
> > >> > wrote:
> > >> >
> > > >> > > To be clear I think we should make these changes right after 0.15.0
> > > >> > > is released so we aren't playing whackamole with our packaging
> > > >> > > scripts. I'm happy to take the lead on the work...
> > >> > >
> > >> > > On Wed, Sep 18, 2019 at 9:54 AM Antoine Pitrou 
> > >> > > wrote:
> > >> > > >
> > >> > > > On Wed, 18 Sep 2019 09:46:54 -0500
> > >> > > > Wes McKinney  wrote:
> > > >> > > > > I think these are both interesting areas to explore further. I'd
> > > >> > > > > like to focus on the couple of immediate items I think we should
> > > >> > > > > address
> > >> > > > >
> > 

Re: [DISCUSS] Changing C++ build system default options to produce more barebones builds

2019-09-20 Thread Neal Richardson
I wonder if having a core "format" C++ library, which the io, compute,
etc. library/libraries would depend on, is a natural step.
Particularly since we're coming up on 1.0 and the format is being
declared stable.

Neal

On Fri, Sep 20, 2019 at 8:28 AM Wes McKinney  wrote:
>
> We would have to be even more careful about managing symbol exports.
> Third party projects would need to link more libraries in their
> applications (not unlike the way that Boost works now -- I suppose
> that Boost is the closest analogue to what we're going for)
>
> On Fri, Sep 20, 2019 at 2:30 AM Micah Kornfield  wrote:
> >>
> >> We could indeed split up libarrow into more shared libraries. This
> >> would mean accepting a lot more maintenance effort though, on a team
> >> that is already overburdened. I'm not too keen on that in the short
> >> term.
> >
> >
> > Something for longer term to think about.  What are you seeing as the added 
> > maintenance here?
> >
> >
> > On Thu, Sep 19, 2019 at 5:38 PM Wes McKinney  wrote:
> >>
> >> hi Micah,
> >>
> >>
> >> On Thu, Sep 19, 2019 at 12:41 AM Micah Kornfield  
> >> wrote:
> >> >
> >> > >
> > >> > > * Should optional components be "opt in", "opt out", or a mix?
> >> > > Currently it's a mix, and that's confusing for people. I think we
> >> > > should make them all "opt in".
> >> >
> > >> > Agreed they should all be opt in by default.  I think active developers
> > >> > are quite adept at flipping the appropriate CMake flags.
> >> >
> >>
> >> Cool. I opened a tracking JIRA
> >> https://issues.apache.org/jira/browse/ARROW-6637 and attached many
> >> issues. Sorry for the new JIRA flood
> >>
> >> >
> >> > > * Do we want to bring the out-of-the-box core build down to zero
> >> > > dependencies, including not depending on boost::filesystem and
> >> > > possibly checking the compiled Flatbuffers files.
> >> >
> >> >  While it may be
> >> > > slightly more maintenance work, I think the optics of a
> >> > > "dependency-free" core build would be beneficial and help the project
> >> > > marketing-wise.
> >> >
> > >> > I'm -.5 on checking in generated artifacts but this is mostly stylistic.
> > >> > In the case of flatbuffers it seems like we might be able to get away
> > >> > with vendoring since it should mostly be headers only.
> > >> >
> > >> > I would prefer to try to come up with more granular components and be
> > >> > very conservative on what is "core".  I think it should be possible to
> > >> > have a zero dependency build if only MemoryPool, Buffers, Arrays and
> > >> > ArrayBuilders are in a core package [1].  This combined with the
> > >> > discussion Antoine started on an ABI compatible C-layer would make basic
> > >> > inter-op within a process reasonable.  Moving up the stack to IPC and
> > >> > files, there is probably a way to package headers separately from
> > >> > implementations.  This would allow other projects wishing to integrate
> > >> > with Arrow to bring their own implementations without the baggage of
> > >> > boost::filesystem. Would this leave anything besides "flatbuffers" as a
> > >> > hard dependency to support IPC?
> >> >
> >>
> >> We could indeed split up libarrow into more shared libraries. This
> >> would mean accepting a lot more maintenance effort though, on a team
> >> that is already overburdened. I'm not too keen on that in the short
> >> term.
> >>
> >> > Thanks,
> >> > Micah
> >> >
> >> >
> > >> > [1] It probably makes sense to go even further and separate out
> > >> > MemoryPool and Buffer, so we can break the circular relationship between
> > >> > parquet and arrow.
> >>
> >> Don't think this is possible even then, particularly in light of my
> >> recent work reading and writing Arrow columnar data "closer to the
> >> metal"  inside Parquet, yielding beneficial performance improvements.
> >>
> >> >
> >> > On Wed, Sep 18, 2019 at 8:03 AM Wes McKinney  wrote:
> >> >
> >> > > To be clear I think we should make these changes right after 0.15.0 is
> >> > > released so we aren't playing whackamole with our packaging scripts.
> >> > > I'm happy to take the lead on the work...
> >> > >
> >> > > On Wed, Sep 18, 2019 at 9:54 AM Antoine Pitrou 
> >> > > wrote:
> >> > > >
> >> > > > On Wed, 18 Sep 2019 09:46:54 -0500
> >> > > > Wes McKinney  wrote:
> > >> > > > > I think these are both interesting areas to explore further. I'd
> > >> > > > > like to focus on the couple of immediate items I think we should
> > >> > > > > address
> > >> > > > >
> > >> > > > > * Should optional components be "opt in", "opt out", or a mix?
> >> > > > > Currently it's a mix, and that's confusing for people. I think we
> >> > > > > should make them all "opt in".
> >> > > > > * Do we want to bring the out-of-the-box core build down to zero
> >> > > > > dependencies, including not depending on boost::filesystem and
> >> > > > > possibly checking the compiled Flatbuffers files. While it may be
> >> > > > > slightly more maintenance work, I think the optics of a
> >> > > > > 

Re: [DISCUSS] Changing C++ build system default options to produce more barebones builds

2019-09-20 Thread Wes McKinney
We would have to be even more careful about managing symbol exports.
Third party projects would need to link more libraries in their
applications (not unlike the way that Boost works now -- I suppose
that Boost is the closest analogue to what we're going for)

On Fri, Sep 20, 2019 at 2:30 AM Micah Kornfield  wrote:
>>
>> We could indeed split up libarrow into more shared libraries. This
>> would mean accepting a lot more maintenance effort though, on a team
>> that is already overburdened. I'm not too keen on that in the short
>> term.
>
>
> Something for longer term to think about.  What are you seeing as the added 
> maintenance here?
>
>
> On Thu, Sep 19, 2019 at 5:38 PM Wes McKinney  wrote:
>>
>> hi Micah,
>>
>>
>> On Thu, Sep 19, 2019 at 12:41 AM Micah Kornfield  
>> wrote:
>> >
>> > >
>> > > * Should optional components be "opt in", "opt out", or a mix?
>> > > Currently it's a mix, and that's confusing for people. I think we
>> > > should make them all "opt in".
>> >
>> > Agreed they should all be opt in by default.  I think active developers are
>> > quite adept at flipping the appropriate CMake flags.
>> >
>>
>> Cool. I opened a tracking JIRA
>> https://issues.apache.org/jira/browse/ARROW-6637 and attached many
>> issues. Sorry for the new JIRA flood
>>
>> >
>> > > * Do we want to bring the out-of-the-box core build down to zero
>> > > dependencies, including not depending on boost::filesystem and
>> > > possibly checking the compiled Flatbuffers files.
>> >
>> >  While it may be
>> > > slightly more maintenance work, I think the optics of a
>> > > "dependency-free" core build would be beneficial and help the project
>> > > marketing-wise.
>> >
>> > I'm -.5 on checking in generated artifacts but this is mostly stylistic.
>> > In the case of flatbuffers it seems like we might be able to get away
>> > with vendoring since it should mostly be headers only.
>> >
>> > I would prefer to try to come up with more granular components and be
>> > very conservative on what is "core".  I think it should be possible to
>> > have a zero dependency build if only MemoryPool, Buffers, Arrays and
>> > ArrayBuilders are in a core package [1].  This combined with the
>> > discussion Antoine started on an ABI compatible C-layer would make basic
>> > inter-op within a process reasonable.  Moving up the stack to IPC and
>> > files, there is probably a way to package headers separately from
>> > implementations.  This would allow other projects wishing to integrate
>> > with Arrow to bring their own implementations without the baggage of
>> > boost::filesystem. Would this leave anything besides "flatbuffers" as a
>> > hard dependency to support IPC?
>> >
>>
>> We could indeed split up libarrow into more shared libraries. This
>> would mean accepting a lot more maintenance effort though, on a team
>> that is already overburdened. I'm not too keen on that in the short
>> term.
>>
>> > Thanks,
>> > Micah
>> >
>> >
>> > [1] It probably makes sense to go even further and separate out MemoryPool
>> > and Buffer, so we can break the circular relationship between parquet and
>> > arrow.
>>
>> Don't think this is possible even then, particularly in light of my
>> recent work reading and writing Arrow columnar data "closer to the
>> metal"  inside Parquet, yielding beneficial performance improvements.
>>
>> >
>> > On Wed, Sep 18, 2019 at 8:03 AM Wes McKinney  wrote:
>> >
>> > > To be clear I think we should make these changes right after 0.15.0 is
>> > > released so we aren't playing whackamole with our packaging scripts.
>> > > I'm happy to take the lead on the work...
>> > >
>> > > On Wed, Sep 18, 2019 at 9:54 AM Antoine Pitrou 
>> > > wrote:
>> > > >
>> > > > On Wed, 18 Sep 2019 09:46:54 -0500
>> > > > Wes McKinney  wrote:
>> > > > > I think these are both interesting areas to explore further. I'd like
>> > > > > to focus on the couple of immediate items I think we should address
>> > > > >
>> > > > > * Should optional components be "opt in", "opt out", or a mix?
>> > > > > Currently it's a mix, and that's confusing for people. I think we
>> > > > > should make them all "opt in".
>> > > > > * Do we want to bring the out-of-the-box core build down to zero
>> > > > > dependencies, including not depending on boost::filesystem and
>> > > > > possibly checking the compiled Flatbuffers files. While it may be
>> > > > > slightly more maintenance work, I think the optics of a
>> > > > > "dependency-free" core build would be beneficial and help the project
>> > > > > marketing-wise.
>> > > > >
>> > > > > Both of these issues must be addressed whether we undertake a Bazel
>> > > > > implementation or some other refactor of the C++ build system.
>> > > >
>> > > > I think checking in the Flatbuffers files (and also Protobuf and Thrift
>> > > > where applicable :-)) would be fine.
>> > > >
>> > > > As for boost::filesystem, getting rid of it wouldn't be a huge task.
>> > > > Still worth deciding whether we want to prioritize development time for
>> > > > it, 

[jira] [Created] (ARROW-6642) [Python] chained access of ParquetDataset's metadata segfaults

2019-09-20 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-6642:


 Summary: [Python] chained access of ParquetDataset's metadata segfaults
 Key: ARROW-6642
 URL: https://issues.apache.org/jira/browse/ARROW-6642
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche


Creating and reading a parquet dataset:

{code}
import pyarrow as pa
import pyarrow.parquet as pq

# Write a small table to disk, then open it as a dataset
table = pa.table({'a': [1, 2, 3]})
pq.write_table(table, '__test_statistics_segfault.parquet')
dataset = pq.ParquetDataset('__test_statistics_segfault.parquet')
dataset_piece = dataset.pieces[0]
{code}

If you access the metadata and a column's statistics in steps, this works fine:

{code}
meta = dataset_piece.get_metadata()
row = meta.row_group(0)
col = row.column(0)
{code}

But chaining the same calls in a single expression segfaults:

{code}
dataset_piece.get_metadata().row_group(0).column(0)
{code}

{{dataset_piece.get_metadata().row_group(0)}} still works; the segfault only
occurs once {{.column(0)}} is added to the chain.
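
The report does not identify a cause. One pattern that can produce exactly this stepwise-versus-chained difference is an object-lifetime problem: in a chained expression nothing holds on to the intermediate objects, so a child that does not keep its parent alive may end up pointing at freed memory. Whether that is what happens here is an assumption. The sketch below is a pure-Python toy model (the classes ToyMetadata and ToyRowGroup are hypothetical stand-ins, not pyarrow APIs) that only shows how the two call styles differ in what stays alive under CPython's reference counting:

```python
import weakref


class ToyMetadata:
    """Hypothetical stand-in for a parent metadata object (not pyarrow)."""

    def row_group(self, index):
        # The child records only a weak reference, so by itself it does
        # not keep the parent alive.
        return ToyRowGroup(weakref.ref(self), index)


class ToyRowGroup:
    """Hypothetical stand-in for a child view that depends on its parent."""

    def __init__(self, parent_ref, index):
        self._parent_ref = parent_ref
        self.index = index

    def parent_alive(self):
        # A dead weak reference returns None when called.
        return self._parent_ref() is not None


# Stepwise access: the name `meta` keeps the parent alive.
meta = ToyMetadata()
stepwise_alive = meta.row_group(0).parent_alive()

# Chained access: the temporary parent has no other owner once
# row_group() has returned, so CPython may free it immediately.
chained_alive = ToyMetadata().row_group(0).parent_alive()

print("stepwise parent alive:", stepwise_alive)
print("chained parent alive:", chained_alive)
```

If the real cause is of this kind, binding each intermediate result to a name, as in the stepwise snippet above, would be a reasonable workaround until the child objects hold a strong reference to their parent.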



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [DISCUSS] IPC buffer layout for Null type

2019-09-20 Thread Micah Kornfield
I think committing as-is is the better of the two options.

On Thu, Sep 19, 2019 at 12:35 PM Wes McKinney  wrote:

> OK, my preference, therefore, would be to rebase and merge my patch
> without bothering with backwards compatibility code. The situations
> where there would be an issue are fairly esoteric.
>
> https://github.com/apache/arrow/pull/5287
>
> On Thu, Sep 19, 2019 at 2:29 PM Antoine Pitrou 
> wrote:
> >
> >
> > Well, this is an incompatible IPC change, so ideally it should be done
> > now, not later.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > On Thu, 19 Sep 2019 14:08:37 -0500
> > Wes McKinney  wrote:
> >
> > > I'm concerned about rushing through any patch for this for 0.15.0, but
> > > each release with the status quo increases the risk of making changes.
> > > Thoughts?
> > >
> > > On Fri, Sep 6, 2019 at 12:59 PM Wes McKinney 
> wrote:
> > > >
> > > > On Fri, Sep 6, 2019 at 12:57 PM Micah Kornfield <
> emkornfi...@gmail.com> wrote:
> > > > >
> > > > > >
> > > > > > > We can't because the buffer layout is not transmitted --
> > > > > > > implementations make assumptions about what Buffer values
> > > > > > > correspond to each field. The only thing we could do to signal
> > > > > > > the change would be to increase the metadata version from V4
> > > > > > > to V5.
> > > > >
> > > > > > If we do this within 0.15.0 we could infer from the padding of
> > > > > > messages.
> > > > >
> > > >
> > > > > That's true. I'd be OK adding backward compatibility code (that we
> > > > > can probably remove later) to my patch...
> > > > >
> > > > > I'm not sure about the other implementations. I think that non-C++
> > > > > implementations, because they don't have much application code that
> > > > > can produce Null arrays, should simply use the no-buffers layout
> > > >
> > > > > On Fri, Sep 6, 2019 at 10:16 AM Wes McKinney 
> wrote:
> > > > >
> > > > > > On Fri, Sep 6, 2019, 12:08 PM Antoine Pitrou 
> wrote:
> > > > > >
> > > > > > >
> > > > > > > > Null can also come up when converting a column with only NA
> > > > > > > > values in a CSV file.  I don't remember for sure, but I think
> > > > > > > > the same can happen with JSON files as well.
> > > > > > > >
> > > > > > > > Can't we accept both forms when reading?  It sounds like it
> > > > > > > > should be reasonably easy.
> > > > > > >
> > > > > >
> > > > > > > We can't because the buffer layout is not transmitted --
> > > > > > > implementations make assumptions about what Buffer values
> > > > > > > correspond to each field. The only thing we could do to signal
> > > > > > > the change would be to increase the metadata version from V4
> > > > > > > to V5.
> > > > > >
> > > > > >
> > > > > > > Regards
> > > > > > >
> > > > > > > Antoine.
> > > > > > >
> > > > > > >
> > > > > > > Le 06/09/2019 à 17:36, Wes McKinney a écrit :
> > > > > > > > hi Micah,
> > > > > > > >
> > > > > > > > > Null wouldn't come up that often in practice. It could
> > > > > > > > > happen when converting from pandas, for example
> > > > > > > > >
> > > > > > > > > In [8]: df = pd.DataFrame({'col1': np.array([np.nan] * 10, dtype=object)})
> > > > > > > >
> > > > > > > > In [9]: t = pa.table(df)
> > > > > > > >
> > > > > > > > In [10]: t
> > > > > > > > Out[10]:
> > > > > > > > pyarrow.Table
> > > > > > > > col1: null
> > > > > > > > metadata
> > > > > > > > 
> > > > > > > > > {b'pandas': b'{"index_columns": [{"kind": "range", "name": null, "start": 0, "'
> > > > > > > > >             b'stop": 10, "step": 1}], "column_indexes": [{"name": null, "field'
> > > > > > > > >             b'_name": null, "pandas_type": "unicode", "numpy_type": "object", '
> > > > > > > > >             b'"metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "col1"'
> > > > > > > > >             b', "field_name": "col1", "pandas_type": "empty", "numpy_type": "o'
> > > > > > > > >             b'bject", "metadata": null}], "creator": {"library": "pyarrow", "v'
> > > > > > > > >             b'ersion": "0.14.1.dev464+g40d08a751"}, "pandas_version": "0.24.2"'
> > > > > > > > >             b'}'}
> > > > > > > >
> > > > > > > > > I'm inclined to make the change without worrying about
> > > > > > > > > backwards compatibility. If people have been persisting data
> > > > > > > > > against the recommendations of the project, the remedy is to
> > > > > > > > > use an older version of the library to read the files and
> > > > > > > > > write them to something else (like Parquet format) in the
> > > > > > > > > meantime.
> > > > > > > > >
> > > > > > > > > Obviously come 1.0.0 we'll begin to make compatibility
> > > > > > > > > guarantees so this will be less of an issue.
> > > > > > > >
> > > > > > > > - Wes
> > > > > > > >
> > > > > > > > On Thu, Sep 5, 2019 at 11:14 PM Micah Kornfield <
> emkornfi...@gmail.com
> > > > > > >
> > > > > > > wrote:
> > > > > > > >>
> > > > > > > >> Hi Wes and others,
> > > > > > > >> I don't have a sense of where Null arrays get created in
> 

Re: Collecting Arrow critique and our roadmap on that

2019-09-20 Thread Micah Kornfield
I think this is a good idea as well.  I added comments and additions to
the document.

On Thu, Sep 19, 2019 at 11:47 AM Neal Richardson <
neal.p.richard...@gmail.com> wrote:

> Uwe, I think this is an excellent idea. I've started
>
> https://docs.google.com/document/d/1cgN7mYzH30URDTaioHsCP2d80wKKHDNs9f5s7vdb2mA/edit?usp=sharing
> to collect some ideas and notes. Once we have gathered our thoughts
> there, we can put them in the appropriate places.
>
> I think that some of the result will go into the FAQ, some into
> documentation (maybe more "how-to" and "getting started" guides in the
> respective language docs, as well as some "how to share Arrow data
> from X to Y"), and other things that we haven't yet done should go
> into a sort of Roadmap document on the main website. We have some very
> outdated content related to a roadmap on the confluence wiki that
> should be folded in as appropriate too.
>
> Neal
>
> On Thu, Sep 19, 2019 at 10:26 AM Uwe L. Korn  wrote:
> >
> > Hello,
> >
> > there have been a lot of public discussions lately with some mentions of
> > actually informed, valid critique of things in the Arrow project. From my
> > perspective, these things include "there is no STL-native C++ Arrow API",
> > "the base build requires too many dependencies", "the pyarrow package is
> > really huge and you cannot select single components". These are things we
> > cannot tackle at the moment due to the lack of contributors to the project.
> > But we can use this as a basis to point out to people who critique the
> > project on this that it is not intentional but a lack of resources, and it
> > also provides another point of entry for new contributors looking for work.
> >
> > Thus I would like to start a document (possibly on the website) where we
> > list the major critiques of Arrow, mention our long-term solution to each
> > and what JIRAs need to be done for that.
> >
> > Would that be something others would also see as valuable?
> >
> > There has also been a lot of uninformed criticism. I think that can be
> > best combated by documentation, blog posts and public appearances at
> > conferences and is not covered by this proposal.
> >
> > Uwe
>


Re: [DISCUSS] Changing C++ build system default options to produce more barebones builds

2019-09-20 Thread Micah Kornfield
>
> We could indeed split up libarrow into more shared libraries. This
> would mean accepting a lot more maintenance effort though, on a team
> that is already overburdened. I'm not too keen on that in the short
> term.


Something for longer term to think about.  What are you seeing as the added
maintenance here?


On Thu, Sep 19, 2019 at 5:38 PM Wes McKinney  wrote:

> hi Micah,
>
>
> On Thu, Sep 19, 2019 at 12:41 AM Micah Kornfield 
> wrote:
> >
> > >
> > > * Should optional components be "opt in", "opt out", or a mix?
> > > Currently it's a mix, and that's confusing for people. I think we
> > > should make them all "opt in".
> >
> > Agreed they should all be opt in by default.  I think active developers
> > are quite adept at flipping the appropriate CMake flags.
> >
>
> Cool. I opened a tracking JIRA
> https://issues.apache.org/jira/browse/ARROW-6637 and attached many
> issues. Sorry for the new JIRA flood
>
> >
> > > * Do we want to bring the out-of-the-box core build down to zero
> > > dependencies, including not depending on boost::filesystem and
> > > possibly checking the compiled Flatbuffers files.
> >
> >  While it may be
> > > slightly more maintenance work, I think the optics of a
> > > "dependency-free" core build would be beneficial and help the project
> > > marketing-wise.
> >
> > I'm -.5 on checking in generated artifacts but this is mostly stylistic.
> > In the case of flatbuffers it seems like we might be able to get away
> > with vendoring since it should mostly be headers only.
> >
> > I would prefer to try to come up with more granular components and be
> > very conservative on what is "core".  I think it should be possible to
> > have a zero dependency build if only MemoryPool, Buffers, Arrays and
> > ArrayBuilders are in a core package [1].  This combined with the
> > discussion Antoine started on an ABI compatible C-layer would make basic
> > inter-op within a process reasonable.  Moving up the stack to IPC and
> > files, there is probably a way to package headers separately from
> > implementations.  This would allow other projects wishing to integrate
> > with Arrow to bring their own implementations without the baggage of
> > boost::filesystem. Would this leave anything besides "flatbuffers" as a
> > hard dependency to support IPC?
> >
>
> We could indeed split up libarrow into more shared libraries. This
> would mean accepting a lot more maintenance effort though, on a team
> that is already overburdened. I'm not too keen on that in the short
> term.
>
> > Thanks,
> > Micah
> >
> >
> > [1] It probably makes sense to go even further and separate out
> > MemoryPool and Buffer, so we can break the circular relationship between
> > parquet and arrow.
>
> Don't think this is possible even then, particularly in light of my
> recent work reading and writing Arrow columnar data "closer to the
> metal"  inside Parquet, yielding beneficial performance improvements.
>
> >
> > On Wed, Sep 18, 2019 at 8:03 AM Wes McKinney 
> wrote:
> >
> > > To be clear I think we should make these changes right after 0.15.0 is
> > > released so we aren't playing whackamole with our packaging scripts.
> > > I'm happy to take the lead on the work...
> > >
> > > On Wed, Sep 18, 2019 at 9:54 AM Antoine Pitrou 
> > > wrote:
> > > >
> > > > On Wed, 18 Sep 2019 09:46:54 -0500
> > > > Wes McKinney  wrote:
> > > > > > I think these are both interesting areas to explore further. I'd
> > > > > > like to focus on the couple of immediate items I think we should
> > > > > > address
> > > > > >
> > > > > > * Should optional components be "opt in", "opt out", or a mix?
> > > > > > Currently it's a mix, and that's confusing for people. I think we
> > > > > > should make them all "opt in".
> > > > > > * Do we want to bring the out-of-the-box core build down to zero
> > > > > > dependencies, including not depending on boost::filesystem and
> > > > > > possibly checking the compiled Flatbuffers files. While it may be
> > > > > > slightly more maintenance work, I think the optics of a
> > > > > > "dependency-free" core build would be beneficial and help the
> > > > > > project marketing-wise.
> > > > >
> > > > > Both of these issues must be addressed whether we undertake a Bazel
> > > > > implementation or some other refactor of the C++ build system.
> > > >
> > > > > I think checking in the Flatbuffers files (and also Protobuf and
> > > > > Thrift where applicable :-)) would be fine.
> > > > >
> > > > > As for boost::filesystem, getting rid of it wouldn't be a huge task.
> > > > > Still worth deciding whether we want to prioritize development time
> > > > > for it, because it's not entirely trivial either.
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > > >
> > >
>