[jira] [Created] (ARROW-6294) [C++] Use hyphen for plasma-store-server executable

2019-08-19 Thread Sutou Kouhei (Jira)
Sutou Kouhei created ARROW-6294:
---

 Summary: [C++] Use hyphen for plasma-store-server executable 
 Key: ARROW-6294
 URL: https://issues.apache.org/jira/browse/ARROW-6294
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Plasma
Reporter: Sutou Kouhei
Assignee: Sutou Kouhei






--
This message was sent by Atlassian Jira
(v8.3.2#803003)


Re: [VOTE] Proposed addition to Arrow Flight Protocol

2019-08-19 Thread Micah Kornfield
The motion carries with 4 binding +1 votes, 2 non-binding +1 votes, and no
other votes.

I think the next step is to review and merge the pending patch [1].

Thanks,
Micah

[1] https://github.com/apache/arrow/pull/4980



On Mon, Aug 19, 2019 at 2:52 AM Antoine Pitrou  wrote:

>
> +1 (binding)
>
> Regards
>
> Antoine.
>
>
> On 16/08/2019 at 07:44, Micah Kornfield wrote:
> > Hello,
> > Ryan Murray has proposed adding a GetFlightSchema RPC [1] to the Arrow
> > Flight Protocol [2].  The purpose of this RPC is to allow decoupling
> schema
> > and endpoint retrieval as provided by the GetFlightInfo RPC.  The new
> > definition provided is:
> >
> > message SchemaResult {
> >   // Serialized Flatbuffer Schema message.
> >   bytes schema = 1;
> > }
> > rpc GetSchema(FlightDescriptor) returns (SchemaResult) {}
> >
> > Ryan has also provided a PR demonstrating implementation of the new RPC
> [3]
> > in Java, C++ and Python which can be reviewed and merged after this
> > addition is approved.
> >
> > Please vote whether to accept the addition. The vote will be open for at
> > least 72 hours.
> >
> > [ ] +1 Accept this addition to the Flight protocol
> > [ ] +0
> > [ ] -1 Do not accept the changes because...
> >
> >
> > Thanks,
> > Micah
> >
> > [1]
> >
> https://docs.google.com/document/d/1zLdFYikk3owbKpHvJrARLMlmYpi-Ef6OJy7H90MqViA/edit
> > [2] https://github.com/apache/arrow/blob/master/format/Flight.proto
> > [3] https://github.com/apache/arrow/pull/4980
> >
>
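
The decoupling being voted on can be sketched with a small pure-Python mock. The classes below are illustrative stand-ins, not pyarrow.flight's actual API: GetFlightInfo forces the server to plan endpoints even when the caller only wants the schema, while the new GetSchema returns just the serialized schema bytes, mirroring the SchemaResult message above.

```python
# Mock illustration of the GetSchema RPC's purpose (not pyarrow.flight code).

class FlightInfo:
    """Stand-in for the existing GetFlightInfo response: schema + endpoints."""
    def __init__(self, schema, endpoints):
        self.schema = schema
        self.endpoints = endpoints

class SchemaResult:
    """Mirrors the voted-on message: just the serialized schema bytes."""
    def __init__(self, schema):
        self.schema = schema

class MockFlightServer:
    def __init__(self, datasets):
        # name -> (serialized schema bytes, list of endpoint URIs)
        self.datasets = datasets

    def get_flight_info(self, descriptor):
        # Must compute endpoints even if the caller only wants the schema.
        schema, endpoints = self.datasets[descriptor]
        return FlightInfo(schema, endpoints)

    def get_schema(self, descriptor):
        # Schema retrieval decoupled from endpoint planning.
        schema, _ = self.datasets[descriptor]
        return SchemaResult(schema)

server = MockFlightServer({"t": (b"serialized-schema", ["grpc://node1"])})
result = server.get_schema("t")
```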


[jira] [Created] (ARROW-6293) datafusion 0.15.0-SNAPSHOT error

2019-08-19 Thread xingzhicn (Jira)
xingzhicn created ARROW-6293:


 Summary: datafusion 0.15.0-SNAPSHOT error
 Key: ARROW-6293
 URL: https://issues.apache.org/jira/browse/ARROW-6293
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Affects Versions: 0.15.0
Reporter: xingzhicn
 Fix For: 0.15.0


error: failed to select a version for the requirement `datafusion = 
"^0.15.0-SNAPSHOT"`
 candidate versions found which didn't match: 0.14.1, 0.14.0, 0.13.0, ...
 location searched: crates.io index
required by package `myRust v0.1.0 (E:\workspace\myRust)`

I added "0.15.0-SNAPSHOT" to my Cargo.toml, but crates.io does not have this version.
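
SNAPSHOT versions are never published to crates.io, so the requirement above can't resolve. Until 0.15.0 is released, one workaround is a git dependency on the Arrow repository (a sketch; Cargo locates the datafusion package by name inside the repo):

```toml
[dependencies]
# Pre-release (SNAPSHOT) versions are not on crates.io; pull the crate
# from the Arrow git repository instead.
datafusion = { git = "https://github.com/apache/arrow" }
```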

 





[jira] [Created] (ARROW-6292) [C++] Add an option to build with mimalloc

2019-08-19 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-6292:
-

 Summary: [C++] Add an option to build with mimalloc
 Key: ARROW-6292
 URL: https://issues.apache.org/jira/browse/ARROW-6292
 Project: Apache Arrow
  Issue Type: Wish
  Components: C++
Reporter: Antoine Pitrou


It's a new allocator, Apache-licensed, by Microsoft. It claims very good 
performance and is cross-platform (works on Windows and Unix).
https://github.com/microsoft/mimalloc/

There's a detailed set of APIs including aligned allocation and
zero-initialized allocation. However, zero-initialized reallocation doesn't
seem to be provided.
https://microsoft.github.io/mimalloc/group__malloc.html#details
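
Since (per the note above) mimalloc offers zero-initialized allocation but apparently no zero-initialized reallocation, a caller would have to zero the grown region manually. A minimal sketch of that workaround, using standard C allocation as a stand-in (the same pattern would apply with mi_realloc):

```c
#include <stdlib.h>
#include <string.h>

/* Grow a buffer whose contents must stay zero-initialized, manually
 * zeroing the newly added region after realloc.  Returns NULL on
 * allocation failure (the original buffer is then still valid). */
void *realloc_zeroed(void *buf, size_t old_size, size_t new_size) {
    void *grown = realloc(buf, new_size);
    if (grown == NULL) return NULL;
    if (new_size > old_size)
        memset((char *)grown + old_size, 0, new_size - old_size);
    return grown;
}
```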






Re: [Discuss] [Python] protocol for conversion to pyarrow Array

2019-08-19 Thread Wes McKinney
No concerns from me either.

On Mon, Aug 19, 2019 at 5:10 AM Antoine Pitrou  wrote:
>
>
> No concern from me.  It should probably be documented somewhere though :-)
>
> Regards
>
> Antoine.
>
>
> On 16/08/2019 at 17:23, Joris Van den Bossche wrote:
> > Coming back to this older thread, I have opened a PR with a proof of
> > concept of the proposed protocol to convert third-party array objects to
> > arrow: https://github.com/apache/arrow/pull/5106
> > In the tests, I added the protocol to pandas' nullable integer array (which
> > is currently not supported in the from_pandas conversion), and it now
> > converts nicely without many changes.
> >
> > Are there remaining concerns about such a protocol?
> >
> > --
> >
> > Note that the protocol is only for pandas -> arrow conversion (or other
> > array-like objects -> arrow). The other way around (arrow -> pandas) is
> > more complex and needs further discussion, and also involves the Arrow
> > ExtensionTypes (as mentioned below by Wes).
> > But I think the protocol will be useful in any case, and we can go ahead
> > with that already (for example, not all pandas ExtensionArrays will need to
> > map to an Arrow ExtensionType; e.g. the nullable integers simply map to
> > arrow's int64, or fletcher's ExtensionArrays just wrap an arrow array).
> > That said, I have been working on the arrow ExtensionTypes the last days,
> > and have been keeping an overview of the issues and needed work in this
> > google document:
> > https://docs.google.com/document/d/1pr9PuBfXTdlUoAgyh9zPIKDJZalDLI6GuxqblMynMM8/edit?usp=sharing
> > (feel free to comment on it). There is also an initial PR to extend the
> > support for defining ExtensionTypes in Python (ARROW-5610
> > ,
> > https://github.com/apache/arrow/pull/5094).
> >
> > Joris
> >
> > On Fri, 17 May 2019 at 00:28, Wes McKinney  wrote:
> >
> >> hi Joris,
> >>
> >> Somewhat related to this, I want to also point out that we have C++
> >> extension types [1]. As part of this, it would also be good to define
> >> and document a public API for users to create ExtensionArray
> >> subclasses that can be serialized and deserialized using this
> >> machinery.
> >>
> >> As a motivating example, suppose that a Java application has a special
> >> data type that can be serialized as a Binary value in Arrow, and we
> >> want to be able to receive this special object as a pandas
> >> ExtensionArray column, which unboxes into a Python user-space type.
> >>
> >> The ExtensionType can be implemented in Java, and then on the Python
> >> side the implementation can occur either in C++ or Python. An API will
> >> need to be defined to register serializer functions for the pandas
> >> ExtensionArray to map the pandas-space type onto the Arrow-space
> >> type. Does this seem like a project you might be able to help drive
> >> forward? As a matter of sequencing, we do not yet have the capability
> >> to interact with C++ ExtensionType in Python, so we might need to
> >> first create callback machinery to enable Arrow extension types to be
> >> defined in Python (that call into the C++ ExtensionType registry)
> >>
> >> - Wes
> >>
> >> [1]:
> >> https://github.com/apache/arrow/blob/master/cpp/src/arrow/extension_type-test.cc
> >>
> >> On Fri, May 10, 2019 at 2:11 AM Joris Van den Bossche
> >>  wrote:
> >>>
> >>> Op do 9 mei 2019 om 21:38 schreef Uwe L. Korn :
> >>>
>  +1 to the idea of adding a protocol to let other objects define their
> >> way
>  to Arrow structures. For pandas.Series I would expect that they return
> >> an
>  Arrow Column.
> 
>  For the Arrow->pandas conversion I have a bit mixed feelings. In the
>  normal Fletcher case I would expect that we don't convert anything as
> >> we
>  represent anything from Arrow with it.
> >>>
> >>>
> >>> Yes, you don't want to convert anything (apart from wrapping the arrow
> >>> array into a FletcherArray). But how does Table.to_pandas know that?
> >>> Maybe it doesn't need to know that. And then you might write a function
> >>> in fletcher to convert a pyarrow Table to a pandas DataFrame with
> >>> fletcher-backed columns. But if you want to have this roundtrip
> >>> automatically, without each project that defines an ExtensionArray and
> >>> wants to interact with arrow (e.g. GeoPandas) needing its own
> >>> "arrow-table-to-pandas-dataframe" converter, pyarrow needs to have
> >>> some notion of how to convert back to a pandas ExtensionArray.
> >>>
> >>>
>  For the case where we want to restore the exact pandas DataFrame we had
>  before, this will become a bit more complicated, as we would either need
>  all third-party libraries to support Arrow via a hook as proposed, or
>  we also define some other kind of protocol on the pandas side to
>  reconstruct ExtensionArrays from Arrow data.
> 
> >>>
> >>> That last one is basically what I 
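
The proposed protocol boils down to a duck-typed hook that the converter probes for on third-party array objects. A pure-Python sketch of the pattern (the `ArrowArray` class and `array()` function are mocks, not pyarrow's real classes; the dunder name follows the PR under discussion but is illustrative):

```python
# Mock sketch of a conversion protocol: a third-party array object
# exposes a hook that the converter calls when present.

class ArrowArray:
    """Stand-in for pyarrow.Array."""
    def __init__(self, values, type):
        self.values = list(values)
        self.type = type

class NullableIntArray:
    """Stand-in for pandas' nullable integer ExtensionArray."""
    def __init__(self, values):
        self.values = values  # list of int or None

    def __arrow_array__(self, type=None):
        # The object itself decides how it maps to Arrow.
        return ArrowArray(self.values, type or "int64")

def array(obj, type=None):
    """Stand-in for pa.array(): defer to the protocol when available."""
    if hasattr(obj, "__arrow_array__"):
        return obj.__arrow_array__(type=type)
    return ArrowArray(obj, type or "object")

arr = array(NullableIntArray([1, None, 3]))
```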

Having a hard time merging PRs after the JIRA upgrade

2019-08-19 Thread Wes McKinney
It seems that INFRA upgraded JIRA -- over the last hour the JIRA REST
API seems to be struggling, I'm not sure what's going wrong:

https://issues.apache.org/jira/browse/INFRA-18900

- Wes


Re: [C++] Naming changes merged

2019-08-19 Thread Wes McKinney
Thanks for doing this! Perhaps we should document these conventions in
developer/cpp.rst in the Sphinx project

On Mon, Aug 19, 2019 at 12:13 PM Antoine Pitrou  wrote:
>
>
> Hello,
>
> In https://github.com/apache/arrow/pull/5069 I've merged the naming
> changes previously discussed.  In short:
> - always use underscores in C++ source file names
> - always use underscores in .so / .a file names
> - always use hyphens in executable file names
> - always use hyphens in pkgconfig file names
>
> Hopefully this will not create too many spurious conflicts in other
> Pull Requests.  Also, you may find it useful to recreate your CMake
> build directories locally, if any problems arise.
>
> Regards
>
> Antoine.
>
>
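
The four conventions above can be illustrated with representative file names (these examples are illustrative, not an exhaustive list from the PR; `plasma-store-server` matches the rename proposed in ARROW-6294):

```shell
# One example per naming rule from the merged change:
src_file="array_builder.cc"       # C++ source files: underscores
lib_file="libarrow_flight.so"     # .so / .a library files: underscores
exe_file="plasma-store-server"    # executables: hyphens
pkg_file="arrow-flight.pc"        # pkgconfig files: hyphens
echo "$src_file $lib_file $exe_file $pkg_file"
```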


Re: Timeline for 0.15.0 release

2019-08-19 Thread Ji Liu
Hi Wes, on the Java side I can think of several bugs that need to be fixed or
flagged.

i. ARROW-6040: Dictionary entries are required in IPC streams even when empty [1]
This one is under review now; however, through this PR we found that there seems
to be a bug in how Java reads and writes dictionaries in IPC, which is
inconsistent with the spec [2] since it assumes all dictionaries are at the
start of the stream (see details in the PR comments; this fix may not make it
into 0.15.0). @Micah Kornfield

ii. ARROW-1875: Write 64-bit ints as strings in integration test JSON files [3]
The Java-side code is already checked in; other implementations seem not to be.

iii. ARROW-6202: OutOfMemory in JdbcAdapter [4]
Caused by trying to load all records in one contiguous batch; fixed by providing
an iterator API for incremental reading in ARROW-6219 [5].

Thanks,
Ji Liu

[1] https://github.com/apache/arrow/pull/4960
[2] https://arrow.apache.org/docs/ipc.html
[3] https://issues.apache.org/jira/browse/ARROW-1875
[4] https://issues.apache.org/jira/browse/ARROW-6202
[5] https://issues.apache.org/jira/browse/ARROW-6219



--
From: Wes McKinney
Send Time: Monday, 19 August 2019, 23:03
To: dev
Subject: Re: Timeline for 0.15.0 release

I'm going to work on organizing the 0.15.0 backlog some this
week. If anyone wants to help with grooming (particularly for
languages other than C++/Python, where I'm focusing), that would be
helpful. There have been almost 500 JIRA issues opened since the
0.14.0 release, so we should make sure to check whether there are any
regressions or other serious bugs that we should try to fix for
0.15.0.

On Thu, Aug 15, 2019 at 6:23 PM Wes McKinney  wrote:
>
> The Windows wheel issue in 0.14.1 seems to be
>
> https://issues.apache.org/jira/browse/ARROW-6015
>
> I think the root cause could be the Windows changes in
>
> https://github.com/apache/arrow/commit/223ae744cc2a12c60cecb5db593263a03c13f85a
>
> I would be appreciative if a volunteer would look into what was wrong
> with the 0.14.1 wheels on Windows. Otherwise 0.15.0 Windows wheels
> will be broken, too
>
> The bad wheels can be found at
>
> https://bintray.com/apache/arrow/python#files/python%2F0.14.1
>
> On Thu, Aug 15, 2019 at 1:28 PM Antoine Pitrou  wrote:
> >
> > On Thu, 15 Aug 2019 11:17:07 -0700
> > Micah Kornfield  wrote:
> > > >
> > > > In C++ they are
> > > > independent, we could have 32-bit array lengths and variable-length
> > > > types with 64-bit offsets if we wanted (we just wouldn't be able to
> > > > have a List child with more than INT32_MAX elements).
> > >
> > > I think the point is we could do this in C++ but we don't.  I'm not sure
> > > we would have introduced the "Large" types if we did.
> >
> > 64-bit offsets take twice as much space as 32-bit offsets, so if you're
> > storing lots of small-ish lists or strings, 32-bit offsets are
> > preferable.  So even with 64-bit array lengths from the start it would
> > still be beneficial to have types with 32-bit offsets.
> >
> > > Going with the limited address space in Java and calling it a reference
> > > implementation seems suboptimal. If a consumer uses a "Large" type
> > > presumably it is because they need the ability to store more than
> > > INT32_MAX child elements in a column, otherwise it is just wasting
> > > space [1].
> >
> > Probably. Though if the individual elements (lists or strings) are
> > large, not much space is wasted in proportion, so it may be simpler in
> > such a case to always create a "Large" type array.
> >
> > > [1] I suppose theoretically there might be some performance benefits on
> > > 64-bit architectures to using the native word sizes.
> >
> > Concretely, common 64-bit architectures don't do that, as 32-bit is an
> > extremely common integer size even in high-performance code.
> >
> > Regards
> >
> > Antoine.
> >
> >
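
The offsets tradeoff Antoine describes is easy to quantify: for a variable-length type, the offsets buffer holds (length + 1) entries, so 64-bit offsets double that buffer relative to 32-bit. A back-of-envelope sketch (the helper below is illustrative, not Arrow code):

```python
# Approximate buffer sizes for a string-like array: value data plus an
# offsets buffer of (num_values + 1) entries at 4 or 8 bytes each.
def offsets_overhead(num_values, avg_value_bytes):
    data = num_values * avg_value_bytes
    off32 = 4 * (num_values + 1)
    off64 = 8 * (num_values + 1)
    return (off32 + data, off64 + data)

small = offsets_overhead(1_000_000, 8)    # many small strings: big overhead
large = offsets_overhead(1_000, 1 << 20)  # few large strings: negligible
```

With many small strings, moving to 64-bit offsets inflates total size by about a third here; with large values the difference is in the noise, which is why both offset widths remain useful.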




Re: [DISCUSS] ArrayBuilders with mutable type

2019-08-19 Thread Antoine Pitrou


If it becomes much more expensive then calling it `type()` (rather than
e.g. GetCurrentType()) is a bit misleading.

Regards

Antoine.


On 19/08/2019 at 16:27, Wes McKinney wrote:
> On Mon, Aug 19, 2019 at 9:16 AM Ben Kietzman  wrote:
>>
>> Thanks for responding.
>>
>> I can certainly add VisitBuilder/VisitBuilderInline for ArrayBuilder, but
>> there's a slight difficulty: some types don't have a single concrete
>> builder class. For example, the builders of dictionary arrays are templated
>> on the encoded type and also on an index builder, which is either adaptive
>> or fixed at int32. I can make this transparent in the definition of
>> VisitBuilder so that one can add Visit(StringDictionaryBuilder*) and
>> Visit(DictionaryBuilder*). Something similar is already required
>> for IntervalType, which may be MonthIntervalType or DayTimeIntervalType
>> depending on the value of IntervalType::interval_type().
>>
>> Another difficulty is the adaptive builders. Whatever they return for
>> ArrayBuilder::type()/type_id()/..., it won't help to discriminate them from
>> the non adaptive builders. Is there any application which uses the adaptive
>> builders except the dictionary builders? If possible, I think it would be
>> simplest to make their inheritance of ArrayBuilder protected.
>>
> 
> My gut feeling is that making ArrayBuilder::type virtual is the simplest
> thing and causes the complexity related to changing types to be
> localized to classes like AdaptiveIntBuilder. I don't have a sense for
> in what use cases changing from an inline member (type_) to a virtual
> function would cause a meaningful performance issue. You could even
> play tricks by having something like
> 
> std::shared_ptr<DataType> type() const {
>   if (!is_fixed_type_) {
> // type_ is null, GetCurrentType is a protected virtual
> return GetCurrentType();
>   } else {
> return type_;
>   }
> }
> 
> You said "much more expensive for nested builders" -- where and how exactly?
> 
>> Ben
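
The trick Wes sketches above can be made compilable. All class and member names below are simplified assumptions for illustration, not Arrow's actual implementation: fixed-type builders keep the cheap inline path, while mutable-type builders (adaptive int, dictionary) compute their current type through a protected virtual hook.

```cpp
#include <memory>
#include <string>

struct DataType { std::string name; };  // stand-in for arrow::DataType

class ArrayBuilder {
 public:
  std::shared_ptr<DataType> type() const {
    if (is_fixed_type_) return type_;  // fast path: immutable type_
    return GetCurrentType();           // mutable-type builders compute it
  }
 protected:
  virtual std::shared_ptr<DataType> GetCurrentType() const { return type_; }
  std::shared_ptr<DataType> type_;
  bool is_fixed_type_ = true;
};

class AdaptiveIntBuilder : public ArrayBuilder {
 public:
  AdaptiveIntBuilder() { is_fixed_type_ = false; }
  void Widen() { width_ *= 2; }  // e.g. int8 -> int16 on overflow
 protected:
  std::shared_ptr<DataType> GetCurrentType() const override {
    return std::make_shared<DataType>(
        DataType{"int" + std::to_string(width_)});
  }
 private:
  int width_ = 8;
};
```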



Re: [DISCUSS] ArrayBuilders with mutable type

2019-08-19 Thread Francois Saint-Jacques
Indeed, I'd expect the `type()` method to not be called in the hot path.

François

On Mon, Aug 19, 2019 at 10:17 AM Wes McKinney  wrote:
>
> hi Ben,
>
> On this possibility
>
> - Make ArrayBuilder::type() virtual. This will be much more expensive for
> nested builders; for applications which need to branch on
> ArrayBuilder::type()->id(), ArrayBuilder::type_id() should be provided as
> well.
>
> Could you explain what would be a situation where these methods
> (type() and type_id()) would be on a hot path? Whether virtual or not,
> I would think that writing code like
>
> VisitTypeInline(*builder->type(), func)
>
> would not necessarily be advisable for cell-level granularity in general.
>
> - Wes
>
> On Sun, Aug 18, 2019 at 8:05 PM Sutou Kouhei  wrote:
> >
> > Hi,
> >
> > Thanks for working on this.
> >
> > This fix would be very welcome for Arrow GLib.
> >
> > Arrow GLib wants array builder's type ID to convert C++
> > object to GLib object. Here is a JIRA issue for this use case:
> >   https://issues.apache.org/jira/browse/ARROW-5355
> >
> > In Arrow GLib's use case, ArrayBuilder::type_id() is only
> > required. Arrow GLib doesn't need ArrayBuilder::type().
> >
> > Furthermore, Arrow GLib just wants to know the type of
> > ArrayBuilder (not the type of built Array). Arrow GLib
> > doesn't need ArrayBuilder::type() nor
> > ArrayBuilder::type_id() if ArrayBuilder::id() or something
> > (e.g. visitor API for ArrayBuilder) is provided. Arrow GLib
> > can use dynamic_cast and a NULL check to detect the real
> > ArrayBuilder class, but I'd like to avoid that as much as
> > possible.
> >
> >
> > Thanks,
> > --
> > kou
> >
> > In 
> >   "[DISCUSS] ArrayBuilders with mutable type" on Fri, 16 Aug 2019 16:40:15 
> > -0400,
> >   Ben Kietzman  wrote:
> >
> > > For some array builders, ArrayBuilder::type() will be different from the
> > > type of array produced by ArrayBuilder::Finish(). These are:
> > > - AdaptiveIntBuilder will progress through {int8, int16, int32, int64}
> > > whenever a value is inserted which cannot be stored using the current
> > > integer type.
> > > - DictionaryBuilder will similarly increase the width of its indices if 
> > > its
> > > memo table grows too large.
> > > - {Dense,Sparse}UnionBuilder may append a new child builder
> > > - Any nested builder whose child builders include a builder with mutable
> > > type
> > >
> > > IMHO if ArrayBuilder::type is sporadically inaccurate then it's a user
> > > hostile API and needs to be fixed.
> > >
> > > The current solution is for mutable type to be marked by
> > > ArrayBuilder::type() == null. This results in significant loss of metadata
> > > from nested types; for example StructBuilder::FinishInternal currently 
> > > sets
> > > all field names to "" if constructed with null type. Null type is
> > > inconsistently applied; a builder of list(dictionary()) will currently
> > > finish to an invalid array if the dictionary builder widens its indices
> > > before finishing.
> > >
> > > Options:
> > > - Implement array builders such that ArrayBuilder::type() is always the
> > > type to which the builder would Finish. There is a PR for this
> > > https://github.com/apache/arrow/pull/4930 but it introduces performance
> > > regressions for the dictionary builders: 5% if the values are integer, 
> > > 1.8%
> > > if they are strings.
> > > - Provide ArrayBuilder::UpdateType(); type() is not guaranteed to be
> > > accurate unless immediately preceded by UpdateType().
> > > - Remove ArrayBuilder::type() in favor of ArrayBuilder::type_id(), which
> > > will be an immutable property of ArrayBuilders.
> > > - Make ArrayBuilder::type() virtual. This will be much more expensive for
> > > nested builders; for applications which need to branch on
> > > ArrayBuilder::type()->id(), ArrayBuilder::type_id() should be provided as
> > > well.




Re: [DISCUSS] Apache Arrow manylinux1 support

2019-08-19 Thread Wes McKinney
On Mon, Aug 19, 2019 at 8:55 AM Antoine Pitrou  wrote:
>
> On Mon, 19 Aug 2019 08:44:26 -0500
> Wes McKinney  wrote:
> > Will publishing only manylinux2010 wheels have any consequences (for
> > example, a relatively new version of setuptools may be required)?
>
> A relatively new version of pip is required.  But upgrading pip is
> straightforward, at least in a virtual environment or private Python
> install.
>

OK. So people building Dockerfiles on Linux will have to upgrade the
pip that's available in their package manager in a lot of
cases. It doesn't seem like a big deal, but it would need to be
appropriately documented, since there are likely to be a lot of folks
out there who are running `pip install pyarrow` without much
additional thought about this detail.



Re: [DISCUSS] Apache Arrow manylinux1 support

2019-08-19 Thread Antoine Pitrou
On Mon, 19 Aug 2019 08:44:26 -0500
Wes McKinney  wrote:
> Will publishing only manylinux2010 wheels have any consequences (for
> example, a relatively new version of setuptools may be required)?

A relatively new version of pip is required.  But upgrading pip is
straightforward, at least in a virtual environment or private Python
install.

Regards

Antoine.







Re: [DISCUSS] Apache Arrow manylinux1 support

2019-08-19 Thread Wes McKinney
Will publishing only manylinux2010 wheels have any consequences (for
example, a relatively new version of setuptools may be required)?

On Fri, Aug 16, 2019 at 11:58 AM Neal Richardson
 wrote:
>
> For R's official support for various C++ versions, see
> https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Using-C_002b_002b11-code
> and below. Empirically, C++ > 11 is not really used: there are only 6
> packages on CRAN that declare it as a requirement, and none of those
> are widely used.
>
> $ R
> > df <- tools::CRAN_package_db()
> > table(grepl("C++11", df$SystemRequirements, fixed=TRUE))
>
> FALSE  TRUE
> 14502   275
> > table(grepl("C++14", df$SystemRequirements, fixed=TRUE))
>
> FALSE  TRUE
> 14771 6
> > df[grepl("C++14", df$SystemRequirements, fixed=TRUE), c("Package", "Reverse 
> > depends", "Reverse imports", "Reverse suggests")]
> Package Reverse depends Reverse imports Reverse suggests
> 6071   IsoSpecR 
> 8004   multinet 
> 8200 ndjson streamR 
> 10487 RcppAlgos STraTUS  bigIntegerAlgos
> 5rmdcev 
> 14391walker 
> > table(grepl("C++17", df$SystemRequirements, fixed=TRUE))
>
> FALSE
> 14777
>
> On Fri, Aug 16, 2019 at 8:32 AM Antoine Pitrou  wrote:
> >
> >
> > Le 16/08/2019 à 17:11, Hatem Helal a écrit :
> > > Hi all,
> > >
> > > I ran into a surprising (to me) limitation when working on an issue [1].  
> > > To summarize, supporting the manylinux1 standard ties Arrow development 
> > > to gcc 4.8.x, which is technically not C++11-complete.  This brought up a 
> > > few questions for me:
> > >
> > > * What are the pre-conditions for dropping manylinux1 / gcc 4.8.x?  I 
> > > found an open task to remove support altogether [2].
> >
> > Not much IMHO.   1) The people who have been producing Python wheels up
> > to now have decided to stop spending valuable time on hairy binary
> > compatibility and distribution issues.  2) Last I tried, manylinux2010
> > works and someone who's interested in reviving Python Linux wheels can
> > probably produce such wheels instead of manylinux1.
> >
> > So IMHO we can drop manylinux1 support right now.  However:
> >
> > > * What is needed to move to C++ 14?
> >
> > Make sure that all important toolchains support it.  Unfortunately, I
> > don't think that's the case for the MinGW version that's used to build R
> > packages on Windows.  It's using gcc 4.9.3.
> >
> > See e.g.
> > https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/26742063/job/7k57qamlpb5cchfh?fullLog=true#L666
> >
> > > * Would either of these changes normally require a PMC-driven vote?
> >
> > I don't think dropping manylinux1 needs a PMC vote.  It's simply a case
> > of a high-cost recurring activity that doesn't find a volunteer anymore.
> >  The PMC can't simply claim that we continue supporting manylinux1 if
> > there's nobody around to do the actual work.
> >
> > As for switching the baseline to C++14, it would probably require a vote
> > indeed.  And I expect a -1 if the R Windows build can't be migrated to a
> > newer compiler.
> >
> > Regards
> >
> > Antoine.


Re: Gandiva Java benchmarks

2019-08-19 Thread Ravindra Pindikura
On Sat, Aug 17, 2019 at 5:09 AM Rui Wang  wrote:

> I got help for a pointer to Gandiva cpp's micro benchmark
> <https://github.com/apache/arrow/blob/master/cpp/src/gandiva/tests/micro_benchmarks.cc>.
> I will start from there.
>

There is also a primitive Java variant here:

https://github.com/apache/arrow/blob/master/java/gandiva/src/test/java/org/apache/arrow/gandiva/evaluator/MicroBenchmarkTest.java


>
>
> -Rui
>
> On Fri, Aug 16, 2019 at 2:49 PM Rui Wang  wrote:
>
> > Hello Arrow community,
> >
> > I am studying Gandiva and especially interested in its performance.
> >
> > I am wondering if there is prior benchmarking work happened somewhere on
> > Gandiva Java? Also, is it worth building some benchmarking code for
> Gandiva
> > Java (something like SparkSQL's benchmarks
> > <https://github.com/apache/spark/tree/master/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark>)?
> >
> >
> >
> > -Rui
> >
>


-- 
Thanks and regards,
Ravindra.


Re: [Discuss] [Python] protocol for conversion to pyarrow Array

2019-08-19 Thread Antoine Pitrou


No concern from me.  It should probably be documented somewhere though :-)

Regards

Antoine.


Le 16/08/2019 à 17:23, Joris Van den Bossche a écrit :
> Coming back to this older thread, I have opened a PR with a proof of
> concept of the proposed protocol to convert third-party array objects to
> arrow: https://github.com/apache/arrow/pull/5106
> In the tests, I added the protocol to pandas' nullable integer array (which
> is currently not supported in the from_pandas conversion), and it now
> converts nicely with only minor changes.
> 
> Are there remaining concerns about such a protocol?
> 
> --
> 
> Note that the protocol is only for pandas -> arrow conversion (or other
> array-like objects -> arrow). The other way around (arrow -> pandas) is
> more complex and needs further discussion, and also involves the Arrow
> ExtensionTypes (as mentioned below by Wes).
> But I think the protocol will be useful in any case, and we can go ahead
> with that already (for example, not all pandas ExtensionArrays will need to
> map to an Arrow ExtensionType, e.g. the nullable integers simply map to
> arrow's int64, or fletcher's ExtensionArrays, which just wrap an arrow array).
> That said, I have been working on the arrow ExtensionTypes the last days,
> and have been keeping an overview of the issues and needed work in this
> google document:
> https://docs.google.com/document/d/1pr9PuBfXTdlUoAgyh9zPIKDJZalDLI6GuxqblMynMM8/edit?usp=sharing
> (feel free to comment on it). There is also an initial PR to extend the
> support for defining ExtensionTypes in Python (ARROW-5610,
> https://github.com/apache/arrow/pull/5094).
> 
> Joris
> 
> On Fri, 17 May 2019 at 00:28, Wes McKinney  wrote:
> 
>> hi Joris,
>>
>> Somewhat related to this, I want to also point out that we have C++
>> extension types [1]. As part of this, it would also be good to define
>> and document a public API for users to create ExtensionArray
>> subclasses that can be serialized and deserialized using this
>> machinery.
>>
>> As a motivating example, suppose that a Java application has a special
>> data type that can be serialized as a Binary value in Arrow, and we
>> want to be able to receive this special object as a pandas
>> ExtensionArray column, which unboxes into a Python user-space type.
>>
>> The ExtensionType can be implemented in Java, and then on the Python
>> side the implementation can occur either in C++ or Python. An API will
>> need to be defined to serializer functions for the pandas
>> ExtensionArray to map the pandas-space type onto the the Arrow-space
>> type. Does this seem like a project you might be able to help drive
>> forward? As a matter of sequencing, we do not yet have the capability
>> to interact with C++ ExtensionType in Python, so we might need to
>> first create callback machinery to enable Arrow extension types to be
>> defined in Python (that call into the C++ ExtensionType registry)
>>
>> - Wes
>>
>> [1]:
>> https://github.com/apache/arrow/blob/master/cpp/src/arrow/extension_type-test.cc
>>
>> On Fri, May 10, 2019 at 2:11 AM Joris Van den Bossche
>>  wrote:
>>>
>>> Op do 9 mei 2019 om 21:38 schreef Uwe L. Korn :
>>>
 +1 to the idea of adding a protocol to let other objects define their
>> way
 to Arrow structures. For pandas.Series I would expect that they return
>> an
 Arrow Column.

 For the Arrow->pandas conversion I have a bit mixed feelings. In the
 normal Fletcher case I would expect that we don't convert anything as
>> we
 represent anything from Arrow with it.
>>>
>>>
>>> Yes, you don't want to convert anything (apart from wrapping the arrow
>>> array into a FletcherArray). But how does Table.to_pandas know that?
>>> Maybe it doesn't need to know that. And then you might write a function
>> in
>>> fletcher to convert a pyarrow Table to a pandas DataFrame with
>>> fletcher-backed columns. But if you want to have this roundtrip
>>> automatically, without the need that each project that defines an
>>> ExtensionArray and wants to interact with arrow (eg in GeoPandas as well)
> needs to have its own "arrow-table-to-pandas-dataframe" converter,
>> pyarrow
>>> needs to have some notion of how to convert back to a pandas
>> ExtensionArray.
>>>
>>>
 For the case where we want to restore the exact pandas DataFrame we had
 before this will become a bit more complicated as we either would need
>> to
 have all third-party libraries to support Arrow via a hook as proposed
>> or
 we also define some kind of other protocol on the pandas side to
 reconstruct ExtensionArrays from Arrow data.

>>>
>>> That last one is basically what I proposed in
>>>
>> https://github.com/pandas-dev/pandas/issues/20612/#issuecomment-489649556
>>>
>>> Thanks Antoine and Uwe for the discussion!
>>>
>>> Joris
>>
> 
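
The hook-based protocol discussed in this thread can be sketched in plain Python. This is a hedged illustration only: the hook name `__arrow_array__` and the dispatch logic are assumptions drawn from the discussion and the linked PR, and the "arrow array" returned here is a stand-in tuple rather than a real pyarrow object, so no pyarrow import is needed.

```python
# Hedged sketch of the proposed conversion protocol.  The hook name
# __arrow_array__ and the dispatch shown are assumptions based on the
# discussion; the merged implementation may differ.  The returned tuple
# stands in for a pyarrow.Array so the dispatch logic stays visible.
class NullableIntArray:
    """Stand-in for a third-party array (e.g. a pandas ExtensionArray)."""

    def __init__(self, values):
        self.values = list(values)

    def __arrow_array__(self, type=None):
        # A real implementation would build and return a pyarrow.Array.
        return ("arrow-int64-array", self.values, type)


def array(obj, type=None):
    """What pyarrow.array() might do: prefer the object's own hook."""
    if hasattr(obj, "__arrow_array__"):
        return obj.__arrow_array__(type=type)
    raise TypeError(f"don't know how to convert {obj!r}")


result = array(NullableIntArray([1, 2, None]))
assert result == ("arrow-int64-array", [1, 2, None], None)
```

With such a hook, third-party libraries control their own conversion without pyarrow needing special-case knowledge of each array type.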


Re: [DISCUSS][Format][C++] Improvement of sparse tensor format and implementation

2019-08-19 Thread Antoine Pitrou


Hi,

This sounds fine on the principle.  I'll let other comment on the details.

Regards

Antoine.


Le 19/08/2019 à 11:29, Kenta Murata a écrit :
> Hi,
> 
> I’d like to propose the following improvement of the sparse tensor
> format and implementation.
> 
> (1) To make variable bit-width indices available.
> 
> The main purpose of the first part of the proposal is making 32-bit
> indices available.  It allows us to serialize scipy.sparse.csr_matrix
> objects etc. with 32-bit indices without converting the index arrays
> to 64-bit values.  As Jed said in the previous discussion [1] in this
> ML, since 32-bit indices have advantages of the small memory
> footprints, I strongly consider this change is necessary for the
> sparse tensor support for Apache Arrow.  Adding both the type field in
> each sparse index format and the stride field in SparseCOOIndex format
> is necessary to do this.
> 
> (2) Adding the new COO format with separated row and column indices
> 
> Scipy.sparse.coo_matrix manages the indices of row and column in
> separated numpy arrays.  It is enough for representing a sparse
> matrix.  On the other hand, for supporting sparse tensors with
> arbitrary ranks, Arrow's SparseCOOIndex manages COO indices as one
> matrix. Hence we need to make a copy of indices to convert
> scipy.sparse.coo_matrix to Arrow’s SparseTensor.  Introducing the new
> COO format with separated row and column indices can resolve this
> issue.
> 
> (3) Adding SparseCSCIndex
> 
> The CSC format of sparse matrices has the advantage of faster scanning
> in the columnar direction, while the CSR format is faster for row-wise
> scans.  Because CSC's strengths differ from CSR's, I want to support
> CSC before releasing Arrow 1.0.
> 
> There is a work-in-progress branch [2] for (1) above.  I’d appreciate
> any comments or suggestions.
> 
> [1] 
> http://mail-archives.apache.org/mod_mbox/arrow-dev/201903.mbox/%3c87pnqz70rg@jedbrown.org%3e
> 
> [2] https://github.com/mrkn/arrow/tree/sparse_tensor_index_value_type
> 
> Regards,
> Kenta Murata
> 


Re: [VOTE] Proposed addition to Arrow Flight Protocol

2019-08-19 Thread Antoine Pitrou


+1 (binding)

Regards

Antoine.


Le 16/08/2019 à 07:44, Micah Kornfield a écrit :
> Hello,
> Ryan Murray has proposed adding a GetFlightSchema RPC [1] to the Arrow
> Flight Protocol [2].  The purpose of this RPC is to allow decoupling schema
> and endpoint retrieval as provided by the GetFlightInfo RPC.  The new
> definition provided is:
> 
> message SchemaResult {
>   // Serialized Flatbuffer Schema message.
>   bytes schema = 1;
> }
> rpc GetSchema(FlightDescriptor) returns (SchemaResult) {}
> 
> Ryan has also provided a PR demonstrating implementation of the new RPC [3]
> in Java, C++ and Python which can be reviewed and merged after this
> addition is approved.
> 
> Please vote whether to accept the addition. The vote will be open for at
> least 72 hours.
> 
> [ ] +1 Accept this addition to the Flight protocol
> [ ] +0
> [ ] -1 Do not accept the changes because...
> 
> 
> Thanks,
> Micah
> 
> [1]
> https://docs.google.com/document/d/1zLdFYikk3owbKpHvJrARLMlmYpi-Ef6OJy7H90MqViA/edit
> [2] https://github.com/apache/arrow/blob/master/format/Flight.proto
> [3] https://github.com/apache/arrow/pull/4980
> 
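
The decoupling the vote approves can be sketched with a plain-Python mock, with no real gRPC or pyarrow dependency. The message names mirror Flight.proto above, but the service class, its dataset storage, and the byte payloads are illustrative assumptions, not Arrow's actual implementation.

```python
# Hedged sketch: how the new GetSchema RPC decouples schema retrieval
# from endpoint retrieval.  MockFlightService and its contents are
# assumptions for illustration only.
from dataclasses import dataclass


@dataclass
class FlightDescriptor:
    path: str


@dataclass
class SchemaResult:
    schema: bytes  # serialized Flatbuffer Schema message


@dataclass
class FlightInfo:
    schema: bytes
    endpoints: list  # locations where the data can actually be fetched


class MockFlightService:
    def __init__(self):
        self._datasets = {
            "sales": {"schema": b"<flatbuffer schema>",
                      "endpoints": ["grpc://node1"]},
        }

    def get_flight_info(self, descriptor: FlightDescriptor) -> FlightInfo:
        # Pre-existing RPC: returns schema *and* endpoints together,
        # forcing the server to plan endpoints even for schema-only calls.
        d = self._datasets[descriptor.path]
        return FlightInfo(schema=d["schema"], endpoints=d["endpoints"])

    def get_schema(self, descriptor: FlightDescriptor) -> SchemaResult:
        # Proposed RPC: schema only, no endpoint planning required.
        return SchemaResult(schema=self._datasets[descriptor.path]["schema"])


svc = MockFlightService()
desc = FlightDescriptor(path="sales")
assert svc.get_schema(desc).schema == svc.get_flight_info(desc).schema
```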


[DISCUSS][Format][C++] Improvement of sparse tensor format and implementation

2019-08-19 Thread Kenta Murata
Hi,

I’d like to propose the following improvement of the sparse tensor
format and implementation.

(1) To make variable bit-width indices available.

The main purpose of the first part of the proposal is making 32-bit
indices available.  It allows us to serialize scipy.sparse.csr_matrix
objects etc. with 32-bit indices without converting the index arrays
to 64-bit values.  As Jed said in the previous discussion [1] in this
ML, 32-bit indices have the advantage of a smaller memory footprint,
so I strongly believe this change is necessary for sparse tensor
support in Apache Arrow.  Adding both the type field in
each sparse index format and the stride field in SparseCOOIndex format
is necessary to do this.

(2) Adding the new COO format with separated row and column indices

Scipy.sparse.coo_matrix manages the indices of row and column in
separated numpy arrays.  It is enough for representing a sparse
matrix.  On the other hand, for supporting sparse tensors with
arbitrary ranks, Arrow's SparseCOOIndex manages COO indices as one
matrix. Hence we need to make a copy of indices to convert
scipy.sparse.coo_matrix to Arrow’s SparseTensor.  Introducing the new
COO format with separated row and column indices can resolve this
issue.

(3) Adding SparseCSCIndex

The CSC format of sparse matrices has the advantage of faster scanning
in the columnar direction, while the CSR format is faster for row-wise
scans.  Because CSC's strengths differ from CSR's, I want to support
CSC before releasing Arrow 1.0.

There is a work-in-progress branch [2] for (1) above.  I’d appreciate
any comments or suggestions.

[1] 
http://mail-archives.apache.org/mod_mbox/arrow-dev/201903.mbox/%3c87pnqz70rg@jedbrown.org%3e

[2] https://github.com/mrkn/arrow/tree/sparse_tensor_index_value_type

Regards,
Kenta Murata
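
Points (1) and (2) of the proposal can be illustrated with a small plain-Python sketch. The layouts and numbers below are assumptions for illustration, not Arrow's actual memory format; `array("i")` is used as a stand-in for a 32-bit index buffer (its itemsize is 4 on common CPython builds).

```python
# Illustrative sketch (stdlib only) of why separated COO indices and
# variable bit-width indices matter.  Layouts here are assumptions.
from array import array

# (2) scipy-style COO keeps row and col indices in *separate* arrays:
row = array("i", [0, 0, 1, 3])
col = array("i", [1, 2, 0, 3])

# Arrow's current SparseCOOIndex expects one (nnz x ndim) index matrix,
# so converting from scipy.sparse.coo_matrix requires interleaving the
# indices into a fresh copy -- the copy the proposal wants to avoid:
stacked = array("i", (v for pair in zip(row, col) for v in pair))
assert list(stacked) == [0, 1, 0, 2, 1, 0, 3, 3]

# (1) variable bit-width indices: for a million nonzeros in a rank-2
# tensor, 32-bit indices halve the index storage relative to 64-bit.
def coo_index_bytes(nnz, ndim, typecode):
    """Bytes for an nnz x ndim COO index matrix stored as `typecode`."""
    return nnz * ndim * array(typecode).itemsize

assert coo_index_bytes(1_000_000, 2, "q") == 16_000_000  # int64 ("q" is 8 bytes)
print(coo_index_bytes(1_000_000, 2, "i"))  # typically 8_000_000 ("i" is 4 bytes on common platforms)
```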