Re: [VOTE] Proposed changes to Arrow Flight protocol

2019-04-05 Thread Kouhei Sutou
+1 (binding)

In 
  "[VOTE] Proposed changes to Arrow Flight protocol" on Tue, 2 Apr 2019 
19:05:27 -0500,
  Wes McKinney  wrote:

> Hi,
> 
> David Li has proposed to make the following additions or changes
> to the Flight gRPC service definition [1] and general design, as explained in
> greater detail in the linked Google Docs document [2]. Arrow
> Flight is an in-development messaging framework for creating
> services that can, among other things, send and receive the Arrow
> binary protocol without intermediate serialization.
> 
> The changes proposed are as follows:
> 
> Proposal 1: In FlightData, add a bytes field for application-defined metadata.
> In DoPut, change the return type to be streaming, and add a bytes
> field to PutResult for application-defined metadata.
> 
> Proposal 2: In client/server APIs, add a call options parameter to
> control timeouts and provide access to the identity of the
> authenticated peer (if any).
> 
> Proposal 3: Add an interface to define authentication protocols on the
> client and server, using the existing Handshake endpoint and adding a
> protocol-defined, per-call token.
> 
> Proposal 4: Construct the client/server using builders to allow
> configuration of transport-specific options and open the door for
> alternative transports.
> 
> The actual changes will be made through subsequent pull requests
> that change Flight.proto and the existing Flight implementations
> in C++ and Java.
> 
> Please vote whether to accept the changes. The vote will be open
> for at least 72 hours.
> 
> [ ] +1 Accept these changes to the Flight protocol
> [ ] +0
> [ ] -1 Do not accept the changes because...
> 
> Thanks,
> Wes
> 
> [1]: https://github.com/apache/arrow/blob/master/format/Flight.proto
> [2]: 
> https://docs.google.com/document/d/1aIVZ8SD5dMZXHTCeEY9PoNAwyuUgG-UEjmd3zfs1PYM/edit
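To make Proposal 1 concrete, here is a hypothetical pseudo-client sketch in Python. The names (do_put, write, app_metadata, results) are invented purely for illustration and are not the actual Flight client API; they only show the shape of a streaming DoPut with application-defined metadata flowing in both directions.

# Hypothetical pseudo-code, NOT the real Flight API: names are invented to
# illustrate Proposal 1 (bytes metadata on FlightData, streaming PutResult).
def do_put_with_acks(client, descriptor, batches):
    writer, results = client.do_put(descriptor)            # hypothetical call
    for i, batch in enumerate(batches):
        # Each FlightData message may carry application-defined metadata.
        writer.write(batch, app_metadata=b"batch-%d" % i)
        # DoPut now returns a stream of PutResult messages, so the server
        # can acknowledge individual batches via its own metadata bytes.
        ack = next(results)
        print("server acknowledged:", ack.app_metadata)
    writer.close()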


Re: [VOTE] Add new DurationInterval Type to Arrow Format

2019-04-05 Thread Kouhei Sutou
+1 (binding)

In 
  "[VOTE] Add new DurationInterval Type to Arrow Format" on Wed, 3 Apr 2019 
07:59:56 -0700,
  Jacques Nadeau  wrote:

> I'd like to propose a change to the Arrow format to support a new duration
> type. Details below. See the mailing list threads for the surrounding
> discussion.
> 
> 
> /// An absolute length of time unrelated to any calendar artifacts.  For the
> /// purposes of Arrow implementations, adding this value to a Timestamp
> /// ("t1") naively (i.e. simply summing the two numbers) is acceptable even
> /// though in some cases the resulting Timestamp ("t2") would not account
> /// for leap-seconds during the elapsed time between "t1" and "t2".
> /// Similarly, representing the difference between two Unix timestamps is
> /// acceptable, but would yield a value that is possibly a few seconds off
> /// from the true elapsed time.
> ///
> /// The resolution defaults to millisecond, but can be any of the other
> /// supported TimeUnit values, as with Timestamp and Time types.  This type
> /// is always represented as an 8-byte integer.
> table DurationInterval {
>    unit: TimeUnit = MILLISECOND;
> }
> 
> 
> Please vote whether to accept the changes. The vote will be open
> for at least 72 hours.
> 
> [ ] +1 Accept these changes to the Flight protocol
> [ ] +0
> [ ] -1 Do not accept the changes because...
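As a small illustration of the "naive summation" semantics described in the comment above (Python, millisecond resolution assumed; the values are made up):

# Naive addition of a millisecond Duration to a millisecond Timestamp:
# plain integer summation, with no attempt to account for leap seconds.
t1_ms = 1_554_508_800_000      # a Unix timestamp in milliseconds
duration_ms = 90_000_000       # an absolute span of time (25 hours)

t2_ms = t1_ms + duration_ms    # acceptable, though t2 may be a few seconds
elapsed_ms = t2_ms - t1_ms     # off from the "true" elapsed time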


Re: [VOTE] Add new DurationInterval Type to Arrow Format

2019-04-05 Thread Micah Kornfield
I think this needs another PMC member to weigh in. Would someone mind taking a look?

On Wed, Apr 3, 2019 at 9:21 AM Jacques Nadeau  wrote:

> Yes, copy and paste error:
>
> +1 to add the new type (binding)
>
> On Wed, Apr 3, 2019 at 8:36 AM Wes McKinney  wrote:
>
> > +1 (binding) to add the new type
> >
> > On Wed, Apr 3, 2019 at 10:35 AM Micah Kornfield 
> > wrote:
> > >
> > > +1 (non-binding).
> > >
> > > P.S. Copy and paste error on the plus 1 option from the flight vote?
> > >
> > > On Wednesday, April 3, 2019, Jacques Nadeau 
> wrote:
> > >
> > > > I'd like to propose a change to the Arrow format to support a new
> > > > duration type. Details below. See the mailing list threads for the
> > > > surrounding discussion.
> > > >
> > > >
> > > > /// An absolute length of time unrelated to any calendar artifacts.
> > > > /// For the purposes of Arrow implementations, adding this value to a
> > > > /// Timestamp ("t1") naively (i.e. simply summing the two numbers) is
> > > > /// acceptable even though in some cases the resulting Timestamp ("t2")
> > > > /// would not account for leap-seconds during the elapsed time between
> > > > /// "t1" and "t2".  Similarly, representing the difference between two
> > > > /// Unix timestamps is acceptable, but would yield a value that is
> > > > /// possibly a few seconds off from the true elapsed time.
> > > > ///
> > > > /// The resolution defaults to millisecond, but can be any of the other
> > > > /// supported TimeUnit values, as with Timestamp and Time types.  This
> > > > /// type is always represented as an 8-byte integer.
> > > > table DurationInterval {
> > > >    unit: TimeUnit = MILLISECOND;
> > > > }
> > > >
> > > >
> > > > Please vote whether to accept the changes. The vote will be open
> > > > for at least 72 hours.
> > > >
> > > > [ ] +1 Accept these changes to the Flight protocol
> > > > [ ] +0
> > > > [ ] -1 Do not accept the changes because...
> > > >
> >
>


[jira] [Created] (ARROW-5130) Segfault when importing TensorFlow after Pyarrow

2019-04-05 Thread Travis Addair (JIRA)
Travis Addair created ARROW-5130:


 Summary: Segfault when importing TensorFlow after Pyarrow
 Key: ARROW-5130
 URL: https://issues.apache.org/jira/browse/ARROW-5130
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Python
Affects Versions: 0.13.0
Reporter: Travis Addair


This issue is similar to https://jira.apache.org/jira/browse/ARROW-2657 which 
was fixed in v0.10.0.

When we import TensorFlow after Pyarrow on Linux (Debian Jessie), we get a 
segfault.  To reproduce:
{code:python}
import pyarrow
import tensorflow
{code}
Here's the backtrace from gdb:
{code:java}
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x in ?? ()
(gdb) bt
#0 0x in ?? ()
#1 0x7f529ee04410 in pthread_once () at 
../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_once.S:103
#2 0x7f5229a74efa in void std::call_once(std::once_flag&, void 
(&)()) () from 
/usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
#3 0x7f5229a74f3e in 
tensorflow::port::TestCPUFeature(tensorflow::port::CPUFeature) () from 
/usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
#4 0x7f522978b561 in tensorflow::port::(anonymous 
namespace)::CheckFeatureOrDie(tensorflow::port::CPUFeature, std::string const&) 
()
from 
/usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
#5 0x7f522978b5b4 in _GLOBAL__sub_I_cpu_feature_guard.cc () from 
/usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
#6 0x7f529f224bea in call_init (l=, argc=argc@entry=9, 
argv=argv@entry=0x7ffc6d8c1488, env=env@entry=0x294c0c0) at dl-init.c:78
#7 0x7f529f224cd3 in call_init (env=0x294c0c0, argv=0x7ffc6d8c1488, argc=9, 
l=) at dl-init.c:36
#8 _dl_init (main_map=main_map@entry=0x2e4aff0, argc=9, argv=0x7ffc6d8c1488, 
env=0x294c0c0) at dl-init.c:126
#9 0x7f529f228e38 in dl_open_worker (a=a@entry=0x7ffc6d8bebb8) at 
dl-open.c:577
#10 0x7f529f224aa4 in _dl_catch_error 
(objname=objname@entry=0x7ffc6d8beba8, 
errstring=errstring@entry=0x7ffc6d8bebb0, 
mallocedp=mallocedp@entry=0x7ffc6d8beba7,
operate=operate@entry=0x7f529f228b60 , 
args=args@entry=0x7ffc6d8bebb8) at dl-error.c:187
#11 0x7f529f22862b in _dl_open (file=0x7f5248178b54 
"/usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so",
 mode=-2147483646, caller_dlopen=,
nsid=-2, argc=9, argv=0x7ffc6d8c1488, env=0x294c0c0) at dl-open.c:661
#12 0x7f529ebf402b in dlopen_doit (a=a@entry=0x7ffc6d8bedd0) at dlopen.c:66
#13 0x7f529f224aa4 in _dl_catch_error (objname=0x2950fc0, 
errstring=0x2950fc8, mallocedp=0x2950fb8, operate=0x7f529ebf3fd0 , 
args=0x7ffc6d8bedd0) at dl-error.c:187
#14 0x7f529ebf45dd in _dlerror_run (operate=operate@entry=0x7f529ebf3fd0 
, args=args@entry=0x7ffc6d8bedd0) at dlerror.c:163
#15 0x7f529ebf40c1 in __dlopen (file=, mode=) 
at dlopen.c:87
#16 0x00540859 in _PyImport_GetDynLoadFunc ()
#17 0x0054024c in _PyImport_LoadDynamicModule ()
#18 0x005f2bcb in ?? ()
#19 0x004ca235 in PyEval_EvalFrameEx ()
#20 0x004ca9c2 in PyEval_EvalFrameEx ()
#21 0x004c8c39 in PyEval_EvalCodeEx ()
#22 0x004c84e6 in PyEval_EvalCode ()
#23 0x004c6e5c in PyImport_ExecCodeModuleEx ()
#24 0x004c3272 in ?? ()
#25 0x004b19e2 in ?? ()
#26 0x004b13d7 in ?? ()
#27 0x004b42f6 in ?? ()
#28 0x004d1aab in PyEval_CallObjectWithKeywords ()
#29 0x004ccdb3 in PyEval_EvalFrameEx ()
#30 0x004c8c39 in PyEval_EvalCodeEx ()
#31 0x004c84e6 in PyEval_EvalCode ()
#32 0x004c6e5c in PyImport_ExecCodeModuleEx ()
#33 0x004c3272 in ?? ()
#34 0x004b1d3f in ?? ()
#35 0x004b6b2b in ?? ()
#36 0x004b0d82 in ?? ()
#37 0x004b42f6 in ?? ()
#38 0x004d1aab in PyEval_CallObjectWithKeywords ()
#39 0x004ccdb3 in PyEval_EvalFrameEx (){code}
It looks like the code changes that fixed the previous issue were recently 
removed in 
[https://github.com/apache/arrow/commit/b766bff34b7d85034d26cebef5b3aeef1eb2fd82#diff-16806bcebc1df2fae432db426905b9f0].
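A possible interim workaround, assuming the same root cause as ARROW-2657 (this is an assumption, not verified against this report): reverse the import order so that libtensorflow_framework.so is loaded before pyarrow's bundled libraries.

{code:python}
# Possible workaround (assumption, untested here): import tensorflow first so
# its shared library is loaded before pyarrow's, then import pyarrow.
import tensorflow  # noqa: F401
import pyarrow     # noqa: F401
{code}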



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [VOTE] Proposed changes to Arrow Flight protocol

2019-04-05 Thread Wes McKinney
hi,

We still need another PMC to look at the 4 proposals, since 2 of them
do not have the requisite votes.

Thanks

On Thu, Apr 4, 2019 at 1:28 PM Wes McKinney  wrote:
>
> Could some other PMC members have a look at these proposals? 2 out of
> the 4 have the requisite 3 votes, while 2 need another +1
>
> On Wed, Apr 3, 2019 at 10:44 AM Bryan Cutler  wrote:
> >
> > +1 (non-binding)
> >
> > On Wed, Apr 3, 2019 at 7:52 AM Jacques Nadeau  wrote:
> >
> > > I'm +1 to all four (binding)
> > >
> > > On Wed, Apr 3, 2019 at 1:56 AM Antoine Pitrou  wrote:
> > >
> > > >
> > > >
> > > > On 03/04/2019 at 02:05, Wes McKinney wrote:
> > > > > Hi,
> > > > >
> > > > > David Li has proposed to make the following additions or changes
> > > > > to the Flight gRPC service definition [1] and general design, as
> > > > explained in
> > > > > greater detail in the linked Google Docs document [2]. Arrow
> > > > > Flight is an in-development messaging framework for creating
> > > > > services that can, among other things, send and receive the Arrow
> > > > > binary protocol without intermediate serialization.
> > > > >
> > > > > The changes proposed are as follows:
> > > > >
> > > > > Proposal 1: In FlightData, add a bytes field for application-defined
> > > > metadata.
> > > > > In DoPut, change the return type to be streaming, and add a bytes
> > > > > field to PutResult for application-defined metadata.
> > > >
> > > > +1 (binding).
> > > >
> > > > > Proposal 2: In client/server APIs, add a call options parameter to
> > > > > control timeouts and provide access to the identity of the
> > > > > authenticated peer (if any).
> > > >
> > > > +0.
> > > >
> > > > > Proposal 3: Add an interface to define authentication protocols on the
> > > > > client and server, using the existing Handshake endpoint and adding a
> > > > > protocol-defined, per-call token.
> > > >
> > > > +0.
> > > >
> > > > > Proposal 4: Construct the client/server using builders to allow
> > > > > configuration of transport-specific options and open the door for
> > > > > alternative transports.
> > > >
> > > > +1 (binding).
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > >


[DRAFT] Apache Arrow ASF Board Report April 2019

2019-04-05 Thread Wes McKinney
## Description:

Apache Arrow is a cross-language development platform for in-memory data. It
specifies a standardized language-independent columnar memory format for flat
and hierarchical data, organized for efficient analytic operations on modern
hardware. It also provides computational libraries and zero-copy streaming
messaging and interprocess communication. Languages currently supported
include C, C++, C#, Go, Java, JavaScript, MATLAB, Python, R, Ruby, and Rust.

## Issues:
- There are no issues requiring board attention at this time

## Activity:
 - The project received a donation of DataFusion, a Rust-based query
engine for Apache Arrow

## Health report:
- The project is very healthy, with a growing number and diversity of
   contributors

## PMC changes:

 - Currently 26 PMC members.
 - Andrew Grove was added to the PMC on Sun Feb 03 2019

## Committer base changes:

 - Currently 41 committers.
 - New committers:
- Micah Kornfield was added as a committer on Fri Mar 08 2019
- Deepak Majeti was added as a committer on Thu Jan 31 2019
- Paddy Horan was added as a committer on Fri Feb 08 2019
- Ravindra Pindikura was added as a committer on Fri Feb 01 2019
- Sun Chao was added as a committer on Fri Feb 22 2019

## Releases:

 - 0.12.0 was released on Sat Jan 26 2019
 - 0.12.1 was released on Sun Feb 24 2019
 - 0.13.0 was released on Sun Mar 31 2019
 - JS-0.4.0 was released on Tue Feb 05 2019
 - JS-0.4.1 was released on Sat Mar 23 2019

## JIRA activity:

 - 969 JIRA tickets created in the last 3 months
 - 861 JIRA tickets closed/resolved in the last 3 months


Re: Need 64-bit Integer length for Parquet ByteArray Type

2019-04-05 Thread Brian Bowman
Thanks Ryan,

After further pondering this, I came to similar conclusions.

Compress the data before putting it into a Parquet ByteArray, and if that’s not 
feasible, reference it in an external/persisted data structure.

Another alternative is to create one or more “shadow columns” to store the 
overflow horizontally.

-Brian
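A minimal sketch of the first option, using pyarrow and zlib purely for illustration (not something prescribed in this thread): compress the blob client-side so the value handed to Parquet stays under the uint32 length limit.

# Sketch only: compress a large blob before storing it as a Parquet BINARY
# value, keeping the encoded length under the 2^32-1 byte ByteArray limit.
import zlib
import pyarrow as pa
import pyarrow.parquet as pq

blob = b"\x00" * (64 * 1024 * 1024)        # 64 MiB of raw data
compressed = zlib.compress(blob, 6)        # must remain < 2**32 bytes

table = pa.Table.from_arrays(
    [pa.array([compressed], type=pa.binary())], names=["payload"])
pq.write_table(table, "blobs.parquet")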

On Apr 5, 2019, at 3:11 PM, Ryan Blue <rb...@netflix.com> wrote:


EXTERNAL

I don't think that's what you would want to do. Parquet will eventually 
compress large values, but not after making defensive copies and attempting to 
encode them. In the end, it will be a lot more overhead, plus the work to make 
it possible. I think you'd be much better off compressing before storing in 
Parquet if you expect good compression rates.

On Fri, Apr 5, 2019 at 11:29 AM Brian Bowman <brian.bow...@sas.com> wrote:
My hope is that these large ByteArray values will encode/compress to a fraction 
of their original size.  FWIW, 
cpp/src/parquet/column_writer.cc/.h has int64_t 
offset and length fields all over the place.

External file references to BLOBS is doable but not the elegant, integrated 
solution I was hoping for.

-Brian

On Apr 5, 2019, at 1:53 PM, Ryan Blue <rb...@netflix.com> wrote:


EXTERNAL

Looks like we will need a new encoding for this: 
https://github.com/apache/parquet-format/blob/master/Encodings.md

That doc specifies that the plain encoding uses a 4-byte length. That's not 
going to be a quick fix.

Now that I'm thinking about this a bit more, does it make sense to support byte 
arrays that are more than 2GB? That's far larger than the size of a row group, 
let alone a page. This would completely break memory management in the JVM 
implementation.

Can you solve this problem using a BLOB type that references an external file 
with the gigantic values? Seems to me that values this large should go in 
separate files, not in a Parquet file where it would destroy any benefit from 
using the format.

On Fri, Apr 5, 2019 at 10:43 AM Brian Bowman <brian.bow...@sas.com> wrote:
Hello Ryan,

Looks like it's limited by both the Parquet implementation and the Thrift 
message methods.  Am I missing anything?

From cpp/src/parquet/types.h

struct ByteArray {
  ByteArray() : len(0), ptr(NULLPTR) {}
  ByteArray(uint32_t len, const uint8_t* ptr) : len(len), ptr(ptr) {}
  uint32_t len;
  const uint8_t* ptr;
};

From cpp/src/parquet/thrift.h

inline void DeserializeThriftMsg(const uint8_t* buf, uint32_t* len, T* 
deserialized_msg) {
inline int64_t SerializeThriftMsg(T* obj, uint32_t len, OutputStream* out)

-Brian

On 4/5/19, 1:32 PM, "Ryan Blue" <rb...@netflix.com.INVALID> wrote:

EXTERNAL

Hi Brian,

This seems like something we should allow. What imposes the current limit?
Is it in the thrift format, or just the implementations?

On Fri, Apr 5, 2019 at 10:23 AM Brian Bowman <brian.bow...@sas.com> wrote:

> All,
>
> SAS requires support for storing varying-length character and binary blobs
> with a 2^64 max length in Parquet.   Currently, the ByteArray len field is
> a uint32_t.   Looks like this will require incrementing the Parquet file
> format version and changing ByteArray len to uint64_t.
>
> Have there been any requests for this or other Parquet developments that
> require file format versioning changes?
>
> I realize this is a non-trivial ask.  Thanks for considering it.
>
> -Brian
>


--
Ryan Blue
Software Engineer
Netflix




--
Ryan Blue
Software Engineer
Netflix


--
Ryan Blue
Software Engineer
Netflix


Re: Need 64-bit Integer length for Parquet ByteArray Type

2019-04-05 Thread Wes McKinney
hi Brian,

Just to comment from the C++ side -- the 64-bit issue is a limitation
of the Parquet format itself and not related to the C++
implementation. It would possibly be interesting to add a
LARGE_BYTE_ARRAY type with 64-bit offset encoding (we are discussing
doing much the same in Apache Arrow for in-memory).

- Wes
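A toy illustration of what 64-bit offset encoding means for variable-length values (Python/NumPy, illustrative only; this is not the Arrow or Parquet layout specification): the values live in one contiguous data buffer and an offsets array marks their boundaries, so widening the offsets from 32 to 64 bits is what allows the concatenated data to exceed 2 GiB.

import numpy as np

values = [b"foo", b"a" * 10, b"bar"]
data = b"".join(values)                                # contiguous value buffer
offsets = np.zeros(len(values) + 1, dtype=np.int64)    # int32 caps data at ~2 GiB
offsets[1:] = np.cumsum([len(v) for v in values])

# The i-th value is the slice between consecutive offsets.
i = 1
assert data[offsets[i]:offsets[i + 1]] == values[i]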

On Fri, Apr 5, 2019 at 2:11 PM Ryan Blue  wrote:
>
> I don't think that's what you would want to do. Parquet will eventually
> compress large values, but not after making defensive copies and attempting
> to encode them. In the end, it will be a lot more overhead, plus the work
> to make it possible. I think you'd be much better of compressing before
> storing in Parquet if you expect good compression rates.
>
> On Fri, Apr 5, 2019 at 11:29 AM Brian Bowman  wrote:
>
> > My hope is that these large ByteArray values will encode/compress to a
> > fraction of their original size.  FWIW, cpp/src/parquet/
> > column_writer.cc/.h has int64_t offset and length fields all over the
> > place.
> >
> > External file references to BLOBS is doable but not the elegant,
> > integrated solution I was hoping for.
> >
> > -Brian
> >
> > On Apr 5, 2019, at 1:53 PM, Ryan Blue  wrote:
> >
> > *EXTERNAL*
> > Looks like we will need a new encoding for this:
> > https://github.com/apache/parquet-format/blob/master/Encodings.md
> >
> > That doc specifies that the plain encoding uses a 4-byte length. That's
> > not going to be a quick fix.
> >
> > Now that I'm thinking about this a bit more, does it make sense to support
> > byte arrays that are more than 2GB? That's far larger than the size of a
> > row group, let alone a page. This would completely break memory management
> > in the JVM implementation.
> >
> > Can you solve this problem using a BLOB type that references an external
> > file with the gigantic values? Seems to me that values this large should go
> > in separate files, not in a Parquet file where it would destroy any benefit
> > from using the format.
> >
> > On Fri, Apr 5, 2019 at 10:43 AM Brian Bowman  wrote:
> >
> >> Hello Ryan,
> >>
> >> Looks like it's limited by both the Parquet implementation and the Thrift
> >> message methods.  Am I missing anything?
> >>
> >> From cpp/src/parquet/types.h
> >>
> >> struct ByteArray {
> >>   ByteArray() : len(0), ptr(NULLPTR) {}
> >>   ByteArray(uint32_t len, const uint8_t* ptr) : len(len), ptr(ptr) {}
> >>   uint32_t len;
> >>   const uint8_t* ptr;
> >> };
> >>
> >> From cpp/src/parquet/thrift.h
> >>
> >> inline void DeserializeThriftMsg(const uint8_t* buf, uint32_t* len, T*
> >> deserialized_msg) {
> >> inline int64_t SerializeThriftMsg(T* obj, uint32_t len, OutputStream*
> >> out)
> >>
> >> -Brian
> >>
> >> On 4/5/19, 1:32 PM, "Ryan Blue"  wrote:
> >>
> >> EXTERNAL
> >>
> >> Hi Brian,
> >>
> >> This seems like something we should allow. What imposes the current
> >> limit?
> >> Is it in the thrift format, or just the implementations?
> >>
> >> On Fri, Apr 5, 2019 at 10:23 AM Brian Bowman 
> >> wrote:
> >>
> >> > All,
> >> >
> >> > SAS requires support for storing varying-length character and
> >> binary blobs
> >> > with a 2^64 max length in Parquet.   Currently, the ByteArray len
> >> field is
> >> > a uint32_t.   Looks like this will require incrementing the Parquet
> >> file
> >> > format version and changing ByteArray len to uint64_t.
> >> >
> >> > Have there been any requests for this or other Parquet developments
> >> that
> >> > require file format versioning changes?
> >> >
> >> > I realize this is a non-trivial ask.  Thanks for considering it.
> >> >
> >> > -Brian
> >> >
> >>
> >>
> >> --
> >> Ryan Blue
> >> Software Engineer
> >> Netflix
> >>
> >>
> >>
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
> >
> >
>
> --
> Ryan Blue
> Software Engineer
> Netflix


Re: Need 64-bit Integer length for Parquet ByteArray Type

2019-04-05 Thread Ryan Blue
I don't think that's what you would want to do. Parquet will eventually
compress large values, but not after making defensive copies and attempting
to encode them. In the end, it will be a lot more overhead, plus the work
to make it possible. I think you'd be much better off compressing before
storing in Parquet if you expect good compression rates.

On Fri, Apr 5, 2019 at 11:29 AM Brian Bowman  wrote:

> My hope is that these large ByteArray values will encode/compress to a
> fraction of their original size.  FWIW, cpp/src/parquet/
> column_writer.cc/.h has int64_t offset and length fields all over the
> place.
>
> External file references to BLOBS is doable but not the elegant,
> integrated solution I was hoping for.
>
> -Brian
>
> On Apr 5, 2019, at 1:53 PM, Ryan Blue  wrote:
>
> *EXTERNAL*
> Looks like we will need a new encoding for this:
> https://github.com/apache/parquet-format/blob/master/Encodings.md
>
> That doc specifies that the plain encoding uses a 4-byte length. That's
> not going to be a quick fix.
>
> Now that I'm thinking about this a bit more, does it make sense to support
> byte arrays that are more than 2GB? That's far larger than the size of a
> row group, let alone a page. This would completely break memory management
> in the JVM implementation.
>
> Can you solve this problem using a BLOB type that references an external
> file with the gigantic values? Seems to me that values this large should go
> in separate files, not in a Parquet file where it would destroy any benefit
> from using the format.
>
> On Fri, Apr 5, 2019 at 10:43 AM Brian Bowman  wrote:
>
>> Hello Ryan,
>>
>> Looks like it's limited by both the Parquet implementation and the Thrift
>> message methods.  Am I missing anything?
>>
>> From cpp/src/parquet/types.h
>>
>> struct ByteArray {
>>   ByteArray() : len(0), ptr(NULLPTR) {}
>>   ByteArray(uint32_t len, const uint8_t* ptr) : len(len), ptr(ptr) {}
>>   uint32_t len;
>>   const uint8_t* ptr;
>> };
>>
>> From cpp/src/parquet/thrift.h
>>
>> inline void DeserializeThriftMsg(const uint8_t* buf, uint32_t* len, T*
>> deserialized_msg) {
>> inline int64_t SerializeThriftMsg(T* obj, uint32_t len, OutputStream*
>> out)
>>
>> -Brian
>>
>> On 4/5/19, 1:32 PM, "Ryan Blue"  wrote:
>>
>> EXTERNAL
>>
>> Hi Brian,
>>
>> This seems like something we should allow. What imposes the current
>> limit?
>> Is it in the thrift format, or just the implementations?
>>
>> On Fri, Apr 5, 2019 at 10:23 AM Brian Bowman 
>> wrote:
>>
>> > All,
>> >
>> > SAS requires support for storing varying-length character and
>> binary blobs
>> > with a 2^64 max length in Parquet.   Currently, the ByteArray len
>> field is
>> > a uint32_t.   Looks like this will require incrementing the Parquet
>> file
>> > format version and changing ByteArray len to uint64_t.
>> >
>> > Have there been any requests for this or other Parquet developments
>> that
>> > require file format versioning changes?
>> >
>> > I realize this is a non-trivial ask.  Thanks for considering it.
>> >
>> > -Brian
>> >
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
>

-- 
Ryan Blue
Software Engineer
Netflix


Re: Need 64-bit Integer length for Parquet ByteArray Type

2019-04-05 Thread Brian Bowman
My hope is that these large ByteArray values will encode/compress to a fraction 
of their original size.  FWIW, 
cpp/src/parquet/column_writer.cc/.h has int64_t 
offset and length fields all over the place.

External file references to BLOBS is doable but not the elegant, integrated 
solution I was hoping for.

-Brian

On Apr 5, 2019, at 1:53 PM, Ryan Blue <rb...@netflix.com> wrote:


EXTERNAL

Looks like we will need a new encoding for this: 
https://github.com/apache/parquet-format/blob/master/Encodings.md

That doc specifies that the plain encoding uses a 4-byte length. That's not 
going to be a quick fix.

Now that I'm thinking about this a bit more, does it make sense to support byte 
arrays that are more than 2GB? That's far larger than the size of a row group, 
let alone a page. This would completely break memory management in the JVM 
implementation.

Can you solve this problem using a BLOB type that references an external file 
with the gigantic values? Seems to me that values this large should go in 
separate files, not in a Parquet file where it would destroy any benefit from 
using the format.

On Fri, Apr 5, 2019 at 10:43 AM Brian Bowman <brian.bow...@sas.com> wrote:
Hello Ryan,

Looks like it's limited by both the Parquet implementation and the Thrift 
message methods.  Am I missing anything?

From cpp/src/parquet/types.h

struct ByteArray {
  ByteArray() : len(0), ptr(NULLPTR) {}
  ByteArray(uint32_t len, const uint8_t* ptr) : len(len), ptr(ptr) {}
  uint32_t len;
  const uint8_t* ptr;
};

From cpp/src/parquet/thrift.h

inline void DeserializeThriftMsg(const uint8_t* buf, uint32_t* len, T* 
deserialized_msg) {
inline int64_t SerializeThriftMsg(T* obj, uint32_t len, OutputStream* out)

-Brian

On 4/5/19, 1:32 PM, "Ryan Blue" <rb...@netflix.com.INVALID> wrote:

EXTERNAL

Hi Brian,

This seems like something we should allow. What imposes the current limit?
Is it in the thrift format, or just the implementations?

On Fri, Apr 5, 2019 at 10:23 AM Brian Bowman <brian.bow...@sas.com> wrote:

> All,
>
> SAS requires support for storing varying-length character and binary blobs
> with a 2^64 max length in Parquet.   Currently, the ByteArray len field is
> a uint32_t.   Looks like this will require incrementing the Parquet file
> format version and changing ByteArray len to uint64_t.
>
> Have there been any requests for this or other Parquet developments that
> require file format versioning changes?
>
> I realize this is a non-trivial ask.  Thanks for considering it.
>
> -Brian
>


--
Ryan Blue
Software Engineer
Netflix




--
Ryan Blue
Software Engineer
Netflix


Re: Need 64-bit Integer length for Parquet ByteArray Type

2019-04-05 Thread Ryan Blue
Looks like we will need a new encoding for this:
https://github.com/apache/parquet-format/blob/master/Encodings.md

That doc specifies that the plain encoding uses a 4-byte length. That's not
going to be a quick fix.

Now that I'm thinking about this a bit more, does it make sense to support
byte arrays that are more than 2GB? That's far larger than the size of a
row group, let alone a page. This would completely break memory management
in the JVM implementation.

Can you solve this problem using a BLOB type that references an external
file with the gigantic values? Seems to me that values this large should go
in separate files, not in a Parquet file where it would destroy any benefit
from using the format.

On Fri, Apr 5, 2019 at 10:43 AM Brian Bowman  wrote:

> Hello Ryan,
>
> Looks like it's limited by both the Parquet implementation and the Thrift
> message methods.  Am I missing anything?
>
> From cpp/src/parquet/types.h
>
> struct ByteArray {
>   ByteArray() : len(0), ptr(NULLPTR) {}
>   ByteArray(uint32_t len, const uint8_t* ptr) : len(len), ptr(ptr) {}
>   uint32_t len;
>   const uint8_t* ptr;
> };
>
> From cpp/src/parquet/thrift.h
>
> inline void DeserializeThriftMsg(const uint8_t* buf, uint32_t* len, T*
> deserialized_msg) {
> inline int64_t SerializeThriftMsg(T* obj, uint32_t len, OutputStream* out)
>
> -Brian
>
> On 4/5/19, 1:32 PM, "Ryan Blue"  wrote:
>
> EXTERNAL
>
> Hi Brian,
>
> This seems like something we should allow. What imposes the current
> limit?
> Is it in the thrift format, or just the implementations?
>
> On Fri, Apr 5, 2019 at 10:23 AM Brian Bowman 
> wrote:
>
> > All,
> >
> > SAS requires support for storing varying-length character and binary
> blobs
> > with a 2^64 max length in Parquet.   Currently, the ByteArray len
> field is
> > a uint32_t.   Looks like this will require incrementing the Parquet
> file
> > format version and changing ByteArray len to uint64_t.
> >
> > Have there been any requests for this or other Parquet developments
> that
> > require file format versioning changes?
> >
> > I realize this is a non-trivial ask.  Thanks for considering it.
> >
> > -Brian
> >
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
>
>

-- 
Ryan Blue
Software Engineer
Netflix


[jira] [Created] (ARROW-5129) Column writer bug: check dictionary encoder when adding a new data page

2019-04-05 Thread Ivan Sadikov (JIRA)
Ivan Sadikov created ARROW-5129:
---

 Summary: Column writer bug: check dictionary encoder when adding a 
new data page
 Key: ARROW-5129
 URL: https://issues.apache.org/jira/browse/ARROW-5129
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
 Environment: N/A
Reporter: Ivan Sadikov


As part of my weekly routine, I glanced over the code in the Parquet column 
writer and found that the way we check when to add a new data page is buggy. 
The idea is to check the current encoder and decide whether we have written 
enough bytes to construct a page. The problem is that we only check the value 
encoder, regardless of whether the dictionary encoder is enabled. 

Here is how we do it now: actual check 
(https://github.com/apache/arrow/blob/master/rust/parquet/src/column/writer.rs#L378)
 and the buggy function 
(https://github.com/apache/arrow/blob/master/rust/parquet/src/column/writer.rs#L423).
 

In the case of a sparse column with the dictionary encoder enabled, we would 
write a single data page even though we would have accumulated enough bytes in 
the encoder for more than one page (the value encoder will be empty, so its 
size will always be less than the constant limit).

I forgot that parquet-cpp has `current_encoder` as either value encoder or 
dictionary encoder 
(https://github.com/apache/parquet-cpp/blob/master/src/parquet/column_writer.cc#L544),
 but in parquet-rs we have them separate.

So the fix could be something like this:
{code}
/// Returns true if there is enough data for a data page, false otherwise.
#[inline]
fn should_add_data_page(&self) -> bool {
  // Check whichever encoder is actually accumulating values: the dictionary
  // encoder when dictionary encoding is enabled, otherwise the plain value
  // encoder.
  match self.dict_encoder {
    Some(ref encoder) => {
      encoder.estimated_data_encoded_size() >= self.props.data_pagesize_limit()
    },
    None => {
      self.encoder.estimated_data_encoded_size() >= self.props.data_pagesize_limit()
    }
  }
}
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Need 64-bit Integer length for Parquet ByteArray Type

2019-04-05 Thread Brian Bowman
Hello Ryan,

Looks like it's limited by both the Parquet implementation and the Thrift 
message methods.  Am I missing anything?

From cpp/src/parquet/types.h 

struct ByteArray {
  ByteArray() : len(0), ptr(NULLPTR) {}
  ByteArray(uint32_t len, const uint8_t* ptr) : len(len), ptr(ptr) {}
  uint32_t len;
  const uint8_t* ptr;
};

From cpp/src/parquet/thrift.h

inline void DeserializeThriftMsg(const uint8_t* buf, uint32_t* len, T* 
deserialized_msg) {
inline int64_t SerializeThriftMsg(T* obj, uint32_t len, OutputStream* out) 

-Brian

On 4/5/19, 1:32 PM, "Ryan Blue"  wrote:

EXTERNAL

Hi Brian,

This seems like something we should allow. What imposes the current limit?
Is it in the thrift format, or just the implementations?

On Fri, Apr 5, 2019 at 10:23 AM Brian Bowman  wrote:

> All,
>
> SAS requires support for storing varying-length character and binary blobs
> with a 2^64 max length in Parquet.   Currently, the ByteArray len field is
> a uint32_t.   Looks like this will require incrementing the Parquet file
> format version and changing ByteArray len to uint64_t.
>
> Have there been any requests for this or other Parquet developments that
> require file format versioning changes?
>
> I realize this is a non-trivial ask.  Thanks for considering it.
>
> -Brian
>


--
Ryan Blue
Software Engineer
Netflix




Re: Need 64-bit Integer length for Parquet ByteArray Type

2019-04-05 Thread Ryan Blue
Hi Brian,

This seems like something we should allow. What imposes the current limit?
Is it in the thrift format, or just the implementations?

On Fri, Apr 5, 2019 at 10:23 AM Brian Bowman  wrote:

> All,
>
> SAS requires support for storing varying-length character and binary blobs
> with a 2^64 max length in Parquet.   Currently, the ByteArray len field is
> a uint32_t.   Looks like this will require incrementing the Parquet file
> format version and changing ByteArray len to uint64_t.
>
> Have there been any requests for this or other Parquet developments that
> require file format versioning changes?
>
> I realize this is a non-trivial ask.  Thanks for considering it.
>
> -Brian
>


-- 
Ryan Blue
Software Engineer
Netflix


Need 64-bit Integer length for Parquet ByteArray Type

2019-04-05 Thread Brian Bowman
All,

SAS requires support for storing varying-length character and binary blobs with 
a 2^64 max length in Parquet.   Currently, the ByteArray len field is a 
uint32_t.   Looks like this will require incrementing the Parquet file format 
version and changing ByteArray len to uint64_t.

Have there been any requests for this or other Parquet developments that 
require file format versioning changes?

I realize this is a non-trivial ask.  Thanks for considering it.

-Brian


[jira] [Created] (ARROW-5128) [Packaging][CentOS][Conda] Numpy not found in nightly builds

2019-04-05 Thread Krisztian Szucs (JIRA)
Krisztian Szucs created ARROW-5128:
--

 Summary: [Packaging][CentOS][Conda] Numpy not found in nightly 
builds
 Key: ARROW-5128
 URL: https://issues.apache.org/jira/browse/ARROW-5128
 Project: Apache Arrow
  Issue Type: Bug
  Components: Packaging
Reporter: Krisztian Szucs
 Fix For: 0.14.0


In the last three days centos-7 and conda-win builds have been failing with 
numpy not found
- https://travis-ci.org/kszucs/crossbow/builds/515638053
- https://ci.appveyor.com/project/kszucs/crossbow/builds/23593736
- https://ci.appveyor.com/project/kszucs/crossbow/builds/23563730



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5127) [Rust] [Parquet] Add page iterator

2019-04-05 Thread Renjie Liu (JIRA)
Renjie Liu created ARROW-5127:
-

 Summary: [Rust] [Parquet] Add page iterator
 Key: ARROW-5127
 URL: https://issues.apache.org/jira/browse/ARROW-5127
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Renjie Liu
Assignee: Renjie Liu


Adds a page iterator for column reader.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5126) [Rust] [Parquet] Convert parquet column desc to arrow data type

2019-04-05 Thread Renjie Liu (JIRA)
Renjie Liu created ARROW-5126:
-

 Summary: [Rust] [Parquet] Convert parquet column desc to arrow 
data type
 Key: ARROW-5126
 URL: https://issues.apache.org/jira/browse/ARROW-5126
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust
Reporter: Renjie Liu
Assignee: Renjie Liu






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5125) [Python] Cannot roundtrip extreme dates through pyarrow

2019-04-05 Thread Max Bolingbroke (JIRA)
Max Bolingbroke created ARROW-5125:
--

 Summary: [Python] Cannot roundtrip extreme dates through pyarrow
 Key: ARROW-5125
 URL: https://issues.apache.org/jira/browse/ARROW-5125
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.13.0
 Environment: Windows 10, Python 3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 2019, 
22:22:05)
Reporter: Max Bolingbroke


You can roundtrip many dates through a pyarrow array:

 
{noformat}
>>> pa.array([datetime.date(1980, 1, 1)], type=pa.date32())[0]
datetime.date(1980, 1, 1){noformat}
 

But (on Windows at least), not extreme ones:

 
{noformat}
>>> pa.array([datetime.date(1960, 1, 1)], type=pa.date32())[0]
Traceback (most recent call last):
 File "", line 1, in 
 File "pyarrow\scalar.pxi", line 74, in pyarrow.lib.ArrayValue.__repr__
 File "pyarrow\scalar.pxi", line 226, in pyarrow.lib.Date32Value.as_py
OSError: [Errno 22] Invalid argument
>>> pa.array([datetime.date(3200, 1, 1)], type=pa.date32())[0]
Traceback (most recent call last):
 File "", line 1, in 
 File "pyarrow\scalar.pxi", line 74, in pyarrow.lib.ArrayValue.__repr__
 File "pyarrow\scalar.pxi", line 226, in pyarrow.lib.Date32Value.as_py
{noformat}
This is because datetime.utcfromtimestamp and datetime.timestamp fail on these 
dates, but it seems we should be able to avoid invoking these functions entirely 
when deserializing dates. Ideally we would be able to roundtrip these as 
datetimes too, of course, but it's less clear that this will be easy. For some 
context on this see [https://bugs.python.org/issue29097].

This may be related to ARROW-3176 and ARROW-4746
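A sketch of the conversion suggested above (illustrative only; date32_to_date is a hypothetical helper, not a pyarrow function): date32 values are days since the Unix epoch, so pure date arithmetic avoids the utcfromtimestamp/timestamp round trip entirely.

{code:python}
# Sketch: convert a raw date32 value (days since 1970-01-01) to datetime.date
# without going through utcfromtimestamp, which fails on Windows for
# out-of-range values.
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)

def date32_to_date(days):
    return EPOCH + timedelta(days=days)

assert date32_to_date(-3653) == date(1960, 1, 1)  # 3653 days before the epoch
{code}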



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)