[jira] [Created] (ARROW-7668) [Packaging][RPM] Use Ninja if possible to reduce build time

2020-01-23 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-7668:
---

 Summary: [Packaging][RPM] Use Ninja if possible to reduce build time
 Key: ARROW-7668
 URL: https://issues.apache.org/jira/browse/ARROW-7668
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [DISCUSS][JAVA] Correct the behavior of ListVector isEmpty

2020-01-23 Thread Micah Kornfield
 I would vote for treating nulls as empty.

On Fri, Jan 10, 2020 at 12:36 AM Ji Liu  wrote:

> Hi all,
>
> Currently the isEmpty API always returns false in BaseRepeatedValueVector,
> and its subclass ListVector does not override this method.
> This leads to incorrect results; for example, a ListVector with data
> [1,2], null, [], [5,6] would get [false, false, false, false], which is not
> right.
> I opened a PR to fix this [1], and I am not sure what the right behavior is
> for null values: should it return [false, false, true, false] or [false,
> true, true, false]?
>
>
> Thanks,
> Ji Liu
>
>
> [1] https://github.com/apache/arrow/pull/6044
>
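For reference, a minimal Java sketch (not the patch in [1]) that builds the
column from the quoted example and prints isEmpty for each row; with nulls
treated as empty, the expected output is false, true, true, false. The writer
calls below follow the existing ListVector/UnionListWriter API:

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.complex.ListVector;
import org.apache.arrow.vector.complex.impl.UnionListWriter;

public class ListVectorIsEmptyDemo {
  public static void main(String[] args) {
    try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
         ListVector vector = ListVector.empty("v", allocator)) {
      vector.allocateNew();
      UnionListWriter writer = vector.getWriter();

      writer.setPosition(0);   // row 0: [1, 2]
      writer.startList();
      writer.writeInt(1);
      writer.writeInt(2);
      writer.endList();

      // row 1 is left unwritten, so it stays null

      writer.setPosition(2);   // row 2: []
      writer.startList();
      writer.endList();

      writer.setPosition(3);   // row 3: [5, 6]
      writer.startList();
      writer.writeInt(5);
      writer.writeInt(6);
      writer.endList();

      vector.setValueCount(4);
      for (int i = 0; i < 4; i++) {
        System.out.println("isEmpty(" + i + ") = " + vector.isEmpty(i));
      }
    }
  }
}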
>


Re: [Format] Make fields required?

2020-01-23 Thread Micah Kornfield
Looking at this, it seems like the main change is requiring empty lists
instead of null values?  I think this might potentially be too strict for
existing degenerate cases (e.g. empty files; I also don't remember if we
said the null type requires a buffer).

Most of the others like MessageHeader make sense to me.

On Mon, Jan 20, 2020 at 2:32 PM Wes McKinney  wrote:

> To help with the discussion, here is a patch with 9 "definitely
> required" fields made required, and the associated generated C++
> changes
>
> https://github.com/apache/arrow/compare/master...wesm:flatbuffers-required
>
> (I am not 100% sure about Field.children always being non-null, if
> there were some doubt we could let it be null)
>
> (I would guess that the semantics in Java and elsewhere is the same,
> but someone should confirm)
>
> On Mon, Jan 20, 2020 at 12:59 PM Wes McKinney  wrote:
> >
> > On Mon, Jan 20, 2020 at 12:20 PM Jacques Nadeau 
> wrote:
> > >
> > > >
> > > > I think what we have determined is that the changes that are being
> > > > discussed in this thread would not render any existing serialized
> > > > Flatbuffers unreadable, unless they are malformed / unable to be
> > > > read with the current libraries.
> > > >
> > >
> > > I think we need to separate two different things:
> > >
> > > Point 1: If all data is populated as we expect, changing from optional
> to
> > > required is a noop.
> > > Point 2: All current Arrow code fails to work in all cases where a
> field is
> > > not populated as expected.
> >
> > I looked at the before/after when adding "(required)" to a field and
> > it appears the only change on the read path is the generated verifier
> > (which you have to explicitly invoke, and you can skip verification)
> >
> > https://gist.github.com/wesm/f1a9e7492b0daee07ccef0566c3900a2
> >
> > This is distinct from Protobuf (I think?) because protobuf verifies
> > the presence of required fields when parsing the protobuf. I assume
> > it's the same in other languages but we'll have to check to be sure
> >
> > This means that if you _fail to invoke the verifier_, you can still
> > follow a null pointer, but applications that use the verifier will
> > stop there and not have to implement their own null checks.
> >
> > >
> > > I think one needs to prove both points in order for this change to be a
> > > compatible change. I agree that point 1 is proven. I don't think point
> 2
> > > has been proven. In fact, I'm not sure how one could prove it(*). The
> bar
> > > for changing the format in a backwards incompatible way (assuming we
> can't
> > > prove point 2) should be high given how long the specification has been
> > > out. It doesn't feel like the benefits here outweigh the cost of
> changing
> > > in an incompatible way (especially given the subjective nature of
> optional
> > > vs. required).
> > >
> > > It's probably less of a concern for
> > > > an in-house protocol than for an open standard like Arrow where there
> > > > may be multiple third-party implementations around at some point.
> > > >
> > >
> > > This is subjective, just like the general argument around whether
> required
> > > or optional should be used in protobuf. My point in sharing was to (1)
> > > point out that the initial implementation choices weren't done without
> > > reason and (2) that we should avoid arguing that either direction is
> more
> > > technically sound (which seemed to be the direction the argument was
> > > taking).
> > >
> > > (*)  One could do an exhaustive analysis of every codepath. This would
> work
> > > for libraries in the Arrow project. However, the flatbuf definition is
> part
> > > of the external specification meaning that other codepaths likely exist
> > > that we could not evaluate.
>


Re: [Java] Large Memory Allocators (Taking a dependency on JNA?)

2020-01-23 Thread Micah Kornfield
Sounds good, I'll leave it up to you which to implement.  Thanks for taking
it on.

On Sun, Jan 19, 2020 at 8:47 PM Fan Liya  wrote:

> Hi Jacques and Micah,
>
> Thanks for the fruitful discussion.
>
> It seems the Netty-based allocator and the Unsafe-based allocator have their
> own specific advantages.
> Maybe we can implement both as independent allocators, to support
> different scenarios.
>
> This should not be difficult, as [1] has laid a solid ground for this.
>
> Best,
> Liya Fan
>
> [1] https://issues.apache.org/jira/browse/ARROW-7329
>
> On Mon, Jan 20, 2020 at 11:38 AM Micah Kornfield 
> wrote:
>
>> Hmm, somehow I missed those two alternatives, thanks for pointing them
>> out.
>>
>> I agree that these are probably better than taking a new dependency.  Of
>> the two of them, it seems like using Unsafe directly might be better since
>> it would also solve the issue of setting special environment variables
>> for Netty [1], but it might be too big of a change to couple the two
>> together.
>>
>> The other point brought up on the JIRA about honoring -XX:MaxDirectMemorySize
>> is a good one.  The one downside to this is it potentially comes with a
>> performance penalty [2] (this is quite dated though).  But I think we can
>> always explore other options after doing the simplest thing first.
>>
>> -Micah
>>
>> [1] https://issues.apache.org/jira/browse/ARROW-7223
>> [2]
>>
>> http://mail.openjdk.java.net/pipermail/hotspot-dev/2015-February/017089.html
>>
>> On Sun, Jan 19, 2020 at 3:03 PM Jacques Nadeau 
>> wrote:
>>
>> > It seems like jna is overkill & unnecessary for simply
>> allocating/freeing
>> > memory.
>> >
>> > A simple way to do this is either to use unsafe directly or call the
>> > existing netty unsafe facade directly.
>> >
>> > PlatformDependent.allocateMemory(long)
>> > PlatformDependent.freeMemory(long)
>> >
>> > Should be relatively straightforward to add to the existing Netty-based
>> > allocator.
>> >
>> > On Sat, Jan 18, 2020 at 8:14 PM Fan Liya  wrote:
>> >
>> >> Hi Micah,
>> >>
>> >> Thanks for the good suggestion. JNA seems like a good and reasonable
>> tool
>> >> for allocating large memory chunks.
>> >>
>> >> How about we directly use the Java Unsafe API? It seems the allocateMemory
>> API
>> >> is also based on the malloc method of the native implementation [1].
>> >>
>> >> Best,
>> >> Liya Fan
>> >>
>> >> [1]
>> >>
>> >>
>> http://hg.openjdk.java.net/jdk7/jdk7/hotspot/file/4fc084dac61e/src/share/vm/prims/unsafe.cpp
>> >>
>> >> On Sat, Jan 18, 2020 at 12:58 PM Micah Kornfield <
>> emkornfi...@gmail.com>
>> >> wrote:
>> >>
>> >> > With the recently merged changes to the underlying ArrowBuf APIs to
>> >> allow
>> >> > 64-bit memory address spaces there is some follow-up work to actually
>> >> > confirm it works.  I opened a JIRA [1] to track this work.
>> >> >
>> >> > The main question is how to provide an allocator that supports larger
>> >> > memory chunks.  It appears the Netty API only takes a 32-bit integer
>> >> for
>> >> > array sizes.  Doing a little bit of investigation it seems like JNA
>> [2]
>> >> > exposes a direct call to malloc of 64-bit integers [3].
>> >> >
>> >> > The other option would seem to be rolling our own allocator via JNI.
>> >> >
>> >> > Has anybody worked with JNA who can share their experiences?
>> >> > Is anyone familiar with other options?
>> >> >
>> >> > Thanks,
>> >> > Micah
>> >> >
>> >> > [1] https://issues.apache.org/jira/browse/ARROW-7606
>> >> > [2] https://github.com/java-native-access/jna
>> >> > [3]
>> >> >
>> >> >
>> >>
>> https://github.com/java-native-access/jna/blob/master/src/com/sun/jna/Native.java#L2265
>> >> >
>> >>
>> >
>>
>
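For illustration, a minimal sketch of the PlatformDependent calls mentioned
above (just the raw allocate/free pair, not the proposed allocator). The size
argument is a long, so it is not subject to the 32-bit array-size limit:

import io.netty.util.internal.PlatformDependent;

public class LargeAllocationSketch {
  public static void main(String[] args) {
    final long size = 3L * 1024 * 1024 * 1024;  // 3 GiB, beyond what an int can index
    final long address = PlatformDependent.allocateMemory(size);
    try {
      PlatformDependent.putByte(address, (byte) 1);             // touch the first byte
      PlatformDependent.putByte(address + size - 1, (byte) 2);  // touch the last byte
    } finally {
      PlatformDependent.freeMemory(address);
    }
  }
}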


Re: [DISCUSS] Format additions for encoding/compression

2020-01-23 Thread Micah Kornfield
Hi John,
Not Wes, but my thoughts on this are as follows:

1. Alternate bit/byte arrangements can also be useful for processing [1] in
addition to compression.
2. I think they are quite a bit more complicated than the existing schemes
proposed in [2], so I think it would be more expedient to get the
integration hooks necessary to work with simpler encodings before going
with something more complex.  I believe the proposal is generic enough to
support this type of encoding.
3. For prototyping, this seems like a potential use of the ExtensionType
[3] type mechanism already in the specification.
4. I don't think these should be new types or part of the basic Array data
structure.  I think having a different container format in the form of
"SparseRecordBatch" (or perhaps it should be renamed to EncodedRecordBatch)
and keeping the existing types with alternate encodings is a better option.

That being said if you have bandwidth to get this working for C++ and Java
we can potentially setup a separate development branch to see how it
evolves.  Personally, I've not brought my proposal up for discussion again,
because I haven't had bandwidth to work on it, but I still think
introducing some level of alternate encodings is a good idea.

Cheers,
Micah


[1]
https://15721.courses.cs.cmu.edu/spring2018/papers/22-vectorization2/p31-feng.pdf
[2] https://github.com/apache/arrow/pull/4815
[3]
https://github.com/apache/arrow/blob/master/docs/source/format/Columnar.rst#extension-types

On Thu, Jan 23, 2020 at 11:36 AM John Muehlhausen  wrote:

> Wes, what do you think about Arrow supporting a new suite of fixed-length
> data types that unshuffle on column->Value(i) calls?  This would allow
> memory/swap compressors and memory maps backed by compressing
> filesystems (ZFS) or block devices (VDO) to operate more efficiently.
>
> By doing it with new datatypes there is no separate flag to check?
>
> On Thu, Jan 23, 2020 at 1:09 PM Wes McKinney  wrote:
>
> > On Thu, Jan 23, 2020 at 12:42 PM John Muehlhausen  wrote:
> > >
> > > Again, I know very little about Parquet, so your patience is
> appreciated.
> > >
> > > At the moment I can Arrow/mmap a file without having anywhere nearly as
> > > much available memory as the file size.  I can visit random place in
> the
> > > file (such as a binary search if it is ordered) and only the locations
> > > visited by column->Value(i) are paged in.  Paging them out happens
> > without
> > > my awareness, if necessary.
> > >
> > > Does Parquet cover this use-case with the same elegance and at least
> > equal
> > > efficiency, or are there more copies/conversions?  Perhaps it requires
> > the
> > > entire file to be transformed into Arrow memory at the beginning? Or
> on a
> > > batch/block basis? Or to get this I need to use a non-Arrow API for
> data
> > > element access?  Etc.
> >
> > Data has to be materialized / deserialized from the Parquet file on a
> > batch-wise per-column basis. The APIs we provide allow batches of
> > values to be read for a given subset of columns
> >
> > >
> > > IFF it covers the above use-case, which does not mention compression or
> > > encoding, then I could consider whether it is interesting on those
> > points.
> >
> > My point really has to do with Parquet's design which is about
> > reducing file size. In the following blog post
> >
> > https://ursalabs.org/blog/2019-10-columnar-perf/
> >
> > I examined a dataset which is about 4GB as raw Arrow stream/file but
> > only 114 MB as a Parquet file. A 30+X compression ratio is a huge deal
> > if you are working with filesystems that yield < 500MB/s (which
> > includes pretty much all cloud filesystems AFAIK). In clickstream
> > analytics this kind of compression ratio is not unusual.
> >
> > >
> > > -John
> > >
> > > On Thu, Jan 23, 2020 at 12:06 PM Francois Saint-Jacques <
> > > fsaintjacq...@gmail.com> wrote:
> > >
> > > > What's the point of having zero copy if the OS is doing the
> > > > decompression in kernel (which trumps the zero-copy argument)? You
> > > > might as well just use parquet without filesystem compression. I
> > > > prefer to have compression algorithm where the columnar engine can
> > > > benefit from it [1] than marginally improving a file-system-os
> > > > specific feature.
> > > >
> > > > François
> > > >
> > > > [1] Section 4.3 http://db.csail.mit.edu/pubs/abadi-column-stores.pdf
> > > >
> > > >
> > > >
> > > >
> > > > On Thu, Jan 23, 2020 at 12:43 PM John Muehlhausen 
> wrote:
> > > > >
> > > > > This could also have utility in memory via things like zram/zswap,
> > right?
> > > > > Mac also has a memory compressor?
> > > > >
> > > > > I don't think Parquet is an option for me unless the integration
> with
> > > > Arrow
> > > > > is tighter than I imagine (i.e. zero-copy).  That said, I confess I
> > know
> > > > > next to nothing about Parquet.
> > > > >
> > > > > On Thu, Jan 23, 2020 at 11:23 AM Antoine Pitrou <
> anto...@python.org>
> > > > wrote:
> > > > > >
> > > > > >
> > > > > 

[Java] PR Reviewers

2020-01-23 Thread Micah Kornfield
I mentioned this elsewhere, but my intent is to stop doing Java reviews for
the immediate future once I wrap up the few that I have requested changes on.

I'm happy to try to triage incoming Java PRs, but in order to do this, I
need to know which committers have some bandwidth to do reviews (on some of
the existing PRs I've tagged people who never responded).

Thanks,
Micah


[Format] Array/RowBatch filters

2020-01-23 Thread Micah Kornfield
One of the things that I think got overlooked in the conversation on having
a slice offset in the C API was a suggestion from Jacques of perhaps
generalizing the concept to an arbitrary "filter" for arrays/record batches.

I believe this point was also discussed in the past.  I'm not
advocating for adding it now, but I'm curious whether people feel we should add
something to Schema.fbs for forward compatibility, in case we wish to
support this use case in the future.

Thanks,
Micah


[jira] [Created] (ARROW-7667) [Packaging][deb] ubuntu-eoan is missing in nightly jobs

2020-01-23 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-7667:
---

 Summary: [Packaging][deb] ubuntu-eoan is missing in nightly jobs
 Key: ARROW-7667
 URL: https://issues.apache.org/jira/browse/ARROW-7667
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration, Packaging
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7666) [Packaging][deb] Always use Ninja to reduce build time

2020-01-23 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-7666:
---

 Summary: [Packaging][deb] Always use Ninja to reduce build time
 Key: ARROW-7666
 URL: https://issues.apache.org/jira/browse/ARROW-7666
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: PySpark failure [RE: [NIGHTLY] Arrow Build Report for Job nightly-2020-01-15-0]

2020-01-23 Thread Bryan Cutler
Thanks for investigating this and the quick fix, Joris and Wes!  I just have
a couple of questions about the behavior observed here.  The pyspark code
either assigns the same series back to the pandas.DataFrame or makes some
modifications if it is a timestamp. In the case where there are no timestamps, is
this potentially making extra copies, or will it be unable to take advantage
of the new zero-copy features in pyarrow? For the case of timestamp
columns that need to be modified, is there a more efficient way to create a
new dataframe with copies of only the modified series?  Thanks!

Bryan

On Thu, Jan 16, 2020 at 11:48 PM Joris Van den Bossche <
jorisvandenboss...@gmail.com> wrote:

> That sounds like a good solution. Having the zero-copy behavior depending
> on whether you have only 1 column of a certain type or not, might lead to
> surprising results. To avoid yet another keyword, only doing it when
> split_blocks=True sounds good to me (in practice, that's also when it will
> happen mostly, except for very narrow dataframes with only few columns).
>
> Joris
>
> On Thu, 16 Jan 2020 at 22:44, Wes McKinney  wrote:
>
> > hi Joris,
> >
> > Thanks for investigating this. It seems there were some unintended
> > consequences of the zero-copy optimizations from ARROW-3789. Another
> > way forward might be to "opt in" to this behavior, or to only do the
> > zero copy optimizations when split_blocks=True. What do you think?
> >
> > - Wes
> >
> > On Thu, Jan 16, 2020 at 3:42 AM Joris Van den Bossche
> >  wrote:
> > >
> > > So the spark integration build started to fail, and with the following
> > test
> > > error:
> > >
> > > ==
> > > ERROR: test_toPandas_batch_order
> > > (pyspark.sql.tests.test_arrow.EncryptionArrowTests)
> > > --
> > > Traceback (most recent call last):
> > >   File "/spark/python/pyspark/sql/tests/test_arrow.py", line 422, in
> > > test_toPandas_batch_order
> > > run_test(*case)
> > >   File "/spark/python/pyspark/sql/tests/test_arrow.py", line 409, in
> > run_test
> > > pdf, pdf_arrow = self._toPandas_arrow_toggle(df)
> > >   File "/spark/python/pyspark/sql/tests/test_arrow.py", line 152, in
> > > _toPandas_arrow_toggle
> > > pdf_arrow = df.toPandas()
> > >   File "/spark/python/pyspark/sql/pandas/conversion.py", line 115, in
> > toPandas
> > > return _check_dataframe_localize_timestamps(pdf, timezone)
> > >   File "/spark/python/pyspark/sql/pandas/types.py", line 180, in
> > > _check_dataframe_localize_timestamps
> > > pdf[column] = _check_series_localize_timestamps(series, timezone)
> > >   File
> > "/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/frame.py",
> > > line 3487, in __setitem__
> > > self._set_item(key, value)
> > >   File
> > "/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/frame.py",
> > > line 3565, in _set_item
> > > NDFrame._set_item(self, key, value)
> > >   File
> >
> "/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/generic.py",
> > > line 3381, in _set_item
> > > self._data.set(key, value)
> > >   File
> >
> "/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/internals/managers.py",
> > > line 1090, in set
> > > blk.set(blk_locs, value_getitem(val_locs))
> > >   File
> >
> "/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/internals/blocks.py",
> > > line 380, in set
> > > self.values[locs] = values
> > > ValueError: assignment destination is read-only
> > >
> > >
> > > It's from a test that is doing conversions from spark to arrow to
> pandas
> > > (so calling pyarrow.Table.to_pandas here
> > > <
> >
> https://github.com/apache/spark/blob/018bdcc53c925072b07956de0600452ad255b9c7/python/pyspark/sql/pandas/conversion.py#L111-L115
> > >),
> > > and on the resulting DataFrame, it is iterating through all columns,
> > > potentially fixing timezones, and writing each column back into the
> > > DataFrame (here
> > > <
> >
> https://github.com/apache/spark/blob/018bdcc53c925072b07956de0600452ad255b9c7/python/pyspark/sql/pandas/types.py#L179-L181
> > >
> > > ).
> > >
> > > Since it is giving an error about read-only, it might be related to
> > > zero-copy behaviour of to_pandas, and thus might be related to the
> > refactor
> > > of the arrow->pandas conversion that landed yesterday (
> > > https://github.com/apache/arrow/pull/6067, it says it changed to do
> > > zero-copy for 1-column blocks if possible).
> > > I am not sure if something should be fixed in pyarrow for this, but the
> > > obvious thing that pyspark can do is specify they don't want zero-copy.
> > >
> > > Joris
> > >
> > > On Wed, 15 Jan 2020 at 14:32, Crossbow  wrote:
> > >
> >
>


Re: [DISCUSS] Format additions for encoding/compression

2020-01-23 Thread John Muehlhausen
Wes, what do you think about Arrow supporting a new suite of fixed-length
data types that unshuffle on column->Value(i) calls?  This would allow
memory/swap compressors and memory maps backed by compressing
filesystems (ZFS) or block devices (VDO) to operate more efficiently.

By doing it with new datatypes there is no separate flag to check?

On Thu, Jan 23, 2020 at 1:09 PM Wes McKinney  wrote:

> On Thu, Jan 23, 2020 at 12:42 PM John Muehlhausen  wrote:
> >
> > Again, I know very little about Parquet, so your patience is appreciated.
> >
> > At the moment I can Arrow/mmap a file without having anywhere nearly as
> > much available memory as the file size.  I can visit random place in the
> > file (such as a binary search if it is ordered) and only the locations
> > visited by column->Value(i) are paged in.  Paging them out happens
> without
> > my awareness, if necessary.
> >
> > Does Parquet cover this use-case with the same elegance and at least
> equal
> > efficiency, or are there more copies/conversions?  Perhaps it requires
> the
> > entire file to be transformed into Arrow memory at the beginning? Or on a
> > batch/block basis? Or to get this I need to use a non-Arrow API for data
> > element access?  Etc.
>
> Data has to be materialized / deserialized from the Parquet file on a
> batch-wise per-column basis. The APIs we provide allow batches of
> values to be read for a given subset of columns
>
> >
> > IFF it covers the above use-case, which does not mention compression or
> > encoding, then I could consider whether it is interesting on those
> points.
>
> My point really has to do with Parquet's design which is about
> reducing file size. In the following blog post
>
> https://ursalabs.org/blog/2019-10-columnar-perf/
>
> I examined a dataset which is about 4GB as raw Arrow stream/file but
> only 114 MB as a Parquet file. A 30+X compression ratio is a huge deal
> if you are working with filesystems that yield < 500MB/s (which
> includes pretty much all cloud filesystems AFAIK). In clickstream
> analytics this kind of compression ratio is not unusual.
>
> >
> > -John
> >
> > On Thu, Jan 23, 2020 at 12:06 PM Francois Saint-Jacques <
> > fsaintjacq...@gmail.com> wrote:
> >
> > > What's the point of having zero copy if the OS is doing the
> > > decompression in kernel (which trumps the zero-copy argument)? You
> > > might as well just use parquet without filesystem compression. I
> > > prefer to have compression algorithm where the columnar engine can
> > > benefit from it [1] than marginally improving a file-system-os
> > > specific feature.
> > >
> > > François
> > >
> > > [1] Section 4.3 http://db.csail.mit.edu/pubs/abadi-column-stores.pdf
> > >
> > >
> > >
> > >
> > > On Thu, Jan 23, 2020 at 12:43 PM John Muehlhausen  wrote:
> > > >
> > > > This could also have utility in memory via things like zram/zswap,
> right?
> > > > Mac also has a memory compressor?
> > > >
> > > > I don't think Parquet is an option for me unless the integration with
> > > Arrow
> > > > is tighter than I imagine (i.e. zero-copy).  That said, I confess I
> know
> > > > next to nothing about Parquet.
> > > >
> > > > On Thu, Jan 23, 2020 at 11:23 AM Antoine Pitrou 
> > > wrote:
> > > > >
> > > > >
> > > > > Le 23/01/2020 à 18:16, John Muehlhausen a écrit :
> > > > > > Perhaps related to this thread, are there any current or proposed
> > > tools
> > > > to
> > > > > > transform columns for fixed-length data types according to a
> > > "shuffle?"
> > > > > >  For precedent see the implementation of the shuffle filter in
> hdf5.
> > > > > >
> > > >
> > >
> https://support.hdfgroup.org/ftp/HDF5//documentation/doc1.6/TechNotes/shuffling-algorithm-report.pdf
> > > > > >
> > > > > > For example, the column (length 3) would store bytes 00 00 00 00
> 00
> > > 00
> > > > 00
> > > > > > 00 00 01 02 03 to represent the three 32-bit numbers 00 00 00 01
> 00
> > > 00
> > > > 00
> > > > > > 02 00 00 00 03  (I'm writing big-endian even if that is not
> actually
> > > the
> > > > > > case).
> > > > > >
> > > > > > Value(1) would return 00 00 00 02 by referring to some metadata
> flag
> > > > that
> > > > > > the column is shuffled, stitching the bytes back together at call
> > > time.
> > > > > >
> > > > > > Thus if the column pages were backed by a memory map to something
> > > like
> > > > > > zfs/gzip-9 (my actual use-case), one would expect approx 30%
> savings
> > > in
> > > > > > underlying disk usage due to better run lengths.
> > > > > >
> > > > > > It would enable a space/time tradeoff that could be useful?  The
> > > > filesystem
> > > > > > itself cannot easily do this particular compression transform
> since
> > > it
> > > > > > benefits from knowing the shape of the data.
> > > > >
> > > > > For the record, there's a pull request adding this encoding to the
> > > > > Parquet C++ specification.
> > > > >
> > > > > Regards
> > > > >
> > > > > Antoine.
> > >
>


[jira] [Created] (ARROW-7665) [R] linuxLibs.R should build in parallel

2020-01-23 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-7665:
-

 Summary: [R] linuxLibs.R should build in parallel
 Key: ARROW-7665
 URL: https://issues.apache.org/jira/browse/ARROW-7665
 Project: Apache Arrow
  Issue Type: Wish
  Components: R
Reporter: Antoine Pitrou


It currently seems to compile everything in one thread, which is ghastly slow.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7664) [C++] Extract localfs default from FileSystemFromUri

2020-01-23 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-7664:
---

 Summary: [C++] Extract localfs default from FileSystemFromUri
 Key: ARROW-7664
 URL: https://issues.apache.org/jira/browse/ARROW-7664
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 0.15.1
Reporter: Ben Kietzman
Assignee: Antoine Pitrou
 Fix For: 1.0.0


[https://github.com/apache/arrow/pull/6257#pullrequestreview-347506792]

The argument to FileSystemFromUri should always be rfc3986 formatted. The 
current fallback to localfs can be recovered by adding {{static string 
Uri::FromPath(string)}} which wraps 
[uriWindowsFilenameToUriStringA|https://uriparser.github.io/doc/api/latest/Uri_8h.html#a422dc4a2b979ad380a4dfe007e3de845]
 and the corresponding unix path function.
{code:java}
FileSystemFromUri(Uri::FromPath(R"(E:\dir\file.txt)"), ) {code}
This is a little more boilerplate but I think it's worthwhile to be explicit 
here.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [DISCUSS] Format additions for encoding/compression

2020-01-23 Thread Wes McKinney
On Thu, Jan 23, 2020 at 12:42 PM John Muehlhausen  wrote:
>
> Again, I know very little about Parquet, so your patience is appreciated.
>
> At the moment I can Arrow/mmap a file without having anywhere nearly as
> much available memory as the file size.  I can visit random place in the
> file (such as a binary search if it is ordered) and only the locations
> visited by column->Value(i) are paged in.  Paging them out happens without
> my awareness, if necessary.
>
> Does Parquet cover this use-case with the same elegance and at least equal
> efficiency, or are there more copies/conversions?  Perhaps it requires the
> entire file to be transformed into Arrow memory at the beginning? Or on a
> batch/block basis? Or to get this I need to use a non-Arrow API for data
> element access?  Etc.

Data has to be materialized / deserialized from the Parquet file on a
batch-wise per-column basis. The APIs we provide allow batches of
values to be read for a given subset of columns

>
> IFF it covers the above use-case, which does not mention compression or
> encoding, then I could consider whether it is interesting on those points.

My point really has to do with Parquet's design which is about
reducing file size. In the following blog post

https://ursalabs.org/blog/2019-10-columnar-perf/

I examined a dataset which is about 4GB as raw Arrow stream/file but
only 114 MB as a Parquet file. A 30+X compression ratio is a huge deal
if you are working with filesystems that yield < 500MB/s (which
includes pretty much all cloud filesystems AFAIK). In clickstream
analytics this kind of compression ratio is not unusual.

>
> -John
>
> On Thu, Jan 23, 2020 at 12:06 PM Francois Saint-Jacques <
> fsaintjacq...@gmail.com> wrote:
>
> > What's the point of having zero copy if the OS is doing the
> > decompression in kernel (which trumps the zero-copy argument)? You
> > might as well just use parquet without filesystem compression. I
> > prefer to have compression algorithm where the columnar engine can
> > benefit from it [1] than marginally improving a file-system-os
> > specific feature.
> >
> > François
> >
> > [1] Section 4.3 http://db.csail.mit.edu/pubs/abadi-column-stores.pdf
> >
> >
> >
> >
> > On Thu, Jan 23, 2020 at 12:43 PM John Muehlhausen  wrote:
> > >
> > > This could also have utility in memory via things like zram/zswap, right?
> > > Mac also has a memory compressor?
> > >
> > > I don't think Parquet is an option for me unless the integration with
> > Arrow
> > > is tighter than I imagine (i.e. zero-copy).  That said, I confess I know
> > > next to nothing about Parquet.
> > >
> > > On Thu, Jan 23, 2020 at 11:23 AM Antoine Pitrou 
> > wrote:
> > > >
> > > >
> > > > Le 23/01/2020 à 18:16, John Muehlhausen a écrit :
> > > > > Perhaps related to this thread, are there any current or proposed
> > tools
> > > to
> > > > > transform columns for fixed-length data types according to a
> > "shuffle?"
> > > > >  For precedent see the implementation of the shuffle filter in hdf5.
> > > > >
> > >
> > https://support.hdfgroup.org/ftp/HDF5//documentation/doc1.6/TechNotes/shuffling-algorithm-report.pdf
> > > > >
> > > > > For example, the column (length 3) would store bytes 00 00 00 00 00
> > 00
> > > 00
> > > > > 00 00 01 02 03 to represent the three 32-bit numbers 00 00 00 01 00
> > 00
> > > 00
> > > > > 02 00 00 00 03  (I'm writing big-endian even if that is not actually
> > the
> > > > > case).
> > > > >
> > > > > Value(1) would return 00 00 00 02 by referring to some metadata flag
> > > that
> > > > > the column is shuffled, stitching the bytes back together at call
> > time.
> > > > >
> > > > > Thus if the column pages were backed by a memory map to something
> > like
> > > > > zfs/gzip-9 (my actual use-case), one would expect approx 30% savings
> > in
> > > > > underlying disk usage due to better run lengths.
> > > > >
> > > > > It would enable a space/time tradeoff that could be useful?  The
> > > filesystem
> > > > > itself cannot easily do this particular compression transform since
> > it
> > > > > benefits from knowing the shape of the data.
> > > >
> > > > For the record, there's a pull request adding this encoding to the
> > > > Parquet C++ specification.
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> >


Re: [DISCUSS] Format additions for encoding/compression

2020-01-23 Thread John Muehlhausen
Again, I know very little about Parquet, so your patience is appreciated.

At the moment I can Arrow/mmap a file without having anywhere near as
much available memory as the file size.  I can visit random places in the
file (such as in a binary search if it is ordered) and only the locations
visited by column->Value(i) are paged in.  Paging them out happens without
my awareness, if necessary.

Does Parquet cover this use-case with the same elegance and at least equal
efficiency, or are there more copies/conversions?  Perhaps it requires the
entire file to be transformed into Arrow memory at the beginning? Or on a
batch/block basis? Or to get this I need to use a non-Arrow API for data
element access?  Etc.

IFF it covers the above use-case, which does not mention compression or
encoding, then I could consider whether it is interesting on those points.

-John

On Thu, Jan 23, 2020 at 12:06 PM Francois Saint-Jacques <
fsaintjacq...@gmail.com> wrote:

> What's the point of having zero copy if the OS is doing the
> decompression in kernel (which trumps the zero-copy argument)? You
> might as well just use parquet without filesystem compression. I
> prefer to have compression algorithm where the columnar engine can
> benefit from it [1] than marginally improving a file-system-os
> specific feature.
>
> François
>
> [1] Section 4.3 http://db.csail.mit.edu/pubs/abadi-column-stores.pdf
>
>
>
>
> On Thu, Jan 23, 2020 at 12:43 PM John Muehlhausen  wrote:
> >
> > This could also have utility in memory via things like zram/zswap, right?
> > Mac also has a memory compressor?
> >
> > I don't think Parquet is an option for me unless the integration with
> Arrow
> > is tighter than I imagine (i.e. zero-copy).  That said, I confess I know
> > next to nothing about Parquet.
> >
> > On Thu, Jan 23, 2020 at 11:23 AM Antoine Pitrou 
> wrote:
> > >
> > >
> > > Le 23/01/2020 à 18:16, John Muehlhausen a écrit :
> > > > Perhaps related to this thread, are there any current or proposed
> tools
> > to
> > > > transform columns for fixed-length data types according to a
> "shuffle?"
> > > >  For precedent see the implementation of the shuffle filter in hdf5.
> > > >
> >
> https://support.hdfgroup.org/ftp/HDF5//documentation/doc1.6/TechNotes/shuffling-algorithm-report.pdf
> > > >
> > > > For example, the column (length 3) would store bytes 00 00 00 00 00
> 00
> > 00
> > > > 00 00 01 02 03 to represent the three 32-bit numbers 00 00 00 01 00
> 00
> > 00
> > > > 02 00 00 00 03  (I'm writing big-endian even if that is not actually
> the
> > > > case).
> > > >
> > > > Value(1) would return 00 00 00 02 by referring to some metadata flag
> > that
> > > > the column is shuffled, stitching the bytes back together at call
> time.
> > > >
> > > > Thus if the column pages were backed by a memory map to something
> like
> > > > zfs/gzip-9 (my actual use-case), one would expect approx 30% savings
> in
> > > > underlying disk usage due to better run lengths.
> > > >
> > > > It would enable a space/time tradeoff that could be useful?  The
> > filesystem
> > > > itself cannot easily do this particular compression transform since
> it
> > > > benefits from knowing the shape of the data.
> > >
> > > For the record, there's a pull request adding this encoding to the
> > > Parquet C++ specification.
> > >
> > > Regards
> > >
> > > Antoine.
>


[jira] [Created] (ARROW-7663) from_pandas gives TypeError instead of ArrowTypeError in some cases

2020-01-23 Thread David Li (Jira)
David Li created ARROW-7663:
---

 Summary: from_pandas gives TypeError instead of ArrowTypeError in 
some cases
 Key: ARROW-7663
 URL: https://issues.apache.org/jira/browse/ARROW-7663
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.15.1
Reporter: David Li


from_pandas sometimes raises a TypeError with an uninformative error message 
rather than an ArrowTypeError with the full, informative type error for 
mixed-type array columns:

{noformat}
>>> pa.Table.from_pandas(pd.DataFrame({"a": ['a', 1]}))
Traceback (most recent call last):
  File "", line 1, in 
  File "pyarrow/table.pxi", line 1177, in pyarrow.lib.Table.from_pandas
  File 
"/Users/lidavidm/Flight/arrow/build/python/lib.macosx-10.12-x86_64-3.7/pyarrow/pandas_compat.py",
 line 575, in dataframe_to_arrays
for c, f in zip(columns_to_convert, convert_fields)]
  File 
"/Users/lidavidm/Flight/arrow/build/python/lib.macosx-10.12-x86_64-3.7/pyarrow/pandas_compat.py",
 line 575, in <listcomp>
for c, f in zip(columns_to_convert, convert_fields)]
  File 
"/Users/lidavidm/Flight/arrow/build/python/lib.macosx-10.12-x86_64-3.7/pyarrow/pandas_compat.py",
 line 566, in convert_column
raise e
  File 
"/Users/lidavidm/Flight/arrow/build/python/lib.macosx-10.12-x86_64-3.7/pyarrow/pandas_compat.py",
 line 560, in convert_column
result = pa.array(col, type=type_, from_pandas=True, safe=safe)
  File "pyarrow/array.pxi", line 265, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 80, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 107, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: ("Expected a bytes object, got a 'int' object", 
'Conversion failed for column a with type object')
>>> pa.Table.from_pandas(pd.DataFrame({"a": [1, 'a']}))
Traceback (most recent call last):
  File "", line 1, in 
  File "pyarrow/table.pxi", line 1177, in pyarrow.lib.Table.from_pandas
  File 
"/Users/lidavidm/Flight/arrow/build/python/lib.macosx-10.12-x86_64-3.7/pyarrow/pandas_compat.py",
 line 575, in dataframe_to_arrays
for c, f in zip(columns_to_convert, convert_fields)]
  File 
"/Users/lidavidm/Flight/arrow/build/python/lib.macosx-10.12-x86_64-3.7/pyarrow/pandas_compat.py",
 line 575, in <listcomp>
for c, f in zip(columns_to_convert, convert_fields)]
  File 
"/Users/lidavidm/Flight/arrow/build/python/lib.macosx-10.12-x86_64-3.7/pyarrow/pandas_compat.py",
 line 560, in convert_column
result = pa.array(col, type=type_, from_pandas=True, safe=safe)
  File "pyarrow/array.pxi", line 265, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 80, in pyarrow.lib._ndarray_to_array
TypeError: an integer is required (got type str)
{noformat}

Noticed on 0.15.1 and on master when we tried to upgrade.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [DISCUSS] Format additions for encoding/compression

2020-01-23 Thread Wes McKinney
Parquet is most relevant in scenarios where filesystem IO is constrained
(spinning rust HDD, network FS, cloud storage / S3 / GCS). For those
use cases memory-mapped Arrow is not viable.

Against local NVMe (> 2000 MB/s read throughput) your mileage may vary.

On Thu, Jan 23, 2020 at 12:06 PM Francois Saint-Jacques
 wrote:
>
> What's the point of having zero copy if the OS is doing the
> decompression in kernel (which trumps the zero-copy argument)? You
> might as well just use parquet without filesystem compression. I
> prefer to have compression algorithm where the columnar engine can
> benefit from it [1] than marginally improving a file-system-os
> specific feature.
>
> François
>
> [1] Section 4.3 http://db.csail.mit.edu/pubs/abadi-column-stores.pdf
>
>
>
>
> On Thu, Jan 23, 2020 at 12:43 PM John Muehlhausen  wrote:
> >
> > This could also have utility in memory via things like zram/zswap, right?
> > Mac also has a memory compressor?
> >
> > I don't think Parquet is an option for me unless the integration with Arrow
> > is tighter than I imagine (i.e. zero-copy).  That said, I confess I know
> > next to nothing about Parquet.
> >
> > On Thu, Jan 23, 2020 at 11:23 AM Antoine Pitrou  wrote:
> > >
> > >
> > > Le 23/01/2020 à 18:16, John Muehlhausen a écrit :
> > > > Perhaps related to this thread, are there any current or proposed tools
> > to
> > > > transform columns for fixed-length data types according to a "shuffle?"
> > > >  For precedent see the implementation of the shuffle filter in hdf5.
> > > >
> > https://support.hdfgroup.org/ftp/HDF5//documentation/doc1.6/TechNotes/shuffling-algorithm-report.pdf
> > > >
> > > > For example, the column (length 3) would store bytes 00 00 00 00 00 00
> > 00
> > > > 00 00 01 02 03 to represent the three 32-bit numbers 00 00 00 01 00 00
> > 00
> > > > 02 00 00 00 03  (I'm writing big-endian even if that is not actually the
> > > > case).
> > > >
> > > > Value(1) would return 00 00 00 02 by referring to some metadata flag
> > that
> > > > the column is shuffled, stitching the bytes back together at call time.
> > > >
> > > > Thus if the column pages were backed by a memory map to something like
> > > > zfs/gzip-9 (my actual use-case), one would expect approx 30% savings in
> > > > underlying disk usage due to better run lengths.
> > > >
> > > > It would enable a space/time tradeoff that could be useful?  The
> > filesystem
> > > > itself cannot easily do this particular compression transform since it
> > > > benefits from knowing the shape of the data.
> > >
> > > For the record, there's a pull request adding this encoding to the
> > > Parquet C++ specification.
> > >
> > > Regards
> > >
> > > Antoine.


Re: [DISCUSS] Format additions for encoding/compression

2020-01-23 Thread John Muehlhausen
This could also have utility in memory via things like zram/zswap, right?
Mac also has a memory compressor?

I don't think Parquet is an option for me unless the integration with Arrow
is tighter than I imagine (i.e. zero-copy).  That said, I confess I know
next to nothing about Parquet.

On Thu, Jan 23, 2020 at 11:23 AM Antoine Pitrou  wrote:
>
>
> Le 23/01/2020 à 18:16, John Muehlhausen a écrit :
> > Perhaps related to this thread, are there any current or proposed tools
to
> > transform columns for fixed-length data types according to a "shuffle?"
> >  For precedent see the implementation of the shuffle filter in hdf5.
> >
https://support.hdfgroup.org/ftp/HDF5//documentation/doc1.6/TechNotes/shuffling-algorithm-report.pdf
> >
> > For example, the column (length 3) would store bytes 00 00 00 00 00 00
00
> > 00 00 01 02 03 to represent the three 32-bit numbers 00 00 00 01 00 00
00
> > 02 00 00 00 03  (I'm writing big-endian even if that is not actually the
> > case).
> >
> > Value(1) would return 00 00 00 02 by referring to some metadata flag
that
> > the column is shuffled, stitching the bytes back together at call time.
> >
> > Thus if the column pages were backed by a memory map to something like
> > zfs/gzip-9 (my actual use-case), one would expect approx 30% savings in
> > underlying disk usage due to better run lengths.
> >
> > It would enable a space/time tradeoff that could be useful?  The
filesystem
> > itself cannot easily do this particular compression transform since it
> > benefits from knowing the shape of the data.
>
> For the record, there's a pull request adding this encoding to the
> Parquet C++ specification.
>
> Regards
>
> Antoine.


Re: [DISCUSS] Format additions for encoding/compression

2020-01-23 Thread Antoine Pitrou


Forgot to give the URL:
https://github.com/apache/arrow/pull/6005

Regards

Antoine.


Le 23/01/2020 à 18:23, Antoine Pitrou a écrit :
> 
> Le 23/01/2020 à 18:16, John Muehlhausen a écrit :
>> Perhaps related to this thread, are there any current or proposed tools to
>> transform columns for fixed-length data types according to a "shuffle?"
>>  For precedent see the implementation of the shuffle filter in hdf5.
>> https://support.hdfgroup.org/ftp/HDF5//documentation/doc1.6/TechNotes/shuffling-algorithm-report.pdf
>>
>> For example, the column (length 3) would store bytes 00 00 00 00 00 00 00
>> 00 00 01 02 03 to represent the three 32-bit numbers 00 00 00 01 00 00 00
>> 02 00 00 00 03  (I'm writing big-endian even if that is not actually the
>> case).
>>
>> Value(1) would return 00 00 00 02 by referring to some metadata flag that
>> the column is shuffled, stitching the bytes back together at call time.
>>
>> Thus if the column pages were backed by a memory map to something like
>> zfs/gzip-9 (my actual use-case), one would expect approx 30% savings in
>> underlying disk usage due to better run lengths.
>>
>> It would enable a space/time tradeoff that could be useful?  The filesystem
>> itself cannot easily do this particular compression transform since it
>> benefits from knowing the shape of the data.
> 
> For the record, there's a pull request adding this encoding to the
> Parquet C++ specification.
> 
> Regards
> 
> Antoine.
> 


Re: [DISCUSS] Format additions for encoding/compression

2020-01-23 Thread Antoine Pitrou


Le 23/01/2020 à 18:16, John Muehlhausen a écrit :
> Perhaps related to this thread, are there any current or proposed tools to
> transform columns for fixed-length data types according to a "shuffle?"
>  For precedent see the implementation of the shuffle filter in hdf5.
> https://support.hdfgroup.org/ftp/HDF5//documentation/doc1.6/TechNotes/shuffling-algorithm-report.pdf
> 
> For example, the column (length 3) would store bytes 00 00 00 00 00 00 00
> 00 00 01 02 03 to represent the three 32-bit numbers 00 00 00 01 00 00 00
> 02 00 00 00 03  (I'm writing big-endian even if that is not actually the
> case).
> 
> Value(1) would return 00 00 00 02 by referring to some metadata flag that
> the column is shuffled, stitching the bytes back together at call time.
> 
> Thus if the column pages were backed by a memory map to something like
> zfs/gzip-9 (my actual use-case), one would expect approx 30% savings in
> underlying disk usage due to better run lengths.
> 
> It would enable a space/time tradeoff that could be useful?  The filesystem
> itself cannot easily do this particular compression transform since it
> benefits from knowing the shape of the data.

For the record, there's a pull request adding this encoding to the
Parquet C++ specification.

Regards

Antoine.


Re: [DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)

2020-01-23 Thread John Muehlhausen
Perhaps related to this thread, are there any current or proposed tools to
transform columns for fixed-length data types according to a "shuffle?"
 For precedent see the implementation of the shuffle filter in hdf5.
https://support.hdfgroup.org/ftp/HDF5//documentation/doc1.6/TechNotes/shuffling-algorithm-report.pdf

For example, the column (length 3) would store bytes 00 00 00 00 00 00 00
00 00 01 02 03 to represent the three 32-bit numbers 00 00 00 01 00 00 00
02 00 00 00 03  (I'm writing big-endian even if that is not actually the
case).

Value(1) would return 00 00 00 02 by referring to some metadata flag that
the column is shuffled, stitching the bytes back together at call time.

Thus if the column pages were backed by a memory map to something like
zfs/gzip-9 (my actual use-case), one would expect approx 30% savings in
underlying disk usage due to better run lengths.

It would enable a space/time tradeoff that could be useful?  The filesystem
itself cannot easily do this particular compression transform since it
benefits from knowing the shape of the data.

-John
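
For concreteness, a minimal Java sketch of the shuffle transform described
above for 32-bit values (plain Java, not an Arrow API): shuffle() stores byte
k of every value contiguously, and value() stitches element i back together
at call time, which is the behavior a "shuffled" fixed-length type would hide
behind Value(i):

import java.util.Arrays;

public class ShuffleSketch {
  // Big-endian byte k of value i is stored at position k * n + i.
  static byte[] shuffle(int[] values) {
    int n = values.length;
    byte[] out = new byte[n * 4];
    for (int i = 0; i < n; i++) {
      for (int k = 0; k < 4; k++) {
        out[k * n + i] = (byte) (values[i] >>> (8 * (3 - k)));
      }
    }
    return out;
  }

  // Reassemble element i from the shuffled bytes at access time.
  static int value(byte[] shuffled, int n, int i) {
    int v = 0;
    for (int k = 0; k < 4; k++) {
      v = (v << 8) | (shuffled[k * n + i] & 0xFF);
    }
    return v;
  }

  public static void main(String[] args) {
    int[] values = {1, 2, 3};
    byte[] shuffled = shuffle(values);
    System.out.println(Arrays.toString(shuffled));         // [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3]
    System.out.println(value(shuffled, values.length, 1)); // 2
  }
}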

On Sun, Aug 25, 2019 at 10:30 PM Micah Kornfield 
wrote:
>
> Hi Ippokratis,
> Thank you for the feedback, I have some questions based on the links you
> provided.
>
>
> > I think that lightweight encodings (like the FrameOfReference Micah
> > suggests) do make a lot of sense for Arrow. There are a few
implementations
> > of those in commercial systems. One related paper in the literature is
> > http://www.cs.columbia.edu/~orestis/damon15.pdf
>
>
> This paper seems to suggest more complex encodings than I was imagining for
> the first implementation.  Specifically, I proposed using only codes that
> are 2^N bits (8, 16, 32, and 64). Do you think it is critical to have
> the dense bit-packing in an initial version?
>
> >
> > I would actually also look into some order-preserving dictionary
encodings
> > for strings that also allow vectorized processing (predicates, joins,
..)
> > on encoded data, e.g. see
> >
https://15721.courses.cs.cmu.edu/spring2017/papers/11-compression/p283-binnig.pdf
> >  .
>
> The IPC spec [1] already has some metadata about the ordering of
> dictionaries, but this might not be sufficient.  The paper linked here
> seems to recommend two things:
> 1.  Treating dictionaries as explicit mappings between value and integer
> code; today this is implicit because the dictionaries are lists indexed by
> code.  It seems like for forward compatibility we should add a type enum to
> the Dictionary Encoding metadata.
> 2.  Adding indexes to the dictionaries.  For this, did you imagine the
> indexes would be transferred or something built up on receiving batches?
>
> Arrow can be used as during shuffles for distributed joins/aggs and being
> > able to operate on encoded data yields benefits (e.g.
> > http://www.vldb.org/pvldb/vol7/p1355-lee.pdf).
>
> The main take-away I got after skimming this paper, as it relates to
> encodings, is that encodings (including dictionary) should be dynamic per
> batch.  The other interesting question it raises with respect to Arrow is
> one of the techniques used is delta-encoding.  I believe delta encoding
> requires linear time access.  The dense representations in Arrow were
> designed to have constant time access to elements. One open question is how
> far we want to relax this requirement for encoded columns.  My proposal
> uses a form of RLE that provides O(log(N)) access.
>
> Cheers,
> Micah
>
> [1] https://github.com/apache/arrow/blob/master/format/Schema.fbs#L285
>
> On Sun, Aug 25, 2019 at 12:03 AM Ippokratis Pandis 
> wrote:
>
> > I think that lightweight encodings (like the FrameOfReference Micah
> > suggests) do make a lot of sense for Arrow. There are a few
implementations
> > of those in commercial systems. One related paper in the literature is
> > http://www.cs.columbia.edu/~orestis/damon15.pdf
> >
> > I would actually also look into some order-preserving dictionary
encodings
> > for strings that also allow vectorized processing (predicates, joins,
..)
> > on encoded data, e.g. see
> >
https://15721.courses.cs.cmu.edu/spring2017/papers/11-compression/p283-binnig.pdf
> >  .
> >
> > Arrow can be used as during shuffles for distributed joins/aggs and
being
> > able to operate on encoded data yields benefits (e.g.
> > http://www.vldb.org/pvldb/vol7/p1355-lee.pdf).
> >
> > Thanks,
> > -Ippokratis.
> >
> >
> > On Thu, Jul 25, 2019 at 11:06 PM Micah Kornfield 
> > wrote:
> >
> >> >
> >> > It's not just computation libraries, it's any library peeking inside
> >> > Arrow data.  Currently, the Arrow data types are simple, which makes
it
> >> > easy and non-intimidating to build data processing utilities around
> >> > them.  If we start adding sophisticated encodings, we also raise the
> >> > cost of supporting Arrow for third-party libraries.
> >>
> >>
> >> This is another legitimate concern about complexity.
> >>
> >> To try to limit complexity. I simplified the proposal PR [1] to only
have
> 

[NIGHTLY] Arrow Build Report for Job nightly-2020-01-23-0

2020-01-23 Thread Crossbow


Arrow Build Report for Job nightly-2020-01-23-0

All tasks: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0

Failed Tasks:
- conda-win-vs2015-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0-azure-conda-win-vs2015-py36
- conda-win-vs2015-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0-azure-conda-win-vs2015-py37
- conda-win-vs2015-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0-azure-conda-win-vs2015-py38
- gandiva-jar-osx:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0-travis-gandiva-jar-osx
- test-conda-python-3.7-spark-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0-circle-test-conda-python-3.7-spark-master
- test-ubuntu-fuzzit-fuzzing:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0-circle-test-ubuntu-fuzzit-fuzzing
- test-ubuntu-fuzzit-regression:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0-circle-test-ubuntu-fuzzit-regression
- wheel-osx-cp27m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0-travis-wheel-osx-cp27m
- wheel-osx-cp35m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0-travis-wheel-osx-cp35m
- wheel-osx-cp36m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0-travis-wheel-osx-cp36m
- wheel-osx-cp37m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0-travis-wheel-osx-cp37m
- wheel-osx-cp38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0-travis-wheel-osx-cp38

Succeeded Tasks:
- centos-6:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0-azure-centos-6
- centos-7:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0-azure-centos-7
- centos-8:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0-azure-centos-8
- conda-linux-gcc-py27:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0-azure-conda-linux-gcc-py27
- conda-linux-gcc-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0-azure-conda-linux-gcc-py36
- conda-linux-gcc-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0-azure-conda-linux-gcc-py37
- conda-linux-gcc-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0-azure-conda-linux-gcc-py38
- conda-osx-clang-py27:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0-azure-conda-osx-clang-py27
- conda-osx-clang-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0-azure-conda-osx-clang-py36
- conda-osx-clang-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0-azure-conda-osx-clang-py37
- conda-osx-clang-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0-azure-conda-osx-clang-py38
- debian-buster:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0-azure-debian-buster
- debian-stretch:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0-azure-debian-stretch
- gandiva-jar-trusty:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0-travis-gandiva-jar-trusty
- homebrew-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0-travis-homebrew-cpp
- macos-r-autobrew:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0-travis-macos-r-autobrew
- test-conda-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0-circle-test-conda-cpp
- test-conda-python-2.7-pandas-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0-circle-test-conda-python-2.7-pandas-latest
- test-conda-python-2.7:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0-circle-test-conda-python-2.7
- test-conda-python-3.6:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0-circle-test-conda-python-3.6
- test-conda-python-3.7-dask-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0-circle-test-conda-python-3.7-dask-latest
- test-conda-python-3.7-hdfs-2.9.2:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0-circle-test-conda-python-3.7-hdfs-2.9.2
- test-conda-python-3.7-pandas-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0-circle-test-conda-python-3.7-pandas-latest
- 

[jira] [Created] (ARROW-7662) Support for auto-inferring list column->array in write_parquet

2020-01-23 Thread Michael Chirico (Jira)
Michael Chirico created ARROW-7662:
--

 Summary: Support for auto-inferring list column->array in 
write_parquet
 Key: ARROW-7662
 URL: https://issues.apache.org/jira/browse/ARROW-7662
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Michael Chirico



{code:r}
DF = data.frame(a = 1:10)
DF$b = as.list(DF$a)
arrow::write_parquet(DF, 'test.parquet')
# Error in Table__from_dots(dots, schema) : cannot infer type from data
{code}

This appears to be supported naturally already in Python:

{code:python}
import pandas as pd
pd.DataFrame({'a': [1, 2, 3], 'b': [[1, 2], [3, 4], [5, 
6]]}).to_parquet('test.parquet')
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7660) [C++][Gandiva] Optimise castVarchar(string, int) function for single byte characters

2020-01-23 Thread Projjal Chanda (Jira)
Projjal Chanda created ARROW-7660:
-

 Summary: [C++][Gandiva] Optimise castVarchar(string, int) function 
for single byte characters
 Key: ARROW-7660
 URL: https://issues.apache.org/jira/browse/ARROW-7660
 Project: Apache Arrow
  Issue Type: Task
  Components: C++ - Gandiva
Reporter: Projjal Chanda
Assignee: Projjal Chanda


The current castVarchar function does a byte-by-byte check to handle multibyte 
characters. Since strings consist of single-byte characters most of the time, 
optimise for that case and move to the slow path only when multibyte characters 
are detected.
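
A rough sketch of the idea in Java (Gandiva's actual implementation is generated C++; the 
names below are illustrative only): if no byte has the high bit set, the string is pure 
ASCII, so characters equal bytes and the truncation can slice bytes directly; otherwise 
fall back to walking characters.

{code:java}
import java.nio.charset.StandardCharsets;

public final class CastVarcharSketch {
  static String castVarchar(String input, int maxChars) {
    byte[] utf8 = input.getBytes(StandardCharsets.UTF_8);
    boolean singleByteOnly = true;
    for (byte b : utf8) {
      if ((b & 0x80) != 0) {   // high bit set means a multibyte UTF-8 sequence is present
        singleByteOnly = false;
        break;
      }
    }
    if (singleByteOnly) {
      // Fast path: one byte per character, so truncate by byte count.
      int len = Math.min(maxChars, utf8.length);
      return new String(utf8, 0, len, StandardCharsets.US_ASCII);
    }
    // Slow path: walk code points to find where the first maxChars characters end.
    int chars = Math.min(maxChars, input.codePointCount(0, input.length()));
    int end = input.offsetByCodePoints(0, chars);
    return input.substring(0, end);
  }

  public static void main(String[] args) {
    System.out.println(castVarchar("abcdef", 3));  // abc (fast path)
    System.out.println(castVarchar("héllo", 3));   // hél (slow path)
  }
}
{code}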



--
This message was sent by Atlassian Jira
(v8.3.4#803005)