[jira] [Created] (ARROW-7668) [Packaging][RPM] Use Ninja if possible to reduce build time
Kouhei Sutou created ARROW-7668: --- Summary: [Packaging][RPM] Use Ninja if possible to reduce build time Key: ARROW-7668 URL: https://issues.apache.org/jira/browse/ARROW-7668 Project: Apache Arrow Issue Type: Improvement Components: Packaging Reporter: Kouhei Sutou Assignee: Kouhei Sutou -- This message was sent by Atlassian Jira (v8.3.4#803005)
Re: [DISCUSS][JAVA] Correct the behavior of ListVector isEmpty
I would vote for treating nulls as empty. On Fri, Jan 10, 2020 at 12:36 AM Ji Liu wrote: > Hi all, > > Currently the isEmpty API always returns false in BaseRepeatedValueVector, > and its subclass ListVector does not override this method. > This will lead to incorrect results, for example, a ListVector with data > [1,2], null, [], [5,6] would get [false, false, false, false] which is not > right. > I opened a PR to fix this[1] and am not sure what's the right behavior for > null value, should it return [false, false, true, false] or [false, true, > true, false] ? > > > Thanks, > Ji Liu > > > [1] https://github.com/apache/arrow/pull/6044
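Arrow's variable-size list layout (an offsets buffer plus a validity buffer) makes the two candidate semantics easy to compare. The following pure-Python sketch is illustrative only; `is_empty` is a hypothetical helper, not the ListVector API. It reproduces both answers for the example data [1,2], null, [], [5,6]:

```python
def is_empty(offsets, validity, null_is_empty):
    """Report, per slot, whether the list is empty.

    offsets       -- Arrow-style offsets buffer (length n + 1)
    validity      -- True where the slot is non-null
    null_is_empty -- the policy question from the thread:
                     does a null list count as empty?
    """
    out = []
    for i in range(len(offsets) - 1):
        if not validity[i]:
            out.append(null_is_empty)          # null slot: depends on policy
        else:
            out.append(offsets[i + 1] == offsets[i])  # zero-length list
    return out

# ListVector with data [1, 2], null, [], [5, 6] (values buffer omitted)
offsets = [0, 2, 2, 2, 4]
validity = [True, False, True, True]

assert is_empty(offsets, validity, null_is_empty=False) == [False, False, True, False]
assert is_empty(offsets, validity, null_is_empty=True) == [False, True, True, False]
```

Either behavior is equally cheap to compute from the buffers; the open question is purely which contract the API should promise.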
Re: [Format] Make fields required?
Looking at this it seems like the main change is require empty lists instead of null values? I think this might potentially be too strict for existing degenerate cases (e.g. empty files, I also don't remember if we said null type requires a buffer). Most of the others like MessageHeader make sense to me. On Mon, Jan 20, 2020 at 2:32 PM Wes McKinney wrote: > To help with the discussion, here is a patch with 9 "definitely > required" fields made required, and the associated generated C++ > changes > > https://github.com/apache/arrow/compare/master...wesm:flatbuffers-required > > (I am not 100% sure about Field.children always being non-null, if > there were some doubt we could let it be null) > > (I would guess that the semantics in Java and elsewhere is the same, > but someone should confirm) > > On Mon, Jan 20, 2020 at 12:59 PM Wes McKinney wrote: > > > > On Mon, Jan 20, 2020 at 12:20 PM Jacques Nadeau > wrote: > > > > > > > > > > > I think what we have determined is that the changes that are being > > > > discussed in this thread would not render any existing serialized > > > > Flatbuffers unreadable, unless they are malformed / unable to be > > > > read with the current libraries. > > > > > > > > > > I think we need to separate two different things: > > > > > > Point 1: If all data is populated as we expect, changing from optional > to > > > required is a noop. > > > Point 2: All current Arrow code fails to work in all cases where a > field is > > > not populated as expected. > > > > I looked at the before/after when adding "(required)" to a field and > > it appears the only change on the read path is the generated verifier > > (which you have to explicitly invoke, and you can skip verification) > > > > https://gist.github.com/wesm/f1a9e7492b0daee07ccef0566c3900a2 > > > > This is distinct from Protobuf (I think?) because protobuf verifies > > the presence of required fields when parsing the protobuf. 
I assume > > it's the same in other languages but we'll have to check to be sure > > > > This means that if you _fail to invoke the verifier_, you can still > > follow a null pointer, but applications that use the verifier will > > stop there and not have to implement their own null checks. > > > > > > > > I think one needs to prove both points in order for this change to be a > > > compatible change. I agree that point 1 is proven. I don't think point > 2 > > > has been proven. In fact, I'm not sure how one could prove it(*). The > bar > > > for changing the format in a backwards incompatible way (assuming we > can't > > > prove point 2) should be high given how long the specification has been > > > out. It doesn't feel like the benefits here outweigh the cost of > changing > > > in an incompatible way (especially given the subjective nature of > optional > > > vs. required). > > > > > > It's probably less of a concern for > > > > an in-house protocol than for an open standard like Arrow where there > > > > may be multiple third-party implementations around at some point. > > > > > > > > > > This is subjective, just like the general argument around whether > required > > > or optional should be used in protobuf. My point in sharing was to (1) > > > point out that the initial implementation choices weren't done without > > > reason and (2) that we should avoid arguing that either direction is > more > > > technically sound (which seemed to be the direction the argument was > > > taking). > > > > > > (*) One could do an exhaustive analysis of every codepath. This would > work > > > for libraries in the Arrow project. However, the flatbuf definition is > part > > > of the external specification meaning that other codepaths likely exist > > > that we could not evaluate. >
Re: [Java] Large Memory Allocators (Taking a dependency on JNA?)
Sounds good, I'll leave it up to you which to implement. Thanks for taking it on. On Sun, Jan 19, 2020 at 8:47 PM Fan Liya wrote: > Hi Jacques and Micah, > > Thanks for the fruitful discussion. > > It seems netty based allocator and unsafe based allocator have their > specific advantages. > Maybe we can implement both as independent allocators, to support > different scenarios. > > This should not be difficult, as [1] has laid a solid ground for this. > > Best, > Liya Fan > > [1] https://issues.apache.org/jira/browse/ARROW-7329 > > On Mon, Jan 20, 2020 at 11:38 AM Micah Kornfield > wrote: > >> Hmm, somehow I missed those two alternatives, thanks for pointing them >> out. >> >> I agree that these are probably better than taking a new dependency. Of >> the two of them, it seems like using Unsafe directly might be better since >> it would also solve the issue of setting special environment variables >> for Netty [1], but it might be too big of a change to couple the two >> together. >> >> The other point brought up on the JIRA about honoring -XX:MaxDirectMemorySize >> is a good one. The one downside to this is it potentially comes with a >> performance penalty [2] (this is quite dated though). But I think we can >> always explore other options after doing the simplest thing first. >> >> -Micah >> >> [1] https://issues.apache.org/jira/browse/ARROW-7223 >> [2] >> >> http://mail.openjdk.java.net/pipermail/hotspot-dev/2015-February/017089.html >> >> On Sun, Jan 19, 2020 at 3:03 PM Jacques Nadeau >> wrote: >> >> > It seems like jna is overkill & unnecessary for simply >> allocating/freeing >> > memory. >> > >> > A simple way to do this is either to use unsafe directly or call the >> > existing netty unsafe facade directly. >> > >> > PlatformDependent.allocateMemory(long) >> > PlatformDependent.freeMemory(long) >> > >> > Should be relatively straightforward to add to the existing Netty-based >> > allocator. 
>> > >> > On Sat, Jan 18, 2020 at 8:14 PM Fan Liya wrote: >> > >> >> Hi Micah, >> >> >> >> Thanks for the good suggestion. JNA seems like a good and reasonable >> tool >> >> for allocating large memory chunks. >> >> >> >> How about we directly use the Java UNSAFE API? It seems the allocateMemory >> API >> >> is also based on the malloc method of the native implementation [1]. >> >> >> >> Best, >> >> Liya Fan >> >> >> >> [1] >> >> >> >> >> http://hg.openjdk.java.net/jdk7/jdk7/hotspot/file/4fc084dac61e/src/share/vm/prims/unsafe.cpp >> >> >> >> On Sat, Jan 18, 2020 at 12:58 PM Micah Kornfield < >> emkornfi...@gmail.com> >> >> wrote: >> >> >> >> > With the recently merged changes to the underlying ArrowBuf APIs to >> >> allow >> >> > 64-bit memory address spaces there is some follow-up work to actually >> >> > confirm it works. I opened a JIRA [1] to track this work. >> >> > >> >> > The main question is how to provide an allocator that supports larger >> >> > memory chunks. It appears the Netty API only takes a 32-bit integer >> >> for >> >> > array sizes. Doing a little bit of investigation it seems like JNA >> [2] >> >> > exposes a direct call to malloc with 64-bit integers [3]. >> >> > >> >> > The other option would seem to be rolling our own allocator via JNI. >> >> > >> >> > Has anybody worked with JNA who can share experiences? >> >> > Is anyone familiar with other options? >> >> > >> >> > Thanks, >> >> > Micah >> >> > >> >> > [1] https://issues.apache.org/jira/browse/ARROW-7606 >> >> > [2] https://github.com/java-native-access/jna >> >> > [3] >> >> > >> >> > >> >> >> https://github.com/java-native-access/jna/blob/master/src/com/sun/jna/Native.java#L2265 >> >> > >> >> >> > >> >
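For readers following along outside the JVM, the capability under discussion is simply a malloc/free pair whose size parameter is 64 bits wide, which is what Unsafe.allocateMemory(long) and Netty's PlatformDependent.allocateMemory(long) ultimately wrap. A rough Python ctypes analogy (not Arrow code) on a Unix platform:

```python
import ctypes

# Load the current process, which links libc on Unix, and declare the
# C runtime's malloc/free signatures.
libc = ctypes.CDLL(None)
libc.malloc.restype = ctypes.c_void_p
# size_t is 64 bits on LP64 platforms, unlike Java array lengths (int),
# so sizes above 2**31 - 1 are expressible here.
libc.malloc.argtypes = [ctypes.c_size_t]
libc.free.argtypes = [ctypes.c_void_p]

addr = libc.malloc(1 << 20)  # 1 MiB; a 64-bit size works the same way
assert addr, "malloc returned NULL"
libc.free(addr)
```

The JNI route mentioned in the thread amounts to the same two native calls, just reached from Java instead of Python.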
Re: [DISCUSS] Format additions for encoding/compression
Hi John, Not Wes, but my thoughts on this are as follows: 1. Alternate bit/byte arrangements can also be useful for processing [1] in addition to compression. 2. I think they are quite a bit more complicated than the existing schemes proposed in [2], so I think it would be more expedient to get the integration hooks necessary to work with simpler encodings before going with something more complex. I believe the proposal is generic enough to support this type of encoding. 3. For prototyping, this seems like a potential use of the ExtensionType [3] mechanism already in the specification. 4. I don't think these should be new types or part of the basic Array data structure. I think having a different container format in the form of "SparseRecordBatch" (or perhaps it should be renamed to EncodedRecordBatch) and keeping the existing types with alternate encodings is a better option. That being said, if you have bandwidth to get this working for C++ and Java, we can potentially set up a separate development branch to see how it evolves. Personally, I've not brought my proposal up for discussion again because I haven't had bandwidth to work on it, but I still think introducing some level of alternate encodings is a good idea. Cheers, Micah [1] https://15721.courses.cs.cmu.edu/spring2018/papers/22-vectorization2/p31-feng.pdf [2] https://github.com/apache/arrow/pull/4815 [3] https://github.com/apache/arrow/blob/master/docs/source/format/Columnar.rst#extension-types On Thu, Jan 23, 2020 at 11:36 AM John Muehlhausen wrote: > Wes, what do you think about Arrow supporting a new suite of fixed-length > data types that unshuffle on column->Value(i) calls? This would allow > memory/swap compressors and memory maps backed by compressing > filesystems (ZFS) or block devices (VDO) to operate more efficiently. > > By doing it with new datatypes there is no separate flag to check? 
> > On Thu, Jan 23, 2020 at 1:09 PM Wes McKinney wrote: > > > On Thu, Jan 23, 2020 at 12:42 PM John Muehlhausen wrote: > > > > > > Again, I know very little about Parquet, so your patience is > appreciated. > > > > > > At the moment I can Arrow/mmap a file without having anywhere nearly as > > > much available memory as the file size. I can visit random place in > the > > > file (such as a binary search if it is ordered) and only the locations > > > visited by column->Value(i) are paged in. Paging them out happens > > without > > > my awareness, if necessary. > > > > > > Does Parquet cover this use-case with the same elegance and at least > > equal > > > efficiency, or are there more copies/conversions? Perhaps it requires > > the > > > entire file to be transformed into Arrow memory at the beginning? Or > on a > > > batch/block basis? Or to get this I need to use a non-Arrow API for > data > > > element access? Etc. > > > > Data has to be materialized / deserialized from the Parquet file on a > > batch-wise per-column basis. The APIs we provide allow batches of > > values to be read for a given subset of columns > > > > > > > > IFF it covers the above use-case, which does not mention compression or > > > encoding, then I could consider whether it is interesting on those > > points. > > > > My point really has to do with Parquet's design which is about > > reducing file size. In the following blog post > > > > https://ursalabs.org/blog/2019-10-columnar-perf/ > > > > I examined a dataset which is about 4GB as raw Arrow stream/file but > > only 114 MB as a Parquet file. A 30+X compression ratio is a huge deal > > if you are working with filesystems that yield < 500MB/s (which > > includes pretty much all cloud filesystems AFAIK). In clickstream > > analytics this kind of compression ratio is not unusual. 
> > > > > > > > -John > > > > > > On Thu, Jan 23, 2020 at 12:06 PM Francois Saint-Jacques < > > > fsaintjacq...@gmail.com> wrote: > > > > > > > What's the point of having zero copy if the OS is doing the > > > > decompression in kernel (which trumps the zero-copy argument)? You > > > > might as well just use parquet without filesystem compression. I > > > > prefer to have compression algorithm where the columnar engine can > > > > benefit from it [1] than marginally improving a file-system-os > > > > specific feature. > > > > > > > > François > > > > > > > > [1] Section 4.3 http://db.csail.mit.edu/pubs/abadi-column-stores.pdf > > > > > > > > > > > > > > > > > > > > On Thu, Jan 23, 2020 at 12:43 PM John Muehlhausen > wrote: > > > > > > > > > > This could also have utility in memory via things like zram/zswap, > > right? > > > > > Mac also has a memory compressor? > > > > > > > > > > I don't think Parquet is an option for me unless the integration > with > > > > Arrow > > > > > is tighter than I imagine (i.e. zero-copy). That said, I confess I > > know > > > > > next to nothing about Parquet. > > > > > > > > > > On Thu, Jan 23, 2020 at 11:23 AM Antoine Pitrou < > anto...@python.org> > > > > wrote: > > > > > > > > > > > > > > > > >
[Java] PR Reviewers
I mentioned this elsewhere, but my intent is to stop doing Java reviews for the immediate future once I wrap up the few that I have requested changes on. I'm happy to try to triage incoming Java PRs, but in order to do this, I need to know which committers have some bandwidth to do reviews (on some of the existing PRs, I've tagged people who never responded). Thanks, Micah
[Format] Array/RowBatch filters
One of the things that I think got overlooked in the conversation on having a slice offset in the C API was a suggestion from Jacques of perhaps generalizing the concept to an arbitrary "filter" for arrays/record batches. I believe this point was also discussed in the past. I'm not advocating for adding it now, but I'm curious whether people feel we should add something to Schema.fbs for forward compatibility, in case we wish to support this use-case in the future. Thanks, Micah
[jira] [Created] (ARROW-7667) [Packaging][deb] ubuntu-eoan is missing in nightly jobs
Kouhei Sutou created ARROW-7667: --- Summary: [Packaging][deb] ubuntu-eoan is missing in nightly jobs Key: ARROW-7667 URL: https://issues.apache.org/jira/browse/ARROW-7667 Project: Apache Arrow Issue Type: Improvement Components: Continuous Integration, Packaging Reporter: Kouhei Sutou Assignee: Kouhei Sutou
[jira] [Created] (ARROW-7666) [Packaging][deb] Always use Ninja to reduce build time
Kouhei Sutou created ARROW-7666: --- Summary: [Packaging][deb] Always use Ninja to reduce build time Key: ARROW-7666 URL: https://issues.apache.org/jira/browse/ARROW-7666 Project: Apache Arrow Issue Type: Improvement Components: Packaging Reporter: Kouhei Sutou Assignee: Kouhei Sutou
Re: PySpark failure [RE: [NIGHTLY] Arrow Build Report for Job nightly-2020-01-15-0]
Thanks for investigating this and the quick fix Joris and Wes! I just have a couple questions about the behavior observed here. The pyspark code assigns either the same series back to the pandas.DataFrame or makes some modifications if it is a timestamp. In the case there are no timestamps, is this potentially making extra copies or will it be unable to take advantage of new zero-copy features in pyarrow? For the case of having timestamp columns that need to be modified, is there a more efficient way to create a new dataframe with only copies of the modified series? Thanks! Bryan On Thu, Jan 16, 2020 at 11:48 PM Joris Van den Bossche < jorisvandenboss...@gmail.com> wrote: > That sounds like a good solution. Having the zero-copy behavior depending > on whether you have only 1 column of a certain type or not, might lead to > surprising results. To avoid yet another keyword, only doing it when > split_blocks=True sounds good to me (in practice, that's also when it will > happen mostly, except for very narrow dataframes with only few columns). > > Joris > > On Thu, 16 Jan 2020 at 22:44, Wes McKinney wrote: > > > hi Joris, > > > > Thanks for investigating this. It seems there were some unintended > > consequences of the zero-copy optimizations from ARROW-3789. Another > > way forward might be to "opt in" to this behavior, or to only do the > > zero copy optimizations when split_blocks=True. What do you think? 
> > > > - Wes > > > > On Thu, Jan 16, 2020 at 3:42 AM Joris Van den Bossche > > wrote: > > > > > > So the spark integration build started to fail, and with the following > > test > > > error: > > > > > > == > > > ERROR: test_toPandas_batch_order > > > (pyspark.sql.tests.test_arrow.EncryptionArrowTests) > > > -- > > > Traceback (most recent call last): > > > File "/spark/python/pyspark/sql/tests/test_arrow.py", line 422, in > > > test_toPandas_batch_order > > > run_test(*case) > > > File "/spark/python/pyspark/sql/tests/test_arrow.py", line 409, in > > run_test > > > pdf, pdf_arrow = self._toPandas_arrow_toggle(df) > > > File "/spark/python/pyspark/sql/tests/test_arrow.py", line 152, in > > > _toPandas_arrow_toggle > > > pdf_arrow = df.toPandas() > > > File "/spark/python/pyspark/sql/pandas/conversion.py", line 115, in > > toPandas > > > return _check_dataframe_localize_timestamps(pdf, timezone) > > > File "/spark/python/pyspark/sql/pandas/types.py", line 180, in > > > _check_dataframe_localize_timestamps > > > pdf[column] = _check_series_localize_timestamps(series, timezone) > > > File > > "/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/frame.py", > > > line 3487, in __setitem__ > > > self._set_item(key, value) > > > File > > "/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/frame.py", > > > line 3565, in _set_item > > > NDFrame._set_item(self, key, value) > > > File > > > "/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/generic.py", > > > line 3381, in _set_item > > > self._data.set(key, value) > > > File > > > "/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/internals/managers.py", > > > line 1090, in set > > > blk.set(blk_locs, value_getitem(val_locs)) > > > File > > > "/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/internals/blocks.py", > > > line 380, in set > > > self.values[locs] = values > > > ValueError: assignment destination is read-only > > > > > > > > > It's from a test that is 
doing conversions from spark to arrow to > pandas > > > (so calling pyarrow.Table.to_pandas here > > > < > > > https://github.com/apache/spark/blob/018bdcc53c925072b07956de0600452ad255b9c7/python/pyspark/sql/pandas/conversion.py#L111-L115 > > >), > > > and on the resulting DataFrame, it is iterating through all columns, > > > potentially fixing timezones, and writing each column back into the > > > DataFrame (here > > > < > > > https://github.com/apache/spark/blob/018bdcc53c925072b07956de0600452ad255b9c7/python/pyspark/sql/pandas/types.py#L179-L181 > > > > > > ). > > > > > > Since it is giving an error about read-only, it might be related to > > > zero-copy behaviour of to_pandas, and thus might be related to the > > refactor > > > of the arrow->pandas conversion that landed yesterday ( > > > https://github.com/apache/arrow/pull/6067, it says it changed to do > > > zero-copy for 1-column blocks if possible). > > > I am not sure if something should be fixed in pyarrow for this, but the > > > obvious thing that pyspark can do is specify they don't want zero-copy. > > > > > > Joris > > > > > > On Wed, 15 Jan 2020 at 14:32, Crossbow wrote: > > > > > >
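The read-only failure above can be reproduced in miniature with nothing but the standard library. This is an analogy for the zero-copy situation, not the pyspark code itself: a zero-copy view over an immutable buffer rejects in-place writes, just as a pandas column backed directly by an Arrow buffer does.

```python
backing = bytes(8)          # stands in for a shared, immutable Arrow buffer
view = memoryview(backing)  # zero-copy view: no bytes are duplicated
assert view.readonly

try:
    view[0] = 1             # in-place write, like pdf[column] = fixed_series
    raised = False
except TypeError:           # "cannot modify read-only memory"
    raised = True
assert raised               # callers must copy first, or opt out of zero-copy
```

This is why making the zero-copy behavior opt-in (e.g. only under split_blocks=True) keeps existing write-back patterns working by default.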
Re: [DISCUSS] Format additions for encoding/compression
Wes, what do you think about Arrow supporting a new suite of fixed-length data types that unshuffle on column->Value(i) calls? This would allow memory/swap compressors and memory maps backed by compressing filesystems (ZFS) or block devices (VDO) to operate more efficiently. By doing it with new datatypes there is no separate flag to check? On Thu, Jan 23, 2020 at 1:09 PM Wes McKinney wrote: > On Thu, Jan 23, 2020 at 12:42 PM John Muehlhausen wrote: > > > > Again, I know very little about Parquet, so your patience is appreciated. > > > > At the moment I can Arrow/mmap a file without having anywhere nearly as > > much available memory as the file size. I can visit random place in the > > file (such as a binary search if it is ordered) and only the locations > > visited by column->Value(i) are paged in. Paging them out happens > without > > my awareness, if necessary. > > > > Does Parquet cover this use-case with the same elegance and at least > equal > > efficiency, or are there more copies/conversions? Perhaps it requires > the > > entire file to be transformed into Arrow memory at the beginning? Or on a > > batch/block basis? Or to get this I need to use a non-Arrow API for data > > element access? Etc. > > Data has to be materialized / deserialized from the Parquet file on a > batch-wise per-column basis. The APIs we provide allow batches of > values to be read for a given subset of columns > > > > > IFF it covers the above use-case, which does not mention compression or > > encoding, then I could consider whether it is interesting on those > points. > > My point really has to do with Parquet's design which is about > reducing file size. In the following blog post > > https://ursalabs.org/blog/2019-10-columnar-perf/ > > I examined a dataset which is about 4GB as raw Arrow stream/file but > only 114 MB as a Parquet file. 
A 30+X compression ratio is a huge deal > if you are working with filesystems that yield < 500MB/s (which > includes pretty much all cloud filesystems AFAIK). In clickstream > analytics this kind of compression ratio is not unusual. > > > > > -John > > > > On Thu, Jan 23, 2020 at 12:06 PM Francois Saint-Jacques < > > fsaintjacq...@gmail.com> wrote: > > > > > What's the point of having zero copy if the OS is doing the > > > decompression in kernel (which trumps the zero-copy argument)? You > > > might as well just use parquet without filesystem compression. I > > > prefer to have compression algorithm where the columnar engine can > > > benefit from it [1] than marginally improving a file-system-os > > > specific feature. > > > > > > François > > > > > > [1] Section 4.3 http://db.csail.mit.edu/pubs/abadi-column-stores.pdf > > > > > > > > > > > > > > > On Thu, Jan 23, 2020 at 12:43 PM John Muehlhausen wrote: > > > > > > > > This could also have utility in memory via things like zram/zswap, > right? > > > > Mac also has a memory compressor? > > > > > > > > I don't think Parquet is an option for me unless the integration with > > > Arrow > > > > is tighter than I imagine (i.e. zero-copy). That said, I confess I > know > > > > next to nothing about Parquet. > > > > > > > > On Thu, Jan 23, 2020 at 11:23 AM Antoine Pitrou > > > wrote: > > > > > > > > > > > > > > > Le 23/01/2020 à 18:16, John Muehlhausen a écrit : > > > > > > Perhaps related to this thread, are there any current or proposed > > > tools > > > > to > > > > > > transform columns for fixed-length data types according to a > > > "shuffle?" > > > > > > For precedent see the implementation of the shuffle filter in > hdf5. 
> > > > > > > > > > > > > > https://support.hdfgroup.org/ftp/HDF5//documentation/doc1.6/TechNotes/shuffling-algorithm-report.pdf > > > > > > > > > > > > For example, the column (length 3) would store bytes 00 00 00 00 > 00 > > > 00 > > > > 00 > > > > > > 00 00 01 02 03 to represent the three 32-bit numbers 00 00 00 01 > 00 > > > 00 > > > > 00 > > > > > > 02 00 00 00 03 (I'm writing big-endian even if that is not > actually > > > the > > > > > > case). > > > > > > > > > > > > Value(1) would return 00 00 00 02 by referring to some metadata > flag > > > > that > > > > > > the column is shuffled, stitching the bytes back together at call > > > time. > > > > > > > > > > > > Thus if the column pages were backed by a memory map to something > > > like > > > > > > zfs/gzip-9 (my actual use-case), one would expect approx 30% > savings > > > in > > > > > > underlying disk usage due to better run lengths. > > > > > > > > > > > > It would enable a space/time tradeoff that could be useful? The > > > > filesystem > > > > > > itself cannot easily do this particular compression transform > since > > > it > > > > > > benefits from knowing the shape of the data. > > > > > > > > > > For the record, there's a pull request adding this encoding to the > > > > > Parquet C++ specification. > > > > > > > > > > Regards > > > > > > > > > > Antoine. > > > >
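The byte-shuffle transform described in the quoted message can be sketched directly. `shuffle` and `unshuffle_value` below are hypothetical helper names illustrating the HDF5-style shuffle filter, not a proposed Arrow API; the assertions mirror the 3-value big-endian example from the thread.

```python
import struct

def shuffle(values, width=4):
    """Byte-transpose fixed-width values: all first bytes, then all
    second bytes, and so on, improving run lengths for compressors."""
    raw = b"".join(struct.pack(">i", v) for v in values)  # big-endian, as in the example
    return bytes(raw[i] for b in range(width) for i in range(b, len(raw), width))

def unshuffle_value(shuffled, i, n, width=4):
    """Stitch value i back together at call time from a shuffled buffer
    of n values, the way a shuffled column->Value(i) would."""
    return struct.unpack(">i", bytes(shuffled[b * n + i] for b in range(width)))[0]

col = [1, 2, 3]
buf = shuffle(col)
assert buf == bytes([0] * 9 + [1, 2, 3])     # 00 x 9 then 01 02 03, per the example
assert unshuffle_value(buf, 1, len(col)) == 2  # Value(1)
```

The space/time trade-off is visible even here: the shuffled buffer compresses better, while each element read pays `width` scattered byte fetches.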
[jira] [Created] (ARROW-7665) [R] linuxLibs.R should build in parallel
Antoine Pitrou created ARROW-7665: - Summary: [R] linuxLibs.R should build in parallel Key: ARROW-7665 URL: https://issues.apache.org/jira/browse/ARROW-7665 Project: Apache Arrow Issue Type: Wish Components: R Reporter: Antoine Pitrou It currently seems to compile everything in one thread, which is ghastly slow.
[jira] [Created] (ARROW-7664) [C++] Extract localfs default from FileSystemFromUri
Ben Kietzman created ARROW-7664: --- Summary: [C++] Extract localfs default from FileSystemFromUri Key: ARROW-7664 URL: https://issues.apache.org/jira/browse/ARROW-7664 Project: Apache Arrow Issue Type: Improvement Components: C++ Affects Versions: 0.15.1 Reporter: Ben Kietzman Assignee: Antoine Pitrou Fix For: 1.0.0 [https://github.com/apache/arrow/pull/6257#pullrequestreview-347506792] The argument to FileSystemFromUri should always be RFC 3986-formatted. The current fallback to localfs can be recovered by adding {{static string Uri::FromPath(string)}} which wraps [uriWindowsFilenameToUriStringA|https://uriparser.github.io/doc/api/latest/Uri_8h.html#a422dc4a2b979ad380a4dfe007e3de845] and the corresponding unix path function. {code:java} FileSystemFromUri(Uri::FromPath(R"(E:\dir\file.txt)"), ) {code} This is a little more boilerplate but I think it's worthwhile to be explicit here.
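As a point of comparison for the proposed Uri::FromPath, Python's standard library already exposes the native-path-to-file-URI half of this conversion (this is an analogy, not the C++ API under discussion):

```python
from pathlib import PurePosixPath, PureWindowsPath

# Turn a native filesystem path into an RFC 8089 / RFC 3986 file URI,
# which is the explicit form a FileSystemFromUri-style entry point wants.
assert PureWindowsPath(r"E:\dir\file.txt").as_uri() == "file:///E:/dir/file.txt"
assert PurePosixPath("/tmp/file.txt").as_uri() == "file:///tmp/file.txt"
```

Requiring callers to perform this conversion up front, rather than guessing whether a string is a path or a URI, is exactly the explicitness the issue argues for.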
Re: [DISCUSS] Format additions for encoding/compression
On Thu, Jan 23, 2020 at 12:42 PM John Muehlhausen wrote: > > Again, I know very little about Parquet, so your patience is appreciated. > > At the moment I can Arrow/mmap a file without having anywhere nearly as > much available memory as the file size. I can visit random place in the > file (such as a binary search if it is ordered) and only the locations > visited by column->Value(i) are paged in. Paging them out happens without > my awareness, if necessary. > > Does Parquet cover this use-case with the same elegance and at least equal > efficiency, or are there more copies/conversions? Perhaps it requires the > entire file to be transformed into Arrow memory at the beginning? Or on a > batch/block basis? Or to get this I need to use a non-Arrow API for data > element access? Etc. Data has to be materialized / deserialized from the Parquet file on a batch-wise per-column basis. The APIs we provide allow batches of values to be read for a given subset of columns > > IFF it covers the above use-case, which does not mention compression or > encoding, then I could consider whether it is interesting on those points. My point really has to do with Parquet's design which is about reducing file size. In the following blog post https://ursalabs.org/blog/2019-10-columnar-perf/ I examined a dataset which is about 4GB as raw Arrow stream/file but only 114 MB as a Parquet file. A 30+X compression ratio is a huge deal if you are working with filesystems that yield < 500MB/s (which includes pretty much all cloud filesystems AFAIK). In clickstream analytics this kind of compression ratio is not unusual. > > -John > > On Thu, Jan 23, 2020 at 12:06 PM Francois Saint-Jacques < > fsaintjacq...@gmail.com> wrote: > > > What's the point of having zero copy if the OS is doing the > > decompression in kernel (which trumps the zero-copy argument)? You > > might as well just use parquet without filesystem compression. 
I > > prefer to have compression algorithm where the columnar engine can > > benefit from it [1] than marginally improving a file-system-os > > specific feature. > > > > François > > > > [1] Section 4.3 http://db.csail.mit.edu/pubs/abadi-column-stores.pdf > > > > > > > > > > On Thu, Jan 23, 2020 at 12:43 PM John Muehlhausen wrote: > > > > > > This could also have utility in memory via things like zram/zswap, right? > > > Mac also has a memory compressor? > > > > > > I don't think Parquet is an option for me unless the integration with > > Arrow > > > is tighter than I imagine (i.e. zero-copy). That said, I confess I know > > > next to nothing about Parquet. > > > > > > On Thu, Jan 23, 2020 at 11:23 AM Antoine Pitrou > > wrote: > > > > > > > > > > > > Le 23/01/2020 à 18:16, John Muehlhausen a écrit : > > > > > Perhaps related to this thread, are there any current or proposed > > tools > > > to > > > > > transform columns for fixed-length data types according to a > > "shuffle?" > > > > > For precedent see the implementation of the shuffle filter in hdf5. > > > > > > > > > > https://support.hdfgroup.org/ftp/HDF5//documentation/doc1.6/TechNotes/shuffling-algorithm-report.pdf > > > > > > > > > > For example, the column (length 3) would store bytes 00 00 00 00 00 > > 00 > > > 00 > > > > > 00 00 01 02 03 to represent the three 32-bit numbers 00 00 00 01 00 > > 00 > > > 00 > > > > > 02 00 00 00 03 (I'm writing big-endian even if that is not actually > > the > > > > > case). > > > > > > > > > > Value(1) would return 00 00 00 02 by referring to some metadata flag > > > that > > > > > the column is shuffled, stitching the bytes back together at call > > time. > > > > > > > > > > Thus if the column pages were backed by a memory map to something > > like > > > > > zfs/gzip-9 (my actual use-case), one would expect approx 30% savings > > in > > > > > underlying disk usage due to better run lengths. > > > > > > > > > > It would enable a space/time tradeoff that could be useful? 
The > > > filesystem > > > > > itself cannot easily do this particular compression transform since > > it > > > > > benefits from knowing the shape of the data. > > > > > > > > For the record, there's a pull request adding this encoding to the > > > > Parquet C++ specification. > > > > > > > > Regards > > > > > > > > Antoine. > >
Re: [DISCUSS] Format additions for encoding/compression
Again, I know very little about Parquet, so your patience is appreciated. At the moment I can Arrow/mmap a file without having anywhere nearly as much available memory as the file size. I can visit random places in the file (such as a binary search if it is ordered) and only the locations visited by column->Value(i) are paged in. Paging them out happens without my awareness, if necessary. Does Parquet cover this use-case with the same elegance and at least equal efficiency, or are there more copies/conversions? Perhaps it requires the entire file to be transformed into Arrow memory at the beginning? Or on a batch/block basis? Or to get this I need to use a non-Arrow API for data element access? Etc. IFF it covers the above use-case, which does not mention compression or encoding, then I could consider whether it is interesting on those points. -John On Thu, Jan 23, 2020 at 12:06 PM Francois Saint-Jacques < fsaintjacq...@gmail.com> wrote: > What's the point of having zero copy if the OS is doing the > decompression in kernel (which trumps the zero-copy argument)? You > might as well just use parquet without filesystem compression. I > prefer to have compression algorithm where the columnar engine can > benefit from it [1] than marginally improving a file-system-os > specific feature. > > François > > [1] Section 4.3 http://db.csail.mit.edu/pubs/abadi-column-stores.pdf > > > > > On Thu, Jan 23, 2020 at 12:43 PM John Muehlhausen wrote: > > > > This could also have utility in memory via things like zram/zswap, right? > > Mac also has a memory compressor? > > > > I don't think Parquet is an option for me unless the integration with > Arrow > > is tighter than I imagine (i.e. zero-copy). That said, I confess I know > > next to nothing about Parquet. 
> > > > On Thu, Jan 23, 2020 at 11:23 AM Antoine Pitrou > wrote: > > > > > > > > > Le 23/01/2020 à 18:16, John Muehlhausen a écrit : > > > > Perhaps related to this thread, are there any current or proposed > tools > > to > > > > transform columns for fixed-length data types according to a > "shuffle?" > > > > For precedent see the implementation of the shuffle filter in hdf5. > > > > > > > https://support.hdfgroup.org/ftp/HDF5//documentation/doc1.6/TechNotes/shuffling-algorithm-report.pdf > > > > > > > > For example, the column (length 3) would store bytes 00 00 00 00 00 > 00 > > 00 > > > > 00 00 01 02 03 to represent the three 32-bit numbers 00 00 00 01 00 > 00 > > 00 > > > > 02 00 00 00 03 (I'm writing big-endian even if that is not actually > the > > > > case). > > > > > > > > Value(1) would return 00 00 00 02 by referring to some metadata flag > > that > > > > the column is shuffled, stitching the bytes back together at call > time. > > > > > > > > Thus if the column pages were backed by a memory map to something > like > > > > zfs/gzip-9 (my actual use-case), one would expect approx 30% savings > in > > > > underlying disk usage due to better run lengths. > > > > > > > > It would enable a space/time tradeoff that could be useful? The > > filesystem > > > > itself cannot easily do this particular compression transform since > it > > > > benefits from knowing the shape of the data. > > > > > > For the record, there's a pull request adding this encoding to the > > > Parquet C++ specification. > > > > > > Regards > > > > > > Antoine. >
[jira] [Created] (ARROW-7663) from_pandas gives TypeError instead of ArrowTypeError in some cases
David Li created ARROW-7663: --- Summary: from_pandas gives TypeError instead of ArrowTypeError in some cases Key: ARROW-7663 URL: https://issues.apache.org/jira/browse/ARROW-7663 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.15.1 Reporter: David Li from_pandas sometimes raises a TypeError with an uninformative error message rather than an ArrowTypeError with the full, informative type error for mixed-type array columns: {noformat} >>> pa.Table.from_pandas(pd.DataFrame({"a": ['a', 1]})) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "pyarrow/table.pxi", line 1177, in pyarrow.lib.Table.from_pandas File "/Users/lidavidm/Flight/arrow/build/python/lib.macosx-10.12-x86_64-3.7/pyarrow/pandas_compat.py", line 575, in dataframe_to_arrays for c, f in zip(columns_to_convert, convert_fields)] File "/Users/lidavidm/Flight/arrow/build/python/lib.macosx-10.12-x86_64-3.7/pyarrow/pandas_compat.py", line 575, in <listcomp> for c, f in zip(columns_to_convert, convert_fields)] File "/Users/lidavidm/Flight/arrow/build/python/lib.macosx-10.12-x86_64-3.7/pyarrow/pandas_compat.py", line 566, in convert_column raise e File "/Users/lidavidm/Flight/arrow/build/python/lib.macosx-10.12-x86_64-3.7/pyarrow/pandas_compat.py", line 560, in convert_column result = pa.array(col, type=type_, from_pandas=True, safe=safe) File "pyarrow/array.pxi", line 265, in pyarrow.lib.array File "pyarrow/array.pxi", line 80, in pyarrow.lib._ndarray_to_array File "pyarrow/error.pxi", line 107, in pyarrow.lib.check_status pyarrow.lib.ArrowTypeError: ("Expected a bytes object, got a 'int' object", 'Conversion failed for column a with type object') >>> pa.Table.from_pandas(pd.DataFrame({"a": [1, 'a']})) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "pyarrow/table.pxi", line 1177, in pyarrow.lib.Table.from_pandas File "/Users/lidavidm/Flight/arrow/build/python/lib.macosx-10.12-x86_64-3.7/pyarrow/pandas_compat.py", line 575, in dataframe_to_arrays for c, f in 
zip(columns_to_convert, convert_fields)] File "/Users/lidavidm/Flight/arrow/build/python/lib.macosx-10.12-x86_64-3.7/pyarrow/pandas_compat.py", line 575, in <listcomp> for c, f in zip(columns_to_convert, convert_fields)] File "/Users/lidavidm/Flight/arrow/build/python/lib.macosx-10.12-x86_64-3.7/pyarrow/pandas_compat.py", line 560, in convert_column result = pa.array(col, type=type_, from_pandas=True, safe=safe) File "pyarrow/array.pxi", line 265, in pyarrow.lib.array File "pyarrow/array.pxi", line 80, in pyarrow.lib._ndarray_to_array TypeError: an integer is required (got type str) {noformat} Noticed on 0.15.1 and on master when we tried to upgrade. -- This message was sent by Atlassian Jira (v8.3.4#803005)
Re: [DISCUSS] Format additions for encoding/compression
Parquet is most relevant in scenarios where filesystem IO is constrained (spinning rust HDD, network FS, cloud storage / S3 / GCS). For those use cases memory-mapped Arrow is not viable. Against local NVMe (> 2000 MB/s read throughput) your mileage may vary. On Thu, Jan 23, 2020 at 12:06 PM Francois Saint-Jacques wrote: > > What's the point of having zero copy if the OS is doing the > decompression in kernel (which trumps the zero-copy argument)? You > might as well just use parquet without filesystem compression. I > prefer to have compression algorithm where the columnar engine can > benefit from it [1] than marginally improving a file-system-os > specific feature. > > François > > [1] Section 4.3 http://db.csail.mit.edu/pubs/abadi-column-stores.pdf > > > > > On Thu, Jan 23, 2020 at 12:43 PM John Muehlhausen wrote: > > > > This could also have utility in memory via things like zram/zswap, right? > > Mac also has a memory compressor? > > > > I don't think Parquet is an option for me unless the integration with Arrow > > is tighter than I imagine (i.e. zero-copy). That said, I confess I know > > next to nothing about Parquet. > > > > On Thu, Jan 23, 2020 at 11:23 AM Antoine Pitrou wrote: > > > > > > > > > Le 23/01/2020 à 18:16, John Muehlhausen a écrit : > > > > Perhaps related to this thread, are there any current or proposed tools > > to > > > > transform columns for fixed-length data types according to a "shuffle?" > > > > For precedent see the implementation of the shuffle filter in hdf5. > > > > > > https://support.hdfgroup.org/ftp/HDF5//documentation/doc1.6/TechNotes/shuffling-algorithm-report.pdf > > > > > > > > For example, the column (length 3) would store bytes 00 00 00 00 00 00 > > 00 > > > > 00 00 01 02 03 to represent the three 32-bit numbers 00 00 00 01 00 00 > > 00 > > > > 02 00 00 00 03 (I'm writing big-endian even if that is not actually the > > > > case). 
> > > > > > > > Value(1) would return 00 00 00 02 by referring to some metadata flag > > that > > > > the column is shuffled, stitching the bytes back together at call time. > > > > > > > > Thus if the column pages were backed by a memory map to something like > > > > zfs/gzip-9 (my actual use-case), one would expect approx 30% savings in > > > > underlying disk usage due to better run lengths. > > > > > > > > It would enable a space/time tradeoff that could be useful? The > > filesystem > > > > itself cannot easily do this particular compression transform since it > > > > benefits from knowing the shape of the data. > > > > > > For the record, there's a pull request adding this encoding to the > > > Parquet C++ specification. > > > > > > Regards > > > > > > Antoine.
Re: [DISCUSS] Format additions for encoding/compression
This could also have utility in memory via things like zram/zswap, right? Mac also has a memory compressor? I don't think Parquet is an option for me unless the integration with Arrow is tighter than I imagine (i.e. zero-copy). That said, I confess I know next to nothing about Parquet. On Thu, Jan 23, 2020 at 11:23 AM Antoine Pitrou wrote: > > > Le 23/01/2020 à 18:16, John Muehlhausen a écrit : > > Perhaps related to this thread, are there any current or proposed tools to > > transform columns for fixed-length data types according to a "shuffle?" > > For precedent see the implementation of the shuffle filter in hdf5. > > https://support.hdfgroup.org/ftp/HDF5//documentation/doc1.6/TechNotes/shuffling-algorithm-report.pdf > > > > For example, the column (length 3) would store bytes 00 00 00 00 00 00 00 > > 00 00 01 02 03 to represent the three 32-bit numbers 00 00 00 01 00 00 00 > > 02 00 00 00 03 (I'm writing big-endian even if that is not actually the > > case). > > > > Value(1) would return 00 00 00 02 by referring to some metadata flag that > > the column is shuffled, stitching the bytes back together at call time. > > > > Thus if the column pages were backed by a memory map to something like > > zfs/gzip-9 (my actual use-case), one would expect approx 30% savings in > > underlying disk usage due to better run lengths. > > > > It would enable a space/time tradeoff that could be useful? The filesystem > > itself cannot easily do this particular compression transform since it > > benefits from knowing the shape of the data. > > For the record, there's a pull request adding this encoding to the > Parquet C++ specification. > > Regards > > Antoine.
Re: [DISCUSS] Format additions for encoding/compression
Forgot to give the URL: https://github.com/apache/arrow/pull/6005 Regards Antoine. Le 23/01/2020 à 18:23, Antoine Pitrou a écrit : > > Le 23/01/2020 à 18:16, John Muehlhausen a écrit : >> Perhaps related to this thread, are there any current or proposed tools to >> transform columns for fixed-length data types according to a "shuffle?" >> For precedent see the implementation of the shuffle filter in hdf5. >> https://support.hdfgroup.org/ftp/HDF5//documentation/doc1.6/TechNotes/shuffling-algorithm-report.pdf >> >> For example, the column (length 3) would store bytes 00 00 00 00 00 00 00 >> 00 00 01 02 03 to represent the three 32-bit numbers 00 00 00 01 00 00 00 >> 02 00 00 00 03 (I'm writing big-endian even if that is not actually the >> case). >> >> Value(1) would return 00 00 00 02 by referring to some metadata flag that >> the column is shuffled, stitching the bytes back together at call time. >> >> Thus if the column pages were backed by a memory map to something like >> zfs/gzip-9 (my actual use-case), one would expect approx 30% savings in >> underlying disk usage due to better run lengths. >> >> It would enable a space/time tradeoff that could be useful? The filesystem >> itself cannot easily do this particular compression transform since it >> benefits from knowing the shape of the data. > > For the record, there's a pull request adding this encoding to the > Parquet C++ specification. > > Regards > > Antoine. >
Re: [DISCUSS] Format additions for encoding/compression
Le 23/01/2020 à 18:16, John Muehlhausen a écrit : > Perhaps related to this thread, are there any current or proposed tools to > transform columns for fixed-length data types according to a "shuffle?" > For precedent see the implementation of the shuffle filter in hdf5. > https://support.hdfgroup.org/ftp/HDF5//documentation/doc1.6/TechNotes/shuffling-algorithm-report.pdf > > For example, the column (length 3) would store bytes 00 00 00 00 00 00 00 > 00 00 01 02 03 to represent the three 32-bit numbers 00 00 00 01 00 00 00 > 02 00 00 00 03 (I'm writing big-endian even if that is not actually the > case). > > Value(1) would return 00 00 00 02 by referring to some metadata flag that > the column is shuffled, stitching the bytes back together at call time. > > Thus if the column pages were backed by a memory map to something like > zfs/gzip-9 (my actual use-case), one would expect approx 30% savings in > underlying disk usage due to better run lengths. > > It would enable a space/time tradeoff that could be useful? The filesystem > itself cannot easily do this particular compression transform since it > benefits from knowing the shape of the data. For the record, there's a pull request adding this encoding to the Parquet C++ specification. Regards Antoine.
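The shuffle discussed in the quoted message is just a transpose of the value-bytes matrix, which numpy makes easy to demonstrate. This is an illustration of the HDF5-style shuffle filter, not an Arrow API; big-endian is used to match the 00 00 00 01 / 00 00 00 02 / 00 00 00 03 example:

```python
import numpy as np

values = np.array([1, 2, 3], dtype=">u4")  # 00000001 00000002 00000003

# Shuffle: group byte 0 of every value, then byte 1, etc.
raw = values.view(np.uint8).reshape(len(values), values.itemsize)
shuffled = np.ascontiguousarray(raw.T).ravel()
print(shuffled.tobytes().hex())  # nine 00 bytes followed by 01 02 03

# Unshuffle and read Value(1): stitch the bytes back together at call time.
unshuffled = np.ascontiguousarray(
    shuffled.reshape(values.itemsize, len(values)).T)
restored = unshuffled.ravel().view(">u4")
print(restored[1])  # 2
```

The long runs of identical bytes in the shuffled form are what a block compressor (or a compressing filesystem like the zfs/gzip-9 case above) can exploit.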
Re: [DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)
Perhaps related to this thread, are there any current or proposed tools to transform columns for fixed-length data types according to a "shuffle?" For precedent see the implementation of the shuffle filter in hdf5. https://support.hdfgroup.org/ftp/HDF5//documentation/doc1.6/TechNotes/shuffling-algorithm-report.pdf For example, the column (length 3) would store bytes 00 00 00 00 00 00 00 00 00 01 02 03 to represent the three 32-bit numbers 00 00 00 01 00 00 00 02 00 00 00 03 (I'm writing big-endian even if that is not actually the case). Value(1) would return 00 00 00 02 by referring to some metadata flag that the column is shuffled, stitching the bytes back together at call time. Thus if the column pages were backed by a memory map to something like zfs/gzip-9 (my actual use-case), one would expect approx 30% savings in underlying disk usage due to better run lengths. It would enable a space/time tradeoff that could be useful? The filesystem itself cannot easily do this particular compression transform since it benefits from knowing the shape of the data. -John On Sun, Aug 25, 2019 at 10:30 PM Micah Kornfield wrote: > > Hi Ippokratis, > Thank you for the feedback, I have some questions based on the links you > provided. > > > > I think that lightweight encodings (like the FrameOfReference Micah > > suggests) do make a lot of sense for Arrow. There are a few implementations > > of those in commercial systems. One related paper in the literature is > > http://www.cs.columbia.edu/~orestis/damon15.pdf > > > This paper seems to suggest more complex encodings than I was imagining for > the first implementation. Specifically, I proposed using only codes that > are 2^N bits (8, 16, 32, and 64). Do you think it is critical to have > the dense bit-packing in an initial version? > > > > > I would actually also look into some order-preserving dictionary encodings > > for strings that also allow vectorized processing (predicates, joins, ..) > > on encoded data, e.g. 
see > > https://15721.courses.cs.cmu.edu/spring2017/papers/11-compression/p283-binnig.pdf > > . > > The IPC spec [1] already has some metadata about the ordering of > dictionaries, but this might not be sufficient. The paper linked here > seems to recommend two things: > 1. Treating dictionaries as explicit mappings between value and integer > code; today this is implicit because the dictionaries are lists indexed by > code. It seems like for forward-compatibility we should add a type enum to > the Dictionary Encoding metadata. > 2. Adding indexes to the dictionaries. For this, did you imagine the > indexes would be transferred or something built up on receiving batches? > > Arrow can be used during shuffles for distributed joins/aggs, and being > > able to operate on encoded data yields benefits (e.g. > > http://www.vldb.org/pvldb/vol7/p1355-lee.pdf). > > The main take-away I got after skimming this paper, as it relates to > encodings, is that encodings (including dictionary) should be dynamic per > batch. Another interesting question it raises with respect to Arrow is that > one of the techniques used is delta-encoding. I believe delta encoding > requires linear time access. The dense representations in Arrow were > designed to have constant time access to elements. One open question is how > far we want to relax this requirement for encoded columns. My proposal > uses a form of RLE that provides O(log(N)) access. > > Cheers, > Micah > > [1] https://github.com/apache/arrow/blob/master/format/Schema.fbs#L285 > > On Sun, Aug 25, 2019 at 12:03 AM Ippokratis Pandis > wrote: > > > I think that lightweight encodings (like the FrameOfReference Micah > > suggests) do make a lot of sense for Arrow. There are a few implementations > > of those in commercial systems. 
One related paper in the literature is > > http://www.cs.columbia.edu/~orestis/damon15.pdf > > > > I would actually also look into some order-preserving dictionary encodings > > for strings that also allow vectorized processing (predicates, joins, ..) > > on encoded data, e.g. see > > https://15721.courses.cs.cmu.edu/spring2017/papers/11-compression/p283-binnig.pdf > > . > > > > Arrow can be used during shuffles for distributed joins/aggs, and being > > able to operate on encoded data yields benefits (e.g. > > http://www.vldb.org/pvldb/vol7/p1355-lee.pdf). > > > > Thanks, > > -Ippokratis. > > > > > > On Thu, Jul 25, 2019 at 11:06 PM Micah Kornfield > > wrote: > > > >> > > >> > It's not just computation libraries, it's any library peeking inside > >> > Arrow data. Currently, the Arrow data types are simple, which makes it > >> > easy and non-intimidating to build data processing utilities around > >> > them. If we start adding sophisticated encodings, we also raise the > >> > cost of supporting Arrow for third-party libraries. > >> > >> > >> This is another legitimate concern about complexity. > >> > >> To try to limit complexity, I simplified the proposal PR [1] to only have >
[NIGHTLY] Arrow Build Report for Job nightly-2020-01-23-0
Arrow Build Report for Job nightly-2020-01-23-0 All tasks: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0 Failed Tasks: - conda-win-vs2015-py36: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0-azure-conda-win-vs2015-py36 - conda-win-vs2015-py37: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0-azure-conda-win-vs2015-py37 - conda-win-vs2015-py38: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0-azure-conda-win-vs2015-py38 - gandiva-jar-osx: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0-travis-gandiva-jar-osx - test-conda-python-3.7-spark-master: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0-circle-test-conda-python-3.7-spark-master - test-ubuntu-fuzzit-fuzzing: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0-circle-test-ubuntu-fuzzit-fuzzing - test-ubuntu-fuzzit-regression: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0-circle-test-ubuntu-fuzzit-regression - wheel-osx-cp27m: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0-travis-wheel-osx-cp27m - wheel-osx-cp35m: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0-travis-wheel-osx-cp35m - wheel-osx-cp36m: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0-travis-wheel-osx-cp36m - wheel-osx-cp37m: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0-travis-wheel-osx-cp37m - wheel-osx-cp38: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0-travis-wheel-osx-cp38 Succeeded Tasks: - centos-6: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0-azure-centos-6 - centos-7: URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0-azure-centos-7 - centos-8: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0-azure-centos-8 - conda-linux-gcc-py27: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0-azure-conda-linux-gcc-py27 - conda-linux-gcc-py36: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0-azure-conda-linux-gcc-py36 - conda-linux-gcc-py37: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0-azure-conda-linux-gcc-py37 - conda-linux-gcc-py38: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0-azure-conda-linux-gcc-py38 - conda-osx-clang-py27: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0-azure-conda-osx-clang-py27 - conda-osx-clang-py36: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0-azure-conda-osx-clang-py36 - conda-osx-clang-py37: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0-azure-conda-osx-clang-py37 - conda-osx-clang-py38: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0-azure-conda-osx-clang-py38 - debian-buster: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0-azure-debian-buster - debian-stretch: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0-azure-debian-stretch - gandiva-jar-trusty: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0-travis-gandiva-jar-trusty - homebrew-cpp: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0-travis-homebrew-cpp - macos-r-autobrew: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0-travis-macos-r-autobrew - test-conda-cpp: URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0-circle-test-conda-cpp - test-conda-python-2.7-pandas-latest: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0-circle-test-conda-python-2.7-pandas-latest - test-conda-python-2.7: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0-circle-test-conda-python-2.7 - test-conda-python-3.6: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0-circle-test-conda-python-3.6 - test-conda-python-3.7-dask-latest: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0-circle-test-conda-python-3.7-dask-latest - test-conda-python-3.7-hdfs-2.9.2: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0-circle-test-conda-python-3.7-hdfs-2.9.2 - test-conda-python-3.7-pandas-latest: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-01-23-0-circle-test-conda-python-3.7-pandas-latest -
[jira] [Created] (ARROW-7662) Support for auto-inferring list column->array in write_parquet
Michael Chirico created ARROW-7662: -- Summary: Support for auto-inferring list column->array in write_parquet Key: ARROW-7662 URL: https://issues.apache.org/jira/browse/ARROW-7662 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Michael Chirico {code:r} DF = data.frame(a = 1:10) DF$b = as.list(DF$a) arrow::write_parquet(DF, 'test.parquet') # Error in Table__from_dots(dots, schema) : cannot infer type from data {code} This appears to be supported naturally already in Python: {code:python} import pandas as pd pd.DataFrame({'a': [1, 2, 3], 'b': [[1, 2], [3, 4], [5, 6]]}).to_parquet('test.parquet') {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7660) [C++][Gandiva] Optimise castVarchar(string, int) function for single byte characters
Projjal Chanda created ARROW-7660: - Summary: [C++][Gandiva] Optimise castVarchar(string, int) function for single byte characters Key: ARROW-7660 URL: https://issues.apache.org/jira/browse/ARROW-7660 Project: Apache Arrow Issue Type: Task Components: C++ - Gandiva Reporter: Projjal Chanda Assignee: Projjal Chanda The current castVarchar function does a byte-by-byte check to handle multibyte characters. Since strings consist of single-byte characters most of the time, optimise for that case and fall back to the slow path only when multibyte characters are detected. -- This message was sent by Atlassian Jira (v8.3.4#803005)
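The fast-path/slow-path split the ticket describes can be sketched in Python. The real implementation is Gandiva C++; the function name and shape here are illustrative only, truncating UTF-8 data to at most n characters:

```python
def cast_varchar(s: bytes, n: int) -> bytes:
    """Truncate UTF-8 bytes to at most n characters (illustrative sketch)."""
    # Fast path: pure ASCII, so one byte == one character and a slice suffices.
    if all(b < 0x80 for b in s):
        return s[:n]
    # Slow path: walk the bytes, honouring multi-byte UTF-8 sequences.
    i = count = 0
    while i < len(s) and count < n:
        b = s[i]
        if b < 0x80:            # 1-byte character
            i += 1
        elif b >> 5 == 0b110:   # 2-byte character
            i += 2
        elif b >> 4 == 0b1110:  # 3-byte character
            i += 3
        else:                   # 4-byte character
            i += 4
        count += 1
    return s[:i]

print(cast_varchar(b"hello world", 5))    # b'hello'
print(cast_varchar("héllo".encode(), 3))  # b'h\xc3\xa9l'
```

The point of the optimisation is that the common all-ASCII case never enters the per-character loop.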