[jira] [Created] (ARROW-2024) Remove global SerializationContext variables.
Robert Nishihara created ARROW-2024:

Summary: Remove global SerializationContext variables.
Key: ARROW-2024
URL: https://issues.apache.org/jira/browse/ARROW-2024
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Reporter: Robert Nishihara

We should get rid of the global variables _default_serialization_context and pandas_serialization_context and replace them with the functions default_serialization_context() and pandas_serialization_context(). This will also make `import pyarrow` faster.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
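The change proposed above — constructing the context lazily inside a function instead of eagerly at module import — might look like this sketch (the names and internals here are illustrative, not pyarrow's actual code):

```python
# Hypothetical sketch: replace a module-level global built at import
# time with a lazily-constructed, memoized factory function, so that
# `import pyarrow` no longer pays the construction cost up front.

_default_context = None  # built on first use instead of at import time

def default_serialization_context():
    """Return the shared context, constructing it on the first call."""
    global _default_context
    if _default_context is None:
        _default_context = _build_context()  # deferred work
    return _default_context

def _build_context():
    # Stand-in for real SerializationContext setup (type registration etc.)
    return {"registered_types": []}
```

Repeated calls return the same memoized object, so existing callers that shared the global keep sharing one context.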
[jira] [Created] (ARROW-2023) [C++] Test opening IPC stream reader or file reader on an empty InputStream
Wes McKinney created ARROW-2023:

Summary: [C++] Test opening IPC stream reader or file reader on an empty InputStream
Key: ARROW-2023
URL: https://issues.apache.org/jira/browse/ARROW-2023
Project: Apache Arrow
Issue Type: Bug
Components: C++
Reporter: Wes McKinney
Fix For: 0.9.0

This was reported to segfault in ARROW-1589.
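The defensive behavior the ticket asks to test — a reader constructor that fails cleanly rather than crashing on an empty input — can be illustrated with a minimal Python analog (this is a stand-in sketch, not Arrow's C++ reader; the magic constant and checks are simplified):

```python
# Python analog of the check ARROW-2023 wants tested: opening a reader
# on an empty stream should raise a clear error, never segfault.
import io

ARROW_STREAM_MAGIC = b"ARROW1"  # illustrative; the real IPC framing differs

class InvalidStreamError(Exception):
    pass

def open_stream_reader(source):
    header = source.read(len(ARROW_STREAM_MAGIC))
    if not header:
        raise InvalidStreamError("input stream is empty")
    if header != ARROW_STREAM_MAGIC:
        raise InvalidStreamError("input does not look like an Arrow stream")
    return source  # a real reader would go on to parse schema + messages
```

The corresponding unit test would assert that an exception (not a crash) results from `open_stream_reader(io.BytesIO(b""))`.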
Re: Filters on Arrow record batch
hi Animesh -- it does not yet, but the idea has come up on occasion. You are welcome to propose additions to the format for including statistics in a stream of record batch messages (these could possibly be embedded in the main RecordBatch metadata or sent as a separate message).

As an aside, I just opened https://issues.apache.org/jira/browse/ARROW-2022 to think about the idea of sending along arbitrary extra metadata with a record batch message.

- Wes

On Sat, Jan 20, 2018 at 6:07 AM, Animesh Trivedi wrote:
> Hi all,
>
> Is it possible to have push-down filters on Arrow record batches while
> reading data in? Something like what Parquet has.
>
> Does Arrow maintain any per-batch statistics?
>
> Thanks
> --
> Animesh
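The filter-pushdown idea discussed in this thread can be sketched quickly: if each record batch carried min/max statistics, a reader could skip batches whose range cannot match the filter. Arrow has no such statistics field (that is what the thread is about), so the sketch below models batches as plain Python dicts:

```python
# Sketch of per-batch statistics enabling filter pushdown. Batches are
# modeled as plain dicts; this is not an Arrow API, just the idea.

def batch_stats(values):
    """Per-batch min/max, the kind of statistic Parquet keeps per row group."""
    return {"min": min(values), "max": max(values)}

def batches_matching(batches, lo, hi):
    """Yield only batches whose [min, max] range can overlap [lo, hi]."""
    for batch in batches:
        s = batch["stats"]
        if s["max"] < lo or s["min"] > hi:
            continue  # whole batch is outside the filter; skip reading it
        yield batch

batches = [
    {"values": [1, 3, 5], "stats": batch_stats([1, 3, 5])},
    {"values": [40, 50], "stats": batch_stats([40, 50])},
]
survivors = list(batches_matching(batches, 0, 10))  # second batch is skipped
```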
[jira] [Created] (ARROW-2022) [Format] Add custom metadata field specific to a RecordBatch message
Wes McKinney created ARROW-2022:

Summary: [Format] Add custom metadata field specific to a RecordBatch message
Key: ARROW-2022
URL: https://issues.apache.org/jira/browse/ARROW-2022
Project: Apache Arrow
Issue Type: Improvement
Components: Format
Reporter: Wes McKinney

While we can have schema- and field-level custom metadata, we cannot send metadata at the record batch level. This could include things like statistics (although statistics isn't a great example, because that may be something we eventually want to standardize), but other things too. See the message definitions in https://github.com/apache/arrow/blob/master/format/Message.fbs
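One possible shape for the addition, mirroring the `custom_metadata` key/value lists that Schema and Field already carry, is sketched below. This is only an illustration of the idea; the actual field name and placement would be decided in design review:

```
// Sketch only: a possible extension to format/Message.fbs adding
// per-batch key/value metadata, mirroring the existing custom_metadata
// on Schema and Field. Not an adopted definition.
table RecordBatch {
  length: long;
  nodes: [FieldNode];
  buffers: [Buffer];
  custom_metadata: [KeyValue];   // proposed addition
}
```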
[jira] [Created] (ARROW-2021) Reduce Travis CI flakiness due to apt connectivity problems
Wes McKinney created ARROW-2021:

Summary: Reduce Travis CI flakiness due to apt connectivity problems
Key: ARROW-2021
URL: https://issues.apache.org/jira/browse/ARROW-2021
Project: Apache Arrow
Issue Type: Improvement
Reporter: Wes McKinney

We have been experiencing periodic apt flakiness in Travis CI. See discussion in https://github.com/apache/arrow/pull/1481#issuecomment-359993584
[jira] [Created] (ARROW-2020) pyarrow: Parquet segfaults if coercing ns timestamps and writing 96-bit timestamps
Yiannis Liodakis created ARROW-2020:

Summary: pyarrow: Parquet segfaults if coercing ns timestamps and writing 96-bit timestamps
Key: ARROW-2020
URL: https://issues.apache.org/jira/browse/ARROW-2020
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.8.0
Environment: OS: Mac OS X 10.13.2, Python: 3.6.4, PyArrow: 0.8.0
Reporter: Yiannis Liodakis
Attachments: crash-report.txt

If you try to write a PyArrow table containing nanosecond-resolution timestamps to Parquet using `coerce_timestamps` and `use_deprecated_int96_timestamps=True`, the Arrow library will segfault. The crash doesn't happen if you don't coerce the timestamp resolution or if you don't use 96-bit timestamps.

*To Reproduce:*

{code:python}
import datetime

import pyarrow
from pyarrow import parquet

schema = pyarrow.schema([
    pyarrow.field('last_updated', pyarrow.timestamp('ns')),
])

data = [
    pyarrow.array([datetime.datetime.now()], pyarrow.timestamp('ns')),
]

table = pyarrow.Table.from_arrays(data, ['last_updated'])

with open('test_file.parquet', 'wb') as fdesc:
    parquet.write_table(table, fdesc,
                        coerce_timestamps='us',  # 'ms' works too
                        use_deprecated_int96_timestamps=True)
{code}

See the attached file for the crash report.
[jira] [Created] (ARROW-2019) Control the memory allocated for inner vector in LIST
Siddharth Teotia created ARROW-2019:

Summary: Control the memory allocated for inner vector in LIST
Key: ARROW-2019
URL: https://issues.apache.org/jira/browse/ARROW-2019
Project: Apache Arrow
Issue Type: Improvement
Reporter: Siddharth Teotia
Assignee: Siddharth Teotia

We have observed cases in our external sort code where the amount of memory actually allocated for a record batch turns out to be more than necessary, and more than what the operator had reserved for special purposes, so queries fail with OOM.

The usual way to control the memory allocated by vector.allocateNew() is to call setInitialCapacity() first; the latter modifies the vector state variables that are then used to allocate memory. However, due to the multiplier of 5 used in ListVector, we end up asking for more memory than necessary. For example, for a value count of 4095, we asked for 128KB of memory for the offset buffer of the VarCharVector backing a field that was a list of varchars: ((4095 * 5) + 1) * 4 bytes => ~80KB => 128KB (rounded up to a power-of-2 allocation).

We had earlier changed setInitialCapacity() of ListVector when we were facing problems with deeply nested lists, and decided to use the multiplier only for the leaf scalar vector. It looks like there is a need for a specialized setInitialCapacity() for ListVector where the caller dictates the repeatedness.

There is also another bug in setInitialCapacity(): the allocation of the validity buffer doesn't obey the capacity specified in setInitialCapacity().
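Working through the arithmetic in the report makes the over-allocation concrete (plain Python used here just to show the numbers; the actual code is Java's ListVector):

```python
# The report's example: value count 4095, ListVector multiplier 5,
# 4-byte offset entries, allocator rounds requests up to a power of two.
value_count = 4095
multiplier = 5        # repeatedness factor applied by ListVector
offset_width = 4      # bytes per offset entry

requested = ((value_count * multiplier) + 1) * offset_width  # ~80 KB

# Round up to the next power of two, as the allocator does.
allocated = 1
while allocated < requested:
    allocated *= 2    # lands on 128 KB

print(requested, allocated)
```

So a request of 81,904 bytes (~80 KB) becomes a 131,072-byte (128 KB) allocation, even though a value count of 4095 needed far less without the multiplier.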
[jira] [Created] (ARROW-2018) [C++] Build instruction on macOS and Homebrew is incomplete
yosuke shiro created ARROW-2018:

Summary: [C++] Build instruction on macOS and Homebrew is incomplete
Key: ARROW-2018
URL: https://issues.apache.org/jira/browse/ARROW-2018
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Affects Versions: 0.8.0
Reporter: yosuke shiro

I read https://github.com/apache/arrow/blob/master/cpp/README.md and followed this instruction:

{quote}On OS X, you can use [Homebrew|https://brew.sh/]:

brew update && brew bundle --file=c_glib/Brewfile{quote}

I got the following result:

{quote}% brew update && brew bundle --file=c_glib/Brewfile
Updated 3 taps (caskroom/cask, caskroom/versions, homebrew/core).
==> Updated Formulae
[list of updated formulae elided]
==> Tapping homebrew/bundle
Cloning into '/usr/local/Homebrew/Library/Taps/homebrew/homebrew-bundle'...
remote: Counting objects: 59, done.
remote: Compressing objects: 100% (53/53), done.
remote: Total 59 (delta 8), reused 13 (delta 3), pack-reused 0
Unpacking objects: 100% (59/59), done.
Tapped 0 formulae (130 files, 173.3KB)
Error: No Brewfile found{quote}

For the command to succeed, the README also needs to say that I must:
* clone the Apache Arrow repository first, and
* move to its top directory before running the command.