[jira] [Commented] (ARROW-3772) [C++] Read Parquet dictionary encoded ColumnChunks directly into an Arrow DictionaryArray
[ https://issues.apache.org/jira/browse/ARROW-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16887610#comment-16887610 ]

Micah Kornfield commented on ARROW-3772:
----------------------------------------

{quote}I'm looking at this. This is not a small project -- the assumption that values are fully materialized is pretty deeply baked into the library. We also have to deal with the "fallback" case where a column chunk starts out dictionary encoded and switches mid-stream because the dictionary got too big.{quote}

I don't have context on how we originally decided to designate an entire column as dictionary encoded, rather than a column of a chunk/record batch, but this seems like another use case where the proposal on encoding/compression might make things easier to code (i.e. specify dictionary encoding only on SparseRecordBatches where it makes sense, and fall back to dense encoding where it no longer makes sense).

> [C++] Read Parquet dictionary encoded ColumnChunks directly into an Arrow
> DictionaryArray
> -----------------------------------------------------------------------------------------
>
>                 Key: ARROW-3772
>                 URL: https://issues.apache.org/jira/browse/ARROW-3772
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Stav Nir
>            Assignee: Wes McKinney
>            Priority: Major
>              Labels: parquet
>             Fix For: 1.0.0
>
> Dictionary data is very common in Parquet. In the current implementation,
> parquet-cpp always decodes dictionary-encoded data before creating a plain
> Arrow array. This is wasteful, since we could use Arrow's DictionaryArray
> directly and gain several benefits:
> # Smaller memory footprint, both during decoding and in the resulting
> Arrow table, especially when the dictionary values are large.
> # Better decoding performance, mostly as a result of the first point:
> fewer memory fetches and fewer allocations.
> I think these benefits could add up to significant runtime improvements.
> My direction for the implementation is to read the indices (through the
> DictionaryDecoder, after the RLE decoding) and the values separately into
> two arrays, and create a DictionaryArray from them.
> Some questions to discuss:
> # Should this be the default behavior for dictionary-encoded data?
> # Should it be controlled by a parameter in the API?
> # What should the policy be when some chunks are dictionary encoded and
> some are not?
> I started implementing this but would like to hear your opinions.

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
[jira] [Commented] (ARROW-3772) [C++] Read Parquet dictionary encoded ColumnChunks directly into an Arrow DictionaryArray
[ https://issues.apache.org/jira/browse/ARROW-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16887577#comment-16887577 ]

Wes McKinney commented on ARROW-3772:
-------------------------------------

I'm looking at this. This is not a small project -- the assumption that values are fully materialized is pretty deeply baked into the library. We also have to deal with the "fallback" case where a column chunk starts out dictionary encoded and switches mid-stream because the dictionary got too big. What to do in that case is ambiguous:

* One option is to dictionary-encode the additional pages, so we could end up with one big dictionary.
* Another option is to optimistically leave things dictionary-encoded, and if we hit the fallback case then we fully materialize. We can always do a cast on the Arrow side after the fact in this case.

FWIW, the fallback scenario is not at all esoteric, because the default dictionary page-size limit in the C++ library is 1MB. I think Java is the same: https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L44

I think adding an option to raise the limit to 2GB or so when writing an Arrow DictionaryArray would help. Things are made a bit more complex by the code duplication between parquet/column_reader.cc and parquet/arrow/record_reader.cc. I'll see if there is something I can do to fix that while I'm working on this.

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
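The "fallback" behavior discussed above can be sketched in a few lines of Python. This is a conceptual illustration only, not Arrow or parquet-cpp API; the class name and byte limit are invented for the example. A writer dictionary-encodes pages until the accumulated dictionary exceeds a size limit, then switches to plain encoding for every subsequent page of the column chunk:

```python
# Conceptual sketch (hypothetical, not the Arrow/parquet-cpp API): a writer
# that dictionary-encodes pages until the dictionary exceeds a byte limit,
# then falls back to plain encoding -- mirroring Parquet's fallback behavior.

class FallbackDictWriter:
    def __init__(self, dict_size_limit):
        self.dict_size_limit = dict_size_limit
        self.dictionary = {}          # value -> dictionary index
        self.dict_bytes = 0           # approximate dictionary size in bytes
        self.fell_back = False
        self.pages = []               # (encoding, payload) tuples

    def write_page(self, values):
        if not self.fell_back:
            indices = []
            for v in values:
                if v not in self.dictionary:
                    self.dict_bytes += len(v)
                    if self.dict_bytes > self.dict_size_limit:
                        # Dictionary got too big: fall back to plain encoding
                        # for this page and every page after it.
                        self.fell_back = True
                        break
                    self.dictionary[v] = len(self.dictionary)
                indices.append(self.dictionary[v])
            if not self.fell_back:
                self.pages.append(("dictionary", indices))
                return
        self.pages.append(("plain", list(values)))

writer = FallbackDictWriter(dict_size_limit=8)
writer.write_page(["aa", "bb", "aa"])     # fits: dictionary-encoded
writer.write_page(["cc", "dd", "ee"])     # dictionary overflows: plain
print([enc for enc, _ in writer.pages])   # ['dictionary', 'plain']
```

This is why a reader that wants to return a DictionaryArray directly must decide what to do when a chunk contains a mix of dictionary-encoded and plain pages.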
[jira] [Updated] (ARROW-5974) read_csv returns truncated read for some valid gzip files
[ https://issues.apache.org/jira/browse/ARROW-5974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jordan Samuels updated ARROW-5974:
----------------------------------
    Affects Version/s: 0.13.0

Confirmed the same issue for 0.13.0.

> read_csv returns truncated read for some valid gzip files
> ----------------------------------------------------------
>
>                 Key: ARROW-5974
>                 URL: https://issues.apache.org/jira/browse/ARROW-5974
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.13.0, 0.14.0
>            Reporter: Jordan Samuels
>            Priority: Minor
>
> If two gzipped files are concatenated together, the result is a valid gzip
> file. However, it appears that pyarrow.csv.read_csv will only read the
> portion corresponding to the first file.
> If the repro script
> [here|https://gist.github.com/jordansamuels/d69f1c22c58418f5dfa0785b9ecd211e]
> is run, the output is:
> {{$ python repro.py}}
> {{pyarrow.csv only reads one row:}}
> {{   x}}
> {{0  1}}
> {{pandas reads two rows:}}
> {{   x}}
> {{0  1}}
> {{1  2}}
> {{pyarrow version: 0.14.0}}

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
[jira] [Created] (ARROW-5974) read_csv returns truncated read for some valid gzip files
Jordan Samuels created ARROW-5974:
-------------------------------------
             Summary: read_csv returns truncated read for some valid gzip files
                 Key: ARROW-5974
                 URL: https://issues.apache.org/jira/browse/ARROW-5974
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.14.0
            Reporter: Jordan Samuels

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
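The premise of the report -- that concatenating two gzip files yields a single valid gzip file -- follows from RFC 1952, which allows a gzip stream to consist of multiple members; a conforming reader should decode all of them. A minimal demonstration with Python's standard gzip module (independent of pyarrow):

```python
# Per RFC 1952, the concatenation of two gzip files is itself a valid gzip
# file, and a conforming reader decodes all members. Python's gzip module
# does; the report says pyarrow.csv.read_csv stops after the first member.
import gzip

part1 = gzip.compress(b"x\n1\n")   # first file: CSV header + one row
part2 = gzip.compress(b"2\n")      # second file: one more row
combined = part1 + part2           # still a valid gzip stream

print(gzip.decompress(combined).decode())  # x / 1 / 2 -- both rows survive
```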
[jira] [Updated] (ARROW-5973) [Java] Variable width vectors' get methods should return null when the underlying data is null
[ https://issues.apache.org/jira/browse/ARROW-5973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-5973:
----------------------------------
    Labels: pull-request-available  (was: )

> [Java] Variable width vectors' get methods should return null when the
> underlying data is null
> ---------------------------------------------------------------------------------------------
>
>                 Key: ARROW-5973
>                 URL: https://issues.apache.org/jira/browse/ARROW-5973
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Java
>            Reporter: Liya Fan
>            Assignee: Liya Fan
>            Priority: Major
>              Labels: pull-request-available
>
> For variable-width vectors (VarCharVector and VarBinaryVector), when the
> validity bit is not set, the underlying data is null, so the get method
> should return null.
> However, the current implementation throws an IllegalStateException when
> NULL_CHECKING_ENABLED is set, or returns an empty array when the flag is
> clear.
> Maybe the purpose of this design is to be consistent with fixed-width
> vectors. However, the scenario is different: fixed-width vectors (e.g.
> IntVector) throw an IllegalStateException simply because the primitive
> types are non-nullable.

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
[jira] [Updated] (ARROW-5973) [Java] Variable width vectors' get methods should return null when the underlying data is null
[ https://issues.apache.org/jira/browse/ARROW-5973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Liya Fan updated ARROW-5973:
----------------------------
    Summary: [Java] Variable width vectors' get methods should return null
when the underlying data is null  (was: [Java] Variable width vectors' get
methods should return return null when the underlying data is null)

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
[jira] [Created] (ARROW-5973) [Java] Variable width vectors' get methods should return return null when the underlying data is null
Liya Fan created ARROW-5973:
-------------------------------
             Summary: [Java] Variable width vectors' get methods should return return null when the underlying data is null
                 Key: ARROW-5973
                 URL: https://issues.apache.org/jira/browse/ARROW-5973
             Project: Apache Arrow
          Issue Type: Bug
          Components: Java
            Reporter: Liya Fan
            Assignee: Liya Fan

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
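As a language-neutral sketch of the semantics proposed above (hypothetical names, not Arrow's actual Java API): a variable-width vector keeps a validity bitmap alongside its data buffers, and get should consult the bitmap and return null for an unset slot rather than throwing an exception or returning an empty value:

```python
# Sketch of the proposed get() semantics (hypothetical class, not Arrow's
# Java API): when the validity bit for a slot is unset, get() returns
# None/null instead of raising or returning an empty array.

class VarWidthVector:
    def __init__(self, values):
        # Store data and a validity bitmap separately, as Arrow's
        # variable-width vectors do; None marks a null slot.
        self.data = [v if v is not None else b"" for v in values]
        self.validity = [v is not None for v in values]

    def get(self, index):
        if not self.validity[index]:
            return None   # proposed behavior: null, not an exception
        return self.data[index]

vec = VarWidthVector([b"abc", None, b"de"])
print(vec.get(0))  # b'abc'
print(vec.get(1))  # None
```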
[jira] [Closed] (ARROW-5815) [Java] Support swap functionality for fixed-width vectors
[ https://issues.apache.org/jira/browse/ARROW-5815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Liya Fan closed ARROW-5815.
---------------------------
    Resolution: Won't Fix

> [Java] Support swap functionality for fixed-width vectors
> ----------------------------------------------------------
>
>                 Key: ARROW-5815
>                 URL: https://issues.apache.org/jira/browse/ARROW-5815
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Java
>            Reporter: Liya Fan
>            Assignee: Liya Fan
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Support swapping data elements for fixed-width vectors.

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
[jira] [Updated] (ARROW-5970) [Java] Provide pointer to Arrow buffer
[ https://issues.apache.org/jira/browse/ARROW-5970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Liya Fan updated ARROW-5970:
----------------------------
    Description:
Introduce a pointer to a memory region within an ArrowBuf. This pointer will be used as the basis for calculating the hash code within a vector, and for equality determination. This data structure can be considered a "universal value holder".

  was:
Introduce a pointer to a memory region within an ArrowBuf. This pointer will be used as the basis for calculating the hash code within a vector, and for equality determination.

> [Java] Provide pointer to Arrow buffer
> ---------------------------------------
>
>                 Key: ARROW-5970
>                 URL: https://issues.apache.org/jira/browse/ARROW-5970
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Java
>            Reporter: Liya Fan
>            Assignee: Liya Fan
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
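The idea of a pointer that acts as a "universal value holder" can be sketched as follows. This is a hypothetical Python illustration, not the actual Java API: the pointer references a region (buffer, offset, length), and its hash and equality are computed from the referenced bytes, so two pointers into different buffers compare equal when the regions hold the same data:

```python
# Conceptual sketch (hypothetical, not Arrow's Java API): a pointer to a
# memory region whose hash code and equality are derived from the bytes it
# references rather than from the buffer object's identity.

class BufPointer:
    def __init__(self, buf, offset, length):
        self.buf, self.offset, self.length = buf, offset, length

    def _bytes(self):
        # Materialize the referenced region for comparison/hashing.
        return bytes(self.buf[self.offset:self.offset + self.length])

    def __eq__(self, other):
        return isinstance(other, BufPointer) and self._bytes() == other._bytes()

    def __hash__(self):
        return hash(self._bytes())

a = BufPointer(b"xxhelloyy", 2, 5)   # points at b"hello"
b = BufPointer(b"hello!!!!", 0, 5)   # different buffer, same bytes
print(a == b, hash(a) == hash(b))    # True True
```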
[jira] [Assigned] (ARROW-5762) [Integration][JS] Integration Tests for Map Type
[ https://issues.apache.org/jira/browse/ARROW-5762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Taylor reassigned ARROW-5762:
----------------------------------
    Assignee: Paul Taylor

> [Integration][JS] Integration Tests for Map Type
> -------------------------------------------------
>
>                 Key: ARROW-5762
>                 URL: https://issues.apache.org/jira/browse/ARROW-5762
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Integration, JavaScript
>            Reporter: Bryan Cutler
>            Assignee: Paul Taylor
>            Priority: Major
>             Fix For: 1.0.0
>
> ARROW-1279 enabled integration tests for MapType between Java and C++, but
> JavaScript had to be disabled for the map case due to an error. Once this
> is fixed, {{generate_map_case}} could be moved under
> {{generate_nested_case}} with the other nested types.

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
[jira] [Updated] (ARROW-5762) [Integration][JS] Integration Tests for Map Type
[ https://issues.apache.org/jira/browse/ARROW-5762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Taylor updated ARROW-5762:
-------------------------------
    Fix Version/s: 1.0.0

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
[jira] [Commented] (ARROW-5762) [Integration][JS] Integration Tests for Map Type
[ https://issues.apache.org/jira/browse/ARROW-5762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16887544#comment-16887544 ]

Paul Taylor commented on ARROW-5762:
------------------------------------

After reviewing the C++, the JS version of the Map type is not the same (it's essentially a Struct, except that child fields are accessed by name instead of by field index). We should absolutely update the JS Map implementation before the 1.0 release.

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
[jira] [Updated] (ARROW-5747) [C++] Better column name and header support in CSV reader
[ https://issues.apache.org/jira/browse/ARROW-5747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-5747:
----------------------------------
    Labels: csv pull-request-available  (was: csv)

> [C++] Better column name and header support in CSV reader
> ----------------------------------------------------------
>
>                 Key: ARROW-5747
>                 URL: https://issues.apache.org/jira/browse/ARROW-5747
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>    Affects Versions: 0.13.0
>            Reporter: Neal Richardson
>            Assignee: Antoine Pitrou
>            Priority: Major
>              Labels: csv, pull-request-available
>             Fix For: 1.0.0
>
> While working on ARROW-5500, I found a number of issues around the CSV
> parse option {{header_rows}}:
> * If {{header_rows}} is 0, [the reader
> errors|https://github.com/apache/arrow/blob/8b0318a11bba2aa2cf39bff245ff916a3283d372/cpp/src/arrow/csv/reader.cc#L150].
> * It's not possible to supply your own column names, as [this
> TODO|https://github.com/apache/arrow/blob/8b0318a11bba2aa2cf39bff245ff916a3283d372/cpp/src/arrow/csv/reader.cc#L149]
> notes. ARROW-4912 allows renaming columns after reading in, which _maybe_
> is enough as long as {{header_rows}} == 0 doesn't error, but then you can't
> naturally specify column types in the convert options, because that takes
> a map of column name to type.
> * If {{header_rows}} is > 1, every cell gets turned into a column name, so
> if {{header_rows}} == 2, you get twice as many column names as columns.
> This doesn't error, but it leads to unexpected results.
> IMO a better interface would be a {{skip_rows}} argument to let you ignore
> a large header, and a {{column_names}} argument that, if provided, gives
> the column names. If not provided, the first row after {{skip_rows}} is
> taken to be the column names. If it were also possible for
> {{column_names}} to take a {{false}} or {{null}} argument, then we could
> support autogenerating names when none are provided and there is no header
> row.
> Alternatively, we could use a boolean {{header}} argument to govern
> whether the first (non-skipped) row should be interpreted as column names.
> (For reference, R's
> [readr|https://github.com/tidyverse/readr/blob/master/R/read_delim.R#L14-L27]
> takes TRUE/FALSE/array of strings in one arg; the base
> [read.csv|https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html]
> uses separate args for header and col.names. Both have a {{skip}}
> argument.)
> I don't think there's value in trying to be clever about multi-row headers
> and converting those to column names; if there's meaningful information in
> a tall header, let the user parse it themselves.

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
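The proposed skip_rows/column_names behavior can be sketched in a few lines of Python using the standard csv module. The function name and argument handling here are illustrative only, not the final Arrow API: skip the first skip_rows lines, then either take the next row as the header, use caller-supplied names, or autogenerate names when column_names is false:

```python
# Sketch of the proposed interface (illustrative names, not the Arrow API):
# skip_rows skips leading junk; column_names, if given, supplies the header;
# column_names=False autogenerates f0, f1, ...; otherwise the first
# non-skipped row is the header.
import csv
import io

def read_csv(text, skip_rows=0, column_names=None):
    rows = list(csv.reader(io.StringIO(text)))[skip_rows:]
    if column_names is None:
        column_names, rows = rows[0], rows[1:]   # first row is the header
    elif column_names is False:
        column_names = [f"f{i}" for i in range(len(rows[0]))]  # autogenerate
    return column_names, rows

data = "generated by tool X\na,b\n1,2\n"
print(read_csv(data, skip_rows=1))            # (['a', 'b'], [['1', '2']])
print(read_csv("1,2\n", column_names=False))  # (['f0', 'f1'], [['1', '2']])
```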
[jira] [Commented] (ARROW-5763) [JS] enable integration tests for MapVector
[ https://issues.apache.org/jira/browse/ARROW-5763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16887479#comment-16887479 ]

Paul Taylor commented on ARROW-5763:
------------------------------------

After reviewing the C++, the JS version of the Map type is not the same (it's essentially a Struct, except that child fields are accessed by name instead of by field index). We should absolutely update the JS Map implementation before the 1.0 release.

> [JS] enable integration tests for MapVector
> --------------------------------------------
>
>                 Key: ARROW-5763
>                 URL: https://issues.apache.org/jira/browse/ARROW-5763
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: JavaScript
>            Reporter: Benjamin Kietzman
>            Priority: Minor
>
> As of 0.14, C++ and Java support Map arrays, and those implementations
> pass integration tests. JS has a MapVector and some unit tests for it, but
> it should be tested against the other implementations as well.

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
[jira] [Assigned] (ARROW-5894) [C++] libgandiva.so.14 is exporting libstdc++ symbols
[ https://issues.apache.org/jira/browse/ARROW-5894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney reassigned ARROW-5894:
-----------------------------------
    Assignee: Zhuo Peng

> [C++] libgandiva.so.14 is exporting libstdc++ symbols
> ------------------------------------------------------
>
>                 Key: ARROW-5894
>                 URL: https://issues.apache.org/jira/browse/ARROW-5894
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++ - Gandiva
>    Affects Versions: 0.14.0
>            Reporter: Zhuo Peng
>            Assignee: Zhuo Peng
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 1.0.0
>
> For example:
> {{$ nm libgandiva.so.14 | grep "once_proxy"}}
> {{018c0a10 T __once_proxy}}
> Many other symbols are also exported that I guess shouldn't be (e.g. LLVM
> symbols).
> There seems to be no linker script for libgandiva.so (there was one, but it
> was never used and got deleted?
> [https://github.com/apache/arrow/blob/9265fe35b67db93f5af0b47e92e039c637ad5b3e/cpp/src/gandiva/symbols-helpers.map]).

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
[jira] [Resolved] (ARROW-5894) [C++] libgandiva.so.14 is exporting libstdc++ symbols
[ https://issues.apache.org/jira/browse/ARROW-5894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney resolved ARROW-5894.
---------------------------------
    Resolution: Fixed

Issue resolved by pull request 4883
[https://github.com/apache/arrow/pull/4883]

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
[jira] [Updated] (ARROW-5964) [C++][Gandiva] Cast double to decimal with rounding returns 0
[ https://issues.apache.org/jira/browse/ARROW-5964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-5964:
--------------------------------
    Component/s: C++ - Gandiva

> [C++][Gandiva] Cast double to decimal with rounding returns 0
> --------------------------------------------------------------
>
>                 Key: ARROW-5964
>                 URL: https://issues.apache.org/jira/browse/ARROW-5964
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++ - Gandiva
>            Reporter: Pindikura Ravindra
>            Assignee: Pindikura Ravindra
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.0.0
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> Casting 1.15470053838 to decimal(18, 0) gives 0; it should return 1.
> There is a bug in the overflow check after rounding.

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
[jira] [Resolved] (ARROW-5964) [C++][Gandiva] Cast double to decimal with rounding returns 0
[ https://issues.apache.org/jira/browse/ARROW-5964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney resolved ARROW-5964.
---------------------------------
       Resolution: Fixed
    Fix Version/s: 1.0.0

Issue resolved by pull request 4894
[https://github.com/apache/arrow/pull/4894]

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
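The expected result of the cast described above can be checked with Python's standard decimal module: rounding 1.15470053838 to zero fractional digits (the scale of decimal(18, 0)) gives 1, not 0:

```python
# Sanity check of the expected cast result using Python's decimal module:
# quantizing 1.15470053838 to 0 fractional digits should yield 1.
from decimal import Decimal

value = Decimal("1.15470053838")
rounded = value.quantize(Decimal("1"))  # scale 0, default rounding mode
print(rounded)  # 1
```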
[jira] [Updated] (ARROW-5741) [JS] Make numeric vector from functions consistent with TypedArray.from
[ https://issues.apache.org/jira/browse/ARROW-5741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-5741:
--------------------------------
    Fix Version/s: 1.0.0

> [JS] Make numeric vector from functions consistent with TypedArray.from
> ------------------------------------------------------------------------
>
>                 Key: ARROW-5741
>                 URL: https://issues.apache.org/jira/browse/ARROW-5741
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: JavaScript
>            Reporter: Brian Hulette
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.0.0
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Described in
> https://lists.apache.org/thread.html/b648a781cba7f10d5a6072ff2e7dab6c03e2d1f12e359d9261891486@%3Cdev.arrow.apache.org%3E

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
[jira] [Comment Edited] (ARROW-5965) [Python] Regression: segfault when reading hive table with v0.14
[ https://issues.apache.org/jira/browse/ARROW-5965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16887352#comment-16887352 ]

H. Vetinari edited comment on ARROW-5965 at 7/17/19 7:11 PM:
-------------------------------------------------------------

[~wesmckinn] Would like to provide it, but would only be able to install through conda (which has a hole in the firewall). Unfortunately:

{{# conda install pyarrow=0.14 gdb}}
{{Collecting package metadata (current_repodata.json): done}}
{{Solving environment: failed}}
{{Collecting package metadata (repodata.json): done}}
{{Solving environment: failed}}
{{UnsatisfiableError: The following specifications were found to be incompatible with each other:}}
{{  - pip -> python[version='>=3.7,<3.8.0a0']}}

which, I believe, is due to the fact that gdb has [not yet|https://github.com/conda-forge/gdb-feedstock/pull/12] been built for Python 3.7 (although, just as I was preparing this message, I triggered a rerender there, which caused some further action and the first passing 3.7 build; not yet merged because 2.7 is failing).

In the meantime I tried downgrading my whole environment to 3.6, where the program also crashes or hangs on v0.14. However, I haven't yet been able to get a gdb output. Might need some more reading of the GDB manual...

EDIT: can't seem to format the code-block correctly, sorry.

> [Python] Regression: segfault when reading hive table with v0.14
> -----------------------------------------------------------------
>
>                 Key: ARROW-5965
>                 URL: https://issues.apache.org/jira/browse/ARROW-5965
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.14.0
>            Reporter: H. Vetinari
>            Priority: Critical
>              Labels: parquet
>
> I'm working with pyarrow on a Cloudera cluster (CDH 6.1.1), with pyarrow
> installed in a conda env.
> The data I'm reading is a hive(-registered) table written as parquet, and
> with v0.13, reading this table (which is partitioned) does not cause any
> issues.
> The code that worked before and now crashes with v0.14 is simply:
> {{import pyarrow.parquet as pq}}
> {{pq.ParquetDataset('hdfs:///data/raw/source/table').read()}}
> Since it completely crashes my notebook (resp. my REPL ends with
> "Killed"), I cannot report much more, but this is a pretty severe
> usability restriction. So far the solution is to enforce {{pyarrow<0.14}}

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
[jira] [Commented] (ARROW-5965) [Python] Regression: segfault when reading hive table with v0.14
[ https://issues.apache.org/jira/browse/ARROW-5965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16887356#comment-16887356 ]

Wes McKinney commented on ARROW-5965:
-------------------------------------

Note I linked this with ARROW-2652, since many users aren't familiar with producing gdb backtraces from Python programs.

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
[jira] [Comment Edited] (ARROW-5965) [Python] Regression: segfault when reading hive table with v0.14
[ https://issues.apache.org/jira/browse/ARROW-5965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887352#comment-16887352 ] H. Vetinari edited comment on ARROW-5965 at 7/17/19 7:09 PM: - [~wesmckinn] Would like to provide it, but would only be able to install through conda (which has a hole in the firewall). Unfortunately, {{# conda install pyarrow=0.14 gdb Collecting package metadata (current_repodata.json): done Solving environment: failed Collecting package metadata (repodata.json): done Solving environment: failed UnsatisfiableError: The following specifications were found to be incompatible with each other: - pip -> python[version='>=3.7,<3.8.0a0']}} which, I believe, is due to the fact that gdb has [not yet](https://github.com/conda-forge/gdb-feedstock/pull/12) been built for python 3.7. (although, just as I was preparing this message, I triggered a rerender there and this has caused some further action and the first passing 3.7 build; not yet merged because 2.7 is failing). In the meantime I tried downgrading my whole environment to 3.6, where the program also crashes or hangs on v0.14. However, I haven't yet been able to get a gdb output. Might need some more reading of the GDB manual... > [Python] Regression: segfault when reading hive table with v0.14 > > > Key: ARROW-5965 > URL: https://issues.apache.org/jira/browse/ARROW-5965 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.14.0 >Reporter: H. Vetinari >Priority: Critical > Labels: parquet > > I'm working with pyarrow on a cloudera cluster (CDH 6.1.1), with pyarrow > installed in a conda env. > The data I'm reading is a hive(-registered) table written as parquet, and > with v0.13, reading this table (that is partitioned) does not cause any > issues. > The code that worked before and now crashes with v0.14 is simply: > ``` > import pyarrow.parquet as pq > pq.ParquetDataset('hdfs:///data/raw/source/table').read() > ``` > Since it completely crashes my notebook (resp. my REPL ends with "Killed"), I > cannot report much more, but this is a pretty severe usability restriction. > So far the solution is to enforce `pyarrow<0.14` -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-5965) [Python] Regression: segfault when reading hive table with v0.14
[ https://issues.apache.org/jira/browse/ARROW-5965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887352#comment-16887352 ] H. Vetinari commented on ARROW-5965: [~wesmckinn] Would like to provide it, but would only be able to install through conda (which has a hole in the firewall). Unfortunately, ``` # conda install pyarrow=0.14 gdb Collecting package metadata (current_repodata.json): done Solving environment: failed Collecting package metadata (repodata.json): done Solving environment: failed UnsatisfiableError: The following specifications were found to be incompatible with each other: - pip -> python[version='>=3.7,<3.8.0a0'] ``` which, I believe, is due to the fact that gdb has [not yet](https://github.com/conda-forge/gdb-feedstock/pull/12) been built for python 3.7. (although, just as I was preparing this message, I triggered a rerender there and this has caused some further action and the first passing 3.7 build; not yet merged because 2.7 is failing). In the meantime I tried downgrading my whole environment to 3.6, where the program also crashes or hangs on v0.14. However, I haven't yet been able to get a gdb output. Might need some more reading of the GDB manual... > [Python] Regression: segfault when reading hive table with v0.14 > > > Key: ARROW-5965 > URL: https://issues.apache.org/jira/browse/ARROW-5965 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.14.0 >Reporter: H. Vetinari >Priority: Critical > Labels: parquet > > I'm working with pyarrow on a cloudera cluster (CDH 6.1.1), with pyarrow > installed in a conda env. > The data I'm reading is a hive(-registered) table written as parquet, and > with v0.13, reading this table (that is partitioned) does not cause any > issues. > The code that worked before and now crashes with v0.14 is simply: > ``` > import pyarrow.parquet as pq > pq.ParquetDataset('hdfs:///data/raw/source/table').read() > ``` > Since it completely crashes my notebook (resp. 
my REPL ends with "Killed"), I > cannot report much more, but this is a pretty severe usability restriction. > So far the solution is to enforce `pyarrow<0.14` -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-3032) [Python] Clean up NumPy-related C++ headers
[ https://issues.apache.org/jira/browse/ARROW-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887347#comment-16887347 ] Wes McKinney commented on ARROW-3032: - We decided in the PR not to combine any of the headers > [Python] Clean up NumPy-related C++ headers > --- > > Key: ARROW-3032 > URL: https://issues.apache.org/jira/browse/ARROW-3032 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 50m > Remaining Estimate: 0h > > There are 4 different headers. After ARROW-2814, we can probably eliminate > numpy_convert.h and combine with numpy_to_arrow.h -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Resolved] (ARROW-3032) [Python] Clean up NumPy-related C++ headers
[ https://issues.apache.org/jira/browse/ARROW-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-3032. - Resolution: Fixed Issue resolved by pull request 4899 [https://github.com/apache/arrow/pull/4899] > [Python] Clean up NumPy-related C++ headers > --- > > Key: ARROW-3032 > URL: https://issues.apache.org/jira/browse/ARROW-3032 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > > There are 4 different headers. After ARROW-2814, we can probably eliminate > numpy_convert.h and combine with numpy_to_arrow.h -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Assigned] (ARROW-3032) [Python] Clean up NumPy-related C++ headers
[ https://issues.apache.org/jira/browse/ARROW-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-3032: --- Assignee: Antoine Pitrou > [Python] Clean up NumPy-related C++ headers > --- > > Key: ARROW-3032 > URL: https://issues.apache.org/jira/browse/ARROW-3032 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > > There are 4 different headers. After ARROW-2814, we can probably eliminate > numpy_convert.h and combine with numpy_to_arrow.h -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-5963) [R] R Appveyor job does not test changes in the C++ library
[ https://issues.apache.org/jira/browse/ARROW-5963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-5963: -- Labels: pull-request-available (was: ) > [R] R Appveyor job does not test changes in the C++ library > --- > > Key: ARROW-5963 > URL: https://issues.apache.org/jira/browse/ARROW-5963 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Wes McKinney >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > It seems like master is being used > https://github.com/apache/arrow/blob/master/ci/PKGBUILD#L42 > I observed this in > https://ci.appveyor.com/project/wesm/arrow/builds/26030853/job/7vn8q3l8e24t83jh?fullLog=true > from this PR > https://github.com/apache/arrow/pull/4841 for ARROW-5893 -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Closed] (ARROW-5953) Thrift download ERRORS with apache-arrow-0.14.0
[ https://issues.apache.org/jira/browse/ARROW-5953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brian closed ARROW-5953. Resolution: Information Provided This is a user-specific build environment issue unrelated to the apache-arrow-0.14.0 codebase. > Thrift download ERRORS with apache-arrow-0.14.0 > --- > > Key: ARROW-5953 > URL: https://issues.apache.org/jira/browse/ARROW-5953 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.13.0, 0.14.0 > Environment: RHEL 6.7 >Reporter: Brian >Priority: Major > > {color:#33}cmake returns:{color} > requests.exceptions.SSLError: hostname 'www.apache.org' doesn't match either > of '*.openoffice.org', 'openoffice.org'/thrift/0.12.0/thrift-0.12.0.tar.gz > {color:#33}during check for thrift download location. {color} > {color:#33}This occurs with a freshly inflated arrow source release tree > where cmake is running for the first time. {color} > {color:#33}Reproducible with the release levels of apache-arrow-0.14.0 > and 0.13.0. 
I tried this 3-5x on 15Jul2019 and see it consistently each > time.{color} > {color:#33}Here's the full context from cmake output: {color} > {quote}-- Checking for module 'thrift' > -- No package 'thrift' found > -- Could NOT find Thrift (missing: THRIFT_STATIC_LIB THRIFT_INCLUDE_DIR > THRIFT_COMPILER) > Building Apache Thrift from source > Downloading Apache Thrift from Traceback (most recent call last): > File "…/apache-arrow-0.14.0/cpp/build-support/get_apache_mirror.py", line > 38, in > suggested_mirror = get_url('[https://www.apache.org/dyn/]' > File "…/apache-arrow-0.14.0/cpp/build-support/get_apache_mirror.py", line > 27, in get_url > return requests.get(url).content > File "/usr/lib/python2.6/site-packages/requests/api.py", line 68, in get > return request('get', url, **kwargs) > File "/usr/lib/python2.6/site-packages/requests/api.py", line 50, in request > response = session.request(method=method, url=url, **kwargs) > File "/usr/lib/python2.6/site-packages/requests/sessions.py", line 464, in > request > resp = self.send(prep, **send_kwargs) > File "/usr/lib/python2.6/site-packages/requests/sessions.py", line 576, in > send > r = adapter.send(request, **kwargs) > File "/usr/lib/python2.6/site-packages/requests/adapters.py", line 431, in > send > raise SSLError(e, request=request) > requests.exceptions.SSLError: hostname 'www.apache.org' doesn't match either > of '*.openoffice.org', 'openoffice.org'/thrift/0.12.0/thrift-0.12.0.tar.gz > {quote} > {color:#FF} {color} > {color:#FF}{color:#33}Per Wes' suggestion I ran the following > directly:{color}{color} > {color:#FF}{color:#33}python cpp/build-support/get_apache_mirror.py > [https://www-eu.apache.org/dist/] [http://us.mirrors.quenda.co/apache/] > {color}{color} > {color:#FF}{color:#33}with this output:{color}{color} > [https://www-eu.apache.org/dist/] [http://us.mirrors.quenda.co/apache/] > > > *NOTE:* here are the cmake thrift log lines from a build of apache-arrow git > clone on 06Jul2019 where cmake/make 
were run fine.pwd > > {quote}-- Checking for module 'thrift' > -- No package 'thrift' found > -- Could NOT find Thrift (missing: THRIFT_STATIC_LIB) > Building Apache Thrift from source > Downloading Apache Thrift from > http://mirror.metrocast.net/apache//thrift/0.12.0/thrift-0.12.0.tar.gz > {quote} > Currently, cmake runs successfully on this apache-arrow-0.14.0 directory. > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-5953) Thrift download ERRORS with apache-arrow-0.14.0
[ https://issues.apache.org/jira/browse/ARROW-5953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887288#comment-16887288 ] Brian commented on ARROW-5953: -- This turns out to be a problem with cert validation when cmake sets up to download thrift, due to back-level RHEL 6 and Python 2.6.6 on the internal SAS build node where this fails. We need to update to a newer supported version of Python on these machines. > Thrift download ERRORS with apache-arrow-0.14.0 > --- > > Key: ARROW-5953 > URL: https://issues.apache.org/jira/browse/ARROW-5953 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.13.0, 0.14.0 > Environment: RHEL 6.7 >Reporter: Brian >Priority: Major > > {color:#33}cmake returns:{color} > requests.exceptions.SSLError: hostname 'www.apache.org' doesn't match either > of '*.openoffice.org', 'openoffice.org'/thrift/0.12.0/thrift-0.12.0.tar.gz > {color:#33}during check for thrift download location. {color} > {color:#33}This occurs with a freshly inflated arrow source release tree > where cmake is running for the first time. {color} > {color:#33}Reproducible with the release levels of apache-arrow-0.14.0 > and 0.13.0. 
I tried this 3-5x on 15Jul2019 and see it consistently each > time.{color} > {color:#33}Here's the full context from cmake output: {color} > {quote}-- Checking for module 'thrift' > -- No package 'thrift' found > -- Could NOT find Thrift (missing: THRIFT_STATIC_LIB THRIFT_INCLUDE_DIR > THRIFT_COMPILER) > Building Apache Thrift from source > Downloading Apache Thrift from Traceback (most recent call last): > File "…/apache-arrow-0.14.0/cpp/build-support/get_apache_mirror.py", line > 38, in > suggested_mirror = get_url('[https://www.apache.org/dyn/]' > File "…/apache-arrow-0.14.0/cpp/build-support/get_apache_mirror.py", line > 27, in get_url > return requests.get(url).content > File "/usr/lib/python2.6/site-packages/requests/api.py", line 68, in get > return request('get', url, **kwargs) > File "/usr/lib/python2.6/site-packages/requests/api.py", line 50, in request > response = session.request(method=method, url=url, **kwargs) > File "/usr/lib/python2.6/site-packages/requests/sessions.py", line 464, in > request > resp = self.send(prep, **send_kwargs) > File "/usr/lib/python2.6/site-packages/requests/sessions.py", line 576, in > send > r = adapter.send(request, **kwargs) > File "/usr/lib/python2.6/site-packages/requests/adapters.py", line 431, in > send > raise SSLError(e, request=request) > requests.exceptions.SSLError: hostname 'www.apache.org' doesn't match either > of '*.openoffice.org', 'openoffice.org'/thrift/0.12.0/thrift-0.12.0.tar.gz > {quote} > {color:#FF} {color} > {color:#FF}{color:#33}Per Wes' suggestion I ran the following > directly:{color}{color} > {color:#FF}{color:#33}python cpp/build-support/get_apache_mirror.py > [https://www-eu.apache.org/dist/] [http://us.mirrors.quenda.co/apache/] > {color}{color} > {color:#FF}{color:#33}with this output:{color}{color} > [https://www-eu.apache.org/dist/] [http://us.mirrors.quenda.co/apache/] > > > *NOTE:* here are the cmake thrift log lines from a build of apache-arrow git > clone on 06Jul2019 where cmake/make 
were run fine.pwd > > {quote}-- Checking for module 'thrift' > -- No package 'thrift' found > -- Could NOT find Thrift (missing: THRIFT_STATIC_LIB) > Building Apache Thrift from source > Downloading Apache Thrift from > http://mirror.metrocast.net/apache//thrift/0.12.0/thrift-0.12.0.tar.gz > {quote} > Currently, cmake runs successfully on this apache-arrow-0.14.0 directory. > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
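Brian's diagnosis above (an old Python whose SSL stack cannot do Server Name Indication, so multi-certificate hosts like the Apache mirrors present the wrong certificate) can be checked on a build node with a stdlib-only probe. This is a sketch of that check, not code from the ticket:

```python
import ssl
import sys

# Python 2.6's ssl module predates Server Name Indication (SNI), so a host
# serving several certificates behind one IP answers with a default
# certificate, and requests then raises the hostname-mismatch SSLError
# quoted in this ticket.
print("Python:", sys.version.split()[0])
print("SNI supported:", getattr(ssl, "HAS_SNI", False))
```

`False` on the failing RHEL 6 node would be consistent with the explanation; any reasonably recent Python built against a modern OpenSSL reports `True`.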
[jira] [Updated] (ARROW-5966) [Python] Capacity error when converting large UTF32 numpy array to arrow array
[ https://issues.apache.org/jira/browse/ARROW-5966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-5966: -- Summary: [Python] Capacity error when converting large UTF32 numpy array to arrow array (was: [Python] Capacity error when converting large string numpy array to arrow array) > [Python] Capacity error when converting large UTF32 numpy array to arrow array > -- > > Key: ARROW-5966 > URL: https://issues.apache.org/jira/browse/ARROW-5966 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.13.0, 0.14.0 >Reporter: Igor Yastrebov >Priority: Major > > Trying to create a large string array fails with > ArrowCapacityError: Encoded string length exceeds maximum size (2GB) > instead of creating a chunked array. > > A reproducible example: > {code:java} > import uuid > import numpy as np > import pyarrow as pa > li = [] > for i in range(1): > li.append(uuid.uuid4().hex) > arr = np.array(li) > parr = pa.array(arr) > {code} > Is it a regression or was it never properly fixed: > [https://github.com/apache/arrow/issues/1855]? > > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-5972) [Rust] Installing cargo-tarpaulin and generating coverage report takes over 20 minutes
Wes McKinney created ARROW-5972: --- Summary: [Rust] Installing cargo-tarpaulin and generating coverage report takes over 20 minutes Key: ARROW-5972 URL: https://issues.apache.org/jira/browse/ARROW-5972 Project: Apache Arrow Issue Type: Improvement Components: Rust Reporter: Wes McKinney Fix For: 1.0.0 See example build: https://travis-ci.org/apache/arrow/jobs/558986931 Here, installing cargo-tarpaulin takes 13m32s. Running the coverage report takes another 7m40s. Given the Travis CI build queue issues we're having, this might be worth optimizing or moving to Docker/Buildbot -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-5971) [Website] Blog post introducing Arrow Flight
[ https://issues.apache.org/jira/browse/ARROW-5971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887257#comment-16887257 ] Wes McKinney commented on ARROW-5971: - Yeah I was thinking to use Python for all the benchmarking, both server and client. Good dogfooding exercise > [Website] Blog post introducing Arrow Flight > > > Key: ARROW-5971 > URL: https://issues.apache.org/jira/browse/ARROW-5971 > Project: Apache Arrow > Issue Type: New Feature > Components: Website >Reporter: Wes McKinney >Priority: Major > Fix For: 1.0.0 > > > I think it's a good time to be bringing more attention to our work over the > last 12-14 months on Arrow Flight. > I would be OK to draft an initial version of the blog post, and I can > circulate to others for review / edit / comment. If there are particular > benchmarks you would like to see included, contributing code for that would > also be helpful. My plan would be to show tcp throughput on localhost, and > node-to-node throughput on a local gigabit ethernet network. I think the > localhost throughput is important to show that Flight is a tool that you > would want to reach for for faster throughput in high performance networking > (e.g. 10/40 gigabit) -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-5971) [Website] Blog post introducing Arrow Flight
[ https://issues.apache.org/jira/browse/ARROW-5971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887246#comment-16887246 ] lidavidm commented on ARROW-5971: - I'd be happy to look over anything. We're also working on a post of our own, though that probably won't come in the near future. It might be interesting to show Python numbers as well - it actually performs better than Java in our tests (don't think I can share actual data though). > [Website] Blog post introducing Arrow Flight > > > Key: ARROW-5971 > URL: https://issues.apache.org/jira/browse/ARROW-5971 > Project: Apache Arrow > Issue Type: New Feature > Components: Website >Reporter: Wes McKinney >Priority: Major > Fix For: 1.0.0 > > > I think it's a good time to be bringing more attention to our work over the > last 12-14 months on Arrow Flight. > I would be OK to draft an initial version of the blog post, and I can > circulate to others for review / edit / comment. If there are particular > benchmarks you would like to see included, contributing code for that would > also be helpful. My plan would be to show tcp throughput on localhost, and > node-to-node throughput on a local gigabit ethernet network. I think the > localhost throughput is important to show that Flight is a tool that you > would want to reach for for faster throughput in high performance networking > (e.g. 10/40 gigabit) -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-5966) [Python] Capacity error when converting large string numpy array to arrow array
[ https://issues.apache.org/jira/browse/ARROW-5966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887219#comment-16887219 ] Antoine Pitrou commented on ARROW-5966: --- I think that's because you're going through a Numpy Array (which also uses a wasteful UTF32 encoding). Just call pa.array() directly on the source. Or use another dtype for the Numpy Array. > [Python] Capacity error when converting large string numpy array to arrow > array > --- > > Key: ARROW-5966 > URL: https://issues.apache.org/jira/browse/ARROW-5966 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.13.0, 0.14.0 >Reporter: Igor Yastrebov >Priority: Major > > Trying to create a large string array fails with > ArrowCapacityError: Encoded string length exceeds maximum size (2GB) > instead of creating a chunked array. > > A reproducible example: > {code:java} > import uuid > import numpy as np > import pyarrow as pa > li = [] > for i in range(1): > li.append(uuid.uuid4().hex) > arr = np.array(li) > parr = pa.array(arr) > {code} > Is it a regression or was it never properly fixed: > [https://github.com/apache/arrow/issues/1855]? > > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-5966) [Python] Capacity error when converting large string numpy array to arrow array
[ https://issues.apache.org/jira/browse/ARROW-5966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887197#comment-16887197 ] Igor Yastrebov commented on ARROW-5966: --- I tried your example and it worked but uuid array fails. I have pyarrow 0.14.0 (from conda-forge) > [Python] Capacity error when converting large string numpy array to arrow > array > --- > > Key: ARROW-5966 > URL: https://issues.apache.org/jira/browse/ARROW-5966 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.13.0, 0.14.0 >Reporter: Igor Yastrebov >Priority: Major > > Trying to create a large string array fails with > ArrowCapacityError: Encoded string length exceeds maximum size (2GB) > instead of creating a chunked array. > > A reproducible example: > {code:java} > import uuid > import numpy as np > import pyarrow as pa > li = [] > for i in range(1): > li.append(uuid.uuid4().hex) > arr = np.array(li) > parr = pa.array(arr) > {code} > Is it a regression or was it never properly fixed: > [https://github.com/apache/arrow/issues/1855]? > > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-5811) [C++] CSV reader: Ability to not infer column types.
[ https://issues.apache.org/jira/browse/ARROW-5811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887195#comment-16887195 ] Wes McKinney commented on ARROW-5811: - Yeah, so we could define a conversion rule to return string or binary, and then add an option to set a default conversion rule (where currently we have an implicit default of "use type inference") > [C++] CSV reader: Ability to not infer column types. > > > Key: ARROW-5811 > URL: https://issues.apache.org/jira/browse/ARROW-5811 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Affects Versions: 0.13.0 > Environment: Ubuntu Xenial >Reporter: Bogdan Klichuk >Priority: Minor > Labels: csv, csvparser, pyarrow > Fix For: 1.0.0 > > > I'm trying to read CSV as is. All columns as strings. I don't know the schema > of these CSVs and they will vary as they are provided by user. > Right now I'm using pandas.read_csv(dtype=str) which works great, but since > final destination of these CSVs are parquet files it seems like much more > efficient to use pyarrow.csv.read_csv in future, as soon as this becomes > available :) > I tried things like > `pyarrow.csv.read_csv(convert_types=ConvertOptions(columns_types=defaultdict(lambda: > 'string')))` but it doesn't work. > Maybe I just didn't find something that already exists? :) -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-5811) [C++] CSV reader: Ability to not infer column types.
[ https://issues.apache.org/jira/browse/ARROW-5811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887187#comment-16887187 ] Antoine Pitrou commented on ARROW-5811: --- We're talking about C++ here. Dynamic typing isn't terribly idiomatic (though it's possible using std::variant) :-) > [C++] CSV reader: Ability to not infer column types. > > > Key: ARROW-5811 > URL: https://issues.apache.org/jira/browse/ARROW-5811 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Affects Versions: 0.13.0 > Environment: Ubuntu Xenial >Reporter: Bogdan Klichuk >Priority: Minor > Labels: csv, csvparser, pyarrow > Fix For: 1.0.0 > > > I'm trying to read CSV as is. All columns as strings. I don't know the schema > of these CSVs and they will vary as they are provided by user. > Right now I'm using pandas.read_csv(dtype=str) which works great, but since > final destination of these CSVs are parquet files it seems like much more > efficient to use pyarrow.csv.read_csv in future, as soon as this becomes > available :) > I tried things like > `pyarrow.csv.read_csv(convert_types=ConvertOptions(columns_types=defaultdict(lambda: > 'string')))` but it doesn't work. > Maybe I just didn't find something that already exists? :) -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-3032) [Python] Clean up NumPy-related C++ headers
[ https://issues.apache.org/jira/browse/ARROW-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-3032: -- Labels: pull-request-available (was: ) > [Python] Clean up NumPy-related C++ headers > --- > > Key: ARROW-3032 > URL: https://issues.apache.org/jira/browse/ARROW-3032 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > There are 4 different headers. After ARROW-2814, we can probably eliminate > numpy_convert.h and combine with numpy_to_arrow.h -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-5811) [C++] CSV reader: Ability to not infer column types.
[ https://issues.apache.org/jira/browse/ARROW-5811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887183#comment-16887183 ] Neal Richardson commented on ARROW-5811: In principle, a user could parse the header row of the CSV separately to identify the column names, then use that to define {{column_types}} mapping each name to string type. So are we just talking about how to facilitate that, whether/how to internalize that logic and expose it as a simple argument? Or is there something else? If {{column_types}} didn't have to be a map, maybe that would help. Perhaps it could also accept an array of length equal to the number of columns, or a single value, in which case it would recycle that type for every column. > [C++] CSV reader: Ability to not infer column types. > > > Key: ARROW-5811 > URL: https://issues.apache.org/jira/browse/ARROW-5811 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Affects Versions: 0.13.0 > Environment: Ubuntu Xenial >Reporter: Bogdan Klichuk >Priority: Minor > Labels: csv, csvparser, pyarrow > Fix For: 1.0.0 > > > I'm trying to read CSV as is. All columns as strings. I don't know the schema > of these CSVs and they will vary as they are provided by user. > Right now I'm using pandas.read_csv(dtype=str) which works great, but since > final destination of these CSVs are parquet files it seems like much more > efficient to use pyarrow.csv.read_csv in future, as soon as this becomes > available :) > I tried things like > `pyarrow.csv.read_csv(convert_types=ConvertOptions(columns_types=defaultdict(lambda: > 'string')))` but it doesn't work. > Maybe I just didn't find something that already exists? :) -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Resolved] (ARROW-5962) [CI][Python] Do not test manylinux1 wheels in Travis CI
[ https://issues.apache.org/jira/browse/ARROW-5962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-5962. - Resolution: Fixed Issue resolved by pull request 4893 [https://github.com/apache/arrow/pull/4893] > [CI][Python] Do not test manylinux1 wheels in Travis CI > --- > > Key: ARROW-5962 > URL: https://issues.apache.org/jira/browse/ARROW-5962 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration, Python >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 50m > Remaining Estimate: 0h > > These can be tested via Crossbow either on demand or nightly. Removing these > from Travis CI will save 30 minutes of build time resulting in better team > productivity -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-5971) [Website] Blog post introducing Arrow Flight
Wes McKinney created ARROW-5971: --- Summary: [Website] Blog post introducing Arrow Flight Key: ARROW-5971 URL: https://issues.apache.org/jira/browse/ARROW-5971 Project: Apache Arrow Issue Type: New Feature Components: Website Reporter: Wes McKinney Fix For: 1.0.0 I think it's a good time to be bringing more attention to our work over the last 12-14 months on Arrow Flight. I would be OK to draft an initial version of the blog post, and I can circulate to others for review / edit / comment. If there are particular benchmarks you would like to see included, contributing code for that would also be helpful. My plan would be to show tcp throughput on localhost, and node-to-node throughput on a local gigabit ethernet network. I think the localhost throughput is important to show that Flight is a tool that you would want to reach for for faster throughput in high performance networking (e.g. 10/40 gigabit) -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-5811) [C++] CSV reader: Ability to not infer column types.
[ https://issues.apache.org/jira/browse/ARROW-5811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887172#comment-16887172 ] Antoine Pitrou commented on ARROW-5811: --- The request is for no inference to occur, without knowing the column names or the number of columns in advance (so you cannot pass an explicit {{column_types}}). > [C++] CSV reader: Ability to not infer column types. > > > Key: ARROW-5811 > URL: https://issues.apache.org/jira/browse/ARROW-5811 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Affects Versions: 0.13.0 > Environment: Ubuntu Xenial >Reporter: Bogdan Klichuk >Priority: Minor > Labels: csv, csvparser, pyarrow > Fix For: 1.0.0 > > > I'm trying to read CSV as is. All columns as strings. I don't know the schema > of these CSVs and they will vary as they are provided by user. > Right now I'm using pandas.read_csv(dtype=str) which works great, but since > final destination of these CSVs are parquet files it seems like much more > efficient to use pyarrow.csv.read_csv in future, as soon as this becomes > available :) > I tried things like > `pyarrow.csv.read_csv(convert_types=ConvertOptions(columns_types=defaultdict(lambda: > 'string')))` but it doesn't work. > Maybe I just didn't find something that already exists? :) -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-5811) [C++] CSV reader: Ability to not infer column types.
[ https://issues.apache.org/jira/browse/ARROW-5811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887170#comment-16887170 ] Neal Richardson commented on ARROW-5811: I think I'm not understanding the problem. What's missing from the {{column_types}} we already support? [https://github.com/apache/arrow/blob/master/cpp/src/arrow/csv/options.h#L69] > [C++] CSV reader: Ability to not infer column types. > > > Key: ARROW-5811 > URL: https://issues.apache.org/jira/browse/ARROW-5811 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Affects Versions: 0.13.0 > Environment: Ubuntu Xenial >Reporter: Bogdan Klichuk >Priority: Minor > Labels: csv, csvparser, pyarrow > Fix For: 1.0.0 > > > I'm trying to read CSV as is. All columns as strings. I don't know the schema > of these CSVs and they will vary as they are provided by user. > Right now i'm using pandas.read_csv(dtype=str) which works great, but since > final destination of these CSVs are parquet files it seems like much more > efficient to use pyarrow.csv.read_csv in future, as soon as this becomes > available :) > I tried things like > `pyarrow.csv.read_csv(convert_types=ConvertOptions(columns_types=defaultdict(lambda: > 'string')))` but it doesn't work. > Maybe I just didnt' find something that already exists? :) -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-750) [Format] Add LargeBinary and LargeString types
[ https://issues.apache.org/jira/browse/ARROW-750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-750: - Component/s: C++ > [Format] Add LargeBinary and LargeString types > -- > > Key: ARROW-750 > URL: https://issues.apache.org/jira/browse/ARROW-750 > Project: Apache Arrow > Issue Type: New Feature > Components: C++, Format >Reporter: Wes McKinney >Priority: Major > Fix For: 1.0.0 > > > These are string and binary types that use 64-bit offsets. Java will not need > to implement these types for the time being, but they are needed when > representing very large datasets in C++ -- This message was sent by Atlassian JIRA (v7.6.14#76016)
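The need for these types follows directly from the layout: Binary/String arrays store element boundaries as int32 offsets, so a single array's values buffer tops out at 2^31 - 1 bytes (about 2 GiB), while LargeBinary/LargeString use int64 offsets. A back-of-the-envelope check:

```python
# int32 offsets cap a single Binary/String array's data buffer at ~2 GiB;
# int64 offsets (LargeBinary/LargeString) effectively remove the limit.
INT32_MAX = 2**31 - 1
INT64_MAX = 2**63 - 1

value_size = 1024  # assume 1 KiB per value, e.g. a small blob
max_values_32 = INT32_MAX // value_size
max_values_64 = INT64_MAX // value_size

print(max_values_32)           # 2097151 -- only ~2 million 1 KiB values
print(max_values_64 > 10**15)  # True -- unbounded for practical purposes
```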
[jira] [Updated] (ARROW-4810) [Format][C++] Add "LargeList" type with 64-bit offsets
[ https://issues.apache.org/jira/browse/ARROW-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-4810: -- Priority: Minor (was: Major) > [Format][C++] Add "LargeList" type with 64-bit offsets > -- > > Key: ARROW-4810 > URL: https://issues.apache.org/jira/browse/ARROW-4810 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Format >Reporter: Wes McKinney >Assignee: Philipp Moritz >Priority: Minor > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 5.5h > Remaining Estimate: 0h > > Mentioned in https://github.com/apache/arrow/issues/3845 -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-5811) [C++] CSV reader: Ability to not infer column types.
[ https://issues.apache.org/jira/browse/ARROW-5811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887166#comment-16887166 ] Wes McKinney commented on ARROW-5811: - I think we need to create an abstract C++ type (or similar) that is a {{ConversionRule}}. We have other types of conversion rules where we have not defined an API yet, for example "timestamp with strptime-like format of $FORMAT". Whatever API we have, it needs to be extensible to accommodate new kinds of logic > [C++] CSV reader: Ability to not infer column types. > > > Key: ARROW-5811 > URL: https://issues.apache.org/jira/browse/ARROW-5811 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Affects Versions: 0.13.0 > Environment: Ubuntu Xenial >Reporter: Bogdan Klichuk >Priority: Minor > Labels: csv, csvparser, pyarrow > Fix For: 1.0.0 > > > I'm trying to read CSV as is. All columns as strings. I don't know the schema > of these CSVs and they will vary as they are provided by user. > Right now i'm using pandas.read_csv(dtype=str) which works great, but since > final destination of these CSVs are parquet files it seems like much more > efficient to use pyarrow.csv.read_csv in future, as soon as this becomes > available :) > I tried things like > `pyarrow.csv.read_csv(convert_types=ConvertOptions(columns_types=defaultdict(lambda: > 'string')))` but it doesn't work. > Maybe I just didnt' find something that already exists? :) -- This message was sent by Atlassian JIRA (v7.6.14#76016)
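One way to read the {{ConversionRule}} suggestion above is an abstract rule object dispatched per column. Everything below — class and method names included — is a hypothetical Python sketch of the shape such an API could take, not anything that exists in Arrow:

```python
from abc import ABC, abstractmethod
from datetime import datetime

class ConversionRule(ABC):
    """Hypothetical per-column conversion rule (names are illustrative)."""
    @abstractmethod
    def convert(self, cell: str):
        ...

class AsString(ConversionRule):
    """Covers the 'no inference, keep everything as text' request."""
    def convert(self, cell: str) -> str:
        return cell

class TimestampWithFormat(ConversionRule):
    """'timestamp with strptime-like format of $FORMAT' as a rule object."""
    def __init__(self, fmt: str):
        self.fmt = fmt
    def convert(self, cell: str) -> datetime:
        return datetime.strptime(cell, self.fmt)

# Extensible: new conversion logic = new subclass; the reader stays unchanged.
rule = TimestampWithFormat("%Y-%m-%d")
print(rule.convert("2019-07-17"))  # 2019-07-17 00:00:00
```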
[jira] [Commented] (ARROW-5811) [C++] CSV reader: Ability to not infer column types.
[ https://issues.apache.org/jira/browse/ARROW-5811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887162#comment-16887162 ] Antoine Pitrou commented on ARROW-5811: --- [~wesmckinn] [~npr] do you have an idea about a desirable API here? > [C++] CSV reader: Ability to not infer column types. > > > Key: ARROW-5811 > URL: https://issues.apache.org/jira/browse/ARROW-5811 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Affects Versions: 0.13.0 > Environment: Ubuntu Xenial >Reporter: Bogdan Klichuk >Priority: Minor > Labels: csv, csvparser, pyarrow > Fix For: 1.0.0 > > > I'm trying to read CSV as is. All columns as strings. I don't know the schema > of these CSVs and they will vary as they are provided by user. > Right now i'm using pandas.read_csv(dtype=str) which works great, but since > final destination of these CSVs are parquet files it seems like much more > efficient to use pyarrow.csv.read_csv in future, as soon as this becomes > available :) > I tried things like > `pyarrow.csv.read_csv(convert_types=ConvertOptions(columns_types=defaultdict(lambda: > 'string')))` but it doesn't work. > Maybe I just didnt' find something that already exists? :) -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Closed] (ARROW-5839) [Python] Test manylinux2010 in CI
[ https://issues.apache.org/jira/browse/ARROW-5839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou closed ARROW-5839. - Resolution: Won't Fix > [Python] Test manylinux2010 in CI > - > > Key: ARROW-5839 > URL: https://issues.apache.org/jira/browse/ARROW-5839 > Project: Apache Arrow > Issue Type: Task > Components: Python >Reporter: Antoine Pitrou >Priority: Major > Fix For: 1.0.0 > > > Currently we test manylinux1 builds on Travis-CI. At some point we should > test manylinux2010 builds too. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-5839) [Python] Test manylinux2010 in CI
[ https://issues.apache.org/jira/browse/ARROW-5839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887160#comment-16887160 ] Antoine Pitrou commented on ARROW-5839: --- manylinux2010 is already tested on crossbow, so closing this since we aren't gonna test it on Travis. > [Python] Test manylinux2010 in CI > - > > Key: ARROW-5839 > URL: https://issues.apache.org/jira/browse/ARROW-5839 > Project: Apache Arrow > Issue Type: Task > Components: Python >Reporter: Antoine Pitrou >Priority: Major > Fix For: 1.0.0 > > > Currently we test manylinux1 builds on Travis-CI. At some point we should > test manylinux2010 builds too. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-5966) [Python] Capacity error when converting large string numpy array to arrow array
[ https://issues.apache.org/jira/browse/ARROW-5966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887156#comment-16887156 ] Antoine Pitrou commented on ARROW-5966: --- Which version are you using? > [Python] Capacity error when converting large string numpy array to arrow > array > --- > > Key: ARROW-5966 > URL: https://issues.apache.org/jira/browse/ARROW-5966 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.13.0, 0.14.0 >Reporter: Igor Yastrebov >Priority: Major > > Trying to create a large string array fails with > ArrowCapacityError: Encoded string length exceeds maximum size (2GB) > instead of creating a chunked array. > > A reproducible example: > {code:java} > import uuid > import numpy as np > import pyarrow as pa > li = [] > for i in range(1): > li.append(uuid.uuid4().hex) > arr = np.array(li) > parr = pa.array(arr) > {code} > Is it a regression or was it never properly fixed: > [https://github.com/apache/arrow/issues/1855]? > > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-5966) [Python] Capacity error when converting large string numpy array to arrow array
[ https://issues.apache.org/jira/browse/ARROW-5966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887155#comment-16887155 ] Antoine Pitrou commented on ARROW-5966: --- I am not seeing this issue: {code:python}
>>> import pyarrow as pa
>>> import numpy as np
>>> l = []
>>> x = b"x" * 1024
>>> for i in range(4 * (1024**2)):
...     l.append(x)
>>> arr = pa.array(l)
>>> arr.type
DataType(binary)
>>> type(arr)
pyarrow.lib.ChunkedArray
>>> len(arr)
4194304
>>> len(arr.chunks)
3
>>> del arr
>>> narr = np.array(l)
>>> narr.nbytes
4294967296
>>> arr = pa.array(narr)
>>> type(arr)
pyarrow.lib.ChunkedArray
>>> len(arr.chunks)
256
>>> len(arr)
4194304
{code} > [Python] Capacity error when converting large string numpy array to arrow > array > --- > > Key: ARROW-5966 > URL: https://issues.apache.org/jira/browse/ARROW-5966 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.13.0, 0.14.0 >Reporter: Igor Yastrebov >Priority: Major > > Trying to create a large string array fails with > ArrowCapacityError: Encoded string length exceeds maximum size (2GB) > instead of creating a chunked array. > > A reproducible example: > {code:java} > import uuid > import numpy as np > import pyarrow as pa > li = [] > for i in range(1): > li.append(uuid.uuid4().hex) > arr = np.array(li) > parr = pa.array(arr) > {code} > Is it a regression or was it never properly fixed: > [https://github.com/apache/arrow/issues/1855]? > > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
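The chunk count in the list-conversion path of that session is consistent with splitting at the int32 offset limit. A rough sanity check (assuming chunks are split when the ~2 GiB data-buffer cap for Binary arrays would be exceeded):

```python
import math

# 4 * 1024**2 values of 1024 bytes each = 4 GiB of binary data.
n_values = 4 * 1024**2
value_size = 1024
total_bytes = n_values * value_size  # matches narr.nbytes in the session

# int32 offsets cap each chunk's data buffer at 2**31 - 1 bytes.
max_chunk_bytes = 2**31 - 1
min_chunks = math.ceil(total_bytes / max_chunk_bytes)

print(total_bytes)  # 4294967296
print(min_chunks)   # 3 -- the chunk count seen for pa.array(l) above
```

The numpy path producing 256 chunks suggests it splits on a different (smaller) granularity; that is an observation about the session, not a documented guarantee.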
[jira] [Assigned] (ARROW-5963) [R] R Appveyor job does not test changes in the C++ library
[ https://issues.apache.org/jira/browse/ARROW-5963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-5963: -- Assignee: Neal Richardson > [R] R Appveyor job does not test changes in the C++ library > --- > > Key: ARROW-5963 > URL: https://issues.apache.org/jira/browse/ARROW-5963 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Wes McKinney >Assignee: Neal Richardson >Priority: Major > Fix For: 1.0.0 > > > It seems like master is being used > https://github.com/apache/arrow/blob/master/ci/PKGBUILD#L42 > I observed this in > https://ci.appveyor.com/project/wesm/arrow/builds/26030853/job/7vn8q3l8e24t83jh?fullLog=true > from this PR > https://github.com/apache/arrow/pull/4841 for ARROW-5893 -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-5965) [Python] Regression: segfault when reading hive table with v0.14
[ https://issues.apache.org/jira/browse/ARROW-5965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887142#comment-16887142 ] Wes McKinney commented on ARROW-5965: - A gdb backtrace would help us a lot. Do you know how to get one? > [Python] Regression: segfault when reading hive table with v0.14 > > > Key: ARROW-5965 > URL: https://issues.apache.org/jira/browse/ARROW-5965 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.14.0 >Reporter: H. Vetinari >Priority: Critical > Labels: parquet > > I'm working with pyarrow on a cloudera cluster (CDH 6.1.1), with pyarrow > installed in a conda env. > The data I'm reading is a hive(-registered) table written as parquet, and > with v0.13, reading this table (that is partitioned) does not cause any > issues. > The code that worked before and now crashes with v0.14 is simply: > ``` > import pyarrow.parquet as pq > pq.ParquetDataset('hdfs:///data/raw/source/table').read() > ``` > Since it completely crashes my notebook (resp. my REPL ends with "Killed"), I > cannot report much more, but this is a pretty severe usability restriction. > So far the solution is to enforce `pyarrow<0.14` -- This message was sent by Atlassian JIRA (v7.6.14#76016)
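Besides a gdb backtrace, a lighter-weight first step for a hard crash like this is the standard-library faulthandler module, which dumps the Python-level stack on SIGSEGV. This is a general debugging suggestion, not something from the ticket, and it complements rather than replaces the C-level backtrace gdb would give:

```python
import faulthandler
import sys

# Enable before the crashing call; on SIGSEGV/SIGFPE/SIGABRT the interpreter
# writes the Python traceback of every thread to stderr before dying.
faulthandler.enable(file=sys.stderr)

print(faulthandler.is_enabled())  # True

# Then run the crashing code, e.g.:
# import pyarrow.parquet as pq
# pq.ParquetDataset('hdfs:///data/raw/source/table').read()
```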
[jira] [Commented] (ARROW-5965) [Python] Regression: segfault when reading hive table with v0.14
[ https://issues.apache.org/jira/browse/ARROW-5965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887129#comment-16887129 ] H. Vetinari commented on ARROW-5965: Hey Neal, I tried a couple of times before filing the report, and all (~5) invocations on 0.14 crashed, and all invocations on 0.13 worked. The machine itself has lots of memory, so I don't think it's that. Not sure I'll be able to pare this down to a minimal reproducing parquet file. I'll try. > [Python] Regression: segfault when reading hive table with v0.14 > > > Key: ARROW-5965 > URL: https://issues.apache.org/jira/browse/ARROW-5965 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.14.0 >Reporter: H. Vetinari >Priority: Critical > Labels: parquet > > I'm working with pyarrow on a cloudera cluster (CDH 6.1.1), with pyarrow > installed in a conda env. > The data I'm reading is a hive(-registered) table written as parquet, and > with v0.13, reading this table (that is partitioned) does not cause any > issues. > The code that worked before and now crashes with v0.14 is simply: > ``` > import pyarrow.parquet as pq > pq.ParquetDataset('hdfs:///data/raw/source/table').read() > ``` > Since it completely crashes my notebook (resp. my REPL ends with "Killed"), I > cannot report much more, but this is a pretty severe usability restriction. > So far the solution is to enforce `pyarrow<0.14` -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-5965) [Python] Regression: segfault when reading hive table with v0.14
[ https://issues.apache.org/jira/browse/ARROW-5965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887119#comment-16887119 ] Neal Richardson commented on ARROW-5965: Thanks for the report. A few questions: # Is this reproducible if you try again with the same file? (I wonder if "Killed" means OOM and not segfault) # Could you provide a (preferably as small as possible) Parquet file that triggers this behavior? I think we'll need that in order to identify and fix any issues. > [Python] Regression: segfault when reading hive table with v0.14 > > > Key: ARROW-5965 > URL: https://issues.apache.org/jira/browse/ARROW-5965 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.14.0 >Reporter: H. Vetinari >Priority: Critical > Labels: parquet > > I'm working with pyarrow on a cloudera cluster (CDH 6.1.1), with pyarrow > installed in a conda env. > The data I'm reading is a hive(-registered) table written as parquet, and > with v0.13, reading this table (that is partitioned) does not cause any > issues. > The code that worked before and now crashes with v0.14 is simply: > ``` > import pyarrow.parquet as pq > pq.ParquetDataset('hdfs:///data/raw/source/table').read() > ``` > Since it completely crashes my notebook (resp. my REPL ends with "Killed"), I > cannot report much more, but this is a pretty severe usability restriction. > So far the solution is to enforce `pyarrow<0.14` -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-5965) [Python] Regression: segfault when reading hive table with v0.14
[ https://issues.apache.org/jira/browse/ARROW-5965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-5965: --- Labels: parquet (was: ) > [Python] Regression: segfault when reading hive table with v0.14 > > > Key: ARROW-5965 > URL: https://issues.apache.org/jira/browse/ARROW-5965 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.14.0 >Reporter: H. Vetinari >Priority: Critical > Labels: parquet > > I'm working with pyarrow on a cloudera cluster (CDH 6.1.1), with pyarrow > installed in a conda env. > The data I'm reading is a hive(-registered) table written as parquet, and > with v0.13, reading this table (that is partitioned) does not cause any > issues. > The code that worked before and now crashes with v0.14 is simply: > ``` > import pyarrow.parquet as pq > pq.ParquetDataset('hdfs:///data/raw/source/table').read() > ``` > Since it completely crashes my notebook (resp. my REPL ends with "Killed"), I > cannot report much more, but this is a pretty severe usability restriction. > So far the solution is to enforce `pyarrow<0.14` -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-5965) [Python] Regression: segfault when reading hive table with v0.14
[ https://issues.apache.org/jira/browse/ARROW-5965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-5965: --- Component/s: Python > [Python] Regression: segfault when reading hive table with v0.14 > > > Key: ARROW-5965 > URL: https://issues.apache.org/jira/browse/ARROW-5965 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.14.0 >Reporter: H. Vetinari >Priority: Critical > > I'm working with pyarrow on a cloudera cluster (CDH 6.1.1), with pyarrow > installed in a conda env. > The data I'm reading is a hive(-registered) table written as parquet, and > with v0.13, reading this table (that is partitioned) does not cause any > issues. > The code that worked before and now crashes with v0.14 is simply: > ``` > import pyarrow.parquet as pq > pq.ParquetDataset('hdfs:///data/raw/source/table').read() > ``` > Since it completely crashes my notebook (resp. my REPL ends with "Killed"), I > cannot report much more, but this is a pretty severe usability restriction. > So far the solution is to enforce `pyarrow<0.14` -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-5965) [Python] Regression: segfault when reading hive table with v0.14
[ https://issues.apache.org/jira/browse/ARROW-5965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-5965: --- Summary: [Python] Regression: segfault when reading hive table with v0.14 (was: Regression: segfault when reading hive table with v0.14) > [Python] Regression: segfault when reading hive table with v0.14 > > > Key: ARROW-5965 > URL: https://issues.apache.org/jira/browse/ARROW-5965 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 0.14.0 >Reporter: H. Vetinari >Priority: Critical > > I'm working with pyarrow on a cloudera cluster (CDH 6.1.1), with pyarrow > installed in a conda env. > The data I'm reading is a hive(-registered) table written as parquet, and > with v0.13, reading this table (that is partitioned) does not cause any > issues. > The code that worked before and now crashes with v0.14 is simply: > ``` > import pyarrow.parquet as pq > pq.ParquetDataset('hdfs:///data/raw/source/table').read() > ``` > Since it completely crashes my notebook (resp. my REPL ends with "Killed"), I > cannot report much more, but this is a pretty severe usability restriction. > So far the solution is to enforce `pyarrow<0.14` -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Assigned] (ARROW-5747) [C++] Better column name and header support in CSV reader
[ https://issues.apache.org/jira/browse/ARROW-5747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou reassigned ARROW-5747: - Assignee: Antoine Pitrou > [C++] Better column name and header support in CSV reader > - > > Key: ARROW-5747 > URL: https://issues.apache.org/jira/browse/ARROW-5747 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 0.13.0 >Reporter: Neal Richardson >Assignee: Antoine Pitrou >Priority: Major > Labels: csv > Fix For: 1.0.0 > > > While working on ARROW-5500, I found a number of issues around the CSV parse > options {{header_rows}}: > * If header_rows is 0, [the reader > errors|https://github.com/apache/arrow/blob/8b0318a11bba2aa2cf39bff245ff916a3283d372/cpp/src/arrow/csv/reader.cc#L150] > * It's not possible to supply your own column names, as [this > TODO|https://github.com/apache/arrow/blob/8b0318a11bba2aa2cf39bff245ff916a3283d372/cpp/src/arrow/csv/reader.cc#L149] > notes. ARROW-4912 allows renaming columns after reading in, which _maybe_ is > enough as long as header_rows == 0 doesn't error, but then you can't > naturally specify column types in the convert options because that takes a > map of column name to type. > * If header_rows is > 1, every cell gets turned into a column name, so if > header_rows == 2, you get twice the number of column names as columns. This > doesn't error, but it leads to unexpected results. > IMO a better interface would be to have a {{skip_rows}} argument to let you > ignore a large header, and a {{column_names}} argument that, if provided, > gives the column names. If not provided, the first row after {{skip_rows}} is > taken to be the column names. If it were also possible for {{column_names}} > to take a {{false}} or {{null}} argument, then we could support the case of > autogenerating names when none are provided and there's no header row. 
> Alternatively, we could use a boolean {{header}} argument to govern whether > the first (non-skipped) row should be interpreted as column names. (For > reference, R's > [readr|https://github.com/tidyverse/readr/blob/master/R/read_delim.R#L14-L27] > takes TRUE/FALSE/array of strings in one arg; the base > [read.csv|https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html] > uses separate args for header and col.names. Both have a {{skip}} argument.) > I don't think there's value in trying to be clever about multirow headers and > converting those to column names; if there's meaningful information in a tall > header, let the user parse it themselves. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
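The proposed {{skip_rows}} / {{column_names}} semantics can be stated precisely with a tiny pure-Python emulation. The option names mirror the proposal; this is an illustration of the intended behavior, not Arrow's implementation:

```python
import csv
import io

def read_csv_rows(text, skip_rows=0, column_names=None):
    """Emulate the proposed CSV option semantics:
    - skip the first `skip_rows` rows outright;
    - if `column_names` is given, use it and treat all remaining rows as data;
    - otherwise the first non-skipped row supplies the names."""
    rows = list(csv.reader(io.StringIO(text)))[skip_rows:]
    if column_names is None:
        column_names, rows = rows[0], rows[1:]
    return column_names, rows

text = "# generated 2019-07-17\na,b\n1,2\n3,4\n"

# Skip the comment line; the next row becomes the header.
names, rows = read_csv_rows(text, skip_rows=1)
print(names)  # ['a', 'b']

# Skip comment + header and supply names explicitly.
names2, rows2 = read_csv_rows(text, skip_rows=2, column_names=["x", "y"])
print(rows2)  # [['1', '2'], ['3', '4']]
```

With these two options the three problem cases in the comment (no header, user-supplied names, oversized header) all reduce to choices of `skip_rows` and `column_names`.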
[jira] [Resolved] (ARROW-5864) [Python] simplify cython wrapping of Result
[ https://issues.apache.org/jira/browse/ARROW-5864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-5864. --- Resolution: Fixed Fix Version/s: 1.0.0 Issue resolved by pull request 4848 [https://github.com/apache/arrow/pull/4848] > [Python] simplify cython wrapping of Result > --- > > Key: ARROW-5864 > URL: https://issues.apache.org/jira/browse/ARROW-5864 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Joris Van den Bossche >Assignee: Joris Van den Bossche >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > See answer in https://github.com/cython/cython/issues/3018 -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-5970) [Java] Provide pointer to Arrow buffer
[ https://issues.apache.org/jira/browse/ARROW-5970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-5970: -- Labels: pull-request-available (was: ) > [Java] Provide pointer to Arrow buffer > -- > > Key: ARROW-5970 > URL: https://issues.apache.org/jira/browse/ARROW-5970 > Project: Apache Arrow > Issue Type: New Feature > Components: Java >Reporter: Liya Fan >Assignee: Liya Fan >Priority: Major > Labels: pull-request-available > > Introduce pointer to a memory region within an ArrowBuf. > This pointer will be used as the basis for calculating the hash code within a > vector, and equality determination. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-5970) [Java] Provide pointer to Arrow buffer
Liya Fan created ARROW-5970: --- Summary: [Java] Provide pointer to Arrow buffer Key: ARROW-5970 URL: https://issues.apache.org/jira/browse/ARROW-5970 Project: Apache Arrow Issue Type: New Feature Components: Java Reporter: Liya Fan Assignee: Liya Fan Introduce pointer to a memory region within an ArrowBuf. This pointer will be used as the basis for calculating the hash code within a vector, and equality determination. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
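The idea — a lightweight (buffer, offset, length) triple used as the basis for hashing and equality of a memory region — can be sketched in Python. The real target is the Java ArrowBuf API; the class below is only an illustration of the concept:

```python
class BufPointer:
    """Points at `length` bytes starting at `offset` within `buf`
    (a pure-Python stand-in for a pointer into an ArrowBuf)."""
    def __init__(self, buf: bytes, offset: int, length: int):
        self.buf, self.offset, self.length = buf, offset, length

    def _region(self) -> bytes:
        return self.buf[self.offset:self.offset + self.length]

    def __eq__(self, other) -> bool:
        # Equal iff the pointed-to bytes are equal, regardless of which
        # buffer they live in or at what offset.
        return isinstance(other, BufPointer) and self._region() == other._region()

    def __hash__(self) -> int:
        return hash(self._region())

a = BufPointer(b"xxarrowyy", 2, 5)
b = BufPointer(b"arrow!", 0, 5)
print(a == b)              # True: both point at b"arrow"
print(hash(a) == hash(b))  # True: equal regions hash equally
```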
[jira] [Resolved] (ARROW-5969) [CI] [R] Lint failures
[ https://issues.apache.org/jira/browse/ARROW-5969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-5969. --- Resolution: Fixed Fix Version/s: 1.0.0 Issue resolved by pull request 4895 [https://github.com/apache/arrow/pull/4895] > [CI] [R] Lint failures > -- > > Key: ARROW-5969 > URL: https://issues.apache.org/jira/browse/ARROW-5969 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Continuous Integration, R >Reporter: Antoine Pitrou >Priority: Blocker > Fix For: 1.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-5969) [CI] [R] Lint failures
[ https://issues.apache.org/jira/browse/ARROW-5969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-5969: -- Labels: pull-request-available (was: ) > [CI] [R] Lint failures > -- > > Key: ARROW-5969 > URL: https://issues.apache.org/jira/browse/ARROW-5969 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Continuous Integration, R >Reporter: Antoine Pitrou >Priority: Blocker > Labels: pull-request-available > Fix For: 1.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Assigned] (ARROW-5969) [CI] [R] Lint failures
[ https://issues.apache.org/jira/browse/ARROW-5969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou reassigned ARROW-5969: - Assignee: Antoine Pitrou > [CI] [R] Lint failures > -- > > Key: ARROW-5969 > URL: https://issues.apache.org/jira/browse/ARROW-5969 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Continuous Integration, R >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Blocker > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-5969) [CI] [R] Lint failures
Antoine Pitrou created ARROW-5969: - Summary: [CI] [R] Lint failures Key: ARROW-5969 URL: https://issues.apache.org/jira/browse/ARROW-5969 Project: Apache Arrow Issue Type: Bug Components: C++, Continuous Integration, R Reporter: Antoine Pitrou -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-5968) [Java] Remove duplicate Preconditions check in JDBC adapter
[ https://issues.apache.org/jira/browse/ARROW-5968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-5968: -- Labels: pull-request-available (was: ) > [Java] Remove duplicate Preconditions check in JDBC adapter > --- > > Key: ARROW-5968 > URL: https://issues.apache.org/jira/browse/ARROW-5968 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Reporter: Ji Liu >Assignee: Ji Liu >Priority: Minor > Labels: pull-request-available > > Some Preconditions check are duplicate in {{JdbcToArrow#sqlToArrow}} -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-5968) [Java] Remove duplicate Preconditions check in JDBC adapter
Ji Liu created ARROW-5968: - Summary: [Java] Remove duplicate Preconditions check in JDBC adapter Key: ARROW-5968 URL: https://issues.apache.org/jira/browse/ARROW-5968 Project: Apache Arrow Issue Type: Bug Components: Java Reporter: Ji Liu Assignee: Ji Liu Some Preconditions check are duplicate in {{JdbcToArrow#sqlToArrow}} -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-5967) [Java] DateUtility#timeZoneList is not correct
[ https://issues.apache.org/jira/browse/ARROW-5967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ji Liu updated ARROW-5967: -- Component/s: Java > [Java] DateUtility#timeZoneList is not correct > -- > > Key: ARROW-5967 > URL: https://issues.apache.org/jira/browse/ARROW-5967 > Project: Apache Arrow > Issue Type: Improvement > Components: Java >Reporter: Ji Liu >Assignee: Ji Liu >Priority: Minor > > Now {{timeZoneList}} in {{DateUtility}} belongs to Joda time. > Since we replaced Joda time with Java time in ARROW-2015, this should > also be changed. > {{TimeStampXXTZVectors}} have a timezone member which seems unused now, and > their {{getObject}} returns Long (different from {{TimeStampXXVectors}}, > which return {{LocalDateTime}}); should it return {{LocalDateTime}} with its > timezone? > Is it reasonable if we do as follows: > # replace Joda {{timezoneList}} with Java {{timezoneList}} in {{DateUtility}} > # add a method like {{getLocalDateTimeFromEpochMilli(long epochMillis, String > timezone)}} in DateUtility > # possibly make {{TimeStampXXTZVectors}} return {{LocalDateTime}}? > cc [~emkornfi...@gmail.com] [~bryanc] [~siddteotia] -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-5967) [Java] DateUtility#timeZoneList is not correct
Ji Liu created ARROW-5967: - Summary: [Java] DateUtility#timeZoneList is not correct Key: ARROW-5967 URL: https://issues.apache.org/jira/browse/ARROW-5967 Project: Apache Arrow Issue Type: Improvement Reporter: Ji Liu Assignee: Ji Liu Now {{timeZoneList}} in {{DateUtility}} belongs to Joda time. Since we replaced Joda time with Java time in ARROW-2015, this should also be changed. {{TimeStampXXTZVectors}} have a timezone member which seems unused now, and their {{getObject}} returns Long (different from {{TimeStampXXVectors}}, which return {{LocalDateTime}}); should it return {{LocalDateTime}} with its timezone? Is it reasonable if we do as follows: # replace Joda {{timezoneList}} with Java {{timezoneList}} in {{DateUtility}} # add a method like {{getLocalDateTimeFromEpochMilli(long epochMillis, String timezone)}} in DateUtility # possibly make {{TimeStampXXTZVectors}} return {{LocalDateTime}}? cc [~emkornfi...@gmail.com] [~bryanc] [~siddteotia] -- This message was sent by Atlassian JIRA (v7.6.14#76016)
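The proposed helper {{getLocalDateTimeFromEpochMilli(long epochMillis, String timezone)}} corresponds to a conversion like the following Python/stdlib sketch (zoneinfo, Python 3.9+); the Java version would use java.time, and the function name here just mirrors the proposal:

```python
from datetime import datetime, timezone

from zoneinfo import ZoneInfo

def local_datetime_from_epoch_milli(epoch_millis: int, tz_name: str) -> datetime:
    """Interpret epoch milliseconds (UTC) as a wall-clock time in the
    given IANA time zone -- the analogue of the proposed DateUtility helper."""
    utc = datetime.fromtimestamp(epoch_millis / 1000, tz=timezone.utc)
    return utc.astimezone(ZoneInfo(tz_name))

dt = local_datetime_from_epoch_milli(0, "UTC")
print(dt.isoformat())  # 1970-01-01T00:00:00+00:00
```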
[jira] [Updated] (ARROW-5957) [C++][Gandiva] Implement div function in Gandiva
[ https://issues.apache.org/jira/browse/ARROW-5957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prudhvi Porandla updated ARROW-5957: Description: Implement 'div' function for int32, int64, float32, and float64 (gandiva) types. div is integer division - divide and return quotient after discarding the fractional part. The function signature is {{type div(type, type)}} was: Implement 'div' function for int32, int64, float32, float64, and decimal128 (gandiva) types. div is integer division - divide and return quotient after discarding the fractional part. The function signature is {{type div(type, type)}} > [C++][Gandiva] Implement div function in Gandiva > > > Key: ARROW-5957 > URL: https://issues.apache.org/jira/browse/ARROW-5957 > Project: Apache Arrow > Issue Type: Task > Components: C++ - Gandiva >Reporter: Prudhvi Porandla >Assignee: Prudhvi Porandla >Priority: Minor > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > Implement 'div' function for int32, int64, float32, and float64 (gandiva) > types. > div is integer division - divide and return quotient after discarding the > fractional part. > The function signature is {{type div(type, type)}} > > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
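Note the wording "return quotient after discarding the fractional part": that is truncation toward zero (C-style division), which differs from Python's floor division for negative operands. A small illustration of the described semantics (not the Gandiva implementation itself):

```python
import math

def div(a, b):
    """Integer division that discards the fractional part, i.e. truncates
    toward zero, matching the 'div' description in the ticket."""
    return math.trunc(a / b)

print(div(7, 2))   # 3
print(div(-7, 2))  # -3 (truncation), whereas -7 // 2 == -4 (floor)
```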
[jira] [Updated] (ARROW-5966) [Python] Capacity error when converting large string numpy array to arrow array
[ https://issues.apache.org/jira/browse/ARROW-5966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Igor Yastrebov updated ARROW-5966: -- Description: Trying to create a large string array fails with ArrowCapacityError: Encoded string length exceeds maximum size (2GB) instead of creating a chunked array. A reproducible example: {code:java} import uuid import numpy as np import pyarrow as pa li = [] for i in range(1): li.append(uuid.uuid4().hex) arr = np.array(li) parr = pa.array(arr) {code} Is it a regression or was it never properly fixed: [https://github.com/apache/arrow/issues/1855]? was: Trying to create a large string array fails with ArrowCapacityError: Encoded string length exceeds maximum size (2GB) instead of creating a chunked array. A reproducible example: {code:java} import uuid import numpy as np import pyarrow as pa li = [] for i in range(1): li.append(uuid.uuid4().hex) arr = np.array(li) parr = pa.array(arr) {code} Is it a regression or was it never properly fixed: [link title|[https://github.com/apache/arrow/issues/1855]]? > [Python] Capacity error when converting large string numpy array to arrow > array > --- > > Key: ARROW-5966 > URL: https://issues.apache.org/jira/browse/ARROW-5966 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.13.0, 0.14.0 >Reporter: Igor Yastrebov >Priority: Major > > Trying to create a large string array fails with > ArrowCapacityError: Encoded string length exceeds maximum size (2GB) > instead of creating a chunked array. > > A reproducible example: > {code:java} > import uuid > import numpy as np > import pyarrow as pa > li = [] > for i in range(1): > li.append(uuid.uuid4().hex) > arr = np.array(li) > parr = pa.array(arr) > {code} > Is it a regression or was it never properly fixed: > [https://github.com/apache/arrow/issues/1855]? > > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-5966) [Python] Capacity error when converting large string numpy array to arrow array
[ https://issues.apache.org/jira/browse/ARROW-5966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Igor Yastrebov updated ARROW-5966:
----------------------------------

External issue URL: (was: https://github.com/apache/arrow/issues/1855)

Description:
Trying to create a large string array fails with ArrowCapacityError: Encoded string length exceeds maximum size (2GB) instead of creating a chunked array. A reproducible example:

{code:java}
import uuid

import numpy as np
import pyarrow as pa

li = []
for i in range(100000000):
    li.append(uuid.uuid4().hex)
arr = np.array(li)
parr = pa.array(arr)
{code}

Is it a regression or was it never properly fixed: [link title|[https://github.com/apache/arrow/issues/1855]]?

was:
Trying to create a large string array fails with ArrowCapacityError: Encoded string length exceeds maximum size (2GB) instead of creating a chunked array. A reproducible example:

{code:java}
import uuid

import numpy as np
import pyarrow as pa

li = []
for i in range(100000000):
    li.append(uuid.uuid4().hex)
arr = np.array(li)
parr = pa.array(arr)
{code}

Is it a regression or was it never properly fixed?


> [Python] Capacity error when converting large string numpy array to arrow array
> -------------------------------------------------------------------------------
>
>                 Key: ARROW-5966
>                 URL: https://issues.apache.org/jira/browse/ARROW-5966
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.13.0, 0.14.0
>            Reporter: Igor Yastrebov
>            Priority: Major
>
> Trying to create a large string array fails with ArrowCapacityError: Encoded string length exceeds maximum size (2GB) instead of creating a chunked array.
>
> A reproducible example:
> {code:java}
> import uuid
> import numpy as np
> import pyarrow as pa
> li = []
> for i in range(100000000):
>     li.append(uuid.uuid4().hex)
> arr = np.array(li)
> parr = pa.array(arr)
> {code}
> Is it a regression or was it never properly fixed: [link title|[https://github.com/apache/arrow/issues/1855]]?
>

-- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-5965) Regression: segfault when reading hive table with v0.14
H. Vetinari created ARROW-5965:
-------------------------------

Summary: Regression: segfault when reading hive table with v0.14
Key: ARROW-5965
URL: https://issues.apache.org/jira/browse/ARROW-5965
Project: Apache Arrow
Issue Type: Bug
Affects Versions: 0.14.0
Reporter: H. Vetinari

I'm working with pyarrow on a Cloudera cluster (CDH 6.1.1), with pyarrow installed in a conda env. The data I'm reading is a Hive(-registered) table written as Parquet, and with v0.13, reading this table (which is partitioned) does not cause any issues. The code that worked before and now crashes with v0.14 is simply:

```
import pyarrow.parquet as pq
pq.ParquetDataset('hdfs:///data/raw/source/table').read()
```

Since it completely crashes my notebook (or rather, my REPL ends with "Killed"), I cannot report much more, but this is a pretty severe usability restriction. So far the workaround is to enforce `pyarrow<0.14`.

-- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-5964) [C++][Gandiva] Cast double to decimal with rounding returns 0
[ https://issues.apache.org/jira/browse/ARROW-5964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-5964:
----------------------------------

Labels: pull-request-available (was: )

> [C++][Gandiva] Cast double to decimal with rounding returns 0
> -------------------------------------------------------------
>
>                 Key: ARROW-5964
>                 URL: https://issues.apache.org/jira/browse/ARROW-5964
>             Project: Apache Arrow
>          Issue Type: Bug
>            Reporter: Pindikura Ravindra
>            Assignee: Pindikura Ravindra
>            Priority: Major
>              Labels: pull-request-available
>
> Casting 1.15470053838 to decimal(18,0) gives 0; it should return 1.
> There is a bug in the overflow check after rounding.

-- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-5964) [C++][Gandiva] Cast double to decimal with rounding returns 0
Pindikura Ravindra created ARROW-5964:
--------------------------------------

Summary: [C++][Gandiva] Cast double to decimal with rounding returns 0
Key: ARROW-5964
URL: https://issues.apache.org/jira/browse/ARROW-5964
Project: Apache Arrow
Issue Type: Bug
Reporter: Pindikura Ravindra
Assignee: Pindikura Ravindra

Casting 1.15470053838 to decimal(18,0) gives 0; it should return 1. There is a bug in the overflow check after rounding.

-- This message was sent by Atlassian JIRA (v7.6.14#76016)
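The fix the ticket describes amounts to applying the overflow check to the rounded result rather than the pre-rounding value. A hedged Python sketch of the intended semantics (this models the behavior only, not Gandiva's actual C++ implementation; the function name is illustrative):

```python
from decimal import Decimal, ROUND_HALF_UP

def cast_double_to_decimal(value, precision, scale):
    # Shift by `scale`, round half-up to an integer unscaled value,
    # and only then check it against the precision bound. Checking
    # overflow before rounding is the kind of bug that can make
    # 1.15470053838 -> decimal(18, 0) come out as 0 instead of 1.
    unscaled = Decimal(str(value)).scaleb(scale).quantize(
        Decimal(1), rounding=ROUND_HALF_UP)
    if abs(unscaled) >= 10 ** precision:
        raise OverflowError(
            "value does not fit in decimal(%d, %d)" % (precision, scale))
    return int(unscaled)

print(cast_double_to_decimal(1.15470053838, 18, 0))  # -> 1
```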
[jira] [Updated] (ARROW-5957) [C++][Gandiva] Implement div function in Gandiva
[ https://issues.apache.org/jira/browse/ARROW-5957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-5957:
----------------------------------

Labels: pull-request-available (was: )

> [C++][Gandiva] Implement div function in Gandiva
> ------------------------------------------------
>
>                 Key: ARROW-5957
>                 URL: https://issues.apache.org/jira/browse/ARROW-5957
>             Project: Apache Arrow
>          Issue Type: Task
>          Components: C++ - Gandiva
>            Reporter: Prudhvi Porandla
>            Assignee: Prudhvi Porandla
>            Priority: Minor
>              Labels: pull-request-available
>
> Implement the 'div' function for int32, int64, float32, float64, and decimal128 (Gandiva) types.
> div is integer division: divide and return the quotient after discarding the fractional part.
> The function signature is {{type div(type, type)}}.

-- This message was sent by Atlassian JIRA (v7.6.14#76016)
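The semantics described above, quotient with the fractional part discarded, are truncated division, which differs from Python's flooring `//` when exactly one operand is negative. A small sketch of just that arithmetic (the real Gandiva function's null handling and divide-by-zero behavior are not modeled here):

```python
import math

def div(a, b):
    # Gandiva-style 'div': truncate the quotient toward zero.
    # Python's // floors instead, so e.g. -7 // 2 == -4 while
    # div(-7, 2) == -3.
    if isinstance(a, int) and isinstance(b, int):
        q = abs(a) // abs(b)
        return -q if (a < 0) != (b < 0) else q
    return float(math.trunc(a / b))

print(div(7, 2), div(-7, 2), div(7.5, 2.0))  # -> 3 -3 3.0
```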
[jira] [Resolved] (ARROW-5351) [Rust] Add support for take kernel functions
[ https://issues.apache.org/jira/browse/ARROW-5351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neville Dipale resolved ARROW-5351.
-----------------------------------

Resolution: Fixed
Fix Version/s: 0.14.1

Issue resolved by pull request 4330
[https://github.com/apache/arrow/pull/4330]

> [Rust] Add support for take kernel functions
> --------------------------------------------
>
>                 Key: ARROW-5351
>                 URL: https://issues.apache.org/jira/browse/ARROW-5351
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Rust
>            Reporter: Neville Dipale
>            Assignee: Neville Dipale
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.14.1
>
>          Time Spent: 9h 10m
>  Remaining Estimate: 0h
>
> Similar to https://issues.apache.org/jira/browse/ARROW-772, a take function would give us random access on arrays, which is useful for sorting and (potentially) filtering.

-- This message was sent by Atlassian JIRA (v7.6.14#76016)
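For readers unfamiliar with the kernel: `take` gathers elements at the given positions. A dependency-free toy sketch of the semantics (Arrow's real kernels operate on typed arrays with validity bitmaps; this models nulls with None):

```python
def take(values, indices):
    # Gather values[i] for each index; a null (None) index produces a
    # null output, mirroring the take kernel's null-index semantics.
    return [None if i is None else values[i] for i in indices]

print(take(["a", "b", "c", "d"], [3, 0, None, 1]))  # -> ['d', 'a', None, 'b']
```

Random access like this is the building block the ticket mentions for sorting (take by the sorted permutation) and filtering (take by the selected indices).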