[jira] [Commented] (ARROW-6849) [Python] can not read a parquet store containing a list of integers
[ https://issues.apache.org/jira/browse/ARROW-6849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949201#comment-16949201 ]

Joris Van den Bossche commented on ARROW-6849:
----------------------------------------------

[~selitvin] thanks for the issue report and reproducible example! This is indeed a regression in 0.15.0, see ARROW-6844. Going to close this issue as a duplicate in favor of ARROW-6844.

> [Python] can not read a parquet store containing a list of integers
> -------------------------------------------------------------------
>
>                 Key: ARROW-6849
>                 URL: https://issues.apache.org/jira/browse/ARROW-6849
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.15.0
>            Reporter: Yevgeni Litvin
>            Priority: Major
>         Attachments: test_bad_parquet.tgz
>
> A field having a type of list-of-ints cannot be read using the {{pyarrow.parquet.read_table}} function. It also fails with other field types (strings were observed, for example).
> This happens only in pyarrow 0.15.0. When downgrading to 0.14.1, the issue is not observed.
> pyspark version: 2.4.4 [^test_bad_parquet.tgz]
> Minimal snippet to reproduce the issue:
> {code:java}
> import pyarrow.parquet as pq
> from pyspark.sql import SparkSession
> from pyspark.sql.types import StructType, StructField, IntegerType, ArrayType, Row
>
> output_url = '/tmp/test_bad_parquet'
>
> spark = SparkSession.builder.getOrCreate()
> schema = StructType([StructField('int_fixed_size_list', ArrayType(IntegerType(), False), False)])
> rows = [Row(int_fixed_size_list=[1, 2, 3])]
> dataframe = spark.createDataFrame(rows, schema).write.mode('overwrite').parquet(output_url)
>
> pq.read_table(output_url)
> {code}
> I get an error:
> {code:java}
> Traceback (most recent call last):
>   File "/home/yevgeni/uatc/dataset-toolkit/repro_failure.py", line 13, in <module>
>     pq.read_table(output_url)
>   File "/home/yevgeni/uatc/.petastorm3.6/lib/python3.6/site-packages/pyarrow/parquet.py", line 1281, in read_table
>     use_pandas_metadata=use_pandas_metadata)
>   File "/home/yevgeni/uatc/.petastorm3.6/lib/python3.6/site-packages/pyarrow/parquet.py", line 1137, in read
>     use_pandas_metadata=use_pandas_metadata)
>   File "/home/yevgeni/uatc/.petastorm3.6/lib/python3.6/site-packages/pyarrow/parquet.py", line 605, in read
>     table = reader.read(**options)
>   File "/home/yevgeni/uatc/.petastorm3.6/lib/python3.6/site-packages/pyarrow/parquet.py", line 253, in read
>     use_threads=use_threads)
>   File "pyarrow/_parquet.pyx", line 1136, in pyarrow._parquet.ParquetReader.read_all
>   File "pyarrow/error.pxi", line 78, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Column data for field 0 with type list<item: int32 not null> is inconsistent with schema list<element: int32 not null>
> Process finished with exit code 1
> {code}
> A parquet store, as generated by the snippet, is attached.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Closed] (ARROW-6849) [Python] can not read a parquet store containing a list of integers
[ https://issues.apache.org/jira/browse/ARROW-6849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joris Van den Bossche closed ARROW-6849.
----------------------------------------
    Resolution: Duplicate
[jira] [Commented] (ARROW-6793) [R] Arrow C++ binary packaging for Linux
[ https://issues.apache.org/jira/browse/ARROW-6793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949227#comment-16949227 ]

Thomas Schm commented on ARROW-6793:
------------------------------------

Yes, I guess this ticket is addressing a subproblem of getting arrow into R on Linux. Solving this problem is unfortunately a huge task, and the information is fragmented across GitHub, Jira and several articles. It's a very unfortunate situation.

Trying to install apache/arrow/r from GitHub worked yesterday but fails today. Today's problem relates to a commit you made yesterday:

compression.cpp: In function 'bool util___Codec__IsAvailable(arrow::Compression::type)':
compression.cpp:37:10: error: 'IsAvailable' is not a member of 'arrow::util::Codec'
   return arrow::util::Codec::IsAvailable(codec);
          ^

Are the libraries I link to outdated? I did a fresh pull just a few minutes ago. Is there a way to specify a certain tag in the install-via-GitHub route?

> [R] Arrow C++ binary packaging for Linux
> ----------------------------------------
>
>                 Key: ARROW-6793
>                 URL: https://issues.apache.org/jira/browse/ARROW-6793
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>            Reporter: Neal Richardson
>            Assignee: Neal Richardson
>            Priority: Major
>             Fix For: 1.0.0
>
> Our current installation experience on Linux isn't ideal. Unless you've already installed the Arrow C++ library, when you install the R package, you get a shell that tells you to install the C++ library. That was a useful approach to allow us to get the package on CRAN, which makes it easy for macOS and Windows users to install, but it doesn't improve the installation experience for Linux users. This is an impediment to adoption of arrow not only by users but also by package maintainers who might want to depend on arrow.
> macOS and Windows have a better experience because at installation time, the configure scripts download and statically link a prebuilt C++ library. CRAN bundles the whole thing up and delivers that as a binary R package.
> Python wheels do a similar thing: they're binaries that contain all external dependencies. And there are pyarrow wheels for Linux. This suggests that we could do something similar for R: build a generic Linux binary of the C++ library and download it in the R package configure script at install time.
> I experimented with using the Arrow C++ binaries included in the Python wheels in R. See discussion at the end of ARROW-5956. This worked on macOS (not useful for R, but it proved the concept) and almost worked on Linux, but it turned out that the "manylinux2010" standard is too archaic to work with contemporary Rcpp.
> Proposal: do a similar workflow to what the manylinux2010 pyarrow build does, just with slightly more modern compiler/settings. Publish that C++ binary package to bintray. Then download it in the R configure script if a local/system package isn't found.
> Once we have a basic version working, test against various distros on [R-hub|https://builder.r-hub.io/advanced] to make sure we're solid everywhere and/or ensure the current fallback behavior when we encounter a distro that this doesn't work for. If necessary, we can make multiple flavors of this C++ binary for debian, centos, etc.
[jira] [Created] (ARROW-6850) [Java] Jdbc converter support Null type
Ji Liu created ARROW-6850:
------------------------------

             Summary: [Java] Jdbc converter support Null type
                 Key: ARROW-6850
                 URL: https://issues.apache.org/jira/browse/ARROW-6850
             Project: Apache Arrow
          Issue Type: New Feature
          Components: Java
            Reporter: Ji Liu
            Assignee: Ji Liu

java.sql.Types.NULL is not supported yet since we previously had no NullVector in the Java code. This could be implemented after ARROW-1638 (IPC roundtrip for the null type) is merged.
[jira] [Created] (ARROW-6851) Should OSError be FileNotFoundError?
Vladimir Filimonov created ARROW-6851:
-----------------------------------------

             Summary: Should OSError be FileNotFoundError?
                 Key: ARROW-6851
                 URL: https://issues.apache.org/jira/browse/ARROW-6851
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Python
    Affects Versions: 0.15.0
            Reporter: Vladimir Filimonov

On the read_table function - if the file is not found, an OSError is raised:

{code:java}
import pyarrow.parquet as pq
pq.read_table('example.parquet')
{code}

Should it rather be FileNotFoundError, which is more standard in such situations?
[jira] [Updated] (ARROW-6851) Should OSError be FileNotFoundError?
[ https://issues.apache.org/jira/browse/ARROW-6851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladimir Filimonov updated ARROW-6851: -- Description: On the read_table function - if the file is not found, an OSError ("Passed non-file path") is raised: {code:java} import pyarrow.parquet as pq pq.read_table('example.parquet') {code} Should it rather be FileNotFoundError which is more standard in such situations? was: On the read_table function - if the file is not found, a OSError is raised: {code:java} import pyarrow.parquet as pq pq.read_table('example.parquet') {code} Should it rather be FileNotFoundError which is more standard in such situations? > Should OSError be FileNotFoundError? > > > Key: ARROW-6851 > URL: https://issues.apache.org/jira/browse/ARROW-6851 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Affects Versions: 0.15.0 >Reporter: Vladimir Filimonov >Priority: Minor > > On the read_table function - if the file is not found, an OSError ("Passed > non-file path") is raised: > {code:java} > import pyarrow.parquet as pq > pq.read_table('example.parquet') > {code} > Should it rather be FileNotFoundError which is more standard in such > situations? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6851) Should OSError be FileNotFoundError?
[ https://issues.apache.org/jira/browse/ARROW-6851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoine Pitrou updated ARROW-6851:
----------------------------------
    Component/s: C++
[jira] [Updated] (ARROW-6851) Should OSError be FileNotFoundError?
[ https://issues.apache.org/jira/browse/ARROW-6851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoine Pitrou updated ARROW-6851:
----------------------------------
    Fix Version/s: 2.0.0
[jira] [Commented] (ARROW-6851) Should OSError be FileNotFoundError?
[ https://issues.apache.org/jira/browse/ARROW-6851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949273#comment-16949273 ]

Antoine Pitrou commented on ARROW-6851:
---------------------------------------

Certainly. But this will require some plumbing on the C++ side to remember {{errno}}.
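Until that errno plumbing lands, a caller can get the more idiomatic exception with a thin wrapper. The sketch below is illustrative only: `read_table_checked` is a hypothetical name, not a pyarrow API, and the existence check only covers local filesystem paths.

```python
import os

def read_table_checked(path, **kwargs):
    # Hypothetical wrapper: raise FileNotFoundError (already a subclass of
    # OSError) for missing local paths before delegating to pyarrow.
    if not os.path.exists(path):
        # This is what the C++ side could surface if it remembered errno == ENOENT.
        raise FileNotFoundError(2, "No such file or directory", path)
    import pyarrow.parquet as pq  # deferred import; the check itself needs no pyarrow
    return pq.read_table(path, **kwargs)
```

Because FileNotFoundError subclasses OSError, existing `except OSError` handlers keep working if the library makes this change.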
[jira] [Commented] (ARROW-6851) Should OSError be FileNotFoundError?
[ https://issues.apache.org/jira/browse/ARROW-6851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949274#comment-16949274 ]

Antoine Pitrou commented on ARROW-6851:
---------------------------------------

cc [~jorisvandenbossche]
[jira] [Updated] (ARROW-6704) [C++] Cast from timestamp to higher resolution does not check out of bounds timestamps
[ https://issues.apache.org/jira/browse/ARROW-6704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-6704:
----------------------------------
    Labels: pull-request-available  (was: )

> [C++] Cast from timestamp to higher resolution does not check out of bounds timestamps
> --------------------------------------------------------------------------------------
>
>                 Key: ARROW-6704
>                 URL: https://issues.apache.org/jira/browse/ARROW-6704
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Priority: Major
>              Labels: pull-request-available
>
> When casting eg {{timestamp('s')}} to {{timestamp('ns')}}, we do not check for out of bounds timestamps, giving "garbage" timestamps in the result:
> {code}
> In [74]: a_np = np.array(["2012-01-01", "2412-01-01"], dtype="datetime64[s]")
>
> In [75]: arr = pa.array(a_np)
>
> In [76]: arr
> Out[76]:
> [
>   2012-01-01 00:00:00,
>   2412-01-01 00:00:00
> ]
>
> In [77]: arr.cast(pa.timestamp('ns'))
> Out[77]:
> [
>   2012-01-01 00:00:00.0,
>   1827-06-13 00:25:26.290448384
> ]
> {code}
> Now, this is the same behaviour as numpy, so not sure we should do this. However, since we have a {{safe=True/False}}, I would expect that for {{safe=True}} we check this and for {{safe=False}} we do not check this. (numpy has a similar {{casting='safe'}} but also does not raise an error in that case).
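The {{safe=True}} check requested above boils down to verifying that multiplying by the resolution factor cannot overflow int64. A minimal Python sketch of that bound check (illustrative only; the real fix belongs in the C++ cast kernels, and these function names are hypothetical):

```python
INT64_MAX = 2**63 - 1
INT64_MIN = -(2**63)
S_TO_NS = 10**9  # seconds -> nanoseconds multiplier

def cast_s_to_ns(seconds, safe=True):
    # safe=True: reject values whose nanosecond representation overflows
    # int64. safe=False: wrap around, mimicking the current numpy-style
    # behaviour that produces "garbage" timestamps like 1827-06-13.
    if safe and not (INT64_MIN // S_TO_NS <= seconds <= INT64_MAX // S_TO_NS):
        raise OverflowError(f"timestamp {seconds}s out of bounds for ns resolution")
    ns = seconds * S_TO_NS
    if not safe:
        # simulate int64 wraparound
        ns = (ns - INT64_MIN) % 2**64 + INT64_MIN
    return ns
```

A timestamp in the year 2412 (roughly 1.39e10 seconds since the epoch) exceeds INT64_MAX // 10**9 ≈ 9.2e9, so the safe path raises where the unsafe path silently wraps.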
[jira] [Created] (ARROW-6852) [C++] memory-benchmark build failed on Arm64
Yuqi Gu created ARROW-6852:
------------------------------

             Summary: [C++] memory-benchmark build failed on Arm64
                 Key: ARROW-6852
                 URL: https://issues.apache.org/jira/browse/ARROW-6852
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Yuqi Gu
            Assignee: Yuqi Gu

After the new commit ARROW-6381 was merged to master, the build fails on Arm64 when -DARROW_BUILD_BENCHMARKS is enabled:

{code:java}
/home/builder/arrow/cpp/src/arrow/io/memory_benchmark.cc:205:31: error: 'kMemoryPerCore' was not declared in this scope
   const int64_t buffer_size = kMemoryPerCore;
                               ^~
/home/builder/arrow/cpp/src/arrow/io/memory_benchmark.cc:207:19: error: 'Buffer' was not declared in this scope
   std::shared_ptr<Buffer> src, dst;
                   ^~
/home/builder/arrow/cpp/src/arrow/io/memory_benchmark.cc:207:19: note: suggested alternative:
In file included from /home/builder/arrow/cpp/src/arrow/array.h:28:0,
                 from /home/builder/arrow/cpp/src/arrow/api.h:23,
                 from /home/builder/arrow/cpp/src/arrow/io/memory_benchmark.cc:20:
/home/builder/arrow/cpp/src/arrow/buffer.h:50:20: note: 'arrow::Buffer'
 class ARROW_EXPORT Buffer {
                    ^~
/home/builder/arrow/cpp/src/arrow/io/memory_benchmark.cc:207:25: error: template argument 1 is invalid
   std::shared_ptr<Buffer> src, dst;
...
{code}
[jira] [Updated] (ARROW-6852) [C++] memory-benchmark build failed on Arm64
[ https://issues.apache.org/jira/browse/ARROW-6852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-6852:
----------------------------------
    Labels: pull-request-available  (was: )
[jira] [Created] (ARROW-6853) [Java] Support vector and dictionary encoder use different hasher for calculating hashCode
Ji Liu created ARROW-6853:
-----------------------------

             Summary: [Java] Support vector and dictionary encoder use different hasher for calculating hashCode
                 Key: ARROW-6853
                 URL: https://issues.apache.org/jira/browse/ARROW-6853
             Project: Apache Arrow
          Issue Type: New Feature
          Components: Java
            Reporter: Ji Liu
            Assignee: Ji Liu

The Hasher interface was introduced in ARROW-5898 and now has two different implementations ({{MurmurHasher}} and {{SimpleHasher}}); more could be added in the future. Currently {{ValueVector#hashCode}} and {{DictionaryHashTable}} only use {{SimpleHasher}} for calculating hashCode. This issue enables them to use a different hasher, or even a user-defined hasher, for their own use cases.
[jira] [Updated] (ARROW-6853) [Java] Support vector and dictionary encoder use different hasher for calculating hashCode
[ https://issues.apache.org/jira/browse/ARROW-6853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ji Liu updated ARROW-6853:
--------------------------
    Description:
The Hasher interface was introduced in ARROW-5898 and now has two different implementations ({{MurmurHasher}} and {{SimpleHasher}}); more could be added in the future. Currently {{ValueVector#hashCode}} and {{DictionaryHashTable}} only use {{SimpleHasher}} for calculating hashCode. This issue enables them to use a different hasher, or even a user-defined hasher, for their own use cases.

  was:
Hasher interface was introduce in ARROW-5898 and now have two different implementations ({{MurmurHasher and }}{{SimpleHasher}}) and it could be more in the future. And currently {{ValueVector#hashCode}} and {{DictionaryHashTable}} only use {{SimpleHasher}} for calculating hashCode. This issue enables them to use different hasher or even user-defined hasher for their own use cases.
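The change described is a classic strategy pattern: inject the hasher instead of hard-coding it. A Python sketch of the idea (the class names mirror the Java ones for readability but are illustrative, and both hash functions are toy stand-ins, not the real SimpleHasher/MurmurHasher algorithms):

```python
class SimpleHasher:
    """Toy stand-in for Java's SimpleHasher strategy."""
    def hash_bytes(self, data: bytes) -> int:
        h = 0
        for b in data:
            h = (31 * h + b) & 0xFFFFFFFF  # 32-bit polynomial hash
        return h

class XorHasher:
    """A second (hypothetical) strategy, standing in for MurmurHasher."""
    def hash_bytes(self, data: bytes) -> int:
        h = 0x9747B28C  # arbitrary seed
        for b in data:
            h = ((h ^ b) * 0x01000193) & 0xFFFFFFFF
        return h

class DictionaryHashTable:
    """Sketch: the hasher is a constructor argument with a default,
    instead of being hard-coded to SimpleHasher."""
    def __init__(self, hasher=None):
        self._hasher = hasher if hasher is not None else SimpleHasher()

    def hash_code(self, data: bytes) -> int:
        return self._hasher.hash_bytes(data)
```

Existing callers keep the old behaviour through the default argument, while users with custom hashing needs pass their own implementation.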
[jira] [Updated] (ARROW-6853) [Java] Support vector and dictionary encoder use different hasher for calculating hashCode
[ https://issues.apache.org/jira/browse/ARROW-6853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-6853:
----------------------------------
    Labels: pull-request-available  (was: )
[jira] [Updated] (ARROW-6854) [Dataset][C++] RecordBatchProjector is not thread safe
[ https://issues.apache.org/jira/browse/ARROW-6854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Francois Saint-Jacques updated ARROW-6854:
------------------------------------------
    Summary: [Dataset][C++] RecordBatchProjector is not thread safe  (was: [Dataset] RecordBatchProjector is not thread safe)
[jira] [Created] (ARROW-6854) [Dataset] RecordBatchProjector is not thread safe
Francois Saint-Jacques created ARROW-6854:
---------------------------------------------

             Summary: [Dataset] RecordBatchProjector is not thread safe
                 Key: ARROW-6854
                 URL: https://issues.apache.org/jira/browse/ARROW-6854
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++
            Reporter: Francois Saint-Jacques

While working on ARROW-6769 I noted that RecordBatchProjector is not thread safe. My goal is to use this class to wrap the ScanTaskIterator in another ScanTaskIterator that projects, so producers (fragments) don't have to know about the schema. The issue is that ScanTasks are expected to run on concurrent threads, so the projector will be invoked by multiple threads.

The lack of concurrency safety is due to the adaptivity of input schemas: `SetInputSchema` stores state in a local cache. I suggest we refactor into 2 classes:

# `RecordBatchProjector`, which will work with a static `from` schema, i.e. no adaptivity. The schema is defined at construction time. This class is thread safe to invoke after construction since no local modification is done.
# `AdaptiveRecordBatchProjector`, which will have a cache map[schema_hash, std::shared_ptr<RecordBatchProjector>] protected with a mutex.
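The proposed split can be sketched in a few lines (Python here for brevity; all names are illustrative, not the actual C++ API): the static projector is immutable after construction, and the adaptive variant holds a lock only around its schema-keyed cache.

```python
import threading

class StaticProjector:
    """Analogue of the proposed RecordBatchProjector: the 'from' schema is
    fixed at construction, so invoking it needs no synchronization."""
    def __init__(self, from_schema, to_schema):
        self.from_schema = from_schema
        self.to_schema = to_schema

class AdaptiveProjector:
    """Analogue of AdaptiveRecordBatchProjector: a mutex-protected cache
    mapping schema hash -> StaticProjector."""
    def __init__(self, to_schema):
        self._to_schema = to_schema
        self._cache = {}           # schema hash -> StaticProjector
        self._lock = threading.Lock()

    def for_schema(self, from_schema):
        key = hash(from_schema)
        with self._lock:           # only the cache lookup/insert is guarded
            proj = self._cache.get(key)
            if proj is None:
                proj = StaticProjector(from_schema, self._to_schema)
                self._cache[key] = proj
        return proj                # safe to use without holding the lock
```

The critical section covers only cache access, so concurrent ScanTasks sharing a schema contend briefly on the lookup and then project lock-free.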
[jira] [Assigned] (ARROW-6390) [Python][Flight] Add Python documentation / tutorial for Flight
[ https://issues.apache.org/jira/browse/ARROW-6390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney reassigned ARROW-6390:
-----------------------------------
    Assignee:     (was: Wes McKinney)

> [Python][Flight] Add Python documentation / tutorial for Flight
> ---------------------------------------------------------------
>
>                 Key: ARROW-6390
>                 URL: https://issues.apache.org/jira/browse/ARROW-6390
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: FlightRPC, Python
>            Reporter: Wes McKinney
>            Priority: Major
>             Fix For: 1.0.0
>
> There is no Sphinx documentation for using Flight from Python. I have found that writing documentation is an effective way to uncover usability problems -- I would suggest we write comprehensive documentation for using Flight from Python as a way to refine the public Python API
[jira] [Updated] (ARROW-6390) [Python][Flight] Add Python documentation / tutorial for Flight
[ https://issues.apache.org/jira/browse/ARROW-6390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-6390:
--------------------------------
    Fix Version/s:     (was: 0.15.0)
                   1.0.0
[jira] [Commented] (ARROW-6390) [Python][Flight] Add Python documentation / tutorial for Flight
[ https://issues.apache.org/jira/browse/ARROW-6390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949487#comment-16949487 ]

Wes McKinney commented on ARROW-6390:
-------------------------------------

I haven't made much progress on this yet. I'll reassign when I do.
[jira] [Commented] (ARROW-5502) [R] file readers should mmap
[ https://issues.apache.org/jira/browse/ARROW-5502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949488#comment-16949488 ]

Wes McKinney commented on ARROW-5502:
-------------------------------------

Note that we stopped memory mapping by default in {{pyarrow.parquet}}.

> [R] file readers should mmap
> ----------------------------
>
>                 Key: ARROW-5502
>                 URL: https://issues.apache.org/jira/browse/ARROW-5502
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>            Reporter: Neal Richardson
>            Priority: Major
>             Fix For: 1.0.0
>
> Arrow is supposed to let you work with datasets bigger than memory. Memory mapping is a big part of that. It should be the default way that files are read in the `read_*` functions. To disable memory mapping, we could use a global `option()`, or a function argument, but that might clutter the interface. Or we could not give a choice and only fall back to not memory mapping if the platform/file system doesn't support it.
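What memory mapping buys for larger-than-memory datasets can be illustrated with Python's standard mmap module (a stdlib sketch of the concept only, not the Arrow C++ memory-mapping path that the R readers would use):

```python
import mmap
import os
import tempfile

# Write a file spanning multiple pages, then map it read-only: the OS
# faults pages in on demand instead of reading the whole file up front.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"arrow" * 4096)  # 20 KiB, several 4 KiB pages

with open(path, "rb") as f:
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    head = mapped[:5]   # touches only the first page
    tail = mapped[-5:]  # touches only the last page
    mapped.close()
os.remove(path)
```

Only the pages actually sliced are brought into memory, which is why mmap-by-default is attractive for the `read_*` functions on big files, and why a fallback is needed on filesystems that don't support it.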
[jira] [Commented] (ARROW-6793) [R] Arrow C++ binary packaging for Linux
[ https://issues.apache.org/jira/browse/ARROW-6793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949489#comment-16949489 ]

Wes McKinney commented on ARROW-6793:
-------------------------------------

If you're building from master, you need to build both the C++ and R libraries from master. In general the git revision of both libraries should be the same.
[jira] [Created] (ARROW-6855) [C++][Python][Flight] Implement Flight middleware
Wes McKinney created ARROW-6855: --- Summary: [C++][Python][Flight] Implement Flight middleware Key: ARROW-6855 URL: https://issues.apache.org/jira/browse/ARROW-6855 Project: Apache Arrow Issue Type: New Feature Components: C++, Python Reporter: Wes McKinney Assignee: David Li Fix For: 1.0.0 C++/Python side of ARROW-6074 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-6855) [C++][Python][Flight] Implement Flight middleware
[ https://issues.apache.org/jira/browse/ARROW-6855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-6855. - Resolution: Fixed Issue resolved by pull request 5552 [https://github.com/apache/arrow/pull/5552] > [C++][Python][Flight] Implement Flight middleware > - > > Key: ARROW-6855 > URL: https://issues.apache.org/jira/browse/ARROW-6855 > Project: Apache Arrow > Issue Type: New Feature > Components: C++, Python >Reporter: Wes McKinney >Assignee: David Li >Priority: Major > Fix For: 1.0.0 > > > C++/Python side of ARROW-6074 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6855) [C++][Python][Flight] Implement Flight middleware
[ https://issues.apache.org/jira/browse/ARROW-6855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6855: -- Labels: pull-request-available (was: ) > [C++][Python][Flight] Implement Flight middleware > - > > Key: ARROW-6855 > URL: https://issues.apache.org/jira/browse/ARROW-6855 > Project: Apache Arrow > Issue Type: New Feature > Components: C++, Python >Reporter: Wes McKinney >Assignee: David Li >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > C++/Python side of ARROW-6074 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6856) [C++] Use ArrayData instead of Array for ArrayData::dictionary
Wes McKinney created ARROW-6856: --- Summary: [C++] Use ArrayData instead of Array for ArrayData::dictionary Key: ARROW-6856 URL: https://issues.apache.org/jira/browse/ARROW-6856 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 This would be helpful for consistency. {{DictionaryArray}} may want to cache a "boxed" version of this to return from {{DictionaryArray::dictionary}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-6846) [C++] Build failures with glog enabled
[ https://issues.apache.org/jira/browse/ARROW-6846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney closed ARROW-6846. --- > [C++] Build failures with glog enabled > -- > > Key: ARROW-6846 > URL: https://issues.apache.org/jira/browse/ARROW-6846 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Antoine Pitrou >Priority: Blocker > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 2h 20m > Remaining Estimate: 0h > > This has started appearing on Travis, e.g.: > https://travis-ci.org/apache/arrow/jobs/596181386#L3663 > {code} > In file included from > /home/travis/build/apache/arrow/cpp/src/arrow/util/logging.cc:29:0: > /home/travis/build/apache/arrow/pyarrow-test-3.6/include/glog/logging.h:994:0: > error: "DCHECK" redefined [-Werror] > #define DCHECK(condition) CHECK(condition) > > In file included from > /home/travis/build/apache/arrow/cpp/src/arrow/util/logging.cc:18:0: > /home/travis/build/apache/arrow/cpp/src/arrow/util/logging.h:130:0: note: > this is the location of the previous definition > #define DCHECK ARROW_CHECK > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-6835) [Archery][CMake] Restore ARROW_LINT_ONLY
[ https://issues.apache.org/jira/browse/ARROW-6835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-6835. - Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 5616 [https://github.com/apache/arrow/pull/5616] > [Archery][CMake] Restore ARROW_LINT_ONLY > -- > > Key: ARROW-6835 > URL: https://issues.apache.org/jira/browse/ARROW-6835 > Project: Apache Arrow > Issue Type: Bug > Components: Archery >Reporter: Francois Saint-Jacques >Assignee: Francois Saint-Jacques >Priority: Minor > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 1h 20m > Remaining Estimate: 0h > > This is used by developers to speed up the cmake build configuration and reduce the > required installed toolchains (notably libraries). This was yanked because > ARROW_LINT_ONLY effectively exits early and doesn't generate > `compile_commands.json`. > Restore this option, but ensure that archery toggles it according to the use > of iwyu or clang-tidy. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-6835) [Archery][CMake] Restore ARROW_LINT_ONLY
[ https://issues.apache.org/jira/browse/ARROW-6835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-6835: --- Assignee: Francois Saint-Jacques > [Archery][CMake] Restore ARROW_LINT_ONLY > -- > > Key: ARROW-6835 > URL: https://issues.apache.org/jira/browse/ARROW-6835 > Project: Apache Arrow > Issue Type: Bug > Components: Archery >Reporter: Francois Saint-Jacques >Assignee: Francois Saint-Jacques >Priority: Minor > Labels: pull-request-available > Time Spent: 1h 10m > Remaining Estimate: 0h > > This is used by developers to speed up the cmake build configuration and reduce the > required installed toolchains (notably libraries). This was yanked because > ARROW_LINT_ONLY effectively exits early and doesn't generate > `compile_commands.json`. > Restore this option, but ensure that archery toggles it according to the use > of iwyu or clang-tidy. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6857) Segfault for dictionary_encode on empty chunked_array (edge case)
Artem KOZHEVNIKOV created ARROW-6857: Summary: Segfault for dictionary_encode on empty chunked_array (edge case) Key: ARROW-6857 URL: https://issues.apache.org/jira/browse/ARROW-6857 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.15.0 Reporter: Artem KOZHEVNIKOV A reproducer is here: {code:python} import pyarrow as pa aa = pa.chunked_array([pa.array(['a', 'b', 'c'])]) aa[:0].dictionary_encode() # Segmentation fault: 11 {code} With pyarrow=0.14 I could not reproduce it. I use the conda version: "pyarrow 0.15.0 py37hdca360a_0 conda-forge" -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6851) [Python] Should OSError be FileNotFoundError?
[ https://issues.apache.org/jira/browse/ARROW-6851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-6851: Summary: [Python] Should OSError be FileNotFoundError? (was: Should OSError be FileNotFoundError?) > [Python] Should OSError be FileNotFoundError? > - > > Key: ARROW-6851 > URL: https://issues.apache.org/jira/browse/ARROW-6851 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Affects Versions: 0.15.0 >Reporter: Vladimir Filimonov >Priority: Minor > Fix For: 2.0.0 > > > On the read_table function - if the file is not found, an OSError ("Passed > non-file path") is raised: > {code:java} > import pyarrow.parquet as pq > pq.read_table('example.parquet') > {code} > Should it rather be FileNotFoundError which is more standard in such > situations? -- This message was sent by Atlassian Jira (v8.3.4#803005)
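On the question raised above: {{FileNotFoundError}} has been a subclass of {{OSError}} since Python 3.3, so switching to the more specific exception would stay backward compatible with existing {{except OSError}} handlers. A stdlib-only sketch (the path is hypothetical):

```python
# FileNotFoundError specializes OSError, so broad handlers keep working.
assert issubclass(FileNotFoundError, OSError)

caught = None
try:
    open("/nonexistent/example.parquet")  # hypothetical missing file
except OSError as exc:                    # the broad handler still catches it...
    caught = exc

# ...and callers who want the specific type can now distinguish it.
print(type(caught).__name__)  # FileNotFoundError
```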
[jira] [Updated] (ARROW-6852) [C++] memory-benchmark build failed on Arm64
[ https://issues.apache.org/jira/browse/ARROW-6852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-6852: Fix Version/s: 1.0.0 > [C++] memory-benchmark build failed on Arm64 > > > Key: ARROW-6852 > URL: https://issues.apache.org/jira/browse/ARROW-6852 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Yuqi Gu >Assignee: Yuqi Gu >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > After the new commit: ARROW-6381 was merged in master, > build would fail on Arm64 when DARROW_BUILD_BENCHMARKS is enabled: > > > {code:java} > /home/builder/arrow/cpp/src/arrow/io/memory_benchmark.cc:205:31: error: > 'kMemoryPerCore' was not declared in this scope > const int64_t buffer_size = kMemoryPerCore; > ^~ > /home/builder/arrow/cpp/src/arrow/io/memory_benchmark.cc:207:19: error: > 'Buffer' was not declared in this scope > std::shared_ptr src, dst; > ^~ > /home/builder/arrow/cpp/src/arrow/io/memory_benchmark.cc:207:19: note: > suggested alternative: > In file included from /home/builder/arrow/cpp/src/arrow/array.h:28:0, > from /home/builder/arrow/cpp/src/arrow/api.h:23, > from /home/builder/arrow/cpp/src/arrow/io/memory_benchmark.cc:20: > /home/builder/arrow/cpp/src/arrow/buffer.h:50:20: note: 'arrow::Buffer' > class ARROW_EXPORT Buffer { > ^~ > /home/builder/arrow/cpp/src/arrow/io/memory_benchmark.cc:207:25: error: > template argument 1 is invalid > std::shared_ptr src, dst; > ... > . > . > > {code} > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6858) [C++] Create Python script to handle transitive component dependencies
Wes McKinney created ARROW-6858: --- Summary: [C++] Create Python script to handle transitive component dependencies Key: ARROW-6858 URL: https://issues.apache.org/jira/browse/ARROW-6858 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 In the C++ build system, we are handling relationships between optional components in an ad hoc fashion https://github.com/apache/arrow/blob/master/cpp/CMakeLists.txt#L266 This doesn't seem ideal. As discussed on the mailing list, I suggest declaring dependencies in a Python data structure and then generating and checking in a .cmake file that can be {{include}}d. This will be a bit easier than maintaining this on an ad hoc basis. -- This message was sent by Atlassian Jira (v8.3.4#803005)
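The proposal above can be sketched as follows. The component names and the emitted variable naming here are illustrative assumptions, not Arrow's actual CMake options:

```python
# Hypothetical dependency graph declared as a Python data structure.
DEPENDENCIES = {
    "flight": ["ipc"],
    "ipc": ["compute"],
    "compute": [],
}

def transitive_deps(component, graph):
    """Depth-first walk returning every direct and indirect dependency."""
    seen = []
    stack = list(graph[component])
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.append(dep)
            stack.extend(graph[dep])
    return sorted(seen)

def emit_cmake(graph):
    """Render one set() line per component, suitable for checking in
    and pulling into the build with include()."""
    return "\n".join(
        'set(ARROW_{}_TRANSITIVE_DEPS "{}")'.format(
            name.upper(), ";".join(transitive_deps(name, graph)))
        for name in sorted(graph))

print(emit_cmake(DEPENDENCIES))
```

Checking in the generated fragment keeps CMakeLists.txt free of hand-maintained dependency lists while keeping the build reproducible without Python at configure time.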
[jira] [Assigned] (ARROW-6859) [CI][Nightly] Disable docker layer caching for CircleCI tasks
[ https://issues.apache.org/jira/browse/ARROW-6859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs reassigned ARROW-6859: -- Assignee: Krisztian Szucs > [CI][Nightly] Disable docker layer caching for CircleCI tasks > - > > Key: ARROW-6859 > URL: https://issues.apache.org/jira/browse/ARROW-6859 > Project: Apache Arrow > Issue Type: Bug > Components: Continuous Integration >Reporter: Krisztian Szucs >Assignee: Krisztian Szucs >Priority: Major > Fix For: 1.0.0 > > > CircleCI builds are failing because the layer caching is not available for > free plans. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6859) [CI][Nightly] Disable docker layer caching for CircleCI tasks
Krisztian Szucs created ARROW-6859: -- Summary: [CI][Nightly] Disable docker layer caching for CircleCI tasks Key: ARROW-6859 URL: https://issues.apache.org/jira/browse/ARROW-6859 Project: Apache Arrow Issue Type: Bug Components: Continuous Integration Reporter: Krisztian Szucs Fix For: 1.0.0 CircleCI builds are failing because the layer caching is not available for free plans. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6859) [CI][Nightly] Disable docker layer caching for CircleCI tasks
[ https://issues.apache.org/jira/browse/ARROW-6859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6859: -- Labels: pull-request-available (was: ) > [CI][Nightly] Disable docker layer caching for CircleCI tasks > - > > Key: ARROW-6859 > URL: https://issues.apache.org/jira/browse/ARROW-6859 > Project: Apache Arrow > Issue Type: Bug > Components: Continuous Integration >Reporter: Krisztian Szucs >Assignee: Krisztian Szucs >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > CircleCI builds are failing because the layer caching is not available for > free plans. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6860) [Python] Only link libarrow_flight.so to pyarrow._flight
Wes McKinney created ARROW-6860: --- Summary: [Python] Only link libarrow_flight.so to pyarrow._flight Key: ARROW-6860 URL: https://issues.apache.org/jira/browse/ARROW-6860 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Wes McKinney Fix For: 1.0.0 See BEAM-8368. We need to find a strategy to mitigate protobuf static linking issues with the Beam community -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6860) [Python] Only link libarrow_flight.so to pyarrow._flight
[ https://issues.apache.org/jira/browse/ARROW-6860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949652#comment-16949652 ] Antoine Pitrou commented on ARROW-6860: --- It's a general issue with our Cython extensions. We link them each with all Arrow DLLs (including gandiva AFAIR) > [Python] Only link libarrow_flight.so to pyarrow._flight > > > Key: ARROW-6860 > URL: https://issues.apache.org/jira/browse/ARROW-6860 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Wes McKinney >Priority: Major > Fix For: 1.0.0 > > > See BEAM-8368. We need to find a strategy to mitigate protobuf static linking > issues with the Beam community -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6860) [Python] Only link libarrow_flight.so to pyarrow._flight
[ https://issues.apache.org/jira/browse/ARROW-6860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949655#comment-16949655 ] Wes McKinney commented on ARROW-6860: - Yes, we'll have to make changes to python/CMakeLists.txt to link less monolithically. I can take a look at it > [Python] Only link libarrow_flight.so to pyarrow._flight > > > Key: ARROW-6860 > URL: https://issues.apache.org/jira/browse/ARROW-6860 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Wes McKinney >Priority: Major > Fix For: 1.0.0 > > > See BEAM-8368. We need to find a strategy to mitigate protobuf static linking > issues with the Beam community -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6861) With arrow-0.14.1-output Parquet dictionary column: Failure reading column: IOError: Arrow error: Invalid: Resize cannot downsize
Adam Hooper created ARROW-6861: -- Summary: With arrow-0.14.1-output Parquet dictionary column: Failure reading column: IOError: Arrow error: Invalid: Resize cannot downsize Key: ARROW-6861 URL: https://issues.apache.org/jira/browse/ARROW-6861 Project: Apache Arrow Issue Type: Bug Components: C++, Python Affects Versions: 0.15.0 Environment: debian:buster (in Docker, Linux 5.2.11-200.fc30.x86_64) Reporter: Adam Hooper Attachments: fix-dict-builder-capacity.diff I'll need to jump through hoops to upload the (seemingly-valid) Parquet file that triggers this bug. In the meantime, here's the error I get, reading the Parquet file with read_dictionary=true. I'll start with the stack trace: {{Failure reading column: IOError: Arrow error: Invalid: Resize cannot downsize}} {{#0 0x00b9fffd in __cxa_throw ()}} {{#1 0x004ce7b5 in parquet::PlainByteArrayDecoder::DecodeArrow (this=0x56612e50, num_values=67339, null_count=0, valid_bits=0x7f39a764b780 '\377' ..., valid_bits_offset=748544,}} \{{ builder=0x56616330) at /src/apache-arrow-0.15.0/cpp/src/parquet/encoding.cc:886}} {{#2 0x0046d703 in parquet::internal::ByteArrayDictionaryRecordReader::ReadValuesSpaced (this=0x56616260, values_to_read=67339, null_count=0)}} \{{ at /src/apache-arrow-0.15.0/cpp/src/parquet/column_reader.cc:1314}} {{#3 0x004a13f8 in parquet::internal::TypedRecordReader >::ReadRecordData (this=0x56616260, num_records=67339)}} \{{ at /src/apache-arrow-0.15.0/cpp/src/parquet/column_reader.cc:1096}} {{#4 0x00493876 in parquet::internal::TypedRecordReader >::ReadRecords (this=0x56616260, num_records=815883)}} \{{ at /src/apache-arrow-0.15.0/cpp/src/parquet/column_reader.cc:875}} {{#5 0x00413955 in parquet::arrow::LeafReader::NextBatch (this=0x56615640, records_to_read=815883, out=0x7ffd4b5afab0) at /src/apache-arrow-0.15.0/cpp/src/parquet/arrow/reader.cc:413}} {{#6 0x00412081 in parquet::arrow::FileReaderImpl::ReadColumn (this=0x566067a0, i=7, row_groups=..., out=0x7ffd4b5afab0) at 
/src/apache-arrow-0.15.0/cpp/src/parquet/arrow/reader.cc:218}} {{#7 0x004121b0 in parquet::arrow::FileReaderImpl::ReadColumn (this=0x566067a0, i=7, out=0x7ffd4b5afab0) at /src/apache-arrow-0.15.0/cpp/src/parquet/arrow/reader.cc:223}} {{#8 0x00405fbd in readParquet(std::__cxx11::basic_string, std::allocator > const&) ()}} And now a report of my gdb adventures: In Arrow 0.15.0, when reading a particular dictionary column ({{read_dictionaries=true}}) with 815883 rows that was written by Arrow 0.14.1, {{arrow::Dictionary32Builder::AppendIndices(...)}} is called twice (once with 493568 values, once with 254976 values); and then {{PlainByteArrayDecoder::DecodeArrow()}} is called. (I'm a novice; I don't know why this column comes in three batches.) On first {{AppendIndices()}} call, the buffer capacity is equal to the number of values. On second call, that's no longer the case: the buffer grows using {{BufferBuilder::GrowByFactor}}, so its capacity is 987136. But there's a bug: the 987136-capacity buffer is in {{Dictionary32Builder::indices_builder_}}; so 987136 is stored in {{Dictionary32Builder::indices_builder_.capacity_}}. {{Dictionary32Builder::capacity_}} does not change when {{AppendIndices()}} is called. (Dictionary32Builder behaves like a proxy for its {{indices_builder_}}; but its {{capacity()}} method is not virtual, so things are messy.) So {{builder.capacity_}} is 0. Then comes the final batch of 67339 values, via {{DecodeArrow()}}. It calls {{builder->Reserve(num_values)}}. But {{builder->Reserve(num_values)}} tries to increase the capacity from 0 (its stale cached value) to {{length_ + num_values}} (815883). Since {{indices_builder_->capacity_}} is 987136, that's a downsize – which throws an exception. The only workaround I can find: use {{read_dictionaries=false}}. This affects Python, too. I've attached a patch that fixes the issue for my file. I don't know how to formulate a reduction, though, so I haven't contributed unit tests. 
I'm also not certain how FinishInternal is meant to work, so this definitely needs expert review. (FinishInternal was _definitely_ buggy before my patch; after my patch it _might_ be buggy but I don't know.) -- This message was sent by Atlassian Jira (v8.3.4#803005)
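The failure mode reported above, a wrapper builder whose cached capacity goes stale while the inner builder it delegates to keeps growing, can be reproduced in spirit with a plain-Python sketch. Class and method names here are illustrative, not Arrow's C++ API:

```python
class IndexBuilder:
    """Inner builder that actually owns the buffer and grows it."""
    def __init__(self):
        self.capacity = 0
        self.length = 0

    def append(self, n):
        if self.length + n > self.capacity:
            # Grow-by-factor: capacity can overshoot the requested size.
            self.capacity = max(self.length + n, self.capacity * 2)
        self.length += n

    def resize(self, new_capacity):
        if new_capacity < self.capacity:
            raise ValueError("Resize cannot downsize")
        self.capacity = new_capacity


class DictionaryBuilder:
    """Proxy that forwards appends but never updates its own capacity."""
    def __init__(self):
        self.inner = IndexBuilder()
        self.capacity = 0  # stays 0: the bug

    def append_indices(self, n):
        self.inner.append(n)  # self.capacity is not kept in sync

    def reserve(self, n):
        # Believes it must grow to length + n, which is smaller than the
        # inner builder's real capacity, so the resize looks like a downsize.
        if self.capacity < self.inner.length + n:
            self.inner.resize(self.inner.length + n)


b = DictionaryBuilder()
b.append_indices(493568)   # first batch: inner capacity grows to 493568
b.append_indices(254976)   # second batch: grow-by-factor -> capacity 987136
try:
    b.reserve(67339)       # final batch: asks inner to "grow" to 815883
    error = None
except ValueError as exc:
    error = str(exc)
print(error)  # Resize cannot downsize
```

The numbers mirror the batch sizes from the gdb session above; the essential ingredient is only that the proxy's cached capacity never tracks the inner builder's growth.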
[jira] [Updated] (ARROW-6861) arrow-0.15.0 reading arrow-0.14.1-output Parquet dictionary column: Failure reading column: IOError: Arrow error: Invalid: Resize cannot downsize
[ https://issues.apache.org/jira/browse/ARROW-6861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adam Hooper updated ARROW-6861: --- Summary: arrow-0.15.0 reading arrow-0.14.1-output Parquet dictionary column: Failure reading column: IOError: Arrow error: Invalid: Resize cannot downsize (was: With arrow-0.14.1-output Parquet dictionary column: Failure reading column: IOError: Arrow error: Invalid: Resize cannot downsize) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-6711) [C++] Consolidate Filter and Expression classes
[ https://issues.apache.org/jira/browse/ARROW-6711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ben Kietzman resolved ARROW-6711. - Resolution: Fixed Issue resolved by pull request 5594 [https://github.com/apache/arrow/pull/5594] > [C++] Consolidate Filter and Expression classes > --- > > Key: ARROW-6711 > URL: https://issues.apache.org/jira/browse/ARROW-6711 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Ben Kietzman >Assignee: Ben Kietzman >Priority: Major > Labels: dataset, pull-request-available > Fix For: 1.0.0 > > Time Spent: 3h 50m > Remaining Estimate: 0h > > There is unnecessary boilerplate required when using the Filter/Expression > classes. Filter is no longer necessary; it (and FilterVector) can be replaced > with Expression. Expression is sufficiently general that it can be subclassed > to provide any custom functionality which would have been added through a > GenericFilter (add some tests for this). > Additionally rows within RecordBatches yielded from a scan are not currently > filtered using Expression::Evaluate(). (Add tests ensuring both row filtering > and pruning obey Kleene logic) > Add some comments on the mechanism of {{Assume()}} too, and refactor it not > to return a Result (its failure modes are covered by {{Validate()}}) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6861) arrow-0.15.0 reading arrow-0.14.1-output Parquet dictionary column: Failure reading column: IOError: Arrow error: Invalid: Resize cannot downsize
[ https://issues.apache.org/jira/browse/ARROW-6861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adam Hooper updated ARROW-6861: --- Attachment: parquet-written-by-arrow-0-14-1.7z -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6861) arrow-0.15.0 reading arrow-0.14.1-output Parquet dictionary column: Failure reading column: IOError: Arrow error: Invalid: Resize cannot downsize
[ https://issues.apache.org/jira/browse/ARROW-6861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949738#comment-16949738 ] Adam Hooper commented on ARROW-6861: I've attached a Parquet file, written by Arrow 0.14.1, which causes this problem. Column 8 (among others) causes this problem. Most columns work fine. > arrow-0.15.0 reading arrow-0.14.1-output Parquet dictionary column: Failure > reading column: IOError: Arrow error: Invalid: Resize cannot downsize > - > > Key: ARROW-6861 > URL: https://issues.apache.org/jira/browse/ARROW-6861 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.15.0 > Environment: debian:buster (in Docker, Linux 5.2.11-200.fc30.x86_64) >Reporter: Adam Hooper >Priority: Major > Attachments: fix-dict-builder-capacity.diff, > parquet-written-by-arrow-0-14-1.7z > > > I'll need to jump through hoops to upload the (seemingly-valid) Parquet file > that triggers this bug. In the meantime, here's the error I get, reading the > Parquet file with read_dictionary=true. 
I'll start with the stack trace:
> {code:java}
> Failure reading column: IOError: Arrow error: Invalid: Resize cannot downsize
> #0  0x00b9fffd in __cxa_throw ()
> #1  0x004ce7b5 in parquet::PlainByteArrayDecoder::DecodeArrow (this=0x56612e50, num_values=67339, null_count=0, valid_bits=0x7f39a764b780 '\377' ..., valid_bits_offset=748544, builder=0x56616330) at /src/apache-arrow-0.15.0/cpp/src/parquet/encoding.cc:886
> #2  0x0046d703 in parquet::internal::ByteArrayDictionaryRecordReader::ReadValuesSpaced (this=0x56616260, values_to_read=67339, null_count=0) at /src/apache-arrow-0.15.0/cpp/src/parquet/column_reader.cc:1314
> #3  0x004a13f8 in parquet::internal::TypedRecordReader<...>::ReadRecordData (this=0x56616260, num_records=67339) at /src/apache-arrow-0.15.0/cpp/src/parquet/column_reader.cc:1096
> #4  0x00493876 in parquet::internal::TypedRecordReader<...>::ReadRecords (this=0x56616260, num_records=815883) at /src/apache-arrow-0.15.0/cpp/src/parquet/column_reader.cc:875
> #5  0x00413955 in parquet::arrow::LeafReader::NextBatch (this=0x56615640, records_to_read=815883, out=0x7ffd4b5afab0) at /src/apache-arrow-0.15.0/cpp/src/parquet/arrow/reader.cc:413
> #6  0x00412081 in parquet::arrow::FileReaderImpl::ReadColumn (this=0x566067a0, i=7, row_groups=..., out=0x7ffd4b5afab0) at /src/apache-arrow-0.15.0/cpp/src/parquet/arrow/reader.cc:218
> #7  0x004121b0 in parquet::arrow::FileReaderImpl::ReadColumn (this=0x566067a0, i=7, out=0x7ffd4b5afab0) at /src/apache-arrow-0.15.0/cpp/src/parquet/arrow/reader.cc:223
> #8  0x00405fbd in readParquet(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
> {code}
> (The template arguments in frames #3 and #4 were lost in transit; {{<...>}} marks the elision.)
> And now a report of my gdb adventures:
> In Arrow 0.15.0, when reading a particular dictionary column ({{read_dictionaries=true}}) with 815883 rows that was written by Arrow 0.14.1, {{arrow::Dictionary32Builder::AppendIndices(...)}} is called twice (once with 493568 values, once with 254976 values); and then {{PlainByteArrayDecoder::DecodeArrow()}} is called. (I'm a novice; I don't know why this column comes in three batches.) On the first {{AppendIndices()}} call, the buffer capacity is equal to the number of values. On the second call, that's no longer the case: the buffer grows using {{BufferBuilder::GrowByFactor}}, so its capacity is 987136.
> But there's a bug: the 987136-capacity buffer is in {{Dictionary32Builder::indices_builder_}}, so 987136 is stored in {{Dictionary32Builder::indices_builder_.capacity_}}. {{Dictionary32Builder::capacity_}} does not change when {{AppendIndices()}} is called. ({{Dictionary32Builder}} behaves like a proxy for its {{indices_builder_}}, but its {{capacity()}} method is not virtual, so things are messy.)
> So {{builder.capacity_}} is 0. Then comes the final batch of 67339 values, via {{DecodeArrow()}}. It calls {{builder->Reserve(num_values)}}, which tries to increase the capacity from 0 (its wrong, cached value) to {{length_ + num_values}} (815883). Since {{indices_builder_->capacity_}} is 987136, that's a downsize, which throws an exception.
> The only workaround I can find: use {{read_dictionaries=false}}.
> This affects Python, too.
> I've attached a patch that fixes the issue for my file. I don't know how to formulate a reduction, though, so I haven't contributed unit tests. I'm also not certain how FinishInternal is meant to work, so this definitely needs expert review. (FinishInternal was _definitely_ buggy before my patch; after my patch it _
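The stale-capacity interaction described in the report above can be modeled outside Arrow. The following plain-Python sketch is only an illustration of the reported mechanism, not the real C++ implementation: the class and field names mirror the ones in the gdb analysis, and the growth rule is a simple doubling stand-in for `BufferBuilder::GrowByFactor`. With the exact counts from the trace it reproduces the "Resize cannot downsize" failure:

```python
class IndicesBuilder:
    """Stand-in for the inner builder that actually owns the index buffer."""

    def __init__(self):
        self.capacity_ = 0
        self.length_ = 0

    def append_indices(self, n):
        needed = self.length_ + n
        if needed > self.capacity_:
            # Doubling growth, a stand-in for BufferBuilder::GrowByFactor.
            self.capacity_ = max(needed, 2 * self.capacity_)
        self.length_ += n

    def resize(self, new_capacity):
        if new_capacity < self.capacity_:
            raise ValueError("Resize cannot downsize")
        self.capacity_ = new_capacity


class Dictionary32Builder:
    """Proxies AppendIndices to indices_builder_ but keeps its own
    capacity_, which AppendIndices never updates -- the reported bug."""

    def __init__(self):
        self.indices_builder_ = IndicesBuilder()
        self.capacity_ = 0  # stays 0 even as the inner buffer grows

    def append_indices(self, n):
        self.indices_builder_.append_indices(n)
        # Bug: self.capacity_ is NOT updated here.

    def reserve(self, num_values):
        # Reserve computes its target from the stale outer capacity_ ...
        target = self.indices_builder_.length_ + num_values
        if target > self.capacity_:  # 815883 > 0, so we try to "grow"
            # ... but applies it to the inner builder, whose real
            # capacity (987136) is already larger: a downsize.
            self.indices_builder_.resize(target)


b = Dictionary32Builder()
b.append_indices(493568)  # first batch: inner capacity becomes 493568
b.append_indices(254976)  # second batch: inner capacity doubles to 987136
try:
    b.reserve(67339)      # final batch, as in DecodeArrow()
except ValueError as e:
    print(e)              # -> Resize cannot downsize
```

The toy reproduces the numbers in the report: after the two `append_indices` calls the inner capacity is 987136 while the outer cached capacity is still 0, so reserving for the final 67339 values requests 815883 and fails.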
[jira] [Commented] (ARROW-6860) [Python] Only link libarrow_flight.so to pyarrow._flight
[ https://issues.apache.org/jira/browse/ARROW-6860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949745#comment-16949745 ] David Li commented on ARROW-6860: - That won't help, as libarrow_python will still link against Flight. You'll need a libarrow_python_flight as well. > [Python] Only link libarrow_flight.so to pyarrow._flight > > > Key: ARROW-6860 > URL: https://issues.apache.org/jira/browse/ARROW-6860 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Wes McKinney >Priority: Major > Fix For: 1.0.0 > > > See BEAM-8368. We need to find a strategy to mitigate protobuf static linking > issues with the Beam community -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6793) [R] Arrow C++ binary packaging for Linux
[ https://issues.apache.org/jira/browse/ARROW-6793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949763#comment-16949763 ] Thomas Schm commented on ARROW-6793: Gosh, that's a big can. Is there a chance to keep the precompiled libraries (see https://arrow.apache.org/install/) somewhat in sync with a tagged version from GitHub? At the moment the libraries are all pointing to 0.15.0 etc., but CRAN is lagging and GitHub is somewhat ahead. Maybe it's a stupid idea in the first place to try to rely on these precompiled libraries? Or maybe one could install slightly outdated libraries to stay in sync with CRAN? > [R] Arrow C++ binary packaging for Linux > > > Key: ARROW-6793 > URL: https://issues.apache.org/jira/browse/ARROW-6793 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Major > Fix For: 1.0.0 > > > Our current installation experience on Linux isn't ideal. Unless you've > already installed the Arrow C++ library, when you install the R package, you > get a shell that tells you to install the C++ library. That was a useful > approach to allow us to get the package on CRAN, which makes it easy for > macOS and Windows users to install, but it doesn't improve the installation > experience for Linux users. This is an impediment to adoption of arrow not > only by users but also by package maintainers who might want to depend on > arrow. > macOS and Windows have a better experience because at installation time, the > configure scripts download and statically link a prebuilt C++ library. CRAN > bundles the whole thing up and delivers that as a binary R package. > Python wheels do a similar thing: they're binaries that contain all external > dependencies. And there are pyarrow wheels for Linux. 
This suggests that we > could do something similar for R: build a generic Linux binary of the C++ > library and download it in the R package configure script at install time. > I experimented with using the Arrow C++ binaries included in the Python > wheels in R. See discussion at the end of ARROW-5956. This worked on macOS > (not useful for R, but it proved the concept) and almost worked on Linux, but > it turned out that the "manylinux2010" standard is too archaic to work with > contemporary Rcpp. > Proposal: do a similar workflow to what the manylinux2010 pyarrow build does, > just with slightly more modern compiler/settings. Publish that C++ binary > package to bintray. Then download it in the R configure script if a > local/system package isn't found. > Once we have a basic version working, test against various distros on > [R-hub|https://builder.r-hub.io/advanced] to make sure we're solid everywhere > and/or ensure the current fallback behavior when we encounter a distro that > this doesn't work for. If necessary, we can make multiple flavors of this C++ > binary for debian, centos, etc. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6793) [R] Arrow C++ binary packaging for Linux
[ https://issues.apache.org/jira/browse/ARROW-6793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949770#comment-16949770 ] Neal Richardson commented on ARROW-6793: The binaries available on the install page _are_ in sync with tagged versions on GitHub, but you seem to be installing the head of the master branch (what you get if you do install_github without specifying a tag). If you want to use the built binary libraries for an official release version of the C++ library, you need to use the corresponding R package. You can get that from CRAN; it isn't lagging. In the output you pasted above, you were installing from a CRAN snapshot "https://mran.microsoft.com/snapshot/2019-09-19/". That's your lag.
[jira] [Issue Comment Deleted] (ARROW-6813) [Ruby] Arrow::Table.load with headers=true leads to exception in Arrow 0.15
[ https://issues.apache.org/jira/browse/ARROW-6813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rick Cobb updated ARROW-6813: - Comment: was deleted (was: As I looked more deeply at the issue, it appears that 0.15.0 completely reworks the notion of header parsing, and the most straightforward solution is to remove the `headers` option from the Ruby layer. Thus my PR. We've "repaired" our application code by removing the use of the option; we always had it set to `true` anyway, and that's the behavior with no options now.) > [Ruby] Arrow::Table.load with headers=true leads to exception in Arrow 0.15 > --- > > Key: ARROW-6813 > URL: https://issues.apache.org/jira/browse/ARROW-6813 > Project: Apache Arrow > Issue Type: Bug > Components: Ruby >Affects Versions: 0.15.0 > Environment: Ubuntu 18.04, Debian Stretch >Reporter: Rick Cobb >Assignee: Rick Cobb >Priority: Major > Labels: pull-request-available > Time Spent: 1.5h > Remaining Estimate: 0h > > ``` > Error: undefined method `n_header_rows=' for > # > ``` > It appears that 0.15 has changed the name for this option to `n_skip_rows` > > ``` > (byebug) options > #(byebug) > (options.methods - Object.new.methods).sort > [:add_column_name, :add_column_type, :add_column_type_raw, :add_false_value, > :add_null_value, :add_schema, :add_true_value, :allow_newlines_in_values=, > :allow_newlines_in_values?, :allow_null_strings=, :allow_null_strings?, > :bind_property, :block_size, :block_size=, :check_utf8=, :check_utf8?, > :column_names, :column_names=, :column_types, :delimiter, :delimiter=, > :destroyed?, :double_quoted=, :double_quoted?, :escape_character, > :escape_character=, :escaped=, :escaped?, :false_values, :false_values=, > :floating?, :freeze_notify, :generate_column_names=, :generate_column_names?, > :get_property, :gtype, :ignore_empty_lines=, :ignore_empty_lines?, > :n_skip_rows, :n_skip_rows=, :notify, :null_values, :null_values=, > :parent_instance, :quote_character, :quote_character=, 
:quoted=, :quoted?, > :ref_count, :set_allow_newlines_in_values, :set_allow_null_strings, > :set_block_size, :set_check_utf8, :set_column_names, :set_delimiter, > :set_double_quoted, :set_escape_character, :set_escaped, :set_false_values, > :set_generate_column_names, :set_ignore_empty_lines, :set_n_skip_rows, > :set_null_values, :set_property, :set_quote_character, :set_quoted, > :set_true_values, :set_use_threads, :signal_connect, :signal_connect_after, > :signal_emit, :signal_emit_stop, :signal_handler_block, > :signal_handler_disconnect, :signal_handler_is_connected?, > :signal_handler_unblock, :signal_has_handler_pending?, :thaw_notify, > :true_values, :true_values=, :type_name, :unref, :use_threads=, :use_threads?] > ``` -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6659) [Rust] [DataFusion] Refactor of HashAggregateExec to support custom merge
[ https://issues.apache.org/jira/browse/ARROW-6659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949779#comment-16949779 ] Kyle McCarthy commented on ARROW-6659: -- I am happy to help - and I would prefer to do it how you wanted/the right way! I am fairly unfamiliar with the codebase, so I am really just learning it by working through the open tasks, so this may be a dumb question. How do the LogicalPlan and partition count actually work together? From the tests it looks like the partition count is related to the batch size? If so, that would mean that every LogicalPlan would have the same partition count, right? Also, if we do add the LogicalPlan::Merge - that would mean that when the SQL Planner is creating a Logical Aggregate it would create: `Aggregate { Merge { Aggregate ( aggregate_input ) } }`? If so that definitely makes sense to me, but I am still not totally sure how the partition count would work into this. Thank you for your patience! > [Rust] [DataFusion] Refactor of HashAggregateExec to support custom merge > - > > Key: ARROW-6659 > URL: https://issues.apache.org/jira/browse/ARROW-6659 > Project: Apache Arrow > Issue Type: Sub-task > Components: Rust, Rust - DataFusion >Reporter: Andy Grove >Assignee: Kyle McCarthy >Priority: Major > > HashAggregateExec currently creates one HashPartition per input partition for > the initial aggregate per partition, and then explicitly calls MergeExec and > then creates another HashPartition for the final reduce operation. > This is fine for in-memory queries in DataFusion but is not extensible. For > example, it is not possible to provide a different MergeExec implementation > that would distribute queries to a cluster. 
> A better design would be to move the logic into the query planner so that the > physical plan contains explicit steps such as: > > {code:java} > - HashAggregate // final aggregate > - MergeExec > - HashAggregate // aggregate per partition > {code} > This would then make it easier to customize the plan in other projects, to > support distributed execution: > {code:java} > - HashAggregate // final aggregate >- MergeExec > - DistributedExec > - HashAggregate // aggregate per partition{code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
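The plan shapes sketched above can be written down as a toy tree. Python dataclasses are used here purely as neutral pseudocode for the tree structure; the node names (`HashAggregate`, `MergeExec`) follow the description in the issue, and this is a sketch of the proposed shape, not DataFusion's actual Rust API:

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Scan:
    table: str

@dataclass
class HashAggregate:
    """One aggregation step (per-partition or final)."""
    input: "Plan"

@dataclass
class MergeExec:
    """Merges the per-partition results into a single partition."""
    input: "Plan"

Plan = Union[Scan, HashAggregate, MergeExec]

# The proposed planner output: a final aggregate over the merged
# per-partition aggregates, i.e. Aggregate { Merge { Aggregate ( input ) } }.
plan = HashAggregate(MergeExec(HashAggregate(Scan("t"))))
```

The point of making the merge an explicit node is visible in the tree: a distributed engine could splice its own node under the merge (e.g. `HashAggregate(MergeExec(DistributedExec(HashAggregate(Scan("t")))))`) without changing HashAggregateExec itself.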
[jira] [Comment Edited] (ARROW-6659) [Rust] [DataFusion] Refactor of HashAggregateExec to support custom merge
[ https://issues.apache.org/jira/browse/ARROW-6659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949779#comment-16949779 ] Kyle McCarthy edited comment on ARROW-6659 at 10/11/19 8:32 PM: I am happy to help - and I would prefer to do it how you wanted/the right way! I am fairly unfamiliar with the codebase so I am really just learning it by working through the open tasks, so this may be a dumb question. How does the LogicalPlan and partition count actually work together. From the tests it looks like the partition count is related to the batch size? If so that would mean that every LogicalPlan would have the same partition count right? Also, if we do add the LogicalPlan::Merge - that would mean that when the SQL Planner is creating a Logical Aggregate it would create: {code:java} Aggregate { Merge { Aggregate ( aggregate_input ) } }{code} ? If so that definitely makes sense to me, but I am still not totally sure how the partition count would work into this. Thank you for your patience! was (Author: kylemccarthy): I am happy to help - and I would prefer to do it how you wanted/the right way! I am fairly unfamiliar with the codebase so I am really just learning it by working through the open tasks, so this may be a dumb question. How does the LogicalPlan and partition count actually work together. From the tests it looks like the partition count is related to the batch size? If so that would mean that every LogicalPlan would have the same partition count right? Also, if we do add the LogicalPlan::Merge - that would mean that when the SQL Planner is creating a Logical Aggregate it would create: `Aggregate { Merge { Aggregate ( aggregate_input ) } }`? If so that definitely makes sense to me, but I am still not totally sure how the partition count would work into this. Thank you for your patience! 
[jira] [Commented] (ARROW-6793) [R] Arrow C++ binary packaging for Linux
[ https://issues.apache.org/jira/browse/ARROW-6793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949785#comment-16949785 ] Thomas Schm commented on ARROW-6793: Awesome, that's good news. For everyone following this thread: to install a tagged version from GitHub, run
{code:java}
R -e 'remotes::install_github("apache/arrow/r@apache-arrow-0.15.0")'
{code}
or, with CRAN,
{code:java}
devtools::install_version("arrow", version = "0.15.0", repos = "http://cran.us.r-project.org")
{code}
Thanks for all your help on that issue. The documentation on downloading the precompiled libraries is unfortunately slightly outdated, but @kou is already on the case. If I understand the linking process correctly, there is no need to specify a version number for the precompiled libraries: Debian is merely given access to a software archive, and the compiler/linker can pick whichever library it needs. I couldn't agree more with the initial premise of this thread: the experience for people running arrow on Linux relying on these binary packages is not exactly ideal :-) Painful. Thanks again. Note that the documentation is too terse for people without deep knowledge of Debian and the way it accesses libraries, and/or without familiarity with devtools.
[jira] [Commented] (ARROW-6793) [R] Arrow C++ binary packaging for Linux
[ https://issues.apache.org/jira/browse/ARROW-6793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949791#comment-16949791 ] Neal Richardson commented on ARROW-6793: You don't need devtools/remotes if you want to install the current version. Just install it from CRAN.
[jira] [Updated] (ARROW-6861) [Python] arrow-0.15.0 reading arrow-0.14.1-output Parquet dictionary column: Failure reading column: IOError: Arrow error: Invalid: Resize cannot downsize
[ https://issues.apache.org/jira/browse/ARROW-6861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-6861: Summary: [Python] arrow-0.15.0 reading arrow-0.14.1-output Parquet dictionary column: Failure reading column: IOError: Arrow error: Invalid: Resize cannot downsize (was: arrow-0.15.0 reading arrow-0.14.1-output Parquet dictionary column: Failure reading column: IOError: Arrow error: Invalid: Resize cannot downsize)
[jira] [Updated] (ARROW-6861) [Python] arrow-0.15.0 reading arrow-0.14.1-output Parquet dictionary column: Failure reading column: IOError: Arrow error: Invalid: Resize cannot downsize
[ https://issues.apache.org/jira/browse/ARROW-6861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-6861: Fix Version/s: 1.0.0 > [Python] arrow-0.15.0 reading arrow-0.14.1-output Parquet dictionary column: > Failure reading column: IOError: Arrow error: Invalid: Resize cannot downsize > -- > > Key: ARROW-6861 > URL: https://issues.apache.org/jira/browse/ARROW-6861 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.15.0 > Environment: debian:buster (in Docker, Linux 5.2.11-200.fc30.x86_64) >Reporter: Adam Hooper >Priority: Major > Fix For: 1.0.0 > > Attachments: fix-dict-builder-capacity.diff, > parquet-written-by-arrow-0-14-1.7z > > > I'll need to jump through hoops to upload the (seemingly-valid) Parquet file > that triggers this bug. In the meantime, here's the error I get, reading the > Parquet file with read_dictionary=true. I'll start with the stack trace: > {{Failure reading column: IOError: Arrow error: Invalid: Resize cannot > downsize}} > {{#0 0x00b9fffd in __cxa_throw ()}} > {{#1 0x004ce7b5 in parquet::PlainByteArrayDecoder::DecodeArrow > (this=0x56612e50, num_values=67339, null_count=0, > valid_bits=0x7f39a764b780 '\377' ..., > valid_bits_offset=748544,}} > \{{ builder=0x56616330) at > /src/apache-arrow-0.15.0/cpp/src/parquet/encoding.cc:886}} > {{#2 0x0046d703 in > parquet::internal::ByteArrayDictionaryRecordReader::ReadValuesSpaced > (this=0x56616260, values_to_read=67339, null_count=0)}} > \{{ at /src/apache-arrow-0.15.0/cpp/src/parquet/column_reader.cc:1314}} > {{#3 0x004a13f8 in > parquet::internal::TypedRecordReader > >::ReadRecordData (this=0x56616260, num_records=67339)}} > \{{ at /src/apache-arrow-0.15.0/cpp/src/parquet/column_reader.cc:1096}} > {{#4 0x00493876 in > parquet::internal::TypedRecordReader > >::ReadRecords (this=0x56616260, num_records=815883)}} > \{{ at /src/apache-arrow-0.15.0/cpp/src/parquet/column_reader.cc:875}} > {{#5 0x00413955 in 
parquet::arrow::LeafReader::NextBatch > (this=0x56615640, records_to_read=815883, out=0x7ffd4b5afab0) at > /src/apache-arrow-0.15.0/cpp/src/parquet/arrow/reader.cc:413}} > {{#6 0x00412081 in parquet::arrow::FileReaderImpl::ReadColumn > (this=0x566067a0, i=7, row_groups=..., out=0x7ffd4b5afab0) at > /src/apache-arrow-0.15.0/cpp/src/parquet/arrow/reader.cc:218}} > {{#7 0x004121b0 in parquet::arrow::FileReaderImpl::ReadColumn > (this=0x566067a0, i=7, out=0x7ffd4b5afab0) at > /src/apache-arrow-0.15.0/cpp/src/parquet/arrow/reader.cc:223}} > {{#8 0x00405fbd in readParquet(std::__cxx11::basic_string std::char_traits, std::allocator > const&) ()}} > And now a report of my gdb adventures: > In Arrow 0.15.0, when reading a particular dictionary column > ({{read_dictionaries=true}}) with 815883 rows that was written by Arrow > 0.14.1, {{arrow::Dictionary32Builder::AppendIndices(...)}} > is called twice (once with 493568 values, once with 254976 values); and then > {{PlainByteArrayDecoder::DecodeArrow()}} is called. (I'm a novice; I don't > know why this column comes in three batches.) On first {{AppendIndices()}} > call, the buffer capacity is equal to the number of values. On second call, > that's no longer the case: the buffer grows using > {{BufferBuilder::GrowByFactor}}, so its capacity is 987136. > But there's a bug: the 987136-capacity buffer is in > {{Dictionary32Builder::indices_builder_}}; so 987136 is stored in > {{Dictionary32Builder::indices_builder_.capacity_}}. > {{Dictionary32Builder::capacity_}} does not change when {{AppendIndices()}} > is called. (Dictionary32Builder behaves like a proxy for its > {{indices_builder_}}; but its {{capacity()}} method is not virtual, so things > are messy.) > So {{builder.capacity_}} is 0. Then comes the final batch of 67339 values, > via {{DecodeArrow()}}. It calls {{builder->Reserve(num_values)}}. 
But > {{builder->Reserve(num_values)}} tries to increase the capacity from 0 (its > wrong cached value) to {{length_ + num_values}} (815883). Since > {{indices_builder_.capacity_}} is 987136, that's a downsize, which throws > an exception. > The only workaround I can find: use {{read_dictionaries=false}}. > This affects Python, too. > I've attached a patch that fixes the issue for my file. I don't know how to > formulate a reduction, though, so I haven't contributed unit tests. I'm also > not certain how FinishInternal is meant to work, so this definitely needs > expert review. (FinishInternal was _definitely_ buggy before my patch; after > my patch it _
[jira] [Commented] (ARROW-6861) [Python] arrow-0.15.0 reading arrow-0.14.1-output Parquet dictionary column: Failure reading column: IOError: Arrow error: Invalid: Resize cannot downsize
[ https://issues.apache.org/jira/browse/ARROW-6861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949801#comment-16949801 ] Wes McKinney commented on ARROW-6861: - Thanks. This should be enough information to help write a unit test to reproduce the issue. [~bkietz] are you interested in taking a look?
[jira] [Commented] (ARROW-6860) [Python] Only link libarrow_flight.so to pyarrow._flight
[ https://issues.apache.org/jira/browse/ARROW-6860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949804#comment-16949804 ] Wes McKinney commented on ARROW-6860: - Ah good point. That is tricky. > [Python] Only link libarrow_flight.so to pyarrow._flight > > > Key: ARROW-6860 > URL: https://issues.apache.org/jira/browse/ARROW-6860 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Wes McKinney >Priority: Major > Fix For: 1.0.0 > > > See BEAM-8368. We need to find a strategy to mitigate protobuf static linking > issues with the Beam community -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-6860) [Python] Only link libarrow_flight.so to pyarrow._flight
[ https://issues.apache.org/jira/browse/ARROW-6860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-6860: --- Assignee: Wes McKinney -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6844) [C++][Parquet][Python] List columns read broken with 0.15.0
[ https://issues.apache.org/jira/browse/ARROW-6844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-6844: Fix Version/s: 0.15.1 > [C++][Parquet][Python] List columns read broken with 0.15.0 > > > Key: ARROW-6844 > URL: https://issues.apache.org/jira/browse/ARROW-6844 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.15.0 >Reporter: Benoit Rostykus >Priority: Major > Labels: parquet > Fix For: 1.0.0, 0.15.1 > > Attachments: dbg_sample.gz.parquet, dbg_sample2.gz.parquet > > > Columns of type {{array}} (such as `array`, > `array`...) are not readable anymore using {{pyarrow == 0.15.0}} (but > were with {{pyarrow == 0.14.1}}) when the original writer of the parquet file > is {{parquet-mr 1.9.1}}. > {code} > import pyarrow.parquet as pq > pf = pq.ParquetFile('sample.gz.parquet') > print(pf.read(columns=['profile_ids'])) > {code} > with 0.14.1: > {code} > pyarrow.Table > profile_ids: list > child 0, element: int64 > ... > {code} > with 0.15.0: > {code} > Traceback (most recent call last): > File "", line 1, in > File > "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pyarrow/parquet.py", > line 253, in read > use_threads=use_threads) > File "pyarrow/_parquet.pyx", line 1131, in > pyarrow._parquet.ParquetReader.read_all > File "pyarrow/error.pxi", line 78, in pyarrow.lib.check_status > pyarrow.lib.ArrowInvalid: Column data for field 0 with type list > is inconsistent with schema list > {code} > I've tested parquet files coming from multiple tables (with various schemas) > created with `parquet-mr`, couldn't read any `array` column > anymore. > > I _think_ the bug was introduced with [this commit|https://github.com/apache/arrow/commit/06fd2da5e8e71b660e6eea4b7702ca175e31f3f5]. 
> I think the root of the issue comes from the fact that `parquet-mr` writes > the inner struct name as `"element"` by default (see > [here|https://github.com/apache/parquet-mr/blob/b4198be200e7e2df82bc9a18d54c8cd16aa156ac/parquet-column/src/main/java/org/apache/parquet/schema/ConversionPatterns.java#L33]), > whereas `parquet-cpp` (or `pyarrow`?) assumes `"item"` (see for example > [this test|https://github.com/apache/arrow/blob/c805b5fadb548925c915e0e130d6ed03c95d1398/python/pyarrow/tests/test_schema.py#L74]). > Round-tripping tests that write and read in pyarrow only obviously won't catch > this. > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6860) [Python] Only link libarrow_flight.so to pyarrow._flight
[ https://issues.apache.org/jira/browse/ARROW-6860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-6860: Fix Version/s: 0.15.1 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6860) [Python] Only link libarrow_flight.so to pyarrow._flight
[ https://issues.apache.org/jira/browse/ARROW-6860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6860: -- Labels: pull-request-available (was: ) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6777) [GLib][CI] Unpin gobject-introspection gem
[ https://issues.apache.org/jira/browse/ARROW-6777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-6777: Fix Version/s: 0.15.1 > [GLib][CI] Unpin gobject-introspection gem > -- > > Key: ARROW-6777 > URL: https://issues.apache.org/jira/browse/ARROW-6777 > Project: Apache Arrow > Issue Type: Improvement > Components: GLib >Reporter: Kouhei Sutou >Assignee: Kouhei Sutou >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0, 0.15.1 > > Time Spent: 50m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6861) [Python] arrow-0.15.0 reading arrow-0.14.1-output Parquet dictionary column: Failure reading column: IOError: Arrow error: Invalid: Resize cannot downsize
[ https://issues.apache.org/jira/browse/ARROW-6861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-6861: Fix Version/s: 0.15.1
[jira] [Commented] (ARROW-6861) [Python] arrow-0.15.0 reading arrow-0.14.1-output Parquet dictionary column: Failure reading column: IOError: Arrow error: Invalid: Resize cannot downsize
[ https://issues.apache.org/jira/browse/ARROW-6861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949809#comment-16949809 ] Wes McKinney commented on ARROW-6861: - Seems like a good candidate for 0.15.1. Marked as such
[jira] [Assigned] (ARROW-6807) [Java][FlightRPC] Expose gRPC service
[ https://issues.apache.org/jira/browse/ARROW-6807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohit Gupta reassigned ARROW-6807: -- Assignee: Rohit Gupta > [Java][FlightRPC] Expose gRPC service > -- > > Key: ARROW-6807 > URL: https://issues.apache.org/jira/browse/ARROW-6807 > Project: Apache Arrow > Issue Type: New Feature > Components: FlightRPC, Java >Reporter: Rohit Gupta >Assignee: Rohit Gupta >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 6h 20m > Remaining Estimate: 0h > > Have a utility class that exposes the flight service & client so that > multiple services can be plugged into the same endpoint. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6738) [Java] Fix problems with current union comparison logic
[ https://issues.apache.org/jira/browse/ARROW-6738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liya Fan updated ARROW-6738: Fix Version/s: 0.15.1 > [Java] Fix problems with current union comparison logic > --- > > Key: ARROW-6738 > URL: https://issues.apache.org/jira/browse/ARROW-6738 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Reporter: Liya Fan >Assignee: Liya Fan >Priority: Major > Labels: pull-request-available > Fix For: 0.15.1 > > Time Spent: 0.5h > Remaining Estimate: 0h > > There are some problems with the current union comparison logic. For example: > 1. For type check, we should not require fields to be equal. It is possible > that two vectors' value ranges are equal but their fields are different. > 2. We should not compare the number of sub vectors, as it is possible that > two union vectors have different numbers of sub vectors, but have equal > values in the range. -- This message was sent by Atlassian Jira (v8.3.4#803005)
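The principle in point 2 can be illustrated with a simplified stand-in for a dense union (an assumption for illustration, not Arrow Java's actual comparison code): two unions with different numbers of child vectors can still hold equal values over a range.

```python
# Simplified dense-union stand-in: (type_ids, offsets, children).
# type_ids[i] selects a child vector; offsets[i] selects a slot within it.
def union_value(u, i):
    type_ids, offsets, children = u
    return children[type_ids[i]][offsets[i]]

# Range equality compares the values produced over [start, start + length),
# not the container metadata (fields present, number of child vectors).
def range_equals(a, b, start, length):
    return all(union_value(a, i) == union_value(b, i)
               for i in range(start, start + length))

u2 = ([0, 1, 0], [0, 0, 1], [[1, 2], ['x']])           # two children
u3 = ([0, 2, 0], [0, 0, 1], [[1, 2], [3.5], ['x']])    # three children, one unused in this range
```

Here `range_equals(u2, u3, 0, 3)` holds even though the two unions differ in child count, which is exactly why comparing sub-vector counts or full field equality is too strict.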
[jira] [Updated] (ARROW-6806) Segfault deserializing ListArray containing null/empty list
[ https://issues.apache.org/jira/browse/ARROW-6806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield updated ARROW-6806: --- Fix Version/s: 0.15.1 > Segfault deserializing ListArray containing null/empty list > --- > > Key: ARROW-6806 > URL: https://issues.apache.org/jira/browse/ARROW-6806 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.15.0 >Reporter: Max Bolingbroke >Assignee: Antoine Pitrou >Priority: Critical > Labels: pull-request-available > Fix For: 1.0.0, 0.15.1 > > Time Spent: 40m > Remaining Estimate: 0h > > The following code segfaults for me (Windows and Linux, pyarrow 0.15): > > {code:java} > import pyarrow as pa > from io import BytesIO > x = > b'\xdc\x00\x00\x00\x10\x00\x00\x00\x0c\x00\x0e\x00\x06\x00\r\x00\x08\x00\x00\x00\x0c\x00\x00\x00\x00\x00\x03\x00\x10\x00\x00\x00\x00\x01\n\x00\x0c\x00\x00\x00\x08\x00\x04\x00\n\x00\x00\x00\x08\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x18\x00\x00\x00\x00\x00\x12\x00\x18\x00\x14\x00\x13\x00\x12\x00\x0c\x00\x00\x00\x08\x00\x04\x00\x12\x00\x00\x00\x14\x00\x00\x00\x14\x00\x00\x00`\x00\x00\x00\x00\x00\x0c\x01\\\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x18\x00\x00\x00\x00\x00\x12\x00\x18\x00\x14\x00\x00\x00\x13\x00\x0c\x00\x00\x00\x08\x00\x04\x00\x12\x00\x00\x00\x14\x00\x00\x00\x14\x00\x00\x00\x14\x00\x00\x00\x00\x00\x00\x05\x10\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xf0\xff\xff\xff\x06\x00\x00\x00$data$\x00\x00\x04\x00\x04\x00\x04\x00\x00\x00\x10\x00\x00\x00exchangeCodeList\x00\x00\x00\x00\xcc\x00\x00\x00\x14\x00\x00\x00\x00\x00\x00\x00\x0c\x00\x16\x00\x0e\x00\x15\x00\x10\x00\x04\x00\x0c\x00\x00\x00\x10\x00\x00\x00\x00\x00\x00\x00\x00\x00\x03\x00\x10\x00\x00\x00\x00\x03\n\x00\x18\x00\x0c\x00\x08\x00\x04\x00\n\x00\x00\x00\x14\x00\x00\x00h\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x05\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x00\x08\x00\x00\x00\x00
\x00\x00\x00\x10\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x10\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x10\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00' > r = pa.RecordBatchStreamReader(BytesIO(x)) > r.read_all() > {code} > I *think* what should happen instead is that I should get a Table with a > single column named "exchangeCodeList", where the column is a ChunkedArray > with a single chunk, where that chunk is a ListArray containing just a single > element (a null). Failing that (i.e. if the bytestring is actually > malformed), pyarrow should maybe throw an error instead of segfaulting? > I'm not 100% sure how the bytestring was generated: I think it comes from a > Java-based server. I can deserialize the server response fine if all the > records have at least one element in the "exchangeCodeList" column, but not > if at least one of them is null. I've tried to reproduce the failure by > generating the bytestring with pyarrow but can't trigger the segfault. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6862) [Developer] Check pull request title
Kouhei Sutou created ARROW-6862: --- Summary: [Developer] Check pull request title Key: ARROW-6862 URL: https://issues.apache.org/jira/browse/ARROW-6862 Project: Apache Arrow Issue Type: Improvement Components: Developer Tools Reporter: Kouhei Sutou Assignee: Kouhei Sutou -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6862) [Developer] Check pull request title
[ https://issues.apache.org/jira/browse/ARROW-6862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6862: -- Labels: pull-request-available (was: ) > [Developer] Check pull request title > > > Key: ARROW-6862 > URL: https://issues.apache.org/jira/browse/ARROW-6862 > Project: Apache Arrow > Issue Type: Improvement > Components: Developer Tools >Reporter: Kouhei Sutou >Assignee: Kouhei Sutou >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-1638) [Java] IPC roundtrip for null type
[ https://issues.apache.org/jira/browse/ARROW-1638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield resolved ARROW-1638. Resolution: Fixed Issue resolved by pull request 5164 [https://github.com/apache/arrow/pull/5164] > [Java] IPC roundtrip for null type > -- > > Key: ARROW-1638 > URL: https://issues.apache.org/jira/browse/ARROW-1638 > Project: Apache Arrow > Issue Type: New Feature > Components: Java >Reporter: Wes McKinney >Assignee: Ji Liu >Priority: Major > Labels: columnar-format-1.0, pull-request-available > Fix For: 1.0.0 > > Time Spent: 9h 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-6721) [JAVA] Avro adapter benchmark only runs once in JMH
[ https://issues.apache.org/jira/browse/ARROW-6721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield resolved ARROW-6721. Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 5524 [https://github.com/apache/arrow/pull/5524] > [JAVA] Avro adapter benchmark only runs once in JMH > --- > > Key: ARROW-6721 > URL: https://issues.apache.org/jira/browse/ARROW-6721 > Project: Apache Arrow > Issue Type: Sub-task > Components: Java >Reporter: Ji Liu >Assignee: Ji Liu >Priority: Minor > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 1h 20m > Remaining Estimate: 0h > > The current {{AvroAdapterBenchmark}} actually only runs once during JMH > evaluation, since the decoder is fully consumed on the first invocation and > follow-up invocations return immediately. > To solve this, we use {{BinaryDecoder}} explicitly in the benchmark and reset its > inner stream when the test method is invoked. -- This message was sent by Atlassian Jira (v8.3.4#803005)
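The one-shot-consumption pitfall generalizes beyond Avro. A minimal Python stand-in (hypothetical names, not the JMH benchmark itself) shows why resetting the underlying stream per invocation matters:

```python
import io

stream = io.BytesIO(b'payload')

def benchmark_iteration():
    # Reset before each run, analogous to resetting BinaryDecoder's inner
    # stream; without this, every run after the first reads b'' and returns
    # immediately, so the benchmark measures nothing.
    stream.seek(0)
    return stream.read()
```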
[jira] [Resolved] (ARROW-6732) [Java] Implement quick sort in a non-recursive way to avoid stack overflow
[ https://issues.apache.org/jira/browse/ARROW-6732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield resolved ARROW-6732. Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 5540 [https://github.com/apache/arrow/pull/5540] > [Java] Implement quick sort in a non-recursive way to avoid stack overflow > -- > > Key: ARROW-6732 > URL: https://issues.apache.org/jira/browse/ARROW-6732 > Project: Apache Arrow > Issue Type: Improvement > Components: Java >Reporter: Liya Fan >Assignee: Liya Fan >Priority: Critical > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 2h 40m > Remaining Estimate: 0h > > The current quick sort is implemented as a recursive algorithm. The > problem is that, in the worst case, the number of recursive layers is equal > to the length of the vector. For large vectors, this will cause stack > overflow. > To solve this problem, we implement the quick sort algorithm as a > non-recursive algorithm. -- This message was sent by Atlassian Jira (v8.3.4#803005)
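The non-recursive approach can be sketched as follows (an illustrative Python sketch, not the Arrow Java implementation): replace the call stack with an explicit stack of subranges, so partitioning depth no longer consumes call-stack frames.

```python
def quicksort_iterative(a):
    """In-place quicksort driven by an explicit stack of (lo, hi) ranges."""
    stack = [(0, len(a) - 1)]
    while stack:
        lo, hi = stack.pop()
        if lo >= hi:
            continue
        # Lomuto partition around the last element of the range.
        pivot = a[hi]
        i = lo
        for j in range(lo, hi):
            if a[j] <= pivot:
                a[i], a[j] = a[j], a[i]
                i += 1
        a[i], a[hi] = a[hi], a[i]
        # Push both partitions; worst-case input grows the heap-allocated
        # stack list instead of overflowing the call stack.
        stack.append((lo, i - 1))
        stack.append((i + 1, hi))
    return a
```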
[jira] [Resolved] (ARROW-6074) [FlightRPC] Implement middleware
[ https://issues.apache.org/jira/browse/ARROW-6074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield resolved ARROW-6074. Resolution: Fixed Issue resolved by pull request 5068 [https://github.com/apache/arrow/pull/5068] > [FlightRPC] Implement middleware > > > Key: ARROW-6074 > URL: https://issues.apache.org/jira/browse/ARROW-6074 > Project: Apache Arrow > Issue Type: Improvement > Components: FlightRPC >Reporter: David Li >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 20h 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-6074) [FlightRPC] Implement middleware
[ https://issues.apache.org/jira/browse/ARROW-6074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield reassigned ARROW-6074: -- Assignee: David Li > [FlightRPC] Implement middleware > > > Key: ARROW-6074 > URL: https://issues.apache.org/jira/browse/ARROW-6074 > Project: Apache Arrow > Issue Type: Improvement > Components: FlightRPC >Reporter: David Li >Assignee: David Li >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 20h 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)