[jira] [Commented] (ARROW-6849) [Python] can not read a parquet store containing a list of integers

2019-10-11 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949201#comment-16949201
 ] 

Joris Van den Bossche commented on ARROW-6849:
--

[~selitvin] thanks for the issue report and reproducible example! 
This is indeed a regression in 0.15.0, see ARROW-6844. Going to close this 
issue as a duplicate in favor of ARROW-6844.

> [Python] can not read a parquet store containing a list of integers 
> 
>
> Key: ARROW-6849
> URL: https://issues.apache.org/jira/browse/ARROW-6849
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0
>Reporter: Yevgeni Litvin
>Priority: Major
> Attachments: test_bad_parquet.tgz
>
>
> A field having a type of list-of-ints can not be read using the 
> {{pyarrow.parquet.read_table}} function. Other field types also fail 
> (strings, for example).
> This happens only in pyarrow 0.15.0. When downgrading to 0.14.1, the issue is 
> not observed.
> pyspark version: 2.4.4 [^test_bad_parquet.tgz]
> Minimal snippet to reproduce the issue:
>  
> {code:java}
> import pyarrow.parquet as pq
> from pyspark.sql import SparkSession
> from pyspark.sql.types import StructType, StructField, IntegerType, 
> ArrayType, Row
> output_url = '/tmp/test_bad_parquet'
> spark = SparkSession.builder.getOrCreate()
> schema = StructType([StructField('int_fixed_size_list', 
> ArrayType(IntegerType(), False), False)])
> rows = [Row(int_fixed_size_list=[1, 2, 3])]
> dataframe = spark.createDataFrame(rows, 
> schema).write.mode('overwrite').parquet(output_url)
> pq.read_table(output_url)
> {code}
> I get an error:
> {code:java}
> Traceback (most recent call last):
>   File "/home/yevgeni/uatc/dataset-toolkit/repro_failure.py", line 13, in 
> <module>
> pq.read_table(output_url)
>   File 
> "/home/yevgeni/uatc/.petastorm3.6/lib/python3.6/site-packages/pyarrow/parquet.py",
>  line 1281, in read_table
> use_pandas_metadata=use_pandas_metadata)
>   File 
> "/home/yevgeni/uatc/.petastorm3.6/lib/python3.6/site-packages/pyarrow/parquet.py",
>  line 1137, in read
> use_pandas_metadata=use_pandas_metadata)
>   File 
> "/home/yevgeni/uatc/.petastorm3.6/lib/python3.6/site-packages/pyarrow/parquet.py",
>  line 605, in read
> table = reader.read(**options)
>   File 
> "/home/yevgeni/uatc/.petastorm3.6/lib/python3.6/site-packages/pyarrow/parquet.py",
>  line 253, in read
> use_threads=use_threads)
>   File "pyarrow/_parquet.pyx", line 1136, in 
> pyarrow._parquet.ParquetReader.read_all
>   File "pyarrow/error.pxi", line 78, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Column data for field 0 with type list not null> is inconsistent with schema list
> Process finished with exit code 1
> {code}
>  
> Column data for field 0 with type list is inconsistent 
> with schema list
>  
> A parquet store, as generated by the snippet is attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-6849) [Python] can not read a parquet store containing a list of integers

2019-10-11 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche closed ARROW-6849.

Resolution: Duplicate

> [Python] can not read a parquet store containing a list of integers 
> 
>
> Key: ARROW-6849
> URL: https://issues.apache.org/jira/browse/ARROW-6849
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0
>Reporter: Yevgeni Litvin
>Priority: Major
> Attachments: test_bad_parquet.tgz
>
>
> A field having a type of list-of-ints can not be read using the 
> {{pyarrow.parquet.read_table}} function. Other field types also fail 
> (strings, for example).
> This happens only in pyarrow 0.15.0. When downgrading to 0.14.1, the issue is 
> not observed.
> pyspark version: 2.4.4 [^test_bad_parquet.tgz]
> Minimal snippet to reproduce the issue:
>  
> {code:java}
> import pyarrow.parquet as pq
> from pyspark.sql import SparkSession
> from pyspark.sql.types import StructType, StructField, IntegerType, 
> ArrayType, Row
> output_url = '/tmp/test_bad_parquet'
> spark = SparkSession.builder.getOrCreate()
> schema = StructType([StructField('int_fixed_size_list', 
> ArrayType(IntegerType(), False), False)])
> rows = [Row(int_fixed_size_list=[1, 2, 3])]
> dataframe = spark.createDataFrame(rows, 
> schema).write.mode('overwrite').parquet(output_url)
> pq.read_table(output_url)
> {code}
> I get an error:
> {code:java}
> Traceback (most recent call last):
>   File "/home/yevgeni/uatc/dataset-toolkit/repro_failure.py", line 13, in 
> <module>
> pq.read_table(output_url)
>   File 
> "/home/yevgeni/uatc/.petastorm3.6/lib/python3.6/site-packages/pyarrow/parquet.py",
>  line 1281, in read_table
> use_pandas_metadata=use_pandas_metadata)
>   File 
> "/home/yevgeni/uatc/.petastorm3.6/lib/python3.6/site-packages/pyarrow/parquet.py",
>  line 1137, in read
> use_pandas_metadata=use_pandas_metadata)
>   File 
> "/home/yevgeni/uatc/.petastorm3.6/lib/python3.6/site-packages/pyarrow/parquet.py",
>  line 605, in read
> table = reader.read(**options)
>   File 
> "/home/yevgeni/uatc/.petastorm3.6/lib/python3.6/site-packages/pyarrow/parquet.py",
>  line 253, in read
> use_threads=use_threads)
>   File "pyarrow/_parquet.pyx", line 1136, in 
> pyarrow._parquet.ParquetReader.read_all
>   File "pyarrow/error.pxi", line 78, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Column data for field 0 with type list not null> is inconsistent with schema list
> Process finished with exit code 1
> {code}
>  
> Column data for field 0 with type list is inconsistent 
> with schema list
>  
> A parquet store, as generated by the snippet is attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6793) [R] Arrow C++ binary packaging for Linux

2019-10-11 Thread Thomas Schm (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949227#comment-16949227
 ] 

Thomas Schm commented on ARROW-6793:


Yes, I guess this ticket is addressing a subproblem of getting arrow into R on 
Linux. Solving this problem is unfortunately a huge task and the information is 
scattered in fragments across GitHub, Jira and several articles. It's a very 
unfortunate situation. Trying to install apache/arrow/r from GitHub worked 
yesterday but fails today. The problem today relates to a commit you made 
yesterday:

compression.cpp: In function ‘bool 
util___Codec__IsAvailable(arrow::Compression::type)’:
compression.cpp:37:10: error: ‘IsAvailable’ is not a member of 
‘arrow::util::Codec’
   return arrow::util::Codec::IsAvailable(codec);
  ^

Are the libraries I link against outdated? I did a fresh pull just a few minutes 
ago. Is there a way to specify a certain tag when installing via the GitHub route? 

> [R] Arrow C++ binary packaging for Linux
> 
>
> Key: ARROW-6793
> URL: https://issues.apache.org/jira/browse/ARROW-6793
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
> Fix For: 1.0.0
>
>
> Our current installation experience on Linux isn't ideal. Unless you've 
> already installed the Arrow C++ library, when you install the R package, you 
> get a shell that tells you to install the C++ library. That was a useful 
> approach to allow us to get the package on CRAN, which makes it easy for 
> macOS and Windows users to install, but it doesn't improve the installation 
> experience for Linux users. This is an impediment to adoption of arrow not 
> only by users but also by package maintainers who might want to depend on 
> arrow. 
> macOS and Windows have a better experience because at installation time, the 
> configure scripts download and statically link a prebuilt C++ library. CRAN 
> bundles the whole thing up and delivers that as a binary R package. 
> Python wheels do a similar thing: they're binaries that contain all external 
> dependencies. And there are pyarrow wheels for Linux. This suggests that we 
> could do something similar for R: build a generic Linux binary of the C++ 
> library and download it in the R package configure script at install time.
> I experimented with using the Arrow C++ binaries included in the Python 
> wheels in R. See discussion at the end of ARROW-5956. This worked on macOS 
> (not useful for R, but it proved the concept) and almost worked on Linux, but 
> it turned out that the "manylinux2010" standard is too archaic to work with 
> contemporary Rcpp. 
> Proposal: do a similar workflow to what the manylinux2010 pyarrow build does, 
> just with slightly more modern compiler/settings. Publish that C++ binary 
> package to bintray. Then download it in the R configure script if a 
> local/system package isn't found.
> Once we have a basic version working, test against various distros on 
> [R-hub|https://builder.r-hub.io/advanced] to make sure we're solid everywhere 
> and/or ensure the current fallback behavior when we encounter a distro that 
> this doesn't work for. If necessary, we can make multiple flavors of this C++ 
> binary for debian, centos, etc.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6850) [Java] Jdbc converter support Null type

2019-10-11 Thread Ji Liu (Jira)
Ji Liu created ARROW-6850:
-

 Summary: [Java] Jdbc converter support Null type
 Key: ARROW-6850
 URL: https://issues.apache.org/jira/browse/ARROW-6850
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


java.sql.Types.NULL is not supported yet because there was no NullVector in the 
Java code before.

This could be implemented after ARROW-1638 is merged (IPC roundtrip for the null type).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6851) Should OSError be FileNotFoundError?

2019-10-11 Thread Vladimir Filimonov (Jira)
Vladimir Filimonov created ARROW-6851:
-

 Summary: Should OSError be FileNotFoundError?
 Key: ARROW-6851
 URL: https://issues.apache.org/jira/browse/ARROW-6851
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Affects Versions: 0.15.0
Reporter: Vladimir Filimonov


On the read_table function - if the file is not found, an OSError is raised:

 
{code:java}
import pyarrow.parquet as pq
pq.read_table('example.parquet')
{code}
Should it rather be FileNotFoundError which is more standard in such situations?
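As a user-side stopgap (not a pyarrow API; the helper name below is hypothetical), the check can be done before calling read_table so that a missing local path surfaces as FileNotFoundError:
{code:python}
import os
import pyarrow.parquet as pq

def read_table_strict(path, **kwargs):
    # Hypothetical wrapper: raise FileNotFoundError for a missing local path
    # instead of the generic OSError currently raised by read_table.
    if not os.path.exists(path):
        raise FileNotFoundError(path)
    return pq.read_table(path, **kwargs)
{code}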

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6851) Should OSError be FileNotFoundError?

2019-10-11 Thread Vladimir Filimonov (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Filimonov updated ARROW-6851:
--
Description: 
On the read_table function - if the file is not found, an OSError ("Passed 
non-file path") is raised:
{code:java}
import pyarrow.parquet as pq
pq.read_table('example.parquet')
{code}
Should it rather be FileNotFoundError which is more standard in such situations?

  was:
On the read_table function - if the file is not found, a OSError is raised:

 
{code:java}
import pyarrow.parquet as pq
pq.read_table('example.parquet')
{code}
Should it rather be FileNotFoundError which is more standard in such situations?

 


> Should OSError be FileNotFoundError?
> 
>
> Key: ARROW-6851
> URL: https://issues.apache.org/jira/browse/ARROW-6851
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.15.0
>Reporter: Vladimir Filimonov
>Priority: Minor
>
> On the read_table function - if the file is not found, an OSError ("Passed 
> non-file path") is raised:
> {code:java}
> import pyarrow.parquet as pq
> pq.read_table('example.parquet')
> {code}
> Should it rather be FileNotFoundError which is more standard in such 
> situations?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6851) Should OSError be FileNotFoundError?

2019-10-11 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-6851:
--
Component/s: C++

> Should OSError be FileNotFoundError?
> 
>
> Key: ARROW-6851
> URL: https://issues.apache.org/jira/browse/ARROW-6851
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Affects Versions: 0.15.0
>Reporter: Vladimir Filimonov
>Priority: Minor
> Fix For: 2.0.0
>
>
> On the read_table function - if the file is not found, an OSError ("Passed 
> non-file path") is raised:
> {code:java}
> import pyarrow.parquet as pq
> pq.read_table('example.parquet')
> {code}
> Should it rather be FileNotFoundError which is more standard in such 
> situations?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6851) Should OSError be FileNotFoundError?

2019-10-11 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-6851:
--
Fix Version/s: 2.0.0

> Should OSError be FileNotFoundError?
> 
>
> Key: ARROW-6851
> URL: https://issues.apache.org/jira/browse/ARROW-6851
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.15.0
>Reporter: Vladimir Filimonov
>Priority: Minor
> Fix For: 2.0.0
>
>
> On the read_table function - if the file is not found, an OSError ("Passed 
> non-file path") is raised:
> {code:java}
> import pyarrow.parquet as pq
> pq.read_table('example.parquet')
> {code}
> Should it rather be FileNotFoundError which is more standard in such 
> situations?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6851) Should OSError be FileNotFoundError?

2019-10-11 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949273#comment-16949273
 ] 

Antoine Pitrou commented on ARROW-6851:
---

Certainly. But this will require some plumbing on the C++ side to remember 
{{errno}}.
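A purely illustrative Python sketch of the mapping that such plumbing would enable (nothing here is existing pyarrow code; the function and its arguments are hypothetical):
{code:python}
import errno

def status_to_exception(message, saved_errno):
    # If the C++ Status carried the original errno, the Python bindings
    # could raise the specific OSError subclass instead of a generic OSError.
    if saved_errno == errno.ENOENT:
        return FileNotFoundError(message)
    return OSError(message)
{code}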

> Should OSError be FileNotFoundError?
> 
>
> Key: ARROW-6851
> URL: https://issues.apache.org/jira/browse/ARROW-6851
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Affects Versions: 0.15.0
>Reporter: Vladimir Filimonov
>Priority: Minor
> Fix For: 2.0.0
>
>
> On the read_table function - if the file is not found, an OSError ("Passed 
> non-file path") is raised:
> {code:java}
> import pyarrow.parquet as pq
> pq.read_table('example.parquet')
> {code}
> Should it rather be FileNotFoundError which is more standard in such 
> situations?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6851) Should OSError be FileNotFoundError?

2019-10-11 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949274#comment-16949274
 ] 

Antoine Pitrou commented on ARROW-6851:
---

cc [~jorisvandenbossche]

 

> Should OSError be FileNotFoundError?
> 
>
> Key: ARROW-6851
> URL: https://issues.apache.org/jira/browse/ARROW-6851
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Affects Versions: 0.15.0
>Reporter: Vladimir Filimonov
>Priority: Minor
> Fix For: 2.0.0
>
>
> On the read_table function - if the file is not found, an OSError ("Passed 
> non-file path") is raised:
> {code:java}
> import pyarrow.parquet as pq
> pq.read_table('example.parquet')
> {code}
> Should it rather be FileNotFoundError which is more standard in such 
> situations?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6704) [C++] Cast from timestamp to higher resolution does not check out of bounds timestamps

2019-10-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6704:
--
Labels: pull-request-available  (was: )

> [C++] Cast from timestamp to higher resolution does not check out of bounds 
> timestamps
> --
>
> Key: ARROW-6704
> URL: https://issues.apache.org/jira/browse/ARROW-6704
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
>
> When casting eg {{timestamp('s')}} to {{timestamp('ns')}}, we do not check 
> for out of bounds timestamps, giving "garbage" timestamps in the result:
> {code}
> In [74]: a_np = np.array(["2012-01-01", "2412-01-01"], dtype="datetime64[s]") 
>   
>
> In [75]: arr = pa.array(a_np) 
>   
>
> In [76]: arr  
>   
>
> Out[76]: 
> 
> [
>   2012-01-01 00:00:00,
>   2412-01-01 00:00:00
> ]
> In [77]: arr.cast(pa.timestamp('ns')) 
>   
>
> Out[77]: 
> 
> [
>   2012-01-01 00:00:00.0,
>   1827-06-13 00:25:26.290448384
> ]
> {code}
> Now, this is the same behaviour as numpy, so I'm not sure we should do this. 
> However, since we have a {{safe=True/False}}, I would expect that for 
> {{safe=True}} we check this and for {{safe=False}} we do not check this.  
> (numpy has a similar {{casting='safe'}} but also does not raise an error in 
> that case).
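For reference, a numpy-only illustration (not Arrow code) of why the second value above cannot be represented once upcast to nanoseconds, which is the overflow a safe cast would be expected to catch:
{code:python}
import numpy as np

a_np = np.array(["2012-01-01", "2412-01-01"], dtype="datetime64[s]")
# A seconds value fits in int64 nanoseconds only if |value| <= (2**63 - 1) // 10**9.
limit = (2**63 - 1) // 10**9
secs = a_np.astype("int64")
print(np.abs(secs) > limit)  # [False  True]: the 2412-01-01 value would overflow
{code}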



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6852) [C++] memory-benchmark build failed on Arm64

2019-10-11 Thread Yuqi Gu (Jira)
Yuqi Gu created ARROW-6852:
--

 Summary: [C++] memory-benchmark build failed on Arm64
 Key: ARROW-6852
 URL: https://issues.apache.org/jira/browse/ARROW-6852
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Yuqi Gu
Assignee: Yuqi Gu


After the new commit ARROW-6381 was merged into master,
the build fails on Arm64 when -DARROW_BUILD_BENCHMARKS is enabled:

{code:java}
/home/builder/arrow/cpp/src/arrow/io/memory_benchmark.cc:205:31: error: 
'kMemoryPerCore' was not declared in this scope
 const int64_t buffer_size = kMemoryPerCore;
 ^~
/home/builder/arrow/cpp/src/arrow/io/memory_benchmark.cc:207:19: error: 
'Buffer' was not declared in this scope
 std::shared_ptr<Buffer> src, dst;
 ^~
/home/builder/arrow/cpp/src/arrow/io/memory_benchmark.cc:207:19: note: 
suggested alternative:
In file included from /home/builder/arrow/cpp/src/arrow/array.h:28:0,
 from /home/builder/arrow/cpp/src/arrow/api.h:23,
 from /home/builder/arrow/cpp/src/arrow/io/memory_benchmark.cc:20:
/home/builder/arrow/cpp/src/arrow/buffer.h:50:20: note: 'arrow::Buffer'
 class ARROW_EXPORT Buffer {
 ^~
/home/builder/arrow/cpp/src/arrow/io/memory_benchmark.cc:207:25: error: 
template argument 1 is invalid
 std::shared_ptr<Buffer> src, dst;
...
.
.
 
{code}


--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6852) [C++] memory-benchmark build failed on Arm64

2019-10-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6852:
--
Labels: pull-request-available  (was: )

> [C++] memory-benchmark build failed on Arm64
> 
>
> Key: ARROW-6852
> URL: https://issues.apache.org/jira/browse/ARROW-6852
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Yuqi Gu
>Assignee: Yuqi Gu
>Priority: Major
>  Labels: pull-request-available
>
> After the new commit ARROW-6381 was merged into master,
> the build fails on Arm64 when -DARROW_BUILD_BENCHMARKS is enabled:
>  
>  
> {code:java}
> /home/builder/arrow/cpp/src/arrow/io/memory_benchmark.cc:205:31: error: 
> 'kMemoryPerCore' was not declared in this scope
>  const int64_t buffer_size = kMemoryPerCore;
>  ^~
> /home/builder/arrow/cpp/src/arrow/io/memory_benchmark.cc:207:19: error: 
> 'Buffer' was not declared in this scope
>  std::shared_ptr<Buffer> src, dst;
>  ^~
> /home/builder/arrow/cpp/src/arrow/io/memory_benchmark.cc:207:19: note: 
> suggested alternative:
> In file included from /home/builder/arrow/cpp/src/arrow/array.h:28:0,
>  from /home/builder/arrow/cpp/src/arrow/api.h:23,
>  from /home/builder/arrow/cpp/src/arrow/io/memory_benchmark.cc:20:
> /home/builder/arrow/cpp/src/arrow/buffer.h:50:20: note: 'arrow::Buffer'
>  class ARROW_EXPORT Buffer {
>  ^~
> /home/builder/arrow/cpp/src/arrow/io/memory_benchmark.cc:207:25: error: 
> template argument 1 is invalid
>  std::shared_ptr<Buffer> src, dst;
> ...
> .
> .
>  
> {code}
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6853) [Java] Support vector and dictionary encoder use different hasher for calculating hashCode

2019-10-11 Thread Ji Liu (Jira)
Ji Liu created ARROW-6853:
-

 Summary: [Java] Support vector and dictionary encoder use 
different hasher for calculating hashCode
 Key: ARROW-6853
 URL: https://issues.apache.org/jira/browse/ARROW-6853
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


The Hasher interface was introduced in ARROW-5898 and now has two different 
implementations ({{MurmurHasher}} and {{SimpleHasher}}); there could be more in 
the future.

Currently {{ValueVector#hashCode}} and {{DictionaryHashTable}} only use 
{{SimpleHasher}} for calculating hashCode. This issue enables them to use a 
different hasher, or even a user-defined hasher, for their own use cases.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6853) [Java] Support vector and dictionary encoder use different hasher for calculating hashCode

2019-10-11 Thread Ji Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ji Liu updated ARROW-6853:
--
Description: 
The Hasher interface was introduced in ARROW-5898 and now has two different 
implementations ({{MurmurHasher and SimpleHasher}}); there could be more in the 
future.

Currently {{ValueVector#hashCode}} and {{DictionaryHashTable}} only use 
{{SimpleHasher}} for calculating hashCode. This issue enables them to use a 
different hasher, or even a user-defined hasher, for their own use cases.

  was:
Hasher interface was introduce in ARROW-5898 and now have two different 
implementations ({{MurmurHasher and }}{{SimpleHasher}}) and it could be more in 
the future.

And currently {{ValueVector#hashCode}} and {{DictionaryHashTable}} only use 
{{SimpleHasher}} for calculating hashCode. This issue enables them to use 
different hasher or even user-defined hasher for their own use cases.


> [Java] Support vector and dictionary encoder use different hasher for 
> calculating hashCode
> --
>
> Key: ARROW-6853
> URL: https://issues.apache.org/jira/browse/ARROW-6853
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Major
>
> The Hasher interface was introduced in ARROW-5898 and now has two different 
> implementations ({{MurmurHasher and SimpleHasher}}); there could be more in 
> the future.
> Currently {{ValueVector#hashCode}} and {{DictionaryHashTable}} only use 
> {{SimpleHasher}} for calculating hashCode. This issue enables them to use a 
> different hasher, or even a user-defined hasher, for their own use cases.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6853) [Java] Support vector and dictionary encoder use different hasher for calculating hashCode

2019-10-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6853:
--
Labels: pull-request-available  (was: )

> [Java] Support vector and dictionary encoder use different hasher for 
> calculating hashCode
> --
>
> Key: ARROW-6853
> URL: https://issues.apache.org/jira/browse/ARROW-6853
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Major
>  Labels: pull-request-available
>
> The Hasher interface was introduced in ARROW-5898 and now has two different 
> implementations ({{MurmurHasher and SimpleHasher}}); there could be more in 
> the future.
> Currently {{ValueVector#hashCode}} and {{DictionaryHashTable}} only use 
> {{SimpleHasher}} for calculating hashCode. This issue enables them to use a 
> different hasher, or even a user-defined hasher, for their own use cases.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6854) [Dataset][C++] RecordBatchProjector is not thread safe

2019-10-11 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-6854:
--
Summary: [Dataset][C++] RecordBatchProjector is not thread safe  (was: 
[Dataset] RecordBatchProjector is not thread safe)

> [Dataset][C++] RecordBatchProjector is not thread safe
> --
>
> Key: ARROW-6854
> URL: https://issues.apache.org/jira/browse/ARROW-6854
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Francois Saint-Jacques
>Priority: Major
>
> While working on ARROW-6769 I noted that RecordBatchProjector is not thread 
> safe. My goal is to use this class to wrap the ScanTaskIterator in another 
> ScanTaskIterator that projects, so producers (fragments) don't have to know 
> about the schema. The issue is that ScanTasks are expected to run on 
> concurrent threads, so the projector will be invoked by multiple threads.
> The lack of concurrency safety is due to the adaptivity of input schemas: 
> `SetInputSchema` stores state in a local cache. I suggest we refactor into 2 
> classes. 
>  # `RecordBatchProjector` which will work with a static `from` schema, i.e. 
> no adaptivity. The schema is defined at construction time. This class is thread 
> safe to invoke after construction since no local modification is done.
>  # `AdaptiveRecordBatchProjector` which will have a cache map[schema_hash, 
> std::shared_ptr] protected with a mutex. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6854) [Dataset] RecordBatchProjector is not thread safe

2019-10-11 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-6854:
-

 Summary: [Dataset] RecordBatchProjector is not thread safe
 Key: ARROW-6854
 URL: https://issues.apache.org/jira/browse/ARROW-6854
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Francois Saint-Jacques


While working on ARROW-6769 I noted that RecordBatchProjector is not thread 
safe. My goal is to use this class to wrap the ScanTaskIterator in another 
ScanTaskIterator that projects, so producers (fragments) don't have to know 
about the schema. The issue is that ScanTasks are expected to run on concurrent 
threads, so the projector will be invoked by multiple threads.

The lack of concurrency safety is due to the adaptivity of input schemas: 
`SetInputSchema` stores state in a local cache. I suggest we refactor into 2 
classes (a rough sketch follows below). 
 # `RecordBatchProjector` which will work with a static `from` schema, i.e. no 
adaptivity. The schema is defined at construction time. This class is thread safe 
to invoke after construction since no local modification is done.
 # `AdaptiveRecordBatchProjector` which will have a cache map[schema_hash, 
std::shared_ptr] protected with a mutex. 
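A purely illustrative Python sketch of this split (the real target is the C++ classes above; all names here are hypothetical), showing a static projector that is safe to share and an adaptive wrapper that guards its per-schema cache with a lock:
{code:python}
import threading

class StaticProjector:
    """Projects batches from one fixed `from` schema to `to_schema`.
    No mutable state after construction, so it can be shared across threads."""
    def __init__(self, from_schema, to_schema):
        self.from_schema = from_schema
        self.to_schema = to_schema

    def project(self, batch):
        # Column reordering / null filling would happen here.
        return batch

class AdaptiveProjector:
    """Keeps one StaticProjector per observed input schema, behind a mutex."""
    def __init__(self, to_schema):
        self.to_schema = to_schema
        self._cache = {}
        self._lock = threading.Lock()

    def project(self, batch):
        key = hash(batch.schema)  # stand-in for schema_hash
        with self._lock:
            projector = self._cache.get(key)
            if projector is None:
                projector = StaticProjector(batch.schema, self.to_schema)
                self._cache[key] = projector
        return projector.project(batch)
{code}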



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-6390) [Python][Flight] Add Python documentation / tutorial for Flight

2019-10-11 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-6390:
---

Assignee: (was: Wes McKinney)

> [Python][Flight] Add Python documentation / tutorial for Flight
> ---
>
> Key: ARROW-6390
> URL: https://issues.apache.org/jira/browse/ARROW-6390
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: FlightRPC, Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> There is no Sphinx documentation for using Flight from Python. I have found 
> that writing documentation is an effective way to uncover usability problems 
> -- I would suggest we write comprehensive documentation for using Flight from 
> Python as a way to refine the public Python API



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6390) [Python][Flight] Add Python documentation / tutorial for Flight

2019-10-11 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6390:

Fix Version/s: (was: 0.15.0)
   1.0.0

> [Python][Flight] Add Python documentation / tutorial for Flight
> ---
>
> Key: ARROW-6390
> URL: https://issues.apache.org/jira/browse/ARROW-6390
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: FlightRPC, Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> There is no Sphinx documentation for using Flight from Python. I have found 
> that writing documentation is an effective way to uncover usability problems 
> -- I would suggest we write comprehensive documentation for using Flight from 
> Python as a way to refine the public Python API



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6390) [Python][Flight] Add Python documentation / tutorial for Flight

2019-10-11 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949487#comment-16949487
 ] 

Wes McKinney commented on ARROW-6390:
-

I haven't made much progress on this yet. I'll reassign when I do.

> [Python][Flight] Add Python documentation / tutorial for Flight
> ---
>
> Key: ARROW-6390
> URL: https://issues.apache.org/jira/browse/ARROW-6390
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: FlightRPC, Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> There is no Sphinx documentation for using Flight from Python. I have found 
> that writing documentation is an effective way to uncover usability problems 
> -- I would suggest we write comprehensive documentation for using Flight from 
> Python as a way to refine the public Python API



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-5502) [R] file readers should mmap

2019-10-11 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949488#comment-16949488
 ] 

Wes McKinney commented on ARROW-5502:
-

Note that we stopped memory mapping by default in {{pyarrow.parquet}}.
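For comparison, the Python-side switch looks roughly like this (assuming the memory_map keyword of pyarrow.parquet.read_table as of the 0.15 era; defaults may differ by version):
{code:python}
import pyarrow.parquet as pq

# Opt back in to memory mapping explicitly; pyarrow 0.15 no longer
# memory-maps local Parquet files by default.
table = pq.read_table("/tmp/example.parquet", memory_map=True)
{code}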

> [R] file readers should mmap
> 
>
> Key: ARROW-5502
> URL: https://issues.apache.org/jira/browse/ARROW-5502
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Priority: Major
> Fix For: 1.0.0
>
>
> Arrow is supposed to let you work with datasets bigger than memory. Memory 
> mapping is a big part of that. It should be the default way that files are 
> read in the `read_*` functions. To disable memory mapping, we could use a 
> global `option()`, or a function argument, but that might clutter the 
> interface. Or we could not give a choice and only fall back to not memory 
> mapping if the platform/file system doesn't support it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6793) [R] Arrow C++ binary packaging for Linux

2019-10-11 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949489#comment-16949489
 ] 

Wes McKinney commented on ARROW-6793:
-

If you're building from master, you need to build both the C++ and R libraries 
from master. In general the git revision of both libraries should be the same

> [R] Arrow C++ binary packaging for Linux
> 
>
> Key: ARROW-6793
> URL: https://issues.apache.org/jira/browse/ARROW-6793
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
> Fix For: 1.0.0
>
>
> Our current installation experience on Linux isn't ideal. Unless you've 
> already installed the Arrow C++ library, when you install the R package, you 
> get a shell that tells you to install the C++ library. That was a useful 
> approach to allow us to get the package on CRAN, which makes it easy for 
> macOS and Windows users to install, but it doesn't improve the installation 
> experience for Linux users. This is an impediment to adoption of arrow not 
> only by users but also by package maintainers who might want to depend on 
> arrow. 
> macOS and Windows have a better experience because at installation time, the 
> configure scripts download and statically link a prebuilt C++ library. CRAN 
> bundles the whole thing up and delivers that as a binary R package. 
> Python wheels do a similar thing: they're binaries that contain all external 
> dependencies. And there are pyarrow wheels for Linux. This suggests that we 
> could do something similar for R: build a generic Linux binary of the C++ 
> library and download it in the R package configure script at install time.
> I experimented with using the Arrow C++ binaries included in the Python 
> wheels in R. See discussion at the end of ARROW-5956. This worked on macOS 
> (not useful for R, but it proved the concept) and almost worked on Linux, but 
> it turned out that the "manylinux2010" standard is too archaic to work with 
> contemporary Rcpp. 
> Proposal: do a similar workflow to what the manylinux2010 pyarrow build does, 
> just with slightly more modern compiler/settings. Publish that C++ binary 
> package to bintray. Then download it in the R configure script if a 
> local/system package isn't found.
> Once we have a basic version working, test against various distros on 
> [R-hub|https://builder.r-hub.io/advanced] to make sure we're solid everywhere 
> and/or ensure the current fallback behavior when we encounter a distro that 
> this doesn't work for. If necessary, we can make multiple flavors of this C++ 
> binary for debian, centos, etc.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6855) [C++][Python][Flight] Implement Flight middleware

2019-10-11 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6855:
---

 Summary: [C++][Python][Flight] Implement Flight middleware
 Key: ARROW-6855
 URL: https://issues.apache.org/jira/browse/ARROW-6855
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++, Python
Reporter: Wes McKinney
Assignee: David Li
 Fix For: 1.0.0


C++/Python side of ARROW-6074



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-6855) [C++][Python][Flight] Implement Flight middleware

2019-10-11 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-6855.
-
Resolution: Fixed

Issue resolved by pull request 5552
[https://github.com/apache/arrow/pull/5552]

> [C++][Python][Flight] Implement Flight middleware
> -
>
> Key: ARROW-6855
> URL: https://issues.apache.org/jira/browse/ARROW-6855
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Python
>Reporter: Wes McKinney
>Assignee: David Li
>Priority: Major
> Fix For: 1.0.0
>
>
> C++/Python side of ARROW-6074



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6855) [C++][Python][Flight] Implement Flight middleware

2019-10-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6855:
--
Labels: pull-request-available  (was: )

> [C++][Python][Flight] Implement Flight middleware
> -
>
> Key: ARROW-6855
> URL: https://issues.apache.org/jira/browse/ARROW-6855
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Python
>Reporter: Wes McKinney
>Assignee: David Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> C++/Python side of ARROW-6074



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6856) [C++] Use ArrayData instead of Array for ArrayData::dictionary

2019-10-11 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6856:
---

 Summary: [C++] Use ArrayData instead of Array for 
ArrayData::dictionary
 Key: ARROW-6856
 URL: https://issues.apache.org/jira/browse/ARROW-6856
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


This would be helpful for consistency. {{DictionaryArray}} may want to cache a 
"boxed" version of this to return from {{DictionaryArray::dictionary}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-6846) [C++] Build failures with glog enabled

2019-10-11 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-6846.
---

> [C++] Build failures with glog enabled
> --
>
> Key: ARROW-6846
> URL: https://issues.apache.org/jira/browse/ARROW-6846
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Antoine Pitrou
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> This has started appearing on Travis, e.g.:
> https://travis-ci.org/apache/arrow/jobs/596181386#L3663
> {code}
> In file included from 
> /home/travis/build/apache/arrow/cpp/src/arrow/util/logging.cc:29:0:
> /home/travis/build/apache/arrow/pyarrow-test-3.6/include/glog/logging.h:994:0:
>  error: "DCHECK" redefined [-Werror]
>  #define DCHECK(condition) CHECK(condition)
>  
> In file included from 
> /home/travis/build/apache/arrow/cpp/src/arrow/util/logging.cc:18:0:
> /home/travis/build/apache/arrow/cpp/src/arrow/util/logging.h:130:0: note: 
> this is the location of the previous definition
>  #define DCHECK ARROW_CHECK
>  {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-6835) [Archery][CMake] Restore ARROW_LINT_ONLY

2019-10-11 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-6835.
-
Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 5616
[https://github.com/apache/arrow/pull/5616]

> [Archery][CMake] Restore ARROW_LINT_ONLY  
> --
>
> Key: ARROW-6835
> URL: https://issues.apache.org/jira/browse/ARROW-6835
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Archery
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> This is used by developers to speed up the cmake build creation and loosen the 
> required installed toolchains (notably libraries). This was yanked because 
> ARROW_LINT_ONLY effectively exits early and doesn't generate 
> `compile_commands.json`.
> Restore this option, but ensure that archery toggles it according to the usage 
> of iwyu or clang-tidy.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-6835) [Archery][CMake] Restore ARROW_LINT_ONLY

2019-10-11 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-6835:
---

Assignee: Francois Saint-Jacques

> [Archery][CMake] Restore ARROW_LINT_ONLY  
> --
>
> Key: ARROW-6835
> URL: https://issues.apache.org/jira/browse/ARROW-6835
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Archery
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> This is used by developers to speed up the cmake build creation and loosen the 
> required installed toolchains (notably libraries). This was yanked because 
> ARROW_LINT_ONLY effectively exits early and doesn't generate 
> `compile_commands.json`.
> Restore this option, but ensure that archery toggles it according to the usage 
> of iwyu or clang-tidy.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6857) Segfault for dictionary_encode on empty chunked_array (edge case)

2019-10-11 Thread Artem KOZHEVNIKOV (Jira)
Artem KOZHEVNIKOV created ARROW-6857:


 Summary: Segfault for dictionary_encode on empty chunked_array 
(edge case)
 Key: ARROW-6857
 URL: https://issues.apache.org/jira/browse/ARROW-6857
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.15.0
Reporter: Artem KOZHEVNIKOV


A reproducer is here:
{code:python}
import pyarrow as pa
aa = pa.chunked_array([pa.array(['a', 'b', 'c'])])
aa[:0].dictionary_encode()  
# Segmentation fault: 11
{code}
For pyarrow=0.14, I could not reproduce. 
I use a conda version: "pyarrow 0.15.0 py37hdca360a_0 conda-forge"
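Until this is fixed, a user-side guard along these lines (a hypothetical helper, not a pyarrow API) avoids hitting the empty-array path:
{code:python}
import pyarrow as pa

def dictionary_encode_safe(chunked):
    # Hypothetical workaround: skip dictionary_encode for a zero-length
    # ChunkedArray, which crashes in pyarrow 0.15.0, and return an empty
    # dictionary-typed ChunkedArray instead.
    if len(chunked) == 0:
        return pa.chunked_array([], type=pa.dictionary(pa.int32(), chunked.type))
    return chunked.dictionary_encode()

aa = pa.chunked_array([pa.array(['a', 'b', 'c'])])
print(dictionary_encode_safe(aa[:0]))
{code}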



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6851) [Python] Should OSError be FileNotFoundError?

2019-10-11 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6851:

Summary: [Python] Should OSError be FileNotFoundError?  (was: Should 
OSError be FileNotFoundError?)

> [Python] Should OSError be FileNotFoundError?
> -
>
> Key: ARROW-6851
> URL: https://issues.apache.org/jira/browse/ARROW-6851
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Affects Versions: 0.15.0
>Reporter: Vladimir Filimonov
>Priority: Minor
> Fix For: 2.0.0
>
>
> On the read_table function - if the file is not found, an OSError ("Passed 
> non-file path") is raised:
> {code:java}
> import pyarrow.parquet as pq
> pq.read_table('example.parquet')
> {code}
> Should it rather be FileNotFoundError which is more standard in such 
> situations?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6852) [C++] memory-benchmark build failed on Arm64

2019-10-11 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6852:

Fix Version/s: 1.0.0

> [C++] memory-benchmark build failed on Arm64
> 
>
> Key: ARROW-6852
> URL: https://issues.apache.org/jira/browse/ARROW-6852
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Yuqi Gu
>Assignee: Yuqi Gu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> After the new commit ARROW-6381 was merged into master,
> the build fails on Arm64 when -DARROW_BUILD_BENCHMARKS is enabled:
>  
>  
> {code:java}
> /home/builder/arrow/cpp/src/arrow/io/memory_benchmark.cc:205:31: error: 
> 'kMemoryPerCore' was not declared in this scope
>  const int64_t buffer_size = kMemoryPerCore;
>  ^~
> /home/builder/arrow/cpp/src/arrow/io/memory_benchmark.cc:207:19: error: 
> 'Buffer' was not declared in this scope
>  std::shared_ptr<Buffer> src, dst;
>  ^~
> /home/builder/arrow/cpp/src/arrow/io/memory_benchmark.cc:207:19: note: 
> suggested alternative:
> In file included from /home/builder/arrow/cpp/src/arrow/array.h:28:0,
>  from /home/builder/arrow/cpp/src/arrow/api.h:23,
>  from /home/builder/arrow/cpp/src/arrow/io/memory_benchmark.cc:20:
> /home/builder/arrow/cpp/src/arrow/buffer.h:50:20: note: 'arrow::Buffer'
>  class ARROW_EXPORT Buffer {
>  ^~
> /home/builder/arrow/cpp/src/arrow/io/memory_benchmark.cc:207:25: error: 
> template argument 1 is invalid
>  std::shared_ptr<Buffer> src, dst;
> ...
> .
> .
>  
> {code}
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6858) [C++] Create Python script to handle transitive component dependencies

2019-10-11 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6858:
---

 Summary: [C++] Create Python script to handle transitive component 
dependencies
 Key: ARROW-6858
 URL: https://issues.apache.org/jira/browse/ARROW-6858
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


In the C++ build system, we are handling relationships between optional 
components in an ad hoc fashion:

https://github.com/apache/arrow/blob/master/cpp/CMakeLists.txt#L266

This doesn't seem ideal. 

As discussed on the mailing list, I suggest declaring dependencies in a Python 
data structure and then generating and checking in a .cmake file that can be 
{{include}}d. This will be a bit easier than maintaining this on an ad hoc 
basis. 
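A rough sketch of that idea (the component names, the dependency data and the file names below are hypothetical, purely to illustrate the generate-and-include workflow):
{code:python}
# generate_component_deps.py -- illustrative sketch only
COMPONENT_DEPENDENCIES = {
    # component: the components it directly requires (hypothetical data)
    "ARROW_FLIGHT": ["ARROW_IPC"],
    "ARROW_PARQUET": ["ARROW_IPC"],
    "ARROW_PYTHON": ["ARROW_COMPUTE", "ARROW_CSV"],
}

def transitive(deps):
    # Expand each component's direct dependencies into its full transitive set.
    def visit(component, seen):
        for dep in deps.get(component, []):
            if dep not in seen:
                seen.add(dep)
                visit(dep, seen)
        return seen
    return {c: sorted(visit(c, set())) for c in deps}

def emit_cmake(deps, path="ComponentDependencies.cmake"):
    # Write a file that CMakeLists.txt can include() to switch on every
    # dependency of an enabled component.
    with open(path, "w") as f:
        f.write("# Generated by generate_component_deps.py; do not edit.\n")
        for component, requires in sorted(deps.items()):
            for dep in requires:
                f.write("if({0})\n  set({1} ON)\nendif()\n".format(component, dep))

if __name__ == "__main__":
    emit_cmake(transitive(COMPONENT_DEPENDENCIES))
{code}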



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-6859) [CI][Nightly] Disable docker layer caching for CircleCI tasks

2019-10-11 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs reassigned ARROW-6859:
--

Assignee: Krisztian Szucs

> [CI][Nightly] Disable docker layer caching for CircleCI tasks
> -
>
> Key: ARROW-6859
> URL: https://issues.apache.org/jira/browse/ARROW-6859
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
> Fix For: 1.0.0
>
>
> CircleCI builds are failing because the layer caching is not available for 
> free plans.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6859) [CI][Nightly] Disable docker layer caching for CircleCI tasks

2019-10-11 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-6859:
--

 Summary: [CI][Nightly] Disable docker layer caching for CircleCI 
tasks
 Key: ARROW-6859
 URL: https://issues.apache.org/jira/browse/ARROW-6859
 Project: Apache Arrow
  Issue Type: Bug
  Components: Continuous Integration
Reporter: Krisztian Szucs
 Fix For: 1.0.0


CircleCI builds are failing because the layer caching is not available for free 
plans.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6859) [CI][Nightly] Disable docker layer caching for CircleCI tasks

2019-10-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6859:
--
Labels: pull-request-available  (was: )

> [CI][Nightly] Disable docker layer caching for CircleCI tasks
> -
>
> Key: ARROW-6859
> URL: https://issues.apache.org/jira/browse/ARROW-6859
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> CircleCI builds are failing because the layer caching is not available for 
> free plans.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6860) [Python] Only link libarrow_flight.so to pyarrow._flight

2019-10-11 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6860:
---

 Summary: [Python] Only link libarrow_flight.so to pyarrow._flight
 Key: ARROW-6860
 URL: https://issues.apache.org/jira/browse/ARROW-6860
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Wes McKinney
 Fix For: 1.0.0


See BEAM-8368. We need to find a strategy to mitigate protobuf static linking 
issues with the Beam community



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6860) [Python] Only link libarrow_flight.so to pyarrow._flight

2019-10-11 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949652#comment-16949652
 ] 

Antoine Pitrou commented on ARROW-6860:
---

It's a general issue with our Cython extensions. We link them each with all 
Arrow DLLs (including gandiva AFAIR)

> [Python] Only link libarrow_flight.so to pyarrow._flight
> 
>
> Key: ARROW-6860
> URL: https://issues.apache.org/jira/browse/ARROW-6860
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> See BEAM-8368. We need to find a strategy to mitigate protobuf static linking 
> issues with the Beam community



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6860) [Python] Only link libarrow_flight.so to pyarrow._flight

2019-10-11 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949655#comment-16949655
 ] 

Wes McKinney commented on ARROW-6860:
-

Yes, we'll have to make changes to python/CMakeLists.txt to link less 
monolithically. I can take a look at it

> [Python] Only link libarrow_flight.so to pyarrow._flight
> 
>
> Key: ARROW-6860
> URL: https://issues.apache.org/jira/browse/ARROW-6860
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> See BEAM-8368. We need to find a strategy to mitigate protobuf static linking 
> issues with the Beam community



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6861) With arrow-0.14.1-output Parquet dictionary column: Failure reading column: IOError: Arrow error: Invalid: Resize cannot downsize

2019-10-11 Thread Adam Hooper (Jira)
Adam Hooper created ARROW-6861:
--

 Summary: With arrow-0.14.1-output Parquet dictionary column: 
Failure reading column: IOError: Arrow error: Invalid: Resize cannot downsize
 Key: ARROW-6861
 URL: https://issues.apache.org/jira/browse/ARROW-6861
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Python
Affects Versions: 0.15.0
 Environment: debian:buster (in Docker, Linux 5.2.11-200.fc30.x86_64)
Reporter: Adam Hooper
 Attachments: fix-dict-builder-capacity.diff

I'll need to jump through hoops to upload the (seemingly-valid) Parquet file 
that triggers this bug. In the meantime, here's the error I get, reading the 
Parquet file with read_dictionary=true. I'll start with the stack trace:

{{Failure reading column: IOError: Arrow error: Invalid: Resize cannot 
downsize}}

{{#0 0x00b9fffd in __cxa_throw ()}}
 {{#1 0x004ce7b5 in parquet::PlainByteArrayDecoder::DecodeArrow 
(this=0x56612e50, num_values=67339, null_count=0, valid_bits=0x7f39a764b780 
'\377' ..., valid_bits_offset=748544,}}
 \{{ builder=0x56616330) at 
/src/apache-arrow-0.15.0/cpp/src/parquet/encoding.cc:886}}
 {{#2 0x0046d703 in 
parquet::internal::ByteArrayDictionaryRecordReader::ReadValuesSpaced 
(this=0x56616260, values_to_read=67339, null_count=0)}}
 \{{ at /src/apache-arrow-0.15.0/cpp/src/parquet/column_reader.cc:1314}}
 {{#3 0x004a13f8 in 
parquet::internal::TypedRecordReader
 >::ReadRecordData (this=0x56616260, num_records=67339)}}
 \{{ at /src/apache-arrow-0.15.0/cpp/src/parquet/column_reader.cc:1096}}
 {{#4 0x00493876 in 
parquet::internal::TypedRecordReader
 >::ReadRecords (this=0x56616260, num_records=815883)}}
 \{{ at /src/apache-arrow-0.15.0/cpp/src/parquet/column_reader.cc:875}}
 {{#5 0x00413955 in parquet::arrow::LeafReader::NextBatch 
(this=0x56615640, records_to_read=815883, out=0x7ffd4b5afab0) at 
/src/apache-arrow-0.15.0/cpp/src/parquet/arrow/reader.cc:413}}
 {{#6 0x00412081 in parquet::arrow::FileReaderImpl::ReadColumn 
(this=0x566067a0, i=7, row_groups=..., out=0x7ffd4b5afab0) at 
/src/apache-arrow-0.15.0/cpp/src/parquet/arrow/reader.cc:218}}
 {{#7 0x004121b0 in parquet::arrow::FileReaderImpl::ReadColumn 
(this=0x566067a0, i=7, out=0x7ffd4b5afab0) at 
/src/apache-arrow-0.15.0/cpp/src/parquet/arrow/reader.cc:223}}
 {{#8 0x00405fbd in readParquet(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()}}

And now a report of my gdb adventures:

In Arrow 0.15.0, when reading a particular dictionary column 
({{read_dictionaries=true}}) with 815883 rows that was written by Arrow 0.14.1, 
{{arrow::Dictionary32Builder::AppendIndices(...)}} is called 
twice (once with 493568 values, once with 254976 values); and then 
{{PlainByteArrayDecoder::DecodeArrow()}} is called. (I'm a novice; I don't know 
why this column comes in three batches.) On first {{AppendIndices()}} call, the 
buffer capacity is equal to the number of values. On second call, that's no 
longer the case: the buffer grows using {{BufferBuilder::GrowByFactor}}, so its 
capacity is 987136.

But there's a bug: the 987136-capacity buffer is in 
{{Dictionary32Builder::indices_builder_}}; so 987136 is stored in 
{{Dictionary32Builder::indices_builder_.capacity_}}. 
{{Dictionary32Builder::capacity_}} does not change when {{AppendIndices()}} is 
called. (Dictionary32Builder behaves like a proxy for its {{indices_builder_}}; 
but its {{capacity()}} method is not virtual, so things are messy.)

So {{builder.capacity_}} is 0. Then comes the final batch of 67339 values, via 
{{DecodeArrow()}}. It calls {{builder->Reserve(num_values)}}. But 
{{builder->Reserve(num_values)}} tries to increase the capacity from 0 (its 
wrong, cached value) to {{length_ + num_values}} (815883). Since 
{{indices_builder_->capacity_}} is 987136, that's a downsize – which throws an 
exception.

The only workaround I can find: use {{read_dictionaries=false}}.

This affects Python, too.

I've attached a patch that fixes the issue for my file. I don't know how to 
formulate a reduction, though, so I haven't contributed unit tests. I'm also 
not certain how FinishInternal is meant to work, so this definitely needs 
expert review. (FinishInternal was _definitely_ buggy before my patch; after my 
patch it _might_ be buggy but I don't know.)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6861) arrow-0.15.0 reading arrow-0.14.1-output Parquet dictionary column: Failure reading column: IOError: Arrow error: Invalid: Resize cannot downsize

2019-10-11 Thread Adam Hooper (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Hooper updated ARROW-6861:
---
Summary: arrow-0.15.0 reading arrow-0.14.1-output Parquet dictionary 
column: Failure reading column: IOError: Arrow error: Invalid: Resize cannot 
downsize  (was: With arrow-0.14.1-output Parquet dictionary column: Failure 
reading column: IOError: Arrow error: Invalid: Resize cannot downsize)

> arrow-0.15.0 reading arrow-0.14.1-output Parquet dictionary column: Failure 
> reading column: IOError: Arrow error: Invalid: Resize cannot downsize
> -
>
> Key: ARROW-6861
> URL: https://issues.apache.org/jira/browse/ARROW-6861
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.15.0
> Environment: debian:buster (in Docker, Linux 5.2.11-200.fc30.x86_64)
>Reporter: Adam Hooper
>Priority: Major
> Attachments: fix-dict-builder-capacity.diff
>
>
> I'll need to jump through hoops to upload the (seemingly-valid) Parquet file 
> that triggers this bug. In the meantime, here's the error I get, reading the 
> Parquet file with read_dictionary=true. I'll start with the stack trace:
> {{Failure reading column: IOError: Arrow error: Invalid: Resize cannot 
> downsize}}
> {{#0 0x00b9fffd in __cxa_throw ()}}
>  {{#1 0x004ce7b5 in parquet::PlainByteArrayDecoder::DecodeArrow 
> (this=0x56612e50, num_values=67339, null_count=0, 
> valid_bits=0x7f39a764b780 '\377' ..., 
> valid_bits_offset=748544,}}
>  \{{ builder=0x56616330) at 
> /src/apache-arrow-0.15.0/cpp/src/parquet/encoding.cc:886}}
>  {{#2 0x0046d703 in 
> parquet::internal::ByteArrayDictionaryRecordReader::ReadValuesSpaced 
> (this=0x56616260, values_to_read=67339, null_count=0)}}
>  \{{ at /src/apache-arrow-0.15.0/cpp/src/parquet/column_reader.cc:1314}}
>  {{#3 0x004a13f8 in 
> parquet::internal::TypedRecordReader
>  >::ReadRecordData (this=0x56616260, num_records=67339)}}
>  \{{ at /src/apache-arrow-0.15.0/cpp/src/parquet/column_reader.cc:1096}}
>  {{#4 0x00493876 in 
> parquet::internal::TypedRecordReader
>  >::ReadRecords (this=0x56616260, num_records=815883)}}
>  \{{ at /src/apache-arrow-0.15.0/cpp/src/parquet/column_reader.cc:875}}
>  {{#5 0x00413955 in parquet::arrow::LeafReader::NextBatch 
> (this=0x56615640, records_to_read=815883, out=0x7ffd4b5afab0) at 
> /src/apache-arrow-0.15.0/cpp/src/parquet/arrow/reader.cc:413}}
>  {{#6 0x00412081 in parquet::arrow::FileReaderImpl::ReadColumn 
> (this=0x566067a0, i=7, row_groups=..., out=0x7ffd4b5afab0) at 
> /src/apache-arrow-0.15.0/cpp/src/parquet/arrow/reader.cc:218}}
>  {{#7 0x004121b0 in parquet::arrow::FileReaderImpl::ReadColumn 
> (this=0x566067a0, i=7, out=0x7ffd4b5afab0) at 
> /src/apache-arrow-0.15.0/cpp/src/parquet/arrow/reader.cc:223}}
>  {{#8 0x00405fbd in readParquet(std::__cxx11::basic_string std::char_traits, std::allocator > const&) ()}}
> And now a report of my gdb adventures:
> In Arrow 0.15.0, when reading a particular dictionary column 
> ({{read_dictionaries=true}}) with 815883 rows that was written by Arrow 
> 0.14.1, {{arrow::Dictionary32Builder::AppendIndices(...)}} 
> is called twice (once with 493568 values, once with 254976 values); and then 
> {{PlainByteArrayDecoder::DecodeArrow()}} is called. (I'm a novice; I don't 
> know why this column comes in three batches.) On first {{AppendIndices()}} 
> call, the buffer capacity is equal to the number of values. On second call, 
> that's no longer the case: the buffer grows using 
> {{BufferBuilder::GrowByFactor}}, so its capacity is 987136.
> But there's a bug: the 987136-capacity buffer is in 
> {{Dictionary32Builder::indices_builder_}}; so 987136 is stored in 
> {{Dictionary32Builder::indices_builder_.capacity_}}. 
> {{Dictionary32Builder::capacity_}} does not change when {{AppendIndices()}} 
> is called. (Dictionary32Builder behaves like a proxy for its 
> {{indices_builder_}}; but its {{capacity()}} method is not virtual, so things 
> are messy.)
> So {{builder.capacity_}} is 0. Then comes the final batch of 67339 values, 
> via {{DecodeArrow()}}. It calls {{builder->Reserve(num_values)}}. But 
> {{builder->Reserve(num_values)}} tries to increase the capacity from 0 (its 
> wrong, cached value) to {{length_ + num_values}} (815883). Since 
> {{indicies_builder->capacity_}} is 987136, that's a downsize – which throws 
> an exception.
> The only workaround I can find: use {{read_dictionaries=false}}.
> This affects Python, too.
> I've attached a patch that fixes the issue for my file. I don't know how to 
> formulate a reduction, though, so I haven't contributed unit test

[jira] [Resolved] (ARROW-6711) [C++] Consolidate Filter and Expression classes

2019-10-11 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman resolved ARROW-6711.
-
Resolution: Fixed

Issue resolved by pull request 5594
[https://github.com/apache/arrow/pull/5594]

> [C++] Consolidate Filter and Expression classes
> ---
>
> Key: ARROW-6711
> URL: https://issues.apache.org/jira/browse/ARROW-6711
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: dataset, pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> There is unnecessary boilerplate required when using the Filter/Expression 
> classes. Filter is no longer necessary; it (and FilterVector) can be replaced 
> with Expression. Expression is sufficiently general that it can be subclassed 
> to provide any custom functionality which would have been added through a 
> GenericFilter (add some tests for this).
> Additionally rows within RecordBatches yielded from a scan are not currently 
> filtered using Expression::Evaluate(). (Add tests ensuring both row filtering 
> and pruning obey Kleene logic)
> Add some comments on the mechanism of {{Assume()}} too, and refactor it not 
> to return a Result (its failure modes are covered by {{Validate()}})



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6861) arrow-0.15.0 reading arrow-0.14.1-output Parquet dictionary column: Failure reading column: IOError: Arrow error: Invalid: Resize cannot downsize

2019-10-11 Thread Adam Hooper (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Hooper updated ARROW-6861:
---
Attachment: parquet-written-by-arrow-0-14-1.7z

> arrow-0.15.0 reading arrow-0.14.1-output Parquet dictionary column: Failure 
> reading column: IOError: Arrow error: Invalid: Resize cannot downsize
> -
>
> Key: ARROW-6861
> URL: https://issues.apache.org/jira/browse/ARROW-6861
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.15.0
> Environment: debian:buster (in Docker, Linux 5.2.11-200.fc30.x86_64)
>Reporter: Adam Hooper
>Priority: Major
> Attachments: fix-dict-builder-capacity.diff, 
> parquet-written-by-arrow-0-14-1.7z
>
>
> I'll need to jump through hoops to upload the (seemingly-valid) Parquet file 
> that triggers this bug. In the meantime, here's the error I get, reading the 
> Parquet file with read_dictionary=true. I'll start with the stack trace:
> {{Failure reading column: IOError: Arrow error: Invalid: Resize cannot 
> downsize}}
> {{#0 0x00b9fffd in __cxa_throw ()}}
>  {{#1 0x004ce7b5 in parquet::PlainByteArrayDecoder::DecodeArrow 
> (this=0x56612e50, num_values=67339, null_count=0, 
> valid_bits=0x7f39a764b780 '\377' ..., 
> valid_bits_offset=748544,}}
>  \{{ builder=0x56616330) at 
> /src/apache-arrow-0.15.0/cpp/src/parquet/encoding.cc:886}}
>  {{#2 0x0046d703 in 
> parquet::internal::ByteArrayDictionaryRecordReader::ReadValuesSpaced 
> (this=0x56616260, values_to_read=67339, null_count=0)}}
>  \{{ at /src/apache-arrow-0.15.0/cpp/src/parquet/column_reader.cc:1314}}
>  {{#3 0x004a13f8 in 
> parquet::internal::TypedRecordReader
>  >::ReadRecordData (this=0x56616260, num_records=67339)}}
>  \{{ at /src/apache-arrow-0.15.0/cpp/src/parquet/column_reader.cc:1096}}
>  {{#4 0x00493876 in 
> parquet::internal::TypedRecordReader
>  >::ReadRecords (this=0x56616260, num_records=815883)}}
>  \{{ at /src/apache-arrow-0.15.0/cpp/src/parquet/column_reader.cc:875}}
>  {{#5 0x00413955 in parquet::arrow::LeafReader::NextBatch 
> (this=0x56615640, records_to_read=815883, out=0x7ffd4b5afab0) at 
> /src/apache-arrow-0.15.0/cpp/src/parquet/arrow/reader.cc:413}}
>  {{#6 0x00412081 in parquet::arrow::FileReaderImpl::ReadColumn 
> (this=0x566067a0, i=7, row_groups=..., out=0x7ffd4b5afab0) at 
> /src/apache-arrow-0.15.0/cpp/src/parquet/arrow/reader.cc:218}}
>  {{#7 0x004121b0 in parquet::arrow::FileReaderImpl::ReadColumn 
> (this=0x566067a0, i=7, out=0x7ffd4b5afab0) at 
> /src/apache-arrow-0.15.0/cpp/src/parquet/arrow/reader.cc:223}}
>  {{#8 0x00405fbd in readParquet(std::__cxx11::basic_string std::char_traits, std::allocator > const&) ()}}
> And now a report of my gdb adventures:
> In Arrow 0.15.0, when reading a particular dictionary column 
> ({{read_dictionaries=true}}) with 815883 rows that was written by Arrow 
> 0.14.1, {{arrow::Dictionary32Builder::AppendIndices(...)}} 
> is called twice (once with 493568 values, once with 254976 values); and then 
> {{PlainByteArrayDecoder::DecodeArrow()}} is called. (I'm a novice; I don't 
> know why this column comes in three batches.) On first {{AppendIndices()}} 
> call, the buffer capacity is equal to the number of values. On second call, 
> that's no longer the case: the buffer grows using 
> {{BufferBuilder::GrowByFactor}}, so its capacity is 987136.
> But there's a bug: the 987136-capacity buffer is in 
> {{Dictionary32Builder::indices_builder_}}; so 987136 is stored in 
> {{Dictionary32Builder::indices_builder_.capacity_}}. 
> {{Dictionary32Builder::capacity_}} does not change when {{AppendIndices()}} 
> is called. (Dictionary32Builder behaves like a proxy for its 
> {{indices_builder_}}; but its {{capacity()}} method is not virtual, so things 
> are messy.)
> So {{builder.capacity_}} is 0. Then comes the final batch of 67339 values, 
> via {{DecodeArrow()}}. It calls {{builder->Reserve(num_values)}}. But 
> {{builder->Reserve(num_values)}} tries to increase the capacity from 0 (its 
> wrong, cached value) to {{length_ + num_values}} (815883). Since 
> {{indicies_builder->capacity_}} is 987136, that's a downsize – which throws 
> an exception.
> The only workaround I can find: use {{read_dictionaries=false}}.
> This affects Python, too.
> I've attached a patch that fixes the issue for my file. I don't know how to 
> formulate a reduction, though, so I haven't contributed unit tests. I'm also 
> not certain how FinishInternal is meant to work, so this definitely needs 
> expert review. (FinishInternal was _definitely_ buggy before my patch; after 
> my patch it _might_ be buggy but I don

[jira] [Commented] (ARROW-6861) arrow-0.15.0 reading arrow-0.14.1-output Parquet dictionary column: Failure reading column: IOError: Arrow error: Invalid: Resize cannot downsize

2019-10-11 Thread Adam Hooper (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949738#comment-16949738
 ] 

Adam Hooper commented on ARROW-6861:


I've attached a Parquet file, written by Arrow 0.14.1, that triggers this 
problem. Column 8 (among others) is affected; most columns work fine.

> arrow-0.15.0 reading arrow-0.14.1-output Parquet dictionary column: Failure 
> reading column: IOError: Arrow error: Invalid: Resize cannot downsize
> -
>
> Key: ARROW-6861
> URL: https://issues.apache.org/jira/browse/ARROW-6861
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.15.0
> Environment: debian:buster (in Docker, Linux 5.2.11-200.fc30.x86_64)
>Reporter: Adam Hooper
>Priority: Major
> Attachments: fix-dict-builder-capacity.diff, 
> parquet-written-by-arrow-0-14-1.7z
>
>
> I'll need to jump through hoops to upload the (seemingly-valid) Parquet file 
> that triggers this bug. In the meantime, here's the error I get, reading the 
> Parquet file with read_dictionary=true. I'll start with the stack trace:
> {{Failure reading column: IOError: Arrow error: Invalid: Resize cannot 
> downsize}}
> {{#0 0x00b9fffd in __cxa_throw ()}}
>  {{#1 0x004ce7b5 in parquet::PlainByteArrayDecoder::DecodeArrow 
> (this=0x56612e50, num_values=67339, null_count=0, 
> valid_bits=0x7f39a764b780 '\377' ..., 
> valid_bits_offset=748544,}}
>  \{{ builder=0x56616330) at 
> /src/apache-arrow-0.15.0/cpp/src/parquet/encoding.cc:886}}
>  {{#2 0x0046d703 in 
> parquet::internal::ByteArrayDictionaryRecordReader::ReadValuesSpaced 
> (this=0x56616260, values_to_read=67339, null_count=0)}}
>  \{{ at /src/apache-arrow-0.15.0/cpp/src/parquet/column_reader.cc:1314}}
>  {{#3 0x004a13f8 in 
> parquet::internal::TypedRecordReader
>  >::ReadRecordData (this=0x56616260, num_records=67339)}}
>  \{{ at /src/apache-arrow-0.15.0/cpp/src/parquet/column_reader.cc:1096}}
>  {{#4 0x00493876 in 
> parquet::internal::TypedRecordReader
>  >::ReadRecords (this=0x56616260, num_records=815883)}}
>  \{{ at /src/apache-arrow-0.15.0/cpp/src/parquet/column_reader.cc:875}}
>  {{#5 0x00413955 in parquet::arrow::LeafReader::NextBatch 
> (this=0x56615640, records_to_read=815883, out=0x7ffd4b5afab0) at 
> /src/apache-arrow-0.15.0/cpp/src/parquet/arrow/reader.cc:413}}
>  {{#6 0x00412081 in parquet::arrow::FileReaderImpl::ReadColumn 
> (this=0x566067a0, i=7, row_groups=..., out=0x7ffd4b5afab0) at 
> /src/apache-arrow-0.15.0/cpp/src/parquet/arrow/reader.cc:218}}
>  {{#7 0x004121b0 in parquet::arrow::FileReaderImpl::ReadColumn 
> (this=0x566067a0, i=7, out=0x7ffd4b5afab0) at 
> /src/apache-arrow-0.15.0/cpp/src/parquet/arrow/reader.cc:223}}
>  {{#8 0x00405fbd in readParquet(std::__cxx11::basic_string std::char_traits, std::allocator > const&) ()}}
> And now a report of my gdb adventures:
> In Arrow 0.15.0, when reading a particular dictionary column 
> ({{read_dictionaries=true}}) with 815883 rows that was written by Arrow 
> 0.14.1, {{arrow::Dictionary32Builder::AppendIndices(...)}} 
> is called twice (once with 493568 values, once with 254976 values); and then 
> {{PlainByteArrayDecoder::DecodeArrow()}} is called. (I'm a novice; I don't 
> know why this column comes in three batches.) On first {{AppendIndices()}} 
> call, the buffer capacity is equal to the number of values. On second call, 
> that's no longer the case: the buffer grows using 
> {{BufferBuilder::GrowByFactor}}, so its capacity is 987136.
> But there's a bug: the 987136-capacity buffer is in 
> {{Dictionary32Builder::indices_builder_}}; so 987136 is stored in 
> {{Dictionary32Builder::indices_builder_.capacity_}}. 
> {{Dictionary32Builder::capacity_}} does not change when {{AppendIndices()}} 
> is called. (Dictionary32Builder behaves like a proxy for its 
> {{indices_builder_}}; but its {{capacity()}} method is not virtual, so things 
> are messy.)
> So {{builder.capacity_}} is 0. Then comes the final batch of 67339 values, 
> via {{DecodeArrow()}}. It calls {{builder->Reserve(num_values)}}. But 
> {{builder->Reserve(num_values)}} tries to increase the capacity from 0 (its 
> wrong, cached value) to {{length_ + num_values}} (815883). Since 
> {{indicies_builder->capacity_}} is 987136, that's a downsize – which throws 
> an exception.
> The only workaround I can find: use {{read_dictionaries=false}}.
> This affects Python, too.
> I've attached a patch that fixes the issue for my file. I don't know how to 
> formulate a reduction, though, so I haven't contributed unit tests. I'm also 
> not certain how FinishInternal is me

[jira] [Commented] (ARROW-6860) [Python] Only link libarrow_flight.so to pyarrow._flight

2019-10-11 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949745#comment-16949745
 ] 

David Li commented on ARROW-6860:
-

That won't help as libarrow_python will still link against Flight. You'll need 
a libarrow_python_flight as well.

> [Python] Only link libarrow_flight.so to pyarrow._flight
> 
>
> Key: ARROW-6860
> URL: https://issues.apache.org/jira/browse/ARROW-6860
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> See BEAM-8368. We need to find a strategy to mitigate protobuf static linking 
> issues with the Beam community



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6793) [R] Arrow C++ binary packaging for Linux

2019-10-11 Thread Thomas Schm (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949763#comment-16949763
 ] 

Thomas Schm commented on ARROW-6793:


Gosh, that's a big can of worms. Is there a chance to keep the precompiled 
libraries (see https://arrow.apache.org/install/) somewhat in sync with a 
tagged version from GitHub? At the moment the libraries are all pointing to 
0.15.0 etc., but CRAN is lagging and GitHub is somewhat ahead. Maybe it's a 
stupid idea in the first place to try to rely on these precompiled libraries? 
Or maybe one could install slightly outdated libraries to stay in sync with 
CRAN?

> [R] Arrow C++ binary packaging for Linux
> 
>
> Key: ARROW-6793
> URL: https://issues.apache.org/jira/browse/ARROW-6793
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
> Fix For: 1.0.0
>
>
> Our current installation experience on Linux isn't ideal. Unless you've 
> already installed the Arrow C++ library, when you install the R package, you 
> get a shell that tells you to install the C++ library. That was a useful 
> approach to allow us to get the package on CRAN, which makes it easy for 
> macOS and Windows users to install, but it doesn't improve the installation 
> experience for Linux users. This is an impediment to adoption of arrow not 
> only by users but also by package maintainers who might want to depend on 
> arrow. 
> macOS and Windows have a better experience because at installation time, the 
> configure scripts download and statically link a prebuilt C++ library. CRAN 
> bundles the whole thing up and delivers that as a binary R package. 
> Python wheels do a similar thing: they're binaries that contain all external 
> dependencies. And there are pyarrow wheels for Linux. This suggests that we 
> could do something similar for R: build a generic Linux binary of the C++ 
> library and download it in the R package configure script at install time.
> I experimented with using the Arrow C++ binaries included in the Python 
> wheels in R. See discussion at the end of ARROW-5956. This worked on macOS 
> (not useful for R, but it proved the concept) and almost worked on Linux, but 
> it turned out that the "manylinux2010" standard is too archaic to work with 
> contemporary Rcpp. 
> Proposal: do a similar workflow to what the manylinux2010 pyarrow build does, 
> just with slightly more modern compiler/settings. Publish that C++ binary 
> package to bintray. Then download it in the R configure script if a 
> local/system package isn't found.
> Once we have a basic version working, test against various distros on 
> [R-hub|https://builder.r-hub.io/advanced] to make sure we're solid everywhere 
> and/or ensure the current fallback behavior when we encounter a distro that 
> this doesn't work for. If necessary, we can make multiple flavors of this C++ 
> binary for debian, centos, etc.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6793) [R] Arrow C++ binary packaging for Linux

2019-10-11 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949770#comment-16949770
 ] 

Neal Richardson commented on ARROW-6793:


The binaries available on the install page _are_ in sync with tagged versions 
on GitHub, but you seem to be installing the head of the master branch (what 
you get if you do install_github without specifying a tag). If you want to use 
the built binary libraries for an official release version of the C++ library, 
you need to use the corresponding R package. You can get that from CRAN; it 
isn't lagging. In the output you pasted above, you were installing from a CRAN 
snapshot ("https://mran.microsoft.com/snapshot/2019-09-19/"). That's your lag.

> [R] Arrow C++ binary packaging for Linux
> 
>
> Key: ARROW-6793
> URL: https://issues.apache.org/jira/browse/ARROW-6793
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
> Fix For: 1.0.0
>
>
> Our current installation experience on Linux isn't ideal. Unless you've 
> already installed the Arrow C++ library, when you install the R package, you 
> get a shell that tells you to install the C++ library. That was a useful 
> approach to allow us to get the package on CRAN, which makes it easy for 
> macOS and Windows users to install, but it doesn't improve the installation 
> experience for Linux users. This is an impediment to adoption of arrow not 
> only by users but also by package maintainers who might want to depend on 
> arrow. 
> macOS and Windows have a better experience because at installation time, the 
> configure scripts download and statically link a prebuilt C++ library. CRAN 
> bundles the whole thing up and delivers that as a binary R package. 
> Python wheels do a similar thing: they're binaries that contain all external 
> dependencies. And there are pyarrow wheels for Linux. This suggests that we 
> could do something similar for R: build a generic Linux binary of the C++ 
> library and download it in the R package configure script at install time.
> I experimented with using the Arrow C++ binaries included in the Python 
> wheels in R. See discussion at the end of ARROW-5956. This worked on macOS 
> (not useful for R, but it proved the concept) and almost worked on Linux, but 
> it turned out that the "manylinux2010" standard is too archaic to work with 
> contemporary Rcpp. 
> Proposal: do a similar workflow to what the manylinux2010 pyarrow build does, 
> just with slightly more modern compiler/settings. Publish that C++ binary 
> package to bintray. Then download it in the R configure script if a 
> local/system package isn't found.
> Once we have a basic version working, test against various distros on 
> [R-hub|https://builder.r-hub.io/advanced] to make sure we're solid everywhere 
> and/or ensure the current fallback behavior when we encounter a distro that 
> this doesn't work for. If necessary, we can make multiple flavors of this C++ 
> binary for debian, centos, etc.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Issue Comment Deleted] (ARROW-6813) [Ruby] Arrow::Table.load with headers=true leads to exception in Arrow 0.15

2019-10-11 Thread Rick Cobb (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rick Cobb updated ARROW-6813:
-
Comment: was deleted

(was: As I looked more deeply at the issue, it appears that 0.15.0 completely 
reworks the notion of header parsing, and the most straightforward solution is 
to remove the `headers` option from the Ruby layer.  Thus my PR.  We've 
"repaired" our application code by removing the use of the option; we always 
had it set to `true` anyway, and that's the behavior with no options now.)

> [Ruby] Arrow::Table.load with headers=true leads to exception in Arrow 0.15
> ---
>
> Key: ARROW-6813
> URL: https://issues.apache.org/jira/browse/ARROW-6813
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Ruby
>Affects Versions: 0.15.0
> Environment: Ubuntu 18.04, Debian Stretch
>Reporter: Rick Cobb
>Assignee: Rick Cobb
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> ```
> Error: undefined method `n_header_rows=' for 
> #
> ```
> It appears that 0.15 has changed the name for this option to `n_skip_rows`
>  
> ```
> (byebug) options
> #(byebug) 
> (options.methods - Object.new.methods).sort
> [:add_column_name, :add_column_type, :add_column_type_raw, :add_false_value, 
> :add_null_value, :add_schema, :add_true_value, :allow_newlines_in_values=, 
> :allow_newlines_in_values?, :allow_null_strings=, :allow_null_strings?, 
> :bind_property, :block_size, :block_size=, :check_utf8=, :check_utf8?, 
> :column_names, :column_names=, :column_types, :delimiter, :delimiter=, 
> :destroyed?, :double_quoted=, :double_quoted?, :escape_character, 
> :escape_character=, :escaped=, :escaped?, :false_values, :false_values=, 
> :floating?, :freeze_notify, :generate_column_names=, :generate_column_names?, 
> :get_property, :gtype, :ignore_empty_lines=, :ignore_empty_lines?, 
> :n_skip_rows, :n_skip_rows=, :notify, :null_values, :null_values=, 
> :parent_instance, :quote_character, :quote_character=, :quoted=, :quoted?, 
> :ref_count, :set_allow_newlines_in_values, :set_allow_null_strings, 
> :set_block_size, :set_check_utf8, :set_column_names, :set_delimiter, 
> :set_double_quoted, :set_escape_character, :set_escaped, :set_false_values, 
> :set_generate_column_names, :set_ignore_empty_lines, :set_n_skip_rows, 
> :set_null_values, :set_property, :set_quote_character, :set_quoted, 
> :set_true_values, :set_use_threads, :signal_connect, :signal_connect_after, 
> :signal_emit, :signal_emit_stop, :signal_handler_block, 
> :signal_handler_disconnect, :signal_handler_is_connected?, 
> :signal_handler_unblock, :signal_has_handler_pending?, :thaw_notify, 
> :true_values, :true_values=, :type_name, :unref, :use_threads=, :use_threads?]
> ```



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6659) [Rust] [DataFusion] Refactor of HashAggregateExec to support custom merge

2019-10-11 Thread Kyle McCarthy (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949779#comment-16949779
 ] 

Kyle McCarthy commented on ARROW-6659:
--

I am happy to help - and I would prefer to do it the way you wanted/the right 
way! I am fairly unfamiliar with the codebase, so I am really just learning it 
by working through the open tasks; this may be a dumb question.

How do the LogicalPlan and the partition count actually work together? From the 
tests it looks like the partition count is related to the batch size? If so, 
that would mean that every LogicalPlan would have the same partition count, 
right?

Also, if we do add LogicalPlan::Merge - would that mean that when the SQL 
Planner is creating a logical Aggregate it would create: `Aggregate { Merge { 
Aggregate ( aggregate_input ) } }`? If so, that definitely makes sense to me, 
but I am still not totally sure how the partition count works into this.

Thank you for your patience!

> [Rust] [DataFusion] Refactor of HashAggregateExec to support custom merge
> -
>
> Key: ARROW-6659
> URL: https://issues.apache.org/jira/browse/ARROW-6659
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Kyle McCarthy
>Priority: Major
>
> HashAggregateExec current creates one HashPartition per input partition for 
> the initial aggregate per partition, and then explicitly calls MergeExec and 
> then creates another HashPartition for the final reduce operation.
> This is fine for in-memory queries in DataFusion but is not extensible. For 
> example, it is not possible to provide a different MergeExec implementation 
> that would distribute queries to a cluster.
> A better design would be to move the logic into the query planner so that the 
> physical plan contains explicit steps such as:
>  
> {code:java}
> - HashAggregate // final aggregate
>   - MergeExec
> - HashAggregate // aggregate per partition
>  {code}
> This would then make it easier to customize the plan in other projects, to 
> support distributed execution:
> {code:java}
>  - HashAggregate // final aggregate
>- MergeExec
>   - DistributedExec
>  - HashAggregate // aggregate per partition{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-6659) [Rust] [DataFusion] Refactor of HashAggregateExec to support custom merge

2019-10-11 Thread Kyle McCarthy (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949779#comment-16949779
 ] 

Kyle McCarthy edited comment on ARROW-6659 at 10/11/19 8:32 PM:


I am happy to help - and I would prefer to do it the way you wanted/the right 
way! I am fairly unfamiliar with the codebase, so I am really just learning it 
by working through the open tasks; this may be a dumb question.

How do the LogicalPlan and the partition count actually work together? From the 
tests it looks like the partition count is related to the batch size? If so, 
that would mean that every LogicalPlan would have the same partition count, 
right?

Also, if we do add LogicalPlan::Merge - would that mean that when the SQL 
Planner is creating a logical Aggregate it would create:
{code:java}
Aggregate { Merge { Aggregate ( aggregate_input ) } }{code}
? If so, that definitely makes sense to me, but I am still not totally sure how 
the partition count works into this.

Thank you for your patience!


was (Author: kylemccarthy):
I am happy to help - and I would prefer to do it how you wanted/the right way! 
I am fairly unfamiliar with the codebase so I am really just learning it by 
working through the open tasks, so this may be a dumb question.

How does the LogicalPlan and partition count actually work together. From the 
tests it looks like the partition count is related to the batch size? If so 
that would mean that every LogicalPlan would have the same partition count 
right?

Also, if we do add the LogicalPlan::Merge - that would mean that when the SQL 
Planner is creating a Logical Aggregate it would create: `Aggregate { Merge { 
Aggregate ( aggregate_input ) } }`? If so that definitely makes sense to me, 
but I am still not totally sure how the partition count would work into this.

Thank you for your patience!

> [Rust] [DataFusion] Refactor of HashAggregateExec to support custom merge
> -
>
> Key: ARROW-6659
> URL: https://issues.apache.org/jira/browse/ARROW-6659
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Kyle McCarthy
>Priority: Major
>
> HashAggregateExec current creates one HashPartition per input partition for 
> the initial aggregate per partition, and then explicitly calls MergeExec and 
> then creates another HashPartition for the final reduce operation.
> This is fine for in-memory queries in DataFusion but is not extensible. For 
> example, it is not possible to provide a different MergeExec implementation 
> that would distribute queries to a cluster.
> A better design would be to move the logic into the query planner so that the 
> physical plan contains explicit steps such as:
>  
> {code:java}
> - HashAggregate // final aggregate
>   - MergeExec
> - HashAggregate // aggregate per partition
>  {code}
> This would then make it easier to customize the plan in other projects, to 
> support distributed execution:
> {code:java}
>  - HashAggregate // final aggregate
>- MergeExec
>   - DistributedExec
>  - HashAggregate // aggregate per partition{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6793) [R] Arrow C++ binary packaging for Linux

2019-10-11 Thread Thomas Schm (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949785#comment-16949785
 ] 

Thomas Schm commented on ARROW-6793:


Awesome, that's good news. For everyone following this thread: to install a 
tagged version from GitHub, run
R -e 'remotes::install_github("apache/arrow/r@apache-arrow-0.15.0")'
or, with CRAN,
devtools::install_version("arrow", version = "0.15.0", repos = 
"http://cran.us.r-project.org")

Thanks for all your help on that issue. The documentation on downloading the 
precompiled libraries is unfortunately slightly outdated, but @kou is already 
on the case. If I understand the linking process correctly, there is no need to 
specify any version number for the precompiled libraries, as Debian is merely 
given access to a software archive and the compiler/linker can pick whatever 
library it needs. I couldn't agree more with the initial premise of this 
thread: the experience for people running arrow on Linux and relying on these 
binary packages is not exactly ideal :-) Painful. Thanks again... Note that the 
documentation is too terse for people without deep knowledge of Debian and the 
way it accesses libraries, and/or who are not familiar with devtools.

> [R] Arrow C++ binary packaging for Linux
> 
>
> Key: ARROW-6793
> URL: https://issues.apache.org/jira/browse/ARROW-6793
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
> Fix For: 1.0.0
>
>
> Our current installation experience on Linux isn't ideal. Unless you've 
> already installed the Arrow C++ library, when you install the R package, you 
> get a shell that tells you to install the C++ library. That was a useful 
> approach to allow us to get the package on CRAN, which makes it easy for 
> macOS and Windows users to install, but it doesn't improve the installation 
> experience for Linux users. This is an impediment to adoption of arrow not 
> only by users but also by package maintainers who might want to depend on 
> arrow. 
> macOS and Windows have a better experience because at installation time, the 
> configure scripts download and statically link a prebuilt C++ library. CRAN 
> bundles the whole thing up and delivers that as a binary R package. 
> Python wheels do a similar thing: they're binaries that contain all external 
> dependencies. And there are pyarrow wheels for Linux. This suggests that we 
> could do something similar for R: build a generic Linux binary of the C++ 
> library and download it in the R package configure script at install time.
> I experimented with using the Arrow C++ binaries included in the Python 
> wheels in R. See discussion at the end of ARROW-5956. This worked on macOS 
> (not useful for R, but it proved the concept) and almost worked on Linux, but 
> it turned out that the "manylinux2010" standard is too archaic to work with 
> contemporary Rcpp. 
> Proposal: do a similar workflow to what the manylinux2010 pyarrow build does, 
> just with slightly more modern compiler/settings. Publish that C++ binary 
> package to bintray. Then download it in the R configure script if a 
> local/system package isn't found.
> Once we have a basic version working, test against various distros on 
> [R-hub|https://builder.r-hub.io/advanced] to make sure we're solid everywhere 
> and/or ensure the current fallback behavior when we encounter a distro that 
> this doesn't work for. If necessary, we can make multiple flavors of this C++ 
> binary for debian, centos, etc.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6793) [R] Arrow C++ binary packaging for Linux

2019-10-11 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949791#comment-16949791
 ] 

Neal Richardson commented on ARROW-6793:


You don't need devtools/remotes if you want to install the current version. 
Just install it from CRAN.

> [R] Arrow C++ binary packaging for Linux
> 
>
> Key: ARROW-6793
> URL: https://issues.apache.org/jira/browse/ARROW-6793
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
> Fix For: 1.0.0
>
>
> Our current installation experience on Linux isn't ideal. Unless you've 
> already installed the Arrow C++ library, when you install the R package, you 
> get a shell that tells you to install the C++ library. That was a useful 
> approach to allow us to get the package on CRAN, which makes it easy for 
> macOS and Windows users to install, but it doesn't improve the installation 
> experience for Linux users. This is an impediment to adoption of arrow not 
> only by users but also by package maintainers who might want to depend on 
> arrow. 
> macOS and Windows have a better experience because at installation time, the 
> configure scripts download and statically link a prebuilt C++ library. CRAN 
> bundles the whole thing up and delivers that as a binary R package. 
> Python wheels do a similar thing: they're binaries that contain all external 
> dependencies. And there are pyarrow wheels for Linux. This suggests that we 
> could do something similar for R: build a generic Linux binary of the C++ 
> library and download it in the R package configure script at install time.
> I experimented with using the Arrow C++ binaries included in the Python 
> wheels in R. See discussion at the end of ARROW-5956. This worked on macOS 
> (not useful for R, but it proved the concept) and almost worked on Linux, but 
> it turned out that the "manylinux2010" standard is too archaic to work with 
> contemporary Rcpp. 
> Proposal: do a similar workflow to what the manylinux2010 pyarrow build does, 
> just with slightly more modern compiler/settings. Publish that C++ binary 
> package to bintray. Then download it in the R configure script if a 
> local/system package isn't found.
> Once we have a basic version working, test against various distros on 
> [R-hub|https://builder.r-hub.io/advanced] to make sure we're solid everywhere 
> and/or ensure the current fallback behavior when we encounter a distro that 
> this doesn't work for. If necessary, we can make multiple flavors of this C++ 
> binary for debian, centos, etc.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6861) [Python] arrow-0.15.0 reading arrow-0.14.1-output Parquet dictionary column: Failure reading column: IOError: Arrow error: Invalid: Resize cannot downsize

2019-10-11 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6861:

Summary: [Python] arrow-0.15.0 reading arrow-0.14.1-output Parquet 
dictionary column: Failure reading column: IOError: Arrow error: Invalid: 
Resize cannot downsize  (was: arrow-0.15.0 reading arrow-0.14.1-output Parquet 
dictionary column: Failure reading column: IOError: Arrow error: Invalid: 
Resize cannot downsize)

> [Python] arrow-0.15.0 reading arrow-0.14.1-output Parquet dictionary column: 
> Failure reading column: IOError: Arrow error: Invalid: Resize cannot downsize
> --
>
> Key: ARROW-6861
> URL: https://issues.apache.org/jira/browse/ARROW-6861
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.15.0
> Environment: debian:buster (in Docker, Linux 5.2.11-200.fc30.x86_64)
>Reporter: Adam Hooper
>Priority: Major
> Attachments: fix-dict-builder-capacity.diff, 
> parquet-written-by-arrow-0-14-1.7z
>
>
> I'll need to jump through hoops to upload the (seemingly-valid) Parquet file 
> that triggers this bug. In the meantime, here's the error I get, reading the 
> Parquet file with read_dictionary=true. I'll start with the stack trace:
> {{Failure reading column: IOError: Arrow error: Invalid: Resize cannot 
> downsize}}
> {{#0 0x00b9fffd in __cxa_throw ()}}
>  {{#1 0x004ce7b5 in parquet::PlainByteArrayDecoder::DecodeArrow 
> (this=0x56612e50, num_values=67339, null_count=0, 
> valid_bits=0x7f39a764b780 '\377' ..., 
> valid_bits_offset=748544,}}
>  \{{ builder=0x56616330) at 
> /src/apache-arrow-0.15.0/cpp/src/parquet/encoding.cc:886}}
>  {{#2 0x0046d703 in 
> parquet::internal::ByteArrayDictionaryRecordReader::ReadValuesSpaced 
> (this=0x56616260, values_to_read=67339, null_count=0)}}
>  \{{ at /src/apache-arrow-0.15.0/cpp/src/parquet/column_reader.cc:1314}}
>  {{#3 0x004a13f8 in 
> parquet::internal::TypedRecordReader
>  >::ReadRecordData (this=0x56616260, num_records=67339)}}
>  \{{ at /src/apache-arrow-0.15.0/cpp/src/parquet/column_reader.cc:1096}}
>  {{#4 0x00493876 in 
> parquet::internal::TypedRecordReader
>  >::ReadRecords (this=0x56616260, num_records=815883)}}
>  \{{ at /src/apache-arrow-0.15.0/cpp/src/parquet/column_reader.cc:875}}
>  {{#5 0x00413955 in parquet::arrow::LeafReader::NextBatch 
> (this=0x56615640, records_to_read=815883, out=0x7ffd4b5afab0) at 
> /src/apache-arrow-0.15.0/cpp/src/parquet/arrow/reader.cc:413}}
>  {{#6 0x00412081 in parquet::arrow::FileReaderImpl::ReadColumn 
> (this=0x566067a0, i=7, row_groups=..., out=0x7ffd4b5afab0) at 
> /src/apache-arrow-0.15.0/cpp/src/parquet/arrow/reader.cc:218}}
>  {{#7 0x004121b0 in parquet::arrow::FileReaderImpl::ReadColumn 
> (this=0x566067a0, i=7, out=0x7ffd4b5afab0) at 
> /src/apache-arrow-0.15.0/cpp/src/parquet/arrow/reader.cc:223}}
>  {{#8 0x00405fbd in readParquet(std::__cxx11::basic_string std::char_traits, std::allocator > const&) ()}}
> And now a report of my gdb adventures:
> In Arrow 0.15.0, when reading a particular dictionary column 
> ({{read_dictionaries=true}}) with 815883 rows that was written by Arrow 
> 0.14.1, {{arrow::Dictionary32Builder::AppendIndices(...)}} 
> is called twice (once with 493568 values, once with 254976 values); and then 
> {{PlainByteArrayDecoder::DecodeArrow()}} is called. (I'm a novice; I don't 
> know why this column comes in three batches.) On first {{AppendIndices()}} 
> call, the buffer capacity is equal to the number of values. On second call, 
> that's no longer the case: the buffer grows using 
> {{BufferBuilder::GrowByFactor}}, so its capacity is 987136.
> But there's a bug: the 987136-capacity buffer is in 
> {{Dictionary32Builder::indices_builder_}}; so 987136 is stored in 
> {{Dictionary32Builder::indices_builder_.capacity_}}. 
> {{Dictionary32Builder::capacity_}} does not change when {{AppendIndices()}} 
> is called. (Dictionary32Builder behaves like a proxy for its 
> {{indices_builder_}}; but its {{capacity()}} method is not virtual, so things 
> are messy.)
> So {{builder.capacity_}} is 0. Then comes the final batch of 67339 values, 
> via {{DecodeArrow()}}. It calls {{builder->Reserve(num_values)}}. But 
> {{builder->Reserve(num_values)}} tries to increase the capacity from 0 (its 
> wrong, cached value) to {{length_ + num_values}} (815883). Since 
> {{indicies_builder->capacity_}} is 987136, that's a downsize – which throws 
> an exception.
> The only workaround I can find: use {{read_dictionaries=false}}.
> This affects Python, too.
> I've attached a patch that fixes the issue for my file. I d

[jira] [Updated] (ARROW-6861) [Python] arrow-0.15.0 reading arrow-0.14.1-output Parquet dictionary column: Failure reading column: IOError: Arrow error: Invalid: Resize cannot downsize

2019-10-11 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6861:

Fix Version/s: 1.0.0

> [Python] arrow-0.15.0 reading arrow-0.14.1-output Parquet dictionary column: 
> Failure reading column: IOError: Arrow error: Invalid: Resize cannot downsize
> --
>
> Key: ARROW-6861
> URL: https://issues.apache.org/jira/browse/ARROW-6861
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.15.0
> Environment: debian:buster (in Docker, Linux 5.2.11-200.fc30.x86_64)
>Reporter: Adam Hooper
>Priority: Major
> Fix For: 1.0.0
>
> Attachments: fix-dict-builder-capacity.diff, 
> parquet-written-by-arrow-0-14-1.7z
>
>
> I'll need to jump through hoops to upload the (seemingly-valid) Parquet file 
> that triggers this bug. In the meantime, here's the error I get, reading the 
> Parquet file with read_dictionary=true. I'll start with the stack trace:
> {{Failure reading column: IOError: Arrow error: Invalid: Resize cannot 
> downsize}}
> {{#0 0x00b9fffd in __cxa_throw ()}}
>  {{#1 0x004ce7b5 in parquet::PlainByteArrayDecoder::DecodeArrow 
> (this=0x56612e50, num_values=67339, null_count=0, 
> valid_bits=0x7f39a764b780 '\377' ..., 
> valid_bits_offset=748544,}}
>  \{{ builder=0x56616330) at 
> /src/apache-arrow-0.15.0/cpp/src/parquet/encoding.cc:886}}
>  {{#2 0x0046d703 in 
> parquet::internal::ByteArrayDictionaryRecordReader::ReadValuesSpaced 
> (this=0x56616260, values_to_read=67339, null_count=0)}}
>  \{{ at /src/apache-arrow-0.15.0/cpp/src/parquet/column_reader.cc:1314}}
>  {{#3 0x004a13f8 in 
> parquet::internal::TypedRecordReader
>  >::ReadRecordData (this=0x56616260, num_records=67339)}}
>  \{{ at /src/apache-arrow-0.15.0/cpp/src/parquet/column_reader.cc:1096}}
>  {{#4 0x00493876 in 
> parquet::internal::TypedRecordReader
>  >::ReadRecords (this=0x56616260, num_records=815883)}}
>  \{{ at /src/apache-arrow-0.15.0/cpp/src/parquet/column_reader.cc:875}}
>  {{#5 0x00413955 in parquet::arrow::LeafReader::NextBatch 
> (this=0x56615640, records_to_read=815883, out=0x7ffd4b5afab0) at 
> /src/apache-arrow-0.15.0/cpp/src/parquet/arrow/reader.cc:413}}
>  {{#6 0x00412081 in parquet::arrow::FileReaderImpl::ReadColumn 
> (this=0x566067a0, i=7, row_groups=..., out=0x7ffd4b5afab0) at 
> /src/apache-arrow-0.15.0/cpp/src/parquet/arrow/reader.cc:218}}
>  {{#7 0x004121b0 in parquet::arrow::FileReaderImpl::ReadColumn 
> (this=0x566067a0, i=7, out=0x7ffd4b5afab0) at 
> /src/apache-arrow-0.15.0/cpp/src/parquet/arrow/reader.cc:223}}
>  {{#8 0x00405fbd in readParquet(std::__cxx11::basic_string std::char_traits, std::allocator > const&) ()}}
> And now a report of my gdb adventures:
> In Arrow 0.15.0, when reading a particular dictionary column 
> ({{read_dictionaries=true}}) with 815883 rows that was written by Arrow 
> 0.14.1, {{arrow::Dictionary32Builder::AppendIndices(...)}} 
> is called twice (once with 493568 values, once with 254976 values); and then 
> {{PlainByteArrayDecoder::DecodeArrow()}} is called. (I'm a novice; I don't 
> know why this column comes in three batches.) On first {{AppendIndices()}} 
> call, the buffer capacity is equal to the number of values. On second call, 
> that's no longer the case: the buffer grows using 
> {{BufferBuilder::GrowByFactor}}, so its capacity is 987136.
> But there's a bug: the 987136-capacity buffer is in 
> {{Dictionary32Builder::indices_builder_}}; so 987136 is stored in 
> {{Dictionary32Builder::indices_builder_.capacity_}}. 
> {{Dictionary32Builder::capacity_}} does not change when {{AppendIndices()}} 
> is called. (Dictionary32Builder behaves like a proxy for its 
> {{indices_builder_}}; but its {{capacity()}} method is not virtual, so things 
> are messy.)
> So {{builder.capacity_}} is 0. Then comes the final batch of 67339 values, 
> via {{DecodeArrow()}}. It calls {{builder->Reserve(num_values)}}. But 
> {{builder->Reserve(num_values)}} tries to increase the capacity from 0 (its 
> wrong, cached value) to {{length_ + num_values}} (815883). Since 
> {{indicies_builder->capacity_}} is 987136, that's a downsize – which throws 
> an exception.
> The only workaround I can find: use {{read_dictionaries=false}}.
> This affects Python, too.
> I've attached a patch that fixes the issue for my file. I don't know how to 
> formulate a reduction, though, so I haven't contributed unit tests. I'm also 
> not certain how FinishInternal is meant to work, so this definitely needs 
> expert review. (FinishInternal was _definitely_ buggy before my patch; after 
> my patch it _

[jira] [Commented] (ARROW-6861) [Python] arrow-0.15.0 reading arrow-0.14.1-output Parquet dictionary column: Failure reading column: IOError: Arrow error: Invalid: Resize cannot downsize

2019-10-11 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949801#comment-16949801
 ] 

Wes McKinney commented on ARROW-6861:
-

Thanks. This should be enough information to help write a unit test to 
reproduce the issue. [~bkietz] are you interested in taking a look?

> [Python] arrow-0.15.0 reading arrow-0.14.1-output Parquet dictionary column: 
> Failure reading column: IOError: Arrow error: Invalid: Resize cannot downsize
> --
>
> Key: ARROW-6861
> URL: https://issues.apache.org/jira/browse/ARROW-6861
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.15.0
> Environment: debian:buster (in Docker, Linux 5.2.11-200.fc30.x86_64)
>Reporter: Adam Hooper
>Priority: Major
> Fix For: 1.0.0
>
> Attachments: fix-dict-builder-capacity.diff, 
> parquet-written-by-arrow-0-14-1.7z
>
>
> I'll need to jump through hoops to upload the (seemingly-valid) Parquet file 
> that triggers this bug. In the meantime, here's the error I get, reading the 
> Parquet file with read_dictionary=true. I'll start with the stack trace:
> {{Failure reading column: IOError: Arrow error: Invalid: Resize cannot 
> downsize}}
> {{#0 0x00b9fffd in __cxa_throw ()}}
>  {{#1 0x004ce7b5 in parquet::PlainByteArrayDecoder::DecodeArrow 
> (this=0x56612e50, num_values=67339, null_count=0, 
> valid_bits=0x7f39a764b780 '\377' ..., 
> valid_bits_offset=748544,}}
>  \{{ builder=0x56616330) at 
> /src/apache-arrow-0.15.0/cpp/src/parquet/encoding.cc:886}}
>  {{#2 0x0046d703 in 
> parquet::internal::ByteArrayDictionaryRecordReader::ReadValuesSpaced 
> (this=0x56616260, values_to_read=67339, null_count=0)}}
>  \{{ at /src/apache-arrow-0.15.0/cpp/src/parquet/column_reader.cc:1314}}
>  {{#3 0x004a13f8 in 
> parquet::internal::TypedRecordReader
>  >::ReadRecordData (this=0x56616260, num_records=67339)}}
>  \{{ at /src/apache-arrow-0.15.0/cpp/src/parquet/column_reader.cc:1096}}
>  {{#4 0x00493876 in 
> parquet::internal::TypedRecordReader
>  >::ReadRecords (this=0x56616260, num_records=815883)}}
>  \{{ at /src/apache-arrow-0.15.0/cpp/src/parquet/column_reader.cc:875}}
>  {{#5 0x00413955 in parquet::arrow::LeafReader::NextBatch 
> (this=0x56615640, records_to_read=815883, out=0x7ffd4b5afab0) at 
> /src/apache-arrow-0.15.0/cpp/src/parquet/arrow/reader.cc:413}}
>  {{#6 0x00412081 in parquet::arrow::FileReaderImpl::ReadColumn 
> (this=0x566067a0, i=7, row_groups=..., out=0x7ffd4b5afab0) at 
> /src/apache-arrow-0.15.0/cpp/src/parquet/arrow/reader.cc:218}}
>  {{#7 0x004121b0 in parquet::arrow::FileReaderImpl::ReadColumn 
> (this=0x566067a0, i=7, out=0x7ffd4b5afab0) at 
> /src/apache-arrow-0.15.0/cpp/src/parquet/arrow/reader.cc:223}}
>  {{#8 0x00405fbd in readParquet(std::__cxx11::basic_string std::char_traits, std::allocator > const&) ()}}
> And now a report of my gdb adventures:
> In Arrow 0.15.0, when reading a particular dictionary column 
> ({{read_dictionaries=true}}) with 815883 rows that was written by Arrow 
> 0.14.1, {{arrow::Dictionary32Builder::AppendIndices(...)}} 
> is called twice (once with 493568 values, once with 254976 values); and then 
> {{PlainByteArrayDecoder::DecodeArrow()}} is called. (I'm a novice; I don't 
> know why this column comes in three batches.) On first {{AppendIndices()}} 
> call, the buffer capacity is equal to the number of values. On second call, 
> that's no longer the case: the buffer grows using 
> {{BufferBuilder::GrowByFactor}}, so its capacity is 987136.
> But there's a bug: the 987136-capacity buffer is in 
> {{Dictionary32Builder::indices_builder_}}; so 987136 is stored in 
> {{Dictionary32Builder::indices_builder_.capacity_}}. 
> {{Dictionary32Builder::capacity_}} does not change when {{AppendIndices()}} 
> is called. (Dictionary32Builder behaves like a proxy for its 
> {{indices_builder_}}; but its {{capacity()}} method is not virtual, so things 
> are messy.)
> So {{builder.capacity_}} is 0. Then comes the final batch of 67339 values, 
> via {{DecodeArrow()}}. It calls {{builder->Reserve(num_values)}}. But 
> {{builder->Reserve(num_values)}} tries to increase the capacity from 0 (its 
> wrong, cached value) to {{length_ + num_values}} (815883). Since 
> {{indicies_builder->capacity_}} is 987136, that's a downsize – which throws 
> an exception.
> The only workaround I can find: use {{read_dictionaries=false}}.
> This affects Python, too.
> I've attached a patch that fixes the issue for my file. I don't know how to 
> formulate a reduction, though, so I haven't contributed unit tests. I'm also 

[jira] [Commented] (ARROW-6860) [Python] Only link libarrow_flight.so to pyarrow._flight

2019-10-11 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949804#comment-16949804
 ] 

Wes McKinney commented on ARROW-6860:
-

Ah good point. That is tricky. 

> [Python] Only link libarrow_flight.so to pyarrow._flight
> 
>
> Key: ARROW-6860
> URL: https://issues.apache.org/jira/browse/ARROW-6860
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> See BEAM-8368. We need to find a strategy to mitigate protobuf static linking 
> issues with the Beam community



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-6860) [Python] Only link libarrow_flight.so to pyarrow._flight

2019-10-11 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-6860:
---

Assignee: Wes McKinney

> [Python] Only link libarrow_flight.so to pyarrow._flight
> 
>
> Key: ARROW-6860
> URL: https://issues.apache.org/jira/browse/ARROW-6860
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> See BEAM-8368. We need to find a strategy to mitigate protobuf static linking 
> issues with the Beam community



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6844) [C++][Parquet][Python] List columns read broken with 0.15.0

2019-10-11 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6844:

Fix Version/s: 0.15.1

> [C++][Parquet][Python] List columns read broken with 0.15.0
> 
>
> Key: ARROW-6844
> URL: https://issues.apache.org/jira/browse/ARROW-6844
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.15.0
>Reporter: Benoit Rostykus
>Priority: Major
>  Labels: parquet
> Fix For: 1.0.0, 0.15.1
>
> Attachments: dbg_sample.gz.parquet, dbg_sample2.gz.parquet
>
>
> Columns of type {{array}} (such as `array`, 
> `array`...) are not readable anymore using {{pyarrow == 0.15.0}} (but 
> were with {{pyarrow == 0.14.1}}) when the original writer of the parquet file 
> is {{parquet-mr 1.9.1}}.
> {code}
> import pyarrow.parquet as pq
> pf = pq.ParquetFile('sample.gz.parquet')
> print(pf.read(columns=['profile_ids']))
> {code}
> with 0.14.1:
> {code}
> pyarrow.Table
> profile_ids: list
>  child 0, element: int64
> ...
> {code}
> with 0.15.0:
> {code}
> Traceback (most recent call last):
>  File "", line 1, in 
>  File 
> "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pyarrow/parquet.py",
>  line 253, in read
>  use_threads=use_threads)
>  File "pyarrow/_parquet.pyx", line 1131, in 
> pyarrow._parquet.ParquetReader.read_all
>  File "pyarrow/error.pxi", line 78, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Column data for field 0 with type list 
> is inconsistent with schema list
> {code}
> I've tested parquet files coming from multiple tables (with various schemas) 
> created with `parquet-mr`, couldn't read any `array` column 
> anymore.
>  
> I _think_ the bug was introduced with [this 
> commit|[https://github.com/apache/arrow/commit/06fd2da5e8e71b660e6eea4b7702ca175e31f3f5]].
> I think the root of the issue comes from the fact that `parquet-mr` writes 
> the inner struct name as `"element"` by default (see 
> [here|[https://github.com/apache/parquet-mr/blob/b4198be200e7e2df82bc9a18d54c8cd16aa156ac/parquet-column/src/main/java/org/apache/parquet/schema/ConversionPatterns.java#L33]]),
>  whereas `parquet-cpp` (or `pyarrow`?) assumes `"item"` (see for example 
> [this 
> test|[https://github.com/apache/arrow/blob/c805b5fadb548925c915e0e130d6ed03c95d1398/python/pyarrow/tests/test_schema.py#L74]]).
>  The round-tripping tests write/read in pyarrow only obviously won't catch 
> this.
>  
>  
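
For reference, the "item" vs "element" name mismatch described above can be 
reproduced with the Arrow C++ type constructors alone. A minimal sketch 
follows; the printed output in the comments is the expected result, not 
verified against these particular files.

{code:cpp}
#include <arrow/api.h>

#include <iostream>
#include <memory>

int main() {
  // Arrow's default child field name for list types is "item" ...
  std::shared_ptr<arrow::DataType> arrow_default = arrow::list(arrow::int64());

  // ... while parquet-mr names the repeated inner field "element".
  std::shared_ptr<arrow::DataType> parquet_mr_style =
      arrow::list(arrow::field("element", arrow::int64()));

  std::cout << arrow_default->ToString() << std::endl;     // list<item: int64>
  std::cout << parquet_mr_style->ToString() << std::endl;  // list<element: int64>

  // Type equality takes the child field name into account, which is why the
  // reader reports the column data as "inconsistent with schema".
  std::cout << std::boolalpha
            << arrow_default->Equals(parquet_mr_style) << std::endl;  // false
  return 0;
}
{code}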



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6860) [Python] Only link libarrow_flight.so to pyarrow._flight

2019-10-11 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6860:

Fix Version/s: 0.15.1

> [Python] Only link libarrow_flight.so to pyarrow._flight
> 
>
> Key: ARROW-6860
> URL: https://issues.apache.org/jira/browse/ARROW-6860
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 1.0.0, 0.15.1
>
>
> See BEAM-8368. We need to find a strategy to mitigate protobuf static linking 
> issues with the Beam community



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6860) [Python] Only link libarrow_flight.so to pyarrow._flight

2019-10-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6860:
--
Labels: pull-request-available  (was: )

> [Python] Only link libarrow_flight.so to pyarrow._flight
> 
>
> Key: ARROW-6860
> URL: https://issues.apache.org/jira/browse/ARROW-6860
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0, 0.15.1
>
>
> See BEAM-8368. We need to find a strategy to mitigate protobuf static linking 
> issues with the Beam community



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6777) [GLib][CI] Unpin gobject-introspection gem

2019-10-11 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6777:

Fix Version/s: 0.15.1

> [GLib][CI] Unpin gobject-introspection gem
> --
>
> Key: ARROW-6777
> URL: https://issues.apache.org/jira/browse/ARROW-6777
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: GLib
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0, 0.15.1
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6861) [Python] arrow-0.15.0 reading arrow-0.14.1-output Parquet dictionary column: Failure reading column: IOError: Arrow error: Invalid: Resize cannot downsize

2019-10-11 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6861:

Fix Version/s: 0.15.1

> [Python] arrow-0.15.0 reading arrow-0.14.1-output Parquet dictionary column: 
> Failure reading column: IOError: Arrow error: Invalid: Resize cannot downsize
> --
>
> Key: ARROW-6861
> URL: https://issues.apache.org/jira/browse/ARROW-6861
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.15.0
> Environment: debian:buster (in Docker, Linux 5.2.11-200.fc30.x86_64)
>Reporter: Adam Hooper
>Priority: Major
> Fix For: 1.0.0, 0.15.1
>
> Attachments: fix-dict-builder-capacity.diff, 
> parquet-written-by-arrow-0-14-1.7z
>
>
> I'll need to jump through hoops to upload the (seemingly-valid) Parquet file 
> that triggers this bug. In the meantime, here's the error I get, reading the 
> Parquet file with read_dictionary=true. I'll start with the stack trace:
> {{Failure reading column: IOError: Arrow error: Invalid: Resize cannot 
> downsize}}
> {{#0 0x00b9fffd in __cxa_throw ()}}
>  {{#1 0x004ce7b5 in parquet::PlainByteArrayDecoder::DecodeArrow 
> (this=0x56612e50, num_values=67339, null_count=0, 
> valid_bits=0x7f39a764b780 '\377' ..., 
> valid_bits_offset=748544,}}
>  \{{ builder=0x56616330) at 
> /src/apache-arrow-0.15.0/cpp/src/parquet/encoding.cc:886}}
>  {{#2 0x0046d703 in 
> parquet::internal::ByteArrayDictionaryRecordReader::ReadValuesSpaced 
> (this=0x56616260, values_to_read=67339, null_count=0)}}
>  \{{ at /src/apache-arrow-0.15.0/cpp/src/parquet/column_reader.cc:1314}}
>  {{#3 0x004a13f8 in 
> parquet::internal::TypedRecordReader
>  >::ReadRecordData (this=0x56616260, num_records=67339)}}
>  \{{ at /src/apache-arrow-0.15.0/cpp/src/parquet/column_reader.cc:1096}}
>  {{#4 0x00493876 in 
> parquet::internal::TypedRecordReader
>  >::ReadRecords (this=0x56616260, num_records=815883)}}
>  \{{ at /src/apache-arrow-0.15.0/cpp/src/parquet/column_reader.cc:875}}
>  {{#5 0x00413955 in parquet::arrow::LeafReader::NextBatch 
> (this=0x56615640, records_to_read=815883, out=0x7ffd4b5afab0) at 
> /src/apache-arrow-0.15.0/cpp/src/parquet/arrow/reader.cc:413}}
>  {{#6 0x00412081 in parquet::arrow::FileReaderImpl::ReadColumn 
> (this=0x566067a0, i=7, row_groups=..., out=0x7ffd4b5afab0) at 
> /src/apache-arrow-0.15.0/cpp/src/parquet/arrow/reader.cc:218}}
>  {{#7 0x004121b0 in parquet::arrow::FileReaderImpl::ReadColumn 
> (this=0x566067a0, i=7, out=0x7ffd4b5afab0) at 
> /src/apache-arrow-0.15.0/cpp/src/parquet/arrow/reader.cc:223}}
>  {{#8 0x00405fbd in readParquet(std::__cxx11::basic_string std::char_traits, std::allocator > const&) ()}}
> And now a report of my gdb adventures:
> In Arrow 0.15.0, when reading a particular dictionary column 
> ({{read_dictionaries=true}}) with 815883 rows that was written by Arrow 
> 0.14.1, {{arrow::Dictionary32Builder::AppendIndices(...)}} 
> is called twice (once with 493568 values, once with 254976 values); and then 
> {{PlainByteArrayDecoder::DecodeArrow()}} is called. (I'm a novice; I don't 
> know why this column comes in three batches.) On first {{AppendIndices()}} 
> call, the buffer capacity is equal to the number of values. On second call, 
> that's no longer the case: the buffer grows using 
> {{BufferBuilder::GrowByFactor}}, so its capacity is 987136.
> But there's a bug: the 987136-capacity buffer is in 
> {{Dictionary32Builder::indices_builder_}}; so 987136 is stored in 
> {{Dictionary32Builder::indices_builder_.capacity_}}. 
> {{Dictionary32Builder::capacity_}} does not change when {{AppendIndices()}} 
> is called. (Dictionary32Builder behaves like a proxy for its 
> {{indices_builder_}}; but its {{capacity()}} method is not virtual, so things 
> are messy.)
> So {{builder.capacity_}} is 0. Then comes the final batch of 67339 values, 
> via {{DecodeArrow()}}. It calls {{builder->Reserve(num_values)}}. But 
> {{builder->Reserve(num_values)}} tries to increase the capacity from 0 (its 
> wrong, cached value) to {{length_ + num_values}} (815883). Since 
> {{indices_builder_->capacity_}} is 987136, that's a downsize – which throws 
> an exception.
> The only workaround I can find: use {{read_dictionaries=false}}.
> This affects Python, too.
> I've attached a patch that fixes the issue for my file. I don't know how to 
> formulate a reduction, though, so I haven't contributed unit tests. I'm also 
> not certain how FinishInternal is meant to work, so this definitely needs 
> expert review. (FinishInternal was _definitely_ buggy before my patch; after 
> my p
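A minimal pyarrow sketch of the workaround described above (the path and column name are placeholders for the attached file, not taken from the report):

{code}
import pyarrow.parquet as pq

path = 'parquet-written-by-arrow-0-14-1.parquet'  # placeholder for the attached file

# passing the affected column via read_dictionary triggers the reported error:
#   pq.read_table(path, read_dictionary=['the_dictionary_column'])

# workaround: leave read_dictionary unset so the column is decoded as plain strings
table = pq.read_table(path)
print(table.schema)
{code}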

[jira] [Commented] (ARROW-6861) [Python] arrow-0.15.0 reading arrow-0.14.1-output Parquet dictionary column: Failure reading column: IOError: Arrow error: Invalid: Resize cannot downsize

2019-10-11 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949809#comment-16949809
 ] 

Wes McKinney commented on ARROW-6861:
-

Seems like a good candidate for 0.15.1. Marked as such

> [Python] arrow-0.15.0 reading arrow-0.14.1-output Parquet dictionary column: 
> Failure reading column: IOError: Arrow error: Invalid: Resize cannot downsize
> --
>
> Key: ARROW-6861
> URL: https://issues.apache.org/jira/browse/ARROW-6861
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.15.0
> Environment: debian:buster (in Docker, Linux 5.2.11-200.fc30.x86_64)
>Reporter: Adam Hooper
>Priority: Major
> Fix For: 1.0.0, 0.15.1
>
> Attachments: fix-dict-builder-capacity.diff, 
> parquet-written-by-arrow-0-14-1.7z
>
>
> I'll need to jump through hoops to upload the (seemingly-valid) Parquet file 
> that triggers this bug. In the meantime, here's the error I get, reading the 
> Parquet file with read_dictionary=true. I'll start with the stack trace:
> {{Failure reading column: IOError: Arrow error: Invalid: Resize cannot 
> downsize}}
> {{#0 0x00b9fffd in __cxa_throw ()}}
>  {{#1 0x004ce7b5 in parquet::PlainByteArrayDecoder::DecodeArrow 
> (this=0x56612e50, num_values=67339, null_count=0, 
> valid_bits=0x7f39a764b780 '\377' ..., 
> valid_bits_offset=748544,}}
>  \{{ builder=0x56616330) at 
> /src/apache-arrow-0.15.0/cpp/src/parquet/encoding.cc:886}}
>  {{#2 0x0046d703 in 
> parquet::internal::ByteArrayDictionaryRecordReader::ReadValuesSpaced 
> (this=0x56616260, values_to_read=67339, null_count=0)}}
>  \{{ at /src/apache-arrow-0.15.0/cpp/src/parquet/column_reader.cc:1314}}
>  {{#3 0x004a13f8 in 
> parquet::internal::TypedRecordReader
>  >::ReadRecordData (this=0x56616260, num_records=67339)}}
>  \{{ at /src/apache-arrow-0.15.0/cpp/src/parquet/column_reader.cc:1096}}
>  {{#4 0x00493876 in 
> parquet::internal::TypedRecordReader
>  >::ReadRecords (this=0x56616260, num_records=815883)}}
>  \{{ at /src/apache-arrow-0.15.0/cpp/src/parquet/column_reader.cc:875}}
>  {{#5 0x00413955 in parquet::arrow::LeafReader::NextBatch 
> (this=0x56615640, records_to_read=815883, out=0x7ffd4b5afab0) at 
> /src/apache-arrow-0.15.0/cpp/src/parquet/arrow/reader.cc:413}}
>  {{#6 0x00412081 in parquet::arrow::FileReaderImpl::ReadColumn 
> (this=0x566067a0, i=7, row_groups=..., out=0x7ffd4b5afab0) at 
> /src/apache-arrow-0.15.0/cpp/src/parquet/arrow/reader.cc:218}}
>  {{#7 0x004121b0 in parquet::arrow::FileReaderImpl::ReadColumn 
> (this=0x566067a0, i=7, out=0x7ffd4b5afab0) at 
> /src/apache-arrow-0.15.0/cpp/src/parquet/arrow/reader.cc:223}}
>  {{#8 0x00405fbd in readParquet(std::__cxx11::basic_string std::char_traits, std::allocator > const&) ()}}
> And now a report of my gdb adventures:
> In Arrow 0.15.0, when reading a particular dictionary column 
> ({{read_dictionaries=true}}) with 815883 rows that was written by Arrow 
> 0.14.1, {{arrow::Dictionary32Builder::AppendIndices(...)}} 
> is called twice (once with 493568 values, once with 254976 values); and then 
> {{PlainByteArrayDecoder::DecodeArrow()}} is called. (I'm a novice; I don't 
> know why this column comes in three batches.) On first {{AppendIndices()}} 
> call, the buffer capacity is equal to the number of values. On second call, 
> that's no longer the case: the buffer grows using 
> {{BufferBuilder::GrowByFactor}}, so its capacity is 987136.
> But there's a bug: the 987136-capacity buffer is in 
> {{Dictionary32Builder::indices_builder_}}; so 987136 is stored in 
> {{Dictionary32Builder::indices_builder_.capacity_}}. 
> {{Dictionary32Builder::capacity_}} does not change when {{AppendIndices()}} 
> is called. (Dictionary32Builder behaves like a proxy for its 
> {{indices_builder_}}; but its {{capacity()}} method is not virtual, so things 
> are messy.)
> So {{builder.capacity_}} is 0. Then comes the final batch of 67339 values, 
> via {{DecodeArrow()}}. It calls {{builder->Reserve(num_values)}}. But 
> {{builder->Reserve(num_values)}} tries to increase the capacity from 0 (its 
> wrong, cached value) to {{length_ + num_values}} (815883). Since 
> {{indices_builder_->capacity_}} is 987136, that's a downsize – which throws 
> an exception.
> The only workaround I can find: use {{read_dictionaries=false}}.
> This affects Python, too.
> I've attached a patch that fixes the issue for my file. I don't know how to 
> formulate a reduction, though, so I haven't contributed unit tests. I'm also 
> not certain how FinishInternal is meant to work, so this definitely needs 

[jira] [Assigned] (ARROW-6807) [Java][FlightRPC] Expose gRPC service

2019-10-11 Thread Rohit Gupta (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohit Gupta reassigned ARROW-6807:
--

Assignee: Rohit Gupta

> [Java][FlightRPC] Expose gRPC service 
> --
>
> Key: ARROW-6807
> URL: https://issues.apache.org/jira/browse/ARROW-6807
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: FlightRPC, Java
>Reporter: Rohit Gupta
>Assignee: Rohit Gupta
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 6h 20m
>  Remaining Estimate: 0h
>
> Have a utility class that exposes the flight service & client so that 
> multiple services can be plugged into the same endpoint. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6738) [Java] Fix problems with current union comparison logic

2019-10-11 Thread Liya Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liya Fan updated ARROW-6738:

Fix Version/s: 0.15.1

> [Java] Fix problems with current union comparison logic
> ---
>
> Key: ARROW-6738
> URL: https://issues.apache.org/jira/browse/ARROW-6738
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.1
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> There are some problems with the current union comparison logic. For example:
> 1. For the type check, we should not require the fields to be equal. It is 
> possible that two vectors' value ranges are equal even though their fields 
> differ.
> 2. We should not compare the number of sub-vectors, as it is possible that 
> two union vectors have different numbers of sub-vectors but equal values in 
> the compared range.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6806) Segfault deserializing ListArray containing null/empty list

2019-10-11 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield updated ARROW-6806:
---
Fix Version/s: 0.15.1

> Segfault deserializing ListArray containing null/empty list
> ---
>
> Key: ARROW-6806
> URL: https://issues.apache.org/jira/browse/ARROW-6806
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.15.0
>Reporter: Max Bolingbroke
>Assignee: Antoine Pitrou
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.0.0, 0.15.1
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> The following code segfaults for me (Windows and Linux, pyarrow 0.15):
>  
> {code:java}
> import pyarrow as pa
> from io import BytesIO
> x = 
> b'\xdc\x00\x00\x00\x10\x00\x00\x00\x0c\x00\x0e\x00\x06\x00\r\x00\x08\x00\x00\x00\x0c\x00\x00\x00\x00\x00\x03\x00\x10\x00\x00\x00\x00\x01\n\x00\x0c\x00\x00\x00\x08\x00\x04\x00\n\x00\x00\x00\x08\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x18\x00\x00\x00\x00\x00\x12\x00\x18\x00\x14\x00\x13\x00\x12\x00\x0c\x00\x00\x00\x08\x00\x04\x00\x12\x00\x00\x00\x14\x00\x00\x00\x14\x00\x00\x00`\x00\x00\x00\x00\x00\x0c\x01\\\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x18\x00\x00\x00\x00\x00\x12\x00\x18\x00\x14\x00\x00\x00\x13\x00\x0c\x00\x00\x00\x08\x00\x04\x00\x12\x00\x00\x00\x14\x00\x00\x00\x14\x00\x00\x00\x14\x00\x00\x00\x00\x00\x00\x05\x10\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xf0\xff\xff\xff\x06\x00\x00\x00$data$\x00\x00\x04\x00\x04\x00\x04\x00\x00\x00\x10\x00\x00\x00exchangeCodeList\x00\x00\x00\x00\xcc\x00\x00\x00\x14\x00\x00\x00\x00\x00\x00\x00\x0c\x00\x16\x00\x0e\x00\x15\x00\x10\x00\x04\x00\x0c\x00\x00\x00\x10\x00\x00\x00\x00\x00\x00\x00\x00\x00\x03\x00\x10\x00\x00\x00\x00\x03\n\x00\x18\x00\x0c\x00\x08\x00\x04\x00\n\x00\x00\x00\x14\x00\x00\x00h\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x05\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x00\x10\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x10\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x10\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
> r = pa.RecordBatchStreamReader(BytesIO(x))
> r.read_all()
> {code}
> I *think* what should happen instead is that I should get a Table with a 
> single column named "exchangeCodeList", where the column is a ChunkedArray 
> with a single chunk, where that chunk is a ListArray containing just a single 
> element (a null). Failing that (i.e. if the bytestring is actually 
> malformed), pyarrow should maybe throw an error instead of segfaulting?
> I'm not 100% sure how the bytestring was generated: I think it comes from a 
> Java-based server. I can deserialize the server response fine if all the 
> records have at least one element in the "exchangeCodeList" column, but not 
> if at least one of them is null. I've tried to reproduce the failure by 
> generating the bytestring with pyarrow but can't trigger the segfault.
>  
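A sketch of the pyarrow-side reproduction attempt mentioned above: it builds a list<string> column whose only entry is null and round-trips it through the IPC stream format (as noted in the report, bytes produced this way do not trigger the segfault):

{code}
import pyarrow as pa
from io import BytesIO

# a list<string> column whose single entry is null
arr = pa.array([None], type=pa.list_(pa.string()))
batch = pa.RecordBatch.from_arrays([arr], ["exchangeCodeList"])

# write the batch to the IPC stream format and read it back
sink = BytesIO()
writer = pa.RecordBatchStreamWriter(sink, batch.schema)
writer.write_batch(batch)
writer.close()

reader = pa.RecordBatchStreamReader(BytesIO(sink.getvalue()))
print(reader.read_all())
{code}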



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6862) [Developer] Check pull request title

2019-10-11 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-6862:
---

 Summary: [Developer] Check pull request title
 Key: ARROW-6862
 URL: https://issues.apache.org/jira/browse/ARROW-6862
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Developer Tools
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6862) [Developer] Check pull request title

2019-10-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6862:
--
Labels: pull-request-available  (was: )

> [Developer] Check pull request title
> 
>
> Key: ARROW-6862
> URL: https://issues.apache.org/jira/browse/ARROW-6862
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-1638) [Java] IPC roundtrip for null type

2019-10-11 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-1638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-1638.

Resolution: Fixed

Issue resolved by pull request 5164
[https://github.com/apache/arrow/pull/5164]

> [Java] IPC roundtrip for null type
> --
>
> Key: ARROW-1638
> URL: https://issues.apache.org/jira/browse/ARROW-1638
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Wes McKinney
>Assignee: Ji Liu
>Priority: Major
>  Labels: columnar-format-1.0, pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 9h 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-6721) [JAVA] Avro adapter benchmark only runs once in JMH

2019-10-11 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-6721.

Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 5524
[https://github.com/apache/arrow/pull/5524]

> [JAVA] Avro adapter benchmark only runs once in JMH
> ---
>
> Key: ARROW-6721
> URL: https://issues.apache.org/jira/browse/ARROW-6721
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Java
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> The current {{AvroAdapterBenchmark}} actually only runs once during JMH 
> evaluation, since the decoder is consumed by the first invocation and 
> follow-up invocations return immediately.
> To solve this, we use {{BinaryDecoder}} explicitly in the benchmark and reset 
> its inner stream each time the test method is invoked.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-6732) [Java] Implement quick sort in a non-recursive way to avoid stack overflow

2019-10-11 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-6732.

Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 5540
[https://github.com/apache/arrow/pull/5540]

> [Java] Implement quick sort in a non-recursive way to avoid stack overflow
> --
>
> Key: ARROW-6732
> URL: https://issues.apache.org/jira/browse/ARROW-6732
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> The current quick sort is implemented as a recursive algorithm. The problem 
> is that in the worst case the recursion depth is equal to the length of the 
> vector, so for large vectors this can cause a stack overflow.
> To solve this problem, we implement the quick sort algorithm non-recursively 
> (a sketch of the idea follows below).
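As a language-neutral illustration of the approach (a Python sketch of the technique, not the actual Java change): the recursion is replaced by an explicit, heap-allocated stack of index ranges, so deep partitions no longer consume call-stack frames.

{code}
def quicksort_iterative(values):
    """Sort the list in place using an explicit stack of (lo, hi) ranges instead of recursion."""
    stack = [(0, len(values) - 1)]
    while stack:
        lo, hi = stack.pop()
        if lo >= hi:
            continue
        # Lomuto partition around the last element
        pivot = values[hi]
        i = lo
        for j in range(lo, hi):
            if values[j] < pivot:
                values[i], values[j] = values[j], values[i]
                i += 1
        values[i], values[hi] = values[hi], values[i]
        left, right = (lo, i - 1), (i + 1, hi)
        # push the larger partition first so the smaller one is processed next,
        # which keeps the explicit stack small
        if left[1] - left[0] > right[1] - right[0]:
            stack.append(left)
            stack.append(right)
        else:
            stack.append(right)
            stack.append(left)


values = [5, 3, 8, 1, 9, 2]
quicksort_iterative(values)
print(values)  # [1, 2, 3, 5, 8, 9]
{code}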



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-6074) [FlightRPC] Implement middleware

2019-10-11 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-6074.

Resolution: Fixed

Issue resolved by pull request 5068
[https://github.com/apache/arrow/pull/5068]

> [FlightRPC] Implement middleware
> 
>
> Key: ARROW-6074
> URL: https://issues.apache.org/jira/browse/ARROW-6074
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: FlightRPC
>Reporter: David Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 20h 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-6074) [FlightRPC] Implement middleware

2019-10-11 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield reassigned ARROW-6074:
--

Assignee: David Li

> [FlightRPC] Implement middleware
> 
>
> Key: ARROW-6074
> URL: https://issues.apache.org/jira/browse/ARROW-6074
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: FlightRPC
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 20h 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)