[jira] [Commented] (ARROW-5949) [Rust] Implement DictionaryArray

2020-02-27 Thread Neville Dipale (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17047269#comment-17047269
 ] 

Neville Dipale commented on ARROW-5949:
---

I'm unable to assign this to andy-thomason; I don't have permission.

> [Rust] Implement DictionaryArray
> 
>
> Key: ARROW-5949
> URL: https://issues.apache.org/jira/browse/ARROW-5949
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Reporter: David Atienza
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 18h
>  Remaining Estimate: 0h
>
> I am pretty new to the codebase, but I have seen that DictionaryArray is not 
> implemented in the Rust library.
> I went through the list of issues and I could not see any work on this. Is 
> there any blocker?
>  
> The specification is a bit 
> [short|https://arrow.apache.org/docs/format/Layout.html#dictionary-encoding] 
> or even 
> [non-existent|https://arrow.apache.org/docs/format/Metadata.html#dictionary-encoding],
>  so I am not sure how to implement it myself.
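For reference, the encoding itself is straightforward: a DictionaryArray stores
integer indices into a small array of distinct values. A minimal illustration
with pyarrow (used here only because the Python library already exposes it; it
is not the Rust API):

{code:python}
import pyarrow as pa

# Dictionary-encode a string array: repeated values become integer
# indices into a dictionary of the distinct values.
arr = pa.array(["foo", "bar", "foo", "foo", "bar"])
dict_arr = arr.dictionary_encode()

print(dict_arr.type)        # dictionary<values=string, indices=int32, ...>
print(dict_arr.indices)     # [0, 1, 0, 0, 1]
print(dict_arr.dictionary)  # ["foo", "bar"]
{code}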



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-5949) [Rust] Implement DictionaryArray

2020-02-27 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale resolved ARROW-5949.
---
Resolution: Fixed

Issue resolved by pull request 6095
[https://github.com/apache/arrow/pull/6095]

> [Rust] Implement DictionaryArray
> 
>
> Key: ARROW-5949
> URL: https://issues.apache.org/jira/browse/ARROW-5949
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Reporter: David Atienza
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 18h
>  Remaining Estimate: 0h
>
> I am pretty new to the codebase, but I have seen that DictionaryArray is not 
> implemented in the Rust library.
> I went through the list of issues and I could not see any work on this. Is 
> there any blocker?
>  
> The specification is a bit 
> [short|https://arrow.apache.org/docs/format/Layout.html#dictionary-encoding] 
> or even 
> [non-existent|https://arrow.apache.org/docs/format/Metadata.html#dictionary-encoding],
>  so I am not sure how to implement it myself.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-7958) [Java] Update Avro to version 1.9.2

2020-02-27 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield reassigned ARROW-7958:
--

Assignee: Ismaël Mejía

> [Java] Update Avro to version 1.9.2
> ---
>
> Key: ARROW-7958
> URL: https://issues.apache.org/jira/browse/ARROW-7958
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Ismaël Mejía
>Assignee: Ismaël Mejía
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-7958) [Java] Update Avro to version 1.9.2

2020-02-27 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-7958.

Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 6500
[https://github.com/apache/arrow/pull/6500]

> [Java] Update Avro to version 1.9.2
> ---
>
> Key: ARROW-7958
> URL: https://issues.apache.org/jira/browse/ARROW-7958
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Ismaël Mejía
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7960) Add support for schema translation from parquet nodes back to arrow for missing types

2020-02-27 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-7960:
--

 Summary: Add support for schema translation from parquet nodes 
back to arrow for missing types
 Key: ARROW-7960
 URL: https://issues.apache.org/jira/browse/ARROW-7960
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Micah Kornfield
Assignee: Micah Kornfield


Map seems to be the most obvious one missing.  Without additional metadata I 
don't think FixedSizeList is possible.  LargeList could probably be determined 
empirically while parsing, if there are any entries that exceed the int32 range 
(or with metadata).  Need to also double-check that struct is supported.
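One way to check which types survive the translation is to round-trip them
through a Parquet file with pyarrow (a sketch assuming Map support on the
write side; the file name is made up):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

# Write a table with a Map column and read it back; the schema of the
# result shows what the parquet -> arrow schema translation yields.
t = pa.table({"m": pa.array([[("a", 1), ("b", 2)], []],
                            type=pa.map_(pa.string(), pa.int32()))})
pq.write_table(t, "map_roundtrip.parquet")
print(pq.read_table("map_roundtrip.parquet").schema)  # m: map<string, int32>
{code}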



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7959) [Ruby] Add support for Ruby 2.3 again

2020-02-27 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-7959:
---

 Summary: [Ruby] Add support for Ruby 2.3 again
 Key: ARROW-7959
 URL: https://issues.apache.org/jira/browse/ARROW-7959
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Ruby
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou


Ruby 2.3 has reached EOL, but Ubuntu 16.04 LTS still ships Ruby 2.3, so 
supporting Ruby 2.3 again is valuable.

Note that Red Arrow 0.15.1 works with Ruby 2.3.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7959) [Ruby] Add support for Ruby 2.3 again

2020-02-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-7959:
--
Labels: pull-request-available  (was: )

> [Ruby] Add support for Ruby 2.3 again
> -
>
> Key: ARROW-7959
> URL: https://issues.apache.org/jira/browse/ARROW-7959
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Ruby
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
>
> Ruby 2.3 has reached EOL, but Ubuntu 16.04 LTS still ships Ruby 2.3, so 
> supporting Ruby 2.3 again is valuable.
> Note that Red Arrow 0.15.1 works with Ruby 2.3.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-3247) [Python] Support spark parquet array and map types

2020-02-27 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-3247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17047208#comment-17047208
 ] 

Micah Kornfield commented on ARROW-3247:



{quote}
<map-repetition> group <name> (MAP) {
  repeated group key_value {
    required <key-type> key;
    <value-repetition> <value-type> value;
  }
}
{quote}
Sorry, I'm not seeing the difference in the Map schema from what is listed in 
the parquet spec (pasted above)?

Other issues covering this:
https://issues.apache.org/jira/browse/ARROW-1644 (this is the one I'm actively 
updating with subtasks)
https://issues.apache.org/jira/browse/ARROW-2587?filter=-1
https://issues.apache.org/jira/browse/ARROW-1599?filter=-1

Discussion on mailing list: 
https://mail-archives.apache.org/mod_mbox/arrow-dev/202002.mbox/%3CCAJPUwMBP_CyfsVn0nCQx%3DP6AFuGaAcYRr-x9Y0GtJ7d2QTZRHA%40mail.gmail.com%3E
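For concreteness, the Arrow counterparts of the spark MAP and LIST columns in
the schema quoted below can be built directly (an illustrative sketch; the
field names are taken from the quoted schema):

{code:python}
import pyarrow as pa

# MAP columns: the key is always required; value nullability varies.
map_ty = pa.map_(pa.string(), pa.string())
# LIST columns: a single element field, nullable or not.
list_ty = pa.list_(pa.field("element", pa.string()))

schema = pa.schema([
    pa.field("map_op_op", map_ty),   # MAP, OPTIONAL
    pa.field("arr_op_op", list_ty),  # LIST, OPTIONAL
])
print(schema)
{code}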




> [Python] Support spark parquet array and map types
> --
>
> Key: ARROW-3247
> URL: https://issues.apache.org/jira/browse/ARROW-3247
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Martin Durant
>Priority: Minor
>  Labels: parquet
>
> As far as I understand, there is already some support for nested 
> array/dict/structs in arrow. However, spark Map and List types are structured 
> one level deeper (I believe to allow for both NULL and empty entries). 
> Surprisingly, fastparquet can load these. I do not know the plan for 
> arbitrary nested object support, but it should be made clear.
> Schema of spark-generated file from the fastparquet test suite:
> {code:java}
>  - spark_schema:
> | - map_op_op: MAP, OPTIONAL
> |   - key_value: REPEATED
> |   | - key: BYTE_ARRAY, UTF8, REQUIRED
> | - value: BYTE_ARRAY, UTF8, OPTIONAL
> | - map_op_req: MAP, OPTIONAL
> |   - key_value: REPEATED
> |   | - key: BYTE_ARRAY, UTF8, REQUIRED
> | - value: BYTE_ARRAY, UTF8, REQUIRED
> | - map_req_op: MAP, REQUIRED
> |   - key_value: REPEATED
> |   | - key: BYTE_ARRAY, UTF8, REQUIRED
> | - value: BYTE_ARRAY, UTF8, OPTIONAL
> | - map_req_req: MAP, REQUIRED
> |   - key_value: REPEATED
> |   | - key: BYTE_ARRAY, UTF8, REQUIRED
> | - value: BYTE_ARRAY, UTF8, REQUIRED
> | - arr_op_op: LIST, OPTIONAL
> |   - list: REPEATED
> | - element: BYTE_ARRAY, UTF8, OPTIONAL
> | - arr_op_req: LIST, OPTIONAL
> |   - list: REPEATED
> | - element: BYTE_ARRAY, UTF8, REQUIRED
> | - arr_req_op: LIST, REQUIRED
> |   - list: REPEATED
> | - element: BYTE_ARRAY, UTF8, OPTIONAL
>   - arr_req_req: LIST, REQUIRED
> - list: REPEATED
>   - element: BYTE_ARRAY, UTF8, REQUIRED
> {code}
> (please forgive that some of this has already been mentioned elsewhere; this 
> is one of the entries in the list at 
> [https://github.com/dask/fastparquet/issues/374] as a feature that is useful 
> in fastparquet)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6887) [Java] Create prose documentation for using ValueVectors

2020-02-27 Thread Ji Liu (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17047169#comment-17047169
 ] 

Ji Liu commented on ARROW-6887:
---

It seems the docs we added in this issue didn't make it onto the website? 
[~emkornfi...@gmail.com] [~wesm]

[http://arrow.apache.org/docs/java/index.html]

> [Java] Create prose documentation for using ValueVectors
> 
>
> Key: ARROW-6887
> URL: https://issues.apache.org/jira/browse/ARROW-6887
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation, Java
>Reporter: Micah Kornfield
>Assignee: Ji Liu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.16.0
>
>  Time Spent: 8h 40m
>  Remaining Estimate: 0h
>
> We should create documentation (in restructured text) for the library that 
> demonstrates:
> 1.  Basic construction of ValueVectors.  Highlighting:
>     * ValueVector lifecycle
>     * Reading by rows using Readers (mentioning that it is not as efficient 
> as direct access).
>     * Populating with Writers
> 2.  Reading and writing the IPC stream and file formats.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7957) [Python] ParquetDataset cannot take HadoopFileSystem as filesystem

2020-02-27 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-7957:

Summary: [Python] ParquetDataset cannot take HadoopFileSystem as filesystem 
 (was: ParquetDataset cannot take HadoopFileSystem as filesystem)

> [Python] ParquetDataset cannot take HadoopFileSystem as filesystem
> --
>
> Key: ARROW-7957
> URL: https://issues.apache.org/jira/browse/ARROW-7957
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.16.0
>Reporter: Catherine
>Priority: Critical
>
> {{from pyarrow.fs import HadoopFileSystem}}
>  {{import pyarrow.parquet as pq}}
>  
> {{file_name = "hdfs://localhost:9000/test/file_name.pq"}}
>  {{hdfs, path = HadoopFileSystem.from_uri(file_name)}}
>  {{dataset = pq.ParquetDataset(file_name, filesystem=hdfs)}}
>  
> has error:
>  {{OSError: Unrecognized filesystem: <class 'pyarrow._hdfs.HadoopFileSystem'>}}
>  
> When I tried using the deprecated {{HadoopFileSystem}}:
> {{import pyarrow}}
>  {{import pyarrow.parquet as pq}}
>  
> {{file_name = "hdfs://localhost:9000/test/file_name.pq"}}
> {{hdfs = pyarrow.hdfs.connect('localhost', 9000)}}
> {{dataset = pq.ParquetDataset(file_names, filesystem=hdfs)}}
> {{pa_schema = dataset.schema.to_arrow_schema()}}
> {{pieces = dataset.pieces}}
> {{for piece in pieces: }}
> {{    print(piece.path)}}
>  
> {{piece.path}} loses the {{hdfs://localhost:9000}} prefix.
>  
> I think {{ParquetDataset}} should accept {{pyarrow.fs.HadoopFileSystem}} as 
> filesystem?
> And {{piece.path}} should have the prefix?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7958) [Java] Update Avro to version 1.9.2

2020-02-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-7958:
--
Labels: pull-request-available  (was: )

> [Java] Update Avro to version 1.9.2
> ---
>
> Key: ARROW-7958
> URL: https://issues.apache.org/jira/browse/ARROW-7958
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Ismaël Mejía
>Priority: Trivial
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7958) [Java] Update Avro to version 1.9.2

2020-02-27 Thread Jira
Ismaël Mejía created ARROW-7958:
---

 Summary: [Java] Update Avro to version 1.9.2
 Key: ARROW-7958
 URL: https://issues.apache.org/jira/browse/ARROW-7958
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Ismaël Mejía






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7951) [Python][Parquet] Expose BYTE_STREAM_SPLIT to pyarrow

2020-02-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-7951:
--
Labels: parquet pull-request-available  (was: parquet)

> [Python][Parquet] Expose BYTE_STREAM_SPLIT to pyarrow
> -
>
> Key: ARROW-7951
> URL: https://issues.apache.org/jira/browse/ARROW-7951
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Martin Radev
>Assignee: Martin Radev
>Priority: Minor
>  Labels: parquet, pull-request-available
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The Parquet writer now supports the option of selecting the 
> BYTE_STREAM_SPLIT encoding. It would be nice to have it exposed in pyarrow.
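For reference, the option ended up as a writer argument; a minimal sketch
(assuming the {{use_byte_stream_split}} parameter as exposed in later pyarrow
releases; the column and file names are made up):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

# BYTE_STREAM_SPLIT is only defined for FLOAT/DOUBLE columns and requires
# dictionary encoding to be disabled for those columns.
t = pa.table({"x": pa.array([1.5, 2.5, 3.5], type=pa.float64())})
pq.write_table(t, "data.parquet",
               use_dictionary=False,
               use_byte_stream_split=["x"])
{code}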



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-7916) [C++][Dataset] Project IPC record batches to materialized fields

2020-02-27 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman resolved ARROW-7916.
-
Resolution: Fixed

Issue resolved by pull request 6474
[https://github.com/apache/arrow/pull/6474]

> [C++][Dataset] Project IPC record batches to materialized fields
> 
>
> Key: ARROW-7916
> URL: https://issues.apache.org/jira/browse/ARROW-7916
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, C++ - Dataset
>Affects Versions: 0.16.0
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> If batches mmapped from disk are projected before post-filtering, unreferenced 
> columns will never be accessed (so the memory map shouldn't do I/O on them).
> At the same time, it'd probably be wise to explicitly document that batches 
> yielded directly from fragments rather than from a Scanner will not be 
> filtered or projected (so they will not match the fragment's schema and will 
> include columns referenced by the filter even if they were not projected).
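From the Python side, the scan-time projection being described looks roughly
like this (a sketch against the {{pyarrow.dataset}} API; the path and column
names are hypothetical):

{code:python}
import pyarrow.dataset as ds

# Project to column "a" while filtering on "b": only those two columns
# need to be touched in the memory-mapped IPC files.
dataset = ds.dataset("data/", format="ipc")
table = dataset.to_table(columns=["a"], filter=ds.field("b") > 0)
{code}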



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7957) ParquetDataset cannot take HadoopFileSystem as filesystem

2020-02-27 Thread Catherine (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Catherine updated ARROW-7957:
-
Description: 
{{from pyarrow.fs import HadoopFileSystem}}
 {{import pyarrow.parquet as pq}}

 

{{file_name = "hdfs://localhost:9000/test/file_name.pq"}}
 {{hdfs, path = HadoopFileSystem.from_uri(file_name)}}
 {{dataset = pq.ParquetDataset(file_name, filesystem=hdfs)}}

 

has error:
 {{OSError: Unrecognized filesystem: <class 'pyarrow._hdfs.HadoopFileSystem'>}}

 

When I tried using the deprecated {{HadoopFileSystem}}:

{{import pyarrow}}
 {{import pyarrow.parquet as pq}}

 

{{file_name = "hdfs://localhost:9000/test/file_name.pq"}}

{{hdfs = pyarrow.hdfs.connect('localhost', 9000)}}

{{dataset = pq.ParquetDataset(file_names, filesystem=hdfs)}}

{{pa_schema = dataset.schema.to_arrow_schema()}}

{{pieces = dataset.pieces}}

{{for piece in pieces: }}

{{    print(piece.path)}}

 

{{piece.path}} loses the {{hdfs://localhost:9000}} prefix.

 

I think {{ParquetDataset}} should accept {{pyarrow.fs.HadoopFileSystem}} as 
filesystem?

And {{piece.path}} should have the prefix?

  was:
{{from pyarrow.fs import HadoopFileSystem}}
 {{import pyarrow.parquet as pq}}

 

{{file_name = "hdfs://localhost:9000/test/file_name.pq"}}
 {{hdfs, path = HadoopFileSystem.from_uri(file_name)}}
 {{dataset = pq.ParquetDataset(file_name, filesystem=hdfs)}}

 

has error:
 {{OSError: Unrecognized filesystem: <class 'pyarrow._hdfs.HadoopFileSystem'>}}

 

When I tried using the deprecated {{HadoopFileSystem}}:

{{import pyarrow}}
 {{import pyarrow.parquet as pq}}

 

{{file_name = "hdfs://localhost:9000/test/file_name.pq"}}

{{hdfs = pyarrow.hdfs.connect('localhost', 9000)}}

{{dataset = pq.ParquetDataset(file_names, filesystem=hdfs)}}

{{pa_schema = dataset.schema.to_arrow_schema()}}

{{pieces = dataset.pieces}}

{{for piece in pieces: }}

{{    print(piece.path)}}

 

{{piece.path}} loses the {{hdfs://localhost:9000}} prefix.

 

I think {{ParquetDataset}} should accept {{pyarrow.fs.HadoopFileSystem}} as 
filesystem?

And {{piece.path}} should have the prefix?


> ParquetDataset cannot take HadoopFileSystem as filesystem
> -
>
> Key: ARROW-7957
> URL: https://issues.apache.org/jira/browse/ARROW-7957
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.16.0
>Reporter: Catherine
>Priority: Critical
>
> {{from pyarrow.fs import HadoopFileSystem}}
>  {{import pyarrow.parquet as pq}}
>  
> {{file_name = "hdfs://localhost:9000/test/file_name.pq"}}
>  {{hdfs, path = HadoopFileSystem.from_uri(file_name)}}
>  {{dataset = pq.ParquetDataset(file_name, filesystem=hdfs)}}
>  
> has error:
>  {{OSError: Unrecognized filesystem: <class 'pyarrow._hdfs.HadoopFileSystem'>}}
>  
> When I tried using the deprecated {{HadoopFileSystem}}:
> {{import pyarrow}}
>  {{import pyarrow.parquet as pq}}
>  
> {{file_name = "hdfs://localhost:9000/test/file_name.pq"}}
> {{hdfs = pyarrow.hdfs.connect('localhost', 9000)}}
> {{dataset = pq.ParquetDataset(file_names, filesystem=hdfs)}}
> {{pa_schema = dataset.schema.to_arrow_schema()}}
> {{pieces = dataset.pieces}}
> {{for piece in pieces: }}
> {{    print(piece.path)}}
>  
> {{piece.path}} loses the {{hdfs://localhost:9000}} prefix.
>  
> I think {{ParquetDataset}} should accept {{pyarrow.fs.HadoopFileSystem}} as 
> filesystem?
> And {{piece.path}} should have the prefix?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7957) ParquetDataset cannot take HadoopFileSystem as filesystem

2020-02-27 Thread Catherine (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Catherine updated ARROW-7957:
-
Description: 
{{from pyarrow.fs import HadoopFileSystem}}
 {{import pyarrow.parquet as pq}}

 

{{file_name = "hdfs://localhost:9000/test/file_name.pq"}}
 {{hdfs, path = HadoopFileSystem.from_uri(file_name)}}
 {{dataset = pq.ParquetDataset(file_name, filesystem=hdfs)}}

 

has error:
 {{OSError: Unrecognized filesystem: <class 'pyarrow._hdfs.HadoopFileSystem'>}}

 

When I tried using the deprecated {{HadoopFileSystem}}:

{{import pyarrow}}
 {{import pyarrow.parquet as pq}}

 

{{file_name = "hdfs://localhost:9000/test/file_name.pq"}}

{{hdfs = pyarrow.hdfs.connect('localhost', 9000)}}

{{dataset = pq.ParquetDataset(file_names, filesystem=hdfs)}}

{{pa_schema = dataset.schema.to_arrow_schema()}}

{{pieces = dataset.pieces}}

{{for piece in pieces: }}

{{    print(piece.path)}}

 

{{piece.path}} loses the {{hdfs://localhost:9000}} prefix.

 

I think {{ParquetDataset}} should accept {{pyarrow.fs.HadoopFileSystem}} as 
filesystem?

And {{piece.path}} should have the prefix?

  was:
{{from pyarrow.fs import HadoopFileSystem}}
{{import pyarrow.parquet as pq}}

 

{{file_name = "hdfs://localhost:9000/test/file_name.pq"}}
{{hdfs, path = HadoopFileSystem.from_uri(file_name)}}
{{dataset = pq.ParquetDataset(file_name, filesystem=hdfs)}}

 

has error:

 

{{raise IOError('Unrecognized filesystem: {0}'.format(fs_type))}}
{{OSError: Unrecognized filesystem: <class 'pyarrow._hdfs.HadoopFileSystem'>}}

 

When I tried using the deprecated {{HadoopFileSystem}}:

{{import pyarrow}}
{{import pyarrow.parquet as pq}}

 

{{file_name = "hdfs://localhost:9000/test/file_name.pq"}}

{{hdfs = pyarrow.hdfs.connect('localhost', 9000)}}

{{dataset = pq.ParquetDataset(file_names, filesystem=hdfs)}}

{{pa_schema = dataset.schema.to_arrow_schema()}}

{{pieces = dataset.pieces}}

{{for piece in pieces: }}

{{    print(piece.path)}}

 

{{piece.path}} loses the {{hdfs://localhost:9000}} prefix.

 

I think {{ParquetDataset}} should accept {{pyarrow.fs.HadoopFileSystem}} as 
filesystem?

And {{piece.path}} should have the prefix?


> ParquetDataset cannot take HadoopFileSystem as filesystem
> -
>
> Key: ARROW-7957
> URL: https://issues.apache.org/jira/browse/ARROW-7957
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.16.0
>Reporter: Catherine
>Priority: Critical
>
> {{from pyarrow.fs import HadoopFileSystem}}
>  {{import pyarrow.parquet as pq}}
>  
> {{file_name = "hdfs://localhost:9000/test/file_name.pq"}}
>  {{hdfs, path = HadoopFileSystem.from_uri(file_name)}}
>  {{dataset = pq.ParquetDataset(file_name, filesystem=hdfs)}}
>  
> has error:
>  {{OSError: Unrecognized filesystem: <class 'pyarrow._hdfs.HadoopFileSystem'>}}
>  
> When I tried using the deprecated {{HadoopFileSystem}}:
> {{import pyarrow}}
>  {{import pyarrow.parquet as pq}}
>  
> {{file_name = "hdfs://localhost:9000/test/file_name.pq"}}
> {{hdfs = pyarrow.hdfs.connect('localhost', 9000)}}
> {{dataset = pq.ParquetDataset(file_names, filesystem=hdfs)}}
> {{pa_schema = dataset.schema.to_arrow_schema()}}
> {{pieces = dataset.pieces}}
> {{for piece in pieces: }}
> {{    print(piece.path)}}
>  
> {{piece.path}} loses the {{hdfs://localhost:9000}} prefix.
>  
> I think {{ParquetDataset}} should accept {{pyarrow.fs.HadoopFileSystem}} as 
> filesystem?
> And {{piece.path}} should have the prefix?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7957) ParquetDataset cannot take HadoopFileSystem as filesystem

2020-02-27 Thread Catherine (Jira)
Catherine created ARROW-7957:


 Summary: ParquetDataset cannot take HadoopFileSystem as filesystem
 Key: ARROW-7957
 URL: https://issues.apache.org/jira/browse/ARROW-7957
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.16.0
Reporter: Catherine


{{from pyarrow.fs import HadoopFileSystem}}
{{import pyarrow.parquet as pq}}

 

{{file_name = "hdfs://localhost:9000/test/file_name.pq"}}
{{hdfs, path = HadoopFileSystem.from_uri(file_name)}}
{{dataset = pq.ParquetDataset(file_name, filesystem=hdfs)}}

 

has error:

 

{{raise IOError('Unrecognized filesystem: {0}'.format(fs_type))}}
{{OSError: Unrecognized filesystem: <class 'pyarrow._hdfs.HadoopFileSystem'>}}

 

When I tried using the deprecated {{HadoopFileSystem}}:

{{import pyarrow}}
{{import pyarrow.parquet as pq}}

 

{{file_name = "hdfs://localhost:9000/test/file_name.pq"}}

{{hdfs = pyarrow.hdfs.connect('localhost', 9000)}}

{{dataset = pq.ParquetDataset(file_names, filesystem=hdfs)}}

{{pa_schema = dataset.schema.to_arrow_schema()}}

{{pieces = dataset.pieces}}

{{for piece in pieces: }}

{{    print(piece.path)}}

 

{{piece.path}} loses the {{hdfs://localhost:9000}} prefix.

 

I think {{ParquetDataset}} should accept {{pyarrow.fs.HadoopFileSystem}} as 
filesystem?

And {{piece.path}} should have the prefix?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7956) [Python] Memory leak in pyarrow functions .ipc.serialize_pandas/deserialize_pandas

2020-02-27 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046864#comment-17046864
 ] 

Wes McKinney commented on ARROW-7956:
-

I reopened this as I want to make sure there is an appropriate unit test (or 
equivalent) for this.

> [Python] Memory leak in pyarrow functions 
> .ipc.serialize_pandas/deserialize_pandas
> --
>
> Key: ARROW-7956
> URL: https://issues.apache.org/jira/browse/ARROW-7956
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0
>Reporter: Denis
>Priority: Critical
> Fix For: 1.0.0
>
> Attachments: loans.parquet, pyarrow_mem_leak_test.py
>
>
> Python version: 3.7.4 (conda distribution)
> OS: Ubuntu 18.04
> pandas version: 0.24.2
> numpy version: 1.16.4
>  
> To reproduce the issue, run the attached script pyarrow_mem_leak_test.py. Also 
> put the attached file loans.parquet in the working directory.
>  
> Reading and writing Parquet in memory also has memory leaks. To reproduce 
> this, run the function test_parquet_leak() from the attached file 
> pyarrow_mem_leak_test.py.
> The memory leak is 100% reproducible.
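The attached script is authoritative; a minimal repro along the same lines
would look like this (a hypothetical sketch, watching RSS with psutil):

{code:python}
import os

import pandas as pd
import psutil
import pyarrow as pa

# Round-trip a DataFrame through serialize_pandas/deserialize_pandas
# repeatedly; with a leak, the process RSS keeps growing.
df = pd.DataFrame({"x": range(1_000_000)})
proc = psutil.Process(os.getpid())
for i in range(100):
    buf = pa.ipc.serialize_pandas(df)
    df2 = pa.ipc.deserialize_pandas(buf)
    if i % 10 == 0:
        print(i, proc.memory_info().rss // 2**20, "MiB")
{code}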



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7956) [Python] Memory leak in pyarrow functions .ipc.serialize_pandas/deserialize_pandas

2020-02-27 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-7956:

Fix Version/s: 1.0.0

> [Python] Memory leak in pyarrow functions 
> .ipc.serialize_pandas/deserialize_pandas
> --
>
> Key: ARROW-7956
> URL: https://issues.apache.org/jira/browse/ARROW-7956
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0
>Reporter: Denis
>Priority: Critical
> Fix For: 1.0.0
>
> Attachments: loans.parquet, pyarrow_mem_leak_test.py
>
>
> Python version: 3.7.4 (conda distribution)
> OS: Ubuntu 18.04
> pandas version: 0.24.2
> numpy version: 1.16.4
>  
> To reproduce the issue, run the attached script pyarrow_mem_leak_test.py. Also 
> put the attached file loans.parquet in the working directory.
>  
> Reading and writing Parquet in memory also has memory leaks. To reproduce 
> this, run the function test_parquet_leak() from the attached file 
> pyarrow_mem_leak_test.py.
> The memory leak is 100% reproducible.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7956) [Python] Memory leak in pyarrow functions .ipc.serialize_pandas/deserialize_pandas

2020-02-27 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-7956:

Affects Version/s: (was: 0.16.0)

> [Python] Memory leak in pyarrow functions 
> .ipc.serialize_pandas/deserialize_pandas
> --
>
> Key: ARROW-7956
> URL: https://issues.apache.org/jira/browse/ARROW-7956
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0
>Reporter: Denis
>Priority: Critical
> Attachments: loans.parquet, pyarrow_mem_leak_test.py
>
>
> Python version: 3.7.4 (conda distribution)
> OS: Ubuntu 18.04
> pandas version: 0.24.2
> numpy version: 1.16.4
>  
> To reproduce the issue, run the attached script pyarrow_mem_leak_test.py. Also 
> put the attached file loans.parquet in the working directory.
>  
> Reading and writing Parquet in memory also has memory leaks. To reproduce 
> this, run the function test_parquet_leak() from the attached file 
> pyarrow_mem_leak_test.py.
> The memory leak is 100% reproducible.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (ARROW-7956) Memory leak in pyarrow functions .ipc.serialize_pandas/deserialize_pandas

2020-02-27 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reopened ARROW-7956:
-

> Memory leak in pyarrow functions .ipc.serialize_pandas/deserialize_pandas
> -
>
> Key: ARROW-7956
> URL: https://issues.apache.org/jira/browse/ARROW-7956
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0, 0.16.0
>Reporter: Denis
>Priority: Critical
> Attachments: loans.parquet, pyarrow_mem_leak_test.py
>
>
> Python version: 3.7.4 (conda distribution)
> OS: Ubuntu 18.04
> pandas version: 0.24.2
> numpy version: 1.16.4
>  
> To reproduce the issue, run the attached script pyarrow_mem_leak_test.py. Also 
> put the attached file loans.parquet in the working directory.
>  
> Reading and writing Parquet in memory also has memory leaks. To reproduce 
> this, run the function test_parquet_leak() from the attached file 
> pyarrow_mem_leak_test.py.
> The memory leak is 100% reproducible.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7956) [Python] Memory leak in pyarrow functions .ipc.serialize_pandas/deserialize_pandas

2020-02-27 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-7956:

Summary: [Python] Memory leak in pyarrow functions 
.ipc.serialize_pandas/deserialize_pandas  (was: Memory leak in pyarrow 
functions .ipc.serialize_pandas/deserialize_pandas)

> [Python] Memory leak in pyarrow functions 
> .ipc.serialize_pandas/deserialize_pandas
> --
>
> Key: ARROW-7956
> URL: https://issues.apache.org/jira/browse/ARROW-7956
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0, 0.16.0
>Reporter: Denis
>Priority: Critical
> Attachments: loans.parquet, pyarrow_mem_leak_test.py
>
>
> Python version: 3.7.4 (conda distribution)
> OS: Ubuntu 18.04
> pandas version: 0.24.2
> numpy version: 1.16.4
>  
> To reproduce the issue, run the attached script pyarrow_mem_leak_test.py. Also 
> put the attached file loans.parquet in the working directory.
>  
> Reading and writing Parquet in memory also has memory leaks. To reproduce 
> this, run the function test_parquet_leak() from the attached file 
> pyarrow_mem_leak_test.py.
> The memory leak is 100% reproducible.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7956) Memory leak in pyarrow functions .ipc.serialize_pandas/deserialize_pandas

2020-02-27 Thread Denis (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046827#comment-17046827
 ] 

Denis commented on ARROW-7956:
--

Not reproduced with pyarrow==0.16.0.

Closing this ticket

> Memory leak in pyarrow functions .ipc.serialize_pandas/deserialize_pandas
> -
>
> Key: ARROW-7956
> URL: https://issues.apache.org/jira/browse/ARROW-7956
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0, 0.16.0
>Reporter: Denis
>Priority: Critical
> Attachments: loans.parquet, pyarrow_mem_leak_test.py
>
>
> Python version: 3.7.4 (conda distribution)
> OS: Ubuntu 18.04
> pandas version: 0.24.2
> numpy version: 1.16.4
>  
> To reproduce the issue, run the attached script pyarrow_mem_leak_test.py. Also 
> put the attached file loans.parquet in the working directory.
>  
> Reading and writing Parquet in memory also has memory leaks. To reproduce 
> this, run the function test_parquet_leak() from the attached file 
> pyarrow_mem_leak_test.py.
> The memory leak is 100% reproducible.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-560) [C++] Add support for zero-copy libhdfs reads

2020-02-27 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046821#comment-17046821
 ] 

Antoine Pitrou commented on ARROW-560:
--

In any case, I've built a prototype, but it crashes when releasing a buffer in 
the Python tests.

{code}
#0  raise (sig=) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x7efbfaaa4298 in os::Linux::chained_handler(int, siginfo*, void*) () 
from /opt/conda/envs/arrow//jre/lib/amd64/server/libjvm.so
#2  0x7efbfaaab585 in JVM_handle_linux_signal () from 
/opt/conda/envs/arrow//jre/lib/amd64/server/libjvm.so
#3  0x7efbfaaa0c93 in signalHandler(int, siginfo*, void*) () from 
/opt/conda/envs/arrow//jre/lib/amd64/server/libjvm.so
#4  <signal handler called>
#5  0x7efbfa86c230 in jni_invoke_nonstatic(JNIEnv_*, JavaValue*, _jobject*, 
JNICallType, _jmethodID*, JNI_ArgumentPusher*, Thread*) ()
   from /opt/conda/envs/arrow//jre/lib/amd64/server/libjvm.so
#6  0x7efbfa86ff7c in jni_CallVoidMethodV () from 
/opt/conda/envs/arrow//jre/lib/amd64/server/libjvm.so
#7  0x7efbf9fe7f90 in invokeMethod (env=env@entry=0x5655584ff9e0, 
retval=retval@entry=0x7ffbfffef420, methType=, 
methType@entry=INSTANCE, 
instObj=0x565559963960, className=className@entry=0x7efbf9ff05f8 
"org/apache/hadoop/fs/FSDataInputStream", 
methName=methName@entry=0x7efbf9fefc74 "releaseBuffer", 
methSignature=0x7efbf9fefc98 ")V", methSignature@entry=0x7efbf9fefc82 
"(Ljava/nio/ByteBuffer;)V")
at 
/build/source/hadoop-hdfs-project/hadoop-hdfs-native-client/src/main/native/libhdfs/jni_helper.c:150
#8  0x7efbf9fed97d in hadoopRzBufferFree (file=0x565559964ab0, 
buffer=0x565559325270)
at 
/build/source/hadoop-hdfs-project/hadoop-hdfs-native-client/src/main/native/libhdfs/hdfs.c:2712
#9  0x7efbbfdd70cd in arrow::io::internal::LibHdfsShim::RzBufferFree 
(this=0x7efbc0b7cc40 , file=0x565559964ab0, 
buffer=0x565559325270) at /arrow/cpp/src/arrow/io/hdfs_internal.cc:579
#10 0x7efbbfdc9a18 in arrow::io::HdfsBuffer::~HdfsBuffer 
(this=0x5655599638b0, __in_chrg=) at 
/arrow/cpp/src/arrow/io/hdfs.cc:134
#11 0x7efbbfdd2e37 in 
__gnu_cxx::new_allocator::destroy 
(this=0x5655599638b0, __p=0x5655599638b0)
at 
/opt/conda/envs/arrow/x86_64-conda_cos6-linux-gnu/include/c++/7.3.0/ext/new_allocator.h:140
#12 0x7efbbfdd2e03 in 
std::allocator_traits 
>::destroy (__a=..., __p=0x5655599638b0)
at 
/opt/conda/envs/arrow/x86_64-conda_cos6-linux-gnu/include/c++/7.3.0/bits/alloc_traits.h:487
#13 0x7efbbfdd2b53 in std::_Sp_counted_ptr_inplace, (__gnu_cxx::_Lock_policy)2>::_M_dispose 
(this=0x5655599638a0)
at 
/opt/conda/envs/arrow/x86_64-conda_cos6-linux-gnu/include/c++/7.3.0/bits/shared_ptr_base.h:535
#14 0x7efbbf92f508 in 
std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release 
(this=0x5655599638a0)
at 
/opt/conda/envs/arrow/x86_64-conda_cos6-linux-gnu/include/c++/7.3.0/bits/shared_ptr_base.h:154
#15 0x7efbbf9272ff in 
std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count 
(this=0x5655599631c8, __in_chrg=)
at 
/opt/conda/envs/arrow/x86_64-conda_cos6-linux-gnu/include/c++/7.3.0/bits/shared_ptr_base.h:684
#16 0x7efbbf9231b2 in std::__shared_ptr::~__shared_ptr (this=0x5655599631c0, 
__in_chrg=)
at 
/opt/conda/envs/arrow/x86_64-conda_cos6-linux-gnu/include/c++/7.3.0/bits/shared_ptr_base.h:1123
#17 0x7efbbf9231ce in std::shared_ptr::~shared_ptr 
(this=0x5655599631c0, __in_chrg=)
at 
/opt/conda/envs/arrow/x86_64-conda_cos6-linux-gnu/include/c++/7.3.0/bits/shared_ptr.h:93
#18 0x7efbbfdb8306 in arrow::io::CompressedInputStream::Impl::~Impl 
(this=0x565559963190, __in_chrg=) at 
/arrow/cpp/src/arrow/io/compressed.cc:242
#19 0x7efbbfdb834c in 
std::default_delete::operator() 
(this=0x565558a161b8, __ptr=0x565559963190)
at 
/opt/conda/envs/arrow/x86_64-conda_cos6-linux-gnu/include/c++/7.3.0/bits/unique_ptr.h:78
#20 0x7efbbfdb7485 in 
std::unique_ptr >::~unique_ptr 
(this=0x565558a161b8, 
__in_chrg=) at 
/opt/conda/envs/arrow/x86_64-conda_cos6-linux-gnu/include/c++/7.3.0/bits/unique_ptr.h:268
#21 0x7efbbfdb355d in 
arrow::io::CompressedInputStream::~CompressedInputStream (this=0x565558a161a0, 
__in_chrg=, __vtt_parm=)
at /arrow/cpp/src/arrow/io/compressed.cc:443
#22 0x7efbbfdb35ba in 
arrow::io::CompressedInputStream::~CompressedInputStream (this=0x565558a161a0, 
__in_chrg=, __vtt_parm=)
at /arrow/cpp/src/arrow/io/compressed.cc:443
#23 0x7efbbfdb91cc in 
std::_Sp_counted_ptr::_M_dispose (this=0x565558ad9d40)
at 
/opt/conda/envs/arrow/x86_64-conda_cos6-linux-gnu/include/c++/7.3.0/bits/shared_ptr_base.h:376
#24 0x7efbc1092ad6 in 
std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release 
(this=0x565558ad9d40)
at 
/opt/conda/envs/arrow/x86_64-conda_cos6-linux-gnu/include/c++/7.3.0/bits/shared_ptr_base.h:154
#25 0x7efbc1089248 in 
std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count 

[jira] [Commented] (ARROW-560) [C++] Add support for zero-copy libhdfs reads

2020-02-27 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046781#comment-17046781
 ] 

Antoine Pitrou commented on ARROW-560:
--

One caveat here is that positioned reads (ReadAt) are not supported with 
zero-copy in libhdfs. I don't know why that is, as I don't see what would make 
it impossible; I guess it just hasn't been implemented. So I'm not sure it's 
worthwhile doing this. What do you think?

> [C++] Add support for zero-copy libhdfs reads
> -
>
> Key: ARROW-560
> URL: https://issues.apache.org/jira/browse/ARROW-560
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: filesystem
>
> See *Rz* functions in 
> https://github.com/apache/arrow/blob/master/cpp/thirdparty/hadoop/include/hdfs.h



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-7746) [Java] Support large buffer for Flight

2020-02-27 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-7746.

Resolution: Invalid

> [Java] Support large buffer for Flight
> --
>
> Key: ARROW-7746
> URL: https://issues.apache.org/jira/browse/ARROW-7746
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Java
>Reporter: Liya Fan
>Priority: Major
>
> The motivation is described in 
> https://github.com/apache/arrow/pull/6323#issuecomment-580137629.
> When the size of the ArrowBuf exceeds 2GB, our Flight library does not work 
> due to integer overflow. 
> This is because internally, we have used some data structures which are based 
> on 32-bit integers. To resolve the problem, we must revise/replace the data 
> structures to make them support 64-bit integers. 
> As a concrete example, we can see that when the server sends data through 
> IPC, an org.apache.arrow.flight.ArrowMessage object is created, and is 
> wrapped as an InputStream through the `asInputStream` method. In this method, 
> we use data structures like java.io.ByteArrayOutputStream and 
> io.netty.buffer.ByteBuf, which are based on 32-bit integers (we can observe 
> that NettyArrowBuf#length and ByteArrayOutputStream#count are both 32-bit 
> integers). 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7746) [Java] Support large buffer for Flight

2020-02-27 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046771#comment-17046771
 ] 

Micah Kornfield commented on ARROW-7746:


Thanks [~fan_li_ya]. I don't think this Jira will be doable because gRPC has a 
maximum message size of 2GB. 

> [Java] Support large buffer for Flight
> --
>
> Key: ARROW-7746
> URL: https://issues.apache.org/jira/browse/ARROW-7746
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Java
>Reporter: Liya Fan
>Priority: Major
>
> The motivation is described in 
> https://github.com/apache/arrow/pull/6323#issuecomment-580137629.
> When the size of the ArrowBuf exceeds 2GB, our Flight library does not work 
> due to integer overflow. 
> This is because internally, we have used some data structures which are based 
> on 32-bit integers. To resolve the problem, we must revise/replace the data 
> structures to make them support 64-bit integers. 
> As a concrete example, we can see that when the server sends data through 
> IPC, an org.apache.arrow.flight.ArrowMessage object is created, and is 
> wrapped as an InputStream through the `asInputStream` method. In this method, 
> we use data structures like java.io.ByteArrayOutputStream and 
> io.netty.buffer.ByteBuf, which are based on 32-bit integers (we can observe 
> that NettyArrowBuf#length and ByteArrayOutputStream#count are both 32-bit 
> integers). 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-7926) [Developer] "archery lint" target is not ergonomic for running a single check like IWYU

2020-02-27 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs resolved ARROW-7926.

Resolution: Fixed

Issue resolved by pull request 6491
[https://github.com/apache/arrow/pull/6491]

> [Developer] "archery lint" target is not ergonomic for running a single check 
> like IWYU
> ---
>
> Key: ARROW-7926
> URL: https://issues.apache.org/jira/browse/ARROW-7926
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Developer Tools
>Reporter: Wes McKinney
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> It might be useful to have a second lint CLI target with everything disabled 
> by default so that a single lint target can be toggled on. How should this be 
> used via docker-compose? See ARROW-7925



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-7886) [C++][Dataset] Consolidate Source and Dataset

2020-02-27 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman resolved ARROW-7886.
-
Resolution: Fixed

Issue resolved by pull request 6470
[https://github.com/apache/arrow/pull/6470]

> [C++][Dataset] Consolidate Source and Dataset
> -
>
> Key: ARROW-7886
> URL: https://issues.apache.org/jira/browse/ARROW-7886
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Dataset
>Affects Versions: 0.16.0
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> Source and Dataset are very similar concepts (collections of multiple data 
> fragments). Consolidating them would decrease doc burden without reducing our 
> flexibility.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7926) [Developer] "archery lint" target is not ergonomic for running a single check like IWYU

2020-02-27 Thread Krisztian Szucs (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046673#comment-17046673
 ] 

Krisztian Szucs commented on ARROW-7926:


The numpydoc validation can't.

> [Developer] "archery lint" target is not ergonomic for running a single check 
> like IWYU
> ---
>
> Key: ARROW-7926
> URL: https://issues.apache.org/jira/browse/ARROW-7926
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Developer Tools
>Reporter: Wes McKinney
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> It might be useful to have a second lint CLI target with everything disabled 
> by default so that a single lint target can be toggled on. How should this be 
> used via docker-compose? See ARROW-7925



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7917) [CMake] FindPythonInterp should check for python3

2020-02-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-7917:
--
Labels: pull-request-available  (was: )

> [CMake] FindPythonInterp should check for python3
> -
>
> Key: ARROW-7917
> URL: https://issues.apache.org/jira/browse/ARROW-7917
> Project: Apache Arrow
>  Issue Type: Improvement
>Affects Versions: 0.16.0
>Reporter: Francois Saint-Jacques
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
>
> On Ubuntu 18.04 it will pick python2 by default.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7956) Memory leak in pyarrow functions .ipc.serialize_pandas/deserialize_pandas

2020-02-27 Thread Denis (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis updated ARROW-7956:
-
Priority: Critical  (was: Major)

> Memory leak in pyarrow functions .ipc.serialize_pandas/deserialize_pandas
> -
>
> Key: ARROW-7956
> URL: https://issues.apache.org/jira/browse/ARROW-7956
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0, 0.16.0
>Reporter: Denis
>Priority: Critical
> Attachments: loans.parquet, pyarrow_mem_leak_test.py
>
>
> Python version: 3.7.4 (conda distribution)
> OS: Ubuntu 18.04
> pandas version: 0.24.2
> numpy version: 1.16.4
>  
> To reproduce the issue, run the attached script pyarrow_mem_leak_test.py. Also 
> put the attached file loans.parquet in the working directory.
>  
> Reading and writing Parquet in memory also has memory leaks. To reproduce 
> this, run the function test_parquet_leak() from the attached file 
> pyarrow_mem_leak_test.py.
> The memory leak is 100% reproducible.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-7917) [CMake] FindPythonInterp should check for python3

2020-02-27 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-7917:
-

Assignee: Antoine Pitrou

> [CMake] FindPythonInterp should check for python3
> -
>
> Key: ARROW-7917
> URL: https://issues.apache.org/jira/browse/ARROW-7917
> Project: Apache Arrow
>  Issue Type: Improvement
>Affects Versions: 0.16.0
>Reporter: Francois Saint-Jacques
>Assignee: Antoine Pitrou
>Priority: Major
>
> On Ubuntu 18.04 it will pick python2 by default.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7956) Memory leak in pyarrow functions .ipc.serialize_pandas/deserialize_pandas

2020-02-27 Thread Denis (Jira)
Denis created ARROW-7956:


 Summary: Memory leak in pyarrow functions 
.ipc.serialize_pandas/deserialize_pandas
 Key: ARROW-7956
 URL: https://issues.apache.org/jira/browse/ARROW-7956
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.16.0, 0.15.0
Reporter: Denis
 Attachments: loans.parquet, pyarrow_mem_leak_test.py

Python version: 3.7.4 (conda distribution)

OS: Ubuntu 18.04

pandas version: 0.24.2

numpy version: 1.16.4

 

To reproduce the issue, run the attached script pyarrow_mem_leak_test.py. Also 
put the attached file loans.parquet in the working directory.

 

Reading and writing Parquet in memory also has memory leaks. To reproduce 
this, run the function test_parquet_leak() from the attached file 
pyarrow_mem_leak_test.py.

The memory leak is 100% reproducible.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7789) [R] Can't initialize arrow objects when R.oo package is loaded

2020-02-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-7789:
--
Labels: pull-request-available  (was: )

> [R] Can't initialize arrow objects when R.oo package is loaded
> --
>
> Key: ARROW-7789
> URL: https://issues.apache.org/jira/browse/ARROW-7789
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Martin
>Priority: Minor
>  Labels: pull-request-available
>
> Unknown error when using arrow::write_feather() in R 3.5.3
> pb = as.data.frame(seq(1:100))
> pbFilename <- file.path(getwd(), "reproduceBug.feather")
> arrow::write_feather(x = pb, sink = pbFilename)
> > Error in exists(name, envir = envir, inherits = FALSE) : 
> >   use of NULL environment is defunct
>  
> packageVersion('arrow')
> [1] ‘0.15.1.1’



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-7789) [R] Can't initialize arrow objects when R.oo package is loaded

2020-02-27 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-7789:
--

Assignee: Neal Richardson

> [R] Can't initialize arrow objects when R.oo package is loaded
> --
>
> Key: ARROW-7789
> URL: https://issues.apache.org/jira/browse/ARROW-7789
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Martin
>Assignee: Neal Richardson
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Unknown error when using arrow::write_feather() in R 3.5.3
> pb = as.data.frame(seq(1:100))
> pbFilename <- file.path(getwd(), "reproduceBug.feather")
> arrow::write_feather(x = pb, sink = pbFilename)
> > Error in exists(name, envir = envir, inherits = FALSE) : 
> >   use of NULL environment is defunct
>  
> packageVersion('arrow')
> [1] ‘0.15.1.1’



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7789) [R] Can't initialize arrow objects when R.oo package is loaded

2020-02-27 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-7789:
---
Summary: [R] Can't initialize arrow objects when R.oo package is loaded  
(was: [R] Unknown error when using arrow::write_feather()  in R 3.5.3)

> [R] Can't initialize arrow objects when R.oo package is loaded
> --
>
> Key: ARROW-7789
> URL: https://issues.apache.org/jira/browse/ARROW-7789
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Martin
>Priority: Minor
>
> Unknown error when using arrow::write_feather() in R 3.5.3
> pb = as.data.frame(seq(1:100))
> pbFilename <- file.path(getwd(), "reproduceBug.feather")
> arrow::write_feather(x = pb, sink = pbFilename)
> > Error in exists(name, envir = envir, inherits = FALSE) : 
> >   use of NULL environment is defunct
>  
> packageVersion('arrow')
> [1] ‘0.15.1.1’



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7789) [R] Unknown error when using arrow::write_feather()  in R 3.5.3

2020-02-27 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046627#comment-17046627
 ] 

Neal Richardson commented on ARROW-7789:


Thanks for the tip, [~karldw]. The problem seems to be that arrow objects share 
a base class name ("Object") that gets special methods defined in {{R.oo}} 
(https://github.com/HenrikBengtsson/R.oo/blob/develop/NAMESPACE#L171). Since 
R.oo Objects don't have the same properties as arrow Objects, calling R.oo's 
{{$<-}} method on an arrow Object does bad things. 

There are a few ways to fix this; I'll discuss them on the PR I'm about to put up.

> [R] Unknown error when using arrow::write_feather()  in R 3.5.3
> ---
>
> Key: ARROW-7789
> URL: https://issues.apache.org/jira/browse/ARROW-7789
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Martin
>Priority: Minor
>
> Unknown error when using arrow::write_feather() in R 3.5.3
> pb = as.data.frame(seq(1:100))
> pbFilename <- file.path(getwd(), "reproduceBug.feather")
> arrow::write_feather(x = pb, sink = pbFilename)
> > Error in exists(name, envir = envir, inherits = FALSE) : 
> >   use of NULL environment is defunct
>  
> packageVersion('arrow')
> [1] ‘0.15.1.1’



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7048) [Java] Support for combining multiple vectors under VectorSchemaRoot

2020-02-27 Thread Liya Fan (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046623#comment-17046623
 ] 

Liya Fan commented on ARROW-7048:
-

[~yogeshtewari] Sorry for the long wait. We have provided a PR for this issue. 
Would you please take a look, and check if it is what you want?

> [Java] Support for combining multiple vectors under VectorSchemaRoot
> 
>
> Key: ARROW-7048
> URL: https://issues.apache.org/jira/browse/ARROW-7048
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Yogesh Tewari
>Assignee: Liya Fan
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Hi,
>  
> pyarrow.Table.combine_chunks provides nice functionality for combining 
> multiple record batches into a single pyarrow.Table.
>  
> I am currently working on a downstream application which reads data from 
> BigQuery. The BigQuery storage API supports data output in Arrow format but 
> streams data in many batches of 1024 rows or fewer.
> It would be really nice to have the Arrow Java API provide this functionality 
> under an abstraction like VectorSchemaRoot.
> After getting guidance from [~emkornfi...@gmail.com], I tried to write my own 
> implementation by copying data vector by vector using TransferPair's 
> copyValueSafe.
> But, unless I am missing something obvious, it turns out it only copies one 
> value at a time. That means a lot of looping, calling copyValueSafe on 
> millions of rows from source vector index to target vector index. Ideally I 
> would want to concatenate/link the underlying buffers rather than copying one 
> cell at a time.
>  
> Eg, if I have :
> {code:java}
> List<VectorSchemaRoot> batchList = new ArrayList<>();
> try (ArrowStreamReader reader = new ArrowStreamReader(new 
> ByteArrayInputStream(out.toByteArray()), allocator)) {
> Schema schema = reader.getVectorSchemaRoot().getSchema();
> for (int i = 0; i < 5; i++) {
> // This will be loaded with new values on every call to loadNextBatch
> VectorSchemaRoot readBatch = reader.getVectorSchemaRoot();
> reader.loadNextBatch();
> batchList.add(readBatch);
> }
> }
> //VectorSchemaRoot.combineChunks(batchList, newVectorSchemaRoot);{code}
>  
> A method like VectorSchemaRoot.combineChunks(List<VectorSchemaRoot>)?
> I did read the VectorSchemaRoot discussion on 
> https://issues.apache.org/jira/browse/ARROW-6896 and am not sure if it's the 
> right thing to use here.
>  
>  
> PS. Feel free to update the title of this feature request with more 
> appropriate wording.
>  
> Cheers,
> Yogesh
>  
>  
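For reference, the pyarrow behavior the reporter is asking for a Java analogue
of looks like this (a minimal sketch; the batch contents are made up):

{code:python}
import pyarrow as pa

# Gather many small record batches into one Table, then combine_chunks()
# concatenates the buffers so each column ends up as a single chunk.
batches = [
    pa.RecordBatch.from_arrays([pa.array(range(1024))], names=["x"])
    for _ in range(5)
]
table = pa.Table.from_batches(batches).combine_chunks()
print(table.column("x").num_chunks)  # 1
{code}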



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7048) [Java] Support for combining multiple vectors under VectorSchemaRoot

2020-02-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-7048:
--
Labels: pull-request-available  (was: )

> [Java] Support for combining multiple vectors under VectorSchemaRoot
> 
>
> Key: ARROW-7048
> URL: https://issues.apache.org/jira/browse/ARROW-7048
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Yogesh Tewari
>Assignee: Liya Fan
>Priority: Major
>  Labels: pull-request-available
>
> Hi,
>  
> pyarrow.Table.combine_chunks provides nice functionality for combining 
> multiple record batches into a single pyarrow.Table.
>  
> I am currently working on a downstream application which reads data from 
> BigQuery. The BigQuery storage API supports data output in Arrow format but 
> streams data in many batches of 1024 rows or fewer.
> It would be really nice to have the Arrow Java API provide this functionality 
> under an abstraction like VectorSchemaRoot.
> After getting guidance from [~emkornfi...@gmail.com], I tried to write my own 
> implementation by copying data vector by vector using TransferPair's 
> copyValueSafe.
> But, unless I am missing something obvious, it turns out it only copies one 
> value at a time. That means a lot of looping, calling copyValueSafe on 
> millions of rows from source vector index to target vector index. Ideally I 
> would want to concatenate/link the underlying buffers rather than copying one 
> cell at a time.
>  
> E.g., if I have:
> {code:java}
> List<VectorSchemaRoot> batchList = new ArrayList<>();
> try (ArrowStreamReader reader = new ArrowStreamReader(
>         new ByteArrayInputStream(out.toByteArray()), allocator)) {
>     Schema schema = reader.getVectorSchemaRoot().getSchema();
>     for (int i = 0; i < 5; i++) {
>         // This root is loaded with new values on every call to loadNextBatch
>         VectorSchemaRoot readBatch = reader.getVectorSchemaRoot();
>         reader.loadNextBatch();
>         batchList.add(readBatch);
>     }
> }
> // VectorSchemaRoot.combineChunks(batchList, newVectorSchemaRoot);
> {code}
>  
> A method like VectorSchemaRoot.combineChunks(List<VectorSchemaRoot>)?
> I did read the VectorSchemaRoot discussion on 
> https://issues.apache.org/jira/browse/ARROW-6896 and am not sure if it's the 
> right thing to use here.
>  
>  
> PS. Feel free to update the title of this feature request with more 
> appropriate wording.
>  
> Cheers,
> Yogesh
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7906) [C++][Python] Full functionality for ORC format

2020-02-27 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046586#comment-17046586
 ] 

Antoine Pitrou commented on ARROW-7906:
---

[~PereTang] Do you want to submit a PR?

> [C++][Python] Full functionality for ORC format
> ---
>
> Key: ARROW-7906
> URL: https://issues.apache.org/jira/browse/ARROW-7906
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Python
>Reporter: HAOFENG DENG
>Priority: Major
>
> Just like the Parquet format, ORC has a large group of fans in the big data 
> area, and it has better performance than Parquet in some use cases.
>  But a problem in Python is that it doesn't have a standard write 
> function.
> It seems the ORC team itself maintains the standard C++ code 
> ([https://github.com/apache/orc/tree/master/c%2B%2B]), so I think it 
> won't take too much effort to integrate it into Arrow (C++) and build the 
> hook for Python.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7949) [Developer] Update to '.gitignore' to not track user specific 'cpp/Brewfile.lock.json' file

2020-02-27 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-7949:
--
Component/s: Developer Tools

> [Developer] Update to '.gitignore' to not track user specific  
> 'cpp/Brewfile.lock.json' file
> 
>
> Key: ARROW-7949
> URL: https://issues.apache.org/jira/browse/ARROW-7949
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools
> Environment: macOS-10.15.3
>Reporter: Tarek Allam
>Assignee: Antoine Pitrou
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> In the developer guides for Python, there is a suggestion for users on macOS 
> to use Homebrew to install all dependencies required for building Arrow C++. 
> This creates a 'cpp/Brewfile.lock.json' file that is specific to the system 
> it sits on.
> It would be desirable for this not to be tracked by version control. To 
> prevent this accidental addition, perhaps it should be ignored in the 
> gitignore file for the repository.
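
For reference, the proposed change amounts to a one-line ignore rule along 
these lines (a sketch; the actual patch may differ):
{code}
# Homebrew lockfile generated when running `brew bundle` on macOS
cpp/Brewfile.lock.json
{code}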



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-7949) [Developer] Update to '.gitignore' to not track user specific 'cpp/Brewfile.lock.json' file

2020-02-27 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-7949:
-

Assignee: Tarek Allam  (was: Antoine Pitrou)

> [Developer] Update to '.gitignore' to not track user specific  
> 'cpp/Brewfile.lock.json' file
> 
>
> Key: ARROW-7949
> URL: https://issues.apache.org/jira/browse/ARROW-7949
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools
> Environment: macOS-10.15.3
>Reporter: Tarek Allam
>Assignee: Tarek Allam
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> In the developer guides for Python, there is a suggestion for users on macOS 
> to use Homebrew to install all dependencies required for building Arrow C++. 
> This creates a 'cpp/Brewfile.lock.json' file that is specific to the system 
> it sits on.
> It would be desirable for this not to be tracked by version control. To 
> prevent this accidental addition, perhaps it should be ignored in the 
> gitignore file for the repository.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-7949) [Developer] Update to '.gitignore' to not track user specific 'cpp/Brewfile.lock.json' file

2020-02-27 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-7949:
-

Assignee: Antoine Pitrou

> [Developer] Update to '.gitignore' to not track user specific  
> 'cpp/Brewfile.lock.json' file
> 
>
> Key: ARROW-7949
> URL: https://issues.apache.org/jira/browse/ARROW-7949
> Project: Apache Arrow
>  Issue Type: Improvement
> Environment: macOS-10.15.3
>Reporter: Tarek Allam
>Assignee: Antoine Pitrou
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> In the developer guides for Python, there is a suggestion for users on macOS 
> to use Homebrew to install all dependencies required for building Arrow C++. 
> This creates a 'cpp/Brewfile.lock.json' file that is specific to the system 
> it sits on.
> It would be desirable for this not to be tracked by version control. To 
> prevent this accidental addition, perhaps it should be ignored in the 
> gitignore file for the repository.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-7949) [Developer] Update to '.gitignore' to not track user specific 'cpp/Brewfile.lock.json' file

2020-02-27 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-7949.
---
Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 6494
[https://github.com/apache/arrow/pull/6494]

> [Developer] Update to '.gitignore' to not track user specific  
> 'cpp/Brewfile.lock.json' file
> 
>
> Key: ARROW-7949
> URL: https://issues.apache.org/jira/browse/ARROW-7949
> Project: Apache Arrow
>  Issue Type: Improvement
> Environment: macOS-10.15.3
>Reporter: Tarek Allam
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In the developer guides for Python, there is a suggestion for users on macOS 
> to use Homebrew to install all dependencies required for building Arrow C++. 
> This creates a 'cpp/Brewfile.lock.json' file that is specific to the system 
> it sits on.
> It would be desirable for this not to be tracked by version control. To 
> prevent this accidental addition, perhaps it should be ignored in the 
> gitignore file for the repository.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-3247) [Python] Support spark parquet array and map types

2020-02-27 Thread Brian Hulette (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-3247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046482#comment-17046482
 ] 

Brian Hulette commented on ARROW-3247:
--

Thanks Micah, could you link those here? When searching for parquet and maps 
this is all I could find.

> [Python] Support spark parquet array and map types
> --
>
> Key: ARROW-3247
> URL: https://issues.apache.org/jira/browse/ARROW-3247
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Martin Durant
>Priority: Minor
>  Labels: parquet
>
> As far as I understand, there is already some support for nested 
> array/dict/structs in Arrow. However, Spark Map and List types are structured 
> one level deeper (I believe to allow for both NULL and empty entries). 
> Surprisingly, fastparquet can load these. I do not know the plan for 
> arbitrary nested object support, but it should be made clear.
> Schema of spark-generated file from the fastparquet test suite:
> {code:java}
>  - spark_schema:
> | - map_op_op: MAP, OPTIONAL
> |   - key_value: REPEATED
> |   | - key: BYTE_ARRAY, UTF8, REQUIRED
> |   | - value: BYTE_ARRAY, UTF8, OPTIONAL
> | - map_op_req: MAP, OPTIONAL
> |   - key_value: REPEATED
> |   | - key: BYTE_ARRAY, UTF8, REQUIRED
> |   | - value: BYTE_ARRAY, UTF8, REQUIRED
> | - map_req_op: MAP, REQUIRED
> |   - key_value: REPEATED
> |   | - key: BYTE_ARRAY, UTF8, REQUIRED
> |   | - value: BYTE_ARRAY, UTF8, OPTIONAL
> | - map_req_req: MAP, REQUIRED
> |   - key_value: REPEATED
> |   | - key: BYTE_ARRAY, UTF8, REQUIRED
> |   | - value: BYTE_ARRAY, UTF8, REQUIRED
> | - arr_op_op: LIST, OPTIONAL
> |   - list: REPEATED
> |   | - element: BYTE_ARRAY, UTF8, OPTIONAL
> | - arr_op_req: LIST, OPTIONAL
> |   - list: REPEATED
> |   | - element: BYTE_ARRAY, UTF8, REQUIRED
> | - arr_req_op: LIST, REQUIRED
> |   - list: REPEATED
> |   | - element: BYTE_ARRAY, UTF8, OPTIONAL
> | - arr_req_req: LIST, REQUIRED
> |   - list: REPEATED
> |   | - element: BYTE_ARRAY, UTF8, REQUIRED
> {code}
> (please forgive that some of this has already been mentioned elsewhere; this 
> is one of the entries in the list at 
> [https://github.com/dask/fastparquet/issues/374] of features that are useful 
> in fastparquet)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7746) [Java] Support large buffer for Flight

2020-02-27 Thread Liya Fan (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046431#comment-17046431
 ] 

Liya Fan commented on ARROW-7746:
-

It seems the PR for ARROW-7610 is already big enough. To make code reviewing 
easier, I have opened ARROW-7955 to track the support for file/stream IPC. 

> [Java] Support large buffer for Flight
> --
>
> Key: ARROW-7746
> URL: https://issues.apache.org/jira/browse/ARROW-7746
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Java
>Reporter: Liya Fan
>Priority: Major
>
> The motivation is described in 
> https://github.com/apache/arrow/pull/6323#issuecomment-580137629.
> When the size of an ArrowBuf exceeds 2GB, our Flight library does not work 
> due to integer overflow. 
> This is because, internally, we have used some data structures which are 
> based on 32-bit integers. To resolve the problem, we must revise/replace 
> these data structures to make them support 64-bit integers. 
> As a concrete example, we can see that when the server sends data through 
> IPC, an org.apache.arrow.flight.ArrowMessage object is created and wrapped 
> as an InputStream through the `asInputStream` method. In this method, we use 
> data structures like java.io.ByteArrayOutputStream and 
> io.netty.buffer.ByteBuf, which are based on 32-bit integers (we can observe 
> that NettyArrowBuf#length and ByteArrayOutputStream#count are both 32-bit 
> integers). 
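
As a toy illustration of the failure mode (hypothetical values, not Arrow 
code), any length above Integer.MAX_VALUE is silently corrupted by the 
narrowing cast that a 32-bit length field implies:
{code:java}
public class OverflowDemo {
  public static void main(String[] args) {
    long bufferLength = 3L * 1024 * 1024 * 1024; // a 3 GiB ArrowBuf length
    int narrowed = (int) bufferLength;           // what a 32-bit length field stores
    System.out.println(bufferLength);            // 3221225472
    System.out.println(narrowed);                // -1073741824: overflowed
  }
}
{code}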



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7955) [Java] Support large buffer for file/stream IPC

2020-02-27 Thread Liya Fan (Jira)
Liya Fan created ARROW-7955:
---

 Summary: [Java] Support large buffer for file/stream IPC
 Key: ARROW-7955
 URL: https://issues.apache.org/jira/browse/ARROW-7955
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


After supporting 64-bit ArrowBuf, we need to make file/stream IPC work.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7746) [Java] Support large buffer for Flight

2020-02-27 Thread Liya Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liya Fan updated ARROW-7746:

Summary: [Java] Support large buffer for Flight  (was: [Java] Support large 
buffer for IPC)

> [Java] Support large buffer for Flight
> --
>
> Key: ARROW-7746
> URL: https://issues.apache.org/jira/browse/ARROW-7746
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Java
>Reporter: Liya Fan
>Priority: Major
>
> The motivation is described in 
> https://github.com/apache/arrow/pull/6323#issuecomment-580137629.
> When the size of an ArrowBuf exceeds 2GB, our Flight library does not work 
> due to integer overflow. 
> This is because, internally, we have used some data structures which are 
> based on 32-bit integers. To resolve the problem, we must revise/replace 
> these data structures to make them support 64-bit integers. 
> As a concrete example, we can see that when the server sends data through 
> IPC, an org.apache.arrow.flight.ArrowMessage object is created and wrapped 
> as an InputStream through the `asInputStream` method. In this method, we use 
> data structures like java.io.ByteArrayOutputStream and 
> io.netty.buffer.ByteBuf, which are based on 32-bit integers (we can observe 
> that NettyArrowBuf#length and ByteArrayOutputStream#count are both 32-bit 
> integers). 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7746) [Java] Support large buffer for IPC

2020-02-27 Thread Liya Fan (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046316#comment-17046316
 ] 

Liya Fan commented on ARROW-7746:
-

[~emkornfi...@gmail.com] Sorry, there must be some misunderstanding here. I 
thought Flight was a necessary part of IPC. 

So I will change the title of this issue, and provide support for the 
remaining IPC issues (e.g. ArrowStreamWriter/ArrowStreamReader) in ARROW-7610.

> [Java] Support large buffer for IPC
> ---
>
> Key: ARROW-7746
> URL: https://issues.apache.org/jira/browse/ARROW-7746
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Java
>Reporter: Liya Fan
>Priority: Major
>
> The motivation is described in 
> https://github.com/apache/arrow/pull/6323#issuecomment-580137629.
> When the size of an ArrowBuf exceeds 2GB, our Flight library does not work 
> due to integer overflow. 
> This is because, internally, we have used some data structures which are 
> based on 32-bit integers. To resolve the problem, we must revise/replace 
> these data structures to make them support 64-bit integers. 
> As a concrete example, we can see that when the server sends data through 
> IPC, an org.apache.arrow.flight.ArrowMessage object is created and wrapped 
> as an InputStream through the `asInputStream` method. In this method, we use 
> data structures like java.io.ByteArrayOutputStream and 
> io.netty.buffer.ByteBuf, which are based on 32-bit integers (we can observe 
> that NettyArrowBuf#length and ByteArrayOutputStream#count are both 32-bit 
> integers). 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)