[jira] [Assigned] (ARROW-10010) [Rust] Speedup arithmetic

2020-09-14 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-10010:
-

Assignee: Jorge  (was: Apache Arrow JIRA Bot)

> [Rust] Speedup arithmetic
> -
>
> Key: ARROW-10010
> URL: https://issues.apache.org/jira/browse/ARROW-10010
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Jorge
>Assignee: Jorge
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> There are some optimizations possible in arithmetic kernels.
>  
> PR to follow



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-10010) [Rust] Speedup arithmetic

2020-09-14 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-10010:
-

Assignee: Apache Arrow JIRA Bot  (was: Jorge)

> [Rust] Speedup arithmetic
> -
>
> Key: ARROW-10010
> URL: https://issues.apache.org/jira/browse/ARROW-10010
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Jorge
>Assignee: Apache Arrow JIRA Bot
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> There are some optimizations possible in arithmetic kernels.
>  
> PR to follow



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10010) [Rust] Speedup arithmetic

2020-09-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10010:
---
Labels: pull-request-available  (was: )

> [Rust] Speedup arithmetic
> -
>
> Key: ARROW-10010
> URL: https://issues.apache.org/jira/browse/ARROW-10010
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Jorge
>Assignee: Jorge
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> There are some optimizations possible in arithmetic kernels.
>  
> PR to follow



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10010) [Rust] Speedup arithmetic

2020-09-14 Thread Jorge (Jira)
Jorge created ARROW-10010:
-

 Summary: [Rust] Speedup arithmetic
 Key: ARROW-10010
 URL: https://issues.apache.org/jira/browse/ARROW-10010
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Jorge
Assignee: Jorge


There are some optimizations possible in arithmetic kernels.

 

PR to follow



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10009) [C++] LeastSignficantBitMask has typo in name.

2020-09-14 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-10009:
---

 Summary: [C++] LeastSignficantBitMask has typo in name.
 Key: ARROW-10009
 URL: https://issues.apache.org/jira/browse/ARROW-10009
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Micah Kornfield
Assignee: Micah Kornfield


We should fix the typo.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10008) pyarrow.parquet.read_table fails with predicate pushdown on categorical data with use_legacy_dataset=False

2020-09-14 Thread Caleb Hattingh (Jira)
Caleb Hattingh created ARROW-10008:
--

 Summary: pyarrow.parquet.read_table fails with predicate pushdown 
on categorical data with use_legacy_dataset=False
 Key: ARROW-10008
 URL: https://issues.apache.org/jira/browse/ARROW-10008
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Python
Affects Versions: 1.0.1, 0.17.1
 Environment: Platform: Linux-5.8.9-050809-generic-x86_64-with-glibc2.10
Python version: 3.8.5 (default, Aug  5 2020, 08:36:46) 
[GCC 7.3.0]
Pandas version: 1.1.2
pyarrow version: 1.0.1

Reporter: Caleb Hattingh


I apologise if this is a known issue; I looked both in this issue tracker and 
on GitHub and didn't find it.

There seems to be a problem reading a dataset with predicate pushdown (filters) 
on columns with categorical data. The problem only occurs with 
`use_legacy_dataset=False` (with True, the filter simply has no effect when the 
column isn't a partition key).

Reproducer:
{code:python}
import shutil
import sys, platform
from pathlib import Path
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd

# Settings
CATEGORICAL_DTYPE = True
USE_LEGACY_DATASET = False

print('Platform:', platform.platform())
print('Python version:', sys.version)
print('Pandas version:', pd.__version__)
print('pyarrow version:', pa.__version__)
print('categorical enabled:', CATEGORICAL_DTYPE)
print('use_legacy_dataset:', USE_LEGACY_DATASET)
print()

# Clean up test dataset if present
path = Path('blah.parquet')
if path.exists():
    shutil.rmtree(str(path))

# Simple data
d = dict(col1=['a', 'b'], col2=[1, 2])

# Either categorical or not
if CATEGORICAL_DTYPE:
    df = pd.DataFrame(data=d, dtype='category')
else:
    df = pd.DataFrame(data=d)

# Write dataset
table = pa.Table.from_pandas(df)
pq.write_to_dataset(table, str(path))

# Load dataset
table = pq.read_table(
    str(path),
    filters=[('col1', '=', 'a')],
    use_legacy_dataset=USE_LEGACY_DATASET,
)
df = table.to_pandas()
print(df.dtypes)
print(repr(df))

{code}
Output:
{code}
$ python categorical_predicate_pushdown.py 
Platform: Linux-5.8.9-050809-generic-x86_64-with-glibc2.10
Python version: 3.8.5 (default, Aug  5 2020, 08:36:46) 
[GCC 7.3.0]
Pandas version: 1.1.2
pyarrow version: 1.0.1
categorical enabled: True
use_legacy_dataset: False

/arrow/cpp/src/arrow/result.cc:28: ValueOrDie called on an error: Type error: 
Cannot compare scalars of differing type: dictionary vs string
/home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow.so.100(+0x4fc128)[0x7f50568c6128]
/home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow.so.100(_ZN5arrow4util8ArrowLogD1Ev+0xdd)[0x7f50568c693d]
/home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow.so.100(_ZN5arrow8internal14DieWithMessageERKSs+0x51)[0x7f50569757c1]
/home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow.so.100(_ZN5arrow8internal17InvalidValueOrDieERKNS_6StatusE+0x4c)[0x7f505697716c]
/home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZNK5arrow7dataset20ComparisonExpression21AssumeGivenComparisonERKS1_+0x438)[0x7f5043334f18]
/home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZNK5arrow7dataset20ComparisonExpression6AssumeERKNS0_10ExpressionE+0x34)[0x7f5043334fa4]
/home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZNK5arrow7dataset20ComparisonExpression6AssumeERKNS0_10ExpressionE+0xce)[0x7f504333503e]
/home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZNK5arrow7dataset20ComparisonExpression6AssumeERKNS0_10ExpressionE+0xce)[0x7f504333503e]
/home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZNK5arrow7dataset12RowGroupInfo7SatisfyERKNS0_10ExpressionE+0x1c)[0x7f50433116ac]
/home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZN5arrow7dataset19ParquetFileFragment15FilterRowGroupsERKNS0_10ExpressionE+0x563)[0x7f5043311cb3]
/home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZNK5arrow7dataset17ParquetFileFormat8ScanFileESt10shared_ptrINS0_11ScanOptionsEES2_INS0_11ScanContextEEPNS0_12FileFragmentE+0x203)[0x7f50433168a3]
/home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZN5arrow7dataset12FileFragment4ScanESt10shared_ptrINS0_11ScanOptionsEES2_INS0_11ScanContextEE+0x55)[0x7f5043329785]
{code}

[jira] [Commented] (ARROW-9989) Arrow

2020-09-14 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17195791#comment-17195791
 ] 

Wes McKinney commented on ARROW-9989:
-

This question might be more relevant for the dev@ or user@ mailing list. If you 
want to keep it as a JIRA, could you write an informative issue title?


> Arrow 
> --
>
> Key: ARROW-9989
> URL: https://issues.apache.org/jira/browse/ARROW-9989
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Affects Versions: 0.14.0
> Environment: Linux 18.04, Arrow from maven 0.14.0
>Reporter: Litchy Soong
>Priority: Major
>
> In Scala (using Arrow's Java library), the following code works:
> {code:java}
> object A {
>   def write() = {
>     val vectorSchemaRoot = VectorSchemaRoot.create(getSchema, allocator)
>     val writer = new ArrowStreamWriter(vectorSchemaRoot, null, out)
>   }
> }
> {code}
> But the following does not work:
> {code:java}
> object A {
>   var vectorSchemaRoot: VectorSchemaRoot = null
>   var writer: ArrowStreamWriter = null
>   def write() = {
>     vectorSchemaRoot = VectorSchemaRoot.create(getSchema, allocator)
>     writer = new ArrowStreamWriter(vectorSchemaRoot, null, out)
>   }
> }
> {code}
> The error is:
> {quote}java.lang.IllegalStateException: wrong buffer size: 601 != 4081
>  at 
> org.apache.arrow.vector.ipc.message.MessageSerializer.writeBatchBuffers(MessageSerializer.java:297)
>  at 
> org.apache.arrow.vector.ipc.message.MessageSerializer.serialize(MessageSerializer.java:267)
>  at 
> org.apache.arrow.vector.ipc.ArrowWriter.writeRecordBatch(ArrowWriter.java:132)
>  at org.apache.arrow.vector.ipc.ArrowWriter.writeBatch(ArrowWriter.java:120)
>
> java.lang.IndexOutOfBoundsException: index: 0, length: 1 (expected: range(0, 
> 0))
>  at io.netty.buffer.ArrowBuf.checkIndexD(ArrowBuf.java:337)
>  at io.netty.buffer.ArrowBuf.chk(ArrowBuf.java:324)
>  at io.netty.buffer.ArrowBuf.getByte(ArrowBuf.java:526)
>  at org.apache.arrow.vector.BitVectorHelper.setBit(BitVectorHelper.java:70)
>  at org.apache.arrow.vector.Float4Vector.set(Float4Vector.java:168)
>
> java.lang.IllegalStateException: RefCnt has gone negative
>  at org.apache.arrow.util.Preconditions.checkState(Preconditions.java:458)
>  at org.apache.arrow.memory.BufferLedger.release(BufferLedger.java:134)
>  at org.apache.arrow.memory.BufferLedger.release(BufferLedger.java:108)
>  at 
> org.apache.arrow.vector.BaseFixedWidthVector.reAlloc(BaseFixedWidthVector.java:441)
>  at 
> org.apache.arrow.vector.BaseFixedWidthVector.setValueCount(BaseFixedWidthVector.java:708)
>  at 
> org.apache.arrow.vector.VectorSchemaRoot.setRowCount(VectorSchemaRoot.java:226)
> {quote}
>  
> The error is raised every second time I call the method. It seems that 
> neither ArrowStreamWriter nor VectorSchemaRoot can be initialized this way. 
> Why?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10007) [Python][CI] Add a nightly build to exercise hypothesis tests

2020-09-14 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-10007:
---

 Summary: [Python][CI] Add a nightly build to exercise hypothesis 
tests
 Key: ARROW-10007
 URL: https://issues.apache.org/jira/browse/ARROW-10007
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Continuous Integration, Python
Reporter: Krisztian Szucs
Assignee: Krisztian Szucs


We have a couple of hypothesis tests, which are especially useful for 
discovering corner cases. We should have a crossbow nightly build to run them 
regularly.
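
For context, a toy property-based test in the style referred to above 
(illustrative only, not one of pyarrow's actual hypothesis tests):

{code:python}
# Hypothetical example of the kind of test such a nightly build would run.
from hypothesis import given
import hypothesis.strategies as st
import pyarrow as pa

@given(st.lists(st.integers(min_value=-2**31, max_value=2**31 - 1)))
def test_int32_roundtrip(values):
    # Round-trip arbitrary int32 lists through an Arrow array.
    arr = pa.array(values, type=pa.int32())
    assert arr.to_pylist() == values
{code}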



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10004) [Python] Consider to raise or normalize if a timezone aware datetime.time object is encountered during conversion

2020-09-14 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-10004:

Issue Type: Improvement  (was: New Feature)

> [Python] Consider to raise or normalize if a timezone aware datetime.time 
> object is encountered during conversion
> -
>
> Key: ARROW-10004
> URL: https://issues.apache.org/jira/browse/ARROW-10004
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Krisztian Szucs
>Priority: Major
>
> Python datetime.time objects may have timezone information attached, but 
> since the time types (time32 and time64) don't have that property in Arrow we 
> simply ignore it.
> We should either raise an error or normalize to UTC.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10005) [C++] Add an Append method to the time builders which validates the input range

2020-09-14 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-10005:

Issue Type: Improvement  (was: New Feature)

> [C++] Add an Append method to the time builders which validates the input 
> range
> ---
>
> Key: ARROW-10005
> URL: https://issues.apache.org/jira/browse/ARROW-10005
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Krisztian Szucs
>Priority: Major
>
> It seems we don't have a method that validates the input value range for 
> time types. It would be handy to do this validation after converting from a 
> Python object.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10006) [C++][Python] Do not collect python iterators if not necessary

2020-09-14 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-10006:
---

 Summary: [C++][Python] Do not collect python iterators if not 
necessary
 Key: ARROW-10006
 URL: https://issues.apache.org/jira/browse/ARROW-10006
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Python
Reporter: Krisztian Szucs


When converting Python objects to an Arrow array, we currently always collect 
the input into a sequence, but this can be memory-consuming in certain cases.

For unknown-sized iterators we could consume and temporarily store the seen 
items during inference, potentially improving both the conversion time and peak 
memory usage.
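
A minimal sketch of that strategy in plain Python (names like infer_type are 
stand-ins for the real inference step, not actual pyarrow internals):

{code:python}
from itertools import chain, islice

def infer_type(values):
    # Toy stand-in for type inference: type of the first non-None value.
    return next((type(v) for v in values if v is not None), object)

def convert_iterable(it, infer_batch=1000):
    # Buffer only a prefix of an unknown-sized iterator for inference,
    # then resume from the buffered items without collecting the rest.
    it = iter(it)
    head = list(islice(it, infer_batch))
    typ = infer_type(head)
    return typ, chain(head, it)

typ, stream = convert_iterable(x * 0.5 for x in range(10_000))
print(typ, sum(stream))  # <class 'float'> 24997500.0
{code}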



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9924) [Python] Performance regression reading individual Parquet files using Dataset interface

2020-09-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9924:
--
Labels: pull-request-available  (was: )

> [Python] Performance regression reading individual Parquet files using 
> Dataset interface
> 
>
> Key: ARROW-9924
> URL: https://issues.apache.org/jira/browse/ARROW-9924
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I haven't investigated very deeply but this seems symptomatic of a problem:
> {code}
> In [27]: df = pd.DataFrame({'A': np.random.randn(1000)})
> In [28]: pq.write_table(pa.table(df), 'test.parquet')
> In [29]: timeit pq.read_table('test.parquet')
> 79.8 ms ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
> In [30]: timeit pq.read_table('test.parquet', use_legacy_dataset=True)
> 66.4 ms ± 1.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10005) [C++] Add an Append method to the time builders which validates the input range

2020-09-14 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-10005:
---

 Summary: [C++] Add an Append method to the time builders which 
validates the input range
 Key: ARROW-10005
 URL: https://issues.apache.org/jira/browse/ARROW-10005
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Krisztian Szucs


It seems we don't have a method that validates the input value range for 
time types. It would be handy to do this validation after converting from a 
Python object.
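
The check itself is simple; a Python sketch of the bounds involved (the real 
Append method would live in the C++ time builders, and the limits below assume 
one midnight-to-midnight day per unit):

{code:python}
# Hypothetical bounds check for time32/time64 values.
_TIME_LIMITS = {
    ("time32", "s"): 86_400,
    ("time32", "ms"): 86_400_000,
    ("time64", "us"): 86_400_000_000,
    ("time64", "ns"): 86_400_000_000_000,
}

def validate_time_value(kind, unit, value):
    limit = _TIME_LIMITS[(kind, unit)]
    if not 0 <= value < limit:
        raise ValueError(
            f"{kind}[{unit}] value {value} out of range [0, {limit})")
    return value

validate_time_value("time32", "s", 3600)  # ok
validate_time_value("time64", "us", -1)   # raises ValueError
{code}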



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10004) [Python] Consider to raise or normalize if a timezone aware datetime.time object is encountered during conversion

2020-09-14 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-10004:
---

 Summary: [Python] Consider to raise or normalize if a timezone 
aware datetime.time object is encountered during conversion
 Key: ARROW-10004
 URL: https://issues.apache.org/jira/browse/ARROW-10004
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Python
Reporter: Krisztian Szucs


Python datetime.time objects may have timezone information attached, but since 
the time types (time32 and time64) don't have that property in Arrow we simply 
ignore it.

We should either raise an error or normalize to UTC.
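
A minimal sketch of the "normalize to UTC" option, assuming a fixed-offset 
tzinfo (the anchor date and helper name are illustrative):

{code:python}
from datetime import datetime, time, timezone, timedelta

def time_to_utc(t):
    # Shift a timezone-aware datetime.time to UTC by anchoring it on an
    # arbitrary date (utcoffset may require a date for some tzinfos).
    if t.tzinfo is None:
        return t  # naive time: nothing to normalize
    anchored = datetime.combine(datetime(1970, 1, 1).date(), t)
    return anchored.astimezone(timezone.utc).timetz().replace(tzinfo=None)

t = time(12, 30, tzinfo=timezone(timedelta(hours=2)))
print(time_to_utc(t))  # 10:30:00
{code}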



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8394) [JS] Typescript compiler errors for arrow d.ts files, when using es2015-esm package

2020-09-14 Thread Paul Taylor (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17195668#comment-17195668
 ] 

Paul Taylor commented on ARROW-8394:


I've started work on a branch in my fork here[1], but have been occupied the 
last few weeks (work, moving, back injury, etc.). There's not much left to do, 
so I think I should be able to get it finished and PR'd this week.

1. https://github.com/trxcllnt/arrow/tree/typescript-3.9

> [JS] Typescript compiler errors for arrow d.ts files, when using es2015-esm 
> package
> ---
>
> Key: ARROW-8394
> URL: https://issues.apache.org/jira/browse/ARROW-8394
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: JavaScript
>Affects Versions: 0.16.0
>Reporter: Shyamal Shukla
>Priority: Blocker
>
> Attempting to use apache-arrow within a web application, but the TypeScript 
> compiler throws the following errors in some of Arrow's .d.ts files:
> import { Table } from "../node_modules/@apache-arrow/es2015-esm/Arrow";
> export class SomeClass {
> .
> .
> constructor() {
> const t = Table.from('');
> }
> *node_modules/@apache-arrow/es2015-esm/column.d.ts:14:22* - error TS2417: 
> Class static side 'typeof Column' incorrectly extends base class static side 
> 'typeof Chunked'. Types of property 'new' are incompatible.
> *node_modules/@apache-arrow/es2015-esm/ipc/reader.d.ts:238:5* - error TS2717: 
> Subsequent property declarations must have the same type. Property 'schema' 
> must be of type 'Schema', but here has type 'Schema'.
> 238 schema: Schema;
> *node_modules/@apache-arrow/es2015-esm/recordbatch.d.ts:17:18* - error 
> TS2430: Interface 'RecordBatch' incorrectly extends interface 'StructVector'. 
> The types of 'slice(...).clone' are incompatible between these types.
> The tsconfig.json file looks like:
> {
>  "compilerOptions": {
>  "target":"ES6",
>  "outDir": "dist",
>  "baseUrl": "src/"
>  },
>  "exclude": ["dist"],
>  "include": ["src/*.ts"]
> }



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-10003) [C++] Create directories in CopyFiles when copying within the same filesystem

2020-09-14 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-10003:
-

Assignee: Ben Kietzman  (was: Apache Arrow JIRA Bot)

> [C++] Create directories in CopyFiles when copying within the same filesystem
> -
>
> Key: ARROW-10003
> URL: https://issues.apache.org/jira/browse/ARROW-10003
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 1.0.1
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> CopyFiles creates parent directories for destination files, but only when 
> copying between different filesystems. This behavior should be made consistent.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-10003) [C++] Create directories in CopyFiles when copying within the same filesystem

2020-09-14 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-10003:
-

Assignee: Apache Arrow JIRA Bot  (was: Ben Kietzman)

> [C++] Create directories in CopyFiles when copying within the same filesystem
> -
>
> Key: ARROW-10003
> URL: https://issues.apache.org/jira/browse/ARROW-10003
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 1.0.1
>Reporter: Ben Kietzman
>Assignee: Apache Arrow JIRA Bot
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> CopyFiles creates parent directories for destination files, but only when 
> copying between different filesystems. This behavior should be made consistent.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10003) [C++] Create directories in CopyFiles when copying within the same filesystem

2020-09-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10003:
---
Labels: pull-request-available  (was: )

> [C++] Create directories in CopyFiles when copying within the same filesystem
> -
>
> Key: ARROW-10003
> URL: https://issues.apache.org/jira/browse/ARROW-10003
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 1.0.1
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> CopyFiles creates parent directories for destination files, but only when 
> copying between different filesystems. This behavior should be made consistent.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-9775) [C++] Automatic S3 region selection

2020-09-14 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-9775:
-

Assignee: Antoine Pitrou

> [C++] Automatic S3 region selection
> ---
>
> Key: ARROW-9775
> URL: https://issues.apache.org/jira/browse/ARROW-9775
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++, Python
> Environment: macOS, Linux.
>Reporter: Sahil Gupta
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: filesystem
> Fix For: 2.0.0
>
>
> Currently, PyArrow and ArrowCpp need to be provided with the region of the S3 
> file/bucket; otherwise they default to using 'us-east-1'. Ideally, PyArrow and 
> ArrowCpp could automatically detect the region and fetch the files. For 
> instance, s3fs and boto3 can read and write files without having to specify 
> the region explicitly. Similar functionality to auto-detect the region would 
> be great to have in PyArrow and ArrowCpp.
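
For reference, this is roughly how boto3 can discover a bucket's region (a 
sketch; the bucket name is a placeholder):

{code:python}
import boto3

def detect_bucket_region(bucket):
    # LocationConstraint is None/empty for the legacy us-east-1 region.
    resp = boto3.client("s3").get_bucket_location(Bucket=bucket)
    return resp.get("LocationConstraint") or "us-east-1"

print(detect_bucket_region("my-bucket"))  # e.g. "eu-west-1"
{code}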



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10003) [C++] Create directories in CopyFiles when copying within the same filesystem

2020-09-14 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-10003:


 Summary: [C++] Create directories in CopyFiles when copying within 
the same filesystem
 Key: ARROW-10003
 URL: https://issues.apache.org/jira/browse/ARROW-10003
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 1.0.1
Reporter: Ben Kietzman
Assignee: Ben Kietzman
 Fix For: 2.0.0


CopyFiles creates parent directories for destination files, but only when 
copying between different filesystems. This behavior should be made consistent.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9924) [Python] Performance regression reading individual Parquet files using Dataset interface

2020-09-14 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17195647#comment-17195647
 ] 

Wes McKinney commented on ARROW-9924:
-

My principal concern is addressing the performance regressions, which are 
especially grave considering that they affect one of the (if not *the*) 
most-called user-facing APIs in the whole Arrow project. The other questions we 
can investigate as follow-up matters.

> [Python] Performance regression reading individual Parquet files using 
> Dataset interface
> 
>
> Key: ARROW-9924
> URL: https://issues.apache.org/jira/browse/ARROW-9924
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Priority: Critical
> Fix For: 2.0.0
>
>
> I haven't investigated very deeply but this seems symptomatic of a problem:
> {code}
> In [27]: df = pd.DataFrame({'A': np.random.randn(1000)})
> In [28]: pq.write_table(pa.table(df), 'test.parquet')
> In [29]: timeit pq.read_table('test.parquet')
> 79.8 ms ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
> In [30]: timeit pq.read_table('test.parquet', use_legacy_dataset=True)
> 66.4 ms ± 1.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10002) [Rust] Trait-specialization requires nightly

2020-09-14 Thread Kyle Strand (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kyle Strand updated ARROW-10002:

Description: 
Trait specialization is widely used in the Rust Arrow implementation. Uses can 
be identified by searching for instances of {{default fn}} in the codebase:

 
{code:java}
$> rg -c 'default fn' ../arrow/rust/
 ../arrow/rust/parquet/src/util/test_common/rand_gen.rs:1
 ../arrow/rust/parquet/src/column/writer.rs:2
 ../arrow/rust/parquet/src/encodings/encoding.rs:16
 ../arrow/rust/parquet/src/arrow/record_reader.rs:1
 ../arrow/rust/parquet/src/encodings/decoding.rs:13
 ../arrow/rust/parquet/src/file/statistics.rs:1
 ../arrow/rust/arrow/src/array/builder.rs:7
 ../arrow/rust/arrow/src/array/array.rs:3
 ../arrow/rust/arrow/src/array/equal.rs:3{code}
 

This feature requires Nightly Rust. Additionally, there is [no schedule for 
stabilization|https://github.com/rust-lang/rust/issues/31844#issue-135807289] , 
primarily due to an [unresolved soundness 
hole|http://aturon.github.io/blog/2017/07/08/lifetime-dispatch]. (Note: there 
has been further discussion and ideas for resolving the soundness issue, but to 
my knowledge no definitive action.)

If we can remove specialization from the Rust codebase, we will not be blocked 
on the Rust team's stabilization of that feature in order to move to stable 
Rust.

  was:
Trait specialization is widely used in the Rust Arrow implementation. Uses can 
be identified by searching for instances of {{default fn}} in the codebase:

 
{code:java}
$> rg -c 'default fn' ../arrow/rust/
 ../arrow/rust/parquet/src/util/test_common/rand_gen.rs:1
 ../arrow/rust/parquet/src/column/writer.rs:2
 ../arrow/rust/parquet/src/encodings/encoding.rs:16
 ../arrow/rust/parquet/src/arrow/record_reader.rs:1
 ../arrow/rust/parquet/src/encodings/decoding.rs:13
 ../arrow/rust/parquet/src/file/statistics.rs:1
 ../arrow/rust/arrow/src/array/builder.rs:7
 ../arrow/rust/arrow/src/array/array.rs:3
 ../arrow/rust/arrow/src/array/equal.rs:3{code}
 

This feature requires Nightly Rust. Additionally, there is [no schedule for 
stabilization|https://github.com/rust-lang/rust/issues/31844#issue-135807289] , 
primarily due to an [unresolved soundness 
hole|[http://aturon.github.io/blog/2017/07/08/lifetime-dispatch]]. (Note: there 
has been further discussion and ideas for resolving the soundness issue, but to 
my knowledge no definitive action.)

If we can remove specialization from the Rust codebase, we will not be blocked 
on the Rust team's stabilization of that feature in order to move to stable 
Rust.


> [Rust] Trait-specialization requires nightly
> 
>
> Key: ARROW-10002
> URL: https://issues.apache.org/jira/browse/ARROW-10002
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Reporter: Kyle Strand
>Priority: Major
>
> Trait specialization is widely used in the Rust Arrow implementation. Uses 
> can be identified by searching for instances of {{default fn}} in the 
> codebase:
>  
> {code:java}
> $> rg -c 'default fn' ../arrow/rust/
>  ../arrow/rust/parquet/src/util/test_common/rand_gen.rs:1
>  ../arrow/rust/parquet/src/column/writer.rs:2
>  ../arrow/rust/parquet/src/encodings/encoding.rs:16
>  ../arrow/rust/parquet/src/arrow/record_reader.rs:1
>  ../arrow/rust/parquet/src/encodings/decoding.rs:13
>  ../arrow/rust/parquet/src/file/statistics.rs:1
>  ../arrow/rust/arrow/src/array/builder.rs:7
>  ../arrow/rust/arrow/src/array/array.rs:3
>  ../arrow/rust/arrow/src/array/equal.rs:3{code}
>  
> This feature requires Nightly Rust. Additionally, there is [no schedule for 
> stabilization|https://github.com/rust-lang/rust/issues/31844#issue-135807289] 
> , primarily due to an [unresolved soundness 
> hole|http://aturon.github.io/blog/2017/07/08/lifetime-dispatch]. (Note: there 
> has been further discussion and ideas for resolving the soundness issue, but 
> to my knowledge no definitive action.)
> If we can remove specialization from the Rust codebase, we will not be 
> blocked on the Rust team's stabilization of that feature in order to move to 
> stable Rust.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10002) [Rust] Trait-specialization requires nightly

2020-09-14 Thread Kyle Strand (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kyle Strand updated ARROW-10002:

Description: 
Trait specialization is widely used in the Rust Arrow implementation. Uses can 
be identified by searching for instances of {{default fn}} in the codebase:

 
{code:java}
$> rg -c 'default fn' ../arrow/rust/
 ../arrow/rust/parquet/src/util/test_common/rand_gen.rs:1
 ../arrow/rust/parquet/src/column/writer.rs:2
 ../arrow/rust/parquet/src/encodings/encoding.rs:16
 ../arrow/rust/parquet/src/arrow/record_reader.rs:1
 ../arrow/rust/parquet/src/encodings/decoding.rs:13
 ../arrow/rust/parquet/src/file/statistics.rs:1
 ../arrow/rust/arrow/src/array/builder.rs:7
 ../arrow/rust/arrow/src/array/array.rs:3
 ../arrow/rust/arrow/src/array/equal.rs:3{code}
 

This feature requires Nightly Rust. Additionally, there is [no schedule for 
stabilization|https://github.com/rust-lang/rust/issues/31844#issue-135807289] , 
primarily due to an [unresolved soundness 
hole|[http://aturon.github.io/blog/2017/07/08/lifetime-dispatch]|http://aturon.github.io/blog/2017/07/08/lifetime-dispatch].
 (Note: there has been further discussion and ideas for resolving the soundness 
issue, but to my knowledge no definitive action.)

If we can remove specialization from the Rust codebase, we will not be blocked 
on the Rust team's stabilization of that feature in order to move to stable 
Rust.

  was:
Trait specialization is widely used in the Rust Arrow implementation. Uses can 
be identified by searching for instances of `default fn` in the codebase:

 
{code:java}
$> rg -c 'default fn' ../arrow/rust/
 ../arrow/rust/parquet/src/util/test_common/rand_gen.rs:1
 ../arrow/rust/parquet/src/column/writer.rs:2
 ../arrow/rust/parquet/src/encodings/encoding.rs:16
 ../arrow/rust/parquet/src/arrow/record_reader.rs:1
 ../arrow/rust/parquet/src/encodings/decoding.rs:13
 ../arrow/rust/parquet/src/file/statistics.rs:1
 ../arrow/rust/arrow/src/array/builder.rs:7
 ../arrow/rust/arrow/src/array/array.rs:3
 ../arrow/rust/arrow/src/array/equal.rs:3{code}
 

This feature requires Nightly Rust. Additionally, there is [no schedule for 
stabilization|https://github.com/rust-lang/rust/issues/31844#issue-135807289] , 
primarily due to an [unresolved soundness 
hole|[http://aturon.github.io/blog/2017/07/08/lifetime-dispatch]. (Note: there 
has been further discussion and ideas for resolving the soundness issue, but to 
my knowledge no definitive 
action.)|http://aturon.github.io/tech/2017/07/08/lifetime-dispatch/].]

If we can remove specialization from the Rust codebase, we will not be blocked 
on the Rust team's stabilization of that feature in order to move to stable 
Rust.


> [Rust] Trait-specialization requires nightly
> 
>
> Key: ARROW-10002
> URL: https://issues.apache.org/jira/browse/ARROW-10002
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Reporter: Kyle Strand
>Priority: Major
>
> Trait specialization is widely used in the Rust Arrow implementation. Uses 
> can be identified by searching for instances of {{default fn}} in the 
> codebase:
>  
> {code:java}
> $> rg -c 'default fn' ../arrow/rust/
>  ../arrow/rust/parquet/src/util/test_common/rand_gen.rs:1
>  ../arrow/rust/parquet/src/column/writer.rs:2
>  ../arrow/rust/parquet/src/encodings/encoding.rs:16
>  ../arrow/rust/parquet/src/arrow/record_reader.rs:1
>  ../arrow/rust/parquet/src/encodings/decoding.rs:13
>  ../arrow/rust/parquet/src/file/statistics.rs:1
>  ../arrow/rust/arrow/src/array/builder.rs:7
>  ../arrow/rust/arrow/src/array/array.rs:3
>  ../arrow/rust/arrow/src/array/equal.rs:3{code}
>  
> This feature requires Nightly Rust. Additionally, there is [no schedule for 
> stabilization|https://github.com/rust-lang/rust/issues/31844#issue-135807289] 
> , primarily due to an [unresolved soundness 
> hole|[http://aturon.github.io/blog/2017/07/08/lifetime-dispatch]|http://aturon.github.io/blog/2017/07/08/lifetime-dispatch].
>  (Note: there has been further discussion and ideas for resolving the 
> soundness issue, but to my knowledge no definitive action.)
> If we can remove specialization from the Rust codebase, we will not be 
> blocked on the Rust team's stabilization of that feature in order to move to 
> stable Rust.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10002) [Rust] Trait-specialization requires nightly

2020-09-14 Thread Kyle Strand (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kyle Strand updated ARROW-10002:

Description: 
Trait specialization is widely used in the Rust Arrow implementation. Uses can 
be identified by searching for instances of {{default fn}} in the codebase:

 
{code:java}
$> rg -c 'default fn' ../arrow/rust/
 ../arrow/rust/parquet/src/util/test_common/rand_gen.rs:1
 ../arrow/rust/parquet/src/column/writer.rs:2
 ../arrow/rust/parquet/src/encodings/encoding.rs:16
 ../arrow/rust/parquet/src/arrow/record_reader.rs:1
 ../arrow/rust/parquet/src/encodings/decoding.rs:13
 ../arrow/rust/parquet/src/file/statistics.rs:1
 ../arrow/rust/arrow/src/array/builder.rs:7
 ../arrow/rust/arrow/src/array/array.rs:3
 ../arrow/rust/arrow/src/array/equal.rs:3{code}
 

This feature requires Nightly Rust. Additionally, there is [no schedule for 
stabilization|https://github.com/rust-lang/rust/issues/31844#issue-135807289] , 
primarily due to an [unresolved soundness 
hole|[http://aturon.github.io/blog/2017/07/08/lifetime-dispatch]]. (Note: there 
has been further discussion and ideas for resolving the soundness issue, but to 
my knowledge no definitive action.)

If we can remove specialization from the Rust codebase, we will not be blocked 
on the Rust team's stabilization of that feature in order to move to stable 
Rust.

  was:
Trait specialization is widely used in the Rust Arrow implementation. Uses can 
be identified by searching for instances of {{default fn}} in the codebase:

 
{code:java}
$> rg -c 'default fn' ../arrow/rust/
 ../arrow/rust/parquet/src/util/test_common/rand_gen.rs:1
 ../arrow/rust/parquet/src/column/writer.rs:2
 ../arrow/rust/parquet/src/encodings/encoding.rs:16
 ../arrow/rust/parquet/src/arrow/record_reader.rs:1
 ../arrow/rust/parquet/src/encodings/decoding.rs:13
 ../arrow/rust/parquet/src/file/statistics.rs:1
 ../arrow/rust/arrow/src/array/builder.rs:7
 ../arrow/rust/arrow/src/array/array.rs:3
 ../arrow/rust/arrow/src/array/equal.rs:3{code}
 

This feature requires Nightly Rust. Additionally, there is [no schedule for 
stabilization|https://github.com/rust-lang/rust/issues/31844#issue-135807289] , 
primarily due to an [unresolved soundness 
hole|[http://aturon.github.io/blog/2017/07/08/lifetime-dispatch]|http://aturon.github.io/blog/2017/07/08/lifetime-dispatch].
 (Note: there has been further discussion and ideas for resolving the soundness 
issue, but to my knowledge no definitive action.)

If we can remove specialization from the Rust codebase, we will not be blocked 
on the Rust team's stabilization of that feature in order to move to stable 
Rust.


> [Rust] Trait-specialization requires nightly
> 
>
> Key: ARROW-10002
> URL: https://issues.apache.org/jira/browse/ARROW-10002
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Reporter: Kyle Strand
>Priority: Major
>
> Trait specialization is widely used in the Rust Arrow implementation. Uses 
> can be identified by searching for instances of {{default fn}} in the 
> codebase:
>  
> {code:java}
> $> rg -c 'default fn' ../arrow/rust/
>  ../arrow/rust/parquet/src/util/test_common/rand_gen.rs:1
>  ../arrow/rust/parquet/src/column/writer.rs:2
>  ../arrow/rust/parquet/src/encodings/encoding.rs:16
>  ../arrow/rust/parquet/src/arrow/record_reader.rs:1
>  ../arrow/rust/parquet/src/encodings/decoding.rs:13
>  ../arrow/rust/parquet/src/file/statistics.rs:1
>  ../arrow/rust/arrow/src/array/builder.rs:7
>  ../arrow/rust/arrow/src/array/array.rs:3
>  ../arrow/rust/arrow/src/array/equal.rs:3{code}
>  
> This feature requires Nightly Rust. Additionally, there is [no schedule for 
> stabilization|https://github.com/rust-lang/rust/issues/31844#issue-135807289] 
> , primarily due to an [unresolved soundness 
> hole|[http://aturon.github.io/blog/2017/07/08/lifetime-dispatch]]. (Note: 
> there has been further discussion and ideas for resolving the soundness 
> issue, but to my knowledge no definitive action.)
> If we can remove specialization from the Rust codebase, we will not be 
> blocked on the Rust team's stabilization of that feature in order to move to 
> stable Rust.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10002) [Rust] Trait-specialization requires nightly

2020-09-14 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17195638#comment-17195638
 ] 

Andy Grove commented on ARROW-10002:


Thanks [~batmanaod], this looks really interesting.

[~paddyhoran] [~nevime] [~sunchao] [~alamb] [~jorgecarleitao] [~jhorstmann] 
will likely be interested in this.

> [Rust] Trait-specialization requires nightly
> 
>
> Key: ARROW-10002
> URL: https://issues.apache.org/jira/browse/ARROW-10002
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Reporter: Kyle Strand
>Priority: Major
>
> Trait specialization is widely used in the Rust Arrow implementation. Uses 
> can be identified by searching for instances of `default fn` in the codebase:
>  
> {code:java}
> $> rg -c 'default fn' ../arrow/rust/
>  ../arrow/rust/parquet/src/util/test_common/rand_gen.rs:1
>  ../arrow/rust/parquet/src/column/writer.rs:2
>  ../arrow/rust/parquet/src/encodings/encoding.rs:16
>  ../arrow/rust/parquet/src/arrow/record_reader.rs:1
>  ../arrow/rust/parquet/src/encodings/decoding.rs:13
>  ../arrow/rust/parquet/src/file/statistics.rs:1
>  ../arrow/rust/arrow/src/array/builder.rs:7
>  ../arrow/rust/arrow/src/array/array.rs:3
>  ../arrow/rust/arrow/src/array/equal.rs:3{code}
>  
> This feature requires Nightly Rust. Additionally, there is [no schedule for 
> stabilization|https://github.com/rust-lang/rust/issues/31844#issue-135807289] 
> , primarily due to an [unresolved soundness 
> hole|[http://aturon.github.io/blog/2017/07/08/lifetime-dispatch]. (Note: 
> there has been further discussion and ideas for resolving the soundness 
> issue, but to my knowledge no definitive 
> action.)|http://aturon.github.io/tech/2017/07/08/lifetime-dispatch/].]
> If we can remove specialization from the Rust codebase, we will not be 
> blocked on the Rust team's stabilization of that feature in order to move to 
> stable Rust.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-9859) [C++] S3 FileSystemFromUri with special char in secret key fails

2020-09-14 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-9859.
---
Resolution: Fixed

Issue resolved by pull request 8185
[https://github.com/apache/arrow/pull/8185]

> [C++] S3 FileSystemFromUri with special char in secret key fails
> 
>
> Key: ARROW-9859
> URL: https://issues.apache.org/jira/browse/ARROW-9859
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Documentation, Python
>Reporter: Neal Richardson
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> S3 Secret access keys can contain special characters like {{/}}. When they do
> 1) FileSystemFromUri will fail to parse the URI unless you URL-encode them 
> (e.g. replace / with %2F)
> 2) When you do escape the special characters, requests that require 
> authorization fail with the message "The request signature we calculated does 
> not match the signature you provided. Check your key and signing method." 
> This may suggest that there's some extra URL encoding/decoding that needs to 
> happen inside.
> I was only able to work around this by generating a new access key that 
> happened not to have special characters.
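
For reference, the URL-encoding step described in (1) looks like this (bucket 
and key values are placeholders; per the report above, the escaped URI still 
failed authorization before this fix):

{code:python}
from urllib.parse import quote

access_key = "AKIAXXXXXXXX"   # placeholder
secret_key = "abc/def+ghi"    # contains '/' and '+'

# Percent-encode both credentials so the URI parses unambiguously.
uri = (f"s3://{quote(access_key, safe='')}:{quote(secret_key, safe='')}"
       f"@my-bucket/path")
print(uri)  # s3://AKIAXXXXXXXX:abc%2Fdef%2Bghi@my-bucket/path
{code}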



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9924) [Python] Performance regression reading individual Parquet files using Dataset interface

2020-09-14 Thread Ben Kietzman (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17195633#comment-17195633
 ] 

Ben Kietzman commented on ARROW-9924:
-

{quote}
Looking at the top of the hierarchical perf report for the "new" code, the 
deeply nested layers of iterators strikes me as one thing to think more about 
whether that's the design we want
{quote}

To be clear, is the concern over clarity or performance? IIUC 
[https://gist.github.com/wesm/3e3eeb6b7f5f22650f18e69e206c2eb8#file-gistfile1-txt-L8-L20]
 represents minimal cost since 0.65% of runtime was spent managing the Iterator 
abstraction. If we wanted to replace our abstraction for lazy sequences we 
could potentially refactor to a {{Future}}-based iteration. Did you have a 
replacement in mind?

{quote}
why ProjectRecordBatch and FilterRecordBatch being used? Nothing is being 
projected nor filtered
{quote}

We don't explicitly elide them when the projection or filter is trivial. I 
could try to benchmark whether there is a significant performance benefit to 
adding a special case for trivial projection/filtering, but I'd guess we don't 
gain anything.

Another potential band-aid fix would be to allow column-level parallelism when 
scanning a single file (since no thread contention would be incurred), combined 
with an increased batch size.
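
To make the "special case for trivial projection/filtering" discussed above 
concrete, a plain-Python sketch (the real code paths are the C++ 
ProjectRecordBatch/FilterRecordBatch; the dict-of-lists batch model is an 
illustrative stand-in):

{code:python}
def scan_batch(batch, predicate=None, columns=None):
    # Elide the filter step entirely when no predicate is given.
    if predicate is not None:
        n = len(next(iter(batch.values())))
        keep = [i for i in range(n) if predicate(i, batch)]
        batch = {name: [col[i] for i in keep] for name, col in batch.items()}
    # Elide the projection step entirely when no column subset is given.
    if columns is not None:
        batch = {name: batch[name] for name in columns}
    return batch

b = {"A": [1, 2, 3], "B": [4, 5, 6]}
print(scan_batch(b))  # trivial case: returned untouched
print(scan_batch(b, predicate=lambda i, bt: bt["A"][i] > 1, columns=["A"]))
{code}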

> [Python] Performance regression reading individual Parquet files using 
> Dataset interface
> 
>
> Key: ARROW-9924
> URL: https://issues.apache.org/jira/browse/ARROW-9924
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Priority: Critical
> Fix For: 2.0.0
>
>
> I haven't investigated very deeply but this seems symptomatic of a problem:
> {code}
> In [27]: df = pd.DataFrame({'A': np.random.randn(1000)})
> In [28]: pq.write_table(pa.table(df), 'test.parquet')
> In [29]: timeit pq.read_table('test.parquet')
> 79.8 ms ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
> In [30]: timeit pq.read_table('test.parquet', use_legacy_dataset=True)
> 66.4 ms ± 1.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-9465) [Python] Improve ergonomics of compute functions

2020-09-14 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-9465.
---
Resolution: Fixed

Issue resolved by pull request 8163
[https://github.com/apache/arrow/pull/8163]

> [Python] Improve ergonomics of compute functions
> 
>
> Key: ARROW-9465
> URL: https://issues.apache.org/jira/browse/ARROW-9465
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Python
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Introspection of exported compute functions currently yield suboptimal output:
> {code:python}
> >>> from pyarrow import compute as pc
> >>> pc.list_flatten
> <function pyarrow.compute.func(arg)>
> >>> ?pc.list_flatten
> Signature: pc.list_flatten(arg)
> Docstring: 
> File:  ~/arrow/dev/python/pyarrow/compute.py
> Type:  function
> >>> help(pc.list_flatten)
> Help on function func in module pyarrow.compute:
> func(arg)
> {code}
> The function should ideally have:
> * the right global name
> * an appropriate signature
> * a docstring
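
For illustration, one way a generated wrapper can carry all three (a sketch; 
pyarrow's actual mechanism may differ):

{code:python}
import inspect

def make_compute_wrapper(name, doc, underlying):
    def func(arg):
        return underlying(arg)
    # Attach the metadata that introspection tools look for.
    func.__name__ = func.__qualname__ = name
    func.__doc__ = doc
    func.__signature__ = inspect.Signature(
        [inspect.Parameter("arg", inspect.Parameter.POSITIONAL_OR_KEYWORD)])
    return func

list_flatten = make_compute_wrapper(
    "list_flatten", "Flatten list values in the input.", lambda a: a)
help(list_flatten)  # now shows the right name, signature and docstring
{code}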



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10002) [Rust] Trait-specialization requires nightly

2020-09-14 Thread Kyle Strand (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17195629#comment-17195629
 ] 

Kyle Strand commented on ARROW-10002:
-

I have put together a repository with a minimal example of one way in which 
we're using specialization in the `array` module: 
[https://github.com/BatmanAoD/arrow-rust-specialization-alternatives]

The {{master}} branch shows how the code is written currently. This pull 
request shows how we could avoid specialization by introducing an "indexing" 
method associated with each primitive type: 
https://github.com/BatmanAoD/arrow-rust-specialization-alternatives/pull/1

> [Rust] Trait-specialization requires nightly
> 
>
> Key: ARROW-10002
> URL: https://issues.apache.org/jira/browse/ARROW-10002
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Reporter: Kyle Strand
>Priority: Major
>
> Trait specialization is widely used in the Rust Arrow implementation. Uses 
> can be identified by searching for instances of `default fn` in the codebase:
>  
> {code:java}
> $> rg -c 'default fn' ../arrow/rust/
>  ../arrow/rust/parquet/src/util/test_common/rand_gen.rs:1
>  ../arrow/rust/parquet/src/column/writer.rs:2
>  ../arrow/rust/parquet/src/encodings/encoding.rs:16
>  ../arrow/rust/parquet/src/arrow/record_reader.rs:1
>  ../arrow/rust/parquet/src/encodings/decoding.rs:13
>  ../arrow/rust/parquet/src/file/statistics.rs:1
>  ../arrow/rust/arrow/src/array/builder.rs:7
>  ../arrow/rust/arrow/src/array/array.rs:3
>  ../arrow/rust/arrow/src/array/equal.rs:3{code}
>  
> This feature requires Nightly Rust. Additionally, there is [no schedule for 
> stabilization|https://github.com/rust-lang/rust/issues/31844#issue-135807289] 
> , primarily due to an [unresolved soundness 
> hole|[http://aturon.github.io/blog/2017/07/08/lifetime-dispatch]. (Note: 
> there has been further discussion and ideas for resolving the soundness 
> issue, but to my knowledge no definitive 
> action.)|http://aturon.github.io/tech/2017/07/08/lifetime-dispatch/].]
> If we can remove specialization from the Rust codebase, we will not be 
> blocked on the Rust team's stabilization of that feature in order to move to 
> stable Rust.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10002) [Rust] Trait-specialization requires nightly

2020-09-14 Thread Kyle Strand (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kyle Strand updated ARROW-10002:

Description: 
Trait specialization is widely used in the Rust Arrow implementation. Uses can 
be identified by searching for instances of `default fn` in the codebase:

 
{code:java}
$> rg -c 'default fn' ../arrow/rust/
 ../arrow/rust/parquet/src/util/test_common/rand_gen.rs:1
 ../arrow/rust/parquet/src/column/writer.rs:2
 ../arrow/rust/parquet/src/encodings/encoding.rs:16
 ../arrow/rust/parquet/src/arrow/record_reader.rs:1
 ../arrow/rust/parquet/src/encodings/decoding.rs:13
 ../arrow/rust/parquet/src/file/statistics.rs:1
 ../arrow/rust/arrow/src/array/builder.rs:7
 ../arrow/rust/arrow/src/array/array.rs:3
 ../arrow/rust/arrow/src/array/equal.rs:3{code}
 

This feature requires Nightly Rust. Additionally, there is [no schedule for 
stabilization|https://github.com/rust-lang/rust/issues/31844#issue-135807289] , 
primarily due to an [unresolved soundness 
hole|[http://aturon.github.io/blog/2017/07/08/lifetime-dispatch]. (Note: there 
has been further discussion and ideas for resolving the soundness issue, but to 
my knowledge no definitive 
action.)|http://aturon.github.io/tech/2017/07/08/lifetime-dispatch/].]

If we can remove specialization from the Rust codebase, we will not be blocked 
on the Rust team's stabilization of that feature in order to move to stable 
Rust.

  was:
Trait specialization is widely used in the Rust Arrow implementation. Uses can 
be identified by searching for instances of `default fn` in the codebase:


{{ }}
{code:java}
$> rg -c 'default fn' ../arrow/rust/
 ../arrow/rust/parquet/src/util/test_common/rand_gen.rs:1
 ../arrow/rust/parquet/src/column/writer.rs:2
 ../arrow/rust/parquet/src/encodings/encoding.rs:16
 ../arrow/rust/parquet/src/arrow/record_reader.rs:1
 ../arrow/rust/parquet/src/encodings/decoding.rs:13
 ../arrow/rust/parquet/src/file/statistics.rs:1
 ../arrow/rust/arrow/src/array/builder.rs:7
 ../arrow/rust/arrow/src/array/array.rs:3
 ../arrow/rust/arrow/src/array/equal.rs:3{code}
 

This feature requires Nightly Rust. Additionally, there is [no schedule for 
stabilization|https://github.com/rust-lang/rust/issues/31844#issue-135807289] , 
primarily due to an [unresolved soundness 
hole|[http://aturon.github.io/blog/2017/07/08/lifetime-dispatch]. (Note: there 
has been further discussion and ideas for resolving the soundness issue, but to 
my knowledge no definitive 
action.)|http://aturon.github.io/tech/2017/07/08/lifetime-dispatch/].]

If we can remove specialization from the Rust codebase, we will not be blocked 
on the Rust team's stabilization of that feature in order to move to stable 
Rust.


> [Rust] Trait-specialization requires nightly
> 
>
> Key: ARROW-10002
> URL: https://issues.apache.org/jira/browse/ARROW-10002
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Reporter: Kyle Strand
>Priority: Major
>
> Trait specialization is widely used in the Rust Arrow implementation. Uses 
> can be identified by searching for instances of `default fn` in the codebase:
>  
> {code:java}
> $> rg -c 'default fn' ../arrow/rust/
>  ../arrow/rust/parquet/src/util/test_common/rand_gen.rs:1
>  ../arrow/rust/parquet/src/column/writer.rs:2
>  ../arrow/rust/parquet/src/encodings/encoding.rs:16
>  ../arrow/rust/parquet/src/arrow/record_reader.rs:1
>  ../arrow/rust/parquet/src/encodings/decoding.rs:13
>  ../arrow/rust/parquet/src/file/statistics.rs:1
>  ../arrow/rust/arrow/src/array/builder.rs:7
>  ../arrow/rust/arrow/src/array/array.rs:3
>  ../arrow/rust/arrow/src/array/equal.rs:3{code}
>  
> This feature requires Nightly Rust. Additionally, there is [no schedule for 
> stabilization|https://github.com/rust-lang/rust/issues/31844#issue-135807289] 
> , primarily due to an [unresolved soundness 
> hole|[http://aturon.github.io/blog/2017/07/08/lifetime-dispatch]. (Note: 
> there has been further discussion and ideas for resolving the soundness 
> issue, but to my knowledge no definitive 
> action.)|http://aturon.github.io/tech/2017/07/08/lifetime-dispatch/].]
> If we can remove specialization from the Rust codebase, we will not be 
> blocked on the Rust team's stabilization of that feature in order to move to 
> stable Rust.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10002) [Rust] Trait-specialization requires nightly

2020-09-14 Thread Kyle Strand (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kyle Strand updated ARROW-10002:

Description: 
Trait specialization is widely used in the Rust Arrow implementation. Uses can 
be identified by searching for instances of `default fn` in the codebase:


{{ }}
{code:java}
$> rg -c 'default fn' ../arrow/rust/
 ../arrow/rust/parquet/src/util/test_common/rand_gen.rs:1
 ../arrow/rust/parquet/src/column/writer.rs:2
 ../arrow/rust/parquet/src/encodings/encoding.rs:16
 ../arrow/rust/parquet/src/arrow/record_reader.rs:1
 ../arrow/rust/parquet/src/encodings/decoding.rs:13
 ../arrow/rust/parquet/src/file/statistics.rs:1
 ../arrow/rust/arrow/src/array/builder.rs:7
 ../arrow/rust/arrow/src/array/array.rs:3
 ../arrow/rust/arrow/src/array/equal.rs:3{code}
 

This feature requires Nightly Rust. Additionally, there is [no schedule for 
stabilization|https://github.com/rust-lang/rust/issues/31844#issue-135807289] , 
primarily due to an [unresolved soundness 
hole|[http://aturon.github.io/blog/2017/07/08/lifetime-dispatch]. (Note: there 
has been further discussion and ideas for resolving the soundness issue, but to 
my knowledge no definitive 
action.)|http://aturon.github.io/tech/2017/07/08/lifetime-dispatch/].]

If we can remove specialization from the Rust codebase, we will not be blocked 
on the Rust team's stabilization of that feature in order to move to stable 
Rust.

  was:
Trait specialization is widely used in the Rust Arrow implementation. Uses can 
be identified by searching for instances of `default fn` in the codebase:


{{ $> rg -c 'default fn' ../arrow/rust/}}
{{ ../arrow/rust/parquet/src/util/test_common/rand_gen.rs:1}}
{{ ../arrow/rust/parquet/src/column/writer.rs:2}}
{{ ../arrow/rust/parquet/src/encodings/encoding.rs:16}}
{{ ../arrow/rust/parquet/src/arrow/record_reader.rs:1}}
{{ ../arrow/rust/parquet/src/encodings/decoding.rs:13}}
{{ ../arrow/rust/parquet/src/file/statistics.rs:1}}
{{ ../arrow/rust/arrow/src/array/builder.rs:7}}
{{ ../arrow/rust/arrow/src/array/array.rs:3}}
{{ ../arrow/rust/arrow/src/array/equal.rs:3}}

This feature requires Nightly Rust. Additionally, there is [no schedule for 
stabilization|https://github.com/rust-lang/rust/issues/31844#issue-135807289] , 
primarily due to an [unresolved soundness 
hole|[http://aturon.github.io/blog/2017/07/08/lifetime-dispatch]. (Note: there 
has been further discussion and ideas for resolving the soundness issue, but to 
my knowledge no definitive 
action.)|http://aturon.github.io/tech/2017/07/08/lifetime-dispatch/].]

If we can remove specialization from the Rust codebase, we will not be blocked 
on the Rust team's stabilization of that feature in order to move to stable 
Rust.


> [Rust] Trait-specialization requires nightly
> 
>
> Key: ARROW-10002
> URL: https://issues.apache.org/jira/browse/ARROW-10002
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Reporter: Kyle Strand
>Priority: Major
>
> Trait specialization is widely used in the Rust Arrow implementation. Uses 
> can be identified by searching for instances of `default fn` in the codebase:
> {code:java}
> $> rg -c 'default fn' ../arrow/rust/
>  ../arrow/rust/parquet/src/util/test_common/rand_gen.rs:1
>  ../arrow/rust/parquet/src/column/writer.rs:2
>  ../arrow/rust/parquet/src/encodings/encoding.rs:16
>  ../arrow/rust/parquet/src/arrow/record_reader.rs:1
>  ../arrow/rust/parquet/src/encodings/decoding.rs:13
>  ../arrow/rust/parquet/src/file/statistics.rs:1
>  ../arrow/rust/arrow/src/array/builder.rs:7
>  ../arrow/rust/arrow/src/array/array.rs:3
>  ../arrow/rust/arrow/src/array/equal.rs:3{code}
>  
> This feature requires Nightly Rust. Additionally, there is [no schedule for 
> stabilization|https://github.com/rust-lang/rust/issues/31844#issue-135807289], 
> primarily due to an [unresolved soundness 
> hole|http://aturon.github.io/blog/2017/07/08/lifetime-dispatch]. (Note: there 
> has been further discussion and ideas for resolving the soundness issue, but 
> to my knowledge no definitive action.)
> If we can remove specialization from the Rust codebase, we will not be 
> blocked on the Rust team's stabilization of that feature in order to move to 
> stable Rust.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10002) [Rust] Trait-specialization requires nightly

2020-09-14 Thread Kyle Strand (Jira)
Kyle Strand created ARROW-10002:
---

 Summary: [Rust] Trait-specialization requires nightly
 Key: ARROW-10002
 URL: https://issues.apache.org/jira/browse/ARROW-10002
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Reporter: Kyle Strand


Trait specialization is widely used in the Rust Arrow implementation. Uses can 
be identified by searching for instances of `default fn` in the codebase:

```
$> rg -c 'default fn' ../arrow/rust/
../arrow/rust/parquet/src/util/test_common/rand_gen.rs:1
../arrow/rust/parquet/src/column/writer.rs:2
../arrow/rust/parquet/src/encodings/encoding.rs:16
../arrow/rust/parquet/src/arrow/record_reader.rs:1
../arrow/rust/parquet/src/encodings/decoding.rs:13
../arrow/rust/parquet/src/file/statistics.rs:1
../arrow/rust/arrow/src/array/builder.rs:7
../arrow/rust/arrow/src/array/array.rs:3
../arrow/rust/arrow/src/array/equal.rs:3
```

This feature requires Nightly Rust. Additionally, there is [no schedule for 
stabilization|https://github.com/rust-lang/rust/issues/31844#issue-135807289], 
primarily due to an [unresolved soundness 
hole|http://aturon.github.io/blog/2017/07/08/lifetime-dispatch]. (Note: there 
has been further discussion and ideas for resolving the soundness issue, but to 
my knowledge no definitive action.)

If we can remove specialization from the Rust codebase, we will not be blocked 
on the Rust team's stabilization of that feature in order to move to stable 
Rust.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-8394) [JS] Typescript compiler errors for arrow d.ts files, when using es2015-esm package

2020-09-14 Thread Tim Conkling (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17195577#comment-17195577
 ] 

Tim Conkling edited comment on ARROW-8394 at 9/14/20, 4:35 PM:
---

This is intended with all respect - this is a complex project, and I appreciate 
the work being done on it! - but I'm surprised by this response.

[~wesm], if nobody is looking at this issue, does that mean that the JavaScript 
library is not a priority (or not being maintained anymore)?

(As a user of the project, I'm trying to calibrate my expectations for its 
future. And as a developer on other open source projects, I recognize that it 
can be supremely frustrating when others feel entitled to ongoing free support 
- that's not my intent! :))


was (Author: timconkling):
This is intended with all respect - this is a complex project, and I appreciate 
the work being done on it! - but I'm surprised by this response.

[~wesm], if nobody is looking at this issue, does that mean that the JavaScript 
library is not a priority (or not being maintained anymore)?

(As a user of the project, I'm trying to gauge my expectations for the project. 
And as a developer on other open source projects, I recognize that it can be 
supremely frustrating when others feel entitled to ongoing free support - 
that's not my intent! :))

> [JS] Typescript compiler errors for arrow d.ts files, when using es2015-esm 
> package
> ---
>
> Key: ARROW-8394
> URL: https://issues.apache.org/jira/browse/ARROW-8394
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: JavaScript
>Affects Versions: 0.16.0
>Reporter: Shyamal Shukla
>Priority: Blocker
>
> Attempting to use apache-arrow within a web application, but typescript 
> compiler throws the following errors in some of arrow's .d.ts files
> import \{ Table } from "../node_modules/@apache-arrow/es2015-esm/Arrow";
> export class SomeClass {
> .
> .
> constructor() {
> const t = Table.from('');
> }
> *node_modules/@apache-arrow/es2015-esm/column.d.ts:14:22* - error TS2417: 
> Class static side 'typeof Column' incorrectly extends base class static side 
> 'typeof Chunked'. Types of property 'new' are incompatible.
> *node_modules/@apache-arrow/es2015-esm/ipc/reader.d.ts:238:5* - error TS2717: 
> Subsequent property declarations must have the same type. Property 'schema' 
> must be of type 'Schema', but here has type 'Schema'.
> 238 schema: Schema;
> *node_modules/@apache-arrow/es2015-esm/recordbatch.d.ts:17:18* - error 
> TS2430: Interface 'RecordBatch' incorrectly extends interface 'StructVector'. 
> The types of 'slice(...).clone' are incompatible between these types.
> the tsconfig.json file looks like
> {
>  "compilerOptions": {
>  "target":"ES6",
>  "outDir": "dist",
>  "baseUrl": "src/"
>  },
>  "exclude": ["dist"],
>  "include": ["src/*.ts"]
> }



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8394) [JS] Typescript compiler errors for arrow d.ts files, when using es2015-esm package

2020-09-14 Thread Tim Conkling (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17195577#comment-17195577
 ] 

Tim Conkling commented on ARROW-8394:
-

This is intended with all respect - this is a complex project, and I appreciate 
the work being done on it! - but I'm surprised by this response.

[~wesm], if nobody is looking at this issue, does that mean that the JavaScript 
library is not a priority (or not being maintained anymore)?

(As a user of the project, I'm trying to gauge my expectations for the project. 
And as a developer on other open source projects, I recognize that it can be 
supremely frustrating when others feel entitled to ongoing free support - 
that's not my intent! :))

> [JS] Typescript compiler errors for arrow d.ts files, when using es2015-esm 
> package
> ---
>
> Key: ARROW-8394
> URL: https://issues.apache.org/jira/browse/ARROW-8394
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: JavaScript
>Affects Versions: 0.16.0
>Reporter: Shyamal Shukla
>Priority: Blocker
>
> Attempting to use apache-arrow within a web application, but typescript 
> compiler throws the following errors in some of arrow's .d.ts files
> import \{ Table } from "../node_modules/@apache-arrow/es2015-esm/Arrow";
> export class SomeClass {
> .
> .
> constructor() {
> const t = Table.from('');
> }
> *node_modules/@apache-arrow/es2015-esm/column.d.ts:14:22* - error TS2417: 
> Class static side 'typeof Column' incorrectly extends base class static side 
> 'typeof Chunked'. Types of property 'new' are incompatible.
> *node_modules/@apache-arrow/es2015-esm/ipc/reader.d.ts:238:5* - error TS2717: 
> Subsequent property declarations must have the same type. Property 'schema' 
> must be of type 'Schema', but here has type 'Schema'.
> 238 schema: Schema;
> *node_modules/@apache-arrow/es2015-esm/recordbatch.d.ts:17:18* - error 
> TS2430: Interface 'RecordBatch' incorrectly extends interface 'StructVector'. 
> The types of 'slice(...).clone' are incompatible between these types.
> the tsconfig.json file looks like
> {
>  "compilerOptions": {
>  "target":"ES6",
>  "outDir": "dist",
>  "baseUrl": "src/"
>  },
>  "exclude": ["dist"],
>  "include": ["src/*.ts"]
> }



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10001) [Rust] [DataFusion] Add developer guide to README

2020-09-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10001:
---
Labels: pull-request-available  (was: )

> [Rust] [DataFusion] Add developer guide to README
> -
>
> Key: ARROW-10001
> URL: https://issues.apache.org/jira/browse/ARROW-10001
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Jorge
>Assignee: Jorge
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-10001) [Rust] [DataFusion] Add developer guide to README

2020-09-14 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-10001:
-

Assignee: Apache Arrow JIRA Bot  (was: Jorge)

> [Rust] [DataFusion] Add developer guide to README
> -
>
> Key: ARROW-10001
> URL: https://issues.apache.org/jira/browse/ARROW-10001
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Jorge
>Assignee: Apache Arrow JIRA Bot
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-10001) [Rust] [DataFusion] Add developer guide to README

2020-09-14 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-10001:
-

Assignee: Jorge  (was: Apache Arrow JIRA Bot)

> [Rust] [DataFusion] Add developer guide to README
> -
>
> Key: ARROW-10001
> URL: https://issues.apache.org/jira/browse/ARROW-10001
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Jorge
>Assignee: Jorge
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-2651) [Python] Build & Test with PyPy

2020-09-14 Thread Niklas B (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-2651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17195571#comment-17195571
 ] 

Niklas B edited comment on ARROW-2651 at 9/14/20, 4:29 PM:
---

Besides GetContiguous we (and by "we" I mean Matti) needed to patch a few 
datetime-related things: 
[https://gist.github.com/mattip/c9c8398b58721ae5893dc8134c353f28]

A build that works with the patch is available at 
[https://github.com/bivald/pyarrow-on-pypy3/tree/feature/latest-pypy-latest-pyarrow]

As for the test suite, I had to disable the IO, misc and memory tests since 
they gave segfaults. 

pytest pyarrow --ignore-glob='*test_io.py' --ignore-glob='*test_misc.py' 
--ignore-glob='*test_memory.py'

Gave:

33 failed, 2620 passed, 532 skipped, 13 xfailed, 10 warnings in 104.02s

 

==
 short test summary info 
==
 FAILED pyarrow2/tests/test_array.py::test_to_pandas_zero_copy - 
AttributeError: module 'sys' has no attribute 'getrefcount'
 FAILED pyarrow2/tests/test_array.py::test_array_slice - SystemError: Function 
returned an error result without setting an exception
 FAILED pyarrow2/tests/test_array.py::test_array_ref_to_ndarray_base - 
AttributeError: module 'sys' has no attribute 'getrefcount'
 FAILED pyarrow2/tests/test_array.py::test_array_conversions_no_sentinel_values 
- AttributeError: module 'sys' has no attribute 'getrefcount'
 FAILED pyarrow2/tests/test_array.py::test_nbytes_sizeof - TypeError: 
getsizeof(...)
 FAILED pyarrow2/tests/test_cffi.py::test_export_import_array - assert 1528 == 
896
 FAILED pyarrow2/tests/test_cffi.py::test_export_import_batch - assert 1048 == 
128
 FAILED pyarrow2/tests/test_convert_builtin.py::test_garbage_collection - 
assert 128 == 766912
 FAILED pyarrow2/tests/test_convert_builtin.py::test_sequence_bytes - 
NotImplementedError: creating contiguous readonly buffer from non-contiguous 
not implemented yet
 FAILED pyarrow2/tests/test_convert_builtin.py::test_map_from_dicts - 
AssertionError: Regex pattern 'integer is required' does not match 'expected 
integer, got str object'.
 FAILED pyarrow2/tests/test_csv.py::test_read_options - Failed: DID NOT RAISE 

 FAILED pyarrow2/tests/test_csv.py::test_parse_options - Failed: DID NOT RAISE 

 FAILED pyarrow2/tests/test_csv.py::test_convert_options - Failed: DID NOT 
RAISE 
 FAILED 
pyarrow2/tests/test_csv.py::TestSerialStreamingCSVRead::test_batch_lifetime - 
AssertionError: assert 1464704 == 1464576
 FAILED pyarrow2/tests/test_cython.py::test_cython_api - 
subprocess.CalledProcessError: Command '['/pyarrow/bin/pypy3', 'setup.py', 
'build_ext', '--inplace']' returned non-zero exit status 1.
 FAILED pyarrow2/tests/test_extension_type.py::test_ext_type__lifetime - 
AssertionError: assert UuidType(extension) is None
 FAILED pyarrow2/tests/test_extension_type.py::test_uuid_type_pickle - 
AssertionError: assert UuidType(extension) is None
 FAILED pyarrow2/tests/test_extension_type.py::test_ext_array_lifetime - 
AssertionError: assert ParamExtType(extension) is None
 FAILED pyarrow2/tests/test_fs.py::test_py_filesystem_lifetime - 
AssertionError: assert  is None
 FAILED 
pyarrow2/tests/test_pandas.py::test_to_pandas_deduplicate_integers_as_objects - 
assert 100 == 991
 FAILED pyarrow2/tests/test_pandas.py::test_array_uses_memory_pool - assert 
103552 == 465152
 FAILED pyarrow2/tests/test_pandas.py::test_to_pandas_self_destruct - assert 
6112064 == 4112064
 FAILED pyarrow2/tests/test_pandas.py::test_table_uses_memory_pool - assert 
6249408 == 6112064
 FAILED pyarrow2/tests/test_pandas.py::test_object_leak_in_numpy_array - 
AttributeError: module 'sys' has no attribute 'getrefcount'
 FAILED pyarrow2/tests/test_pandas.py::test_object_leak_in_dataframe - 
AttributeError: module 'sys' has no attribute 'getrefcount'
 FAILED pyarrow2/tests/test_schema.py::test_schema_sizeof - TypeError: 
getsizeof(...)
 FAILED 
pyarrow2/tests/test_sparse_tensor.py::test_sparse_coo_tensor_base_object - 
AttributeError: module 'sys' has no attribute 'getrefcount'
 FAILED 
pyarrow2/tests/test_sparse_tensor.py::test_sparse_csr_matrix_base_object - 
AttributeError: module 'sys' has no attribute 'getrefcount'
 FAILED 
pyarrow2/tests/test_sparse_tensor.py::test_sparse_csf_tensor_base_object - 
AttributeError: module 'sys' has no attribute 'getrefcount'
 FAILED pyarrow2/tests/test_table.py::test_chunked_array_basics - TypeError: 
getsizeof(...)
 FAILED pyarrow2/tests/test_table.py::test_recordbatch_basics - TypeError: 
getsizeof(...)
 FAILED pyarrow2/tests/test_table.py::test_table_basics - TypeError: 
getsizeof(...)
 FAILED pyarrow2/tests/test_tensor.py::test_tensor_base_object - 
AttributeError: module 'sys' has no attribute 'getrefcount'
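Many of the failures above share a single cause: PyPy does not use 
CPython-style reference counting, so {{sys.getrefcount}} does not exist there. 
A minimal sketch (my illustration, not code from the test suite) of how 
refcount-dependent checks could guard for this:

{code:python}
import sys

def refcount_or_none(obj):
    # CPython exposes sys.getrefcount; PyPy (GC-based) does not, so
    # refcount assertions should be skipped rather than allowed to fail.
    if hasattr(sys, "getrefcount"):
        return sys.getrefcount(obj)
    return None

print(refcount_or_none(object()))  # an int on CPython, None on PyPy
{code}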
 

[jira] [Commented] (ARROW-2651) [Python] Build & Test with PyPy

2020-09-14 Thread Niklas B (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-2651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17195571#comment-17195571
 ] 

Niklas B commented on ARROW-2651:
-

Besides GetContiguous we (and by "we" I mean Matti) needed to patch a few 
datetime-related things: 
[https://gist.github.com/mattip/c9c8398b58721ae5893dc8134c353f28]

A build that works with the patch is available at 
[https://github.com/bivald/pyarrow-on-pypy3/tree/feature/latest-pypy-latest-pyarrow]

As for the test suite, I had to disable the IO, misc and memory tests since 
they gave segfaults. 

pytest pyarrow --ignore-glob='*test_io.py' --ignore-glob='*test_misc.py' 
--ignore-glob='*test_memory.py'

Gave:

33 failed, 2620 passed, 532 skipped, 13 xfailed, 10 warnings in 104.02s

 

==
 short test summary info 
==
FAILED pyarrow2/tests/test_array.py::test_to_pandas_zero_copy - AttributeError: 
module 'sys' has no attribute 'getrefcount'
FAILED pyarrow2/tests/test_array.py::test_array_slice - SystemError: Function 
returned an error result without setting an exception
FAILED pyarrow2/tests/test_array.py::test_array_ref_to_ndarray_base - 
AttributeError: module 'sys' has no attribute 'getrefcount'
FAILED pyarrow2/tests/test_array.py::test_array_conversions_no_sentinel_values 
- AttributeError: module 'sys' has no attribute 'getrefcount'
FAILED pyarrow2/tests/test_array.py::test_nbytes_sizeof - TypeError: 
getsizeof(...)
FAILED pyarrow2/tests/test_cffi.py::test_export_import_array - assert 1528 == 
896
FAILED pyarrow2/tests/test_cffi.py::test_export_import_batch - assert 1048 == 
128
FAILED pyarrow2/tests/test_convert_builtin.py::test_garbage_collection - assert 
128 == 766912
FAILED pyarrow2/tests/test_convert_builtin.py::test_sequence_bytes - 
NotImplementedError: creating contiguous readonly buffer from non-contiguous 
not implemented yet
FAILED pyarrow2/tests/test_convert_builtin.py::test_map_from_dicts - 
AssertionError: Regex pattern 'integer is required' does not match 'expected 
integer, got str object'.
FAILED pyarrow2/tests/test_csv.py::test_read_options - Failed: DID NOT RAISE 

FAILED pyarrow2/tests/test_csv.py::test_parse_options - Failed: DID NOT RAISE 

FAILED pyarrow2/tests/test_csv.py::test_convert_options - Failed: DID NOT RAISE 

FAILED 
pyarrow2/tests/test_csv.py::TestSerialStreamingCSVRead::test_batch_lifetime - 
AssertionError: assert 1464704 == 1464576
FAILED pyarrow2/tests/test_cython.py::test_cython_api - 
subprocess.CalledProcessError: Command '['/pyarrow/bin/pypy3', 'setup.py', 
'build_ext', '--inplace']' returned non-zero exit status 1.
FAILED pyarrow2/tests/test_extension_type.py::test_ext_type__lifetime - 
AssertionError: assert UuidType(extension) is None
FAILED pyarrow2/tests/test_extension_type.py::test_uuid_type_pickle - 
AssertionError: assert UuidType(extension) is None
FAILED pyarrow2/tests/test_extension_type.py::test_ext_array_lifetime - 
AssertionError: assert ParamExtType(extension) is None
FAILED pyarrow2/tests/test_fs.py::test_py_filesystem_lifetime - AssertionError: 
assert  is 
None
FAILED 
pyarrow2/tests/test_pandas.py::test_to_pandas_deduplicate_integers_as_objects - 
assert 100 == 991
FAILED pyarrow2/tests/test_pandas.py::test_array_uses_memory_pool - assert 
103552 == 465152
FAILED pyarrow2/tests/test_pandas.py::test_to_pandas_self_destruct - assert 
6112064 == 4112064
FAILED pyarrow2/tests/test_pandas.py::test_table_uses_memory_pool - assert 
6249408 == 6112064
FAILED pyarrow2/tests/test_pandas.py::test_object_leak_in_numpy_array - 
AttributeError: module 'sys' has no attribute 'getrefcount'
FAILED pyarrow2/tests/test_pandas.py::test_object_leak_in_dataframe - 
AttributeError: module 'sys' has no attribute 'getrefcount'
FAILED pyarrow2/tests/test_schema.py::test_schema_sizeof - TypeError: 
getsizeof(...)
FAILED pyarrow2/tests/test_sparse_tensor.py::test_sparse_coo_tensor_base_object 
- AttributeError: module 'sys' has no attribute 'getrefcount'
FAILED pyarrow2/tests/test_sparse_tensor.py::test_sparse_csr_matrix_base_object 
- AttributeError: module 'sys' has no attribute 'getrefcount'
FAILED pyarrow2/tests/test_sparse_tensor.py::test_sparse_csf_tensor_base_object 
- AttributeError: module 'sys' has no attribute 'getrefcount'
FAILED pyarrow2/tests/test_table.py::test_chunked_array_basics - TypeError: 
getsizeof(...)
FAILED pyarrow2/tests/test_table.py::test_recordbatch_basics - TypeError: 
getsizeof(...)
FAILED pyarrow2/tests/test_table.py::test_table_basics - TypeError: 
getsizeof(...)
FAILED pyarrow2/tests/test_tensor.py::test_tensor_base_object - AttributeError: 
module 'sys' has no attribute 'getrefcount'
= 33 failed, 2620 passed, 532 skipped, 13 xfailed, 10 warnings in 104.02s =

[jira] [Created] (ARROW-10001) [Rust] [DataFusion] Add developer guide to README

2020-09-14 Thread Jorge (Jira)
Jorge created ARROW-10001:
-

 Summary: [Rust] [DataFusion] Add developer guide to README
 Key: ARROW-10001
 URL: https://issues.apache.org/jira/browse/ARROW-10001
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Jorge
Assignee: Jorge






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9995) [R] Snappy Codec Support not built

2020-09-14 Thread Joska Lako (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17195563#comment-17195563
 ] 

Joska Lako commented on ARROW-9995:
---

Thanks, that worked in the end!

> [R] Snappy Codec Support not built
> --
>
> Key: ARROW-9995
> URL: https://issues.apache.org/jira/browse/ARROW-9995
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 1.0.0, 1.0.1
>Reporter: Joska Lako
>Assignee: Neal Richardson
>Priority: Major
>  Labels: Snappy
> Attachments: ErrorScreenshot.PNG
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I am reading my file on a Linux-based server which has no Snappy compression 
> support. Even though I call the function with compression='UNCOMPRESSED', I 
> still get the error "Snappy codec support not built". How do I overcome this 
> error and read a parquet file without the snappy codec on Linux?
> read_parquet(file,as_data_frame=TRUE,compression='UNCOMPRESSED')
> Error in parquet___arrow___FileReader__ReadTable1(self) : IOError: 
> NotImplemented: Snappy codec support not built



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-4432) [Python][Hypothesis] Empty table - pandas roundtrip produces unequal tables

2020-09-14 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-4432:
---
Summary: [Python][Hypothesis] Empty table - pandas roundtrip produces 
unequal tables  (was: [Python][Hypothesis] Empty table - pandas roundtrip 
produces inequal tables)

> [Python][Hypothesis] Empty table - pandas roundtrip produces unequal tables
> ---
>
> Key: ARROW-4432
> URL: https://issues.apache.org/jira/browse/ARROW-4432
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Krisztian Szucs
>Priority: Major
>  Labels: hypothesis
>
> The following test case fails for empty tables:
> {code:python}
> import hypothesis as h
> import pyarrow.tests.strategies as past
> @h.given(past.all_tables)
> def test_pandas_roundtrip(table):
> df = table.to_pandas()
> table_ = pa.Table.from_pandas(df)
> assert table == table_
> {code}
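For a concrete feel of what the quoted property exercises, here is a minimal 
hand-written instance (a hypothetical sketch; the failing inputs are found by 
hypothesis, and the actual mismatch may involve metadata attached by 
{{from_pandas}}):

{code:python}
import pyarrow as pa

# Hypothetical minimal instance of the roundtrip; hypothesis generates
# many such tables, including empty ones.
table = pa.table({'a': pa.array([], type=pa.int64())})
df = table.to_pandas()
table_ = pa.Table.from_pandas(df)
print(table.equals(table_))  # reported to be False for some empty tables
{code}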



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9992) [C++][Python] Refactor python to arrow conversions based on a reusable conversion API

2020-09-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9992:
--
Labels: pull-request-available  (was: )

> [C++][Python] Refactor python to arrow conversions based on a reusable 
> conversion API 
> --
>
> Key: ARROW-9992
> URL: https://issues.apache.org/jira/browse/ARROW-9992
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We have a lot of technical debt accumulated in the python to arrow conversion 
> code paths including hidden bugs. We need to simplify the implementation. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10000) [C++][Python] Support constructing StructArray from list of key-value pairs

2020-09-14 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-1?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-1:

Fix Version/s: 2.0.0

> [C++][Python] Support constructing StructArray from list of key-value pairs
> ---
>
> Key: ARROW-1
> URL: https://issues.apache.org/jira/browse/ARROW-1
> Project: Apache Arrow
>  Issue Type: New Feature
>Reporter: Krisztian Szucs
>Priority: Major
> Fix For: 2.0.0
>
>
> {code:python}
> item = [
> ('a', 1),
> ('b', 2)
> ]
> ty = pa.struct([
> pa.field('a', type=pa.int8()),
> pa.field('b', type=pa.float64())
> ])
> pa.array([item], type=ty)
> {code}
> raises 
> {code}
> ArrowTypeError: Could not convert [('a', 1), ('b', 2)] with type list: was 
> not a dict, tuple, or recognized null value for conversion to struct type
> {code}
> This feature is required for {{pa.repeat(scalar, n)}} roundtrip if the type 
> contains duplicated field names.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10000) [C++][Python] Support constructing StructArray from list of key-value pairs

2020-09-14 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-1?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-1:

Component/s: Python

> [C++][Python] Support constructing StructArray from list of key-value pairs
> ---
>
> Key: ARROW-1
> URL: https://issues.apache.org/jira/browse/ARROW-1
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Krisztian Szucs
>Priority: Major
> Fix For: 2.0.0
>
>
> {code:python}
> item = [
> ('a', 1),
> ('b', 2)
> ]
> ty = pa.struct([
> pa.field('a', type=pa.int8()),
> pa.field('b', type=pa.float64())
> ])
> pa.array([item], type=ty)
> {code}
> raises 
> {code}
> ArrowTypeError: Could not convert [('a', 1), ('b', 2)] with type list: was 
> not a dict, tuple, or recognized null value for conversion to struct type
> {code}
> This feature is required for {{pa.repeat(scalar, n)}} roundtrip if the type 
> contains duplicated field names.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10000) [C++][Python] Support constructing StructArray from list of key-value pairs

2020-09-14 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-1?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-1:

Description: 
{code:python}
item = [
('a', 1),
('b', 2)
]
ty = pa.struct([
pa.field('a', type=pa.int8()),
pa.field('b', type=pa.float64())
])
pa.array([item], type=ty)
{code}

raises 

{code}
ArrowTypeError: Could not convert [('a', 1), ('b', 2)] with type list: was not 
a dict, tuple, or recognized null value for conversion to struct type
{code}

This feature is required for {{pa.repeat(scalar, n)}} roundtrip if the type 
contains duplicated field names.
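For comparison, a sketch (my illustration, assuming pyarrow >= 1.0) of the 
construction paths that already work; note only the positional one can express 
duplicate field names:

{code:python}
import pyarrow as pa

ty = pa.struct([
    pa.field('a', pa.int8()),
    pa.field('b', pa.float64()),
])

# A dict converts today, but dict keys cannot carry duplicate field names:
arr = pa.array([{'a': 1, 'b': 2.0}], type=ty)

# Building each child positionally also works and sidesteps the
# duplicate-name problem, since children are matched by position:
arr2 = pa.StructArray.from_arrays(
    [pa.array([1], type=pa.int8()), pa.array([2.0])],
    names=['a', 'b'],
)
print(arr.type == arr2.type)  # True
{code}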

> [C++][Python] Support constructing StructArray from list of key-value pairs
> ---
>
> Key: ARROW-1
> URL: https://issues.apache.org/jira/browse/ARROW-1
> Project: Apache Arrow
>  Issue Type: New Feature
>Reporter: Krisztian Szucs
>Priority: Major
>
> {code:python}
> item = [
> ('a', 1),
> ('b', 2)
> ]
> ty = pa.struct([
> pa.field('a', type=pa.int8()),
> pa.field('b', type=pa.float64())
> ])
> pa.array([item], type=ty)
> {code}
> raises 
> {code}
> ArrowTypeError: Could not convert [('a', 1), ('b', 2)] with type list: was 
> not a dict, tuple, or recognized null value for conversion to struct type
> {code}
> This feature is required for {{pa.repeat(scalar, n)}} roundtrip if the type 
> contains duplicated field names.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9999) [Python] Support constructing dictionary array directly through pa.array()

2020-09-14 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-:
---
Description: 
{code:python}
pa.array(["some", "string"], type=pa.dictionary(pa.int8(), pa.string)))
{code}

raises

{code}
ArrowNotImplementedError: Sequence converter for type dictionary not implemented
{code}

It would be much more convenient than

{code:python}
pa.DictionaryArray.from_arrays(indices, dictionary)
{code}

  was:
{code:python}
pa.array(["some", "string"], type=pa.dictionary(pa.int8(), pa.string)))
{code}

raises

{code}
ArrowNotImplementedError: Sequence converter for type dictionary not implemented
{code}




> [Python] Support constructing dictionary array directly through pa.array()
> --
>
> Key: ARROW-
> URL: https://issues.apache.org/jira/browse/ARROW-
> Project: Apache Arrow
>  Issue Type: New Feature
>Reporter: Krisztian Szucs
>Priority: Major
>
> {code:python}
> pa.array(["some", "string"], type=pa.dictionary(pa.int8(), pa.string)))
> {code}
> raises
> {code}
> ArrowNotImplementedError: Sequence converter for type 
> dictionary not implemented
> {code}
> It would be much more convenient than
> {code:python}
> pa.DictionaryArray.from_arrays(indices, dictionary)
> {code}
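A sketch of today's workaround next to the desired call (my illustration, 
assuming pyarrow >= 1.0):

{code:python}
import pyarrow as pa

# Today: indices and dictionary must be assembled by hand.
indices = pa.array([0, 1, 0], type=pa.int8())
dictionary = pa.array(["some", "string"])
arr = pa.DictionaryArray.from_arrays(indices, dictionary)
print(arr.type)  # dictionary<values=string, indices=int8, ordered=0>

# Desired: pa.array() handles the encoding given the requested type, e.g.
# pa.array(["some", "string", "some"],
#          type=pa.dictionary(pa.int8(), pa.string()))
{code}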



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10000) [C++][Python] Support constructing StructArray from list of key-value pairs

2020-09-14 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-1:
---

 Summary: [C++][Python] Support constructing StructArray from list 
of key-value pairs
 Key: ARROW-1
 URL: https://issues.apache.org/jira/browse/ARROW-1
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Krisztian Szucs






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9995) [R] Snappy Codec Support not built

2020-09-14 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-9995:
---
Component/s: (was: C++)

> [R] Snappy Codec Support not built
> --
>
> Key: ARROW-9995
> URL: https://issues.apache.org/jira/browse/ARROW-9995
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 1.0.0, 1.0.1
>Reporter: Joska Lako
>Priority: Major
>  Labels: Snappy
> Attachments: ErrorScreenshot.PNG
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I am reading my file on a Linux-based server which has no Snappy compression 
> support. Even though I call the function with compression='UNCOMPRESSED', I 
> still get the error "Snappy codec support not built". How do I overcome this 
> error and read a parquet file without the snappy codec on Linux?
> read_parquet(file,as_data_frame=TRUE,compression='UNCOMPRESSED')
> Error in parquet___arrow___FileReader__ReadTable1(self) : IOError: 
> NotImplemented: Snappy codec support not built



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9999) [Python] Support constructing dictionary array directly through pa.array()

2020-09-14 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-:
---
Description: 
{code:python}
pa.array(["some", "string"], type=pa.dictionary(pa.int8(), pa.string)))
{code}

raises

{code}
ArrowNotImplementedError: Sequence converter for type dictionary not implemented
{code}

It would be much more convenient than

{code:python}
pa.DictionaryArray.from_arrays(indices, dictionary)
{code}

And possibly more efficient as well thanks to the adaptive dictionary builders.

  was:
{code:python}
pa.array(["some", "string"], type=pa.dictionary(pa.int8(), pa.string)))
{code}

raises

{code}
ArrowNotImplementedError: Sequence converter for type dictionary not implemented
{code}

It would be much more convenient than

{code:python}
pa.DictionaryArray.from_arrays(indices, dictionary)
{code}


> [Python] Support constructing dictionary array directly through pa.array()
> --
>
> Key: ARROW-
> URL: https://issues.apache.org/jira/browse/ARROW-
> Project: Apache Arrow
>  Issue Type: New Feature
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>
> {code:python}
> pa.array(["some", "string"], type=pa.dictionary(pa.int8(), pa.string)))
> {code}
> raises
> {code}
> ArrowNotImplementedError: Sequence converter for type 
> dictionary not implemented
> {code}
> It would be much more convenient than
> {code:python}
> pa.DictionaryArray.from_arrays(indices, dictionary)
> {code}
> And possibly more efficient as well thanks to the adaptive dictionary 
> builders.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-9999) [Python] Support constructing dictionary array directly through pa.array()

2020-09-14 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs reassigned ARROW-:
--

Assignee: Krisztian Szucs

> [Python] Support constructing dictionary array directly through pa.array()
> --
>
> Key: ARROW-
> URL: https://issues.apache.org/jira/browse/ARROW-
> Project: Apache Arrow
>  Issue Type: New Feature
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>
> {code:python}
> pa.array(["some", "string"], type=pa.dictionary(pa.int8(), pa.string)))
> {code}
> raises
> {code}
> ArrowNotImplementedError: Sequence converter for type 
> dictionary not implemented
> {code}
> It would be much more convenient than
> {code:python}
> pa.DictionaryArray.from_arrays(indices, dictionary)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9999) [Python] Support constructing dictionary array directly through pa.array()

2020-09-14 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-:
---
Summary: [Python] Support constructing dictionary array directly through 
pa.array()  (was: [Python] Support constructing dictionary array through 
pa.array())

> [Python] Support constructing dictionary array directly through pa.array()
> --
>
> Key: ARROW-
> URL: https://issues.apache.org/jira/browse/ARROW-
> Project: Apache Arrow
>  Issue Type: New Feature
>Reporter: Krisztian Szucs
>Priority: Major
>
> {code:python}
> pa.array(["some", "string"], type=pa.dictionary(pa.int8(), pa.string)))
> {code}
> raises
> {code}
> ArrowNotImplementedError: Sequence converter for type 
> dictionary not implemented
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9999) [Python] Support constructing dictionary array through pa.array()

2020-09-14 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-:
---
Description: 
{code:python}
pa.array(["some", "string"], type=pa.dictionary(pa.int8(), pa.string)))
{code}

raises

{code}
ArrowNotImplementedError: Sequence converter for type dictionary not implemented
{code}



> [Python] Support constructing dictionary array through pa.array()
> -
>
> Key: ARROW-
> URL: https://issues.apache.org/jira/browse/ARROW-
> Project: Apache Arrow
>  Issue Type: New Feature
>Reporter: Krisztian Szucs
>Priority: Major
>
> {code:python}
> pa.array(["some", "string"], type=pa.dictionary(pa.int8(), pa.string)))
> {code}
> raises
> {code}
> ArrowNotImplementedError: Sequence converter for type 
> dictionary not implemented
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9995) [R] Snappy Codec Support not built

2020-09-14 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17195541#comment-17195541
 ] 

Neal Richardson commented on ARROW-9995:


{{read_parquet()}} doesn't ask you about compression--it detects what 
compression is used in the file. So it sounds like you're trying to read a 
snappy-compressed file and thus need a build with snappy enabled. 

To get that, since you already have arrow installed, you could call 
{{arrow::install_arrow()}} and it should just work, installing a more complete 
build. Or you could set {{LIBARROW_MINIMAL=FALSE}} and reinstall in the usual 
way. See https://arrow.apache.org/docs/r/articles/install.html for more. 

> [R] Snappy Codec Support not built
> --
>
> Key: ARROW-9995
> URL: https://issues.apache.org/jira/browse/ARROW-9995
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 1.0.0, 1.0.1
>Reporter: Joska Lako
>Priority: Major
>  Labels: Snappy
> Attachments: ErrorScreenshot.PNG
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I am reading my file on a Linux-based server which has no Snappy compression 
> support. Even though I call the function with compression='UNCOMPRESSED', I 
> still get the error "Snappy codec support not built". How do I overcome this 
> error and read a parquet file without the snappy codec on Linux?
> read_parquet(file,as_data_frame=TRUE,compression='UNCOMPRESSED')
> Error in parquet___arrow___FileReader__ReadTable1(self) : IOError: 
> NotImplemented: Snappy codec support not built



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-9995) [R] Snappy Codec Support not built

2020-09-14 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-9995:
--

Assignee: Neal Richardson

> [R] Snappy Codec Support not built
> --
>
> Key: ARROW-9995
> URL: https://issues.apache.org/jira/browse/ARROW-9995
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 1.0.0, 1.0.1
>Reporter: Joska Lako
>Assignee: Neal Richardson
>Priority: Major
>  Labels: Snappy
> Attachments: ErrorScreenshot.PNG
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I am reading my file on a Linux-based server which has no Snappy compression 
> support. Even though I call the function with compression='UNCOMPRESSED', I 
> still get the error "Snappy codec support not built". How do I overcome this 
> error and read a parquet file without the snappy codec on Linux?
> read_parquet(file,as_data_frame=TRUE,compression='UNCOMPRESSED')
> Error in parquet___arrow___FileReader__ReadTable1(self) : IOError: 
> NotImplemented: Snappy codec support not built



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-9995) [R] Snappy Codec Support not built

2020-09-14 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-9995.

Resolution: Information Provided

> [R] Snappy Codec Support not built
> --
>
> Key: ARROW-9995
> URL: https://issues.apache.org/jira/browse/ARROW-9995
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 1.0.0, 1.0.1
>Reporter: Joska Lako
>Assignee: Neal Richardson
>Priority: Major
>  Labels: Snappy
> Attachments: ErrorScreenshot.PNG
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I am reading my file on a Linux-based server which has no Snappy compression 
> support. Even though I call the function with compression='UNCOMPRESSED', I 
> still get the error "Snappy codec support not built". How do I overcome this 
> error and read a parquet file without the snappy codec on Linux?
> read_parquet(file,as_data_frame=TRUE,compression='UNCOMPRESSED')
> Error in parquet___arrow___FileReader__ReadTable1(self) : IOError: 
> NotImplemented: Snappy codec support not built



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9997) [Python] StructScalar.as_py() fails if the type has duplicate field names

2020-09-14 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-9997:
--

 Summary: [Python] StructScalar.as_py() fails if the type has 
duplicate field names
 Key: ARROW-9997
 URL: https://issues.apache.org/jira/browse/ARROW-9997
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Krisztian Szucs
Assignee: Krisztian Szucs
 Fix For: 2.0.0


{{StructScalar}} currently extends an abstract Mapping interface. Since the 
type allows duplicate field names, we cannot provide that API.
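A sketch of the failure mode (my illustration, assuming current master): a 
struct type may legally repeat a field name, which a dict-shaped {{as_py()}} 
cannot represent without dropping a value:

{code:python}
import pyarrow as pa

# Duplicate field names are legal for struct types.
arr = pa.StructArray.from_arrays(
    [pa.array([1], type=pa.int64()), pa.array([2.0])],
    names=['x', 'x'],
)
s = arr[0]  # a StructScalar
# A Mapping-style s.as_py() would have to map 'x' to both 1 and 2.0;
# a plain dict {'x': ...} can only keep one of them.
{code}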





--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9999) [Python] Support constructing dictionary array through pa.array()

2020-09-14 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-:
--

 Summary: [Python] Support constructing dictionary array through 
pa.array()
 Key: ARROW-
 URL: https://issues.apache.org/jira/browse/ARROW-
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Krisztian Szucs






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9998) [Python] Support pickling DictionaryScalar

2020-09-14 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-9998:
--

 Summary: [Python] Support pickling DictionaryScalar
 Key: ARROW-9998
 URL: https://issues.apache.org/jira/browse/ARROW-9998
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Krisztian Szucs
Assignee: Krisztian Szucs
 Fix For: 2.0.0


Since the {{pa.array}} factory function doesn't support the creation of 
dictionary arrays, pickling [has not been 
implemented|https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_scalars.py#L554]
 for dictionary scalars yet.
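A sketch of the roundtrip this ticket asks for (my illustration; the array is 
built via {{DictionaryArray.from_arrays}} since {{pa.array}} cannot create one 
yet, see ARROW-9999):

{code:python}
import pickle
import pyarrow as pa

arr = pa.DictionaryArray.from_arrays(
    pa.array([0, 1], type=pa.int8()), pa.array(["a", "b"]))
scalar = arr[0]  # a DictionaryScalar
# Desired roundtrip, currently unsupported (which is what this ticket
# tracks):
# restored = pickle.loads(pickle.dumps(scalar))
{code}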



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9995) [R] Snappy Codec Support not built

2020-09-14 Thread Joska Lako (Jira)
Joska Lako created ARROW-9995:
-

 Summary: [R] Snappy Codec Support not built
 Key: ARROW-9995
 URL: https://issues.apache.org/jira/browse/ARROW-9995
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, R
Affects Versions: 1.0.1, 1.0.0
Reporter: Joska Lako
 Attachments: ErrorScreenshot.PNG

I am reading my file on a Linux-based server which has no Snappy compression 
support. Even though I call the function with compression='UNCOMPRESSED', I 
still get the error "Snappy codec support not built". How do I overcome this 
error and read a parquet file without the snappy codec on Linux?
read_parquet(file,as_data_frame=TRUE,compression='UNCOMPRESSED')

Error in parquet___arrow___FileReader__ReadTable1(self) : IOError: 
NotImplemented: Snappy codec support not built



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9994) [C++][Python] Auto chunking nested array containing binary-like fields results in malformed output

2020-09-14 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-9994:
---
Issue Type: Bug  (was: Improvement)

> [C++][Python] Auto chunking nested array containing binary-like fields 
> results in malformed output
> --
>
> Key: ARROW-9994
> URL: https://issues.apache.org/jira/browse/ARROW-9994
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 1.0.0
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>
> For nested types, the binary-like child arrays are chunked but the others 
> are not, so after finalizing the builder the nested output array contains 
> children of different lengths.
> {code:python}
> char = b'x'
> ty = pa.binary()
> v1 = char * 1
> v2 = char * 147483646
> struct_type = pa.struct([
> pa.field('bool', pa.bool_()),
> pa.field('integer', pa.int64()),
> pa.field('string-like', ty),
> ])
> data = [{'bool': True, 'integer': 1, 'string-like': v1}] * 20
> data.append({'bool': True, 'integer': 1, 'string-like': v2})
> arr = pa.array(data, type=struct_type)
> assert isinstance(arr, pa.Array)
> data.append({'bool': True, 'integer': 1, 'string-like': char})
> arr = pa.array(data, type=struct_type)
> assert isinstance(arr, pa.ChunkedArray)
> {code}
> {code:python}
> len(arr.field(0)) == 22
> len(arr.field(1)) == 22
> len(arr.field(2)) == 1  # the string array gets chunked whereas the rest of 
> the fields do not
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9994) [C++][Python] Auto chunking nested array containing binary-like fields results in malformed output

2020-09-14 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-9994:
--

 Summary: [C++][Python] Auto chunking nested array containing 
binary-like fields results in malformed output
 Key: ARROW-9994
 URL: https://issues.apache.org/jira/browse/ARROW-9994
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Python
Affects Versions: 1.0.0
Reporter: Krisztian Szucs
Assignee: Krisztian Szucs


For nested types, the binary-like child arrays are chunked but the others are 
not, so after finalizing the builder the nested output array contains children 
of different lengths.

{code:python}
char = b'x'
ty = pa.binary()

v1 = char * 1
v2 = char * 147483646

struct_type = pa.struct([
pa.field('bool', pa.bool_()),
pa.field('integer', pa.int64()),
pa.field('string-like', ty),
])

data = [{'bool': True, 'integer': 1, 'string-like': v1}] * 20
data.append({'bool': True, 'integer': 1, 'string-like': v2})
arr = pa.array(data, type=struct_type)
assert isinstance(arr, pa.Array)

data.append({'bool': True, 'integer': 1, 'string-like': char})
arr = pa.array(data, type=struct_type)
assert isinstance(arr, pa.ChunkedArray)
{code}

{code:python}
len(arr.field(0)) == 22
len(arr.field(1)) == 22
len(arr.field(2)) == 1  # the string array gets chunked whereas the rest of the 
fields do not
{code}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-9580) Docs have superfluous ()

2020-09-14 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9580:


Assignee: Apache Arrow JIRA Bot  (was: Dominik Moritz)

> Docs have superfluous ()
> 
>
> Key: ARROW-9580
> URL: https://issues.apache.org/jira/browse/ARROW-9580
> Project: Apache Arrow
>  Issue Type: Task
>  Components: JavaScript
>Reporter: Dominik Moritz
>Assignee: Apache Arrow JIRA Bot
>Priority: Trivial
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9580) Docs have superfluous ()

2020-09-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9580:
--
Labels: pull-request-available  (was: )

> Docs have superfluous ()
> 
>
> Key: ARROW-9580
> URL: https://issues.apache.org/jira/browse/ARROW-9580
> Project: Apache Arrow
>  Issue Type: Task
>  Components: JavaScript
>Reporter: Dominik Moritz
>Assignee: Dominik Moritz
>Priority: Trivial
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-9580) Docs have superfluous ()

2020-09-14 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9580:


Assignee: Dominik Moritz  (was: Apache Arrow JIRA Bot)

> Docs have superfluous ()
> 
>
> Key: ARROW-9580
> URL: https://issues.apache.org/jira/browse/ARROW-9580
> Project: Apache Arrow
>  Issue Type: Task
>  Components: JavaScript
>Reporter: Dominik Moritz
>Assignee: Dominik Moritz
>Priority: Trivial
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9993) [Python] Tzinfo - string roundtrip fails on pytz.StaticTzInfo objects

2020-09-14 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-9993:
---
Issue Type: Bug  (was: Improvement)

> [Python] Tzinfo - string roundtrip fails on pytz.StaticTzInfo objects
> -
>
> Key: ARROW-9993
> URL: https://issues.apache.org/jira/browse/ARROW-9993
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>
> Timezone roundtrip fails with {{pytz.StaticTzInfo}} objects on master:
> {code:python}
> tz = pytz.timezone('Etc/GMT+1')
> pa.lib.string_to_tzinfo(pa.lib.tzinfo_to_string(tz))
> {code}
> {code}
> ---
> UnknownTimeZoneError  Traceback (most recent call last)
>  in 
> > 1 pa.lib.string_to_tzinfo(pa.lib.tzinfo_to_string(tz))
> ~/Workspace/arrow/python/pyarrow/types.pxi in pyarrow.lib.string_to_tzinfo()
>1838 Time zone object
>1839 """
> -> 1840 cdef PyObject* tz = 
> GetResultValue(StringToTzinfo(name.encode('utf-8')))
>1841 return PyObject_to_object(tz)
>1842
> ~/Workspace/arrow/python/pyarrow/error.pxi in 
> pyarrow.lib.pyarrow_internal_check_status()
> 120 cdef api int pyarrow_internal_check_status(const CStatus& status) \
> 121 nogil except -1:
> --> 122 return check_status(status)
> ~/.conda/envs/arrow38/lib/python3.8/site-packages/pytz/__init__.py in 
> timezone(zone)
> 179 fp.close()
> 180 else:
> --> 181 raise UnknownTimeZoneError(zone)
> 182
> 183 return _tzinfo_cache[zone]
> UnknownTimeZoneError: '-01'
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9993) [Python] Tzinfo - string roundtrip fails on pytz.StaticTzInfo objects

2020-09-14 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-9993:
--

 Summary: [Python] Tzinfo - string roundtrip fails on 
pytz.StaticTzInfo objects
 Key: ARROW-9993
 URL: https://issues.apache.org/jira/browse/ARROW-9993
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Krisztian Szucs
Assignee: Krisztian Szucs


Timezone roundtrip fails with {{pytz.StaticTzInfo}} objects on master:

{code:python}
tz = pytz.timezone('Etc/GMT+1')
pa.lib.string_to_tzinfo(pa.lib.tzinfo_to_string(tz))
{code}
{code}
---
UnknownTimeZoneError  Traceback (most recent call last)
 in 
> 1 pa.lib.string_to_tzinfo(pa.lib.tzinfo_to_string(tz))

~/Workspace/arrow/python/pyarrow/types.pxi in pyarrow.lib.string_to_tzinfo()
   1838 Time zone object
   1839 """
-> 1840 cdef PyObject* tz = 
GetResultValue(StringToTzinfo(name.encode('utf-8')))
   1841 return PyObject_to_object(tz)
   1842

~/Workspace/arrow/python/pyarrow/error.pxi in 
pyarrow.lib.pyarrow_internal_check_status()
120 cdef api int pyarrow_internal_check_status(const CStatus& status) \
121 nogil except -1:
--> 122 return check_status(status)

~/.conda/envs/arrow38/lib/python3.8/site-packages/pytz/__init__.py in 
timezone(zone)
179 fp.close()
180 else:
--> 181 raise UnknownTimeZoneError(zone)
182
183 return _tzinfo_cache[zone]

UnknownTimeZoneError: '-01'
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9616) [C++] Support LTO for R

2020-09-14 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17195515#comment-17195515
 ] 

Antoine Pitrou commented on ARROW-9616:
---

Deciding to LTO everything sounds more ideological than pragmatic. LTO can be 
useful in some select cases, but I fail to understand why it would be 
mandatory. Also it will increase build times again.

> [C++] Support LTO for R
> ---
>
> Key: ARROW-9616
> URL: https://issues.apache.org/jira/browse/ARROW-9616
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 1.0.0
>Reporter: Jeroen
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The next version of R might enable LTO on Windows, i.e. R packages will be 
> compiled with {{-flto}} by default. This works out of the box for most 
> packages, but for arrow, the linker crashes as below. 
> {code}
>  C:/rtools40/mingw64/bin/g++ -shared -O2 -Wall -mfpmath=sse -msse2 
> -mstackrealign -flto -s -static-libgcc -o arrow.dll tmp.def array.o 
> array_from_vector.o array_to_vector.o arraydata.o arrowExports.o buffer.o 
> chunkedarray.o compression.o compute.o csv.o dataset.o datatype.o 
> expression.o feather.o field.o filesystem.o imports.o io.o json.o 
> memorypool.o message.o parquet.o py-to-r.o recordbatch.o recordbatchreader.o 
> recordbatchwriter.o scalar.o schema.o symbols.o table.o threadpool.o 
> -L../windows//lib-8.3.0/x64 -L../windows//lib/x64 -lparquet -larrow_dataset 
> -larrow -lthrift -lsnappy -lz -lzstd -llz4 -lbcrypt -lpsapi -lcrypto 
> -lcrypt32 -lws2_32 -LC:/PROGRA~1/R/R-devel/bin/x64 -lR
>  lto1.exe: internal compiler error: in add_symbol_to_partition_1, at 
> lto/lto-partition.c:153
>  libbacktrace could not find executable to open
>  Please submit a full bug report,
>  with preprocessed source if appropriate.
>  See <[https://github.com/r-windows]> for instructions.
>  lto-wrapper.exe: fatal error: C:\rtools40\mingw64\bin\g++.exe returned 1 
> exit status
>  compilation terminated.
>  
> C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/9.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe:
>  error: lto-wrapper failed
> {code}
> You can reproduce this in R on Windows for example like so:
> {code:r}
> dir.create("~/.R")
> writeLines("CPPFLAGS=-flto", file = "~/.R/Makevars")
> install.packages("arrow", type = 'source')
> {code}
> I am not sure if this is a bug in the toolchain, or in arrow. I tried with 
> both gcc-8.3.0 and gcc-9.3.0, and the result is the same. I did find [this 
> issue|https://github.com/cycfi/elements/pull/56] in another project which 
> suggests to enable `INTERPROCEDURAL_OPTIMIZATION` in cmake, when mixing lto 
> code with non-lto code (which is the case when we only build the r bindings 
> with lto, but not the c++ library).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9616) [C++] Support LTO for R

2020-09-14 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17195513#comment-17195513
 ] 

Antoine Pitrou commented on ARROW-9616:
---

An internal compiler error is certainly not a bug in Arrow, but we have to 
work around the issue at some point, no?

> [C++] Support LTO for R
> ---
>
> Key: ARROW-9616
> URL: https://issues.apache.org/jira/browse/ARROW-9616
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 1.0.0
>Reporter: Jeroen
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The next version of R might enable LTO on Windows, i.e. R packages will be 
> compiled with {{-flto}} by default. This works out of the box for most 
> packages, but for arrow, the linker crashes as below. 
> {code}
>  C:/rtools40/mingw64/bin/g++ -shared -O2 -Wall -mfpmath=sse -msse2 
> -mstackrealign -flto -s -static-libgcc -o arrow.dll tmp.def array.o 
> array_from_vector.o array_to_vector.o arraydata.o arrowExports.o buffer.o 
> chunkedarray.o compression.o compute.o csv.o dataset.o datatype.o 
> expression.o feather.o field.o filesystem.o imports.o io.o json.o 
> memorypool.o message.o parquet.o py-to-r.o recordbatch.o recordbatchreader.o 
> recordbatchwriter.o scalar.o schema.o symbols.o table.o threadpool.o 
> -L../windows//lib-8.3.0/x64 -L../windows//lib/x64 -lparquet -larrow_dataset 
> -larrow -lthrift -lsnappy -lz -lzstd -llz4 -lbcrypt -lpsapi -lcrypto 
> -lcrypt32 -lws2_32 -LC:/PROGRA~1/R/R-devel/bin/x64 -lR
>  lto1.exe: internal compiler error: in add_symbol_to_partition_1, at 
> lto/lto-partition.c:153
>  libbacktrace could not find executable to open
>  Please submit a full bug report,
>  with preprocessed source if appropriate.
>  See <[https://github.com/r-windows]> for instructions.
>  lto-wrapper.exe: fatal error: C:\rtools40\mingw64\bin\g++.exe returned 1 
> exit status
>  compilation terminated.
>  
> C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/9.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe:
>  error: lto-wrapper failed
> {code}
> You can reproduce this in R on Windows for example like so:
> {code:r}
> dir.create("~/.R")
> writeLines("CPPFLAGS=-flto", file = "~/.R/Makevars")
> install.packages("arrow", type = 'source')
> {code}
> I am not sure if this is a bug in the toolchain, or in arrow. I tried with 
> both gcc-8.3.0 and gcc-9.3.0, and the result is the same. I did find [this 
> issue|https://github.com/cycfi/elements/pull/56] in another project which 
> suggests to enable `INTERPROCEDURAL_OPTIMIZATION` in cmake, when mixing lto 
> code with non-lto code (which is the case when we only build the r bindings 
> with lto, but not the c++ library).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9616) [C++] Support LTO for R

2020-09-14 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17195512#comment-17195512
 ] 

Neal Richardson commented on ARROW-9616:


If CRAN decides that it LTO's everything, then we wouldn't be able to turn that 
off. FWIW CRAN already has an LTO builder in its test setup (Debian, I believe) 
and arrow is not failing that. So this is something in the Windows setup, and 
possibly not a problem in arrow at all.

> [C++] Support LTO for R
> ---
>
> Key: ARROW-9616
> URL: https://issues.apache.org/jira/browse/ARROW-9616
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 1.0.0
>Reporter: Jeroen
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The next version of R might enable LTO on Windows, i.e. R packages will be 
> compiled with {{-flto}} by default. This works out of the box for most 
> packages, but for arrow, the linker crashes as below. 
> {code}
>  C:/rtools40/mingw64/bin/g++ -shared -O2 -Wall -mfpmath=sse -msse2 
> -mstackrealign -flto -s -static-libgcc -o arrow.dll tmp.def array.o 
> array_from_vector.o array_to_vector.o arraydata.o arrowExports.o buffer.o 
> chunkedarray.o compression.o compute.o csv.o dataset.o datatype.o 
> expression.o feather.o field.o filesystem.o imports.o io.o json.o 
> memorypool.o message.o parquet.o py-to-r.o recordbatch.o recordbatchreader.o 
> recordbatchwriter.o scalar.o schema.o symbols.o table.o threadpool.o 
> -L../windows//lib-8.3.0/x64 -L../windows//lib/x64 -lparquet -larrow_dataset 
> -larrow -lthrift -lsnappy -lz -lzstd -llz4 -lbcrypt -lpsapi -lcrypto 
> -lcrypt32 -lws2_32 -LC:/PROGRA~1/R/R-devel/bin/x64 -lR
>  lto1.exe: internal compiler error: in add_symbol_to_partition_1, at 
> lto/lto-partition.c:153
>  libbacktrace could not find executable to open
>  Please submit a full bug report,
>  with preprocessed source if appropriate.
>  See <[https://github.com/r-windows]> for instructions.
>  lto-wrapper.exe: fatal error: C:\rtools40\mingw64\bin\g++.exe returned 1 
> exit status
>  compilation terminated.
>  
> C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/9.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe:
>  error: lto-wrapper failed
> {code}
> You can reproduce this in R on Windows for example like so:
> {code:r}
> dir.create("~/.R")
> writeLines("CPPFLAGS=-flto", con = "~/.R/Makevars")
> install.packages("arrow", type = 'source')
> {code}
> I am not sure if this is a bug in the toolchain, or in arrow. I tried with 
> both gcc-8.3.0 and gcc-9.3.0, and the result is the same. I did find [this 
> issue|https://github.com/cycfi/elements/pull/56] in another project which 
> suggests enabling `INTERPROCEDURAL_OPTIMIZATION` in cmake, when mixing lto 
> code with non-lto code (which is the case when we only build the r bindings 
> with lto, but not the c++ library).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9991) [C++] split kernels for strings/binary

2020-09-14 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17195508#comment-17195508
 ] 

Joris Van den Bossche commented on ARROW-9991:
--

And I suppose "whitespace" here is more than a split on " "? (also multiple 
spaces, different kinds of newlines, tabs, etc.?) In that case, a separate 
specialized kernel indeed seems best.

> [C++] split kernels for strings/binary
> --
>
> Key: ARROW-9991
> URL: https://issues.apache.org/jira/browse/ARROW-9991
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Maarten Breddels
>Assignee: Maarten Breddels
>Priority: Major
>
> Similar to Python str.split and bytes.split, we'd like to have a way to 
> convert str into list[str] (and similarly for bytes).
> When the separator is given, the algorithms for both types are the same. 
> Python, however, overloads split. When given no separator, the algorithm will 
> split considering all whitespace (unicode for str, ascii for bytes) as 
> separator.
> I'd rather not see too many overloaded kernels, e.g.
> binary_split (takes string/binary separator, and maxsplit arg, no special 
> utf8 version needed)
> utf8_split_whitespace (similar to Python's version given no separator)
> ascii_split_whitespace (similar to Python's version given no separator, but 
> considering ascii, although this could work on any binary data)
> there can also be rsplit versions of these, or they could be an argument.
>  
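
For reference, Python's own split (which the proposal mirrors) already shows why 
the two behaviours are better kept as separate kernels; this is plain stdlib 
Python, not the proposed Arrow API:

{code:python}
# Separator given: empty strings between adjacent separators are kept.
# This is the behaviour a binary_split kernel would mirror.
print("a  b\tc".split(" "))   # ['a', '', 'b\tc']

# No separator: runs of arbitrary whitespace collapse, and leading and
# trailing whitespace is dropped -- the utf8_split_whitespace /
# ascii_split_whitespace behaviour for str and bytes respectively.
print("  a  b\tc\n".split())  # ['a', 'b', 'c']
print(b"  a  b\t".split())    # [b'a', b'b']
{code}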



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-1385) [C++] Add Buffer implementation and helper functions for POSIX shared memory

2020-09-14 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-1385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17195491#comment-17195491
 ] 

Antoine Pitrou commented on ARROW-1385:
---

Is there a target use case we're thinking about? Otherwise it's not obvious 
this deserves keeping an issue open.
(especially, one annoyance with shared memory is garbage collecting unused 
shared memory segments: Windows is able to do this automatically, Unix 
unfortunately is not)
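
To make the cleanup annoyance concrete, here is a stdlib-only Python sketch (not 
an Arrow API) of the POSIX lifecycle: if {{unlink()}} is never reached, for 
instance because the process crashed, the segment lingers on Unix, whereas 
Windows reclaims it once the last handle closes:

{code:python}
from multiprocessing import shared_memory

# Create a 1 KiB shared memory segment (shm_open on POSIX systems).
shm = shared_memory.SharedMemory(create=True, size=1024)
try:
    shm.buf[:4] = b"\x01\x02\x03\x04"  # visible to any process attached by name
finally:
    shm.close()   # detach this process's mapping
    shm.unlink()  # without this, the segment leaks on Unix
{code}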

> [C++] Add Buffer implementation and helper functions for POSIX shared memory
> 
>
> Key: ARROW-1385
> URL: https://issues.apache.org/jira/browse/ARROW-1385
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 2.0.0
>
>
> This should also include affordances for detaching and removing shm segments



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-9859) [C++] S3 FileSystemFromUri with special char in secret key fails

2020-09-14 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9859:


Assignee: Apache Arrow JIRA Bot  (was: Antoine Pitrou)

> [C++] S3 FileSystemFromUri with special char in secret key fails
> 
>
> Key: ARROW-9859
> URL: https://issues.apache.org/jira/browse/ARROW-9859
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Documentation, Python
>Reporter: Neal Richardson
>Assignee: Apache Arrow JIRA Bot
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> S3 Secret access keys can contain special characters like {{/}}. When they do
> 1) FileSystemFromUri will fail to parse the URI unless you URL-encode them 
> (e.g. replace / with %2F)
> 2) When you do escape the special characters, requests that require 
> authorization fail with the message "The request signature we calculated does 
> not match the signature you provided. Check your key and signing method." 
> This may suggest that there's some extra URL encoding/decoding that needs to 
> happen inside.
> I was only able to work around this by generating a new access key that 
> happened not to have special characters.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-9859) [C++] S3 FileSystemFromUri with special char in secret key fails

2020-09-14 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9859:


Assignee: Antoine Pitrou  (was: Apache Arrow JIRA Bot)

> [C++] S3 FileSystemFromUri with special char in secret key fails
> 
>
> Key: ARROW-9859
> URL: https://issues.apache.org/jira/browse/ARROW-9859
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Documentation, Python
>Reporter: Neal Richardson
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> S3 Secret access keys can contain special characters like {{/}}. When they do
> 1) FileSystemFromUri will fail to parse the URI unless you URL-encode them 
> (e.g. replace / with %2F)
> 2) When you do escape the special characters, requests that require 
> authorization fail with the message "The request signature we calculated does 
> not match the signature you provided. Check your key and signing method." 
> This may suggest that there's some extra URL encoding/decoding that needs to 
> happen inside.
> I was only able to work around this by generating a new access key that 
> happened not to have special characters.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9859) [C++] S3 FileSystemFromUri with special char in secret key fails

2020-09-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9859:
--
Labels: pull-request-available  (was: )

> [C++] S3 FileSystemFromUri with special char in secret key fails
> 
>
> Key: ARROW-9859
> URL: https://issues.apache.org/jira/browse/ARROW-9859
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Documentation, Python
>Reporter: Neal Richardson
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> S3 Secret access keys can contain special characters like {{/}}. When they do
> 1) FileSystemFromUri will fail to parse the URI unless you URL-encode them 
> (e.g. replace / with %2F)
> 2) When you do escape the special characters, requests that require 
> authorization fail with the message "The request signature we calculated does 
> not match the signature you provided. Check your key and signing method." 
> This may suggest that there's some extra URL encoding/decoding that needs to 
> happen inside.
> I was only able to work around this by generating a new access key that 
> happened not to have special characters.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9964) [C++] CSV date support

2020-09-14 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17195469#comment-17195469
 ] 

Antoine Pitrou commented on ARROW-9964:
---

Thanks for the report. Indeed, for now, you cannot directly read those values as 
a date type. However, you can read them as timestamps. I agree it would be good 
to allow specifying a date column in {{column_types}}.

> [C++] CSV date support
> --
>
> Key: ARROW-9964
> URL: https://issues.apache.org/jira/browse/ARROW-9964
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 1.0.1
>Reporter: Maciej
>Priority: Major
>
> There is no support for reading date type from CSV file. I'd like to read 
> such a value:
> {code:java}
> 1991-02-03
> {code}
>  
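
As a stopgap in the meantime, reading the column as a timestamp (which 
{{column_types}} does accept) and casting afterwards is one possible sketch; the 
cast to date32 below assumes the values carry no time-of-day, so nothing is 
truncated:

{code:python}
import io

import pyarrow as pa
from pyarrow import csv

data = io.BytesIO(b"d\n1991-02-03\n")
table = csv.read_csv(
    data,
    convert_options=csv.ConvertOptions(column_types={"d": pa.timestamp("s")}),
)
# Midnight-only timestamps cast losslessly down to a date type.
table = table.cast(pa.schema([("d", pa.date32())]))
{code}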



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9859) [C++] S3 FileSystemFromUri with special char in secret key fails

2020-09-14 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17195470#comment-17195470
 ] 

Antoine Pitrou commented on ARROW-9859:
---

Nevermind, I have such a test bucket myself :-)

> [C++] S3 FileSystemFromUri with special char in secret key fails
> 
>
> Key: ARROW-9859
> URL: https://issues.apache.org/jira/browse/ARROW-9859
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Documentation, Python
>Reporter: Neal Richardson
>Assignee: Antoine Pitrou
>Priority: Major
> Fix For: 2.0.0
>
>
> S3 Secret access keys can contain special characters like {{/}}. When they do
> 1) FileSystemFromUri will fail to parse the URI unless you URL-encode them 
> (e.g. replace / with %2F)
> 2) When you do escape the special characters, requests that require 
> authorization fail with the message "The request signature we calculated does 
> not match the signature you provided. Check your key and signing method." 
> This may suggest that there's some extra URL encoding/decoding that needs to 
> happen inside.
> I was only able to work around this by generating a new access key that 
> happened not to have special characters.
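
On the workaround in 1), percent-encoding the credentials before handing the URI 
to Arrow might look like the sketch below (the bucket and keys are placeholders; 
{{pyarrow.fs.FileSystem.from_uri}} is the entry point under discussion). Whether 
the escaped key then authenticates is exactly the open question in 2):

{code:python}
from urllib.parse import quote

from pyarrow import fs

access_key = "AKIAEXAMPLE"   # placeholder credentials
secret_key = "abc/def+ghi"   # contains '/' and '+'

# Percent-encode so the URI parser accepts the key ('/' -> '%2F', '+' -> '%2B').
uri = "s3://{}:{}@my-bucket/path".format(
    quote(access_key, safe=""), quote(secret_key, safe="")
)
filesystem, path = fs.FileSystem.from_uri(uri)
{code}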



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9964) [C++] CSV date support

2020-09-14 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-9964:
--
Fix Version/s: 2.0.0

> [C++] CSV date support
> --
>
> Key: ARROW-9964
> URL: https://issues.apache.org/jira/browse/ARROW-9964
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 1.0.1
>Reporter: Maciej
>Priority: Major
> Fix For: 2.0.0
>
>
> There is no support for reading date type from CSV file. I'd like to read 
> such a value:
> {code:java}
> 1991-02-03
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9616) [C++] Support LTO for R

2020-09-14 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17195464#comment-17195464
 ] 

Antoine Pitrou commented on ARROW-9616:
---

Can't we simply disable LTO? I doubt LTO would bring much to Arrow (and if it 
does, then I'd say it's a bug: we should structure our source code so that LTO 
is generally not useful).

> [C++] Support LTO for R
> ---
>
> Key: ARROW-9616
> URL: https://issues.apache.org/jira/browse/ARROW-9616
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 1.0.0
>Reporter: Jeroen
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The next version of R might enable LTO on Windows, i.e. R packages will be 
> compiled with {{-flto}} by default. This works out of the box for most 
> packages, but for arrow, the linker crashes as below. 
> {code}
>  C:/rtools40/mingw64/bin/g++ -shared -O2 -Wall -mfpmath=sse -msse2 
> -mstackrealign -flto -s -static-libgcc -o arrow.dll tmp.def array.o 
> array_from_vector.o array_to_vector.o arraydata.o arrowExports.o buffer.o 
> chunkedarray.o compression.o compute.o csv.o dataset.o datatype.o 
> expression.o feather.o field.o filesystem.o imports.o io.o json.o 
> memorypool.o message.o parquet.o py-to-r.o recordbatch.o recordbatchreader.o 
> recordbatchwriter.o scalar.o schema.o symbols.o table.o threadpool.o 
> -L../windows//lib-8.3.0/x64 -L../windows//lib/x64 -lparquet -larrow_dataset 
> -larrow -lthrift -lsnappy -lz -lzstd -llz4 -lbcrypt -lpsapi -lcrypto 
> -lcrypt32 -lws2_32 -LC:/PROGRA~1/R/R-devel/bin/x64 -lR
>  lto1.exe: internal compiler error: in add_symbol_to_partition_1, at 
> lto/lto-partition.c:153
>  libbacktrace could not find executable to open
>  Please submit a full bug report,
>  with preprocessed source if appropriate.
>  See <[https://github.com/r-windows]> for instructions.
>  lto-wrapper.exe: fatal error: C:\rtools40\mingw64\bin\g++.exe returned 1 
> exit status
>  compilation terminated.
>  
> C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/9.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe:
>  error: lto-wrapper failed
> {code}
> You can reproduce this in R on Windows for example like so:
> {code:r}
> dir.create("~/.R")
> writeLines("CPPFLAGS=-flto", con = "~/.R/Makevars")
> install.packages("arrow", type = 'source')
> {code}
> I am not sure if this is a bug in the toolchain, or in arrow. I tried with 
> both gcc-8.3.0 and gcc-9.3.0, and the result is the same. I did find [this 
> issue|https://github.com/cycfi/elements/pull/56] in another project which 
> suggests enabling `INTERPROCEDURAL_OPTIMIZATION` in cmake, when mixing lto 
> code with non-lto code (which is the case when we only build the r bindings 
> with lto, but not the c++ library).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-5123) [Rust] derive RecordWriter from struct definitions

2020-09-14 Thread Neville Dipale (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17195410#comment-17195410
 ] 

Neville Dipale commented on ARROW-5123:
---

I'm unable to assign to Xavier

> [Rust] derive RecordWriter from struct definitions
> --
>
> Key: ARROW-5123
> URL: https://issues.apache.org/jira/browse/ARROW-5123
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Reporter: Xavier Lange
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 14h 20m
>  Remaining Estimate: 0h
>
> Migrated from previous github issue (which saw a lot of comments but at a 
> rough transition time in the project): 
> https://github.com/sunchao/parquet-rs/pull/197
>  
> Goal
> ===
> Writing many columns to a file is a chore. If you can put your values into a 
> struct which mirrors the schema of your file, this 
> `derive(ParquetRecordWriter)` will write out all the fields, in the order in 
> which they are defined, to a row_group.
> How to Use
> ===
> ```
> extern crate parquet;
> #[macro_use] extern crate parquet_derive;
> #[derive(ParquetRecordWriter)]
> struct ACompleteRecord<'a> {
>   pub a_bool: bool,
>   pub a_str: &'a str,
> }
> ```
> RecordWriter trait
> ===
> This is the new trait which `parquet_derive` will implement for your structs.
> ```
> use super::RowGroupWriter;
> pub trait RecordWriter<T> {
>   fn write_to_row_group(&self, row_group_writer: &mut Box<RowGroupWriter>);
> }
> ```
> How does it work?
> ===
> The `parquet_derive` crate adds code generating functionality to the rust 
> compiler. The code generation takes rust syntax and emits additional syntax. 
> This macro expansion works on rust 1.15+ stable. This is a dynamic plugin, 
> loaded by the machinery in cargo. Users don't have to do any special 
> `build.rs` steps or anything like that, it's automatic by including 
> `parquet_derive` in their project. The `parquet_derive/src/Cargo.toml` has a 
> section saying as much:
> ```
> [lib]
> proc-macro = true
> ```
> The rust struct tagged with `#[derive(ParquetRecordWriter)]` is provided to 
> the `parquet_record_writer` function in `parquet_derive/src/lib.rs`. The 
> `syn` crate parses the struct from a string-representation to an AST (a 
> recursive enum value). The AST contains all the values I care about when 
> generating a `RecordWriter` impl:
>  - the name of the struct
>  - the lifetime variables of the struct
>  - the fields of the struct
> The fields of the struct are translated from AST to a flat `FieldInfo` 
> struct. It has the bits I care about for writing a column: `field_name`, 
> `field_lifetime`, `field_type`, `is_option`, `column_writer_variant`.
> The code then does the equivalent of templating to build the `RecordWriter` 
> implementation. The templating functionality is provided by the `quote` 
> crate. At a high-level the template for `RecordWriter` looks like:
> ```
> impl RecordWriter for $struct_name {
>   fn write_row_group(..) {
>     $({
>   $column_writer_snippet
>     })
>   } 
> }
> ```
> this template is then added under the struct definition, ending up something 
> like:
> ```
> struct MyStruct {
> }
> impl RecordWriter for MyStruct {
>   fn write_row_group(..) {
>     {
>    write_col_1();
>     };
>    {
>    write_col_2();
>    }
>   }
> }
> ```
> and finally _THIS_ is the code passed to rustc. It's just code now, fully 
> expanded and standalone. If a user ever changes their `struct MyValue` 
> definition the `ParquetRecordWriter` will be regenerated. There's no 
> intermediate values to version control or worry about.
> Viewing the Derived Code
> ===
> To see the generated code before it's compiled, one very useful bit is to 
> install `cargo expand` [more info on 
> gh](https://github.com/dtolnay/cargo-expand), then you can do:
> ```
> $WORK_DIR/parquet-rs/parquet_derive_test
> cargo expand --lib > ../temp.rs
> ```
> then you can dump the contents:
> ```
> struct DumbRecord {
>     pub a_bool: bool,
>     pub a2_bool: bool,
> }
> impl RecordWriter<DumbRecord> for &[DumbRecord] {
>     fn write_to_row_group(
>         &self,
>         row_group_writer: &mut Box<RowGroupWriter>,
>     ) {
>     let mut row_group_writer = row_group_writer;
>     {
>     let vals: Vec<bool> = self.iter().map(|x| x.a_bool).collect();
>     let mut column_writer = 
> row_group_writer.next_column().unwrap().unwrap();
>     if let 
> parquet::column::writer::ColumnWriter::BoolColumnWriter(ref mut typed) =
>     column_writer
>     {
>     typed.write_batch(&vals[..], None, None).unwrap();
>     }
>     row_group_writer.close_column(column_writer).unwrap();
>     };
>     {
>     let vals: Vec<bool> = self.iter().map(|x| x.a2_bool).collect();
>     let mut 

[jira] [Resolved] (ARROW-5123) [Rust] derive RecordWriter from struct definitions

2020-09-14 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale resolved ARROW-5123.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 4140
[https://github.com/apache/arrow/pull/4140]

> [Rust] derive RecordWriter from struct definitions
> --
>
> Key: ARROW-5123
> URL: https://issues.apache.org/jira/browse/ARROW-5123
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Reporter: Xavier Lange
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 14h 10m
>  Remaining Estimate: 0h
>
> Migrated from previous github issue (which saw a lot of comments but at a 
> rough transition time in the project): 
> https://github.com/sunchao/parquet-rs/pull/197
>  
> Goal
> ===
> Writing many columns to a file is a chore. If you can put your values into a 
> struct which mirrors the schema of your file, this 
> `derive(ParquetRecordWriter)` will write out all the fields, in the order in 
> which they are defined, to a row_group.
> How to Use
> ===
> ```
> extern crate parquet;
> #[macro_use] extern crate parquet_derive;
> #[derive(ParquetRecordWriter)]
> struct ACompleteRecord<'a> {
>   pub a_bool: bool,
>   pub a_str: &'a str,
> }
> ```
> RecordWriter trait
> ===
> This is the new trait which `parquet_derive` will implement for your structs.
> ```
> use super::RowGroupWriter;
> pub trait RecordWriter<T> {
>   fn write_to_row_group(&self, row_group_writer: &mut Box<RowGroupWriter>);
> }
> ```
> How does it work?
> ===
> The `parquet_derive` crate adds code generating functionality to the rust 
> compiler. The code generation takes rust syntax and emits additional syntax. 
> This macro expansion works on rust 1.15+ stable. This is a dynamic plugin, 
> loaded by the machinery in cargo. Users don't have to do any special 
> `build.rs` steps or anything like that, it's automatic by including 
> `parquet_derive` in their project. The `parquet_derive/src/Cargo.toml` has a 
> section saying as much:
> ```
> [lib]
> proc-macro = true
> ```
> The rust struct tagged with `#[derive(ParquetRecordWriter)]` is provided to 
> the `parquet_record_writer` function in `parquet_derive/src/lib.rs`. The 
> `syn` crate parses the struct from a string-representation to an AST (a 
> recursive enum value). The AST contains all the values I care about when 
> generating a `RecordWriter` impl:
>  - the name of the struct
>  - the lifetime variables of the struct
>  - the fields of the struct
> The fields of the struct are translated from AST to a flat `FieldInfo` 
> struct. It has the bits I care about for writing a column: `field_name`, 
> `field_lifetime`, `field_type`, `is_option`, `column_writer_variant`.
> The code then does the equivalent of templating to build the `RecordWriter` 
> implementation. The templating functionality is provided by the `quote` 
> crate. At a high-level the template for `RecordWriter` looks like:
> ```
> impl RecordWriter for $struct_name {
>   fn write_row_group(..) {
>     $({
>   $column_writer_snippet
>     })
>   } 
> }
> ```
> this template is then added under the struct definition, ending up something 
> like:
> ```
> struct MyStruct {
> }
> impl RecordWriter for MyStruct {
>   fn write_row_group(..) {
>     {
>    write_col_1();
>     };
>    {
>    write_col_2();
>    }
>   }
> }
> ```
> and finally _THIS_ is the code passed to rustc. It's just code now, fully 
> expanded and standalone. If a user ever changes their `struct MyValue` 
> definition the `ParquetRecordWriter` will be regenerated. There's no 
> intermediate values to version control or worry about.
> Viewing the Derived Code
> ===
> To see the generated code before it's compiled, one very useful bit is to 
> install `cargo expand` [more info on 
> gh](https://github.com/dtolnay/cargo-expand), then you can do:
> ```
> $WORK_DIR/parquet-rs/parquet_derive_test
> cargo expand --lib > ../temp.rs
> ```
> then you can dump the contents:
> ```
> struct DumbRecord {
>     pub a_bool: bool,
>     pub a2_bool: bool,
> }
> impl RecordWriter<DumbRecord> for &[DumbRecord] {
>     fn write_to_row_group(
>         &self,
>         row_group_writer: &mut Box<RowGroupWriter>,
>     ) {
>     let mut row_group_writer = row_group_writer;
>     {
>     let vals: Vec<bool> = self.iter().map(|x| x.a_bool).collect();
>     let mut column_writer = 
> row_group_writer.next_column().unwrap().unwrap();
>     if let 
> parquet::column::writer::ColumnWriter::BoolColumnWriter(ref mut typed) =
>     column_writer
>     {
>     typed.write_batch(&vals[..], None, None).unwrap();
>     }
>     row_group_writer.close_column(column_writer).unwrap();
>     };
>     {
>     let vals: Vec<bool> = 

[jira] [Assigned] (ARROW-9976) [Python] ArrowCapacityError when doing Table.from_pandas with large dataframe

2020-09-14 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs reassigned ARROW-9976:
--

Assignee: Krisztian Szucs

> [Python] ArrowCapacityError when doing Table.from_pandas with large dataframe
> -
>
> Key: ARROW-9976
> URL: https://issues.apache.org/jira/browse/ARROW-9976
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 1.0.1
>Reporter: quentin lhoest
>Assignee: Krisztian Szucs
>Priority: Minor
>
> When calling Table.from_pandas() with a large dataset with a column of 
> vectors (np.array), there is an `ArrowCapacityError`
> To reproduce:
> {code:python}
> import pandas as pd
> import numpy as np
> import pyarrow as pa
> n = 1713614
> df = pd.DataFrame.from_dict({"a": list(np.zeros((n, 128))), "b": range(n)})
> pa.Table.from_pandas(df)
> {code}
> With a smaller n it works.
> Error raised:
> {noformat}
> ---
> ArrowCapacityErrorTraceback (most recent call last)
>  in 
> > 1 _ = pa.Table.from_pandas(df)
> ~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/table.pxi in 
> pyarrow.lib.Table.from_pandas()
> ~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/pandas_compat.py
>  in dataframe_to_arrays(df, schema, preserve_index, nthreads, columns, safe)
> 591 for i, maybe_fut in enumerate(arrays):
> 592 if isinstance(maybe_fut, futures.Future):
> --> 593 arrays[i] = maybe_fut.result()
> 594 
> 595 types = [x.type for x in arrays]
> ~/.pyenv/versions/3.7.2/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/_base.py
>  in result(self, timeout)
> 423 raise CancelledError()
> 424 elif self._state == FINISHED:
> --> 425 return self.__get_result()
> 426 
> 427 self._condition.wait(timeout)
> ~/.pyenv/versions/3.7.2/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/_base.py
>  in __get_result(self)
> 382 def __get_result(self):
> 383 if self._exception:
> --> 384 raise self._exception
> 385 else:
> 386 return self._result
> ~/.pyenv/versions/3.7.2/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/thread.py
>  in run(self)
>  55 
>  56 try:
> ---> 57 result = self.fn(*self.args, **self.kwargs)
>  58 except BaseException as exc:
>  59 self.future.set_exception(exc)
> ~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/pandas_compat.py
>  in convert_column(col, field)
> 557 
> 558 try:
> --> 559 result = pa.array(col, type=type_, from_pandas=True, 
> safe=safe)
> 560 except (pa.ArrowInvalid,
> 561 pa.ArrowNotImplementedError,
> ~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/array.pxi in 
> pyarrow.lib.array()
> ~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/array.pxi in 
> pyarrow.lib._ndarray_to_array()
> ~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/error.pxi in 
> pyarrow.lib.check_status()
> ArrowCapacityError: List array cannot contain more than 2147483646 child 
> elements, have 2147483648
> {noformat}
> I guess one needs to chunk the data before creating the arrays ?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-9984) [Rust] [DataFusion] DRY of function to string

2020-09-14 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale resolved ARROW-9984.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8176
[https://github.com/apache/arrow/pull/8176]

> [Rust] [DataFusion] DRY of function to string
> -
>
> Key: ARROW-9984
> URL: https://issues.apache.org/jira/browse/ARROW-9984
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Jorge
>Assignee: Jorge
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9992) [C++][Python] Refactor python to arrow conversions based on a reusable conversion API

2020-09-14 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-9992:
--

 Summary: [C++][Python] Refactor python to arrow conversions based 
on a reusable conversion API 
 Key: ARROW-9992
 URL: https://issues.apache.org/jira/browse/ARROW-9992
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Python
Reporter: Krisztian Szucs
Assignee: Krisztian Szucs
 Fix For: 2.0.0


We have a lot of technical debt accumulated in the python to arrow conversion 
code paths including hidden bugs. We need to simplify the implementation. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9991) [C++] split kernels for strings/binary

2020-09-14 Thread Maarten Breddels (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maarten Breddels updated ARROW-9991:

Summary: [C++] split kernels for strings/binary  (was: [C++] split kernsl 
for strings/binary)

> [C++] split kernels for strings/binary
> --
>
> Key: ARROW-9991
> URL: https://issues.apache.org/jira/browse/ARROW-9991
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Maarten Breddels
>Assignee: Maarten Breddels
>Priority: Major
>
> Similar to Python str.split and bytes.split, we'd like to have a way to 
> convert str into list[str] (and similarly for bytes).
> When the separator is given, the algorithms for both types are the same. 
> Python, however, overloads split. When given no separator, the algorithm will 
> split considering all whitespace (unicode for str, ascii for bytes) as 
> separator.
> I'd rather not see too many overloaded kernels, e.g.
> binary_split (takes string/binary separator, and maxsplit arg, no special 
> utf8 version needed)
> utf8_split_whitespace (similar to Python's version given no separator)
> ascii_split_whitespace (similar to Python's version given no separator, but 
> considering ascii, although this could work on any binary data)
> there can also be rsplit versions of these, or they could be an argument.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9991) [C++] split kernsl for strings/binary

2020-09-14 Thread Maarten Breddels (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maarten Breddels updated ARROW-9991:

Description: 
Similar to Python str.split and bytes.split, we'd like to have a way to convert 
str into list[str] (and similarly for bytes).

When the separator is given, the algorithms for both types are the same. 
Python, however, overloads split. When given no separator, the algorithm will 
split considering all whitespace (unicode for str, ascii for bytes) as 
separator.

I'd rather not see too many overloaded kernels, e.g.

binary_split (takes string/binary separator, and maxsplit arg, no special utf8 
version needed)

utf8_split_whitespace (similar to Python's version given no separator)

ascii_split_whitespace (similar to Python's version given no separator, but 
considering ascii, although this could work on any binary data)

there can also be rsplit versions of these, or they could be an argument.

 

  was:
Similar to Python str.split and bytes.split, we'd like to have a way to convert 
str into list[str] (and similarly for bytes).

When the separator is given, the algorithms for both types are the same. 
Python, however, overloads split. When given no separator, the algorithm will 
split considering all whitespace (unicode for str, ascii for bytes) as 
separator.

I'd rather not see too many overloaded kernels, e.g.
 # 
binary_split (takes string/binary separator, and maxsplit arg, no special utf8 
version needed)


 
utf8_split_whitespace (similar to Python's version given no separator)
asi


> [C++] split kernsl for strings/binary
> -
>
> Key: ARROW-9991
> URL: https://issues.apache.org/jira/browse/ARROW-9991
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Maarten Breddels
>Assignee: Maarten Breddels
>Priority: Major
>
> Similar to Python str.split and bytes.split, we'd like to have a way to 
> convert str into list[str] (and similarly for bytes).
> When the separator is given, the algorithms for both types are the same. 
> Python, however, overloads split. When given no separator, the algorithm will 
> split considering all whitespace (unicode for str, ascii for bytes) as 
> separator.
> I'd rather not see too many overloaded kernels, e.g.
> binary_split (takes string/binary separator, and maxsplit arg, no special 
> utf8 version needed)
> utf8_split_whitespace (similar to Python's version given no separator)
> ascii_split_whitespace (similar to Python's version given no separator, but 
> considering ascii, although this could work on any binary data)
> there can also be rsplit versions of these, or they could be an argument.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9991) [C++] split kernsl for strings/binary

2020-09-14 Thread Maarten Breddels (Jira)
Maarten Breddels created ARROW-9991:
---

 Summary: [C++] split kernsl for strings/binary
 Key: ARROW-9991
 URL: https://issues.apache.org/jira/browse/ARROW-9991
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Maarten Breddels
Assignee: Maarten Breddels


Similar to Python str.split and bytes.split, we'd like to have a way to convert 
str into list[str] (and similarly for bytes).

When the separator is given, the algorithms for both types are the same. 
Python, however, overloads split. When given no separator, the algorithm will 
split considering all whitespace (unicode for str, ascii for bytes) as 
separator.

I'd rather not see too many overloaded kernels, e.g.
 # 
binary_split (takes string/binary separator, and maxsplit arg, no special utf8 
version needed)


 
utf8_split_whitespace (similar to Python's version given no separator)
asi



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9976) [Python] ArrowCapacityError when doing Table.from_pandas with large dataframe

2020-09-14 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17195249#comment-17195249
 ] 

Joris Van den Bossche commented on ARROW-9976:
--

[~lhoestq] Thanks for the report. Yes, for now you will need to chunk the data 
yourself before converting to pyarrow, but this might be something that pyarrow 
should do for you.

cc [~kszucs] might be a relevant case for your python conversion refactor?

> [Python] ArrowCapacityError when doing Table.from_pandas with large dataframe
> -
>
> Key: ARROW-9976
> URL: https://issues.apache.org/jira/browse/ARROW-9976
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 1.0.1
>Reporter: quentin lhoest
>Priority: Minor
>
> When calling Table.from_pandas() with a large dataset with a column of 
> vectors (np.array), there is an `ArrowCapacityError`
> To reproduce:
> {code:python}
> import pandas as pd
> import numpy as np
> import pyarrow as pa
> n = 1713614
> df = pd.DataFrame.from_dict({"a": list(np.zeros((n, 128))), "b": range(n)})
> pa.Table.from_pandas(df)
> {code}
> With a smaller n it works.
> Error raised:
> {noformat}
> ---
> ArrowCapacityErrorTraceback (most recent call last)
>  in 
> > 1 _ = pa.Table.from_pandas(df)
> ~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/table.pxi in 
> pyarrow.lib.Table.from_pandas()
> ~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/pandas_compat.py
>  in dataframe_to_arrays(df, schema, preserve_index, nthreads, columns, safe)
> 591 for i, maybe_fut in enumerate(arrays):
> 592 if isinstance(maybe_fut, futures.Future):
> --> 593 arrays[i] = maybe_fut.result()
> 594 
> 595 types = [x.type for x in arrays]
> ~/.pyenv/versions/3.7.2/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/_base.py
>  in result(self, timeout)
> 423 raise CancelledError()
> 424 elif self._state == FINISHED:
> --> 425 return self.__get_result()
> 426 
> 427 self._condition.wait(timeout)
> ~/.pyenv/versions/3.7.2/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/_base.py
>  in __get_result(self)
> 382 def __get_result(self):
> 383 if self._exception:
> --> 384 raise self._exception
> 385 else:
> 386 return self._result
> ~/.pyenv/versions/3.7.2/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/thread.py
>  in run(self)
>  55 
>  56 try:
> ---> 57 result = self.fn(*self.args, **self.kwargs)
>  58 except BaseException as exc:
>  59 self.future.set_exception(exc)
> ~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/pandas_compat.py
>  in convert_column(col, field)
> 557 
> 558 try:
> --> 559 result = pa.array(col, type=type_, from_pandas=True, 
> safe=safe)
> 560 except (pa.ArrowInvalid,
> 561 pa.ArrowNotImplementedError,
> ~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/array.pxi in 
> pyarrow.lib.array()
> ~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/array.pxi in 
> pyarrow.lib._ndarray_to_array()
> ~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/error.pxi in 
> pyarrow.lib.check_status()
> ArrowCapacityError: List array cannot contain more than 2147483646 child 
> elements, have 2147483648
> {noformat}
> I guess one needs to chunk the data before creating the arrays ?
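
A minimal sketch of that manual chunking (the helper name and chunk size are 
made up here; 1,000,000 rows keeps 1_000_000 * 128 child elements well under 
the 2**31 - 2 limit):

{code:python}
import numpy as np
import pandas as pd
import pyarrow as pa

def table_from_pandas_chunked(df, chunk_rows=1_000_000):
    # Convert slice by slice so no single ListArray exceeds the child
    # element limit, then stitch the slices together as chunked columns.
    pieces = [
        pa.Table.from_pandas(df.iloc[start:start + chunk_rows])
        for start in range(0, len(df), chunk_rows)
    ]
    return pa.concat_tables(pieces)

n = 1713614
df = pd.DataFrame.from_dict({"a": list(np.zeros((n, 128))), "b": range(n)})
table = table_from_pandas_chunked(df)  # works where the one-shot call overflows
{code}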



--
This message was sent by Atlassian Jira
(v8.3.4#803005)