[jira] [Created] (ARROW-10561) [Rust] Simplify `MutableBuffer::write` and `MutableBuffer::write_bytes`

2020-11-11 Thread Jira
Jorge Leitão created ARROW-10561:


 Summary: [Rust] Simplify `MutableBuffer::write` and 
`MutableBuffer::write_bytes`
 Key: ARROW-10561
 URL: https://issues.apache.org/jira/browse/ARROW-10561
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Jorge Leitão
Assignee: Jorge Leitão






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10560) [Python] Crash when creating array with string over 2GB

2020-11-11 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-10560:

 Summary: [Python] Crash when creating array with string over 2GB
 Key: ARROW-10560
 URL: https://issues.apache.org/jira/browse/ARROW-10560
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Antoine Pitrou


{code:python}
>>> import pyarrow as pa
>>> data = [b"x" * (1<<32)]
>>> arr = pa.array(data)
Segmentation fault
{code}






[jira] [Created] (ARROW-10559) [Rust] [DataFusion] Break up logical_plan/mod.rs into smaller modules

2020-11-11 Thread Andrew Lamb (Jira)
Andrew Lamb created ARROW-10559:

 Summary: [Rust] [DataFusion] Break up logical_plan/mod.rs into 
smaller modules
 Key: ARROW-10559
 URL: https://issues.apache.org/jira/browse/ARROW-10559
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Andrew Lamb
Assignee: Andrew Lamb


The module has grown fairly large, so refactoring it into smaller modules 
will improve readability, as suggested by Jorge in 
https://github.com/apache/arrow/pull/8619#pullrequestreview-527391221






[jira] [Created] (ARROW-10558) [Python] Filesystem S3 tests not independent (native s3 influences s3fs)

2020-11-11 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-10558:

 Summary: [Python] Filesystem S3 tests not independent (native s3 
influences s3fs)
 Key: ARROW-10558
 URL: https://issues.apache.org/jira/browse/ARROW-10558
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche


The filesystem tests in {{test_fs.py}} that are parametrized over all the 
tested filesystems share some "state" between them, at least in the case 
of S3. 

When a test first runs with our own S3FileSystem and, for example, creates a 
directory, that directory is still present when we test the s3fs-wrapped 
filesystem. This causes some tests to pass that would fail if run in 
isolation.
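
The leak described above can be sketched without S3 at all. The snippet 
below is only an illustration: {{FakeBucket}} and {{FakeFileSystem}} are 
hypothetical stand-ins, not pyarrow or s3fs classes, and the per-test unique 
prefix is just one possible way to isolate the tests, not the fix decided 
for this issue.

```python
import uuid


class FakeBucket:
    """Hypothetical stand-in for the single S3 bucket used by all tests."""

    def __init__(self):
        self.directories = set()


class FakeFileSystem:
    """Hypothetical wrapper; every variant talks to the same bucket."""

    def __init__(self, bucket, prefix=""):
        self.bucket = bucket
        self.prefix = prefix

    def create_dir(self, path):
        self.bucket.directories.add(self.prefix + path)

    def exists(self, path):
        return (self.prefix + path) in self.bucket.directories


def run_shared():
    # Both "filesystems" see the same namespace, so state leaks across tests.
    bucket = FakeBucket()
    native = FakeFileSystem(bucket)
    wrapped = FakeFileSystem(bucket)
    native.create_dir("test-directory")
    return wrapped.exists("test-directory")


def run_isolated():
    # Each "test" gets a unique prefix, so nothing created by one is visible
    # to the other.
    bucket = FakeBucket()
    native = FakeFileSystem(bucket, prefix=uuid.uuid4().hex + "/")
    wrapped = FakeFileSystem(bucket, prefix=uuid.uuid4().hex + "/")
    native.create_dir("test-directory")
    return wrapped.exists("test-directory")
```

With shared state the second filesystem sees the directory created by the 
first; with a unique prefix per test it does not.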





[jira] [Created] (ARROW-10557) [C++] Add scalar string slicing/substring kernel

2020-11-11 Thread Maarten Breddels (Jira)
Maarten Breddels created ARROW-10557:


 Summary: [C++] Add scalar string slicing/substring kernel 
 Key: ARROW-10557
 URL: https://issues.apache.org/jira/browse/ARROW-10557
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Maarten Breddels
Assignee: Maarten Breddels


This should implement slicing of the scalar string values of string arrays 
with Python semantics, using start, stop, and step arguments. This may seem 
similar to slicing lists or binary arrays, but string length semantics enter 
into this kernel: the length of a string need not equal its number of bytes, 
nor its number of codepoints (accents, etc. should be skipped).
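
As a rough illustration of the semantics in question, here is a small Python 
sketch. The helper names {{slice_strings}} and {{length_views}} are made up 
for this example and are not part of Arrow. Python's own {{str}} slicing 
counts codepoints, so the sketch shows the start/stop/step behaviour and the 
three possible length notions, but it does not implement accent skipping.

```python
import unicodedata


def slice_strings(values, start=None, stop=None, step=None):
    """Slice every string with Python semantics; None (null) stays None."""
    return [None if v is None else v[start:stop:step] for v in values]


def length_views(text):
    """Return three different notions of the 'length' of one string."""
    return {
        "bytes": len(text.encode("utf-8")),
        "codepoints": len(text),
        "without_combining_marks": sum(
            1 for ch in text if not unicodedata.combining(ch)
        ),
    }


# "e" followed by U+0301 COMBINING ACUTE ACCENT renders as a single "é".
decomposed = "e\u0301"
print(length_views(decomposed))
print(slice_strings(["hello", None, "caf\u00e9"], start=1, step=2))
```

For the decomposed "é" the three lengths differ (3 bytes, 2 codepoints, 1 
character once the combining mark is skipped), which is why the kernel has 
to pick its length semantics explicitly.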

 





[jira] [Created] (ARROW-10556) [C++] Caching pre computed data based on FunctionOptions in the kernel state

2020-11-11 Thread Maarten Breddels (Jira)
Maarten Breddels created ARROW-10556:


 Summary: [C++] Caching pre computed data based on FunctionOptions 
in the kernel state
 Key: ARROW-10556
 URL: https://issues.apache.org/jira/browse/ARROW-10556
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Maarten Breddels


See discussion here:

[https://github.com/apache/arrow/pull/8621#issuecomment-724796243]

 

A kernel might need to pre-compute something based on the function options 
passed. Since the Kernel-FunctionOptions mapping is not 1-to-1, it does not 
make sense to store this in the function option object. 

Currently, match_substring calculates a `prefix_table` on each Exec call. In 
trim ([https://github.com/apache/arrow/pull/8621]) we compute a vector on 
each Exec call. This should be done only once and cached in the kernel state 
instead.
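
The idea can be sketched in Python as follows. This is not Arrow's C++ 
kernel API: {{MatchSubstringState}} and its methods are hypothetical names, 
and the prefix table is assumed to be the usual Knuth-Morris-Pratt failure 
table. The point is only that the table depends on the options (the 
pattern), so it can be built once when the state is created and reused by 
every execution.

```python
def compute_prefix_table(pattern):
    """Build the Knuth-Morris-Pratt failure table for `pattern`."""
    table = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = table[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k
    return table


class MatchSubstringState:
    """Hypothetical kernel state: precomputed once per set of options."""

    def __init__(self, pattern):
        self.pattern = pattern
        # Computed a single time here instead of on every execute() call.
        self.prefix_table = compute_prefix_table(pattern)

    def execute(self, values):
        """Return, for each input string, whether it contains the pattern."""
        return [self._contains(v) for v in values]

    def _contains(self, text):
        if not self.pattern:
            return True
        k = 0
        for ch in text:
            while k > 0 and ch != self.pattern[k]:
                k = self.prefix_table[k - 1]
            if ch == self.pattern[k]:
                k += 1
            if k == len(self.pattern):
                return True
        return False


state = MatchSubstringState("ab")
first_batch = state.execute(["cab", "ba"])
second_batch = state.execute(["", "abab"])  # reuses the same prefix table
```

Keeping the table on the state rather than on the options object matches 
the reasoning above: different kernels can derive different cached data 
from the same options.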

 





[jira] [Created] (ARROW-10555) Java API get negative messageLength

2020-11-11 Thread Litchy Soong (Jira)
Litchy Soong created ARROW-10555:


 Summary: Java API get negative messageLength
 Key: ARROW-10555
 URL: https://issues.apache.org/jira/browse/ARROW-10555
 Project: Apache Arrow
  Issue Type: Bug
Affects Versions: 1.0.1
Reporter: Litchy Soong


When I call ArrowStreamReader.getVectorSchemaRoot(), I get the following 
error:

{{: (-520103681 < 0)}}
{{2020-11-09 07:09:07,033 ERROR com.intel.analytics.zoo.serving.PreProcessing - Error stack trace}}
{{java.base/java.nio.Buffer.createCapacityException(Buffer.java:256)}}
{{java.base/java.nio.ByteBuffer.allocate(ByteBuffer.java:347)}}
{{com.intel.analytics.zoo.shaded.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:692)}}
{{com.intel.analytics.zoo.shaded.arrow.vector.ipc.message.MessageChannelReader.readNext(MessageChannelReader.java:57)}}
{{com.intel.analytics.zoo.shaded.arrow.vector.ipc.ArrowStreamReader.readSchema(ArrowStreamReader.java:164)}}
{{com.intel.analytics.zoo.shaded.arrow.vector.ipc.ArrowReader.initialize(ArrowReader.java:170)}}
{{com.intel.analytics.zoo.shaded.arrow.vector.ipc.ArrowReader.ensureInitialized(ArrowReader.java:161)}}
{{com.intel.analytics.zoo.shaded.arrow.vector.ipc.ArrowReader.getVectorSchemaRoot(ArrowReader.java:63)}}

The messageLength is negative. The relevant code is in 
MessageSerializer.java:

{{messageLength = MessageSerializer.bytesToInt(buffer.array());}}

and the error is raised in

{{ByteBuffer messageBuffer = ByteBuffer.allocate(messageLength);}}

I tried to reproduce the error with a minimal example, but I could not. 
So *any ideas on when messageLength can be negative*? This error occurs 
occasionally in my program.
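
One way to investigate a value like this is to look at the four bytes that 
would decode to it. The sketch below is in Python rather than Java and 
assumes the length prefix is read as a signed little-endian 32-bit integer, 
as in the Arrow IPC format; the helper names are made up for illustration. 
Under that assumption, any four bytes whose last byte is 0x80 or higher 
decode to a negative length.

```python
import struct


def read_length_prefix(prefix: bytes) -> int:
    """Decode 4 bytes as a signed little-endian 32-bit integer (assumed layout)."""
    (length,) = struct.unpack("<i", prefix)
    return length


def length_prefix_bytes(length: int) -> bytes:
    """Inverse of read_length_prefix: the 4 bytes that decode to `length`."""
    return struct.pack("<i", length)


observed = -520103681  # value from the error message above
raw = length_prefix_bytes(observed)
print(raw.hex())  # prints "ffd8ffe0"
```

The bytes FF D8 FF E0 are also the sequence a JPEG/JFIF file starts with, so 
one possible explanation, which this sketch cannot confirm, is that the 
reader was occasionally handed raw image bytes instead of an Arrow stream.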





[jira] [Created] (ARROW-10554) [Rust] Default to non-cryptographic hash function?

2020-11-11 Thread Ritchie (Jira)
Ritchie created ARROW-10554:

 Summary: [Rust] Default to non-cryptographic hash function?
 Key: ARROW-10554
 URL: https://issues.apache.org/jira/browse/ARROW-10554
 Project: Apache Arrow
  Issue Type: Wish
Reporter: Ritchie


This isn't a major issue, or even an issue at all. 

I was wondering how you would feel about using a non-cryptographic hash 
function (such as SeaHash or Fnv) as the default for Schema (and maybe other 
locations in the crate). 

Arrow currently uses the default HashMap and HashSet. These use a 
cryptographic hasher to guard against DoS attacks, which I believe isn't 
needed most of the time. This extra security has some performance overhead.
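
For readers unfamiliar with this class of hash functions, here is a sketch 
of 64-bit FNV-1a, one variant of the Fnv family mentioned above. It is 
written in Python purely for illustration and is not the code of any Rust 
crate; the function name is made up. It shows why such hashes are cheap: one 
XOR and one multiplication per input byte, with no secret key, which is also 
why they offer no protection against crafted collisions.

```python
FNV64_OFFSET_BASIS = 0xCBF29CE484222325
FNV64_PRIME = 0x100000001B3
MASK64 = 0xFFFFFFFFFFFFFFFF


def fnv1a_64(data: bytes) -> int:
    """Compute the 64-bit FNV-1a hash of `data`."""
    h = FNV64_OFFSET_BASIS
    for byte in data:
        h ^= byte                       # mix in the next byte
        h = (h * FNV64_PRIME) & MASK64  # multiply, keeping 64 bits
    return h


field_name_hash = fnv1a_64(b"column_a")
```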


