[jira] [Created] (ARROW-9474) Column type inference in read_csv vs. open_csv. CSV conversion error to null.

2020-07-14 Thread Sep Dehpour (Jira)
Sep Dehpour created ARROW-9474:
--

 Summary: Column type inference in read_csv vs. open_csv. CSV 
conversion error to null.
 Key: ARROW-9474
 URL: https://issues.apache.org/jira/browse/ARROW-9474
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Sep Dehpour


The open_csv stream reader does not adjust the inferred column type based on 
data seen in later blocks.

For example, if a CSV column contains only null values in the first block read 
by open_csv, the column is inferred as null type. When PyArrow then iterates 
over later blocks and sees non-null values in that column, it crashes.

Example Error:
{code:java}
pyarrow.lib.ArrowInvalid: In CSV column #44: CSV conversion error to null: 
invalid value '-176400' {code}
 

This problem goes away if ReadOptions with a huge block_size is passed to 
open_csv, but that negates the whole point of streaming with open_csv instead 
of read_csv.
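
For reference, a minimal sketch of two ways to avoid the crash (the file name 
and column name are illustrative; pinning the column type via ConvertOptions is 
an additional workaround not mentioned above):
{code:python}
import pyarrow as pa
import pyarrow.csv as csv

# 1) Huge block size, so type inference sees (nearly) the whole file at once --
#    works, but defeats the purpose of streaming.
reader = csv.open_csv(
    "data.csv",
    read_options=csv.ReadOptions(block_size=1 << 30),
)

# 2) Pin the column type explicitly, so later non-null values still convert.
reader = csv.open_csv(
    "data.csv",
    convert_options=csv.ConvertOptions(column_types={"col44": pa.int64()}),
)

table = reader.read_all()  # or consume batch by batch with read_next_batch()
{code}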

 

System info:

PyArrow 0.17.1, Mac OS Catalina, Python 3.7.4



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9473) [Doc] Polishing for 1.0

2020-07-14 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-9473:
--

 Summary: [Doc] Polishing for 1.0
 Key: ARROW-9473
 URL: https://issues.apache.org/jira/browse/ARROW-9473
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Documentation, R
Reporter: Neal Richardson
Assignee: Neal Richardson






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [arrow-testing] wesm opened a new pull request #39: ARROW-9399: [C++] Check in serialized schema with MetadataVersion::V6

2020-07-14 Thread GitBox


wesm opened a new pull request #39:
URL: https://github.com/apache/arrow-testing/pull/39


   Testing file needed for forward compatibility test.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow-testing] wesm merged pull request #39: ARROW-9399: [C++] Check in serialized schema with MetadataVersion::V6

2020-07-14 Thread GitBox


wesm merged pull request #39:
URL: https://github.com/apache/arrow-testing/pull/39


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Created] (ARROW-9472) [R] Provide configurable MetadataVersion in IPC API and environment variable to set default to V4 when needed

2020-07-14 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-9472:
--

 Summary: [R] Provide configurable MetadataVersion in IPC API and 
environment variable to set default to V4 when needed
 Key: ARROW-9472
 URL: https://issues.apache.org/jira/browse/ARROW-9472
 Project: Apache Arrow
  Issue Type: New Feature
  Components: R
Reporter: Neal Richardson
Assignee: Neal Richardson
 Fix For: 1.0.0


See ARROW-9395 for the Python version of this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9471) [C++] Scan Dataset in reverse

2020-07-14 Thread Maarten Breddels (Jira)
Maarten Breddels created ARROW-9471:
---

 Summary: [C++] Scan Dataset in reverse
 Key: ARROW-9471
 URL: https://issues.apache.org/jira/browse/ARROW-9471
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Maarten Breddels


If a dataset does not fit into the OS cache, it can be beneficial to alternate 
between normal and reverse 'scanning'. Even if 90% of a set of files fits into 
the cache, scanning the same set twice (forward both times) will not make use 
of the OS cache. If, on the other hand, the second pass scans in reverse order, 
90% of the data will still be in the OS cache. We use this trick in vaex, and 
I'd like to support it for Parquet reading as well. (Is there a proper 
name/term for this?)

Note that since you don't want to reverse at the byte level, you would instead 
reverse the order in which fragments, or fragments and row groups, are 
traversed. Chunks that are too small (e.g. pages) could hurt performance, 
because most read paths implement read-ahead (not the reverse). Doing this at 
the fragment level might be enough.
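
A rough Python sketch of the fragment-level version of this idea (the dataset 
path and the two-pass loop are illustrative only; pyarrow has no built-in 
reverse scan, so the reversal is done by hand):
{code:python}
import pyarrow.dataset as ds

dataset = ds.dataset("/data/taxi_parquet")  # illustrative path

for pass_number in range(2):
    fragments = list(dataset.get_fragments())
    if pass_number % 2 == 1:
        # Reverse traversal at fragment granularity, so the files read last in
        # the previous pass (still warm in the OS cache) are read first now.
        fragments.reverse()
    for fragment in fragments:
        table = fragment.to_table()
        # ... process table ...
{code}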



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9470) [CI][Java] Run Maven in parallel

2020-07-14 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-9470:
-

 Summary: [CI][Java] Run Maven in parallel
 Key: ARROW-9470
 URL: https://issues.apache.org/jira/browse/ARROW-9470
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration, Developer Tools, Java
Reporter: Antoine Pitrou


It looks like Maven nowadays supports multi-threaded builds, but we're not 
using them:
https://cwiki.apache.org/confluence/display/MAVEN/Parallel+builds+in+Maven+3
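
For reference, enabling it is a one-liner on the Maven command line ('1C' means 
one thread per available core; the exact thread count to use on CI is an open 
question):
{code}
mvn -T 1C clean install
{code}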




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9469) [Python] Make more objects weakrefable

2020-07-14 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-9469:
-

 Summary: [Python] Make more objects weakrefable
 Key: ARROW-9469
 URL: https://issues.apache.org/jira/browse/ARROW-9469
 Project: Apache Arrow
  Issue Type: Wish
  Components: Python
Reporter: Antoine Pitrou
 Fix For: 2.0.0


Currently, some PyArrow objects (like Array) are weakrefable, but others (like 
Buffer) are not. There's no reason not to allow this; it just needs the 
required (short) boilerplate.
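
A quick illustration of the asymmetry described above (pa.py_buffer is just one 
way to obtain a Buffer):
{code:python}
import weakref
import pyarrow as pa

weakref.ref(pa.array([1, 2, 3]))    # Array: works today
weakref.ref(pa.py_buffer(b"data"))  # Buffer: raises TypeError, not weakrefable
{code}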



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9468) [Python][Java] Ensure jvm module doesn't leak java buffers

2020-07-14 Thread Ryan Murray (Jira)
Ryan Murray created ARROW-9468:
--

 Summary: [Python][Java] Ensure jvm module doesn't leak java buffers
 Key: ARROW-9468
 URL: https://issues.apache.org/jira/browse/ARROW-9468
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java, Python
Reporter: Ryan Murray


As per the discussion in https://github.com/apache/arrow/pull/7753, we should 
ensure we aren't leaking JVM direct memory when the corresponding Python 
objects are garbage collected.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9467) [Rust] [Website] Create Rust-specific 1.0.0 blog post

2020-07-14 Thread Andy Grove (Jira)
Andy Grove created ARROW-9467:
-

 Summary: [Rust] [Website] Create Rust-specific 1.0.0 blog post
 Key: ARROW-9467
 URL: https://issues.apache.org/jira/browse/ARROW-9467
 Project: Apache Arrow
  Issue Type: Task
Reporter: Andy Grove
Assignee: Andy Grove


Create Rust-specific 1.0.0 blog post



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9466) [Rust] [DataFusion] Upgrade to latest version of sqlparser crate

2020-07-14 Thread Andy Grove (Jira)
Andy Grove created ARROW-9466:
-

 Summary: [Rust] [DataFusion] Upgrade to latest version of 
sqlparser crate
 Key: ARROW-9466
 URL: https://issues.apache.org/jira/browse/ARROW-9466
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust, Rust - DataFusion
Reporter: Andy Grove


We should upgrade to the latest version of the sqlparser crate so that we can 
support more complex queries, such as those used in TPC-H.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9465) [Python] Improve ergonomics of compute functions

2020-07-14 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-9465:
-

 Summary: [Python] Improve ergonomics of compute functions
 Key: ARROW-9465
 URL: https://issues.apache.org/jira/browse/ARROW-9465
 Project: Apache Arrow
  Issue Type: Wish
  Components: Python
Reporter: Antoine Pitrou


Introspection of exported compute functions currently yields suboptimal output:
{code:python}
>>> from pyarrow import compute as pc
>>> pc.list_flatten
.func(arg)>
>>> ?pc.list_flatten
Signature: pc.list_flatten(arg)
Docstring:
File:      ~/arrow/dev/python/pyarrow/compute.py
Type:      function
>>> help(pc.list_flatten)
Help on function func in module pyarrow.compute:

func(arg)
{code}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9464) [Rust] [DataFusion] Physical plan refactor to support async and optimization rules

2020-07-14 Thread Andy Grove (Jira)
Andy Grove created ARROW-9464:
-

 Summary: [Rust] [DataFusion] Physical plan refactor to support 
async and optimization rules
 Key: ARROW-9464
 URL: https://issues.apache.org/jira/browse/ARROW-9464
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust, Rust - DataFusion
Reporter: Andy Grove


I would like to propose a refactor of the physical/execution planning, based on 
experience I have had implementing distributed execution in Ballista.

This will likely need subtasks, but here is an overview of the changes I am 
proposing.

>> 1. Introduce an enum to represent the physical plan

By wrapping the execution plan structs in an enum, we make it possible to build 
a tree representing the physical plan, just as we do with the logical plan. 
This makes it easy to print physical plans and to apply transformations to them.
{code:java}
pub enum PhysicalPlan {
    /// Projection.
    Projection(Arc),
    /// Filter a.k.a. predicate.
    Filter(Arc),
    /// Hash aggregate
    HashAggregate(Arc),
    /// Performs a hash join of two child relations by first shuffling the
    /// data using the join keys.
    ShuffledHashJoin(ShuffledHashJoinExec),
    /// Performs a shuffle that will result in the desired partitioning.
    ShuffleExchange(Arc),
    /// Reads results from a ShuffleExchange
    ShuffleReader(Arc),
    /// Scans a partitioned data source
    ParquetScan(Arc),
    /// Scans an in-memory table
    InMemoryTableScan(Arc),
}
{code}
>> 2. Introduce a physical plan optimization rule to insert "shuffle" operators

We should extend the ExecutionPlan trait so that each operator can specify its 
input and output partitioning needs, and then have an optimization rule that 
can insert any repartitioning or reordering steps required.

For example, these are the methods to be added to ExecutionPlan. This design is 
based on Apache Spark.

 
{code:java}
/// Specifies how data is partitioned across different nodes in the cluster
fn output_partitioning() -> Partitioning {
    Partitioning::UnknownPartitioning(0)
}

/// Specifies the data distribution requirements of all the children for
/// this operator
fn required_child_distribution() -> Distribution {
    Distribution::UnspecifiedDistribution
}

/// Specifies how data is ordered in each partition
fn output_ordering() -> Option> {
    None
}

/// Specifies the sort ordering requirements of all the children for
/// this operator
fn required_child_ordering() -> Option>> {
    None
}
{code}
A good example of applying this rule would be in the case of hash aggregates 
where we perform a partial aggregate in parallel across partitions and then 
coalesce the results and apply a final hash aggregate.

Another example would be a SortMergeExec specifying the sort order required for 
its children.

>> 3. Make execution async

The execution plan trait should use the async keyword. This will require adding 
dependencies on async_trait and smol. This allows us to remove much of the 
manual thread management and have more efficient execution.

The main benefits of these changes are:
 # Simplify implementation of physical operators, because the optimizer will 
take care of repartitioning concerns
 # The ability to print a physical query plan
 # More efficient query execution because of the use of async
 # Easier for projects like Ballista to use DataFusion and add their own 
optimization rules e.g. replacing repartitioning steps with distributed 
equivalents

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9463) [Go] The writer is double closed in TestReadWrite

2020-07-14 Thread FredGan (Jira)
FredGan created ARROW-9463:
--

 Summary: [Go] The writer is double closed in TestReadWrite
 Key: ARROW-9463
 URL: https://issues.apache.org/jira/browse/ARROW-9463
 Project: Apache Arrow
  Issue Type: Test
  Components: Go
Affects Versions: 0.17.1
Reporter: FredGan


The writer in the test case 'TestReadWrite' is double closed.
{code:java}
w, err := NewWriter(f, recs[0].Schema())
if err != nil {
   t.Fatal(err)
}
defer w.Close() // <= Here

for i, rec := range recs {
   err = w.Write(rec)
   if err != nil {
  t.Fatalf("could not write record[%d] to JSON: %v", i, err)
   }
}

err = w.Close()// <= Here
if err != nil {
   t.Fatalf("could not close JSON writer: %v", err)
}
{code}
The 'defer w.Close()' is redundant and results in an extra ']}' at the end of 
the output JSON file.

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9462) [Go] The Indentation after the first Record arrjson writer is missing

2020-07-14 Thread FredGan (Jira)
FredGan created ARROW-9462:
--

 Summary: [Go] The Indentation after the first Record arrjson 
writer is missing
 Key: ARROW-9462
 URL: https://issues.apache.org/jira/browse/ARROW-9462
 Project: Apache Arrow
  Issue Type: Bug
  Components: Go
Affects Versions: 0.17.1
Reporter: FredGan


The `jsonRecPrefix` is missing for records after the first one in the arrjson 
writer.

This can be seen in the output file `arrjson-xx` in the TempDir, for example:
{code:java}
  "batches": [
{
  "count": 5,
  "columns": [
{
  "name": "fixed_size_binary_3",
  "count": 5,
  "VALIDITY": [
1,
0,
0,
1,
1
  ],
  "DATA": [
"303031",
"303032",
"303033",
"303034",
"303035"
  ]
}
  ]
},
{ //   <-   HERE! we can see it is not indented correctly
  "count": 5,
  "columns": [
{
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9461) [Rust] Reading Date32 and Date64 errors - they are incorrectly converted to RecordBatch

2020-07-14 Thread Jorge (Jira)
Jorge created ARROW-9461:


 Summary: [Rust] Reading Date32 and Date64 errors - they are 
incorrectly converted to RecordBatch
 Key: ARROW-9461
 URL: https://issues.apache.org/jira/browse/ARROW-9461
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Reporter: Jorge
Assignee: Jorge


Steps to reproduce:

1. Create a file `a.parquet` using the following code:


{code:python}
import pyarrow.parquet
import numpy


def _data_datetime(f):
    data = numpy.array([
        numpy.datetime64('2018-08-18 23:25'),
        numpy.datetime64('2019-08-18 23:25'),
        numpy.datetime64("NaT")
    ])
    data = numpy.array(data, dtype=f'datetime64[{f}]')
    return data


def _write_parquet(path, data):
    table = pyarrow.Table.from_arrays([pyarrow.array(data)], names=['a'])
    pyarrow.parquet.write_table(table, path)
    return path


_write_parquet('a.parquet', _data_datetime('D'))
{code}

2. Write a small example to read it to RecordBatches

3. Observe the error {{ArrowError(ParquetError("InvalidArgumentError(\"column 
types must match schema types, expected Date32(Day) but found UInt32 at column 
index 0\")"))}}







--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9460) [C++] BinaryContainsExact doesn't cope with double characters in the pattern

2020-07-14 Thread Uwe Korn (Jira)
Uwe Korn created ARROW-9460:
---

 Summary: [C++] BinaryContainsExact doesn't cope with double 
characters in the pattern
 Key: ARROW-9460
 URL: https://issues.apache.org/jira/browse/ARROW-9460
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Uwe Korn
Assignee: Uwe Korn
 Fix For: 1.0.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9459) [C++][Dataset] Make collecting/parsing statistics optional for ParquetFragment

2020-07-14 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-9459:


 Summary: [C++][Dataset] Make collecting/parsing statistics 
optional for ParquetFragment
 Key: ARROW-9459
 URL: https://issues.apache.org/jira/browse/ARROW-9459
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Joris Van den Bossche


See some timing checks here: 
https://github.com/dask/dask/pull/6346#issuecomment-656548675

Parsing all statistics, even from a centralized {{_metadata}} file, can be 
quite expensive. If you know in advance that you are not going to use them 
(e.g. you are only going to filter on the partition fields and otherwise read 
all the data), it would be nice to have an option to disable parsing statistics.
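
For context, roughly what "parsing all statistics" amounts to on the Python 
side (a sketch; the {{_metadata}} path is illustrative):
{code:python}
import pyarrow.parquet as pq

meta = pq.read_metadata("dataset/_metadata")  # illustrative path
stats = [
    meta.row_group(rg).column(col).statistics
    for rg in range(meta.num_row_groups)
    for col in range(meta.num_columns)
]
# With many row groups and columns this alone is costly, hence the request
# for an opt-out when only partition-field filtering is needed.
{code}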

cc [~rjzamora] [~bkietz] [~fsaintjacques]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9458) [Python] Dataset singlethreaded only

2020-07-14 Thread Maarten Breddels (Jira)
Maarten Breddels created ARROW-9458:
---

 Summary: [Python] Dataset singlethreaded only
 Key: ARROW-9458
 URL: https://issues.apache.org/jira/browse/ARROW-9458
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Maarten Breddels


I'm not sure whether this is a misunderstanding, a compilation issue (flags?), 
or an issue in the C++ layer.

I have 1000 parquet files with a total of 1 billion rows (1 million rows per 
file, ~20 columns). I wanted to see if I could go through all rows of 1 or 2 
columns efficiently (the vaex use case).

 
{code:python}
import pyarrow.parquet
import pyarrow as pa
import pyarrow.dataset as ds
import glob

ds = pa.dataset.dataset(glob.glob('/data/taxi_parquet/data_*.parquet'))
scanned = 0
for scan_task in ds.scan(batch_size=1_000_000, columns=['passenger_count'], use_threads=True):
    for record_batch in scan_task.execute():
        scanned += record_batch.num_rows
scanned
{code}
This only seems to use 1 cpu.

Using a threadpool from Python:
{code:python}
# %%timeit
import concurrent.futures

pool = concurrent.futures.ThreadPoolExecutor()
ds = pa.dataset.dataset(glob.glob('/data/taxi_parquet/data_*.parquet'))

def process(scan_task):
    scan_count = 0
    for record_batch in scan_task.execute():
        scan_count += len(record_batch)
    return scan_count

sum(pool.map(process, ds.scan(batch_size=1_000_000, columns=['passenger_count'], use_threads=False)))
{code}
This gives similar performance: again only 100% CPU usage (= 1 core).

py-spy (a Python profiler) shows no GIL contention, so this might be something 
at the C++ layer.

Am I 'holding it wrong', or could this be a bug? Note that IO speed is not a 
problem on this system (the data actually all comes from the OS cache; no disk 
reads observed).

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9457) [C++] TableReader support protobuf

2020-07-14 Thread Shuai Zhang (Jira)
Shuai Zhang created ARROW-9457:
--

 Summary: [C++] TableReader support protobuf
 Key: ARROW-9457
 URL: https://issues.apache.org/jira/browse/ARROW-9457
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Affects Versions: 0.17.1
Reporter: Shuai Zhang


I found there are TableReaders for the CSV & JSON formats. It would be very 
nice if the Protobuf format were also supported.

The basic idea is that the user passes in both the data file and the protobuf 
descriptor. The protobuf messages are either delimited, as CSV rows are, or 
prefixed by a (possibly encoded) message length.
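
A sketch (in Python, purely illustrative) of the length-prefixed framing 
described above; the fixed 4-byte big-endian prefix is only one possible 
encoding and is not an existing Arrow API:
{code:python}
import struct

def read_length_prefixed(path):
    """Yield raw protobuf message payloads from a length-prefixed file."""
    with open(path, "rb") as f:
        while True:
            header = f.read(4)
            if len(header) < 4:
                break
            (length,) = struct.unpack(">I", header)
            yield f.read(length)  # raw bytes; decode with the supplied descriptor

# Each payload would then be decoded with the user-supplied protobuf descriptor
# and appended to Arrow builders to form a Table.
{code}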



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9456) [Python] Dataset segfault when not importing pyarrow.parquet

2020-07-14 Thread Maarten Breddels (Jira)
Maarten Breddels created ARROW-9456:
---

 Summary: [Python] Dataset segfault when not importing 
pyarrow.parquet 
 Key: ARROW-9456
 URL: https://issues.apache.org/jira/browse/ARROW-9456
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Maarten Breddels


To reproduce:
{code:python}
# import pyarrow.parquet  # if we skip this...
import pyarrow as pa
import pyarrow.dataset as ds
import glob

ds = pa.dataset.dataset('/data/taxi_parquet/data_0.parquet')
ds.to_table()  # this will crash
{code}
{code}
$ python pyarrow/crash.py dev
terminate called after throwing an instance of 'parquet::ParquetException'
  what(): The file only has 19 columns, requested metadata for column: 1049198736
[1] 1559395 abort (core dumped)  python pyarrow/crash.py
{code}
When the import is there, it will work fine.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9455) Request: add option for taking all columns from all files in pa.dataset

2020-07-14 Thread David Cortes (Jira)
David Cortes created ARROW-9455:
---

 Summary: Request: add option for taking all columns from all files 
in pa.dataset
 Key: ARROW-9455
 URL: https://issues.apache.org/jira/browse/ARROW-9455
 Project: Apache Arrow
  Issue Type: Wish
  Components: Python
Reporter: David Cortes


In PyArrow's dataset class, if I give it multiple parquet files in a list and 
these parquet files have potentially different columns, it will always take the 
schema from the first parquet file in the list, thus ignoring columns that the 
first file doesn't have. Getting all columns within the files into the same 
dataset implies passing a manual schema or constructing one by iterating over 
the files and checking for their columns.

 

It would be nicer if PyArrow's dataset class had an option to automatically 
take the union of all columns across the files from which it is constructed, 
for example:
{code:python}
import numpy as np, pandas as pd

df1 = pd.DataFrame({
    "col1": np.arange(10),
    "col2": np.random.choice(["a", "b"], size=10)
})
df2 = pd.DataFrame({
    "col1": np.arange(10, 20),
    "col3": np.random.random(size=10)
})
df1.to_parquet("df1.parquet")
df2.to_parquet("df2.parquet")
{code}
{code:python}
import pyarrow.dataset as pds

ff = ["df1.parquet", "df2.parquet"]

### Code below will generate a DF with col1 and col2, but no col3
pds.dataset(ff, format="parquet").to_table().to_pandas()
{code}
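
In the meantime, the manual-schema workaround mentioned above can look roughly 
like this (a sketch; assumes pa.unify_schemas is available in the installed 
PyArrow and that columns missing from a file come back as nulls):
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as pds

ff = ["df1.parquet", "df2.parquet"]

# Collect every file's schema, merge them, and build the dataset with the
# unified schema so that col1, col2 and col3 all appear.
schemas = [pq.read_schema(f) for f in ff]
unified = pa.unify_schemas(schemas)
pds.dataset(ff, format="parquet", schema=unified).to_table().to_pandas()
{code}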
 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9454) [GLib] Add binding of some dictionary builders

2020-07-14 Thread Kenta Murata (Jira)
Kenta Murata created ARROW-9454:
---

 Summary: [GLib] Add binding of some dictionary builders
 Key: ARROW-9454
 URL: https://issues.apache.org/jira/browse/ARROW-9454
 Project: Apache Arrow
  Issue Type: Improvement
  Components: GLib
Reporter: Kenta Murata
Assignee: Kenta Murata






--
This message was sent by Atlassian Jira
(v8.3.4#803005)