[jira] [Created] (ARROW-9474) Column type inference in read_csv vs. open_csv. CSV conversion error to null.
Sep Dehpour created ARROW-9474:

Summary: Column type inference in read_csv vs. open_csv. CSV conversion error to null.
Key: ARROW-9474
URL: https://issues.apache.org/jira/browse/ARROW-9474
Project: Apache Arrow
Issue Type: Bug
Reporter: Sep Dehpour

The open_csv stream does not adjust the inferred column type based on new data seen in later blocks. For example, if a CSV has only null values in a column in the first few blocks read by open_csv, the column is inferred as the Null type. As PyArrow iterates over later blocks and encounters non-null values in that column, it crashes.

Example error:
{code:java}
pyarrow.lib.ArrowInvalid: In CSV column #44: CSV conversion error to null: invalid value '-176400'
{code}

The problem is resolved if ReadOptions with a huge block size is passed to open_csv, but that negates the whole point of having a stream instead of read_csv.

System info: PyArrow 0.17.1, macOS Catalina, Python 3.7.4
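A minimal sketch of the workaround described above, assuming pyarrow.csv.ReadOptions and the streaming reader's read_next_batch(); the file name and block size are placeholders, and a larger block_size only makes it more likely that the type-inference block contains non-null values:

{code:python}
import pyarrow.csv as csv

# block_size is deliberately large so the first block, which drives type
# inference, is likely to contain non-null values for every column.
read_options = csv.ReadOptions(block_size=1 << 30)
reader = csv.open_csv("data.csv", read_options=read_options)

while True:
    try:
        batch = reader.read_next_batch()  # one RecordBatch per block
    except StopIteration:
        break
    # process batch here
{code}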
[jira] [Created] (ARROW-9473) [Doc] Polishing for 1.0
Neal Richardson created ARROW-9473:

Summary: [Doc] Polishing for 1.0
Key: ARROW-9473
URL: https://issues.apache.org/jira/browse/ARROW-9473
Project: Apache Arrow
Issue Type: New Feature
Components: Documentation, R
Reporter: Neal Richardson
Assignee: Neal Richardson
[GitHub] [arrow-testing] wesm opened a new pull request #39: ARROW-9399: [C++] Check in serialized schema with MetadataVersion::V6
wesm opened a new pull request #39:
URL: https://github.com/apache/arrow-testing/pull/39

Testing file needed for forward compatibility test.
[GitHub] [arrow-testing] wesm merged pull request #39: ARROW-9399: [C++] Check in serialized schema with MetadataVersion::V6
wesm merged pull request #39:
URL: https://github.com/apache/arrow-testing/pull/39
[jira] [Created] (ARROW-9472) [R] Provide configurable MetadataVersion in IPC API and environment variable to set default to V4 when needed
Neal Richardson created ARROW-9472:

Summary: [R] Provide configurable MetadataVersion in IPC API and environment variable to set default to V4 when needed
Key: ARROW-9472
URL: https://issues.apache.org/jira/browse/ARROW-9472
Project: Apache Arrow
Issue Type: New Feature
Components: R
Reporter: Neal Richardson
Assignee: Neal Richardson
Fix For: 1.0.0

See ARROW-9395 for the Python version of this.
[jira] [Created] (ARROW-9471) [C++] Scan Dataset in reverse
Maarten Breddels created ARROW-9471:

Summary: [C++] Scan Dataset in reverse
Key: ARROW-9471
URL: https://issues.apache.org/jira/browse/ARROW-9471
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Maarten Breddels

If a dataset does not fit into the OS cache, it can be beneficial to alternate between normal and reverse 'scanning'. Even if 90% of a set of files fits into the cache, scanning the same set twice in the same order will not make use of the OS cache. On the other hand, if the second scan goes in reverse order, 90% of the data will still be in the OS cache. We use this trick in vaex, and I'd like to support it for Parquet reading as well. (Is there a proper name/term for this?)

Note that since you don't want to reverse at the byte level, you may want to reverse the order in which fragments, or fragments and row groups, are traversed. Chunks that are too small (e.g. pages) could decrease performance, because most read algorithms implement read-ahead optimization (not the reverse). I think doing this at the fragment level might be enough.
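For illustration only, a plain-Python sketch of the alternating-direction idea at the fragment level; the scan_passes helper and the fragment names are hypothetical, not part of the Dataset API:

{code:python}
# Alternate the traversal direction between passes so that the fragments read
# last in one pass (and still resident in the OS cache) are read first in the
# next pass.
def scan_passes(fragments, n_passes):
    for i in range(n_passes):
        ordered = fragments if i % 2 == 0 else list(reversed(fragments))
        for fragment in ordered:
            yield fragment

# Usage: two passes over the same fragments, the second one in reverse order.
for fragment in scan_passes(["part-0.parquet", "part-1.parquet", "part-2.parquet"], n_passes=2):
    print(fragment)
{code}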
[jira] [Created] (ARROW-9470) [CI][Java] Run Maven in parallel
Antoine Pitrou created ARROW-9470:

Summary: [CI][Java] Run Maven in parallel
Key: ARROW-9470
URL: https://issues.apache.org/jira/browse/ARROW-9470
Project: Apache Arrow
Issue Type: Improvement
Components: Continuous Integration, Developer Tools, Java
Reporter: Antoine Pitrou

It looks like Maven nowadays supports multi-threaded builds, but we're not using them:
https://cwiki.apache.org/confluence/display/MAVEN/Parallel+builds+in+Maven+3
[jira] [Created] (ARROW-9469) [Python] Make more objects weakrefable
Antoine Pitrou created ARROW-9469:

Summary: [Python] Make more objects weakrefable
Key: ARROW-9469
URL: https://issues.apache.org/jira/browse/ARROW-9469
Project: Apache Arrow
Issue Type: Wish
Components: Python
Reporter: Antoine Pitrou
Fix For: 2.0.0

Currently, some PyArrow objects (like Array) are weakrefable, but others (like Buffer) are not. There's no reason not to allow that; it just needs the required (short) boilerplate.
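A small sketch of the behavior described above, taking the report's premise that Array supports weak references while Buffer does not (the exact behavior may differ across PyArrow versions):

{code:python}
import weakref

import pyarrow as pa

arr = pa.array([1, 2, 3])
ref = weakref.ref(arr)      # works: Array instances are weakrefable
assert ref() is arr

buf = pa.allocate_buffer(16)
try:
    weakref.ref(buf)        # per the report, Buffer is not weakrefable yet
except TypeError as exc:
    print("Buffer is not weakrefable:", exc)
{code}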
[jira] [Created] (ARROW-9468) [Python][Java] Ensure jvm module doesn't leak java buffers
Ryan Murray created ARROW-9468:

Summary: [Python][Java] Ensure jvm module doesn't leak java buffers
Key: ARROW-9468
URL: https://issues.apache.org/jira/browse/ARROW-9468
Project: Apache Arrow
Issue Type: Improvement
Components: Java, Python
Reporter: Ryan Murray

As per the discussion in https://github.com/apache/arrow/pull/7753, we should ensure we aren't leaking JVM direct memory when Python objects are collected.
[jira] [Created] (ARROW-9467) [Rust] [Website] Create Rust-specific 1.0.0 blog post
Andy Grove created ARROW-9467:

Summary: [Rust] [Website] Create Rust-specific 1.0.0 blog post
Key: ARROW-9467
URL: https://issues.apache.org/jira/browse/ARROW-9467
Project: Apache Arrow
Issue Type: Task
Reporter: Andy Grove
Assignee: Andy Grove

Create a Rust-specific 1.0.0 blog post.
[jira] [Created] (ARROW-9466) [Rust] [DataFusion] Upgrade to latest version of sqlparser crate
Andy Grove created ARROW-9466:

Summary: [Rust] [DataFusion] Upgrade to latest version of sqlparser crate
Key: ARROW-9466
URL: https://issues.apache.org/jira/browse/ARROW-9466
Project: Apache Arrow
Issue Type: Improvement
Components: Rust, Rust - DataFusion
Reporter: Andy Grove

We should upgrade to the latest version of the sqlparser crate so that we can support more complex queries, such as those used in TPC-H.
[jira] [Created] (ARROW-9465) [Python] Improve ergonomics of compute functions
Antoine Pitrou created ARROW-9465:

Summary: [Python] Improve ergonomics of compute functions
Key: ARROW-9465
URL: https://issues.apache.org/jira/browse/ARROW-9465
Project: Apache Arrow
Issue Type: Wish
Components: Python
Reporter: Antoine Pitrou

Introspection of exported compute functions currently yields suboptimal output:

{code:python}
>>> from pyarrow import compute as pc
>>> pc.list_flatten
.func(arg)>

>>> ?pc.list_flatten
Signature: pc.list_flatten(arg)
Docstring:
File:      ~/arrow/dev/python/pyarrow/compute.py
Type:      function

>>> help(pc.list_flatten)
Help on function func in module pyarrow.compute:

func(arg)
{code}
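One possible direction, sketched here for illustration and not the actual PyArrow implementation: generate each wrapper with the registered function's real name and a docstring, so that help() and IPython's ? show something meaningful. The make_wrapper factory and the docstring text are hypothetical; pyarrow.compute.call_function is assumed for dispatching by function name.

{code:python}
import pyarrow.compute as pc

def make_wrapper(name, doc):
    # Hypothetical factory: give the generated wrapper a real name and
    # docstring instead of the generic "func(arg)".
    def wrapper(arg):
        return pc.call_function(name, [arg])
    wrapper.__name__ = name
    wrapper.__qualname__ = name
    wrapper.__doc__ = doc
    return wrapper

list_flatten = make_wrapper("list_flatten", "Flatten list values into a single array.")
help(list_flatten)   # now reports "list_flatten(arg)" plus the docstring
{code}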
[jira] [Created] (ARROW-9464) [Rust] [DataFusion] Physical plan refactor to support async and optimization rules
Andy Grove created ARROW-9464:

Summary: [Rust] [DataFusion] Physical plan refactor to support async and optimization rules
Key: ARROW-9464
URL: https://issues.apache.org/jira/browse/ARROW-9464
Project: Apache Arrow
Issue Type: Improvement
Components: Rust, Rust - DataFusion
Reporter: Andy Grove

I would like to propose a refactor of the physical/execution planning based on experience I have had in implementing distributed execution in Ballista. This will likely need subtasks, but here is an overview of the changes I am proposing.

1. Introduce an enum to represent the physical plan.

By wrapping the execution plan structs in an enum, we make it possible to build a tree representing the physical plan, just like we do with the logical plan. This makes it easy to print physical plans and also to apply transformations to them.

{code:java}
pub enum PhysicalPlan {
    /// Projection.
    Projection(Arc),
    /// Filter a.k.a. predicate.
    Filter(Arc),
    /// Hash aggregate.
    HashAggregate(Arc),
    /// Performs a hash join of two child relations by first shuffling the data using the join keys.
    ShuffledHashJoin(ShuffledHashJoinExec),
    /// Performs a shuffle that will result in the desired partitioning.
    ShuffleExchange(Arc),
    /// Reads results from a ShuffleExchange.
    ShuffleReader(Arc),
    /// Scans a partitioned data source.
    ParquetScan(Arc),
    /// Scans an in-memory table.
    InMemoryTableScan(Arc),
}
{code}

2. Introduce a physical plan optimization rule to insert "shuffle" operators.

We should extend the ExecutionPlan trait so that each operator can specify its input and output partitioning needs, and then have an optimization rule that can insert any repartitioning or reordering steps required. For example, these are the methods to be added to ExecutionPlan. This design is based on Apache Spark.

{code:java}
/// Specifies how data is partitioned across different nodes in the cluster
fn output_partitioning() -> Partitioning {
    Partitioning::UnknownPartitioning(0)
}

/// Specifies the data distribution requirements of all the children for this operator
fn required_child_distribution() -> Distribution {
    Distribution::UnspecifiedDistribution
}

/// Specifies how data is ordered in each partition
fn output_ordering() -> Option> {
    None
}

/// Specifies the ordering requirements of all the children for this operator
fn required_child_ordering() -> Option>> {
    None
}
{code}

A good example of applying this rule would be hash aggregates, where we perform a partial aggregate in parallel across partitions and then coalesce the results and apply a final hash aggregate. Another example would be a SortMergeExec specifying the sort order required for its children.

3. Make execution async.

The ExecutionPlan trait should use the async keyword. This will require adding dependencies on async_trait and smol. This allows us to remove much of the manual thread management and have more efficient execution.

The main benefits of these changes are:
1. Simpler implementation of physical operators, because the optimizer will take care of repartitioning concerns
2. The ability to print a physical query plan
3. More efficient query execution because of the use of async
4. Easier for projects like Ballista to use DataFusion and add their own optimization rules, e.g. replacing repartitioning steps with distributed equivalents
[jira] [Created] (ARROW-9463) [Go] The writer is double closed in TestReadWrite
FredGan created ARROW-9463:

Summary: [Go] The writer is double closed in TestReadWrite
Key: ARROW-9463
URL: https://issues.apache.org/jira/browse/ARROW-9463
Project: Apache Arrow
Issue Type: Test
Components: Go
Affects Versions: 0.17.1
Reporter: FredGan

The writer in the test case 'TestReadWrite' is closed twice.

{code:java}
w, err := NewWriter(f, recs[0].Schema())
if err != nil {
	t.Fatal(err)
}
defer w.Close() // <= Here

for i, rec := range recs {
	err = w.Write(rec)
	if err != nil {
		t.Fatalf("could not write record[%d] to JSON: %v", i, err)
	}
}

err = w.Close() // <= Here
if err != nil {
	t.Fatalf("could not close JSON writer: %v", err)
}
{code}

The 'defer w.Close()' is redundant, and it writes one extra ']}' at the end of the output JSON file.
[jira] [Created] (ARROW-9462) [Go] The Indentation after the first Record arrjson writer is missing
FredGan created ARROW-9462:

Summary: [Go] The Indentation after the first Record arrjson writer is missing
Key: ARROW-9462
URL: https://issues.apache.org/jira/browse/ARROW-9462
Project: Apache Arrow
Issue Type: Bug
Components: Go
Affects Versions: 0.17.1
Reporter: FredGan

The `jsonRecPrefix` is missing for the Records after the first one in the arrjson writer. This can be seen in the output file `arrjson-xx` in the TempDir, for example:

{code:java}
  "batches": [
    {
      "count": 5,
      "columns": [
        {
          "name": "fixed_size_binary_3",
          "count": 5,
          "VALIDITY": [
            1,
            0,
            0,
            1,
            1
          ],
          "DATA": [
            "303031",
            "303032",
            "303033",
            "303034",
            "303035"
          ]
        }
      ]
    },
{ // <- HERE! we can see it is not indented correctly
      "count": 5,
      "columns": [
        {
{code}
[jira] [Created] (ARROW-9461) [Rust] Reading Date32 and Date64 errors - they are incorrectly converted to RecordBatch
Jorge created ARROW-9461:

Summary: [Rust] Reading Date32 and Date64 errors - they are incorrectly converted to RecordBatch
Key: ARROW-9461
URL: https://issues.apache.org/jira/browse/ARROW-9461
Project: Apache Arrow
Issue Type: Bug
Components: Rust
Reporter: Jorge
Assignee: Jorge

Steps to reproduce:

1. Create a file `a.parquet` using the following code:

{code:python}
import pyarrow.parquet
import numpy

def _data_datetime(f):
    data = numpy.array([
        numpy.datetime64('2018-08-18 23:25'),
        numpy.datetime64('2019-08-18 23:25'),
        numpy.datetime64("NaT"),
    ])
    data = numpy.array(data, dtype=f'datetime64[{f}]')
    return data

def _write_parquet(path, data):
    table = pyarrow.Table.from_arrays([pyarrow.array(data)], names=['a'])
    pyarrow.parquet.write_table(table, path)
    return path

_write_parquet('a.parquet', _data_datetime('D'))
{code}

2. Write a small example to read it to RecordBatches.

3. Observe the error:

{{ArrowError(ParquetError("InvalidArgumentError(\"column types must match schema types, expected Date32(Day) but found UInt32 at column index 0\")"))}}
[jira] [Created] (ARROW-9460) [C++] BinaryContainsExact doesn't cope with double characters in the pattern
Uwe Korn created ARROW-9460:

Summary: [C++] BinaryContainsExact doesn't cope with double characters in the pattern
Key: ARROW-9460
URL: https://issues.apache.org/jira/browse/ARROW-9460
Project: Apache Arrow
Issue Type: Bug
Components: C++
Reporter: Uwe Korn
Assignee: Uwe Korn
Fix For: 1.0.0
[jira] [Created] (ARROW-9459) [C++][Dataset] Make collecting/parsing statistics optional for ParquetFragment
Joris Van den Bossche created ARROW-9459:

Summary: [C++][Dataset] Make collecting/parsing statistics optional for ParquetFragment
Key: ARROW-9459
URL: https://issues.apache.org/jira/browse/ARROW-9459
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Joris Van den Bossche

See some timing checks here: https://github.com/dask/dask/pull/6346#issuecomment-656548675

Parsing all statistics, even from a centralized {{_metadata}} file, can be quite expensive. If you know in advance that you are not going to use them (e.g. you are only going to filter on the partition fields, and otherwise read all data), it would be nice to have an option to disable parsing statistics.

cc [~rjzamora] [~bkietz] [~fsaintjacques]
[jira] [Created] (ARROW-9458) [Python] Dataset singlethreaded only
Maarten Breddels created ARROW-9458:

Summary: [Python] Dataset singlethreaded only
Key: ARROW-9458
URL: https://issues.apache.org/jira/browse/ARROW-9458
Project: Apache Arrow
Issue Type: Bug
Components: Python
Reporter: Maarten Breddels

I'm not sure whether this is a misunderstanding, a compilation issue (flags?), or an issue in the C++ layer.

I have 1000 Parquet files with a total of 1 billion rows (1 million rows per file, ~20 columns). I wanted to see if I could go through all rows of 1 or 2 columns efficiently (vaex use case).

{code:python}
import pyarrow.parquet
import pyarrow as pa
import pyarrow.dataset as ds
import glob

ds = pa.dataset.dataset(glob.glob('/data/taxi_parquet/data_*.parquet'))
scanned = 0
for scan_task in ds.scan(batch_size=1_000_000, columns=['passenger_count'], use_threads=True):
    for record_batch in scan_task.execute():
        scanned += record_batch.num_rows
scanned
{code}

This only seems to use 1 CPU.

Using a thread pool from Python:

{code:python}
# %%timeit
import concurrent.futures
pool = concurrent.futures.ThreadPoolExecutor()
ds = pa.dataset.dataset(glob.glob('/data/taxi_parquet/data_*.parquet'))

def process(scan_task):
    scan_count = 0
    for record_batch in scan_task.execute():
        scan_count += len(record_batch)
    return scan_count

sum(pool.map(process, ds.scan(batch_size=1_000_000, columns=['passenger_count'], use_threads=False)))
{code}

gives me similar performance; again, only 100% CPU usage (= 1 core/cpu).

py-spy (a profiler for Python) shows no GIL, so this might be something at the C++ layer. Am I 'holding it wrong' or could this be a bug? Note that IO speed is not a problem on this system (the data actually all comes from the OS cache; no disk reads observed).
[jira] [Created] (ARROW-9457) [C++] TableReader support protobuf
Shuai Zhang created ARROW-9457:

Summary: [C++] TableReader support protobuf
Key: ARROW-9457
URL: https://issues.apache.org/jira/browse/ARROW-9457
Project: Apache Arrow
Issue Type: New Feature
Components: C++
Affects Versions: 0.17.1
Reporter: Shuai Zhang

I found there are TableReaders for the CSV & JSON formats. It would be very nice if we also supported a Protobuf format.

The basic idea is that the user passes in both the data file and the protobuf descriptor. The protobuf messages are either delimited, like CSV rows, or prefixed by a (possibly encoded) message length.
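For illustration only, a sketch of the length-prefixed framing mentioned above, assuming a plain 4-byte little-endian length before each serialized message (the real proposal might use a varint or another encoding; the read_length_prefixed helper and file name are hypothetical):

{code:python}
import struct

def read_length_prefixed(stream):
    # Yield raw serialized protobuf messages, each preceded by a
    # 4-byte little-endian length prefix (an assumed framing).
    while True:
        header = stream.read(4)
        if len(header) < 4:
            return
        (size,) = struct.unpack("<I", header)
        yield stream.read(size)

# Usage: each chunk could then be parsed with the message class generated
# from the user-supplied descriptor.
with open("messages.bin", "rb") as f:
    for raw in read_length_prefixed(f):
        pass  # e.g. MyMessage.FromString(raw)
{code}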
[jira] [Created] (ARROW-9456) [Python] Dataset segfault when not importing pyarrow.parquet
Maarten Breddels created ARROW-9456:

Summary: [Python] Dataset segfault when not importing pyarrow.parquet
Key: ARROW-9456
URL: https://issues.apache.org/jira/browse/ARROW-9456
Project: Apache Arrow
Issue Type: Bug
Components: Python
Reporter: Maarten Breddels

To reproduce:

{code:python}
# import pyarrow.parquet  # if we skip this...
import pyarrow as pa
import pyarrow.dataset as ds
import glob

ds = pa.dataset.dataset('/data/taxi_parquet/data_0.parquet')
ds.to_table()  # this will crash
{code}

{code}
$ python pyarrow/crash.py
terminate called after throwing an instance of 'parquet::ParquetException'
  what():  The file only has 19 columns, requested metadata for column: 1049198736
[1]    1559395 abort (core dumped)  python pyarrow/crash.py
{code}

When the import is there, it works fine.
[jira] [Created] (ARROW-9455) Request: add option for taking all columns from all files in pa.dataset
David Cortes created ARROW-9455:

Summary: Request: add option for taking all columns from all files in pa.dataset
Key: ARROW-9455
URL: https://issues.apache.org/jira/browse/ARROW-9455
Project: Apache Arrow
Issue Type: Wish
Components: Python
Reporter: David Cortes

In PyArrow's dataset class, if I give it a list of multiple parquet files and these files have potentially different columns, it will always take the schema from the first file in the list, thus ignoring columns that the first file doesn't have. Getting all columns within the files into the same dataset means passing a manual schema or constructing one by iterating over the files and checking their columns. It would be nicer if PyArrow's dataset class had an option to automatically take all columns from the files it is constructed from.

{code:python}
import numpy as np, pandas as pd

df1 = pd.DataFrame({
    "col1": np.arange(10),
    "col2": np.random.choice(["a", "b"], size=10)
})
df2 = pd.DataFrame({
    "col1": np.arange(10, 20),
    "col3": np.random.random(size=10)
})
df1.to_parquet("df1.parquet")
df2.to_parquet("df2.parquet")
{code}

{code:python}
import pyarrow.dataset as pds

ff = ["df1.parquet", "df2.parquet"]

### Code below will generate a DF with col1 and col2, but no col3
pds.dataset(ff, format="parquet").to_table().to_pandas()
{code}
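A rough sketch of the manual-schema workaround described above (illustrative only; it merges fields by name, keeping the first occurrence of each, and assumes pyarrow.parquet.read_schema together with the schema= argument of pyarrow.dataset.dataset):

{code:python}
import pyarrow as pa
import pyarrow.dataset as pds
import pyarrow.parquet as pq

files = ["df1.parquet", "df2.parquet"]

# Build a unified schema by collecting every field name seen across the files.
fields = {}
for path in files:
    for field in pq.read_schema(path):
        fields.setdefault(field.name, field)
schema = pa.schema(list(fields.values()))

# Columns missing from a given file are filled with nulls.
table = pds.dataset(files, format="parquet", schema=schema).to_table()
print(table.to_pandas())
{code}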
[jira] [Created] (ARROW-9454) [GLib] Add binding of some dictionary builders
Kenta Murata created ARROW-9454:

Summary: [GLib] Add binding of some dictionary builders
Key: ARROW-9454
URL: https://issues.apache.org/jira/browse/ARROW-9454
Project: Apache Arrow
Issue Type: Improvement
Components: GLib
Reporter: Kenta Murata
Assignee: Kenta Murata