[jira] [Created] (ARROW-10970) [Rust][DataFusion] Implement Value(Null)
Mike Seddon created ARROW-10970: --- Summary: [Rust][DataFusion] Implement Value(Null) Key: ARROW-10970 URL: https://issues.apache.org/jira/browse/ARROW-10970 Project: Apache Arrow Issue Type: Improvement Components: Rust - DataFusion Reporter: Mike Seddon We need to add support for the NULL value. For example: ```sql SELECT char_length(NULL) AS char_length_null ``` -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10969) [Rust][DataFusion] Implement basic String Functions
Mike Seddon created ARROW-10969: --- Summary: [Rust][DataFusion] Implement basic String Functions Key: ARROW-10969 URL: https://issues.apache.org/jira/browse/ARROW-10969 Project: Apache Arrow Issue Type: Bug Components: Rust - DataFusion Reporter: Mike Seddon Assignee: Mike Seddon There are not many ANSI SQL functions currently supported. This ticket is an umbrella for increasing the support. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10968) [Rust][DataFusion] Don't build hash table for right side of the join
Daniël Heres created ARROW-10968: Summary: [Rust][DataFusion] Don't build hash table for right side of the join Key: ARROW-10968 URL: https://issues.apache.org/jira/browse/ARROW-10968 Project: Apache Arrow Issue Type: Improvement Components: Rust - DataFusion Reporter: Daniël Heres Assignee: Daniël Heres -- This message was sent by Atlassian Jira (v8.3.4#803005)
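For context on ARROW-10968, a minimal sketch (illustrative only, not DataFusion's actual implementation) of the idea: an equijoin only needs a hash table over one side (the build side); the other side can be streamed and probed against it, so hashing both inputs is wasted work.

{code}
use std::collections::HashMap;

// Build a hash table over the left (build) side only, then stream the
// right (probe) side against it; no second hash table is required.
fn hash_join<'a>(
    left: &'a [(u32, &'a str)],
    right: &'a [(u32, &'a str)],
) -> Vec<(&'a str, &'a str)> {
    // Build phase: hash only the left side.
    let mut table: HashMap<u32, Vec<&str>> = HashMap::new();
    for (key, value) in left {
        table.entry(*key).or_default().push(*value);
    }
    // Probe phase: stream the right side row by row.
    let mut out = Vec::new();
    for (key, rvalue) in right {
        if let Some(lvalues) = table.get(key) {
            for lvalue in lvalues {
                out.push((*lvalue, *rvalue));
            }
        }
    }
    out
}

fn main() {
    let left = [(1, "a"), (2, "b")];
    let right = [(2, "x"), (3, "y")];
    assert_eq!(hash_join(&left, &right), vec![("b", "x")]);
}
{code}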
[jira] [Created] (ARROW-10967) Make env vars ARROW_TEST_DATA and PARQUET_TEST_DATA optional
meng qingyou created ARROW-10967: Summary: Make env vars ARROW_TEST_DATA and PARQUET_TEST_DATA optional Key: ARROW-10967 URL: https://issues.apache.org/jira/browse/ARROW-10967 Project: Apache Arrow Issue Type: Test Reporter: meng qingyou Facts/problems: # The two vars *ARROW_TEST_DATA* and *PARQUET_TEST_DATA* are required to be set for running tests, benchmarks, and examples. # Eighteen .rs files in total use these environment variables. # The typical usage looks like this: ``` let testdata = std::env::var("PARQUET_TEST_DATA").expect("PARQUET_TEST_DATA not defined");``` # Some code tries to assemble the test data directories by appending a relative dir to the *current dir* of the running process, but this depends heavily on what the current dir actually is (for example, rust/, rust/datafusion, etc.). Here is my solution. Suppose: # the *current dir* is ALWAYS inside the *git workspace dir* # we know a *dir X relative to the git workspace dir* Then getting the absolute dir of X reduces to finding the absolute dir (the *TOP*) of the *git workspace dir*. Given the *current dir* (inside the *git workspace dir*), we visit that dir and its parents and check whether ".git" (file or dir) exists. The first dir that contains ".git" SHOULD be the *git workspace dir*. A sketch of this lookup follows below. -- This message was sent by Atlassian Jira (v8.3.4#803005)
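A minimal sketch of the lookup proposed in ARROW-10967 (the "testing/data" relative path is illustrative, not a fixed constant from the ticket):

{code}
use std::path::PathBuf;

// Walk from the current dir up through its parents; the first dir that
// contains a ".git" entry (file or dir) is taken as the git workspace dir.
fn find_git_workspace_root() -> Option<PathBuf> {
    let mut dir = std::env::current_dir().ok()?;
    loop {
        if dir.join(".git").exists() {
            return Some(dir);
        }
        if !dir.pop() {
            return None; // hit the filesystem root without finding ".git"
        }
    }
}

// With the root known, a dir X relative to the workspace resolves to an
// absolute dir no matter where the process was started from.
fn default_test_data_dir() -> Option<PathBuf> {
    Some(find_git_workspace_root()?.join("testing").join("data"))
}

fn main() {
    println!("{:?}", default_test_data_dir());
}
{code}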
[jira] [Created] (ARROW-10966) [C++] Use FnOnce for ThreadPool's tasks instead of std::function
Ben Kietzman created ARROW-10966: Summary: [C++] Use FnOnce for ThreadPool's tasks instead of std::function Key: ARROW-10966 URL: https://issues.apache.org/jira/browse/ARROW-10966 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Ben Kietzman Assignee: Ben Kietzman Fix For: 4.0.0 FnOnce drops dependencies on invocation and is lighter weight than std::function -- This message was sent by Atlassian Jira (v8.3.4#803005)
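As a rough illustration of the semantics (shown in Rust, whose FnOnce the C++ utility mirrors; this is not the Arrow C++ code itself): a call-once task consumes its captured state when invoked, so its dependencies are released immediately instead of living as long as a copyable std::function wrapper would.

{code}
// Calling an FnOnce closure consumes it: its captures are dropped as soon
// as the call completes, instead of being kept alive by the wrapper.
fn run_once<F: FnOnce() -> usize>(task: F) -> usize {
    task() // `task` is moved into the call; its captures drop afterwards
}

fn main() {
    let payload = vec![1, 2, 3]; // a dependency owned by the task
    let task = move || payload.len();
    assert_eq!(run_once(task), 3);
    // `task` and `payload` are gone here; the task cannot be invoked twice.
}
{code}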
[jira] [Created] (ARROW-10965) [Rust][DataFusion] switching key join order leads to error
Daniël Heres created ARROW-10965: Summary: [Rust][DataFusion] switching key join order leads to error Key: ARROW-10965 URL: https://issues.apache.org/jira/browse/ARROW-10965 Project: Apache Arrow Issue Type: Bug Components: Rust - DataFusion Reporter: Daniël Heres If we switch the order of the keys in an equijoin, this results in an error. For example, changing l_orderkey = o_orderkey to o_orderkey = l_orderkey in query 12 of the TPC-H benchmark, we get this error: {{Error: Plan("The left or right side of the join does not have all columns on \"on\": \nMissing on the left: \{\"o_orderkey\"}\nMissing on the right: \{\"l_orderkey\"}")}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
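One possible direction for a fix, as a minimal sketch (the helper and its types are hypothetical, not DataFusion's actual API): reorient each key pair against the two input schemas instead of requiring the user to write the keys in table order.

{code}
use std::collections::HashSet;

// Reorder each equijoin key pair so its first column comes from the left
// schema and its second from the right, accepting either written order.
fn normalize_join_keys(
    left_cols: &HashSet<String>,
    right_cols: &HashSet<String>,
    on: &[(String, String)],
) -> Result<Vec<(String, String)>, String> {
    on.iter()
        .map(|(a, b)| {
            if left_cols.contains(a) && right_cols.contains(b) {
                Ok((a.clone(), b.clone()))
            } else if left_cols.contains(b) && right_cols.contains(a) {
                // e.g. "o_orderkey = l_orderkey": swap to left-first order
                Ok((b.clone(), a.clone()))
            } else {
                Err(format!("join key {} = {} matches neither side", a, b))
            }
        })
        .collect()
}

fn main() {
    let left: HashSet<_> = ["l_orderkey".to_string()].into_iter().collect();
    let right: HashSet<_> = ["o_orderkey".to_string()].into_iter().collect();
    let on = [("o_orderkey".to_string(), "l_orderkey".to_string())];
    assert_eq!(
        normalize_join_keys(&left, &right, &on).unwrap(),
        vec![("l_orderkey".to_string(), "o_orderkey".to_string())]
    );
}
{code}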
[jira] [Created] (ARROW-10964) [Rust] [DataFusion] Optimize nested joins
Andy Grove created ARROW-10964: -- Summary: [Rust] [DataFusion] Optimize nested joins Key: ARROW-10964 URL: https://issues.apache.org/jira/browse/ARROW-10964 Project: Apache Arrow Issue Type: Improvement Components: Rust - DataFusion Reporter: Andy Grove Once [https://github.com/apache/arrow/pull/8961] is merged, we will have an optimization for a JOIN that operates on two tables. The next step is to extend this optimization to work with nested joins, which is not trivial. See the discussion in [https://github.com/apache/arrow/pull/8961] for context. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10963) [Rust] [DataFusion] Improve the PartialEq implementation for UDF expressions
Daniel Russo created ARROW-10963: Summary: [Rust] [DataFusion] Improve the PartialEq implementation for UDF expressions Key: ARROW-10963 URL: https://issues.apache.org/jira/browse/ARROW-10963 Project: Apache Arrow Issue Type: New Feature Components: Rust, Rust - DataFusion Reporter: Daniel Russo An implementation of {{PartialEq}} for {{ScalarUDF}} and {{AggregateUDF}} was added in ARROW-10808 ([pull request|https://github.com/apache/arrow/pull/8836]), which was a requirement for the {{PartialEq}} derivation for {{Expr}}. The implementation checks equality on only the UDFs' {{name}} and {{signature}} fields: the underlying assumption is that two UDFs with the same name and signature must be the same UDF. This assumption may hold up in a SQL context where UDFs are registered by name (therefore guaranteeing they are distinct); however, it doesn't hold up in the general case, where there is no uniqueness requirement on the name. Improve the equality implementation for {{ScalarUDF}} and {{AggregateUDF}}. For additional context, see the discussion in the pull request [here|https://github.com/apache/arrow/pull/8836#discussion_r536874229]. -- This message was sent by Atlassian Jira (v8.3.4#803005)
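To see why the name+signature equality is too coarse, a minimal sketch (with simplified stand-in types, not the real ScalarUDF): two UDFs that share a name and signature but differ in implementation still compare equal.

{code}
// Simplified stand-in for ScalarUDF: `fun` holds the implementation.
struct UdfSketch {
    name: String,
    signature: Vec<&'static str>, // stand-in for the real Signature type
    fun: fn(i64) -> i64,
}

impl PartialEq for UdfSketch {
    fn eq(&self, other: &Self) -> bool {
        // Mirrors the ARROW-10808 approach: `fun` is ignored.
        self.name == other.name && self.signature == other.signature
    }
}

fn main() {
    let double = UdfSketch { name: "f".into(), signature: vec!["Int64"], fun: |x| x * 2 };
    let square = UdfSketch { name: "f".into(), signature: vec!["Int64"], fun: |x| x * x };
    // Different behavior, yet "equal" under name+signature comparison:
    assert!(double == square);
    assert_ne!((double.fun)(3), (square.fun)(3)); // 6 vs 9
}
{code}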
[jira] [Created] (ARROW-10962) [Java][FlightRPC] FlightData deserializer should accept missing fields
David Li created ARROW-10962: Summary: [Java][FlightRPC] FlightData deserializer should accept missing fields Key: ARROW-10962 URL: https://issues.apache.org/jira/browse/ARROW-10962 Project: Apache Arrow Issue Type: Bug Components: FlightRPC, Java Affects Versions: 2.0.0 Reporter: David Li Fix For: 3.0.0 To be compatible with Protobuf implementations, missing fields should not be treated as an error. The C++ implementation has the same issue, and this is causing issues with the C# and Rust implementations (and presumably the Go implementation) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10961) [Rust] [Flight] Upgrade to tonic > 0.3.1 to get fix of headers/trailers handling
Carol Nichols created ARROW-10961: - Summary: [Rust] [Flight] Upgrade to tonic > 0.3.1 to get fix of headers/trailers handling Key: ARROW-10961 URL: https://issues.apache.org/jira/browse/ARROW-10961 Project: Apache Arrow Issue Type: Bug Components: FlightRPC, Rust Reporter: Carol Nichols C++ gRPC servers sometimes return both headers and trailers, and [tonic|https://crates.io/crates/tonic], the crate that provides a Rust gRPC implementation, wasn't correctly merging headers and trailers for errors in the gRPC client. [This has been fixed|https://github.com/hyperium/tonic/pull/510] and should be included in the next release of tonic, which should have some version number greater than 0.3.1, but I'm not sure what tonic's release plans are. In the Rust Flight integration test client I'm developing, the middleware scenario with the Rust client against the C++ server will fail until this is taken care of. Filing this ticket so I can reference it when disabling that test case. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10960) [C++] [Flight] Missing protobuf data_body should result in default value of empty bytes, not null
Carol Nichols created ARROW-10960: - Summary: [C++] [Flight] Missing protobuf data_body should result in default value of empty bytes, not null Key: ARROW-10960 URL: https://issues.apache.org/jira/browse/ARROW-10960 Project: Apache Arrow Issue Type: Bug Components: FlightRPC Reporter: Carol Nichols Attachments: cpp-client-empty-data-body.png, rust-client-missing-data-body.png h1. Problem ProtoBuf {{proto3}} specifies that [if a message does not contain a particular singular element, the field should get the default value|https://developers.google.com/protocol-buffers/docs/proto3#default]. However, when the C++ {{flight-test-integration-server}} gets a {{DoPut}} request with a {{FlightData}} message for a record batch containing no items, and the {{FlightData}} is missing the {{data_body}} field, the server responds with an error "Expected body in IPC message of type record batch". h2. What happens If I run the C++ {{flight-test-integration-server}} and the C++ {{flight-test-integration-client}} with the {{generated_null_trivial}} test case, the test passes and I see the protobuf in wireshark shown in the cpp-client-empty-data-body.png attachment. Note the {{data_body}} field is present but has no value. If I run the Rust {{flight-test-integration-client}} that I'm working on developing, it does not send the {{data_body}} field at all if there are no bytes to send. I see the protobuf in wireshark shown in the rust-client-missing-data-body.png attachment. Note the {{data_body}} field is not present. The C++ server then returns the error message "Expected body in IPC message of type record batch", which comes from [this check for message body|https://github.com/apache/arrow/blob/519e9da4fc1698f686525f4226295f3680a3f3db/cpp/src/arrow/ipc/reader.cc#L92] called in [{{ReadNext}} of the record batch stream reader|https://github.com/apache/arrow/blob/519e9da4fc1698f686525f4226295f3680a3f3db/cpp/src/arrow/ipc/reader.cc#L787]. h2. What I expect to happen Instead of returning an error message because of a null pointer, the Message should get the default value of empty bytes. -- This message was sent by Atlassian Jira (v8.3.4#803005)
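As an illustration of the expected proto3 behavior, a minimal Rust sketch using the prost crate (not Arrow's actual code); FlightDataSketch is a hypothetical stand-in for the generated FlightData type, though tag 1000 mirrors the data_body field in Flight.proto:

{code}
use prost::Message;

#[derive(Clone, PartialEq, Message)]
struct FlightDataSketch {
    #[prost(bytes = "vec", tag = "1000")]
    data_body: Vec<u8>,
}

fn main() {
    // Decoding a message with the field entirely absent must succeed:
    // a missing singular field takes its default value.
    let decoded = FlightDataSketch::decode(&b""[..]).expect("decode");
    assert!(decoded.data_body.is_empty()); // empty bytes, not null/error
}
{code}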
[jira] [Created] (ARROW-10959) [C++] Add scalar string join kernel
Maarten Breddels created ARROW-10959: Summary: [C++] Add scalar string join kernel Key: ARROW-10959 URL: https://issues.apache.org/jira/browse/ARROW-10959 Project: Apache Arrow Issue Type: New Feature Components: C++, Python Reporter: Maarten Breddels Similar to Python's str.join -- This message was sent by Atlassian Jira (v8.3.4#803005)
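For reference, the intended semantics in a one-liner (Rust's slice join shown as an analogue of Python's str.join; not the proposed kernel itself):

{code}
fn main() {
    // Python: "-".join(["foo", "bar", "baz"]) == "foo-bar-baz"
    let parts = ["foo", "bar", "baz"];
    assert_eq!(parts.join("-"), "foo-bar-baz");
}
{code}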
[jira] [Created] (ARROW-10958) "Nested data conversions not implemented" through glib, but not through pyarrow
Samay Kapadia created ARROW-10958: - Summary: "Nested data conversions not implemented" through glib, but not through pyarrow Key: ARROW-10958 URL: https://issues.apache.org/jira/browse/ARROW-10958 Project: Apache Arrow Issue Type: Bug Components: GLib Affects Versions: 2.0.0 Environment: macOS Catalina 10.15.7 Reporter: Samay Kapadia Hey all, For some context, I am trying to use Arrow's GLib interface through Julia; I have a sense that I can speed up my pandas workflows by using Julia and Apache Arrow. I have a 1.7GB parquet file that can be read in about 20s using pyarrow's parquet reader
{code:java}
pq.read_table(path)
{code}
but when I try to do the same through the GLib interface in Julia, I see
{code:java}
[parquet][arrow][file-reader][read-table]: NotImplemented: Nested data conversions not implemented for chunked array outputs
{code}
Arrow was installed using {{brew install apache-arrow-glib}}, and it installed version 2.0.0. Here's my Julia code:
{code:java}
using Pkg
Pkg.add("Gtk")
using Gtk.GLib
using Gtk

path = "..." # contains columns that are lists of strings

struct _GParquetArrowFileReader
    parent_instance::Cint
end
const GParquetArrowFileReader = _GParquetArrowFileReader

struct _GParquetArrowFileReaderClass
    parent_class::Cint
end
const GParquetArrowFileReaderClass = _GParquetArrowFileReaderClass

struct _GArrowTable
    parent_instance::Cint
end
const GArrowTable = _GArrowTable

struct _GArrowTableClass
    parent_class::Cint
end
const GArrowTableClass = _GArrowTableClass

function parquet_arrow_file_reader_new_path(path::String)::Ptr{GParquetArrowFileReader}
    ret::Ptr{GParquetArrowFileReader} = 0
    GError() do error_check
        ret = ccall(
            (:gparquet_arrow_file_reader_new_path, "/usr/local/Cellar/apache-arrow-glib/2.0.0/lib/libparquet-glib.200"),
            Ptr{GParquetArrowFileReader},
            (Ptr{UInt8}, Ptr{Ptr{GError}}),
            Gtk.bytestring(path), error_check
        )
        ret != 0
    end
    ret
end

function parquet_arrow_file_reader_read_table(reader::Ptr{GParquetArrowFileReader})::Ptr{GArrowTable}
    ret::Ptr{GArrowTable} = 0
    GError() do error_check
        ret = ccall(
            (:gparquet_arrow_file_reader_read_table, "/usr/local/Cellar/apache-arrow-glib/2.0.0/lib/libparquet-glib.200"),
            Ptr{GArrowTable}, # return type is the table, not the reader
            (Ptr{GParquetArrowFileReader}, Ptr{Ptr{GError}}),
            reader, error_check
        )
        ret != 0
    end
    ret
end

reader = parquet_arrow_file_reader_new_path(path)
tbl = parquet_arrow_file_reader_read_table(reader)
{code}
Am I doing something wrong or is there a behavior discrepancy between pyarrow and glib? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10957) Expanding pyarrow buffer size more than 2GB for pandas_udf functions
Dmitry Kravchuk created ARROW-10957: --- Summary: Expanding pyarrow buffer size more than 2GB for pandas_udf functions Key: ARROW-10957 URL: https://issues.apache.org/jira/browse/ARROW-10957 Project: Apache Arrow Issue Type: Improvement Components: C++, Java, Python Affects Versions: 2.0.0 Environment: Spark: 2.4.4; Python packages: cycler (0.10.0), glmnet-py (0.1.0b2), joblib (1.0.0), kiwisolver (1.3.1), lightgbm (3.1.1), matplotlib (3.0.3), numpy (1.19.4), pandas (1.1.5), pip (9.0.3), pyarrow (2.0.0), pyparsing (2.4.7), python-dateutil (2.8.1), pytz (2020.4), scikit-learn (0.23.2), scipy (1.5.4), setuptools (51.0.0), six (1.15.0), sklearn (0.0), threadpoolctl (2.1.0), venv-pack (0.2.0), wheel (0.36.2) Reporter: Dmitry Kravchuk Fix For: 2.0.1 There is a 2GB limit on the data that can be passed to any pandas_udf function, and the aim of this issue is to expand that limit. This buffer size is very small if we use pyspark and our goal is fitting machine learning models. Steps to reproduce: use the following spark-submit command to execute the Python function below it.
{code:java}
%sh
cd /home/zeppelin/code && \
export PYSPARK_DRIVER_PYTHON=/home/zeppelin/envs/env3/bin/python && \
export PYSPARK_PYTHON=./env3/bin/python && \
export ARROW_PRE_0_15_IPC_FORMAT=1 && \
spark-submit \
  --master yarn \
  --deploy-mode client \
  --num-executors 5 \
  --executor-cores 5 \
  --driver-memory 8G \
  --executor-memory 8G \
  --conf spark.executor.memoryOverhead=4G \
  --conf spark.driver.memoryOverhead=4G \
  --archives /home/zeppelin/env3.tar.gz#env3 \
  --jars "/opt/deltalake/delta-core_2.11-0.5.0.jar" \
  --py-files jobs.zip,"/opt/deltalake/delta-core_2.11-0.5.0.jar" main.py \
  --job temp
{code}
{code:java|title=Bar.Python|borderStyle=solid}
import pyspark
from pyspark.sql import functions as F, types as T
import pandas as pd

def analyze(spark):
    pdf1 = pd.DataFrame(
        [[1234567, 0.0, "abcdefghij", "2000-01-01T00:00:00.000Z"]],
        columns=['df1_c1', 'df1_c2', 'df1_c3', 'df1_c4']
    )
    df1 = spark.createDataFrame(pd.concat([pdf1 for i in range(429)]).reset_index()).drop('index')

    pdf2 = pd.DataFrame(
        [[1234567, 0.0, "abcdefghijklmno", "2000-01-01", "abcdefghijklmno", "abcdefghijklmno"]],
        columns=['df2_c1', 'df2_c2', 'df2_c3', 'df2_c4', 'df2_c5', 'df2_c6']
    )
    df2 = spark.createDataFrame(pd.concat([pdf2 for i in range(48993)]).reset_index()).drop('index')

    df3 = df1.join(df2, df1['df1_c1'] == df2['df2_c1'], how='inner')

    def myudf(df):
        import os
        os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"
        return df

    df4 = df3 \
        .withColumn('df1_c1', F.col('df1_c1').cast(T.IntegerType())) \
        .withColumn('df1_c2', F.col('df1_c2').cast(T.DoubleType())) \
        .withColumn('df1_c3', F.col('df1_c3').cast(T.StringType())) \
        .withColumn('df1_c4', F.col('df1_c4').cast(T.StringType())) \
        .withColumn('df2_c1', F.col('df2_c1').cast(T.IntegerType())) \
        .withColumn('df2_c2', F.col('df2_c2').cast(T.DoubleType())) \
        .withColumn('df2_c3', F.col('df2_c3').cast(T.StringType())) \
        .withColumn('df2_c4', F.col('df2_c4').cast(T.StringType())) \
        .withColumn('df2_c5', F.col('df2_c5').cast(T.StringType())) \
        .withColumn('df2_c6', F.col('df2_c6').cast(T.StringType()))

    print(df4.printSchema())

    udf = F.pandas_udf(df4.schema, F.PandasUDFType.GROUPED_MAP)(myudf)

    df5 = df4.groupBy('df1_c1').apply(udf)
    print('df5.count()', df5.count())
{code}
If you need more details please let me know.
-- This message was sent by Atlassian Jira (v8.3.4#803005)