[jira] [Created] (ARROW-10970) [Rust][DataFusion] Implement Value(Null)

2020-12-18 Thread Mike Seddon (Jira)
Mike Seddon created ARROW-10970:
---

 Summary: [Rust][DataFusion] Implement Value(Null)
 Key: ARROW-10970
 URL: https://issues.apache.org/jira/browse/ARROW-10970
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Mike Seddon


We need to add support for the NULL value. 

For example:

{code:sql}
SELECT char_length(NULL) AS char_length_null
{code}
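
Whatever the physical representation ends up being, the key behavior is NULL 
propagation: a NULL argument yields a NULL result rather than a planning or 
execution error. A minimal Rust sketch of that behavior using {{Option}} 
(illustrative only, not DataFusion's actual kernel API):

{code:java}
/// char_length over nullable string values: None (SQL NULL) propagates.
fn char_length(values: &[Option<&str>]) -> Vec<Option<u64>> {
    values
        .iter()
        .map(|v| v.map(|s| s.chars().count() as u64))
        .collect()
}

fn main() {
    let input = vec![Some("josé"), None];
    // char_length counts characters, not bytes; NULL stays NULL.
    assert_eq!(char_length(&input), vec![Some(4), None]);
}
{code}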



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10969) [Rust][DataFusion] Implement basic String Functions

2020-12-18 Thread Mike Seddon (Jira)
Mike Seddon created ARROW-10969:
---

 Summary: [Rust][DataFusion] Implement basic String Functions
 Key: ARROW-10969
 URL: https://issues.apache.org/jira/browse/ARROW-10969
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust - DataFusion
Reporter: Mike Seddon
Assignee: Mike Seddon


There are not many ANSI SQL string functions currently supported. This ticket 
is an umbrella for increasing that support.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10968) [Rust][DataFusion] Don't build hash table for right side of the join

2020-12-18 Thread Daniël Heres (Jira)
Daniël Heres created ARROW-10968:


 Summary: [Rust][DataFusion] Don't build hash table for right side 
of the join 
 Key: ARROW-10968
 URL: https://issues.apache.org/jira/browse/ARROW-10968
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Daniël Heres
Assignee: Daniël Heres






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10967) Make env vars ARROW_TEST_DATA and PARQUET_TEST_DATA optional

2020-12-18 Thread meng qingyou (Jira)
meng qingyou created ARROW-10967:


 Summary: Make env vars ARROW_TEST_DATA and PARQUET_TEST_DATA 
optional
 Key: ARROW-10967
 URL: https://issues.apache.org/jira/browse/ARROW-10967
 Project: Apache Arrow
  Issue Type: Test
Reporter: meng qingyou


Facts/problems:
 # Two env vars, *ARROW_TEST_DATA* and *PARQUET_TEST_DATA*, are required to be 
set for running tests, benchmarks, and examples.
 # In total, eighteen .rs files use these environment variables.
 # The typical usage looks like this:
{code:java}
let testdata =
    std::env::var("PARQUET_TEST_DATA").expect("PARQUET_TEST_DATA not defined");
{code}
 # Some code tries to assemble the test data directories by appending a 
relative dir to the *current dir* of the running process, but this depends 
heavily on the actual current dir (for example, rust/, rust/datafusion, etc.).

Here is my solution.

Suppose:
 # the *current dir* is ALWAYS inside the *git workspace dir*
 # we know the location of a dir *X* relative to the *git workspace dir*

Then getting the absolute dir of X reduces to getting the absolute dir *TOP* 
of the *git workspace dir*.

Given the *current dir* (inside the *git workspace dir*), we visit that dir 
and its parents, checking whether ".git" (file or dir) exists. The first dir 
that contains ".git" SHOULD be the *git workspace dir*; a sketch follows.
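
A minimal Rust sketch of that lookup (function names are illustrative; the 
fallback location of the parquet test data in the repo is an assumption, only 
the env var behavior is taken from the existing code):

{code:java}
use std::env;
use std::path::PathBuf;

/// Walk up from the current dir until a ".git" entry (file or dir) is
/// found; that dir is taken to be the git workspace dir.
fn find_workspace_dir() -> Option<PathBuf> {
    let mut dir = env::current_dir().ok()?;
    loop {
        if dir.join(".git").exists() {
            return Some(dir);
        }
        // pop() returns false once we reach the filesystem root.
        if !dir.pop() {
            return None;
        }
    }
}

/// Prefer the env var; fall back to a path relative to the workspace dir.
fn parquet_test_data() -> Option<PathBuf> {
    if let Ok(dir) = env::var("PARQUET_TEST_DATA") {
        return Some(PathBuf::from(dir));
    }
    // Assumed location of the parquet testing submodule in the Arrow repo.
    find_workspace_dir().map(|top| top.join("cpp/submodules/parquet-testing/data"))
}
{code}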

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10966) [C++] Use FnOnce for ThreadPool's tasks instead of std::function

2020-12-18 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-10966:


 Summary: [C++] Use FnOnce for ThreadPool's tasks instead of 
std::function
 Key: ARROW-10966
 URL: https://issues.apache.org/jira/browse/ARROW-10966
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Ben Kietzman
Assignee: Ben Kietzman
 Fix For: 4.0.0


FnOnce drops its dependencies on invocation and is lighter weight than 
std::function.
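
The name mirrors Rust's {{FnOnce}} trait. For illustration, a small Rust 
sketch of the semantics being borrowed (the captured dependency is consumed 
and dropped by the single invocation, instead of living on in a copyable 
{{std::function}}):

{code:java}
fn main() {
    let payload = vec![1, 2, 3]; // a dependency owned by the task

    // The closure consumes its capture, so it is FnOnce (not Fn/FnMut):
    let task = move || payload.into_iter().sum::<i32>();

    println!("sum = {}", task()); // the capture is dropped after this one call
    // task(); // would not compile: the task has already been consumed
}
{code}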



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10965) [Rust][DataFusion] switching key join order leads to error

2020-12-18 Thread Daniël Heres (Jira)
Daniël Heres created ARROW-10965:


 Summary: [Rust][DataFusion] switching key join order leads to error
 Key: ARROW-10965
 URL: https://issues.apache.org/jira/browse/ARROW-10965
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust - DataFusion
Reporter: Daniël Heres


If we switch the order of the keys in an equijoin, this results in an error.

For example, changing {{l_orderkey = o_orderkey}} to {{o_orderkey = 
l_orderkey}} in query 12 of the TPC-H benchmark produces this error:
 
{{Error: Plan("The left or right side of the join does not have all columns on 
\"on\": \nMissing on the left: \{\"o_orderkey\"}\nMissing on the right: 
\{\"l_orderkey\"}")}}
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10964) [Rust] [DataFusion] Optimize nested joins

2020-12-18 Thread Andy Grove (Jira)
Andy Grove created ARROW-10964:
--

 Summary: [Rust] [DataFusion] Optimize nested joins
 Key: ARROW-10964
 URL: https://issues.apache.org/jira/browse/ARROW-10964
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Andy Grove


Once [https://github.com/apache/arrow/pull/8961] is merged, we will have an 
optimization for a JOIN that operates on two tables.

The next step is to extend this optimization to work with nested joins, which 
is not trivial. See the discussion in 
[https://github.com/apache/arrow/pull/8961] for context.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10963) [Rust] [DataFusion] Improve the PartialEq implementation for UDF expressions

2020-12-18 Thread Daniel Russo (Jira)
Daniel Russo created ARROW-10963:


 Summary: [Rust] [DataFusion] Improve the PartialEq implementation 
for UDF expressions
 Key: ARROW-10963
 URL: https://issues.apache.org/jira/browse/ARROW-10963
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust, Rust - DataFusion
Reporter: Daniel Russo


An implementation of {{PartialEq}} for {{ScalarUDF}} and {{AggregateUDF}} was 
added in ARROW-10808 ([pull request|https://github.com/apache/arrow/pull/8836]), 
which was a requirement for the {{PartialEq}} derivation for {{Expr}}.

The implementation checks equality on only the UDFs' {{name}} and {{signature}} 
fields: the underlying assumption is that two UDFs with the same name and 
signature must be the same UDF. This assumption may hold in a SQL context, 
where UDFs are registered by name (guaranteeing they are distinct); however, it 
does not hold in the general case, where there are no uniqueness requirements 
on the name. 

Improve the equality implementation for {{ScalarUDF}} and {{AggregateUDF}}. For 
additional context, see the discussion in the pull request 
[here|https://github.com/apache/arrow/pull/8836#discussion_r536874229]. 
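
To make the failure mode concrete, here is a minimal sketch (the struct is a 
simplified stand-in for DataFusion's {{ScalarUDF}}, not its actual definition) 
in which two UDFs with different behavior compare equal under 
name-plus-signature equality:

{code:java}
struct ScalarUdf {
    name: String,
    signature: Vec<String>, // argument types, simplified to strings
    fun: fn(i64) -> i64,
}

// Equality on name and signature only, mirroring the current approach.
impl PartialEq for ScalarUdf {
    fn eq(&self, other: &Self) -> bool {
        self.name == other.name && self.signature == other.signature
    }
}

fn main() {
    let double = ScalarUdf {
        name: "f".to_string(),
        signature: vec!["Int64".to_string()],
        fun: |x| x * 2,
    };
    let triple = ScalarUdf {
        name: "f".to_string(),
        signature: vec!["Int64".to_string()],
        fun: |x| x * 3,
    };
    // Different functions, yet considered equal:
    assert!(double == triple);
    assert!((double.fun)(1) != (triple.fun)(1));
}
{code}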




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10962) [Java][FlightRPC] FlightData deserializer should accept missing fields

2020-12-18 Thread David Li (Jira)
David Li created ARROW-10962:


 Summary: [Java][FlightRPC] FlightData deserializer should accept 
missing fields
 Key: ARROW-10962
 URL: https://issues.apache.org/jira/browse/ARROW-10962
 Project: Apache Arrow
  Issue Type: Bug
  Components: FlightRPC, Java
Affects Versions: 2.0.0
Reporter: David Li
 Fix For: 3.0.0


To be compatible with Protobuf implementations, missing fields should not be 
treated as an error. The C++ implementation has the same issue, and this is 
causing problems with the C# and Rust implementations (and presumably the Go 
implementation).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10961) [Rust] [Flight] Upgrade to tonic > 0.3.1 to get fix of headers/trailers handling

2020-12-18 Thread Carol Nichols (Jira)
Carol Nichols created ARROW-10961:
-

 Summary: [Rust] [Flight] Upgrade to tonic > 0.3.1 to get fix of 
headers/trailers handling
 Key: ARROW-10961
 URL: https://issues.apache.org/jira/browse/ARROW-10961
 Project: Apache Arrow
  Issue Type: Bug
  Components: FlightRPC, Rust
Reporter: Carol Nichols


C++ gRPC servers sometimes return both headers and trailers, and 
[tonic|https://crates.io/crates/tonic], the crate that provides a Rust gRPC 
implementation, wasn't correctly merging headers and trailers for errors in the 
gRPC client. [This has been fixed|https://github.com/hyperium/tonic/pull/510] 
and should be included in the next release of tonic, which should have some 
version number greater than 0.3.1, but I'm not sure what tonic's release plans 
are.

In the Rust Flight integration test client I'm developing, the middleware 
scenario with the Rust client against the C++ server will fail until this is 
taken care of. Filing this ticket so I can reference it in the disabling of 
that test case.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10960) [C++] [Flight] Missing protobuf data_body should result in default value of empty bytes, not null

2020-12-18 Thread Carol Nichols (Jira)
Carol Nichols created ARROW-10960:
-

 Summary: [C++] [Flight] Missing protobuf data_body should result 
in default value of empty bytes, not null
 Key: ARROW-10960
 URL: https://issues.apache.org/jira/browse/ARROW-10960
 Project: Apache Arrow
  Issue Type: Bug
  Components: FlightRPC
Reporter: Carol Nichols
 Attachments: cpp-client-empty-data-body.png, 
rust-client-missing-data-body.png

h1. Problem

ProtoBuf {{proto3}} specifies that [if a message does not contain a particular 
singular element, the field should get the default 
value|https://developers.google.com/protocol-buffers/docs/proto3#default]. 
However, when the C++ {{flight-test-integration-server}} gets a {{DoPut}} 
request with a {{FlightData}} message for a record batch containing no items, 
and the {{FlightData}} is missing the {{data_body}} field, the server responds 
with an error "Expected body in IPC message of type record batch".

h2. What happens

If I run the C++ {{flight-test-integration-server}} and the C++ 
{{flight-test-integration-client}} with the {{generated_null_trivial}} test 
case, the test passes and I see the protobuf in wireshark shown in the 
cpp-client-empty-data-body.png attachment.

Note the {{data_body}} field is present but has no value.

If I run the Rust {{flight-test-integration-client}} that I'm developing, it 
does not send the {{data_body}} field at all if there are no bytes to send. I 
see the protobuf in wireshark shown in the rust-client-missing-data-body.png 
attachment.

Note the {{data_body}} field is not present.

The C++ server then returns the error message "Expected body in IPC message of 
type record batch", which comes from [this check for message 
body|https://github.com/apache/arrow/blob/519e9da4fc1698f686525f4226295f3680a3f3db/cpp/src/arrow/ipc/reader.cc#L92]
 called in [{{ReadNext}} of the record batch stream 
reader|https://github.com/apache/arrow/blob/519e9da4fc1698f686525f4226295f3680a3f3db/cpp/src/arrow/ipc/reader.cc#L787].

h2. What I expect to happen

Instead of returning an error message because of a null pointer, the Message 
should get the default value of empty bytes.
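
A minimal illustration of the proto3 rule in Rust with 
[prost|https://crates.io/crates/prost] (the message here is a cut-down 
stand-in for the real {{FlightData}}; only the {{data_body}} field and its tag 
are modeled):

{code:java}
use prost::Message;

#[derive(Clone, PartialEq, Message)]
struct FlightData {
    /// proto3: a missing bytes field decodes to its default, empty bytes.
    #[prost(bytes = "vec", tag = "1000")]
    data_body: Vec<u8>,
}

fn main() {
    // A wire message that omits data_body entirely...
    let wire: &[u8] = &[];
    let msg = FlightData::decode(wire).unwrap();
    // ...decodes to the default value, not a null and not an error.
    assert!(msg.data_body.is_empty());
}
{code}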




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10959) [C++] Add scalar string join kernel

2020-12-18 Thread Maarten Breddels (Jira)
Maarten Breddels created ARROW-10959:


 Summary: [C++] Add scalar string join kernel
 Key: ARROW-10959
 URL: https://issues.apache.org/jira/browse/ARROW-10959
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++, Python
Reporter: Maarten Breddels


Similar to Python's str.join.
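
For reference, the semantics being mirrored, shown here in Rust (a plain 
{{join}} over strings; how the kernel should treat nulls is left to the 
design):

{code:java}
fn main() {
    // Python: "-".join(["2020", "12", "18"]) == "2020-12-18"
    let parts = vec!["2020", "12", "18"];
    assert_eq!(parts.join("-"), "2020-12-18");
}
{code}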



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10958) "Nested data conversions not implemented" through glib, but not through pyarrow

2020-12-18 Thread Samay Kapadia (Jira)
Samay Kapadia created ARROW-10958:
-

 Summary: "Nested data conversions not implemented" through glib, 
but not through pyarrow
 Key: ARROW-10958
 URL: https://issues.apache.org/jira/browse/ARROW-10958
 Project: Apache Arrow
  Issue Type: Bug
  Components: GLib
Affects Versions: 2.0.0
 Environment: macOS Catalina 10.15.7
Reporter: Samay Kapadia


Hey all,

For some context, I am trying to use Arrow's GLib interface through Julia; I 
have a sense that I can speed up my pandas workflows by using Julia and Apache 
Arrow.

I have a 1.7GB parquet file that can be read in about 20s by using pyarrow's 
parquet reader
{code:java}
pq.read_table(path)
{code}
but when I try to do the same through the GLib interface in Julia, I see
{code:java}
[parquet][arrow][file-reader][read-table]: NotImplemented: Nested data 
conversions not implemented for chunked array outputs
{code}
Arrow was installed using {{brew install apache-arrow-glib}}, which installed 
version 2.0.0.

Here's my Julia code:
{code:java}
using Pkg
Pkg.add("Gtk")
using Gtk.GLib
using Gtk

path = "..." # contains columns that are lists of strings

struct _GParquetArrowFileReader
    parent_instance::Cint
end

const GParquetArrowFileReader = _GParquetArrowFileReader

struct _GParquetArrowFileReaderClass
    parent_class::Cint
end

const GParquetArrowFileReaderClass = _GParquetArrowFileReaderClass

struct _GArrowTable
    parent_instance::Cint
end

const GArrowTable = _GArrowTable

struct _GArrowTableClass
    parent_class::Cint
end

const GArrowTableClass = _GArrowTableClass

function parquet_arrow_file_reader_new_path(path::String)::Ptr{GParquetArrowFileReader}
    ret::Ptr{GParquetArrowFileReader} = 0
    GError() do error_check
        ret = ccall(
            (:gparquet_arrow_file_reader_new_path,
             "/usr/local/Cellar/apache-arrow-glib/2.0.0/lib/libparquet-glib.200"),
            Ptr{GParquetArrowFileReader},
            (Ptr{UInt8}, Ptr{Ptr{GError}}),
            Gtk.bytestring(path), error_check
        )
        ret != 0
    end
    ret
end

function parquet_arrow_file_reader_read_table(reader::Ptr{GParquetArrowFileReader})::Ptr{GArrowTable}
    ret::Ptr{GArrowTable} = 0
    GError() do error_check
        ret = ccall(
            (:gparquet_arrow_file_reader_read_table,
             "/usr/local/Cellar/apache-arrow-glib/2.0.0/lib/libparquet-glib.200"),
            Ptr{GArrowTable}, # ccall return type must match GArrowTable
            (Ptr{GParquetArrowFileReader}, Ptr{Ptr{GError}}),
            reader, error_check
        )
        ret != 0
    end
    ret
end

reader = parquet_arrow_file_reader_new_path(path)
tbl = parquet_arrow_file_reader_read_table(reader)
{code}
Am I doing something wrong or is there a behavior discrepancy between pyarrow 
and glib?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10957) Expanding pyarrow buffer size more than 2GB for pandas_udf functions

2020-12-18 Thread Dmitry Kravchuk (Jira)
Dmitry Kravchuk created ARROW-10957:
---

 Summary: Expanding pyarrow buffer size more than 2GB for 
pandas_udf functions
 Key: ARROW-10957
 URL: https://issues.apache.org/jira/browse/ARROW-10957
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Java, Python
Affects Versions: 2.0.0
 Environment: Spark: 2.4.4

Python:
cycler (0.10.0)
glmnet-py (0.1.0b2)
joblib (1.0.0)
kiwisolver (1.3.1)
lightgbm (3.1.1)
matplotlib (3.0.3)
numpy (1.19.4)
pandas (1.1.5)
pip (9.0.3)
pyarrow (2.0.0)
pyparsing (2.4.7)
python-dateutil (2.8.1)
pytz (2020.4)
scikit-learn (0.23.2)
scipy (1.5.4)
setuptools (51.0.0)
six (1.15.0)
sklearn (0.0)
threadpoolctl (2.1.0)
venv-pack (0.2.0)
wheel (0.36.2)
Reporter: Dmitry Kravchuk
 Fix For: 2.0.1


There is a 2GB limit on the data that can be passed to any pandas_udf function, 
and the aim of this issue is to expand that limit. This buffer is far too small 
when using pyspark to fit machine learning models.

Steps to reproduce: use the following spark-submit command to execute the 
Python function given after it.

{code:java}
%sh
cd /home/zeppelin/code && \
export PYSPARK_DRIVER_PYTHON=/home/zeppelin/envs/env3/bin/python && \
export PYSPARK_PYTHON=./env3/bin/python && \
export ARROW_PRE_0_15_IPC_FORMAT=1 && \
spark-submit \
--master yarn \
--deploy-mode client \
--num-executors 5 \
--executor-cores 5 \
--driver-memory 8G \
--executor-memory 8G \
--conf spark.executor.memoryOverhead=4G \
--conf spark.driver.memoryOverhead=4G \
--archives /home/zeppelin/env3.tar.gz#env3 \
--jars "/opt/deltalake/delta-core_2.11-0.5.0.jar" \
--py-files jobs.zip,"/opt/deltalake/delta-core_2.11-0.5.0.jar" main.py \
--job temp
{code}
 
{code:java|title=Bar.Python|borderStyle=solid}
import pyspark
from pyspark.sql import functions as F, types as T
import pandas as pd

def analyze(spark):

    pdf1 = pd.DataFrame(
        [[1234567, 0.0, "abcdefghij", "2000-01-01T00:00:00.000Z"]],
        columns=['df1_c1', 'df1_c2', 'df1_c3', 'df1_c4']
    )
    df1 = spark.createDataFrame(
        pd.concat([pdf1 for i in range(429)]).reset_index()
    ).drop('index')

    pdf2 = pd.DataFrame(
        [[1234567, 0.0, "abcdefghijklmno", "2000-01-01", "abcdefghijklmno",
          "abcdefghijklmno"]],
        columns=['df2_c1', 'df2_c2', 'df2_c3', 'df2_c4', 'df2_c5', 'df2_c6']
    )
    df2 = spark.createDataFrame(
        pd.concat([pdf2 for i in range(48993)]).reset_index()
    ).drop('index')
    df3 = df1.join(df2, df1['df1_c1'] == df2['df2_c1'], how='inner')

    def myudf(df):
        import os
        os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"
        return df

    df4 = df3 \
        .withColumn('df1_c1', F.col('df1_c1').cast(T.IntegerType())) \
        .withColumn('df1_c2', F.col('df1_c2').cast(T.DoubleType())) \
        .withColumn('df1_c3', F.col('df1_c3').cast(T.StringType())) \
        .withColumn('df1_c4', F.col('df1_c4').cast(T.StringType())) \
        .withColumn('df2_c1', F.col('df2_c1').cast(T.IntegerType())) \
        .withColumn('df2_c2', F.col('df2_c2').cast(T.DoubleType())) \
        .withColumn('df2_c3', F.col('df2_c3').cast(T.StringType())) \
        .withColumn('df2_c4', F.col('df2_c4').cast(T.StringType())) \
        .withColumn('df2_c5', F.col('df2_c5').cast(T.StringType())) \
        .withColumn('df2_c6', F.col('df2_c6').cast(T.StringType()))
    print(df4.printSchema())

    udf = F.pandas_udf(df4.schema, F.PandasUDFType.GROUPED_MAP)(myudf)

    df5 = df4.groupBy('df1_c1').apply(udf)
    print('df5.count()', df5.count())
{code}

If you need more details please let me know.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)