[jira] [Created] (ARROW-14488) [Python] Incorrect inferred schema from pandas dataframe with length 0.

2021-10-26 Thread Yuan Zhou (Jira)
Yuan Zhou created ARROW-14488:
-

 Summary: [Python] Incorrect inferred schema from pandas dataframe 
with length 0.
 Key: ARROW-14488
 URL: https://issues.apache.org/jira/browse/ARROW-14488
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 5.0.0
 Environment: OS: Windows 10, CentOS 7
Reporter: Yuan Zhou


We use pandas (with the pyarrow engine) to write Parquet files, and those 
outputs are consumed by other applications, such as Java apps using 
org.apache.parquet.hadoop.ParquetFileReader. We found that for some empty 
dataframes, those applications see an incorrect schema for string columns. 
After some investigation, we narrowed the issue down to schema inference in 
pyarrow:

{{In [1]: import pandas as pd}}

{{In [2]: df = pd.DataFrame([['a', 1, 1.0]], columns=['a', 'b', 'c'])}}

{{In [3]: import pyarrow as pa}}

{{In [4]: pa.Schema.from_pandas(df)}}
{{Out[4]:}}
{{a: string}}
{{b: int64}}
{{c: double}}
{{-- schema metadata --}}
{{pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 
562}}

{{In [5]: pa.Schema.from_pandas(df.head(0))}}
{{Out[5]:}}
{{a: null}}
{{b: int64}}
{{c: double}}
{{-- schema metadata --}}
{{pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 
560}}

{{In [6]: pa.__version__}}
{{Out[6]: '5.0.0'}}

 

Is this expected behavior? If so, is there a workaround for this issue? 
Could anyone take a look, please? Thanks!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12147) TimestampTZ functions

2021-03-29 Thread Yuan Zhou (Jira)
Yuan Zhou created ARROW-12147:
-

 Summary: TimestampTZ functions
 Key: ARROW-12147
 URL: https://issues.apache.org/jira/browse/ARROW-12147
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Gandiva
Reporter: Yuan Zhou


Hi Arrow developers, 

Gandiva already supports timestamp-related functions, but apparently only for 
UTC. It would be nice to also support TimestampTZ (timestamp with timezone). 
Passing a timestamp with a timezone to those functions returns wrong results, 
because the timezone offset is not taken into account.

[https://github.com/apache/arrow/blob/master/cpp/src/gandiva/precompiled/time.cc#L41]

I guess we could provide a helper function like the one below that first 
converts the value to a UTC timestamp, after which all existing functions 
should work correctly. 

{{_ConvertTIMESTAMP(timestamp, timezone)_}}
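
A minimal Python sketch of one reading of this helper, assuming timestamps 
are epoch milliseconds and the timezone is given as a fixed UTC offset. The 
names convert_timestamp and extract_hour_utc are hypothetical stand-ins, not 
Gandiva functions, and real timezones would also need DST-aware offsets.

```python
from datetime import datetime, timedelta, timezone

MILLIS_PER_HOUR = 3_600_000


def extract_hour_utc(millis: int) -> int:
    """Stand-in for a UTC-only extractor: hour of day from epoch millis."""
    return (millis // MILLIS_PER_HOUR) % 24


def convert_timestamp(millis: int, utc_offset: timedelta) -> int:
    """Hypothetical helper: shift an instant by a fixed zone offset so that
    UTC-only extractors return the local wall-clock fields."""
    return millis + int(utc_offset.total_seconds() * 1000)


# 2021-03-29 22:30 UTC is 06:30 on the next day in a UTC+8 zone.
instant = datetime(2021, 3, 29, 22, 30, tzinfo=timezone.utc)
millis = int(instant.timestamp() * 1000)
offset = timedelta(hours=8)

hour_without_offset = extract_hour_utc(millis)
hour_with_offset = extract_hour_utc(convert_timestamp(millis, offset))
```

Without the conversion the extractor reports the UTC hour; after shifting by 
the offset, the same unmodified extractor reports the local hour.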

A better approach may be to re-implement those functions so that they take 
the zone offset into account in their calculations, but this may make the 
code more complicated. 

Note that castTIMESTAMP_utf8 already supports creating a timestamp with a 
timezone. 

[https://github.com/apache/arrow/blob/master/cpp/src/gandiva/precompiled/time.cc#L618-L743]

 

thanks, -yuan





[jira] [Created] (ARROW-9087) Missing HDFS options parsing

2020-06-09 Thread Yuan Zhou (Jira)
Yuan Zhou created ARROW-9087:


 Summary: Missing HDFS options parsing
 Key: ARROW-9087
 URL: https://issues.apache.org/jira/browse/ARROW-9087
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Yuan Zhou
Assignee: Yuan Zhou


HDFS options for the Kerberos ticket and extra conf are not parsed.





[jira] [Created] (ARROW-8609) orc JNI bridge crashed on null arrow buffer

2020-04-27 Thread Yuan Zhou (Jira)
Yuan Zhou created ARROW-8609:


 Summary: orc JNI bridge crashed on null arrow buffer
 Key: ARROW-8609
 URL: https://issues.apache.org/jira/browse/ARROW-8609
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Yuan Zhou
Assignee: Yuan Zhou


https://github.com/apache/arrow/blob/master/cpp/src/jni/orc/jni_wrapper.cpp#L278-L281
We should check whether the Arrow buffer is null and pass the right value to 
the constructor. 





[jira] [Updated] (ARROW-8609) [C++]orc JNI bridge crashed on null arrow buffer

2020-04-27 Thread Yuan Zhou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuan Zhou updated ARROW-8609:
-
Summary: [C++]orc JNI bridge crashed on null arrow buffer  (was: orc JNI 
bridge crashed on null arrow buffer)

> [C++]orc JNI bridge crashed on null arrow buffer
> 
>
> Key: ARROW-8609
> URL: https://issues.apache.org/jira/browse/ARROW-8609
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Yuan Zhou
>Assignee: Yuan Zhou
>Priority: Major
>
> https://github.com/apache/arrow/blob/master/cpp/src/jni/orc/jni_wrapper.cpp#L278-L281
> We should check whether the Arrow buffer is null and pass the right value to 
> the constructor. 





[jira] [Created] (ARROW-8360) Fixes date32 support for date/time functions

2020-04-07 Thread Yuan Zhou (Jira)
Yuan Zhou created ARROW-8360:


 Summary: Fixes date32 support for date/time functions
 Key: ARROW-8360
 URL: https://issues.apache.org/jira/browse/ARROW-8360
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++ - Gandiva
Reporter: Yuan Zhou
Assignee: Yuan Zhou


Gandiva date/time functions like extractYear[1] only work with millisecond 
values; passing date32 to these functions gives wrong results.


[1]https://github.com/apache/arrow/blob/6d92694d00aec08081ae1bfe06f0a265e141b1b7/cpp/src/gandiva/precompiled/time.cc#L75-L80
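
The mismatch can be sketched in Python, assuming date32 holds days since the 
Unix epoch and the extractor expects epoch milliseconds. The function names 
are hypothetical stand-ins, not the Gandiva implementations.

```python
from datetime import date, datetime, timedelta, timezone

MILLIS_PER_DAY = 86_400_000
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def extract_year_from_millis(millis: int) -> int:
    """Stand-in for an extractor that expects epoch milliseconds."""
    return (EPOCH + timedelta(milliseconds=millis)).year


def date32_to_millis(days: int) -> int:
    """Convert a date32 value (days since epoch) to epoch milliseconds."""
    return days * MILLIS_PER_DAY


# date32 value for 2020-04-07: number of days since 1970-01-01.
days = (date(2020, 4, 7) - date(1970, 1, 1)).days

wrong_year = extract_year_from_millis(days)
right_year = extract_year_from_millis(date32_to_millis(days))
```

Treating the day count as milliseconds lands a few seconds after the epoch, 
so the extracted year is 1970; converting to milliseconds first gives 2020.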






[jira] [Created] (ARROW-8312) improve IN expression support

2020-04-02 Thread Yuan Zhou (Jira)
Yuan Zhou created ARROW-8312:


 Summary: improve IN expression support
 Key: ARROW-8312
 URL: https://issues.apache.org/jira/browse/ARROW-8312
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Gandiva, Java
Reporter: Yuan Zhou
Assignee: Yuan Zhou


The IN API[1] provided by Gandiva C++ accepts a TreeNode as a parameter, which 
allows an IN expression to operate on the output of a function. However, in 
the Java API[2], an IN expression only accepts a Field as a parameter, which 
limits the API's usage. 

[1] 
https://github.com/apache/arrow/blob/master/cpp/src/gandiva/tree_expr_builder.h#L94-L125
[2] 
https://github.com/apache/arrow/blob/master/java/gandiva/src/main/java/org/apache/arrow/gandiva/expression/InNode.java#L50-L63
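
The difference can be illustrated with a toy expression tree in Python. This 
is only a sketch: FieldNode, FunctionNode, and InNode are hypothetical classes 
invented for the illustration and do not mirror the Gandiva C++ or Java APIs.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, FrozenSet, Union

Row = Dict[str, Any]


@dataclass(frozen=True)
class FieldNode:
    """Leaf node that reads a column value from a row."""
    name: str

    def evaluate(self, row: Row) -> Any:
        return row[self.name]


@dataclass(frozen=True)
class FunctionNode:
    """Applies a unary function to the result of a child node."""
    func: Callable[[Any], Any]
    child: "Node"

    def evaluate(self, row: Row) -> Any:
        return self.func(self.child.evaluate(row))


Node = Union[FieldNode, FunctionNode]


@dataclass(frozen=True)
class InNode:
    """IN expression whose operand may be any node, not just a field."""
    operand: Node
    values: FrozenSet[Any]

    def evaluate(self, row: Row) -> bool:
        return self.operand.evaluate(row) in self.values


row = {"name": "Arrow"}
candidates = frozenset({"arrow", "gandiva"})

# Field-only operand: compares the raw column value.
field_only = InNode(FieldNode("name"), candidates)
# Tree operand: compares the output of a function applied to the column.
on_function = InNode(FunctionNode(str.lower, FieldNode("name")), candidates)
```

Here field_only.evaluate(row) is False because the raw value differs in case, 
while on_function.evaluate(row) is True because IN operates on the lowercased 
output. Accepting only a field as the operand rules out the second form.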





[jira] [Created] (ARROW-7336) implement minmax options

2019-12-06 Thread Yuan Zhou (Jira)
Yuan Zhou created ARROW-7336:


 Summary: implement minmax options
 Key: ARROW-7336
 URL: https://issues.apache.org/jira/browse/ARROW-7336
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Compute
Reporter: Yuan Zhou
Assignee: Yuan Zhou


The minmax kernel has MinMaxOptions, but they are not used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7083) [C++] Determine the feasibility and build a prototype to replace compute/kernels with gandiva kernels

2019-11-07 Thread Yuan Zhou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969683#comment-16969683
 ] 

Yuan Zhou commented on ARROW-7083:
--

Hi [~emkornfi...@gmail.com]

For the coming AQE, what kernels will Arrow use? Will it use 100% C++ kernels, 
or a combination of C++ and Gandiva kernels?

The design draft seems to use a combination of these two kinds of kernels. 
https://docs.google.com/document/d/10RoUZmiMQRi_J1FcPeVAUAMJ6d_ZuiEbaM2Y33sNPu4/edit#heading=h.2k6k5a4y9b8y

Cheers, -yuan

> [C++] Determine the feasibility and build a prototype to replace 
> compute/kernels with gandiva kernels
> -
>
> Key: ARROW-7083
> URL: https://issues.apache.org/jira/browse/ARROW-7083
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, C++ - Compute, C++ - Gandiva
>Reporter: Micah Kornfield
>Priority: Major
>
> See discussion on [https://issues.apache.org/jira/browse/ARROW-7017]
>  
> Requirements:
> 1.  No hard runtime dependency on LLVM
> 2.  Ability to run without LLVM static/shared libraries.
>  
> Open questions:
> 1.  What dependencies does this add to the build tool chain?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-300) [Format] Add buffer compression option to IPC file format

2019-10-25 Thread Yuan Zhou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16959552#comment-16959552
 ] 

Yuan Zhou commented on ARROW-300:
-

Hi [~wesm], thanks for providing the general idea. I'm quite interested in 
this feature. Do you happen to have any updates on the detailed proposal?

Cheers, -yuan

> [Format] Add buffer compression option to IPC file format
> -
>
> Key: ARROW-300
> URL: https://issues.apache.org/jira/browse/ARROW-300
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Format
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> It may be useful, if data is to be sent over the wire, to compress the data 
> buffers themselves as they're being written in the file layout.
> I would propose that we keep this extremely simple with a global buffer 
> compression setting in the file Footer. Probably only two compressors worth 
> supporting out of the box would be zlib (higher compression ratios) and lz4 
> (better performance).
> What does everyone think?
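
The quoted proposal can be sketched in Python as follows. This is a toy 
illustration only, not the Arrow IPC format: the footer is a plain dict, the 
key name buffer_compression and both function names are hypothetical, and only 
zlib is shown because lz4 is not in the Python standard library.

```python
import zlib
from typing import Dict, List, Tuple


def write_buffers(buffers: List[bytes], codec: str) -> Tuple[List[bytes], Dict[str, str]]:
    """Compress each buffer with one global codec recorded in the footer."""
    if codec == "zlib":
        body = [zlib.compress(buf) for buf in buffers]
    elif codec == "none":
        body = list(buffers)
    else:
        raise ValueError(f"unsupported codec: {codec}")
    footer = {"buffer_compression": codec}
    return body, footer


def read_buffers(body: List[bytes], footer: Dict[str, str]) -> List[bytes]:
    """Decompress every buffer according to the global footer setting."""
    codec = footer["buffer_compression"]
    if codec == "zlib":
        return [zlib.decompress(buf) for buf in body]
    if codec == "none":
        return list(body)
    raise ValueError(f"unsupported codec: {codec}")


buffers = [b"\x00" * 1024, b"arrow" * 200]
body, footer = write_buffers(buffers, "zlib")
restored = read_buffers(body, footer)
```

Because the setting is global, a reader only needs the footer to know how to 
decode every buffer, which keeps the scheme simple at the cost of not being 
able to choose a codec per buffer.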


