[jira] [Created] (ARROW-14488) [Python] Incorrect inferred schema from pandas dataframe with length 0.
Yuan Zhou created ARROW-14488:

Summary: [Python] Incorrect inferred schema from pandas dataframe with length 0.
Key: ARROW-14488
URL: https://issues.apache.org/jira/browse/ARROW-14488
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 5.0.0
Environment: OS: Windows 10, CentOS 7
Reporter: Yuan Zhou

We use pandas (with the pyarrow engine) to write Parquet files, and those files are consumed by other applications, such as Java apps using org.apache.parquet.hadoop.ParquetFileReader. We found that for some empty dataframes those applications see an incorrect schema for string columns. After some investigation, we narrowed the issue down to pyarrow's schema inference:

{{In [1]: import pandas as pd}}
{{In [2]: df = pd.DataFrame([['a', 1, 1.0]], columns=['a', 'b', 'c'])}}
{{In [3]: import pyarrow as pa}}
{{In [4]: pa.Schema.from_pandas(df)}}
{{Out[4]:}}
{{a: string}}
{{b: int64}}
{{c: double}}
{{-- schema metadata --}}
{{pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 562}}
{{In [5]: pa.Schema.from_pandas(df.head(0))}}
{{Out[5]:}}
{{a: null}}
{{b: int64}}
{{c: double}}
{{-- schema metadata --}}
{{pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 560}}
{{In [6]: pa.__version__}}
{{Out[6]: '5.0.0'}}

Is this expected behavior? Is there a workaround for this issue? Could anyone take a look, please? Thanks!

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12147) TimestampTZ functions
Yuan Zhou created ARROW-12147:

Summary: TimestampTZ functions
Key: ARROW-12147
URL: https://issues.apache.org/jira/browse/ARROW-12147
Project: Apache Arrow
Issue Type: Improvement
Components: C++ - Gandiva
Reporter: Yuan Zhou

Hi Arrow developers,

Gandiva already supports timestamp-related functions, but it looks like they handle UTC only. It would be nice to also support TimestampTZ (timestamp with timezone). Passing a timestamp with a timezone to these functions returns a wrong result, because they do not take the timezone offset into account.
[https://github.com/apache/arrow/blob/master/cpp/src/gandiva/precompiled/time.cc#L41]

I guess we could provide a helper function like the one below to convert to a UTC timestamp first; all existing functions should then work correctly.
{{_ConvertTIMESTAMP(timestamp, timezone)_}}

A better approach might be to re-implement those functions so that they take the zone offset into account during calculation, but this may make the code more complicated.

Note that castTIMESTAMP_utf8 already supports creating a timestamp with a timezone.
[https://github.com/apache/arrow/blob/master/cpp/src/gandiva/precompiled/time.cc#L618-L743]

Thanks,
-yuan
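The helper idea can be illustrated with a small Python sketch. It is one possible reading of the proposal, not Gandiva code: extract_hour_utc stands in for an existing UTC-only function, and shift_to_local_millis is a hypothetical helper that applies the zone offset before the UTC-only function runs. A fixed-offset zone is used to keep the sketch self-contained.

```python
from datetime import datetime, timedelta, timezone, tzinfo

MILLIS_PER_HOUR = 3_600_000


def extract_hour_utc(millis: int) -> int:
    """Stand-in for a UTC-only function: hour of day from epoch milliseconds."""
    return (millis // MILLIS_PER_HOUR) % 24


def shift_to_local_millis(utc_millis: int, tz: tzinfo) -> int:
    """Hypothetical helper: add the zone's UTC offset at that instant."""
    instant = datetime.fromtimestamp(utc_millis // 1000, tz=timezone.utc)
    offset = instant.astimezone(tz).utcoffset()
    return utc_millis + int(offset.total_seconds() * 1000)


utc_millis = 1_600_000_000_000  # 2020-09-13 12:26:40 UTC
plus_eight = timezone(timedelta(hours=8))  # fixed UTC+08:00 zone, for illustration

hour_without_offset = extract_hour_utc(utc_millis)
hour_with_offset = extract_hour_utc(shift_to_local_millis(utc_millis, plus_eight))
```

Here hour_without_offset is the UTC hour, while hour_with_offset is the wall-clock hour in the UTC+08:00 zone, obtained without changing the UTC-only function itself.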
[jira] [Created] (ARROW-9087) Missing HDFS options parsing
Yuan Zhou created ARROW-9087:

Summary: Missing HDFS options parsing
Key: ARROW-9087
URL: https://issues.apache.org/jira/browse/ARROW-9087
Project: Apache Arrow
Issue Type: Bug
Reporter: Yuan Zhou
Assignee: Yuan Zhou

The HDFS options for the Kerberos ticket and extra conf are not parsed.
[jira] [Created] (ARROW-8609) orc JNI bridge crashed on null arrow buffer
Yuan Zhou created ARROW-8609:

Summary: orc JNI bridge crashed on null arrow buffer
Key: ARROW-8609
URL: https://issues.apache.org/jira/browse/ARROW-8609
Project: Apache Arrow
Issue Type: Bug
Components: C++
Reporter: Yuan Zhou
Assignee: Yuan Zhou

https://github.com/apache/arrow/blob/master/cpp/src/jni/orc/jni_wrapper.cpp#L278-L281

We should check whether the Arrow buffer is null and pass the right value to the constructor.
[jira] [Updated] (ARROW-8609) [C++]orc JNI bridge crashed on null arrow buffer
[ https://issues.apache.org/jira/browse/ARROW-8609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuan Zhou updated ARROW-8609:

Summary: [C++]orc JNI bridge crashed on null arrow buffer (was: orc JNI bridge crashed on null arrow buffer)

> [C++]orc JNI bridge crashed on null arrow buffer
>
> Key: ARROW-8609
> URL: https://issues.apache.org/jira/browse/ARROW-8609
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Reporter: Yuan Zhou
> Assignee: Yuan Zhou
> Priority: Major
>
> https://github.com/apache/arrow/blob/master/cpp/src/jni/orc/jni_wrapper.cpp#L278-L281
> We should check whether the Arrow buffer is null and pass the right value to
> the constructor.
[jira] [Created] (ARROW-8360) Fixes date32 support for date/time functions
Yuan Zhou created ARROW-8360:

Summary: Fixes date32 support for date/time functions
Key: ARROW-8360
URL: https://issues.apache.org/jira/browse/ARROW-8360
Project: Apache Arrow
Issue Type: Bug
Components: C++ - Gandiva
Reporter: Yuan Zhou
Assignee: Yuan Zhou

Gandiva date/time functions such as extractYear [1] only work with millisecond values; passing date32 values to these functions yields wrong results.

[1] https://github.com/apache/arrow/blob/6d92694d00aec08081ae1bfe06f0a265e141b1b7/cpp/src/gandiva/precompiled/time.cc#L75-L80
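The mismatch can be shown with a small Python sketch. It is an illustration of the unit problem only, not the Gandiva implementation: Arrow's date32 counts days since the Unix epoch, so a function that interprets its input as milliseconds sees a value close to the epoch. Both extract_year_from_millis and date32_to_millis are hypothetical names used for the example.

```python
from datetime import date, datetime, timedelta, timezone

MILLIS_PER_DAY = 24 * 60 * 60 * 1000
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def extract_year_from_millis(millis: int) -> int:
    """Stand-in for a millisecond-based function such as extractYear."""
    return (EPOCH + timedelta(milliseconds=millis)).year


def date32_to_millis(days: int) -> int:
    """Hypothetical conversion: days since the epoch to milliseconds since the epoch."""
    return days * MILLIS_PER_DAY


# date32 value for 2020-01-01, i.e. the number of days since 1970-01-01.
date32_value = (date(2020, 1, 1) - date(1970, 1, 1)).days

# Feeding the day count directly treats it as a few seconds after the epoch.
wrong_year = extract_year_from_millis(date32_value)

# Scaling to milliseconds first gives the intended year.
right_year = extract_year_from_millis(date32_to_millis(date32_value))
```

Under these assumptions wrong_year comes out as the epoch year and right_year as the year of the original date, which is the kind of wrong result the report describes.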
[jira] [Created] (ARROW-8312) improve IN expression support
Yuan Zhou created ARROW-8312:

Summary: improve IN expression support
Key: ARROW-8312
URL: https://issues.apache.org/jira/browse/ARROW-8312
Project: Apache Arrow
Issue Type: Improvement
Components: C++ - Gandiva, Java
Reporter: Yuan Zhou
Assignee: Yuan Zhou

The IN API provided by Gandiva C++ [1] accepts a TreeNode as a parameter, which allows an IN expression to operate on the output of a function. In the Java API [2], however, an IN expression only accepts a Field as a parameter, which limits how the API can be used.

[1] https://github.com/apache/arrow/blob/master/cpp/src/gandiva/tree_expr_builder.h#L94-L125
[2] https://github.com/apache/arrow/blob/master/java/gandiva/src/main/java/org/apache/arrow/gandiva/expression/InNode.java#L50-L63
[jira] [Created] (ARROW-7336) implement minmax options
Yuan Zhou created ARROW-7336:

Summary: implement minmax options
Key: ARROW-7336
URL: https://issues.apache.org/jira/browse/ARROW-7336
Project: Apache Arrow
Issue Type: Improvement
Components: C++ - Compute
Reporter: Yuan Zhou
Assignee: Yuan Zhou

The minmax kernel has MinMaxOptions, but they are not used.
[jira] [Commented] (ARROW-7083) [C++] Determine the feasibility and build a prototype to replace compute/kernels with gandiva kernels
[ https://issues.apache.org/jira/browse/ARROW-7083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969683#comment-16969683 ]

Yuan Zhou commented on ARROW-7083:

Hi [~emkornfi...@gmail.com]

For the coming AQE, which kernels will Arrow use? Will it use 100% C++ kernels, or a combination of C++ and Gandiva kernels? The design draft seems to combine the two kinds of kernels.
https://docs.google.com/document/d/10RoUZmiMQRi_J1FcPeVAUAMJ6d_ZuiEbaM2Y33sNPu4/edit#heading=h.2k6k5a4y9b8y

Cheers,
-yuan

> [C++] Determine the feasibility and build a prototype to replace
> compute/kernels with gandiva kernels
>
> Key: ARROW-7083
> URL: https://issues.apache.org/jira/browse/ARROW-7083
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++, C++ - Compute, C++ - Gandiva
> Reporter: Micah Kornfield
> Priority: Major
>
> See discussion on [https://issues.apache.org/jira/browse/ARROW-7017]
>
> Requirements:
> 1. No hard runtime dependency on LLVM
> 2. Ability to run without LLVM static/shared libraries.
>
> Open questions:
> 1. What dependencies does this add to the build tool chain?
[jira] [Commented] (ARROW-300) [Format] Add buffer compression option to IPC file format
[ https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16959552#comment-16959552 ]

Yuan Zhou commented on ARROW-300:

Hi [~wesm], thanks for providing the general idea. I'm quite interested in this feature. Do you happen to have any updates on the detailed proposal?

Cheers,
-yuan

> [Format] Add buffer compression option to IPC file format
>
> Key: ARROW-300
> URL: https://issues.apache.org/jira/browse/ARROW-300
> Project: Apache Arrow
> Issue Type: New Feature
> Components: Format
> Reporter: Wes McKinney
> Assignee: Wes McKinney
> Priority: Major
> Fix For: 1.0.0
>
> It may be useful if data is to be sent over the wire to compress the data
> buffers themselves as they're being written in the file layout.
> I would propose that we keep this extremely simple with a global buffer
> compression setting in the file Footer. Probably only two compressors worth
> supporting out of the box would be zlib (higher compression ratios) and lz4
> (better performance).
> What does everyone think?