[jira] [Created] (ARROW-14488) [Python] Incorrect inferred schema from pandas dataframe with length 0.
Yuan Zhou created ARROW-14488: - Summary: [Python] Incorrect inferred schema from pandas dataframe with length 0. Key: ARROW-14488 URL: https://issues.apache.org/jira/browse/ARROW-14488 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 5.0.0 Environment: OS: Windows 10, CentOS 7 Reporter: Yuan Zhou

We use pandas (with the pyarrow engine) to write out parquet files, and those outputs are consumed by other applications such as Java apps using org.apache.parquet.hadoop.ParquetFileReader. We found that some empty dataframes get an incorrect schema for string columns in those applications. After some investigation, we narrowed the issue down to the schema inference by pyarrow:

{code:python}
In [1]: import pandas as pd

In [2]: df = pd.DataFrame([['a', 1, 1.0]], columns=['a', 'b', 'c'])

In [3]: import pyarrow as pa

In [4]: pa.Schema.from_pandas(df)
Out[4]:
a: string
b: int64
c: double
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 562

In [5]: pa.Schema.from_pandas(df.head(0))
Out[5]:
a: null
b: int64
c: double
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 560

In [6]: pa.__version__
Out[6]: '5.0.0'
{code}

Is this expected behavior? Or is there a workaround for this issue? Could anyone take a look please? Thanks!

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-14487) [R] Implement altrep Extract_subset() methods
Romain Francois created ARROW-14487: --- Summary: [R] Implement altrep Extract_subset() methods Key: ARROW-14487 URL: https://issues.apache.org/jira/browse/ARROW-14487 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Romain Francois Assignee: Romain Francois -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-14486) [Packaging][deb] libthrift-dev dependency is missing
Kouhei Sutou created ARROW-14486: Summary: [Packaging][deb] libthrift-dev dependency is missing Key: ARROW-14486 URL: https://issues.apache.org/jira/browse/ARROW-14486 Project: Apache Arrow Issue Type: Improvement Components: Packaging Affects Versions: 6.0.0 Reporter: Kouhei Sutou Assignee: Kouhei Sutou Fix For: 7.0.0, 6.0.1 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-14485) ParquetFile.read_row_group loses struct nullability when selecting one column from a struct
Jim Pivarski created ARROW-14485: Summary: ParquetFile.read_row_group loses struct nullability when selecting one column from a struct Key: ARROW-14485 URL: https://issues.apache.org/jira/browse/ARROW-14485 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 6.0.0 Reporter: Jim Pivarski Attachments: test8.parquet

This appeared minutes ago because we have a test suite that saw Arrow 6.0.0 land in PyPI. (Congrats, by the way! I've been looking forward to this one!) Below, you'll see one thing that version 6 fixed (asking for one column in a nested struct returns only that one column) and a new error (it does not preserve nullability of the surrounding struct). Here, I'll write down the steps to reproduce and then explain.

{code:python}
Python 3.9.7 | packaged by conda-forge | (default, Sep 29 2021, 19:20:46)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow.parquet
>>> pyarrow.__version__
'5.0.0'
>>> file = pyarrow.parquet.ParquetFile("test8.parquet")
>>> file.schema
required group field_id=-1 schema {
  required group field_id=-1 x (List) {
    repeated group field_id=-1 list {
      required group field_id=-1 item {
        required int64 field_id=-1 y;
        required double field_id=-1 z;
      }
    }
  }
}
>>> file.schema_arrow
x: large_list<item: struct<y: int64 not null, z: double not null> not null> not null
  child 0, item: struct<y: int64 not null, z: double not null> not null
      child 0, y: int64 not null
      child 1, z: double not null
>>> file.read_row_group(0, ["x.list.item.y"]).schema
x: large_list<item: struct<y: int64 not null, z: double not null> not null> not null
  child 0, item: struct<y: int64 not null, z: double not null> not null
      child 0, y: int64 not null
      child 1, z: double not null
>>> file.read_row_group(0, ["x.list.item.y", "x.list.item.z"]).schema
x: large_list<item: struct<y: int64 not null, z: double not null> not null> not null
  child 0, item: struct<y: int64 not null, z: double not null> not null
      child 0, y: int64 not null
      child 1, z: double not null
>>> file.read_row_group(0).schema
x: large_list<item: struct<y: int64 not null, z: double not null> not null> not null
  child 0, item: struct<y: int64 not null, z: double not null> not null
      child 0, y: int64 not null
      child 1, z: double not null

Python 3.9.7 | packaged by conda-forge | (default, Sep 29 2021, 19:20:46)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow.parquet
>>> pyarrow.__version__
'6.0.0'
>>> file = pyarrow.parquet.ParquetFile("test8.parquet")
>>> file.schema
required group field_id=-1 schema {
  required group field_id=-1 x (List) {
    repeated group field_id=-1 list {
      required group field_id=-1 item {
        required int64 field_id=-1 y;
        required double field_id=-1 z;
      }
    }
  }
}
>>> file.schema_arrow
x: large_list<item: struct<y: int64 not null, z: double not null> not null> not null
  child 0, item: struct<y: int64 not null, z: double not null> not null
      child 0, y: int64 not null
      child 1, z: double not null
>>> file.read_row_group(0, ["x.list.item.y"]).schema
x: large_list<item: struct<y: int64 not null>> not null
  child 0, item: struct<y: int64 not null>
      child 0, y: int64 not null
>>> file.read_row_group(0, ["x.list.item.y", "x.list.item.z"]).schema
x: large_list<item: struct<y: int64 not null, z: double not null> not null> not null
  child 0, item: struct<y: int64 not null, z: double not null> not null
      child 0, y: int64 not null
      child 1, z: double not null
>>> file.read_row_group(0).schema
x: large_list<item: struct<y: int64 not null, z: double not null> not null> not null
  child 0, item: struct<y: int64 not null, z: double not null> not null
      child 0, y: int64 not null
      child 1, z: double not null
{code}

In Arrow 5, asking for only column {{"x.list.item.y"}} returns a struct of type {{x: large_list<item: struct<y: int64 not null, z: double not null> not null> not null}}, which was undesirable because it has unnecessarily read the {{"z"}} column, but it got all of the {{"not null"}} types right. In test8.parquet, the data are non-nullable at each level.

In Arrow 6, asking for only column {{"x.list.item.y"}} returns a struct of type {{x: large_list<item: struct<y: int64 not null>> not null}}, which is great because it's not reading the {{"z"}} column, but the struct's nullability is wrong: we should see three {{"not null"}}s here, one for the data in {{y}}, one for the {{struct}}, and one for the {{list}}. It's just missing the middle one.

When I ask for two columns specifically or don't specify the columns, the nullability is correct. I think that can help to narrow it down. I've attached the file (test8.parquet). It was the same in both of the above tests (generated by Arrow 5).
I labeled this as "Python" because I've only seen the symptom in Python, but I suspect that the actual error is in C++. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-14484) [Crossbow] Add support for specifying queue path by environment variable
Kouhei Sutou created ARROW-14484: Summary: [Crossbow] Add support for specifying queue path by environment variable Key: ARROW-14484 URL: https://issues.apache.org/jira/browse/ARROW-14484 Project: Apache Arrow Issue Type: Improvement Components: Developer Tools Reporter: Kouhei Sutou Assignee: Kouhei Sutou -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-14483) [Release] Packages for AlmaLinux and Amazon Linux aren't downloaded in verification script
Kouhei Sutou created ARROW-14483: Summary: [Release] Packages for AlmaLinux and Amazon Linux aren't downloaded in verification script Key: ARROW-14483 URL: https://issues.apache.org/jira/browse/ARROW-14483 Project: Apache Arrow Issue Type: Improvement Components: Developer Tools Reporter: Kouhei Sutou Assignee: Kouhei Sutou -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-14482) [C++][Gandiva] Implement MASK_FIRST_N and MASK_LAST_N functions
Augusto Alves Silva created ARROW-14482: --- Summary: [C++][Gandiva] Implement MASK_FIRST_N and MASK_LAST_N functions Key: ARROW-14482 URL: https://issues.apache.org/jira/browse/ARROW-14482 Project: Apache Arrow Issue Type: New Feature Components: C++ - Gandiva Reporter: Augusto Alves Silva

*MASK_FIRST_N*
Returns a masked version of str with the first n values masked. Upper case letters are converted to "X", lower case letters are converted to "x" and numbers are converted to "n". For example, mask_first_n("1234-5678-8765-4321", 4) results in nnnn-5678-8765-4321.

*MASK_LAST_N*
Returns a masked version of str with the last n values masked. Upper case letters are converted to "X", lower case letters are converted to "x" and numbers are converted to "n". For example, mask_last_n("1234-5678-8765-4321", 4) results in 1234-5678-8765-nnnn.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
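The intended semantics can be sketched in Python (an illustration only; the actual implementation would be a Gandiva C++/LLVM function):

```python
def _mask_char(c):
    # Upper case -> "X", lower case -> "x", digits -> "n";
    # everything else passes through unchanged.
    if c.isupper():
        return "X"
    if c.islower():
        return "x"
    if c.isdigit():
        return "n"
    return c

def mask_first_n(s, n):
    # Mask only the first n characters, keep the rest.
    return "".join(_mask_char(c) for c in s[:n]) + s[n:]

def mask_last_n(s, n):
    # Mask only the last n characters, keep the rest.
    return s[:-n] + "".join(_mask_char(c) for c in s[-n:]) if n > 0 else s

print(mask_first_n("1234-5678-8765-4321", 4))  # nnnn-5678-8765-4321
print(mask_last_n("1234-5678-8765-4321", 4))   # 1234-5678-8765-nnnn
```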
[jira] [Created] (ARROW-14481) [C++] Investigate recent regressions in some utf8 kernel benchmarks
David Li created ARROW-14481: Summary: [C++] Investigate recent regressions in some utf8 kernel benchmarks Key: ARROW-14481 URL: https://issues.apache.org/jira/browse/ARROW-14481 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: David Li See [https://conbench.ursa.dev/benchmarks/6ccff6887e7c47148a09fe46f18c8688/] Some (on the surface) unrelated commits have caused performance for a few string kernels to plummet. We should try to replicate locally. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-14480) [R] Expose to R
Weston Pace created ARROW-14480: --- Summary: [R] Expose to R Key: ARROW-14480 URL: https://issues.apache.org/jira/browse/ARROW-14480 Project: Apache Arrow Issue Type: Sub-task Components: R Reporter: Weston Pace Fix For: 6.0.1 Trying to keep this an R-only thing for ease of patch/CRAN. Not sure if the fix version should be 6.0.1 or 7.0.0. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-14479) [C++][Compute] Hash Join microbenchmarks
Michal Nowakiewicz created ARROW-14479: -- Summary: [C++][Compute] Hash Join microbenchmarks Key: ARROW-14479 URL: https://issues.apache.org/jira/browse/ARROW-14479 Project: Apache Arrow Issue Type: Improvement Components: C++ Affects Versions: 7.0.0 Reporter: Michal Nowakiewicz Assignee: Sasha Krassovsky Fix For: 7.0.0

Implement a series of microbenchmarks giving a good picture of the performance of the hash join implemented in Arrow across different sets of dimensions. Compare the performance against some other product(s). Add scripts for generating useful visual reports giving a good picture of the costs of hash join.

Examples of dimensions to explore in microbenchmarks:
* number of duplicate keys on build side
* relative size of build side to probe side
* selectivity of the join
* number of key columns
* number of payload columns
* filtering performance for semi- and anti-joins
* dense integer key vs sparse integer key vs string key
* build size
* scaling of build, filtering, probe
* inner vs left outer, inner vs right outer
* left semi vs right semi, left anti vs right anti, left outer vs right outer
* non-uniform key distribution
* monotonic key values in input, partitioned key values in input (with and without per batch min-max metadata)
* chain of multiple hash joins
* overhead of Bloom filter for non-selective Bloom filter

-- This message was sent by Atlassian Jira (v8.3.4#803005)
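One way to organize such a sweep (a sketch only; the dimension names and values below are hypothetical, not the benchmark's actual parameters) is to enumerate the cross-product of the dimensions and run one benchmark case per combination:

```python
import itertools

# Hypothetical dimension grid for a hash-join benchmark sweep.
dimensions = {
    "build_rows": [1_000, 100_000, 10_000_000],
    "probe_to_build_ratio": [1, 16, 256],
    "key_duplicates": [1, 8, 64],
    "join_type": ["inner", "left_outer", "left_semi", "left_anti"],
    "key_kind": ["dense_int", "sparse_int", "string"],
}

# One dict per benchmark case, e.g. {"build_rows": 1000, "join_type": "inner", ...}
cases = [dict(zip(dimensions, values))
         for values in itertools.product(*dimensions.values())]
print(len(cases))  # 3 * 3 * 3 * 4 * 3 = 324 parameter combinations
```

A full grid over all the dimensions listed above would explode combinatorially, so in practice each report would likely sweep one or two dimensions while holding the rest at defaults.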
[jira] [Created] (ARROW-14478) [C++] Potential stack overflow in async scanner
David Li created ARROW-14478: Summary: [C++] Potential stack overflow in async scanner Key: ARROW-14478 URL: https://issues.apache.org/jira/browse/ARROW-14478 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: David Li

Observed in [AppVeyor CI|https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/41288964/job/6s4bx6cd2kc6eld6] on the main branch:

{noformat}
[ RUN      ] TestScan/TestCsvFileFormatScan.ScanBatchSize/0AsyncThreaded16b1024r
unknown file: error: SEH exception with code 0xc0fd thrown in the test body.
[  FAILED  ] TestScan/TestCsvFileFormatScan.ScanBatchSize/0AsyncThreaded16b1024r, where GetParam() = AsyncThreaded16b1024r (250 ms)
{noformat}

From some searching, this error code corresponds to a stack overflow. We've previously seen errors similar to this, so it might be good to identify and track this down too. (It seems less likely on Linux due to the larger default stack size.)

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-14477) [C++] Timezone-aware kernels should also handle offset strings
David Li created ARROW-14477: Summary: [C++] Timezone-aware kernels should also handle offset strings Key: ARROW-14477 URL: https://issues.apache.org/jira/browse/ARROW-14477 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: David Li Both the [format|https://github.com/apache/arrow/blob/836ffa5656d5107fd4895ae8d7eb0e20a3df23ba/format/Schema.fbs#L341-L347] and the [C++ library|https://github.com/apache/arrow/blob/836ffa5656d5107fd4895ae8d7eb0e20a3df23ba/cpp/src/arrow/type.h#L1233-L1237] allow this, but kernels rely on a helper assuming that the timezone field of a timestamp is always a timezone name and not a timezone offset. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-14476) [CI] Crossbow should comment cause of failure
Balazs Jeszenszky created ARROW-14476: - Summary: [CI] Crossbow should comment cause of failure Key: ARROW-14476 URL: https://issues.apache.org/jira/browse/ARROW-14476 Project: Apache Arrow Issue Type: Improvement Components: Continuous Integration Reporter: Balazs Jeszenszky Fix For: 7.0.0 Instead of just giving a thumbs down, Crossbow should comment with a link to the failing job (e.g. https://github.com/apache/arrow/runs/4010195788?check_suite_focus=true), or its stack trace (usually under 'handle github commit event'). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-14475) [C++] Don't shadow enable_if helpers in kernel implementations
David Li created ARROW-14475: Summary: [C++] Don't shadow enable_if helpers in kernel implementations Key: ARROW-14475 URL: https://issues.apache.org/jira/browse/ARROW-14475 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: David Li A few kernel implementation files define enable_if helpers that shadow existing ones, which can cause strange errors in unity builds. For example: scalar_arithmetic.cc defines {{enable_if_floating_point}} for the C types float/double, which conflicts with the one defined in type_traits.h for the Arrow types. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-14474) [Java] Add support for sliced arrays in C Data Interface
Roee Shlomo created ARROW-14474: --- Summary: [Java] Add support for sliced arrays in C Data Interface Key: ARROW-14474 URL: https://issues.apache.org/jira/browse/ARROW-14474 Project: Apache Arrow Issue Type: Bug Affects Versions: 6.0.0 Reporter: Roee Shlomo The Java implementation of the C Data Interface does not support arrays with a non-zero offset. This means that arrays like {{pyarrow.array([0, None, 2, 3, 4]).slice(1, 2)}} cannot be moved to a Java process. This is not even documented as required by the spec, because it was an oversight. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-14473) [JS][Release] Ensure can use nohup with the release script
Benson Muite created ARROW-14473: Summary: [JS][Release] Ensure can use nohup with the release script Key: ARROW-14473 URL: https://issues.apache.org/jira/browse/ARROW-14473 Project: Apache Arrow Issue Type: Improvement Components: JavaScript Affects Versions: 7.0.0 Reporter: Benson Muite Assignee: Benson Muite

Node may have problems reading and writing files when called using nohup. Directly running

{code:bash}
env "TEST_DEFAULT=0" env "TEST_JS=1" bash dev/release/verify-release-candidate.sh source 6.0.0 3
{code}

seems to work, but

{code:bash}
nohup env "TEST_DEFAULT=0" env "TEST_JS=1" bash dev/release/verify-release-candidate.sh source 6.0.0 3 > log.out &
{code}

may not work [1]. Either document that one can use

{code:bash}
(nohup env "TEST_DEFAULT=0" env "TEST_JS=1" bash dev/release/verify-release-candidate.sh source 6.0.0 3 > log.out &)
{code}

or modify the JavaScript implementation so that it can run as a background process and still find files, so that the following error is not obtained:

{code:bash}
yarn run v1.22.17
$ /tmp/arrow-6.0.0.BDnN3/apache-arrow-6.0.0/js/node_modules/.bin/run-s clean:all lint build
events.js:377
      throw er; // Unhandled 'error' event
      ^

Error: EBADF: bad file descriptor, read
Emitted 'error' event on ReadStream instance at:
    at internal/fs/streams.js:173:14
    at FSReqCallback.wrapper [as oncomplete] (fs.js:562:5) {
  errno: -9,
  code: 'EBADF',
  syscall: 'read'
}
error Command failed with exit code 1.
{code}

[1] https://stackoverflow.com/questions/16604176/error-ebadf-bad-file-descriptor-when-running-node-using-nohup-of-forever

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-14472) [Dev][Archery] Generate contribution statistics using archery
Krisztian Szucs created ARROW-14472: --- Summary: [Dev][Archery] Generate contribution statistics using archery Key: ARROW-14472 URL: https://issues.apache.org/jira/browse/ARROW-14472 Project: Apache Arrow Issue Type: Improvement Components: Archery, Developer Tools Reporter: Krisztian Szucs Currently we use a bash script to do that: https://github.com/apache/arrow/blob/master/dev/release/post-03-website.sh#L47-L67 Since the Rust repository split, this logic needs to be extended. Additionally, the script expects GNU {{date}} commands, which are not available on macOS by default. -- This message was sent by Atlassian Jira (v8.3.4#803005)
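A sketch of the kind of helper the archery port might contain (the function name is hypothetical): parse {{git shortlog -sn}} output into per-contributor commit counts in Python, which sidesteps the GNU {{date}} dependency entirely.

```python
def parse_shortlog(output):
    # Each line of `git shortlog -sn ref1..ref2` looks like "   42\tJane Doe":
    # a whitespace-padded commit count, a tab, then the contributor name.
    stats = []
    for line in output.strip().splitlines():
        count, _, name = line.strip().partition("\t")
        stats.append((name, int(count)))
    return stats

sample = "    42\tJane Doe\n     7\tJohn Smith\n"
print(parse_shortlog(sample))  # [('Jane Doe', 42), ('John Smith', 7)]
```

The same parser could be run against each repository after the Rust split and the results merged, which is the extension the bash script cannot easily accommodate.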
[jira] [Created] (ARROW-14471) [R] Implement lubridate's date/time parsing functions
Nicola Crane created ARROW-14471: Summary: [R] Implement lubridate's date/time parsing functions Key: ARROW-14471 URL: https://issues.apache.org/jira/browse/ARROW-14471 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane

Parse dates with year, month, and day components:
ymd() ydm() mdy() myd() dmy() dym() yq() ym() my()

Parse date-times with year, month, and day, hour, minute, and second components:
ymd_hms() ymd_hm() ymd_h() dmy_hms() dmy_hm() dmy_h() mdy_hms() mdy_hm() mdy_h() ydm_hms() ydm_hm() ydm_h()

Parse periods with hour, minute, and second components:
ms() hm() hms()

-- This message was sent by Atlassian Jira (v8.3.4#803005)
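The semantics of the order-based parsers can be sketched with strptime (an illustration of what the ymd()-family does, not the actual Arrow binding; the separator handling here is a simplifying assumption, since lubridate is more flexible about separators):

```python
from datetime import datetime

# The function name encodes only the *order* of the components.
DIRECTIVES = {"y": "%Y", "m": "%m", "d": "%d"}

def parse_ordered(text, order, sep="-"):
    # Build the strptime format matching the requested component order.
    fmt = sep.join(DIRECTIVES[c] for c in order)
    return datetime.strptime(text, fmt)

print(parse_ordered("2021-10-28", "ymd"))  # 2021-10-28 00:00:00
print(parse_ordered("28-10-2021", "dmy"))  # 2021-10-28 00:00:00
```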
[jira] [Created] (ARROW-14470) [Python] Expose the use_threads option in Feather read functions
Joris Van den Bossche created ARROW-14470: - Summary: [Python] Expose the use_threads option in Feather read functions Key: ARROW-14470 URL: https://issues.apache.org/jira/browse/ARROW-14470 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche On the C++ side, the Feather V2 Reader wraps the IPC RecordBatchFileReader, which accepts IpcReadOptions which can control the use of threads (and the default memory pool and some other options). On the Python (Cython) side, those options are not passed through. As a consequence, the {{use_threads}} keyword only disables multithreading in the conversion from Arrow table to pandas DataFrame, and not the actual reading. As a follow-up on ARROW-13317, we can actually make this keyword control both. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-14469) [R] Binding for lubridate::month() doesn't have `label` argument implemented
Nicola Crane created ARROW-14469: Summary: [R] Binding for lubridate::month() doesn't have `label` argument implemented Key: ARROW-14469 URL: https://issues.apache.org/jira/browse/ARROW-14469 Project: Apache Arrow Issue Type: Bug Components: R Reporter: Nicola Crane It will be worth checking the other lubridate temporal extraction bindings to see whether any others also need extra arguments implemented. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-14468) [Python] Resolve parquet version deprecation warnings when compiling pyarrow
Krisztian Szucs created ARROW-14468: --- Summary: [Python] Resolve parquet version deprecation warnings when compiling pyarrow Key: ARROW-14468 URL: https://issues.apache.org/jira/browse/ARROW-14468 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Krisztian Szucs

{code}
/tmp/arrow-6.0.0.theE2/apache-arrow-6.0.0/python/build/temp.linux-x86_64-3.8/_parquet.cpp: In function ‘PyObject* __pyx_pf_7pyarrow_8_parquet_12FileMetaData_14format_version___get__(__pyx_obj_7pyarrow_8_parquet_FileMetaData*)’:
/tmp/arrow-6.0.0.theE2/apache-arrow-6.0.0/python/build/temp.linux-x86_64-3.8/_parquet.cpp:14168:36: warning: ‘parquet::ParquetVersion::PARQUET_2_0’ is deprecated: use PARQUET_2_4 or PARQUET_2_6 for fine-grained feature selection [-Wdeprecated-declarations]
14168 |       case parquet::ParquetVersion::PARQUET_2_0:
      |                                     ^~~
In file included from /tmp/arrow-6.0.0.theE2/install/include/parquet/types.h:30,
                 from /tmp/arrow-6.0.0.theE2/install/include/parquet/schema.h:32,
                 from /tmp/arrow-6.0.0.theE2/install/include/parquet/api/schema.h:21,
                 from /tmp/arrow-6.0.0.theE2/apache-arrow-6.0.0/python/build/temp.linux-x86_64-3.8/_parquet.cpp:734:
/tmp/arrow-6.0.0.theE2/install/include/parquet/type_fwd.h:44:5: note: declared here
   44 |     PARQUET_2_0 ARROW_DEPRECATED_ENUM_VALUE("use PARQUET_2_4 or PARQUET_2_6 "
      |     ^~~
/tmp/arrow-6.0.0.theE2/apache-arrow-6.0.0/python/build/temp.linux-x86_64-3.8/_parquet.cpp:14168:36: warning: ‘parquet::ParquetVersion::PARQUET_2_0’ is deprecated: use PARQUET_2_4 or PARQUET_2_6 for fine-grained feature selection [-Wdeprecated-declarations]
14168 |       case parquet::ParquetVersion::PARQUET_2_0:
      |                                     ^~~
In file included from /tmp/arrow-6.0.0.theE2/install/include/parquet/types.h:30,
                 from /tmp/arrow-6.0.0.theE2/install/include/parquet/schema.h:32,
                 from /tmp/arrow-6.0.0.theE2/install/include/parquet/api/schema.h:21,
                 from /tmp/arrow-6.0.0.theE2/apache-arrow-6.0.0/python/build/temp.linux-x86_64-3.8/_parquet.cpp:734:
/tmp/arrow-6.0.0.theE2/install/include/parquet/type_fwd.h:44:5: note: declared here
   44 |     PARQUET_2_0 ARROW_DEPRECATED_ENUM_VALUE("use PARQUET_2_4 or PARQUET_2_6 "
      |     ^~~
/tmp/arrow-6.0.0.theE2/apache-arrow-6.0.0/python/build/temp.linux-x86_64-3.8/_parquet.cpp: In function ‘std::shared_ptr<parquet::WriterProperties> __pyx_f_7pyarrow_8_parquet__create_writer_properties(__pyx_opt_args_7pyarrow_8_parquet__create_writer_properties*)’:
/tmp/arrow-6.0.0.theE2/apache-arrow-6.0.0/python/build/temp.linux-x86_64-3.8/_parquet.cpp:23800:62: warning: ‘parquet::ParquetVersion::PARQUET_2_0’ is deprecated: use PARQUET_2_4 or PARQUET_2_6 for fine-grained feature selection [-Wdeprecated-declarations]
23800 |     (void)(__pyx_v_props.version( parquet::ParquetVersion::PARQUET_2_0));
      |                                                              ^~~
In file included from /tmp/arrow-6.0.0.theE2/install/include/parquet/types.h:30,
                 from /tmp/arrow-6.0.0.theE2/install/include/parquet/schema.h:32,
                 from /tmp/arrow-6.0.0.theE2/install/include/parquet/api/schema.h:21,
                 from /tmp/arrow-6.0.0.theE2/apache-arrow-6.0.0/python/build/temp.linux-x86_64-3.8/_parquet.cpp:734:
/tmp/arrow-6.0.0.theE2/install/include/parquet/type_fwd.h:44:5: note: declared here
   44 |     PARQUET_2_0 ARROW_DEPRECATED_ENUM_VALUE("use PARQUET_2_4 or PARQUET_2_6 "
      |     ^~~
/tmp/arrow-6.0.0.theE2/apache-arrow-6.0.0/python/build/temp.linux-x86_64-3.8/_parquet.cpp:23800:62: warning: ‘parquet::ParquetVersion::PARQUET_2_0’ is deprecated: use PARQUET_2_4 or PARQUET_2_6 for fine-grained feature selection [-Wdeprecated-declarations]
23800 |     (void)(__pyx_v_props.version( parquet::ParquetVersion::PARQUET_2_0));
      |                                                              ^~~
In file included from /tmp/arrow-6.0.0.theE2/install/include/parquet/types.h:30,
                 from /tmp/arrow-6.0.0.theE2/install/include/parquet/schema.h:32,
                 from /tmp/arrow-6.0.0.theE2/install/include/parquet/api/schema.h:21,
                 from /tmp/arrow-6.0.0.theE2/apache-arrow-6.0.0/python/build/temp.linux-x86_64-3.8/_parquet.cpp:734:
/tmp/arrow-6.0.0.theE2/install/include/parquet/type_fwd.h:44:5: note: declared here
   44 |     PARQUET_2_0 ARROW_DEPRECATED_ENUM_VALUE("use PARQUET_2_4 or PARQUET_2_6 "
      |     ^~~
{code}

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-14467) [C++][Python][Parquet] Uniform encryption
Maya Anderson created ARROW-14467: - Summary: [C++][Python][Parquet] Uniform encryption Key: ARROW-14467 URL: https://issues.apache.org/jira/browse/ARROW-14467 Project: Apache Arrow Issue Type: Improvement Components: C++, Parquet, Python Reporter: Maya Anderson Assignee: Maya Anderson PME supports using the same encryption key for all columns, which is useful in a number of scenarios. However, misuse of this feature can break the NIST limit on the number of AES GCM operations with one key, as reported in PARQUET-2040. We will develop limit-enforcing code and provide a Python API for uniform encryption, similarly to PARQUET-2040 and based on ARROW-9947. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-14466) [Java] Introduce memory leak detector/handler utility to hook on unused unreleased buffers
Hongze Zhang created ARROW-14466: Summary: [Java] Introduce memory leak detector/handler utility to hook on unused unreleased buffers Key: ARROW-14466 URL: https://issues.apache.org/jira/browse/ARROW-14466 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Hongze Zhang Assignee: Hongze Zhang See previous discussions in mail thread: https://lists.apache.org/thread.html/re9896b902cddc0931e4efbdecf27203710fb87505b63e927eef7ea77%40%3Cdev.arrow.apache.org%3E -- This message was sent by Atlassian Jira (v8.3.4#803005)