[jira] [Created] (ARROW-13321) [C++][P
Niranda Perera created ARROW-13321:

Summary: [C++][P
Key: ARROW-13321
URL: https://issues.apache.org/jira/browse/ARROW-13321
Project: Apache Arrow
Issue Type: Bug
Reporter: Niranda Perera

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-13320) [Website] Add MIME types to FAQ
Kouhei Sutou created ARROW-13320:

Summary: [Website] Add MIME types to FAQ
Key: ARROW-13320
URL: https://issues.apache.org/jira/browse/ARROW-13320
Project: Apache Arrow
Issue Type: Improvement
Components: Website
Reporter: Kouhei Sutou
Assignee: Wes McKinney
[jira] [Created] (ARROW-13319) [C++] NullArray::IsNull always returns false and NullArray::IsValid always returns true
Nate Clark created ARROW-13319:

Summary: [C++] NullArray::IsNull always returns false and NullArray::IsValid always returns true
Key: ARROW-13319
URL: https://issues.apache.org/jira/browse/ARROW-13319
Project: Apache Arrow
Issue Type: Bug
Components: C++
Affects Versions: 3.0.0, 5.0.0
Reporter: Nate Clark

NullArray::SetData sets null_bitmap_data_ to NULLPTR, which IsNull and IsValid interpret as meaning that no values are null. However, null_count() and length() return the same value, which implies that IsNull() and IsValid() should return true and false, respectively.
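To make the reported behavior concrete, here is a minimal pure-Python sketch of the validity-bitmap logic described above (the helper name `is_null` is illustrative, not Arrow's actual C++ code): when the bitmap is absent, every slot is reported as valid, which is wrong for a NullArray whose slots are all null.

```python
def is_null(null_bitmap, i):
    """Illustrative stand-in for Array::IsNull as described in the report.

    Arrow validity bitmaps store 1 for valid, 0 for null. When no bitmap
    is present (None here, NULLPTR in C++), every slot is treated as valid.
    """
    if null_bitmap is None:
        return False  # <- the behavior the report flags for NullArray
    return ((null_bitmap[i // 8] >> (i % 8)) & 1) == 0

# A NullArray carries no validity bitmap, yet null_count() == length():
length, null_count = 4, 4
print([is_null(None, i) for i in range(length)])  # all False, should be all True
```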
[jira] [Created] (ARROW-13318) kMaxParserNumRows Value Increase/Removal
Ryan Stalets created ARROW-13318:

Summary: kMaxParserNumRows Value Increase/Removal
Key: ARROW-13318
URL: https://issues.apache.org/jira/browse/ARROW-13318
Project: Apache Arrow
Issue Type: Improvement
Components: C++, Python
Reporter: Ryan Stalets

I'm a new PyArrow user and have been investigating occasional errors related to the Python exception "ArrowInvalid: Exceeded maximum rows" when parsing JSON line files using pyarrow.json.read_json(). Digging in, the original source of this exception appears to be cpp/src/arrow/json/parser.cc on line 703, which returns the error when the number of lines processed exceeds kMaxParserNumRows:

{code:java}
for (; num_rows_ < kMaxParserNumRows; ++num_rows_) {
  auto ok = reader.Parse(json, handler);
  switch (ok.Code()) {
    case rj::kParseErrorNone:
      // parse the next object
      continue;
    case rj::kParseErrorDocumentEmpty:
      // parsed all objects, finish
      return Status::OK();
    case rj::kParseErrorTermination:
      // handler emitted an error
      return handler.Error();
    default:
      // rj emitted an error
      return ParseError(rj::GetParseError_En(ok.Code()), " in row ", num_rows_);
  }
}
return Status::Invalid("Exceeded maximum rows");
{code}

This constant is set in arrow/json/parser.h on line 53 and has had this value since that file's initial commit:

{code:java}
constexpr int32_t kMaxParserNumRows = 10;{code}

There does not appear to be a comment in the code, or in the commit or PR, explaining this maximum number of lines. I'm wondering what the reason for this maximum might be, and whether it could be removed, increased, or made overridable in C++ and the upstream Python bindings. It is common to need to process JSON files of arbitrary length (logs from applications, third-party vendors, etc.) where the user of the data has no control over the size of the file.
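As an untested workaround sketch (pure Python, not an Arrow API): a JSONL file of arbitrary length can be split into chunks of at most `max_lines` lines, so that each chunk stays under whatever row cap the parser enforces; each chunk could then be parsed separately (e.g. fed to pyarrow.json.read_json through an in-memory buffer) and the resulting tables concatenated.

```python
from itertools import islice

def iter_line_chunks(path, max_lines):
    """Yield successive lists of at most max_lines lines from a JSONL file.

    Each chunk is small enough to parse independently; the per-chunk
    parse and the final concatenation (e.g. pyarrow.concat_tables) are
    left out, since this sketch only shows the splitting step.
    """
    with open(path, "r", encoding="utf-8") as f:
        while True:
            chunk = list(islice(f, max_lines))
            if not chunk:
                return
            yield chunk
```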
[jira] [Created] (ARROW-13317) [Python] Improve documentation on what 'use_threads' does in 'read_feather'
Arun Joseph created ARROW-13317:

Summary: [Python] Improve documentation on what 'use_threads' does in 'read_feather'
Key: ARROW-13317
URL: https://issues.apache.org/jira/browse/ARROW-13317
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Affects Versions: 4.0.1
Reporter: Arun Joseph

The current documentation for [read_feather|https://arrow.apache.org/docs/python/generated/pyarrow.feather.read_feather.html] states the following:

*use_threads* (_bool_, _default True_) – Whether to parallelize reading using multiple threads. If the underlying file uses compression, then multiple threads will still be spawned.

The wording for `use_threads` is ambiguous about whether the restriction applies only to the conversion from pyarrow to the pandas DataFrame, or also to the reading/decompression of the file itself, which might spawn additional threads. [set_cpu_count|http://arrow.apache.org/docs/python/generated/pyarrow.set_cpu_count.html#pyarrow.set_cpu_count] might be good to mention as well.
[jira] [Created] (ARROW-13316) [Python][C++][Doc] Fix warnings generated by sphinx when incorporating doxygen docs
Weston Pace created ARROW-13316:

Summary: [Python][C++][Doc] Fix warnings generated by sphinx when incorporating doxygen docs
Key: ARROW-13316
URL: https://issues.apache.org/jira/browse/ARROW-13316
Project: Apache Arrow
Issue Type: Improvement
Components: C++, Documentation, Python
Reporter: Weston Pace

Sphinx interprets the doxygen output to build the final documentation, and this process generates some warnings. This warning is generated when running doxygen:

{code:java}
warning: Tag 'COLS_IN_ALPHA_INDEX' at line 1118 of file 'Doxyfile' has become obsolete.
To avoid this warning please remove this line from your configuration file or upgrade it using "doxygen -u"
{code}

There are many warnings attributed to compute.rst that look like this (it is unclear where the doubled "static constexpr static" comes from, as it is not present in the repo or in any doxygen output that I can find):

{code:java}
/home/pace/dev/arrow/docs/source/cpp/api/compute.rst:51: WARNING: Invalid C++ declaration: Expected identifier in nested name, got keyword: static
[error at 23] static constexpr static char const kTypeName [] = "ScalarAggregateOptions"
{code}

There is a duplicate declaration warning (I think this one is because the doc comment is present on both the definition and the override):

{code:java}
/home/pace/dev/arrow/docs/source/cpp/api/dataset.rst:69: WARNING: Duplicate declaration, Result< std::shared_ptr< FileFragment > > MakeFragment (FileSource source, compute::Expression partition_expression, std::shared_ptr< Schema > physical_schema)
{code}

Finally, there is a specific issue with the GetRecordBatchGenerator function:

{code:java}
/home/pace/dev/arrow/docs/source/cpp/api/formats.rst:80: WARNING: Error when parsing function declaration.
If the function has no return type:
  Error in declarator or parameters-and-qualifiers
  Main error: Invalid C++ declaration: Expecting "(" in parameters-and-qualifiers.
  [error at 23] virtual ::arrow::Result< std::function<::arrow::Future< std::shared_ptr<::arrow::RecordBatch > >)> > GetRecordBatchGenerator (std::shared_ptr< FileReader > reader, const std::vector< int > row_group_indices, const std::vector< int > column_indices, ::arrow::internal::Executor *cpu_executor=NULLPTR)=0
  ---^
Potential other error:
  Error in parsing template argument list.
  If type argument:
    Main error: Invalid C++ declaration: Expected "...>", ">" or "," in template argument list.
    [error at 38] virtual ::arrow::Result< std::function<::arrow::Future< std::shared_ptr<::arrow::RecordBatch > >)> > GetRecordBatchGenerator (std::shared_ptr< FileReader > reader, const std::vector< int > row_group_indices, const std::vector< int > column_indices, ::arrow::internal::Executor *cpu_executor=NULLPTR)=0
    --^
Potential other error:
  Main error: Invalid C++ declaration: Expected identifier in nested name.
  [error at 38] virtual ::arrow::Result< std::function<::arrow::Future< std::shared_ptr<::arrow::RecordBatch > >)> > GetRecordBatchGenerator (std::shared_ptr< FileReader > reader, const std::vector< int > row_group_indices, const std::vector< int > column_indices, ::arrow::internal::Executor *cpu_executor=NULLPTR)=0
  --^
Potential other error:
  Error in parsing template argument list.
  If type argument:
    Invalid C++ declaration: Expected "...>", ">" or "," in template argument list.
    [error at 96] virtual ::arrow::Result< std::function<::arrow::Future< std::shared_ptr<::arrow::RecordBatch > >)> > GetRecordBatchGenerator (std::shared_ptr< FileReader > reader, const std::vector< int > row_group_indices, const std::vector< int > column_indices, ::arrow::internal::Executor *cpu_executor=NULLPTR)=0
    ^
  If non-type argument:
    Invalid C++ declaration: Expected "...>", ">" or "," in template argument list.
    [error at 96] virtual ::arrow::Result< std::function<::arrow::Future< std::shared_ptr<::arrow::RecordBatch > >)> > GetRecordBatchGenerator (std::shared_ptr< FileReader > reader, const std::vector< int > row_group_indices, const std::vector< int > column_indices, ::arrow::internal::Executor *cpu_executor=NULLPTR)=0
    ^
  If non-type argument:
    Invalid C++ declaration: Expected "...>", ">" or "," in template argument list.
    [error at 96] virtual ::arrow::Result< std::function<::arrow::Future< std::shared_ptr<::arrow::RecordBatch > >)> > GetRecordBatchGenerator
{code}
[jira] [Created] (ARROW-13315) [R] Wrap r_task_group includes with ARROW_R_WITH_ARROW checking
Jonathan Keane created ARROW-13315:

Summary: [R] Wrap r_task_group includes with ARROW_R_WITH_ARROW checking
Key: ARROW-13315
URL: https://issues.apache.org/jira/browse/ARROW-13315
Project: Apache Arrow
Issue Type: Bug
Reporter: Jonathan Keane
Assignee: Jonathan Keane
Fix For: 5.0.0

We need to wrap the includes at https://github.com/apache/arrow/blob/master/r/src/r_task_group.h#L20-L21 with:

{code}
#if defined(ARROW_R_WITH_ARROW)
...
#endif
{code}
[jira] [Created] (ARROW-13314) JSON parsing segment fault on long records (block_size dependent)
Guido Muscioni created ARROW-13314:

Summary: JSON parsing segment fault on long records (block_size dependent)
Key: ARROW-13314
URL: https://issues.apache.org/jira/browse/ARROW-13314
Project: Apache Arrow
Issue Type: Bug
Reporter: Guido Muscioni

Hello,

I have a big JSON file (~300MB) with complex records (nested JSON, nested lists of JSONs). When I try to read this with pyarrow I get a segmentation fault. I then tried a couple of things with the read options; please see the code below (I developed this code against the example file attached to https://issues.apache.org/jira/browse/ARROW-9612):

{code:python}
from pyarrow import json
from pyarrow.json import ReadOptions
import tqdm

if __name__ == '__main__':
    source = 'wiki_04.jsonl'
    ro = ReadOptions(block_size=2**20)
    with open(source, 'r') as file:
        for i, line in tqdm.tqdm(enumerate(file)):
            with open('temp_file_arrow_3.ndjson', 'a') as file2:
                file2.write(line)
            json.read_json('temp_file_arrow_3.ndjson', read_options=ro)
{code}

For both the example file and my file, this code raises the straddling-object exception (or seg faults) once the file reaches the block_size; increasing the block_size makes the code fail later. Then I tried, on my file, providing an explicit schema:

{code:python}
import pandas as pd
import pyarrow as pa
from pyarrow import json
from pyarrow.json import ReadOptions

if __name__ == '__main__':
    source = 'my_file.jsonl'
    df = pd.read_json(source, lines=True)
    table_schema = pa.Table.from_pandas(df).schema
    ro = ReadOptions(explicit_schema=table_schema)
    table = json.read_json(source, read_options=ro)
{code}

This works, which may suggest that this issue, and the issue in the linked JIRA, only appear when an explicit schema is not provided.
Additionally, the following code works as well:

{code:python}
from pyarrow import json
from pyarrow.json import ReadOptions

if __name__ == '__main__':
    source = 'my_file.jsonl'
    ro = ReadOptions(block_size=2**30)
    table = json.read_json(source, read_options=ro)
{code}

In this case the block_size is bigger than my file. Is it possible that the schema is inferred from the first block, and that I get a seg fault when the schema changes in a later block? I cannot share my JSON file; however, I hope someone can add some clarity on what I am seeing and maybe suggest a workaround.

Thank you,
Guido
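A small sketch of the workaround above, assuming (unverified) that making block_size at least as large as the file keeps any one record from straddling a block boundary; the helper name is hypothetical:

```python
import os

def whole_file_block_size(path, minimum=1 << 20):
    # Choose a block_size no smaller than the file itself, so the whole
    # JSONL input fits in a single parse block.
    return max(minimum, os.path.getsize(path))
```

The returned value would then be passed as `ReadOptions(block_size=...)`.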
[jira] [Created] (ARROW-13313) [C++][Compute] Add ScalarAggregateNode
Ben Kietzman created ARROW-13313:

Summary: [C++][Compute] Add ScalarAggregateNode
Key: ARROW-13313
URL: https://issues.apache.org/jira/browse/ARROW-13313
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Ben Kietzman
Assignee: Ben Kietzman

Provide an ExecNode which wraps ScalarAggregateFunctions.
[jira] [Created] (ARROW-13312) [C++] Bitmap::VisitWordAndWrite epilogue needs to work on Words (not bytes)
Niranda Perera created ARROW-13312:

Summary: [C++] Bitmap::VisitWordAndWrite epilogue needs to work on Words (not bytes)
Key: ARROW-13312
URL: https://issues.apache.org/jira/browse/ARROW-13312
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Niranda Perera

The recently added `Bitmap::VisitWordAndWrite` method translates the `visitor` lambda (which works on a `Word`) into a byte-visitor while handling the epilogue. This could lead to incorrect results in client code, e.g.:

{code:java}
// N readers, M writers
int64_t bits_written = 0;
auto visitor = [&](std::array<Word, N> in, std::array<Word, M>* out) {
  ...
  bits_written += (sizeof(Word) * 8);
};
{code}

At the end of the Visit, bits_written holds an incorrect sum, because the epilogue adds 64 to bits_written for each trailing byte, whereas it should have added 8.

Possible solution: add ReadTrailingWord and WriteTrailingWord functionality to BitmapWordReader and BitmapWordWriter respectively, and call the visitor with Words in the epilogue.
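A pure-Python sketch of the counting problem described above (illustrative only, not the Arrow implementation): the epilogue visits trailing bytes, but a visitor written for Words still adds sizeof(Word) * 8 = 64 bits per call, overcounting each trailing byte by 56 bits.

```python
WORD_BITS = 64
BYTE_BITS = 8

def count_visited_bits(total_bits):
    """Simulate the visitor's bits_written counter over one bitmap.

    Returns (buggy, correct): 'buggy' adds a full word per epilogue call,
    as the report describes; 'correct' adds only the 8 bits actually seen.
    """
    full_words, rem = divmod(total_bits, WORD_BITS)
    trailing_bytes = (rem + BYTE_BITS - 1) // BYTE_BITS
    buggy = correct = full_words * WORD_BITS
    for _ in range(trailing_bytes):
        buggy += WORD_BITS    # visitor assumes a whole Word was visited
        correct += BYTE_BITS  # only one byte of bits was actually visited
    return buggy, correct

print(count_visited_bits(144))  # 2 words + 2 trailing bytes -> (256, 144)
```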
[jira] [Created] (ARROW-13311) [C++][Documentation] List hash aggregate kernels somewhere
David Li created ARROW-13311:

Summary: [C++][Documentation] List hash aggregate kernels somewhere
Key: ARROW-13311
URL: https://issues.apache.org/jira/browse/ARROW-13311
Project: Apache Arrow
Issue Type: Improvement
Components: C++, Documentation
Reporter: David Li

Hash aggregate kernels are not listed in compute.rst with the rest of the functions, presumably because they're not intended to be directly callable. However, once ARROW-12759 goes in, we should find some place to list which aggregations are supported with group by.
[jira] [Created] (ARROW-13310) [C++] Implement hash_aggregate mode kernel
David Li created ARROW-13310:

Summary: [C++] Implement hash_aggregate mode kernel
Key: ARROW-13310
URL: https://issues.apache.org/jira/browse/ARROW-13310
Project: Apache Arrow
Issue Type: Improvement
Reporter: David Li

Requires ARROW-12759. We already have a scalar aggregate kernel for this, and hopefully its implementation can be reused. Note that Pandas doesn't actually expose this in DataFrameGroupBy.
[jira] [Created] (ARROW-13309) [C++] Implement hash_aggregate quantile kernel
David Li created ARROW-13309:

Summary: [C++] Implement hash_aggregate quantile kernel
Key: ARROW-13309
URL: https://issues.apache.org/jira/browse/ARROW-13309
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: David Li

Requires ARROW-12759. We already have a scalar aggregate kernel for this, and hopefully its implementation can be reused.
[jira] [Created] (ARROW-13308) [Packaging] Should we maintain the Arch linux repository?
Jonathan Keane created ARROW-13308:

Summary: [Packaging] Should we maintain the Arch linux repository?
Key: ARROW-13308
URL: https://issues.apache.org/jira/browse/ARROW-13308
Project: Apache Arrow
Issue Type: Improvement
Components: Packaging
Reporter: Jonathan Keane

One thing that came up in ARROW-13192 is [that Arrow on the archlinux repo|https://archlinux.org/packages/community/x86_64/arrow/] is out of date. Arch Linux isn't listed as a supported/maintained distribution, and as far as I can see we don't have any CI infrastructure that tests against it.
[jira] [Created] (ARROW-13307) [C++] Use reflection-based enums for compute options
David Li created ARROW-13307:

Summary: [C++] Use reflection-based enums for compute options
Key: ARROW-13307
URL: https://issues.apache.org/jira/browse/ARROW-13307
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: David Li
Assignee: David Li

This will reduce boilerplate and give us consistent naming. To be done after ARROW-13296.
[jira] [Created] (ARROW-13306) [Java][JDBC] use ResultSetMetaData.getColumnLabel instead of ResultSetMetaData.getColumnName
Jiangtao Peng created ARROW-13306:

Summary: [Java][JDBC] use ResultSetMetaData.getColumnLabel instead of ResultSetMetaData.getColumnName
Key: ARROW-13306
URL: https://issues.apache.org/jira/browse/ARROW-13306
Project: Apache Arrow
Issue Type: Improvement
Components: Java
Reporter: Jiangtao Peng

When using the JDBC-to-Arrow utilities, a column alias sometimes does not appear in the final Arrow results. For example, given a result set from this query:

{code:sql}
SELECT col AS a FROM table{code}

Postgres works properly and the Arrow result schema contains "a", but with MySQL the Arrow result schema contains "col". This is because Postgres uses the field label as both the column name and the column label ([postgres jdbc|https://github.com/pgjdbc/pgjdbc/blob/f61fbfe7b72ccf2ca0ac2e2c366230fdb93260e5/pgjdbc/src/main/java/org/postgresql/jdbc/PgResultSetMetaData.java#L144]), whereas MySQL uses the alias as the label and the original column name as the name ([mysql jdbc|https://github.com/mysql/mysql-connector-j/blob/18bbd5e68195d0da083cbd5bd0d05d76320df7cd/src/main/user-impl/java/com/mysql/cj/jdbc/result/ResultSetMetaData.java#L176]). Maybe "getColumnLabel" is a better fit than "getColumnName" for Arrow results.