[jira] [Created] (ARROW-13321) [C++][P

2021-07-12 Thread Niranda Perera (Jira)
Niranda Perera created ARROW-13321:
--

 Summary: [C++][P
 Key: ARROW-13321
 URL: https://issues.apache.org/jira/browse/ARROW-13321
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Niranda Perera






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-13320) [Website] Add MIME types to FAQ

2021-07-12 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-13320:


 Summary: [Website] Add MIME types to FAQ
 Key: ARROW-13320
 URL: https://issues.apache.org/jira/browse/ARROW-13320
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Website
Reporter: Kouhei Sutou
Assignee: Wes McKinney








[jira] [Created] (ARROW-13319) [C++] NullArray::IsNull always returns false and NullArray::IsValid always returns true

2021-07-12 Thread Nate Clark (Jira)
Nate Clark created ARROW-13319:
--

 Summary: [C++] NullArray::IsNull always returns false and NullArray::IsValid always returns true
 Key: ARROW-13319
 URL: https://issues.apache.org/jira/browse/ARROW-13319
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 3.0.0, 5.0.0
Reporter: Nate Clark


NullArray::SetData sets null_bitmap_data_ to NULLPTR, which IsNull and IsValid 
interpret as "no values are null". However, null_count() and length() return 
the same value, which suggests that IsNull() and IsValid() should return true 
and false respectively.
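
The mismatch can be sketched in plain Python (an illustrative model of the described logic, not the actual Arrow classes; `ArrayModel` and its method names are invented for this sketch):
{code:python}
# A missing (NULLPTR) validity bitmap is normally read as "all values
# valid", but a NullArray has null_count == length, so IsNull should
# return true there.
class ArrayModel:
    def __init__(self, length, null_bitmap, null_count):
        self.length = length
        self.null_bitmap = null_bitmap  # None models NULLPTR
        self.null_count = null_count

    def is_null_reported(self, i):
        # Reported behavior: a missing bitmap always reads as "not null".
        if self.null_bitmap is None:
            return False
        return not self.null_bitmap[i]

    def is_null_expected(self, i):
        # A missing bitmap can mean all-valid or (for NullArray) all-null;
        # null_count disambiguates the two cases.
        if self.null_bitmap is None:
            return self.null_count == self.length
        return not self.null_bitmap[i]


null_array = ArrayModel(length=3, null_bitmap=None, null_count=3)
print(null_array.is_null_reported(0))  # False (the reported bug)
print(null_array.is_null_expected(0))  # True
{code}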





[jira] [Created] (ARROW-13318) kMaxParserNumRows Value Increase/Removal

2021-07-12 Thread Ryan Stalets (Jira)
Ryan Stalets created ARROW-13318:


 Summary: kMaxParserNumRows Value Increase/Removal
 Key: ARROW-13318
 URL: https://issues.apache.org/jira/browse/ARROW-13318
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Python
Reporter: Ryan Stalets


I'm a new PyArrow user and have been investigating occasional errors: the 
Python exception "ArrowInvalid: Exceeded maximum rows" when parsing JSON-lines 
files with pyarrow.json.read_json(). Digging in, the original source of this 
exception appears to be cpp/src/arrow/json/parser.cc, around line 703, which 
raises the error when the number of lines processed exceeds kMaxParserNumRows.

 
{code:java}
for (; num_rows_ < kMaxParserNumRows; ++num_rows_) {
  auto ok = reader.Parse(json, handler);
  switch (ok.Code()) {
    case rj::kParseErrorNone:
      // parse the next object
      continue;
    case rj::kParseErrorDocumentEmpty:
      // parsed all objects, finish
      return Status::OK();
    case rj::kParseErrorTermination:
      // handler emitted an error
      return handler.Error();
    default:
      // rj emitted an error
      return ParseError(rj::GetParseError_En(ok.Code()), " in row ", num_rows_);
  }
}
return Status::Invalid("Exceeded maximum rows");
{code}
 

 

This constant appears to be set in arrow/json/parser.h on line 53, and has been 
set this way since that file's initial commit.

 
{code:java}
constexpr int32_t kMaxParserNumRows = 10;{code}
 

 

There does not appear to be a comment in the code, the commit, or the PR 
explaining this maximum number of lines.

 

I'm wondering what the reason for this maximum might be, and whether it could 
be removed, increased, or made overridable in C++ and in the upstream Python 
bindings. It is common to need to process JSON files of arbitrary length (logs 
from applications, third-party vendors, etc.) where the user of the data has 
no control over the size of the file.
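
The "make it overridable" option could look roughly like this generic bounded-parse loop (illustrative plain Python, not the Arrow implementation; `parse_lines` and its `max_rows` parameter are hypothetical):
{code:python}
def parse_lines(lines, max_rows=None):
    """Parse newline-delimited records, optionally capped at max_rows.

    max_rows=None models removing the fixed kMaxParserNumRows limit;
    an integer models today's hard cap.
    """
    out = []
    for i, line in enumerate(lines):
        if max_rows is not None and i >= max_rows:
            raise ValueError("Exceeded maximum rows")
        out.append(line.strip())
    return out


rows = ["{}"] * 5
print(len(parse_lines(rows)))               # 5 (no cap)
print(len(parse_lines(rows, max_rows=10)))  # 5 (under the cap)
{code}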





[jira] [Created] (ARROW-13317) [Python] Improve documentation on what 'use_threads' does in 'read_feather'

2021-07-12 Thread Arun Joseph (Jira)
Arun Joseph created ARROW-13317:
---

 Summary: [Python] Improve documentation on what 'use_threads' does 
in 'read_feather'
 Key: ARROW-13317
 URL: https://issues.apache.org/jira/browse/ARROW-13317
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Affects Versions: 4.0.1
Reporter: Arun Joseph


The current documentation for 
[read_feather|https://arrow.apache.org/docs/python/generated/pyarrow.feather.read_feather.html]
 states the following:

*use_threads* (_bool_, default _True_) – Whether to parallelize reading using 
multiple threads.

However, if the underlying file uses compression, multiple threads will still 
be spawned. The wording of `use_threads` is ambiguous: it does not make clear 
that the option only restricts the conversion from pyarrow to the pandas 
DataFrame, while the reading/decompression of the file itself may still spawn 
additional threads.

[set_cpu_count|http://arrow.apache.org/docs/python/generated/pyarrow.set_cpu_count.html#pyarrow.set_cpu_count]
 might be good to mention as well.





[jira] [Created] (ARROW-13316) [Python][C++][Doc] Fix warnings generated by sphinx when incorporating doxygen docs

2021-07-12 Thread Weston Pace (Jira)
Weston Pace created ARROW-13316:
---

 Summary: [Python][C++][Doc] Fix warnings generated by sphinx when 
incorporating doxygen docs
 Key: ARROW-13316
 URL: https://issues.apache.org/jira/browse/ARROW-13316
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Documentation, Python
Reporter: Weston Pace


Sphinx interprets the doxygen output to build the final documentation.  This 
process generates some warnings.

 

This warning is generated when running doxygen:
{code:java}
warning: Tag 'COLS_IN_ALPHA_INDEX' at line 1118 of file 'Doxyfile' has become 
obsolete.
 To avoid this warning please remove this line from your configuration 
file or upgrade it using "doxygen -u"
{code}
Many warnings are attributed to compute.rst and look like this (it is unclear 
where the duplicated `static constexpr static` comes from, as I cannot find it 
in the repo or in the doxygen output):
{code:java}
/home/pace/dev/arrow/docs/source/cpp/api/compute.rst:51: WARNING: Invalid C++ 
declaration: Expected identifier in nested name, got keyword: static [error at 
23]
  static constexpr static char const kTypeName []  = "ScalarAggregateOptions"
{code}
There is a duplicate declaration warning (I think this one arises because the 
doc comment is present on both the definition and the override):
{code:java}
/home/pace/dev/arrow/docs/source/cpp/api/dataset.rst:69: WARNING: Duplicate 
declaration, Result< std::shared_ptr< FileFragment > > MakeFragment (FileSource 
source, compute::Expression partition_expression, std::shared_ptr< Schema > 
physical_schema)
{code}
Finally, there is a specific issue with the GetRecordBatchGenerator function:
{code:java}
/home/pace/dev/arrow/docs/source/cpp/api/formats.rst:80: WARNING: Error when 
parsing function declaration.
If the function has no return type:
  Error in declarator or parameters-and-qualifiers
  Main error:
Invalid C++ declaration: Expecting "(" in parameters-and-qualifiers. [error 
at 23]
  virtual ::arrow::Result< std::function<::arrow::Future< 
std::shared_ptr<::arrow::RecordBatch > >)> > GetRecordBatchGenerator 
(std::shared_ptr< FileReader > reader, const std::vector< int > 
row_group_indices, const std::vector< int > column_indices, 
::arrow::internal::Executor *cpu_executor=NULLPTR)=0
  ---^
  Potential other error:
Error in parsing template argument list.
If type argument:
  Main error:
Invalid C++ declaration: Expected "...>", ">" or "," in template 
argument list. [error at 38]
  virtual ::arrow::Result< std::function<::arrow::Future< 
std::shared_ptr<::arrow::RecordBatch > >)> > GetRecordBatchGenerator 
(std::shared_ptr< FileReader > reader, const std::vector< int > 
row_group_indices, const std::vector< int > column_indices, 
::arrow::internal::Executor *cpu_executor=NULLPTR)=0
  --^
  Potential other error:
Main error:
  Invalid C++ declaration: Expected identifier in nested name. [error 
at 38]
virtual ::arrow::Result< std::function<::arrow::Future< 
std::shared_ptr<::arrow::RecordBatch > >)> > GetRecordBatchGenerator 
(std::shared_ptr< FileReader > reader, const std::vector< int > 
row_group_indices, const std::vector< int > column_indices, 
::arrow::internal::Executor *cpu_executor=NULLPTR)=0
--^
Potential other error:
  Error in parsing template argument list.
  If type argument:
Invalid C++ declaration: Expected "...>", ">" or "," in template 
argument list. [error at 96]
  virtual ::arrow::Result< std::function<::arrow::Future< 
std::shared_ptr<::arrow::RecordBatch > >)> > GetRecordBatchGenerator 
(std::shared_ptr< FileReader > reader, const std::vector< int > 
row_group_indices, const std::vector< int > column_indices, 
::arrow::internal::Executor *cpu_executor=NULLPTR)=0
  
^
  If non-type argument:
Invalid C++ declaration: Expected "...>", ">" or "," in template 
argument list. [error at 96]
  virtual ::arrow::Result< std::function<::arrow::Future< 
std::shared_ptr<::arrow::RecordBatch > >)> > GetRecordBatchGenerator 
(std::shared_ptr< FileReader > reader, const std::vector< int > 
row_group_indices, const std::vector< int > column_indices, 
::arrow::internal::Executor *cpu_executor=NULLPTR)=0
  
^
If non-type argument:
  Invalid C++ declaration: Expected "...>", ">" or "," in template argument 
list. [error at 96]
virtual ::arrow::Result< std::function<::arrow::Future< 
std::shared_ptr<::arrow::RecordBatch > >)> > GetRecordBatchGenerator 
{code}

[jira] [Created] (ARROW-13315) [R] Wrap r_task_group includes with ARROW_R_WITH_ARROW checking

2021-07-12 Thread Jonathan Keane (Jira)
Jonathan Keane created ARROW-13315:
--

 Summary: [R] Wrap r_task_group includes with ARROW_R_WITH_ARROW 
checking
 Key: ARROW-13315
 URL: https://issues.apache.org/jira/browse/ARROW-13315
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Jonathan Keane
Assignee: Jonathan Keane
 Fix For: 5.0.0


Need to wrap the includes with

{code}
#if defined(ARROW_R_WITH_ARROW)
...
#endif
{code}

at https://github.com/apache/arrow/blob/master/r/src/r_task_group.h#L20-L21






[jira] [Created] (ARROW-13314) JSON parsing segment fault on long records (block_size) dependent

2021-07-12 Thread Guido Muscioni (Jira)
Guido Muscioni created ARROW-13314:
--

 Summary: JSON parsing segment fault on long records (block_size) 
dependent
 Key: ARROW-13314
 URL: https://issues.apache.org/jira/browse/ARROW-13314
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Guido Muscioni


Hello,

 

I have a big JSON file (~300 MB) with complex records (nested JSON, nested 
lists of JSON objects). When I try to read it with pyarrow I get a 
segmentation fault. I then tried a couple of read-option variations; please 
see the code below (I developed this code against the example file attached 
to https://issues.apache.org/jira/browse/ARROW-9612):

 
{code:python}
from pyarrow import json
from pyarrow.json import ReadOptions
import tqdm

if __name__ == '__main__':
    source = 'wiki_04.jsonl'
    ro = ReadOptions(block_size=2**20)

    with open(source, 'r') as file:
        for i, line in tqdm.tqdm(enumerate(file)):
            with open('temp_file_arrow_3.ndjson', 'a') as file2:
                file2.write(line)
            json.read_json('temp_file_arrow_3.ndjson', read_options=ro)
{code}
For both the example file and my file, this code raises the straddling-object 
exception (or seg faults) once the file reaches the block_size. Increasing the 
block_size makes the code fail later.

Then I tried, on my file, to put an explicit schema:
{code:python}
import pandas as pd
import pyarrow as pa
from pyarrow import json
from pyarrow.json import ReadOptions

if __name__ == '__main__':
    source = 'my_file.jsonl'

    df = pd.read_json(source, lines=True)
    table_schema = pa.Table.from_pandas(df).schema

    ro = ReadOptions(explicit_schema=table_schema)
    table = json.read_json(source, read_options=ro)
{code}
This works, which may suggest that this issue, and the one in the linked JIRA 
ticket, only appear when an explicit schema is not provided. Additionally, the 
following code works as well:
{code:python}
from pyarrow import json
from pyarrow.json import ReadOptions

if __name__ == '__main__':
    source = 'my_file.jsonl'

    ro = ReadOptions(block_size=2**30)
    table = json.read_json(source, read_options=ro)
{code}
In this case the block_size is bigger than my file. Is it possible that the 
schema is inferred from the first block, and that a seg fault occurs when the 
schema changes in a later block?
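
That hypothesis, a schema fixed by the first block with later deviations failing, can be modeled in plain Python (purely illustrative; this is not how pyarrow's reader is implemented, and `read_in_blocks` is invented for this sketch):
{code:python}
def infer_schema(records):
    # Union of field names seen in a batch of dict records.
    keys = set()
    for record in records:
        keys.update(record)
    return keys


def read_in_blocks(records, block_size):
    # Model: the schema is inferred from the first block only; a record
    # in a later block with an unseen field triggers an error, much like
    # the reported failure mode.
    blocks = [records[i:i + block_size]
              for i in range(0, len(records), block_size)]
    schema = infer_schema(blocks[0])
    for block in blocks[1:]:
        for record in block:
            if not set(record) <= schema:
                raise ValueError("record does not match first-block schema")
    return schema


records = [{"a": 1}, {"a": 2}, {"a": 3, "b": 4}]
print(read_in_blocks(records, block_size=4))  # one big block: succeeds
{code}
With block_size=2 the third record introduces field "b" after the schema was fixed, and the model raises, mirroring how a large enough block_size (bigger than the file) avoids the crash.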

I cannot share my JSON file; however, I hope someone can shed some light on 
what I am seeing and perhaps suggest a workaround.

Thank you,
 Guido





[jira] [Created] (ARROW-13313) [C++][Compute] Add ScalarAggregateNode

2021-07-12 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-13313:


 Summary: [C++][Compute] Add ScalarAggregateNode
 Key: ARROW-13313
 URL: https://issues.apache.org/jira/browse/ARROW-13313
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Ben Kietzman
Assignee: Ben Kietzman


Provide an ExecNode which wraps ScalarAggregateFunctions





[jira] [Created] (ARROW-13312) [C++] Bitmap::VisitWordAndWrite epilogue needs to work on Words (not bytes)

2021-07-12 Thread Niranda Perera (Jira)
Niranda Perera created ARROW-13312:
--

 Summary: [C++] Bitmap::VisitWordAndWrite epilogue needs to work on 
Words (not bytes)
 Key: ARROW-13312
 URL: https://issues.apache.org/jira/browse/ARROW-13312
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Niranda Perera


In the recently added `Bitmap::VisitWordAndWrite` method, the `visitor` lambda 
(which operates on a `Word`) is translated into a byte-visitor while handling 
the epilogue.

This can lead to incorrect results in client code, for example:
{code:java}
// N readers, M writers
int64_t bits_written = 0;
auto visitor = [&](std::array<Word, N> in, std::array<Word, M>* out) {
  ...
  bits_written += (sizeof(Word) * 8);
};{code}
At the end of the Visit, bits_written holds an incorrect sum: in the epilogue, 
64 is added to bits_written for each trailing byte, whereas it should have 
been 8.
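
The miscount can be reproduced arithmetically (a plain Python sketch of the counting described above, assuming a 64-bit `Word`; `bits_visited` is invented for illustration):
{code:python}
WORD_BITS = 64  # sizeof(Word) * 8 for a 64-bit Word
BYTE_BITS = 8


def bits_visited(n_bits):
    # Full words are visited WORD_BITS at a time; trailing bits are
    # handled byte-by-byte. The reported bug counts each trailing byte
    # as a full word (64 bits) instead of 8 bits.
    full_words, trailing_bits = divmod(n_bits, WORD_BITS)
    trailing_bytes = (trailing_bits + BYTE_BITS - 1) // BYTE_BITS
    buggy = full_words * WORD_BITS + trailing_bytes * WORD_BITS
    correct = full_words * WORD_BITS + trailing_bytes * BYTE_BITS
    return buggy, correct


# 130 bits = 2 full words + 1 trailing byte: over-counted by 56 bits.
print(bits_visited(130))  # (192, 136)
{code}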

 

Possible solution:

Add ReadTrailingWord and WriteTrailingWord functionality to BitmapWordReader 
and BitmapWordWriter respectively, and call the visitor with whole words in 
the epilogue.

 

 





[jira] [Created] (ARROW-13311) [C++][Documentation] List hash aggregate kernels somewhere

2021-07-12 Thread David Li (Jira)
David Li created ARROW-13311:


 Summary: [C++][Documentation] List hash aggregate kernels somewhere
 Key: ARROW-13311
 URL: https://issues.apache.org/jira/browse/ARROW-13311
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Documentation
Reporter: David Li


Hash aggregate kernels are not listed in compute.rst with the rest of the 
functions, presumably because they're not intended to be directly callable. 
However, once ARROW-12759 goes in, we should find some place to list what 
aggregations are supported with group by.





[jira] [Created] (ARROW-13310) [C++] Implement hash_aggregate mode kernel

2021-07-12 Thread David Li (Jira)
David Li created ARROW-13310:


 Summary: [C++] Implement hash_aggregate mode kernel
 Key: ARROW-13310
 URL: https://issues.apache.org/jira/browse/ARROW-13310
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: David Li


Requires ARROW-12759.

We have a scalar aggregate kernel for this already, and hopefully the 
implementation can be reused. Note that pandas doesn't actually expose this in 
DataFrameGroupBy.





[jira] [Created] (ARROW-13309) [C++] Implement hash_aggregate quantile kernel

2021-07-12 Thread David Li (Jira)
David Li created ARROW-13309:


 Summary: [C++] Implement hash_aggregate quantile kernel
 Key: ARROW-13309
 URL: https://issues.apache.org/jira/browse/ARROW-13309
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: David Li


Requires ARROW-12759.

We have a scalar aggregate kernel for this already and hopefully the 
implementation can be reused.





[jira] [Created] (ARROW-13308) [Packaging] Should we maintain the Arch linux repository?

2021-07-12 Thread Jonathan Keane (Jira)
Jonathan Keane created ARROW-13308:
--

 Summary: [Packaging] Should we maintain the Arch linux repository?
 Key: ARROW-13308
 URL: https://issues.apache.org/jira/browse/ARROW-13308
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging
Reporter: Jonathan Keane


One thing that came up in ARROW-13192 is that [Arrow in the archlinux 
repo|https://archlinux.org/packages/community/x86_64/arrow/] is out of date.

Arch Linux isn't listed as a supported/maintained distribution, and as far as 
I can see we don't have any CI infrastructure that tests against it.





[jira] [Created] (ARROW-13307) [C++] Use reflection-based enums for compute options

2021-07-12 Thread David Li (Jira)
David Li created ARROW-13307:


 Summary: [C++] Use reflection-based enums for compute options
 Key: ARROW-13307
 URL: https://issues.apache.org/jira/browse/ARROW-13307
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: David Li
Assignee: David Li


This will reduce boilerplate and give us consistent naming. To be done after 
ARROW-13296.





[jira] [Created] (ARROW-13306) [Java][JDBC] use ResultSetMetaData.getColumnLabel instead of ResultSetMetaData.getColumnName

2021-07-12 Thread Jiangtao Peng (Jira)
Jiangtao Peng created ARROW-13306:
-

 Summary: [Java][JDBC] use ResultSetMetaData.getColumnLabel instead 
of ResultSetMetaData.getColumnName
 Key: ARROW-13306
 URL: https://issues.apache.org/jira/browse/ARROW-13306
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Jiangtao Peng


When using the JDBC-to-Arrow utilities, a column alias sometimes does not 
appear in the final Arrow results.

For example, given the result set of the query
{code:sql}
SELECT col AS a FROM table{code}
Postgres works properly and the Arrow result schema contains "a", but with 
MySQL the Arrow result schema contains "col".

This is because Postgres uses the field label as both the column name and the 
column label ([postgres 
jdbc|https://github.com/pgjdbc/pgjdbc/blob/f61fbfe7b72ccf2ca0ac2e2c366230fdb93260e5/pgjdbc/src/main/java/org/postgresql/jdbc/PgResultSetMetaData.java#L144]),
 whereas MySQL uses the alias only as the column label and keeps the original 
column name as the name ([mysql 
jdbc|https://github.com/mysql/mysql-connector-j/blob/18bbd5e68195d0da083cbd5bd0d05d76320df7cd/src/main/user-impl/java/com/mysql/cj/jdbc/result/ResultSetMetaData.java#L176]).

Maybe "getColumnLabel" is a better fit for Arrow results than "getColumnName".


