[jira] [Created] (ARROW-11942) [C++] If tasks are submitted quickly the thread pool may fail to spin up new threads

2021-03-11 Thread Weston Pace (Jira)
Weston Pace created ARROW-11942:
---

 Summary: [C++] If tasks are submitted quickly the thread pool may 
fail to spin up new threads
 Key: ARROW-11942
 URL: https://issues.apache.org/jira/browse/ARROW-11942
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Weston Pace
Assignee: Weston Pace


This probably only affects unit tests.  Consider an idle thread pool with 1 
thread (`ready_count_ == 1`).  If `Spawn` is called several times in quick 
succession, `ready_count_` may still appear to be greater than 0 (because it is 
not necessarily decremented by the time `Spawn` returns), so the pool will not 
spin up new threads.
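
The race can be illustrated with a deterministic toy model. This is only a 
sketch, not Arrow's thread pool code: the class `ToyPool` and its methods are 
hypothetical, and the worker picking up a task is modelled as an explicit step 
so that the interleaving is reproducible.

```python
class ToyPool:
    """Hypothetical single-threaded model of a pool that grows on demand."""

    def __init__(self, max_threads):
        self.max_threads = max_threads
        self.num_threads = 1   # one idle worker already exists
        self.ready_count = 1   # workers currently waiting for a task
        self.pending = []      # queued tasks not yet picked up

    def spawn(self, task):
        # Growth decision as described in the report: only add a worker
        # when no existing worker appears to be idle.
        if self.ready_count == 0 and self.num_threads < self.max_threads:
            self.num_threads += 1
            self.ready_count += 1
        self.pending.append(task)

    def worker_picks_up_task(self):
        # Runs "later" on the worker; only now does ready_count drop.
        task = self.pending.pop(0)
        self.ready_count -= 1
        return task


pool = ToyPool(max_threads=4)
pool.spawn("a")
pool.spawn("b")  # arrives before the worker woke up, still sees ready_count == 1
threads_after_burst = pool.num_threads  # no new worker was started

pool.worker_picks_up_task()
pool.spawn("c")  # ready_count is now 0, so the pool finally grows
threads_after_pickup = pool.num_threads
```

In this model the two back-to-back calls share a single worker, and the pool 
only grows once the worker has actually taken a task.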



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11941) [Dev] "DEBUG=1 merge_arrow_pr.py" updates Jira issue

2021-03-11 Thread Yibo Cai (Jira)
Yibo Cai created ARROW-11941:


 Summary: [Dev] "DEBUG=1 merge_arrow_pr.py" updates Jira issue
 Key: ARROW-11941
 URL: https://issues.apache.org/jira/browse/ARROW-11941
 Project: Apache Arrow
  Issue Type: Bug
  Components: Developer Tools
Reporter: Yibo Cai
Assignee: Yibo Cai


"DEBUG=1 dev/merge_arrow_pr.py" is supposed to act as a dry run without writing 
anything. It doesn't merge the PR, but it does update the Jira issue status. 
This should be fixed.





[jira] [Created] (ARROW-11940) [Rust][Datafusion] Support joins on TimestampMillisecond columns

2021-03-11 Thread Morgan Cassels (Jira)
Morgan Cassels created ARROW-11940:
--

 Summary: [Rust][Datafusion] Support joins on TimestampMillisecond 
columns
 Key: ARROW-11940
 URL: https://issues.apache.org/jira/browse/ARROW-11940
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust - DataFusion
Reporter: Morgan Cassels


Joining DataFrames on a TimestampMillisecond column gives an error:

```
'called `Result::unwrap()` on an `Err` value: Internal("Unsupported data type in hasher")'
```





[jira] [Created] (ARROW-11939) Bug in `pa.PythonFile`?

2021-03-11 Thread Dave Hirschfeld (Jira)
Dave Hirschfeld created ARROW-11939:
---

 Summary: Bug in `pa.PythonFile`?
 Key: ARROW-11939
 URL: https://issues.apache.org/jira/browse/ARROW-11939
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Affects Versions: 3.0.0
Reporter: Dave Hirschfeld



```python
import pyarrow as pa
with pa.PythonFile('deleteme.jnk', 'wb') as f: pass
# AttributeError: 'str' object has no attribute 'closed'
```
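
`pa.PythonFile` is documented as a wrapper around an already-open Python file 
object rather than a path, which would explain the error: the wrapper 
presumably looks up `closed` on whatever object it was given. The sketch below 
reproduces that failure mode with the standard library only. 
`FileObjectWrapper` and `try_wrap` are hypothetical stand-ins, not pyarrow's 
implementation.

```python
import io


class FileObjectWrapper:
    """Hypothetical stand-in for a class that wraps an open file object."""

    def __init__(self, handle):
        self.handle = handle

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Delegates to the wrapped object; a str has no `closed` attribute.
        if not self.handle.closed:
            self.handle.close()


def try_wrap(handle):
    """Return None on success, or the error message if wrapping fails."""
    try:
        with FileObjectWrapper(handle):
            pass
    except AttributeError as err:
        return str(err)
    return None


path_result = try_wrap("deleteme.jnk")  # a path string, as in the report
file_result = try_wrap(io.BytesIO())    # an open file-like object
```

With pyarrow itself, the equivalent would presumably be to open the file first 
and pass the file object to `pa.PythonFile`, or to use a path-based class such 
as `pa.OSFile`.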





[jira] [Created] (ARROW-11938) [R] Enable R build process to find locally built C++ library on Windows

2021-03-11 Thread Ian Cook (Jira)
Ian Cook created ARROW-11938:


 Summary: [R] Enable R build process to find locally built C++ 
library on Windows
 Key: ARROW-11938
 URL: https://issues.apache.org/jira/browse/ARROW-11938
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Ian Cook


Currently, {{configure.win}} and {{tools/winlibs.R}} have two ways of finding 
the Arrow C++ library:
 # If {{RWINLIB_LOCAL}} is set, it gets it from that zip file
 # If not, it downloads it

Enable and document a third option for the case when the C++ library has been 
built locally. This will enable R package developers using Windows machines to 
make changes to code in the C++ library, build and install it, and then build 
the R package using it.





[jira] [Created] (ARROW-11937) [C++] GZip codec hangs if flushed twice

2021-03-11 Thread David Li (Jira)
David Li created ARROW-11937:


 Summary: [C++] GZip codec hangs if flushed twice
 Key: ARROW-11937
 URL: https://issues.apache.org/jira/browse/ARROW-11937
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 3.0.0
Reporter: David Li
Assignee: David Li
 Fix For: 4.0.0


{code:java}
// "If deflate returns with avail_out == 0, this function must be called
//  again with the same value of the flush parameter and more output space
//  (updated avail_out), until the flush is complete (deflate returns
//  with non-zero avail_out)."
return FlushResult{bytes_written, (bytes_written == 0)}; {code}
But contrary to the comment, we're checking bytes_written rather than 
avail_out. So if we flush twice, the second time, we won't write any bytes, but 
we'll erroneously interpret that as zlib asking for a larger buffer, rather 
than zlib telling us there's no data to compress. Then we'll enter a loop where 
we keep doubling the buffer size forever, hanging the program.
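
The difference between the two checks can be shown with a toy model. This is a 
sketch under stated assumptions, not the Arrow codec: `toy_deflate`, `flush`, 
and both predicates are hypothetical names, and a size cap is added so that the 
runaway case ends with an exception instead of hanging.

```python
def toy_deflate(pending, capacity):
    """Hypothetical stand-in for one deflate() call during a flush.

    Copies as much pending output as fits into a buffer of `capacity` bytes
    and returns (chunk, still_pending, avail_out).
    """
    chunk = pending[:capacity]
    return chunk, pending[len(chunk):], capacity - len(chunk)


def retry_if_nothing_written(bytes_written, avail_out):
    # Check described in the report: "wrote nothing" is read as "buffer too small".
    return bytes_written == 0


def retry_if_buffer_full(bytes_written, avail_out):
    # Check described by the zlib comment: retry only when the buffer filled up.
    return avail_out == 0


def flush(pending, should_retry, capacity=4, max_capacity=1024):
    """Flush `pending`, doubling the buffer whenever should_retry says so."""
    out = b""
    while True:
        chunk, pending, avail_out = toy_deflate(pending, capacity)
        out += chunk
        if not should_retry(len(chunk), avail_out):
            return out
        capacity *= 2
        if capacity > max_capacity:
            # The loop described in the report has no such cap and would spin.
            raise RuntimeError("buffer kept doubling; flush never completed")


first = flush(b"0123456789", retry_if_buffer_full)  # grows once, then completes
second = flush(b"", retry_if_buffer_full)           # second flush: nothing left
```

Calling `flush(b"", retry_if_nothing_written)` in this model raises the 
`RuntimeError`, which corresponds to the hang in the report.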





[jira] [Created] (ARROW-11936) Rust/Java incorrect serialization of Struct wrapped Int8Dictionary

2021-03-11 Thread Justin (Jira)
Justin created ARROW-11936:
--

 Summary: Rust/Java incorrect serialization of Struct wrapped 
Int8Dictionary
 Key: ARROW-11936
 URL: https://issues.apache.org/jira/browse/ARROW-11936
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java, Rust
Affects Versions: 3.0.0
Reporter: Justin


Using Rust, I serialized a datatype to a file with a schema of
{code:java}
Field { name: "val", data_type: Struct([Field { name: "val", data_type: Utf8, 
nullable: false, dict_id: 0, dict_is_ordered: false, metadata: None }]), 
nullable: false, dict_id: 0, dict_is_ordered: false, metadata: None }{code}
Using a Java client to read the serialized datatype results in a schema of
{code:java}
Schema not null>{code}
whilst calling ArrowFileReader.loadNextBatch() results in
{code:java}
Exception in thread "main" java.util.NoSuchElementException
    at java.base/java.util.ArrayList$Itr.next(ArrayList.java:1000)
    at org.apache.arrow.vector.VectorLoader.loadBuffers(VectorLoader.java:81)
    at org.apache.arrow.vector.VectorLoader.loadBuffers(VectorLoader.java:99)
    at org.apache.arrow.vector.VectorLoader.load(VectorLoader.java:61)
    at org.apache.arrow.vector.ipc.ArrowReader.loadRecordBatch(ArrowReader.java:205)
    at org.apache.arrow.vector.ipc.ArrowFileReader.loadNextBatch(ArrowFileReader.java:153)
{code}





[jira] [Created] (ARROW-11935) [C++] Add push generator

2021-03-11 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-11935:
--

 Summary: [C++] Add push generator
 Key: ARROW-11935
 URL: https://issues.apache.org/jira/browse/ARROW-11935
 Project: Apache Arrow
  Issue Type: Task
  Components: C++
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou


Sometimes a producer of values just wants to queue futures and let a consumer 
pop them iteratively.
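
A minimal sketch of the idea, assuming a simple FIFO of futures. This is not 
the proposed C++ API: `PushGenerator` and its methods are hypothetical names, 
and the model is single-threaded, so it omits the synchronization a real 
implementation would need.

```python
from collections import deque
from concurrent.futures import Future


class PushGenerator:
    """Hypothetical sketch: a FIFO of futures shared by producer and consumer."""

    def __init__(self):
        self._queue = deque()

    def push(self, future):
        # Producer side: queue a future, completed or not.
        self._queue.append(future)

    def __iter__(self):
        # Consumer side: pop futures in the order they were pushed.
        while self._queue:
            yield self._queue.popleft()


gen = PushGenerator()
futures = [Future() for _ in range(3)]
for fut in futures:
    gen.push(fut)

# The producer may complete the futures in any order...
for value, fut in zip((30, 10, 20), reversed(futures)):
    fut.set_result(value)

# ...but the consumer still sees the results in push order.
results = [fut.result() for fut in gen]
```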






[jira] [Created] (ARROW-11934) [Rust] Document patch release process

2021-03-11 Thread Andy Grove (Jira)
Andy Grove created ARROW-11934:
--

 Summary: [Rust] Document patch release process
 Key: ARROW-11934
 URL: https://issues.apache.org/jira/browse/ARROW-11934
 Project: Apache Arrow
  Issue Type: Task
  Components: Rust
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 3.0.1


Now that we have moved to voting on source releases for patch releases, we need 
to document the process for doing so in the Rust implementation.

 

Google doc for discussion / collaboration: 
https://docs.google.com/document/d/1i2Elk6J0H4nhPeQZdLDyqvHoRbsabx2iOTXLHxxNqRE/edit?usp=sharing





[jira] [Created] (ARROW-11933) [Developer] Provide a dashboard for improved Pull Request management

2021-03-11 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-11933:


 Summary: [Developer] Provide a dashboard for improved Pull Request 
management
 Key: ARROW-11933
 URL: https://issues.apache.org/jira/browse/ARROW-11933
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Developer Tools
Affects Versions: 3.0.0
Reporter: Ben Kietzman


The [Spark PR dashboard|https://github.com/databricks/spark-pr-dashboard] 
(instance at http://spark-prs.appspot.com/) provides a useful view of pull 
requests. Information is retrieved from the GitHub API and persisted to a 
database for analyses, including classification of pull requests based on which 
files they modify. The added context provides greater visibility of PRs to the 
committers interested in reviewing/merging them.
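
Classifying pull requests by the files they modify could look roughly like the 
sketch below. It is not taken from the Spark dashboard: the function 
`classify_pull_request` and the `COMPONENTS` prefix mapping are hypothetical, 
chosen only to illustrate the approach.

```python
def classify_pull_request(changed_files, components):
    """Return the sorted component labels whose path prefix matches a changed file."""
    labels = set()
    for path in changed_files:
        for label, prefix in components.items():
            if path.startswith(prefix):
                labels.add(label)
    return sorted(labels)


# Hypothetical mapping from component label to path prefix.
COMPONENTS = {
    "C++": "cpp/",
    "Python": "python/",
    "R": "r/",
    "Rust": "rust/",
}

labels = classify_pull_request(
    ["cpp/src/arrow/util/thread_pool.cc", "python/pyarrow/io.pxi", "README.md"],
    COMPONENTS,
)
```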





[jira] [Created] (ARROW-11932) [C++] Provide ArrayBuilder::AppendScalar

2021-03-11 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-11932:


 Summary: [C++] Provide ArrayBuilder::AppendScalar
 Key: ARROW-11932
 URL: https://issues.apache.org/jira/browse/ARROW-11932
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 3.0.0
Reporter: Ben Kietzman
 Fix For: 5.0.0


It would be useful to be able to append a Scalar (and/or ScalarVector) to an 
ArrayBuilder. For example, in 
https://github.com/apache/arrow/pull/9621#discussion_r587461083 (ARROW-11591) 
this could be used to accumulate an array of expected grouped aggregation 
results using existing scalar aggregate kernels.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [arrow-testing] jmgpeeters commented on pull request #59: ARROW-11838: files for testing IPC reads with shared dictionaries.

2021-03-11 Thread GitBox


jmgpeeters commented on pull request #59:
URL: https://github.com/apache/arrow-testing/pull/59#issuecomment-796777534


   Agreed. I'll make the changes and get back to you.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow-testing] pitrou commented on pull request #59: ARROW-11838: files for testing IPC reads with shared dictionaries.

2021-03-11 Thread GitBox


pitrou commented on pull request #59:
URL: https://github.com/apache/arrow-testing/pull/59#issuecomment-796768922


   Indeed, the JSON format doesn't support it, so that will be a problem if we 
want to do roundtripping tests with the integration machinery.
   However, I think we can still use the "golden files" part of integration 
testing, because there the logic for each implementation is (see 
[here](https://github.com/apache/arrow/blob/master/cpp/src/arrow/testing/json_integration_test.cc#L225-L234)
 for the C++ implementation):
   * read the JSON file and convert it into a series of record batches
   * read the Arrow file and decode it into a series of record batches
   * compare respective record batches for equality
   
   Comparing for equality doesn't care if the dictionaries are shared, so this 
should be ok for testing the ability to read IPC files with shared dictionaries.
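
The three steps above can be sketched as follows. This is a toy model, not the 
integration machinery: the JSON layout, the helper names, and the 
index-plus-dictionary encoding are all simplified assumptions made for 
illustration.

```python
import json


def batches_from_json(text):
    """Toy JSON side: a list of batches, each mapping column name to values."""
    return json.loads(text)["batches"]


def batches_from_encoded(dictionary, encoded_batches):
    """Toy file side: resolve dictionary indices into values for each column."""
    return [
        {name: [dictionary[i] for i in indices] for name, indices in batch.items()}
        for batch in encoded_batches
    ]


def batches_equal(left, right):
    """Compare respective batches for equality."""
    return len(left) == len(right) and all(a == b for a, b in zip(left, right))


json_text = '{"batches": [{"a": ["x", "y"], "b": ["y", "y"]}]}'

# Columns "a" and "b" share one dictionary on the file side.
shared_dictionary = ["x", "y"]
encoded = [{"a": [0, 1], "b": [1, 1]}]

matches = batches_equal(
    batches_from_json(json_text),
    batches_from_encoded(shared_dictionary, encoded),
)
```

The comparison only sees decoded values, so whether the two columns shared a 
dictionary does not affect the result.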
   







[GitHub] [arrow-testing] jmgpeeters commented on pull request #59: ARROW-11838: files for testing IPC reads with shared dictionaries.

2021-03-11 Thread GitBox


jmgpeeters commented on pull request #59:
URL: https://github.com/apache/arrow-testing/pull/59#issuecomment-796737078


   Ah, thanks, I wasn't aware of the Archery integration suite. Had a quick 
glance, and seems to make sense. Was a bit worried it would require support in 
all languages for shared dicts, but it seems easy to disable languages per 
folder etc.
   
   One thing I noticed from the JSON format is that it doesn't (appear to) 
support dictionary restatement, i.e. schema -> dict_batch[id=1] -> batch -> 
dict_batch[id=1] -> batch -> ... as we have in the streaming format, and that 
I'm currently explicitly testing in the bespoke tests. 







[GitHub] [arrow-testing] pitrou commented on pull request #59: ARROW-11838: files for testing IPC reads with shared dictionaries.

2021-03-11 Thread GitBox


pitrou commented on pull request #59:
URL: https://github.com/apache/arrow-testing/pull/59#issuecomment-796719583


   @jmgpeeters It seems these should go into the "golden files" used for 
integration testing, see 
https://github.com/apache/arrow-testing/tree/master/data/arrow-ipc-stream/integration
   Integration testing is documented here: 
https://arrow.apache.org/docs/format/Integration.html
   The integration testing machinery is maintained here: 
https://github.com/apache/arrow/tree/master/dev/archery/archery/integration
   
   Don't hesitate to ask questions if you have trouble navigating this.


