[GitHub] [arrow] houqp commented on pull request #7501: ARROW-9192: [Rust] run clippy to lint arrow crate in CI

2020-06-23 Thread GitBox


houqp commented on pull request #7501:
URL: https://github.com/apache/arrow/pull/7501#issuecomment-648576776


   @kszucs let me know if there is anything I can do to help move it to the main 
lint workflow.







[GitHub] [arrow] wesm commented on pull request #7501: ARROW-9192: [Rust] run clippy to lint arrow crate in CI

2020-06-23 Thread GitBox


wesm commented on pull request #7501:
URL: https://github.com/apache/arrow/pull/7501#issuecomment-648562595


   Hm I think this lint step should be merged into the main Lint workflow. 
@kszucs can you help?







[GitHub] [arrow] wesm closed pull request #7530: ARROW-8934: [C++] Enable `compute::Subtract` with timestamp inputs to return duration

2020-06-23 Thread GitBox


wesm closed pull request #7530:
URL: https://github.com/apache/arrow/pull/7530


   







[GitHub] [arrow] wesm commented on pull request #7530: ARROW-8934: [C++] Enable `compute::Subtract` with timestamp inputs to return duration

2020-06-23 Thread GitBox


wesm commented on pull request #7530:
URL: https://github.com/apache/arrow/pull/7530#issuecomment-648561822


   +1







[GitHub] [arrow] wesm closed pull request #7528: ARROW-8933: [C++] Trim redundant generated code from compute/kernels/vector_hash.cc

2020-06-23 Thread GitBox


wesm closed pull request #7528:
URL: https://github.com/apache/arrow/pull/7528


   







[GitHub] [arrow] wesm commented on pull request #7528: ARROW-8933: [C++] Trim redundant generated code from compute/kernels/vector_hash.cc

2020-06-23 Thread GitBox


wesm commented on pull request #7528:
URL: https://github.com/apache/arrow/pull/7528#issuecomment-648561165


   +1







[GitHub] [arrow] hcoona commented on pull request #7493: ARROW-9183: [C++] Fix build with clang & old libstdc++.

2020-06-23 Thread GitBox


hcoona commented on pull request #7493:
URL: https://github.com/apache/arrow/pull/7493#issuecomment-648542491


   Got it. It seems that gcc 4.8.5 was released later than gcc 4.9.2 & gcc 5.1. 
I'll investigate further how to detect the missing `std::atomic_load` 
overloads.
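
   One way to detect this is a configure-time try-compile. A hedged sketch (not 
from the PR) of a probe translation unit that compiles only when the overloads 
exist:

   ```
   // Hypothetical probe (assumption: detection happens at configure time):
   // this TU compiles only if the standard library provides the
   // std::atomic_load overloads for std::shared_ptr, which is the gap in
   // old libstdc++ discussed above.
   #include <memory>

   int main() {
     auto p = std::make_shared<int>(42);
     auto q = std::atomic_load(&p);  // the overload under test
     return (*q == 42) ? 0 : 1;
   }
   ```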







[GitHub] [arrow] sagnikc-dremio commented on a change in pull request #7402: ARROW-9099: [C++][Gandiva] Implement trim function for string

2020-06-23 Thread GitBox


sagnikc-dremio commented on a change in pull request #7402:
URL: https://github.com/apache/arrow/pull/7402#discussion_r444588872



##
File path: cpp/src/gandiva/precompiled/string_ops.cc
##
@@ -284,6 +284,49 @@ const char* reverse_utf8(gdv_int64 context, const char* data, gdv_int32 data_len
   return ret;
 }
 
+// Trim a utf8 sequence
+FORCE_INLINE
+const char* trim_utf8(gdv_int64 context, const char* data, gdv_int32 data_len,
+                      int32_t* out_len) {
+  if (data_len == 0) {
+    *out_len = 0;
+    return "";
+  }
+
+  gdv_int32 start = 0, end = data_len - 1;
+  // start and end denote the first and last positions of non-space
+  // characters in the input string respectively
+  while (start <= end && data[start] == ' ') {
+    ++start;
+  }
+  while (end >= start && data[end] == ' ') {

Review comment:
   Verified that for utf8, 0x20 is always a single-byte space and is not 
part of any multi-byte character.
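
   For context on why that holds, a small self-contained illustration (not part 
of the PR): in UTF-8, every byte of a multi-byte sequence has its high bit set, 
so a raw 0x20 byte can only ever be a standalone ASCII space.

   ```
   #include <cassert>
   #include <string>

   int main() {
     const std::string s = " \xC3\xA9 ";  // " é " in UTF-8
     // The two bytes of 'é' are 0xC3 and 0xA9; neither can equal 0x20.
     for (unsigned char b : s.substr(1, 2)) {
       assert(b >= 0x80 && b != 0x20);
     }
     return 0;
   }
   ```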









[GitHub] [arrow] kou commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-23 Thread GitBox


kou commented on pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#issuecomment-648518023


   I've added a workaround we already used: 
https://github.com/apache/arrow/pull/7449/commits/782499f8641da4a23d86125bcc812546107f2ce5
   
   But it doesn't solve this yet.
   
   I'm trying to reproduce this in my local environment.







[GitHub] [arrow] wesm commented on pull request #6592: ARROW-8089: [C++] Port the toolchain build from Appveyor to Github Actions

2020-06-23 Thread GitBox


wesm commented on pull request #6592:
URL: https://github.com/apache/arrow/pull/6592#issuecomment-648515003


   @kszucs do you intend to keep working on this? I'll close the PR until it 
can be rehabilitated







[GitHub] [arrow] wesm closed pull request #6592: ARROW-8089: [C++] Port the toolchain build from Appveyor to Github Actions

2020-06-23 Thread GitBox


wesm closed pull request #6592:
URL: https://github.com/apache/arrow/pull/6592


   







[GitHub] [arrow] wesm commented on pull request #6156: ARROW-7539: [Java] FieldVector getFieldBuffers API should not set reader/writer indices

2020-06-23 Thread GitBox


wesm commented on pull request #6156:
URL: https://github.com/apache/arrow/pull/6156#issuecomment-648514762


   Does this impact IPC? 







[GitHub] [arrow] wesm commented on pull request #7290: ARROW-1692: [Java] UnionArray round trip not working

2020-06-23 Thread GitBox


wesm commented on pull request #7290:
URL: https://github.com/apache/arrow/pull/7290#issuecomment-648510708


   > That would be my preference.
   
   I'm OK with this. We would need to act quickly to try to pull this off for 
the release. I can start a DISCUSS thread and work up a patch with the proposed 
changes to the specification







[GitHub] [arrow] jacques-n commented on pull request #7290: ARROW-1692: [Java] UnionArray round trip not working

2020-06-23 Thread GitBox


jacques-n commented on pull request #7290:
URL: https://github.com/apache/arrow/pull/7290#issuecomment-648507985


   > We can decide to stipulate that union types never have non-valid values at 
the Union cell level, only at the child cell level. But then a union value 
cannot be "made null" by changing the validity bitmap of the Union. 
   
   I believe a union can still express this with a null type as a child type, 
no? I think that is how we either modeled it or planned to model it on the Java 
side.
   
   > All nested types including union are composed from well-formed child 
arrays which may have null values.
   
   I'm in agreement on this. Decomposing would be complex.
   
   > In the case of union, a null at the top level would indicate that the type 
of the child is not known. This seems algebraically consistent to me.
   
   I think it's where the model breaks down because of the weird situation 
where you actually need to evaluate two validity buffers to determine whether 
something is valid: the parent and the child. And an inconsistency would be 
really weird. As such, I think it would be better to avoid the top-level 
validity buffer.
   
   > FTR I'm OK with dropping the top-level validity bitmap from Union, 
especially if it helps us move forward
   
   That would be my preference. It seems to ultimately reduce the risk of 
inconsistency and doesn't seem to have any functional loss (given the use of 
null type to indicate a non-alternatively-typed value). I also think this works 
well in the most common case of union types, e.g. two files where one has 
fieldA with schemaA and another where you have fieldA with schemaB. Compositing 
those two doesn't require some kind of introspection and AND'ing of the 
individual children to build an additional validity buffer (or simply setting 
true for all and then having an inconsistency with the child array) and allows 
a fast set of the type vector for each independent chunk.
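
   To make the two-bitmap concern concrete, here is a minimal sketch 
(illustrative only, not spec or Arrow code) of what consumers would have to 
evaluate if both the union and the selected child carried validity bitmaps:

   ```
   #include <cstdint>

   static bool GetBit(const uint8_t* bitmap, int64_t i) {
     return (bitmap[i / 8] >> (i % 8)) & 1;
   }

   // A slot is only valid if BOTH the union's own bitmap and the selected
   // child's bitmap say so.
   bool UnionSlotIsValid(const uint8_t* union_validity,
                         const uint8_t* child_validity,
                         int64_t union_index, int64_t child_index) {
     return GetBit(union_validity, union_index) &&
            GetBit(child_validity, child_index);
   }
   ```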







[GitHub] [arrow] github-actions[bot] commented on pull request #7530: ARROW-8934: [C++] Enable `compute::Subtract` with timestamp inputs to return duration

2020-06-23 Thread GitBox


github-actions[bot] commented on pull request #7530:
URL: https://github.com/apache/arrow/pull/7530#issuecomment-648484942


   https://issues.apache.org/jira/browse/ARROW-8934







[GitHub] [arrow] wesm commented on pull request #7530: ARROW-8934: [C++] Enable `compute::Subtract` with timestamp inputs to return duration

2020-06-23 Thread GitBox


wesm commented on pull request #7530:
URL: https://github.com/apache/arrow/pull/7530#issuecomment-648482567


   Example use in Python:
   
   ```
   In [14]: arr = pa.array(pd.date_range('2000-01-01', periods=20))

   In [15]: arr2 = pa.array(pd.date_range('2000-01-10', periods=20))

   In [16]: pc.subtract(arr, arr2)
   Out[16]:
   [
     -7776000,
     -7776000,
     -7776000,
     -7776000,
     -7776000,
     -7776000,
     -7776000,
     -7776000,
     -7776000,
     -7776000,
     -7776000,
     -7776000,
     -7776000,
     -7776000,
     -7776000,
     -7776000,
     -7776000,
     -7776000,
     -7776000,
     -7776000
   ]
   ```
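
   The same type-level intuition in C++20 `<chrono>` terms (a standalone 
sketch, unrelated to the Arrow internals): subtracting one point in time from 
another yields a signed duration, never another point in time.

   ```
   #include <chrono>
   #include <iostream>

   int main() {
     using namespace std::chrono;
     sys_days a = 2000y / 1 / 1;
     sys_days b = 2000y / 1 / 10;
     auto diff = a - b;  // a duration; the 9-day gap is -777600 seconds
     std::cout << duration_cast<seconds>(diff).count() << "\n";
   }
   ```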







[GitHub] [arrow] wesm commented on a change in pull request #7530: ARROW-8934: [C++] Enable `compute::Subtract` with timestamp inputs to return duration

2020-06-23 Thread GitBox


wesm commented on a change in pull request #7530:
URL: https://github.com/apache/arrow/pull/7530#discussion_r444564799



##
File path: cpp/src/arrow/compute/kernels/scalar_arithmetic_test.cc
##
@@ -39,7 +40,7 @@ namespace arrow {
 namespace compute {
 
 template 
-class TestBinaryArithmetics : public TestBase {
+class TestBinaryArithmetic : public TestBase {

Review comment:
   "arithmetics" is not correct, so many of the changes in this file are 
just this rename









[GitHub] [arrow] wesm opened a new pull request #7530: ARROW-8934: [C++] Enable `compute::Subtract` with timestamp inputs to return duration

2020-06-23 Thread GitBox


wesm opened a new pull request #7530:
URL: https://github.com/apache/arrow/pull/7530


   I also did a little bit of cleaning, moving some stuff into 
`arrow::compute::internal`. 







[GitHub] [arrow] wesm closed pull request #7529: ARROW-8025: [C++][CI][FOLLOWUP] Fix test compilation failure due to conflicting changes in scalar_cast_test.cc

2020-06-23 Thread GitBox


wesm closed pull request #7529:
URL: https://github.com/apache/arrow/pull/7529


   







[GitHub] [arrow] kszucs commented on pull request #7516: ARROW-9201: [Archery] More user-friendly console output for benchmark diffs, add repetitions argument, don't build unit tests

2020-06-23 Thread GitBox


kszucs commented on pull request #7516:
URL: https://github.com/apache/arrow/pull/7516#issuecomment-648473447


   I’m going to update the bot tomorrow.







[GitHub] [arrow] github-actions[bot] commented on pull request #7529: ARROW-8025: [C++][CI][FOLLOWUP] Fix test compilation failure due to conflicting changes in scalar_cast_test.cc

2020-06-23 Thread GitBox


github-actions[bot] commented on pull request #7529:
URL: https://github.com/apache/arrow/pull/7529#issuecomment-648472074


   https://issues.apache.org/jira/browse/ARROW-8025







[GitHub] [arrow] wesm opened a new pull request #7529: ARROW-8025: [C++][CI][FOLLOWUP] Fix test compilation failure due to conflicting changes in scalar_cast_test.cc

2020-06-23 Thread GitBox


wesm opened a new pull request #7529:
URL: https://github.com/apache/arrow/pull/7529


   







[GitHub] [arrow] github-actions[bot] commented on pull request #7528: ARROW-8933: [C++] Trim redundant generated code from compute/kernels/vector_hash.cc

2020-06-23 Thread GitBox


github-actions[bot] commented on pull request #7528:
URL: https://github.com/apache/arrow/pull/7528#issuecomment-648472075


   https://issues.apache.org/jira/browse/ARROW-8933







[GitHub] [arrow] wesm commented on pull request #7529: ARROW-8025: [C++][CI][FOLLOWUP] Fix test compilation failure due to conflicting changes in scalar_cast_test.cc

2020-06-23 Thread GitBox


wesm commented on pull request #7529:
URL: https://github.com/apache/arrow/pull/7529#issuecomment-648472135


   I'll merge this ASAP to minimize the number of broken builds







[GitHub] [arrow] wesm edited a comment on pull request #7529: ARROW-8025: [C++][CI][FOLLOWUP] Fix test compilation failure due to conflicting changes in scalar_cast_test.cc

2020-06-23 Thread GitBox


wesm edited a comment on pull request #7529:
URL: https://github.com/apache/arrow/pull/7529#issuecomment-648472135


   I'll merge this ASAP to minimize the number of broken builds







[GitHub] [arrow] wesm closed pull request #7470: ARROW-8025: [C++] Implement cast from String to Binary

2020-06-23 Thread GitBox


wesm closed pull request #7470:
URL: https://github.com/apache/arrow/pull/7470


   







[GitHub] [arrow] wesm opened a new pull request #7528: ARROW-8933: [C++] Trim redundant generated code from vector_hash.cc

2020-06-23 Thread GitBox


wesm opened a new pull request #7528:
URL: https://github.com/apache/arrow/pull/7528


   Since hashing doesn't know the difference between int64, uint64, float64, or 
timestamp when it comes to performing its work, there's no need to generate 
identical compiled code for each of these logical types with the same 
underlying physical representation (64-bit values). 400KB is cut out of 
libarrow.so on Linux
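
   The idea, as a hedged sketch with assumed names (this is not the Arrow 
kernel code): because the kernel only touches the physical 64-bit payload, a 
single template instantiation can serve every logical type that shares that 
representation.

   ```
   #include <cstdint>
   #include <iostream>

   // One kernel per physical representation; FNV-1a is purely illustrative.
   template <typename Physical>
   uint64_t HashValues(const Physical* values, int64_t length) {
     uint64_t h = 14695981039346656037ULL;
     const auto* bytes = reinterpret_cast<const uint8_t*>(values);
     for (int64_t i = 0; i < length * static_cast<int64_t>(sizeof(Physical)); ++i) {
       h = (h ^ bytes[i]) * 1099511628211ULL;
     }
     return h;
   }

   enum class LogicalType { INT64, UINT64, FLOAT64, TIMESTAMP };

   // Every 64-bit logical type collapses onto the same compiled kernel.
   uint64_t HashColumn(LogicalType type, const void* data, int64_t length) {
     switch (type) {
       case LogicalType::INT64:
       case LogicalType::UINT64:
       case LogicalType::FLOAT64:
       case LogicalType::TIMESTAMP:
         return HashValues(static_cast<const uint64_t*>(data), length);
     }
     return 0;
   }

   int main() {
     int64_t values[] = {1, 2, 3};
     std::cout << HashColumn(LogicalType::INT64, values, 3) << "\n";
   }
   ```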







[GitHub] [arrow] emkornfield commented on pull request #7143: ARROW-8504: [C++] Add BitRunReader and use it in parquet

2020-06-23 Thread GitBox


emkornfield commented on pull request #7143:
URL: https://github.com/apache/arrow/pull/7143#issuecomment-648465709


   Thanks, I'll take a look tonight.  Hopefully this should be enough of a clue.







[GitHub] [arrow] wesm commented on pull request #7143: ARROW-8504: [C++] Add BitRunReader and use it in parquet

2020-06-23 Thread GitBox


wesm commented on pull request #7143:
URL: https://github.com/apache/arrow/pull/7143#issuecomment-648454941


   The bug seems to be in the BitRunReader
   
   
![image](https://user-images.githubusercontent.com/329591/85470398-95a9fa00-b574-11ea-99c4-3f06db4a0179.png)
   







[GitHub] [arrow] wesm commented on pull request #7143: ARROW-8504: [C++] Add BitRunReader and use it in parquet

2020-06-23 Thread GitBox


wesm commented on pull request #7143:
URL: https://github.com/apache/arrow/pull/7143#issuecomment-648453423


   @emkornfield I'm sort of at a dead end here, hopefully the above gives you 
some clues about where there might be a problem







[GitHub] [arrow] github-actions[bot] commented on pull request #7527: ARROW-7018: [R] Non-UTF-8 data in Arrow <--> R conversion

2020-06-23 Thread GitBox


github-actions[bot] commented on pull request #7527:
URL: https://github.com/apache/arrow/pull/7527#issuecomment-648451652


   https://issues.apache.org/jira/browse/ARROW-7018







[GitHub] [arrow] wesm commented on pull request #7143: ARROW-8504: [C++] Add BitRunReader and use it in parquet

2020-06-23 Thread GitBox


wesm commented on pull request #7143:
URL: https://github.com/apache/arrow/pull/7143#issuecomment-648451899


   there seems to be a situation where the bit run has more values than are 
needed to fulfill the call to `GetSpaced`
   
   
![image](https://user-images.githubusercontent.com/329591/85469465-50d19380-b573-11ea-8129-14da0a68dc5a.png)
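
   For readers following along, a minimal sketch of the run-reading idea 
(assumed semantics and names, not Arrow's actual `BitRunReader` API); the bug 
described above would correspond to a run whose length overshoots what the 
`GetSpaced` caller still needs:

   ```
   #include <cstdint>
   #include <iostream>

   struct BitRun {
     int64_t length;
     bool set;  // true if this run consists of 1-bits
   };

   class SimpleBitRunReader {
    public:
     SimpleBitRunReader(const uint8_t* bitmap, int64_t num_bits)
         : bitmap_(bitmap), pos_(0), num_bits_(num_bits) {}

     // Returns the next run of identical bits; length 0 signals exhaustion.
     BitRun NextRun() {
       if (pos_ >= num_bits_) return {0, false};
       const bool current = GetBit(pos_);
       const int64_t start = pos_;
       while (pos_ < num_bits_ && GetBit(pos_) == current) ++pos_;
       return {pos_ - start, current};
     }

    private:
     bool GetBit(int64_t i) const { return (bitmap_[i / 8] >> (i % 8)) & 1; }
     const uint8_t* bitmap_;
     int64_t pos_;
     int64_t num_bits_;
   };

   int main() {
     // LSB-first bitmap: runs of 4 set, 8 clear, 4 set.
     const uint8_t bitmap[] = {0b00001111, 0b11110000};
     SimpleBitRunReader reader(bitmap, 16);
     for (BitRun run = reader.NextRun(); run.length > 0; run = reader.NextRun()) {
       std::cout << (run.set ? "set" : "clear") << " x " << run.length << "\n";
     }
   }
   ```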
   







[GitHub] [arrow] nealrichardson commented on a change in pull request #7520: ARROW-9189: [Website] Improve contributor guide

2020-06-23 Thread GitBox


nealrichardson commented on a change in pull request #7520:
URL: https://github.com/apache/arrow/pull/7520#discussion_r444533246



##
File path: docs/source/developers/contributing.rst
##
@@ -76,46 +96,83 @@ visibility. They may add a "Fix version" to indicate that they're considering
 it for inclusion in the next release, though adding that tag is not a
 commitment that it will be done in the next release.
 
-Advanced use
-------------
-
-Once you are involved in the project and want to do more on JIRA, such as
-assign yourself an issue, you will need "Contributor" permissions on the
-Apache Arrow JIRA. To get this role, ask on the mailing list for a project
-maintainer's help.
-
-GitHub issues
--------------
-
-We support `GitHub issues <https://github.com/apache/arrow/issues>`_ as a
-lightweight way to ask questions and engage with
-the Arrow developer community. We use JIRA for maintaining a queue of
-development work and as the public record for work on the project. So, feel
-free to open GitHub issues, but bugs and feature requests will eventually need
-to end up in JIRA, either before or after completing a pull request. Don't be
-surprised if you are immediately asked by a project maintainer to open a JIRA
-issue.
-
-How to contribute patches
-=========================
-
-We prefer to receive contributions in the form of GitHub pull requests. Please
-send pull requests against the `github.com/apache/arrow
-<https://github.com/apache/arrow>`_ repository following the procedure below.
-
-If you are looking for some ideas on what to contribute, check out the JIRA
-issues for the Apache Arrow project. Comment on the issue and/or contact
-d...@arrow.apache.org with your questions and ideas.
-
-If you’d like to report a bug but don’t have time to fix it, you can still post
-it on JIRA, or email the mailing list d...@arrow.apache.org.
+Tips for successful bug reports
+===============================
+
+No one likes having bugs in their software, and in an ideal world, all bugs
+would get fixed as soon as they were reported. However, time and attention are
+finite, especially in an open-source project where most contributors are
+participating in their spare time. In order for your bug to get prompt
+attention, there are things you can do to make it easier for contributors to
+reproduce and fix it.
+
+When you're reporting a bug, please help us understand the issue by providing,
+to the best of your ability,
+
+* Clear, minimal steps to reproduce the issue, with as few non-Arrow
+  dependencies as possible. If there's a problem on reading a file, try to
+  provide as small of an example file as possible, or code to create one.
+  If your bug report says "it crashes trying to read my file, but I can't
+  share it with you," it's really hard for us to debug.
+* Any relevant operating system, language, and library version information
+* If it isn't obvious, clearly state the expected behavior and what actually
+  happened.
+
+If a developer can't get a failing unit test, they won't be able to know that
+the issue has been identified, and they won't know when it has been fixed.
+Try to anticipate the questions you might be asked by someone working to
+understand the issue and provide those supporting details up front.
+

Review comment:
   I added a couple of links; feel free to add others if you have favorites









[GitHub] [arrow] nealrichardson commented on a change in pull request #7527: ARROW-7018: [R] Non-UTF-8 data in Arrow <--> R conversion

2020-06-23 Thread GitBox


nealrichardson commented on a change in pull request #7527:
URL: https://github.com/apache/arrow/pull/7527#discussion_r444530279



##
File path: r/src/array_from_vector.cpp
##
@@ -159,6 +159,9 @@ struct VectorToArrayConverter {
       if (s == NA_STRING) {
         RETURN_NOT_OK(binary_builder->AppendNull());
         continue;
+      } else {
+        // Make sure we're ingesting UTF-8
+        s = Rf_mkCharCE(Rf_translateCharUTF8(s), CE_UTF8);

Review comment:
   This `Rf_mkCharCE(Rf_translateCharUTF8(s), CE_UTF8)` is dropped in 
several places; should we factor this out to a macro or something?
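
   A hedged sketch of such a helper (hypothetical name, not code from the PR), 
wrapping the translate-and-rebox dance in one inline function instead of a 
macro:

   ```
   #include <Rinternals.h>

   // Ensure a CHARSXP is UTF-8, translating and re-interning if necessary.
   static inline SEXP EnsureUtf8(SEXP s) {
     return Rf_mkCharCE(Rf_translateCharUTF8(s), CE_UTF8);
   }
   ```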

##
File path: r/src/recordbatch.cpp
##
@@ -246,6 +246,7 @@ std::shared_ptr<arrow::RecordBatch> RecordBatch__from_arrays__known_schema(
   SEXP names = Rf_getAttrib(lst, R_NamesSymbol);
 
   auto fill_array = [, ](int j, SEXP x, SEXP name) {
+    name = Rf_mkCharCE(Rf_translateCharUTF8(name), CE_UTF8);

Review comment:
   There are a few places where I `Rf_mkCharCE()` and then immediately call 
`CHAR()`, which IIUC is boxing in a SEXP and then immediately unboxing it. 
Maybe that can be eliminated in some places or done better?









[GitHub] [arrow] nealrichardson opened a new pull request #7527: ARROW-7018: [R] Non-UTF-8 data in Arrow <--> R conversion

2020-06-23 Thread GitBox


nealrichardson opened a new pull request #7527:
URL: https://github.com/apache/arrow/pull/7527


   Sprinkles `Rf_translateCharUTF8` in a few places. I tried to add tests for 
all of the different scenarios I could think of where we could have non-UTF-8 
strings.
   
   Also includes `$` and `[[` methods for `Schema` objects.







[GitHub] [arrow] wesm commented on pull request #7143: ARROW-8504: [C++] Add BitRunReader and use it in parquet

2020-06-23 Thread GitBox


wesm commented on pull request #7143:
URL: https://github.com/apache/arrow/pull/7143#issuecomment-648446373


   I'm able to reproduce the error in VS and set breakpoints, I got this far to 
see that GetBatchWithDictSpaced has decoded more values than it was asked to
   
   
![image](https://user-images.githubusercontent.com/329591/85467944-6e9df900-b571-11ea-82f8-5f7efade8f2e.png)
   







[GitHub] [arrow] wesm commented on pull request #7290: ARROW-1692: [Java] UnionArray round trip not working

2020-06-23 Thread GitBox


wesm commented on pull request #7290:
URL: https://github.com/apache/arrow/pull/7290#issuecomment-648439435


   FTR I'm OK with dropping the top-level validity bitmap from Union, 
especially if it helps us move forward







[GitHub] [arrow] wesm edited a comment on pull request #7290: ARROW-1692: [Java] UnionArray round trip not working

2020-06-23 Thread GitBox


wesm edited a comment on pull request #7290:
URL: https://github.com/apache/arrow/pull/7290#issuecomment-648435911


   > @wesm why would we have validity at both the top level and the inner level?
   
   Well, the way the specification is written
   
   * _All_ nested types including union are composed from well-formed child 
arrays which may have null values. 
   * Additionally, all array types, including all nested array types, have 
their own validity bitmap
   * In the case of union, a null at the top level would indicate that the type 
of the child is not known. This seems algebraically consistent to me. 
   
   We can decide to stipulate that union types never have non-valid values at 
the Union cell level, only at the child cell level. But then a union value 
cannot be "made null" by changing the validity bitmap of the Union. From a 
purely algebraic / design perspective it isn't great. From the get-go in the 
project I've been striving for algebraic consistency, e.g. enabling well-formed 
arrays to be composed to create composite types without alteration. Since 
Unions are one of the more seldom used parts of the project, we can decide to 
explicitly deviate from that, but we need to do it now if we're going to do 
that and not delay longer in reconciling the issue. 







[GitHub] [arrow] wesm edited a comment on pull request #7290: ARROW-1692: [Java] UnionArray round trip not working

2020-06-23 Thread GitBox


wesm edited a comment on pull request #7290:
URL: https://github.com/apache/arrow/pull/7290#issuecomment-648435911


   > @wesm why would we have validity at both the top level and the inner level?
   
   Well, the way the specification is written
   
   * _All_ nested types including union are composed from well-formed child 
arrays which may have null values. 
   * Additionally, all array types, including all nested array types, have 
their own validity bitmap
   * In the case of union, a null at the top level would indicate that the type 
of the child is not known. This seems algebraically consistent to me. 
   
   We can decide to stipulate that union types never have non-valid values at 
the Union cell level, only at the child cell level. But then a union value 
cannot be "made null" by changing the validity bitmap of the Union. From a 
purely algebraic / design perspective it isn't great. From the get-go in the 
project I've been striving for algebraic consistency, e.g. enabling well-formed 
arrays to be composed to create composite types without alteration. Since 
Unions are one of the more seldom used parts of the project, we can decide to 
explicitly deviate from that, but we need to do it now if we're going to do 
that and not delay longer in reconciling the issue. 







[GitHub] [arrow] wesm commented on pull request #7290: ARROW-1692: [Java] UnionArray round trip not working

2020-06-23 Thread GitBox


wesm commented on pull request #7290:
URL: https://github.com/apache/arrow/pull/7290#issuecomment-648435911


   > @wesm why would we have validity at both the top level and the inner level?
   
   Well, the way the specification is written
   
   * _All_ nested types including union are composed from well-formed child 
arrays which may be nullable. 
   * Additionally, all array types, including all nested array types, have 
their own validity bitmap
   * In the case of union, a null at the top level would indicate that the type 
of the child is not known. This seems algebraically consistent to me. 
   
   We can decide to stipulate that union types never have non-valid values at 
the Union cell level, only at the child cell level. But then a union value 
cannot be "made null" by changing the validity bitmap of the Union. From a 
purely algebraic / design perspective it isn't great. From the get-go in the 
project I've been striving for algebraic consistency, e.g. enabling well-formed 
arrays to be composed to create composite types without alteration. Since 
Unions are one of the more seldom used parts of the project, we can decide to 
explicitly deviate from that, but we need to do it now if we're going to do 
that and not delay longer in reconciling the issue. 







[GitHub] [arrow] jacques-n commented on a change in pull request #6402: ARROW-7831: [Java] do not allocate a new offset buffer if the slice starts at 0 since the relative offset pointer would be uncha

2020-06-23 Thread GitBox


jacques-n commented on a change in pull request #6402:
URL: https://github.com/apache/arrow/pull/6402#discussion_r444514257



##
File path: java/vector/src/main/java/org/apache/arrow/vector/BaseVariableWidthVector.java
##
@@ -751,55 +757,57 @@ private void splitAndTransferOffsetBuffer(int startIndex, int length, BaseVariab
    */
   private void splitAndTransferValidityBuffer(int startIndex, int length,
       BaseVariableWidthVector target) {
-    int firstByteSource = BitVectorHelper.byteIndex(startIndex);
-    int lastByteSource = BitVectorHelper.byteIndex(valueCount - 1);
-    int byteSizeTarget = getValidityBufferSizeFromCount(length);
-    int offset = startIndex % 8;
+    if (length <= 0) {

Review comment:
   I'm not understanding why this is okay. It seems like you're leaving 
this function with a different number of references than the other exit points.

##
File path: java/vector/src/main/java/org/apache/arrow/vector/BaseVariableWidthVector.java
##
@@ -751,55 +757,57 @@ private void splitAndTransferOffsetBuffer(int startIndex, int length, BaseVariab
    */
   private void splitAndTransferValidityBuffer(int startIndex, int length,
       BaseVariableWidthVector target) {
-    int firstByteSource = BitVectorHelper.byteIndex(startIndex);
-    int lastByteSource = BitVectorHelper.byteIndex(valueCount - 1);
-    int byteSizeTarget = getValidityBufferSizeFromCount(length);
-    int offset = startIndex % 8;
+    if (length <= 0) {
+      return;
+    }
 
-    if (length > 0) {
-      if (offset == 0) {
-        // slice
-        if (target.validityBuffer != null) {
-          target.validityBuffer.getReferenceManager().release();
-        }
-        target.validityBuffer = validityBuffer.slice(firstByteSource, byteSizeTarget);
-        target.validityBuffer.getReferenceManager().retain();
+    final int firstByteSource = BitVectorHelper.byteIndex(startIndex);
+    final int lastByteSource = BitVectorHelper.byteIndex(valueCount - 1);
+    final int byteSizeTarget = getValidityBufferSizeFromCount(length);
+    final int offset = startIndex % 8;
+
+    if (offset == 0) {
+      // slice
+      if (target.validityBuffer != null) {
+        target.validityBuffer.getReferenceManager().release();
+      }
+      final ArrowBuf slicedValidityBuffer = validityBuffer.slice(firstByteSource, byteSizeTarget);
+      target.validityBuffer = transferBuffer(slicedValidityBuffer, target.allocator);

Review comment:
   I think you can just return here (or make small modifications to do so), 
avoiding the large else block.









[GitHub] [arrow] jacques-n commented on pull request #7290: ARROW-1692: [Java] UnionArray round trip not working

2020-06-23 Thread GitBox


jacques-n commented on pull request #7290:
URL: https://github.com/apache/arrow/pull/7290#issuecomment-648428724


   Adding to my previous comments: if validity were only at the top level, I'm 
not sure what the ramifications of that would be for the Java codebase. I think 
it would require a fairly massive refactoring.







[GitHub] [arrow] bkietz commented on a change in pull request #7526: ARROW-9146: [C++][Dataset] Lazily store fragment physical schema

2020-06-23 Thread GitBox


bkietz commented on a change in pull request #7526:
URL: https://github.com/apache/arrow/pull/7526#discussion_r444509707



##
File path: cpp/src/arrow/dataset/file_parquet.cc
##
@@ -357,13 +355,20 @@ static inline Result<std::vector<RowGroupInfo>> AugmentRowGroups(
   return row_groups;
 }
 
-Result<ScanTaskIterator> ParquetFileFormat::ScanFile(
-    const FileSource& source, std::shared_ptr<ScanOptions> options,
-    std::shared_ptr<ScanContext> context, std::vector<RowGroupInfo> row_groups) const {
+Result<ScanTaskIterator> ParquetFileFormat::ScanFile(std::shared_ptr<ScanOptions> options,
+                                                     std::shared_ptr<ScanContext> context,
+                                                     FileFragment* fragment) const {
+  const auto& source = fragment->source();
+  auto row_groups = checked_cast<ParquetFileFragment*>(fragment)->row_groups();
+
   bool row_groups_are_complete = RowGroupInfosAreComplete(row_groups);
   // The following block is required to avoid any IO if all RowGroups are
   // excluded due to prior statistics knowledge.
   if (row_groups_are_complete) {
+    // physical_schema should be cached at this point
+    ARROW_ASSIGN_OR_RAISE(auto physical_schema, fragment->ReadPhysicalSchema());
+    RETURN_NOT_OK(options->filter->Validate(*physical_schema));

Review comment:
   Fragments created using `_metadata` have their physical schema preloaded 
so they won't incur IO when no row group can satisfy a predicate









[GitHub] [arrow] jacques-n commented on pull request #7290: ARROW-1692: [Java] UnionArray round trip not working

2020-06-23 Thread GitBox


jacques-n commented on pull request #7290:
URL: https://github.com/apache/arrow/pull/7290#issuecomment-648427018


   I'm really struggling with these changes. I don't understand why there is a 
validity buffer at the union level as well as at the cell level. I'm not sure 
what it even means that a union value is valid but the inner value is not. I 
think there was an inconsistent understanding on my part when that change was 
proposed in the spec over what the Java implementation was doing. @wesm why 
would we have validity at both the top level and the inner level? Or is the C++ 
impl only at the top level and not the inner level (which means that this Java 
code is incorrect, since the inner vectors have their own validity vectors)?







[GitHub] [arrow] jacques-n edited a comment on pull request #7290: ARROW-1692: [Java] UnionArray round trip not working

2020-06-23 Thread GitBox


jacques-n edited a comment on pull request #7290:
URL: https://github.com/apache/arrow/pull/7290#issuecomment-648427018


   I'm really struggling with these changes. I don't understand why there is a 
validity buffer at the union level as well as at the cell level. I'm not sure 
what it even means that a union value is valid but the inner value is not. I 
think there was an inconsistent understanding on my part when that change was 
proposed in the format (versus what the Java implementation was already doing). 
@wesm why would we have validity at both the top level and the inner level? Or 
is the C++ impl only at the top level and not the inner level (which means that 
this Java code is incorrect, since the inner vectors have their own validity 
vectors)?







[GitHub] [arrow] wesm closed pull request #7525: ARROW-9214: [C++] Use separate functions for valid/not-valid values in VisitArrayDataInline

2020-06-23 Thread GitBox


wesm closed pull request #7525:
URL: https://github.com/apache/arrow/pull/7525


   







[GitHub] [arrow] wesm commented on pull request #7525: ARROW-9214: [C++] Use separate functions for valid/not-valid values in VisitArrayDataInline

2020-06-23 Thread GitBox


wesm commented on pull request #7525:
URL: https://github.com/apache/arrow/pull/7525#issuecomment-648410864


   +1. We can work on performance smithing in follow-up PRs







[GitHub] [arrow] wesm commented on pull request #7525: ARROW-9214: [C++] Use separate functions for valid/not-valid values in VisitArrayDataInline

2020-06-23 Thread GitBox


wesm commented on pull request #7525:
URL: https://github.com/apache/arrow/pull/7525#issuecomment-648410615


   I looked at the Parquet read/write benchmarks; the differences look mostly 
like noise to me
   
   ```
       benchmark                     baseline         contender        change %  counters
   40  BM_ReadColumn/-1/1            754.859 MiB/sec  857.980 MiB/sec    13.661  {'iterations': 18}
   17  BM_ReadColumn/25/25           1.165 GiB/sec    1.228 GiB/sec       5.379  {'iterations': 7}
   37  BM_ReadColumn/5/10            461.996 MiB/sec  485.049 MiB/sec     4.990  {'iterations': 11}
   8   BM_ReadColumn/25/5            749.004 MiB/sec  785.493 MiB/sec     4.872  {'iterations': 8}
   33  BM_ReadColumn/-1/0            1.537 GiB/sec    1.609 GiB/sec       4.694  {'iterations': 18}
   27  BM_ReadColumn/-1/0            2.398 GiB/sec    2.492 GiB/sec       3.913  {'iterations': 17}
   36  BM_ReadColumn/1/20            396.502 MiB/sec  409.002 MiB/sec     3.153  {'iterations': 28}
   28  BM_ReadMultipleRowGroups      4.873 GiB/sec    4.988 GiB/sec       2.357  {'iterations': 35}
   44  BM_ReadIndividualRowGroups    4.907 GiB/sec    5.018 GiB/sec       2.259  {'iterations': 35}
   0   BM_ReadColumn/-1/0            1.003 GiB/sec    1.023 GiB/sec       2.039  {'iterations': 72}
   12  BM_ReadColumn/50/0            608.159 MiB/sec  618.035 MiB/sec     1.624  {'iterations': 7}
   35  BM_ReadColumn/-1/50           3.050 GiB/sec    3.089 GiB/sec       1.260  {'iterations': 51}
   43  BM_ReadColumn/10/5            1.448 GiB/sec    1.464 GiB/sec       1.062  {'iterations': 9}
   2   BM_ReadColumn/10/10           946.909 MiB/sec  954.967 MiB/sec     0.851  {'iterations': 9}
   38  BM_ReadColumn/99/50           2.082 GiB/sec    2.097 GiB/sec       0.731  {'iterations': 11}
   1   BM_ReadColumn/-1/1            9.340 GiB/sec    9.396 GiB/sec       0.607  {'iterations': 79}
   15  BM_ReadColumn/-1/1            5.533 GiB/sec    5.564 GiB/sec       0.559  {'iterations': 101}
   32  BM_WriteColumn                323.257 MiB/sec  324.536 MiB/sec     0.396  {'iterations': 7}
   7   BM_WriteColumn                575.266 MiB/sec  575.742 MiB/sec     0.083  {'iterations': 6}
   16  BM_ReadColumn/-1/10           1.608 GiB/sec    1.609 GiB/sec       0.074  {'iterations': 29}
   13  BM_ReadColumn/50/50           607.396 MiB/sec  607.840 MiB/sec     0.073  {'iterations': 6}
   34  BM_ReadColumn/-1/0            2.485 GiB/sec    2.486 GiB/sec       0.050  {'iterations': 13}
   39  BM_ReadColumn/-1/50           5.035 GiB/sec    5.033 GiB/sec      -0.053  {'iterations': 47}
   9   BM_ReadColumn/50/50           960.235 MiB/sec  957.587 MiB/sec    -0.276  {'iterations': 6}
   21  BM_ReadColumn/1/1             2.445 GiB/sec    2.437 GiB/sec      -0.342  {'iterations': 17}
   42  BM_ReadColumn/35/10           1.083 GiB/sec    1.078 GiB/sec      -0.508  {'iterations': 6}
   45  BM_ReadColumn/10/50           1.441 GiB/sec    1.432 GiB/sec      -0.591  {'iterations': 10}
   31  BM_ReadColumn/-1/10           3.050 GiB/sec    3.031 GiB/sec      -0.626  {'iterations': 26}
   26  BM_ReadColumn/99/0            2.080 GiB/sec    2.067 GiB/sec      -0.630  {'iterations': 11}
   29  BM_ReadColumn/99/50           1.345 GiB/sec    1.333 GiB/sec      -0.845  {'iterations': 13}
   5   BM_ReadColumn/99/0            1.369 GiB/sec    1.357 GiB/sec      -0.875  {'iterations': 13}
   14  BM_ReadColumn/45/25           966.350 MiB/sec  957.826 MiB/sec    -0.882  {'iterations': 6}
   18  BM_WriteColumn                318.849 MiB/sec  315.743 MiB/sec    -0.974  {'iterations': 22}
   3   BM_ReadColumn/-1/20           4.140 GiB/sec    4.098 GiB/sec      -1.013  {'iterations': 34}
   25  BM_ReadColumn/1/1             1.610 GiB/sec    1.593 GiB/sec      -1.049  {'iterations': 19}
   4   BM_ReadColumn/50/1            961.641 MiB/sec  949.897 MiB/sec    -1.221  {'iterations': 6}
   10  BM_ReadColumn/5/5             1.751 GiB/sec    1.729 GiB/sec      -1.259  {'iterations': 10}
   19  BM_WriteColumn                1.087 GiB/sec    1.072 GiB/sec      -1.355  {'iterations': 10}
   24  BM_ReadColumn/30/10           1.137 GiB/sec    1.120 GiB/sec      -1.502  {'iterations': 7}
   22  BM_WriteColumn                769.037 MiB/sec  757.229 MiB/sec    -1.536  {'iterations': 11}
   41  BM_ReadColumn/25/10           1.207 GiB/sec    1.187 GiB/sec      -1.611  {'iterations': 9}
   20  BM_WriteColumn                995.675 MiB/sec  967.346 MiB/sec    -2.845  {'iterations': 6}
   11  BM_ReadColumn/75/1            1.178 GiB/sec    1.143 GiB/sec      -2.953  {'iterations': 8}
   30  BM_WriteColumn                967.789 MiB/sec  915.061 MiB/sec    -5.448  {'iterations': 6}
   6   BM_ReadColumn/-1/0            14.679 GiB/sec   13.820 GiB/sec    -5.851  {'iterations': 100}
   23  BM_WriteColumn                1.427 GiB/sec    1.296 GiB/sec      -9.217  {'iterations': 13}
   ```




[GitHub] [arrow] fsaintjacques commented on a change in pull request #7526: ARROW-9146: [C++][Dataset] Lazily store fragment physical schema

2020-06-23 Thread GitBox


fsaintjacques commented on a change in pull request #7526:
URL: https://github.com/apache/arrow/pull/7526#discussion_r92049



##
File path: cpp/src/arrow/dataset/file_parquet.cc
##
@@ -357,13 +355,20 @@ static inline Result<std::vector<RowGroupInfo>> AugmentRowGroups(
   return row_groups;
 }
 
-Result<ScanTaskIterator> ParquetFileFormat::ScanFile(
-    const FileSource& source, std::shared_ptr<ScanOptions> options,
-    std::shared_ptr<ScanContext> context, std::vector<RowGroupInfo> row_groups) const {
+Result<ScanTaskIterator> ParquetFileFormat::ScanFile(std::shared_ptr<ScanOptions> options,
+                                                     std::shared_ptr<ScanContext> context,
+                                                     FileFragment* fragment) const {
+  const auto& source = fragment->source();
+  auto row_groups = checked_cast<ParquetFileFragment*>(fragment)->row_groups();
+
   bool row_groups_are_complete = RowGroupInfosAreComplete(row_groups);
   // The following block is required to avoid any IO if all RowGroups are
   // excluded due to prior statistics knowledge.
   if (row_groups_are_complete) {
+    // physical_schema should be cached at this point
+    ARROW_ASSIGN_OR_RAISE(auto physical_schema, fragment->ReadPhysicalSchema());
+    RETURN_NOT_OK(options->filter->Validate(*physical_schema));

Review comment:
   Correct, this line can't be here.









[GitHub] [arrow] github-actions[bot] commented on pull request #7526: ARROW-9146: [C++][Dataset] Lazily store fragment physical schema

2020-06-23 Thread GitBox


github-actions[bot] commented on pull request #7526:
URL: https://github.com/apache/arrow/pull/7526#issuecomment-648401641


   https://issues.apache.org/jira/browse/ARROW-9146







[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7526: ARROW-9146: [C++][Dataset] Lazily store fragment physical schema

2020-06-23 Thread GitBox


jorisvandenbossche commented on a change in pull request #7526:
URL: https://github.com/apache/arrow/pull/7526#discussion_r86902



##
File path: cpp/src/arrow/dataset/file_parquet.cc
##
@@ -357,13 +355,20 @@ static inline Result<std::vector<RowGroupInfo>> AugmentRowGroups(
   return row_groups;
 }
 
-Result<ScanTaskIterator> ParquetFileFormat::ScanFile(
-    const FileSource& source, std::shared_ptr<ScanOptions> options,
-    std::shared_ptr<ScanContext> context, std::vector<RowGroupInfo> row_groups) const {
+Result<ScanTaskIterator> ParquetFileFormat::ScanFile(std::shared_ptr<ScanOptions> options,
+                                                     std::shared_ptr<ScanContext> context,
+                                                     FileFragment* fragment) const {
+  const auto& source = fragment->source();
+  auto row_groups = checked_cast<ParquetFileFragment*>(fragment)->row_groups();
+
   bool row_groups_are_complete = RowGroupInfosAreComplete(row_groups);
   // The following block is required to avoid any IO if all RowGroups are
   // excluded due to prior statistics knowledge.
   if (row_groups_are_complete) {
+    // physical_schema should be cached at this point
+    ARROW_ASSIGN_OR_RAISE(auto physical_schema, fragment->ReadPhysicalSchema());
+    RETURN_NOT_OK(options->filter->Validate(*physical_schema));

Review comment:
   Didn't look in detail yet (so might be misreading, it's late here ;)), 
but a quick note: in general, when Fragments were instantiated with 
ParquetFactory (from a `_metadata` file) with populated statistics, I think it 
should be possible to filter out fragments without any additional IO.









[GitHub] [arrow] bkietz opened a new pull request #7526: ARROW-9146: [C++][Dataset] Lazily store fragment physical schema

2020-06-23 Thread GitBox


bkietz opened a new pull request #7526:
URL: https://github.com/apache/arrow/pull/7526


   The physical schema is required to validate predicates used for filtering 
row groups based on statistics.
   
   It can also be explicitly provided to ensure that if no row groups satisfy 
the predicate no I/O is done.
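
   A minimal sketch of the caching pattern (names assumed, not the PR's code): 
read the physical schema on first request, reuse it afterwards, or seed it up 
front (e.g. from a `_metadata` file) so no I/O happens at all.

   ```
   #include <memory>
   #include <mutex>
   #include <utility>

   struct Schema {};

   class LazySchemaHolder {
    public:
     // Optionally seed with an already-known schema to avoid any I/O.
     explicit LazySchemaHolder(std::shared_ptr<Schema> known = nullptr)
         : schema_(std::move(known)) {}

     template <typename ReadFn>
     std::shared_ptr<Schema> Get(ReadFn&& read) {
       std::lock_guard<std::mutex> lock(mutex_);
       if (!schema_) schema_ = read();  // I/O only on the first call
       return schema_;
     }

    private:
     std::mutex mutex_;
     std::shared_ptr<Schema> schema_;
   };
   ```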







[GitHub] [arrow] maxburke commented on a change in pull request #7500: ARROW-9191: [Rust] Do not panic when milliseconds is less than zero as chrono can handle…

2020-06-23 Thread GitBox


maxburke commented on a change in pull request #7500:
URL: https://github.com/apache/arrow/pull/7500#discussion_r70083



##
File path: rust/parquet/src/record/api.rs
##
@@ -893,16 +893,6 @@ mod tests {
         assert_eq!(row, Field::TimestampMillis(123854406));
     }
 
-    #[test]
-    #[should_panic(expected = "Expected non-negative milliseconds when converting Int96")]
-    fn test_row_convert_int96_invalid() {
-        // INT96 value does not depend on logical type
-        let descr = make_column_descr![PhysicalType::INT96, LogicalType::NONE];
-
-        let value = Int96::from(vec![0, 0, 0]);
-        Field::convert_int96(&descr, value);
-    }
-

Review comment:
   As a consumer of Parquet and Arrow I'm trying to not have my program 
panic when it encounters a date from 1968. 
   
   Perhaps it's a mistake to have the timestamp types implemented with unsigned 
integers? Should these be signed?
   
   Regardless, from an API design point of view, I think that Parquet is the 
wrong place to be defending against (potential) gaps in functionality of 
non-project libraries like chrono.









[GitHub] [arrow] maxburke commented on a change in pull request #7500: ARROW-9191: [Rust] Do not panic when milliseconds is less than zero as chrono can handle…

2020-06-23 Thread GitBox


maxburke commented on a change in pull request #7500:
URL: https://github.com/apache/arrow/pull/7500#discussion_r70083



##
File path: rust/parquet/src/record/api.rs
##
@@ -893,16 +893,6 @@ mod tests {
         assert_eq!(row, Field::TimestampMillis(123854406));
     }
 
-    #[test]
-    #[should_panic(expected = "Expected non-negative milliseconds when converting Int96")]
-    fn test_row_convert_int96_invalid() {
-        // INT96 value does not depend on logical type
-        let descr = make_column_descr![PhysicalType::INT96, LogicalType::NONE];
-
-        let value = Int96::from(vec![0, 0, 0]);
-        Field::convert_int96(&descr, value);
-    }
-

Review comment:
   As a consumer of Parquet and Arrow I'm trying to not have my program 
panic when it encounters a date from 1968. 
   
   Perhaps it's a mistake to have the timestamp types implemented with unsigned 
integers? Should these be signed?
   
   Regardless, from an API design point of view, I think that Parquet is the 
wrong place to be defending against (potential) gaps in functionality of 
non-project libraries like chrono. As a consumer, I'd rather receive the panic 
from chrono than from Parquet :) 









[GitHub] [arrow] paddyhoran closed pull request #7466: ARROW-9158: [Rust][Datafusion] projection physical plan compilation should preserve nullability

2020-06-23 Thread GitBox


paddyhoran closed pull request #7466:
URL: https://github.com/apache/arrow/pull/7466


   







[GitHub] [arrow] paddyhoran commented on a change in pull request #7500: ARROW-9191: [Rust] Do not panic when milliseconds is less than zero as chrono can handle…

2020-06-23 Thread GitBox


paddyhoran commented on a change in pull request #7500:
URL: https://github.com/apache/arrow/pull/7500#discussion_r43777



##
File path: rust/parquet/src/record/api.rs
##
@@ -893,16 +893,6 @@ mod tests {
 assert_eq!(row, Field::TimestampMillis(123854406));
 }
 
-    #[test]
-    #[should_panic(expected = "Expected non-negative milliseconds when converting Int96")]
-    fn test_row_convert_int96_invalid() {
-        // INT96 value does not depend on logical type
-        let descr = make_column_descr![PhysicalType::INT96, LogicalType::NONE];
-
-        let value = Int96::from(vec![0, 0, 0]);
-        Field::convert_int96(&descr, value);
-    }
-

Review comment:
   Are there other tests we should be adding to test the effect of allowing 
negative milliseconds to flow through?  
   
   I'm not that familiar with Parquet but I see things like 
[this](https://github.com/apache/arrow/blob/master/rust/parquet/src/record/api.rs#L584)
 which seem to indicate that we might be relying on milliseconds being 
non-negative.
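
   One possible addition, sketched under the assumption that `Field::TimestampMillis` is changed to carry a signed (`i64`) value rather than `u64`, mirroring the shape of the surviving conversion tests:

   ```rust
   #[test]
   fn test_row_convert_timestamp_millis_negative() {
       // A pre-epoch timestamp: -1_000 ms is one second before 1970-01-01.
       let descr =
           make_column_descr![PhysicalType::INT64, LogicalType::TIMESTAMP_MILLIS];

       let row = Field::convert_int64(&descr, -1_000);
       assert_eq!(row, Field::TimestampMillis(-1_000));
   }
   ```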





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm closed pull request #7518: ARROW-9138: [Docs][Format] Make sure format version is hard coded in the docs

2020-06-23 Thread GitBox


wesm closed pull request #7518:
URL: https://github.com/apache/arrow/pull/7518


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kiszk commented on pull request #7507: ARROW-8797: [C++] [WIP] Create test to receive RecordBatch for different endian

2020-06-23 Thread GitBox


kiszk commented on pull request #7507:
URL: https://github.com/apache/arrow/pull/7507#issuecomment-648320579


   Are there any comments on this approach for preparing test cases across different endianness? cc @pitrou @wesm 
   If not, I will prepare other tests (disabled for now) with this approach. 
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm edited a comment on pull request #7525: ARROW-9214: [C++] Use separate functions for valid/not-valid values in VisitArrayDataInline

2020-06-23 Thread GitBox


wesm edited a comment on pull request #7525:
URL: https://github.com/apache/arrow/pull/7525#issuecomment-648279829


   Here are the sort benchmarks prior to the initial visitor_inline.h changes.
   
   gcc-8:
   
   ```
  benchmark baseline
contender  change %   counters
   2 SortToIndicesInt64Count/32768/1/min_time:1.0004.578 GiB/sec
5.269 GiB/sec15.096  {'iterations': 209972, 'null_percent': 100.0}
   0   SortToIndicesInt64Compare/32768/1/min_time:1.0004.604 GiB/sec
5.282 GiB/sec14.722  {'iterations': 212391, 'null_percent': 100.0}
   3 SortToIndicesInt64Compare/1048576/1/min_time:1.0004.596 GiB/sec
5.108 GiB/sec11.149{'iterations': 6619, 'null_percent': 100.0}
   9   SortToIndicesInt64Count/1048576/1/min_time:1.0004.612 GiB/sec
5.098 GiB/sec10.548{'iterations': 6632, 'null_percent': 100.0}
   11  SortToIndicesInt64Count/8388608/1/min_time:1.0004.331 GiB/sec
4.745 GiB/sec 9.561 {'iterations': 782, 'null_percent': 100.0}
   4 SortToIndicesInt64Compare/8388608/1/min_time:1.0004.336 GiB/sec
4.742 GiB/sec 9.357 {'iterations': 775, 'null_percent': 100.0}
   7 SortToIndicesInt64Count/32768/1/min_time:1.0001.922 GiB/sec
2.101 GiB/sec 9.299{'iterations': 87523, 'null_percent': 0.01}
   10SortToIndicesInt64Count/32768/2/min_time:1.000  836.548 MiB/sec  
862.251 MiB/sec 3.073{'iterations': 37646, 'null_percent': 50.0}
   15  SortToIndicesInt64Compare/32768/2/min_time:1.000  244.171 MiB/sec  
251.494 MiB/sec 2.999{'iterations': 10912, 'null_percent': 50.0}
   1  SortToIndicesInt64Compare/32768/10/min_time:1.000  157.085 MiB/sec  
159.112 MiB/sec 1.290 {'iterations': 7038, 'null_percent': 10.0}
   5   SortToIndicesInt64Compare/32768/0/min_time:1.000  151.208 MiB/sec  
152.825 MiB/sec 1.069  {'iterations': 6831, 'null_percent': 0.0}
   13  SortToIndicesInt64Compare/32768/1/min_time:1.000  149.768 MiB/sec  
150.096 MiB/sec 0.219 {'iterations': 6664, 'null_percent': 0.01}
   12SortToIndicesInt64Compare/32768/100/min_time:1.000  149.783 MiB/sec  
149.835 MiB/sec 0.035  {'iterations': 6716, 'null_percent': 1.0}
   6 SortToIndicesInt64Count/32768/0/min_time:1.0002.440 GiB/sec
2.317 GiB/sec-5.020{'iterations': 107181, 'null_percent': 0.0}
   14  SortToIndicesInt64Count/32768/100/min_time:1.0001.866 GiB/sec
1.607 GiB/sec   -13.895 {'iterations': 86112, 'null_percent': 1.0}
   8SortToIndicesInt64Count/32768/10/min_time:1.0001.795 GiB/sec
1.337 GiB/sec   -25.485{'iterations': 81140, 'null_percent': 10.0}
   ```
   
   clang-8
   
   ```
  benchmark baseline
contender  change %   counters
   8 SortToIndicesInt64Count/32768/1/min_time:1.0001.582 GiB/sec
2.017 GiB/sec27.429{'iterations': 71240, 'null_percent': 0.01}
   4   SortToIndicesInt64Compare/32768/1/min_time:1.000  150.179 MiB/sec  
151.637 MiB/sec 0.971 {'iterations': 6727, 'null_percent': 0.01}
   1 SortToIndicesInt64Compare/32768/100/min_time:1.000  150.667 MiB/sec  
151.708 MiB/sec 0.691  {'iterations': 6799, 'null_percent': 1.0}
   15  SortToIndicesInt64Compare/32768/0/min_time:1.000  154.512 MiB/sec  
155.255 MiB/sec 0.481  {'iterations': 6960, 'null_percent': 0.0}
   14SortToIndicesInt64Compare/8388608/1/min_time:1.0004.288 GiB/sec
4.298 GiB/sec 0.216 {'iterations': 721, 'null_percent': 100.0}
   3   SortToIndicesInt64Compare/32768/1/min_time:1.0004.759 GiB/sec
4.765 GiB/sec 0.135  {'iterations': 219030, 'null_percent': 100.0}
   5   SortToIndicesInt64Count/1048576/1/min_time:1.0004.586 GiB/sec
4.589 GiB/sec 0.064{'iterations': 6584, 'null_percent': 100.0}
   6 SortToIndicesInt64Compare/1048576/1/min_time:1.0004.581 GiB/sec
4.581 GiB/sec-0.004{'iterations': 6543, 'null_percent': 100.0}
   11  SortToIndicesInt64Count/8388608/1/min_time:1.0004.296 GiB/sec
4.295 GiB/sec-0.014 {'iterations': 772, 'null_percent': 100.0}
   7   SortToIndicesInt64Compare/32768/2/min_time:1.000  255.324 MiB/sec  
253.514 MiB/sec-0.709{'iterations': 11541, 'null_percent': 50.0}
   2  SortToIndicesInt64Compare/32768/10/min_time:1.000  162.353 MiB/sec  
161.194 MiB/sec-0.714 {'iterations': 7318, 'null_percent': 10.0}
   9 SortToIndicesInt64Count/32768/1/min_time:1.0004.772 GiB/sec
4.737 GiB/sec-0.720  {'iterations': 216649, 'null_percent': 100.0}
   0 SortToIndicesInt64Count/32768/0/min_time:1.0002.156 GiB/sec
2.091 GiB/sec-3.013 {'iterations': 98848, 'null_percent': 0.0}
   12  

[GitHub] [arrow] wesm commented on pull request #7525: ARROW-9214: [C++] Use separate functions for valid/not-valid values in VisitArrayDataInline

2020-06-23 Thread GitBox


wesm commented on pull request #7525:
URL: https://github.com/apache/arrow/pull/7525#issuecomment-648279829


   Here are the sort benchmarks prior to the visitor_inline.h changes.
   
   gcc-8:
   
   ```
  benchmark baseline
contender  change %   counters
   2 SortToIndicesInt64Count/32768/1/min_time:1.0004.578 GiB/sec
5.269 GiB/sec15.096  {'iterations': 209972, 'null_percent': 100.0}
   0   SortToIndicesInt64Compare/32768/1/min_time:1.0004.604 GiB/sec
5.282 GiB/sec14.722  {'iterations': 212391, 'null_percent': 100.0}
   3 SortToIndicesInt64Compare/1048576/1/min_time:1.0004.596 GiB/sec
5.108 GiB/sec11.149{'iterations': 6619, 'null_percent': 100.0}
   9   SortToIndicesInt64Count/1048576/1/min_time:1.0004.612 GiB/sec
5.098 GiB/sec10.548{'iterations': 6632, 'null_percent': 100.0}
   11  SortToIndicesInt64Count/8388608/1/min_time:1.0004.331 GiB/sec
4.745 GiB/sec 9.561 {'iterations': 782, 'null_percent': 100.0}
   4 SortToIndicesInt64Compare/8388608/1/min_time:1.0004.336 GiB/sec
4.742 GiB/sec 9.357 {'iterations': 775, 'null_percent': 100.0}
   7 SortToIndicesInt64Count/32768/1/min_time:1.0001.922 GiB/sec
2.101 GiB/sec 9.299{'iterations': 87523, 'null_percent': 0.01}
   10SortToIndicesInt64Count/32768/2/min_time:1.000  836.548 MiB/sec  
862.251 MiB/sec 3.073{'iterations': 37646, 'null_percent': 50.0}
   15  SortToIndicesInt64Compare/32768/2/min_time:1.000  244.171 MiB/sec  
251.494 MiB/sec 2.999{'iterations': 10912, 'null_percent': 50.0}
   1  SortToIndicesInt64Compare/32768/10/min_time:1.000  157.085 MiB/sec  
159.112 MiB/sec 1.290 {'iterations': 7038, 'null_percent': 10.0}
   5   SortToIndicesInt64Compare/32768/0/min_time:1.000  151.208 MiB/sec  
152.825 MiB/sec 1.069  {'iterations': 6831, 'null_percent': 0.0}
   13  SortToIndicesInt64Compare/32768/1/min_time:1.000  149.768 MiB/sec  
150.096 MiB/sec 0.219 {'iterations': 6664, 'null_percent': 0.01}
   12SortToIndicesInt64Compare/32768/100/min_time:1.000  149.783 MiB/sec  
149.835 MiB/sec 0.035  {'iterations': 6716, 'null_percent': 1.0}
   6 SortToIndicesInt64Count/32768/0/min_time:1.0002.440 GiB/sec
2.317 GiB/sec-5.020{'iterations': 107181, 'null_percent': 0.0}
   14  SortToIndicesInt64Count/32768/100/min_time:1.0001.866 GiB/sec
1.607 GiB/sec   -13.895 {'iterations': 86112, 'null_percent': 1.0}
   8SortToIndicesInt64Count/32768/10/min_time:1.0001.795 GiB/sec
1.337 GiB/sec   -25.485{'iterations': 81140, 'null_percent': 10.0}
   ```
   
   clang-8
   
   ```
  benchmark baseline
contender  change %   counters
   8 SortToIndicesInt64Count/32768/1/min_time:1.0001.582 GiB/sec
2.017 GiB/sec27.429{'iterations': 71240, 'null_percent': 0.01}
   4   SortToIndicesInt64Compare/32768/1/min_time:1.000  150.179 MiB/sec  
151.637 MiB/sec 0.971 {'iterations': 6727, 'null_percent': 0.01}
   1 SortToIndicesInt64Compare/32768/100/min_time:1.000  150.667 MiB/sec  
151.708 MiB/sec 0.691  {'iterations': 6799, 'null_percent': 1.0}
   15  SortToIndicesInt64Compare/32768/0/min_time:1.000  154.512 MiB/sec  
155.255 MiB/sec 0.481  {'iterations': 6960, 'null_percent': 0.0}
   14SortToIndicesInt64Compare/8388608/1/min_time:1.0004.288 GiB/sec
4.298 GiB/sec 0.216 {'iterations': 721, 'null_percent': 100.0}
   3   SortToIndicesInt64Compare/32768/1/min_time:1.0004.759 GiB/sec
4.765 GiB/sec 0.135  {'iterations': 219030, 'null_percent': 100.0}
   5   SortToIndicesInt64Count/1048576/1/min_time:1.0004.586 GiB/sec
4.589 GiB/sec 0.064{'iterations': 6584, 'null_percent': 100.0}
   6 SortToIndicesInt64Compare/1048576/1/min_time:1.0004.581 GiB/sec
4.581 GiB/sec-0.004{'iterations': 6543, 'null_percent': 100.0}
   11  SortToIndicesInt64Count/8388608/1/min_time:1.0004.296 GiB/sec
4.295 GiB/sec-0.014 {'iterations': 772, 'null_percent': 100.0}
   7   SortToIndicesInt64Compare/32768/2/min_time:1.000  255.324 MiB/sec  
253.514 MiB/sec-0.709{'iterations': 11541, 'null_percent': 50.0}
   2  SortToIndicesInt64Compare/32768/10/min_time:1.000  162.353 MiB/sec  
161.194 MiB/sec-0.714 {'iterations': 7318, 'null_percent': 10.0}
   9 SortToIndicesInt64Count/32768/1/min_time:1.0004.772 GiB/sec
4.737 GiB/sec-0.720  {'iterations': 216649, 'null_percent': 100.0}
   0 SortToIndicesInt64Count/32768/0/min_time:1.0002.156 GiB/sec
2.091 GiB/sec-3.013 {'iterations': 98848, 'null_percent': 0.0}
   12  

[GitHub] [arrow] bkietz closed pull request #7513: ARROW-9207: [Python] Clean-up internal FileSource class

2020-06-23 Thread GitBox


bkietz closed pull request #7513:
URL: https://github.com/apache/arrow/pull/7513


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #7525: ARROW-9214: [C++] Use separate functions for valid/not-valid values in VisitArrayDataInline

2020-06-23 Thread GitBox


wesm commented on pull request #7525:
URL: https://github.com/apache/arrow/pull/7525#issuecomment-648270948


   OK I'm done twiddling this, here is the latest comparison of the hash 
benchmarks versus master with gcc-8:
   
   ```
 benchmark  baselinecontender  change % 
   counters
   35UniqueInt64/6 3.391 GiB/sec   21.874 GiB/sec   544.979 
 {'iterations': 77, 'null_percent': 100.0, 'num_unique': 100.0}
   22   UniqueInt64/13 3.407 GiB/sec   21.807 GiB/sec   540.146   
{'iterations': 77, 'null_percent': 100.0, 'num_unique': 10.0}
   33UniqueInt64/5 2.588 GiB/sec8.809 GiB/sec   240.359 
  {'iterations': 58, 'null_percent': 99.0, 'num_unique': 100.0}
   27   UniqueInt64/12 2.282 GiB/sec5.873 GiB/sec   157.319
{'iterations': 50, 'null_percent': 99.0, 'num_unique': 10.0}
   30UniqueUInt8/6 1.763 GiB/sec2.712 GiB/sec53.829 
{'iterations': 315, 'null_percent': 100.0, 'num_unique': 200.0}
   44  BuildDictionary   978.092 MiB/sec1.401 GiB/sec46.647 
{'iterations': 172, 'null_percent': 11.111047470674487}
   8UniqueInt64/11   613.864 MiB/sec  836.145 MiB/sec36.210
{'iterations': 13, 'null_percent': 50.0, 'num_unique': 10.0}
   2UniqueInt64/10   559.826 MiB/sec  743.501 MiB/sec32.809
{'iterations': 11, 'null_percent': 10.0, 'num_unique': 10.0}
   21UniqueUInt8/3   434.059 MiB/sec  553.941 MiB/sec27.619 
  {'iterations': 76, 'null_percent': 10.0, 'num_unique': 200.0}
   25UniqueUInt8/4   211.533 MiB/sec  251.887 MiB/sec19.077 
  {'iterations': 36, 'null_percent': 50.0, 'num_unique': 200.0}
   13UniqueUInt8/5 1.115 GiB/sec1.290 GiB/sec15.775 
 {'iterations': 199, 'null_percent': 99.0, 'num_unique': 200.0}
   5 UniqueUInt8/2   855.076 MiB/sec  977.493 MiB/sec14.316 
  {'iterations': 148, 'null_percent': 1.0, 'num_unique': 200.0}
   7 UniqueInt64/4 1.072 GiB/sec1.224 GiB/sec14.136 
  {'iterations': 24, 'null_percent': 50.0, 'num_unique': 100.0}
   43UniqueInt64/9   697.824 MiB/sec  744.400 MiB/sec 6.674 
{'iterations': 15, 'null_percent': 1.0, 'num_unique': 10.0}
   40UniqueString10bytes/0   889.829 MiB/sec  924.593 MiB/sec 3.907 
   {'iterations': 15, 'null_percent': 0.0, 'num_unique': 100.0}
   10UniqueInt64/3 1.490 GiB/sec1.549 GiB/sec 3.907 
  {'iterations': 34, 'null_percent': 10.0, 'num_unique': 100.0}
   18UniqueString10bytes/1   862.556 MiB/sec  890.677 MiB/sec 3.260 
   {'iterations': 15, 'null_percent': 0.1, 'num_unique': 100.0}
   41   UniqueString100bytes/9 1.023 GiB/sec1.051 GiB/sec 2.746 
 {'iterations': 2, 'null_percent': 1.0, 'num_unique': 10.0}
   38  UniqueString100bytes/11 1.941 GiB/sec1.982 GiB/sec 2.152 
{'iterations': 3, 'null_percent': 50.0, 'num_unique': 10.0}
   29   UniqueString100bytes/8 1.021 GiB/sec1.037 GiB/sec 1.487 
 {'iterations': 2, 'null_percent': 0.1, 'num_unique': 10.0}
   34  UniqueString100bytes/10 1.109 GiB/sec1.126 GiB/sec 1.484 
{'iterations': 2, 'null_percent': 10.0, 'num_unique': 10.0}
   3UniqueString100bytes/1 4.719 GiB/sec4.775 GiB/sec 1.190 
{'iterations': 8, 'null_percent': 0.1, 'num_unique': 100.0}
   46UniqueString10bytes/2   855.155 MiB/sec  862.082 MiB/sec 0.810 
   {'iterations': 15, 'null_percent': 1.0, 'num_unique': 100.0}
   6UniqueString100bytes/0 4.769 GiB/sec4.799 GiB/sec 0.635 
{'iterations': 9, 'null_percent': 0.0, 'num_unique': 100.0}
   45   UniqueString100bytes/2 4.735 GiB/sec4.759 GiB/sec 0.515 
{'iterations': 8, 'null_percent': 1.0, 'num_unique': 100.0}
   11UniqueInt64/2 1.812 GiB/sec1.812 GiB/sec-0.018 
   {'iterations': 41, 'null_percent': 1.0, 'num_unique': 100.0}
   14   UniqueString10bytes/10   332.210 MiB/sec  331.356 MiB/sec-0.257 
{'iterations': 6, 'null_percent': 10.0, 'num_unique': 10.0}
   48UniqueInt64/0 2.123 GiB/sec2.115 GiB/sec-0.357 
   {'iterations': 47, 'null_percent': 0.0, 'num_unique': 100.0}
   15UniqueString10bytes/9   311.568 MiB/sec  310.197 MiB/sec-0.440 
 {'iterations': 5, 'null_percent': 1.0, 'num_unique': 10.0}
   32UniqueString10bytes/8   312.322 MiB/sec  310.270 MiB/sec-0.657 
 {'iterations': 5, 'null_percent': 0.1, 'num_unique': 10.0}
   47UniqueInt64/1 1.987 GiB/sec1.972 GiB/sec-0.750 
   {'iterations': 44, 'null_percent': 0.1, 'num_unique': 100.0}
   42UniqueString10bytes/7   315.974 MiB/sec  312.873 MiB/sec-0.981 
 {'iterations': 6, 'null_percent': 0.0, 

[GitHub] [arrow] rjzamora edited a comment on pull request #7523: ARROW-8733: [Python][Dataset] Expose statistics of ParquetFileFragment::RowGroupInfo

2020-06-23 Thread GitBox


rjzamora edited a comment on pull request #7523:
URL: https://github.com/apache/arrow/pull/7523#issuecomment-648269136


   Thanks for working on this @jorisvandenbossche !
   
   This does seem like the functionality needed by Dask.  To test my 
understanding (and for the sake of discussion), I am imagining something 
(roughly) like the following in Dask to collect row-group statistics (note that 
I am using pyarrow-0.17.1 from conda, so the `get_row_group_fragments` call 
would be replaced):
   
   ```python
   from collections import defaultdict
   import json
   import pandas as pd
   import pyarrow.dataset as pds
   from string import ascii_lowercase as letters
   
   path = "simple.pdf"
    df0 = pd.DataFrame(
        {"x": range(26), "myindex": list(letters)}
    ).set_index("myindex")
    df0.to_parquet(path, engine="pyarrow", row_group_size=10)
    ds = pds.dataset(path)

    # Need index_cols to be specified by user or encoded in
    # the "pandas" metadata. Otherwise, we will not bother
    # to infer an index column (and won't need statistics).
    index_cols = json.loads(
        ds.schema.metadata[b"pandas"].decode("utf8")
    )["index_columns"]
    filter = None  # Some user-defined filter

    # Collect path and statistics for each row-group
    metadata = defaultdict(list)
    for file_frag in ds.get_fragments(filter=filter):
        for rg_frag in file_frag.get_row_group_fragments():
            for rg in rg_frag.row_groups:
                stats = ds.get_min_max_statistics(rg.statistics)
                metadata[rg_frag.path].append((rg.id, stats))
   ```
   
   In this case, the resulting `metadata` object would be something like:
   ```
   defaultdict(list,
   {'simple.pdf': [(0, ),
 (1, ),
 (2, )]})
   ```



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] rjzamora commented on pull request #7523: ARROW-8733: [Python][Dataset] Expose statistics of ParquetFileFragment::RowGroupInfo

2020-06-23 Thread GitBox


rjzamora commented on pull request #7523:
URL: https://github.com/apache/arrow/pull/7523#issuecomment-648269136


   Thanks for working on this @jorisvandenbossche !
   
   This does seem like the functionality needed by Dask.  To test my 
understanding (and for the sake of discussion), I am imagining something 
(roughly) like the following in Dask to collect row-group statistics (note that 
I am using pyarrow-0.17.1 from conda, so the `get_row_group_fragments` call 
would be replaced):
   
   ```python
   from collections import defaultdict
   import json
   import pandas as pd
    import pyarrow.dataset as pds
   from string import ascii_lowercase as letters
   
   path = "simple.pdf"
    df0 = pd.DataFrame(
        {"x": range(26), "myindex": list(letters)}
    ).set_index("myindex")
    df0.to_parquet(path, engine="pyarrow", row_group_size=10)
    ds = pds.dataset(path)

    # Need index_cols to be specified by user or encoded in
    # the "pandas" metadata. Otherwise, we will not bother
    # to infer an index column (and won't need statistics).
    index_cols = json.loads(
        ds.schema.metadata[b"pandas"].decode("utf8")
    )["index_columns"]
    filter = None  # Some user-defined filter

    # Collect path and statistics for each row-group
    metadata = defaultdict(list)
    for file_frag in ds.get_fragments(filter=filter):
        for rg_frag in file_frag.get_row_group_fragments():
            for rg in rg_frag.row_groups:
                stats = ds.get_min_max_statistics(rg.statistics)
                metadata[rg_frag.path].append((rg.id, stats))
   ```
   
   In this case, the resulting `metadata` object would be something like:
   ```
   defaultdict(list,
   {'simple.pdf': [(0, ),
 (1, ),
 (2, )]})
   ```



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nealrichardson commented on a change in pull request #7520: ARROW-9189: [Website] Improve contributor guide

2020-06-23 Thread GitBox


nealrichardson commented on a change in pull request #7520:
URL: https://github.com/apache/arrow/pull/7520#discussion_r444333447



##
File path: docs/source/developers/contributing.rst
##
@@ -124,29 +181,72 @@ To contribute a patch:
   `ARROW-767: [C++] Filesystem abstraction 
`_).
 * Make sure that your code passes the unit tests. You can find instructions how
   to run the unit tests for each Arrow component in its respective README file.
+
+Core developers and others with a stake in the part of the project your change
+affects will review, request changes, and hopefully indicate their approval
+in the end. To make the review process smooth for everyone, try to
+
+* Break your work into small, single-purpose patches if possible. It’s much
+  harder to merge in a large change with a lot of disjoint features, and
+  particularly if you're new to the project, smaller changes are much easier
+  for maintainers to accept.
 * Add new unit tests for your code.
+* Follow the style guides for the part(s) of the project you're modifying.
+  Some languages (C++, Python, and Rust, for example) run a lint check in
+  continuous integration. For all languages, see their respective developer
+  documentation and READMEs for style guidance. In general, try to make it look
+  as if the codebase has a single author, and emulate any conventions you see,
+  whether or not they are officially documented or checked.
+
+When tests are passing and the pull request has been approved by the interested
+parties, a committer will merge the pull request. This is done with a
+command-line utility that does a squash merge, so all of your commits will be
+registered as a single commit to the master branch; this simplifies the
+connection between JIRA issues and commits, and it makes it easier to bisect
+history to identify where changes were introduced. A side effect of this way of
+merging is that your pull request will appear in the GitHub interface to have
+been "closed without merge". Do not be alarmed: if you look at the bottom, you
+will see a message that says "@user closed this in $COMMIT".
+
+Local git conventions
++
+
+If you are tracking the Arrow source repository locally, here are some tips
+for using ``git``.
+
+All Arrow contributors work off of their personal fork of ``apache/arrow``
+and submit pull requests "upstream". Once you've cloned your fork of Arrow,
+be sure to::
+
+$ git remote add upstream https://github.com/apache/arrow
+
+to set the "upstream" repository.
+
+You are encouraged to develop on branches, rather than your own "master" 
branch,
+and it helps to keep your fork's master branch synced with ``upstream/master``.
 
-Thank you in advance for your contributions!
+To start a new branch, pull the latest from upstream first::
 
-Common Git conventions followed within the project
---
+   $ git fetch upstream
+   $ git checkout master
+   $ git reset --hard upstream/master
+   $ git checkout -b $NEW_BRANCH_NAME
 
-If you are tracking the Arrow source repository locally, following some common 
Git
-conventions would make everyone's workflow compatible.  These recommendations 
along with
-their rationale are outlined below.
+It does not matter what you call your branch. Some people like to use the JIRA
+number as branch name, others use descriptive names.
 
-It is strongly discouraged to use a regular ``git merge``, as a linear commit 
history is
-prefered by the project.  It is much easier to maintain, and makes for easier
-``cherry-picking`` of features; useful for backporting fixes to maintenance 
releases.
+Once you have a branch going, you should sync with ``upstream/master``
+regularly, as many commits merge to master every day.
+It is recommended to use ``git rebase`` rather than ``git merge``.

Review comment:
   Honestly I don't think it matters what people do on their branches since 
we squash merge in the end, right?
   
   The last thing I want out of this document is to spark a flame war between 
`rebase` and `merge` partisans. I wouldn't have added this section myself but 
there was a JIRA requesting it and someone else added it this weekend.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nealrichardson commented on a change in pull request #7520: ARROW-9189: [Website] Improve contributor guide

2020-06-23 Thread GitBox


nealrichardson commented on a change in pull request #7520:
URL: https://github.com/apache/arrow/pull/7520#discussion_r444330998



##
File path: docs/source/developers/contributing.rst
##
@@ -124,29 +181,72 @@ To contribute a patch:
   `ARROW-767: [C++] Filesystem abstraction 
`_).
 * Make sure that your code passes the unit tests. You can find instructions how
   to run the unit tests for each Arrow component in its respective README file.
+
+Core developers and others with a stake in the part of the project your change
+affects will review, request changes, and hopefully indicate their approval
+in the end. To make the review process smooth for everyone, try to
+
+* Break your work into small, single-purpose patches if possible. It’s much
+  harder to merge in a large change with a lot of disjoint features, and
+  particularly if you're new to the project, smaller changes are much easier
+  for maintainers to accept.
 * Add new unit tests for your code.
+* Follow the style guides for the part(s) of the project you're modifying.
+  Some languages (C++, Python, and Rust, for example) run a lint check in
+  continuous integration. For all languages, see their respective developer
+  documentation and READMEs for style guidance. In general, try to make it look
+  as if the codebase has a single author, and emulate any conventions you see,
+  whether or not they are officially documented or checked.
+
+When tests are passing and the pull request has been approved by the interested
+parties, a committer will merge the pull request. This is done with a
+command-line utility that does a squash merge, so all of your commits will be
+registered as a single commit to the master branch; this simplifies the
+connection between JIRA issues and commits, and it makes it easier to bisect
+history to identify where changes were introduced. A side effect of this way of
+merging is that your pull request will appear in the GitHub interface to have
+been "closed without merge". Do not be alarmed: if you look at the bottom, you
+will see a message that says "@user closed this in $COMMIT".
+
+Local git conventions
++
+
+If you are tracking the Arrow source repository locally, here are some tips
+for using ``git``.
+
+All Arrow contributors work off of their personal fork of ``apache/arrow``
+and submit pull requests "upstream". Once you've cloned your fork of Arrow,
+be sure to::
+
+$ git remote add upstream https://github.com/apache/arrow
+
+to set the "upstream" repository.
+
+You are encouraged to develop on branches, rather than your own "master" 
branch,
+and it helps to keep your fork's master branch synced with ``upstream/master``.
 
-Thank you in advance for your contributions!
+To start a new branch, pull the latest from upstream first::
 
-Common Git conventions followed within the project
---
+   $ git fetch upstream
+   $ git checkout master
+   $ git reset --hard upstream/master

Review comment:
   Learn something new every day. 
   
   I generally treat `origin/master` as disposable and wish GitHub had a way to 
keep it automatically in sync with `upstream/master`, but that's neither here 
nor there.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] bkietz commented on pull request #7493: ARROW-9183: [C++] Fix build with clang & old libstdc++.

2020-06-23 Thread GitBox


bkietz commented on pull request #7493:
URL: https://github.com/apache/arrow/pull/7493#issuecomment-648252136


   Hmm, there's a failure building with GCC 4.8: 
https://github.com/apache/arrow/pull/7493/checks?check_run_id=791725319#step:9:534
   The `#ifdef` condition seems to be failing to detect a too-old libstdc++.
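
   For what it's worth, here is a standalone sketch of one way to distinguish a pre-GCC-7 libstdc++; the `HAS_OLD_LIBSTDCXX` macro name is hypothetical, for illustration only:

   ```cpp
   // Any standard header pulls in <bits/c++config.h>, which defines the
   // libstdc++ version macros.
   #include <cstdio>

   // _GLIBCXX_RELEASE was introduced with GCC 7, so an old libstdc++ (such as
   // the one shipped with GCC 4.8) defines only the date-stamp macro __GLIBCXX__.
   #if defined(__GLIBCXX__) && !defined(_GLIBCXX_RELEASE)
   #define HAS_OLD_LIBSTDCXX 1
   #endif

   int main() {
   #ifdef HAS_OLD_LIBSTDCXX
     std::puts("pre-GCC-7 libstdc++ detected");
   #else
     std::puts("modern libstdc++ (or a different standard library)");
   #endif
     return 0;
   }
   ```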



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] alippai commented on pull request #7517: ARROW-1682: [Doc] Expand S3/MinIO fileystem dataset documentation

2020-06-23 Thread GitBox


alippai commented on pull request #7517:
URL: https://github.com/apache/arrow/pull/7517#issuecomment-648247832


   Thanks, now I understand. So the pairing with toxiproxy is for the testing :))
   That's what you wrote; I just misunderstood.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nealrichardson commented on a change in pull request #7520: ARROW-9189: [Website] Improve contributor guide

2020-06-23 Thread GitBox


nealrichardson commented on a change in pull request #7520:
URL: https://github.com/apache/arrow/pull/7520#discussion_r444322251



##
File path: docs/source/developers/contributing.rst
##
@@ -76,46 +96,83 @@ visibility. They may add a "Fix version" to indicate that 
they're considering
 it for inclusion in the next release, though adding that tag is not a
 commitment that it will be done in the next release.
 
-Advanced use
-
-
-Once you are involved in the project and want to do more on JIRA, such as
-assign yourself an issue, you will need "Contributor" permissions on the
-Apache Arrow JIRA. To get this role, ask on the mailing list for a project
-maintainer's help.
-
-GitHub issues
--
-
-We support `GitHub issues `_ as a
-lightweight way to ask questions and engage with
-the Arrow developer community. We use JIRA for maintaining a queue of
-development work and as the public record for work on the project. So, feel
-free to open GitHub issues, but bugs and feature requests will eventually need
-to end up in JIRA, either before or after completing a pull request. Don't be
-surprised if you are immediately asked by a project maintainer to open a JIRA
-issue.
-
-How to contribute patches
-=
-
-We prefer to receive contributions in the form of GitHub pull requests. Please
-send pull requests against the `github.com/apache/arrow
-`_ repository following the procedure below.
-
-If you are looking for some ideas on what to contribute, check out the JIRA
-issues for the Apache Arrow project. Comment on the issue and/or contact
-d...@arrow.apache.org with your questions and ideas.
-
-If you’d like to report a bug but don’t have time to fix it, you can still post
-it on JIRA, or email the mailing list d...@arrow.apache.org.
+Tips for successful bug reports

+
+No one likes having bugs in their software, and in an ideal world, all bugs
+would get fixed as soon as they were reported. However, time and attention are
+finite, especially in an open-source project where most contributors are
+participating in their spare time. In order for your bug to get prompt
+attention, there are things you can do to make it easier for contributors to
+reproduce and fix it.
+
+When you're reporting a bug, please help us understand the issue by providing,
+to the best of your ability,
+
+* Clear, minimal steps to reproduce the issue, with as few non-Arrow
+  dependencies as possible. If there's a problem on reading a file, try to
+  provide as small of an example file as possible, or code to create one.
+  If your bug report says "it crashes trying to read my file, but I can't
+  share it with you," it's really hard for us to debug.
+* Any relevant operating system, language, and library version information
+* If it isn't obvious, clearly state the expected behavior and what actually
+  happened.
+
+If a developer can't get a failing unit test, they won't be able to know that
+the issue has been identified, and they won't know when it has been fixed.
+Try to anticipate the questions you might be asked by someone working to
+understand the issue and provide those supporting details up front.
+

Review comment:
   Clearly there is lots of prior art out there on how to write a good bug 
report. I hesitated to add links for the reason you mentioned, but I'll see if 
I can collect a few links that say the same things but in different 
(programming) languages.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] fsaintjacques commented on pull request #7517: ARROW-1682: [Doc] Expand S3/MinIO fileystem dataset documentation

2020-06-23 Thread GitBox


fsaintjacques commented on pull request #7517:
URL: https://github.com/apache/arrow/pull/7517#issuecomment-648244980


   I can't comment on the production quality of MinIO since I've never used it 
in such a scenario. I meant this as a reference for other developers who want to 
test the S3 bindings without having to use an actual S3 bucket. For example, I 
have a dataset stored locally. When I want to test the difference between local 
and cloud storage, I spin up a MinIO instance pointing to the same local directory 
and benchmark accordingly. toxiproxy is used to introduce latency and bandwidth 
limits, mimicking S3.
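
   As a concrete sketch of that setup (the keys, port, and directory layout below are made up, and the `S3FileSystem` options are assumptions about the current Python bindings):

   ```python
   # Serve a local directory over the S3 protocol first (run in a shell);
   # each subdirectory of the served path shows up as a bucket:
   #   MINIO_ACCESS_KEY=minio MINIO_SECRET_KEY=minio123 \
   #       minio server /path/to/local
   from pyarrow import fs

   # Point pyarrow's S3 filesystem at the local MinIO endpoint.
   s3 = fs.S3FileSystem(
       access_key="minio",
       secret_key="minio123",
       endpoint_override="localhost:9000",
       scheme="http",
   )

   # List the "dataset" bucket, i.e. /path/to/local/dataset on disk.
   print(s3.get_file_info(fs.FileSelector("dataset", recursive=True)))
   ```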



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nealrichardson commented on a change in pull request #7520: ARROW-9189: [Website] Improve contributor guide

2020-06-23 Thread GitBox


nealrichardson commented on a change in pull request #7520:
URL: https://github.com/apache/arrow/pull/7520#discussion_r444320449



##
File path: docs/source/developers/contributing.rst
##
@@ -168,11 +274,15 @@ remote repo still holds the old history, you would need 
to do a force push. ::
 look at your updates, please ensure you comment on the PR on GitHub as simply 
force
 pushing does not trigger a notification in the GitHub user interface.
 
-Simplifying ``rebase``
-++
+Also, once you have a pull request up, be sure you pull from ``origin``
+before rebasing and force-pushing. Arrow maintainers can push commits directly
+to your branch, which they sometimes do to help move a pull request along.
+In addition, the GitHub PR "suggestion" feature can also add commits to
+your branch, so it is possible that your local copy of your branch is missing
+some additions.

Review comment:
   FWIW I too like the suggestion feature: I find it makes for a better 
experience as the person being reviewed, because trivial or pedantic suggestions 
can just be accepted. In any case, I agree with Joris that I don't read this 
paragraph as an endorsement, just an acknowledgement of a way that `origin` may 
have commits that you don't have locally, even though it is "your" branch.
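
   For example, a sketch of that sync-before-rebase sequence (`$BRANCH_NAME` is a placeholder):

   ```
   $ git pull origin $BRANCH_NAME    # pick up commits pushed to your PR branch
   $ git fetch upstream
   $ git rebase upstream/master
   $ git push --force-with-lease origin $BRANCH_NAME
   ```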





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nealrichardson commented on a change in pull request #7520: ARROW-9189: [Website] Improve contributor guide

2020-06-23 Thread GitBox


nealrichardson commented on a change in pull request #7520:
URL: https://github.com/apache/arrow/pull/7520#discussion_r444318497



##
File path: docs/source/developers/contributing.rst
##
@@ -124,29 +181,72 @@ To contribute a patch:
   `ARROW-767: [C++] Filesystem abstraction 
`_).
 * Make sure that your code passes the unit tests. You can find instructions how
   to run the unit tests for each Arrow component in its respective README file.
+
+Core developers and others with a stake in the part of the project your change
+affects will review, request changes, and hopefully indicate their approval
+in the end. To make the review process smooth for everyone, try to
+
+* Break your work into small, single-purpose patches if possible. It’s much
+  harder to merge in a large change with a lot of disjoint features, and
+  particularly if you're new to the project, smaller changes are much easier
+  for maintainers to accept.
 * Add new unit tests for your code.
+* Follow the style guides for the part(s) of the project you're modifying.
+  Some languages (C++, Python, and Rust, for example) run a lint check in
+  continuous integration. For all languages, see their respective developer
+  documentation and READMEs for style guidance. In general, try to make it look
+  as if the codebase has a single author, and emulate any conventions you see,
+  whether or not they are officially documented or checked.
+
+When tests are passing and the pull request has been approved by the interested
+parties, a committer will merge the pull request. This is done with a
+command-line utility that does a squash merge, so all of your commits will be
+registered as a single commit to the master branch; this simplifies the
+connection between JIRA issues and commits, and it makes it easier to bisect
+history to identify where changes were introduced. A side effect of this way of
+merging is that your pull request will appear in the GitHub interface to have
+been "closed without merge". Do not be alarmed: if you look at the bottom, you
+will see a message that says "@user closed this in $COMMIT".
+
+Local git conventions
++
+
+If you are tracking the Arrow source repository locally, here are some tips
+for using ``git``.
+
+All Arrow contributors work off of their personal fork of ``apache/arrow``
+and submit pull requests "upstream". Once you've cloned your fork of Arrow,
+be sure to::
+
+$ git remote add upstream https://github.com/apache/arrow
+
+to set the "upstream" repository.
+
+You are encouraged to develop on branches, rather than your own "master" 
branch,
+and it helps to keep your fork's master branch synced with ``upstream/master``.
 
-Thank you in advance for your contributions!
+To start a new branch, pull the latest from upstream first::
 
-Common Git conventions followed within the project
---
+   $ git fetch upstream
+   $ git checkout master
+   $ git reset --hard upstream/master
+   $ git checkout -b $NEW_BRANCH_NAME
 
-If you are tracking the Arrow source repository locally, following some common 
Git
-conventions would make everyone's workflow compatible.  These recommendations 
along with
-their rationale are outlined below.
+It does not matter what you call your branch. Some people like to use the JIRA
+number as branch name, others use descriptive names.
 
-It is strongly discouraged to use a regular ``git merge``, as a linear commit 
history is
-prefered by the project.  It is much easier to maintain, and makes for easier
-``cherry-picking`` of features; useful for backporting fixes to maintenance 
releases.
+Once you have a branch going, you should sync with ``upstream/master``
+regularly, as many commits merge to master every day.
+It is recommended to use ``git rebase`` rather than ``git merge``.
 To sync your local copy of a branch, you may do the following::
 
 $ git pull upstream branch --rebase

Review comment:
   IDK personally, and IMO it's hard to give `git` advice because everyone 
has slightly different workflows that work for them





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nealrichardson commented on a change in pull request #7524: ARROW-8899 [R] Add R metadata like pandas metadata for round-trip fidelity

2020-06-23 Thread GitBox


nealrichardson commented on a change in pull request #7524:
URL: https://github.com/apache/arrow/pull/7524#discussion_r444311774



##
File path: r/R/table.R
##
@@ -202,7 +210,27 @@ Table$create <- function(..., schema = NULL) {
 
 #' @export
 as.data.frame.Table <- function(x, row.names = NULL, optional = FALSE, ...) {
-  Table__to_dataframe(x, use_threads = option_use_threads())
+  df <- Table__to_dataframe(x, use_threads = option_use_threads())
+
+  if (!is.null(r_metadata <- x$metadata$r)) {
+    r_metadata <- .arrow_unserialize(r_metadata)
+
+    df_metadata <- r_metadata[[1L]]

Review comment:
   IMO using a named list would be clearer than relying on position in an 
unnamed list

##
File path: r/src/table.cpp
##
@@ -172,11 +172,20 @@ std::shared_ptr Table__from_dots(SEXP lst, 
SEXP schema_sxp) {
   std::shared_ptr schema;
 
   if (Rf_isNull(schema_sxp)) {
-    // infer the schema from the ...
+    // infer the schema from the `...`

Review comment:
   I wonder if this whole block of code should be factored out so it can be 
used whether you're creating a Table or a RecordBatch. It's the same either 
way: return a Schema corresponding to the data given.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #7525: ARROW-9214: [C++] Use separate functions for valid/not-valid values in VisitArrayDataInline

2020-06-23 Thread GitBox


github-actions[bot] commented on pull request #7525:
URL: https://github.com/apache/arrow/pull/7525#issuecomment-648240180


   https://issues.apache.org/jira/browse/ARROW-9214



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #7525: ARROW-9214: [C++] Use separate functions for valid/not-valid values in VisitArrayDataInline

2020-06-23 Thread GitBox


wesm commented on pull request #7525:
URL: https://github.com/apache/arrow/pull/7525#issuecomment-648238512


   Here are some vector-hash benchmarks comparing this branch with master. The 
performance "regressions" are in the 99%-100% null cases; I'll take a quick 
look at these in the implementation, but these perf changes don't concern me.
   
   ```
   $ archery benchmark diff --cc=gcc-8 --cxx=g++-8 ARROW-9214 master 
--suite-filter=vector-hash
 benchmark  baselinecontender  change % 
   counters
   35   UniqueInt64/13 3.377 GiB/sec   14.983 GiB/sec   343.678   
{'iterations': 76, 'null_percent': 100.0, 'num_unique': 10.0}
   23UniqueInt64/6 3.401 GiB/sec   14.871 GiB/sec   337.243 
 {'iterations': 76, 'null_percent': 100.0, 'num_unique': 100.0}
   42UniqueInt64/5 2.642 GiB/sec7.320 GiB/sec   177.036 
  {'iterations': 57, 'null_percent': 99.0, 'num_unique': 100.0}
   27   UniqueInt64/12 2.268 GiB/sec5.175 GiB/sec   128.206
{'iterations': 50, 'null_percent': 99.0, 'num_unique': 10.0}
   33UniqueInt64/0 2.121 GiB/sec4.793 GiB/sec   125.974 
   {'iterations': 48, 'null_percent': 0.0, 'num_unique': 100.0}
   31UniqueInt64/1 1.994 GiB/sec4.212 GiB/sec   111.160 
   {'iterations': 44, 'null_percent': 0.1, 'num_unique': 100.0}
   11   UniqueInt64/10   565.967 MiB/sec1.054 GiB/sec90.635
{'iterations': 12, 'null_percent': 10.0, 'num_unique': 10.0}
   44UniqueInt64/2 1.810 GiB/sec3.215 GiB/sec77.589 
   {'iterations': 41, 'null_percent': 1.0, 'num_unique': 100.0}
   34UniqueInt64/9   679.320 MiB/sec1.113 GiB/sec67.808 
{'iterations': 14, 'null_percent': 1.0, 'num_unique': 10.0}
   32   UniqueInt64/11   618.468 MiB/sec  959.261 MiB/sec55.103
{'iterations': 12, 'null_percent': 50.0, 'num_unique': 10.0}
   3 UniqueInt64/8   785.346 MiB/sec1.186 GiB/sec54.663 
{'iterations': 18, 'null_percent': 0.1, 'num_unique': 10.0}
   7 UniqueInt64/7   846.630 MiB/sec1.241 GiB/sec50.088 
{'iterations': 18, 'null_percent': 0.0, 'num_unique': 10.0}
   8 UniqueInt64/3 1.510 GiB/sec2.212 GiB/sec46.508 
  {'iterations': 34, 'null_percent': 10.0, 'num_unique': 100.0}
   28UniqueInt64/4 1.066 GiB/sec1.398 GiB/sec31.131 
  {'iterations': 24, 'null_percent': 50.0, 'num_unique': 100.0}
   24  BuildDictionary  1014.450 MiB/sec1.282 GiB/sec29.367 
{'iterations': 177, 'null_percent': 11.111047470674487}
   19UniqueUInt8/1 1.322 GiB/sec1.453 GiB/sec 9.968 
  {'iterations': 241, 'null_percent': 0.1, 'num_unique': 200.0}
   12UniqueUInt8/3   434.316 MiB/sec  443.925 MiB/sec 2.212 
  {'iterations': 76, 'null_percent': 10.0, 'num_unique': 200.0}
   30   UniqueString100bytes/8 1.025 GiB/sec1.043 GiB/sec 1.705 
 {'iterations': 2, 'null_percent': 0.1, 'num_unique': 10.0}
   22UniqueUInt8/2   837.222 MiB/sec  848.342 MiB/sec 1.328 
  {'iterations': 145, 'null_percent': 1.0, 'num_unique': 200.0}
   29UniqueUInt8/6 1.759 GiB/sec1.778 GiB/sec 1.065 
{'iterations': 316, 'null_percent': 100.0, 'num_unique': 200.0}
   47   UniqueString100bytes/1 4.643 GiB/sec4.661 GiB/sec 0.373 
{'iterations': 8, 'null_percent': 0.1, 'num_unique': 100.0}
   25   UniqueString100bytes/2 4.654 GiB/sec4.669 GiB/sec 0.326 
{'iterations': 8, 'null_percent': 1.0, 'num_unique': 100.0}
   6 UniqueString10bytes/8   311.452 MiB/sec  310.589 MiB/sec-0.277 
 {'iterations': 5, 'null_percent': 0.1, 'num_unique': 10.0}
   43UniqueString10bytes/7   316.291 MiB/sec  315.246 MiB/sec-0.330 
 {'iterations': 5, 'null_percent': 0.0, 'num_unique': 10.0}
   37  UniqueString100bytes/11 1.883 GiB/sec1.876 GiB/sec-0.402 
{'iterations': 3, 'null_percent': 50.0, 'num_unique': 10.0}
   15  UniqueString100bytes/10 1.130 GiB/sec1.125 GiB/sec-0.407 
{'iterations': 2, 'null_percent': 10.0, 'num_unique': 10.0}
   48   UniqueString10bytes/10   333.843 MiB/sec  331.318 MiB/sec-0.756 
{'iterations': 6, 'null_percent': 10.0, 'num_unique': 10.0}
   2 UniqueString10bytes/9   313.340 MiB/sec  310.886 MiB/sec-0.783 
 {'iterations': 5, 'null_percent': 1.0, 'num_unique': 10.0}
   17UniqueUInt8/4   210.495 MiB/sec  208.540 MiB/sec-0.929 
  {'iterations': 37, 'null_percent': 50.0, 'num_unique': 200.0}
   36   UniqueString100bytes/7 1.038 GiB/sec1.029 GiB/sec-0.950 
 {'iterations': 2, 'null_percent': 0.0, 'num_unique': 10.0}
   20   UniqueString100bytes/0 4.733 GiB/sec  

[GitHub] [arrow] wesm commented on pull request #7525: ARROW-9214: [C++] Use separate functions for valid/not-valid values in VisitArrayDataInline

2020-06-23 Thread GitBox


wesm commented on pull request #7525:
URL: https://github.com/apache/arrow/pull/7525#issuecomment-648235922


   Here's what I see in the sort benchmarks with this patch compared with 
7ed698b94, the patch right before the visitor_inline.h changes:
   
   ```
  benchmark baseline
contender  change %   counters
   14SortToIndicesInt64Count/32768/1/min_time:1.0001.604 GiB/sec
1.888 GiB/sec17.699{'iterations': 72619, 'null_percent': 0.01}
   3   SortToIndicesInt64Compare/32768/1/min_time:1.0003.926 GiB/sec
4.186 GiB/sec 6.625  {'iterations': 187121, 'null_percent': 100.0}
   8 SortToIndicesInt64Compare/1048576/1/min_time:1.0005.012 GiB/sec
5.182 GiB/sec 3.399{'iterations': 7178, 'null_percent': 100.0}
   1   SortToIndicesInt64Count/8388608/1/min_time:1.0004.460 GiB/sec
4.588 GiB/sec 2.886 {'iterations': 803, 'null_percent': 100.0}
   9 SortToIndicesInt64Count/32768/1/min_time:1.0003.937 GiB/sec
4.047 GiB/sec 2.809  {'iterations': 180351, 'null_percent': 100.0}
   12  SortToIndicesInt64Count/1048576/1/min_time:1.0004.953 GiB/sec
5.082 GiB/sec 2.617{'iterations': 7119, 'null_percent': 100.0}
   10SortToIndicesInt64Compare/8388608/1/min_time:1.0004.496 GiB/sec
4.610 GiB/sec 2.533 {'iterations': 808, 'null_percent': 100.0}
   4   SortToIndicesInt64Compare/32768/0/min_time:1.000  141.060 MiB/sec  
143.991 MiB/sec 2.078  {'iterations': 6473, 'null_percent': 0.0}
   15  SortToIndicesInt64Compare/32768/2/min_time:1.000  231.389 MiB/sec  
232.690 MiB/sec 0.562{'iterations': 10364, 'null_percent': 50.0}
   5   SortToIndicesInt64Compare/32768/1/min_time:1.000  140.567 MiB/sec  
140.746 MiB/sec 0.127 {'iterations': 6253, 'null_percent': 0.01}
   13SortToIndicesInt64Count/32768/2/min_time:1.000  765.674 MiB/sec  
761.830 MiB/sec-0.502{'iterations': 34058, 'null_percent': 50.0}
   7 SortToIndicesInt64Compare/32768/100/min_time:1.000  141.645 MiB/sec  
139.367 MiB/sec-1.609  {'iterations': 6318, 'null_percent': 1.0}
   11 SortToIndicesInt64Compare/32768/10/min_time:1.000  150.378 MiB/sec  
143.670 MiB/sec-4.461 {'iterations': 6717, 'null_percent': 10.0}
   2 SortToIndicesInt64Count/32768/0/min_time:1.0002.139 GiB/sec
2.042 GiB/sec-4.574 {'iterations': 96456, 'null_percent': 0.0}
   6   SortToIndicesInt64Count/32768/100/min_time:1.0001.571 GiB/sec
1.343 GiB/sec   -14.525 {'iterations': 71534, 'null_percent': 1.0}
   0SortToIndicesInt64Count/32768/10/min_time:1.0001.532 GiB/sec
1.098 GiB/sec   -28.306{'iterations': 69953, 'null_percent': 10.0}
   ```



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm opened a new pull request #7525: ARROW-9214: [C++] Use separate functions for valid/not-valid values in VisitArrayDataInline

2020-06-23 Thread GitBox


wesm opened a new pull request #7525:
URL: https://github.com/apache/arrow/pull/7525


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] romainfrancois commented on a change in pull request #7514: ARROW-6235: [R] Implement conversion from arrow::BinaryArray to R character vector

2020-06-23 Thread GitBox


romainfrancois commented on a change in pull request #7514:
URL: https://github.com/apache/arrow/pull/7514#discussion_r444308097



##
File path: r/src/array_from_vector.cpp
##
@@ -1067,12 +1110,22 @@ std::shared_ptr 
InferArrowTypeFromVector(SEXP x) {
   if (Rf_inherits(x, "data.frame")) {
 return InferArrowTypeFromDataFrame(x);
   } else {
-    if (XLENGTH(x) == 0) {
-      Rcpp::stop(
-          "Requires at least one element to infer the values' type of a list vector");
-    }
+    SEXP ptype = Rf_getAttrib(x, symbols::ptype);
+    if (ptype == R_NilValue) {

Review comment:
   Counting 6 cases of `Rf_isNull()` and 3 cases of `== R_NilValue`, I'll 
switch to `Rf_isNull()` here.
   





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nealrichardson commented on a change in pull request #7524: ARROW-8899 [R] Add R metadata like pandas metadata for round-trip fidelity

2020-06-23 Thread GitBox


nealrichardson commented on a change in pull request #7524:
URL: https://github.com/apache/arrow/pull/7524#discussion_r444306795



##
File path: r/tests/testthat/test-Table.R
##
@@ -334,5 +334,5 @@ test_that("Table metadata", {
 
 test_that("Table handles null type (ARROW-7064)", {
   tab <- Table$create(a = 1:10, n = vctrs::unspecified(10))
-  expect_equal(tab$schema,  schema(a = int32(), n = null()))
+  expect_true(tab$schema$Equals(schema(a = int32(), n = null()), FALSE))

Review comment:
   I think you can use `expect_equivalent()` instead of `expect_equal` and 
it will skip the metadata comparison; cf. 
https://github.com/apache/arrow/blob/master/r/R/arrow-package.R#L111 
   
   (of course, we'll have to fix this for the new version of `testthat` that 
doesn't use all.equal under the hood)





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #7143: ARROW-8504: [C++] Add BitRunReader and use it in parquet

2020-06-23 Thread GitBox


wesm commented on pull request #7143:
URL: https://github.com/apache/arrow/pull/7143#issuecomment-648230676


   I'm not sure what the MSVC failure is about but I'll debug locally



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nealrichardson commented on a change in pull request #7514: ARROW-6235: [R] Implement conversion from arrow::BinaryArray to R character vector

2020-06-23 Thread GitBox


nealrichardson commented on a change in pull request #7514:
URL: https://github.com/apache/arrow/pull/7514#discussion_r444302116



##
File path: r/src/array_from_vector.cpp
##
@@ -1067,12 +1110,22 @@ std::shared_ptr 
InferArrowTypeFromVector(SEXP x) {
   if (Rf_inherits(x, "data.frame")) {
 return InferArrowTypeFromDataFrame(x);
   } else {
-    if (XLENGTH(x) == 0) {
-      Rcpp::stop(
-          "Requires at least one element to infer the values' type of a list vector");
-    }
+    SEXP ptype = Rf_getAttrib(x, symbols::ptype);
+    if (ptype == R_NilValue) {

Review comment:
   Doesn't matter to me which we use as long as we're consistent. Happy to 
update everywhere to follow whichever pattern is preferred.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7520: ARROW-9189: [Website] Improve contributor guide

2020-06-23 Thread GitBox


jorisvandenbossche commented on a change in pull request #7520:
URL: https://github.com/apache/arrow/pull/7520#discussion_r444295036



##
File path: docs/source/developers/contributing.rst
##
@@ -124,29 +181,72 @@ To contribute a patch:
   `ARROW-767: [C++] Filesystem abstraction 
`_).
 * Make sure that your code passes the unit tests. You can find instructions how
   to run the unit tests for each Arrow component in its respective README file.
+
+Core developers and others with a stake in the part of the project your change
+affects will review, request changes, and hopefully indicate their approval
+in the end. To make the review process smooth for everyone, try to
+
+* Break your work into small, single-purpose patches if possible. It’s much
+  harder to merge in a large change with a lot of disjoint features, and
+  particularly if you're new to the project, smaller changes are much easier
+  for maintainers to accept.
 * Add new unit tests for your code.
+* Follow the style guides for the part(s) of the project you're modifying.
+  Some languages (C++, Python, and Rust, for example) run a lint check in
+  continuous integration. For all languages, see their respective developer
+  documentation and READMEs for style guidance. In general, try to make it look
+  as if the codebase has a single author, and emulate any conventions you see,
+  whether or not they are officially documented or checked.
+
+When tests are passing and the pull request has been approved by the interested
+parties, a committer will merge the pull request. This is done with a
+command-line utility that does a squash merge, so all of your commits will be
+registered as a single commit to the master branch; this simplifies the
+connection between JIRA issues and commits, and it makes it easier to bisect
+history to identify where changes were introduced. A side effect of this way of
+merging is that your pull request will appear in the GitHub interface to have
+been "closed without merge". Do not be alarmed: if you look at the bottom, you
+will see a message that says "@user closed this in $COMMIT".
+
+Local git conventions
++
+
+If you are tracking the Arrow source repository locally, here are some tips
+for using ``git``.
+
+All Arrow contributors work off of their personal fork of ``apache/arrow``
+and submit pull requests "upstream". Once you've cloned your fork of Arrow,
+be sure to::
+
+$ git remote add upstream https://github.com/apache/arrow
+
+to set the "upstream" repository.
+
+You are encouraged to develop on branches, rather than your own "master" 
branch,
+and it helps to keep your fork's master branch synced with ``upstream/master``.
 
-Thank you in advance for your contributions!
+To start a new branch, pull the latest from upstream first::
 
-Common Git conventions followed within the project
---
+   $ git fetch upstream
+   $ git checkout master
+   $ git reset --hard upstream/master
+   $ git checkout -b $NEW_BRANCH_NAME
 
-If you are tracking the Arrow source repository locally, following some common 
Git
-conventions would make everyone's workflow compatible.  These recommendations 
along with
-their rationale are outlined below.
+It does not matter what you call your branch. Some people like to use the JIRA
+number as branch name, others use descriptive names.
 
-It is strongly discouraged to use a regular ``git merge``, as a linear commit 
history is
-prefered by the project.  It is much easier to maintain, and makes for easier
-``cherry-picking`` of features; useful for backporting fixes to maintenance 
releases.
+Once you have a branch going, you should sync with ``upstream/master``
+regularly, as many commits merge to master every day.
+It is recommended to use ``git rebase`` rather than ``git merge``.

Review comment:
   Yes, it's partly due to a lack of reviewing functionality in GitHub, but 
*when* using GitHub (which is the case right now), my feeling is that a merging 
workflow works better with it.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7520: ARROW-9189: [Website] Improve contributor guide

2020-06-23 Thread GitBox


jorisvandenbossche commented on a change in pull request #7520:
URL: https://github.com/apache/arrow/pull/7520#discussion_r444293158



##
File path: docs/source/developers/contributing.rst
##
@@ -124,29 +181,72 @@ To contribute a patch:
   `ARROW-767: [C++] Filesystem abstraction 
`_).
 * Make sure that your code passes the unit tests. You can find instructions how
   to run the unit tests for each Arrow component in its respective README file.
+
+Core developers and others with a stake in the part of the project your change
+affects will review, request changes, and hopefully indicate their approval
+in the end. To make the review process smooth for everyone, try to
+
+* Break your work into small, single-purpose patches if possible. It’s much
+  harder to merge in a large change with a lot of disjoint features, and
+  particularly if you're new to the project, smaller changes are much easier
+  for maintainers to accept.
 * Add new unit tests for your code.
+* Follow the style guides for the part(s) of the project you're modifying.
+  Some languages (C++, Python, and Rust, for example) run a lint check in
+  continuous integration. For all languages, see their respective developer
+  documentation and READMEs for style guidance. In general, try to make it look
+  as if the codebase has a single author, and emulate any conventions you see,
+  whether or not they are officially documented or checked.
+
+When tests are passing and the pull request has been approved by the interested
+parties, a committer will merge the pull request. This is done with a
+command-line utility that does a squash merge, so all of your commits will be
+registered as a single commit to the master branch; this simplifies the
+connection between JIRA issues and commits, and it makes it easier to bisect
+history to identify where changes were introduced. A side effect of this way of
+merging is that your pull request will appear in the GitHub interface to have
+been "closed without merge". Do not be alarmed: if you look at the bottom, you
+will see a message that says "@user closed this in $COMMIT".
+
+Local git conventions
++
+
+If you are tracking the Arrow source repository locally, here are some tips
+for using ``git``.
+
+All Arrow contributors work off of their personal fork of ``apache/arrow``
+and submit pull requests "upstream". Once you've cloned your fork of Arrow,
+be sure to::
+
+$ git remote add upstream https://github.com/apache/arrow
+
+to set the "upstream" repository.
+
+You are encouraged to develop on branches, rather than your own "master" 
branch,
+and it helps to keep your fork's master branch synced with ``upstream/master``.
 
-Thank you in advance for your contributions!
+To start a new branch, pull the latest from upstream first::
 
-Common Git conventions followed within the project
---
+   $ git fetch upstream
+   $ git checkout master
+   $ git reset --hard upstream/master
+   $ git checkout -b $NEW_BRANCH_NAME
 
-If you are tracking the Arrow source repository locally, following some common 
Git
-conventions would make everyone's workflow compatible.  These recommendations 
along with
-their rationale are outlined below.
+It does not matter what you call your branch. Some people like to use the JIRA
+number as branch name, others use descriptive names.
 
-It is strongly discouraged to use a regular ``git merge``, as a linear commit 
history is
-prefered by the project.  It is much easier to maintain, and makes for easier
-``cherry-picking`` of features; useful for backporting fixes to maintenance 
releases.
+Once you have a branch going, you should sync with ``upstream/master``
+regularly, as many commits merge to master every day.
+It is recommended to use ``git rebase`` rather than ``git merge``.

Review comment:
   > Resolving conflicts in a PR with merge commits can be a nightmare
   
   If you only `merge upstream/master`, then in my experience that is never a 
nightmare (you should of course never mix rebasing and merging, as that indeed 
will give nightmares). 
   
   Also, when merging master instead of rebasing, you only need to fix merge 
conflicts once, whereas right now we have a complicated section about how to 
simplify conflict resolution by squashing your commits while rebasing. None of 
that is needed in a merging workflow (which IMO is also nicer because it 
preserves the ability to see what has changed since the last review on GitHub).
   
   Anyway, we are not going to resolve that discussion here ;) (both workflows 
have their pros and cons). I was mainly wondering to what extent we want to 
push contributors toward a chosen workflow.
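   To make the comparison concrete, here is a minimal sketch of the two styles 
being discussed, assuming an `upstream` remote pointing at `apache/arrow` (the 
commands are illustrative, not part of the proposed guide):
   
   ``` sh
   $ git fetch upstream
   
   # Merge style: fold upstream/master into the feature branch;
   # conflicts are resolved once, in a merge commit.
   $ git merge upstream/master
   
   # Rebase style: replay the feature branch's commits on top of
   # upstream/master; conflicts may have to be resolved per replayed commit.
   $ git rebase upstream/master
   ```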





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


[GitHub] [arrow] lionel- commented on a change in pull request #7514: ARROW-6235: [R] Implement conversion from arrow::BinaryArray to R character vector

2020-06-23 Thread GitBox


lionel- commented on a change in pull request #7514:
URL: https://github.com/apache/arrow/pull/7514#discussion_r444292367



##
File path: r/src/array_from_vector.cpp
##
@@ -1067,12 +1110,22 @@ std::shared_ptr<arrow::DataType> InferArrowTypeFromVector(SEXP x) {
   if (Rf_inherits(x, "data.frame")) {
 return InferArrowTypeFromDataFrame(x);
   } else {
-if (XLENGTH(x) == 0) {
-  Rcpp::stop(
-  "Requires at least one element to infer the values' type of a list 
vector");
-}
+SEXP ptype = Rf_getAttrib(x, symbols::ptype);
+if (ptype == R_NilValue) {

Review comment:
   Yes, in vctrs and rlang we generally use pointer comparison with 
singletons (NULL, symbols, cached objects). I guess it's best to use the local 
style here, though?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on a change in pull request #7520: ARROW-9189: [Website] Improve contributor guide

2020-06-23 Thread GitBox


wesm commented on a change in pull request #7520:
URL: https://github.com/apache/arrow/pull/7520#discussion_r444288120



##
File path: docs/source/developers/contributing.rst
##
@@ -124,29 +181,72 @@ To contribute a patch:
   `ARROW-767: [C++] Filesystem abstraction 
`_).
 * Make sure that your code passes the unit tests. You can find instructions how
   to run the unit tests for each Arrow component in its respective README file.
+
+Core developers and others with a stake in the part of the project your change
+affects will review, request changes, and hopefully indicate their approval
+in the end. To make the review process smooth for everyone, try to
+
+* Break your work into small, single-purpose patches if possible. It’s much
+  harder to merge in a large change with a lot of disjoint features, and
+  particularly if you're new to the project, smaller changes are much easier
+  for maintainers to accept.
 * Add new unit tests for your code.
+* Follow the style guides for the part(s) of the project you're modifying.
+  Some languages (C++, Python, and Rust, for example) run a lint check in
+  continuous integration. For all languages, see their respective developer
+  documentation and READMEs for style guidance. In general, try to make it look
+  as if the codebase has a single author, and emulate any conventions you see,
+  whether or not they are officially documented or checked.
+
+When tests are passing and the pull request has been approved by the interested
+parties, a committer will merge the pull request. This is done with a
+command-line utility that does a squash merge, so all of your commits will be
+registered as a single commit to the master branch; this simplifies the
+connection between JIRA issues and commits, and it makes it easier to bisect
+history to identify where changes were introduced. A side effect of this way of
+merging is that your pull request will appear in the GitHub interface to have
+been "closed without merge". Do not be alarmed: if you look at the bottom, you
+will see a message that says "@user closed this in $COMMIT".
+
+Local git conventions
++
+
+If you are tracking the Arrow source repository locally, here are some tips
+for using ``git``.
+
+All Arrow contributors work off of their personal fork of ``apache/arrow``
+and submit pull requests "upstream". Once you've cloned your fork of Arrow,
+be sure to::
+
+$ git remote add upstream https://github.com/apache/arrow
+
+to set the "upstream" repository.
+
+You are encouraged to develop on branches, rather than your own "master" 
branch,
+and it helps to keep your fork's master branch synced with ``upstream/master``.
 
-Thank you in advance for your contributions!
+To start a new branch, pull the latest from upstream first::
 
-Common Git conventions followed within the project
---
+   $ git fetch upstream
+   $ git checkout master
+   $ git reset --hard upstream/master
+   $ git checkout -b $NEW_BRANCH_NAME
 
-If you are tracking the Arrow source repository locally, following some common 
Git
-conventions would make everyone's workflow compatible.  These recommendations 
along with
-their rationale are outlined below.
+It does not matter what you call your branch. Some people like to use the JIRA
+number as branch name, others use descriptive names.
 
-It is strongly discouraged to use a regular ``git merge``, as a linear commit 
history is
-prefered by the project.  It is much easier to maintain, and makes for easier
-``cherry-picking`` of features; useful for backporting fixes to maintenance 
releases.
+Once you have a branch going, you should sync with ``upstream/master``
+regularly, as many commits merge to master every day.
+It is recommended to use ``git rebase`` rather than ``git merge``.

Review comment:
   FWIW at large organizations like Google, PRs (which are called 
"changelists") with more than one commit are not accepted. Every proposed patch 
must be a single rebased commit (this is tightly integrated with Gerrit, 
Google's code review system). I would be much happier using Gerrit than GitHub 
PRs because of the incremental code diffing system -- once you experience it, 
it is difficult to go back. 
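   For contributors who want to emulate that locally, a minimal sketch of 
collapsing a branch into one rebased commit, assuming an `upstream` remote 
(illustrative only -- Arrow's merge script squashes on merge anyway):
   
   ``` sh
   $ git fetch upstream
   $ git rebase upstream/master        # replay the branch onto latest master
   $ git reset --soft upstream/master  # keep the combined diff staged, drop the commits
   $ git commit -m "ARROW-XXXX: [C++] One-line summary"  # hypothetical message
   ```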





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on a change in pull request #7520: ARROW-9189: [Website] Improve contributor guide

2020-06-23 Thread GitBox


wesm commented on a change in pull request #7520:
URL: https://github.com/apache/arrow/pull/7520#discussion_r444285972



##
File path: docs/source/developers/contributing.rst
##
@@ -124,29 +181,72 @@ To contribute a patch:
   `ARROW-767: [C++] Filesystem abstraction 
`_).
 * Make sure that your code passes the unit tests. You can find instructions how
   to run the unit tests for each Arrow component in its respective README file.
+
+Core developers and others with a stake in the part of the project your change
+affects will review, request changes, and hopefully indicate their approval
+in the end. To make the review process smooth for everyone, try to
+
+* Break your work into small, single-purpose patches if possible. It’s much
+  harder to merge in a large change with a lot of disjoint features, and
+  particularly if you're new to the project, smaller changes are much easier
+  for maintainers to accept.
 * Add new unit tests for your code.
+* Follow the style guides for the part(s) of the project you're modifying.
+  Some languages (C++, Python, and Rust, for example) run a lint check in
+  continuous integration. For all languages, see their respective developer
+  documentation and READMEs for style guidance. In general, try to make it look
+  as if the codebase has a single author, and emulate any conventions you see,
+  whether or not they are officially documented or checked.
+
+When tests are passing and the pull request has been approved by the interested
+parties, a committer will merge the pull request. This is done with a
+command-line utility that does a squash merge, so all of your commits will be
+registered as a single commit to the master branch; this simplifies the
+connection between JIRA issues and commits, and it makes it easier to bisect
+history to identify where changes were introduced. A side effect of this way of
+merging is that your pull request will appear in the GitHub interface to have
+been "closed without merge". Do not be alarmed: if you look at the bottom, you
+will see a message that says "@user closed this in $COMMIT".
+
+Local git conventions
++
+
+If you are tracking the Arrow source repository locally, here are some tips
+for using ``git``.
+
+All Arrow contributors work off of their personal fork of ``apache/arrow``
+and submit pull requests "upstream". Once you've cloned your fork of Arrow,
+be sure to::
+
+$ git remote add upstream https://github.com/apache/arrow
+
+to set the "upstream" repository.
+
+You are encouraged to develop on branches, rather than your own "master" 
branch,
+and it helps to keep your fork's master branch synced with ``upstream/master``.
 
-Thank you in advance for your contributions!
+To start a new branch, pull the latest from upstream first::
 
-Common Git conventions followed within the project
---
+   $ git fetch upstream
+   $ git checkout master
+   $ git reset --hard upstream/master
+   $ git checkout -b $NEW_BRANCH_NAME
 
-If you are tracking the Arrow source repository locally, following some common 
Git
-conventions would make everyone's workflow compatible.  These recommendations 
along with
-their rationale are outlined below.
+It does not matter what you call your branch. Some people like to use the JIRA
+number as branch name, others use descriptive names.
 
-It is strongly discouraged to use a regular ``git merge``, as a linear commit 
history is
-prefered by the project.  It is much easier to maintain, and makes for easier
-``cherry-picking`` of features; useful for backporting fixes to maintenance 
releases.
+Once you have a branch going, you should sync with ``upstream/master``
+regularly, as many commits merge to master every day.
+It is recommended to use ``git rebase`` rather than ``git merge``.

Review comment:
   Resolving conflicts in a PR with merge commits can be a nightmare, 
whereas resolving the same conflicts in a PR that doesn't have the merges is, 
to me, relatively straightforward. Maybe I'm doing something wrong. Very rarely 
have I seen anything good come out of `git merge`.
   
   So I think we can say "The project maintainers prefer to handle conflict 
resolution from a rebase standpoint rather than from a merge standpoint. If a 
PR contains merge commits and contains conflicts, then a maintainer may squash 
the changes and rebase them to remove the merge commits."
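   A minimal sketch of that cleanup, assuming a feature branch that contains 
merge commits (commands illustrative, not a mandated procedure):
   
   ``` sh
   $ git fetch upstream
   # A plain rebase already linearizes history: merge commits are dropped and
   # the remaining commits are replayed on top of upstream/master.
   $ git rebase upstream/master
   # Optionally squash the replayed commits into a single commit afterwards:
   $ git rebase -i upstream/master
   ```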





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] romainfrancois commented on a change in pull request #7514: ARROW-6235: [R] Implement conversion from arrow::BinaryArray to R character vector

2020-06-23 Thread GitBox


romainfrancois commented on a change in pull request #7514:
URL: https://github.com/apache/arrow/pull/7514#discussion_r444283172



##
File path: r/src/array_from_vector.cpp
##
@@ -1067,12 +1110,22 @@ std::shared_ptr<arrow::DataType> InferArrowTypeFromVector(SEXP x) {
   if (Rf_inherits(x, "data.frame")) {
 return InferArrowTypeFromDataFrame(x);
   } else {
-if (XLENGTH(x) == 0) {
-  Rcpp::stop(
-  "Requires at least one element to infer the values' type of a list 
vector");
-}
+SEXP ptype = Rf_getAttrib(x, symbols::ptype);
+if (ptype == R_NilValue) {
+  if (XLENGTH(x) == 0) {
+Rcpp::stop(
+"Requires at least one element to infer the values' type of a list 
vector");
+  }
 
-return arrow::list(InferArrowType(VECTOR_ELT(x, 0)));
+  return arrow::list(InferArrowType(VECTOR_ELT(x, 0)));
+} else {
+  // special case list(raw()) -> BinaryArray
+  if (TYPEOF(ptype) == RAWSXP) {
+return arrow::binary();
+  }
+
+  return arrow::list(InferArrowType(ptype));

Review comment:
   Oh yeah, that makes sense. Although this only looks at the attribute, not 
specifically at whether it is a `vctrs_list_of`; but on the arrow -> R 
conversion, I think it does not hurt to make it a `vctrs_list_of`.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] romainfrancois commented on a change in pull request #7514: ARROW-6235: [R] Implement conversion from arrow::BinaryArray to R character vector

2020-06-23 Thread GitBox


romainfrancois commented on a change in pull request #7514:
URL: https://github.com/apache/arrow/pull/7514#discussion_r444281970



##
File path: r/src/array_from_vector.cpp
##
@@ -1067,12 +1110,22 @@ std::shared_ptr<arrow::DataType> InferArrowTypeFromVector(SEXP x) {
   if (Rf_inherits(x, "data.frame")) {
 return InferArrowTypeFromDataFrame(x);
   } else {
-if (XLENGTH(x) == 0) {
-  Rcpp::stop(
-  "Requires at least one element to infer the values' type of a list 
vector");
-}
+SEXP ptype = Rf_getAttrib(x, symbols::ptype);
+if (ptype == R_NilValue) {

Review comment:
   IIRC @lionel- ? said he prefers `== R_NilValue` 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] romainfrancois commented on a change in pull request #7524: ARROW-8899 [R] Add R metadata like pandas metadata for round-trip fidelity

2020-06-23 Thread GitBox


romainfrancois commented on a change in pull request #7524:
URL: https://github.com/apache/arrow/pull/7524#discussion_r444273703



##
File path: r/tests/testthat/test-Table.R
##
@@ -334,5 +334,5 @@ test_that("Table metadata", {
 
 test_that("Table handles null type (ARROW-7064)", {
   tab <- Table$create(a = 1:10, n = vctrs::unspecified(10))
-  expect_equal(tab$schema,  schema(a = int32(), n = null()))
+  expect_true(tab$schema$Equals(schema(a = int32(), n = null()), FALSE))

Review comment:
   Maybe `vctrs::unspecified()` can be special-cased so that it does not 
generate additional info in the schema here (it's just the class).
   
   Or maybe the equality operator could disregard the metadata altogether (i.e. 
changing the default). 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7520: ARROW-9189: [Website] Improve contributor guide

2020-06-23 Thread GitBox


jorisvandenbossche commented on a change in pull request #7520:
URL: https://github.com/apache/arrow/pull/7520#discussion_r444268553



##
File path: docs/source/developers/contributing.rst
##
@@ -124,29 +181,72 @@ To contribute a patch:
   `ARROW-767: [C++] Filesystem abstraction 
`_).
 * Make sure that your code passes the unit tests. You can find instructions how
   to run the unit tests for each Arrow component in its respective README file.
+
+Core developers and others with a stake in the part of the project your change
+affects will review, request changes, and hopefully indicate their approval
+in the end. To make the review process smooth for everyone, try to
+
+* Break your work into small, single-purpose patches if possible. It’s much
+  harder to merge in a large change with a lot of disjoint features, and
+  particularly if you're new to the project, smaller changes are much easier
+  for maintainers to accept.
 * Add new unit tests for your code.
+* Follow the style guides for the part(s) of the project you're modifying.
+  Some languages (C++, Python, and Rust, for example) run a lint check in
+  continuous integration. For all languages, see their respective developer
+  documentation and READMEs for style guidance. In general, try to make it look
+  as if the codebase has a single author, and emulate any conventions you see,
+  whether or not they are officially documented or checked.
+
+When tests are passing and the pull request has been approved by the interested
+parties, a committer will merge the pull request. This is done with a
+command-line utility that does a squash merge, so all of your commits will be
+registered as a single commit to the master branch; this simplifies the
+connection between JIRA issues and commits, and it makes it easier to bisect
+history to identify where changes were introduced. A side effect of this way of
+merging is that your pull request will appear in the GitHub interface to have
+been "closed without merge". Do not be alarmed: if you look at the bottom, you
+will see a message that says "@user closed this in $COMMIT".
+
+Local git conventions
++
+
+If you are tracking the Arrow source repository locally, here are some tips
+for using ``git``.
+
+All Arrow contributors work off of their personal fork of ``apache/arrow``
+and submit pull requests "upstream". Once you've cloned your fork of Arrow,
+be sure to::
+
+$ git remote add upstream https://github.com/apache/arrow
+
+to set the "upstream" repository.
+
+You are encouraged to develop on branches, rather than your own "master" 
branch,
+and it helps to keep your fork's master branch synced with ``upstream/master``.
 
-Thank you in advance for your contributions!
+To start a new branch, pull the latest from upstream first::
 
-Common Git conventions followed within the project
---
+   $ git fetch upstream
+   $ git checkout master
+   $ git reset --hard upstream/master
+   $ git checkout -b $NEW_BRANCH_NAME
 
-If you are tracking the Arrow source repository locally, following some common 
Git
-conventions would make everyone's workflow compatible.  These recommendations 
along with
-their rationale are outlined below.
+It does not matter what you call your branch. Some people like to use the JIRA
+number as branch name, others use descriptive names.
 
-It is strongly discouraged to use a regular ``git merge``, as a linear commit 
history is
-prefered by the project.  It is much easier to maintain, and makes for easier
-``cherry-picking`` of features; useful for backporting fixes to maintenance 
releases.
+Once you have a branch going, you should sync with ``upstream/master``
+regularly, as many commits merge to master every day.
+It is recommended to use ``git rebase`` rather than ``git merge``.
 To sync your local copy of a branch, you may do the following::
 
 $ git pull upstream branch --rebase

Review comment:
   Is this doing the same as `git rebase upstream/master`? (I am used to 
doing this.)
   
   EDIT: ah, I suppose my suggestion needs a `git fetch upstream` first, so the 
above is probably the one-liner for doing this
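   Spelled out in two steps, the assumed equivalent would be:
   
   ``` sh
   $ git fetch upstream
   $ git rebase upstream/master  # roughly what `git pull upstream master --rebase` does in one step
   ```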

##
File path: docs/source/developers/contributing.rst
##
@@ -17,53 +17,73 @@
 
 .. _contributing:
 
-***
-Contribution Guidelines
-***
+
+Contributing to Apache Arrow
+
 
-There are many ways to contribute to Apache Arrow:
+Thanks for your interest in the Apache Arrow project. Arrow is a large project
+and may seem overwhelming when you're first getting involved.
+Contributing code is great, but that's probably not the first place to start.
+There are lots of ways to make valuable contributions to the project and
+community.
 
-* Contributing code (we call them "patches")
-* Writing documentation (another form of code, in 

[GitHub] [arrow] github-actions[bot] commented on pull request #7524: ARROW-8899 [R] Add R metadata like pandas metadata for round-trip fidelity

2020-06-23 Thread GitBox


github-actions[bot] commented on pull request #7524:
URL: https://github.com/apache/arrow/pull/7524#issuecomment-648198565


   https://issues.apache.org/jira/browse/ARROW-8899



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] romainfrancois opened a new pull request #7524: ARROW-8899 [R] Add R metadata like pandas metadata for round-trip fidelity

2020-06-23 Thread GitBox


romainfrancois opened a new pull request #7524:
URL: https://github.com/apache/arrow/pull/7524


   ``` r
   library(arrow, warn.conflicts = FALSE)
   
   tab <- Table$create(
 a = structure(1:4, foo = "bar"), 
 b = haven::labelled(1:4, label = "description")
   )
   tab$metadata$r
   #> [1] 
"A\n3\n262144\n197888\n5\nUTF-8\n19\n2\n254\n531\n2\n1026\n1\n262153\n3\nfoo\n16\n1\n262153\n3\nbar\n254\n1026\n1\n262153\n5\nlabel\n16\n1\n262153\n11\ndescription\n1026\n1\n262153\n5\nclass\n16\n3\n262153\n14\nhaven_labelled\n262153\n10\nvctrs_vctr\n262153\n7\ninteger\n254\n1026\n1\n262153\n5\nnames\n16\n2\n262153\n1\na\n262153\n1\nb\n254\n"
   arrow:::.arrow_unserialize(tab$metadata$r)
   #> [[1]]
   #> NULL
   #> 
   #> [[2]]
   #> [[2]]$a
   #> [[2]]$a$foo
   #> [1] "bar"
   #> 
   #> 
   #> [[2]]$b
   #> [[2]]$b$label
   #> [1] "description"
   #> 
   #> [[2]]$b$class
   #> [1] "haven_labelled" "vctrs_vctr" "integer"
   
   df <- as.data.frame(tab)
   df$a
   #> [1] 1 2 3 4
   #> attr(,"foo")
   #> [1] "bar"
   df$b
   #> <labelled<integer>[4]>: description
   #> [1] 1 2 3 4
   ```
   
   Created on 2020-06-23 by the [reprex 
package](https://reprex.tidyverse.org) (v0.3.0.9001)
   
   `metadata$r` gets a serialized string of a list with 2 things: 
 - attributes of the data frame itself
 - named list of attributes of each column (possibly NULL)
   
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on a change in pull request #7321: ARROW-8985: [Format][DONOTMERGE] RFC Proposed Decimal::byteWidth field for forward compatibility

2020-06-23 Thread GitBox


wesm commented on a change in pull request #7321:
URL: https://github.com/apache/arrow/pull/7321#discussion_r444249804



##
File path: format/Schema.fbs
##
@@ -134,11 +134,20 @@ table FixedSizeBinary {
 table Bool {
 }
 
+/// Exact decimal value represented as an integer value in two's
+/// complement. Currently only 128-bit (16-byte) integers are used but this may

Review comment:
   This should use the endianness indicated by the schema, I believe (like 
the other data types)?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jorisvandenbossche commented on pull request #7395: ARROW-9089: [Python] A PyFileSystem handler for fsspec-based filesystems

2020-06-23 Thread GitBox


jorisvandenbossche commented on pull request #7395:
URL: https://github.com/apache/arrow/pull/7395#issuecomment-648165633


   More comments on this? (apart from ensuring the tests pass)
   
   I should probably still add it to the filesystem docs.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm closed pull request #7516: ARROW-9201: [Archery] More user-friendly console output for benchmark diffs, add repetitions argument, don't build unit tests

2020-06-23 Thread GitBox


wesm closed pull request #7516:
URL: https://github.com/apache/arrow/pull/7516


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #7516: ARROW-9201: [Archery] More user-friendly console output for benchmark diffs, add repetitions argument, don't build unit tests

2020-06-23 Thread GitBox


wesm commented on pull request #7516:
URL: https://github.com/apache/arrow/pull/7516#issuecomment-648162680


   +1. The bot changes can't be done here, so I'm going to go ahead and merge 
this so that I can use it more easily without having to switch branches (to 
use this branch) before running benchmarks
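   For reference, a sketch of invoking the diff tool (`archery benchmark diff` 
exists in the repo; the `--repetitions` option shown is an assumption based on 
this PR's title, not a verified flag):
   
   ``` sh
   # Compare benchmark results between the current workspace and master:
   $ archery benchmark diff
   # Assumed: cut repetitions for a quicker (noisier) comparison
   $ archery benchmark diff --repetitions=1
   ```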



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm closed pull request #7522: ARROW-8801: [Python] Fix memory leak when converting datetime64-with-tz data to pandas

2020-06-23 Thread GitBox


wesm closed pull request #7522:
URL: https://github.com/apache/arrow/pull/7522


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #7521: ARROW-9210: [C++] Use BitBlockCounter in array/visitor_inline.h

2020-06-23 Thread GitBox


wesm commented on pull request #7521:
URL: https://github.com/apache/arrow/pull/7521#issuecomment-648147535


   thanks @pitrou and @cyb70289 -- I will spend a little time on the count-sort 
implementation and post a new patch



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jorisvandenbossche edited a comment on pull request #7522: ARROW-8801: [Python] Fix memory leak when converting datetime64-with-tz data to pandas

2020-06-23 Thread GitBox


jorisvandenbossche edited a comment on pull request #7522:
URL: https://github.com/apache/arrow/pull/7522#issuecomment-648146247


   Was just testing it, and can also confirm the case from the issue is fixed, 
and the code looks good to me



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jorisvandenbossche commented on pull request #7522: ARROW-8801: [Python] Fix memory leak when converting datetime64-with-tz data to pandas

2020-06-23 Thread GitBox


jorisvandenbossche commented on pull request #7522:
URL: https://github.com/apache/arrow/pull/7522#issuecomment-648146247


   Was just testing it, and can also confirm the case from the issue is fixed



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



