[GitHub] [arrow] bkietz closed pull request #7546: ARROW-8733: [C++][Dataset][Python] Expose RowGroupInfo statistics values

2020-06-25 Thread GitBox


bkietz closed pull request #7546:
URL: https://github.com/apache/arrow/pull/7546


   







[GitHub] [arrow] wesm commented on a change in pull request #6213: ARROW-7592: [C++] Fix crashes on corrupt IPC input

2020-06-25 Thread GitBox


wesm commented on a change in pull request #6213:
URL: https://github.com/apache/arrow/pull/6213#discussion_r445934957



##
File path: cpp/src/arrow/type.cc
##
@@ -501,20 +501,35 @@ Status Decimal128Type::Make(int32_t precision, int32_t scale,
 // ----------------------------------------------------------------------
 // Dictionary-encoded type
 
+Status DictionaryType::ValidateParameters(const DataType& index_type,
+                                          const DataType& value_type) {
+  const bool index_type_ok = is_integer(index_type.id()) &&
+                             checked_cast<const IntegerType&>(index_type).is_signed();
+  if (!index_type_ok) {
+    return Status::TypeError("Dictionary index type should be signed integer, got ",

Review comment:
   I'm against supporting unsigned dictionary indices. The costs outweigh 
the benefits in my opinion. If you want to discuss further could we move this 
to the Arrow dev mailing list?
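   
   For context, a minimal pyarrow illustration of the signed-index default under discussion (a sketch; `dictionary_encode` producing int32 indices is the library default):
   
   ```python
   import pyarrow as pa
   
   # dictionary_encode() uses a signed int32 index type by default,
   # matching the C++ check quoted above.
   arr = pa.array(["a", "b", "a"]).dictionary_encode()
   print(arr.type)          # dictionary<values=string, indices=int32, ordered=0>
   print(arr.indices.type)  # int32
   ```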









[GitHub] [arrow] wesm commented on pull request #7531: ARROW-9216: [C++] Use BitBlockCounter for plain spaced encoding/decoding

2020-06-25 Thread GitBox


wesm commented on pull request #7531:
URL: https://github.com/apache/arrow/pull/7531#issuecomment-649911719


   Okay, thanks. I think we can leave it as is 







[GitHub] [arrow] jianxind commented on pull request #7531: ARROW-9216: [C++] Use BitBlockCounter for plain spaced encoding/decoding

2020-06-25 Thread GitBox


jianxind commented on pull request #7531:
URL: https://github.com/apache/arrow/pull/7531#issuecomment-649874368


   > +1, merging this. If removing the BitmapReader helps perf it can be done in a follow up PR
   
   Thanks. BitmapReader does help for the 1% and 50% null-probability cases; the initial implementation used GetBit directly, but I observed some regressions compared to the previous version, at least on my setup.







[GitHub] [arrow] kszucs commented on pull request #7376: ARROW-9043: [Go][FOLLOWUP] Move license file copy to correct location

2020-06-25 Thread GitBox


kszucs commented on pull request #7376:
URL: https://github.com/apache/arrow/pull/7376#issuecomment-649859049


   Yes. We can go either route, but it is still unclear to me whether we can test it automatically (I'm unfamiliar with Go packaging).
   
   Based on the information @jba provided I'm happy to merge it, but if there is a way to maintain Go's packaging requirements we should have JIRAs for it - like creating nightly packages, perhaps?
   
   cc @sbinet 







[GitHub] [arrow] wesm closed pull request #7531: ARROW-9216: [C++] Use BitBlockCounter for plain spaced encoding/decoding

2020-06-25 Thread GitBox


wesm closed pull request #7531:
URL: https://github.com/apache/arrow/pull/7531


   







[GitHub] [arrow] wesm commented on a change in pull request #7502: [DISCUSS] Proposed Feature enum for forward compatibility.

2020-06-25 Thread GitBox


wesm commented on a change in pull request #7502:
URL: https://github.com/apache/arrow/pull/7502#discussion_r445874451



##
File path: format/Schema.fbs
##
@@ -33,6 +33,35 @@ enum MetadataVersion:short {
   V4,
 }
 
+/// Represents Arrow Features that might not have full support
+/// within implementations. This is intended to be used in
+/// two scenarios:
+///  1.  A mechanism for readers of Arrow Streams
+///  and files to understand that the stream or file makes
+///  use of a feature that isn't supported by or is unknown to
+///  the implementation (and therefore the implementation can meet
+///  Arrow's forward compatibility guarantees).
+///  2.  A means of negotiating between a client and server
+///  what features a stream is allowed to use. The enum
+///  values here are intended to represent higher-level
+///  features; additional details may be negotiated
+///  with key-value pairs specific to the protocol.
+///
+/// Enums added to this list should be assigned power-of-two values
+/// to facilitate exchanging and comparing bitmaps for supported
+/// features.
+enum Feature : int {

Review comment:
   agreed









[GitHub] [arrow] wesm commented on pull request #7531: ARROW-9216: [C++] Use BitBlockCounter for plain spaced encoding/decoding

2020-06-25 Thread GitBox


wesm commented on pull request #7531:
URL: https://github.com/apache/arrow/pull/7531#issuecomment-649849609


   +1, merging this. If removing the BitmapReader helps perf it can be done in 
a follow up PR







[GitHub] [arrow] wesm commented on a change in pull request #7535: [Format][DONOTMERGE] Columnar.rst changes for removing validity bitmap from union types

2020-06-25 Thread GitBox


wesm commented on a change in pull request #7535:
URL: https://github.com/apache/arrow/pull/7535#discussion_r445873149



##
File path: docs/source/format/Columnar.rst
##
@@ -566,11 +572,6 @@ having the values: ``[{f=1.2}, null, {f=3.4}, {i=5}]``
 ::
 
 * Length: 4, Null count: 1
-* Validity bitmap buffer:
-  |Byte 0 (validity bitmap) | Bytes 1-63  |
-  |-------------------------|-------------|
-  |1101                     | 0 (padding) |
-
 * Types buffer:
 
   |Byte 0   | Byte 1  | Byte 2   | Byte 3   | Bytes 4-63  |

Review comment:
   Oops, went too fast. Fixing.









[GitHub] [arrow] wesm commented on a change in pull request #7535: [Format][DONOTMERGE] Columnar.rst changes for removing validity bitmap from union types

2020-06-25 Thread GitBox


wesm commented on a change in pull request #7535:
URL: https://github.com/apache/arrow/pull/7535#discussion_r445873223



##
File path: docs/source/format/Columnar.rst
##
@@ -586,13 +587,13 @@ having the values: ``[{f=1.2}, null, {f=3.4}, {i=5}]``
 * Children arrays:
   * Field-0 array (f: float):
 * Length: 2, nulls: 0
-* Validity bitmap buffer: Not required
+* Validity bitmap buffer: 0101
 
 * Value Buffer:
 
-  | Bytes 0-7 | Bytes 8-63  |
-  |---|-|
-  | 1.2, 3.4  | unspecified |
+  | Bytes 0-11 | Bytes 8-63  |

Review comment:
   Fixing









[GitHub] [arrow] emkornfield commented on a change in pull request #6213: ARROW-7592: [C++] Fix crashes on corrupt IPC input

2020-06-25 Thread GitBox


emkornfield commented on a change in pull request #6213:
URL: https://github.com/apache/arrow/pull/6213#discussion_r445866678



##
File path: cpp/src/arrow/type.cc
##
@@ -501,20 +501,35 @@ Status Decimal128Type::Make(int32_t precision, int32_t scale,
 // ----------------------------------------------------------------------
 // Dictionary-encoded type
 
+Status DictionaryType::ValidateParameters(const DataType& index_type,
+                                          const DataType& value_type) {
+  const bool index_type_ok = is_integer(index_type.id()) &&
+                             checked_cast<const IntegerType&>(index_type).is_signed();
+  if (!index_type_ok) {
+    return Status::TypeError("Dictionary index type should be signed integer, got ",

Review comment:
   This appears to be only mentioned in 
[Columnar.rst](https://github.com/apache/arrow/blob/master/docs/source/format/Columnar.rst#dictionary-encoded-layout)
 but not in 
[Schema.fbs](https://github.com/apache/arrow/blob/master/format/Schema.fbs#L282).
 Schema.fbs only mentions that the default is signed.









[GitHub] [arrow] kkraus14 commented on a change in pull request #6213: ARROW-7592: [C++] Fix crashes on corrupt IPC input

2020-06-25 Thread GitBox


kkraus14 commented on a change in pull request #6213:
URL: https://github.com/apache/arrow/pull/6213#discussion_r445863087



##
File path: cpp/src/arrow/type.cc
##
@@ -501,20 +501,35 @@ Status Decimal128Type::Make(int32_t precision, int32_t scale,
 // ----------------------------------------------------------------------
 // Dictionary-encoded type
 
+Status DictionaryType::ValidateParameters(const DataType& index_type,
+                                          const DataType& value_type) {
+  const bool index_type_ok = is_integer(index_type.id()) &&
+                             checked_cast<const IntegerType&>(index_type).is_signed();
+  if (!index_type_ok) {
+    return Status::TypeError("Dictionary index type should be signed integer, got ",

Review comment:
   cc @trxcllnt who claimed that other Arrow implementations (I presume JS) 
didn't enforce signed types for the indices









[GitHub] [arrow] bkietz commented on a change in pull request #7546: ARROW-8733: [C++][Dataset][Python] Expose RowGroupInfo statistics values

2020-06-25 Thread GitBox


bkietz commented on a change in pull request #7546:
URL: https://github.com/apache/arrow/pull/7546#discussion_r445811655



##
File path: python/pyarrow/_dataset.pyx
##
@@ -845,6 +845,28 @@ cdef class RowGroupInfo:
     def num_rows(self):
         return self.info.num_rows()
 
+    @property
+    def statistics(self):
+        if not self.info.HasStatistics():
+            return None
+
+        cdef:
+            CStructScalar* c_statistics
+            CStructScalar* c_minmax
+
+        statistics = dict()
+        c_statistics = self.info.statistics().get()
+        for i in range(c_statistics.value.size()):
+            name = frombytes(c_statistics.type.get().field(i).get().name())
+            c_minmax = c_statistics.value[i].get()
+
+            statistics[name] = {
+                'min': pyarrow_wrap_scalar(c_minmax.value[0]).as_py(),
+                'max': pyarrow_wrap_scalar(c_minmax.value[1]).as_py(),
+            }
+
+        return statistics
+
+

Review comment:
   If that's desired it can wait for a follow up









[GitHub] [arrow] github-actions[bot] commented on pull request #7550: ARROW-9219: [R] coerce_timestamps in Parquet write options does not work

2020-06-25 Thread GitBox


github-actions[bot] commented on pull request #7550:
URL: https://github.com/apache/arrow/pull/7550#issuecomment-649754779


   https://issues.apache.org/jira/browse/ARROW-9219







[GitHub] [arrow] nealrichardson opened a new pull request #7550: ARROW-9219: [R] coerce_timestamps in Parquet write options does not work

2020-06-25 Thread GitBox


nealrichardson opened a new pull request #7550:
URL: https://github.com/apache/arrow/pull/7550


   In addition to fixing the bug (a quick fix), I also spent some time deleting 
unnecessary bindings for parquet writer builder methods. There's more that can 
be done, which I think would shave more off of the built library size, but I 
don't have time to do it right now.
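   
   For reference, a sketch of the equivalent writer option in Python (the R fix makes `write_parquet()` honor the same underlying `coerce_timestamps` setting; the file name below is illustrative):
   
   ```python
   import pyarrow as pa
   import pyarrow.parquet as pq
   
   # Nanosecond timestamps coerced to milliseconds on write.
   table = pa.table({"ts": pa.array([0, 1_000_000_000], type=pa.timestamp("ns"))})
   pq.write_table(table, "example.parquet", coerce_timestamps="ms",
                  allow_truncated_timestamps=True)
   ```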







[GitHub] [arrow] github-actions[bot] commented on pull request #7549: ARROW-9230: [FlightRPC][Python] pass through all options in flight.connect

2020-06-25 Thread GitBox


github-actions[bot] commented on pull request #7549:
URL: https://github.com/apache/arrow/pull/7549#issuecomment-649740871


   https://issues.apache.org/jira/browse/ARROW-9230







[GitHub] [arrow] lidavidm opened a new pull request #7549: ARROW-9230: [FlightRPC][Python] pass through all options in flight.connect

2020-06-25 Thread GitBox


lidavidm opened a new pull request #7549:
URL: https://github.com/apache/arrow/pull/7549


   Instead of individually listing options, just use kwargs - especially as the options are documented in the docstring anyway.
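   
   A minimal sketch of that pass-through pattern (illustrative, not the actual pyarrow source):
   
   ```python
   from pyarrow.flight import FlightClient
   
   def connect(location, **kwargs):
       # Forward every keyword argument to FlightClient rather than
       # re-listing each supported option in the wrapper's signature.
       return FlightClient(location, **kwargs)
   ```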







[GitHub] [arrow] emkornfield commented on a change in pull request #7502: [DISCUSS] Proposed Feature enum for forward compatibility.

2020-06-25 Thread GitBox


emkornfield commented on a change in pull request #7502:
URL: https://github.com/apache/arrow/pull/7502#discussion_r445699442



##
File path: format/Schema.fbs
##
@@ -33,6 +33,35 @@ enum MetadataVersion:short {
   V4,
 }
 
+/// Represents Arrow Features that might not have full support
+/// within implementations. This is intended to be used in
+/// two scenarios:
+///  1.  A mechanism for readers of Arrow Streams
+///  and files to understand that the stream or file makes
+///  use of a feature that isn't supported by or is unknown to
+///  the implementation (and therefore the implementation can meet
+///  Arrow's forward compatibility guarantees).
+///  2.  A means of negotiating between a client and server
+///  what features a stream is allowed to use. The enum
+///  values here are intended to represent higher-level
+///  features; additional details may be negotiated
+///  with key-value pairs specific to the protocol.
+///
+/// Enums added to this list should be assigned power-of-two values
+/// to facilitate exchanging and comparing bitmaps for supported
+/// features.
+enum Feature : int {

Review comment:
   this should probably be "long" if possible
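   
   To illustrate the power-of-two scheme (and why a 64-bit `long` buys more room), a sketch; the feature names and values here are assumptions, not final Schema.fbs assignments:
   
   ```python
   # Hypothetical power-of-two feature flags.
   DICTIONARY_REPLACEMENT = 1 << 0
   COMPRESSED_BODY = 1 << 1
   
   server = DICTIONARY_REPLACEMENT | COMPRESSED_BODY
   client = DICTIONARY_REPLACEMENT
   
   # Negotiation: a stream may only use features both sides support.
   negotiated = server & client
   assert negotiated == DICTIONARY_REPLACEMENT
   
   # A 32-bit int caps this bitmap at 31 flags; "long" would allow 63.
   ```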









[GitHub] [arrow] github-actions[bot] commented on pull request #7548: WIP - Data frame memory management - see: "Two proposals for expanding arrow Table API (virtual arrays and random access)"

2020-06-25 Thread GitBox


github-actions[bot] commented on pull request #7548:
URL: https://github.com/apache/arrow/pull/7548#issuecomment-64960


   
   
   Thanks for opening a pull request!
   
   Could you open an issue for this pull request on JIRA?
   https://issues.apache.org/jira/browse/ARROW
   
   Then could you also rename pull request title in the following format?
   
   ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}
   
   See also:
   
 * [Other pull requests](https://github.com/apache/arrow/pulls/)
 * [Contribution Guidelines - How to contribute 
patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
   







[GitHub] [arrow] raduteo opened a new pull request #7548: WIP - Data frame memory management - see: "Two proposals for expanding arrow Table API (virtual arrays and random access)"

2020-06-25 Thread GitBox


raduteo opened a new pull request #7548:
URL: https://github.com/apache/arrow/pull/7548


   Follow-up on the "Two proposals for expanding arrow Table API (virtual arrays and random access)" thread.
   
   I have laid out a number of components illustrating how I see an arrow DataFrame implementation connecting to the current arrow structure while also allowing for virtual data sources, thread-safe mutation, shallow copying, and fragmentation control.
   
   The best way to read the header is probably either:
   - bottom up, starting from DataFrameOperationExamples::puttingItAllTogether, for those inclined to see it all in action first, or
   - top down, where I hopefully did an OK job introducing each component.
   
   I tried to strike a balance between being overwhelmingly verbose and hopelessly vague, so please flag any aspect that needs clarification and I will update accordingly.







[GitHub] [arrow] fsaintjacques closed pull request #7517: ARROW-1682: [Doc] Expand S3/MinIO fileystem dataset documentation

2020-06-25 Thread GitBox


fsaintjacques closed pull request #7517:
URL: https://github.com/apache/arrow/pull/7517


   







[GitHub] [arrow] pitrou commented on pull request #7544: ARROW-7285: [C++] ensure C++ implementation meets clarified dictionary spec

2020-06-25 Thread GitBox


pitrou commented on pull request #7544:
URL: https://github.com/apache/arrow/pull/7544#issuecomment-649673482


   @liyafan82  I'm assuming this is work-in-progress? I can give it a quick 
review anyway.







[GitHub] [arrow] wesm closed pull request #7537: ARROW-842: [Python] Recognize pandas.NaT as null when converting object arrays with from_pandas=True

2020-06-25 Thread GitBox


wesm closed pull request #7537:
URL: https://github.com/apache/arrow/pull/7537


   







[GitHub] [arrow] pitrou commented on a change in pull request #7547: ARROW-8950: [C++] Avoid HEAD when possible in S3 filesystem

2020-06-25 Thread GitBox


pitrou commented on a change in pull request #7547:
URL: https://github.com/apache/arrow/pull/7547#discussion_r445673261



##
File path: cpp/src/arrow/dataset/discovery.cc
##
@@ -156,27 +177,26 @@ Result<std::shared_ptr<DatasetFactory>> FileSystemDatasetFactory::Make(
 
   ARROW_ASSIGN_OR_RAISE(auto files, filesystem->GetFileInfo(selector));
 
-  std::vector<std::string> paths;
+  // XXX We may avoid copying around vectors of FileInfo by filtering in place.
+  std::vector<fs::FileInfo> filtered_files;

Review comment:
   Yeah, that's what I thought :-) Thanks for spelling it out, will give it 
a try.









[GitHub] [arrow] bkietz commented on a change in pull request #7547: ARROW-8950: [C++] Avoid HEAD when possible in S3 filesystem

2020-06-25 Thread GitBox


bkietz commented on a change in pull request #7547:
URL: https://github.com/apache/arrow/pull/7547#discussion_r445672523



##
File path: cpp/src/arrow/dataset/file_parquet.cc
##
@@ -571,6 +571,8 @@ static inline Result FileFromRowGroup(
   }
 }
 
+// TODO Is it possible to infer the file size and return a populated FileInfo?
+// This could avoid some spurious HEAD requests on S3 (ARROW-8950)

Review comment:
   https://issues.apache.org/jira/browse/ARROW-9227









[GitHub] [arrow] bkietz commented on a change in pull request #7547: ARROW-8950: [C++] Avoid HEAD when possible in S3 filesystem

2020-06-25 Thread GitBox


bkietz commented on a change in pull request #7547:
URL: https://github.com/apache/arrow/pull/7547#discussion_r445663612



##
File path: cpp/src/arrow/dataset/file_parquet.cc
##
@@ -571,6 +571,8 @@ static inline Result FileFromRowGroup(
   }
 }
 
+// TODO Is it possible to infer the file size and return a populated FileInfo?
+// This could avoid some spurious HEAD requests on S3 (ARROW-8950)

Review comment:
   If it's not easily accessible from FileMetaData or so then this probably 
warrants a custom metadata field when writing `_metadata`.

##
File path: cpp/src/arrow/dataset/discovery.cc
##
@@ -156,27 +177,26 @@ Result<std::shared_ptr<DatasetFactory>> FileSystemDatasetFactory::Make(
 
   ARROW_ASSIGN_OR_RAISE(auto files, filesystem->GetFileInfo(selector));
 
-  std::vector<std::string> paths;
+  // XXX We may avoid copying around vectors of FileInfo by filtering in place.
+  std::vector<fs::FileInfo> filtered_files;

Review comment:
   ```suggestion
  auto files_end = std::remove_if(files.begin(), files.end(), [&](const FileInfo& info) {
    if (!info.IsFile()) {
      return true;
    }

    if (StartsWithAnyOf(info.path(), options.selector_ignore_prefixes)) {
      return true;
    }

    return false;
  });
  files.erase(files_end, files.end());
   ```









[GitHub] [arrow] pitrou commented on pull request #7541: ARROW-9224: [Dev][Archery] Copy local repo on clone failure

2020-06-25 Thread GitBox


pitrou commented on pull request #7541:
URL: https://github.com/apache/arrow/pull/7541#issuecomment-649653387


   > What about simply replacing `git clone --local` with `git clone --shared`?
   
   +1 for me.







[GitHub] [arrow] jorisvandenbossche commented on pull request #7523: ARROW-8733: [Python][Dataset] Expose statistics of ParquetFileFragment::RowGroupInfo

2020-06-25 Thread GitBox


jorisvandenbossche commented on pull request #7523:
URL: https://github.com/apache/arrow/pull/7523#issuecomment-649644745


   @rjzamora indeed, something like that. I am not sure that you need to keep track of the path as well, unless maybe to keep it working with existing functions that determine sorted columns from this (but that's more something to discuss on the dask issue/PR).
   
   Now, in the meantime, @bkietz did the more proper implementation -> 
https://github.com/apache/arrow/pull/7546







[GitHub] [arrow] jorisvandenbossche closed pull request #7523: ARROW-8733: [Python][Dataset] Expose statistics of ParquetFileFragment::RowGroupInfo

2020-06-25 Thread GitBox


jorisvandenbossche closed pull request #7523:
URL: https://github.com/apache/arrow/pull/7523


   







[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7546: ARROW-8733: [C++][Dataset][Python] Expose RowGroupInfo statistics values

2020-06-25 Thread GitBox


jorisvandenbossche commented on a change in pull request #7546:
URL: https://github.com/apache/arrow/pull/7546#discussion_r445661743



##
File path: python/pyarrow/_dataset.pyx
##
@@ -845,6 +845,28 @@ cdef class RowGroupInfo:
     def num_rows(self):
         return self.info.num_rows()
 
+    @property
+    def statistics(self):
+        if not self.info.HasStatistics():
+            return None
+
+        cdef:
+            CStructScalar* c_statistics
+            CStructScalar* c_minmax
+
+        statistics = dict()
+        c_statistics = self.info.statistics().get()
+        for i in range(c_statistics.value.size()):
+            name = frombytes(c_statistics.type.get().field(i).get().name())
+            c_minmax = c_statistics.value[i].get()
+
+            statistics[name] = {
+                'min': pyarrow_wrap_scalar(c_minmax.value[0]).as_py(),
+                'max': pyarrow_wrap_scalar(c_minmax.value[1]).as_py(),
+            }
+
+        return statistics
+

Review comment:
   Do we want to expose the `statistics_expression` as well? (not fully 
sure if it would have a use case, so maybe we should only do that if we have 
one)









[GitHub] [arrow] pitrou commented on pull request #7547: ARROW-8950: [C++] Avoid HEAD when possible in S3 filesystem

2020-06-25 Thread GitBox


pitrou commented on pull request #7547:
URL: https://github.com/apache/arrow/pull/7547#issuecomment-649636318


   Note to self: instead of basing the default `OpenInputStream(const 
FileInfo&)` on `OpenInputStream(const std::string&)`, it would be more logical 
to do the reverse, so that no subclass needs to override both.
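   
   A sketch of that inversion in Python (illustrative only, not the actual C++ API; names are assumptions): the path-based method wraps the bare path in a FileInfo with unknown size and delegates, so a subclass such as S3 overrides a single method and can skip the stat/HEAD when the size is already known.
   
   ```python
   from dataclasses import dataclass
   from typing import Optional
   
   @dataclass
   class FileInfo:
       path: str
       size: Optional[int] = None  # unknown until stat'ed
   
   class FileSystem:
       def open_input_stream_from_info(self, info: FileInfo):
           # Subclasses override this one method; when info.size is already
           # known (e.g. from a directory listing), S3 can skip a HEAD request.
           raise NotImplementedError
   
       def open_input_stream(self, path: str):
           # Default delegates with an unknown size, so no subclass needs
           # to override both entry points.
           return self.open_input_stream_from_info(FileInfo(path))
   ```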







[GitHub] [arrow] github-actions[bot] commented on pull request #7547: ARROW-8950: [C++] Avoid HEAD when possible in S3 filesystem

2020-06-25 Thread GitBox


github-actions[bot] commented on pull request #7547:
URL: https://github.com/apache/arrow/pull/7547#issuecomment-649636224


   https://issues.apache.org/jira/browse/ARROW-8950







[GitHub] [arrow] github-actions[bot] commented on pull request #7546: ARROW-8733: [C++][Dataset][Python] Expose RowGroupInfo statistics values

2020-06-25 Thread GitBox


github-actions[bot] commented on pull request #7546:
URL: https://github.com/apache/arrow/pull/7546#issuecomment-649636223


   https://issues.apache.org/jira/browse/ARROW-8733







[GitHub] [arrow] github-actions[bot] commented on pull request #7545: ARROW-9139: [Python] Switch parquet.read_table to use new datasets API by default

2020-06-25 Thread GitBox


github-actions[bot] commented on pull request #7545:
URL: https://github.com/apache/arrow/pull/7545#issuecomment-649636228


   https://issues.apache.org/jira/browse/ARROW-9139







[GitHub] [arrow] pitrou commented on pull request #7547: ARROW-8950: [C++] Avoid HEAD when possible in S3 filesystem

2020-06-25 Thread GitBox


pitrou commented on pull request #7547:
URL: https://github.com/apache/arrow/pull/7547#issuecomment-649634658


   @bkietz Would like your input on the dataset changes.







[GitHub] [arrow] pitrou opened a new pull request #7547: ARROW-8950: [C++] Avoid HEAD when possible in S3 filesystem

2020-06-25 Thread GitBox


pitrou opened a new pull request #7547:
URL: https://github.com/apache/arrow/pull/7547


   Add FileSystem::OpenInput{Stream,File} overrides that accept a FileInfo parameter.
   This can be used to optimize file opening when the file size and existence are already known. Concretely, this avoids a HEAD request in S3.







[GitHub] [arrow] bkietz opened a new pull request #7546: ARROW-8733: [C++][Dataset][Python] Expose RowGroupInfo statistics values

2020-06-25 Thread GitBox


bkietz opened a new pull request #7546:
URL: https://github.com/apache/arrow/pull/7546


   ```python
   stats = parquet_fragment.row_groups[0].statistics
   assert stats == {
 'normal_column': {'min': 1, 'max': 2},
 'all_null_column': {'min': None, 'max': None},
 'column_without_stats': None,
   }
   ```
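   
   A hedged sketch of reaching those fragments through the dataset API (the path is illustrative, and depending on the version the fragment may need its row groups populated first):
   
   ```python
   import pyarrow.dataset as ds
   
   dataset = ds.dataset("data/", format="parquet")
   for fragment in dataset.get_fragments():
       for rg in fragment.row_groups or []:
           print(fragment.path, rg.id, rg.num_rows, rg.statistics)
   ```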







[GitHub] [arrow] jorisvandenbossche commented on pull request #7545: ARROW-9139: [Python] Switch parquet.read_table to use new datasets API by default

2020-06-25 Thread GitBox


jorisvandenbossche commented on pull request #7545:
URL: https://github.com/apache/arrow/pull/7545#issuecomment-649631989


   Still some work:
   
   Need to add tests for the different filesystems that can be passed.
   
   There are still some skipped tests:
   
   * `ARROW:schema` is not yet removed from the metadata -> ARROW-9009
   * Partition fields as dictionary keys
   * Specifying `metadata` object (not very important IMO)
   
   One of the `large_memory` tests is also failing 
(`test_binary_array_overflow_to_chunked`):
   
   ```
   $ pytest python/pyarrow/tests/test_parquet.py -v -r s -m large_memory 
--enable-large_memory
   
   ============================= test session starts ==============================
   platform linux -- Python 3.7.3, pytest-5.2.1, py-1.8.0, pluggy-0.12.0 -- 
/home/joris/miniconda3/envs/arrow-dev/bin/python
   cachedir: .pytest_cache
   hypothesis profile 'dev' -> max_examples=10, 
database=DirectoryBasedExampleDatabase('/home/joris/scipy/repos/arrow/.hypothesis/examples')
   rootdir: /home/joris/scipy/repos/arrow/python, inifile: setup.cfg
   plugins: hypothesis-4.47.5, lazy-fixture-0.6.1
   collected 277 items / 273 deselected / 4 selected

 
   
   python/pyarrow/tests/test_parquet.py::test_large_table_int32_overflow PASSED [ 25%]
   python/pyarrow/tests/test_parquet.py::test_byte_array_exactly_2gb PASSED [ 50%]
   python/pyarrow/tests/test_parquet.py::test_binary_array_overflow_to_chunked FAILED [ 75%]
   python/pyarrow/tests/test_parquet.py::test_list_of_binary_large_cell PASSED [100%]
   
   

   =================================== FAILURES ===================================
   _____________________ test_binary_array_overflow_to_chunked _____________________
   
   assert t.equals(result)
   
   
   @pytest.mark.pandas
   @pytest.mark.large_memory
   def test_binary_array_overflow_to_chunked():
   # ARROW-3762
   
   # 2^31 + 1 bytes
   values = [b'x'] + [
   b'x' * (1 << 20)
   ] * 2 * (1 << 10)
   >   df = pd.DataFrame({'byte_col': values})
   
   python/pyarrow/tests/test_parquet.py:3043: 
   _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
   python/pyarrow/tests/test_parquet.py:3010: in _simple_table_roundtrip
   stream = pa.BufferOutputStream()
   python/pyarrow/tests/test_parquet.py:82: in _read_table
   return pq.read_table(*args, **kwargs)
   python/pyarrow/parquet.py:1555: in read_table
   raise ValueError(
   python/pyarrow/parquet.py:1468: in read
   use_threads=use_threads
   pyarrow/_dataset.pyx:403: in pyarrow._dataset.Dataset.to_table
   ???
   pyarrow/_dataset.pyx:1893: in pyarrow._dataset.Scanner.to_table
   ???
   pyarrow/error.pxi:122: in pyarrow.lib.pyarrow_internal_check_status
   ???
   _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
   
   >   ???
   E   pyarrow.lib.ArrowNotImplementedError: This class cannot yet iterate chunked arrays
   
   pyarrow/error.pxi:105: ArrowNotImplementedError
   
   ========== 1 failed, 3 passed, 273 deselected in 512.87s (0:08:32) ===========
   ```




[GitHub] [arrow] jorisvandenbossche opened a new pull request #7545: ARROW-9139: [Python] Switch parquet.read_table to use new datasets API by default

2020-06-25 Thread GitBox


jorisvandenbossche opened a new pull request #7545:
URL: https://github.com/apache/arrow/pull/7545


   







[GitHub] [arrow] liyafan82 opened a new pull request #7544: ARROW-7285: [C++] ensure C++ implementation meets clarified dictionary spec

2020-06-25 Thread GitBox


liyafan82 opened a new pull request #7544:
URL: https://github.com/apache/arrow/pull/7544


   https://issues.apache.org/jira/browse/ARROW-7285







[GitHub] [arrow] wesm commented on pull request #7537: ARROW-842: [Python] Recognize pandas.NaT as null when converting object arrays with from_pandas=True

2020-06-25 Thread GitBox


wesm commented on pull request #7537:
URL: https://github.com/apache/arrow/pull/7537#issuecomment-649612663


   +1, will merge this once build is green







[GitHub] [arrow] fsaintjacques commented on pull request #7536: ARROW-8647: [C++][Python][Dataset] Allow partitioning fields to be inferred with dictionary type

2020-06-25 Thread GitBox


fsaintjacques commented on pull request #7536:
URL: https://github.com/apache/arrow/pull/7536#issuecomment-649607214


   I'm also of the opinion that we should stick with int32_t. That's what Parquet uses for dictionary columns, that's what R uses for factor columns, that's what we use by default in DictType, etc. I suspect the short-term net effect of this will be uncovering index-type issues we have lying around.







[GitHub] [arrow] bkietz commented on a change in pull request #7534: ARROW-8729: [C++][Dataset] Ensure non-empty batches when only virtual columns are projected

2020-06-25 Thread GitBox


bkietz commented on a change in pull request #7534:
URL: https://github.com/apache/arrow/pull/7534#discussion_r445625710



##
File path: cpp/src/parquet/arrow/reader.cc
##
@@ -338,22 +348,39 @@ class RowGroupRecordBatchReader : public ::arrow::RecordBatchReader {
     // TODO (hatemhelal): Consider refactoring this to share logic with ReadTable as this
     // does not currently honor the use_threads option.
     std::vector<std::shared_ptr<ChunkedArray>> columns(field_readers_.size());
-    for (size_t i = 0; i < field_readers_.size(); ++i) {
-      RETURN_NOT_OK(field_readers_[i]->NextBatch(batch_size_, &columns[i]));
-      if (columns[i]->num_chunks() > 1) {
-        return Status::NotImplemented("This class cannot yet iterate chunked arrays");
+    int64_t num_rows = -1;
+
+    if (columns.empty()) {
+      // num_rows cannot be derived from field_readers_ so compute it using
+      // row group sizes cached from metadata
+      num_rows = std::min(batch_size_, *row_group_remaining_size_);
+      *row_group_remaining_size_ -= num_rows;
+      if (*row_group_remaining_size_ == 0) {
+        ++row_group_remaining_size_;
       }
+    } else {
+      for (size_t i = 0; i < field_readers_.size(); ++i) {
+        RETURN_NOT_OK(field_readers_[i]->NextBatch(batch_size_, &columns[i]));
+        if (columns[i]->num_chunks() > 1) {
+          return Status::NotImplemented("This class cannot yet iterate chunked arrays");
+        }
+      }
+      num_rows = columns[0]->length();
     }
 
     // Create an intermediate table and use TableBatchReader as an adaptor to a
     // RecordBatch
-    std::shared_ptr<Table> table = Table::Make(schema_, columns);
+    std::shared_ptr<Table> table = Table::Make(schema_, columns, num_rows);
+
     RETURN_NOT_OK(table->Validate());
     ::arrow::TableBatchReader table_batch_reader(*table);
     return table_batch_reader.ReadNext(out);
   }
 
  private:
+  std::shared_ptr<FileMetaData> metadata_;

Review comment:
   I'll remove it and rebase









[GitHub] [arrow] wesm closed pull request #7542: ARROW-9225: [C++][Compute] Speed up counting sort

2020-06-25 Thread GitBox


wesm closed pull request #7542:
URL: https://github.com/apache/arrow/pull/7542


   







[GitHub] [arrow] wesm commented on pull request #7542: ARROW-9225: [C++][Compute] Speed up counting sort

2020-06-25 Thread GitBox


wesm commented on pull request #7542:
URL: https://github.com/apache/arrow/pull/7542#issuecomment-649601444


   Here's my benchmarks on i9-9960X
   
   ```
   $ archery benchmark diff --cc=gcc-8 --cxx=g++-8 cyb70289/sort master --suite-filter=vector-sort
                                              benchmark         baseline        contender  change %                                      counters
   12   SortToIndicesInt64Count/32768/10/min_time:1.000    1.233 GiB/sec    2.062 GiB/sec    67.239   {'iterations': 56417, 'null_percent': 10.0}
   1   SortToIndicesInt64Count/32768/100/min_time:1.000    1.605 GiB/sec    2.033 GiB/sec    26.628    {'iterations': 73247, 'null_percent': 1.0}
   3     SortToIndicesInt64Count/32768/0/min_time:1.000    2.438 GiB/sec    2.850 GiB/sec    16.911   {'iterations': 111533, 'null_percent': 0.0}
   7     SortToIndicesInt64Count/32768/2/min_time:1.000  805.132 MiB/sec  854.275 MiB/sec     6.104   {'iterations': 36039, 'null_percent': 50.0}
   11 SortToIndicesInt64Compare/32768/100/min_time:1.000  147.729 MiB/sec  149.201 MiB/sec     0.996     {'iterations': 6588, 'null_percent': 1.0}
   2    SortToIndicesInt64Count/1048576/1/min_time:1.000    4.559 GiB/sec    4.580 GiB/sec     0.458   {'iterations': 6503, 'null_percent': 100.0}
   5  SortToIndicesInt64Compare/1048576/1/min_time:1.000    4.554 GiB/sec    4.560 GiB/sec     0.136   {'iterations': 6577, 'null_percent': 100.0}
   0    SortToIndicesInt64Compare/32768/1/min_time:1.000  148.181 MiB/sec  148.368 MiB/sec     0.126    {'iterations': 6659, 'null_percent': 0.01}
   9    SortToIndicesInt64Compare/32768/2/min_time:1.000  245.628 MiB/sec  245.887 MiB/sec     0.106   {'iterations': 11171, 'null_percent': 50.0}
   6   SortToIndicesInt64Compare/32768/10/min_time:1.000  156.791 MiB/sec  156.841 MiB/sec     0.032    {'iterations': 7073, 'null_percent': 10.0}
   4    SortToIndicesInt64Count/8388608/1/min_time:1.000    4.276 GiB/sec    4.269 GiB/sec    -0.155    {'iterations': 772, 'null_percent': 100.0}
   10 SortToIndicesInt64Compare/8388608/1/min_time:1.000    4.286 GiB/sec    4.273 GiB/sec    -0.305    {'iterations': 769, 'null_percent': 100.0}
   8    SortToIndicesInt64Compare/32768/1/min_time:1.000    4.677 GiB/sec    4.648 GiB/sec    -0.634  {'iterations': 209517, 'null_percent': 100.0}
   14     SortToIndicesInt64Count/32768/1/min_time:1.000    4.661 GiB/sec    4.617 GiB/sec    -0.945  {'iterations': 214256, 'null_percent': 100.0}
   15   SortToIndicesInt64Compare/32768/0/min_time:1.000  151.334 MiB/sec  147.025 MiB/sec    -2.847     {'iterations': 6660, 'null_percent': 0.0}
   13     SortToIndicesInt64Count/32768/1/min_time:1.000    2.312 GiB/sec    2.131 GiB/sec    -7.844   {'iterations': 91492, 'null_percent': 0.01}
   ```
   
   and clang-11
   
   ```
                                              benchmark         baseline        contender  change %                                      counters
   15    SortToIndicesInt64Count/32768/0/min_time:1.000    2.069 GiB/sec    3.490 GiB/sec    68.665    {'iterations': 95577, 'null_percent': 0.0}
   13   SortToIndicesInt64Count/32768/10/min_time:1.000    1.288 GiB/sec    1.895 GiB/sec    47.099   {'iterations': 59504, 'null_percent': 10.0}
   10  SortToIndicesInt64Count/32768/100/min_time:1.000    1.551 GiB/sec    1.993 GiB/sec    28.456    {'iterations': 71755, 'null_percent': 1.0}
   2     SortToIndicesInt64Count/32768/1/min_time:1.000    1.975 GiB/sec    2.069 GiB/sec     4.771   {'iterations': 87376, 'null_percent': 0.01}
   3   SortToIndicesInt64Compare/32768/0/min_time:1.000  155.058 MiB/sec  158.211 MiB/sec     2.033     {'iterations': 6935, 'null_percent': 0.0}
   9   SortToIndicesInt64Compare/32768/2/min_time:1.000  248.515 MiB/sec  252.036 MiB/sec     1.417    {'iterations': 11676, 'null_percent': 50.0}
   0   SortToIndicesInt64Compare/32768/1/min_time:1.000    4.678 GiB/sec    4.735 GiB/sec     1.222  {'iterations': 214204, 'null_percent': 100.0}
   8     SortToIndicesInt64Count/32768/1/min_time:1.000    4.677 GiB/sec    4.729 GiB/sec     1.110  {'iterations': 213435, 'null_percent': 100.0}
   5   SortToIndicesInt64Compare/32768/1/min_time:1.000  151.573 MiB/sec  151.744 MiB/sec     0.113    {'iterations': 6769, 'null_percent': 0.01}
   1  SortToIndicesInt64Compare/32768/10/min_time:1.000  162.339 MiB/sec  162.226 MiB/sec    -0.069    {'iterations': 7284, 'null_percent': 10.0}
   14 SortToIndicesInt64Compare/32768/100/min_time:1.000  151.879 MiB/sec  151.589 MiB/sec    -0.191     {'iterations': 6837, 'null_percent': 1.0}
   7  SortToIndicesInt64Compare/1048576/1/min_time:1.000    4.570 GiB/sec    4.553 GiB/sec    -0.383   {'iterations': 6579, 'null_percent': 100.0}
   6    SortToIndicesInt64Count/1048576/1/min_time:1.000    4.569 GiB/sec    4.550 GiB/sec    -0.407   {'iterations': 6459,

[GitHub] [arrow] fsaintjacques commented on pull request #7526: ARROW-9146: [C++][Dataset] Lazily store fragment physical schema

2020-06-25 Thread GitBox


fsaintjacques commented on pull request #7526:
URL: https://github.com/apache/arrow/pull/7526#issuecomment-649600834


   The test is failing on Windows: https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/33736283/job/mshkl837y3b5v5u6







[GitHub] [arrow] fsaintjacques commented on pull request #7526: ARROW-9146: [C++][Dataset] Lazily store fragment physical schema

2020-06-25 Thread GitBox


fsaintjacques commented on pull request #7526:
URL: https://github.com/apache/arrow/pull/7526#issuecomment-649600179


   Correct, we don't support the file changing underneath. For now we can 
preach YAGNI on this as it is a rare use case. If this is required, the 
consumer can create a custom dataset/fragment.







[GitHub] [arrow] cyb70289 commented on pull request #7541: ARROW-9224: [Dev][Archery] Copy local repo on clone failure

2020-06-25 Thread GitBox


cyb70289 commented on pull request #7541:
URL: https://github.com/apache/arrow/pull/7541#issuecomment-649595034


   > Can you try using `git clone --shared` instead? It should avoid the copy.
   
   `--shared` works okay per my test. What about simply replacing `git clone --local` with `git clone --shared`?
   From the man page, the only **danger** of `--shared` is if I delete and prune the test branch while a related test is running, which does not look like a real concern.
   
   ```
      --shared, -s
          When the repository to clone is on the local machine, instead of using hard links, automatically setup .git/objects/info/alternates to share the
          objects with the source repository. The resulting repository starts out without any object of its own.
   
          NOTE: this is a possibly dangerous operation; do not use it unless you understand what it does. If you clone your repository using this option
          and then delete branches (or use any other Git command that makes any existing commit unreferenced) in the source repository, some objects may
          become unreferenced (or dangling). These objects may be removed by normal Git operations (such as git commit) which automatically call git gc
          --auto. (See git-gc(1).) If these objects are removed and were referenced by the cloned repository, then the cloned repository will become
          corrupt.
   ```







[GitHub] [arrow] fsaintjacques commented on a change in pull request #7534: ARROW-8729: [C++][Dataset] Ensure non-empty batches when only virtual columns are projected

2020-06-25 Thread GitBox


fsaintjacques commented on a change in pull request #7534:
URL: https://github.com/apache/arrow/pull/7534#discussion_r445605965



##
File path: cpp/src/parquet/arrow/reader.cc
##
@@ -338,22 +348,39 @@ class RowGroupRecordBatchReader : public ::arrow::RecordBatchReader {
     // TODO (hatemhelal): Consider refactoring this to share logic with ReadTable as this
     // does not currently honor the use_threads option.
     std::vector<std::shared_ptr<ChunkedArray>> columns(field_readers_.size());
-    for (size_t i = 0; i < field_readers_.size(); ++i) {
-      RETURN_NOT_OK(field_readers_[i]->NextBatch(batch_size_, &columns[i]));
-      if (columns[i]->num_chunks() > 1) {
-        return Status::NotImplemented("This class cannot yet iterate chunked arrays");
+    int64_t num_rows = -1;
+
+    if (columns.empty()) {
+      // num_rows cannot be derived from field_readers_ so compute it using
+      // row group sizes cached from metadata
+      num_rows = std::min(batch_size_, *row_group_remaining_size_);
+      *row_group_remaining_size_ -= num_rows;
+      if (*row_group_remaining_size_ == 0) {
+        ++row_group_remaining_size_;
      }
+    } else {
+      for (size_t i = 0; i < field_readers_.size(); ++i) {
+        RETURN_NOT_OK(field_readers_[i]->NextBatch(batch_size_, &columns[i]));
+        if (columns[i]->num_chunks() > 1) {
+          return Status::NotImplemented("This class cannot yet iterate chunked arrays");
+        }
+      }
+      num_rows = columns[0]->length();
     }
 
     // Create an intermediate table and use TableBatchReader as an adaptor to a
     // RecordBatch
-    std::shared_ptr<Table> table = Table::Make(schema_, columns);
+    std::shared_ptr<Table> table = Table::Make(schema_, columns, num_rows);
+
     RETURN_NOT_OK(table->Validate());
     ::arrow::TableBatchReader table_batch_reader(*table);
     return table_batch_reader.ReadNext(out);
   }
 
  private:
+  std::shared_ptr<FileMetaData> metadata_;

Review comment:
   `metadata_` is not used.









[GitHub] [arrow] kszucs commented on pull request #7519: ARROW-9017: [C++][Python] Refactor scalar bindings

2020-06-25 Thread GitBox


kszucs commented on pull request #7519:
URL: https://github.com/apache/arrow/pull/7519#issuecomment-649590873


   > Really nice!
   > 
   > Could we add a `is_valid` attribute to the python scalar as well? Now the 
only way to check for a null value is to do `.as_py() is None` ?
   
   Definitely!
   
   > 
   > Might not be related to the changes in this PR, but reviewing this 
triggered me to test scalar casting:
   > 
   > ```
   > In [2]: s = pa.scalar(pd.Timestamp("2012-01-01")) 
   >...:  
   >...: import pyarrow.compute as pc 
   >...: pc.cast(s, pa.timestamp('ns'))  
   > ../src/arrow/compute/kernels/scalar_cast_temporal.cc:130:  Check failed: 
(batch[0].kind()) == (Datum::ARRAY) 
   > ...
   > Aborted (core dumped)
   > ```
   I think the cast kernel doesn't support scalars yet; on the other hand, the scalars have a custom `CastTo` implementation which we may want to remove in favor of `compute.Cast`.
   
   cc @bkietz 
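   
   For context, the null-check ergonomics being discussed, as a sketch (`is_valid` is the proposed attribute, not yet present in this PR):
   
   ```python
   import pyarrow as pa
   
   s = pa.scalar(None)        # a NullScalar
   assert s.as_py() is None   # today's only way to check for null
   
   # Proposed: a direct validity flag on every scalar.
   # assert not s.is_valid
   ```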
   







[GitHub] [arrow] kszucs commented on a change in pull request #7519: ARROW-9017: [C++][Python] Refactor scalar bindings

2020-06-25 Thread GitBox


kszucs commented on a change in pull request #7519:
URL: https://github.com/apache/arrow/pull/7519#discussion_r445603562



##
File path: python/pyarrow/util.py
##
@@ -41,6 +41,24 @@ def wrapper(*args, **kwargs):
     return wrapper
 
 
+def _deprecate_class(old_name, new_class, next_version,
+                     instancecheck=True):
+    """
+    Raise warning if a deprecated class is used in an isinstance check.

Review comment:
   The deprecated classes cannot be instantiated, so we don't need to worry 
about it for now - although we can add support for overriding `__init__` for 
later reuse.









[GitHub] [arrow] kszucs commented on a change in pull request #7519: ARROW-9017: [C++][Python] Refactor scalar bindings

2020-06-25 Thread GitBox


kszucs commented on a change in pull request #7519:
URL: https://github.com/apache/arrow/pull/7519#discussion_r445604627



##
File path: python/pyarrow/tests/test_scalars.py
##
@@ -17,426 +17,395 @@
 
 import datetime
 import pytest
-import unittest
 
 import numpy as np
 
 import pyarrow as pa
 
 
-class TestScalars(unittest.TestCase):
-
-    def test_null_singleton(self):
-        with pytest.raises(Exception):
-            pa.NAType()
+@pytest.mark.parametrize(['value', 'ty', 'klass', 'deprecated'], [
+    (None, None, pa.NullScalar, pa.NullType),
+    (False, None, pa.BooleanScalar, pa.BooleanValue),
+    (True, None, pa.BooleanScalar, pa.BooleanValue),
+    (1, None, pa.Int64Scalar, pa.Int64Value),
+    (-1, None, pa.Int64Scalar, pa.Int64Value),
+    (1, pa.int8(), pa.Int8Scalar, pa.Int8Value),
+    (1, pa.uint8(), pa.UInt8Scalar, pa.UInt8Value),
+    (1, pa.int16(), pa.Int16Scalar, pa.Int16Value),
+    (1, pa.uint16(), pa.UInt16Scalar, pa.UInt16Value),
+    (1, pa.int32(), pa.Int32Scalar, pa.Int32Value),
+    (1, pa.uint32(), pa.UInt32Scalar, pa.UInt32Value),
+    (1, pa.int64(), pa.Int64Scalar, pa.Int64Value),
+    (1, pa.uint64(), pa.UInt64Scalar, pa.UInt64Value),
+    (1.0, None, pa.DoubleScalar, pa.DoubleValue),
+    (np.float16(1.0), pa.float16(), pa.HalfFloatScalar, pa.HalfFloatValue),
+    (1.0, pa.float32(), pa.FloatScalar, pa.FloatValue),
+    ("string", None, pa.StringScalar, pa.StringValue),
+    (b"bytes", None, pa.BinaryScalar, pa.BinaryValue),
+    ([1, 2, 3], None, pa.ListScalar, pa.ListValue),
+    ([1, 2, 3, 4], pa.large_list(pa.int8()), pa.LargeListScalar,
+     pa.LargeListValue),
+    # date
+    # time
+])
+def test_type_inference(value, ty, klass, deprecated):
+    s = pa.scalar(value, type=ty)
+    assert isinstance(s, klass)
+    assert s == value
+    with pytest.warns(FutureWarning):
+        isinstance(s, deprecated)
+
+
+def test_null_singleton():
+    with pytest.raises(Exception):
+        pa.NullScalar()
+
+
+def test_nulls():
+    arr = pa.array([None, None])
+    for v in arr:
+        assert v is pa.NA
+        assert v.as_py() is None
+
+
+def test_null_equality():
+    assert (pa.NA == pa.NA) is pa.NA
+    assert (pa.NA == 1) is pa.NA

Review comment:
   Great question, I assume it should follow the same semantics as `Null` 
does.









[GitHub] [arrow] wesm commented on pull request #7143: ARROW-8504: [C++] Add BitRunReader and use it in parquet

2020-06-25 Thread GitBox


wesm commented on pull request #7143:
URL: https://github.com/apache/arrow/pull/7143#issuecomment-649585804


   +1, thanks for fixing this @emkornfield!







[GitHub] [arrow] wesm closed pull request #7143: ARROW-8504: [C++] Add BitRunReader and use it in parquet

2020-06-25 Thread GitBox


wesm closed pull request #7143:
URL: https://github.com/apache/arrow/pull/7143


   







[GitHub] [arrow] wesm commented on pull request #7143: ARROW-8504: [C++] Add BitRunReader and use it in parquet

2020-06-25 Thread GitBox


wesm commented on pull request #7143:
URL: https://github.com/apache/arrow/pull/7143#issuecomment-649585468


   FWIW gcc benchmarks (sse4.2) on my machine (ubuntu 18.04 on i9-9960X)
   
   ```
   $ archery benchmark diff --cc=gcc-8 --cxx=g++-8 emkornfield/ARROW-8504 master --suite-filter=parquet-arrow
      benchmark                      baseline         contender        change %  counters
   35 BM_ReadColumn/75/1             267.930 MiB/sec  275.083 MiB/sec     2.670  {'iterations': 2}
   6  BM_ReadColumn/99/0             238.381 MiB/sec  244.286 MiB/sec     2.477  {'iterations': 3}
   10 BM_ReadColumn/10/50            270.775 MiB/sec  275.249 MiB/sec     1.652  {'iterations': 2}
   31 BM_ReadColumn/25/5             171.549 MiB/sec  174.237 MiB/sec     1.567  {'iterations': 2}
   29 BM_ReadColumn/-1/1             856.194 MiB/sec  866.945 MiB/sec     1.256  {'iterations': 11}
   7  BM_WriteColumn                  31.593 MiB/sec   31.845 MiB/sec     0.797  {'iterations': 1}
   44 BM_ReadIndividualRowGroups     849.469 MiB/sec  856.209 MiB/sec     0.793  {'iterations': 6}
   36 BM_ReadColumn/-1/0             408.388 MiB/sec  411.490 MiB/sec     0.759  {'iterations': 3}
   3  BM_ReadColumn/50/0             151.600 MiB/sec  152.399 MiB/sec     0.527  {'iterations': 2}
   32 BM_ReadColumn/30/10            268.512 MiB/sec  269.849 MiB/sec     0.498  {'iterations': 2}
   8  BM_ReadColumn/-1/20            772.878 MiB/sec  776.716 MiB/sec     0.497  {'iterations': 5}
   17 BM_ReadColumn/25/25            277.705 MiB/sec  278.957 MiB/sec     0.451  {'iterations': 2}
   23 BM_ReadColumn/-1/1               1.472 GiB/sec    1.478 GiB/sec     0.414  {'iterations': 13}
   30 BM_ReadColumn/1/1              235.809 MiB/sec  236.730 MiB/sec     0.391  {'iterations': 3}
   45 BM_WriteColumn                  45.067 MiB/sec   45.232 MiB/sec     0.366  {'iterations': 1}
   41 BM_ReadColumn/45/25            241.545 MiB/sec  242.427 MiB/sec     0.365  {'iterations': 2}
   40 BM_ReadColumn/-1/10            710.485 MiB/sec  712.181 MiB/sec     0.239  {'iterations': 5}
   22 BM_ReadColumn/-1/0             211.939 MiB/sec  212.433 MiB/sec     0.233  {'iterations': 15}
   39 BM_ReadColumn/-1/50            629.632 MiB/sec  630.955 MiB/sec     0.210  {'iterations': 9}
   15 BM_ReadColumn/1/20             145.490 MiB/sec  145.701 MiB/sec     0.145  {'iterations': 10}
   9  BM_ReadColumn/-1/1             162.373 MiB/sec  162.497 MiB/sec     0.077  {'iterations': 4}
   11 BM_ReadColumn/-1/0             257.190 MiB/sec  257.278 MiB/sec     0.034  {'iterations': 3}
   13 BM_WriteColumn                  32.378 MiB/sec   32.386 MiB/sec     0.025  {'iterations': 1}
   37 BM_ReadColumn/99/50            238.918 MiB/sec  238.975 MiB/sec     0.024  {'iterations': 3}
   20 BM_WriteColumn                  54.200 MiB/sec   54.209 MiB/sec     0.017  {'iterations': 1}
   19 BM_ReadColumn/1/1              369.522 MiB/sec  369.199 MiB/sec    -0.087  {'iterations': 2}
   0  BM_ReadColumn/-1/50              1.144 GiB/sec    1.143 GiB/sec    -0.090  {'iterations': 10}
   34 BM_ReadColumn/35/10            262.271 MiB/sec  261.890 MiB/sec    -0.145  {'iterations': 2}
   21 BM_ReadColumn/50/1             242.510 MiB/sec  242.067 MiB/sec    -0.183  {'iterations': 2}
   5  BM_ReadColumn/50/50            152.344 MiB/sec  152.000 MiB/sec    -0.225  {'iterations': 2}
   1  BM_ReadColumn/10/10            179.063 MiB/sec  178.583 MiB/sec    -0.268  {'iterations': 2}
   28 BM_WriteColumn                  64.162 MiB/sec   63.990 MiB/sec    -0.269  {'iterations': 1}
   38 BM_WriteColumn                  69.641 MiB/sec   69.446 MiB/sec    -0.280  {'iterations': 1}
   18 BM_WriteColumn                  26.643 MiB/sec   26.557 MiB/sec    -0.324  {'iterations': 2}
   16 BM_ReadColumn/-1/0               2.291 GiB/sec    2.281 GiB/sec    -0.397  {'iterations': 20}
   43 BM_WriteColumn                  75.453 MiB/sec   75.050 MiB/sec    -0.534  {'iterations': 1}
   33 BM_ReadColumn/5/10             115.534 MiB/sec  114.574 MiB/sec    -0.832  {'iterations': 3}
   14 BM_ReadColumn/50/50            242.071 MiB/sec  240.005 MiB/sec    -0.854  {'iterations': 2}
   4  BM_ReadMultipleRowGroups       826.026 MiB/sec  818.709 MiB/sec    -0.886  {'iterations': 5}
   24 BM_ReadColumn/-1/10            386.383 MiB/sec  382.213 MiB/sec    -1.079  {'iterations': 6}
   12 BM_ReadColumn/-1/0             410.877 MiB/sec  404.700 MiB/sec    -1.503  {'iterations': 3}
   25 BM_ReadColumn/10/5             289.612 MiB/sec  284.777 MiB/sec    -1.670  {'iterations': 2}
   2  BM_ReadColumn/25/10            277.045 MiB/sec  271.925 MiB/sec    -1.848  {'iterations': 2}
   26 BM_ReadColumn/5/5              311.137 MiB/sec  295.823 MiB/sec    -4.922  {'iterations': 2}
   42 BM_ReadColumn/99/0             371.803 MiB/sec  347.872 MiB/sec    -6.436  {'iterations': 2}
   27 BM_ReadColumn/99/50            381.029 MiB/sec  351.329 MiB/sec    -7.795  {'iterations': 3}
   ```
   
   Here's clang-11
   
   ```
   $ archery benchmark diff --cc=clang-11 --cxx=clang++-11 emkornfield/ARROW-8504 master --suite-filter=parquet-arrow
      benchmark  baseline  contender  change %

[GitHub] [arrow] kszucs commented on a change in pull request #7519: ARROW-9017: [C++][Python] Refactor scalar bindings

2020-06-25 Thread GitBox


kszucs commented on a change in pull request #7519:
URL: https://github.com/apache/arrow/pull/7519#discussion_r445603562



##
File path: python/pyarrow/util.py
##
@@ -41,6 +41,24 @@ def wrapper(*args, **kwargs):
 return wrapper
 
 
+def _deprecate_class(old_name, new_class, next_version,
+ instancecheck=True):
+"""
+Raise warning if a deprecated class is used in an isinstance check.

Review comment:
   > Really nice!
   > 
   > Could we add an `is_valid` attribute to the python scalar as well? Now the 
only way to check for a null value is to do `.as_py() is None` ?
   
   Definitely!
   
   > 
   > Might not be related to the changes in this PR, but reviewing this 
triggered me to test scalar casting:
   > 
   > ```
   > In [2]: s = pa.scalar(pd.Timestamp("2012-01-01")) 
   >...:  
   >...: import pyarrow.compute as pc 
   >...: pc.cast(s, pa.timestamp('ns'))  
   > ../src/arrow/compute/kernels/scalar_cast_temporal.cc:130:  Check failed: 
(batch[0].kind()) == (Datum::ARRAY) 
   > ...
   > Aborted (core dumped)
   > ```
   I think the cast kernel doesn't support scalars yet, on the other hand the 
scalars have custom `CastTo` implementation which we may want to remove in 
favor of `compute.Cast`?
   
   cc @bkietz 
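   
   Until then, a possible workaround (a sketch only, assuming the `pa.scalar` 
and `pc.cast` from this branch) is to route the scalar through a length-1 
array, since the cast kernels do accept arrays:
   
   ```python
   import pandas as pd
   import pyarrow as pa
   import pyarrow.compute as pc
   
   s = pa.scalar(pd.Timestamp("2012-01-01"))
   # wrap the scalar in a one-element array, cast, then pull the scalar back out
   result = pc.cast(pa.array([s.as_py()]), pa.timestamp('ns'))[0]
   ```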
   
   





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on issue #1771: pyarrow-- reading selected columns from multiindex parquet file

2020-06-25 Thread GitBox


wesm commented on issue #1771:
URL: https://github.com/apache/arrow/issues/1771#issuecomment-649581184


   Please direct questions to u...@arrow.apache.org or d...@arrow.apache.org. 
Thank you



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #7456: ARROW-9106: [Python] Allow specifying CSV file encoding

2020-06-25 Thread GitBox


wesm commented on pull request #7456:
URL: https://github.com/apache/arrow/pull/7456#issuecomment-649580251


   Looks good here. Merging



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm closed pull request #7456: ARROW-9106: [Python] Allow specifying CSV file encoding

2020-06-25 Thread GitBox


wesm closed pull request #7456:
URL: https://github.com/apache/arrow/pull/7456


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #7534: ARROW-8729: [C++][Dataset] Ensure non-empty batches when only virtual columns are projected

2020-06-25 Thread GitBox


wesm commented on pull request #7534:
URL: https://github.com/apache/arrow/pull/7534#issuecomment-649579503


   That crash is ARROW-8999. I'm fairly confident it's a real bug given that it 
happens about 1-5% of the time



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm edited a comment on pull request #7534: ARROW-8729: [C++][Dataset] Ensure non-empty batches when only virtual columns are projected

2020-06-25 Thread GitBox


wesm edited a comment on pull request #7534:
URL: https://github.com/apache/arrow/pull/7534#issuecomment-649579503


   That crash is ARROW-8999. I'm fairly confident it's a real bug given that it 
happens about 1-5% of the time



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on a change in pull request #7537: ARROW-842: [Python] Recognize pandas.NaT as null when converting object arrays with from_pandas=True

2020-06-25 Thread GitBox


wesm commented on a change in pull request #7537:
URL: https://github.com/apache/arrow/pull/7537#discussion_r445594104



##
File path: cpp/src/arrow/python/helpers.cc
##
@@ -254,14 +255,45 @@ bool PyFloat_IsNaN(PyObject* obj) {
   return PyFloat_Check(obj) && std::isnan(PyFloat_AsDouble(obj));
 }
 
+namespace {
+
+static std::once_flag pandas_static_initialized;
+static PyTypeObject* pandas_NaTType = nullptr;
+
+void GetPandasStaticSymbols() {
+  OwnedRef pandas;
+  Status s = ImportModule("pandas", &pandas);
+  if (!s.ok()) {
+    return;
+  }
+
+  OwnedRef nat_value;
+  s = ImportFromModule(pandas.obj(), "NaT", &nat_value);
+  if (!s.ok()) {
+    return;
+  }
+  PyObject* nat_type = PyObject_Type(nat_value.obj());
+  pandas_NaTType = reinterpret_cast<PyTypeObject*>(nat_type);
+
+  // PyObject_Type returns a new reference but we trust that pandas.NaT will
+  // outlive our use of this PyObject*
+  Py_DECREF(nat_type);
+}

Review comment:
   Sure, I'll add it





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on a change in pull request #7537: ARROW-842: [Python] Recognize pandas.NaT as null when converting object arrays with from_pandas=True

2020-06-25 Thread GitBox


wesm commented on a change in pull request #7537:
URL: https://github.com/apache/arrow/pull/7537#discussion_r445593841



##
File path: cpp/src/arrow/python/helpers.cc
##
@@ -254,14 +255,45 @@ bool PyFloat_IsNaN(PyObject* obj) {
   return PyFloat_Check(obj) && std::isnan(PyFloat_AsDouble(obj));
 }
 
+namespace {
+
+static std::once_flag pandas_static_initialized;
+static PyTypeObject* pandas_NaTType = nullptr;
+
+void GetPandasStaticSymbols() {
+  OwnedRef pandas;
+  Status s = ImportModule("pandas", &pandas);
+  if (!s.ok()) {
+    return;
+  }
+
+  OwnedRef nat_value;
+  s = ImportFromModule(pandas.obj(), "NaT", &nat_value);
+  if (!s.ok()) {
+    return;
+  }
+  PyObject* nat_type = PyObject_Type(nat_value.obj());
+  pandas_NaTType = reinterpret_cast<PyTypeObject*>(nat_type);
+
+  // PyObject_Type returns a new reference but we trust that pandas.NaT will
+  // outlive our use of this PyObject*
+  Py_DECREF(nat_type);
+}
+
+}  // namespace
+
+void InitPandasStaticData() {
+  std::call_once(pandas_static_initialized, GetPandasStaticSymbols);
+}
+
 bool PandasObjectIsNull(PyObject* obj) {
   if (!MayHaveNaN(obj)) {
 return false;
   }
   if (obj == Py_None) {
 return true;
   }
-  if (PyFloat_IsNaN(obj) ||
+  if (PyFloat_IsNaN(obj) || (pandas_NaTType && PyObject_TypeCheck(obj, 
pandas_NaTType)) ||

Review comment:
   Right unfortunately you can have multiple instances of `pandas.NaT`... 
not great
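   
   For reference, a Python analogue of the type-based check in this hunk (a 
sketch of the semantics only; the real check is the C++ above):
   
   ```python
   import pandas as pd
   
   def is_nat(obj):
       # an identity test (obj is pd.NaT) can miss duplicated NaT instances,
       # so compare against the NaT type instead, as the C++ does
       return isinstance(obj, type(pd.NaT))
   
   print(is_nat(pd.NaT))  # True
   print(is_nat(None))    # False
   ```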





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on a change in pull request #7537: ARROW-842: [Python] Recognize pandas.NaT as null when converting object arrays with from_pandas=True

2020-06-25 Thread GitBox


wesm commented on a change in pull request #7537:
URL: https://github.com/apache/arrow/pull/7537#discussion_r445593243



##
File path: cpp/src/arrow/python/python_to_arrow.cc
##
@@ -1171,6 +1171,12 @@ Status GetConverterFlat(const std::shared_ptr<DataType>& type, bool strict_conversions,
 
 Status GetConverter(const std::shared_ptr<DataType>& type, bool from_pandas,
                     bool strict_conversions, std::unique_ptr<SeqConverter>* out) {
+  if (from_pandas) {
+// ARROW-842: If pandas is not installed then null checks will be less
+// comprehensive, but that is okay.
+internal::InitPandasStaticData();

Review comment:
   I see. Well, the problem with putting the pandas initialization elsewhere is 
that we've already arranged for pandas to only be imported when it's needed, so 
if we did `import pandas` always when doing `import pyarrow` we'd be breaking 
that contract. I suggest we address this as a follow-up as needed.
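   
   For the record, a minimal sketch of what this enables (assuming this branch, 
where `from_pandas=True` triggers the lazy pandas initialization above):
   
   ```python
   import numpy as np
   import pandas as pd
   import pyarrow as pa
   
   data = np.array([pd.Timestamp("2020-01-01"), pd.NaT], dtype=object)
   arr = pa.array(data, from_pandas=True)
   print(arr.null_count)  # expected: 1, since pd.NaT is now treated as null
   ```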





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on issue #7540: Why conversion from DoubleArray with nulls to numpy needs to copy?

2020-06-25 Thread GitBox


wesm commented on issue #7540:
URL: https://github.com/apache/arrow/issues/7540#issuecomment-649571447


   This came up in ARROW-3263 for R. If you'd like you can open a corresponding 
Python issue. On the face of it, this seems rather difficult to me to do safely



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm closed issue #7540: Why conversion from DoubleArray with nulls to numpy needs to copy?

2020-06-25 Thread GitBox


wesm closed issue #7540:
URL: https://github.com/apache/arrow/issues/7540


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #7542: ARROW-9225: [C++][Compute] Speed up counting sort

2020-06-25 Thread GitBox


wesm commented on pull request #7542:
URL: https://github.com/apache/arrow/pull/7542#issuecomment-649563024


   I'm interested to see if we could add a `BitmapReader::Advance` method so 
that we could employ a hybrid approach to use both BitBlockCounter and 
BitmapReader to get the best of both worlds
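   
   Roughly, the hybrid would look like this (a plain-Python sketch of the 
control flow only, not the actual C++ API):
   
   ```python
   def scan_valid(valid_bits, visit_run, visit_one):
       # consume the validity bitmap in 64-bit words: take a bulk fast path
       # for all-set words and fall back to per-bit visits otherwise
       i, n = 0, len(valid_bits)
       while i < n:
           word = valid_bits[i:i + 64]
           if all(word):
               visit_run(i, len(word))   # contiguous valid run
           else:
               for j, bit in enumerate(word):
                   if bit:
                       visit_one(i + j)  # sparse or partial word
           i += len(word)
   ```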



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] bkietz commented on a change in pull request #7526: ARROW-9146: [C++][Dataset] Lazily store fragment physical schema

2020-06-25 Thread GitBox


bkietz commented on a change in pull request #7526:
URL: https://github.com/apache/arrow/pull/7526#discussion_r445583091



##
File path: cpp/src/arrow/util/mutex.h
##
@@ -0,0 +1,58 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include <memory>
+
+#include "arrow/util/macros.h"
+#include "arrow/util/visibility.h"
+
+namespace arrow {
+namespace util {
+
+/// A wrapper around std::mutex since we can't use it directly due to CPP/CLI

Review comment:
   https://github.com/apache/arrow/pull/7526#discussion_r445512267





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] fsaintjacques commented on a change in pull request #7526: ARROW-9146: [C++][Dataset] Lazily store fragment physical schema

2020-06-25 Thread GitBox


fsaintjacques commented on a change in pull request #7526:
URL: https://github.com/apache/arrow/pull/7526#discussion_r445096637



##
File path: cpp/src/arrow/util/mutex.h
##
@@ -0,0 +1,58 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include <memory>
+
+#include "arrow/util/macros.h"
+#include "arrow/util/visibility.h"
+
+namespace arrow {
+namespace util {
+
+/// A wrapper around std::mutex since we can't use it directly due to CPP/CLI

Review comment:
   We use it everywhere why is it a problem here?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] cyb70289 commented on pull request #7542: ARROW-9225: [C++][Compute] Speed up counting sort

2020-06-25 Thread GitBox


cyb70289 commented on pull request #7542:
URL: https://github.com/apache/arrow/pull/7542#issuecomment-649561938


   About the `null_percent=0` case, the benchmark result I attached shows about 
a 19% improvement. Looks reasonable.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #7539: ARROW-9156: [C++] Reducing the code size of the tensor module

2020-06-25 Thread GitBox


wesm commented on pull request #7539:
URL: https://github.com/apache/arrow/pull/7539#issuecomment-649561148


   This already removes more than 2MB of code from libarrow.so on Linux: great. 
I'll keep an eye on this



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] rladeira edited a comment on issue #1771: pyarrow-- reading selected columns from multiindex parquet file

2020-06-25 Thread GitBox


rladeira edited a comment on issue #1771:
URL: https://github.com/apache/arrow/issues/1771#issuecomment-649554677


   I found the same issue here, using pyarrow version '0.17.1'. I could not 
select columns from a multi-index dataframe saved as a parquet file. Is there 
some way to accomplish this? Read just some columns from a multi-index parquet 
file? Something like:
   
   `pd.read_parquet('file.parquet', engine='pyarrow', columns=[('level_0_key', 
'level_1_key')])`



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] rladeira commented on issue #1771: pyarrow-- reading selected columns from multiindex parquet file

2020-06-25 Thread GitBox


rladeira commented on issue #1771:
URL: https://github.com/apache/arrow/issues/1771#issuecomment-649554677


   I found the same issue here, using pyarrow version '0.17.1'. I could not 
select columns from a multi-index dataframe saved as a parquet file. Is there 
some way to accomplish this? Read just some columns from a multi-index parquet 
file? Something like:
   
   `pd.read_parquet('file.parquet', engine='pyarrow', columns=[('a', 'b')])`



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] cyb70289 commented on a change in pull request #7542: ARROW-9225: [C++][Compute] Speed up counting sort

2020-06-25 Thread GitBox


cyb70289 commented on a change in pull request #7542:
URL: https://github.com/apache/arrow/pull/7542#discussion_r445575078



##
File path: cpp/src/arrow/compute/kernels/vector_sort.cc
##
@@ -88,6 +88,28 @@ struct PartitionIndices {
   }
 };
 
+template 
+inline void VisitRawValueInline(const ArrayType& values,

Review comment:
   done





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on a change in pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-25 Thread GitBox


wesm commented on a change in pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#discussion_r445574693



##
File path: cpp/src/arrow/compute/kernels/scalar_string_test.cc
##
@@ -68,6 +76,64 @@ TYPED_TEST(TestStringKernels, AsciiLower) {
this->string_type(), "[\"aaazzæÆ&\", null, \"\", \"bbb\"]");
 }
 
+TEST(TestStringKernels, Utf8Upper32bitGrowth) {

Review comment:
   I think it's fine to have this test but it definitely should be marked 
as LARGE_MEMORY_TEST





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on a change in pull request #7531: ARROW-9216: [C++] Use BitBlockCounter for plain spaced encoding/decoding

2020-06-25 Thread GitBox


wesm commented on a change in pull request #7531:
URL: https://github.com/apache/arrow/pull/7531#discussion_r445569869



##
File path: cpp/src/arrow/util/spaced.h
##
@@ -0,0 +1,200 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include "arrow/util/align_util.h"
+#include "arrow/util/bit_block_counter.h"
+#include "arrow/util/bitmap_reader.h"
+
+namespace arrow {
+namespace util {
+namespace internal {
+
+/// \brief Compress the buffer to spaced, excluding the null entries.
+///
+/// \param[in] src the source buffer
+/// \param[in] num_values the size of source buffer
+/// \param[in] valid_bits bitmap data indicating position of valid slots
+/// \param[in] valid_bits_offset offset into valid_bits
+/// \param[out] output the output buffer spaced
+/// \return The size of spaced buffer.
+template <typename T>
+inline int SpacedCompress(const T* src, int num_values, const uint8_t* valid_bits,
+                          int64_t valid_bits_offset, T* output) {
+  int num_valid_values = 0;
+  int idx_src = 0;
+
+  const auto p =
+  arrow::internal::BitmapWordAlign<1>(valid_bits, valid_bits_offset, 
num_values);
+  // First handle the leading bits
+  const int leading_bits = static_cast<int>(p.leading_bits);
+  while (idx_src < leading_bits) {
+if (BitUtil::GetBit(valid_bits, valid_bits_offset)) {
+  output[num_valid_values] = src[idx_src];
+  num_valid_values++;
+}
+idx_src++;
+valid_bits_offset++;
+  }
+
+  // The aligned parts scanned with BitBlockCounter
+  arrow::internal::BitBlockCounter data_counter(valid_bits, valid_bits_offset,
+num_values - leading_bits);
+  auto current_block = data_counter.NextWord();
+  while (idx_src < num_values) {
+if (current_block.AllSet()) {  // All true values
+  int run_length = 0;
+  // Scan forward until a block that has some false values (or the end)
+  while (current_block.length > 0 && current_block.AllSet()) {
+run_length += current_block.length;
+current_block = data_counter.NextWord();
+  }
+  // Fill all valid values of this scan
+  std::memcpy(&output[num_valid_values], &src[idx_src], run_length * sizeof(T));
+  num_valid_values += run_length;
+  idx_src += run_length;
+  valid_bits_offset += run_length;
+  // The current_block already computed, advance to next loop
+  continue;
+} else if (!current_block.NoneSet()) {  // Some values are null
+  arrow::internal::BitmapReader valid_bits_reader(valid_bits, 
valid_bits_offset,

Review comment:
   One question: is it actually beneficial to use `BitmapReader` here to 
process just a run of 64 bits? In other places where I have used 
BitBlockCounter, I have observed it does not always help
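   
   For context, the reference semantics of `SpacedCompress`, independent of the 
bitmap-scanning strategy (a plain-Python sketch):
   
   ```python
   def spaced_compress_reference(values, valid):
       # keep only the entries whose validity bit is set
       return [v for v, ok in zip(values, valid) if ok]
   
   assert spaced_compress_reference([1, 2, 3, 4], [1, 1, 0, 1]) == [1, 2, 4]
   ```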





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #7519: ARROW-9017: [C++][Python] Refactor scalar bindings

2020-06-25 Thread GitBox


github-actions[bot] commented on pull request #7519:
URL: https://github.com/apache/arrow/pull/7519#issuecomment-649551591


   https://issues.apache.org/jira/browse/ARROW-9017



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7519: ARROW-9017: [C++][Python] Refactor scalar bindings

2020-06-25 Thread GitBox


jorisvandenbossche commented on a change in pull request #7519:
URL: https://github.com/apache/arrow/pull/7519#discussion_r445569265



##
File path: python/pyarrow/scalar.pxi
##
@@ -16,1198 +16,704 @@
 # under the License.
 
 
-_NULL = NA = None
-
-
 cdef class Scalar:
 """
-The base class for all array elements.
+The base class for scalars.
 """
 
+def __init__(self):
+raise TypeError("Do not call {}'s constructor directly, use "
+"pa.scalar() instead.".format(self.__class__.__name__))
 
-cdef class NullType(Scalar):
-"""
-Singleton for null array elements.
-"""
-# TODO rename this NullValue?
+cdef void init(self, const shared_ptr[CScalar]& wrapped):
+self.wrapped = wrapped
 
-def __cinit__(self):
-global NA
-if NA is not None:
-raise Exception('Cannot create multiple NAType instances')
+@staticmethod
+cdef wrap(const shared_ptr[CScalar]& wrapped):
+cdef:
+Scalar self
+Type type_id = wrapped.get().type.get().id()
+
+if type_id == _Type_NA:
+return _NULL
+
+typ = _scalar_classes[type_id]
+self = typ.__new__(typ)
+self.init(wrapped)
+
+return self
+
+cdef inline shared_ptr[CScalar] unwrap(self) nogil:
+return self.wrapped
 
-self.type = null()
+@property
+def type(self):
+return pyarrow_wrap_data_type(self.wrapped.get().type)
 
 def __repr__(self):
-return 'NULL'
+return '<pyarrow.{}: {!r}>'.format(
+self.__class__.__name__, self.as_py()
+)
 
-def as_py(self):
-"""
-Return None
-"""
-return None
+def __str__(self):
+return str(self.as_py())
 
 def __eq__(self, other):
-return NA
+# TODO(kszucs): use c++ Equals
+if isinstance(other, Scalar):
+other = other.as_py()
+return self.as_py() == other
 
+def __hash__(self):
+# TODO(kszucs): use C++ hash if implemented for the type
+return hash(self.as_py())
+
+def as_py(self):
+raise NotImplementedError()
 
-_NULL = NA = NullType()
+
+_NULL = NA = None
 
 
-cdef class ArrayValue(Scalar):
+cdef class NullScalar(Scalar):
 """
-The base class for non-null array elements.
+Concrete class for null scalars.
 """
 
-def __init__(self):
-raise TypeError("Do not call {}'s constructor directly, use array "
-"subscription instead."
-.format(self.__class__.__name__))
-
-cdef void init(self, DataType type, const shared_ptr[CArray]& sp_array,
-   int64_t index):
-self.type = type
-self.index = index
-self._set_array(sp_array)
+def __cinit__(self):
+global NA
+if NA is not None:
+raise Exception('Cannot create multiple NAType instances')
+self.init(shared_ptr[CScalar](new CNullScalar()))
 
-cdef void _set_array(self, const shared_ptr[CArray]& sp_array):
-self.sp_array = sp_array
+def __init__(self):
+pass
 
-def __repr__(self):
-if hasattr(self, 'as_py'):
-return repr(self.as_py())
-else:
-return super(Scalar, self).__repr__()
+def __eq__(self, other):
+return NA
 
-def __str__(self):
-if hasattr(self, 'as_py'):
-return str(self.as_py())
-else:
-return super(Scalar, self).__str__()
+def as_py(self):
+"""
+Return this value as a Python None.
+"""
+return None
 
-def __eq__(self, other):
-if hasattr(self, 'as_py'):
-if isinstance(other, ArrayValue):
-other = other.as_py()
-return self.as_py() == other
-else:
-raise NotImplementedError(
-"Cannot compare Arrow values that don't support as_py()")
 
-def __hash__(self):
-return hash(self.as_py())
+_NULL = NA = NullScalar()
 
 
-cdef class BooleanValue(ArrayValue):
+cdef class BooleanScalar(Scalar):
 """
-Concrete class for boolean array elements.
+Concrete class for boolean scalars.
 """
 
 def as_py(self):
 """
 Return this value as a Python bool.
 """
-cdef CBooleanArray* ap = <CBooleanArray*> self.sp_array.get()
-return ap.Value(self.index)
+cdef CBooleanScalar* sp = <CBooleanScalar*> self.wrapped.get()
+return sp.value if sp.is_valid else None
 
 
-cdef class Int8Value(ArrayValue):
+cdef class UInt8Scalar(Scalar):
 """
-Concrete class for int8 array elements.
+Concrete class for uint8 scalars.
 """
 
 def as_py(self):
 """
 Return this value as a Python int.
 """
-cdef CInt8Array* ap = <CInt8Array*> self.sp_array.get()
-return ap.Value(self.index)
+cdef CUInt8Scalar* sp = <CUInt8Scalar*> self.wrapped.get()
+return sp.value if 

[GitHub] [arrow] wesm commented on pull request #7531: ARROW-9216: [C++] Use BitBlockCounter for plain spaced encoding/decoding

2020-06-25 Thread GitBox


wesm commented on pull request #7531:
URL: https://github.com/apache/arrow/pull/7531#issuecomment-649548830


   @jianxind I think the bot is broken right now because of the changes I 
recently made in ARROW-9201. @kszucs is going to update it



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7519: ARROW-9017: [C++][Python] Refactor scalar bindings

2020-06-25 Thread GitBox


jorisvandenbossche commented on a change in pull request #7519:
URL: https://github.com/apache/arrow/pull/7519#discussion_r445523294



##
File path: python/pyarrow/_dataset.pyx
##
@@ -216,22 +216,18 @@ cdef class Expression:
 @staticmethod
 def _scalar(value):
 cdef:
-shared_ptr[CScalar] scalar
-
-if value is None:
-scalar.reset(new CNullScalar())
-elif isinstance(value, bool):
-scalar = MakeScalar(<c_bool>value)
-elif isinstance(value, float):
-scalar = MakeScalar(<double>value)
-elif isinstance(value, int):
-scalar = MakeScalar(<int64_t>value)
-elif isinstance(value, (bytes, str)):
-scalar = MakeStringScalar(tobytes(value))

Review comment:
   I think the `MakeStringScalar` included in libarrow.pxd can then be 
removed (I don't see any other usage of it)

##
File path: python/pyarrow/tests/test_parquet.py
##
@@ -2028,7 +2028,7 @@ def test_filters_invalid_pred_op(tempdir, 
use_legacy_dataset):
 use_legacy_dataset=use_legacy_dataset)
 assert dataset.read().num_rows == 0
 
-with pytest.raises(ValueError if use_legacy_dataset else TypeError):
+with pytest.raises(ValueError if use_legacy_dataset else pa.ArrowInvalid):
 # dataset API returns TypeError when trying create invalid comparison

Review comment:
   ```suggestion
   with pytest.raises(ValueError):
   ```
   
   ArrowInvalid is a ValueError, I think, so if this changed, the above should 
be sufficient

##
File path: python/pyarrow/util.py
##
@@ -41,6 +41,24 @@ def wrapper(*args, **kwargs):
 return wrapper
 
 
+def _deprecate_class(old_name, new_class, next_version,
+ instancecheck=True):
+"""
+Raise warning if a deprecated class is used in an isinstance check.

Review comment:
   Shouldn't it also raise when instantiated?
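   
   One way it could cover both (warn on instantiation as well as on isinstance 
checks) is a metaclass-based placeholder; a hypothetical sketch, not the code 
in this diff:
   
   ```python
   import warnings
   
   def _deprecate_class(old_name, new_class, next_version, instancecheck=True):
       msg = ("pyarrow.{} is deprecated, use pyarrow.{} instead "
              "(removal planned for {})").format(
                  old_name, new_class.__name__, next_version)
   
       class _Meta(type):
           def __instancecheck__(cls, other):
               if instancecheck:
                   warnings.warn(msg, FutureWarning)
               return isinstance(other, new_class)
   
           def __call__(cls, *args, **kwargs):
               # also warn when the deprecated name is instantiated
               warnings.warn(msg, FutureWarning)
               return new_class(*args, **kwargs)
   
       return _Meta(old_name, (new_class,), {})
   ```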

##
File path: python/pyarrow/scalar.pxi
##
@@ -16,1198 +16,704 @@
 # under the License.
 
 
-_NULL = NA = None
-
-
 cdef class Scalar:
 """
-The base class for all array elements.
+The base class for scalars.
 """
 
+def __init__(self):
+raise TypeError("Do not call {}'s constructor directly, use "
+"pa.scalar() instead.".format(self.__class__.__name__))
 
-cdef class NullType(Scalar):
-"""
-Singleton for null array elements.
-"""
-# TODO rename this NullValue?
+cdef void init(self, const shared_ptr[CScalar]& wrapped):
+self.wrapped = wrapped
 
-def __cinit__(self):
-global NA
-if NA is not None:
-raise Exception('Cannot create multiple NAType instances')
+@staticmethod
+cdef wrap(const shared_ptr[CScalar]& wrapped):
+cdef:
+Scalar self
+Type type_id = wrapped.get().type.get().id()
+
+if type_id == _Type_NA:
+return _NULL
+
+typ = _scalar_classes[type_id]
+self = typ.__new__(typ)
+self.init(wrapped)
+
+return self
+
+cdef inline shared_ptr[CScalar] unwrap(self) nogil:
+return self.wrapped
 
-self.type = null()
+@property
+def type(self):
+return pyarrow_wrap_data_type(self.wrapped.get().type)
 
 def __repr__(self):
-return 'NULL'
+return '<pyarrow.{}: {!r}>'.format(
+self.__class__.__name__, self.as_py()
+)
 
-def as_py(self):
-"""
-Return None
-"""
-return None
+def __str__(self):
+return str(self.as_py())
 
 def __eq__(self, other):
-return NA
+# TODO(kszucs): use c++ Equals
+if isinstance(other, Scalar):
+other = other.as_py()
+return self.as_py() == other
 
+def __hash__(self):
+# TODO(kszucs): use C++ hash if implemented for the type
+return hash(self.as_py())
+
+def as_py(self):
+raise NotImplementedError()
 
-_NULL = NA = NullType()
+
+_NULL = NA = None
 
 
-cdef class ArrayValue(Scalar):
+cdef class NullScalar(Scalar):
 """
-The base class for non-null array elements.
+Concrete class for null scalars.
 """
 
-def __init__(self):
-raise TypeError("Do not call {}'s constructor directly, use array "
-"subscription instead."
-.format(self.__class__.__name__))
-
-cdef void init(self, DataType type, const shared_ptr[CArray]& sp_array,
-   int64_t index):
-self.type = type
-self.index = index
-self._set_array(sp_array)
+def __cinit__(self):
+global NA
+if NA is not None:
+raise Exception('Cannot create multiple NAType instances')
+self.init(shared_ptr[CScalar](new CNullScalar()))
 
-cdef void _set_array(self, const shared_ptr[CArray]& sp_array):
-self.sp_array = sp_array
+def __init__(self):
+

[GitHub] [arrow] github-actions[bot] commented on pull request #7543: ARROW-9221: [Java] account for big-endian buffers in ArrowBuf.setBytes

2020-06-25 Thread GitBox


github-actions[bot] commented on pull request #7543:
URL: https://github.com/apache/arrow/pull/7543#issuecomment-649534666


   https://issues.apache.org/jira/browse/ARROW-9221



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] lidavidm opened a new pull request #7543: ARROW-9221: [Java] account for big-endian buffers in ArrowBuf.setBytes

2020-06-25 Thread GitBox


lidavidm opened a new pull request #7543:
URL: https://github.com/apache/arrow/pull/7543


   `ArrowBuf.setBytes` has an override that uses an 8-byte-at-a-time copy loop 
if the byte buffer does not provide an array and is not direct. Unfortunately, 
this means it'll mangle data when the byte buffer is big-endian, as it then 
writes the data into the little-endian ArrowBuf. This fixes it by setting the 
byte order before copying, and then restoring it.
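   
   The mangling is easy to demonstrate outside Java; a quick illustration of 
why the byte order matters for a word-at-a-time copy (Python, for brevity):
   
   ```python
   import struct
   
   raw = struct.pack('>q', 1)           # 1 encoded big-endian
   wrong = struct.unpack('<q', raw)[0]  # read back as little-endian
   print(wrong)                         # 72057594037927936 (2**56), not 1
   ```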



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on pull request #7456: ARROW-9106: [Python] Allow specifying CSV file encoding

2020-06-25 Thread GitBox


pitrou commented on pull request #7456:
URL: https://github.com/apache/arrow/pull/7456#issuecomment-649522260


   Ok, I added some tests for error propagation. I'm going to merge if CI stays 
green.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] bkietz commented on pull request #7536: ARROW-8647: [C++][Python][Dataset] Allow partitioning fields to be inferred with dictionary type

2020-06-25 Thread GitBox


bkietz commented on pull request #7536:
URL: https://github.com/apache/arrow/pull/7536#issuecomment-649514981


   @jorisvandenbossche okay, I'll extend the key value `Partitioning`s to 
maintain dictionaries of all unique values of a field.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7526: ARROW-9146: [C++][Dataset] Lazily store fragment physical schema

2020-06-25 Thread GitBox


jorisvandenbossche commented on a change in pull request #7526:
URL: https://github.com/apache/arrow/pull/7526#discussion_r445516148



##
File path: cpp/src/arrow/util/mutex.h
##
@@ -0,0 +1,58 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include <memory>
+
+#include "arrow/util/macros.h"
+#include "arrow/util/visibility.h"
+
+namespace arrow {
+namespace util {
+
+/// A wrapper around std::mutex since we can't use it directly due to CPP/CLI

Review comment:
   Thanks for the explanation! maybe add: "can't use it directly *in public 
headers*"





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] bkietz commented on pull request #7526: ARROW-9146: [C++][Dataset] Lazily store fragment physical schema

2020-06-25 Thread GitBox


bkietz commented on pull request #7526:
URL: https://github.com/apache/arrow/pull/7526#issuecomment-649506352


   > I suppose we don't / don't want to support files being changed after the 
dataset object has been constructed anyways?
   
   No, we don't support this at all. IMHO if fragments are modified that calls 
for the viewing `Dataset` to be reconstructed. @fsaintjacques 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] bkietz commented on a change in pull request #7526: ARROW-9146: [C++][Dataset] Lazily store fragment physical schema

2020-06-25 Thread GitBox


bkietz commented on a change in pull request #7526:
URL: https://github.com/apache/arrow/pull/7526#discussion_r445512267



##
File path: cpp/src/arrow/util/mutex.h
##
@@ -0,0 +1,58 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include <memory>
+
+#include "arrow/util/macros.h"
+#include "arrow/util/visibility.h"
+
+namespace arrow {
+namespace util {
+
+/// A wrapper around std::mutex since we can't use it directly due to CPP/CLI

Review comment:
   We cannot use `std::mutex` in public headers due to CPP/CLI forbidding use 
of that header: 
https://docs.microsoft.com/en-us/cpp/standard-library/mutex#remarks
   We don't support compilation of Arrow with the `/clr` option, so it's 
acceptable to use `std::mutex` in source files and private headers (including 
`arrow/util/mutex.cc`). However, Arrow headers may be included by projects 
which *are* using `/clr`, and usage of `std::mutex` in those public header 
files would break those projects.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jorisvandenbossche commented on pull request #7534: ARROW-8729: [C++][Dataset] Ensure non-empty batches when only virtual columns are projected

2020-06-25 Thread GitBox


jorisvandenbossche commented on pull request #7534:
URL: https://github.com/apache/arrow/pull/7534#issuecomment-649501921


   Hmm, it failed on the last commit as well, but I restarted that one. And so 
now it appears to be green indeed...



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] bkietz commented on pull request #7534: ARROW-8729: [C++][Dataset] Ensure non-empty batches when only virtual columns are projected

2020-06-25 Thread GitBox


bkietz commented on pull request #7534:
URL: https://github.com/apache/arrow/pull/7534#issuecomment-649500754


   @jorisvandenbossche that build comes from the first of two `suggestion` 
commits and it doesn't seem to have crashed with both commits in place. Maybe 
it was ephemeral?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jorisvandenbossche commented on pull request #7536: ARROW-8647: [C++][Python][Dataset] Allow partitioning fields to be inferred with dictionary type

2020-06-25 Thread GitBox


jorisvandenbossche commented on pull request #7536:
URL: https://github.com/apache/arrow/pull/7536#issuecomment-649500017


   Currently for the ParquetDataset, it also simply uses int32 for the indices. 
   
   Now, there is a more fundamental issue I had not thought of: the actual 
dictionary of the DictionaryArray. Right now, you create a DictionaryArray with 
only the (single) value of the partition field for that specific fragment 
(because we don't keep track of all unique values of a certain partition 
level?). 
   In the python ParquetDataset, however, we create a DictionaryArray with all 
occurring values of that partition field (also for the other fragments/pieces).
   
   To illustrate with a small dataset with `part=A` and `part=B` directories:
   
   ```python
   In [1]: import pyarrow.dataset as ds
   
   In [6]: part = 
ds.HivePartitioning.discover(max_partition_dictionary_size=-1)   
   
   In [9]: dataset = ds.dataset("test_partitioned/", format="parquet", 
partitioning=part) 
   
   In [10]: fragment = list(dataset.get_fragments())[0] 
   
   In [11]: fragment.to_table(schema=dataset.schema) 
   Out[11]: 
   pyarrow.Table
   dummy: int64
   part: dictionary
   
   # only A included
   In [13]: fragment.to_table(schema=dataset.schema).column("part")  
   Out[13]: 
   
   [
   
 -- dictionary:
   [
 "A"
   ]
 -- indices:
   [
 0,
 0
   ]
   ]
   
   In [15]: import pyarrow.parquet as pq  
   
   In [16]: dataset2 = pq.ParquetDataset("test_partitioned/")  
   
   In [19]: piece = dataset2.pieces[0] 
   
   In [25]: piece.read(partitions=dataset2.partitions) 
   Out[25]: 
   pyarrow.Table
   dummy: int64
   part: dictionary
   
   # both A and B included
   In [26]: piece.read(partitions=dataset2.partitions).column("part")   
   Out[26]: 
   
   [
   
 -- dictionary:
   [
 "A",
 "B"
   ]
 -- indices:
   [
 0,
 0
   ]
   ]
   ```
   
   I think for this to be valuable (e.g. in the context of dask, or for pandas 
when reading only a part of the parquet dataset), it's important to get all 
values of the partition field. But I am not sure to what extent that fits in 
the Dataset design (although I think that during discovery in the Factory, we 
could keep track of all unique values of a partition field?)



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7537: ARROW-842: [Python] Recognize pandas.NaT as null when converting object arrays with from_pandas=True

2020-06-25 Thread GitBox


jorisvandenbossche commented on a change in pull request #7537:
URL: https://github.com/apache/arrow/pull/7537#discussion_r445495780



##
File path: cpp/src/arrow/python/helpers.cc
##
@@ -254,14 +255,45 @@ bool PyFloat_IsNaN(PyObject* obj) {
   return PyFloat_Check(obj) && std::isnan(PyFloat_AsDouble(obj));
 }
 
+namespace {
+
+static std::once_flag pandas_static_initialized;
+static PyTypeObject* pandas_NaTType = nullptr;
+
+void GetPandasStaticSymbols() {
+  OwnedRef pandas;
+  Status s = ImportModule("pandas", &pandas);
+  if (!s.ok()) {
+    return;
+  }
+
+  OwnedRef nat_value;
+  s = ImportFromModule(pandas.obj(), "NaT", &nat_value);
+  if (!s.ok()) {
+    return;
+  }
+  PyObject* nat_type = PyObject_Type(nat_value.obj());
+  pandas_NaTType = reinterpret_cast<PyTypeObject*>(nat_type);
+
+  // PyObject_Type returns a new reference but we trust that pandas.NaT will
+  // outlive our use of this PyObject*
+  Py_DECREF(nat_type);
+}

Review comment:
   Do you also want to add `pd.NA` here, or leave that for a follow-up JIRA?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jorisvandenbossche commented on pull request #7534: ARROW-8729: [C++][Dataset] Ensure non-empty batches when only virtual columns are projected

2020-06-25 Thread GitBox


jorisvandenbossche commented on pull request #7534:
URL: https://github.com/apache/arrow/pull/7534#issuecomment-649482676


   The python dataset tests are crashing on Mac: 
https://github.com/apache/arrow/runs/806974457



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7526: ARROW-9146: [C++][Dataset] Lazily store fragment physical schema

2020-06-25 Thread GitBox


jorisvandenbossche commented on a change in pull request #7526:
URL: https://github.com/apache/arrow/pull/7526#discussion_r445482607



##
File path: cpp/src/arrow/util/mutex.h
##
@@ -0,0 +1,58 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include <memory>
+
+#include "arrow/util/macros.h"
+#include "arrow/util/visibility.h"
+
+namespace arrow {
+namespace util {
+
+/// A wrapper around std::mutex since we can't use it directly due to CPP/CLI

Review comment:
   we use `std::mutex` in several places in the code base, so this comment is not fully clear to me

##
File path: cpp/src/arrow/dataset/file_parquet.cc
##
@@ -357,13 +355,20 @@ static inline Result<std::vector<RowGroupInfo>> AugmentRowGroups(
   return row_groups;
 }
 
-Result<ScanTaskIterator> ParquetFileFormat::ScanFile(
-    const FileSource& source, std::shared_ptr<ScanOptions> options,
-    std::shared_ptr<ScanContext> context, std::vector<RowGroupInfo> row_groups) const {
+Result<ScanTaskIterator> ParquetFileFormat::ScanFile(std::shared_ptr<ScanOptions> options,
+                                                     std::shared_ptr<ScanContext> context,
+                                                     FileFragment* fragment) const {
+  const auto& source = fragment->source();
+  auto row_groups = checked_cast<ParquetFileFragment*>(fragment)->row_groups();
+
   bool row_groups_are_complete = RowGroupInfosAreComplete(row_groups);
   // The following block is required to avoid any IO if all RowGroups are
   // excluded due to prior statistics knowledge.
   if (row_groups_are_complete) {
+    // physical_schema should be cached at this point
+    ARROW_ASSIGN_OR_RAISE(auto physical_schema, fragment->ReadPhysicalSchema());
+    RETURN_NOT_OK(options->filter->Validate(*physical_schema));

Review comment:
   To avoid confusion about this, should we rename ReadPhysicalSchema? Or 
instead of having `ReadPhysicalSchema` which caches and 
`ReadPhysicalSchemaImpl` that always reads, a `GetPhysicalSchema` that caches 
and `ReadPhysicalSchema` that reads?
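
   On the Python side, the cached schema would be what a user sees through the
   fragment. A minimal sketch, assuming a parquet dataset under `data/` and
   that the binding exposes the accessor as `physical_schema`:

   ```python
   import pyarrow.dataset as ds

   dataset = ds.dataset("data", format="parquet")
   fragment = next(dataset.get_fragments())

   # First access reads the parquet footer; with the caching in this PR,
   # subsequent accesses should not trigger further IO.
   print(fragment.physical_schema)
   ```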









[GitHub] [arrow] bkietz commented on a change in pull request #7534: ARROW-8729: [C++][Dataset] Ensure non-empty batches when only virtual columns are projected

2020-06-25 Thread GitBox


bkietz commented on a change in pull request #7534:
URL: https://github.com/apache/arrow/pull/7534#discussion_r445470685



##
File path: cpp/src/parquet/arrow/reader.cc
##
@@ -338,22 +348,37 @@ class RowGroupRecordBatchReader : public ::arrow::RecordBatchReader {
     // TODO (hatemhelal): Consider refactoring this to share logic with ReadTable as this
     // does not currently honor the use_threads option.
     std::vector<std::shared_ptr<ChunkedArray>> columns(field_readers_.size());
-    for (size_t i = 0; i < field_readers_.size(); ++i) {
-      RETURN_NOT_OK(field_readers_[i]->NextBatch(batch_size_, &columns[i]));
-      if (columns[i]->num_chunks() > 1) {
-        return Status::NotImplemented("This class cannot yet iterate chunked arrays");
+    int64_t num_rows = -1;
+
+    if (columns.empty()) {
+      num_rows = std::min(batch_size_, *row_group_remaining_size_);

Review comment:
   ```suggestion
 // num_rows cannot be derived from field_readers_ so compute it using
 // row group sizes cached from metadata
 num_rows = std::min(batch_size_, *row_group_remaining_size_);
   ```
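
   For context, the user-visible case this PR guards against, sketched with a
   hypothetical hive-partitioned `data/` layout:

   ```python
   import pyarrow.dataset as ds

   # Projecting only the partition ("virtual") column: no physical parquet
   # column is read, so batch lengths must come from row-group metadata.
   dataset = ds.dataset("data", format="parquet", partitioning="hive")
   table = dataset.to_table(columns=["year"])
   print(table.num_rows)  # should match the dataset's row count, not 0
   ```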









[GitHub] [arrow] cyb70289 commented on pull request #7542: ARROW-9225: [C++][Compute] Speed up counting sort

2020-06-25 Thread GitBox


cyb70289 commented on pull request #7542:
URL: https://github.com/apache/arrow/pull/7542#issuecomment-649457378


   > I see no significant performance difference here on an AMD Ryzen CPU.
   > The weird thing with these performance numbers is that the `null_percent=0` case should probably be faster, no?
   
   Per @wesm's [comment](https://github.com/apache/arrow/pull/7525#issuecomment-648279829), he also saw no significant performance difference after replacing BitmapReader with BitBlockCounter.
   
   But I see big differences on AMD EPYC 7251, Intel Xeon Gold 5218, and an Arm server, with both gcc and clang. No idea what's happening.
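
   For readers following along, the counting-sort idea under discussion in
   rough Python form. The real kernel is C++ and reads nulls from the validity
   bitmap; this sketch only shows the algorithm's shape:

   ```python
   def counting_sort_indices(values, min_v, max_v):
       # One bucket per distinct value; nulls get a trailing bucket so they
       # stably sort to the end.
       n_buckets = max_v - min_v + 2
       counts = [0] * n_buckets
       for v in values:
           bucket = (v - min_v) if v is not None else n_buckets - 1
           counts[bucket] += 1
       # Exclusive prefix sum turns counts into each bucket's start offset.
       offsets = [0] * n_buckets
       for j in range(1, n_buckets):
           offsets[j] = offsets[j - 1] + counts[j - 1]
       out = [0] * len(values)
       for idx, v in enumerate(values):
           bucket = (v - min_v) if v is not None else n_buckets - 1
           out[offsets[bucket]] = idx
           offsets[bucket] += 1
       return out

   print(counting_sort_indices([3, None, 1, 3, 2], 1, 3))  # -> [2, 4, 0, 3, 1]
   ```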







[GitHub] [arrow] pitrou commented on a change in pull request #7517: ARROW-1682: [Doc] Expand S3/MinIO fileystem dataset documentation

2020-06-25 Thread GitBox


pitrou commented on a change in pull request #7517:
URL: https://github.com/apache/arrow/pull/7517#discussion_r445450752



##
File path: docs/source/python/dataset.rst
##
@@ -325,6 +325,22 @@ The currently available classes are :class:`~pyarrow.fs.S3FileSystem` and
 details.
 
 
+Reading from Minio
+------------------
+
+In addition to cloud storage, pyarrow also supports reading from a MinIO object
+storage instance emulating S3 APIs. Paired with toxiproxy, this is useful for
+testing or benchmarking.
+
+.. code-block:: python
+
+    from pyarrow import fs
+
+    minio = fs.S3FileSystem(scheme="http", endpoint="localhost:9000")

Review comment:
   Add a comment that this assumes MinIO is running unencrypted on local 
port 9000?
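
   Concretely, the amended snippet might read as follows — the comment wording
   is only a guess at what the review is asking for, and it assumes a locally
   started MinIO server:

   ```python
   from pyarrow import fs

   # Assumes a MinIO server is already running locally, unencrypted, on
   # port 9000 — e.g. started with: minio server /tmp/minio-data
   minio = fs.S3FileSystem(scheme="http", endpoint="localhost:9000")
   ```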









[GitHub] [arrow] pitrou commented on a change in pull request #7517: ARROW-1682: [Doc] Expand S3/MinIO fileystem dataset documentation

2020-06-25 Thread GitBox


pitrou commented on a change in pull request #7517:
URL: https://github.com/apache/arrow/pull/7517#discussion_r445450482



##
File path: docs/source/python/dataset.rst
##
@@ -325,6 +325,22 @@ The currently available classes are :class:`~pyarrow.fs.S3FileSystem` and
 details.
 
 
+Reading from Minio
+------------------
+
+In addition to cloud storage, pyarrow also supports reading from a MinIO object
+storage instance emulating S3 APIs. Paired with toxiproxy, this is useful for

Review comment:
   Can you add hyperlinks to MinIO and toxiproxy?









[GitHub] [arrow] pitrou commented on a change in pull request #7535: [Format][DONOTMERGE] Columnar.rst changes for removing validity bitmap from union types

2020-06-25 Thread GitBox


pitrou commented on a change in pull request #7535:
URL: https://github.com/apache/arrow/pull/7535#discussion_r445449521



##
File path: docs/source/format/Columnar.rst
##
@@ -586,13 +587,13 @@ having the values: ``[{f=1.2}, null, {f=3.4}, {i=5}]``
 * Children arrays:
   * Field-0 array (f: float):
 * Length: 2, nulls: 0
-* Validity bitmap buffer: Not required
+* Validity bitmap buffer: 0101
 
 * Value Buffer:
 
-  | Bytes 0-7 | Bytes 8-63  |
-  |-----------|-------------|
-  | 1.2, 3.4  | unspecified |
+  | Bytes 0-11 | Bytes 8-63  |

Review comment:
   Second column should be "bytes 12-63"









[GitHub] [arrow] pitrou commented on a change in pull request #7535: [Format][DONOTMERGE] Columnar.rst changes for removing validity bitmap from union types

2020-06-25 Thread GitBox


pitrou commented on a change in pull request #7535:
URL: https://github.com/apache/arrow/pull/7535#discussion_r445449396



##
File path: docs/source/format/Columnar.rst
##
@@ -566,11 +572,6 @@ having the values: ``[{f=1.2}, null, {f=3.4}, {i=5}]``
 ::
 
 * Length: 4, Null count: 1
-* Validity bitmap buffer:
-  |Byte 0 (validity bitmap) | Bytes 1-63    |
-  |-------------------------|---------------|
-  | 00001101                | 0 (padding)   |
-
 * Types buffer:
 
   |Byte 0   | Byte 1  | Byte 2   | Byte 3   | Bytes 4-63  |

Review comment:
   It seems the types and offsets buffers need to be fixed (the second 
entry isn't unspecified anymore).
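
   For readers who want to see child-encoded nulls concretely, a pyarrow
   sketch of the document's dense union example (the `from_dense` call and
   field names are assumptions about the Python API, not part of the format
   text):

   ```python
   import pyarrow as pa

   # Dense union for [{f=1.2}, null, {f=3.4}, {i=5}]: without a union-level
   # validity bitmap, the null is carried by the float child instead.
   types = pa.array([0, 0, 0, 1], type=pa.int8())
   offsets = pa.array([0, 1, 2, 0], type=pa.int32())
   f = pa.array([1.2, None, 3.4], type=pa.float32())
   i = pa.array([5], type=pa.int32())
   union = pa.UnionArray.from_dense(types, offsets, [f, i], ["f", "i"])
   print(union.to_pylist())  # [1.2, None, 3.4, 5] (as approximate floats)
   ```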








