[GitHub] [arrow] liyafan82 commented on a change in pull request #7326: ARROW-9010: [Java] Framework and interface changes for RecordBatch IPC buffer compression

2020-06-30 Thread GitBox


liyafan82 commented on a change in pull request #7326:
URL: https://github.com/apache/arrow/pull/7326#discussion_r447441667



##
File path: java/vector/src/main/java/org/apache/arrow/vector/compression/CompressionUtility.java
##
@@ -0,0 +1,56 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.arrow.vector.compression;
+
+import org.apache.arrow.flatbuf.BodyCompressionMethod;
+import org.apache.arrow.flatbuf.CompressionType;
+import org.apache.arrow.vector.ipc.message.ArrowBodyCompression;
+
+/**
+ * Utilities for data compression/decompression.
+ */
+public class CompressionUtility {
+
+  private CompressionUtility() {
+  }
+
+  /**
+   * Creates the {@link ArrowBodyCompression} object, given the {@link CompressionCodec}.
+   */
+  public static ArrowBodyCompression createBodyCompression(CompressionCodec codec) {
+    if (codec == null) {
+      return null;
+    }
+    for (int i = 0; i < CompressionType.names.length; i++) {
+      if (CompressionType.names[i].equals(codec.getCodecName())) {

Review comment:
   Thanks for the suggestion. I have revised the code to implement it 
through a switch statement. Please check if it looks good.
   
   It looks clearer. However, we may need to change the code whenever we 
support a new compression method. 
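
The trade-off raised here (a switch is readable, but must be edited for every new codec) can be sketched with a table-driven lookup instead. The Python below is purely illustrative; `COMPRESSION_TYPE_NAMES` stands in for the flatbuffer-generated `CompressionType.names` array (in the Arrow format, LZ4_FRAME = 0 and ZSTD = 1):

```python
# Illustrative stand-in for the generated CompressionType.names array.
COMPRESSION_TYPE_NAMES = ["LZ4_FRAME", "ZSTD"]


def create_body_compression(codec_name):
    """Map a codec name to its CompressionType id, like createBodyCompression."""
    if codec_name is None:
        return None
    # A lookup table needs no per-codec branch: supporting a new codec
    # only means extending COMPRESSION_TYPE_NAMES.
    try:
        return COMPRESSION_TYPE_NAMES.index(codec_name)
    except ValueError:
        raise ValueError("unsupported codec: " + codec_name)
```

With either approach the mapping is the same; the table version just centralizes the per-codec knowledge in one place.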





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] liyafan82 commented on a change in pull request #7326: ARROW-9010: [Java] Framework and interface changes for RecordBatch IPC buffer compression

2020-06-30 Thread GitBox


liyafan82 commented on a change in pull request #7326:
URL: https://github.com/apache/arrow/pull/7326#discussion_r447442441



##
File path: java/vector/src/main/java/org/apache/arrow/vector/compression/CompressionCodec.java
##
@@ -0,0 +1,68 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.arrow.vector.compression;
+
+import org.apache.arrow.memory.ArrowBuf;
+
+/**
+ * The codec for compression/decompression.
+ */
+public interface CompressionCodec {
+
+  /**
+   * Given a buffer, estimate the compressed size.
+   * Please note this operation is optional, and some compression methods may not support it.
+   *
+   * @param input the input buffer to be estimated.
+   * @return the estimated size of the compressed data.
+   */
+  long estimateCompressedSize(ArrowBuf input);

Review comment:
   Sounds reasonable. Thank you.
   I have revised the code to simplify the interface. Maybe we can add the optimization in the future, if we believe it is necessary.
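
The simplified contract (compress/decompress only, with size estimation dropped) can be sketched as follows. This is a hedged Python analogue, not the actual Java API; `ZlibCodec` is a hypothetical implementation used only to show the round-trip property:

```python
import abc
import zlib


class CompressionCodec(abc.ABC):
    """Sketch of the simplified codec contract: just compress/decompress."""

    @abc.abstractmethod
    def compress(self, buf: bytes) -> bytes: ...

    @abc.abstractmethod
    def decompress(self, buf: bytes) -> bytes: ...


class ZlibCodec(CompressionCodec):
    """Hypothetical codec backed by zlib, for illustration only."""

    def compress(self, buf: bytes) -> bytes:
        return zlib.compress(buf)

    def decompress(self, buf: bytes) -> bytes:
        return zlib.decompress(buf)


# The essential invariant: decompress(compress(x)) returns x unchanged.
codec = ZlibCodec()
payload = b"abc" * 1000
assert codec.decompress(codec.compress(payload)) == payload
```

Dropping `estimateCompressedSize` keeps every implementation obligated to exactly these two operations, which is why the interface no longer needs optional methods.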









[GitHub] [arrow] nevi-me commented on pull request #7500: ARROW-9191: [Rust] Do not panic when milliseconds is less than zero as chrono can handle…

2020-06-30 Thread GitBox


nevi-me commented on pull request #7500:
URL: https://github.com/apache/arrow/pull/7500#issuecomment-651650955


   I've merged this. I've been having computer issues, so I haven't been able to do much work.







[GitHub] [arrow] nevi-me closed pull request #7554: ARROW-9236: [Rust] CSV WriterBuilder never writes header

2020-06-30 Thread GitBox


nevi-me closed pull request #7554:
URL: https://github.com/apache/arrow/pull/7554


   







[GitHub] [arrow] nevi-me closed pull request #7500: ARROW-9191: [Rust] Do not panic when milliseconds is less than zero as chrono can handle…

2020-06-30 Thread GitBox


nevi-me closed pull request #7500:
URL: https://github.com/apache/arrow/pull/7500


   







[GitHub] [arrow] pitrou commented on pull request #7580: ARROW-9261: [Python] Fix CA certificate lookup with S3 filesystem on manylinux

2020-06-30 Thread GitBox


pitrou commented on pull request #7580:
URL: https://github.com/apache/arrow/pull/7580#issuecomment-651665973


   I'm restarting that build just in case; it seems to fail when contacting GitHub. But macOS wheel builds don't enable S3 anyway, so I doubt this has anything to do with this PR.







[GitHub] [arrow] pitrou commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-30 Thread GitBox


pitrou commented on pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#issuecomment-651717472


   Phew. It worked. RTools 4.0 is still broken, but there doesn't seem to be 
anything we can do, except perhaps disable that job. I'm gonna merge and leave 
the R cleanup to someone else.







[GitHub] [arrow] pitrou closed pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-30 Thread GitBox


pitrou closed pull request #7449:
URL: https://github.com/apache/arrow/pull/7449


   







[GitHub] [arrow] pitrou closed pull request #7580: ARROW-9261: [Python] Fix CA certificate lookup with S3 filesystem on manylinux

2020-06-30 Thread GitBox


pitrou closed pull request #7580:
URL: https://github.com/apache/arrow/pull/7580


   







[GitHub] [arrow] rymurr commented on pull request #7290: ARROW-1692: [Java] UnionArray round trip not working

2020-06-30 Thread GitBox


rymurr commented on pull request #7290:
URL: https://github.com/apache/arrow/pull/7290#issuecomment-651630439


   I _think_ this is now in line with the spec. The Union/DenseUnion types now use logical type ids in Java, the same as in C++. The difference in a Java-created Union is that the type ids are chosen from a known index (minor type), but the Java implementation no longer relies on the strict ordering of minor type ordinals as it did before. At least, the intention was to be fully in line with the spec while maintaining a convenient default choice for type id. Drawing any meaning from the type buffers alone in client code would be an error, in my opinion. Hope that clarifies my intention with this change.







[GitHub] [arrow] nevi-me commented on pull request #7588: ARROW-9274: [Rust] Parse 64bit numbers from integration files as strings

2020-06-30 Thread GitBox


nevi-me commented on pull request #7588:
URL: https://github.com/apache/arrow/pull/7588#issuecomment-651720277


   @wesm may I merge this? Travis CI often takes a long time to start running.







[GitHub] [arrow] nevi-me opened a new pull request #7588: ARROW-9274: [Rust] Parse 64bit numbers from integration files as strings

2020-06-30 Thread GitBox


nevi-me opened a new pull request #7588:
URL: https://github.com/apache/arrow/pull/7588


   This fixes Rust build failures







[GitHub] [arrow] liyafan82 opened a new pull request #7587: ARROW-8973: [Java] Support batch value appending for large varchar/varbinary vectors

2020-06-30 Thread GitBox


liyafan82 opened a new pull request #7587:
URL: https://github.com/apache/arrow/pull/7587


   Please see https://issues.apache.org/jira/browse/ARROW-8973







[GitHub] [arrow] tianchen92 commented on pull request #6156: ARROW-7539: [Java] FieldVector getFieldBuffers API should not set reader/writer indices

2020-06-30 Thread GitBox


tianchen92 commented on pull request #6156:
URL: https://github.com/apache/arrow/pull/6156#issuecomment-651706326


   > Can we leave the old method in place and mark it as deprecated and remove 
in a later release?
   
   I am afraid that's not feasible: we need the right order in IPC, and Dremio needs the old (wrong) order in getBuffers to avoid test failures, unless we correct the behavior in getBuffers and rename the old method to something like getBuffers2, which seems ugly. @projjal Any thoughts? :)







[GitHub] [arrow] github-actions[bot] commented on pull request #7587: ARROW-8973: [Java] Support batch value appending for large varchar/varbinary vectors

2020-06-30 Thread GitBox


github-actions[bot] commented on pull request #7587:
URL: https://github.com/apache/arrow/pull/7587#issuecomment-651614454


   https://issues.apache.org/jira/browse/ARROW-8973







[GitHub] [arrow] github-actions[bot] commented on pull request #7588: ARROW-9274: [Rust] Parse 64bit numbers from integration files as strings

2020-06-30 Thread GitBox


github-actions[bot] commented on pull request #7588:
URL: https://github.com/apache/arrow/pull/7588#issuecomment-651709041


   https://issues.apache.org/jira/browse/ARROW-9274







[GitHub] [arrow] nevi-me opened a new pull request #7591: ARROW-8535: [Rust] Specify arrow-flight version

2020-06-30 Thread GitBox


nevi-me opened a new pull request #7591:
URL: https://github.com/apache/arrow/pull/7591


   This is to ensure that Rust users who include the arrow crate from crates.io do not get errors, as they would not have the arrow-flight directory.







[GitHub] [arrow] pitrou commented on pull request #7590: ARROW-9277: [C++] Fix docs of reading CSV files

2020-06-30 Thread GitBox


pitrou commented on pull request #7590:
URL: https://github.com/apache/arrow/pull/7590#issuecomment-651773473


   @crcrpar  Do you think the version I pushed is ok and clear enough?







[GitHub] [arrow] wesm commented on a change in pull request #7535: ARROW-9222: [Format][DONOTMERGE] Columnar.rst changes for removing validity bitmap from union types

2020-06-30 Thread GitBox


wesm commented on a change in pull request #7535:
URL: https://github.com/apache/arrow/pull/7535#discussion_r447691975



##
File path: docs/source/format/Columnar.rst
##
@@ -566,33 +572,28 @@ having the values: ``[{f=1.2}, null, {f=3.4}, {i=5}]``
 ::
 
 * Length: 4, Null count: 1
-* Validity bitmap buffer:
-  |Byte 0 (validity bitmap) | Bytes 1-63|
-  |-|---|
-  |00001101 | 0 (padding)   |
-
 * Types buffer:
 
   |Byte 0   | Byte 1  | Byte 2   | Byte 3   | Bytes 4-63  |
   |-|-|--|--|-|
-  | 0   | unspecified | 0| 1| unspecified |
+  | 0   | 0   | 0| 1| unspecified |
 
 * Offset buffer:
 
   |Bytes 0-3 | Bytes 4-7   | Bytes 8-11 | Bytes 12-15 | Bytes 16-63 |
   |--|-||-|-|
-  | 0| unspecified | 1  | 0   | unspecified |
+  | 0| 1   | 2  | 0   | unspecified |
 
 * Children arrays:
   * Field-0 array (f: float):
 * Length: 2, nulls: 0

Review comment:
   Yes
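
The types/offsets tables in the hunk above can be checked by resolving each slot by hand: child 0 is the float field `f` (which now carries the null at offset 1, implying it has three slots) and child 1 is the int field `i`. A plain-Python walk-through of the dense-union example `[{f=1.2}, null, {f=3.4}, {i=5}]`:

```python
# Dense union resolution for the layout above.
types = [0, 0, 0, 1]      # child id per slot (0 = f, 1 = i)
offsets = [0, 1, 2, 0]    # index into the selected child array
children = [
    [1.2, None, 3.4],     # field f (float); carries the null at offset 1
    [5],                  # field i (int)
]

# Each logical slot i holds children[types[i]][offsets[i]].
resolved = [children[t][o] for t, o in zip(types, offsets)]
assert resolved == [1.2, None, 3.4, 5]
```

This matches the change being made: with no parent validity bitmap, nullness lives entirely in the children, so offset 1 of field `f` must exist to hold the null.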









[GitHub] [arrow] wesm commented on a change in pull request #7535: ARROW-9222: [Format][DONOTMERGE] Columnar.rst changes for removing validity bitmap from union types

2020-06-30 Thread GitBox


wesm commented on a change in pull request #7535:
URL: https://github.com/apache/arrow/pull/7535#discussion_r447692444



##
File path: docs/source/format/Columnar.rst
##
@@ -688,11 +687,10 @@ will have the following layout: ::
 ||---|
 | joemark| unspecified (padding) |
 
-Similar to structs, a particular child array may have a non-null slot
-even if the validity bitmap of the parent union array indicates the
-slot is null.  Additionally, a child array may have a non-null slot
-even if the types array indicates that a slot contains a different
-type at the index.
+Note that a child array may have a non-null slot even if the types array
+indicates that a slot contains a different type at the index, so the

Review comment:
   OK









[GitHub] [arrow] pitrou commented on a change in pull request #7594: ARROW-7654: [Python] Ability to set column_types to a Schema in csv.ConvertOptions is undocumented

2020-06-30 Thread GitBox


pitrou commented on a change in pull request #7594:
URL: https://github.com/apache/arrow/pull/7594#discussion_r447724372



##
File path: python/pyarrow/_csv.pyx
##
@@ -342,9 +342,9 @@ cdef class ConvertOptions:
 --
 check_utf8 : bool, optional (default True)
 Whether to check UTF8 validity of string columns.
-column_types: dict, optional
-Map column names to column types
-(disabling type inference on those columns).
+column_types: pa.Schema or dict, optional
+Explicitly map column names to column types. Passing this argument

Review comment:
   I'm surprised: was the original spelling not clear?









[GitHub] [arrow] kszucs opened a new pull request #7594: ARROW-7654: [Python] Ability to set column_types to a Schema in csv.ConvertOptions is undocumented

2020-06-30 Thread GitBox


kszucs opened a new pull request #7594:
URL: https://github.com/apache/arrow/pull/7594


   







[GitHub] [arrow] kszucs commented on a change in pull request #7594: ARROW-7654: [Python] Ability to set column_types to a Schema in csv.ConvertOptions is undocumented

2020-06-30 Thread GitBox


kszucs commented on a change in pull request #7594:
URL: https://github.com/apache/arrow/pull/7594#discussion_r447723473



##
File path: python/pyarrow/_csv.pyx
##
@@ -342,9 +342,9 @@ cdef class ConvertOptions:
 --
 check_utf8 : bool, optional (default True)
 Whether to check UTF8 validity of string columns.
-column_types: dict, optional
-Map column names to column types
-(disabling type inference on those columns).
+column_types: pa.Schema or dict, optional
+Explicitly map column names to column types. Passing this argument

Review comment:
   I'm not sure how to clarify it better.









[GitHub] [arrow] wesm commented on pull request #7556: ARROW-9188: [C++] Use Brotli shared libraries if they are available

2020-06-30 Thread GitBox


wesm commented on pull request #7556:
URL: https://github.com/apache/arrow/pull/7556#issuecomment-651847404


   This looks good to me, @kszucs @kou @nealrichardson anything else you would 
want to check?







[GitHub] [arrow] kszucs commented on a change in pull request #7519: ARROW-9017: [C++][Python] Refactor scalar bindings

2020-06-30 Thread GitBox


kszucs commented on a change in pull request #7519:
URL: https://github.com/apache/arrow/pull/7519#discussion_r447758318



##
File path: python/pyarrow/tests/test_scalars.py
##
@@ -16,427 +16,443 @@
 # under the License.
 
 import datetime
+import decimal
 import pytest
-import unittest
 
 import numpy as np
 
 import pyarrow as pa
 
 
-class TestScalars(unittest.TestCase):
-
-    def test_null_singleton(self):
-        with pytest.raises(Exception):
-            pa.NAType()
+@pytest.mark.parametrize(['value', 'ty', 'klass', 'deprecated'], [
+    (False, None, pa.BooleanScalar, pa.BooleanValue),
+    (True, None, pa.BooleanScalar, pa.BooleanValue),
+    (1, None, pa.Int64Scalar, pa.Int64Value),
+    (-1, None, pa.Int64Scalar, pa.Int64Value),
+    (1, pa.int8(), pa.Int8Scalar, pa.Int8Value),
+    (1, pa.uint8(), pa.UInt8Scalar, pa.UInt8Value),
+    (1, pa.int16(), pa.Int16Scalar, pa.Int16Value),
+    (1, pa.uint16(), pa.UInt16Scalar, pa.UInt16Value),
+    (1, pa.int32(), pa.Int32Scalar, pa.Int32Value),
+    (1, pa.uint32(), pa.UInt32Scalar, pa.UInt32Value),
+    (1, pa.int64(), pa.Int64Scalar, pa.Int64Value),
+    (1, pa.uint64(), pa.UInt64Scalar, pa.UInt64Value),
+    (1.0, None, pa.DoubleScalar, pa.DoubleValue),
+    (np.float16(1.0), pa.float16(), pa.HalfFloatScalar, pa.HalfFloatValue),
+    (1.0, pa.float32(), pa.FloatScalar, pa.FloatValue),
+    ("string", None, pa.StringScalar, pa.StringValue),
+    (b"bytes", None, pa.BinaryScalar, pa.BinaryValue),
+    ([1, 2, 3], None, pa.ListScalar, pa.ListValue),
+    ([1, 2, 3, 4], pa.large_list(pa.int8()), pa.LargeListScalar,
+     pa.LargeListValue),
+    (datetime.date.today(), None, pa.Date32Scalar, pa.Date64Value),
+    (datetime.datetime.now(), None, pa.TimestampScalar, pa.TimestampValue),
+    ({'a': 1, 'b': [1, 2]}, None, pa.StructScalar, pa.StructValue)
+])
+def test_basics(value, ty, klass, deprecated):
+    s = pa.scalar(value, type=ty)
+    assert isinstance(s, klass)
+    assert s == value
+    assert s == s
+    assert s != "else"
+    assert hash(s) == hash(s)
Review comment:
   It tests that different calls of the C++ hashing mechanism produce the same hash values.
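
The invariant that `assert hash(s) == hash(s)` exercises is the usual hash/eq contract: repeated hashing of the same value must be stable, and equal values must hash equal. A plain-Python analogue (`FakeScalar` is hypothetical, standing in for the C++-backed scalar wrapper):

```python
class FakeScalar:
    """Hypothetical stand-in for a C++-backed scalar wrapper."""

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        return isinstance(other, FakeScalar) and self.value == other.value

    def __hash__(self):
        # Delegating to a deterministic hash makes repeated calls stable.
        return hash(("scalar", self.value))


a, b = FakeScalar(3), FakeScalar(3)
assert hash(a) == hash(a)              # stable across calls
assert a == b and hash(a) == hash(b)   # equal objects hash equal
```

The pyarrow test is the same check, except the hash is computed in C++ on each call, which is why calling it twice and comparing is meaningful.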









[GitHub] [arrow] xhochy opened a new pull request #7593: ARROW-9160: [C++] Implement contains for exact matches

2020-06-30 Thread GitBox


xhochy opened a new pull request #7593:
URL: https://github.com/apache/arrow/pull/7593


   







[GitHub] [arrow] lidavidm commented on pull request #7543: ARROW-9221: [Java] account for big-endian buffers in ArrowBuf.setBytes

2020-06-30 Thread GitBox


lidavidm commented on pull request #7543:
URL: https://github.com/apache/arrow/pull/7543#issuecomment-651777225


   Would a Java maintainer be able to look at this? 







[GitHub] [arrow] crcrpar commented on pull request #7590: ARROW-9277: [C++] Fix docs of reading CSV files

2020-06-30 Thread GitBox


crcrpar commented on pull request #7590:
URL: https://github.com/apache/arrow/pull/7590#issuecomment-651835046


   @pitrou My pleasure :smiley: 
   
   Yes, I think so, as the use of `arrow::Result` in my commits might be confusing.







[GitHub] [arrow] pitrou closed pull request #7590: ARROW-9277: [C++] Fix docs of reading CSV files

2020-06-30 Thread GitBox


pitrou closed pull request #7590:
URL: https://github.com/apache/arrow/pull/7590


   







[GitHub] [arrow] kszucs commented on a change in pull request #7519: ARROW-9017: [C++][Python] Refactor scalar bindings

2020-06-30 Thread GitBox


kszucs commented on a change in pull request #7519:
URL: https://github.com/apache/arrow/pull/7519#discussion_r447734944



##
File path: python/pyarrow/scalar.pxi
##
@@ -16,1198 +16,745 @@
 # under the License.
 
 
-_NULL = NA = None
+import collections
 
 
 cdef class Scalar:
     """
-    The base class for all array elements.
+    The base class for scalars.
     """
 
+    def __init__(self):
+        raise TypeError("Do not call {}'s constructor directly, use "
+                        "pa.scalar() instead.".format(self.__class__.__name__))
 
-cdef class NullType(Scalar):
-    """
-    Singleton for null array elements.
-    """
-    # TODO rename this NullValue?
+    cdef void init(self, const shared_ptr[CScalar]& wrapped):
+        self.wrapped = wrapped
 
-    def __cinit__(self):
-        global NA
-        if NA is not None:
-            raise Exception('Cannot create multiple NAType instances')
+    @staticmethod
+    cdef wrap(const shared_ptr[CScalar]& wrapped):
+        cdef:
+            Scalar self
+            Type type_id = wrapped.get().type.get().id()
+
+        if type_id == _Type_NA:
+            return _NULL
+
+        typ = _scalar_classes[type_id]
+        self = typ.__new__(typ)
+        self.init(wrapped)
+
+        return self
+
+    cdef inline shared_ptr[CScalar] unwrap(self) nogil:
+        return self.wrapped
+
+    @property
+    def type(self):
+        return pyarrow_wrap_data_type(self.wrapped.get().type)
 
-        self.type = null()
+    @property
+    def is_valid(self):
+        return self.wrapped.get().is_valid
 
     def __repr__(self):
-        return 'NULL'
+        return '<pyarrow.{}: {!r}>'.format(
+            self.__class__.__name__, self.as_py()
+        )
 
-    def as_py(self):
-        """
-        Return None
-        """
-        return None
+    def __str__(self):
+        return str(self.as_py())
+
+    def equals(self, Scalar other):
+        return self.wrapped.get().Equals(other.unwrap().get()[0])
 
     def __eq__(self, other):
-        return NA
+        try:
+            if not isinstance(other, Scalar):
+                other = scalar(other, type=self.type)
+            return self.equals(other)
+        except (TypeError, ValueError, ArrowInvalid):
+            return NotImplemented
+
+    def __hash__(self):
+        cdef CScalarHash hasher
+        return hasher(self.wrapped)
+
+    def as_py(self):
+        raise NotImplementedError()
 
 
-_NULL = NA = NullType()
+_NULL = NA = None
 
 
-cdef class ArrayValue(Scalar):
+cdef class NullScalar(Scalar):
     """
-    The base class for non-null array elements.
+    Concrete class for null scalars.
     """
 
-    def __init__(self):
-        raise TypeError("Do not call {}'s constructor directly, use array "
-                        "subscription instead."
-                        .format(self.__class__.__name__))
+    def __cinit__(self):
+        global NA
+        if NA is not None:
+            raise Exception('Cannot create multiple NAType instances')
+        self.init(shared_ptr[CScalar](new CNullScalar()))
 
-    cdef void init(self, DataType type, const shared_ptr[CArray]& sp_array,
-                   int64_t index):
-        self.type = type
-        self.index = index
-        self._set_array(sp_array)
+    def __init__(self):
+        pass
 
-    cdef void _set_array(self, const shared_ptr[CArray]& sp_array):
-        self.sp_array = sp_array
+    def __eq__(self, other):
+        return NA
 
-    def __repr__(self):
-        if hasattr(self, 'as_py'):
-            return repr(self.as_py())
-        else:
-            return super(Scalar, self).__repr__()
+    def __hash__(self):
+        cdef CScalarHash hasher
+        return hasher(self.wrapped)
 
-    def __str__(self):
-        if hasattr(self, 'as_py'):
-            return str(self.as_py())
-        else:
-            return super(Scalar, self).__str__()
+    def as_py(self):
+        """
+        Return this value as a Python None.
+        """
+        return None
 
-    def __eq__(self, other):
-        if hasattr(self, 'as_py'):
-            if isinstance(other, ArrayValue):
-                other = other.as_py()
-            return self.as_py() == other
-        else:
-            raise NotImplementedError(
-                "Cannot compare Arrow values that don't support as_py()")
 
-    def __hash__(self):
-        return hash(self.as_py())
+_NULL = NA = NullScalar()
 
 
-cdef class BooleanValue(ArrayValue):
+cdef class BooleanScalar(Scalar):
     """
-    Concrete class for boolean array elements.
+    Concrete class for boolean scalars.
     """
 
     def as_py(self):
         """
         Return this value as a Python bool.
         """
-        cdef CBooleanArray* ap = <CBooleanArray*> self.sp_array.get()
-        return ap.Value(self.index)
+        cdef CBooleanScalar* sp = <CBooleanScalar*> self.wrapped.get()
+        return sp.value if sp.is_valid else None
 
 
-cdef class Int8Value(ArrayValue):
+cdef class UInt8Scalar(Scalar):
     """
-   

[GitHub] [arrow] github-actions[bot] commented on pull request #7586: ARROW-9280: [Rust] [Parquet] Calculate page and column statistics

2020-06-30 Thread GitBox


github-actions[bot] commented on pull request #7586:
URL: https://github.com/apache/arrow/pull/7586#issuecomment-651852823


   https://issues.apache.org/jira/browse/ARROW-9280







[GitHub] [arrow] kszucs commented on a change in pull request #7589: ARROW-9276: [Release] Enforce CUDA device for updating the api documentations

2020-06-30 Thread GitBox


kszucs commented on a change in pull request #7589:
URL: https://github.com/apache/arrow/pull/7589#discussion_r447643080



##
File path: dev/release/post-09-docs.sh
##
@@ -42,20 +47,20 @@ popd
 pushd "${ARROW_DIR}"
 git checkout "${release_tag}"

Review comment:
   @wesm could you try running it after commenting out the checkout line above (unless you have the tag locally):
   
   ```bash
   PUSH=0 dev/release/post-09-docs.sh 0.17.1
   ```









[GitHub] [arrow] github-actions[bot] commented on pull request #7590: ARROW-9277: [C++] Fix docs of reading CSV files

2020-06-30 Thread GitBox


github-actions[bot] commented on pull request #7590:
URL: https://github.com/apache/arrow/pull/7590#issuecomment-651768214


   https://issues.apache.org/jira/browse/ARROW-9277







[GitHub] [arrow] github-actions[bot] commented on pull request #7591: ARROW-8535: [Rust] Specify arrow-flight version

2020-06-30 Thread GitBox


github-actions[bot] commented on pull request #7591:
URL: https://github.com/apache/arrow/pull/7591#issuecomment-651768213


   https://issues.apache.org/jira/browse/ARROW-8535







[GitHub] [arrow] pitrou opened a new pull request #7592: ARROW-9220: [C++] Make utf8proc optional even with ARROW_COMPUTE=ON

2020-06-30 Thread GitBox


pitrou opened a new pull request #7592:
URL: https://github.com/apache/arrow/pull/7592


   







[GitHub] [arrow] pitrou commented on a change in pull request #7582: ARROW-8190: [FlightRPC][C++] Expose IPC options

2020-06-30 Thread GitBox


pitrou commented on a change in pull request #7582:
URL: https://github.com/apache/arrow/pull/7582#discussion_r447665599



##
File path: cpp/src/arrow/flight/flight_test.cc
##
@@ -1808,6 +1880,90 @@ TEST_F(TestMetadata, DoPutReadMetadata) {
   ASSERT_OK(writer->Close());
 }
 
+TEST_F(TestOptions, DoGetReadOptions) {
+  // Call DoGet, but with a very low read nesting depth set to fail the call.
+  Ticket ticket{""};
+  auto options = FlightCallOptions();
+  options.read_options.max_recursion_depth = 1;
+  std::unique_ptr<FlightStreamReader> stream;
+  ASSERT_OK(client_->DoGet(options, ticket, &stream));
+  FlightStreamChunk chunk;
+  ASSERT_RAISES(Invalid, stream->Next());
+}
+
+TEST_F(TestOptions, DoPutWriteOptions) {
+  // Call DoPut, but with a very low write nesting depth set to fail the call.
+  std::unique_ptr<FlightStreamWriter> writer;
+  std::unique_ptr<FlightMetadataReader> reader;
+  BatchVector expected_batches;
+  ASSERT_OK(ExampleNestedBatches(&expected_batches));
+
+  auto options = FlightCallOptions();
+  options.write_options.max_recursion_depth = 1;
+  ASSERT_OK(client_->DoPut(options, FlightDescriptor{}, expected_batches[0]->schema(),
+                           &writer, &reader));
+  for (const auto& batch : expected_batches) {
+ASSERT_RAISES(Invalid, writer->WriteRecordBatch(*batch));
+  }
+}
+
+TEST_F(TestOptions, DoExchangeReadOptions) {
+  // Call DoExchange and write nested data, but read with a very low nesting
+  // depth set to fail the call.
+  auto options = FlightCallOptions();
+  options.write_options.max_recursion_depth = 1;
+  auto descr = FlightDescriptor::Command("");
+  std::unique_ptr<FlightStreamReader> reader;
+  std::unique_ptr<FlightStreamWriter> writer;
+  ASSERT_OK(client_->DoExchange(options, descr, &writer, &reader));
+  BatchVector batches;
+  ASSERT_OK(ExampleNestedBatches(&batches));
+  ASSERT_OK(writer->Begin(batches[0]->schema()));
+  for (const auto& batch : batches) {
+ASSERT_RAISES(Invalid, writer->WriteRecordBatch(*batch));
+  }
+  ASSERT_OK(writer->DoneWriting());
+  ASSERT_OK(writer->Close());
+}
+
+TEST_F(TestOptions, DoExchangeReadOptionsBegin) {
+  // Call DoExchange and write nested data, but read with a very low nesting
+  // depth set to fail the call. Here the options are set explicitly when we
+  // write data and not in the call options.
+  auto descr = FlightDescriptor::Command("");
+  std::unique_ptr<FlightStreamReader> reader;
+  std::unique_ptr<FlightStreamWriter> writer;
+  ASSERT_OK(client_->DoExchange(descr, &writer, &reader));
+  BatchVector batches;
+  ASSERT_OK(ExampleNestedBatches(&batches));
+  auto options = ipc::IpcWriteOptions::Defaults();
+  options.max_recursion_depth = 1;
+  ASSERT_OK(writer->Begin(batches[0]->schema(), options));
+  for (const auto& batch : batches) {
+ASSERT_RAISES(Invalid, writer->WriteRecordBatch(*batch));
+  }
+  ASSERT_OK(writer->DoneWriting());
+  ASSERT_OK(writer->Close());
+}
+
+TEST_F(TestOptions, DoExchangeWriteOptions) {
+  // Call DoExchange and write nested data, but with a very low nesting depth
+  // set to fail the call. (The low nesting depth is set on the server side.)
+  auto options = FlightCallOptions();
+  options.read_options.max_recursion_depth = 1;

Review comment:
   Why do you set the read options here? AFAICT, this test raises its error 
on the server side.

##
File path: cpp/src/arrow/flight/flight_test.cc
##
@@ -1808,6 +1880,90 @@ TEST_F(TestMetadata, DoPutReadMetadata) {
   ASSERT_OK(writer->Close());
 }
 
+TEST_F(TestOptions, DoGetReadOptions) {
+  // Call DoGet, but with a very low read nesting depth set to fail the call.
+  Ticket ticket{""};
+  auto options = FlightCallOptions();
+  options.read_options.max_recursion_depth = 1;
+  std::unique_ptr<FlightStreamReader> stream;
+  ASSERT_OK(client_->DoGet(options, ticket, &stream));
+  FlightStreamChunk chunk;
+  ASSERT_RAISES(Invalid, stream->Next());
+}
+
+TEST_F(TestOptions, DoPutWriteOptions) {
+  // Call DoPut, but with a very low write nesting depth set to fail the call.
+  std::unique_ptr<FlightStreamWriter> writer;
+  std::unique_ptr<FlightMetadataReader> reader;
+  BatchVector expected_batches;
+  ASSERT_OK(ExampleNestedBatches(&expected_batches));
+
+  auto options = FlightCallOptions();
+  options.write_options.max_recursion_depth = 1;
+  ASSERT_OK(client_->DoPut(options, FlightDescriptor{}, expected_batches[0]->schema(),
+                           &writer, &reader));
+  for (const auto& batch : expected_batches) {
+ASSERT_RAISES(Invalid, writer->WriteRecordBatch(*batch));
+  }
+}
+
+TEST_F(TestOptions, DoExchangeReadOptions) {
+  // Call DoExchange and write nested data, but read with a very low nesting
+  // depth set to

Review comment:
   Comment seems off, it's the write options that are being set (and it's 
the `WriteRecordBatch` call that fails).
   

##
File path: cpp/src/arrow/flight/flight_test.cc
##
@@ -1808,6 +1880,90 @@ TEST_F(TestMetadata, DoPutReadMetadata) {
   ASSERT_OK(writer->Close());
 }
 
+TEST_F(TestOptions, DoGetReadOptions) {
+  // Call DoGet, but with a very low read nesting depth set to fail the call.
+  Ticket ticket{""};
+  auto options = FlightCallOptions();
+  options.read_options.max_recursion_depth = 1;
+  

[GitHub] [arrow] xhochy commented on pull request #7593: ARROW-9160: [C++] Implement contains for exact matches

2020-06-30 Thread GitBox


xhochy commented on pull request #7593:
URL: https://github.com/apache/arrow/pull/7593#issuecomment-651779867


   This currently fails for chunked arrays. I thought they should be handled 
by the kernel framework automatically, but it seems they aren't.
   ```
   [----------] 1 test from TestStringKernels/1, where TypeParam = arrow::LargeStringType
   [ RUN      ] TestStringKernels/1.ContainsExact
   ../src/arrow/testing/gtest_util.cc:149: Failure
   Failed
   Got:
 [
   [
 false,
 null,
 false
   ]
 ]
   Expected:
 [
   [
 true,
 null
   ],
   [
 false
   ]
 ]
   ../src/arrow/testing/gtest_util.cc:149: Failure
   Failed
   Got:
 [
   [
 false,
 false,
 false,
 null,
 false
   ]
 ]
   Expected:
 [
   [
 true,
 false
   ],
   [
 true,
 null,
 false
   ]
 ]
   [  FAILED  ] TestStringKernels/1.ContainsExact, where TypeParam = arrow::LargeStringType (2 ms)
   ```
   
   @wesm @pitrou Any pointers what I'm missing?
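   For readers following along, the contract being appealed to here — the kernel
executor applying a scalar kernel to each chunk of a chunked array independently
and preserving the chunk layout in the output — can be sketched in plain Python.
All names below are illustrative stand-ins, not Arrow APIs.

```python
def execute_scalar_kernel(kernel, chunked_array):
    """Apply `kernel` chunk by chunk, keeping the chunk boundaries intact."""
    return [kernel(chunk) for chunk in chunked_array]

def contains_exact(pattern):
    # Per-chunk kernel: null (None) propagates, otherwise exact containment.
    def kernel(chunk):
        return [None if v is None else pattern in v for v in chunk]
    return kernel

chunks = [["foo", None], ["bar"]]
result = execute_scalar_kernel(contains_exact("foo"), chunks)
# The output keeps the input chunking: [[True, None], [False]],
# rather than one flattened chunk as in the failing test output above.
```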







[GitHub] [arrow] github-actions[bot] commented on pull request #7593: ARROW-9160: [C++] Implement contains for exact matches

2020-06-30 Thread GitBox


github-actions[bot] commented on pull request #7593:
URL: https://github.com/apache/arrow/pull/7593#issuecomment-651787097


   https://issues.apache.org/jira/browse/ARROW-9160







[GitHub] [arrow] github-actions[bot] commented on pull request #7592: ARROW-9220: [C++] Make utf8proc optional even with ARROW_COMPUTE=ON

2020-06-30 Thread GitBox


github-actions[bot] commented on pull request #7592:
URL: https://github.com/apache/arrow/pull/7592#issuecomment-651787096


   https://issues.apache.org/jira/browse/ARROW-9220







[GitHub] [arrow] wesm commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-30 Thread GitBox


wesm commented on pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#issuecomment-651793397


   thanks @maartenbreddels!







[GitHub] [arrow] github-actions[bot] commented on pull request #7594: ARROW-7654: [Python] Ability to set column_types to a Schema in csv.ConvertOptions is undocumented

2020-06-30 Thread GitBox


github-actions[bot] commented on pull request #7594:
URL: https://github.com/apache/arrow/pull/7594#issuecomment-651831683


   https://issues.apache.org/jira/browse/ARROW-7654







[GitHub] [arrow] pitrou commented on a change in pull request #7544: ARROW-7285: [C++] ensure C++ implementation meets clarified dictionary spec

2020-06-30 Thread GitBox


pitrou commented on a change in pull request #7544:
URL: https://github.com/apache/arrow/pull/7544#discussion_r447729043



##
File path: cpp/src/arrow/ipc/read_write_test.cc
##
@@ -1228,6 +1228,152 @@ TEST_P(TestFileFormat, RoundTrip) {
   TestZeroLengthRoundTrip(*GetParam(), options);
 }
 
+Status MakeDictionaryBatch(std::shared_ptr<RecordBatch>* out) {
+  const int64_t length = 6;
+
+  std::vector<bool> is_valid = {true, true, false, true, true, true};
+
+  auto dict_ty = utf8();
+
+  auto dict = ArrayFromJSON(dict_ty, "[\"foo\", \"bar\", \"baz\"]");
+
+  auto f0_type = arrow::dictionary(arrow::int32(), dict_ty);
+  auto f1_type = arrow::dictionary(arrow::int8(), dict_ty);
+
+  std::shared_ptr<Array> indices0, indices1;
+  std::vector<int32_t> indices0_values = {1, 2, -1, 0, 2, 0};
+  std::vector<int8_t> indices1_values = {0, 0, 2, 2, 1, 1};
+
+  ArrayFromVector<Int32Type, int32_t>(is_valid, indices0_values, &indices0);
+  ArrayFromVector<Int8Type, int8_t>(is_valid, indices1_values, &indices1);

Review comment:
   You can create the indices arrays with `ArrayFromJSON` too, it will 
probably make the code simpler to read and maintain.

##
File path: cpp/src/arrow/ipc/metadata_internal.h
##
@@ -198,7 +198,7 @@ Status WriteDictionaryMessage(
 const int64_t id, const int64_t length, const int64_t body_length,
    const std::shared_ptr<const KeyValueMetadata>& custom_metadata,
    const std::vector<FieldMetadata>& nodes, const std::vector<BufferMetadata>& buffers,
-    const IpcWriteOptions& options, std::shared_ptr<Buffer>* out);
+    const IpcWriteOptions& options, std::shared_ptr<Buffer>* out, bool isDelta);

Review comment:
   We should keep `std::shared_ptr<Buffer>* out` at the end of the arguments.

##
File path: cpp/src/arrow/ipc/reader.cc
##
@@ -684,7 +685,19 @@ Status ReadDictionary(const Buffer& metadata, DictionaryMemo* dictionary_memo,
     return Status::Invalid("Dictionary record batch must only contain one field");
   }
   auto dictionary = batch->column(0);
-  return dictionary_memo->AddDictionary(id, dictionary);
+  if (dictionary_batch->isDelta()) {
+    std::shared_ptr<Array> originalDict, combinedDict;
+    RETURN_NOT_OK(dictionary_memo->GetDictionary(id, &originalDict));
+    ArrayVector dictsToCombine{originalDict, dictionary};
+    ARROW_ASSIGN_OR_RAISE(combinedDict,
+                          Concatenate(dictsToCombine, options.memory_pool));

Review comment:
   How about folding this logic, e.g. add an `AddDictionaryDelta` method to `DictionaryMemo`?
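
   A minimal sketch of the suggested refactor, in plain Python rather than the
Arrow C++ API (the class and method names mirror the suggestion but are
assumptions, not the actual implementation):

```python
class DictionaryMemo:
    """Toy dictionary memo: tracks one dictionary per id."""

    def __init__(self):
        self._dictionaries = {}

    def add_dictionary(self, dict_id, values):
        # A plain dictionary batch replaces any prior dictionary for this id.
        self._dictionaries[dict_id] = list(values)

    def add_dictionary_delta(self, dict_id, delta):
        # A delta dictionary batch appends entries to the existing
        # dictionary instead of replacing it (the Concatenate in the diff).
        self._dictionaries[dict_id] = self._dictionaries[dict_id] + list(delta)

    def get_dictionary(self, dict_id):
        return self._dictionaries[dict_id]

memo = DictionaryMemo()
memo.add_dictionary(0, ["foo", "bar"])
memo.add_dictionary_delta(0, ["baz"])
# get_dictionary(0) now returns ["foo", "bar", "baz"]
```

With this shape, the reader only decides *which* method to call based on the
delta flag; the combination logic lives in one place.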

##
File path: cpp/src/arrow/ipc/writer.h
##
@@ -341,6 +343,29 @@ class ARROW_EXPORT IpcPayloadWriter {
   virtual Status Close() = 0;
 };
 
+/// Create a new IPC payload stream writer from stream sink. User is
+/// responsible for closing the actual OutputStream.
+///
+/// \param[in] sink output stream to write to
+/// \param[in] options options for serialization
+/// \return Result<std::unique_ptr<IpcPayloadWriter>>
+ARROW_EXPORT
+Result<std::unique_ptr<IpcPayloadWriter>> NewPayloadStreamWriter(

Review comment:
   Nit: call this `MakePayloadStreamWriter`

##
File path: cpp/src/arrow/ipc/writer.h
##
@@ -341,6 +343,29 @@ class ARROW_EXPORT IpcPayloadWriter {
   virtual Status Close() = 0;
 };
 
+/// Create a new IPC payload stream writer from stream sink. User is
+/// responsible for closing the actual OutputStream.
+///
+/// \param[in] sink output stream to write to
+/// \param[in] options options for serialization
+/// \return Result<std::unique_ptr<IpcPayloadWriter>>
+ARROW_EXPORT
+Result<std::unique_ptr<IpcPayloadWriter>> NewPayloadStreamWriter(
+    io::OutputStream* sink, const IpcWriteOptions& options = IpcWriteOptions::Defaults());
+
+/// Create a new IPC payload file writer from stream sink.
+///
+/// \param[in] sink output stream to write to
+/// \param[in] schema the schema of the record batches to be written
+/// \param[in] options options for serialization, optional
+/// \param[in] metadata custom metadata for File Footer, optional
+/// \return Status
+ARROW_EXPORT
+Result<std::unique_ptr<IpcPayloadWriter>> NewPayloadFileWriter(

Review comment:
   Nit: call this `MakePayloadFileWriter`

##
File path: cpp/src/arrow/ipc/read_write_test.cc
##
@@ -1228,6 +1228,152 @@ TEST_P(TestFileFormat, RoundTrip) {
   TestZeroLengthRoundTrip(*GetParam(), options);
 }
 
+Status MakeDictionaryBatch(std::shared_ptr<RecordBatch>* out) {
+  const int64_t length = 6;
+
+  std::vector<bool> is_valid = {true, true, false, true, true, true};
+
+  auto dict_ty = utf8();
+
+  auto dict = ArrayFromJSON(dict_ty, "[\"foo\", \"bar\", \"baz\"]");
+
+  auto f0_type = arrow::dictionary(arrow::int32(), dict_ty);
+  auto f1_type = arrow::dictionary(arrow::int8(), dict_ty);
+
+  std::shared_ptr<Array> indices0, indices1;
+  std::vector<int32_t> indices0_values = {1, 2, -1, 0, 2, 0};
+  std::vector<int8_t> indices1_values = {0, 0, 2, 2, 1, 1};
+
+  ArrayFromVector<Int32Type, int32_t>(is_valid, indices0_values, &indices0);
+  ArrayFromVector<Int8Type, int8_t>(is_valid, indices1_values, &indices1);
+
+  auto a0 = std::make_shared<DictionaryArray>(f0_type, indices0, dict);
+  auto a1 = std::make_shared<DictionaryArray>(f1_type, indices1, dict);
+
+  // construct batch
+  auto schema = ::arrow::schema({field("dict1", f0_type), field("dict2", f1_type)});
+
+  *out = RecordBatch::Make(schema, length, {a0, a1});
+  return Status::OK();
+}
+
+// A record batch 

[GitHub] [arrow] pitrou closed pull request #7592: ARROW-9220: [C++] Make utf8proc optional even with ARROW_COMPUTE=ON

2020-06-30 Thread GitBox


pitrou closed pull request #7592:
URL: https://github.com/apache/arrow/pull/7592


   







[GitHub] [arrow] kszucs commented on a change in pull request #7519: ARROW-9017: [C++][Python] Refactor scalar bindings

2020-06-30 Thread GitBox


kszucs commented on a change in pull request #7519:
URL: https://github.com/apache/arrow/pull/7519#discussion_r447731034



##
File path: python/pyarrow/scalar.pxi
##
@@ -16,1198 +16,745 @@
 # under the License.
 
 
-_NULL = NA = None
+import collections
 
 
 cdef class Scalar:
 """
-The base class for all array elements.
+The base class for scalars.
 """
 
+def __init__(self):
+raise TypeError("Do not call {}'s constructor directly, use "
+"pa.scalar() instead.".format(self.__class__.__name__))
 
-cdef class NullType(Scalar):
-"""
-Singleton for null array elements.
-"""
-# TODO rename this NullValue?
+cdef void init(self, const shared_ptr[CScalar]& wrapped):
+self.wrapped = wrapped
 
-def __cinit__(self):
-global NA
-if NA is not None:
-raise Exception('Cannot create multiple NAType instances')
+@staticmethod
+cdef wrap(const shared_ptr[CScalar]& wrapped):
+cdef:
+Scalar self
+Type type_id = wrapped.get().type.get().id()
+
+if type_id == _Type_NA:
+return _NULL
+
+typ = _scalar_classes[type_id]
+self = typ.__new__(typ)
+self.init(wrapped)
+
+return self
+
+cdef inline shared_ptr[CScalar] unwrap(self) nogil:
+return self.wrapped
+
+@property
+def type(self):
+return pyarrow_wrap_data_type(self.wrapped.get().type)
 
-self.type = null()
+@property
+def is_valid(self):
+return self.wrapped.get().is_valid
 
 def __repr__(self):
-return 'NULL'
+return '<pyarrow.{}: {!r}>'.format(
+self.__class__.__name__, self.as_py()
+)
 
-def as_py(self):
-"""
-Return None
-"""
-return None
+def __str__(self):
+return str(self.as_py())
+
+def equals(self, Scalar other):
+return self.wrapped.get().Equals(other.unwrap().get()[0])
 
 def __eq__(self, other):
-return NA
+try:
+if not isinstance(other, Scalar):
+other = scalar(other, type=self.type)
+return self.equals(other)
+except (TypeError, ValueError, ArrowInvalid):
+return NotImplemented
+
+def __hash__(self):
+cdef CScalarHash hasher
+return hasher(self.wrapped)
+
+def as_py(self):
+raise NotImplementedError()
 
 
-_NULL = NA = NullType()
+_NULL = NA = None
 
 
-cdef class ArrayValue(Scalar):
+cdef class NullScalar(Scalar):
 """
-The base class for non-null array elements.
+Concrete class for null scalars.
 """
 
-def __init__(self):
-raise TypeError("Do not call {}'s constructor directly, use array "
-"subscription instead."
-.format(self.__class__.__name__))
+def __cinit__(self):
+global NA
+if NA is not None:
+raise Exception('Cannot create multiple NAType instances')
+self.init(shared_ptr[CScalar](new CNullScalar()))
 
-cdef void init(self, DataType type, const shared_ptr[CArray]& sp_array,
-   int64_t index):
-self.type = type
-self.index = index
-self._set_array(sp_array)
+def __init__(self):
+pass
 
-cdef void _set_array(self, const shared_ptr[CArray]& sp_array):
-self.sp_array = sp_array
+def __eq__(self, other):
+return NA

Review comment:
   Well, it was the behavior before the refactor. Shall we check proper 
equality with NullScalar instead?









[GitHub] [arrow] kszucs commented on a change in pull request #7519: ARROW-9017: [C++][Python] Refactor scalar bindings

2020-06-30 Thread GitBox


kszucs commented on a change in pull request #7519:
URL: https://github.com/apache/arrow/pull/7519#discussion_r447738979



##
File path: python/pyarrow/scalar.pxi
##
@@ -16,1198 +16,745 @@
 # under the License.
 
 
-_NULL = NA = None
+import collections
 
 
 cdef class Scalar:
 """
-The base class for all array elements.
+The base class for scalars.
 """
 
+def __init__(self):
+raise TypeError("Do not call {}'s constructor directly, use "
+"pa.scalar() instead.".format(self.__class__.__name__))
 
-cdef class NullType(Scalar):
-"""
-Singleton for null array elements.
-"""
-# TODO rename this NullValue?
+cdef void init(self, const shared_ptr[CScalar]& wrapped):
+self.wrapped = wrapped
 
-def __cinit__(self):
-global NA
-if NA is not None:
-raise Exception('Cannot create multiple NAType instances')
+@staticmethod
+cdef wrap(const shared_ptr[CScalar]& wrapped):
+cdef:
+Scalar self
+Type type_id = wrapped.get().type.get().id()
+
+if type_id == _Type_NA:
+return _NULL
+
+typ = _scalar_classes[type_id]
+self = typ.__new__(typ)
+self.init(wrapped)
+
+return self
+
+cdef inline shared_ptr[CScalar] unwrap(self) nogil:
+return self.wrapped
+
+@property
+def type(self):
+return pyarrow_wrap_data_type(self.wrapped.get().type)
 
-self.type = null()
+@property
+def is_valid(self):
+return self.wrapped.get().is_valid
 
 def __repr__(self):
-return 'NULL'
+return '<pyarrow.{}: {!r}>'.format(
+self.__class__.__name__, self.as_py()
+)
 
-def as_py(self):
-"""
-Return None
-"""
-return None
+def __str__(self):
+return str(self.as_py())
+
+def equals(self, Scalar other):
+return self.wrapped.get().Equals(other.unwrap().get()[0])
 
 def __eq__(self, other):
-return NA
+try:
+if not isinstance(other, Scalar):
+other = scalar(other, type=self.type)
+return self.equals(other)
+except (TypeError, ValueError, ArrowInvalid):
+return NotImplemented
+
+def __hash__(self):
+cdef CScalarHash hasher
+return hasher(self.wrapped)
+
+def as_py(self):
+raise NotImplementedError()
 
 
-_NULL = NA = NullType()
+_NULL = NA = None
 
 
-cdef class ArrayValue(Scalar):
+cdef class NullScalar(Scalar):
 """
-The base class for non-null array elements.
+Concrete class for null scalars.
 """
 
-def __init__(self):
-raise TypeError("Do not call {}'s constructor directly, use array "
-"subscription instead."
-.format(self.__class__.__name__))
+def __cinit__(self):
+global NA
+if NA is not None:
+raise Exception('Cannot create multiple NAType instances')
+self.init(shared_ptr[CScalar](new CNullScalar()))
 
-cdef void init(self, DataType type, const shared_ptr[CArray]& sp_array,
-   int64_t index):
-self.type = type
-self.index = index
-self._set_array(sp_array)
+def __init__(self):
+pass
 
-cdef void _set_array(self, const shared_ptr[CArray]& sp_array):
-self.sp_array = sp_array
+def __eq__(self, other):
+return NA
 
-def __repr__(self):
-if hasattr(self, 'as_py'):
-return repr(self.as_py())
-else:
-return super(Scalar, self).__repr__()
+def __hash__(self):
+cdef CScalarHash hasher
+return hasher(self.wrapped)
 
-def __str__(self):
-if hasattr(self, 'as_py'):
-return str(self.as_py())
-else:
-return super(Scalar, self).__str__()
+def as_py(self):
+"""
+Return this value as a Python None.
+"""
+return None
 
-def __eq__(self, other):
-if hasattr(self, 'as_py'):
-if isinstance(other, ArrayValue):
-other = other.as_py()
-return self.as_py() == other
-else:
-raise NotImplementedError(
-"Cannot compare Arrow values that don't support as_py()")
 
-def __hash__(self):
-return hash(self.as_py())
+_NULL = NA = NullScalar()
 
 
-cdef class BooleanValue(ArrayValue):
+cdef class BooleanScalar(Scalar):
 """
-Concrete class for boolean array elements.
+Concrete class for boolean scalars.
 """
 
 def as_py(self):
 """
 Return this value as a Python bool.
 """
-cdef CBooleanArray* ap = <CBooleanArray*> self.sp_array.get()
-return ap.Value(self.index)
+cdef CBooleanScalar* sp = <CBooleanScalar*> self.wrapped.get()
+return sp.value if sp.is_valid else None
 
 
-cdef class Int8Value(ArrayValue):
+cdef class UInt8Scalar(Scalar):
 """
-   

[GitHub] [arrow] maartenbreddels commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-30 Thread GitBox


maartenbreddels commented on pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#issuecomment-651796902


   You're welcome. Thanks all for your help. Impressed by the project, setup 
(CI/CMake), and people, and happy with the results:
   
   
![image](https://user-images.githubusercontent.com/1765949/86132718-92b88780-bae7-11ea-97a7-bb6d853fbb91.png)
   







[GitHub] [arrow] pitrou commented on pull request #7519: ARROW-9017: [C++][Python] Refactor scalar bindings

2020-06-30 Thread GitBox


pitrou commented on pull request #7519:
URL: https://github.com/apache/arrow/pull/7519#issuecomment-651806049


   Ok, at a quick glance, it seems that null container tests work properly regardless:
   ```python
   >>> s = set()
   >>> s.add(pa.scalar(None))
   >>> s
   {<pyarrow.NullScalar: None>}
   >>> pa.scalar(None) in s
   True
   >>> s.add(pa.scalar(None, pa.int64()))
   >>> s.add(pa.scalar(12, pa.int64()))
   >>> s
   {<pyarrow.NullScalar: None>,
    <pyarrow.Int64Scalar: None>,
    <pyarrow.Int64Scalar: 12>}
   >>> pa.scalar(None, pa.int64()) in s
   True
   >>> pa.scalar(None, pa.int32()) in s
   False
   ```
   ```python
   >>> l = [pa.scalar(None)]
   >>> pa.scalar(None) in l
   True
   >>> pa.scalar(None, pa.int64()) in l
   False
   ```







[GitHub] [arrow] fsaintjacques closed pull request #7536: ARROW-8647: [C++][Python][Dataset] Allow partitioning fields to be inferred with dictionary type

2020-06-30 Thread GitBox


fsaintjacques closed pull request #7536:
URL: https://github.com/apache/arrow/pull/7536


   







[GitHub] [arrow] lidavidm commented on pull request #7582: ARROW-8190: [FlightRPC][C++] Expose IPC options

2020-06-30 Thread GitBox


lidavidm commented on pull request #7582:
URL: https://github.com/apache/arrow/pull/7582#issuecomment-651813343


   Thanks for the review! I've fixed the comment/test names.







[GitHub] [arrow] pitrou commented on a change in pull request #7519: ARROW-9017: [C++][Python] Refactor scalar bindings

2020-06-30 Thread GitBox


pitrou commented on a change in pull request #7519:
URL: https://github.com/apache/arrow/pull/7519#discussion_r447732301



##
File path: python/pyarrow/scalar.pxi
##
@@ -16,1198 +16,745 @@
 # under the License.
 
 
-_NULL = NA = None
+import collections
 
 
 cdef class Scalar:
 """
-The base class for all array elements.
+The base class for scalars.
 """
 
+def __init__(self):
+raise TypeError("Do not call {}'s constructor directly, use "
+"pa.scalar() instead.".format(self.__class__.__name__))
 
-cdef class NullType(Scalar):
-"""
-Singleton for null array elements.
-"""
-# TODO rename this NullValue?
+cdef void init(self, const shared_ptr[CScalar]& wrapped):
+self.wrapped = wrapped
 
-def __cinit__(self):
-global NA
-if NA is not None:
-raise Exception('Cannot create multiple NAType instances')
+@staticmethod
+cdef wrap(const shared_ptr[CScalar]& wrapped):
+cdef:
+Scalar self
+Type type_id = wrapped.get().type.get().id()
+
+if type_id == _Type_NA:
+return _NULL
+
+typ = _scalar_classes[type_id]
+self = typ.__new__(typ)
+self.init(wrapped)
+
+return self
+
+cdef inline shared_ptr[CScalar] unwrap(self) nogil:
+return self.wrapped
+
+@property
+def type(self):
+return pyarrow_wrap_data_type(self.wrapped.get().type)
 
-self.type = null()
+@property
+def is_valid(self):
+return self.wrapped.get().is_valid
 
 def __repr__(self):
-return 'NULL'
+return '<pyarrow.{}: {!r}>'.format(
+self.__class__.__name__, self.as_py()
+)
 
-def as_py(self):
-"""
-Return None
-"""
-return None
+def __str__(self):
+return str(self.as_py())
+
+def equals(self, Scalar other):
+return self.wrapped.get().Equals(other.unwrap().get()[0])
 
 def __eq__(self, other):
-return NA
+try:
+if not isinstance(other, Scalar):
+other = scalar(other, type=self.type)
+return self.equals(other)
+except (TypeError, ValueError, ArrowInvalid):
+return NotImplemented
+
+def __hash__(self):
+cdef CScalarHash hasher
+return hasher(self.wrapped)
+
+def as_py(self):
+raise NotImplementedError()
 
 
-_NULL = NA = NullType()
+_NULL = NA = None
 
 
-cdef class ArrayValue(Scalar):
+cdef class NullScalar(Scalar):
 """
-The base class for non-null array elements.
+Concrete class for null scalars.
 """
 
-def __init__(self):
-raise TypeError("Do not call {}'s constructor directly, use array "
-"subscription instead."
-.format(self.__class__.__name__))
+def __cinit__(self):
+global NA
+if NA is not None:
+raise Exception('Cannot create multiple NAType instances')
+self.init(shared_ptr[CScalar](new CNullScalar()))
 
-cdef void init(self, DataType type, const shared_ptr[CArray]& sp_array,
-   int64_t index):
-self.type = type
-self.index = index
-self._set_array(sp_array)
+def __init__(self):
+pass
 
-cdef void _set_array(self, const shared_ptr[CArray]& sp_array):
-self.sp_array = sp_array
+def __eq__(self, other):
+return NA

Review comment:
   Well, it should probably return `False` for non-nulls?
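
   The two candidate semantics under discussion can be contrasted with a small,
pyarrow-free sketch (all class names below are illustrative, not the actual
bindings):

```python
NA = object()  # stand-in for the null sentinel

class NullScalarPropagating:
    # Pre-refactor behavior: comparing with null yields "null", not a bool.
    def __eq__(self, other):
        return NA

class NullScalarStrict:
    # Suggested behavior: a real bool, True only for another null scalar.
    def __eq__(self, other):
        return isinstance(other, (NullScalarStrict, NullScalarPropagating))

# The strict variant gives plain booleans usable in ordinary comparisons:
strict_eq_null = NullScalarStrict() == NullScalarStrict()   # True
strict_eq_int = NullScalarStrict() == 42                    # False
# The propagating variant never yields True/False at all:
propagating_eq = NullScalarPropagating() == NullScalarPropagating()  # NA
```

Note that `in` checks on containers call `__eq__` (after an identity check), so
the propagating variant only behaves sensibly in containers because identical
objects short-circuit via `is`.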









[GitHub] [arrow] wesm commented on pull request #7593: ARROW-9160: [C++] Implement contains for exact matches

2020-06-30 Thread GitBox


wesm commented on pull request #7593:
URL: https://github.com/apache/arrow/pull/7593#issuecomment-651849432


   @xhochy chunked array should be handled automatically by the function 
executors. I will take a look. 







[GitHub] [arrow] pitrou commented on a change in pull request #7519: ARROW-9017: [C++][Python] Refactor scalar bindings

2020-06-30 Thread GitBox


pitrou commented on a change in pull request #7519:
URL: https://github.com/apache/arrow/pull/7519#discussion_r447751371



##
File path: python/pyarrow/scalar.pxi
##
@@ -16,1198 +16,745 @@
 # under the License.
 
 
-_NULL = NA = None
+import collections
 
 
 cdef class Scalar:
 """
-The base class for all array elements.
+The base class for scalars.
 """
 
+def __init__(self):
+raise TypeError("Do not call {}'s constructor directly, use "
+"pa.scalar() instead.".format(self.__class__.__name__))
 
-cdef class NullType(Scalar):
-"""
-Singleton for null array elements.
-"""
-# TODO rename this NullValue?
+cdef void init(self, const shared_ptr[CScalar]& wrapped):
+self.wrapped = wrapped
 
-def __cinit__(self):
-global NA
-if NA is not None:
-raise Exception('Cannot create multiple NAType instances')
+@staticmethod
+cdef wrap(const shared_ptr[CScalar]& wrapped):
+cdef:
+Scalar self
+Type type_id = wrapped.get().type.get().id()
+
+if type_id == _Type_NA:
+return _NULL
+
+typ = _scalar_classes[type_id]
+self = typ.__new__(typ)
+self.init(wrapped)
+
+return self
+
+cdef inline shared_ptr[CScalar] unwrap(self) nogil:
+return self.wrapped
+
+@property
+def type(self):
+return pyarrow_wrap_data_type(self.wrapped.get().type)
 
-self.type = null()
+@property
+def is_valid(self):
+return self.wrapped.get().is_valid
 
 def __repr__(self):
-return 'NULL'
+return '<pyarrow.{}: {!r}>'.format(
+self.__class__.__name__, self.as_py()
+)
 
-def as_py(self):
-"""
-Return None
-"""
-return None
+def __str__(self):
+return str(self.as_py())
+
+def equals(self, Scalar other):
+return self.wrapped.get().Equals(other.unwrap().get()[0])
 
 def __eq__(self, other):
-return NA
+try:
+if not isinstance(other, Scalar):
+other = scalar(other, type=self.type)
+return self.equals(other)
+except (TypeError, ValueError, ArrowInvalid):
+return NotImplemented
+
+def __hash__(self):
+cdef CScalarHash hasher
+return hasher(self.wrapped)
+
+def as_py(self):
+raise NotImplementedError()
 
 
-_NULL = NA = NullType()
+_NULL = NA = None
 
 
-cdef class ArrayValue(Scalar):
+cdef class NullScalar(Scalar):
 """
-The base class for non-null array elements.
+Concrete class for null scalars.
 """
 
-def __init__(self):
-raise TypeError("Do not call {}'s constructor directly, use array "
-"subscription instead."
-.format(self.__class__.__name__))
+def __cinit__(self):
+global NA
+if NA is not None:
+raise Exception('Cannot create multiple NAType instances')
+self.init(shared_ptr[CScalar](new CNullScalar()))
 
-cdef void init(self, DataType type, const shared_ptr[CArray]& sp_array,
-   int64_t index):
-self.type = type
-self.index = index
-self._set_array(sp_array)
+def __init__(self):
+pass
 
-cdef void _set_array(self, const shared_ptr[CArray]& sp_array):
-self.sp_array = sp_array
+def __eq__(self, other):
+return NA
 
-def __repr__(self):
-if hasattr(self, 'as_py'):
-return repr(self.as_py())
-else:
-return super(Scalar, self).__repr__()
+def __hash__(self):
+cdef CScalarHash hasher
+return hasher(self.wrapped)
 
-def __str__(self):
-if hasattr(self, 'as_py'):
-return str(self.as_py())
-else:
-return super(Scalar, self).__str__()
+def as_py(self):
+"""
+Return this value as a Python None.
+"""
+return None
 
-def __eq__(self, other):
-if hasattr(self, 'as_py'):
-if isinstance(other, ArrayValue):
-other = other.as_py()
-return self.as_py() == other
-else:
-raise NotImplementedError(
-"Cannot compare Arrow values that don't support as_py()")
 
-def __hash__(self):
-return hash(self.as_py())
+_NULL = NA = NullScalar()
 
 
-cdef class BooleanValue(ArrayValue):
+cdef class BooleanScalar(Scalar):
 """
-Concrete class for boolean array elements.
+Concrete class for boolean scalars.
 """
 
 def as_py(self):
 """
 Return this value as a Python bool.
 """
-cdef CBooleanArray* ap = <CBooleanArray*> self.sp_array.get()
-return ap.Value(self.index)
+cdef CBooleanScalar* sp = <CBooleanScalar*> self.wrapped.get()
+return sp.value if sp.is_valid else None
 
 
-cdef class Int8Value(ArrayValue):
+cdef class UInt8Scalar(Scalar):
 """
-   
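The `Scalar.wrap` staticmethod in the diff above dispatches on the scalar's type id to pick a concrete subclass from `_scalar_classes` and instantiates it with `__new__`, bypassing the guarded `__init__`. A plain-Python sketch of that registry pattern (the names here are illustrative, not pyarrow's actual internals):

```python
# Illustrative sketch (not pyarrow's real internals): a registry maps a
# type id to a concrete Scalar subclass, and wrap() picks the subclass
# and instantiates it with __new__, bypassing the guarded __init__
# exactly as the Cython code does with typ.__new__(typ).
_scalar_classes = {}

def register(type_id):
    def deco(cls):
        _scalar_classes[type_id] = cls
        return cls
    return deco

class Scalar:
    def __init__(self):
        raise TypeError("use wrap() instead")

    @staticmethod
    def wrap(type_id, value):
        typ = _scalar_classes[type_id]
        self = typ.__new__(typ)   # no __init__ call, so no TypeError
        self.value = value
        return self

@register("bool")
class BooleanScalar(Scalar):
    def as_py(self):
        return self.value

s = Scalar.wrap("bool", True)
assert isinstance(s, BooleanScalar)
assert s.as_py() is True
```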

[GitHub] [arrow] pitrou commented on pull request #7590: ARROW-9277: [C++] Fix docs of reading CSV files

2020-06-30 Thread GitBox


pitrou commented on pull request #7590:
URL: https://github.com/apache/arrow/pull/7590#issuecomment-651836379


   Thank you for your contribution. Please don't hesitate to report other doc 
problems.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm edited a comment on pull request #7593: ARROW-9160: [C++] Implement contains for exact matches

2020-06-30 Thread GitBox


wesm edited a comment on pull request #7593:
URL: https://github.com/apache/arrow/pull/7593#issuecomment-651849432


   @xhochy chunked arrays should be handled automatically by the function 
executors. I will take a look. 







[GitHub] [arrow] pitrou commented on pull request #7590: ARROW-9277: [C++] Fix docs of reading CSV files

2020-06-30 Thread GitBox


pitrou commented on pull request #7590:
URL: https://github.com/apache/arrow/pull/7590#issuecomment-651769268


   Thank you for spotting this and suggesting a fix! I will make a couple 
changes to your suggestion.







[GitHub] [arrow] nevi-me commented on pull request #7591: ARROW-8535: [Rust] Specify arrow-flight version

2020-06-30 Thread GitBox


nevi-me commented on pull request #7591:
URL: https://github.com/apache/arrow/pull/7591#issuecomment-651837534


   @kszucs could you please have a look at this when you get a chance? There's a 
change to the prepare-test Ruby script.







[GitHub] [arrow] pitrou commented on a change in pull request #7589: ARROW-9276: [Release] Enforce CUDA device for updating the api documentations

2020-06-30 Thread GitBox


pitrou commented on a change in pull request #7589:
URL: https://github.com/apache/arrow/pull/7589#discussion_r447741353



##
File path: dev/release/post-09-docs.sh
##
@@ -42,20 +47,20 @@ popd
 pushd "${ARROW_DIR}"
 git checkout "${release_tag}"

Review comment:
   @kszucs Is a CUDA machine needed? Do you want me to try?









[GitHub] [arrow] kszucs opened a new pull request #7589: ARROW-9276: [Release] Enforce CUDA device for updating the api documentations

2020-06-30 Thread GitBox


kszucs opened a new pull request #7589:
URL: https://github.com/apache/arrow/pull/7589


   







[GitHub] [arrow] github-actions[bot] commented on pull request #7590: [C++] Fix docs of reading CSV files

2020-06-30 Thread GitBox


github-actions[bot] commented on pull request #7590:
URL: https://github.com/apache/arrow/pull/7590#issuecomment-651761199


   
   
   Thanks for opening a pull request!
   
   Could you open an issue for this pull request on JIRA?
   https://issues.apache.org/jira/browse/ARROW
   
   Then could you also rename pull request title in the following format?
   
   ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}
   
   See also:
   
 * [Other pull requests](https://github.com/apache/arrow/pulls/)
 * [Contribution Guidelines - How to contribute 
patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
   







[GitHub] [arrow] github-actions[bot] commented on pull request #7589: ARROW-9276: [Release] Enforce CUDA device for updating the api documentations

2020-06-30 Thread GitBox


github-actions[bot] commented on pull request #7589:
URL: https://github.com/apache/arrow/pull/7589#issuecomment-651761179


   https://issues.apache.org/jira/browse/ARROW-9276







[GitHub] [arrow] crcrpar opened a new pull request #7590: [C++] Fix docs of reading CSV files

2020-06-30 Thread GitBox


crcrpar opened a new pull request #7590:
URL: https://github.com/apache/arrow/pull/7590


   Hi,
   
   This is my first PR for this awesome project, so if I do not follow the 
workflow correctly, please tell me.
   
   In this PR, I aim to fix 
https://arrow.apache.org/docs/cpp/csv.html?highlight=csv%20c#basic-usage, as I 
found the example not working in my environment (arrow v0.17.1).







[GitHub] [arrow] nevi-me closed pull request #7588: ARROW-9274: [Rust] Parse 64bit numbers from integration files as strings

2020-06-30 Thread GitBox


nevi-me closed pull request #7588:
URL: https://github.com/apache/arrow/pull/7588


   







[GitHub] [arrow] xhochy edited a comment on pull request #7593: ARROW-9160: [C++] Implement contains for exact matches

2020-06-30 Thread GitBox


xhochy edited a comment on pull request #7593:
URL: https://github.com/apache/arrow/pull/7593#issuecomment-651779867


   This currently fails for chunked arrays. I thought they would be handled by 
the kernel framework automatically, but it seems they aren't.
   
   ```
   [--] 1 test from TestStringKernels/1, where TypeParam = 
arrow::LargeStringType
   [ RUN  ] TestStringKernels/1.ContainsExact
   ../src/arrow/testing/gtest_util.cc:149: Failure
   Failed
   Got:
 [
   [
 false,
 null,
 false
   ]
 ]
   Expected:
 [
   [
 true,
 null
   ],
   [
 false
   ]
 ]
   ../src/arrow/testing/gtest_util.cc:149: Failure
   Failed
   Got:
 [
   [
 false,
 false,
 false,
 null,
 false
   ]
 ]
   Expected:
 [
   [
 true,
 false
   ],
   [
 true,
 null,
 false
   ]
 ]
   [  FAILED  ] TestStringKernels/1.ContainsExact, where TypeParam = 
arrow::LargeStringType (2 ms)
   ```
   
   @wesm @pitrou Any pointers what I'm missing?







[GitHub] [arrow] nevi-me commented on pull request #7586: Calculate page and column statistics

2020-06-30 Thread GitBox


nevi-me commented on pull request #7586:
URL: https://github.com/apache/arrow/pull/7586#issuecomment-651810251


   Hi @zeevm, could you please rebase (to fix the Rust failures) and open a JIRA 
for this PR?







[GitHub] [arrow] kszucs commented on a change in pull request #7519: ARROW-9017: [C++][Python] Refactor scalar bindings

2020-06-30 Thread GitBox


kszucs commented on a change in pull request #7519:
URL: https://github.com/apache/arrow/pull/7519#discussion_r447725217



##
File path: python/pyarrow/includes/libarrow.pxd
##
@@ -44,6 +44,11 @@ cdef extern from "arrow/util/key_value_metadata.h" namespace 
"arrow" nogil:
 c_bool Contains(const c_string& key) const
 
 
+cdef extern from "arrow/util/decimal.h" namespace "arrow" nogil:
+cdef cppclass CDecimal128" arrow::Decimal128":
+c_string ToString(int32_t scale) const

Review comment:
   Created jira https://issues.apache.org/jira/browse/ARROW-9279









[GitHub] [arrow] kszucs commented on a change in pull request #7519: ARROW-9017: [C++][Python] Refactor scalar bindings

2020-06-30 Thread GitBox


kszucs commented on a change in pull request #7519:
URL: https://github.com/apache/arrow/pull/7519#discussion_r447733611



##
File path: python/pyarrow/scalar.pxi
##
@@ -16,1198 +16,745 @@
 # under the License.
 
 
-_NULL = NA = None
+import collections
 
 
 cdef class Scalar:
 """
-The base class for all array elements.
+The base class for scalars.
 """
 
+def __init__(self):
+raise TypeError("Do not call {}'s constructor directly, use "
+"pa.scalar() instead.".format(self.__class__.__name__))
 
-cdef class NullType(Scalar):
-"""
-Singleton for null array elements.
-"""
-# TODO rename this NullValue?
+cdef void init(self, const shared_ptr[CScalar]& wrapped):
+self.wrapped = wrapped
 
-def __cinit__(self):
-global NA
-if NA is not None:
-raise Exception('Cannot create multiple NAType instances')
+@staticmethod
+cdef wrap(const shared_ptr[CScalar]& wrapped):
+cdef:
+Scalar self
+Type type_id = wrapped.get().type.get().id()
+
+if type_id == _Type_NA:
+return _NULL
+
+typ = _scalar_classes[type_id]
+self = typ.__new__(typ)
+self.init(wrapped)
+
+return self
+
+cdef inline shared_ptr[CScalar] unwrap(self) nogil:
+return self.wrapped
+
+@property
+def type(self):
+return pyarrow_wrap_data_type(self.wrapped.get().type)
 
-self.type = null()
+@property
+def is_valid(self):
+return self.wrapped.get().is_valid
 
 def __repr__(self):
-return 'NULL'
+return '<pyarrow.{}: {!r}>'.format(
+self.__class__.__name__, self.as_py()
+)
 
-def as_py(self):
-"""
-Return None
-"""
-return None
+def __str__(self):
+return str(self.as_py())
+
+def equals(self, Scalar other):
+return self.wrapped.get().Equals(other.unwrap().get()[0])
 
 def __eq__(self, other):
-return NA
+try:
+if not isinstance(other, Scalar):
+other = scalar(other, type=self.type)
+return self.equals(other)
+except (TypeError, ValueError, ArrowInvalid):
+return NotImplemented
+
+def __hash__(self):
+cdef CScalarHash hasher
+return hasher(self.wrapped)
+
+def as_py(self):
+raise NotImplementedError()
 
 
-_NULL = NA = NullType()
+_NULL = NA = None
 
 
-cdef class ArrayValue(Scalar):
+cdef class NullScalar(Scalar):
 """
-The base class for non-null array elements.
+Concrete class for null scalars.
 """
 
-def __init__(self):
-raise TypeError("Do not call {}'s constructor directly, use array "
-"subscription instead."
-.format(self.__class__.__name__))
+def __cinit__(self):
+global NA
+if NA is not None:
+raise Exception('Cannot create multiple NAType instances')
+self.init(shared_ptr[CScalar](new CNullScalar()))
 
-cdef void init(self, DataType type, const shared_ptr[CArray]& sp_array,
-   int64_t index):
-self.type = type
-self.index = index
-self._set_array(sp_array)
+def __init__(self):
+pass
 
-cdef void _set_array(self, const shared_ptr[CArray]& sp_array):
-self.sp_array = sp_array
+def __eq__(self, other):
+return NA
 
-def __repr__(self):
-if hasattr(self, 'as_py'):
-return repr(self.as_py())
-else:
-return super(Scalar, self).__repr__()
+def __hash__(self):

Review comment:
   Once I override `__eq__`, the inherited `__hash__` gets removed.
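This is standard Python data-model behavior: a class that defines `__eq__` without `__hash__` gets `__hash__` set to `None`, so every subclass that overrides `__eq__` has to restate `__hash__` to stay hashable. A minimal demonstration:

```python
# Demonstrates the CPython data-model rule referenced above: defining
# __eq__ in a class implicitly sets __hash__ to None, so classes that
# override __eq__ must restate __hash__ to remain hashable.
class WithEqOnly:
    def __eq__(self, other):
        return isinstance(other, WithEqOnly)

class WithEqAndHash:
    def __eq__(self, other):
        return isinstance(other, WithEqAndHash)

    def __hash__(self):
        return 42

assert WithEqOnly.__hash__ is None        # implicitly removed
unhashable = False
try:
    hash(WithEqOnly())
except TypeError:
    unhashable = True
assert unhashable
assert hash(WithEqAndHash()) == 42        # restored explicitly
```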

##
File path: python/pyarrow/scalar.pxi
##
@@ -16,1198 +16,745 @@
 # under the License.
 
 
-_NULL = NA = None
+import collections
 
 
 cdef class Scalar:
 """
-The base class for all array elements.
+The base class for scalars.
 """
 
+def __init__(self):
+raise TypeError("Do not call {}'s constructor directly, use "
+"pa.scalar() instead.".format(self.__class__.__name__))
 
-cdef class NullType(Scalar):
-"""
-Singleton for null array elements.
-"""
-# TODO rename this NullValue?
+cdef void init(self, const shared_ptr[CScalar]& wrapped):
+self.wrapped = wrapped
 
-def __cinit__(self):
-global NA
-if NA is not None:
-raise Exception('Cannot create multiple NAType instances')
+@staticmethod
+cdef wrap(const shared_ptr[CScalar]& wrapped):
+cdef:
+Scalar self
+Type type_id = wrapped.get().type.get().id()
+
+if type_id == _Type_NA:
+return _NULL
+
+typ = _scalar_classes[type_id]
+self = typ.__new__(typ)
+self.init(wrapped)
+
+return self
+
+cdef inline shared_ptr[CScalar] unwrap(self) nogil:
+

[GitHub] [arrow] pitrou closed pull request #7549: ARROW-9230: [FlightRPC][Python] pass through all options in flight.connect

2020-06-30 Thread GitBox


pitrou closed pull request #7549:
URL: https://github.com/apache/arrow/pull/7549


   







[GitHub] [arrow] sunchao commented on a change in pull request #7319: ARROW-8289: [Rust] Parquet Arrow writer with nested support

2020-06-30 Thread GitBox


sunchao commented on a change in pull request #7319:
URL: https://github.com/apache/arrow/pull/7319#discussion_r447898772



##
File path: rust/parquet/src/arrow/arrow_writer.rs
##
@@ -0,0 +1,348 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Contains writer which writes arrow data into parquet data.
+
+use std::fs::File;
+use std::rc::Rc;
+
+use array::Array;
+use arrow::array;
+use arrow::datatypes::{DataType as ArrowDataType, Field, Schema};
+use arrow::record_batch::RecordBatch;
+
+use crate::column::writer::ColumnWriter;
+use crate::errors::Result;
+use crate::file::properties::WriterProperties;
+use crate::{
+data_type::*,
+file::writer::{FileWriter, RowGroupWriter, SerializedFileWriter},
+};
+
+struct ArrowWriter {
+writer: SerializedFileWriter<File>,
+rows: i64,
+}
+
+impl ArrowWriter {
+pub fn try_new(file: File, arrow_schema: &Schema) -> Result<Self> {
+let schema = crate::arrow::arrow_to_parquet_schema(arrow_schema)?;
+let props = Rc::new(WriterProperties::builder().build());
+let file_writer = SerializedFileWriter::new(
+file.try_clone()?,
+schema.root_schema_ptr(),
+props,
+)?;
+
+Ok(Self {
+writer: file_writer,
+rows: 0,
+})
+}
+
+pub fn write(&mut self, batch: &RecordBatch) -> Result<()> {
+let mut row_group_writer = self.writer.next_row_group()?;
+self.rows += unnest_arrays_to_leaves(
+&mut row_group_writer,
+batch.schema().fields(),
+batch.columns(),
+&vec![1i16; batch.num_rows()][..],
+0,
+)?;
+self.writer.close_row_group(row_group_writer)
+}
+
+pub fn close(&mut self) -> Result<()> {
+self.writer.close()
+}
+}
+
+/// Write nested arrays by traversing into structs and lists until primitive
+/// arrays are found.
+fn unnest_arrays_to_leaves(
+row_group_writer: &mut Box<dyn RowGroupWriter>,
+// The fields from the record batch or struct
+fields: &[Field],
+// The columns from record batch or struct, must have same length as fields
+columns: &[array::ArrayRef],
+// The parent mask; in the case of a struct, this represents which values
+// of the struct are true (1) or false (0).
+// This is useful to respect the definition level of structs where all values are null in a row
+parent_mask: &[i16],
+// The current level that is being read at
+level: i16,
+) -> Result<i64> {
+let mut rows_written = 0;
+for (field, column) in fields.iter().zip(columns) {
+match field.data_type() {
+ArrowDataType::List(_dtype) => unimplemented!("list not yet implemented"),
+ArrowDataType::FixedSizeList(_, _) => {
+unimplemented!("fsl not yet implemented")
+}
+ArrowDataType::Struct(fields) => {
+// fields in a struct should recursively be written out
+let array = column
+.as_any()
+.downcast_ref::<array::StructArray>()
+.expect("Unable to get struct array");
+let mut null_mask = Vec::with_capacity(array.len());
+for i in 0..array.len() {
+null_mask.push(array.is_valid(i) as i16);
+}
+rows_written += unnest_arrays_to_leaves(
+row_group_writer,
+fields,
+&array.columns_ref()[..],
+&null_mask[..],
+// if the field is nullable, we have to increment level
+level + field.is_nullable() as i16,
+)?;
+}
+ArrowDataType::Null => unimplemented!(),
+ArrowDataType::Boolean
+| ArrowDataType::Int8
+| ArrowDataType::Int16
+| ArrowDataType::Int32
+| ArrowDataType::Int64
+| ArrowDataType::UInt8
+| ArrowDataType::UInt16
+| ArrowDataType::UInt32
+| ArrowDataType::UInt64
+| ArrowDataType::Float16
+| ArrowDataType::Float32
+| ArrowDataType::Float64
+| ArrowDataType::Timestamp(_, _)
+| ArrowDataType::Date32(_)
+ 
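The `parent_mask`/`level` plumbing in the diff above exists to compute Parquet definition levels for nested data: a null struct masks out its children. As a hedged sketch (simplified to one nullable struct wrapping one nullable leaf, and not the PR's actual algorithm), the per-row definition level can be derived like this:

```python
# Hedged sketch of the definition-level idea behind unnest_arrays_to_leaves:
# the definition level counts how many nullable nesting levels are actually
# defined (non-null) for a row. Assumes one nullable struct (at `level`)
# wrapping one nullable leaf, so the maximum level is level + 1.
def definition_levels(parent_mask, child_valid, level):
    # parent_mask: 1 where the enclosing struct row is defined, else 0
    # child_valid: validity of the leaf values themselves
    levels = []
    for parent, valid in zip(parent_mask, child_valid):
        if not parent:
            levels.append(level - 1)      # struct itself is null
        elif not valid:
            levels.append(level)          # struct defined, leaf null
        else:
            levels.append(level + 1)      # fully defined value
    return levels

assert definition_levels([1, 0, 1], [True, True, False], 1) == [2, 0, 1]
```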

[GitHub] [arrow] BryanCutler commented on pull request #7275: ARROW-6110: [Java][Integration] Support LargeList Type and add integration test with C++

2020-06-30 Thread GitBox


BryanCutler commented on pull request #7275:
URL: https://github.com/apache/arrow/pull/7275#issuecomment-651984931


   > Is this good to merge now? @BryanCutler are you still planning to review 
this? Would like to get this in 1.0.
   
   I'm taking a look now, I'd like to get it in for 1.0 too.







[GitHub] [arrow] nealrichardson closed pull request #7595: ARROW-9281: [R] Turn off utf8proc in R builds

2020-06-30 Thread GitBox


nealrichardson closed pull request #7595:
URL: https://github.com/apache/arrow/pull/7595


   







[GitHub] [arrow] tobim commented on pull request #7315: ARROW-7605: [C++] Bundle jemalloc into static libarrow

2020-06-30 Thread GitBox


tobim commented on pull request #7315:
URL: https://github.com/apache/arrow/pull/7315#issuecomment-652001357


   @wesm I still believe this approach is the sanest, but seeing that it 
requires CMake 3.9 I guess that makes it a non-starter? 
   I would not expect problems with this on windows, but that is not my turf, 
so don't take my word for it. Which other bundled dependencies were you 
thinking of?







[GitHub] [arrow] nealrichardson closed pull request #7600: ARROW-4390: [R] Serialize "labeled" metadata in Feather files, IPC messages

2020-06-30 Thread GitBox


nealrichardson closed pull request #7600:
URL: https://github.com/apache/arrow/pull/7600


   







[GitHub] [arrow] lmsanch commented on issue #6841: AttributeError: module 'pyarrow' has no attribute 'filesystem'

2020-06-30 Thread GitBox


lmsanch commented on issue #6841:
URL: https://github.com/apache/arrow/issues/6841#issuecomment-651976303


   I am experiencing the same problem. I don't know what the solution is







[GitHub] [arrow] wesm commented on pull request #7593: ARROW-9160: [C++] Implement contains for exact matches

2020-06-30 Thread GitBox


wesm commented on pull request #7593:
URL: https://github.com/apache/arrow/pull/7593#issuecomment-652048000


   I just opened https://issues.apache.org/jira/browse/ARROW-9285 -- it should 
be easy to check if a kernel has mistakenly replaced a preallocated data buffer 
(which may be a slice of a larger output that is being populated)







[GitHub] [arrow] wesm edited a comment on pull request #7593: ARROW-9160: [C++] Implement contains for exact matches

2020-06-30 Thread GitBox


wesm edited a comment on pull request #7593:
URL: https://github.com/apache/arrow/pull/7593#issuecomment-652053744


   I also propose to rename the function from "contains_exact" to 
"utf8_contains". ~~I'm pushing that change here shortly~~. Or should we call it 
"binary_contains" / "string_contains" since it will work with any base binary 
types? I'll hold off on making any changes until there is consensus about the 
name. I think it would be useful to put string functions in a 
"pseudo-namespace" so they appear next to each other when sorted 
alphabetically. 







[GitHub] [arrow] wesm edited a comment on pull request #7593: ARROW-9160: [C++] Implement contains for exact matches

2020-06-30 Thread GitBox


wesm edited a comment on pull request #7593:
URL: https://github.com/apache/arrow/pull/7593#issuecomment-652053744


   I also propose to rename the function from "contains_exact" to 
"utf8_contains". ~~I'm pushing that change here shortly~~. Or should we call it 
"binary_contains" / "string_contains" since it will work with any base binary 
types? I'll hold off on making any changes until there is consensus about the 
name. I think it would be useful to put string functions in a 
"pseudo-namespace" so they appear next to each other when sorted 
alphabetically. 
   
   Ideas:
   
   - string_contains
   - string_utf8_lower
   - string_ascii_lower
   - ...







[GitHub] [arrow] nealrichardson closed pull request #7597: ARROW-9282: [R] Remove usage of _EXTPTR_PTR

2020-06-30 Thread GitBox


nealrichardson closed pull request #7597:
URL: https://github.com/apache/arrow/pull/7597


   







[GitHub] [arrow] nealrichardson opened a new pull request #7601: ARROW-8867: [R] Support converting POSIXlt type

2020-06-30 Thread GitBox


nealrichardson opened a new pull request #7601:
URL: https://github.com/apache/arrow/pull/7601


   Also contains some test refactor, documentation, and slight tweaks as 
followup to ARROW-8899.







[GitHub] [arrow] wesm edited a comment on pull request #7593: ARROW-9160: [C++] Implement contains for exact matches

2020-06-30 Thread GitBox


wesm edited a comment on pull request #7593:
URL: https://github.com/apache/arrow/pull/7593#issuecomment-652053744


   I also propose to rename the function from "contains_exact" to 
"utf8_contains". ~~I'm pushing that change here shortly~~. Or should we call it 
"binary_contains" / "string_contains" since it will work with any base binary 
types? I'll hold off on making any changes until there is consensus about the 
name. I think it would be useful to put string functions in a 
"pseudo-namespace" so they appear next to each other when sorted 
alphabetically. 
   
   Ideas:
   
   - string_contains
   - string_lower_utf8
   - string_lower_ascii
   - ...
   
   @maartenbreddels @pitrou any opinions about this?







[GitHub] [arrow] wesm commented on pull request #7593: ARROW-9160: [C++] Implement contains for exact matches

2020-06-30 Thread GitBox


wesm commented on pull request #7593:
URL: https://github.com/apache/arrow/pull/7593#issuecomment-652045953


   @xhochy I'm fixing a couple issues with the implementation:
   
   * The function executor allocates memory for you unless you explicitly 
disable it. The idea is that you don't want to do allocations in-kernel unless 
it is necessary. The preallocation settings affect how execution with chunked 
arrays work
   * The slice offset of the `Datum* out` parameter must be respected (it may 
be non-zero -- the executor does not preserve the chunk layout of ChunkedArrays 
and will merge chunks to make larger output arrays -- this can be configured in 
`ExecContext`)
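The two points above can be illustrated with a toy executor: it preallocates a single output buffer for all chunks and hands each kernel invocation an output offset that the kernel must respect. This is a hedged plain-Python sketch of the contract, not Arrow's actual C++ API:

```python
# Hedged sketch (not Arrow's real API): a scalar-kernel executor that
# preallocates one output buffer for a chunked input and passes each
# chunk's output_offset to the kernel, mirroring the behavior described
# above (the executor may merge chunks into one larger output array).
def execute_scalar_kernel(chunks, kernel):
    total = sum(len(c) for c in chunks)
    out = [None] * total              # preallocated by the executor
    offset = 0
    for chunk in chunks:
        kernel(chunk, out, offset)    # kernel must respect the offset
        offset += len(chunk)
    return out

def contains_a(chunk, out, output_offset):
    # A kernel that writes into the shared output at the given offset
    # instead of allocating its own buffer.
    for i, s in enumerate(chunk):
        out[output_offset + i] = "a" in s

result = execute_scalar_kernel([["abc", "b"], ["xa"]], contains_a)
assert result == [True, False, True]
```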







[GitHub] [arrow] wesm commented on a change in pull request #7593: ARROW-9160: [C++] Implement contains for exact matches

2020-06-30 Thread GitBox


wesm commented on a change in pull request #7593:
URL: https://github.com/apache/arrow/pull/7593#discussion_r447982219



##
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##
@@ -387,16 +380,20 @@ using ContainsExactState = OptionsWrapper<ContainsExactOptions>;
 
 template <typename Type>
 struct ContainsExact {
+  using offset_type = typename Type::offset_type;
   static void Exec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
-ContainsExactOptions arg =
-checked_cast<const ContainsExactState&>(*ctx->state()).options;
-auto transform_func =
-std::bind(TransformContainsExact,
-  reinterpret_cast<const uint8_t*>(arg.pattern.c_str()),
-  arg.pattern.length(), std::placeholders::_1, std::placeholders::_2,
-  std::placeholders::_3, std::placeholders::_4);
-
-StringBoolTransform(ctx, batch, transform_func, out);
+ContainsExactOptions arg = ContainsExactState::Get(ctx);
+const uint8_t* pat = reinterpret_cast<const uint8_t*>(arg.pattern.c_str());
+const int64_t pat_size = arg.pattern.length();
+StringBoolTransform(
+ctx, batch,
+[pat, pat_size](const void* offsets, const uint8_t* data, int64_t length,
+int64_t output_offset, uint8_t* output) {
+  TransformContainsExact(
+  pat, pat_size, reinterpret_cast<const offset_type*>(offsets), data, length,
+  output_offset, output);
+},

Review comment:
   I hope you don't mind, this seemed simpler to me









[GitHub] [arrow] github-actions[bot] commented on pull request #7601: ARROW-8867: [R] Support converting POSIXlt type

2020-06-30 Thread GitBox


github-actions[bot] commented on pull request #7601:
URL: https://github.com/apache/arrow/pull/7601#issuecomment-652056875


   https://issues.apache.org/jira/browse/ARROW-8867







[GitHub] [arrow] kou commented on a change in pull request #7556: ARROW-9188: [C++] Use Brotli shared libraries if they are available

2020-06-30 Thread GitBox


kou commented on a change in pull request #7556:
URL: https://github.com/apache/arrow/pull/7556#discussion_r448003034



##
File path: python/manylinux201x/build_arrow.sh
##
@@ -120,6 +120,7 @@ PATH="${CPYTHON_PATH}/bin:${PATH}" cmake \
 -DARROW_WITH_SNAPPY=ON \
 -DARROW_WITH_ZLIB=ON \
 -DARROW_WITH_ZSTD=ON \
+-DARROW_BROTLI_USE_SHARED=OFF \

Review comment:
   Ditto.

##
File path: python/manylinux1/build_arrow.sh
##
@@ -122,6 +122,7 @@ cmake \
 -DARROW_WITH_SNAPPY=ON \
 -DARROW_WITH_ZLIB=ON \
 -DARROW_WITH_ZSTD=ON \
+-DARROW_BROTLI_USE_SHARED=OFF \

Review comment:
   Could you keep this option list in alphabetical order?









[GitHub] [arrow] wesm commented on pull request #7556: ARROW-9188: [C++] Use Brotli shared libraries if they are available

2020-06-30 Thread GitBox


wesm commented on pull request #7556:
URL: https://github.com/apache/arrow/pull/7556#issuecomment-652058759


   If there's no further feedback I will merge this in the next 24h and I 
assume that any packaging issues will come up in nightlies as we push toward 
the next release.







[GitHub] [arrow] wesm edited a comment on pull request #7315: ARROW-7605: [C++] Bundle jemalloc into static libarrow

2020-06-30 Thread GitBox


wesm edited a comment on pull request #7315:
URL: https://github.com/apache/arrow/pull/7315#issuecomment-652020060


   We're running out of time to get this completed for the release, so if a 
working solution can be demonstrated to have been reached on all 3 platforms 
that works for both jemalloc (for Linux and macOS) and mimalloc (for Windows), 
then I will accept it. But there are single-digit days remaining to complete 
this work now. CMake 3.9 is a bit problematic since we've tried to support 
CMake >= 3.2 and definitely >= 3.7.







[GitHub] [arrow] wesm commented on pull request #7315: ARROW-7605: [C++] Bundle jemalloc into static libarrow

2020-06-30 Thread GitBox


wesm commented on pull request #7315:
URL: https://github.com/apache/arrow/pull/7315#issuecomment-652020060


   We're running out of time to get this completed for the release, so if a 
working solution can be demonstrated to have been reached on all 3 platforms 
that works for both jemalloc (for Linux and macOS) and mimalloc (for Windows), 
then I will accept it. But there are single-digit days remaining to complete 
this work now.







[GitHub] [arrow] wesm edited a comment on pull request #7593: ARROW-9160: [C++] Implement contains for exact matches

2020-06-30 Thread GitBox


wesm edited a comment on pull request #7593:
URL: https://github.com/apache/arrow/pull/7593#issuecomment-652053744


   I also propose to rename the function from "contains_exact" to 
"utf8_contains". I'm pushing that change here shortly. Or should we call it 
"BinaryContains" since it will work with any base binary types?







[GitHub] [arrow] wesm edited a comment on pull request #7593: ARROW-9160: [C++] Implement contains for exact matches

2020-06-30 Thread GitBox


wesm edited a comment on pull request #7593:
URL: https://github.com/apache/arrow/pull/7593#issuecomment-652053744


   I also propose to rename the function from "contains_exact" to 
"utf8_contains". I'm pushing that change here shortly. Or should we call it 
"binary_contains" / "string_contains" since it will work with any base binary 
types?







[GitHub] [arrow] wesm commented on pull request #7593: ARROW-9160: [C++] Implement contains for exact matches

2020-06-30 Thread GitBox


wesm commented on pull request #7593:
URL: https://github.com/apache/arrow/pull/7593#issuecomment-652053744


   I also propose to rename the function from "contains_exact" to 
"utf8_contains". I'm pushing that change here shortly







[GitHub] [arrow] BryanCutler commented on a change in pull request #7275: ARROW-6110: [Java][Integration] Support LargeList Type and add integration test with C++

2020-06-30 Thread GitBox


BryanCutler commented on a change in pull request #7275:
URL: https://github.com/apache/arrow/pull/7275#discussion_r448032771



##
File path: java/vector/src/main/java/org/apache/arrow/vector/complex/LargeListVector.java
##
@@ -0,0 +1,991 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.arrow.vector.complex;
+
+import static java.util.Collections.singletonList;
+import static org.apache.arrow.memory.util.LargeMemoryUtil.capAtMaxInt;
+import static org.apache.arrow.memory.util.LargeMemoryUtil.checkedCastToInt;
+import static org.apache.arrow.util.Preconditions.checkNotNull;
+
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+
+import org.apache.arrow.memory.ArrowBuf;
+import org.apache.arrow.memory.BufferAllocator;
+import org.apache.arrow.memory.OutOfMemoryException;
+import org.apache.arrow.memory.util.ArrowBufPointer;
+import org.apache.arrow.memory.util.ByteFunctionHelpers;
+import org.apache.arrow.memory.util.CommonUtil;
+import org.apache.arrow.memory.util.hash.ArrowBufHasher;
+import org.apache.arrow.util.Preconditions;
+import org.apache.arrow.vector.AddOrGetResult;
+import org.apache.arrow.vector.BaseFixedWidthVector;
+import org.apache.arrow.vector.BaseValueVector;
+import org.apache.arrow.vector.BaseVariableWidthVector;
+import org.apache.arrow.vector.BitVectorHelper;
+import org.apache.arrow.vector.BufferBacked;
+import org.apache.arrow.vector.DensityAwareVector;
+import org.apache.arrow.vector.FieldVector;
+import org.apache.arrow.vector.NullVector;
+import org.apache.arrow.vector.UInt4Vector;
+import org.apache.arrow.vector.ValueVector;
+import org.apache.arrow.vector.ZeroVector;
+import org.apache.arrow.vector.compare.VectorVisitor;
+import org.apache.arrow.vector.complex.impl.ComplexCopier;
+import org.apache.arrow.vector.complex.impl.UnionLargeListReader;
+import org.apache.arrow.vector.complex.impl.UnionLargeListWriter;
+import org.apache.arrow.vector.complex.reader.FieldReader;
+import org.apache.arrow.vector.ipc.message.ArrowFieldNode;
+import org.apache.arrow.vector.types.Types.MinorType;
+import org.apache.arrow.vector.types.pojo.ArrowType;
+import org.apache.arrow.vector.types.pojo.Field;
+import org.apache.arrow.vector.types.pojo.FieldType;
+import org.apache.arrow.vector.util.CallBack;
+import org.apache.arrow.vector.util.JsonStringArrayList;
+import org.apache.arrow.vector.util.OversizedAllocationException;
+import org.apache.arrow.vector.util.SchemaChangeRuntimeException;
+import org.apache.arrow.vector.util.TransferPair;
+
+/**
+ * A list vector contains lists of a specific type of elements.  Its structure
+ * contains 3 elements.
+ * <ul>
+ *   <li>A validity buffer.</li>
+ *   <li>An offset buffer that denotes list boundaries.</li>
+ *   <li>A child data vector that contains the elements of the lists.</li>
+ * </ul>
+ *
+ * <p>This is the LargeList variant of list; it has a 64-bit wide offset.</p>
+ *
+ * <p>
+ *   todo review checkedCastToInt usage in this class.
+ *   Once int64 indexed vectors are supported these checks aren't needed.
+ * </p>
+ */
+public class LargeListVector extends BaseValueVector implements RepeatedValueVector, BaseListVector, PromotableVector {
+
+  public static LargeListVector empty(String name, BufferAllocator allocator) {
+    return new LargeListVector(name, allocator, FieldType.nullable(ArrowType.LargeList.INSTANCE), null);
+  }
+
+  public static final FieldVector DEFAULT_DATA_VECTOR = ZeroVector.INSTANCE;
+  public static final String DATA_VECTOR_NAME = "$data$";
+
+  public static final byte OFFSET_WIDTH = 8;
+  protected ArrowBuf offsetBuffer;
+  protected FieldVector vector;
+  protected final CallBack callBack;
+  protected int valueCount;
+  protected long offsetAllocationSizeInBytes = INITIAL_VALUE_ALLOCATION * OFFSET_WIDTH;
+  private final String name;
+
+  protected String defaultDataVectorName = DATA_VECTOR_NAME;
+  protected ArrowBuf validityBuffer;
+  protected UnionLargeListReader reader;
+  private final FieldType fieldType;
+  private int validityAllocationSizeInBytes;
+
+  /**
+   * The maximum index that is actually set.
+   */
+  private long lastSet;
+
+  /**
+   * Constructs a new instance.
+   *
+   * @param name The 

[GitHub] [arrow] mrkn commented on pull request #7477: ARROW-4221: [C++][Python] Add canonical flag in COO sparse index

2020-06-30 Thread GitBox


mrkn commented on pull request #7477:
URL: https://github.com/apache/arrow/pull/7477#issuecomment-652124757


   @pitrou In summary, the constraint on the order of SparseCOOIndex indices 
was removed, but a new flag field is introduced to state whether or not the 
indices tensor is ordered in a row-major manner.  This flag lets us preserve 
the canonicality (or non-canonicality) of the sparse tensor before and after 
serialization, which avoids needless scanning of the indices tensor after 
deserialization.
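The check the flag makes unnecessary can be sketched as follows (a hypothetical helper, not the actual Arrow API): a COO indices tensor, with one row per nonzero and one column per dimension, is canonically ordered when its rows are strictly increasing in lexicographic (row-major) order.

```python
def coo_indices_are_canonical(indices):
    # Rows must be strictly increasing lexicographically; equal rows
    # would mean duplicate coordinates, which a canonical index forbids.
    return all(tuple(a) < tuple(b) for a, b in zip(indices, indices[1:]))

canonical = [[0, 0], [0, 2], [1, 1]]   # row-major ordered
unordered = [[1, 1], [0, 0], [0, 2]]   # same nonzeros, shuffled
```

Without a serialized flag, a reader would have to run this O(n) scan after every deserialization to know whether the tensor is canonical.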







[GitHub] [arrow] wesm closed pull request #7478: ARROW-9055: [C++] Add sum/mean/minmax kernels for Boolean type

2020-06-30 Thread GitBox


wesm closed pull request #7478:
URL: https://github.com/apache/arrow/pull/7478


   







[GitHub] [arrow] wesm commented on pull request #7287: ARROW-8771: [C++] Add boost/process library to build support

2020-06-30 Thread GitBox


wesm commented on pull request #7287:
URL: https://github.com/apache/arrow/pull/7287#issuecomment-652142725


   Could this be picked up again?







[GitHub] [arrow] wesm commented on pull request #7507: ARROW-8797: [C++] Create test to receive RecordBatch for different endian

2020-06-30 Thread GitBox


wesm commented on pull request #7507:
URL: https://github.com/apache/arrow/pull/7507#issuecomment-652142541


   @kiszk I would suggest creating a LE and BE example corpus in 
apache/arrow-testing. You can use the integration test command line tools to 
create point-of-truth JSON files and then use the JSON_TO_ARROW mode of the 
integration test executable to create IPC binary files that can be checked in 







[GitHub] [arrow] nealrichardson opened a new pull request #7602: ARROW-9083: [R] collect int64, uint32, uint64 as R integer type if not out of bounds

2020-06-30 Thread GitBox


nealrichardson opened a new pull request #7602:
URL: https://github.com/apache/arrow/pull/7602


   Still to do:
   
   - [ ] Update test expectations since the output types have changed
   - [ ] Add tests that exercise the case where the data doesn't fit into int32







[GitHub] [arrow] github-actions[bot] commented on pull request #7602: ARROW-9083: [R] collect int64, uint32, uint64 as R integer type if not out of bounds

2020-06-30 Thread GitBox


github-actions[bot] commented on pull request #7602:
URL: https://github.com/apache/arrow/pull/7602#issuecomment-652080559


   https://issues.apache.org/jira/browse/ARROW-9083







[GitHub] [arrow] wesm commented on a change in pull request #7602: ARROW-9083: [R] collect int64, uint32, uint64 as R integer type if not out of bounds

2020-06-30 Thread GitBox


wesm commented on a change in pull request #7602:
URL: https://github.com/apache/arrow/pull/7602#discussion_r448028250



##
File path: r/src/array_to_vector.cpp
##
@@ -673,6 +676,17 @@ class Converter_Null : public Converter {
   }
 };
 
+bool arrays_can_fit_integer(ArrayVector arrays) {

Review comment:
   nit: not sure if it's worth the effort of following the C++ guide 
vis-à-vis function naming









[GitHub] [arrow] wesm commented on a change in pull request #7584: ARROW-9272: [C++][Python] Reduce complexity in python to arrow conversion

2020-06-30 Thread GitBox


wesm commented on a change in pull request #7584:
URL: https://github.com/apache/arrow/pull/7584#discussion_r448064117



##
File path: cpp/src/arrow/python/python_to_arrow.cc
##
@@ -53,6 +53,335 @@ using internal::checked_pointer_cast;
 
 namespace py {
 
+// --
+// NullCoding
+
+enum class NullCoding : char { NONE_ONLY, PANDAS_SENTINELS };
+
+template <NullCoding null_coding>
+struct NullChecker {};
+
+template <>
+struct NullChecker<NullCoding::NONE_ONLY> {
+  static inline bool Check(PyObject* obj) { return obj == Py_None; }
+};
+
+template <>
+struct NullChecker<NullCoding::PANDAS_SENTINELS> {
+  static inline bool Check(PyObject* obj) { return internal::PandasObjectIsNull(obj); }
+};
+
+// --
+// ValueConverters
+//
+// Typed conversion logic for single python objects is encapsulated in
+// ValueConverter structs using SFINAE for specialization.
+//
+// The FromPython method is responsible for converting the python object to
+// the C++ value counterpart, which can be appended directly to the
+// ArrayBuilder or from which a Scalar can be constructed.
+
+template <typename Type, typename Enable = void>
+struct ValueConverter {};
+
+template <>
+struct ValueConverter<BooleanType> {
+  static inline Result<bool> FromPython(PyObject* obj) {
+    if (obj == Py_True) {
+      return true;
+    } else if (obj == Py_False) {
+      return false;
+    } else {
+      return internal::InvalidValue(obj, "tried to convert to boolean");
+    }
+  }
+};
+
+template <typename Type>
+struct ValueConverter<Type, enable_if_integer<Type>> {
+  using ValueType = typename Type::c_type;
+
+  static inline Result<ValueType> FromPython(PyObject* obj) {
+    ValueType value;
+    RETURN_NOT_OK(internal::CIntFromPython(obj, &value));
+    return value;
+  }
+};
+
+template <>
+struct ValueConverter<HalfFloatType> {
+  using ValueType = typename HalfFloatType::c_type;
+
+  static inline Result<ValueType> FromPython(PyObject* obj) {
+    ValueType value;
+    RETURN_NOT_OK(PyFloat_AsHalf(obj, &value));
+    return value;
+  }
+};
+
+template <>
+struct ValueConverter<FloatType> {
+  static inline Result<float> FromPython(PyObject* obj) {
+    float value;
+    if (internal::PyFloatScalar_Check(obj)) {
+      value = static_cast<float>(PyFloat_AsDouble(obj));
+      RETURN_IF_PYERROR();
+    } else if (internal::PyIntScalar_Check(obj)) {
+      RETURN_NOT_OK(internal::IntegerScalarToFloat32Safe(obj, &value));
+    } else {
+      return internal::InvalidValue(obj, "tried to convert to float32");
+    }
+    return value;
+  }
+};
+
+template <>
+struct ValueConverter<DoubleType> {
+  static inline Result<double> FromPython(PyObject* obj) {
+    double value;
+    if (PyFloat_Check(obj)) {
+      value = PyFloat_AS_DOUBLE(obj);
+    } else if (internal::PyFloatScalar_Check(obj)) {
+      // Other kinds of float-y things
+      value = PyFloat_AsDouble(obj);
+      RETURN_IF_PYERROR();
+    } else if (internal::PyIntScalar_Check(obj)) {
+      RETURN_NOT_OK(internal::IntegerScalarToDoubleSafe(obj, &value));
+    } else {
+      return internal::InvalidValue(obj, "tried to convert to double");
+    }
+    return value;
+  }
+};
+
+template <>
+struct ValueConverter<Date32Type> {
+  static inline Result<int32_t> FromPython(PyObject* obj) {
+    int32_t value;
+    if (PyDate_Check(obj)) {
+      auto pydate = reinterpret_cast<PyDateTime_Date*>(obj);
+      value = static_cast<int32_t>(internal::PyDate_to_days(pydate));
+    } else {
+      RETURN_NOT_OK(
+          internal::CIntFromPython(obj, &value, "Integer too large for date32"));
+    }
+    return value;
+  }
+};
+
+template <>
+struct ValueConverter<Date64Type> {
+  static inline Result<int64_t> FromPython(PyObject* obj) {
+    int64_t value;
+    if (PyDateTime_Check(obj)) {
+      auto pydate = reinterpret_cast<PyDateTime_DateTime*>(obj);
+      value = internal::PyDateTime_to_ms(pydate);
+      // Truncate any intraday milliseconds
+      value -= value % 86400000LL;
+    } else if (PyDate_Check(obj)) {
+      auto pydate = reinterpret_cast<PyDateTime_Date*>(obj);
+      value = internal::PyDate_to_ms(pydate);
+    } else {
+      RETURN_NOT_OK(
+          internal::CIntFromPython(obj, &value, "Integer too large for date64"));
+    }
+    return value;
+  }
+};
+
+template <>
+struct ValueConverter<Time32Type> {
+  static inline Result<int32_t> FromPython(PyObject* obj, TimeUnit::type unit) {
+    int32_t value;
+    if (PyTime_Check(obj)) {
+      // datetime.time stores microsecond resolution
+      switch (unit) {
+        case TimeUnit::SECOND:
+          value = static_cast<int32_t>(internal::PyTime_to_s(obj));
+          break;
+        case TimeUnit::MILLI:
+          value = static_cast<int32_t>(internal::PyTime_to_ms(obj));
+          break;
+        default:
+          return Status::UnknownError("Invalid time unit");
+      }
+    } else {
+      RETURN_NOT_OK(internal::CIntFromPython(obj, &value, "Integer too large for int32"));
+    }
+    return value;
+  }
+};
+
+template <>
+struct ValueConverter<Time64Type> {
+  static inline Result<int64_t> FromPython(PyObject* obj, TimeUnit::type unit) {
+    int64_t value;
+    if (PyTime_Check(obj)) {
+      // datetime.time stores microsecond resolution
+      switch (unit) {
+        case TimeUnit::MICRO:
+          value = 

[GitHub] [arrow] wesm closed pull request #7556: ARROW-9188: [C++] Use Brotli shared libraries if they are available

2020-06-30 Thread GitBox


wesm closed pull request #7556:
URL: https://github.com/apache/arrow/pull/7556


   







[GitHub] [arrow] wesm closed pull request #7578: ARROW-9264: [C++][Parquet] Refactor and modernize schema conversion code

2020-06-30 Thread GitBox


wesm closed pull request #7578:
URL: https://github.com/apache/arrow/pull/7578


   






