[GitHub] [arrow] kszucs commented on pull request #7021: ARROW-8628: [Dev] Wrap docker-compose commands with archery

2020-05-01 Thread GitBox


kszucs commented on pull request #7021:
URL: https://github.com/apache/arrow/pull/7021#issuecomment-622320430


   @github-actions crossbow submit -g test



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kszucs commented on pull request #7074: ARROW-8656: [Python] Switch to VS2017 in the windows wheel builds

2020-05-01 Thread GitBox


kszucs commented on pull request #7074:
URL: https://github.com/apache/arrow/pull/7074#issuecomment-622345879


   @github-actions crossbow submit wheel-win-cp38



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] zgramana commented on pull request #7032: ARROW-6603: [C#] Adds ArrayBuilder API to support writing null values + BooleanArray null support

2020-04-27 Thread GitBox


zgramana commented on pull request #7032:
URL: https://github.com/apache/arrow/pull/7032#issuecomment-620266004


   @eerhardt thanks for the excellent review and feedback. I believe that I 
have pushed fixes for everything.
   
   I also noticed I had missed validity buffer state management in 
`PrimitiveArrayBuilder`'s `Set`, `Swap` and `Clear` methods. I've added those 
in.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] bkietz commented on a change in pull request #7033: ARROW-7759: [C++][Dataset] Add CsvFileFormat

2020-04-27 Thread GitBox


bkietz commented on a change in pull request #7033:
URL: https://github.com/apache/arrow/pull/7033#discussion_r416235260



##
File path: cpp/src/arrow/dataset/file_csv.cc
##
@@ -0,0 +1,99 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/dataset/file_csv.h"
+
+#include 
+#include 
+
+#include "arrow/csv/options.h"
+#include "arrow/csv/reader.h"
+#include "arrow/dataset/dataset_internal.h"
+#include "arrow/dataset/file_base.h"
+#include "arrow/dataset/type_fwd.h"
+#include "arrow/dataset/visibility.h"
+#include "arrow/result.h"
+#include "arrow/util/iterator.h"
+
+namespace arrow {
+namespace dataset {
+
+using internal::checked_cast;
+using internal::checked_pointer_cast;
+
+static inline csv::ReadOptions default_read_options() {
+  auto defaults = csv::ReadOptions::Defaults();
+  defaults.use_threads = false;
+  return defaults;
+}
+
+static inline Result> OpenReader(
+const FileSource& source, const CsvFileFormat& format) {
+  ARROW_ASSIGN_OR_RAISE(auto input, source.Open());
+
+  auto maybe_reader = csv::StreamingReader::Make(
+  default_memory_pool(), std::move(input), default_read_options(),
+  format.parse_options, format.convert_options);
+  if (!maybe_reader.ok()) {
+return maybe_reader.status().WithMessage("Could not open IPC input source 
'",
+ source.path(), "': ", 
maybe_reader.status());
+  }
+
+  return std::move(maybe_reader).ValueOrDie();
+}
+
+/// \brief A ScanTask backed by an Csv file.
+class CsvScanTask : public ScanTask {
+ public:
+  CsvScanTask(std::shared_ptr format, FileSource source,
+  std::shared_ptr options, 
std::shared_ptr context)
+  : ScanTask(std::move(options), std::move(context)),
+format_(std::move(format)),
+source_(std::move(source)) {}
+
+  Result Execute() override {
+ARROW_ASSIGN_OR_RAISE(auto reader, OpenReader(source_, *format_));

Review comment:
   Reading using a predetermined schema is already supported for CSV and 
for all the other formats. The first chunk (as determined by 
`CsvFileFormat::chunk_size`) is used to sniff the schema which is assumed for 
the rest of the file, and this inspection can be applied to just one or to all 
the files in a dataset at discovery time.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] bkietz commented on a change in pull request #7033: ARROW-7759: [C++][Dataset] Add CsvFileFormat

2020-04-27 Thread GitBox


bkietz commented on a change in pull request #7033:
URL: https://github.com/apache/arrow/pull/7033#discussion_r416253903



##
File path: cpp/src/arrow/dataset/file_csv.cc
##
@@ -0,0 +1,99 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/dataset/file_csv.h"
+
+#include 
+#include 
+
+#include "arrow/csv/options.h"
+#include "arrow/csv/reader.h"
+#include "arrow/dataset/dataset_internal.h"
+#include "arrow/dataset/file_base.h"
+#include "arrow/dataset/type_fwd.h"
+#include "arrow/dataset/visibility.h"
+#include "arrow/result.h"
+#include "arrow/util/iterator.h"
+
+namespace arrow {
+namespace dataset {
+
+using internal::checked_cast;
+using internal::checked_pointer_cast;
+
+static inline csv::ReadOptions default_read_options() {
+  auto defaults = csv::ReadOptions::Defaults();
+  defaults.use_threads = false;
+  return defaults;
+}
+
+static inline Result> OpenReader(
+const FileSource& source, const CsvFileFormat& format) {
+  ARROW_ASSIGN_OR_RAISE(auto input, source.Open());
+
+  auto maybe_reader = csv::StreamingReader::Make(
+  default_memory_pool(), std::move(input), default_read_options(),
+  format.parse_options, format.convert_options);
+  if (!maybe_reader.ok()) {
+return maybe_reader.status().WithMessage("Could not open IPC input source 
'",
+ source.path(), "': ", 
maybe_reader.status());
+  }
+
+  return std::move(maybe_reader).ValueOrDie();
+}
+
+/// \brief A ScanTask backed by an Csv file.
+class CsvScanTask : public ScanTask {
+ public:
+  CsvScanTask(std::shared_ptr format, FileSource source,
+  std::shared_ptr options, 
std::shared_ptr context)
+  : ScanTask(std::move(options), std::move(context)),
+format_(std::move(format)),
+source_(std::move(source)) {}
+
+  Result Execute() override {
+ARROW_ASSIGN_OR_RAISE(auto reader, OpenReader(source_, *format_));

Review comment:
   I'll add a failing test tomorrow so this can become less hand wavy





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #7048: ARROW-8609: [C++] fix orc jni crash

2020-04-27 Thread GitBox


github-actions[bot] commented on pull request #7048:
URL: https://github.com/apache/arrow/pull/7048#issuecomment-620323735


   https://issues.apache.org/jira/browse/ARROW-8609



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] bkietz commented on a change in pull request #7033: ARROW-7759: [C++][Dataset] Add CsvFileFormat

2020-04-27 Thread GitBox


bkietz commented on a change in pull request #7033:
URL: https://github.com/apache/arrow/pull/7033#discussion_r416239202



##
File path: cpp/src/arrow/dataset/file_csv.cc
##
@@ -0,0 +1,99 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/dataset/file_csv.h"
+
+#include 
+#include 
+
+#include "arrow/csv/options.h"
+#include "arrow/csv/reader.h"
+#include "arrow/dataset/dataset_internal.h"
+#include "arrow/dataset/file_base.h"
+#include "arrow/dataset/type_fwd.h"
+#include "arrow/dataset/visibility.h"
+#include "arrow/result.h"
+#include "arrow/util/iterator.h"
+
+namespace arrow {
+namespace dataset {
+
+using internal::checked_cast;
+using internal::checked_pointer_cast;
+
+static inline csv::ReadOptions default_read_options() {
+  auto defaults = csv::ReadOptions::Defaults();
+  defaults.use_threads = false;
+  return defaults;
+}
+
+static inline Result> OpenReader(
+const FileSource& source, const CsvFileFormat& format) {
+  ARROW_ASSIGN_OR_RAISE(auto input, source.Open());
+
+  auto maybe_reader = csv::StreamingReader::Make(
+  default_memory_pool(), std::move(input), default_read_options(),
+  format.parse_options, format.convert_options);
+  if (!maybe_reader.ok()) {
+return maybe_reader.status().WithMessage("Could not open IPC input source 
'",
+ source.path(), "': ", 
maybe_reader.status());
+  }
+
+  return std::move(maybe_reader).ValueOrDie();
+}
+
+/// \brief A ScanTask backed by an Csv file.
+class CsvScanTask : public ScanTask {
+ public:
+  CsvScanTask(std::shared_ptr format, FileSource source,
+  std::shared_ptr options, 
std::shared_ptr context)
+  : ScanTask(std::move(options), std::move(context)),
+format_(std::move(format)),
+source_(std::move(source)) {}
+
+  Result Execute() override {
+ARROW_ASSIGN_OR_RAISE(auto reader, OpenReader(source_, *format_));

Review comment:
   The edge case here is due to the way we pass only the projected schema 
during a scan, so maybe the solution is to always pass the full dataset schema 
as well so that no materialized fields will be inferring at scan time





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] zhouyuan opened a new pull request #7048: ARROW-8609: [C++] fix orc jni crash

2020-04-27 Thread GitBox


zhouyuan opened a new pull request #7048:
URL: https://github.com/apache/arrow/pull/7048


   check if arrow buffer is null before passing to the constructor
   
   Signed-off-by: Yuan Zhou 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] zhouyuan commented on pull request #6953: [Java] Reproduce ORC JNI binding crash [WIP]

2020-04-27 Thread GitBox


zhouyuan commented on pull request #6953:
URL: https://github.com/apache/arrow/pull/6953#issuecomment-620321153


   @kszucs 
   ran into similar issue on parquet side, not sure if this #7048 works  



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kszucs opened a new pull request #7081: [CI] Cache docker volumes [WIP]

2020-05-01 Thread GitBox


kszucs opened a new pull request #7081:
URL: https://github.com/apache/arrow/pull/7081


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] zgramana commented on a change in pull request #7032: ARROW-6603: [C#] Adds ArrayBuilder API to support writing null values + BooleanArray null support

2020-05-01 Thread GitBox


zgramana commented on a change in pull request #7032:
URL: https://github.com/apache/arrow/pull/7032#discussion_r418424144



##
File path: csharp/src/Apache.Arrow/Arrays/ListArray.cs
##
@@ -135,6 +152,11 @@ public int GetValueOffset(int index)
 
 public int GetValueLength(int index)
 {
+if (IsNull(index))

Review comment:
   Fixed #Closed





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] eerhardt commented on a change in pull request #7032: ARROW-6603: [C#] Adds ArrayBuilder API to support writing null values + BooleanArray null support

2020-05-01 Thread GitBox


eerhardt commented on a change in pull request #7032:
URL: https://github.com/apache/arrow/pull/7032#discussion_r418393347



##
File path: csharp/src/Apache.Arrow/Arrays/ListArray.cs
##
@@ -135,6 +152,11 @@ public int GetValueOffset(int index)
 
 public int GetValueLength(int index)
 {
+if (IsNull(index))

Review comment:
   This should be after the argument checking. #Closed





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] zgramana commented on a change in pull request #7032: ARROW-6603: [C#] Adds ArrayBuilder API to support writing null values + BooleanArray null support

2020-05-01 Thread GitBox


zgramana commented on a change in pull request #7032:
URL: https://github.com/apache/arrow/pull/7032#discussion_r418383314



##
File path: csharp/src/Apache.Arrow/Arrays/PrimitiveArrayBuilder.cs
##
@@ -162,8 +188,8 @@ public TBuilder Swap(int i, int j)
 public TArray Build(MemoryAllocator allocator = default)
 {
 return Build(
-ValueBuffer.Build(allocator), ArrowBuffer.Empty,
-ValueBuffer.Length, 0, 0);
+ValueBuffer.Build(allocator), 
ValidityBuffer.Build(allocator).ValueBuffer,

Review comment:
   As noted on another comment, I've pushed up this refactor. #Closed





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] zgramana commented on a change in pull request #7032: ARROW-6603: [C#] Adds ArrayBuilder API to support writing null values + BooleanArray null support

2020-05-01 Thread GitBox


zgramana commented on a change in pull request #7032:
URL: https://github.com/apache/arrow/pull/7032#discussion_r418380212



##
File path: csharp/src/Apache.Arrow/Arrays/PrimitiveArrayBuilder.cs
##
@@ -99,55 +105,75 @@ public abstract class PrimitiveArrayBuilder : IArrowArrayBu
 {
 protected TBuilder Instance => this as TBuilder;
 protected ArrowBuffer.Builder ValueBuffer { get; }
+protected BooleanArray.Builder ValidityBuffer { get; }
 
 public int Length => ValueBuffer.Length;
-
-// TODO: Implement support for null values (null bitmaps)
+protected int NullCount { get; set; }
 
 internal PrimitiveArrayBuilder()
 {
 ValueBuffer = new ArrowBuffer.Builder();
+ValidityBuffer = new BooleanArray.Builder();
 }
 
 public TBuilder Resize(int length)
 {
 ValueBuffer.Resize(length);
+ValidityBuffer.Resize(length + 1);

Review comment:
   Copy/paste error from `BinaryArray`. Good catch.
   
   Fixed. #Closed





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] eerhardt commented on a change in pull request #7032: ARROW-6603: [C#] Adds ArrayBuilder API to support writing null values + BooleanArray null support

2020-05-01 Thread GitBox


eerhardt commented on a change in pull request #7032:
URL: https://github.com/apache/arrow/pull/7032#discussion_r417709163



##
File path: csharp/src/Apache.Arrow/Arrays/PrimitiveArrayBuilder.cs
##
@@ -99,55 +105,75 @@ public abstract class PrimitiveArrayBuilder : IArrowArrayBu
 {
 protected TBuilder Instance => this as TBuilder;
 protected ArrowBuffer.Builder ValueBuffer { get; }
+protected BooleanArray.Builder ValidityBuffer { get; }
 
 public int Length => ValueBuffer.Length;
-
-// TODO: Implement support for null values (null bitmaps)
+protected int NullCount { get; set; }
 
 internal PrimitiveArrayBuilder()
 {
 ValueBuffer = new ArrowBuffer.Builder();
+ValidityBuffer = new BooleanArray.Builder();
 }
 
 public TBuilder Resize(int length)
 {
 ValueBuffer.Resize(length);
+ValidityBuffer.Resize(length + 1);

Review comment:
   Why the `+ 1` here? Shouldn't the ValidityBuffer and ValueBuffer be the 
same size? (Same question for Reserve below). #Closed





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] bkietz commented on pull request #7073: ARROW-8318: [C++][Dataset] Construct FileSystemDataset from fragments

2020-05-01 Thread GitBox


bkietz commented on pull request #7073:
URL: https://github.com/apache/arrow/pull/7073#issuecomment-622364855


   WRT convenient single format or single file system datasets, it would be 
straightforward (and possibly more useful) to provide accessories for subsets, 
Dataset.subsets(by_format=True): { format: subset ... } Or similar. IMHO the 
convenience added does not justify a whole class, at least in C++. As you noted 
even if we remove it in C++ it could still be a convenience subclass in Python



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #6985: ARROW-8413: [C++][Parquet] Refactor Generating validity bitmap for values column

2020-05-01 Thread GitBox


wesm commented on pull request #6985:
URL: https://github.com/apache/arrow/pull/6985#issuecomment-622371728


   What I've seen other projects do (have to dig for some examples) is to have 
files like
   
   ```
   functionality_nosimd.cc
   functionality_sse42.cc
   functionality_avx2.cc
   ```
   
   The appropriate compiler flags are set on the respective compilation unit. 
   
   Then query the CPU at runtime to determine its features and choose which 
variant to invoke. @cyb70289 expressed an interest in this recently, I think



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] cyb70289 commented on a change in pull request #6954: ARROW-8440: [C++] Refine SIMD header files

2020-05-01 Thread GitBox


cyb70289 commented on a change in pull request #6954:
URL: https://github.com/apache/arrow/pull/6954#discussion_r418558120



##
File path: cpp/src/arrow/util/simd.h
##
@@ -17,6 +17,24 @@
 
 #pragma once
 
+#ifdef _MSC_VER
+// MSVC x86_64/arm64
+
+#if defined(_M_AMD64) || defined(_M_X64)
+#include 
+#elif defined(_M_ARM64)
+#include 
+#endif
+
+#else
+// gcc/clang (possibly others)
+

Review comment:
   yes, I think so





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jorisvandenbossche commented on pull request #7073: ARROW-8318: [C++][Dataset] Construct FileSystemDataset from fragments

2020-05-01 Thread GitBox


jorisvandenbossche commented on pull request #7073:
URL: https://github.com/apache/arrow/pull/7073#issuecomment-622354887


   Do we need FileSystemDataset, maybe not. Is it still useful, IMO yes. 
   
   As mentioned above, I personally find it convenient to know that my dataset 
has a single format / filesystem, and be able to easily check this format. 
   Now of course, it might be that this convenience is not worth it given the 
added complexity to the code. Or that such convenience could also be given by 
the wrapper languages (eg in Python we could still have a Dataset subclass for 
single formats)



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kszucs commented on pull request #7074: ARROW-8656: [Python] Switch to VS2017 in the windows wheel builds

2020-05-01 Thread GitBox


kszucs commented on pull request #7074:
URL: https://github.com/apache/arrow/pull/7074#issuecomment-622356800


   @github-actions crossbow submit wheel-win-*



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #7074: ARROW-8656: [Python] Switch to VS2017 in the windows wheel builds

2020-05-01 Thread GitBox


github-actions[bot] commented on pull request #7074:
URL: https://github.com/apache/arrow/pull/7074#issuecomment-622357099


   Revision: 8852e2f5f32402ca9c85877289c7948db141cca7
   
   Submitted crossbow builds: [ursa-labs/crossbow @ 
actions-194](https://github.com/ursa-labs/crossbow/branches/all?query=actions-194)
   
   |Task|Status|
   ||--|
   
|wheel-win-cp35m|[![Appveyor](https://img.shields.io/appveyor/ci/ursa-labs/crossbow/actions-194-appveyor-wheel-win-cp35m.svg)](https://ci.appveyor.com/project/ursa-labs/crossbow/history)|
   
|wheel-win-cp36m|[![Appveyor](https://img.shields.io/appveyor/ci/ursa-labs/crossbow/actions-194-appveyor-wheel-win-cp36m.svg)](https://ci.appveyor.com/project/ursa-labs/crossbow/history)|
   
|wheel-win-cp37m|[![Appveyor](https://img.shields.io/appveyor/ci/ursa-labs/crossbow/actions-194-appveyor-wheel-win-cp37m.svg)](https://ci.appveyor.com/project/ursa-labs/crossbow/history)|
   
|wheel-win-cp38|[![Appveyor](https://img.shields.io/appveyor/ci/ursa-labs/crossbow/actions-194-appveyor-wheel-win-cp38.svg)](https://ci.appveyor.com/project/ursa-labs/crossbow/history)|



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #6985: ARROW-8413: [C++][Parquet] Refactor Generating validity bitmap for values column

2020-05-01 Thread GitBox


wesm commented on pull request #6985:
URL: https://github.com/apache/arrow/pull/6985#issuecomment-622373323


   Take a look at how this is currently being handled in NumPy
   
   * https://numpy.org/neps/nep-0038-SIMD-optimizations.html
   * https://github.com/numpy/numpy/pull/13516
   
   I assume a lot of thought was put into this, and since it was done in 2019 
it should provide a good guideline for what we should do



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] eerhardt commented on a change in pull request #7032: ARROW-6603: [C#] Adds ArrayBuilder API to support writing null values + BooleanArray null support

2020-05-01 Thread GitBox


eerhardt commented on a change in pull request #7032:
URL: https://github.com/apache/arrow/pull/7032#discussion_r417708201



##
File path: csharp/src/Apache.Arrow/Arrays/BinaryArray.cs
##
@@ -73,24 +76,34 @@ public TArray Build(MemoryAllocator allocator = default)
 {
 ValueOffsets.Append(Offset);
 
-var data = new ArrayData(DataType, ValueOffsets.Length - 1, 0, 
0,
-new[] { ArrowBuffer.Empty, ValueOffsets.Build(allocator), 
ValueBuffer.Build(allocator) });
+var data = new ArrayData(DataType, ValueOffsets.Length - 1, 
NullCount, 0,
+new[] { ValidityBuffer.Build(allocator).ValueBuffer, 
ValueOffsets.Build(allocator), ValueBuffer.Build(allocator) });

Review comment:
   Were we going to check for `NullCount == 0` and use an empty buffer for 
the validity buffer in that case?
   
   The same goes for `PrimitiveArrayBuilder` and `BoleanArray.Builder`. #Closed





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] eerhardt commented on a change in pull request #7032: ARROW-6603: [C#] Adds ArrayBuilder API to support writing null values + BooleanArray null support

2020-05-01 Thread GitBox


eerhardt commented on a change in pull request #7032:
URL: https://github.com/apache/arrow/pull/7032#discussion_r415075754



##
File path: csharp/src/Apache.Arrow/Arrays/ArrayData.cs
##
@@ -84,7 +84,7 @@ public ArrayData Slice(int offset, int length)
 length = Math.Min(Length - offset, length);
 offset += Offset;
 
-return new ArrayData(DataType, length, -1, offset, Buffers, 
Children);
+return new ArrayData(DataType, length, NullCount, offset, Buffers, 
Children);

Review comment:
   This isn't correct because we are slicing a subset of the `ArrayData`.
   
   If the total `ArrayData` has values:
   
   `0`, `null`, `2`, `3`, `4`
   
   `NullCount` for the outer `ArrayData` will be `1`.
   
   But if we are slicing from `offset = 2` and `length = 3` and creating a new 
`ArrayData` from the above, the new `ArrayData` won't have any nulls in it. But 
with this change, it will have `NullCount = 1` from its parent.
   
   `NullCount = -1` in the original code means "we don't know yet, it needs to 
be computed". That was done so we don't have to count the nulls in the sliced 
values until it is necessary. Since there are cases where it may not be 
required. #Closed

##
File path: csharp/src/Apache.Arrow/Arrays/ArrayData.cs
##
@@ -22,6 +22,8 @@ namespace Apache.Arrow
 {
 public sealed class ArrayData : IDisposable
 {
+public static readonly int RecalculateNullCount = -1;

Review comment:
   Can we leave this `private` for now? We can always make it public later. 
#Closed





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] zgramana commented on a change in pull request #7032: ARROW-6603: [C#] Adds ArrayBuilder API to support writing null values + BooleanArray null support

2020-05-01 Thread GitBox


zgramana commented on a change in pull request #7032:
URL: https://github.com/apache/arrow/pull/7032#discussion_r416049921



##
File path: csharp/src/Apache.Arrow/Arrays/ArrayData.cs
##
@@ -84,7 +84,7 @@ public ArrayData Slice(int offset, int length)
 length = Math.Min(Length - offset, length);
 offset += Offset;
 
-return new ArrayData(DataType, length, -1, offset, Buffers, 
Children);
+return new ArrayData(DataType, length, NullCount, offset, Buffers, 
Children);

Review comment:
   Appears obvious in retrospect. I reverted, though I did introduce a 
constant for enhanced readability. #Closed





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] zgramana commented on a change in pull request #7032: ARROW-6603: [C#] Adds ArrayBuilder API to support writing null values + BooleanArray null support

2020-05-01 Thread GitBox


zgramana commented on a change in pull request #7032:
URL: https://github.com/apache/arrow/pull/7032#discussion_r418423816



##
File path: csharp/src/Apache.Arrow/Arrays/ArrayData.cs
##
@@ -22,6 +22,8 @@ namespace Apache.Arrow
 {
 public sealed class ArrayData : IDisposable
 {
+public static readonly int RecalculateNullCount = -1;

Review comment:
   Changed #Closed





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] zgramana commented on a change in pull request #7032: ARROW-6603: [C#] Adds ArrayBuilder API to support writing null values + BooleanArray null support

2020-05-01 Thread GitBox


zgramana commented on a change in pull request #7032:
URL: https://github.com/apache/arrow/pull/7032#discussion_r418385072



##
File path: csharp/src/Apache.Arrow/Arrays/StringArray.cs
##
@@ -71,6 +76,15 @@ public string GetString(int index, Encoding encoding = 
default)
 
 var bytes = GetBytes(index);
 
+if (bytes == Span.Empty)
+{
+return null;
+}
+if (bytes.Length == 0)

Review comment:
   Yes. See 
`ArrayBuilderTests.StringArrayBuilderHandlesNullsAndEmptyStrings`. #Closed

##
File path: csharp/src/Apache.Arrow/Arrays/StringArray.cs
##
@@ -71,6 +76,15 @@ public string GetString(int index, Encoding encoding = 
default)
 
 var bytes = GetBytes(index);
 
+if (bytes == Span.Empty)

Review comment:
   I chose the first option as it avoids any public API changes. In general 
I am not a fan of `out` contracts, but did not have a better alternative. This 
way there is still time to contemplate alternatives before committing to the 
proposed overload. #Closed





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] eerhardt commented on a change in pull request #7032: ARROW-6603: [C#] Adds ArrayBuilder API to support writing null values + BooleanArray null support

2020-05-01 Thread GitBox


eerhardt commented on a change in pull request #7032:
URL: https://github.com/apache/arrow/pull/7032#discussion_r417711102



##
File path: csharp/src/Apache.Arrow/Arrays/StringArray.cs
##
@@ -71,6 +76,15 @@ public string GetString(int index, Encoding encoding = 
default)
 
 var bytes = GetBytes(index);
 
+if (bytes == Span.Empty)
+{
+return null;
+}
+if (bytes.Length == 0)

Review comment:
   Do we have unit tests for both of these cases to make sure they act 
appropriately? A test for `null` and a test for `string.Empty`. #Closed

##
File path: csharp/src/Apache.Arrow/Arrays/StringArray.cs
##
@@ -71,6 +76,15 @@ public string GetString(int index, Encoding encoding = 
default)
 
 var bytes = GetBytes(index);
 
+if (bytes == Span.Empty)

Review comment:
   We had similar conversations about this in the `DataFrame` type we made 
here: https://github.com/dotnet/corefxlab/pull/2710#issuecomment-521799900.
   
   I think it makes more sense to say `if (bytes == default)` here.
   
   Or another idea is to add a `ReadOnlySpan GetBytes(int index, out bool 
isNull)` method to `BinaryArray`, which would allow callers know explicitly if 
the value is `null` without having a double `IsValid` check. #Closed





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #7074: ARROW-8656: [Python] Switch to VS2017 in the windows wheel builds

2020-05-01 Thread GitBox


github-actions[bot] commented on pull request #7074:
URL: https://github.com/apache/arrow/pull/7074#issuecomment-622397441


   Revision: b65130bd5eae0e6fe79ace9d529a57f76869f621
   
   Submitted crossbow builds: [ursa-labs/crossbow @ 
actions-195](https://github.com/ursa-labs/crossbow/branches/all?query=actions-195)
   
   |Task|Status|
   ||--|
   
|wheel-win-cp35m|[![Appveyor](https://img.shields.io/appveyor/ci/ursa-labs/crossbow/actions-195-appveyor-wheel-win-cp35m.svg)](https://ci.appveyor.com/project/ursa-labs/crossbow/history)|
   
|wheel-win-cp36m|[![Appveyor](https://img.shields.io/appveyor/ci/ursa-labs/crossbow/actions-195-appveyor-wheel-win-cp36m.svg)](https://ci.appveyor.com/project/ursa-labs/crossbow/history)|
   
|wheel-win-cp37m|[![Appveyor](https://img.shields.io/appveyor/ci/ursa-labs/crossbow/actions-195-appveyor-wheel-win-cp37m.svg)](https://ci.appveyor.com/project/ursa-labs/crossbow/history)|
   
|wheel-win-cp38|[![Appveyor](https://img.shields.io/appveyor/ci/ursa-labs/crossbow/actions-195-appveyor-wheel-win-cp38.svg)](https://ci.appveyor.com/project/ursa-labs/crossbow/history)|



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kszucs commented on pull request #7074: ARROW-8656: [Python] Switch to VS2017 in the windows wheel builds

2020-05-01 Thread GitBox


kszucs commented on pull request #7074:
URL: https://github.com/apache/arrow/pull/7074#issuecomment-622396964


   @github-actions crossbow submit wheel-win-*



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] fsaintjacques commented on a change in pull request #7073: ARROW-8318: [C++][Dataset] Construct FileSystemDataset from fragments

2020-05-01 Thread GitBox


fsaintjacques commented on a change in pull request #7073:
URL: https://github.com/apache/arrow/pull/7073#discussion_r418554202



##
File path: cpp/src/arrow/dataset/file_base.cc
##
@@ -83,131 +83,67 @@ Result 
FileFragment::Scan(std::shared_ptr options
 
 FileSystemDataset::FileSystemDataset(std::shared_ptr schema,
  std::shared_ptr 
root_partition,
- std::shared_ptr format,
- std::shared_ptr 
filesystem,
- fs::PathForest forest,
- ExpressionVector file_partitions)
+ 
std::vector> fragments)
 : Dataset(std::move(schema), std::move(root_partition)),
-  format_(std::move(format)),
-  filesystem_(std::move(filesystem)),
-  forest_(std::move(forest)),
-  partitions_(std::move(file_partitions)) {
-  DCHECK_EQ(static_cast(forest_.size()), partitions_.size());
-}
-
-Result> FileSystemDataset::Make(
-std::shared_ptr schema, std::shared_ptr root_partition,
-std::shared_ptr format, std::shared_ptr 
filesystem,
-std::vector infos) {
-  ExpressionVector partitions(infos.size(), scalar(true));
-  return Make(std::move(schema), std::move(root_partition), std::move(format),
-  std::move(filesystem), std::move(infos), std::move(partitions));
-}
+  fragments_(std::move(fragments)) {}
 
 Result> FileSystemDataset::Make(
 std::shared_ptr schema, std::shared_ptr root_partition,
-std::shared_ptr format, std::shared_ptr 
filesystem,
-std::vector infos, ExpressionVector partitions) {
-  ARROW_ASSIGN_OR_RAISE(auto forest, fs::PathForest::Make(std::move(infos), 
));
-  return Make(std::move(schema), std::move(root_partition), std::move(format),
-  std::move(filesystem), std::move(forest), std::move(partitions));
-}
+FragmentVector fragments) {
+  std::vector> file_fragments;
+  for (const auto& fragment : fragments) {
+auto file_fragment = 
internal::checked_pointer_cast(fragment);

Review comment:
   derp, I actually forgot to check.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] fsaintjacques commented on a change in pull request #7073: ARROW-8318: [C++][Dataset] Construct FileSystemDataset from fragments

2020-05-01 Thread GitBox


fsaintjacques commented on a change in pull request #7073:
URL: https://github.com/apache/arrow/pull/7073#discussion_r418556583



##
File path: cpp/src/arrow/dataset/file_base.cc
##
@@ -221,42 +157,34 @@ Result> 
FileSystemDataset::Write(
 filesystem = std::make_shared();
   }
 
-  std::vector files(plan.paths.size());
-  ExpressionVector partition_expressions(plan.paths.size(), scalar(true));
   auto task_group = scan_context->TaskGroup();
-
   auto partition_base_dir = 
fs::internal::EnsureTrailingSlash(plan.partition_base_dir);
   auto extension = "." + plan.format->type_name();
 
+  FragmentVector fragments;
   for (size_t i = 0; i < plan.paths.size(); ++i) {
 const auto& op = plan.fragment_or_partition_expressions[i];
-if (util::holds_alternative>(op)) {
-  files[i].set_type(fs::FileType::Directory);
-  files[i].set_path(partition_base_dir + plan.paths[i]);
+if (util::holds_alternative>(op)) {

Review comment:
   This will be part of ARROW-8382.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] fsaintjacques commented on a change in pull request #7073: ARROW-8318: [C++][Dataset] Construct FileSystemDataset from fragments

2020-05-01 Thread GitBox


fsaintjacques commented on a change in pull request #7073:
URL: https://github.com/apache/arrow/pull/7073#discussion_r418556583



##
File path: cpp/src/arrow/dataset/file_base.cc
##
@@ -221,42 +157,34 @@ Result> 
FileSystemDataset::Write(
 filesystem = std::make_shared();
   }
 
-  std::vector files(plan.paths.size());
-  ExpressionVector partition_expressions(plan.paths.size(), scalar(true));
   auto task_group = scan_context->TaskGroup();
-
   auto partition_base_dir = 
fs::internal::EnsureTrailingSlash(plan.partition_base_dir);
   auto extension = "." + plan.format->type_name();
 
+  FragmentVector fragments;
   for (size_t i = 0; i < plan.paths.size(); ++i) {
 const auto& op = plan.fragment_or_partition_expressions[i];
-if (util::holds_alternative>(op)) {
-  files[i].set_type(fs::FileType::Directory);
-  files[i].set_path(partition_base_dir + plan.paths[i]);
+if (util::holds_alternative>(op)) {

Review comment:
   This will be done part of ARROW-8382.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #7081: [CI] Cache docker volumes [WIP]

2020-05-01 Thread GitBox


github-actions[bot] commented on pull request #7081:
URL: https://github.com/apache/arrow/pull/7081#issuecomment-622378960


   
   
   Thanks for opening a pull request!
   
   Could you open an issue for this pull request on JIRA?
   https://issues.apache.org/jira/browse/ARROW
   
   Then could you also rename pull request title in the following format?
   
   ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}
   
   See also:
   
 * [Other pull requests](https://github.com/apache/arrow/pulls/)
 * [Contribution Guidelines - How to contribute 
patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] hantusk opened a new issue #7082: pyarrow 0.17 atexit handler causes a segmentation fault

2020-05-01 Thread GitBox


hantusk opened a new issue #7082:
URL: https://github.com/apache/arrow/issues/7082


   When running an ASGI webapp in python with uvicorn, I am getting the 
following error when shutting down. Solved by reverting back to pyarrow 0.16.0
   
   ```python
   Error in atexit._run_exitfuncs:
   Traceback (most recent call last):
 File "pyarrow/types.pxi", line 2638, in 
pyarrow.lib._unregister_py_extension_types
   SystemError: Bad call flags in _PyMethodDef_RawFastCallDict. METH_OLDARGS is 
no longer supported!
   [1]23073 segmentation fault  uvicorn ba...
   ```



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kszucs commented on pull request #7081: [CI] Cache docker volumes [WIP]

2020-05-01 Thread GitBox


kszucs commented on pull request #7081:
URL: https://github.com/apache/arrow/pull/7081#issuecomment-622417613


   With warmed up cache the build time has been reduced to 6m from 17m which is 
promising. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #7080: ARROW-8662: [CI] Consolidate appveyor scripts

2020-05-01 Thread GitBox


github-actions[bot] commented on pull request #7080:
URL: https://github.com/apache/arrow/pull/7080#issuecomment-622423422


   https://issues.apache.org/jira/browse/ARROW-8662



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kszucs commented on pull request #7080: [CI] Consolidate appveyor scripts [WIP]

2020-05-01 Thread GitBox


kszucs commented on pull request #7080:
URL: https://github.com/apache/arrow/pull/7080#issuecomment-622408594


   Checking that the cache properly works on my fork's master branch.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kszucs commented on pull request #7073: ARROW-8318: [C++][Dataset] Construct FileSystemDataset from fragments

2020-05-01 Thread GitBox


kszucs commented on pull request #7073:
URL: https://github.com/apache/arrow/pull/7073#issuecomment-622410968


   I'd like elaborate a bit more on the generic dataset class regardless what 
kind of wrappers do we provide. 
   - Do you plan to unify the filesystem classes into a single one which can 
contain any kind of fragments (including InMemoryFragment)?
   - Will the UnionDataset still be required?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kszucs commented on pull request #7080: [CI] Consolidate appveyor scripts [WIP]

2020-05-01 Thread GitBox


kszucs commented on pull request #7080:
URL: https://github.com/apache/arrow/pull/7080#issuecomment-622418446


   Checking that the cache properly works on my fork's master branch.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kszucs removed a comment on pull request #7081: [CI] Cache docker volumes [WIP]

2020-05-01 Thread GitBox


kszucs removed a comment on pull request #7081:
URL: https://github.com/apache/arrow/pull/7081#issuecomment-622418381


   Checking that the cache properly works on my fork's master branch.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] crd477 opened a new pull request #7083: Update building.rst

2020-05-01 Thread GitBox


crd477 opened a new pull request #7083:
URL: https://github.com/apache/arrow/pull/7083


   simple typo: not -> note



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #7083: ARROW-8663: [Documentation] Small correction to building.rst

2020-05-01 Thread GitBox


github-actions[bot] commented on pull request #7083:
URL: https://github.com/apache/arrow/pull/7083#issuecomment-622435457


   https://issues.apache.org/jira/browse/ARROW-8663



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] fsaintjacques commented on a change in pull request #7021: ARROW-8628: [Dev] Wrap docker-compose commands with archery

2020-05-01 Thread GitBox


fsaintjacques commented on a change in pull request #7021:
URL: https://github.com/apache/arrow/pull/7021#discussion_r418596433



##
File path: .github/workflows/archery.yml
##
@@ -51,10 +53,12 @@ jobs:
   python-version: '3.7'
   - name: Install
 working-directory: dev/archery
-run: pip install pytest responses ruamel.yaml toolz jinja2 -e .
+run: pip install pytest python-dotenv responses ruamel.yaml toolz 
jinja2 -e .

Review comment:
   Why not use pip install -e to ensure that requirements are up to date?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #6985: ARROW-8413: [C++][Parquet] Refactor Generating validity bitmap for values column

2020-05-01 Thread GitBox


wesm commented on pull request #6985:
URL: https://github.com/apache/arrow/pull/6985#issuecomment-622409707


   The 32-bit R failure seems like it could be real



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] fsaintjacques commented on a change in pull request #7073: ARROW-8318: [C++][Dataset] Construct FileSystemDataset from fragments

2020-05-01 Thread GitBox


fsaintjacques commented on a change in pull request #7073:
URL: https://github.com/apache/arrow/pull/7073#discussion_r418564299



##
File path: cpp/src/arrow/dataset/file_base.cc
##
@@ -83,131 +83,67 @@ Result 
FileFragment::Scan(std::shared_ptr options
 
 FileSystemDataset::FileSystemDataset(std::shared_ptr schema,
  std::shared_ptr 
root_partition,
- std::shared_ptr format,
- std::shared_ptr 
filesystem,
- fs::PathForest forest,
- ExpressionVector file_partitions)
+ 
std::vector> fragments)
 : Dataset(std::move(schema), std::move(root_partition)),
-  format_(std::move(format)),
-  filesystem_(std::move(filesystem)),
-  forest_(std::move(forest)),
-  partitions_(std::move(file_partitions)) {
-  DCHECK_EQ(static_cast(forest_.size()), partitions_.size());
-}
-
-Result> FileSystemDataset::Make(
-std::shared_ptr schema, std::shared_ptr root_partition,
-std::shared_ptr format, std::shared_ptr 
filesystem,
-std::vector infos) {
-  ExpressionVector partitions(infos.size(), scalar(true));
-  return Make(std::move(schema), std::move(root_partition), std::move(format),
-  std::move(filesystem), std::move(infos), std::move(partitions));
-}
+  fragments_(std::move(fragments)) {}
 
 Result> FileSystemDataset::Make(
 std::shared_ptr schema, std::shared_ptr root_partition,
-std::shared_ptr format, std::shared_ptr 
filesystem,
-std::vector infos, ExpressionVector partitions) {
-  ARROW_ASSIGN_OR_RAISE(auto forest, fs::PathForest::Make(std::move(infos), 
));
-  return Make(std::move(schema), std::move(root_partition), std::move(format),
-  std::move(filesystem), std::move(forest), std::move(partitions));
-}
+FragmentVector fragments) {
+  std::vector> file_fragments;
+  for (const auto& fragment : fragments) {
+auto file_fragment = 
internal::checked_pointer_cast(fragment);

Review comment:
   Resolved by changing the to vector and moving the checks 
at callsite.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kszucs removed a comment on pull request #7080: [CI] Consolidate appveyor scripts [WIP]

2020-05-01 Thread GitBox


kszucs removed a comment on pull request #7080:
URL: https://github.com/apache/arrow/pull/7080#issuecomment-622408594


   Checking that the cache properly works on my fork's master branch.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kszucs commented on pull request #7081: [CI] Cache docker volumes [WIP]

2020-05-01 Thread GitBox


kszucs commented on pull request #7081:
URL: https://github.com/apache/arrow/pull/7081#issuecomment-622418381


   Checking that the cache properly works on my fork's master branch.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] emkornfield commented on pull request #6985: ARROW-8413: [C++][Parquet] Refactor Generating validity bitmap for values column

2020-05-01 Thread GitBox


emkornfield commented on pull request #6985:
URL: https://github.com/apache/arrow/pull/6985#issuecomment-622422074


   @nealrichardson i'm not sure what the error is saying?  also 32-bit R didn't 
realize that is still a thing :)



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kszucs edited a comment on pull request #7081: [CI] Cache docker volumes [WIP]

2020-05-01 Thread GitBox


kszucs edited a comment on pull request #7081:
URL: https://github.com/apache/arrow/pull/7081#issuecomment-622417613


   With warmed up cache the build time has been reduced to 6m from 17m which is 
promising. 
   
   I'll need to do some gymnastics with the cache keys because the cache plugin 
lacks some features, but it'll work.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nealrichardson commented on pull request #6631: ARROW-8111: [C++][CSV] Support MM/DD/YYYY date format

2020-05-01 Thread GitBox


nealrichardson commented on pull request #6631:
URL: https://github.com/apache/arrow/pull/6631#issuecomment-622437524


   @github-actions autotune everything



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #6631: ARROW-8111: [C++][CSV] Support MM/DD/YYYY date format

2020-05-01 Thread GitBox


wesm commented on pull request #6631:
URL: https://github.com/apache/arrow/pull/6631#issuecomment-622447876


   I can pick up this patch today and take it the last mile so it can be 
merged. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jorisvandenbossche commented on pull request #6303: ARROW-8039: [Python] Use dataset API in existing parquet readers and tests

2020-04-30 Thread GitBox


jorisvandenbossche commented on pull request #6303:
URL: https://github.com/apache/arrow/pull/6303#issuecomment-621937113


   I finally listed the open TODO items from the discussions in this PR / the 
skipped tests, and opened JIRAs where this was not yet the case:
   
   - Deduplicating the specified column names 
(https://github.com/apache/arrow/pull/6303#discussion_r397350410): do we 
actually want this? 
   - Support buffers/NativeFile as file source -> 
https://issues.apache.org/jira/browse/ARROW-8074
   - Multithreaded discovery -> https://issues.apache.org/jira/browse/ARROW-8137
   - Deterministic row order -> https://issues.apache.org/jira/browse/ARROW-8447
   - Error on "bad" files instead of skipping -> 
https://issues.apache.org/jira/browse/ARROW-7673
   - Partition fields using dictionary type -> opened 
https://issues.apache.org/jira/browse/ARROW-8647 (but should this be on by 
default?)
   - Metadata support: https://issues.apache.org/jira/browse/ARROW-8062 is an 
issue about discovery from metadata files, but in addition we also need a way 
to expose the metadata/statistics
   - Support pickling -> opened https://issues.apache.org/jira/browse/ARROW-8651
   - Comment about needing better error message when encountering invalid 
files: this seems to work now, opened 
https://issues.apache.org/jira/browse/ARROW-8652 to enable the test
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #7060: ARROW-8619: [C++] Use distinct enum values for MonthInterval, DayTimeInterval

2020-04-30 Thread GitBox


wesm commented on pull request #7060:
URL: https://github.com/apache/arrow/pull/7060#issuecomment-621933237


   For some reason I can't get JNI running in my local setup
   
   ```
   CMake Error at 
/home/wesm/cpp-toolchain/share/cmake-3.16/Modules/FindPackageHandleStandardArgs.cmake:146
 (message):
 Could NOT find JNI (missing: JAVA_AWT_INCLUDE_PATH)
   Call Stack (most recent call first):
 
/home/wesm/cpp-toolchain/share/cmake-3.16/Modules/FindPackageHandleStandardArgs.cmake:393
 (_FPHSA_FAILURE_MESSAGE)
 /home/wesm/cpp-toolchain/share/cmake-3.16/Modules/FindJNI.cmake:372 
(FIND_PACKAGE_HANDLE_STANDARD_ARGS)
 src/gandiva/jni/CMakeLists.txt:27 (find_package)
   ```
   
   tried with 
   
   ```
   export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
   ```
   
   and 
   
   ```
   export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
   ```
   
   It worked with
   
   ```
   export JAVA_HOME=/usr/lib/jvm/java-14-oracle
   ```
   
   though



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kszucs commented on a change in pull request #7067: ARROW-8639: [C++][Plasma] Require gflags

2020-04-30 Thread GitBox


kszucs commented on a change in pull request #7067:
URL: https://github.com/apache/arrow/pull/7067#discussion_r418109900



##
File path: cpp/cmake_modules/FindgflagsAlt.cmake
##
@@ -15,6 +15,8 @@
 # specific language governing permissions and limitations
 # under the License.
 
+# TODO: Support version detection.

Review comment:
   Created a follow-up ticket 
https://issues.apache.org/jira/browse/ARROW-8653





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #7060: ARROW-8619: [C++] Use distinct enum values for MonthInterval, DayTimeInterval

2020-04-30 Thread GitBox


wesm commented on pull request #7060:
URL: https://github.com/apache/arrow/pull/7060#issuecomment-621872082


   Will fix the JNI issue and will notify the mailing list



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] bkietz commented on a change in pull request #7033: ARROW-7759: [C++][Dataset] Add CsvFileFormat

2020-04-30 Thread GitBox


bkietz commented on a change in pull request #7033:
URL: https://github.com/apache/arrow/pull/7033#discussion_r418108005



##
File path: cpp/src/arrow/dataset/file_csv.cc
##
@@ -0,0 +1,136 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/dataset/file_csv.h"
+
+#include 
+#include 
+#include 
+#include 
+
+#include "arrow/csv/options.h"
+#include "arrow/csv/reader.h"
+#include "arrow/dataset/dataset_internal.h"
+#include "arrow/dataset/file_base.h"
+#include "arrow/dataset/filter.h"
+#include "arrow/dataset/type_fwd.h"
+#include "arrow/dataset/visibility.h"
+#include "arrow/result.h"
+#include "arrow/type.h"
+#include "arrow/util/iterator.h"
+
+namespace arrow {
+namespace dataset {
+
+using internal::checked_cast;
+using internal::checked_pointer_cast;
+
+static inline Result GetConvertOptions(
+const CsvFileFormat& format, const std::shared_ptr& 
scan_options) {
+  auto options = csv::ConvertOptions::Defaults();
+  if (scan_options != nullptr) {
+// This is set to true to match behavior with other formats; a missing 
column
+// will be materialized as null.
+options.include_missing_columns = true;
+
+for (const auto& field : scan_options->schema()->fields()) {
+  options.column_types[field->name()] = field->type();
+  options.include_columns.push_back(field->name());
+}
+
+// FIXME(bkietz) also acquire types of fields materialized but not 
projected.
+for (auto&& name : FieldsInExpression(scan_options->filter)) {
+  ARROW_ASSIGN_OR_RAISE(auto match,
+
FieldRef(name).FindOneOrNone(*scan_options->schema()));
+  if (match.indices().empty()) {
+options.include_columns.push_back(std::move(name));
+  }
+}
+  }
+  return options;
+}
+
+static inline csv::ReadOptions GetReadOptions(const CsvFileFormat& format) {
+  auto options = csv::ReadOptions::Defaults();
+  // Multithreaded conversion of individual files would lead to excessive 
thread
+  // contention when ScanTasks are also executed in multiple threads, so we 
disable it
+  // here.
+  options.use_threads = false;
+  return options;
+}
+
+static inline Result> OpenReader(
+const FileSource& source, const CsvFileFormat& format,
+const std::shared_ptr& options = nullptr,

Review comment:
   Not sure what you mean





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kszucs commented on pull request #7021: ARROW-8628: [Dev] Wrap docker-compose commands with archery

2020-04-30 Thread GitBox


kszucs commented on pull request #7021:
URL: https://github.com/apache/arrow/pull/7021#issuecomment-621957531


   @github-actions crossbow submit test-debian-10-cpp



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #7021: ARROW-8628: [Dev] Wrap docker-compose commands with archery

2020-04-30 Thread GitBox


github-actions[bot] commented on pull request #7021:
URL: https://github.com/apache/arrow/pull/7021#issuecomment-621884756


   Revision: 5c0b02dd7947e1e61da701169cc5fafb9135a6e5
   
   Submitted crossbow builds: [ursa-labs/crossbow @ 
actions-183](https://github.com/ursa-labs/crossbow/branches/all?query=actions-183)
   
   |Task|Status|
   ||--|
   |test-conda-python-3.7|[![Github 
Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-183-github-test-conda-python-3.7)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-183-github-test-conda-python-3.7)|
   
|test-debian-10-cpp|[![CircleCI](https://img.shields.io/circleci/build/github/ursa-labs/crossbow/actions-183-circle-test-debian-10-cpp.svg)](https://circleci.com/gh/ursa-labs/crossbow/tree/actions-183-circle-test-debian-10-cpp)|
   
|test-debian-10-go-1.12|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-183-azure-test-debian-10-go-1.12)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-183-azure-test-debian-10-go-1.12)|



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kszucs commented on pull request #7021: ARROW-8628: [Dev] Wrap docker-compose commands with archery

2020-04-30 Thread GitBox


kszucs commented on pull request #7021:
URL: https://github.com/apache/arrow/pull/7021#issuecomment-621883711


   @github-actions crossbow submit test-debian-10-cpp test-debian-10-go-1.12 
test-conda-python-3.7



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] bkietz commented on pull request #7033: ARROW-7759: [C++][Dataset] Add CsvFileFormat

2020-04-30 Thread GitBox


bkietz commented on pull request #7033:
URL: https://github.com/apache/arrow/pull/7033#issuecomment-621814564


   I'm happy to implement whatever configuration is agreeable. I'll add a list 
of the approaches which have been discussed here to the follow-up so we can 
discuss them there.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] markhildreth opened a new pull request #7072: ARROW-8648: [Rust] Optimize Rust CI Workflows

2020-04-30 Thread GitBox


markhildreth opened a new pull request #7072:
URL: https://github.com/apache/arrow/pull/7072


   WIP



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #7072: ARROW-8648: [Rust] Optimize Rust CI Workflows

2020-04-30 Thread GitBox


github-actions[bot] commented on pull request #7072:
URL: https://github.com/apache/arrow/pull/7072#issuecomment-621910533


   https://issues.apache.org/jira/browse/ARROW-8648



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] bkietz commented on a change in pull request #7033: ARROW-7759: [C++][Dataset] Add CsvFileFormat

2020-04-30 Thread GitBox


bkietz commented on a change in pull request #7033:
URL: https://github.com/apache/arrow/pull/7033#discussion_r418109100



##
File path: cpp/src/arrow/dataset/file_csv.cc
##
@@ -0,0 +1,136 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/dataset/file_csv.h"
+
+#include 
+#include 
+#include 
+#include 
+
+#include "arrow/csv/options.h"
+#include "arrow/csv/reader.h"
+#include "arrow/dataset/dataset_internal.h"
+#include "arrow/dataset/file_base.h"
+#include "arrow/dataset/filter.h"
+#include "arrow/dataset/type_fwd.h"
+#include "arrow/dataset/visibility.h"
+#include "arrow/result.h"
+#include "arrow/type.h"
+#include "arrow/util/iterator.h"
+
+namespace arrow {
+namespace dataset {
+
+using internal::checked_cast;
+using internal::checked_pointer_cast;
+
+static inline Result GetConvertOptions(
+const CsvFileFormat& format, const std::shared_ptr& 
scan_options) {
+  auto options = csv::ConvertOptions::Defaults();
+  if (scan_options != nullptr) {
+// This is set to true to match behavior with other formats; a missing 
column
+// will be materialized as null.
+options.include_missing_columns = true;
+
+for (const auto& field : scan_options->schema()->fields()) {
+  options.column_types[field->name()] = field->type();
+  options.include_columns.push_back(field->name());
+}
+
+// FIXME(bkietz) also acquire types of fields materialized but not 
projected.
+for (auto&& name : FieldsInExpression(scan_options->filter)) {
+  ARROW_ASSIGN_OR_RAISE(auto match,
+
FieldRef(name).FindOneOrNone(*scan_options->schema()));
+  if (match.indices().empty()) {
+options.include_columns.push_back(std::move(name));
+  }
+}
+  }
+  return options;
+}
+
+static inline csv::ReadOptions GetReadOptions(const CsvFileFormat& format) {
+  auto options = csv::ReadOptions::Defaults();
+  // Multithreaded conversion of individual files would lead to excessive 
thread
+  // contention when ScanTasks are also executed in multiple threads, so we 
disable it
+  // here.
+  options.use_threads = false;
+  return options;
+}
+
+static inline Result> OpenReader(
+const FileSource& source, const CsvFileFormat& format,
+const std::shared_ptr& options = nullptr,
+MemoryPool* pool = default_memory_pool()) {
+  ARROW_ASSIGN_OR_RAISE(auto input, source.Open());
+
+  auto reader_options = GetReadOptions(format);
+  const auto& parse_options = format.parse_options;
+  ARROW_ASSIGN_OR_RAISE(auto convert_options, GetConvertOptions(format, 
options));
+  auto maybe_reader = csv::StreamingReader::Make(pool, std::move(input), 
reader_options,
+ parse_options, 
convert_options);
+  if (!maybe_reader.ok()) {
+return maybe_reader.status().WithMessage("Could not open CSV input source 
'",
+ source.path(), "': ", 
maybe_reader.status());
+  }
+
+  return std::move(maybe_reader).ValueOrDie();
+}
+
+/// \brief A ScanTask backed by an Csv file.
+class CsvScanTask : public ScanTask {
+ public:
+  CsvScanTask(std::shared_ptr format, FileSource source,
+  std::shared_ptr options, 
std::shared_ptr context)
+  : ScanTask(std::move(options), std::move(context)),
+format_(std::move(format)),
+source_(std::move(source)) {}
+
+  Result Execute() override {
+ARROW_ASSIGN_OR_RAISE(auto reader,
+  OpenReader(source_, *format_, options(), 
context()->pool));
+return IteratorFromReader(std::move(reader));
+  }
+
+ private:
+  std::shared_ptr format_;
+  FileSource source_;
+};
+
+Result CsvFileFormat::IsSupported(const FileSource& source) const {
+  RETURN_NOT_OK(source.Open().status());

Review comment:
   If source fails to open (for example if it points to a file which 
doesn't exist) that should raise an error rather than returning false, which is 
what this line detects.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about 

[GitHub] [arrow] github-actions[bot] commented on pull request #7073: ARROW-8318: [C++][Dataset] Construct FileSystemDataset from fragments

2020-04-30 Thread GitBox


github-actions[bot] commented on pull request #7073:
URL: https://github.com/apache/arrow/pull/7073#issuecomment-621937629


   https://issues.apache.org/jira/browse/ARROW-8318



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] andygrove commented on pull request #7066: ARROW-8634: [Java] Add Java examples

2020-04-30 Thread GitBox


andygrove commented on pull request #7066:
URL: https://github.com/apache/arrow/pull/7066#issuecomment-621873527


   Ah, I wish I had found this documentation before I started using Arrow Java! 
OK, I will just add links to the README instead then. Thanks.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] liyafan82 commented on pull request #7071: ARROW-7955: [Java] Support large buffer for file/stream IPC

2020-04-30 Thread GitBox


liyafan82 commented on pull request #7071:
URL: https://github.com/apache/arrow/pull/7071#issuecomment-621896058


   Also add an integration test for VarCharVector, as it is possible that the 
size of the offset buffer be larger than Integer.MAX_VALUE



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] fsaintjacques opened a new pull request #7073: ARROW-8318: [C++][Dataset] Construct FileSystemDataset from fragments

2020-04-30 Thread GitBox


fsaintjacques opened a new pull request #7073:
URL: https://github.com/apache/arrow/pull/7073


   * Simplified FileSystemDataset to hold a FragmentVector. Each
 Fragment must be a FileFragment and is checked at
 `FileSystemDataset::Make`.  Fragments are not required to use the same
 backing filesystem nor the same format.
   
   * Removed `FileSystemDataset::format` and
 `FileSystemDataset::partitions`.
   
   * Since FileInfo is not required by neither FileSystemDataset and
 FileSystemDatasetFactory, it is no possible to create a dataset
 without any IO involved.
   
   * Re-introduced the natural behavior of creating FileFragment with their
 full partition expressions instead of removing the ancestors common
 partitions.
   
   * Added `Expression::IsSatisfiableWith` method.
   
   * Added missing compression cmake options to archery.
   
   * Ensure FileSource holds a shared_ptr pointer.
 This is required to refactor FileSystemDataset to support Buffer
 FileSource and heterogeneous FileSystems.
   
   * Rename `type` to `id`, following other classes.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #7021: ARROW-8628: [Dev] Wrap docker-compose commands with archery

2020-04-30 Thread GitBox


github-actions[bot] commented on pull request #7021:
URL: https://github.com/apache/arrow/pull/7021#issuecomment-621958397


   Revision: 0b532b06da2ffab64d34d4420790eee3e4f64ca3
   
   Submitted crossbow builds: [ursa-labs/crossbow @ 
actions-185](https://github.com/ursa-labs/crossbow/branches/all?query=actions-185)
   
   |Task|Status|
   ||--|
   
|test-debian-10-cpp|[![CircleCI](https://img.shields.io/circleci/build/github/ursa-labs/crossbow/actions-185-circle-test-debian-10-cpp.svg)](https://circleci.com/gh/ursa-labs/crossbow/tree/actions-185-circle-test-debian-10-cpp)|



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jorisvandenbossche commented on pull request #7073: ARROW-8318: [C++][Dataset] Construct FileSystemDataset from fragments

2020-04-30 Thread GitBox


jorisvandenbossche commented on pull request #7073:
URL: https://github.com/apache/arrow/pull/7073#issuecomment-621940060


   >  Fragments are not required to use the same
   backing filesystem nor the same format.
   
   Shouldn't we require that? That seems the goal of UnionDataset to combine 
datasets with different formats



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] fsaintjacques commented on a change in pull request #7033: ARROW-7759: [C++][Dataset] Add CsvFileFormat

2020-04-30 Thread GitBox


fsaintjacques commented on a change in pull request #7033:
URL: https://github.com/apache/arrow/pull/7033#discussion_r417959216



##
File path: cpp/src/arrow/dataset/file_csv.cc
##
@@ -0,0 +1,136 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/dataset/file_csv.h"
+
+#include 
+#include 
+#include 
+#include 
+
+#include "arrow/csv/options.h"
+#include "arrow/csv/reader.h"
+#include "arrow/dataset/dataset_internal.h"
+#include "arrow/dataset/file_base.h"
+#include "arrow/dataset/filter.h"
+#include "arrow/dataset/type_fwd.h"
+#include "arrow/dataset/visibility.h"
+#include "arrow/result.h"
+#include "arrow/type.h"
+#include "arrow/util/iterator.h"
+
+namespace arrow {
+namespace dataset {
+
+using internal::checked_cast;
+using internal::checked_pointer_cast;
+
+static inline Result GetConvertOptions(
+const CsvFileFormat& format, const std::shared_ptr& 
scan_options) {
+  auto options = csv::ConvertOptions::Defaults();
+  if (scan_options != nullptr) {
+// This is set to true to match behavior with other formats; a missing 
column
+// will be materialized as null.
+options.include_missing_columns = true;

Review comment:
   I would expect the FilterAndProjectScanTask to fix this in a more 
efficient way.

##
File path: cpp/src/arrow/dataset/file_csv.cc
##
@@ -0,0 +1,136 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/dataset/file_csv.h"
+
+#include 
+#include 
+#include 
+#include 
+
+#include "arrow/csv/options.h"
+#include "arrow/csv/reader.h"
+#include "arrow/dataset/dataset_internal.h"
+#include "arrow/dataset/file_base.h"
+#include "arrow/dataset/filter.h"
+#include "arrow/dataset/type_fwd.h"
+#include "arrow/dataset/visibility.h"
+#include "arrow/result.h"
+#include "arrow/type.h"
+#include "arrow/util/iterator.h"
+
+namespace arrow {
+namespace dataset {
+
+using internal::checked_cast;
+using internal::checked_pointer_cast;
+
+static inline Result GetConvertOptions(
+const CsvFileFormat& format, const std::shared_ptr& 
scan_options) {
+  auto options = csv::ConvertOptions::Defaults();
+  if (scan_options != nullptr) {
+// This is set to true to match behavior with other formats; a missing 
column
+// will be materialized as null.
+options.include_missing_columns = true;
+
+for (const auto& field : scan_options->schema()->fields()) {
+  options.column_types[field->name()] = field->type();
+  options.include_columns.push_back(field->name());
+}
+
+// FIXME(bkietz) also acquire types of fields materialized but not 
projected.
+for (auto&& name : FieldsInExpression(scan_options->filter)) {
+  ARROW_ASSIGN_OR_RAISE(auto match,
+
FieldRef(name).FindOneOrNone(*scan_options->schema()));
+  if (match.indices().empty()) {
+options.include_columns.push_back(std::move(name));
+  }
+}
+  }
+  return options;
+}
+
+static inline csv::ReadOptions GetReadOptions(const CsvFileFormat& format) {
+  auto options = csv::ReadOptions::Defaults();
+  // Multithreaded conversion of individual files would lead to excessive 
thread
+  // contention when ScanTasks are also executed in multiple threads, so we 
disable it
+  // here.
+  options.use_threads = false;
+  return options;
+}
+
+static inline Result> OpenReader(
+const FileSource& source, const CsvFileFormat& format,
+const std::shared_ptr& options = nullptr,


[GitHub] [arrow] kszucs commented on pull request #7021: ARROW-8628: [Dev] Wrap docker-compose commands with archery

2020-04-30 Thread GitBox


kszucs commented on pull request #7021:
URL: https://github.com/apache/arrow/pull/7021#issuecomment-621946996


   @github-actions crossbow submit test-debian-10-cpp 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] sunchao commented on pull request #7076: ARROW-8659: [Rust] ListBuilder allocate with_capacity

2020-05-01 Thread GitBox


sunchao commented on pull request #7076:
URL: https://github.com/apache/arrow/pull/7076#issuecomment-622616144


   Merged. Thanks @tustvold !



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kou edited a comment on pull request #7085: ARROW-8668: [Packaging][APT][Yum][ARM] Use Travis CI's ARM machine to build packages

2020-05-01 Thread GitBox


kou edited a comment on pull request #7085:
URL: https://github.com/apache/arrow/pull/7085#issuecomment-622586181


   @kszucs I want to set `DOCKERHUB_USER` and `DOCKERHUB_TOKEN` in 
https://travis-ci.org/github/ursa-labs/crossbow and 
https://github.com/ursa-labs/crossbow. Which user should we use for this?
   It seems that we use them for https://hub.docker.com/u/asfjenkins on GitHub 
Actions in this repository.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kou edited a comment on pull request #7085: ARROW-8668: [Packaging][APT][Yum][ARM] Use Travis CI's ARM machine to build packages

2020-05-01 Thread GitBox


kou edited a comment on pull request #7085:
URL: https://github.com/apache/arrow/pull/7085#issuecomment-622586181


   @kszucs I want to set `DOCKERHUB_USER` and `DOCKERHUB_TOKEN` in 
https://travis-ci.org/github/ursa-labs/crossbow and 
https://github.com/ursa-labs/crossbow. Which user should we use for this?
   It seems that we use them for https://hub.docker.com/u/asfjenkins on GitHub 
Actions.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kszucs commented on pull request #7085: ARROW-8668: [Packaging][APT][Yum][ARM] Use Travis CI's ARM machine to build packages

2020-05-01 Thread GitBox


kszucs commented on pull request #7085:
URL: https://github.com/apache/arrow/pull/7085#issuecomment-622672970


   @kou yes we use asf provided credentials on github actions to upload the 
images. We need a user with write access to that repository with a custom 
dockerhub token. Just granted you admin rights on ursa-labs/crossbow so you can 
set it yourself (or I can do it later if you want me to).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kou commented on pull request #7085: ARROW-8668: [Packaging][APT][Yum][ARM] Use Travis CI's ARM machine to build packages

2020-05-02 Thread GitBox


kou commented on pull request #7085:
URL: https://github.com/apache/arrow/pull/7085#issuecomment-622731551


   @github-actions crossbow submit -g linux



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kou commented on pull request #7085: ARROW-8668: [Packaging][APT][Yum][ARM] Use Travis CI's ARM machine to build packages

2020-05-02 Thread GitBox


kou commented on pull request #7085:
URL: https://github.com/apache/arrow/pull/7085#issuecomment-622731373


   Thanks! I set mine to https://travis-ci.org/github/ursa-labs/crossbow and 
https://github.com/ursa-labs/crossbow.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] xhochy commented on pull request #7074: ARROW-8656: [Python] Switch to VS2017 in the windows wheel builds

2020-05-02 Thread GitBox


xhochy commented on pull request #7074:
URL: https://github.com/apache/arrow/pull/7074#issuecomment-622729700


   Actually, I think the most practical approach would be to use the 
conda-forge recipes, patch them to use `vs2015` and upload them to a separate 
channel. This gives us the build machinery to not rebuild them in every CI run 
and also keeps us from maintaining a complete set of build scripts, we just 
need to copy them over from time to time.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nealrichardson commented on pull request #6985: ARROW-8413: [C++][Parquet] Refactor Generating validity bitmap for values column

2020-05-01 Thread GitBox


nealrichardson commented on pull request #6985:
URL: https://github.com/apache/arrow/pull/6985#issuecomment-622495803


   For what it's worth, R on Windows uses mingw, not msvc



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #7074: ARROW-8656: [Python] Switch to VS2017 in the windows wheel builds

2020-05-01 Thread GitBox


wesm commented on pull request #7074:
URL: https://github.com/apache/arrow/pull/7074#issuecomment-622496448


   Is this change necessary? I understand why we are using VS2017 in the conda 
package but why in the wheels? 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #7077: ARROW-8660: [C++][Gandiva] Reduce usage of Boost in Gandiva codebase

2020-05-01 Thread GitBox


wesm commented on pull request #7077:
URL: https://github.com/apache/arrow/pull/7077#issuecomment-622513296


   +1



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] mayuropensource commented on pull request #7022: ARROW-8562: [C++] IO: Parameterize I/O Coalescing using S3 metrics

2020-05-01 Thread GitBox


mayuropensource commented on pull request #7022:
URL: https://github.com/apache/arrow/pull/7022#issuecomment-622581788


   thank you @wesm 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm edited a comment on pull request #6631: ARROW-8111: [C++][CSV] Support MM/DD/YYYY date format

2020-05-01 Thread GitBox


wesm edited a comment on pull request #6631:
URL: https://github.com/apache/arrow/pull/6631#issuecomment-622586463


   Problems:
   
   * There aren't any unit tests in this patch so there is some work to do to 
get this merged
   * Code is duplicated from arrow/util/parsing.h
   
   I started working on this so I'll either post an update here or open a new 
PR. Since there's a good deal more code changes required it would be a good 
idea to have another round of code review. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #7085: ARROW-8668: [Packaging][APT][Yum][ARM] Use Travis CI's ARM machine to build packages

2020-05-02 Thread GitBox


github-actions[bot] commented on pull request #7085:
URL: https://github.com/apache/arrow/pull/7085#issuecomment-622733242


   Revision: c73b2b65c373892f95dab8d65cf6fae06a39fe68
   
   Submitted crossbow builds: [ursa-labs/crossbow @ 
actions-198](https://github.com/ursa-labs/crossbow/branches/all?query=actions-198)
   
   |Task|Status|
   ||--|
   |centos-6-amd64|[![Github 
Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-198-github-centos-6-amd64)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-198-github-centos-6-amd64)|
   
|centos-7-aarch64|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-198-travis-centos-7-aarch64.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|
   |centos-7-amd64|[![Github 
Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-198-github-centos-7-amd64)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-198-github-centos-7-amd64)|
   
|centos-8-aarch64|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-198-travis-centos-8-aarch64.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|
   |centos-8-amd64|[![Github 
Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-198-github-centos-8-amd64)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-198-github-centos-8-amd64)|
   |debian-buster-amd64|[![Github 
Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-198-github-debian-buster-amd64)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-198-github-debian-buster-amd64)|
   
|debian-buster-arm64|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-198-travis-debian-buster-arm64.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|
   |debian-stretch-amd64|[![Github 
Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-198-github-debian-stretch-amd64)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-198-github-debian-stretch-amd64)|
   
|debian-stretch-arm64|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-198-travis-debian-stretch-arm64.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|
   |ubuntu-bionic-amd64|[![Github 
Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-198-github-ubuntu-bionic-amd64)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-198-github-ubuntu-bionic-amd64)|
   
|ubuntu-bionic-arm64|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-198-travis-ubuntu-bionic-arm64.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|
   |ubuntu-eoan-amd64|[![Github 
Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-198-github-ubuntu-eoan-amd64)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-198-github-ubuntu-eoan-amd64)|
   
|ubuntu-eoan-arm64|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-198-travis-ubuntu-eoan-arm64.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|
   |ubuntu-focal-amd64|[![Github 
Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-198-github-ubuntu-focal-amd64)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-198-github-ubuntu-focal-amd64)|
   
|ubuntu-focal-arm64|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-198-travis-ubuntu-focal-arm64.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|
   |ubuntu-xenial-amd64|[![Github 
Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-198-github-ubuntu-xenial-amd64)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-198-github-ubuntu-xenial-amd64)|
   
|ubuntu-xenial-arm64|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-198-travis-ubuntu-xenial-arm64.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] xhochy commented on pull request #7074: ARROW-8656: [Python] Switch to VS2017 in the windows wheel builds

2020-05-02 Thread GitBox


xhochy commented on pull request #7074:
URL: https://github.com/apache/arrow/pull/7074#issuecomment-622697125


   We would need to switch away from using the compiled conda packages as the 
basis for the wheels. I'm not sure if there is a different source that could 
provide except Arrow building them on their own.
   
   With my Arrow hat on, this is really annoying as either build times will 
increase significantly or we would have to use make a pre-filled Windows Docker 
image that already includes all our static libraries.
   
   With my conda-forge hat on, I would actually love to see this happen as this 
would mean that we could get rid of the static libraries which we actually 
don't want to build and distribute. As a package manager who can explicitly 
deal with native dependencies, dynamic linkage makes the package maintenance a 
whole lot easier.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kou commented on pull request #7085: ARROW-8668: [Packaging][APT][Yum][ARM] Use Travis CI's ARM machine to build packages

2020-05-02 Thread GitBox


kou commented on pull request #7085:
URL: https://github.com/apache/arrow/pull/7085#issuecomment-622885428


   @github-actions crossbow submit -g linux



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #7085: ARROW-8668: [Packaging][APT][Yum][ARM] Use Travis CI's ARM machine to build packages

2020-05-02 Thread GitBox


github-actions[bot] commented on pull request #7085:
URL: https://github.com/apache/arrow/pull/7085#issuecomment-622887235


   Revision: c73b2b65c373892f95dab8d65cf6fae06a39fe68
   
   Submitted crossbow builds: [ursa-labs/crossbow @ 
actions-199](https://github.com/ursa-labs/crossbow/branches/all?query=actions-199)
   
   |Task|Status|
   ||--|
   |centos-6-amd64|[![Github 
Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-199-github-centos-6-amd64)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-199-github-centos-6-amd64)|
   
|centos-7-aarch64|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-199-travis-centos-7-aarch64.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|
   |centos-7-amd64|[![Github 
Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-199-github-centos-7-amd64)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-199-github-centos-7-amd64)|
   
|centos-8-aarch64|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-199-travis-centos-8-aarch64.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|
   |centos-8-amd64|[![Github 
Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-199-github-centos-8-amd64)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-199-github-centos-8-amd64)|
   |debian-buster-amd64|[![Github 
Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-199-github-debian-buster-amd64)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-199-github-debian-buster-amd64)|
   
|debian-buster-arm64|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-199-travis-debian-buster-arm64.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|
   |debian-stretch-amd64|[![Github 
Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-199-github-debian-stretch-amd64)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-199-github-debian-stretch-amd64)|
   
|debian-stretch-arm64|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-199-travis-debian-stretch-arm64.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|
   |ubuntu-bionic-amd64|[![Github 
Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-199-github-ubuntu-bionic-amd64)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-199-github-ubuntu-bionic-amd64)|
   
|ubuntu-bionic-arm64|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-199-travis-ubuntu-bionic-arm64.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|
   |ubuntu-eoan-amd64|[![Github 
Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-199-github-ubuntu-eoan-amd64)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-199-github-ubuntu-eoan-amd64)|
   
|ubuntu-eoan-arm64|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-199-travis-ubuntu-eoan-arm64.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|
   |ubuntu-focal-amd64|[![Github 
Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-199-github-ubuntu-focal-amd64)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-199-github-ubuntu-focal-amd64)|
   
|ubuntu-focal-arm64|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-199-travis-ubuntu-focal-arm64.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|
   |ubuntu-xenial-amd64|[![Github 
Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-199-github-ubuntu-xenial-amd64)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-199-github-ubuntu-xenial-amd64)|
   
|ubuntu-xenial-arm64|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-199-travis-ubuntu-xenial-arm64.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nevi-me commented on pull request #7037: ARROW-6718: [Rust] Remove packed_simd

2020-05-02 Thread GitBox


nevi-me commented on pull request #7037:
URL: https://github.com/apache/arrow/pull/7037#issuecomment-622889198


   Hi @yordan-pavlov, we want to remove packed_simd due to the uncertainty with 
it being stabilised soon. We so far found that if we optimise some non-SIMD 
code, we don't lose a lot of performance relative to the explicit SIMD code. 
   
   It'd be great if we could see where we could further improve the non-SIMD 
functions



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] xhochy commented on pull request #7074: ARROW-8656: [Python] Switch to VS2017 in the windows wheel builds

2020-05-01 Thread GitBox


xhochy commented on pull request #7074:
URL: https://github.com/apache/arrow/pull/7074#issuecomment-622494222


   > @xhochy to run the wheels, or build them?
   
   To run.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] tobim commented on pull request #7038: ARROW-8593: [C++][Parquet] Fix build with musl libc

2020-05-01 Thread GitBox


tobim commented on pull request #7038:
URL: https://github.com/apache/arrow/pull/7038#issuecomment-622535149


   @fsaintjacques @emkornfield sorry for the long silence, I updated the commit 
as you suggested.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pauldix commented on a change in pull request #7064: ARROW-6945: [Rust] WIP: Add initial skeleton for Rust integration tests

2020-05-01 Thread GitBox


pauldix commented on a change in pull request #7064:
URL: https://github.com/apache/arrow/pull/7064#discussion_r418732315



##
File path: rust/arrow/Cargo.toml
##
@@ -50,6 +50,7 @@ chrono = "0.4"
 flatbuffers = "0.6"
 hex = "0.4"
 arrow-flight = { path = "../arrow-flight", optional = true }
+clap = "2.33.0"

Review comment:
   It's not clear to me how trybuild would be used in conjunction with the 
Archery suite, which just seems to want three CLI executables to pipe invoke.  
Maybe these would be best in a separate project on the same level as `arrow`, 
`arrow-flight` etd? `arrow-integration` perhaps? Then I'll just import from 
arrow and we keep the actual library dependencies clean.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm edited a comment on pull request #6631: ARROW-8111: [C++][CSV] Support MM/DD/YYYY date format

2020-05-01 Thread GitBox


wesm edited a comment on pull request #6631:
URL: https://github.com/apache/arrow/pull/6631#issuecomment-622586463


   Problems:
   
   * There aren't any unit tests in this patch so there is some work to do to 
get this merged
   * Code is duplicated from arrow/util/parsing.h



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm edited a comment on pull request #7074: ARROW-8656: [Python] Switch to VS2017 in the windows wheel builds

2020-05-01 Thread GitBox


wesm edited a comment on pull request #7074:
URL: https://github.com/apache/arrow/pull/7074#issuecomment-622496448


   Is this change necessary? I understand why we are using VS2017 in the conda 
package but why in the wheels? I'm sort of -0.5 on this unless there is a 
concrete reason why we can't keep using VS2015 (because it may require users to 
install a different C++ redistributable than the VS2015 one that other packages 
require)



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kou commented on pull request #7085: ARROW-8668: [Packaging][APT][Yum][ARM] Use Travis CI's ARM machine to build packages

2020-05-01 Thread GitBox


kou commented on pull request #7085:
URL: https://github.com/apache/arrow/pull/7085#issuecomment-622586181


   @kszucs I want to set `DOCKERHUB_USER` and `DOCKERHUB_TOKEN` in 
https://travis-ci.org/github/ursa-labs/crossbow . Which user should we use for 
this?
   It seems that we use them for https://hub.docker.com/u/asfjenkins on GitHub 
Actions.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kou opened a new pull request #7085: ARROW-8668: [Packaging][APT][Yum][ARM] Use Travis CI's ARM machine to build packages

2020-05-01 Thread GitBox


kou opened a new pull request #7085:
URL: https://github.com/apache/arrow/pull/7085


   If we use QEMU on GitHub Actions, it takes 6h+.
   
   If we use ARM machine on Travis CI, it takes 30-40m.
   
   This change adds Docker image caching to
   
apache/arrow-dev:${ARCH}-{DISTRIBUTION}-${CODE_NAME_OR_VERSION}-package-${PACKAGE_NAME}-llvm-${LLVM_VERSION}.
 It
   refers .env at the top directory. So we can override the
   "apache/arrow-dev" by the REPO environment variable for testing.
   
   This change disables Gandiva for Debian/Ubuntu ARM packages because
   apt.llvm.org doesn't provide packages for ARM.
   
   This change disables Gandiva for CentOS 8 ARM packages because we
   don't have enough disk space to build it. We may need to disable
   Flight later.
   
   This change adds ARM package builds to nightly jobs. If they are
   fragile, I'll remove them.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kou commented on pull request #7085: ARROW-8668: [Packaging][APT][Yum][ARM] Use Travis CI's ARM machine to build packages

2020-05-01 Thread GitBox


kou commented on pull request #7085:
URL: https://github.com/apache/arrow/pull/7085#issuecomment-622584942


   @github-actions crossbow submit -g linux



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #6631: ARROW-8111: [C++][CSV] Support MM/DD/YYYY date format

2020-05-01 Thread GitBox


wesm commented on pull request #6631:
URL: https://github.com/apache/arrow/pull/6631#issuecomment-622586463


   There aren't any unit tests in this patch so there is some work to do to get 
this merged



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] fsaintjacques commented on pull request #7074: ARROW-8656: [Python] Switch to VS2017 in the windows wheel builds

2020-05-01 Thread GitBox


fsaintjacques commented on pull request #7074:
URL: https://github.com/apache/arrow/pull/7074#issuecomment-622491245


   @xhochy to run the wheels, or build them?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group

2020-05-01 Thread GitBox


wesm commented on pull request #6744:
URL: https://github.com/apache/arrow/pull/6744#issuecomment-622529037


   thanks @lidavidm! I'm confident we'll be able to devise some solutions to 
the resource allocation problem



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




<    3   4   5   6   7   8   9   10   11   12   >