[GitHub] [arrow] bkietz commented on a change in pull request #7526: ARROW-9146: [C++][Dataset] Lazily store fragment physical schema

2020-06-25 Thread GitBox


bkietz commented on a change in pull request #7526:
URL: https://github.com/apache/arrow/pull/7526#discussion_r445583091



##
File path: cpp/src/arrow/util/mutex.h
##
@@ -0,0 +1,58 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include 
+
+#include "arrow/util/macros.h"
+#include "arrow/util/visibility.h"
+
+namespace arrow {
+namespace util {
+
+/// A wrapper around std::mutex since we can't use it directly due to CPP/CLI

Review comment:
   https://github.com/apache/arrow/pull/7526#discussion_r445512267





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] bkietz commented on a change in pull request #7526: ARROW-9146: [C++][Dataset] Lazily store fragment physical schema

2020-06-25 Thread GitBox


bkietz commented on a change in pull request #7526:
URL: https://github.com/apache/arrow/pull/7526#discussion_r445512267



##
File path: cpp/src/arrow/util/mutex.h
##
@@ -0,0 +1,58 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include 
+
+#include "arrow/util/macros.h"
+#include "arrow/util/visibility.h"
+
+namespace arrow {
+namespace util {
+
+/// A wrapper around std::mutex since we can't use it directly due to CPP/CLI

Review comment:
   We cannot use `std::mutex` in public headers due to CPP/CLI forbidding 
use of that header 
https://docs.microsoft.com/en-us/cpp/standard-library/mutex#remarks
   We don't support compilation of Arrow with the `/clr` option so it's 
acceptable to use `std::mutex` in source files and private headers (including 
`arrow/util/mutex.cc`). However arrow headers may be included by projects which 
*are* using `/clr` and usage of `std::mutex` in those public header files would 
break those projects.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] bkietz commented on a change in pull request #7526: ARROW-9146: [C++][Dataset] Lazily store fragment physical schema

2020-06-24 Thread GitBox


bkietz commented on a change in pull request #7526:
URL: https://github.com/apache/arrow/pull/7526#discussion_r444914459



##
File path: cpp/src/arrow/dataset/file_parquet.cc
##
@@ -357,13 +355,20 @@ static inline Result> 
AugmentRowGroups(
   return row_groups;
 }
 
-Result ParquetFileFormat::ScanFile(
-const FileSource& source, std::shared_ptr options,
-std::shared_ptr context, std::vector 
row_groups) const {
+Result 
ParquetFileFormat::ScanFile(std::shared_ptr options,
+ 
std::shared_ptr context,
+ FileFragment* fragment) 
const {
+  const auto& source = fragment->source();
+  auto row_groups = checked_cast(fragment)->row_groups();
+
   bool row_groups_are_complete = RowGroupInfosAreComplete(row_groups);
   // The following block is required to avoid any IO if all RowGroups are
   // excluded due to prior statistics knowledge.
   if (row_groups_are_complete) {
+// physical_schema should be cached at this point
+ARROW_ASSIGN_OR_RAISE(auto physical_schema, 
fragment->ReadPhysicalSchema());
+RETURN_NOT_OK(options->filter->Validate(*physical_schema));

Review comment:
   Fragments created by ParquetDatasetFactory will have their physical 
schema populated on construction: 
https://github.com/apache/arrow/pull/7526/files#diff-b7aed3b5c0adacff778c6930cc6a03e4R646-R650
   
   After that calling `ReadPhysicalSchema` on those fragments will be an 
IO-free operation 
https://github.com/apache/arrow/pull/7526/files#diff-ce896c4eb96f3a050aec23151adf58e8R42-R48





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] bkietz commented on a change in pull request #7526: ARROW-9146: [C++][Dataset] Lazily store fragment physical schema

2020-06-24 Thread GitBox


bkietz commented on a change in pull request #7526:
URL: https://github.com/apache/arrow/pull/7526#discussion_r444914459



##
File path: cpp/src/arrow/dataset/file_parquet.cc
##
@@ -357,13 +355,20 @@ static inline Result> 
AugmentRowGroups(
   return row_groups;
 }
 
-Result ParquetFileFormat::ScanFile(
-const FileSource& source, std::shared_ptr options,
-std::shared_ptr context, std::vector 
row_groups) const {
+Result 
ParquetFileFormat::ScanFile(std::shared_ptr options,
+ 
std::shared_ptr context,
+ FileFragment* fragment) 
const {
+  const auto& source = fragment->source();
+  auto row_groups = checked_cast(fragment)->row_groups();
+
   bool row_groups_are_complete = RowGroupInfosAreComplete(row_groups);
   // The following block is required to avoid any IO if all RowGroups are
   // excluded due to prior statistics knowledge.
   if (row_groups_are_complete) {
+// physical_schema should be cached at this point
+ARROW_ASSIGN_OR_RAISE(auto physical_schema, 
fragment->ReadPhysicalSchema());
+RETURN_NOT_OK(options->filter->Validate(*physical_schema));

Review comment:
   
https://github.com/apache/arrow/pull/7526/files#diff-b7aed3b5c0adacff778c6930cc6a03e4R646-R650





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] bkietz commented on a change in pull request #7526: ARROW-9146: [C++][Dataset] Lazily store fragment physical schema

2020-06-23 Thread GitBox


bkietz commented on a change in pull request #7526:
URL: https://github.com/apache/arrow/pull/7526#discussion_r444509707



##
File path: cpp/src/arrow/dataset/file_parquet.cc
##
@@ -357,13 +355,20 @@ static inline Result> 
AugmentRowGroups(
   return row_groups;
 }
 
-Result ParquetFileFormat::ScanFile(
-const FileSource& source, std::shared_ptr options,
-std::shared_ptr context, std::vector 
row_groups) const {
+Result 
ParquetFileFormat::ScanFile(std::shared_ptr options,
+ 
std::shared_ptr context,
+ FileFragment* fragment) 
const {
+  const auto& source = fragment->source();
+  auto row_groups = checked_cast(fragment)->row_groups();
+
   bool row_groups_are_complete = RowGroupInfosAreComplete(row_groups);
   // The following block is required to avoid any IO if all RowGroups are
   // excluded due to prior statistics knowledge.
   if (row_groups_are_complete) {
+// physical_schema should be cached at this point
+ARROW_ASSIGN_OR_RAISE(auto physical_schema, 
fragment->ReadPhysicalSchema());
+RETURN_NOT_OK(options->filter->Validate(*physical_schema));

Review comment:
   Fragments created using `_metadata` have their physical schema preloaded 
so they won't incur IO when no row group can satisfy a predicate





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org