[jira] [Commented] (PARQUET-1369) [Python] Unavailable Parquet column statistics from Spark-generated file
[ https://issues.apache.org/jira/browse/PARQUET-1369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632014#comment-16632014 ]

ASF GitHub Bot commented on PARQUET-1369:
------------------------------------------

rgruener closed pull request #491: PARQUET-1369: Disregard column sort order if statistics max/min are equal
URL: https://github.com/apache/parquet-cpp/pull/491

This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance:

diff --git a/src/parquet/metadata-test.cc b/src/parquet/metadata-test.cc
index 53653bd7..25bab380 100644
--- a/src/parquet/metadata-test.cc
+++ b/src/parquet/metadata-test.cc
@@ -20,6 +20,7 @@
 #include "parquet/schema.h"
 #include "parquet/statistics.h"
 #include "parquet/types.h"
+#include "parquet/parquet_types.h"

 namespace parquet {

@@ -219,12 +220,36 @@ TEST(ApplicationVersion, Basics) {
   ASSERT_EQ(true, version.VersionLt(version1));

-  ASSERT_FALSE(version1.HasCorrectStatistics(Type::INT96, SortOrder::UNKNOWN));
-  ASSERT_TRUE(version.HasCorrectStatistics(Type::INT32, SortOrder::SIGNED));
-  ASSERT_FALSE(version.HasCorrectStatistics(Type::BYTE_ARRAY, SortOrder::SIGNED));
-  ASSERT_TRUE(version1.HasCorrectStatistics(Type::BYTE_ARRAY, SortOrder::SIGNED));
-  ASSERT_TRUE(
-      version3.HasCorrectStatistics(Type::FIXED_LEN_BYTE_ARRAY, SortOrder::SIGNED));
+  EncodedStatistics stats;
+  ASSERT_FALSE(version1.HasCorrectStatistics(Type::INT96, stats, SortOrder::UNKNOWN));
+  ASSERT_TRUE(version.HasCorrectStatistics(Type::INT32, stats, SortOrder::SIGNED));
+  ASSERT_FALSE(version.HasCorrectStatistics(Type::BYTE_ARRAY, stats, SortOrder::SIGNED));
+  ASSERT_TRUE(version1.HasCorrectStatistics(Type::BYTE_ARRAY, stats, SortOrder::SIGNED));
+  ASSERT_FALSE(
+      version1.HasCorrectStatistics(Type::BYTE_ARRAY, stats, SortOrder::UNSIGNED));
+  ASSERT_TRUE(
+      version3.HasCorrectStatistics(Type::FIXED_LEN_BYTE_ARRAY, stats, SortOrder::SIGNED));
+
+  // Check that the old stats are correct if min and max are the same
+  // regardless of sort order
+  EncodedStatistics stats_str;
+  stats_str.set_min("a").set_max("b");
+  ASSERT_FALSE(
+      version1.HasCorrectStatistics(Type::BYTE_ARRAY, stats_str, SortOrder::UNSIGNED));
+  stats_str.set_max("a");
+  ASSERT_TRUE(
+      version1.HasCorrectStatistics(Type::BYTE_ARRAY, stats_str, SortOrder::UNSIGNED));
+
+  // Check that the same holds true for ints
+  int32_t int_min = 100, int_max = 200;
+  EncodedStatistics stats_int;
+  stats_int.set_min(std::string(reinterpret_cast<const char*>(&int_min), 4))
+      .set_max(std::string(reinterpret_cast<const char*>(&int_max), 4));
+  ASSERT_FALSE(
+      version1.HasCorrectStatistics(Type::BYTE_ARRAY, stats_int, SortOrder::UNSIGNED));
+  stats_int.set_max(std::string(reinterpret_cast<const char*>(&int_min), 4));
+  ASSERT_TRUE(
+      version1.HasCorrectStatistics(Type::BYTE_ARRAY, stats_int, SortOrder::UNSIGNED));
 }

 }  // namespace metadata
diff --git a/src/parquet/metadata.cc b/src/parquet/metadata.cc
index 1cab51f0..3414d258 100644
--- a/src/parquet/metadata.cc
+++ b/src/parquet/metadata.cc
@@ -99,7 +99,7 @@ class ColumnChunkMetaData::ColumnChunkMetaDataImpl {
     for (auto encoding : meta_data.encodings) {
       encodings_.push_back(FromThrift(encoding));
     }
-    stats_ = nullptr;
+    possible_stats_ = nullptr;
   }

   ~ColumnChunkMetaDataImpl() {}

@@ -124,15 +124,19 @@ class ColumnChunkMetaData::ColumnChunkMetaDataImpl {
   // Eg: UTF8
   inline bool is_stats_set() const {
     DCHECK(writer_version_ != nullptr);
-    return column_->meta_data.__isset.statistics &&
-           writer_version_->HasCorrectStatistics(type(), descr_->sort_order());
+    if (!column_->meta_data.__isset.statistics) {
+      return false;
+    }
+    if (possible_stats_ == nullptr) {
+      possible_stats_ = MakeColumnStats(column_->meta_data, descr_);
+    }
+    EncodedStatistics encodedStatistics = possible_stats_->Encode();
+    return writer_version_->HasCorrectStatistics(type(), encodedStatistics,
+                                                 descr_->sort_order());
   }

   inline std::shared_ptr<RowGroupStatistics> statistics() const {
-    if (stats_ == nullptr && is_stats_set()) {
-      stats_ = MakeColumnStats(column_->meta_data, descr_);
-    }
-    return stats_;
+    return is_stats_set() ? possible_stats_ : nullptr;
   }

   inline Compression::type compression() const {
@@ -168,7 +172,7 @@ class ColumnChunkMetaData::ColumnChunkMetaDataImpl {
   }

  private:
-  mutable std::shared_ptr<RowGroupStatistics> stats_;
+  mutable std::shared_ptr<RowGroupStatistics> possible_stats_;
   std::vector
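In plain terms, the patch relaxes the statistics check: min/max written under a mismatched or unknown sort order are normally discarded, but when min equals max the comparator used to compute them could not have changed the (single) value, so they can be trusted. A rough Python sketch of that decision rule (names are illustrative only, not the parquet-cpp API):

```python
# Illustrative sketch of the rule PARQUET-1369 adds; the function and
# constants below are hypothetical, not the actual parquet-cpp interface.
SIGNED, UNSIGNED, UNKNOWN = "signed", "unsigned", "unknown"

def has_correct_statistics(written_order, expected_order, stat_min, stat_max):
    """Decide whether min/max written by an old writer can be trusted."""
    if expected_order == UNKNOWN:
        # e.g. INT96: no well-defined ordering, never trust the stats
        return False
    if written_order == expected_order:
        # Writer used the comparator this column actually requires
        return True
    # Mismatched comparator, but min == max means the ordering used to
    # compute the stats could not have affected the result.
    return stat_min == stat_max

# Old writers sorted BYTE_ARRAY with signed comparison; UTF8 needs unsigned:
assert has_correct_statistics(SIGNED, UNSIGNED, b"a", b"b") is False
# With only one distinct value the stats are safe regardless of order:
assert has_correct_statistics(SIGNED, UNSIGNED, b"hello", b"hello") is True
```

This is exactly the case in the reported file: the `string` column has the single value `hello`, so its statistics are usable even though they were written under the wrong sort order.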
[jira] [Commented] (PARQUET-1369) [Python] Unavailable Parquet column statistics from Spark-generated file
[ https://issues.apache.org/jira/browse/PARQUET-1369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16584436#comment-16584436 ]

ASF GitHub Bot commented on PARQUET-1369:
------------------------------------------

rgruener opened a new pull request #491: PARQUET-1369: Disregard column sort order if statistics max/min are equal
URL: https://github.com/apache/parquet-cpp/pull/491

This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

> [Python] Unavailable Parquet column statistics from Spark-generated file
> ------------------------------------------------------------------------
>
>                 Key: PARQUET-1369
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1369
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-cpp
>    Affects Versions: cpp-1.4.0
>            Reporter: Robert Gruener
>            Assignee: Robert Gruener
>            Priority: Major
>              Labels: parquet, pull-request-available
>             Fix For: cpp-1.5.0
>
> I have a dataset generated by Spark which shows it has statistics for the
> string column when using the Java parquet-mr code (shown by using
> `parquet-tools meta`); however, reading from pyarrow shows that the statistics
> for that column are not set. I should note the column only has a single
> value, though it still seems like a problem that pyarrow can't recognize it
> (it can recognize statistics set for the long and double types).
> See https://github.com/apache/arrow/files/2161147/metadata.zip for a file
> example.
>
> Pyarrow code to check statistics:
> {code}
> from pyarrow import parquet as pq
> meta = pq.read_metadata('/tmp/metadata.parquet')
> # No statistics for the string column: prints False and the statistics object is None
> print(meta.row_group(0).column(1).is_stats_set)
> {code}
> Example parquet-meta output:
> {code}
> file schema: spark_schema
> --------------------------------------------------------------------------------
> int:    REQUIRED INT64 R:0 D:0
> string: OPTIONAL BINARY O:UTF8 R:0 D:1
> float:  REQUIRED DOUBLE R:0 D:0
>
> row group 1: RC:8333 TS:76031 OFFSET:4
> --------------------------------------------------------------------------------
> int:    INT64 SNAPPY DO:0 FPO:4 SZ:7793/8181/1.05 VC:8333 ENC:PLAIN_DICTIONARY,BIT_PACKED ST:[min: 0, max: 100, num_nulls: 0]
> string: BINARY SNAPPY DO:0 FPO:7797 SZ:1146/1139/0.99 VC:8333 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE ST:[min: hello, max: hello, num_nulls: 4192]
> float:  DOUBLE SNAPPY DO:0 FPO:8943 SZ:66720/66711/1.00 VC:8333 ENC:PLAIN,BIT_PACKED ST:[min: 0.0057611096964338415, max: 99.99811053829232, num_nulls: 0]
> {code}
> I realize the column only has a single value, though it still seems like
> pyarrow should be able to read the statistics set. I made this here and not a
> JIRA since I wanted to be sure this is actually an issue and there wasn't a
> ticket already made there (I couldn't find one, but I wanted to be sure).
> Either way, I would like to understand why this happens.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
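The reason parquet-cpp distrusts these parquet-mr statistics in the first place is that old writers compared BYTE_ARRAY values bytewise as signed chars, while UTF8 ordering requires unsigned byte comparison; the two orders disagree as soon as any byte is >= 0x80. A small Python illustration of the discrepancy (this is not the parquet-mr code, just a model of a signed per-byte comparator):

```python
def signed_key(b: bytes):
    # Interpret each byte as a signed int8, the way a naive char-based
    # comparator would; bytes >= 0x80 become negative values.
    return [x - 256 if x >= 0x80 else x for x in b]

ascii_val = b"a"            # single byte 0x61
accented = "é".encode()     # b"\xc3\xa9", first byte 0xc3 >= 0x80

# Correct unsigned (lexicographic bytes, i.e. UTF8) order: "a" < "é"
assert ascii_val < accented
# Signed per-byte order flips it: 0xc3 reads as -61, sorting before 'a' (97)
assert signed_key(accented) < signed_key(ascii_val)
```

So min/max computed under the signed order can be plainly wrong for UTF8 data, and readers discard them, which is why a column whose min and max are the same single value (like the `hello` column above) is the special case where the stats remain trustworthy.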
[jira] [Commented] (PARQUET-1369) [Python] Unavailable Parquet column statistics from Spark-generated file
[ https://issues.apache.org/jira/browse/PARQUET-1369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16568247#comment-16568247 ]

Uwe L. Korn commented on PARQUET-1369:
--------------------------------------

[~rgruener] Moved it.