Re: [PR] PARQUET-34: implement Size() filter for repeated columns [parquet-java]

2024-12-20 Thread via GitHub


emkornfield commented on PR #3098:
URL: https://github.com/apache/parquet-java/pull/3098#issuecomment-2557642415

   I can try to look in more detail, but stats can certainly be used here. I 
imagine they are most useful for repeated fields when trying to discriminate 
between repeated fields that mostly have 0 or 1 element and trying to filter 
out cases with > 0 or 1 elements. E.g. if all fields have 0 observed 
rep_levels of 1, then one knows for sure all lists are of length 0 or 1 
(whether there are any lists of length 0 or 1 can be determined by 
inspecting the def level histogram). For larger cardinality lists the 
filtering power diminishes significantly (it's hard to distinguish, based on 
histograms, between many very small lists and one very large one).
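   A minimal sketch of the rep-level check described above (illustrative only; the class and helper names are mine, and the histogram shape is assumed to follow SizeStatistics, where index i counts values observed at repetition level i):

```java
import java.util.List;

public class RepLevelCheck {
    // A value at repetition level >= 1 continues an existing list, so if no
    // such values were observed, no list in the chunk has a second element.
    static boolean allListsAtMostOneElement(List<Long> repLevelHistogram) {
        for (int level = 1; level < repLevelHistogram.size(); level++) {
            if (repLevelHistogram.get(level) > 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // 4 list starts, no continuation values: every list has length 0 or 1
        System.out.println(allListsAtMostOneElement(List.of(4L, 0L))); // true
        // 3 list starts, 5 continuation values: some list has >= 2 elements
        System.out.println(allListsAtMostOneElement(List.of(3L, 5L))); // false
    }
}
```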


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] PARQUET-34: implement Size() filter for repeated columns [parquet-java]

2024-12-16 Thread via GitHub


clairemcginty commented on PR #3098:
URL: https://github.com/apache/parquet-java/pull/3098#issuecomment-2546352209

   > Thanks for adding this! This is a large PR that I need to take some time 
to review.
   
   thanks, no rush on reviewing it! 👍 




Re: [PR] PARQUET-34: implement Size() filter for repeated columns [parquet-java]

2024-12-13 Thread via GitHub


wgtmac commented on PR #3098:
URL: https://github.com/apache/parquet-java/pull/3098#issuecomment-2541673515

   Thanks for adding this! This is a large PR that I need to take some time to 
review.
   
   It would be good if @emkornfield @gszadovszky could take a look to see if 
this is a good use case for SizeStatistics.




Re: [PR] PARQUET-34: implement Size() filter for repeated columns [parquet-java]

2024-12-06 Thread via GitHub


clairemcginty commented on code in PR #3098:
URL: https://github.com/apache/parquet-java/pull/3098#discussion_r1873889419


##
parquet-hadoop/src/main/java/org/apache/parquet/filter2/dictionarylevel/DictionaryFilter.java:
##
@@ -493,6 +494,39 @@ public <T extends Comparable<T>> Boolean visit(Contains<T> contains) {
     return contains.filter(this, (l, r) -> l || r, (l, r) -> l && r, v -> BLOCK_MIGHT_MATCH);
   }
 
+  @Override
+  public Boolean visit(Size size) {
+    ColumnChunkMetaData meta = getColumnChunk(size.getColumn().getColumnPath());
+
+    if (meta == null) {
+      // the column isn't in this file, so fail eq/gt/gte targeting size > 0
+      final boolean blockCannotMatch =
+          size.filter((eq) -> eq > 0, (lt) -> false, (lte) -> false, (gt) -> gt >= 0, (gte) -> gte > 0);
+      return blockCannotMatch ? BLOCK_CANNOT_MATCH : BLOCK_MIGHT_MATCH;
+    }
+
+    try {
+      // We know the block has at least as many array elements as the dictionary size
+      final Set<?> dict = expandDictionary(meta);
+      if (dict == null) {
+        return BLOCK_MIGHT_MATCH;
+      }
+      int numDistinctValues = dict.size();
+      final boolean blockCannotMatch = size.filter(
+          (eq) -> eq < numDistinctValues,
+          (lt) -> lt <= numDistinctValues,
+          (lte) -> lte < numDistinctValues,
+          (gt) -> false,
+          (gte) -> false);
+

Review Comment:
   actually now that I think about it, this isn't accurate, since we don't know 
the distribution of values. I guess we could combine it with SizeStatistics to 
get the number of elements and work out the minimum size from there.
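   The combination suggested here could be sketched as follows (purely illustrative; the class and method names are mine, not from the PR): with D distinct dictionary values spread across R non-null lists, the pigeonhole principle guarantees that at least one list holds ceil(D / R) or more elements.

```java
public class DictSizeBound {
    // Lower bound on the size of the largest list in the chunk, given the
    // dictionary cardinality and the non-null list count from SizeStatistics.
    static long minGuaranteedLargestListSize(long numDistinctValues, long numNonNullLists) {
        if (numNonNullLists <= 0 || numDistinctValues <= 0) {
            return 0;
        }
        // ceiling division without floating point
        return (numDistinctValues + numNonNullLists - 1) / numNonNullLists;
    }

    public static void main(String[] args) {
        // 7 distinct values across 3 lists: some list has at least 3 elements
        System.out.println(minGuaranteedLargestListSize(7, 3)); // 3
    }
}
```

   Note this only bounds one list, so it can at most rule out predicates that no single record could satisfy; it says nothing about the other lists' sizes.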





Re: [PR] PARQUET-34: implement Size() filter for repeated columns [parquet-java]

2024-12-05 Thread via GitHub


clairemcginty commented on code in PR #3098:
URL: https://github.com/apache/parquet-java/pull/3098#discussion_r1872004619


##
parquet-hadoop/src/main/java/org/apache/parquet/filter2/statisticslevel/StatisticsFilter.java:
##
@@ -217,6 +219,70 @@ public <T extends Comparable<T>> Boolean visit(Contains<T> contains) {
     return contains.filter(this, (l, r) -> l || r, (l, r) -> l && r, v -> BLOCK_MIGHT_MATCH);
   }
 
+  @Override
+  public Boolean visit(Size size) {
+    final ColumnChunkMetaData metadata = getColumnChunk(size.getColumn().getColumnPath());
+    if (metadata == null) {
+      // the column isn't in this file, so fail eq/gt/gte targeting size > 0
+      final boolean blockCannotMatch =
+          size.filter((eq) -> eq > 0, (lt) -> false, (lte) -> false, (gt) -> gt >= 0, (gte) -> gte > 0);
+      return blockCannotMatch ? BLOCK_CANNOT_MATCH : BLOCK_MIGHT_MATCH;
+    }
+
+    final SizeStatistics stats = metadata.getSizeStatistics();
+    final List<Long> repetitionLevelHistogram = stats.getRepetitionLevelHistogram();
+    final List<Long> definitionLevelHistogram = stats.getDefinitionLevelHistogram();
+
+    if (repetitionLevelHistogram.isEmpty() || definitionLevelHistogram.isEmpty()) {
+      return BLOCK_MIGHT_MATCH;
+    }
+
+    // If all values have repetition level 0, then no array has more than 1 element
+    if (repetitionLevelHistogram.size() == 1
+        || repetitionLevelHistogram.subList(1, repetitionLevelHistogram.size()).stream()
+            .allMatch(l -> l == 0)) {
+
+      // Null list fields are treated as having size 0
+      if (( // all lists are nulls
+          definitionLevelHistogram.subList(1, definitionLevelHistogram.size()).stream()
+              .allMatch(l -> l == 0))
+          || // all lists are size 0
+          (definitionLevelHistogram.get(0) == 0
+              && definitionLevelHistogram.subList(2, definitionLevelHistogram.size()).stream()
+                  .allMatch(l -> l == 0))) {
+
+        final boolean blockCannotMatch =
+            size.filter((eq) -> eq > 0, (lt) -> false, (lte) -> false, (gt) -> gt >= 0, (gte) -> gte > 0);
+        return blockCannotMatch ? BLOCK_CANNOT_MATCH : BLOCK_MIGHT_MATCH;
+      }
+
+      long maxDefinitionLevel = definitionLevelHistogram.get(definitionLevelHistogram.size() - 1);
+
+      // If all repetition levels are zero and all definition levels are > MAX_DEFINITION_LEVEL - 1,
+      // all lists are of size 1
+      if (definitionLevelHistogram.stream().allMatch(l -> l > maxDefinitionLevel - 1)) {
+        final boolean blockCannotMatch = size.filter(
+            (eq) -> eq != 1, (lt) -> lt <= 1, (lte) -> lte < 1, (gt) -> gt >= 1, (gte) -> gte > 1);
+
+        return blockCannotMatch ? BLOCK_CANNOT_MATCH : BLOCK_MIGHT_MATCH;
+      }
+    }
+    long nonNullElementCount =
+        repetitionLevelHistogram.stream().mapToLong(l -> l).sum() - definitionLevelHistogram.get(0);
+    long numNonNullRecords = repetitionLevelHistogram.get(0) - definitionLevelHistogram.get(0);
+
+    // Given the total number of elements and non-null fields, we can compute the max size of any array field
+    long maxArrayElementCount = 1 + (nonNullElementCount - numNonNullRecords);
+    final boolean blockCannotMatch = size.filter(
+        (eq) -> eq > maxArrayElementCount,
+        (lt) -> false,
+        (lte) -> false,
+        (gt) -> gt >= maxArrayElementCount,
+        (gte) -> gte > maxArrayElementCount);
+
+    return blockCannotMatch ? BLOCK_CANNOT_MATCH : BLOCK_MIGHT_MATCH;

Review Comment:
   hopefully this is a faithful transcription of the logic outlined here: 
https://github.com/apache/parquet-java/issues/1452#issuecomment-2271914678
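   A worked example of the bound in the final branch (illustrative; class and method names are mine, but the arithmetic mirrors the diff): every non-null record contributes at least one element, so the largest possible list holds at most 1 + (totalElements - numNonNullRecords) elements.

```java
import java.util.List;

public class MaxListSizeBound {
    // Upper bound on any single list's size, derived from the rep/def level
    // histograms: sum of the rep histogram counts total values, index 0 of the
    // def histogram counts null records.
    static long maxArrayElementCount(List<Long> repLevelHistogram, List<Long> defLevelHistogram) {
        long nonNullElementCount =
            repLevelHistogram.stream().mapToLong(Long::longValue).sum() - defLevelHistogram.get(0);
        long numNonNullRecords = repLevelHistogram.get(0) - defLevelHistogram.get(0);
        return 1 + (nonNullElementCount - numNonNullRecords);
    }

    public static void main(String[] args) {
        // Lists [1,2,3], [1], [1,2]: rep histogram [3, 3] (3 list starts,
        // 3 continuations), def histogram [0, 6] (no nulls). 6 elements over
        // 3 non-empty lists, so the largest list has at most 4 elements.
        System.out.println(maxArrayElementCount(List.of(3L, 3L), List.of(0L, 6L))); // 4
    }
}
```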



##
parquet-hadoop/src/test/java/org/apache/parquet/filter2/statisticslevel/TestStatisticsFilter.java:
##
@@ -435,6 +468,174 @@ public void testOr() {
     assertFalse(canDrop(or(no, no), columnMetas));
   }
 
+  @Test
+  public void testSizeFilterRequiredGroupRequiredElements() throws Exception {
+    final IntStatistics minMaxStats = new IntStatistics();
+
+    // Case 1: Lists are populated
+    List<ColumnChunkMetaData> columnMeta = Collections.singletonList(getIntColumnMeta(
+        nestedListColumn.getColumnPath(),
+        minMaxStats,
+        createSizeStatisticsForRepeatedField(
+            true,
+            ImmutableList.of(
+                ImmutableList.of(1, 2, 3),
+                ImmutableList.of(1),
+                ImmutableList.of(1, 2, 3),
+                ImmutableList.of())),
+        4));
+
+    // SizeStats tells us that there are 7 total array elements spread across 3 non-empty list_fields,
+    // so the max size any single list_field could have is 5
+    assertTrue(canDrop(size(nestedListColumn, Operators.Size.Operator.GTE, 6), columnMeta));
+    assertTrue(canDrop(size(nestedListColumn, Operators.Size.Operator.GT, 5), columnMeta));
+    assertTrue(canDrop(size(nestedListColumn, Operators.Size.Operator.EQ, 6), columnMeta));
+
+    // These predicates should not