[GitHub] [parquet-mr] ggershinsky commented on pull request #978: PARQUET-2161: Fix row index generation in combination with range filtering

2022-06-28 Thread GitBox


ggershinsky commented on PR #978:
URL: https://github.com/apache/parquet-mr/pull/978#issuecomment-1168681667

   Yep, I remember reviewing that PR. @prakharjain09, can you also have a look 
at this fix?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] steveloughran commented on pull request #976: PARQUET-2158: Upgrade Hadoop dependency to version 3.2.0

2022-06-27 Thread GitBox


steveloughran commented on PR #976:
URL: https://github.com/apache/parquet-mr/pull/976#issuecomment-1167200968

   I will do a separate PR to remove `PathGlobPattern`; not this week though. 
   
   It is used in DeprecatedFieldProjectionFilter, and that is used in 
org.apache.parquet.hadoop.thrift.ThriftReadSupport if 
"parquet.thrift.column.filter" is set. That use would have to be cut and, rather 
than just printing a deprecation warning, actually fail.
   
   Nobody can be using this on anything with ASF hadoop binaries 3.2+, or they 
would have complained about linkage errors by now. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] ala commented on pull request #978: PARQUET-2161: Fix row index generation in combination with range filtering

2022-06-24 Thread GitBox


ala commented on PR #978:
URL: https://github.com/apache/parquet-mr/pull/978#issuecomment-1165708679

   cc @ggershinsky
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] steveloughran commented on a diff in pull request #976: PARQUET-2158: Upgrade Hadoop dependency to version 3.2.0

2022-06-24 Thread GitBox


steveloughran commented on code in PR #976:
URL: https://github.com/apache/parquet-mr/pull/976#discussion_r905967944


##
pom.xml:
##
@@ -76,7 +76,7 @@
 2.13.2.2
 0.14.2
 shaded.parquet
-2.10.1
+3.2.0

Review Comment:
   I was being unambitious. Moving to this, the oldest 3.x release working on 
Java 11, ensures that anything else on a version >= this should link properly.
   
   If you do want to be more current: Spark is on 3.3.3, Hive is trying 
to move to 3.3.x, and I will be doing a 3.3.4 release in a week's time, which is 
just some security changes, mostly of relevance to servers.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] steveloughran commented on a diff in pull request #976: PARQUET-2158: Upgrade Hadoop dependency to version 3.2.0

2022-06-24 Thread GitBox


steveloughran commented on code in PR #976:
URL: https://github.com/apache/parquet-mr/pull/976#discussion_r905965620


##
parquet-thrift/src/main/java/org/apache/parquet/thrift/projection/deprecated/PathGlobPattern.java:
##
@@ -20,8 +20,8 @@
 
 import org.apache.hadoop.fs.GlobPattern;
 
-import java.util.regex.Pattern;
-import java.util.regex.PatternSyntaxException;
+import com.google.re2j.Pattern;

Review Comment:
   +1 for cutting. I will update the patch



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] ala commented on pull request #978: PARQUET-2161: Fix row index generation in combination with range filtering

2022-06-23 Thread GitBox


ala commented on PR #978:
URL: https://github.com/apache/parquet-mr/pull/978#issuecomment-1164263841

   cc @shangxinli This is a small follow-up bug fix for 
https://github.com/apache/parquet-mr/pull/945


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] ggershinsky commented on a diff in pull request #968: PARQUET-2149: Async IO implementation for ParquetFileReader

2022-06-22 Thread GitBox


ggershinsky commented on code in PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#discussion_r903697361


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java:
##
@@ -1796,5 +1882,314 @@ public void readAll(SeekableInputStream f, 
ChunkListBuilder builder) throws IOEx
 public long endPos() {
   return offset + length;
 }
+
+@Override
+public String toString() {
+  return "ConsecutivePartList{" +
+"offset=" + offset +
+", length=" + length +
+", chunks=" + chunks +
+'}';
+}
   }
+
+  /**
+   * Encapsulates the reading of a single page.
+   */
+  public class PageReader implements Closeable {
+private final Chunk chunk;
+private final int currentBlock;
+private final BlockCipher.Decryptor headerBlockDecryptor;
+private final BlockCipher.Decryptor pageBlockDecryptor;
+private final byte[] aadPrefix;
+private final int rowGroupOrdinal;
+private final int columnOrdinal;
+
+//state
+private final LinkedBlockingDeque<Optional<DataPage>> pagesInChunk = new LinkedBlockingDeque<>();
+private DictionaryPage dictionaryPage = null;
+private int pageIndex = 0;
+private long valuesCountReadSoFar = 0;
+private int dataPageCountReadSoFar = 0;
+
+// derived
+private final PrimitiveType type;
+private final byte[] dataPageAAD;
+private final byte[] dictionaryPageAAD;

Review Comment:
   probably not needed



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] ggershinsky commented on a diff in pull request #968: PARQUET-2149: Async IO implementation for ParquetFileReader

2022-06-22 Thread GitBox


ggershinsky commented on code in PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#discussion_r903693351


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java:
##
@@ -1796,5 +1882,314 @@ public void readAll(SeekableInputStream f, 
ChunkListBuilder builder) throws IOEx
 public long endPos() {
   return offset + length;
 }
+
+@Override
+public String toString() {
+  return "ConsecutivePartList{" +
+"offset=" + offset +
+", length=" + length +
+", chunks=" + chunks +
+'}';
+}
   }
+
+  /**
+   * Encapsulates the reading of a single page.
+   */
+  public class PageReader implements Closeable {
+private final Chunk chunk;
+private final int currentBlock;
+private final BlockCipher.Decryptor headerBlockDecryptor;
+private final BlockCipher.Decryptor pageBlockDecryptor;
+private final byte[] aadPrefix;
+private final int rowGroupOrdinal;
+private final int columnOrdinal;
+
+//state
+private final LinkedBlockingDeque<Optional<DataPage>> pagesInChunk = new LinkedBlockingDeque<>();
+private DictionaryPage dictionaryPage = null;
+private int pageIndex = 0;
+private long valuesCountReadSoFar = 0;
+private int dataPageCountReadSoFar = 0;
+
+// derived
+private final PrimitiveType type;
+private final byte[] dataPageAAD;
+private final byte[] dictionaryPageAAD;
+private byte[] dataPageHeaderAAD = null;
+
+private final BytesInputDecompressor decompressor;
+
+private final ConcurrentLinkedQueue<Future<Void>> readFutures = new ConcurrentLinkedQueue<>();
+
+private final LongAdder totalTimeReadOnePage = new LongAdder();
+private final LongAdder totalCountReadOnePage = new LongAdder();
+private final LongAccumulator maxTimeReadOnePage = new 
LongAccumulator(Long::max, 0L);
+private final LongAdder totalTimeBlockedPagesInChunk = new LongAdder();
+private final LongAdder totalCountBlockedPagesInChunk = new LongAdder();
+private final LongAccumulator maxTimeBlockedPagesInChunk = new 
LongAccumulator(Long::max, 0L);
+
+public PageReader(Chunk chunk, int currentBlock, Decryptor 
headerBlockDecryptor,
+  Decryptor pageBlockDecryptor, byte[] aadPrefix, int rowGroupOrdinal, int 
columnOrdinal,
+  BytesInputDecompressor decompressor
+  ) {
+  this.chunk = chunk;
+  this.currentBlock = currentBlock;
+  this.headerBlockDecryptor = headerBlockDecryptor;
+  this.pageBlockDecryptor = pageBlockDecryptor;
+  this.aadPrefix = aadPrefix;
+  this.rowGroupOrdinal = rowGroupOrdinal;
+  this.columnOrdinal = columnOrdinal;
+  this.decompressor = decompressor;
+
+  this.type = getFileMetaData().getSchema()
+.getType(chunk.descriptor.col.getPath()).asPrimitiveType();
+
+  if (null != headerBlockDecryptor) {
+dataPageHeaderAAD = AesCipher.createModuleAAD(aadPrefix, 
ModuleType.DataPageHeader,
+  rowGroupOrdinal,
+  columnOrdinal, chunk.getPageOrdinal(dataPageCountReadSoFar));
+  }
+  if (null != pageBlockDecryptor) {
+dataPageAAD = AesCipher.createModuleAAD(aadPrefix, 
ModuleType.DataPage, rowGroupOrdinal,
+  columnOrdinal, 0);
+dictionaryPageAAD = AesCipher.createModuleAAD(aadPrefix, 
ModuleType.DictionaryPage,

Review Comment:
   Yep, the `dictionaryPageAAD` is not necessary here. This is a significant 
code change, more than just moving the current logic of
   ```java
   public ColumnChunkPageReader readAllPages(BlockCipher.Decryptor 
headerBlockDecryptor, BlockCipher.Decryptor pageBlockDecryptor, byte[] 
aadPrefix, int rowGroupOrdinal, int columnOrdinal)
   ```
   
   I'll have a closer look at the details, but we need a unit test (proposed in 
my other comment) to make sure decryption works ok with async io and parallel 
column reading.
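
   For illustration, a minimal sketch of such a test, assuming hypothetical 
async-IO property names (the real switches are defined in the PR) and the usual 
decryption setup from the existing encryption tests:
   
   ```java
   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.fs.Path;
   import org.apache.parquet.HadoopReadOptions;
   import org.apache.parquet.ParquetReadOptions;
   import org.apache.parquet.column.page.PageReadStore;
   import org.apache.parquet.crypto.FileDecryptionProperties;
   import org.apache.parquet.hadoop.ParquetFileReader;
   import org.apache.parquet.hadoop.util.HadoopInputFile;
   
   public class AsyncDecryptionSketch {
     static long countRows(Path file, FileDecryptionProperties decryption) throws Exception {
       Configuration conf = new Configuration();
       // Hypothetical property names, for illustration only:
       conf.setBoolean("parquet.read.async.io.enabled", true);
       conf.setBoolean("parquet.read.parallel.columns", true);
   
       ParquetReadOptions options = HadoopReadOptions.builder(conf)
         .withDecryption(decryption)
         .build();
   
       long rows = 0;
       try (ParquetFileReader reader =
           new ParquetFileReader(HadoopInputFile.fromPath(file, conf), options)) {
         PageReadStore rowGroup;
         while ((rowGroup = reader.readNextRowGroup()) != null) {
           // A real test would also decode the columns so data pages are
           // actually decrypted, then compare the values against expectations.
           rows += rowGroup.getRowCount();
         }
       }
       return rows;
     }
   }
   ```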



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] ggershinsky commented on a diff in pull request #968: PARQUET-2149: Async IO implementation for ParquetFileReader

2022-06-22 Thread GitBox


ggershinsky commented on code in PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#discussion_r903595526


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java:
##
@@ -1796,5 +1882,314 @@ public void readAll(SeekableInputStream f, 
ChunkListBuilder builder) throws IOEx
 public long endPos() {
   return offset + length;
 }
+
+@Override
+public String toString() {
+  return "ConsecutivePartList{" +
+"offset=" + offset +
+", length=" + length +
+", chunks=" + chunks +
+'}';
+}
   }
+
+  /**
+   * Encapsulates the reading of a single page.
+   */
+  public class PageReader implements Closeable {

Review Comment:
   Maybe this can also be separated from ParquetFileReader; it is a chance to 
reduce the size of the latter :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] ggershinsky commented on a diff in pull request #968: PARQUET-2149: Async IO implementation for ParquetFileReader

2022-06-22 Thread GitBox


ggershinsky commented on code in PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#discussion_r903592957


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java:
##
@@ -126,6 +127,42 @@ public class ParquetFileReader implements Closeable {
 
   public static String PARQUET_READ_PARALLELISM = 
"parquet.metadata.read.parallelism";
 
+  public static int numProcessors = Runtime.getRuntime().availableProcessors();
+
+  // Thread pool to read column chunk data from disk. Applications should call 
setAsyncIOThreadPool
+  // to initialize this with their own implementations.
+  // Default initialization is useful only for testing
+  public static ExecutorService ioThreadPool = Executors.newCachedThreadPool(
+r -> new Thread(r, "parquet-io"));
+
+  // Thread pool to process pages for multiple columns in parallel. 
Applications should call
+  // setAsyncProcessThreadPool to initialize this with their own 
implementations.
+  // Default initialization is useful only for testing
+  public static ExecutorService processThreadPool = 
Executors.newCachedThreadPool(

Review Comment:
   Given the comment "Default initialization is useful only for testing", maybe 
this can be moved to the tests?
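
   One way to do that, sketched below (not the PR's code): create the pool 
lazily on first use, so no threads are allocated while async IO stays disabled.
   
   ```java
   import java.util.concurrent.ExecutorService;
   import java.util.concurrent.Executors;
   
   final class LazyIoPools {
     private static volatile ExecutorService ioThreadPool;
   
     // Double-checked locking: the pool (and its threads) only comes into
     // existence the first time async IO actually asks for it.
     static ExecutorService ioPool() {
       ExecutorService pool = ioThreadPool;
       if (pool == null) {
         synchronized (LazyIoPools.class) {
           pool = ioThreadPool;
           if (pool == null) {
             ioThreadPool = pool = Executors.newCachedThreadPool(
               r -> new Thread(r, "parquet-io"));
           }
         }
       }
       return pool;
     }
   }
   ```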



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] ggershinsky commented on a diff in pull request #968: PARQUET-2149: Async IO implementation for ParquetFileReader

2022-06-22 Thread GitBox


ggershinsky commented on code in PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#discussion_r903469988


##
parquet-hadoop/src/main/java/org/apache/parquet/crypto/InternalFileDecryptor.java:
##
@@ -61,10 +61,7 @@ public InternalFileDecryptor(FileDecryptionProperties 
fileDecryptionProperties)
 
   private BlockCipher.Decryptor getThriftModuleDecryptor(byte[] columnKey) {
 if (null == columnKey) { // Decryptor with footer key
-  if (null == aesGcmDecryptorWithFooterKey) {
-aesGcmDecryptorWithFooterKey = 
ModuleCipherFactory.getDecryptor(AesMode.GCM, footerKey);
-  }
-  return aesGcmDecryptorWithFooterKey;
+  return ModuleCipherFactory.getDecryptor(AesMode.GCM, footerKey);

Review Comment:
   Could you add a unit test of decryption with async io and a parallel column 
reader, e.g. to 
https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/test/java/org/apache/parquet/crypto/TestPropertiesDrivenEncryption.java#L507



##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java:
##
@@ -126,6 +127,42 @@ public class ParquetFileReader implements Closeable {
 
   public static String PARQUET_READ_PARALLELISM = 
"parquet.metadata.read.parallelism";
 
+  public static int numProcessors = Runtime.getRuntime().availableProcessors();
+
+  // Thread pool to read column chunk data from disk. Applications should call 
setAsyncIOThreadPool
+  // to initialize this with their own implementations.
+  // Default initialization is useful only for testing
+  public static ExecutorService ioThreadPool = Executors.newCachedThreadPool(
+r -> new Thread(r, "parquet-io"));
+
+  // Thread pool to process pages for multiple columns in parallel. 
Applications should call
+  // setAsyncProcessThreadPool to initialize this with their own 
implementations.
+  // Default initialization is useful only for testing
+  public static ExecutorService processThreadPool = 
Executors.newCachedThreadPool(

Review Comment:
   Should we be creating thread pools if the Async IO and parallel column 
reading are not activated? 
   (here and on line 135)



##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java:
##
@@ -1796,5 +1882,314 @@ public void readAll(SeekableInputStream f, 
ChunkListBuilder builder) throws IOEx
 public long endPos() {
   return offset + length;
 }
+
+@Override
+public String toString() {
+  return "ConsecutivePartList{" +
+"offset=" + offset +
+", length=" + length +
+", chunks=" + chunks +
+'}';
+}
   }
+
+  /**
+   * Encapsulates the reading of a single page.
+   */
+  public class PageReader implements Closeable {

Review Comment:
   We already have a PageReader (interface). Could you rename this class?



##
parquet-common/src/main/java/org/apache/parquet/bytes/AsyncMultiBufferInputStream.java:
##
@@ -0,0 +1,158 @@
+/*
+ *  Licensed to the Apache Software Foundation (ASF) under one
+ *  or more contributor license agreements.  See the NOTICE file
+ *  distributed with this work for additional information
+ *  regarding copyright ownership.  The ASF licenses this file
+ *  to you under the Apache License, Version 2.0 (the
+ *  "License"); you may not use this file except in compliance
+ *  with the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ *  Unless required by applicable law or agreed to in writing,
+ *  software distributed under the License is distributed on an
+ *  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ *  KIND, either express or implied.  See the License for the
+ *  specific language governing permissions and limitations
+ *  under the License.
+ */
+
+package org.apache.parquet.bytes;
+
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.List;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Future;
+import java.util.concurrent.LinkedBlockingQueue;
+import java.util.concurrent.TimeUnit;
+import java.util.concurrent.atomic.LongAccumulator;
+import java.util.concurrent.atomic.LongAdder;
+import org.apache.parquet.io.SeekableInputStream;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+class AsyncMultiBufferInputStream extends MultiBufferInputStream {
+
+  private static final Logger LOG = 
LoggerFactory.getLogger(AsyncMultiBufferInputStream.class);
+
+  private int fetchIndex = 0;
+  private final SeekableInputStream fileInputStream;
+  private int readIndex = 0;
+  private ExecutorService threadPool;
+  private LinkedBlockingQueue<Future<Void>> readFutures;
+  private boolean closed = false;
+
+  private LongAdder totalTimeBlocked = new LongAdder();
+  private LongAdder totalCountBlocked = new LongAdder();
+  private LongAccumulator maxTimeBlocked = new LongAccumulator(Long::max, 0L);
+
+  AsyncMultiBufferInputStream(ExecutorService threadPool, 

[GitHub] [parquet-mr] ggershinsky commented on a diff in pull request #968: PARQUET-2149: Async IO implementation for ParquetFileReader

2022-06-22 Thread GitBox


ggershinsky commented on code in PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#discussion_r903403821


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java:
##
@@ -1796,5 +1882,314 @@ public void readAll(SeekableInputStream f, 
ChunkListBuilder builder) throws IOEx
 public long endPos() {
   return offset + length;
 }
+
+@Override
+public String toString() {
+  return "ConsecutivePartList{" +
+"offset=" + offset +
+", length=" + length +
+", chunks=" + chunks +
+'}';
+}
   }
+
+  /**
+   * Encapsulates the reading of a single page.
+   */
+  public class PageReader implements Closeable {
+private final Chunk chunk;
+private final int currentBlock;
+private final BlockCipher.Decryptor headerBlockDecryptor;
+private final BlockCipher.Decryptor pageBlockDecryptor;
+private final byte[] aadPrefix;
+private final int rowGroupOrdinal;
+private final int columnOrdinal;
+
+//state
+private final LinkedBlockingDeque<Optional<DataPage>> pagesInChunk = new LinkedBlockingDeque<>();
+private DictionaryPage dictionaryPage = null;
+private int pageIndex = 0;
+private long valuesCountReadSoFar = 0;
+private int dataPageCountReadSoFar = 0;
+
+// derived
+private final PrimitiveType type;
+private final byte[] dataPageAAD;
+private final byte[] dictionaryPageAAD;
+private byte[] dataPageHeaderAAD = null;
+
+private final BytesInputDecompressor decompressor;
+
+private final ConcurrentLinkedQueue<Future<Void>> readFutures = new ConcurrentLinkedQueue<>();
+
+private final LongAdder totalTimeReadOnePage = new LongAdder();
+private final LongAdder totalCountReadOnePage = new LongAdder();
+private final LongAccumulator maxTimeReadOnePage = new 
LongAccumulator(Long::max, 0L);
+private final LongAdder totalTimeBlockedPagesInChunk = new LongAdder();
+private final LongAdder totalCountBlockedPagesInChunk = new LongAdder();
+private final LongAccumulator maxTimeBlockedPagesInChunk = new 
LongAccumulator(Long::max, 0L);
+
+public PageReader(Chunk chunk, int currentBlock, Decryptor 
headerBlockDecryptor,
+  Decryptor pageBlockDecryptor, byte[] aadPrefix, int rowGroupOrdinal, int 
columnOrdinal,
+  BytesInputDecompressor decompressor
+  ) {
+  this.chunk = chunk;
+  this.currentBlock = currentBlock;
+  this.headerBlockDecryptor = headerBlockDecryptor;
+  this.pageBlockDecryptor = pageBlockDecryptor;
+  this.aadPrefix = aadPrefix;
+  this.rowGroupOrdinal = rowGroupOrdinal;
+  this.columnOrdinal = columnOrdinal;
+  this.decompressor = decompressor;
+
+  this.type = getFileMetaData().getSchema()
+.getType(chunk.descriptor.col.getPath()).asPrimitiveType();
+
+  if (null != headerBlockDecryptor) {
+dataPageHeaderAAD = AesCipher.createModuleAAD(aadPrefix, 
ModuleType.DataPageHeader,
+  rowGroupOrdinal,
+  columnOrdinal, chunk.getPageOrdinal(dataPageCountReadSoFar));
+  }
+  if (null != pageBlockDecryptor) {
+dataPageAAD = AesCipher.createModuleAAD(aadPrefix, 
ModuleType.DataPage, rowGroupOrdinal,
+  columnOrdinal, 0);
+dictionaryPageAAD = AesCipher.createModuleAAD(aadPrefix, 
ModuleType.DictionaryPage,

Review Comment:
   sure



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] ggershinsky commented on a diff in pull request #968: PARQUET-2149: Async IO implementation for ParquetFileReader

2022-06-22 Thread GitBox


ggershinsky commented on code in PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#discussion_r90337


##
parquet-common/src/main/java/org/apache/parquet/bytes/AsyncMultiBufferInputStream.java:
##
@@ -0,0 +1,158 @@
+/*
+ *  Licensed to the Apache Software Foundation (ASF) under one
+ *  or more contributor license agreements.  See the NOTICE file
+ *  distributed with this work for additional information
+ *  regarding copyright ownership.  The ASF licenses this file
+ *  to you under the Apache License, Version 2.0 (the
+ *  "License"); you may not use this file except in compliance
+ *  with the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ *  Unless required by applicable law or agreed to in writing,
+ *  software distributed under the License is distributed on an
+ *  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ *  KIND, either express or implied.  See the License for the
+ *  specific language governing permissions and limitations
+ *  under the License.
+ */
+
+package org.apache.parquet.bytes;
+
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.List;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Future;
+import java.util.concurrent.LinkedBlockingQueue;
+import java.util.concurrent.TimeUnit;
+import java.util.concurrent.atomic.LongAccumulator;
+import java.util.concurrent.atomic.LongAdder;
+import org.apache.parquet.io.SeekableInputStream;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+class AsyncMultiBufferInputStream extends MultiBufferInputStream {
+
+  private static final Logger LOG = 
LoggerFactory.getLogger(AsyncMultiBufferInputStream.class);
+
+  private int fetchIndex = 0;
+  private final SeekableInputStream fileInputStream;
+  private int readIndex = 0;
+  private ExecutorService threadPool;
+  private LinkedBlockingQueue<Future<Void>> readFutures;
+  private boolean closed = false;
+
+  private LongAdder totalTimeBlocked = new LongAdder();
+  private LongAdder totalCountBlocked = new LongAdder();
+  private LongAccumulator maxTimeBlocked = new LongAccumulator(Long::max, 0L);
+
+  AsyncMultiBufferInputStream(ExecutorService threadPool, SeekableInputStream 
fileInputStream,
+List<ByteBuffer> buffers) {
+super(buffers);
+this.fileInputStream = fileInputStream;
+this.threadPool = threadPool;
+readFutures = new LinkedBlockingQueue<>(buffers.size());
+if (LOG.isDebugEnabled()) {
+  LOG.debug("ASYNC: Begin read into buffers ");
+  for (ByteBuffer buf : buffers) {
+LOG.debug("ASYNC: buffer {} ", buf);
+  }
+}
+fetchAll();
+  }
+
+  private void checkState() {
+if (closed) {
+  throw new RuntimeException("Stream is closed");
+}
+  }
+
+  private void fetchAll() {
+checkState();
+submitReadTask(0);
+  }
+
+  private void submitReadTask(int bufferNo) {
+ByteBuffer buffer = buffers.get(bufferNo);
+try {
+  readFutures.put(threadPool.submit(() -> {
+  readOneBuffer(buffer);
+  if (bufferNo < buffers.size() - 1) {
+submitReadTask(bufferNo + 1);
+  }
+  return null;
+})
+  );
+} catch (InterruptedException e) {
+  Thread.currentThread().interrupt();
+  throw new RuntimeException(e);
+}
+  }
+
+  private void readOneBuffer(ByteBuffer buffer) {
+long startTime = System.nanoTime();
+try {
+  fileInputStream.readFully(buffer);
+  buffer.flip();
+  long readCompleted = System.nanoTime();
+  long timeSpent = readCompleted - startTime;
+  LOG.debug("ASYNC Stream: READ - {}", timeSpent / 1000.0);
+  fetchIndex++;
+} catch (IOException e) {
+  throw new RuntimeException(e);
+}
+  }
+
+  @Override
+  public boolean nextBuffer() {
+checkState();
+// hack: parent constructor can call this method before this class is 
fully initialized.
+// Just return without doing anything.
+if (readFutures == null) {
+  return false;
+}
+if (readIndex < buffers.size()) {
+  long start = System.nanoTime();
+  try {
+LOG.debug("ASYNC (next): Getting next buffer");
+Future<Void> future = readFutures.take();
+future.get();
+long timeSpent = System.nanoTime() - start;
+totalCountBlocked.add(1);
+totalTimeBlocked.add(timeSpent);
+maxTimeBlocked.accumulate(timeSpent);
+LOG.debug("ASYNC (next): {}: Time blocked for read {} ns", this, 
timeSpent);

Review Comment:
   Should `if (LOG.isDebugEnabled()) {` be added here and at line 118? This check 
is performed in the constructor (line 58); `nextBuffer()` is called with the 
same or higher frequency.
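
   For reference, the suggested guard is just this; with SLF4J the string 
formatting is already deferred, so the guard mainly removes the residual 
per-call overhead (autoboxing, the varargs array) on the hot path:
   
   ```java
   long timeSpent = System.nanoTime() - start;
   if (LOG.isDebugEnabled()) {
     // Only pay for the debug call when debug logging is actually on.
     LOG.debug("ASYNC (next): {}: Time blocked for read {} ns", this, timeSpent);
   }
   ```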



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, 

[GitHub] [parquet-mr] parthchandra commented on pull request #968: PARQUET-2149: Async IO implementation for ParquetFileReader

2022-06-20 Thread GitBox


parthchandra commented on PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1160693399

   @shangxinli Thank you for the review! I'll address these comments asap.
   I am reviewing the thread pool and its initialization. IMO, it is better if 
there is no default initialization of the pool and the calling 
application/framework does so explicitly. One side effect of the default 
initialization is that the pool is created unnecessarily even if async is off. 
Also, if an application shades and includes another copy of the library (or, 
transitively, many more), then one more thread pool gets created for every 
version of the library included. 
   It is probably a better idea to allow the thread pool to be assigned as a 
per-instance variable. The calling application can then decide to use a single 
pool for all instances or a new one per instance, whichever works better 
for its performance.
   Finally, some large-scale testing has revealed a possible resource leak. I'm 
looking into addressing it. 
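
   A sketch of the per-instance idea (the builder method is hypothetical, not 
the PR's API): the caller owns the pool and decides whether to share it across 
readers or create one per reader.
   
   ```java
   import java.util.concurrent.ExecutorService;
   import java.util.concurrent.Executors;
   import org.apache.parquet.ParquetReadOptions;
   
   class PerInstancePoolSketch {
     static ParquetReadOptions optionsWithSharedPool() {
       // One pool the application owns and can share across all its readers...
       ExecutorService sharedIoPool = Executors.newFixedThreadPool(8,
         r -> new Thread(r, "parquet-io"));
       // ...injected per reader instance rather than held in a static field.
       return ParquetReadOptions.builder()
         .withIoThreadPool(sharedIoPool)  // hypothetical builder method
         .build();
     }
   }
   ```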


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] steveloughran commented on a diff in pull request #968: PARQUET-2149: Async IO implementation for ParquetFileReader

2022-06-20 Thread GitBox


steveloughran commented on code in PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#discussion_r901859862


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java:
##
@@ -126,6 +127,42 @@ public class ParquetFileReader implements Closeable {
 
   public static String PARQUET_READ_PARALLELISM = 
"parquet.metadata.read.parallelism";
 
+  public static int numProcessors = Runtime.getRuntime().availableProcessors();

Review Comment:
   Dynamically choosing the number of threads/buffer sizes/cache sizes from the 
host is a recurrent source of pain in past work: once you get to 128-core 
systems, they often end up asking for too much of a limited resource.
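
   A common mitigation, as a sketch (assuming a hard ceiling is acceptable): 
derive the default from the core count but cap it.
   
   ```java
   import java.util.concurrent.ExecutorService;
   import java.util.concurrent.Executors;
   
   class CappedPoolSketch {
     // Derive the size from the host, but cap it so a 128-core machine does
     // not demand 128 IO threads (the ceiling of 16 is an arbitrary example).
     static ExecutorService newIoPool() {
       int numProcessors = Runtime.getRuntime().availableProcessors();
       return Executors.newFixedThreadPool(Math.min(numProcessors, 16));
     }
   }
   ```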



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] steveloughran commented on a diff in pull request #951: PARQUET-2134: Fix type checking in HadoopStreams.wrap

2022-06-20 Thread GitBox


steveloughran commented on code in PR #951:
URL: https://github.com/apache/parquet-mr/pull/951#discussion_r901856428


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java:
##
@@ -50,51 +46,45 @@ public class HadoopStreams {
*/
   public static SeekableInputStream wrap(FSDataInputStream stream) {
 Objects.requireNonNull(stream, "Cannot wrap a null input stream");
-if (byteBufferReadableClass != null && h2SeekableConstructor != null &&
-byteBufferReadableClass.isInstance(stream.getWrappedStream())) {
-  try {
-return h2SeekableConstructor.newInstance(stream);
-  } catch (InstantiationException | IllegalAccessException e) {
-LOG.warn("Could not instantiate H2SeekableInputStream, falling back to 
byte array reads", e);
-return new H1SeekableInputStream(stream);
-  } catch (InvocationTargetException e) {
-throw new ParquetDecodingException(
-"Could not instantiate H2SeekableInputStream", 
e.getTargetException());
-  }
+if (isWrappedStreamByteBufferReadable(stream)) {
+  return new H2SeekableInputStream(stream);
 } else {
   return new H1SeekableInputStream(stream);
 }
   }
 
-  private static Class<?> getReadableClass() {
-try {
-  return Class.forName("org.apache.hadoop.fs.ByteBufferReadable");
-} catch (ClassNotFoundException | NoClassDefFoundError e) {
-  return null;
+  /**
+   * Is the inner stream byte buffer readable?
+   * The test is "the stream is not FSDataInputStream
+   * and implements ByteBufferReadable".
+   *
+   * That is: all streams which implement ByteBufferReadable
+   * other than FSDataInputStream successfully support read(ByteBuffer).
+   * This is true for all filesystem clients in the hadoop codebase.
+   *
+   * In hadoop 3.3.0+, the StreamCapabilities probe can be used to
+   * check this: only those streams which provide the read(ByteBuffer)
+   * semantics MAY return true for the probe "in:readbytebuffer";
+   * FSDataInputStream will pass the probe down to the underlying stream.
+   *
+   * @param stream stream to probe
+   * @return true if it is safe to use an H2SeekableInputStream to access the data
+   */
+  private static boolean isWrappedStreamByteBufferReadable(FSDataInputStream 
stream) {
+if (stream.hasCapability("in:readbytebuffer")) {

Review Comment:
   No, the StreamCapabilities probe has been around since hadoop 2. It is just 
that in 3.3.0 all streams which implement the API return true for this probe...a 
probe which gets passed down to the wrapped streams. It avoids looking at the 
wrapped streams, as you should be able to trust the response (put differently: 
if something lied, it is in trouble)
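
   Putting the two together, the selection logic described above amounts to 
something like this sketch: trust the probe when it answers true, and only 
fall back to inspecting the wrapped stream on older releases.
   
   ```java
   import org.apache.hadoop.fs.ByteBufferReadable;
   import org.apache.hadoop.fs.FSDataInputStream;
   
   class ByteBufferReadableProbe {
     static boolean isWrappedStreamByteBufferReadable(FSDataInputStream stream) {
       if (stream.hasCapability("in:readbytebuffer")) {
         // Hadoop 3.3.0+: the stream declares read(ByteBuffer) support, and
         // FSDataInputStream passes the probe down to the wrapped stream.
         return true;
       }
       // Older releases: inspect the wrapped stream directly.
       return stream.getWrappedStream() instanceof ByteBufferReadable;
     }
   }
   ```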



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] steveloughran commented on pull request #970: PARQUET-2150: parquet-protobuf to compile on Mac M1

2022-06-20 Thread GitBox


steveloughran commented on PR #970:
URL: https://github.com/apache/parquet-mr/pull/970#issuecomment-1160638077

   This patch is based on Dongjoon's one for hadoop; it tells maven to use the 
x86 artifact on MacBook M1 builds.
   
   The sunchao one switches to a version of protobuf with genuine Mac M1 
artifacts, a version which should also include some CVE fixes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] theosib-amazon commented on a diff in pull request #957: PARQUET-2069: Allow list and array record types to be compatible.

2022-06-20 Thread GitBox


theosib-amazon commented on code in PR #957:
URL: https://github.com/apache/parquet-mr/pull/957#discussion_r901748898


##
parquet-avro/src/main/java/org/apache/parquet/avro/AvroReadSupport.java:
##
@@ -136,10 +137,22 @@ public RecordMaterializer prepareForRead(
 
 GenericData model = getDataModel(configuration);
 String compatEnabled = metadata.get(AvroReadSupport.AVRO_COMPATIBILITY);
-if (compatEnabled != null && Boolean.valueOf(compatEnabled)) {
-  return newCompatMaterializer(parquetSchema, avroSchema, model);
+
+try {
+  if (compatEnabled != null && Boolean.valueOf(compatEnabled)) {
+return newCompatMaterializer(parquetSchema, avroSchema, model);
+  }
+  return new AvroRecordMaterializer(parquetSchema, avroSchema, model);
+} catch (InvalidRecordException | ClassCastException e) {
+  System.err.println("Warning, Avro schema doesn't match Parquet schema, 
falling back to conversion: " + e.toString());

Review Comment:
   Oversight on my part.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] theosib-amazon commented on a diff in pull request #957: PARQUET-2069: Allow list and array record types to be compatible.

2022-06-20 Thread GitBox


theosib-amazon commented on code in PR #957:
URL: https://github.com/apache/parquet-mr/pull/957#discussion_r901740673


##
parquet-avro/src/test/java/org/apache/parquet/avro/TestArrayListCompatibility.java:
##
@@ -0,0 +1,51 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.avro;
+
+import com.google.common.io.Resources;
+import org.apache.avro.generic.GenericData;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.parquet.hadoop.ParquetReader;
+import org.junit.Test;
+import java.io.IOException;
+
+public class TestArrayListCompatibility {
+
+  @Test
+  public void testListArrayCompatibility() throws IOException {
+Path testPath = new 
Path(Resources.getResource("list-array-compat.parquet").getFile());
+
+Configuration conf = new Configuration();
+ParquetReader parquetReader =
+  AvroParquetReader.builder(testPath).withConf(conf).build();
+GenericData.Record firstRecord;
+try {
+  firstRecord = (GenericData.Record) parquetReader.read();
+} catch (Exception x) {
+  x.printStackTrace();

Review Comment:
   Ok, I got rid of the extra catch. I'm not sure what kind of exceptions 
parquetReader.read() can throw, though, so we'll see if we get a compile error 
from not specifying it in the function signature. :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] theosib-amazon commented on a diff in pull request #957: PARQUET-2069: Allow list and array record types to be compatible.

2022-06-20 Thread GitBox


theosib-amazon commented on code in PR #957:
URL: https://github.com/apache/parquet-mr/pull/957#discussion_r901733632


##
parquet-avro/src/main/java/org/apache/parquet/avro/AvroReadSupport.java:
##
@@ -136,10 +137,22 @@ public RecordMaterializer prepareForRead(
 
 GenericData model = getDataModel(configuration);
 String compatEnabled = metadata.get(AvroReadSupport.AVRO_COMPATIBILITY);
-if (compatEnabled != null && Boolean.valueOf(compatEnabled)) {
-  return newCompatMaterializer(parquetSchema, avroSchema, model);
+
+try {
+  if (compatEnabled != null && Boolean.valueOf(compatEnabled)) {
+return newCompatMaterializer(parquetSchema, avroSchema, model);
+  }
+  return new AvroRecordMaterializer(parquetSchema, avroSchema, model);
+} catch (InvalidRecordException | ClassCastException e) {

Review Comment:
   I think the underlying problem is that some versions of ParquetMR produce 
*bad schemas*, so when we try to load those same files, parsing fails, since 
the Parquet schema implicit in the file metadata doesn't match up with the 
stored Avro schema. I'm not sure what to do about bad schemas other than to 
throw them away and try a fallback.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] ala opened a new pull request, #978: PARQUET-2161: Fix row index generation in combination with range filtering

2022-06-20 Thread GitBox


ala opened a new pull request, #978:
URL: https://github.com/apache/parquet-mr/pull/978

   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [x] My PR addresses the following [Parquet 
Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references 
them in the PR title. For example, "PARQUET-1234: My Parquet PR"
 - https://issues.apache.org/jira/browse/PARQUET-2161
 - In case you are adding a dependency, check if the license complies with 
the [ASF 3rd Party License 
Policy](https://www.apache.org/legal/resolved.html#category-x).
   
   ### Tests
   
   - [x] My PR adds the following unit tests __OR__ does not need testing for 
this extremely good reason:
 - Extends `TestParquetReader` suite. 
   
   ### Commits
   
   - [x] My commits all reference Jira issues in their subject lines. In 
addition, my commits follow the guidelines from "[How to write a good git 
commit message](http://chris.beams.io/posts/git-commit/)":
 1. Subject is separated from body by a blank line
 1. Subject is limited to 50 characters (not including Jira issue reference)
 1. Subject does not end with a period
 1. Subject uses the imperative mood ("add", not "adding")
 1. Body wraps at 72 characters
 1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [x] In case of new functionality, my PR adds documentation that describes 
how to use it.
 - All the public functions and the classes in the PR contain Javadoc that 
explain what it does
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] sunchao commented on pull request #970: PARQUET-2150: parquet-protobuf to compile on Mac M1

2022-06-19 Thread GitBox


sunchao commented on PR #970:
URL: https://github.com/apache/parquet-mr/pull/970#issuecomment-1159806290

   @shangxinli my approach is different from @steveloughran's. Since newer 
versions of protobuf already provide M1 artifacts, upgrading will solve the issue


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] shangxinli commented on a diff in pull request #968: PARQUET-2149: Async IO implementation for ParquetFileReader

2022-06-19 Thread GitBox


shangxinli commented on code in PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#discussion_r900994027


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java:
##
@@ -46,12 +46,11 @@
 import java.util.List;
 import java.util.Map;
 import java.util.Map.Entry;
+import java.util.Optional;
 import java.util.Set;
-import java.util.concurrent.Callable;
-import java.util.concurrent.ExecutionException;
-import java.util.concurrent.ExecutorService;
-import java.util.concurrent.Executors;
-import java.util.concurrent.Future;
+import java.util.concurrent.*;

Review Comment:
   I guess the IDE does that, but let's not use a wildcard here 



##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java:
##
@@ -126,6 +127,42 @@ public class ParquetFileReader implements Closeable {
 
   public static String PARQUET_READ_PARALLELISM = 
"parquet.metadata.read.parallelism";
 
+  public static int numProcessors = Runtime.getRuntime().availableProcessors();
+
+  // Thread pool to read column chunk data from disk. Applications should call 
setAsyncIOThreadPool
+  // to initialize this with their own implementations.
+  // Default initialization is useful only for testing

Review Comment:
   I understand we want applications to provide their own implementations, but 
can you share why we choose the cached thread pool instead of fixed in default? 
I kind of feel a lot of user scenarios of Parquet is with unpredictable 
execution times and we need better control over our program's resource 
consumption. 
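
   For context, the trade-off behind the question in miniature: a cached pool 
grows without bound under bursty load, while a fixed pool caps the thread count 
and queues excess work.
   
   ```java
   import java.util.concurrent.ExecutorService;
   import java.util.concurrent.Executors;
   
   class PoolChoiceSketch {
     // Cached: creates threads on demand, reuses idle ones, has no upper
     // bound; good for short unpredictable tasks, risky for resource control.
     static ExecutorService cached = Executors.newCachedThreadPool();
   
     // Fixed: at most N threads; extra tasks wait in an unbounded queue,
     // which bounds thread consumption at the cost of queueing delay.
     static ExecutorService fixed = Executors.newFixedThreadPool(
       Runtime.getRuntime().availableProcessors());
   }
   ```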



##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java:
##
@@ -1387,8 +1489,13 @@ public void close() throws IOException {
* result of the column-index based filtering when some pages might be 
skipped at reading.
*/
   private class ChunkListBuilder {
+// ChunkData is backed by either a list of buffers or a list of strams

Review Comment:
   typo? streams? 



##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java:
##
@@ -1796,5 +1882,314 @@ public void readAll(SeekableInputStream f, 
ChunkListBuilder builder) throws IOEx
 public long endPos() {
   return offset + length;
 }
+
+@Override
+public String toString() {
+  return "ConsecutivePartList{" +
+"offset=" + offset +
+", length=" + length +
+", chunks=" + chunks +
+'}';
+}
   }
+
+  /**
+   * Encapsulates the reading of a single page.
+   */
+  public class PageReader implements Closeable {
+private final Chunk chunk;
+private final int currentBlock;
+private final BlockCipher.Decryptor headerBlockDecryptor;
+private final BlockCipher.Decryptor pageBlockDecryptor;
+private final byte[] aadPrefix;
+private final int rowGroupOrdinal;
+private final int columnOrdinal;
+
+//state
+private final LinkedBlockingDeque<Optional<DataPage>> pagesInChunk = new LinkedBlockingDeque<>();
+private DictionaryPage dictionaryPage = null;
+private int pageIndex = 0;
+private long valuesCountReadSoFar = 0;
+private int dataPageCountReadSoFar = 0;
+
+// derived
+private final PrimitiveType type;
+private final byte[] dataPageAAD;
+private final byte[] dictionaryPageAAD;
+private byte[] dataPageHeaderAAD = null;
+
+private final BytesInputDecompressor decompressor;
+
+private final ConcurrentLinkedQueue<Future<Void>> readFutures = new ConcurrentLinkedQueue<>();
+
+private final LongAdder totalTimeReadOnePage = new LongAdder();
+private final LongAdder totalCountReadOnePage = new LongAdder();
+private final LongAccumulator maxTimeReadOnePage = new 
LongAccumulator(Long::max, 0L);
+private final LongAdder totalTimeBlockedPagesInChunk = new LongAdder();
+private final LongAdder totalCountBlockedPagesInChunk = new LongAdder();
+private final LongAccumulator maxTimeBlockedPagesInChunk = new 
LongAccumulator(Long::max, 0L);
+
+public PageReader(Chunk chunk, int currentBlock, Decryptor 
headerBlockDecryptor,
+  Decryptor pageBlockDecryptor, byte[] aadPrefix, int rowGroupOrdinal, int 
columnOrdinal,
+  BytesInputDecompressor decompressor
+  ) {
+  this.chunk = chunk;
+  this.currentBlock = currentBlock;
+  this.headerBlockDecryptor = headerBlockDecryptor;
+  this.pageBlockDecryptor = pageBlockDecryptor;
+  this.aadPrefix = aadPrefix;
+  this.rowGroupOrdinal = rowGroupOrdinal;
+  this.columnOrdinal = columnOrdinal;
+  this.decompressor = decompressor;
+
+  this.type = getFileMetaData().getSchema()
+.getType(chunk.descriptor.col.getPath()).asPrimitiveType();
+
+  if (null != headerBlockDecryptor) {
+dataPageHeaderAAD = AesCipher.createModuleAAD(aadPrefix, 
ModuleType.DataPageHeader,
+  rowGroupOrdinal,
+  columnOrdinal, chunk.getPageOrdinal(dataPageCountReadSoFar));
+  }
+  if (null != 

[GitHub] [parquet-mr] shangxinli commented on a diff in pull request #968: PARQUET-2149: Async IO implementation for ParquetFileReader

2022-06-18 Thread GitBox


shangxinli commented on code in PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#discussion_r900993899


##
parquet-hadoop/src/main/java/org/apache/parquet/HadoopReadOptions.java:
##
@@ -61,9 +65,10 @@ private HadoopReadOptions(boolean useSignedStringMinMax,
 Configuration conf,
 FileDecryptionProperties fileDecryptionProperties) 
{
 super(
-useSignedStringMinMax, useStatsFilter, useDictionaryFilter, 
useRecordFilter, useColumnIndexFilter,
-usePageChecksumVerification, useBloomFilter, recordFilter, 
metadataFilter, codecFactory, allocator,
-maxAllocationSize, properties, fileDecryptionProperties
+  useSignedStringMinMax, useStatsFilter, useDictionaryFilter, 
useRecordFilter,

Review Comment:
   it seems two spaces were removed. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] huaxingao commented on pull request #975: PARQUET-2157: add bloom filter fpp config

2022-06-17 Thread GitBox


huaxingao commented on PR #975:
URL: https://github.com/apache/parquet-mr/pull/975#issuecomment-1159354846

   Thank you all very much! @chenjunjiedada @dongjoon-hyun @ggershinsky 
@shangxinli 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] shangxinli commented on pull request #970: PARQUET-2150: parquet-protobuf to compile on Mac M1

2022-06-17 Thread GitBox


shangxinli commented on PR #970:
URL: https://github.com/apache/parquet-mr/pull/970#issuecomment-1159351267

   @sunchao I see your change is to upgrade the protobuf version. Is that 
required to solve this problem? I don't see it in this PR. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] sunchao commented on a diff in pull request #976: PARQUET-2158: Upgrade Hadoop dependency to version 3.2.0

2022-06-17 Thread GitBox


sunchao commented on code in PR #976:
URL: https://github.com/apache/parquet-mr/pull/976#discussion_r900689594


##
parquet-thrift/src/main/java/org/apache/parquet/thrift/projection/deprecated/PathGlobPattern.java:
##
@@ -20,8 +20,8 @@
 
 import org.apache.hadoop.fs.GlobPattern;
 
-import java.util.regex.Pattern;
-import java.util.regex.PatternSyntaxException;
+import com.google.re2j.Pattern;

Review Comment:
   I think this may not work for projects like Spark that are using the Hadoop 
shaded client, since `GlobPattern.compiled` is relocated to 
`org.apache.hadoop.shaded.com.google.re2j.Pattern`.
   
   It might be easier to just remove the class, as it has been marked as 
deprecated since Parquet 1.8.0 (2015). It is also not used anywhere in the 
project.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] shangxinli merged pull request #975: PARQUET-2157: add bloom filter fpp config

2022-06-17 Thread GitBox


shangxinli merged PR #975:
URL: https://github.com/apache/parquet-mr/pull/975


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] shangxinli commented on pull request #975: PARQUET-2157: add bloom filter fpp config

2022-06-17 Thread GitBox


shangxinli commented on PR #975:
URL: https://github.com/apache/parquet-mr/pull/975#issuecomment-1159349333

   LGTM


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] shangxinli commented on a diff in pull request #976: PARQUET-2158: Upgrade Hadoop dependency to version 3.2.0

2022-06-17 Thread GitBox


shangxinli commented on code in PR #976:
URL: https://github.com/apache/parquet-mr/pull/976#discussion_r900688843


##
pom.xml:
##
@@ -76,7 +76,7 @@
 2.13.2.2
 0.14.2
 shaded.parquet
-2.10.1
+3.2.0

Review Comment:
   +1 for the question 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] shangxinli commented on pull request #958: PARQUET-2138: Add ShowBloomFilterCommand to parquet-cli

2022-06-17 Thread GitBox


shangxinli commented on PR #958:
URL: https://github.com/apache/parquet-mr/pull/958#issuecomment-1159347960

   @WangGuangxin Do you still plan to implement the decryption? I don't want to 
place a blocker if you don't have a plan for it. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] shangxinli commented on a diff in pull request #957: PARQUET-2069: Allow list and array record types to be compatible.

2022-06-17 Thread GitBox


shangxinli commented on code in PR #957:
URL: https://github.com/apache/parquet-mr/pull/957#discussion_r900688194


##
parquet-avro/src/main/java/org/apache/parquet/avro/AvroReadSupport.java:
##
@@ -136,10 +137,22 @@ public RecordMaterializer prepareForRead(
 
 GenericData model = getDataModel(configuration);
 String compatEnabled = metadata.get(AvroReadSupport.AVRO_COMPATIBILITY);
-if (compatEnabled != null && Boolean.valueOf(compatEnabled)) {
-  return newCompatMaterializer(parquetSchema, avroSchema, model);
+
+try {
+  if (compatEnabled != null && Boolean.valueOf(compatEnabled)) {
+return newCompatMaterializer(parquetSchema, avroSchema, model);
+  }
+  return new AvroRecordMaterializer(parquetSchema, avroSchema, model);
+} catch (InvalidRecordException | ClassCastException e) {

Review Comment:
   I understand the target issue can be solved by this retry with a converted 
schema. But I am not sure it is safe to just ignore the Avro schema in case of 
an exception. @rdblue @wesm Do you have some time to have a look at this? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] shangxinli commented on a diff in pull request #957: PARQUET-2069: Allow list and array record types to be compatible.

2022-06-17 Thread GitBox


shangxinli commented on code in PR #957:
URL: https://github.com/apache/parquet-mr/pull/957#discussion_r900687324


##
parquet-avro/src/main/java/org/apache/parquet/avro/AvroReadSupport.java:
##
@@ -136,10 +137,22 @@ public RecordMaterializer prepareForRead(
 
 GenericData model = getDataModel(configuration);
 String compatEnabled = metadata.get(AvroReadSupport.AVRO_COMPATIBILITY);
-if (compatEnabled != null && Boolean.valueOf(compatEnabled)) {
-  return newCompatMaterializer(parquetSchema, avroSchema, model);
+
+try {
+  if (compatEnabled != null && Boolean.valueOf(compatEnabled)) {
+return newCompatMaterializer(parquetSchema, avroSchema, model);
+  }
+  return new AvroRecordMaterializer(parquetSchema, avroSchema, model);
+} catch (InvalidRecordException | ClassCastException e) {
+  System.err.println("Warning, Avro schema doesn't match Parquet schema, 
falling back to conversion: " + e.toString());

Review Comment:
   Any reason we don't use Log4j? 
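   
   (For reference, parquet-mr generally logs through SLF4J; a minimal sketch of that pattern, with the wrapper class name assumed purely for illustration:)
   
   ```java
   import org.slf4j.Logger;
   import org.slf4j.LoggerFactory;
   
   class AvroFallbackLoggingSketch {
     // parquet-mr modules typically log via SLF4J rather than System.err.
     private static final Logger LOG = LoggerFactory.getLogger(AvroFallbackLoggingSketch.class);
   
     void warnFallback(Exception e) {
       // Passing the exception as the last argument also logs its stack trace.
       LOG.warn("Avro schema doesn't match Parquet schema, falling back to conversion", e);
     }
   }
   ```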



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] shangxinli commented on a diff in pull request #957: PARQUET-2069: Allow list and array record types to be compatible.

2022-06-17 Thread GitBox


shangxinli commented on code in PR #957:
URL: https://github.com/apache/parquet-mr/pull/957#discussion_r900687175


##
parquet-avro/src/test/java/org/apache/parquet/avro/TestArrayListCompatibility.java:
##
@@ -0,0 +1,51 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.avro;
+
+import com.google.common.io.Resources;
+import org.apache.avro.generic.GenericData;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.parquet.hadoop.ParquetReader;
+import org.junit.Test;
+import java.io.IOException;
+
+public class TestArrayListCompatibility {
+
+  @Test
+  public void testListArrayCompatibility() throws IOException {
+Path testPath = new 
Path(Resources.getResource("list-array-compat.parquet").getFile());
+
+Configuration conf = new Configuration();
+ParquetReader parquetReader =
+  AvroParquetReader.builder(testPath).withConf(conf).build();
+GenericData.Record firstRecord;
+try {
+  firstRecord = (GenericData.Record) parquetReader.read();
+} catch (Exception x) {
+  x.printStackTrace();

Review Comment:
   I think if you don't catch it, the stack trace would still be printed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] shangxinli commented on a diff in pull request #951: PARQUET-2134: Fix type checking in HadoopStreams.wrap

2022-06-17 Thread GitBox


shangxinli commented on code in PR #951:
URL: https://github.com/apache/parquet-mr/pull/951#discussion_r900676217


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java:
##
@@ -50,51 +46,45 @@ public class HadoopStreams {
*/
   public static SeekableInputStream wrap(FSDataInputStream stream) {
 Objects.requireNonNull(stream, "Cannot wrap a null input stream");
-if (byteBufferReadableClass != null && h2SeekableConstructor != null &&
-byteBufferReadableClass.isInstance(stream.getWrappedStream())) {
-  try {
-return h2SeekableConstructor.newInstance(stream);
-  } catch (InstantiationException | IllegalAccessException e) {
-LOG.warn("Could not instantiate H2SeekableInputStream, falling back to 
byte array reads", e);
-return new H1SeekableInputStream(stream);
-  } catch (InvocationTargetException e) {
-throw new ParquetDecodingException(
-"Could not instantiate H2SeekableInputStream", 
e.getTargetException());
-  }
+if (isWrappedStreamByteBufferReadable(stream)) {
+  return new H2SeekableInputStream(stream);
 } else {
   return new H1SeekableInputStream(stream);
 }
   }
 
-  private static Class<?> getReadableClass() {
-try {
-  return Class.forName("org.apache.hadoop.fs.ByteBufferReadable");
-} catch (ClassNotFoundException | NoClassDefFoundError e) {
-  return null;
+  /**
+   * Is the inner stream byte buffer readable?
+   * The test is "the stream is not FSDataInputStream
+   * and implements ByteBufferReadable".
+   *
+   * That is: all streams which implement ByteBufferReadable
+   * other than FSDataInputStream successfully support read(ByteBuffer).
+   * This is true for all filesystem clients in the hadoop codebase.
+   *
+   * In hadoop 3.3.0+, the StreamCapabilities probe can be used to
+   * check this: only those streams which provide the read(ByteBuffer)
+   * semantics MAY return true for the probe "in:readbytebuffer";
+   * FSDataInputStream will pass the probe down to the underlying stream.
+   *
+   * @param stream stream to probe
+   * @return true if it is safe to use an H2SeekableInputStream to access the data
+   */
+  private static boolean isWrappedStreamByteBufferReadable(FSDataInputStream 
stream) {
+if (stream.hasCapability("in:readbytebuffer")) {

Review Comment:
   We don't have Hadoop 3.3.0 yet in Parquet. Does it mean we need to hold off this PR?
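   
   For context, a minimal sketch of the probe-then-fallback pattern being discussed (assuming Hadoop's `StreamCapabilities.hasCapability` API; illustrative, not the exact code in this PR):
   
   ```java
   import org.apache.hadoop.fs.ByteBufferReadable;
   import org.apache.hadoop.fs.FSDataInputStream;
   
   final class ByteBufferReadableProbe {
     private ByteBufferReadableProbe() {}
   
     // Trust a positive capability probe (reliable on Hadoop 3.3.0+); on older
     // releases the probe never returns true, so fall back to inspecting the
     // wrapped stream. FSDataInputStream itself always implements
     // ByteBufferReadable, which is why only the inner stream's type matters.
     static boolean supportsByteBufferReads(FSDataInputStream stream) {
       if (stream.hasCapability("in:readbytebuffer")) {
         return true;
       }
       return stream.getWrappedStream() instanceof ByteBufferReadable;
     }
   }
   ```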



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] sunchao commented on a diff in pull request #976: PARQUET-2158: Upgrade Hadoop dependency to version 3.2.0

2022-06-17 Thread GitBox


sunchao commented on code in PR #976:
URL: https://github.com/apache/parquet-mr/pull/976#discussion_r900576817


##
pom.xml:
##
@@ -76,7 +76,7 @@
 2.13.2.2
 0.14.2
 shaded.parquet
-2.10.1
+3.2.0

Review Comment:
   hmm why 3.2.0, not 3.3.1/3.3.2?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] huaxingao commented on pull request #975: PARQUET-2157: add bloom filter fpp config

2022-06-16 Thread GitBox


huaxingao commented on PR #975:
URL: https://github.com/apache/parquet-mr/pull/975#issuecomment-1157762209

   > it should be good enough to also check the lower limit, eg exist > 
totalCount * (testFpp[i] * 0.9) , or exist > totalCount * (testFpp[i] * 0.5) , 
or even exist > 0. What do you think? This way, we'll be certain the test 
passes not because exist is just 0.
   
   Thanks for the suggestion! I can't find a reliable number for the lower 
limit. I put `exist > 0`. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] huaxingao commented on a diff in pull request #975: PARQUET-2157: add bloom filter fpp config

2022-06-16 Thread GitBox


huaxingao commented on code in PR #975:
URL: https://github.com/apache/parquet-mr/pull/975#discussion_r899177750


##
parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestParquetWriter.java:
##
@@ -282,6 +286,63 @@ public void testParquetFileWithBloomFilter() throws 
IOException {
 }
   }
 
+  @Test
+  public void testParquetFileWithBloomFilterWithFpp() throws IOException {
+int totalCount = 10;
+double[] testFpp = {0.005, 0.01, 0.05, 0.10, 0.15, 0.20, 0.25};
+
+Set<String> distinctStrings = new HashSet<>();
+while (distinctStrings.size() < totalCount) {
+  String str = RandomStringUtils.randomAlphabetic(12);
+  distinctStrings.add(str);
+}
+
+MessageType schema = Types.buildMessage().
+  required(BINARY).as(stringType()).named("name").named("msg");
+
+Configuration conf = new Configuration();
+GroupWriteSupport.setSchema(schema, conf);
+
+GroupFactory factory = new SimpleGroupFactory(schema);
+for (int i = 0; i < testFpp.length; i++) {
+  File file = temp.newFile();
+  file.delete();
+  Path path = new Path(file.getAbsolutePath());
+  try (ParquetWriter<Group> writer = ExampleParquetWriter.builder(path)
+.withPageRowCountLimit(10)
+.withConf(conf)
+.withDictionaryEncoding(false)
+.withBloomFilterEnabled("name", true)
+.withBloomFilterNDV("name", totalCount)
+.withBloomFilterFPP("name", testFpp[i])
+.build()) {
+java.util.Iterator<String> iterator = distinctStrings.iterator();
+while (iterator.hasNext()) {
+  writer.write(factory.newGroup().append("name", iterator.next()));
+}
+  }
+  distinctStrings.clear();
+
+  try (ParquetFileReader reader = ParquetFileReader.open(HadoopInputFile.fromPath(path, new Configuration()))) {
+BlockMetaData blockMetaData = reader.getFooter().getBlocks().get(0);
+BloomFilter bloomFilter = 
reader.getBloomFilterDataReader(blockMetaData)
+  .readBloomFilter(blockMetaData.getColumns().get(0));
+
+// The exist counts the number of times FindHash returns true.
+int exist = 0;
+while (distinctStrings.size() < totalCount) {
+  String str = RandomStringUtils.randomAlphabetic(10);
+  if (distinctStrings.add(str) &&
+      bloomFilter.findHash(LongHashFunction.xx(0).hashBytes(Binary.fromString(str).toByteBuffer()))) {
+exist++;
+  }
+}
+// The exist should be less than totalCount * fpp. Add 10% here for 
error space.
+assertTrue(exist < totalCount * (testFpp[i] * 1.1));

Review Comment:
   Yes. Agree.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] ggershinsky commented on pull request #975: PARQUET-2157: add bloom filter fpp config

2022-06-16 Thread GitBox


ggershinsky commented on PR #975:
URL: https://github.com/apache/parquet-mr/pull/975#issuecomment-1157577513

   > The test takes about 2300 milli seconds on my laptop.
   
   Ok, this is reasonable. If this time is sufficient for reliably testing the 
upper limit of FPPs, it should be good enough to also check the lower limit, eg 
`exist > totalCount * (testFpp[i] * 0.9)` , or `exist > totalCount * 
(testFpp[i] * 0.5)` , or even `exist > 0`. What do you think? This way, we'll 
be certain the test passes not because `exist` is just 0.
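   
   For reference, a standard back-of-the-envelope bound here (textbook Binomial reasoning, not a computation from this PR): among n distinct never-inserted probes at false-positive rate p,
   
   ```latex
   \mathrm{exist} \sim \mathrm{Binomial}(n, p), \qquad
   \mathbb{E}[\mathrm{exist}] = np, \qquad
   \sigma = \sqrt{np(1-p)}
   ```
   
   so a two-sided check like `0.5 * n * p < exist < 1.1 * n * p` only passes reliably when np is large relative to sigma, i.e. when n is much larger than 1/p.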


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] chenjunjiedada commented on a diff in pull request #975: PARQUET-2157: add bloom filter fpp config

2022-06-16 Thread GitBox


chenjunjiedada commented on code in PR #975:
URL: https://github.com/apache/parquet-mr/pull/975#discussion_r898756998


##
parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestParquetWriter.java:
##
@@ -282,6 +286,63 @@ public void testParquetFileWithBloomFilter() throws 
IOException {
 }
   }
 
+  @Test
+  public void testParquetFileWithBloomFilterWithFpp() throws IOException {
+int totalCount = 10;
+double[] testFpp = {0.005, 0.01, 0.05, 0.10, 0.15, 0.20, 0.25};
+
+Set<String> distinctStrings = new HashSet<>();
+while (distinctStrings.size() < totalCount) {
+  String str = RandomStringUtils.randomAlphabetic(12);
+  distinctStrings.add(str);
+}
+
+MessageType schema = Types.buildMessage().
+  required(BINARY).as(stringType()).named("name").named("msg");
+
+Configuration conf = new Configuration();
+GroupWriteSupport.setSchema(schema, conf);
+
+GroupFactory factory = new SimpleGroupFactory(schema);
+for (int i = 0; i < testFpp.length; i++) {
+  File file = temp.newFile();
+  file.delete();
+  Path path = new Path(file.getAbsolutePath());
+  try (ParquetWriter<Group> writer = ExampleParquetWriter.builder(path)
+.withPageRowCountLimit(10)
+.withConf(conf)
+.withDictionaryEncoding(false)
+.withBloomFilterEnabled("name", true)
+.withBloomFilterNDV("name", totalCount)
+.withBloomFilterFPP("name", testFpp[i])
+.build()) {
+java.util.Iterator<String> iterator = distinctStrings.iterator();
+while (iterator.hasNext()) {
+  writer.write(factory.newGroup().append("name", iterator.next()));
+}
+  }
+  distinctStrings.clear();
+
+  try (ParquetFileReader reader = ParquetFileReader.open(HadoopInputFile.fromPath(path, new Configuration()))) {
+BlockMetaData blockMetaData = reader.getFooter().getBlocks().get(0);
+BloomFilter bloomFilter = 
reader.getBloomFilterDataReader(blockMetaData)
+  .readBloomFilter(blockMetaData.getColumns().get(0));
+
+// The exist counts the number of times FindHash returns true.
+int exist = 0;
+while (distinctStrings.size() < totalCount) {
+  String str = RandomStringUtils.randomAlphabetic(10);
+  if (distinctStrings.add(str) &&
+      bloomFilter.findHash(LongHashFunction.xx(0).hashBytes(Binary.fromString(str).toByteBuffer()))) {
+exist++;
+  }
+}
+// The exist should be less than totalCount * fpp. Add 10% here for 
error space.
+assertTrue(exist < totalCount * (testFpp[i] * 1.1));

Review Comment:
   The size of the bloom filter is computed from `ndv` and `fpp`. So even if the size is "unreasonably" small, it should be enough to handle the given situation, right?
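   
   For reference, a sketch of the textbook sizing rule relating `ndv` and `fpp` to the filter size (illustrative only; parquet-mr's block-split Bloom filter uses a related, block-based calculation):
   
   ```java
   final class BloomFilterSizingSketch {
     private BloomFilterSizingSketch() {}
   
     // Textbook Bloom filter sizing: bits needed to hold n distinct values
     // at target false-positive probability p.
     static long optimalNumOfBits(long n, double p) {
       return (long) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
     }
   
     public static void main(String[] args) {
       // e.g. n = 10_000, p = 0.01 -> roughly 95_851 bits (~11.7 KiB)
       System.out.println(optimalNumOfBits(10_000, 0.01));
     }
   }
   ```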



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] huaxingao commented on a diff in pull request #975: PARQUET-2157: add bloom filter fpp config

2022-06-15 Thread GitBox


huaxingao commented on code in PR #975:
URL: https://github.com/apache/parquet-mr/pull/975#discussion_r898230192


##
parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java:
##
@@ -471,6 +484,12 @@ public Builder withBloomFilterNDV(String columnPath, long 
ndv) {
   return this;
 }
 
+public Builder withBloomFilterFPP(String columnPath, double fpp) {

Review Comment:
   This value will be silently ignored.
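   
   For reference, a minimal usage sketch showing how these builder options combine (column name and writer setup taken from the test in this PR; `path` and `conf` assumed as there):
   
   ```java
   // Illustrative: the FPP setting below is silently ignored unless the
   // bloom filter is enabled for the column.
   ParquetWriter<Group> writer = ExampleParquetWriter.builder(path)
       .withConf(conf)
       .withBloomFilterEnabled("name", true)
       .withBloomFilterNDV("name", 10_000L)
       .withBloomFilterFPP("name", 0.01)
       .build();
   ```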



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] huaxingao commented on a diff in pull request #975: PARQUET-2157: add bloom filter fpp config

2022-06-15 Thread GitBox


huaxingao commented on code in PR #975:
URL: https://github.com/apache/parquet-mr/pull/975#discussion_r898229962


##
parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestParquetWriter.java:
##
@@ -282,6 +286,63 @@ public void testParquetFileWithBloomFilter() throws 
IOException {
 }
   }
 
+  @Test
+  public void testParquetFileWithBloomFilterWithFpp() throws IOException {
+int totalCount = 10;
+double[] testFpp = {0.005, 0.01, 0.05, 0.10, 0.15, 0.20, 0.25};
+
+Set<String> distinctStrings = new HashSet<>();
+while (distinctStrings.size() < totalCount) {
+  String str = RandomStringUtils.randomAlphabetic(12);
+  distinctStrings.add(str);
+}
+
+MessageType schema = Types.buildMessage().
+  required(BINARY).as(stringType()).named("name").named("msg");
+
+Configuration conf = new Configuration();
+GroupWriteSupport.setSchema(schema, conf);
+
+GroupFactory factory = new SimpleGroupFactory(schema);
+for (int i = 0; i < testFpp.length; i++) {
+  File file = temp.newFile();
+  file.delete();
+  Path path = new Path(file.getAbsolutePath());
+  try (ParquetWriter<Group> writer = ExampleParquetWriter.builder(path)
+.withPageRowCountLimit(10)
+.withConf(conf)
+.withDictionaryEncoding(false)
+.withBloomFilterEnabled("name", true)
+.withBloomFilterNDV("name", totalCount)
+.withBloomFilterFPP("name", testFpp[i])
+.build()) {
+java.util.Iterator<String> iterator = distinctStrings.iterator();
+while (iterator.hasNext()) {
+  writer.write(factory.newGroup().append("name", iterator.next()));
+}
+  }
+  distinctStrings.clear();
+
+  try (ParquetFileReader reader = ParquetFileReader.open(HadoopInputFile.fromPath(path, new Configuration()))) {
+BlockMetaData blockMetaData = reader.getFooter().getBlocks().get(0);
+BloomFilter bloomFilter = 
reader.getBloomFilterDataReader(blockMetaData)
+  .readBloomFilter(blockMetaData.getColumns().get(0));
+
+// The exist counts the number of times FindHash returns true.
+int exist = 0;
+while (distinctStrings.size() < totalCount) {
+  String str = RandomStringUtils.randomAlphabetic(10);
+  if (distinctStrings.add(str) &&
+      bloomFilter.findHash(LongHashFunction.xx(0).hashBytes(Binary.fromString(str).toByteBuffer()))) {
+exist++;
+  }
+}
+// The exist should be less than totalCount * fpp. Add 10% here for 
error space.
+assertTrue(exist < totalCount * (testFpp[i] * 1.1));

Review Comment:
   Basically `exist` > 0 means a false positive occurred, which happens when a hash value that was never inserted into the bloom filter causes the check to return true. I don't think there is a simple closed-form calculation of this probability, but setting `totalCount` to `10` seems to be a pretty safe number for the test to pass.
   
   I am thinking we probably should disallow the Bloom filter's size from being unreasonably small. We currently only have a maximum size in bytes for the Bloom filter. Shall we also have a minimum? What do you think? @chenjunjiedada
   
   The test takes about 2300 milliseconds on my laptop.
   
   
   
   



##
parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestParquetWriter.java:
##
@@ -282,6 +286,63 @@ public void testParquetFileWithBloomFilter() throws 
IOException {
 }
   }
 
+  @Test
+  public void testParquetFileWithBloomFilterWithFpp() throws IOException {
+int totalCount = 10;
+double[] testFpp = {0.005, 0.01, 0.05, 0.10, 0.15, 0.20, 0.25};
+
+Set<String> distinctStrings = new HashSet<>();
+while (distinctStrings.size() < totalCount) {
+  String str = RandomStringUtils.randomAlphabetic(12);
+  distinctStrings.add(str);
+}
+
+MessageType schema = Types.buildMessage().
+  required(BINARY).as(stringType()).named("name").named("msg");
+
+Configuration conf = new Configuration();
+GroupWriteSupport.setSchema(schema, conf);
+
+GroupFactory factory = new SimpleGroupFactory(schema);
+for (int i = 0; i < testFpp.length; i++) {
+  File file = temp.newFile();
+  file.delete();
+  Path path = new Path(file.getAbsolutePath());
+  try (ParquetWriter<Group> writer = ExampleParquetWriter.builder(path)
+.withPageRowCountLimit(10)
+.withConf(conf)
+.withDictionaryEncoding(false)
+.withBloomFilterEnabled("name", true)
+.withBloomFilterNDV("name", totalCount)
+.withBloomFilterFPP("name", testFpp[i])
+.build()) {
+java.util.Iterator<String> iterator = distinctStrings.iterator();
+while (iterator.hasNext()) {
+  writer.write(factory.newGroup().append("name", iterator.next()));
+}
+  }
+  distinctStrings.clear();
+
+  try (ParquetFileReader reader = 

[GitHub] [parquet-mr] ggershinsky commented on a diff in pull request #975: PARQUET-2157: add bloom filter fpp config

2022-06-15 Thread GitBox


ggershinsky commented on code in PR #975:
URL: https://github.com/apache/parquet-mr/pull/975#discussion_r898002511


##
parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestParquetWriter.java:
##
@@ -282,6 +286,63 @@ public void testParquetFileWithBloomFilter() throws 
IOException {
 }
   }
 
+  @Test
+  public void testParquetFileWithBloomFilterWithFpp() throws IOException {
+int totalCount = 10;
+double[] testFpp = {0.005, 0.01, 0.05, 0.10, 0.15, 0.20, 0.25};
+
+Set<String> distinctStrings = new HashSet<>();
+while (distinctStrings.size() < totalCount) {
+  String str = RandomStringUtils.randomAlphabetic(12);
+  distinctStrings.add(str);
+}
+
+MessageType schema = Types.buildMessage().
+  required(BINARY).as(stringType()).named("name").named("msg");
+
+Configuration conf = new Configuration();
+GroupWriteSupport.setSchema(schema, conf);
+
+GroupFactory factory = new SimpleGroupFactory(schema);
+for (int i = 0; i < testFpp.length; i++) {
+  File file = temp.newFile();
+  file.delete();
+  Path path = new Path(file.getAbsolutePath());
+  try (ParquetWriter<Group> writer = ExampleParquetWriter.builder(path)
+.withPageRowCountLimit(10)
+.withConf(conf)
+.withDictionaryEncoding(false)
+.withBloomFilterEnabled("name", true)
+.withBloomFilterNDV("name", totalCount)
+.withBloomFilterFPP("name", testFpp[i])
+.build()) {
+java.util.Iterator<String> iterator = distinctStrings.iterator();
+while (iterator.hasNext()) {
+  writer.write(factory.newGroup().append("name", iterator.next()));
+}
+  }
+  distinctStrings.clear();
+
+  try (ParquetFileReader reader = ParquetFileReader.open(HadoopInputFile.fromPath(path, new Configuration()))) {
+BlockMetaData blockMetaData = reader.getFooter().getBlocks().get(0);
+BloomFilter bloomFilter = 
reader.getBloomFilterDataReader(blockMetaData)
+  .readBloomFilter(blockMetaData.getColumns().get(0));
+
+// The exist counts the number of times FindHash returns true.
+int exist = 0;
+while (distinctStrings.size() < totalCount) {
+  String str = RandomStringUtils.randomAlphabetic(10);
+  if (distinctStrings.add(str) &&
+      bloomFilter.findHash(LongHashFunction.xx(0).hashBytes(Binary.fromString(str).toByteBuffer()))) {
+exist++;
+  }
+}
+// The exist should be less than totalCount * fpp. Add 10% here for 
error space.
+assertTrue(exist < totalCount * (testFpp[i] * 1.1));

Review Comment:
   Two related questions:
   - what should be the `totalCount` to reliably ensure that a) `exist > 0` b) `exist < totalCount * (testFpp[i] * 1.1)` ? Depending on the `fpp` value, we can get a random assert exception if `totalCount` is too low (also, `exist` could be just 0 then). If `totalCount` is high, the unit test could take a very long time.
   - how long does this unit test run on your laptop? (with the current `totalCount` of 10).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] ggershinsky commented on a diff in pull request #975: PARQUET-2157: add bloom filter fpp config

2022-06-15 Thread GitBox


ggershinsky commented on code in PR #975:
URL: https://github.com/apache/parquet-mr/pull/975#discussion_r897894894


##
parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestParquetWriter.java:
##
@@ -282,6 +286,63 @@ public void testParquetFileWithBloomFilter() throws 
IOException {
 }
   }
 
+  @Test
+  public void testParquetFileWithBloomFilterWithFpp() throws IOException {
+int totalCount = 10;
+double[] testFpp = {0.005, 0.01, 0.05, 0.10, 0.15, 0.20, 0.25};
+
+Set distinctStrings = new HashSet<>();
+while (distinctStrings.size() < totalCount) {
+  String str = RandomStringUtils.randomAlphabetic(12);
+  distinctStrings.add(str);
+}
+
+MessageType schema = Types.buildMessage().
+  required(BINARY).as(stringType()).named("name").named("msg");
+
+Configuration conf = new Configuration();
+GroupWriteSupport.setSchema(schema, conf);
+
+GroupFactory factory = new SimpleGroupFactory(schema);
+for (int i = 0; i < testFpp.length; i++) {
+  File file = temp.newFile();
+  file.delete();
+  Path path = new Path(file.getAbsolutePath());
+  try (ParquetWriter writer = ExampleParquetWriter.builder(path)
+.withPageRowCountLimit(10)
+.withConf(conf)
+.withDictionaryEncoding(false)
+.withBloomFilterEnabled("name", true)
+.withBloomFilterNDV("name", totalCount)
+.withBloomFilterFPP("name", testFpp[i])
+.build()) {
+java.util.Iterator iterator = distinctStrings.iterator();
+while (iterator.hasNext()) {
+  writer.write(factory.newGroup().append("name", iterator.next()));
+}
+  }
+  distinctStrings.clear();
+
+  try (ParquetFileReader reader = 
ParquetFileReader.open(HadoopInputFile.fromPath(path, new Configuration( {
+BlockMetaData blockMetaData = reader.getFooter().getBlocks().get(0);
+BloomFilter bloomFilter = 
reader.getBloomFilterDataReader(blockMetaData)
+  .readBloomFilter(blockMetaData.getColumns().get(0));
+
+// The exist counts the number of times FindHash returns true.
+int exist = 0;
+while (distinctStrings.size() < totalCount) {
+  String str = RandomStringUtils.randomAlphabetic(10);

Review Comment:
   the original values are 12 chars long. To make sure that finding a different-length string among them is always false, can you change it to `originalLength - 2` instead of hard-coding 10?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] ggershinsky commented on a diff in pull request #975: PARQUET-2157: add bloom filter fpp config

2022-06-15 Thread GitBox


ggershinsky commented on code in PR #975:
URL: https://github.com/apache/parquet-mr/pull/975#discussion_r897894894


##
parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestParquetWriter.java:
##
@@ -282,6 +286,63 @@ public void testParquetFileWithBloomFilter() throws 
IOException {
 }
   }
 
+  @Test
+  public void testParquetFileWithBloomFilterWithFpp() throws IOException {
+int totalCount = 10;
+double[] testFpp = {0.005, 0.01, 0.05, 0.10, 0.15, 0.20, 0.25};
+
+Set<String> distinctStrings = new HashSet<>();
+while (distinctStrings.size() < totalCount) {
+  String str = RandomStringUtils.randomAlphabetic(12);
+  distinctStrings.add(str);
+}
+
+MessageType schema = Types.buildMessage().
+  required(BINARY).as(stringType()).named("name").named("msg");
+
+Configuration conf = new Configuration();
+GroupWriteSupport.setSchema(schema, conf);
+
+GroupFactory factory = new SimpleGroupFactory(schema);
+for (int i = 0; i < testFpp.length; i++) {
+  File file = temp.newFile();
+  file.delete();
+  Path path = new Path(file.getAbsolutePath());
+  try (ParquetWriter<Group> writer = ExampleParquetWriter.builder(path)
+.withPageRowCountLimit(10)
+.withConf(conf)
+.withDictionaryEncoding(false)
+.withBloomFilterEnabled("name", true)
+.withBloomFilterNDV("name", totalCount)
+.withBloomFilterFPP("name", testFpp[i])
+.build()) {
+java.util.Iterator<String> iterator = distinctStrings.iterator();
+while (iterator.hasNext()) {
+  writer.write(factory.newGroup().append("name", iterator.next()));
+}
+  }
+  distinctStrings.clear();
+
+  try (ParquetFileReader reader = ParquetFileReader.open(HadoopInputFile.fromPath(path, new Configuration()))) {
+BlockMetaData blockMetaData = reader.getFooter().getBlocks().get(0);
+BloomFilter bloomFilter = 
reader.getBloomFilterDataReader(blockMetaData)
+  .readBloomFilter(blockMetaData.getColumns().get(0));
+
+// The exist counts the number of times FindHash returns true.
+int exist = 0;
+while (distinctStrings.size() < totalCount) {
+  String str = RandomStringUtils.randomAlphabetic(10);

Review Comment:
   the original values are 12 char long. Are we supposed to find a 10-char 
string among them?



##
parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java:
##
@@ -471,6 +484,12 @@ public Builder withBloomFilterNDV(String columnPath, long 
ndv) {
   return this;
 }
 
+public Builder withBloomFilterFPP(String columnPath, double fpp) {

Review Comment:
   what happens if this value is set, but the BF is not enabled? (general / 
per-column)



##
parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestParquetWriter.java:
##
@@ -282,6 +286,63 @@ public void testParquetFileWithBloomFilter() throws 
IOException {
 }
   }
 
+  @Test
+  public void testParquetFileWithBloomFilterWithFpp() throws IOException {
+int totalCount = 10;
+double[] testFpp = {0.005, 0.01, 0.05, 0.10, 0.15, 0.20, 0.25};
+
+Set<String> distinctStrings = new HashSet<>();
+while (distinctStrings.size() < totalCount) {
+  String str = RandomStringUtils.randomAlphabetic(12);
+  distinctStrings.add(str);
+}
+
+MessageType schema = Types.buildMessage().
+  required(BINARY).as(stringType()).named("name").named("msg");
+
+Configuration conf = new Configuration();
+GroupWriteSupport.setSchema(schema, conf);
+
+GroupFactory factory = new SimpleGroupFactory(schema);
+for (int i = 0; i < testFpp.length; i++) {
+  File file = temp.newFile();
+  file.delete();
+  Path path = new Path(file.getAbsolutePath());
+  try (ParquetWriter<Group> writer = ExampleParquetWriter.builder(path)
+.withPageRowCountLimit(10)
+.withConf(conf)
+.withDictionaryEncoding(false)
+.withBloomFilterEnabled("name", true)
+.withBloomFilterNDV("name", totalCount)
+.withBloomFilterFPP("name", testFpp[i])
+.build()) {
+java.util.Iterator<String> iterator = distinctStrings.iterator();
+while (iterator.hasNext()) {
+  writer.write(factory.newGroup().append("name", iterator.next()));
+}
+  }
+  distinctStrings.clear();
+
+  try (ParquetFileReader reader = ParquetFileReader.open(HadoopInputFile.fromPath(path, new Configuration()))) {
+BlockMetaData blockMetaData = reader.getFooter().getBlocks().get(0);
+BloomFilter bloomFilter = 
reader.getBloomFilterDataReader(blockMetaData)
+  .readBloomFilter(blockMetaData.getColumns().get(0));
+
+// The exist counts the number of times FindHash returns true.
+int exist = 0;
+while (distinctStrings.size() < totalCount) {
+  String str = RandomStringUtils.randomAlphabetic(10);
+  if (distinctStrings.add(str) &&
+

[GitHub] [parquet-mr] dossett commented on pull request #963: PARQUET-1020 Add DynamicMessage writing support

2022-06-14 Thread GitBox


dossett commented on PR #963:
URL: https://github.com/apache/parquet-mr/pull/963#issuecomment-1155680245

   I tested this locally and it works beautifully thank you @guillaume-fetter.
   
   @shangxinli @gszadovszky -- Apologies for pinging you directly, but it can be difficult to get parquet-protobuf changes approved since there aren't any(?) active committers working in the protobuf code. This seems like a low-risk change and, as I said, works great for us.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] huaxingao commented on a diff in pull request #975: PARQUET-2157: add bloom filter fpp config

2022-06-13 Thread GitBox


huaxingao commented on code in PR #975:
URL: https://github.com/apache/parquet-mr/pull/975#discussion_r896299022


##
parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestParquetWriter.java:
##
@@ -282,6 +286,63 @@ public void testParquetFileWithBloomFilter() throws 
IOException {
 }
   }
 
+  @Test
+  public void testParquetFileWithBloomFilterWithFpp() throws IOException {
+final int totalCount = 10;

Review Comment:
   Removed. 



##
parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestParquetWriter.java:
##
@@ -282,6 +286,63 @@ public void testParquetFileWithBloomFilter() throws 
IOException {
 }
   }
 
+  @Test
+  public void testParquetFileWithBloomFilterWithFpp() throws IOException {
+final int totalCount = 10;
+double[] testFpp = {0.005, 0.01, 0.05, 0.10, 0.15, 0.20, 0.25};
+
+Set<String> distinctStrings = new HashSet<>();
+while (distinctStrings.size() < totalCount) {
+  String str = RandomStringUtils.randomAlphabetic(12);
+  distinctStrings.add(str);
+}
+
+MessageType schema = Types.buildMessage().
+  required(BINARY).as(stringType()).named("name").named("msg");
+
+Configuration conf = new Configuration();
+GroupWriteSupport.setSchema(schema, conf);
+
+GroupFactory factory = new SimpleGroupFactory(schema);
+for (int i = 0; i < testFpp.length; i++) {
+  File file = temp.newFile();
+  file.delete();
+  Path path = new Path(file.getAbsolutePath());
+  try (ParquetWriter<Group> writer = ExampleParquetWriter.builder(path)
+.withPageRowCountLimit(10)
+.withConf(conf)
+.withDictionaryEncoding(false)
+.withBloomFilterEnabled("name", true)
+.withBloomFilterNDV("name", 10l)

Review Comment:
   Fixed. Thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] chenjunjiedada commented on a diff in pull request #975: PARQUET-2157: add bloom filter fpp config

2022-06-13 Thread GitBox


chenjunjiedada commented on code in PR #975:
URL: https://github.com/apache/parquet-mr/pull/975#discussion_r896285374


##
parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestParquetWriter.java:
##
@@ -282,6 +286,63 @@ public void testParquetFileWithBloomFilter() throws 
IOException {
 }
   }
 
+  @Test
+  public void testParquetFileWithBloomFilterWithFpp() throws IOException {
+final int totalCount = 10;
+double[] testFpp = {0.005, 0.01, 0.05, 0.10, 0.15, 0.20, 0.25};
+
+Set<String> distinctStrings = new HashSet<>();
+while (distinctStrings.size() < totalCount) {
+  String str = RandomStringUtils.randomAlphabetic(12);
+  distinctStrings.add(str);
+}
+
+MessageType schema = Types.buildMessage().
+  required(BINARY).as(stringType()).named("name").named("msg");
+
+Configuration conf = new Configuration();
+GroupWriteSupport.setSchema(schema, conf);
+
+GroupFactory factory = new SimpleGroupFactory(schema);
+for (int i = 0; i < testFpp.length; i++) {
+  File file = temp.newFile();
+  file.delete();
+  Path path = new Path(file.getAbsolutePath());
+  try (ParquetWriter<Group> writer = ExampleParquetWriter.builder(path)
+.withPageRowCountLimit(10)
+.withConf(conf)
+.withDictionaryEncoding(false)
+.withBloomFilterEnabled("name", true)
+.withBloomFilterNDV("name", 10l)

Review Comment:
   Nit: Can we use `totalCount`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] chenjunjiedada commented on a diff in pull request #975: PARQUET-2157: add bloom filter fpp config

2022-06-13 Thread GitBox


chenjunjiedada commented on code in PR #975:
URL: https://github.com/apache/parquet-mr/pull/975#discussion_r896285197


##
parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestParquetWriter.java:
##
@@ -282,6 +286,63 @@ public void testParquetFileWithBloomFilter() throws 
IOException {
 }
   }
 
+  @Test
+  public void testParquetFileWithBloomFilterWithFpp() throws IOException {
+final int totalCount = 10;

Review Comment:
   Nit: Why do we need `final`? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] huaxingao commented on pull request #975: PARQUET-2157: add bloom filter fpp config

2022-06-13 Thread GitBox


huaxingao commented on PR #975:
URL: https://github.com/apache/parquet-mr/pull/975#issuecomment-1154253580

   cc @chenjunjiedada @ggershinsky @shangxinli 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] huaxingao commented on pull request #975: PARQUET-2157: add bloom filter fpp config

2022-06-13 Thread GitBox


huaxingao commented on PR #975:
URL: https://github.com/apache/parquet-mr/pull/975#issuecomment-1154252101

   The CI passed. Thanks a lot @dongjoon-hyun 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] dongjoon-hyun commented on a diff in pull request #975: PARQUET-2157: add bloom filter fpp config

2022-06-13 Thread GitBox


dongjoon-hyun commented on code in PR #975:
URL: https://github.com/apache/parquet-mr/pull/975#discussion_r895949925


##
parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestParquetWriter.java:
##
@@ -39,13 +39,17 @@
 import java.io.File;
 import java.io.IOException;
 import java.util.HashMap;
+import java.util.HashSet;
 import java.util.Map;
+import java.util.Set;
 import java.util.concurrent.Callable;
 
 import net.openhft.hashing.LongHashFunction;
+import org.apache.commons.lang3.RandomStringUtils;

Review Comment:
   To avoid CI failure, please add this as a test dependency to 
`parquet-hadoop/pom.xml`.
   ```
   
   <dependency>
     <groupId>org.apache.commons</groupId>
     <artifactId>commons-lang3</artifactId>
     <version>3.9</version>
     <scope>test</scope>
   </dependency>
   
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] dongjoon-hyun commented on a diff in pull request #975: PARQUET-2157: add bloom filter fpp config

2022-06-13 Thread GitBox


dongjoon-hyun commented on code in PR #975:
URL: https://github.com/apache/parquet-mr/pull/975#discussion_r895949925


##
parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestParquetWriter.java:
##
@@ -39,13 +39,17 @@
 import java.io.File;
 import java.io.IOException;
 import java.util.HashMap;
+import java.util.HashSet;
 import java.util.Map;
+import java.util.Set;
 import java.util.concurrent.Callable;
 
 import net.openhft.hashing.LongHashFunction;
+import org.apache.commons.lang3.RandomStringUtils;

Review Comment:
   Please add this as a test dependency to `parquet-hadoop/pom.xml`.
   ```
   
   <dependency>
     <groupId>org.apache.commons</groupId>
     <artifactId>commons-lang3</artifactId>
     <version>3.9</version>
     <scope>test</scope>
   </dependency>
   
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] steveloughran commented on pull request #976: PARQUET-2158: Upgrade Hadoop dependency to version 3.2.0

2022-06-13 Thread GitBox


steveloughran commented on PR #976:
URL: https://github.com/apache/parquet-mr/pull/976#issuecomment-1154018331

   This PR fixes Parquet to build/link against Hadoop 3.2.0 and higher. It would be cleaner to remove the deprecated class causing the compatibility issues; the fact that nobody has ever reported linkage errors implies it is not in active use.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] dossett commented on pull request #963: PARQUET-1020 Add DynamicMessage writing support

2022-06-13 Thread GitBox


dossett commented on PR #963:
URL: https://github.com/apache/parquet-mr/pull/963#issuecomment-1154013785

   @guillaume-fetter I see what you mean, that makes sense. I think for my use 
case (reading protobuf data from kafka via the confluent schema registry and 
then writing to parquet) I won't get tripped up by the serializability issue. 
This will be a nice parquet enhancement!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] guillaume-fetter commented on pull request #963: PARQUET-1020 Add DynamicMessage writing support

2022-06-13 Thread GitBox


guillaume-fetter commented on PR #963:
URL: https://github.com/apache/parquet-mr/pull/963#issuecomment-1154004113

   @dossett Depends on your use case. If you are running a simple program that 
does data processing on a single host, then you're good. If you are using a big 
data processing tool (like me here, Flink) you can't pass around a DM instance 
from one task to the other, or at least, I did not find a way to make it work...
   For unrelated reasons, we are using the SelfDescribingMessage design pattern 
(https://developers.google.com/protocol-buffers/docs/techniques#self-description),
 which is a specific message, therefore serializable. From there we wrote a 
parquet writer which basically converts the SelfDescribingMessage to a 
DynamicMessage and then writes it using this upgraded ProtoWriteSupport.
   
   It's clearly convoluted unless you are already using a SelfDescribingMessage 
or equivalent.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] steveloughran commented on pull request #968: PARQUET-2149: Async IO implementation for ParquetFileReader

2022-06-13 Thread GitBox


steveloughran commented on PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1153924743

   (i could of course add those probes into the shim class, so at least that 
access of internals was in one place)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] steveloughran commented on pull request #968: PARQUET-2149: Async IO implementation for ParquetFileReader

2022-06-13 Thread GitBox


steveloughran commented on PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1153923501

   bq.  perhaps check if the ByteBufferReadable interface is implemented in the 
stream?
   
   The requirement for `hasCapability("in:readbytebuffer")` to return true postdates the API; there's no way to be confident, if the probe returns false (or hasPathCapability() isn't available), that the method *isn't actually there*.
   
   See #951 for a design which will trust a `true` response, falling back to looking at the wrapped stream. Note that as it calls getWrapped() it is calling methods tagged LimitedPrivate. It shouldn't really do that... at the very least hadoop needs a PR saying "we need to do this because..." so that tag can be changed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] steveloughran commented on pull request #951: PARQUET-2134: Fix type checking in HadoopStreams.wrap

2022-06-13 Thread GitBox


steveloughran commented on PR #951:
URL: https://github.com/apache/parquet-mr/pull/951#issuecomment-1153871066

   Whoever actually commits this can use the github squash option to combine all commits into one before merging.
   
   FYI, I've just started writing a shim library so that apps compiling against hadoop 3.2.0 will be able to invoke the 3.3+ API calls when present: [HADOOP-18287](https://issues.apache.org/jira/browse/HADOOP-18287).
   
   First, parquet will need to be able to compile/link against hadoop 3.x: #976


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] NilsB44 opened a new pull request, #977: PARQUET-2153: SchemaParseException: Can't redefine: element for FixedSchema

2022-06-13 Thread GitBox


NilsB44 opened a new pull request, #977:
URL: https://github.com/apache/parquet-mr/pull/977

   This extends PARQUET-1441, where this issue was fixed for RecordSchema but not for FixedSchema.
   
   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [ ] My PR addresses the following [Parquet 
Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references 
them in the PR title. For example, "PARQUET-1234: My Parquet PR"
 - https://issues.apache.org/jira/browse/PARQUET-XXX
 - In case you are adding a dependency, check if the license complies with 
the [ASF 3rd Party License 
Policy](https://www.apache.org/legal/resolved.html#category-x).
   
   ### Tests
   
   - [ ] My PR adds the following unit tests __OR__ does not need testing for 
this extremely good reason:
   I'm unsure how to add unit tests for FixedSchema; maybe someone could help me with this?
   
   ### Commits
   
   - [ ] My commits all reference Jira issues in their subject lines. In 
addition, my commits follow the guidelines from "[How to write a good git 
commit message](http://chris.beams.io/posts/git-commit/)":
 1. Subject is separated from body by a blank line
 1. Subject is limited to 50 characters (not including Jira issue reference)
 1. Subject does not end with a period
 1. Subject uses the imperative mood ("add", not "adding")
 1. Body wraps at 72 characters
 1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [ ] In case of new functionality, my PR adds documentation that describes 
how to use it.
 - All the public functions and the classes in the PR contain Javadoc that 
explain what it does
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] dossett commented on pull request #963: PARQUET-1020 Add DynamicMessage writing support

2022-06-13 Thread GitBox


dossett commented on PR #963:
URL: https://github.com/apache/parquet-mr/pull/963#issuecomment-1153856375

   Oh that's interesting @guillaume-fetter so you can't just write out a 
dynamic message into parquet without jumping through more hoops?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] steveloughran commented on pull request #976: PARQUET-2158: Upgrade Hadoop dependency to version 3.2.0

2022-06-13 Thread GitBox


steveloughran commented on PR #976:
URL: https://github.com/apache/parquet-mr/pull/976#issuecomment-1153851509

   the thrift module doesn't compile: it is using a hadoop-internal class tagged as private, which made an incompatible change in hadoop 3. See HADOOP-12436.
   
   ```
   Error:  Failed to execute goal 
org.apache.maven.plugins:maven-compiler-plugin:3.8.1:compile (default-compile) 
on project parquet-thrift: Compilation failure
   Error:  
/home/runner/work/parquet-mr/parquet-mr/parquet-thrift/src/main/java/org/apache/parquet/thrift/projection/deprecated/PathGlobPattern.java:[55,49]
 incompatible types: com.google.re2j.Pattern cannot be converted to 
java.util.regex.Pattern
   ```
   
   the good news: the class is deprecated, which explains why nobody has seen it in the wild. Any attempt to use that class would fail with hadoop 3.x on the classpath.
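   
   For anyone hitting this, the incompatibility is that Hadoop 3's GlobPattern switched its compiled pattern type from java.util.regex.Pattern to com.google.re2j.Pattern (HADOOP-12436). A hypothetical workaround sketch is to stop reaching into Hadoop's class and compile the glob locally:
   
   ```java
   import java.util.regex.Pattern;
   
   final class GlobCompileSketch {
     private GlobCompileSketch() {}
   
     // Hypothetical: translate a simple glob (* and ? wildcards only) into a
     // java.util.regex.Pattern, avoiding any dependency on the return type of
     // Hadoop's GlobPattern#compiled(), which changed in Hadoop 3.
     static Pattern compileGlob(String glob) {
       StringBuilder regex = new StringBuilder();
       for (char c : glob.toCharArray()) {
         switch (c) {
           case '*': regex.append(".*"); break;
           case '?': regex.append('.'); break;
           default:  regex.append(Pattern.quote(String.valueOf(c)));
         }
       }
       return Pattern.compile(regex.toString());
     }
   }
   ```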


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] steveloughran opened a new pull request, #976: PARQUET-2158: Upgrade Hadoop dependency to version 3.2.0

2022-06-13 Thread GitBox


steveloughran opened a new pull request, #976:
URL: https://github.com/apache/parquet-mr/pull/976

   
   
   This updates Parquet's Hadoop dependency to 3.2.0.
   This version adds compatibility with Java 11, as well
   as many other features and bug fixes.
   
   ### Jira
   
   - [X] My PR addresses the following [Parquet 
Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references 
them in the PR title. For example, "PARQUET-1234: My Parquet PR"
 - https://issues.apache.org/jira/browse/PARQUET-XXX
 - In case you are adding a dependency, check if the license complies with 
the [ASF 3rd Party License 
Policy](https://www.apache.org/legal/resolved.html#category-x).
   
   ### Tests
   
   - [X] My PR adds the following unit tests __OR__ does not need testing for 
this extremely good reason:
 it's a dependency update.
   
   ### Commits
   
   - [X] My commits all reference Jira issues in their subject lines. In 
addition, my commits follow the guidelines from "[How to write a good git 
commit message](http://chris.beams.io/posts/git-commit/)":
 1. Subject is separated from body by a blank line
 1. Subject is limited to 50 characters (not including Jira issue reference)
 1. Subject does not end with a period
 1. Subject uses the imperative mood ("add", not "adding")
 1. Body wraps at 72 characters
 1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [ ] In case of new functionality, my PR adds documentation that describes 
how to use it.
 - All the public functions and the classes in the PR contain Javadoc that 
explain what it does
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] guillaume-fetter commented on pull request #963: PARQUET-1020 Add DynamicMessage writing support

2022-06-13 Thread GitBox


guillaume-fetter commented on PR #963:
URL: https://github.com/apache/parquet-mr/pull/963#issuecomment-1153626738

   Just a heads-up (because I have run into this issue): DynamicMessage is not serializable.
   So this means that this use case is for local-only instances of a DynamicMessage. In my use case I need to build the DynamicMessage from another object which is serializable, and do so directly in the writer, which is a bit convoluted.
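   
   For anyone facing the same constraint, a minimal sketch of the pattern described above: ship a serializable carrier (the raw message bytes plus a way to recover the Descriptor, e.g. a FileDescriptorSet) between tasks, and rebuild the DynamicMessage inside the writer. Names here are illustrative, and recovering the Descriptor is left out:
   
   ```java
   import com.google.protobuf.Descriptors.Descriptor;
   import com.google.protobuf.DynamicMessage;
   import com.google.protobuf.InvalidProtocolBufferException;
   
   final class DynamicMessageRebuildSketch {
     private DynamicMessageRebuildSketch() {}
   
     // Rebuild a DynamicMessage locally from a Descriptor plus the raw
     // serialized message bytes; both can be shipped between tasks,
     // unlike DynamicMessage itself.
     static DynamicMessage rebuild(Descriptor descriptor, byte[] payload)
         throws InvalidProtocolBufferException {
       return DynamicMessage.parseFrom(descriptor, payload);
     }
   }
   ```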
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] guillaume-fetter commented on a diff in pull request #963: PARQUET-1020 Add DynamicMessage writing support

2022-06-13 Thread GitBox


guillaume-fetter commented on code in PR #963:
URL: https://github.com/apache/parquet-mr/pull/963#discussion_r895442880


##
parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoWriteSupport.java:
##
@@ -115,27 +120,32 @@ public void prepareForWrite(RecordConsumer 
recordConsumer) {
   public WriteContext init(Configuration configuration) {
 
 // if no protobuf descriptor was given in constructor, load descriptor 
from configuration (set with setProtobufClass)
-if (protoMessage == null) {
-  Class<? extends Message> pbClass = configuration.getClass(PB_CLASS_WRITE, null, Message.class);
-  if (pbClass != null) {
-protoMessage = pbClass;
-  } else {
-String msg = "Protocol buffer class not specified.";
-String hint = " Please use method 
ProtoParquetOutputFormat.setProtobufClass(...) or other similar method.";
-throw new BadConfigurationException(msg + hint);
+if (descriptor == null) {
+  if (protoMessage == null) {
+Class<? extends Message> pbClass = configuration.getClass(PB_CLASS_WRITE, null, Message.class);
+if (pbClass != null) {
+  protoMessage = pbClass;
+} else {
+  String msg = "Protocol buffer class or descriptor not specified.";
+  String hint = " Please use method 
ProtoParquetOutputFormat.setProtobufClass(...) or other similar method.";
+  throw new BadConfigurationException(msg + hint);
+}
   }
+  descriptor = Protobufs.getMessageDescriptor(protoMessage);
+} else {
+  //Assume no specific Message extending class, so use DynamicMessage
+  protoMessage = DynamicMessage.class;

Review Comment:
   Yes, I agree. In the end I set it just for the sake of having it set, but you are right, it will be more confusing than useful.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] sheinbergon commented on pull request #900: PARQUET-2042: Add support for unwrapping common Protobuf wrappers and…

2022-06-13 Thread GitBox


sheinbergon commented on PR #900:
URL: https://github.com/apache/parquet-mr/pull/900#issuecomment-1153562823

   @mwong38 anyway I can help with finalizing this PR?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] 7c00 commented on pull request #951: PARQUET-2134: Fix type checking in HadoopStreams.wrap

2022-06-13 Thread GitBox


7c00 commented on PR #951:
URL: https://github.com/apache/parquet-mr/pull/951#issuecomment-1153522439

   Thanks @steveloughran @shangxinli . I have cherry-picked the commit from 
https://github.com/apache/parquet-mr/pull/971


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] huaxingao opened a new pull request, #975: add bloom filter fpp config

2022-06-12 Thread GitBox


huaxingao opened a new pull request, #975:
URL: https://github.com/apache/parquet-mr/pull/975

   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [ ] My PR addresses the following [Parquet 
Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references 
them in the PR title. For example, "PARQUET-1234: My Parquet PR"
 - https://issues.apache.org/jira/browse/PARQUET-XXX
 - In case you are adding a dependency, check if the license complies with 
the [ASF 3rd Party License 
Policy](https://www.apache.org/legal/resolved.html#category-x).
   
   ### Tests
   
   - [ ] My PR adds the following unit tests __OR__ does not need testing for 
this extremely good reason:
   
   ### Commits
   
   - [ ] My commits all reference Jira issues in their subject lines. In 
addition, my commits follow the guidelines from "[How to write a good git 
commit message](http://chris.beams.io/posts/git-commit/)":
 1. Subject is separated from body by a blank line
 1. Subject is limited to 50 characters (not including Jira issue reference)
 1. Subject does not end with a period
 1. Subject uses the imperative mood ("add", not "adding")
 1. Body wraps at 72 characters
 1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [ ] In case of new functionality, my PR adds documentation that describes 
how to use it.
 - All the public functions and the classes in the PR contain Javadoc that 
explain what it does
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] sunchao commented on pull request #973: PARQUET-2155: Upgrade protobuf version to 3.20.1

2022-06-10 Thread GitBox


sunchao commented on PR #973:
URL: https://github.com/apache/parquet-mr/pull/973#issuecomment-1152586301

   cc @ggershinsky @shangxinli 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] sunchao commented on pull request #973: PARQUET-2155: Upgrade protobuf version to 3.20.1

2022-06-10 Thread GitBox


sunchao commented on PR #973:
URL: https://github.com/apache/parquet-mr/pull/973#issuecomment-1152458544

   To reduce the risk, we can also change 3.20.1 to 3.17.1 which seems to be 
the lowest version that has the aarch_64 artifact. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] steveloughran commented on pull request #973: PARQUET-2155: Upgrade protobuf version to 3.20.1

2022-06-10 Thread GitBox


steveloughran commented on PR #973:
URL: https://github.com/apache/parquet-mr/pull/973#issuecomment-1152148534

   ...but the #970 patch should add a comment saying "not needed once protobuf 
is incremented to 3.20+"


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] steveloughran commented on pull request #973: PARQUET-2155: Upgrade protobuf version to 3.20.1

2022-06-10 Thread GitBox


steveloughran commented on PR #973:
URL: https://github.com/apache/parquet-mr/pull/973#issuecomment-1152146993

   this is a different solution to mine, which just changed the protoc 
artifact. i think my one is lower risk, but this one can address other protobuf 
issues


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] LuciferYang commented on pull request #972: PARQUET-2154: `ParquetFileReader` should close its input stream when `filterRowGroups` throw Exception in constructor

2022-06-10 Thread GitBox


LuciferYang commented on PR #972:
URL: https://github.com/apache/parquet-mr/pull/972#issuecomment-1152089288

   Thanks @sunchao @HyukjinKwon @ggershinsky 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] ggershinsky merged pull request #972: PARQUET-2154: `ParquetFileReader` should close its input stream when `filterRowGroups` throw Exception in constructor

2022-06-10 Thread GitBox


ggershinsky merged PR #972:
URL: https://github.com/apache/parquet-mr/pull/972


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] LuciferYang commented on pull request #972: PARQUET-2154: `ParquetFileReader` should close its input stream when `filterRowGroups` throw Exception in constructor

2022-06-09 Thread GitBox


LuciferYang commented on PR #972:
URL: https://github.com/apache/parquet-mr/pull/972#issuecomment-1151743122

   @ggershinsky yeah~ all passed now~ thanks~


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] sunchao commented on pull request #970: PARQUET-2150: parquet-protobuf to compile on Mac M1

2022-06-09 Thread GitBox


sunchao commented on PR #970:
URL: https://github.com/apache/parquet-mr/pull/970#issuecomment-1151738855

   Oops @steveloughran just found this PR. I opened #973 for the same purpose.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] sunchao opened a new pull request, #973: Upgrade protobuf version to 3.20.1

2022-06-09 Thread GitBox


sunchao opened a new pull request, #973:
URL: https://github.com/apache/parquet-mr/pull/973

   Currently the repo can't be compiled with M1 Mac:
   ```
   com.google.protobuf:protoc:exe:osx-aarch_64:3.16.1 was not found in 
https://jitpack.io/ during a previous attempt.
   ```
   
   since the artifact for M1 arch was not published: 
https://repo1.maven.org/maven2/com/google/protobuf/protoc/3.16.1/
   
   This upgrades the protobuf version to 3.20.1 to support compiling on M1 Macs.
   
   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [X] My PR addresses the following [Parquet 
Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references 
them in the PR title. For example, "PARQUET-1234: My Parquet PR"
 - https://issues.apache.org/jira/browse/PARQUET-2155
 - In case you are adding a dependency, check if the license complies with 
the [ASF 3rd Party License 
Policy](https://www.apache.org/legal/resolved.html#category-x).
   
   ### Tests
   
   - [X] My PR adds the following unit tests __OR__ does not need testing for 
this extremely good reason:
   
   ### Commits
   
   - [X] My commits all reference Jira issues in their subject lines. In 
addition, my commits follow the guidelines from "[How to write a good git 
commit message](http://chris.beams.io/posts/git-commit/)":
 1. Subject is separated from body by a blank line
 1. Subject is limited to 50 characters (not including Jira issue reference)
 1. Subject does not end with a period
 1. Subject uses the imperative mood ("add", not "adding")
 1. Body wraps at 72 characters
 1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [ ] In case of new functionality, my PR adds documentation that describes 
how to use it.
 - All the public functions and the classes in the PR contain Javadoc that 
explain what it does
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] jinyius commented on pull request #445: Proposed change in AvroSchema to handle circular references.

2022-06-09 Thread GitBox


jinyius commented on PR #445:
URL: https://github.com/apache/parquet-mr/pull/445#issuecomment-1151527049

   any updates here?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] dossett commented on pull request #963: PARQUET-1020 Add DynamicMessage writing support

2022-06-09 Thread GitBox


dossett commented on PR #963:
URL: https://github.com/apache/parquet-mr/pull/963#issuecomment-1151481732

   +1 (non-binding) for this change.  `DynamicMessage` is quite useful in 
protobuf and support here would be great, I ran into a need for it just today.
   cc @belugabehr in case they have thoughts. There aren't any active 
protobuf-parquet committers AFAICT.
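
   To make the use case concrete, here is a hedged sketch of building a
DynamicMessage straight from a Descriptor (the descriptor source and the
field names are invented for illustration):

   ```java
   import com.google.protobuf.Descriptors.Descriptor;
   import com.google.protobuf.DynamicMessage;

   // "descriptor" is a hypothetical Descriptor obtained at runtime, e.g.
   // from a schema registry, rather than from a generated message class.
   DynamicMessage msg = DynamicMessage.newBuilder(descriptor)
     .setField(descriptor.findFieldByName("id"), 42L)         // assumes an int64 field
     .setField(descriptor.findFieldByName("name"), "example") // assumes a string field
     .build();
   // with this PR, such a message could be written without any generated class
   ```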


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] dossett commented on a diff in pull request #963: PARQUET-1020 Add DynamicMessage writing support

2022-06-09 Thread GitBox


dossett commented on code in PR #963:
URL: https://github.com/apache/parquet-mr/pull/963#discussion_r893846424


##
parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoWriteSupport.java:
##
@@ -115,27 +120,32 @@ public void prepareForWrite(RecordConsumer recordConsumer) {
   public WriteContext init(Configuration configuration) {
 
     // if no protobuf descriptor was given in constructor, load descriptor from configuration (set with setProtobufClass)
-    if (protoMessage == null) {
-      Class<? extends Message> pbClass = configuration.getClass(PB_CLASS_WRITE, null, Message.class);
-      if (pbClass != null) {
-        protoMessage = pbClass;
-      } else {
-        String msg = "Protocol buffer class not specified.";
-        String hint = " Please use method ProtoParquetOutputFormat.setProtobufClass(...) or other similar method.";
-        throw new BadConfigurationException(msg + hint);
+    if (descriptor == null) {
+      if (protoMessage == null) {
+        Class<? extends Message> pbClass = configuration.getClass(PB_CLASS_WRITE, null, Message.class);
+        if (pbClass != null) {
+          protoMessage = pbClass;
+        } else {
+          String msg = "Protocol buffer class or descriptor not specified.";
+          String hint = " Please use method ProtoParquetOutputFormat.setProtobufClass(...) or other similar method.";
+          throw new BadConfigurationException(msg + hint);
+        }
       }
+      descriptor = Protobufs.getMessageDescriptor(protoMessage);
+    } else {
+      //Assume no specific Message extending class, so use DynamicMessage
+      protoMessage = DynamicMessage.class;

Review Comment:
   Should this just be left null? Having it set implies it's usable, but it's 
only used to determine the descriptor and get a ProtoSchemaConverter, and then 
only in cases when we don't have a descriptor.
   
   I just worry that setting it could lead to assumptions later on that it's as 
usable as if it were a generated message and not a dynamic message.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] dossett commented on pull request #959: PARQUET-2126: Make cached (de)compressors thread-safe

2022-06-09 Thread GitBox


dossett commented on PR #959:
URL: https://github.com/apache/parquet-mr/pull/959#issuecomment-1151467318

   @shangxinli I do not feel strongly about it.  I think historical context is 
better kept in JIRAs and PR discussion than in code comments, but that is just 
a style choice if there's no standard.  (I appreciate you following up, btw!)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] parthchandra commented on pull request #968: PARQUET-2149: Async IO implementation for ParquetFileReader

2022-06-09 Thread GitBox


parthchandra commented on PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1151467274

   Sounds good.
   
   Also, perhaps check if the ByteBufferReadable interface is implemented in 
the stream?
   > ByteBufferReadable will raise UnsupportedException if not found, there is 
a check for it
   > 
https://github.com/steveloughran/fs-api-shim/blob/main/fs-api-shim-library/src/main/java/org/apache/hadoop/fs/shim/FSDataInputStreamShim.java
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] steveloughran commented on pull request #968: PARQUET-2149: Async IO implementation for ParquetFileReader

2022-06-09 Thread GitBox


steveloughran commented on PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1151417126

   I've started work on a fs-api-shim library, with the goal that apps 
compiled against hadoop 3.2.0 can get access to the 3.3 and 3.4 APIs when 
available, either with transparent fallback (openFile()) or the ability to 
probe for the API before trying to invoke it.
   
   https://github.com/steveloughran/fs-api-shim
   
   openFile() takes the seek & status params, falls back to open():
   https://github.com/steveloughran/fs-api-shim/blob/main/fs-api-shim-library/src/main/java/org/apache/hadoop/fs/shim/FileSystemShim.java#L87
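
   A hedged sketch of that fallback idea against the stock Hadoop 3.3.0+ API
(the method name is made up; error handling is flattened for brevity):

   ```java
   import java.io.IOException;
   import java.util.concurrent.ExecutionException;
   import org.apache.hadoop.fs.FSDataInputStream;
   import org.apache.hadoop.fs.FileStatus;
   import org.apache.hadoop.fs.FileSystem;
   import org.apache.hadoop.fs.Path;

   /** Open with the 3.3.0+ openFile() builder, falling back to open(). */
   static FSDataInputStream openWithFallback(FileSystem fs, Path path,
       FileStatus status) throws IOException {
     try {
       // withFileStatus lets object stores skip a fresh HEAD request
       return fs.openFile(path).withFileStatus(status).build().get();
     } catch (UnsupportedOperationException e) {
       return fs.open(path); // pre-openFile() semantics
     } catch (InterruptedException | ExecutionException e) {
       throw new IOException(e);
     }
   }
   ```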
   
   ByteBufferReadable will raise UnsupportedException if not found, there is a 
check for it
   
https://github.com/steveloughran/fs-api-shim/blob/main/fs-api-shim-library/src/main/java/org/apache/hadoop/fs/shim/FSDataInputStreamShim.java
   
   Vector IO SHALL be available the same way
   
   Adopt 3.2.0 then and we will help give the library the ability to use the 
newer api calls, even stuff not yet shipped in apache releases.
   
   (I want to release this as an asf artifact with oversight by hadoop project. 
lets us maintain it)
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] steveloughran commented on a diff in pull request #968: PARQUET-2149: Async IO implementation for ParquetFileReader

2022-06-09 Thread GitBox


steveloughran commented on code in PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#discussion_r893760667


##
parquet-common/src/main/java/org/apache/parquet/bytes/SequenceByteBufferInputStream.java:
##
@@ -0,0 +1,269 @@
+/*
+ *  Licensed to the Apache Software Foundation (ASF) under one
+ *  or more contributor license agreements.  See the NOTICE file
+ *  distributed with this work for additional information
+ *  regarding copyright ownership.  The ASF licenses this file
+ *  to you under the Apache License, Version 2.0 (the
+ *  "License"); you may not use this file except in compliance
+ *  with the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ *  Unless required by applicable law or agreed to in writing,
+ *  software distributed under the License is distributed on an
+ *  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ *  KIND, either express or implied.  See the License for the
+ *  specific language governing permissions and limitations
+ *  under the License.
+ */
+
+package org.apache.parquet.bytes;
+
+import java.io.EOFException;
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.ArrayList;
+import java.util.Collection;
+import java.util.Collections;
+import java.util.Iterator;
+import java.util.List;
+
+/**
+ * A bare minimum implementation of a {@link java.io.SequenceInputStream} that wraps an
+ * ordered collection of ByteBufferInputStreams.
+ *
+ * This class, as implemented, is intended only for a specific use in the ParquetFileReader and
+ * throws {@link UnsupportedOperationException} in unimplemented methods to catch any unintended
+ * use in other cases.
+ *
+ * Even though this class is derived from ByteBufferInputStream it explicitly does not support any
+ * byte buffer related methods like slice. It does, however, support sliceBuffers, which is a
+ * curious case of reading data from underlying streams.
+ *
+ * Even though this class changes the state of the underlying streams (by reading from them)
+ * it does not own them, and so the close method does not close the streams. To avoid resource
+ * leaks the calling code should close the underlying streams.
+ */
+public class SequenceByteBufferInputStream extends ByteBufferInputStream {
+
+  Collection<ByteBufferInputStream> collection;
+  Iterator<ByteBufferInputStream> iterator;
+  ByteBufferInputStream current;
+  long position = 0;
+
+  @Override
+  public String toString() {
+    return "SequenceByteBufferInputStream{" +
+      "collection=" + collection +
+      ", current=" + current +
+      ", position=" + position +
+      '}';
+  }
+
+  public SequenceByteBufferInputStream(Collection<ByteBufferInputStream> collection) {
+    this.collection = collection;
+    iterator = collection.iterator();
+    current = iterator.hasNext() ? iterator.next() : null;
+    if (current == null) {
+      throw new UnsupportedOperationException(
+        "Initializing SequenceByteBufferInputStream with an empty collection is not supported");
+    }
+  }
+
+  @Override
+  public long position() {
+    return position;
+  }
+
+  @Override
+  public int read(ByteBuffer out) {

Review Comment:
   good to know. 
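
   For anyone following along, a hedged usage sketch of the class proposed in
this PR (ByteBufferInputStream is the existing parquet-common type; the
two-buffer setup is invented for illustration):

   ```java
   import java.nio.ByteBuffer;
   import java.util.Arrays;
   import org.apache.parquet.bytes.ByteBufferInputStream;

   ByteBufferInputStream a = ByteBufferInputStream.wrap(ByteBuffer.wrap(new byte[] {1, 2}));
   ByteBufferInputStream b = ByteBufferInputStream.wrap(ByteBuffer.wrap(new byte[] {3, 4}));
   ByteBufferInputStream seq = new SequenceByteBufferInputStream(Arrays.asList(a, b));

   ByteBuffer out = ByteBuffer.allocate(4);
   seq.read(out); // reads across both wrapped streams; caller still closes a and b
   ```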



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] LuciferYang commented on pull request #972: PARQUET-2154: `ParquetFileReader` should close its input stream when `filterRowGroups` throw Exception in constructor

2022-06-09 Thread GitBox


LuciferYang commented on PR #972:
URL: https://github.com/apache/parquet-mr/pull/972#issuecomment-1151345710

   > can you re-run the CI? (eg via re-opening the PR)
   
   OK


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] LuciferYang closed pull request #972: PARQUET-2154: `ParquetFileReader` should close its input stream when `filterRowGroups` throw Exception in constructor

2022-06-09 Thread GitBox


LuciferYang closed pull request #972: PARQUET-2154: `ParquetFileReader` should 
close its input stream when `filterRowGroups` throw Exception in constructor
URL: https://github.com/apache/parquet-mr/pull/972


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] ggershinsky commented on pull request #972: PARQUET-2154: `ParquetFileReader` should close its input stream when `filterRowGroups` throw Exception in constructor

2022-06-09 Thread GitBox


ggershinsky commented on PR #972:
URL: https://github.com/apache/parquet-mr/pull/972#issuecomment-1151344885

   can you re-run the CI? (eg via re-opening the PR)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] shangxinli commented on pull request #959: PARQUET-2126: Make cached (de)compressors thread-safe

2022-06-09 Thread GitBox


shangxinli commented on PR #959:
URL: https://github.com/apache/parquet-mr/pull/959#issuecomment-1151340364

   Thanks for addressing the feedback! What I meant was that ideally, when 
'threads terminate' happens, the compressor/decompressor should be cleaned up 
immediately. I understand we won't leak in the end once 'close/release' is 
called, though. 
   
   > Seems good to me (non-binding!). Revisiting whether or not the caching 
strategy make sense might be worthwhile, but that shouldn't stop this fix.
   > 
   > Small comment: I would remove most of the references to the JIRA ticket as 
well as descriptions of the old behavior. I think the comment that describes 
the new behavior and why it might be unintuitive with a reference to the JIRA 
makes sense though. I'll defer to others in the project (again, I'm not a 
committer) if there are existing standards for this though.
   
   @dossett, we don't have a standard like that. It seems OK to have. What do 
you think? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] LuciferYang commented on pull request #972: PARQUET-2154: `ParquetFileReader` should close its input stream when `filterRowGroups` throw Exception in constructor

2022-06-09 Thread GitBox


LuciferYang commented on PR #972:
URL: https://github.com/apache/parquet-mr/pull/972#issuecomment-1151319153

   @ggershinsky I found Travis CI failed, but `Raw log` is null
   
   https://user-images.githubusercontent.com/1475305/172891856-e8acc55a-99cd-4023-9e31-e0fac5b9d1e4.png
   
   What should I do now?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] shangxinli commented on pull request #951: PARQUET-2134: Fix type checking in HadoopStreams.wrap

2022-06-09 Thread GitBox


shangxinli commented on PR #951:
URL: https://github.com/apache/parquet-mr/pull/951#issuecomment-1151315837

   > I've taken this PR and added the changes I was suggesting, plus tests. see 
#971. If you take that extra commit and merge it in here, it should complete 
this PR
   
   @7c00 Are you OK with this, since you originally created this PR? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] ggershinsky commented on pull request #972: PARQUET-2154: `ParquetFileReader` should close its input stream when `filterRowGroups` throw Exception in constructor

2022-06-08 Thread GitBox


ggershinsky commented on PR #972:
URL: https://github.com/apache/parquet-mr/pull/972#issuecomment-1150670630

   yep, I've approved the CI run. Once complete, will merge.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] LuciferYang commented on a diff in pull request #972: PARQUET-2154: `ParquetFileReader` should close its input stream when `filterRowGroups` throw Exception in constructor

2022-06-08 Thread GitBox


LuciferYang commented on code in PR #972:
URL: https://github.com/apache/parquet-mr/pull/972#discussion_r893047755


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java:
##
@@ -715,7 +715,14 @@ public ParquetFileReader(
           .withDecryption(fileDecryptor.getDecryptionProperties())
           .build();
     }
-    this.blocks = filterRowGroups(blocks);
+    try {
+      this.blocks = filterRowGroups(blocks);
+    } catch (Exception e) {

Review Comment:
   More than just IOE can be thrown here; for example, pushing down a filter 
on a repeated primitive type throws `IllegalArgumentException` as follows:
   
   ```
   Caused by: java.lang.IllegalArgumentException: FilterPredicates do not currently support repeated columns. Column keywords is repeated.
     at org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:176)
     at org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:149)
     at org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:89)
     at org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:56)
     at org.apache.parquet.filter2.predicate.Operators$NotEq.accept(Operators.java:192)
     at org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:61)
     at org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:95)
     at org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:45)
     at org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:149)
     at org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:72)
     at org.apache.parquet.hadoop.ParquetFileReader.filterRowGroups(ParquetFileReader.java:870)
     at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:789)
   ```
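
   For context, a hedged sketch of the kind of predicate that trips this
check (the column name and value are made up to match the trace above):

   ```java
   import static org.apache.parquet.filter2.predicate.FilterApi.binaryColumn;
   import static org.apache.parquet.filter2.predicate.FilterApi.notEq;

   import org.apache.parquet.filter2.compat.FilterCompat;
   import org.apache.parquet.filter2.predicate.FilterPredicate;
   import org.apache.parquet.io.api.Binary;

   // "keywords" is a repeated primitive column in a hypothetical schema, so
   // building the predicate succeeds but validation rejects it at read time.
   FilterPredicate onRepeated =
     notEq(binaryColumn("keywords"), Binary.fromString("spark"));
   FilterCompat.Filter filter = FilterCompat.get(onRepeated);
   // the IllegalArgumentException above fires later, inside filterRowGroups
   ```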
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] LuciferYang opened a new pull request, #972: PARQUET-2154: `ParquetFileReader` should close its input stream when `filterRowGroups` throw Exception in constructor

2022-06-08 Thread GitBox


LuciferYang opened a new pull request, #972:
URL: https://github.com/apache/parquet-mr/pull/972

   During the construction of `ParquetFileReader`, if the 
`filterRowGroups(footer.getBlocks())` method throws an exception, it causes a 
resource leak: once the exception is thrown, the already-opened stream 
`this.f = file.newStream()` can no longer be closed.
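
   As the diff quoted earlier in this thread shows, the fix is to close the
stream before re-throwing; a minimal sketch of the pattern (the exact cleanup
in the PR may differ):

   ```java
   try {
     this.blocks = filterRowGroups(blocks);
   } catch (Exception e) {
     // release the stream opened earlier in the constructor, then re-throw
     f.close();
     throw e;
   }
   ```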
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] steveloughran commented on pull request #951: PARQUET-2134: Fix type checking in HadoopStreams.wrap

2022-06-07 Thread GitBox


steveloughran commented on PR #951:
URL: https://github.com/apache/parquet-mr/pull/951#issuecomment-1148538544

   I've taken this PR and added the changes I was suggesting, plus tests. see 
#971. If you take that extra commit and merge it in here, it should complete 
this PR


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] steveloughran opened a new pull request, #971: PARQUET-2134: Improve binding to ByteBufferReadable

2022-06-07 Thread GitBox


steveloughran opened a new pull request, #971:
URL: https://github.com/apache/parquet-mr/pull/971

   This extends #951
   
   It improves binding to streams which implement
   ByteBufferReadable through recursive probes of wrapped
   streams and direct querying of the stream on Hadoop 3.3.0+.
   
   Since HDFS-14111 all input streams in the hadoop codebase
   which implement ByteBufferReadable return true on the
   StreamCapabilities probe hasCapability("in:readbytebuffer")
   
   This means the best way to probe for the API on those versions
   is to ask the stream.
   
   The StreamCapabilities probe was added in Hadoop 2.9. Along with
   making all use of `ByteBufferReadable` non-reflective, this makes
   the checks fairly straightforward.
   
   The recursive check is from #951; the change is it no longer
   needs to use reflection.
   
   Tests verify that if a stream implements `ByteBufferReadable` then
   it will be bound to H2SeekableInputStream, even if multiply wrapped
   by FSDataInputStreams, and that if it doesn't, it won't.
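
   A hedged sketch of that multiply-wrapped check (the inner stream here is a
   hypothetical one implementing Seekable, PositionedReadable and
   ByteBufferReadable):

   ```java
   import org.apache.hadoop.fs.FSDataInputStream;
   import org.apache.parquet.hadoop.util.HadoopStreams;
   import org.apache.parquet.io.SeekableInputStream;

   FSDataInputStream doublyWrapped =
     new FSDataInputStream(new FSDataInputStream(byteBufferReadableStream));
   SeekableInputStream wrapped = HadoopStreams.wrap(doublyWrapped);
   // expected: the ByteBuffer-capable binding (an H2SeekableInputStream)
   ```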
   
   ### Jira
   
   - [X] My PR addresses the following [Parquet 
Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references 
them in the PR title. For example, "PARQUET-1234: My Parquet PR"
 - https://issues.apache.org/jira/browse/PARQUET-XXX
 - In case you are adding a dependency, check if the license complies with 
the [ASF 3rd Party License 
Policy](https://www.apache.org/legal/resolved.html#category-x).
   
   ### Tests
   
   - [ ] My PR adds the following unit tests __OR__ does not need testing for 
this extremely good reason:
   
   ### Commits
   
   - [X] My commits all reference Jira issues in their subject lines. In 
addition, my commits follow the guidelines from "[How to write a good git 
commit message](http://chris.beams.io/posts/git-commit/)":
 1. Subject is separated from body by a blank line
 1. Subject is limited to 50 characters (not including Jira issue reference)
 1. Subject does not end with a period
 1. Subject uses the imperative mood ("add", not "adding")
 1. Body wraps at 72 characters
 1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [X] In case of new functionality, my PR adds documentation that describes 
how to use it.
 - All the public functions and the classes in the PR contain Javadoc that 
explain what it does
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] steveloughran commented on a diff in pull request #951: PARQUET-2134: Fix type checking in HadoopStreams.wrap

2022-06-06 Thread GitBox


steveloughran commented on code in PR #951:
URL: https://github.com/apache/parquet-mr/pull/951#discussion_r890391185


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java:
##
@@ -66,6 +67,19 @@ public static SeekableInputStream wrap(FSDataInputStream 
stream) {
 }
   }
 
+  private static boolean isWrappedStreamByteBufferReadable(FSDataInputStream 
stream) {
+InputStream wrapped = stream.getWrappedStream();
+if (wrapped == stream) {
+  throw new ParquetDecodingException("Illegal FSDataInputStream as wrapped 
itself");

Review Comment:
   this can't happen. the inner stream is set in the constructor, so cannot 
take the not-yet-constructed class as an argument...no need to worry about 
recursion.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] steveloughran commented on a diff in pull request #951: PARQUET-2134: Fix type checking in HadoopStreams.wrap

2022-06-06 Thread GitBox


steveloughran commented on code in PR #951:
URL: https://github.com/apache/parquet-mr/pull/951#discussion_r890388928


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java:
##
@@ -51,7 +52,7 @@ public class HadoopStreams {
   public static SeekableInputStream wrap(FSDataInputStream stream) {
 Objects.requireNonNull(stream, "Cannot wrap a null input stream");
 if (byteBufferReadableClass != null && h2SeekableConstructor != null &&
-byteBufferReadableClass.isInstance(stream.getWrappedStream())) {
+isWrappedStreamByteBufferReadable(stream)) {

Review Comment:
   this is really going into the internals of the hadoop classes and 
potentially tricky if there is any dynamic decision making in the inner class. 
The good news there is I don't see anything doing that.
   
   there is a way to ask (hadoop 3.2+) if a stream does support the API before 
calling, using the StreamCapabilities interface.
   https://issues.apache.org/jira/browse/HDFS-14111
   
   ```java
   if (stream.hasCapability("in:readbytebuffer")) {
     // stream is confident it has the api
   } else {
     // do the checking of the inner class
   }
   ```
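
   Combined with the recursive unwrap from this PR, the whole probe might look
like this hedged sketch (the helper name is illustrative; assumes a Hadoop
version where FSDataInputStream exposes StreamCapabilities):

   ```java
   import java.io.InputStream;
   import org.apache.hadoop.fs.ByteBufferReadable;
   import org.apache.hadoop.fs.FSDataInputStream;

   static boolean isByteBufferReadable(FSDataInputStream stream) {
     if (stream.hasCapability("in:readbytebuffer")) {
       return true; // the stream declares the capability itself (HDFS-14111)
     }
     InputStream wrapped = stream.getWrappedStream();
     if (wrapped instanceof FSDataInputStream) {
       return isByteBufferReadable((FSDataInputStream) wrapped); // recurse
     }
     return wrapped instanceof ByteBufferReadable;
   }
   ```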



##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java:
##
@@ -66,6 +67,15 @@ public static SeekableInputStream wrap(FSDataInputStream 
stream) {
 }
   }
 
+  private static boolean isWrappedStreamByteBufferReadable(FSDataInputStream 
stream) {
+InputStream wrapped = stream.getWrappedStream();
+if (wrapped instanceof FSDataInputStream) {
+  return isWrappedStreamByteBufferReadable(((FSDataInputStream) wrapped));

Review Comment:
   you can't. the inner stream is set in the constructor, so cannot take the 
not-yet-constructed class as an argument.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



  1   2   3   4   5   6   7   8   9   10   >