Re: [PR] GH-3040: DictionaryFilter.canDrop may return false positive result when dict size exceeds 8k [parquet-java]

2024-11-06 Thread via GitHub


Fokko merged PR #3041:
URL: https://github.com/apache/parquet-java/pull/3041


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@parquet.apache.org
For additional commands, e-mail: issues-h...@parquet.apache.org



Re: [PR] GH-3040: DictionaryFilter.canDrop may return false positive result when dict size exceeds 8k [parquet-java]

2024-11-06 Thread via GitHub


Fokko commented on PR #3041:
URL: https://github.com/apache/parquet-java/pull/3041#issuecomment-2460726171

   Thanks @pan3793 for finding and fixing this, and thanks @wgtmac @ConeyLiu 
and @gszadovszky for the review 🙌 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@parquet.apache.org
For additional commands, e-mail: issues-h...@parquet.apache.org



Re: [PR] GH-3040: DictionaryFilter.canDrop may return false positive result when dict size exceeds 8k [parquet-java]

2024-11-05 Thread via GitHub


pan3793 commented on code in PR #3041:
URL: https://github.com/apache/parquet-java/pull/3041#discussion_r1829273177


##
parquet-common/src/main/java/org/apache/parquet/bytes/BytesInput.java:
##
@@ -376,7 +378,15 @@ void writeInto(ByteBuffer buffer) {
 ByteBuffer workBuf = buffer.duplicate();
 int pos = buffer.position();
 workBuf.limit(pos + byteCount);
-Channels.newChannel(in).read(workBuf);

Review Comment:
   I went through the [original 
PR](https://github.com/apache/parquet-java/pull/1278) and found nothing else, 
it would be great if others have a double check.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@parquet.apache.org
For additional commands, e-mail: issues-h...@parquet.apache.org



Re: [PR] GH-3040: DictionaryFilter.canDrop may return false positive result when dict size exceeds 8k [parquet-java]

2024-11-05 Thread via GitHub


ConeyLiu commented on code in PR #3041:
URL: https://github.com/apache/parquet-java/pull/3041#discussion_r1829249295


##
parquet-common/src/main/java/org/apache/parquet/bytes/BytesInput.java:
##
@@ -376,7 +378,15 @@ void writeInto(ByteBuffer buffer) {
 ByteBuffer workBuf = buffer.duplicate();
 int pos = buffer.position();
 workBuf.limit(pos + byteCount);
-Channels.newChannel(in).read(workBuf);

Review Comment:
   Is there any other place that is used like this?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@parquet.apache.org
For additional commands, e-mail: issues-h...@parquet.apache.org



Re: [PR] GH-3040: DictionaryFilter.canDrop may return false positive result when dict size exceeds 8k [parquet-java]

2024-11-04 Thread via GitHub


Fokko commented on code in PR #3041:
URL: https://github.com/apache/parquet-java/pull/3041#discussion_r1827579758


##
parquet-common/src/test/java/org/apache/parquet/bytes/AvailableAgnosticInputStream.java:
##
@@ -0,0 +1,35 @@
+/*
+ *  Licensed to the Apache Software Foundation (ASF) under one
+ *  or more contributor license agreements.  See the NOTICE file
+ *  distributed with this work for additional information
+ *  regarding copyright ownership.  The ASF licenses this file
+ *  to you under the Apache License, Version 2.0 (the
+ *  "License"); you may not use this file except in compliance
+ *  with the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ *  Unless required by applicable law or agreed to in writing,
+ *  software distributed under the License is distributed on an
+ *  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ *  KIND, either express or implied.  See the License for the
+ *  specific language governing permissions and limitations
+ *  under the License.
+ */
+
+package org.apache.parquet.bytes;
+
+import java.io.ByteArrayInputStream;
+
+public class AvailableAgnosticInputStream extends ByteArrayInputStream {
+
+  public AvailableAgnosticInputStream(byte[] buf) {
+super(buf);
+  }
+
+  // In practice, there are some implementations always return 0 even if they 
has more data

Review Comment:
   Thanks, this is very helpful!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@parquet.apache.org
For additional commands, e-mail: issues-h...@parquet.apache.org



Re: [PR] GH-3040: DictionaryFilter.canDrop may return false positive result when dict size exceeds 8k [parquet-java]

2024-11-04 Thread via GitHub


pan3793 commented on code in PR #3041:
URL: https://github.com/apache/parquet-java/pull/3041#discussion_r1827388657


##
parquet-common/src/test/java/org/apache/parquet/bytes/AvailableAgnosticInputStream.java:
##
@@ -0,0 +1,35 @@
+/*
+ *  Licensed to the Apache Software Foundation (ASF) under one
+ *  or more contributor license agreements.  See the NOTICE file
+ *  distributed with this work for additional information
+ *  regarding copyright ownership.  The ASF licenses this file
+ *  to you under the Apache License, Version 2.0 (the
+ *  "License"); you may not use this file except in compliance
+ *  with the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ *  Unless required by applicable law or agreed to in writing,
+ *  software distributed under the License is distributed on an
+ *  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ *  KIND, either express or implied.  See the License for the
+ *  specific language governing permissions and limitations
+ *  under the License.
+ */
+
+package org.apache.parquet.bytes;
+
+import java.io.ByteArrayInputStream;
+
+public class AvailableAgnosticInputStream extends ByteArrayInputStream {
+
+  public AvailableAgnosticInputStream(byte[] buf) {
+super(buf);
+  }
+
+  // In practice, there are some implementations always return 0 even if they 
has more data

Review Comment:
   My test code is
   (sorry, the file contains private data so I can not share)
   ```
   public class MyDictionaryFilterTest {
   
 private static final Configuration conf = new Configuration();
 List ccmd;
 ParquetFileReader reader;
 DictionaryPageReadStore dictionaries;
 private Path file = new 
Path("/Users/chengpan/Temp/part-2bb8404a-f6e5-4e9f-9161-f749c4bf46d0-2-");
   
 @Before
 public void setUp() throws Exception {
   reader = ParquetFileReader.open(conf, file);
   ParquetMetadata meta = reader.getFooter();
   ccmd = meta.getBlocks().get(0).getColumns();
   dictionaries = reader.getDictionaryReader(meta.getBlocks().get(0));
 }
   
 @After
 public void tearDown() throws Exception {
   reader.close();
 }
   
 @Test
 public void testEqBinary() throws Exception {
   BinaryColumn b = binaryColumn("source_id");
   FilterPredicate pred = eq(b, Binary.fromString("5059661515"));
   
   assertFalse(canDrop(pred, ccmd, dictionaries));
 }
   }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@parquet.apache.org
For additional commands, e-mail: issues-h...@parquet.apache.org



Re: [PR] GH-3040: DictionaryFilter.canDrop may return false positive result when dict size exceeds 8k [parquet-java]

2024-11-04 Thread via GitHub


Fokko commented on code in PR #3041:
URL: https://github.com/apache/parquet-java/pull/3041#discussion_r1827384881


##
parquet-common/src/test/java/org/apache/parquet/bytes/AvailableAgnosticInputStream.java:
##
@@ -0,0 +1,35 @@
+/*
+ *  Licensed to the Apache Software Foundation (ASF) under one
+ *  or more contributor license agreements.  See the NOTICE file
+ *  distributed with this work for additional information
+ *  regarding copyright ownership.  The ASF licenses this file
+ *  to you under the Apache License, Version 2.0 (the
+ *  "License"); you may not use this file except in compliance
+ *  with the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ *  Unless required by applicable law or agreed to in writing,
+ *  software distributed under the License is distributed on an
+ *  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ *  KIND, either express or implied.  See the License for the
+ *  specific language governing permissions and limitations
+ *  under the License.
+ */
+
+package org.apache.parquet.bytes;
+
+import java.io.ByteArrayInputStream;
+
+public class AvailableAgnosticInputStream extends ByteArrayInputStream {
+
+  public AvailableAgnosticInputStream(byte[] buf) {
+super(buf);
+  }
+
+  // In practice, there are some implementations always return 0 even if they 
has more data

Review Comment:
   Out of curiosity. Why are you using `H1SeekableInputStream`? This one is 
related to Hadoop 1.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@parquet.apache.org
For additional commands, e-mail: issues-h...@parquet.apache.org



Re: [PR] GH-3040: DictionaryFilter.canDrop may return false positive result when dict size exceeds 8k [parquet-java]

2024-11-03 Thread via GitHub


pan3793 commented on code in PR #3041:
URL: https://github.com/apache/parquet-java/pull/3041#discussion_r1827221056


##
parquet-common/src/test/java/org/apache/parquet/bytes/TestBytesInput.java:
##
@@ -140,6 +140,20 @@ public void testFromInputStream() throws IOException {
 validate(data, factory);
   }
 
+  @Test
+  public void testFromLargeAvailableAgnosticInputStream() throws IOException {
+// allocate a bytes that large than
+// java.nio.channel.Channels.ReadableByteChannelImpl.TRANSFER_SIZE = 8192
+byte[] data = new byte[9 * 1024];
+RANDOM.nextBytes(data);
+byte[] input = new byte[data.length + 10];
+RANDOM.nextBytes(input);
+System.arraycopy(data, 0, input, 0, data.length);
+Supplier factory = () -> BytesInput.from(new 
AvailableAgnosticInputStream(input), 9 * 1024);

Review Comment:
   I tend to use a new file as it is a quite common case that needs to be 
tested, it might be used in other places in the future.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@parquet.apache.org
For additional commands, e-mail: issues-h...@parquet.apache.org



Re: [PR] GH-3040: DictionaryFilter.canDrop may return false positive result when dict size exceeds 8k [parquet-java]

2024-11-03 Thread via GitHub


pan3793 commented on code in PR #3041:
URL: https://github.com/apache/parquet-java/pull/3041#discussion_r1827220535


##
parquet-common/src/main/java/org/apache/parquet/bytes/BytesInput.java:
##
@@ -376,7 +377,11 @@ void writeInto(ByteBuffer buffer) {
 ByteBuffer workBuf = buffer.duplicate();
 int pos = buffer.position();
 workBuf.limit(pos + byteCount);
-Channels.newChannel(in).read(workBuf);
+ReadableByteChannel channel = Channels.newChannel(in);
+int remaining = byteCount;
+while (remaining > 0) {
+  remaining -= channel.read(workBuf);

Review Comment:
   added a check to detect EOF case



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@parquet.apache.org
For additional commands, e-mail: issues-h...@parquet.apache.org



Re: [PR] GH-3040: DictionaryFilter.canDrop may return false positive result when dict size exceeds 8k [parquet-java]

2024-11-03 Thread via GitHub


wgtmac commented on code in PR #3041:
URL: https://github.com/apache/parquet-java/pull/3041#discussion_r1827206140


##
parquet-common/src/main/java/org/apache/parquet/bytes/BytesInput.java:
##
@@ -376,7 +377,11 @@ void writeInto(ByteBuffer buffer) {
 ByteBuffer workBuf = buffer.duplicate();
 int pos = buffer.position();
 workBuf.limit(pos + byteCount);
-Channels.newChannel(in).read(workBuf);
+ReadableByteChannel channel = Channels.newChannel(in);
+int remaining = byteCount;
+while (remaining > 0) {
+  remaining -= channel.read(workBuf);

Review Comment:
   Is `remaining` reliable? Should we check the return value of 
`channel.read(workBuf)`?



##
parquet-common/src/test/java/org/apache/parquet/bytes/TestBytesInput.java:
##
@@ -140,6 +140,20 @@ public void testFromInputStream() throws IOException {
 validate(data, factory);
   }
 
+  @Test
+  public void testFromLargeAvailableAgnosticInputStream() throws IOException {
+// allocate a bytes that large than
+// java.nio.channel.Channels.ReadableByteChannelImpl.TRANSFER_SIZE = 8192
+byte[] data = new byte[9 * 1024];
+RANDOM.nextBytes(data);
+byte[] input = new byte[data.length + 10];
+RANDOM.nextBytes(input);
+System.arraycopy(data, 0, input, 0, data.length);
+Supplier factory = () -> BytesInput.from(new 
AvailableAgnosticInputStream(input), 9 * 1024);

Review Comment:
   What about using an anonymous class here instead of adding a new file?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@parquet.apache.org
For additional commands, e-mail: issues-h...@parquet.apache.org



Re: [PR] GH-3040: DictionaryFilter.canDrop may return false positive result when dict size exceeds 8k [parquet-java]

2024-11-03 Thread via GitHub


pan3793 commented on PR #3041:
URL: https://github.com/apache/parquet-java/pull/3041#issuecomment-2453525098

   cc @gszadovszky @wgtmac @Fokko 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@parquet.apache.org
For additional commands, e-mail: issues-h...@parquet.apache.org



Re: [PR] GH-3040: DictionaryFilter.canDrop may return false positive result when dict size exceeds 8k [parquet-java]

2024-11-03 Thread via GitHub


pan3793 commented on code in PR #3041:
URL: https://github.com/apache/parquet-java/pull/3041#discussion_r1827037716


##
parquet-common/src/test/java/org/apache/parquet/bytes/AvailableAgnosticInputStream.java:
##
@@ -0,0 +1,35 @@
+/*
+ *  Licensed to the Apache Software Foundation (ASF) under one
+ *  or more contributor license agreements.  See the NOTICE file
+ *  distributed with this work for additional information
+ *  regarding copyright ownership.  The ASF licenses this file
+ *  to you under the Apache License, Version 2.0 (the
+ *  "License"); you may not use this file except in compliance
+ *  with the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ *  Unless required by applicable law or agreed to in writing,
+ *  software distributed under the License is distributed on an
+ *  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ *  KIND, either express or implied.  See the License for the
+ *  specific language governing permissions and limitations
+ *  under the License.
+ */
+
+package org.apache.parquet.bytes;
+
+import java.io.ByteArrayInputStream;
+
+public class AvailableAgnosticInputStream extends ByteArrayInputStream {
+
+  public AvailableAgnosticInputStream(byte[] buf) {
+super(buf);
+  }
+
+  // In practice, there are some implementations always return 0 even if they 
has more data

Review Comment:
   in my case, the underlying `IntputStream` is `H1SeekableInputStream`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@parquet.apache.org
For additional commands, e-mail: issues-h...@parquet.apache.org