Re: [PR] GH-3040: DictionaryFilter.canDrop may return false positive result when dict size exceeds 8k [parquet-java]
Fokko merged PR #3041: URL: https://github.com/apache/parquet-java/pull/3041 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@parquet.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@parquet.apache.org For additional commands, e-mail: issues-h...@parquet.apache.org
Re: [PR] GH-3040: DictionaryFilter.canDrop may return false positive result when dict size exceeds 8k [parquet-java]
Fokko commented on PR #3041: URL: https://github.com/apache/parquet-java/pull/3041#issuecomment-2460726171 Thanks @pan3793 for finding and fixing this, and thanks @wgtmac @ConeyLiu and @gszadovszky for the review 🙌 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@parquet.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@parquet.apache.org For additional commands, e-mail: issues-h...@parquet.apache.org
Re: [PR] GH-3040: DictionaryFilter.canDrop may return false positive result when dict size exceeds 8k [parquet-java]
pan3793 commented on code in PR #3041: URL: https://github.com/apache/parquet-java/pull/3041#discussion_r1829273177 ## parquet-common/src/main/java/org/apache/parquet/bytes/BytesInput.java: ## @@ -376,7 +378,15 @@ void writeInto(ByteBuffer buffer) { ByteBuffer workBuf = buffer.duplicate(); int pos = buffer.position(); workBuf.limit(pos + byteCount); -Channels.newChannel(in).read(workBuf); Review Comment: I went through the [original PR](https://github.com/apache/parquet-java/pull/1278) and found nothing else, it would be great if others have a double check. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@parquet.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@parquet.apache.org For additional commands, e-mail: issues-h...@parquet.apache.org
Re: [PR] GH-3040: DictionaryFilter.canDrop may return false positive result when dict size exceeds 8k [parquet-java]
ConeyLiu commented on code in PR #3041: URL: https://github.com/apache/parquet-java/pull/3041#discussion_r1829249295 ## parquet-common/src/main/java/org/apache/parquet/bytes/BytesInput.java: ## @@ -376,7 +378,15 @@ void writeInto(ByteBuffer buffer) { ByteBuffer workBuf = buffer.duplicate(); int pos = buffer.position(); workBuf.limit(pos + byteCount); -Channels.newChannel(in).read(workBuf); Review Comment: Is there any other place that is used like this? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@parquet.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@parquet.apache.org For additional commands, e-mail: issues-h...@parquet.apache.org
Re: [PR] GH-3040: DictionaryFilter.canDrop may return false positive result when dict size exceeds 8k [parquet-java]
Fokko commented on code in PR #3041: URL: https://github.com/apache/parquet-java/pull/3041#discussion_r1827579758 ## parquet-common/src/test/java/org/apache/parquet/bytes/AvailableAgnosticInputStream.java: ## @@ -0,0 +1,35 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.parquet.bytes; + +import java.io.ByteArrayInputStream; + +public class AvailableAgnosticInputStream extends ByteArrayInputStream { + + public AvailableAgnosticInputStream(byte[] buf) { +super(buf); + } + + // In practice, there are some implementations always return 0 even if they has more data Review Comment: Thanks, this is very helpful! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@parquet.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@parquet.apache.org For additional commands, e-mail: issues-h...@parquet.apache.org
Re: [PR] GH-3040: DictionaryFilter.canDrop may return false positive result when dict size exceeds 8k [parquet-java]
pan3793 commented on code in PR #3041: URL: https://github.com/apache/parquet-java/pull/3041#discussion_r1827388657 ## parquet-common/src/test/java/org/apache/parquet/bytes/AvailableAgnosticInputStream.java: ## @@ -0,0 +1,35 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.parquet.bytes; + +import java.io.ByteArrayInputStream; + +public class AvailableAgnosticInputStream extends ByteArrayInputStream { + + public AvailableAgnosticInputStream(byte[] buf) { +super(buf); + } + + // In practice, there are some implementations always return 0 even if they has more data Review Comment: My test code is (sorry, the file contains private data so I can not share) ``` public class MyDictionaryFilterTest { private static final Configuration conf = new Configuration(); List ccmd; ParquetFileReader reader; DictionaryPageReadStore dictionaries; private Path file = new Path("/Users/chengpan/Temp/part-2bb8404a-f6e5-4e9f-9161-f749c4bf46d0-2-"); @Before public void setUp() throws Exception { reader = ParquetFileReader.open(conf, file); ParquetMetadata meta = reader.getFooter(); ccmd = meta.getBlocks().get(0).getColumns(); dictionaries = reader.getDictionaryReader(meta.getBlocks().get(0)); } @After public void tearDown() throws Exception { reader.close(); } @Test public void testEqBinary() throws Exception { BinaryColumn b = binaryColumn("source_id"); FilterPredicate pred = eq(b, Binary.fromString("5059661515")); assertFalse(canDrop(pred, ccmd, dictionaries)); } } ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@parquet.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@parquet.apache.org For additional commands, e-mail: issues-h...@parquet.apache.org
Re: [PR] GH-3040: DictionaryFilter.canDrop may return false positive result when dict size exceeds 8k [parquet-java]
Fokko commented on code in PR #3041: URL: https://github.com/apache/parquet-java/pull/3041#discussion_r1827384881 ## parquet-common/src/test/java/org/apache/parquet/bytes/AvailableAgnosticInputStream.java: ## @@ -0,0 +1,35 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.parquet.bytes; + +import java.io.ByteArrayInputStream; + +public class AvailableAgnosticInputStream extends ByteArrayInputStream { + + public AvailableAgnosticInputStream(byte[] buf) { +super(buf); + } + + // In practice, there are some implementations always return 0 even if they has more data Review Comment: Out of curiosity. Why are you using `H1SeekableInputStream`? This one is related to Hadoop 1. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@parquet.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@parquet.apache.org For additional commands, e-mail: issues-h...@parquet.apache.org
Re: [PR] GH-3040: DictionaryFilter.canDrop may return false positive result when dict size exceeds 8k [parquet-java]
pan3793 commented on code in PR #3041: URL: https://github.com/apache/parquet-java/pull/3041#discussion_r1827221056 ## parquet-common/src/test/java/org/apache/parquet/bytes/TestBytesInput.java: ## @@ -140,6 +140,20 @@ public void testFromInputStream() throws IOException { validate(data, factory); } + @Test + public void testFromLargeAvailableAgnosticInputStream() throws IOException { +// allocate a bytes that large than +// java.nio.channel.Channels.ReadableByteChannelImpl.TRANSFER_SIZE = 8192 +byte[] data = new byte[9 * 1024]; +RANDOM.nextBytes(data); +byte[] input = new byte[data.length + 10]; +RANDOM.nextBytes(input); +System.arraycopy(data, 0, input, 0, data.length); +Supplier factory = () -> BytesInput.from(new AvailableAgnosticInputStream(input), 9 * 1024); Review Comment: I tend to use a new file as it is a quite common case that needs to be tested, it might be used in other places in the future. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@parquet.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@parquet.apache.org For additional commands, e-mail: issues-h...@parquet.apache.org
Re: [PR] GH-3040: DictionaryFilter.canDrop may return false positive result when dict size exceeds 8k [parquet-java]
pan3793 commented on code in PR #3041: URL: https://github.com/apache/parquet-java/pull/3041#discussion_r1827220535 ## parquet-common/src/main/java/org/apache/parquet/bytes/BytesInput.java: ## @@ -376,7 +377,11 @@ void writeInto(ByteBuffer buffer) { ByteBuffer workBuf = buffer.duplicate(); int pos = buffer.position(); workBuf.limit(pos + byteCount); -Channels.newChannel(in).read(workBuf); +ReadableByteChannel channel = Channels.newChannel(in); +int remaining = byteCount; +while (remaining > 0) { + remaining -= channel.read(workBuf); Review Comment: added a check to detect EOF case -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@parquet.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@parquet.apache.org For additional commands, e-mail: issues-h...@parquet.apache.org
Re: [PR] GH-3040: DictionaryFilter.canDrop may return false positive result when dict size exceeds 8k [parquet-java]
wgtmac commented on code in PR #3041: URL: https://github.com/apache/parquet-java/pull/3041#discussion_r1827206140 ## parquet-common/src/main/java/org/apache/parquet/bytes/BytesInput.java: ## @@ -376,7 +377,11 @@ void writeInto(ByteBuffer buffer) { ByteBuffer workBuf = buffer.duplicate(); int pos = buffer.position(); workBuf.limit(pos + byteCount); -Channels.newChannel(in).read(workBuf); +ReadableByteChannel channel = Channels.newChannel(in); +int remaining = byteCount; +while (remaining > 0) { + remaining -= channel.read(workBuf); Review Comment: Is `remaining` reliable? Should we check the return value of `channel.read(workBuf)`? ## parquet-common/src/test/java/org/apache/parquet/bytes/TestBytesInput.java: ## @@ -140,6 +140,20 @@ public void testFromInputStream() throws IOException { validate(data, factory); } + @Test + public void testFromLargeAvailableAgnosticInputStream() throws IOException { +// allocate a bytes that large than +// java.nio.channel.Channels.ReadableByteChannelImpl.TRANSFER_SIZE = 8192 +byte[] data = new byte[9 * 1024]; +RANDOM.nextBytes(data); +byte[] input = new byte[data.length + 10]; +RANDOM.nextBytes(input); +System.arraycopy(data, 0, input, 0, data.length); +Supplier factory = () -> BytesInput.from(new AvailableAgnosticInputStream(input), 9 * 1024); Review Comment: What about using an anonymous class here instead of adding a new file? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@parquet.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@parquet.apache.org For additional commands, e-mail: issues-h...@parquet.apache.org
Re: [PR] GH-3040: DictionaryFilter.canDrop may return false positive result when dict size exceeds 8k [parquet-java]
pan3793 commented on PR #3041: URL: https://github.com/apache/parquet-java/pull/3041#issuecomment-2453525098 cc @gszadovszky @wgtmac @Fokko -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@parquet.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@parquet.apache.org For additional commands, e-mail: issues-h...@parquet.apache.org
Re: [PR] GH-3040: DictionaryFilter.canDrop may return false positive result when dict size exceeds 8k [parquet-java]
pan3793 commented on code in PR #3041: URL: https://github.com/apache/parquet-java/pull/3041#discussion_r1827037716 ## parquet-common/src/test/java/org/apache/parquet/bytes/AvailableAgnosticInputStream.java: ## @@ -0,0 +1,35 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.parquet.bytes; + +import java.io.ByteArrayInputStream; + +public class AvailableAgnosticInputStream extends ByteArrayInputStream { + + public AvailableAgnosticInputStream(byte[] buf) { +super(buf); + } + + // In practice, there are some implementations always return 0 even if they has more data Review Comment: in my case, the underlying `IntputStream` is `H1SeekableInputStream` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@parquet.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@parquet.apache.org For additional commands, e-mail: issues-h...@parquet.apache.org