(This is a cross post. I did not get any response on SO: https://stackoverflow.com/questions/44691416/why-is-apache-orc-recordreader-searchargument-not-filtering-correctly. I'm hoping someone can help me get to the bottom of the issue.)
Here is a simple program that:

1. Writes records into an ORC file.
2. Then tries to read the file back using predicate pushdown (a SearchArgument).

Questions:

1. Is this the right way to use predicate pushdown in ORC?
2. The read(..) method seems to return all the records, completely ignoring the SearchArgument. Why is that?

*Notes:*

I have not been able to find any useful unit test that demonstrates how predicate pushdown works in ORC (ORC on GitHub: <https://github.com/apache/orc/tree/9175b3e22742b5d4537f072165b863c78de23db5/java/core/src/test/org/apache/orc>), nor any clear documentation on this feature. I tried looking at the Spark <https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFilters.scala> and Presto <https://github.com/prestodb/presto/blob/master/presto-orc/src/test/java/com/facebook/presto/orc/TestCachingOrcDataSource.java#L192> code, but was not able to find anything useful.

The code below is a modified version of https://github.com/melanio/codecheese-blog-examples/tree/master/orc-examples/src/main/java/codecheese/blog/examples/orc

```java
// Imports assume orc-core plus the hive storage-api classes; package names
// may differ slightly depending on the ORC version.
import java.io.File;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.hadoop.hive.ql.io.sarg.PredicateLeaf.Type;
import org.apache.hadoop.hive.ql.io.sarg.SearchArgumentFactory;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.Reader.Options;
import org.apache.orc.RecordReader;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class TestRoundTrip {
    public static void main(String[] args) throws IOException {
        final String file = "tmp/test-round-trip.orc";
        new File(file).delete();
        final long highestX = 10000L;
        final Configuration conf = new Configuration();
        write(file, highestX, conf);
        read(file, highestX, conf);
    }

    private static void read(String file, long highestX, Configuration conf) throws IOException {
        Reader reader = OrcFile.createReader(
                new Path(file),
                OrcFile.readerOptions(conf)
        );

        // Retrieve the row whose x is "highestX - 1000". So, only 1 value
        // should've been retrieved.
        Options readerOptions = new Options(conf)
                .searchArgument(
                        SearchArgumentFactory
                                .newBuilder()
                                .equals("x", Type.LONG, highestX - 1000)
                                .build(),
                        new String[]{"x"}
                );

        RecordReader rows = reader.rows(readerOptions);
        VectorizedRowBatch batch = reader.getSchema().createRowBatch();
        while (rows.nextBatch(batch)) {
            LongColumnVector x = (LongColumnVector) batch.cols[0];
            LongColumnVector y = (LongColumnVector) batch.cols[1];
            for (int r = 0; r < batch.size; r++) {
                long xValue = x.vector[r];
                long yValue = y.vector[r];
                System.out.println(xValue + ", " + yValue);
            }
        }
        rows.close();
    }

    private static void write(String file, long highestX, Configuration conf) throws IOException {
        TypeDescription schema = TypeDescription.fromString("struct<x:int,y:int>");
        Writer writer = OrcFile.createWriter(
                new Path(file),
                OrcFile.writerOptions(conf).setSchema(schema)
        );
        VectorizedRowBatch batch = schema.createRowBatch();
        LongColumnVector x = (LongColumnVector) batch.cols[0];
        LongColumnVector y = (LongColumnVector) batch.cols[1];
        for (int r = 0; r < highestX; ++r) {
            int row = batch.size++;
            x.vector[row] = r;
            y.vector[row] = r * 3;
            // If the batch is full, write it out and start over.
            if (batch.size == batch.getMaxSize()) {
                writer.addRowBatch(batch);
                batch.reset();
            }
        }
        if (batch.size != 0) {
            writer.addRowBatch(batch);
            batch.reset();
        }
        writer.close();
    }
}
```

Thanks,
Ashwin
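P.S. One thing I considered while debugging, in case it is relevant: my understanding (which may be wrong) is that ORC evaluates a SearchArgument at row-group granularity, where a row group spans `orc.row.index.stride` rows (10,000 by default), so it can only skip whole row groups, not individual rows. Since my file holds exactly 10,000 rows, everything would land in a single row group, and the reader would have nothing it could skip. A quick sketch of that arithmetic (the `rowGroups` helper is mine, not an ORC API):

```java
public class RowGroupMath {
    // Number of row groups a file of totalRows spans, given the row index
    // stride (orc.row.index.stride, 10,000 by default). Predicate pushdown
    // can at best skip whole row groups.
    static long rowGroups(long totalRows, long stride) {
        return (totalRows + stride - 1) / stride; // ceiling division
    }

    public static void main(String[] args) {
        long stride = 10_000L;                           // ORC default stride
        System.out.println(rowGroups(10_000L, stride));  // my file: prints 1
        System.out.println(rowGroups(100_000L, stride)); // prints 10
    }
}
```

If that is right, a larger file (or a smaller stride at write time) would give the predicate more than one row group to work with, but I have not confirmed this is the cause.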
