If you'd like to contribute a patch to Impala but aren't sure what to work on, you can look at Impala's newbie issues: https://issues.apache.org/jira/issues/?filter=12341668. You can find detailed instructions on submitting patches at https://cwiki.apache.org/confluence/display/IMPALA/Contributing+to+Impala. This is a walkthrough of a ticket a new contributor could take on, with hopefully enough detail to get you going but not so much as to take away the fun.

How can we fix https://issues.apache.org/jira/browse/IMPALA-5341, "File size filter in planner tests also filters row-size"? The very first thing to do is understand the pieces of test infrastructure that relate to this issue.

When Impala processes a query, before running it, the query has to be "planned". Part of the output of that is what you see when you EXPLAIN SELECT. For example:

    [localhost:21000] > explain select * from tpch.lineitem;
    Query: explain select * from tpch.lineitem
    +----------------------------------------------+
    | Explain String                               |
    +----------------------------------------------+
    | Max Per-Host Resource Reservation: Memory=0B |
    | Per-Host Resource Estimates: Memory=264.00MB |
    |                                              |
    | PLAN-ROOT SINK                               |
    | |                                            |
    | 01:EXCHANGE [UNPARTITIONED]                  |
    | |                                            |
    | 00:SCAN HDFS [tpch.lineitem]                 |
    |    partitions=1/1 files=1 size=718.94MB      |
    +----------------------------------------------+
    Fetched 9 row(s) in 0.01s

Impala has planner-specific tests that focus on just this one part of the system. You can see what these tests look like in testdata/workloads/functional-planner/queries/PlannerTest/. For example, in aggregation.test, one test starts with:

    # basic aggregation
    select count(*), count(tinyint_col), min(tinyint_col), max(tinyint_col), sum(tinyint_col),
    avg(tinyint_col)
    from functional.alltypesagg
    ---- PLAN
    PLAN-ROOT SINK
    |
    01:AGGREGATE [FINALIZE]
    |  output: count(*), count(tinyint_col), min(tinyint_col), max(tinyint_col), sum(tinyint_col), avg(tinyint_col)
    |
    00:SCAN HDFS [functional.alltypesagg]
       partitions=11/11 files=11 size=814.73KB
    ---- DISTRIBUTEDPLAN
    ...

As you can see, this looks like the output of EXPLAIN. These tests are run by running EXPLAIN on the query in the first section and diffing the result with the plan in the second section.

One part of the EXPLAIN output that isn't consistent is the file size. Because this can change, text like "size=814.73KB" is replaced with just "size=" before diffing.
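
To make the replace-then-diff step concrete, here is a minimal sketch of that kind of filter. It is not the code from TestUtils.java: the class name SizeFilterSketch, the regular expression, and the helper sameAfterFiltering() are all assumptions made for illustration. Only the method names matches() and transform() are borrowed from the real file.

```java
import java.util.regex.Pattern;

// Hypothetical sketch of a "strip the size value before diffing" filter.
// This is NOT the actual Impala implementation; the names and the regex
// below are assumptions chosen only to illustrate the idea.
public class SizeFilterSketch {
    // Assumed pattern: the text "size=" followed by a number and a byte unit.
    private static final Pattern SIZE_VALUE =
        Pattern.compile("size=[0-9]+(\\.[0-9]+)?(B|KB|MB|GB|TB)");

    // Returns true if the line contains something the filter wants to rewrite.
    public static boolean matches(String line) {
        return line.contains("size=");
    }

    // Drops the value so that only "size=" remains.
    public static String transform(String line) {
        return SIZE_VALUE.matcher(line).replaceAll("size=");
    }

    // Compares an expected plan line with an actual one after filtering both.
    public static boolean sameAfterFiltering(String expected, String actual) {
        String e = matches(expected) ? transform(expected) : expected;
        String a = matches(actual) ? transform(actual) : actual;
        return e.equals(a);
    }

    public static void main(String[] args) {
        String expected = "partitions=11/11 files=11 size=814.73KB";
        String actual = "partitions=11/11 files=11 size=820.10KB";
        System.out.println(transform(expected));
        System.out.println(sameAfterFiltering(expected, actual));
    }
}
```

With this sketch, the two lines in main() compare as equal even though the file sizes differ, which is the intended effect. Note that the assumed pattern is not anchored in any way, so it rewrites the value after any occurrence of the text "size=" on a line.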

This replacement covers up any differences in the file sizes, but it also covers up differences in "row-size=" sections, which you can see in constant-folding.test. Try changing one of the "row-size=" sections to be much larger or smaller and see that it doesn't cause the tests to fail.

The tests themselves are defined in fe/src/test/java/org/apache/impala/planner/PlannerTest.java. They mostly just refer to the .test files, so for instance, constant-folding.test is referenced in

    @Test
    public void testConstantFolding() {
      // Tests that constant folding is applied to all relevant PlanNodes and DataSinks.
      // Note that not all Exprs are printed in the explain plan, so validating those
      // via this test is currently not possible.
      TQueryOptions options = defaultQueryOptions();
      options.setExplain_level(TExplainLevel.EXTENDED);
      runPlannerTestFile("constant-folding", options);
    }

You can run this test by using:

    (pushd fe && mvn -fae test -Dtest=PlannerTest#testConstantFolding)

OK, now that we have covered the background, you are ready to fix the issue. You probably want to make the filter more restrictive, perhaps by changing the static Strings or changing the matches() and transform() methods in TestUtils.java.

Once that's done, rerun a planner test whose expected plan includes row-size. This time, it should pass with the row-size as written and fail if you change the row-size, as you did above.

Have fun, and don't hesitate to ask d...@impala.apache.org if you get stuck!