If you'd like to contribute a patch to Impala, but aren't sure what you
want to work on, you can look at Impala's newbie issues:
https://issues.apache.org/jira/issues/?filter=12341668. You can find
detailed instructions on submitting patches at
https://cwiki.apache.org/confluence/display/IMPALA/Contributing+to+Impala.
This is a walkthrough of a ticket a new contributor could take on, with
(hopefully) enough detail to get you going, but not so much as to take
away the fun.

How can we fix https://issues.apache.org/jira/browse/IMPALA-5341, "File
size filter in planner tests also filters row-size"? The very first thing
to do is understand the pieces of test infrastructure that relate to this
issue.

When Impala processes a query, the query has to be "planned" before it can
run. Part of the planner's output is what you see when you run EXPLAIN on a
query. For example:

[localhost:21000] > explain select * from tpch.lineitem;
Query: explain select * from tpch.lineitem
+----------------------------------------------+
| Explain String                               |
+----------------------------------------------+
| Max Per-Host Resource Reservation: Memory=0B |
| Per-Host Resource Estimates: Memory=264.00MB |
|                                              |
| PLAN-ROOT SINK                               |
| |                                            |
| 01:EXCHANGE [UNPARTITIONED]                  |
| |                                            |
| 00:SCAN HDFS [tpch.lineitem]                 |
|    partitions=1/1 files=1 size=718.94MB      |
+----------------------------------------------+
Fetched 9 row(s) in 0.01s


Impala has planner-specific tests that focus on just this one part of the
system. You can see what these tests look like in
testdata/workloads/functional-planner/queries/PlannerTest/. For example, in
aggregation.test, one test starts with:

# basic aggregation
select count(*), count(tinyint_col), min(tinyint_col), max(tinyint_col),
sum(tinyint_col),
avg(tinyint_col)
from functional.alltypesagg
---- PLAN
PLAN-ROOT SINK
|
01:AGGREGATE [FINALIZE]
|  output: count(*), count(tinyint_col), min(tinyint_col), max(tinyint_col), sum(tinyint_col), avg(tinyint_col)
|
00:SCAN HDFS [functional.alltypesagg]
   partitions=11/11 files=11 size=814.73KB
---- DISTRIBUTEDPLAN
...

As you can see, this looks like the output of EXPLAIN. These tests are run
by running EXPLAIN on the query in the first section and diffing the result
with the plan in the second section.
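
To make the file layout concrete, here is a minimal Java sketch of how a
test case of this shape could be split into its query and its expected-plan
sections. It is not Impala's actual parser: the class and method names
(TestCaseSketch, parseSections) are hypothetical, and the only thing it
assumes about the format is that section headers are lines starting with
"---- ", as in the excerpt above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not Impala's real test-file parser.
public class TestCaseSketch {
  // Splits one test case into named sections. Lines before the first
  // "---- " header are stored under "QUERY"; '#' comment lines are skipped.
  static Map<String, List<String>> parseSections(String testCase) {
    Map<String, List<String>> sections = new LinkedHashMap<>();
    String current = "QUERY";
    sections.put(current, new ArrayList<>());
    for (String line : testCase.split("\n")) {
      if (line.startsWith("---- ")) {
        // A header such as "---- PLAN" opens a new section named "PLAN".
        current = line.substring(5).trim();
        sections.put(current, new ArrayList<>());
      } else if (!line.startsWith("#")) {
        sections.get(current).add(line);
      }
    }
    return sections;
  }

  public static void main(String[] args) {
    String testCase = String.join("\n",
        "# basic aggregation",
        "select count(*) from functional.alltypesagg",
        "---- PLAN",
        "PLAN-ROOT SINK",
        "|",
        "01:AGGREGATE [FINALIZE]");
    Map<String, List<String>> sections = parseSections(testCase);
    System.out.println(sections.get("QUERY"));
    System.out.println(sections.get("PLAN"));
  }
}
```

A test runner built on this would run EXPLAIN on the "QUERY" text and
compare the result line by line with the "PLAN" section.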

One part of the EXPLAIN output that isn't consistent is the file size.
Because this can change, text like "size=814.73KB" is replaced with just
"size=" before diffing. This covers up any differences in the file sizes,
but it also covers up differences in "row-size=" sections, which you can
see in constant-folding.test. Try changing one of the "row-size=" sections
to be much larger or smaller and see that it doesn't cause the tests to
fail.
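
The over-matching is easy to reproduce in isolation. The sketch below is
not the code in TestUtils.java; it is a hypothetical stand-in (the class
SizeFilterSketch, its constants, and its regex are all assumptions) for a
filter that looks for the substring "size=" and strips the value after it.
The second input line is an illustrative plan line, not copied from a real
.test file.

```java
// Hypothetical stand-in for a "size=" filter; names and regex are assumed,
// not copied from Impala's TestUtils.java.
public class SizeFilterSketch {
  static final String KEY_PREFIX = "size=";
  // Assumed value shape: a number with an optional unit, e.g. 814.73KB.
  static final String VALUE_REGEX = "\\d+(\\.\\d+)?[A-Za-z]*";

  // True if the line contains the key prefix anywhere.
  static boolean matches(String line) {
    return line.contains(KEY_PREFIX);
  }

  // Replaces every "size=<value>" with a bare "size=".
  static String transform(String line) {
    return line.replaceAll(KEY_PREFIX + VALUE_REGEX, KEY_PREFIX);
  }

  public static void main(String[] args) {
    String scan = "   partitions=11/11 files=11 size=814.73KB";
    String rows = "   tuple-ids=0 row-size=97B cardinality=11000";
    System.out.println(transform(scan));
    System.out.println(transform(rows));
  }
}
```

Both lines lose their value: the first becomes "size=" as intended, and the
second becomes "row-size=", because "row-size=" also contains "size=".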

The Java code that runs the planner tests is in
fe/src/test/java/org/apache/impala/planner/PlannerTest.java. Its test
methods mostly just refer to the .test files, so for instance,
constant-folding.test is referenced in:

  @Test
  public void testConstantFolding() {
    // Tests that constant folding is applied to all relevant PlanNodes and DataSinks.
    // Note that not all Exprs are printed in the explain plan, so validating those
    // via this test is currently not possible.
    TQueryOptions options = defaultQueryOptions();
    options.setExplain_level(TExplainLevel.EXTENDED);
    runPlannerTestFile("constant-folding", options);
  }

You can run this test by using:

(pushd fe && mvn -fae test -Dtest=PlannerTest#testConstantFolding)

OK, now that we have covered the background, you are ready to fix the
issue. You probably want to make the filter more restrictive, perhaps by
changing the static Strings or changing the matches() and transform()
methods in TestUtils.java. Once that's done, rerun a planner test whose
.test file includes row-size. This time, it should pass with the row-size
as written and fail if you change the row-size, as you did above.

Have fun, and don't hesitate to ask d...@impala.apache.org if you get stuck!
