Cheolsoo Park created PIG-3956: ---------------------------------- Summary: UDF profile is often misleading Key: PIG-3956 URL: https://issues.apache.org/jira/browse/PIG-3956 Project: Pig Issue Type: Bug Components: impl Reporter: Cheolsoo Park
Pig provides UDF profiling via the {{pig.udf.profile}} perperty. How it works is that it samples every 100th UDF invocation and increments a counter by 100 x elapsed time. {code} private final static int TIMING_FREQ = 100; ... PigStatusReporter.getInstance().incrCounter(counterGroup, TIMING_COUNTER, Math.round((System.nanoTime() - startNanos) / 1000) * TIMING_FREQ); {code} The problem is that this assumes every invocation takes about the same time, which doesn't hold true oftentimes. Here is an example. TIMING_FREQ = 100. {code} set pig.udf.profile true; set opt.fetch false; l = LOAD 'foo' AS (value:chararray); f = FOREACH l GENERATE LOWER(value); DUMP f; {code} This job gives me the following counters- {code} org.apache.pig.builtin.LOWER - approx_invocations: 6,144,600 org.apache.pig.builtin.LOWER - approx_microsecs: 130,913,800 {code} However, if I change TIMING_FREQ to 1, I get the followings- {code} org.apache.pig.builtin.LOWER - approx_invocations: 6,144,248 org.apache.pig.builtin.LOWER - approx_microsecs: 19,711,842 {code} As can be seen, approx_microsecs is 6x larger when sampling every 100th invocation as compared to when sampling every invocation. -- This message was sent by Atlassian JIRA (v6.2#6252)