Hi -
I have a Pig script that hangs when it streams data through multiple Python
scripts. It only hangs in multiquery mode; the script runs successfully if I
add the -no_multiquery flag, but that run is extremely slow. I am using Pig
0.7.
It seems that when I have a simple process (load, stream, save) the Pig
script works as expected. When I have a slightly more complex process (load
A, stream A -> A', generate B from A', process B to B', stream A' to output,
stream B' to output), the Pig script stalls and never completes execution.
The failure occurs in both local and mapred mode. I have been debugging in
local mode.
Here is an example of a working Pig script:
arrivals = load '$arrivals_path' using PigStorage('\t');
processed_arrivals = stream arrivals through `python -u -m ProcessArrivals` as (foo, bar, baz);
-- Replace empty strings with \N in preparation for import into MySQL
final_arrivals = stream processed_arrivals through `python -u -m Nullify`;
store final_arrivals into '$final_arrivals_path' using PigStorage('\t');
This runs to completion and produces the expected output.
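For reference, a minimal sketch of the behavior the Nullify comment describes
(assuming tab-separated rows on stdin; this is illustrative, not the actual
script):

import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    # empty strings become \N so MySQL's LOAD DATA treats them as NULL
    print "\t".join(f if f != "" else r"\N" for f in fields)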
Here is an example of a failing Pig script:
arrivals = load '$arrivals_path' using PigStorage('\t');
processed_arrivals = stream arrivals through `python -u -m ProcessArrivals` as (foo, bar, baz);
-- Derive a new data set from processed_arrivals
derived = foreach processed_arrivals generate foo, bar;
processed_derived = stream derived through `python -u -m ProcessDerived`;
-- Replace empty strings with \N in preparation for import into MySQL
final_arrivals = stream processed_arrivals through `python -u -m Nullify`;
final_derived = stream processed_derived through `python -u -m Nullify`;
store final_arrivals into '$final_arrivals_path' using PigStorage('\t');
store final_derived into '$final_derived_path' using PigStorage('\t');
When I run the failing script, I observe the following:
1. Pig log output like this:
2011-01-05 09:17:27,350 [communication thread] INFO
org.apache.hadoop.mapred.LocalJobRunner -
2011-01-05 09:17:30,351 [communication thread] INFO
org.apache.hadoop.mapred.LocalJobRunner -
2011-01-05 09:17:33,351 [communication thread] INFO
org.apache.hadoop.mapred.LocalJobRunner -
2011-01-05 09:17:36,352 [communication thread] INFO
org.apache.hadoop.mapred.LocalJobRunner -
2011-01-05 09:17:39,352 [communication thread] INFO
org.apache.hadoop.mapred.LocalJobRunner -
===== Task Information Footer =====
End time: Wed Jan 05 09:17:40 EST 2011
Exit code: 0
Input records: 93036
Input bytes: 14089001 bytes (stdin using
org.apache.pig.builtin.PigStreaming)
Output records: 200064
Output bytes: 26794153 bytes (stdout using
org.apache.pig.builtin.PigStreaming)
===== * * * =====
===== Task Information Footer =====
End time: Wed Jan 05 09:17:40 EST 2011
Exit code: 0
Input records: 200064
Input bytes: 26794153 bytes (stdin using
org.apache.pig.builtin.PigStreaming)
Output records: 200064
Output bytes: 26794153 bytes (stdout using
org.apache.pig.builtin.PigStreaming)
===== * * * =====
===== Task Information Footer =====
End time: Wed Jan 05 09:17:40 EST 2011
Exit code: 0
Input records: 93036
Input bytes: 98024714 bytes (stdin using
org.apache.pig.builtin.PigStreaming)
Output records: 93036
Output bytes: 102466442 bytes (stdout using
org.apache.pig.builtin.PigStreaming)
===== * * * =====
2011-01-05 09:17:42,353 [communication thread] INFO
org.apache.hadoop.mapred.LocalJobRunner -
2011-01-05 09:17:45,353 [communication thread] INFO
org.apache.hadoop.mapred.LocalJobRunner -
The tasks end, then there are always two more of these LocalJobRunner lines,
and then nothing. The job hangs indefinitely.
2. Before I kill the hanging process, htop indicates that "python -u -m
ProcessArrivals" is running, but using 0% CPU and 1% memory.
3. Before I kill the hanging process, I can look at the temporary files in
the output directory: the outputs for "final_arrivals" and "final_derived".
Both files terminate abruptly, as if the output hasn't been flushed. For
example, a user-agent value is cut off mid-string:
(iPhone; U; CPU iPhone OS 4_2_1 like Mac OS X; en-us) AppleWebKit/533.17.9
(KHTML, like Ge
and then the file ends.
4. When I kill the hanging process (Ctrl-C), the output is a Python
traceback:
^CTraceback (most recent call last):
File "/usr/lib64/python2.7/runpy.py", line 162, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/opt/svn/repository/ProcessArrivals.py", line 285, in <module>
print io.serialize_output(ordered)
KeyboardInterrupt
This indicates that ProcessArrivals is blocked while writing an output row to
stdout, presumably because the pipe to Pig has filled up and nothing is
draining it. The code is nothing special: "ordered" is a Python list, and
io.serialize_output is defined as:
def serialize_output(row):
    "Returns a string representation of data to output."
    return "\t".join([str(x) for x in row])
So nothing out of the ordinary there.
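For context, here is the shape of the call site from the traceback. The
surrounding loop and the per-row processing are my reconstruction, not the
actual ProcessArrivals code:

import sys

def serialize_output(row):
    "Returns a string representation of data to output."
    return "\t".join([str(x) for x in row])

for line in sys.stdin:
    # stand-in for whatever ProcessArrivals actually computes per row
    ordered = line.rstrip("\n").split("\t")
    # the traceback shows the process blocked on this print, i.e. on a
    # write to stdout that never completes while nothing reads the pipe
    print serialize_output(ordered)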
I have verified that piping the data through the Python scripts on the
command line completes successfully. Because it looked like a buffering
issue, I tried running Python in both buffered (the default) and unbuffered
(-u) mode, to no avail.
Any thoughts on what I could try to make this run in multiquery mode, or on
what the problem might be, are very much appreciated.
Thanks,
- Charles