Pig Streaming with Python Scripts

felix gao Wed, 06 Oct 2010 18:09:41 -0700

I have a Python script defined as:
import sys

for line in sys.stdin:
    if not line:
        break
    sys.stdout.write(line)


My data, in the relation test, looks like:
({(19199vzFj6+uRbJf,7388,9074,50|22598,1267739954,0.0020,365,1,1)},1L)


My Pig script is:

temp = STREAM test THROUGH GroupStreamer as
(test_bag:chararray, num_entries:long);

When I run it, the job fails with:
===== Task Information Header =====
Command: TestStream.py
(stdin-org.apache.pig.builtin.PigStreaming/stdout-org.apache.pig.builtin.PigStreaming)
Start time: Wed Oct 06 17:57:52 PDT 2010
=====          * * *          =====
/Users/felixgao/Documents/data/logs/TestStream.py: line 1: import: command
not found
/Users/felixgao/Documents/data/logs/TestStream.py: line 9: syntax error near
unexpected token `if'
/Users/felixgao/Documents/data/logs/TestStream.py: line 9: `    if not
line:'
2010-10-06 17:57:52,690 [Thread-21] ERROR
org.apache.pig.impl.streaming.ExecutableManager - 'TestStream.py ' failed
with exit status: 2
2010-10-06 17:57:52,697 [Thread-14] WARN
org.apache.hadoop.mapred.LocalJobRunner - job_local_0001
org.apache.pig.backend.executionengine.ExecException: ERROR 2090: Received
Error while processing the reduce plan: 'TestStream.py ' failed with exit
status: 2
    at
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.runPipeline(PigMapReduce.java:465)
    at
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.processOnePackageOutput(PigMapReduce.java:401)
    at
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:381)
    at
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:250)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
    at
org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:566)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
    at
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)

What did I do wrong here?
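
For reference, the "import: command not found" and "syntax error near unexpected token `if'" messages come from a shell, not from Python, which suggests TestStream.py is being executed without a Python interpreter. Below is a minimal sketch of the same pass-through script, assuming the missing piece is an interpreter (shebang) line at the top. The passthrough function name is only for illustration, and the `if not line: break` check is dropped because iterating over a file never yields an empty string.

```python
#!/usr/bin/env python
# Sketch of TestStream.py with an interpreter line added. This is an
# assumption about the fix: the script as posted starts with "import sys",
# so the OS has no way to know it should be run by Python.
import sys


def passthrough(instream, outstream):
    """Copy every input line to the output unchanged."""
    for line in instream:
        outstream.write(line)


if __name__ == "__main__":
    # Pig streaming feeds records on stdin and reads results from stdout.
    passthrough(sys.stdin, sys.stdout)
```

With a shebang line the file would also need to be executable (chmod +x TestStream.py). Another option, if the DEFINE for GroupStreamer can be changed, would be to put the interpreter in the command itself, for example `python TestStream.py`.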




My second question: if I specify the output schema as
temp = STREAM Test THROUGH GroupStreamer
as (test_grp_cnt:bag {test_none_zero: tuple(f1:chararray, f2:int, f3:int,
f4:chararray, f5:int, f6:double, f7:int, f8:int, f9:int)}, num_entries:
long);
I get:
2010-10-06 17:38:57,092 [main] ERROR org.apache.pig.tools.grunt.Grunt -
ERROR 1000: Error during parsing. Encountered " ";" "; "" at line 76, column
179.
Was expecting one of:
    ")" ...
    "," ...

What is the correct way of specifying a bag of tuples based on my data
sample?

Thanks,

Felix
