I don't think Pig understands that this is a Python script. What
happens if you put #!/bin/python (or whatever is appropriate on your
system) at the beginning of your GroupStreamer? Alternatively, you
could explicitly call python on this file in your command by saying:
STREAM test THROUGH `/bin/python GroupStreamer`
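For the first option, a minimal sketch of the script with an interpreter
line added might look like the following. The /usr/bin/env path and the
echo_lines name are my assumptions for illustration, not something taken
from your setup:

```python
#!/usr/bin/env python
# The line above tells the OS to run this file with Python. Without it,
# the file is handed to the shell, which is where errors such as
# "import: command not found" come from.
# Assumption: /usr/bin/env exists and "python" is on the PATH.
import sys


def echo_lines(src, dst):
    """Copy every line from src to dst unchanged (hypothetical helper)."""
    for line in src:
        dst.write(line)


if __name__ == "__main__":
    # Pig writes input records to stdin and reads results from stdout.
    echo_lines(sys.stdin, sys.stdout)
```

The file also has to be executable (chmod +x GroupStreamer) for the #!
line to take effect.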
Alan.
On Oct 6, 2010, at 6:09 PM, felix gao wrote:
I have a Python script defined as:
import sys
for line in sys.stdin:
    if not line:
        break
    sys.stdout.write(line)
My data, test, looks like:
({(19199vzFj6+uRbJf,7388,9074,50|22598,1267739954,0.0020,365,1,1)},1L)
My Pig script is:
temp = STREAM test THROUGH GroupStreamer as
(test_bag:chararray, num_entries: long );
When I run that, my job fails with:
===== Task Information Header =====
Command: TestStream.py (stdin-org.apache.pig.builtin.PigStreaming/stdout-org.apache.pig.builtin.PigStreaming)
Start time: Wed Oct 06 17:57:52 PDT 2010
===== * * * =====
/Users/felixgao/Documents/data/logs/TestStream.py: line 1: import: command not found
/Users/felixgao/Documents/data/logs/TestStream.py: line 9: syntax error near unexpected token `if'
/Users/felixgao/Documents/data/logs/TestStream.py: line 9: ` if not line:'
2010-10-06 17:57:52,690 [Thread-21] ERROR org.apache.pig.impl.streaming.ExecutableManager - 'TestStream.py ' failed with exit status: 2
2010-10-06 17:57:52,697 [Thread-14] WARN org.apache.hadoop.mapred.LocalJobRunner - job_local_0001
org.apache.pig.backend.executionengine.ExecException: ERROR 2090: Received Error while processing the reduce plan: 'TestStream.py ' failed with exit status: 2
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.runPipeline(PigMapReduce.java:465)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.processOnePackageOutput(PigMapReduce.java:401)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:381)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:250)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:566)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
What did I do wrong here?
Another question: if I specify the schema by alias as
temp = STREAM Test THROUGH GroupStreamer
as (test_grp_cnt:bag {test_none_zero: tuple(f1:chararray, f2:int, f3:int,
f4:chararray, f5:int, f6:double, f7:int, f8:int, f9:int)},
num_entries: long );
I get:
2010-10-06 17:38:57,092 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Encountered " ";" "; "" at line 76, column 179.
Was expecting one of:
")" ...
"," ...
What is the correct way of specifying a bag of tuples based on my data
sample?
Thanks,
Felix