I don't think Pig understands that this is a Python script. The "import: command not found" message in your log suggests the file is being run by the shell rather than by Python. What happens if you put #!/bin/python (or whatever is appropriate on your system) at the beginning of your GroupStreamer? Alternatively, you could explicitly call python on this file in your command by saying

STREAM test THROUGH `/bin/python GroupStreamer`
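
For the first option, here is a minimal sketch of what the script could look like with an interpreter line added. The #!/usr/bin/env python path is an assumption (use whatever points at Python on your machine), and the stream() helper is just a name I made up for illustration. The file would also need to be executable (chmod +x) for the interpreter line to take effect.

```python
#!/usr/bin/env python
# Assumption: "python" is on the PATH of the user running the task.
# With this first line the OS hands the file to Python instead of the shell.
import sys


def stream(lines, out):
    """Copy every input line to the output unchanged (hypothetical helper)."""
    for line in lines:
        out.write(line)


if __name__ == "__main__":
    # Pig writes records to the script's stdin and reads results from stdout.
    stream(sys.stdin, sys.stdout)
```

Iterating over sys.stdin stops by itself at end of input, so the "if not line: break" check in the original is not needed.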

Alan.

On Oct 6, 2010, at 6:09 PM, felix gao wrote:

I have a python script defined as
import sys

for line in sys.stdin:
   if not line:
       break
   sys.stdout.write(line)

my data test looks like
({(19199vzFj6+uRbJf,7388,9074,50|22598,1267739954,0.0020,365,1,1)},1L)


my pig script is

temp = STREAM test THROUGH GroupStreamer as
(test_bag:chararray, num_entries: long );

When I run that, my job fails with
===== Task Information Header =====
Command: TestStream.py
(stdin-org.apache.pig.builtin.PigStreaming/stdout- org.apache.pig.builtin.PigStreaming)
Start time: Wed Oct 06 17:57:52 PDT 2010
=====          * * *          =====
/Users/felixgao/Documents/data/logs/TestStream.py: line 1: import: command not found
/Users/felixgao/Documents/data/logs/TestStream.py: line 9: syntax error near unexpected token `if'
/Users/felixgao/Documents/data/logs/TestStream.py: line 9: `    if not line:'
2010-10-06 17:57:52,690 [Thread-21] ERROR org.apache.pig.impl.streaming.ExecutableManager - 'TestStream.py ' failed with exit status: 2
2010-10-06 17:57:52,697 [Thread-14] WARN org.apache.hadoop.mapred.LocalJobRunner - job_local_0001
org.apache.pig.backend.executionengine.ExecException: ERROR 2090: Received Error while processing the reduce plan: 'TestStream.py ' failed with exit status: 2
   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.runPipeline(PigMapReduce.java:465)
   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.processOnePackageOutput(PigMapReduce.java:401)
   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:381)
   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:250)
   at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
   at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:566)
   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
   at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)

What did I do wrong here?




Another question: if I specify the schema as
temp = STREAM Test THROUGH GroupStreamer
as (test_grp_cnt:bag {test_none_zero: tuple(f1:chararray, f2:int, f3:int, f4:chararray, f5:int, f6:double, f7:int, f8:int, f9:int)}, num_entries: long );
I get
2010-10-06 17:38:57,092 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Encountered " ";" "; "" at line 76, column 179.
Was expecting one of:
   ")" ...
   "," ...

What is the correct way of specifying a bag of tuples based on my data
sample?

Thanks,

Felix
