Re: Pig Streaming with Python Scripts

Alan Gates Fri, 08 Oct 2010 12:11:53 -0700

I don't think Pig understands that this is a Python script. What happens if you put #!/bin/python (or whatever is appropriate on your system) at the beginning of your GroupStreamer? Alternatively, you could explicitly call python on this file in your command by saying


STREAM test THROUGH `/bin/python GroupStreamer`
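
For the first option, a minimal sketch of what the streaming script might look like with an interpreter line at the top. The `/usr/bin/env python` path is an assumption (use whatever points at Python on your system), and the `stream` helper is a hypothetical name introduced here for illustration, not part of the original script:

```python
#!/usr/bin/env python
# Assumed interpreter line: without one, the shell tries to run the file
# itself, which would explain the "import: command not found" error.
import sys


def stream(infile, outfile):
    """Copy every input line to the output unchanged (identity streamer)."""
    for line in infile:
        outfile.write(line)


if __name__ == "__main__":
    stream(sys.stdin, sys.stdout)
```

For the interpreter line to take effect, the file also has to be executable (for example, chmod +x on the script).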


Alan.

On Oct 6, 2010, at 6:09 PM, felix gao wrote:

I have a python script defined as
import sys

for line in sys.stdin:
   if not line:
       break
   sys.stdout.write(line)

my data test looks like
({(19199vzFj6+uRbJf,7388,9074,50|22598,1267739954,0.0020,365,1,1)},1L)


my pig script is

temp = STREAM test THROUGH GroupStreamer as (test_bag:chararray, num_entries: long );

When I run that, my job fails with
===== Task Information Header =====
Command: TestStream.py (stdin-org.apache.pig.builtin.PigStreaming/stdout-org.apache.pig.builtin.PigStreaming)
Start time: Wed Oct 06 17:57:52 PDT 2010
=====          * * *          =====
/Users/felixgao/Documents/data/logs/TestStream.py: line 1: import: command not found
/Users/felixgao/Documents/data/logs/TestStream.py: line 9: syntax error near unexpected token `if'
/Users/felixgao/Documents/data/logs/TestStream.py: line 9: `    if not line:'
2010-10-06 17:57:52,690 [Thread-21] ERROR org.apache.pig.impl.streaming.ExecutableManager - 'TestStream.py ' failed with exit status: 2
2010-10-06 17:57:52,697 [Thread-14] WARN org.apache.hadoop.mapred.LocalJobRunner - job_local_0001
org.apache.pig.backend.executionengine.ExecException: ERROR 2090: Received Error while processing the reduce plan: 'TestStream.py ' failed with exit status: 2
   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.runPipeline(PigMapReduce.java:465)
   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.processOnePackageOutput(PigMapReduce.java:401)
   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:381)
   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:250)
   at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
   at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:566)
   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
   at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
What did I do wrong here?




Another question: if I specify the schema for the alias as
temp = STREAM Test THROUGH GroupStreamer
as (test_grp_cnt:bag {test_none_zero: tuple(f1:chararray, f2:int,f3:int,f4:chararray, f5:int, f6:double, f7:int, f8:int, f9:int)}, num_entries: long  );
I get
2010-10-06 17:38:57,092 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Encountered " ";" "; "" at line 76, column 179.
Was expecting one of:
   ")" ...
   "," ...

What is the correct way of specifying a bag of tuples based on my data
sample?

Thanks,

Felix
