Re: [PERFORM] Greenplum MapReduce

2009-08-03 Thread Richard Huxton

Suvankar Roy wrote:

Hi all,

Has anybody worked on Greenplum MapReduce programming ?

I am facing a problem while trying to execute the below Greenplum 
Mapreduce program written in YAML (in blue). 


The other poster suggested contacting Greenplum and I can only agree.


The error is thrown in the 7th line as:
Error: YAML syntax error - found character that cannot start any token 
while scanning for the next token, at line 7 (in red)


There is no red, particularly if viewing messages as plain text (which 
most people do on mailing lists). Consider indicating the line some other 
way next time; commonly, you put a marker directly below the line, 
something like "this is line 7 ^".
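
For anyone preparing an excerpt to post, the marking suggested above can be scripted. This is only a rough sketch: `mark_line` is a hypothetical helper name, and the wording of the marker is an assumption, not a list convention.

```python
def mark_line(text, target, note="this is the line with the error"):
    """Return text with a caret marker inserted directly below line `target` (1-based)."""
    out = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        out.append(line)
        if lineno == target:
            # The marker survives plain-text mail, unlike colour.
            out.append("^ " + note)
    return "\n".join(out)


excerpt = "%YAML 1.1\n---\nVERSION: 1.0.0.1\nDATABASE: test_db1\nUSER: gpadmin\nDEFINE:\n- INPUT:\nNAME: doc"
print(mark_line(excerpt, 7, "this is line 7"))
```

Only the lines around the marked one need to be posted; the rest of the file can usually be cut.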


The most common problem I get with YAML files though is when a tab is 
accidentally inserted instead of spaces at the start of a line.
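
YAML does not allow tab characters in indentation, so one quick check is to scan the leading whitespace of every line. A minimal sketch, assuming the file has been read into a string; `find_indent_tabs` and the sample text are made up for illustration and are not part of Greenplum or any YAML library.

```python
def find_indent_tabs(text):
    """Return (line_number, column) for each tab found in a line's leading whitespace."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch == "\t":
                hits.append((lineno, col))
            elif ch != " ":
                break  # first non-whitespace character: indentation ends here
    return hits


# Hypothetical sample: the third line is indented with a tab followed by spaces.
sample = "DEFINE:\n  - INPUT:\n\t  NAME: doc\n      TABLE: documents\n"
for lineno, col in find_indent_tabs(sample):
    print("tab in indentation at line %d, column %d" % (lineno, col))
```

Tabs that appear after the first non-blank character of a line are deliberately ignored here, since only the indentation is being checked.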


--
  Richard Huxton
  Archonet Ltd

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Greenplum MapReduce

2009-08-03 Thread Richard Huxton

Suvankar Roy wrote:

Hi Richard,

I sincerely regret the inconvenience caused.


No big inconvenience, but the lists can be very busy sometimes and the 
easier you make it for people to answer your questions the better the 
answers you will get.



%YAML 1.1
---
VERSION: 1.0.0.1
DATABASE: test_db1
USER: gpadmin
DEFINE:
  - INPUT: #** This is the line which is causing the error **#
      NAME: doc
      TABLE: documents

If it looks fine, always check for tabs. Oh, and you could have cut out 
all the rest of the file, really.


I have learnt that unnecessary TABs can be the cause of this, so I am 
trying to eliminate them; hopefully the problem will go away then.


I'm always getting this. It's easy to accidentally introduce a tab 
character when reformatting YAML. It might be worth checking if your 
text editor has an option to always replace tabs with spaces.
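
If the editor cannot do this, the same clean-up can be scripted. The following is a sketch only: `detab_indentation` is a hypothetical helper, and the width of two spaces per tab is an assumption, so the result should still be checked against the nesting the file is meant to have.

```python
def detab_indentation(text, width=2):
    """Expand tabs to spaces in the leading whitespace of each line only."""
    out = []
    for line in text.splitlines(keepends=True):
        stripped = line.lstrip(" \t")
        indent = line[: len(line) - len(stripped)]
        # str.expandtabs pads each tab up to the next multiple of `width`.
        out.append(indent.expandtabs(width) + stripped)
    return "".join(out)


# Hypothetical sample with tab-indented lines.
sample = "DEFINE:\n\t- INPUT:\n\t\tNAME: doc\n"
print(detab_indentation(sample))
```

Tabs inside values are left untouched, so a block scalar that contains literal tabs after its first non-blank character keeps them.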


--
  Richard Huxton
  Archonet Ltd



Re: [PERFORM] Greenplum MapReduce

2009-08-03 Thread Suvankar Roy
Hi Robert,

Thanks very much for your valuable input.

This spaces-versus-tabs problem is killing me in a way; it is pretty 
cumbersome, to say the least.

Regards,

Suvankar Roy



Robert Mah r...@pobox.com
Sent by: Robert Mah robert@gmail.com
08/02/2009 10:52 PM

To: 'Suvankar Roy' suvankar@tcs.com, pgsql-performance@postgresql.org
cc:
Subject: RE: [PERFORM] Greenplum MapReduce

Suvankar:
 
Check your file for spaces vs tabs (one of them is bad and yes, it 
matters).
 
And as a personal aside, this is yet another reason I hate YAML.
 
Cheers,
Rob
 
From: pgsql-performance-ow...@postgresql.org [mailto:pgsql-performance-ow...@postgresql.org] On Behalf Of Suvankar Roy
Sent: Thursday, July 30, 2009 8:25 AM
To: pgsql-performance@postgresql.org
Subject: [PERFORM] Greenplum MapReduce
 


[PERFORM] Greenplum MapReduce

2009-08-02 Thread Suvankar Roy
Hi all,

Has anybody worked on Greenplum MapReduce programming?

I am facing a problem while trying to execute the Greenplum MapReduce 
program below, written in YAML (in blue). 

The error is thrown on the 7th line:
Error: YAML syntax error - found character that cannot start any token 
while scanning for the next token, at line 7 (in red)

Could somebody explain this and suggest a potential solution?

%YAML 1.1
---
VERSION: 1.0.0.1
DATABASE: test_db1
USER: gpadmin
DEFINE:
  - INPUT:
      NAME: doc
      TABLE: documents
  - INPUT:
      NAME: kw
      TABLE: keywords
  - MAP:
      NAME: doc_map
      LANGUAGE: python
      FUNCTION: |
        i = 0
        terms = {}
        for term in data.lower().split():
            i = i + 1
            if term in terms:
                terms[term] += ','+str(i)
            else:
                terms[term] = str(i)
        for term in terms:
            yield([doc_id, term, terms[term]])
      OPTIMIZE: STRICT IMMUTABLE
      PARAMETERS:
        - doc_id integer
        - data text
      RETURNS:
        - doc_id integer
        - term text
        - positions text
  - MAP:
      NAME: kw_map
      LANGUAGE: python
      FUNCTION: |
        i = 0
        terms = {}
        for term in keyword.lower().split():
            i = i + 1
            if term in terms:
                terms[term] += ','+str(i)
            else:
                terms[term] = str(i)
        yield([keyword_id, i, term, terms[term]])
      OPTIMIZE: STRICT IMMUTABLE
      PARAMETERS:
        - keyword_id integer
        - keyword text
      RETURNS:
        - keyword_id integer
        - nterms integer
        - term text
        - positions text
  - TASK:
      NAME: doc_prep
      SOURCE: doc
      MAP: doc_map
  - TASK:
      NAME: kw_prep
      SOURCE: kw
      MAP: kw_map
  - INPUT:
      NAME: term_join
      QUERY: |
        SELECT doc.doc_id, kw.keyword_id, kw.term,
               kw.nterms,
               doc.positions as doc_positions,
               kw.positions as kw_positions
          FROM doc_prep doc INNER JOIN kw_prep kw ON
               (doc.term = kw.term)
  - REDUCE:
      NAME: term_reducer
      TRANSITION: term_transition
      FINALIZE: term_finalizer
  - TRANSITION:
      NAME: term_transition
      LANGUAGE: python
      PARAMETERS:
        - state text
        - term text
        - nterms integer
        - doc_positions text
        - kw_positions text
      FUNCTION: |
        if state:
            kw_split = state.split(':')
        else:
            kw_split = []
            for i in range(0,nterms):
                kw_split.append('')
        for kw_p in kw_positions.split(','):
            kw_split[int(kw_p)-1] = doc_positions
        outstate = kw_split[0]
        for s in kw_split[1:]:
            outstate = outstate + ':' + s
        return outstate
  - FINALIZE:
      NAME: term_finalizer
      LANGUAGE: python
      RETURNS:
        - count integer
      MODE: MULTI
      FUNCTION: |
        if not state:
            return 0
        kw_split = state.split(':')
        previous = None
        for i in range(0,len(kw_split)):
            isplit = kw_split[i].split(',')
            if any(map(lambda(x): x == '', isplit)):
                return 0
            adjusted = set(map(lambda(x): int(x)-i, isplit))
            if (previous):

Re: [PERFORM] Greenplum MapReduce

2009-08-02 Thread Chris

Suvankar Roy wrote:


Hi all,

Has anybody worked on Greenplum MapReduce programming ?


It's a commercial product; you need to contact Greenplum.

--
Postgresql & php tutorials
http://www.designmagick.com/

