Re: [PERFORM] Greenplum MapReduce
Suvankar Roy wrote: Hi all, Has anybody worked on Greenplum MapReduce programming ? I am facing a problem while trying to execute the below Greenplum Mapreduce program written in YAML (in blue). The other poster suggested contacting Greenplum and I can only agree. The error is thrown in the 7th line as: Error: YAML syntax error - found character that cannot start any token while scanning for the next token, at line 7 (in red) There is no red, particularly if viewing messages as plain text (which most people do on mailing lists). Consider indicating a line some other way next time (commonly below the line you put something like this is line 7 ^) The most common problem I get with YAML files though is when a tab is accidentally inserted instead of spaces at the start of a line. -- Richard Huxton Archonet Ltd -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Greenplum MapReduce
Suvankar Roy wrote: Hi Richard, I sincerely regret the inconvenience caused. No big inconvenience, but the lists can be very busy sometimes and the easier you make it for people to answer your questions the better the answers you will get. %YAML 1.1 --- VERSION: 1.0.0.1 DATABASE: test_db1 USER: gpadmin DEFINE: - INPUT: #** This the line which is causing the error **# NAME: doc TABLE: documents If it looks fine, always check for tabs. Oh, and you could have cut out all the rest of the file, really. I have learnt that unnecessary TABs can the cause of this, so trying to overcome that, hopefully the problem will subside then I'm always getting this. It's easy to accidentally introduce a tab character when reformatting YAML. It might be worth checking if your text editor has an option to always replace tabs with spaces. -- Richard Huxton Archonet Ltd -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Greenplum MapReduce
Hi Robert, Thanks much for your valuable inputs This spaces and tabs problem is killing me in a way, it is pretty cumbersome to say the least Regards, Suvankar Roy Robert Mah r...@pobox.com Sent by: Robert Mah robert@gmail.com 08/02/2009 10:52 PM To 'Suvankar Roy' suvankar@tcs.com, pgsql-performance@postgresql.org cc Subject RE: [PERFORM] Greenplum MapReduce Suvankar: Check your file for spaces vs tabs (one of them is bad and yes, it matters). And as an personal aside, this is yet another reason I hate YAML. Cheers, Rob From: pgsql-performance-ow...@postgresql.org [ mailto:pgsql-performance-ow...@postgresql.org] On Behalf Of Suvankar Roy Sent: Thursday, July 30, 2009 8:25 AM To: pgsql-performance@postgresql.org Subject: [PERFORM] Greenplum MapReduce Hi all, Has anybody worked on Greenplum MapReduce programming ? I am facing a problem while trying to execute the below Greenplum Mapreduce program written in YAML (in blue). The error is thrown in the 7th line as: Error: YAML syntax error - found character that cannot start any token while scanning for the next token, at line 7 (in red) If somebody can explain this and the potential solution %YAML 1.1 --- VERSION: 1.0.0.1 DATABASE: test_db1 USER: gpadmin DEFINE: - INPUT: NAME: doc TABLE: documents - INPUT: NAME: kw TABLE: keywords - MAP: NAME: doc_map LANGUAGE: python FUNCTION: | i = 0 terms = {} for term in data.lower().split(): i = i + 1 if term in terms: terms[term] += ','+str(i) else: terms[term] = str(i) for term in terms: yield([doc_id, term, terms[term]]) OPTIMIZE: STRICT IMMUTABLE PARAMETERS: - doc_id integer - data text RETURNS: - doc_id integer - term text - positions text - MAP: NAME: kw_map LANGUAGE: python FUNCTION: | i = 0 terms = {} for term in keyword.lower().split(): i = i + 1 if term in terms: terms[term] += ','+str(i) else: terms[term] = str(i) yield([keyword_id, i, term, terms[term]]) OPTIMIZE: STRICT IMMUTABLE PARAMETERS: - keyword_id integer - keyword text RETURNS: - keyword_id integer - nterms integer - term text - positions text - TASK: NAME: doc_prep SOURCE: doc MAP: doc_map - TASK: NAME: kw_prep SOURCE: kw MAP: kw_map - INPUT: NAME: term_join QUERY: | SELECT doc.doc_id, kw.keyword_id, kw.term, kw.nterms, doc.positions as doc_positions, kw.positions as kw_positions FROM doc_prep doc INNER JOIN kw_prep kw ON (doc.term = kw.term) - REDUCE: NAME: term_reducer TRANSITION: term_transition FINALIZE: term_finalizer - TRANSITION: NAME: term_transition LANGUAGE: python PARAMETERS: - state text - term text - nterms integer - doc_positions text - kw_positions text FUNCTION: | if state: kw_split = state.split(':') else: kw_split = [] for i in range(0,nterms): kw_split.append('') for kw_p in kw_positions.split(','): kw_split[int(kw_p)-1] = doc_positions outstate = kw_split[0] for s in kw_split[1
[PERFORM] Greenplum MapReduce
Hi all, Has anybody worked on Greenplum MapReduce programming ? I am facing a problem while trying to execute the below Greenplum Mapreduce program written in YAML (in blue). The error is thrown in the 7th line as: Error: YAML syntax error - found character that cannot start any token while scanning for the next token, at line 7 (in red) If somebody can explain this and the potential solution %YAML 1.1 --- VERSION: 1.0.0.1 DATABASE: test_db1 USER: gpadmin DEFINE: - INPUT: NAME: doc TABLE: documents - INPUT: NAME: kw TABLE: keywords - MAP: NAME: doc_map LANGUAGE: python FUNCTION:| i = 0 terms = {} for term in data.lower().split(): i = i + 1 if term in terms: terms[term] += ','+str(i) else: terms[term] = str(i) for term in terms: yield([doc_id, term, terms[term]]) OPTIMIZE: STRICT IMMUTABLE PARAMETERS: - doc_id integer - data text RETURNS: - doc_id integer - term text - positions text - MAP: NAME: kw_map LANGUAGE: python FUNCTION: | i = 0 terms = {} for term in keyword.lower().split(): i = i + 1 if term in terms: terms[term] += ','+str(i) else: terms[term] = str(i) yield([keyword_id, i, term, terms[term]]) OPTIMIZE: STRICT IMMUTABLE PARAMETERS: - keyword_id integer - keyword text RETURNS: - keyword_id integer - nterms integer - term text - positions text - TASK: NAME: doc_prep SOURCE: doc MAP: doc_map - TASK: NAME: kw_prep SOURCE: kw MAP: kw_map - INPUT: NAME: term_join QUERY: | SELECT doc.doc_id, kw.keyword_id, kw.term, kw.nterms, doc.positions as doc_positions, kw.positions as kw_positions FROM doc_prep doc INNER JOIN kw_prep kw ON (doc.term = kw.term) - REDUCE: NAME: term_reducer TRANSITION: term_transition FINALIZE: term_finalizer - TRANSITION: NAME: term_transition LANGUAGE: python PARAMETERS: - state text - term text - nterms integer - doc_positions text - kw_positions text FUNCTION: | if state: kw_split = state.split(':') else: kw_split = [] for i in range(0,nterms): kw_split.append('') for kw_p in kw_positions.split(','): kw_split[int(kw_p)-1] = doc_positions outstate = kw_split[0] for s in kw_split[1:]: outstate = outstate + ':' + s return outstate - FINALIZE: NAME: term_finalizer LANGUAGE: python RETURNS: - count integer MODE: MULTI FUNCTION: | if not state: return 0 kw_split = state.split(':') previous = None for i in range(0,len(kw_split)): isplit = kw_split[i].split(',') if any(map(lambda(x): x == '', isplit)): return 0 adjusted = set(map(lambda(x): int(x)-i, isplit)) if (previous):
Re: [PERFORM] Greenplum MapReduce
Suvankar Roy wrote: Hi all, Has anybody worked on Greenplum MapReduce programming ? It's a commercial product, you need to contact greenplum. -- Postgresql php tutorials http://www.designmagick.com/ -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance