>> But one more point, I have only one mapper running with this pig job as my cluster has 4 slaves. How could it be different ?
Are you asking why only a single mapper runs even though there are 3 more slaves available? Having 4 slaves doesn't mean that you will always have 4 mappers/reducers.

Hadoop launches one mapper per file split. How many input files do you have?

- If you have just one small file, Pig will launch a single mapper. You can increase parallelism by splitting that file into smaller splits:
  http://stackoverflow.com/questions/9678180/change-file-split-size-in-hadoop
- If you have many small files, Pig will combine them into a single split and launch a single mapper. In this case, you might want to change pig.maxCombinedSplitSize:
  http://pig.apache.org/docs/r0.10.0/perf.html#combine-files

Thanks,
Cheolsoo

On Tue, Feb 5, 2013 at 8:06 AM, Jerome Pierson <[email protected]> wrote:

> Thanks a lot. It works fine.
>
> But one more point, I have only one mapper running with this pig job as my
> cluster has 4 slaves.
> How could it be different ?
>
> Regards,
> Jérôme
>
>
> On 31/01/2013 20:45, Cheolsoo Park wrote:
>
>> Hi Jerome,
>>
>> Try this:
>>
>> XmlTag = FOREACH xmlToTuple GENERATE FLATTEN ($0);
>> XmlTag2 = FOREACH XmlTag {
>>     tag_with_amenity = FILTER tag BY (tag_attr_k == 'amenity');
>>     GENERATE *, COUNT(tag_with_amenity) AS count;
>> };
>> XmlTag3 = FOREACH (FILTER XmlTag2 BY count > 0) GENERATE node_attr_id, node_attr_lon, node_attr_lat, tag;
>>
>> Thanks,
>> Cheolsoo
>>
>>
>> On Thu, Jan 31, 2013 at 9:19 AM, Jerome Pierson <[email protected]> wrote:
>>
>>> Hi There,
>>>
>>> I am a beginner. I achieved something, but I guess I could have done
>>> better. Let me explain.
>>> (Pig 0.10)
>>>
>>> DESCRIBE on my data gives:
>>>
>>> xmlToTuple: {(node_attr_id: int,node_attr_lon: chararray,node_attr_lat: chararray,tag: {(tag_attr_k: chararray,tag_attr_v: chararray)})}
>>>
>>>
>>> and DUMP gives this:
>>>
>>> ((100312088,45.2745669,-12.7776222,{(created_by,JOSM)}))
>>> ((100948454,45.2620946,-12.7849171,))
>>> ((100948519,45.2356985,-12.7707014,{(created_by,JOSM)}))
>>> ((704398904,45.2416667,-13.0058333,{(lat,-13.00583333),(lon,45.24166667)}))
>>> ((1230941976,45.0743117,-12.6888807,{(place,village)}))
>>> ((1230941977,45.0832807,-12.6810328,{(name,Mtsahara)}))
>>> ((1976927219,45.2272263,-12.7794359,))
>>> ((1751057677,45.2216163,-12.7825896,{(amenity,fast_food),(name,Brochetterie)}))
>>> ((1751057678,45.2216953,-12.7829678,{(amenity,fast_food),(name,Brochetterie)}))
>>> ((100948360,45.2338541,-12.7762230,{(amenity,ferry_terminal)}))
>>> ((362795028,45.2086809,-12.8062991,{(amenity,fuel),(operator,Total)}))
>>>
>>>
>>> I want to extract the records that have a certain value in the tag_attr_k
>>> field. For example: give me the records where there is a tag_attr_k =
>>> amenity. That should be:
>>>
>>> (100948360,-12.7762230,45.2338541,{(amenity,ferry_terminal)})
>>> (362795028,-12.8062991,45.2086809,{(operator,Total),(amenity,fuel)})
>>> (1751057677,-12.7825896,45.2216163,{(amenity,fast_food),(name,Brochetterie)})
>>> (1751057678,-12.7829678,45.2216953,{(amenity,fast_food),(name,Brochetterie)})
>>>
>>> So (node_attr_id, node_attr_lat, node_attr_lon, {(tag_attr_k,tag_attr_v)...(tag_attr_k,tag_attr_v)})
>>>
>>>
>>> I ended up with this script.
>>>
>>>
>>> ...
>>> XmlTag = foreach xmlToTuple GENERATE FLATTEN ($0); --removed top including level bag
>>> XmlTag2 = foreach XmlTag GENERATE $0 as id, $1 as lon, $2 as lat, FLATTEN (tag) as (key, value); --flatten the bag of tags
>>> XmlTag3 = FILTER XmlTag2 BY key == 'amenity'; -- get all the records with amenity tags
>>> XmlTag4 = JOIN XmlTag3 BY id, XmlTag2 BY id; -- re-build records with all tags containing amenity tag
>>> XmlTag7 = foreach XmlTag4 GENERATE $0 as id,$1 as lon, $2 as lat,$8 as key, $9 as value; -- re-build records : removing redundant field
>>> XmlTag5 = GROUP XmlTag7 BY (id,lat,lon); -- re-build records : grouping redundant records
>>> XmlTag8 = foreach XmlTag5 { --rebuild records : id,lat,long {(key,value)...(key,value)}
>>>     tag = foreach XmlTag7 GENERATE key, value;
>>>     GENERATE group.id,group.lat,group.lon,tag;
>>> };
>>>
>>> Using this variable:
>>>
>>> xmlToTuple: {(node_attr_id: int,node_attr_lon: chararray,node_attr_lat: chararray,tag: {(tag_attr_k: chararray,tag_attr_v: chararray)})}
>>> XmlTag: {null::node_attr_id: int,null::node_attr_lon: chararray,null::node_attr_lat: chararray,null::tag: {(tag_attr_k: chararray,tag_attr_v: chararray)}}
>>> XmlTag2: {id: int,lon: chararray,lat: chararray,key: chararray,value: chararray}
>>> XmlTag3: {id: int,lon: chararray,lat: chararray,key: chararray,value: chararray}
>>> XmlTag4: {XmlTag3::id: int,XmlTag3::lon: chararray,XmlTag3::lat: chararray,XmlTag3::key: chararray,XmlTag3::value: chararray,XmlTag2::id: int,XmlTag2::lon: chararray,XmlTag2::lat: chararray,XmlTag2::key: chararray,XmlTag2::value: chararray}
>>> XmlTag7: {id: int,lon: chararray,lat: chararray,key: chararray,value: chararray}
>>> XmlTag5: {group: (id: int,lat: chararray,lon: chararray),XmlTag7: {(id: int,lon: chararray,lat: chararray,key: chararray,value: chararray)}}
>>> XmlTag8: {id: int,lat: chararray,lon: chararray,tag: {(key: chararray,value: chararray)}}
>>>
>>>
>>> I guess this is not very straightforward and could be optimized considerably.
>>> Could you please give me some hints?
>>>
>>> Regards,
>>> Jérôme
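
For reference, the two split-size suggestions in the reply can be sketched as Pig SET statements placed at the top of the script. This is only a sketch under assumptions: the byte values are illustrative, not recommendations, and mapred.max.split.size is assumed to be the property name honored by the Hadoop 1.x input format in use, so check both against your Pig and Hadoop versions. Pig's split combination may merge small splits back together, which is why the two settings are shown side by side.

    -- Single input file: cap the split size (in bytes) so that the file
    -- yields more than one split. 16 MB is an assumed, illustrative value.
    SET mapred.max.split.size 16777216;

    -- Many small files: limit the size (in bytes) up to which Pig combines
    -- small inputs into one split. Again an assumed, illustrative value.
    SET pig.maxCombinedSplitSize 16777216;

    -- Alternatively, turn split combination off entirely (assumption: one
    -- mapper per small file is acceptable for the job).
    -- SET pig.splitCombination false;

With smaller splits the job should launch more mappers, but each mapper carries startup overhead, so very small values can make the job slower overall.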
