Re: Some optimization advices

Cheolsoo Park Tue, 05 Feb 2013 14:13:59 -0800

>> But one more point: I have only one mapper running for this Pig job,
although my cluster has 4 slaves. How can I change that?


Are you asking why only a single mapper runs even though there are 3 more
slaves available? Having 4 slaves doesn't mean that you will always have 4
mappers/reducers. Hadoop launches one mapper per input split.

How many input files do you have?

- If you have just one small file, Pig will launch a single mapper. You can
increase parallelism by splitting that file into smaller splits:
http://stackoverflow.com/questions/9678180/change-file-split-size-in-hadoop

- If you have many small files, Pig will combine them into a single split
and launch a single mapper. In this case, you might want to change
pig.maxCombinedSplitSize:
http://pig.apache.org/docs/r0.10.0/perf.html#combine-files
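
For illustration, both suggestions could be applied from inside the Pig
script with SET. This is only a sketch: it assumes Hadoop 1.x property names
(mapred.max.split.size) and a loader whose input format is splittable, and
the byte values are arbitrary examples rather than recommendations. As far
as I know, Pig may also recombine small splits, so when splitting a single
file the combined split size likely needs to be lowered as well.

```pig
-- Sketch only: the sizes below are illustrative, not tuned values.
-- Assumes Hadoop 1.x property names and a splittable input format.

-- Cap each input split at 1 MB so that one file can yield several splits.
SET mapred.max.split.size 1048576;

-- Keep Pig from merging those small splits back into a single mapper.
SET pig.maxCombinedSplitSize 1048576;

-- Alternatively, turn split combination off entirely:
-- SET pig.splitCombination false;
```

The number of map tasks reported for the job should show whether the
settings took effect.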

Thanks,
Cheolsoo

On Tue, Feb 5, 2013 at 8:06 AM, Jerome Pierson
<[email protected]> wrote:

> Thanks a lot. It works fine.
>
> But one more point: I have only one mapper running for this Pig job,
> although my cluster has 4 slaves.
> How can I change that?
>
> Regards,
> Jérôme
>
>
> On 31/01/2013 20:45, Cheolsoo Park wrote:
>
>> Hi Jerome,
>>
>> Try this:
>>
>> XmlTag = FOREACH xmlToTuple GENERATE FLATTEN ($0);
>> XmlTag2 = FOREACH XmlTag {
>>      tag_with_amenity = FILTER tag BY (tag_attr_k == 'amenity');
>>      GENERATE *, COUNT(tag_with_amenity) AS count;
>> };
>> XmlTag3 = FOREACH (FILTER XmlTag2 BY count > 0) GENERATE node_attr_id,
>> node_attr_lon, node_attr_lat, tag;
>>
>> Thanks,
>> Cheolsoo
>>
>>
>> On Thu, Jan 31, 2013 at 9:19 AM, Jerome Pierson
>> <[email protected]> wrote:
>>
>>  Hi There,
>>>
>>> I am a beginner, I achieved something, but I guess I could have done
>>> better. Let me explain.
>>> (Pig 0.10)
>>>
>>> DESCRIBE shows my data as:
>>>
>>>   xmlToTuple: {(node_attr_id: int,node_attr_lon: chararray,node_attr_lat:
>>> chararray,tag: {(tag_attr_k: chararray,tag_attr_v: chararray)})}
>>>
>>>
>>> and DUMP shows it like this:
>>>
>>> ((100312088,45.2745669,-12.7776222,{(created_by,JOSM)}))
>>> ((100948454,45.2620946,-12.7849171,))
>>> ((100948519,45.2356985,-12.7707014,{(created_by,JOSM)}))
>>> ((704398904,45.2416667,-13.0058333,{(lat,-13.00583333),(lon,45.24166667)}))
>>> ((1230941976,45.0743117,-12.6888807,{(place,village)}))
>>> ((1230941977,45.0832807,-12.6810328,{(name,Mtsahara)}))
>>> ((1976927219,45.2272263,-12.7794359,))
>>> ((1751057677,45.2216163,-12.7825896,{(amenity,fast_food),(name,Brochetterie)}))
>>> ((1751057678,45.2216953,-12.7829678,{(amenity,fast_food),(name,Brochetterie)}))
>>> ((100948360,45.2338541,-12.7762230,{(amenity,ferry_terminal)}))
>>> ((362795028,45.2086809,-12.8062991,{(amenity,fuel),(operator,Total)}))
>>>
>>>
>>> I want to extract the records that have a certain value in the
>>> tag_attr_k field. For example, give me the records that have a tag with
>>> tag_attr_k = amenity. That should be:
>>>
>>> (100948360,-12.7762230,45.2338541,{(amenity,ferry_terminal)})
>>> (362795028,-12.8062991,45.2086809,{(operator,Total),(amenity,fuel)})
>>> (1751057677,-12.7825896,45.2216163,{(amenity,fast_food),(name,Brochetterie)})
>>> (1751057678,-12.7829678,45.2216953,{(amenity,fast_food),(name,Brochetterie)})
>>>
>>> So (node_attr_id, node_attr_lat, node_attr_lon, {(tag_attr_k,
>>> tag_attr_v)...(tag_attr_k, tag_attr_v)})
>>>
>>>
>>> I ended up with this script.
>>>
>>>
>>> ...
>>> -- remove the top-level enclosing bag
>>> XmlTag = foreach xmlToTuple GENERATE FLATTEN ($0);
>>> -- flatten the bag of tags
>>> XmlTag2 = foreach XmlTag GENERATE $0 as id, $1 as lon, $2 as lat,
>>>     FLATTEN (tag) as (key, value);
>>> -- get all the records with amenity tags
>>> XmlTag3 = FILTER XmlTag2 BY key == 'amenity';
>>> -- re-build records with all tags for nodes that have an amenity tag
>>> XmlTag4 = JOIN XmlTag3 BY id, XmlTag2 BY id;
>>> -- re-build records: remove redundant fields
>>> XmlTag7 = foreach XmlTag4 GENERATE $0 as id, $1 as lon, $2 as lat,
>>>     $8 as key, $9 as value;
>>> -- re-build records: group redundant records
>>> XmlTag5 = GROUP XmlTag7 BY (id,lat,lon);
>>> -- re-build records: id, lat, lon, {(key,value)...(key,value)}
>>> XmlTag8 = foreach XmlTag5 {
>>>     tag = foreach XmlTag7 GENERATE key, value;
>>>     GENERATE group.id, group.lat, group.lon, tag;
>>> };
>>>
>>> These are the schemas of the relations:
>>>
>>> xmlToTuple: {(node_attr_id: int,node_attr_lon: chararray,node_attr_lat:
>>> chararray,tag: {(tag_attr_k: chararray,tag_attr_v: chararray)})}
>>> XmlTag: {null::node_attr_id: int,null::node_attr_lon:
>>> chararray,null::node_attr_lat: chararray,null::tag: {(tag_attr_k:
>>> chararray,tag_attr_v: chararray)}}
>>> XmlTag2: {id: int,lon: chararray,lat: chararray,key: chararray,value:
>>> chararray}
>>> XmlTag3: {id: int,lon: chararray,lat: chararray,key: chararray,value:
>>> chararray}
>>> XmlTag4: {XmlTag3::id: int,XmlTag3::lon: chararray,XmlTag3::lat:
>>> chararray,XmlTag3::key: chararray,XmlTag3::value: chararray,XmlTag2::id:
>>> int,XmlTag2::lon: chararray,XmlTag2::lat: chararray,XmlTag2::key:
>>> chararray,XmlTag2::value: chararray}
>>> XmlTag7: {id: int,lon: chararray,lat: chararray,key: chararray,value:
>>> chararray}
>>> XmlTag5: {group: (id: int,lat: chararray,lon: chararray),XmlTag7: {(id:
>>> int,lon: chararray,lat: chararray,key: chararray,value: chararray)}}
>>> XmlTag8: {id: int,lat: chararray,lon: chararray,tag: {(key:
>>> chararray,value: chararray)}}
>>>
>>>
>>> I guess this is not very straightforward and can be largely optimized.
>>> Could you please give me some hints?
>>>
>>> Regards,
>>> Jérôme
>>>
>>>
>
