Thanks. I will have a look at my InputFormat. If my InputFormat makes one split, there will be only one mapper.
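For example, if its isSplitable() looks anything like this minimal sketch (hypothetical class name, Hadoop's new mapreduce API), Hadoop will never cut the file, and one mapper is exactly what I should expect:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class WholeFileXmlInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // false = getSplits() builds one split per file,
        // so even a 50 GB file is read by a single mapper
        return false;
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        // a real XML loader would return its XML-aware reader here
        return new LineRecordReader();
    }
}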
Regards,
Jérôme.

On Wed, 6 Feb 2013 08:41:50 -0800, Cheolsoo Park <[email protected]> wrote:

> Hi Jerome,
>
> It's not Pig but Hadoop that splits input files. Pig Load/Store UDFs
> are associated with InputFormat, OutputFormat and RecordReader
> classes. Hadoop uses them to decide how to create splits. Here are
> more explanations:
> http://www.quora.com/How-does-Hadoop-handle-split-input-records
>
> Thanks,
> Cheolsoo
>
> On Wed, Feb 6, 2013 at 2:00 AM, Jerome Person <[email protected]> wrote:
>
> > It is not a gzip file. It is an XML file which is loaded with a UDF.
> > When does Pig split the input file?
> > I guess my loader is wrong?
> >
> > Jérôme.
> >
> > On Tue, 5 Feb 2013 15:10:14 -0800, Prashant Kommireddi <[email protected]> wrote:
> >
> > > Is this a gzip file? You have to make sure the compression scheme
> > > you use is splittable for more mappers to be spawned.
> > >
> > > -Prashant
> > >
> > > On Tue, Feb 5, 2013 at 2:57 PM, Jerome Person <[email protected]> wrote:
> > >
> > > > As it is a 50 GB single file, I believe this job needs more than
> > > > one mapper.
> > > >
> > > > I do not find any mapred.max.split.size parameter in the job
> > > > configuration XML file (only mapred.min.split.size = 0).
> > > >
> > > > Is there any "keyword" to activate parallelism in the Pig
> > > > script?
> > > >
> > > > Jérôme.
> > > >
> > > > On Tue, 5 Feb 2013 14:13:32 -0800, Cheolsoo Park <[email protected]> wrote:
> > > >
> > > > > >> But one more point, I have only one mapper running with
> > > > > >> this pig job as my cluster has 4 slaves. How could it be
> > > > > >> different ?
> > > > >
> > > > > Are you asking why only a single mapper runs even though there
> > > > > are 3 more slaves available? 4 slaves doesn't mean that you
> > > > > will always have 4 mappers/reducers. Hadoop launches a mapper
> > > > > per file split.
> > > > >
> > > > > How many input files do you have?
> > > > >
> > > > > - If you have just one small file, Pig will launch a single
> > > > > mapper. You can increase parallelism by splitting that file
> > > > > into smaller splits:
> > > > > http://stackoverflow.com/questions/9678180/change-file-split-size-in-hadoop
> > > > >
> > > > > - If you have many small files, Pig will combine them into a
> > > > > single split and launch a single mapper. In this case, you
> > > > > might want to change pig.maxCombinedSplitSize:
> > > > > http://pig.apache.org/docs/r0.10.0/perf.html#combine-files
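> > > > >
> > > > > For example, you could put something like this at the top of
> > > > > your script (untested; 134217728 bytes = 128 MB is just an
> > > > > illustrative figure):
> > > > >
> > > > > -- ask Hadoop for splits of at most ~128 MB on a splittable file
> > > > > set mapred.max.split.size 134217728;
> > > > > -- cap the combined splits Pig builds out of many small files
> > > > > set pig.maxCombinedSplitSize 134217728;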
> > > > >
> > > > > Thanks,
> > > > > Cheolsoo
> > > > >
> > > > > On Tue, Feb 5, 2013 at 8:06 AM, Jerome Pierson <[email protected]> wrote:
> > > > >
> > > > > > Thanks a lot. It works fine.
> > > > > >
> > > > > > But one more point, I have only one mapper running with
> > > > > > this pig job as my cluster has 4 slaves.
> > > > > > How could it be different ?
> > > > > >
> > > > > > Regards,
> > > > > > Jérôme
> > > > > >
> > > > > > On 31/01/2013 20:45, Cheolsoo Park wrote:
> > > > > >
> > > > > >> Hi Jerome,
> > > > > >>
> > > > > >> Try this:
> > > > > >>
> > > > > >> XmlTag = FOREACH xmlToTuple GENERATE FLATTEN($0);
> > > > > >> XmlTag2 = FOREACH XmlTag {
> > > > > >>     tag_with_amenity = FILTER tag BY (tag_attr_k == 'amenity');
> > > > > >>     GENERATE *, COUNT(tag_with_amenity) AS count;
> > > > > >> };
> > > > > >> XmlTag3 = FOREACH (FILTER XmlTag2 BY count > 0) GENERATE
> > > > > >>     node_attr_id, node_attr_lon, node_attr_lat, tag;
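> > > > > >>
> > > > > >> The nested FOREACH runs the FILTER over the tag bag inside
> > > > > >> each record, so no JOIN or re-GROUP is needed. To sanity-check
> > > > > >> the intermediate shapes (standard Pig commands, illustrative):
> > > > > >>
> > > > > >> DESCRIBE XmlTag2; -- the original fields plus the amenity count
> > > > > >> DUMP XmlTag3;     -- only records whose tag bag has an 'amenity' key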
> > > > > >>
> > > > > >> Thanks,
> > > > > >> Cheolsoo
> > > > > >>
> > > > > >> On Thu, Jan 31, 2013 at 9:19 AM, Jerome Pierson <[email protected]> wrote:
> > > > > >>
> > > > > >>> Hi There,
> > > > > >>>
> > > > > >>> I am a beginner. I achieved something, but I guess I could
> > > > > >>> have done better. Let me explain.
> > > > > >>> (Pig 0.10)
> > > > > >>>
> > > > > >>> My data is DESCRIBEd as:
> > > > > >>>
> > > > > >>> xmlToTuple: {(node_attr_id: int,node_attr_lon: chararray,node_attr_lat: chararray,tag: {(tag_attr_k: chararray,tag_attr_v: chararray)})}
> > > > > >>>
> > > > > >>> and DUMPs like this:
> > > > > >>>
> > > > > >>> ((100312088,45.2745669,-12.7776222,{(created_by,JOSM)}))
> > > > > >>> ((100948454,45.2620946,-12.7849171,))
> > > > > >>> ((100948519,45.2356985,-12.7707014,{(created_by,JOSM)}))
> > > > > >>> ((704398904,45.2416667,-13.0058333,{(lat,-13.00583333),(lon,45.24166667)}))
> > > > > >>> ((1230941976,45.0743117,-12.6888807,{(place,village)}))
> > > > > >>> ((1230941977,45.0832807,-12.6810328,{(name,Mtsahara)}))
> > > > > >>> ((1976927219,45.2272263,-12.7794359,))
> > > > > >>> ((1751057677,45.2216163,-12.7825896,{(amenity,fast_food),(name,Brochetterie)}))
> > > > > >>> ((1751057678,45.2216953,-12.7829678,{(amenity,fast_food),(name,Brochetterie)}))
> > > > > >>> ((100948360,45.2338541,-12.7762230,{(amenity,ferry_terminal)}))
> > > > > >>> ((362795028,45.2086809,-12.8062991,{(amenity,fuel),(operator,Total)}))
> > > > > >>>
> > > > > >>> I want to extract the records which have a certain value for
> > > > > >>> the tag_attr_k field. For example, give me the records where
> > > > > >>> there is a tag_attr_k = amenity. That should be:
> > > > > >>>
> > > > > >>> (100948360,-12.7762230,45.2338541,{(amenity,ferry_terminal)})
> > > > > >>> (362795028,-12.8062991,45.2086809,{(operator,Total),(amenity,fuel)})
> > > > > >>> (1751057677,-12.7825896,45.2216163,{(amenity,fast_food),(name,Brochetterie)})
> > > > > >>> (1751057678,-12.7829678,45.2216953,{(amenity,fast_food),(name,Brochetterie)})
> > > > > >>>
> > > > > >>> So (node_attr_id, node_attr_lat, node_attr_lon, {(tag_attr_k,
> > > > > >>> tag_attr_v)...(tag_attr_k,tag_attr_v)})
> > > > > >>>
> > > > > >>> I ended up with this script:
> > > > > >>>
> > > > > >>> ...
> > > > > >>> XmlTag = foreach xmlToTuple GENERATE FLATTEN ($0); -- remove the enclosing top-level bag
> > > > > >>> XmlTag2 = foreach XmlTag GENERATE $0 as id, $1 as lon, $2 as lat,
> > > > > >>>     FLATTEN (tag) as (key, value); -- flatten the bag of tags
> > > > > >>> XmlTag3 = FILTER XmlTag2 BY key == 'amenity'; -- get all the records with amenity tags
> > > > > >>> XmlTag4 = JOIN XmlTag3 BY id, XmlTag2 BY id; -- re-build records with all tags for ids containing an amenity tag
> > > > > >>> XmlTag7 = foreach XmlTag4 GENERATE $0 as id, $1 as lon, $2 as lat,
> > > > > >>>     $8 as key, $9 as value; -- re-build records: remove redundant fields
> > > > > >>> XmlTag5 = GROUP XmlTag7 BY (id,lat,lon); -- re-build records: group redundant records
> > > > > >>> XmlTag8 = foreach XmlTag5 { -- re-build records: id,lat,lon,{(key,value)...(key,value)}
> > > > > >>>     tag = foreach XmlTag7 GENERATE key, value;
> > > > > >>>     GENERATE group.id, group.lat, group.lon, tag;
> > > > > >>> };
> > > > > >>>
> > > > > >>> with these schemas:
> > > > > >>>
> > > > > >>> xmlToTuple: {(node_attr_id: int,node_attr_lon: chararray,node_attr_lat: chararray,tag: {(tag_attr_k: chararray,tag_attr_v: chararray)})}
> > > > > >>> XmlTag: {null::node_attr_id: int,null::node_attr_lon: chararray,null::node_attr_lat: chararray,null::tag: {(tag_attr_k: chararray,tag_attr_v: chararray)}}
> > > > > >>> XmlTag2: {id: int,lon: chararray,lat: chararray,key: chararray,value: chararray}
> > > > > >>> XmlTag3: {id: int,lon: chararray,lat: chararray,key: chararray,value: chararray}
> > > > > >>> XmlTag4: {XmlTag3::id: int,XmlTag3::lon: chararray,XmlTag3::lat: chararray,XmlTag3::key: chararray,XmlTag3::value: chararray,XmlTag2::id: int,XmlTag2::lon: chararray,XmlTag2::lat: chararray,XmlTag2::key: chararray,XmlTag2::value: chararray}
> > > > > >>> XmlTag7: {id: int,lon: chararray,lat: chararray,key: chararray,value: chararray}
> > > > > >>> XmlTag5: {group: (id: int,lat: chararray,lon: chararray),XmlTag7: {(id: int,lon: chararray,lat: chararray,key: chararray,value: chararray)}}
> > > > > >>> XmlTag8: {id: int,lat: chararray,lon: chararray,tag: {(key: chararray,value: chararray)}}
> > > > > >>>
> > > > > >>> I guess this is not very straightforward and can be largely
> > > > > >>> optimized. Could you give me some hints?
> > > > > >>>
> > > > > >>> Regards,
> > > > > >>> Jérôme
