Hi Keren,

The number of output files is determined by the number of tasks that write
output. Given your query, Pig will run a map-only job, so each map task
writes its own part-m-* file. Even if you run it on a single local file,
multiple tasks (threads) can be launched if the input file is big and
splittable. You can probably enforce a single task by tuning
pig.maxCombinedSplitSize and mapred.max.split.size.
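
As a rough sketch, you could put something like the following at the top of
your script. The 1 GB value is only an assumption for illustration (I don't
know how big your page_views file is); the idea is to pick a size, in bytes,
that is larger than the whole input so that all of it can end up in one
split. I haven't tried this against your loader, so treat it as a starting
point rather than a guaranteed fix.

```pig
-- Illustrative value only: 1 GB in bytes, assumed to exceed the input size.
-- Let Pig combine input splits up to this size into a single split.
set pig.maxCombinedSplitSize 1073741824;
-- Do not cap individual splits below that same size.
set mapred.max.split.size 1073741824;
```

If the job still launches more than one map task after that, the split size
is probably being limited somewhere else, such as by the file system block
size or by the loader itself.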

Thanks,
Cheolsoo


On Wed, Jul 9, 2014 at 2:02 PM, Keren Ouaknine <[email protected]> wrote:

> Hi,
>
> I am aware there are several threads on the topic already :), however the
> suggestions out there didn't seem to work on my script.
>
> My output folder contains many parts:
> part-m-00000  part-m-00003  part-m-00006  part-m-00009  part-m-00012
> part-m-00015  part-m-00018  part-m-00021  part-m-00024  part-m-00027
> part-m-00030
> part-m-00001  part-m-00004  part-m-00007  part-m-00010  part-m-00013
> part-m-00016  part-m-00019  part-m-00022  part-m-00025  part-m-00028
> _temporary
> part-m-00002  part-m-00005  part-m-00008  part-m-00011  part-m-00014
> part-m-00017  part-m-00020  part-m-00023  part-m-00026  part-m-00029
>
> I am reading from one local file and executing in local mode so I would
> expect getting only one part-m-00000 as my output. Any clue why I get more
> than one part?
>
> I pasted my script below:
> register /home/kereno/pigmix.jar
>
> page_views = load
> '/home/kereno/more/pig-0.13.0-RC1/conversion_pig_scripts/page_views' using
> org.apache.pig.test.pigmix.udf.PigPerformanceLoader() as (user, action,
> timespent, query_term, ip_addr, timestamp, estimated_revenue, page_info,
> page_links);
>
> page_views_flattened = foreach page_views generate user, action, timespent,
> query_term, ip_addr, timestamp, estimated_revenue,
> ((map[]) page_info) as page_info, (bag{tuple(map[])})page_links as
> page_links;
>
> store page_views_flattened into 'parsed/ADM-format/page_views' using
> org.apache.pig.builtin.PigStorage_for_AQL('\t');
>
> Thanks,
> Keren
>
>
>
>
> --
> Keren Ouaknine
> www.kereno.com
>
