Hi Keren,

The number of output files is determined by the number of tasks that write output. Given your query, Pig will run a map-only job, so each map task writes its own part-m-* file. Even if you run it on a single local file, multiple tasks (threads in local mode) can be launched if the input file is big and splittable. You can probably enforce a single task by tuning pig.maxCombinedSplitSize and mapred.max.split.size.
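
A minimal sketch of that tuning, placed at the top of the script. The 4 GB value is only an illustrative assumption: the idea is to pick a byte count larger than the input file so that all splits can be combined into one. It also assumes split combination is left enabled (pig.splitCombination, on by default as far as is known), and it has not been verified against PigPerformanceLoader.

```
-- Assumed value: any byte count larger than the input file.
set pig.maxCombinedSplitSize 4294967296;
set mapred.max.split.size 4294967296;
```

If the job still launches several map tasks, the task count in the job output is the first thing to compare against the number of part-m-* files.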

Thanks,
Cheolsoo

On Wed, Jul 9, 2014 at 2:02 PM, Keren Ouaknine <[email protected]> wrote:

> Hi,
>
> I am aware there are several threads on the topic already :), however the
> suggestions out there didn't seem to work on my script.
>
> My output folder contains many parts:
> part-m-00000 part-m-00003 part-m-00006 part-m-00009 part-m-00012
> part-m-00015 part-m-00018 part-m-00021 part-m-00024 part-m-00027
> part-m-00030
> part-m-00001 part-m-00004 part-m-00007 part-m-00010 part-m-00013
> part-m-00016 part-m-00019 part-m-00022 part-m-00025 part-m-00028
> _temporary
> part-m-00002 part-m-00005 part-m-00008 part-m-00011 part-m-00014
> part-m-00017 part-m-00020 part-m-00023 part-m-00026 part-m-00029
>
> I am reading from one local file and executing in local mode so I would
> expect getting only one part-m-00000 as my output. Any clue why I get more
> than one part?
>
> I pasted my script below:
> register /home/kereno/pigmix.jar
>
> page_views = load
> '/home/kereno/more/pig-0.13.0-RC1/conversion_pig_scripts/page_views' using
> org.apache.pig.test.pigmix.udf.PigPerformanceLoader() as (user, action,
> timespent, query_term, ip_addr, timestamp, estimated_revenue, page_info,
> page_links);
>
> page_views_flattened = foreach page_views generate user, action, timespent,
> query_term, ip_addr, timestamp, estimated_revenue,
> ((map[]) page_info) as page_info, (bag{tuple(map[])})page_links as
> page_links;
>
> store page_views_flattened into 'parsed/ADM-format/page_views' using
> org.apache.pig.builtin.PigStorage_for_AQL('\t');
>
> Thanks,
> Keren
>
> --
> Keren Ouaknine
> www.kereno.com
