Hi,
I need to create a Hive UDAF and control the number of mappers launched
for it. The UDAF itself works fine: it reads rows from a file, stores them
in an ArrayList, and passes the list to the merge() method in the reducer.

To control the number of mappers, I created a custom InputFormat class
whose getSplits() method returns fake splits, so the number of mappers
launched is determined by the number of fake splits. Each RecordReader
returns only one record, with a random key/value pair.

For example, if getSplits() returns 10 fake splits, 10 mappers are
launched on the datanodes. In the UDAF's iterate() method (called once per
mapper, since each mapper gets only one record), I generate a random
string and save it in the ArrayList in the aggregation buffer. I expect
the 10 random strings from the 10 mappers to reach merge() in the reducer,
so that terminate() sees a list of 10 different random strings. Instead,
the result in terminate() is 10 copies of the string from one of the 10
mappers. I checked that the string output by each mapper is correct (in
terminatePartial()), but in merge() I see something like this:
In the first call to merge(), the aggregation buffer has an empty
ArrayList; randomstring1 from mapper 1 is added to it.
In the second call to merge(), the ArrayList in the aggregation buffer
contains randomstring2, which I think should be randomstring1; then
randomstring2 from mapper 2 is added to it.
In the third call to merge(), the ArrayList in the aggregation buffer
contains 2 strings, both with the value randomstring3.
.....
At the end, the ArrayList in the aggregation buffer contains 10 copies of
randomstring10 instead of randomstring1, randomstring2, ..., randomstring10.
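
The merge() trace above is the pattern a list shows when it stores
references to one mutable object that is overwritten between calls. The
sketch below reproduces that pattern in plain Java, without Hive. It is
only an illustration under an assumption: I have not confirmed that Hive
or my InputFormat reuses the object passed to merge(). The class
ReusedObjectDemo, its methods, and the StringBuilder standing in for the
reused object are all hypothetical names made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

final class ReusedObjectDemo {

    // Simulates merge() calls that store a REFERENCE to the incoming object.
    // Assumption: the caller reuses one mutable object for every partial result.
    static List<CharSequence> mergeByReference(int calls) {
        StringBuilder reused = new StringBuilder(); // stands in for the reused partial
        List<CharSequence> buffer = new ArrayList<>();
        for (int i = 1; i <= calls; i++) {
            reused.setLength(0);                     // caller overwrites the same object
            reused.append("randomstring").append(i);
            buffer.add(reused);                      // every element points at 'reused'
        }
        return buffer;
    }

    // Same loop, but each call stores an immutable COPY of the current value.
    static List<String> mergeByCopy(int calls) {
        StringBuilder reused = new StringBuilder();
        List<String> buffer = new ArrayList<>();
        for (int i = 1; i <= calls; i++) {
            reused.setLength(0);
            reused.append("randomstring").append(i);
            buffer.add(reused.toString());           // snapshot of the value at this call
        }
        return buffer;
    }

    // Converts every element to a String so the two results can be compared.
    static List<String> asStrings(List<? extends CharSequence> items) {
        List<String> out = new ArrayList<>();
        for (CharSequence item : items) {
            out.add(item.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(asStrings(mergeByReference(3)));
        System.out.println(mergeByCopy(3));
    }
}
```

With 3 calls, mergeByReference() ends up with three entries that all read
randomstring3, while mergeByCopy() keeps randomstring1, randomstring2 and
randomstring3. If the same kind of reuse happens in my UDAF, copying the
value inside merge() before adding it to the ArrayList would be the thing
to try.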

The problem exists when using the custom InputFormat, so I am wondering
what I am missing in the InputFormat. I also noticed that if I set the
number of mappers using HiveInputFormat:
 set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
 set mapred.map.tasks=10;
it works fine when mapred.map.tasks is set to 2 or 3, but with a larger
value, say more than 10, I also get duplicate values. (There are 15 rows
in the table when I use HiveInputFormat.) Does anyone know why the UDAF
behaves like this, and is there any way to solve it?

Thanks in advance for any help !

Ted
