Hi, I need to create a Hive UDAF and control the number of mappers launched for it. The UDAF itself works fine: it reads rows from a file, stores them in an ArrayList, and passes that list to the merge method in the reducer.
To control the number of mappers, I created a custom InputFormat class whose getSplits method returns some fake splits, so the number of mappers launched is determined by the number of fake splits. Each RecordReader returns only one record, with a random key/value pair. For example, if I return 10 fake splits, 10 mappers are launched on the datanodes. In the iterate method of the UDAF (called once per mapper, since each mapper gets only one record), I generate a random string and save it in the ArrayList in the aggregation buffer.
In this case I expect the 10 random strings from the 10 mappers to reach the merge method in the reducer, so that the terminate method ends up with a list of the 10 random strings generated by the mappers. Instead, the result in terminate is just 10 copies of the string from one of the 10 mappers. I checked that the string output by each mapper is correct (in terminatePartial), but in the merge method I see the following. In the first call to merge, the aggregation buffer has an empty ArrayList, and randomstring1 from mapper 1 is added to it. In the second call, the aggregation buffer has an ArrayList containing randomstring2, which I think should be randomstring1, and then randomstring2 from mapper 2 is added to it. In the third call, the aggregation buffer has an ArrayList with two strings, both randomstring3, and so on. In the end, the aggregation buffer contains an ArrayList with 10 copies of randomstring10 instead of randomstring1, randomstring2, ..., randomstring10.
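To make the symptom concrete, here is a minimal sketch in plain Java that reproduces the same pattern without any Hive or Hadoop classes. It assumes, and this is only a guess at the mechanism, that the value handed to merge is a mutable object that gets reused between calls, and that my list stores a reference to it rather than a copy. ReuseSketch, MutableValue, mergeByReference and mergeByCopy are made-up names for illustration, not part of my UDAF or of any Hive API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ReuseSketch {
    // Hypothetical stand-in for a mutable value object that a framework
    // might recycle between calls (not a Hive or Hadoop class).
    static final class MutableValue {
        private String value = "";
        void set(String v) { value = v; }
        String get() { return value; }
    }

    // Stores the same recycled object on every simulated merge call.
    static List<String> mergeByReference(List<String> partials) {
        MutableValue reused = new MutableValue();
        List<MutableValue> buffer = new ArrayList<>();
        for (String partial : partials) {
            reused.set(partial);  // overwrites what earlier entries point to
            buffer.add(reused);   // adds another reference, not a new value
        }
        return snapshot(buffer);
    }

    // Copies the recycled object before storing it in the buffer.
    static List<String> mergeByCopy(List<String> partials) {
        MutableValue reused = new MutableValue();
        List<MutableValue> buffer = new ArrayList<>();
        for (String partial : partials) {
            reused.set(partial);
            MutableValue copy = new MutableValue();
            copy.set(reused.get());  // independent object per entry
            buffer.add(copy);
        }
        return snapshot(buffer);
    }

    // Reads the current contents of the buffer as plain strings.
    static List<String> snapshot(List<MutableValue> buffer) {
        List<String> out = new ArrayList<>();
        for (MutableValue v : buffer) {
            out.add(v.get());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> partials =
            Arrays.asList("randomstring1", "randomstring2", "randomstring3");
        // prints [randomstring3, randomstring3, randomstring3]
        System.out.println(mergeByReference(partials));
        // prints [randomstring1, randomstring2, randomstring3]
        System.out.println(mergeByCopy(partials));
    }
}
```

The by-reference variant shows the same thing I see in merge: every entry already in the list changes to the newest string. Whether this is what actually happens inside my UDAF is exactly what I am unsure about.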
The problem occurs when using the custom InputFormat, so I was wondering what I am missing in the InputFormat. Meanwhile, I noticed that the same thing can happen if I set the number of mappers through HiveInputFormat instead:
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
set mapred.map.tasks=10;
It works fine if mapred.map.tasks is set to 2 or 3, but if I pick a larger value, say more than 10, I also get duplicate values. By the way, there are 15 rows in the table when I use HiveInputFormat. Does anyone know why the UDAF behaves like this, and is there any way to solve it?
Thanks in advance for any help!
Ted