Not sure if there is a function for that but I wrote a UDF to do so - 
https://github.com/chandeepsingh/Hive-UDFs 
<https://github.com/chandeepsingh/Hive-UDFs>

hive> ADD JAR hive-udfs-1.0-uber.jar;
Added [hive-udfs-1.0-uber.jar] to class path
Added resources: [hive-udfs-1.0-uber.jar]

hive> CREATE TEMPORARY FUNCTION array_dedup AS 'com.hive.udfs.UdfArrayDeDup';
OK
Time taken: 0.015 seconds

hive> SELECT array_dedup(array("blah","blah","blah")) from table1 limit 1;
OK
["blah"]
Time taken: 0.502 seconds, Fetched: 1 row(s)

Here is the code:

package com.hive.udfs;

/**
 *
 * @author chandeepsingh 
 * Remove duplicates from an array
 */

import java.util.ArrayList;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ListObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils;

import java.util.HashSet;
import java.util.List;

public class UdfArrayDeDup extends GenericUDF {

    ListObjectInspector arrayOI = null;

    @Override
    public ObjectInspector initialize(ObjectInspector[] arguments)
            throws UDFArgumentException {

        arrayOI = (ListObjectInspector) arguments[0];
        return ObjectInspectorUtils.getStandardObjectInspector(arrayOI);
    }

    @Override
    public Object evaluate(DeferredObject[] arguments) throws HiveException {

        List<?> myArr = (List<?>) 
ObjectInspectorUtils.copyToStandardObject(arguments[0].get(), arrayOI);
        HashSet<Object> myHashSet = new HashSet<>();

        myHashSet.addAll(myArr);

        if (myHashSet != null) {
            return new ArrayList<>(myHashSet);
        } else {
            return null;
        }

    }

    @Override
    public String getDisplayString(String[] input) {
        return new String();
    }
}



> On Mar 13, 2016, at 1:30 AM, Rex X <dnsr...@gmail.com> wrote:
> 
> For the first question, is there any way to use "set" instead of an "array" 
> to dedupe all elements?
> 
> "select array(1,1)" will return "[1,1]", not "[1]".
> 
> 
> 
> On Sat, Mar 12, 2016 at 5:26 PM, Rex X <dnsr...@gmail.com 
> <mailto:dnsr...@gmail.com>> wrote:
> Thank you, Chandeep. Yes, my first problem solved. 
> How about the second one? Is there any way to append an element to an 
> existing array?
> 
> 
> 
> On Sat, Mar 12, 2016 at 5:10 PM, Chandeep Singh <c...@chandeep.com 
> <mailto:c...@chandeep.com>> wrote:
> If you only want the array while you’re querying table1 your example should 
> work. If you want to add AB to the table you’ll probably need to create a new 
> table by selecting everything you need from table1.
> 
> hive> select * from table1 limit 1;
> OK
> temp1 temp2   temp3
> 
> hive> select f1, array(f2, f3) AS AB from table1 limit 1;
> OK
> temp1 [“temp2”,"temp3"]
> 
> 
>> On Mar 13, 2016, at 12:33 AM, Rex X <dnsr...@gmail.com 
>> <mailto:dnsr...@gmail.com>> wrote:
>> 
>> How to make the following work?
>> 
>> 1. combine columns A and B to make one array as a new column AB. Both column 
>> A and B are string types.
>> 
>>   select 
>> string_columnA, 
>> string_columnB, 
>> array(string_columnA, string_columnB) as AB
>> from Table1;
>> 
>> 2. append columnA to an existing array-type column B
>> 
>> select
>> string_columnA,
>> array_columnB,
>> array_flatmerge(string_columnA, array_columnB) as AB
>> from Table2;
>> 
>> In fact, I should say "set" instead of "array" above, since I expect no 
>> duplicates.
>> 
>> Any idea?
>> 
> 
> 
> 

Reply via email to