I am writing a GenericUDTF now, but notice on
http://wiki.apache.org/hadoop/Hive/DeveloperGuide/UDTF
the method docs show:

    /**
     * Called to notify the UDTF that there are no more rows to process. Note that
     * forward() should not be called in this function. Only clean up code should
     * be run.
     */
    public abstract void close() throws HiveException;

but the example does exactly that:

    @Override
    public void close() throws HiveException {
      forwardObj[0] = count;
      forward(forwardObj);
      forward(forwardObj);
    }

I'll assume the example is correct and continue, but it might be worth
fixing that page.

Cheers,
Tim

On Mon, Nov 8, 2010 at 7:35 AM, Tim Robertson <timrobertson...@gmail.com> wrote:
> Thank you both,
>
> A quick glance looks like that is what I am looking for. When I get
> it working, I'll post the solution.
>
> Cheers,
> Tim
>
> On Mon, Nov 8, 2010 at 6:55 AM, Namit Jain <nj...@facebook.com> wrote:
>> Another option would be to create a wrapper script (using neither a UDF
>> nor a UDTF).
>> That script, in any language, can emit any number of output rows per input
>> row.
>>
>> Look at:
>> http://wiki.apache.org/hadoop/Hive/LanguageManual/Transform
>> for details
>>
>> ________________________________
>> From: Sonal Goyal [sonalgoy...@gmail.com]
>> Sent: Sunday, November 07, 2010 8:40 PM
>> To: user@hive.apache.org
>> Subject: Re: Unions causing many scans of input - workaround?
>>
>> Hey Tim,
>>
>> You have an interesting problem. Have you tried creating a UDTF for your
>> case, so that you can possibly emit more than one record for each row of
>> your input?
>>
>> http://wiki.apache.org/hadoop/Hive/DeveloperGuide/UDTF
>>
>> Thanks and Regards,
>> Sonal
>>
>> Sonal Goyal | Founder and CEO | Nube Technologies LLP
>> http://www.nubetech.co | http://in.linkedin.com/in/sonalgoyal
>>
>> On Mon, Nov 8, 2010 at 2:31 AM, Tim Robertson <timrobertson...@gmail.com>
>> wrote:
>>>
>>> Hi all,
>>>
>>> I am porting custom MR code to Hive and have written working UDFs
>>> where I need them.
>>> Is there a workaround to having to do this in Hive:
>>>
>>> select * from
>>> (
>>>   select name_id, toTileX(longitude,0) as x, toTileY(latitude,0) as y,
>>>     0 as zoom, funct2(longitude,0) as f2_x, funct2(latitude,0) as f2_y,
>>>     count(1) as count
>>>   from table
>>>   group by name_id, x, y, f2_x, f2_y
>>>
>>>   UNION ALL
>>>
>>>   select name_id, toTileX(longitude,1) as x, toTileY(latitude,1) as y,
>>>     1 as zoom, funct2(longitude,1) as f2_x, funct2(latitude,1) as f2_y,
>>>     count(1) as count
>>>   from table
>>>   group by name_id, x, y, f2_x, f2_y
>>>
>>>   -- etc. etc., increasing in zoom
>>> )
>>>
>>> The issue is that this makes many passes over the table, whereas
>>> previously in my Map() I would just emit many times from the same
>>> input record and then let it all group in the shuffle and sort.
>>> I actually emit 184 times per input record (23 zoom levels of
>>> Google Maps, and 8 ways to derive the name_id), which would mean
>>> 184 union statements - is it possible in Hive to force it to emit
>>> many times from the source record in the stage-1 map?
>>>
>>> (ahem) Does anyone know if Pig can do this, if Hive cannot?
>>>
>>> I hope I have explained this well enough to make sense.
>>>
>>> Thanks in advance,
>>> Tim
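For reference, the wrapper-script route Namit points at (Hive's TRANSFORM, per the LanguageManual/Transform page) can be sketched as a small Python script that reads one row per line from stdin and emits one output row per zoom level. This is only a sketch: the column layout (tab-separated name_id, latitude, longitude), the script name, and the tile math (standard Web Mercator formulas, not Tim's actual toTileX/toTileY/funct2) are all assumptions, not taken from the thread.

```python
#!/usr/bin/env python
# Sketch of the TRANSFORM wrapper-script idea: one input row in, many
# output rows out (one per zoom level). Input columns and the Web
# Mercator tile formulas are illustrative assumptions.
import math
import sys

MAX_ZOOM = 3  # Tim uses 23 zoom levels; kept small here for illustration


def tile_x(lon, zoom):
    # Standard Web Mercator longitude -> tile column at this zoom
    return int((lon + 180.0) / 360.0 * (1 << zoom))


def tile_y(lat, zoom):
    # Standard Web Mercator latitude -> tile row at this zoom
    rad = math.radians(lat)
    return int((1.0 - math.log(math.tan(rad) + 1.0 / math.cos(rad)) / math.pi)
               / 2.0 * (1 << zoom))


def emit_rows(line):
    # One tab-separated input row -> one output row per zoom level
    name_id, lat, lon = line.rstrip("\n").split("\t")
    lat, lon = float(lat), float(lon)
    for zoom in range(MAX_ZOOM + 1):
        yield "\t".join([name_id, str(tile_x(lon, zoom)),
                         str(tile_y(lat, zoom)), str(zoom)])


if __name__ == "__main__":
    for line in sys.stdin:
        for row in emit_rows(line):
            print(row)
```

From Hive, something roughly like the following (syntax per the Transform wiki page; names are hypothetical) would then feed a single GROUP BY instead of 184 UNION ALL branches:

    SELECT TRANSFORM (name_id, latitude, longitude)
      USING 'python tiles.py'
      AS (name_id, x, y, zoom)
    FROM table;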