C = DISTINCT B;
STORE C INTO '$OUTPUT';
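
If you'd rather find out whether the duplicates are already in the raw
data, something along these lines might do it. This is an untested
sketch: G, CNT, DUPS and $DUPS_OUTPUT are names I made up, and it assumes
A is loaded exactly as in your script.

-- group identical rows of A and count how many times each one occurs
G = GROUP A BY (entity_id, lat, lng);
CNT = FOREACH G GENERATE FLATTEN(group) AS (entity_id, lat, lng), COUNT(A) AS n;
-- keep only rows that occur more than once in the input
DUPS = FILTER CNT BY n > 1;
STORE DUPS INTO '$DUPS_OUTPUT';

If DUPS comes out non-empty, the duplicates were in the input before the
FILTER ever ran. Note that COUNT skips tuples whose first field is null,
so rows with a null entity_id won't be counted.
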
-Kris
On Fri, May 18, 2012 at 04:55:23PM +0100, Brendan Gill wrote:
> Hi all,
>
> Recently some of our Pig jobs have been producing odd output that
> contains a lot of duplicated data. I'm wondering whether Pig could be
> the cause, or whether the duplicates are already in our raw data set
> (which is very possible).
>
> We're running simple Pig jobs that just filter a subset of our data
> based on coordinates, e.g.:
>
> A = LOAD '$INPUT' USING PigStorage('\t') as (entity_id: long, lat: double,
> lng: double);
>
> B = FILTER A BY (lat > 37.708) AND (lat < 37.817) AND (lng > -122.519) AND
> (lng < -122.356);
>
> STORE B INTO '$OUTPUT';
>
> Thanks.
--
Kris Coward http://unripe.melon.org/
GPG Fingerprint: 2BF3 957D 310A FEEC 4733 830E 21A4 05C7 1FEB 12B3