Hi all,

We've recently been getting odd output from some Pig jobs: it contains a
lot of duplicated data.  I'm wondering whether Pig could be the cause, or
whether the duplicates must already be in our raw data set (which is very
possible).

We're running simple Pig jobs that just filter a subset of our data based
on coordinates, e.g.:

A = LOAD '$INPUT' USING PigStorage('\t') AS (entity_id: long, lat: double, lng: double);

B = FILTER A BY (lat > 37.708) AND (lat < 37.817) AND (lng > -122.519) AND (lng < -122.356);

STORE B INTO '$OUTPUT';
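
To check the raw data itself, I assume something like the following would list any rows that occur more than once in the input. This is only a sketch: '$DUPES_OUTPUT' is a made-up parameter, and it treats two rows as duplicates only when entity_id, lat and lng are all identical.

```pig
-- Sketch: find rows in the raw input that appear more than once.
-- Assumes the same tab-separated schema as the job above.
raw = LOAD '$INPUT' USING PigStorage('\t') AS (entity_id: long, lat: double, lng: double);

-- Group identical rows together by using all three fields as the key.
grouped = GROUP raw BY (entity_id, lat, lng);

-- Count how many times each distinct row occurs (COUNT_STAR also counts rows with nulls).
counts = FOREACH grouped GENERATE FLATTEN(group) AS (entity_id, lat, lng), COUNT_STAR(raw) AS n;

-- Keep only the rows that occur more than once.
dupes = FILTER counts BY n > 1;

-- '$DUPES_OUTPUT' is a hypothetical parameter, not part of the original job.
STORE dupes INTO '$DUPES_OUTPUT';
```

If this comes back empty, the raw data has no exact duplicates and the problem would have to be somewhere else in the pipeline.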

Thanks.
