I'm using PigStorage(',') for all stores.

I agree that CROSS is expensive, but I'm still confused as to why it would
lose records in this case.

--Alex


On Fri, Apr 18, 2014 at 2:28 PM, Pradeep Gollakota <[email protected]> wrote:

> What is the storage func you're using? My guess is that there is some
> shared state in the storage func. Take a look at this SO answer, which
> deals with shared state in stores:
>
> http://stackoverflow.com/questions/20225842/apache-pig-append-one-dataset-to-another-one/20235592#20235592
>
> The reason the record loss doesn't occur when you STORE first is that
> PigStorage doesn't have shared state. So, for T3, you're loading from text
> files instead of your original store func.
>
> CROSS is pretty expensive by nature. If one of your datasets is small
> enough to load into memory, you can use a fragment-replicate join instead.
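>
> As a rough, untested sketch of that rewrite, assuming T1 is the small
> relation: JOIN needs a key, so the made-up constant field k below is added
> to both sides purely for illustration, and the small relation is listed
> last because that is the one Pig replicates into memory.
>
> -- k is a hypothetical constant key, not a field from your script
> T1k = FOREACH T1 GENERATE *, 1 AS k;
> T2k = FOREACH T2 GENERATE *, 1 AS k;
> -- large relation first, small (in-memory) relation last
> T3 = JOIN T2k BY k, T1k BY k USING 'replicated';
>
> T3 would then carry the two extra k columns, which you can project away
> with a FOREACH if they get in the way.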
>
>
> On Fri, Apr 18, 2014 at 11:43 AM, Alex Rasmussen <[email protected]> wrote:
>
> > I'm noticing some really strange behavior with a CROSS operation in one
> > of my scripts.
> >
> > I'm CROSSing a table T1 with another table T2 to produce T3. T1 has one
> > row, and T2 has 2,982,035 rows.
> >
> > If I STORE both T1 and T2 before CROSSing them together to get T3, like
> > so:
> >
> > -- ... Long script that, among other things, creates T1 and T2 ...
> > STORE T1 INTO 'hdfs://namenode/x/T1' USING PigStorage(',');
> > STORE T2 INTO 'hdfs://namenode/x/T2' USING PigStorage(',');
> > T3 = CROSS T2, T1;
> >
> > then I get what I expect; T3 has 2,982,035 records.
> >
> > However, if I omit the STOREs and run the CROSS directly, T3 only has
> > 1,492,977 records.
> >
> > I've run EXPLAIN on both the script with the STOREs and the script
> > without, and their query plans are identical.
> >
> > I'm going to end up refactoring the script to get rid of the CROSS anyway
> > since it's expensive, but am curious as to whether I'm doing something
> > wrong or if there may be a subtle bug in CROSS.
> >
> > I'm using Pig version 0.11.0-cdh4.5.0
> >
> > Any insight you could give me here would be greatly appreciated.
> >
> > Thanks,
> > --Alex
> >
>
