Re: Easy question...difference between this::form and this.form?

Dmitriy Ryaboy Tue, 07 Dec 2010 16:51:23 -0800

it's sort of true -- but, iirc, only goes one level deep, so once you do a
second join, you are stuck with "::"s


On Tue, Dec 7, 2010 at 10:11 AM, Santhosh Srinivasan <[email protected]> wrote:

> > The sql way to deal with this issue is essentially to keep the name of the
> > parent relation around during parsing, and require that you explicitly
> > provide the desired parent if column names are ambiguous. That's probably
> > something that could be implemented now that we have the required metadata
> > in the operators (I believe it wasn't there when the disambiguation design
> > was implemented).
>
> Isn't that true today? Unambiguous columns can be referenced without the ::
> operator.
>
> Santhosh
>
> -----Original Message-----
> From: Dmitriy Ryaboy [mailto:[email protected]]
> Sent: Tuesday, December 07, 2010 9:49 AM
> To: [email protected]
> Subject: Re: Easy question...difference between this::form and this.form?
>
> Consider self-joins, with regards to the meaningful name problem...
>
> The sql way to deal with this issue is essentially to keep the name of the
> parent relation around during parsing, and require that you explicitly
> provide the desired parent if column names are ambiguous. That's probably
> something that could be implemented now that we have the required metadata
> in the operators (I believe it wasn't there when the disambiguation design
> was implemented).
>
> As for the difference between "::" and ".": the double-colon is just a
> string with no special meaning; it's simply part of the field name. The
> period is essentially a projection operator -- you are saying, "the thing to
> the left of the period is a tuple, and the thing to the right is a field in
> that tuple". (It works for bags as well, in which case it means the thing to
> the left of the period is a bag of tuples, and the thing to the right is a
> field in every tuple in the bag.)
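>
> A rough sketch of the two forms side by side, reusing the relation and field
> names from the script quoted at the bottom of this thread. The aliases
> grouped, maxed and normed are made up for illustration, and the comments
> describe the schemas only approximately:
>
> ```pig
> data    = LOAD 'data/timeseries.tsv' AS (month:chararray, count:int);
> grouped = GROUP data ALL;
> -- grouped has two fields: group, and data (a bag of (month, count) tuples).
>
> -- "." projects a field out of the bag: data.count is a bag of counts.
> maxed   = FOREACH grouped GENERATE FLATTEN(data),
>                                    MAX(data.count) AS max_count;
> -- FLATTEN adds the "data::" prefix, so the flattened fields are now
> -- named data::month and data::count.
>
> -- Here "::" is just part of the field name, not an operator.
> normed  = FOREACH maxed GENERATE data::month AS month,
>                                  (float)data::count / (float)max_count
>                                      AS normed_count;
> ```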
>
> -Dmitriy.
>
> 2010/12/7 Anze <[email protected]>
>
> >
> > If one uses meaningful names then Pig would never use '::' anyway. The
> > problem is when you use multiple joins in sequence, then '::' names
> > get very annoying.
> > But that's just my opinion. :)
> >
> > Anze
> >
> >
> > On Tuesday 07 December 2010, Jonathan Coveney wrote:
> > > > Would that even be much better? It seems like it'd be better to have
> > > > it be consistent in appending the whatever::, so that at least you
> > > > have to be cognizant of it when you do the join. If it starts being
> > > > too clever, then it's up to you to figure out when it does and
> > > > doesn't do it, which might be annoying.
> > >
> > > 2010/12/7 Anze <[email protected]>
> > >
> > > > > I understand the reason for this, it just seems like a drastic
> > > > > solution. :)
> > > >
> > > > Ideally, Pig should be clever enough to detect ambiguity and deal
> > > > with it, and leave the non-conflicting names intact. For instance:
> > > >
> > > > A = load 'foo' as (x, y, z);
> > > > B = load 'bar' as (x, a, b, c);
> > > > C = join A by x, B by x;
> > > > DESCRIBE C;
> > > > C: {A::x, y, z, B::x, a, b, c}
> > > >
> > > > or even:
> > > > C: {x, y, z, B::x, a, b, c}
> > > >
> > > > or even a step further, in case of JOIN:
> > > > C: {x, y, z, a, b, c}
> > > > (since join *joins* by x, why would there be two? This doesn't
> > > > always work for other operations, of course)
> > > >
> > > > > Reasoning: at least in my cases the names are descriptive from the
> > > > > start, so there are almost no name conflicts. In the rare cases
> > > > > where there are, Pig can detect that and use the old "::" syntax,
> > > > > then let me deal with it.
> > > >
> > > > I know this is backwards-incompatible change and is not likely to
> > > > be accepted, but still... :)
> > > >
> > > > Anze
> > > >
> > > > On Monday 06 December 2010, Alan Gates wrote:
> > > > > The reason it's needed is that ambiguities would result otherwise.
> > > > >
> > > > > A = load 'foo' as (x, y, z);
> > > > > B = load 'bar' as (w, x, y, z);
> > > > > C = join A by x, B by x;
> > > > > D = filter C by z > 0;  -- which z?
> > > > >
> > > > > As long as the name is not ambiguous, the :: is not required.
> > > > > So in the above example it would be perfectly legal to say
> > > > >
> > > > > D = filter C by w > 0;
> > > > >
> > > > > Out of curiosity, why do you want to remove the :: names?
> > > > >
> > > > > Alan.
> > > > >
> > > > > On Dec 6, 2010, at 1:05 PM, Jonathan Coveney wrote:
> > > > > > Hijack away. I would be curious as to the reason we need this
> > > > > > as well.
> > > > > >
> > > > > > 2010/12/6 Anze <[email protected]>
> > > > > >
> > > > > > > >> Sorry to hijack your question, Jonathan, but while we are at
> > > > > > > >> it... :)
> > > > > >>
> > > > > >> Is there a way to tell Pig NOT to add "base_alias::"? Almost
> > > > > >> half my code consists of FOREACH... GENERATE that just remove
> > > > > >> these prefixes.
> > > > > >>
> > > > > >> Thanks,
> > > > > >>
> > > > > >> Anze
> > > > > >>
> > > > > >> On Monday 06 December 2010, Daniel Dai wrote:
> > > > > >>> After join, cross, foreach flatten, Pig will automatically
> > > > > >>> add "base_alias::" prefix. All other cases use "."
> > > > > >>>
> > > > > >>> Daniel
> > > > > >>>
> > > > > >>> Jonathan Coveney wrote:
> > > > > > > >>>> It's very hard to search for this among the docs because
> > > > > > > >>>> it's so generic, so I thought I'd ask... I'm sure the
> > > > > > > >>>> answer is painfully easy.
> > > > > >>>>
> > > > > >>>> Taking a look at this code that I found online, for example
> > > > > >>>>
> > > > > >>>> --
> > > > > >>>> -- Read in a bag of tuples (timeseries for this example)
> > > > > >>>> and divide the
> > > > > >>>> -- numeric column by its maximum.
> > > > > >>>> --
> > > > > >>>> %default DATABAG 'data/timeseries.tsv'
> > > > > >>>>
> > > > > >>>> data       = LOAD '$DATABAG' AS (month:chararray, count:int);
> > > > > >>>> accumulate = GROUP data ALL;
> > > > > >>>> calc_max   = FOREACH accumulate GENERATE FLATTEN(data),
> > > > > >>>> MAX(data.count) AS max_count; normalize  = FOREACH calc_max
> > > > > >>>> GENERATE data::month AS month, data::count AS count,
> > > > > >>>> (float)data::count / (float)max_count AS normed_count; DUMP
> > > > > >>>> normalize;
> > > > > >>>>
> > > > > >>>> What purpose does data::month serve versus data.count?
> > > > > >>>>
> > > > > >>>> Thanks
> >
> >
>
