That is probably true, regarding needing skewed joins, but our users rarely encounter those situations, and I have never knowingly done so - though I may have worked my way round one or two without knowing it.
As to the bugs, I used the dev branch for a long time, so my recommendations are colored by that, and some are owing to peculiarities with our storage UDFs. I don't remember exactly, but around 0.2 and for a few releases after, lots of features were on the books but were really 'Yahoo only.' :D

Pig is much more stable now. I should try more features and be more expressive.

On Jan 7, 2011, at 11:22 AM, Dmitriy Ryaboy <[email protected]> wrote:

> Would love to see what bugs you are running into with skewed and replicated
> joins.
> I use them all the time to great effect.
>
> You are correct in saying putting biggest relation to the left in regular
> joins is effective but totally wrong when saying it's the same thing as
> skewed join; we've encountered queries that are simply not possible to
> finish without skewed join.
>
> D
>
> On Thu, Jan 6, 2011 at 11:44 PM, Russell Jurney
> <[email protected]> wrote:
>
>> I wrote this up for LinkedIn Hadoop Users today, figured it was worth
>> sharing. If you have any other tips, or edits, please submit and I'll put
>> these in a wiki some place:
>>
>> /* Russell's philosophy of Pig:
>> 1) The pig is powerful, but cannot be trusted. His nature is perverse. He
>> will eat anything, and his diet affects his mood.
>> In time, you will understand his nature. In the meantime, do as little
>> as possible at each step - each line of Pig code.
>> Don't tempt the pig, for he will fuck your world, going from a tool that
>> enables you to a tool that gores you.
>> 2) Whenever possible, do all similar operations on a relation together in
>> one step. In one foreach.
>> 3) When operations on a single field of a relation become too chained,
>> too complex to read, do not be ashamed to break them into
>> two foreaches, one right after the other. Pig is smart enough to combine
>> those into one job.
>> 4) Each GROUP BY/FOREACH is an MR job. Each LOAD/FOREACH is an MR job.
>> Pig 0.8 will hint at this by telling you the relations
>> used in each job, but it is helpful to yourself as you edit your code
>> later, and to others that follow, to label your scripts
>> as you learn to infer which Pig code chunks correspond to which Pig
>> lines. Examples of this are below.
>> 5) Always strip relation names, even if it means another
>> FOREACH/GENERATE. For example, after a JOIN your relations may have two
>> namespaces like one_type::thing and another_type::gearbox. Take the time
>> to do:
>>
>> relation = FOREACH relation GENERATE one_type::thing AS thing,
>>                                      another_type::gearbox AS gearbox;
>>
>> Either you, or the person inheriting your code, will thank you later.
>>
>> 6) Pig Latin Code Highlighting/Formatting. If you aren't using Textmate
>> http://macromates.com/ to edit your Pig Latin code... shoot yourself in
>> the foot.
>> Facilities has pistols and first-aid kits suitable for this masochism.
>> Stop the bleeding, then download Textmate here:
>> http://download.macromates.com/TextMate_1.5.10.zip
>> and make a helpdesk ticket for a permanent licence here:****** Then
>> download the Pig Latin syntax and install it by pasting the commands
>> listed here into your shell, and restarting TextMate:
>> http://tommy.chheng.com/index.php/2009/09/pig-textmate-bundle/
>> 7) Never use the special variables $0/$1/$2 to represent a field, unless
>> you are creating a throwaway script. In that case, do not save the script.
>>
>> 8) If your code fails, you are probably being too clever for the Pig.
>> Back off doing things in combination, starting with the point of failure,
>> working back. Add steps when sensible. Think like the pig parser - and
>> know that there are really 5 or so Pig parsers, each one thinking a little
>> differently.
>> 9) Use ILLUSTRATE if it works. Complain bitterly to *** if it does not.
>> And give *** or *** or *** the evil eye, then learn their material
>> weaknesses and bribe them to fix Pig's parser to make it work.
>>
>> In the absence of ILLUSTRATE, use:
>> foo = SAMPLE my_relation 0.01; STORE foo INTO '/tmp/foo5125'; cat /tmp/foo5125
>> OR
>> foo = SAMPLE my_relation 0.01; foo = LIMIT foo 100; DUMP foo;
>> -- Be aware that this sorts the 'sample' and makes it less random.
>>
>> 11) Format your Pig code so that lists of things being generated line up
>> with CRs after each thing, as below. This makes it readable.
>> 12) Always put the smaller dataset to the left in a JOIN. This puts it in
>> RAM if possible, resulting in a 10-1000x performance improvement.
>> The other join types are often much buggier, so I personally never use
>> them.
>> 13) There are undocumented limitations to Pig. If you run into a
>> problem, search Pig's JIRA: https://issues.apache.org/jira/browse/PIG
>> Do not feel afraid to file bugs, after you email
>> [email protected]. We have contributed enough to the Pig project,
>> through UDFs, steering feedback, events and marketing, that they are
>> sensitive to our needs if we make reasonable requests.
>> 14) PUT YOUR UDFs IN THE NEW PIGGYBANK. It is on github and is on
>> wilbur. https://github.com/wilbur/Piggybank Fork the project, git clone
>> it, add your UDF, and do a pull request. Email me at [email protected]
>> when you do so and I will immediately approve it. Congrats - you've
>> contributed to the pig project!
>> */
>>
