Thanks for being thorough! It is indeed a bug, but backporting a fix may be hard. The parser and logical plan changed a lot between 0.8 and 0.9, so if at all possible I would try to use 0.10 (the latest release). We use it in production; it is stable and has a lot of benefits over 0.8. I will warn you that the parser changed, so if you have many existing jobs it may be worth running them on a test cluster with 0.10 first. If you don't, it is definitely better to make the jump now.
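To make the call-count argument from the quoted thread concrete, here is a minimal sketch in plain Java with no Pig dependency. Everything in it is hypothetical: the class name, the six records, and the keywords are invented so that the numbers line up with the ones reported (3 expected rows versus 2, and 10 expected calls versus 6). It also assumes the OR is evaluated with short-circuiting, which is how Java behaves but is only an assumption about Pig's evaluation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class OrFilterCallCount {
    // Counts invocations of the stand-in for the UDF's exec() method.
    static int calls = 0;

    // Hypothetical input: six records, two matching "apple", one matching "banana".
    static final List<String> RECORDS = Arrays.asList(
        "apple pie", "apple tart", "banana bread",
        "cherry cake", "plum jam", "pear juice");

    // Stand-in for the CONTAINS filter function: a plain substring check.
    static boolean contains(String column, String needle) {
        calls++;
        return column.contains(needle);
    }

    // Intended semantics: keep a record if either side of the OR matches.
    static List<String> filter(List<String> records, String left, String right) {
        List<String> kept = new ArrayList<String>();
        for (String r : records) {
            if (contains(r, left) || contains(r, right)) {
                kept.add(r);
            }
        }
        return kept;
    }

    // Models the reported symptom: only the left-hand side is ever evaluated.
    static List<String> filterLeftOnly(List<String> records, String left) {
        List<String> kept = new ArrayList<String>();
        for (String r : records) {
            if (contains(r, left)) {
                kept.add(r);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        calls = 0;
        int keptWithOr = filter(RECORDS, "apple", "banana").size();
        System.out.println("OR: kept=" + keptWithOr + " calls=" + calls);

        calls = 0;
        int keptLeftOnly = filterLeftOnly(RECORDS, "apple").size();
        System.out.println("left only: kept=" + keptLeftOnly + " calls=" + calls);
    }
}
```

With this made-up data, the full OR calls the check once per record plus once more for each record that fails the left-hand side, while the left-only variant calls it exactly once per record. One call per record is therefore consistent with the right-hand side of the OR never being evaluated.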

2012/5/23 Johannes Schwenk <johannes.schw...@adition.com>

> Hi Jonathan,
>
> thanks again for your help!
>
> I have cloned the current git head and created this pig script
> http://pastebin.com/Gc9C9ZPS
>
> TestCONTAINS-testFilteringCluster-input.txt contains
> http://pastebin.com/h5MC695F
>
> The adition.jar has been built against the cloudera cdh3u3 distribution
> and contains the filter function CONTAINS
> http://pastebin.com/Uwje7v1V
>
>
> Output from running my script with both versions of pig:
>
> pig 0.11.0-SNAPSHOT
> http://pastebin.com/Cr5CkHui
>
> => Correct results!!
>
>
> pig 0.8.1-cdh3u3
> http://pastebin.com/yXY17mXx
>
> => Incorrect results!!
>
>
> It seems like the new logical plan in pig 0.8.1 optimizes the OR
> operator away. So it's a bug, right?
>
>
> On 22.05.2012 21:26, Jonathan Coveney wrote:
> > If this is a bug, it's an annoying one, so I definitely appreciate your
> > help in getting to the bottom of it. So let's get to the bottom of it :)
> >
> > First, I would clone the trunk version of pig and run the same tests
> > against it and compare. Always good to test any bugs against trunk to see
> > if it is version specific.
> >
> > Right off the bat, I would say that you should dump the files in your test
> > to a file, make a short script that does exactly what your test does, and
> > paste the EXPLAIN plan generated for your script (ideally in both your
> > version of pig and trunk). We should be able to see if there is something
> > weird going on.
> >
> > Let me know if you need any help with any of that. If it persists I'll try
> > and recreate on my end.
> >
> > 2012/5/22 Johannes Schwenk <johannes.schw...@adition.com>
> >
> >> Thank you for your quick suggestions!
> >>
> >> - I am now using local mode - good point!
> >> - I know of builtin matches, the CONTAINS filter was just to get into
> >> programming UDFs...
> >> - Whatever I do the problem persists. I tried:
> >> * turning off all optimizations (-t All) : no effect
> >> * reordering the statements : the outcome still contains only the
> >> tuples matching the lhs of the OR
> >> * using different data (just in case...) : no effect
> >> * finally counted how many times the exec() function gets called
> >> processing the script... : exactly *six times* - once for every record!
> >>
> >> That last observation leads me to believe that this is a bug!? The exec
> >> function should be called at least *ten times* I think.
> >>
> >> Do you have any suggestions on how to verify this?
> >>
> >> Greetings
> >>
> >> On 21.05.2012 19:11, Jonathan Coveney wrote:
> >>> Not sure why it is failing... though I will mention two things. 1) you
> >>> should use local mode if possible, especially just to test UDFs :) 2) you
> >>> could use the builtin matches function to achieve this (ie matches
> >>> '.*keyword.*')
> >>>
> >>> Besides that it is odd indeed, and I'd have to dig in more.
> >>>
> >>> 2012/5/21 Johannes Schwenk <johannes.schw...@adition.com>
> >>>
> >>>> Hello List,
> >>>>
> >>>> I am using Cloudera's distribution (cdh3u3) which comes with pig-0.8.1.
> >>>>
> >>>> I have written a UDF extending FilterFunc that checks if the provided
> >>>> string is contained within the specified column of the current tuple:
> >>>> http://pastebin.com/Uwje7v1V
> >>>>
> >>>> I have also written some TestCases:
> >>>> http://pastebin.com/uA4LHB4Q
> >>>>
> >>>> The odd thing is that only TestCase testFilteringClusterWithOR1 fails,
> >>>> because the result does not have the expected length of 3 but is of length 2
> >>>> instead (line 177 in http://pastebin.com/Uwje7v1V). After a lot of
> >>>> investigating I still cannot find out why testFilteringCluster and
> >>>> testFilteringClusterWithOR2 succeed but not testFilteringClusterWithOR1.
> >>>> Is there a special prerequisite for making my FilterFunc usable within
> >>>> OR ? Maybe I have missed something very obvious... Please help me figure
> >>>> this out!
> >>>>
> >>>> Greetings,
> >>>> Johannes Schwenk
> >>>>
> >>>> --
> >>>> Software Developer (Reporting)
> >>>> ________________________________________________________
> >>>>
> >>>> ADITION technologies AG
> >>>> Schwarzwaldstraße 78b
> >>>> 79117 Freiburg
> >>>>
> >>>> http://www.adition.com
> >>>>
> >>>> T +49 / (0)761 / 88147 - 30
> >>>> F +49 / (0)761 / 88147 - 77
> >>>> SUPPORT +49 / (0)1805 - ADITION
> >>>>
> >>>> (landline rate 14 ct/min; mobile rates at most 42 ct/min)
> >>>>
> >>>> Registered at the Düsseldorf District Court under HRB 54076
> >>>> Executive Board: Andreas Kleiser, Jörg Klekamp, Tihomir Perkovic, Marcus Schlüter
> >>>> Chairman of the Supervisory Board: Daniel Raimer, attorney
> >>>> VAT ID: DE 218 858 434
> >>>>
> >>>
> >>
> >> Johannes Schwenk
> >>
> >
>
> Johannes Schwenk