Hi Jonathan, thanks again for your help!
I have cloned the current git head and created this pig script:
http://pastebin.com/Gc9C9ZPS

TestCONTAINS-testFilteringCluster-input.txt contains:
http://pastebin.com/h5MC695F

The adition.jar has been built against the Cloudera cdh3u3 distribution and
contains the filter function CONTAINS:
http://pastebin.com/Uwje7v1V

Output from running my script with both versions of pig:

pig 0.11.0-SNAPSHOT  http://pastebin.com/Cr5CkHui  => Correct results!!
pig 0.8.1-cdh3u3     http://pastebin.com/yXY17mXx  => Incorrect results!!

It seems like the new logical plan in pig 0.8.1 optimizes the OR operator
away. So it's a bug, right?

On 22.05.2012 21:26, Jonathan Coveney wrote:
> If this is a bug, it's an annoying one, so I definitely appreciate your
> help in getting to the bottom of it. So let's get to the bottom of it :)
>
> First, I would clone the trunk version of pig and run the same tests
> against it and compare. Always good to test any bugs against trunk to see
> if it is version specific.
>
> Right off the bat, I would say that you should dump the files in your test
> to a file, make a short script that does exactly what your test does, and
> paste the EXPLAIN plan generated for your script (ideally in both your
> version of pig and trunk). We should be able to see if there is something
> weird going on.
>
> Let me know if you need any help with any of that. If it persists I'll try
> and recreate on my end.
>
> 2012/5/22 Johannes Schwenk <johannes.schw...@adition.com>
>
>> Thank you for your quick suggestions!
>>
>> - I am now using local mode - good point!
>> - I know of the builtin matches, the CONTAINS filter was just to get into
>>   programming UDFs...
>> - Whatever I do, the problem persists. I tried:
>>   * turning off all optimizations (-t All) : no effect
>>   * reordering the statements : the outcome still contains only the
>>     tuples matching the lhs of the OR
>>   * using different data (just in case...) : no effect
>>   * finally counted how many times the exec() function gets called
>>     while processing the script... : exactly *six times* - once for
>>     every record!
>>
>> That last observation leads me to believe that this is a bug!? The exec
>> function should be called at least *ten times* I think.
>>
>> Do you have any suggestions on how to verify this?
>>
>> Greetings
>>
>> On 21.05.2012 19:11, Jonathan Coveney wrote:
>>> Not sure why it is failing... though I will mention two things. 1) you
>>> should use local mode if possible, especially just to test UDFs :) 2) you
>>> could use the builtin matches function to achieve this (ie matches
>>> '.*keyword.*')
>>>
>>> Besides that it is odd indeed, and I'd have to dig in more.
>>>
>>> 2012/5/21 Johannes Schwenk <johannes.schw...@adition.com>
>>>
>>>> Hello List,
>>>>
>>>> I am using Cloudera's distribution (cdh3u3) which comes with pig-0.8.1.
>>>>
>>>> I have written a UDF extending FilterFunc that checks if the provided
>>>> string is contained within the specified column of the current tuple:
>>>> http://pastebin.com/Uwje7v1V
>>>>
>>>> I have also written some TestCases:
>>>> http://pastebin.com/uA4LHB4Q
>>>>
>>>> The odd thing is that only TestCase testFilteringClusterWithOR1 fails,
>>>> because the result does not have the expected length of 3 but is of
>>>> length 2 instead (line 177 in http://pastebin.com/Uwje7v1V). After a lot
>>>> of investigating I still cannot find out why testFilteringCluster and
>>>> testFilteringClusterWithOR2 succeed but not testFilteringClusterWithOR1.
>>>> Is there a special prerequisite for making my FilterFunc usable within
>>>> OR ? Maybe I have missed something very obvious... Please help me figure
>>>> this out!
>>>>
>>>> Greetings,
>>>> Johannes Schwenk
>>>>
>>>> --
>>>> Software Developer (Reporting)
>>>> ________________________________________________________
>>>>
>>>> ADITION technologies AG
>>>> Schwarzwaldstraße 78b
>>>> 79117 Freiburg
>>>>
>>>> http://www.adition.com
>>>>
>>>> T +49 / (0)761 / 88147 - 30
>>>> F +49 / (0)761 / 88147 - 77
>>>> SUPPORT +49 / (0)1805 - ADITION
>>>>
>>>> (landline rate 14 ct/min; mobile rates at most 42 ct/min)
>>>>
>>>> Registered at the Düsseldorf District Court under HRB 54076
>>>> Executive Board: Andreas Kleiser, Jörg Klekamp, Tihomir Perkovic,
>>>> Marcus Schlüter
>>>> Chairman of the Supervisory Board: Daniel Raimer, attorney-at-law
>>>> VAT ID No.: DE 218 858 434
>>
>> Johannes Schwenk
>

Johannes Schwenk
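
To sanity-check the expected exec() call count outside of Pig, here is a minimal sketch in plain Java. It does not use the Pig API at all: the class name OrFilterCallCount, the contains() stand-in for the CONTAINS UDF, and the sample records and keywords are all made up for illustration. It also assumes that OR is evaluated with short-circuit semantics like Java's ||, which may or may not match what a given Pig version does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone model of "FILTER ... BY CONTAINS(f, a) OR CONTAINS(f, b)".
// Not the Pig API; it only counts how often the predicate would be invoked.
final class OrFilterCallCount {
    // Incremented on every predicate call, like a counter placed inside exec().
    public static int calls = 0;

    // Stand-in for the CONTAINS UDF: a plain substring check.
    public static boolean contains(String field, String keyword) {
        calls++;
        return field != null && field.contains(keyword);
    }

    // Keeps every record matching kw1 OR kw2. Because || short-circuits,
    // the right-hand side runs only when the left-hand side is false.
    public static List<String> filter(List<String> records, String kw1, String kw2) {
        List<String> kept = new ArrayList<>();
        for (String record : records) {
            if (contains(record, kw1) || contains(record, kw2)) {
                kept.add(record);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up sample data, not the contents of the test input file.
        List<String> records = Arrays.asList("apple pie", "banana split", "cherry tart");
        List<String> kept = filter(records, "apple", "banana");
        System.out.println("kept=" + kept.size() + " calls=" + calls);
    }
}
```

Under these assumptions the call count lies between the number of records (every left-hand side matches) and twice that number (no left-hand side matches). A count exactly equal to the number of records, while some left-hand sides do not match, would suggest that the right-hand side of the OR is never evaluated.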