Hi Alan,

I'd like to use this method to exclude records from my output that are already present in previously computed data. So I tried to adapt your suggestion like this:
grunt> cat in.dat
1
2
3
4
5
6
7
8
9
grunt> C = LOAD 'in.dat' AS (A1); -- previously generated data
grunt> cat in2.dat
12
2
13
1
10
9
11
8
grunt> A = LOAD 'in2.dat' AS (A1); -- new data
grunt> B1 = join A by A1, C by A1;
grunt> B2 = filter B1 by SIZE(C) == 0;

Which gives me this error:

2012-07-04 14:36:16,768 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: Pig script failed to parse:
<line 14, column 23> Invalid scalar projection: C : A column needs to be projected from a relation for it to be used as a scalar
Details at logfile: /home/schwenk/pig-0.10.0/pig_1341403702015.log

The relevant Pig stack trace from the logfile can be found at http://pastebin.com/MxPfduWS

What am I doing wrong?

Greetings,
Johannes

On 25.06.2012 18:39, Alan Gates wrote:
> This type of in is really a semi-join. So you could rewrite this as:
>
> B1 = join A by A1, C by A1;
> B2 = filter B1 by SIZE(C) > 0;
> B = foreach B2 flatten(A);
>
> Alan.
>
> On Jun 25, 2012, at 2:50 AM, yonghu wrote:
>
>> Dear all,
>>
>> in the sql, there is a in clause which is used to check if the value
>> is in a set or not? Does pig also have the same in clause? Such as:
>>
>> B = filter A by A1 in C;
>>
>> A,B,C are relation names and A1 is a column_name of A.
>>
>> Thanks!
>>
>> Yong
>

--
Johannes Schwenk
Software Developer (Reporting)
ADITION technologies AG
