Just for the record, I'm posting the solution to my problem here. Thank you for your help.
In the end, the problem seems to be with the JsonLoader I was using. I don't know exactly why, but it seems to have a bug with my strings. I finally changed my code to use https://github.com/kevinweil/elephant-bird.

The code now looks like this:

register 'elephant-bird-core-3.0.0.jar';
register 'elephant-bird-pig-3.0.0.jar';
register 'google-collections-1.0.jar';
register 'json-simple-1.1.jar';

-- load each JSON line as a map
json_lines = LOAD '/twitterecho/tweets/stream/v1/json/2012_10_10/08' USING com.twitter.elephantbird.pig.load.JsonLoader();

-- pull out the two fields of interest
geo_tweets = FOREACH json_lines GENERATE (CHARARRAY) $0#'id' AS id, (CHARARRAY) $0#'geoLocation' AS geoLocation;

-- keep one tweet per id
tweets_grp = GROUP geo_tweets BY id;

unique_tweets = FOREACH tweets_grp {
    first_tweet = LIMIT geo_tweets 1;
    GENERATE FLATTEN(first_tweet);
};

-- note: this filters geo_tweets, so unique_tweets is not used in the stored result
only_not_nulls = FILTER geo_tweets BY geoLocation is not null;

store only_not_nulls into '/twitter_data/results/geo_tweets';

Cheers, and thanks again for your support.

Arian P

2012/11/1 Arian Pasquali <[email protected]>

> You are right Cheolsoo,
> Indeed, it doesn't make any sense to write an UDF to compare datatypes. I
> know its possible, but doesn't sound the right way.
> Maybe it can be a bug at the JsonLoader I'm using
> https://github.com/mmay/PigJsonLoader/blob/master/JsonLoader.java
>
> I will share with u the script and the data in a few.
>
> tks for the hints.
>
> Arian Rodrigo Pasquali
> FEUP, SAPO Labs
> http://www.arianpasquali.com
> twitter @arianpasquali
>
>
> 2012/10/31 Cheolsoo Park <[email protected]>
>
>> Hi,
>>
>> > what's be the best way to filter only the valid rows, since some of
>> them are string and others map?
>>
>> This shouldn't happen. The data type is defined per column, so it should be
>> either string or map for all rows. If that's not the case, it should be a
>> bug.
>>
>> > can create an expression to compare datatypes? is it possible?
>>
>> Technically, you should be able to write a UDF that checks type. But I am
>> more interested in knowing why you're running into this problem.
>> Can you please share your script and sample data? I'd like to reproduce it.
>>
>> Thanks,
>> Cheolsoo
>>
>> On Wed, Oct 31, 2012 at 2:54 PM, Arian Pasquali <[email protected]> wrote:
>>
>> > can create an expression to compare datatypes?
>> > is it possible?
>> >
>> > ArianP
>> >
>> > 2012/10/31 Arian Pasquali <[email protected]>
>> >
>> > > you are right, it doesn't seam like a null value.
>> > > it looks like a chararray. But the expression causes error when comparing
>> > > a string with ([longitude#-9.15199849,latitude#38.71179122])
>> > >
>> > > geoinfo_no_nulls = FILTER geoinfo BY $0!='null'
>> > >
>> > > I get
>> > > ERROR 2997: Unable to recreate exception from backed error:
>> > > org.apache.pig.backend.executionengine.ExecException: ERROR 1071: Cannot
>> > > convert a map to a String
>> > >
>> > > what's be the best way to filter only the valid rows, since some of them
>> > > are string and others map?
>> > >
>> > > Arian
>> > >
>> > > 2012/10/31 Cheolsoo Park <[email protected]>
>> > >
>> > >> Hi,
>> > >>
>> > >> I am not sure what's the problem because I can't reproduce it. To me, null
>> > >> values are printed as an empty "( )" not "(null)", so it doesn't seem like
>> > >> null.
>> > >>
>> > >> I am wondering whether OpenJDK is the problem. Can you try Oracle HotSpot
>> > >> JDK 1.6 and see that fixes it?
>> > >>
>> > >> Thanks,
>> > >> Cheolsoo
>> > >>
>> > >> On Wed, Oct 31, 2012 at 1:06 PM, Arian Pasquali <[email protected]> wrote:
>> > >>
>> > >> > hey people
>> > >> > I'm having some troubles with a silly task, I can't find a way to filter
>> > >> > null values from my rows.
>> > >> > This is the result when I dump the object geoinfo:
>> > >> >
>> > >> > DUMP geoinfo;
>> > >> > ([longitude#70.95853,latitude#30.9773])
>> > >> > ([longitude#-9.37944507,latitude#38.91780853])
>> > >> > (null)
>> > >> > (null)
>> > >> > (null)
>> > >> > ([longitude#-92.64416,latitude#16.73326])
>> > >> > (null)
>> > >> > (null)
>> > >> > ([longitude#-9.15199849,latitude#38.71179122])
>> > >> > ([longitude#-9.15210796,latitude#38.71195131])
>> > >> >
>> > >> > and here is the description
>> > >> >
>> > >> > DESCRIBE geoinfo;
>> > >> > geoinfo: {geoLocation: bytearray}
>> > >> >
>> > >> > What I'm trying to do is to filter null values like this:
>> > >> >
>> > >> > geoinfo_no_nulls = FILTER geoinfo BY geoLocation is not null;
>> > >> >
>> > >> > but the result remains the same. nothing is filtered.
>> > >> >
>> > >> > I also tried something like this
>> > >> >
>> > >> > geoinfo_no_nulls = FILTER geoinfo BY geoLocation != 'null';
>> > >> >
>> > >> > and I got an error
>> > >> >
>> > >> > org.apache.pig.backend.executionengine.ExecException: ERROR 1071: Cannot
>> > >> > convert a map to a String
>> > >> >
>> > >> > What am I doing wrong here?
>> > >> >
>> > >> > env details,
>> > >> >
>> > >> > Ubuntu 12.04.1 LTS,
>> > >> > hadoop-1.0.3
>> > >> > pig 0.9.3 version 0.9.3-SNAPSHOT (rexported) compiled Oct 24 2012, 19:04:03
>> > >> > java version "1.6.0_24" OpenJDK Runtime Environment (IcedTea6 1.11.4)
>> > >> > (6b24-1.11.4-1ubuntu0.12.04.1)
>> > >> > OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)
>> > >> >
>> > >> > ArianP
