Just for the record, I'm posting the solution to my problem here.

Thank you for your help.

In the end, the problem seems to be with the JsonLoader I was using. I don't
know exactly why, but it seems to have a bug with my strings.

I finally changed my code to use https://github.com/kevinweil/elephant-bird.

The code now looks like this:

    register 'elephant-bird-core-3.0.0.jar'
    register 'elephant-bird-pig-3.0.0.jar'
    register 'google-collections-1.0.jar'
    register 'json-simple-1.1.jar'

    json_lines = LOAD '/twitterecho/tweets/stream/v1/json/2012_10_10/08'
        USING com.twitter.elephantbird.pig.load.JsonLoader();

    geo_tweets = FOREACH json_lines GENERATE
        (CHARARRAY) $0#'id' AS id,
        (CHARARRAY) $0#'geoLocation' AS geoLocation;

    tweets_grp = GROUP geo_tweets BY id;
    unique_tweets = FOREACH tweets_grp {
          first_tweet = LIMIT geo_tweets 1;
          GENERATE FLATTEN(first_tweet);
    };

    only_not_nulls = FILTER geo_tweets BY geoLocation is not null;
    store only_not_nulls into '/twitter_data/results/geo_tweets';
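
One thing to note about the script above: the stored relation only_not_nulls
is filtered from geo_tweets, so the deduplication done in unique_tweets never
reaches the output. If one row per tweet id is what is wanted, a minimal
sketch could look like the following. It assumes geo_tweets is defined as
above; the aliases not_null_tweets, not_null_grp and unique_not_nulls and the
output path are made up for illustration, and the sketch is untested against
this data.

    -- Sketch only: drop rows without geoLocation first, then keep one row per id.
    not_null_tweets = FILTER geo_tweets BY geoLocation is not null;
    not_null_grp = GROUP not_null_tweets BY id;
    unique_not_nulls = FOREACH not_null_grp {
          first_tweet = LIMIT not_null_tweets 1;
          GENERATE FLATTEN(first_tweet) AS (id:chararray, geoLocation:chararray);
    };
    -- Hypothetical output path, not from the original script.
    store unique_not_nulls into '/twitter_data/results/unique_geo_tweets';

Filtering before grouping keeps the null rows out of the GROUP step, which
should reduce the amount of data shuffled.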


Cheers, and thanks again for your support.
Arian P



2012/11/1 Arian Pasquali <[email protected]>

> You are right, Cheolsoo.
> Indeed, it doesn't make sense to write a UDF to compare datatypes. I
> know it's possible, but it doesn't sound like the right way.
> Maybe it is a bug in the JsonLoader I'm using:
> https://github.com/mmay/PigJsonLoader/blob/master/JsonLoader.java
>
> I will share the script and the data with you shortly.
>
> Thanks for the hints.
>
> Arian Rodrigo Pasquali
> FEUP, SAPO Labs
> http://www.arianpasquali.com
> twitter @arianpasquali
>
>
>
> 2012/10/31 Cheolsoo Park <[email protected]>
>
>> Hi,
>>
>> > What would be the best way to filter only the valid rows, since some of
>> > them are strings and others are maps?
>>
>> This shouldn't happen. The data type is defined per column, so it should be
>> either string or map for all rows. If that's not the case, it should be a
>> bug.
>>
>> > Can I create an expression to compare datatypes? Is it possible?
>>
>> Technically, you should be able to write a UDF that checks type. But I am
>> more interested in knowing why you're running into this problem. Can you
>> please share your script and sample data? I'd like to reproduce it.
>>
>> Thanks,
>> Cheolsoo
>>
>> On Wed, Oct 31, 2012 at 2:54 PM, Arian Pasquali <[email protected]
>> >wrote:
>>
>> > Can I create an expression to compare datatypes?
>> > Is it possible?
>> >
>> > ArianP
>> >
>> > 2012/10/31 Arian Pasquali <[email protected]>
>> >
>> > > You are right, it doesn't seem like a null value.
>> > > It looks like a chararray. But the expression causes an error when
>> > > comparing a string with ([longitude#-9.15199849,latitude#38.71179122])
>> > >
>> > > geoinfo_no_nulls = FILTER geoinfo BY $0!='null'
>> > >
>> > > I get
>> > > ERROR 2997: Unable to recreate exception from backed error:
>> > > org.apache.pig.backend.executionengine.ExecException: ERROR 1071: Cannot
>> > > convert a map to a String
>> > >
>> > > What would be the best way to filter only the valid rows, since some of
>> > > them are strings and others are maps?
>> > >
>> > > Arian
>> > >
>> > >
>> > >
>> > > 2012/10/31 Cheolsoo Park <[email protected]>
>> > >
>> > >> Hi,
>> > >>
>> > >> I am not sure what the problem is because I can't reproduce it. To me,
>> > >> null values are printed as an empty "( )" not "(null)", so it doesn't
>> > >> seem like null.
>> > >>
>> > >> I am wondering whether OpenJDK is the problem. Can you try Oracle HotSpot
>> > >> JDK 1.6 and see whether that fixes it?
>> > >>
>> > >> Thanks,
>> > >> Cheolsoo
>> > >>
>> > >> On Wed, Oct 31, 2012 at 1:06 PM, Arian Pasquali <
>> > [email protected]
>> > >> >wrote:
>> > >>
>> > >> > Hey people,
>> > >> > I'm having some trouble with a silly task: I can't find a way to filter
>> > >> > null values from my rows. This is the result when I dump the object
>> > >> > geoinfo:
>> > >> >
>> > >> > DUMP geoinfo;
>> > >> > ([longitude#70.95853,latitude#30.9773])
>> > >> > ([longitude#-9.37944507,latitude#38.91780853])
>> > >> > (null)
>> > >> > (null)
>> > >> > (null)
>> > >> > ([longitude#-92.64416,latitude#16.73326])
>> > >> > (null)
>> > >> > (null)
>> > >> > ([longitude#-9.15199849,latitude#38.71179122])
>> > >> > ([longitude#-9.15210796,latitude#38.71195131])
>> > >> >
>> > >> > and here is the description
>> > >> >
>> > >> > DESCRIBE geoinfo;
>> > >> > geoinfo: {geoLocation: bytearray}
>> > >> >
>> > >> > What I'm trying to do is to filter null values like this:
>> > >> >
>> > >> > geoinfo_no_nulls = FILTER geoinfo BY geoLocation is not null;
>> > >> >
>> > >> > but the result remains the same; nothing is filtered.
>> > >> >
>> > >> > I also tried something like this
>> > >> >
>> > >> > geoinfo_no_nulls = FILTER geoinfo BY geoLocation != 'null';
>> > >> >
>> > >> >  and I got an error
>> > >> >
>> > >> > org.apache.pig.backend.executionengine.ExecException: ERROR 1071: Cannot
>> > >> > convert a map to a String
>> > >> >
>> > >> > What am I doing wrong here?
>> > >> >
>> > >> > env details,
>> > >> >
>> > >> > Ubuntu 12.04.1 LTS,
>> > >> > hadoop-1.0.3
>> > >> > pig 0.9.3 version 0.9.3-SNAPSHOT (rexported) compiled Oct 24 2012, 19:04:03
>> > >> > java version "1.6.0_24" OpenJDK Runtime Environment (IcedTea6 1.11.4)
>> > >> > (6b24-1.11.4-1ubuntu0.12.04.1)
>> > >> > OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)
>> > >> >
>> > >> >
>> > >> > ArianP
>> > >> >
>> > >>
>> > >
>> > >
>> >
>>
>
>
