Dmitriy's solution is definitely more elegant than writing a UDF, and in a
quick test it worked equally well.

c = filter a by x matches '\\p{ASCII}*';

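This would work if you wanted to ensure that all characters are ASCII:
matches has to match the whole string, so a record passes the filter only
if every character in x is ASCII.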
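
Since Pig's matches operator uses Java regular expressions, the pattern
above can be sanity-checked outside Pig. The following is only a rough
sketch in plain Java, not Pig code; the class name AsciiFilterSketch and
the helper isAllAscii are made up for illustration, and the sample strings
are taken from the tags in the original question.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, used only to illustrate the regex semantics.
public class AsciiFilterSketch {

    // Same regex as in the Pig filter: zero or more ASCII characters.
    private static final Pattern ALL_ASCII = Pattern.compile("\\p{ASCII}*");

    // Returns true only if the whole string consists of ASCII characters.
    // Matcher.matches() requires the pattern to cover the entire input.
    static boolean isAllAscii(String s) {
        return ALL_ASCII.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAllAscii("apples and oranges")); // true
        System.out.println(isAllAscii("\uac38\uc6b0"));       // false
        System.out.println(isAllAscii("caf\u00e9"));          // false: one non-ASCII character is enough
    }
}
```

Note that an empty string also passes, because the pattern allows zero
characters.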

2011/11/11 Dmitriy Ryaboy <[email protected]>

> I think you can just filter by "not foo matches '.*\\p{ASCII}.*'
>
> On Fri, Nov 11, 2011 at 1:12 PM, Kat Huang <[email protected]> wrote:
> >
> > I have parsed a json file structured as:
> > {"id":"xyz", "name":"John", "tags":"apples and oranges"}
> > {"id":"xyz", "name":"John", "tags":"\uac38\uc6b0"}...etc
> >
> > and I'd like to filter out the entries that contain unicode --like the
> > second entry.
> > I've tried using:
> >
> > rawdata = LOAD 'data' using PigJasonLoader() as (json:map[]);
> > logs = FOREACH rawdata generate json#name as thingtag;
> > result = FILTER logs by thingtag matches '.*\\\\[a-z].*';
> > dump result;
> >
> > This does not filter the second entry. What's more -- when I just look
> > at the tags being loaded, it looks like the unicode characters have
> > been converted (ie I see weird graphics)
> >
> > running:
> > rawdata = LOAD 'data' using PigJasonLoader() as (json:map[]);
> > logs = FOREACH rawdata generate json#name as thingtag;
> > dump logs;
> >
> > Any help would be appreciated.
>
