Re: filter (regex) unicode from json files in pig?

mh2942 Fri, 11 Nov 2011 15:56:58 -0800

Thank you so much, that did the trick.

Quoting Jonathan Coveney <[email protected]>:

Dmitriy's solution is definitely more elegant than writing a UDF, and in a
quick test it worked equally well.

c = filter a by x matches '\\p{ASCII}*';

This would work if you wanted to ensure that all characters are ASCII.
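
For anyone who wants to check the pattern outside Pig: as far as I know, Pig's
matches operator hands the pattern to Java's regex engine and requires the
whole string to match, so the behavior can be sketched in plain Java. The class
and method names below are made up for illustration; the sample strings are the
two tags values from the original question.

```java
import java.util.regex.Pattern;

final class AsciiFilterSketch {
    // Same regex the Pig filter is assumed to pass to the engine: zero or more ASCII characters.
    private static final Pattern ALL_ASCII = Pattern.compile("\\p{ASCII}*");

    // True only when every character of s is in the ASCII range (full-string match).
    static boolean isAllAscii(String s) {
        return ALL_ASCII.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAllAscii("apples and oranges")); // true
        System.out.println(isAllAscii("\uac38\uc6b0"));       // false
        System.out.println(isAllAscii(""));                   // true
    }
}
```

Note that an empty string also passes, because * allows zero characters.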

2011/11/11 Dmitriy Ryaboy <[email protected]>

I think you can just filter by "not foo matches '.*\\p{ASCII}.*'"
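
A caveat on this form, sketched in Java on the assumption that Pig's matches
follows Java's full-string matching: '.*\\p{ASCII}.*' matches any value that
contains at least one ASCII character, so negating it keeps only values with no
ASCII characters at all and drops plain-ASCII rows. The names below are
hypothetical.

```java
import java.util.regex.Pattern;

final class NegatedContainsAsciiSketch {
    // Matches any single-line string that contains at least one ASCII character.
    private static final Pattern CONTAINS_ASCII = Pattern.compile(".*\\p{ASCII}.*");

    // Mirrors "not foo matches ...": a row is kept when the pattern does NOT match.
    static boolean keptByNegatedFilter(String s) {
        return !CONTAINS_ASCII.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(keptByNegatedFilter("apples and oranges")); // false
        System.out.println(keptByNegatedFilter("\uac38\uc6b0"));       // true
        System.out.println(keptByNegatedFilter("\uac38 apples"));      // false
    }
}
```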

On Fri, Nov 11, 2011 at 1:12 PM, Kat Huang <[email protected]> wrote:
>
> I have parsed a json file structured as:
> {"id":"xyz", "name":"John", "tags":"apples and oranges"}
> {"id":"xyz", "name":"John", "tags":"\uac38\uc6b0"}...etc
>
> and I'd like to filter out the entries that contain unicode, like the
> second entry.
> I've tried using:
>
> rawdata = LOAD 'data' using PigJasonLoader() as (json:map[]);
> logs = FOREACH rawdata generate json#name as thingtag;
> result = FILTER logs by thingtag matches '.*\\\\[a-z].*';
> dump result;
>
> This does not filter the second entry. What's more, when I just look
> at the tags being loaded, it looks like the unicode characters have
> been converted (i.e., I see weird graphics)
>
> running:
> rawdata = LOAD 'data' using PigJasonLoader() as (json:map[]);
> logs = FOREACH rawdata generate json#name as thingtag;
> dump logs;
>
> Any help would be appreciated.
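
On why the original '.*\\\\[a-z].*' filter matched nothing: a likely
explanation, assuming the loader decodes the JSON \uXXXX escapes into real
characters (which the "weird graphics" in the dump suggest), is that the loaded
value no longer contains a backslash for the pattern to find. A small Java
sketch with hypothetical names:

```java
import java.util.regex.Pattern;

final class EscapeVersusDecodedSketch {
    // A literal backslash followed by a lowercase letter, anywhere in the string.
    private static final Pattern BACKSLASH_LETTER = Pattern.compile(".*\\\\[a-z].*");

    static boolean hasBackslashLetter(String s) {
        return BACKSLASH_LETTER.matcher(s).matches();
    }

    public static void main(String[] args) {
        // Text exactly as it appears in the JSON file: backslash, 'u', four hex digits, twice.
        String rawEscapes = "\\uac38\\uc6b0";
        // The same value after JSON decoding: two Hangul characters, no backslash.
        String decoded = "\uac38\uc6b0";

        System.out.println(rawEscapes.length());            // 12
        System.out.println(decoded.length());               // 2
        System.out.println(hasBackslashLetter(rawEscapes)); // true
        System.out.println(hasBackslashLetter(decoded));    // false
    }
}
```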
