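I think you can just filter by "foo matches '\\p{ASCII}*'", which keeps
only the values made up entirely of ASCII characters. (A pattern like
'.*\\p{ASCII}.*' matches anything containing at least one ASCII character,
so negating it would drop your plain-English rows instead.)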
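
Here is a quick Java sketch of how these patterns behave, on the assumption
that Pig's "matches" follows Java regex semantics and has to match the whole
string. The class and method names are made up for illustration, and the
"mixed" sample value is mine, not from your data.

```java
import java.util.regex.Pattern;

public class AsciiFilterSketch {
    // Whole-string match: true only if every character is ASCII
    // (an empty string also matches).
    static final Pattern ALL_ASCII = Pattern.compile("\\p{ASCII}*");

    // Whole-string match: true if at least one character is ASCII.
    static final Pattern HAS_ASCII = Pattern.compile(".*\\p{ASCII}.*");

    static boolean isAllAscii(String s) {
        return ALL_ASCII.matcher(s).matches();
    }

    static boolean hasAnyAscii(String s) {
        return HAS_ASCII.matcher(s).matches();
    }

    public static void main(String[] args) {
        String plain = "apples and oranges";  // tags value from the first record
        String korean = "\uac38\uc6b0";       // tags value from the second record
        String mixed = "apples \uac38";       // hypothetical mixed value

        for (String s : new String[] {plain, korean, mixed}) {
            System.out.println(s
                + " allAscii=" + isAllAscii(s)
                + " hasAnyAscii=" + hasAnyAscii(s));
        }
    }
}
```

Only the first value passes isAllAscii, so a filter on '\\p{ASCII}*' would
keep the first record and drop the other two.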

On Fri, Nov 11, 2011 at 1:12 PM, Kat Huang wrote:
>
> I have parsed a json file structured as:
> {"id":"xyz", "name":"John", "tags":"apples and oranges"}
> {"id":"xyz", "name":"John", "tags":"\uac38\uc6b0"}...etc
>
> and I'd like to filter out the entries that contain unicode --like the
> second entry.
> I've tried using:
>
> rawdata = LOAD 'data' using PigJasonLoader() as (json:map[]);
> logs = FOREACH rawdata generate json#name as thingtag;
> result = FILTER logs by thingtag matches '.*\\\\[a-z].*';
> dump result;
>
> This does not filter the second entry. What's more -- when I just look
> at the tags being loaded, it looks like the unicode characters have
> been converted (i.e., I see weird graphics)
>
> running:
> rawdata = LOAD 'data' using PigJasonLoader() as (json:map[]);
> logs = FOREACH rawdata generate json#name as thingtag;
> dump logs;
>
> Any help would be appreciated.
