I think you can just filter by "foo matches '\\p{ASCII}*'", with foo being
the field you want to check (tags in your sample). matches has to match the
whole value, so this keeps only the rows where every character is ASCII.
Filtering by "not foo matches '.*\\p{ASCII}.*'" would do the opposite and
keep only the rows that contain no ASCII characters at all.

On Fri, Nov 11, 2011 at 1:12 PM, Kat Huang wrote:
>
> I have parsed a json file structured as:
> {"id":"xyz", "name":"John", "tags":"apples and oranges"}
> {"id":"xyz", "name":"John", "tags":"\uac38\uc6b0"}...etc
>
> and I'd like to filter out the entries that contain unicode --like the
> second entry.
> I've tried using:
>
> rawdata = LOAD 'data' using PigJasonLoader() as (json:map[]);
> logs = FOREACH rawdata generate json#name as thingtag;
> result = FILTER logs by thingtag matches '.*\\\\[a-z].*';
> dump result;
>
> This does not filter the second entry. What's more -- when I just look
> at the tags being loaded, it looks like the unicode characters have
> been converted (ie I see weird graphics)
>
> running:
> rawdata = LOAD 'data' using PigJasonLoader() as (json:map[]);
> logs = FOREACH rawdata generate json#name as thingtag;
> dump logs;
>
> Any help would be appreciated.
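
On the regex itself: as far as I know, Pig's matches operator uses Java regular expressions and has to match the whole value, so the behaviour can be checked in plain Java outside of Pig. The sketch below is only an illustration under that assumption. The class name AsciiFilter and its two helper methods are made up for this example and are not part of Pig or of any loader. It tests the two sample tags values against \p{ASCII}*, which accepts a value only when every character in it is ASCII.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper class, for illustration only (not part of Pig).
public class AsciiFilter {
    // The regex that the Pig literal '\\p{ASCII}*' is assumed to denote.
    private static final Pattern ASCII_ONLY = Pattern.compile("\\p{ASCII}*");

    // True only when the whole value consists of ASCII characters.
    // Null values are rejected.
    public static boolean isAsciiOnly(String value) {
        return value != null && ASCII_ONLY.matcher(value).matches();
    }

    // Keeps the values that are entirely ASCII and drops the rest.
    public static List<String> keepAsciiOnly(List<String> values) {
        return values.stream()
                .filter(AsciiFilter::isAsciiOnly)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // The two "tags" values from the sample records in the question.
        List<String> tags = Arrays.asList("apples and oranges", "\uac38\uc6b0");
        System.out.println(keepAsciiOnly(tags)); // prints [apples and oranges]
    }
}
```

A pattern of the form .*\p{ASCII}.* behaves differently: it accepts any value that contains at least one ASCII character, so negating it keeps only values with no ASCII characters at all.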
