Thank you so much, that did the trick.

Quoting Jonathan Coveney <[email protected]>:
Dmitriy's solution is definitely more elegant than writing a UDF, and in a
quick test, worked equally well.

    c = filter a by x matches '\\p{ASCII}*'

This would work if you wanted to ensure that all characters are ASCII.

2011/11/11 Dmitriy Ryaboy <[email protected]>

I think you can just filter by "not foo matches '.*\\p{ASCII}.*'"

On Fri, Nov 11, 2011 at 1:12 PM, Kat Huang <[email protected]> wrote:
>
> I have parsed a json file structured as:
> {"id":"xyz", "name":"John", "tags":"apples and oranges"}
> {"id":"xyz", "name":"John", "tags":"\uac38\uc6b0"}...etc
>
> and I'd like to filter out the entries that contain unicode --like the
> second entry.
> I've tried using:
>
> rawdata = LOAD 'data' using PigJasonLoader() as (json:map[]);
> logs = FOREACH rawdata generate json#name as thingtag;
> result = FILTER logs by thingtag matches '.*\\\\[a-z].*';
> dump result;
>
> This does not filter the second entry. What's more -- when I just look
> at the tags being loaded, it looks like the unicode characters have
> been converted (i.e. I see weird graphics)
>
> running:
> rawdata = LOAD 'data' using PigJasonLoader() as (json:map[]);
> logs = FOREACH rawdata generate json#name as thingtag;
> dump logs;
>
> Any help would be appreciated.
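
For anyone comparing the two patterns quoted above, here is a minimal Java sketch of how they behave. It assumes Pig's matches operator uses Java regular expressions and has to match the whole string, the way Matcher.matches() does. The class and method names (AsciiFilterSketch, isAllAscii, containsAnyAscii, keepAscii) are hypothetical and not part of Pig, and the third sample value is made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper class, for illustration only; it is not part of Pig.
class AsciiFilterSketch {
    // Regex text as Pig would see it for '\\p{ASCII}*': zero or more ASCII characters.
    private static final Pattern ALL_ASCII = Pattern.compile("\\p{ASCII}*");
    // Regex text for '.*\\p{ASCII}.*': at least one ASCII character somewhere.
    private static final Pattern ANY_ASCII = Pattern.compile(".*\\p{ASCII}.*");

    // True only when every character of the string is ASCII (whole-string match).
    static boolean isAllAscii(String s) {
        return ALL_ASCII.matcher(s).matches();
    }

    // True when the string contains at least one ASCII character.
    static boolean containsAnyAscii(String s) {
        return ANY_ASCII.matcher(s).matches();
    }

    // Keeps only the values made up entirely of ASCII characters.
    static List<String> keepAscii(List<String> values) {
        return values.stream()
                .filter(AsciiFilterSketch::isAllAscii)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // The first two values come from the sample records; the third is an assumed mixed case.
        List<String> tags = Arrays.asList("apples and oranges", "\uac38\uc6b0", "caf\u00e9");
        System.out.println(keepAscii(tags));
    }
}
```

Under those assumptions, a value that mixes ASCII and non-ASCII characters still satisfies the "any ASCII" pattern, so negating that pattern would keep only values with no ASCII characters at all. The '\\p{ASCII}*' form is the one that keeps purely ASCII values.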
