Thanks, Marcos! I ended up going with a UDF and it's working great.
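
For anyone who finds this thread later, here is a minimal sketch of how such a UDF could look. It is an illustration, not necessarily the code used here. It assumes a Pig build with Jython UDF support; the file name extract_test.py, the function name extract_test and the alias myfuncs are made up for the example. Note that it uses [^;\s]+ rather than the \S+ from the original pattern, so the match stops at the next semicolon.

```python
import re

# Pig's Jython engine injects an `outputSchema` decorator when the script is
# registered. This fallback exists only so the file also runs under plain
# Python (for example, when testing the function locally).
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def decorator(func):
            return func
        return decorator

# The pattern lives in this file rather than in the Pig script, so the Pig
# parser never sees the semicolon. [^;\s]+ stops at the next semicolon or
# whitespace; (?:^|;) keeps keys such as "mytest" from matching.
TEST_PATTERN = re.compile(r'(?:^|;)test=([^;\s]+)')


@outputSchema('test:chararray')
def extract_test(str_of_interest):
    """Return the value of the 'test' key, or None if it is absent."""
    if str_of_interest is None:
        return None
    match = TEST_PATTERN.search(str_of_interest)
    if match is None:
        return None
    return match.group(1)
```

Under those assumptions, the Pig side would look roughly like this:

    REGISTER 'extract_test.py' USING jython AS myfuncs;
    blah = FOREACH data GENERATE myfuncs.extract_test(str_of_interest) AS test;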

On Tue, Apr 16, 2013 at 4:06 AM, MARCOS MEDRADO RUBINELLI <
[email protected]> wrote:

> Dylan,
>
> It seems my first message fell through a crack, so I apologize if you
> receive it twice, but: yes it is a known issue, and there isn't a stable
> version with the fix yet. I see two ways to work around it:
>
> 1. write a UDF that encapsulates the regex
>
> 2. load the regex from a file
>
> I actually tested number 2. I ran it on Pig 0.10.0, but it should work on a
> recent version of EMR too:
>
> $ echo "test=(\\S+);?" > testregex.txt
> $ hadoop fs -put testregex.txt /tmp
>
> B = LOAD '/tmp/testregex.txt' as (regex :chararray);
>
> blah =
>        FOREACH
>          data
>        GENERATE
>          FLATTEN (
>            REGEX_EXTRACT (
>              str_of_interest, B.regex, 1
>            )
>          )
>          AS (
>            test: chararray
>          )
>        ;
>
> Cheers,
> Marcos
>
> On 16-04-2013 02:03, Dylan Sather wrote:
> > Hi y'all,
> >
> > First time on this list, and hoping you might be able to help me with a
> > (possible) issue.
> >
> > I'm working with some data in Pig that includes strings of interest,
> > optionally separated by semicolons and in random order, e.g.
> >
> >      test=12345;foo=bar
> >      test=12345
> >      foo=bar;test=12345
> >
> > The following code should extract the value of the string for the test
> > 'key':
> >
> >      blah =
> >        FOREACH
> >          data
> >        GENERATE
> >          FLATTEN (
> >            EXTRACT (
> >              str_of_interest,
> >              'test=(\\S+);?'
> >            )
> >          )
> >          AS (
> >            test: chararray
> >          )
> >        ;
> >
> > However, when running the code, I encounter the following error:
> >
> >      <line 46, column 0>  mismatched character '<EOF>' expecting '''
> >      2013-04-16 04:46:05,245 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <line 46, column 0>  mismatched character '<EOF>' expecting '''
> >
> > I thought I had my regex escape syntax off at first, but that doesn't
> > appear to be the problem. The only information I get from a Google search
> > is a bug report (https://issues.apache.org/jira/browse/PIG-2507) that
> > appears to have been recently fixed, but it's still an issue on the Amazon
> > EMR cluster I'm running (spun up ad hoc, just now, for this analysis).
> >
> > As in the bug report and as suggested elsewhere, replacing the semicolon
> > with its Unicode equivalent (\u003B) yields the same error.
> >
> > I could be crazy and this could be a syntax issue, so I'm hoping someone
> > might be able to point me in the right direction or confirm that this is an
> > existing problem. If the latter, are there any workarounds (either in Pig,
> > or for matching the string I want)?
> >
> > Cheers.
> > Dylan
> >
>
