REGEX_EXTRACT returns a single chararray rather than a tuple or bag, so it doesn't need to be flattened. In this case, use:

REGEX_EXTRACT(uri,'id=(\\d*)',0); --returns id=1234
or
REGEX_EXTRACT(uri,'id=(\\d*)',1); --returns 1234

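You were missing the *, so the pattern matches only a single digit, which is why PHP only grabbed the 1. Your pattern also has no capture group, so index 2 has nothing to return in Pig: the index refers to a parenthesized group, with 0 meaning the whole match.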
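
For reference, here is a rough sketch of how the corrected call might slot into your FOREACH. The relation name `ids` and the field name `id` are just names I made up for the example, and I used `\\d+` instead of `\\d*` so that at least one digit is required; that is my own variation, not something your script needs.

```pig
-- Sketch only: 'logs', 'time' and 'uri' come from the quoted script below;
-- 'ids' and 'id' are hypothetical names chosen for this example.
ids = FOREACH logs GENERATE
          time,
          uri,
          REGEX_EXTRACT(uri, 'id=(\\d+)', 1) AS id;  -- group 1: the digits after "id="

-- Print the result to check that the id column is populated.
DUMP ids;
```

Note there is no FLATTEN around the call, and the backslash is doubled because it sits inside a Pig string literal.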

2011/6/17 Irooniam <[email protected]>

> Hello,
>
> I'm having an issue with regex in pig.
>
> Specifically, I'm loading an apache access log and trying to break out the
> bits from the query string:
>
> logs = LOAD '$input' using logloader as (remoteHost:CHARARRAY,
> hyphen:CHARARRAY, hyphen2:CHARARRAY, time:CHARARRAY, method:CHARARRAY,
> uri:CHARARRAY, protocol:CHARARRAY, statusCode:CHARARRAY,
> responseSize:CHARARRAY, treferer:CHARARRAY, agent:CHARARRAY);
>
> full_logs = FOREACH logs GENERATE time, uri, FLATTEN(REGEX_EXTRACT(uri,
> 'id=[0-9]', 2));
>
> The uri looks like:
> /khello.html?ref=http%3A%2F%2Fwww.google.com
> %2F&k=4165427574dfdb75e0a37a8c13ab757d4273a283&id=1234
>
> However when I run this simple pig script, I get the uri but not the 'id'
> parameter.
>
> I then tried using "\d" instead of [0-9] - still won't work.
>
> I tried both [0-9] and \d in php and I get 'id=1' and '1' so I'm not sure
> what I'm doing wrong.
>
> Thanks in advance.
>
