Awesome, works as advertised. Thanks for the help, Jonathan.

On Fri, Jun 17, 2011 at 6:04 PM, Jonathan Coveney <[email protected]> wrote:

> regex extract doesn't need to be flattened. In this case, use:
>
> REGEX_EXTRACT(uri,'id=(\\d*)',0); --returns id=1234
> or
> REGEX_EXTRACT(uri,'id=(\\d*)',1); --returns 1234
>
> You were missing the *, which is why it only grabbed the 1.
>
> 2011/6/17 Irooniam <[email protected]>
>
> > Hello,
> >
> > I'm having an issue with regex in pig.
> >
> > Specifically, I'm loading an apache access log and trying to break out the
> > bits from the query string:
> >
> > logs = LOAD '$input' using logloader as (remoteHost:CHARARRAY,
> > hyphen:CHARARRAY, hyphen2:CHARARRAY, time:CHARARRAY, method:CHARARRAY,
> > uri:CHARARRAY, protocol:CHARARRAY, statusCode:CHARARRAY,
> > responseSize:CHARARRAY, treferer:CHARARRAY, agent:CHARARRAY);
> >
> > full_logs = FOREACH logs GENERATE time, uri, FLATTEN(REGEX_EXTRACT(uri,
> > 'id=[0-9]', 2));
> >
> > The uri looks like:
> > /khello.html?ref=http%3A%2F%2Fwww.google.com
> > %2F&k=4165427574dfdb75e0a37a8c13ab757d4273a283&id=1234
> >
> > However when I run this simple pig script, I get the uri but not the 'id'
> > parameter.
> >
> > I then tried using "\d" instead of [0-9] - still won't work.
> >
> > I tried both [0-9] and \d in php and I get 'id=1' and '1' so I'm not sure
> > what I'm doing wrong.
> >
> > Thanks in advance.
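
For anyone finding this thread later, here is a rough sketch of how the patterns above behave. It is written in Python rather than Pig so it can be run anywhere; Python's `re` module stands in for the Java regex engine that Pig uses, and the two should agree for patterns this simple. The helper `regex_extract` is hypothetical and only approximates REGEX_EXTRACT: it assumes the first match is used and that a missing match or a missing group gives null (modeled here as `None`). The uri is the one from the original message with the wrapped line joined back together.

```python
import re


def regex_extract(text, pattern, index):
    """Hypothetical stand-in for Pig's REGEX_EXTRACT.

    Assumption: use the first match found in the string and return the
    requested group, or None when there is no match or no such group.
    """
    m = re.search(pattern, text)
    if m is None or index > m.re.groups:
        return None
    return m.group(index)


# The uri from the original message, with the wrapped line rejoined.
uri = ("/khello.html?ref=http%3A%2F%2Fwww.google.com"
       "%2F&k=4165427574dfdb75e0a37a8c13ab757d4273a283&id=1234")

# Suggested pattern: group 0 is the whole match, group 1 is the capture.
print(regex_extract(uri, r"id=(\d*)", 0))  # id=1234
print(regex_extract(uri, r"id=(\d*)", 1))  # 1234

# Original pattern: a single digit and no capture group, so index 2
# refers to a group that does not exist.
print(regex_extract(uri, r"id=[0-9]", 0))  # id=1
print(regex_extract(uri, r"id=[0-9]", 2))  # None
```

Note that inside a Pig string literal the backslash is doubled ('id=(\\d*)'), whereas the Python raw strings above use a single backslash.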
