Yes, I agree: it should behave like 'matches' and REGEX_EXTRACT_ALL.

The problem is specific to REGEX_EXTRACT.

A simple workaround is to use REGEX_EXTRACT_ALL, e.g.
REGEX_EXTRACT_ALL(value, '(.+?)/?').$0, but that defeats the purpose of
having REGEX_EXTRACT, and the bug is difficult to spot.
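
To make the difference concrete, here is a minimal Java sketch. The class
and helper names (RegexExtractDemo, extractWithFind, extractWithMatches) are
hypothetical and only mimic the two behaviors; this is not Pig's actual
code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Pig.
public class RegexExtractDemo {

    // Mimics a find()-based extract: returns the requested group of the
    // first subsequence that matches, or null if nothing matches.
    public static String extractWithFind(String input, String regex, int group) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(group) : null;
    }

    // Mimics a matches()-based extract: the whole input must match the
    // pattern, otherwise null is returned.
    public static String extractWithMatches(String input, String regex, int group) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(group) : null;
    }

    public static void main(String[] args) {
        String a = "hdfs://mygrid.com/projects/";
        // The reluctant group stops after one character when find() is used.
        System.out.println(extractWithFind(a, "(.+?)/?", 1));    // h
        // matches() forces the group to grow until the whole input is consumed.
        System.out.println(extractWithMatches(a, "(.+?)/?", 1)); // hdfs://mygrid.com/projects
    }
}
```

With find(), the reluctant quantifier is satisfied by the shortest possible
prefix, which is why only "h" comes back.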

I posted:
https://issues.apache.org/jira/browse/PIG-2514

Romain



On Sun, Feb 5, 2012 at 9:41 PM, Dmitriy Ryaboy <[email protected]> wrote:

> I think the intent is to behave the same way as the Pig "matches" operator
> (which, unsurprisingly, uses the Java matches method).
>
> RegexExtractAll becomes quite confusing if it means "extract all matched
> subexpressions of the first match of the expression" (one might expect
> "all" to refer to all matches of the expression itself).
>
> At the very least, the behavior should be documented.
>
> On Fri, Feb 3, 2012 at 5:29 PM, Romain Rigaux <[email protected]> wrote:
>
> > Hello,
> >
> > REGEX_EXTRACT uses Matcher.find() instead of Matcher.matches(), so it
> > does not work with some non-greedy regular expressions.
> >
> > Is this the intended behavior?
> >
> > Thanks,
> >
> > Romain
> >
> >
> >
> > http://docs.oracle.com/javase/1.4.2/docs/api/java/util/regex/Matcher.html
> >
> >    - The matches method
> >      <http://docs.oracle.com/javase/1.4.2/docs/api/java/util/regex/Matcher.html#matches()>
> >      attempts to match the entire input sequence against the pattern.
> >    - The find method
> >      <http://docs.oracle.com/javase/1.4.2/docs/api/java/util/regex/Matcher.html#find()>
> >      scans the input sequence looking for the next subsequence that
> >      matches the pattern.
> >
> >
> >
> >
> >     System.out.println("Pig's way with m.find()");
> >     String a = "hdfs://mygrid.com/projects/";
> >     Matcher m = Pattern.compile("(.+?)/?").matcher(a);
> >     System.out.println(m.find());
> >     System.out.println(m.group(1));
> >     System.out.println(m.start());
> >     System.out.println(m.end());
> >
> >     System.out.println("\nm.matches()");
> >     a = "hdfs://mygrid.com/projects/";
> >     m = Pattern.compile("(.+?)/?").matcher(a);
> >     System.out.println(m.matches());
> >     System.out.println(m.group(1));
> >     System.out.println(m.start());
> >     System.out.println(m.end());
> >
> >     System.out.println("\nREGEX_EXTRACT m.find()");
> >     Tuple t = TupleFactory.getInstance().newTuple();
> >     t.append(a);
> >     t.append("(.+?)/?");
> >     t.append(1);
> >     System.out.println(new TestPigExtractAll().new REGEX_EXTRACT().exec(t));
> >
> >
> > Output:
> >
> > Pig's way with m.find()
> > true
> > h
> > 0
> > 1
> >
> > m.matches()
> > true
> > hdfs://mygrid.com/projects
> > 0
> > 27
> >
> > REGEX_EXTRACT m.find()
> > h
> >
> >
> >
>
