It would have been really nice to define fn:analyze-string() so that it would 
capture every match of a repeated capturing group, but we made a policy decision 
that, as far as possible, the XPath/XQuery regex facilities should be 
implementable using existing regex libraries, and sadly they generally do 
not have this capability.

But in any case I think a multi-pass approach is probably appropriate here. In 
fact generally, I think the approach of trying to do everything in one great 
complex regular expression is usually misguided. Splitting it up into smaller 
steps not only makes the logic easier to understand (and therefore to debug and 
maintain), it can also benefit performance.

So:

(a) use analyze-string to mark any substring comprising "(" followed by digits, 
spaces, and commas, followed by ")"

(b) use tokenize to split out the individual numbers.
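
As a rough sketch of steps (a) and (b), assuming XQuery 3.0 or later (where 
analyze-string is available and the fn prefix is predeclared), something along 
these lines should work. The sample string, the variable names, and the exact 
regexes are illustrative choices, not the only way to write it:

```xquery
(: Shortened sample input; the variable names are illustrative. :)
let $string := "... Republic (UAR) President ... to achieve peace. (79, 171)"
return
  (: Pass 1: match "(" + digits, spaces, commas + ")". "(UAR)" is skipped. :)
  for $m in analyze-string($string, "\([\d, ]+\)")/fn:match
  (: Pass 2: split on runs of non-digits, dropping the empty edge tokens. :)
  return tokenize($m, "\D+")[. ne ""]
```

For the sample string this should return the two strings "79" and "171". The 
filter on empty strings is needed because the parentheses at either end of 
the match produce empty tokens.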

Michael Kay
Saxonica

> On 23 Apr 2018, at 17:22, Joe Wicentowski <joe...@gmail.com> wrote:
> 
> Hi all,
> 
> I have encountered an unexpected challenge constructing a regex for a pattern 
> I am looking for.  I am looking for numbers in parentheses.  For example, in 
> the following string:
> 
>   "On February 13, 1968, Secretary of State Dean Rusk sent a 
>     message to Israeli Foreign Minister Abba Eban calling upon Israel to 
>     endorse openly Resolution 242, and on May 13 President Johnson sent a 
>     letter to United Arab Republic (UAR) President Gamal Abdel Nasser, 
>     urging him to seize the unique opportunity offered by the Jarring 
>     mission to achieve peace. (79, 171)"
> 
> ... I would like to match "79" and "171" (but not "UAR" or "13" or "1968").  
> I have been trying to construct a regex for use with analyze-string to 
> capture this pattern, but I have not been successful.  I have tried the 
> following:
> 
>   analyze-string($string, "(?:\()(?:(\d+)(?:, )?)+(?:\))")
> 
> In other words, there are these 3 components:
> 
>   1. (?:\() a non-capturing group consisting of an open parens, followed by
>   2. (?:(\d+)(?:, )?)+ one or more non-capturing groups, each consisting of 
> a number followed by an optional, non-capturing comma-and-space, followed by
>   3. (?:\)) a non-capturing group consisting of a close parens
> 
> I was expecting to get the following output:
> 
>   <fn:analyze-string-result xmlns:fn="http://www.w3.org/2005/xpath-functions">
>     <fn:non-match>On February 13, 1968, Secretary of State Dean Rusk sent a 
>     message to Israeli Foreign Minister Abba Eban calling upon Israel to 
>     endorse openly Resolution 242, and on May 13 President Johnson sent a 
>     letter to United Arab Republic (UAR) President Gamal Abdel Nasser, 
>     urging him to seize the unique opportunity offered by the Jarring 
>     mission to achieve peace. </fn:non-match>
>     <fn:match>(<fn:group nr="1">79</fn:group>, 
>       <fn:group nr="1">171</fn:group>)</fn:match>
>   </fn:analyze-string-result>
> 
> However, the actual result is that the first number ("79") is skipped, and 
> only the 2nd number ("171") is captured:
> 
>   <fn:analyze-string-result xmlns:fn="http://www.w3.org/2005/xpath-functions">
>     <fn:non-match>On February 13, 1968, Secretary of State Dean Rusk sent a 
>     message to Israeli Foreign Minister Abba Eban calling upon Israel to 
>     endorse openly Resolution 242, and on May 13 President Johnson sent a 
>     letter to United Arab Republic (UAR) President Gamal Abdel Nasser, 
>     urging him to seize the unique opportunity offered by the Jarring 
>     mission to achieve peace. </fn:non-match>
>     <fn:match>(79, 
>       <fn:group nr="1">171</fn:group>)</fn:match>
>   </fn:analyze-string-result>
> 
> What am I missing?  Can anyone suggest a regex that is able to capture both 
> numbers inside the parentheses?  Or do I need to make a two-pass run through 
> this, finding parenthetical text with a first analyze-string like "\(.+\)" 
> and then looking inside its matches with a second analyze-string like 
> "(\d+)(?:, )?"?
> 
> Thanks,
> Joe
> _______________________________________________
> talk@x-query.com
> http://x-query.com/mailman/listinfo/talk
