Re: [xquery-talk] An analyze-string stumper

2018-04-24 Thread Michael Kay
It would have been really nice to define fn:analyze-string() so that it would 
capture multiple matches of the capturing groups, but we made a policy decision 
that as far as possible it should be possible to implement the XPath/XQuery 
regex facilities using existing regex libraries, and sadly they generally do 
not have this capability.
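
A minimal sketch of that limitation, assuming an XQuery 3.0 or later processor. The shortened sample string and the variable names are illustrative only and do not come from the thread:

```xquery
xquery version "3.1";

(: Shortened sample: only the parenthetical from the original question. :)
let $sample := "(79, 171)"
(: Capturing group 1 sits inside a repeated non-capturing group. :)
let $result := analyze-string($sample, "\((?:(\d+)(?:, )?)+\)")
(: Inspect which repetitions of group 1 were actually reported. :)
return $result//fn:group
```

As reported in the question quoted below, only the last repetition ("171") comes back as a group; the earlier repetition ("79") is part of the match but is not captured.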

But in any case I think a multi-pass approach is probably appropriate here. In 
fact generally, I think the approach of trying to do everything in one great 
complex regular expression is usually misguided. Splitting it up into smaller 
steps not only makes the logic easier to understand (and therefore to debug and 
maintain), it can also benefit performance.

So:

(a) use analyze-string to mark any substring comprising "(" followed by digits, 
spaces, and commas, followed by ")"

(b) use tokenize to split out the individual numbers.
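
The two steps above might be sketched as follows. This is only one possible reading of them, assuming an XQuery 3.0 or later processor; the shortened sample text, the variable name $string, and the exact regexes are assumptions rather than anything given in the thread:

```xquery
xquery version "3.1";

(: Shortened version of the sample text from the original question. :)
declare variable $string :=
  "letter to United Arab Republic (UAR) President Gamal Abdel Nasser, "
  || "urging him to achieve peace. (79, 171)";

(: Step (a): mark every "(" + digits, spaces and commas + ")". :)
analyze-string($string, "\([\d, ]+\)")/fn:match
  (: Step (b): drop the parentheses, split on commas and spaces,
     and discard any empty tokens left at the edges. :)
  ! tokenize(translate(., "()", ""), "[, ]+")[. ne ""]
```

With the sample above this should return the two strings "79" and "171". "(UAR)" is never marked in step (a) because letters are not in the character class, and "13" and "1968" are skipped because they are not inside parentheses.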

Michael Kay
Saxonica

> On 23 Apr 2018, at 17:22, Joe Wicentowski wrote:
> 
> Hi all,
> 
> I have encountered an unexpected challenge constructing a regex for a pattern 
> I am looking for.  I am looking for numbers in parentheses.  For example, in 
> the following string:
> 
>   "On February 13, 1968, Secretary of State Dean Rusk sent a 
> message to Israeli Foreign Minister Abba Eban calling upon Israel to 
> endorse openly Resolution 242, and on May 13 President Johnson sent a 
> letter to United Arab Republic (UAR) President Gamal Abdel Nasser, 
> urging him to seize the unique opportunity offered by the Jarring 
> mission to achieve peace. (79, 171)"
> 
> ... I would like to match "79" and "171" (but not "UAR" or "13" or "1968").  
> I have been trying to construct a regex for use with analyze-string to 
> capture this pattern, but I have not been successful.  I have tried the 
> following:
> 
>   analyze-string($string, "(?:\()(?:(\d+)(?:, )?)+(?:\))")
> 
> In other words, there are these 3 components:
> 
>   1. (?:\() a non-capturing group consisting of an open parens, followed by
>   2. (?:(\d+)(?:, )?)+ one or more non-capturing groups consisting of (a 
> number followed by an optional, non-matching comma-and-space), followed by
>   3. (?:\)) a non-capturing group consisting of a close parens
> 
> I was expecting to get the following output:
> 
>   <analyze-string-result xmlns="http://www.w3.org/2005/xpath-functions">
>     <non-match>On February 13, 1968, Secretary of State Dean Rusk sent a 
> message to Israeli Foreign Minister Abba Eban calling upon Israel to 
> endorse openly Resolution 242, and on May 13 President Johnson sent a 
> letter to United Arab Republic (UAR) President Gamal Abdel Nasser, 
> urging him to seize the unique opportunity offered by the Jarring 
> mission to achieve peace. </non-match>
>     <match>(<group nr="1">79</group>, 
>       <group nr="1">171</group>)</match>
>   </analyze-string-result>
> 
> However, the actual result is that the first number ("79") is skipped, and 
> only the 2nd number ("171") is captured:
> 
>   <analyze-string-result xmlns="http://www.w3.org/2005/xpath-functions">
>     <non-match>On February 13, 1968, Secretary of State Dean Rusk sent a 
> message to Israeli Foreign Minister Abba Eban calling upon Israel to 
> endorse openly Resolution 242, and on May 13 President Johnson sent a 
> letter to United Arab Republic (UAR) President Gamal Abdel Nasser, 
> urging him to seize the unique opportunity offered by the Jarring 
> mission to achieve peace. </non-match>
>     <match>(79, 
>       <group nr="1">171</group>)</match>
>   </analyze-string-result>
> 
> What am I missing?  Can anyone suggest a regex that is able to capture both 
> numbers inside the parentheses?  Or do I need to make a two-pass run through 
> this, finding parenthetical text with a first analyze-string like "\(.+\)" 
> and then looking inside its matches with a second analyze-string like 
> "(\d+)(?:, )?"?
> 
> Thanks,
> Joe
> ___
> talk@x-query.com
> http://x-query.com/mailman/listinfo/talk

Re: [xquery-talk] An analyze-string stumper

2018-04-23 Thread Patrick Durusau
Joe,

Forgive the length but I'm likely to bump my head on this issue in the
future, so a fuller than necessary explanation:

Started with the simplest regex that would capture the parens:

1. fn:analyze-string("On February 13, 1968, Secretary of State Dean Rusk
sent a message to Israeli Foreign Minister Abba Eban calling upon Israel
to endorse openly Resolution 242, and on May 13 President Johnson sent a
letter to United Arab Republic (UAR) President Gamal Abdel Nasser,
urging him to seize the unique opportunity offered by the Jarring
mission to achieve peace. (79, 171) ", "\(\d.*\)")

1. Result: <analyze-string-result xmlns="http://www.w3.org/2005/xpath-functions">
  <non-match>On February 13, 1968, Secretary of State Dean Rusk sent
a message to Israeli Foreign Minister Abba Eban calling upon Israel to
endorse openly Resolution 242, and on May 13 President Johnson sent a
letter to United Arab Republic </non-match>
  <match>(UAR) President Gamal Abdel Nasser, urging him to seize the
unique opportunity offered by the Jarring mission to achieve peace. (79,
171)</match>
  <non-match> </non-match>
</analyze-string-result>


OK, so what do we know about the desired matches? Digits plus (, ) with
no spaces. Yes?

2. fn:analyze-string("On February 13, 1968, Secretary of State Dean Rusk
sent a message to Israeli Foreign Minister Abba Eban calling upon Israel
to endorse openly Resolution 242, and on May 13 President Johnson sent a
letter to United Arab Republic (UAR) President Gamal Abdel Nasser,
urging him to seize the unique opportunity offered by the Jarring
mission to achieve peace. (79, 171) ", "\(\d+, \d+\)")

So I match parens plus digits, ", " (comma plus whitespace), digits plus
paren.

2. Result: <analyze-string-result xmlns="http://www.w3.org/2005/xpath-functions">
  <non-match>On February 13, 1968, Secretary of State Dean Rusk sent
a message to Israeli Foreign Minister Abba Eban calling upon Israel to
endorse openly Resolution 242, and on May 13 President Johnson sent a
letter to United Arab Republic (UAR) President Gamal Abdel Nasser,
urging him to seize the unique opportunity offered by the Jarring
mission to achieve peace. </non-match>
  <match>(79, 171)</match>
  <non-match> </non-match>
</analyze-string-result>


I need to split the two numbers and what better to do that than
alternative matching?

3. fn:analyze-string("On February 13, 1968, Secretary of State Dean Rusk
sent a message to Israeli Foreign Minister Abba Eban calling upon Israel
to endorse openly Resolution 242, and on May 13 President Johnson sent a
letter to United Arab Republic (UAR) President Gamal Abdel Nasser,
urging him to seize the unique opportunity offered by the Jarring
mission to achieve peace. (79, 171) ", "\(\d+ | \d+\)")

3. Result: <analyze-string-result xmlns="http://www.w3.org/2005/xpath-functions">
  <non-match>On February 13, 1968, Secretary of State Dean Rusk sent
a message to Israeli Foreign Minister Abba Eban calling upon Israel to
endorse openly Resolution 242, and on May 13 President Johnson sent a
letter to United Arab Republic (UAR) President Gamal Abdel Nasser,
urging him to seize the unique opportunity offered by the Jarring
mission to achieve peace. (79,</non-match>
  <match> 171)</match>
  <non-match> </non-match>
</analyze-string-result>


You're probably already laughing because you see my mistake, which I
correct in #4:

4. fn:analyze-string("On February 13, 1968, Secretary of State Dean Rusk
sent a message to Israeli Foreign Minister Abba Eban calling upon Israel
to endorse openly Resolution 242, and on May 13 President Johnson sent a
letter to United Arab Republic (UAR) President Gamal Abdel Nasser,
urging him to seize the unique opportunity offered by the Jarring
mission to achieve peace. (79, 171) ", "\(\d+|\d+\)")

4. Result: <analyze-string-result xmlns="http://www.w3.org/2005/xpath-functions">
  <non-match>On February 13, 1968, Secretary of State Dean Rusk sent
a message to Israeli Foreign Minister Abba Eban calling upon Israel to
endorse openly Resolution 242, and on May 13 President Johnson sent a
letter to United Arab Republic (UAR) President Gamal Abdel Nasser,
urging him to seize the unique opportunity offered by the Jarring
mission to achieve peace. </non-match>
  <match>(79</match>
  <non-match>, </non-match>
  <match>171)</match>
  <non-match> </non-match>
</analyze-string-result>


The error was here: "\(\d+ | \d+\)". The spaces around the "|" are part
of the pattern, so the first alternative would only match "(" plus digits
plus a white space, whereas the number in question was followed by a
comma with *no space* before it.
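
A small side-by-side sketch of the two patterns from #3 and #4, assuming an XQuery 3.0 or later processor. The shortened sample string and the variable name $sample are illustrative and not part of the runs above:

```xquery
xquery version "3.1";

(: Only the parenthetical from the sample text. :)
let $sample := "(79, 171)"
return (
  (: #3: spaces around "|" belong to the alternatives, so "(79,"
     fails the first branch and only " 171)" is matched. :)
  analyze-string($sample, "\(\d+ | \d+\)")/fn:match/string(),

  (: #4: without the spaces, "(79" and "171)" are matched separately. :)
  analyze-string($sample, "\(\d+|\d+\)")/fn:match/string()
)
```

Note that the matches from #4 still include the parentheses, so a further step such as translate(., "()", "") would be needed to get the bare numbers.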

Know thy data!

Examples created on BaseX. BTW, I started from known good examples in
XQuery Functions 3.1, verified that they worked and then created the
search strings.

Hope this helps!

Patrick

On 04/23/2018 12:22 PM, Joe Wicentowski wrote:
> Hi all,
>
> I have encountered an unexpected challenge constructing a regex for a
> pattern I am looking for.  I am looking for numbers in parentheses. 
> For example, in the following string:
>
>   "On February 13, 1968, Secretary of State Dean Rusk sent a 
>     message to Israeli Foreign Minister Abba Eban calling upon Israel to 
>     endorse openly Resolution 242, and on May 13 President Johnson sent a 
>     letter to United Arab Republic (UAR) President Gamal Abdel Nasser, 
>     urging him to seize the unique opportunity offered by the Jarring 
>     mission to achieve peace. (79, 171)"
>
> ... I would like to match "79" and "171" (but not "UAR" or "13" or
> "1968").  I have