[R] Regular expressions: retrieving matches depending on intervening strings

2006-08-16 Thread Stefan Th. Gries
Dear all

I again have a regular expression question. I have this character vector a:

a <- c("<w AT0>a <w NN1>blockage <w CJC>and <w DT0>that<c PUN>.",
 "<w AT0>a <w NN1>blockage <w CJC>and <ptr target=KB2LC003><w DT0>that<c PUN>.",
 "<w AT0>a <w NN1>blockage <w CJC>and<c PUN>, <w DT0>that<c PUN>.",
 "<w AT0>a <w NN1>blockage <w CJC>and <w AJ0>hungry <w DT0>that<c PUN>.")

I would like to retrieve those elements of a in which "<w CJC>" and "<w DT0>" 
are

- directly adjacent, as in a[1], or
- not interrupted by "<[wc] ", as in a[2].

Of these elements, I would like to consume all characters from the "<" in 
"<w CJC>" to the last character after "<w DT0>" that is not a "<". For example, 
if I were only searching a[1], I would like something like this:

matches <- gregexpr("<w CJC>[^<]+?<w DT0>[^<]+", a[1], perl=TRUE)
substr(a[1], unlist(matches), unlist(matches)+attr(matches[[1]], "match.length")-1)
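
The extraction step for a[1] can also be sketched with regmatches(), which is available in current versions of R and does the start/length bookkeeping itself. This is only an illustration: it assumes that a[1] carries SGML-style tags in angle brackets, and the names a1, m, by_substr and by_regmatches are made up for the example.

```r
# Assumed form of a[1], with SGML-style tags in angle brackets
a1 <- "<w AT0>a <w NN1>blockage <w CJC>and <w DT0>that<c PUN>."

# Match from the "<" of "<w CJC>" up to the last non-"<" character
# that follows "<w DT0>"
m <- regexpr("<w CJC>[^<]+?<w DT0>[^<]+", a1, perl = TRUE)

# Variant 1: read the match length with attr() and cut with substr()
by_substr <- substr(a1, m, m + attr(m, "match.length") - 1)

# Variant 2: let regmatches() pair the match data with the text
by_regmatches <- regmatches(a1, m)

print(by_substr)
print(by_regmatches)
```

Both variants should return the same substring; regmatches() mainly saves the manual arithmetic on start positions and lengths.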

I have been fiddling around with negative lookahead but I really can't get my 
head around this. Any pointers would be greatly appreciated. Thanks a lot,
STG
--
Stefan Th. Gries
---
University of California, Santa Barbara
http://www.linguistics.ucsb.edu/faculty/stgries

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Regular expressions: retrieving matches depending on intervening strings [Follow-up]

2006-08-16 Thread Stefan Th. Gries
Dear all

This is a follow-up to an earlier posting today regarding a regular expression 
question. In the meantime, this is the best approximation I could come up with, 
which should give you a better idea of what I am talking about.

a <- c("<w AT0>a <w NN1>blockage <w CJC>and <w DT0>that<c PUN>.",
 "<w AT0>a <w NN1>blockage <w CJC>and <ptr target=KB2LC003><w DT0>that<c PUN>.",
 "<w AT0>a <w NN1>blockage <w CJC>and<c PUN>, <w DT0>that<c PUN>.",
 "<w AT0>a <w NN1>blockage <w CJC>and <w AJ0>hungry <w DT0>that<c PUN>.")
matches <- gregexpr("<w CJC>[^<]+(?:<[^wc].*?>.*?)*<w DT0>that", a, perl=TRUE)
starts <- unlist(matches)
# read only the match.length attribute; gregexpr() may attach others as well
lengths <- unlist(lapply(matches, attr, "match.length"))
stops <- starts+lengths-1
substr(a, starts, stops)

What is still missing is that the disallowed string is not just "<[wc]" but 
"<[wc] " and I don't know how to do that. Any ideas (preferably with 
lookarounds)?
Thanks a bunch,
STG
--
Stefan Th. Gries
---
University of California, Santa Barbara
http://www.linguistics.ucsb.edu/faculty/stgries
---
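
One possible way to express the restriction asked about above with a lookaround is a negative lookahead, (?![wc] ), placed directly after the "<" of each intervening tag. The sketch below is not a definitive solution: it assumes that the tags are written in angle brackets (as in "<w CJC>" and "<ptr target=KB2LC003>") and that no tag contains a ">" inside it. The names pattern, keep and hits are illustrative only.

```r
# Assumed form of the data, with SGML-style tags in angle brackets
a <- c("<w AT0>a <w NN1>blockage <w CJC>and <w DT0>that<c PUN>.",
       "<w AT0>a <w NN1>blockage <w CJC>and <ptr target=KB2LC003><w DT0>that<c PUN>.",
       "<w AT0>a <w NN1>blockage <w CJC>and<c PUN>, <w DT0>that<c PUN>.",
       "<w AT0>a <w NN1>blockage <w CJC>and <w AJ0>hungry <w DT0>that<c PUN>.")

# "<(?![wc] )" admits an intervening tag only if it does not begin
# with "<w " or "<c "; "[^>]*>" consumes the rest of that tag and
# "[^<]*" any text that follows it
pattern <- "<w CJC>[^<]+(?:<(?![wc] )[^>]*>[^<]*)*<w DT0>[^<]+"

matches <- gregexpr(pattern, a, perl = TRUE)

# Which elements of a contain a match at all (gregexpr gives -1 if none)
keep <- sapply(matches, function(m) m[1] != -1)

# The matched substrings themselves, across all elements
hits <- unlist(regmatches(a, matches))

print(keep)
print(hits)
```

With the four elements above, keep should be TRUE only for a[1] and a[2]: in a[3] and a[4] the tag between "<w CJC>" and "<w DT0>" begins with "<c " or "<w ", so the lookahead rejects it.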

