Re: [CODE4LIB] Regex Question

2015-07-09 Thread Matt Sherman
Thanks for the advice everyone.  This is all helpful stuff that I need to
spend some time with.

On Thu, Jul 9, 2015 at 3:38 AM, Kool,Wouter wouter.k...@oclc.org wrote:

 I also recommend this site: http://www.regular-expressions.info/
 If you do not want to work inside MSWord and want to use only regexes not
 xpath, you could of course do something like:

 <italics>.*[A-Z ,;:]+.*</italics>

 But, depending on your environment, you might be troubled by newlines in
 the data (regex engines tend to chunk your data, and they tend to split on
 newlines by default).

 If you just want to list the titles you could grab the title proper like:

 <italics>.*([A-Z ,;:]+).*</italics>. The part between ( and ) is then
 usually accessible as $1 (in a language like Perl) or \1 (in a text editor).

 Wouter
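
For anyone following along in a scripting language, here is a minimal Python sketch of the capture-group idea above. It assumes the italic text really is wrapped in literal <italics>...</italics> tags, which is a guess about the data, and the names ITALIC, CAPS_TITLE and caps_titles are made up for illustration. It departs from the pattern in the message in two ways: it uses a lazy .*? so each tag pair is matched on its own (with a greedy .* in front, the group tends to capture only the last character), and it sets re.DOTALL so a title that wraps across a newline is still found.

```python
import re

# Everything between one <italics> tag and the nearest closing tag.
# DOTALL lets "." cross newlines; "?" makes the match lazy.
ITALIC = re.compile(r"<italics>(.*?)</italics>", re.DOTALL)

# A candidate counts as a title only if it consists solely of
# capitals, whitespace and the punctuation from the original pattern.
CAPS_TITLE = re.compile(r"[A-Z ,;:\s]+")


def caps_titles(text):
    """Return all-caps italic titles with whitespace normalised."""
    titles = []
    for match in ITALIC.finditer(text):
        candidate = match.group(1)  # the part between ( and )
        if CAPS_TITLE.fullmatch(candidate):
            titles.append(" ".join(candidate.split()))
    return titles


sample = "<italics>MOBY DICK;\nOR, THE WHALE</italics> and <italics>not caps</italics>"
print(caps_titles(sample))
```

In Python the captured part is match.group(1) rather than $1 or \1.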






Re: [CODE4LIB] Regex Question

2015-07-08 Thread Harper, Cynthia
I like this regex add-in for Excel: 
http://www.codedawn.com/index/new-excel-add-in-regex-find-replace
Cindy Harper




Re: [CODE4LIB] Regex Question

2015-07-07 Thread Ivan Goldsmith
Hi Matt, 

I don't know much about your docx file, but I've also recently been learning 
to use regular expressions, and I thought I'd send you a link to a handy tool in 
case you hadn't seen it yet: http://regexr.com/ 

I've found regexr extremely helpful while trying to create useful regular 
expressions. You can tweak your regular expression in regexr and instantly see 
the results. (They provide some default sample text to search, though you're 
free to type/paste in your own.) 
If you hover your cursor over pieces of the regular expression, hints pop up 
and tell you what each part of the expression does; I've found it useful for 
learning how regular expressions work. There's also a nice cheatsheet on the 
left, which sometimes cuts down on how much Googling you need to do. 

Also, in case this is potentially helpful... here is a regular expression that 
matches groups of two or more capital letters: http://regexr.com/3bbet 
Perhaps this will do the trick when searching for words that are in all caps? 
(I make no guarantees; you might need to fiddle with it a bit.) 

As for searching for italicized words, I have no idea how to search for them 
unless they are surrounded by certain tags or signifiers. For instance, perhaps 
all italicized words are surrounded by tags like this: <em>Some Nice 
Title</em>. You could search for all phrases surrounded by those tags. But 
without a textual signifier like that, it's beyond me. 

Best, 

-- Ivan Goldsmith 
Web Project Analyst 
University of Pennsylvania Libraries 


----- Original Message -----

From: Matt Sherman matt.r.sher...@gmail.com 
To: CODE4LIB@LISTSERV.ND.EDU 
Sent: Tuesday, July 7, 2015 11:56:15 AM 
Subject: [CODE4LIB] Regex Question 

Hi all, 

I am working my way through teaching myself regex to parse an annotated 
bibliography docx file and had a question, as I can't seem to get a succinct 
answer from Google. Is it possible to have regex find words, or in this 
case names, displayed in all caps? Similarly, is it possible to 
have regex find words, or in this case titles, that are italicized? Given 
how the document is formatted, doing both would be nice so that I could 
parse them into a table or database, but I cannot find a clear answer on 
that. I am very new to regex, so this is probably jumping into the deep 
end. Any answers are appreciated. 

Matt Sherman 


Re: [CODE4LIB] Regex Question

2015-07-07 Thread Eric Phetteplace
Hi Matt!

You can match capital letters with a character class like [A-Z]; add a
quantifier, as in [A-Z]+, to match a whole string of them. The brackets say
match any one character inside, and the hyphen indicates the full range of
capital letters.
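
As a rough illustration in Python (the pattern and the names CAPS_WORD and caps_words are my own choices, not something from this thread): [A-Z] on its own matches one capital letter, so the sketch adds a {2,} quantifier and \b word boundaries to pick out whole words of two or more capitals. It only covers unaccented ASCII letters.

```python
import re

# Two or more consecutive ASCII capitals, bounded on both sides so that
# the "T" in "The" or a lone initial like "A." is not reported.
CAPS_WORD = re.compile(r"\b[A-Z]{2,}\b")


def caps_words(text):
    """Return every all-caps ASCII word of at least two letters."""
    return CAPS_WORD.findall(text)


print(caps_words("SMITH, John. The Title. DOE, Jane A."))
```

Whether two letters is the right minimum depends on the bibliography; single-letter names would need {1,} and some other way to exclude initials.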

You cannot, unfortunately, match italics since that's formatting and not
text. Regex is really only meant for strings of characters and not their
formatting.

Lastly, I'd be remiss if I didn't point you to Bohyun Kim's nice intro to
regex: http://acrl.ala.org/techconnect/?p=3549

Good luck!



Re: [CODE4LIB] Regex Question

2015-07-07 Thread Brian Zelip
I think I figured out the all-caps need, see http://regexr.com/3bbfi

Cheers


bzelip

On Tue, Jul 7, 2015 at 12:32 PM, Thomas Krichel kric...@openlib.org wrote:

   Eric Phetteplace writes

  You can match a string of all caps letters like [A-Z]

   This works if you are limited to English. But in a multilingual
   setting, you need to watch out for other uppercases, such as
   крихель vs КРИХЕЛЬ. It then depends on the Unicode implementation
   of your regex application. In Perl, for example, you would use
   [[:upper:]].


 --

   Cheers,

   Thomas Krichel  http://openlib.org/home/krichel
   skype:thomaskrichel
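
To make the multilingual point concrete, here is a small Python sketch. Python's built-in re module has no [[:upper:]] class, so this is a swapped-in technique: the regex finds runs of letters and str.isupper() does the case test. The names WORD and uppercase_words are hypothetical.

```python
import re

# Runs of word characters that are neither digits nor underscores,
# which in practice means runs of letters in any script
# (str patterns are Unicode-aware by default in Python 3).
WORD = re.compile(r"[^\W\d_]+")


def uppercase_words(text):
    """Return the words whose cased characters are all uppercase."""
    return [word for word in WORD.findall(text) if word.isupper()]


print(uppercase_words("крихель vs КРИХЕЛЬ and SMITH"))
```

An ASCII-only [A-Z]+ would report SMITH here and miss КРИХЕЛЬ.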



Re: [CODE4LIB] Regex Question

2015-07-07 Thread Jason R Peak
In the case of xml, I think xpath is the simpler tool.





Re: [CODE4LIB] Regex Question

2015-07-07 Thread Brian Zelip
Hi Matt.

Re: finding words in all caps, yes it's possible. See this SO answer to
help: http://stackoverflow.com/a/4255225/2145103

Re: italics, my hunch is that you could do so if you got hold of the XML
behind the Word doc, which I'd assume would have something like
`italic` tags or attribute values of `italic` in the markup.
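
A sketch of that hunch in Python, using only the standard library. It assumes the usual docx layout: a zip archive whose main text lives in word/document.xml, where an italic run carries a w:i element inside its w:rPr run properties. The function names are invented for illustration. It ignores italics inherited from a paragraph or character style, and Word may split one title across several runs, so treat the output as a starting point.

```python
import xml.etree.ElementTree as ET
import zipfile

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def italic_runs(document_xml):
    """Return the text of each run that is directly marked italic."""
    root = ET.fromstring(document_xml)
    found = []
    for run in root.iter(f"{{{W}}}r"):
        flag = run.find("w:rPr/w:i", NS)
        if flag is None:
            continue  # no italic property on this run
        if flag.get(f"{{{W}}}val", "true") in ("0", "false"):
            continue  # italic explicitly switched off
        text = "".join(t.text or "" for t in run.findall("w:t", NS))
        if text:
            found.append(text)
    return found


def italic_runs_in_docx(path):
    """Open a .docx file (hypothetical path) and scan its main part."""
    with zipfile.ZipFile(path) as archive:
        return italic_runs(archive.read("word/document.xml"))
```

The "w:rPr/w:i" lookup is a small XPath-style expression, so this is closer to the XPath route than to running a regex over the raw XML.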


good luck!

Brian Zelip

---

Emerging Technologies Librarian

Health Sciences & Human Services Library

University of Maryland, Baltimore

bze...@hshsl.umaryland.edu

410-706-8865





Re: [CODE4LIB] Regex Question

2015-07-07 Thread Matt Sherman
Thanks everyone, this really helps.  I'll have to work out the italicized
stuff, but this gets me much closer.

 



Re: [CODE4LIB] Regex Question

2015-07-07 Thread Kyle Banerjee
Y'all are doing this the hard way. Word allows regex replacements as well
as format based criteria.

For this particular use case:

   1. Open the find/replace dialog (Ctrl+H)
   2. In the Find what box, put (*) -- make sure the option for Use
   Wildcards is selected, and for the format, specify italic
   3. For the Replace box, just put \1 and specify All caps

And you're done

kyle




Re: [CODE4LIB] Regex Question

2015-07-07 Thread Gordon, Bonnie
OpenOffice Writer (or a similar program) may be useful for this. It would allow 
you to search by format while using a more controlled regular expression than 
MS Word's wildcards.

 



Re: [CODE4LIB] Regex Question

2015-07-07 Thread Katherine N. Deibel
To add on a few things that others have said in this thread:

- Another good online regex tool is https://regex101.com/. I really like the 
testing tools it provides.

- Although it's not exactly what you need, Word does have an ability to search 
by format (it's under the Select menu on the Home tab of the ribbon).

Kate Deibel, PhD | Web Applications Specialist
Information Technology Services 
University of Washington Libraries 
http://staff.washington.edu/deibel

--

When Thor shows up, it's always deus ex machina.

 



Re: [CODE4LIB] Regex Question

2015-07-07 Thread Kyle Banerjee
For clarity, Word does regex, not just wildcards.  It's not quite as
complete as what you'd get with some other environments such as OpenOffice
Writer: matching is lazy rather than greedy, which can be a big deal
depending on what you're doing, and there are a couple of other catches --
notably no support for | -- but it's reasonably powerful. There is no
regexp capability in Excel unless you're willing to use VBA.
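
For anyone unsure what lazy versus greedy means in practice, here is a tiny Python demonstration. It shows ordinary behaviour of Python's re module and is not a claim about how Word's engine works internally; the <em> tags and sample text are made up.

```python
import re

text = "<em>First</em> and <em>Second</em>"

# Greedy: .* grabs as much as it can, so the match runs from the
# first opening tag to the LAST closing tag.
greedy = re.findall(r"<em>(.*)</em>", text)

# Lazy: .*? stops at the first closing tag it can reach,
# so each tagged phrase is matched separately.
lazy = re.findall(r"<em>(.*?)</em>", text)

print(greedy)
print(lazy)
```

The greedy pattern returns one long match that swallows the text between the two phrases, while the lazy one returns First and Second separately.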

kyle
