Hi Matt, 

I don't know much about your docx file, but I've also recently been learning & 
using regular expressions, and I thought I'd send you a link to a handy tool in 
case you hadn't seen it yet: http://regexr.com/ 

I've found regexr extremely helpful while trying to create useful regular 
expressions. You can tweak your regular expression in regexr and instantly see 
the results. (They provide some default sample text to search, though you're 
free to type/paste in your own.) 
If you hover your cursor over pieces of the regular expression, hints pop up 
and tell you what each part of the expression does; I've found it useful for 
learning how regular expressions work. There's also a nice cheatsheet on the 
left, which sometimes cuts down on how much Googling you need to do. 

Also, in case this is potentially helpful... here is a regular expression that 
matches groups of two or more capital letters: http://regexr.com/3bbet 
Perhaps this will do the trick when searching for words that are in all caps? 
(I make no guarantees; you might need to fiddle with it a bit.) 
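
For trying the same idea outside regexr, here is a minimal Python sketch. It assumes a pattern of the form \b[A-Z]{2,}\b, which may not be identical to the one saved at the link above, and the sample text is made up for illustration. It only covers unaccented A-Z, so names with accented capitals or internal punctuation would need a broader pattern.

```python
import re

# Assumed pattern (not necessarily the one behind the regexr link):
# two or more consecutive ASCII capital letters bounded by word edges.
ALL_CAPS = re.compile(r"\b[A-Z]{2,}\b")


def find_all_caps(text):
    """Return every run of two or more capital letters that forms a whole word."""
    return ALL_CAPS.findall(text)


# Made-up sample line, standing in for text pulled out of the bibliography.
sample = "SHERMAN, Matt. A Study of Regex. Reviewed by GOLDSMITH in 2015."
print(find_all_caps(sample))  # ['SHERMAN', 'GOLDSMITH']
```

The single capital "A" and mixed-case words like "Study" are skipped because the pattern requires at least two capitals in a row with nothing else in the word.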

As for searching for italicized words, I have no idea how to search for them 
unless they are surrounded by certain tags or signifiers. For instance, perhaps 
all italicized words are surrounded by tags like this: <em>Some Nice 
Title</em>. You could search for all phrases surrounded by those tags. But 
without a textual signifier like that, it's beyond me. 
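
If the text does carry tags like that, a rough Python sketch of the search could look like the following. The <em> markup and the sample string are assumptions for illustration; a real docx export may mark italics differently, and a regular expression like this will not cope with nested or malformed tags.

```python
import re

# Assumed markup: italic text is wrapped in <em>...</em>.
# The non-greedy (.*?) stops at the first closing tag; DOTALL lets a
# title that wraps onto the next line still match.
EM_TAG = re.compile(r"<em>(.*?)</em>", re.DOTALL)


def find_italicized(text):
    """Return the text inside each <em>...</em> pair, with whitespace collapsed."""
    return [" ".join(match.split()) for match in EM_TAG.findall(text)]


# Made-up sample with one title broken across two lines.
sample = "Smith read <em>Some Nice\nTitle</em> and <em>Another Title</em>."
print(find_italicized(sample))  # ['Some Nice Title', 'Another Title']
```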

Best, 

-- Ivan Goldsmith 
Web Project Analyst 
University of Pennsylvania Libraries 


----- Original Message -----

From: "Matt Sherman" <matt.r.sher...@gmail.com> 
To: CODE4LIB@LISTSERV.ND.EDU 
Sent: Tuesday, July 7, 2015 11:56:15 AM 
Subject: [CODE4LIB] Regex Question 

Hi all, 

I am working my way through teaching myself regex to parse an annotated 
bibliography docx file and had a question as I can't seem to get a succinct 
answer from Google. Is it possible to have regex find words, or in this 
case names, displayed in all caps? Also, similarly, is it possible to 
have regex find words, or in this case titles, that are italicized? Given 
how the document is formatted doing both would be nice so that I could 
parse them into a table or database, but I cannot find a clear answer on 
that, though I am very new to regex so it is probably jumping into the deep 
end on this. Any answers are appreciated. 

Matt Sherman 
