Re: [CODE4LIB] Regex Question
Thanks for the advice, everyone. This is all helpful stuff that I need to spend some time with.

On Thu, Jul 9, 2015 at 3:38 AM, Kool,Wouter wouter.k...@oclc.org wrote:

I also recommend this site: http://www.regular-expressions.info/

If you do not want to work inside MS Word and want to use only regexes, not XPath, you could of course do something like:

<italics>.*[A-Z ,;:]+.*</italics>

But, depending on your environment, you might be troubled by newlines in the data (regex engines tend to chunk your data, and they tend to split on newlines by default). If you just want to list the titles, you could grab the title proper like:

<italics>.*([A-Z ,;:]+).*</italics>

The part between ( and ) is then usually accessible as $1 (in a language like Perl) or \1 (in a text editor).

Wouter
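A minimal Python sketch of the capture-group idea in Wouter's message, assuming the italic titles really are wrapped in literal <italics>...</italics> tags in the text being searched. The sample string and the names SAMPLE, TITLE_RE, GREEDY_RE and list_titles are made up for illustration. One thing worth checking in your own environment: as far as I can tell, the greedy leading .* in the quoted pattern swallows most of the title and leaves only its last character for the group, so the sketch also shows a tighter variant without the surrounding .* parts.

```python
import re

# Hypothetical sample line; the <italics> tags are an assumption about the data.
SAMPLE = "SMITH, JOHN. <italics>A HISTORY OF CATALOGS</italics> 1999."

# Tighter variant: capture only characters from the class used in the thread.
TITLE_RE = re.compile(r"<italics>([A-Z ,;:]+)</italics>")

# The pattern as quoted in the thread, kept for comparison.
GREEDY_RE = re.compile(r"<italics>.*([A-Z ,;:]+).*</italics>")


def list_titles(text):
    """Return group 1 (the part between the parentheses) of every match."""
    return TITLE_RE.findall(text)


print(list_titles(SAMPLE))                # the whole title
print(GREEDY_RE.search(SAMPLE).group(1))  # only the last character of the title
```

In Python the captured group is read with group(1) or returned by findall, which plays the role of $1 in Perl or \1 in an editor's replace box.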
Re: [CODE4LIB] Regex Question
I like this regex add-in for Excel: http://www.codedawn.com/index/new-excel-add-in-regex-find-replace

Cindy Harper

-----Original Message-----
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Kyle Banerjee
Sent: Tuesday, July 07, 2015 6:22 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Regex Question

For clarity, Word does regex, not just wildcards. It's not quite as complete as what you'd get in some other environments, such as OpenOffice Writer: matching is lazy rather than greedy, which can be a big deal depending on what you're doing, and there are a couple of other catches (notably no support for |). But it's reasonably powerful.

There is no regexp capability in Excel unless you're willing to use VBA.

kyle
Re: [CODE4LIB] Regex Question
Hi Matt,

I don't know much about your docx file, but I've also recently been learning to use regular expressions, and I thought I'd send you a link to a handy tool in case you hadn't seen it yet: http://regexr.com/

I've found regexr extremely helpful while trying to create useful regular expressions. You can tweak your regular expression in regexr and instantly see the results. (They provide some default sample text to search, though you're free to type or paste in your own.) If you hover your cursor over pieces of the regular expression, hints pop up and tell you what each part of the expression does; I've found it useful for learning how regular expressions work. There's also a nice cheatsheet on the left, which sometimes cuts down on how much Googling you need to do.

Also, in case this is helpful, here is a regular expression that matches groups of two or more capital letters: http://regexr.com/3bbet

Perhaps this will do the trick when searching for words that are in all caps? (I make no guarantees; you might need to fiddle with it a bit.)

As for searching for italicized words, I have no idea how to search for them unless they are surrounded by certain tags or signifiers. For instance, perhaps all italicized words are surrounded by tags like this: <em>Some Nice Title</em>. You could search for all phrases surrounded by those tags. But without a textual signifier like that, it's beyond me.

Best,
--
Ivan Goldsmith
Web Project Analyst
University of Pennsylvania Libraries

----- Original Message -----
From: Matt Sherman matt.r.sher...@gmail.com
To: CODE4LIB@LISTSERV.ND.EDU
Sent: Tuesday, July 7, 2015 11:56:15 AM
Subject: [CODE4LIB] Regex Question

Hi all,

I am working my way through teaching myself regex to parse an annotated bibliography docx file and had a question, as I can't seem to get a succinct answer from Google. Is it possible to have regex find words, or in this case names, displayed in all caps? Also, similarly, is it possible to have regex find words, or in this case titles, that are italicized? Given how the document is formatted, doing both would be nice so that I could parse them into a table or database, but I cannot find a clear answer on that. I am very new to regex, so this is probably jumping into the deep end. Any answers are appreciated.

Matt Sherman
Re: [CODE4LIB] Regex Question
Hi Matt!

You can match a string of all-caps letters with a character class like [A-Z]. Those brackets say to match any one character listed inside, and the hyphen indicates the full range of capital letters; add + after the class to match a whole run of them. You cannot, unfortunately, match italics, since that's formatting and not text. Regex is really only meant for strings of characters and not their formatting.

Lastly, I'd be remiss if I didn't point you to Bohyun Kim's nice intro to regex: http://acrl.ala.org/techconnect/?p=3549

Good luck!

On Tue, Jul 7, 2015 at 08:56 Matt Sherman matt.r.sher...@gmail.com wrote:

Hi all,

I am working my way through teaching myself regex to parse an annotated bibliography docx file and had a question, as I can't seem to get a succinct answer from Google. Is it possible to have regex find words, or in this case names, displayed in all caps? Also, similarly, is it possible to have regex find words, or in this case titles, that are italicized? Given how the document is formatted, doing both would be nice so that I could parse them into a table or database, but I cannot find a clear answer on that. I am very new to regex, so this is probably jumping into the deep end. Any answers are appreciated.

Matt Sherman
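To make the [A-Z] suggestion concrete, here is a small Python sketch that pulls all-caps words out of a line of text. The sample sentence and the names ALL_CAPS_RE and all_caps_words are invented for illustration, and the two-letter minimum is an assumption (it keeps a lone "A" or "I" from being reported), not something the thread prescribes.

```python
import re

# \b marks a word boundary; {2,} requires at least two capitals in a row,
# so single capital letters such as "A" are not treated as all-caps words.
ALL_CAPS_RE = re.compile(r"\b[A-Z]{2,}\b")


def all_caps_words(text):
    """Return every run of two or more ASCII capitals that forms a whole word."""
    return ALL_CAPS_RE.findall(text)


print(all_caps_words("SMITH, John. A Guide to MARC and XML."))
```

Note that [A-Z] covers only the 26 unaccented English capitals, so names containing other uppercase letters would be missed or split by this pattern.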
Re: [CODE4LIB] Regex Question
I think I figured out the all-caps need; see http://regexr.com/3bbfi

Cheers
bzelip

On Tue, Jul 7, 2015 at 12:32 PM, Thomas Krichel kric...@openlib.org wrote:

Eric Phetteplace writes

You can match a string of all caps letters like [A-Z]

This works if you are limited to English. But in a multilingual setting, you need to watch out for other uppercase letters, such as крихель vs КРИХЕЛЬ. It then depends on the Unicode implementation of your regex application. In Perl, for example, you would use [[:upper:]].

--
Cheers,
Thomas Krichel http://openlib.org/home/krichel skype:thomaskrichel
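For the multilingual case Thomas raises, here is a hedged Python sketch. Python's built-in re module does not accept the POSIX class [[:upper:]] that Perl does, so this version swaps in a different technique: it finds runs of letters in any script and then keeps those that str.isupper() reports as uppercase. The names LETTERS_RE and uppercase_words and the min_len parameter are illustrative assumptions.

```python
import re

# [^\W\d_] means "a word character that is neither a digit nor an underscore",
# which for Python 3 str patterns is a letter in any script.
LETTERS_RE = re.compile(r"[^\W\d_]+")


def uppercase_words(text, min_len=2):
    """Return letter-only words of at least min_len characters that are all uppercase."""
    return [
        word
        for word in LETTERS_RE.findall(text)
        if len(word) >= min_len and word.isupper()
    ]


print(uppercase_words("крихель vs КРИХЕЛЬ and SMITH"))
```

Filtering after the match keeps the pattern simple; a regex engine with Unicode property support could express the same test inside the pattern instead.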
Re: [CODE4LIB] Regex Question
In the case of XML, I think XPath is the simpler tool.

Brian Zelip wrote:

Hi Matt.

Re: finding words in all caps, yes, it's possible. See this SO answer to help: http://stackoverflow.com/a/4255225/2145103

Re: italics, my hunch is that you could do so if you got hold of the XML behind the Word doc, which I'd assume would have something like `italic` tags or attribute values of `italic` in the markup.

Good luck!

Brian Zelip
---
Emerging Technologies Librarian
Health Sciences & Human Services Library
University of Maryland, Baltimore
bze...@hshsl.umaryland.edu
410-706-8865
Re: [CODE4LIB] Regex Question
Hi Matt.

Re: finding words in all caps, yes, it's possible. See this SO answer to help: http://stackoverflow.com/a/4255225/2145103

Re: italics, my hunch is that you could do so if you got hold of the XML behind the Word doc, which I'd assume would have something like `italic` tags or attribute values of `italic` in the markup.

Good luck!

Brian Zelip
---
Emerging Technologies Librarian
Health Sciences & Human Services Library
University of Maryland, Baltimore
bze...@hshsl.umaryland.edu
410-706-8865

On Tue, Jul 7, 2015 at 11:56 AM, Matt Sherman matt.r.sher...@gmail.com wrote:

Hi all,

I am working my way through teaching myself regex to parse an annotated bibliography docx file and had a question, as I can't seem to get a succinct answer from Google. Is it possible to have regex find words, or in this case names, displayed in all caps? Also, similarly, is it possible to have regex find words, or in this case titles, that are italicized? Given how the document is formatted, doing both would be nice so that I could parse them into a table or database, but I cannot find a clear answer on that. I am very new to regex, so this is probably jumping into the deep end. Any answers are appreciated.

Matt Sherman
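Brian's hunch about the XML behind the Word document can be sketched in Python with only the standard library. A .docx file is a zip archive whose main text lives in word/document.xml, where a run (w:r) carries an italic flag (w:i) inside its run properties (w:rPr). The function names below are made up for illustration, and the sketch assumes italics were applied directly to the run; italics inherited from a named style would not be found this way.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements such as w:r, w:rPr, w:i and w:t.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def italic_runs(document_xml):
    """Return the text of each run whose own properties switch italics on."""
    root = ET.fromstring(document_xml)
    found = []
    for run in root.iter(W + "r"):
        flag = run.find(W + "rPr/" + W + "i")
        # <w:i/> turns italics on; a w:val of false, 0 or off turns it off.
        if flag is None or flag.get(W + "val") in ("false", "0", "off"):
            continue
        text = "".join(t.text or "" for t in run.iter(W + "t"))
        if text:
            found.append(text)
    return found


def italic_runs_in_docx(path_or_file):
    """Read word/document.xml out of the .docx zip container and scan it."""
    with zipfile.ZipFile(path_or_file) as docx:
        return italic_runs(docx.read("word/document.xml"))
```

Calling italic_runs_in_docx on the bibliography file would give a list of italic strings, which could then be paired with the all-caps names found by a regex over the same paragraphs.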
Re: [CODE4LIB] Regex Question
Thanks everyone, this really helps. I'll have to work out the italicized stuff, but this gets me much closer.

On Tue, Jul 7, 2015 at 12:43 PM, Kyle Banerjee kyle.baner...@gmail.com wrote:

Y'all are doing this the hard way. Word allows regex replacements as well as format-based criteria. For this particular use case:

1. Open the find/replace dialog (Ctrl+H).
2. In the Find what box, put (*). Make sure the option for Use Wildcards is selected and, for the format, specify italic.
3. For the Replace box, just put \1 and specify All caps.

And you're done.

kyle
Re: [CODE4LIB] Regex Question
Y'all are doing this the hard way. Word allows regex replacements as well as format-based criteria. For this particular use case:

1. Open the find/replace dialog (Ctrl+H).
2. In the Find what box, put (*). Make sure the option for Use Wildcards is selected and, for the format, specify italic.
3. For the Replace box, just put \1 and specify All caps.

And you're done.

kyle

On Tue, Jul 7, 2015 at 9:32 AM, Thomas Krichel kric...@openlib.org wrote:

Eric Phetteplace writes

You can match a string of all caps letters like [A-Z]

This works if you are limited to English. But in a multilingual setting, you need to watch out for other uppercase letters, such as крихель vs КРИХЕЛЬ. It then depends on the Unicode implementation of your regex application. In Perl, for example, you would use [[:upper:]].

--
Cheers,
Thomas Krichel http://openlib.org/home/krichel skype:thomaskrichel
Re: [CODE4LIB] Regex Question
OpenOffice Writer (or a similar program) may be useful for this. It would allow you to search by format while using a more controlled regular expression than MS Word's wildcards.

-----Original Message-----
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Matt Sherman
Sent: Tuesday, July 07, 2015 12:45 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Regex Question

Thanks everyone, this really helps. I'll have to work out the italicized stuff, but this gets me much closer.
Re: [CODE4LIB] Regex Question
To add on a few things that others have said in this thread:

- Another good online regex tool is https://regex101.com/ I really like the testing tools it provides.
- Although it's not exactly what you need, Word does have an ability to search by format (it's under the Select menu on the Home tab of the ribbon).

Kate Deibel, PhD | Web Applications Specialist
Information Technology Services
University of Washington Libraries
http://staff.washington.edu/deibel
--
When Thor shows up, it's always deus ex machina.

-----Original Message-----
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Matt Sherman
Sent: Tuesday, July 7, 2015 9:45 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Regex Question

Thanks everyone, this really helps. I'll have to work out the italicized stuff, but this gets me much closer.
Re: [CODE4LIB] Regex Question
For clarity, Word does regex, not just wildcards. It's not quite as complete as what you'd get in some other environments, such as OpenOffice Writer: matching is lazy rather than greedy, which can be a big deal depending on what you're doing, and there are a couple of other catches (notably no support for |). But it's reasonably powerful.

There is no regexp capability in Excel unless you're willing to use VBA.

kyle

On Tue, Jul 7, 2015 at 1:10 PM, Gordon, Bonnie bgor...@rockarch.org wrote:

OpenOffice Writer (or a similar program) may be useful for this. It would allow you to search by format while using a more controlled regular expression than MS Word's wildcards.
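The lazy-versus-greedy distinction Kyle mentions is easier to see side by side. The sketch below uses Python's re module rather than Word, and the <i> tags in the sample string are hypothetical stand-ins for whatever marks a title; it only illustrates the general behaviour of the two quantifier styles.

```python
import re

# Hypothetical sample with two tagged titles on one line.
SAMPLE = "<i>First Title</i> and <i>Second Title</i>"

GREEDY = re.compile(r"<i>(.*)</i>")   # .* takes as much text as it can
LAZY = re.compile(r"<i>(.*?)</i>")    # .*? takes as little text as it can

greedy_matches = GREEDY.findall(SAMPLE)
lazy_matches = LAZY.findall(SAMPLE)

print(greedy_matches)  # one match running from the first <i> to the last </i>
print(lazy_matches)    # one match per title
```

When several tagged items share a line, the lazy form is usually the one that yields one match per item, which is why the default behaviour of a given tool matters.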