Martin A. Brown wrote: > > Greetings Sam, > >>Hello,I am hoping you experts can help out a newbie.I have written >>some python code to gather a bunch of sentences from many files. >>These sentences contain the content: >> >>blah blah blah blah <uicontrol>1-up printing</uicontrol>blah blah blah >>blah blah blah blah blah blah blah <uicontrol>Preset</uicontrol>blah blah >>blah blah blah blah blah <uicontrol>Preset</uicontrol> blah blah blah >> >>What I want to do is remove the words before the <uicontrol> and >>after the </uicontrol>. How do I do that in Python? Thanks for any >>and all help. > > That looks like DocBook markup to me. > > If you actually wish only to mangle the strings, you can do the > following: > > def show_just_uicontrol_elements(fin): > for line in fin: > s = line.index('<uicontrol>') > e = line.index('</uicontrol>') + len('</uicontrol>') > print(line[s:e]) > > (Given your question, I'm assuming you can figure out how to open a > file and read line by line in Python.) > > If it's XML, though, you might consider more carefully Alan's > admonitions and his suggestion of lxml, as a bit of a safer choice > for handling XML data. > > def listtags(filename, tag=None): > if tag is None: > sys.exit("Provide a tag name to this function, e.g. uicontrol") > doc = lxml.etree.parse(filename) > sought = list(doc.getroot().iter(tag)) > for element in sought: > print(element.tag, element.text) > > If you were to call the above with: > > listtags('/path/to/the/data/file.xml', 'uicontrol') > > You should see the name 'uicontrol' and the contents of each tag, > stripped of all surrounding context. > > The above snippet is really just an example to show you how easy it > is (from a coding perspective) to use lxml. You still have to make > the investment to understand how lxml processes XML and what your > data processing needs are. > > Good luck in your explorations,
Another approach that uses XPath, also provided by lxml: $ cat tmp.xml <foo> blah blah blah blah <uicontrol>1-up printing</uicontrol>blah blah blah blah blah blah blah blah blah blah <uicontrol>Preset</uicontrol>blah blah blah blah blah blah blah <uicontrol>Preset</uicontrol> blah blah blah </foo> $ python3 [...] >>> from lxml import etree >>> root = etree.parse("tmp.xml") >>> root.xpath("//uicontrol/text()") ['1-up printing', 'Preset', 'Preset'] https://en.wikipedia.org/wiki/XPath _______________________________________________ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: https://mail.python.org/mailman/listinfo/tutor