Re: [Tutor] xpath - html entities issue --
On 04/10/16 15:02, bruce wrote:

> I did a quick replace ('&amp;','&') and it replaced the '&amp;' as desired.
> So the content only had '&' in it..

You are probably better off using your parser's escape/unescape facilities.
Simple string replacement is notoriously hard to get right.

> I can provide a more comprehensive chunk of code, but minimized the post to
> get to the heart of the issue. Also, I'd prefer not to use a sep parse lib.

Define separate? There are several options in the standard library.
All will be more effective than trying to do it by hand.
And libxml2dom is not one of them (at least I've never seen it before),
so you appear to be breaking your own rules?

> code chunk
>
> import libxml2dom
> q1=libxml2dom

You can get the same effect with

import libxml2dom as q1

> s2= q1.parseString(a.toString().strip(), html=1)
> tt=s2.xpath(tpath)
>
> tt=tt[0].toString().strip()
> print "tit "+tt

You may have over-simplified a tad, it has become fairly meaningless
to us - what are 'a' and 'tpath'?

--
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos

___
Tutor maillist - Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor
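To illustrate the point about string replacement being hard to get right, here is a minimal sketch. It assumes Python 3.4 or later, where the standard library provides html.unescape (the code in this thread is Python 2, where the rough equivalent lived in the HTMLParser module). The helper naive_unescape is hypothetical, written only to show how a chain of replace() calls can decode a string one level too far.

```python
import html


def naive_unescape(text):
    # Hypothetical hand-rolled approach. Each replace() runs over the output
    # of the previous one, so "&amp;lt;" is decoded twice instead of once.
    return text.replace("&amp;", "&").replace("&lt;", "<").replace("&gt;", ">")


title = "Organization Development &amp; Change"
tricky = "&amp;lt;b&amp;gt;"  # the literal text "&lt;b&gt;", escaped once

print(html.unescape(title))    # decodes exactly one level of escaping
print(naive_unescape(tricky))  # decodes two levels: the hand-rolled bug
print(html.unescape(tricky))   # decodes one level, as intended
```

The library call makes a single pass over the input, which is why it does not suffer from the ordering problem of chained replacements.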
[Tutor] xpath - html entities issue --
Hi.

Just realized I might have a prob with testing a crawl.

I get a page of data via a basic curl. The returned data is
html/charset-utf-8.

I did a quick replace ('&amp;','&') and it replaced the '&amp;' as desired.
So the content only had '&' in it..

I then did a parseString/xpath to extract what I wanted, and realized I
have '&amp;' as representative of the '&' in the returned xpath content.

My issue: is there a way/method/etc. to return only the actual char, not
the html entity (&amp;)?

I can provide a more comprehensive chunk of code, but minimized the post to
get to the heart of the issue. Also, I'd prefer not to use a sep parse lib.

code chunk

import libxml2dom
q1=libxml2dom

s2= q1.parseString(a.toString().strip(), html=1)
tt=s2.xpath(tpath)

tt=tt[0].toString().strip()
print "tit "+tt

the content of a.toString() (shortened)
. . .
Organization Development & Change Edition: 10th
. . .

the xpath results are

Organization Development &amp; Change Edition: 10th

As you can see.. in the results of the xpath (toString()) the & --> &amp;

I'm wondering if there's a process that can be used within the toString(),
or do you really have to wrap each xpath/toString with an unescape() kind of
process to convert html entities to the requisite chars.

Thanks
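One possible explanation for the '&' --> '&amp;' behaviour, offered as an assumption rather than a statement about libxml2dom itself: serialising a node back to markup has to re-escape '&', whereas reading the node's text content returns the plain character. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in (it is not the library used in the post), and the title element and its content are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-element document standing in for the crawled page.
elem = ET.fromstring("<title>Organization Development &amp; Change</title>")

# Reading the text content gives the decoded character.
text_value = elem.text

# Serialising the element back to markup re-escapes the ampersand.
serialized = ET.tostring(elem, encoding="unicode")

print(text_value)
print(serialized)
```

If the parser in use exposes a comparable way to read a node's text instead of its serialised markup, that would avoid wrapping every result in an unescape step.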