Re: [Tutor] xpath - html entities issue --

2016-10-04 Thread Alan Gauld via Tutor
On 04/10/16 15:02, bruce wrote:

> I did a quick replace ('&amp;','&') and it replaced the '&amp;' as desired.
> So the content only had '&' in it..

You are probably better off using your parser's escape/unescape
facilities. Simple string replacement is notoriously hard to
get right.
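
As a minimal sketch of the unescape approach, assuming Python 3 is
available (the html module is part of the standard library; on
Python 2 the rough equivalent is HTMLParser.HTMLParser().unescape()).
The helper name unescape_text and the sample string are made up for
illustration:

```python
import html


def unescape_text(raw):
    """Convert HTML entities such as &amp; back into the characters they stand for."""
    return html.unescape(raw)


# Hypothetical sample resembling the escaped text in the xpath result.
sample = "Organization Development &amp; Change"
print(unescape_text(sample))  # Organization Development & Change
```

Unlike a single replace() call, this handles every named and numeric
entity in one pass, not just '&amp;'.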

> I can provide a more comprehensive chunk of code, but minimized the post to
> get to the heart of the issue. Also, I'd prefer not to use a sep parse lib.

Define separate? There are several options in the standard library.
All will be more effective than trying to do it by hand.
And libxml2dom is not one of them (at least I've never
seen it before) so you appear to be breaking your own rules?
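
To illustrate one standard library option, here is a small sketch
using xml.etree.ElementTree on Python 3. The markup is invented for
the example, and ElementTree only accepts well-formed XML, so real
HTML pages may need cleaning up first. It shows the difference between
reading a node's text (entities decoded) and serializing the node back
to markup (entities escaped again), which appears to be what a
toString() style call does:

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment; the real page markup is not shown in the post.
snippet = (
    "<div>"
    "<h2>Organization Development &amp; Change</h2>"
    "<p>Edition: 10th</p>"
    "</div>"
)

root = ET.fromstring(snippet)
title = root.find("h2")

# Text content: the parser has already decoded the entity.
print(title.text)

# Serialized markup: the '&' is escaped again on the way out.
print(ET.tostring(title, encoding="unicode"))
```

If only the text is wanted, reading the text content avoids the need
to unescape anything afterwards.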

> 
> code chunk
> 
> import libxml2dom
> q1=libxml2dom

You can get the same effect with

import libxml2dom as q1

> s2= q1.parseString(a.toString().strip(), html=1)
> tt=s2.xpath(tpath)
> 
> tt=tt[0].toString().strip()
> print "tit "+tt
> 
> -

You may have over-simplified a tad; it has become fairly
meaningless to us. What are 'a' and 'tpath'?

-- 
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos


___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor


[Tutor] xpath - html entities issue --

2016-10-04 Thread bruce
Hi.

Just realized I might have a prob with testing a crawl.

I get a page of data via a basic curl. The returned data is
html/charset-utf-8.

I did a quick replace ('&amp;','&') and it replaced the '&amp;' as desired.
So the content only had '&' in it..

I then did a parseString/xpath to extract what I wanted, and realized I
have '&amp;' as representative of the '&' in the returned xpath content.

My issue: is there a way/method/etc. to return only the actual char, not
the html entity (&amp;)?

I can provide a more comprehensive chunk of code, but minimized the post to
get to the heart of the issue. Also, I'd prefer not to use a sep parse lib.


code chunk

import libxml2dom

q1=libxml2dom

s2= q1.parseString(a.toString().strip(), html=1)
tt=s2.xpath(tpath)

tt=tt[0].toString().strip()
print "tit "+tt

-


the content of a.toString() (shortened)
.
.
.
 

Organization
Development & Change
Edition: 10th



.
.
.

the xpath results are



Organization
Development &amp; Change
Edition: 10th



As you can see.. in the results of the xpath (toString())
 the & --> &amp;

I'm wondering if there's a process that can be used within the toString()
or do you really have to wrap each xpath/toString with a unescape() kind of
process to convert htmlentities to the requisite chars.

Thanks