Re: [Web-SIG] HTML parsing - get text position and font size

2009-01-12 Thread Noah Gift
2009/1/13 Girish Redekar girish.rede...@gmail.com: I'm trying to build a search engine in Python and am stuck at the place where I parse HTML to get useful text. One should ideally be able to parse the text (out of HTML tags) along with its position (for phrase searches) and font-size (to weigh

Re: [Web-SIG] HTML parsing - get text position and font size

2009-01-12 Thread Girish Redekar
Thanks Noah - Beautiful Soup does give a tree that can be used - however, getting from the tree to the result I desire is still a long way. I'm using lxml (for speed concerns) and it also returns a tree similar to BS. I have even got as far as parsing the CSS and getting the attributes for each

Re: [Web-SIG] HTML parsing - get text position and font size

2009-01-12 Thread Dirkjan Ochtman
2009/1/12 Girish Redekar girish.rede...@gmail.com: is still tedious as font sizes in HTML/CSS can be expressed in multiple ways (like FONT tags, sizes in pixels, relative sizes, default larger size for headers etc). One can get down and code each of these cases, but I was hoping someone has

Re: [Web-SIG] HTML parsing - get text position and font size

2009-01-12 Thread Thomas Broyer
2009/1/12 Girish Redekar: I'm trying to build a search engine in Python and am stuck at the place where I parse HTML to get useful text. One should ideally be able to parse the text (out of HTML tags) along with its position (for phrase searches) and font-size (to weigh words appropriately). Have

Re: [Web-SIG] HTML parsing - get text position and font size

2009-01-12 Thread Manlio Perillo
Girish Redekar wrote: I'm trying to build a search engine in Python and am stuck at the place where I parse HTML to get useful text. One should ideally be able to parse the text (out of HTML tags) along with its position (for phrase searches) and font-size (to weigh words appropriately).
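
For readers following the thread, here is a minimal sketch of the kind of extraction being asked about: each word is recorded with its ordinal position (for phrase searches) and an approximate font size (for weighting). It is not taken from any message in the thread. It uses the standard library's html.parser rather than lxml or Beautiful Soup so that it stays self-contained, and it only looks at FONT tags, inline style attributes, and heading defaults. The 16px base size, the heading scale factors, and the FONT size table are assumptions modelled on common browser defaults, and the names WeightedTextParser and extract_weighted_words are hypothetical. External stylesheets and relative FONT sizes such as "+1" are ignored.

```python
import re
from html.parser import HTMLParser

# Assumed defaults, loosely modelled on common browser behaviour.
BASE_PX = 16.0
HEADING_SCALE = {"h1": 2.0, "h2": 1.5, "h3": 1.17, "h4": 1.0, "h5": 0.83, "h6": 0.67}
FONT_TAG_PX = {1: 10.0, 2: 13.0, 3: 16.0, 4: 18.0, 5: 24.0, 6: 32.0, 7: 48.0}
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}
SKIP_TAGS = {"script", "style"}
SIZE_RE = re.compile(r"font-size\s*:\s*(\d+(?:\.\d+)?)\s*(px|em|%)", re.I)


class WeightedTextParser(HTMLParser):
    """Collects (position, word, size_px) tuples from an HTML string."""

    def __init__(self):
        super().__init__()
        self.stack = [("#root", BASE_PX)]  # open elements and their font size
        self.skip_depth = 0                # > 0 while inside <script>/<style>
        self.tokens = []

    def _resolve_size(self, tag, attrs, parent_px):
        attrs = dict(attrs)
        # Headings scale the inherited size; other tags inherit it unchanged.
        size = parent_px * HEADING_SCALE.get(tag, 1.0)
        # <font size="N"> with an absolute value 1..7.
        font_size = attrs.get("size") or ""
        if tag == "font" and font_size.isdecimal():
            size = FONT_TAG_PX.get(int(font_size), size)
        # An inline style overrides both of the above.
        match = SIZE_RE.search(attrs.get("style") or "")
        if match:
            value, unit = float(match.group(1)), match.group(2).lower()
            if unit == "px":
                size = value
            elif unit == "em":
                size = parent_px * value
            else:  # percentage of the inherited size
                size = parent_px * value / 100
        return size

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        if tag in VOID_TAGS:
            return  # void elements never contain text
        self.stack.append((tag, self._resolve_size(tag, attrs, self.stack[-1][1])))

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        # Pop back to the nearest matching open tag, tolerating sloppy markup.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if self.skip_depth:
            return
        size = self.stack[-1][1]
        for word in re.findall(r"\w+", data):
            self.tokens.append((len(self.tokens), word.lower(), size))


def extract_weighted_words(html):
    parser = WeightedTextParser()
    parser.feed(html)
    parser.close()
    return parser.tokens
```

Calling extract_weighted_words on a page returns a flat list that an indexer could consume directly: consecutive positions support phrase matching, and the size can be turned into a term weight, for example by dividing by the assumed 16px base.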