Re: how to parse html content in handler

2011-03-25 Thread Nick Kew
On Thu, 24 Mar 2011 22:58:07 +0800 (CST)
"Whut  Jia"  wrote:

> Hi,
> Thank you!
> But I want to parse a JSP page in my handler. How can I do it?
> Please help me! In my handler, I make a request (http://www.xxx/xxx.jsp) with
> libcurl, then parse the returned response and extract some information.
> How do I parse this JSP response?

Last time I looked, JSP 2 insisted on XML well-formedness,
and would work well under an XML parser.  You could dispense
with parsing altogether and just implement event handlers
under an existing parser such as mod_xmlns or mod_xml2.

JSP 1 was an SSI-like language.  It would be a little more
work, but mod_include would be a good starting point.

-- 
Nick Kew

Available for work, contract or permanent.
http://www.webthing.com/~nick/cv.html


Re: how to parse html content in handler

2011-03-25 Thread Mike Meyer
On Fri, 25 Mar 2011 09:28:01 -0400
MK  wrote:

> On Thu, 24 Mar 2011 20:10:46 +0800 (CST)
> Whut  Jia  wrote:
> > Hi, all
> > I want to parse some HTML content and extract some elements in my own
> > Apache handler. Please tell me how to do it. Thanks,
> > Jia
> 
> I think right now the only public C library for parsing html is in the
> venerable and long unmaintained libwww.  

How about the HTMLparser module in libxml2?

 http://www.mired.org/consulting.html
Independent Software developer/SCM consultant, email for more information.

O< ascii ribbon campaign - stop html mail - www.asciiribbon.org


Re: how to parse html content in handler

2011-03-25 Thread Joshua Marantz
mod_pagespeed's event-driven HTML parser is open source, and is written in
C++:
http://code.google.com/p/modpagespeed/source/browse/trunk/src/net/instaweb/htmlparse/public/html_parse.h

This parser is tested using HTML from large numbers of web sites.  The build
process for this module (http://code.google.com/p/modpagespeed/wiki/HowToBuild)
generates a separate .a for the HTML parser, although it's got a few
dependencies that would need to be linked in.  These are all included in
mod_pagespeed.so which is self-contained but larger.

If there were much interest, we could try to package up a self-contained
library that would be easier to call from other modules.

See also libxml2, which has an HTML mode.

-Josh

On Fri, Mar 25, 2011 at 9:28 AM, MK  wrote:

> On Thu, 24 Mar 2011 20:10:46 +0800 (CST)
> Whut  Jia  wrote:
> > Hi, all
> > I want to parse some HTML content and extract some elements in my own
> > Apache handler. Please tell me how to do it. Thanks,
> > Jia
>
> I think right now the only public C library for parsing html is in the
> venerable and long unmaintained libwww.
>
> However, I wrote a quick and simple, event-driven parser library a few
> months ago.  I have been meaning to open source it on CCAN or somewhere
> but have not gotten around to it, so if you are interested you can send
> me a message directly; I have some basic scraper demos, etc.  It is not
> on the scale of libwww -- it is just a low-level HTML parser -- but I am
> sure it could do what you want, and you can either compile it in or link
> to it from an Apache module (it has no further dependencies).


Re: how to parse html content in handler

2011-03-25 Thread MK
On Thu, 24 Mar 2011 20:10:46 +0800 (CST)
Whut  Jia  wrote:
> Hi, all
> I want to parse some HTML content and extract some elements in my own
> Apache handler. Please tell me how to do it. Thanks,
> Jia

I think right now the only public C library for parsing html is in the
venerable and long unmaintained libwww.  

However, I wrote a quick and simple, event-driven parser library a few
months ago.  I have been meaning to open source it on CCAN or somewhere
but have not gotten around to it, so if you are interested you can send
me a message directly; I have some basic scraper demos, etc.  It is not
on the scale of libwww -- it is just a low-level HTML parser -- but I am
sure it could do what you want, and you can either compile it in or link
to it from an Apache module (it has no further dependencies).
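
For anyone unsure what "event driven" means for a low-level parser: the
scanner walks the buffer once and calls back into your code for each start
tag, end tag and run of text, and your handler keeps whatever state it needs.
The following is only a toy sketch in standard C. It is not the library
described in this message, and every name in it (ev_kind, ev_fn, scan_tags,
title_ctx, find_title) is made up for illustration. It ignores attributes,
comments, CDATA, script bodies and entities.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical event kinds and callback type (illustrative names only). */
typedef enum { EV_START, EV_END, EV_TEXT } ev_kind;
typedef void (*ev_fn)(void *ctx, ev_kind kind, const char *s, size_t len);

/* Walk buf once and emit one event per start tag, end tag and text run.
 * Toy only: attributes are skipped; comments, CDATA, <script> bodies,
 * entities and '>' inside attribute values are not handled. */
void scan_tags(const char *buf, size_t len, ev_fn emit, void *ctx)
{
    size_t i = 0;

    assert(buf != NULL && emit != NULL);
    while (i < len) {
        size_t j = i, name;
        int closing = 0;

        if (buf[i] != '<') {
            /* text run: everything up to the next '<' */
            while (j < len && buf[j] != '<')
                j++;
            emit(ctx, EV_TEXT, buf + i, j - i);
            i = j;
            continue;
        }
        j = i + 1;
        if (j < len && buf[j] == '/') {
            closing = 1;
            j++;
        }
        name = j;
        while (j < len && isalnum((unsigned char)buf[j]))
            j++;
        if (j > name)
            emit(ctx, closing ? EV_END : EV_START, buf + name, j - name);
        while (j < len && buf[j] != '>')   /* skip any attributes */
            j++;
        i = (j < len) ? j + 1 : len;
    }
}

/* Example handler state: collect the text inside <title>...</title>. */
typedef struct {
    int    in_title;
    size_t used;
    char   title[128];
} title_ctx;

/* Case-insensitive comparison of a tag name against "title". */
static int is_title(const char *name, size_t len)
{
    static const char want[] = "title";
    size_t i;

    if (len != sizeof want - 1)
        return 0;
    for (i = 0; i < len; i++)
        if (tolower((unsigned char)name[i]) != want[i])
            return 0;
    return 1;
}

static void title_handler(void *ctx, ev_kind kind, const char *s, size_t len)
{
    title_ctx *t = ctx;
    size_t room = sizeof t->title - 1 - t->used;

    if (kind != EV_TEXT) {
        if (is_title(s, len))
            t->in_title = (kind == EV_START);
    } else if (t->in_title) {
        if (len > room)
            len = room;                    /* truncate rather than overflow */
        memcpy(t->title + t->used, s, len);
        t->used += len;
        t->title[t->used] = '\0';
    }
}

/* Returns the number of title bytes copied into out->title. */
size_t find_title(const char *html, title_ctx *out)
{
    memset(out, 0, sizeof *out);
    scan_tags(html, strlen(html), title_handler, out);
    return out->used;
}
```

A caller declares a title_ctx, passes the fetched page to find_title(), and
reads the result from the title field. For real-world HTML, the libxml2 and
mod_proxy_html suggestions elsewhere in this thread are likely the safer
route; the point here is only the shape of the callbacks.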


-- 
"Enthusiasm is not the enemy of the intellect." (said of Irving Howe)
"The angel of history[...]is turned toward the past." (Walter Benjamin)



Re:Re: how to parse html content in handler

2011-03-24 Thread Whut Jia
Hi,
Thank you!
But I want to parse a JSP page in my handler. How can I do it?
Please help me! In my handler, I make a request (http://www.xxx/xxx.jsp) with
libcurl, then parse the returned response and extract some information.
How do I parse this JSP response?
Thanks,
Jia 
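
The fetch half of the question mostly comes down to giving libcurl a write
callback that accumulates the response body before it is handed to a parser.
Below is a minimal sketch of that piece, assuming plain realloc() rather than
APR pools (an Apache module would probably allocate from the request pool
instead). The names body_buf and collect_body are made up; only the callback
prototype and the options named in the comment are libcurl's own.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable buffer for the response body. */
typedef struct {
    char  *data;   /* NUL-terminated once at least one chunk has arrived */
    size_t len;    /* bytes stored, not counting the terminator */
} body_buf;

/* Has the prototype libcurl expects for CURLOPT_WRITEFUNCTION.
 * Returning fewer bytes than were offered makes libcurl abort the
 * transfer, which is how an allocation failure is reported here. */
size_t collect_body(char *ptr, size_t size, size_t nmemb, void *userdata)
{
    body_buf *b = userdata;
    size_t n = size * nmemb;
    char *grown;

    assert(b != NULL);
    if (n == 0)
        return 0;
    grown = realloc(b->data, b->len + n + 1);
    if (grown == NULL)
        return 0;                  /* short count: transfer is aborted */
    memcpy(grown + b->len, ptr, n);
    b->data = grown;
    b->len += n;
    b->data[b->len] = '\0';
    return n;
}

/* With libcurl this would be wired up roughly as follows:
 *
 *   body_buf body = { NULL, 0 };
 *   curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, collect_body);
 *   curl_easy_setopt(curl, CURLOPT_WRITEDATA, &body);
 *   curl_easy_perform(curl);
 *   ... pass body.data / body.len to the HTML parser ...
 *   free(body.data);
 */
```

libcurl may deliver the body in several chunks, so parsing should wait until
the transfer has finished, or use a parser that accepts input incrementally.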




At 2011-03-24 20:25:11,"Ben Noordhuis"  wrote:

>On Thu, Mar 24, 2011 at 13:10, Whut  Jia  wrote:
>> Hi, all
>> I want to parse some HTML content and extract some elements in my own
>> Apache handler. Please tell me how to do it.
>> Thanks,
>> Jia
>
>Hey, have a look at how mod_proxy_html[1] does it.
>
>[1] http://apache.webthing.com/mod_proxy_html/


Re: how to parse html content in handler

2011-03-24 Thread Ben Noordhuis
On Thu, Mar 24, 2011 at 13:10, Whut  Jia  wrote:
> Hi, all
> I want to parse some HTML content and extract some elements in my own
> Apache handler. Please tell me how to do it.
> Thanks,
> Jia

Hey, have a look at how mod_proxy_html[1] does it.

[1] http://apache.webthing.com/mod_proxy_html/