[PHP] Re: html analyzer

2010-05-19 Thread Bill Guion

At 12:30 AM +0200 5/19/10, Rene Veerman wrote:


Hi.

I'm trying to build an HTML analyzer that looks at natural words in HTML text.

I'd like to build a routine that walks through the HTML character by
character, but I'm not sure how to properly walk through escaped "
and ' characters in JavaScript or other embedded languages. Skipping
the first " and ' is no problem, but after that, the escaped " and '
can get difficult, IMO.

If you have any ideas on this, I'd like to hear them.

--
-
Greetings from Rene7705,

My free open source webcomponents:
  http://code.google.com/u/rene7705/
  http://mediabeez.ws/downloads (and demos)

http://www.facebook.com/rene7705
-


Rene,

I agree with the previous post: what you want to do is non-trivial.
However, to address your question: one approach is to create a single
quote flag (sqf) and a double quote flag (dqf). When you encounter
the first quote, set the flag for that quote type. When you encounter
the second quote of the same type, clear the flag. At the end, both
flags should be clear, or the HTML is malformed. You can also get more
sophisticated and verify that you do not encounter a single, double,
single sequence, or a double, single, double sequence. That gets more
involved, because you have to remember which quote was first, second,
and third; the third should be the same as the second, for example.
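
A minimal sketch of the two-flag idea in PHP, not a full parser. The function name quotesBalanced is made up for illustration, and it assumes one thing the description leaves open: while one flag is set, a quote of the other type is treated as ordinary string content. Backslash escapes are not handled, and the type declarations assume PHP 7 or later.

```php
<?php
// Hypothetical helper: true when every quote that opens is later closed.
// Assumption: no backslash escapes; a quote of the other type inside an
// open string is treated as content.
function quotesBalanced(string $text): bool
{
    $sqf = false; // single quote flag
    $dqf = false; // double quote flag

    $length = strlen($text);
    for ($i = 0; $i < $length; $i++) {
        $char = $text[$i];
        if ($char === "'" && !$dqf) {
            $sqf = !$sqf; // set on the first single quote, clear on the second
        } elseif ($char === '"' && !$sqf) {
            $dqf = !$dqf; // same for double quotes
        }
    }

    // Both flags must be clear at the end, otherwise a quote was left open.
    return !$sqf && !$dqf;
}
```

Note that an apostrophe in ordinary text (as in "don't") sets the single quote flag too, so a check like this only makes sense on the inside of a tag or a script block, not on a whole page.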


 -= Bill =-
--

Don't find fault. Find a remedy. - Henry Ford
  



--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP] Re: html analyzer

2010-05-19 Thread Robert Cummings

Bill Guion wrote:

Rene,

I agree with the previous post: what you want to do is non-trivial.
However, to address your question: one approach is to create a single
quote flag (sqf) and a double quote flag (dqf). When you encounter
the first quote, set the flag for that quote type. When you encounter
the second quote of the same type, clear the flag. At the end, both
flags should be clear, or the HTML is malformed. You can also get more
sophisticated and verify that you do not encounter a single, double,
single sequence, or a double, single, double sequence. That gets more
involved, because you have to remember which quote was first, second,
and third; the third should be the same as the second, for example.


There's more to it than that. You also need to handle escaping of the 
same quote character within the string's content. This is parsing 101 :)
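
A rough PHP sketch of that escape handling, assuming backslash escaping as used in JavaScript string literals (HTML attribute values use entities such as &quot; instead). The name findStringEnd is hypothetical, and the nullable return type assumes PHP 7.1 or later.

```php
<?php
// Hypothetical helper: $start must point at an opening ' or " in $source.
// Returns the offset of the matching closing quote, or null if the string
// is never terminated.
function findStringEnd(string $source, int $start): ?int
{
    $quote = $source[$start];
    $length = strlen($source);

    for ($i = $start + 1; $i < $length; $i++) {
        if ($source[$i] === '\\') {
            $i++; // skip the escaped character, whatever it is
            continue;
        }
        if ($source[$i] === $quote) {
            return $i; // first unescaped quote of the same type
        }
    }

    return null;
}
```

The key point is that the backslash consumes the next character before the quote comparison runs, so an escaped quote of the same type can never end the string.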


Cheers,
Rob.
--
http://www.interjinn.com
Application and Templating Framework for PHP




Re: [PHP] Re: html analyzer

2010-05-19 Thread Ashley Sheridan
On Wed, 2010-05-19 at 13:24 -0400, Bill Guion wrote:

 Rene,
 
 I agree with the previous post: what you want to do is non-trivial.
 However, to address your question: one approach is to create a single
 quote flag (sqf) and a double quote flag (dqf). When you encounter
 the first quote, set the flag for that quote type. When you encounter
 the second quote of the same type, clear the flag. At the end, both
 flags should be clear, or the HTML is malformed. You can also get more
 sophisticated and verify that you do not encounter a single, double,
 single sequence, or a double, single, double sequence. That gets more
 involved, because you have to remember which quote was first, second,
 and third; the third should be the same as the second, for example.
 
   -= Bill =-
 -- 
 
 Don't find fault. Find a remedy. - Henry Ford

 
 


It would have to be a lot more complicated than that. Consider:

print "document.write('<a href=\"#\" onmouseover=\"doSomething(\'argument\')\">link</a>');";

It's ugly, but potentially possible. I've seen JavaScript being used to
write JavaScript before because it required less (albeit uglier) code
than using cross-browser code to add event handlers.

The parser could, though, split off any strings it finds within
JavaScript like this and parse them with the same function. It could
then call itself recursively each time it encounters a string.
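
Something along these lines, as a sketch only. The function name collectStrings and the result layout are made up for illustration, escapes are handled by simply dropping the backslash (so \n would become n), and comments or regex literals in the script are ignored. The sample input is an assumed JavaScript line of the kind described, with quotes nested three levels deep.

```php
<?php
// Hypothetical helper: collects every quoted string found in $code, then
// scans each string's unescaped content again, recording the nesting depth.
function collectStrings(string $code, int $depth = 0): array
{
    $found = [];
    $length = strlen($code);

    for ($i = 0; $i < $length; $i++) {
        $quote = $code[$i];
        if ($quote !== '"' && $quote !== "'") {
            continue;
        }

        // Read up to the matching unescaped quote, dropping escape backslashes.
        $content = '';
        for ($i++; $i < $length && $code[$i] !== $quote; $i++) {
            if ($code[$i] === '\\' && $i + 1 < $length) {
                $i++;
            }
            $content .= $code[$i];
        }

        $found[] = ['depth' => $depth, 'text' => $content];

        // The string may itself hold markup or script, so scan it the same way.
        $found = array_merge($found, collectStrings($content, $depth + 1));
    }

    return $found;
}

// Assumed sample input: JavaScript that writes HTML containing more JavaScript.
$js = <<<'JS'
document.write('<a href="#" onmouseover="doSomething(\'argument\')">link</a>');
JS;

foreach (collectStrings($js) as $entry) {
    echo $entry['depth'], ': ', $entry['text'], "\n";
}
```

Each level of recursion strips one layer of quoting, which is why the innermost 'argument' only becomes visible once the two strings around it have been unescaped.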

Thanks,
Ash
http://www.ashleysheridan.co.uk




[PHP] Re: html analyzer

2010-05-18 Thread Manuel Lemos
Hello,

on 05/18/2010 07:30 PM Rene Veerman said the following:
 Hi.
 
 I'm trying to build an HTML analyzer that looks at natural words in HTML text.
 
 I'd like to build a routine that walks through the HTML character by
 character, but I'm not sure how to properly walk through escaped "
 and ' characters in JavaScript or other embedded languages. Skipping
 the first " and ' is no problem, but after that, the escaped " and '
 can get difficult, IMO.
 
 If you have any ideas on this, I'd like to hear them.

Better to try something that already exists. HTML parsing is not that
trivial. If the HTML you are parsing is malformed, things get worse.

You may want to try this HTML parser package. It can parse HTML, CSS,
DTD, etc. in pure PHP. No special extensions required. It can tolerate
malformed HTML and even filter insecure HTML and CSS that may contain
dangerous JavaScript. Actually, it was written mainly for that purpose.

http://www.phpclasses.org/secure-html-filter
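
To illustrate the general "use an existing parser" route without guessing at that package's API, here is a sketch that uses PHP's bundled DOM extension instead. This is a different tool from the package linked above, it does need the DOM extension to be enabled, and the function name visibleWords is made up. It assumes non-empty input and PHP 7 or later.

```php
<?php
// Hypothetical helper built on the bundled DOM extension (an assumption:
// ext-dom is enabled). Returns the words found in text outside <script>
// and <style> elements.
function visibleWords(string $html): array
{
    // Collect parser warnings internally so malformed markup stays quiet.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $words = [];

    // Text nodes whose parent is neither <script> nor <style>.
    $nodes = $xpath->query('//text()[not(parent::script) and not(parent::style)]');
    foreach ($nodes as $node) {
        foreach (preg_split('/\s+/', $node->nodeValue, -1, PREG_SPLIT_NO_EMPTY) as $word) {
            $words[] = $word;
        }
    }

    return $words;
}
```

Because the parser has already separated markup, attribute values, and script content, the quote and escape handling never has to be written by hand for this use case.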


-- 

Regards,
Manuel Lemos

Find and post PHP jobs
http://www.phpclasses.org/jobs/

PHP Classes - Free ready to use OOP components written in PHP
http://www.phpclasses.org/
