Adrian Holovaty (creator of ChicagoCrime.org and Django) has a Python
script called templatemaker[1][2], which in theory would do what I want. You
feed it a bunch of similar web pages and it produces a template with "holes"
where the data was different across each web page. In practice, it's too
granular; it doesn't recognize HTML. It looks at every character. I don't care
about spaces between tags. I only care about substantial content differences
across pages. Everything else can be moved to the template.

You could try running everything through HTML Tidy first to see if that
normalizes whitespace and such. Then run templatemaker and see how
that works out.
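
As a rough illustration of the "normalize first" idea, here is a minimal Python sketch that uses two regular expressions as a stand-in for HTML Tidy (it is not Tidy and does not repair markup). The function name normalize_html_whitespace and the sample pages are made up for the example, and it assumes the pages have no whitespace-sensitive content such as <pre> blocks.

```python
import re

def normalize_html_whitespace(html: str) -> str:
    """Collapse whitespace so formatting-only differences disappear.

    Assumption: no <pre>/<textarea> content whose whitespace matters.
    """
    # Drop whitespace that sits purely between two tags: "</li>\n  <li>" -> "</li><li>"
    html = re.sub(r">\s+<", "><", html)
    # Collapse any remaining run of whitespace into a single space
    html = re.sub(r"\s+", " ", html)
    return html.strip()

# Two hypothetical pages that differ only in layout whitespace
page_a = "<ul>\n  <li>Alice</li>\n  <li>Bob</li>\n</ul>\n"
page_b = "<ul><li>Alice</li>   <li>Bob</li></ul>"

print(normalize_html_whitespace(page_a) == normalize_html_whitespace(page_b))
```

After this step the two pages compare equal, so a character-level tool would no longer see the spacing as a difference.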

You could use a diff program to find out where they are different and then kind of do the reverse to come up with the similarities. However, I would do it after running it all through Tidy first.
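
The "diff, then reverse it" idea could be sketched in Python with the standard difflib module: split each page into tags and text runs, diff the token lists, keep the equal parts, and mark everything else as a hole. This is only one possible approach, not how templatemaker works; the {{HOLE}} marker, the function names, and the sample pages are all invented for the example.

```python
import re
from difflib import SequenceMatcher

HOLE = "{{HOLE}}"  # hypothetical placeholder marker, not from any library

def tokenize(html):
    # Split into tags and the text runs between them; drop whitespace-only runs
    parts = re.findall(r"<[^>]+>|[^<]+", html)
    return [p.strip() for p in parts if p.strip()]

def common_template(html_a, html_b):
    a, b = tokenize(html_a), tokenize(html_b)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    template = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            # Tokens shared by both pages belong in the template
            template.extend(a[i1:i2])
        elif not template or template[-1] != HOLE:
            # Anything that differs becomes a single hole
            template.append(HOLE)
    return template

page_a = "<html><h1>Alice</h1><p>Age: 30</p></html>"
page_b = "<html><h1>Bob</h1>\n<p>Age: 41</p></html>"

print("".join(common_template(page_a, page_b)))
# prints: <html><h1>{{HOLE}}</h1><p>{{HOLE}}</p></html>
```

Because the diff runs on whole tags and text runs rather than characters, the newline between tags is ignored and each differing text node becomes one hole.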

If it were up to me, I would look at taking one page and creating a template from it, and then extract all the data you need to populate other pages with that template.
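
One way to read the "template from one page, then extract" suggestion: mark the variable spots in a single page by hand, turn that template into a regular expression, and pull the data out of the other pages. The Python sketch below assumes the pages really do share identical markup around the data; the {{HOLE}} marker and both function names are hypothetical.

```python
import re

HOLE = "{{HOLE}}"  # hypothetical marker placed by hand in the template page

def template_to_regex(template):
    # Escape the literal markup and turn each hole into a non-greedy capture group
    literal_parts = [re.escape(part) for part in template.split(HOLE)]
    return re.compile("(.*?)".join(literal_parts), re.DOTALL)

def extract(template, page):
    # Returns a tuple with one value per hole, or None if the page does not fit
    match = template_to_regex(template).fullmatch(page)
    return match.groups() if match else None

template = "<h1>{{HOLE}}</h1><p>Age: {{HOLE}}</p>"

print(extract(template, "<h1>Bob</h1><p>Age: 41</p>"))   # ('Bob', '41')
print(extract(template, "<div>something else</div>"))    # None
```

The match is strict, so any page whose surrounding markup differs from the template yields None; normalizing whitespace beforehand makes that less likely.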

--
thebigdog

_______________________________________________

UPHPU mailing list
[email protected]
http://uphpu.org/mailman/listinfo/uphpu
IRC: #uphpu on irc.freenode.net
