Has anyone seen a tool to extract a "template" from a set of similar web pages? We acquired a website that uses the same code across multiple web pages. Each web page was copied and pasted from the last; no includes were used. Each is slightly different from the next, even where they should be the same. (For example, some have <title> tags; some don't.) To the human eye, it's obvious what's template and what's content, but I can't do a find/replace because there's no consistent pattern in the code.

Adrian Holovaty (creator of ChicagoCrime.org and Django) has a Python script called templatemaker[1][2], which in theory would do what I want. You feed it a bunch of similar web pages and it produces a template with "holes" where the data differed across the pages. In practice, it's too granular; it doesn't recognize HTML. It looks at every difference, but I don't care about spaces between tags. I only care about substantial content differences across pages. Everything else can be moved to the template.
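
To make the idea concrete, here is a rough sketch of the kind of thing I'm picturing. It is not templatemaker's code or API, just a guess at an approach using Python's standard library: split each page into tag and text tokens, throw away whitespace between tags, and keep only the tokens every page shares. The names tokenize, merge, extract_template and the {{hole}} marker are made up for illustration, and the regex is a naive tokenizer, not a real HTML parser.

```python
import re
from difflib import SequenceMatcher

HOLE = "{{hole}}"  # hypothetical marker for "content differs here"

# Naive tokenizer: a token is either one tag or one run of text between tags.
TOKEN_RE = re.compile(r"<[^>]+>|[^<]+")


def tokenize(html):
    """Split HTML into tag and text tokens, dropping whitespace-only text."""
    tokens = []
    for match in TOKEN_RE.finditer(html):
        token = " ".join(match.group(0).split())  # collapse runs of whitespace
        if token:
            tokens.append(token)
    return tokens


def merge(template, tokens):
    """Keep tokens common to both lists; turn each differing run into one HOLE."""
    matcher = SequenceMatcher(None, template, tokens, autojunk=False)
    merged = []
    for op, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if op == "equal":
            merged.extend(template[i1:i2])
        elif not merged or merged[-1] != HOLE:
            merged.append(HOLE)  # avoid emitting two holes in a row
    return merged


def extract_template(pages):
    """Fold merge() over all pages, starting from the first page's tokens."""
    token_lists = [tokenize(page) for page in pages]
    template = token_lists[0]
    for tokens in token_lists[1:]:
        template = merge(template, tokens)
    return template
```

Given two pages that differ only in spacing, a missing <title>, and the heading text, extract_template should return the shared tags with a {{hole}} where the title block and the heading text were. Working on tokens rather than characters is the whole point: whitespace between tags never shows up as a difference.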

Any ideas come to mind?

Richard



[1] http://code.google.com/p/templatemaker/
[2] http://www.holovaty.com/blog/archive/2007/07/06/0128


_______________________________________________

UPHPU mailing list
[email protected]
http://uphpu.org/mailman/listinfo/uphpu
IRC: #uphpu on irc.freenode.net
