I have not looked at the code yet, but look for "NovelAnalyzer" in Lucene JIRA. I believe it's supposed to do something similar.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: Cam Bazz <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Monday, June 9, 2008 3:55:16 PM
> Subject: html to text based on some sort of uniqueness metric
>
> Hello,
>
> I am indexing newspaper articles as an exercise in Solr. When dealing with
> newspaper articles in the past, I always tried to get the div or the table
> that contains the actual news, using NekoHTML to traverse the DOM tree and
> pull the text from the div or table that contains the article. When dealing
> with many newspapers, it is a hassle to write custom code to extract the
> relevant information. There is usually a lot of garbage in the HTML, from
> categories to ads, and furthermore the layouts change, so static coding is
> problematic.
>
> I have been thinking about whether I could measure the frequency or
> uniqueness of each node and find the news automatically, but I have not
> come up with an implementation.
>
> Has anyone done, contemplated, or used something similar? Maybe there is
> already a way, using Lucene or even Hadoop.
>
> Best Regards,
> -C.A.
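
A minimal sketch of the frequency idea from the quoted message, for illustration only. It is not NovelAnalyzer and does not use Lucene, Solr, or NekoHTML; it relies only on the Python standard library. It assumes you have several pages from the same newspaper: text blocks that repeat across most pages (menus, category lists, ads) are treated as boilerplate, and blocks that show up on few pages are kept as the likely article text. The names BlockTextExtractor, extract_blocks, unique_text and the max_share threshold are made up for this example.

```python
from collections import Counter
from html.parser import HTMLParser

# Tags treated as block boundaries; this set is an assumption, adjust per site.
BLOCK_TAGS = {"div", "td", "p", "li", "h1", "h2", "h3"}
SKIP_TAGS = {"script", "style"}


class BlockTextExtractor(HTMLParser):
    """Collects the text chunks that sit between block-level tags."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buf = []
        self._skip_depth = 0

    def _flush(self):
        # Normalize whitespace so identical blocks compare equal across pages.
        text = " ".join("".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._buf.append(data)

    def close(self):
        super().close()
        self._flush()


def extract_blocks(html):
    parser = BlockTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks


def unique_text(pages, max_share=0.5):
    """For each page, keep blocks found on at most max_share of all pages."""
    per_page = [extract_blocks(page) for page in pages]
    doc_freq = Counter()
    for blocks in per_page:
        doc_freq.update(set(blocks))  # count each block once per page
    limit = max_share * len(pages)
    return [[b for b in blocks if doc_freq[b] <= limit] for blocks in per_page]


if __name__ == "__main__":
    nav = '<div class="nav">Home | Sports | World</div>'
    ads = "<div>Ads by Example</div>"
    pages = [
        nav + "<div><p>Council approves new bridge.</p></div>" + ads,
        nav + "<div><p>Local team wins the cup.</p></div>" + ads,
        nav + "<div><p>Rain expected on Friday.</p></div>" + ads,
    ]
    for kept in unique_text(pages):
        print(kept)
```

Exact string matching is the simplest possible uniqueness measure and breaks as soon as boilerplate varies slightly between pages (dates, rotating ads). Comparing blocks by shingles or by their path in the DOM tree would probably be more robust, but that is beyond this sketch.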