I have not looked at the code yet, but search for "NovelAnalyzer" in the Lucene JIRA.
I believe it is supposed to do something similar.

Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: Cam Bazz <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Monday, June 9, 2008 3:55:16 PM
> Subject: html to text based on some sort of uniqueness metric
> Hello,
> I am indexing newspaper articles as an exercise in Solr. When dealing with
> newspaper articles in the past, I always tried to get the div or
> the table that contains the actual news, using NekoHTML to traverse the
> DOM tree and extract the text from the div or table that contains the
> article. When dealing with many newspapers, it is a hassle to write custom
> code to extract the relevant information. There is usually a lot of garbage
> in the HTML, from categories to ads, and furthermore the pages change, so
> static coding is problematic.
> I have been wondering whether I could measure the frequency or uniqueness of
> each node and find the news automatically, but I have not come up with an
> implementation.
> Has anyone done, contemplated, or used something similar? Maybe there is
> already a way, using Lucene or even Hadoop.
> Best Regards,
> -C.A.
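
The per-node uniqueness idea in the quoted message could be sketched roughly as
follows. This is only an illustration under stated assumptions, not what
NovelAnalyzer does (its code has not been examined here): it assumes several
pages from the same newspaper are available, splits each page into text chunks
at block-level tags, and keeps the chunks that appear on few pages, on the
theory that menus, category lists, and ads repeat across pages while article
text does not. The names BlockTextParser, text_chunks, unique_text, and the
max_share threshold are all hypothetical; only Python's standard html.parser
and collections modules are used.

```python
from collections import Counter
from html.parser import HTMLParser

# Assumption: these tags are treated as boundaries between text chunks.
BLOCK_TAGS = {"div", "table", "tr", "td", "p", "li", "ul", "ol"}


class BlockTextParser(HTMLParser):
    """Collects page text as chunks split at block-level tag boundaries."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._buf = []
        self._skip = 0  # nesting depth inside <script>/<style>

    def _flush(self):
        # Normalize whitespace and store the chunk if it is non-empty.
        text = " ".join("".join(self._buf).split())
        if text:
            self.chunks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = max(0, self._skip - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if not self._skip:
            self._buf.append(data)

    def close(self):
        super().close()
        self._flush()


def text_chunks(html):
    """Return the list of text chunks found in one HTML page."""
    parser = BlockTextParser()
    parser.feed(html)
    parser.close()
    return parser.chunks


def unique_text(pages, max_share=0.5):
    """For each page, join the chunks seen on at most max_share of all pages."""
    chunked = [text_chunks(page) for page in pages]
    doc_freq = Counter()
    for chunks in chunked:
        doc_freq.update(set(chunks))  # count each chunk once per page
    limit = max_share * len(pages)
    return [
        " ".join(c for c in chunks if doc_freq[c] <= limit)
        for chunks in chunked
    ]
```

With only a handful of pages the threshold is crude, and exact string matching
will miss boilerplate that varies slightly between pages (dates, counters), so
a real setup would likely need more pages per site and some fuzzier comparison.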
