html to text based on some sort of uniqueness metric

2008-06-09 Thread Cam Bazz
Hello,

I am indexing newspaper articles as an exercise in Solr. In previous work with
newspaper articles I always tried to locate the div or table that contains the
actual news, using NekoHTML to traverse the DOM tree and pulling the text out
of that element. When dealing with many newspapers, it is a hassle to write
custom extraction code for each one. There is usually a lot of garbage in the
HTML, from categories to ads, and furthermore the layouts change, so static
coding is problematic.

I have been wondering whether I could measure the frequency or uniqueness of
each node and find the news automatically, but I have not come up with an
implementation.
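
One way to read the "uniqueness" idea, as a rough sketch only: fetch several
pages from the same newspaper, split each into block-level text chunks, and
count on how many pages each chunk appears. Chunks repeated across most pages
(menus, category lists, ads) are treated as garbage; chunks seen on few pages
are treated as the article. Everything below is an assumption made for
illustration: it uses Python's standard html.parser rather than NekoHTML, and
the names BlockExtractor, unique_text, BLOCK_TAGS, and max_share are
hypothetical, as is the 0.5 threshold. It is not what NovelAnalyzer or any
Lucene component does.

```python
from collections import Counter
from html.parser import HTMLParser

# Assumption: these tags are treated as block boundaries; adjust per site.
BLOCK_TAGS = {"div", "td", "p", "li", "h1", "h2", "h3"}
SKIP_TAGS = {"script", "style"}


class BlockExtractor(HTMLParser):
    """Collect the text found between block-level tag boundaries."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buf = []
        self._skip = 0

    def _flush(self):
        # Normalize whitespace and store the chunk if it is non-empty.
        text = " ".join("".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if not self._skip:
            self._buf.append(data)

    def close(self):
        super().close()
        self._flush()


def extract_blocks(html):
    parser = BlockExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks


def unique_text(pages, max_share=0.5):
    """For each page, keep chunks that occur on at most max_share of all pages."""
    per_page = [extract_blocks(html) for html in pages]
    doc_freq = Counter()
    for blocks in per_page:
        doc_freq.update(set(blocks))  # count each chunk once per page
    limit = max_share * len(pages)
    return [
        " ".join(b for b in blocks if doc_freq[b] <= limit)
        for blocks in per_page
    ]


# Made-up sample pages sharing a menu and an ad.
sample_pages = [
    "<div><li>Home</li><li>Sports</li></div>"
    "<div><p>Council approves new bridge.</p></div><div>Buy now!</div>",
    "<div><li>Home</li><li>Sports</li></div>"
    "<div><p>Local team wins the cup.</p></div><div>Buy now!</div>",
    "<div><li>Home</li><li>Sports</li></div>"
    "<div><p>Rain expected this weekend.</p></div><div>Buy now!</div>",
]

for text in unique_text(sample_pages):
    print(text)
```

Exact string matching is the weakest part of this sketch: an ad or menu whose
text changes slightly from page to page would slip through, so a real version
would probably compare chunks by position in the DOM or by a fuzzier
fingerprint.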

Has anyone done, contemplated, or used something similar? Maybe there is
already a way, using Lucene or even Hadoop.

Best Regards,
-C.A.


Re: html to text based on some sort of uniqueness metric

2008-06-09 Thread Otis Gospodnetic
I have not looked at the code yet, but look for NovelAnalyzer in the Lucene
JIRA. I believe it's supposed to do something similar.

 Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

