You should keep in mind though that frequency plays a role in computing
the score. If you eliminate repeated words, you will have frequency 1
for all of them and will lose one of the dimensions used in ranking.
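
To make that concrete, here is a minimal sketch in plain Java (no Lucene
classes involved; TermFrequencyDemo and termFrequencies are names made up
for this illustration). It counts term occurrences for a few tokens taken
from the sample output below, before and after removing repeats:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, for illustration only; not part of Lucene.
class TermFrequencyDemo {
    // Count how many times each term occurs, preserving first-seen order.
    static Map<String, Integer> termFrequencies(Iterable<String> tokens) {
        Map<String, Integer> freq = new LinkedHashMap<>();
        for (String term : tokens) {
            freq.merge(term, 1, Integer::sum);
        }
        return freq;
    }

    public static void main(String[] args) {
        List<String> tokens =
            Arrays.asList("question", "topic", "question", "bertek", "product");

        // With repeats kept, "question" has frequency 2.
        System.out.println(termFrequencies(tokens));

        // With repeats removed, every term has frequency 1.
        System.out.println(termFrequencies(new LinkedHashSet<>(tokens)));
    }
}
```

After de-duplication every count is 1, so frequency can no longer help
tell one document from another for a given term.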
-----Original Message-----
From: Kelvin Tan [mailto:[EMAIL PROTECTED]]
Sent: Thursday, March 14, 2002 9:51 PM
To: Lucene Users List
Subject: Re: Need pointers on using a very small part of Lucene
Robert,
I just have one more question - how do I remove repeated words? Does
anyone have a filter for doing this?
For example, here's the result of one of my files being worked on:
todai customer.formattedmailingaddress3 dear customer.dearnam respond
request inform productlongnam summari inform subjectssinglelin addit
question topic question bertek product call 1-888-523-7835 ext.9877
product inform productnam includ correspond product inform product
bertek product electron internet site http wwwbertekcom interest
bertek pharmaceut product appreci sincer responsiblehcp.signatur
responsiblehcp.fullnam responsiblehcp.position.nam initiator.initi
responsiblehcp.initi casenumb cc salesrep.fullnam enclosur
enclosurestextbullet bodytext
If you look closely, you'll see that the word 'question' appears twice.
One way to do it is to write a TokenFilter that keeps a Set of every
token.termText() it has seen. In the next() method, check whether the
current token's term text is already in the Set. If so, skip it. If
not, add it to the Set and carry on. Note that this may be rather
memory-intensive, since the Set holds every distinct term... :)
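
A rough sketch of that idea, written against stand-in types so it runs
without Lucene on the classpath. TermStream, DedupFilter and DedupDemo are
hypothetical names, not Lucene classes; a real version would extend
TokenFilter and compare token.termText() instead of plain Strings:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a token stream: next() returns the next term
// text, or null once the stream is exhausted.
interface TermStream {
    String next();
}

// Wraps another stream and drops every term that has already been seen.
class DedupFilter implements TermStream {
    private final TermStream input;
    private final Set<String> seen = new HashSet<>();

    DedupFilter(TermStream input) {
        this.input = input;
    }

    @Override
    public String next() {
        String term;
        while ((term = input.next()) != null) {
            // Set.add returns false when the term is already present,
            // so repeats are skipped and the loop moves to the next term.
            if (seen.add(term)) {
                return term;
            }
        }
        return null; // underlying stream is exhausted
    }
}

class DedupDemo {
    // Adapt a list of terms to the stand-in stream interface.
    static TermStream fromList(List<String> terms) {
        Iterator<String> it = terms.iterator();
        return () -> it.hasNext() ? it.next() : null;
    }

    // Read a stream to the end and collect its terms.
    static List<String> drain(TermStream stream) {
        List<String> out = new ArrayList<>();
        String term;
        while ((term = stream.next()) != null) {
            out.add(term);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> terms =
            Arrays.asList("question", "topic", "question", "bertek", "product");
        // prints [question, topic, bertek, product]
        System.out.println(drain(new DedupFilter(fromList(terms))));
    }
}
```

The Set grows with the number of distinct terms, so one filter instance
should be used per document rather than shared across documents.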
HTH.
Regards,
Kelvin
thanks,
rob
--
To unsubscribe, e-mail:
mailto:[EMAIL PROTECTED]
For additional commands, e-mail:
mailto:[EMAIL PROTECTED]