RE: Need pointers on using a very small part of Lucene

2002-03-15 Thread Alex Murzaku

You should keep in mind though that frequency plays a role in computing
the score. If you eliminate repeated words, you will have frequency 1
for all of them and will lose one of the dimensions used in ranking.
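
To put a rough number on that, here is a minimal sketch. It assumes a square-root term-frequency factor, which is what Lucene's default similarity uses as far as I know; the class and method names are made up for illustration and are not Lucene API.

```java
// Illustration only, not Lucene code. Assumes tf(freq) = sqrt(freq).
class TfSketch {
    // Term-frequency factor for a term that occurs freq times in a document.
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    public static void main(String[] args) {
        // "question" occurs twice in the sample document quoted below.
        System.out.println("tf with repeats kept:    " + tf(2));
        // After removing repeats, every term has frequency 1.
        System.out.println("tf with repeats removed: " + tf(1));
    }
}
```

Under that assumption every term ends up with the same factor once repeats are dropped, so frequency no longer separates one matching document from another.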

-----Original Message-----
From: Kelvin Tan [mailto:[EMAIL PROTECTED]] 
Sent: Thursday, March 14, 2002 9:51 PM
To: Lucene Users List
Subject: Re: Need pointers on using a very small part of Lucene


Robert,


 I just have one more question - how do I remove repeated words? Does 
 anyone have a filter for doing this?

 For example, here's the result of one of my files being worked on: 
 todai customer.formattedmailingaddress3 dear customer.dearnam respond

 request inform productlongnam summari inform subjectssinglelin addit 
 question topic question bertek product call 1-888-523-7835 ext.9877 
 product inform productnam includ correspond product inform product 
 bertek product electron internet site http wwwbertekcom interest 
 bertek pharmaceut product appreci sincer responsiblehcp.signatur 
 responsiblehcp.fullnam responsiblehcp.position.nam initiator.initi 
 responsiblehcp.initi casenumb cc salesrep.fullnam enclosur 
 enclosurestextbullet bodytext


 If you look closely, you'll see the word 'question' repeated twice.

One way to do it is to write a TokenFilter that keeps a Set of every
token's termText. In the next() method, check whether the current
token's termText already exists in the Set. If so, skip it. If not,
add it to the Set and return the token. Note that this may be rather
memory-intensive, since the Set grows with every distinct term...:)
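
A rough, self-contained sketch of that idea follows. To keep it runnable without the Lucene jar, TermSource is a hypothetical stand-in for a token stream (next() returns the term text, or null at the end); it is not Lucene's TokenStream/TokenFilter API, so a real filter would need adapting to whatever those classes look like in your version.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

// Hypothetical stand-in for a token stream: next() returns the next
// term text, or null when the stream is exhausted.
interface TermSource {
    String next();
}

// Passes each distinct term through once and drops later repeats.
class UniqueTermFilter implements TermSource {
    private final TermSource input;
    // Holds every distinct term seen so far, so it grows with the vocabulary.
    private final Set<String> seen = new HashSet<>();

    UniqueTermFilter(TermSource input) {
        this.input = input;
    }

    @Override
    public String next() {
        String term;
        while ((term = input.next()) != null) {
            // Set.add returns false if the term was already present.
            if (seen.add(term)) {
                return term;
            }
        }
        return null; // end of stream
    }
}

class UniqueTermDemo {
    // Helper that wraps a fixed list of terms as a TermSource.
    static TermSource of(String... terms) {
        Iterator<String> it = Arrays.asList(terms).iterator();
        return () -> it.hasNext() ? it.next() : null;
    }

    public static void main(String[] args) {
        TermSource filtered = new UniqueTermFilter(
                of("addit", "question", "topic", "question", "bertek"));
        for (String t = filtered.next(); t != null; t = filtered.next()) {
            System.out.println(t);
        }
    }
}
```

A new filter instance per document keeps the Set from growing across documents; sharing one instance would drop terms that already appeared in an earlier document.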

HTH.

Regards,
Kelvin



 thanks,
 rob



 --
 To unsubscribe, e-mail:
mailto:[EMAIL PROTECTED]
 For additional commands, e-mail:
mailto:[EMAIL PROTECTED]









taxonomy and lucene

2002-03-15 Thread Mark Ayad

Hi All,

Is anyone using Lucene to automate taxonomy generation over volumes of
text-based documents?

Regards

Mark






Re: size and nos of documents in the index

2002-03-15 Thread Otis Gospodnetic

Parag,

Indexing time and index size should be proportional to the size of
the documents being indexed.  Also, I believe a document containing more
distinct terms will increase the index size more than a document
containing more duplicates.  For instance, "I am going to bed in a few
moments because I am tired" will result in more unique terms than
"Good night".
As for the maximum number of documents that can be indexed, I think
there is virtually no limit, other than your hardware and things like
that.
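
To make the unique-terms point concrete, here is a small sketch that counts distinct words in the two example sentences. It assumes plain lowercasing and whitespace splitting; a real analyzer may also drop stop words or stem, which would change the counts. The class name is made up for illustration.

```java
import java.util.HashSet;
import java.util.Set;

class UniqueTermCount {
    // Counts distinct lowercased, whitespace-separated words in the text.
    static int uniqueTerms(String text) {
        Set<String> terms = new HashSet<>();
        for (String word : text.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                terms.add(word);
            }
        }
        return terms.size();
    }

    public static void main(String[] args) {
        String longer = "I am going to bed in a few moments because I am tired";
        String shorter = "Good night";
        System.out.println(uniqueTerms(longer) + " unique terms vs. "
                + uniqueTerms(shorter));
    }
}
```

Each distinct term needs its own entry in the index's term dictionary, while a repeat only adds to an existing entry, which is why the first sentence should cost more.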

Otis

--- Parag Dharmadhikari [EMAIL PROTECTED] wrote:
 Hi all,
 
 How is indexing affected by the size of documents, and what is
 the maximum number of documents that can be indexed?
 
 regards
 parag
 
 

