RE: carrot2 question too - Re: Fun with the Wikipedia

2005-01-31 Thread Adam Saltiel
OK, thanks.

Adam

> -----Original Message-----
> From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
> Sent: Monday, January 31, 2005 5:51 PM
> To: Lucene Users List; [EMAIL PROTECTED]
> Subject: RE: carrot2 question too - Re: Fun with the Wikipedia
>
> Adam,
>
> Dawid posted some code that lets you use Carrot2 locally with Lucene,
> without the componentized pipeline system described on the Carrot2 site.
>
> Otis




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: carrot2 question too - Re: Fun with the Wikipedia

2005-01-31 Thread Dawid Weiss

Hi Adam.

Otis and David have already provided you with pointers to my previous
post regarding Carrot2-Lucene integration, so just a tiny note here:

> Also, when I looked at Carrot2, the pipeline is implemented over HTTP. I
> wonder how efficient that is, or whether it can be changed, for instance
> to an all-local implementation?

Yes, it is possible to combine components locally. This is even
demonstrated in the sample code David Spencer mentioned.

> Has Carrot2 been integrated with Lucene, and has it been used as the basis
> for a recommender system (could it be)?

I don't know... I guess it could, but you'd have to play with the source
code and modify it a bit to get the required functionality. I can't really
say anything more specific because I'm not deep in that subject.

D.


Re: carrot2 question too - Re: Fun with the Wikipedia

2005-01-31 Thread David Spencer
Otis Gospodnetic wrote:

> Adam,
>
> Dawid posted some code that lets you use Carrot2 locally with Lucene,

See the embedded zip URL in the post linked below for the Carrot2/Lucene
code (it may also be in the Carrot2 CVS tree). This is what I used as the
basis for the Wikipedia/cluster stuff:

http://www.newsarch.com/archive/mailinglist/jakarta/lucene/user/msg03928.html

> without the componentized pipeline system described on the Carrot2 site.
>
> Otis


RE: carrot2 question too - Re: Fun with the Wikipedia

2005-01-31 Thread Otis Gospodnetic
Adam,

Dawid posted some code that lets you use Carrot2 locally with Lucene,
without the componentized pipeline system described on the Carrot2 site.

Otis

--- Adam Saltiel <[EMAIL PROTECTED]> wrote:

> David, Hi,
> Would you be able to comment on the coincidentally recent thread
> "RE: -> Grouping Search Results by Clustering Snippets:"?
> Also, when I looked at Carrot2, the pipeline is implemented over HTTP. I
> wonder how efficient that is, or whether it can be changed, for instance
> to an all-local implementation?
> Has Carrot2 been integrated with Lucene, and has it been used as the basis
> for a recommender system (could it be)?
> TIA.
> 
> Adam



RE: carrot2 question too - Re: Fun with the Wikipedia

2005-01-31 Thread Adam Saltiel
David, Hi,
Would you be able to comment on the coincidentally recent thread
"RE: -> Grouping Search Results by Clustering Snippets:"?
Also, when I looked at Carrot2, the pipeline is implemented over HTTP. I
wonder how efficient that is, or whether it can be changed, for instance
to an all-local implementation?
Has Carrot2 been integrated with Lucene, and has it been used as the basis
for a recommender system (could it be)?
TIA.

Adam

> -----Original Message-----
> From: Dawid Weiss [mailto:[EMAIL PROTECTED]
> Sent: Monday, January 31, 2005 4:12 PM
> To: Lucene Users List
> Subject: Re: carrot2 question too - Re: Fun with the Wikipedia
>
>
> Hi.
>
> Coming up with answers... a little belated, but hope you're still on:
>
> > we have been experimenting with carrot2 and are very pleased so far,
> > only one issue: there is no release, not even an alpha one, and the
> > dependencies seem to have been patched (jama)
>
> Yes, there is no "official" release. We just don't feel the need to tag
> the sources with an official label because Carrot is not a stand-alone
> product (rather a library... or a framework). That does not imply that the
> project is in alpha stage... quite the contrary, in fact -- it has been
> out there for a while and it seems to do a good job for most people.
>
> > are there any intentions to have any releases in the near future?
>
> I could tag a release even today if it makes you happy ;) But I hope I
> made the status of the project clear above.
>
> D.



Re: carrot2 question too - Re: Fun with the Wikipedia

2005-01-31 Thread Dawid Weiss
Hi.

Coming up with answers... a little belated, but hope you're still on:

> we have been experimenting with carrot2 and are very pleased so far,
> only one issue: there is no release, not even an alpha one, and the
> dependencies seem to have been patched (jama)

Yes, there is no "official" release. We just don't feel the need to tag
the sources with an official label because Carrot is not a stand-alone
product (rather a library... or a framework). That does not imply that the
project is in alpha stage... quite the contrary, in fact -- it has been
out there for a while and it seems to do a good job for most people.

> are there any intentions to have any releases in the near future?

I could tag a release even today if it makes you happy ;) But I hope I
made the status of the project clear above.

D.


RE: carrot2 question too - Re: Fun with the Wikipedia

2005-01-29 Thread Adam Saltiel
Strangely enough, this subject is being taken up in the "RE: -> Grouping
Search Results by Clustering Snippets:" thread.

Adam

> -----Original Message-----
> From: Owen Densmore [mailto:[EMAIL PROTECTED]
> Sent: Friday, January 28, 2005 4:57 PM
> To: lucene-user@jakarta.apache.org
> Cc: Owen Densmore
> Subject: Re: carrot2 question too - Re: Fun with the Wikipedia
>
> I looked at the Carrot2 docs, which mentioned dimension reduction via
> singular value decomposition (SVD), and other forms too I think.
>
> Question: Does anyone have pointers to successful clustering techniques
> used with Lucene?  I'm particularly interested in 2D and 3D graphics as
> well, possibly SOM (Self-Organizing Maps).
>
> I'm hoping to combine Lucene with a graphical auto-clustering stunt of
> some kind but am not sure how to do it yet.
>
> Owen



Re: carrot2 question too - Re: Fun with the Wikipedia

2005-01-28 Thread Owen Densmore
I looked at the Carrot2 docs, which mentioned dimension reduction via
singular value decomposition (SVD), and other forms too I think.

Question: Does anyone have pointers to successful clustering techniques
used with Lucene?  I'm particularly interested in 2D and 3D graphics as
well, possibly SOM (Self-Organizing Maps).

I'm hoping to combine Lucene with a graphical auto-clustering stunt of
some kind but am not sure how to do it yet.

Owen

> From: Akmal Sarhan <[EMAIL PROTECTED]>
> Date: January 28, 2005 8:19:03 AM MST
> To: Lucene Users List
> Subject: Re: carrot2 question too - Re: Fun with the Wikipedia
>
> Hello,
>
> we have been experimenting with carrot2 and are very pleased so far,
> only one issue: there is no release, not even an alpha one, and the
> dependencies seem to have been patched (jama)
> are there any intentions to have any releases in the near future?
>
> thanks
>
> Akmal


Re: carrot2 question too - Re: Fun with the Wikipedia

2005-01-28 Thread Akmal Sarhan
Hello,

We have been experimenting with Carrot2 and are very pleased so far.
Only one issue: there is no release, not even an alpha one, and the
dependencies seem to have been patched (JAMA).
Are there any intentions to have any releases in the near future?

thanks 

Akmal
On Monday, 17.01.2005, at 10:15 +0100, Dawid Weiss wrote:
> Hi David,
> 
> I apologize about the delay in answering this one, Lucene is a busy 
> mailing list and I had a hectic last week... Again, sorry for belated 
> answer, hope you still find it useful.
> 
> >> That is awesome and very inspirational!
> 
> Yes, I admit what you've done with Wikipedia is quite interesting and 
> looks very good. I'm also glad you spent some time working out Carrot 
> integration with Lucene. It works quite nice.
> 
> >> Carrot2 looks very interesting. Wondering if anybody has a list of all 
> >> the
> > 
> > Technically I don't think carrot2 uses lucene per-se- it's just that you 
> > can integrate the two, and ditto for Nutch - it has code that uses Carrot2.
> 
> Yes, this is true. Carrot2 doesn't use all of Lucene's potential -- it 
> merely takes the output from a query (titles, urls and snippets) and 
> attempts to cluster them into some sensible groups. I think many things 
> could be improved, the most important of them is fast snippet retrieval 
> from Lucene because right now it takes 50% of the time of the 
> clustering; I've seen a post a while ago describing a faster snippet 
> generation technique, I'm sure that would give clustering a huge boost 
> speed-wise.
> 
> > And here's my question. I reread the Carrot2<->Lucene code, esp 
> > Demo.java, and there's this fragment:
> > 
> > // warm-up round (stemmer tables must be read etc).
> > List clusters = clusterer.clusterHits(docs);
> > 
> > long clusteringStartTime = System.currentTimeMillis();
> > clusters = clusterer.clusterHits(docs);
> > long clusteringEndTime = System.currentTimeMillis();
> > 
> > Thus it calls clusterHits() twice.
> > 
> > I don't really understand how to use Carrot2 - but I think the above is 
> > just for the sake of benchmarking clusterHits() w/o the effect of 1-time 
> > initialization - and that there's no benefit of repeatedly calling 
> > clusterHits (where a benefit might be that it can find nested clusters 
> > or whatever) - is that right (that there's no benefit)?
> 
> No, there is absolutely no benefit from it. It was merely to show people 
> that the clustering needs to be warmed up a bit. I should not have put 
> it in the code knowing people would be confused by it. You can safely 
> use clusterHits just once. It will just have a small delay at the first 
> invocation.
> 
> 
> Thanks for experimenting. Please BCC me if you have any urgent projects 
> -- I read Lucene's list in batches and my personal e-mail I try to keep 
> up to date with.
> 
> Dawid
-- 
Akmal Sarhan <[EMAIL PROTECTED]>
ByteAction GmbH





Re: carrot2 question too - Re: Fun with the Wikipedia

2005-01-17 Thread David Spencer
Dawid Weiss wrote:

> Hi David,
>
> I apologize about the delay in answering this one, Lucene is a busy
> mailing list and I had a hectic last week... Again, sorry for belated
> answer, hope you still find it useful.

Oh no problem, and yes carrot2 is useful and fun.  It's a rich package
so it takes a while to understand all that it can do.

> > > That is awesome and very inspirational!
>
> Yes, I admit what you've done with Wikipedia is quite interesting and
> looks very good. I'm also glad you spent some time working out Carrot
> integration with Lucene. It works quite nice.

Thanks but I just took code that I think you wrote(!) and made minor
mods to it - here's one link:

http://www.newsarch.com/archive/mailinglist/jakarta/lucene/user/msg03928.html

I'd like to do more w/ Carrot2- that's where things get harder.

> > > Carrot2 looks very interesting. Wondering if anybody has a list of
> > > all the
> >
> > Technically I don't think carrot2 uses lucene per-se- it's just that
> > you can integrate the two, and ditto for Nutch - it has code that uses
> > Carrot2.
>
> Yes, this is true. Carrot2 doesn't use all of Lucene's potential -- it
> merely takes the output from a query (titles, urls and snippets) and
> attempts to cluster them into some sensible groups. I think many things
> could be improved, the most important of them is fast snippet retrieval
> from Lucene because right now it takes 50% of the time of the
> clustering; I've seen a post a while ago describing a faster snippet
> generation technique, I'm sure that would give clustering a huge boost
> speed-wise.
>
> > And here's my question. I reread the Carrot2<->Lucene code, esp
> > Demo.java, and there's this fragment:
> >
> > // warm-up round (stemmer tables must be read etc).
> > List clusters = clusterer.clusterHits(docs);
> >
> > long clusteringStartTime = System.currentTimeMillis();
> > clusters = clusterer.clusterHits(docs);
> > long clusteringEndTime = System.currentTimeMillis();
> >
> > Thus it calls clusterHits() twice.
> >
> > I don't really understand how to use Carrot2 - but I think the above
> > is just for the sake of benchmarking clusterHits() w/o the effect of
> > 1-time initialization - and that there's no benefit of repeatedly
> > calling clusterHits (where a benefit might be that it can find nested
> > clusters or whatever) - is that right (that there's no benefit)?
>
> No, there is absolutely no benefit from it. It was merely to show people
> that the clustering needs to be warmed up a bit. I should not have put
> it in the code knowing people would be confused by it. You can safely
> use clusterHits just once. It will just have a small delay at the first
> invocation.
>
> Thanks for experimenting. Please BCC me if you have any urgent projects
> -- I read Lucene's list in batches and my personal e-mail I try to keep
> up to date with.
>
> Dawid

thx,
 Dave


Re: carrot2 question too - Re: Fun with the Wikipedia

2005-01-17 Thread Dawid Weiss
Hi David,

I apologize for the delay in answering this one; Lucene is a busy
mailing list and I had a hectic last week... Again, sorry for the belated
answer, hope you still find it useful.

> > That is awesome and very inspirational!

Yes, I admit what you've done with Wikipedia is quite interesting and
looks very good. I'm also glad you spent some time working out Carrot
integration with Lucene. It works quite nicely.

> > Carrot2 looks very interesting. Wondering if anybody has a list of all
> > the
>
> Technically I don't think carrot2 uses lucene per se; it's just that you
> can integrate the two, and ditto for Nutch - it has code that uses Carrot2.

Yes, this is true. Carrot2 doesn't use all of Lucene's potential -- it
merely takes the output from a query (titles, URLs and snippets) and
attempts to cluster them into some sensible groups. I think many things
could be improved, the most important being fast snippet retrieval
from Lucene, because right now it takes 50% of the time of the
clustering. I've seen a post a while ago describing a faster snippet
generation technique; I'm sure that would give clustering a huge boost
speed-wise.
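
To make the input shape described above concrete, here is a minimal sketch of search hits (title, URL, snippet) being grouped. The grouping rule is a deliberately naive shared-term heuristic swapped in for illustration; it is not the algorithm Carrot2 uses, and the `Hit` class, `groupBySharedTerms` method, and example URLs are all hypothetical names, not part of Carrot2 or Lucene.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class NaiveSnippetGrouping {

    /** One search hit as described in the thread: title, URL, and snippet (hypothetical type). */
    public static final class Hit {
        public final String title;
        public final String url;
        public final String snippet;

        public Hit(String title, String url, String snippet) {
            this.title = title;
            this.url = url;
            this.snippet = snippet;
        }
    }

    /**
     * Naive grouping, NOT Carrot2's algorithm: every lowercase word of at least
     * minLength characters that occurs in two or more hits labels a group
     * holding the URLs of those hits. A hit may land in several groups.
     */
    public static Map<String, List<String>> groupBySharedTerms(List<Hit> hits, int minLength) {
        Map<String, List<String>> byTerm = new TreeMap<>();
        for (Hit hit : hits) {
            String text = (hit.title + " " + hit.snippet).toLowerCase(Locale.ROOT);
            // De-duplicate terms per hit so each URL is listed once per group.
            Set<String> terms = new LinkedHashSet<>(Arrays.asList(text.split("[^a-z0-9]+")));
            for (String term : terms) {
                if (term.length() >= minLength) {
                    byTerm.computeIfAbsent(term, k -> new ArrayList<>()).add(hit.url);
                }
            }
        }
        // Keep only terms shared by at least two hits.
        byTerm.values().removeIf(urls -> urls.size() < 2);
        return byTerm;
    }

    public static void main(String[] args) {
        List<Hit> hits = Arrays.asList(
            new Hit("Lucene indexing", "http://example.org/a", "Building a search index"),
            new Hit("Search basics", "http://example.org/b", "How a search engine ranks"),
            new Hit("Carrots", "http://example.org/c", "A root vegetable"));
        System.out.println(groupBySharedTerms(hits, 5));
    }
}
```

The point of the sketch is only that the clusterer sees short text per hit rather than the index itself, which is why snippet retrieval cost matters as much as the clustering step.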

> And here's my question. I reread the Carrot2<->Lucene code, esp
> Demo.java, and there's this fragment:
>
> // warm-up round (stemmer tables must be read etc).
> List clusters = clusterer.clusterHits(docs);
>
> long clusteringStartTime = System.currentTimeMillis();
> clusters = clusterer.clusterHits(docs);
> long clusteringEndTime = System.currentTimeMillis();
>
> Thus it calls clusterHits() twice.
>
> I don't really understand how to use Carrot2 - but I think the above is
> just for the sake of benchmarking clusterHits() w/o the effect of 1-time
> initialization - and that there's no benefit of repeatedly calling
> clusterHits (where a benefit might be that it can find nested clusters
> or whatever) - is that right (that there's no benefit)?

No, there is absolutely no benefit from it. It was merely to show people
that the clustering needs to be warmed up a bit. I should not have put
it in the code, knowing people would be confused by it. You can safely
call clusterHits just once. It will just have a small delay on the first
invocation.
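
The warm-up pattern from the Demo.java fragment quoted above can be sketched in a self-contained form. The `Clusterer` interface and `timeAfterWarmup` helper below are hypothetical stand-ins, not the real Carrot2 API; only the idea of one untimed call followed by one timed call comes from the thread. The toy clusterer simulates a one-time initialization cost with a short sleep.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class WarmupTiming {

    /** Hypothetical stand-in for the clusterer used in Demo.java; not the real Carrot2 API. */
    public interface Clusterer {
        List<String> clusterHits(List<String> docs);
    }

    /**
     * Runs one untimed warm-up call so that one-time initialization is excluded,
     * then returns the elapsed nanoseconds of a second, timed call.
     */
    public static long timeAfterWarmup(Clusterer clusterer, List<String> docs) {
        clusterer.clusterHits(docs);              // warm-up round, result discarded
        long start = System.nanoTime();
        clusterer.clusterHits(docs);              // measured round
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        // Toy clusterer with a simulated one-time initialization cost.
        Clusterer toy = new Clusterer() {
            private boolean initialized = false;

            @Override
            public List<String> clusterHits(List<String> docs) {
                if (!initialized) {
                    try {
                        Thread.sleep(50);         // pretend to load stemmer tables once
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                    initialized = true;
                }
                return new ArrayList<>(docs);     // trivial "clustering": copy the input
            }
        };

        List<String> docs = Arrays.asList("first snippet", "second snippet");
        long nanos = timeAfterWarmup(toy, docs);
        System.out.println("steady-state call took " + (nanos / 1_000) + " microseconds");
    }
}
```

In application code, as opposed to a benchmark, the warm-up call is unnecessary: a single call returns the same result and simply absorbs the initialization delay once.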

Thanks for experimenting. Please BCC me if you have anything urgent
-- I read the Lucene list in batches, but I try to keep up to date with
my personal e-mail.

Dawid