Hi,
does anyone have any sample code/documentation available for doing document based
clustering using lucene?
Thanks,
Marc
Hi Marc,
I'm working on it. Classification and Clustering as well.
I was planning to do it for nutch.org, but some people there
broke up some important basic work I had already done, so maybe I will
not contribute it there.
However, it will be open source, and I can notify you if something
I'm working on it. Classification and Clustering as well.
Very interesting... if you get something working, please don't forget to
notify this list :-)
--
Eric Jain
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional
Hi
As everybody seems to be so excited about it, would someone please be so kind as to
explain
what document based clustering is?
Regards,
Marcel
Marcel Stör wrote:
Hi
As everybody seems to be so excited about it, would someone please be so kind as to explain
what document based clustering is?
Hi
they are trying to implement what you can see in the right panel here:
http://www.egothor.dundee.ac.uk/egothor/q2c.jsp?q=protein
They may also
--- Leo Galambos [EMAIL PROTECTED] wrote:
Marcel Stör wrote:
Hi
As everybody seems to be so excited about it, would someone please be
so kind as to explain
what document based clustering is?
AFAIK, document clustering consists of detection of documents with
similar content (similar
On Nov 11, 2003, at 16:05, Marcel Stör wrote:
As everybody seems to be so excited about it, would someone please be
so kind as to explain
what document based clustering is?
This mostly means finding documents that are similar in some way(s).
The similarity is mostly in the eye of the beholder. In
Categorization typically assigns documents to a node in a pre-defined taxonomy.
For clustering, however, the categorization 'structure' is emergent... i.e. the
clusters (which are analogous to taxonomy nodes) are created dynamically based on the
content of the documents at hand.
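
To make the contrast concrete, here is a minimal Java sketch. Everything in it is an assumption for illustration and not taken from Lucene or from anyone's project in this thread: the class name TaxonomyVsClusterSketch, the word-overlap (Jaccard) score, and the single-pass "leader" grouping are stand-ins for whatever similarity measure and clustering method one would really use. categorize() needs the taxonomy up front; cluster() is given only the documents and a threshold, and the groups emerge from the data.

```java
import java.util.*;

class TaxonomyVsClusterSketch {
    // Lower-cased set of whitespace-separated words (a stand-in for real analysis).
    static Set<String> words(String text) {
        return new HashSet<>(Arrays.asList(text.toLowerCase().split("\\s+")));
    }

    // Jaccard overlap: shared words divided by all distinct words.
    static double overlap(Set<String> a, Set<String> b) {
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }

    // Categorization: the nodes exist before any document is seen.
    // Returns the name of the node whose keywords overlap most with the document.
    static String categorize(String doc, Map<String, Set<String>> taxonomy) {
        String best = null;
        double bestScore = -1.0;
        for (Map.Entry<String, Set<String>> node : taxonomy.entrySet()) {
            double score = overlap(words(doc), node.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = node.getKey();
            }
        }
        return best;
    }

    // Clustering: no predefined nodes. A document joins the first cluster whose
    // first member (its "leader") is similar enough, else it starts a new cluster.
    static List<List<String>> cluster(List<String> docs, double threshold) {
        List<List<String>> clusters = new ArrayList<>();
        for (String doc : docs) {
            List<String> home = null;
            for (List<String> c : clusters) {
                if (overlap(words(doc), words(c.get(0))) >= threshold) {
                    home = c;
                    break;
                }
            }
            if (home == null) {
                home = new ArrayList<>();
                clusters.add(home);
            }
            home.add(doc);
        }
        return clusters;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> taxonomy = new LinkedHashMap<>();
        taxonomy.put("search", words("lucene index query"));
        taxonomy.put("biology", words("protein gene cell"));
        System.out.println(categorize("lucene query syntax", taxonomy));

        List<String> docs = Arrays.asList("lucene index search", "protein gene cell",
                "lucene search query", "gene protein sequence");
        System.out.println(cluster(docs, 0.4));
    }
}
```

Leader clustering depends on document order and on the chosen threshold, so treat it only as the simplest way to show an emergent structure.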
-Original
On Nov 11, 2003, at 16:58, Tate Avery wrote:
Categorization typically assigns documents to a node in a pre-defined
taxonomy.
For clustering, however, the categorization 'structure' is emergent...
i.e. the clusters (which are analogous to taxonomy nodes) are created
dynamically based on the
Hi,
How is document clustering different/related to text categorization?
Clustering: you try to find the categories yourself and put the matching documents into them.
You group all documents with minimal distance together.
Classification: you already have categories and samples for them, which help you to match
Thanks for the clarification, Stefan. I should have known that... :)
Otis
--- Stefan Groschupf [EMAIL PROTECTED] wrote:
Hi,
How is document clustering different/related to text categorization?
Clustering: you try to find the categories yourself and put the matching
documents into them.
You group all
Hi,
A couple questions...
1). If I delete a term using an IndexReader, can I use an existing
IndexWriter to write to the index? Or do I need to close and reopen the
IndexWriter?
2). Is it safe to call IndexReader.delete(term) while an IndexWriter is
writing? Or should I be synchronizing
Classes for indexing PDF and Word files in Lucene.
Ernesto.
- Original Message -
From: Ernesto De Santis [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Wednesday, October 29, 2003 12:04 PM
Subject: Re: [opencms-dev] Index pdf files with your content in lucene.
Hello all,
Thanks very much
Stefan Groschupf wrote:
Hi,
How is document clustering different/related to text categorization?
Clustering: you try to find the categories yourself and put the matching
documents into them. You group all documents with minimal distance together.
Would I be correct to say that you have to define a distance
On Tuesday, Nov 11, 2003, at 11:05 US/Pacific, Marcel Stor wrote:
Stefan Groschupf wrote:
Hi,
How is document clustering different/related to text categorization?
Clustering: you try to find the categories yourself and put the matching
documents into them. You group all documents with minimal distance together.
Some of us have corporate firewalls that are stripping out attachments. If possible,
put these on a web site somewhere so we can download them. Thanks!
-Original Message-
From: Ernesto De Santis [mailto:[EMAIL PROTECTED]
Sent: Tuesday, November 11, 2003 11:07 AM
To: Lucene Users List
Marcel Stor wrote:
Stefan Groschupf wrote:
Hi,
How is document clustering different/related to text categorization?
Clustering: you try to find the categories yourself and put the matching
documents into them. You group all documents with minimal distance together.
Would I be correct to say
Hello ,
now that the topic is clustering methods: has there been any effort in
implementing latent semantic indexing in Lucene? Google only turns up
someone else asking this in February.
Is there an overview of the structure of the Lucene index besides
the javadoc or any other fast
Thomas Krämer wrote:
Is there an overview of the structure of the Lucene index besides
the javadoc or any other fast access to understanding what happens
inside Lucene?
You mean something like this?:
http://jakarta.apache.org/lucene/docs/fileformats.html
cheers,
Gerret
Thomas Krämer wrote:
now that the topic is clustering methods: has there been any effort in
implementing latent semantic indexing in Lucene? Google only turns up
someone else asking this in February.
Just a note: LSI is encumbered by US patents 4,839,853 and 5,301,109. It would be wise to
On Tuesday, November 11, 2003, at 02:37 PM, Thomas Krämer wrote:
Is there an overview of the structure of the Lucene index besides
the javadoc or any other fast access to understanding what happens
inside Lucene?
Here is what is inside a Lucene index:
Hi All and Marc,
There is the Carrot project:
http://www.cs.put.poznan.pl/dweiss/carrot/
The Carrot system consists of web services that can easily be fed a Lucene
result list. You simply have to create a JSP that creates this XML file and
create a custom process and input component. The input
Really cool stuff!
maurits van wijland wrote:
Hi All and Marc,
There is the Carrot project:
http://www.cs.put.poznan.pl/dweiss/carrot/
The Carrot system consists of web services that can easily be fed a Lucene
result list. You simply have to create a JSP that creates this XML file and
1). If I delete a term using an IndexReader, can I use an existing
IndexWriter to write to the index? Or do I need to close and reopen
the IndexWriter?
No. You should close IndexWriter first, then open IndexReader, then
call delete, then close IndexReader, and then open a new IndexWriter.
Ernesto, it looks like something got stripped. A ZIP file should make
it to the list. If not, maybe you can post it somewhere.
Could you also tell us a bit about this code? Is it better than
existing PDF/Word parsing solutions? Pure Java? Uses POI?
Thanks,
Otis
--- Ernesto De Santis
On Nov 11, 2003, at 21:32, maurits van wijland wrote:
There is the Carrot project:
http://www.cs.put.poznan.pl/dweiss/carrot/
Leo Galambos, author of the Egothor project, constantly supports us
with fresh ideas and includes Carrot components in his own project!
Hi!
I'm also interested in it. Kindly CC me on the latest progress of your
clustering project.
Regards,
AlexAw
- Original Message -
From: Eric Jain [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Tuesday, November 11, 2003 10:07 PM
Subject: Re: Document Clustering
I'm
Thanks, everyone, for the responses and links to resources.
I was basically thinking of using Lucene to generate document vectors, and
writing my custom similarity algorithms for measuring distance.
I could then run this data through k-means or SOM algorithms for calculating
clusters
Does this
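
The pipeline described above (document vectors, a distance measure, then k-means) can be sketched in plain Java roughly as follows. This is only a sketch under stated assumptions, not anyone's actual implementation: the class name KMeansSketch, the fixed vocabulary, whitespace tokenization, cosine distance, and seeding the centroids with the first k documents are all choices made here for illustration. How the term vectors would be pulled out of a Lucene index is left out entirely.

```java
import java.util.*;

class KMeansSketch {
    // Term-frequency vector over a fixed vocabulary (hypothetical stand-in for
    // vectors derived from an index).
    static double[] termVector(String text, List<String> vocab) {
        double[] v = new double[vocab.size()];
        for (String tok : text.toLowerCase().split("\\s+")) {
            int i = vocab.indexOf(tok);
            if (i >= 0) v[i] += 1.0;
        }
        return v;
    }

    // Cosine distance: 0 for the same direction, 1 for no shared terms.
    static double cosineDistance(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        if (na == 0 || nb == 0) return 1.0; // an empty vector matches nothing
        return 1.0 - dot / Math.sqrt(na * nb);
    }

    // Plain k-means loop using cosine distance; centroids are seeded with the
    // first k vectors (an assumption; real code would pick seeds more carefully).
    // Returns the cluster index assigned to each input vector.
    static int[] kMeans(double[][] vecs, int k, int maxIter) {
        int n = vecs.length, dim = vecs[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) centroids[c] = vecs[c].clone();
        int[] assign = new int[n];
        Arrays.fill(assign, -1);
        for (int iter = 0; iter < maxIter; iter++) {
            boolean changed = false;
            // Assignment step: nearest centroid for every vector.
            for (int i = 0; i < n; i++) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (cosineDistance(vecs[i], centroids[c])
                            < cosineDistance(vecs[i], centroids[best])) best = c;
                }
                if (assign[i] != best) { assign[i] = best; changed = true; }
            }
            if (!changed) break;
            // Update step: each centroid becomes the mean of its members.
            for (int c = 0; c < k; c++) {
                double[] sum = new double[dim];
                int count = 0;
                for (int i = 0; i < n; i++) {
                    if (assign[i] != c) continue;
                    for (int d = 0; d < dim; d++) sum[d] += vecs[i][d];
                    count++;
                }
                if (count > 0) {
                    for (int d = 0; d < dim; d++) sum[d] /= count;
                    centroids[c] = sum;
                }
            }
        }
        return assign;
    }

    public static void main(String[] args) {
        List<String> vocab = Arrays.asList("lucene", "index", "search", "query",
                "protein", "gene", "cell", "sequence");
        String[] docs = {"lucene index search", "protein gene cell",
                "lucene search query", "gene protein sequence"};
        double[][] vecs = new double[docs.length][];
        for (int i = 0; i < docs.length; i++) vecs[i] = termVector(docs[i], vocab);
        System.out.println(Arrays.toString(kMeans(vecs, 2, 10)));
    }
}
```

Swapping cosineDistance for a custom measure only touches the assignment step, which is where a hand-written similarity algorithm would plug in.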
Hi,
I have a huge data file with 4 GB of data. The data in the file never changes.
The format of the file is as follows:
Col1,col2,col3,Value
abababc,xyzza,c,100
ababadx,xyz,adfdfd,101
I need to retrieve the value with simple queries on the data like:
On Tuesday, November 11, 2003, at 10:00 PM, Kumar Mettu wrote:
The format of the file is as follows:
Col1,col2,col3,Value
abababc,xyzza,c,100
ababadx,xyz,adfdfd,101
I need to retrieve the value with simple queries on the data like:
select value where col1
I'll try again, zipping the files.
Afterwards I'll post the files on the web site.
Could you also tell us a bit about this code? Is it better than
existing PDF/Word parsing solutions? Pure Java? Uses POI?
This code uses existing parsing solutions.
The intent is to make a Lucene Document for indexing PDF and