Document Clustering

2003-11-11 Thread marc
Hi,

Does anyone have any sample code/documentation available for doing document-based
clustering using Lucene?

Thanks,
Marc



Re: Document Clustering

2003-11-11 Thread Stefan Groschupf
Hi Marc,

I'm working on it. Classification and Clustering as well.
I was planning to do it for nutch.org, but some guys there broke up some
important basic work I had already done, so maybe I will not contribute it
there.
However, it will be open source and I can notify you when something useful
is ready.
But it will take some weeks. I am currently working on radically minimizing
the feature selection.

Cheers
Stefan


marc wrote:

Hi,

does anyone have any sample code/documentation available for doing document based clustering using lucene?

Thanks,
Marc
 

--
day time: www.media-style.com
spare time: www.text-mining.org | www.weta-group.net




Re: Document Clustering

2003-11-11 Thread Eric Jain
 I'm working on it. Classification and Clustering as well.

Very interesting... if you get something working, please don't forget to
notify this list :-)

--
Eric Jain





Re: Document Clustering

2003-11-11 Thread Marcel Stör
Hi

As everybody seems to be so excited about it, would someone please be so kind
as to explain what document-based clustering is?

Regards,
Marcel




Re: Document Clustering

2003-11-11 Thread Leo Galambos
Marcel Stör wrote:

Hi

As everybody seems to be so excited about it, would someone please be so kind as to explain
what document-based clustering is?
 

Hi

they are trying to implement what you can see in the right panel here:
http://www.egothor.dundee.ac.uk/egothor/q2c.jsp?q=protein
They may also analyze identical pages (hits #9 and #10); this could also be
taken as clustering, AFAIK.

For instance, Doug wrote some papers about clustering (if I remember it 
correctly) - see his bibliography.

Leo



Re: Document Clustering

2003-11-11 Thread Otis Gospodnetic

--- Leo Galambos [EMAIL PROTECTED] wrote:
 Marcel Stör wrote:
 
 Hi
 
 As everybody seems to be so excited about it, would someone please be
 so kind as to explain
 what document-based clustering is?

AFAIK, document clustering consists of detecting documents with
similar content (similar subjects/topics).
 
 Hi
 
 they are trying to implement what you can see in the right panel
 here:
 http://www.egothor.dundee.ac.uk/egothor/q2c.jsp?q=protein
 They may also analyze identical pages (hit #9 and #10) - this could
 be 
 also taken as clustering AFAIK.

Interesting.

 For instance, Doug wrote some papers about clustering (if I remember
 it 
 correctly) - see his bibliography.


How is document clustering different/related to text categorization?

Thanks,
Otis






Re: Document Clustering

2003-11-11 Thread petite_abeille
On Nov 11, 2003, at 16:05, Marcel Stör wrote:

As everybody seems to be so exited about it, would someone please be 
so kind to explain
what document based clustering is?
This mostly means finding documents which are similar in some way(s).
The similarity is mostly in the eye of the beholder. In such a
world, a cluster would be a pile of documents sharing something. As
far as Lucene goes, a straightforward way of approaching this could be
to use an entire document's content to query an index. Lucene's result
set could then be construed as a document cluster. Admittedly, this is
ground zero of document clustering, but here you go anyway :)
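
For what it's worth, here is a minimal, untested sketch of that idea (Lucene
1.3-era API assumed; the "contents" and "path" field names are hypothetical):

import java.io.StringReader;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

public class SimilarDocs {
    public static void main(String[] args) throws Exception {
        IndexSearcher searcher = new IndexSearcher("/path/to/index");
        String documentText = "... the entire text of the seed document ...";

        // Turn every token of the seed document into an optional TermQuery clause.
        // (Very long documents may run into BooleanQuery's clause limit.)
        BooleanQuery query = new BooleanQuery();
        TokenStream stream =
            new StandardAnalyzer().tokenStream("contents", new StringReader(documentText));
        for (Token token = stream.next(); token != null; token = stream.next()) {
            query.add(new TermQuery(new Term("contents", token.termText())), false, false);
        }

        // The best-scoring hits are, loosely speaking, the seed document's "cluster".
        Hits hits = searcher.search(query);
        for (int i = 0; i < hits.length(); i++) {
            System.out.println(hits.score(i) + "\t" + hits.doc(i).get("path"));
        }
        searcher.close();
    }
}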

Here is an illustration:

Patterns in Unstructured Data
Discovery, Aggregation, and Visualization
http://javelina.cet.middlebury.edu/lsa/out/cover_page.htm
Cheers,

PA.



RE: Document Clustering

2003-11-11 Thread Tate Avery
Categorization typically assigns documents to a node in a pre-defined taxonomy.

For clustering, however, the categorization 'structure' is emergent... i.e. the 
clusters (which are analogous to taxonomy nodes) are created dynamically based on the 
content of the documents at hand.


-Original Message-
From: petite_abeille [mailto:[EMAIL PROTECTED]
Sent: Tuesday, November 11, 2003 10:50 AM
To: Lucene Users List
Subject: Re: Document Clustering


Hi Otis,

On Nov 11, 2003, at 16:41, Otis Gospodnetic wrote:

 How is document clustering different/related to text categorization?

Not that I'm an expert in any of this, but clustering is a much more 
holistic approach than categorization. Usually, categorization is 
understood as a more precise endeavor (e.g. dmoz.org), while clustering 
is much more fuzzy and non-deterministic. Both try to achieve the 
same goal though. So perhaps this is just a question of jargon.

I'm confident that the owner of this site could help shed some light
on the finer points of clustering vs. categorization:

http://www.lissus.com/resources/index.htm

Cheers,

PA.





Re: Document Clustering

2003-11-11 Thread petite_abeille
On Nov 11, 2003, at 16:58, Tate Avery wrote:

Categorization typically assigns documents to a node in a pre-defined 
taxonomy.

For clustering, however, the categorization 'structure' is emergent... 
i.e. the clusters (which are analogous to taxonomy nodes) are created 
dynamically based on the content of the documents at hand.
Another way to look at it is this:

An attempt to apply the Dewey Decimal system to an orgy. [1]

Without a Dewey Decimal system that is.

Cheers,

PA.

[1] http://www.eod.com/devil/archive/semantic_web.html



Re: Document Clustering

2003-11-11 Thread Stefan Groschupf
Hi,
How is document clustering different/related to text categorization?
Clustering: you try to find your own categories and put the documents that match into them.
You group all documents with minimal distance together.

Classification: you already have categories, and samples for them that help you to match other documents.
You calculate a document's distance to the existing categories and put it into the category with the smallest distance.
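
As an illustration only (not from any particular library), the classification
half can be sketched as a nearest-centroid lookup over precomputed
term-weight vectors:

public class NearestCategory {

    // Euclidean distance between two equally sized term-weight vectors.
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Return the index of the category centroid with the smallest distance
    // to the document vector.
    static int classify(double[] document, double[][] categoryCentroids) {
        int best = -1;
        double bestDistance = Double.MAX_VALUE;
        for (int i = 0; i < categoryCentroids.length; i++) {
            double d = distance(document, categoryCentroids[i]);
            if (d < bestDistance) {
                bestDistance = d;
                best = i;
            }
        }
        return best;
    }
}

How the vectors and centroids are built (term weighting, feature selection) is
the hard part, of course.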

Cheers
Stefan
--
day time: www.media-style.com
spare time: www.text-mining.org | www.weta-group.net




Re: Document Clustering

2003-11-11 Thread Otis Gospodnetic
Thanks for the clarification, Stefan.  I should have known that... :)

Otis

--- Stefan Groschupf [EMAIL PROTECTED] wrote:
 Hi,
 How is document clustering different/related to text categorization?
 
 Clustering: try to find own categories and put documents that match
 in it. 
 You group all documents with minimal distance together. 
 
 Classification: you have already categories and samples for it, that
 help you to match other documents. 
 You calculate document distances to the existing categories and put
 it in the category with smallest distance.
 
 Cheers
 Stefan
 
 -- 
 day time: www.media-style.com
 spare time: www.text-mining.org | www.weta-group.net
 
 
 
 
 


__
Do you Yahoo!?
Protect your identity with Yahoo! Mail AddressGuard
http://antispam.yahoo.com/whatsnewfree




Reopen IndexWriter after delete?

2003-11-11 Thread Wilton, Reece
Hi,

A couple of questions...

1).  If I delete a term using an IndexReader, can I use an existing
IndexWriter to write to the index?  Or do I need to close and reopen the
IndexWriter?

2).  Is it safe to call IndexReader.delete(term) while an IndexWriter is
writing?  Or should I be synchronizing these two tasks so only one
occurs at a time?

Any help is appreciated!
-Reece




Index pdf files with your content in lucene.

2003-11-11 Thread Ernesto De Santis
Classes for indexing PDF and Word files in Lucene.
Ernesto.

- Original Message -
From: Ernesto De Santis [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Wednesday, October 29, 2003 12:04 PM
Subject: Re: [opencms-dev] Index pdf files with your content in lucene.


Hello all,

Thanks very much, Stephan, for your valuable help.
Attached you will find the PDFDocument and WordDocument class source code.

Ernesto.


- Original Message -
From: Hartmann, Waehrisch  Feykes GmbH [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Tuesday, October 28, 2003 11:10 AM
Subject: Re: [opencms-dev] Index pdf files with your content in lucene.


 Hi Ernesto,

 the IndexManager retrieves a list of files of a folder by calling the method
 getFilesInFolder of CmsObject. This method returns only empty files, i.e.
 with empty content. To get the content of a PDF file you have to reread the
 file:

 f = cms.readFile(f.getAbsolutePath());

 Bye,
 Stephan

 On Monday, 27 October 2003 at 19:18, you wrote:

   Hello
 
  Thanks for the previous reply.
 
  Now I use:
  - version 1.4 of the Lucene search module (the version attached in this list)
  - the new version of the registry.xml format for the module (like you wrote me)
  - the PDF files are stored with the binary type.

  But I have the following problem:
  I can't create an InputStream for the CmsFile content.
  For this I wrote this code in the Document method of my class PDFDocument:
 
  -
 
  // f is the CmsFile parameter of the Document method
  InputStream in = new ByteArrayInputStream(f.getContents());

  // PDFExtractor is the library I use; it works fine on the file system.
  PDFExtractor extractor = new PDFExtractor();

  bodyText = extractor.extractText(in);
 
  
 
  Is it correct to use a ByteArrayInputStream to create an InputStream for a
  CmsFile?

  The error occurs in the third line, in the PDFParser.
  The error message in Tomcat is:
 
  java.io.IOException: Error: Header is corrupt ''
  at PDFParcer.parse
  at PDFExtractor.extractText
  at PDFDocument.Document (my class)
  at.
 
  Bye, and thanks.
  Ernesto.
 
 
  - Original Message -
From: Hartmann, Waehrisch  Feykes GmbH
To: [EMAIL PROTECTED]
Sent: Friday, October 24, 2003 4:45 AM
Subject: Re: [opencms-dev] Index pdf files with your content in
lucene.
 
 
Hello Ernesto,
 
I assume you are using the unpatched version 1.3 of the search module.
As I mentioned yesterday, the plainDocFactory only indexes cmsFiles of
type plain, but not of type binary. PDF files are stored as binary. I
suggest using the version I posted yesterday. Then your registry.xml
would have to look like this:

<docFactories>
  ...
  <docFactory type="plain" enabled="true">
    ...
  </docFactory>
  <docFactory type="binary" enabled="true">
    <fileType name="pdftext">
      <extension>.pdf</extension>
      <class>net.grcomputing.opencms.search.lucene.PDFDocument</class>
    </fileType>
  </docFactory>
  ...
</docFactories>

Important: The type attribute must match the file types of OpenCms (also
defined in the registry.xml).
 
Bye,
Stephan
 
  - Original Message -
  From: Ernesto De Santis
  To: Lucene Users List
  Cc: [EMAIL PROTECTED]
  Sent: Thursday, October 23, 2003 4:16 PM
  Subject: [opencms-dev] Index pdf files with your content in lucene.
 
 
  Hello
 
  I am new to OpenCms and Lucene technology.

  I want to index PDF files, and index the content of these files.

  I work this way:

  Make a PDFDocument class like the JspDocument class.
  Use the org.textmining.text.extraction.PDFExtractor class; this class works
  fine outside the VFS.

  And write my registry.xml entry for PDF documents in the plainDocFactory tag:

  <fileType name="pdftext">
    <extension>.pdf</extension>
    <!-- This will strip tags before processing -->
    <class>net.grcomputing.opencms.search.lucene.PDFDocument</class>
  </fileType>

  My PDFDocument contains this code:
  I think the problem is how to take the content from the CmsFile. What
  InputStream should I use? PDFExtractor works with the extractText(InputStream) method.
 
  public class PDFDocument implements I_DocumentConstants, I_DocumentFactory {

      public PDFDocument() {
      }

      public Document Document(CmsObject cmsobject, CmsFile cmsfile)
          throws CmsException {
          return Document(cmsobject, cmsfile, null);
      }

      public Document Document(CmsObject cmsobject, CmsFile cmsfile, HashMap hashmap)
          throws CmsException {

          Document document = (new BodylessDocument()).Document(cmsobject, cmsfile);

          // Put the content of the PDF file into the document.
          String contenido = new String(cmsfile.getContents());

          StringBufferInputStream in = new StringBufferInputStream(contenido);
          // ByteArrayInputStream in = new ByteArrayInputStream(contenido.getBytes());
 
 

RE: Document Clustering

2003-11-11 Thread Marcel Stor
Stefan Groschupf wrote:
 Hi,
  How is document clustering different/related to text categorization?
 
 Clustering: try to find own categories and put documents that match
 in it. You group all documents with minimal distance together.

Would I be correct to say that you have to define a distance threshold
parameter in order to define when to build a new category for a certain
group?

 Classification: you have already categories and samples for
 it, that help you to match other documents.
 You calculate document distances to the existing categories
 and put it in the category with smallest distance.

Regards,
Marcel





Re: Document Clustering

2003-11-11 Thread Joshua O'Madadhain
On Tuesday, Nov 11, 2003, at 11:05 US/Pacific, Marcel Stor wrote:

Stefan Groschupf wrote:
Hi,
How is document clustering different/related to text categorization?
Clustering: try to find own categories and put documents that match
in it. You group all documents with minimal distance together.
Would I be correct to say that you have to define a distance 
threshold
parameter in order to define when to build a new category for a certain
group?
Depends on the type of clustering algorithm.  Some clustering 
algorithms take the number of clusters as a parameter (in this case the 
algorithm may be run several times with different values, to determine 
the best value).  Other types of algorithms, such as hierarchical 
agglomerative clustering algorithms, work more as you suggest.
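
To make the threshold-driven variant concrete, here is a rough, untested
sketch of single-link agglomerative merging over a precomputed distance
matrix; documents end up in the same cluster whenever they are connected by
distances below the threshold:

import java.util.ArrayList;
import java.util.List;

public class ThresholdClustering {

    // Start with one cluster per document and merge any two clusters whose
    // closest pair of members is nearer than the threshold (single link).
    static List cluster(double[][] distance, double threshold) {
        List clusters = new ArrayList();
        for (int i = 0; i < distance.length; i++) {
            List singleton = new ArrayList();
            singleton.add(new Integer(i));
            clusters.add(singleton);
        }
        boolean merged = true;
        while (merged) {
            merged = false;
            for (int a = 0; a < clusters.size() && !merged; a++) {
                for (int b = a + 1; b < clusters.size() && !merged; b++) {
                    if (minLink(distance, (List) clusters.get(a), (List) clusters.get(b)) < threshold) {
                        ((List) clusters.get(a)).addAll((List) clusters.get(b));
                        clusters.remove(b);
                        merged = true;
                    }
                }
            }
        }
        return clusters; // each element is a List of document ids
    }

    // Smallest pairwise distance between members of two clusters.
    static double minLink(double[][] distance, List a, List b) {
        double min = Double.MAX_VALUE;
        for (int i = 0; i < a.size(); i++) {
            for (int j = 0; j < b.size(); j++) {
                double d = distance[((Integer) a.get(i)).intValue()][((Integer) b.get(j)).intValue()];
                if (d < min) {
                    min = d;
                }
            }
        }
        return min;
    }
}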

Regards,

Joshua O'Madadhain

 [EMAIL PROTECTED] ... Per Obscurius ... www.ics.uci.edu/~jmadden
 Joshua O'Madadhain: Information Scientist, Musician, Philosopher-At-Tall
 It's that moment of dawning comprehension that I live for --Bill Watterson
 My opinions are too rational and insightful to be those of any organization.



RE: Index pdf files with your content in lucene.

2003-11-11 Thread Wilton, Reece
Some of us have corporate firewalls that are stripping out attachments.  If possible, 
put these on a web site somewhere so we can download them.  Thanks!

-Original Message-
From: Ernesto De Santis [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, November 11, 2003 11:07 AM
To: Lucene Users List
Subject: Index pdf files with your content in lucene.

[quoted message snipped; it repeats the original posting above in full]

Re: Document Clustering

2003-11-11 Thread Stefan Groschupf


Marcel Stor wrote:

Stefan Groschupf wrote:
 

Hi,
   

How is document clustering different/related to text categorization?
 

Clustering: try to find own categories and put documents that match
in it. You group all documents with minimal distance together.
   

Would I be correct to say that you have to define a distance threshold
parameter in order to define when to build a new category for a certain
group?
 

I'm not sure. There are different data mining algorithms that could be used; it depends on the algorithm. I prefer support vector machines (SVM). There you calculate distances between multi-dimensional vectors in a multi-dimensional space.
One vector represents one document.
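
Whatever the learning algorithm, the "one vector per document" part usually
boils down to something like this (plain term frequencies and cosine
similarity; just a sketch, tf-idf weighting and feature selection are left
out):

import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class DocumentVectors {

    // Build a term-frequency vector from already tokenized text.
    static Map termFrequencies(String[] tokens) {
        Map vector = new HashMap();
        for (int i = 0; i < tokens.length; i++) {
            Integer count = (Integer) vector.get(tokens[i]);
            vector.put(tokens[i], new Integer(count == null ? 1 : count.intValue() + 1));
        }
        return vector;
    }

    // Cosine similarity between two non-empty term-frequency vectors:
    // 1.0 means the same direction, 0.0 means no terms in common.
    static double cosine(Map a, Map b) {
        double dot = 0.0;
        for (Iterator it = a.keySet().iterator(); it.hasNext();) {
            Object term = it.next();
            Integer fa = (Integer) a.get(term);
            Integer fb = (Integer) b.get(term);
            if (fb != null) {
                dot += fa.intValue() * fb.intValue();
            }
        }
        return dot / (norm(a) * norm(b));
    }

    // Euclidean length of a term-frequency vector.
    static double norm(Map v) {
        double sum = 0.0;
        for (Iterator it = v.values().iterator(); it.hasNext();) {
            int f = ((Integer) it.next()).intValue();
            sum += f * f;
        }
        return Math.sqrt(sum);
    }
}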

Stefan




fuzzy searches

2003-11-11 Thread Thomas Krämer
Hello,

now that the topic is clustering methods: has there been any effort to
implement Latent Semantic Indexing in Lucene? Google only turns up
someone else asking this in February.
Is there an overview of the structure of the Lucene index, apart from
the javadoc, or any other fast way to understand what happens
inside Lucene?

regards

thomas



Re: fuzzy searches

2003-11-11 Thread Gerret Apelt
Thomas Krämer wrote:

Is there an overview of the structure of the Lucene index, apart from
the javadoc, or any other fast way to understand what happens
inside Lucene?

You mean something like this?:

http://jakarta.apache.org/lucene/docs/fileformats.html

cheers,
Gerret


Re: fuzzy searches

2003-11-11 Thread Bruce Ritchie
Thomas Krämer wrote:
now that the topic is clustering methods: has there been any effort in 
implementing Latent semantic indexing in Lucene? Google only indicates 
someone else asking this in february.
Just a note that LSI is encumbered by US patents 4,839,853 and 5,301,109. It would be wise to make
sure that any implementation is either blessed by the patent holders or does not infringe on the 
patents.

Regards,

Bruce Ritchie





Re: fuzzy searches

2003-11-11 Thread Erik Hatcher
On Tuesday, November 11, 2003, at 02:37  PM, Thomas Krämer wrote:
Is there an overview of the structure of the index of lucene despite 
of the javadoc or any other fast access to understanding what happens 
inside lucene?
Here is what is inside a Lucene index:  
http://jakarta.apache.org/lucene/docs/fileformats.html





Re: Document Clustering

2003-11-11 Thread maurits van wijland
Hi All and Marc,

There is the carrot project :
http://www.cs.put.poznan.pl/dweiss/carrot/

The Carrot system consists of web services that can easily be fed by a Lucene
result list. You simply have to create a JSP that produces this XML file and
create a custom process and input component. The input component
for Lucene could look like:

<?xml version="1.0" encoding="UTF-8"?>
<service xmlns="http://www.dawidweiss.com/projects/carrot/componentDescriptor"
         framework="Carrot2">
  <component id="carrot2.input.lucene"
             type="input"
             serviceURL="http://localhost/weblucene/c2.jsp"
             infoURL="http://localhost/weblucene/" />
</service>

The c2.jsp file simply has to translate a result list into an XML file such
as:

<searchresult>
  <document id="1">
    <title>...</title>
    <weight>1.0</weight>
    <url>http://...</url>
    <summary>sum 1</summary>
    <snippet>snip 1</snippet>
  </document>
  <document id="2">
    <title>...</title>
    <weight>1.0</weight>
    <url>http://...</url>
    <summary>sum 2</summary>
    <snippet>snip 2</snippet>
  </document>
</searchresult>

Feed this into the Carrot system, and you will get a nicely clustered
result list. The amazing part of this clustering mechanism is that
the cluster labels are incredible; they're great!
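
Just to illustrate the translation step, a rough sketch of what such a c2.jsp
(or a servlet behind it) could emit from a Lucene Hits object; the stored
"title", "url" and "summary" field names are assumptions:

import java.io.PrintWriter;
import org.apache.lucene.search.Hits;

public class CarrotXml {

    // Write a Lucene result list in the searchresult format shown above.
    // (Real code would escape &, < and > in the field values.)
    static void write(Hits hits, PrintWriter out) throws Exception {
        out.println("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
        out.println("<searchresult>");
        for (int i = 0; i < hits.length(); i++) {
            out.println("  <document id=\"" + (i + 1) + "\">");
            out.println("    <title>" + hits.doc(i).get("title") + "</title>");
            out.println("    <weight>" + hits.score(i) + "</weight>");
            out.println("    <url>" + hits.doc(i).get("url") + "</url>");
            out.println("    <summary>" + hits.doc(i).get("summary") + "</summary>");
            out.println("    <snippet>" + hits.doc(i).get("summary") + "</snippet>");
            out.println("  </document>");
        }
        out.println("</searchresult>");
    }
}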

Then there is an open source project called Classifier4J that can
be used for classification, the opposite of clustering. These other
open source projects are a great addition to the Lucene system.

I hope this helps...

Marc, what are you building?? Maybe we can help!

Kind regards,

Maurits


- Original Message - 
From: marc [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Tuesday, November 11, 2003 5:15 PM
Subject: Document Clustering


Hi,

does anyone have any sample code/documentation available for doing document
based clustering using lucene?

Thanks,
Marc






Re: Document Clustering

2003-11-11 Thread Stefan Groschupf
Really cool stuff!!!

maurits van wijland wrote:

Hi All and Marc,

There is the carrot project:
http://www.cs.put.poznan.pl/dweiss/carrot/

[rest of the quoted message snipped; it repeats Maurits's posting above]

--
day time: www.media-style.com
spare time: www.text-mining.org | www.weta-group.net




Re: Reopen IndexWriter after delete?

2003-11-11 Thread Otis Gospodnetic
 1).  If I delete a term using an IndexReader, can I use an existing
 IndexWriter to write to the index?  Or do I need to close and reopen
 the IndexWriter?

No.  You should close IndexWriter first, then open IndexReader, then
call delete, then close IndexReader, and then open a new IndexWriter.

 2).  Is it safe to call IndexReader.delete(term) while an IndexWriter
 is
 writing?  Or should I be synchronizing these two tasks so only one
 occurs at a time?

No, it is not safe.  You should close the IndexWriter, then delete the
document and close IndexReader, and then get a new IndexWriter and
continue writing.
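
In code, that sequence looks roughly like this (untested sketch; the index
path and the "id" term are placeholders):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class DeleteThenWrite {
    public static void main(String[] args) throws Exception {
        String indexPath = "/path/to/index";
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // 1. Finish with the current writer and close it.
        IndexWriter writer = new IndexWriter(indexPath, analyzer, false);
        // ... writer.addDocument(...) calls ...
        writer.close();

        // 2. Open a reader just for the deletion, then close it.
        IndexReader reader = IndexReader.open(indexPath);
        reader.delete(new Term("id", "42")); // deletes all documents containing this term
        reader.close();

        // 3. Open a fresh writer and continue adding documents.
        writer = new IndexWriter(indexPath, analyzer, false);
        // ... more writer.addDocument(...) calls ...
        writer.close();
    }
}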

Incidentally, I just wrote a section about concurrency issues and about
locking in Lucene for the upcoming Lucene book.

Otis






Re: Index pdf files with your content in lucene.

2003-11-11 Thread Otis Gospodnetic
Ernesto, it looks like something got stripped.  A ZIP file should make
it to the list.  If not, maybe you can post it somewhere.

Could you also tell us a bit about this code?  Is it better than
existing PDF/Word parsing solutions?  Pure Java?  Uses POI?

Thanks,
Otis


--- Ernesto De Santis [EMAIL PROTECTED] wrote:
 [quoted message snipped; it repeats the original posting above in full]

Re: Document Clustering

2003-11-11 Thread petite_abeille
On Nov 11, 2003, at 21:32, maurits van wijland wrote:

There is the carrot project :
http://www.cs.put.poznan.pl/dweiss/carrot/
Leo Galambos, author of the Egothor project, constantly supports us 
with fresh ideas and includes Carrot components in his own project!

http://www.cs.put.poznan.pl/dweiss/carrot/xml/authors.xml?lang=en

Small world :)

PA.



Re: Document Clustering

2003-11-11 Thread Alex Aw Seat Kiong
Hi!

I'm also interested in it. Kindly CC me on the latest progress of your
clustering project.

Regards,
AlexAw


- Original Message - 
From: Eric Jain [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Tuesday, November 11, 2003 10:07 PM
Subject: Re: Document Clustering


  I'm working on it. Classification and Clustering as well.

 Very interesting... if you get something working, please don't forget to
 notify this list :-)

 --
 Eric Jain








Re: Document Clustering

2003-11-11 Thread marc
Thanks everyone for the responses and links to resources.

I was basically thinking of using Lucene to generate document vectors, and
writing my own custom similarity algorithms for measuring distance.

I could then run this data through k-means or SOM algorithms to calculate
the clusters.
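
For what it's worth, a bare-bones k-means over such document vectors could
look like the untested sketch below (vectors assumed to be dense double[] of
equal length; building them and choosing k are separate problems):

import java.util.Random;

public class KMeans {

    // Assign each document vector to the nearest of k centroids and recompute
    // the centroids until the assignment stops changing (or maxIterations).
    static int[] cluster(double[][] docs, int k, int maxIterations) {
        int dim = docs[0].length;
        double[][] centroids = new double[k][];
        Random random = new Random(42);
        for (int i = 0; i < k; i++) {
            centroids[i] = (double[]) docs[random.nextInt(docs.length)].clone();
        }
        int[] assignment = new int[docs.length];
        for (int iteration = 0; iteration < maxIterations; iteration++) {
            boolean changed = false;
            // Assignment step: the nearest centroid wins.
            for (int d = 0; d < docs.length; d++) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (distance(docs[d], centroids[c]) < distance(docs[d], centroids[best])) {
                        best = c;
                    }
                }
                if (assignment[d] != best) {
                    assignment[d] = best;
                    changed = true;
                }
            }
            if (!changed) {
                break;
            }
            // Update step: move each centroid to the mean of its members.
            for (int c = 0; c < k; c++) {
                double[] mean = new double[dim];
                int members = 0;
                for (int d = 0; d < docs.length; d++) {
                    if (assignment[d] == c) {
                        members++;
                        for (int i = 0; i < dim; i++) {
                            mean[i] += docs[d][i];
                        }
                    }
                }
                if (members > 0) {
                    for (int i = 0; i < dim; i++) {
                        mean[i] /= members;
                    }
                    centroids[c] = mean;
                }
            }
        }
        return assignment; // assignment[d] is the cluster id of document d
    }

    // Squared Euclidean distance is enough for deciding which centroid is closer.
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }
}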

Does this sound like I'm on the right track? I'm still just in the
*thinking* stage.

Marc


- Original Message - 
From: Alex Aw Seat Kiong [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Tuesday, November 11, 2003 5:47 PM
Subject: Re: Document Clustering


 [quoted message snipped; it repeats Alex's message above]







Can use Lucene be used for this

2003-11-11 Thread Kumar Mettu
Hi,
 
  I have a huge data file with 4 GB of data. The data in the file never changes.
 
The format of the file is as follows:
 
Col1,col2,col3,Value

abababc,xyzza,c,100
ababadx,xyz,adfdfd,101
 
I need to retrieve the value with simple queries on the data like:
select  value where col1 like %ab, col2 like %aa% and col3 sounds like ;
 
Is Lucene suitable for doing this kind of task? I am currently using a DB for this.
I'm wondering whether Lucene can be used instead.
 
Thanks,
Kumar.



Re: Can use Lucene be used for this

2003-11-11 Thread Erik Hatcher
On Tuesday, November 11, 2003, at 10:00  PM, Kumar Mettu wrote:
The format of the file is as follows:

Col1,col2,col3,Value

abababc,xyzza,c,100
ababadx,xyz,adfdfd,101
I need to retrieve the value with simple queries on the data like:
select  value where col1 like %ab, col2 like %aa% and col3 sounds 
like ;

Is Lucene suitable for doing this kind of tasks? I am using DB 
currently for this. Wondering whether Lucene can be used for this.
It's not a straightforward use of Lucene to emulate that type of query.
The trickiest one is the "sounds like". The FuzzyQuery in Lucene is
close, but not quite a "sounds like". You could use WildcardQuerys for
the like clauses, but they might be better served with more
sophisticated analysis that puts all combinations (a, ab, aba,
abab) as terms.

There are certainly tricks that could be played at either indexing 
analysis or query analysis times that could do what you want.  Would it 
be faster than a fast database with that large of a dataset?  I'm not 
sure.
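
To make that concrete, a programmatic approximation (not a drop-in equivalent
of the SQL; the column field names come from the example, the fuzzy term is a
placeholder, and leading-wildcard queries enumerate terms so they can be slow
on a large index) might be:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.WildcardQuery;

public class ColumnSearch {
    public static void main(String[] args) throws Exception {
        IndexSearcher searcher = new IndexSearcher("/path/to/index");

        // col1 like '%ab'   -> leading wildcard
        // col2 like '%aa%'  -> wildcard on both sides
        // col3 sounds like  -> FuzzyQuery is edit-distance based, only an approximation
        BooleanQuery query = new BooleanQuery();
        query.add(new WildcardQuery(new Term("col1", "*ab")), true, false);
        query.add(new WildcardQuery(new Term("col2", "*aa*")), true, false);
        query.add(new FuzzyQuery(new Term("col3", "someword")), true, false);

        Hits hits = searcher.search(query);
        for (int i = 0; i < hits.length(); i++) {
            System.out.println(hits.doc(i).get("value"));
        }
        searcher.close();
    }
}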

	Erik



Re: Index pdf files with your content in lucene.

2003-11-11 Thread Ernesto De Santis
I'll try again, zipping the files.

Afterwards I will post the files on a web site.

 Could you also tell us a bit about this code?  Is it better than
 existing PDF/Word parsing solutions?  Pure Java?  Uses POI?

This code uses existing parsing solutions.
The intent is to make a Lucene Document for indexing PDF and Word files, with
their content.
It is pure Java.
It uses the TextExtraction library (tm-extractors-0.2.jar),
and uses POI and PDFBox.

Ernesto
Sorry for my bad English.


 Thanks,
 Otis


 --- Ernesto De Santis [EMAIL PROTECTED] wrote:
  [quoted message snipped; it repeats the original posting above in full]