RE: Document Clustering

2003-11-11 Thread Marcel Stor
Stefan Groschupf wrote:
 Hi,
  How is document clustering different/related to text categorization?
 
 Clustering: you try to find the categories yourself and put the
 matching documents into them. You group all documents with minimal
 distance together.

Would I be correct to say that you have to set a distance threshold
parameter in order to decide when to create a new category for a certain
group?

 Classification: you already have categories and sample documents
 for them, which help you match other documents.
 You calculate each document's distance to the existing categories
 and put it in the category with the smallest distance.
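
To make the two descriptions and the threshold question concrete, here is a minimal sketch in Java. It assumes documents are already reduced to equal-length term-weight vectors and uses Euclidean distance; the class and method names are made up for illustration and are not part of Lucene. The clustering part is a simple single-pass "leader" scheme, which is one approach where a distance threshold decides when a new category is opened. Other algorithms (k-means, for example) fix the number of clusters instead of a threshold.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class, not a Lucene API.
class DistanceSketch {

    /** Euclidean distance between two equal-length term-weight vectors. */
    public static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Classification: index of the nearest existing category, or -1 if there are none. */
    public static int classify(double[] doc, List<double[]> categories) {
        int best = -1;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int i = 0; i < categories.size(); i++) {
            double d = distance(doc, categories.get(i));
            if (d < bestDist) {
                bestDist = d;
                best = i;
            }
        }
        return best;
    }

    /**
     * Clustering: a document joins the nearest cluster if its leader is
     * within the threshold; otherwise it becomes the leader of a new cluster.
     * Returns the cluster index assigned to each document.
     */
    public static int[] cluster(List<double[]> docs, double threshold) {
        List<double[]> leaders = new ArrayList<>();
        int[] assignment = new int[docs.size()];
        for (int i = 0; i < docs.size(); i++) {
            double[] doc = docs.get(i);
            int nearest = classify(doc, leaders);
            if (nearest >= 0 && distance(doc, leaders.get(nearest)) <= threshold) {
                assignment[i] = nearest;
            } else {
                leaders.add(doc);
                assignment[i] = leaders.size() - 1;
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        List<double[]> docs = Arrays.asList(
                new double[] {1.0, 0.0},
                new double[] {0.9, 0.1},
                new double[] {0.0, 1.0});
        // The first two vectors are about 0.14 apart, the third is far from both.
        System.out.println(Arrays.toString(cluster(docs, 0.5))); // prints [0, 0, 1]
    }
}
```

With a threshold of 0.5 the first two documents end up in one group and the third opens a new one. A much larger threshold would merge all three, which is why the choice of threshold matters in this kind of scheme.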

Regards,
Marcel


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Index entire filesystem

2003-11-05 Thread Marcel Stor
Hi all,

I'm thinkin' about writing a search tool for my filesystem. I know such
things exist already but programming it myself is much more fun ;-)
So, I would crawl through my filesystem and pass each file to an
appropriate indexer (PDF -> PDFBox, etc.) that feeds Lucene. Yes, I run
a Windows system and would depend on the file extension to distinguish
the file type. Is this a good idea in general? Is there a list of
available indexers for the different file types? Any other comments are
also welcome.
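
The crawl-and-dispatch idea described above could look roughly like the following sketch. It uses java.nio.file.Files.walk, which only exists in much newer Java versions than were current when this was posted, and a made-up FileHandler interface standing in for the per-type indexers. The actual text extraction and the Lucene indexing calls are deliberately left out, and every name here is hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.stream.Stream;

// Hypothetical sketch: walks a directory tree and dispatches by file extension.
class FileCrawlerSketch {

    /** Hypothetical hook: extract text from one file and hand it to the index. */
    interface FileHandler {
        void handle(Path file) throws IOException;
    }

    private final Map<String, FileHandler> handlers = new HashMap<>();

    /** Registers a handler for an extension such as "pdf" (case-insensitive). */
    public void register(String extension, FileHandler handler) {
        handlers.put(extension.toLowerCase(Locale.ROOT), handler);
    }

    /** Lower-cased extension without the dot, or "" if the name has none. */
    public static String extensionOf(Path file) {
        String name = file.getFileName().toString();
        int dot = name.lastIndexOf('.');
        return dot <= 0 ? "" : name.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    /** Visits every regular file under root; returns how many files had a handler. */
    public int crawl(Path root) throws IOException {
        int handled = 0;
        try (Stream<Path> paths = Files.walk(root)) {
            for (Path p : (Iterable<Path>) paths::iterator) {
                if (!Files.isRegularFile(p)) {
                    continue; // skip directories and special files
                }
                FileHandler handler = handlers.get(extensionOf(p));
                if (handler != null) {
                    handler.handle(p);
                    handled++;
                }
            }
        }
        return handled;
    }
}
```

A handler would be registered with something like crawler.register("pdf", file -> ...), where the lambda body does the extraction for that type. Files without a registered extension are simply skipped, so support for new file types can be added one handler at a time.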

Regards,
Marcel





RE: OutOfMemoryException while Indexing an XML file

2003-02-14 Thread Marcel Stor
 -----Original Message-----
 From: Rob Outar [mailto:[EMAIL PROTECTED]]
 Sent: Friday, 14 February 2003 14:13
 To: Lucene Users List
 Subject: OutOfMemoryException while Indexing an XML file

 Hi all,

 I was using the sample code, provided I believe by Doug Cutting, to
 index an XML file. The XML file was 2 MB (kinda large), but while
 adding fields to the Document object I got an OutOfMemoryException.
 I work with XML files a lot and can easily parse that 2 MB file into
 a DOM tree; I can't imagine a Lucene document being larger than a DOM
 tree. Pasted below is the SAX handler.
[...code...]

Try passing -Xmx256M as an argument to java to raise the maximum heap
size.
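
One way to check that the option took effect is to print the limit the JVM reports. The small class below is only an illustration (the name HeapCheck is made up); it could be run as "java -Xmx256M HeapCheck". The reported figure can come out slightly below the requested value, depending on the JVM.

```java
// Hypothetical helper for checking the configured maximum heap size.
class HeapCheck {

    /** Maximum amount of memory the JVM will attempt to use, in MiB. */
    public static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMiB() + " MiB");
    }
}
```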

Marcel

