RE: Document Clustering
Stefan Groschupf wrote:

Hi,

How is document clustering different/related to text categorization?

Clustering: the algorithm tries to find its own categories and puts matching documents into them. You group together all documents that are a minimal distance apart.

Would I be correct to say that you have to define a distance threshold parameter that decides when to start a new category for a certain group?

Classification: you already have categories and sample documents for each, which help you match other documents. You calculate the distance from a document to each existing category and put it in the category with the smallest distance.

Regards,
Marcel

- To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
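To make the distinction above concrete, here is a minimal sketch. It assumes documents are already reduced to numeric term vectors and uses Euclidean distance; both are illustrative choices, not something the thread specifies. The clustering part is a simple single-pass "leader" scheme in which the first document of each cluster acts as its representative. It is only one of several ways to use a distance threshold, and the class and method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

class ClusterVsClassify {
    // Euclidean distance between two term vectors of equal length.
    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Classification: the categories already exist (here as centroids);
    // return the index of the closest one, or -1 if there are none.
    static int classify(double[] doc, List<double[]> centroids) {
        int best = -1;
        double bestDist = Double.MAX_VALUE;
        for (int i = 0; i < centroids.size(); i++) {
            double d = distance(doc, centroids.get(i));
            if (d < bestDist) {
                bestDist = d;
                best = i;
            }
        }
        return best;
    }

    // Clustering: no categories are given. A document joins the nearest
    // cluster if it lies within the threshold, otherwise it starts a new one.
    static List<List<double[]>> cluster(List<double[]> docs, double threshold) {
        List<List<double[]>> clusters = new ArrayList<>();
        for (double[] doc : docs) {
            List<double[]> nearest = null;
            double nearestDist = Double.MAX_VALUE;
            for (List<double[]> c : clusters) {
                double d = distance(doc, c.get(0)); // first member is the representative
                if (d < nearestDist) {
                    nearestDist = d;
                    nearest = c;
                }
            }
            if (nearest != null && nearestDist <= threshold) {
                nearest.add(doc);
            } else {
                List<double[]> fresh = new ArrayList<>();
                fresh.add(doc);
                clusters.add(fresh);
            }
        }
        return clusters;
    }
}
```

In this scheme the threshold directly controls how many categories emerge: a smaller value produces more, tighter clusters. Other clustering methods fix the number of clusters up front instead and need no threshold.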
Index entire filesystem
Hi all,

I'm thinkin' about writing a search tool for my filesystem. I know such things exist already, but programming it myself is much more fun ;-)

So, I would crawl through my filesystem and pass each file to an appropriate indexer (PDF - PDFBox, etc.), which feeds the extracted text to Lucene. Yes, I run a Windows system and would depend on the file extension to distinguish the file type.

Is this a good idea in general? Is there a list of available indexers for the different file types?

Any other comments are also welcome.

Regards,
Marcel
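The crawl-and-dispatch idea described above could look roughly like the following sketch. It uses only the standard java.nio.file API; the FileHandler interface and the handler map are hypothetical placeholders for whatever per-format parser (PDFBox for PDF, and so on) would extract text and add it to the index. No Lucene or PDFBox calls are shown, since those depend on the version in use.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class ExtensionDispatch {
    // Hypothetical hook: one implementation per file type.
    interface FileHandler {
        void handle(Path file) throws IOException;
    }

    // Lower-cased extension without the dot, or "" if the name has none.
    // A leading dot (".bashrc") is not treated as an extension.
    static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot <= 0 || dot == fileName.length() - 1) {
            return "";
        }
        return fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    // Walks the tree under root and hands each regular file to the handler
    // registered for its extension. Returns how many files were handled.
    // Note: Files.walk can fail on directories that are not readable.
    static int crawl(Path root, Map<String, FileHandler> handlers) throws IOException {
        List<Path> files;
        try (Stream<Path> paths = Files.walk(root)) {
            files = paths.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        int handled = 0;
        for (Path file : files) {
            FileHandler handler = handlers.get(extensionOf(file.getFileName().toString()));
            if (handler != null) {
                handler.handle(file);
                handled++;
            }
        }
        return handled;
    }
}
```

Relying on the extension is cheap but can misfire on renamed or extension-less files, so a handler should be prepared for content it cannot parse and skip such files rather than abort the whole crawl.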
RE: OutOfMemoryException while Indexing an XML file
-----Original Message-----
From: Rob Outar [mailto:[EMAIL PROTECTED]]
Sent: Friday, 14 February 2003 14:13
To: Lucene Users List
Subject: OutOfMemoryException while Indexing an XML file

Hi all,

I was using the sample code, provided I believe by Doug Cutting, to index an XML file. The XML file was 2 megs (kinda large), but while adding fields to the Document object I got an OutOfMemoryException. I work with XML files a lot and can easily parse that 2 meg file into a DOM tree; I can't imagine a Lucene document being larger than a DOM tree. Pasted below is the SAX handler.

[...code...]

Try adding -Xmx256M as an argument to the java command to raise the maximum heap size.

Marcel
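To confirm that a heap setting such as -Xmx256M actually took effect, a small check like the one below can help. It relies only on the standard Runtime API; the class name HeapCheck is made up for this example.

```java
class HeapCheck {
    // Maximum heap the JVM will attempt to use, in megabytes.
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMegabytes() + " MB");
    }
}
```

Run it once as `java HeapCheck` and once as `java -Xmx256M HeapCheck` and compare the numbers. Some JVMs report a value slightly below the requested limit. If the indexer still runs out of memory with a larger heap, the handler is probably accumulating data it should be releasing.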