RE: Concurrent searching & re-indexing
Ok, I will change my reindex method to delete all documents and then re-add them all, rather than using an IndexWriter to write a completely new index. Thanks for the help on this everyone.

Paul

-----Original Message-----
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: 17 February 2005 22:26
To: Lucene Users List
Subject: Re: Concurrent searching & re-indexing

Paul Mellor wrote:
> I've read from various sources on the Internet that it is perfectly safe to simultaneously search a Lucene index that is being updated from another Thread, as long as all write access to the index is synchronized. But does this apply only to updating the index (i.e. deleting and adding documents), or to a complete re-indexing (i.e. create a new IndexWriter with the 'create' argument true and then re-add all the documents)?
> [...]
> java.io.IOException: couldn't delete _a.f1
>   at org.apache.lucene.store.FSDirectory.create(FSDirectory.java:166)
> [...]
> This is running on Windows 2000.

On Windows one cannot delete a file while it is still open. So, no, on Windows one cannot remove an index entirely while an IndexReader or Searcher is still open on it, since it is simply impossible to remove all the files in the index.

We might attempt to patch this by keeping a list of such files and attempting to delete them later (as is done when updating an index). But this could cause problems, as a new index would eventually try to use these same file names again, and it would then conflict with the open IndexReader. This is not a problem when updating an existing index, since filenames (except for a few which are not kept open, like "segments") are never reused in the lifetime of an index. So, in order for such a fix to work we would need to switch to globally unique segment names, e.g., long random strings, rather than increasing integers.

In the meantime, the safe way to rebuild an index from scratch while other processes are reading it is simply to delete all of its documents, then start adding new ones.

Doug
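Doug's "delete all documents, then re-add" approach, as a minimal sketch against the Lucene 1.x API. The class name and the makeDocuments() hook are illustrative, not from the original mail:

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;

public abstract class Rebuilder {

    // Supplies the rebuilt documents; hypothetical hook for this example.
    protected abstract Document[] makeDocuments() throws IOException;

    public synchronized void rebuild(String indexPath) throws IOException {
        // Phase 1: delete every existing document. Segment file names are
        // never reused, so readers/searchers still open on Windows stay valid.
        IndexReader reader = IndexReader.open(indexPath);
        for (int i = 0; i < reader.maxDoc(); i++) {
            if (!reader.isDeleted(i)) {
                reader.delete(i);
            }
        }
        reader.close(); // commits the deletions

        // Phase 2: re-add everything with create=false, appending new
        // segments instead of wiping the directory.
        IndexWriter writer = new IndexWriter(indexPath, new StandardAnalyzer(), false);
        Document[] docs = makeDocuments();
        for (int i = 0; i < docs.length; i++) {
            writer.addDocument(docs[i]);
        }
        writer.optimize();
        writer.close();
    }
}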
RE: Concurrent searching & re-indexing
Otis,

Looking at your reply again, I have a couple of questions.

> IndexSearcher (IndexReader, really) does take a snapshot of the index state when it is opened, so at that time the index segments listed in 'segments' should be in a complete state. It also reads index files when searching, of course.

1. If IndexReader takes a snapshot of the index state when opened and then reads the files when searching, what would happen if the files it takes a snapshot of are deleted before the search is performed (as would happen with a reindexing in the period between opening an IndexSearcher and using it to search)?

2. Does a similar potential problem exist when optimising an index, if this combines all the segments into a single file?

Many thanks

Paul

-----Original Message-----
From: Paul Mellor [mailto:[EMAIL PROTECTED]
Sent: 16 February 2005 17:37
To: 'Lucene Users List'
Subject: RE: Concurrent searching & re-indexing

But all write access to the index is synchronized, so that although multiple threads are creating an IndexWriter for the same directory and using it to totally recreate that index, only one thread is doing this at once.

I was concerned about the safety of using an IndexSearcher to perform queries on an index that is in the process of being recreated from scratch, but I guess that if the IndexSearcher takes a snapshot of the index when it is created (and in my code this creation is synchronized with the write operations as well, so that the threads wait for the write operations to finish before instantiating an IndexSearcher, and vice versa) this can't be a problem.

-----Original Message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: 16 February 2005 17:30
To: Lucene Users List
Subject: Re: Concurrent searching & re-indexing

Hi Paul,

If I understand your setup correctly, it looks like you are running multiple threads that create an IndexWriter for the same directory. That's a no-no. This section (first hit) describes the various concurrency issues with regard to adds, updates, optimization, and searches: http://www.lucenebook.com/search?query=concurrent

IndexSearcher (IndexReader, really) does take a snapshot of the index state when it is opened, so at that time the index segments listed in 'segments' should be in a complete state. It also reads index files when searching, of course.

Otis

--- Paul Mellor [EMAIL PROTECTED] wrote:

Hi,

I've read from various sources on the Internet that it is perfectly safe to simultaneously search a Lucene index that is being updated from another Thread, as long as all write access to the index is synchronized. But does this apply only to updating the index (i.e. deleting and adding documents), or to a complete re-indexing (i.e. create a new IndexWriter with the 'create' argument true and then re-add all the documents)?

I have a class which encapsulates all access to my index, so that writes can be synchronized. This class also exposes a method to obtain an IndexSearcher for the index. I'm running unit tests to test this which create many threads - each thread does a complete re-indexing and then obtains an IndexSearcher and does a query.

I'm finding that with sufficiently high numbers of threads, I'm getting the occasional failure, with the following exception thrown when attempting to construct a new IndexWriter (during the reindexing):

java.io.IOException: couldn't delete _a.f1
  at org.apache.lucene.store.FSDirectory.create(FSDirectory.java:166)
  at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:135)
  at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:113)
  at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:151)
  ...

The exception occurs quite infrequently (usually for somewhere between 1-5% of the threads).

Does the IndexSearcher take a 'snapshot' of the index at creation? Or does it access the filesystem whilst searching? I am also synchronizing creation of the IndexSearcher with the write lock, so that the IndexSearcher is not created whilst the index is being recreated (and vice versa). But do I need to ensure that the IndexSearcher cannot search whilst the index is being recreated as well?

Note that a similar unit test where the threads update the index (rather than recreate it from scratch) works fine, as expected.

This is running on Windows 2000.

Any help would be much appreciated!

Paul
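Paul's described setup, as a hedged sketch: one gateway class whose writes and searcher creation share the same lock, so a searcher is never opened mid-write. The class and method names are illustrative, not Paul's actual code:

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;

public class IndexGateway {
    private final String path;

    public IndexGateway(String path) { this.path = path; }

    // All writes go through here, so only one thread writes at a time.
    public synchronized void addDocuments(Document[] docs) throws IOException {
        IndexWriter writer = new IndexWriter(path, new StandardAnalyzer(), false);
        for (int i = 0; i < docs.length; i++) {
            writer.addDocument(docs[i]);
        }
        writer.close();
    }

    // Same lock, so the searcher always sees a complete index state;
    // once opened, it holds that snapshot until closed.
    public synchronized IndexSearcher getSearcher() throws IOException {
        return new IndexSearcher(path);
    }
}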
RE: Concurrent searching & re-indexing
Paul Mellor writes:

> 1. If IndexReader takes a snapshot of the index state when opened and then reads the files when searching, what would happen if the files it takes a snapshot of are deleted before the search is performed (as would happen with a reindexing in the period between opening an IndexSearcher and using it to search)?

On Unix, open files are still there even if they are deleted (that is, there is no link (filename) to the file anymore, but the file's content still exists). On Windows you cannot delete open files, so Lucene AFAIK (I don't use Windows) postpones the deletion to a time when the file is closed.

> 2. Does a similar potential problem exist when optimising an index, if this combines all the segments into a single file?

AFAIK optimising creates new files. The only problem that might occur is opening a reader during an index change, but that's handled by a lock.

HTH

Morus
Re: Concurrent searching & re-indexing
Hi Paul,

If I understand your setup correctly, it looks like you are running multiple threads that create an IndexWriter for the same directory. That's a no-no. This section (first hit) describes the various concurrency issues with regard to adds, updates, optimization, and searches: http://www.lucenebook.com/search?query=concurrent

IndexSearcher (IndexReader, really) does take a snapshot of the index state when it is opened, so at that time the index segments listed in 'segments' should be in a complete state. It also reads index files when searching, of course.

Otis

--- Paul Mellor [EMAIL PROTECTED] wrote:

Hi,

I've read from various sources on the Internet that it is perfectly safe to simultaneously search a Lucene index that is being updated from another Thread, as long as all write access to the index is synchronized. But does this apply only to updating the index (i.e. deleting and adding documents), or to a complete re-indexing (i.e. create a new IndexWriter with the 'create' argument true and then re-add all the documents)?

I have a class which encapsulates all access to my index, so that writes can be synchronized. This class also exposes a method to obtain an IndexSearcher for the index. I'm running unit tests to test this which create many threads - each thread does a complete re-indexing and then obtains an IndexSearcher and does a query.

I'm finding that with sufficiently high numbers of threads, I'm getting the occasional failure, with the following exception thrown when attempting to construct a new IndexWriter (during the reindexing):

java.io.IOException: couldn't delete _a.f1
  at org.apache.lucene.store.FSDirectory.create(FSDirectory.java:166)
  at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:135)
  at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:113)
  at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:151)
  ...

The exception occurs quite infrequently (usually for somewhere between 1-5% of the threads).

Does the IndexSearcher take a 'snapshot' of the index at creation? Or does it access the filesystem whilst searching? I am also synchronizing creation of the IndexSearcher with the write lock, so that the IndexSearcher is not created whilst the index is being recreated (and vice versa). But do I need to ensure that the IndexSearcher cannot search whilst the index is being recreated as well?

Note that a similar unit test where the threads update the index (rather than recreate it from scratch) works fine, as expected.

This is running on Windows 2000.

Any help would be much appreciated!

Paul
RE: Concurrent searching & re-indexing
But all write access to the index is synchronized, so that although multiple threads are creating an IndexWriter for the same directory and using it to totally recreate that index, only one thread is doing this at once.

I was concerned about the safety of using an IndexSearcher to perform queries on an index that is in the process of being recreated from scratch, but I guess that if the IndexSearcher takes a snapshot of the index when it is created (and in my code this creation is synchronized with the write operations as well, so that the threads wait for the write operations to finish before instantiating an IndexSearcher, and vice versa) this can't be a problem.

-----Original Message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: 16 February 2005 17:30
To: Lucene Users List
Subject: Re: Concurrent searching & re-indexing

Hi Paul,

If I understand your setup correctly, it looks like you are running multiple threads that create an IndexWriter for the same directory. That's a no-no. This section (first hit) describes the various concurrency issues with regard to adds, updates, optimization, and searches: http://www.lucenebook.com/search?query=concurrent

IndexSearcher (IndexReader, really) does take a snapshot of the index state when it is opened, so at that time the index segments listed in 'segments' should be in a complete state. It also reads index files when searching, of course.

Otis

--- Paul Mellor [EMAIL PROTECTED] wrote:

Hi,

I've read from various sources on the Internet that it is perfectly safe to simultaneously search a Lucene index that is being updated from another Thread, as long as all write access to the index is synchronized. But does this apply only to updating the index (i.e. deleting and adding documents), or to a complete re-indexing (i.e. create a new IndexWriter with the 'create' argument true and then re-add all the documents)?

I have a class which encapsulates all access to my index, so that writes can be synchronized. This class also exposes a method to obtain an IndexSearcher for the index. I'm running unit tests to test this which create many threads - each thread does a complete re-indexing and then obtains an IndexSearcher and does a query.

I'm finding that with sufficiently high numbers of threads, I'm getting the occasional failure, with the following exception thrown when attempting to construct a new IndexWriter (during the reindexing):

java.io.IOException: couldn't delete _a.f1
  at org.apache.lucene.store.FSDirectory.create(FSDirectory.java:166)
  at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:135)
  at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:113)
  at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:151)
  ...

The exception occurs quite infrequently (usually for somewhere between 1-5% of the threads).

Does the IndexSearcher take a 'snapshot' of the index at creation? Or does it access the filesystem whilst searching? I am also synchronizing creation of the IndexSearcher with the write lock, so that the IndexSearcher is not created whilst the index is being recreated (and vice versa). But do I need to ensure that the IndexSearcher cannot search whilst the index is being recreated as well?

Note that a similar unit test where the threads update the index (rather than recreate it from scratch) works fine, as expected.

This is running on Windows 2000.

Any help would be much appreciated!

Paul
Re: Re-Indexing a moving target???
details?

Yousef Ourabi wrote:

> Saad, Here is what I got. I will post again, and be more specific.
> -Y

--- Nader Henein [EMAIL PROTECTED] wrote:

We'll need a little more detail to help you - what are the sizes of your updates and how often are they updated?

1) No, just re-open the index writer every time you re-index. Since, according to you, it's a moderately changing index, just keep a flag on the rows and batch-index every so often.

2) It all comes down to your needs; more detail would help us help you.

Nader Henein

Yousef Ourabi wrote:

Hey, We are using Lucene to index a moderately changing database, and I have a couple of questions on a performance strategy.

1) Should we just have one index writer open until the system comes down, or create a new index writer each time we re-index our data-set?

2) Does anyone have any thoughts on multi-threading and segments instead of one index?

Thanks for your time and help.

Best,
Yousef

--
Nader S. Henein
Senior Applications Developer
Bayt.com
Re: Re-Indexing a moving target???
Saad,

Here is what I got. I will post again, and be more specific.

-Y

--- Nader Henein [EMAIL PROTECTED] wrote:

We'll need a little more detail to help you - what are the sizes of your updates and how often are they updated?

1) No, just re-open the index writer every time you re-index. Since, according to you, it's a moderately changing index, just keep a flag on the rows and batch-index every so often.

2) It all comes down to your needs; more detail would help us help you.

Nader Henein

Yousef Ourabi wrote:

Hey, We are using Lucene to index a moderately changing database, and I have a couple of questions on a performance strategy.

1) Should we just have one index writer open until the system comes down, or create a new index writer each time we re-index our data-set?

2) Does anyone have any thoughts on multi-threading and segments instead of one index?

Thanks for your time and help.

Best,
Yousef

--
Nader S. Henein
Senior Applications Developer
Bayt.com
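Nader's "flag the rows and batch-index every so often" suggestion, as a hedged sketch over JDBC. The table and column names ("docs", "id", "body", "dirty") are invented for the example, and since Lucene has no in-place update, changed rows are deleted from the index before being re-added:

import java.io.IOException;
import java.sql.*;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class BatchIndexer {
    public static void indexDirtyRows(Connection conn, String indexPath)
            throws IOException, SQLException {
        // remove any stale copies of the changed rows
        IndexReader reader = IndexReader.open(indexPath);
        Statement st = conn.createStatement();
        ResultSet rs = st.executeQuery("SELECT id FROM docs WHERE dirty = 1");
        while (rs.next()) {
            reader.delete(new Term("id", rs.getString(1)));
        }
        reader.close();

        // re-add them with a writer opened fresh for this batch
        IndexWriter writer = new IndexWriter(indexPath, new StandardAnalyzer(), false);
        rs = st.executeQuery("SELECT id, body FROM docs WHERE dirty = 1");
        while (rs.next()) {
            Document doc = new Document();
            doc.add(Field.Keyword("id", rs.getString(1)));
            doc.add(Field.UnStored("body", rs.getString(2)));
            writer.addDocument(doc);
        }
        writer.close();

        // a real version would clear only the ids it actually indexed
        st.executeUpdate("UPDATE docs SET dirty = 0 WHERE dirty = 1");
        st.close();
    }
}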
Re: Indexing flat files without .txt extension
Hi Erik,

Thanks for the pointers. I have modified the Indexer.java to index the files from the directory by removing the file extension check for .txt. Now I do get the index from the files. The new situation is that when I run the file search:

java org.apache.lucene.demo.SearchFiles
Query: tty
Searching for: tty
3 total matching documents
0. No path nor URL for this document
1. No path nor URL for this document
2. No path nor URL for this document

I do not get the actual path from the index, and using Luke I get the three hits. The last two are from the index and not the real documents. Any idea what is happening and how I can fix it?

Thanks.
-H

Erik Hatcher wrote:

On Jan 10, 2005, at 7:06 PM, Hetan Shah wrote:
> Got the latest Ant and got the demo to work. I am however not sure in which part of the whole source code the indexing for different file types is done, say for example .html, .txt and such?

Your best bet is to dig around in the codebase. The Indexer.java code is hard-coded to only do .txt file extensions - this was on purpose as the first example in the book, figuring someone using this code on their C:\ drive would be relatively safe and fast to run.

There is also an example, easily run from the Ant launcher, to show how various document types can be handled using an extensible framework. Run "ant ExtensionFileHandler". It doesn't actually index the document it creates, but displays it to the console. It would be pretty trivial to pair the Indexer.java code up with the file handler framework to crawl a directory tree and index any content it recognizes.

> Appreciate your help. If you have any sample code I would certainly appreciate that also.

You got all the code already. It should be fairly straightforward to navigate the src tree, especially with the Table of Contents handy: http://www.lucenebook.com/toc (incidentally, this dynamic TOC page is blending the blog content with the TOC, using an IndexReader to find all blog entries that refer to each section - and you'll see the two, minor and cosmetic, errata listed there already).

Erik
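The "No path nor URL for this document" lines come from the demo's SearchFiles, which prints each hit's stored "path" (or "url") field; if the modified indexing code never stored one, there is nothing to print. A hedged sketch of building a document with a stored path, loosely mirroring what the demo's FileDocument does for .txt files:

import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class PlainFileDocument {
    public static Document document(File f) throws FileNotFoundException {
        Document doc = new Document();
        doc.add(Field.Keyword("path", f.getPath()));        // stored, so SearchFiles can print it
        doc.add(Field.Text("contents", new FileReader(f))); // tokenized from a Reader, not stored
        return doc;
    }
}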
Re: Indexing flat files without .txt extension
Hi Erik,

Got the latest Ant and got the demo to work. I am however not sure in which part of the whole source code the indexing for different file types is done, say for example .html, .txt and such? From there I can derive how I can index a plain text file which does not have any extension.

Appreciate your help. If you have any sample code I would certainly appreciate that also.

-H.

Erik Hatcher wrote:

On Jan 6, 2005, at 6:49 PM, Hetan Shah wrote:
> Hi Erik, I got the source downloaded and unpacked. I am having difficulty in building any of the modules. Maybe something's wrong with my Ant installation.
>
> LuceneInAction% ant test
> Buildfile: build.xml
> BUILD FAILED
> file:/home/hs152827/LuceneInAction/build.xml:12: Unexpected element "available"

The good ol' README says this:

R E Q U I R E M E N T S

* JDK 1.4+
* Ant 1.6+ (to run the automated examples)
* JUnit 3.8.1+ - junit.jar should be in ANT_HOME/lib

You are not running Ant 1.6, I'm sure. Upgrade your version of Ant, and of course follow the rest of the README, and all should be well.

Erik
Re: Indexing flat files without .txt extension
On Jan 10, 2005, at 7:06 PM, Hetan Shah wrote:
> Got the latest Ant and got the demo to work. I am however not sure in which part of the whole source code the indexing for different file types is done, say for example .html, .txt and such?

Your best bet is to dig around in the codebase. The Indexer.java code is hard-coded to only do .txt file extensions - this was on purpose as the first example in the book, figuring someone using this code on their C:\ drive would be relatively safe and fast to run.

There is also an example, easily run from the Ant launcher, to show how various document types can be handled using an extensible framework. Run "ant ExtensionFileHandler". It doesn't actually index the document it creates, but displays it to the console. It would be pretty trivial to pair the Indexer.java code up with the file handler framework to crawl a directory tree and index any content it recognizes.

> Appreciate your help. If you have any sample code I would certainly appreciate that also.

You got all the code already. It should be fairly straightforward to navigate the src tree, especially with the Table of Contents handy: http://www.lucenebook.com/toc (incidentally, this dynamic TOC page is blending the blog content with the TOC, using an IndexReader to find all blog entries that refer to each section - and you'll see the two, minor and cosmetic, errata listed there already).

Erik
Re: Indexing flat files without .txt extension
Hi Erik,

I got the source downloaded and unpacked. I am having difficulty in building any of the modules. Maybe something's wrong with my Ant installation.

LuceneInAction% ant test
Buildfile: build.xml
BUILD FAILED
file:/home/hs152827/LuceneInAction/build.xml:12: Unexpected element "available"
Total time: 5 seconds

LuceneInAction% ant Indexer
Buildfile: build.xml
BUILD FAILED
file:/home/hs152827/LuceneInAction/build.xml:12: Unexpected element "available"
Total time: 5 seconds

Can you point me to the proper module for creating my own indexer? I tried looking into the indexing module but was not sure.

TIA,
-H

Erik Hatcher wrote:

On Jan 5, 2005, at 6:31 PM, Hetan Shah wrote:
> How can one index simple text files without the .txt extension? I am trying to use the IndexFiles and IndexHTML but not to my satisfaction. In IndexFiles I do not get any control over the content of the file, and in the case of IndexHTML the files without any extension do not get indexed at all. Any pointers are really appreciated.

Try out the Indexer code from Lucene in Action. You can download it from the link here: http://www.lucenebook.com/blog/announcements/sourcecode.html

It'll be cleaner to follow and borrow from. The code that ships with Lucene is for demonstration purposes. It surprises me how often folks use that code to build real indexes. It's quite straightforward to create your own Java code to do the indexing in whatever manner you like, borrowing from examples.

When you get the download unpacked, simply run "ant Indexer" to see it in action. And then "ant Searcher" to search the index just built.

Erik
Re: Indexing flat files without .txt extension
On Jan 6, 2005, at 6:49 PM, Hetan Shah wrote:
> Hi Erik, I got the source downloaded and unpacked. I am having difficulty in building any of the modules. Maybe something's wrong with my Ant installation.
>
> LuceneInAction% ant test
> Buildfile: build.xml
> BUILD FAILED
> file:/home/hs152827/LuceneInAction/build.xml:12: Unexpected element "available"

The good ol' README says this:

R E Q U I R E M E N T S

* JDK 1.4+
* Ant 1.6+ (to run the automated examples)
* JUnit 3.8.1+ - junit.jar should be in ANT_HOME/lib

You are not running Ant 1.6, I'm sure. Upgrade your version of Ant, and of course follow the rest of the README, and all should be well.

Erik
Re: Indexing flat files without .txt extension
On Jan 5, 2005, at 6:31 PM, Hetan Shah wrote:
> How can one index simple text files without the .txt extension? I am trying to use the IndexFiles and IndexHTML but not to my satisfaction. In IndexFiles I do not get any control over the content of the file, and in the case of IndexHTML the files without any extension do not get indexed at all. Any pointers are really appreciated.

Try out the Indexer code from Lucene in Action. You can download it from the link here: http://www.lucenebook.com/blog/announcements/sourcecode.html

It'll be cleaner to follow and borrow from. The code that ships with Lucene is for demonstration purposes. It surprises me how often folks use that code to build real indexes. It's quite straightforward to create your own Java code to do the indexing in whatever manner you like, borrowing from examples.

When you get the download unpacked, simply run "ant Indexer" to see it in action. And then "ant Searcher" to search the index just built.

Erik
Re: Indexing terms only
Whether or not the text is stored in the index is a different concern than how it is analyzed. If you want the text to be indexed and not stored, then use the Field.Text(String, String) method or the appropriate constructor when adding a field to the Document. You'll need to also store a reference to the actual file (URL, path, etc.) in the document so it can be retrieved from the doc returned in the Hits object.

Or did I completely misunderstand the question?

-Mike

On Wed, 22 Dec 2004 17:23:24 +0100, DES [EMAIL PROTECTED] wrote:
> hi
> I need to index my text so that the index contains only tokenized, stemmed words, without stopwords etc. The text is German, so I tried to use GermanAnalyzer, but it stores the whole text, not terms. Please give me a tip on how to index terms only. Thanks!
> DES
Re: Indexing terms only
I actually use Field.Text(String, String) to add documents to my index. Maybe I do not understand the way an analyzer works, but I thought that all German articles (der, die, das etc.) should be filtered out. However, if I use Luke to view my index, the original text is completely stored in a field. And what I need is a term vector that I can create from an indexed document field. So this field should contain terms only.

> Whether or not the text is stored in the index is a different concern than how it is analyzed. If you want the text to be indexed and not stored, then use the Field.Text(String, String) method or the appropriate constructor when adding a field to the Document. You'll need to also store a reference to the actual file (URL, path, etc.) in the document so it can be retrieved from the doc returned in the Hits object.
>
> Or did I completely misunderstand the question?
>
> -Mike

On Wed, 22 Dec 2004 17:23:24 +0100, DES [EMAIL PROTECTED] wrote:
> hi
> I need to index my text so that the index contains only tokenized, stemmed words, without stopwords etc. The text is German, so I tried to use GermanAnalyzer, but it stores the whole text, not terms. Please give me a tip on how to index terms only. Thanks!
> DES
Re: Indexing terms only
On Dec 22, 2004, at 11:36 AM, Mike Snare wrote:
> Whether or not the text is stored in the index is a different concern than how it is analyzed. If you want the text to be indexed, and not stored, then use the Field.Text(String, String) method

Correction: Field.Text(String, String) is a stored field. If you want unstored, use Field.UnStored(String, String). This is a bit confusing because Field.Text(String, Reader) is not stored. This confusion has been cleared up in the CVS version of Lucene: these methods will be deprecated in the 1.9 release and removed in the 2.0 release.

Erik
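Erik's correction, recapped as a hedged sketch against the Lucene 1.4 API; the field names and sample path are invented:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class FieldChoices {
    public static Document termsOnly(String path, String germanText) {
        Document doc = new Document();
        doc.add(Field.Keyword("path", path));            // stored + indexed, untokenized reference
        doc.add(Field.UnStored("contents", germanText)); // analyzed + indexed, never stored
        // Field.Text("contents", germanText) would additionally store the
        // original text verbatim, which is why the full text shows up in Luke.
        return doc;
    }
}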
Re: Indexing terms only
I've never used the GermanAnalyzer, so I don't know what stop words it defines/uses. Someone else will have to answer that. Sorry.

On Wed, 22 Dec 2004 17:45:17 +0100, DES [EMAIL PROTECTED] wrote:
> I actually use Field.Text(String, String) to add documents to my index. Maybe I do not understand the way an analyzer works, but I thought that all German articles (der, die, das etc.) should be filtered out. However, if I use Luke to view my index, the original text is completely stored in a field. And what I need is a term vector that I can create from an indexed document field. So this field should contain terms only.

> > Whether or not the text is stored in the index is a different concern than how it is analyzed. If you want the text to be indexed and not stored, then use the Field.Text(String, String) method or the appropriate constructor when adding a field to the Document. You'll need to also store a reference to the actual file (URL, path, etc.) in the document so it can be retrieved from the doc returned in the Hits object.
> >
> > Or did I completely misunderstand the question?
> >
> > -Mike

On Wed, 22 Dec 2004 17:23:24 +0100, DES [EMAIL PROTECTED] wrote:
> hi
> I need to index my text so that the index contains only tokenized, stemmed words, without stopwords etc. The text is German, so I tried to use GermanAnalyzer, but it stores the whole text, not terms. Please give me a tip on how to index terms only. Thanks!
> DES
Re: Indexing terms only
Thanks for correcting me. I use the Reader version -- hence my confusion.

-Mike

On Wed, 22 Dec 2004 11:53:31 -0500, Erik Hatcher [EMAIL PROTECTED] wrote:
> On Dec 22, 2004, at 11:36 AM, Mike Snare wrote:
> > Whether or not the text is stored in the index is a different concern than how it is analyzed. If you want the text to be indexed, and not stored, then use the Field.Text(String, String) method
>
> Correction: Field.Text(String, String) is a stored field. If you want unstored, use Field.UnStored(String, String). This is a bit confusing because Field.Text(String, Reader) is not stored. This confusion has been cleared up in the CVS version of Lucene: these methods will be deprecated in the 1.9 release and removed in the 2.0 release.
>
> Erik
Re: Indexing with Lucene 1.4.3
That looks right to me, assuming you have done an optimize. All of your index segments are merged into the one .cfs file (which is large, right?). Try searching -- it should work.

Chuck is right; the index looks fine and will be searchable. Since Lucene version 1.4, the index is stored by default using the compound file format. The index files you are missing are merged into one compound file, which has the extension .cfs. You can disable the compound file option using IndexWriter's setUseCompoundFile(false).

Bernhard

-----Original Message-----
From: Hetan Shah [mailto:[EMAIL PROTECTED]
Sent: Thursday, December 16, 2004 11:00 AM
To: Lucene Users List
Subject: Indexing with Lucene 1.4.3

Hello,

I have been trying to index around 6000 documents using IndexHTML from 1.4.3, and at the end of indexing my index directory contains only 3 files: segments, deletable and _5en.cfs.

Can someone tell me what is going on and where the actual index files are? How can I resolve this issue?

Thanks.
-H
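Bernhard's setUseCompoundFile(false) in context, as a hedged sketch against the Lucene 1.4 API; the path and analyzer are arbitrary choices for the example:

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class MultiFileIndex {
    public static IndexWriter open(String path) throws IOException {
        IndexWriter writer = new IndexWriter(path, new StandardAnalyzer(), true);
        // false: write the classic multi-file format (.fnm, .frq, .prx, ...)
        // instead of merging each segment into a single .cfs file
        writer.setUseCompoundFile(false);
        return writer;
    }
}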
Re: Indexing with Lucene 1.4.3
Thanks Chuck, I now understand why I see only one file. Another question: do I have to specify somewhere in my code, or in some configuration setting, that I would now be using the compound file format (.cfs file) for the index?

I have an application that was working in version 1.3-final till I moved to 1.4.3; now I do not get any results back from my searches. I tried using Luke and it shows me the content of the index. I can search using Luke, but no success so far with my own application. Any pointers?

Thanks.
-H

Chuck Williams wrote:

That looks right to me, assuming you have done an optimize. All of your index segments are merged into the one .cfs file (which is large, right?). Try searching -- it should work.

Chuck

-----Original Message-----
From: Hetan Shah [mailto:[EMAIL PROTECTED]
Sent: Thursday, December 16, 2004 11:00 AM
To: Lucene Users List
Subject: Indexing with Lucene 1.4.3

Hello,

I have been trying to index around 6000 documents using IndexHTML from 1.4.3, and at the end of indexing my index directory contains only 3 files: segments, deletable and _5en.cfs.

Can someone tell me what is going on and where the actual index files are? How can I resolve this issue?

Thanks.
-H
Re: Indexing with Lucene 1.4.3
The only place where you have to specify that you are using the compound index format is on the IndexWriter instance. Nothing needs to be done at search time on the IndexSearcher.

Otis

--- Hetan Shah [EMAIL PROTECTED] wrote:

Thanks Chuck, I now understand why I see only one file. Another question: do I have to specify somewhere in my code, or in some configuration setting, that I would now be using the compound file format (.cfs file) for the index?

I have an application that was working in version 1.3-final till I moved to 1.4.3; now I do not get any results back from my searches. I tried using Luke and it shows me the content of the index. I can search using Luke, but no success so far with my own application. Any pointers?

Thanks.
-H

Chuck Williams wrote:

That looks right to me, assuming you have done an optimize. All of your index segments are merged into the one .cfs file (which is large, right?). Try searching -- it should work.

Chuck

-----Original Message-----
From: Hetan Shah [mailto:[EMAIL PROTECTED]
Sent: Thursday, December 16, 2004 11:00 AM
To: Lucene Users List
Subject: Indexing with Lucene 1.4.3

Hello,

I have been trying to index around 6000 documents using IndexHTML from 1.4.3, and at the end of indexing my index directory contains only 3 files: segments, deletable and _5en.cfs.

Can someone tell me what is going on and where the actual index files are? How can I resolve this issue?

Thanks.
-H
RE: Indexing a large number of DB records
There were other reasons for my choice of going with a temp index - namely, I was having terrible write times to my live index as it was stored on a different server, and also, while I was writing to my live index, people were trying to search on it and were getting file-not-found exceptions. So rather than spend hours or days trying to fix it, I took the easiest route by creating a temp index on the server which had the application and merging to the server with the live index. This greatly increased my indexing speed.

Best of luck
Garrett

-----Original Message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: 15 December 2004 18:43
To: Lucene Users List
Subject: RE: Indexing a large number of DB records

Note that this really includes some extra steps. You don't need a temp index. Add everything to a single index using a single IndexWriter instance. No need to call addIndexes nor optimize until the end. Adding Documents to an index takes a constant amount of time, regardless of the index size, because new segments are created as documents are added, and existing segments don't need to be updated (only when merges happen).

Again, I'd run your app under a profiler to see where the time and memory are going.

Otis

--- Garrett Heaver [EMAIL PROTECTED] wrote:

Hi Homam

I had a similar problem to yours in that I was indexing A LOT of data. Essentially how I got round it was to batch the index. What I was doing was to add 10,000 documents to a temporary index, use addIndexes() to merge the temporary index into the live index (which also optimizes the live index), then delete the temporary index. On the next loop I'd only query rows from the db above the id in the maxdoc of the live index and set the max rows of the query to 10,000, i.e.

SELECT TOP 10000 [fields] FROM [tables] WHERE [id_field] > {ID from Index.MaxDoc()} ORDER BY [id_field] ASC

Ensuring that the documents go into the index sequentially, your problem is solved, and memory usage on mine (dotlucene 1.3) is low.

Regards
Garrett

-----Original Message-----
From: Homam S.A. [mailto:[EMAIL PROTECTED]
Sent: 15 December 2004 02:43
To: Lucene Users List
Subject: Indexing a large number of DB records

I'm trying to index a large number of records from the DB (a few millions). Each record will be stored as a document with about 30 fields; most of them are UnStored and represent small strings or numbers. No huge DB Text fields.

But I'm running out of memory very fast, and the indexing is slowing down to a crawl once I hit around 1500 records. The problem is each document is holding references to the string objects returned from ToString() on the DB field, and the IndexWriter is holding references to all these document objects in memory, so the garbage collector isn't getting a chance to clean these up.

How do you guys go about indexing a large DB table? Here's a snippet of my code (this method is called for each record in the DB):

private void IndexRow(SqlDataReader rdr, IndexWriter iw) {
    Document doc = new Document();
    for (int i = 0; i < BrowseFieldNames.Length; i++) {
        doc.Add(Field.UnStored(BrowseFieldNames[i], rdr.GetValue(i).ToString()));
    }
    iw.AddDocument(doc);
}
RE: Indexing with Lucene 1.4.3
That looks right to me, assuming you have done an optimize. All of your index segments are merged into the one .cfs file (which is large, right?). Try searching -- it should work.

Chuck

-----Original Message-----
From: Hetan Shah [mailto:[EMAIL PROTECTED]
Sent: Thursday, December 16, 2004 11:00 AM
To: Lucene Users List
Subject: Indexing with Lucene 1.4.3

Hello,

I have been trying to index around 6000 documents using IndexHTML from 1.4.3, and at the end of indexing my index directory contains only 3 files: segments, deletable and _5en.cfs.

Can someone tell me what is going on and where the actual index files are? How can I resolve this issue?

Thanks.
-H
RE: Indexing with Lucene 1.4.3
Hi there,

Apologies. If you are using the IndexHTML from the demo JAR package which is available in the Lucene 1.4.3 zip, then you had better look at the file extensions of your files; they may be filtered out of the indexing process due to this code present in IndexHTML.java:

} else if (file.getPath().endsWith(".html") || // index .html files
           file.getPath().endsWith(".htm") ||  // index .htm files
           file.getPath().endsWith(".txt")) {  // index .txt files

If the extensions you have are within these endsWith options, then you have successfully indexed your 6000 documents. Try the Luke monitoring software available from the Jakarta Lucene web site and check for the same. [Hint: try using the SearchFiles class from the Lucene 1.4.3 zip to search the documents you have indexed successfully.]

with regards
Karthik

-----Original Message-----
From: Hetan Shah [mailto:[EMAIL PROTECTED]
Sent: Friday, December 17, 2004 12:30 AM
To: Lucene Users List
Subject: Indexing with Lucene 1.4.3

Hello,

I have been trying to index around 6000 documents using IndexHTML from 1.4.3, and at the end of indexing my index directory contains only 3 files: segments, deletable and _5en.cfs.

Can someone tell me what is going on and where the actual index files are? How can I resolve this issue?

Thanks.
-H
Re: Indexing a large number of DB records
Hello Homam,

The batches I was referring to were batches of DB rows. Instead of "SELECT * FROM table ...", do "SELECT * FROM table ... OFFSET=X LIMIT=Y". Don't close the IndexWriter - use the single instance. There is no MakeStable()-like method in Lucene, but you can control the number of in-memory Documents, the frequency of segment merges, and the maximal size of index segments with 3 IndexWriter parameters, described fairly verbosely in the javadocs.

Since you are using the .NET version, you should really consult the dotLucene guy(s). Running under the profiler should also tell you where the time and memory go.

Otis

--- Homam S.A. [EMAIL PROTECTED] wrote:

Thanks Otis!

What do you mean by building it in batches? Does it mean I should close the IndexWriter every 1000 rows and reopen it? Does that release references to the document objects so that they can be garbage-collected?

I'm calling optimize() only at the end.

I agree that 1500 documents is very small. I'm building the index on a PC with 512 megs, and the indexing process is quickly gobbling up around 400 megs when I index around 1800 documents, and the whole machine is grinding to a virtual halt. I'm using the latest dotLucene .NET port, so maybe there's a memory leak in it.

I have experience with AltaVista search (acquired by FastSearch), and I used to call MakeStable() every 20,000 documents to flush memory structures to disk. There doesn't seem to be an equivalent in Lucene.

-- Homam

--- Otis Gospodnetic [EMAIL PROTECTED] wrote:

Hello,

There are a few things you can do:

1) Don't just pull all rows from the DB at once. Do that in batches.
2) If you can get a Reader from your SqlDataReader, consider this: http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/document/Field.html#Text(java.lang.String,%20java.io.Reader)
3) Give the JVM more memory to play with by using the -Xms and -Xmx JVM parameters.
4) See IndexWriter's minMergeDocs parameter.
5) Are you calling optimize() at some point by any chance? Leave that call for the end.

1500 documents with 30 columns of short String/number values is not a lot. You may be doing something else, not Lucene related, that's slowing things down.

Otis

--- Homam S.A. [EMAIL PROTECTED] wrote:

I'm trying to index a large number of records from the DB (a few millions). Each record will be stored as a document with about 30 fields; most of them are UnStored and represent small strings or numbers. No huge DB Text fields.

But I'm running out of memory very fast, and the indexing is slowing down to a crawl once I hit around 1500 records. The problem is each document is holding references to the string objects returned from ToString() on the DB field, and the IndexWriter is holding references to all these document objects in memory, so the garbage collector isn't getting a chance to clean these up.

How do you guys go about indexing a large DB table? Here's a snippet of my code (this method is called for each record in the DB):

private void IndexRow(SqlDataReader rdr, IndexWriter iw) {
    Document doc = new Document();
    for (int i = 0; i < BrowseFieldNames.Length; i++) {
        doc.Add(Field.UnStored(BrowseFieldNames[i], rdr.GetValue(i).ToString()));
    }
    iw.AddDocument(doc);
}
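The three IndexWriter parameters Otis mentions, shown as a hedged sketch. In Lucene 1.4 they are public fields (later releases use setters); the values below are arbitrary examples, not recommendations:

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class TunedWriter {
    public static IndexWriter open(String path) throws IOException {
        IndexWriter writer = new IndexWriter(path, new StandardAnalyzer(), true);
        writer.minMergeDocs = 1000;   // documents buffered in memory before a segment is flushed
        writer.mergeFactor = 20;      // how many segments accumulate before a merge kicks in
        writer.maxMergeDocs = 100000; // cap on documents per merged segment
        return writer;
    }
}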
RE: Indexing a large number of DB records
Hi Homam

I had a similar problem to yours in that I was indexing A LOT of data. Essentially how I got round it was to batch the index. What I was doing was to add 10,000 documents to a temporary index, use addIndexes() to merge the temporary index into the live index (which also optimizes the live index), then delete the temporary index. On the next loop I'd only query rows from the db above the id in the maxdoc of the live index and set the max rows of the query to 10,000, i.e.

SELECT TOP 10000 [fields] FROM [tables] WHERE [id_field] > {ID from Index.MaxDoc()} ORDER BY [id_field] ASC

Ensuring that the documents go into the index sequentially, your problem is solved, and memory usage on mine (dotlucene 1.3) is low.

Regards
Garrett

-----Original Message-----
From: Homam S.A. [mailto:[EMAIL PROTECTED]
Sent: 15 December 2004 02:43
To: Lucene Users List
Subject: Indexing a large number of DB records

I'm trying to index a large number of records from the DB (a few millions). Each record will be stored as a document with about 30 fields; most of them are UnStored and represent small strings or numbers. No huge DB Text fields.

But I'm running out of memory very fast, and the indexing is slowing down to a crawl once I hit around 1500 records. The problem is each document is holding references to the string objects returned from ToString() on the DB field, and the IndexWriter is holding references to all these document objects in memory, so the garbage collector isn't getting a chance to clean these up.

How do you guys go about indexing a large DB table? Here's a snippet of my code (this method is called for each record in the DB):

private void IndexRow(SqlDataReader rdr, IndexWriter iw) {
    Document doc = new Document();
    for (int i = 0; i < BrowseFieldNames.Length; i++) {
        doc.Add(Field.UnStored(BrowseFieldNames[i], rdr.GetValue(i).ToString()));
    }
    iw.AddDocument(doc);
}
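Garrett's temp-index batching, reduced to a hedged sketch using the Lucene 1.4 Java API rather than dotLucene; the paths are invented. addIndexes() merges the temporary index in and optimizes the live index as a side effect:

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class BatchMerger {
    public static void mergeBatch(Document[] batch) throws IOException {
        // build the batch (e.g. 10,000 documents) in a fast local index
        Directory temp = FSDirectory.getDirectory("/local/tempIndex", true);
        IndexWriter tempWriter = new IndexWriter(temp, new StandardAnalyzer(), true);
        for (int i = 0; i < batch.length; i++) {
            tempWriter.addDocument(batch[i]);
        }
        tempWriter.close();

        // merge it into the live index in a single operation
        IndexWriter live = new IndexWriter("/remote/liveIndex", new StandardAnalyzer(), false);
        live.addIndexes(new Directory[] { temp });
        live.close();
    }
}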
RE: Indexing a large number of DB records
Note that this really includes some extra steps. You don't need a temp index. Add everything to a single index using a single IndexWriter instance. No need to call addIndexes nor optimize until the end. Adding Documents to an index takes a constant amount of time, regardless of the index size, because new segments are created as documents are added, and existing segments don't need to be updated (only when merges happen).

Again, I'd run your app under a profiler to see where the time and memory are going.

Otis

--- Garrett Heaver [EMAIL PROTECTED] wrote:

Hi Homam

I had a similar problem to yours in that I was indexing A LOT of data. Essentially how I got round it was to batch the index. What I was doing was to add 10,000 documents to a temporary index, use addIndexes() to merge the temporary index into the live index (which also optimizes the live index), then delete the temporary index. On the next loop I'd only query rows from the db above the id in the maxdoc of the live index and set the max rows of the query to 10,000, i.e.

SELECT TOP 10000 [fields] FROM [tables] WHERE [id_field] > {ID from Index.MaxDoc()} ORDER BY [id_field] ASC

Ensuring that the documents go into the index sequentially, your problem is solved, and memory usage on mine (dotlucene 1.3) is low.

Regards
Garrett

-----Original Message-----
From: Homam S.A. [mailto:[EMAIL PROTECTED]
Sent: 15 December 2004 02:43
To: Lucene Users List
Subject: Indexing a large number of DB records

I'm trying to index a large number of records from the DB (a few millions). Each record will be stored as a document with about 30 fields; most of them are UnStored and represent small strings or numbers. No huge DB Text fields.

But I'm running out of memory very fast, and the indexing is slowing down to a crawl once I hit around 1500 records. The problem is each document is holding references to the string objects returned from ToString() on the DB field, and the IndexWriter is holding references to all these document objects in memory, so the garbage collector isn't getting a chance to clean these up.

How do you guys go about indexing a large DB table? Here's a snippet of my code (this method is called for each record in the DB):

private void IndexRow(SqlDataReader rdr, IndexWriter iw) {
    Document doc = new Document();
    for (int i = 0; i < BrowseFieldNames.Length; i++) {
        doc.Add(Field.UnStored(BrowseFieldNames[i], rdr.GetValue(i).ToString()));
    }
    iw.AddDocument(doc);
}
Re: Indexing a large number of DB records
Hello,

There are a few things you can do:

1) Don't just pull all rows from the DB at once. Do that in batches.
2) If you can get a Reader from your SqlDataReader, consider this: http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/document/Field.html#Text(java.lang.String,%20java.io.Reader)
3) Give the JVM more memory to play with by using the -Xms and -Xmx JVM parameters.
4) See IndexWriter's minMergeDocs parameter.
5) Are you calling optimize() at some point by any chance? Leave that call for the end.

1500 documents with 30 columns of short String/number values is not a lot. You may be doing something else, not Lucene related, that's slowing things down.

Otis

--- Homam S.A. [EMAIL PROTECTED] wrote:

I'm trying to index a large number of records from the DB (a few millions). Each record will be stored as a document with about 30 fields; most of them are UnStored and represent small strings or numbers. No huge DB Text fields.

But I'm running out of memory very fast, and the indexing is slowing down to a crawl once I hit around 1500 records. The problem is each document is holding references to the string objects returned from ToString() on the DB field, and the IndexWriter is holding references to all these document objects in memory, so the garbage collector isn't getting a chance to clean these up.

How do you guys go about indexing a large DB table? Here's a snippet of my code (this method is called for each record in the DB):

private void IndexRow(SqlDataReader rdr, IndexWriter iw) {
    Document doc = new Document();
    for (int i = 0; i < BrowseFieldNames.Length; i++) {
        doc.Add(Field.UnStored(BrowseFieldNames[i], rdr.GetValue(i).ToString()));
    }
    iw.AddDocument(doc);
}
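Suggestion 2 as a hedged sketch (in Java/JDBC terms rather than .NET): where a column is large, hand Lucene a java.io.Reader instead of a String, so the value is tokenized and indexed without being stored or fully copied into memory. The table and column names are invented; Field.Text(String, Reader) is indexed but never stored:

import java.sql.ResultSet;
import java.sql.SQLException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class RowDocument {
    public static Document fromRow(ResultSet rs) throws SQLException {
        Document doc = new Document();
        doc.add(Field.Keyword("id", rs.getString("id")));
        // stream the large text column; tokenized and indexed, not stored
        doc.add(Field.Text("contents", rs.getClob("body").getCharacterStream()));
        return doc;
    }
}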
Re: Indexing a large number of DB records
Thanks Otis!

What do you mean by building it in batches? Does it mean I should close the IndexWriter every 1000 rows and reopen it? Does that release references to the document objects so that they can be garbage-collected?

I'm calling optimize() only at the end.

I agree that 1500 documents is very small. I'm building the index on a PC with 512 megs, and the indexing process is quickly gobbling up around 400 megs when I index around 1800 documents, and the whole machine is grinding to a virtual halt. I'm using the latest dotLucene .NET port, so maybe there's a memory leak in it.

I have experience with AltaVista search (acquired by FastSearch), and I used to call MakeStable() every 20,000 documents to flush memory structures to disk. There doesn't seem to be an equivalent in Lucene.

-- Homam

--- Otis Gospodnetic [EMAIL PROTECTED] wrote:

Hello,

There are a few things you can do:

1) Don't just pull all rows from the DB at once. Do that in batches.
2) If you can get a Reader from your SqlDataReader, consider this: http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/document/Field.html#Text(java.lang.String,%20java.io.Reader)
3) Give the JVM more memory to play with by using the -Xms and -Xmx JVM parameters.
4) See IndexWriter's minMergeDocs parameter.
5) Are you calling optimize() at some point by any chance? Leave that call for the end.

1500 documents with 30 columns of short String/number values is not a lot. You may be doing something else, not Lucene related, that's slowing things down.

Otis

--- Homam S.A. [EMAIL PROTECTED] wrote:

I'm trying to index a large number of records from the DB (a few millions). Each record will be stored as a document with about 30 fields; most of them are UnStored and represent small strings or numbers. No huge DB Text fields.

But I'm running out of memory very fast, and the indexing is slowing down to a crawl once I hit around 1500 records. The problem is each document is holding references to the string objects returned from ToString() on the DB field, and the IndexWriter is holding references to all these document objects in memory, so the garbage collector isn't getting a chance to clean these up.

How do you guys go about indexing a large DB table? Here's a snippet of my code (this method is called for each record in the DB):

private void IndexRow(SqlDataReader rdr, IndexWriter iw) {
    Document doc = new Document();
    for (int i = 0; i < BrowseFieldNames.Length; i++) {
        doc.Add(Field.UnStored(BrowseFieldNames[i], rdr.GetValue(i).ToString()));
    }
    iw.AddDocument(doc);
}
Re: Indexing HTML files give following message
Hello, This is probably due to some bad HTML. The application you are using is just a demo, and it uses a JavaCC-based HTML parser, which may not be resilient to invalid HTML. For Lucene in Action we developed a little extensible indexing framework, and for HTML indexing we used two tools to handle HTML parsing: JTidy and NekoHTML. The code for the book is freely available from http://www.manning.com. NekoHTML knows how to deal with some bad HTML; that's why I'm suggesting it. The indexing framework could come in handy for those working on various 'desktop search' applications (Roosster, LDesktop (if that's really happening), Lucidity, etc.) Otis

--- Hetan Shah [EMAIL PROTECTED] wrote: java org.apache.lucene.demo.IndexHTML -create -index /source/workarea/hs152827/newIndex .. adding ../0/10037.html adding ../0/10050.html adding ../0/1006132.html adding ../0/1013223.html Parse Aborted: Encountered \ at line 5, column 1. Was expecting one of: ArgName ... = ... TagEnd ... And then the indexing hangs on this line. Earlier it used to go on and index the remaining pages in the directory. Any idea why the indexer would stop at this error? Pointers are much needed and appreciated. -H
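A rough sketch of swapping in a more forgiving parser: the snippet below uses NekoHTML's DOM parser to extract the text from messy HTML before handing it to Lucene. The class names come from the NekoHTML distribution; the file path and field names are made up:

import java.io.FileInputStream;
import org.cyberneko.html.parsers.DOMParser;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class NekoIndexer {
    // Recursively collect all text nodes under a DOM node.
    static void collectText(Node node, StringBuffer buf) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            buf.append(node.getNodeValue()).append(' ');
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collectText(c, buf);
        }
    }

    public static void main(String[] args) throws Exception {
        DOMParser parser = new DOMParser(); // NekoHTML: tolerant of bad HTML
        parser.parse(new InputSource(new FileInputStream("page.html"))); // hypothetical file
        StringBuffer text = new StringBuffer();
        collectText(parser.getDocument(), text);

        IndexWriter writer = new IndexWriter("/tmp/index", new StandardAnalyzer(), true);
        org.apache.lucene.document.Document doc = new org.apache.lucene.document.Document();
        doc.add(Field.UnStored("contents", text.toString()));
        writer.addDocument(doc);
        writer.close();
    }
}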
Re: Indexing within an XML document
Redirecting to lucene-user, which is more appropriate. I'm not sure what exactly the question is here, but: parse your XML document and for each p element you encounter create a new Document instance, then populate its fields with some data, like the URI data you mentioned. If you parse with DOM, just walk the node tree and make a new Document whenever you encounter an element you want as a separate Document. If you are using the SAX API you'll probably want some logic in the start/endElement and characters methods. When you reach the end of the element you are done with your Document instance, so add it to the IndexWriter instance that you opened once, before the parser. When you are done with the whole XML document, close the IndexWriter. Otis

--- Murray Altheim [EMAIL PROTECTED] wrote: Hi, I'm trying to develop a class to handle an XML document, where the contents aren't so much indexed on a per-document basis, rather on an element basis. Each element has a unique ID, so I'm looking to create a class/method similar to Lucene's Document.Document(). By way of example, I'll use some XHTML markup to illustrate what I'm trying to do:

<html>
<base href="http://purl.org/ceryle/blat.xml"/>
[...]
<body>
<p id="p1"> some text to index... </p>
<p id="p2"> some more text to index... </p>
<p id="p3"> even more text to index... </p>
</body>
</html>

I'd very much appreciate any help in explaining how I'd go about creating a method to return a Lucene Document to index this via ID. Would I want a separate Document per p? (There are many thousands of such elements.) Everything in my system, both at the document and the individual element level, is done via URL, so the method should create URLs for each p element like http://purl.org/ceryle/blat.xml#p1 http://purl.org/ceryle/blat.xml#p2 http://purl.org/ceryle/blat.xml#p3 etc. I don't need anyone to go to the trouble of coding this, just point me to how it might be done, or to any existing examples that do this kind of thing. Thanks very much! Murray .. Murray Altheim http://kmi.open.ac.uk/people/murray/ Knowledge Media Institute The Open University, Milton Keynes, Bucks, MK7 6AA, UK
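Following Otis's SAX suggestion, here is a minimal sketch that turns each p element into its own Lucene Document keyed by a fragment URL. The base URL, field names and analyzer choice are assumptions for illustration:

import java.io.FileReader;
import javax.xml.parsers.SAXParserFactory;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class ParaIndexer extends DefaultHandler {
    private final IndexWriter writer;
    private final String baseUrl;
    private StringBuffer text; // non-null while inside a <p>
    private String id;

    ParaIndexer(IndexWriter writer, String baseUrl) {
        this.writer = writer;
        this.baseUrl = baseUrl;
    }

    public void startElement(String uri, String local, String qName, Attributes atts) {
        if ("p".equals(qName)) { // start buffering a new paragraph
            text = new StringBuffer();
            id = atts.getValue("id");
        }
    }

    public void characters(char[] ch, int start, int length) {
        if (text != null) text.append(ch, start, length);
    }

    public void endElement(String uri, String local, String qName) throws SAXException {
        if ("p".equals(qName)) {
            try {
                Document doc = new Document();
                doc.add(Field.Keyword("url", baseUrl + "#" + id)); // e.g. ...blat.xml#p1
                doc.add(Field.UnStored("contents", text.toString()));
                writer.addDocument(doc);
            } catch (java.io.IOException e) {
                throw new SAXException(e);
            }
            text = null;
        }
    }

    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("/tmp/index", new StandardAnalyzer(), true);
        ParaIndexer handler = new ParaIndexer(writer, "http://purl.org/ceryle/blat.xml");
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new FileReader("blat.xml")), handler);
        writer.optimize();
        writer.close();
    }
}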
Re: Indexing MS Files
That's one place to start. The other one would be textmining.org, at least for Word files. I used both POI and the Textmining API in Lucene in Action, and the latter was much simpler to use. You can also find some comments about both libs in the lucene-user archives. People tend to like the Textmining API better. Otis --- Luke Shannon [EMAIL PROTECTED] wrote: I need to index Word, Excel and PowerPoint files. Is this the place to start? http://jakarta.apache.org/poi/ Is there something better? Thanks, Luke
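For Word files specifically, the textmining.org extractor reduces to a couple of lines; a sketch of its API from memory (the class and method names should be verified against the library, and the file path is made up):

import java.io.FileInputStream;
import org.textmining.text.extraction.WordExtractor;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class WordIndexer {
    public static void main(String[] args) throws Exception {
        // Extract the raw text from a .doc file (API names from memory; verify)
        WordExtractor extractor = new WordExtractor();
        String text = extractor.extractText(new FileInputStream("report.doc"));

        // Then index it like any other text
        IndexWriter writer = new IndexWriter("/tmp/index", new StandardAnalyzer(), true);
        Document doc = new Document();
        doc.add(Field.Keyword("filename", "report.doc"));
        doc.add(Field.UnStored("contents", text));
        writer.addDocument(doc);
        writer.close();
    }
}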
Re: Indexing MS Files
Thanks Otis. I am looking forward to this book. Any idea when it may be released?
Re: Indexing MS Files
As Manning Publications said, you should be able to get it for your grandma this Christmas. Otis
Re: Indexing MS Files
// ... for loading the document
PropertyValue propertyvalue[] = new PropertyValue[1];
// Setting the flag for hiding the open document
propertyvalue[0] = new PropertyValue();
propertyvalue[0].Name = "Hidden";
propertyvalue[0].Value = new Boolean(true);
// Loading the wanted document
Object objectDocumentToStore = xcomponentloader.loadComponentFromURL(stringUrl, "_blank", 0, propertyvalue);
// Getting an object that will offer a simple way to store a document to a URL.
XStorable xstorable = (XStorable) UnoRuntime.queryInterface(XStorable.class, objectDocumentToStore);
// Preparing properties for converting the document
propertyvalue = new PropertyValue[2];
// Setting the flag for overwriting
propertyvalue[0] = new PropertyValue();
propertyvalue[0].Name = "Overwrite";
propertyvalue[0].Value = new Boolean(true);
// Setting the filter name
propertyvalue[1] = new PropertyValue();
propertyvalue[1].Name = "FilterName";
propertyvalue[1].Value = stringConvertType;
// Appending the favoured extension to the origin document name
//if (stringUrl.lastIndexOf(".") != 0) {
//    stringUrl = stringUrl.substring(0, stringUrl.lastIndexOf("."));
//}
if (namedoc.lastIndexOf(".") != -1) {
    namedoc = namedoc.substring(0, namedoc.lastIndexOf("."));
}
//stringConvertedFile = stringUrl + "." + stringExtension;
stringConvertedFile = xbase.getAlias("local") + "/oo_tmp/" + namedoc + "." + stringExtension;
stringConvertedFile = stringConvertedFile.replace('\\', '/');
// Storing and converting the document
xstorable.storeToURL(stringConvertedFile, propertyvalue);
// Getting the method dispose() for closing the document
XComponent xcomponent = (XComponent) UnoRuntime.queryInterface(XComponent.class, xstorable);
// Closing the converted document
xcomponent.dispose();
} catch (NoConnectException ex) {
    return ("");
} catch (IOException ex) {
    return ("");
} catch (Exception ex) {
    return ("");
}
// Returning the name of the converted file
return (stringConvertedFile);
}
Re: Indexing MS Files
Thanks. Grandmas around the world will certainly be surprised this Christmas.
Re: Indexing MS Files
This looks great. Thank you Thierry! - Original Message - From: Thierry Ferrero [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Wednesday, November 10, 2004 12:23 PM Subject: Re: Indexing MS Files I used the OpenOffice API to convert all Word and Excel versions. For me it's the solution for complex Word and Excel documents. http://api.openoffice.org/ Good luck!

// UNO API
import com.sun.star.bridge.XUnoUrlResolver;
import com.sun.star.uno.XComponentContext;
import com.sun.star.uno.UnoRuntime;
import com.sun.star.frame.XComponentLoader;
import com.sun.star.frame.XStorable;
import com.sun.star.beans.PropertyValue;
import com.sun.star.beans.XPropertySet;
import com.sun.star.lang.XComponent;
import com.sun.star.lang.XMultiComponentFactory;
import com.sun.star.connection.NoConnectException;
import com.sun.star.io.IOException;

/** This class implements an http servlet in order to convert an incoming document
 *  with help of a running OpenOffice.org and to push the converted file back
 *  to the client. */
public class DocConverter {
    private String stringHost;
    private String stringPort;
    private Xcontext xcontext;
    private Xbase xbase;

    public DocConverter(Xbase xbase, Xcontext xcontext, ServletContext sc) {
        this.xbase = xbase;
        this.xcontext = xcontext;
        stringHost = ApplicationUtil.getParameter(sc, "openoffice.oohost");
        stringPort = ApplicationUtil.getParameter(sc, "openoffice.ooport");
    }

    public synchronized String convertToTxt(String namedoc, String pathdoc, String stringConvertType, String stringExtension) {
        String stringConvertedFile = this.convertDocument(namedoc, pathdoc, stringConvertType, stringExtension);
        return stringConvertedFile;
    }

    /** This method converts a document to a given type by using a running
     *  OpenOffice.org and saves the converted document to the specified
     *  working directory.
     *  @param stringDocumentName The full path name of the file on the server to be converted.
     *  @param stringConvertType Type to convert to.
     *  @param stringExtension This string will be appended to the file name of the converted file.
     *  @return The full path name of the converted file will be returned.
     *  @see stringWorkingDirectory */
    private String convertDocument(String namedoc, String pathdoc, String stringConvertType, String stringExtension) {
        String tagerr = "";
        String stringUrl = "";
        String stringConvertedFile = "";
        // Converting the document to the favoured type
        try {
            tagerr = "0";
            // Composing the URL - suppression de l'extension
            stringUrl = pathdoc + "/" + namedoc;
            stringUrl = stringUrl.replace('\\', '/');
            /* Bootstraps a component context with the jurt base components registered.
               Component context to be granted to a component for running.
               Arbitrary values can be retrieved from the context. */
            XComponentContext xcomponentcontext = com.sun.star.comp.helper.Bootstrap.createInitialComponentContext(null);
            /* Gets the service manager instance to be used (or null). This method has
               been added for convenience, because the service manager is an often used object. */
            XMultiComponentFactory xmulticomponentfactory = xcomponentcontext.getServiceManager();
            tagerr = "2";
            /* Creates an instance of the component UnoUrlResolver which supports the
               services specified by the factory. */
            Object objectUrlResolver = xmulticomponentfactory.createInstanceWithContext("com.sun.star.bridge.UnoUrlResolver", xcomponentcontext);
            // Create a new url resolver
            XUnoUrlResolver xurlresolver = (XUnoUrlResolver) UnoRuntime.queryInterface(XUnoUrlResolver.class, objectUrlResolver);
            // Resolves an object that is specified as follows:
            // uno:connection description;protocol description;initial object name
            Object objectInitial = xurlresolver.resolve("uno:socket,host=" + stringHost + ",port=" + stringPort + ";urp;StarOffice.ServiceManager");
            // Create a service manager from the initial object
            xmulticomponentfactory = (XMultiComponentFactory) UnoRuntime.queryInterface(XMultiComponentFactory.class, objectInitial);
            // Query for the XPropertySet interface.
            XPropertySet xpropertysetMultiComponentFactory = (XPropertySet) UnoRuntime.queryInterface(XPropertySet.class, xmulticomponentfactory);
            // Get the default context from the office server.
            Object objectDefaultContext = xpropertysetMultiComponentFactory.getPropertyValue("DefaultContext");
            // Query for the interface XComponentContext.
            xcomponentcontext = (XComponentContext) UnoRuntime.queryInterface(XComponentContext.class, objectDefaultContext);
            /* A desktop environment contains tasks with one or more frames in which
               components can be loaded. Desktop is the environment
RE: Indexing process causes Tomcat to stop working
Before screwing Tomcat too much... 1. Make sure both indexing and reading processes use the same locking directory (i.e. set it explicitly; take a look at the wiki how-to). 2. Try to execute queries from the command line and see what happens. 3. In case your queries use sorting, there is a memory leak in 1.4.1 - upgrade to 1.4.2. Regards, J.

James Tyrrell [EMAIL PROTECTED] 28.10.2004 10:13 Please respond to Lucene Users List To: [EMAIL PROTECTED] cc: (bcc: Iouli Golovatyi/X/GP/Novartis) Subject: RE: Indexing process causes Tomcat to stop working From: Armbrust, Daniel C. [EMAIL PROTECTED] Right, got back to work with the newly created index to try these ideas. "So, are you creating the indexes from inside the tomcat runtime, or are you creating them on the command line (which would be in a different runtime than tomcat)?" I'm creating them on the command line using a variation on the standard shown in the demo (it has some additional optimisation input that is set to default until I can fix this bug). "What happens to tomcat? Does it hang - still running but not responsive? Or does it crash? If it hangs, maybe you are running out of memory. By default, Tomcat's limit is set pretty low..." It definitely hangs; when shut down you can't access it, and when re-started it just sits there trying to access port 8080. "There is no reason at all you should have to reboot... If you stop and start tomcat (make sure it actually stopped - sometimes it requires a kill -9 when it really gets hung) it should start working again. Depending on your setup of Tomcat + apache, you may have to restart apache as well to get them linked to each other again..." Good news: this did work. However, I never see tomcat in top or even using ps -A | grep tomcat; the only way I've found tomcat is using ps -auwx | grep tomcat. The output is *after tomcat shutdown.sh was run*:

root 2266 0.0 3.8 243740 4860 pts/0 S Oct26 0:36 /opt/jdk1.4/bin/java -Djava.endorsed.dirs=/opt/tomcat/common/endorsed -classpath /opt/jdk1.4/lib/tools.jar:/opt/tomcat/bin/bootstrap.jar:/opt/tomcat/bin/commons-logging-api.jar -Dcatalina.base=/opt/tomcat -Dcatalina.home=/opt/tomcat -Djava.io.tmpdir=/opt/to
root 16050 0.0 0.4 3576 620 pts/0 S 08:41 0:00 grep tomcat

I did however find two java processes running, so I dutifully used kill -9 on both pids, and hey presto, when I restarted Tomcat it ran perfectly. So while I can work around this, I guess now the question becomes: does anyone have any advice as to what could be causing it? Bearing in mind I can still run java processes (even create new indexes) on the same machine, so it is just Tomcat that's affected. Meanwhile, I will try as Dan suggested to raise the default memory of Tomcat significantly and run another index (it seems a likely culprit). Thanks for all the help thus far, it's more than appreciated. regards, JT
RE: Indexing process causes Tomcat to stop working
From: [EMAIL PROTECTED] Hello! "before screwing tomcat too much..." A little late, but probably good advice; thankfully it hasn't gone wrong. "1. make sure both indexing and reading processes use the same locking directory (i.e. set it explicitly, take a look at the wiki how-to)" Working on this; not so good at Java yet (until recently I mostly worked on PHP). I looked at the wiki how-to's; could you be more specific, as I couldn't find much on locking directories? But I will struggle on. "2. try to execute queries from the command line and see what happens" I only execute from the command line, so all the info in previous posts is what happens. "3. in case your queries use sorting, there is a memory leak in 1.4.1 - upgrade to 1.4.2" My queries do use sorting! So I have placed the 1.4 final jar onto my classpath and have started 'another' index; as the company I work for is moving home tomorrow, I may not be able to tell you if that worked till next week, mind. To Dan: the increased memory allocation for Tomcat didn't work unfortunately, but I do know a lot more about CATALINA_OPTS and Tomcat now, which has proved handy for other things. Cheers for all the advice people; will keep you posted if I make a breakthrough. Thanks for your patience, regards, JT
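On the locking directory point: Lucene 1.4 resolves its lock files against the org.apache.lucene.lockDir system property (falling back to java.io.tmpdir), so one way to make the command-line indexer and the Tomcat webapp agree is to pass the same value to both JVMs. A sketch, with a made-up path:

java -Dorg.apache.lucene.lockDir=/opt/lucene/locks com.lucene.IndexHTML -create -index indexstore/ ..

and on the Tomcat side, e.g. in catalina.sh:

CATALINA_OPTS="-Dorg.apache.lucene.lockDir=/opt/lucene/locks"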
RE: Indexing process causes Tomcat to stop working
You want version 1.4.2, not version 1.4. The website makes it hard to find 1.4.2, because the mirrors have not been updated yet. Get 1.4.2 here: http://cvs.apache.org/dist/jakarta/lucene/v1.4.2/
RE: Indexing process causes Tomcat to stop working
James, How do you kick off your reindex? Could it be a session timeout? cheers, Aad

Hello, I am a Java/Lucene/Tomcat newbie - I know that does not bode well as a start to a post, but I really am in dire straits as far as Lucene goes, so bear with me. I am working on indexing and replacing search functionality for a website (about 10 gig in size, although only about 7 gig is indexed). I presently have a working model based on the luceneweb demo distributed with Lucene; this has already proven functional when tested on various sites (admittedly much smaller, 200-400mb etc). However, issues occur when performing the index on the main site that I haven't found explained on any of the Lucene forums thus far. After a successful index and optimisation of the website (takes around 4hrs 40m unoptimised) I can't get to the index.jsp or even access tomcat. My first thought was to restart tomcat. No joy and no access. Thinking the larger index had killed the test server I accessed apache on port 80, which worked perfectly. After a few checks I realised the test server was fine, apache was fine; I used the same application to create an index of the tomcat docs, so java was working. Confused, I went back to the forums, FAQs and groups to see if anyone had had any similar problems, and have come up with a brief list of what my problem is not: There are no write.lock files found for Lucene in either the /tmp or opt/tomcat/temp directories, so the index is open to be searched. Nor does 'top' reveal anything overloading the system. Apache is running fine and displays all relevant pages. Tomcat cannot be reached with a browser (neither the default congratulations page nor the Luceneweb application). Tomcat was a fresh install, as was Java; Tomcat logs show nothing different to standard startup logs. So I logged the entire indexing process and saw two errors occurring infrequently.

Parse Aborted: Encountered \ at line 6, column 129. //where these values vary
Was expecting one of: ArgName ... = ... TagEnd ...

I'm satisfied this is just the HTML parser kicking off about some badly formatted HTML and is only affecting what is indexed, but it's here for completeness. The other error is more serious:

java.io.IOException: Pipe closed
at java.io.PipedInputStream.receive(PipedInputStream.java:136)
at java.io.PipedInputStream.receive(PipedInputStream.java:176)
at java.io.PipedOutputStream.write(PipedOutputStream.java:129)
at sun.nio.cs.StreamEncoder$CharsetSE.writeBytes(StreamEncoder.java:336)
at sun.nio.cs.StreamEncoder$CharsetSE.implWrite(StreamEncoder.java:395)
at sun.nio.cs.StreamEncoder.write(StreamEncoder.java:136)
at sun.nio.cs.StreamEncoder.write(StreamEncoder.java:146)
at java.io.OutputStreamWriter.write(OutputStreamWriter.java:204)
at java.io.Writer.write(Writer.java:126)
at org.apache.lucene.demo.html.HTMLParser.addText(HTMLParser.java:137)
at org.apache.lucene.demo.html.HTMLParser.HTMLDocument(HTMLParser.java:203)
at org.apache.lucene.demo.html.ParserThread.run(ParserThread.java:31)

I'm again pretty sure that this is the same error that occurred once before when I was using maxFieldLength to limit the number of terms recorded. I'm also confident it's a threading error, and I found the following post by Doug Cutting that seemed to explain it: http://java2.5341.com/msg/80502.html - however, I am assuming that's what it is and haven't yet attempted to change the threading system of the demo, due to my lack of java knowledge.
The strange thing is, after restarting the server all aspects of the Lucene web application work perfectly - stemming, alphanumeric indexing, summaries etc. are all as expected - so I am left assuming from this (and by running out of options) that Lucene has somehow done something to Tomcat by doing such a large index. Being that both run off Java, I guess it's something to do with that, but I have nowhere near enough experience in java to work out what. The system I am currently running on is Java 1.4.2_05, Tomcat 5.0.27, Lucene 1.4.1, Linux 2.4.20-8 (gcc version 3.2.2 20030222 (Red Hat Linux 3.2.2-5)), Apache 2.0.42. I have not modified the mergeFactor or MaxMergeDocuments, nor am I using RAMDirectories. The processor is 800MHz and there is 128mb of RAM. If more info is required on setup, source code etc, or you think this should be moved to a tomcat forum, just post. Best regards and thanks in advance for any advice you can offer, J Tyrrell
RE: Indexing process causes Tomcat to stop working
Aad, D'oh, forgot to mention that mildly important info. Rather than re-index I am just creating a new index each time; this makes things easier to roll back etc (which is what my boss wants). The command line is something like: java com.lucene.IndexHTML -create -index indexstore/ .. I have wondered about whether sessions could be a problem, but I don't think so - otherwise wouldn't a restart of Tomcat be sufficient rather than a reboot? I even tried the killall command on java and tomcat, then started everything again, to no avail. cheers, JT
RE: Indexing process causes Tomcat to stop working
So, are you creating the indexes from inside the tomcat runtime, or are you creating them on the command line (which would be in a different runtime than tomcat)? What happens to tomcat? Does it hang - still running but not responsive? Or does it crash? If it hangs, maybe you are running out of memory. By default, Tomcat's limit is set pretty low... There is no reason at all you should have to reboot... If you stop and start tomcat, (make sure it actually stopped - sometimes it requires a kill -9 when it really gets hung) it should start working again. Depending on your setup of Tomcat + apache, you may have to restart apache as well to get them linked to each other again... Dan
Re: Indexing Strategy for 20 million documents
--- Christoph Kiehl [EMAIL PROTECTED] wrote: Otis Gospodnetic wrote: I would try putting everything in a single index first, and split it up only if I see performance issues. Why would you put everything into a single index? I found some benchmark results on the list (starting with your post from 06/08/04) from which I got the impression that the performance loss is very small if I choose to search multiple indexes with MultiSearcher instead of using one big index. I think it's simpler to deal with a single index. One directory, one set of lock files, etc. If you don't gain anything by having multiple indices, why have them? Going from 1 index to N indices is not a lot of work (not a lot of Lucene-related code). How do you get from 1 index to N indices without adding the documents again? Yes, you would have to re-create N Lucene indices. Otis
Re: indexing numeric entities?
Yes, you need to parse the entities yourself. I implemented an HTML entity parser as part of the http://objectledge.org project. You may use it if it fits your needs. It is in the ledge-components project module. See http://objectledge.org/modules/ledge-components/index.html Have fun, -- Damian Gajda Caltha Sp. j. http://www.caltha.pl/
Re: Indexing Strategy for 20 million documents
It depends on a lot of factors. I myself use multiple indexes for about 10M documents. My documents are transient: each day I get about 400K and I remove about 400K. I always remove an entire day's documents at one time. It is much faster/easier to delete the Lucene index for the day that I am removing than to loop through one big index and remove the entries with the IndexReader. Since my data is also partitioned by day in my database, I essentially do the same thing there with truncate table. I use a ParallelMultiSearcher object to search the indexes. I store my indexes on a 14-disk 15k rpm fibre channel RAID 1+0 array (striped mirrors). I get very good performance in both updating and searching indexes.

On Fri, 8 Oct 2004 06:11:37 -0700 (PDT), Otis Gospodnetic [EMAIL PROTECTED] wrote: Jeff, These questions are difficult to answer, because the answer depends on a number of factors, such as: - hardware (memory, disk speed, number of disks...) - index complexity and size (number of fields and their size) - number of queries/second - complexity of queries etc. I would try putting everything in a single index first, and split it up only if I see performance issues. Going from 1 index to N indices is not a lot of work (not a lot of Lucene-related code). If searching 1 big index is too slow, split your index, put each index on a separate disk, and use ParallelMultiSearcher (http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/ParallelMultiSearcher.html) to search your indices. Otis

--- Jeff Munson [EMAIL PROTECTED] wrote: I am a new user of Lucene. I am looking to index over 20 million documents (and a lot more someday) and am looking for ideas on the best indexing/search strategy. Which will optimize the Lucene search, one index or multiple indexes? Do I create multiple indexes and merge them all together? Or do I create multiple indexes and search on the multiple indexes? Any helpful ideas would be appreciated!
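For the multiple-index route, searching the per-day indices together is only a few lines with ParallelMultiSearcher. A sketch assuming one FSDirectory-backed index per day (the paths, query and field name are invented):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ParallelMultiSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searchable;

public class DailySearch {
    public static void main(String[] args) throws Exception {
        // One searcher per daily index directory (hypothetical paths)
        Searchable[] searchables = {
            new IndexSearcher("/indexes/2004-10-06"),
            new IndexSearcher("/indexes/2004-10-07"),
            new IndexSearcher("/indexes/2004-10-08"),
        };
        ParallelMultiSearcher searcher = new ParallelMultiSearcher(searchables);
        Query q = QueryParser.parse("lucene", "contents", new StandardAnalyzer());
        Hits hits = searcher.search(q);
        System.out.println(hits.length() + " hits");
        searcher.close(); // closes the underlying searchers too
    }
}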
Re: indexing numeric entities?
Daan Hoogland wrote: Daan Hoogland wrote: Hello, Does anyone do indexing of numeric entities for Japanese characters? I have (non-X)HTML containing those entities and need to index and search them. Can the CJKAnalyzer index a string like &#9679;&#20837;&#31038;? It seems to be ignored completely when used with the demo. There was talk on this list of fixes for the demo HTMLParser; do these address this issue? When I look at the code it seems that the entities should have been interpreted before indexing. What am I missing? Any comment please? Or a pointer to a howto for dumm^H^H^H^H^H westerners? Indexing the attached document using the HTMLParser demo and the CJKAnalyzer, only the term japan is found in the content. This is not correct, is it? Should I convert the entities by hand? thanks,
Re: indexing numeric entities?
maybe inline?

<html xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<head> <title>japan</title> </head>
<body bgcolor="#FF" alink="black">
<p> &#12501;&#12451;&#12540;&#12523;&#12489;&#12469;&#12540;&#12499;&#12473;&#12456;&#12531;&#12472;&#12491;&#12450; </p>
</html>

Indexing the above document using the HTMLParser demo and the CJKAnalyzer, only the term japan is found in the content. This is not correct, is it? Should I convert the entities by hand? Sorry for the mess I sent before.
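As Damian says earlier in this thread, the numeric character references have to be decoded before the text reaches the analyzer. A small JDK 1.4 sketch (the class is hypothetical, written for illustration; it handles decimal references to BMP characters only):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EntityDecoder {
    private static final Pattern NUMERIC_REF = Pattern.compile("&#(\\d+);");

    /** Replace decimal numeric character references like &#12501;
     *  with the characters they denote. */
    public static String decode(String html) {
        Matcher m = NUMERIC_REF.matcher(html);
        StringBuffer sb = new StringBuffer();
        int last = 0;
        while (m.find()) {
            sb.append(html.substring(last, m.start()));
            sb.append((char) Integer.parseInt(m.group(1))); // BMP characters only
            last = m.end();
        }
        sb.append(html.substring(last));
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("&#26085;&#26412;")); // prints the two kanji for "Japan"
    }
}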
re-indexing
I am having trouble reindexing. Basically what I want to do is: 1. Delete the old index 2. Write the new index. The environment: the index is searched by a web app running from the Orion App Server. This code runs fine and reindexes fine prior to any searches. After the first search against the index is completed, the index ends up being read-only (or not writeable); I cannot reindex and subsequently cannot search because the index is incomplete. 1. Why doesn't IndexReader.delete(i) really delete the file? It seems to just make another 1K file with a .del extension that the IndexWriter still cannot contend with. 2. How can I make this work? Thanks, Jason

The code below produces the following output when run AFTER an initial search against the index has been completed:

IndexerDrug-disableLuceneLocks: true
Directory: [EMAIL PROTECTED]:\lucene_index_drug
Deleted [0]: true
... (output from for loop confirming deleted items)
Deleted [367]: true
Hit uncaught exception java.io.IOException
java.io.IOException: Cannot delete _ba.cfs
at org.apache.lucene.store.FSDirectory.create(FSDirectory.java:144)
at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:105)
at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:193)
at IndexerDrug.index(IndexerDrug.java:103)
at IndexerDrug.main(IndexerDrug.java:246)
Exception in thread main

=-=-=-=-=-=-=-=-=-=-=-=-=- My indexing code (some items have been deleted to protect the innocent) =-=-=-=-=-=-=-=-=-=-=-=-=-

import java.io.*;
import java.sql.*;
import javax.naming.*;
import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.standard.*;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.store.*;

public class IndexerDrug {
    private String sql = ...; // my query code (elided)
    public static String[] stopWords = org.apache.lucene.analysis.standard.StandardAnalyzer.STOP_WORDS;
    public File indexDir = new File("C:\\lucene_index_drug\\");
    public Directory fsDir;

    public void index() throws IOException {
        try {
            // Delete old index
            fsDir = FSDirectory.getDirectory(indexDir, false);
            if (indexDir.list().length > 0) {
                IndexReader reader = IndexReader.open(fsDir);
                System.out.println("Directory: " + reader.directory().toString());
                reader.unlock(fsDir);
                for (int i = 0; i < reader.maxDoc() - 1; i++) {
                    reader.delete(i);
                    System.out.println("Deleted [" + i + "]: " + reader.isDeleted(i));
                }
                reader.close();
            }
        } catch (Exception ex) {
            System.out.println("Error while deleting index: " + ex.getMessage());
        }
        // Write new index
        Analyzer analyzer = new StandardAnalyzer(stopWords);
        IndexWriter writer = new IndexWriter(indexDir, analyzer, true); // fails here *
        writer.mergeFactor = 1000;
        indexDirectory(writer);
        writer.setUseCompoundFile(true);
        writer.optimize();
        writer.close();
    }

    private void indexDirectory(IndexWriter writer) throws IOException {
        Connection c = null;
        ResultSet rs = null;
        Statement stmt = null;
        long startTime = System.currentTimeMillis();
        System.out.println("Start Time: " + new java.sql.Timestamp(System.currentTimeMillis()).toString());
        try {
            Class.forName();
            c = DriverManager.getConnection( , , );
            stmt = c.createStatement();
            rs = stmt.executeQuery(this.sql);
            System.out.println("Query Completed: " + new java.sql.Timestamp(System.currentTimeMillis()).toString());
            int total = 0;
            String resourceID = "";
            String resourceName = "";
            String summary = "";
            String shortSummary = "";
            String hciPick = "";
            String url = "";
            String format = "";
            String orgType = "";
            String holdingType = "";
            String indexText = "";
            String c_indexText = "";
            boolean ready = false;
            Document doc = null;
            String oldResourceID = null;
            String newResourceID = null;
            while (rs.next()) {
                newResourceID = rs.getString("resourceID") != null ? rs.getString("resourceID") : "";
                resourceID = newResourceID;
                resourceName = rs.getString("resourceName") != null ? rs.getString("resourceName") : "";
                summary = rs.getString("summary") != null ? rs.getString("summary") : "";
                if (summary.length() > 300) {
                    shortSummary = summary.substring(0, 300) + "...";
                } else {
                    shortSummary = summary;
                }
                hciPick = rs.getString("hciPick") != null ? rs.getString("hciPick") : "";
                url = rs.getString("url") != null ? rs.getString("url") : "";
                format = rs.getString("format") != null ? rs.getString("format") : "";
                orgType = rs.getString("orgType") != null ? rs.getString("orgType") : "";
                holdingType = rs.getString("holdingType") != null ? rs.getString("holdingType") : "";
                indexText = rs.getString("indexText") != null ? rs.getString("indexText") : "";
                if
Re: re-indexing
Jason wrote: I am having trouble reindexing. Basically what I want to do is: 1. Delete the old index 2. Write the new index. [...] After the first search against the index is completed the index ends up being read-only (or not writeable); I cannot reindex and subsequently cannot search because the index is incomplete. We have several apps running like this, only on Tomcat and JBoss, with no problems... 1. Why doesn't IndexReader.delete(i) really delete the file? It seems to just make another 1K file with a .del extension that the IndexWriter still cannot contend with. Never tried the IndexReader.delete() method; we generally build the new index in a temporary directory, and when the index is done we delete the current online directory (using java.io.File methods) and then rename the temp directory to online. 2. How can I make this work? This may just be silly, but do you remember to close your org.apache.lucene.search.IndexSearcher when you are done with your search? -- Bo Gundersen DBA/Software Developer M.Sc.CS. www.atira.dk
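Bo's build-then-swap approach needs only java.io.File operations. A minimal sketch, under the assumption that no searcher holds the online directory open during the swap (the directory names are invented, and note that File.delete/renameTo can fail on Windows while files are still open):

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class SwapReindex {
    // Delete a directory and everything in it (no subdirectories assumed).
    static boolean deleteDir(File dir) {
        File[] files = dir.listFiles();
        for (int i = 0; files != null && i < files.length; i++) {
            if (!files[i].delete()) return false;
        }
        return dir.delete();
    }

    public static void main(String[] args) throws Exception {
        File temp = new File("/indexes/temp");
        File online = new File("/indexes/online");

        // 1. Build the complete new index in a scratch directory.
        IndexWriter writer = new IndexWriter(temp, new StandardAnalyzer(), true);
        // ... addDocument() calls go here ...
        writer.optimize();
        writer.close();

        // 2. Swap it in: remove the old index, rename the new one into place.
        if (!deleteDir(online) || !temp.renameTo(online)) {
            throw new java.io.IOException("index swap failed; is a searcher still open?");
        }
    }
}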
Re: indexing date ranges
If it is unindexed, then you cannot query on it, so you do not have a choice. The other option is to use a field that is indexed, not tokenized, and not stored (you have to use new Field(...) to accomplish that) if you don't want to store the field data. Erik

On Sep 21, 2004, at 5:54 PM, Chris Fraschetti wrote: is it most efficient to index or not index 'numeric' ranges that I will do a range search by? epoc_date:[110448 TO 820483200] Would it be better to treat it as Field.Keyword or Field.UnIndexed? -- ___ Chris Fraschetti, Student CompSci System Admin University of San Francisco e [EMAIL PROTECTED] | http://meteora.cs.usfca.edu
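A sketch of Erik's suggestion: index the date as an untokenized keyword-style field and zero-pad the value so lexicographic range queries order correctly. The field name and padding width are illustrative; in Lucene 1.4 the Field(String, String, boolean store, boolean index, boolean token) constructor gives an indexed, untokenized, unstored field:

import java.text.DecimalFormat;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class EpochDateField {
    public static void main(String[] args) {
        long epoch = 820483200L;
        // Zero-pad so that string order matches numeric order in RangeQuery.
        String padded = new DecimalFormat("0000000000").format(epoch);

        Document doc = new Document();
        // store=false, index=true, token=false: searchable but not tokenized or stored
        doc.add(new Field("epoc_date", padded, false, true, false));
        System.out.println(doc);
        // The query side would then pad too: epoc_date:[0000110448 TO 0820483200]
    }
}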
Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents
Hi Doug, you are absolutely right about the older version of the JDK: it is 1.3.1 (IBM). Unfortunately we cannot upgrade, since we are bound to the IBM Portal Server 4 environment. Results: I patched Lucene 1.4.1, but it has not improved much: after indexing 1897 objects the number of SegmentTermEnums is up to 17936. To be realistic: this is even a deterioration :((( My next check will be with JDK 1.4.2 for the test environment, but this can only be a reference run for now. Thanks, Daniel

Doug Cutting wrote: It sounds like the ThreadLocal in TermInfosReader is not getting correctly garbage collected when the TermInfosReader is collected. Researching a bit, this was a bug in JVMs prior to 1.4.2, so my guess is that you're running in an older JVM. Is that right? I've attached a patch which should fix this. Please tell me if it works for you. Doug

Daniel Taurat wrote: Okay, that (1.4rc3) worked fine, too! Got only 257 SegmentTermEnums for 1900 objects. Now I will go for the final test on the production server with the 1.4rc3 version and about 40,000 objects. Daniel

Daniel Taurat schrieb: Hi all, here is some update for you: I switched back to Lucene 1.3-final and now the number of SegmentTermEnum objects is controlled by gc again: it goes up to about 1000 and then it is down again to 254 after indexing my 1900 test objects. Stay tuned, I will try 1.4RC3 now, the last version before FieldCache was introduced... Daniel

Rupinder Singh Mazara schrieb: hi all, I had a similar problem. I have a database of documents with 24 fields and an average content of 7K; with 16M+ records I had to split the job into slabs of 1M each, merging the resulting indexes. Submissions to our job queue looked like: java -Xms100M -Xcompactexplicitgc -cp $CLASSPATH lucene.Indexer 22 and I still had an OutOfMemory exception. The solution that I created was, after every 200K documents, to create a temp directory and merge them together. This was done for the first production run; updates are now being handled incrementally.

Exception in thread main java.lang.OutOfMemoryError
at org.apache.lucene.store.RAMOutputStream.flushBuffer(RAMOutputStream.java(Compiled Code))
at org.apache.lucene.store.OutputStream.flush(OutputStream.java(Inlined Compiled Code))
at org.apache.lucene.store.OutputStream.writeByte(OutputStream.java(Inlined Compiled Code))
at org.apache.lucene.store.OutputStream.writeBytes(OutputStream.java(Compiled Code))
at org.apache.lucene.index.CompoundFileWriter.copyFile(CompoundFileWriter.java(Compiled Code))
at org.apache.lucene.index.CompoundFileWriter.close(CompoundFileWriter.java(Compiled Code))
at org.apache.lucene.index.SegmentMerger.createCompoundFile(SegmentMerger.java(Compiled Code))
at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java(Compiled Code))
at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java(Compiled Code))
at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:366)
at lucene.Indexer.doIndex(CDBIndexer.java(Compiled Code))
at lucene.Indexer.main(CDBIndexer.java:168)
Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents
Okay, reference test is done: on JDK 1.4.2 Lucene 1.4.1 really seems to run fine: just a moderate number of SegmentTermEnums that is controlled by gc (about 500 for the 1900 test objects).
Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents
Daniel Aber schrieb: On Thursday 09 September 2004 19:47, Daniel Taurat wrote: I am facing an out of memory problem using Lucene 1.4.1. Could you try with a recent CVS version? There has been a fix about files not being deleted after 1.4.1. Not sure if that could cause the problems you're experiencing. Regards, Daniel

Well, it seems not to be files; it looks more like those SegmentTermEnum objects accumulating in memory. I've seen some discussion of these objects in the developer newsgroup that took place some time ago. I am afraid this is some kind of runaway caching I have to deal with. Maybe not correctly addressed in this newsgroup, after all... Anyway: any idea if there is an API command to re-init caches? Thanks, Daniel
Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents
Hi all, Reading the thread with interest, there is another way I've come across out of memory errors when indexing large batches of documents. If you have your heap space settings too high, you get swapping (which impacts performance), plus you never reach the trigger for garbage collection; hence you don't garbage collect, and hence you run out of memory. Can you check whether or not your garbage collection is being triggered? Counterintuitively, if this is the case, reducing the heap space can both improve performance and get rid of the out of memory errors. Cheers, Pete Lewis

- Original Message - From: Daniel Taurat To: Lucene Users List Sent: Friday, September 10, 2004 1:10 PM Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents [...]
Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents
Hi Pete, good hint, but we actually do have 4 GB of physical memory on the system. But then, we have also experienced that the gc of the IBM JDK 1.3.1 that we use sometimes behaves strangely with too large a heap space anyway (the limit seems to be 1.2 GB). I can say that gc is not collecting these objects, since I forced gc runs every now and then while indexing (when parsing pdf-type objects, that is): no effect. Regards, Daniel

Pete Lewis wrote: Hi all, Reading the thread with interest, there is another way I've come across out of memory errors when indexing large batches of documents. [...]
Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents
I can say that gc is not collecting these objects since I forced gc runs when indexing every now and then (when parsing pdf-type objects, that is): No effect.

What PDF parser are you using? Is the problem within the parser and not lucene? Are you releasing all resources? Ben
RE: Out of memory in lucene 1.4.1 when re-indexing large number of documents
hi all, I had a similar problem. I have a database of documents with 24 fields and an average content of 7K; with 16M+ records I had to split the job into slabs of 1M each and merge the resulting indexes. Submissions to our job queue looked like

  java -Xms100M -Xcompactexplicitgc -cp $CLASSPATH lucene.Indexer 22

and I still had OutOfMemory exceptions. The solution I created was, after every 200K documents, to create a temp directory and merge them together. This was done for the first production run; updates are now being handled incrementally.

  Exception in thread "main" java.lang.OutOfMemoryError
    at org.apache.lucene.store.RAMOutputStream.flushBuffer(RAMOutputStream.java(Compiled Code))
    at org.apache.lucene.store.OutputStream.flush(OutputStream.java(Inlined Compiled Code))
    at org.apache.lucene.store.OutputStream.writeByte(OutputStream.java(Inlined Compiled Code))
    at org.apache.lucene.store.OutputStream.writeBytes(OutputStream.java(Compiled Code))
    at org.apache.lucene.index.CompoundFileWriter.copyFile(CompoundFileWriter.java(Compiled Code))
    at org.apache.lucene.index.CompoundFileWriter.close(CompoundFileWriter.java(Compiled Code))
    at org.apache.lucene.index.SegmentMerger.createCompoundFile(SegmentMerger.java(Compiled Code))
    at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java(Compiled Code))
    at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java(Compiled Code))
    at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:366)
    at lucene.Indexer.doIndex(CDBIndexer.java(Compiled Code))
    at lucene.Indexer.main(CDBIndexer.java:168)

-Original Message- From: Daniel Taurat Sent: 10 September 2004 14:42 To: Lucene Users List Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents

Hi Pete, good hint, but we actually do have physical memory of 4Gb on the system. [...]
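Rupinder's slab-and-merge scheme can be sketched roughly as follows against the Lucene 1.4-era API. This is only an illustration, not his actual lucene.Indexer: the paths, the slab count, and the nextDocument() document source are hypothetical.

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.store.FSDirectory;

  public class SlabIndexer {
    public static void indexInSlabs(int numSlabs) throws Exception {
      Directory[] slabs = new Directory[numSlabs];
      for (int i = 0; i < numSlabs; i++) {
        // Build each slab as its own small on-disk index.
        slabs[i] = FSDirectory.getDirectory("/tmp/slab" + i, true);
        IndexWriter slabWriter = new IndexWriter(slabs[i], new StandardAnalyzer(), true);
        Document doc;
        while ((doc = nextDocument(i)) != null)   // hypothetical document source
          slabWriter.addDocument(doc);
        slabWriter.optimize();
        slabWriter.close();
      }
      // Merge all slab indexes into the final index in one pass.
      IndexWriter merged = new IndexWriter("/tmp/final-index", new StandardAnalyzer(), true);
      merged.addIndexes(slabs);
      merged.optimize();
      merged.close();
    }

    // Hypothetical stub standing in for the real per-slab document feed.
    static Document nextDocument(int slab) { return null; }
  }

Keeping each slab small bounds the memory the writer needs at any one time; the final merge pays the cost once instead of on every segment merge during the run.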
Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents
The parser is pdfBox. PDF is about 25% of the overall indexing volume on the production system; I also have Word docs and loads of html resources to be indexed. In my testing environment I have merely 5 pdf docs, and still those permanent objects hanging around, though. Cheers, Daniel

Ben Litchfield wrote: What PDF parser are you using? Is the problem within the parser and not lucene? Are you releasing all resources? Ben
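Ben's question about releasing resources is worth illustrating. With the PDFBox of that era, extraction and cleanup would look roughly like this; a sketch only, assuming the early org.pdfbox package layout (the exact load() signature may differ between PDFBox versions, and the file path is hypothetical):

  import java.io.FileInputStream;
  import org.pdfbox.pdmodel.PDDocument;
  import org.pdfbox.util.PDFTextStripper;

  public class PdfText {
    public static String extractText(String path) throws Exception {
      FileInputStream in = new FileInputStream(path);
      PDDocument pdf = PDDocument.load(in);
      try {
        // The returned text is what gets handed to the Lucene analyzer.
        return new PDFTextStripper().getText(pdf);
      } finally {
        pdf.close();   // skipping this keeps parser buffers alive across documents
        in.close();
      }
    }
  }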
Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents
Hi all, here is some update for you: I switched back to Lucene 1.3-final, and now the number of SegmentTermEnum objects is controlled by gc again: it goes up to about 1000 and then drops back to 254 after indexing my 1900 test objects. Stay tuned, I will try 1.4RC3 now, the last version before FieldCache was introduced... Daniel

Rupinder Singh Mazara schrieb: hi all, I had a similar problem [...]
Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents
Okay, that (1.4rc3) worked fine, too! Got only 257 SegmentTermEnums for 1900 objects. Now I will go for the final test on the production server, with the 1.4rc3 version and about 40,000 objects. Daniel

Daniel Taurat schrieb: Hi all, here is some update for you: I switched back to Lucene 1.3-final [...]
Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents
It sounds like the ThreadLocal in TermInfosReader is not getting correctly garbage collected when the TermInfosReader is collected. Researching a bit, this was a bug in JVMs prior to 1.4.2, so my guess is that you're running in an older JVM. Is that right? I've attached a patch which should fix this. Please tell me if it works for you. Doug

Daniel Taurat wrote: Okay, that (1.4rc3) worked fine, too! Got only 257 SegmentTermEnums for 1900 objects. [...]
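The patch itself is not reproduced in the archive, but the pattern it addresses can be sketched. This is an illustration of the general ThreadLocal-caching pattern, not the actual TermInfosReader source:

  // Each reader caches one expensive object per thread; close() clears the
  // entry explicitly instead of relying on the JVM to collect stale
  // ThreadLocal values, which was unreliable before 1.4.2.
  public class PerThreadCache {
    private final ThreadLocal cache = new ThreadLocal();

    public Object get() {
      Object value = cache.get();
      if (value == null) {
        value = new Object();   // stands in for a cloned SegmentTermEnum
        cache.set(value);
      }
      return value;
    }

    public void close() {
      cache.set(null);   // eagerly release the calling thread's entry
    }
  }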
Re: indexing size
Dmitry Serebrennikov wrote: Niraj Alok wrote: Hi PA, Thanks for the detail! Since we are using lucene to store the data also, I guess I would not be able to use it.

By the way, I could be wrong, but I think the 35% figure you referenced in your first e-mail actually does not include any stored fields. The deal with 35% was, I think, to illustrate that the index data structures Lucene uses for searching are efficient. But Lucene does nothing special about stored content, no compression or anything like that. So you end up with the pure size of your data plus the 35% for the indexed data. Cheers. Dmitry.

There will be a patch available by the end of this week which allows you to store binary values compressed within a lucene index. It means that you will be able to store and retrieve whole documents within lucene in a very efficient way ;-) regards bernhard
Out of memory in lucene 1.4.1 when re-indexing large number of documents
Hi, I am facing an out of memory problem using Lucene 1.4.1. I am re-indexing a pretty large number (about 30,000) of documents. I identify old instances by checking for a unique ID field, delete those with indexReader.delete(), and add the new document version. A heap dump says I have a huge number of HashMaps with SegmentTermEnum objects (256891). The IndexReader is closed directly after delete(term)... It seems to me that this did not happen with version 1.2 (same number of objects and all...). Has anyone an idea how I end up with these hanging objects? Or what to do in order to avoid them? Thanks, Daniel
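The delete-then-re-add cycle Daniel describes looks roughly like this in the 1.4-era API. A minimal sketch: the index path, the "uid" field name, and StandardAnalyzer are assumptions for illustration.

  import java.io.IOException;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.Term;

  public class Reindexer {
    public static void reindexDocument(String indexPath, String uid, Document newVersion)
        throws IOException {
      // Delete every old instance identified by the unique ID field.
      IndexReader reader = IndexReader.open(indexPath);
      reader.delete(new Term("uid", uid));
      reader.close();   // must be closed before a writer opens the index

      // Re-add the new version of the document.
      IndexWriter writer = new IndexWriter(indexPath, new StandardAnalyzer(), false);
      writer.addDocument(newVersion);
      writer.close();
    }
  }

In practice the deletes are usually batched through one reader before a single writer re-adds the new versions; opening a reader and a writer per document is expensive.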
Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents
On Thursday 09 September 2004 19:47, Daniel Taurat wrote: I am facing an out of memory problem using Lucene 1.4.1.

Could you try with a recent CVS version? There has been a fix about files not being deleted after 1.4.1. Not sure if that could cause the problems you're experiencing. Regards, Daniel -- http://www.danielnaber.de
Re: indexing size
Niraj Alok wrote: Hi PA, Thanks for the detail! Since we are using lucene to store the data also, I guess I would not be able to use it.

By the way, I could be wrong, but I think the 35% figure you referenced in your first e-mail actually does not include any stored fields. The deal with 35% was, I think, to illustrate that the index data structures used for searching by Lucene are efficient. But Lucene does nothing special about stored content - no compression or anything like that. So you end up with the pure size of your data plus the 35% of the indexed data. Cheers. Dmitry.
Re: indexing size
Hi Niraj, I'd rather respond to the list, as others may be interested in your questions, and since I don't consider myself a guru, I appreciate being corrected. For a title, I'd say yes: use the Field.Text(String name, String value) constructor, not the ones that take a Reader, as those do not store the value. You want it to be: 1) tokenised (so that its fragments are saved for searching, not only the totality of the text); 2) indexed, to make it searchable; 3) stored, to make the field retrievable from the index. hth, sv p.s. my name is Stephane; it's been a while since I've been in Oz, and since I've been called James.

On Wed, 1 Sep 2004, Niraj Alok wrote: Hi James, Since this is a minor issue I am not posting it on the lucene list. Let's say I have one field, title, which has a value of George Bush. I would need to search on that title and also retrieve its value. So you are saying that I should have it as Field.Text? Also, if I need to just search on that title but want to retrieve the value of another field, content, then title should be unstored while content should be stored? Regards, Niraj [...]
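For reference, the distinction Stephane is drawing comes down to which Field.Text variant is used; a minimal sketch with hypothetical field names and file (in the Lucene 1.x API, the String variant stores the value, the Reader variant does not):

  import java.io.FileReader;
  import java.io.IOException;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;

  public class FieldTextDemo {
    public static Document makeDoc() throws IOException {
      Document doc = new Document();
      // tokenised + indexed + stored: searchable and retrievable
      doc.add(Field.Text("title", "George Bush"));
      // tokenised + indexed, NOT stored: searchable only
      doc.add(Field.Text("content", new FileReader("article.txt")));
      return doc;
    }
  }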
Re: indexing size
Thanks a lot Stephane and Otis for your detailed explanations. I am now on the path to making a judicious choice between the different options on offer, and hope to reduce the overall size. Will surely get back if there are any more hiccups (hope not!). Thanks again! Niraj

- Original Message - From: Stephane James Vaucher Sent: Wednesday, September 01, 2004 12:48 PM Subject: Re: indexing size [...]
Re: indexing size
Hi Niraj, On Sep 01, 2004, at 06:45, Niraj Alok wrote: If I make some of them Field.UnStored, I can see from the javadocs that it will be indexed and tokenized but not stored. If it is not stored, how can I use it while searching?

The different types of fields don't impact how you do your search. This is always the same. Using UnStored fields simply means that you use Lucene as a pure index, for search purposes only, not for storing any data. Specifically, the assumption is that your original data lives somewhere else, outside of Lucene. If this assumption is true, then you can index everything as UnStored with the addition of one Keyword per document. The Keyword field holds some sort of unique identifier which allows you to retrieve the original data if necessary (e.g. a primary key, an URI, what not). Here is an example of this approach: (1) For indexing, check the indexValuesWithID() method http://cvs.sourceforge.net/viewcvs.py/zoe/ZOE/Frameworks/SZObject/SZIndex.java?view=markup and note the addition of a Field.Keyword for each document and the use of Field.UnStored for everything else. (2) For fetching, check objectsWithSpecificationAndHitsInStore() http://cvs.sourceforge.net/viewcvs.py/zoe/ZOE/Frameworks/SZObject/SZFinder.java?view=markup HTH. Cheers, PA.
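PA's approach end to end, as a minimal sketch against the 1.4-era API; the index path, field names, identifier, and query string are all assumptions for illustration:

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.queryParser.QueryParser;
  import org.apache.lucene.search.Hits;
  import org.apache.lucene.search.IndexSearcher;

  public class KeywordIdDemo {
    public static void main(String[] args) throws Exception {
      // Index: the Keyword identifier is the only retrievable field.
      IndexWriter writer = new IndexWriter("/tmp/index", new StandardAnalyzer(), true);
      Document doc = new Document();
      doc.add(Field.Keyword("uri", "db://article/42"));
      doc.add(Field.UnStored("contents", "full text of the article goes here"));
      writer.addDocument(doc);
      writer.close();

      // Search: hits hand back only the identifier; the data lives elsewhere.
      IndexSearcher searcher = new IndexSearcher("/tmp/index");
      Hits hits = searcher.search(
          QueryParser.parse("article", "contents", new StandardAnalyzer()));
      for (int i = 0; i < hits.length(); i++)
        System.out.println(hits.doc(i).get("uri"));
      searcher.close();
    }
  }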
Re: indexing size
Hi PA, Thanks for the detail! Since we are using lucene to store the data also, I guess I would not be able to use it. Regards, Niraj

- Original Message - From: petite_abeille Sent: Wednesday, September 01, 2004 1:14 PM Subject: Re: indexing size [...]
Re: indexing size
Are you using a pre-1.4.1 version of Lucene? There was a bug in one of the older versions that left multiple old index files around instead of deleting them. Maybe that's what is using up the disk space. Give us your index directory's 'ls -al' or 'dir'. Otis

--- Niraj Alok wrote: Hi Guys, If you have any ideas, please help me out. I have looked into most of the lucene archives and they are suggesting what I am currently doing. So the only possible solution for me right now would be to reduce the no. of fields, which could severely change the logic used for searching. Regards, Niraj

- Original Message - From: Niraj Alok Sent: Tuesday, August 31, 2004 11:17 AM Subject: indexing size

Hi, I am indexing plain xml files, the total size of which is around 100 MB. I am creating two indexes for different modules, and they are stored in different directories as I am not merging them. The problem is that the combined size of these indexes is about 300 MB (3 times the data size), which is in contrast to the 35% I have read it should create. Both these indexes have different fields and different data stored in them, and hence there is no duplication occurring. I have one indexwriter for each index. After both indexes have been created, I am simply calling optimize on these two writers and closing them. Is there something I am doing wrong? I am using writer.addDocument(doc). Regards, Niraj
Re: indexing size
[directory listing, abridged in the archive: dozens of per-field files _4dkv.fNN, 183,796 bytes each, followed by]
21/08/2004 17:30    206,637,045 _4dkv.fdt
21/08/2004 17:30      1,470,368 _4dkv.fdx
21/08/2004 17:29          5,509 _4dkv.fnm
21/08/2004 17:31     30,953,033 _4dkv.frq
21/08/2004 17:31     29,334,297 _4dkv.prx
21/08/2004 17:31        225,415 _4dkv.tii
21/08/2004 17:31     16,814,807 _4dkv.tis
455 File(s) 367,413,520 bytes
2 Dir(s) 6,854,688,768 bytes free

Regards, Niraj

- Original Message - From: Otis Gospodnetic Sent: Tuesday, August 31, 2004 6:02 PM Subject: Re: indexing size [...]
Re: indexing size
On Aug 31, 2004, at 17:17, Otis Gospodnetic wrote: You also have a large number of fields, and it looks like a lot (all?) of them are stored and indexed. That's what that large .fdt file indicated.

That file is 206 MB in size. Try using Field.UnStored() to avoid storing all that data in your indices, as it's usually not necessary. PA.
Re: indexing size
I was also thinking along the same lines. Actually, the original code was written by someone else who has left, so I have to own it. In almost all places it is Field.Text, and in a few places it's Field.UnIndexed. I looked at the javadocs and found that there is Field.UnStored as well. The problem is I am not too sure which one to change to what. It would be really enlightening if you could point out the differences between those three and what I would need to change in my search code. If I make some of them Field.UnStored, I can see from the javadocs that the value will be indexed and tokenized but not stored. If it is not stored, how can I use it while searching? Basically, what is meant by indexed and stored, indexed and not stored, and not indexed and stored? Regards, Niraj

- Original Message - From: petite_abeille Sent: Tuesday, August 31, 2004 8:57 PM Subject: Re: indexing size [...]
Re: indexing size
On Wed, 1 Sep 2004, Niraj Alok wrote: [...] Basically, what is meant by indexed and stored, indexed and not stored, and not indexed and stored?

If all you need is to search a field, you do not need to store it. Even if it is not stored, it can still be tokenised and analysed by lucene; it will then be kept only as a set of tokens, not as a whole. You can thus use it for fields that you never need to retrieve from the index. For example, "the quick brown fox jumped over the lazy dog." will be kept in lucene only as tokens, not as a whole. So, using a whitespace analyser with the stopword list {the}, you will have these tokens in lucene: quick brown fox jumped over lazy dog. You will NOT be able to retrieve the original text, but you will be able to search it. HTH, sv
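Stephane's token example can be checked directly. A sketch, with StopAnalyzer standing in for his "whitespace analyser with a stopword list" (note StopAnalyzer also lowercases and strips punctuation):

  import java.io.StringReader;
  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.StopAnalyzer;
  import org.apache.lucene.analysis.Token;
  import org.apache.lucene.analysis.TokenStream;

  public class TokenDemo {
    public static void main(String[] args) throws Exception {
      Analyzer analyzer = new StopAnalyzer(new String[] { "the" });
      TokenStream stream = analyzer.tokenStream("contents",
          new StringReader("the quick brown fox jumped over the lazy dog."));
      Token token;
      while ((token = stream.next()) != null)
        System.out.println(token.termText());
      // prints: quick brown fox jumped over lazy dog
    }
  }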
RE: Indexing and Searching Database in Lucene
You need to create a lucene index from the database: just index the columns and the records from the database. It will also be useful to have a field in lucene that contains the database's primary key, so you can retrieve the actual record from the database. Aviran

-Original Message- From: sivalingam T Sent: Friday, August 20, 2004 10:55 AM Subject: Indexing and Searching Database in Lucene

Hi, can we index and search a database with the Lucene search engine? If anybody has done this, please send a reply. With Warm Regards, Sivalingam.T Sai Eswar Innovations (P) Ltd, Chennai-92
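Aviran's advice, sketched end to end: the JDBC URL, table, and column names are hypothetical, and the Lucene calls are the 1.4-era API.

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.ResultSet;
  import java.sql.Statement;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexWriter;

  public class DbIndexer {
    public static void main(String[] args) throws Exception {
      Connection conn = DriverManager.getConnection("jdbc:...", "user", "pass"); // hypothetical
      Statement stmt = conn.createStatement();
      ResultSet rs = stmt.executeQuery("SELECT id, title, body FROM articles");  // hypothetical table

      IndexWriter writer = new IndexWriter("/tmp/db-index", new StandardAnalyzer(), true);
      while (rs.next()) {
        Document doc = new Document();
        doc.add(Field.Keyword("id", rs.getString("id")));      // primary key: retrievable
        doc.add(Field.Text("title", rs.getString("title")));   // searchable and retrievable
        doc.add(Field.UnStored("body", rs.getString("body"))); // searchable only
        writer.addDocument(doc);
      }
      writer.optimize();
      writer.close();
      rs.close();
      stmt.close();
      conn.close();
    }
  }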
Re: Indexing and Searching Database in Lucene
Funny thing is, I was thinking of doing something like this just today. This is especially good when you perform a lot of queries using the LIKE statement; Lucene would increase search performance a great deal.

Aviran wrote: You need to create a lucene index from the database. [...]

-- Don Vaillancourt, Director of Software Development, WEB IMPACT INC., http://www.web-impact.com
Re: indexing help
Hi John, The source code is available from CVS; make it non-final and do what you need to do. Of course, you may have a hard time finding help later if you aren't using something everyone else is and your solution doesn't work... :-) If I understand correctly what you are trying to do, you already know all of the answers for indexing, you just want Lucene to do the retrieval side of the coin, correct? I suppose a crazy idea might be to write a program that took your info and output it in the Lucene file format, but that seems a bit like overkill. -Grant

[EMAIL PROTECTED] 07/07/04 07:37PM: Hi Doug: Thanks for the response! The solution you proposed is still a derivative of creating a dummy document stream. Taking the same example, java (5), lucene (6): VectorTokenStream would create a total of 11 Tokens, whereas only 2 are necessary. Given many documents with many terms and frequencies, it would create many extra Token instances. The reason I was looking at deriving the Field class is that I could directly manipulate the FieldInfo by setting the frequency. But the class is final... Any other suggestions? Thanks, -John

On Wed, 07 Jul 2004 14:20:24 -0700, Doug Cutting wrote:

John Wang wrote: While lucene tokenizes the words in the document, it counts the frequency and figures out the position; we are trying to bypass this stage. For each document, I have a set of words with a known frequency, e.g. java (5), lucene (6), etc. (I don't care about the position, so it can always be 0.) What I can do now is to create a dummy document, e.g. "java java java java java lucene lucene lucene lucene lucene", and pass it to lucene. This seems hacky and cumbersome. Is there a better alternative? I browsed around in the source code, but couldn't find anything.

Write an analyzer that returns terms with the appropriate distribution. For example:

  public class VectorTokenStream extends TokenStream {
    private String[] terms;  // field declarations added; the posting omitted them
    private int[] freqs;
    private int term = -1;   // initialized to -1 so terms[0] is not skipped
    private int freq;

    public VectorTokenStream(String[] terms, int[] freqs) {
      this.terms = terms;
      this.freqs = freqs;
    }

    public Token next() {
      if (freq == 0) {
        term++;
        if (term >= terms.length)
          return null;
        freq = freqs[term];
      }
      freq--;
      return new Token(terms[term], 0, 0);
    }
  }

  Document doc = new Document();
  doc.add(Field.Text("content", ""));
  indexWriter.addDocument(doc, new Analyzer() {
    public TokenStream tokenStream(String field, Reader reader) {
      return new VectorTokenStream(new String[] { "java", "lucene" },
                                   new int[] { 5, 6 });
    }
  });

John Wang wrote: Too bad the Field class is final, otherwise I could derive from it and do something along those lines...

Extending Field would not help. That's why it's final. Doug
Re: indexing help
Hi Grant: Thanks for the options. How likely is it that the lucene file formats will change? Are there really no more options? :(... Thanks, -John

On Thu, 08 Jul 2004 08:50:44 -0400, Grant Ingersoll wrote: Hi John, The source code is available from CVS, make it non-final and do what you need to do. [...]
Re: indexing help
Hi Grant: I have something that extracts only the important words from a document, along with their importance; furthermore, these important words may not physically be in the document, as they could be synonyms of some of the words in the document. So the output of the process for a document is a list of word/importance pairs. I want to be able to query on only these words for the document. I don't think Lucene has such a capability. Can you suggest what I can do with the analyzer process to achieve this without replicating words/tokens? Thanks, -John

On Thu, 08 Jul 2004 11:10:07 -0400, Grant Ingersoll wrote: Hey John, Those are just options, didn't say they were good ones! :-) I guess the real question is: what is the background of what you are trying to do? Presumably you have some other program that is generating frequencies for you; do you really need that in its current form? Can't the Lucene indexing engine act as a stand-in for this process, since your end result _should_ be the same? The Lucene Analyzer process is quite flexible; I bet you could even find a way to hook your existing tools into the Analyzer process. -Grant [...]
For example: public class VectorTokenStream extends TokenStream { private int term; private int freq; public VectorTokenStream(String[] terms, int[] freqs) { this.terms = terms; this.freqs = freqs; } public Token next() { if (freq == 0) { term++; if (term = terms.length) return null; freq = freqs[term]; } freq--; return new Token(terms[term], 0, 0); } } Document doc = new Document(); doc.add(Field.Text(content, )); indexWriter.addDocument(doc, new Analyzer() { public TokenStream tokenStream(String field, Reader reader) { return new VectorTokenStream(new String[] {java,lucene}, new int[] {5,6}); } }); Too bad the Field class is final, otherwise I can derive from it and do something on that line... Extending Field would not help. That's why it's final. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional
Re: indexing help
Thanks Doug. I will do just that. Just for my education, can you maybe elaborate on the "implement an IndexReader that delivers a synthetic index" approach? Thanks in advance -John

On Thu, 08 Jul 2004 10:01:59 -0700, Doug Cutting [EMAIL PROTECTED] wrote: John Wang wrote: The solution you proposed is still a derivative of creating a dummy document stream. Taking the same example, java (5), lucene (6), VectorTokenStream would create a total of 11 Tokens whereas only 2 are necessary. That's easy to fix. We just need to reuse the token:

    public class VectorTokenStream extends TokenStream {
      private String[] terms;
      private int[] freqs;
      private int term = -1;
      private int freq = 0;
      private Token token;

      public VectorTokenStream(String[] terms, int[] freqs) {
        this.terms = terms;
        this.freqs = freqs;
      }

      public Token next() {
        if (freq == 0) {
          term++;
          if (term >= terms.length) return null;
          token = new Token(terms[term], 0, 0);
          freq = freqs[term];
        }
        freq--;
        return token;
      }
    }

Then only two Tokens are created, as you desire. If you for some reason don't want to create a dummy document stream, then you could instead implement an IndexReader that delivers a synthetic index for a single document. Then use IndexWriter.addIndexes() to turn this into a real, FSDirectory-based index. However, that would be a lot more work and only very marginally faster, so I'd stick with the approach I've outlined above. (Note: this code has not been compiled or run. It may have bugs.) Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: indexing help
John Wang wrote: Just for my education, can you maybe elaborate on the "implement an IndexReader that delivers a synthetic index" approach? IndexReader is an abstract class. It has few data fields and few non-static methods that are not implemented in terms of abstract methods, so, in effect, it is an interface. When Lucene indexes a token stream, it creates a single-document index that is then merged with other single- and multi-document indexes to form the index that is searched. You could bypass the first step (indexing a token stream) by instead directly implementing all of IndexReader's abstract methods to return the same thing as the single-document index that Lucene would create. This would be marginally faster, as no Token objects would be created at all. But since IndexReader has a lot of abstract methods, it would be a lot of work. I didn't really mean it as a practical suggestion. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: indexing help
John Wang wrote: While Lucene tokenizes the words in the document, it counts the frequency and figures out the position; we are trying to bypass this stage. For each document, I have a set of words with a known frequency, e.g. java (5), lucene (6), etc. (I don't care about the position, so it can always be 0.) What I can do now is to create a dummy document, e.g. "java java java java java lucene lucene lucene lucene lucene", and pass it to Lucene. This seems hacky and cumbersome. Is there a better alternative? I browsed around in the source code, but couldn't find anything. Write an analyzer that returns terms with the appropriate distribution. For example:

    public class VectorTokenStream extends TokenStream {
      private String[] terms;
      private int[] freqs;
      private int term = -1;
      private int freq = 0;

      public VectorTokenStream(String[] terms, int[] freqs) {
        this.terms = terms;
        this.freqs = freqs;
      }

      public Token next() {
        if (freq == 0) {
          term++;
          if (term >= terms.length) return null;
          freq = freqs[term];
        }
        freq--;
        return new Token(terms[term], 0, 0);
      }
    }

    Document doc = new Document();
    doc.add(Field.Text("content", ""));
    indexWriter.addDocument(doc, new Analyzer() {
      public TokenStream tokenStream(String field, Reader reader) {
        return new VectorTokenStream(new String[] {"java", "lucene"},
                                     new int[] {5, 6});
      }
    });

Too bad the Field class is final, otherwise I could derive from it and do something along those lines... Extending Field would not help. That's why it's final. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: indexing help
Hi Doug: Thanks for the response! The solution you proposed is still a derivative of creating a dummy document stream. Taking the same example, java (5), lucene (6), VectorTokenStream would create a total of 11 Tokens whereas only 2 are necessary. Given many documents with many terms and frequencies, it would create many extra Token instances. The reason I was looking at deriving the Field class is that I could directly manipulate the FieldInfo by setting the frequency. But the class is final... Any other suggestions? Thanks -John

On Wed, 07 Jul 2004 14:20:24 -0700, Doug Cutting [EMAIL PROTECTED] wrote: [...] Write an analyzer that returns terms with the appropriate distribution. [...] Extending Field would not help. That's why it's final. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: indexing incrementally concurrently
On Jul 5, 2004, at 9:00 AM, Michael Wechner wrote: If several users are saving documents on the server concurrently, and during saving the index shall be updated incrementally ... do I have to make sure that it's going to be thread-safe, or does Lucene take care of this? Only a single IndexWriter instance at a time can be used - so you will need to coordinate things. Multiple threads can share a single IndexWriter though, so no worries there. Erik - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
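[Editor's note: a minimal sketch of the pattern Erik describes - every saving thread funnels its documents through one long-lived IndexWriter. The class name and index path are hypothetical, not from the thread.]

    import java.io.IOException;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;

    // Hypothetical wrapper: one IndexWriter shared by all saving threads.
    public class SharedIndexer {
        private IndexWriter writer;

        public SharedIndexer(String indexDir, boolean create) throws IOException {
            // Opened once; never open a second IndexWriter on the same directory.
            writer = new IndexWriter(indexDir, new StandardAnalyzer(), create);
        }

        public void save(Document doc) throws IOException {
            // Per the thread above, several threads may call addDocument()
            // against the same writer.
            writer.addDocument(doc);
        }

        public void close() throws IOException {
            writer.close();
        }
    }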
Re: indexing incrementally concurrently
Erik Hatcher wrote: On Jul 5, 2004, at 9:00 AM, Michael Wechner wrote: If several users are saving documents on the server concurrently, and during saving the index shall be updated incrementally ... do I have to make sure that it's going to be thread-safe, or does Lucene take care of this? Only a single IndexWriter instance at a time can be used - so you will need to coordinate things. Multiple threads can share a single IndexWriter though, so no worries there. OK. Thanks very much for the info. Michi -- Michael Wechner Wyona Inc. - Open Source Content Management - Apache Lenya http://www.wyona.com http://cocoon.apache.org/lenya/ [EMAIL PROTECTED][EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: indexing with 1.4-rc3 only yields single .cfs file
Otis, can you please explain why 1.4-rc3 leaves old files like _*.cfs in the index folder after optimization? References to them can also be found in the 'deletable' file. Is it a bug? We switched from the multi-file to the compound index structure in one of the recent RCs. This should be mentioned in the CHANGES.txt file. The change was made to make it more difficult for people to reach 'Too many open files' situations. Otis --- Claude Devarenne [EMAIL PROTECTED] wrote: Hi, I just upgraded to 1.4-rc3 and re-indexed my data. I did not change any code and noticed that in the index directory there is a single .cfs file, which I am guessing stands for compound file system. Search works fine, but after checking out the latest from CVS I did not see this mentioned in the file formats documentation. Is this the normal behavior for indexes from now on, or is something else going on? When creating the index I see the .tis, .frq and other files being created. Maybe I need to update my indexer; sorry if I did not RTFM. Claude - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
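[Editor's note: for anyone who prefers the old multi-file layout, a small sketch assuming the 1.4-era IndexWriter.setUseCompoundFile() setter; the index path is hypothetical.]

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    public class CompoundFormatDemo {
        public static void main(String[] args) throws Exception {
            IndexWriter writer = new IndexWriter("/tmp/demo-index",  // hypothetical path
                                                 new StandardAnalyzer(), true);
            // 1.4 writes one .cfs per segment by default; pass false to get
            // the older multi-file (.tis, .frq, ...) layout back.
            writer.setUseCompoundFile(false);
            writer.close();
        }
    }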
Re: indexing with 1.4-rc3 only yields single .cfs file
We switched from the multi-file to the compound index structure in one of the recent RCs. This should be mentioned in the CHANGES.txt file. The change was made to make it more difficult for people to reach 'Too many open files' situations. Otis --- Claude Devarenne [EMAIL PROTECTED] wrote: Hi, I just upgraded to 1.4-rc3 and re-indexed my data. I did not change any code and noticed that in the index directory there is a single .cfs file, which I am guessing stands for compound file system. Search works fine, but after checking out the latest from CVS I did not see this mentioned in the file formats documentation. Is this the normal behavior for indexes from now on, or is something else going on? When creating the index I see the .tis, .frq and other files being created. Maybe I need to update my indexer; sorry if I did not RTFM. Claude - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Indexing japanese PDF documents
I have not tried these other tools yet. Have you asked Ben Litchfield, the PDFBox author, about its handling of Japanese text? Otis --- Chandan Tamrakar [EMAIL PROTECTED] wrote: I am using the latest PDFBox library for parsing. I can parse English documents successfully, but when I parse a document containing English and Japanese I do not get what I expected. Has anyone tried using the PDFBox library for parsing Japanese documents? Or do I need to use another parser like xPDF or Jpedal? Thanks in advance, Chandan - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Indexing japanese PDF documents
Yes he did, but I was away the past couple of days. As this is more of a PDFBox issue, I responded in the PDFBox forums; please follow the thread there if you are interested. Ben On Mon, 22 Mar 2004, Otis Gospodnetic wrote: I have not tried these other tools yet. Have you asked Ben Litchfield, the PDFBox author, about its handling of Japanese text? Otis --- Chandan Tamrakar [EMAIL PROTECTED] wrote: I am using the latest PDFBox library for parsing. I can parse English documents successfully, but when I parse a document containing English and Japanese I do not get what I expected. Has anyone tried using the PDFBox library for parsing Japanese documents? Or do I need to use another parser like xPDF or Jpedal? Thanks in advance, Chandan - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Indexing HTML
How do I index an HTML document which may have any encoding, like EUC, SJIS, Western European, or UTF-8? Can I parse and extract the HTML into a string and then convert it into a text file in Unicode? Is this an appropriate way to index HTML files? Can anyone suggest a simple parser other than the one found in the Lucene demo? Also, how do I find the encoding of files? Whenever there are ANSI text files containing Japanese characters I am not able to convert them into UTF-16; Lucene indexes properly when I convert into SJIS. Thanks, Chandan - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
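[Editor's note: Lucene itself only sees Java Strings, which are already Unicode, so the usual fix is to decode the file with the right charset while reading it. A minimal sketch using only the standard library; the class name and charset choices are illustrative.]

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.InputStreamReader;

    // Read a file with an explicit character encoding so the resulting
    // String is correct Unicode before it is handed to an Analyzer.
    public class EncodedFileReader {
        public static String read(String path, String encoding) throws Exception {
            BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(path), encoding));
            StringBuffer buf = new StringBuffer();
            char[] chunk = new char[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.append(chunk, 0, n);
            }
            in.close();
            return buf.toString();
        }
    }
    // e.g. read("page.html", "Shift_JIS") or read("page.html", "EUC-JP")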
RE: Indexing multiple instances of the same field for each document
I don't have access to the process that created the XML; it was done in the past. As I stated at the beginning of this thread, this is just an example of the type of thing I'm trying to accomplish. I think the real issue here is that the fields are being inserted in reverse order. Here are the comments in the code (for Document.add()): /** Adds a field to a document. Several fields may be added with * the same name. In this case, if the fields are indexed, their text is * treated as though appended for the purposes of search. */ I guess it doesn't specify the order in which they're appended; however, when I read that comment, I thought that it meant the order added. It's a pretty simple change to the Document class to make this work as I'd expect. From Doug's initial response, I think he expected this behavior as well. Thanks again for all your help! Roy -Original Message- From: Erik Hatcher [mailto:[EMAIL PROTECTED] Sent: Sunday, February 29, 2004 9:10 AM To: Lucene Users List Subject: Re: Indexing multiple instances of the same field for each document What you are doing is really the job of an Analyzer. You are doing pre-analysis, when instead you could do all of this within the context of a custom analyzer and avoid many of these issues altogether. Do you use the XML only during indexing? If so, you could bypass the whole conversion to XML and then back through Digester all within an analyzer. Or am I missing something that prevents you from doing it this way? Erik On Feb 28, 2004, at 10:05 PM, Roy Klein wrote: Erik, Here's a brief example of the type of thing I'm trying to do: I have a file that contains the words: "The quick brown fox jumped over the lazy dog." I run that file through a utility that produces the following XML document:

    <document>
      <field name="wordposition1">
        <word>The</word>
      </field>
      <field name="wordposition2">
        <word>quick</word>
        <word>fast</word>
        <word>speedy</word>
      </field>
      <field name="wordposition3">
        <word>brown</word>
        <word>tan</word>
        <word>dark</word>
      </field>
      . . .

I parse that document (via the Digester), and add all the words from each of the fields to one Lucene field: contents. The tricky part is that I want to have each word position contain all the words at that position in the Lucene index, i.e. word location 1 in the index contains "The"; word location 2: "quick", "fast", and "speedy"; word location 3: "brown", "tan", and "dark"; etc. That way, all the following phrase queries will match this document: "fast tan", "quick brown", "fast brown". I wrote a TermAnalyzer that adds all the words from a field into the index at the same position (via setPositionIncrement(0)). That way I can simply add each set of words to the contents field, and it'll just keep adding them to the same field. However, since it's reversing them, I can't match phrases. Roy - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
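[Editor's note: a sketch of the kind of token stream Roy describes - synonyms stacked at one position via setPositionIncrement(0), with group order preserved so phrase queries still work. The class name and the grouping format are hypothetical; this is not Roy's actual TermAnalyzer.]

    import java.io.IOException;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;

    // Emits every word of each group; the first word of a group gets a
    // position increment of 1, the rest get 0, so all land at one position.
    public class SamePositionTokenStream extends TokenStream {
        private String[][] groups;  // groups[pos] = the words at that position
        private int pos = 0;        // current group
        private int i = 0;          // current word within the group

        public SamePositionTokenStream(String[][] groups) {
            this.groups = groups;
        }

        public Token next() throws IOException {
            if (pos >= groups.length) return null;
            Token t = new Token(groups[pos][i], 0, 0);
            t.setPositionIncrement(i == 0 ? 1 : 0);  // stack synonyms
            if (++i == groups[pos].length) { pos++; i = 0; }
            return t;
        }
    }

    // e.g. new SamePositionTokenStream(new String[][] {
    //     {"The"}, {"quick", "fast", "speedy"}, {"brown", "tan", "dark"} });
    // then phrases like "fast brown" and "quick brown" both match.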
RE: Indexing multiple instances of the same field for each document
Thanks Doug! I was in the midst of testing my fix to it and noticed your checkin... Roy -Original Message- From: Doug Cutting [mailto:[EMAIL PROTECTED] Sent: Monday, March 01, 2004 12:33 PM To: Lucene Users List Subject: Re: Indexing multiple instances of the same field for each document Erik Hatcher wrote: On Feb 27, 2004, at 6:17 PM, Doug Cutting wrote: I think it's document.add(). Fields are pushed onto the front, rather than added to the end. Ah, ok DocumentFieldList/DocumentFieldEnumeration are the culprits. This is certainly a bug. Yes, a bug that's been there since the genesis of Lucene, six years ago. It is surprising that something like this could go so long unnoticed. I just fixed this in CVS. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
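[Editor's note: to make the bug concrete, a tiny sketch; the field name and values are made up for illustration.]

    import java.util.Enumeration;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class FieldOrderDemo {
        public static void main(String[] args) {
            Document doc = new Document();
            doc.add(Field.Text("contents", "first"));
            doc.add(Field.Text("contents", "second"));
            // Before Doug's checkin, enumerating the fields yielded "second"
            // before "first" (fields were pushed onto the front of an internal
            // list); after the fix they come back in the order they were added.
            Enumeration e = doc.fields();
            while (e.hasMoreElements()) {
                System.out.println(e.nextElement());
            }
        }
    }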
Re: Indexing multiple instances of the same field for each document
Roy Klein wrote: Erik, Indexing a single field in chunks solves a design problem I'm working on. It's not the only way to do it, but it would certainly be the most straightforward. However, if using this method makes phrase searching unusable, then I'll have to go another route. Hmm, wouldn't it be easier to index only one term for a list of synonyms instead of indexing each synonym for one term? quick, fast, speedy - quick (both when building the index and building the query). This would also solve your problems with the (somewhat counterintuitive but probably well-reasoned) behaviour of Lucene adding fields with the same name at the beginning instead of appending them. Markus Here's a brief example of the type of thing I'm trying to do: I have a file that contains the words: "The quick brown fox jumped over the lazy dog." I run that file through a utility that produces the following XML document:

    <document>
      <field name="wordposition1">
        <word>The</word>
      </field>
      <field name="wordposition2">
        <word>quick</word>
        <word>fast</word>
        <word>speedy</word>
      </field>
      <field name="wordposition3">
        <word>brown</word>
        <word>tan</word>
        <word>dark</word>
      </field>
      . . .

I parse that document (via the Digester), and add all the words from each of the fields to one Lucene field: contents. The tricky part is that I want to have each word position contain all the words at that position in the Lucene index, i.e. word location 1 in the index contains "The"; word location 2: "quick", "fast", and "speedy"; word location 3: "brown", "tan", and "dark"; etc. That way, all the following phrase queries will match this document: "fast tan", "quick brown", "fast brown" - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: Indexing multiple instances of the same field for each document
Hi Markus, What you're saying would work if I weren't concerned about query performance. If I add the synonyms at document index time, then I only process the word "quick" once (when I insert the doc into the index). If I process each query to convert "fast" and "speedy" to "quick" at query time, then I might wind up processing those words millions of times (once for each query). Yes, I could come up with a cache so that the processing is kept to a minimum; however, it still makes more sense to do it once, at index time. Roy -Original Message- From: Markus Spath [mailto:[EMAIL PROTECTED] Sent: Sunday, February 29, 2004 5:45 AM To: Lucene Users List Subject: Re: Indexing multiple instances of the same field for each document Roy Klein wrote: Erik, Indexing a single field in chunks solves a design problem I'm working on. It's not the only way to do it, but it would certainly be the most straightforward. However, if using this method makes phrase searching unusable, then I'll have to go another route. Hmm, wouldn't it be easier to index only one term for a list of synonyms instead of indexing each synonym for one term? quick, fast, speedy - quick (both when building the index and building the query). This would also solve your problems with the (somewhat counterintuitive but probably well-reasoned) behaviour of Lucene adding fields with the same name at the beginning instead of appending them. Markus [...] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
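[Editor's note: for comparison, a sketch of Markus's canonical-term approach as a TokenFilter applied at both index and query time. The class name and the hard-coded synonym map are illustrative only.]

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    // Collapses each synonym onto one canonical term, so only "quick" is
    // ever indexed or searched for the quick/fast/speedy group.
    public class CanonicalTermFilter extends TokenFilter {
        private Map canonical = new HashMap();  // synonym -> canonical form

        public CanonicalTermFilter(TokenStream in) {
            super(in);
            canonical.put("fast", "quick");
            canonical.put("speedy", "quick");
        }

        public Token next() throws IOException {
            Token t = input.next();
            if (t == null) return null;
            String mapped = (String) canonical.get(t.termText());
            if (mapped == null) return t;
            return new Token(mapped, t.startOffset(), t.endOffset());
        }
    }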
Re: Indexing multiple instances of the same field for each document
What you are doing is really the job of an Analyzer. You are doing pre-analysis, when instead you could do all of this within the context of a custom analyzer and avoid many of these issues altogether. Do you use the XML only during indexing? If so, you could bypass the whole conversion to XML and then back through Digester all within an analyzer. Or am I missing something that prevents you from doing it this way? Erik On Feb 28, 2004, at 10:05 PM, Roy Klein wrote: Erik, Here's a brief example of the type of thing I'm trying to do: I have a file that contains the words: "The quick brown fox jumped over the lazy dog." I run that file through a utility that produces the following XML document:

    <document>
      <field name="wordposition1">
        <word>The</word>
      </field>
      <field name="wordposition2">
        <word>quick</word>
        <word>fast</word>
        <word>speedy</word>
      </field>
      <field name="wordposition3">
        <word>brown</word>
        <word>tan</word>
        <word>dark</word>
      </field>
      . . .

I parse that document (via the Digester), and add all the words from each of the fields to one Lucene field: contents. The tricky part is that I want to have each word position contain all the words at that position in the Lucene index, i.e. word location 1 in the index contains "The"; word location 2: "quick", "fast", and "speedy"; word location 3: "brown", "tan", and "dark"; etc. That way, all the following phrase queries will match this document: "fast tan", "quick brown", "fast brown". I wrote a TermAnalyzer that adds all the words from a field into the index at the same position (via setPositionIncrement(0)). That way I can simply add each set of words to the contents field, and it'll just keep adding them to the same field. However, since it's reversing them, I can't match phrases. Roy - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Indexing multiple instances of the same field for each document
- Original Message - From: Erik Hatcher [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Friday, February 27, 2004 4:07 PM Subject: Re: Indexing multiple instances of the same field for each document Does this mean that whenever I want to do keyword searches, I must avoid QueryParser? Not necessarily. This is a bit of an involved issue, and I posted a more extensive reply on this a few weeks ago (pasting a bit of our Lucene in Action discussion on it - perhaps search for KeywordAnalyzer to find that mail). Look into PerFieldAnalyzerWrapper. Thanks for this tip; I've mostly done it now using this route - I guess one could also derive a new Analyzer that does a switch on the basis of FieldName, but that wouldn't be so flexible. I see from the DocumentWriter class that all keyword fields are indexed exactly, including case-sensitivity. This really tripped me up, since my version of the KeywordAnalyzer (left by Erik as an exercise to the reader) was applying the LowerCaseFilter, and therefore I got no matches. I guess the best way to handle this problem, other than getting the application to transform values prior to query or indexing, is actually to tokenize the field after all, but use the same KeywordAnalyzer to do it! Yours, Moray McConnachie - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Indexing multiple instances of the same field for each document
On Feb 28, 2004, at 5:38 PM, Moray McConnachie (OA) wrote: - Original Message - I guess the best way to handle this problem, other than getting the application to transform values prior to query or indexing, is actually to tokenize the field after all, but use the same KeywordAnalyzer to do it! Bingo... this is the same thinking I've had on this subject. Why even bother with Field.Keyword and the confusion that occurs with QueryParser and such? Just use a KeywordAnalyzer and PerFieldAnalyzerWrapper setup instead for both indexing and querying; at least that seems a more confusion-free route to go in a lot of ways. Erik - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
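[Editor's note: a sketch of that setup, assuming the PerFieldAnalyzerWrapper class discussed above. This KeywordAnalyzer is one possible reading of the "exercise to the reader" version (the whole field value becomes a single token); the "partnum" field name is hypothetical.]

    import java.io.IOException;
    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;

    // Emits the entire field value as one untokenized, case-preserved token.
    public class KeywordAnalyzer extends Analyzer {
        public TokenStream tokenStream(String fieldName, final Reader reader) {
            return new TokenStream() {
                private boolean done = false;
                public Token next() throws IOException {
                    if (done) return null;
                    done = true;
                    StringBuffer buf = new StringBuffer();
                    char[] chunk = new char[256];
                    int n;
                    while ((n = reader.read(chunk)) != -1) buf.append(chunk, 0, n);
                    return new Token(buf.toString(), 0, buf.length());
                }
                public void close() throws IOException {
                    reader.close();
                }
            };
        }
    }

    // Usage - the same wrapper at both index and query time, so untokenized
    // fields get the keyword treatment and everything else falls through:
    //   PerFieldAnalyzerWrapper wrapper =
    //       new PerFieldAnalyzerWrapper(new StandardAnalyzer());
    //   wrapper.addAnalyzer("partnum", new KeywordAnalyzer());  // hypothetical field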