index update (was Re: Large InputStream.BUFFER_SIZE causes OutOfMemoryError.. FYI)
On Apr 13, 2004, at 02:45, Kevin A. Burton wrote: He mentioned that I might be able to squeeze 5-10% out of index merges this way.

Talking of which... what strategies do people use to minimize downtime when updating an index? My current strategy is as follows: (1) use a temporary RAMDirectory for ongoing updates; (2) perform a copy-on-write when flushing the RAMDirectory into the persistent index. The second step means that I create an offline copy of a live index before invoking addIndexes() and then substitute the old index with the new, updated one. While this effectively increases the time it takes to update an index, it nonetheless reduces the *perceived* downtime for it.

Thoughts? Alternative strategies? TIA.

Cheers, PA.

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
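The two-step strategy above could be sketched roughly as follows against the Lucene 1.x API. This is a minimal sketch, not PA's actual code: the class and path names are invented, the plain file copy stands in for whatever copy mechanism is actually used, and the final swap (repointing searchers at the new directory) is left as a comment.

```java
import java.io.*;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class CopyOnWriteUpdater {
    // (1) Ongoing updates go into a temporary RAMDirectory.
    final RAMDirectory buffer = new RAMDirectory();

    // (2) Copy-on-write flush: duplicate the live index, merge the buffer
    // into the duplicate, then swap the duplicate in as the live index.
    void flush(File liveIndex, File offlineCopy) throws IOException {
        copyFiles(liveIndex, offlineCopy);
        IndexWriter writer = new IndexWriter(offlineCopy.getPath(),
                                             new StandardAnalyzer(),
                                             false /* append, don't create */);
        writer.addIndexes(new Directory[] { buffer });
        writer.close();
        // ...now atomically repoint searchers from liveIndex to offlineCopy,
        // e.g. by renaming directories or updating a pointer the app reads.
    }

    // Simple flat-directory file copy; real code may want something sturdier.
    static void copyFiles(File from, File to) throws IOException {
        to.mkdirs();
        String[] names = from.list();
        for (int i = 0; i < names.length; i++) {
            InputStream in = new FileInputStream(new File(from, names[i]));
            OutputStream out = new FileOutputStream(new File(to, names[i]));
            byte[] buf = new byte[4096];
            for (int n; (n = in.read(buf)) != -1; ) out.write(buf, 0, n);
            in.close();
            out.close();
        }
    }
}
```

Because addIndexes() runs against the offline copy, searchers keep using the untouched live index for the whole merge, which is what shrinks the perceived downtime.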
Re: Closing IndexWriter object after each file causes NullPointerException?
If you close an IndexWriter more than once, the release of the writeLock causes a NullPointerException. You should clean up your code and close your writer only once. Still, I don't know why there is no null check on 'writeLock' as there is in the 'finalize' method. I think it's a small bug, so I suggest the attached patch to fix it.

Franck Brisbart

jitender ahuja wrote:
Hi, Can anyone tell the cause of the following error? The source of the error is not any of the following:
a) the index directory being closed after each file of the directory (to be indexed): verified by the changing directory size as the number of files to be indexed changes;
b) the IndexWriter object being closed out: verified by checking that the IndexWriter object (here, writ) is non-null, via the line System.out.println(writ != null); in the attached code.

Error output:
java.lang.NullPointerException
	at org.apache.lucene.index.IndexWriter.close(Unknown Source)
	at IndexDatanew.indexDocs(IndexDatanew.java:89)
	at IndexDatanew.indexDocs(IndexDatanew.java:50)
	at IndexDatanew.main(IndexDatanew.java:25)

The code that causes this error is working fine otherwise (i.e. for indexing purposes) and is attached; the detailed output for a directory of 2 files is also attached:

Thanks, Jitender

C:\lucroche>java IndexDatanew E:\freebooks\books\whole\jiten
Index Directory: E:\freebooks\books\whole\jiten 2
E:\freebooks\books\whole\jiten\Copy of TIJ3_c.htm
adding: E:\freebooks\books\whole\jiten\Copy of TIJ3_c.htm
File contents from buffer: E:\freebooks\books\whole\jiten\Copy of TIJ3_c.htm false
E:\freebooks\books\whole\jiten\TIJ3_c.htm
adding: E:\freebooks\books\whole\jiten\TIJ3_c.htm
File contents from buffer: E:\freebooks\books\whole\jiten\TIJ3_c.htm false
java.lang.NullPointerException
	at org.apache.lucene.index.IndexWriter.close(Unknown Source)
	at IndexDatanew.indexDocs(IndexDatanew.java:89)
	at IndexDatanew.indexDocs(IndexDatanew.java:50)
	at IndexDatanew.main(IndexDatanew.java:25)

-- Franck Brisbart R&D http://www.kelkoo.com

Index: IndexWriter.java
===
RCS file: /home/cvspublic/jakarta-lucene/src/java/org/apache/lucene/index/IndexWriter.java,v
retrieving revision 1.28
diff -u -r1.28 IndexWriter.java
--- IndexWriter.java	25 Mar 2004 19:34:53 -	1.28
+++ IndexWriter.java	13 Apr 2004 16:39:56 -
@@ -235,8 +235,10 @@
   public synchronized void close() throws IOException {
     flushRamSegments();
     ramDirectory.close();
-    writeLock.release();          // release write lock
-    writeLock = null;
+    if (writeLock != null) {
+      writeLock.release();        // release write lock
+      writeLock = null;
+    }
     directory.close();
   }
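The guard in the patch above is the usual idempotent-close pattern. Here is a tiny self-contained illustration of why the null-check plus null-out makes a second close() harmless; the Lock class and all names are invented for the example and are not Lucene's.

```java
// Minimal illustration of the null-guard close pattern from the patch above.
// Lock is a made-up stand-in for Lucene's write lock.
public class GuardedClose {
    static class Lock {
        void release() { /* release the underlying resource */ }
    }

    private Lock writeLock = new Lock();

    public synchronized void close() {
        if (writeLock != null) {    // guard: a second close() is a no-op
            writeLock.release();
            writeLock = null;       // without the guard, a second call would
        }                           // invoke release() on null -> NPE
    }

    public boolean isClosed() { return writeLock == null; }

    public static void main(String[] args) {
        GuardedClose g = new GuardedClose();
        g.close();
        g.close();                  // safe thanks to the guard
        System.out.println("closed twice, no NullPointerException");
    }
}
```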
Re: index update (was Re: Large InputStream.BUFFER_SIZE causes OutOfMemoryError.. FYI)
I'm actually pretty lazy about index updates, and haven't had the need for efficiency, since my requirement is that new documents should be available on a next-working-day basis. I reindex everything from scratch every night (400,000 docs) and store it in a timestamped index. When the reindexing is done, I alert a controller of the new active index. I keep a few versions of the index in case of a failure somewhere, and I can always send a message to the controller to use an old index.

cheers, sv

On Tue, 13 Apr 2004, petite_abeille wrote: Talking of which... what strategies do people use to minimize downtime when updating an index? [...]
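The timestamped-index-plus-controller setup above could look something like this. All names here (the "current" pointer file, the directory naming scheme) are my own invention, just one simple way to make the switch atomic; the actual controller protocol is not described in the mail.

```java
import java.io.*;
import java.text.SimpleDateFormat;
import java.util.Date;

public class IndexSwitcher {
    // Rebuild into a fresh timestamped directory, then atomically repoint
    // a small "current" pointer file that the search controller reads.
    public static File newIndexDir(File root) {
        String stamp = new SimpleDateFormat("yyyyMMdd-HHmmss").format(new Date());
        File dir = new File(root, "index-" + stamp);
        dir.mkdirs();
        return dir;
    }

    public static void activate(File root, File indexDir) throws IOException {
        // Write the pointer to a temp file first, then rename: rename is the
        // atomic step, so readers never see a half-written pointer.
        File tmp = new File(root, "current.tmp");
        Writer w = new FileWriter(tmp);
        w.write(indexDir.getName());
        w.close();
        tmp.renameTo(new File(root, "current"));
    }

    public static String activeIndex(File root) throws IOException {
        BufferedReader r = new BufferedReader(new FileReader(new File(root, "current")));
        String name = r.readLine();
        r.close();
        return name;
    }
}
```

Keeping the previous timestamped directories around is what gives the cheap rollback: switching back to an old index is just another activate() call.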
Suggestion for Token.java
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1

Hi there, Just a short suggestion: It would be useful to make Token.termText public (or to provide a reader/writer pair). That way one can create TokenFilters altering termText (for synonyms, for example) in packages other than org.apache.lucene.analysis.

Mit freundlichem Gruß / With kind regards
Holger Klawitter
- -- lists at klawitter dot de

-BEGIN PGP SIGNATURE- Version: GnuPG v1.2.2 (GNU/Linux) iD8DBQFAe/C41Xdt0HKSwgYRAoUEAKCUARoxSFiPv6OyJCzJhbLCLbtkmwCfQHzH pH4Z4Bk6M/emmLn0CVoEX8w= =1fIA -END PGP SIGNATURE-
Re: Suggestion for Token.java
What is wrong with simply creating a new token that replaces an incoming one for synonyms? I'm just playing devil's advocate here, since you can already get the termText() through the public _method_.

Erik

On Apr 13, 2004, at 9:52 AM, Holger Klawitter wrote: It would be useful to make Token.termText public (or to provide a reader/writer pair). That way one can create TokenFilters altering termText (for Synonyms for example) in other packages as org.apache.lucene.analyzer. [...]
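Erik's "create a new token" approach looks roughly like this against the Lucene 1.x analysis API. The class name and the synonym map are made-up for the example, and the exact Token constructors available may differ between Lucene versions.

```java
// Sketch of replacing a token instead of mutating its termText.
// ReplacingSynonymFilter and the Map-based lookup are invented names.
import java.io.IOException;
import java.util.Map;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class ReplacingSynonymFilter extends TokenFilter {
    private final Map synonyms;   // e.g. "car" -> "automobile"

    public ReplacingSynonymFilter(TokenStream in, Map synonyms) {
        super(in);
        this.synonyms = synonyms;
    }

    public Token next() throws IOException {
        Token t = input.next();
        if (t == null) return t;
        String replacement = (String) synonyms.get(t.termText());
        if (replacement == null) return t;
        // Build a fresh Token rather than mutating the old one's termText,
        // preserving the original offsets and type.
        return new Token(replacement, t.startOffset(), t.endOffset(), t.type());
    }
}
```

This keeps Token effectively immutable from the filter's point of view, which is exactly the property the later replies in this thread debate.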
ANN: Docco 0.3
Hello, we released Docco 0.3 along with two updates for its plugins.

Docco is a personal document retrieval tool based on Apache's Lucene indexing engine and Formal Concept Analysis. It allows you to create an index for files on your file system which you can then search for keywords. It can index plain text, HTML, XML and OpenOffice files and, with the support of plugins, other formats like PDF, DOC and XLS.

This new version of Docco features a number of small enhancements: the diagram layout can be changed, printing and graphic export options have been added, and some plugins have been updated. The new POI plugin should be able to index MS Word documents again (the old one broke with recent Java versions), and the PDFbox plugin gets all the recent updates from the PDFbox project. Old plugins will still continue to work, though.

You can find the updated files here: http://sourceforge.net/project/showfiles.php?group_id=21448

Note that you can now also use the export plugins to add more graphic export options.

Enjoy! Peter
Simple spider demo
I'm wondering if there is interest in a simple spider demo. I've got an example of how to use HttpUnit to spider a web site and index it on disk (HTML pages only for now). I can send it to the list if anyone is interested (it's one class, 200 LOC).

cheers, sv
i411 Faceted Metadata Search
Hi, Does anyone know the difference between i411 Faceted Metadata Search and the Lucene search engine? Thanks, William.
Re: ANN: Docco 0.3
Looks cool, but I've got a question: how do you handle symlinks on *nix? I think it's stuck in a loop. When indexing my home dir, I see it indexing: /home/vauchers/.Cirano-gnome/.gnome-desktop/Home directory/.Cirano-gnome/...

cheers, sv

On Wed, 14 Apr 2004, Peter Becker wrote: Hello, we released Docco 0.3 along with two updates for its plugins. [...]
Re: Suggestion for Token.java
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1

Hi Erik,

What is wrong with simply creating a new token that replaces an incoming one for synonyms? I'm just playing devil's advocate here since you can already get the termText() through the public _method_.

Well, you're right; I forgot about cloning, but... (Lord's advocate :-)

1.) Cloning implies the need to change filters whenever the fields in Token change.
2.) In the presence of so many finals it's quite consistent to avoid creation of objects in favour of reuse.

Mit freundlichem Gruß / With kind regards
Holger Klawitter
- -- lists at klawitter dot de

-BEGIN PGP SIGNATURE- Version: GnuPG v1.2.2 (GNU/Linux) iD8DBQFAfFxL1Xdt0HKSwgYRAkSRAJoDQcdpeGEl6PaqkqLqYSuzOq+DMQCgoUh9 3jbhzhz00QvH4EUiJgVFgus= =2FEm -END PGP SIGNATURE-
Re: Simple spider demo
I've uploaded it to the wiki: http://wiki.apache.org/jakarta-lucene/HttpUnitExample

Disclaimer: it's not anywhere close to production quality, especially since it's based on a unit test framework.

sv

On Tue, 13 Apr 2004, Stephane James Vaucher wrote: I'm wondering if there is interest for a simple spider demo. [...]
Re: ANN: Docco 0.3
The underlying assumption was that File.isDirectory() returns false on symlinks, but we never tested under UNIX or Linux, and the JavaDoc is not very explicit about this (as so often). If that assumption is wrong, can someone mail me a hint on how to do it properly? I assume it involves getCanonicalPath(), but the details might be tricky :-)

Peter

Stephane James Vaucher wrote: Looks cool, but I've got a question: How do you handle symlinks on *nix? I think it's stuck in a loop. When indexing my home dir, I see it indexing: /home/vauchers/.Cirano-gnome/.gnome-desktop/Home directory/.Cirano-gnome/... [...]
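For what it's worth, File.isDirectory() follows symlinks (it reports on the link target), so it alone cannot break the loop. The usual pre-NIO heuristic does involve getCanonicalPath(): resolve the parent canonically, reattach the name, and see whether canonicalizing the last component changes anything. A sketch, with an invented class name:

```java
import java.io.File;
import java.io.IOException;

public class SymlinkGuard {
    // Pre-NIO heuristic: a symlink's canonical path differs from the path
    // built from its parent's canonical path plus its own name.
    // (File.isDirectory() follows links, so it can't detect them by itself.)
    public static boolean looksLikeSymlink(File f) throws IOException {
        File parent = f.getAbsoluteFile().getParentFile();
        if (parent == null) return false;              // filesystem root
        // Canonicalize everything except the final component...
        File resolved = new File(parent.getCanonicalFile(), f.getName());
        // ...then see if canonicalizing the final component changes the path.
        return !resolved.getCanonicalFile().equals(resolved.getAbsoluteFile());
    }
}
```

Canonicalizing the parent first matters: otherwise a symlink anywhere earlier in the path (like /tmp on some systems) would make every file under it look like a link.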
Re: Suggestion for Token.java
On Tuesday 13 April 2004 15:31, Holger Klawitter wrote: 1.) Cloning implies the need to change filters whenever the fields in Token change. [...]

On the other hand, one needs to be sure that no other code assumes Tokens are immutable. For example, if they weren't, one couldn't reliably use tokens in Sets or Maps (not sure if it's useful to do that, just an example). I guess it's really a matter of whether tokens were designed to be immutable (which often makes sense for similar objects), or if they just happen to be, due to a lack of modifier method(s).

-+ Tatu +-