index update (was Re: Large InputStream.BUFFER_SIZE causes OutOfMemoryError.. FYI)

2004-04-13 Thread petite_abeille
On Apr 13, 2004, at 02:45, Kevin A. Burton wrote:

> He mentioned that I might be able to squeeze 5-10% out of index merges 
> this way.

Talking of which... what strategy(ies) do people use to minimize 
downtime when updating an index?

My current strategy is as follows:

(1) use a temporary RAMDirectory for ongoing updates.
(2) perform a copy on write when flushing the RAMDirectory into the 
persistent index.

The second step means that I create an offline copy of the live index 
before invoking addIndexes() and then replace the old index with the 
new, updated one. While this increases the time it takes to 
update an index, it nonetheless reduces the *perceived* downtime.
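
A minimal sketch of the copy-and-swap step in (2), under stated assumptions: the index is a flat directory of files, both copies sit on the same file system so a directory rename is cheap, and OfflineUpdate is a hypothetical callback standing in for the addIndexes() call. It uses java.nio.file rather than any Lucene API, and none of these names come from the original code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class IndexSwapSketch {

    /** Hypothetical hook: applies pending updates to the offline copy. */
    interface OfflineUpdate {
        void applyTo(Path offlineCopy) throws IOException;
    }

    /** Lists the entries directly inside a directory. */
    static List<Path> entries(Path dir) throws IOException {
        try (Stream<Path> files = Files.list(dir)) {
            return files.collect(Collectors.toList());
        }
    }

    /** Copies every file of a flat index directory into another directory. */
    static void copyIndex(Path live, Path offline) throws IOException {
        Files.createDirectories(offline);
        for (Path file : entries(live)) {
            Files.copy(file, offline.resolve(file.getFileName()),
                    StandardCopyOption.REPLACE_EXISTING);
        }
    }

    /** Deletes a flat directory and the files directly inside it. */
    static void deleteFlatDirectory(Path dir) throws IOException {
        for (Path file : entries(dir)) {
            Files.delete(file);
        }
        Files.delete(dir);
    }

    /** Updates a copy of the live index, then swaps it in by renaming. */
    static void updateWithSwap(Path live, OfflineUpdate update) throws IOException {
        Path offline = live.resolveSibling(live.getFileName() + ".new");
        Path retired = live.resolveSibling(live.getFileName() + ".old");

        copyIndex(live, offline);   // slow, but the live index keeps serving searches
        update.applyTo(offline);    // slow, e.g. merging the in-memory index

        Files.move(live, retired);  // short window: the swap is two renames
        Files.move(offline, live);
        deleteFlatDirectory(retired);
    }
}
```

In this sketch the index is only unavailable between the two renames; searchers holding the old directory open would still need to be reopened on the new one.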

Thoughts? Alternative strategies?

TIA.

Cheers,

PA.



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Closing IndexWriter object after each file causes NullPointerException?

2004-04-13 Thread Brisbart Franck
If you close an IndexWriter more than once, releasing the writeLock 
throws a NullPointerException.
You should clean up your code and close your writer only once. Anyway, I 
don't know why there's no null check on 'writeLock', as there is in the 
'finalize' method.
I think it's a minor bug, so I suggest the attached patch to fix it.
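
The guard can be illustrated in isolation. This is a minimal sketch with stand-in Lock and Writer classes, not Lucene's actual IndexWriter; it only shows how a null check turns a second close() into a no-op instead of a NullPointerException.

```java
class IdempotentCloseSketch {

    /** Stand-in for a lock; counts how often it is released. */
    static class Lock {
        int releases = 0;

        void release() {
            releases++;
        }
    }

    /** Stand-in for a writer that holds a lock until it is closed. */
    static class Writer {
        private Lock writeLock;

        Writer(Lock writeLock) {
            this.writeLock = writeLock;
        }

        synchronized void close() {
            // Without this check, a second close() would dereference null.
            if (writeLock != null) {
                writeLock.release();
                writeLock = null;
            }
        }
    }

    public static void main(String[] args) {
        Lock lock = new Lock();
        Writer writer = new Writer(lock);
        writer.close();
        writer.close(); // second call does nothing
        System.out.println("releases = " + lock.releases);
    }
}
```

Even with the guard, closing the writer once, after the last file has been indexed, remains the cleaner fix on the calling side.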

Franck Brisbart

jitender ahuja wrote:
Hi,
 Can anyone tell me the cause of the following error? I have ruled out 
two possible causes:
a) The index directory being closed after each file of the directory to 
be indexed: verified by the index directory size changing with the 
number of files indexed.
b) The IndexWriter object having been closed: verified by checking that 
the IndexWriter object (here, writ) is non-null, using the line 
System.out.println(writ != null); in the attached code.
 
 
Error output:
 java.lang.NullPointerException
at org.apache.lucene.index.IndexWriter.close(Unknown Source)
at IndexDatanew.indexDocs(IndexDatanew.java:89)
at IndexDatanew.indexDocs(IndexDatanew.java:50)
at IndexDatanew.main(IndexDatanew.java:25)
 
The code that causes this error otherwise works fine (i.e. for 
indexing purposes) and is attached; the detailed output for a 
directory of 2 files is also attached.
 
Thanks
Jitender



C:\lucroche>java IndexDatanew E:\freebooks\books\whole\jiten
Index Directory: E:\freebooks\books\whole\jiten
2
E:\freebooks\books\whole\jiten\Copy of TIJ3_c.htm
adding: E:\freebooks\books\whole\jiten\Copy of TIJ3_c.htm
File contents from buffer:
E:\freebooks\books\whole\jiten\Copy of TIJ3_c.htm
false
E:\freebooks\books\whole\jiten\TIJ3_c.htm
adding: E:\freebooks\books\whole\jiten\TIJ3_c.htm
File contents from buffer:
E:\freebooks\books\whole\jiten\TIJ3_c.htm
false
java.lang.NullPointerException
at org.apache.lucene.index.IndexWriter.close(Unknown Source)
at IndexDatanew.indexDocs(IndexDatanew.java:89)
at IndexDatanew.indexDocs(IndexDatanew.java:50)
at IndexDatanew.main(IndexDatanew.java:25)
 





--
Franck Brisbart
R&D
http://www.kelkoo.com
Index: IndexWriter.java
===
RCS file: /home/cvspublic/jakarta-lucene/src/java/org/apache/lucene/index/IndexWriter.java,v
retrieving revision 1.28
diff -u -r1.28 IndexWriter.java
--- IndexWriter.java 25 Mar 2004 19:34:53 -  1.28
+++ IndexWriter.java 13 Apr 2004 16:39:56 -
@@ -235,8 +235,10 @@
   public synchronized void close() throws IOException {
     flushRamSegments();
     ramDirectory.close();
-    writeLock.release();  // release write lock
-    writeLock = null;
+    if (writeLock != null) {
+      writeLock.release();  // release write lock
+      writeLock = null;
+    }
     directory.close();
   }
 


Re: index update (was Re: Large InputStream.BUFFER_SIZE causes OutOfMemoryError.. FYI)

2004-04-13 Thread Stephane James Vaucher
I'm actually pretty lazy about index updates, and haven't had the need for 
efficiency, since my requirement is that new documents should be 
available by the next working day.

I reindex everything from scratch every night (400,000 docs) and store it 
in a timestamped index. When the reindexing is done, I alert a controller 
to the new active index. I keep a few versions of the index in case of 
a failure somewhere, and I can always send a message to the controller to 
use an old index.
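
A minimal sketch of such a controller, assuming it only needs to track which index directory is active and remember a few earlier ones for rollback. The class name, the keep parameter, and the index-<date> naming scheme are hypothetical; deleting retired directories from disk is left out.

```java
import java.nio.file.Path;
import java.time.LocalDate;
import java.util.ArrayDeque;
import java.util.Deque;

class ActiveIndexController {

    // Most recent index first; older versions are kept for rollback.
    private final Deque<Path> history = new ArrayDeque<>();
    private final int keep;

    ActiveIndexController(int keep) {
        this.keep = keep;
    }

    /** Called when a nightly rebuild has finished. */
    synchronized void activate(Path newIndex) {
        history.addFirst(newIndex);
        while (history.size() > keep) {
            history.removeLast(); // a real system would also delete it from disk
        }
    }

    /** The index that searches should currently use. */
    synchronized Path active() {
        return history.peekFirst();
    }

    /** Falls back to the previous index, if one is still kept. */
    synchronized Path rollback() {
        if (history.size() > 1) {
            history.removeFirst();
        }
        return history.peekFirst();
    }

    /** Builds a directory name such as index-2004-04-13 under the given base. */
    static Path timestampedName(Path base, LocalDate day) {
        return base.resolve("index-" + day); // LocalDate prints as yyyy-MM-dd
    }
}
```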

cheers,
sv






Suggestion for Token.java

2004-04-13 Thread Holger Klawitter

Hi there,

Just a short suggestion:

It would be useful to make Token.termText public (or to provide a reader/writer 
pair).

That way one can create TokenFilters that alter termText (for synonyms, for 
example) in packages other than org.apache.lucene.analysis.

Mit freundlichem Gruß / With kind regards
Holger Klawitter
- --
lists at klawitter dot de





Re: Suggestion for Token.java

2004-04-13 Thread Erik Hatcher
What is wrong with simply creating a new token that replaces an 
incoming one for synonyms?

I'm just playing devil's advocate here since you can already get 
the termText() through the public _method_.
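
A rough sketch of the replace-with-a-new-token approach, using a stand-in Token class rather than Lucene's own (the class, its fields, and the replaceSynonyms helper are assumptions for illustration only). Each incoming token that has a synonym is replaced by a fresh token carrying the same offsets.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class SynonymReplaceSketch {

    /** Stand-in for an immutable token: term text plus character offsets. */
    static final class Token {
        private final String termText;
        private final int start;
        private final int end;

        Token(String termText, int start, int end) {
            this.termText = termText;
            this.start = start;
            this.end = end;
        }

        String termText() { return termText; }
        int startOffset() { return start; }
        int endOffset() { return end; }
    }

    /** Returns the tokens with mapped terms replaced by new Token objects. */
    static List<Token> replaceSynonyms(List<Token> in, Map<String, String> synonyms) {
        List<Token> out = new ArrayList<>();
        for (Token t : in) {
            String canonical = synonyms.get(t.termText());
            if (canonical == null) {
                out.add(t); // no synonym: pass the original token through
            } else {
                // Every field has to be copied by hand into the new token.
                out.add(new Token(canonical, t.startOffset(), t.endOffset()));
            }
        }
        return out;
    }
}
```

The hand-copied fields are the cost of this approach: if the token type gains a field, code like this has to be updated to carry it over.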

	Erik





ANN: Docco 0.3

2004-04-13 Thread Peter Becker
Hello,

We have released Docco 0.3 along with two updates for its plugins.

Docco is a personal document retrieval tool based on Apache's Lucene 
indexing engine and Formal Concept Analysis. It allows you to create an 
index for files on your file system which you can then search for 
keywords. It can index plain text, HTML, XML and OpenOffice files and, 
with the support of plugins, other formats such as PDF, DOC and XLS.

This new version of Docco features a number of small enhancements: the 
diagram layout can be changed, printing and graphic export options have 
been added and some plugins have been updated.

The new POI plugin should be able to index MS Word documents again (the 
old one broke with recent Java versions), and the PDFbox plugin includes all the 
recent updates from the PDFbox project. Old plugins will still continue 
to work, though.

You can find the updated files here:
http://sourceforge.net/project/showfiles.php?group_id=21448
Note that you can now also use the export plugins to add more graphic 
export options.

Enjoy!
 Peter


Simple spider demo

2004-04-13 Thread Stephane James Vaucher
I'm wondering if there is interest in a simple spider demo.

I've got an example of how to use HttpUnit to spider a web site and 
index it on disk (HTML pages only for now). I can send it to the list 
if anyone is interested (it's one class,  200 loc).

cheers,
sv






i411 Faceted Metadata Search

2004-04-13 Thread William W
Hi,

Does anyone know the difference between i411 Faceted Metadata Search and 
the Lucene search engine?

Thanks,

William.



Re: ANN: Docco 0.3

2004-04-13 Thread Stephane James Vaucher
Looks cool, but I've got a question:

How do you handle symlinks on *nix? I think it's stuck in a loop.

When indexing my home dir, I see it indexing: 
/home/vauchers/.Cirano-gnome/.gnome-desktop/Home directory/.Cirano-gnome/...

cheers,
sv






Re: Suggestion for Token.java

2004-04-13 Thread Holger Klawitter

Hi Erik,

> What is wrong with simply creating a new token that replaces an
> incoming one for synonyms?
> I'm just playing devil's advocate here since you can already get
> the termText() through the public _method_.

Well, you're right; I forgot about cloning, but ... (Lord's advocate :-)

1.) Cloning implies the need to change filters whenever the fields in Token 
change.

2.) In the presence of so many finals it's quite consistent to avoid creating 
objects in favour of reuse.

Mit freundlichem Gruß / With kind regards
Holger Klawitter
- --
lists at klawitter dot de





Re: Simple spider demo

2004-04-13 Thread Stephane James Vaucher
I've uploaded it to the wiki:

http://wiki.apache.org/jakarta-lucene/HttpUnitExample

<disclaimer>
It's not anywhere close to production quality, especially since it's based 
on a unit test framework.
</disclaimer>

sv






Re: ANN: Docco 0.3

2004-04-13 Thread Peter Becker
The underlying assumption was that File.isDirectory() returns false 
on symlinks, but we never tested under UNIX or Linux, and the JavaDoc is not 
very explicit about this (as so often). If that is wrong, can someone 
mail me a hint on how to do it properly? I assume it involves 
getCanonicalPath() but the details might be tricky :-)
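
One way this could be done, as a sketch rather than a tested fix for Docco: remember the canonical path of every directory already visited and skip any directory whose canonical path has been seen before. getCanonicalPath() resolves symbolic links on platforms that support them, so a link pointing back up the tree maps to an already-visited path. The class and method names are hypothetical.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SymlinkSafeWalk {

    /** Collects all regular files below root, visiting each real directory once. */
    static List<File> listFiles(File root) throws IOException {
        List<File> out = new ArrayList<>();
        walk(root, new HashSet<>(), out);
        return out;
    }

    private static void walk(File dir, Set<String> visited, List<File> out)
            throws IOException {
        // add() returns false if this canonical path was already visited,
        // which is what happens when a symlink leads back into the tree.
        if (!visited.add(dir.getCanonicalPath())) {
            return;
        }
        File[] children = dir.listFiles();
        if (children == null) {
            return; // not readable, or not a directory after all
        }
        for (File child : children) {
            if (child.isDirectory()) {
                walk(child, visited, out);
            } else if (child.isFile()) {
                out.add(child);
            }
        }
    }
}
```

A side effect of this design is that a directory reachable through several links is indexed only once, under whichever path is encountered first.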

 Peter

 





Re: Suggestion for Token.java

2004-04-13 Thread Tatu Saloranta
On Tuesday 13 April 2004 15:31, Holger Klawitter wrote:
> Hi Erik,
>
> > What is wrong with simply creating a new token that replaces an
> > incoming one for synonyms?
> > I'm just playing devil's advocate here since you can already get
> > the termText() through the public _method_.
>
> Well, you're right; I forgot about cloning, but ... (Lords advocate :-)
>
> 1.) Cloning implies the need to change filters whenever the fields in Token
> change.

On the other hand, one needs to be sure that no other code assumes Tokens are 
immutable. For example, if they weren't, one couldn't reliably use tokens in 
Sets or Maps (not sure if it's useful to do that, just an example).

I guess it's really a matter of whether tokens were designed to be immutable (which 
often makes sense for similar objects), or whether they just happen to be, due to 
the lack of modifier method(s).
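
To make the Set/Map concern concrete, here is a small sketch with a hypothetical MutableToken class that defines equals() and hashCode() on its text. It is not Lucene's Token; it only shows how changing a field after insertion can make a HashSet lose track of the object.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

class MutableKeySketch {

    /** Hypothetical token with value equality and a mutable term text. */
    static final class MutableToken {
        String termText;

        MutableToken(String termText) {
            this.termText = termText;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof MutableToken
                    && Objects.equals(termText, ((MutableToken) o).termText);
        }

        @Override
        public int hashCode() {
            return Objects.hashCode(termText);
        }
    }

    /** Adds a token to a set, mutates it, then looks the same object up again. */
    static boolean stillFoundAfterMutation() {
        Set<MutableToken> seen = new HashSet<>();
        MutableToken token = new MutableToken("car");
        seen.add(token);             // stored under the hash of "car"
        token.termText = "automobile";
        return seen.contains(token); // looked up under the hash of "automobile"
    }

    public static void main(String[] args) {
        System.out.println("found after mutation: " + stillFoundAfterMutation());
    }
}
```

The lookup fails even though the very same object is still in the set, because the set recorded the hash of the old text.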

-+ Tatu +-

