Hi All,
I'm using Jackrabbit 2.4.3 and my repository has approximately 110
thousand nodes. Of these, about 10 thousand nodes have binary values
whose content needs to be extracted with Tika and indexed in Lucene.
I decided to delete the index so that Jackrabbit would create it again.
The problem is how long this operation takes. I waited for 3 hours and
the repository still wasn't initialized (I don't know exactly how long
the initialization takes to complete, because I stopped the process).
With Tika's text extraction disabled, it took 5 minutes, so I concluded
that the problem is the time Tika takes to extract text from the 10
thousand documents.
If the index becomes inconsistent and I have to rebuild it, my client
doesn't want to wait more than 3 hours to start using the system. So
I'm planning to create a subclass of
org.apache.jackrabbit.core.query.lucene.SearchIndex and modify how the
index is re-created. To give my client fast access to the repository,
I'll first skip the text extraction and build the index from the normal
properties only. With this index, I can give my client access to the
repository, and he can do many things using only the normal properties.
Then, in the background, I'll start the text extraction for each
document and update its Lucene document with the extracted text.
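
To make the idea concrete, here is a rough sketch of the two phases. It
is only an illustration under my own assumptions: a plain in-memory map
stands in for the Lucene index, and TwoPhaseIndexSketch, Node,
IndexEntry and TextExtractor are hypothetical names I made up, not
Jackrabbit, Lucene or Tika classes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class TwoPhaseIndexSketch {

    /** Hypothetical stand-in for Tika: turns binary content into text. */
    interface TextExtractor {
        String extract(byte[] binary) throws Exception;
    }

    /** Hypothetical node: an id, its normal properties, an optional binary. */
    static final class Node {
        final String id;
        final String properties;
        final byte[] binary; // null when the node has no binary value

        Node(String id, String properties, byte[] binary) {
            this.id = id;
            this.properties = properties;
            this.binary = binary;
        }
    }

    /** Hypothetical index entry; fullText is null until extraction is done. */
    static final class IndexEntry {
        final String properties;
        final String fullText;

        IndexEntry(String properties, String fullText) {
            this.properties = properties;
            this.fullText = fullText;
        }
    }

    // Stand-in for the Lucene index (assumption: one entry per node id).
    private final Map<String, IndexEntry> index = new ConcurrentHashMap<>();
    private final ExecutorService extractorPool;
    private final TextExtractor extractor;

    TwoPhaseIndexSketch(TextExtractor extractor, int threads) {
        this.extractor = extractor;
        this.extractorPool = Executors.newFixedThreadPool(threads);
    }

    /** Phase 1: index the normal properties only, so startup stays fast. */
    void indexProperties(List<Node> nodes) {
        for (Node n : nodes) {
            index.put(n.id, new IndexEntry(n.properties, null));
        }
    }

    /** Phase 2: extract text in the background and update each entry. */
    void extractInBackground(List<Node> nodes) {
        for (Node n : nodes) {
            if (n.binary == null) {
                continue; // nothing to extract for this node
            }
            extractorPool.execute(() -> {
                try {
                    String text = extractor.extract(n.binary);
                    index.computeIfPresent(n.id,
                            (id, e) -> new IndexEntry(e.properties, text));
                } catch (Exception ex) {
                    // A failed extraction keeps the properties-only entry.
                }
            });
        }
    }

    /** Stops accepting work and waits for the queued extractions. */
    boolean awaitExtraction(long timeout, TimeUnit unit)
            throws InterruptedException {
        extractorPool.shutdown();
        return extractorPool.awaitTermination(timeout, unit);
    }

    IndexEntry lookup(String id) {
        return index.get(id);
    }

    public static void main(String[] args) throws Exception {
        TwoPhaseIndexSketch idx = new TwoPhaseIndexSketch(
                b -> new String(b, StandardCharsets.UTF_8), 2);
        List<Node> nodes = Arrays.asList(
                new Node("a", "title=A", "hello".getBytes(StandardCharsets.UTF_8)),
                new Node("b", "title=B", null));
        idx.indexProperties(nodes);     // repository usable from here on
        idx.extractInBackground(nodes); // full text arrives later
        idx.awaitExtraction(5, TimeUnit.SECONDS);
        System.out.println(idx.lookup("a").fullText);
    }
}
```

In the real thing, phase 1 would correspond to the index rebuild with
text extraction turned off, and phase 2 would have to re-index each
binary node through whatever update mechanism SearchIndex allows; I
haven't worked out that part yet.
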
I have some questions about it.
1) Reading the source code, I see that Jackrabbit uses
LazyTextExtractorField (and other classes) to run the extraction in a
separate thread. Doesn't that do exactly what I want? Even so, I waited
3 hours and the repository still wasn't initialized and ready to use.
Is that normal?
2) Is what I'm planning to do the best approach? Has anybody done
something similar?
Thanks,
Nelson