Sebastien,
Just as a reminder, we use Jackrabbit 1.4. I have not explicitly configured
any text extractors. Our repository.xml looks like this:
<Repository>
  <FileSystem
      class="org.apache.jackrabbit.core.fs.db.JNDIDatabaseFileSystem">
    <param name="dataSourceLocation" value="kycAppDataSource" />
    <param name="schema" value="mssql" />
    <param name="schemaObjectPrefix" value="J_R_FS_" />
    <param name="bundleCacheSize" value="8" />
    <param name="consistencyCheck" value="false" />
    <param name="minBlobSize" value="16384" />
  </FileSystem>
  <Security appName="Jackrabbit">
    <AccessManager
        class="org.apache.jackrabbit.core.security.SimpleAccessManager">
    </AccessManager>
    <LoginModule
        class="org.apache.jackrabbit.core.security.SimpleLoginModule">
      <param name="anonymousId" value="anonymous" />
    </LoginModule>
  </Security>
  <Workspaces rootPath="${rep.home}/workspaces"
      defaultWorkspace="default" />
  <Workspace name="${wsp.name}">
    <FileSystem
        class="org.apache.jackrabbit.core.fs.db.JNDIDatabaseFileSystem">
      <param name="dataSourceLocation" value="kycAppDataSource" />
      <param name="schema" value="mssql" />
      <param name="schemaObjectPrefix" value="J_FS_${wsp.name}_" />
      <param name="bundleCacheSize" value="8" />
      <param name="consistencyCheck" value="false" />
      <param name="minBlobSize" value="16384" />
    </FileSystem>
    <PersistenceManager
        class="org.apache.jackrabbit.core.persistence.db.JNDIDatabasePersistenceManager">
      <param name="dataSourceLocation" value="kycAppDataSource" />
      <param name="schema" value="mssql" />
      <param name="schemaObjectPrefix" value="J_PM_${wsp.name}_" />
      <param name="bundleCacheSize" value="8" />
      <param name="consistencyCheck" value="false" />
      <param name="minBlobSize" value="16384" />
    </PersistenceManager>
    <SearchIndex
        class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
      <param name="path" value="${wsp.home}/index" />
    </SearchIndex>
  </Workspace>
  <Versioning rootPath="${rep.home}/version">
    <FileSystem
        class="org.apache.jackrabbit.core.fs.db.JNDIDatabaseFileSystem">
      <param name="dataSourceLocation" value="kycAppDataSource" />
      <param name="schema" value="mssql" />
      <param name="schemaObjectPrefix" value="J_V_FS_" />
      <param name="bundleCacheSize" value="8" />
      <param name="consistencyCheck" value="false" />
      <param name="minBlobSize" value="16384" />
    </FileSystem>
    <PersistenceManager
        class="org.apache.jackrabbit.core.persistence.db.JNDIDatabasePersistenceManager">
      <param name="dataSourceLocation" value="kycAppDataSource" />
      <param name="schema" value="mssql" />
      <param name="schemaObjectPrefix" value="J_V_PM_" />
      <param name="bundleCacheSize" value="8" />
      <param name="consistencyCheck" value="false" />
      <param name="minBlobSize" value="16384" />
    </PersistenceManager>
  </Versioning>
  <SearchIndex
      class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
    <param name="path" value="${rep.home}/repository/index" />
  </SearchIndex>
</Repository>
Our customers are alleviating the memory problem by restarting the servers
daily.
The documents we store are numerous (thousands daily) and vary in size. They
are news articles (XML/HTML) and reports (RTF), all stored as binary content
(base64 encoded). We also store some attributes of these articles as strings.
We delete thousands of news articles per day when reports are finalized. We
do not need to search the content of these articles, but I assume they are
being indexed because we have specified SearchIndex elements in our
repository.xml.
Am I correct here?
Muguet
-----Original Message-----
From: Sébastien Launay [mailto:[email protected]]
Sent: Tuesday, September 29, 2009 8:20 AM
To: [email protected]
Subject: Re: Memory issues with jackrabbit/lucene
On 29/09/2009 13:51, Muguet Bradbury wrote:
> Sebastien,
>
> Thanks for the reply. Yes, we do store large documents (rtf and large xml
> documents). When we store each document, we create a session, add the
> document, save the session, and close the session. The LuceneTermBuffers
> remain. However, if the indexing occurs asynchronously, this may be what's
> filling up the memory. Eventually, the application gets an out of memory
> exception.
This is clearly caused by the asynchronous indexing of binary properties.
You can also deactivate indexing for this kind of document [1].
Can you provide more information about these documents (size, number, ...)?
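As a rough illustration of what deactivating text extraction might look
like, here is a sketch of the workspace SearchIndex element using the
textFilterClasses parameter described in [1]. That an empty value registers
no extractors (so binary properties are not full-text indexed) is an
assumption here and should be checked against the Jackrabbit 1.4
documentation before relying on it.

```xml
<SearchIndex
    class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <param name="path" value="${wsp.home}/index" />
  <!-- Assumption: an empty list means no text extractors are registered,
       so the content of binary properties is not full-text indexed. -->
  <param name="textFilterClasses" value="" />
</SearchIndex>
```

With a setup along these lines the index would still exist and queries on
string properties would still work; only the extraction of text from binary
content would be skipped.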
> I will look into removing the SearchIndex elements from the repository.xml
> and workspace.xml. Do we also need to remove the index directories from the
> wsp.home path? Will removing the SearchIndex elements make retrieval of the
> documents (with the node keys) slower?
>
Removing the index directories is not mandatory, as they will no longer be
used. They do consume disk space, though, so you can remove them.
Lucene indexes are only used for search features (XPath, SQL, AQM).
Node#getNodes(), Node#getProperties(), Session#getNodeByUUID(),
... use an abstraction called PersistenceManager [2].
Default implementations of PersistenceManager do not use an index.
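For reference, a workspace template without search support might look
roughly like the sketch below. It simply reuses the Workspace element from
the configuration posted in this thread with the SearchIndex child left out;
nothing else is changed. Queries would no longer be available on such a
workspace, and the workspace.xml of each existing workspace would need the
same change, since the template in repository.xml is only applied when new
workspaces are created.

```xml
<Workspace name="${wsp.name}">
  <FileSystem
      class="org.apache.jackrabbit.core.fs.db.JNDIDatabaseFileSystem">
    <param name="dataSourceLocation" value="kycAppDataSource" />
    <param name="schema" value="mssql" />
    <param name="schemaObjectPrefix" value="J_FS_${wsp.name}_" />
  </FileSystem>
  <PersistenceManager
      class="org.apache.jackrabbit.core.persistence.db.JNDIDatabasePersistenceManager">
    <param name="dataSourceLocation" value="kycAppDataSource" />
    <param name="schema" value="mssql" />
    <param name="schemaObjectPrefix" value="J_PM_${wsp.name}_" />
  </PersistenceManager>
  <!-- No SearchIndex element: nothing is indexed and search is disabled. -->
</Workspace>
```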
[1] http://jackrabbit.apache.org/jackrabbit-text-extractors.html
[2] http://wiki.apache.org/jackrabbit/PersistenceManagerFAQ
--
Sébastien Launay