Re: Node Retrieval Performance

Dirk Rudolph Sat, 14 Nov 2015 00:36:45 -0800

> I am planning on storing a lot of data in JackRabbit (terabytes)

But that should not mean storing it all as children of a single node. 
You should probably think about driving the content hierarchy as explained in 
DavidsModel.


So in general you would structure your files into categories, for example:

/categoryA
/categoryB
/categoryC

Or even

/categoryA/sub1/subsuba
/categoryA/sub1/subsubb

and so on. Each of them could then be the root of a NodeSequence managed as a 
BTree. This would additionally allow you to split the content over multiple 
Jackrabbit instances to increase performance.

In general Jackrabbit is (or should be) able to handle that much data, but 
maintenance might take a long time and block your application. So you should 
try to keep the repository size of a single instance as small as possible, for 
example by splitting content by category, region of access, or whatever fits.
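
To illustrate the idea of spreading content over a hierarchy instead of one flat parent, here is a minimal sketch in plain Java. It is not Jackrabbit API and not what BTreeManager does internally: the class name ContentPaths, the method pathFor and the choice of two hash-derived levels are all assumptions made for illustration. It only computes the path under which a node could be created.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper (not part of Jackrabbit): derives a bucketed node path
// so that no single parent node collects all content nodes.
final class ContentPaths {
    private ContentPaths() {}

    // Returns "/<category>/<xx>/<yy>/<contentId>", where xx and yy are the
    // first two bytes of SHA-1(contentId) in lower-case hex. This gives up to
    // 256 * 256 leaf folders per category.
    static String pathFor(String category, String contentId) {
        byte[] digest;
        try {
            digest = MessageDigest.getInstance("SHA-1")
                    .digest(contentId.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
        return String.format("/%s/%02x/%02x/%s",
                category, digest[0] & 0xff, digest[1] & 0xff, contentId);
    }
}
```

Because the path depends only on the content id, a consumer that receives the id from the queue can recompute the same path and fetch the node directly, without listing the children of the root node.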

> Or can I simplify it and just do something like this to get a repo


Have a look at: 

https://jackrabbit.apache.org/api/2.10/org/apache/jackrabbit/commons/JcrUtils.html#getRepository(java.util.Map)

The parameter map can contain, for example, the following keys:

https://jackrabbit.apache.org/api/2.10/org/apache/jackrabbit/commons/JcrUtils.html#REPOSITORY_URI
https://jackrabbit.apache.org/api/2.10/org/apache/jackrabbit/spi2davex/Spi2davexRepositoryServiceFactory.html#PARAM_REPOSITORY_URI
https://jackrabbit.apache.org/api/2.10/org/apache/jackrabbit/spi2davex/Spi2davexRepositoryServiceFactory.html#PARAM_ITEMINFO_CACHE_SIZE

By the way, you should not need to call ServiceLoader#load() yourself. 
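
As a rough sketch of what such a parameter map could look like, the snippet below builds one using only the JDK. The class RepositoryParams and the method forServer are hypothetical names. The two string literals are what I believe JcrUtils.REPOSITORY_URI and Spi2davexRepositoryServiceFactory.PARAM_ITEMINFO_CACHE_SIZE resolve to; treat them as assumptions and reference the constants themselves in real code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper that assembles the parameter map for
// JcrUtils.getRepository(Map). It has no Jackrabbit dependency.
final class RepositoryParams {
    // Assumed literal values of the Jackrabbit constants; verify them against
    // the Javadoc or, better, use the constants directly.
    static final String REPOSITORY_URI = "org.apache.jackrabbit.repository.uri";
    static final String ITEMINFO_CACHE_SIZE =
            "org.apache.jackrabbit.spi2davex.ItemInfoCacheSize";

    private RepositoryParams() {}

    // Builds a map with the server URL and an item-info cache size.
    static Map<String, String> forServer(String serverUrl, int itemInfoCacheSize) {
        Map<String, String> params = new HashMap<>();
        params.put(REPOSITORY_URI, serverUrl);
        params.put(ITEMINFO_CACHE_SIZE, Integer.toString(itemInfoCacheSize));
        return params;
    }
}
```

With Jackrabbit on the classpath, the result would be passed on as JcrUtils.getRepository(RepositoryParams.forServer(url, size)). Whether the cache size is honoured depends on which repository factory ends up handling the map.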

Cheers, D

Dirk Rudolph | Senior Software Engineer
Netcentric AG

M: +41 79 642 37 11
D: +49 174 966 84 34

[email protected] <mailto:[email protected]> | 
www.netcentric.biz <http://www.netcentric.biz/>
> On 14 Nov 2015, at 01:26, David Marginian <[email protected]> wrote:
> 
> Thanks Dirk, I should have found that page on my own.  I am going to look 
> into using the BTreeManager, just curious what are the limitations for 
> documents/file counts within a node?  I am planning on storing a lot of data 
> in JackRabbit (terabytes).  Also, is the configuration code I posted in my 
> previous posts the best way to do things?  Or can I simplify it and just do 
> something like this to get a repo:
> 
> ServiceLoader.load(Class.forName("org.apache.jackrabbit.jcr2dav.Jcr2davRepositoryFactory"));
>  
> return JcrUtils.getRepository(jackabbitServerUrl);
> 
> On 11/13/2015 03:47 PM, Dirk Rudolph wrote:
>> Did I understand you correctly: you have thousands of child nodes below the
>> root node?
>> 
>> You should avoid this because this is considered bad practice in terms of
>> write performance and depending on your concurrent access this might also
>> block read access.
>> 
>> http://wiki.apache.org/jackrabbit/Performance
>> 
>> Try to introduce a structure to your content using BTreeManager:
>> 
>> 
>> https://jackrabbit.apache.org/api/2.10/org/apache/jackrabbit/commons/flat/BTreeManager.html
>> 
>> Cheers, D
>> 
>> 
>> On Friday, 13 November 2015, David Marginian <[email protected]> wrote:
>> 
>>> Thanks Clay.  I am not trying to load that many records at once.  The
>>> application is crawling a directory.  It places the files from that
>>> directory into JackRabbit one at a time, and puts a content id onto a queue
>>> which is picked up by consumers on different servers.  Those consumers then
>>> use the content id to retrieve the file from JackRabbit. Each piece of
>>> content is saved in a node under the root node.  The performance slowdown
>>> is coming from calling session.getRootNode(), from what I can gather from
>>> the docs I need the root node in order to add a child node.  Note the
>>> slowdown is pretty significant and I don't need to have close to 50k to
>>> start seeing it (I start seeing it within a few minutes of running my
>>> app).  I don't need orderable nodes, how do I disable that?
>>> 
>>> 
>>> On 11/13/2015 03:10 PM, Clay Ferguson wrote:
>>> 
>>>> Please let us know more about your use case. Why are you even "trying" to
>>>> load that many records all at once? Or at least scan them one by one, I
>>>> mean. In most use cases you wouldn't need to do this kind of thing, unless
>>>> it's some kind of backup or replication. I say "most" cases... I'm not
>>>> saying you don't need to, just asking for a bit more background. BTW: If
>>>> you don't need 'orderable' nodes, try to avoid them. That type of node does
>>>> not work at 'scale'... and 50K is probably pushing it.
>>>> 
>>>> Best regards,
>>>> Clay Ferguson
>>>> [email protected]
>>>> 
>>>> 
>>>> On Fri, Nov 13, 2015 at 3:33 PM, <[email protected]> wrote:
>>>> 
>>>>> Hi,
>>>>> I am new to JackRabbit and using version 2.11.2.  I am using JackRabbit to
>>>>> store documents in a multi-threaded environment.  I noticed that the time
>>>>> it takes to retrieve the root node is inconsistent and slow (several
>>>>> seconds +) and degrades over time (after 50K plus child nodes retrieval is
>>>>> taking ~15 seconds).
>>>>> 
>>>>> Originally, I was using code as follows to obtain a repository:
>>>>> 
>>>>>   public Repository getRepository() throws ClassNotFoundException, RepositoryException {
>>>>>       ServiceLoader.load(Class.forName("org.apache.jackrabbit.jcr2dav.Jcr2davRepositoryFactory"));
>>>>>       return JcrUtils.getRepository(jackabbitServerUrl);
>>>>>   }
>>>>> 
>>>>> Then I came across the following thread:
>>>>> 
>>>>> 
>>>>> http://jackrabbit.510166.n4.nabble.com/getRootNode-takes-27-seconds-td1571027.html#a1571302
>>>>> 
>>>>> This thread had some useful information (BatchReadConfig), but I am not
>>>>> certain how to use the API to take advantage of it.  I have changed my code
>>>>> to the following but it doesn't appear that node retrieval performance has
>>>>> improved, is there something I am missing/doing wrong?
>>>>> 
>>>>> 1) Repository Factory
>>>>> public Repository getRepository(@SuppressWarnings("rawtypes") Map parameters)
>>>>>         throws RepositoryException {
>>>>>     String repositoryFactoryName = parameters != null
>>>>>             && (parameters.containsKey(PARAM_REPOSITORY_SERVICE_FACTORY)
>>>>>                 || parameters.containsKey(PARAM_REPOSITORY_CONFIG))
>>>>>             ? "org.apache.jackrabbit.jcr2spi.Jcr2spiRepositoryFactory"
>>>>>             : "org.apache.jackrabbit.core.RepositoryFactoryImpl";
>>>>> 
>>>>>     Object repositoryFactory;
>>>>>     try {
>>>>>         Class<?> repositoryFactoryClass = Class.forName(repositoryFactoryName, true,
>>>>>                 Thread.currentThread().getContextClassLoader());
>>>>>         repositoryFactory = repositoryFactoryClass.newInstance();
>>>>>     } catch (Exception e) {
>>>>>         throw new RepositoryException(e);
>>>>>     }
>>>>> 
>>>>>     if (repositoryFactory instanceof RepositoryFactory) {
>>>>>         return ((RepositoryFactory) repositoryFactory).getRepository(parameters);
>>>>>     } else {
>>>>>         throw new RepositoryException(repositoryFactory + " is not a RepositoryFactory");
>>>>>     }
>>>>> }
>>>>> 
>>>>> 2) Use the factory to get a repo:
>>>>> public Repository getRepository() throws ClassNotFoundException, RepositoryException {
>>>>>     Map<String, RepositoryConfig> parameters = Collections.singletonMap(
>>>>>             "org.apache.jackrabbit.jcr2spi.RepositoryConfig",
>>>>>             (RepositoryConfig) new RepositoryConfigImpl(jackabbitServerUrl));
>>>>> 
>>>>>     return getRepository(parameters);
>>>>> }
>>>>> 
>>>>> 3) Repository Config:
>>>>> private static final class RepositoryConfigImpl implements RepositoryConfig {
>>>>> 
>>>>>     private String jackabbitServerUrl;
>>>>> 
>>>>>     private RepositoryConfigImpl(String jackabbitServerUrl) {
>>>>>         super();
>>>>>         this.jackabbitServerUrl = jackabbitServerUrl;
>>>>>     }
>>>>> 
>>>>>     public CacheBehaviour getCacheBehaviour() {
>>>>>         return CacheBehaviour.INVALIDATE;
>>>>>     }
>>>>> 
>>>>>     public int getItemCacheSize() {
>>>>>         return 100;
>>>>>     }
>>>>> 
>>>>>     public int getPollTimeout() {
>>>>>         return 5000;
>>>>>     }
>>>>> 
>>>>>     public RepositoryService getRepositoryService() throws RepositoryException {
>>>>>         BatchReadConfig brc = new BatchReadConfig() {
>>>>>             public int getDepth(Path path, PathResolver resolver) throws NamespaceException {
>>>>>                 return 1;
>>>>>             }
>>>>>         };
>>>>>         return new RepositoryServiceImpl(jackabbitServerUrl, brc);
>>>>>     }
>>>>> }
>>>>> 
>>>>> Thanks for your time.
>>>>> 
>>>>> David
