Re: Node Retrieval Performance

Clay Ferguson Fri, 13 Nov 2015 16:22:10 -0800

In my opinion this one issue is the single most crippling Achilles' heel of
the entire JCR, and it is very likely to drive away many potential users of
this API. It's touted as an enterprise-scale API, yet it chokes on just a few
tens of thousands of nodes. This, IMO, urgently needs to be addressed. I know
it's a technical limitation and not a design decision, but to me that just
means it's an 'unsolved' problem. I'm not complaining or criticizing the
developers, I'm just saying that as a community we need to solve this. In an
ideal situation I should be able to have 50 million nodes without it being a
problem. RDBMSs solved these issues years ago with a "never load everything
all at once" rule. Somehow, however, the "it's OK to load all children into
memory" mentality caught on in the JCR, and we are now stuck with the
results.
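
Until that is solved, the workaround suggested in the quoted reply below is
to give the content some structure so that no single parent accumulates tens
of thousands of children. Here is a minimal sketch, not Jackrabbit API, of
deriving a two-level bucketed path from a content id, which spreads nodes
across up to 256 x 256 intermediate folders. The class name BucketedPath, the
/content prefix, and the use of String.hashCode() for bucketing are all
assumptions made for illustration.

```java
/**
 * Hypothetical helper (not part of Jackrabbit): maps a content id to a
 * bucketed absolute path so that children are spread over many parents
 * instead of all sitting directly under the root node.
 */
public final class BucketedPath {

    private BucketedPath() {
        // static helper, no instances
    }

    /**
     * Builds /content/<b1>/<b2>/<contentId>, where b1 and b2 are the two
     * lowest bytes of contentId.hashCode() written as two hex digits each.
     * Example: "a" has hash code 97 (0x61), so it maps to /content/61/00/a.
     */
    public static String forContentId(String contentId) {
        if (contentId == null || contentId.isEmpty() || contentId.indexOf('/') >= 0) {
            throw new IllegalArgumentException("content id must be non-empty and must not contain '/'");
        }
        int hash = contentId.hashCode();
        int first = hash & 0xff;            // lowest byte picks the first-level bucket
        int second = (hash >>> 8) & 0xff;   // next byte picks the second-level bucket
        return String.format("/content/%02x/%02x/%s", first, second, contentId);
    }
}
```

With something like this, a consumer could pass the resulting path straight
to Session.getNode(String) instead of going through session.getRootNode(),
and a producer could create the intermediate folders on demand
(JcrUtils.getOrCreateByPath in jackrabbit-jcr-commons is one option). How
much this helps over the remoting layer would need to be measured.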



Best regards,
Clay Ferguson
[email protected]


On Fri, Nov 13, 2015 at 4:47 PM, Dirk Rudolph <[email protected]>
wrote:

> Did I understand you correctly: you have thousands of child nodes below the
> root node?
>
> You should avoid this because this is considered bad practice in terms of
> write performance and depending on your concurrent access this might also
> block read access.
>
> http://wiki.apache.org/jackrabbit/Performance
>
> Try to introduce a structure to your content using BTreeManager:
>
>
>
> https://jackrabbit.apache.org/api/2.10/org/apache/jackrabbit/commons/flat/BTreeManager.html
>
> Cheers, D
>
>
> On Friday, 13 November 2015, David Marginian <[email protected]> wrote:
>
> > Thanks Clay.  I am not trying to load that many records at once.  The
> > application is crawling a directory.  It places the files from that
> > directory into JackRabbit one at a time and puts a content id onto a
> > queue, which is picked up by consumers on different servers.  Those
> > consumers then use the content id to retrieve the file from JackRabbit.
> > Each piece of content is saved in a node under the root node.  The
> > performance slowdown is coming from calling session.getRootNode(); from
> > what I can gather from the docs, I need the root node in order to add a
> > child node.  Note that the slowdown is pretty significant, and I don't
> > need to get close to 50k nodes to start seeing it (I start seeing it
> > within a few minutes of running my app).  I don't need orderable nodes;
> > how do I disable that?
> >
> >
> > On 11/13/2015 03:10 PM, Clay Ferguson wrote:
> >
> >> Please let us know more about your use case. Why are you even "trying"
> >> to load that many records all at once, or at least scan them one by one?
> >> In most use cases you wouldn't need to do this kind of thing, unless
> >> it's some kind of backup or replication. I say "most" cases... I'm not
> >> saying you don't need to, just asking for a bit more background. BTW: if
> >> you don't need 'orderable' nodes, try to avoid them. That type of node
> >> does not work at 'scale'... and 50K is probably pushing it.
> >>
> >> Best regards,
> >> Clay Ferguson
> >> [email protected]
> >>
> >>
> >> On Fri, Nov 13, 2015 at 3:33 PM, <[email protected]> wrote:
> >>
> >>> Hi,
> >>> I am new to JackRabbit and using version 2.11.2.  I am using JackRabbit
> >>> to store documents in a multi-threaded environment.  I noticed that the
> >>> time it takes to retrieve the root node is inconsistent and slow
> >>> (several seconds or more) and degrades over time (after 50K-plus child
> >>> nodes, retrieval takes ~15 seconds).
> >>>
> >>> Originally, I was using code as follows to obtain a repository:
> >>>
> >>>   public Repository getRepository() throws ClassNotFoundException, RepositoryException {
> >>>       ServiceLoader.load(Class.forName("org.apache.jackrabbit.jcr2dav.Jcr2davRepositoryFactory"));
> >>>       return JcrUtils.getRepository(jackabbitServerUrl);
> >>>   }
> >>>
> >>> Then I came across the following thread:
> >>>
> >>> http://jackrabbit.510166.n4.nabble.com/getRootNode-takes-27-seconds-td1571027.html#a1571302
> >>>
> >>> This thread had some useful information (BatchReadConfig), but I am not
> >>> certain how to use the API to take advantage of it.  I have changed my
> >>> code to the following, but node retrieval performance does not appear to
> >>> have improved. Is there something I am missing or doing wrong?
> >>>
> >>> 1) Repository Factory
> >>> public Repository getRepository(@SuppressWarnings("rawtypes") Map parameters)
> >>>         throws RepositoryException {
> >>>     String repositoryFactoryName = parameters != null
> >>>             && (parameters.containsKey(PARAM_REPOSITORY_SERVICE_FACTORY)
> >>>                 || parameters.containsKey(PARAM_REPOSITORY_CONFIG))
> >>>             ? "org.apache.jackrabbit.jcr2spi.Jcr2spiRepositoryFactory"
> >>>             : "org.apache.jackrabbit.core.RepositoryFactoryImpl";
> >>>
> >>>     Object repositoryFactory;
> >>>     try {
> >>>         Class<?> repositoryFactoryClass = Class.forName(repositoryFactoryName, true,
> >>>                 Thread.currentThread().getContextClassLoader());
> >>>
> >>>         repositoryFactory = repositoryFactoryClass.newInstance();
> >>>     }
> >>>     catch (Exception e) {
> >>>         throw new RepositoryException(e);
> >>>     }
> >>>
> >>>     if (repositoryFactory instanceof RepositoryFactory) {
> >>>         return ((RepositoryFactory) repositoryFactory).getRepository(parameters);
> >>>     }
> >>>     else {
> >>>         throw new RepositoryException(repositoryFactory + " is not a RepositoryFactory");
> >>>     }
> >>> }
> >>>
> >>> 2) Use the factory to get a repo:
> >>> public Repository getRepository() throws ClassNotFoundException, RepositoryException {
> >>>     Map<String, RepositoryConfig> parameters = Collections.singletonMap(
> >>>             "org.apache.jackrabbit.jcr2spi.RepositoryConfig",
> >>>             (RepositoryConfig) new RepositoryConfigImpl(jackabbitServerUrl));
> >>>
> >>>     return getRepository(parameters);
> >>> }
> >>>
> >>> 3) Repository Config:
> >>> private static final class RepositoryConfigImpl implements RepositoryConfig {
> >>>
> >>>     private String jackabbitServerUrl;
> >>>
> >>>     private RepositoryConfigImpl(String jackabbitServerUrl) {
> >>>         super();
> >>>         this.jackabbitServerUrl = jackabbitServerUrl;
> >>>     }
> >>>
> >>>     public CacheBehaviour getCacheBehaviour() {
> >>>         return CacheBehaviour.INVALIDATE;
> >>>     }
> >>>
> >>>     public int getItemCacheSize() {
> >>>         return 100;
> >>>     }
> >>>
> >>>     public int getPollTimeout() {
> >>>         return 5000;
> >>>     }
> >>>
> >>>     public RepositoryService getRepositoryService() throws RepositoryException {
> >>>         BatchReadConfig brc = new BatchReadConfig() {
> >>>             public int getDepth(Path path, PathResolver resolver)
> >>>                     throws NamespaceException {
> >>>                 return 1;
> >>>             }
> >>>         };
> >>>         return new RepositoryServiceImpl(jackabbitServerUrl, brc);
> >>>     }
> >>>
> >>> }
> >>>
> >>> Thanks for your time.
> >>>
> >>> David
> >>>
> >>>
> >>>
> >>>
> >>>
> >
>
> --
>
> Dirk Rudolph | Senior Software Engineer
>
> Netcentric AG
>
> M: +41 79 642 37 11
> D: +49 174 966 84 34
>
> [email protected] | www.netcentric.biz
>
