Re: Node Retrieval Performance

Clay Ferguson Sat, 14 Nov 2015 08:56:46 -0800

Dirk,
You are not adding new information. Everything you just said was a known
and a given. We all realize we can be creative and solve this problem, and
avoid large numbers of children in all manner of creative and
straightforward ways. However, can you imagine yourself making the same
statement about RDBMS tables? If you were a developer on an RDBMS,
struggling to get scale working, would you ever say this to your boss: "Oh
well, if the table gets over 50K, we can just add new tables, because since
the DB can't deal with it we can just put the responsibility on the app
developers." If that would be a silly statement in the RDBMS world, it will
be silly in the NoSQL world for all the same exact reasons.



Best regards,
Clay Ferguson
[email protected]


On Sat, Nov 14, 2015 at 10:07 AM, Dirk Rudolph <[email protected]>
wrote:

> Each of the records has a primary key, I guess. So build the uuid or any
> hash from it and use it as key in a BTree structure. Simple and
> straightforward.
>
> Actually the idea is to find structure in your data. This is a core idea of
> structured document stores. In case you have a large amount of siblings the
> detail level of your structure might not be deep enough.
>
> Anyway if you want to store key value tables somewhere there is a broad
> pool of available open source solutions.
>
> Cheers, D
>
> On Saturday, 14 November 2015, Clay Ferguson <[email protected]> wrote:
>
> > Dirk,
> > What you're explaining would work great if the data had naturally
> > occurring categories all being conveniently at whatever size JCR happens
> > to handle ok. This just doesn't work well in actuality. What if I just
> > need to store a table of 25 million arbitrary records? The "it can't be
> > done" with JCR is the honest answer. Solving it by creating a bunch of
> > separate buckets is a massive ugly kluge. Whatever the technical
> > limitation is, it's INSIDE Jackrabbit, and badly needs to be addressed
> > rather than forcing developers to jump thru hoops in application code.
> > Surely I can't be the only one to think this? Is everybody else just
> > afraid to be critical like me, because they are getting paid to work on
> > JCR? Why don't we just be honest.
> >
> > Best regards,
> > Clay Ferguson
> > [email protected]
> >
> >
> > On Sat, Nov 14, 2015 at 2:35 AM, Dirk Rudolph <[email protected]> wrote:
> >
> > > > I am planning on storing a lot of data in JackRabbit (terabytes)
> > >
> > > But that should not mean storing them all as children of a single Node.
> > > Probably you should think about driving the hierarchy as explained in
> > > DavidsModel.
> > >
> > > So in general you would structure your files in for example categories:
> > >
> > > /categoryA
> > > /categoryB
> > > /categoryC
> > >
> > > Or even
> > >
> > > /categoryA/sub1/subsuba
> > > /categoryA/sub1/subsubb
> > >
> > > and so on. Each of them could then be the root of a NodeSequence managed
> > > as a BTree. This would additionally allow you to split the content over
> > > multiple Jackrabbit instances to increase performance.
> > >
> > > In general Jackrabbit is/should be able to handle that much data, but
> > > maintenance might take a lot of time, blocking your application. So you
> > > should try to keep the repository size of a single instance as small as
> > > possible by, for example, splitting content by category, region of
> > > access, or whatever.
> > >
> > > > Or can I simplify it and just do something like this to get a repo
> > >
> > >
> > > Have a look at:
> > >
> > > https://jackrabbit.apache.org/api/2.10/org/apache/jackrabbit/commons/JcrUtils.html#getRepository(java.util.Map)
> > >
> > > The parameterMap contains for example
> > >
> > > https://jackrabbit.apache.org/api/2.10/org/apache/jackrabbit/commons/JcrUtils.html#REPOSITORY_URI
> > >
> > > https://jackrabbit.apache.org/api/2.10/org/apache/jackrabbit/spi2davex/Spi2davexRepositoryServiceFactory.html#PARAM_REPOSITORY_URI
> > >
> > > https://jackrabbit.apache.org/api/2.10/org/apache/jackrabbit/spi2davex/Spi2davexRepositoryServiceFactory.html#PARAM_ITEMINFO_CACHE_SIZE
> > >
> > > Btw, it should not be necessary to call ServiceLoader#load() yourself.
> > >
> > > Cheers, D
> > >
> > > Dirk Rudolph | Senior Software Engineer
> > > Netcentric AG
> > >
> > > M: +41 79 642 37 11
> > > D: +49 174 966 84 34
> > >
> > > [email protected] | www.netcentric.biz
> > > > On 14 Nov 2015, at 01:26, David Marginian <[email protected]> wrote:
> > > >
> > > > Thanks Dirk, I should have found that page on my own.  I am going to
> > > > look into using the BTreeManager, just curious what are the limitations
> > > > for documents/file counts within a node?  I am planning on storing a lot
> > > > of data in JackRabbit (terabytes).  Also, is the configuration code I
> > > > posted in my previous posts the best way to do things?  Or can I simplify
> > > > it and just do something like this to get a repo:
> > > >
> > > > ServiceLoader.load(Class.forName("org.apache.jackrabbit.jcr2dav.Jcr2davRepositoryFactory"));
> > > > return JcrUtils.getRepository(jackabbitServerUrl);
> > > >
> > > > On 11/13/2015 03:47 PM, Dirk Rudolph wrote:
> > > >> Did I understand you right, you have thousands of child nodes below
> > > >> the root node?
> > > >>
> > > >> You should avoid this because this is considered bad practice in terms
> > > >> of write performance and depending on your concurrent access this might
> > > >> also block read access.
> > > >>
> > > >> http://wiki.apache.org/jackrabbit/Performance
> > > >>
> > > >> Try to introduce a structure to your content using BTreeManager
> > > >>
> > > >> https://jackrabbit.apache.org/api/2.10/org/apache/jackrabbit/commons/flat/BTreeManager.html
> > > >>
> > > >> Cheers, D
> > > >>
> > > >>
> > > >> On Friday, 13 November 2015, David Marginian <[email protected]> wrote:
> > > >>
> > > >>> Thanks Clay.  I am not trying to load that many records at once.  The
> > > >>> application is crawling a directory.  It places the files from that
> > > >>> directory into JackRabbit one at a time, and puts a content id onto a
> > > >>> queue which is picked up by consumers on different servers.  Those
> > > >>> consumers then use the content id to retrieve the file from JackRabbit.
> > > >>> Each piece of content is saved in a node under the root node.  The
> > > >>> performance slowdown is coming from calling session.getRootNode(); from
> > > >>> what I can gather from the docs, I need the root node in order to add a
> > > >>> child node.  Note the slowdown is pretty significant and I don't need to
> > > >>> have close to 50k to start seeing it (I start seeing it within a few
> > > >>> minutes of running my app).  I don't need orderable nodes, how do I
> > > >>> disable that?
> > > >>>
> > > >>>
> > > >>> On 11/13/2015 03:10 PM, Clay Ferguson wrote:
> > > >>>
> > > >>>> Please let us know more about your use case. Why are you even "trying"
> > > >>>> to load that many records all at once? Or at least scan them one by
> > > >>>> one, I mean. In most use cases you wouldn't need to do this kind of
> > > >>>> thing, unless it's some kind of backup or replication. I say "most"
> > > >>>> cases... I'm not saying you don't need to, just asking for a bit more
> > > >>>> background. BTW: If you don't need 'orderable' nodes try to avoid them.
> > > >>>> That type of node does not work at 'scale'... and 50K is probably
> > > >>>> pushing it.
> > > >>>>
> > > >>>> Best regards,
> > > >>>> Clay Ferguson
> > > >>>> [email protected]
> > > >>>>
> > > >>>>
> > > >>>> On Fri, Nov 13, 2015 at 3:33 PM, <[email protected]> wrote:
> > > >>>>
> > > >>>>> Hi,
> > > >>>>> I am new to JackRabbit and using version 2.11.2.  I am using JackRabbit
> > > >>>>> to store documents in a multi-threaded environment.  I noticed that the
> > > >>>>> time it takes to retrieve the root node is inconsistent and slow
> > > >>>>> (several seconds +) and degrades over time (after 50K plus child nodes
> > > >>>>> retrieval is taking ~15 seconds).
> > > >>>>>
> > > >>>>> Originally, I was using code as follows to obtain a repository:
> > > >>>>>
> > > >>>>>   public Repository getRepository() throws ClassNotFoundException, RepositoryException {
> > > >>>>>       ServiceLoader.load(Class.forName("org.apache.jackrabbit.jcr2dav.Jcr2davRepositoryFactory"));
> > > >>>>>       return JcrUtils.getRepository(jackabbitServerUrl);
> > > >>>>>   }
> > > >>>>>
> > > >>>>> Then I came across the following thread:
> > > >>>>>
> > > >>>>> http://jackrabbit.510166.n4.nabble.com/getRootNode-takes-27-seconds-td1571027.html#a1571302
> > > >>>>>
> > > >>>>> This thread had some useful information (BatchReadConfig), but I am not
> > > >>>>> certain how to use the API to take advantage of it.  I have changed my
> > > >>>>> code to the following but it doesn't appear that node retrieval
> > > >>>>> performance has improved, is there something I am missing/doing wrong?
> > > >>>>>
> > > >>>>> 1) Repository Factory
> > > >>>>> public Repository getRepository(@SuppressWarnings("rawtypes") Map parameters) throws RepositoryException {
> > > >>>>>     String repositoryFactoryName = parameters != null && (
> > > >>>>>             parameters.containsKey(PARAM_REPOSITORY_SERVICE_FACTORY) ||
> > > >>>>>             parameters.containsKey(PARAM_REPOSITORY_CONFIG))
> > > >>>>>             ? "org.apache.jackrabbit.jcr2spi.Jcr2spiRepositoryFactory"
> > > >>>>>             : "org.apache.jackrabbit.core.RepositoryFactoryImpl";
> > > >>>>>
> > > >>>>>     Object repositoryFactory;
> > > >>>>>     try {
> > > >>>>>         Class<?> repositoryFactoryClass = Class.forName(repositoryFactoryName, true,
> > > >>>>>                 Thread.currentThread().getContextClassLoader());
> > > >>>>>
> > > >>>>>         repositoryFactory = repositoryFactoryClass.newInstance();
> > > >>>>>     }
> > > >>>>>     catch (Exception e) {
> > > >>>>>         throw new RepositoryException(e);
> > > >>>>>     }
> > > >>>>>
> > > >>>>>     if (repositoryFactory instanceof RepositoryFactory) {
> > > >>>>>         return ((RepositoryFactory) repositoryFactory).getRepository(parameters);
> > > >>>>>     }
> > > >>>>>     else {
> > > >>>>>         throw new RepositoryException(repositoryFactory + " is not a RepositoryFactory");
> > > >>>>>     }
> > > >>>>> }
> > > >>>>>
> > > >>>>> 2) Use the factory to get a repo:
> > > >>>>> public Repository getRepository() throws ClassNotFoundException, RepositoryException {
> > > >>>>>     Map<String, RepositoryConfig> parameters = Collections.singletonMap(
> > > >>>>>             "org.apache.jackrabbit.jcr2spi.RepositoryConfig",
> > > >>>>>             (RepositoryConfig) new RepositoryConfigImpl(jackabbitServerUrl));
> > > >>>>>
> > > >>>>>     return getRepository(parameters);
> > > >>>>> }
> > > >>>>>
> > > >>>>> 3) Repository Config:
> > > >>>>> private static final class RepositoryConfigImpl implements RepositoryConfig {
> > > >>>>>
> > > >>>>>     private String jackabbitServerUrl;
> > > >>>>>
> > > >>>>>     private RepositoryConfigImpl(String jackabbitServerUrl) {
> > > >>>>>         super();
> > > >>>>>         this.jackabbitServerUrl = jackabbitServerUrl;
> > > >>>>>     }
> > > >>>>>
> > > >>>>>     public CacheBehaviour getCacheBehaviour() {
> > > >>>>>         return CacheBehaviour.INVALIDATE;
> > > >>>>>     }
> > > >>>>>
> > > >>>>>     public int getItemCacheSize() {
> > > >>>>>         return 100;
> > > >>>>>     }
> > > >>>>>
> > > >>>>>     public int getPollTimeout() {
> > > >>>>>         return 5000;
> > > >>>>>     }
> > > >>>>>
> > > >>>>>     public RepositoryService getRepositoryService() throws RepositoryException {
> > > >>>>>         BatchReadConfig brc = new BatchReadConfig() {
> > > >>>>>             public int getDepth(Path path, PathResolver resolver) throws NamespaceException {
> > > >>>>>                 return 1;
> > > >>>>>             }
> > > >>>>>         };
> > > >>>>>         return new RepositoryServiceImpl(jackabbitServerUrl, brc);
> > > >>>>>     }
> > > >>>>>
> > > >>>>> }
> > > >>>>>
> > > >>>>> Thanks for your time.
> > > >>>>>
> > > >>>>> David
> > >
> > >
> >
>
>
> --
>
> Dirk Rudolph | Senior Software Engineer
>
> Netcentric AG
>
> M: +41 79 642 37 11
> D: +49 174 966 84 34
>
> [email protected] | www.netcentric.biz
>
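
For anyone who does take the application-side route debated in this thread, the hashing idea Dirk describes (derive a hash from each record's key and use it to spread records over intermediate nodes) could be sketched roughly as below. This is plain Java and does not use Jackrabbit's BTreeManager or any JCR API. The class name HashedPaths, the /records root, and the two-level fan-out are assumptions made for illustration only, and the key is assumed to already be a valid JCR node name.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper: maps a record key to a bucketed repository path. */
public final class HashedPaths {

    private HashedPaths() {
    }

    /**
     * Builds a path of the form root/xx/yy/key, where each intermediate
     * segment is one byte of the key's SHA-256 digest written as two hex
     * digits. Every intermediate node therefore has at most 256 children.
     */
    public static String bucketPath(String root, String key, int levels) {
        if (levels < 0 || levels > 32) {
            // A SHA-256 digest has 32 bytes, so at most 32 levels are available.
            throw new IllegalArgumentException("levels must be between 0 and 32");
        }
        byte[] digest = sha256(key);
        StringBuilder path = new StringBuilder(root);
        for (int i = 0; i < levels; i++) {
            path.append('/').append(String.format("%02x", digest[i] & 0xff));
        }
        return path.append('/').append(key).toString();
    }

    private static byte[] sha256(String key) {
        try {
            return MessageDigest.getInstance("SHA-256")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be supported by every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Prints a path with two hex bucket segments between root and key.
        System.out.println(bucketPath("/records", "record-0000001", 2));
    }
}
```

With two levels there are 256 * 256 = 65,536 leaf buckets, so 25 million records would average roughly 380 per bucket, assuming the hash spreads the keys evenly. Because the path is derived from the key alone, a consumer holding only the content id can recompute it without listing any parent node.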
