> I am planning on storing a lot of data in JackRabbit (terabytes)

That should not mean storing it all as children of a single node, though. You
should think about designing the hierarchy as explained in DavidsModel.

In general you would structure your files by, for example, category:

/categoryA
/categoryB
/categoryC

Or even:

/categoryA/sub1/subsuba
/categoryA/sub1/subsubb

and so on. Each of them could then be the root of a NodeSequence managed as a
BTree. This would additionally allow you to split the content over multiple
Jackrabbit instances to increase performance.

In general Jackrabbit is (or should be) able to handle that much data, but
maintenance might take a long time and block your application while it runs.
So you should try to keep the repository size of a single instance as small as
possible, for example by splitting content by category, region of access, or
whatever fits your use case.

> Or can I simplify it and just do something like this to get a repo

Have a look at:

https://jackrabbit.apache.org/api/2.10/org/apache/jackrabbit/commons/JcrUtils.html#getRepository(java.util.Map)

The parameter map contains, for example:

https://jackrabbit.apache.org/api/2.10/org/apache/jackrabbit/commons/JcrUtils.html#REPOSITORY_URI
https://jackrabbit.apache.org/api/2.10/org/apache/jackrabbit/spi2davex/Spi2davexRepositoryServiceFactory.html#PARAM_REPOSITORY_URI
https://jackrabbit.apache.org/api/2.10/org/apache/jackrabbit/spi2davex/Spi2davexRepositoryServiceFactory.html#PARAM_ITEMINFO_CACHE_SIZE

Btw., it should not be necessary to call ServiceLoader#load() yourself.
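
As a rough illustration of the category layout described above: a content id
could be mapped to a bucketed path below its category, so that no single node
collects all the children. This is a plain hash-bucketing sketch, not what
BTreeManager does internally. The class name ContentPaths, the method
bucketedPath and the two-level hex bucketing are assumptions made up for this
example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Jackrabbit: derives a node path such as
// /categoryA/<xx>/<yy>/<contentId> from a content id.
class ContentPaths {

    // Uses the first two bytes of a SHA-256 hash of the content id as two
    // intermediate levels (256 possible names per level), so children are
    // spread over many parent nodes instead of one.
    static String bucketedPath(String category, String contentId) {
        byte[] digest;
        try {
            digest = MessageDigest.getInstance("SHA-256")
                    .digest(contentId.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
        String level1 = String.format("%02x", digest[0] & 0xff);
        String level2 = String.format("%02x", digest[1] & 0xff);
        return "/" + category + "/" + level1 + "/" + level2 + "/" + contentId;
    }

    public static void main(String[] args) {
        // Prints the bucketed path for one example id.
        System.out.println(bucketedPath("categoryA", "doc-000123"));
    }
}
```

The same content id always yields the same path, so a consumer that only knows
the category and the id can recompute where the node lives. The intermediate
nodes could be created on demand, for example with JcrUtils.getOrCreateByPath
from jackrabbit-jcr-commons, instead of adding every node directly below the
root.
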
Cheers, D

Dirk Rudolph | Senior Software Engineer
Netcentric AG
M: +41 79 642 37 11
D: +49 174 966 84 34
[email protected] | www.netcentric.biz

> On 14 Nov 2015, at 01:26, David Marginian <[email protected]> wrote:
>
> Thanks Dirk, I should have found that page on my own. I am going to look
> into using the BTreeManager, just curious what are the limitations for
> documents/file counts within a node? I am planning on storing a lot of data
> in JackRabbit (terabytes). Also, is the configuration code I posted in my
> previous posts the best way to do things? Or can I simplify it and just do
> something like this to get a repo:
>
> ServiceLoader.load(Class.forName("org.apache.jackrabbit.jcr2dav.Jcr2davRepositoryFactory"));
>
> return JcrUtils.getRepository(jackabbitServerUrl);
>
> On 11/13/2015 03:47 PM, Dirk Rudolph wrote:
>> Did I understand you right, you have thousands of child nodes below the
>> root node?
>>
>> You should avoid this because this is considered bad practice in terms of
>> write performance and depending on your concurrent access this might also
>> block read access.
>>
>> http://wiki.apache.org/jackrabbit/Performance
>>
>> Try to introduce a structure to your content using BTreeManager
>>
>> https://jackrabbit.apache.org/api/2.10/org/apache/jackrabbit/commons/flat/BTreeManager.html
>>
>> Cheers, D
>>
>>
>> On Friday, 13 November 2015, David Marginian <[email protected]> wrote:
>>
>>> Thanks Clay. I am not trying to load that many records at once. The
>>> application is crawling a directory. It places the files from that
>>> directory into JackRabbit one at a time, and puts a content id onto a queue
>>> which is picked up by consumers on different servers. Those consumers then
>>> use the content id to retrieve the file from JackRabbit. Each piece of
>>> content is saved in a node under the root node.
>>> The performance slowdown is coming from calling session.getRootNode().
>>> From what I can gather from the docs I need the root node in order to add
>>> a child node. Note the slowdown is pretty significant and I don't need to
>>> have close to 50k to start seeing it (I start seeing it within a few
>>> minutes of running my app). I don't need orderable nodes, how do I disable
>>> that?
>>>
>>>
>>> On 11/13/2015 03:10 PM, Clay Ferguson wrote:
>>>
>>>> Please let us know more about your use case. Why are you even "trying" to
>>>> load that many records all at once? Or at least scan them one by one, I
>>>> mean. In most use cases you wouldn't need to do this kind of thing, unless
>>>> it's some kind of backup or replication. I say "most" cases... I'm not
>>>> saying you don't need to, just asking for a bit more background. BTW: If
>>>> you don't need 'orderable' nodes try to avoid them. That type of node does
>>>> not work at 'scale'... and 50K is probably pushing it.
>>>>
>>>> Best regards,
>>>> Clay Ferguson
>>>> [email protected]
>>>>
>>>>
>>>> On Fri, Nov 13, 2015 at 3:33 PM, <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>> I am new to JackRabbit and using version 2.11.2. I am using JackRabbit to
>>>>> store documents in a multi-threaded environment. I noticed that the time
>>>>> it takes to retrieve the root node is inconsistent and slow (several
>>>>> seconds +) and degrades over time (after 50K plus child nodes retrieval is
>>>>> taking ~15 seconds).
>>>>>
>>>>> Originally, I was using code as follows to obtain a repository:
>>>>>
>>>>> public Repository getRepository() throws ClassNotFoundException, RepositoryException {
>>>>>     ServiceLoader.load(Class.forName("org.apache.jackrabbit.jcr2dav.Jcr2davRepositoryFactory"));
>>>>>     return JcrUtils.getRepository(jackabbitServerUrl);
>>>>> }
>>>>>
>>>>> Then I came across the following thread:
>>>>>
>>>>> http://jackrabbit.510166.n4.nabble.com/getRootNode-takes-27-seconds-td1571027.html#a1571302
>>>>>
>>>>> This thread had some useful information (BatchReadConfig), but I am not
>>>>> certain how to use the API to take advantage of it. I have changed my code
>>>>> to the following but it doesn't appear that node retrieval performance has
>>>>> improved, is there something I am missing/doing wrong?
>>>>>
>>>>> 1) Repository Factory
>>>>> public Repository getRepository(@SuppressWarnings("rawtypes") Map parameters) throws RepositoryException {
>>>>>     String repositoryFactoryName = parameters != null && (
>>>>>             parameters.containsKey(PARAM_REPOSITORY_SERVICE_FACTORY) ||
>>>>>             parameters.containsKey(PARAM_REPOSITORY_CONFIG))
>>>>>             ?
>>>>>             "org.apache.jackrabbit.jcr2spi.Jcr2spiRepositoryFactory"
>>>>>             : "org.apache.jackrabbit.core.RepositoryFactoryImpl";
>>>>>
>>>>>     Object repositoryFactory;
>>>>>     try {
>>>>>         Class<?> repositoryFactoryClass = Class.forName(repositoryFactoryName, true,
>>>>>                 Thread.currentThread().getContextClassLoader());
>>>>>
>>>>>         repositoryFactory = repositoryFactoryClass.newInstance();
>>>>>     }
>>>>>     catch (Exception e) {
>>>>>         throw new RepositoryException(e);
>>>>>     }
>>>>>
>>>>>     if (repositoryFactory instanceof RepositoryFactory) {
>>>>>         return ((RepositoryFactory) repositoryFactory).getRepository(parameters);
>>>>>     }
>>>>>     else {
>>>>>         throw new RepositoryException(repositoryFactory + " is not a RepositoryFactory");
>>>>>     }
>>>>> }
>>>>>
>>>>> 2) Use the factory to get a repo:
>>>>> public Repository getRepository() throws ClassNotFoundException, RepositoryException {
>>>>>     Map<String, RepositoryConfig> parameters = Collections.singletonMap(
>>>>>             "org.apache.jackrabbit.jcr2spi.RepositoryConfig",
>>>>>             (RepositoryConfig) new RepositoryConfigImpl(jackabbitServerUrl));
>>>>>
>>>>>     return getRepository(parameters);
>>>>> }
>>>>>
>>>>> 3) Repository Config:
>>>>> private static final class RepositoryConfigImpl implements RepositoryConfig {
>>>>>
>>>>>     private String jackabbitServerUrl;
>>>>>
>>>>>     private RepositoryConfigImpl(String jackabbitServerUrl) {
>>>>>         super();
>>>>>         this.jackabbitServerUrl = jackabbitServerUrl;
>>>>>     }
>>>>>
>>>>>     public CacheBehaviour getCacheBehaviour() {
>>>>>         return CacheBehaviour.INVALIDATE;
>>>>>     }
>>>>>
>>>>>     public int getItemCacheSize() {
>>>>>         return 100;
>>>>>     }
>>>>>
>>>>>     public int getPollTimeout() {
>>>>>         return 5000;
>>>>>     }
>>>>>
>>>>>     public RepositoryService getRepositoryService() throws RepositoryException {
>>>>>         BatchReadConfig brc = new BatchReadConfig() {
>>>>>             public int getDepth(Path path, PathResolver resolver) throws NamespaceException {
>>>>>                 return 1;
>>>>>             }
>>>>>         };
>>>>>         return new RepositoryServiceImpl(jackabbitServerUrl, brc);
>>>>>     }
>>>>>
>>>>> }
>>>>>
>>>>> Thanks for your time.
>>>>>
>>>>> David
