[ANNOUNCE] Apache Jackrabbit Oak 1.7.1 released

2017-06-09 Thread Davide Giannella
The Apache Jackrabbit community is pleased to announce the release of
Apache Jackrabbit Oak. The release is available for download at:

http://jackrabbit.apache.org/downloads.html

See the full release notes below for details about this release:

Release Notes -- Apache Jackrabbit Oak -- Version 1.7.1

Introduction
------------

Jackrabbit Oak is a scalable, high-performance hierarchical content
repository designed for use as the foundation of modern world-class
web sites and other demanding content applications.

Apache Jackrabbit Oak 1.7.1 is an unstable release cut directly from
Jackrabbit Oak trunk, with a focus on new features and other
improvements. For production use we recommend the latest stable 1.6.x
release.

The Oak effort is a part of the Apache Jackrabbit project.
Apache Jackrabbit is a project of the Apache Software Foundation.

Changes in Oak 1.7.1
--------------------

Sub-task

[OAK-6227] - There should be a way to retrieve oldest timestamp to
keep from nodestores

Technical task

[OAK-4612] - Multiplexing support for CugPermissionProvider
[OAK-6196] - Improve Javadoc of multiplexing SPI
[OAK-6270] - There should be a way for editors to be notified by
AsyncIndexUpdate about success/failure of indexing commit
[OAK-6282] - Implement a DummyDataStore to be used to test setup
with no BlobStore access

Bug

[OAK-5573] -
org.apache.jackrabbit.oak.segment.standby.StandbyTestIT.testSyncLoop
[OAK-6266] - SolrQueryIndexProviderService should always have
NodeAggregator
[OAK-6267] - Version restore fails if restore would not change
bundling root but changes bundled nodes
[OAK-6273] -
FilteringNodeStateTest#shouldHaveCorrectChildOrderProperty is
failing
[OAK-6277] - UserQueryManager: redundant check for colliding bound
and offset
[OAK-6278] - UserQueryManager: scope filtering for everyone
groupId compares to principal name
[OAK-6283] - FileCache should ignore when file evicted with
replacement
[OAK-6290] - UserQueryManager.findAuthorizables fails with
IllegalArgumentException when there are multiple selectors
[OAK-6292] - SecurityProviderRegistration.maybeUnregister: typo on
comment
[OAK-6293] - Enable test log creation for oak-blob-plugins
[OAK-6300] - CacheConsistencyTestBase: potential NPE in teardown
[OAK-6306] - upgrade uses lucene wrong version (transient
dependency)

Improvement

[OAK-2808] - Active deletion of 'deleted' Lucene index files from
DataStore without relying on full scale Blob GC
[OAK-3498] - DN can't be used as the group name in the external
auth handler
[OAK-4513] - Detect and log references across stores
[OAK-5525] - VisibleEditor should use the NodeStateUtils to
determine visibility
[OAK-5935] - AbstractSharedCachingDataStore#getRecordIfStored
should use the underlying cache.get
[OAK-6256] - Prevent creating the across-mounts references
[OAK-6272] - AbstractNodeState.toString does not scale to many
child nodes
[OAK-6289] - Unreferenced argument reference in method
SegmentBufferWriter.writeRecordId
[OAK-6296] - Move JACKRABBIT_2_SINGLE_QUOTED_PHRASE from
o.a.j.oak.query.ask.FullTextSearchImpl to
oak.fulltext.FullTextParser
[OAK-6298] - FacetHelper should have private constructor
[OAK-6299] - FilterIterators should have a private constructor
[OAK-6301] - Make QueryEngineSettingsMBeanImpl an inner class of
o.a.j.oak.Oak
[OAK-6302] - UserInitializer: createSystemRoot can get null value
for QueryEngineSettings
[OAK-6307] - Function to find all large docs in Mongo

Task

[OAK-6280] - Expose whiteboard from NodeStoreFixture to provide
access to NodeStore components
[OAK-6281] - Dump metrics data to system out if metrics option is
enabled

Test

[OAK-5882] - Improve coverage for oak.security code in oak-core

In addition to the above-mentioned changes, this release contains
all changes included up to the Apache Jackrabbit Oak 1.7.x release.

For more detailed information about all the changes in this and other
Oak releases, please see the Oak issue tracker at

  https://issues.apache.org/jira/browse/OAK

Release Contents
----------------

This release consists of a single source archive packaged as a zip file.
The archive can be unpacked with the jar tool from your JDK installation.
See the README.md file for instructions on how to build this release.

The source archive is accompanied by SHA1 and MD5 checksums and a PGP
signature that you can use to verify the authenticity of your download.
The public key used for the PGP signature can be found at
http://www.apache.org/dist/jackrabbit/KEYS.
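
For example, the SHA1 check can be done with a short Java snippet like the
one below (the file names are placeholders for the artifacts you actually
downloaded):

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.security.MessageDigest;

    public class VerifySha1 {
        public static void main(String[] args) throws Exception {
            // Placeholder names: the source archive and its published .sha1 file.
            byte[] data = Files.readAllBytes(
                    Paths.get("jackrabbit-oak-1.7.1-src.zip"));
            byte[] digest = MessageDigest.getInstance("SHA-1").digest(data);
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b)); // lower-case hex
            }
            // Checksum files may contain "digest filename"; keep the first token.
            String expected = new String(Files.readAllBytes(
                    Paths.get("jackrabbit-oak-1.7.1-src.zip.sha1")))
                    .trim().split("\\s+")[0];
            System.out.println(hex.toString().equalsIgnoreCase(expected)
                    ? "SHA1 OK" : "SHA1 MISMATCH");
        }
    }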

About Apache Jackrabbit Oak
---------------------------

Jackrabbit Oak is a scalable, high-performance hierarchical content
repository designed for use as the foundation of modern world-class
web sites and other demanding content applications.

The Oak effort is a part of the Apache Jackrabbit project.
Apache Jackrabbit is a project of the Apache Software Foundation.

[RESULT][VOTE] Release Apache Jackrabbit Oak 1.7.1

2017-06-09 Thread Davide Giannella
Hello Team,

the vote passes as follows:

+1 Julian Reschke
+1 Alex Parvulescu
+1 Robert Munteanu
+1 Amit Jain
+1 Davide Giannella

Thanks for voting. I'll push the release out.

-- Davide



Re: Oak 1.7.2 release plan

2017-06-09 Thread Davide Giannella
On 08/06/2017 15:01, Davide Giannella wrote:
> I'm planning to cut Oak
> tomorrow, 9th of June

Change of plans. We'll be following the usual schedule. :)

D.




Re: index selection when 0 or 1 property clause (or sort clause) matches

2017-06-09 Thread Alvaro Cabrerizo
Good news is always welcome!

Many thanks Chetan.



On Fri, Jun 9, 2017 at 9:44 AM, Chetan Mehrotra wrote:

> On Wed, May 17, 2017 at 1:40 PM, Alvaro Cabrerizo wrote:
> > The main issue of delegating to entryCount is that if the index contains
> > more than 1000 docs and the query does not contain fulltext clauses, the
> > index planner will use the number 1000 as the entryCount, overwriting
> > the actual size of the index
> > [Math.min(definition.getEntryCount(), getReader().numDocs())].
>
> A very late reply here, but I have opened OAK-6333 to remove this
> artificial limit of 1000 in cost estimates. Once we have consensus on
> the approach, I will fix it and backport it to the older branches.
>
> Chetan Mehrotra
>


Re: minimize the impact when creating a new index (or re-indexing)

2017-06-09 Thread Alvaro Cabrerizo
Thank you guys for your comments.

On Fri, Jun 9, 2017 at 9:50 AM, Ian Boston  wrote:

> Hi,
> Assuming the MongoDB instance is performing well and does not show any
> slow queries in the mongodb logs, running the index operation on many
> cores, each core handling one index writer, should parallelise the
> operation. IIRC this is theoretically possible, and might have been
> implemented in the latest versions of Oak (Chetan?). If you are in AWS
> then an X1 instance will give you 128 cores and up to 2TB of RAM for the
> duration of the re-index. Other cloud vendors have equivalent VMs.
> Whatever the instance is, the Oak cluster leader should be allocated to
> this instance as IIRC only the Oak cluster leader performs the index
> operation. The single-threaded index writer is a feature/limitation of
> the way Lucene works, but Oak has many independent indexes. Your
> deployment may not have 128, so it may not be able to use all the cores
> of the largest instance.
>
> If, however, the MongoDB cluster is showing any signs of slow queries in
> the logs (> 100ms), or any level of read IO, then however many cores over
> however many VMs won't speed the process up, and may slow it down. To be
> certain of no bottleneck in MongoDB, ensure the VM has more memory than
> the disk size of the database. The latest version of MongoDB supported by
> Oak, running WiredTiger, will greatly reduce memory pressure and IO as it
> doesn't use memory mapping as the primary DB-to-disk mechanism, and
> compresses the data as it writes.
>
> The instance running Oak must also be sized correctly. I suspect you will
> be running a persistent cache, which must be sized to give optimum
> performance and minimise IO, and which therefore also requires sufficient
> memory. For the period of the re-index, the largest AEM instance you can
> afford will minimise IO. Big VMs (in AWS at least) have more network
> bandwidth, which also helps.
>
> Finally, disks. Don't use HDD, only use SSD, ensure that there are
> sufficient IOPS available at all times, and enable all the Oak indexing
> optimisation switches (copyOnRead, copyOnWrite etc.).
>
> IO generally kills performance, and if the VMs have not been configured
> properly (THP off, readahead low, XFS or noatime ext4 disks) then that IO
> will be amplified.
>
> If you have done all of this, then you might have to wait for OAK-6246 (I
> see Chetan just responded), but if you haven't, please do check that you
> are running as fast as possible with no constrained resources.
>
> HTH. If it's been said before, sorry for the noise; please ignore.
> Best Regards
> Ian
>
> On 9 June 2017 at 07:49, Alvaro Cabrerizo  wrote:
>
> > Thanks Chetan,
> >
> > Sorry, but that part is out of my reach. There is an IT team in charge
> > of managing the infrastructure and making optimizations, so it is
> > difficult to get that information. Basically, what I was looking for is
> > a way to parallelize the indexing process. On the other hand, reducing
> > the indexing time would be fine (it was previously reduced from 7 to 2
> > days), but I think that traversing more than 100M nodes is a pretty
> > tough operation and I'm not sure if there is much we can do. Anyway,
> > any pointer related to indexing optimization or any advice on how to
> > design the repo (e.g. use different paths to isolate different groups
> > of assets, use different nodetypes to differentiate content types,
> > create different repositories [is that possible?] for different groups
> > of users...) is welcome.
> >
> > Regards.
> >
> > On Thu, Jun 8, 2017 at 12:44 PM, Chetan Mehrotra
> > <chetan.mehro...@gmail.com> wrote:
> >
> > > On Thu, Jun 8, 2017 at 4:04 PM, Alvaro Cabrerizo wrote:
> > > > It is a DocumentNodeStore based instance. We don't extract data from
> > > > binary files, just indexing metadata stored on nodes.
> > >
> > > In that case 48 hrs is a long time. Can you share some details around
> > > how many nodes are being indexed as part of that index and the repo
> > > size in terms of Mongo stats if possible?
> > >
> > > Chetan Mehrotra
> > >
> >
>


Re: minimize the impact when creating a new index (or re-indexing)

2017-06-09 Thread Ian Boston
Hi,
Assuming the MongoDB instance is performing well and does not show any
slow queries in the mongodb logs, running the index operation on many
cores, each core handling one index writer, should parallelise the
operation. IIRC this is theoretically possible, and might have been
implemented in the latest versions of Oak (Chetan?). If you are in AWS
then an X1 instance will give you 128 cores and up to 2TB of RAM for the
duration of the re-index. Other cloud vendors have equivalent VMs.
Whatever the instance is, the Oak cluster leader should be allocated to
this instance as IIRC only the Oak cluster leader performs the index
operation. The single-threaded index writer is a feature/limitation of
the way Lucene works, but Oak has many independent indexes. Your
deployment may not have 128, so it may not be able to use all the cores
of the largest instance.
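
Schematically, that one-writer-per-index parallelism amounts to something
like the sketch below (a plain Java illustration with made-up names, not
the actual Oak indexing code):

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class ParallelReindexSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical index names standing in for Oak's independent
            // Lucene index definitions.
            List<String> indexes =
                    Arrays.asList("lucene-a", "lucene-b", "lucene-c");
            ExecutorService pool = Executors.newFixedThreadPool(
                    Math.min(indexes.size(),
                             Runtime.getRuntime().availableProcessors()));
            List<Future<?>> jobs = new ArrayList<>();
            for (String index : indexes) {
                // Each index keeps its single-threaded writer; parallelism
                // comes from rebuilding different indexes concurrently.
                jobs.add(pool.submit(() -> reindex(index)));
            }
            for (Future<?> job : jobs) {
                job.get(); // wait, and propagate any indexing failure
            }
            pool.shutdown();
        }

        // Stand-in for the real per-index rebuild work.
        static void reindex(String indexName) {
            System.out.println("reindexing " + indexName);
        }
    }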

If, however, the MongoDB cluster is showing any signs of slow queries in
the logs (> 100ms), or any level of read IO, then however many cores over
however many VMs won't speed the process up, and may slow it down. To be
certain of no bottleneck in MongoDB, ensure the VM has more memory than
the disk size of the database. The latest version of MongoDB supported by
Oak, running WiredTiger, will greatly reduce memory pressure and IO as it
doesn't use memory mapping as the primary DB-to-disk mechanism, and
compresses the data as it writes.

The instance running Oak must also be sized correctly. I suspect you will
be running a persistent cache, which must be sized to give optimum
performance and minimise IO, and which therefore also requires sufficient
memory. For the period of the re-index, the largest AEM instance you can
afford will minimise IO. Big VMs (in AWS at least) have more network
bandwidth, which also helps.

Finally, disks. Don't use HDD, only use SSD, ensure that there are
sufficient IOPS available at all times, and enable all the Oak indexing
optimisation switches (copyOnRead, copyOnWrite etc.).

IO generally kills performance, and if the VMs have not been configured
properly (THP off, readahead low, XFS or noatime ext4 disks) then that IO
will be amplified.

If you have done all of this, then you might have to wait for OAK-6246 (I
see Chetan just responded), but if you haven't, please do check that you
are running as fast as possible with no constrained resources.

HTH. If it's been said before, sorry for the noise; please ignore.
Best Regards
Ian

On 9 June 2017 at 07:49, Alvaro Cabrerizo  wrote:

> Thanks Chetan,
>
> Sorry, but that part is out of my reach. There is an IT team in charge
> of managing the infrastructure and making optimizations, so it is
> difficult to get that information. Basically, what I was looking for is
> a way to parallelize the indexing process. On the other hand, reducing
> the indexing time would be fine (it was previously reduced from 7 to 2
> days), but I think that traversing more than 100M nodes is a pretty
> tough operation and I'm not sure if there is much we can do. Anyway,
> any pointer related to indexing optimization or any advice on how to
> design the repo (e.g. use different paths to isolate different groups
> of assets, use different nodetypes to differentiate content types,
> create different repositories [is that possible?] for different groups
> of users...) is welcome.
>
> Regards.
>
> On Thu, Jun 8, 2017 at 12:44 PM, Chetan Mehrotra
> <chetan.mehro...@gmail.com> wrote:
>
> > On Thu, Jun 8, 2017 at 4:04 PM, Alvaro Cabrerizo wrote:
> > > It is a DocumentNodeStore based instance. We don't extract data from
> > > binary files, just indexing metadata stored on nodes.
> >
> > In that case 48 hrs is a long time. Can you share some details around
> > how many nodes are being indexed as part of that index and the repo
> > size in terms of Mongo stats if possible?
> >
> > Chetan Mehrotra
> >
>


Re: index selection when 0 or 1 property clause (or sort clause) matches

2017-06-09 Thread Chetan Mehrotra
On Wed, May 17, 2017 at 1:40 PM, Alvaro Cabrerizo  wrote:
> The main issue of delegating to entryCount is that if the index contains
> more than 1000 docs and the query does not contain fulltext clauses, the
> index planner will use the number 1000 as the entryCount, overwriting the
> actual size of the index
> [Math.min(definition.getEntryCount(), getReader().numDocs())].

A very late reply here, but I have opened OAK-6333 to remove this
artificial limit of 1000 in cost estimates. Once we have consensus on
the approach, I will fix it and backport it to the older branches.
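
For reference, the behaviour described above boils down to something like
the sketch below (only the Math.min(definition.getEntryCount(),
getReader().numDocs()) expression comes from the mail; the parameter
names are illustrative, not the actual planner fields):

    // Illustrative sketch: how the planner's entry count estimate behaves,
    // per the description in Alvaro's mail.
    static long estimatedEntryCount(long definitionEntryCount,
                                    int numDocs,
                                    boolean hasFullTextClause) {
        // The real size, bounded by the index definition's entryCount.
        long actualSize = Math.min(definitionEntryCount, numDocs);
        // Without a fulltext clause the estimate is artificially capped at
        // 1000 -- the limit OAK-6333 proposes to remove.
        return hasFullTextClause ? actualSize : Math.min(actualSize, 1000L);
    }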

Chetan Mehrotra


Re: minimize the impact when creating a new index (or re-indexing)

2017-06-09 Thread Chetan Mehrotra
> indexing time would be fine (it was previously reduced from 7 to 2 days),
> but I think that traversing more than 100M nodes is a pretty tough

Yes, that would take time (i.e. indexing 100M+ nodes). This is an area
where work is in progress (OAK-6246) to get much shorter indexing
times.

> related to indexing optimization or any advice on how to design the repo
> (e.g. use different paths to isolate different groups of assets, use
> different nodetypes to differentiate content types, create different
> repositories [is that possible?] for different groups of users...) is

Hard to give generic advice here. It all depends on the type of query,
the index definition, and the content structure, so I would need those
details to provide any suggestions.

Chetan Mehrotra