[jira] [Commented] (OAK-6353) Use Document order traversal for reindexing performed on DocumentNodeStore setups

2017-12-19 Thread Chetan Mehrotra (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-6353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16297924#comment-16297924
 ] 

Chetan Mehrotra commented on OAK-6353:
--

Some performance numbers for reindexing done for repo having 255M Mongo Docs, 
66M nodes under /content and having 4.2M assets

# Normal NodeStore traversal - 13.66 h

*Document Traversal*

A - Default setup 

# Total time - 3.469 h
## Time in dumping - 2.405 h
## Time in sorting - 39.87 min
###  Batch sorting - 19.13 min
###  Merging - 20.17
## Indexing 24 mins
# Space consumed
#* dumped json - 43.6 GB
#* chunked files - 43.6 GB
#* index size - 2.5 GB

{noformat}
2017-12-15 16:48:34 Proceeding to index [/oak:index/damAssetLucene2] upto 
checkpoint head {} 
2017-12-15 19:12:55 Dumped 65472172 nodestates in json format in 2.405 h 
2017-12-15 19:12:55 Compression enabled while sorting : false 
(oak.indexer.useZip) 
2017-12-15 19:12:55 Delete original dump from traversal : true 
(oak.indexer.deleteOriginal) 
2017-12-15 19:12:55 Max heap memory (GB) to be used for merge sort : 3 
(oak.indexer.maxSortMemoryInGB) 
2017-12-15 19:12:57 Sorting with memory 3.2 GB (estimated 12.6 GB) 
2017-12-15 19:32:05 Batch sorting done in 19.13 min with 29 files of size 43.6 
GB to merge 
2017-12-15 19:32:05 Removing the original file temp/flat-file-store/store.json 
2017-12-15 19:52:50 Merging of sorted files completed in 20.71 min 
2017-12-15 19:52:50 Sorting completed in 39.87 min 
2017-12-15 19:52:50 Estimated node count to be traversed for reindexing under / 
is [65472172] 
2017-12-15 20:16:35 Indexing report
- /oak:index/damAssetLucene2*(4407265)
2017-12-15 20:16:43 Indexing completed for indexes [/oak:index/damAssetLucene2] 
in 3.469 h (12488171 ms) 
{noformat}

B - Compression enabled in sorting

# Total time - 3.811 h
## Time in dumping - 2.929 h
## Time in sorting - 29.56 min
###  Batch sorting - 17.67 min
###  Merging - 11.87 min
## Indexing 24 mins
# Space consumed
#* dumped json - 43.6 GB
#* chunked files - 5.5 GB
#* index size - 2.5 GB

{noformat}
2017-12-19 10:56:00  Proceeding to index [/oak:index/damAssetLucene2] upto 
checkpoint head {} 
2017-12-19 13:51:50 oreBuilder - Dumped 65469575 nodestates in json format in 
2.929 h (43.6 GB) 
2017-12-19 13:51:50 oreBuilder - Compression enabled while sorting : true 
(oak.indexer.useZip) 
2017-12-19 13:51:50 oreBuilder - Delete original dump from traversal : true 
(oak.indexer.deleteOriginal) 
2017-12-19 13:51:50 oreBuilder - Max heap memory (GB) to be used for merge sort 
: 3 (oak.indexer.maxSortMemoryInGB) 
2017-12-19 13:51:52 Sorter - Sorting with memory 3.2 GB (estimated 12.6 GB) 
2017-12-19 14:09:32 Sorter - Batch sorting done in 17.67 min with 29 files of 
size 5.5 GB to merge 
2017-12-19 14:09:32 Sorter - Removing the original file 
temp/flat-file-store/store.json 
2017-12-19 14:21:25 Sorter - Merging of sorted files completed in 11.87 min 
2017-12-19 14:21:25 Sorter - Sorting completed in 29.56 min 
2017-12-19 14:21:26 Estimated node count to be traversed for reindexing under / 
is [65469575] 
2017-12-19 14:44:30 Indexing report
- /oak:index/damAssetLucene2*(4407265)
 2017-12-19 14:44:30 Reindexing completed 
2017-12-19 14:44:30 Switched the async lane for indexes at 
[/oak:index/damAssetLucene2] back to there original lanes 
2017-12-19 14:44:39 Indexing completed for indexes [/oak:index/damAssetLucene2] 
in 3.811 h (13718589 ms)
{noformat}

> Use Document order traversal for reindexing performed on DocumentNodeStore 
> setups
> -
>
> Key: OAK-6353
> URL: https://issues.apache.org/jira/browse/OAK-6353
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: run
>Reporter: Chetan Mehrotra
>Assignee: Chetan Mehrotra
> Fix For: 1.7.13, 1.8
>
> Attachments: OAK-6353-v1.patch, OAK-6353-v2.patch
>
>
> [~tmueller] suggested 
> [here|https://issues.apache.org/jira/browse/OAK-6246?focusedCommentId=16034442=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16034442]
>  that document order traversal can be faster compared to current mode of path 
> based traversal. Initial test indicate that such a traversal can be order of 
> magnitude faster. 
> So this task is meant to implement such an approach and see if it can be a 
> viable indexing mode used for DocumentNodeStore based setups



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (OAK-6353) Use Document order traversal for reindexing performed on DocumentNodeStore setups

2017-12-18 Thread Chetan Mehrotra (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-6353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16294747#comment-16294747
 ] 

Chetan Mehrotra commented on OAK-6353:
--

There are some aspects which still need to be taken care of. See OAK-7074

> Use Document order traversal for reindexing performed on DocumentNodeStore 
> setups
> -
>
> Key: OAK-6353
> URL: https://issues.apache.org/jira/browse/OAK-6353
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: run
>Reporter: Chetan Mehrotra
>Assignee: Chetan Mehrotra
> Fix For: 1.7.13, 1.8
>
> Attachments: OAK-6353-v1.patch, OAK-6353-v2.patch
>
>
> [~tmueller] suggested 
> [here|https://issues.apache.org/jira/browse/OAK-6246?focusedCommentId=16034442=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16034442]
>  that document order traversal can be faster compared to current mode of path 
> based traversal. Initial test indicate that such a traversal can be order of 
> magnitude faster. 
> So this task is meant to implement such an approach and see if it can be a 
> viable indexing mode used for DocumentNodeStore based setups



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (OAK-6353) Use Document order traversal for reindexing performed on DocumentNodeStore setups

2017-11-14 Thread Julian Reschke (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-6353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16251128#comment-16251128
 ] 

Julian Reschke commented on OAK-6353:
-

Fixed nullability annotation in [r1815190|http://svn.apache.org/r1815190] 

> Use Document order traversal for reindexing performed on DocumentNodeStore 
> setups
> -
>
> Key: OAK-6353
> URL: https://issues.apache.org/jira/browse/OAK-6353
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: run
>Reporter: Chetan Mehrotra
>Assignee: Chetan Mehrotra
> Fix For: 1.8
>
> Attachments: OAK-6353-v1.patch, OAK-6353-v2.patch
>
>
> [~tmueller] suggested 
> [here|https://issues.apache.org/jira/browse/OAK-6246?focusedCommentId=16034442=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16034442]
>  that document order traversal can be faster compared to current mode of path 
> based traversal. Initial test indicate that such a traversal can be order of 
> magnitude faster. 
> So this task is meant to implement such an approach and see if it can be a 
> viable indexing mode used for DocumentNodeStore based setups



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (OAK-6353) Use Document order traversal for reindexing performed on DocumentNodeStore setups

2017-09-12 Thread Chetan Mehrotra (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-6353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16164170#comment-16164170
 ] 

Chetan Mehrotra commented on OAK-6353:
--

Based on some tests done by [~chibulcu] for 100M nodes

* Simple DBCursor traversal - ~50 mins , 82k docs/sec
* NodeStore traversal 
** From a remote setup - 1.2d
** From a local setup (on same machine as Mongo) - 8 hrs, 8k docs/sec

So to perform full reindexing on such setups we would need to make better use 
of Document order traversal


> Use Document order traversal for reindexing performed on DocumentNodeStore 
> setups
> -
>
> Key: OAK-6353
> URL: https://issues.apache.org/jira/browse/OAK-6353
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: run
>Reporter: Chetan Mehrotra
>Assignee: Chetan Mehrotra
> Fix For: 1.8
>
> Attachments: OAK-6353-v1.patch, OAK-6353-v2.patch
>
>
> [~tmueller] suggested 
> [here|https://issues.apache.org/jira/browse/OAK-6246?focusedCommentId=16034442=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16034442]
>  that document order traversal can be faster compared to current mode of path 
> based traversal. Initial test indicate that such a traversal can be order of 
> magnitude faster. 
> So this task is meant to implement such an approach and see if it can be a 
> viable indexing mode used for DocumentNodeStore based setups



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (OAK-6353) Use Document order traversal for reindexing performed on DocumentNodeStore setups

2017-09-06 Thread Chetan Mehrotra (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-6353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16156484#comment-16156484
 ] 

Chetan Mehrotra commented on OAK-6353:
--

Mongodb provides a {{\$natual}} option [1] which forces the cursor to use the 
natual order. This can then be coupled with a regex which can filter out hidden 
nodes alltogeher


[1] 
https://docs.mongodb.com/manual/reference/operator/meta/natural/#metaOp._S_natural

> Use Document order traversal for reindexing performed on DocumentNodeStore 
> setups
> -
>
> Key: OAK-6353
> URL: https://issues.apache.org/jira/browse/OAK-6353
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: run
>Reporter: Chetan Mehrotra
>Assignee: Chetan Mehrotra
> Fix For: 1.8
>
> Attachments: OAK-6353-v1.patch, OAK-6353-v2.patch
>
>
> [~tmueller] suggested 
> [here|https://issues.apache.org/jira/browse/OAK-6246?focusedCommentId=16034442=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16034442]
>  that document order traversal can be faster compared to current mode of path 
> based traversal. Initial test indicate that such a traversal can be order of 
> magnitude faster. 
> So this task is meant to implement such an approach and see if it can be a 
> viable indexing mode used for DocumentNodeStore based setups



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (OAK-6353) Use Document order traversal for reindexing performed on DocumentNodeStore setups

2017-09-05 Thread Chetan Mehrotra (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-6353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16154791#comment-16154791
 ] 

Chetan Mehrotra commented on OAK-6353:
--

Applied the patch with r1807438

> Use Document order traversal for reindexing performed on DocumentNodeStore 
> setups
> -
>
> Key: OAK-6353
> URL: https://issues.apache.org/jira/browse/OAK-6353
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: run
>Reporter: Chetan Mehrotra
>Assignee: Chetan Mehrotra
> Fix For: 1.8
>
> Attachments: OAK-6353-v1.patch, OAK-6353-v2.patch
>
>
> [~tmueller] suggested 
> [here|https://issues.apache.org/jira/browse/OAK-6246?focusedCommentId=16034442=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16034442]
>  that document order traversal can be faster compared to current mode of path 
> based traversal. Initial test indicate that such a traversal can be order of 
> magnitude faster. 
> So this task is meant to implement such an approach and see if it can be a 
> viable indexing mode used for DocumentNodeStore based setups



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)