[jira] [Updated] (ATLAS-4408) Dynamic handling of failure in updating index

2021-10-10 Thread Sarath Subramanian (Jira)


 [ 
https://issues.apache.org/jira/browse/ATLAS-4408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sarath Subramanian updated ATLAS-4408:
--
Affects Version/s: 3.0.0
   2.2.0

> Dynamic handling of failure in updating index
> -
>
>                 Key: ATLAS-4408
>                 URL: https://issues.apache.org/jira/browse/ATLAS-4408
>             Project: Atlas
>          Issue Type: New Feature
>          Components:  atlas-core
>    Affects Versions: 3.0.0, 2.2.0
>            Reporter: Radhika Kundam
>            Assignee: Radhika Kundam
>            Priority: Major
>         Attachments: IndexRecovery.png, IndexRecovery_FunctionalFlow.png
>
>
> *Index failure resilience:* dynamic handling of failures in updating the 
> index (i.e. the HBase commit succeeds but the index commit fails).
> In a secondary persistence failure scenario, the indexes become inconsistent 
> for every transaction that failed at Solr. The only existing way to repair 
> this is to re-index all the data, which is time-consuming because it involves 
> indexing the entire database.
> To recover from such inconsistencies we can use the *transaction write-ahead 
> log option*. With the write-ahead log enabled (tx.log-tx), JanusGraph 
> maintains the transaction log data for all transactions, which can be used to 
> recover indices after failures. Maintaining log data for every transaction 
> adds overhead, but it makes the system more resilient and lets it recover 
> proactively, so the advantages of this approach can outweigh the overhead of 
> maintaining the log data.
> Design details are as follows.
>  # Start a new service, IndexRecoveryService, at Atlas startup.
>  ## Continuously monitor Solr (index client) health every retryTime 
> milliseconds.
>  ### If Solr is healthy and a recovery start time is available:
>  #### Start transaction recovery from the available recovery start time 
> (noted when Solr became unhealthy).
>  #### Persist the current recovery start time as the previous one, so that it 
> can later be passed as a custom recovery time to start index recovery if 
> required.
>  #### Reset the current recovery start time.
>  #### Continue with the Solr health check.
>  ### If Solr is unhealthy and no recovery start time is available:
>  #### Shut down the existing transaction recovery process.
>  #### Note the time, to be used as the next recovery start time, and persist 
> it in the graph.
>  #### Continue with the Solr health check.
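The monitoring steps above can be sketched as a small state machine. This is only an illustrative sketch in Python (Atlas itself is written in Java), not the actual IndexRecoveryService: the class name, the three callbacks and the injectable clock are hypothetical, and the two in-memory fields stand in for the values the design persists in the graph.

```python
import time
from typing import Callable, Optional


class IndexRecoveryMonitor:
    """Hypothetical stand-in for the health-check logic of IndexRecoveryService."""

    def __init__(
        self,
        is_index_healthy: Callable[[], bool],    # assumed health probe for Solr
        start_recovery: Callable[[int], None],   # assumed hook: replay tx log from a time
        stop_recovery: Callable[[], None],       # assumed hook: stop a running recovery
        clock: Callable[[], int] = lambda: int(time.time()),
    ) -> None:
        self.is_index_healthy = is_index_healthy
        self.start_recovery = start_recovery
        self.stop_recovery = stop_recovery
        self.clock = clock
        # In the design these two values are persisted in the graph;
        # plain attributes are used here to keep the sketch self-contained.
        self.recovery_start_time: Optional[int] = None
        self.previous_recovery_start_time: Optional[int] = None

    def check_once(self) -> None:
        """Run one health check and apply the matching branch of the design."""
        healthy = self.is_index_healthy()
        if healthy and self.recovery_start_time is not None:
            # Index is back: recover from the time it went down, keep that
            # time as "previous", then reset the current start time.
            self.start_recovery(self.recovery_start_time)
            self.previous_recovery_start_time = self.recovery_start_time
            self.recovery_start_time = None
        elif not healthy and self.recovery_start_time is None:
            # Index just went down: stop any running recovery and note
            # the time that the next recovery should start from.
            self.stop_recovery()
            self.recovery_start_time = self.clock()
        # In every other case nothing changes; just keep monitoring.

    def run(self, retry_interval_ms: int, iterations: int) -> None:
        """Repeat the check every retry_interval_ms milliseconds."""
        for _ in range(iterations):
            self.check_once()
            time.sleep(retry_interval_ms / 1000.0)
```

Passing the probe, the recovery hooks and the clock in as callables keeps the branching logic testable without a running Solr or JanusGraph instance. Repeated checks while Solr stays down leave the noted start time untouched, so recovery always resumes from the first observed failure.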
> Configuration properties used by this feature:
>  1. To enable or disable index recovery (by default, index recovery is 
> enabled on Atlas startup):
>     *atlas.graph.enable.index.recovery=true*
>  2. To configure how frequently the Solr health check is performed:
>     *atlas.graph.index.search.solr.status.retry.interval=*
>  3. To start index recovery from a custom, user-provided recovery time:
>     *atlas.graph.index.search.solr.recovery.start.time=1630086622*
>  
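As a rough illustration of how the three properties above might be interpreted, here is a minimal Python sketch. The parser, the `IndexRecoveryConfig` type and the function names are hypothetical. The only behaviour taken from the description is that recovery is enabled by default and that the retry interval is expressed in milliseconds; a property that is missing or left empty is reported as `None` rather than given an invented default.

```python
from typing import Dict, NamedTuple, Optional

ENABLE_KEY = "atlas.graph.enable.index.recovery"
RETRY_INTERVAL_KEY = "atlas.graph.index.search.solr.status.retry.interval"
START_TIME_KEY = "atlas.graph.index.search.solr.recovery.start.time"


class IndexRecoveryConfig(NamedTuple):
    """Hypothetical container for the three settings."""
    enabled: bool
    retry_interval_ms: Optional[int]   # None when the property is unset or empty
    custom_start_time: Optional[int]   # None when no custom start time is given


def parse_properties(text: str) -> Dict[str, str]:
    """Parse simple key=value lines, skipping blanks and '#' comments."""
    props: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def load_index_recovery_config(props: Dict[str, str]) -> IndexRecoveryConfig:
    """Build the config, treating index recovery as enabled unless set to false."""

    def optional_int(key: str) -> Optional[int]:
        raw = props.get(key, "")
        return int(raw) if raw else None

    enabled = props.get(ENABLE_KEY, "true").lower() == "true"
    return IndexRecoveryConfig(
        enabled=enabled,
        retry_interval_ms=optional_int(RETRY_INTERVAL_KEY),
        custom_start_time=optional_int(START_TIME_KEY),
    )
```

For example, `load_index_recovery_config(parse_properties(text))` on the three lines listed above yields an enabled configuration with no retry interval set and a custom start time of 1630086622.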



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ATLAS-4408) Dynamic handling of failure in updating index

2021-10-10 Thread Sarath Subramanian (Jira)


 [ 
https://issues.apache.org/jira/browse/ATLAS-4408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sarath Subramanian updated ATLAS-4408:
--
Labels: indexing  (was: )



[jira] [Updated] (ATLAS-4408) Dynamic handling of failure in updating index

2021-10-10 Thread Sarath Subramanian (Jira)


 [ 
https://issues.apache.org/jira/browse/ATLAS-4408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sarath Subramanian updated ATLAS-4408:
--
Fix Version/s: 2.3.0
   3.0.0



[jira] [Updated] (ATLAS-4408) Dynamic handling of failure in updating index

2021-08-31 Thread Radhika Kundam (Jira)


 [ 
https://issues.apache.org/jira/browse/ATLAS-4408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Radhika Kundam updated ATLAS-4408:
--
Attachment: IndexRecovery.png
IndexRecovery_FunctionalFlow.png



[jira] [Updated] (ATLAS-4408) Dynamic handling of failure in updating index

2021-08-30 Thread Radhika Kundam (Jira)


 [ 
https://issues.apache.org/jira/browse/ATLAS-4408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Radhika Kundam updated ATLAS-4408:
--
Description: 
*Index failure resilience:* dynamic handling of failures in updating the index 
(i.e. the HBase commit succeeds but the index commit fails).

In a secondary persistence failure scenario, the indexes become inconsistent 
for every transaction that failed at Solr. The only existing way to repair 
this is to re-index all the data, which is time-consuming because it involves 
indexing the entire database.

To recover from such inconsistencies we can use the *transaction write-ahead 
log option*. With the write-ahead log enabled (tx.log-tx), JanusGraph 
maintains the transaction log data for all transactions, which can be used to 
recover indices after failures. Maintaining log data for every transaction 
adds overhead, but it makes the system more resilient and lets it recover 
proactively, so the advantages of this approach can outweigh the overhead of 
maintaining the log data.

Design details are as follows.
 # Start a new service, IndexRecoveryService, at Atlas startup.
 ## Continuously monitor Solr (index client) health every retryTime 
milliseconds.
 ### If Solr is healthy and a recovery start time is available:
 #### Start transaction recovery from the available recovery start time 
(noted when Solr became unhealthy).
 #### Persist the current recovery start time as the previous one, so that it 
can later be passed as a custom recovery time to start index recovery if 
required.
 #### Reset the current recovery start time.
 #### Continue with the Solr health check.
 ### If Solr is unhealthy and no recovery start time is available:
 #### Shut down the existing transaction recovery process.
 #### Note the time, to be used as the next recovery start time, and persist 
it in the graph.
 #### Continue with the Solr health check.

Configuration properties used by this feature:

 1. To enable or disable index recovery (by default, index recovery is enabled 
on Atlas startup):
    *atlas.graph.enable.index.recovery=true*
 2. To configure how frequently the Solr health check is performed:
    *atlas.graph.index.search.solr.status.retry.interval=*
 3. To start index recovery from a custom, user-provided recovery time:
    *atlas.graph.index.search.solr.recovery.start.time=1630086622*

 

  was:
*Index failure resilience:* dynamic handling of failures in updating the index 
(i.e. the HBase commit succeeds but the index commit fails).

To support this feature, the *tx.log-tx* property must be enabled, which 
starts storing write-ahead logs. *With this approach we need to maintain more 
data related to write-ahead transaction logs*. However, given the advantages 
of proactive index recovery over re-indexing the entire data set in case of 
secondary persistence failures, the feature is worth having despite the 
overhead of maintaining more data.

Design details are as follows.
 # Start a new service, IndexRecoveryService, at Atlas startup.
 ## Continuously monitor Solr (index client) health every retryTime 
milliseconds.
 ### If Solr is healthy and a recovery start time is available:
 #### Start transaction recovery from the available recovery start time 
(noted when Solr became unhealthy).
 #### Persist the current recovery start time as the previous one, so that it 
can later be passed as a custom recovery time to start index recovery if 
required.
 #### Reset the current recovery start time.
 #### Continue with the Solr health check.
 ### If Solr is unhealthy and no recovery start time is available:
 #### Shut down the existing transaction recovery process.
 #### Note the time, to be used as the next recovery start time, and persist 
it in the graph.
 #### Continue with the Solr health check.

Configuration properties used by this feature:

 1. To enable or disable index recovery (by default, index recovery is enabled 
on Atlas startup):
    *atlas.graph.enable.index.recovery=true*
 2. To configure how frequently the Solr health check is performed:
    *atlas.graph.index.search.solr.status.retry.interval=*
 3. To start index recovery from a custom, user-provided recovery time:
    *atlas.graph.index.search.solr.recovery.start.time=1630086622*

 



[jira] [Updated] (ATLAS-4408) Dynamic handling of failure in updating index

2021-08-30 Thread Radhika Kundam (Jira)


 [ 
https://issues.apache.org/jira/browse/ATLAS-4408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Radhika Kundam updated ATLAS-4408:
--
Description: 
*Index failure resilience:* dynamic handling of failures in updating the index 
(i.e. the HBase commit succeeds but the index commit fails).

To support this feature, the *tx.log-tx* property must be enabled, which 
starts storing write-ahead logs. *With this approach we need to maintain more 
data related to write-ahead transaction logs*. However, given the advantages 
of proactive index recovery over re-indexing the entire data set in case of 
secondary persistence failures, the feature is worth having despite the 
overhead of maintaining more data.

Design details are as follows.
 # Start a new service, IndexRecoveryService, at Atlas startup.
 ## Continuously monitor Solr (index client) health every retryTime 
milliseconds.
 ### If Solr is healthy and a recovery start time is available:
 #### Start transaction recovery from the available recovery start time 
(noted when Solr became unhealthy).
 #### Persist the current recovery start time as the previous one, so that it 
can later be passed as a custom recovery time to start index recovery if 
required.
 #### Reset the current recovery start time.
 #### Continue with the Solr health check.
 ### If Solr is unhealthy and no recovery start time is available:
 #### Shut down the existing transaction recovery process.
 #### Note the time, to be used as the next recovery start time, and persist 
it in the graph.
 #### Continue with the Solr health check.

Configuration properties used by this feature:

 1. To enable or disable index recovery (by default, index recovery is enabled 
on Atlas startup):
    *atlas.graph.enable.index.recovery=true*
 2. To configure how frequently the Solr health check is performed:
    *atlas.graph.index.search.solr.status.retry.interval=*
 3. To start index recovery from a custom, user-provided recovery time:
    *atlas.graph.index.search.solr.recovery.start.time=1630086622*

 

  was:
*Index failure resilience:* dynamic handling of failures in updating the index 
(i.e. the HBase commit succeeds but the index commit fails).

Design details are as follows.
 # Start a new service, IndexRecoveryService, at Atlas startup.
 ## Continuously monitor Solr (index client) health every retryTime 
milliseconds.
 ### If Solr is healthy and a recovery start time is available:
 #### Start transaction recovery from the available recovery start time 
(noted when Solr became unhealthy).
 #### Persist the current recovery start time as the previous one, so that it 
can later be passed as a custom recovery time to start index recovery if 
required.
 #### Reset the current recovery start time.
 #### Continue with the Solr health check.
 ### If Solr is unhealthy and no recovery start time is available:
 #### Shut down the existing transaction recovery process.
 #### Note the time, to be used as the next recovery start time, and persist 
it in the graph.
 #### Continue with the Solr health check.

Configuration properties used by this feature:

 1. To enable or disable index recovery (by default, index recovery is enabled 
on Atlas startup):
    *atlas.graph.enable.index.recovery=true*
 2. To configure how frequently the Solr health check is performed:
    *atlas.graph.index.search.solr.status.retry.interval=*
 3. To start index recovery from a custom, user-provided recovery time:
    *atlas.graph.index.search.solr.recovery.start.time=1630086622*



[jira] [Updated] (ATLAS-4408) Dynamic handling of failure in updating index

2021-08-27 Thread Radhika Kundam (Jira)


 [ 
https://issues.apache.org/jira/browse/ATLAS-4408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Radhika Kundam updated ATLAS-4408:
--
Description: 
*Index failure resilience:* dynamic handling of failures in updating the index 
(i.e. the HBase commit succeeds but the index commit fails).

Design details are as follows.
 # Start a new service, IndexRecoveryService, at Atlas startup.
 ## Continuously monitor Solr (index client) health every retryTime 
milliseconds.
 ### If Solr is healthy and a recovery start time is available:
 #### Start transaction recovery from the available recovery start time 
(noted when Solr became unhealthy).
 #### Persist the current recovery start time as the previous one, so that it 
can later be passed as a custom recovery time to start index recovery if 
required.
 #### Reset the current recovery start time.
 #### Continue with the Solr health check.
 ### If Solr is unhealthy and no recovery start time is available:
 #### Shut down the existing transaction recovery process.
 #### Note the time, to be used as the next recovery start time, and persist 
it in the graph.
 #### Continue with the Solr health check.

Configuration properties used by this feature:

 1. To enable or disable index recovery (by default, index recovery is enabled 
on Atlas startup):
    *atlas.graph.enable.index.recovery=true*
 2. To configure how frequently the Solr health check is performed:
    *atlas.graph.index.search.solr.status.retry.interval=*
 3. To start index recovery from a custom, user-provided recovery time:
    *atlas.graph.index.search.solr.recovery.start.time=1630086622*

  was:
*Index failure resilience:* dynamic handling of failures in updating the index 
(i.e. the HBase commit succeeds but the index commit fails):
 * a monitor thread to check the state of the index
 * save the index state in a graph node
 * basic-search to use graph queries instead of index queries
 * partial reindex of vertices, i.e. vertices that were updated since the last 
successful index update

