Re: Review Request 71924: ATLAS-3562: Hive metadata has the same classification multiple times

2019-12-17 Thread Madhan Neethiraj

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71924/#review219052
---


Ship it!

- Madhan Neethiraj


On Dec. 18, 2019, 7:43 a.m., Mandar Ambawane wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/71924/
> ---
> 
> (Updated Dec. 18, 2019, 7:43 a.m.)
> 
> 
> Review request for atlas, Ashutosh Mestry, Madhan Neethiraj, Nixon Rodrigues, 
> and Sarath Subramanian.
> 
> 
> Bugs: ATLAS-3562
> https://issues.apache.org/jira/browse/ATLAS-3562
> 
> 
> Repository: atlas
> 
> 
> Description
> ---
> 
> Put a lock on the entity before it gets cached.
> 
> Moved the call to GraphTransactionInterceptor.lockObjectAndReleasePostCommit(guid)
> so that it runs before the code fetches the AtlasVertex for the entity guid.
> 
> 
> Diffs
> ---
> 
>   repository/src/main/java/org/apache/atlas/repository/store/graph/v2/AtlasEntityStoreV2.java ea5e6ab 
> 
> 
> Diff: https://reviews.apache.org/r/71924/diff/1/
> 
> 
> Testing
> ---
> 
> TESTING:
> Tested by sending 2 simultaneous curl requests to associate the same 
> classification with the same entity.
> 
> RESULT: 
> 
> The classification gets associated with the entity only once.
> 
> The application throws an exception for the other simultaneous curl request:
> org.apache.atlas.exception.AtlasBaseException: invalid parameters: entity: 
> , already associated with classification: 
> 
> 
> Thanks,
> 
> Mandar Ambawane
> 
>



Review Request 71924: ATLAS-3562: Hive metadata has the same classification multiple times

2019-12-17 Thread Mandar Ambawane

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71924/
---

Review request for atlas, Ashutosh Mestry, Madhan Neethiraj, Nixon Rodrigues, 
and Sarath Subramanian.


Bugs: ATLAS-3562
https://issues.apache.org/jira/browse/ATLAS-3562


Repository: atlas


Description
---

Put a lock on the entity before it gets cached.

Moved the call to GraphTransactionInterceptor.lockObjectAndReleasePostCommit(guid)
so that it runs before the code fetches the AtlasVertex for the entity guid.
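
To illustrate why the lock has to be taken before the entity is read, here is a
minimal Python sketch. It is not Atlas code: EntityStore, AlreadyAssociated and
run_concurrent_adds are hypothetical names, and an in-process threading.Lock
stands in for the lock that lockObjectAndReleasePostCommit(guid) holds until
commit. The sketch assumes the duplicate arose because both requests read the
entity state before either of them held the lock.

```python
import threading


class AlreadyAssociated(Exception):
    """Raised when the entity already carries the classification."""


class EntityStore:
    """Toy in-memory store; a stand-in for the real entity store (assumption)."""

    def __init__(self):
        self._classifications = {}  # guid -> set of classification names
        self._locks = {}            # guid -> per-entity lock
        self._registry_lock = threading.Lock()

    def _lock_for(self, guid):
        # Guard the lock registry so both threads get the same per-entity lock.
        with self._registry_lock:
            return self._locks.setdefault(guid, threading.Lock())

    def add_classification(self, guid, name):
        # Take the per-entity lock first, and only then read the entity state.
        # Reading before locking would let both callers see "not yet associated".
        with self._lock_for(guid):
            current = self._classifications.setdefault(guid, set())
            if name in current:
                raise AlreadyAssociated(
                    f"entity {guid} already associated with classification {name}"
                )
            current.add(name)


def run_concurrent_adds(store, guid, name, workers=2):
    """Fire `workers` simultaneous add requests and report each outcome."""
    results = []
    barrier = threading.Barrier(workers)

    def worker():
        barrier.wait()  # release all workers at the same moment
        try:
            store.add_classification(guid, name)
            results.append("ok")
        except AlreadyAssociated:
            results.append("rejected")

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(results)
```

With two workers, one request succeeds and the other is rejected, which mirrors
the result reported in the Testing section of this request.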


Diffs
---

  repository/src/main/java/org/apache/atlas/repository/store/graph/v2/AtlasEntityStoreV2.java ea5e6ab 


Diff: https://reviews.apache.org/r/71924/diff/1/


Testing
---

TESTING:
Tested by sending 2 simultaneous curl requests to associate the same 
classification with the same entity.

RESULT: 

The classification gets associated with the entity only once.

The application throws an exception for the other simultaneous curl request:
org.apache.atlas.exception.AtlasBaseException: invalid parameters: entity: 
, already associated with classification: 


Thanks,

Mandar Ambawane



[jira] [Commented] (ATLAS-3563) Improve tag propagation performance using in-memory traversal

2019-12-17 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/ATLAS-3563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16998804#comment-16998804
 ] 

ASF subversion and git services commented on ATLAS-3563:


Commit db467ed60ed9537be8cad258322bea9f2e53f3f9 in atlas's branch 
refs/heads/branch-2.0 from Sarath Subramanian
[ https://gitbox.apache.org/repos/asf?p=atlas.git;h=db467ed ]

ATLAS-3563: Improve tag propagation performance using in-memory traversal

(cherry picked from commit 7423addb220330cceadbf690d3bf4e4e5fcde99f)


> Improve tag propagation performance using in-memory traversal
> ---
>
> Key: ATLAS-3563
> URL: https://issues.apache.org/jira/browse/ATLAS-3563
> Project: Atlas
> Issue Type: Task
> Components: atlas-core
> Affects Versions: 2.0.0
> Reporter: Sarath Subramanian
> Assignee: Sarath Subramanian
> Priority: Major
> Fix For: 2.1.0
>
>
> Tag propagation uses a Gremlin query to find the entities to which the tag 
> has to be propagated.
> The Gremlin query doesn't scale well for entities with large lineage (many 
> levels deep). In-memory traversal seems to have improved performance 
> significantly, since it avoids the overhead of Gremlin script engine 
> initialization and query execution.
> An improvement in tag propagation time from *3004 ms* to *180 ms* was observed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ATLAS-3563) Improve tag propagation performance using in-memory traversal

2019-12-17 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/ATLAS-3563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16998802#comment-16998802
 ] 

ASF subversion and git services commented on ATLAS-3563:


Commit 7423addb220330cceadbf690d3bf4e4e5fcde99f in atlas's branch 
refs/heads/master from Sarath Subramanian
[ https://gitbox.apache.org/repos/asf?p=atlas.git;h=7423add ]

ATLAS-3563: Improve tag propagation performance using in-memory traversal




[jira] [Comment Edited] (ATLAS-3546) isOptional for composition relationship category?

2019-12-17 Thread charles shen (Jira)


[ 
https://issues.apache.org/jira/browse/ATLAS-3546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16998011#comment-16998011
 ] 

charles shen edited comment on ATLAS-3546 at 12/18/19 3:51 AM:
---

Also, it has the side effect that loading a parent loads all contained 
children (through ownedRef), which somewhat conflicts with ATLAS-3056. Consider 
AWS S3 as an example: there might be thousands of pseudo dirs under one bucket 
and thousands of S3 objects under one pseudo dir, so this is a huge performance 
issue.

Back to the cascade delete: cascade-deleting all the S3 objects is also a huge 
performance issue, so an async/offline cascade delete might be a better 
option.


was (Author: api123):
Also, it has the side effect that loading a parent loads all contained 
children (through ownedRef), which somewhat conflicts with 
[ATLAS-3056|https://issues.apache.org/jira/browse/ATLAS-3056]. Consider AWS S3 
as an example: there might be thousands of pseudo dirs under one bucket and 
thousands of S3 objects under one pseudo dir, so this is a huge performance 
issue.


> isOptional for composition relationship category?
> ---
>
> Key: ATLAS-3546
> URL: https://issues.apache.org/jira/browse/ATLAS-3546
> Project: Atlas
> Issue Type: Improvement
> Components: atlas-core
> Affects Versions: 2.0.0
> Reporter: charles shen
> Priority: Major
> Attachments: json.jpg, xml.jpg, xml.jpg
>
>
> I noticed that since 
> [ATLAS-3051|https://issues.apache.org/jira/browse/ATLAS-3051], the 
> relationship attribute must be specified on the end def that is not the 
> container when the relationship category is composition. 
> I understand this is meant to prohibit orphan children, but is it too 
> strict? Reasons below:
>  # I have to provide all the entities along the relationship path in the 
> payload when creating a child. E.g., for RDBMS, I have to provide 
> rdbms_instance, rdbms_db, rdbms_table and rdbms_column when I just want to 
> create a single rdbms_column, which adds the performance overhead of 
> checking the existence of rdbms_instance, rdbms_db, etc.
>  # I have defined a composition relationship type where each end is the same 
> entity type. It can no longer be created successfully, since it always 
> requires the mandatory attribute, which is of the same type, and then falls 
> into an infinite loop.
> Three possible fixes:
>  # Remove the isOptional constraint, since ownedRef/inverseRef doesn't have 
> such a constraint.
>  # Add isOptional to the relationship type end def.
>  # Add an option in the REST API to ignore the isOptional constraint for 
> relationship types.
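
The two reasons given in the description can be sketched in a few lines of
Python. This is only an illustration of the reasoning, not Atlas behaviour:
required_ancestors and the mandatory_parent mapping are hypothetical, and the
mapping simply encodes "a child end must name its container" for the RDBMS
types mentioned above.

```python
def required_ancestors(entity_type, mandatory_parent):
    """Return the parent types that must accompany a new entity_type."""
    chain = []
    seen = {entity_type}
    current = entity_type
    while current in mandatory_parent:
        parent = mandatory_parent[current]
        if parent in seen:
            # Without this check a self-referencing composition never terminates.
            raise ValueError(f"mandatory composition cycle at {parent!r}")
        chain.append(parent)
        seen.add(parent)
        current = parent
    return chain


# Hypothetical mapping: child type -> mandatory container type.
RDBMS_PARENTS = {
    "rdbms_column": "rdbms_table",
    "rdbms_table": "rdbms_db",
    "rdbms_db": "rdbms_instance",
}
```

Creating one rdbms_column pulls in three ancestors, and a type whose mandatory
parent is itself (for example {"folder": "folder"}) is rejected by the cycle
check instead of looping forever.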





Re: Review Request 71919: ATLAS-3563: Improve tag propagation performance using in-memory traversal

2019-12-17 Thread Madhan Neethiraj

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71919/#review219049
---


Ship it!

- Madhan Neethiraj


On Dec. 18, 2019, 1:31 a.m., Sarath Subramanian wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/71919/
> ---
> 
> (Updated Dec. 18, 2019, 1:31 a.m.)
> 
> 
> Review request for atlas, Ashutosh Mestry, Aadarsh Jajodia, keval bhatt, 
> Sridhar K, Le Ma, Mandar Ambawane, mayank jain, Nixon Rodrigues, Sameer 
> Shaikh, and Sarath Subramanian.
> 
> 
> Bugs: ATLAS-3563
> https://issues.apache.org/jira/browse/ATLAS-3563
> 
> 
> Repository: atlas
> 
> 
> Description
> ---
> 
> Tag propagation uses a Gremlin query to find the entities to which the tag 
> has to be propagated.
> 
> The Gremlin query doesn't scale well for entities with large lineage (many 
> levels deep). In-memory traversal seems to have improved performance 
> significantly, since it avoids the overhead of Gremlin script engine 
> initialization and query execution.
> 
> An improvement in tag propagation time from 3004 ms to 180 ms was observed.
> 
> 
> Diffs
> ---
> 
>   graphdb/api/src/main/java/org/apache/atlas/repository/graphdb/AtlasVertex.java 6de4dcf10 
>   graphdb/janus/src/main/java/org/apache/atlas/repository/graphdb/janus/AtlasJanusVertex.java 71b285731 
>   intg/src/main/java/org/apache/atlas/AtlasErrorCode.java 7a2aae2e9 
>   intg/src/main/java/org/apache/atlas/type/AtlasEntityType.java 928ac0d8b 
>   repository/src/main/java/org/apache/atlas/repository/graph/GraphHelper.java 1e7acf1e7 
>   repository/src/main/java/org/apache/atlas/repository/store/graph/v1/DeleteHandlerV1.java c9ed79750 
>   repository/src/main/java/org/apache/atlas/repository/store/graph/v2/AtlasRelationshipStoreV2.java 1c8b057ba 
>   repository/src/main/java/org/apache/atlas/repository/store/graph/v2/EntityGraphMapper.java a415d3084 
>   repository/src/main/java/org/apache/atlas/repository/store/graph/v2/EntityGraphRetriever.java 8a24fa127 
>   repository/src/main/java/org/apache/atlas/util/AtlasGremlin3QueryProvider.java 20c570f7f 
>   repository/src/main/java/org/apache/atlas/util/AtlasGremlinQueryProvider.java d201db338 
>   repository/src/test/java/org/apache/atlas/repository/tagpropagation/ClassificationPropagationTest.java 6f9c05e7a 
> 
> 
> Diff: https://reviews.apache.org/r/71919/diff/4/
> 
> 
> Testing
> ---
> 
> Manually validated tag propagation works.
> 
> * Add classification
> * Block propagation
> * Change Propagation direction
> * Remove Classification
> 
> 
> Thanks,
> 
> Sarath Subramanian
> 
>



Re: Review Request 71919: ATLAS-3563: Improve tag propagation performance using in-memory traversal

2019-12-17 Thread Sarath Subramanian

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71919/
---

(Updated Dec. 17, 2019, 5:31 p.m.)


Review request for atlas, Ashutosh Mestry, Aadarsh Jajodia, keval bhatt, 
Sridhar K, Le Ma, Mandar Ambawane, mayank jain, Nixon Rodrigues, Sameer Shaikh, 
and Sarath Subramanian.


Bugs: ATLAS-3563
https://issues.apache.org/jira/browse/ATLAS-3563


Repository: atlas


Description
---

Tag propagation uses a Gremlin query to find the entities to which the tag has 
to be propagated.

The Gremlin query doesn't scale well for entities with large lineage (many 
levels deep). In-memory traversal seems to have improved performance 
significantly, since it avoids the overhead of Gremlin script engine 
initialization and query execution.

An improvement in tag propagation time from 3004 ms to 180 ms was observed.
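
The idea of replacing a query with an in-memory walk can be sketched as a
breadth-first traversal. This is a minimal Python illustration and not the
patch itself: the adjacency map, the blocked_edges parameter and the entity
names are assumptions made for the example.

```python
from collections import deque


def entities_to_propagate(adjacency, start, blocked_edges=frozenset()):
    """Breadth-first walk from `start`, skipping edges that block propagation.

    adjacency: dict mapping an entity id to the ids it propagates tags to.
    blocked_edges: set of (source, target) pairs that must not be followed.
    Returns the reachable entity ids in visit order, excluding `start`.
    """
    visited = {start}
    queue = deque([start])
    order = []
    while queue:
        vertex = queue.popleft()
        for neighbour in adjacency.get(vertex, ()):
            if (vertex, neighbour) in blocked_edges or neighbour in visited:
                continue
            visited.add(neighbour)
            order.append(neighbour)
            queue.append(neighbour)
    return order


# Hypothetical lineage: table_a -> process_1 -> {table_b, table_c}, and so on.
LINEAGE = {
    "table_a": ["process_1"],
    "process_1": ["table_b", "table_c"],
    "table_b": ["process_2"],
    "process_2": ["table_d"],
}
```

Each vertex is visited once, so the cost grows with the size of the reachable
lineage rather than with per-query setup, which is consistent with the overhead
argument made in the description.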


Diffs (updated)
---

  graphdb/api/src/main/java/org/apache/atlas/repository/graphdb/AtlasVertex.java 6de4dcf10 
  graphdb/janus/src/main/java/org/apache/atlas/repository/graphdb/janus/AtlasJanusVertex.java 71b285731 
  intg/src/main/java/org/apache/atlas/AtlasErrorCode.java 7a2aae2e9 
  intg/src/main/java/org/apache/atlas/type/AtlasEntityType.java 928ac0d8b 
  repository/src/main/java/org/apache/atlas/repository/graph/GraphHelper.java 1e7acf1e7 
  repository/src/main/java/org/apache/atlas/repository/store/graph/v1/DeleteHandlerV1.java c9ed79750 
  repository/src/main/java/org/apache/atlas/repository/store/graph/v2/AtlasRelationshipStoreV2.java 1c8b057ba 
  repository/src/main/java/org/apache/atlas/repository/store/graph/v2/EntityGraphMapper.java a415d3084 
  repository/src/main/java/org/apache/atlas/repository/store/graph/v2/EntityGraphRetriever.java 8a24fa127 
  repository/src/main/java/org/apache/atlas/util/AtlasGremlin3QueryProvider.java 20c570f7f 
  repository/src/main/java/org/apache/atlas/util/AtlasGremlinQueryProvider.java d201db338 
  repository/src/test/java/org/apache/atlas/repository/tagpropagation/ClassificationPropagationTest.java 6f9c05e7a 


Diff: https://reviews.apache.org/r/71919/diff/4/

Changes: https://reviews.apache.org/r/71919/diff/3-4/


Testing
---

Manually validated tag propagation works.

* Add classification
* Block propagation
* Change Propagation direction
* Remove Classification


Thanks,

Sarath Subramanian



Review Request 71922: ATLAS-3564: New version of AWS S3 model addition

2019-12-17 Thread Sidharth Mishra

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71922/
---

Review request for atlas, Ashutosh Mestry, Sridhar K, Madhan Neethiraj, and 
Sarath Subramanian.


Bugs: ATLAS-3564
https://issues.apache.org/jira/browse/ATLAS-3564


Repository: atlas


Description
---

This is a new version of the AWS S3 model in Atlas that allows the same 
hierarchical structure and attributes as the AWS S3 Console. The existing AWS 
S3 model has limitations: a pseudo directory cannot contain another pseudo 
directory, objects don't have version information, and neither objects nor 
buckets contain all the attributes of AWS S3.


Diffs
-

  addons/models/3000-Cloud/3030-aws_s3_typedefs_v2.json PRE-CREATION 


Diff: https://reviews.apache.org/r/71922/diff/1/


Testing
---


Thanks,

Sidharth Mishra



[jira] [Created] (ATLAS-3564) New AWS S3 model addition

2019-12-17 Thread Sidharth Kumar Mishra (Jira)
Sidharth Kumar Mishra created ATLAS-3564:


 Summary: New AWS S3 model addition
 Key: ATLAS-3564
 URL: https://issues.apache.org/jira/browse/ATLAS-3564
 Project: Atlas
  Issue Type: New Feature
  Components:  atlas-core
Reporter: Sidharth Kumar Mishra
Assignee: Sidharth Kumar Mishra








[jira] [Commented] (ATLAS-3546) isOptional for composition relationship category?

2019-12-17 Thread Bolke de Bruin (Jira)


[ 
https://issues.apache.org/jira/browse/ATLAS-3546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16998540#comment-16998540
 ] 

Bolke de Bruin commented on ATLAS-3546:
---

+1 on that last comment; see also ATLAS-3254.



[jira] [Comment Edited] (ATLAS-3254) Atlas entity with large array of refs causes performance issues for lineage

2019-12-17 Thread Bolke de Bruin (Jira)


[ 
https://issues.apache.org/jira/browse/ATLAS-3254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16998538#comment-16998538
 ] 

Bolke de Bruin edited comment on ATLAS-3254 at 12/17/19 8:08 PM:
-

[~mayank_nj], what do you consider loading “properly”? What is the time taken 
to show the properties? What is the size of the JSON sent over the network 
(ours is > 27 MB)? What is the load time? What is the render time?

Are you saying that loading 200K objects in a pseudodir takes over 1h? I 
don't think that is “proper”.


was (Author: bolke):
[~mayank_nj], what do you consider loading “properly”? What is the time taken 
to show the properties? What is the size of the JSON sent over the network 
(ours is > 27 MB)? What is the load time? What is the render time?

> Atlas entity with large array of refs causes performance issues for lineage
> ---
>
> Key: ATLAS-3254
> URL: https://issues.apache.org/jira/browse/ATLAS-3254
> Project: Atlas
> Issue Type: Bug
> Components: atlas-core, atlas-webui
> Affects Versions: 1.0.0, 2.0.0
> Reporter: Adam Rempter
> Assignee: Mayank Jain
> Priority: Major
> Labels: performance
> Attachments: Screenshot 2019-11-28 at 21.18.44.png, 
> entity_auto_create.sh, example_create_entities.json, 
> rest_entity_get_pseudodir.json
>
>
> We use the “aws_s3_pseudo_dir” type from the 3020-aws_s3_typedefs.json model.
> It has the following property: 
> "name": "s3Objects",
> "typeName": "array"
>
> AWS buckets can hold thousands of objects, so the s3Objects array grows 
> quite quickly and the aws_s3_pseudo_dir entity JSON easily reaches a few MB.
>
> Then we start seeing problems like:
>  * The UI dies when displaying entity properties or lineage.
>  * Error in logs: audit record too long: entityType=aws_s3_pseudo_dir, 
> guid=24398271-6ba0-4db5-adfa-38e432dc55ce, size=1053931; maxSize=1048576. 
> entity attribute values not stored in audit (EntityAuditListenerV2:234)
>  * Some errors when writing to HBase (java.lang.IllegalArgumentException: 
> KeyValue size too large; as a workaround we set the 
> hbase.client.keyvalue.maxsize param to 0).
>  * Kafka consumer errors (we can of course set some parameters on the 
> consumer, but I think that is just a workaround):
> …
> Exception in NotificationHookConsumer (NotificationHookConsumer:332)
> org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be 
> completed since the group has already rebalanced and assigned the partitions 
> to another member. This means that the time between subsequent calls to 
> poll() was longer than the configured max.poll.interval.ms, which typically 
> implies that the poll loop is spending too much time message processing. You 
> can address this either by increasing the session timeout or by reducing the 
> maximum size of batches returned in poll() with max.poll.records.
> …
> Specifying pseudo_dir is required for s3objects:
> "name": "pseudoDirectory",
> "typeName": "aws_s3_pseudo_dir",
> "cardinality": "SINGLE",
> "isIndexable": false,
> *"isOptional": false,*
> "isUnique": false,
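
A rough feel for how quickly such an entity outgrows the audit limit can be had
with a small Python estimate. This is a sketch under stated assumptions: the
reference shape (a guid plus a typeName), the aws_s3_object type name and the
entity_json_size helper are illustrative, and a real Atlas entity carries more
fields per reference. Only the 1048576 limit and the 1053931 size come from the
log line quoted above; that record exceeded the limit by 5355 bytes.

```python
import json
import uuid

AUDIT_MAX_SIZE = 1048576  # maxSize reported in the quoted audit log line


def entity_json_size(num_refs):
    """Serialized size in bytes of a pseudo-dir entity holding num_refs refs."""
    refs = [
        # Assumed minimal reference shape; real references hold more fields.
        {"guid": str(uuid.UUID(int=i)), "typeName": "aws_s3_object"}
        for i in range(num_refs)
    ]
    entity = {"typeName": "aws_s3_pseudo_dir", "attributes": {"s3Objects": refs}}
    return len(json.dumps(entity).encode("utf-8"))


def refs_until_limit(limit=AUDIT_MAX_SIZE, step=1000):
    """Smallest multiple of `step` references whose JSON exceeds `limit`."""
    count = step
    while entity_json_size(count) <= limit:
        count += step
    return count
```

Even with this minimal reference shape, the serialized entity crosses the limit
somewhere between 10,000 and 20,000 references, so a directory with thousands
of objects is already close to the range where the problems listed above start.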





[jira] [Commented] (ATLAS-3254) Atlas entity with large array of refs causes performance issues for lineage

2019-12-17 Thread Bolke de Bruin (Jira)


[ 
https://issues.apache.org/jira/browse/ATLAS-3254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16998538#comment-16998538
 ] 

Bolke de Bruin commented on ATLAS-3254:
---

[~mayank_nj], what do you consider loading “properly”? What is the time taken 
to show the properties? What is the size of the JSON sent over the network 
(ours is > 27 MB)? What is the load time? What is the render time?



Re: Review Request 71919: ATLAS-3563: Improve tag propagation performance using in-memory traversal

2019-12-17 Thread Sarath Subramanian

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71919/
---

(Updated Dec. 17, 2019, 10:05 a.m.)


Review request for atlas, Ashutosh Mestry, Aadarsh Jajodia, keval bhatt, 
Sridhar K, Le Ma, Mandar Ambawane, mayank jain, Nixon Rodrigues, Sameer Shaikh, 
and Sarath Subramanian.


Bugs: ATLAS-3563
https://issues.apache.org/jira/browse/ATLAS-3563


Repository: atlas


Description
---

Tag propagation uses a Gremlin query to find the entities to which the tag has 
to be propagated.

The Gremlin query doesn't scale well for entities with large lineage (many 
levels deep). In-memory traversal seems to have improved performance 
significantly, since it avoids the overhead of Gremlin script engine 
initialization and query execution.

An improvement in tag propagation time from 3004 ms to 180 ms was observed.


Diffs (updated)
---

  graphdb/api/src/main/java/org/apache/atlas/repository/graphdb/AtlasVertex.java 6de4dcf10 
  graphdb/janus/src/main/java/org/apache/atlas/repository/graphdb/janus/AtlasJanusVertex.java 71b285731 
  intg/src/main/java/org/apache/atlas/type/AtlasEntityType.java 928ac0d8b 
  repository/src/main/java/org/apache/atlas/repository/graph/GraphHelper.java 1e7acf1e7 
  repository/src/main/java/org/apache/atlas/repository/store/graph/v1/DeleteHandlerV1.java c9ed79750 
  repository/src/main/java/org/apache/atlas/repository/store/graph/v2/AtlasRelationshipStoreV2.java 1c8b057ba 
  repository/src/main/java/org/apache/atlas/repository/store/graph/v2/EntityGraphMapper.java a415d3084 
  repository/src/main/java/org/apache/atlas/repository/store/graph/v2/EntityGraphRetriever.java 8a24fa127 
  repository/src/main/java/org/apache/atlas/util/AtlasGremlin3QueryProvider.java 20c570f7f 
  repository/src/main/java/org/apache/atlas/util/AtlasGremlinQueryProvider.java d201db338 


Diff: https://reviews.apache.org/r/71919/diff/3/

Changes: https://reviews.apache.org/r/71919/diff/2-3/


Testing
---

Manually validated tag propagation works.

* Add classification
* Block propagation
* Change Propagation direction
* Remove Classification


Thanks,

Sarath Subramanian



Re: Review Request 71919: ATLAS-3563: Improve tag propagation performance using in-memory traversal

2019-12-17 Thread Madhan Neethiraj

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71919/#review219043
---




graphdb/janus/src/main/java/org/apache/atlas/repository/graphdb/janus/AtlasJanusVertex.java
Lines 76 (patched)


Given the underlying vertex classes expect a string array, consider using 
"String[]"  as the type for parameter "edgeLabels", instead of 
"Collection".



intg/src/main/java/org/apache/atlas/type/AtlasEntityType.java
Lines 284 (patched)


LOG.info ==> LOG.debug



intg/src/main/java/org/apache/atlas/type/AtlasEntityType.java
Lines 407 (patched)


";;" => ";"



repository/src/main/java/org/apache/atlas/repository/store/graph/v2/EntityGraphRetriever.java
Lines 412 (patched)


impactedEntityVertices => propagatedEntities
  // entity vertices to which the classification is currently propagated

impactedEntityVerticesWithRestrictions => impactedEntities
  // entity vertices to which the classifications must be propagated



repository/src/main/java/org/apache/atlas/repository/store/graph/v2/EntityGraphRetriever.java
Lines 418 (patched)


- Is 'ret' in #416 the list of propagations to be added?
- Is 'ret' in #418 the list of propagations to be removed?

Consider adding a comment for this method. Looking at the caller of this 
method in AtlasRelationshipStoreV2.handleBlockedClassifications(), the list 
returned from this method seems to be used to both remove and add propagations. 
Please review and refactor/rename as necessary.
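
One way to make the add/remove distinction explicit is to return two separate
results instead of a single list. The Python sketch below only illustrates that
idea and is not the code under review: propagation_changes is a hypothetical
helper, and its two parameters borrow the names proposed in the earlier comment
on line 412.

```python
def propagation_changes(propagated_entities, impacted_entities):
    """Split a propagation update into explicit additions and removals.

    propagated_entities: entities the classification is currently propagated to.
    impacted_entities: entities the classification must be propagated to.
    Returns (to_add, to_remove) as sorted lists.
    """
    propagated = set(propagated_entities)
    impacted = set(impacted_entities)
    to_add = sorted(impacted - propagated)      # needed but not yet propagated
    to_remove = sorted(propagated - impacted)   # propagated but no longer needed
    return to_add, to_remove
```

For example, with propagated entities a, b, c and impacted entities b, c, d,
the helper reports d as an addition and a as a removal, so a caller never has
to guess which role a returned list plays.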



repository/src/main/java/org/apache/atlas/repository/store/graph/v2/EntityGraphRetriever.java
Lines 466 (patched)


classificationIdToExclude => classificationId
  in #466 and #474



repository/src/main/java/org/apache/atlas/repository/store/graph/v2/EntityGraphRetriever.java
Lines 517 (patched)


getAdjacentVertex() => getOtherVertex() // to be in line with 
JanusGraphEdge.otherVertex()


- Madhan Neethiraj


On Dec. 17, 2019, 8:29 a.m., Sarath Subramanian wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/71919/
> ---
> 
> (Updated Dec. 17, 2019, 8:29 a.m.)
> 
> 
> Review request for atlas, Ashutosh Mestry, Aadarsh Jajodia, keval bhatt, 
> Sridhar K, Le Ma, Mandar Ambawane, mayank jain, Nixon Rodrigues, Sameer 
> Shaikh, and Sarath Subramanian.
> 
> 
> Bugs: ATLAS-3563
> https://issues.apache.org/jira/browse/ATLAS-3563
> 
> 
> Repository: atlas
> 
> 
> Description
> ---
> 
> Tag propagation uses a Gremlin query to find the entities to which the tag 
> has to be propagated.
> 
> The Gremlin query doesn't scale well for entities with large lineage (many 
> levels deep). In-memory traversal seems to have improved performance 
> significantly, since it avoids the overhead of Gremlin script engine 
> initialization and query execution.
> 
> An improvement in tag propagation time from 3004 ms to 180 ms was observed.
> 
> 
> Diffs
> ---
> 
>   graphdb/api/src/main/java/org/apache/atlas/repository/graphdb/AtlasVertex.java 6de4dcf10 
>   graphdb/janus/src/main/java/org/apache/atlas/repository/graphdb/janus/AtlasJanusVertex.java 71b285731 
>   intg/src/main/java/org/apache/atlas/type/AtlasEntityType.java 928ac0d8b 
>   repository/src/main/java/org/apache/atlas/repository/graph/GraphHelper.java 1e7acf1e7 
>   repository/src/main/java/org/apache/atlas/repository/store/graph/v1/DeleteHandlerV1.java c9ed79750 
>   repository/src/main/java/org/apache/atlas/repository/store/graph/v2/AtlasRelationshipStoreV2.java 1c8b057ba 
>   repository/src/main/java/org/apache/atlas/repository/store/graph/v2/EntityGraphMapper.java a415d3084 
>   repository/src/main/java/org/apache/atlas/repository/store/graph/v2/EntityGraphRetriever.java 8a24fa127 
>   repository/src/main/java/org/apache/atlas/util/AtlasGremlin3QueryProvider.java 20c570f7f 
>   repository/src/main/java/org/apache/atlas/util/AtlasGremlinQueryProvider.java d201db338 
> 
> 
> Diff: https://reviews.apache.org/r/71919/diff/2/
> 
> 
> Testing
> ---
> 
> Manually validated tag propagation works.
> 
> * Add classification
> * Block propagation
> * Change Propagation direction
> * Remove Classification
> 
> 
> Thanks,
> 
> Sarath Subramanian
> 
>



[jira] [Commented] (ATLAS-3254) Atlas entity with large array of refs causes performance issues for lineage

2019-12-17 Thread Mayank Jain (Jira)


[ 
https://issues.apache.org/jira/browse/ATLAS-3254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16998035#comment-16998035
 ] 

Mayank Jain commented on ATLAS-3254:


@Team,

[^entity_auto_create.sh]

I was unable to reproduce the issue; none of the problems mentioned above 
occurred. The steps I followed are listed below:
 # Created a cluster with 3 instances and then followed the steps mentioned 
above.

I used a script to create 200k+ entities in a bucket, which took 3 days. After 
that, loading the bucket and its properties initially took approx. 1 hr 18 
mins, but once loaded, the properties were displayed properly, and I didn't 
find errors in the log or the NotificationHookConsumer error stated above.

I have the script file that I used for creating the 200k entities. 

Kindly let me know in case I have missed something.


> Atlas entity with large array of refs causes performance issues for lineage
> ---
>
> Key: ATLAS-3254
> URL: https://issues.apache.org/jira/browse/ATLAS-3254
> Project: Atlas
>  Issue Type: Bug
>  Components:  atlas-core, atlas-webui
>Affects Versions: 1.0.0, 2.0.0
>Reporter: Adam Rempter
>Assignee: Mayank Jain
>Priority: Major
>  Labels: performance
> Attachments: Screenshot 2019-11-28 at 21.18.44.png, 
> entity_auto_create.sh, example_create_entities.json, 
> rest_entity_get_pseudodir.json
>
>
> We use “aws_s3_pseudo_dir” type from 3020-aws_s3_typedefs.json model.
> It has following property: 
> "name":    "s3Objects",
> "typeName":    "array"
>  
> Now in AWS buckets you can have thousands of objects. This causes that 
> s3Objects array grows quite quickly, causing aws_s3_pseudo_dir entity Json to 
> rich easly few MBs.
>  
> Then we start seeing problems like:
>  * UI is dying on displaying entity properties or lineage
>  * Error in logs: audit record too long: entityType=aws_s3_pseudo_dir, 
> guid=24398271-6ba0-4db5-adfa-38e432dc55ce, size=1053931; maxSize=1048576. 
> entity attribute values not stored in audit (EntityAuditListenerV2:234)
>  * Some errors on writes to HBase (java.lang.IllegalArgumentException: 
> KeyValue size too large; as a workaround we set the 
> hbase.client.keyvalue.maxsize param to 0)
>  * Kafka consumer errors (we can of course set some parameters on the 
> consumer, but I think that is just a workaround)
> …
> Exception in NotificationHookConsumer (NotificationHookConsumer:332)
> org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be 
> completed since the group has already rebalanced and assigned the partitions 
> to another member. This means that the time between subsequent calls to 
> poll() was longer than the configured max.poll.interval.ms, which typically 
> implies that the poll loop is spending too much time message processing. You 
> can address this either by increasing the session timeout or by reducing the 
> maximum size of batches returned in poll() with max.poll.records.
> …
> Specifying pseudo_dir is required for s3objects:
> "name": "pseudoDirectory",
> "typeName": "aws_s3_pseudo_dir",
> "cardinality": "SINGLE",
> "isIndexable": false,
> *"isOptional": false,*
> "isUnique": false,
>  
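
The size problem described in the report can be illustrated with a rough 
sketch. It assumes each entry of s3Objects serializes as a guid/typeName pair 
(a hypothetical shape chosen for illustration; the real entity JSON carries 
more fields, so it would hit the limit sooner) and compares the serialized 
size with the maxSize=1048576 value from the quoted audit log line. The 
function names are made up for this example.

```python
import json
import uuid

# Limit taken from the audit log line quoted above (maxSize=1048576).
AUDIT_MAX_SIZE = 1048576


def pseudo_dir_json_size(num_objects):
    """Return the serialized size in bytes of a pseudo-dir entity
    that references num_objects S3 objects."""
    entity = {
        "typeName": "aws_s3_pseudo_dir",
        "attributes": {
            # Hypothetical reference shape: one guid/typeName pair per object.
            "s3Objects": [
                {"guid": str(uuid.UUID(int=i)), "typeName": "aws_s3_object"}
                for i in range(num_objects)
            ],
        },
    }
    return len(json.dumps(entity).encode("utf-8"))


def first_count_over_limit(limit=AUDIT_MAX_SIZE, step=1000):
    """Return the first multiple of step whose entity JSON exceeds limit."""
    count = step
    while pseudo_dir_json_size(count) <= limit:
        count += step
    return count
```

Under these assumptions the size grows linearly with the number of references, 
so a directory with tens of thousands of objects passes the 1 MB audit limit 
even with this minimal reference shape.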



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ATLAS-3254) Atlas entity with large array of refs causes performance issues for lineage

2019-12-17 Thread Mayank Jain (Jira)


 [ 
https://issues.apache.org/jira/browse/ATLAS-3254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayank Jain updated ATLAS-3254:
---
Attachment: entity_auto_create.sh






[jira] [Commented] (ATLAS-3546) isOptional for composition relationship category?

2019-12-17 Thread charles shen (Jira)


[ 
https://issues.apache.org/jira/browse/ATLAS-3546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16998011#comment-16998011
 ] 

charles shen commented on ATLAS-3546:
-

Also, it brings a side effect: loading a parent loads all of its contained 
children (through ownedRef), which somewhat conflicts with 
[ATLAS-3056|https://issues.apache.org/jira/browse/ATLAS-3056]. Consider AWS S3 
as an example: there might be thousands of pseudo dirs under one bucket and 
thousands of S3 objects under one pseudo dir, which is a huge performance issue.

 

> isOptional for composition relationship category?
> -
>
> Key: ATLAS-3546
> URL: https://issues.apache.org/jira/browse/ATLAS-3546
> Project: Atlas
>  Issue Type: Improvement
>  Components:  atlas-core
>Affects Versions: 2.0.0
>Reporter: charles shen
>Priority: Major
> Attachments: json.jpg, xml.jpg, xml.jpg
>
>
> I noticed that since 
> [ATLAS-3051|https://issues.apache.org/jira/browse/ATLAS-3051], the 
> relationship attribute must be specified in the end def that is not the 
> container when the relationship category is composition. 
> I understand this is meant to prohibit orphan children, but is it too 
> strict? Reasons below:
>  # I have to provide all the entities along the relationship path in the 
> payload when creating a child. E.g., for RDBMS, I have to provide 
> rdbms_instance, rdbms_db, rdbms_table and rdbms_column when I just want to 
> create a single rdbms_column, which brings the performance overhead of 
> checking the existence of rdbms_instance, rdbms_db, etc.
>  # I have defined a composition relationship type where each end is the same 
> entity type. It can no longer be created successfully, since it always 
> requires the mandatory attribute, which is of the same type, and then falls 
> into an infinite loop.
> Three possible fixes:
>  # Remove the isOptional constraint, since ownedRef/inverseRef doesn't have 
> such a constraint.
>  # Add isOptional to the relationship type end def.
>  # Add an option in REST to ignore the isOptional constraint for relationship 
> types.





[jira] [Updated] (ATLAS-3563) Improve tag propagation performance using in-memory traversal

2019-12-17 Thread Sarath Subramanian (Jira)


 [ 
https://issues.apache.org/jira/browse/ATLAS-3563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sarath Subramanian updated ATLAS-3563:
--
Description: 
Tag propagation uses a Gremlin query to find the entities to which the tag has 
to be propagated.

The Gremlin query doesn't scale well for entities with large lineage (many 
levels deep). In-memory traversal seems to have improved performance 
significantly, since it avoids the overhead added by Gremlin script engine 
initialization and query execution time.

Tag propagation time was observed to improve from *3004 ms* to *180 ms*.

  was:
Tag propagation uses a Gremlin query to find the entities to which the tag has 
to be propagated.

The Gremlin query is not scaling well for entities with large lineage (many 
levels deep). In-memory traversal seems to have improved performance 
significantly, since it avoids the overhead added by Gremlin script engine 
initialization and query execution time.

Tag propagation time was observed to improve from 3004 ms to 180 ms.


> Improve tag propagation performance using in-memory traversal
> -
>
> Key: ATLAS-3563
> URL: https://issues.apache.org/jira/browse/ATLAS-3563
> Project: Atlas
>  Issue Type: Task
>  Components:  atlas-core
>Affects Versions: 2.0.0
>Reporter: Sarath Subramanian
>Assignee: Sarath Subramanian
>Priority: Major
> Fix For: 2.1.0
>
>
> Tag propagation uses a Gremlin query to find the entities to which the tag 
> has to be propagated.
> The Gremlin query doesn't scale well for entities with large lineage (many 
> levels deep). In-memory traversal seems to have improved performance 
> significantly, since it avoids the overhead added by Gremlin script engine 
> initialization and query execution time.
> Tag propagation time was observed to improve from *3004 ms* to *180 ms*.





[jira] [Updated] (ATLAS-3563) Improve tag propagation performance using in-memory traversal

2019-12-17 Thread Sarath Subramanian (Jira)


 [ 
https://issues.apache.org/jira/browse/ATLAS-3563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sarath Subramanian updated ATLAS-3563:
--
Attachment: (was: ATLAS-3563.001.patch)






[jira] [Updated] (ATLAS-3563) Improve tag propagation performance using in-memory traversal

2019-12-17 Thread Sarath Subramanian (Jira)


 [ 
https://issues.apache.org/jira/browse/ATLAS-3563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sarath Subramanian updated ATLAS-3563:
--
Attachment: ATLAS-3563.001.patch






Review Request 71919: ATLAS-3563: Improve tag propagation performance using in-memory traversal

2019-12-17 Thread Sarath Subramanian

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71919/
---

Review request for atlas, Ashutosh Mestry, Aadarsh Jajodia, keval bhatt, 
Sridhar K, Le Ma, Mandar Ambawane, mayank jain, Nixon Rodrigues, Sameer Shaikh, 
and Sarath Subramanian.


Bugs: ATLAS-3563
https://issues.apache.org/jira/browse/ATLAS-3563


Repository: atlas


Description
---

Tag propagation uses a Gremlin query to find the entities to which the tag has 
to be propagated.

The Gremlin query is not scaling well for entities with large lineage (many 
levels deep). In-memory traversal seems to have improved performance 
significantly, since it avoids the overhead added by Gremlin script engine 
initialization and query execution time.

Tag propagation time was observed to improve from 3004 ms to 180 ms.
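
For readers unfamiliar with the idea, in-memory traversal can be sketched as a 
breadth-first walk over an adjacency map. This is only an illustration of the 
general technique, not the code in the attached patch (which changes the Java 
classes listed under Diffs); the function name, the dict-based graph and the 
entity names are assumptions made for this example.

```python
from collections import deque


def entities_to_propagate(graph, start):
    """Breadth-first walk over an in-memory adjacency map.

    graph maps an entity id to the ids reachable through one lineage hop
    (hypothetical structure). Returns every entity reachable from start,
    in visiting order, excluding start itself.
    """
    seen = {start}
    queue = deque([start])
    reached = []
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:  # guards against cycles in the lineage
                seen.add(neighbor)
                queue.append(neighbor)
                reached.append(neighbor)
    return reached


# Hypothetical lineage: table_a -> process_1 -> {table_b, table_c} -> ...
lineage = {
    "table_a": ["process_1"],
    "process_1": ["table_b", "table_c"],
    "table_c": ["process_2"],
    "process_2": ["table_a"],  # cycle back to the start
}
```

Calling entities_to_propagate(lineage, "table_a") visits each entity once, with 
no script engine or query compilation involved, which is the kind of overhead 
the description says the change avoids.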


Diffs
-

  
  graphdb/api/src/main/java/org/apache/atlas/repository/graphdb/AtlasVertex.java 6de4dcf10 
  graphdb/janus/src/main/java/org/apache/atlas/repository/graphdb/janus/AtlasJanusVertex.java 71b285731 
  intg/src/main/java/org/apache/atlas/type/AtlasEntityType.java 928ac0d8b 
  repository/src/main/java/org/apache/atlas/repository/graph/GraphHelper.java 1e7acf1e7 
  repository/src/main/java/org/apache/atlas/repository/store/graph/v1/DeleteHandlerV1.java c9ed79750 
  repository/src/main/java/org/apache/atlas/repository/store/graph/v2/AtlasRelationshipStoreV2.java 1c8b057ba 
  repository/src/main/java/org/apache/atlas/repository/store/graph/v2/EntityGraphMapper.java a415d3084 
  repository/src/main/java/org/apache/atlas/repository/store/graph/v2/EntityGraphRetriever.java 8a24fa127 
  repository/src/main/java/org/apache/atlas/util/AtlasGremlin3QueryProvider.java 20c570f7f 
  repository/src/main/java/org/apache/atlas/util/AtlasGremlinQueryProvider.java d201db338 


Diff: https://reviews.apache.org/r/71919/diff/1/


Testing
---

Manually validated tag propagation works.

* Add classification
* Block propagation
* Change Propagation direction
* Remove Classification


Thanks,

Sarath Subramanian



[jira] [Created] (ATLAS-3563) Improve tag propagation performance using in-memory traversal

2019-12-17 Thread Sarath Subramanian (Jira)
Sarath Subramanian created ATLAS-3563:
-

 Summary: Improve tag propagation performance using in-memory 
traversal
 Key: ATLAS-3563
 URL: https://issues.apache.org/jira/browse/ATLAS-3563
 Project: Atlas
  Issue Type: Task
  Components:  atlas-core
Affects Versions: 2.0.0
Reporter: Sarath Subramanian
Assignee: Sarath Subramanian
 Fix For: 2.1.0


Tag propagation uses a Gremlin query to find the entities to which the tag has 
to be propagated.

The Gremlin query is not scaling well for entities with large lineage (many 
levels deep). In-memory traversal seems to have improved performance 
significantly, since it avoids the overhead added by Gremlin script engine 
initialization and query execution time.

Tag propagation time was observed to improve from 3004 ms to 180 ms.


