[jira] [Commented] (ATLAS-2708) AWS S3 data lake typedefs for Atlas

2018-09-24 Thread t oo (JIRA)


[ 
https://issues.apache.org/jira/browse/ATLAS-2708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16626588#comment-16626588
 ] 

t oo commented on ATLAS-2708:
-

tracking further work in ATLAS-2889

> AWS S3 data lake typedefs for Atlas
> ---
>
> Key: ATLAS-2708
> URL: https://issues.apache.org/jira/browse/ATLAS-2708
> Project: Atlas
>  Issue Type: New Feature
>  Components:  atlas-core
>Reporter: Barbara Eckman
>Assignee: Barbara Eckman
>Priority: Critical
> Fix For: 1.1.0, 2.0.0
>
> Attachments: 3010-aws_model.json, ATLAS-2708-2.patch, 
> ATLAS-2708.patch, all_AWS_common_typedefs.json, 
> all_AWS_common_typedefs_v2.json, all_datalake_typedefs.json, 
> all_datalake_typedefs_v2.json
>
>
> Currently the base types in Atlas do not include AWS data lake objects. It 
> would be nice to add typedefs for AWS data lake objects (buckets and 
> pseudo-directories) and lineage processes that move the data from another 
> source (e.g., kafka topic) to the data lake.  For example:
>  * AWSS3PseudoDir type represents the pseudo-directory “prefix” of objects in 
> an S3 bucket.  For example, in the case of an object with key 
> “myWork/Development/Projects1.xls”, “myWork/Development” is the 
> pseudo-directory.  It supports:
>  ** Array of avro schemas that are associated with the data in the 
> pseudo-directory (based on Avro schema extensions outlined in ATLAS-2694)
>  ** what type of data it contains, e.g., avro, json, unstructured
>  ** time of creation
>  * AWSS3BucketLifeCycleRule type represents a rule specifying a transition of 
> the data in a bucket to a storageClass after a specific time interval, or 
> expiration.  For example, transition to GLACIER after 60 days, or expire 
> (i.e. be deleted) after 90 days:
>  ** ruleType (e.g., transition or expiration)
>  ** time interval in days before rule is executed  
>  ** storageClass to which the data is transitioned (null if ruleType is 
> expiration)
>  * AWSTag type represents a tag-value pair created by the user and associated 
> with an AWS object.
>  **  tag
>  ** value
>  * AWSCloudWatchMetric type represents a storage or request metric that is 
> monitored by AWS CloudWatch and can be configured for a bucket
>  ** metricName, for example, “AllRequests”, “GetRequests”, 
> TotalRequestLatency, BucketSizeBytes
>  ** scope: null if entire bucket; otherwise, the prefixes/tags that filter or 
> limit the monitoring of the metric.
>  * AWSS3Bucket type represents a bucket in an S3 instance.  It supports:
>  ** Array of AWSS3PseudoDirectories that are associated with objects stored 
> in the bucket 
>  ** AWS region
>  ** IsEncrypted (boolean) 
>  ** encryptionType, e.g., AES-256
>  ** S3AccessPolicy, a JSON object expressing access policies, eg GetObject, 
> PutObject
>  ** time of creation
>  ** Array of AWSS3BucketLifeCycleRules that are associated with the bucket 
>  ** Array of AWSS3CloudWatchMetrics that are associated with the bucket or 
> its tags or prefixes
>  ** Array of AWSTags that are associated with the bucket
>  * Generic dataset2Dataset process to represent movement of data from one 
> dataset to another.  It supports:
>  ** array of transforms performed by the process 
>  ** map of tag/value pairs representing configurationParameters of the process
>  ** inputs and outputs are arrays of dataset objects, e.g., kafka topic and 
> S3 pseudo-directory.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ATLAS-2708) AWS S3 data lake typedefs for Atlas

2018-09-20 Thread Barbara Eckman (JIRA)


[ 
https://issues.apache.org/jira/browse/ATLAS-2708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16622266#comment-16622266
 ] 

Barbara Eckman commented on ATLAS-2708:
---

[~ayushmnnit] Agreed.

> AWS S3 data lake typedefs for Atlas
> ---
>
> Key: ATLAS-2708
> URL: https://issues.apache.org/jira/browse/ATLAS-2708
> Project: Atlas
>  Issue Type: New Feature
>  Components:  atlas-core
>Reporter: Barbara Eckman
>Assignee: Barbara Eckman
>Priority: Critical
> Fix For: 1.1.0, 2.0.0
>
> Attachments: 3010-aws_model.json, ATLAS-2708-2.patch, 
> ATLAS-2708.patch, all_AWS_common_typedefs.json, 
> all_AWS_common_typedefs_v2.json, all_datalake_typedefs.json, 
> all_datalake_typedefs_v2.json
>
>
> Currently the base types in Atlas do not include AWS data lake objects. It 
> would be nice to add typedefs for AWS data lake objects (buckets and 
> pseudo-directories) and lineage processes that move the data from another 
> source (e.g., kafka topic) to the data lake.  For example:
>  * AWSS3PseudoDir type represents the pseudo-directory “prefix” of objects in 
> an S3 bucket.  For example, in the case of an object with key 
> “myWork/Development/Projects1.xls”, “myWork/Development” is the 
> pseudo-directory.  It supports:
>  ** Array of avro schemas that are associated with the data in the 
> pseudo-directory (based on Avro schema extensions outlined in ATLAS-2694)
>  ** what type of data it contains, e.g., avro, json, unstructured
>  ** time of creation
>  * AWSS3BucketLifeCycleRule type represents a rule specifying a transition of 
> the data in a bucket to a storageClass after a specific time interval, or 
> expiration.  For example, transition to GLACIER after 60 days, or expire 
> (i.e. be deleted) after 90 days:
>  ** ruleType (e.g., transition or expiration)
>  ** time interval in days before rule is executed  
>  ** storageClass to which the data is transitioned (null if ruleType is 
> expiration)
>  * AWSTag type represents a tag-value pair created by the user and associated 
> with an AWS object.
>  **  tag
>  ** value
>  * AWSCloudWatchMetric type represents a storage or request metric that is 
> monitored by AWS CloudWatch and can be configured for a bucket
>  ** metricName, for example, “AllRequests”, “GetRequests”, 
> TotalRequestLatency, BucketSizeBytes
>  ** scope: null if entire bucket; otherwise, the prefixes/tags that filter or 
> limit the monitoring of the metric.
>  * AWSS3Bucket type represents a bucket in an S3 instance.  It supports:
>  ** Array of AWSS3PseudoDirectories that are associated with objects stored 
> in the bucket 
>  ** AWS region
>  ** IsEncrypted (boolean) 
>  ** encryptionType, e.g., AES-256
>  ** S3AccessPolicy, a JSON object expressing access policies, eg GetObject, 
> PutObject
>  ** time of creation
>  ** Array of AWSS3BucketLifeCycleRules that are associated with the bucket 
>  ** Array of AWSS3CloudWatchMetrics that are associated with the bucket or 
> its tags or prefixes
>  ** Array of AWSTags that are associated with the bucket
>  * Generic dataset2Dataset process to represent movement of data from one 
> dataset to another.  It supports:
>  ** array of transforms performed by the process 
>  ** map of tag/value pairs representing configurationParameters of the process
>  ** inputs and outputs are arrays of dataset objects, e.g., kafka topic and 
> S3 pseudo-directory.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ATLAS-2708) AWS S3 data lake typedefs for Atlas

2018-09-19 Thread Ayush Nigam (JIRA)


[ 
https://issues.apache.org/jira/browse/ATLAS-2708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16621527#comment-16621527
 ] 

Ayush Nigam commented on ATLAS-2708:


[~toopt4] You have to create your own Lambda code that creates AtlasEntities on 
the fly as trigerred by Lambda Function(on changes made to s3 object) and then 
push to Kafka Queue. This particular functionality is not part of Atlas tool as 
of now.

> AWS S3 data lake typedefs for Atlas
> ---
>
> Key: ATLAS-2708
> URL: https://issues.apache.org/jira/browse/ATLAS-2708
> Project: Atlas
>  Issue Type: New Feature
>  Components:  atlas-core
>Reporter: Barbara Eckman
>Assignee: Barbara Eckman
>Priority: Critical
> Fix For: 1.1.0, 2.0.0
>
> Attachments: 3010-aws_model.json, ATLAS-2708-2.patch, 
> ATLAS-2708.patch, all_AWS_common_typedefs.json, 
> all_AWS_common_typedefs_v2.json, all_datalake_typedefs.json, 
> all_datalake_typedefs_v2.json
>
>
> Currently the base types in Atlas do not include AWS data lake objects. It 
> would be nice to add typedefs for AWS data lake objects (buckets and 
> pseudo-directories) and lineage processes that move the data from another 
> source (e.g., kafka topic) to the data lake.  For example:
>  * AWSS3PseudoDir type represents the pseudo-directory “prefix” of objects in 
> an S3 bucket.  For example, in the case of an object with key 
> “myWork/Development/Projects1.xls”, “myWork/Development” is the 
> pseudo-directory.  It supports:
>  ** Array of avro schemas that are associated with the data in the 
> pseudo-directory (based on Avro schema extensions outlined in ATLAS-2694)
>  ** what type of data it contains, e.g., avro, json, unstructured
>  ** time of creation
>  * AWSS3BucketLifeCycleRule type represents a rule specifying a transition of 
> the data in a bucket to a storageClass after a specific time interval, or 
> expiration.  For example, transition to GLACIER after 60 days, or expire 
> (i.e. be deleted) after 90 days:
>  ** ruleType (e.g., transition or expiration)
>  ** time interval in days before rule is executed  
>  ** storageClass to which the data is transitioned (null if ruleType is 
> expiration)
>  * AWSTag type represents a tag-value pair created by the user and associated 
> with an AWS object.
>  **  tag
>  ** value
>  * AWSCloudWatchMetric type represents a storage or request metric that is 
> monitored by AWS CloudWatch and can be configured for a bucket
>  ** metricName, for example, “AllRequests”, “GetRequests”, 
> TotalRequestLatency, BucketSizeBytes
>  ** scope: null if entire bucket; otherwise, the prefixes/tags that filter or 
> limit the monitoring of the metric.
>  * AWSS3Bucket type represents a bucket in an S3 instance.  It supports:
>  ** Array of AWSS3PseudoDirectories that are associated with objects stored 
> in the bucket 
>  ** AWS region
>  ** IsEncrypted (boolean) 
>  ** encryptionType, e.g., AES-256
>  ** S3AccessPolicy, a JSON object expressing access policies, eg GetObject, 
> PutObject
>  ** time of creation
>  ** Array of AWSS3BucketLifeCycleRules that are associated with the bucket 
>  ** Array of AWSS3CloudWatchMetrics that are associated with the bucket or 
> its tags or prefixes
>  ** Array of AWSTags that are associated with the bucket
>  * Generic dataset2Dataset process to represent movement of data from one 
> dataset to another.  It supports:
>  ** array of transforms performed by the process 
>  ** map of tag/value pairs representing configurationParameters of the process
>  ** inputs and outputs are arrays of dataset objects, e.g., kafka topic and 
> S3 pseudo-directory.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ATLAS-2708) AWS S3 data lake typedefs for Atlas

2018-09-19 Thread t oo (JIRA)


[ 
https://issues.apache.org/jira/browse/ATLAS-2708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16621191#comment-16621191
 ] 

t oo commented on ATLAS-2708:
-

[~barbara] what config/steps should I run to enable this? or are you saying 
this is homegrown and not included in the atlas tool?

> AWS S3 data lake typedefs for Atlas
> ---
>
> Key: ATLAS-2708
> URL: https://issues.apache.org/jira/browse/ATLAS-2708
> Project: Atlas
>  Issue Type: New Feature
>  Components:  atlas-core
>Reporter: Barbara Eckman
>Assignee: Barbara Eckman
>Priority: Critical
> Fix For: 1.1.0, 2.0.0
>
> Attachments: 3010-aws_model.json, ATLAS-2708-2.patch, 
> ATLAS-2708.patch, all_AWS_common_typedefs.json, 
> all_AWS_common_typedefs_v2.json, all_datalake_typedefs.json, 
> all_datalake_typedefs_v2.json
>
>
> Currently the base types in Atlas do not include AWS data lake objects. It 
> would be nice to add typedefs for AWS data lake objects (buckets and 
> pseudo-directories) and lineage processes that move the data from another 
> source (e.g., kafka topic) to the data lake.  For example:
>  * AWSS3PseudoDir type represents the pseudo-directory “prefix” of objects in 
> an S3 bucket.  For example, in the case of an object with key 
> “myWork/Development/Projects1.xls”, “myWork/Development” is the 
> pseudo-directory.  It supports:
>  ** Array of avro schemas that are associated with the data in the 
> pseudo-directory (based on Avro schema extensions outlined in ATLAS-2694)
>  ** what type of data it contains, e.g., avro, json, unstructured
>  ** time of creation
>  * AWSS3BucketLifeCycleRule type represents a rule specifying a transition of 
> the data in a bucket to a storageClass after a specific time interval, or 
> expiration.  For example, transition to GLACIER after 60 days, or expire 
> (i.e. be deleted) after 90 days:
>  ** ruleType (e.g., transition or expiration)
>  ** time interval in days before rule is executed  
>  ** storageClass to which the data is transitioned (null if ruleType is 
> expiration)
>  * AWSTag type represents a tag-value pair created by the user and associated 
> with an AWS object.
>  **  tag
>  ** value
>  * AWSCloudWatchMetric type represents a storage or request metric that is 
> monitored by AWS CloudWatch and can be configured for a bucket
>  ** metricName, for example, “AllRequests”, “GetRequests”, 
> TotalRequestLatency, BucketSizeBytes
>  ** scope: null if entire bucket; otherwise, the prefixes/tags that filter or 
> limit the monitoring of the metric.
>  * AWSS3Bucket type represents a bucket in an S3 instance.  It supports:
>  ** Array of AWSS3PseudoDirectories that are associated with objects stored 
> in the bucket 
>  ** AWS region
>  ** IsEncrypted (boolean) 
>  ** encryptionType, e.g., AES-256
>  ** S3AccessPolicy, a JSON object expressing access policies, eg GetObject, 
> PutObject
>  ** time of creation
>  ** Array of AWSS3BucketLifeCycleRules that are associated with the bucket 
>  ** Array of AWSS3CloudWatchMetrics that are associated with the bucket or 
> its tags or prefixes
>  ** Array of AWSTags that are associated with the bucket
>  * Generic dataset2Dataset process to represent movement of data from one 
> dataset to another.  It supports:
>  ** array of transforms performed by the process 
>  ** map of tag/value pairs representing configurationParameters of the process
>  ** inputs and outputs are arrays of dataset objects, e.g., kafka topic and 
> S3 pseudo-directory.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ATLAS-2708) AWS S3 data lake typedefs for Atlas

2018-09-19 Thread Barbara Eckman (JIRA)


[ 
https://issues.apache.org/jira/browse/ATLAS-2708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16621077#comment-16621077
 ] 

Barbara Eckman commented on ATLAS-2708:
---

It doesn't do it automatically through a listener like the hive hook.  We do it 
via lambda functions, triggered, say, on the creation of S3 object or 
pseudodirectory or bucket.  We package up the info into AtlasEntities and then 
publish to the ATLAS_HOOK kafka topic.

> AWS S3 data lake typedefs for Atlas
> ---
>
> Key: ATLAS-2708
> URL: https://issues.apache.org/jira/browse/ATLAS-2708
> Project: Atlas
>  Issue Type: New Feature
>  Components:  atlas-core
>Reporter: Barbara Eckman
>Assignee: Barbara Eckman
>Priority: Critical
> Fix For: 1.1.0, 2.0.0
>
> Attachments: 3010-aws_model.json, ATLAS-2708-2.patch, 
> ATLAS-2708.patch, all_AWS_common_typedefs.json, 
> all_AWS_common_typedefs_v2.json, all_datalake_typedefs.json, 
> all_datalake_typedefs_v2.json
>
>
> Currently the base types in Atlas do not include AWS data lake objects. It 
> would be nice to add typedefs for AWS data lake objects (buckets and 
> pseudo-directories) and lineage processes that move the data from another 
> source (e.g., kafka topic) to the data lake.  For example:
>  * AWSS3PseudoDir type represents the pseudo-directory “prefix” of objects in 
> an S3 bucket.  For example, in the case of an object with key 
> “myWork/Development/Projects1.xls”, “myWork/Development” is the 
> pseudo-directory.  It supports:
>  ** Array of avro schemas that are associated with the data in the 
> pseudo-directory (based on Avro schema extensions outlined in ATLAS-2694)
>  ** what type of data it contains, e.g., avro, json, unstructured
>  ** time of creation
>  * AWSS3BucketLifeCycleRule type represents a rule specifying a transition of 
> the data in a bucket to a storageClass after a specific time interval, or 
> expiration.  For example, transition to GLACIER after 60 days, or expire 
> (i.e. be deleted) after 90 days:
>  ** ruleType (e.g., transition or expiration)
>  ** time interval in days before rule is executed  
>  ** storageClass to which the data is transitioned (null if ruleType is 
> expiration)
>  * AWSTag type represents a tag-value pair created by the user and associated 
> with an AWS object.
>  **  tag
>  ** value
>  * AWSCloudWatchMetric type represents a storage or request metric that is 
> monitored by AWS CloudWatch and can be configured for a bucket
>  ** metricName, for example, “AllRequests”, “GetRequests”, 
> TotalRequestLatency, BucketSizeBytes
>  ** scope: null if entire bucket; otherwise, the prefixes/tags that filter or 
> limit the monitoring of the metric.
>  * AWSS3Bucket type represents a bucket in an S3 instance.  It supports:
>  ** Array of AWSS3PseudoDirectories that are associated with objects stored 
> in the bucket 
>  ** AWS region
>  ** IsEncrypted (boolean) 
>  ** encryptionType, e.g., AES-256
>  ** S3AccessPolicy, a JSON object expressing access policies, eg GetObject, 
> PutObject
>  ** time of creation
>  ** Array of AWSS3BucketLifeCycleRules that are associated with the bucket 
>  ** Array of AWSS3CloudWatchMetrics that are associated with the bucket or 
> its tags or prefixes
>  ** Array of AWSTags that are associated with the bucket
>  * Generic dataset2Dataset process to represent movement of data from one 
> dataset to another.  It supports:
>  ** array of transforms performed by the process 
>  ** map of tag/value pairs representing configurationParameters of the process
>  ** inputs and outputs are arrays of dataset objects, e.g., kafka topic and 
> S3 pseudo-directory.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ATLAS-2708) AWS S3 data lake typedefs for Atlas

2018-09-19 Thread t oo (JIRA)


[ 
https://issues.apache.org/jira/browse/ATLAS-2708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16620568#comment-16620568
 ] 

t oo commented on ATLAS-2708:
-

Can atlas import 's3 object tags' from all s3 objects under a given s3 bucket?

> AWS S3 data lake typedefs for Atlas
> ---
>
> Key: ATLAS-2708
> URL: https://issues.apache.org/jira/browse/ATLAS-2708
> Project: Atlas
>  Issue Type: New Feature
>  Components:  atlas-core
>Reporter: Barbara Eckman
>Assignee: Barbara Eckman
>Priority: Critical
> Fix For: 1.1.0, 2.0.0
>
> Attachments: 3010-aws_model.json, ATLAS-2708-2.patch, 
> ATLAS-2708.patch, all_AWS_common_typedefs.json, 
> all_AWS_common_typedefs_v2.json, all_datalake_typedefs.json, 
> all_datalake_typedefs_v2.json
>
>
> Currently the base types in Atlas do not include AWS data lake objects. It 
> would be nice to add typedefs for AWS data lake objects (buckets and 
> pseudo-directories) and lineage processes that move the data from another 
> source (e.g., kafka topic) to the data lake.  For example:
>  * AWSS3PseudoDir type represents the pseudo-directory “prefix” of objects in 
> an S3 bucket.  For example, in the case of an object with key 
> “myWork/Development/Projects1.xls”, “myWork/Development” is the 
> pseudo-directory.  It supports:
>  ** Array of avro schemas that are associated with the data in the 
> pseudo-directory (based on Avro schema extensions outlined in ATLAS-2694)
>  ** what type of data it contains, e.g., avro, json, unstructured
>  ** time of creation
>  * AWSS3BucketLifeCycleRule type represents a rule specifying a transition of 
> the data in a bucket to a storageClass after a specific time interval, or 
> expiration.  For example, transition to GLACIER after 60 days, or expire 
> (i.e. be deleted) after 90 days:
>  ** ruleType (e.g., transition or expiration)
>  ** time interval in days before rule is executed  
>  ** storageClass to which the data is transitioned (null if ruleType is 
> expiration)
>  * AWSTag type represents a tag-value pair created by the user and associated 
> with an AWS object.
>  **  tag
>  ** value
>  * AWSCloudWatchMetric type represents a storage or request metric that is 
> monitored by AWS CloudWatch and can be configured for a bucket
>  ** metricName, for example, “AllRequests”, “GetRequests”, 
> TotalRequestLatency, BucketSizeBytes
>  ** scope: null if entire bucket; otherwise, the prefixes/tags that filter or 
> limit the monitoring of the metric.
>  * AWSS3Bucket type represents a bucket in an S3 instance.  It supports:
>  ** Array of AWSS3PseudoDirectories that are associated with objects stored 
> in the bucket 
>  ** AWS region
>  ** IsEncrypted (boolean) 
>  ** encryptionType, e.g., AES-256
>  ** S3AccessPolicy, a JSON object expressing access policies, eg GetObject, 
> PutObject
>  ** time of creation
>  ** Array of AWSS3BucketLifeCycleRules that are associated with the bucket 
>  ** Array of AWSS3CloudWatchMetrics that are associated with the bucket or 
> its tags or prefixes
>  ** Array of AWSTags that are associated with the bucket
>  * Generic dataset2Dataset process to represent movement of data from one 
> dataset to another.  It supports:
>  ** array of transforms performed by the process 
>  ** map of tag/value pairs representing configurationParameters of the process
>  ** inputs and outputs are arrays of dataset objects, e.g., kafka topic and 
> S3 pseudo-directory.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ATLAS-2708) AWS S3 data lake typedefs for Atlas

2018-06-19 Thread Eckman, Barbara (JIRA)


[ 
https://issues.apache.org/jira/browse/ATLAS-2708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16517207#comment-16517207
 ] 

Eckman, Barbara commented on ATLAS-2708:


Awesome!  Thanks for your suggestions, too!

On 6/19/18, 2:00 AM, "Don Bosco Durai (JIRA)"  wrote:


[ 
https://issues.apache.org/jira/browse/ATLAS-2708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16516825#comment-16516825
 ] 

Don Bosco Durai commented on ATLAS-2708:


{quote}[~bosco] [~madhan.neethiraj] Of course you can use my JSON for the 
demo!
{quote}
[~barbara] I was able to swap out my model with yours. Now it looks pretty 
good :) Thanks

> AWS S3 data lake typedefs for Atlas
> ---
>
> Key: ATLAS-2708
> URL: https://issues.apache.org/jira/browse/ATLAS-2708
> Project: Atlas
>  Issue Type: New Feature
>  Components:  atlas-core
>Reporter: Barbara Eckman
>Assignee: Barbara Eckman
>Priority: Critical
> Fix For: 1.1.0, 2.0.0
>
> Attachments: 3010-aws_model.json, ATLAS-2708-2.patch, 
ATLAS-2708.patch, all_AWS_common_typedefs.json, 
all_AWS_common_typedefs_v2.json, all_datalake_typedefs.json, 
all_datalake_typedefs_v2.json
>
>
> Currently the base types in Atlas do not include AWS data lake objects. 
It would be nice to add typedefs for AWS data lake objects (buckets and 
pseudo-directories) and lineage processes that move the data from another 
source (e.g., kafka topic) to the data lake.  For example:
>  * AWSS3PseudoDir type represents the pseudo-directory “prefix” of 
objects in an S3 bucket.  For example, in the case of an object with key 
“myWork/Development/Projects1.xls”, “myWork/Development” is the 
pseudo-directory.  It supports:
>  ** Array of avro schemas that are associated with the data in the 
pseudo-directory (based on Avro schema extensions outlined in ATLAS-2694)
>  ** what type of data it contains, e.g., avro, json, unstructured
>  ** time of creation
>  * AWSS3BucketLifeCycleRule type represents a rule specifying a 
transition of the data in a bucket to a storageClass after a specific time 
interval, or expiration.  For example, transition to GLACIER after 60 days, or 
expire (i.e. be deleted) after 90 days:
>  ** ruleType (e.g., transition or expiration)
>  ** time interval in days before rule is executed  
>  ** storageClass to which the data is transitioned (null if ruleType is 
expiration)
>  * AWSTag type represents a tag-value pair created by the user and 
associated with an AWS object.
>  **  tag
>  ** value
>  * AWSCloudWatchMetric type represents a storage or request metric that 
is monitored by AWS CloudWatch and can be configured for a bucket
>  ** metricName, for example, “AllRequests”, “GetRequests”, 
TotalRequestLatency, BucketSizeBytes
>  ** scope: null if entire bucket; otherwise, the prefixes/tags that 
filter or limit the monitoring of the metric.
>  * AWSS3Bucket type represents a bucket in an S3 instance.  It supports:
>  ** Array of AWSS3PseudoDirectories that are associated with objects 
stored in the bucket 
>  ** AWS region
>  ** IsEncrypted (boolean) 
>  ** encryptionType, e.g., AES-256
>  ** S3AccessPolicy, a JSON object expressing access policies, eg 
GetObject, PutObject
>  ** time of creation
>  ** Array of AWSS3BucketLifeCycleRules that are associated with the 
bucket 
>  ** Array of AWSS3CloudWatchMetrics that are associated with the bucket 
or its tags or prefixes
>  ** Array of AWSTags that are associated with the bucket
>  * Generic dataset2Dataset process to represent movement of data from one 
dataset to another.  It supports:
>  ** array of transforms performed by the process 
>  ** map of tag/value pairs representing configurationParameters of the 
process
>  ** inputs and outputs are arrays of dataset objects, e.g., kafka topic 
and S3 pseudo-directory.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


> AWS S3 data lake typedefs for Atlas
> ---
>
> Key: ATLAS-2708
> URL: https://issues.apache.org/jira/browse/ATLAS-2708
> Project: Atlas
>  Issue Type: New Feature
>  Components:  atlas-core
>Reporter: Barbara Eckman
>Assignee: Barbara Eckman
>Priority: Critical
> Fix For: 1.1.0, 2.0.0
>
> Attachments: 3010-aws_model.json, ATLAS-2708-2.patch, 
> ATLAS-2708.patch, all_AWS_common_typedefs.json, 
> all_AWS_common_typedefs_v2.json, all_datalake_typedefs.json, 
> all_datalake_typedefs_v2.json
>

[jira] [Commented] (ATLAS-2708) AWS S3 data lake typedefs for Atlas

2018-06-19 Thread Don Bosco Durai (JIRA)


[ 
https://issues.apache.org/jira/browse/ATLAS-2708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16516825#comment-16516825
 ] 

Don Bosco Durai commented on ATLAS-2708:


{quote}[~bosco] [~madhan.neethiraj] Of course you can use my JSON for the demo!
{quote}
[~barbara] I was able to swap out my model with yours. Now it looks pretty good 
:) Thanks

> AWS S3 data lake typedefs for Atlas
> ---
>
> Key: ATLAS-2708
> URL: https://issues.apache.org/jira/browse/ATLAS-2708
> Project: Atlas
>  Issue Type: New Feature
>  Components:  atlas-core
>Reporter: Barbara Eckman
>Assignee: Barbara Eckman
>Priority: Critical
> Fix For: 1.1.0, 2.0.0
>
> Attachments: 3010-aws_model.json, ATLAS-2708-2.patch, 
> ATLAS-2708.patch, all_AWS_common_typedefs.json, 
> all_AWS_common_typedefs_v2.json, all_datalake_typedefs.json, 
> all_datalake_typedefs_v2.json
>
>
> Currently the base types in Atlas do not include AWS data lake objects. It 
> would be nice to add typedefs for AWS data lake objects (buckets and 
> pseudo-directories) and lineage processes that move the data from another 
> source (e.g., kafka topic) to the data lake.  For example:
>  * AWSS3PseudoDir type represents the pseudo-directory “prefix” of objects in 
> an S3 bucket.  For example, in the case of an object with key 
> “myWork/Development/Projects1.xls”, “myWork/Development” is the 
> pseudo-directory.  It supports:
>  ** Array of avro schemas that are associated with the data in the 
> pseudo-directory (based on Avro schema extensions outlined in ATLAS-2694)
>  ** what type of data it contains, e.g., avro, json, unstructured
>  ** time of creation
>  * AWSS3BucketLifeCycleRule type represents a rule specifying a transition of 
> the data in a bucket to a storageClass after a specific time interval, or 
> expiration.  For example, transition to GLACIER after 60 days, or expire 
> (i.e. be deleted) after 90 days:
>  ** ruleType (e.g., transition or expiration)
>  ** time interval in days before rule is executed  
>  ** storageClass to which the data is transitioned (null if ruleType is 
> expiration)
>  * AWSTag type represents a tag-value pair created by the user and associated 
> with an AWS object.
>  **  tag
>  ** value
>  * AWSCloudWatchMetric type represents a storage or request metric that is 
> monitored by AWS CloudWatch and can be configured for a bucket
>  ** metricName, for example, “AllRequests”, “GetRequests”, 
> TotalRequestLatency, BucketSizeBytes
>  ** scope: null if entire bucket; otherwise, the prefixes/tags that filter or 
> limit the monitoring of the metric.
>  * AWSS3Bucket type represents a bucket in an S3 instance.  It supports:
>  ** Array of AWSS3PseudoDirectories that are associated with objects stored 
> in the bucket 
>  ** AWS region
>  ** IsEncrypted (boolean) 
>  ** encryptionType, e.g., AES-256
>  ** S3AccessPolicy, a JSON object expressing access policies, eg GetObject, 
> PutObject
>  ** time of creation
>  ** Array of AWSS3BucketLifeCycleRules that are associated with the bucket 
>  ** Array of AWSS3CloudWatchMetrics that are associated with the bucket or 
> its tags or prefixes
>  ** Array of AWSTags that are associated with the bucket
>  * Generic dataset2Dataset process to represent movement of data from one 
> dataset to another.  It supports:
>  ** array of transforms performed by the process 
>  ** map of tag/value pairs representing configurationParameters of the process
>  ** inputs and outputs are arrays of dataset objects, e.g., kafka topic and 
> S3 pseudo-directory.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ATLAS-2708) AWS S3 data lake typedefs for Atlas

2018-06-17 Thread Madhan Neethiraj (JIRA)


[ 
https://issues.apache.org/jira/browse/ATLAS-2708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16515156#comment-16515156
 ] 

Madhan Neethiraj commented on ATLAS-2708:
-

[~barbara] - thanks for the updated model files. The changes look good. To be 
consistent with type-names used in rest of the models, I renamed the types to 
replace CamelCase with lower_case_underscore_seperation. Please review.

> AWS S3 data lake typedefs for Atlas
> ---
>
> Key: ATLAS-2708
> URL: https://issues.apache.org/jira/browse/ATLAS-2708
> Project: Atlas
>  Issue Type: New Feature
>  Components:  atlas-core
>Reporter: Barbara Eckman
>Assignee: Barbara Eckman
>Priority: Critical
> Fix For: 1.1.0
>
> Attachments: 3010-aws_model.json, ATLAS-2708.patch, 
> all_AWS_common_typedefs.json, all_AWS_common_typedefs_v2.json, 
> all_datalake_typedefs.json, all_datalake_typedefs_v2.json
>
>
> Currently the base types in Atlas do not include AWS data lake objects. It 
> would be nice to add typedefs for AWS data lake objects (buckets and 
> pseudo-directories) and lineage processes that move the data from another 
> source (e.g., kafka topic) to the data lake.  For example:
>  * AWSS3PseudoDir type represents the pseudo-directory “prefix” of objects in 
> an S3 bucket.  For example, in the case of an object with key 
> “myWork/Development/Projects1.xls”, “myWork/Development” is the 
> pseudo-directory.  It supports:
>  ** Array of avro schemas that are associated with the data in the 
> pseudo-directory (based on Avro schema extensions outlined in ATLAS-2694)
>  ** what type of data it contains, e.g., avro, json, unstructured
>  ** time of creation
>  * AWSS3BucketLifeCycleRule type represents a rule specifying a transition of 
> the data in a bucket to a storageClass after a specific time interval, or 
> expiration.  For example, transition to GLACIER after 60 days, or expire 
> (i.e. be deleted) after 90 days:
>  ** ruleType (e.g., transition or expiration)
>  ** time interval in days before rule is executed  
>  ** storageClass to which the data is transitioned (null if ruleType is 
> expiration)
>  * AWSTag type represents a tag-value pair created by the user and associated 
> with an AWS object.
>  **  tag
>  ** value
>  * AWSCloudWatchMetric type represents a storage or request metric that is 
> monitored by AWS CloudWatch and can be configured for a bucket
>  ** metricName, for example, “AllRequests”, “GetRequests”, 
> TotalRequestLatency, BucketSizeBytes
>  ** scope: null if entire bucket; otherwise, the prefixes/tags that filter or 
> limit the monitoring of the metric.
>  * AWSS3Bucket type represents a bucket in an S3 instance.  It supports:
>  ** Array of AWSS3PseudoDirectories that are associated with objects stored 
> in the bucket 
>  ** AWS region
>  ** IsEncrypted (boolean) 
>  ** encryptionType, e.g., AES-256
>  ** S3AccessPolicy, a JSON object expressing access policies, eg GetObject, 
> PutObject
>  ** time of creation
>  ** Array of AWSS3BucketLifeCycleRules that are associated with the bucket 
>  ** Array of AWSS3CloudWatchMetrics that are associated with the bucket or 
> its tags or prefixes
>  ** Array of AWSTags that are associated with the bucket
>  * Generic dataset2Dataset process to represent movement of data from one 
> dataset to another.  It supports:
>  ** array of transforms performed by the process 
>  ** map of tag/value pairs representing configurationParameters of the process
>  ** inputs and outputs are arrays of dataset objects, e.g., kafka topic and 
> S3 pseudo-directory.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ATLAS-2708) AWS S3 data lake typedefs for Atlas

2018-06-15 Thread Barbara Eckman (JIRA)


[ 
https://issues.apache.org/jira/browse/ATLAS-2708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16514229#comment-16514229
 ] 

Barbara Eckman commented on ATLAS-2708:
---

[~bosco] 

 bq. You have S3AccessPolicy in AWSS3Bucket as string. In S3, Bucket Policy is 
a list of Statement Structure. If we are not using it now, we should probably 
remove it and add it when we need to. Or we can create a placeholder 
S3BucketPolicy entity and associate that with AWSS3Bucket

> AWS S3 data lake typedefs for Atlas
> ---
>
> Key: ATLAS-2708
> URL: https://issues.apache.org/jira/browse/ATLAS-2708
> Project: Atlas
>  Issue Type: New Feature
>  Components:  atlas-core
>Reporter: Barbara Eckman
>Assignee: Barbara Eckman
>Priority: Critical
> Attachments: 3010-aws_model.json, all_AWS_common_typedefs.json, 
> all_datalake_typedefs.json
>
>
> Currently the base types in Atlas do not include AWS data lake objects. It 
> would be nice to add typedefs for AWS data lake objects (buckets and 
> pseudo-directories) and lineage processes that move the data from another 
> source (e.g., kafka topic) to the data lake.  For example:
>  * AWSS3PseudoDir type represents the pseudo-directory “prefix” of objects in 
> an S3 bucket.  For example, in the case of an object with key 
> “myWork/Development/Projects1.xls”, “myWork/Development” is the 
> pseudo-directory.  It supports:
>  ** Array of avro schemas that are associated with the data in the 
> pseudo-directory (based on Avro schema extensions outlined in ATLAS-2694)
>  ** what type of data it contains, e.g., avro, json, unstructured
>  ** time of creation
>  * AWSS3BucketLifeCycleRule type represents a rule specifying a transition of 
> the data in a bucket to a storageClass after a specific time interval, or 
> expiration.  For example, transition to GLACIER after 60 days, or expire 
> (i.e. be deleted) after 90 days:
>  ** ruleType (e.g., transition or expiration)
>  ** time interval in days before rule is executed  
>  ** storageClass to which the data is transitioned (null if ruleType is 
> expiration)
>  * AWSTag type represents a tag-value pair created by the user and associated 
> with an AWS object.
>  **  tag
>  ** value
>  * AWSCloudWatchMetric type represents a storage or request metric that is 
> monitored by AWS CloudWatch and can be configured for a bucket
>  ** metricName, for example, “AllRequests”, “GetRequests”, 
> TotalRequestLatency, BucketSizeBytes
>  ** scope: null if entire bucket; otherwise, the prefixes/tags that filter or 
> limit the monitoring of the metric.
>  * AWSS3Bucket type represents a bucket in an S3 instance.  It supports:
>  ** Array of AWSS3PseudoDirectories that are associated with objects stored 
> in the bucket 
>  ** AWS region
>  ** IsEncrypted (boolean) 
>  ** encryptionType, e.g., AES-256
>  ** S3AccessPolicy, a JSON object expressing access policies, eg GetObject, 
> PutObject
>  ** time of creation
>  ** Array of AWSS3BucketLifeCycleRules that are associated with the bucket 
>  ** Array of AWSS3CloudWatchMetrics that are associated with the bucket or 
> its tags or prefixes
>  ** Array of AWSTags that are associated with the bucket
>  * Generic dataset2Dataset process to represent movement of data from one 
> dataset to another.  It supports:
>  ** array of transforms performed by the process 
>  ** map of tag/value pairs representing configurationParameters of the process
>  ** inputs and outputs are arrays of dataset objects, e.g., kafka topic and 
> S3 pseudo-directory.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ATLAS-2708) AWS S3 data lake typedefs for Atlas

2018-06-15 Thread Barbara Eckman (JIRA)


[ 
https://issues.apache.org/jira/browse/ATLAS-2708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16514222#comment-16514222
 ] 

Barbara Eckman commented on ATLAS-2708:
---

[~madhan.neethiraj] I agree with all your changes except one: 
 - is avroSchema applicable for AWSS3PseudoDir?  
 ** Yes, for those that don't model objects and stop at pseudo (like us).  We 
recommend that there be a 1:1 relationship between pseudo and avro_schema, but 
we can't enforce that, so it's an array of schemas in the pseudo type.  I have 
changed it to a single schema in the object type as you suggest.

I am breaking your file into two jsons, one datalake-specific and one general 
AWS, as discussed with [~bosco]. 

Nice to see an example of relationshipDefs.  Can you please point me to a place 
where the semantics of these attributes is documented?  I'd like to start using 
them.

 

> AWS S3 data lake typedefs for Atlas
> ---
>
> Key: ATLAS-2708
> URL: https://issues.apache.org/jira/browse/ATLAS-2708
> Project: Atlas
>  Issue Type: New Feature
>  Components:  atlas-core
>Reporter: Barbara Eckman
>Assignee: Barbara Eckman
>Priority: Critical
> Attachments: 3010-aws_model.json, all_AWS_common_typedefs.json, 
> all_datalake_typedefs.json
>
>
> Currently the base types in Atlas do not include AWS data lake objects. It 
> would be nice to add typedefs for AWS data lake objects (buckets and 
> pseudo-directories) and lineage processes that move the data from another 
> source (e.g., kafka topic) to the data lake.  For example:
>  * AWSS3PseudoDir type represents the pseudo-directory “prefix” of objects in 
> an S3 bucket.  For example, in the case of an object with key 
> “myWork/Development/Projects1.xls”, “myWork/Development” is the 
> pseudo-directory.  It supports:
>  ** Array of avro schemas that are associated with the data in the 
> pseudo-directory (based on Avro schema extensions outlined in ATLAS-2694)
>  ** what type of data it contains, e.g., avro, json, unstructured
>  ** time of creation
>  * AWSS3BucketLifeCycleRule type represents a rule specifying a transition of 
> the data in a bucket to a storageClass after a specific time interval, or 
> expiration.  For example, transition to GLACIER after 60 days, or expire 
> (i.e. be deleted) after 90 days:
>  ** ruleType (e.g., transition or expiration)
>  ** time interval in days before rule is executed  
>  ** storageClass to which the data is transitioned (null if ruleType is 
> expiration)
>  * AWSTag type represents a tag-value pair created by the user and associated 
> with an AWS object.
>  **  tag
>  ** value
>  * AWSCloudWatchMetric type represents a storage or request metric that is 
> monitored by AWS CloudWatch and can be configured for a bucket
>  ** metricName, for example, “AllRequests”, “GetRequests”, 
> TotalRequestLatency, BucketSizeBytes
>  ** scope: null if entire bucket; otherwise, the prefixes/tags that filter or 
> limit the monitoring of the metric.
>  * AWSS3Bucket type represents a bucket in an S3 instance.  It supports:
>  ** Array of AWSS3PseudoDirectories that are associated with objects stored 
> in the bucket 
>  ** AWS region
>  ** IsEncrypted (boolean) 
>  ** encryptionType, e.g., AES-256
>  ** S3AccessPolicy, a JSON object expressing access policies, eg GetObject, 
> PutObject
>  ** time of creation
>  ** Array of AWSS3BucketLifeCycleRules that are associated with the bucket 
>  ** Array of AWSS3CloudWatchMetrics that are associated with the bucket or 
> its tags or prefixes
>  ** Array of AWSTags that are associated with the bucket
>  * Generic dataset2Dataset process to represent movement of data from one 
> dataset to another.  It supports:
>  ** array of transforms performed by the process 
>  ** map of tag/value pairs representing configurationParameters of the process
>  ** inputs and outputs are arrays of dataset objects, e.g., kafka topic and 
> S3 pseudo-directory.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ATLAS-2708) AWS S3 data lake typedefs for Atlas

2018-06-15 Thread Barbara Eckman (JIRA)


[ 
https://issues.apache.org/jira/browse/ATLAS-2708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16514168#comment-16514168
 ] 

Barbara Eckman commented on ATLAS-2708:
---

[~bosco] [~madhan.neethiraj] Of course you can use my JSON for the demo!!

> AWS S3 data lake typedefs for Atlas
> ---
>
> Key: ATLAS-2708
> URL: https://issues.apache.org/jira/browse/ATLAS-2708
> Project: Atlas
>  Issue Type: New Feature
>  Components:  atlas-core
>Reporter: Barbara Eckman
>Assignee: Barbara Eckman
>Priority: Critical
> Attachments: 3010-aws_model.json, all_AWS_common_typedefs.json, 
> all_datalake_typedefs.json
>
>
> Currently the base types in Atlas do not include AWS data lake objects. It 
> would be nice to add typedefs for AWS data lake objects (buckets and 
> pseudo-directories) and lineage processes that move the data from another 
> source (e.g., kafka topic) to the data lake.  For example:
>  * AWSS3PseudoDir type represents the pseudo-directory “prefix” of objects in 
> an S3 bucket.  For example, in the case of an object with key 
> “myWork/Development/Projects1.xls”, “myWork/Development” is the 
> pseudo-directory.  It supports:
>  ** Array of avro schemas that are associated with the data in the 
> pseudo-directory (based on Avro schema extensions outlined in ATLAS-2694)
>  ** what type of data it contains, e.g., avro, json, unstructured
>  ** time of creation
>  * AWSS3BucketLifeCycleRule type represents a rule specifying a transition of 
> the data in a bucket to a storageClass after a specific time interval, or 
> expiration.  For example, transition to GLACIER after 60 days, or expire 
> (i.e. be deleted) after 90 days:
>  ** ruleType (e.g., transition or expiration)
>  ** time interval in days before rule is executed  
>  ** storageClass to which the data is transitioned (null if ruleType is 
> expiration)
>  * AWSTag type represents a tag-value pair created by the user and associated 
> with an AWS object.
>  **  tag
>  ** value
>  * AWSCloudWatchMetric type represents a storage or request metric that is 
> monitored by AWS CloudWatch and can be configured for a bucket
>  ** metricName, for example, “AllRequests”, “GetRequests”, 
> TotalRequestLatency, BucketSizeBytes
>  ** scope: null if entire bucket; otherwise, the prefixes/tags that filter or 
> limit the monitoring of the metric.
>  * AWSS3Bucket type represents a bucket in an S3 instance.  It supports:
>  ** Array of AWSS3PseudoDirectories that are associated with objects stored 
> in the bucket 
>  ** AWS region
>  ** IsEncrypted (boolean) 
>  ** encryptionType, e.g., AES-256
>  ** S3AccessPolicy, a JSON object expressing access policies, eg GetObject, 
> PutObject
>  ** time of creation
>  ** Array of AWSS3BucketLifeCycleRules that are associated with the bucket 
>  ** Array of AWSS3CloudWatchMetrics that are associated with the bucket or 
> its tags or prefixes
>  ** Array of AWSTags that are associated with the bucket
>  * Generic dataset2Dataset process to represent movement of data from one 
> dataset to another.  It supports:
>  ** array of transforms performed by the process 
>  ** map of tag/value pairs representing configurationParameters of the process
>  ** inputs and outputs are arrays of dataset objects, e.g., kafka topic and 
> S3 pseudo-directory.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ATLAS-2708) AWS S3 data lake typedefs for Atlas

2018-06-15 Thread Don Bosco Durai (JIRA)


[ 
https://issues.apache.org/jira/browse/ATLAS-2708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16513593#comment-16513593
 ] 

Don Bosco Durai commented on ATLAS-2708:


{quote}I updated the model files for above comments, except the last one, and 
uploaded in this JIRA - [^3010-aws_model.json]. Please review.
{quote}
[~madhan.neethiraj] , seems our comments crossed path at the same time.

> AWS S3 data lake typedefs for Atlas
> ---
>
> Key: ATLAS-2708
> URL: https://issues.apache.org/jira/browse/ATLAS-2708
> Project: Atlas
>  Issue Type: New Feature
>  Components:  atlas-core
>Reporter: Barbara Eckman
>Assignee: Barbara Eckman
>Priority: Critical
> Attachments: 3010-aws_model.json, all_AWS_common_typedefs.json, 
> all_datalake_typedefs.json
>
>
> Currently the base types in Atlas do not include AWS data lake objects. It 
> would be nice to add typedefs for AWS data lake objects (buckets and 
> pseudo-directories) and lineage processes that move the data from another 
> source (e.g., kafka topic) to the data lake.  For example:
>  * AWSS3PseudoDir type represents the pseudo-directory “prefix” of objects in 
> an S3 bucket.  For example, in the case of an object with key 
> “myWork/Development/Projects1.xls”, “myWork/Development” is the 
> pseudo-directory.  It supports:
>  ** Array of avro schemas that are associated with the data in the 
> pseudo-directory (based on Avro schema extensions outlined in ATLAS-2694)
>  ** what type of data it contains, e.g., avro, json, unstructured
>  ** time of creation
>  * AWSS3BucketLifeCycleRule type represents a rule specifying a transition of 
> the data in a bucket to a storageClass after a specific time interval, or 
> expiration.  For example, transition to GLACIER after 60 days, or expire 
> (i.e. be deleted) after 90 days:
>  ** ruleType (e.g., transition or expiration)
>  ** time interval in days before rule is executed  
>  ** storageClass to which the data is transitioned (null if ruleType is 
> expiration)
>  * AWSTag type represents a tag-value pair created by the user and associated 
> with an AWS object.
>  **  tag
>  ** value
>  * AWSCloudWatchMetric type represents a storage or request metric that is 
> monitored by AWS CloudWatch and can be configured for a bucket
>  ** metricName, for example, “AllRequests”, “GetRequests”, 
> TotalRequestLatency, BucketSizeBytes
>  ** scope: null if entire bucket; otherwise, the prefixes/tags that filter or 
> limit the monitoring of the metric.
>  * AWSS3Bucket type represents a bucket in an S3 instance.  It supports:
>  ** Array of AWSS3PseudoDirectories that are associated with objects stored 
> in the bucket 
>  ** AWS region
>  ** IsEncrypted (boolean) 
>  ** encryptionType, e.g., AES-256
>  ** S3AccessPolicy, a JSON object expressing access policies, eg GetObject, 
> PutObject
>  ** time of creation
>  ** Array of AWSS3BucketLifeCycleRules that are associated with the bucket 
>  ** Array of AWSS3CloudWatchMetrics that are associated with the bucket or 
> its tags or prefixes
>  ** Array of AWSTags that are associated with the bucket
>  * Generic dataset2Dataset process to represent movement of data from one 
> dataset to another.  It supports:
>  ** array of transforms performed by the process 
>  ** map of tag/value pairs representing configurationParameters of the process
>  ** inputs and outputs are arrays of dataset objects, e.g., kafka topic and 
> S3 pseudo-directory.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ATLAS-2708) AWS S3 data lake typedefs for Atlas

2018-06-15 Thread Don Bosco Durai (JIRA)


[ 
https://issues.apache.org/jira/browse/ATLAS-2708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16513580#comment-16513580
 ] 

Don Bosco Durai commented on ATLAS-2708:


[~barbara] looks good. Having a common JSON for AWS makes sense.

Sorry, two more feedback :)
 # We should add AWSTags to S3Object also
 # You have S3AccessPolicy in AWSS3Bucket as string. In S3, Bucket Policy is a 
list of Statement Structure. If we are not using it now, we should probably 
remove it and add it when we need to. Or we can create a placeholder 
S3BucketPolicy entity and associate that with AWSS3Bucket

 

[~madhan.neethiraj] and I are giving a talk next week in San Jose at DataWorks 
conference 
[https://dataworkssummit.com/san-jose-2018/session/securing-data-in-hybrid-environments-using-apache-ranger/].
 Are you okay if we can use your JSON for the demo?

Thanks

 

> AWS S3 data lake typedefs for Atlas
> ---
>
> Key: ATLAS-2708
> URL: https://issues.apache.org/jira/browse/ATLAS-2708
> Project: Atlas
>  Issue Type: New Feature
>  Components:  atlas-core
>Reporter: Barbara Eckman
>Assignee: Barbara Eckman
>Priority: Critical
> Attachments: 3010-aws_model.json, all_AWS_common_typedefs.json, 
> all_datalake_typedefs.json
>
>
> Currently the base types in Atlas do not include AWS data lake objects. It 
> would be nice to add typedefs for AWS data lake objects (buckets and 
> pseudo-directories) and lineage processes that move the data from another 
> source (e.g., kafka topic) to the data lake.  For example:
>  * AWSS3PseudoDir type represents the pseudo-directory “prefix” of objects in 
> an S3 bucket.  For example, in the case of an object with key 
> “myWork/Development/Projects1.xls”, “myWork/Development” is the 
> pseudo-directory.  It supports:
>  ** Array of avro schemas that are associated with the data in the 
> pseudo-directory (based on Avro schema extensions outlined in ATLAS-2694)
>  ** what type of data it contains, e.g., avro, json, unstructured
>  ** time of creation
>  * AWSS3BucketLifeCycleRule type represents a rule specifying a transition of 
> the data in a bucket to a storageClass after a specific time interval, or 
> expiration.  For example, transition to GLACIER after 60 days, or expire 
> (i.e. be deleted) after 90 days:
>  ** ruleType (e.g., transition or expiration)
>  ** time interval in days before rule is executed  
>  ** storageClass to which the data is transitioned (null if ruleType is 
> expiration)
>  * AWSTag type represents a tag-value pair created by the user and associated 
> with an AWS object.
>  **  tag
>  ** value
>  * AWSCloudWatchMetric type represents a storage or request metric that is 
> monitored by AWS CloudWatch and can be configured for a bucket
>  ** metricName, for example, “AllRequests”, “GetRequests”, 
> TotalRequestLatency, BucketSizeBytes
>  ** scope: null if entire bucket; otherwise, the prefixes/tags that filter or 
> limit the monitoring of the metric.
>  * AWSS3Bucket type represents a bucket in an S3 instance.  It supports:
>  ** Array of AWSS3PseudoDirectories that are associated with objects stored 
> in the bucket 
>  ** AWS region
>  ** IsEncrypted (boolean) 
>  ** encryptionType, e.g., AES-256
>  ** S3AccessPolicy, a JSON object expressing access policies, eg GetObject, 
> PutObject
>  ** time of creation
>  ** Array of AWSS3BucketLifeCycleRules that are associated with the bucket 
>  ** Array of AWSS3CloudWatchMetrics that are associated with the bucket or 
> its tags or prefixes
>  ** Array of AWSTags that are associated with the bucket
>  * Generic dataset2Dataset process to represent movement of data from one 
> dataset to another.  It supports:
>  ** array of transforms performed by the process 
>  ** map of tag/value pairs representing configurationParameters of the process
>  ** inputs and outputs are arrays of dataset objects, e.g., kafka topic and 
> S3 pseudo-directory.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ATLAS-2708) AWS S3 data lake typedefs for Atlas

2018-06-15 Thread Madhan Neethiraj (JIRA)


[ 
https://issues.apache.org/jira/browse/ATLAS-2708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16513574#comment-16513574
 ] 

Madhan Neethiraj commented on ATLAS-2708:
-

[~barbara] - thanks for AWS/S3 model types. Here are my comments:

- looks like following types are suitable to be modeled as a struct, instead of 
entity - given each instance will be contained within an instance of 
AWSS3Bucket; and they don't need their own separate identity outside of its 
container.
-- AWSTag
-- AWSCloudWatchMetric
-- AWSS3BucketLifeCycleRule
- is avroSchema applicable for AWSS3PseudoDir?
- consider renaming attribute AWSTag.tag to AWSTag.key - to be in sync with 
names used in 
https://docs.aws.amazon.com/AmazonS3/latest/dev/object-tagging.html
- I would suggest using attribute names that begin with a lower case letter, to 
be consistent with rest of types. AWSS3Bucket.S3AccessPolicy, 
AWSS3Bucket.AWSTags
- AWSS3Object has an array of avro_schema associated with. Wouldn't a single 
avro_schema be enough?

I updated the model files for above comments, except the last one, and uploaded 
in this JIRA - 3010-aws_model.json. Please review.

> AWS S3 data lake typedefs for Atlas
> ---
>
> Key: ATLAS-2708
> URL: https://issues.apache.org/jira/browse/ATLAS-2708
> Project: Atlas
>  Issue Type: New Feature
>  Components:  atlas-core
>Reporter: Barbara Eckman
>Assignee: Barbara Eckman
>Priority: Critical
> Attachments: all_AWS_common_typedefs.json, all_datalake_typedefs.json
>
>
> Currently the base types in Atlas do not include AWS data lake objects. It 
> would be nice to add typedefs for AWS data lake objects (buckets and 
> pseudo-directories) and lineage processes that move the data from another 
> source (e.g., kafka topic) to the data lake.  For example:
>  * AWSS3PseudoDir type represents the pseudo-directory “prefix” of objects in 
> an S3 bucket.  For example, in the case of an object with key 
> “myWork/Development/Projects1.xls”, “myWork/Development” is the 
> pseudo-directory.  It supports:
>  ** Array of avro schemas that are associated with the data in the 
> pseudo-directory (based on Avro schema extensions outlined in ATLAS-2694)
>  ** what type of data it contains, e.g., avro, json, unstructured
>  ** time of creation
>  * AWSS3BucketLifeCycleRule type represents a rule specifying a transition of 
> the data in a bucket to a storageClass after a specific time interval, or 
> expiration.  For example, transition to GLACIER after 60 days, or expire 
> (i.e. be deleted) after 90 days:
>  ** ruleType (e.g., transition or expiration)
>  ** time interval in days before rule is executed  
>  ** storageClass to which the data is transitioned (null if ruleType is 
> expiration)
>  * AWSTag type represents a tag-value pair created by the user and associated 
> with an AWS object.
>  **  tag
>  ** value
>  * AWSCloudWatchMetric type represents a storage or request metric that is 
> monitored by AWS CloudWatch and can be configured for a bucket
>  ** metricName, for example, “AllRequests”, “GetRequests”, 
> TotalRequestLatency, BucketSizeBytes
>  ** scope: null if entire bucket; otherwise, the prefixes/tags that filter or 
> limit the monitoring of the metric.
>  * AWSS3Bucket type represents a bucket in an S3 instance.  It supports:
>  ** Array of AWSS3PseudoDirectories that are associated with objects stored 
> in the bucket 
>  ** AWS region
>  ** IsEncrypted (boolean) 
>  ** encryptionType, e.g., AES-256
>  ** S3AccessPolicy, a JSON object expressing access policies, eg GetObject, 
> PutObject
>  ** time of creation
>  ** Array of AWSS3BucketLifeCycleRules that are associated with the bucket 
>  ** Array of AWSS3CloudWatchMetrics that are associated with the bucket or 
> its tags or prefixes
>  ** Array of AWSTags that are associated with the bucket
>  * Generic dataset2Dataset process to represent movement of data from one 
> dataset to another.  It supports:
>  ** array of transforms performed by the process 
>  ** map of tag/value pairs representing configurationParameters of the process
>  ** inputs and outputs are arrays of dataset objects, e.g., kafka topic and 
> S3 pseudo-directory.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ATLAS-2708) AWS S3 data lake typedefs for Atlas

2018-06-14 Thread Barbara Eckman (JIRA)


[ 
https://issues.apache.org/jira/browse/ATLAS-2708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16512944#comment-16512944
 ] 

Barbara Eckman commented on ATLAS-2708:
---

[~bosco]  good point, I should have a backpointer from AWSS3Pseudo to 
AWSS3Bucket.  I will add that when I add the AWSS3Object typedef (with 
backpointer to AWSS3Pseudo).  And I take your preference to be two jsons (one 
for S3-specific and one for more general AWS entities).  I will do that too. 

> AWS S3 data lake typedefs for Atlas
> ---
>
> Key: ATLAS-2708
> URL: https://issues.apache.org/jira/browse/ATLAS-2708
> Project: Atlas
>  Issue Type: New Feature
>  Components:  atlas-core
>Reporter: Barbara Eckman
>Assignee: Barbara Eckman
>Priority: Critical
> Attachments: all_datalake_typedefs.json
>
>
> Currently the base types in Atlas do not include AWS data lake objects. It 
> would be nice to add typedefs for AWS data lake objects (buckets and 
> pseudo-directories) and lineage processes that move the data from another 
> source (e.g., kafka topic) to the data lake.  For example:
>  * AWSS3PseudoDir type represents the pseudo-directory “prefix” of objects in 
> an S3 bucket.  For example, in the case of an object with key 
> “myWork/Development/Projects1.xls”, “myWork/Development” is the 
> pseudo-directory.  It supports:
>  ** Array of avro schemas that are associated with the data in the 
> pseudo-directory (based on Avro schema extensions outlined in ATLAS-2694)
>  ** what type of data it contains, e.g., avro, json, unstructured
>  ** time of creation
>  * AWSS3BucketLifeCycleRule type represents a rule specifying a transition of 
> the data in a bucket to a storageClass after a specific time interval, or 
> expiration.  For example, transition to GLACIER after 60 days, or expire 
> (i.e. be deleted) after 90 days:
>  ** ruleType (e.g., transition or expiration)
>  ** time interval in days before rule is executed  
>  ** storageClass to which the data is transitioned (null if ruleType is 
> expiration)
>  * AWSTag type represents a tag-value pair created by the user and associated 
> with an AWS object.
>  **  tag
>  ** value
>  * AWSCloudWatchMetric type represents a storage or request metric that is 
> monitored by AWS CloudWatch and can be configured for a bucket
>  ** metricName, for example, “AllRequests”, “GetRequests”, 
> TotalRequestLatency, BucketSizeBytes
>  ** scope: null if entire bucket; otherwise, the prefixes/tags that filter or 
> limit the monitoring of the metric.
>  * AWSS3Bucket type represents a bucket in an S3 instance.  It supports:
>  ** Array of AWSS3PseudoDirectories that are associated with objects stored 
> in the bucket 
>  ** AWS region
>  ** IsEncrypted (boolean) 
>  ** encryptionType, e.g., AES-256
>  ** S3AccessPolicy, a JSON object expressing access policies, eg GetObject, 
> PutObject
>  ** time of creation
>  ** Array of AWSS3BucketLifeCycleRules that are associated with the bucket 
>  ** Array of AWSS3CloudWatchMetrics that are associated with the bucket or 
> its tags or prefixes
>  ** Array of AWSTags that are associated with the bucket
>  * Generic dataset2Dataset process to represent movement of data from one 
> dataset to another.  It supports:
>  ** array of transforms performed by the process 
>  ** map of tag/value pairs representing configurationParameters of the process
>  ** inputs and outputs are arrays of dataset objects, e.g., kafka topic and 
> S3 pseudo-directory.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ATLAS-2708) AWS S3 data lake typedefs for Atlas

2018-06-14 Thread Don Bosco Durai (JIRA)


[ 
https://issues.apache.org/jira/browse/ATLAS-2708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16512845#comment-16512845
 ] 

Don Bosco Durai commented on ATLAS-2708:


[~barbara], my comments are inline
{quote}bq. 1.  We actually didn't model each object (file), because we don't 
plan to store metadata on each object.  But I can whip up an object typedef for 
you.  
{quote}
Thanks for the clarification. After giving some more thought, it felt, the S3 
hierarchy could be similar to database model. E.g. hive_db -> hive_table -> 
hive_column. Out here, it would be S3Bucket ->Pseudodir -> S3Object. Because at 
each level there could be different attributes and properties, so we should be 
able to capture it. Currently, in your current model, it seems we have lists of 
AWSS3PseudoDir within AWSS3Bucket, whereas in hive_model.json, hive_table has 
reference to hive_db (however, hive_table has array of hive_columns also).

In your case, the S3Object could be optional.
{quote}I could include an attribute of type Schema rather than 
avro_schema,associated with it.
{quote}
Thanks for pointing out. I didn't realize that AVRO was first class attribute. 
Yes, it makes sense to have it more generic. 
{quote}Any other attributes you can think of?
{quote}
I can think of compression type for now. If it is CSV, then it might need 
delimiter also. But should these be dynamic attributes? I would let the others 
comment on the best practice for modeling.

 
{quote}bq. 2. Do you mean there'd be a separate Jira ticket for AWS common 
Entities like Tags, Permissions, CloudWatchMetrics, etc, instead of lumping 
them together with the S3 entities?  Then AWSS3Bucket and AWSDynamoDB could 
each have attributes of type AWSTag.  In terms of CloudWatchMetrics, for 
Dynamo, are metric name and scope sufficient?  I'd have to check.  
{quote}
Yes, you are right. In some cases, it could be an array of entities (e.g. 
AWSTag) and in some cases, it could be just extending from a base class like 
fs_path and hdfs_path. Regardless it would be good to abstract them out.

For now, it could be in the same json and part this JIRA.

 

 

> AWS S3 data lake typedefs for Atlas
> ---
>
> Key: ATLAS-2708
> URL: https://issues.apache.org/jira/browse/ATLAS-2708
> Project: Atlas
>  Issue Type: New Feature
>  Components:  atlas-core
>Reporter: Barbara Eckman
>Assignee: Barbara Eckman
>Priority: Critical
> Attachments: all_datalake_typedefs.json
>
>
> Currently the base types in Atlas do not include AWS data lake objects. It 
> would be nice to add typedefs for AWS data lake objects (buckets and 
> pseudo-directories) and lineage processes that move the data from another 
> source (e.g., kafka topic) to the data lake.  For example:
>  * AWSS3PseudoDir type represents the pseudo-directory “prefix” of objects in 
> an S3 bucket.  For example, in the case of an object with key 
> “myWork/Development/Projects1.xls”, “myWork/Development” is the 
> pseudo-directory.  It supports:
>  ** Array of avro schemas that are associated with the data in the 
> pseudo-directory (based on Avro schema extensions outlined in ATLAS-2694)
>  ** what type of data it contains, e.g., avro, json, unstructured
>  ** time of creation
>  * AWSS3BucketLifeCycleRule type represents a rule specifying a transition of 
> the data in a bucket to a storageClass after a specific time interval, or 
> expiration.  For example, transition to GLACIER after 60 days, or expire 
> (i.e. be deleted) after 90 days:
>  ** ruleType (e.g., transition or expiration)
>  ** time interval in days before rule is executed  
>  ** storageClass to which the data is transitioned (null if ruleType is 
> expiration)
>  * AWSTag type represents a tag-value pair created by the user and associated 
> with an AWS object.
>  **  tag
>  ** value
>  * AWSCloudWatchMetric type represents a storage or request metric that is 
> monitored by AWS CloudWatch and can be configured for a bucket
>  ** metricName, for example, “AllRequests”, “GetRequests”, 
> TotalRequestLatency, BucketSizeBytes
>  ** scope: null if entire bucket; otherwise, the prefixes/tags that filter or 
> limit the monitoring of the metric.
>  * AWSS3Bucket type represents a bucket in an S3 instance.  It supports:
>  ** Array of AWSS3PseudoDirectories that are associated with objects stored 
> in the bucket 
>  ** AWS region
>  ** IsEncrypted (boolean) 
>  ** encryptionType, e.g., AES-256
>  ** S3AccessPolicy, a JSON object expressing access policies, eg GetObject, 
> PutObject
>  ** time of creation
>  ** Array of AWSS3BucketLifeCycleRules that are associated with the bucket 
>  ** Array of AWSS3CloudWatchMetrics that are associated with the bucket or 
> its tags or prefixes
>  ** Array of AWSTags that are associated with the bucket
>  * Generic 

[jira] [Commented] (ATLAS-2708) AWS S3 data lake typedefs for Atlas

2018-06-13 Thread Barbara Eckman (JIRA)


[ 
https://issues.apache.org/jira/browse/ATLAS-2708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511218#comment-16511218
 ] 

Barbara Eckman commented on ATLAS-2708:
---

OK... can you point me to documentation or examples of typedefs using this 
other style?  I found the box and arrow diagrams but no examples. 

> AWS S3 data lake typedefs for Atlas
> ---
>
> Key: ATLAS-2708
> URL: https://issues.apache.org/jira/browse/ATLAS-2708
> Project: Atlas
>  Issue Type: New Feature
>  Components:  atlas-core
>Reporter: Barbara Eckman
>Assignee: Barbara Eckman
>Priority: Critical
> Attachments: all_datalake_typedefs.json
>
>
> Currently the base types in Atlas do not include AWS data lake objects. It 
> would be nice to add typedefs for AWS data lake objects (buckets and 
> pseudo-directories) and lineage processes that move the data from another 
> source (e.g., kafka topic) to the data lake.  For example:
>  * AWSS3PseudoDir type represents the pseudo-directory “prefix” of objects in 
> an S3 bucket.  For example, in the case of an object with key 
> “myWork/Development/Projects1.xls”, “myWork/Development” is the 
> pseudo-directory.  It supports:
>  ** Array of avro schemas that are associated with the data in the 
> pseudo-directory (based on Avro schema extensions outlined in ATLAS-2694)
>  ** what type of data it contains, e.g., avro, json, unstructured
>  ** time of creation
>  * AWSS3BucketLifeCycleRule type represents a rule specifying a transition of 
> the data in a bucket to a storageClass after a specific time interval, or 
> expiration.  For example, transition to GLACIER after 60 days, or expire 
> (i.e. be deleted) after 90 days:
>  ** ruleType (e.g., transition or expiration)
>  ** time interval in days before rule is executed  
>  ** storageClass to which the data is transitioned (null if ruleType is 
> expiration)
>  * AWSTag type represents a tag-value pair created by the user and associated 
> with an AWS object.
>  **  tag
>  ** value
>  * AWSCloudWatchMetric type represents a storage or request metric that is 
> monitored by AWS CloudWatch and can be configured for a bucket
>  ** metricName, for example, “AllRequests”, “GetRequests”, 
> TotalRequestLatency, BucketSizeBytes
>  ** scope: null if entire bucket; otherwise, the prefixes/tags that filter or 
> limit the monitoring of the metric.
>  * AWSS3Bucket type represents a bucket in an S3 instance.  It supports:
>  ** Array of AWSS3PseudoDirectories that are associated with objects stored 
> in the bucket 
>  ** AWS region
>  ** IsEncrypted (boolean) 
>  ** encryptionType, e.g., AES-256
>  ** S3AccessPolicy, a JSON object expressing access policies, eg GetObject, 
> PutObject
>  ** time of creation
>  ** Array of AWSS3BucketLifeCycleRules that are associated with the bucket 
>  ** Array of AWSS3CloudWatchMetrics that are associated with the bucket or 
> its tags or prefixes
>  ** Array of AWSTags that are associated with the bucket
>  * Generic dataset2Dataset process to represent movement of data from one 
> dataset to another.  It supports:
>  ** array of transforms performed by the process 
>  ** map of tag/value pairs representing configurationParameters of the process
>  ** inputs and outputs are arrays of dataset objects, e.g., kafka topic and 
> S3 pseudo-directory.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ATLAS-2708) AWS S3 data lake typedefs for Atlas

2018-06-13 Thread Barbara Eckman (JIRA)


[ 
https://issues.apache.org/jira/browse/ATLAS-2708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511216#comment-16511216
 ] 

Barbara Eckman commented on ATLAS-2708:
---

[~bosco], Great, I'm glad they will be useful.  

1.  We actually didn't model each object (file), because we don't plan to store 
metadata on each object.  But I can whip up an object typedef for you.  I'd 
imagine attributes like:  pseudodir it's in, compression format, creation time. 
 For us they'd be .avro files, so I'd include an avro schema.  For you and 
others:  would you have a .csv "schema" associated with the object?  Or a JSON 
schema? I could include an attribute of type Schema rather than avro_schema, so 
any schema can be associated with it.  Any other attributes you can think of?

2. Do you mean there'd be a separate Jira ticket for AWS common Entities like 
Tags, Permissions, CloudWatchMetrics, etc, instead of lumping them together 
with the S3 entities?  Then AWSS3Bucket and AWSDynamoDB could each have 
attributes of type AWSTag.  In terms of CloudWatchMetrics, for Dynamo, are 
metric name and scope sufficient?  I'd have to check.  

> AWS S3 data lake typedefs for Atlas
> ---
>
> Key: ATLAS-2708
> URL: https://issues.apache.org/jira/browse/ATLAS-2708
> Project: Atlas
>  Issue Type: New Feature
>  Components:  atlas-core
>Reporter: Barbara Eckman
>Assignee: Barbara Eckman
>Priority: Critical
> Attachments: all_datalake_typedefs.json
>
>
> Currently the base types in Atlas do not include AWS data lake objects. It 
> would be nice to add typedefs for AWS data lake objects (buckets and 
> pseudo-directories) and lineage processes that move the data from another 
> source (e.g., kafka topic) to the data lake.  For example:
>  * AWSS3PseudoDir type represents the pseudo-directory “prefix” of objects in 
> an S3 bucket.  For example, in the case of an object with key 
> “myWork/Development/Projects1.xls”, “myWork/Development” is the 
> pseudo-directory.  It supports:
>  ** Array of avro schemas that are associated with the data in the 
> pseudo-directory (based on Avro schema extensions outlined in ATLAS-2694)
>  ** what type of data it contains, e.g., avro, json, unstructured
>  ** time of creation
>  * AWSS3BucketLifeCycleRule type represents a rule specifying a transition of 
> the data in a bucket to a storageClass after a specific time interval, or 
> expiration.  For example, transition to GLACIER after 60 days, or expire 
> (i.e. be deleted) after 90 days:
>  ** ruleType (e.g., transition or expiration)
>  ** time interval in days before rule is executed  
>  ** storageClass to which the data is transitioned (null if ruleType is 
> expiration)
>  * AWSTag type represents a tag-value pair created by the user and associated 
> with an AWS object.
>  **  tag
>  ** value
>  * AWSCloudWatchMetric type represents a storage or request metric that is 
> monitored by AWS CloudWatch and can be configured for a bucket
>  ** metricName, for example, “AllRequests”, “GetRequests”, 
> TotalRequestLatency, BucketSizeBytes
>  ** scope: null if entire bucket; otherwise, the prefixes/tags that filter or 
> limit the monitoring of the metric.
>  * AWSS3Bucket type represents a bucket in an S3 instance.  It supports:
>  ** Array of AWSS3PseudoDirectories that are associated with objects stored 
> in the bucket 
>  ** AWS region
>  ** IsEncrypted (boolean) 
>  ** encryptionType, e.g., AES-256
>  ** S3AccessPolicy, a JSON object expressing access policies, eg GetObject, 
> PutObject
>  ** time of creation
>  ** Array of AWSS3BucketLifeCycleRules that are associated with the bucket 
>  ** Array of AWSS3CloudWatchMetrics that are associated with the bucket or 
> its tags or prefixes
>  ** Array of AWSTags that are associated with the bucket
>  * Generic dataset2Dataset process to represent movement of data from one 
> dataset to another.  It supports:
>  ** array of transforms performed by the process 
>  ** map of tag/value pairs representing configurationParameters of the process
>  ** inputs and outputs are arrays of dataset objects, e.g., kafka topic and 
> S3 pseudo-directory.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ATLAS-2708) AWS S3 data lake typedefs for Atlas

2018-06-12 Thread Don Bosco Durai (JIRA)


[ 
https://issues.apache.org/jira/browse/ATLAS-2708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16510324#comment-16510324
 ] 

Don Bosco Durai commented on ATLAS-2708:


[~barbara] Thanks for coming up with S3 typedef. I am working on the Ranger 
Jira https://issues.apache.org/jira/browse/RANGER-1974  and planning to use 
your S3 typedef for pushing the S3 resources and tags to Atlas. I reviewed your 
JSON and it is looking good. I had a question and also a suggestion.
 # In your design, how would you represent an object like 
s3://my_bucket/my_path/my_file.csv. E.g would s3://my_bucket be represented as 
AWSS3Bucket, my_path as AWSS3PseudoDir. And what about my_file.csv? Or the 
entire path is AWSS3Bucket?
 # Would it be convenient if we abstract AWS common Entities like Tags, 
Permissions, CloudWatchMetrics, etc into its own definition/entity and 
AWSS3Bucket and others can extend from it where applicable? This would help us 
add additional AWS services like DynamoDB, RDS, etc easily.

 

> AWS S3 data lake typedefs for Atlas
> ---
>
> Key: ATLAS-2708
> URL: https://issues.apache.org/jira/browse/ATLAS-2708
> Project: Atlas
>  Issue Type: New Feature
>  Components:  atlas-core
>Reporter: Barbara Eckman
>Assignee: Barbara Eckman
>Priority: Critical
> Attachments: all_datalake_typedefs.json
>
>
> Currently the base types in Atlas do not include AWS data lake objects. It 
> would be nice to add typedefs for AWS data lake objects (buckets and 
> pseudo-directories) and lineage processes that move the data from another 
> source (e.g., kafka topic) to the data lake.  For example:
>  * AWSS3PseudoDir type represents the pseudo-directory “prefix” of objects in 
> an S3 bucket.  For example, in the case of an object with key 
> “myWork/Development/Projects1.xls”, “myWork/Development” is the 
> pseudo-directory.  It supports:
>  ** Array of avro schemas that are associated with the data in the 
> pseudo-directory (based on Avro schema extensions outlined in ATLAS-2694)
>  ** what type of data it contains, e.g., avro, json, unstructured
>  ** time of creation
>  * AWSS3BucketLifeCycleRule type represents a rule specifying a transition of 
> the data in a bucket to a storageClass after a specific time interval, or 
> expiration.  For example, transition to GLACIER after 60 days, or expire 
> (i.e. be deleted) after 90 days:
>  ** ruleType (e.g., transition or expiration)
>  ** time interval in days before rule is executed  
>  ** storageClass to which the data is transitioned (null if ruleType is 
> expiration)
>  * AWSTag type represents a tag-value pair created by the user and associated 
> with an AWS object.
>  **  tag
>  ** value
>  * AWSCloudWatchMetric type represents a storage or request metric that is 
> monitored by AWS CloudWatch and can be configured for a bucket
>  ** metricName, for example, “AllRequests”, “GetRequests”, 
> TotalRequestLatency, BucketSizeBytes
>  ** scope: null if entire bucket; otherwise, the prefixes/tags that filter or 
> limit the monitoring of the metric.
>  * AWSS3Bucket type represents a bucket in an S3 instance.  It supports:
>  ** Array of AWSS3PseudoDirectories that are associated with objects stored 
> in the bucket 
>  ** AWS region
>  ** IsEncrypted (boolean) 
>  ** encryptionType, e.g., AES-256
>  ** S3AccessPolicy, a JSON object expressing access policies, eg GetObject, 
> PutObject
>  ** time of creation
>  ** Array of AWSS3BucketLifeCycleRules that are associated with the bucket 
>  ** Array of AWSS3CloudWatchMetrics that are associated with the bucket or 
> its tags or prefixes
>  ** Array of AWSTags that are associated with the bucket
>  * Generic dataset2Dataset process to represent movement of data from one 
> dataset to another.  It supports:
>  ** array of transforms performed by the process 
>  ** map of tag/value pairs representing configurationParameters of the process
>  ** inputs and outputs are arrays of dataset objects, e.g., kafka topic and 
> S3 pseudo-directory.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ATLAS-2708) AWS S3 data lake typedefs for Atlas

2018-06-11 Thread David Radley (JIRA)


[ 
https://issues.apache.org/jira/browse/ATLAS-2708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16507807#comment-16507807
 ] 

David Radley commented on ATLAS-2708:
-

[~barbara] I have had a quick look at the json file, it looks like you are 
using the old style of definition. I would recommend that that you use the new 
RelationshipDef style rather than constraints. This is more descriptive of 
relationships - it allows you to specify attributes on the relationship as 
well.  I think the constraints style is only there for so that the Hadoop types 
could continue to work.

> AWS S3 data lake typedefs for Atlas
> ---
>
> Key: ATLAS-2708
> URL: https://issues.apache.org/jira/browse/ATLAS-2708
> Project: Atlas
>  Issue Type: New Feature
>  Components:  atlas-core
>Reporter: Barbara Eckman
>Assignee: Barbara Eckman
>Priority: Critical
> Attachments: all_datalake_typedefs.json
>
>
> Currently the base types in Atlas do not include AWS data lake objects. It 
> would be nice to add typedefs for AWS data lake objects (buckets and 
> pseudo-directories) and lineage processes that move the data from another 
> source (e.g., kafka topic) to the data lake.  For example:
>  * AWSS3PseudoDir type represents the pseudo-directory “prefix” of objects in 
> an S3 bucket.  For example, in the case of an object with key 
> “myWork/Development/Projects1.xls”, “myWork/Development” is the 
> pseudo-directory.  It supports:
>  ** Array of avro schemas that are associated with the data in the 
> pseudo-directory (based on Avro schema extensions outlined in ATLAS-2694)
>  ** what type of data it contains, e.g., avro, json, unstructured
>  ** time of creation
>  * AWSS3BucketLifeCycleRule type represents a rule specifying a transition of 
> the data in a bucket to a storageClass after a specific time interval, or 
> expiration.  For example, transition to GLACIER after 60 days, or expire 
> (i.e. be deleted) after 90 days:
>  ** ruleType (e.g., transition or expiration)
>  ** time interval in days before rule is executed  
>  ** storageClass to which the data is transitioned (null if ruleType is 
> expiration)
>  * AWSTag type represents a tag-value pair created by the user and associated 
> with an AWS object.
>  **  tag
>  ** value
>  * AWSCloudWatchMetric type represents a storage or request metric that is 
> monitored by AWS CloudWatch and can be configured for a bucket
>  ** metricName, for example, “AllRequests”, “GetRequests”, 
> TotalRequestLatency, BucketSizeBytes
>  ** scope: null if entire bucket; otherwise, the prefixes/tags that filter or 
> limit the monitoring of the metric.
>  * AWSS3Bucket type represents a bucket in an S3 instance.  It supports:
>  ** Array of AWSS3PseudoDirectories that are associated with objects stored 
> in the bucket 
>  ** AWS region
>  ** IsEncrypted (boolean) 
>  ** encryptionType, e.g., AES-256
>  ** S3AccessPolicy, a JSON object expressing access policies, eg GetObject, 
> PutObject
>  ** time of creation
>  ** Array of AWSS3BucketLifeCycleRules that are associated with the bucket 
>  ** Array of AWSS3CloudWatchMetrics that are associated with the bucket or 
> its tags or prefixes
>  ** Array of AWSTags that are associated with the bucket
>  * Generic dataset2Dataset process to represent movement of data from one 
> dataset to another.  It supports:
>  ** array of transforms performed by the process 
>  ** map of tag/value pairs representing configurationParameters of the process
>  ** inputs and outputs are arrays of dataset objects, e.g., kafka topic and 
> S3 pseudo-directory.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ATLAS-2708) AWS S3 data lake typedefs for Atlas

2018-06-10 Thread Barbara Eckman (JIRA)


[ 
https://issues.apache.org/jira/browse/ATLAS-2708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16507532#comment-16507532
 ] 

Barbara Eckman commented on ATLAS-2708:
---

I'm sorry, that was stupid.  I'm attaching a file that reflects your 
suggestions.

> AWS S3 data lake typedefs for Atlas
> ---
>
> Key: ATLAS-2708
> URL: https://issues.apache.org/jira/browse/ATLAS-2708
> Project: Atlas
>  Issue Type: New Feature
>  Components:  atlas-core
>Reporter: Barbara Eckman
>Assignee: Barbara Eckman
>Priority: Critical
> Attachments: all_datalake_typedefs.json
>
>
> Currently the base types in Atlas do not include AWS data lake objects. It 
> would be nice to add typedefs for AWS data lake objects (buckets and 
> pseudo-directories) and lineage processes that move the data from another 
> source (e.g., kafka topic) to the data lake.  For example:
>  * AWSS3PseudoDir type represents the pseudo-directory “prefix” of objects in 
> an S3 bucket.  For example, in the case of an object with key 
> “myWork/Development/Projects1.xls”, “myWork/Development” is the 
> pseudo-directory.  It supports:
>  ** Array of avro schemas that are associated with the data in the 
> pseudo-directory (based on Avro schema extensions outlined in ATLAS-2694)
>  ** what type of data it contains, e.g., avro, json, unstructured
>  ** time of creation
>  * AWSS3BucketLifeCycleRule type represents a rule specifying a transition of 
> the data in a bucket to a storageClass after a specific time interval, or 
> expiration.  For example, transition to GLACIER after 60 days, or expire 
> (i.e. be deleted) after 90 days:
>  ** ruleType (e.g., transition or expiration)
>  ** time interval in days before rule is executed  
>  ** storageClass to which the data is transitioned (null if ruleType is 
> expiration)
>  * AWSTag type represents a tag-value pair created by the user and associated 
> with an AWS object.
>  **  tag
>  ** value
>  * AWSCloudWatchMetric type represents a storage or request metric that is 
> monitored by AWS CloudWatch and can be configured for a bucket
>  ** metricName, for example, “AllRequests”, “GetRequests”, 
> TotalRequestLatency, BucketSizeBytes
>  ** scope: null if entire bucket; otherwise, the prefixes/tags that filter or 
> limit the monitoring of the metric.
>  * AWSS3Bucket type represents a bucket in an S3 instance.  It supports:
>  ** Array of AWSS3PseudoDirectories that are associated with objects stored 
> in the bucket 
>  ** AWS region
>  ** IsEncrypted (boolean) 
>  ** encryptionType, e.g., AES-256
>  ** S3AccessPolicy, a JSON object expressing access policies, eg GetObject, 
> PutObject
>  ** time of creation
>  ** Array of AWSS3BucketLifeCycleRules that are associated with the bucket 
>  ** Array of AWSS3CloudWatchMetrics that are associated with the bucket or 
> its tags or prefixes
>  ** Array of AWSTags that are associated with the bucket
>  * Generic dataset2Dataset process to represent movement of data from one 
> dataset to another.  It supports:
>  ** array of transforms performed by the process 
>  ** map of tag/value pairs representing configurationParameters of the process
>  ** inputs and outputs are arrays of dataset objects, e.g., kafka topic and 
> S3 pseudo-directory.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ATLAS-2708) AWS S3 data lake typedefs for Atlas

2018-06-10 Thread Madhan Neethiraj (JIRA)


[ 
https://issues.apache.org/jira/browse/ATLAS-2708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16507520#comment-16507520
 ] 

Madhan Neethiraj commented on ATLAS-2708:
-

[~barbara] - thanks for adding models for S3 types. Couple of comments:

{noformat}
AWSTag, AWSCloudWatchMetric, AWSS3BucketLifeCycleRule: are these really 
datasets? Perhaps Referenceable should be the only super-type?
{noformat}

{noformat}
AWSS3PseudoDir, AWSS3Bucket: both DataSet and Asset as listed as super-types. 
Given DataSet already has has Asset as its super-type, it is enough to list 
only DataSet as the super-type.
{noformat}


> AWS S3 data lake typedefs for Atlas
> ---
>
> Key: ATLAS-2708
> URL: https://issues.apache.org/jira/browse/ATLAS-2708
> Project: Atlas
>  Issue Type: New Feature
>  Components:  atlas-core
>Reporter: Barbara Eckman
>Assignee: Barbara Eckman
>Priority: Critical
> Attachments: all_datalake_types.json
>
>
> Currently the base types in Atlas do not include AWS data lake objects. It 
> would be nice to add typedefs for AWS data lake objects (buckets and 
> pseudo-directories) and lineage processes that move the data from another 
> source (e.g., kafka topic) to the data lake.  For example:
>  * AWSS3PseudoDir type represents the pseudo-directory “prefix” of objects in 
> an S3 bucket.  For example, in the case of an object with key 
> “myWork/Development/Projects1.xls”, “myWork/Development” is the 
> pseudo-directory.  It supports:
>  ** Array of avro schemas that are associated with the data in the 
> pseudo-directory (based on Avro schema extensions outlined in ATLAS-2694)
>  ** what type of data it contains, e.g., avro, json, unstructured
>  ** time of creation
>  * AWSS3BucketLifeCycleRule type represents a rule specifying a transition of 
> the data in a bucket to a storageClass after a specific time interval, or 
> expiration.  For example, transition to GLACIER after 60 days, or expire 
> (i.e. be deleted) after 90 days:
>  ** ruleType (e.g., transition or expiration)
>  ** time interval in days before rule is executed  
>  ** storageClass to which the data is transitioned (null if ruleType is 
> expiration)
>  * AWSTag type represents a tag-value pair created by the user and associated 
> with an AWS object.
>  **  tag
>  ** value
>  * AWSCloudWatchMetric type represents a storage or request metric that is 
> monitored by AWS CloudWatch and can be configured for a bucket
>  ** metricName, for example, “AllRequests”, “GetRequests”, 
> TotalRequestLatency, BucketSizeBytes
>  ** scope: null if entire bucket; otherwise, the prefixes/tags that filter or 
> limit the monitoring of the metric.
>  * AWSS3Bucket type represents a bucket in an S3 instance.  It supports:
>  ** Array of AWSS3PseudoDirectories that are associated with objects stored 
> in the bucket 
>  ** AWS region
>  ** IsEncrypted (boolean) 
>  ** encryptionType, e.g., AES-256
>  ** S3AccessPolicy, a JSON object expressing access policies, eg GetObject, 
> PutObject
>  ** time of creation
>  ** Array of AWSS3BucketLifeCycleRules that are associated with the bucket 
>  ** Array of AWSS3CloudWatchMetrics that are associated with the bucket or 
> its tags or prefixes
>  ** Array of AWSTags that are associated with the bucket
>  * Generic dataset2Dataset process to represent movement of data from one 
> dataset to another.  It supports:
>  ** array of transforms performed by the process 
>  ** map of tag/value pairs representing configurationParameters of the process
>  ** inputs and outputs are arrays of dataset objects, e.g., kafka topic and 
> S3 pseudo-directory.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)