[ https://issues.apache.org/jira/browse/ATLAS-4389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashutosh Mestry reassigned ATLAS-4389:
--------------------------------------

    Assignee: Ashutosh Mestry

> Best practice or a way to bring in a large number of entities on a regular 
> basis.
> -------------------------------------------------------------------------------
>
>                 Key: ATLAS-4389
>                 URL: https://issues.apache.org/jira/browse/ATLAS-4389
>             Project: Atlas
>          Issue Type: Bug
>          Components:  atlas-core
>    Affects Versions: 2.0.0, 2.1.0
>            Reporter: Saad
>            Assignee: Ashutosh Mestry
>            Priority: Major
>              Labels: documentation, newbie, performance
>         Attachments: image-2021-08-05-11-22-29-259.png, 
> image-2021-08-05-11-23-05-440.png
>
>
> Would you be so kind as to let us know whether there is a best practice or a 
> recommended way to bring in a large number of entities on a regular basis.
> *Our use case:*
> We will be bringing in around 12,000 datasets, 12,000 jobs and 70,000 
> columns. We want to do this as part of our deployment pipeline for other 
> upstream projects.
> At every deploy we want to do the following (see the sketch after this list):
>  - Add the jobs, datasets and columns that are not yet in Atlas
>  - Update the jobs, datasets and columns that are already in Atlas
>  - Delete from Atlas the jobs that have been deleted from the upstream systems.
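> For illustration, here is a minimal sketch of the three-way diff we have in 
> mind, keyed on qualifiedName (the sets and names below are hypothetical 
> placeholders, not our real entities):
> {code:java}
> import java.util.HashSet;
> import java.util.Set;
> 
> public class SyncPlan {
>     public static void main(String[] args) {
>         // Hypothetical inputs: qualifiedNames from our pipeline vs. those already in Atlas
>         Set<String> upstream = Set.of("job.a", "job.b", "dataset.x");
>         Set<String> inAtlas  = Set.of("job.b", "job.c", "dataset.x");
> 
>         Set<String> toCreate = new HashSet<>(upstream);
>         toCreate.removeAll(inAtlas);     // present upstream, missing in Atlas -> create
>         Set<String> toUpdate = new HashSet<>(upstream);
>         toUpdate.retainAll(inAtlas);     // present in both -> re-submit to update
>         Set<String> toDelete = new HashSet<>(inAtlas);
>         toDelete.removeAll(upstream);    // gone upstream -> delete from Atlas
> 
>         System.out.println("create=" + toCreate + " update=" + toUpdate + " delete=" + toDelete);
>     }
> }
> {code}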
> So far we have considered using the bulk API endpoint (/v2/entity/bulk). This 
> has its own issues: we found that if the payload is too big, in our case 
> bigger than 300-500 entities, the request times out. The deeper the 
> relationships, the fewer entities you can send through the bulk endpoint.
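> One obvious workaround is to chunk the payload ourselves and post each chunk 
> separately. A minimal sketch follows; the host, batch size and sample entity 
> are assumptions to be adapted, and authentication is omitted:
> {code:java}
> import java.net.URI;
> import java.net.http.HttpClient;
> import java.net.http.HttpRequest;
> import java.net.http.HttpResponse;
> import java.util.List;
> 
> public class ChunkedBulkLoad {
>     // Assumed Atlas host/port; BATCH_SIZE is kept below the size at which we saw timeouts
>     static final String BULK_URL = "http://atlas-host:21000/api/atlas/v2/entity/bulk";
>     static final int BATCH_SIZE = 250;
> 
>     public static void main(String[] args) throws Exception {
>         // Each element is a pre-serialized AtlasEntity JSON object (placeholder sample below)
>         List<String> entities = List.of(
>             "{\"typeName\":\"hive_table\",\"attributes\":{\"qualifiedName\":\"db.t1@cluster\",\"name\":\"t1\"}}");
> 
>         HttpClient client = HttpClient.newHttpClient();
>         for (int i = 0; i < entities.size(); i += BATCH_SIZE) {
>             List<String> batch = entities.subList(i, Math.min(i + BATCH_SIZE, entities.size()));
>             // /v2/entity/bulk expects an AtlasEntitiesWithExtInfo wrapper: {"entities":[...]}
>             String payload = "{\"entities\":[" + String.join(",", batch) + "]}";
>             HttpRequest request = HttpRequest.newBuilder()
>                     .uri(URI.create(BULK_URL))
>                     .header("Content-Type", "application/json")
>                     .POST(HttpRequest.BodyPublishers.ofString(payload))
>                     .build();
>             HttpResponse<String> resp = client.send(request, HttpResponse.BodyHandlers.ofString());
>             if (resp.statusCode() >= 400) {
>                 throw new RuntimeException("Batch at offset " + i + " failed: " + resp.body());
>             }
>         }
>     }
> }
> {code}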
> After inspecting some of the code, we believe that both REST and streaming 
> data through Kafka follow the same code path and ultimately yield the same 
> performance.
> Further, we found that when creating entities the type registry becomes the 
> bottleneck. We discovered this by profiling the JVM: only one core processes 
> the entities and their relationships.
> *Questions:*
> 1- What is the best practice for bulk loading a large number of entities in a 
> reasonable time? We are aiming to load 12k jobs, 12k datasets and 70k columns 
> in less than 10 minutes.
> 2- Where should we start if we want to scale the API? Is there any known way 
> to horizontally scale Atlas?
> Here are some of the stats from the load testing we did:
>  
> !image-2021-08-05-11-23-05-440.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
