Re: Performance improvements: SQL queries

Francesco Chicchiriccò Wed, 09 Oct 2019 04:28:44 -0700

Hi Dmitriy,
thanks for your analysis.

Please consider that:


1. you should refer to latest Syncope 2.1 (2.1.5 at the moment), rather than 
old 2.0.12.
2. the queries are generated in the current form because of the various use 
cases they need to work for, and that are mostly checked by various tests
3. Syncope 2.1 also provides Postgresql JSONB support, which dramatically 
improves overall performance

Anyway, feel free to submit any PR that we'll be glad to review.
Regards.

On 09/10/19 13:03, Dmitriy Brashevets wrote:
>
> Hi Apache Syncope team!
>
>  
>
> Currently, we're using the Apache Syncope 2.0.12 version.
>
>  
>
> AS provides the wide rest API and our team would like to have the 
> org.apache.syncope.common.rest.api.service.AnyService#search working faster 
> as we do the search of groups and users using FIQL queries.
>
>  
>
> Here is an example of the URI with a request path that has the FIQL query:
>
>  
>
> https://hostname/syncope/rest/users?fiql=%28realm%3D%3Dea696a4f-e77a-4ef1-be67-8f8093bc8686%3BlastChangeDate%3Dle%3D2019-10-10T13%253A57%253A00%252B0300%29&size=50&orderby=username+ASC&details=true&page=1
>
>  
>
>  
>
> During the quick investigation of how this endpoint works under the hood, I 
> found the places that potentially can be optimized.
>
>  
>
> The result of generated sql query that searches for AnyTO keys looks like 
> this (in a particular case I'm taking into consideration the User anyTO):
>
>  
>
> {code:sql}
>
> SELECT u.any_id,sv.username FROM (SELECT DISTINCT any_id FROM user_search 
> WHERE (realm_id=? AND any_id IN ( SELECT DISTINCT any_id FROM user_search 
> WHERE lastChangeDate<=?))) u,user_search sv WHERE u.any_id=sv.any_id AND 
> u.any_id IN (SELECT any_id FROM user_search WHERE realm_id IN (SELECT id AS 
> realm_id FROM Realm WHERE id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR 
> id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? 
> OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR 
> id=? OR id=? OR id=? OR id=?)) ORDER BY sv.username ASC
>
> {code}
>
>  
>
> Firstly, we can optimize the multiple realm OR-queries that is used for 
> effective realms to [1st Optimization]:
>
> {code:sql}
>
> SELECT u.any_id, sv.username
>
> FROM (SELECT DISTINCT any_id
>
>       FROM user_search
>
>       WHERE (realm_id=? AND any_id IN (SELECT DISTINCT any_id FROM 
> user_search WHERE lastChangeDate<=?))) u,
>
>      user_search sv
>
> WHERE u.any_id = sv.any_id
>
>   AND u.any_id IN (SELECT any_id
>
>                    FROM user_search
>
>                   WHERE realm_id IN (SELECT id AS realm_id
>
>                                       FROM Realm
>
>                                       WHERE realm_id IN (?, ?, ?, ?, ?, ?, ?, 
> ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)))
>
> ORDER BY sv.username ASC
>
> {code}
>
>  
>
> Also, it is not clear for me why there is an inner join of two tables "u" and 
> "sv"? As far as I can see the effective realms clause and fiql query can be 
> done without inner join.  Thus, the query can be optimized to [2nd 
> Optimization]:
>
> {code:sql}
>
> SELECT DISTINCT any_id, username
>
> FROM user_search
>
> WHERE (realm_id=? AND any_id IN (SELECT DISTINCT any_id FROM user_search 
> WHERE lastChangeDate<=?))
>
>   and realm_id IN
>
>       (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, 
> ?, ?,
>
>        ?, ?, ?, ?) ORDER BY user_search.username ASC;
>
> {code}
>
>  
>
>  
>
> Is it possbile that in "user_search" view two rows with the same 
> `user_search.any_id` can present? If not then the query can be optimized to  
> [3rd Optimization]:
>
> {code:sql}
>
> SELECT any_id, username
>
> FROM user_search
>
> WHERE (realm_id=? AND any_id IN (SELECT any_id FROM user_search WHERE 
> lastChangeDate<=?))
>
>   and realm_id IN
>
>       (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, 
> ?, ?,
>
>        ?, ?, ?, ?) ORDER BY user_search.username ASC;
>
> {code}
>
>  
>
>  
>
> After fetching the any_id <-> username pairs 
> org.apache.syncope.core.persistence.jpa.dao.AbstractAnySearchDAO#buildResult 
> method plays its role.
>
> Here the code selects each entity one by one in the loop by sorted `any_id` 
> order keys.
>
> {code:sql}
>
> for (Object anyKey : raw) {
>
>             String actualKey = anyKey instanceof Object[]
>
>                     ? (String) ((Object[]) anyKey)[0]
>
>                     : ((String) anyKey);
>
>  
>
>             @SuppressWarnings("unchecked")
>
>             T any = kind == AnyTypeKind.USER
>
>                     ? (T) userDAO.find(actualKey)
>
>                     : kind == AnyTypeKind.GROUP
>
>                             ? (T) groupDAO.find(actualKey)
>
>                             : (T) anyObjectDAO.find(actualKey);
>
>             if (any == null) {
>
>                 LOG.error("Could not find {} with id {}, even if returned by 
> native query", kind, actualKey);
>
>             } else if (!result.contains(any)) {
>
>                 result.add(any);
>
>             }
>
>         }
>
> {code}
>
>  
>
> What if we can optimize this logic and instead of producing the multiple sql 
> requests, where each sql request fetches only one user, we will use the IN 
> clause? For this purpose we can add an additional method "findByKeys" into 
> org.apache.syncope.core.persistence.jpa.dao.AbstractAnyDAO. It can look like 
> this [4th Optimization]:
>
> {code:sql}
>
>     @Transactional(readOnly = true)
>
>     @Override
>
>     public Collection<A> findByKeys(Set<String> keys) {
>
>         validateKeys(keys);
>
>  
>
>         Class<A> entityClass = anyUtils().anyClass();
>
>         TypedQuery<A> query = entityManager().createQuery(
>
>                 "SELECT e FROM " + entityClass.getSimpleName()
>
>                         + " e WHERE e.id IN (:keys)",
>
>                 entityClass);
>
>         query.setParameter("keys", keys);
>
>         List<A> foundAnys = query.getResultList();
>
>  
>
>         validateAllAnysPresent(keys, foundAnys);
>
>  
>
>         // TODO: adopt to multiple keys
>
>         for (A any : foundAnys) {
>
>             securityChecks(any);
>
>         }
>
>  
>
>         return foundAnys;
>
>     }
>
> {code}
>
> Note that this is a quick implementation and method like 
> org.apache.syncope.core.persistence.jpa.dao.AbstractAnyDAO#securityChecks can 
> be also optimized and adopted to consume multiple entities.
>
> And we can use the new method "findByKeys" instead of the loop.
>
> After all, entities are fetched by keys in a single query we can sort the 
> entities in Java code using the sorted keys.
>
>  
>
> I prepared the PR where I implemented quickly 
> https://github.com/DmitriyBrashevets/syncope/pull/1:
>
> * [1st Optimization]
>
> * [4th Optimization]
>
>  
>
>  
>
> What I also want to achieve in result:
>
> 1) Understand if Apache Syncope team also sees the benefits in these 
> optimizations and to ensure that everybody agrees with them and I haven't 
> missed anything and everything written above is correct.
>
> 2) If everything is correct we can find other places in the code and also 
> optimize them
>
> 3) Finally, we can measure the performance if AS team is agree with these 
> changes. I hope that AS team already have an infrastructure with integration 
> or e2e tests that do measure the performance.
>
>  
>
>  
>
> Thanks in advance,
>
> Dmitriy Brashevets
>

-- 
Francesco Chicchiriccò

Tirasa - Open Source Excellence
http://www.tirasa.net/

Member at The Apache Software Foundation
Syncope, Cocoon, Olingo, CXF, OpenJPA, PonyMail
http://home.apache.org/~ilgrosso/

Re: Performance improvements: SQL queries

Reply via email to