Smalyshev created this task.
Herald added a subscriber: Aklapper.

TASK DESCRIPTION

On December 8, I noticed that the Updater was getting stuck on updates. It turns out there is a performance problem in the Updater code, specifically in this piece of RdfRepository.java:

Collection<Statement> aboutStatements = new HashSet<>(insertStatements);
aboutStatements.removeAll(entityStatements);
aboutStatements.removeAll(statementStatements);
aboutStatements.removeAll(filtered(insertStatements).withSubjectStarts(uris.value()));
aboutStatements.removeAll(filtered(insertStatements).withSubjectStarts(uris.reference()));

The problem is in the implementation of removeAll, which HashSet inherits from AbstractSet:

if (size() > c.size()) {
    for (Iterator<?> i = c.iterator(); i.hasNext(); )
        modified |= remove(i.next());
} else {
    for (Iterator<?> i = iterator(); i.hasNext(); ) {
        if (c.contains(i.next())) {
            i.remove();
            modified = true;
        }
    }
}

As we can see, when the set is not larger than c, removeAll does not go over the elements of c and remove them one by one; instead it goes over the elements of the set and checks whether each one is in c. The problem is that in this case c is a filter over a 100K-element list, so each contains() check scans the whole list (or close to it). This makes the whole procedure extremely slow.
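
One possible way around this, sketched below, is to avoid calling removeAll with the filtered collection directly. The class and method names here (RemoveAllSketch, removeEach, removeAllViaCopy) are hypothetical and are not taken from RdfRepository.java. Either the elements to remove are iterated exactly once, or they are first copied into a HashSet so that contains() is a hash lookup whichever branch removeAll takes. This is only an illustration of the idea, not necessarily the fix that was applied.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper; the names are illustrative and not part of the Updater code.
final class RemoveAllSketch {

    private RemoveAllSketch() {
    }

    // Removes each element of toRemove from target. toRemove is iterated
    // exactly once and its contains() method is never called.
    static <T> void removeEach(Set<T> target, Iterable<? extends T> toRemove) {
        for (T item : toRemove) {
            target.remove(item);
        }
    }

    // Alternative: copy toRemove into a HashSet first (one pass over it),
    // so that removeAll only ever performs hash lookups.
    static <T> void removeAllViaCopy(Set<T> target, Collection<? extends T> toRemove) {
        Set<T> copy = new HashSet<>(toRemove);
        target.removeAll(copy);
    }

    public static void main(String[] args) {
        Set<String> statements = new HashSet<>(Arrays.asList("a", "b", "c", "d"));
        // A List stands in for the filtered view: its contains() is a linear scan.
        List<String> filtered = Arrays.asList("a", "b");
        removeEach(statements, filtered);
        System.out.println(statements.size());
    }
}
```

Applied to the code in question, this would mean either replacing each aboutStatements.removeAll(x) call with a loop over x, or wrapping x in new HashSet<>(x) before passing it in. Both cost a single pass over the filtered list instead of one scan per element of the set.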


TASK DETAIL
https://phabricator.wikimedia.org/T182464

To: Smalyshev
Cc: Wikidata-Query-Service, Aklapper, Smalyshev