debt created this task.
debt added projects: Discovery, German-Community-Wishlist, Discovery-Wikidata-Query-Service-Sprint, Wikidata, Wikidata-Query-Service.
Herald added a subscriber: Aklapper.
Herald added a project: TCB-Team.

TASK DESCRIPTION

Problem:

Searching for subcategories often produces result sets that are too big to be used effectively as a basis for search results.

Suggestion:

We set a limit on how many subcategories we support. If that limit is hit, we inform the user that the category was too unspecific, which is why we can't return anything. When calculating how many subcategories there are, loops should be detected and not counted.
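The counting rule suggested above can be sketched as follows. This is only an illustration under assumptions: `collect_subcategories`, `children_of`, and `TooManySubcategories` are hypothetical names, the root category is counted toward the limit, and the lookup of a category's direct subcategories is left to the caller.

```python
from collections import deque


class TooManySubcategories(Exception):
    """Hypothetical error: the category tree exceeds the supported limit."""


def collect_subcategories(root, children_of, limit):
    """Walk the category tree breadth first and return every category once.

    `children_of` is a caller-supplied function (an assumption of this
    sketch) that returns the direct subcategories of a category.
    """
    seen = {root}
    ordered = [root]
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for child in children_of(current):
            if child in seen:
                # Reached again through a loop or a shared parent:
                # it was already counted, so skip it.
                continue
            seen.add(child)
            ordered.append(child)
            if len(ordered) > limit:
                # Refuse instead of returning an incomplete list.
                raise TooManySubcategories(
                    f"more than {limit} categories under {root!r}"
                )
            queue.append(child)
    return ordered


# Toy tree with a loop: A -> B -> A, and D reachable from both B and C.
tree = {"A": ["B", "C"], "B": ["A", "D"], "C": ["D"], "D": []}
print(collect_subcategories("A", lambda c: tree.get(c, []), limit=10))
```

Raising an error rather than truncating matches the intent of telling the user the category was too unspecific instead of returning partial results.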

Reasoning / Background info:

  • We don't want to return incomplete results
  • We need to set a limit
    • Elastic allows a maximum of 1,024 conditions (categories) in any single query
      • e.g., if we're searching for 1,000 categories, that is 1,000 conditions
      • Elastic searches categories breadth first and then depth
    • But the search string itself adds further conditions that count against that maximum
      • i.e., the more complex the query, the fewer conditions (categories) remain available to search for
    • One option for paring things down: first look in the database for categories that have no pages associated with them (empty) and leave them out of the query results, so that we can hopefully return more useful results.
  • We want to notify users if the category hit the limit (WMDE UI component)
    • exclude loops in the category tree
    • exclude empty categories from the result list
    • have a deep cat keyword for search
    • The search query building probably needs to be a combination of API and curl
  • How many empty categories are there?
    • enwiki: 1.5 million total categories, 400K of which are empty (roughly a quarter)
  • Can empty categories be excluded?
    • Yes, kind of. We can do daily or weekly dumps of the categories for searching on. Updating the database takes about 3 hours, and it is unusable until the update is complete. We don't have the ability to do real-time database updates.
  • Links to the WMDE catwatch project:

Action items:

  • We need to set a limit.
    • Let's start with using 800 categories as a limit
  • Apply empty category filter (ignore them)
  • Apply the limit of category counts (800)
  • Use the daily database dump to search on
    • Users can’t use the search while the dump is loading
      • Stas will investigate whether we can minimize the time during which users cannot search because the database is locked
  • Create an API and cURL combination for the keyword creation
    • WMF will do this work and it should take a couple of weeks
      • goal is to be done by end of January 2018
    • the keyword will have a UI component
      • WMDE will complete the frontend work to expose it
  • Handover between the Discovery team and Technical Wishes team
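
How the starting limit of 800 relates to the Elastic cap of 1,024 conditions, and how the empty-category filter fits in, can be sketched roughly as below. The function names are hypothetical, and how many conditions a given search string consumes is treated as an input here because the task does not specify how that would be measured.

```python
ELASTIC_MAX_CONDITIONS = 1024  # cap quoted in this task
CATEGORY_LIMIT = 800           # starting limit from the action items


def category_budget(query_conditions, max_conditions=ELASTIC_MAX_CONDITIONS):
    """Conditions left for categories after the search string takes its share."""
    return max(0, max_conditions - query_conditions)


def select_categories(categories, page_counts, query_conditions,
                      limit=CATEGORY_LIMIT):
    """Drop empty categories, then enforce the limit.

    `page_counts` maps a category to its number of pages (assumed to come
    from the periodic category dump). Returns None when too many categories
    remain, so the caller can tell the user the category was too unspecific
    instead of returning incomplete results.
    """
    non_empty = [c for c in categories if page_counts.get(c, 0) > 0]
    allowed = min(limit, category_budget(query_conditions))
    if len(non_empty) > allowed:
        return None
    return non_empty


# With 800 categories, 1024 - 800 = 224 conditions remain for the search string.
print(category_budget(800))
print(select_categories(["a", "b", "c"], {"a": 3, "b": 0, "c": 1},
                        query_conditions=10))
```

Filtering empty categories before applying the limit means they do not use up any of the 800 slots.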

TASK DETAIL
https://phabricator.wikimedia.org/T181549

To: debt
Cc: Aklapper, Addshore, Lea_WMDE, Smalyshev, debt, Lahi, Gq86, Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, Avner, Gehel, Jonas, FloNight, Xmlizer, KasiaWMDE, Luke081515, Bmueller, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, jayvdb, Tobi_WMDE_SW, Mbch331, Jay8g