GoranSMilovanovic added a comment.

I've been thinking about this for some time already. The following idea is probably overkill, but I would say it's the way to go if we cannot establish any prima facie criteria:

(1) Preprocessing, a text-mining approach: (1a) work from the dumps, (1b) go through all pages, and (1c) collect metrics: page length (which we can probably get without performing any actual text search), number of references, number of external links, number of sections, properties of the word frequency distribution, sentiment, whatever can be measured. Formal descriptions would do (page length, frequency distributions, distributions of syntactic categories used); we don't need semantics.
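For illustration, here is a minimal sketch in R of the kind of feature extraction I have in mind for (1c). It assumes the wikitext of each page is already available as a character vector called pages (parsing the dumps themselves is not shown), the metrics are deliberately crude, and names such as extract_features are just illustrative:

    # Rough sketch of phase (1): simple formal features from raw wikitext.
    # Assumes `pages` is a character vector of page wikitext already pulled
    # out of the dumps; dump parsing itself is omitted here.
    library(stringr)

    extract_features <- function(wikitext) {
      words <- unlist(str_split(tolower(wikitext), "[^[:alpha:]]+"))
      words <- words[nchar(words) > 0]
      freq  <- table(words)
      data.frame(
        page_length   = nchar(wikitext),                     # raw length in characters
        n_words       = length(words),
        n_refs        = str_count(wikitext, fixed("<ref")),  # crude reference count
        n_ext_links   = str_count(wikitext, fixed("[http")), # crude external link count
        n_sections    = str_count(wikitext, "\n==+[^=]"),    # section headings
        type_token    = length(freq) / max(length(words), 1),# lexical diversity
        mean_word_len = mean(nchar(words))
      )
    }

    features <- do.call(rbind, lapply(pages, extract_features))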

(2) Machine learning: use pages that we know are stubs against a sample of pages that are certainly not stubs to train a model (binary logistic regression, decision tree, random forest, something along those lines); train until some acceptable classification accuracy is reached (if that is possible with the set of features produced in phase (1)); then use the model to predict which of the remaining pages are stubs.
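A minimal sketch of what (2) could look like in R, assuming phase (1) produced a data frame train with the features plus a known is_stub label, and a data frame unlabeled with the same features for the remaining pages; the randomForest package and the column names are assumptions, not a settled choice:

    # Rough sketch of phase (2): train a binary classifier on the phase-(1) features.
    # Assumes `train` has the features plus a known stub/non-stub label `is_stub`,
    # and `unlabeled` holds the same features for the remaining pages.
    library(randomForest)

    train$is_stub <- factor(train$is_stub)

    # Binary logistic regression as a simple baseline
    logit_fit <- glm(is_stub ~ ., data = train, family = binomial)

    # Random forest as a stronger alternative
    rf_fit <- randomForest(is_stub ~ ., data = train, ntree = 500)

    # Rough in-sample accuracy check; a proper hold-out split or
    # cross-validation would be needed before trusting the model
    rf_acc <- mean(predict(rf_fit, train) == train$is_stub)

    # Predict which of the remaining pages are stubs
    unlabeled$predicted_stub <- predict(rf_fit, newdata = unlabeled)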

However, this is time-consuming and would take a lot of experimentation and model tuning before we figure out exactly which model delivers a satisfactory result. The feature extraction in phase (1) would be the difficult and computationally intensive part, while training a predictive model in phase (2) on a set of several million preprocessed pages should not be a problem for R on a single machine.


TASK DETAIL
https://phabricator.wikimedia.org/T119976
