GoranSMilovanovic added a subscriber: Addshore. GoranSMilovanovic added a comment.
@Lydia_Pintscher @Addshore @JAllemandou
Hey: don't you think that we could get this from the pagelinks table in Wikidatawiki's database?
The way I see it (please correct me if I am wrong):
0. In the pagelinks table we have the following fields:
- pl_from int unsigned NOT NULL default 0 - Key to the page_id of the page containing the link;
- pl_from_namespace int NOT NULL default 0 - Namespace for this page;
- pl_namespace int NOT NULL default 0 - Key to page_namespace of the target page;
- pl_title varchar(255) binary NOT NULL default '' - Key to page_title of the target page;
1. I get all the Wikidata IDs for all external identifiers of interest - these should also correspond to the respective page titles on Wikidatawiki (I can use SPARQL or a Blazegraph GAS program for this);
2. I would probably need to have Wikidatawiki's pagelinks table sqooped to Hadoop, so that I can cache it in Spark and join it with the external identifiers' titles there.
Since we have the pl_from_namespace field in the pagelinks table, we don't even need the whole table (because we are looking for items and lexemes only, so we first filter out everything else).
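A minimal sketch of the filter-then-join idea from the steps above, run against an in-memory SQLite copy of the four pagelinks fields rather than the real replica or Spark. The namespace numbers (0 for items, 146 for lexemes, 120 for properties), the helper names, and the toy rows are assumptions for illustration only; they should be checked against the live wiki before any real run.

```python
import sqlite3

# Assumed namespace numbers on wikidatawiki (verify before use):
# 0 = items, 146 = lexemes, 120 = properties.
ITEM_NS, LEXEME_NS, PROPERTY_NS = 0, 146, 120


def count_identifier_usage(conn, identifier_titles):
    """Count, per external-identifier title, the item/lexeme pages that link to it."""
    # Hypothetical helper table holding the titles obtained in step 1.
    conn.execute("CREATE TEMP TABLE IF NOT EXISTS ext_ids (title TEXT PRIMARY KEY)")
    conn.execute("DELETE FROM ext_ids")
    conn.executemany("INSERT INTO ext_ids VALUES (?)", [(t,) for t in identifier_titles])
    rows = conn.execute(
        """
        SELECT pl.pl_title, COUNT(DISTINCT pl.pl_from) AS n_pages
        FROM pagelinks AS pl
        JOIN ext_ids AS e ON e.title = pl.pl_title
        WHERE pl.pl_from_namespace IN (?, ?)   -- keep items and lexemes only
          AND pl.pl_namespace = ?              -- link targets in the property namespace
        GROUP BY pl.pl_title
        ORDER BY pl.pl_title
        """,
        (ITEM_NS, LEXEME_NS, PROPERTY_NS),
    ).fetchall()
    return dict(rows)


def demo():
    """Build a toy pagelinks table and run the count on two made-up identifier titles."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE pagelinks (
               pl_from INTEGER NOT NULL DEFAULT 0,
               pl_from_namespace INTEGER NOT NULL DEFAULT 0,
               pl_namespace INTEGER NOT NULL DEFAULT 0,
               pl_title TEXT NOT NULL DEFAULT ''
           )"""
    )
    toy_rows = [
        (1, 0, 120, "P214"),    # item linking to an identifier property
        (2, 0, 120, "P214"),
        (2, 0, 120, "P31"),     # property that is not in the identifier list
        (3, 146, 120, "P214"),  # lexeme page
        (4, 4, 120, "P214"),    # other namespace, filtered out
        (1, 0, 120, "P227"),
        (5, 0, 0, "Q42"),       # link to an item, not to a property
    ]
    conn.executemany("INSERT INTO pagelinks VALUES (?, ?, ?, ?)", toy_rows)
    return count_identifier_usage(conn, ["P214", "P227"])


if __name__ == "__main__":
    print(demo())
```

The same filter and join would carry over to a Spark DataFrame or an R data.table; SQLite is used here only so the sketch runs without a cluster.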
Also, it is not impossible that R alone could handle this on, say, stat1007 - the Wiktionary Cognate Dashboard works with hundreds of millions of rows exported from SQL there, and crunches the numbers with the {data.table} package regularly (it is updated every six hours). I could experiment to see how the problem scales and then decide whether we go for PySpark or do it in R.
Please let me know what you think. Thanks.
Cc: Addshore, JAllemandou, Aklapper, GoranSMilovanovic, Lydia_Pintscher, Nandana, Lahi, Gq86, QZanden, LawExplorer, _jensen, Jonas, Wikidata-bugs, aude, Mbch331
