mforns added a comment.

Hey!

As @Ladsgroup knows, I worked on this task during the BCN Hackathon.
It was super-interesting and I learned a lot about Wikidata :]
Thanks for the opportunity!
Here's a summary of what I did, the issues I had, and next steps:

  • After reading docs and getting up to speed on the basics, I wrote a small bash script to extract Wikidata items from the dump in /mnt/data/xmldatadumps/public/wikidatawiki/entities/20180514/wikidata-20180514-all.json.gz, abridge their contents down to id, labels and sitelinks (using type only to keep items), and finally split them into 1M-line files, to be processed in HDFS/Hadoop in a distributed way. The script is:
nice -n19 ionice -c2 -n7 sh -c "zcat /mnt/data/xmldatadumps/public/wikidatawiki/entities/20180514/wikidata-20180514-all.json.gz | head -n -1 | tail -n +2 | sed 's/,$//' | jq -c 'select(.type == \"item\") | {id, labels: .labels | [keys[] as \$k | [\$k, .[\$k].value]], sitelinks: .sitelinks | [keys[] as \$k | [\$k, .[\$k].title]]}' | split -l 1000000 - ~/wikidata_items_abridged_20180514/part_"
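To make the abridged format concrete, here is a small Python sketch (not part of the task itself) that mirrors what the jq filter above does to a single dump line; the sample item is made up:

```python
# Sketch: reproduce the jq transformation for one (invented) dump line.
import json

raw = json.dumps({
    "id": "Q1",
    "type": "item",
    "labels": {"en": {"language": "en", "value": "universe"}},
    "sitelinks": {"enwiki": {"site": "enwiki", "title": "Universe"}},
})

item = json.loads(raw)
# Keep only id, plus labels/sitelinks flattened to [key, value] pairs,
# exactly as the jq filter emits them.
abridged = {
    "id": item["id"],
    "labels": [[k, v["value"]] for k, v in item["labels"].items()],
    "sitelinks": [[k, v["title"]] for k, v in item["sitelinks"].items()],
}
print(json.dumps(abridged))
# {"id": "Q1", "labels": [["en", "universe"]], "sitelinks": [["enwiki", "Universe"]]}
```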

  • Then, I compressed each file separately (Hadoop can only distribute computation over compressed files if each file is compressed on its own) and moved them to HDFS: /user/mforns/wikidata_items_abridged_20180514. Actually, I only moved 5 of the 49 files, to avoid computing over the whole data set while developing. But the rest are ready in stat1005:/home/mforns/wikidata_items_abridged_20180514 and can be copied over any time.
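The compression step itself wasn't shown; a hedged sketch of what it could look like (the gzip loop is the essential part; the hdfs commands are commented out and assume a configured Hadoop client):

```shell
# Sketch: gzip each split file on its own, so Hadoop gets one
# independently-decompressible chunk per file, then push to HDFS.
# Paths mirror the ones in this comment.
src="$HOME/wikidata_items_abridged_20180514"
for f in "$src"/part_*; do
    [ -e "$f" ] && gzip "$f"   # one .gz per chunk
done
# Copy to HDFS (run where the hadoop client is configured):
# hdfs dfs -mkdir -p /user/mforns/wikidata_items_abridged_20180514
# hdfs dfs -put "$src"/part_*.gz /user/mforns/wikidata_items_abridged_20180514/
```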
  • I also wrote a Spark/Scala script that reads the item files in HDFS and processes them to find duplicate candidates. The logic identifies items that have identical labels for at least one language, or identical sitelinks for at least one site. Labels or sitelinks of different languages/sites are not compared. As this is executed in the cluster using Spark RDDs (resilient distributed datasets), the algorithm can compare all Wikidata items against each other and output a graph, where the vertices are item IDs (Q12345) and an edge means two vertices have identical labels/sitelinks. The weight of the edge corresponds to the number of label/sitelink matches between the two vertices (items). Here's the code:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.types._
import org.apache.spark.sql.SparkSession

type Item = (String, Map[String, String], Map[String, String])

def parseItems(
    sourceDirectory: String,
    spark: SparkSession
): RDD[Item] = {
    val schema = StructType(Seq(
        StructField("id", StringType, nullable = false),
        StructField("type", StringType, nullable = false),
        StructField("labels", ArrayType(ArrayType(StringType)), nullable = false),
        StructField("sitelinks", ArrayType(ArrayType(StringType)), nullable = false)
    ))
    val items = spark.read.schema(schema).json(sourceDirectory + "/*").rdd
    // Row -> (id, labels as language -> value map, sitelinks as site -> title map).
    items.map(r => (
        r.getString(0),
        r.getSeq(2).asInstanceOf[Seq[Any]].map(e => e.asInstanceOf[Seq[String]]).map(e => e(0) -> e(1)).toMap,
        r.getSeq(3).asInstanceOf[Seq[Any]].map(e => e.asInstanceOf[Seq[String]]).map(e => e(0) -> e(1)).toMap
    ))
}

val items = parseItems("/user/mforns/wikidata_items_abridged_20180514", spark)

// One record per label/sitelink: (language/site, value, item id).
// Values of 2 characters or fewer are skipped to reduce noise.
val expressions = items.flatMap { item =>
    (
        item._2.map(label => (label._1, label._2, item._1)) ++
        item._3.map(sitelink => (sitelink._1, sitelink._2, item._1))
    ).filter(e => e._2.size > 2)
}

// Group item ids sharing the same (language/site, value) pair,
// keeping only groups with more than one item.
val expressionGroups = (expressions
    .keyBy(e => (e._1, e._2))
    .groupByKey
    .map(g => (g._1, g._2.map(_._3).toSeq.sortBy(id => id)))
    .filter(g => g._2.size > 1))

// Explode each group into all pairs of item ids.
val explodedEdges = expressionGroups.flatMap(g => g._2.combinations(2))

// Weight each pair by its number of label/sitelink matches.
val weightedEdges = explodedEdges.keyBy(e => e).groupByKey.map(g => (g._1, g._2.size))

// Keep only pairs that match more than once.
val edges = weightedEdges.filter(e => e._2 > 1)

// Write out tab-separated lines: item1, item2, weight.
edges.map(e => e._1(0) + "\t" + e._1(1) + "\t" + e._2).saveAsTextFile("/user/mforns/duplicate_candidates")

The output looks like this (you can access it in HDFS under /user/mforns/duplicate_candidates):

Q7545947	Q7545948	4
Q2581746	Q3779054	2
Q32850943	Q32851055	2
Q32498252	Q804060	2
Q4451724	Q4451776	5
...
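To make the RDD pipeline easier to follow, here is the same duplicate-candidate logic as a plain-Python sketch on made-up toy items (it omits the job's length filter on values for brevity):

```python
# Sketch: the Spark job's logic in plain Python, on invented data.
from collections import defaultdict
from itertools import combinations

# Toy items: (id, labels, sitelinks).
items = [
    ("Q1", {"en": "Example"}, {"enwiki": "Example"}),
    ("Q2", {"en": "Example"}, {"enwiki": "Example"}),
    ("Q3", {"en": "Other"}, {}),
]

# 1. flatMap: one (language/site, value, item id) triple per label/sitelink.
expressions = []
for item_id, labels, sitelinks in items:
    for lang, label in labels.items():
        expressions.append((lang, label, item_id))
    for site, title in sitelinks.items():
        expressions.append((site, title, item_id))

# 2. Group by (language/site, value).
groups = defaultdict(list)
for key_space, value, item_id in expressions:
    groups[(key_space, value)].append(item_id)

# 3. Explode each group into item pairs; count matches per pair.
weights = defaultdict(int)
for ids in groups.values():
    for pair in combinations(sorted(set(ids)), 2):
        weights[pair] += 1

# 4. Keep pairs that match in more than one language/site.
edges = {pair: w for pair, w in weights.items() if w > 1}
print(edges)  # {('Q1', 'Q2'): 2}
```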

Finally, I wrote a Python script that reads that output on a single machine and computes the graph's connected components. I haven't tested it, but here it is:

import networkx as nx
import sys

G = nx.Graph()

with open(sys.argv[1], 'r') as input_file:
    for line in input_file:
        # The Spark output is tab-separated: item1, item2, weight.
        v1, v2, w = line.rstrip('\n').split('\t')
        G.add_edge(v1, v2, weight=int(w))

for component in nx.connected_components(G):
    print(component)

This should return all groups of items that are likely to be duplicates (same-label/sitelink duplicates, that is).
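If networkx weren't available on the machine, the same connected components could be computed with a small union-find; a sketch on made-up edges:

```python
# Sketch: connected components via union-find, no networkx needed.
from collections import defaultdict

parent = {}

def find(x):
    """Return the representative of x's component (with path halving)."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(a, b):
    """Merge the components containing a and b."""
    parent[find(a)] = find(b)

# Invented edge list in the same shape as the duplicate_candidates pairs.
edges = [("Q1", "Q2"), ("Q2", "Q3"), ("Q7", "Q8")]
for a, b in edges:
    union(a, b)

components = defaultdict(set)
for node in parent:
    components[find(node)].add(node)
print(sorted(map(sorted, components.values())))
# [['Q1', 'Q2', 'Q3'], ['Q7', 'Q8']]
```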

Issues

If you look at the duplicate_candidates files, you can quickly spot false positives. I found two types:

  • Disambiguation pages: They have the same label as the pages they disambiguate, and are thus flagged as duplicates, but they are not. To fix this, we should look into the statements section of the item's data. However, that section was not in the abridged version of the data I was using, so I didn't work on this.
  • Different locations with the same name: For example, I found that Q19468507 and Q19468544 have identical labels, but are different streets in the Netherlands. To fix this we would also need to look into statements (e.g. postal code).
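As a sketch of the disambiguation fix: if the claims section were kept in the abridged data, items that are an instance of (P31) Wikimedia disambiguation page (Q4167410) could be filtered out before pairing. The item below is made up, but follows the dump's claim structure:

```python
# Sketch: drop disambiguation pages using P31 (instance of) claims.
DISAMBIGUATION = "Q4167410"  # Wikimedia disambiguation page

def is_disambiguation(item):
    """True if any P31 claim points at the disambiguation-page item."""
    for claim in item.get("claims", {}).get("P31", []):
        value = (claim.get("mainsnak", {})
                      .get("datavalue", {})
                      .get("value", {}))
        if value.get("id") == DISAMBIGUATION:
            return True
    return False

# Invented item with a single P31 claim, in the dump's nested shape.
item = {"id": "Q1", "claims": {"P31": [
    {"mainsnak": {"datavalue": {"value": {"id": "Q4167410"}}}}]}}
print(is_disambiguation(item))  # True
```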

Next steps

  • Re-import all the data without abridging it. It's not too big for the Hadoop cluster to handle. However, it must be split and compressed in chunks.
  • Modify the Scala/Spark code to consider statements (maybe also descriptions?).
  • If we get to a point where there are few enough false positives, we could productionize this and have it run weekly, on each new Wikidata dump.

Cheers!


TASK DETAIL
https://phabricator.wikimedia.org/T127467
