| mforns added a comment. |
Hey!
As @Ladsgroup knows, I worked on this task during the BCN Hackathon.
It was super-interesting and I learned a lot about Wikidata :]
Thanks for the opportunity!
Here's a summary about what I did, issues I had, and next steps:
- After a while spent reading docs and understanding the basics, I wrote a small bash script that extracts Wikidata items from the dump at /mnt/data/xmldatadumps/public/wikidatawiki/entities/20180514/wikidata-20180514-all.json.gz, abridges their contents down to: id, type, labels and sitelinks, and finally splits them into 1M-line files, to be processed in HDFS/Hadoop in a distributed way. The script is:
nice -n19 ionice -c2 -n7 sh -c "zcat /mnt/data/xmldatadumps/public/wikidatawiki/entities/20180514/wikidata-20180514-all.json.gz | head -n -1 | tail -n +2 | sed 's/,$//' | jq -c 'select(.type == \"item\") | {id, labels: .labels | [keys[] as \$k | [\$k, .[\$k].value]], sitelinks: .sitelinks | [keys[] as \$k | [\$k, .[\$k].title]]}' | split -l 1000000 - ~/wikidata_items_abridged_20180514/part_"
- Then, I compressed each file separately (Hadoop can only distribute computation over compressed input if each chunk is compressed on its own) and moved them to HDFS: /user/mforns/wikidata_items_abridged_20180514. Actually, I only moved 5 of the 49 files, to avoid computing over the whole data set while developing. But the rest are ready in stat1005:/home/mforns/wikidata_items_abridged_20180514 and can be copied over any time.
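For illustration, the jq filter above can be mirrored in plain Python for a single dump line. This is only a sketch: the entity layout follows the Wikidata JSON dump format, and the example record below is fabricated.

```python
import json

def abridge_line(line):
    """Mirror of the jq filter: keep only id, labels and sitelinks
    of entities whose type is "item"; drop everything else.
    Each dump line is one entity JSON, possibly with a trailing comma."""
    entity = json.loads(line.rstrip().rstrip(','))
    if entity.get('type') != 'item':
        return None
    return {
        'id': entity['id'],
        'labels': [[lang, v['value']]
                   for lang, v in entity.get('labels', {}).items()],
        'sitelinks': [[site, v['title']]
                      for site, v in entity.get('sitelinks', {}).items()],
    }

# Tiny fabricated example record (not taken from the real dump):
raw = json.dumps({
    'id': 'Q42', 'type': 'item',
    'labels': {'en': {'language': 'en', 'value': 'Douglas Adams'}},
    'sitelinks': {'enwiki': {'site': 'enwiki', 'title': 'Douglas Adams'}},
}) + ','
print(abridge_line(raw))
```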
- I also wrote a Spark/Scala script that reads the item files in HDFS and processes them to find duplicate candidates. The logic identifies items that have identical labels for at least one language, or identical sitelinks for at least one site. Labels or sitelinks of different languages/sites are not compared. As this is executed on the cluster using Spark RDDs (resilient distributed datasets), the algorithm can compare all Wikidata items against each other and output a graph, where the vertices are item IDs (Q12345) and an edge means two vertices have identical labels/sitelinks. The weight of an edge corresponds to the number of label/sitelink matches between the two items (vertices). Here's the code:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.types._
import org.apache.spark.sql.SparkSession

// (id, labels: language -> label, sitelinks: site -> title)
type Item = (String, Map[String, String], Map[String, String])

def parseItems(
    sourceDirectory: String,
    spark: SparkSession
): RDD[Item] = {
  val schema = StructType(Seq(
    StructField("id", StringType, nullable = false),
    StructField("type", StringType, nullable = false),
    StructField("labels", ArrayType(ArrayType(StringType)), nullable = false),
    StructField("sitelinks", ArrayType(ArrayType(StringType)), nullable = false)
  ))
  val items = spark.read.schema(schema).json(sourceDirectory + "/*").rdd
  items.map(r => (
    r.getString(0),
    r.getSeq(2).asInstanceOf[Seq[Any]].map(e => e.asInstanceOf[Seq[String]]).map(e => e(0) -> e(1)).toMap,
    r.getSeq(3).asInstanceOf[Seq[Any]].map(e => e.asInstanceOf[Seq[String]]).map(e => e(0) -> e(1)).toMap
  ))
}

val items = parseItems("/user/mforns/wikidata_items_abridged_20180514", spark)

// Explode each item into (language/site, value, id) triples,
// ignoring values shorter than 3 characters.
val expressions = items.flatMap { item =>
  (
    item._2.map(label => (label._1, label._2, item._1)) ++
    item._3.map(sitelink => (sitelink._1, sitelink._2, item._1))
  ).filter(e => e._2.size > 2)
}

// Group item ids that share the same (language/site, value) pair.
val expressionGroups = (expressions
  .keyBy(e => (e._1, e._2))
  .groupByKey
  .map(g => (g._1, g._2.map(_._3).toSeq.sortBy(id => id)))
  .filter(g => g._2.size > 1))

// Each group yields candidate pairs; the number of groups a pair
// appears in becomes its edge weight. Keep only weights > 1.
val explodedEdges = expressionGroups.flatMap(g => g._2.combinations(2))
val weightedEdges = explodedEdges.keyBy(e => e).groupByKey.map(g => (g._1, g._2.size))
val edges = weightedEdges.filter(e => e._2 > 1)

edges.map(e => e._1(0) + "\t" + e._1(1) + "\t" + e._2).saveAsTextFile("/user/mforns/duplicate_candidates")

The output looks like this (you can access it in HDFS under /user/mforns/duplicate_candidates):
Q7545947	Q7545948	4
Q2581746	Q3779054	2
Q32850943	Q32851055	2
Q32498252	Q804060	2
Q4451724	Q4451776	5
...
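To make the candidate/edge-weight logic easier to follow, here is a plain-Python sketch of the same pipeline run on a few toy items (the item IDs and values are fabricated, just to illustrate the computation):

```python
from collections import defaultdict
from itertools import combinations

# Toy items: (id, labels, sitelinks) — fabricated, only to illustrate the logic.
items = [
    ('Q1', {'en': 'Foo', 'de': 'Foo'}, {'enwiki': 'Foo'}),
    ('Q2', {'en': 'Foo', 'de': 'Foo'}, {'enwiki': 'Foo'}),
    ('Q3', {'en': 'Bar'}, {}),
]

# 1. Explode each item into (key, value, id) "expressions",
#    skipping values shorter than 3 characters, as in the Spark job.
expressions = [
    (key, value, item_id)
    for item_id, labels, sitelinks in items
    for key, value in list(labels.items()) + list(sitelinks.items())
    if len(value) > 2
]

# 2. Group item ids by (key, value); groups of size > 1 yield candidate pairs.
groups = defaultdict(list)
for key, value, item_id in expressions:
    groups[(key, value)].append(item_id)

# 3. Count in how many groups each pair co-occurs (the edge weight),
#    and keep only pairs that match more than once.
weights = defaultdict(int)
for ids in groups.values():
    for pair in combinations(sorted(set(ids)), 2):
        weights[pair] += 1
edges = {pair: w for pair, w in weights.items() if w > 1}

print(edges)  # {('Q1', 'Q2'): 3} — en + de labels and the enwiki sitelink match
```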
Finally, I wrote a Python script that reads that output on a single machine and computes the graph's connected components. I haven't tested it, but here it is:
import networkx as nx
import sys

G = nx.Graph()
with open(sys.argv[1], 'r') as input_file:
    for line in input_file:
        # The Spark job writes tab-separated lines: item1<TAB>item2<TAB>weight.
        v1, v2, w = line.strip().split('\t')
        G.add_edge(v1, v2, weight=int(w))

for component in nx.connected_components(G):
    print(component)

This should return all groups of items that are likely to be duplicates (same-label/sitelink duplicates, that is).
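In case networkx is not available on the target machine, the same components can also be computed with only the standard library, e.g. with a small union-find. A sketch, using fabricated example edges:

```python
from collections import defaultdict

def connected_components(edges):
    """Group vertices into connected components using union-find."""
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    # Union the endpoints of every edge.
    for v1, v2 in edges:
        parent[find(v1)] = find(v2)

    # Collect vertices by their root.
    components = defaultdict(set)
    for v in parent:
        components[find(v)].add(v)
    return list(components.values())

# Fabricated example edges (same shape as the Spark output, weights dropped):
edges = [('Q1', 'Q2'), ('Q2', 'Q3'), ('Q4', 'Q5')]
print(connected_components(edges))  # two components: {Q1, Q2, Q3} and {Q4, Q5}
```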
Issues
If you look at the duplicate_candidates files, you can quickly spot false positives. I found two types:
- Disambiguation pages: they have the same label as the specific pages they disambiguate, and are thus identified as duplicates, although they are not. To fix this, we should look into the statements section of the item's data. However, that section was not in the abridged version of the data I was using, so I didn't work on this.
- Different locations with the same name: I found, for example, Q19468507 and Q19468544, which have identical labels but are different streets in the Netherlands. To fix this we would also need to look into statements (e.g. postal code).
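For the disambiguation-page case, Wikidata marks such items with a P31 (instance of) statement pointing to Q4167410 (Wikimedia disambiguation page), so a first filter could look roughly like this. It's only a sketch assuming the claims layout of the JSON dump; the example entity is fabricated:

```python
DISAMBIGUATION = 'Q4167410'  # Wikimedia disambiguation page

def is_disambiguation(entity):
    """True if any P31 (instance of) statement points to Q4167410.
    Assumes the claims structure of the Wikidata JSON dump."""
    for claim in entity.get('claims', {}).get('P31', []):
        snak = claim.get('mainsnak', {})
        if snak.get('snaktype') == 'value':
            value = snak.get('datavalue', {}).get('value', {})
            if value.get('id') == DISAMBIGUATION:
                return True
    return False

# Minimal fabricated entity with one P31 -> Q4167410 statement:
entity = {'id': 'Q123', 'claims': {'P31': [{
    'mainsnak': {'snaktype': 'value',
                 'datavalue': {'value': {'id': 'Q4167410'}}}}]}}
print(is_disambiguation(entity))  # True
```

Items for which this returns True could be dropped before the pair-generation step, shrinking the candidate graph.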
Next steps
- Reimport all the data without abridging it. It's not too big for the Hadoop cluster to handle; however, it must still be split and compressed in chunks.
- Modify the Scala/Spark code to also consider statements (and maybe descriptions?).
- If we reach a point where there are few enough false positives, we could productionize this and have it run every week, with each new Wikidata dump?
Cheers!
To: mforns
Cc: mforns, Lahi, MichaelSchoenitzer_WMDE, Ladsgroup, Esc3300, Liuxinyu970226, matej_suchanek, Bugreporter, Ricordisamoa, Aklapper, StudiesWorld, Lydia_Pintscher, samuwmde, Gq86, Vacio, GoranSMilovanovic, QZanden, LawExplorer, Culex, Wikidata-bugs, aude, Alchimista, Mbch331
_______________________________________________ Wikidata-bugs mailing list [email protected] https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs
