JAllemandou added a comment.
Exact analysis ran on 2018-12-06:

  val df = spark.read.parquet("/user/joal/wmf/data/wmf/mediawiki/wikidata_parquet/20181001")
  val base_rdd = df.select("labels", "descriptions", "aliases").rdd
  val strings = base_rdd.flatMap(r => {
    r.getMap[String, String](0).values ++
    r.getMap[String, String](1).values ++
    r.getMap[String, Seq[String]](2).values.flatMap(l => l)
  })
  val grouped_strings = strings.map(s => (s, 1)).reduceByKey(_ + _)
  val total_bytes = grouped_strings.map(t => t._1.getBytes.length * t._2).sum()
  val duplicate_bytes = grouped_strings.map(t => t._1.getBytes.length * (t._2 - 1)).sum()
  println(f"Total bytes for strings:           $total_bytes%15.0f")
  println(f"Total duplicate bytes for strings: $duplicate_bytes%15.0f")
  println(f"Useful bytes for strings:          ${total_bytes - duplicate_bytes}%15.0f")

  // Total bytes for strings:           45,724,033,674
  // Total duplicate bytes for strings: 41,630,588,801
  // Useful bytes for strings:           4,093,444,873
  // Useful is one order of magnitude less than total.

  // Triple-check useful bytes for strings:
  grouped_strings.map(_._1.getBytes.length).sum() == (total_bytes - duplicate_bytes) // true

  // How many unique strings?
  grouped_strings.count() // 98,524,732

  // How many strings with a single instance?
  grouped_strings.filter(t => t._2 == 1).count() // 72,584,179

  // Leaving 25,940,553 unique strings having multiple instances.
  // --> If we go for table-indirection, we'll need ~100M 4-byte indices,
  // --> i.e. ~400,000,000 bytes - one order of magnitude less than the unique-string size.

TASK DETAIL
https://phabricator.wikimedia.org/T217821
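The byte accounting above can be sketched without a Spark cluster, using plain Scala collections on toy data (the strings below are hypothetical stand-ins for the label/description/alias values; the grouping and byte arithmetic mirror the job's reduceByKey step):

```scala
// Hypothetical toy data standing in for label/description/alias strings.
val strings = Seq("cat", "cat", "dog", "chat", "chat", "chat")

// Count occurrences per distinct string, like map(s => (s, 1)).reduceByKey(_ + _).
val grouped = strings.groupBy(identity).map { case (s, occs) => (s, occs.size) }

// Total bytes stored if every occurrence is materialised separately.
val totalBytes = grouped.map { case (s, n) => s.getBytes("UTF-8").length * n }.sum

// Bytes wasted on the (n - 1) redundant copies of each string.
val duplicateBytes = grouped.map { case (s, n) => s.getBytes("UTF-8").length * (n - 1) }.sum

// Bytes needed if each distinct string is stored exactly once.
val usefulBytes = totalBytes - duplicateBytes

println(s"total=$totalBytes duplicate=$duplicateBytes useful=$usefulBytes")
```

On this toy input, useful equals the sum of each distinct string's length stored once, which is the same identity the "triple check" above verifies against the full dataset.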