JAllemandou added a comment.
Exact analysis ran on 2018-12-06:

  val df = spark.read.parquet("/user/joal/wmf/data/wmf/mediawiki/wikidata_parquet/20181001")
  val base_rdd = df.select("labels", "descriptions", "aliases").rdd
  val strings = base_rdd.flatMap(r => {
    r.getMap[String, String](0).values ++
    r.getMap[String, String](1).values ++
    r.getMap[String, Seq[String]](2).values.flatMap(l => l)
  })
  val grouped_strings = strings.map(s => (s, 1)).reduceByKey(_ + _)
  val total_bytes = grouped_strings.map(t => t._1.getBytes.length * t._2).sum()
  val duplicate_bytes = grouped_strings.map(t => t._1.getBytes.length * (t._2 - 1)).sum()
  println(f"Total bytes for strings:           $total_bytes%15.0f")
  println(f"Total duplicate bytes for strings: $duplicate_bytes%15.0f")
  println(f"Useful bytes for strings:          ${total_bytes - duplicate_bytes}%15.0f")

  // Total bytes for strings:           45,724,033,674
  // Total duplicate bytes for strings: 41,630,588,801
  // Useful bytes for strings:           4,093,444,873
  // Useful is one order of magnitude less than total.

  // Triple-check useful bytes for strings:
  grouped_strings.map(_._1.getBytes.length).sum() == (total_bytes - duplicate_bytes) // true

  // How many unique strings?
  grouped_strings.count() // 98,524,732

  // How many strings with a single instance?
  grouped_strings.filter(t => t._2 == 1).count() // 72,584,179

  // Leaving 25,940,553 unique strings having multiple instances.
  // --> If we go for table-indirection, we'll need ~100M 4-byte indices,
  // --> i.e. ~400,000,000 bytes - one order of magnitude less than the unique-string size.

TASK DETAIL
https://phabricator.wikimedia.org/T217821
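The byte accounting above can be sketched without a Spark cluster, using plain Scala collections on toy data (the strings below are hypothetical stand-ins for the label/description/alias values; the grouping and byte arithmetic mirror the job's reduceByKey step):

```scala
// Hypothetical toy data standing in for label/description/alias strings.
val strings = Seq("cat", "cat", "dog", "chat", "chat", "chat")

// Count occurrences per distinct string, like map(s => (s, 1)).reduceByKey(_ + _).
val grouped = strings.groupBy(identity).map { case (s, occs) => (s, occs.size) }

// Total bytes stored if every occurrence is materialised separately.
val totalBytes = grouped.map { case (s, n) => s.getBytes("UTF-8").length * n }.sum

// Bytes wasted on the (n - 1) redundant copies of each string.
val duplicateBytes = grouped.map { case (s, n) => s.getBytes("UTF-8").length * (n - 1) }.sum

// Bytes needed if each distinct string is stored exactly once.
val usefulBytes = totalBytes - duplicateBytes

println(s"total=$totalBytes duplicate=$duplicateBytes useful=$usefulBytes")
```

On this toy input, useful equals the sum of each distinct string's length stored once, which is the same identity the "triple check" above verifies against the full dataset.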