[GitHub] spark pull request #16607: [SPARK-19247][ML] Save large word2vec models
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/16607
[GitHub] spark pull request #16607: [SPARK-19247][ML] Save large word2vec models
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/16607#discussion_r99263532

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala ---
@@ -302,16 +302,36 @@ class Word2VecModel private[ml] (

 @Since("1.6.0")
 object Word2VecModel extends MLReadable[Word2VecModel] {

+  private case class Data(word: String, vector: Array[Float])
+
   private[Word2VecModel] class Word2VecModelWriter(instance: Word2VecModel) extends MLWriter {

-    private case class Data(wordIndex: Map[String, Int], wordVectors: Seq[Float])
-
     override protected def saveImpl(path: String): Unit = {
       DefaultParamsWriter.saveMetadata(instance, path, sc)
-      val data = Data(instance.wordVectors.wordIndex, instance.wordVectors.wordVectors.toSeq)
+
+      val wordVectors = instance.wordVectors.getVectors
+      val dataArray = wordVectors.toSeq.map { case (word, vector) => Data(word, vector) }.toArray
--- End diff --

No need to convert back to an Array.
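For reference, the suggested simplification is a one-line change: createDataFrame accepts a Seq of case-class instances directly, so the trailing .toArray materialization adds nothing. A minimal sketch, assuming the Data case class, wordVectors, dataPath, and sparkSession from the diff above:

    // Sketch only: createDataFrame works on a Seq of Products,
    // so no intermediate Array is needed before writing to Parquet.
    val dataSeq = wordVectors.toSeq.map { case (word, vector) => Data(word, vector) }
    sparkSession.createDataFrame(dataSeq).write.parquet(dataPath)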
[GitHub] spark pull request #16607: [SPARK-19247][ML] Save large word2vec models
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/16607#discussion_r99263525

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala ---
@@ -320,14 +340,29 @@ object Word2VecModel extends MLReadable[Word2VecModel] {

   private val className = classOf[Word2VecModel].getName

   override def load(path: String): Word2VecModel = {
+    val spark = sparkSession
+    import spark.implicits._
+
     val metadata = DefaultParamsReader.loadMetadata(path, sc, className)
+    val (major, minor) = VersionUtils.majorMinorVersion(metadata.sparkVersion)
+
     val dataPath = new Path(path, "data").toString
-    val data = sparkSession.read.parquet(dataPath)
-      .select("wordIndex", "wordVectors")
-      .head()
-    val wordIndex = data.getAs[Map[String, Int]](0)
-    val wordVectors = data.getAs[Seq[Float]](1).toArray
-    val oldModel = new feature.Word2VecModel(wordIndex, wordVectors)
+
+    val oldModel = if (major.toInt < 2 || (major.toInt == 2 && minor.toInt < 2)) {
--- End diff --

major, minor are already Ints.
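VersionUtils.majorMinorVersion returns the two components as an (Int, Int) pair, so the comparison can drop the conversions. A minimal sketch of the corrected check, reusing the metadata value from the diff above:

    // Sketch only: major and minor are already Ints, so no .toInt is needed.
    val (major, minor) = VersionUtils.majorMinorVersion(metadata.sparkVersion)
    val isPre22Model = major < 2 || (major == 2 && minor < 2)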
[GitHub] spark pull request #16607: [SPARK-19247][ML] Save large word2vec models
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/16607#discussion_r99259617

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala ---
@@ -18,10 +18,9 @@ package org.apache.spark.ml.feature

 import org.apache.hadoop.fs.Path
-
--- End diff --

Keep newline between non-spark and spark imports.
[GitHub] spark pull request #16607: [SPARK-19247][ML] Save large word2vec models
Github user Krimit commented on a diff in the pull request: https://github.com/apache/spark/pull/16607#discussion_r96672243

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala ---
@@ -320,14 +341,29 @@ object Word2VecModel extends MLReadable[Word2VecModel] {

   private val className = classOf[Word2VecModel].getName

   override def load(path: String): Word2VecModel = {
+    val spark = sparkSession
+    import spark.implicits._
+
     val metadata = DefaultParamsReader.loadMetadata(path, sc, className)
+
     val dataPath = new Path(path, "data").toString
-    val data = sparkSession.read.parquet(dataPath)
-      .select("wordIndex", "wordVectors")
-      .head()
-    val wordIndex = data.getAs[Map[String, Int]](0)
-    val wordVectors = data.getAs[Seq[Float]](1).toArray
-    val oldModel = new feature.Word2VecModel(wordIndex, wordVectors)
+    val rawData = spark.read.parquet(dataPath)
+
+    val oldModel = if (rawData.columns.contains("wordIndex")) {
--- End diff --

@jkbradley - Please see https://issues.apache.org/jira/browse/SPARK-15573; I left a comment there on sniffing model versions and am curious to hear your opinion. I'll follow the ``SaveLoadV1_0`` version pattern if you think it's best.
[GitHub] spark pull request #16607: [SPARK-19247][ML] Save large word2vec models
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/16607#discussion_r96524580

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala ---
@@ -320,14 +341,29 @@ object Word2VecModel extends MLReadable[Word2VecModel] {

   private val className = classOf[Word2VecModel].getName

   override def load(path: String): Word2VecModel = {
+    val spark = sparkSession
+    import spark.implicits._
+
     val metadata = DefaultParamsReader.loadMetadata(path, sc, className)
+
     val dataPath = new Path(path, "data").toString
-    val data = sparkSession.read.parquet(dataPath)
-      .select("wordIndex", "wordVectors")
-      .head()
-    val wordIndex = data.getAs[Map[String, Int]](0)
-    val wordVectors = data.getAs[Seq[Float]](1).toArray
-    val oldModel = new feature.Word2VecModel(wordIndex, wordVectors)
+    val rawData = spark.read.parquet(dataPath)
+
+    val oldModel = if (rawData.columns.contains("wordIndex")) {
--- End diff --

@Krimit You're right that the versioned SaveLoad code was in spark.mllib only. There isn't a standard to follow yet for spark.ml. I believe that relying on the Spark version is currently the best option.
[GitHub] spark pull request #16607: [SPARK-19247][ML] Save large word2vec models
Github user Krimit commented on a diff in the pull request: https://github.com/apache/spark/pull/16607#discussion_r96328450

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala ---
@@ -320,14 +341,29 @@ object Word2VecModel extends MLReadable[Word2VecModel] {

   private val className = classOf[Word2VecModel].getName

   override def load(path: String): Word2VecModel = {
+    val spark = sparkSession
+    import spark.implicits._
+
     val metadata = DefaultParamsReader.loadMetadata(path, sc, className)
+
     val dataPath = new Path(path, "data").toString
-    val data = sparkSession.read.parquet(dataPath)
-      .select("wordIndex", "wordVectors")
-      .head()
-    val wordIndex = data.getAs[Map[String, Int]](0)
-    val wordVectors = data.getAs[Seq[Float]](1).toArray
-    val oldModel = new feature.Word2VecModel(wordIndex, wordVectors)
+    val rawData = spark.read.parquet(dataPath)
+
+    val oldModel = if (rawData.columns.contains("wordIndex")) {
--- End diff --

I'd only ever seen ``SaveLoadV1_0`` used in MLlib; is it still the preferred way to mark versions? In ml land I've seen things like relying on the Spark version instead: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala#L981, https://github.com/apache/spark/blob/f830bb9170f6b853565d9dd30ca7418b93a54fe3/mllib/src/main/scala/org/apache/spark/ml/feature/PCA.scala#L210, https://github.com/apache/spark/blob/7db09abb0168b77697064c69126ee82ca89609a0/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala#L234. I don't really like that approach in this case, since it relies on something extraneous and makes it difficult to backport.
[GitHub] spark pull request #16607: [SPARK-19247][ML] Save large word2vec models
Github user Krimit commented on a diff in the pull request: https://github.com/apache/spark/pull/16607#discussion_r96327575

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala ---
@@ -302,16 +303,36 @@ class Word2VecModel private[ml] (

 @Since("1.6.0")
 object Word2VecModel extends MLReadable[Word2VecModel] {

+  private case class Data(word: String, vector: Seq[Float])
+
   private[Word2VecModel] class Word2VecModelWriter(instance: Word2VecModel) extends MLWriter {

-    private case class Data(wordIndex: Map[String, Int], wordVectors: Seq[Float])
-
     override protected def saveImpl(path: String): Unit = {
       DefaultParamsWriter.saveMetadata(instance, path, sc)
-      val data = Data(instance.wordVectors.wordIndex, instance.wordVectors.wordVectors.toSeq)
+
+      val wordVectors = instance.wordVectors.getVectors
+      val dataArray = wordVectors.toSeq.map { case (word, vector) => Data(word, vector) }
       val dataPath = new Path(path, "data").toString
-      sparkSession.createDataFrame(Seq(data)).repartition(1).write.parquet(dataPath)
+      sparkSession.createDataFrame(dataArray)
+        .repartition(calculateNumberOfPartitions)
+        .write
+        .parquet(dataPath)
+    }
+
+    val FloatSize = 4
+    val AverageWordSize = 15
+    def calculateNumberOfPartitions(): Int = {
+      // [SPARK-11994] - We want to partition the model in partitions smaller than
+      // spark.kryoserializer.buffer.max
+      val bufferSizeInBytes = Utils.byteStringAsBytes(
+        sc.conf.get("spark.kryoserializer.buffer.max", "64m"))
+      // Calculate the approximate size of the model.
+      // Assuming an average word size of 15 bytes, the formula is:
+      // (floatSize * vectorSize + 15) * numWords
+      val numWords = instance.wordVectors.wordIndex.size
+      val approximateSizeInBytes = (FloatSize * instance.getVectorSize + AverageWordSize) * numWords
+      ((approximateSizeInBytes / bufferSizeInBytes) + 1).toInt
--- End diff --

This is basically copied from here: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala#L661-L671. Could you please clarify what you mean by rounding it?
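For context, the difference between the two formulations is small: integer division plus one always rounds up, but it overshoots by a whole partition when the model size is an exact multiple of the buffer size. A sketch of the ceiling-based alternative, reusing the vals from the diff above; this is one reading of the suggestion, not the code that was merged:

    // Sketch only: math.ceil gives a true ceiling division, whereas
    // (a / b) + 1 yields one extra partition when a divides evenly by b.
    val numPartitions = math.ceil(approximateSizeInBytes.toDouble / bufferSizeInBytes).toInt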
[GitHub] spark pull request #16607: [SPARK-19247][ML] Save large word2vec models
Github user Krimit commented on a diff in the pull request: https://github.com/apache/spark/pull/16607#discussion_r96327379

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala ---
@@ -302,16 +303,36 @@ class Word2VecModel private[ml] (

 @Since("1.6.0")
 object Word2VecModel extends MLReadable[Word2VecModel] {

+  private case class Data(word: String, vector: Seq[Float])
+
   private[Word2VecModel] class Word2VecModelWriter(instance: Word2VecModel) extends MLWriter {

-    private case class Data(wordIndex: Map[String, Int], wordVectors: Seq[Float])
-
     override protected def saveImpl(path: String): Unit = {
       DefaultParamsWriter.saveMetadata(instance, path, sc)
-      val data = Data(instance.wordVectors.wordIndex, instance.wordVectors.wordVectors.toSeq)
+
+      val wordVectors = instance.wordVectors.getVectors
+      val dataArray = wordVectors.toSeq.map { case (word, vector) => Data(word, vector) }
       val dataPath = new Path(path, "data").toString
-      sparkSession.createDataFrame(Seq(data)).repartition(1).write.parquet(dataPath)
+      sparkSession.createDataFrame(dataArray)
+        .repartition(calculateNumberOfPartitions)
+        .write
+        .parquet(dataPath)
+    }
+
+    val FloatSize = 4
--- End diff --

I was trying to follow the Scala naming conventions for constants (http://docs.scala-lang.org/style/naming-conventions.html), which to my understanding state that constants should be UpperCamelCase. Coming from Java, I was looking for the equivalent of ``private static final float FLOAT_SIZE``. Happy to just use local vals if that's more idiomatic.
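The two conventions under discussion, side by side; a sketch with illustrative names and values only (Word2VecWriterConstants and approximateSizeInBytes are hypothetical, not from the PR):

    object Word2VecWriterConstants {
      // Object-level constant: UpperCamelCase per the Scala style guide.
      val FloatSize = 4
    }

    def approximateSizeInBytes(vectorSize: Int, numWords: Long): Long = {
      // Method-local values: plain lowerCamelCase is idiomatic for these.
      val floatSize = 4
      val averageWordSize = 15
      (floatSize.toLong * vectorSize + averageWordSize) * numWords
    }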
[GitHub] spark pull request #16607: [SPARK-19247][ML] Save large word2vec models
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/16607#discussion_r96307400

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala ---
@@ -320,14 +341,29 @@ object Word2VecModel extends MLReadable[Word2VecModel] {

   private val className = classOf[Word2VecModel].getName

   override def load(path: String): Word2VecModel = {
+    val spark = sparkSession
+    import spark.implicits._
+
     val metadata = DefaultParamsReader.loadMetadata(path, sc, className)
+
     val dataPath = new Path(path, "data").toString
-    val data = sparkSession.read.parquet(dataPath)
-      .select("wordIndex", "wordVectors")
-      .head()
-    val wordIndex = data.getAs[Map[String, Int]](0)
-    val wordVectors = data.getAs[Seq[Float]](1).toArray
-    val oldModel = new feature.Word2VecModel(wordIndex, wordVectors)
+    val rawData = spark.read.parquet(dataPath)
+
+    val oldModel = if (rawData.columns.contains("wordIndex")) {
--- End diff --

Have a look for `SaveLoadV1_0` elsewhere in the code. I think there's a different standard approach to versioning. I am not so familiar with it, but you can see who's written it with git. Maybe @jkbradley?
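For readers unfamiliar with the convention being referenced: in spark.mllib, each on-disk format is handled by a private versioned object, and the loader dispatches on the class name and format version stored in the saved metadata. A rough sketch of the shape of that pattern; all names here are illustrative, not the actual mllib code:

    private object SaveLoadV1_0 {
      // Version tag written into the metadata alongside the class name.
      val thisFormatVersion = "1.0"

      def load(sc: SparkContext, path: String): feature.Word2VecModel = {
        // Read the v1.0 on-disk layout here; a newer format would get its
        // own SaveLoadV2_0 object and a dispatch branch in the main loader.
        ???
      }
    }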
[GitHub] spark pull request #16607: [SPARK-19247][ML] Save large word2vec models
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/16607#discussion_r96307257

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala ---
@@ -302,16 +303,36 @@ class Word2VecModel private[ml] (

 @Since("1.6.0")
 object Word2VecModel extends MLReadable[Word2VecModel] {

+  private case class Data(word: String, vector: Seq[Float])
+
   private[Word2VecModel] class Word2VecModelWriter(instance: Word2VecModel) extends MLWriter {

-    private case class Data(wordIndex: Map[String, Int], wordVectors: Seq[Float])
-
     override protected def saveImpl(path: String): Unit = {
       DefaultParamsWriter.saveMetadata(instance, path, sc)
-      val data = Data(instance.wordVectors.wordIndex, instance.wordVectors.wordVectors.toSeq)
+
+      val wordVectors = instance.wordVectors.getVectors
+      val dataArray = wordVectors.toSeq.map { case (word, vector) => Data(word, vector) }
       val dataPath = new Path(path, "data").toString
-      sparkSession.createDataFrame(Seq(data)).repartition(1).write.parquet(dataPath)
+      sparkSession.createDataFrame(dataArray)
+        .repartition(calculateNumberOfPartitions)
+        .write
+        .parquet(dataPath)
+    }
+
+    val FloatSize = 4
+    val AverageWordSize = 15
+    def calculateNumberOfPartitions(): Int = {
+      // [SPARK-11994] - We want to partition the model in partitions smaller than
+      // spark.kryoserializer.buffer.max
+      val bufferSizeInBytes = Utils.byteStringAsBytes(
+        sc.conf.get("spark.kryoserializer.buffer.max", "64m"))
+      // Calculate the approximate size of the model.
+      // Assuming an average word size of 15 bytes, the formula is:
+      // (floatSize * vectorSize + 15) * numWords
+      val numWords = instance.wordVectors.wordIndex.size
+      val approximateSizeInBytes = (FloatSize * instance.getVectorSize + AverageWordSize) * numWords
+      ((approximateSizeInBytes / bufferSizeInBytes) + 1).toInt
--- End diff --

Just round it?
[GitHub] spark pull request #16607: [SPARK-19247][ML] Save large word2vec models
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/16607#discussion_r96307241

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala ---
@@ -302,16 +303,36 @@ class Word2VecModel private[ml] (

 @Since("1.6.0")
 object Word2VecModel extends MLReadable[Word2VecModel] {

+  private case class Data(word: String, vector: Seq[Float])
+
   private[Word2VecModel] class Word2VecModelWriter(instance: Word2VecModel) extends MLWriter {

-    private case class Data(wordIndex: Map[String, Int], wordVectors: Seq[Float])
-
     override protected def saveImpl(path: String): Unit = {
       DefaultParamsWriter.saveMetadata(instance, path, sc)
-      val data = Data(instance.wordVectors.wordIndex, instance.wordVectors.wordVectors.toSeq)
+
+      val wordVectors = instance.wordVectors.getVectors
+      val dataArray = wordVectors.toSeq.map { case (word, vector) => Data(word, vector) }
       val dataPath = new Path(path, "data").toString
-      sparkSession.createDataFrame(Seq(data)).repartition(1).write.parquet(dataPath)
+      sparkSession.createDataFrame(dataArray)
+        .repartition(calculateNumberOfPartitions)
+        .write
+        .parquet(dataPath)
+    }
+
+    val FloatSize = 4
--- End diff --

Nit: camelCase here, like floatSize. These can be local variables?
[GitHub] spark pull request #16607: [SPARK-19247][ML] Save large word2vec models
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/16607#discussion_r96307444

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala ---
@@ -320,14 +341,29 @@ object Word2VecModel extends MLReadable[Word2VecModel] {

   private val className = classOf[Word2VecModel].getName

   override def load(path: String): Word2VecModel = {
+    val spark = sparkSession
+    import spark.implicits._
+
     val metadata = DefaultParamsReader.loadMetadata(path, sc, className)
+
     val dataPath = new Path(path, "data").toString
-    val data = sparkSession.read.parquet(dataPath)
-      .select("wordIndex", "wordVectors")
-      .head()
-    val wordIndex = data.getAs[Map[String, Int]](0)
-    val wordVectors = data.getAs[Seq[Float]](1).toArray
-    val oldModel = new feature.Word2VecModel(wordIndex, wordVectors)
+    val rawData = spark.read.parquet(dataPath)
+
+    val oldModel = if (rawData.columns.contains("wordIndex")) {
+      val data = rawData
+        .select("wordIndex", "wordVectors")
+        .head()
+      val wordIndex = data.getAs[Map[String, Int]](0)
+      val wordVectors = data.getAs[Seq[Float]](1).toArray
+      new feature.Word2VecModel(wordIndex, wordVectors)
+    } else {
+      val wordVectorsMap: Map[String, Array[Float]] = rawData.as[Data]
--- End diff --

Type isn't needed here, nor in general on locals.
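The shape of the suggested change; the tail of the expression is truncated in the diff above, so the continuation here is illustrative only:

    // Sketch only: drop the annotation and the compiler infers
    // Map[String, Array[Float]] (continuation lines are assumed, not
    // from the PR; requires the Data case class and spark.implicits._).
    val wordVectorsMap = rawData.as[Data]
      .collect()
      .map(datum => (datum.word, datum.vector.toArray))
      .toMap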