I'm running into an error that doesn't make much sense to me, and I
couldn't find enough information on the web to answer it myself. By the
way, you can also reply on Stack Overflow:
http://stackoverflow.com/questions/36254005/nosuchelementexception-in-chisqselector-fit-method-version-1-6-0

I've written code to generate a list of (String, ArrayBuffer[String]) pairs
and then use HashingTF to convert the features column to vectors (because
it's for NLP research on parsing, where I end up with a very large number
of unique features; long story). Then I convert the string labels using
StringIndexer. I get the "key not found" error when running
ChiSqSelector.fit on the training data, and the stack trace points to a
hashmap lookup on labels in ChiSqTest. This struck me as strange: I could
imagine I was using it wrong and had somehow failed to account for unseen
labels -- except this is the fit method, on the training data itself.

Anyway, here's the interesting bit of my code, followed by the important
part of the stack trace. Any help would be very much appreciated!


    val parSdp = sc.parallelize(sdp.take(10)) // it dies on a small amount of data
    val insts: RDD[(String, ArrayBuffer[String])] =
        parSdp.flatMap(x=> TrainTest.transformGraphSpark(x))
    
    val indexer = new StringIndexer()
        .setInputCol("labels")
        .setOutputCol("labelIndex")
    
    val instDF = sqlContext.createDataFrame(insts)
        .toDF("labels","feats")
    val hash = new HashingTF()
        .setInputCol("feats")
        .setOutputCol("hashedFeats")
        .setNumFeatures(1000000)
    val readyDF = hash.transform(indexer
        .fit(instDF)
        .transform(instDF))
    
    val selector = new ChiSqSelector()
        .setNumTopFeatures(100)
        .setFeaturesCol("hashedFeats")
        .setLabelCol("labelIndex")
        .setOutputCol("selectedFeatures")
        
    val Array(training, dev, test) = readyDF.randomSplit(Array(0.8, 0.1, 0.1), seed = 12345)
    
    val chisq = selector.fit(training)

And the stack trace:

    java.util.NoSuchElementException: key not found: 23.0

        at scala.collection.MapLike$class.default(MapLike.scala:228)
        at scala.collection.AbstractMap.default(Map.scala:58)
        at scala.collection.MapLike$class.apply(MapLike.scala:141)
        at scala.collection.AbstractMap.apply(Map.scala:58)
        at org.apache.spark.mllib.stat.test.ChiSqTest$$anonfun$chiSquaredFeatures$4$$anonfun$apply$4.apply(ChiSqTest.scala:131)
        at org.apache.spark.mllib.stat.test.ChiSqTest$$anonfun$chiSquaredFeatures$4$$anonfun$apply$4.apply(ChiSqTest.scala:129)
        at scala.collection.immutable.HashSet$HashSet1.foreach(HashSet.scala:153)
        at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:306)
        at org.apache.spark.mllib.stat.test.ChiSqTest$$anonfun$chiSquaredFeatures$4.apply(ChiSqTest.scala:129)
        at org.apache.spark.mllib.stat.test.ChiSqTest$$anonfun$chiSquaredFeatures$4.apply(ChiSqTest.scala:125)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:224)
        at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
        at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
        at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
        at scala.collection.AbstractTraversable.map(Traversable.scala:105)
        at org.apache.spark.mllib.stat.test.ChiSqTest$.chiSquaredFeatures(ChiSqTest.scala:125)
        at org.apache.spark.mllib.stat.Statistics$.chiSqTest(Statistics.scala:176)
        at org.apache.spark.mllib.feature.ChiSqSelector.fit(ChiSqSelector.scala:193)
        at org.apache.spark.ml.feature.ChiSqSelector.fit(ChiSqSelector.scala:86)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:89)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:122)
        ... etc etc
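For what it's worth, the bottom of the trace is just the standard Scala collections behavior: `MapLike.default` throws when `apply` is called with a key the map doesn't contain. So somewhere inside ChiSqTest a label-to-index map is being consulted for a label value (23.0) that wasn't there when the map was built. A tiny plain-Scala illustration (no Spark involved; the label values here are made up) of the same exception:

```scala
// Hypothetical label->index map, as if only labels 0.0 and 1.0 had been seen.
val labels = Map(0.0 -> 0, 1.0 -> 1)

val ok = labels(1.0) // fine: key exists, returns 1

// labels(23.0) throws java.util.NoSuchElementException: key not found: 23.0,
// exactly the exception at the top of the trace above.
val thrown =
  try { labels(23.0); false }
  catch { case _: NoSuchElementException => true }

println(s"ok=$ok thrown=$thrown")
```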

I also realized that by making the size of sdp.take above larger (100), I
get a different error:

    java.lang.IllegalArgumentException: Chi-squared statistic undefined for input matrix due to0 sum in column [4].
        at org.apache.spark.mllib.stat.test.ChiSqTest$.chiSquaredMatrix(ChiSqTest.scala:229)
        at org.apache.spark.mllib.stat.test.ChiSqTest$$anonfun$chiSquaredFeatures$4.apply(ChiSqTest.scala:134)
        at org.apache.spark.mllib.stat.test.ChiSqTest$$anonfun$chiSquaredFeatures$4.apply(ChiSqTest.scala:125)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:224)
        at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
        at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
        at scala.collection.AbstractTraversable.map(Traversable.scala:105)
        at org.apache.spark.mllib.stat.test.ChiSqTest$.chiSquaredFeatures(ChiSqTest.scala:125)
        at org.apache.spark.mllib.stat.Statistics$.chiSqTest(Statistics.scala:176)
        at org.apache.spark.mllib.feature.ChiSqSelector.fit(ChiSqSelector.scala:193)
        at org.apache.spark.ml.feature.ChiSqSelector.fit(ChiSqSelector.scala:86)
        at $iwC$$iwC.<init>(<console>:96)
        at $iwC.<init>(<console>:130)
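This second error at least has a clearer mechanism, going by its message: the chi-squared statistic needs expected counts E = rowSum * colSum / total for each cell of the contingency table, and if some column of observed counts sums to zero, E is zero there and the (O - E)^2 / E terms are undefined, so Spark presumably refuses rather than divide by zero. A plain-Scala sketch of that zero-column condition (toy counts I made up, not Spark's actual code):

```scala
// Toy 2x2 contingency table of observed counts; the second column is all
// zeros, which is the condition the IllegalArgumentException complains about.
val observed = Array(
  Array(3.0, 0.0),
  Array(5.0, 0.0)
)

// Column sums: a zero here means expected counts E = rowSum * colSum / total
// would be zero for that column, leaving (O - E)^2 / E undefined.
val colSums = observed.transpose.map(_.sum)
val hasZeroColumn = colSums.exists(_ == 0.0)

println(s"colSums=${colSums.mkString(",")} hasZeroColumn=$hasZeroColumn")
```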





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/NoSuchElementException-in-ChiSqSelector-fit-method-version-1-6-0-tp26614.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
