Hi there,

I'm using spark-mllib_2.11-2.1.0 and facing an issue where
BucketedRandomProjectionLSHModel.approxNearestNeighbors always returns only
one result.

Dataset looks like:

+----+--------------------+-------------+------------------------+----------------------+
|  id|            features|kmeansCluster|predictionVectorFeatures|featuresInNewDimension|
+----+--------------------+-------------+------------------------+----------------------+
|1045|(16384,[196,11016...|            0|    (16384,[196],[0.2...|  [[0.0], [0.0], [0...|
|1041|(16384,[4110,1065...|            0|    (16384,[196],[0.2...|  [[0.0], [0.0], [-...|
+----+--------------------+-------------+------------------------+----------------------+
Execution code:

Dataset<Row> approximatedDS = (Dataset<Row>)
    ((BucketedRandomProjectionLSHModel) model)
        .approxNearestNeighbors(dataset, vectorToCalculateAgainst, numberOfResults,
            false, MLFlowConstants.THEMES_PREDICTION_COLUMNS.distance.name());
Where:

numberOfResults = 2
vectorToCalculateAgainst = the first vector in the predictionVectorFeatures column

approximatedDS looks as follows:

+----+--------------------+-------------+------------------------+----------------------+------------------+
|  id|            features|kmeansCluster|predictionVectorFeatures|featuresInNewDimension|          distance|
+----+--------------------+-------------+------------------------+----------------------+------------------+
|1061|(16384,[196,11016...|            1|    (16384,[196],[0.2...|  [[0.0], [0.0], [0...|0.8536603178950374|
+----+--------------------+-------------+------------------------+----------------------+------------------+
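For context on what the hash distance below is computed from: as far as I
understand, BucketedRandomProjectionLSH hashes each vector as
floor((x . v) / bucketLength) for a random unit vector v. Here is a minimal,
Spark-free plain-Java sketch of that bucketing (the unit vector, bucket
length, and sample points are made-up values, not taken from my dataset):

```java
public class BrpHashSketch {
    // One random-projection hash: floor(dot(x, v) / bucketLength).
    static int hash(double[] x, double[] v, double bucketLength) {
        double dot = 0.0;
        for (int i = 0; i < x.length; i++) {
            dot += x[i] * v[i];
        }
        return (int) Math.floor(dot / bucketLength);
    }

    public static void main(String[] args) {
        double[] unitVector = {0.6, 0.8}; // assumed random unit vector
        double bucketLength = 2.0;        // assumed bucket length
        double[] key  = {1.0, 1.0};
        double[] near = {1.2, 0.9};
        double[] far  = {9.0, 7.0};
        // Nearby points should land in the same bucket as the key;
        // distant points should land in a different bucket.
        System.out.println("key bucket:  " + hash(key, unitVector, bucketLength));
        System.out.println("near bucket: " + hash(near, unitVector, bucketLength));
        System.out.println("far bucket:  " + hash(far, unitVector, bucketLength));
    }
}
```

With these made-up values the key and the nearby point both hash to bucket 0
while the distant point hashes to bucket 5, which matches my understanding of
how candidates get pre-selected before the distance column is computed.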
I have a suspicion that in LSH.scala:

  // Compute threshold to get exact k elements.
  // TODO: SPARK-18409: Use approxQuantile to get the threshold
  val modelDatasetSortedByHash = modelDataset.sort(hashDistCol).limit(numNearestNeighbors)
  val thresholdDataset = modelDatasetSortedByHash.select(max(hashDistCol))
  val hashThreshold = thresholdDataset.take(1).head.getDouble(0)

  // Filter the dataset where the hash value is less than the threshold.
  modelDataset.filter(hashDistCol <= hashThreshold)
}
the last filter does the wrong filtering, but I may be wrong (I do not know Scala).
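To sanity-check that suspicion without knowing Scala, the threshold logic can
be replayed on plain doubles. This is a hedged, Spark-free re-implementation
of the three lines above (the hash distances are made-up values); it suggests
the filter by itself keeps at least numNearestNeighbors candidates, and ties
can only keep more, so the single result may come from somewhere else:

```java
import java.util.Arrays;

public class ThresholdFilterSketch {
    // Mirrors the quoted LSH.scala logic: sort by hash distance, take the
    // first k, set the threshold to the max of those k, then keep every row
    // whose hash distance is <= threshold.
    static long candidatesKept(double[] hashDistances, int k) {
        double[] sorted = hashDistances.clone();
        Arrays.sort(sorted);
        double threshold = sorted[Math.min(k, sorted.length) - 1];
        return Arrays.stream(hashDistances).filter(d -> d <= threshold).count();
    }

    public static void main(String[] args) {
        double[] hashDist = {0.0, 1.0, 1.0, 2.0, 3.0}; // made-up hash distances
        // k = 2 -> threshold = 1.0, and the tie at 1.0 keeps 3 candidates.
        System.out.println(candidatesKept(hashDist, 2));
    }
}
```

If this mirrors the real code faithfully, the candidate set entering the final
distance sort should never shrink below k, which makes the always-one-result
behavior even more puzzling to me.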

Can anyone help me understand how to make
BucketedRandomProjectionLSHModel.approxNearestNeighbors return multiple
"nearest" vectors?

Thanks,



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/BucketedRandomProjectionLSHModel-algorithm-details-tp28578.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
