Hi, I need to build a system that finds the N (e.g. 10) most similar documents to a given document. I have some (theoretical) knowledge of Mahout algorithms, but not enough to build the system. Can you give me some suggestions?
At first I was researching Latent Semantic Analysis (LSA) for the task, but since Mahout doesn't support it, I started looking at other options. I got a hint that instead of LSA, you can use LDA (Latent Dirichlet Allocation) in Mahout to achieve similar or even better results.

This is where I got confused: LDA is a clustering algorithm, but I don't need to cluster the documents into N clusters. I need a matrix (similar to TF-IDF) from which I can calculate some sort of distance between any two documents, so that I can get the N most similar documents for any given document. How do I achieve that?

My idea (still mostly theoretical, since I have some problems running the LDA algorithm) was to extract some number of topics with LDA and then, rather than clustering the documents with the help of these topics, to get a matrix with documents as one dimension and topics as the other. I was guessing I could then use this matrix as input to the row-similarity algorithm. Is this the correct concept, or am I missing something?

And, since LDA is not supported on Spark/Samsara, how could I achieve similar results on Spark?

Thanks in advance,
David
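
P.S. To make the idea concrete, here is a minimal sketch in plain Python (not Mahout or Spark code) of what I have in mind. It assumes the document-topic matrix already exists; the topic weights below are made up for illustration, and `cosine` and `top_n_similar` are hypothetical helper names, not functions from any library. Cosine similarity is just one possible choice of measure.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length topic vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


def top_n_similar(doc_topic, doc_id, n):
    """Return up to n (doc_id, similarity) pairs, most similar first."""
    query = doc_topic[doc_id]
    scored = [
        (other_id, cosine(query, vector))
        for other_id, vector in doc_topic.items()
        if other_id != doc_id  # skip the query document itself
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


# Made-up document-topic matrix: one row per document, one column per topic.
doc_topic = {
    "doc_a": [0.80, 0.15, 0.05],
    "doc_b": [0.75, 0.20, 0.05],
    "doc_c": [0.05, 0.10, 0.85],
    "doc_d": [0.10, 0.80, 0.10],
}

neighbours = top_n_similar(doc_topic, "doc_a", 2)
for other_id, score in neighbours:
    print(other_id, round(score, 3))
```

This compares every document against the query, which is fine for a toy example but is exactly the part I would hope the row-similarity job handles at scale.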