Hey there, I'm using Solr for my thesis, where I have to implement a content-based recommender system for movies.
I have indexed about 20,000 movies with the following information:

- movie-id
- title
- genre
- plot/movie description  <- !!!
- cast

I've enabled the TermVectorComponent for the fields genre, description and cast, so I can get the tf-idf values for the terms of every movie. With these term/tf-idf pairs I have to compute the similarities between movies using the cosine similarity. I know about the Solr feature MLT (MoreLikeThis), but that's not the solution; I have to implement the cosine similarity in Java myself.

Now I have some problems/questions:

I get the responses in XML format, which I read with an XML reader in Java that walks through every child node in order to reach the right node. Is there a better way to get these values from the node attributes or node texts? I have tried wt=csv, but for those requests I get responses with only the movie IDs, nothing more.

With the XML response writer my request is, for example:

http://localhost:8983/solr/select/?qt=tvrh&q=id:1800180382&fl=id&tv.tf_idf=true

I get the right response with all terms and tf-idf values, in XML. If I add the CSV response format:

http://localhost:8983/solr/select/?qt=tvrh&q=id:1800180382&fl=id&tv.tf_idf=true&wt=csv

I get only this:

id
1800180382

Maybe my request is wrong?

Another problem: when I get the terms and their tf-idf values, I store them in a map, but the values are not in any particular order. I want to store, e.g., only the 10 top terms, i.e. the 10 terms with the highest tf-idf values. Can I have Solr return them in descending order? I haven't found anything for that. If it's not possible, I will have to sort them later in the map.

My last question: every movie has a genre - often more than one. It's like the "cat" field (category) in the exampledocs with ipod/monitor etc., and it's an important point for the movies. How can I integrate this factor?
I changed the boost attribute in the Solr XML schema like this:

<field name="genre" type="string" indexed="true" stored="true" multiValued="true" omitNorms="false" boost="3" termVectors="true" termPositions="true" termOffsets="true"/>

Is that enough, or is there any other possibility?

As you can probably see, I am a beginner in Solr. When I started a few weeks ago it was even more difficult for me, but now it is going better. I would be very grateful for any help, ideas, tips or suggestions!

Many regards
Nejla
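
For reference, here is a minimal sketch of the two client-side steps described above: the cosine similarity between two movies and the selection of the 10 terms with the highest tf-idf values. It assumes the term/tf-idf pairs of each movie have already been read from the TermVectorComponent response into a Map<String, Double>. The class and method names (TermVectorSimilarity, cosine, topTerms) are hypothetical and not part of Solr or SolrJ, and the sketch does not yet give the genre any extra weight.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class; operates only on already-parsed term -> tf-idf maps.
final class TermVectorSimilarity {

    /** Cosine similarity of two sparse term -> tf-idf vectors; 0.0 if either vector has zero length. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        // Iterate over the smaller map; terms missing from the other map contribute 0 to the dot product.
        Map<String, Double> small = a.size() <= b.size() ? a : b;
        Map<String, Double> large = (small == a) ? b : a;
        double dot = 0.0;
        for (Map.Entry<String, Double> e : small.entrySet()) {
            Double other = large.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / (normA * normB);
    }

    /** Euclidean length of a sparse vector. */
    private static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double w : v.values()) {
            sum += w * w;
        }
        return Math.sqrt(sum);
    }

    /** Returns the n entries with the highest tf-idf values, in descending order of value. */
    static Map<String, Double> topTerms(Map<String, Double> v, int n) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(v.entrySet());
        entries.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        // LinkedHashMap keeps insertion order, so iteration follows the sorted order.
        Map<String, Double> top = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : entries.subList(0, Math.min(n, entries.size()))) {
            top.put(e.getKey(), e.getValue());
        }
        return top;
    }
}
```

A call such as TermVectorSimilarity.cosine(topTerms(movieA, 10), topTerms(movieB, 10)) would then compare two movies on their 10 strongest terms only. Whether truncating to the top terms before comparing is appropriate for the recommender is an open design choice.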