Solr and TF-IDF

Nejla Karacan Thu, 26 Jan 2012 09:19:23 -0800

Hey there,

I'm using Solr for my thesis, where I have to implement a content-based
recommender system for movies.


I have indexed about 20,000 movies with the following information:
movie-id
title
genre
plot/movie-description <- !!!
cast

I've enabled the TermVectorComponent for the fields genre, description and
cast, so I can get the tf-idf values for the terms of every movie.

With these term/tf-idf pairs I have to compute the similarities
between movies using cosine similarity.
I know about the Solr feature MLT (MoreLikeThis), but that's not the
solution here; I have to implement the cosine similarity in Java myself.
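
To make the computation concrete, here is a minimal sketch of cosine
similarity over two sparse term vectors, each held as a map from term to
tf-idf value. The class and method names are hypothetical and the sample
terms and weights are made up; the formula is the standard
dot(a, b) / (|a| * |b|).

```java
import java.util.HashMap;
import java.util.Map;

public class CosineSimilarity {

    // Cosine similarity of two sparse vectors stored as term -> tf-idf weight.
    public static double cosine(Map<String, Double> a, Map<String, Double> b) {
        // Only terms present in both maps contribute to the dot product,
        // so iterate over the smaller map and look up in the larger one.
        Map<String, Double> small = a.size() <= b.size() ? a : b;
        Map<String, Double> large = (small == a) ? b : a;
        double dot = 0.0;
        for (Map.Entry<String, Double> e : small.entrySet()) {
            Double other = large.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        // An empty or all-zero vector has no direction; treat it as "not similar".
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (normA * normB);
    }

    // Euclidean length of a sparse vector.
    private static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double w : v.values()) {
            sum += w * w;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        // Made-up weights for two hypothetical movies.
        Map<String, Double> movieA = new HashMap<>();
        movieA.put("space", 0.8);
        movieA.put("alien", 0.6);
        Map<String, Double> movieB = new HashMap<>();
        movieB.put("space", 0.5);
        movieB.put("romance", 0.9);
        System.out.println(cosine(movieA, movieB));
    }
}
```

Because tf-idf values are non-negative, the result lies between 0 (no
shared terms) and 1 (same direction).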

Now I have some problems/questions:
I get the responses in XML format, which I read with an XML reader in
Java that walks through every child node in order to reach the right one.
Is there a better way to get these values from the node attributes or
node texts?
I have tried wt=csv, but for those requests I get
responses containing only the movie IDs, nothing more.
With the XML response writer, my request looks like this, for example:
http://localhost:8983/solr/select/?qt=tvrh&q=id:1800180382&fl=id&tv.tf_idf=true
I get the right response with all terms and their tf-idf values, in XML.

If I add wt=csv to the request:
http://localhost:8983/solr/select/?qt=tvrh&q=id:1800180382&fl=id&tv.tf_idf=true&wt=csv
I get only this:
id
1800180382

Maybe my request is wrong?
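
On the XML-reading question, one alternative to walking every child node
by hand is an XPath expression from the JDK's javax.xml.xpath package.
The sketch below assumes the term vector response nests as
termVectors > document > field > term > tf-idf, using lst and double
elements with name attributes; that layout should be checked against a
real response. The class name, method name and sample XML are
hypothetical.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class TermVectorXPath {

    // Collects term -> tf-idf for one field from a term vector response.
    // If the response holds several documents, their terms end up in one map,
    // so this is meant for single-document queries such as q=id:...
    public static Map<String, Double> tfIdf(String xml, String field) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Assumed layout: <lst name="termVectors"><lst name="doc-N">
        //   <lst name="FIELD"><lst name="TERM"><double name="tf-idf">...
        // The field name is pasted in unescaped, which is fine for a sketch only.
        String expr = "//lst[@name='termVectors']/lst/lst[@name='" + field
                + "']/lst/double[@name='tf-idf']";
        Map<String, Double> result = new LinkedHashMap<>();
        try {
            NodeList hits = (NodeList) xpath.evaluate(
                    expr, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            for (int i = 0; i < hits.getLength(); i++) {
                Element weight = (Element) hits.item(i);
                // The parent <lst> carries the term text in its name attribute.
                Element term = (Element) weight.getParentNode();
                result.put(term.getAttribute("name"),
                        Double.parseDouble(weight.getTextContent()));
            }
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate XPath on response", e);
        }
        return result;
    }

    public static void main(String[] args) {
        // Hand-written sample in the assumed layout, not a real Solr response.
        String xml = "<response><lst name='termVectors'><lst name='doc-0'>"
                + "<str name='uniqueKey'>1800180382</str>"
                + "<lst name='genre'>"
                + "<lst name='drama'><double name='tf-idf'>0.25</double></lst>"
                + "<lst name='thriller'><double name='tf-idf'>0.5</double></lst>"
                + "</lst></lst></lst></response>";
        System.out.println(tfIdf(xml, "genre"));
    }
}
```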

Another problem: once I have the terms and their tf-idf values, I store
them in a map.
But the values come in no particular order. I want to store e.g. only the
10 top terms, i.e. the 10 terms with the highest tf-idf values. Can I have
them returned sorted in descending order?
I haven't found anything for that. If it's not possible, I will have to
sort them later in the map.
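
Sorting on the client side is short in any case. A minimal sketch,
assuming the terms are already in a Map<String, Double> and Java 8 or
later is available; the class and method names are hypothetical and the
sample weights are made up.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TopTerms {

    // Returns the n entries with the highest tf-idf value, highest first.
    public static List<Map.Entry<String, Double>> topN(Map<String, Double> weights, int n) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(weights.entrySet());
        // Sort by value in descending order.
        entries.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        // Copy the head of the list so the result does not depend on the sorted list.
        return new ArrayList<>(entries.subList(0, Math.min(n, entries.size())));
    }

    public static void main(String[] args) {
        Map<String, Double> weights = new HashMap<>();
        weights.put("alien", 0.9);
        weights.put("space", 0.4);
        weights.put("the", 0.01);
        for (Map.Entry<String, Double> e : topN(weights, 2)) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

Calling topN(weights, 10) keeps the 10 highest-weighted terms, or all of
them if the map holds fewer than 10.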

My last question is:
Every movie has a genre, often more than one.
It's like the "cat" field (category) in the exampledocs with ipod/monitor
etc., and it's an important point for the movies.
How can I integrate this factor?
I changed the boost attribute in the Solr XML schema like this:
<field name="genre" type="string" indexed="true" stored="true"
multiValued="true" omitNorms="false" boost="3" termVectors="true"
termPositions="true" termOffsets="true"/>
Is that enough or is there any other possibility?
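
One other possibility, since the similarity is computed in Java anyway,
would be to weight the genre on the client side: compute one cosine
similarity per field and combine them with a weighted average. This is
only a sketch of that idea, not something Solr does itself; the class
name, the method name and the weights (3 for genre, mirroring the boost
above, and 1 for the other fields) are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

public class WeightedSimilarity {

    // Weighted average of per-field similarities.
    // Fields without a similarity value count as 0.
    public static double combine(Map<String, Double> simByField,
                                 Map<String, Double> weightByField) {
        double weighted = 0.0;
        double totalWeight = 0.0;
        for (Map.Entry<String, Double> e : weightByField.entrySet()) {
            Double sim = simByField.get(e.getKey());
            weighted += e.getValue() * (sim == null ? 0.0 : sim);
            totalWeight += e.getValue();
        }
        return totalWeight == 0.0 ? 0.0 : weighted / totalWeight;
    }

    public static void main(String[] args) {
        // Assumed weights: genre counts three times as much as the other fields.
        Map<String, Double> weights = new HashMap<>();
        weights.put("genre", 3.0);
        weights.put("description", 1.0);
        weights.put("cast", 1.0);

        // Made-up per-field cosine similarities for one pair of movies.
        Map<String, Double> sims = new HashMap<>();
        sims.put("genre", 1.0);
        sims.put("description", 0.5);
        sims.put("cast", 0.0);

        System.out.println(combine(sims, weights));
    }
}
```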

As you can probably tell, I am a beginner in Solr.
When I started a few weeks ago it was even more difficult for me, but now
it is going better.
I would be very grateful for any help, ideas, tips or suggestions!

Many regards
Nejla
