Hi Jake,

Thanks for that. I'll give it a try with cosine similarity first and, as I get more experience, I'll try to implement some other similarity methods.
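In case it's useful, this is roughly what I have in mind for the unit-length normalization step that makes the multiplication come out as cosine similarity. It's a rough, untested sketch: it assumes the matrix rows on HDFS are a SequenceFile keyed by IntWritable, the paths are placeholders, and it runs as a single local pass purely for clarity.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

/**
 * Rewrites a SequenceFile of document vectors so that every row has unit
 * L2 norm; the subsequent matrix multiply then yields cosine similarities.
 */
public class UnitNormalize {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path in = new Path(args[0]);   // e.g. the raw document-term matrix
    Path out = new Path(args[1]);  // normalized copy to feed into the multiply

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, in, conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, out, IntWritable.class, VectorWritable.class);

    IntWritable docId = new IntWritable();
    VectorWritable row = new VectorWritable();
    while (reader.next(docId, row)) {
      Vector normalized = row.get().normalize(2); // unit-length (L2) row
      writer.append(docId, new VectorWritable(normalized));
    }
    reader.close();
    writer.close();
  }
}

For 10 million documents this would obviously need to be folded into a map job rather than run locally, but the normalize(2) call on each row is the important part.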
Kris

2010/6/9 Jake Mannix <[email protected]>

> On Wed, Jun 9, 2010 at 5:11 AM, Kris Jack <[email protected]> wrote:
>
> > Hi,
> >
> > Thanks very much for the code. In implementing it, I got a little stuck on specifying the JobConf's similarity job -
> >
> > JobConf conf = new JobConf("similarity job");
> >
> > I assume that I should define here how I would like two vectors to be compared with one another? Please do correct me if that's wrong. If so, however, could you point me to any examples of what this code should look like (e.g. cosine similarity)? I'm sure that these kinds of similarity jobs must already exist in Mahout but, being new to both Mahout and MapReduce, I'm not sure where to find them.
>
> In the sample I mentioned (using sparse matrix multiplication), you don't get to choose the similarity - if the input vectors are unit-length normalized, then the computation is cosine similarity. You would have to write your own map-reduce job to do a different one.
>
>   -jake
>
> > Thanks,
> > Kris
> >
> > 2010/6/8 Jake Mannix <[email protected]>
> >
> > > Hi Kris,
> > >
> > > So you already know how to make a sparse feature matrix out of your Solr index, based on Grant's instructions? Once you have that matrix loaded into HDFS, then the following Java code should create a document-document similarity matrix:
> > >
> > > //
> > > String inputPath = "/path/to/matrix/on/hdfs";
> > > String tmpPath = "/tmp/matrixmultiplyspace";
> > > int numDocuments = // whatever your numDocuments is
> > > int numTerms = // total number of terms in the matrix
> > >
> > > DistributedRowMatrix text = new DistributedRowMatrix(inputPath,
> > >     tmpPath, numDocuments, numTerms);
> > > JobConf conf = new JobConf("similarity job");
> > > text.configure(conf);
> > >
> > > DistributedRowMatrix transpose = text.transpose();
> > >
> > > DistributedRowMatrix similarity = transpose.times(transpose);
> > > System.out.println("Similarity matrix lives: " + similarity.getRowPath());
> > > //
> > >
> > > Now, the rows of this similarity matrix are going to be way too dense, so you'll want to write a small map-reduce job (well, no reduce is necessary) to run over this matrix and trim out all the unuseful entries of each row, but that shouldn't be too hard to do.
> > >
> > > Of course, to do it really efficiently, that functionality could be folded into the reducer of the matrix multiply job, and done in the same pass over the data as that one.
> > >
> > >   -jake
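For the trimming job Jake describes, a minimal map-only sketch might look like the following. It assumes the similarity rows come out as a SequenceFile of IntWritable / VectorWritable pairs (as DistributedRowMatrix stores its rows), uses the old org.apache.hadoop.mapred API to match the JobConf above, and MAX_SIMILAR is just a placeholder cutoff; exact class and package names may differ between Mahout versions, so treat it as untested.

import java.io.IOException;
import java.util.Comparator;
import java.util.Iterator;
import java.util.PriorityQueue;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

/**
 * Map-only job: keeps only the top MAX_SIMILAR entries of each row of the
 * similarity matrix, dropping everything else (including self-similarity).
 */
public class TrimSimilarityMapper extends MapReduceBase
    implements Mapper<IntWritable, VectorWritable, IntWritable, VectorWritable> {

  private static final int MAX_SIMILAR = 50; // tune to taste

  public void map(IntWritable docId, VectorWritable row,
                  OutputCollector<IntWritable, VectorWritable> out,
                  Reporter reporter) throws IOException {
    Vector similarities = row.get();
    // smallest-on-top queue of (value, index) pairs, capped at MAX_SIMILAR
    PriorityQueue<double[]> top = new PriorityQueue<double[]>(MAX_SIMILAR + 1,
        new Comparator<double[]>() {
          public int compare(double[] a, double[] b) {
            return Double.compare(a[0], b[0]);
          }
        });
    Iterator<Vector.Element> it = similarities.iterateNonZero();
    while (it.hasNext()) {
      Vector.Element e = it.next();
      if (e.index() == docId.get()) {
        continue; // skip self-similarity
      }
      top.add(new double[] { e.get(), e.index() });
      if (top.size() > MAX_SIMILAR) {
        top.poll(); // drop the current smallest
      }
    }
    Vector trimmed = new RandomAccessSparseVector(similarities.size(), MAX_SIMILAR);
    for (double[] pair : top) {
      trimmed.setQuick((int) pair[1], pair[0]);
    }
    out.collect(docId, new VectorWritable(trimmed));
  }
}

Wiring it up should just be a normal JobConf with setNumReduceTasks(0), reading from similarity.getRowPath() and writing wherever the trimmed matrix should live.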
> > > On Tue, Jun 8, 2010 at 9:44 AM, Kris Jack <[email protected]> wrote:
> > >
> > > > Hi Jake,
> > > >
> > > > Thanks for that. The first solution that you suggest is more like what I was imagining.
> > > >
> > > > Please excuse me, I'm new to Mahout and don't know how to use it to generate the full document-document similarity matrix. I would rather not have to re-implement the moreLikeThis algorithm which, although rather straightforward, may take time for a newbie to MapReduce like me. Could you guide me a little in finding the relevant Mahout code for generating the matrix, or is it not really designed for that?
> > > >
> > > > For the moment, I would be happy to have an off-line batch version working.
> > > >
> > > > Also, it is desirable to take advantage of the text processing features that I have already configured using solr, so I would prefer to read in the feature vectors for the documents from a lucene index, as I am doing at present (e.g. http://lucene.grantingersoll.com/2010/02/16/trijug-intro-to-mahout-slides-and-demo-examples/).
> > > >
> > > > Thanks,
> > > > Kris
> > > >
> > > > 2010/6/8 Jake Mannix <[email protected]>
> > > >
> > > > > Hi Kris,
> > > > >
> > > > > If you generate a full document-document similarity matrix offline, and then make sure to sparsify the rows (trim off all similarities below a threshold, or only take the top N for each row, etc.), then encoding these values directly in the index would indeed allow for *superfast* MoreLikeThis functionality, because you've already computed all of the similar results offline.
> > > > >
> > > > > The only downside is that it won't apply to newly indexed documents. If your indexing setup is such that you don't fold in new documents live, but do so in batch, then this should be fine.
> > > > >
> > > > > An alternative is to use something like a Locality Sensitive Hash (something one of my co-workers is writing up a nice implementation of now, and I'm going to get him to contribute it once it's fully tested), to reduce the search space (as a lucene Filter) and speed up the query.
> > > > >
> > > > >   -jake
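To make concrete what "encoding these values directly in the index" buys at serving time: if the offline batch job writes each document's top-N neighbours into a stored field, the whole moreLikeThis call collapses to one term lookup plus a stored-field read. A hedged sketch follows; the "id" and "similar_docs" field names are made up for the example and would have to match whatever the batch (re)indexing job actually writes.

import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

public class PrecomputedSimilarLookup {

  /** Returns the ids of the precomputed neighbours of docKey, or an empty array. */
  public static String[] similarDocs(IndexSearcher searcher, String docKey) throws IOException {
    // one exact-match query on the unique key field...
    TopDocs hits = searcher.search(new TermQuery(new Term("id", docKey)), 1);
    if (hits.totalHits == 0) {
      return new String[0];
    }
    // ...then a stored-field read; "similar_docs" is assumed to have been
    // written by the batch job, e.g. as a comma-separated list "doc42,doc911,doc7"
    Document doc = searcher.doc(hits.scoreDocs[0].doc);
    String stored = doc.get("similar_docs");
    return stored == null ? new String[0] : stored.split(",");
  }
}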
> > > > > On Tue, Jun 8, 2010 at 8:11 AM, Kris Jack <[email protected]> wrote:
> > > > >
> > > > > > Hi Olivier,
> > > > > >
> > > > > > Thanks for your suggestions. I have over 10 million documents and they have quite a lot of meta-data associated with them, including rather large text fields. It is possible to tweak the moreLikeThis function from solr. I have tried changing the parameters (http://wiki.apache.org/solr/MoreLikeThis) but am not managing to get results in under 300ms without sacrificing the quality of the results too much.
> > > > > >
> > > > > > I suspect that there would be gains to be made from reducing the dimensionality of the feature vectors before indexing with lucene, so I may give that a try. I'll keep you posted if I come up with other solutions.
> > > > > >
> > > > > > Thanks,
> > > > > > Kris
> > > > > >
> > > > > > 2010/6/8 Olivier Grisel <[email protected]>
> > > > > >
> > > > > > > 2010/6/8 Kris Jack <[email protected]>:
> > > > > > > > Hi everyone,
> > > > > > > >
> > > > > > > > I currently use lucene's moreLikeThis function through solr to find documents that are related to one another. A single call, however, takes around 4 seconds to complete and I would like to reduce this. I got to thinking that I might be able to use Mahout to generate a document similarity matrix offline that could then be looked-up in real time for serving. Is this a reasonable use of Mahout? If so, what functions will generate a document similarity matrix?
> > > > > > > >
> > > > > > > > Also, I would like to be able to keep the text processing advantages provided through lucene, so it would help if I could still use my lucene index. If not, then could you recommend any alternative solutions please?
> > > > > > >
> > > > > > > How many documents do you have in your index? Have you tried to tweak the MoreLikeThis parameters? (I don't know if it's possible using the solr interface, I use it directly using the lucene java API.)
> > > > > > >
> > > > > > > For instance, you can trade off recall for speed by decreasing the number of terms to use in the query, and trade recall for precision and speed by increasing the percentage of terms that should match.
> > > > > > >
> > > > > > > You could also use Mahout's implementation of SVD to build low dimensional semantic vectors representing your documents (a.k.a. Latent Semantic Indexing) and then index those transformed frequency vectors in a dedicated lucene index (or document field, provided you name the resulting terms with something that does not match real terms present in the other fields). However, using standard SVD will probably result in dense (as opposed to sparse) low dimensional semantic vectors. I don't think lucene's lookup performance is good with dense frequency vectors, even though the number of terms is greatly reduced by SVD. Hence it would probably be better to either threshold the top 100 absolute values of each semantic vector before indexing (probably the simpler solution) or use a sparsifying, penalty-constrained variant of SVD / LSI. You should have a look at the literature on sparse coding or sparse dictionary learning, Sparse-PCA and, more generally, L1 penalty regression methods such as the Lasso and LARS. I don't know of any library for sparse semantic coding of documents that works automatically with lucene; probably some non-trivial coding is needed there.
> > > > > > >
> > > > > > > Another alternative is finding low dimensional (64 or 32 components) dense codes, binary thresholding them, storing the integer codes in the DB or the lucene index, and then building smart exact-match queries to find all documents lying in the Hamming ball of size 1 or 2 around the reference document's binary code. But I think this approach, while promising for web-scale document collections, is even more experimental and requires very good low dimensional encoders (I don't think linear models such as SVD are good enough for reducing sparse 10e6-component vectors to dense 64-component vectors; non-linear encoders such as Stacked Restricted Boltzmann Machines are probably a better choice).
> > > > > > >
> > > > > > > In any case let us know about your results, I am really interested in practical yet scalable solutions to this problem.
> > > > > > >
> > > > > > > --
> > > > > > > Olivier
> > > > > > > http://twitter.com/ogrisel - http://github.com/ogrisel
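As a concrete illustration of the MoreLikeThis knobs Olivier mentions at the lucene level, here is an untested sketch. The field names and numbers are made-up examples, and the minimum-should-match trick assumes the generated query is a BooleanQuery of term clauses, which is what the contrib MoreLikeThis produces in the lucene versions around this time.

import java.io.File;
import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.similar.MoreLikeThis;
import org.apache.lucene.store.FSDirectory;

public class TunedMoreLikeThis {
  public static void main(String[] args) throws IOException {
    IndexReader reader = IndexReader.open(FSDirectory.open(new File(args[0])));
    IndexSearcher searcher = new IndexSearcher(reader);

    MoreLikeThis mlt = new MoreLikeThis(reader);
    mlt.setFieldNames(new String[] { "title", "abstract" }); // restrict to cheaper fields
    mlt.setMinTermFreq(2);      // ignore terms occurring only once in the source doc
    mlt.setMinDocFreq(5);       // ignore very rare terms
    mlt.setMaxQueryTerms(15);   // fewer query terms => faster, lower recall
    // if the fields have no term vectors, an analyzer is also needed: mlt.setAnalyzer(...)

    int sourceDoc = 12345;      // internal lucene doc id of the reference document
    Query query = mlt.like(sourceDoc);

    // Optionally require a fraction of the generated terms to match,
    // trading recall for precision and speed.
    if (query instanceof BooleanQuery) {
      BooleanQuery bq = (BooleanQuery) query;
      bq.setMinimumNumberShouldMatch((int) (bq.clauses().size() * 0.3));
    }

    TopDocs related = searcher.search(query, 10);
    System.out.println("found " + related.totalHits + " related documents");
    searcher.close();
    reader.close();
  }
}

The idea is simply that fewer, more selective terms make the generated boolean query cheaper to execute, at the cost of some recall.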

--
Dr Kris Jack,
http://www.mendeley.com/profiles/kris-jack/
