Re: Taste Recommenders, LuceneIterable Weights and loading custom code
Hi all, I'm not part of the Mahout dev team, but I have a little experience with the problem mentioned (i.e. loading custom implementations for certain things). We faced this problem in our team when managing RDF data. We defined an RDFHelperAPI, and then each of us competed to write the most useful and fastest implementation. To manage this and allow loading one implementation or another, we simply made a Factory that reads the implementation class name from a property file and loads it. It's similar to Spring injection in a way, but really simple. So you can add any number of implementations in jars and then dynamically change the one in use by playing with the factory, or use the configuration file to define the right one. I'm pretty sure that this is actually not the best solution (nor the brightest), but I just wanted to mention it. br, On Thu, Jun 18, 2009 at 05:55, Ted Dunning ted.dunn...@gmail.com wrote: +2 On Wed, Jun 17, 2009 at 4:36 PM, Sean Owen sro...@gmail.com wrote: I may be on a tangent now but I suppose my basic reaction is: skip this complexity and build this as an extensible library. As always I am open to being convinced otherwise. -- Gérard Dupont Information Processing Control and Cognition (IPCC) - EADS DS http://weblab-project.org Document Learning team - LITIS Laboratory
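The factory approach described above can be sketched roughly as follows. This is only an illustration under assumptions: the Helper interface, the two implementations, and the "helper.impl" property key are hypothetical names invented for the example, not anything from Mahout or from the RDF project mentioned.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

public class ImplementationFactory {

    /** Hypothetical service interface; stands in for something like the RDF helper API. */
    public interface Helper {
        String describe();
    }

    /** Two hypothetical competing implementations. */
    public static class SimpleHelper implements Helper {
        public String describe() { return "simple"; }
    }

    public static class FastHelper implements Helper {
        public String describe() { return "fast"; }
    }

    /** Instantiates the class named under the given key; it needs a public no-arg constructor. */
    public static Helper load(Properties config, String key) {
        String className = config.getProperty(key);
        if (className == null) {
            throw new IllegalArgumentException("No implementation configured for " + key);
        }
        try {
            return Class.forName(className)
                    .asSubclass(Helper.class)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot load " + className, e);
        }
    }

    public static void main(String[] args) throws IOException {
        Properties config = new Properties();
        // In practice this text would live in a properties file shipped next to the jars.
        config.load(new StringReader("helper.impl=" + FastHelper.class.getName()));
        System.out.println(load(config, "helper.impl").describe());
    }
}
```

Switching implementations then only means editing the property value; the calling code never names a concrete class.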
[jira] Commented: (MAHOUT-126) Prepare document vectors from the text
[ https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721215#action_12721215 ] Grant Ingersoll commented on MAHOUT-126: Hey David, I'm not sure what's going on here, because that value being null means the term is not in the index, yet is in the Term Vector for that doc. Are you sure you're loading the same field? Can you share the indexing code? This fix works, though, but I'd like to know at a deeper level what's going on. Prepare document vectors from the text -- Key: MAHOUT-126 URL: https://issues.apache.org/jira/browse/MAHOUT-126 Project: Mahout Issue Type: New Feature Affects Versions: 0.2 Reporter: Shashikant Kore Assignee: Grant Ingersoll Fix For: 0.2 Attachments: mahout-126-benson.patch, MAHOUT-126-no-normalization.patch, MAHOUT-126-no-normalization.patch, MAHOUT-126-null-entry.patch, MAHOUT-126-TF.patch, MAHOUT-126.patch, MAHOUT-126.patch, MAHOUT-126.patch, MAHOUT-126.patch Clustering algorithms presently take the document vectors as input. Generating these document vectors from the text can be broken into two tasks. 1. Create a Lucene index of the input plain-text documents 2. From the index, generate the document vectors (sparse) with weights as TF-IDF values of the term. With a Lucene index, this value can be calculated very easily. Presently, I have created two separate utilities, which could possibly be invoked from another class. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-126) Prepare document vectors from the text
[ https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721346#action_12721346 ] David Hall commented on MAHOUT-126: --- That's not the only time. This constructor clearly lets certain things slip through.
{code}
public CachedTermInfo(IndexReader reader, String field, int minDf, int maxDfPercent) throws IOException {
  this.field = field;
  TermEnum te = reader.terms(new Term(field, ""));
  int count = 0;
  int numDocs = reader.numDocs();
  double percent = numDocs * maxDfPercent / 100.0;
  //Should we use a linked hash map so that we know terms are in order?
  termEntries = new LinkedHashMap<String, TermEntry>();
  do {
    Term term = te.term();
    if (term == null || term.field().equals(field) == false){
      break;
    }
    int df = te.docFreq();
    // Terms that are too rare or too common never get a dictionary entry.
    if (df < minDf || df > percent){
      continue;
    }
    TermEntry entry = new TermEntry(term.text(), count++, df);
    termEntries.put(entry.term, entry);
  } while (te.next());
  te.close();
}
{code}
My code is essentially Lucene's demo indexing code (IndexFiles.java and FileDocument.java: http://google.com/codesearch/p?hl=en&sa=N&cd=1&ct=rc#uGhWbO8eR20/trunk/src/demo/org/apache/lucene/demo/FileDocument.java&q=org.apache.lucene.demo.IndexFiles ) except that I replaced
{code}
doc.add(new Field("contents", new FileReader(f)));
{code}
with
{code}
doc.add(new Field("contents", new FileReader(f), Field.TermVector.YES));
{code}
I then ran
{code}
java -cp <classpath> org.apache.lucene.demo.IndexFiles /Users/dlwh/txt-reuters/
{code}
and then
{code}
java -cp <classpath> org.apache.mahout.utils.vectors.Driver --dir /Users/dlwh/src/lucene/index/ --output ~/src/vec-reuters -f contents -t /Users/dlwh/dict --weight TF
{code}
For what it's worth, it gives a null on "reuters", which is not usually a stop word, except that every single document ends with it, and so the IDF filtering above is catching it.
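To make the filtering effect concrete, here is a small stand-alone sketch of the same document-frequency test, without Lucene. The class name, the sample terms and counts, and the thresholds (minDf = 2, maxDfPercent = 99) are assumptions chosen for illustration; they are not Mahout's defaults.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DfFilterSketch {

    /** Keeps only terms whose document frequency lies between minDf and maxDfPercent% of numDocs. */
    public static Map<String, Integer> filter(Map<String, Integer> docFreqs,
                                              int numDocs, int minDf, int maxDfPercent) {
        double maxDf = numDocs * maxDfPercent / 100.0;
        Map<String, Integer> kept = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : docFreqs.entrySet()) {
            int df = e.getValue();
            if (df < minDf || df > maxDf) {
                continue; // too rare or too common: no dictionary entry is created
            }
            kept.put(e.getKey(), df);
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Integer> docFreqs = new LinkedHashMap<>();
        docFreqs.put("reuters", 100); // appears in all 100 documents
        docFreqs.put("oil", 12);
        docFreqs.put("zyzzyva", 1);
        Map<String, Integer> kept = filter(docFreqs, 100, 2, 99);
        // A filtered term can still occur in a document's term vector, so a later
        // dictionary lookup for it returns null and the caller has to check for that.
        System.out.println(kept.get("reuters")); // null
        System.out.println(kept.get("oil"));     // 12
    }
}
```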
[jira] Commented: (MAHOUT-126) Prepare document vectors from the text
[ https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721351#action_12721351 ] Grant Ingersoll commented on MAHOUT-126: Yep, you are right. I committed your patch anyway. We should probably add command-line options for setting minDF and maxDF.
Re: MAHOUT-65
Oh, wow, never mind. Vector implements Writable. Sorry, everyone. -- David On Thu, Jun 18, 2009 at 12:19 PM, David Hall d...@cs.stanford.edu wrote: actually, it looks like someone went to all the trouble to make both SparseVector and DenseVector have all the methods required by Writable, but they don't implement Writable. Could I just make Vector extend Writable? -- David On Thu, Jun 18, 2009 at 12:01 PM, David Hall d...@cs.stanford.edu wrote: following up on my earlier email. Would anyone be interested in a compressed serialization for DenseVector/SparseVector that follows in the vein of hadoop.io.Writable? The space overhead for gson (parsing issues notwithstanding) is pretty high, and it wouldn't be terribly hard to implement a high-performance thing for vectors. -- David On Tue, Jun 16, 2009 at 1:39 PM, Jeff Eastman j...@windwardsolutions.com wrote: +1, you added name constructors that I didn't have and the equals/equivalent stuff. Ya, Gson makes it all pretty trivial once you grok it. Grant Ingersoll wrote: Shall I take that as approval of the approach? BTW, the Gson stuff seems like a winner for serialization. On Jun 16, 2009, at 3:56 PM, Jeff Eastman wrote: You gonna commit your patch? I agree with shortening the class name in the JsonVectorAdapter and will do it once you commit your stuff. Jeff
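For readers unfamiliar with the Writable contract discussed here, the following is a minimal sketch of a binary round trip for a dense vector. It mirrors the two methods of Hadoop's Writable interface (write(DataOutput) and readFields(DataInput)) but deliberately uses only java.io so it runs without Hadoop; the class DenseVectorBytes and its layout (an int length followed by raw doubles) are assumptions for illustration, not Mahout's actual DenseVector format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

public class DenseVectorBytes {
    private double[] values = new double[0];

    public DenseVectorBytes() { }

    public DenseVectorBytes(double[] values) {
        this.values = values.clone();
    }

    public double[] values() {
        return values.clone();
    }

    /** Same shape as Writable.write: a 4-byte length, then 8 bytes per double. */
    public void write(DataOutput out) throws IOException {
        out.writeInt(values.length);
        for (double v : values) {
            out.writeDouble(v);
        }
    }

    /** Same shape as Writable.readFields: overwrites this object's state from the stream. */
    public void readFields(DataInput in) throws IOException {
        int n = in.readInt();
        values = new double[n];
        for (int i = 0; i < n; i++) {
            values[i] = in.readDouble();
        }
    }

    public static byte[] toBytes(DenseVectorBytes v) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            v.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static DenseVectorBytes fromBytes(byte[] bytes) {
        DenseVectorBytes v = new DenseVectorBytes();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            v.readFields(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return v;
    }

    public static void main(String[] args) {
        byte[] bytes = toBytes(new DenseVectorBytes(new double[] {1.5, 0.0, -2.25}));
        System.out.println(bytes.length); // 28 = 4 + 3 * 8
        System.out.println(Arrays.toString(fromBytes(bytes).values()));
    }
}
```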
Re: MAHOUT-65
Writable should be plenty! On Thu, Jun 18, 2009 at 1:15 PM, David Hall d...@cs.stanford.edu wrote: See my followup on another thread (sorry for the schizophrenic posting); Vector already implements Writable, so that's all I really can ask of it. Is there something more you'd like? I'd be happy to do it.
Re: MAHOUT-65
How often does Mahout need the Comparable part for Vectors? Are vectors commonly used as map output keys? In terms of space efficiency, I'd bet it's probably a bit better than a factor of two in the average case, especially for DenseVectors. The gson format is storing both the int index and the double as raw strings, plus whatever boundary characters. The Writable implementation stores just the bytes of the double, plus a length. -- David On Thu, Jun 18, 2009 at 2:13 PM, Jeff Eastman j...@windwardsolutions.com wrote: +1 asWritableComparable is a simple implementation that uses asFormatString. It would be good to rewrite it for internal communication. A factor of two is still a factor of two. Jeff Grant Ingersoll wrote: On Jun 18, 2009, at 4:45 PM, Ted Dunning wrote: Writable should be plenty! +1. Still nice to have JSON for user facing though. On Thu, Jun 18, 2009 at 1:15 PM, David Hall d...@cs.stanford.edu wrote: See my followup on another thread (sorry for the schizophrenic posting); Vector already implements Writable, so that's all I really can ask of it. Is there something more you'd like? I'd be happy to do it.
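The space argument above can be checked with a rough size comparison. The sketch below is an assumption-laden stand-in: the text form is a hand-built JSON-like map of index to value, not Gson's actual output, and the binary form (an int count, then an int index and a double per entry) is one plausible layout rather than Mahout's. The sample indices and values are made up.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class EncodingSizeSketch {

    /** Text encoding: index and value are both written out as strings, plus delimiters. */
    public static int textSize(int[] indices, double[] values) {
        StringBuilder sb = new StringBuilder("{");
        for (int i = 0; i < indices.length; i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append('"').append(indices[i]).append("\":").append(values[i]);
        }
        sb.append('}');
        return sb.toString().getBytes(StandardCharsets.UTF_8).length;
    }

    /** Binary encoding: a 4-byte count, then a 4-byte index and an 8-byte double per entry. */
    public static int binarySize(int[] indices, double[] values) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeInt(indices.length);
            for (int i = 0; i < indices.length; i++) {
                out.writeInt(indices[i]);
                out.writeDouble(values[i]);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.size();
    }

    public static void main(String[] args) {
        int[] indices = {10000, 20000, 30000};
        double[] values = {1.0 / 3, 2.0 / 3, 1.0 / 7};
        System.out.println("text bytes:   " + textSize(indices, values));
        System.out.println("binary bytes: " + binarySize(indices, values));
    }
}
```

The ratio depends heavily on the data: values with long decimal expansions favor the binary form, while short values such as 1.0 can make the text form competitive, so the factor is worth measuring on a real dataset.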
Re: MAHOUT-65
I don't know of any situations where Vectors are used as keys. It hardly makes sense to use them as they are so unwieldy. Suggest we could change to just Writable and be ahead. In terms of the potential density improvement, it will be interesting to see what can typically be achieved. r786323 just removed all calls to asWritableComparable, replacing them with asFormatString which was correct anyway. Shall I change the method to asWritable()? Jeff David Hall wrote: How often does Mahout need the Comparable part for Vectors? Are vectors commonly used as map output keys? In terms of space efficiency, I'd bet it's probably a bit better than a factor of two in the average case, especially for DenseVectors. The gson format is storing both the int index and the double as raw strings, plus whatever boundary characters. The Writable implementation stores just the bytes of the double, plus a length. -- David On Thu, Jun 18, 2009 at 2:13 PM, Jeff Eastman j...@windwardsolutions.com wrote: +1 asWritableComparable is a simple implementation that uses asFormatString. It would be good to rewrite it for internal communication. A factor of two is still a factor of two. Jeff Grant Ingersoll wrote: On Jun 18, 2009, at 4:45 PM, Ted Dunning wrote: Writable should be plenty! +1. Still nice to have JSON for user facing though. On Thu, Jun 18, 2009 at 1:15 PM, David Hall d...@cs.stanford.edu wrote: See my followup on another thread (sorry for the schizophrenic posting); Vector already implements Writable, so that's all I really can ask of it. Is there something more you'd like? I'd be happy to do it.
Re: MAHOUT-65
On Thu, Jun 18, 2009 at 3:39 PM, Jeff Eastman j...@windwardsolutions.com wrote: Shall I change the method to asWritable()? I'd just be for getting rid of it. Vector implements Writable, so asWritable() could just be "return this;", which seems gratuitous. As for actual efficiency: lucene/mahout/trunk/core/src/main/java/org/apache/mahout/clustering/meanshift/MeanShiftCanopy.java is currently dumping output values as text strings. If there's a standard dataset, that would be an easy place to do the test. - David I don't know of any situations where Vectors are used as keys. It hardly makes sense to use them as they are so unwieldy. Suggest we could change to just Writable and be ahead. In terms of the potential density improvement, it will be interesting to see what can typically be achieved. r786323 just removed all calls to asWritableComparable, replacing them with asFormatString which was correct anyway. Jeff David Hall wrote: How often does Mahout need the Comparable part for Vectors? Are vectors commonly used as map output keys? In terms of space efficiency, I'd bet it's probably a bit better than a factor of two in the average case, especially for DenseVectors. The gson format is storing both the int index and the double as raw strings, plus whatever boundary characters. The Writable implementation stores just the bytes of the double, plus a length. -- David On Thu, Jun 18, 2009 at 2:13 PM, Jeff Eastman j...@windwardsolutions.com wrote: +1 asWritableComparable is a simple implementation that uses asFormatString. It would be good to rewrite it for internal communication. A factor of two is still a factor of two. Jeff Grant Ingersoll wrote: On Jun 18, 2009, at 4:45 PM, Ted Dunning wrote: Writable should be plenty! +1. Still nice to have JSON for user facing though.
On Thu, Jun 18, 2009 at 1:15 PM, David Hall d...@cs.stanford.edu wrote: See my followup on another thread (sorry for the schizophrenic posting); Vector already implements Writable, so that's all I really can ask of it. Is there something more you'd like? I'd be happy to do it.
Re: MAHOUT-65
Er, um, I see what you mean. How about just deleting the method? What really needs doing then is for all of the various clusters to themselves implement Writable so that they don't need to call asFormatString but can just emit themselves. Jeff Ted Dunning wrote: What does this method do? If the vector already implements Writable, what is the purpose of a conversion? On Thu, Jun 18, 2009 at 3:39 PM, Jeff Eastman j...@windwardsolutions.com wrote: Shall I change the method to asWritable()?
[jira] Created: (MAHOUT-135) Allow FileDataModel to transpose users and items
Allow FileDataModel to transpose users and items Key: MAHOUT-135 URL: https://issues.apache.org/jira/browse/MAHOUT-135 Project: Mahout Issue Type: Improvement Components: Collaborative Filtering Reporter: Grant Ingersoll Assignee: Grant Ingersoll Priority: Minor Fix For: 0.2 Sometimes it would be nice to flip around users and items in the FileDataModel. This patch adds a transpose boolean that flips userId and itemId in the processLine method. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
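The idea of the transpose flag can be sketched as below. This is not the code from the patch or from FileDataModel: the class and method names, the comma-delimited "userId,itemId,preference" line format, and the returned String array are all assumptions made to keep the example self-contained.

```java
public class TransposeSketch {

    /**
     * Parses a "userId,itemId,preference" line and returns {userId, itemId, preference}.
     * With transpose set, the first two fields swap roles.
     */
    public static String[] parseLine(String line, boolean transpose) {
        String[] fields = line.split(",");
        if (fields.length < 2) {
            throw new IllegalArgumentException("Expected at least userId,itemId: " + line);
        }
        String userId = fields[0].trim();
        String itemId = fields[1].trim();
        String preference = fields.length > 2 ? fields[2].trim() : "";
        if (transpose) {
            String tmp = userId;
            userId = itemId;
            itemId = tmp;
        }
        return new String[] {userId, itemId, preference};
    }

    public static void main(String[] args) {
        String[] normal = parseLine("u1,i9,4.5", false);
        String[] flipped = parseLine("u1,i9,4.5", true);
        System.out.println(normal[0] + " rates " + normal[1]);   // u1 rates i9
        System.out.println(flipped[0] + " rates " + flipped[1]); // i9 rates u1
    }
}
```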
[jira] Updated: (MAHOUT-135) Allow FileDataModel to transpose users and items
[ https://issues.apache.org/jira/browse/MAHOUT-135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll updated MAHOUT-135: --- Attachment: MAHOUT-135.patch Patch that adds transpose and tests
[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors
[ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721646#action_12721646 ] Sean Owen commented on MAHOUT-121: -- Since I am not hearing objections, and cognizant that people are waiting on this, going to commit. If there are issues we can roll back or tweak from there. Speed up distance calculations for sparse vectors - Key: MAHOUT-121 URL: https://issues.apache.org/jira/browse/MAHOUT-121 Project: Mahout Issue Type: Improvement Components: Matrix Reporter: Shashikant Kore Attachments: MAHOUT-121.patch, MAHOUT-121.patch, MAHOUT-121.patch, MAHOUT-121.patch, MAHOUT-121.patch, mahout-121.patch, MAHOUT-121jfe.patch, Mahout1211.patch From my mail to the Mahout mailing list. I am working on clustering a dataset which has thousands of sparse vectors. The complete dataset has a few tens of thousands of feature items but each vector has only a couple of hundred feature items. For this, there is an optimization in distance calculation, a link to which I found in the archives of the Mahout mailing list. http://lingpipe-blog.com/2009/03/12/speeding-up-k-means-clustering-algebra-sparse-vectors/ I tried out this optimization. The test setup had 2000 document vectors with a few hundred items. I ran canopy generation with Euclidean distance and t1, t2 values as 250 and 200. Current Canopy Generation: 28 min 15 sec. Canopy Generation with distance optimization: 1 min 38 sec. I know by experience that using Integer, Double objects instead of primitives is computationally expensive. I changed the sparse vector implementation to use primitive collections by Trove [ http://trove4j.sourceforge.net/ ]. Distance optimization with Trove: 59 sec Current canopy generation with Trove: 21 min 55 sec To sum up, these two optimizations reduced cluster generation time by 97%. Currently, I have made the changes for Euclidean Distance, Canopy and KMeans. Licensing of Trove seems to be an issue which needs to be addressed.
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
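As a rough illustration of the kind of optimization described in the issue, the sketch below uses the standard identity ||x - c||^2 = ||x||^2 - 2 (x . c) + ||c||^2, so that with both squared norms cached the per-distance work only touches the non-zero entries of the sparse vector. I am assuming this is the algebra the linked post relies on; the class, the parallel-array representation of the sparse vector, and the sample numbers are made up and are not taken from the MAHOUT-121 patch.

```java
public class SparseDistanceSketch {

    /** Sum of squares of the given values; meant to be computed once and cached. */
    public static double normSq(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v * v;
        }
        return sum;
    }

    /**
     * Squared Euclidean distance between a sparse vector (parallel index/value arrays,
     * primitives only) and a dense centroid, given both cached squared norms.
     * The loop runs over the sparse vector's non-zero entries, not the full dimension.
     */
    public static double squaredDistance(int[] xIndices, double[] xValues, double xNormSq,
                                         double[] centroid, double centroidNormSq) {
        double dot = 0.0;
        for (int i = 0; i < xIndices.length; i++) {
            dot += xValues[i] * centroid[xIndices[i]];
        }
        // Guard against tiny negative results caused by floating-point rounding.
        return Math.max(0.0, xNormSq - 2.0 * dot + centroidNormSq);
    }

    public static void main(String[] args) {
        int[] xIndices = {1, 4};
        double[] xValues = {3.0, 4.0};
        double[] centroid = {1.0, 1.0, 0.0, 0.0, 2.0, 0.0};
        double d2 = squaredDistance(xIndices, xValues, normSq(xValues),
                                    centroid, normSq(centroid));
        System.out.println(Math.sqrt(d2)); // 3.0
    }
}
```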
Re: [GSOC] Thoughts about Random forests map-reduce implementation
Very similar, but I was talking about building trees on each split of the data (a la map reduce split). That would give many small splits and would thus give very different results from bagging because the splits would be small and contiguous rather than large and random. On Thu, Jun 18, 2009 at 1:37 AM, deneche abdelhakim a_dene...@yahoo.fr wrote: build multiple trees for different portions of the data What's the difference with the basic bagging algorithm, which builds 'each tree' using a different portion (about 2/3) of the data ?
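The contrast drawn above, large random samples versus small contiguous splits, can be made concrete with a sketch of the two ways of choosing training rows. The class and method names and the sizes (1000 rows, 10 splits) are assumptions for illustration; this is not code from the GSoC project.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

public class SamplingSketch {

    /** Bagging: draw n row indices uniformly at random, with replacement. */
    public static int[] bootstrapSample(int n, Random rng) {
        int[] sample = new int[n];
        for (int i = 0; i < n; i++) {
            sample[i] = rng.nextInt(n);
        }
        return sample;
    }

    /** Map-reduce style: cut rows 0..n-1 into k contiguous ranges {start, end}, end exclusive. */
    public static List<int[]> contiguousSplits(int n, int k) {
        List<int[]> splits = new ArrayList<>();
        for (int i = 0; i < k; i++) {
            int start = (int) ((long) n * i / k);
            int end = (int) ((long) n * (i + 1) / k);
            splits.add(new int[] {start, end});
        }
        return splits;
    }

    public static void main(String[] args) {
        int n = 1000;
        // Each bagged tree sees rows scattered over the whole dataset; on average
        // a bootstrap sample contains roughly 63% of the distinct rows.
        Set<Integer> distinct = new HashSet<>();
        for (int row : bootstrapSample(n, new Random(42))) {
            distinct.add(row);
        }
        System.out.println("distinct rows in one bootstrap sample: " + distinct.size());

        // Each per-split tree sees only a small, adjacent block of rows.
        for (int[] split : contiguousSplits(n, 10)) {
            System.out.println("split rows [" + split[0] + ", " + split[1] + ")");
        }
    }
}
```

If the input file is ordered in any meaningful way, each contiguous split is a biased slice of the data, which is one reason the two schemes can behave so differently.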
[jira] Commented: (MAHOUT-135) Allow FileDataModel to transpose users and items
[ https://issues.apache.org/jira/browse/MAHOUT-135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721653#action_12721653 ] Sean Owen commented on MAHOUT-135: -- Looks OK to me -- I applied the patch locally and tweaked a few things. Seems like a rare use case but simple to implement anyway. Mind if I submit over here?
Re: [jira] Commented: (MAHOUT-135) Allow FileDataModel to transpose users and items
Transposing is actually a common need as you abstract away from users and ratings. On Thu, Jun 18, 2009 at 10:19 PM, Sean Owen (JIRA) j...@apache.org wrote: Looks OK to me -- I applied the patch locally and tweaked a few things. Seems like a rare use case but simple to implement anyway. Mind if I submit over here? Allow FileDataModel to transpose users and items