Re: Taste Recommenders, LuceneIterable Weights and loading custom code

2009-06-18 Thread Gérard Dupont
Hi all,

I'm not on the Mahout dev team, but I have a little experience with
the problem mentioned (i.e. loading custom implementations for certain
things).

We faced this problem in our team when managing RDF data. We defined an
RDFHelperAPI, and then each of us competed to write the most useful and
fastest implementation. To manage this and allow loading one implementation
or another, we simply made a Factory that reads the implementation class name
from a property file and loads it. It's similar to Spring injection in a
way, but really simple. So you can add any number of implementations in jars
and then dynamically change the one in use, either by calling the factory
directly or by letting the configuration file define the right one.
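
A minimal sketch of the kind of property-driven factory described above. The thread only names RDFHelperAPI, so the interface method, the class names and the property key "rdf.helper.impl" are all assumptions made up for illustration:

```java
import java.util.Properties;

// Hypothetical API: the thread names "RDFHelperAPI" but not its methods.
interface RdfHelper {
  String describe();
}

// One of possibly many implementations, which could live in any jar.
class SimpleRdfHelper implements RdfHelper {
  @Override
  public String describe() {
    return "simple";
  }
}

final class RdfHelperFactory {
  // Assumed property key; any name would do.
  static final String IMPL_KEY = "rdf.helper.impl";

  private RdfHelperFactory() {}

  // Looks up the class name in the properties and instantiates it
  // through its no-argument constructor.
  static RdfHelper create(Properties props) {
    String className = props.getProperty(IMPL_KEY);
    if (className == null) {
      throw new IllegalArgumentException("Missing property: " + IMPL_KEY);
    }
    try {
      Class<? extends RdfHelper> impl =
          Class.forName(className).asSubclass(RdfHelper.class);
      return impl.getDeclaredConstructor().newInstance();
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException("Cannot load " + className, e);
    }
  }

  public static void main(String[] args) {
    // In practice the properties would be loaded from a file.
    Properties props = new Properties();
    props.setProperty(IMPL_KEY, SimpleRdfHelper.class.getName());
    System.out.println(create(props).describe());
  }
}
```

Swapping implementations then only requires changing the class name in the property file, as long as the named class is on the classpath and has a no-argument constructor.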

I'm pretty sure that this is actually not the best solution (nor the
brightest), but I just wanted to mention it.

br,



On Thu, Jun 18, 2009 at 05:55, Ted Dunning ted.dunn...@gmail.com wrote:

 +2

 On Wed, Jun 17, 2009 at 4:36 PM, Sean Owen sro...@gmail.com wrote:

  I may be on a tangent now but I suppose my basic reaction is: skip this
  complexity and build this as an extensible library. As always I am open
  to being convinced otherwise.
 




-- 
Gérard Dupont
Information Processing Control and Cognition (IPCC) - EADS DS
http://weblab-project.org

Document  Learning team - LITIS Laboratory


[jira] Commented: (MAHOUT-126) Prepare document vectors from the text

2009-06-18 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721215#action_12721215
 ] 

Grant Ingersoll commented on MAHOUT-126:


Hey David,

I'm not sure what's going on here, because that value being null means the term 
is not in the index, yet is in the Term Vector for that doc.  Are you sure you're 
loading the same field?  Can you share the indexing code?

The fix works, but I'd like to know at a deeper level what's going on.

 Prepare document vectors from the text
 --

 Key: MAHOUT-126
 URL: https://issues.apache.org/jira/browse/MAHOUT-126
 Project: Mahout
  Issue Type: New Feature
Affects Versions: 0.2
Reporter: Shashikant Kore
Assignee: Grant Ingersoll
 Fix For: 0.2

 Attachments: mahout-126-benson.patch, 
 MAHOUT-126-no-normalization.patch, MAHOUT-126-no-normalization.patch, 
 MAHOUT-126-null-entry.patch, MAHOUT-126-TF.patch, MAHOUT-126.patch, 
 MAHOUT-126.patch, MAHOUT-126.patch, MAHOUT-126.patch


 Clustering algorithms presently take the document vectors as input.  
 Generating these document vectors from the text can be broken in two tasks. 
 1. Create lucene index of the input  plain-text documents 
 2. From the index, generate the document vectors (sparse) with weights as 
 TF-IDF values of the term. With lucene index, this value can be calculated 
 very easily. 
 Presently, I have created two separate utilities, which could possibly be 
 invoked from another class. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-126) Prepare document vectors from the text

2009-06-18 Thread David Hall (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721346#action_12721346
 ] 

David Hall commented on MAHOUT-126:
---

That's not the only time. This constructor clearly lets certain things slip 
through.

{code}
  public CachedTermInfo(IndexReader reader, String field, int minDf, int maxDfPercent) throws IOException {
    this.field = field;
    TermEnum te = reader.terms(new Term(field, ""));
    int count = 0;
    int numDocs = reader.numDocs();
    double percent = numDocs * maxDfPercent / 100.0;
    //Should we use a linked hash map so that we know terms are in order?
    termEntries = new LinkedHashMap<String, TermEntry>();
    do {
      Term term = te.term();
      if (term == null || term.field().equals(field) == false){
        break;
      }
      int df = te.docFreq();
      // Terms that are too rare or too common never get an entry.
      if (df < minDf || df > percent){
        continue;
      }
      TermEntry entry = new TermEntry(term.text(), count++, df);
      termEntries.put(entry.term, entry);
    } while (te.next());
    te.close();
  }
{code}

My code is essentially Lucene's demo indexing code (IndexFiles.java and 
FileDocument.java: 
http://google.com/codesearch/p?hl=ensa=Ncd=1ct=rc#uGhWbO8eR20/trunk/src/demo/org/apache/lucene/demo/FileDocument.javaq=org.apache.lucene.demo.IndexFiles
) except that I replaced
{code}doc.add(new Field("contents", new FileReader(f)));{code}

with
{code}doc.add(new Field("contents", new FileReader(f), Field.TermVector.YES));{code}

I then ran {code} java -cp classpath org.apache.lucene.demo.IndexFiles 
/Users/dlwh/txt-reuters/ {code}

and then {code} java -cp classpath org.apache.mahout.utils.vectors.Driver 
--dir /Users/dlwh/src/lucene/index/ --output ~/src/vec-reuters -f contents -t 
/Users/dlwh/dict --weight TF {code}

For what it's worth, it gives a null on "reuters", which is not usually a stop 
word, except that every single document ends with it, and so the 
document-frequency filtering above is catching it.
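
To make the failure mode concrete, here is a small sketch that applies the same filter to a plain map of document frequencies, without Lucene. The class and method names are made up, and the threshold values (minDf = 1, maxDfPercent = 99) are assumptions chosen for illustration, not necessarily the defaults the Driver uses:

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class DfFilterSketch {
  private DfFilterSketch() {}

  // Same rule as the constructor above: a term gets a dictionary index
  // only if minDf <= df <= numDocs * maxDfPercent / 100.
  static Map<String, Integer> buildDictionary(
      Map<String, Integer> docFreqs, int numDocs, int minDf, int maxDfPercent) {
    double maxDf = numDocs * maxDfPercent / 100.0;
    Map<String, Integer> termToIndex = new LinkedHashMap<>();
    int count = 0;
    for (Map.Entry<String, Integer> e : docFreqs.entrySet()) {
      int df = e.getValue();
      if (df < minDf || df > maxDf) {
        continue; // filtered terms never get an entry
      }
      termToIndex.put(e.getKey(), count++);
    }
    return termToIndex;
  }

  public static void main(String[] args) {
    Map<String, Integer> docFreqs = new LinkedHashMap<>();
    docFreqs.put("oil", 2);
    docFreqs.put("reuters", 4); // appears in every one of the 4 documents
    Map<String, Integer> dict = buildDictionary(docFreqs, 4, 1, 99);
    // The term still occurs in each document's term vector, but a
    // dictionary lookup for it comes back null.
    System.out.println("oil -> " + dict.get("oil"));
    System.out.println("reuters -> " + dict.get("reuters"));
  }
}
```

Any code that walks a document's term vector and looks each term up in the dictionary therefore has to expect a null for filtered terms.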






[jira] Commented: (MAHOUT-126) Prepare document vectors from the text

2009-06-18 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721351#action_12721351
 ] 

Grant Ingersoll commented on MAHOUT-126:


Yep, you are right.  I committed your patch anyway.  We should probably add 
command-line options for setting minDF and maxDF.




Re: MAHOUT-65

2009-06-18 Thread David Hall
Oh, wow, never mind. Vector implements Writable.

Sorry everyone.

-- David

On Thu, Jun 18, 2009 at 12:19 PM, David Halld...@cs.stanford.edu wrote:
 actually, it looks like someone went to all the trouble to make both
 SparseVector and DenseVector have all the methods required by
 Writable, but they don't implement Writable.

 Could I just make Vector extend Writable?

 -- David

 On Thu, Jun 18, 2009 at 12:01 PM, David Halld...@cs.stanford.edu wrote:
 following up on my earlier email.

 Would anyone be interested in a compressed serialization for
 DenseVector/SparseVector that follows in the vein of
 hadoop.io.Writable? The space overhead for gson (parsing issues
 not-withstanding) is pretty high, and it wouldn't be terribly hard to
 implement a high-performance thing for vectors.

 -- David

 On Tue, Jun 16, 2009 at 1:39 PM, Jeff Eastmanj...@windwardsolutions.com 
 wrote:
 +1, you added name constructors that I didn't have and the equals/equivalent
 stuff. Ya, Gson makes it all pretty trivial once you grok it.


 Grant Ingersoll wrote:

 Shall I take that as approval of the approach?

 BTW, the Gson stuff seems like a winner for serialization.

 On Jun 16, 2009, at 3:56 PM, Jeff Eastman wrote:

 You gonna commit your patch? I agree with shortening the class name in
 the JsonVectorAdapter and will do it once you commit ur stuff.
 Jeff










Re: MAHOUT-65

2009-06-18 Thread Ted Dunning
Writable should be plenty!

On Thu, Jun 18, 2009 at 1:15 PM, David Hall d...@cs.stanford.edu wrote:

 See my followup on another thread (sorry for the schizophrenic
 posting); Vector already implements Writable, so that's all I really
 can ask of it. Is there something more you'd like? I'd be happy to do
 it.




Re: MAHOUT-65

2009-06-18 Thread David Hall
How often does Mahout need the Comparable part for Vectors? Are
vectors commonly used as map output keys?

In terms of space efficiency, I'd bet it's probably a bit better than
a factor of two in the average case, especially for densevectors. The
gson format is storing both the int index and the double as raw
strings, plus whatever boundary characters.  The writable
implementation stores just the bytes of the double, plus a length.
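
A rough way to check this claim for a dense vector is to encode the same values both ways and compare byte counts. The sketch below uses only java.io; the bracketed text form is a stand-in for a JSON-style encoding and is not Mahout's actual Gson output, and the class name is made up:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Random;

final class VectorEncodingSizes {
  private VectorEncodingSizes() {}

  // Text form: bracketed, comma-separated decimal strings.
  static int textSize(double[] values) {
    StringBuilder sb = new StringBuilder("[");
    for (int i = 0; i < values.length; i++) {
      if (i > 0) {
        sb.append(',');
      }
      sb.append(values[i]);
    }
    sb.append(']');
    return sb.toString().getBytes(StandardCharsets.UTF_8).length;
  }

  // Binary form: a 4-byte length followed by 8 raw bytes per double.
  static int binarySize(double[] values) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      out.writeInt(values.length);
      for (double v : values) {
        out.writeDouble(v);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.size();
  }

  public static void main(String[] args) {
    double[] values = new double[1000];
    Random rng = new Random(42);
    for (int i = 0; i < values.length; i++) {
      values[i] = rng.nextDouble();
    }
    System.out.println("text bytes:   " + textSize(values));
    System.out.println("binary bytes: " + binarySize(values));
  }
}
```

The binary size is fixed at 4 + 8n bytes, while the text size depends on how many digits each value prints with, so the actual ratio will vary with the data.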

-- David

On Thu, Jun 18, 2009 at 2:13 PM, Jeff Eastmanj...@windwardsolutions.com wrote:
 +1 asWritableComparable is a simple implementation that uses asFormatString.
 It would be good to rewrite it for internal communication. A factor of two
 is still a factor of two.

 Jeff


 Grant Ingersoll wrote:

 On Jun 18, 2009, at 4:45 PM, Ted Dunning wrote:

 Writable should be plenty!


 +1.  Still nice to have JSON for user facing though.

 On Thu, Jun 18, 2009 at 1:15 PM, David Hall d...@cs.stanford.edu wrote:

 See my followup on another thread (sorry for the schizophrenic
 posting); Vector already implements Writable, so that's all I really
 can ask of it. Is there something more you'd like? I'd be happy to do
 it.










Re: MAHOUT-65

2009-06-18 Thread Jeff Eastman
I don't know of any situations where Vectors are used as keys. It hardly 
makes sense to use them as they are so unwieldy. Suggest we could change 
to just Writable and be ahead. In terms of the potential density 
improvement, it will be interesting to see what can typically be achieved.


r786323 just removed all calls to asWritableComparable, replacing them 
with asFormatString which was correct anyway.


Shall I change the method to asWritable()?

Jeff




Re: MAHOUT-65

2009-06-18 Thread David Hall
On Thu, Jun 18, 2009 at 3:39 PM, Jeff Eastmanj...@windwardsolutions.com wrote:
 Shall I change the method to asWritable()?

I'd just be for getting rid of it. Vector implements Writable, so
asWritable() could just be "return this;", which seems gratuitous.

As for actual efficiency:
   
lucene/mahout/trunk/core/src/main/java/org/apache/mahout/clustering/meanshift/MeanShiftCanopy.java

is currently dumping output values as the text strings. If there's a
standard dataset, that would be an easy place to do the test.

- David

 I don't know of any situations where Vectors are used as keys. It hardly
 makes sense to use them as they are so unwieldy. Suggest we could change to
 just Writable and be ahead. In terms of the potential density improvement,
 it will be interesting to see what can typically be achieved.

 r786323 just removed all calls to asWritableComparable, replacing them with
 asFormatString which was correct anyway.



 Jeff


Re: MAHOUT-65

2009-06-18 Thread Jeff Eastman
Er, um, I see what you mean. How about just deleting the method? What 
really needs doing then is for all of the various clusters to themselves 
implement Writable so that they don't need to call asFormatString but 
can just emit themselves.
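
As a sketch of what "clusters implement Writable themselves" could look like: the interface below is a local stand-in with the same two methods as org.apache.hadoop.io.Writable, so the example compiles without Hadoop, and SimpleCluster is a hypothetical class, not one of Mahout's cluster types:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Local stand-in mirroring the two methods of Hadoop's Writable.
interface Writable {
  void write(DataOutput out) throws IOException;

  void readFields(DataInput in) throws IOException;
}

// Hypothetical cluster: an id plus a dense center.
final class SimpleCluster implements Writable {
  int id;
  double[] center = new double[0];

  SimpleCluster() {}

  SimpleCluster(int id, double[] center) {
    this.id = id;
    this.center = center.clone();
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeInt(id);
    out.writeInt(center.length);
    for (double v : center) {
      out.writeDouble(v);
    }
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    id = in.readInt();
    center = new double[in.readInt()];
    for (int i = 0; i < center.length; i++) {
      center[i] = in.readDouble();
    }
  }

  // Serializes a cluster to bytes and reads it back into a new instance.
  static SimpleCluster roundTrip(SimpleCluster cluster) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      cluster.write(new DataOutputStream(bytes));
      SimpleCluster copy = new SimpleCluster();
      copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
      return copy;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    SimpleCluster copy = roundTrip(new SimpleCluster(7, new double[] {1.5, -2.0}));
    System.out.println(copy.id + " " + copy.center.length);
  }
}
```

With the real Hadoop interface, such a class could be emitted directly as a map or reduce output value instead of being converted to a format string first.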

Jeff




Ted Dunning wrote:

What does this method do?

If the vector already implements Writable, what is the purpose of a
conversion?

On Thu, Jun 18, 2009 at 3:39 PM, Jeff Eastman j...@windwardsolutions.com wrote:

  

Shall I change the method to asWritable()?






  






[jira] Created: (MAHOUT-135) Allow FileDataModel to transpose users and items

2009-06-18 Thread Grant Ingersoll (JIRA)
Allow FileDataModel to transpose users and items


 Key: MAHOUT-135
 URL: https://issues.apache.org/jira/browse/MAHOUT-135
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 0.2


Sometimes it would be nice to flip around users and items in the FileDataModel. 
 This patch adds a transpose boolean that flips userId and itemId in the 
processLine method.
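
A minimal sketch of the idea, independent of the actual patch. It assumes a comma-separated "userID,itemID,preference" line format; the class, the Preference holder and the processLine signature here are made up for illustration and are not FileDataModel's real API:

```java
final class TransposeSketch {
  private TransposeSketch() {}

  // Hypothetical value holder for one parsed line.
  static final class Preference {
    final String userId;
    final String itemId;
    final double value;

    Preference(String userId, String itemId, double value) {
      this.userId = userId;
      this.itemId = itemId;
      this.value = value;
    }
  }

  // Parses "userID,itemID,preference"; when transpose is true the first
  // two fields swap roles, so items are treated as users and vice versa.
  static Preference processLine(String line, boolean transpose) {
    String[] parts = line.split(",");
    if (parts.length < 3) {
      throw new IllegalArgumentException("Expected 3 fields: " + line);
    }
    String first = parts[0].trim();
    String second = parts[1].trim();
    double value = Double.parseDouble(parts[2].trim());
    return transpose
        ? new Preference(second, first, value)
        : new Preference(first, second, value);
  }

  public static void main(String[] args) {
    Preference p = processLine("u1,i9,4.5", true);
    System.out.println(p.userId + " " + p.itemId + " " + p.value);
  }
}
```

Because the swap happens at parse time, everything downstream of the data model sees a consistent, already-transposed view.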

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAHOUT-135) Allow FileDataModel to transpose users and items

2009-06-18 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated MAHOUT-135:
---

Attachment: MAHOUT-135.patch

Patch that adds transpose and tests

 Allow FileDataModel to transpose users and items
 

 Key: MAHOUT-135
 URL: https://issues.apache.org/jira/browse/MAHOUT-135
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 0.2

 Attachments: MAHOUT-135.patch


 Sometimes it would be nice to flip around users and items in the 
 FileDataModel.  This patch adds a transpose boolean that flips userId and 
 itemId in the processLine method.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-18 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721646#action_12721646
 ] 

Sean Owen commented on MAHOUT-121:
--

Since I am not hearing objections, and cognizant that people are waiting on 
this, going to commit. If there are issues we can roll back or tweak from there.

 Speed up distance calculations for sparse vectors
 -

 Key: MAHOUT-121
 URL: https://issues.apache.org/jira/browse/MAHOUT-121
 Project: Mahout
  Issue Type: Improvement
  Components: Matrix
Reporter: Shashikant Kore
 Attachments: MAHOUT-121.patch, MAHOUT-121.patch, MAHOUT-121.patch, 
 MAHOUT-121.patch, MAHOUT-121.patch, mahout-121.patch, MAHOUT-121jfe.patch, 
 Mahout1211.patch


 From my mail to the Mahout mailing list.
 I am working on clustering a dataset which has thousands of sparse vectors. 
 The complete dataset has a few tens of thousands of feature items but each 
 vector has only a couple of hundred feature items. For this, there is an 
 optimization in distance calculation, a link to which I found in the archives 
 of the Mahout mailing list.
 http://lingpipe-blog.com/2009/03/12/speeding-up-k-means-clustering-algebra-sparse-vectors/
 I tried out this optimization.  The test setup had 2000 document vectors 
 with a few hundred items.  I ran canopy generation with Euclidean distance and 
 t1, t2 values as 250 and 200.
  
 Current Canopy Generation: 28 min 15 sec.
 Canopy Generation with distance optimization: 1 min 38 sec.
 I know by experience that using Integer, Double objects instead of primitives 
 is computationally expensive. I changed the sparse vector implementation to 
 use primitive collections by Trove [ http://trove4j.sourceforge.net/ ].
 Distance optimization with Trove: 59 sec
 Current canopy generation with Trove: 21 min 55 sec
 To sum up, these two optimizations reduced cluster generation time by 97%.
 Currently, I have made the changes for Euclidean Distance, Canopy and KMeans. 
  
 Licensing of Trove seems to be an issue which needs to be addressed.
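
One common form of this kind of optimization, sketched here under my own assumptions rather than taken from the patch, expands the squared Euclidean distance as |x|^2 - 2 x.c + |c|^2. The two squared norms can be computed once and cached, and the dot product only has to visit the non-zero entries of the sparse vector. The class name and the array-based sparse representation are made up for illustration:

```java
final class SparseDistanceSketch {
  private SparseDistanceSketch() {}

  // Sum of squares of all entries.
  static double squaredNorm(double[] values) {
    double sum = 0.0;
    for (double v : values) {
      sum += v * v;
    }
    return sum;
  }

  // Dot product that touches only the non-zero entries of the sparse vector,
  // given as parallel arrays of indices and values.
  static double dot(int[] indices, double[] values, double[] dense) {
    double sum = 0.0;
    for (int i = 0; i < indices.length; i++) {
      sum += values[i] * dense[indices[i]];
    }
    return sum;
  }

  // |x - c|^2 = |x|^2 - 2 x.c + |c|^2, with both squared norms precomputed.
  static double squaredDistance(
      int[] indices, double[] values, double sparseNormSq,
      double[] dense, double denseNormSq) {
    double d = sparseNormSq - 2.0 * dot(indices, values, dense) + denseNormSq;
    return Math.max(0.0, d); // rounding can push the result slightly below zero
  }

  public static void main(String[] args) {
    int[] indices = {1};          // sparse x = (0, 3, 0)
    double[] values = {3.0};
    double[] center = {1.0, 1.0, 2.0};
    double d2 = squaredDistance(
        indices, values, squaredNorm(values), center, squaredNorm(center));
    System.out.println(Math.sqrt(d2));
  }
}
```

The cost per distance drops from the full dimensionality to the number of non-zero entries, which matters when vectors have a few hundred features out of tens of thousands.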

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: [GSOC] Thoughts about Random forests map-reduce implementation

2009-06-18 Thread Ted Dunning
Very similar, but I was talking about building trees on each split of the
data (a la map reduce split).

That would give many small splits and would thus give very different results
from bagging because the splits would be small and contiguous rather than
large and random.
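
To illustrate the difference in how rows reach each tree, here is a small sketch with made-up names. A contiguous split hands a tree one consecutive block of rows, whereas a bootstrap sample draws n rows with replacement from the whole dataset, which on average covers roughly 63% of the distinct rows (the "about 2/3" mentioned below):

```java
import java.util.Random;

final class SplitVsBagSketch {
  private SplitVsBagSketch() {}

  // Row range [start, end) of the i-th of k contiguous splits over n rows.
  static int[] contiguousSplit(int n, int k, int i) {
    int start = (int) ((long) n * i / k);
    int end = (int) ((long) n * (i + 1) / k);
    return new int[] {start, end};
  }

  // Bagging-style sample: n row indices drawn uniformly with replacement.
  static int[] bootstrapSample(int n, Random rng) {
    int[] rows = new int[n];
    for (int j = 0; j < n; j++) {
      rows[j] = rng.nextInt(n);
    }
    return rows;
  }

  // Number of distinct row indices in a sample over n rows.
  static int countDistinct(int[] rows, int n) {
    boolean[] seen = new boolean[n];
    int distinct = 0;
    for (int r : rows) {
      if (!seen[r]) {
        seen[r] = true;
        distinct++;
      }
    }
    return distinct;
  }

  public static void main(String[] args) {
    int n = 10000;
    int[] split = contiguousSplit(n, 100, 3);
    System.out.println("split rows: " + split[0] + ".." + (split[1] - 1));
    int[] bag = bootstrapSample(n, new Random(1));
    System.out.println("distinct rows in bag: " + countDistinct(bag, n));
  }
}
```

If the input file is ordered in any meaningful way, a tree trained on one contiguous split sees a much less representative slice than a tree trained on a bootstrap sample.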


On Thu, Jun 18, 2009 at 1:37 AM, deneche abdelhakim a_dene...@yahoo.fr wrote:

 build multiple trees for different portions of the data

 What's the difference with the basic bagging algorithm, which builds 'each
 tree' using a different portion (about 2/3) of the data ?


[jira] Commented: (MAHOUT-135) Allow FileDataModel to transpose users and items

2009-06-18 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721653#action_12721653
 ] 

Sean Owen commented on MAHOUT-135:
--

Looks OK to me -- I applied the patch locally and tweaked a few things. Seems 
like a rare use case but simple to implement anyway. Mind if I submit over here?




Re: [jira] Commented: (MAHOUT-135) Allow FileDataModel to transpose users and items

2009-06-18 Thread Ted Dunning
Transposing is actually a common need as you abstract away from users and
ratings.

On Thu, Jun 18, 2009 at 10:19 PM, Sean Owen (JIRA) j...@apache.org wrote:

 Looks OK to me -- I applied the patch locally and tweaked a few things.
 Seems like a rare use case but simple to implement anyway. Mind if I submit
 over here?

  Allow FileDataModel to transpose users and items