Re: Commercializing Mahout: the Myrrix recommender platform

2012-04-20 Thread Jeff Eastman
+1 to both of these developments. I'm very happy to see corporate involvement in Mahout and I think it will be very good for the project in the long run. For-profit priorities will certainly have an impact upon our future activities but this will lead to broader market acceptance and use. On

Re: Commercializing Mahout: the Myrrix recommender platform

2012-04-20 Thread Hector Yee
On a related note, I wish I could share the data I have to see how these algorithms stack up to the ones we use for large-scale learning. Are there other examples of large data sets people use? I know there's the Exxon one and possibly the one used in the Netflix Prize. There's also ImageNet but

Re: Commercializing Mahout: the Myrrix recommender platform

2012-04-20 Thread Sebastian Schelter
Yahoo offers a 700M datapoints ratings dataset [1] which I recently used. That's still academically large but at least it's a lot more challenging than Netflix :) [1] http://webscope.sandbox.yahoo.com/catalog.php?datatype=r Best, Sebastian On 20.04.2012 18:05, Hector Yee wrote: On a related note,

Quartiles computation with M/R or Pig (combine function states)

2012-04-20 Thread Dmitriy Lyubimov
Hello, There should be some way to compile quartiles in a map/reduce fashion (i.e. with an API similar to Pig's Arithmetic custom function) without keeping an enormous count hash? There's this countsketch thing that I implemented before on map reduce, but it is sort of like a Bloom filter: if it gives a

Re: Quartiles computation with M/R or Pig (combine function states)

2012-04-20 Thread Dmitriy Lyubimov
Thanks in advance. On Fri, Apr 20, 2012 at 10:44 AM, Dmitriy Lyubimov dlie...@gmail.com wrote: Hello, There should be some way to compile quartiles in a map/reduce fashion (i.e. with api similar to Pig's Arithmetic custom function) without keeping enormous count hash? There's this

[jira] [Created] (MAHOUT-1000) Implementation of Single Sample T-Test using Map Reduce/Mahout

2012-04-20 Thread Dev Lakhani (Created) (JIRA)
Implementation of Single Sample T-Test using Map Reduce/Mahout -- Key: MAHOUT-1000 URL: https://issues.apache.org/jira/browse/MAHOUT-1000 Project: Mahout Issue Type: New Feature
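
For readers following the issue, here is a minimal sketch of how a single-sample t statistic can be computed in a map/reduce style. It assumes each mapper emits a small tuple of sufficient statistics (count, sum, sum of squares) that a combiner or reducer adds up. The function names `partial_stats`, `combine`, and `t_statistic` are hypothetical and are not taken from the MAHOUT-1000 patch or from any Mahout API.

```python
import math

def partial_stats(values):
    """Map side: reduce one partition to (count, sum, sum of squares)."""
    n, s, ss = 0, 0.0, 0.0
    for v in values:
        n += 1
        s += v
        ss += v * v
    return (n, s, ss)

def combine(a, b):
    """Combiner/reducer: sufficient statistics simply add component-wise."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def t_statistic(stats, mu0):
    """Single-sample t statistic for the null hypothesis mean == mu0."""
    n, s, ss = stats
    if n < 2:
        raise ValueError("need at least two observations")
    mean = s / n
    # Unbiased sample variance from the accumulated sums.
    var = (ss - n * mean * mean) / (n - 1)
    if var <= 0:
        raise ValueError("sample variance is zero")
    return (mean - mu0) / math.sqrt(var / n)

# Two partitions of the same data set, combined before the final step.
merged = combine(partial_stats([1, 2]), partial_stats([3, 4, 5]))
t = t_statistic(merged, mu0=2.0)
```

The sum-of-squares form is easy to combine but can lose precision when the mean is large relative to the spread; a production version would more likely merge running means and centered second moments instead.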

Re: Quartiles computation with M/R or Pig (combine function states)

2012-04-20 Thread Hector Yee
How about this: http://en.wikipedia.org/wiki/Reservoir_sampling On Fri, Apr 20, 2012 at 10:44 AM, Dmitriy Lyubimov dlie...@gmail.com wrote: Hello, There should be some way to compile quartiles in a map/reduce fashion (i.e. with api similar to Pig's Arithmetic custom function) without keeping
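
A minimal sketch of the reservoir sampling suggestion, assuming the quartiles are then estimated from the fixed-size sample rather than from the full data. This is the textbook Algorithm R, not code from Mahout or Pig; `reservoir_sample` and `quartiles` are illustrative names, and the nearest-rank rule used for the quartiles is one of several possible conventions.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of up to k items from a stream of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for i, x in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(x)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = x
    return reservoir

def quartiles(sample):
    """Estimate (Q1, median, Q3) from a non-empty sample by nearest rank."""
    s = sorted(sample)
    n = len(s)
    return tuple(s[min(n - 1, int(q * n))] for q in (0.25, 0.5, 0.75))

sample = reservoir_sample(range(100000), 2000, random.Random(42))
q1, median, q3 = quartiles(sample)
```

Memory use is bounded by `k` regardless of the input size, at the cost of sampling error in the estimated quartiles.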

Re: Quartiles computation with M/R or Pig (combine function states)

2012-04-20 Thread Dmitriy Lyubimov
Thank you, sir. Let me consider this. On Fri, Apr 20, 2012 at 11:50 AM, Hector Yee hector@gmail.com wrote: how about this http://en.wikipedia.org/wiki/Reservoir_sampling On Fri, Apr 20, 2012 at 10:44 AM, Dmitriy Lyubimov dlie...@gmail.com wrote: Hello, There should be some way to

Re: Quartiles computation with M/R or Pig (combine function states)

2012-04-20 Thread Ted Dunning
Look at our OnlineSummarizer. This should be roughly parallelizable. On Fri, Apr 20, 2012 at 2:12 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: Thank you, sir. Let me consider this. On Fri, Apr 20, 2012 at 11:50 AM, Hector Yee hector@gmail.com wrote: how about this

Re: Quartiles computation with M/R or Pig (combine function states)

2012-04-20 Thread Dmitriy Lyubimov
Thank you, Ted. On Fri, Apr 20, 2012 at 2:30 PM, Ted Dunning ted.dunn...@gmail.com wrote: Look at our OnlineSummarizer. This should be roughly parallelizable. On Fri, Apr 20, 2012 at 2:12 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: Thank you, sir. Let me consider this. On Fri, Apr 20,

Re: Quartiles computation with M/R or Pig (combine function states)

2012-04-20 Thread Ted Dunning
The basic idea is that you would extend the OnlineSummarizer to get more quantiles. Then you would combine these OnlineSummarizer estimates weighted by how much data they represent. This won't work if the data is perversely ordered. Hector's suggestions will give you lower accuracy for random
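
A rough sketch of the weighted combination described above, assuming each partition reports how many values it saw together with its own quartile estimates. It does not use Mahout's OnlineSummarizer API; `combine_quantile_estimates` and the `(count, (q1, median, q3))` input shape are hypothetical choices for illustration.

```python
def combine_quantile_estimates(partials):
    """Average per-partition quantile estimates, weighted by partition size.

    partials: iterable of (count, (q1, median, q3)) pairs, one per partition.
    """
    total = 0
    sums = [0.0, 0.0, 0.0]
    for count, estimates in partials:
        total += count
        for i, q in enumerate(estimates):
            sums[i] += count * q
    if total == 0:
        raise ValueError("no data to combine")
    return tuple(s / total for s in sums)

# A small partition and a partition three times its size.
combined = combine_quantile_estimates([
    (100, (1.0, 2.0, 3.0)),
    (300, (5.0, 6.0, 7.0)),
])
```

The averaging is only reasonable when every partition sees roughly the same distribution. If the input is sorted or otherwise perversely ordered, each partition's quartiles describe a different slice of the data and the weighted average is not a meaningful global estimate.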

[jira] [Commented] (MAHOUT-1000) Implementation of Single Sample T-Test using Map Reduce/Mahout

2012-04-20 Thread Ted Dunning (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13258623#comment-13258623 ] Ted Dunning commented on MAHOUT-1000: - I am not sure that I see the value here. All

Jenkins build is still unstable: Mahout-Quality #1444

2012-04-20 Thread Apache Jenkins Server
See https://builds.apache.org/job/Mahout-Quality/1444/