+1 to both of these developments. I'm very happy to see corporate
involvement in Mahout and I think it will be very good for the project
in the long run. For-profit priorities will certainly have an impact
upon our future activities but this will lead to broader market
acceptance and use.
On
On a related note, I wish I could share the data I have to see how these
algorithms stack up against the ones we use for large-scale learning.
Are there other examples of large data sets people use? I know there's the
Exxon one and possibly the one used in the Netflix Prize.
There's also ImageNet but
Yahoo offers a ratings dataset with 700M data points [1], which I recently
used. That's still academically large, but at least it's a lot more
challenging than Netflix :)
[1] http://webscope.sandbox.yahoo.com/catalog.php?datatype=r
Best,
Sebastian
On 20.04.2012 18:05, Hector Yee wrote:
On a related note,
Hello,
Is there some way to compute quartiles in a map/reduce fashion
(i.e. with an API similar to Pig's Arithmetic custom function) without
keeping an enormous count hash?
There's this count sketch thing that I implemented before on MapReduce,
but it is sort of like a Bloom filter: if it gives a
Thanks in advance.
On Fri, Apr 20, 2012 at 10:44 AM, Dmitriy Lyubimov dlie...@gmail.com wrote:
Hello,
Is there some way to compute quartiles in a map/reduce fashion
(i.e. with an API similar to Pig's Arithmetic custom function) without
keeping an enormous count hash?
There's this
Implementation of Single Sample T-Test using Map Reduce/Mahout
--
Key: MAHOUT-1000
URL: https://issues.apache.org/jira/browse/MAHOUT-1000
Project: Mahout
Issue Type: New Feature
how about this
http://en.wikipedia.org/wiki/Reservoir_sampling
On Fri, Apr 20, 2012 at 10:44 AM, Dmitriy Lyubimov dlie...@gmail.com wrote:
Hello,
Is there some way to compute quartiles in a map/reduce fashion
(i.e. with an API similar to Pig's Arithmetic custom function) without
keeping
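For readers unfamiliar with the reservoir sampling suggestion, here is a minimal single-stream sketch in Python. It is not Mahout or Pig code: the function names `reservoir_sample` and `quartiles`, the sample size, and the nearest-rank quartile rule are all assumptions made for illustration. The idea is to keep a fixed-size uniform sample instead of a count hash and read approximate quartiles off the sorted sample.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of up to k items from a stream
    of unknown length (the classic "Algorithm R")."""
    rng = rng or random.Random()
    reservoir = []
    for i, x in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(x)
        else:
            # Item i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = x
    return reservoir


def quartiles(sample):
    """Approximate (Q1, median, Q3) from a sample using a simple
    nearest-rank rule (an assumption; other rules are equally valid)."""
    s = sorted(sample)
    n = len(s)
    if n == 0:
        raise ValueError("empty sample")
    return tuple(s[min(n - 1, int(q * n))] for q in (0.25, 0.5, 0.75))
```

A call such as `quartiles(reservoir_sample(values, 1000))` uses memory proportional to the sample size only. Merging reservoirs built by separate mappers needs extra care (each reservoir has to be weighted by the number of items it saw), which this sketch does not attempt.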
Thank you, sir. Let me consider this.
On Fri, Apr 20, 2012 at 11:50 AM, Hector Yee hector@gmail.com wrote:
how about this
http://en.wikipedia.org/wiki/Reservoir_sampling
On Fri, Apr 20, 2012 at 10:44 AM, Dmitriy Lyubimov dlie...@gmail.com wrote:
Hello,
There should be some way to
Look at our OnlineSummarizer. This should be roughly parallelizable.
On Fri, Apr 20, 2012 at 2:12 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:
Thank you, sir. Let me consider this.
On Fri, Apr 20, 2012 at 11:50 AM, Hector Yee hector@gmail.com wrote:
how about this
Thank you, Ted.
On Fri, Apr 20, 2012 at 2:30 PM, Ted Dunning ted.dunn...@gmail.com wrote:
Look at our OnlineSummarizer. This should be roughly parallelizable.
On Fri, Apr 20, 2012 at 2:12 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:
Thank you, sir. Let me consider this.
On Fri, Apr 20,
The basic idea is that you would extend the OnlineSummarizer to get more
quantiles. Then you would combine these OnlineSummarizer estimates
weighted by how much data they represent. This won't work if the data is
perversely ordered. Hector's suggestions will give you lower accuracy for
random
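The weighted combination described above could look roughly like the following Python sketch. It does not use the actual OnlineSummarizer API; the function name `combine_quantiles` and the `(count, estimates)` input shape are assumptions made for illustration. Each partition is assumed to report how many values it saw together with its own quantile estimates, and the combined estimate is the count-weighted average.

```python
def combine_quantiles(partials):
    """Combine per-partition quantile estimates into one estimate.

    partials: list of (count, estimates) pairs, where `estimates` is a
    list of quantile values (e.g. [q1, median, q3]) computed on `count`
    data points. Every partition must report the same quantiles.
    """
    total = sum(count for count, _ in partials)
    if total == 0:
        raise ValueError("no data to combine")
    width = len(partials[0][1])
    if any(len(est) != width for _, est in partials):
        raise ValueError("partitions report different numbers of quantiles")
    # Weight each partition's estimate by the share of data it represents.
    return [
        sum(count * est[i] for count, est in partials) / total
        for i in range(width)
    ]
```

For example, `combine_quantiles([(100, [1.0, 2.0, 3.0]), (300, [2.0, 4.0, 6.0])])` gives `[1.75, 3.5, 5.25]`. As noted above, this is only reasonable when each partition looks like a random slice of the data; if the input is perversely ordered across partitions, the averaged values can be far from the true quantiles.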
[
https://issues.apache.org/jira/browse/MAHOUT-1000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13258623#comment-13258623
]
Ted Dunning commented on MAHOUT-1000:
-
I am not sure that I see the value here. All
See https://builds.apache.org/job/Mahout-Quality/1444/