Hi,
I'm upgrading some classification code from 0.6 to 0.8 and am wondering what
the replacement is for the ClassifierContext?
Thanks,
Grant
Hi Mahout Users,
Drew Farris, Tom Morton and I are currently working on the 2nd Edition of
Taming Text (http://www.manning.com/ingersoll for first ed.) and are soliciting
interested parties who would be willing to contribute to a chapter on practical
use cases (i.e. you have something in
question but I find it hard to find details on
these specifics.
Many thanks,
Will
Grant Ingersoll | @gsingers
http://www.lucidworks.com
The Apache Mahout PMC is pleased to announce the release of Mahout 0.8.
Mahout's goal is to build scalable machine learning libraries focused
primarily in the areas of collaborative filtering (recommenders),
clustering and classification (known collectively as the 3Cs), as well as the
A _preview_ of the release artifacts for 0.8 is at
https://repository.apache.org/content/repositories/orgapachemahout-113/org/apache/mahout/.
This is not an official release. I will call a vote in a day or two, pending
feedback on this thread, so please review/test.
A _preview_ of the release
, 2013 at 1:06 PM, Grant Ingersoll gsing...@apache.org
wrote:
The CSVVectorIterator in the Integration package will take in a CSV file
and produce vectors. It assumes that each row is the equivalent of a
DenseVector (does MovieLens fit that?) If you need otherwise, I'd
suggest
starting
doubt is, Is
there any need to convert the movielens rating.csv file into a sequence
file. If needed what are the commands for applying clustering technique
using mahout and the hadoop.
Thanking you,
Neetha Suan Thampi
Grant Ingersoll | @gsingers
committers, this is a biased first proposal,
please shout, if you see things different and want to have things kept.
Best,
Sebastian
On 08.06.2013 16:42, Grant Ingersoll wrote:
More tests are always welcome.
On Jun 8, 2013, at 10:29 AM, Ravi Mummulla ravi.mummu...@gmail.com wrote:
Hi Grant
- rowid - cvb.
lucene.vector will still give you higher performance at the cost of extra
storage (and the fact that it doesn't work in M/R and can't handle multiple
directories).
I'd say we keep it for now.
From: Grant Ingersoll gsing...@apache.org
Grant Ingersoll | @gsingers
http://www.lucidworks.com
On Jun 2, 2013, at 10:42 AM, Sebastian Schelter s...@apache.org wrote:
I don't think unmaintained code should stay in our codebase.
+1
This will
only create frustration amongst our users, as they will not get
questions answered and bugs fixed. It would also be an obstacle for a
1.0
FP Growth seems to not have a lot of dev support. Are there users out there
using it? Should it live on or get the axe prior to 1.0?
-Grant
.
Grant Ingersoll | @gsingers
http://www.lucidworks.com
Hi,
I'm looking for interns for the summer for those interested in Mahout and
Machine Learning:
Research Engineer Internship
DESCRIPTION
LucidWorks, the leading commercial company for Apache Lucene and Solr, is
looking for interns to work on building next generation search, analytics and
--
View this message in context:
http://lucene.472066.n3.nabble.com/Clustering-using-Solr-Index-vs-Lucene-Index-Different-Results-tp4037198.html
Sent from the Mahout User List mailing list archive at Nabble.com.
Grant Ingersoll
http
@ Cloudera
hadoop: http://www.cloudera.com
Grant Ingersoll
http://www.lucidworks.com
a
solicitation of an offer to buy, any financial product.
Grant Ingersoll
http://www.lucidworks.com
Hi,
I'm wondering if anyone has any rules of thumb around model size and memory usage
for SGD? I'm doing some testing of it myself, but thought I would ask to see
how it compares.
Thanks,
Grant
Grant Ingersoll
http://www.lucidimagination.com
On Apr 20, 2012, at 12:05 PM, Hector Yee wrote:
On a related note, I wish I could share the data I have to see how these
algorithms stack up to the ones we use for large scale learning.
That certainly would be interesting.
Are there other examples of large data sets people use? I know
, as it's key to the
long-term project health. It's most certainly going to be the year of
the application layer (analytics, machine learning) for Big Data.
Thank you!
Sean
Grant Ingersoll
http://www.lucidimagination.com
Hi,
I have internships open for this summer for students interested in working on
search and machine learning. Description is below.
-Grant
Research Engineer Internship
DESCRIPTION
Lucid Imagination, the leading commercial company for Apache Lucene and Solr,
is looking for interns to work
accompanied by some plan to address the
contributions already in line in JIRA. It's not OK to be implicitly
rejecting so much from the community by not planning to fix that first
and foremost.
Grant Ingersoll
http://www.lucidimagination.com
Hi Mahouts,
Thought some here might be interested as search and machine learning often go
together.
--
Lucene Revolution will be here May 9-10 in Boston. Reserve your spot today with
Early Bird pricing of $575. Committers and accepted speakers are entitled to
free admission. Our CFP is
On Feb 22, 2012, at 7:24 AM, Jake Mannix wrote:
On recent threads on the dev@ list, and discussions off-list, it's pretty
clear that we need to have cleanup be a priority for the next release.
How about this for a formal proposal:
- The 0.7 release will have issues (both new and on
, California
TRAVEL
Minimal
Grant Ingersoll
http://www.lucidimagination.com
On Jan 31, 2012, at 2:14 PM, Keary Cavin wrote:
Dhruv, I downloaded the MAHOUT-627 patch and applied the files to the current
mahout release. I'll let you know when I have questions.
Note, the plan is to put this patch into 0.7 once the remaining test issue is
fixed.
-Grant
3 -seq
Thanks,
John
On Sun, Jan 22, 2012 at 3:00 PM, Grant Ingersoll gsing...@apache.org wrote:
What were the command/options you were passing in?
On Jan 18, 2012, at 4:26 PM, John Conwell wrote:
I got latest from Trunk and built it, and when
running SparseVectorsFromSequenceFiles
,
minLLRValue, -1.0f, false, reduceTasks, chunkSize,
sequentialAccessOutput, namedVectors);
}
--
Thanks,
John C
--
-- John C
Grant Ingersoll
http://www.lucidimagination.com
please help?
--
Regards
Junaid
Grant Ingersoll
http://www.lucidimagination.com
EMR including
clusterdumper following the instructions on:
https://cwiki.apache.org/MAHOUT/mahout-on-elastic-mapreduce.html
Thanks once again,
Ipshita
Grant Ingersoll
http://www.lucidimagination.com
the TF IDF from the documents
present in a directory.
Can you please help me with the Steps to go about it using Apache Mahout?
Thank you.
--
Regards
Junaid
Grant Ingersoll
http://www.lucidimagination.com
task is to try and predict what project an email belongs to based on
its content.
Are these textual
features? Or what?
On Tue, Jan 3, 2012 at 2:53 PM, Grant Ingersoll gsing...@apache.org wrote:
I'm trying to run the full ASF email SGD classifier problem and am facing
heap size issues
, there are some issues w/ this example and the SGD code that are still
being worked through. See https://issues.apache.org/jira/browse/MAHOUT-904 for
more info.
Grant Ingersoll
http://www.lucidimagination.com
a compressed binary format would be useful for representing such
attributes, unless you also needed a count.
Thanks, Don
--- On Wed, 12/21/11, Grant Ingersoll gsing...@apache.org wrote:
From: Grant Ingersoll gsing...@apache.org
Subject: Re: Will mahout arff.vector correctly convert
to poke around does anyone agree this looks wrong?
I'm running a 0.6-SNAPSHOT I cloned today from github. Was considering
trying 0.5 but a quick look at recent changes doesn't seem to suggest this
code has changed in a while...
Cheers,
Mat
Grant
Grant that was the point of my first question..
Now I'll take a look at the vector implementation.
Thanks again
Daniele
On 14 December 2011 23:44, Grant Ingersoll gsing...@apache.org wrote:
While Ted answered the Dissector question, your original issue, I believe,
is that Mahout currently
:370)
at
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
On 13 December 2011 19:52, Grant Ingersoll gsing...@apache.org wrote:
What steps have you done?
On Dec 13, 2011, at 12:29 PM, Daniele Volpi wrote:
Hi everyone,
I'm trying to implement the Naive Bayes
get this error:
java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to
org.apache.mahout.math.VectorWritable
Do you have some hints on the right usage of this class?
Thanks,
Daniele Volpi
Grant Ingersoll
http
happened? please help me, thanks a lot.
--
View this message in context:
http://lucene.472066.n3.nabble.com/mahout-exception-lucene-vector-tp3569144p3569144.html
Sent from the Mahout User List mailing list archive at Nabble.com.
Grant Ingersoll
testclassifier \
-m ${WORK_DIR}/myproj-bydate/bayes-model \
-d ${WORK_DIR}/myproj-bydate/bayes-test-input \
-type bayes \
-ng 1 \
-source hdfs \
-v \
-method mapreduce
Any suggestions? Thanks
Grant Ingersoll
http://www.lucidimagination.com
Grant Ingersoll
http://www.lucidimagination.com
I launched a micro instance and mounted the volume and downloaded it. That's
the only way to get that exact data set that I am aware of. I've got a smaller
sample up on the Lucid website. Otherwise, if you just want something like it,
you can use your ASF credentials to get it. I can point
attached it, but those get stripped. I didn't realize that this
was going to the list.
Try here: http://dl.dropbox.com/u/36863361/cluster-viz.r
And here for the image: http://dl.dropbox.com/u/36863361/xyz.png
On Wed, Nov 30, 2011 at 4:04 PM, Grant Ingersoll gsing...@apache.org
wrote:
Can you
are near.xyz.png
On Tue, Nov 29, 2011 at 8:03 AM, Grant Ingersoll gsing...@apache.org wrote:
I'm still learning R, do you have code handy you could share?
On Nov 29, 2011, at 6:25 AM, Ted Dunning wrote:
Coloring is pretty easy in R, which is what I use. I just build a color
map with the right
://issues.apache.org/jira/browse/MAHOUT-899) but would really like to be
able to produce much prettier visualizations out of the box.
Grant Ingersoll
http://www.lucidimagination.com
the transparency according to how seriously
down-sampled the cluster is. That lets me get a good visual feel for the
actual cluster size.
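A minimal sketch of the coloring idea described above, in Python rather than the R used in the thread. It assumes evenly spaced hues for the k clusters and an alpha value that grows with how heavily a cluster was down-sampled; the function name `cluster_colors` and the `sample_fractions` input are hypothetical and are not taken from the original script.

```python
import colorsys


def cluster_colors(sample_fractions):
    """Return one (r, g, b, alpha) tuple per cluster.

    sample_fractions[i] is the (hypothetical) fraction of cluster i's points
    that survived down-sampling, in the range (0, 1].
    """
    k = len(sample_fractions)
    if k == 0:
        return []
    if any(f <= 0 or f > 1 for f in sample_fractions):
        raise ValueError("each sample fraction must be in (0, 1]")

    # Each plotted point stands in for 1/fraction original points.
    weights = [1.0 / f for f in sample_fractions]
    max_weight = max(weights)

    colors = []
    for i, weight in enumerate(weights):
        # Evenly spaced hues keep the k clusters visually distinct.
        r, g, b = colorsys.hsv_to_rgb(i / k, 0.8, 0.9)
        # More heavily down-sampled clusters are drawn more opaque.
        colors.append((r, g, b, weight / max_weight))
    return colors
```

Calling `cluster_colors([1.0, 0.5, 0.1])` gives three colors, with the most heavily down-sampled cluster fully opaque. The linear scaling of alpha is only one possible choice; a log scale may read better when cluster sizes differ by orders of magnitude.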
On Tue, Nov 29, 2011 at 5:03 AM, Grant Ingersoll gsing...@apache.org wrote:
Anyone have an easy algorithm for coloring clusters in a nice way? That
is, given k
that the NGram attribute was set to the default value of 1 when
creating the tf-idf vectors from sequence files.
Suneel
From: Grant Ingersoll gsing...@apache.org
To: user@mahout.apache.org
Sent: Tuesday, October 25, 2011 5:55 AM
Subject: Re: MinHash
Grant Ingersoll
http://www.lucidimagination.com
();
But it is returning null .
Please help me to move further .
Thanks and Regards,
S SYED ABDUL KATHER
Grant Ingersoll
http://www.lucidimagination.com
For those in the San Francisco area, there will be a Mahout User Meeting on
Nov. 29th at Lucid Imagination's offices. Details and RSVP are at
http://sf-mahout-11-11.eventbrite.com/
For those not in the SF area, I _believe_ we will be recording it and posting
it.
, such that I wonder if they are more
or less empty. Running now to check.
I am assuming that the NGram attribute was set to the default value of 1 when
creating the tf-idf vectors from sequence files.
Suneel
From: Grant Ingersoll gsing...@apache.org
To: user@mahout.apache.org
Sent: Tuesday, October 25, 2011 5:55 AM
Subject: Re: MinHash Clustering in Mahout
On Oct 19, 2011, at 11:38 AM, Varun Thacker wrote:
I was trying to run the MinHash algorithm
in Action. If they do what I think they do, I will definitely try
them, and probably complain on the list (Ted) if I can't interpret them right
:).
Thanks for the reply,
--
Ioan Eugen Stan
Grant Ingersoll
http://www.lucidimagination.com
,
k,
null, //returnableFeatures
output,
updater)
Grant Ingersoll
http://www.lucidimagination.com
Might be of interest: Clustering Very Large Multi-dimensional Datasets with
MapReduce
http://www.cs.cmu.edu/~jclopez/ref/kdd2011-mr-clustering.pdf
Grant Ingersoll
http://www.lucidimagination.com
I've never implemented LSI. Is there a way to incrementally build the model
(by simply indexing documents) or is it something that one only runs after the
fact once one has built up the much bigger matrix? If it's the former, I bet
it wouldn't be that hard to just implement the appropriate
Might be useful: https://github.com/algoriffic/lsa4solr
Looks like it hasn't been kept up to date.
On Nov 13, 2011, at 1:47 PM, Sebastian Schelter wrote:
Is there some documentation/tutorial available on how to build a LSI
pipeline with mahout and lucene?
--sebastian
:0.011,2:0.032,..etc
As seen above in MSV-441 there is no presence of : in the output whereas
MSV-770 has ):-0.025.
Can anyone shed some light on what the difference is and why it is
present there?
Thanks.
Grant Ingersoll
http
Cool, how about adding it to the Wiki?
On Nov 9, 2011, at 8:15 AM, Suneel Marthi wrote:
I can put together a doc if we don't already have one, know the SGD code
pretty well.
Regards,
Suneel
From: Grant Ingersoll grant.ingers...@gmail.com
To: user
this
far. Any help would be gratefully received.
R
Grant Ingersoll
http://www.lucidimagination.com
.
-Grant
On Nov 7, 2011, at 8:54 PM, Suneel Marthi wrote:
Do we have an answer for this?
Sent from my iPhone
On Nov 2, 2011, at 7:20 AM, Grant Ingersoll gsing...@apache.org wrote:
What's the Minhash key groups value used for in the MinhashDriver? I mean,
I see it is used for building up
We've been debating removing/archiving the Watchmaker integration in Mahout due
to seeming lack of maintenance and interest. Is anybody actually using it?
-Grant
,
thanks guys.
That would be a great addition! Also, javadoc would be helpful, so patches
would be great there.
Grant Ingersoll
http://www.lucidimagination.com
field.
Also, I assume at some point this could be a map-reduce job in hadoop.
I'm just asking for sanity check, or if there are any better ideas out there.
Thanks
Bob
--
Grant Ingersoll
http://www.lucidimagination.com
two speakers giving presentations related to Mahout: Ted Dunning, MapR
and Grant Ingersoll of Lucid Imagination (me). Both Ted and Grant are long
time committers on the Mahout project.
Ted's talk: How and why random projections work?
Mine: Using Mahout to Cluster, Classify and Recommend
What's the Minhash key groups value used for in the MinhashDriver? I mean, I
see it is used for building up the key out of the hashed values, but what's the
significance of different values for it? The default is 2, what does it mean
practically speaking if I choose, say, 10? AFAICT, it
What functionality, specifically, are you proposing to remove? I know we had a
lot of discussion around some of this stuff way back when as to how best to do
it, but of course, that doesn't mean it has uptake. If it's on the Matrix,
then doesn't it more easily get shipped around via the
On Nov 2, 2011, at 7:17 AM, Tharindu Mathew wrote:
I want to create a java UI tool (based on a web app) that can pick and
apply different algorithms available in Mahout to different data sets.
Very cool! Keep us posted, as this would be immensely useful! Any chance it
will be donated back?
On Nov 2, 2011, at 10:58 AM, Jake Mannix wrote:
On Wed, Nov 2, 2011 at 7:34 AM, Grant Ingersoll gsing...@apache.org wrote:
What functionality, specifically, are you proposing to remove?
I'm suggesting we kill, from Matrix.java and descendants, all of the
following methods
In the vein of users become contributors become committers:
It seems there has been some spark of interest in contributing more, so I
thought I would pass along a few pointers:
1. https://cwiki.apache.org/MAHOUT/how-to-contribute.html -- Details how to
submit patches, etc. IDE codestyles at
available?
--
Regards,
Tharindu
blog: http://mackiemathew.com/
Grant Ingersoll
http://www.lucidimagination.com
...@gmail.com
Grant Ingersoll
http://www.lucidimagination.com
I seem to recall past discussions on where one hits the bottleneck w/ user
based recommendation approaches in Mahout, but I can't seem to locate it
anymore. Anyone know off hand? Where do user based approaches hit their
limits, more or less?
Thanks,
Grant
, but on Hadoop.
On Wed, Oct 26, 2011 at 1:56 PM, Grant Ingersoll gsing...@apache.org wrote:
I seem to recall past discussions on where one hits the bottleneck w/ user
based recommendation approaches in Mahout, but I can't seem to locate it
anymore. Anyone know off hand? Where do user based
Just a friendly nudge to those on the fence for ApacheCon in Vancouver this
year that there will be both a Mahout training and some Mahout talks. I think
a few of us committers will also be hacking Mahout on Tuesday if you are
interested.
Training info: http://na11.apachecon.com/talks/18395
. Thanks a lot .
Grant Ingersoll
http://www.lucidimagination.com
this model to classify new data, all sample will be classified to
unknown
My Environment:
1. OS : CentOS 5
2. Mahout : 0.5
3. Hadoop : 0.20.205
Thanks,
Wangda
Grant Ingersoll
http://www.lucidimagination.com
Lucene Eurocon 2011
:17, Grant Ingersoll wrote:
Were you able to get the data, Sebastian?
On Oct 13, 2011, at 4:01 AM, Sebastian Schelter wrote:
Grant,
Can you share a few more details about the results? Do you get any
exceptions? Or do you just get no results?
Using the NaNs inside the similarity matrix
will probably have to tweak it.
Lance
On Thu, Oct 13, 2011 at 11:04 PM, Sebastian Schelter s...@apache.org wrote:
Only got the raw data, how did you convert it to our standard
recommender input?
--sebastian
On 14.10.2011 01:17, Grant Ingersoll wrote:
Were you able to get the data
this job worked for someone?
On Wed, Oct 12, 2011 at 11:30 AM, Grant Ingersoll
gsing...@apache.org wrote:
Both local and on EC2
On Oct 12, 2011, at 2:10 PM, Ken Krugler wrote:
Hi Grant,
Just curious, are you running this locally or distributed?
I'd run into a similar issue, though
at 7:33 AM, Lance Norskog goks...@gmail.com wrote:
Is this job working well for anyone now?
When was the last time this job worked for someone?
On Wed, Oct 12, 2011 at 11:30 AM, Grant Ingersoll
gsing...@apache.org wrote:
Both local and on EC2
On Oct 12, 2011, at 2:10 PM, Ken Krugler wrote
Note, the next version (13df29e4fe97b4370f24d7e91ab5909de76f0f3b) doesn't work.
Debugging.
On Oct 13, 2011, at 9:31 PM, Grant Ingersoll wrote:
OK, I can confirm that an earlier version
(54300025dbdd6e688a4eb3d043016eb641067c7e in github/lucidimagination/mahout)
worked. Now, to figure
Looks like it is me. Still not sure why, but getting there.
On Oct 13, 2011, at 10:35 PM, Grant Ingersoll wrote:
Note, the next version (13df29e4fe97b4370f24d7e91ab5909de76f0f3b) doesn't
work. Debugging.
On Oct 13, 2011, at 9:31 PM, Grant Ingersoll wrote:
OK, I can confirm
-for-20-Newsgroups-testing-tp3415474p3415474.html
Sent from the Mahout User List mailing list archive at Nabble.com.
--
Grant Ingersoll
http://www.lucidimagination.com
Lucene Eurocon 2011: http://www.lucene-eurocon.com
on how to interpret this as I haven't dug into the math here yet or
figured out where those NaN are coming from originally.
On Oct 11, 2011, at 2:55 PM, Grant Ingersoll wrote:
On Oct 11, 2011, at 2:49 PM, Grant Ingersoll wrote:
On Oct 11, 2011, at 12:36 PM, Sean Owen wrote:
Where
.
When running locally, this wasn't getting cleared between loops, and thus I
got wonky results.
The same thing would have happened with JVM reuse enabled.
-- Ken
On Oct 12, 2011, at 3:28pm, Grant Ingersoll wrote:
Digging some more:
In AggregateAndRecommend, around lines 143, I
I'm running trunk RecommenderJob (via build-asf-email.sh) and am not getting
any recommendations due to NaNs being calculated in the AggregateAndRecommend
step. I'm not quite sure what is going on as it seems like this was working as
little as two weeks ago (post Sebastian's big change to
larger data set on Hadoop, it's just
that's a whole lot harder to debug.
On Tue, Oct 11, 2011 at 5:34 PM, Grant Ingersoll gsing...@apache.org wrote:
I'm running trunk RecommenderJob (via build-asf-email.sh) and am not getting
any recommendations due to NaNs being calculated
On Oct 11, 2011, at 2:49 PM, Grant Ingersoll wrote:
On Oct 11, 2011, at 12:36 PM, Sean Owen wrote:
Where is the NaN coming up -- what has this value?
simColumn seems to be the originator in the Aggregate step. For instance, my
current breakpoint shows:
{309682
at 11:54 AM, Grant Ingersoll gsing...@apache.org wrote:
On Oct 2, 2011, at 11:52 PM, Walter Chang wrote:
Hi,
I have used Mahout to produce k-means clustering for my tf-idf result. I use
the mahout command line to produce the clusters and it seems it successfully
completes.
$MAHOUT_HOME
to provide some?
Thank you,
--
Colin Wang
Skype : colin.bin.wang
Grant Ingersoll
http://www.lucidimagination.com
Lucene Eurocon 2011: http://www.lucene-eurocon.com
belongs to.
Thanks a lot,
Weide
Grant Ingersoll
http://www.lucidimagination.com
Lucene Eurocon 2011: http://www.lucene-eurocon.com
for the users
and devs of Mahout?
I will be moving there next week and was curious to know about the
networking opportunities with similar minded folks in the coming months.
Grant Ingersoll
http://www.lucidimagination.com
Lucene Eurocon 2011: http
Grant Ingersoll
http://www.lucidimagination.com
Lucene Eurocon 2011: http://www.lucene-eurocon.com
What's the interpretation of scores for the output from the new (complementary)
naive bayes classifiers? Larger is better, right?
Thanks,
Grant
to the
complement class, you have highest affinity to the actual class which the
data belongs to.
Unless the new computation is spitting out positive numbers
in which case it's the largest.
:-)
On Thu, Sep 15, 2011 at 9:18 PM, Grant Ingersoll gsing...@apache.org wrote:
What's the interpretation
to be ordered, but my features are
not ordered.
I would then use DictionaryVectorizer.createTermFrequencyVectors and
TFIDFConverter.processTfIdf, just like in SparseVectorsFromSequenceFiles.
Am I on the right track?
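For readers unfamiliar with the two stages named above (term-frequency vectors first, then TF-IDF weighting), here is a rough Python sketch of the idea. It is not Mahout's implementation: the function names are hypothetical, and the textbook weighting tf * ln(N / df) is an assumption that may differ from what TFIDFConverter actually computes.

```python
import math
from collections import Counter


def term_frequencies(docs):
    """Stage 1: one term -> count mapping per tokenized document."""
    return [Counter(tokens) for tokens in docs]


def tf_idf(docs):
    """Stage 2: reweight each term count by ln(N / document frequency)."""
    tfs = term_frequencies(docs)
    n_docs = len(docs)

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tf in tfs:
        df.update(tf.keys())

    return [
        {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}
        for tf in tfs
    ]
```

Note that the features are keyed by term, so their order within a document does not matter for this weighting; a term that appears in every document gets weight 0.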
Grant Ingersoll
http
...
This is the error which I get: http://pastebin.com/ADPm0Vbx
Am I missing any steps?
Also on a side note is there a post on using MinHash in Mahout?
--
Regards,
Varun Thacker
http://varunthacker.wordpress.com
Grant Ingersoll
directory (userVectors I think).
On Thu, Sep 1, 2011 at 4:30 PM, Grant Ingersoll gsing...@apache.org wrote:
On Sep 1, 2011, at 10:04 AM, Sean Owen wrote:
Your input needs to be CSV if you want to use it all as-is. But, it
quickly
creates vectors out of things, so really you can comment out