Generally, you want to do a bit of projection on these data before
clustering.
One option is random projection. This maps each item to a sparse binary
vector based on a few independent hashes of the original item id. This
gives you a moderate dimensional vector to do clustering in (say
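Very roughly, in code (a minimal sketch using Mahout's RandomAccessSparseVector; the dimension and number of hashes are made-up parameters):

    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;

    // Map an item id to a sparse binary vector using k salted hashes.
    public static Vector project(String itemId, int dim, int numHashes) {
      Vector v = new RandomAccessSparseVector(dim);
      for (int i = 0; i < numHashes; i++) {
        int h = (itemId + "#" + i).hashCode(); // salting gives independent-ish hashes
        v.set(((h % dim) + dim) % dim, 1.0);   // fold into a non-negative index
      }
      return v;
    }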
You won't necessarily see any distinct clumps, depending on your data. With
some text, you might get such, but with resumes, especially if you don't do
IDF weighting, you are likely to have a pretty nasty distribution that
doesn't clump very well at all. Even with IDF weighting on terms and the
What happens if the number is too large? Is this a dense matrix we are
talking about?
Would it work to make it a random access sparse matrix with very, very large
bounds?
On Sun, May 23, 2010 at 10:29 AM, Jeff Eastman
j...@windwardsolutions.com wrote:
I agree it is not very friendly.
Just to forestall some effort on this, LLR is very good for threshold, but
the value is bad as a score so substituting TF or TFIDF is entirely
appropriate.
There may be use cases for keeping LLR if only for diagnostic purposes.
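For concreteness, the LLR in question is computed from the 2x2 contingency counts. A sketch (I believe the trunk has a LogLikelihood utility roughly like this, but treat the class name as an assumption):

    import org.apache.mahout.math.stats.LogLikelihood;

    // k11: both words together, k12: first without second,
    // k21: second without first, k22: neither.
    long k11 = 110, k12 = 2442, k21 = 950, k22 = 100000;
    double llr = LogLikelihood.logLikelihoodRatio(k11, k12, k21, k22);
    boolean keep = llr > threshold; // gate with LLR, score with TF or TF-IDF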
On Thu, May 27, 2010 at 8:52 AM, Drew Farris drew.far...@gmail.com
A bit off topic, but what you really want is collocations that bring
different information to the party than the constituent words. That is, you
need to detect cases where the meaning of the collocation is not
compositionally predicted by the meanings of the words in the collocation.
Simple
That should be a small change (and helpful for a lot of mining tasks).
But once you jump on that slippery slope, why not allow a tiny Groovy
closure to be injected? Or to pass in an object that will extract a map of
values from each line?
On Thu, May 27, 2010 at 2:59 PM, Grant Ingersoll
understand Nu x VTk, but then P is defined by
an additional product with Uk
In short... what?
On Thu, Jun 3, 2010 at 4:15 PM, Ted Dunning ted.dunn...@gmail.com wrote:
Fire away.
On Thu, Jun 3, 2010 at 3:52 AM, Sean Owen sro...@gmail.com wrote:
Is anyone out there familiar enough
better approach with
SVD++ and their time dynamics trick. That is much the same as mean removal.
On Fri, Jun 4, 2010 at 6:48 AM, Ted Dunning ted.dunn...@gmail.com wrote:
You are correct. The paper has an appalling treatment of the folding-in
approach.
In fact, the procedure is dead
Thresholds are generally dangerous. It is usually preferable to specify the
sparseness you want (1%, 0.2%, whatever), sort the results in descending
score order using Hadoop's built-in capabilities, and just drop the rest.
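For instance (a sketch assuming the Hadoop 0.20 new API; the class name is mine):

    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;

    // Invert DoubleWritable's ordering so scores come out highest first;
    // then keep just the first k outputs, with k set from the target sparseness.
    public class DescendingDoubleComparator extends WritableComparator {
      public DescendingDoubleComparator() {
        super(DoubleWritable.class, true);
      }
      @Override
      public int compare(WritableComparable a, WritableComparable b) {
        return -super.compare(a, b);
      }
    }
    // wired in with job.setSortComparatorClass(DescendingDoubleComparator.class)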
On Tue, Jun 15, 2010 at 9:32 AM, Kris Jack mrkrisj...@gmail.com wrote:
I
items?
On Tue, Jun 15, 2010 at 8:12 PM, Ted Dunning ted.dunn...@gmail.com
wrote:
You have most of the workings available to do a reasonable job of this in
Mahout. The simplest method in my mind is to grovel the logs and emit
pairs
of items with the key being the last item and previous
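In sketch form (plain Java rather than the map-reduce version; sessionItems is one user's click sequence in time order):

    import java.util.AbstractMap.SimpleEntry;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    // Emit (lastItem, previousItem) pairs from one user's ordered history.
    static List<Map.Entry<String, String>> pairs(List<String> sessionItems) {
      List<Map.Entry<String, String>> out = new ArrayList<Map.Entry<String, String>>();
      String last = sessionItems.get(sessionItems.size() - 1);
      for (int i = 0; i < sessionItems.size() - 1; i++) {
        out.add(new SimpleEntry<String, String>(last, sessionItems.get(i)));
      }
      return out;
    }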
I would follow Sean's suggestion and try simpler methods first. My guess is
that the important structure of the HMM may be much easier to learn by
sparsification techniques.
Sequence-aware methods also have potential for harm in that they may just
reverse-engineer your current link structure.
How large is your input and how is it arranged in files?
Is your input oddly distributed? Are there big skews in item frequency?
2010/6/16 Björn Jacobs jac...@gmx.de
Is this a bug or do I have to configure something to get this working?
Tamas,
In what context is this serialization occurring? Would it be better to use
an alternative serialization framework such as Gson or Hadoop or Avro?
I tend to try to avoid native serialization because of the problems that
come up so easily.
On Sun, Jun 20, 2010 at 5:54 PM, Tamas Jambor
You can also recommend attributes to users by reducing the user, item
history file to a user, attribute history file. Once you have recommended
attributes, you can use a search engine or an attribute to item
recommendation engine to get the items to recommend.
On Tue, Jun 22, 2010 at 5:43 AM,
The SGD and SVM implementations (neither released yet) both have sequential
versions. I expect that for pretty large corpora they will be faster
than the MR learners due to lower overhead and faster convergence. See
http://leon.bottou.org/projects/sgd for why.
On Wed, Jun 23, 2010 at
On Wed, Jun 23, 2010 at 11:13 PM, Otis Gospodnetic
otis_gospodne...@yahoo.com wrote:
* Do any classifiers offer the option of basing classification on
linguistic rules?
It is common in advanced text classifiers to include human guided features
such as you suggest here. This is one of the
Pranay,
Sean's comments are dead-on. You may be able to get a feel for how good (or
not) these results are by marking all unrated items either as good or
bad. That will likely tell you that the real precision is between 0.22 and
0.9. This same problem is exhibited by essentially all other
How much speedup do you observe?
On Mon, Jun 28, 2010 at 2:29 PM, Tamas Jambor jambo...@gmail.com wrote:
Hi,
I was looking at the SVD code, I am sure you are aware of this
modification, but it would really make things faster. The idea is that you
set up a minimum RMSE improvement so it
or
later).
Tamas
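The stopping rule itself is tiny (a sketch; Trainer here is a hypothetical stand-in for one SVD training pass, not the actual Mahout API):

    interface Trainer {
      double trainOneEpoch(); // runs one pass, returns the resulting RMSE
    }

    static int trainUntilConverged(Trainer t, int maxEpochs, double minImprovement) {
      double prevRmse = Double.POSITIVE_INFINITY;
      int epoch = 0;
      while (epoch < maxEpochs) {
        double rmse = t.trainOneEpoch();
        epoch++;
        if (prevRmse - rmse < minImprovement) {
          break; // improvement too small to be worth another pass
        }
        prevRmse = rmse;
      }
      return epoch; // number of epochs actually run
    }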
On 28/06/2010 22:31, Ted Dunning wrote:
How much speedup do you observe?
On Mon, Jun 28, 2010 at 2:29 PM, Tamas Jambor jambo...@gmail.com
wrote:
Hi,
I was looking at the SVD code, I am sure you are aware of this
modification, but it would really make things faster
Indeed.
Did you mention it?
On Wed, Jun 30, 2010 at 12:32 AM, Danny Leshem dles...@gmail.com wrote:
Just came back from ICML / COLT.
The two conferences held a joint workshop day, with one of the tracks
concentrating on open-source software for machine-learning (see
Also note that there *is* a pretty large scale SVD solver in Mahout. That
can give you a short-cut to PageRank.
On Wed, Jun 30, 2010 at 12:11 PM, Grant Ingersoll gsing...@apache.org wrote:
If not, I'd like to implement it. Any advice appreciated,
Have a look at the matrix/vector libraries.
Jimmy Lin's presentation (first link on this page:
http://www.umiacs.umd.edu/~jimmylin/) had to do with data structure
improvements for link distance computations. After his talk, there was an
interesting discussion with Arun Murthy of the map-reduce team at Yahoo.
Arun's contention was that it
By this, do you mean migrate from using the Mahout recommendation framework
without hadoop to using the Mahout recommendation framework with Hadoop?
On Fri, Jul 2, 2010 at 8:26 AM, matboeh...@googlemail.com wrote:
However, I am currently looking for an easy way of how to migrate to
Hadoop.
Practically speaking, term weighting is important, but you also have to
watch out for eigen-spoke behavior.
https://research.sprintlabs.com/publications/uploads/icdm-09-ldmta-camera-ready.pdf
This can arise when you have a strong clique phenomenon in your data (not
likely in your case) or where
Pity. I am in the San Francisco Bay area. Would love to help.
Robin Anil is in India, but I think he is totally over-committed.
On Wed, Jul 7, 2010 at 9:17 AM, tog guillaume.all...@gmail.com wrote:
Hi,
I am looking for a Mahout (and related technologies) expert in Bangalore
for
a few
Clustering of time series data is usually better done in an abstract
relatively low dimensional coordinate space based on some transform like a
locality sensitive frequency transform. Gabor transforms might be
appropriate.
You might be able to get away with something like an SVD of your daily
On Mon, Jul 19, 2010 at 1:29 AM, ihadanny ido.hada...@gmail.com wrote:
I've been trying out MAHOUT-228: Sequential LR (using SGD).
Thanks!
Few things I haven't been able to figure out:
1. Is there a parallel version? Can it integrate with hadoop and do each
pass in parallel?
Not
That would be great!
On Mon, Jul 19, 2010 at 7:38 PM, Josh Patterson j...@cloudera.com wrote:
From just a personal
time perspective, I may try and mock up some demos for something like
this.
This is, roughly, a reasonable thing to do.
If you want to maintain the fiction of counts a little bit more closely, you
might consider just having counts decay over time and having short visits
only give partial credit.
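A minimal sketch of both ideas (all names and parameters are illustrative):

    // Decay a stored count toward the present; elapsed and halfLife share units.
    static double decayedCount(double count, double elapsed, double halfLife) {
      return count * Math.pow(0.5, elapsed / halfLife);
    }

    // Partial credit for a short visit, capped at 1.0 for a full visit.
    static double visitCredit(long dwellMillis, long fullVisitMillis) {
      return Math.min(1.0, dwellMillis / (double) fullVisitMillis);
    }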
On Wed, Jul 21, 2010 at 3:54 PM, Dave Williford
This is a ubiquitous problem with cooccurrence algorithms since they scale
with the square of the number of occurrences of the most popular item.
The good news is that you learn everything there is to learn about that item
if you look at just a sampling of the occurrences so sampling is your
friend. If
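A sketch of the sampling (reservoir-style accept rule; maxPerItem is a made-up knob):

    import java.util.Random;

    // Keep roughly maxPerItem occurrences of any one item so that popular
    // items contribute a bounded amount of cooccurrence work.
    static boolean keepOccurrence(int seenSoFar, int maxPerItem, Random rnd) {
      if (seenSoFar < maxPerItem) {
        return true; // always keep the first maxPerItem occurrences
      }
      // accept the (seenSoFar+1)-th with probability maxPerItem/(seenSoFar+1)
      return rnd.nextInt(seenSoFar + 1) < maxPerItem;
    }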
make me even less likely to consider it as an early
design option.
On Wed, Jul 21, 2010 at 5:02 PM, Ted Dunning ted.dunn...@gmail.com wrote:
This is, roughly, a reasonable thing to do.
If you want to maintain the fiction of counts a little bit more closely,
you might consider just having
Sean,
Are you back yet?
I have a friend in London who is apparently in somewhat dire straits
(mugged, everything taken except passport). I am looking for resources in
London to help him out.
On Tue, Jul 27, 2010 at 6:26 AM, Sean Owen sro...@gmail.com wrote:
There's no direct way to do this,
Lucene 4.0? 3.0 just came out.
http://hudson.zones.apache.org/hudson/view/Lucene/job/Lucene-trunk/lastSuccessfulBuild/artifact/lucene/build/docs/changes/Changes.html#older
On Fri, Aug 6, 2010 at 8:59 AM, smcgi...@seas.upenn.edu wrote:
Hello,
I am trying to import an index from Solr 1.5,
importing vectors from this
Solr-trunk/Lucene-trunk combination.
Thanks!
Steve
Quoting Ted Dunning ted.dunn...@gmail.com:
Lucene 4.0? 3.0 just came out.
http://hudson.zones.apache.org/hudson/view/Lucene/job/Lucene-trunk/lastSuccessfulBuild/artifact/lucene/build/docs/changes/Changes.html
I was at this talk and it was appallingly bad.
The most serious confusion is that the algorithms behind the prediction API
are NOT the same as the algorithms described in the talk. The talk was
really two talks glued together without a transition. The first part was
essentially just a rehash of
Focussing on rating error is also problematic in that it causes us to worry
about being correct about the estimated ratings for items that will *never*
be shown to a user.
In my mind, the only thing that matters in a practical system is the
ordering of the top few items and the rough composition
or
tomorrow.
On Thu, Aug 12, 2010 at 10:30 PM, Ted Dunning ted.dunn...@gmail.com
wrote:
Jimmy Lin's stripes work was presented at the last Summit and there was
heated (well, warm and cordial at least) discussion with the Map-reduce
committers about whether good use of a combiner wouldn't do
Ahh thanks for being brave enough to ask.
A JIRA is a bug ticket. See http://issues.apache.org/jira/browse/MAHOUT
Filing a complete statement of the problem there will really help with
documenting the problem. Also, if you can develop a patch that helps
fix the problem, you can attach it
We don't have mega scale ols but we do have mega scale svd which
should be close to what you want if you have sparse data.
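For reference, the textbook identity that connects the two (nothing Mahout-specific): with the thin SVD of the design matrix,

    X = U \Sigma V^{\mathsf{T}}
    \quad\Longrightarrow\quad
    \hat{\beta} = \arg\min_{\beta} \lVert X\beta - y \rVert_2^2
                = V \Sigma^{+} U^{\mathsf{T}} y

so an SVD plus two cheap multiplies gives you the least-squares solution.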
Sent from my iPhone
On Aug 20, 2010, at 1:37 PM, Chris Bates christopher.andrew.ba...@gmail.com
wrote:
Hi all,
I'm new to the list. I have a bunch of algorithms
Sorry to chime in late, but removing items after recommendation isn't such a
crazy thing to do.
In particular, it is common to remove previously viewed items (for a period
of time). Likewise, if the user says "don't show this again", it makes
sense to backstop the actual recommendation system with
Can you file a bug report at http://issues.apache.org/jira/browse/MAHOUT ?
Please attach your test case.
On Wed, Aug 25, 2010 at 7:25 AM, Laszlo Dosa laszlo.d...@fredhopper.com wrote:
Hi,
I tried to iterate over the elements of a SequentialAccessSparseVector.
I run the following test and
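For reference, the usual iteration idiom over the non-zeros (assuming the 0.x math API):

    import java.util.Iterator;
    import org.apache.mahout.math.SequentialAccessSparseVector;
    import org.apache.mahout.math.Vector;

    Vector v = new SequentialAccessSparseVector(10);
    v.set(2, 1.5);
    v.set(7, -3.0);
    Iterator<Vector.Element> it = v.iterateNonZero();
    while (it.hasNext()) {
      Vector.Element e = it.next();
      System.out.println(e.index() + " -> " + e.get());
    }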
I formatted your tests as a patch and attached them to the bug itself.
On Fri, Aug 27, 2010 at 8:38 AM, Laszlo Dosa laszlo.d...@fredhopper.com wrote:
It is filed as MAHOUT-489.
Regards,
Laszlo
-----Original Message-----
From: Ted Dunning [mailto:ted.dunn...@gmail.com]
Sent: 25 August
I don't follow Weka much lately, but I don't know of any support for
calling Mahout clustering
algorithms from Weka. Typically people run Mahout clustering from the
command line.
On Fri, Aug 27, 2010 at 1:06 PM, Valerio valerio.cera...@gmail.com wrote:
hi all,
I need some guides that
These are examples of what I call cross-recommendation where you have user x
item1 and user x item2 data and you
want item1 = item2 recommendations.
All of the standard techniques apply (user-based, LLR cooccurrence, SVD,
latent factor models), but you have to rejigger things here
and there.
Like Jake said.
On Sun, Aug 29, 2010 at 4:48 PM, Ted Dunning ted.dunn...@gmail.com wrote:
In particular, since our sparse representation requires an int (4 bytes)
and a double (8 bytes) to store one non-zero entry while a dense row
requires only 8 bytes per entry then your original data
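Spelled out, using just the figures above:

    // ~12 bytes per stored non-zero (4-byte int index + 8-byte double)
    // versus 8 bytes per slot in a dense row.
    static long sparseBytes(long nonZeros)   { return 12L * nonZeros; }
    static long denseBytes(long cardinality) { return 8L * cardinality; }
    // Break-even: sparse is smaller only when density < 8/12, about 67%.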
Metaphorically speaking, if user x search term is A and user x item is B,
then transpose(B) * B is item x item, transpose(A) * A is search term x
search term, and transpose(B) * A is item x search-term.
Depending on what kind of recommendation system you are using, the actual
mechanics will be
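In Mahout's in-memory matrix terms, the metaphor is almost literal (a sketch; assumes everything fits in memory):

    import org.apache.mahout.math.Matrix;

    // A: user x searchTerm, B: user x item
    static Matrix itemByItem(Matrix b) { return b.transpose().times(b); }
    static Matrix termByTerm(Matrix a) { return a.transpose().times(a); }
    static Matrix itemByTerm(Matrix b, Matrix a) { return b.transpose().times(a); }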
Lance,
As Sean said, there is definitely a performance and API-intelligibility
motivated difference between From-things and To-things, but you are right
that there is a conceptual symmetry between the two objects just as there is
symmetry or duality between the rows and columns of a matrix.
On
A 20% spread in what?
Speed? Results? Iterations?
On Mon, Aug 30, 2010 at 11:26 PM, Lance Norskog goks...@gmail.com wrote:
SVDRecommender is really sensitive to the random number seed. AADRE
gives about a 20% spread in its evaluations. (I have only tried
Yes.
Mahout can support this.
On Tue, Aug 31, 2010 at 2:55 PM, hdev ml hde...@gmail.com wrote:
But we also want to mine this data to get some predictive capabilities like
what is the likelihood that the user will use the same device again or if
we
get sales/marketing data (on the roadmap
For categorization, there are several different answers to the integration
problem, but text
export of a sampled and curated data file is pretty typical as a data path.
The on-line sequential classifiers are a bit more flexible and would allow
different input
formats at the cost of coding on your
I think that Chris was actually recommending stuff that is too simple to
call data-mining.
Basically this stuff is simpler than any machine learning algorithm so there
isn't anything really
to write.
An example for recommendations is to simply recommend the most popular items
to everybody,
+1
I'm in.
On Thu, Sep 2, 2010 at 6:50 AM, Ken Krugler kkrugler_li...@transpac.com wrote:
On Sun, Aug 29, 2010 at 6:33 AM, Grant Ingersoll gsing...@apache.org
wrote:
Anyone in the Bay Area interested in getting together to talk Mahout on
Sept. 16th or 17th? Nothing formal required. If
What version of Mahout? (I will assume the trunk)
What platform?
I see that you are using hadoop 0.21. So far, we only officially support
0.20.2, although that is clearly not your problem. It may become a problem
in your next step.
This looks like a problem in the Mahout compilation. The
Multiple classification is a classic problem and raises many problems.
Currently Mahout has classifiers that do 1 of n classification which is a
useful basis for multiple classification, but it isn't the final answer by
any means.
As a simple start, you can build multiple binary classifiers, one
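A sketch of that simple start, built on the SGD classifier (a one-vs-rest wrapper of my own devising, not a Mahout class):

    import java.util.Set;
    import org.apache.mahout.classifier.sgd.L1;
    import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
    import org.apache.mahout.math.Vector;

    public class OneVsRest {
      private final OnlineLogisticRegression[] models;

      public OneVsRest(int numLabels, int numFeatures) {
        models = new OnlineLogisticRegression[numLabels];
        for (int i = 0; i < numLabels; i++) {
          models[i] = new OnlineLogisticRegression(2, numFeatures, new L1());
        }
      }

      // Each binary model sees label i as positive iff i is in trueLabels.
      public void train(Vector features, Set<Integer> trueLabels) {
        for (int i = 0; i < models.length; i++) {
          models[i].train(trueLabels.contains(i) ? 1 : 0, features);
        }
      }
    }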
Not much that I know of. There are bound to be some off-line academic
talks, and possibly some academic areas.
On Sun, Sep 5, 2010 at 8:32 PM, Lance Norskog goks...@gmail.com wrote:
The Hadoop lists seem to be all about the sysad aspects of Hadoop, while
Mahout users talk about algorithms a
Just to cross-check, is it true that your data has 35 x 100 million
non-zeros in it?
On Tue, Sep 7, 2010 at 6:16 PM, Akshay Bhat akshayub...@gmail.com wrote:
- the total number of non-zero elements. This drives the scan time and, to
some extent, the cost of the multiplies.
The total
Should?
or
Is?
The answer to the "should" question is "possibly".
The answer to the "is" question is "no".
This behavior is the reason for the jar-with-dependencies Maven assembly
that is built in. Very handy for this problem.
On Fri, Sep 10, 2010 at 6:44 PM, Mark static.void@gmail.com wrote:
Should be close. The matrixMult step may be redundant if you want to
cluster the same data that you decomposed. That would make the second
transpose unnecessary as well.
On Sat, Sep 11, 2010 at 2:43 PM, Grant Ingersoll gsing...@apache.org wrote:
To put this in bin/mahout speak, this would look
I think you were translating. But the last multiply is still redundant, I
think.
On Sat, Sep 11, 2010 at 4:55 PM, Grant Ingersoll gsing...@apache.org wrote:
On Sep 11, 2010, at 5:50 PM, Ted Dunning wrote:
Should be close. The matrixMult step may be redundant if you want to
cluster
Steven's comments are correct. Weka has a larger collection of algorithms.
Mahout is specialized
around scalable algorithms and scalable implementations.
Both packages support supervised and unsupervised algorithms. Due to
scalability concerns, Mahout
does not have much in the way of
I don't know the answer to this, but previously this kind of problem was
caused by highly skewed statistics in the input data.
If there are things that cooccur with everything, then you will have
problems with the speed of the algorithm.
Can you say something about the distribution of your data?
Good advice relative to Mahout as well. Trying it on a smaller sample will
tell you if it is due to bad scaling or really a hangup.
On Sat, Sep 18, 2010 at 12:03 PM, Mark static.void@gmail.com wrote:
Thanks. I'll give this a try and see how it performs
On 9/18/10 12:01 PM, Neal Richter
Anonymous can mean many things.
It can mean
a) here is a user with no history
or
b) here is a user with history but possibly no formal login
It is normally true that the history that a user has when recommendations
need to be made is not the history that that or any user necessarily had when the
Did you do [mvn -DskipTests install] at the top level before trying this?
On Tue, Sep 21, 2010 at 9:15 AM, Neil Ghosh neil.gh...@gmail.com wrote:
Hi,
I am trying to run the example using Mahout 0.3 at
https://cwiki.apache.org/confluence/display/MAHOUT/Twenty+Newsgroups
I have carried out
then it will hang and never finish. Is this a possible hadoop
configuration bug?
On 9/18/10 12:08 PM, Ted Dunning wrote:
Good advice relative to Mahout as well. Trying it on a smaller sample
will
tell you if it is due to bad scaling or really a hangup.
On Sat, Sep 18, 2010 at 12:03 PM
Isabel noted the same thing. I will get to it shortly. Most likely I have
broken these older APIs in some subtle (or not) fashion.
On Wed, Sep 22, 2010 at 2:57 AM, Frank Wang wangfan...@gmail.com wrote:
I was running the donut example for logistic regression. It has always
worked until
This is cool:
http://lca2011.linux.org.au/programme/schedule/view_talk/213?day=None
That is the first Mahout talk I have seen announced by somebody whose name I
don't recognize. It looks like a reasonable topic and I will be interested
to hear how their results turned out.
Are Aneesha Bakharia
I don't think that the future.get() will ever be done. Testing for
!future.isDone() will always return false after
invokeAll because invokeAll waits for all tasks to complete.
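To see why, here is a small self-contained demo of the invokeAll contract:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.*;

    public class InvokeAllDemo {
      public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Callable<Integer>> tasks = new ArrayList<Callable<Integer>>();
        for (int i = 0; i < 8; i++) {
          final int n = i;
          tasks.add(new Callable<Integer>() {
            public Integer call() { return n * n; }
          });
        }
        List<Future<Integer>> futures = pool.invokeAll(tasks); // blocks until all done
        for (Future<Integer> f : futures) {
          System.out.println(f.isDone() + " -> " + f.get()); // isDone() always true here
        }
        pool.shutdown();
      }
    }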
On Thu, Sep 23, 2010 at 7:57 PM, Stanley Ipkiss saurabhnan...@gmail.com wrote:
According to me, the first line is
This looks like a great series.
Could you do us a favor and point to http://mahout.apache.org instead? The
URL you have is old and we haven't
yet redirected from there to the current web site.
On Thu, Sep 23, 2010 at 9:38 PM, Timothy Potter thelabd...@gmail.com wrote:
I've just put the
There isn't a lot more documentation than that. There is a
forthcoming book by Grant called Taming Text that might help you and
the currently being written classification sections of the forthcoming
Mahout in Action book might be helpful.
On 9/24/10, Neil Ghosh neil.gh...@gmail.com wrote:
Is
That would be fabulous.
On Fri, Sep 24, 2010 at 6:07 AM, Alex Baranau alex.barano...@gmail.com wrote:
I'd suggest using the approach discussed (and accepted) at
https://issues.apache.org/jira/browse/TIKA-488, which is about using
multiple search engines.
Will create a patch (to include both
Is that the complete stack trace? Threaded code like this usually has two
or three levels of "Caused by" sections. The last is the critical one.
On Fri, Sep 24, 2010 at 1:07 PM, Stanley Ipkiss saurabhnan...@gmail.com wrote:
I did that change yesterday in my code, but forgot to post the update
Either Naive Bayes or the SGD classifiers will do a nice job for most text
classification problems.
On Sat, Sep 25, 2010 at 11:48 AM, Neil Ghosh neil.gh...@gmail.com wrote:
Actually I want to know how I can use Mahout for text classification.
Will naive bayes be enough?
Drew,
You do recall correctly. This is a good example to follow for the Naive
Bayes side of the house.
On Sun, Sep 26, 2010 at 1:05 PM, Drew Farris d...@apache.org wrote:
The
PrepareTwentyNewsgroups example converts a bunch of files organized
into directories into the Bayes input format,
The test that you are reading is testing an entire command line interface.
If you look inside that code, you can probably see something simpler.
Also, you can take a look at the SGD models which are much easier to use on
a small scale. There the pertinent classes are
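Something like this (a sketch against the 0.4-era trunk API; the sizes and the L1 prior are arbitrary choices):

    import org.apache.mahout.classifier.sgd.L1;
    import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.Vector;

    // 2 categories, 10 features, L1 prior.
    OnlineLogisticRegression lr = new OnlineLogisticRegression(2, 10, new L1());
    Vector x = new DenseVector(10);
    x.set(3, 1.0);
    lr.train(1, x);                 // one online update with a labeled example
    Vector p = lr.classifyFull(x);  // posterior over the 2 categories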
That is exactly what it does.
On Thu, Sep 30, 2010 at 8:37 AM, Neal Richter nrich...@gmail.com wrote:
On Thu, Sep 30, 2010 at 8:37 AM, Neil Ghosh neil.gh...@gmail.com wrote:
Does anybody have examples/reference how to use TF-IDF weights in mahout
cbayes for particular words and phrases
A very good practice is to use a data set like this:
http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz
Segregating by date avoids problems with duplicate documents appearing in
both training and test. It also gives you a standard split so that you can
compare to other
And if you want to see more about recommendation using side data as well as
interaction data,
the best reference I know of is Menon and Elkan's recent paper:
http://arxiv.org/abs/1006.2156
On Thu, Sep 30, 2010 at 4:45 PM, Sebastian Schelter s...@apache.org wrote:
If you just wanna know more
The best argument I have seen (with one powered-by sticker still pending) is
that it
helps with recruiting.
On Fri, Oct 1, 2010 at 1:34 AM, Isabel Drost isa...@apache.org wrote:
On Thu, 30 Sep 2010 Grant Ingersoll gsing...@apache.org wrote:
Now, if we could just get people to add to the
No, there isn't. Your other option is to use kmeans directly and set k (as
you seem to do now).
t1 and t2 can also be quite delicate parameters.
My own tendency is to try to use a good initialization scheme such as
kmeans++ (which we don't
yet have) and just specify the number of clusters. If
Yes. Instance = training example.
Your method of duplicating lines is just what Robin meant.
On Fri, Oct 1, 2010 at 3:55 AM, Robin Anil robin.a...@gmail.com wrote:
Let me list what I understood. Please confirm if I got it correct:
Add duplicate extra lines many times in an extra file
Jake,
You asked a bit ago about strategies for very large SVD's.
I wonder if interpolative decompositions might be an avenue toward that.
See, for instance, Less is More: Compact Matrix Decomposition for Large
Sparse Graphs http://www.cs.cmu.edu/~jimeng/papers/SunSDM07.pdf
The idea is that if
Can you provide a transcript of the commands you use to do this?
You might even try computing an md5sum on all of the source files in the src
directory and the class files in the
target directory to verify that you know exactly what is changing.
In general, when I have these kinds of problems,
Matt,
This is good detail.
On Fri, Oct 1, 2010 at 3:44 PM, Matt Tanquary matt.tanqu...@gmail.com wrote:
I forced rebuild of the projects after changing
org.apache.mahout.clustering.kmeans.KMeansDriver
I noticed that the
-type bayes is the other option. If time allows, cbayes will probably be
better for most purposes.
See http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.13.8572 for
details on the algorithm and comparisons.
On Fri, Oct 1, 2010 at 11:13 PM, Neil Ghosh neil.gh...@gmail.com wrote:
Hello,
If that 50GB represents 20million training examples for a classifier, then
you are fine without hadoop.
If it is data to cluster or do SVD on, the answer is probably the same.
This might be near the edge.
If it is data for recommendations, that is a moderate amount and with or
without hadoop is
You will need to make sure that the tokenization is done reasonably.
There is an example program for a sequential classifier in
org.apache.mahout.classifier.sgd.TrainNewsGroups
It assumes data in the 20 news groups format and uses a Lucene tokenizer.
The NaiveBayes code also uses a Lucene
To rebuild the job jar, use Maven's command [mvn -DskipTests install] (but
make sure you run the tests occasionally)
You can't trust Eclipse to understand the entire build. It will be ok if
you are running unit tests, but if you try to submit
a Hadoop job, you need to package everything up.
On
The SGD classifier software will use all the cores for training even without
Hadoop.
Hadoop can definitely run on a multi-core machine, but the overhead
introduced will mean that your net gain will be distinctly less than 8x.
On Sat, Oct 2, 2010 at 6:43 PM, Latency Buster
This paper had some interesting references. The problem they worked on was
different from yours, but if you
know something about the training images, this might work out. The something
might be the original web-site
nearby text or almost anything.
verify with Hindi text as a string?
Thanks
Neil
On Sun, Oct 3, 2010 at 10:20 AM, Ted Dunning ted.dunn...@gmail.com
wrote:
Hindi should be pretty good to go with the default Lucene analyzer. You
should look at the
tokens to be sure they are reasonable. Punctuation and some other work
a large library such as Mahout.
On Sun, Oct 3, 2010 at 11:41 AM, gagan chhabra gagan.13031...@gmail.com wrote:
I was proposed to use MATLAB for this project but I had no idea, so I
somehow
ended up here.
Is it possible to implement in MATLAB??
On Sun, Oct 3, 2010 at 11:48 PM, Ted Dunning
In that case, another Faloutsos paper would be of interest:
2002 Performance - best student paper award: Mengzhi Wang, Anastassia
Ailamaki and Christos Faloutsos, "Capturing the spatio-temporal behavior of
real traffic data":
http://www.cs.cmu.edu/~christos/PUBLICATIONS/performance02.pdf
mention.
On Mon, Oct 4, 2010 at 1:34 AM, Ted Dunning ted.dunn...@gmail.com wrote:
Try this:
http://www.public.asu.edu/~huanliu/sbp09/Presentations/paper%20presentations/SBP09_3-31(Baoxin%20Li%20-4).pdf
On Sun, Oct 3, 2010 at 12:57 PM, Federico Castanedo
fcast...@inf.uc3m.es
wrote
Texture models like Gabor transforms.
On Mon, Oct 4, 2010 at 9:10 AM, gagan chhabra gagan.13031...@gmail.com wrote:
So what about the images of animals and humans? Any particulars for them,
like histograms are for snow and sunsets etc.
My own best candidate for using side information, of which context is just
one source, is the latent factor log-linear approach described in Menon and
Elkan's paper. I am part-way into an implementation of this, but it will
not be integrated into the recommendation framework at first. As soon as
There is currently no provision for a payload in the VectorWritable. It is
plausible that such a capability could be added.
Perhaps you could suggest an implementation?
On Tue, Oct 12, 2010 at 2:28 PM, Lance Norskog goks...@gmail.com wrote:
Ok. Now, how would one save payloads with the Vector
On Tue, Oct 12, 2010 at 5:30 PM, Lance Norskog goks...@gmail.com wrote:
This use case is doing Random Projection with paired vectors. Look up
'semantic vectors' for an explanation.
Even so, I think that there is another way to do this by just keeping an id
on each vector.
In random
Can you attach your test docs to a JIRA report?
On Thu, Oct 14, 2010 at 2:51 AM, Sreejith S srssreej...@gmail.com wrote:
Hi all...
I used Mahout CBayes Classifier (and Bayes) to train a sample data set. The
data set consists of 500 positive and 500 negative documents. After training
I passed
If you are comparing ranking systems against a gold standard of relevance,
the accepted standard measure is AUC. You can define AUC most conveniently
as the probability that the score of a randomly chosen known good example is
higher than the score of a randomly chosen known bad example. This is
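Computed exactly on modest data, that definition is just pair counting (ties get half credit):

    // AUC = P(score(random positive) > score(random negative)), ties count 0.5.
    static double auc(double[] pos, double[] neg) {
      double wins = 0;
      for (double p : pos) {
        for (double n : neg) {
          if (p > n) {
            wins += 1;
          } else if (p == n) {
            wins += 0.5;
          }
        }
      }
      return wins / (pos.length * (double) neg.length);
    }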