Re: Using ItemSimilarity.scala from Java

2014-09-26 Thread Frank Scholten
remaining stages = 0
2014-09-26 13:36:30,683 DEBUG FileSystem - Starting clear of FileSystem
cache with 1 elements.
2014-09-26 13:36:30,684 DEBUG FileSystem - Removing filesystem for file:///
2014-09-26 13:36:30,684 DEBUG FileSystem - Removing filesystem for file:///
2014-09-26 13:36:30,684 DEBUG FileSystem - Done clearing cache
2014-09-26 13:36:30,685 DEBUG Logging$class - Shutdown hook called
Disconnected from the target VM, address: '127.0.0.1:53897', transport:
'socket'

On Fri, Sep 12, 2014 at 7:05 PM, Pat Ferrel  wrote:

> True but a bit daunting to get started.
>
> Here is a translation to Scala.
> https://gist.github.com/pferrel/9cfee8b5723bb2e2a22c
>
> It uses the MahoutDriver and IndexedDataset and lives in
> org.apache.mahout.examples, a package I created, so you’ll need to add the
> right imports if you put it somewhere else. As a bonus it uses Spark's
> parallel writing to part files, and you can add command-line parsing quite
> easily.
>
> article_views.txt:
> pat,article1
> pat,article2
> pat,article3
> frank,article3
> frank,article4
> joe-bob,article10
> joe-bob,article11
>
> indicators/part-0
> article2    article1:3.819085009768877 article3:1.046496287529096
> article3    article2:1.046496287529096 article4:1.046496287529096 article1:1.046496287529096
> article11   article10:3.819085009768877
> article4    article3:1.046496287529096
> article10   article11:3.819085009768877
> article1    article2:3.819085009768877 article3:1.046496287529096
>
> The search using frank’s history will return article2, article3 (filtered
> out), article4 (filtered out), and article1, as you’d expect.
>
> Oh, and I was wrong about the bug—works from the current repo.
>
> You still need to get the right jars in the classpath when running from
> the command line.
>
> On Sep 12, 2014, at 9:04 AM, Peter Wolf  wrote:
>
I'm new here, but I just wanted to add that Scala is extremely cool.  I've
moved to Scala wherever possible in my work.  It's really nice, and well
worth the effort to learn.  Scala has put the joy back into programming.
>
> Instead of trying to call Scala from Java, perhaps you might enjoy writing
> your stuff in Scala.
>
> On Fri, Sep 12, 2014 at 11:53 AM, Pat Ferrel 
> wrote:
>
> > #1 I’m glad to see someone using this. I haven’t tried calling Scala from
> > Java and would expect a fair amount of difficulty with it. Scala
> > constructs objects to deal with its new features (anonymous functions,
> > traits, implicits) and you have to guess at what those will look like to
> > Java. Maybe you could try asking the Scala community.
> >
> > IntelliJ will auto-convert Java to Scala when you paste it into a .scala
> > file. For some reason yours doesn’t seem to convert, but I’ve seen it work
> > pretty well.
> >
> > I started to convert your code and it pointed out a bug in mine, a bad
> > value in the default schema. I’d be interested in helping with this as a
> > way to work out the kinks in creating drivers.
> >
> > Are you interested in this or are you set on using java? Either way I’ll
> > post a gist of your code using the MahoutDriver as the template and
> > converted to Scala. It’ll take me a few minutes.
> >
> > On Sep 12, 2014, at 6:46 AM, Frank Scholten 
> > wrote:
> >
> > Hi all,
> >
> > Trying out the new spark-itemsimilarity code, but I am new to Scala and
> > have a hard time calling certain methods from Java.
> >
> > Here is a Gist with a Java main that runs the cooccurrence analysis:
> >
> > https://gist.github.com/frankscholten/d373c575ad721dd0204e
> >
> > When I run this I get an exception:
> >
> > Exception in thread "main" java.lang.NoSuchMethodError:
> >
> >
> org.apache.mahout.drivers.TextDelimitedIndexedDatasetReader.readElementsFrom(Ljava/lang/String;Lcom/google/common/collect/BiMap;)Lorg/apache/mahout/drivers/IndexedDataset;
> >
> > What do I have to do here to use the Scala readers from Java?
> >
> > Cheers,
> >
> > Frank
> >
> >
>
>


Using ItemSimilarity.scala from Java

2014-09-12 Thread Frank Scholten
Hi all,

Trying out the new spark-itemsimilarity code, but I am new to Scala and
have a hard time calling certain methods from Java.

Here is a Gist with a Java main that runs the cooccurrence analysis:

https://gist.github.com/frankscholten/d373c575ad721dd0204e

When I run this I get an exception:

Exception in thread "main" java.lang.NoSuchMethodError:
org.apache.mahout.drivers.TextDelimitedIndexedDatasetReader.readElementsFrom(Ljava/lang/String;Lcom/google/common/collect/BiMap;)Lorg/apache/mahout/drivers/IndexedDataset;

What do I have to do here to use the Scala readers from Java?

Cheers,

Frank


MultithreadedBatchItemSimilarities with LLR versus Spark co-occurrence

2014-08-01 Thread Frank Scholten
Hi all,

I noticed the development of the Spark co-occurrence code in MAHOUT-1464 and I
wondered if I could get similar results, with less scalability, when I
use MultithreadedBatchItemSimilarities with LogLikelihoodSimilarity.

I want to use a co-occurrence recommender on a smallish dataset of a few
GB that does not warrant the use of a Spark cluster. Is the Spark
implementation mostly a more scalable version, or is it an improved
implementation that gives different or better results?
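For context, the non-Spark route I have in mind is roughly the following
(untested sketch; the precompute class and package names are from memory, so
double-check them, and the file names are just placeholders):

import java.io.File;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.impl.similarity.precompute.FileSimilarItemsWriter;
import org.apache.mahout.cf.taste.impl.similarity.precompute.MultithreadedBatchItemSimilarities;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.ItemBasedRecommender;
import org.apache.mahout.cf.taste.similarity.precompute.BatchItemSimilarities;

DataModel model = new FileDataModel(new File("views.csv"));
ItemBasedRecommender recommender =
    new GenericItemBasedRecommender(model, new LogLikelihoodSimilarity(model));

// Precompute the 10 most similar items per item on a single machine.
BatchItemSimilarities batch = new MultithreadedBatchItemSimilarities(recommender, 10);
batch.computeItemSimilarities(Runtime.getRuntime().availableProcessors(), 1,
    new FileSimilarItemsWriter(new File("similarities.csv")));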

Cheers,

Frank


Re: Setting up a recommender

2014-04-21 Thread Frank Scholten
Pat and Ted: I am late to the party but this is very interesting!

I am not sure I understand all the steps, though. Do you still create a
cooccurrence matrix and compute LLR scores during this process, or do you
only compute the matrix products with the history vector: [B'B] * h and
[B'A] * h?
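To be explicit about what I mean by that multiplication, in in-memory terms it
would be something like the following (toy sketch with Mahout's math classes
and made-up numbers, leaving out the LLR weighting the real job may apply):

import org.apache.mahout.math.DenseMatrix;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Matrix;
import org.apache.mahout.math.Vector;

// B: user x item matrix for the primary action, h: one user's history vector.
Matrix b = new DenseMatrix(new double[][] {
    {1, 1, 0},
    {0, 1, 1}});
Vector h = new DenseVector(new double[] {0, 1, 1});

// Item-item cooccurrence applied to the user's history: [B'B] * h
Vector recs = b.transpose().times(b).times(h);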

Cheers,

Frank


On Tue, Aug 13, 2013 at 7:49 PM, Pat Ferrel  wrote:

> I finally got some time to work on this and have a first cut at output to
> Solr working on the github repo. It only works on 2-action input but I'll
> have that cleaned up soon so it will work with one action. Solr indexing
> has not been tested yet and the field names and/or types may need tweaking.
>
> It takes the result of the previous drop:
> 1) DRMs for B (user history or B items action1) and A (user history of A
> items action2)
> 2) DRMs for [B'B] using LLR, and [B'A] using cooccurrence
>
> There are two final outputs created using mapreduce but requiring 2
> in-memory hashmaps. I think this will work on a cluster (the hashmaps are
> instantiated on each node) but haven't tried yet. It orders items in #2
> fields by strength of "link", which is the similarity value used in [B'B]
> or [B'A]. It would be nice to order #1 by recency but there is no provision
> for passing through timestamps at present so they are ordered by the
> strength of preference. This is probably not useful and so can be ignored.
> Ordering by recency might be useful for truncating queries by recency while
> leaving the training data containing 100% of available history.
>
> 1) It joins #1 DRMs to produce a single set of docs in CSV form, which
> looks like this:
> id,history_b,history_a
> user1,iphone ipad,iphone ipad galaxy
> ...
>
> 2) it joins #2 DRMs to produce a single set of docs in CSV form, which
> looks like this:
> id,b_b_links,b_a_links
> u1,iphone ipad,iphone ipad galaxy
> …
>
> It may work on a cluster, I haven't tried yet. As soon as someone has some
> large-ish sample log files I'll give them a try. Check the sample input
> files in the resources dir for format.
>
> https://github.com/pferrel/solr-recommender
>
>
> On Aug 13, 2013, at 10:17 AM, Pat Ferrel  wrote:
>
> When I started looking at this I was a bit skeptical. As a search engine
> Solr may be peerless, but as yet another NoSQL db?
>
> However, getting further into this I see one very large benefit. It has one
> feature that sets it completely apart from the typical NoSQL db: the
> queries you run return fuzzy results--in the very best sense of that
> word. The most interesting queries are based on similarity to some
> exemplar. Results are returned in order of similarity strength, not ordered
> by a sort field.
>
> Wherever similarity based queries are important I'll look at Solr first.
> SolrJ looks like an interesting way to get Solr queries on POJOs. It's
> probably at least an alternative to using docs and CSVs to import the data
> from Mahout.
>
>
>
> On Aug 12, 2013, at 2:32 PM, Ted Dunning  wrote:
>
> Yes.  That would be interesting.
>
>
>
>
> On Mon, Aug 12, 2013 at 1:25 PM, Gokhan Capan  wrote:
>
> > A little digression: Might a Matrix implementation backed by a Solr index
> > and uses SolrJ for querying help at all for the Solr recommendation
> > approach?
> >
> > It supports multiple fields of String, Text, or boolean flags.
> >
> > Best
> > Gokhan
> >
> >
> > On Wed, Aug 7, 2013 at 9:42 PM, Pat Ferrel  wrote:
> >
> >> Also a question about user history.
> >>
> >> I was planning to write these into separate directories so Solr could
> >> fetch them from different sources but it occurs to me that it would be
> >> better to join A and B by user ID and output a doc per user ID with three
> >> fields: id, A item history, and B item history. Other fields could be
> >> added for user metadata.
> >>
> >> Sound correct? This is what I'll do unless someone stops me.
> >>
> >> On Aug 7, 2013, at 11:25 AM, Pat Ferrel  wrote:
> >>
> >> Once you have a sample or example of what you think the
> >> "log file" version will look like, can you post it? It would be great to
> >> have example lines for two actions with or without the same item IDs.
> > I'll
> >> make sure we can digest it.
> >>
> >> I thought more about the ingest part and I don't think the
> one-item-space
> >> is actually a problem. It just means one item dictionary. A and B will
> > have
> >> the right content, all I have to do is make sure the right ranks are
> > input
> >> to the MM,
> >> Transpose, and RSJ. This in turn is only one extra count of the # of
> > items
> >> in A's item space. This should be a very easy change if my thinking is
> >> correct.
> >>
> >>
> >> On Aug 7, 2013, at 8:09 AM, Ted Dunning  wrote:
> >>
> >> On Tue, Aug 6, 2013 at 7:57 AM, Pat Ferrel 
> wrote:
> >>
> >>> 4) To add more metadata to the Solr output will be left to the consumer
> >>> for now. If there is a good data set to use we can illustrate how to do
> >> it
> >>> in the project. Ted may have some data for this from musicbrainz.
>

Re: lucene2seq error: field does not exist in the index

2014-04-16 Thread Frank Scholten
Hi Terry,

What happens when you make the 'body' field indexed in your schema?

LuceneIndexHelper checks the field using an IndexSearcher so it might be
that the field has to be indexed as well as being stored, which would be a
bug because lucene2seq is designed to load stored fields.
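One quick way to check what is actually in that index (a throwaway sketch
against the Lucene 4.x API, reusing the index path from your command; this is
not part of lucene2seq):

import java.io.File;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.store.FSDirectory;

public class CheckBodyField {
  public static void main(String[] args) throws Exception {
    DirectoryReader reader = DirectoryReader.open(
        FSDirectory.open(new File("/home/ec2-user/solr/solr-data/solrindex/index")));

    Document first = reader.document(0);
    System.out.println("stored 'body' value: " + first.get("body"));    // null if not stored
    System.out.println("indexed 'body' terms: "
        + MultiFields.getTerms(reader, "body"));                        // null if never indexed

    reader.close();
  }
}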

Cheers,

Frank


On Fri, Apr 11, 2014 at 5:33 AM, Terry Blankers  wrote:

> Hi All, I'm very new to trying to use lucene2seq so I'm not sure if it's
> just user error, but I'm experiencing some unexpected behavior when running
> lucene2seq against my Solr index (4.7.1). I've tried using both 0.9 and the
> trunk build of Mahout. (And BTW, I have been able to successfully run the
> Reuters example as a test baseline.)
>
>
> Here's the command I'm running:
>
>$MAHOUT_HOME/bin/mahout lucene2seq -i
>/home/ec2-user/solr/solr-data/solrindex/index -o solr/sequence -id
>key_sha1hex -f body -xm sequential -q topics:diabetes -n 500
>
>
> Excerpts from my solr schema:
>
>    <field name="body" ... stored="true" multiValued="true"/>
>
>    <copyField source="body" dest="content"/>
>    <defaultSearchField>content</defaultSearchField>
>
>
>
> When I use SolrAdmin and specify fl=body the search handler returns the
> 'body' field with data as expected. Yet I get the following error when
> running lucene2seq specifying '-f body':
>
>    IllegalArgumentException: Field 'body' does not exist in the index
>
>
>
> And if I specify '-f content', lucene2seq runs without errors or warnings,
> but seqdumper output shows no values for any key:
>
>    Key class: class org.apache.hadoop.io.Text Value Class: class
>    org.apache.hadoop.io.Text
>    Key: 96C4C76CF9D7449C724CA77CB8F650EAFD33E31C: Value:
>    Key: D6842B81B8D09733B50BEDB4767C2A5C49E43B20: Value:
>    Key: 61CB95FEE2C6BF0AC6E8A1F7738338CA36F42264: Value:
>    Key: 0F9903B72A7C9F0373A5171403B3AAEB291B16E1: Value:
>
>
> Can anyone give me any suggestions as to how to track down what might be
> happening here?
>
> Many thanks,
>
> Terry
>
>
>
>
>
>
>
>
>


Difference between CIMapper and ClusterIterator

2014-03-31 Thread Frank Scholten
Hi all,

I noticed in the CIMapper that the policy.update() call is done in the
setup of the mapper, while
in the ClusterIterator it is called for every vector in the iteration.

In the sequential version there is only a single policy while in the MR
version we will get a policy per mapper. Which implementation is correct?
If I recall correctly from the previous K-means implementation the update
centroids step was done at the end of each iteration, so I think the
policy.update() call should be moved outside of the vector loop in
ClusterIterator.

Thoughts?

Cheers,

Frank


Re: Text clustering with hashing vector encoders

2014-03-21 Thread Frank Scholten
Ah, interesting. I am going to try it out.

Thanks for your comments!
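
For my own notes, I think the per-cluster word scoring you describe boils down
to something like this (rough sketch with Guava multisets and Mahout's
LogLikelihood; the 'wheat' token and the empty multisets are only placeholders):

import com.google.common.collect.HashMultiset;
import com.google.common.collect.Multiset;
import org.apache.mahout.math.stats.LogLikelihood;

// clusterWords: word counts over documents assigned to one cluster,
// corpusWords:  word counts over the whole sample (both filled while streaming the docs).
Multiset<String> clusterWords = HashMultiset.create();
Multiset<String> corpusWords = HashMultiset.create();

// LLR score of one word for this cluster, with the usual 2x2 contingency counts.
String word = "wheat";
long k11 = clusterWords.count(word);                            // word inside the cluster
long k12 = corpusWords.count(word) - k11;                       // word outside the cluster
long k21 = clusterWords.size() - k11;                           // other words inside the cluster
long k22 = corpusWords.size() - corpusWords.count(word) - k21;  // other words outside
double score = LogLikelihood.rootLogLikelihoodRatio(k11, k12, k21, k22);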


On Fri, Mar 21, 2014 at 9:29 PM, Johannes Schulte <
johannes.schu...@gmail.com> wrote:

> Hi Frank,
>
> no, no collocation job. You just take a big enough sample of documents and
> assign each to its cluster with the learned ClusterClassifier. In parallel
> you count the total words in a Guava multiset and the per-cluster word
> counts in another multiset. The LogLikelihood class contains a convenient
> method that takes two multisets, which you use for all clusters.
>
> there should be no need to start a MapReduce job for that; with some
> RAM you can just stream the documents from HDFS
>
>
>
>
> On Fri, Mar 21, 2014 at 5:29 PM, Frank Scholten  >wrote:
>
> > Hi Johannes,
> >
> > Sounds good.
> >
> > The step for finding labels is still unclear to me. You use the
> > Loglikelihood class on the original documents? How? Or do you mean the
> > collocation job?
> >
> > Cheers,
> >
> > Frank
> >
> >
> >
> >
> >
> >
> >
> > On Thu, Mar 20, 2014 at 8:39 PM, Johannes Schulte <
> > johannes.schu...@gmail.com> wrote:
> >
> > > Hi Frank, we are using a very similar system in production.
> > > Hashing text like data to a 5 dimensional vector with two probes,
> and
> > > then applying tf-idf weighting.
> > >
> > > For IDF we dont keep a separate weight dictionary but just count the
> > > distinct training examples ("documents") that have a non null value per
> > > column.
> > > so there is a full idf vector that can be used.
> > > Instead of Euclidean Distance we use Cosine (Performance Reasons).
> > >
> > > The results are very good, building such a system is easy and maybe
> it's
> > > worth a try.
> > >
> > > For representing the cluster we have a separate job that assigns users
> > > ("documents") to clusters and shows the most discriminating words for
> the
> > > cluster via the LogLikelihood class. The results are then visualized
> > using
> > > http://wordcram.org/ for the whoah effect.
> > >
> > > Cheers,
> > >
> > > Johannes
> > >
> > >
> > > On Wed, Mar 19, 2014 at 8:35 PM, Ted Dunning 
> > > wrote:
> > >
> > > > On Wed, Mar 19, 2014 at 11:34 AM, Frank Scholten <
> > fr...@frankscholten.nl
> > > > >wrote:
> > > >
> > > > > On Wed, Mar 19, 2014 at 12:13 AM, Ted Dunning <
> ted.dunn...@gmail.com
> > >
> > > > > wrote:
> > > > >
> > > > > > Yes.  Hashing vector encoders will preserve distances when used
> > with
> > > > > > multiple probes.
> > > > > >
> > > > >
> > > > > So if a token occurs two times in a document the first token will
> be
> > > > mapped
> > > > > to a given location and when the token is hashed the second time it
> > > will
> > > > be
> > > > > mapped to a different location, right?
> > > > >
> > > >
> > > > No.  The same token will always hash to the same location(s).
> > > >
> > > >
> > > > > I am wondering if when many probes are used and a large enough
> vector
> > > > this
> > > > > process mimics TF weighting, since documents that have a high TF
> of a
> > > > given
> > > > > token will have the same positions marked in the vector. As Suneel
> > said
> > > > > when we then use the Hamming distance the vectors that are close to
> > > each
> > > > > other should be in the same cluster.
> > > > >
> > > >
> > > > Hamming distance doesn't quite work because you want to have
> collisions
> > > to
> > > > a sum rather than an OR.  Also, if you apply weights to the words,
> > these
> > > > weights will be added to all of the probe locations for the words.
> >  This
> > > > means we still need a plus/times/L2 dot product rather than an
> > > plus/AND/L1
> > > > dot product like the Hamming distance uses.
> > > >
> > > > >
> > > > > > Interpretation becomes somewhat difficult, but there is code
> > > available
> > > > to
> > > > > > reverse engineer labels on hashed vectors.
> > > > >
> > &

Re: Text clustering with hashing vector encoders

2014-03-21 Thread Frank Scholten
Hi Johannes,

Sounds good.

The step for finding labels is still unclear to me. You use the
LogLikelihood class on the original documents? How? Or do you mean the
collocation job?

Cheers,

Frank







On Thu, Mar 20, 2014 at 8:39 PM, Johannes Schulte <
johannes.schu...@gmail.com> wrote:

> Hi Frank, we are using a very similar system in production.
> Hashing text-like data to a 5 dimensional vector with two probes, and
> then applying tf-idf weighting.
>
> For IDF we don't keep a separate weight dictionary but just count the
> distinct training examples ("documents") that have a non-null value per
> column,
> so there is a full idf vector that can be used.
> Instead of Euclidean distance we use cosine (performance reasons).
>
> The results are very good, building such a system is easy and maybe it's
> worth a try.
>
> For representing the cluster we have a separate job that assigns users
> ("documents") to clusters and shows the most discriminating words for the
> cluster via the LogLikelihood class. The results are then visualized using
> http://wordcram.org/ for the whoah effect.
>
> Cheers,
>
> Johannes
>
>
> On Wed, Mar 19, 2014 at 8:35 PM, Ted Dunning 
> wrote:
>
> > On Wed, Mar 19, 2014 at 11:34 AM, Frank Scholten  > >wrote:
> >
> > > On Wed, Mar 19, 2014 at 12:13 AM, Ted Dunning 
> > > wrote:
> > >
> > > > Yes.  Hashing vector encoders will preserve distances when used with
> > > > multiple probes.
> > > >
> > >
> > > So if a token occurs two times in a document the first token will be
> > mapped
> > > to a given location and when the token is hashed the second time it
> will
> > be
> > > mapped to a different location, right?
> > >
> >
> > No.  The same token will always hash to the same location(s).
> >
> >
> > > I am wondering if when many probes are used and a large enough vector
> > this
> > > process mimics TF weighting, since documents that have a high TF of a
> > given
> > > token will have the same positions marked in the vector. As Suneel said
> > > when we then use the Hamming distance the vectors that are close to
> each
> > > other should be in the same cluster.
> > >
> >
> > Hamming distance doesn't quite work because you want to have collisions
> to
> > a sum rather than an OR.  Also, if you apply weights to the words, these
> > weights will be added to all of the probe locations for the words.  This
> > means we still need a plus/times/L2 dot product rather than an
> plus/AND/L1
> > dot product like the Hamming distance uses.
> >
> > >
> > > > Interpretation becomes somewhat difficult, but there is code
> available
> > to
> > > > reverse engineer labels on hashed vectors.
> > >
> > >
> > > I saw that AdaptiveWordEncoder has a built in dictionary so I can see
> > which
> > > words it has seen but I don't see how to go from a position or several
> > > positions in the vector to labels. Is there an example in the code I
> can
> > > look at?
> > >
> >
> > Yes.  The newsgroups example applies.
> >
> > The AdaptiveWordEncoder counts word occurrences that it sees and uses the
> > IDF based on the resulting counts.  This assumes that all instances of
> the
> > AWE will see the same rough distribution of words to work.  It is fine
> for
> > lots of applications and not fine for lots.
> >
> >
> > >
> > >
> > > > IDF weighting is slightly tricky, but quite doable if you keep a
> > > dictionary
> > > > of, say, the most common 50-200 thousand words and assume all others
> > have
> > > > constant and equal frequency.
> > > >
> > >
> > > How would IDF weighting work in conjunction with hashing? First build
> up
> > a
> > > dictionary of 50-200 and pass that into the vector encoders? The
> drawback
> > > of this is that you have another pass through the data and another
> > 'input'
> > > to keep track of and configure. But maybe it has to be like that.
> >
> >
> > With hashing, you still have the option of applying a weight to the
> hashed
> > representation of each word.  The question is what weight.
> >
> > To build a small dictionary, you don't have to go through all of the
> data.
> >  Just enough to get reasonably accurate weights for most words.  All
> words
> > not yet seen can be assumed to be rare an

Re: Text clustering with hashing vector encoders

2014-03-19 Thread Frank Scholten
On Wed, Mar 19, 2014 at 12:13 AM, Ted Dunning  wrote:

> Yes.  Hashing vector encoders will preserve distances when used with
> multiple probes.
>

So if a token occurs two times in a document the first occurrence will be mapped
to a given location and when the token is hashed the second time it will be
mapped to a different location, right?

I am wondering if when many probes are used and a large enough vector this
process mimics TF weighting, since documents that have a high TF of a given
token will have the same positions marked in the vector. As Suneel said
when we then use the Hamming distance the vectors that are close to each
other should be in the same cluster.


>
> Interpretation becomes somewhat difficult, but there is code available to
> reverse engineer labels on hashed vectors.


I saw that AdaptiveWordEncoder has a built in dictionary so I can see which
words it has seen but I don't see how to go from a position or several
positions in the vector to labels. Is there an example in the code I can
look at?


> IDF weighting is slightly tricky, but quite doable if you keep a dictionary
> of, say, the most common 50-200 thousand words and assume all others have
> constant and equal frequency.
>

How would IDF weighting work in conjunction with hashing? First build up a
dictionary of the 50-200 thousand most common words and pass that into the
vector encoders? The drawback of this is that you have another pass through
the data and another 'input'
to keep track of and configure. But maybe it has to be like that. The
reason I like the hashed encoders is that vectorizing can be done in a
streaming manner at the last possible moment. With the current tools you
have to do: data -> data2seq -> seq2sparse -> kmeans.

If this approach is doable I would like to code up a Java, non-Hadoop
example using the Reuters dataset that vectorizes each doc with the
hashing encoders and configures KMeans with Hamming distance, and then write
some code to get the labels.
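
To make the encoding step concrete, I am thinking of something along these
lines (sketch only, using the vectorizer encoder classes; the vector size,
probe count and tokens are arbitrary):

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

// Encode one tokenized document into a fixed-size hashed vector.
StaticWordValueEncoder encoder = new StaticWordValueEncoder("body");
encoder.setProbes(2);                        // each token hits two slots

Vector doc = new RandomAccessSparseVector(1000);
for (String token : new String[] {"wheat", "corn", "wheat"}) {
  encoder.addToVector(token, 1.0, doc);      // the same token always hashes to the same slots
}
// 'doc' can now go straight into k-means; repeated tokens accumulate weight (TF-like).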

Cheers,

Frank


>
>
>
> On Tue, Mar 18, 2014 at 2:40 PM, Frank Scholten  >wrote:
>
> > Hi all,
> >
> > Would it be possible to use hashing vector encoders for text clustering
> > just like when classifying?
> >
> > Currently we vectorize using a dictionary where we map each token to a
> > fixed position in the dictionary. After the clustering we use have to
> > retrieve the dictionary to determine the cluster labels.
> > This is quite a complex process where multiple outputs are read and
> written
> > in the entire clustering process.
> >
> > I think it would be great if both algorithms could use the same encoding
> > process but I don't know if this is possible.
> >
> > The problem is that we lose the mapping between token and position when
> > hashing. We need this mapping to determine cluster labels.
> >
> > However, maybe we could make it so hashed encoders can be used and that
> > determining top labels is left to the user. This might be a possibility
> > because I noticed a problem with the current cluster labeling code. This
> is
> > what happens: first vectors are vectorized with TF-IDF and clustered.
> Then
> > the labels are ranked, but again according to TF-IDF, instead of TF. So
> it
> > is possible that a token becomes the top ranked label, even though it is
> > rare within the cluster. The document with that token is in the cluster
> > because of other tokens. If the labels are determined based on a TF score
> > within the cluster I think you would have better labels. But this
> requires
> > a post-processing step on your original data and doing a TF count.
> >
> > Thoughts?
> >
> > Cheers,
> >
> > Frank
> >
>


Text clustering with hashing vector encoders

2014-03-18 Thread Frank Scholten
Hi all,

Would it be possible to use hashing vector encoders for text clustering
just like when classifying?

Currently we vectorize using a dictionary where we map each token to a
fixed position in the dictionary. After the clustering we have to
retrieve the dictionary to determine the cluster labels.
This is quite a complex process where multiple outputs are read and written
in the entire clustering process.

I think it would be great if both algorithms could use the same encoding
process but I don't know if this is possible.

The problem is that we lose the mapping between token and position when
hashing. We need this mapping to determine cluster labels.

However, maybe we could make it so hashed encoders can be used and that
determining top labels is left to the user. This might be a possibility
because I noticed a problem with the current cluster labeling code. This is
what happens: first documents are vectorized with TF-IDF and clustered. Then
the labels are ranked, but again according to TF-IDF, instead of TF. So it
is possible that a token becomes the top ranked label, even though it is
rare within the cluster. The document with that token is in the cluster
because of other tokens. If the labels are determined based on a TF score
within the cluster I think you would have better labels. But this requires
a post-processing step on your original data and doing a TF count.
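
The post-processing I have in mind is not much more than this (sketch with
Guava; 'tokenizedDocsInCluster' is an assumed variable holding the
re-tokenized documents assigned to one cluster):

import java.util.List;

import com.google.common.collect.HashMultiset;
import com.google.common.collect.Iterables;
import com.google.common.collect.Multiset;
import com.google.common.collect.Multisets;

// Count raw term frequencies inside one cluster and take the top terms as labels.
Multiset<String> clusterTf = HashMultiset.create();
for (List<String> docTokens : tokenizedDocsInCluster) {
  clusterTf.addAll(docTokens);
}
for (String label : Iterables.limit(Multisets.copyHighestCountFirst(clusterTf).elementSet(), 5)) {
  System.out.println(label);
}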

Thoughts?

Cheers,

Frank


Re: Naive Bayes classification

2014-03-18 Thread Frank Scholten
Hi Tharindu,

If I understand correctly seqdirectory creates labels based on the file
name but this is not what you want. What do you want the labels to be?

Cheers,

Frank


On Tue, Mar 18, 2014 at 2:22 PM, Tharindu Rusira
wrote:

> Hi everyone,
> I'm developing an application where I need to train a Naive Bayes
> classification model and use this model to classify new entities(In this
> case text files based on their content)
>
> I observed that seqdirectory command always adds the file/directory name as
> the "key" field for each document which will be used as the label in
> classification jobs.
> This makes sense when I need to train a model and create the labelindex
> since I have organized my training data according to their labels in
> separate directories.
>
> Now I'm trying to use this model and infer the best label for an unknown
> document.
> My requirement is to ask Mahout to read my new file and output the
> predicted category by looking at the labelindex and the tfidf vector of the
> new content.
> I tried creating vectors from the new content (seqdirectory and
> seq2sparse), and then using this vector to run testnb command. But
> unfortunately seqdirectory commands adds file names as labels which does
> not make sense in classification.
>
> The following error message will further demonstrate this behavior.
> input0.txt is the file name of my new document.
>
> [main] ERROR com.me.classifier.mahout.MahoutClassifier - Error while
> classifying documents
> java.lang.IllegalArgumentException: Label not found: input0.txt
> at
> com.google.common.base.Preconditions.checkArgument(Preconditions.java:125)
> at
>
> org.apache.mahout.classifier.ConfusionMatrix.getCount(ConfusionMatrix.java:182)
> at
>
> org.apache.mahout.classifier.ConfusionMatrix.incrementCount(ConfusionMatrix.java:205)
> at
>
> org.apache.mahout.classifier.ConfusionMatrix.incrementCount(ConfusionMatrix.java:209)
> at
>
> org.apache.mahout.classifier.ConfusionMatrix.addInstance(ConfusionMatrix.java:173)
> at
>
> org.apache.mahout.classifier.ResultAnalyzer.addInstance(ResultAnalyzer.java:70)
> at
>
> org.apache.mahout.classifier.naivebayes.test.TestNaiveBayesDriver.analyzeResults(TestNaiveBayesDriver.java:160)
> at
>
> org.apache.mahout.classifier.naivebayes.test.TestNaiveBayesDriver.run(TestNaiveBayesDriver.java:125)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> at
>
> org.apache.mahout.classifier.naivebayes.test.TestNaiveBayesDriver.main(TestNaiveBayesDriver.java:66)
>
>
> So how can I achieve what I'm trying to do here?
>
> Thanks,
>
>
> --
> M.P. Tharindu Rusira Kumara
>
> Department of Computer Science and Engineering,
> University of Moratuwa,
> Sri Lanka.
> +94757033733
> www.tharindu-rusira.blogspot.com
>


Re: Problem with K-Means clustering on Amazon EMR

2014-03-16 Thread Frank Scholten
Hi Konstantin,

Good to hear from you.

The link you mentioned points to EigenSeedGenerator not
RandomSeedGenerator. The problem seems to be with the call to

fs.getFileStatus(input).isDir()


It's been a while and I don't remember, but perhaps you have to set
additional Hadoop fs properties to use S3. See
https://wiki.apache.org/hadoop/AmazonS3. Perhaps you can isolate the cause of
this by creating a small Java main app with that line of code and running it
in the debugger.
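
Something along these lines (untested sketch; it just reuses the S3 path from
your stack trace):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3PathCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path input = new Path(
        "s3://by.kslisenko.bigdata/stackovweflow-small/out_new/sparse/tfidf-vectors");
    // Resolve the filesystem from the path's own URI, mirroring the
    // FileSystem.get(uri, conf) hint in the exception message.
    FileSystem fs = FileSystem.get(input.toUri(), conf);
    System.out.println(fs.getFileStatus(input).isDir());
  }
}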

Cheers,

Frank



On Sun, Mar 16, 2014 at 12:07 PM, Konstantin Slisenko
wrote:

> Hello!
>
> I run a text-documents clustering on Hadoop cluster in Amazon Elastic Map
> Reduce. As input and output I use S3 Amazon file system. I specify all
> paths as "s3://bucket-name/folder-name".
>
> SparceVectorsFromSequenceFile works correctly with S3
> but when I start K-Means clustering job, I get this error:
>
> Exception in thread "main" java.lang.IllegalArgumentException: This
> file system object (hdfs://172.31.41.65:9000) does not support access
> to the request path
>
> 's3://by.kslisenko.bigdata/stackovweflow-small/out_new/sparse/tfidf-vectors'
> You possibly called FileSystem.get(conf) when you should have called
> FileSystem.get(uri, conf) to obtain a file system supporting your
> path.
>
> at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:375)
> at
> org.apache.hadoop.hdfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:106)
> at
> org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:162)
> at
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:530)
> at
> org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:76)
> at
> org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:93)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> at
> bbuzz2011.stackoverflow.runner.RunnerWithInParams.cluster(RunnerWithInParams.java:121)
> at
> bbuzz2011.stackoverflow.runner.RunnerWithInParams.run(RunnerWithInParams.java:52)
> at
> bbuzz2011.stackoverflow.runner.RunnerWithInParams.main(RunnerWithInParams.java:41)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>
>
> I checked RandomSeedGenerator.buildRandom
> (
> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.mahout/mahout-core/0.8/org/apache/mahout/clustering/kmeans/EigenSeedGenerator.java?av=f
> )
> and I assume it has correct code:
>
> FileSystem fs = FileSystem.get(output.toUri(), conf);
>
>
> I cannot run clustering because of this error. Maybe you have some
> ideas how to fix this?
>


Re: Welcome Andrew Musselman as new comitter

2014-03-07 Thread Frank Scholten
Congratulations Andrew!


On Fri, Mar 7, 2014 at 6:12 PM, Sebastian Schelter  wrote:

> Hi,
>
> this is to announce that the Project Management Committee (PMC) for Apache
> Mahout has asked Andrew Musselman to become committer and we are pleased to
> announce that he has accepted.
>
> Being a committer enables easier contribution to the project since in
> addition to posting patches on JIRA it also gives write access to the code
> repository. That also means that now we have yet another person who can
> commit patches submitted by others to our repo *wink*
>
> Andrew, we look forward to working with you in the future. Welcome! It
> would be great if you could introduce yourself with a few words :)
>
> Sebastian
>


Re: Rework our website

2014-03-05 Thread Frank Scholten
+1 for design 2


On Wed, Mar 5, 2014 at 6:00 PM, Suneel Marthi wrote:

> +1 for Option# 2.
>
>
>
>
>
> On Wednesday, March 5, 2014 7:11 AM, Sebastian Schelter 
> wrote:
>
> Hi everyone,
>
> In our latest discussion, I argued that the lack (and errors) of
> documentation on our website is one of the main pain points of Mahout
> atm. To be honest, I'm also not very happy with the design, especially
> fonts and spacing make it super hard to read long articles. This also
> prevents me from wanting to add articles and documentation.
>
> I think we should have a beautiful website, where it is fun to add new
> stuff.
>
> My design skills are pretty limited, but fortunately my brother is an
> art director! I asked him to make our website a bit more beautiful
> without changing too much of the structure, so that a redesign wouldn't
> take too long.
>
> I really like the results and would volunteer to dig out my CSS skills
> and do the redesign, if people agree.
>
> Here are his drafts, I like the second one best:
>
> https://people.apache.org/~ssc/mahout/mahout.jpg
> https://people.apache.org/~ssc/mahout/mahout2.jpg
>
> Let me know what you think!
>
> Best,
> Sebastian
>


Re: SGD classifier demo app

2014-02-04 Thread Frank Scholten
Thanks to you too, Johannes, for your comments!


On Tue, Feb 4, 2014 at 7:39 PM, Frank Scholten wrote:

> Thanks Ted!
>
> Would indeed be a nice example to add.
>
>
> On Tue, Feb 4, 2014 at 10:40 AM, Ted Dunning wrote:
>
>> Yes.
>>
>>
>> On Tue, Feb 4, 2014 at 1:31 AM, Sebastian Schelter 
>> wrote:
>>
>> > Would be great to add this as an example to Mahout's codebase.
>> >
>> >
>> > On 02/04/2014 10:27 AM, Ted Dunning wrote:
>> >
>> >> Frank,
>> >>
>> >> I just munched on your code and sent a pull request.
>> >>
>> >> In doing this, I made a bunch of changes.  Hope you liked them.
>> >>
>> >> These include massive simplification of the reading and vectorization.
>> >>   This wasn't strictly necessary, but it seemed like a good idea.
>> >>
>> >> More important was the way that I changed the vectorization.  For the
>> >> continuous values, I added log transforms.  For the categorical
>> values, I
>> >> encoded as they are.  I also increased the feature vector size to 100
>> to
>> >> avoid excessive collisions.
>> >>
>> >> In the learning code itself, I got rid of the use of index arrays in
>> favor
>> >> of shuffling the training data itself.  I also tuned the learning
>> >> parameters a lot.
>> >>
>> >> The result is that the AUC that results is just a tiny bit less than
>> 0.9
>> >> which is pretty close to what I got in R.
>> >>
>> >> For everybody else, see
>> >> https://github.com/tdunning/mahout-sgd-bank-marketing for my version
>> and
>> >> https://github.com/tdunning/mahout-sgd-bank-marketing/
>> >> compare/frankscholten:master...masterfor
>> >> my pull request.
>> >>
>> >>
>> >>
>> >> On Mon, Feb 3, 2014 at 3:57 PM, Ted Dunning 
>> >> wrote:
>> >>
>> >>
>> >>> Johannes,
>> >>>
>> >>> Very good comments.
>> >>>
>> >>> Frank,
>> >>>
>> >>> As a benchmark, I just spent a few minutes building a logistic
>> regression
>> >>> model using R.  For this model AUC on 10% held-out data is about 0.9.
>> >>>
>> >>> Here is a gist summarizing the results:
>> >>>
>> >>> https://gist.github.com/tdunning/8794734
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> On Mon, Feb 3, 2014 at 2:41 PM, Johannes Schulte <
>> >>> johannes.schu...@gmail.com> wrote:
>> >>>
>> >>>  Hi Frank,
>> >>>>
>> >>>> you are using the feature vector encoders which hash a combination of
>> >>>> feature name and feature value to 2 (default) locations in the
>> vector.
>> >>>> The
>> >>>> vector size you configured is 11 and this is imo very small to the
>> >>>> possible
>> >>>> combination of values you have for your data (education, marital,
>> >>>> campaign). You can do no harm by using a much bigger cardinality (try
>> >>>> 1000).
>> >>>>
>> >>>> Second, you are using a continuous value encoder with passing in the
>> >>>> weight
>> >>>> your are using as string (e.g. variable "pDays"). I am not quite sure
>> >>>> about
>> >>>> the reasons in th mahout code right now but the way it is implemented
>> >>>> now,
>> >>>> every unique value should end up in a different location because the
>> >>>> continuous value is part of the hashing. Try adding the weight
>> directly
>> >>>> using a static word value encoder, addToVector("pDays",v,pDays)
>> >>>>
>> >>>> Last, you are also putting in the variable "campaign" as a continous
>> >>>> variable which should be probably a categorical variable, so just
>> added
>> >>>> with a StaticWorldValueEncoder.
>> >>>>
>> >>>> And finally and probably most important after looking at your target
>> >>>> variable: you are using a Dictionary for mapping either y or no to 0
>> or
>> >>>> 1.
>> >>>> This is bad. Depending on what comes first in th

Re: SGD classifier demo app

2014-02-04 Thread Frank Scholten
Thanks Ted!

Would indeed be a nice example to add.
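
For anyone skimming the thread later, the vectorization changes discussed
below boil down to roughly this (my own untested sketch; the field values are
invented, the cardinality of 100 follows Ted's change and the log transform
follows his comment about continuous values):

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.vectorizer.encoders.ConstantValueEncoder;
import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

// Encode one call record into a hashed feature vector.
Vector v = new RandomAccessSparseVector(100);

new ConstantValueEncoder("intercept").addToVector("1", 1.0, v);
new StaticWordValueEncoder("education").addToVector("tertiary", 1.0, v);
new StaticWordValueEncoder("marital").addToVector("married", 1.0, v);
new StaticWordValueEncoder("campaign").addToVector("2", 1.0, v);          // categorical, not continuous
new ConstantValueEncoder("pdays").addToVector("1", Math.log1p(42.0), v);  // numeric weight, log-transformed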


On Tue, Feb 4, 2014 at 10:40 AM, Ted Dunning  wrote:

> Yes.
>
>
> On Tue, Feb 4, 2014 at 1:31 AM, Sebastian Schelter  wrote:
>
> > Would be great to add this as an example to Mahout's codebase.
> >
> >
> > On 02/04/2014 10:27 AM, Ted Dunning wrote:
> >
> >> Frank,
> >>
> >> I just munched on your code and sent a pull request.
> >>
> >> In doing this, I made a bunch of changes.  Hope you liked them.
> >>
> >> These include massive simplification of the reading and vectorization.
> >>   This wasn't strictly necessary, but it seemed like a good idea.
> >>
> >> More important was the way that I changed the vectorization.  For the
> >> continuous values, I added log transforms.  For the categorical values,
> I
> >> encoded as they are.  I also increased the feature vector size to 100 to
> >> avoid excessive collisions.
> >>
> >> In the learning code itself, I got rid of the use of index arrays in
> favor
> >> of shuffling the training data itself.  I also tuned the learning
> >> parameters a lot.
> >>
> >> The result is that the AUC that results is just a tiny bit less than 0.9
> >> which is pretty close to what I got in R.
> >>
> >> For everybody else, see
> >> https://github.com/tdunning/mahout-sgd-bank-marketing for my version
> and
> >> https://github.com/tdunning/mahout-sgd-bank-marketing/
> >> compare/frankscholten:master...masterfor
> >> my pull request.
> >>
> >>
> >>
> >> On Mon, Feb 3, 2014 at 3:57 PM, Ted Dunning 
> >> wrote:
> >>
> >>
> >>> Johannes,
> >>>
> >>> Very good comments.
> >>>
> >>> Frank,
> >>>
> >>> As a benchmark, I just spent a few minutes building a logistic
> regression
> >>> model using R.  For this model AUC on 10% held-out data is about 0.9.
> >>>
> >>> Here is a gist summarizing the results:
> >>>
> >>> https://gist.github.com/tdunning/8794734
> >>>
> >>>
> >>>
> >>>
> >>> On Mon, Feb 3, 2014 at 2:41 PM, Johannes Schulte <
> >>> johannes.schu...@gmail.com> wrote:
> >>>
> >>>  Hi Frank,
> >>>>
> >>>> you are using the feature vector encoders which hash a combination of
> >>>> feature name and feature value to 2 (default) locations in the vector.
> >>>> The
> >>>> vector size you configured is 11 and this is imo very small to the
> >>>> possible
> >>>> combination of values you have for your data (education, marital,
> >>>> campaign). You can do no harm by using a much bigger cardinality (try
> >>>> 1000).
> >>>>
> >>>> Second, you are using a continuous value encoder with passing in the
> >>>> weight
> >>>> your are using as string (e.g. variable "pDays"). I am not quite sure
> >>>> about
> >>>> the reasons in th mahout code right now but the way it is implemented
> >>>> now,
> >>>> every unique value should end up in a different location because the
> >>>> continuous value is part of the hashing. Try adding the weight
> directly
> >>>> using a static word value encoder, addToVector("pDays",v,pDays)
> >>>>
> >>>> Last, you are also putting in the variable "campaign" as a continous
> >>>> variable which should be probably a categorical variable, so just
> added
> >>>> with a StaticWorldValueEncoder.
> >>>>
> >>>> And finally and probably most important after looking at your target
> >>>> variable: you are using a Dictionary for mapping either y or no to 0
> or
> >>>> 1.
> >>>> This is bad. Depending on what comes first in the data set, either a
> >>>> positive or negative example might be 0 or 1, totally random. Make a
> >>>> hard
> >>>> mapping from the possible values (y/n?) to zero and one, having yes
> the
> >>>> 1
> >>>> and no the zero.
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On Mon, Feb 3, 2014 at 9:33 PM, Frank Scholten <
> fr...@frankscholten.nl
> >>>>
> >>>>> w

Re: Annotation based vectorizer

2014-02-03 Thread Frank Scholten
The second field of NewsgroupPost should be called bodyText, of course.


On Mon, Feb 3, 2014 at 10:52 PM, Frank Scholten wrote:

> Hi all,
>
> I put together a utility which vectorizes plain old Java objects annotated
> with @Feature and @Target via Mahout's vector encoders.
>
> See my Github branch:
> https://github.com/frankscholten/mahout/tree/annotation-based-vectorizer
>
> and the unit test:
> https://github.com/frankscholten/mahout/blob/annotation-based-vectorizer/core/src/test/java/org/apache/mahout/classifier/sgd/AnnotationBasedVectorizerTest.java
>
> Use it like this:
>
> class NewsgroupPost {
>
>   @Target
>   private String newsgroup;
>
>   @Feature(encoder = TextValueEncoder.class)
>   private String newsgroup;
>
>   // Getters & setters
>
> }
>
> AnnotationBasedVectorizer<NewsgroupPost> vectorizer = new
> AnnotationBasedVectorizer<NewsgroupPost>(new
> TypeReference<NewsgroupPost>(){});
>
> Here the vectorizer scans the NewsgroupPost's annotations. Then you can do
> this:
>
> NewsgroupPost post = ...
>
> Vector vector = vectorizer.vectorize(post);
> int target = vectorizer.getTarget(post);
> int numFeatures = vectorizer.getNumberOfFeatures();
>
> Note that the vectorize() and getTarget() methods are generically typed and due
> to the type token passed in the constructor we can enforce that only
> NewsgroupPosts are accepted.
>
> The vectorizer uses a Dictionary for encoding the target.
>
> Thoughts?
>
> Cheers,
>
> Frank
>


Annotation based vectorizer

2014-02-03 Thread Frank Scholten
Hi all,

I put together a utility which vectorizes plain old Java objects annotated
with @Feature and @Target via Mahout's vector encoders.

See my Github branch:
https://github.com/frankscholten/mahout/tree/annotation-based-vectorizer

and the unit test:
https://github.com/frankscholten/mahout/blob/annotation-based-vectorizer/core/src/test/java/org/apache/mahout/classifier/sgd/AnnotationBasedVectorizerTest.java

Use it like this:

class NewsgroupPost {

  @Target
  private String newsgroup;

  @Feature(encoder = TextValueEncoder.class)
  private String newsgroup;

  // Getters & setters

}

AnnotationBasedVectorizer<NewsgroupPost> vectorizer = new
AnnotationBasedVectorizer<NewsgroupPost>(new
TypeReference<NewsgroupPost>(){});

Here the vectorizer scans the NewsgroupPost's annotations. Then you can do
this:

NewsgroupPost post = ...

Vector vector = vectorizer.vectorize(post);
int target = vectorizer.getTarget(post);
int numFeatures = vectorizer.getNumberOfFeatures();

Note that the vectorize() and getTarget() methods are generically typed and due
to the type token passed in the constructor we can enforce that only
NewsgroupPosts are accepted.

The vectorizer uses a Dictionary for encoding the target.

Thoughts?

Cheers,

Frank


Re: Data(Set) creation of for train and test.

2014-02-03 Thread Frank Scholten
Sorry I didn't properly read your message. The random forest code is quite
different and what I suggested is not applicable.

The DataConverter converts a String to a Vector wrapped by Instance. With
this you can create your training set I think.



On Mon, Feb 3, 2014 at 10:09 PM, Frank Scholten wrote:

> Have a look at OnlineLogisticRegressionTest.iris().
>
> Here List.subList() is used in combination with Collections.shuffle() to
> make the train and test dataset split.
>
> So you could first read the dataset in a list and then use this trick.
>
> I just pushed an example to Github that also uses this approach but I
> wrapped this logic into a utility
>
> See: https://github.com/frankscholten/mahout-sgd-bank-marketing and
>
>
> https://github.com/frankscholten/mahout-sgd-bank-marketing/blob/master/src/main/java/bankmarketing/util/TrainAndTestSetUtil.java
>
> Cheers,
>
> Frank
>
>
> On Mon, Feb 3, 2014 at 10:01 PM, j.barrett Strausser <
> j.barrett.straus...@gmail.com> wrote:
>
>> Two part question.
>>
>> 1. String Descriptor for input data
>>
>> Can anyone confirm my reasoning on the following -
>>
>> I believe the below code does the following.  It says the first column is
>> the feature to be predicted (is a label) all other columns are to be used
>> in the tree construction e.g. as variable to split on.
>>
>> val descriptor = "L N N"
>> val trainDataValues = fileAsStringArray("myTrainFile.csv");
>> val data = DataLoader.loadData(DataLoader.generateDataset(descriptor,
>> false, trainDataValues), trainDataValues);
>>
>> Where my "myTrainFile.csv has a form like
>>
>> "A", .45,.55
>> ...
>> ...
>> "B" 33.3, 22.3
>>
>>
>>
>> 2. String Descriptor for input data
>>
>> I'm now provided a new file "myTestData.csv"
>>
>> This data has no labels, but is otherwise the same as above. So if I
>> attempt to create a dataset an error will be thrown with complain of no
>> label.
>>
>> All I'm interested in is being able to call forest.classify(..., ...) but
>> I'm not sure how to correctly construct my training dataset.
>>
>> I cannot simply split the original dataset as is done in most examples.
>>
>>
>> Any examples showing test data construction independent of the original
>> training set would be appreciated.
>>
>>
>> --
>>
>>
>> https://github.com/bearrito
>> @deepbearrito
>>
>
>


Re: Data(Set) creation of for train and test.

2014-02-03 Thread Frank Scholten
Have a look at OnlineLogisticRegressionTest.iris().

Here List.subList() is used in combination with Collections.shuffle() to
make the train and test dataset split.

So you could first read the dataset into a list and then use this trick.
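
Roughly like this (sketch; assume allLines holds the CSV lines read into
memory, and the 80/20 split and fixed seed are arbitrary):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Shuffle once, then slice into a training and a test portion.
List<String> lines = new ArrayList<String>(allLines);
Collections.shuffle(lines, new Random(42));
int cut = (int) (lines.size() * 0.8);
List<String> train = lines.subList(0, cut);
List<String> test = lines.subList(cut, lines.size());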

I just pushed an example to Github that also uses this approach but I
wrapped this logic into a utility

See: https://github.com/frankscholten/mahout-sgd-bank-marketing and

https://github.com/frankscholten/mahout-sgd-bank-marketing/blob/master/src/main/java/bankmarketing/util/TrainAndTestSetUtil.java

Cheers,

Frank


On Mon, Feb 3, 2014 at 10:01 PM, j.barrett Strausser <
j.barrett.straus...@gmail.com> wrote:

> Two part question.
>
> 1. String Descriptor for input data
>
> Can anyone confirm my reasoning on the following -
>
> I believe the below code does the following.  It says the first column is
> the feature to be predicted (is a label) all other columns are to be used
> in the tree construction e.g. as variable to split on.
>
> val descriptor = "L N N"
> val trainDataValues = fileAsStringArray("myTrainFile.csv");
> val data = DataLoader.loadData(DataLoader.generateDataset(descriptor,
> false, trainDataValues), trainDataValues);
>
> Where my "myTrainFile.csv has a form like
>
> "A", .45,.55
> ...
> ...
> "B" 33.3, 22.3
>
>
>
> 2. String Descriptor for input data
>
> I'm now provided a new file "myTestData.csv"
>
> This data has no labels, but is otherwise the same as above. So if I
> attempt to create a dataset an error will be thrown with complain of no
> label.
>
> All I'm interested in is being able to call forest.classify(..., ...) but
> I'm not sure how to correctly construct my training dataset.
>
> I cannot simply split the original dataset as is done in most examples.
>
>
> Any examples showing test data construction independent of the original
> training set would be appreciated.
>
>
> --
>
>
> https://github.com/bearrito
> @deepbearrito
>


SGD classifier demo app

2014-02-03 Thread Frank Scholten
Hi all,

I am exploring Mahout's SGD classifier and would like some feedback because I
think I didn't properly configure things.

I created an example app that trains an SGD classifier on the 'bank
marketing' dataset from UCI:
http://archive.ics.uci.edu/ml/datasets/Bank+Marketing

My app is at: https://github.com/frankscholten/mahout-sgd-bank-marketing

The app reads a CSV file of telephone calls, encodes the features into a
vector and tries to predict whether a customer answers yes to a business
proposal.

I do a few runs and measure accuracy but I don't trust the results.
When I only use an intercept term as a feature I get around 88% accuracy
and when I add all features it drops to around 85%. Is this perhaps because
the dataset is highly unbalanced? Most customers answer no. Or is the
classifier biased to predict 0 as the target code when it doesn't have any
data to go with?

Any other comments about my code or improvements I can make in the app are
welcome! :)

Cheers,

Frank


Re: Logistic Regression cost function

2014-01-14 Thread Frank Scholten
I see the update rule for the beta matrix is derived from the pseudocode of the
innermost loop of 'Algorithm 1: Stochastic Gradient Descent' in
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.177.3514&rep=rep1&type=pdf

In the paper the learning rate is written before the gradient of the error
function, and the multiplication by the instance value also comes before the
gradient. Perhaps we can rearrange the code like this so it matches the paper,
and add a comment? The only difference then is the perTermAnnealingRate.
and add a comment? The only difference then is the perTermAnnealingRate.

// See 'Algorithm 1: SGD' in
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.177.3514&rep=rep1&type=pdf
double newValue = beta.getQuick(i, j) + learningRate *
perTermLearningRate(j) * instance.get(j) * gradientBase;
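
Written out (my reading of the code, so treat this as a sketch rather than
gospel), the per-term update is

    beta_ij <- beta_ij + mu * lambda_j * x_j * g_i

where mu is the current learning rate, lambda_j the per-term learning rate,
x_j the j-th value of the instance and g_i the gradient term for class i
(observed indicator minus predicted probability, as far as I can tell).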

Cheers,

Frank







On Mon, Jan 13, 2014 at 10:54 PM, Frank Scholten wrote:

> Thanks guys, I have some reading to do :-)
>
>
> On Mon, Jan 13, 2014 at 10:45 PM, Ted Dunning wrote:
>
>> The reference is to the web site in general.
>>
>> If anything, this blog is closest:
>>
>> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.177.3514&rep=rep1&type=pdf
>>
>>
>> On Mon, Jan 13, 2014 at 1:14 PM, Suneel Marthi > >wrote:
>>
>> > I think this is the one. Yes, I don't see this paper referenced in the
>> > code sorry about that.
>> > http://leon.bottou.org/publications/pdf/compstat-2010.pdf
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> > On Monday, January 13, 2014 3:51 PM, Frank Scholten <
>> > fr...@frankscholten.nl> wrote:
>> >
>> > Do you know which paper it is? He has quite a few publications. I don't
>> see
>> > any mention of one of his papers in the code. I only see
>> > www.eecs.tufts.edu/~dsculley/papers/combined-ranking-and-regression.pdfin
>> > MixedGradient but this is something different.
>> >
>> >
>> >
>> >
>> > On Mon, Jan 13, 2014 at 1:27 PM, Suneel Marthi > > >wrote:
>> >
>> > > Mahout's impl is based off of Leon Bottou's paper on this subject.  I
>> > > don't gave the link handy but it's referenced in the code or try
>> google
>> > > search
>> > >
>> > > Sent from my iPhone
>> > >
>> > > > On Jan 13, 2014, at 7:14 AM, Frank Scholten > >
>> > > wrote:
>> > > >
>> > > > Hi,
>> > > >
>> > > > I followed the Coursera Machine Learning course quite a while ago
>> and I
>> > > am
>> > > > trying to find out how Mahout implements the Logistic Regression
>> cost
>> > > > function in the code surrounding AbstractOnlineLogisticRegression.
>> > > >
>> > > > I am looking at the train method in AbstractOnlineLogisticRegression
>> > and
>> > > I
>> > > > see online gradient descent step where the beta matrix is updated
>> but
>> > to
>> > > me
>> > > > its unclear how matches with the cost function described at:
>> > > > http://www.holehouse.org/mlclass/06_Logistic_Regression.html
>> > > >
>> > > > Perhaps Mahout uses an optimized approach for that does not directly
>> > map
>> > > > into the formula at that link?
>> > > >
>> > > > Cheers,
>> > > >
>> > > > Frank
>> > >
>> >
>>
>
>


Re: Logistic Regression cost function

2014-01-13 Thread Frank Scholten
Thanks guys, I have some reading to do :-)


On Mon, Jan 13, 2014 at 10:45 PM, Ted Dunning  wrote:

> The reference is to the web site in general.
>
> If anything, this blog is closest:
>
> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.177.3514&rep=rep1&type=pdf
>
>
> On Mon, Jan 13, 2014 at 1:14 PM, Suneel Marthi  >wrote:
>
> > I think this is the one. Yes, I don't see this paper referenced in the
> > code sorry about that.
> > http://leon.bottou.org/publications/pdf/compstat-2010.pdf
> >
> >
> >
> >
> >
> >
> >
> > On Monday, January 13, 2014 3:51 PM, Frank Scholten <
> > fr...@frankscholten.nl> wrote:
> >
> > Do you know which paper it is? He has quite a few publications. I don't
> see
> > any mention of one of his papers in the code. I only see
> > www.eecs.tufts.edu/~dsculley/papers/combined-ranking-and-regression.pdfin
> > MixedGradient but this is something different.
> >
> >
> >
> >
> > On Mon, Jan 13, 2014 at 1:27 PM, Suneel Marthi  > >wrote:
> >
> > > Mahout's impl is based off of Leon Bottou's paper on this subject.  I
> > > don't gave the link handy but it's referenced in the code or try google
> > > search
> > >
> > > Sent from my iPhone
> > >
> > > > On Jan 13, 2014, at 7:14 AM, Frank Scholten 
> > > wrote:
> > > >
> > > > Hi,
> > > >
> > > > I followed the Coursera Machine Learning course quite a while ago
> and I
> > > am
> > > > trying to find out how Mahout implements the Logistic Regression cost
> > > > function in the code surrounding AbstractOnlineLogisticRegression.
> > > >
> > > > I am looking at the train method in AbstractOnlineLogisticRegression
> > and
> > > I
> > > > see online gradient descent step where the beta matrix is updated but
> > to
> > > me
> > > > its unclear how matches with the cost function described at:
> > > > http://www.holehouse.org/mlclass/06_Logistic_Regression.html
> > > >
> > > > Perhaps Mahout uses an optimized approach for that does not directly
> > map
> > > > into the formula at that link?
> > > >
> > > > Cheers,
> > > >
> > > > Frank
> > >
> >
>


Re: Logistic Regression cost function

2014-01-13 Thread Frank Scholten
Do you know which paper it is? He has quite a few publications. I don't see
any mention of one of his papers in the code. I only see
www.eecs.tufts.edu/~dsculley/papers/combined-ranking-and-regression.pdf in
MixedGradient but this is something different.



On Mon, Jan 13, 2014 at 1:27 PM, Suneel Marthi wrote:

> Mahout's impl is based off of Leon Bottou's paper on this subject.  I
> don't gave the link handy but it's referenced in the code or try google
> search
>
> Sent from my iPhone
>
> > On Jan 13, 2014, at 7:14 AM, Frank Scholten 
> wrote:
> >
> > Hi,
> >
> > I followed the Coursera Machine Learning course quite a while ago and I
> am
> > trying to find out how Mahout implements the Logistic Regression cost
> > function in the code surrounding AbstractOnlineLogisticRegression.
> >
> > I am looking at the train method in AbstractOnlineLogisticRegression and
> I
> > see online gradient descent step where the beta matrix is updated but to
> me
> > its unclear how matches with the cost function described at:
> > http://www.holehouse.org/mlclass/06_Logistic_Regression.html
> >
> > Perhaps Mahout uses an optimized approach for that does not directly map
> > into the formula at that link?
> >
> > Cheers,
> >
> > Frank
>


Logistic Regression cost function

2014-01-13 Thread Frank Scholten
Hi,

I followed the Coursera Machine Learning course quite a while ago and I am
trying to find out how Mahout implements the Logistic Regression cost
function in the code surrounding AbstractOnlineLogisticRegression.

I am looking at the train method in AbstractOnlineLogisticRegression and I
see the online gradient descent step where the beta matrix is updated, but to me
it's unclear how it matches the cost function described at:
http://www.holehouse.org/mlclass/06_Logistic_Regression.html

Perhaps Mahout uses an optimized approach for that does not directly map
into the formula at that link?

Cheers,

Frank
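
For readers trying to map the code onto the Coursera-style cost function, here is
a minimal, self-contained sketch of the per-example update that plain stochastic
gradient descent performs for the logistic log-loss. This is not Mahout's actual
code: AbstractOnlineLogisticRegression adds learning-rate annealing, regularization
and lazy per-feature updates on top of this idea, which is why the mapping is hard
to see in the train method.

// Hedged sketch, not Mahout's implementation: one SGD step for binary
// logistic regression on a single example (x, y). Minimising the log-loss
// J(beta) = -[y*log(p) + (1-y)*log(1-p)] gives the gradient (p - y) * x,
// so gradient descent adds learningRate * (y - p) * x to beta.
public final class LogisticSgdSketch {

  static void sgdStep(double[] beta, double[] x, int y, double learningRate) {
    double dot = 0.0;
    for (int i = 0; i < x.length; i++) {
      dot += beta[i] * x[i];
    }
    double p = 1.0 / (1.0 + Math.exp(-dot));   // sigmoid(beta . x)
    double error = y - p;                      // negative gradient of the log-loss w.r.t. beta . x
    for (int i = 0; i < x.length; i++) {
      beta[i] += learningRate * error * x[i];  // move beta towards lower log-loss
    }
  }

  public static void main(String[] args) {
    double[] beta = new double[3];
    double[] x = {1.0, 0.5, -1.2};             // element 0 is the constant intercept feature
    sgdStep(beta, x, 1, 0.1);
    System.out.println(java.util.Arrays.toString(beta));
  }
}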


Re: Question on OnlineLogisticRegression.iris() test case

2014-01-06 Thread Frank Scholten
Ah of course. Thanks Ted!

Btw for others who are interested, the online statistical learning class at
Stanford starts in a few weeks:
https://class.stanford.edu/courses/HumanitiesScience/StatLearning/Winter2014/about


On Mon, Jan 6, 2014 at 5:37 PM, Ted Dunning  wrote:

> This is an offset element which allows the model to have an intercept term
> in addition to terms for the predictor variables.
>
>
>
>
> On Mon, Jan 6, 2014 at 8:31 AM, Frank Scholten  >wrote:
>
> > Hi,
> >
> > I am studying the LR / SGD code and I was wondering why in the iris test
> > case the first element of each vector is set to 1 in the loop parsing the
> > CSV file via v.set(0,1)
> >
> > for (String line : raw.subList(1, raw.size())) {
> >   // order gets a list of indexes
> >   order.add(order.size());
> >
> >   // parse the predictor variables
> >   Vector v = new DenseVector(5);
> >   v.set(0, 1);
> >   int i = 1;
> >   Iterable values = onComma.split(line);
> >   for (String value : Iterables.limit(values, 4)) {
> > v.set(i++, Double.parseDouble(value));
> >   }
> >   data.add(v);
> >
> >   // and the target
> >   target.add(dict.intern(Iterables.get(values, 4)));
> > }
> >
> > If I remove the line the accuracy drops to 92% but I don't know why this
> is
> > happening. Where is this first element used throughout the algorithm?
> >
> > Cheers,
> >
> > Frank
> >
>


Question on OnlineLogisticRegression.iris() test case

2014-01-06 Thread Frank Scholten
Hi,

I am studying the LR / SGD code and I was wondering why in the iris test
case the first element of each vector is set to 1 in the loop parsing the
CSV file via v.set(0,1)

for (String line : raw.subList(1, raw.size())) {
  // order gets a list of indexes
  order.add(order.size());

  // parse the predictor variables
  Vector v = new DenseVector(5);
  v.set(0, 1);
  int i = 1;
  Iterable values = onComma.split(line);
  for (String value : Iterables.limit(values, 4)) {
v.set(i++, Double.parseDouble(value));
  }
  data.add(v);

  // and the target
  target.add(dict.intern(Iterables.get(values, 4)));
}

If I remove the line the accuracy drops to 92% but I don't know why this is
happening. Where is this first element used throughout the algorithm?

Cheers,

Frank
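
To make the quoted answer concrete: with v = [1, x1, x2, x3, x4] the model's linear
score is beta0*1 + beta1*x1 + ... + beta4*x4, so beta0 is a bias the optimizer can
learn; remove the leading 1 and the decision boundary is forced through the origin,
which is consistent with the accuracy drop. A tiny sketch, assuming plain
dot-product scoring rather than Mahout's exact classes:

// Hedged illustration (not Mahout code): because v[0] is a constant 1,
// beta[0] contributes the same offset to every example's score, i.e. it is
// the intercept term.
public final class InterceptSketch {

  static double linearScore(double[] beta, double[] v) {
    double score = 0.0;
    for (int i = 0; i < v.length; i++) {
      score += beta[i] * v[i];
    }
    return score;                                 // = beta[0] + beta[1]*x1 + ... when v[0] == 1
  }

  public static void main(String[] args) {
    double[] beta = {0.7, 1.2, -0.4, 0.0, 0.3};   // beta[0] plays the role of the learned bias
    double[] v = {1.0, 5.1, 3.5, 1.4, 0.2};       // iris-style row with the constant first element
    System.out.println(linearScore(beta, v));
  }
}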


Re: general mahout working / some solr questions / last version tests

2012-07-07 Thread Frank Scholten
First make sure you can do a normal build.

It seems you have some local changes to the pom because trunk builds
fine on my machine. Do a clean checkout and run

$ mvn clean install -DskipTests=true

Second, the type of input and output depends on the job you want to run.

If you want to do clustering you run several jobs in sequence. Try the
clustering example on the Reuters news dataset.

Have a look at examples/bin/cluster-reuters.sh, run it and
look at the script to see what kind of jobs it runs.

Frank

On Fri, Jul 6, 2012 at 11:45 AM, Videnova, Svetlana
 wrote:
>
>
> Can someone please answer the following questions for me:
> 1) What is the input of Mahout (an XML file? The output of Solr is what
> interests me!)?
> 2) What is the output of Mahout, I mean after clustering with k-means for
> example (an XML file again?)?
> 3) Where is the output stored?
> 4) Can somebody please give me an example command line on Unix/Ubuntu?
>  I tried this already :
>  $ $MAHOUT_HOME/bin/mahout --input my_file.txt --output output.txt
> Does that make any sense to you?
>
>
> I know that there are some scripts to make Solr and Mahout work together
> and create a connection between the two, but no tutorials on this subject
> (they are either not very clear or too old...). Any ideas, tutorials, forums...?
>
>
>
>
> I'm still using this tutorial: 
> http://cloudblog.8kmiles.com/2012/01/31/apache-mahout-a-clustering-example/
> But with the implemented code from here: 
> http://zoekja.nl/proxy/index.php?q=aHR0cHM6Ly9naXRodWIuY29tL2FwYWNoZS9tYWhvdXQ%3D
> PS: Hadoop is running OK, Java is set up OK
>
> 
> BUILD SUCCESSFUL
> [INFO] 
> 
> [INFO] Total time: 77 minutes 5 seconds
> [INFO] Finished at: Fri Jul 06 10:48:45 CEST 2012
> [INFO] Final Memory: 67M/170M
> 
> :):):):):):):):):):)
> Then thanks to : Sean Owen and his updates on 
> http://zoekja.nl/proxy/index.php?q=aHR0cHM6Ly9naXRodWIuY29tL2FwYWNoZS9tYWhvdXQ%3D
>
>
>
>
>
>
>
>
> -Message d'origine-
> De : Videnova, Svetlana [mailto:svetlana.viden...@logica.com]
> Envoyé : vendredi 6 juillet 2012 09:44
> À : d...@mahout.apache.org
> Objet : RE: train mahout ex
>
> Hi guys.
>
> I join pom.xml for mahout-distribution-0.7.
> Im following this tutorial: 
> http://cloudblog.8kmiles.com/2012/01/31/apache-mahout-a-clustering-example/
>
> I still have errors when I execute this step: user1@ubuntu-server:~$ mvn 
> clean install
>
> I can't understand what's wrong with the pom.xml. This is the output:
>
>
>
> 
>
> /usr/local/mahout-distribution-0.7$ mvn clean install [INFO] Scanning for 
> projects...
> [INFO] 
> 
> [ERROR] FATAL ERROR
> [INFO] 
> 
> [INFO] Error building POM (may not be this project's POM).
>
>
> Project ID: unknown
> POM Location: /usr/local/mahout-distribution-0.7/pom.xml
>
> Reason: Parse error reading POM. Reason: Unrecognised tag: 'relativePath' 
> (position: START_TAG seen ...\r\n  ... @24:17)  for 
> project unknown at /usr/local/mahout-distribution-0.7/pom.xml
>
>
> [INFO] 
> 
> [INFO] Trace
> org.apache.maven.reactor.MavenExecutionException: Parse error reading POM. 
> Reason: Unrecognised tag: 'relativePath' (position: START_TAG seen 
> ...\r\n  ... @24:17)  for project unknown at 
> /usr/local/mahout-distribution-0.7/pom.xml
> at org.apache.maven.DefaultMaven.getProjects(DefaultMaven.java:404)
> at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:272)
> at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:138)
> at org.apache.maven.cli.MavenCli.main(MavenCli.java:362)
> at 
> org.apache.maven.cli.compat.CompatibleMain.main(CompatibleMain.java:60)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.codehaus.classworlds.Launcher.launchEnhanced(Launcher.java:315)
> at org.codehaus.classworlds.Launcher.launch(Launcher.java:255)
> at 
> org.codehaus.classworlds.Launcher.mainWithExitCode(Launcher.java:430)
> at org.codehaus.classworlds.Launcher.main(Launcher.java:375)
> Caused by: org.apache.maven.project.InvalidProjectModelException: Parse error 
> reading POM. Reason: Unrecognised tag: 'relativePath' (position: START_TAG 
> seen ...\r\n  ... @24:17)  for project unknown at 
> /usr/local/mahout-distribution-0.7/pom.xml
> at 
> org.apache.maven.project.DefaultMavenProjectBuilder.readModel(DefaultMavenProjectBuilder.java:1610)
> at 
> org.

Re: Collusion detection in online bridge

2012-04-28 Thread Frank Scholten
I have a question about computing the loglikelihood scores for this problem.

In bridge, deals are reused inside a tournament.

I can see how to figure out which players play more against a specific
partner than others. In this case N equals the number of deals, k11
from the loglikelihood contingency table equals the number of deals
played by players A and B, k12 deals played by A but not by B, and so
on.

What I really want is to figure out which players have a lot of wins
from deals that were played by others at the same time or in the past.
The reasoning is that players who have wins only when someone else has
played this deal before are suspect.

However, how do I work this temporal aspect, 'number of won
deals which were played before by player X', into the loglikelihood
counts? It seems I have several subsets, like wins and losses, wins
before a certain time and so on.

I am not sure how to work these factors into a loglikelihood ratio
test. Perhaps there is a different, more suitable method for this type
of problem?

Cheers,

Frank
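
For the non-temporal part of the question, a hedged sketch of the plain 2x2
contingency-table test, using the LogLikelihood helper that ships with mahout-math.
The counts below are made up, and the temporal 'played before by player X'
refinement is exactly the part this sketch does not cover:

import org.apache.mahout.math.stats.LogLikelihood;

// Hedged sketch: 2x2 contingency-table LLR for one pair of players (A, B).
//   k11 = deals played by both A and B
//   k12 = deals played by A but not B
//   k21 = deals played by B but not A
//   k22 = deals played by neither (N - k11 - k12 - k21)
public final class PairAnomalySketch {
  public static void main(String[] args) {
    long n = 10000;                              // hypothetical total number of deals
    long k11 = 120, k12 = 80, k21 = 60;          // hypothetical counts for the pair
    long k22 = n - k11 - k12 - k21;
    double llr = LogLikelihood.logLikelihoodRatio(k11, k12, k21, k22);
    System.out.println("LLR for pair (A, B): " + llr);
  }
}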

On Tue, Apr 24, 2012 at 7:32 PM, Frank Scholten  wrote:
> On Tue, Apr 24, 2012 at 5:20 PM, Sean Owen  wrote:
>> OK, this may yet just be an application of statistics.
>>
>> I assume that my skill in bridge is a relatively fixed quantity, and
>> my score in a game is probably a function of the skill of me and my
>> partner, and of our opponents' skill. I don't know how IMPs work, but
>> assume you can establish some "expected" change in score given these
>> two inputs (average skill of my team, their team). Actual changes
>> ought to be normally distributed around that expectation. You look for
>> pairs whose actual change is highly unlikely (too high) given this,
>> like +3 standard deviations above expectation.
>
> That seems like a good approach. Thanks!
>
> Cheers,
>
> Frank
>
>>
>> How's that?
>>
>> On Tue, Apr 24, 2012 at 3:13 PM, Frank Scholten  
>> wrote:
>>> Interesting. However, winning in bridge is not a boolean event, each
>>> deal gives a number of IMPs, International Match Points, to each
>>> player which can be positive and negative. The sum of IMPs of each
>>> deal is always zero.
>>


Re: Collusion detection in online bridge

2012-04-24 Thread Frank Scholten
On Tue, Apr 24, 2012 at 5:20 PM, Sean Owen  wrote:
> OK, this may yet just be an application of statistics.
>
> I assume that my skill in bridge is a relatively fixed quantity, and
> my score in a game is probably a function of the skill of me and my
> partner, and of our opponents' skill. I don't know how IMPs work, but
> assume you can establish some "expected" change in score given these
> two inputs (average skill of my team, their team). Actual changes
> ought to be normally distributed around that expectation. You look for
> pairs whose actual change is highly unlikely (too high) given this,
> like +3 standard deviations above expectation.

That seems like a good approach. Thanks!

Cheers,

Frank

>
> How's that?
>
> On Tue, Apr 24, 2012 at 3:13 PM, Frank Scholten  
> wrote:
>> Interesting. However, winning in bridge is not a boolean event, each
>> deal gives a number of IMPs, International Match Points, to each
>> player which can be positive and negative. The sum of IMPs of each
>> deal is always zero.
>


Re: Collusion detection in online bridge

2012-04-24 Thread Frank Scholten
On Tue, Apr 24, 2012 at 1:18 PM, Sean Owen  wrote:
> Define collusion? You could think of it as playing games together a
> lot, but, friends do that too.
>
> I don't think this is a recommender problem. At best you're describing
> finding unusual similarities, which is something simpler.
> Log-likelihood would probably better normalize the result.
>
> It's probably something more like winning an unusual number of games
> when playing with some other user, right? How about a log-likelihood
> test on win rate when playing with player X vs overall win rate?
>

Interesting. However, winning in bridge is not a boolean event, each
deal gives a number of IMPs, International Match Points, to each
player which can be positive and negative. The sum of IMPs of each
deal is always zero.

> You can translate the example on Wikipedia involving fair coins pretty
> directly to this case:
> http://en.wikipedia.org/wiki/Likelihood-ratio_test
>
> On Tue, Apr 24, 2012 at 11:55 AM, Frank Scholten  
> wrote:
>> Hi all,
>>
>> I am working on a collusion detection system for online bridge.
>>
>> My plan was to use a user-based recommender using TanimotoCoefficient
>> for looking up users that have played many games together as a
>> starting point. I want to use this score as well as other features and
>> feed this into an SGD classifier. Later on we can look into actual gameplay
>> features but that's an advanced topic.
>>
>> How should I model the training data? Should a training sample contain
>> a user pair and features related to this pair of users, such as the
>> tanimoto score or do I create a vector for a single user and create
>> complex features with information about user interactions?
>>
>> If you have any other suggestions I would love to hear them.
>>
>> Cheers,
>>
>> Frank
>


Collusion detection in online bridge

2012-04-24 Thread Frank Scholten
Hi all,

I am working on a collusion detection system for online bridge.

My plan was to use a user-based recommender using TanimotoCoefficient
for looking up users that have played many games together as a
starting point. I want to use this score as well as other features and
feed this into an SGD classifier. Later on we can look into actual gameplay
features but that's an advanced topic.

How should I model the training data? Should a training sample contain
a user pair and features related to this pair of users, such as the
tanimoto score or do I create a vector for a single user and create
complex features with information about user interactions?

If you have any other suggestions I would love to hear them.

Cheers,

Frank


Re: Pre-configured Mahout on the cloud

2012-04-03 Thread Frank Scholten
An alternative is to use Apache Whirr to quickly set up a Hadoop
cluster on AWS and install the Mahout binary distribution on one of
the nodes.

Check out http://whirr.apache.org/ and
http://www.searchworkings.org/blog/-/blogs/apache-whirr-includes-mahout-support
for the mahout-client role

Frank

On Tue, Apr 3, 2012 at 11:39 AM, Sean Owen  wrote:
> This is lightly covered in Mahout in Action but yes there is really little
> more to know. You upload the job jar and run it like anything else in AWS.
> On Apr 3, 2012 10:24 AM, "Sebastian Schelter"  wrote:
>
>> None that I'm aware of. But it's super easy to use Mahout in EMR: You need
>> to upload your data and Mahout's job-jar file to Amazon S3. After that
>> you can simply start a Hadoop job in EMR that makes use of Mahout,
>> just as you would use it on the command line with 'hadoop jar'
>>
>> Best,
>> Sebastian
>>
>> On 03.04.2012 11:20, Yuval Feinstein wrote:
>> > Hi.
>> > I heard about Amazon's Elastic Map Reduce (
>> > http://aws.amazon.com/elasticmapreduce/)
>> > which provides pre-configured Hadoop servers over the cloud.
>> > Does there exist any service providing Mahout over a similar
>> infrastructure?
>> > i.e a cloud server providing either a stand-alone or a distributed Mahout
>> > service where one can upload data files and run Mahout algorithms?
>> > TIA,
>> > Yuval
>> >
>>
>>


Re: Mahout Hosting Provider

2012-02-17 Thread Frank Scholten
Check out 
http://www.searchworkings.org/blog/-/blogs/apache-whirr-includes-mahout-support
to set up Mahout and Hadoop on Amazon AWS.

You can then SSH into the cluster and submit jobs from the command line.

Frank

On Thu, Feb 16, 2012 at 9:30 AM, VIGNESH PRAJAPATI
 wrote:
> Hi Folks,
>
>  I am new to Mahout. I want to know whether there is any hosting
> provider for Apache Mahout other than Amazon?
>
> --
>
> *Vignesh Prajapati*
> Tel: 9427415949 |
> vignesh2...@gmail.com | www.vipras.com.co.in


Re: Mahout 0.5 java.lang.IllegalStateException: No clusters found. Check your -c path.

2012-02-15 Thread Frank Scholten
You must either specify -k  to have kmeans randomly pick k
initial clusters from the input vectors or use -c to point to a
directory of initial clusters, generated by canopy for example.

2012/2/15 Qiang Xu :
>
> Note, this problem only happens on a Hadoop cluster. In Mahout standalone mode 
> there is no such problem.
>
>> From: xxqonl...@hotmail.com
>> To: user@mahout.apache.org
>> Subject: RE: Mahout 0.5 java.lang.IllegalStateException: No clusters found. 
>> Check your -c path.
>> Date: Wed, 15 Feb 2012 12:22:26 +0800
>>
>>
>> I have seen there is such a problem in the main thread
>> http://lucene.472066.n3.nabble.com/jira-Created-MAHOUT-504-Kmeans-clustering-error-td1531052.html
>> and
>> https://issues.apache.org/jira/browse/MAHOUT-504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#issue-tabs
>>
>> But my step is following official guide.
>> https://cwiki.apache.org/MAHOUT/k-means-clustering.html
>>
>> Could you point out what I should do correctly?
>> I have tried
>> ./bin/mahout kmeans -i
>>  examples/bin/work/reuters-out-seqdir-sparse/tfidf-vectors/ -c
>>  examples/bin/work/clusters -o  examples/bin/work/reuters-kmeans -x 10
>>  -ow
>> ./bin/mahout kmeans -i
>>  examples/bin/work/reuters-out-seqdir-sparse/tfidf-vectors/ -c
>>  examples/bin/work/clusters -o  examples/bin/work/reuters-kmeans -x 10
>>  -ow -cl
>> ./bin/mahout kmeans -i
>>  examples/bin/work/reuters-out-seqdir-sparse/tfidf-vectors/ -c
>>  examples/bin/work/clusters -o  examples/bin/work/reuters-kmeans -x 10
>> -k 0 -ow
>> ./bin/mahout kmeans -i
>>  examples/bin/work/reuters-out-seqdir-sparse/tfidf-vectors/ -c
>>  examples/bin/work/clusters -o  examples/bin/work/reuters-kmeans -x 10
>> -k 20 -ow
>> > Date: Tue, 14 Feb 2012 19:39:59 -0800
>> > Subject: Re: Mahout 0.5 java.lang.IllegalStateException: No clusters 
>> > found. Check your -c path.
>> > From: goks...@gmail.com
>> > To: user@mahout.apache.org
>> >
>> > See the other mail thread for the MAHOUT-504 JIRA. That jira is closed
>> > and fixed.
>> > The problem is that the program needs one of a few different
>> > combinations of arguments. It does not give you an error message
>> > describing the problem.
>> >
>> > On Tue, Feb 14, 2012 at 6:59 PM, Qiang Xu  wrote:
>> > >
>> > > The new test is using command  ./bin/mahout kmeans -i  
>> > > examples/bin/work/reuters-out-seqdir-sparse/tfidf-vectors/ -c  
>> > > examples/bin/work/clusters -o  examples/bin/work/reuters-kmeans -x 10  
>> > > -ow -cl
>> > > Still the same problem.
>> > >
>> > >> From: xxqonl...@hotmail.com
>> > >> To: user@mahout.apache.org
>> > >> Subject: RE: Mahout 0.5 java.lang.IllegalStateException: No clusters 
>> > >> found. Check your -c path.
>> > >> Date: Wed, 15 Feb 2012 10:58:25 +0800
>> > >>
>> > >>
>> > >> I have checked the command line:
>> > >> --clustering (-cl)                           If present, run clustering 
>> > >> after
>> > >>                                                the iterations have 
>> > >> taken place
>> > >> And try it, it seems the same behavior, could you give me more clue?
>> > >> op_cluster/hadoop-0.20.2/
>> > >> HADOOP_CONF_DIR=/data/hadoop_cluster/hadoop-0.20.2/conf/
>> > >> 12/02/15 11:16:23 INFO common.AbstractJob: Command line arguments: 
>> > >> {--clustering=null, --clusters=examples/bin/work/clusters, 
>> > >> --convergenceDelta=0.5, 
>> > >> --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure,
>> > >>  --endPhase=2147483647, 
>> > >> --input=examples/bin/work/reuters-out-seqdir-sparse/tfidf-vectors/, 
>> > >> --maxIter=10, --method=mapreduce, 
>> > >> --output=examples/bin/work/reuters-kmeans, --overwrite=null, 
>> > >> --startPhase=0, --tempDir=temp}
>> > >> 12/02/15 11:16:23 INFO common.HadoopUtil: Deleting 
>> > >> examples/bin/work/reuters-kmeans
>> > >> 12/02/15 11:16:23 INFO kmeans.KMeansDriver: Input: 
>> > >> examples/bin/work/reuters-out-seqdir-sparse/tfidf-vectors Clusters In: 
>> > >> examples/bin/work/clusters Out: examples/bin/work/reuters-kmeans 
>> > >> Distance: 
>> > >> org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure
>> > >> 12/02/15 11:16:23 INFO kmeans.KMeansDriver: convergence: 0.5 max 
>> > >> Iterations: 10 num Reduce Tasks: org.apache.mahout.math.VectorWritable 
>> > >> Input Vectors: {}
>> > >> 12/02/15 11:16:23 INFO kmeans.KMeansDriver: K-Means Iteration 1
>> > >> 12/02/15 11:16:24 INFO input.FileInputFormat: Total input paths to 
>> > >> process : 1
>> > >> 12/02/15 11:16:24 INFO mapred.JobClient: Running job: 
>> > >> job_201202131515_0126
>> > >> 12/02/15 11:16:25 INFO mapred.JobClient:  map 0% reduce 0%
>> > >> 12/02/15 11:16:38 INFO mapred.JobClient: Task Id : 
>> > >> attempt_201202131515_0126_m_00_0, Status : FAILED
>> > >> java.lang.IllegalStateException: No clusters found. Check your -c path.
>> > >>         at 
>> > >> org.apache.mahout.clustering.kmeans.KMeansMapper.setup(KMeansMapper.java:60)
>> > >>         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
>> > >>

Re: only single cluster per document

2012-02-06 Thread Frank Scholten
Hi Lokesh,

Could you provide more details on the commands you are running, including 
parameters?

If you use seqdirectory on one CSV file it will generate one vector and then 
you end up with one cluster.

On Feb 6, 2012, at 14:55, Lokesh  wrote:

> hi,
>   I am new to mahout kmeans clustering when i run kmeans clustering i
> get only one cluster if one csv file of any size given can anyone help me to
> know whether this is correct or not.
> 
> Thanks in advance
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/only-single-cluster-per-document-tp3719668p3719668.html
> Sent from the Mahout User List mailing list archive at Nabble.com.
> 


Re: How to present mahout cluster in combination with Solr results

2012-02-02 Thread Frank Scholten
Check out the recent mailing list post 'Clustering user profiles'.

Jeff (Eastman) sums it up clearly.

> Mahout clustering (unsupervised classification) can only deal with 
> continuous, homogeneous vector representations of the input data, where each 
> vector element is weighted the same as the other elements. Mahout
> (supervised) classification can deal with continuous, categorical, word-like 
> and text-like features such as in your problem space.

> To address your problem with Mahout clustering, you would need to develop a 
> mapping for each of your features to continuous vector elements and use a 
> WeightedDistanceMeasure to account for the different element > types and 
> their relative impacts on the overall distance computation. This would be an 
> iterative process which might or might not produce useful results.

> An alternative approach would be to train a Mahout classifier with the 
> various features using marked training data which classifies similar users 
> into a finite number of "clusters" that seem natural to you. With such a
> model, you could then classify new users into those "clusters". This approach 
> would not be very useful for discovering new "clusters" in your data, but it 
> would leverage the classifier training mechanisms to develop the > models as 
> more of a black box than above.

A question also to other people reading this: I looked into this and saw
that there are clustering algorithms for categorical data such as
K-modes. Are these effective for solving these kinds of problems? If so,
would they be interesting to add to Mahout?

Cheers,

Frank
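
As a hedged illustration of the "mapping for each of your features to continuous
vector elements" that Jeff describes above, one simple option is to one-hot encode
each Low/Medium/High level. The field and level names below are made up, and a
WeightedDistanceMeasure could then weight the resulting elements differently:

import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

// Hedged sketch: one-hot encode three categorical risk-level fields into a
// 9-element numeric vector that k-means (or a WeightedDistanceMeasure) can use.
public final class RiskLevelEncoder {

  private static final String[] LEVELS = {"Low", "Medium", "High"};

  static Vector encode(String... riskLevels) {
    Vector v = new DenseVector(riskLevels.length * LEVELS.length);
    for (int field = 0; field < riskLevels.length; field++) {
      for (int level = 0; level < LEVELS.length; level++) {
        if (LEVELS[level].equalsIgnoreCase(riskLevels[field])) {
          v.set(field * LEVELS.length + level, 1.0);   // 1 in the slot for this field's level
        }
      }
    }
    return v;
  }

  public static void main(String[] args) {
    System.out.println(encode("High", "High", "Low"));  // ones at indexes 2, 5 and 6
  }
}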

On Thu, Feb 2, 2012 at 12:38 PM, Vikas Pandya  wrote:
> Frank. Thanks.
>>>In your case you want to cluster items that have several risk levels
>>> as well as other properties. You have to use your original numerical
>>> data, (I assume probabilities) in a clustering algorithm, not the
>>> labels like low, medium, high. How were these labels assigned?
>
>
> RiskLevel1, RiskLevel2 and RiskLevel3 all have actual lookup values (High, 
> Medium, Low etc.) in the Solr index (the index is stored flattened)
>
> -Vikas
>
>
> 
>  From: Frank Scholten 
> To: user@mahout.apache.org
> Sent: Wednesday, February 1, 2012 3:28 AM
> Subject: Re: How to present mahout cluster in combination with Solr results
>
> Vikas,
>
> Please send messages to the mailinglist so everyone can benefit.
>
>> Frank,
>>
>> To give further details about the usecase.
>>
>> 1)User searches for a free text, this search is served from Solr.
>> 2)User selects a record from the search result, subsequently need to display 
>> all the items where RiskLevels of the items match the values of Risk Levels 
>> of a selected item from search result (and put them under "Similar items" in 
>> UI).
>>
>> upon indexing I am copying RiskLevel1, RiskLevel2,RiskLevel3 into a single 
>> field (solr copyField). Vector is created against that field for Mahout to 
>> create clusters on it. Now the issue is (understandably) when clusters are 
>> created it will find distance between words and its very much possible that 
>> following three records get clustered into a single cluster.
>> RiskLevel1, RiskLevel2, RiskLevel3
>> High             High       Low
>> High             High             High
>> High             High         Medium
>
> Just to make sure, in my presentation I talk about using text
> clustering for document tagging. The documents are vectorized and
> weighted with TF/IDF and are fed into a Mahout clustering algorithm.
>
> In your case you want to cluster items that have several risk levels
> as well as other properties. You have to use your original numerical
> data, (I assume probabilities) in a clustering algorithm, not the
> labels like low, medium, high. How were these labels assigned?
>
>>
>> But clustering on these metadata columns, requirement is to cluster as below 
>> (sequence of the values DO matter)
>>
>> Cluster1:
>> RiskLevel1, RiskLevel2,RiskLevel3
>> High             High           Low
>> High             High           Low
>>
>> Cluster2:
>> RiskLevel1, RiskLevel2,RiskLevel3
>> High            High           High
>> High            High           High
>>
>> Cluster3:
>> RiskLevel1, RiskLevel2,RiskLevel3
>> High            High           Medium
>> High            High            Medium
>>
>> I started thinking about using classification over clustering? but while 
>> playing with Weka (http://www.cs.waikato.ac.nz/ml/weka/ ) Swing based GUI 
>> tool where one can e

Re: How to present mahout cluster in combination with Solr results

2012-02-01 Thread Frank Scholten
Vikas,

Please send messages to the mailinglist so everyone can benefit.

> Frank,
>
> To give further details about the usecase.
>
> 1)User searches for a free text, this search is served from Solr.
> 2)User selects a record from the search result, subsequently need to display 
> all the items where RiskLevels of the items match the values of Risk Levels 
> of a selected item from search result (and put them under "Similar items" in 
> UI).
>
> upon indexing I am copying RiskLevel1, RiskLevel2,RiskLevel3 into a single 
> field (solr copyField). Vector is created against that field for Mahout to 
> create clusters on it. Now the issue is (understandably) when clusters are 
> created it will find the distance between words and it's very much possible that 
> the following three records get clustered into a single cluster.
> RiskLevel1, RiskLevel2, RiskLevel3
> High        High        Low
> High        High        High
> High        High        Medium

Just to make sure, in my presentation I talk about using text
clustering for document tagging. The documents are vectorized and
weighted with TF/IDF and are fed into a Mahout clustering algorithm.

In your case you want to cluster items that have several risk levels
as well as other properties. You have to use your original numerical
data, (I assume probabilities) in a clustering algorithm, not the
labels like low, medium, high. How were these labels assigned?

>
> But clustering on these metadata columns, the requirement is to cluster as below 
> (the sequence of the values DOES matter)
>
> Cluster1:
> RiskLevel1, RiskLevel2, RiskLevel3
> High        High        Low
> High        High        Low
>
> Cluster2:
> RiskLevel1, RiskLevel2, RiskLevel3
> High        High        High
> High        High        High
>
> Cluster3:
> RiskLevel1, RiskLevel2, RiskLevel3
> High        High        Medium
> High        High        Medium
>
> I started thinking about using classification over clustering? but while 
> playing with Weka (http://www.cs.waikato.ac.nz/ml/weka/ ) Swing based GUI 
> tool where one can easily play around with different algorithms from UI 
> directly, I found DBScan clustering did cluster results correctly per my 
> requirements, to be precise it created three different clusters (if you pick 
> above mentioned example).
>
> can clustering be done the way I need it to work in Mahout? or any other 
> ideas that can be explore further?
>
> Thanks,

On Fri, Jan 20, 2012 at 6:48 PM, Frank Scholten  wrote:
> On Fri, Jan 20, 2012 at 4:01 PM, Vikas Pandya  wrote:
>> From the example below, solr search results should be clustered in some
>> following way
>> list all the items which have matching RiskLevels e.g.
>>
>>
>> Cluster 1:
>> Title          RiskLevel1          RiskLevel2         RiskLevel3
>> abc            High                     Medium             Low
>> xyz            High                      Medium            High
>> def            Low                        Medium           High
>>
>> Cluster 2:
>> Title          RiskLevel1          RiskLevel2         RiskLevel3
>> omn            Low                     Medium             Low
>> yui            Low                      Medium            High
>> bnm            Medium             Medium           High
>>
>> Though I have a feeling I don't need to use Mahout clustering for this, I am
>> still trying to hook in mahout for this since we have more clustering
>> requirements in the pipeline to cluster based on other features (attributes
>> of objects).
>>
>
> You only have 27 unique risk-level combinations. You could just sort by one
> or more risk levels to get a sense of the data.
>
> If you have more attributes then you could indeed look into clustering.
>
> Cheers,
>
> Frank
>
>> Any thoughts?
>>
>> 
>> From: Vikas Pandya 
>> To: Frank Scholten ; "user@mahout.apache.org"
>> 
>> Sent: Thursday, January 19, 2012 11:05 AM
>>
>> Subject: Re: How to present mahout cluster in combination with Solr results
>>
>> Hi Frank,
>>
>> Thanks for the link. That was useful. It's still a bit unclear how he built
>> his index. Are we saying we index clusterId, clusterSize and clusterLabel
>> in the same index (where the other data is indexed)? So one index will have two
>> sets of Solr documents in it? One containing cluster info?
>>
>> My requirement again; I have bunch of db columns which are being indexed.
>> e.g.
>> Title,             RiskLevel1, RiskLevel2,RiskLevel3 etc
>> Ti

FOSDEM 2012 Brussels 4/5 february

2012-01-22 Thread Frank Scholten
Hi all,

I will be visiting FOSDEM in Brussels 4/5 february.

Anybody from this group planning to go there? Would be cool to meet a
few of you there!

I think the graph processing devroom and the virtualization and cloud
devroom will be interesting.

See http://fosdem.org/2012/ and of course the beer event :-)
http://fosdem.org/2012/beerevent

Cheers,

Frank


Re: How to present mahout cluster in combination with Solr results

2012-01-20 Thread Frank Scholten
On Fri, Jan 20, 2012 at 4:01 PM, Vikas Pandya  wrote:
> From the example below, solr search results should be clustered in some
> following way
> list all the items which have matching RiskLevels e.g.
>
>
> Cluster 1:
> Title          RiskLevel1          RiskLevel2         RiskLevel3
> abc            High                     Medium             Low
> xyz            High                      Medium            High
> def            Low                        Medium           High
>
> Cluster 2:
> Title          RiskLevel1          RiskLevel2         RiskLevel3
> omn            Low                     Medium             Low
> yui            Low                      Medium            High
> bnm            Medium             Medium           High
>
> Though I have a feeling I don't need to use Mahout clustering for this, I am
> still trying to hook in mahout for this since we have more clustering
> requirements in the pipeline to cluster based on other features (attributes
> of objects).
>

You only have 27 unique risk-level combinations. You could just sort by one
or more risk levels to get a sense of the data.

If you have more attributes then you could indeed look into clustering.

Cheers,

Frank

> Any thoughts?
>
> 
> From: Vikas Pandya 
> To: Frank Scholten ; "user@mahout.apache.org"
> 
> Sent: Thursday, January 19, 2012 11:05 AM
>
> Subject: Re: How to present mahout cluster in combination with Solr results
>
> Hi Frank,
>
> Thanks for the link. That was useful. It's still a bit unclear how he built
> his index. Are we saying we index clusterId, clusterSize and clusterLabel
> in the same index (where the other data is indexed)? So one index will have two
> sets of Solr documents in it? One containing cluster info?
>
> My requirement again; I have bunch of db columns which are being indexed.
> e.g.
> Title,             RiskLevel1, RiskLevel2,RiskLevel3 etc
> Title1        High             Medium      Low
>
> Current requirement is to cluster documents based on their riskLevels and
> NOT the title.
>
> Thanks,
>
>
> 
> From: Frank Scholten 
> To: user@mahout.apache.org; Vikas Pandya 
> Sent: Thursday, January 19, 2012 4:24 AM
> Subject: Re: How to present mahout cluster in combination with Solr results
>
> Hi Vikas,
>
> I suggest indexing the cluster label, cluster size and
> cluster-document mappings so you can use that information to build a
> tag cloud of your data. Check out this presentation:
> http://java.dzone.com/videos/configuring-mahout-clustering
>
> Cheers,
>
> Frank
>
> On Thu, Jan 19, 2012 at 4:18 AM, Vikas Pandya  wrote:
>> Hello,
>>
>> I have successfully created vectors from reading my existing Solr Index.
>> Then created sequenceFile and mahout clusters from it. As I understand that
>> currently solr and mahout clustering aren't integrated, what's the best way
>> to represent mahout clusters to the user? Mine is a search application which
>> renders results by querying solr index. Now I need to incorporate Mahout
>> created clusters in the result. While Solr-Mahout integration isn't there
>> yet, what's the best alternative way to represent this info?
>>
>> Thanks,
>


Re: How to present mahout cluster in combination with Solr results

2012-01-19 Thread Frank Scholten
Hi Vikas,

I suggest indexing the cluster label, cluster size and
cluster-document mappings so you can use that information to build a
tag cloud of your data. Check out this presentation:
http://java.dzone.com/videos/configuring-mahout-clustering

Cheers,

Frank
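
A hedged sketch of what that indexing step could look like with SolrJ; the field
names are made up and would have to exist in your schema, and wiring in the actual
cluster output from Mahout is left out:

import org.apache.solr.common.SolrInputDocument;

// Hedged sketch: for each document, store the id of the cluster it was assigned
// to plus a human-readable label and the cluster size, so the search UI can
// facet on clusterLabel and build a tag cloud. Field names are hypothetical.
public final class ClusterFieldsSketch {

  static SolrInputDocument withClusterInfo(String docId, int clusterId,
                                           String clusterLabel, long clusterSize) {
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", docId);
    doc.addField("clusterId", clusterId);
    doc.addField("clusterLabel", clusterLabel);
    doc.addField("clusterSize", clusterSize);
    return doc;
  }

  public static void main(String[] args) {
    System.out.println(withClusterInfo("doc-42", 7, "apache mahout", 123));
  }
}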

On Thu, Jan 19, 2012 at 4:18 AM, Vikas Pandya  wrote:
> Hello,
>
> I have successfully created vectors from reading my existing Solr Index. Then 
> created sequenceFile and mahout clusters from it. As I understand that 
> currently solr and mahout clustering aren't integrated, what's the best way 
> to represent mahout clusters to the user? Mine is a search application which 
> renders results by querying solr index. Now I need to incorporate Mahout 
> created clusters in the result. While Solr-Mahout integration isn't there 
> yet, what's the best alternative way to represent this info?
>
> Thanks,


[ANNOUNCE] Apache Whirr 0.7.0 includes Mahout support

2011-12-22 Thread Frank Scholten
Hi all,

Apache Whirr 0.7.0, which was released yesterday, includes Mahout
support. You can install the Mahout binary distribution via the
'mahout-client' role.

For more details see the following blog:
http://www.searchworkings.org/blog/-/blogs/apache-whirr-includes-mahout-support

Cheers,

Frank


Re: New User to Mahout

2011-11-12 Thread Frank Scholten
Hi Sachin,

Most Mahout jobs have several overloaded run methods. For example:

KMeansDriver.run(configuration, input, clustersIn, output, measure,
convergenceDelta,  maxIterations, runClustering,  runSequential)

Also, most of them extend AbstractJob and implement Hadoop's Tool
interface, so you can use Hadoop's ToolRunner and create an array with
the arguments you would specify on the command line.

String[] kmeansArgs = new String[] {
  "--input", inputPath,
  "--output", outputPath,
  "--numClusters", numClusters,
  // More arguments
};

ToolRunner.run(configuration, new KMeansDriver(), kmeansArgs);

See 
https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/common/AbstractJob.html

Frank

On Sat, Nov 12, 2011 at 12:55 AM, Ramon Wang  wrote:
> Try to read Mahout in Action.
>
> Sent from iPhone
>
>
>
>> Hello everyone,
>>
>> I am new to Mahout and want to start working with it. However, on the Mahout
>> website I could not find nice Java coding examples. I can see some examples
>> which we can run using the command line. However, I feel that just running the
>> command line will limit the usability of Mahout.
>>
>> I want to understand it fully and want the coding to be done in Java. If anyone
>> can help me with some example code that uses Hadoop, that would be really
>> helpful.
>>
>>
>> Thanks
>> Sachin
>>
>> --
>> View this message in context: 
>> http://lucene.472066.n3.nabble.com/New-User-to-Mahout-tp3501316p3501316.html
>> Sent from the Mahout User List mailing list archive at Nabble.com.
>


Cluster labeling

2011-11-08 Thread Frank Scholten
Hi all,

Sometimes my cluster labels are terms that hardly occur in the
combined text of the documents of a cluster. I would expect to see a
label of a term that occurs very frequently across documents of the
cluster.

For example, suppose there is a cluster of tweets about Mahout. You
would see a lot of occurrences of 'Apache Mahout' in every document.
Maybe a few documents have the term 'License' in them. You could end
up with a 'License' label instead of 'Apache Mahout'.

I think this happens when Mahout sorts the cluster centroid by TF-IDF
weight in descending order and fetches the correlated terms. So the
'License' label will be chosen because it has a high TF-IDF even
though it has a low cluster frequency.

Thoughts?

Cheers,

Frank
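
A rough sketch of the alternative suggested above: pick the term that occurs in the
largest fraction of the cluster's documents (in-cluster document frequency) instead
of the top TF-IDF element of the centroid. This is illustrative only, with documents
modelled as plain token lists, not what Mahout's cluster dumper currently does:

import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hedged sketch: label a cluster by the term with the highest in-cluster
// document frequency, counting each term at most once per document.
public final class ClusterLabelSketch {

  static String labelByClusterFrequency(List<List<String>> clusterDocs) {
    Map<String, Integer> docFreq = new HashMap<String, Integer>();
    for (List<String> doc : clusterDocs) {
      for (String term : new HashSet<String>(doc)) {     // once per document
        Integer count = docFreq.get(term);
        docFreq.put(term, count == null ? 1 : count + 1);
      }
    }
    String best = null;
    int bestCount = -1;
    for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
      if (e.getValue() > bestCount) {
        best = e.getKey();
        bestCount = e.getValue();
      }
    }
    return best;
  }

  public static void main(String[] args) {
    List<List<String>> docs = java.util.Arrays.asList(
        java.util.Arrays.asList("apache", "mahout", "clustering"),
        java.util.Arrays.asList("apache", "mahout", "license"),
        java.util.Arrays.asList("apache", "mahout", "recommender"));
    System.out.println(labelByClusterFrequency(docs));   // apache or mahout, never license
  }
}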


[Announcement] SearchWorkings.org is live!

2011-09-12 Thread Frank Scholten
Hi all,

This is an announcement of the community site SearchWorkings.org [1]

SearchWorkings.org offers search professionals a point of contact or a
comprehensive resource to learn about and discuss all the
new developments in the world of open source search and related
subjects like Mahout and Hadoop.

The site is created by a group of search professionals from the
Lucene & Solr community and I am involved in it
to cover topics related to Mahout and Hadoop. The initial focus is on Lucene &
Solr, Mahout and Hadoop but it aims to be much broader.

Like any other community website, content will be added on a regular
basis and community members can contribute too.

Right now, you have access to an extensive resource centre offering
online tutorials, downloads, white papers and access to a host of
search specialists in the forum.
In addition you can post blog items and keep up to date with relevant
news.

We look forward to more and more blogs, articles and tutorials, real
case-studies or 3rd party extensions for OSS Search components.

You are more than welcome to contribute and tell your story about
using these technologies.

Have fun,

Frank

[1] http://www.searchworkings.org
[2] Trademark Acknowledgement: Apache Lucene, Apache Solr and Apache
Mahout and respective logos are trademarks of The Apache
Software Foundation. All other marks mentioned may be trademarks or
registered trademarks of their respective owners.


Re: Doubt regarding the kmeans clustering results on mahout

2011-08-01 Thread Frank Scholten
Ah, I meant maybe seq2sparse could produce NamedVectors by default.
There was a discussion on that some time ago on
https://issues.apache.org/jira/browse/MAHOUT-401

On Mon, Aug 1, 2011 at 6:20 PM, Jeff Eastman  wrote:
> It's really not possible for the clustering to produce NamedVectors but you 
> are free to send it points which are named. Those points will pass through 
> the clustering process and be available in the output.
>
> -Original Message-
> From: Frank Scholten [mailto:fr...@frankscholten.nl]
> Sent: Saturday, July 30, 2011 4:21 AM
> To: user@mahout.apache.org
> Subject: Re: Doubt regarding the kmeans clustering results on mahout
>
> Maybe it should produce NamedVectors by default as well. This is
> another of those optional settings
> that is often needed in practice.
>
> On Fri, Jul 29, 2011 at 11:42 PM, Jeff Eastman  wrote:
>> No problem. I really think the default needs to be changed anyway. Perhaps 
>> this will get me to do it.
>>
>> -Original Message-
>> From: Abhik Banerjee [mailto:banerjee.abhik@gmail.com]
>> Sent: Friday, July 29, 2011 1:48 PM
>> To: user@mahout.apache.org
>> Subject: Re: Doubt regarding the kmeans clustering results on mahout
>>
>> Thanks a lot , I missed that part in the wiki , My mistake.
>>
>>
>


Re: Doubt regarding the kmeans clustering results on mahout

2011-07-30 Thread Frank Scholten
Maybe it should produce NamedVectors by default as well. This is
another of those optional settings
that is often needed in practice.

On Fri, Jul 29, 2011 at 11:42 PM, Jeff Eastman  wrote:
> No problem. I really think the default needs to be changed anyway. Perhaps 
> this will get me to do it.
>
> -Original Message-
> From: Abhik Banerjee [mailto:banerjee.abhik@gmail.com]
> Sent: Friday, July 29, 2011 1:48 PM
> To: user@mahout.apache.org
> Subject: Re: Doubt regarding the kmeans clustering results on mahout
>
> Thanks a lot , I missed that part in the wiki , My mistake.
>
>


Re: fkmeans or Cluster Dumper not working?

2011-07-21 Thread Frank Scholten
Hi Jeffrey,

Fuzzy kmeans outputs a [Cluster ID, WeightedVectorWritable] file under
clusters/clusteredPoints and a [Cluster ID, SoftCluster] file under
clusters/clusters-*; you don't need to write code for that.

However if you want to display your clusters in an application, along
with nice labels and so on you need to write some code to join all
these clustering outputs together and enrich your original documents
with their cluster ID, or in case of fuzzy kmeans, multiple cluster
IDs along with weights.

I don't know why your fkmeans clustering fails when running with 50
clusters though. I just ran fkmeans on seinfeld transcripts on my
local machine like this:

$MAHOUT_HOME/bin/mahout fkmeans --input $OUTPUT/vectors/tfidf-vectors \
  --output $OUTPUT/fkmeans/clusters \
  --clusters $OUTPUT/fkmeans/initialclusters \
  --maxIter 5 \
  --numClusters 50 \
  --clustering \
  --m 2 \
  --overwrite


Frank
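
For the "write some code to join these outputs" part, a hedged sketch of reading
clusteredPoints with the plain SequenceFile API. It assumes the 0.5/0.6-era
key/value classes mentioned above (IntWritable cluster id, WeightedVectorWritable
point), so adjust if your version differs:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.clustering.WeightedVectorWritable;

// Hedged sketch: print which cluster each point landed in, plus its weight.
// For fuzzy k-means the same point can appear under several cluster ids.
public final class ClusteredPointsReader {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path(args[0]);                // path to one clusteredPoints part file
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    IntWritable clusterId = new IntWritable();
    WeightedVectorWritable point = new WeightedVectorWritable();
    while (reader.next(clusterId, point)) {
      System.out.println(clusterId.get() + "\t" + point.getWeight() + "\t" + point.getVector());
    }
    reader.close();
  }
}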

On Thu, Jul 21, 2011 at 10:29 AM, Jeffrey  wrote:
> Hi again,
>
> Let me update on what's working and what's not working.
>
> Works:
> fkmeans clustering (10 clusters) - thanks Jeff for the --cl tip
> fkmeans clustering (5 clusters)
> clusterdump (5 clusters) - so points are not included in the clusterdump and 
> I need to write a program for it?
>
> Not Working:
> fkmeans clustering (50 clusters) - same error
> clusterdump (10 clusters) - same error
>
>
> so it seems to attach points to the cluster dumper output like the synthetic 
> control example does, i would have to write some code as pointed 
> by @Frank_Scholten ? 
> https://twitter.com/#!/Frank_Scholten/status/93617269296472064
>
> Best wishes,
> Jeffrey04
>
>>
>>From: Jeff Eastman 
>>To: "user@mahout.apache.org" ; Jeffrey 
>>
>>Sent: Wednesday, July 20, 2011 11:53 PM
>>Subject: RE: fkmeans or Cluster Dumper not working?
>>
>>Hi Jeffrey,
>>
>>It is always difficult to debug remotely, but here are some suggestions:
>>- First, you are specifying both an input clusters directory --clusters and 
>>--numClusters clusters so the job is sampling 10 points from your input data 
>>set and writing them to clusteredPoints as the prior clusters for the first 
>>iteration. You should pick a different name for this directory, as the 
>>clusteredPoints directory is used by the -cl (--clustering) option (which you 
>>did not supply) to write out the clustered (classified) input vectors. When 
>>you subsequently supplied clusteredPoints to the clusterdumper it was 
>>expecting a different format and that caused the exception you saw. Change 
>>your --clusters directory (clusters-0 is good) and add a -cl argument and 
>>things should go more smoothly. The -cl option is not the default and so no 
>>clustering of the input points is performed without this (Many people get 
>>caught by this and perhaps the default should be changed, but clustering can 
>>be expensive and so it is not performed without request).
>>- If you still have problems, try again with k-means. The similarity to 
>>fkmeans is good and it will eliminate fkmeans itself if you see the same 
>>problems with k-means
>>- I don't see why changing the -k argument from 10 to 50 should cause any 
>>problems, unless your vectors are very large and you are getting an OME in 
>>the reducer. Since the reducer is calculating centroid vectors for the next 
>>iteration these will become more dense and memory will increase substantially.
>>- I can't figure out what might be causing your second exception. It is 
>>bombing inside of Hadoop file IO and this causes me to suspect command 
>>argument problems.
>>
>>Hope this helps,
>>Jeff
>>
>>
>>-Original Message-
>>From: Jeffrey [mailto:mycyber...@yahoo.com]
>>Sent: Wednesday, July 20, 2011 2:41 AM
>>To: user@mahout.apache.org
>>Subject: fkmeans or Cluster Dumper not working?
>>
>>Hi,
>>
>>I am trying to generate clusters using the fkmeans command line tool from my 
>>test data. Not sure if this is correct, as it only runs one iteration (output 
>>from 0.6-snapshot, gotta use some workaround to some weird bug 
>>- http://search.lucidimagination.com/search/document/d95ff0c29ac4a8a7/bug_in_fkmeans
>> )
>>
>>$ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output 
>>sensei/clusters --clusters sensei/clusteredPoints --maxIter 10 --numClusters 
>>10 --overwrite --m 5
>>Running on hadoop, using 
>>HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/confMAHOUT-JOB:
>> 
>>/home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar11/07/20
>> 14:05:18 INFO common.AbstractJob: Command line arguments: 
>>{--clusters=sensei/clusteredPoints, --convergenceDelta=0.5, 
>>--distanceMeasure=org.apache.ma

Re: Finding thresholds for canopy

2011-05-17 Thread Frank Scholten
Hi Jeff,

After building this distance matrix, what would then be a good value
for T2? The average distance in the matrix?
Frank

On Wed, Apr 27, 2011 at 10:57 PM, Jeff Eastman  wrote:
> Worth a try, but it ultimately boils down to the distance measure you've 
> chosen, the distributions of input vectors and T2. As a pre-run experiment, 
> you could sample some points from your data set (e.g. using 
> RandomSeedGenerator as you would to prime k-means), then build a distance 
> matrix using your chosen distance measure. That would give you a T2 starting 
> point in a more systematic manner than grabbing it completely out of thin air.
>
> -Original Message-
> From: Paul Mahon [mailto:pma...@decarta.com]
> Sent: Wednesday, April 27, 2011 1:46 PM
> To: user@mahout.apache.org
> Subject: Re: Finding thresholds for canopy
>
> If you have a guess at how many clusters you want you could take the
> total area of the space and divide by the number of clusters to get an
> initial guess of T2 or T1. That might work to get you started,
> depending on the distribution.
>
> On 04/27/2011 12:39 PM, Camilo Lopez wrote:
>> I'm using Canopy as first step for K-means clustering, is there any 
>> algorithmic, or even a good heuristic to estimate good T1 and T2 from the 
>> vectorized data?
>
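
A small sketch of the pre-run experiment described in the quoted reply: sample some
points, compute all pairwise distances with your chosen measure (plain Euclidean
here, on made-up points), and use a summary statistic such as the median as a first
guess for T2, with T1 somewhat larger. Whether the mean, the median or a lower
percentile works best depends entirely on the data; treat it as a starting point:

import java.util.Arrays;

// Hedged sketch: derive a starting T2 from the pairwise distances of a sample.
public final class CanopyThresholdSketch {

  static double euclidean(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return Math.sqrt(sum);
  }

  public static void main(String[] args) {
    double[][] sample = {{0, 0}, {1, 1}, {0.5, 0.2}, {8, 9}, {7.5, 8.5}};  // made-up sampled points
    double[] dists = new double[sample.length * (sample.length - 1) / 2];
    int k = 0;
    for (int i = 0; i < sample.length; i++) {
      for (int j = i + 1; j < sample.length; j++) {
        dists[k++] = euclidean(sample[i], sample[j]);
      }
    }
    Arrays.sort(dists);
    double t2Guess = dists[dists.length / 2];       // median pairwise distance
    System.out.println("T2 guess: " + t2Guess + ", T1 guess: " + 2 * t2Guess);
  }
}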


Re: AW: Incremental clustering

2011-05-12 Thread Frank Scholten
What do you recommend for vectorizing the new docs? Run seq2sparse on
a batch of them? Seems there's no code at the moment for quickly
vectorizing a few new documents based on the existing dictionary.

Frank
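
A rough sketch of what "quickly vectorizing a few new documents against the
existing dictionary" could look like: load the term-to-index dictionary that
seq2sparse wrote, tokenize the new text, and fill a sparse term-frequency vector.
Reading the dictionary file and applying the matching IDF weights are left out,
and the tokenizer here is deliberately crude:

import java.util.Map;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

// Hedged sketch only: build a TF vector for a new document in the same space
// as the vectors the existing clusters were built from.
public final class IncrementalVectorizerSketch {

  static Vector vectorize(String text, Map<String, Integer> dictionary, int cardinality) {
    Vector v = new RandomAccessSparseVector(cardinality);
    for (String token : text.toLowerCase().split("\\s+")) {  // crude tokenizer; use Lucene's in practice
      Integer index = dictionary.get(token);
      if (index != null) {                                   // unknown terms are simply dropped
        v.set(index, v.get(index) + 1.0);
      }
    }
    return v;
  }

  public static void main(String[] args) {
    Map<String, Integer> dict = new java.util.HashMap<String, Integer>();
    dict.put("mahout", 0);
    dict.put("clustering", 1);
    System.out.println(vectorize("Mahout clustering with Mahout", dict, 2));
  }
}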

On Thu, May 12, 2011 at 12:32 PM, Grant Ingersoll  wrote:
> From what I've seen, using Mahout's existing clustering methods, I think most 
> people setup some schedule whereby they cluster the whole collection on a 
> regular basis and then all docs that come in the meantime are simply assigned 
> to the closest cluster until the next whole collection iteration is 
> completed.  There are, of course, other variants one could do, such as kick 
> off the whole clustering when some threshold of number of docs is reached.
>
> There are other clustering methods, as Benson alluded to, that may better 
> support incremental approaches.
>
> On May 12, 2011, at 4:53 AM, David Saile wrote:
>
>> I am still stuck at this problem.
>>
>> Can anyone give me a heads-up on how existing systems handle this?
>> If a collection of documents is modified, is the clustering recomputed from 
>> scratch each time?
>> Or is there in fact any incremental way to handle an evolving set of 
>> documents?
>>
>> I would really appreciate any hint!
>>
>> Thanks,
>> David
>>
>>
>> Am 09.05.2011 um 12:45 schrieb Ulrich Poppendieck:
>>
>>> Not an answer, but a follow-up question:
>>> I would be interested in the very same thing, but with the possibility to 
>>> assign new sites to existing clusters OR to new ones.
>>>
>>> Thanks in advance,
>>> Ulrich
>>>
>>> -Ursprüngliche Nachricht-
>>> Von: David Saile [mailto:da...@uni-koblenz.de]
>>> Gesendet: Montag, 9. Mai 2011 11:53
>>> An: user@mahout.apache.org
>>> Betreff: Incremental clustering
>>>
>>> Hi list,
>>>
>>> I am completely new to Mahout, so please forgive me if the answer to my 
>>> question is too obvious.
>>>
>>> For a case study, I am working on a simple incremental web crawler (much 
>>> like Nutch) and I want to include a very simple indexing step that 
>>> incorporates clustering of documents.
>>>
>>> I was hoping to use some kind of incremental clustering algorithm, in order 
>>> to make use of the incremental way the crawler is supposed to work (i.e. 
>>> continuously adding and updating websites).
>>>
>>> Is there some way to achieve the following:
>>>      1) initial clustering of the first web-crawl
>>>      2) assigning new sites to existing clusters
>>>      3) possibly moving modified sites between clusters
>>>
>>> I would really appreciate any help!
>>>
>>> Thanks,
>>> David
>>
>
> --
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem docs using Solr/Lucene:
> http://www.lucidimagination.com/search
>
>


Re: The perennial "Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector" problem

2011-05-11 Thread Frank Scholten
Just ran seq2sparse on a clean checkout of trunk with a cluster
started by Whirr. This works without problems.

frank@franktop:~/Desktop/mahout$ bin/mahout seq2sparse --input
target/posts --output target/seq2sparse --weight tfidf  --namedVector
Running on hadoop, using HADOOP_HOME=/usr/local/hadoop
HADOOP_CONF_DIR=/home/frank/.whirr/frank-cluster/
11/05/11 17:57:17 WARN conf.Configuration: DEPRECATED: hadoop-site.xml
found in the classpath. Usage of hadoop-site.xml is deprecated.
Instead use core-site.xml, mapred-site.xml and hdfs-site.xml to
override properties of core-default.xml, mapred-default.xml and
hdfs-default.xml respectively
11/05/11 17:57:18 INFO vectorizer.SparseVectorsFromSequenceFiles:
Maximum n-gram size is: 1
11/05/11 17:57:18 INFO vectorizer.SparseVectorsFromSequenceFiles:
Minimum LLR value: 1.0
11/05/11 17:57:18 INFO vectorizer.SparseVectorsFromSequenceFiles:
Number of reduce tasks: 1
11/05/11 17:57:19 INFO common.HadoopUtil: Deleting target/seq2sparse
11/05/11 17:58:42 INFO input.FileInputFormat: Total input paths to process : 1
11/05/11 17:58:45 INFO mapred.JobClient: Running job: job_201105111409_0009
11/05/11 17:58:46 INFO mapred.JobClient:  map 0% reduce 0%
11/05/11 17:59:00 INFO mapred.JobClient:  map 100% reduce 0%

Frank

On Tue, May 10, 2011 at 5:34 PM, Jake Mannix  wrote:
> On Tue, May 10, 2011 at 8:24 AM, Sean Owen  wrote:
>
>> I peeked in the examples job jar and it definitely does have this class,
>> along with the other dependencies (after my patch). Double-check that
>> you've
>> done the clean build an "install" again? and maybe even print out
>> MAHOUT_JOB
>> in the script to double-check what it is using?
>>
>
> [jake@smf1-ady-15-sr1 bla]$ jar -tf mahout-examples-0.5-SNAPSHOT-job.jar |
> grep "/Analyzer.class"
> org/apache/lucene/analysis/Analyzer.class
>
> [swap exec for echo in last line of bin/mahout ]
>
> [jake@smf1-ady-15-sr1 mahout-distribution-0.5-SNAPSHOT]$ ./bin/mahout
> Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop-0.20
> No HADOOP_CONF_DIR set, using /usr/lib/hadoop-0.20/src/conf
> /usr/lib/hadoop-0.20/bin/hadoop jar
> /home/jake/mahout-distribution-0.5-SNAPSHOT/mahout-examples-0.5-SNAPSHOT-job.jar
> org.apache.mahout.driver.MahoutDriver
>
> :\
>
>
>> On Tue, May 10, 2011 at 12:40 AM, Jake Mannix 
>> wrote:
>>
>> > wah.  Even trying to do seq2sparse doesn't work for me:
>> >
>> > [jake@smf1-ady-15-sr1 mahout-distribution-0.5-SNAPSHOT]$ ./bin/mahout
>> > seq2sparse -i hdfs:///user/jake/text_temp -o
>> > hdfs:///user/jake/text_vectors_temp
>> > Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop-0.20
>> > No HADOOP_CONF_DIR set, using /usr/lib/hadoop-0.20/src/conf
>> > 11/05/09 23:36:01 WARN driver.MahoutDriver: No seq2sparse.props found on
>> > classpath, will use command-line arguments only
>> > 11/05/09 23:36:01 INFO vectorizer.SparseVectorsFromSequenceFiles: Maximum
>> > n-gram size is: 1
>> > 11/05/09 23:36:01 INFO vectorizer.SparseVectorsFromSequenceFiles: Minimum
>> > LLR value: 1.0
>> > 11/05/09 23:36:01 INFO vectorizer.SparseVectorsFromSequenceFiles: Number
>> of
>> > reduce tasks: 1
>> > 11/05/09 23:36:04 INFO input.FileInputFormat: Total input paths to
>> process
>> > :
>> > 1
>> > 11/05/09 23:36:10 INFO mapred.JobClient: Running job:
>> > job_201104300433_126621
>> > 11/05/09 23:36:12 INFO mapred.JobClient:  map 0% reduce 0%
>> > 11/05/09 23:36:47 INFO mapred.JobClient: Task Id :
>> > attempt_201104300433_126621_m_00_0, Status : FAILED
>> > 11/05/09 23:37:07 INFO mapred.JobClient: Task Id :
>> > attempt_201104300433_126621_m_00_1, Status : FAILED
>> > Error: java.lang.ClassNotFoundException:
>> > org.apache.lucene.analysis.Analyzer
>> >
>> > 
>> >
>> > Note I'm not specifying any fancy analyzer.  Just trying to run with the
>> > defaults. :\
>> >
>> >  -jake
>>
>


Re: Vectorizing arbitrary value types with seq2sparse

2011-05-06 Thread Frank Scholten
Hmm, seems more complex than I thought. I thought of a simple approach
where you could configure your own class that concatenates the desired
fields into one Text value and have the SequenceFileTokenizerMapper
process that value.

But this could give unexpected results: I guess it may find incorrect
n-grams from tokens that came from different fields.
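
A sketch of that concatenation idea, with the caveat above: the sentinel token
between fields is only a guess at keeping cross-field n-grams identifiable, and
whether it plays well with the collocation code would need checking. The class and
field names are made up:

// Hedged sketch: flatten a multi-field document into the single Text value
// that SequenceFileTokenizerMapper expects. The "fieldsep" sentinel is a
// hypothetical marker between fields, not something seq2sparse knows about.
public final class BlogArticleToText {

  static String flatten(String title, String content, String tags) {
    String sep = " fieldsep ";          // hypothetical sentinel token
    return title + sep + content + sep + tags;
  }

  public static void main(String[] args) {
    System.out.println(flatten("Apache Mahout 0.5 released",
                               "The new release contains clustering improvements.",
                               "mahout hadoop clustering"));
  }
}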

On Fri, May 6, 2011 at 10:17 PM, Ted Dunning  wrote:
> This is definitely desirable but is very different from the current tool.
>
> My guess is the big difficulty will be describing the vectorization to be
> done.  The hashed representations would make that easier, but still not
> trivial.  Dictionary based methods add multiple dictionary specifications
> and also require that we figure out how to combine vectors by concatenation
> or overlay.
>
> On Fri, May 6, 2011 at 1:02 PM, Frank Scholten wrote:
>
>> Hi everyone,
>>
>> At the moment seq2sparse can generate vectors from sequence values of
>> type Text. More specifically, SequenceFileTokenizerMapper handles Text
>> values.
>>
>> Would it be useful if seq2sparse could be configured to vectorize
>> value types such as a Blog article with several textual fields like
>> title, content, tags and so on?
>>
>> Or is it easier to create a separate job for this or use Pig or
>> anything like that?
>>
>> Frank
>>
>


Vectorizing arbitrary value types with seq2sparse

2011-05-06 Thread Frank Scholten
Hi everyone,

At the moment seq2sparse can generate vectors from sequence values of
type Text. More specifically, SequenceFileTokenizerMapper handles Text
values.

Would it be useful if seq2sparse could be configured to vectorize
value types such as a Blog article with several textual fields like
title, content, tags and so on?

Or is it easier to create a separate job for this or use Pig or
anything like that?

Frank


Re: 20NewsGroups Error: Illegal Capacity: -40

2011-04-13 Thread Frank Scholten
This sh error also occurred for the reuters script but has been fixed. Maybe it 
would be good to update all the scripts to bash?

On Apr 13, 2011, at 18:34, Ken Williams  wrote:

> Ted Dunning  gmail.com> writes:
> 
>> 
>> This may be a bit of a regression.
> 
> Thanks for the reply.
> 
> Just out of interest, I also reckon your 'build-cluster-syntheticcontrol.sh' 
> script should be a bash script (#!/bin/bash) rather than a standard
> shell (#!/bin/sh) script.
> 
> 
> $ trunk/examples/bin/build-cluster-syntheticcontrol.sh 
> trunk/examples/bin/build-cluster-syntheticcontrol.sh: 28: Syntax error: "("
> unexpected (expecting "fi")
> $ 
> 
> 
> Regards,
> 
> Ken
> 
> 
>> 
>> On Wed, Apr 13, 2011 at 4:48 AM, Ken Williams  hotmail.com> 
>> wrote:
>> 
>>> I'm not sure what to try next. Any help would be very welcome.
>>> 
>> 
> 
> 
> 
>