Re: [ANNOUNCE] Andrew Musselman, New Mahout PMC Chair
Congrats! 2018-07-19 9:31 GMT+02:00 Peng Zhang : > Congrats Andrew! > > On Thu, Jul 19, 2018 at 04:01 Andrew Musselman > > wrote: > > > Thanks Andy, looking forward to it! Thank you too for your support and > > dedication the past two years; here's to continued progress! > > > > Best > > Andrew > > > > On Wed, Jul 18, 2018 at 1:30 PM, Andrew Palumbo > > wrote: > > > Please join me in congratulating Andrew Musselman as the new Chair of > > > the > > > Apache Mahout Project Management Committee. I would like to thank > > > Andrew > > > for stepping up, all of us who have worked with him over the years > > > know his > > > dedication to the project to be invaluable. I look forward to Andrew > > > taking the project into the future. > > > > > > Thank you, > > > > > > Andy > > >
Re: Does mahout 0.5 fit hadoop-0.20.2?
Please use a recent version of Mahout; 0.4 and 0.5 are totally outdated. -s On 06/25/2014 09:05 AM, seabiscuit08 wrote: Hi everyone, I am new to Mahout. Our Hadoop cluster runs hadoop-0.20.2. I tried out the mahout-distribution-0.4 LDA function, and it works well. But it can't infer topics for new documents once the LDA estimation is over. I heard Mahout 0.5 has such an ability, but when I try it, it can't even create a sequence file on my HDFS. Any help is appreciated!!! seabiscuit08
Re: divide a vector (sum) by a double, error
It's also not a good idea to put the vectors into a HashSet; I don't think we have equals and hashCode correctly implemented for that. On 16.06.2014 18:21, Ted Dunning ted.dunn...@gmail.com wrote: Patrice, This sounds like a classpath problem more than a code error. Are you sure that you can run any program that uses Mahout? Do you perhaps have two versions of Mahout floating around? Regarding the code, this is a more compact idiom for the same thing: Matrix m = ...; Vector centroid = m.aggregateColumns(new VectorFunction() { @Override public double apply(Vector f) { return f.zSum() / f.size(); } }); This uses a matrix as a container for vectors rather than a set of Vectors. If you really want to use a set, then your iteration-based approach should be fine. In your code, you could also be much tighter. For instance, the last three lines could simply be: return sum.divide(vectors.size()); None of the stuff with the Integer or casting is necessary. On Mon, Jun 16, 2014 at 9:01 AM, Patrice Seyed apse...@gmail.com wrote: Hi all, I have attempted to write a method centroid() that 1) sums a HashSet of org.apache.mahout.math.Vector (vectors that are DenseVector), and 2) (org.apache.mahout.math.Vector.divide) divides the summed vector by its size, as a double. I get an error: Exception in thread "main" java.lang.IncompatibleClassChangeError: class org.apache.mahout.math.function.Functions$1 has interface org.apache.mahout.math.function.DoubleFunction as super class I've tried this with a set of DenseVector and SequentialAccessSparseVector with the same result. Any help appreciated, the actual method is below. I noticed a class Centroid in the mahout distribution, but it seems to cover a different sense of centroid than the one I'm implementing here. Thanks, Patrice

public Vector centroid(HashSet<Vector> vectors) {
  Iterator<Vector> it = vectors.iterator();
  Vector sum = it.next();
  while (it.hasNext()) {
    Vector aVector = it.next();
    sum = sum.plus(aVector);
    System.out.println(sum.toString());
  }
  Integer totalVectors = vectors.size();
  double dlTotalVectors = totalVectors.doubleValue();
  return sum.divide(dlTotalVectors);
}
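For archive readers, here is Ted's matrix idiom as a complete, runnable sketch, assuming mahout-math (0.9-era API) on the classpath; the sample values are made up:

import org.apache.mahout.math.DenseMatrix;
import org.apache.mahout.math.Matrix;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.function.VectorFunction;

public class CentroidExample {

  public static void main(String[] args) {
    // Each row is one point; the centroid is the column-wise mean.
    Matrix m = new DenseMatrix(new double[][] {
        {1.0, 2.0},
        {3.0, 4.0},
        {5.0, 6.0}
    });

    Vector centroid = m.aggregateColumns(new VectorFunction() {
      @Override
      public double apply(Vector column) {
        return column.zSum() / column.size();
      }
    });

    System.out.println(centroid); // expected column means: 3.0 and 4.0
  }
}

Because aggregateColumns invokes the function once per column, each entry of the result is the mean of one coordinate across all points, which is exactly the centroid.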
Re: Performance issues in Mahout recommendations
You should not use Hadoop for such a tiny dataset. Use the GenericItemBasedRecommender on a single machine in Java. --sebastian On 06/06/2014 11:10 AM, Warunika Ranaweera wrote: Hi, I am using Mahout's recommenditembased algorithm on a data set with nearly 10,000 (implicit) user ratings. This is the command I used: *mahout recommenditembased --input ratings.csv --output recommendation --usersFile users.dat --tempDir temp --similarityClassname SIMILARITY_LOGLIKELIHOOD --numRecommendations 3 * Although the output is successfully generated, this process takes nearly 7 minutes to produce recommendations for a single user. The Hadoop cluster has 8 nodes and the machine on which Mahout is invoked is an AWS EC2 c3.2xlarge server. When I tracked the mapreduce jobs, I noticed that more than one machine is *not* utilized at a time, and the *recommenditembased* command takes 9 mapreduce jobs altogether with approx. 45 seconds taken per job. Since the performance is too slow for real-time recommendations, it would be really helpful to know whether I'm missing out on any additional commands or configurations that enable faster performance. Thanks, Warunika
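A minimal single-machine version of what Sebastian suggests, sketched against the Mahout 0.9 Taste API (the file name and user id are placeholders):

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class SingleMachineRecommender {

  public static void main(String[] args) throws Exception {
    // A few million ratings fit comfortably in memory on one machine.
    DataModel model = new FileDataModel(new File("ratings.csv"));
    ItemSimilarity similarity = new LogLikelihoodSimilarity(model);
    GenericItemBasedRecommender recommender =
        new GenericItemBasedRecommender(model, similarity);

    // Top-3 recommendations for user 42, with no MapReduce job startup cost.
    List<RecommendedItem> items = recommender.recommend(42L, 3);
    for (RecommendedItem item : items) {
      System.out.println(item.getItemID() + " : " + item.getValue());
    }
  }
}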
Re: Performance issues in Mahout recommendations
1M ratings take up something like 20 megabytes. This is a data size where it does not make any sense to use Hadoop. Just try the single-machine implementation. --sebastian On 06/06/2014 12:01 PM, Warunika Ranaweera wrote: Hi Sebastian, Thanks for your prompt response. It's just a sample data set from our database and it may expand up to 6 million ratings. Since the performance was low for a smaller data set, I thought it would be even worse for a larger data set. As per your suggestion, I also applied the same command on 1 million user ratings for approx. 6000 users and got the same performance level. What is the average running time for the Mahout distributed recommendation job on 1 million ratings? Does it usually take more than 1 minute? Thanks in advance, Warunika On Fri, Jun 6, 2014 at 2:42 PM, Sebastian Schelter s...@apache.org wrote: You should not use Hadoop for such a tiny dataset. Use the GenericItemBasedRecommender on a single machine in Java. --sebastian On 06/06/2014 11:10 AM, Warunika Ranaweera wrote: Hi, I am using Mahout's recommenditembased algorithm on a data set with nearly 10,000 (implicit) user ratings. This is the command I used: *mahout recommenditembased --input ratings.csv --output recommendation --usersFile users.dat --tempDir temp --similarityClassname SIMILARITY_LOGLIKELIHOOD --numRecommendations 3 * Although the output is successfully generated, this process takes nearly 7 minutes to produce recommendations for a single user. The Hadoop cluster has 8 nodes and the machine on which Mahout is invoked is an AWS EC2 c3.2xlarge server. When I tracked the mapreduce jobs, I noticed that more than one machine is *not* utilized at a time, and the *recommenditembased* command takes 9 mapreduce jobs altogether with approx. 45 seconds taken per job. Since the performance is too slow for real-time recommendations, it would be really helpful to know whether I'm missing out on any additional commands or configurations that enable faster performance. Thanks, Warunika
Re: Performance issues in Mahout recommendations
Mahout has single-machine and distributed recommenders. On 06/06/2014 02:31 PM, Warunika Ranaweera wrote: I agree with your suggestion though. I have already implemented a Java recommender and it performed better. But, due to scalability problems that are predicted to occur in the future, we thought of moving to Mahout. However, it seems like, for now, it's better to go with the single-machine implementation. Thanks for your suggestions, Warunika On Fri, Jun 6, 2014 at 3:36 PM, Sebastian Schelter s...@apache.org wrote: 1M ratings take up something like 20 megabytes. This is a data size where it does not make any sense to use Hadoop. Just try the single-machine implementation. --sebastian On 06/06/2014 12:01 PM, Warunika Ranaweera wrote: Hi Sebastian, Thanks for your prompt response. It's just a sample data set from our database and it may expand up to 6 million ratings. Since the performance was low for a smaller data set, I thought it would be even worse for a larger data set. As per your suggestion, I also applied the same command on 1 million user ratings for approx. 6000 users and got the same performance level. What is the average running time for the Mahout distributed recommendation job on 1 million ratings? Does it usually take more than 1 minute? Thanks in advance, Warunika On Fri, Jun 6, 2014 at 2:42 PM, Sebastian Schelter s...@apache.org wrote: You should not use Hadoop for such a tiny dataset. Use the GenericItemBasedRecommender on a single machine in Java. --sebastian On 06/06/2014 11:10 AM, Warunika Ranaweera wrote: Hi, I am using Mahout's recommenditembased algorithm on a data set with nearly 10,000 (implicit) user ratings. This is the command I used: *mahout recommenditembased --input ratings.csv --output recommendation --usersFile users.dat --tempDir temp --similarityClassname SIMILARITY_LOGLIKELIHOOD --numRecommendations 3 * Although the output is successfully generated, this process takes nearly 7 minutes to produce recommendations for a single user. The Hadoop cluster has 8 nodes and the machine on which Mahout is invoked is an AWS EC2 c3.2xlarge server. When I tracked the mapreduce jobs, I noticed that more than one machine is *not* utilized at a time, and the *recommenditembased* command takes 9 mapreduce jobs altogether with approx. 45 seconds taken per job. Since the performance is too slow for real-time recommendations, it would be really helpful to know whether I'm missing out on any additional commands or configurations that enable faster performance. Thanks, Warunika
Re: Indicator Matrix and Mahout + Solr recommender
I have added the threshold merely as a way to increase the performance of RowSimilarityJob. If a threshold is given, some item pairs don't need to be looked at. A simple example: if you use cooccurrence count as the similarity measure and set a threshold of n cooccurrences, then any pair containing an item with fewer than n interactions can be ignored (such an item can co-occur at most as many times as it has interactions, so it can never reach the threshold). IIRC similar techniques are implemented for cosine and Jaccard. Best, Sebastian On 05/27/2014 07:08 PM, Pat Ferrel wrote: On May 27, 2014, at 8:15 AM, Ted Dunning ted.dunn...@gmail.com wrote: The threshold should not normally be used in the Mahout+Solr deployment style. Understood, and that's why an alternative way of specifying a cutoff may be a good idea. This need is better supported by specifying the maximum number of indicators. This is mathematically equivalent to specifying a fraction of values, but is more meaningful to users since good values for this number are pretty consistent across different uses (50-100 are reasonable values for most needs; larger values are quite plausible). Assume you mean 50-100 as the average number per item. The total for the entire indicator matrix is what Ken was asking for. But I was thinking about the use with itemsimilarity, where the user may not know the dimensionality, since itemsimilarity assembles the matrix from individual prefs. The user probably knows the number of items in their catalog but the indicator matrix dimensionality is arbitrarily smaller. Currently the help reads: --maxSimilaritiesPerItem (-m) maxSimilaritiesPerItem: try to cap the number of similar items per item to this number (default: 100) If this were actually the average # per item it would do what you describe, but it looks like it's a literal cutoff per vector in the code. A cutoff based on the highest scores in the entire matrix seems to imply a sort when the total is larger than the average would allow, and I don't see an obvious sort being done in the MR. Anyway, it looks like we could do this by 1) total number of values in the matrix (what Ken was asking for). This requires that the user know the dimensionality of the indicator matrix to be very useful. 2) average number per item (what Ted describes). This seems the most intuitive and does not require the dimensionality be known. 3) fraction of the values. This might be useful if you are more interested in downsampling by score; at least it seems more useful than --threshold as it is today, but maybe I'm missing some use cases? Is there really a need for a hard score threshold? On Tue, May 27, 2014 at 8:08 AM, Pat Ferrel pat.fer...@gmail.com wrote: I was talking with Ken Krugler off list about the Mahout + Solr recommender and he had an interesting request. When calculating the indicator/item similarity matrix using ItemSimilarityJob there is a --threshold option. Wouldn't it be better to have an option that specified the fraction of values kept in the entire matrix based on their similarity strength? This is very difficult to do with --threshold. It would be like expressing the threshold as a fraction of the total number of values rather than a strength value. Seems like this would have the effect of tossing the least interesting similarities, where limiting per item (--maxSimilaritiesPerItem) could easily toss some of the most interesting. At the very least it seems like a better way of expressing the threshold, doesn't it?
Re: Theory behind LogisticRegression in Mahout
We should add these links to the LR page on the website. --s On 05/23/2014 03:20 PM, Ted Dunning wrote: Ahh... my error then. Happily, Dmitriy and others have provided the requisite links. On Thu, May 22, 2014 at 11:50 PM, namit maheshwari namitmaheshwa...@gmail.com wrote: No, I didn't find any links in the comments. On Fri, May 23, 2014 at 2:44 AM, Ted Dunning ted.dunn...@gmail.com wrote: I thought that there were links in the comments to documentation. Are there not? Sent from my iPhone On May 22, 2014, at 2:29, namit maheshwari namitmaheshwa...@gmail.com wrote: Hello Everyone, Could anyone please let me know the algorithm used behind LogisticRegression in Mahout? Also, AdaptiveLogisticRegression mentions an *annealing* schedule. I would be grateful if someone could guide me towards the theory behind it. Thanks Namit
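For later readers: Mahout's sequential logistic regression is trained with stochastic gradient descent, and the learning rate decays over time, which is the annealing the question refers to. A minimal training sketch against the 0.9-era SGD API (the hyperparameters and feature vectors are made up for illustration):

import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class LrSketch {

  public static void main(String[] args) {
    // 2 classes, 3 features, L1 prior; the learning rate anneals per step.
    OnlineLogisticRegression lr =
        new OnlineLogisticRegression(2, 3, new L1()).lambda(1e-4).learningRate(1);

    Vector positive = new DenseVector(new double[] {1, 0, 1});
    Vector negative = new DenseVector(new double[] {0, 1, 0});

    for (int i = 0; i < 100; i++) {
      lr.train(1, positive); // true class 1
      lr.train(0, negative); // true class 0
    }

    // Estimated probability of class 1 for an unseen example.
    System.out.println(lr.classifyScalar(new DenseVector(new double[] {1, 0, 0})));
  }
}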
Re: Setting mahout heapsize for rowsimilarity job
I don't think you should use RowSimilarityJob for that case if you only have 6 columns. Can you tell us a little bit about the data and what problem you are trying to solve? --sebastian On 05/23/2014 09:03 PM, Suneel Marthi wrote: I had seen this issue too with RSJ until 0.8. Switch to using Mahout 0.9; downsampling was introduced in RSJ, which should avoid this error. On Fri, May 23, 2014 at 2:59 PM, Mohit Singh mohit1...@gmail.com wrote: Hi, I have a 1M x 6 dimensional matrix stored as a sequence file and I am trying to use rowSimilarity for this job... But when I try to run the job, I see a Java heap space error for the second step (RowSimilarityJob-CooccurrencesMapper-Reducer). My raw sequence file is around 700MB and I have already set MAHOUT_OPTS to (say) 7gb, but I am still seeing that error. My command line args are: hadoop jar /usr/lib/mahout/mahout-examples-0.8-cdh5.0.0-job.jar org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob -i $INPUT -o $OUTPUT *-r 6 *-s SIMILARITY_COSINE -m 15 --tempDir $TEMP -ess Also, is this -r a typo? The help file says that this is the column length. Is it the column or row dimension? Thanks -- Mohit When you want success as badly as you want the air, then you will get it. There is no other secret of success. -Socrates
Re: Mahout recommendation in implicit feedback situation
Alessandro, which version of Mahout are you using? I had a look at the current implementation of GenericBooleanPrefUserBasedRecommender and its doEstimatePreference method returns the sum of similarities of users that have also interacted with the item. So that should be different from either 0 or 1. --sebastian On 05/03/2014 05:00 PM, Alessandro Suglia wrote: Sorry Sebastian, maybe you didn't have the chance to read the post on SO, so I'll report the code here. I've already used the GenericBooleanPrefUserBasedRecommender in order to generate the recommendations and the results are the same.

DataModel trainModel = new FileDataModel(new File(String.valueOf(Main.class.getResource("/binarized/u1.base").getFile())));
DataModel testModel = new FileDataModel(new File(String.valueOf(Main.class.getResource("/binarized/u1.test").getFile())));
UserSimilarity similarity = new TanimotoCoefficientSimilarity(trainModel);
UserNeighborhood neighborhood = new NearestNUserNeighborhood(35, similarity, trainModel);
GenericBooleanPrefUserBasedRecommender userBased = new GenericBooleanPrefUserBasedRecommender(trainModel, neighborhood, similarity);
long firstUser = testModel.getUserIDs().nextLong(); // get the first user
// try to recommend items for the first user
for (LongPrimitiveIterator iterItem = testModel.getItemIDsFromUser(firstUser).iterator(); iterItem.hasNext(); ) {
  long currItem = iterItem.nextLong();
  // estimate the preference for the current item for the first user
  System.out.println("Estimated preference for item " + currItem + " is " + userBased.estimatePreference(firstUser, currItem));
}

Can you explain to me where the error in this code is? Thank you. On 05/03/14 16:42, Sebastian Schelter wrote: You should try the org.apache.mahout.cf.taste.impl.recommender.GenericBooleanPrefUserBasedRecommender which has been built to handle such data. Best, Sebastian On 05/03/2014 04:34 PM, Alessandro Suglia wrote: I have described it in the SO post: When I execute this code, the result is a list of 0.0 or 1.0, which is not useful in the context of top-n recommendation with implicit feedback. Simply because I have to obtain, for each item, an estimated rating in the range [0, 1] in order to rank the list in decreasing order and construct the top-n recommendation appropriately. On 05/03/14 16:25, Sebastian Schelter wrote: Hi Alessandro, what result do you expect and what do you get? Can you give a concrete example? --sebastian On 05/03/2014 12:11 PM, Alessandro Suglia wrote: Good morning, I've tried to create a recommender system using Mahout in an implicit feedback situation. What I'm trying to do is explained exactly in this post on stack overflow: http://stackoverflow.com/questions/23077735/mahout-recommendation-in-implicit-feedback-situation As you can see, I'm having some problems with it, simply because I cannot get the result that I expect (a value between 0 and 1) when I try to predict a score for a specific item. Can someone here help me, please? Thank you in advance. Alessandro Suglia
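A side note for archive readers: for a top-n list you usually don't need estimatePreference at all; recommend() already returns items ranked by the internal score. A sketch continuing Alessandro's snippet above (userBased and firstUser as defined there; the score is a similarity sum, so it ranks fine even though it is not bounded to [0, 1]):

import java.util.List;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;

// Already ranked in decreasing order of the internal score.
List<RecommendedItem> topN = userBased.recommend(firstUser, 10);
for (RecommendedItem item : topN) {
  System.out.println(item.getItemID() + " scored " + item.getValue());
}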
Re: Fwd: Mahout Naive Bayes CSV Classification
Hi Jossef, You have to vectorize and normalize your data. The input for naive bayes is a sequence file containing a Text object as key (your label) and a VectorWritable that holds a vector with the data. Instructions to run NaiveBayes can be found here: https://mahout.apache.org/users/classification/bayesian.html --sebastian On 05/03/2014 07:40 PM, Jossef Harush wrote: I have these 2 CSV files: 1. train-set.csv 2. test-set.csv Both of them have the same structure (with different content), similar to this example (http://i.stack.imgur.com/jsckr.png): each column is a feature and the last column, class, is the name of the class to predict. *Can anyone please provide a sample code for:* 1. Initializing Naive Bayes with a CSV file (model creation, training, required pre-processing, etc...) 2. For a given CSV row - predicting a class Thanks! BTW - I'm using Mahout 0.9 and Hadoop 2.4 and I've already tried to follow these links: http://web.archiveorange.com/archive/v/y0uRZw9Q4iHdjrm4Rfsu http://chimpler.wordpress.com/2013/03/13/using-the-mahout-naive-bayes-classifier-to-automatically-classify-twitter-messages/
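A sketch of the vectorization step Sebastian describes, against the Hadoop 2.x and mahout-math 0.9 APIs (the output path, label, and feature values are placeholders; parsing the actual CSV is omitted):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.VectorWritable;

public class CsvToVectors {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path("train-vectors/part-m-00000"); // placeholder path

    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, out, Text.class, VectorWritable.class);
    try {
      // One record per CSV row: the class label as the Text key and the
      // feature columns as the vector. trainnb extracts the label from the
      // key, so check which key format your Mahout version expects.
      String label = "someClass";          // the last CSV column
      double[] features = {1.0, 0.0, 2.5}; // the remaining columns
      writer.append(new Text(label), new VectorWritable(new DenseVector(features)));
    } finally {
      writer.close();
    }
  }
}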
Re: Mahout recommendation in implicit feedback situation
Hi Alessandro, what result do you expect and what do you get? Can you give a concrete example? --sebastian On 05/03/2014 12:11 PM, Alessandro Suglia wrote: Good morning, I've tried to create a recommender system using Mahout in an implicit feedback situation. What I'm trying to do is explained exactly in this post on stack overflow: http://stackoverflow.com/questions/23077735/mahout-recommendation-in-implicit-feedback-situation As you can see, I'm having some problems with it, simply because I cannot get the result that I expect (a value between 0 and 1) when I try to predict a score for a specific item. Can someone here help me, please? Thank you in advance. Alessandro Suglia
Re: Mahout recommendation in implicit feedback situation
You should try the org.apache.mahout.cf.taste.impl.recommender.GenericBooleanPrefUserBasedRecommender which has been built to handle such data. Best, Sebastian On 05/03/2014 04:34 PM, Alessandro Suglia wrote: I have described it in the SO post: When I execute this code, the result is a list of 0.0 or 1.0, which is not useful in the context of top-n recommendation with implicit feedback. Simply because I have to obtain, for each item, an estimated rating in the range [0, 1] in order to rank the list in decreasing order and construct the top-n recommendation appropriately. On 05/03/14 16:25, Sebastian Schelter wrote: Hi Alessandro, what result do you expect and what do you get? Can you give a concrete example? --sebastian On 05/03/2014 12:11 PM, Alessandro Suglia wrote: Good morning, I've tried to create a recommender system using Mahout in an implicit feedback situation. What I'm trying to do is explained exactly in this post on stack overflow: http://stackoverflow.com/questions/23077735/mahout-recommendation-in-implicit-feedback-situation As you can see, I'm having some problems with it, simply because I cannot get the result that I expect (a value between 0 and 1) when I try to predict a score for a specific item. Can someone here help me, please? Thank you in advance. Alessandro Suglia
Re: Future of Frequent Pattern Mining
I don't think we have to extract the code; people can pull it out of the 0.9 release sources, which are in svn. We have not heard any opposition from production users of this code here, nor has someone stepped up to maintain this code (and we've asked for the second time), so let's finish what we already aimed for in the 0.9 release and remove it. I'll prepare a patch. --sebastian On 04/28/2014 10:52 AM, Ted Dunning wrote: One thought is to extract the code, publish on github with warnings about no support. Then if there are requests, we can point them to the GH archive and tell them to go for it. On Mon, Apr 28, 2014 at 10:03 AM, Suneel Marthi smar...@apache.org wrote: +100 to purging this from the codebase. This stuff uses the old MR api and would have to be upgraded, not to mention that this was removed from 0.9 and was restored only because one user wanted it, who promised to maintain it and has not been heard from. On Mon, Apr 28, 2014 at 2:19 AM, Sebastian Schelter s...@apache.org wrote: Hi, I'm resending this mail to also include the users list. To wrap up: We currently have a discussion whether our frequent pattern mining package should stay in the codebase. The original author suggested to remove the original implementation and maybe retain the FPGrowth2 implementation. I'd like to ask our users here for their opinion: is anybody opposed to removing the frequent pattern mining code from Mahout? Please shout out. --sebastian
Future of Frequent Pattern Mining
Hi, I'm resending this mail to also include the users list. To wrap up: We currently have a discussion whether our frequent pattern mining package should stay in the codebase. The original author suggested to remove the original implementation and maybe retain the FPGrowth2 implementation. I'd like to ask our users here for their opinion: is anybody opposed to removing the frequent pattern mining code from Mahout? Please shout out. --sebastian
Re: Future of Frequent Pattern Mining
Hi Michael, the problem is that currently nobody is maintaining the fpgrowth code anymore or working on documentation for it; that's why we consider it to be a candidate for removal. I don't see much value in keeping algorithms in the codebase if nobody is maintaining them, answering questions and providing documentation. If someone opposes here who has that code in production, that could be a reason to retain it, however. People wanting to use the code in the future can always download Mahout 0.9, which has the current implementation. --sebastian On 04/28/2014 08:23 AM, Michael Wechner wrote: what is the alternative, and if one would still want to use the frequent pattern mining code in the future, how would this be possible otherwise? Thanks Michael On 28.04.14 08:19, Sebastian Schelter wrote: Hi, I'm resending this mail to also include the users list. To wrap up: We currently have a discussion whether our frequent pattern mining package should stay in the codebase. The original author suggested to remove the original implementation and maybe retain the FPGrowth2 implementation. I'd like to ask our users here for their opinion: is anybody opposed to removing the frequent pattern mining code from Mahout? Please shout out. --sebastian
Re: Reading the wiki
Would someone be willing to open a jira ticket for this issue and fix the problem? --sebastian On 04/28/2014 01:05 AM, Ted Dunning wrote: Mathjax is both static content and a server. There is an FAQ about this https problem. I think that part of the issue is that they don't use the same URL for both http and https connections. http://www.mathjax.org/resources/faqs/#problem-https The URL that they suggest to use for getting mathjax.js is https://c328740.ssl.cf1.rackcdn.com/mathjax/latest/MathJax.js This is required because the rackspace cdn won't allow the http address to be used with https. Perversely, this https address also breaks when used with http. My guess is that if we update our css/headers/templates to use this https address then things will work. On Sun, Apr 27, 2014 at 11:59 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: i think we would have to host mathjax to appease the browsers under an https handshake. I am not sure what would be associated with that; I am not sure if mathjax is solely static content or an actual server doing something. On Sun, Apr 27, 2014 at 12:41 AM, Sebastian Schelter s...@apache.org wrote: What if we store a copy of the js file on our site and also serve it via https? On 04/27/2014 05:34 AM, Pat Ferrel wrote: Often CMSs have a way to configure https access to be used only for password or other secure areas of the site. No idea if the Apache CMS does this but worth asking. If there is no https fix, it seems like Mathjax should be discontinued. On Apr 26, 2014, at 8:03 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: I have no solution for https. It is most likely a security thing. I just asked that whoever writes the blog fix https links to simple unsecure ones. On Apr 26, 2014 6:24 PM, Andrew Musselman andrew.mussel...@gmail.com wrote: There was chat last week about this breaking, something about https vs http in the link to Mathjax as I recall. Dmitriy was dealing with it last I saw. On Apr 26, 2014, at 6:04 PM, Pat Ferrel p...@occamsmachete.com wrote: I probably missed some announcement but why is the math markup coming out raw? Do I need a plugin or something? \[\mathbf{G}=\mathbf{B}\mathbf{B}^{\top}-\mathbf{C}-\mathbf{C}^{\top}+\mathbf{s}_{q}\mathbf{s}_{q}^{\top}\boldsymbol{\xi}^{\top}\boldsymbol{\xi}\]
Re: Future of Frequent Pattern Mining
I'm very much in favor of this idea. On 04/28/2014 10:52 AM, Ted Dunning wrote: One thought is to extract the code, publish on github with warnings about no support. Then if there are requests, we can point them to the GH archive and tell them to go for it. On Mon, Apr 28, 2014 at 10:03 AM, Suneel Marthi smar...@apache.org wrote: +100 to purging this from the codebase. This stuff uses the old MR api and would have to be upgraded, not to mention that this was removed from 0.9 and was restored only because one user wanted it, who promised to maintain it and has not been heard from. On Mon, Apr 28, 2014 at 2:19 AM, Sebastian Schelter s...@apache.org wrote: Hi, I'm resending this mail to also include the users list. To wrap up: We currently have a discussion whether our frequent pattern mining package should stay in the codebase. The original author suggested to remove the original implementation and maybe retain the FPGrowth2 implementation. I'd like to ask our users here for their opinion: is anybody opposed to removing the frequent pattern mining code from Mahout? Please shout out. --sebastian
Re: Reading the wiki
What if we store a copy of the js file on our site and also serve it via https? On 04/27/2014 05:34 AM, Pat Ferrel wrote: Often CMSs have a way to configure https access to be used only for password or other secure areas of the site. No idea if the Apache CMS does this but worth asking. If there is no https fix, it seems like Mathjax should be discontinued. On Apr 26, 2014, at 8:03 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: I have no solution for https. It is most likely a security thing. I just asked that whoever writes the blog fix https links to simple unsecure ones. On Apr 26, 2014 6:24 PM, Andrew Musselman andrew.mussel...@gmail.com wrote: There was chat last week about this breaking, something about https vs http in the link to Mathjax as I recall. Dmitriy was dealing with it last I saw. On Apr 26, 2014, at 6:04 PM, Pat Ferrel p...@occamsmachete.com wrote: I probably missed some announcement but why is the math markup coming out raw? Do I need a plugin or something? \[\mathbf{G}=\mathbf{B}\mathbf{B}^{\top}-\mathbf{C}-\mathbf{C}^{\top}+\mathbf{s}_{q}\mathbf{s}_{q}^{\top}\boldsymbol{\xi}^{\top}\boldsymbol{\xi}\]
Welcome Pat Ferrel as new committer on Mahout
Hi, this is to announce that the Project Management Committee (PMC) for Apache Mahout has asked Pat Ferrel to become a committer, and we are pleased to announce that he has accepted. Being a committer enables easier contribution to the project, since in addition to posting patches on JIRA it also gives write access to the code repository. That also means that we now have yet another person who can commit patches submitted by others to our repo *wink* Pat, we look forward to working with you in the future. Welcome! It would be great if you could introduce yourself with a few words. -s
Re: Spark Mahout with a CLI?
I'll create a jira ticket for this, as I have a little time to work on it. On 04/16/2014 08:15 PM, Pat Ferrel wrote: bug in the pseudocode, should use columnIds: val hashedCrossIndicatorMatrix = new HashedSparseMatrix(indicatorMatrices(1), hashedDrms(0).columnIds(), hashedDrms(1).columnIds()) RecommendationExamplesHelper.saveIndicatorMatrix(hashedCrossIndicatorMatrix, "hdfs://some/path/for/output") On Apr 16, 2014, at 10:00 AM, Pat Ferrel p...@occamsmachete.com wrote: Great, and an excellent example is at hand. In it I will play the user and contributor role, Sebastian and Dmitriy the committer/scientist role. I have a web site that uses a Mahout+Solr recommender (the video recommender demo site). This creates logfiles of the form timestamp, userId, itemId, action: timestamp1, userIdString1, itemIdString1, "view" timestamp2, userIdString2, itemIdString1, "like" These are currently processed using the Solr-recommender example code and Hadoop Mahout. The input is split and accumulated into two matrices, which could then be input to the new Spark cooccurrence analysis code (see the patch here: https://issues.apache.org/jira/browse/MAHOUT-1464): val indicatorMatrices = cooccurrences(drmB, randomSeed = 0xdeadbeef, maxInterestingItemsPerThing = 100, maxNumInteractions = 500, Array(drmA)) What I propose to do is replace my Hadoop Mahout impl by creating a new Scala (or maybe Java) class, call it HashedSparseMatrix for now. There will be a CLI-accessible job that takes the above logfile input and creates a HashedSparseMatrix. Inside the HashedSparseMatrix will be a drm SparseMatrix and two hashed dictionaries for row and column external-Id-to-Mahout-Id lookup. The 'cooccurrences' call would be identical and the data it deals with would also be identical. But the HashedSparseMatrix would be able to deliver two dictionaries, which store the dimension lengths and are used to look up string Ids from internal Mahout ordinal integer Ids. These could be created with a helper function to read from logfiles: val hashedDrms = readHashedSparseMatrices("hdfs://path/to/input/logfiles", "^actions-.*", "\t", 1, 2, "like", "view") Here hashedDrms(0) is a HashedSparseMatrix corresponding to drmA, (1) = drmB. When the output is written to a text file it will be creating a new HashedSparseMatrix from the cooccurrences indicator matrix and the original itemId dictionaries: val hashedCrossIndicatorMatrix = new HashedSparseMatrix(indicatorMatrices(1), hashedDrms(0).rowIds(), hashedDrms(1).rowIds()) RecommendationExamplesHelper.saveIndicatorMatrix(hashedCrossIndicatorMatrix, "hdfs://some/path/for/output") Here the two Id dictionaries are used to create output file(s) with external Ids. Since I already have to do this for the demo site using Hadoop Mahout, I'll have to create a Spark impl of the wrapper for the new cross-cooccurrence indicator matrix. And since my scripting/web app language is not Scala, the format of the output needs to be text. I think this meets all issues raised here. No unnecessary import/export. Dmitriy doesn't need to write a CLI. Sebastian doesn't need to write a HashedSparseMatrix. The internal calculations are done on RDDs and the drms are never written to disk. AND the logfiles can be consumed directly, producing data that any language can consume directly, with external Ids used and preserved. BTW: in the MAHOUT-1464 example the drms are read in serially, single threaded, but written out using Spark (unless I missed something). In the proposed impl the read and write would be Sparkified.
BTW2: Since this is a CLI interface to Spark Mahout, it can be scheduled using cron directly, with no additional processing pipeline, and by people unfamiliar with Scala, the Spark shell, or internal Mahout Ids. Just as is done now on the demo site, but with a lot of non-Mahout code. BTW3: This type of thing IMO must be done for any Mahout job we want to be widely used. Otherwise we leave all of this wrapper code to be duplicated over and over again by users and expect them to know too much about Spark Mahout internals. On Apr 15, 2014, at 6:45 PM, Ted Dunning ted.dunn...@gmail.com wrote: Well... I think it is an issue that has to do with figuring out how to *avoid* import and export as much as possible. On Tue, Apr 15, 2014 at 6:36 PM, Pat Ferrel p...@occamsmachete.com wrote: Which is why it's an import/export issue. On Apr 15, 2014, at 5:48 PM, Ted Dunning ted.dunn...@gmail.com wrote: On Tue, Apr 15, 2014 at 10:58 AM, Pat Ferrel p...@occamsmachete.com wrote: As to the statement "There is not, nor do i think there will be a way to run this stuff with CLI" -- that seems unduly misleading. Really, does anyone second this? There will be Scala scripts to drive this stuff, and yes, even from the CLI. Do you imagine that every Mahout USER will be a Scala + Mahout DSL programmer? That may be fine for committers, but users will be PHP devs, Ruby
Re: org.apache.mahout.math.IndexException
Yes, it should give you the necessary information. The important part is this: Apply the patch with patch -p 0 -i <path to patch> Throw a --dry-run on there if you want to see what happens without screwing up your checkout. On 04/20/2014 09:47 PM, Mario Levitin wrote: Thanks Sebastian, I have not applied a patch before. I found the following page: http://mahout.apache.org/developers/patch-check-list.html Is that description enough for applying a patch? On Sat, Apr 19, 2014 at 2:23 AM, Sebastian Schelter s...@apache.org wrote: Mario, could you check whether the patch from https://issues.apache.org/jira/browse/MAHOUT-1517 fixes your problem? Best, Sebastian On 04/18/2014 11:03 PM, Mario Levitin wrote: In my dataset IDs are strings, so I use MemoryIDMigrator. This migrator produces large longs. I'm not doing any translation. I could not understand why there is a cast to int in the Mahout code. This will produce errors for large long values. On Fri, Apr 18, 2014 at 8:06 PM, Ted Dunning ted.dunn...@gmail.com wrote: Are you translating the IDs down into a range that will fit into ints? On Thu, Apr 17, 2014 at 3:02 PM, Mario Levitin mariolevi...@gmail.com wrote: Hi, I'm trying to run the ALS algorithm. However, I get the following error: Exception in thread "pool-1-thread-3" org.apache.mahout.math.IndexException: Index -691877539 is outside allowable range of [0,2147483647) at org.apache.mahout.math.AbstractVector.set(AbstractVector.java:395) at org.apache.mahout.cf.taste.impl.recommender.svd.ALSWRFactorizer.sparseUserRatingVector(ALSWRFactorizer.java:305) At line 305 in ALSWRFactorizer.java, there is the following code: ratings.set((int) preference.getItemID(), preference.getValue()); My suspicion is that the error results from the cast to int in the above line. Item IDs in Mahout are long, so if you cast a long (which does not fit into an int) then you will get negative numbers and hence the error. However, this explanation also seems implausible to me, since I don't think such an error exists in Mahout code. Any help will be appreciated. Thanks
Re: simple idea for improving mahout docs over the next month?
Hm, I'm not so sure whether introducing another source for documentation besides the webpage would be so helpful (there's still lots of work to do on the website...), how do others see this? --sebastian On 04/17/2014 05:06 PM, Jay Vyas wrote: Hi Sebastian: theoretically, one could extract all the information from a mailing list search, but I think a rolling FAQ would much more (1) be likely to evolve into real documentation and (2) be more easily refined. Is that a little convincing? If not, I guess we can table the idea... just a thought. On Thu, Apr 17, 2014 at 1:38 AM, Sebastian Schelter s...@apache.org wrote: Hi Jay, I'm not sure what the benefit of this approach is; people can already post their questions to the mailing list and get answers here, why would a google doc be helpful? --sebastian On 04/16/2014 09:31 PM, Jay Vyas wrote: hi mahout... i finally thought of a really easy way of ad-hoc improvement of mahout docs, that can feed into the efforts to get formal docs improved. Any interest in creating a shared mahout FAQ file in a google doc? we can easily start adding questions into it that point to obvious missing documentation parts, and mahout committers can add responses below inline. then over time we can take those questions/answers and turn them directly into real docs. I think this will make it easier for a broader range of people to rapidly improve mahout docs in an ad hoc sort of way. i for one will volunteer to help translate the QA stream into real documentation / JIRAs etc.
Re: Performance Issue using item-based approach!
You can, but you shouldn't :) On 04/18/2014 07:23 PM, Ted Dunning wrote: You can always run Hadoop in a local mode. Nothing prevents a single node from being a cluster. :-) On Thu, Apr 17, 2014 at 7:43 AM, Najum Ali naju...@googlemail.com wrote: Ted, Is it also possible to use ItemSimilarityJob in a non-distributed environment? On 17.04.2014 at 16:22, Ted Dunning ted.dunn...@gmail.com wrote: Najum, You should also be able to use the ItemSimilarityJob to compute a limited indicator set. This is stepping off of the path you have been on, but it would allow you to deploy the recommender via a search engine. That makes a lot of code simply vanish. This is also a well-trod production path. On Thu, Apr 17, 2014 at 3:57 AM, Najum Ali naju...@googlemail.com wrote: @Sebastian wow ... you are right. The original csv file is about 21mb and the corresponding precomputed item-item similarity file is about 260mb!! And yes, there are far more than 50 "most similar items" for an item. Trying to restrict this to the 50 (or something like that) most similar items for an item could do the trick, as you said. Ok, I will give it a try and reply later. By the way, what about the SamplingCandidateItemsStrategy or something like this, by using this constructor: GenericItemBasedRecommender(DataModel dataModel, ItemSimilarity similarity, CandidateItemsStrategy candidateItemsStrategy, MostSimilarItemsCandidateItemsStrategy mostSimilarItemsCandidateItemsStrategy) On 17.04.2014 at 12:41, Sebastian Schelter s...@apache.org wrote: Hi Najum, I think I found the problem. Remember: Two items are similar whenever at least one user interacted with both of them (the items co-occur). In the movielens dataset this is true for almost all pairs of items, unfortunately. From 3076 items, more than 11 million similarities are created. A common approach for that (which is not yet implemented in our precomputation, unfortunately) is to only retain the top-k similar items per item. A solution would be to take the csv file that is created by the MultithreadedBatchItemSimilarities and postprocess it so that only the 50 most similar items per item are retained. That should help with your problem. Unfortunately, we don't have code for that yet, maybe you want to try to write that yourself? Best, Sebastian PS: The user-based recommender restricts the number of similar users, I guess that's why it is so fast here.
On 04/17/2014 12:18 PM, Najum Ali wrote: Ok, here you go: I have created a simple class with a main method (no server and other stuff):

public class RecommenderTest {

  public static void main(String[] args) throws IOException, TasteException {
    DataModel dataModel = new FileDataModel(new File("/Users/najum/Documents/recommender-console/src/main/webapp/resources/preference_csv/1mil.csv"));
    ItemSimilarity similarity = new LogLikelihoodSimilarity(dataModel);
    ItemBasedRecommender recommender = new GenericItemBasedRecommender(dataModel, similarity);
    String pathToPreComputedFile = preComputeSimilarities(recommender, dataModel.getNumItems());
    InputStream inputStream = new FileInputStream(new File(pathToPreComputedFile));
    BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(inputStream));
    Collection<GenericItemSimilarity.ItemItemSimilarity> correlations = bufferedReader.lines().map(mapToItemItemSimilarity).collect(Collectors.toList());
    ItemSimilarity precomputedSimilarity = new GenericItemSimilarity(correlations);
    ItemBasedRecommender recommenderWithPrecomputation = new GenericItemBasedRecommender(dataModel, precomputedSimilarity);
    recommend(recommender);
    recommend(recommenderWithPrecomputation);
  }

  private static String preComputeSimilarities(ItemBasedRecommender recommender, int simItemsPerItem) throws TasteException {
    String pathToAbsolutePath = "";
    try {
      File resultFile = new File(System.getProperty("java.io.tmpdir"), "similarities.csv");
      if (resultFile.exists()) {
        resultFile.delete();
      }
      BatchItemSimilarities batchJob = new MultithreadedBatchItemSimilarities(recommender, simItemsPerItem);
      int numSimilarities
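Sebastian's suggested postprocessing step never made it into the thread; a sketch of what it could look like, assuming the similarities csv holds item1,item2,similarity rows (the file names are placeholders; Java 8 matches the setup Najum describes):

import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TopKSimilarities {

  public static void main(String[] args) throws IOException {
    int k = 50;
    // itemId -> its similarity rows, so each list can be capped at k.
    Map<String, List<String[]>> byItem = new HashMap<>();
    for (String line : Files.readAllLines(Paths.get("similarities.csv"))) {
      String[] fields = line.split(","); // item1,item2,similarity
      byItem.computeIfAbsent(fields[0], i -> new ArrayList<>()).add(fields);
    }

    try (PrintWriter out = new PrintWriter("similarities-top50.csv")) {
      for (List<String[]> rows : byItem.values()) {
        // Highest similarity first; keep at most k rows per item.
        rows.sort((a, b) -> Double.compare(
            Double.parseDouble(b[2]), Double.parseDouble(a[2])));
        for (String[] row : rows.subList(0, Math.min(k, rows.size()))) {
          out.println(String.join(",", row));
        }
      }
    }
  }
}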
Re: Installation on Ubuntu
Which version do you use? It shouldn't be a problem with Oracle Java. --sebastian On 04/18/2014 09:39 PM, Christopher Eugene wrote: Hello, I want to install mahout on Ubuntu 14.04. I had previously tried in vain to install on 13.10. Could the version of Java be the problem? I am compiling from source. Any help will be appreciated.
Re: Installation on Ubuntu
That is wrong, but you could use a server such as PredictionIO (which uses Mahout internally) with PHP. --sebastian On 04/18/2014 09:49 PM, Christopher Eugene wrote: @sebastian I have version 1.7. @Andrew I plan on using mahout with php since I heard that there is a new API, or am I wrong? On Fri, Apr 18, 2014 at 10:45 PM, Andrew Musselman andrew.mussel...@gmail.com wrote: I would say if you want to get started, just grab the pre-built version via the download button on the home page of http://mahout.apache.org E.g., following those links you would end up here: http://apache.cs.utah.edu/mahout/0.9 and then get either the -src or non--src version and use the pre-built jars and examples. On Fri, Apr 18, 2014 at 12:39 PM, Christopher Eugene xriseug...@gmail.com wrote: Hello, I want to install mahout on Ubuntu 14.04. I had previously tried in vain to install on 13.10. Could the version of Java be the problem? I am compiling from source. Any help will be appreciated. -- Omar Christopher Eugene http://about.me/mojo706
Re: Installation on Ubuntu
You can, but I'm not sure how much we can help you. Give it a try :) On 04/18/2014 10:11 PM, Christopher Eugene wrote: sorry, I thought I replied to it :). I can ask predictionio-related questions on the list too? On Fri, Apr 18, 2014 at 11:06 PM, Sebastian Schelter s...@apache.org wrote: Please reply to the list, not to me in person :) On 04/18/2014 10:05 PM, Christopher Eugene wrote: Thank you Sebastian, I could've sworn I saw something involving mahout and php not so long ago. Quick question: are all the methods available in mahout available on PredictionIO? On Fri, Apr 18, 2014 at 10:53 PM, Sebastian Schelter s...@apache.org wrote: That is wrong, but you could use a server such as PredictionIO (which uses Mahout internally) with PHP. --sebastian On 04/18/2014 09:49 PM, Christopher Eugene wrote: @sebastian I have version 1.7. @Andrew I plan on using mahout with php since I heard that there is a new API, or am I wrong? On Fri, Apr 18, 2014 at 10:45 PM, Andrew Musselman andrew.mussel...@gmail.com wrote: I would say if you want to get started, just grab the pre-built version via the download button on the home page of http://mahout.apache.org E.g., following those links you would end up here: http://apache.cs.utah.edu/mahout/0.9 and then get either the -src or non--src version and use the pre-built jars and examples. On Fri, Apr 18, 2014 at 12:39 PM, Christopher Eugene xriseug...@gmail.com wrote: Hello, I want to install mahout on Ubuntu 14.04. I had previously tried in vain to install on 13.10. Could the version of Java be the problem? I am compiling from source. Any help will be appreciated. -- Omar Christopher Eugene http://about.me/mojo706
Re: org.apache.mahout.math.IndexException
Hi Mario, this is indeed a bug. The problem is that the CF code (taste) uses long ids, while our math library internally uses int keys. I'll open a jira and post a patch that will hopefully help you. --sebastian On 04/18/2014 11:03 PM, Mario Levitin wrote: In my dataset IDs are strings, so I use MemoryIDMigrator. This migrator produces large longs. I'm not doing any translation. I could not understand why there is a cast to int in the Mahout code. This will produce errors for large long values. On Fri, Apr 18, 2014 at 8:06 PM, Ted Dunning ted.dunn...@gmail.com wrote: Are you translating the IDs down into a range that will fit into ints? On Thu, Apr 17, 2014 at 3:02 PM, Mario Levitin mariolevi...@gmail.com wrote: Hi, I'm trying to run the ALS algorithm. However, I get the following error: Exception in thread "pool-1-thread-3" org.apache.mahout.math.IndexException: Index -691877539 is outside allowable range of [0,2147483647) at org.apache.mahout.math.AbstractVector.set(AbstractVector.java:395) at org.apache.mahout.cf.taste.impl.recommender.svd.ALSWRFactorizer.sparseUserRatingVector(ALSWRFactorizer.java:305) At line 305 in ALSWRFactorizer.java, there is the following code: ratings.set((int) preference.getItemID(), preference.getValue()); My suspicion is that the error results from the cast to int in the above line. Item IDs in Mahout are long, so if you cast a long (which does not fit into an int) then you will get negative numbers and hence the error. However, this explanation also seems implausible to me, since I don't think such an error exists in Mahout code. Any help will be appreciated. Thanks
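For anyone hitting this before the patch lands: what Ted's question hints at is mapping arbitrary external ids into a dense int range yourself and keeping the reverse mapping for output. A plain-Java sketch (the class and method names are made up):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Maps arbitrary external long ids into the dense [0, n) int range that
// mahout-math vectors can index, and back again for writing results.
public class IdIndex {

  private final Map<Long, Integer> toIndex = new HashMap<Long, Integer>();
  private final List<Long> toId = new ArrayList<Long>();

  public int index(long externalId) {
    Integer idx = toIndex.get(externalId);
    if (idx == null) {
      idx = toId.size();
      toIndex.put(externalId, idx);
      toId.add(externalId);
    }
    return idx;
  }

  public long externalId(int index) {
    return toId.get(index);
  }
}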
Re: org.apache.mahout.math.IndexException
Mario, could you check whether the patch from https://issues.apache.org/jira/browse/MAHOUT-1517 fixes your problem? Best, Sebastian On 04/18/2014 11:03 PM, Mario Levitin wrote: In my dataset ID's are strings so I use MemoryIDMigrator. This migrator produces large longs. I'm not doing any translation. I could not understand why there is a cast to int in the Mahout code. This will produce errors for large long values. On Fri, Apr 18, 2014 at 8:06 PM, Ted Dunning ted.dunn...@gmail.com wrote: Are you translating the ID's down into a range that will fit into int's? On Thu, Apr 17, 2014 at 3:02 PM, Mario Levitin mariolevi...@gmail.com wrote: Hi, I'm trying to run the ALS algorithm. However, I get the following error: Exception in thread pool-1-thread-3 org.apache.mahout.math.IndexException: Index -691877539 is outside allowable range of [0,2147483647) at org.apache.mahout.math.AbstractVector.set(AbstractVector.java:395) at org.apache.mahout.cf.taste.impl.recommender.svd.ALSWRFactorizer.sparseUserRatingVector(ALSWRFactorizer.java:305) At line 305 in ALSWRFactorizer.java, there is the following code ratings.set((int) preference.getItemID(), preference.getValue()); My suspicion is that the error results from the casting to int in the above line. Item IDs in mahout are long, so if you cast a long (which does not fit into an int) then you will get negative numbers and hence the error. However, this explanation also seems to me implausible since I don't think such an error exists in Mahout code. Any help will be appreciated. Thanks
Re: simple idea for improving mahout docs over the next month?
Hi Najum, please write a new mail to ask a question and don't reply to an unrelated thread -- https://people.apache.org/~hossman/#threadhijack If you write a new mail, I'm sure we can help you with your recommender problem. Can you give us a few more details, such as the similarity that you used, how you did the precomputation and how exactly you measure the response time? --sebastian On 04/17/2014 10:49 AM, Najum Ali wrote: Hi guys, I'm pretty much new to mahout and I'm working on this problem here: I have created a precomputed item-item-similarity collection for a GenericItemBasedRecommender. Using the 1M MovieLens data, my item-based recommender is only 40-50% faster than without precomputation (like 589.5ms instead of 1222.9ms). But the user-based recommender instead is really fast, it's like 24.2ms? How can this happen? Why is item-based so slow?
Re: Performance Issue using item-based approach!
Could you take the output of the precomputation, feed it into a standalone recommender and test it there? On 04/17/2014 11:37 AM, Najum Ali wrote: @sebastian "Are you sure that the precomputation is done only once and not in every request?" Yes, a @Bean annotated Object is in Spring per default a singleton instance. I also just tested it out using a System.out.println(). Here is my log: System.out.println("precomputation done!") is called before returning the GenericItemSimilarity. The first two recommendations are item-based - pearson similarity. The third and 4th log are also item-based using the precomputed similarity. The last log is the user-based recommender using pearson. Look at the huge time difference! On 17.04.2014 at 11:23, Sebastian Schelter s...@apache.org wrote: Najum, this is really strange, feeding an item-based recommender with precomputed similarities should give you superfast recommendations. Are you sure that the precomputation is done only once and not in every request? --sebastian On 04/17/2014 11:17 AM, Najum Ali wrote: Hi guys, I have created a precomputed item-item-similarity collection for a GenericItemBasedRecommender. Using the 1M MovieLens data, my item-based recommender is only 40-50% faster than without precomputation (like 589.5ms instead of 1222.9ms). But the user-based recommender instead is really fast, it's like 24.2ms? How can this happen? Here are more details on my implementation: CSV file: 1M prefs, 6040 users, 3706 items. For my implementation I'm using screenshots, because of the good highlighting. My recommender runs inside a webserver (Jetty) using Spring 4 and Java 8. I receive recommendations as a webservice (JSON). For the DataModel, I'm using FileDataModel. The code below creates a precomputed ItemSimilarity when I start the webserver and the property isItemPreComputationEnabled is set to true: For time measuring I'm using AOP. I'm measuring the whole time from entering my Controller to sending the response, based on System.nanoTime() and getting the diff. It's the same time measure for user-based. I have tried to cache the recommender and the similarity with no big difference. I also tried to use CandidateItemsStrategy and MostSimilarItemsCandidateItemsStrategy, but also no performance boost.

public RecommenderBuilder createRecommenderBuilder(ItemSimilarity similarity) throws TasteException {
  final int numberOfUsers = dataModel.getNumUsers();
  final int numberOfItems = dataModel.getNumItems();
  CandidateItemsStrategy candidateItemsStrategy = new SamplingCandidateItemsStrategy(numberOfUsers, numberOfItems);
  MostSimilarItemsCandidateItemsStrategy mostSimilarStrategy = new SamplingCandidateItemsStrategy(numberOfUsers, numberOfItems);
  return model -> new GenericItemBasedRecommender(model, similarity, candidateItemsStrategy, mostSimilarStrategy);
}

I don't know why item-based is taking so much longer than user-based. User-based is like fast as hell. I even tried a dataset using 100k prefs, and 10 million (MovieLens). Every time the user-based is so much faster for any similarity. Hope anyone can help me understand this. Maybe I'm doing something wrong. Thanks!! :))
Re: Performance Issue using item-based approach!
Yes, just to make sure the problem is in the mahout code and not in the surrounding environment. On 04/17/2014 11:43 AM, Najum Ali wrote: @Sebastian What do you mean by a standalone recommender? A simple offline java main program? On 17.04.2014 at 11:41, Sebastian Schelter s...@apache.org wrote: Could you take the output of the precomputation, feed it into a standalone recommender and test it there? On 04/17/2014 11:37 AM, Najum Ali wrote: @sebastian "Are you sure that the precomputation is done only once and not in every request?" Yes, a @Bean annotated Object is in Spring per default a singleton instance. I also just tested it out using a System.out.println(). Here is my log: System.out.println("precomputation done!") is called before returning the GenericItemSimilarity. The first two recommendations are item-based - pearson similarity. The third and 4th log are also item-based using the precomputed similarity. The last log is the user-based recommender using pearson. Look at the huge time difference! On 17.04.2014 at 11:23, Sebastian Schelter s...@apache.org wrote: Najum, this is really strange, feeding an item-based recommender with precomputed similarities should give you superfast recommendations. Are you sure that the precomputation is done only once and not in every request? --sebastian On 04/17/2014 11:17 AM, Najum Ali wrote: Hi guys, I have created a precomputed item-item-similarity collection for a GenericItemBasedRecommender. Using the 1M MovieLens data, my item-based recommender is only 40-50% faster than without precomputation (like 589.5ms instead of 1222.9ms). But the user-based recommender instead is really fast, it's like 24.2ms? How can this happen? Here are more details on my implementation: CSV file: 1M prefs, 6040 users, 3706 items. For my implementation I'm using screenshots, because of the good highlighting. My recommender runs inside a webserver (Jetty) using Spring 4 and Java 8. I receive recommendations as a webservice (JSON). For the DataModel, I'm using FileDataModel. The code below creates a precomputed ItemSimilarity when I start the webserver and the property isItemPreComputationEnabled is set to true: For time measuring I'm using AOP. I'm measuring the whole time from entering my Controller to sending the response, based on System.nanoTime() and getting the diff. It's the same time measure for user-based. I have tried to cache the recommender and the similarity with no big difference. I also tried to use CandidateItemsStrategy and MostSimilarItemsCandidateItemsStrategy, but also no performance boost.

public RecommenderBuilder createRecommenderBuilder(ItemSimilarity similarity) throws TasteException {
  final int numberOfUsers = dataModel.getNumUsers();
  final int numberOfItems = dataModel.getNumItems();
  CandidateItemsStrategy candidateItemsStrategy = new SamplingCandidateItemsStrategy(numberOfUsers, numberOfItems);
  MostSimilarItemsCandidateItemsStrategy mostSimilarStrategy = new SamplingCandidateItemsStrategy(numberOfUsers, numberOfItems);
  return model -> new GenericItemBasedRecommender(model, similarity, candidateItemsStrategy, mostSimilarStrategy);
}

I don't know why item-based is taking so much longer than user-based. User-based is like fast as hell. I even tried a dataset using 100k prefs, and 10 million (MovieLens). Every time the user-based is so much faster for any similarity. Hope anyone can help me understand this. Maybe I'm doing something wrong. Thanks!! :))
Re: Is there any website documentation repository or tool for Apache Mahout?
The templates for the individual pages are in the svn under site/ in markdown format. You can use an online markdown editor to get an approximate idea of how they will look. We don't have a better solution yet, unfortunately. --sebastian Am 17.04.2014 20:09 schrieb Andrew Musselman andrew.mussel...@gmail.com: The content of the main part of each page is written in markdown and parsed by the CMS to render the HTML. I'm not aware of a way to submit pages except as patches. On Apr 17, 2014, at 1:52 PM, Pat Ferrel p...@occamsmachete.com wrote: +1 the project uses Confluence for the wiki. All but committers are blocked from editing pages. This is getting increasingly frustrating. How many tickets and patches are being passed around now? I can't follow them all. I haven't used Confluence for 4-5 years now, but there must be some way to allow edits and new pages from anyone, pending approval to publish? On Apr 17, 2014, at 4:47 AM, tuxdna tux...@gmail.com wrote: I have seen the instructions here [1], but I am not sure if there is any source code for the website documentation. So here are my questions: * Does the Apache Mahout project use any tool to generate the website documentation as it is now at http://mahout.apache.org ? * Suppose I want to add some correction or addition to the current Apache Mahout documentation. Can I get read-only access to the source of the website, so that I can immediately see how the edits will look once they are accepted? I was thinking in terms of the way GitHub Pages [2] work. For example, if I use Jekyll [3], I can view the changes on my machine exactly as they will appear on the final website. Regards, Saleem [1] http://mahout.apache.org/developers/how-to-update-the-website.html [2] https://pages.github.com/ [3] http://jekyllrb.com/
Re: simple idea for improving mahout docs over the next month?
Hi Jay, I'm not sure what the benefit of this approach is, people can already post their questions to the mailing list and get answers here, why would a Google doc be helpful? --sebastian On 04/16/2014 09:31 PM, Jay Vyas wrote: Hi mahout... I finally thought of a really easy way of ad-hoc improvement of mahout docs, that can feed into the efforts to get the formal docs improved. Any interest in creating a shared mahout FAQ file in a Google doc? We can easily start adding questions into it that point to obvious missing documentation parts, and mahout committers can add responses below inline. Then over time we can take those questions/answers and turn them directly into real docs. I think this will make it easier for a broader range of people to rapidly improve the mahout docs in an ad-hoc sort of way. I for one will volunteer to help translate the Q&A stream into real documentation / JIRAs etc.
Documentation, Documentation, Documentation
Hi, this is another reminder that we still have to finish our documentation improvements! The website looks shiny now and there have been lots of discussions about new directions, but we still have some work to do in cleaning up webpages. We should especially make sure that the examples work. Please help with that; anyone who is willing to sacrifice some time to go through a page and try out the steps described is of great help to the project. It would also be awesome to get some help in creating a few new pages, especially for the recommenders. Here's the list of documentation-related jiras for 1.0: https://issues.apache.org/jira/browse/MAHOUT-1441?jql=project%20%3D%20MAHOUT%20AND%20component%20%3D%20Documentation%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20due%20ASC%2C%20priority%20DESC%2C%20created%20ASC Best, Sebastian
Re: PreferenceArray userID uniqueness?
Yes, it's a unique identifier for a user. --sebastian On 04/11/2014 04:41 PM, Mike Summers wrote: Does the userID of a PreferenceArray need to be unique across all entries in a FastByIDMap? I'm comparing two types of objects that contain the same set of traits, however it's possible that the userID (primary key) is not unique, as it comes from two db tables. Thanks.
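If the raw primary keys can collide across the two tables, one simple workaround (a sketch, not the only option) is to shift one table's IDs into a disjoint range before building the FastByIDMap. The offset below is an arbitrary assumption that just has to exceed the largest ID in the first table:

import org.apache.mahout.cf.taste.impl.common.FastByIDMap;
import org.apache.mahout.cf.taste.impl.model.GenericDataModel;
import org.apache.mahout.cf.taste.impl.model.GenericUserPreferenceArray;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.model.PreferenceArray;

public class DisjointIdExample {

  // Shift the second table's primary keys into their own range so the
  // combined key space stays unique; 1,000,000,000 is a made-up offset.
  private static final long TABLE_B_OFFSET = 1_000_000_000L;

  public static DataModel combine() {
    FastByIDMap<PreferenceArray> userData = new FastByIDMap<>();

    PreferenceArray fromTableA = new GenericUserPreferenceArray(1);
    fromTableA.setUserID(0, 42L);                  // ID as stored in table A
    fromTableA.setItemID(0, 7L);
    fromTableA.setValue(0, 3.0f);
    userData.put(42L, fromTableA);

    PreferenceArray fromTableB = new GenericUserPreferenceArray(1);
    fromTableB.setUserID(0, 42L + TABLE_B_OFFSET); // same raw ID, different table
    fromTableB.setItemID(0, 7L);
    fromTableB.setValue(0, 5.0f);
    userData.put(42L + TABLE_B_OFFSET, fromTableB);

    return new GenericDataModel(userData);
  }
}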
Re: Best practice for partial cartesian product
I don't know a good name for that. The problem is that a quadratic amount of pairs needs to be emitted here. In our collaborative filtering code, we solve this through downsampling. --sebastian On 04/08/2014 10:08 AM, Reinis Vicups wrote: Hi, this is not a mahout question directly, but I figured that you guys most likely can answer it. Actually I have two questions: 1. This: {(1,2); (1,3); (2,3)} is not the full cartesian product, right? It is missing (1,1); (2,2); (3,3); (2,1); My question is - what is it called? Partial cartesian? Asymmetric cartesian? 2. If I try to build the product I described above in a reducer, what would be the best practice? My current code looks like this: @Override public void reduce(final VarLongWritable key, final Iterable<VarLongWritable> values, final Context context) { final VarLongWritable[] valueArray = Iterables.toArray(values, VarLongWritable.class); for (int i = 0; i < valueArray.length; i++) { for (int j = i + 1; j < valueArray.length; j++) { context.write(new PairWritable(valueArray[i].get(), valueArray[j].get()), customerPreferenceWritable); } } } I don't feel quite right with this solution since I make a copy of the values in valueArray and believe that it will cost me OutOfMemoryExceptions with larger data sets. thanks and br reinis
Re: Best practice for partial cartesian product
Have a look at the sampleDown method in RowSimilarityJob: https://svn.apache.org/viewvc/mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/cooccurrence/RowSimilarityJob.java?view=markup On 04/08/2014 10:33 AM, Reinis Vicups wrote: Sebastian, thank you very much for your response. Could you or anyone point me to the mahout classes where this is being solved? thank you guys reinis On 08.04.2014 10:27, Sebastian Schelter wrote: I don't know a good name for that. The problem is that a quadratic amount of pairs needs to be emitted here. In our collaborative filtering code, we solve this through downsampling. --sebastian
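In the spirit of that sampleDown method, a sketch of the same pairing reducer with a cap: keys with too many values are reservoir-sampled before the quadratic pair emission, which bounds both heap usage and output size. The cap, the class name, and the Text output (standing in for the original PairWritable) are assumptions:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.mahout.math.VarLongWritable;

public class SampledPairingReducer
    extends Reducer<VarLongWritable, VarLongWritable, Text, NullWritable> {

  // Cap chosen arbitrarily for the sketch; tune it to your data.
  private static final int MAX_VALUES_PER_KEY = 500;

  private final Random random = new Random();

  @Override
  protected void reduce(VarLongWritable key, Iterable<VarLongWritable> values, Context context)
      throws IOException, InterruptedException {
    // Reservoir sampling: keep at most MAX_VALUES_PER_KEY elements in memory,
    // which also bounds the number of emitted pairs per key.
    List<Long> sample = new ArrayList<>(MAX_VALUES_PER_KEY);
    int seen = 0;
    for (VarLongWritable value : values) {
      seen++;
      if (sample.size() < MAX_VALUES_PER_KEY) {
        sample.add(value.get());
      } else {
        int pos = random.nextInt(seen);
        if (pos < MAX_VALUES_PER_KEY) {
          sample.set(pos, value.get());
        }
      }
    }
    // Emit the "asymmetric" pairs (i, j) with i < j, as in the original code.
    for (int i = 0; i < sample.size(); i++) {
      for (int j = i + 1; j < sample.size(); j++) {
        context.write(new Text(sample.get(i) + "," + sample.get(j)), NullWritable.get());
      }
    }
  }
}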
Re: Can anyone help
It seems there is a problem with your hdfs, how did you configure that? --sebastian On 04/08/2014 07:23 PM, Neetha wrote: Hi, I am trying to run Mahout -kmeans clustering on hadoop, but I am getting this error, hduser3@ubuntu:/usr/local/hadoop-1.0.1/mahout3$ bin/mahout seqdirectory \-i mahout-work/reuters-out \-o mahout-work/reuters-out-seqdir \-c UTF-8 -chunk 5 Warning: $HADOOP_HOME is deprecated. hduser3@ubuntu:/usr/local/hadoop-1.0.1/mahout3$ bin/mahout seqdirectory \-i mahout-work/reuters-out \-o mahout-work/reuters-out-seqdir \-c UTF-8 -chunk 5 Warning: $HADOOP_HOME is deprecated. Running on hadoop, using /usr/local/hadoop-1.0.1/bin/hadoop and HADOOP_CONF_DIR= MAHOUT-JOB: /usr/local/hadoop-1.0.1/mahout3/examples/target/ mahout-examples-0.7-job.jar Warning: $HADOOP_HOME is deprecated. 14/04/07 12:10:14 INFO common.AbstractJob: Command line arguments: {--charset=[UTF-8], --chunkSize=[5], --endPhase=[2147483647], --fileFilterClass=[org.apache.mahout.text.PrefixAdditionFilter], --input=[mahout-work/reuters-out], --keyPrefix=[], --output=[mahout-work/reuters-out-seqdir], --startPhase=[0], --tempDir=[temp]} 14/04/07 12:10:15 WARN hdfs.DFSClient: DataStreamer Exception: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/hduser3/mahout-work/reuters-out-seqdir/chunk-0 could only be replicated to 0 nodes, instead of 1 at org.apache.hadoop.hdfs.server.namenode.FSNamesystem. getAdditionalBlock(FSNamesystem.java:1556) at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock( NameNode.java:696) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke( NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke( DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:616) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:416) at org.apache.hadoop.security.UserGroupInformation.doAs( UserGroupInformation.java:1093) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382) at org.apache.hadoop.ipc.Client.call(Client.java:1066) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225) at $Proxy1.addBlock(Unknown Source) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke( NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke( DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:616) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod( RetryInvocationHandler.java:82) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke( RetryInvocationHandler.java:59) at $Proxy1.addBlock(Unknown Source) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream. locateFollowingBlock(DFSClient.java:3507) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream. nextBlockOutputStream(DFSClient.java:3370) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream. access$2700(DFSClient.java:2586) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ DataStreamer.run(DFSClient.java:2826) 14/04/07 12:10:15 WARN hdfs.DFSClient: Error Recovery for block null bad datanode[0] nodes == null 14/04/07 12:10:15 WARN hdfs.DFSClient: Could not get block locations. Source file /user/hduser3/mahout-work/reuters-out-seqdir/chunk-0 - Aborting... 
Apr 7, 2014 12:10:15 PM com.google.common.io.Closeables close WARNING: IOException thrown while closing Closeable. org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/hduser3/mahout-work/reuters-out-seqdir/chunk-0 could only be replicated to 0 nodes, instead of 1 at org.apache.hadoop.hdfs.server.namenode.FSNamesystem. getAdditionalBlock(FSNamesystem.java:1556) at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock( NameNode.java:696) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke( NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke( DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:616) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:416) at org.apache.hadoop.security.UserGroupInformation.doAs( UserGroupInformation.java:1093) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382) at
Re: Solr+Mahout Recommender Demo Site
The top 3 recommendations based on videos you liked are very good! Nice job. On 04/06/2014 07:26 PM, Pat Ferrel wrote: After having integrated several versions of the Mahout and Myrrix recommenders at fairly large scale, I was interested in solving three problems that these did not directly provide for: 1) realtime queries for recs using data not yet incorporated into the training set. Myrrix allows this but Mahout using the hadoop mr version does not. 2) cross-recommendations from two or more action types (say purchase and detail-view) 3) blending metadata and user preference data to return recs (for example category user preferences = recs) Using Solr + Mahout provided an amazingly flexible and performant way to do this. Ted wrote about his experience with this basic approach in his recent book. Take user preferences, run them through RowSimilarityJob, and you get an item-by-item similarity matrix. This is the core of an item-based cooccurrence recommender. If you take the similarity matrix and convert it into a list of tokens per row, you have something Solr can index. If you then use a user's history as a query on the indexed data, you get an ordered list of recommendations. When I set out to do #1 and #3, the need for CF data AND metadata was the first problem. So I mined the web for video reviews and video metadata. Then logging any users who visit the site will lead to data for #2 and #1. The demo site is https://guide.finderbots.com and instructions are at the end of this for anyone who would like to test it out. As a crude user test there is a procedure we ask you to follow to help gather quality-of-recommendations data. It's running out of my closet over Comcast, so if it's down I may have tripped over a cord; sorry, try again later. There are a bunch of different methods for making recs illustrated on the site. One method that illustrates blending metadata uses preference data from you, and metadata to bias and filter recs. Imagine that you have trained the system with your preferences by making some video picks. Now imagine you'd like to get recommendations for Comedies from Netflix based on your previous video preferences. This is done with a single Solr query on indexed video fields that hold genre, similar videos (from the similarity matrix), and sources. The query finds similar videos to the ones you have liked, with the genre "Comedy" boosted by some amount, but only those that have at least one source = "Netflix". I'll be doing some blog posts covering the specifics of how each rec type is done, the site and DB architecture, and the Solr setup. The project uses the Solr recommender prep code here: https://github.com/pferrel/solr-recommender BTW I plan to publish obfuscated usage data in the github repo. begin form letter === Please use a very newly updated browser (latest Firefox, Chrome, Safari, and nothing older than IE10); the site doesn't yet check browser compatibility but relies on HTML5 and CSS3 rather heavily. 1) go to https://guide.finderbots.com/users/sign_up to create an account 2) go to https://guide.finderbots.com/trainers to 'train' the recommender: hit thumbs up on videos you like. There are 20 pages of training videos; you can leave at any time, but if you can go through them all it would be appreciated. 3) go to https://guide.finderbots.com/guides/recommend to immediately get personalized recs from your training data. If you completed the trainer, check the top line of recs and count how many are videos you liked or would like to see.
Scroll right or left to see a total of 24 in four batches of 6. If you could report to me the total you thought were good recs it would be greatly appreciated. 4) browse videos by various criteria here: https://guide.finderbots.com/guides These are not recommendations, they are simply a catalog. 5) control how you browse videos by clicking the gears icon. You can set all videos to be from one or more sources here. If you choose Netflix alone (don’t forget to uncheck ‘all’) then recs and browsed videos will all be available on Netflix.
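A sketch of the indexing idea described above, i.e. turning each row of the item-item similarity matrix into a Solr-indexable document whose field is just the space-separated IDs of similar items. The map shape and the printed tab-separated format are assumptions; in practice this would feed SolrJ or a CSV update handler:

import java.util.List;
import java.util.Map;

public class IndicatorDocs {

  // similarItemsPerItem: itemID -> IDs of its most similar items,
  // e.g. taken from the rows of the RowSimilarityJob output.
  public static void print(Map<Long, List<Long>> similarItemsPerItem) {
    for (Map.Entry<Long, List<Long>> row : similarItemsPerItem.entrySet()) {
      StringBuilder indicators = new StringBuilder();
      for (long similarItem : row.getValue()) {
        indicators.append(similarItem).append(' ');
      }
      // One "document" per item: its ID plus the tokens Solr will index.
      // A user's liked-item history then becomes a query against this field.
      System.out.println(row.getKey() + "\t" + indicators.toString().trim());
    }
  }
}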
Re: Number of features for ALS
Use k-fold cross-validation or hold-out tests for estimating the quality of different parameter combinations. --sebastian On 03/30/2014 11:53 AM, Niklas Ekvall wrote: Hi, My name is Niklas Ekvall and I have an implementation of the recommender algorithm Large-scale Parallel Collaborative Filtering for the Netflix Prize, and now I'm wondering how to choose the number of features and lambda. Could any of you guys help me with a stepwise strategy to choose or optimize these two parameters? Best regards, Niklas 2014-03-27 19:07 GMT+01:00 j.barrett Strausser j.barrett.straus...@gmail.com: Thanks Ted, Yes for the time problem. We tend to use aggregations of session data. So instead of asking for user recommendations we do things like user+session recommendations. Of course, deciding when sessions start and stop isn't trivial. Ideally what I would want to do is time-weight views using a kernel or convolution. That's a bit heavy, so we typically have a global model, that is basically all preferences over time. Then these user+session type models. We can then combine these at another level to give recommendations based on what you like throughout time versus what you have been doing recently. -b On Thu, Mar 27, 2014 at 1:59 PM, Ted Dunning ted.dunn...@gmail.com wrote: For the poly-syllable challenged, heteroscedasticity - the degree of variation changes. This is common with counts because you expect the standard deviation of count data to be proportional to sqrt(n). time inhomogeneity - changes in behavior over time. One way to handle this (roughly) is to first remove variation in personal and item means over time (if using ratings) and then to segment user histories into episodes. By including both short and long episodes you get some repair for changes in personal preference. A great example of how this works/breaks is Christmas music. On December 26th, you want to *stop* recommending this music, so it really pays to limit histories at this point. By having an episodic user session that starts around November and runs to Christmas, you can get good recommendations for seasonal songs and not pollute the rest of the universe. On Thu, Mar 27, 2014 at 8:30 AM, j.barrett Strausser j.barrett.straus...@gmail.com wrote: For my team it has usually been heteroscedasticity and time inhomogeneity. On Thu, Mar 27, 2014 at 10:18 AM, Tevfik Aytekin tevfik.ayte...@gmail.com wrote: Interesting topic, Ted, can you give examples of those mathematical assumptions under-pinning ALS which are violated by the real world? On Thu, Mar 27, 2014 at 3:43 PM, Ted Dunning ted.dunn...@gmail.com wrote: How can there be any other practical method? Essentially all of the mathematical assumptions under-pinning ALS are violated by the real world. Why would any mathematical consideration of the number of features be much more than heuristic? That said, you can make an information content argument. You can also make the argument that if you take too many features, it doesn't much hurt, so you should always take as many as you can compute. On Thu, Mar 27, 2014 at 6:33 AM, Sebastian Schelter s...@apache.org wrote: Hi, does anyone know of a principled approach of choosing the number of features for ALS (other than cross-validation)? --sebastian -- https://github.com/bearrito @deepbearrito -- https://github.com/bearrito @deepbearrito
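A hold-out test of that kind can be scripted against the in-memory ALS implementation in Taste. A sketch, assuming a ratings.csv and an arbitrary parameter grid, using RMSRecommenderEvaluator with 90% of each user's preferences for training:

import java.io.File;

import org.apache.mahout.cf.taste.eval.RecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.eval.RMSRecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.svd.ALSWRFactorizer;
import org.apache.mahout.cf.taste.impl.recommender.svd.SVDRecommender;
import org.apache.mahout.cf.taste.model.DataModel;

public class AlsGridSearch {
  public static void main(String[] args) throws Exception {
    DataModel model = new FileDataModel(new File("ratings.csv")); // placeholder path
    RecommenderEvaluator evaluator = new RMSRecommenderEvaluator();

    // Train on 90% of each user's preferences, compute RMSE on the held-out 10%,
    // and compare across a small (made-up) grid of parameter combinations.
    for (int numFeatures : new int[] {10, 20, 50, 100}) {
      for (double lambda : new double[] {0.01, 0.05, 0.1}) {
        double rmse = evaluator.evaluate(
            trainingModel -> new SVDRecommender(trainingModel,
                new ALSWRFactorizer(trainingModel, numFeatures, lambda, 20)),
            null, model, 0.9, 1.0);
        System.out.printf("numFeatures=%d lambda=%.2f RMSE=%.4f%n",
            numFeatures, lambda, rmse);
      }
    }
  }
}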
Re: (help!) Can someone scan this
Jay, which version of Mahout are you using? Have you tried to explicitly set the temp path? --sebastian On 03/29/2014 01:52 AM, Jay Vyas wrote: Hi again mahout: I'm wrapping a distributed recommender like this: https://raw.githubusercontent.com/jayunit100/bigpetstore/master/src/main/java/org/bigtop/bigpetstore/clustering/BPSRecommnder.java And it's not working. Any thoughts on why? The error message is simply that intermediate data sets don't exist (i.e. numUsers.bin or /tmp/preparePreferencesMatrix...). Basically it's clear that the intermediate jobs are failing, but I can't see any reason why they would fail, and I don't see any meaningful stack traces. I've found a lot of good whitepapers and stuff on how the algorithms work, but it's not clear what is really done for me by mahout, and what I have to do on my own for the distributed recommender APIs.
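One way to set the temp path explicitly is to drive RecommenderJob through ToolRunner with --tempDir. A sketch with placeholder paths, pointing tempDir at a fresh, writable location so stale or missing intermediate files (numUsers.bin, the preference-matrix preparation output) are easier to rule out:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.cf.taste.hadoop.item.RecommenderJob;

public class RecommenderJobDriver {
  public static void main(String[] args) throws Exception {
    // All paths below are placeholders; --tempDir is where the intermediate
    // data sets of the distributed item-based recommender are written.
    ToolRunner.run(new Configuration(), new RecommenderJob(), new String[] {
        "--input", "/user/me/prefs.csv",
        "--output", "/user/me/recommendations",
        "--tempDir", "/user/me/recommender-temp",
        "--similarityClassname", "SIMILARITY_COOCCURRENCE",
        "--numRecommendations", "10"
    });
  }
}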
Re: The 3 distributed recommenders
Hi Jay, there's not much documentation unfortunately. We're in the process of creating that however. We removed the pseudo-distributed recommender, mainly because nobody ever used it. There are two research papers that could help you with understanding the other two distributed recommenders: For ALS: Distributed Matrix Factorization with MapReduce using a series of Broadcast-Joins, RecSys'13 http://ssc.io/wp-content/uploads/2011/12/sys024-schelter.pdf For item-based: Scalable Similarity-Based Neighborhood Methods with MapReduce, RecSys'12 http://ssc.io/wp-content/uploads/2012/06/rec11-schelter.pdf On 03/28/2014 02:04 PM, Jay Vyas wrote: Hi mahout: Looking through the source code there are 3 distributed recommenders... the als recommender the item recommender the pseudo recommender Any docs differentiating these?
Number of features for ALS
Hi, does anyone know of a principled approach of choosing the number of features for ALS (other than cross-validation?) --sebastian
Re: Does Recommender System Overview Demo work?
Hi Bhargav, you are right, the content on the page is outdated and contains some errors. I've created a jira ticket to fix this [1]. Thank you for reporting the problem! [1] https://issues.apache.org/jira/browse/MAHOUT-1485 On 03/24/2014 04:41 AM, Bhargav Golla wrote: Hi, I was wondering if the demo existing at https://mahout.apache.org/users/recommender/recommender-documentation.html still works. I don't find a webapp directory in integration/ and hence, even after I add the jetty plugin to the pom.xml in integration/, it is throwing an exception. Bhargav Golla Committer, ASF Github http://www.github.com/bhargavgolla | LinkedIn http://www.linkedin.com/in/bhargavgolla | Website http://www.bhargavgolla.com/
Re: Does Recommender System Overview Demo work?
The webapp in Mahout does not offer much functionality. If you'd like to use Mahout via a web interface, I suggest you either use predictionIO [1] or kornakapi [2]. Best, Sebastian [1] http://prediction.io [2] http://ssc.io/a-recommendation-webservice-in-10-minutes/ On 03/24/2014 02:29 PM, Bhargav Golla wrote: Hi Sebastian, Thanks for letting me know. I was wondering if it was removed only in the 0.9 version. Can I check out the 0.8 branch and use the webapp in that branch? Bhargav Golla Developer. Freelancer. Github http://www.github.com/bhargavgolla | LinkedIn http://www.linkedin.com/in/bhargavgolla | Website http://www.bhargavgolla.com/
Re: Does Recommender System Overview Demo work?
Would be great to have such an overview on the mahout website. On 03/24/2014 03:18 PM, Jay Vyas wrote: I've tried to start disambiguating the difference between mahout distributed vs local tutorials here, because I've found it causes problems for a lot of people (including me): http://jayunit100.blogspot.com/2014/02/a-few-nice-posts-about-distirbuted.html Anyone want to collaborate on a two-table wiki page which links to tutorials about distributed vs single-node implementations of all algorithms? On Mon, Mar 24, 2014 at 10:00 AM, Suneel Marthi suneel_mar...@yahoo.com wrote: It was removed in 0.9 and am not sure if it was there in 0.8. I vaguely remember removing it in 0.9 based on a conversation with Manuel on user@. Manuel, if you could chime in here.
Re: Problem with K-Means clustering on Amazon EMR
Hi Konstantin, Great to see that you located the error. Could you open a jira issue and submit a patch that contains an updated error message? Thank you, Sebastian On 03/23/2014 02:57 PM, Konstantin Slisenko wrote: Hi! I investigated the situation. RandomSeedGenerator ( http://grepcode.com/file/repo1.maven.org/maven2/org.apache.mahout/mahout-core/0.8/org/apache/mahout/clustering/kmeans/RandomSeedGenerator.java?av=f) has the following code: FileSystem fs = FileSystem.get(output.toUri(), conf); ... fs.getFileStatus(input).isDir() The FileSystem object was created from the output path, which was not specified correctly by me (I didn't use the s3:// prefix for this path). Afterwards getFileStatus has the input path as parameter, which was correct. This caused the misunderstanding. To prevent this misunderstanding, I propose to improve the error message by adding the following details: 1. Specify which filesystem type is used (DistributedFileSystem, NativeS3FileSystem, etc., using fs.getClass().getName()) 2. Then specify which path cannot be processed correctly. This can be done by a validation utility which can be applied in many places in Mahout. When we use Mahout we need to specify many paths, and we also can use many types of file systems: local for debugging, distributed on Hadoop, and s3 on Amazon. In this case better error messages can save much time. I think that refactoring is not needed for this case. 2014-03-16 22:19 GMT+03:00 Jay Vyas jayunit...@gmail.com: I agree, best to be explicit when creating filesystem instances by using the two-argument get(...). It's time to update to the FileSystem 2.0 APIs. Can you file a Jira for this? If not I will :) On Mar 16, 2014, at 12:37 PM, Sebastian Schelter s...@apache.org wrote: I've also encountered a similar error once. It's really just the FileSystem.get call that needs to be modified. I think it's a good idea to walk through the codebase and refactor this where necessary. --sebastian On 03/16/2014 05:16 PM, Andrew Musselman wrote: Another wild guess, I've had issues trying to use the 's3' protocol from Hadoop and got things working by using the 's3n' protocol instead. On Mar 16, 2014, at 8:41 AM, Jay Vyas jayunit...@gmail.com wrote: I specifically have fixed mapreduce jobs by doing what the error message suggests. But maybe (hopefully) there is another workaround that is configuration driven. Just a hunch, but maybe mahout needs to be refactored to create fs objects using the get(uri,conf) calls? As hadoop evolves to support different flavors of HCFS, using API calls that are more flexible (i.e. like the fs.get(uri,conf) one) will probably be a good thing to keep in mind. On Mar 16, 2014, at 9:22 AM, Frank Scholten fr...@frankscholten.nl wrote: Hi Konstantin, Good to hear from you. The link you mentioned points to EigenSeedGenerator, not RandomSeedGenerator. The problem seems to be with the call to fs.getFileStatus(input).isDir() It's been a while and I don't remember, but perhaps you have to set additional Hadoop fs properties to use S3. See https://wiki.apache.org/hadoop/AmazonS3. Perhaps you can isolate the cause of this by creating a small Java main app with that line of code and run it in the debugger. Cheers, Frank On Sun, Mar 16, 2014 at 12:07 PM, Konstantin Slisenko kslise...@gmail.com wrote: Hello! I run a text-documents clustering on a Hadoop cluster in Amazon Elastic Map Reduce. As input and output I use the S3 Amazon file system. I specify all paths as s3://bucket-name/folder-name. 
SparseVectorsFromSequenceFile works correctly with S3, but when I start the K-Means clustering job, I get this error: Exception in thread main java.lang.IllegalArgumentException: This file system object (hdfs://172.31.41.65:9000) does not support access to the request path 's3://by.kslisenko.bigdata/stackovweflow-small/out_new/sparse/tfidf-vectors' You possibly called FileSystem.get(conf) when you should have called FileSystem.get(uri, conf) to obtain a file system supporting your path. at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:375) at org.apache.hadoop.hdfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:106) at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:162) at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:530) at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:76) at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:93) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at bbuzz2011.stackoverflow.runner.RunnerWithInParams.cluster(RunnerWithInParams.java:121) at bbuzz2011.stackoverflow.runner.RunnerWithInParams.run(RunnerWithInParams.java:52) at bbuzz2011.stackoverflow.runner.RunnerWithInParams.main(RunnerWithInParams.java:41
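A sketch of the validation utility proposed above: resolve every path against its own filesystem via Path.getFileSystem(conf) instead of reusing a FileSystem built from a different path, and report both the filesystem class and the offending path in the error. The helper name is made up:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public final class PathValidator {
  private PathValidator() {}

  public static void validate(Configuration conf, Path... paths) throws IOException {
    for (Path path : paths) {
      // Each path resolves its own filesystem (hdfs://, s3://, file://, ...),
      // never FileSystem.get(conf), which only knows the default filesystem.
      FileSystem fs = path.getFileSystem(conf);
      if (!fs.exists(path)) {
        throw new IOException("Path " + path + " does not exist on filesystem "
            + fs.getClass().getName() + " (" + fs.getUri() + ')');
      }
    }
  }
}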
Documentation, documentation, documentation
Hi, It's great to see a lot of work being spent on cleaning up the website. I think we have already done a great job here, but there are still a few more pages that need work. I created a jira issue for every single page that needs some work; it would be awesome if we could find enough volunteers to finish this quickly. If you wanna take a ticket, write a comment that you're starting work on it, go through the website, check it for dead links and formatting errors, and try out the examples that are listed with the current release to see if everything still works. Either attach a text file containing a new version of the page to the issue or add a comment on the issue that details the fix that you want to see (e.g. remove link ... because it is dead). Here's an overview of the tickets: MAHOUT-1471 Clean up website on Canopy Clustering MAHOUT-1472 Clean up website on Fuzzy k-Means MAHOUT-1473 Clean up website on Spectral Clustering MAHOUT-1474 Add Seinfeld clustering example MAHOUT-1475 Clean up website on Naive Bayes MAHOUT-1476 Clean up website on Hidden Markov Models MAHOUT-1477 Clean up website on Logistic Regression MAHOUT-1478 Clean up website on Random Forests MAHOUT-1479 Clean up website on wikipedia example MAHOUT-1480 Clean up website on 20 newsgroups MAHOUT-1481 Clean up website on breiman example MAHOUT-1482 Rework quickstart website I would kindly ask Shannon to take 1473, Frank 1474 and Frank or Ted 1477. Let's quickly finish the work on documenting what we have, so we can move on to new and exciting developments in Mahout! --sebastian
Re: Documentation, documentation, documentation
Sorry, I seem to have overlooked this. Could you move the cleanings of canopy to 1471? Thank you. On 03/22/2014 04:54 PM, Pavan Kumar N wrote: I have already added the canopy clustering cleanup as part of jira 1450. Also created a new issue for adding streaming kmeans.
Re: Problem with K-Means clustering on Amazon EMR
I've also encountered a similar error once. It's really just the FileSystem.get call that needs to be modified. I think it's a good idea to walk through the codebase and refactor this where necessary. --sebastian On 03/16/2014 05:16 PM, Andrew Musselman wrote: Another wild guess, I've had issues trying to use the 's3' protocol from Hadoop and got things working by using the 's3n' protocol instead.
Re: Compiling Mahout with maven in Eclipse
Maven should generate the classes automatically. Have you tried running mvn -DskipTests clean install on the commandline? On 03/13/2014 09:50 AM, Kevin Moulart wrote: How can I generate them to make these errors go away then? Or don't I have to? Kévin Moulart 2014-03-13 9:17 GMT+01:00 Sebastian Schelter ssc.o...@googlemail.com: Those are autogenerated. On 03/13/2014 09:05 AM, Kevin Moulart wrote: Ok, it does compile with maven in eclipse as well, but still, many imports are not recognized in the sources: - import org.apache.mahout.math.function.IntObjectProcedure; - import org.apache.mahout.math.map.OpenIntLongHashMap; - import org.apache.mahout.math.map.OpenIntObjectHashMap; - import org.apache.mahout.math.set.OpenIntHashSet; - import org.apache.mahout.math.list.DoubleArrayList; ... Pretty much all the problems come from the OpenInt... classes that it doesn't seem to find. Is there a jar or a pom entry I need to add here? Or do I have the wrong version of org.apache.mahout.math, because I can't find those maps/sets/lists in the math package? (I have the same problem on my Windows, CentOS and Mac OS machines) Kévin Moulart 2014-03-12 17:00 GMT+01:00 Kevin Moulart kevinmoul...@gmail.com: Never mind, I found where the problem lay: I deleted the full content of .m2 and retried it as a non-root user and it worked. Trying in Eclipse now, with tests; I'll let you know if it doesn't work. Kévin Moulart 2014-03-12 16:45 GMT+01:00 Kevin Moulart kevinmoul...@gmail.com: Hi, I tried to fix all the problems I had configuring Eclipse in order to compile mahout in it, using maven clean package as the goal. First I had to make a change in mahout-core in the class GroupTree.java, line 171: stack = new ArrayDeque<GroupTree>(); Then I tried compiling with eclipse (I already had the plugin and all imported, and I'm working on the trunk version). From eclipse it runs until it tries compiling the examples: [INFO] Building jar: /home/myCompany/Workspace_eclipse/mahout-trunk/examples/target/mahout-examples-1.0-SNAPSHOT-job.jar [INFO] [INFO] Reactor Summary: [INFO] [INFO] Mahout Build Tools SUCCESS [ 1.173 s] [INFO] Apache Mahout . SUCCESS [ 0.307 s] [INFO] Mahout Math ... SUCCESS [ 8.041 s] [INFO] Mahout Core ... SUCCESS [ 8.378 s] [INFO] Mahout Integration SUCCESS [ 1.030 s] [INFO] Mahout Examples ... FAILURE [ 5.325 s] [INFO] Mahout Release Package SKIPPED [INFO] Mahout Math/Scala wrappers SKIPPED [INFO] Mahout Spark bindings . SKIPPED [INFO] [INFO] BUILD FAILURE [INFO] [INFO] Total time: 24.630 s [INFO] Finished at: 2014-03-12T16:38:08+01:00 [INFO] Final Memory: 101M/1430M [INFO] [ERROR] Failed to execute goal org.apache.maven.plugins:maven-assembly-plugin:2.4:single (job) on project mahout-examples: Failed to create assembly: Error creating assembly archive job: IOException when zipping com/ibm/icu/ICUConfig.properties: invalid LOC header (bad signature) - [Help 1] [ERROR] [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch. [ERROR] Re-run Maven using the -X switch to enable full debug logging. 
[ERROR] [ERROR] For more information about the errors and possible solutions, please read the following articles: [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException [ERROR] [ERROR] After correcting the problems, you can resume the build with the command [ERROR] mvn <goals> -rf :mahout-examples It does the exact same thing when I try typing mvn clean package in a terminal, but when I try it as root, it works, so it might be an issue with the permissions; however, I fail to see where (I did a chown -R on my entire home folder just to be on the safe side and it still fails). Has anyone had the same problem? Any idea about how to fix it? Kévin Moulart
Re: Compiling Mahout with maven in Eclipse
Are you executing Maven in the topmost directory? On 03/13/2014 10:09 AM, Kevin Moulart wrote: I did, but then it fails because of these missing files: https://gist.github.com/kmoulart/9524828 Kévin Moulart 2014-03-13 9:57 GMT+01:00 Sebastian Schelter s...@apache.org: Maven should generate the classes automatically. Have you tried running mvn -DskipTests clean install on the commandline?
Re: verbose output
To my knowledge, there is no such flag for mahout. You can check hadoop's logs for further information, however. On 03/13/2014 10:21 AM, Mahmood Naderan wrote: Hi, Is there any verbosity flag for hadoop and mahout commands? I cannot find such a thing in the command line. Regards, Mahmood
Re: Website, urgent help needed
Hi Scott, Create a jira ticket and attach your scripts and a text version of the page there. Best, Sebastian On 03/12/2014 03:27 PM, Scott C. Cote wrote: I took the tour of the text analysis and pushed through despite the problems on the page. Committers helped me over the hump where others might have just given up (to your point). When I did it, I made shell scripts so that my steps would be repeatable, with an anticipation of updating the page. Unfortunately, I gave up on trying to figure out how to update the page (there were links indicating that I could do it), and I didn't want to appear to be stupid asking how to update the documentation (my bad - not anyone else). Now I know that it was not possible unless I was a committer. Who should I send my scripts to, or how should I proceed with a current form of the page? Scott
Re: Problem with FileSystem in Kmeans
Hi Bikash, Have you tried adding hdfs:// to your input path? Maybe that helps. --sebastian On 03/11/2014 11:22 AM, Bikash Gupta wrote: Hi, I am running KMeans in a cluster where I am setting the configuration of fs.hdfs.impl and fs.file.impl beforehand, as mentioned below: conf.set("fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName()); conf.set("fs.file.impl", org.apache.hadoop.fs.LocalFileSystem.class.getName()); The problem is that the cluster-0 directory is getting created in the local file system and cluster-1 is getting created in HDFS, and the KMeans map reduce job is unable to find cluster-0. Please see the stacktrace below: 2014-03-11 14:52:15 o.a.m.c.AbstractJob [INFO] Command line arguments: {--clustering=null, --clusters=[/3/clusters-0-final], --convergenceDelta=[0.1], --distanceMeasure=[org.apache.mahout.common.distance.EuclideanDistanceMeasure], --endPhase=[2147483647], --input=[/2/sequence], --maxIter=[100], --method=[mapreduce], --output=[/5], --overwrite=null, --startPhase=[0], --tempDir=[temp]} 2014-03-11 14:52:15 o.a.h.u.NativeCodeLoader [WARN] Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 2014-03-11 14:52:15 o.a.m.c.k.KMeansDriver [INFO] Input: /2/sequence Clusters In: /3/clusters-0-final Out: /5 2014-03-11 14:52:15 o.a.m.c.k.KMeansDriver [INFO] convergence: 0.1 max Iterations: 100 2014-03-11 14:52:16 o.a.h.m.JobClient [WARN] Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. 2014-03-11 14:52:17 o.a.h.m.l.i.FileInputFormat [INFO] Total input paths to process : 3 2014-03-11 14:52:19 o.a.h.m.JobClient [INFO] Running job: job_201403111332_0011 2014-03-11 14:52:20 o.a.h.m.JobClient [INFO] map 0% reduce 0% 2014-03-11 14:52:28 o.a.h.m.JobClient [INFO] Task Id : attempt_201403111332_0011_m_00_0, Status : FAILED 2014-03-11 14:52:28 STDIO [ERROR] java.lang.IllegalStateException: /5/clusters-0 at org.apache.mahout.common.iterator.sequencefile.SequenceFileDirValueIterable.iterator(SequenceFileDirValueIterable.java:78) at org.apache.mahout.clustering.classify.ClusterClassifier.readFromSeqFiles(ClusterClassifier.java:208) at org.apache.mahout.clustering.iterator.CIMapper.setup(CIMapper.java:44) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:138) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330) at org.apache.hadoop.mapred.Child$4.run(Child.java:268) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1438) at org.apache.hadoop.mapred.Child.main(Child.java:262) Caused by: java.io.FileNotFoundException: File /5/clusters-0 Please suggest!!!
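Following Sebastian's suggestion, a sketch of qualifying all paths against the intended filesystem up front, so that a path like /5 cannot silently resolve against the local filesystem; the namenode address is a placeholder:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class QualifiedPaths {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://namenode:9000"); // placeholder address

    // makeQualified pins each path to a concrete scheme and authority,
    // so every iteration of the driver resolves against the same filesystem.
    FileSystem hdfs = FileSystem.get(conf);
    Path input = hdfs.makeQualified(new Path("/2/sequence"));
    Path clustersIn = hdfs.makeQualified(new Path("/3/clusters-0-final"));
    Path output = hdfs.makeQualified(new Path("/5"));

    System.out.println(input);  // e.g. hdfs://namenode:9000/2/sequence
    System.out.println(clustersIn);
    System.out.println(output);
    // These qualified paths can then be handed to KMeansDriver.run(...).
  }
}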
Website, urgent help needed
Hi, As you've probably noticed, I've put in a lot of effort over the last days to kickstart cleaning up our website. I've thrown out a lot of stuff and have been startled by the amount of outdated and incorrect information on our website, as well as links pointing to nowhere. I think our lack of documentation makes it super hard for new people to use Mahout. A crucial next step is to clean up the documentation on classification and clustering. I cannot do this alone, because I don't have the time and I'm not so familiar with the background of the algorithms. I need volunteers to go through all the pages under Classification and Clustering on the website. For the algorithms, the content and claims of the articles need to be checked; for the examples, we need to make sure that everything still works as described. It would also be great to move articles from personal blogs to our website. Imagine that some developer wants to try out Mahout and takes one hour for that in the evening. She will go to our website, download Mahout, read the description of an algorithm and try to run an example. In the current state of the documentation, I'm afraid that most people will walk away frustrated, because the website does not help them as it should. Best, Sebastian PS: I will make my standpoint on whether Mahout should do a 1.0 release depend on whether we manage to clean up and maintain our documentation.
Re: Website, urgent help needed
We don't exactly have that page, but we have pages that touch parts of it, such as https://mahout.apache.org/users/basics/creating-vectors-from-text.html It would be great if you could create a jira ticket which lists the errors. I'll fix them then. Best, Sebastian On 03/12/2014 08:42 AM, Juan José Ramos wrote: Hi Sebastian, I am afraid I am only familiar with the recommendation part. In previous posts, I pointed out a couple of errors in this wiki page: https://cwiki.apache.org/confluence/display/MAHOUT/Quick+tour+of+text+analysis+using+the+Mahout+command+line If you are planning to keep it in the new website, I can help point them out again. Thanks a lot for your effort.
Re: Website, urgent help needed
Hi Pavan, Awesome that you're willing to help. The documentation is the set of pages listed under Clustering in the navigation bar on mahout.apache.org If you start working on one of the pages listed there (e.g. the k-Means doc), please create a jira ticket in our issue tracker with a title along the lines of Cleaning up the documentation for k-Means on the website. Put a list of errors and corrections into the jira and I (or some other committer) will make sure to fix the website. Thanks, Sebastian On 03/12/2014 08:48 AM, Pavan Kumar N wrote: I'll help with the clustering algorithms documentation. Do send me the old documentation and I will check and remove errors, or better, let me know how to proceed. Pavan
Re: Website, urgent help needed
Hi Manoj, Awesome that you're willing to help. I suggest we proceed analogously to the clustering cleanup: the documentation consists of the pages listed under Classification in the navigation bar on mahout.apache.org If you start working on one of the pages listed there (e.g. the Naive Bayes doc), please create a jira ticket in our issue tracker with a title along the lines of Cleaning up the documentation for Naive Bayes on the website. Put a list of errors and corrections into the jira and I (or some other committer) will make sure to fix the website. Best, Sebastian On 03/12/2014 09:05 AM, Manoj Awasthi wrote: Thanks, Sebastian, to you and the others for the effort in cleaning up the website interface. It looks much better (fonts, layout) and much more usable, if I may say. I will be happy to volunteer for the pages under classification in whatever ways I can. I would especially like to contribute by verifying that the examples provided work in the form they exist on the website, and I will be happy to make corrections wherever possible. If there is an initial backlog list which provides tasks at a granular level, that would be great; otherwise I can start looking at the pages myself. Manoj On Wed, Mar 12, 2014 at 12:33 PM, Sebastian Schelter s...@apache.org wrote: [...]
Re: Website, urgent help needed
Hi Kevin, Thank you for offering to help! Feel free to ask questions here about how to set up the sources in Eclipse. If you succeed, you could write up what you did and we could add this to the website, as I'm sure a lot of others will have the same problem. It would be great if you could start improving the javadoc; it's totally fine if your English is not perfect, we can always ask a native speaker to read over it. If you start working on the javadoc, please create a jira issue for that work before you start. Best, Sebastian On 03/12/2014 09:30 AM, Kevin Moulart wrote: I can confirm what Sebastian said. I'm fairly new at this, and I found myself so desperate at some point that I almost gave up on Mahout due to the lack of documentation, but my feeling is that it doesn't only concern the website: the API is too sparsely documented as well. At this point there is no simple way for a beginner to know what kind of format any one of the algorithms expects and what it outputs exactly, how to chain processes, etc. They might go as far as reading the javadoc (although not everyone does that), but they won't all, as I had to and did, download the sources and try making sense of them to get the information. Fortunately the mailing list is particularly active and one can find answers if one has the time and will to search and ask kindly, which is a great strength of Mahout, but the average beginner, wanting to just try the library, can't and won't do that. I'm willing to document the parts of the code I used and began to understand, however I've been facing difficulties setting up the maven project in Eclipse for now. Also, since I'm Belgian, English is not my mother tongue, so I'm almost certain to make mistakes, but I think it would take you less time to correct these few English mistakes than to write the documentation :) I'll go ahead and try to set things up with Eclipse, and if I don't succeed I'll write a mail to the dev list for help in that matter. I also can, if I find the time, continue my efforts of reporting bugs, broken or inaccurate links and descriptions on the website, if need be, and update my JIRA entry accordingly. Kévin Moulart 2014-03-12 8:48 GMT+01:00 Pavan Kumar N pavan.naraya...@gmail.com: [...]
Re: Website, urgent help needed
Here you can see all issues (resolved and unresolved) for the next release: https://issues.apache.org/jira/browse/MAHOUT-1413?jql=project%20%3D%20MAHOUT%20AND%20fixVersion%20%3D%201.0%20ORDER%20BY%20priority%20DESC When you start to work on the cleanup of a page, make sure that no ticket exists for it yet. If there isn't one, create a jira ticket with the name of the page in the title. --sebastian On 03/12/2014 11:20 AM, pramit choudhary wrote: Hi All, I would also like to participate in cleaning up the documentation. Since I am fairly new to the Mahout infrastructure, it will in turn help me understand things better. Do we already have a Jira ticket for organizing the documentation cleanup? I just want to be sure that I am not stepping on pages someone else has already updated. Thanks Regards, Pramit On Wed, Mar 12, 2014 at 3:07 AM, Sebastian Schelter s...@apache.org wrote: [...]
Re: Few questions about SVM configuration in Mahout
Hi Quentin, Mahout does not have SVMs. Best, Sebastian On 03/10/2014 10:38 AM, Quentin-Gabriel Thurier wrote: Hi all, just a few questions about the configuration of an SVM in Mahout: - Is it possible to do multi-class classification? - Which kernels are already available (linear, polynomial, rbf)? - Where can we find details about the way the algorithm has been distributed? Many thanks, Quentin
Re: [blog post] Comparing Document Classification Functions of Lucene and Mahout
Hi Koji, I've added a link to your article to our website: https://mahout.apache.org/general/books-tutorials-and-talks.html On 03/07/2014 03:29 AM, Koji Sekiguchi wrote: Hello, I just posted an article on Comparing Document Classification Functions of Lucene and Mahout. http://soleami.com/blog/comparing-document-classification-functions-of-lucene-and-mahout.html Comments are welcome. :) Thanks! koji
Re: Heap space
I usually do trial and error. Start with some very large value and do a binary search :) --sebastian On 03/09/2014 01:30 PM, Mahmood Naderan wrote: Excuse me, I added the -Xmx option and restarted the hadoop services using sbin/stop-all.sh sbin/start-all.sh but I still get the heap space error. How can I find the correct and needed heap size? Regards, Mahmood On Sunday, March 9, 2014 1:37 PM, Mahmood Naderan nt_mahm...@yahoo.com wrote: OK I found that I have to add this property to mapred-site.xml <property> <name>mapred.child.java.opts</name> <value>-Xmx2048m</value> </property> Regards, Mahmood On Sunday, March 9, 2014 11:39 AM, Mahmood Naderan nt_mahm...@yahoo.com wrote: Hello, I ran this command ./bin/mahout wikipediaXMLSplitter -d examples/temp/enwiki-latest-pages-articles.xml -o wikipedia/chunks -c 64 but got this error: Exception in thread "main" java.lang.OutOfMemoryError: Java heap space There are many web pages regarding this and the solution is to add -Xmx2048M, for example. My question is: that option should be passed to the java command, not to Mahout. As a result, running ./bin/mahout -Xmx 2048M shows that there is no such option. What should I do? Regards, Mahmood
Re: Welcome Andrew Musselman as new committer
Hi Pavan, Committership is given for engagement with the project, like providing documentation, answering questions on the mailing list, reviewing patches, testing patches and submitting patches. We currently have a discussion ongoing about the future of Mahout, feel free to participate. --sebastian On 03/07/2014 06:41 PM, Pavan Kumar N wrote: Congratulations to Andrew. It would be nice to have some information/background on how the PMC evaluated Andrew to become a committer. It would also be nice to know which future aspects/algorithms of machine learning Mahout is going to focus on. I have been keen to maintain code for one of the projects, and I mistakenly spent time developing a map-reduce version of a weighted linear regression solution procedure. Only recently did I see that Mahout's web pages were updated. Would appreciate any advice from Andrew and other PMC members. Pavan On 7 March 2014 22:56, Frank Scholten fr...@frankscholten.nl wrote: Congratulations Andrew! On Fri, Mar 7, 2014 at 6:12 PM, Sebastian Schelter s...@apache.org wrote: [...]
Welcome Andrew Musselman as new committer
Hi, this is to announce that the Project Management Committee (PMC) for Apache Mahout has asked Andrew Musselman to become committer and we are pleased to announce that he has accepted. Being a committer enables easier contribution to the project since in addition to posting patches on JIRA it also gives write access to the code repository. That also means that now we have yet another person who can commit patches submitted by others to our repo *wink* Andrew, we look forward to working with you in the future. Welcome! It would be great if you could introduce yourself with a few words :) Sebastian
Re: Rework our website
Thank you very much! Could you create a jira ticket and post the links there? That would be awesome, then we can track that this stuff gets fixed. Best, Sebastian On 03/06/2014 02:58 PM, Kevin Moulart wrote: Hi, I also prefer the second one. While I'm at it, there are several links that point to absent pages. I just clicked on all the links present on the page: http://mahout.apache.org/users/basics/quickstart.html And those links are broken: http://mahout.apache.org/users/basics/recommender-documentation.html http://mahout.apache.org/users/classification/partial-implementation.html http://mahout.apache.org/users/basics/TasteCommandLine http://mahout.apache.org/users/recommender/recommendationexamples.html http://mahout.apache.org/users/basics/parallel-frequent-pattern-mining.html http://mahout.apache.org/users/basics/mahout.ga.tutorial.html http://hadoop.apache.org.html/ That's just the ones I found in 2 minutes on the quickstart page. Best Regards, Kevin 2014-03-05 23:43 GMT+01:00 Sebastian Schelter s...@apache.org: At the moment, only committers can change the website, unfortunately. If you have text to add, I'm happy to work it in and add your name to our contributors list in the CHANGELOG. Best, Sebastian On 03/05/2014 04:58 PM, Scott C. Cote wrote: I had recently taken the text tour of Mahout, but I couldn't decipher a way to contribute updates to the tour (some of the file names have changed, etc). How would I start? (This was part of my offer to help with the documentation of Mahout.) SCott On 3/5/14 9:47 AM, Pat Ferrel p...@occamsmachete.com wrote: What no centered text?? ;-) Love either. BTW users are no longer able to contribute content to the wiki. Most CMSs have a way to allow input that is moderated. Might this make getting documentation help easier? Allow anyone to contribute but committers can filter out the bad, sort of like submitting patches. On Mar 5, 2014, at 4:11 AM, Sebastian Schelter s...@apache.org wrote: [...]
Re: Rework our website
Could you add the missing pages to the jira issue? I'll have a look later. On 03/06/2014 03:25 PM, Suneel Marthi wrote: I fixed some of the broken links. For some of the others, e.g. TasteCommandline and Recommendationexamples, either the pages have not been migrated or the links have to be purged. On Thursday, March 6, 2014 9:07 AM, Sebastian Schelter s...@apache.org wrote: Thank you very much! Could you create a jira ticket and post the links there? [...]
Re: Recommend items not rated by any user
Hi Juan, that is a good catch. CandidateItemsStrategy is the right place to implement this. Maybe we should simply extend its interface to add a parameter that says whether to keep or remove the current user's items? We could even do this in the abstract base class then. --sebastian On 03/05/2014 10:42 AM, Juan José Ramos wrote: In case somebody runs into the same situation, the key seems to be in the CandidateItemsStrategy being passed to the constructor of GenericItemBasedRecommender. Looking into the code, if no CandidateItemsStrategy is specified in the constructor, PreferredItemsNeighborhoodCandidateItemsStrategy is used, and as the documentation says, its doGetCandidateItems method returns all items that have not been rated by the user and that were preferred by another user that has preferred at least one item that the current user has preferred too. So, a different CandidateItemsStrategy needs to be passed. For this problem, it seems to me that AllSimilarItemsCandidateItemsStrategy and AllUnknownItemsCandidateItemsStrategy are good candidates. Does anybody know where to find some documentation about the different CandidateItemsStrategy implementations? Based on the names, I would say that: 1) AllSimilarItemsCandidateItemsStrategy returns all similar items, regardless of whether they have already been rated by someone or not. 2) AllUnknownItemsCandidateItemsStrategy returns all similar items that have not been rated by anyone yet. Does anybody know if it works like that? Thanks. On Tue, Mar 4, 2014 at 9:16 AM, Juan José Ramos jjar...@gmail.com wrote: First thing is that I know this requirement would not make sense in a CF recommender. In my case, I am trying to use Mahout to create something closer to a content-based recommender. In particular, I am pre-computing a similarity matrix between all the documents (items) of my catalogue and using that matrix as the ItemSimilarity for my item-based recommender. So, when a user rates a document, how could I make the recommender output documents similar to the ones the user has already rated, even if no other user in the system has rated them yet? Is that even possible in the first place? Thanks a lot.
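A minimal sketch of wiring a non-default strategy into the recommender, using the classes named in this thread; the file paths are hypothetical placeholders, and the detail that AllUnknownItemsCandidateItemsStrategy implements both strategy interfaces (so one instance can be passed twice) is how the abstract base class works, IIRC:

import java.io.File;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.AllUnknownItemsCandidateItemsStrategy;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.file.FileItemSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class UnknownItemsExample {
  public static void main(String[] args) throws Exception {
    DataModel dataModel = new FileDataModel(new File("data/dataModel.txt")); // hypothetical path
    ItemSimilarity similarity = new FileItemSimilarity(new File("data/similarities")); // hypothetical path

    // One instance serves as both CandidateItemsStrategy and
    // MostSimilarItemsCandidateItemsStrategy, so it is passed for both arguments.
    AllUnknownItemsCandidateItemsStrategy strategy = new AllUnknownItemsCandidateItemsStrategy();

    GenericItemBasedRecommender recommender =
        new GenericItemBasedRecommender(dataModel, similarity, strategy, strategy);
    System.out.println(recommender.recommend(1L, 10));
  }
}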
Rework our website
Hi everyone, In our latest discussion, I argued that the lack (and errors) of documentation on our website is one of the main pain points of Mahout atm. To be honest, I'm also not very happy with the design; especially the fonts and spacing make it super hard to read long articles. This also prevents me from wanting to add articles and documentation. I think we should have a beautiful website where it is fun to add new stuff. My design skills are pretty limited, but fortunately my brother is an art director! I asked him to make our website a bit more beautiful without changing too much of the structure, so that a redesign wouldn't take too long. I really like the results and would volunteer to dig out my CSS skills and do the redesign, if people agree. Here are his drafts, I like the second one best: https://people.apache.org/~ssc/mahout/mahout.jpg https://people.apache.org/~ssc/mahout/mahout2.jpg Let me know what you think! Best, Sebastian
Re: Recommend items not rated by any user
On 03/05/2014 01:23 PM, Juan José Ramos wrote: Thanks for the reply, Sebastian. I am not sure if that should be implemented in the abstract base class though, because for instance PreferredItemsNeighborhoodCandidateItemsStrategy, by definition, returns the items not rated by the user and rated by somebody else. Good point. So we seem to need special implementations. Back to my last post, I have been playing around with AllSimilarItemsCandidateItemsStrategy and AllUnknownItemsCandidateItemsStrategy, and although they both do what I wanted (recommend items not previously rated by any user), I honestly can't tell the difference between the two strategies. In my tests the output was always the same. If the eventual output of the recommender will not include items already rated by the user, as pointed out here ( http://mail-archives.apache.org/mod_mbox/mahout-user/201403.mbox/%3CCABHkCkuv35dbwF%2B9sK88FR3hg7MAcdv0MP10v-5QWEvwmNdY%2BA%40mail.gmail.com%3E), AllSimilarItemsCandidateItemsStrategy should be equivalent to AllUnknownItemsCandidateItemsStrategy, shouldn't it? AllSimilarItems returns all items that are similar to any item that the user already knows. AllUnknownItems simply returns all items that the user has not interacted with yet. These are two different things, although they might overlap in some scenarios. Best, Sebastian Thanks. On Wed, Mar 5, 2014 at 10:23 AM, Sebastian Schelter s...@apache.org wrote: [...]
Re: Recommend items not rated by any user
So both strategies seem to be effectively the same; I don't know what the implementers had in mind when designing AllSimilarItemsCandidateItemsStrategy. It can take a long time to estimate preferences for all items a user doesn't know, especially if you have a lot of items. Traditional item-based recommenders will not recommend any item that is not similar to at least one of the items the user interacted with, so the AllSimilarItems strategy already selects the maximum set of items that could potentially be recommended to the user. --sebastian On 03/05/2014 05:38 PM, Tevfik Aytekin wrote: If the similarities between item 5 and two of the items user 1 preferred are not NaN, then it will return it, that is what I'm saying. If the similarities were all NaN then it would not return it. But surely, you might wonder: if all similarities between an item and the user's items are NaN, then AllUnknownItemsCandidateItemsStrategy probably will not end up recommending it either. On Wed, Mar 5, 2014 at 6:06 PM, Juan José Ramos jjar...@gmail.com wrote: @Tevfik, running this recommender: GenericItemBasedRecommender itemRecommender = new GenericItemBasedRecommender(dataModel, itemSimilarity, new AllSimilarItemsCandidateItemsStrategy(itemSimilarity), new AllSimilarItemsCandidateItemsStrategy(itemSimilarity)); With this dataModel: 1,1,1.0 1,2,2.0 1,3,1.0 1,4,2.0 2,1,1.0 2,2,4.0 And these similarities: 1,2,0.1 1,3,0.2 1,4,0.3 2,3,0.5 3,4,0.5 5,1,0.2 5,2,1.0 it returns item 5 for user 1. So item 5 has not been preferred by user 1, and the similarities between item 5 and two of the items user 1 preferred are not NaN, but AllSimilarItemsCandidateItemsStrategy is returning that item. So, I'm truly sorry to insist on this, but I still really do not get the difference. On Wed, Mar 5, 2014 at 2:53 PM, Tevfik Aytekin tevfik.ayte...@gmail.com wrote: Juan, you got me wrong. AllSimilarItemsCandidateItemsStrategy returns all items that have not been rated by the user and for which the similarity metric returns a non-NaN similarity value with at least one of the items preferred by the user. So, it does not simply return all items that have not been rated by the user. For example, if there is an item X which has not been rated by the user and the similarity values between X and all of the items rated (preferred) by the user are NaN, then X will not be returned by AllSimilarItemsCandidateItemsStrategy, but it will be returned by AllUnknownItemsCandidateItemsStrategy. On Wed, Mar 5, 2014 at 4:42 PM, Juan José Ramos jjar...@gmail.com wrote: Hi Tevfik, Thanks for the response. I think what you say contradicts what Sebastian pointed out before. Also, if AllSimilarItemsCandidateItemsStrategy returns all items that have not been rated by the user, what would AllUnknownItemsCandidateItemsStrategy return? On Wed, Mar 5, 2014 at 1:40 PM, Tevfik Aytekin tevfik.ayte...@gmail.com wrote: Sorry, there was a typo in the previous paragraph. If I remember correctly, AllSimilarItemsCandidateItemsStrategy returns all items that have not been rated by the user and for which the similarity metric returns a non-NaN similarity value with at least one of the items preferred by the user. On Wed, Mar 5, 2014 at 3:38 PM, Tevfik Aytekin tevfik.ayte...@gmail.com wrote: Hi Juan, If I remember correctly, AllSimilarItemsCandidateItemsStrategy returns all items that have not been rated by the user and the similarity metric returns a non-NaN similarity value that is with at least one of the items preferred by the user.
Tevfik On Wed, Mar 5, 2014 at 2:30 PM, Sebastian Schelter s...@apache.org wrote: [...]
Re: Recommend items not rated by any user
For SVD-based algorithms, you should use the AllUnknownItems strategy then, that's correct. In the majority of industry use cases that I have seen, people use pre-computed item similarities (Mahout has lots of machinery for doing this, btw), so AllSimilarItems totally makes sense there. --sebastian On 03/05/2014 06:01 PM, Tevfik Aytekin wrote: It can even make things worse in SVD-based algorithms, for which preference estimation is very fast. On Wed, Mar 5, 2014 at 7:00 PM, Tevfik Aytekin tevfik.ayte...@gmail.com wrote: Hi Sebastian, But in order not to select items that are not similar to at least one of the items the user interacted with, you have to compute the similarities with all of the user's items (which is the main task in estimating the preference of an item in the item-based method). So, it seems to me that AllSimilarItemsStrategy does not bring much advantage over AllUnknownItemsCandidateItemsStrategy. On Wed, Mar 5, 2014 at 6:46 PM, Sebastian Schelter s...@apache.org wrote: [...]
Re: Rework our website
At the moment, only committers can change the website, unfortunately. If you have text to add, I'm happy to work it in and add your name to our contributors list in the CHANGELOG. Best, Sebastian On 03/05/2014 04:58 PM, Scott C. Cote wrote: [...]
Re: Mahout-232-0.8.patch using
I think you should rather choose a different library that already offers an SVM than try to revive a 4-year-old patch. --sebastian On 03/04/2014 08:51 AM, Amol Kakade wrote: Hi, I am a new user of Mahout and want to run a sample SVM algorithm with Mahout. Can you please list the steps to use Mahout-232-0.8.patch for SVM in Mahout? I have been trying for the last 2 days but keep getting errors. -- Amol Kakade.
Re: how to recommend users already consumed items
I think we should introduce a new parameter for the recommend() method in the Recommender interface that tells whether already-known items should be recommended or not. What do you think? Best, Sebastian On 03/04/2014 05:32 PM, Pat Ferrel wrote: I’d suggest a command line option if you want to submit a patch. Most people will want that line executed, so the default should be the current behavior. But a large minority will want it your way. And please do submit a patch with the Jira; it will make your life easier when new releases come out, since you won’t have to manage a fork. On Mar 2, 2014, at 12:38 PM, Mario Levitin mariolevi...@gmail.com wrote: Juan, I don't understand your solution; if there are no ratings, how can you blend the recommendations from the system with the user's already-read news? Anyway, I think, as Pat does, that the best way is to remove the mentioned line. It should be the responsibility of the business logic to remove the user's items if needed. I will also create a Jira issue as you suggested. thanks On Sun, Mar 2, 2014 at 7:12 PM, Ted Dunning ted.dunn...@gmail.com wrote: On Sun, Mar 2, 2014 at 8:52 AM, Pat Ferrel p...@occamsmachete.com wrote: You are not the only one to see this, so I'd recommend creating an option for the Job, which will be checked before executing that line of code, then submitting it as a patch to the Jira you need to create in any case. That way it might get into the mainline and you won't have to maintain a fork. Avoiding the cost of a fork over a trivial issue like this is a grand idea.
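For concreteness, a sketch of what the proposed extension could look like; this is only the idea under discussion in this thread, not an existing Mahout API, and the parameter name includeKnownItems is made up here:

// Hypothetical addition to org.apache.mahout.cf.taste.recommender.Recommender
// (other methods of the interface omitted). Not part of the released API.
import java.util.List;
import org.apache.mahout.cf.taste.common.Refreshable;
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;

public interface Recommender extends Refreshable {

  // existing behavior: items the user already knows are filtered out
  List<RecommendedItem> recommend(long userID, int howMany) throws TasteException;

  // proposed overload: passing true skips the filtering of known items,
  // passing false keeps the current behavior
  List<RecommendedItem> recommend(long userID, int howMany, boolean includeKnownItems)
      throws TasteException;
}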
Re: how to recommend users already consumed items
That's fine, I was talking about the non-distributed part only. This page has instructions on how to create patches: https://mahout.apache.org/developers/how-to-contribute.html Let me know if you need more info! Best, Sebastian On 03/05/2014 12:27 AM, Mario Levitin wrote: I have created a Jira issue already. I only use the non-hadoop part of the Mahout recommender algorithms. Maybe I can create a patch for that part. However, I have not done it before and don't know how to proceed. On Wed, Mar 5, 2014 at 1:01 AM, Sebastian Schelter s...@apache.org wrote: Would you be willing to set up a jira issue and create a patch for this? --sebastian On 03/04/2014 11:58 PM, Mario Levitin wrote: I think we should introduce a new parameter for the recommend() method in the Recommender interface that tells whether already known items should be recommended or not. I agree (if the parameter is missing, it defaults to the current behavior, as Pat suggested). On 03/04/2014 05:32 PM, Pat Ferrel wrote: [...]
Re: Issue updating a FileDataModel
Hi Juan, IIRC, FileDataModel has a parameter that determines how much time must have elapsed since the last modification of the underlying file. You can also directly append new data to the original file. If you want a DataModel that can be concurrently updated, I suggest moving your data to a database. --sebastian On 03/02/2014 11:11 PM, Juan José Ramos wrote: I am having issues refreshing my recommender, in particular with the DataModel. I am using a FileDataModel and a GenericItemBasedRecommender that also has a CachingItemSimilarity wrapping a FileItemSimilarity. But for the test I am running I am making things even simpler. By the time I instantiate the recommender, these two files are in the file system: data/datamodel.txt 0,1,0.0 data/datamodel.0.txt 0,2,1.0 And then I run the code you can find below: --- FileDataModel dataModel = new FileDataModel(new File("data/dataModel.txt")); FileItemSimilarity itemSimilarity = new FileItemSimilarity(new File("data/similarities")); GenericItemBasedRecommender itemRecommender = new GenericItemBasedRecommender(dataModel, itemSimilarity); System.out.println("Number of users in the system: " + itemRecommender.getDataModel().getNumUsers() + " and " + itemRecommender.getDataModel().getNumItems() + " items"); FileWriter writer = new FileWriter(new File("data/dataModel.1.txt")); writer.write("1,2,1.0\r"); writer.close(); writer = new FileWriter(new File("data/dataModel.2.txt")); writer.write("2,2,1.0\r"); writer.close(); writer = new FileWriter(new File("data/dataModel.3.txt")); writer.write("3,2,1.0\r"); writer.close(); writer = new FileWriter(new File("data/dataModel.4.txt")); writer.write("4,2,1.0\r"); writer.close(); writer = new FileWriter(new File("data/dataModel.5.txt")); writer.write("5,2,1.0\r"); writer.close(); writer = new FileWriter(new File("data/dataModel.6.txt")); writer.write("6,2,1.0\r"); writer.close(); itemRecommender.refresh(null); System.out.println("Number of users in the system: " + itemRecommender.getDataModel().getNumUsers() + " and " + itemRecommender.getDataModel().getNumItems() + " items"); --- The output is the same in both println calls: Number of users in the system: 2 and 2 items. So, only the information from the files that were on the system by the time I ran this test seems to get loaded into the DataModel. What can be causing that? Is there a maximum number of updates a FileDataModel can take up in every refresh? Could it be that by the time I call itemRecommender.refresh(null) the files have not been written to the file system yet? Should I be calling refresh in a different manner? Thank you for your help.
Re: classification in standalone application in Apache Mahout 0.9
If you don't want to call a shell, I assume you don't want to use a Hadoop cluster, right? In that case, you should rather try Mahout's logistic regression classifier, which is tuned for usage on a single machine. --sebastian On 03/03/2014 03:07 PM, Hollow Quincy wrote: I am looking for a simple example in Java (without any shell call) of how to use NaiveBayesClassifier in Apache Mahout 0.9. I have samples of text. I want to train the algorithm on this data and then classify a new text. class Main { public static void main(String[] args) { // train the algorithm on some data // classify some data } } There is no example of how to do this in Apache Mahout 0.9! Thanks for help
Re: classification in standalone application in Apache Mahout 0.9
It's certainly possible to run Hadoop on a single machine, but it will give you terrible performance. We don't have a single-machine implementation of naive bayes, so I'd really suggest you use the logistic regression code. --sebastian On 03/03/2014 03:15 PM, Hollow Quincy wrote: You are right. I want to run my program on a single machine as a classic public static void main() standalone application. In my opinion, Naive Bayes classification would suit my problem well. Is there a way to call it from my Java code? I cannot find any example. Thanks for help 2014-03-03 15:11 GMT+01:00 Sebastian Schelter s...@apache.org: [...]
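As a pointer in that direction, a minimal single-machine sketch with the SGD-based logistic regression; the toy features and labels are made up, and real text would first have to be turned into Vectors (e.g. with Mahout's feature encoders):

import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class Main {
  public static void main(String[] args) {
    // 2 categories, 3 features, L1 prior -- all values here are arbitrary
    OnlineLogisticRegression learner =
        new OnlineLogisticRegression(2, 3, new L1()).learningRate(0.1).lambda(1.0e-5);

    double[][] samples = { {1, 0, 1}, {0, 1, 0}, {1, 1, 1}, {0, 0, 1} };
    int[] labels = { 1, 0, 1, 0 };

    // learn: several passes of online training over the toy data
    for (int pass = 0; pass < 10; pass++) {
      for (int i = 0; i < samples.length; i++) {
        learner.train(labels[i], new DenseVector(samples[i]));
      }
    }

    // classify: classifyFull() returns a vector of per-category probabilities
    Vector p = learner.classifyFull(new DenseVector(new double[] {1, 0, 0}));
    System.out.println("p(class = 1) = " + p.get(1));
  }
}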
Re: Issue updating a FileDataModel
I think it depends on the difference between the time of the call to refresh() and the last-modified time of the file. --sebastian On 03/03/2014 04:45 PM, Juan José Ramos wrote: Thanks for the reply, Sebastian. I do not have concurrent updates, but they actually may happen very, very close in time. Would adding the new preferences to new files rather than appending to the existing one make any difference, or does everything depend on the time elapsed between two calls to recommender.refresh(null)? Many thanks. On Mon, Mar 3, 2014 at 1:18 PM, Sebastian Schelter s...@apache.org wrote: [...]
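For reference, the constructor in question looks roughly like this; IIRC the default minimum reload interval is 60 seconds, but check the javadoc of your Mahout version before relying on the exact signature:

import java.io.File;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;

public class ReloadIntervalExample {
  public static void main(String[] args) throws Exception {
    // FileDataModel(File dataFile, boolean ignoreRatings, long minReloadIntervalMS):
    // refresh() only re-reads the file if it has been modified and at least
    // minReloadIntervalMS have passed -- here 2 seconds instead of the default
    FileDataModel dataModel =
        new FileDataModel(new File("data/dataModel.txt"), false, 2000L);
    dataModel.refresh(null);
  }
}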
Re: Mahout-232-0.8.patch using
Hi Amol, SVMs are not integrated in Mahout. I'd suggest you try our logistic regression classifier instead. Best, Sebastian On 03/04/2014 08:51 AM, Amol Kakade wrote: [...]
Re: parallelALS and RMSE TEST
The output of parallelALS consists of two matrices U and M whose product is an approximation of your input matrix. The matrices are written out as sequence files with an IntWritable as key (the index of the row in the matrix) and a VectorWritable as value, which holds the contents of the row vector. --sebastian On 02/27/2014 06:30 PM, AJ Rader wrote: Sean Owen srowen at gmail.com writes: Parallel ALS is exactly an example of where you can use matrix factorization for 0/1 data. On Mon, May 6, 2013 at 9:22 PM, Tevfik Aytekin tevfik.aytekin at gmail.com wrote: Hi Sean, Aren't boolean preferences supported in the context of memory-based recommendation algorithms in Mahout? Are there matrix factorization algorithms in Mahout which can work with this kind of data (that is, the kind of data which consists of users and the movies they have seen)? On Mon, May 6, 2013 at 10:34 PM, Sean Owen srowen at gmail.com wrote: Yes, it goes by the name 'boolean prefs' in the project, since target variables don't have values -- they just exist or don't. So, yes, it's certainly supported, but the question here is how to evaluate the output. On Mon, May 6, 2013 at 8:29 PM, Tevfik Aytekin tevfik.aytekin at gmail.com wrote: This problem is called the one-class classification problem. In the domain of collaborative filtering it is called one-class collaborative filtering (since what you have are only positive preferences). You may search the web with these keywords to find papers providing solutions. I'm not sure whether Mahout has algorithms for one-class collaborative filtering. On Mon, May 6, 2013 at 1:42 PM, Sean Owen srowen at gmail.com wrote: ALS-WR weights the error on each term differently, so the average error doesn't really have meaning here, even if you are comparing the difference with 1. I think you will need to fall back to mean average precision or something. On Mon, May 6, 2013 at 11:24 AM, William icswilliam2010 at gmail.com wrote: Sean Owen srowen at gmail.com writes: If you have no ratings, how are you using RMSE? This typically measures error in reconstructing ratings. I think you are probably measuring something meaningless. I suppose the rating of seen movies is 1. Is that right? If I use collaborative filtering with ALS-WR to get some recommendations, must I have a real rating matrix? I was wondering what kind of format the output produced by parallelALS is stored in. More specifically, I am looking for a way to decode/read this information. I have been able to run the mahout parallelALS command, calculate RMSE using mahout evaluateFactorization, and generate recommendations via mahout recommendfactorized. However, I would like to take a closer look at things like the factorized products for my probeSet (stored in --tempDir from the 'mahout evaluateFactorization' command) and the actual feature vectors stored in the /out/U/ and /out/M/ directories. thanks AJ
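A sketch for inspecting those sequence files from Java; the out/U path is taken from the question above, and SequenceFileDirIterable is Mahout's helper for exactly this IntWritable/VectorWritable layout:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.mahout.common.Pair;
import org.apache.mahout.common.iterator.sequencefile.PathFilters;
import org.apache.mahout.common.iterator.sequencefile.PathType;
import org.apache.mahout.common.iterator.sequencefile.SequenceFileDirIterable;
import org.apache.mahout.math.VectorWritable;

public class DumpFactors {
  public static void main(String[] args) {
    // iterate over all part-files of the user-factor matrix U
    for (Pair<IntWritable, VectorWritable> record :
        new SequenceFileDirIterable<IntWritable, VectorWritable>(
            new Path("out/U"), PathType.LIST, PathFilters.partFilter(), new Configuration())) {
      int rowIndex = record.getFirst().get();       // row of the matrix (user index)
      System.out.println(rowIndex + " -> " + record.getSecond().get()); // feature vector
    }
  }
}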
Re: Load output of rowsimilarity to memory
Hi Juan, It would definitely be nice to have that in the API! It would be great if you could submit a patch after you implement this. Best, Sebastian On 02/25/2014 10:52 AM, Juan José Ramos wrote: Thanks for the answer. That was the approach I had in mind in the first place; the only difference is that I would write the output to a file that can later be used to create a FileItemSimilarity. I think that would be a very nice feature to have in the API. Thanks again. On Mon, Feb 24, 2014 at 9:27 PM, Sebastian Schelter s...@apache.org wrote: I overlooked that you're interested in document similarities. Sorry again :) Another way would be to read the output of RowSimilarityJob with a o.a.m.common.iterator.sequencefile.SequenceFileDirIterable You create a list of instances of o.a.m.cf.taste.impl.similarity.GenericItemSimilarity.ItemItemSimilarity e.g. for the output Key: 0: Value: {61112:0.21139380179557016,52144:0.23797846026935565,...} you would do list.add(new ItemItemSimilarity(0, 61112, 0.21139380179557016)); list.add(new ItemItemSimilarity(0, 52144, 0.23797846026935565)); ... After that you create a GenericItemSimilarity from the list of ItemItemSimilarities, which is the in-memory item similarity you asked for. Hope that helps, Sebastian On 02/24/2014 10:04 PM, Juan José Ramos wrote: Correct me if I'm wrong, but isn't the ItemSimilarityJob meant to be for item-based CF? In particular, in the documentation I can read that: Preferences in the input file should look like userID,itemID[,preferencevalue] And in my case the input I have is just text documents, and I want to pre-compute similarities between them beforehand, even before any user has expressed any preference value for any item. In order to use ItemSimilarityJob for this purpose, what should be the input I need to provide? Would it be the output of seq2sparse? Thanks again. On Mon, Feb 24, 2014 at 8:54 PM, Sebastian Schelter s...@apache.org wrote: You're right, my bad. If you don't use RowSimilarityJob directly, but org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob (which calls RowSimilarityJob under the covers), your output will be a textfile that is directly usable with FileItemSimilarity. --sebastian On 02/24/2014 09:30 PM, Juan José Ramos wrote: Thanks for the prompt reply. RowSimilarityJob produces output in the form of: Key: 0: Value: {61112:0.21139380179557016, 52144:0.23797846026935565,...} whereas FileItemSimilarity is expecting comma- or tab-separated input. I assume that you meant that the output of RowSimilarityJob can be loaded by the FileItemSimilarity after doing the appropriate parsing. Is that correct, or is there actually a way to load the raw output of RowSimilarityJob into FileItemSimilarity? Thanks. On Mon, Feb 24, 2014 at 7:41 PM, Sebastian Schelter s...@apache.org wrote: The output of RowSimilarityJob can be loaded by the FileItemSimilarity. --sebastian On 02/24/2014 08:31 PM, Juan José Ramos wrote: Is there a way to reproduce this process: https://cwiki.apache.org/confluence/display/MAHOUT/Quick+tour+of+text+analysis+using+the+Mahout+command+line inside Java code and not using the command line tool? I am not interested in the clustering part but in 'Calculate several similar docs to each doc in the data'. In particular, I am interested in loading the output of the rowsimilarity tool into memory to be used as my custom ItemSimilarity implementation for an ItemBasedRecommender.
What I exactly want is to have a matrix in memory where for every doc in my catalogue I have the similarity with the 100 (that is the threshold I am using) most similar items and an undefined similarity for the rest. Is it possible to do this with the Java API? I know it can be done by calling the commands from inside the Java code, and I guess also by using the corresponding SparseVectorsFromSequenceFiles, DistributedRowMatrix and RowSimilarityJob classes. But I still cannot see an easy way of parsing the output of RowSimilarityJob into the memory representation I intend to use. Thanks a lot.
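To tie the pieces of this thread together, here is a rough sketch of reading the rowsimilarity output into the in-memory GenericItemSimilarity described above; the output path is an assumption, and on Mahout versions before 0.9 the non-zero iteration is iterateNonZero() rather than nonZeroes():

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.mahout.cf.taste.impl.similarity.GenericItemSimilarity;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;
import org.apache.mahout.common.Pair;
import org.apache.mahout.common.iterator.sequencefile.PathFilters;
import org.apache.mahout.common.iterator.sequencefile.PathType;
import org.apache.mahout.common.iterator.sequencefile.SequenceFileDirIterable;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

Configuration conf = new Configuration();
List<GenericItemSimilarity.ItemItemSimilarity> sims =
    new ArrayList<GenericItemSimilarity.ItemItemSimilarity>();
for (Pair<IntWritable, VectorWritable> row
    : new SequenceFileDirIterable<IntWritable, VectorWritable>(
          new Path("output/rowsimilarity"), PathType.LIST, PathFilters.partFilter(), conf)) {
  long docID = row.getFirst().get();
  for (Vector.Element e : row.getSecond().get().nonZeroes()) {
    // e.index() is the id of a similar doc, e.get() is the similarity value
    sims.add(new GenericItemSimilarity.ItemItemSimilarity(docID, e.index(), e.get()));
  }
}
ItemSimilarity similarity = new GenericItemSimilarity(sims);

Pairs that never appear in the output stay unknown to GenericItemSimilarity, which matches the "undefined similarity for the rest" requirement above.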
Re: Load output of rowsimilarity to memory
If you iterate over the vector, you will get Vector.Element objects. elem.index() gives you the id of the similar thing, elem.get() gives you the similarity value. --sebastian On 02/25/2014 11:58 AM, Juan José Ramos wrote: Regarding the parsing of a VectorWritable object, what is the recommended approach to access the different 'DocID: similarity' pairs? I can see that if I get the String representation of the org.apache.mahout.math.Vector object it should not be hard to parse using the text representation. However, is there a way to access the individual elements of the 'DocID: similarity' pair? I tried iterating through the individual Vector.Element objects and calling get(), but that does not return what I expect. More than happy to contribute to the project once I get this working. Thanks a lot. On Tue, Feb 25, 2014 at 9:52 AM, Juan José Ramos jjar...@gmail.com wrote: Thanks for the answer. That was the approach I had in mind in the first place; the only difference would be that I would write the output to a file that can later be used to create a FileItemSimilarity. I think that would be a very nice feature to have in the API. Thanks again. On Mon, Feb 24, 2014 at 9:27 PM, Sebastian Schelter s...@apache.org wrote: I overlooked that you're interested in document similarities. Sorry again :) Another way would be to read the output of RowSimilarityJob with a o.a.m.common.iterator.sequencefile.SequenceFileDirIterable. You create a list of instances of o.a.m.cf.taste.impl.similarity.GenericItemSimilarity.ItemItemSimilarity e.g. for the output Key: 0: Value: {61112:0.21139380179557016,52144:0.23797846026935565,...} you would do list.add(new ItemItemSimilarity(0, 61112, 0.21139380179557016)); list.add(new ItemItemSimilarity(0, 52144, 0.23797846026935565)); ... After that you create a GenericItemSimilarity from the list of ItemItemSimilarities, which is the in-memory item similarity you asked for. Hope that helps, Sebastian On 02/24/2014 10:04 PM, Juan José Ramos wrote: Correct me if I'm wrong, but isn't the ItemSimilarityJob meant to be for item-based CF? In particular, in the documentation I can read that: Preferences in the input file should look like userID,itemID[,preferencevalue] And in my case the input I have is just text documents and I want to pre-compute similarities between them beforehand, even before any user has expressed any preference value for any item. In order to use ItemSimilarityJob for this purpose, what should be the input I need to provide? Would it be the output of seq2sparse? Thanks again. On Mon, Feb 24, 2014 at 8:54 PM, Sebastian Schelter s...@apache.org wrote: You're right, my bad. If you don't use RowSimilarityJob directly, but org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob (which calls RowSimilarityJob under the covers), your output will be a text file that is directly usable with FileItemSimilarity. --sebastian On 02/24/2014 09:30 PM, Juan José Ramos wrote: Thanks for the prompt reply. RowSimilarityJob produces an output in the form of: Key: 0: Value: {61112:0.21139380179557016, 52144:0.23797846026935565,...} whereas FileItemSimilarity is expecting comma- or tab-separated input. I assume that you meant that the output of RowSimilarityJob can be loaded by the FileItemSimilarity after doing the appropriate parsing. Is that correct, or is there actually a way to load the raw output of RowSimilarityJob into FileItemSimilarity? Thanks. 
On Mon, Feb 24, 2014 at 7:41 PM, Sebastian Schelter s...@apache.org wrote: The output of RowSimilarityJob can be loaded by the FileItemSimilarity. --sebastian On 02/24/2014 08:31 PM, Juan José Ramos wrote: Is there a way to reproduce this process: https://cwiki.apache.org/confluence/display/MAHOUT/Quick+tour+of+text+analysis+using+the+Mahout+command+line inside Java code and not using the command line tool? I am not interested in the clustering part but in 'Calculate several similar docs to each doc in the data'. In particular, I am interested in loading the output of the rowsimilarity tool into memory to be used as my custom ItemSimilarity implementation for an ItemBasedRecommender. What I exactly want is to have a matrix in memory where for every doc in my catalogue I have the similarity with the 100 (that is the threshold I am using) most similar items and an undefined similarity for the rest. Is it possible to do this with the Java API? I know it can be done by calling the commands from inside the Java code, and I guess also by using the corresponding SparseVectorsFromSequenceFiles, DistributedRowMatrix and RowSimilarityJob classes. But I still cannot see an easy way of parsing the output of RowSimilarityJob into the memory representation I intend to use. Thanks a lot.
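For the file-based route Juan mentions (writing the parsed output to a file and loading it with FileItemSimilarity later), the same SequenceFileDirIterable loop shown earlier in this thread can dump itemID1,itemID2,value lines instead; file names are made up and exception handling is omitted (imports for the sequence file iteration as in the earlier sketch):

import java.io.File;
import java.io.PrintWriter;
import org.apache.mahout.cf.taste.impl.similarity.file.FileItemSimilarity;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

PrintWriter out = new PrintWriter(new File("doc-similarities.csv"), "UTF-8");
for (Pair<IntWritable, VectorWritable> row
    : new SequenceFileDirIterable<IntWritable, VectorWritable>(
          new Path("output/rowsimilarity"), PathType.LIST, PathFilters.partFilter(), conf)) {
  int docID = row.getFirst().get();
  for (Vector.Element e : row.getSecond().get().nonZeroes()) {
    out.println(docID + "," + e.index() + "," + e.get());  // itemID1,itemID2,similarity
  }
}
out.close();
ItemSimilarity similarity = new FileItemSimilarity(new File("doc-similarities.csv"));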
Re: Use Naïve Bayes on a large CSV
NaiveBayes expects a SequenceFile as input. The key is the class label as Text, the value is the features as a VectorWritable. --sebastian On 02/24/2014 11:51 AM, Kevin Moulart wrote: Hi again, I finally set my mind on going through Java to make a sequence file for the naive bayes, but I still can't manage to find anyplace stating exactly what should be in the sequence file for Mahout to process it with Naive Bayes. I tried virtually every piece of code I found related to this subject, with no luck. My CSV file is like this: label that I want to predict, feature 1, feature 2, ..., feature 1628 Could someone tell me exactly what the Naive Bayes training procedure expects? 2014-02-20 13:56 GMT+01:00 Jay Vyas jayunit...@gmail.com: This relates to a previous question I have: Does Mahout have a concept of adapters which allow us to read CSV-style data with filters to create the exact format for its various inputs (i.e. the recommender's three-column format)? If not, is it worth a JIRA? On Feb 20, 2014, at 7:50 AM, Kevin Moulart kevinmoul...@gmail.com wrote: Hi and thanks! What about the command line, is there a way to do that using the existing command line? 2014-02-20 12:02 GMT+01:00 Suneel Marthi suneel_mar...@yahoo.com: To convert input CSV to vectors, you can either: a) use CSVIterator, or b) use InputDriver. Either of the above should generate vectors from input CSV that could then be fed into Mahout classifier/clustering jobs. On Thursday, February 20, 2014 5:57 AM, Kevin Moulart kevinmoul...@gmail.com wrote: Hi I'm trying to apply a Naive Bayes classifier to a large CSV file from the command line. I know I have to feed the classifier with a seq file, so I tried to put my CSV into one using the command seqdirectory, but even when I try with a really small CSV (less than 100 MB) I instantly get an OutOfMemoryError (Java heap space): mahout seqdirectory -i /user/cacf/Echant/testSeq -o /user/cacf/resSeq -ow MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath. 
Running on hadoop, using /opt/cloudera/parcels/CDH/lib/hadoop/bin/hadoop and HADOOP_CONF_DIR=/etc/hadoop/conf
MAHOUT-JOB: /usr/lib/mahout/mahout-examples-0.7-cdh4.5.0-job.jar
14/02/20 11:47:22 INFO common.AbstractJob: Command line arguments: {--charset=[UTF-8], --chunkSize=[64], --endPhase=[2147483647], --fileFilterClass=[org.apache.mahout.text.PrefixAdditionFilter], --input=[/user/cacf/Echant/testSeq], --keyPrefix=[], --output=[/user/cacf/resSeq], --overwrite=null, --startPhase=[0], --tempDir=[temp]}
14/02/20 11:47:22 INFO common.HadoopUtil: Deleting /user/cacf/resSeq
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2367)
    at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
    at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:415)
    at java.lang.StringBuilder.append(StringBuilder.java:132)
    at org.apache.mahout.text.PrefixAdditionFilter.process(PrefixAdditionFilter.java:62)
    at org.apache.mahout.text.SequenceFilesFromDirectoryFilter.accept(SequenceFilesFromDirectoryFilter.java:90)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1468)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1502)
    at org.apache.mahout.text.SequenceFilesFromDirectory.run(SequenceFilesFromDirectory.java:98)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
    at org.apache.mahout.text.SequenceFilesFromDirectory.main(SequenceFilesFromDirectory.java:53)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:196)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
Do you have an idea or a simple way to use Naive Bayes against my large CSV? Thanks in advance! -- Kévin Moulart GSM France : +33 7 81 06 10 10 GSM Belgique : +32 473 85 23 85 Téléphone fixe : +32 2 771 88 45 -- Kévin Moulart GSM France : +33 7 81 06 10 10 GSM Belgique : +32 473 85 23 85 Téléphone fixe : +32 2 771 88 45
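Since seqdirectory is meant for directories of whole text documents (the stack trace above shows it buffering an entire file into a StringBuilder, hence the heap error), it is the wrong tool for a numeric CSV; writing the sequence file directly is simpler. A rough sketch, with file names assumed; whether the label must be wrapped in a path-like key (for trainnb -el) depends on your Mahout version, so check before relying on it:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public static void csvToNaiveBayesInput(String csvFile, Path output, Configuration conf)
    throws IOException {
  FileSystem fs = FileSystem.get(conf);
  SequenceFile.Writer writer =
      SequenceFile.createWriter(fs, conf, output, Text.class, VectorWritable.class);
  BufferedReader reader = new BufferedReader(new FileReader(csvFile));
  try {
    String line;
    while ((line = reader.readLine()) != null) {
      String[] cols = line.split(",");
      // column 0 is the label; columns 1..n are the features (1628 of them here)
      Vector features = new DenseVector(cols.length - 1);
      for (int i = 1; i < cols.length; i++) {
        features.set(i - 1, Double.parseDouble(cols[i]));
      }
      // key: the class label as Text; value: the features as a VectorWritable
      writer.append(new Text("/" + cols[0] + "/"), new VectorWritable(features));
    }
  } finally {
    reader.close();
    writer.close();
  }
}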
Re: Load output of rowsimilarity to memory
The output of RowSimilarityJob can be loaded by the FileItemSimilarity. --sebastian On 02/24/2014 08:31 PM, Juan José Ramos wrote: Is there a way to reproduce this process: https://cwiki.apache.org/confluence/display/MAHOUT/Quick+tour+of+text+analysis+using+the+Mahout+command+line inside Java code and not using the command line tool? I am not interested in the clustering part but in 'Calculate several similar docs to each doc in the data'. In particular, I am interested in loading the output of the rowsimilarity tool into memory to be used as my custom ItemSimilarity implementation for an ItemBasedRecommender. What I exactly want is to have a matrix in memory where for every doc in my catalogue I have the similarity with the 100 (that is the threshold I am using) most similar items and an undefined similarity for the rest. Is it possible to do this with the Java API? I know it can be done by calling the commands from inside the Java code, and I guess also by using the corresponding SparseVectorsFromSequenceFiles, DistributedRowMatrix and RowSimilarityJob classes. But I still cannot see an easy way of parsing the output of RowSimilarityJob into the memory representation I intend to use. Thanks a lot.
Re: Load output of rowsimilarity to memory
You're right, my bad. If you don't use RowSimilarityJob directly, but org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob (which calls RowSimilarityJob under the covers), your output will be a text file that is directly usable with FileItemSimilarity. --sebastian On 02/24/2014 09:30 PM, Juan José Ramos wrote: Thanks for the prompt reply. RowSimilarityJob produces an output in the form of: Key: 0: Value: {61112:0.21139380179557016,52144:0.23797846026935565,...} whereas FileItemSimilarity is expecting comma- or tab-separated input. I assume that you meant that the output of RowSimilarityJob can be loaded by the FileItemSimilarity after doing the appropriate parsing. Is that correct, or is there actually a way to load the raw output of RowSimilarityJob into FileItemSimilarity? Thanks. On Mon, Feb 24, 2014 at 7:41 PM, Sebastian Schelter s...@apache.org wrote: The output of RowSimilarityJob can be loaded by the FileItemSimilarity. --sebastian On 02/24/2014 08:31 PM, Juan José Ramos wrote: Is there a way to reproduce this process: https://cwiki.apache.org/confluence/display/MAHOUT/Quick+tour+of+text+analysis+using+the+Mahout+command+line inside Java code and not using the command line tool? I am not interested in the clustering part but in 'Calculate several similar docs to each doc in the data'. In particular, I am interested in loading the output of the rowsimilarity tool into memory to be used as my custom ItemSimilarity implementation for an ItemBasedRecommender. What I exactly want is to have a matrix in memory where for every doc in my catalogue I have the similarity with the 100 (that is the threshold I am using) most similar items and an undefined similarity for the rest. Is it possible to do this with the Java API? I know it can be done by calling the commands from inside the Java code, and I guess also by using the corresponding SparseVectorsFromSequenceFiles, DistributedRowMatrix and RowSimilarityJob classes. But I still cannot see an easy way of parsing the output of RowSimilarityJob into the memory representation I intend to use. Thanks a lot.
Re: Load output of rowsimilarity to memory
I overlooked that you're interested in document similarities. Sorry again :) Another way would be to read the output of RowSimilarityJob with a o.a.m.common.iterator.sequencefile.SequenceFileDirIterable. You create a list of instances of o.a.m.cf.taste.impl.similarity.GenericItemSimilarity.ItemItemSimilarity e.g. for the output Key: 0: Value: {61112:0.21139380179557016,52144:0.23797846026935565,...} you would do list.add(new ItemItemSimilarity(0, 61112, 0.21139380179557016)); list.add(new ItemItemSimilarity(0, 52144, 0.23797846026935565)); ... After that you create a GenericItemSimilarity from the list of ItemItemSimilarities, which is the in-memory item similarity you asked for. Hope that helps, Sebastian On 02/24/2014 10:04 PM, Juan José Ramos wrote: Correct me if I'm wrong, but isn't the ItemSimilarityJob meant to be for item-based CF? In particular, in the documentation I can read that: Preferences in the input file should look like userID,itemID[,preferencevalue] And in my case the input I have is just text documents and I want to pre-compute similarities between them beforehand, even before any user has expressed any preference value for any item. In order to use ItemSimilarityJob for this purpose, what should be the input I need to provide? Would it be the output of seq2sparse? Thanks again. On Mon, Feb 24, 2014 at 8:54 PM, Sebastian Schelter s...@apache.org wrote: You're right, my bad. If you don't use RowSimilarityJob directly, but org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob (which calls RowSimilarityJob under the covers), your output will be a text file that is directly usable with FileItemSimilarity. --sebastian On 02/24/2014 09:30 PM, Juan José Ramos wrote: Thanks for the prompt reply. RowSimilarityJob produces an output in the form of: Key: 0: Value: {61112:0.21139380179557016,52144:0.23797846026935565,...} whereas FileItemSimilarity is expecting comma- or tab-separated input. I assume that you meant that the output of RowSimilarityJob can be loaded by the FileItemSimilarity after doing the appropriate parsing. Is that correct, or is there actually a way to load the raw output of RowSimilarityJob into FileItemSimilarity? Thanks. On Mon, Feb 24, 2014 at 7:41 PM, Sebastian Schelter s...@apache.org wrote: The output of RowSimilarityJob can be loaded by the FileItemSimilarity. --sebastian On 02/24/2014 08:31 PM, Juan José Ramos wrote: Is there a way to reproduce this process: https://cwiki.apache.org/confluence/display/MAHOUT/Quick+tour+of+text+analysis+using+the+Mahout+command+line inside Java code and not using the command line tool? I am not interested in the clustering part but in 'Calculate several similar docs to each doc in the data'. In particular, I am interested in loading the output of the rowsimilarity tool into memory to be used as my custom ItemSimilarity implementation for an ItemBasedRecommender. What I exactly want is to have a matrix in memory where for every doc in my catalogue I have the similarity with the 100 (that is the threshold I am using) most similar items and an undefined similarity for the rest. Is it possible to do this with the Java API? I know it can be done by calling the commands from inside the Java code, and I guess also by using the corresponding SparseVectorsFromSequenceFiles, DistributedRowMatrix and RowSimilarityJob classes. But I still cannot see an easy way of parsing the output of RowSimilarityJob into the memory representation I intend to use. Thanks a lot.
Re: Mahout on Spark?
Completely agree with Sean's statement. On 02/19/2014 01:52 PM, Sean Owen wrote: To set expectations appropriately, I think it's important to point out this is completely infeasible short of a total rewrite, and I can't imagine that will happen. It may not be obvious if you haven't looked at the code how completely dependent on M/R it is. You can swap out M/R for Spark if you write in terms of something like Crunch, but that is not at all the case here. On Wed, Feb 19, 2014 at 12:43 PM, Jay Vyas jayunit...@gmail.com wrote: +100 for this, different execution engines, like the direction pig and crunch take Sent from my iPhone On Feb 19, 2014, at 5:19 AM, Gokhan Capan gkhn...@gmail.com wrote: I imagine Mahout offering an option to the users to select from different execution engines (just like we currently do by giving M/R or sequential options), and starting from Spark. I am not sure what changes are needed in the codebase, though. Maybe following MLI (or alike) and implementing some more stuff, such as common interfaces for iterating over data (the M/R way and the Spark way). IMO, another effort might be porting pre-online machine learning (such as transforming text into vectors based on the dictionary generated by seq2sparse), machine learning based on mini-batches, and the streaming summarization stuff in Mahout to Spark-Streaming. Best, Gokhan On Wed, Feb 19, 2014 at 10:45 AM, Dmitriy Lyubimov dlie...@gmail.com wrote: PS I am moving along a cost optimizer for Spark-backed DRMs on some multiplicative pipelines that is capable of figuring out different cost-based rewrites, and an R-like DSL that mixes in-core and distributed matrix representations and blocks, but it is painfully slow; I am really only doing it a couple of nights a month. It does not look like I will be doing it on company time any time soon (and even if I did, the company doesn't seem to be inclined to contribute anything new I do on their time). It is all painfully slow; there's no direct funding for it anywhere with no strings attached. That will probably be the primary reason why Mahout would not be able to get much traction compared to university-based contributions. On Wed, Feb 19, 2014 at 12:27 AM, Dmitriy Lyubimov dlie...@gmail.com wrote: Unfortunately methinks the prospects of something like a Mahout/MLlib merge seem very unlikely due to vastly diverged approaches to the basics of linear algebra (and other things). Just like one cannot grow a single tree out of two trunks -- not easily, anyway. It is fairly easy to port (and subsequently beat) MLlib at this point from a collection-of-algorithms point of view. But IMO the goal should be more MLI-like first, and a port second. And be very careful with concepts. Something that I so far don't see happening with MLlib. MLlib seems to be an old-style Mahout-like rush to become a collection of basic algorithms rather than a coherent foundation. Admittedly, I haven't looked very closely. On Tue, Feb 18, 2014 at 11:41 PM, Sebastian Schelter s...@apache.org wrote: I'm also convinced that Spark is a superior platform for executing distributed ML algorithms. We've had a discussion about a change from Hadoop to another platform some time ago, but at that point in time it was not clear which of the upcoming dataflow processing systems (Spark, Hyracks, Stratosphere) would establish itself amongst the users. To me it seems pretty obvious that Spark has won the race. I concur with Ted, it would be great to have the communities work together. 
I know that at least 4 Mahout committers (including me) are already following Spark's mailing list and actively participating in the discussions. What are the ideas for how a fruitful cooperation could look? Best, Sebastian PS: I ported LLR-based cooccurrence analysis (aka item-based recommendation) to Spark some time ago, but I haven't had time to test my code on a large dataset yet. I'd be happy to see someone help with that. On 02/19/2014 08:04 AM, Nick Pentreath wrote: I know the Spark/MLlib devs can occasionally be quite set in their ways of doing certain things, but we'd welcome as many Mahout devs as possible to work together. It may be too late, but perhaps a GSoC project to look at a port of some stuff like the co-occurrence recommender and streaming k-means? N -- Sent from Mailbox for iPhone On Wed, Feb 19, 2014 at 3:02 AM, Ted Dunning ted.dunn...@gmail.com wrote: On Tue, Feb 18, 2014 at 1:58 PM, Nick Pentreath nick.pentre...@gmail.com wrote: My (admittedly heavily biased) view is Spark is a superior platform overall for ML. If the two communities can work together to leverage the strengths of Spark, and the large amount of good stuff in Mahout (as well as the fantastic depth of experience of Mahout devs), I think a lot can be achieved! It makes a lot of sense that Spark would be better than Hadoop for ML purposes given that Hadoop was intended to do web-crawl kinds of things and Spark was intentionally built to support machine learning.
Re: Mahout on Spark?
I'm also convinced that Spark is a superior platform for executing distributed ML algorithms. We've had a discussion about a change from Hadoop to another platform some time ago, but at that point in time it was not clear which of the upcoming dataflow processing systems (Spark, Hyracks, Stratosphere) would establish itself amongst the users. To me it seems pretty obvious that Spark has won the race. I concur with Ted, it would be great to have the communities work together. I know that at least 4 Mahout committers (including me) are already following Spark's mailing list and actively participating in the discussions. What are the ideas for how a fruitful cooperation could look? Best, Sebastian PS: I ported LLR-based cooccurrence analysis (aka item-based recommendation) to Spark some time ago, but I haven't had time to test my code on a large dataset yet. I'd be happy to see someone help with that. On 02/19/2014 08:04 AM, Nick Pentreath wrote: I know the Spark/MLlib devs can occasionally be quite set in their ways of doing certain things, but we'd welcome as many Mahout devs as possible to work together. It may be too late, but perhaps a GSoC project to look at a port of some stuff like the co-occurrence recommender and streaming k-means? N — Sent from Mailbox for iPhone On Wed, Feb 19, 2014 at 3:02 AM, Ted Dunning ted.dunn...@gmail.com wrote: On Tue, Feb 18, 2014 at 1:58 PM, Nick Pentreath nick.pentre...@gmail.com wrote: My (admittedly heavily biased) view is Spark is a superior platform overall for ML. If the two communities can work together to leverage the strengths of Spark, and the large amount of good stuff in Mahout (as well as the fantastic depth of experience of Mahout devs), I think a lot can be achieved! It makes a lot of sense that Spark would be better than Hadoop for ML purposes given that Hadoop was intended to do web-crawl kinds of things and Spark was intentionally built to support machine learning. Given that Spark has been announced by a majority of the Hadoop-based distribution vendors, it makes sense that maybe Mahout should jump in. I really would prefer it if the two communities (MLlib/MLI and Mahout) could work more closely together. There is a lot of good to be had on both sides.
Re: get similar items
Hi, Mahout's recommenders are based on analyzing interactions between users and items/movies, e.g. ratings or counts of how often the movie was watched. On 02/12/2014 11:34 AM, N! wrote: Hi all: Does anyone have any suggestions for the questions below? Thanks a lot. -- Original -- Sender: N!12481...@qq.com; Send time: Wednesday, Feb 12, 2014 6:17 PM To: user@mahout.apache.org; Subject: Re: get similar items Hi Sean: Thanks for the reply. Assume I have only one table named 'movie' with 1000+ records; this table has three columns: 'id', 'movieName', 'movieDescription'. Can Mahout calculate the most similar movies for a movie (based only on the 'movie' table)? Code like: List mostSimilarMovieList = recommender.mostSimilar(int movieId). If not, do you have any suggestions for this scenario?
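So with only names and descriptions there is nothing for the collaborative filtering code to work on. If you can log even simple watch events, though, the call being asked for exists almost literally; a sketch with an assumed interaction file and a made-up movie id (exception handling omitted):

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;

DataModel model = new FileDataModel(new File("watches.csv"));  // lines: userID,movieID[,rating]
GenericItemBasedRecommender recommender =
    new GenericItemBasedRecommender(model, new LogLikelihoodSimilarity(model));
List<RecommendedItem> similar = recommender.mostSimilarItems(42L, 10);  // 10 movies most like movie 42

For similarity computed purely from the movieDescription text, the seq2sparse/rowsimilarity route discussed elsewhere on this list is the better fit.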
Re: Mahout algorithms
That is outdated unfortunately. I will send a list of current algorithms shortly. --sebastian On 02/05/2014 11:13 AM, Chameera Wijebandara wrote: Hi Sergey, This will help: https://cwiki.apache.org/confluence/display/MAHOUT/Algorithms Thanks, Chameera On Wed, Feb 5, 2014 at 3:30 PM, Sergey Svinarchuk ssvinarc...@hortonworks.com wrote: Hi, Where can I see all the algorithms included in Mahout 0.9, and documentation for these algorithms? Thanks, Sergey!
Re: Mahout algorithms
Hi Sergey, here is the list of algorithms. We're currently in the process of reworking our wiki, that's why the documentation is unfortunately incorrect at the moment. I've added a ticket for this: https://issues.apache.org/jira/browse/MAHOUT-1413

Here's the current list of algorithms in Mahout 0.9:

Recommenders (non-distributed):
- user-based collaborative filtering
- item-based collaborative filtering
- latent-factor models (SGD, SVD++, ALS)

Recommenders (distributed):
- item-based collaborative filtering
- latent-factor models (ALS)

Classification (non-distributed):
- logistic regression solved with SGD
- Multilayer Perceptron
- Hidden Markov Models

Classification (distributed):
- Naive Bayes
- Random Forests

Clustering (distributed):
- Canopy
- k-Means
- streaming k-Means
- fuzzy k-Means
- spectral k-Means

Topic Models (distributed):
- Latent Dirichlet Allocation

Frequent Pattern Mining (distributed)

Math (distributed):
- SVD using the Lanczos algorithm
- Stochastic SVD

Hope that helps. Best, Sebastian On 02/05/2014 11:00 AM, Sergey Svinarchuk wrote: Hi, Where can I see all the algorithms included in Mahout 0.9, and documentation for these algorithms? Thanks, Sergey!
Re: SGD classifier demo app
Would be great to add this as an example to Mahout's codebase. On 02/04/2014 10:27 AM, Ted Dunning wrote: Frank, I just munched on your code and sent a pull request. In doing this, I made a bunch of changes. Hope you like them. These include massive simplification of the reading and vectorization. This wasn't strictly necessary, but it seemed like a good idea. More important was the way that I changed the vectorization. For the continuous values, I added log transforms. For the categorical values, I encoded them as they are. I also increased the feature vector size to 100 to avoid excessive collisions. In the learning code itself, I got rid of the use of index arrays in favor of shuffling the training data itself. I also tuned the learning parameters a lot. The result is that the AUC that results is just a tiny bit less than 0.9, which is pretty close to what I got in R. For everybody else, see https://github.com/tdunning/mahout-sgd-bank-marketing for my version and https://github.com/tdunning/mahout-sgd-bank-marketing/compare/frankscholten:master...master for my pull request. On Mon, Feb 3, 2014 at 3:57 PM, Ted Dunning ted.dunn...@gmail.com wrote: Johannes, Very good comments. Frank, As a benchmark, I just spent a few minutes building a logistic regression model using R. For this model AUC on 10% held-out data is about 0.9. Here is a gist summarizing the results: https://gist.github.com/tdunning/8794734 On Mon, Feb 3, 2014 at 2:41 PM, Johannes Schulte johannes.schu...@gmail.com wrote: Hi Frank, you are using the feature vector encoders which hash a combination of feature name and feature value to 2 (default) locations in the vector. The vector size you configured is 11 and this is imo very small compared to the possible combinations of values you have for your data (education, marital, campaign). You can do no harm by using a much bigger cardinality (try 1000). Second, you are using a continuous value encoder, passing in the weight you are using as a string (e.g. variable pDays). I am not quite sure about the reasons in the Mahout code right now, but the way it is implemented now, every unique value should end up in a different location because the continuous value is part of the hashing. Try adding the weight directly using a static word value encoder, addToVector(pDays,v,pDays) Last, you are also putting in the variable campaign as a continuous variable, which should probably be a categorical variable, so just add it with a StaticWordValueEncoder. And finally, and probably most important after looking at your target variable: you are using a Dictionary for mapping either yes or no to 0 or 1. This is bad. Depending on what comes first in the data set, either a positive or negative example might be 0 or 1, totally random. Make a hard mapping from the possible values (y/n?) to zero and one, with yes as 1 and no as 0. On Mon, Feb 3, 2014 at 9:33 PM, Frank Scholten fr...@frankscholten.nl wrote: Hi all, I am exploring Mahout's SGD classifier and would like some feedback because I think I didn't properly configure things. I created an example app that trains an SGD classifier on the 'bank marketing' dataset from UCI: http://archive.ics.uci.edu/ml/datasets/Bank+Marketing My app is at: https://github.com/frankscholten/mahout-sgd-bank-marketing The app reads a CSV file of telephone calls, encodes the features into a vector and tries to predict whether a customer answers yes to a business proposal. I do a few runs and measure accuracy but I don't trust the results. 
When I only use an intercept term as a feature I get around 88% accuracy, and when I add all features it drops to around 85%. Is this perhaps because the dataset is highly unbalanced? Most customers answer no. Or is the classifier biased to predict 0 as the target code when it doesn't have any data to go with? Any other comments about my code or improvements I can make in the app are welcome! :) Cheers, Frank
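For later readers, a condensed sketch of the encoding Johannes recommends; the dimension, the sample values and the yes/no mapping are illustrative, and ConstantValueEncoder is used here as the usual way to add a continuous value as a weight:

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.vectorizer.encoders.ConstantValueEncoder;
import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

Vector v = new RandomAccessSparseVector(1000);  // much larger than 11, to limit hash collisions

ConstantValueEncoder intercept = new ConstantValueEncoder("intercept");
ConstantValueEncoder pDaysEncoder = new ConstantValueEncoder("pDays");
StaticWordValueEncoder educationEncoder = new StaticWordValueEncoder("education");
StaticWordValueEncoder campaignEncoder = new StaticWordValueEncoder("campaign");

intercept.addToVector("", 1.0, v);              // bias term
pDaysEncoder.addToVector("", 42.0, v);          // continuous value enters as the weight, not via hashing
educationEncoder.addToVector("tertiary", v);    // categorical: the word itself is hashed
campaignEncoder.addToVector("3", v);            // treat campaign as categorical, too

String label = "yes";                           // would come from the CSV in the real app
int target = "yes".equals(label) ? 1 : 0;       // fixed mapping instead of an order-dependent Dictionary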
Re: Mahout 0.9 Release
Hi Suneel, That's great news, thank you for driving this release! On 02/02/2014 10:22 PM, Suneel Marthi wrote: Mahout 0.9 has been pushed to the mirrors and is available for download at http://www.apache.org/dyn/closer.cgi/mahout/ On Friday, January 31, 2014 11:21 PM, Suneel Marthi suneel_mar...@yahoo.com wrote: The release has passed with the required votes from the PMC; will be pushing 0.9 to the mirrors and updating the release notes over the next day or two. On Thursday, January 30, 2014 2:16 AM, Stevo Slavić ssla...@gmail.com wrote: +1 On Wed, Jan 29, 2014 at 10:56 PM, Shannon Quinn squ...@gatech.edu wrote: LGTM On 1/29/14, 4:27 PM, peng wrote: +1, can't see a bad side. On Wed 29 Jan 2014 11:33:02 AM EST, Suneel Marthi wrote: +1 from me On Wednesday, January 29, 2014 8:58 AM, Sebastian Schelter s...@apache.org wrote: +1 On 01/29/2014 05:25 AM, Andrew Musselman wrote: Looks good. +1 On Tue, Jan 28, 2014 at 8:07 PM, Andrew Palumbo ap@outlook.com wrote: a), b), c), d) all passed here. CosineDistance of clustered points from cluster-reuters.sh -1 kmeans were within the range [0,1]. Date: Tue, 28 Jan 2014 16:45:42 -0800 From: suneel_mar...@yahoo.com Subject: Mahout 0.9 Release To: user@mahout.apache.org; d...@mahout.apache.org Fixed the issues that were reported with the clustering code this past week, and upgraded the codebase to Lucene 4.6.1, which was released today. Here's the URL for the 0.9 release in staging:- https://repository.apache.org/content/repositories/orgapachemahout-1004/org/apache/mahout/mahout-distribution/0.9/ The artifacts have been signed with the following key: https://people.apache.org/keys/committer/smarthi.asc Please:- a) Verify that you can unpack the release (tar or zip) b) Verify you are able to compile the distro c) Run through the unit tests: mvn clean test d) Run the example scripts under $MAHOUT_HOME/examples/bin. Please run through all the different options in each script. Need a minimum of 3 '+1' votes from the PMC for the release to be finalized.
Re: generic latent variable recommender question
Case 1 is fine as is. For Case 2 I would suggest simply experimenting: try different similarity measures like euclidean distance or cosine and see what gives the best results. --sebastian On 01/25/2014 04:08 AM, Koobas wrote: A generic latent variable recommender question. I passed the user-item matrix through a low-rank approximation, with either something like ALS or SVD, and now I have the feature vectors for all users and all items. Case 1: I want to recommend items to a user. I compute the dot product of the user’s feature vector with the feature vectors of all the items. I eliminate the ones that the user already has, and find the largest value among the others, right? Case 2: I want to find similar items for an item. Should I compute the dot product of the item’s feature vector against the feature vectors of all the other items? OR Should I compute the ANGLE between each pair of feature vectors? I.e., compute the cosine similarity? I.e., normalize the vectors before computing the dot products? If “yes” for case 2, is that something I should also do for case 1?
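In Mahout vector terms the two cases differ only by a normalization; a tiny sketch with made-up feature vectors:

import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

Vector userVec = new DenseVector(new double[] {0.3, 1.2, -0.4});
Vector itemA = new DenseVector(new double[] {0.9, 0.1, 0.7});
Vector itemB = new DenseVector(new double[] {1.0, 0.0, 0.5});

// Case 1: predicted preference is the plain dot product
double score = userVec.dot(itemA);

// Case 2: cosine similarity is the dot product of the normalized vectors
double cosine = itemA.dot(itemB) / (itemA.norm(2) * itemB.norm(2));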
Re: Pig local mode issue
I think this question is better suited for the mailing list of the Pig project. On 01/23/2014 01:24 AM, Sameer Tilak wrote: Hi All, My script runs fine in map reduce mode, but I get the following error when I run it in local mode. I have made sure that the input file exists. I am not sure why map reduce is coming into the picture when it is local mode. pig -x local myscript.pig
2014-01-22 16:14:02,771 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2014-01-22 16:14:02,805 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 3
2014-01-22 16:14:02,806 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 2 map-only splittees.
2014-01-22 16:14:02,806 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 2 out of total 3 MR operators.
2014-01-22 16:14:02,806 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2014-01-22 16:14:02,845 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
2014-01-22 16:14:02,865 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2014-01-22 16:14:02,876 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Using reducer estimator: org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator
2014-01-22 16:14:02,878 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator - BytesPerReducer=10 maxReducers=999 totalInputFileSize=9940865
2014-01-22 16:14:02,878 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting Parallelism to 1
2014-01-22 16:14:02,909 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up multi store job
2014-01-22 16:14:02,918 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Key [pig.schematuple] is false, will not generate code.
2014-01-22 16:14:02,918 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Starting process to move generated code to distributed cache
2014-01-22 16:14:02,918 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Distributed cache not supported or needed in local mode. Setting key [pig.schematuple.local.dir] with code temp directory: /tmp/1390436042918-0
2014-01-22 16:14:02,978 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2014-01-22 16:14:02,991 [JobControl] INFO org.apache.hadoop.util.NativeCodeLoader - Loaded the native-hadoop library
2014-01-22 16:14:02,994 [JobControl] ERROR org.apache.hadoop.security.UserGroupInformation - PriviledgedActionException as:userid cause:ENOENT: No such file or directory
2014-01-22 16:14:03,479 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2014-01-22 16:14:03,489 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Ooops! Some job has failed! Specify -stop_on_failure if you want Pig to stop immediately on failure.
2014-01-22 16:14:03,489 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job null has failed! Stop running all dependent jobs
2014-01-22 16:14:03,490 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2014-01-22 16:14:03,492 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to recreate exception from backend error: ENOENT: No such file or directory
    at org.apache.hadoop.io.nativeio.NativeIO.chmod(Native Method)
    at org.apache.hadoop.fs.FileUtil.execSetPermission(FileUtil.java:699)
    at org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:654)
    at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:509)
    at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:344)
    at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:189)
    at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:116)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:856)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
    at
Re: Problem with ItemSimilarityJob, empty part-r-00000
Hi Quentin, Have you checked the log to ensure that you don't get any exceptions during the computation? Could you test the job with a tiny example where you can calculate the result by hand? Can you share an input file on which this job fails? --sebastian On 01/21/2014 11:22 AM, Quentin-Gabriel Thurier wrote: I've encountered a few troubles with Mahout that I can't sort out. The context is that I'm trying to calculate pairwise euclidean distances between music tracks based on 6 audio features per track. My input for the Mahout job is a text file which looks like this: feature_id,track_id,feature_value (integer,integer,double). This command works locally for fewer than 600 tracks (based on mahout-core-0.7-cdh4.5.0-job.jar): mahout itemsimilarity --input input/msd_sample/mahout --output output/mahout --similarityClassname SIMILARITY_EUCLIDEAN_DISTANCE --booleanData false --maxSimilaritiesPerItem 1 But for more tracks I get an empty file part-r-00000. I tried to decrease the --threshold parameter but I still don't have any result. I also tried to launch the job on AWS EMR with the equivalent input for 3000 tracks (based on mahout-core-0.8-job.jar): org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob --input s3n://hadoop-filrouge/input/msd-sample/mahout --output s3n://hadoop-filrouge/output/mahout/01202014-itemsimilarity --similarityClassname SIMILARITY_EUCLIDEAN_DISTANCE --booleanData false --maxSimilaritiesPerItem 1 The job runs successfully but I get 17 empty part-r-000xx files. I'm totally stuck right now and I'm running out of ideas to fix this issue. So if anybody has even a little idea of what is going on, that could really help. Many thanks,
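To make the tiny-example suggestion concrete, an input small enough to check by hand could look like this (made-up ids and values, in the same userID,itemID,value layout as the failing file):

1,101,5.0
1,102,3.0
2,101,2.0
2,102,2.5
2,103,4.0
3,103,1.0

mahout itemsimilarity --input tiny.csv --output tiny-out --similarityClassname SIMILARITY_EUCLIDEAN_DISTANCE --booleanData false --maxSimilaritiesPerItem 10

With only three items there are just three pairwise similarities to verify by hand; if part-r-00000 is empty even for this input, the problem is in the setup rather than in the data or the threshold.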