Hi,

Erm, sorry for the super late reply; I was stuck with some other problems 
last week. My data is currently in this form (each line in the format: 
tag_uri image_1_uri image_2_uri ...)

http://flickr.com/photos/tags/100commentgroup 
http://flickr.com/photos/34254318@N06/4019040356 
http://flickr.com/photos/46857830@N03/5651576112
http://flickr.com/photos/tags/beauty 
http://flickr.com/photos/7309029@N06/3233772398
http://flickr.com/photos/tags/canon 
http://flickr.com/photos/13980928@N03/6001200971 
http://flickr.com/photos/21207178@N07/5441742937 
http://flickr.com/photos/25845846@N06/3033371575 
http://flickr.com/photos/35581435@N07/5655217599 
http://flickr.com/photos/42987061@N04/5872736361

What I did before this was turn the raw data above into CSV/ARFF as 
follows, with each line describing a tag:

image_uri_1,image_uri_2,... (header)
1,0,0,1,1,1,0, ... (example data for tag_uri_1)
1,1,0,1,0,1,0, ... (example data for tag_uri_2)

...

If I have 1000 records (1000 rows), do I have to generate 1000 SequenceFiles or 
just one (like the CSV)? And considering the data that I have in hand (basically 
a sparse matrix), can I generate the SequenceFile directly (without going through 
the arff->mvc step)?
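
On the second question, here is a minimal sketch of the direct route, assuming
one record per line with whitespace-separated URIs as described above. It only
covers the parsing step; the class and method names (SparseTagRows, parse) are
hypothetical and not part of Mahout. Each tag becomes a sorted set of column
indices, which is the sparse form of one row of the 0/1 matrix. Since a
SequenceFile holds many key/value entries, each row could then presumably be
written as one entry of a single file, using the SequenceFile.Writer and
VectorWritable that Sean mentions, rather than as a file of its own.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical helper: turns "tag_uri image_uri ..." lines into sparse binary rows. */
final class SparseTagRows {

    /** Column dictionary (image URI -> column index) plus one sorted index set per tag. */
    static final class Result {
        final Map<String, Integer> columns = new LinkedHashMap<>();
        final Map<String, TreeSet<Integer>> rows = new LinkedHashMap<>();
    }

    /** Assumes each line is: tag_uri followed by zero or more image URIs. */
    static Result parse(List<String> lines) {
        Result result = new Result();
        for (String line : lines) {
            String[] fields = line.trim().split("\\s+");
            if (fields[0].isEmpty()) {
                continue; // skip blank lines
            }
            TreeSet<Integer> row =
                    result.rows.computeIfAbsent(fields[0], k -> new TreeSet<>());
            for (int i = 1; i < fields.length; i++) {
                // Assign the next free column index the first time an image is seen.
                Integer col = result.columns.get(fields[i]);
                if (col == null) {
                    col = result.columns.size();
                    result.columns.put(fields[i], col);
                }
                row.add(col); // only the non-zero cells are stored
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "http://flickr.com/photos/tags/100commentgroup "
                        + "http://flickr.com/photos/34254318@N06/4019040356 "
                        + "http://flickr.com/photos/46857830@N03/5651576112",
                "http://flickr.com/photos/tags/beauty "
                        + "http://flickr.com/photos/7309029@N06/3233772398");
        Result result = parse(lines);
        System.out.println("columns: " + result.columns.size());
        result.rows.forEach((tag, cols) -> System.out.println(tag + " -> " + cols));
    }
}
```

The number of distinct image URIs (result.columns.size()) would be the vector
cardinality, and each index set lists the positions to set to 1 in a sparse
vector, so the dense 1,0,0,1,... rows never need to be materialised.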

For the clusterdumper out-of-memory problem, I have just read Jeff Eastman's 
email 
<http://mahout.markmail.org/search/?q=#query:+page:4+mid:jpzv6u36kcellzvp+state:results>, 
which suggests a workaround. I will try that after I get the clustering done again.

Best wishes,
Jeffrey04

>________________________________
>From: Sean Owen <[email protected]>
>To: Jeffrey <[email protected]>
>Cc: "[email protected]" <[email protected]>
>Sent: Tuesday, August 9, 2011 3:19 PM
>Subject: Re: Needs clue to create a Proof of Concept recommender
>
>You need some glue code here -- what you need to create in Java is a
>SequenceFile.Writer, and feed that to a VectorWritable, which knows how to
>write vectors in the right format. It's straightforward but needs some
>coding. There's no magic that ingests SQL and outputs this.
>
>Yes, but where is the memory error? Then we can say what setting to change.
>Is it a Hadoop worker?
>
>OK, so we're on clustering, good to clarify. So the question is just how to
>get the input in the right place and format and how to avoid that error?
>
>On Tue, Aug 9, 2011 at 8:15 AM, Jeffrey <[email protected]> wrote:
>
>> Hi Sean,
>>
>> Thanks for the help. I am currently reading <
>> http://wiki.apache.org/hadoop/SequenceFile> for more information (please
>> let me know if I am not reading the right document). So in short, by using
>> the API, I can produce a SequenceFile by feeding the sql result containing
>> image and tag data into it?
>>
>> OME - Out of Memory Error lol (for more information on my attempt to
>> cluster my test data, please refer to <
>> http://mahout.markmail.org/search/?q=#query:+page:30+mid:nseo36uopmgat5iv+state:results>,
>> let me know if the link is broken)
>>
>> Yea, I am making a recommender, but I can't implement the whole thing at
>> once and I have no idea how to implement the other parts right now (yea,
>> have the habit of breaking a project into small parts). My current task is
>> to implement the tag clustering component as mentioned in the previous mail.
>>
>> @Jeffrey04
>>
>> ------------------------------
>> *From:* Sean Owen <[email protected]>
>> *To:* [email protected]; Jeffrey <[email protected]>
>> *Sent:* Tuesday, August 9, 2011 2:54 PM
>> *Subject:* Re: Needs clue to create a Proof of Concept recommender
>>
>> You don't need ARFF, no. You can write some Java code to write a
>> SequenceFile directly, one entry at a time. It would take a little study of
>> the code to understand how it works but it's probably just 10 lines.
>>
>> What is the "OME" error?
>>
>> Results can live wherever you want; HDFS is the most natural choice for a
>> SequenceFile.
>>
>> You say you're making a recommender but sounds like your task now is
>> clustering?
>>
>> On Tue, Aug 9, 2011 at 7:27 AM, Jeffrey <[email protected]> wrote:
>>
>> Hi,
>>
>> I am trying to implement a recommender system for my postgraduate project.
>> I currently have all my data (collected using flickr API) stored in the
>> MySQL database in RDF form using Redland <http://librdf.org> (lol, PHP is
>> my main language hence Redland).
>>
>> The recommender system is basically designed along the lines of the paper
>> published by Jonathan Gemmell et al. (reference listed below), where tag
>> clusters are also generated to find out the similarity measure between
>> clusters and items/users (hence was really frustrating when I failed to dump
>> the points for fuzzy k-means cluster). I am currently reading some articles
>> on implementing taste (recommender framework) with mahout but the use cases
>> described in the article are quite different than what I am about to
>> implement.
>>
>> I am still trying to build the tag clusters properly now. Each tag is now
>> represented as a vector of resources (each equivalent to a row in the item-tag
>> matrix). I currently generate the vectors by converting a pre-generated
>> arff by following this tutorial <
>> https://cwiki.apache.org/confluence/display/MAHOUT/Creating+Vectors+from+Weka%27s+ARFF+Format>.
>> Is there another way of doing this (is it possible to generate the vectors
>> without first generating ARFF)? I have also read this <
>> https://cwiki.apache.org/confluence/display/MAHOUT/Creating+Vectors+from+Text>
>> but can't seem to relate it to my use case right now.
>>
>> Since I can't dump the points for the clusters using cluster dumper (I keep
>> getting OME), I will probably calculate the degree of membership manually.
>> Where should I store the result (MySQL via JDBC? Hadoop Bigtable?
>> Cassandra?) so that I can reuse it later for further calculation (eg.
>> similarity of an item with a cluster)?
>>
>> Reference:
>> Shepitsen, Andriy; Gemmell, Jonathan; Mobasher, Bamshad; Burke,
>> Robin. Personalized Recommendation in Folksonomies. Proceedings of the 2nd
>> International Conference on Recommender Systems. Lausanne, Switzerland.
>> October 23, 2008.
>>
>> p/s: I probably really should find a copy of "Mahout in Action" since I
>> keep seeing it being recommended.
>>
>> best wishes,
>> Jeffrey04
>>
>>
>>
>>
>>
>
>
>
