Hi,

Erm, sorry for the super late reply; I was stuck with some other problems last week. I currently have data in this form (each line in the format: tag_uri image_1_uri image_2_uri ...):

http://flickr.com/photos/tags/100commentgroup http://flickr.com/photos/34254318@N06/4019040356 http://flickr.com/photos/46857830@N03/5651576112
http://flickr.com/photos/tags/beauty http://flickr.com/photos/7309029@N06/3233772398
http://flickr.com/photos/tags/canon http://flickr.com/photos/13980928@N03/6001200971 http://flickr.com/photos/21207178@N07/5441742937 http://flickr.com/photos/25845846@N06/3033371575 http://flickr.com/photos/35581435@N07/5655217599 http://flickr.com/photos/42987061@N04/5872736361

What I did before this was turn the raw data above into csv/arff as follows, with each line describing a tag:

image_uri_1,image_uri_2,... (header)
1,0,0,1,1,1,0, ... (example data for tag_uri_1)
1,1,0,1,0,1,0, ... (example data for tag_uri_2)
...

If I have 1000 records (1000 rows), do I have to generate 1000 sequenceFiles or just 1 (like the csv)? And considering the data that I have in hand (basically a sparse matrix), can I generate the sequenceFile directly (without going through the arff->mvc step)?

For the clusterdumper out of memory problem, I have just read Jeff Eastman's email <http://mahout.markmail.org/search/?q=#query:+page:4+mid:jpzv6u36kcellzvp+state:results>, which suggests a workaround. I will try that after I get the clustering done again.

Best wishes,
Jeffrey04

>________________________________
>From: Sean Owen <[email protected]>
>To: Jeffrey <[email protected]>
>Cc: "[email protected]" <[email protected]>
>Sent: Tuesday, August 9, 2011 3:19 PM
>Subject: Re: Needs clue to create a Proof of Concept recommender
>
>You need some glue code here -- what you need to create in Java is a
>SequenceFile.Writer, and feed that to a VectorWritable, which knows how to
>write vectors in the right format. It's straightforward but needs some
>coding. There's no magic that ingests SQL and outputs this.
>
>Yes, but where is the memory error? Then we can say what setting to change.
>Is it a Hadoop worker?
>
>OK, so we're on clustering, good to clarify.
>So the question is just how to
>get the input in the right place and format and how to avoid that error?
>
>On Tue, Aug 9, 2011 at 8:15 AM, Jeffrey <[email protected]> wrote:
>
>> Hi Sean,
>>
>> Thanks for the help. I am currently reading
>> <http://wiki.apache.org/hadoop/SequenceFile> for more information (please
>> let me know if I am not reading the right document). So in short, by using
>> the API, I can produce a SequenceFile by feeding the SQL result containing
>> image and tag data into it?
>>
>> OME - Out of Memory Error lol (for more information on my attempt to
>> cluster my test data, please refer to
>> <http://mahout.markmail.org/search/?q=#query:+page:30+mid:nseo36uopmgat5iv+state:results>,
>> let me know if the link is broken)
>>
>> Yea, I am making a recommender, but I can't implement the whole thing at
>> once and I have no idea how to implement the other parts right now (yea,
>> I have the habit of breaking a project into small parts). My current task is
>> to implement the tag clustering component as mentioned in the previous mail.
>>
>> @Jeffrey04
>>
>> ------------------------------
>> *From:* Sean Owen <[email protected]>
>> *To:* [email protected]; Jeffrey <[email protected]>
>> *Sent:* Tuesday, August 9, 2011 2:54 PM
>> *Subject:* Re: Needs clue to create a Proof of Concept recommender
>>
>> You don't need ARFF, no. You can write some Java code to write a
>> SequenceFile directly, one entry at a time. It would take a little study of
>> the code to understand how it works but it's probably just 10 lines.
>>
>> What is the "OME" error?
>>
>> Results can live wherever you want; HDFS is the most natural choice for a
>> SequenceFile.
>>
>> You say you're making a recommender but sounds like your task now is
>> clustering?
>>
>> On Tue, Aug 9, 2011 at 7:27 AM, Jeffrey <[email protected]> wrote:
>>
>> Hi,
>>
>> I am trying to implement a recommender system for my postgraduate project.
>> I currently have all my data (collected using the flickr API) stored in a
>> MySQL database in RDF form using Redland <http://librdf.org> (lol, PHP is
>> my main language hence Redland).
>>
>> The recommender system is basically designed similarly to the paper
>> published by Jonathan Gemmell et al. (reference listed below), where tag
>> clusters are also generated to find out the similarity measure between
>> clusters and items/users (hence it was really frustrating when I failed to
>> dump the points for the fuzzy k-means clusters). I am currently reading some
>> articles on implementing taste (recommender framework) with mahout but the
>> use cases described in the articles are quite different from what I am about
>> to implement.
>>
>> I am still trying to build the tag clusters properly now. Each tag is now
>> represented as a vector of resources (each equivalent to a row in the
>> item-tag matrix). I currently generate the vectors by converting a
>> pre-generated arff file, following this tutorial
>> <https://cwiki.apache.org/confluence/display/MAHOUT/Creating+Vectors+from+Weka%27s+ARFF+Format>.
>> Is there another way of doing this (is it possible to generate the vectors
>> without first generating arff)? I have also read this
>> <https://cwiki.apache.org/confluence/display/MAHOUT/Creating+Vectors+from+Text>
>> but can't seem to relate it to my use case right now.
>>
>> Since I can't dump the points for the clusters using cluster dumper (I keep
>> getting OME), I would probably calculate the degree of membership manually.
>> Where should I store the result (MySQL via JDBC? Hadoop Bigtable?
>> Cassandra?) so that I can reuse it later for further calculation (e.g.
>> similarity of an item with a cluster)?
>>
>> Reference:
>> Shepitsen, Andriy; Gemmell, Jonathan; Mobasher, Bamshad; Burke,
>> Robin. Personalized Recommendation in Folksonomies. Proceedings of the 2nd
>> International Conference on Recommender Systems. Lausanne, Switzerland.
>> October 23, 2008.
>>
>> p/s: I probably really should find a copy of "Mahout in Action" since I
>> keep seeing it being recommended.
>>
>> best wishes,
>> Jeffrey04
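
p/s: On the sparse-matrix question at the top of this mail, here is a minimal sketch (not a Mahout API) of how the raw "tag_uri image_1_uri image_2_uri ..." lines could be turned into sparse rows without building the dense csv/arff first. It is written in Python purely for illustration, and the function name build_sparse_rows is hypothetical. Each resulting row only lists the column numbers that would hold a 1. Following Sean's "one entry at a time" remark, the assumption is that each row would then become one key/value entry (tag URI as key, sparse vector as value) appended to a single SequenceFile from Java.

```python
def build_sparse_rows(lines):
    """Turn 'tag_uri image_uri ...' lines into sparse rows.

    Returns (image_index, rows):
      image_index maps each image URI to a column number, assigned in
      order of first appearance;
      rows maps each tag URI to the sorted column numbers that would
      hold a 1 in the dense csv/arff representation.
    """
    image_index = {}
    columns_by_tag = {}
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        tag, images = parts[0], parts[1:]
        columns = columns_by_tag.setdefault(tag, set())
        for uri in images:
            if uri not in image_index:
                image_index[uri] = len(image_index)
            columns.add(image_index[uri])
    rows = {tag: sorted(cols) for tag, cols in columns_by_tag.items()}
    return image_index, rows


# Two of the sample lines from the data above.
raw = [
    "http://flickr.com/photos/tags/100commentgroup "
    "http://flickr.com/photos/34254318@N06/4019040356 "
    "http://flickr.com/photos/46857830@N03/5651576112",
    "http://flickr.com/photos/tags/beauty "
    "http://flickr.com/photos/7309029@N06/3233772398",
]

image_index, rows = build_sparse_rows(raw)
for tag, columns in rows.items():
    print(tag, columns)
```

The vector cardinality (number of columns) would be len(image_index), so it is only known after all lines have been read; a real converter would probably make one pass to build the index and a second pass to write the entries.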
