Thanks a lot, that is a quite good example for my study.



At 2011-08-29 15:50:36,"Danny Bickson" <[email protected]> wrote:
> Mahout - SVD matrix factorization - formatting input matrix
> Converting Input Format into Mahout's SVD Distributed Matrix Factorization Solver
>
>Purpose
>The code below converts a matrix from CSV format:
><from row>,<to col>,<value>\n
>into the input format of Mahout's SVD solver: a Hadoop SequenceFile of
>IntWritable column indices mapped to VectorWritable column vectors.
>
>
>For example,
>The 3x3 matrix:
>0    1.0 2.1
>3.0  4.0 5.0
>-5.0 6.2 0
>
>
>Will be given as input in a CSV file (with 1-based row and column indices,
>which the converter shifts to 0-based) as:
>2,1,3.0
>3,1,-5.0
>1,2,1.0
>2,2,4.0
>3,2,6.2
>1,3,2.1
>2,3,5.0
>
>NOTE: I ASSUME THE CSV FILE IS SORTED BY COLUMN (THE SECOND FIELD).
>This code is based on code by Danny Leshem, ContextIn.
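>
>If your CSV is not already in this order, a plain Unix sort on the second
>field (then the first) produces it. This one-liner is my suggestion, not
>part of the original code:
>
>sort -t, -k2,2n -k1,1n input.csv > input.sorted.csv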
>
>Command line arguments:
>args[0] - path to the input csv file
>args[1] - cardinality of the matrix (the length of each column vector)
>args[2] - path to the resulting Mahout SVD input file
>
>Method:
>The code below goes over the csv file and, for each matrix column, creates
>a SequentialAccessSparseVector containing all the non-zero row entries
>for that column. It then appends the column vector to the output file.
>
>Compilation:
>Copy the Java code below into a file named Convert2SVD.java.
>Add both the Mahout and Hadoop jars to your IDE project path. Alternatively, a
>command-line option for compilation is given below.
>
>
>
>
>import java.io.BufferedReader;
>import java.io.FileReader;
>import java.util.StringTokenizer;
>
>import org.apache.mahout.math.SequentialAccessSparseVector;
>import org.apache.mahout.math.Vector;
>import org.apache.mahout.math.VectorWritable;
>import org.apache.hadoop.conf.Configuration;
>import org.apache.hadoop.fs.FileSystem;
>import org.apache.hadoop.fs.Path;
>import org.apache.hadoop.io.IntWritable;
>import org.apache.hadoop.io.SequenceFile;
>import org.apache.hadoop.io.SequenceFile.CompressionType;
>
>/**
> * Code for converting CSV format to Mahout's SVD format
> * @author Danny Bickson, CMU
> * Note: I ASSUME THE CSV FILE IS SORTED BY COLUMN (THE SECOND FIELD).
> */
>
>public class Convert2SVD {
>
>
>        public static int Cardinality;
>
>        /**
>         *
>         * @param args[0] - input csv file
>         * @param args[1] - cardinality (length of vector)
>         * @param args[2] - output file for svd
>         */
>        public static void main(String[] args){
>
>try {
>        Cardinality = Integer.parseInt(args[1]);
>        final Configuration conf = new Configuration();
>        final FileSystem fs = FileSystem.get(conf);
>        final SequenceFile.Writer writer =
>SequenceFile.createWriter(fs, conf, new Path(args[2]),
>IntWritable.class, VectorWritable.class, CompressionType.BLOCK);
>
>          final IntWritable key = new IntWritable();
>          final VectorWritable value = new VectorWritable();
>
>
>           String thisLine;
>
>           BufferedReader br = new BufferedReader(new FileReader(args[0]));
>           Vector vector = null;
>           int from = -1, to = -1;
>           int last_to = -1;
>           float val = 0;
>           int total = 0;
>           int nnz = 0;
>           int e = 0;
>           int max_to = 0;
>           int max_from = 0;
>
>           while ((thisLine = br.readLine()) != null) { // while loop begins here
>
>                 StringTokenizer st = new StringTokenizer(thisLine, ",");
>                 while (st.hasMoreTokens()) {
>                     from = Integer.parseInt(st.nextToken()) - 1; // convert from 1-based to 0-based
>                     to = Integer.parseInt(st.nextToken()) - 1; // convert from 1-based to 0-based
>                     val = Float.parseFloat(st.nextToken());
>                     if (max_from < from) max_from = from;
>                     if (max_to < to) max_to = to;
>                     if (from < 0 || to < 0 || to >= Cardinality || val == 0.0)
>                         throw new NumberFormatException("wrong data: from: " + from
>                             + " to: " + to + " val: " + val);
>                 }
>
>                 // the column index changed: write out the previous column vector
>                 if (last_to != to && last_to != -1){
>                     value.set(vector);
>
>                     writer.append(key, value); //write the older vector
>                     e += vector.getNumNondefaultElements();
>                 }
>                 //a new column is observed, open a new vector for it
>                 if (last_to != to){
>                     vector = new SequentialAccessSparseVector(Cardinality);
>                     key.set(to); // open a new vector
>                     total++;
>                 }
>
>                 vector.set(from, val);
>                 nnz++;
>
>                 if (nnz % 1000000 == 0){
>                   System.out.println("Col" + total + " nnz: " + nnz);
>                 }
>                 last_to = to;
>
>          } // end while
>
>           if (vector != null) { // write the last column vector
>               value.set(vector);
>               writer.append(key, value);
>               e += vector.getNumNondefaultElements();
>               total++;
>           }
>
>           br.close();
>           writer.close();
>           System.out.println("Wrote a total of " + total + " cols " +
>" nnz: " + nnz);
>           if (e != nnz)
>                System.err.println("Bug: missing edges! We only got " + e);
>
>           System.out.println("Highest column: " + max_to + " highest
>row: " + max_from );
>        } catch(Exception ex){
>                ex.printStackTrace();
>        }
>    }
>}
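>
>To sanity-check the generated sequence file, you can read it back with the
>same Hadoop/Mahout classes. The sketch below is my own addition (the class
>name PrintSVDInput is arbitrary); compile it against the same jars as
>Convert2SVD:
>
>import org.apache.hadoop.conf.Configuration;
>import org.apache.hadoop.fs.FileSystem;
>import org.apache.hadoop.fs.Path;
>import org.apache.hadoop.io.IntWritable;
>import org.apache.hadoop.io.SequenceFile;
>import org.apache.mahout.math.VectorWritable;
>
>public class PrintSVDInput {
>        public static void main(String[] args) throws Exception {
>                Configuration conf = new Configuration();
>                FileSystem fs = FileSystem.get(conf);
>                // open the file written by Convert2SVD (args[0] = path to the .seq file)
>                SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(args[0]), conf);
>                IntWritable key = new IntWritable();
>                VectorWritable value = new VectorWritable();
>                while (reader.next(key, value)) { // one entry per column vector
>                        System.out.println("col " + key.get() + " nnz: "
>                                + value.get().getNumNondefaultElements());
>                }
>                reader.close();
>        }
>}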
>
>
>
>A second option for compiling this file is to create a Makefile with the
>following in it:
>
>
>all:
>        javac -cp
>/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/core-3.1.1.jar:/mnt/bigbrofs/usr7/bickson/mahout-0.4/taste-web/target/mahout-taste-webapp-0.5-SNAPSHOT/WEB-INF/lib/mahout-core-0.5-SNAPSHOT.jar:/mnt/bigbrofs/usr7/bickson/mahout-0.4/taste-web/target/mahout-taste-webapp-0.5-SNAPSHOT/WEB-INF/lib/mahout-math-0.5-SNAPSHOT.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/commons-cli-1.2.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/hadoop-0.20.2-core.jar
>*.java
>
>Note that you will have to change the locations of the jars to point to
>where your jars are stored.
>
>Example of running this conversion on the Netflix data:
>
>
>java -cp 
>.:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/core-3.1.1.jar:/mnt/bigbrofs/usr7/bickson/mahout-0.4/taste-web/target/mahout-taste-webapp-0.5-SNAPSHOT/WEB-INF/lib/mahout-core-0.5-SNAPSHOT.jar:/mnt/bigbrofs/usr7/bickson/mahout-0.4/taste-web/target/mahout-taste-webapp-0.5-SNAPSHOT/WEB-INF/lib/mahout-math-0.5-SNAPSHOT.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/commons-cli-1.2.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/hadoop-0.20.2-core.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/commons-logging-1.0.4.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/commons-logging-api-1.0.4.jar
>Convert2SVD ../../netflixe.csv 17770 netflixe.seq
>Aug 23, 2011 1:16:06 PM org.apache.hadoop.util.NativeCodeLoader <clinit>
>WARNING: Unable to load native-hadoop library for your platform...
>using builtin-java classes where applicable
>Aug 23, 2011 1:16:06 PM org.apache.hadoop.io.compress.CodecPool getCompressor
>INFO: Got brand-new compressor
>Row241 nnz: 1000000
>Row381 nnz: 2000000
>Row571 nnz: 3000000
>Row789 nnz: 4000000
>Row1046 nnz: 5000000
>Row1216 nnz: 6000000
>Row1441 nnz: 7000000
>
>...
>
>
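>Once netflixe.seq is written, it can be handed to Mahout's Lanczos solver.
>A sketch of the invocation (option names per the dimensional-reduction wiki
>page linked later in this thread; the values are placeholders, so please
>verify against "bin/mahout svd --help" for your Mahout version):
>
>bin/mahout svd --input netflixe.seq --output netflixe-svd \
>       --numRows <rows> --numCols <cols> --rank <rank>
>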
>NOTE: You may also want to check out GraphLab's collaborative filtering
>library here: <http://graphlab.org/pmf.html>. GraphLab has an SVD solver that
>is 100% compatible with Mahout's, with performance gains of up to 50x. I have
>created Java code to convert Mahout sequence files into GraphLab's format
>and back. Email me and I will send you the code.
>
>2011/8/29 myn <[email protected]>
>
>> Thanks.
>> But could you send the content of
>> http://bickson.blogspot.com/2011/02/mahout-svd-matrix-factorization.html to me?
>> I can't open it in China.
>>
>>
>>
>>
>>
>> At 2011-08-29 15:29:40,"Danny Bickson" <[email protected]> wrote:
>> >Command line arguments are found here:
>> >https://cwiki.apache.org/MAHOUT/dimensional-reduction.html
>> >I wrote a quick tutorial on how to prepare sparse matrices as input to
>> >Mahout SVD here:
>> >http://bickson.blogspot.com/2011/02/mahout-svd-matrix-factorization.html
>> >
>> >Let me know if you have further questions.
>> >
>> >2011/8/29 myn <[email protected]>
>> >
>> >> I want to study Singular Value Decomposition algorithms.
>> >> I also have the book Mahout in Action, but I can't find anything about this
>> >> algorithm.
>> >> Is there someplace that introduces how to use the method?
>> >> Till now, DistributedLanczosSolver is not a MapReduce method.
>> >> org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver = svd
>>
