the best way is to read the source code ; @_@
At 2011-08-29 16:02:57, "Lance Norskog" <[email protected]> wrote:
>'R' also has an SVD implementation, directly in the base package.
>
>There are a few answers to your question:
>1) What is SVD? The video lecture above will help. Also, searching for
>'singular value decomposition' on Baidu finds a lot of basic explanations.
>2) Why do you want it? In one pass it creates a few different unique
>views of what is going on inside your dataset.
>3) The Mahout distributed matrix code, DistributedLanczos etc., consists of
>implementations specifically for large-scale problems. There are sub-parts
>of SVD that you may not need for your problem, and these jobs avoid some of
>the work.
>
>Until you have a solid grasp of what SVD can tell you, there is no point in
>trying the distributed Mahout jobs. The SingularValueDecomposition class in
>Mahout has served me well in my research.
>
>Lance
>
>On Mon, Aug 29, 2011 at 12:50 AM, Danny Bickson <[email protected]> wrote:
>
>> Mahout - SVD matrix factorization - formatting the input matrix
>> Converting the input format for Mahout's SVD distributed matrix
>> factorization solver
>>
>> Purpose
>> The code below converts a matrix from CSV format:
>> <from row>,<to col>,<value>\n
>> into Mahout's SVD solver format.
>>
>> For example, the 3x3 matrix:
>>  0    1.0  2.1
>>  3.0  4.0  5.0
>> -5.0  6.2  0
>>
>> will be given as input in a CSV file as:
>> 1,0,3.0
>> 2,0,-5.0
>> 0,1,1.0
>> 1,1,4.0
>> 2,1,6.2
>> 0,2,2.1
>> 1,2,5.0
>>
>> NOTE: I ASSUME THE MATRIX IS SORTED IN COLUMN ORDER.
>> This code is based on code by Danny Leshem, ContextIn.
>>
>> Command line arguments:
>> args[0] - path to the CSV input file
>> args[1] - cardinality of the matrix (length of a column vector)
>> args[2] - path to the resulting Mahout SVD input file
>>
>> Method:
>> The code below goes over the CSV file and, for each matrix column, creates
>> a SequentialAccessSparseVector that contains all the non-zero row entries
>> of that column.
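[Editor's note: the per-column, single-pass grouping described here can be exercised without any Hadoop or Mahout dependencies. The sketch below is only an illustration of that grouping logic, not the converter itself; the class name ColumnGrouper is hypothetical, and a plain TreeMap stands in for SequentialAccessSparseVector.]

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ColumnGrouper {
    // Group "row,col,value" triplets (already sorted by column) into one
    // sparse vector per column, mirroring the converter's single pass.
    // A TreeMap<row, value> stands in for SequentialAccessSparseVector.
    public static Map<Integer, Map<Integer, Float>> group(List<String> lines) {
        Map<Integer, Map<Integer, Float>> columns = new LinkedHashMap<>();
        int lastCol = -1;
        Map<Integer, Float> current = null;
        for (String line : lines) {
            String[] f = line.split(",");
            int row = Integer.parseInt(f[0]);
            int col = Integer.parseInt(f[1]);
            float val = Float.parseFloat(f[2]);
            if (col != lastCol) {           // a new column begins: open a fresh vector
                current = new TreeMap<>();
                columns.put(col, current);
                lastCol = col;
            }
            current.put(row, val);          // record one non-zero row entry
        }
        return columns;
    }
}
```

Feeding it the seven example triplets above yields three columns with 2, 3 and 2 non-zeros respectively.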
>> It then appends each column vector to the output file.
>>
>> Compilation:
>> Copy the Java code below into a file named Convert2SVD.java.
>> Add both the Mahout and Hadoop jars to your IDE project path.
>> Alternatively, a command line option for compilation is given below.
>>
>> import java.io.BufferedReader;
>> import java.io.FileReader;
>> import java.util.StringTokenizer;
>>
>> import org.apache.mahout.math.SequentialAccessSparseVector;
>> import org.apache.mahout.math.Vector;
>> import org.apache.mahout.math.VectorWritable;
>> import org.apache.hadoop.conf.Configuration;
>> import org.apache.hadoop.fs.FileSystem;
>> import org.apache.hadoop.fs.Path;
>> import org.apache.hadoop.io.IntWritable;
>> import org.apache.hadoop.io.SequenceFile;
>> import org.apache.hadoop.io.SequenceFile.CompressionType;
>>
>> /**
>>  * Code for converting CSV format to Mahout's SVD format
>>  * @author Danny Bickson, CMU
>>  * Note: I ASSUME THE CSV FILE IS SORTED BY THE COLUMN (NAMELY THE
>>  * SECOND FIELD).
>>  */
>> public class Convert2SVD {
>>
>>     public static int Cardinality;
>>
>>     /**
>>      * args[0] - input csv file
>>      * args[1] - cardinality (length of a column vector)
>>      * args[2] - output file for svd
>>      */
>>     public static void main(String[] args) {
>>
>>         try {
>>             Cardinality = Integer.parseInt(args[1]);
>>             final Configuration conf = new Configuration();
>>             final FileSystem fs = FileSystem.get(conf);
>>             final SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf,
>>                     new Path(args[2]), IntWritable.class, VectorWritable.class,
>>                     CompressionType.BLOCK);
>>
>>             final IntWritable key = new IntWritable();
>>             final VectorWritable value = new VectorWritable();
>>
>>             String thisLine;
>>             BufferedReader br = new BufferedReader(new FileReader(args[0]));
>>             Vector vector = null;
>>             int from = -1, to = -1;
>>             int last_to = -1;
>>             float val = 0;
>>             int total = 0;
>>             int nnz = 0;
>>             int e = 0;
>>             int max_to = 0;
>>             int max_from = 0;
>>
>>             while ((thisLine = br.readLine()) != null) {
>>                 StringTokenizer st = new StringTokenizer(thisLine, ",");
>>                 while (st.hasMoreTokens()) {
>>                     from = Integer.parseInt(st.nextToken()) - 1; // convert from 1-based to 0-based
>>                     to = Integer.parseInt(st.nextToken()) - 1;   // convert from 1-based to 0-based
>>                     val = Float.parseFloat(st.nextToken());
>>                     if (max_from < from) max_from = from;
>>                     if (max_to < to) max_to = to;
>>                     // the row index must fit the declared vector length
>>                     if (from < 0 || to < 0 || from >= Cardinality || val == 0.0)
>>                         throw new NumberFormatException("wrong data: " + from
>>                                 + " to: " + to + " val: " + val);
>>                 }
>>
>>                 // we finished an existing column: write out its non-zero rows
>>                 if (last_to != to && last_to != -1) {
>>                     value.set(vector);
>>                     writer.append(key, value); // write the older vector
>>                     e += vector.getNumNondefaultElements();
>>                 }
>>                 // a new column is observed: open a new vector for it
>>                 if (last_to != to) {
>>                     vector = new SequentialAccessSparseVector(Cardinality);
>>                     key.set(to);
>>                     total++;
>>                 }
>>
>>                 vector.set(from, val);
>>                 nnz++;
>>
>>                 if (nnz % 1000000 == 0) {
>>                     System.out.println("Col " + total + " nnz: " + nnz);
>>                 }
>>                 last_to = to;
>>             } // end while
>>
>>             br.close();
>>             value.set(vector);
>>             writer.append(key, value); // write the last column
>>             e += vector.getNumNondefaultElements();
>>             writer.close();
>>
>>             System.out.println("Wrote a total of " + total + " cols, nnz: " + nnz);
>>             if (e != nnz)
>>                 System.err.println("Bug: missing entries! we only got " + e);
>>
>>             System.out.println("Highest column: " + max_to + " highest row: " + max_from);
>>         } catch (Exception ex) {
>>             ex.printStackTrace();
>>         }
>>     }
>> }
>>
>> A second option for compiling this file is to create a Makefile with the
>> following in it:
>> all:
>> 	javac -cp /mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/core-3.1.1.jar:/mnt/bigbrofs/usr7/bickson/mahout-0.4/taste-web/target/mahout-taste-webapp-0.5-SNAPSHOT/WEB-INF/lib/mahout-core-0.5-SNAPSHOT.jar:/mnt/bigbrofs/usr7/bickson/mahout-0.4/taste-web/target/mahout-taste-webapp-0.5-SNAPSHOT/WEB-INF/lib/mahout-math-0.5-SNAPSHOT.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/commons-cli-1.2.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/hadoop-0.20.2-core.jar *.java
>>
>> Note that you will have to change the locations of the jars to point to
>> where your jars are stored.
>>
>> Example of running this conversion on the Netflix data:
>> java -cp .:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/core-3.1.1.jar:/mnt/bigbrofs/usr7/bickson/mahout-0.4/taste-web/target/mahout-taste-webapp-0.5-SNAPSHOT/WEB-INF/lib/mahout-core-0.5-SNAPSHOT.jar:/mnt/bigbrofs/usr7/bickson/mahout-0.4/taste-web/target/mahout-taste-webapp-0.5-SNAPSHOT/WEB-INF/lib/mahout-math-0.5-SNAPSHOT.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/commons-cli-1.2.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/hadoop-0.20.2-core.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/commons-logging-1.0.4.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/commons-logging-api-1.0.4.jar Convert2SVD ../../netflixe.csv 17770 netflixe.seq
>> Aug 23, 2011 1:16:06 PM org.apache.hadoop.util.NativeCodeLoader <clinit>
>> WARNING: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>> Aug 23, 2011 1:16:06 PM org.apache.hadoop.io.compress.CodecPool getCompressor
>> INFO: Got brand-new compressor
>> Row241 nnz: 1000000
>> Row381 nnz: 2000000
>> Row571 nnz: 3000000
>> Row789 nnz: 4000000
>> Row1046 nnz: 5000000
>> Row1216 nnz: 6000000
>> Row1441 nnz: 7000000
>>
>> ...
>>
>> NOTE: You may also want to check out GraphLab's collaborative filtering
>> library: http://graphlab.org/pmf.html. GraphLab has an SVD solver that is
>> 100% compatible with Mahout's, with performance gains of up to 50x. I have
>> created Java code to convert Mahout sequence files into GraphLab's format
>> and back. Email me and I will send you the code.
>>
>> 2011/8/29 myn <[email protected]>
>>
>> > Thanks.
>> > Could you send me the content of
>> > http://bickson.blogspot.com/2011/02/mahout-svd-matrix-factorization.html?
>> > I can't open it in China.
>> >
>> >
>> > At 2011-08-29 15:29:40, "Danny Bickson" <[email protected]> wrote:
>> > >Command line arguments are found here:
>> > >https://cwiki.apache.org/MAHOUT/dimensional-reduction.html
>> > >I wrote a quick tutorial on how to prepare sparse matrices as input to
>> > >Mahout SVD here:
>> > >http://bickson.blogspot.com/2011/02/mahout-svd-matrix-factorization.html
>> > >
>> > >Let me know if you have further questions.
>> > >
>> > >2011/8/29 myn <[email protected]>
>> > >
>> > >> I want to study singular value decomposition algorithms.
>> > >> I also have the book Mahout in Action, but I couldn't find anything
>> > >> about this algorithm. Is there someplace that explains how to use the
>> > >> method?
>> > >> So far DistributedLanczosSolver is not a MapReduce method:
>> > >> org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver = svd
>
>--
>Lance Norskog
>[email protected]
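[Editor's note: for completeness, here is a small stdlib-only Java sketch of producing the column-sorted CSV that the converter above expects, from a dense matrix. The class and method names are hypothetical. Indices here are 0-based, matching the example in the post; note that the posted converter additionally subtracts 1 from each index, expecting 1-based input, so adjust the offset to match your data.]

```java
public class CsvWriterSketch {
    // Emit the non-zero entries of a dense matrix as "row,col,value" lines,
    // scanning column-by-column so the output is sorted by column, as the
    // converter assumes. Indices are 0-based, as in the post's 3x3 example.
    public static String toColumnSortedCsv(float[][] m) {
        StringBuilder sb = new StringBuilder();
        int rows = m.length, cols = m[0].length;
        for (int c = 0; c < cols; c++)       // outer loop over columns => column order
            for (int r = 0; r < rows; r++)
                if (m[r][c] != 0.0f)
                    sb.append(r).append(',').append(c).append(',')
                      .append(m[r][c]).append('\n');
        return sb.toString();
    }
}
```

Applied to the 3x3 example matrix from the post, this reproduces the seven CSV lines listed there.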
