thanks a lot ,that is a quit good example for my study.
At 2011-08-29 15:50:36,"Danny Bickson" <[email protected]> wrote: > Mahout - SVD matrix factorization - formatting input matrix > Converting Input Format into Mahout's SVD Distributed Matrix Factorization >Solver > >Purpose >The code below, converts a matrix from csv format: ><from row>,<to col>,<value>\n >Into Mahout's SVD solver format. > > >For example, >The 3x3 matrix: >0 1.0 2.1 >3.0 4.0 5.0 >-5.0 6.2 0 > > >Will be given as input in a csv file as: >1,0,3.0 >2,0,-5.0 >0,1,1.0 >1,1,4.0 >2,1,6.2 >0,2,2.1 >1,2,5.0 > >NOTE: I ASSUME THE MATRIX IS SORTED BY THE COLUMNS ORDER >This code is based on code by Danny Leshem, ContextIn. > >Command line arguments: > args[0] - path to csv input file >args[1] - cardinality of the matrix (number of columns) >args[2] - path the resulting Mahout's SVD input file > >Method: >The code below, goes over the csv file, and for each matrix column, creates >a SequentialAccessSparseVector which contains all the non-zero row entries >for this column. >Then it appends the column vector to file. > >Compilation: >Copy the java code below into an java file named Convert2SVD.java >Add to your IDE project path both Mahout and Hadoop jars. Alternatively, a >command line option for compilation is given below. > > >view >plain<http://bickson.blogspot.com/2011/02/mahout-svd-matrix-factorization.html#> >print<http://bickson.blogspot.com/2011/02/mahout-svd-matrix-factorization.html#> >?<http://bickson.blogspot.com/2011/02/mahout-svd-matrix-factorization.html#> > > 1. import java.io.BufferedReader; > 2. import java.io.FileReader; > 3. import java.util.StringTokenizer; > 4. > 5. import org.apache.mahout.math.SequentialAccessSparseVector; > 6. import org.apache.mahout.math.Vector; > 7. import org.apache.mahout.math.VectorWritable; > 8. import org.apache.hadoop.conf.Configuration; > 9. import org.apache.hadoop.fs.FileSystem; > 10. import org.apache.hadoop.fs.Path; > 11. import org.apache.hadoop.io.IntWritable; > 12. import org.apache.hadoop.io.SequenceFile; > 13. import org.apache.hadoop.io.SequenceFile.CompressionType; > 14. > 15. /** > 16. * Code for converting CSV format to Mahout's SVD format > 17. * @author Danny Bickson, CMU > 18. > * Note: I ASSUME THE CSV FILE IS SORTED BY THE COLUMN (NAMELY THE >SECOND FIELD). > > 19. * > 20. */ > 21. > 22. public class Convert2SVD { > 23. > 24. > 25. public static int Cardinality; > 26. > 27. /** > 28. * > 29. * @param args[0] - input csv file > 30. * @param args[1] - cardinality (length of vector) > 31. * @param args[2] - output file for svd > 32. */ > 33. public static void main(String[] args){ > 34. > 35. try { > 36. Cardinality = Integer.parseInt(args[1]); > 37. final Configuration conf = new Configuration(); > 38. final FileSystem fs = FileSystem.get(conf); > 39. final > SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, new > Path(args[2]), IntWritable.class, VectorWritable.class > , CompressionType.BLOCK); > 40. > 41. final IntWritable key = new IntWritable(); > 42. final VectorWritable value = new VectorWritable(); > 43. > 44. > 45. String thisLine; > 46. > 47. BufferedReader br = new BufferedReader(new > FileReader(args[0])); > 48. Vector vector = null; > 49. int from = -1,to =-1; > 50. int last_to = -1; > 51. float val = 0; > 52. int total = 0; > 53. int nnz = 0; > 54. int e = 0; > 55. int max_to =0; > 56. int max_from = 0; > 57. > 58. while ((thisLine = br.readLine()) != null) { > // while loop begins here > 59. > 60. StringTokenizer st = new StringTokenizer(thisLine, > ","); > 61. while(st.hasMoreTokens()) { > 62. from = Integer.parseInt(st.nextToken())-1; > //convert from 1 based to zero based > 63. to = Integer.parseInt(st.nextToken())-1; > //convert from 1 based to zero basd > 64. val = Float.parseFloat(st.nextToken()); > 65. if (max_from < from) max_from = from; > 66. if (max_to < to) max_to = to; > 67. if (from < 0 || to < 0 > || to > Cardinality || val == 0.0) > 68. throw new NumberFormatException("wrong data" > + from + " to: " + to + " val: " + val); > 69. } > 70. > 71. > //we are working on an existing column, set non-zero rows in it > 72. if (last_to != to && last_to != -1){ > 73. value.set(vector); > 74. > 75. writer.append(key, value); > //write the older vector > 76. e+= vector.getNumNondefaultElements(); > 77. } > 78. //a new column is observed, open a new vector for it > > 79. if (last_to != to){ > 80. vector = new > SequentialAccessSparseVector(Cardinality); > 81. key.set(to); // open a new vector > 82. total++; > 83. } > 84. > 85. vector.set(from, val); > 86. nnz++; > 87. > 88. if (nnz % 1000000 == 0){ > 89. System.out.println("Col" + total + " nnz: " > + nnz); > 90. } > 91. last_to = to; > 92. > 93. } // end while > 94. > 95. value.set(vector); > 96. writer.append(key,value);//write last row > 97. e+= vector.getNumNondefaultElements(); > 98. total++; > 99. > 100. writer.close(); > 101. System.out.println("Wrote a total of " + total + " cols " > + " nnz: " + nnz); > 102. if (e != nnz) > 103. System.err.println("Bug:missing edges! we only got" > + e); > 104. > 105. System.out.println("Highest column: " + max_to + > " highest row: " + max_from ); > 106. } catch(Exception ex){ > 107. ex.printStackTrace(); > 108. } > 109. } > 110. } > >import java.io.BufferedReader; >import java.io.FileReader; >import java.util.StringTokenizer; > >import org.apache.mahout.math.SequentialAccessSparseVector; >import org.apache.mahout.math.Vector; >import org.apache.mahout.math.VectorWritable; >import org.apache.hadoop.conf.Configuration; >import org.apache.hadoop.fs.FileSystem; >import org.apache.hadoop.fs.Path; >import org.apache.hadoop.io.IntWritable; >import org.apache.hadoop.io.SequenceFile; >import org.apache.hadoop.io.SequenceFile.CompressionType; > >/** > * Code for converting CSV format to Mahout's SVD format > * @author Danny Bickson, CMU > * Note: I ASSUME THE CSV FILE IS SORTED BY THE COLUMN (NAMELY THE >SECOND FIELD). > * > */ > >public class Convert2SVD { > > > public static int Cardinality; > > /** > * > * @param args[0] - input csv file > * @param args[1] - cardinality (length of vector) > * @param args[2] - output file for svd > */ > public static void main(String[] args){ > >try { > Cardinality = Integer.parseInt(args[1]); > final Configuration conf = new Configuration(); > final FileSystem fs = FileSystem.get(conf); > final SequenceFile.Writer writer = >SequenceFile.createWriter(fs, conf, new Path(args[2]), >IntWritable.class, VectorWritable.class, CompressionType.BLOCK); > > final IntWritable key = new IntWritable(); > final VectorWritable value = new VectorWritable(); > > > String thisLine; > > BufferedReader br = new BufferedReader(new FileReader(args[0])); > Vector vector = null; > int from = -1,to =-1; > int last_to = -1; > float val = 0; > int total = 0; > int nnz = 0; > int e = 0; > int max_to =0; > int max_from = 0; > > while ((thisLine = br.readLine()) != null) { // while loop >begins here > > StringTokenizer st = new StringTokenizer(thisLine, ","); > while(st.hasMoreTokens()) { > from = Integer.parseInt(st.nextToken())-1; >//convert from 1 based to zero based > to = Integer.parseInt(st.nextToken())-1; >//convert from 1 based to zero basd > val = Float.parseFloat(st.nextToken()); > if (max_from < from) max_from = from; > if (max_to < to) max_to = to; > if (from < 0 || to < 0 || to > Cardinality || val == 0.0) > throw new NumberFormatException("wrong data" >+ from + " to: " + to + " val: " + val); > } > > //we are working on an existing column, set non-zero rows in > it > if (last_to != to && last_to != -1){ > value.set(vector); > > writer.append(key, value); //write the older vector > e+= vector.getNumNondefaultElements(); > } > //a new column is observed, open a new vector for it > if (last_to != to){ > vector = new SequentialAccessSparseVector(Cardinality); > key.set(to); // open a new vector > total++; > } > > vector.set(from, val); > nnz++; > > if (nnz % 1000000 == 0){ > System.out.println("Col" + total + " nnz: " + nnz); > } > last_to = to; > > } // end while > > value.set(vector); > writer.append(key,value);//write last row > e+= vector.getNumNondefaultElements(); > total++; > > writer.close(); > System.out.println("Wrote a total of " + total + " cols " + >" nnz: " + nnz); > if (e != nnz) > System.err.println("Bug:missing edges! we only got" + e); > > System.out.println("Highest column: " + max_to + " highest >row: " + max_from ); > } catch(Exception ex){ > ex.printStackTrace(); > } > } >} > > > >A second option to compile this file is create a Makefile, with the >following in it: >view >plain<http://bickson.blogspot.com/2011/02/mahout-svd-matrix-factorization.html#> >print<http://bickson.blogspot.com/2011/02/mahout-svd-matrix-factorization.html#> >?<http://bickson.blogspot.com/2011/02/mahout-svd-matrix-factorization.html#> > > 1. all: > 2. javac -cp /mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/core- > 3.1.1.jar:/mnt/bigbrofs/usr7/bickson/mahout-0.4 > /taste-web/target/mahout-taste-webapp-0.5 > -SNAPSHOT/WEB-INF/lib/mahout-core-0.5 > -SNAPSHOT.jar:/mnt/bigbrofs/usr7/bickson/mahout-0.4 > /taste-web/target/mahout-taste-webapp-0.5 > -SNAPSHOT/WEB-INF/lib/mahout-math-0.5 > -SNAPSHOT.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/commons-cli- > 1.2.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/hadoop-0.20.2 > -core.jar *.java > >all: > javac -cp >/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/core-3.1.1.jar:/mnt/bigbrofs/usr7/bickson/mahout-0.4/taste-web/target/mahout-taste-webapp-0.5-SNAPSHOT/WEB-INF/lib/mahout-core-0.5-SNAPSHOT.jar:/mnt/bigbrofs/usr7/bickson/mahout-0.4/taste-web/target/mahout-taste-webapp-0.5-SNAPSHOT/WEB-INF/lib/mahout-math-0.5-SNAPSHOT.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/commons-cli-1.2.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/hadoop-0.20.2-core.jar >*.java > >Note that you will have the change location of the jars to point to where >your jars are stored. > >Example for running this conversion for netflix data: >view >plain<http://bickson.blogspot.com/2011/02/mahout-svd-matrix-factorization.html#> >print<http://bickson.blogspot.com/2011/02/mahout-svd-matrix-factorization.html#> >?<http://bickson.blogspot.com/2011/02/mahout-svd-matrix-factorization.html#> > > 1. java -cp .:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/core-3.1.1 > .jar:/mnt/bigbrofs/usr7/bickson/mahout-0.4 > /taste-web/target/mahout-taste-webapp-0.5 > -SNAPSHOT/WEB-INF/lib/mahout-core-0.5 > -SNAPSHOT.jar:/mnt/bigbrofs/usr7/bickson/mahout-0.4 > /taste-web/target/mahout-taste-webapp-0.5 > -SNAPSHOT/WEB-INF/lib/mahout-math-0.5 > -SNAPSHOT.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/commons-cli- > 1.2.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/hadoop-0.20.2 > -core.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/commons-logging- > 1.0.4.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2 > /lib/commons-logging-api-1.0.4.jar Convert2SVD ../../netflixe.csv 17770 > netflixe.seq > 2. Aug 23, 2011 1:16:06 > PM org.apache.hadoop.util.NativeCodeLoader <clinit> > 3. WARNING: Unable to load native-hadoop library for > your platform... using builtin-java classes where applicable > 4. Aug 23, 2011 1:16:06 > PM org.apache.hadoop.io.compress.CodecPool getCompressor > 5. INFO: Got brand-new compressor > 6. Row241 nnz: 1000000 > 7. Row381 nnz: 2000000 > 8. Row571 nnz: 3000000 > 9. Row789 nnz: 4000000 > 10. Row1046 nnz: 5000000 > 11. Row1216 nnz: 6000000 > 12. Row1441 nnz: 7000000 > 13. > 14. ... > 15. </clinit> > >java -cp >.:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/core-3.1.1.jar:/mnt/bigbrofs/usr7/bickson/mahout-0.4/taste-web/target/mahout-taste-webapp-0.5-SNAPSHOT/WEB-INF/lib/mahout-core-0.5-SNAPSHOT.jar:/mnt/bigbrofs/usr7/bickson/mahout-0.4/taste-web/target/mahout-taste-webapp-0.5-SNAPSHOT/WEB-INF/lib/mahout-math-0.5-SNAPSHOT.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/commons-cli-1.2.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/hadoop-0.20.2-core.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/commons-logging-1.0.4.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/commons-logging-api-1.0.4.jar >Convert2SVD ../../netflixe.csv 17770 netflixe.seq >Aug 23, 2011 1:16:06 PM org.apache.hadoop.util.NativeCodeLoader >WARNING: Unable to load native-hadoop library for your platform... >using builtin-java classes where applicable >Aug 23, 2011 1:16:06 PM org.apache.hadoop.io.compress.CodecPool getCompressor >INFO: Got brand-new compressor >Row241 nnz: 1000000 >Row381 nnz: 2000000 >Row571 nnz: 3000000 >Row789 nnz: 4000000 >Row1046 nnz: 5000000 >Row1216 nnz: 6000000 >Row1441 nnz: 7000000 > >... > > >NOTE: You may want also to checkout GraphLab's collaborative filtering >library: here <http://graphlab.org/pmf.html>. GraphLab has a 100% compatible >SVD solver to Mahout, with performance gains up to x50 times faster. I have >created Java code to convert Mahout sequence files into Graphlab's format >and back. Email me and I will send you the code. > >2011/8/29 myn <[email protected]> > >> thanks >> But could you send the content ofhttp:// >> bickson.blogspot.com/2011/02/mahout-svd-matrix-factorization.html to me ? >> I can`t open it in china . >> >> >> >> >> >> At 2011-08-29 15:29:40,"Danny Bickson" <[email protected]> wrote: >> >Command line arguments are found here: >> >https://cwiki.apache.org/MAHOUT/dimensional-reduction.html >> >I wrote a quick tutorial on how to prepare sparse matrices as input to >> >Mahout SVD here: >> >http://bickson.blogspot.com/2011/02/mahout-svd-matrix-factorization.html >> > >> >Let me know if you have further questions. >> > >> >2011/8/29 myn <[email protected]> >> > >> >> i want to study Singular Value Decomposition algorithms; >> >> I also have a book called mahout in action,but i can`t found sth about >> this >> >> algorithm; >> >> is there someplace introduce how to use the method? >> >> till now DistributedLanczosSolver is not a mapreduce method >> >> org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver = svd >>
