You could also look at using Pig with the elephant-bird package for creating sequence files.
There's an example in the Readme at
https://github.com/kevinweil/elephant-bird/blob/master/Readme.md

On Fri, May 23, 2014 at 11:05 AM, jamal sasha <[email protected]> wrote:

> Hi,
> I tried to use one of the implementations. Here is a copy-paste for
> reference:
>
> import java.io.BufferedReader;
> import java.io.FileReader;
> import java.io.IOException;
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.io.IntWritable;
> import org.apache.hadoop.io.SequenceFile;
> import org.apache.hadoop.io.Text;
> import org.apache.mahout.math.DenseVector;
> import org.apache.mahout.math.Vector;
> import org.apache.mahout.math.VectorWritable;
>
> public class SequenceOutput {
>   public static void main(String[] args) throws IOException,
>       InterruptedException, ClassNotFoundException {
>     Configuration conf = new Configuration(true);
>     FileSystem fs = FileSystem.get(conf);
>
>     // The input file is not in HDFS
>     BufferedReader reader = new BufferedReader(new FileReader(args[1]));
>     Path filePath = new Path(args[2]);
>     // Delete the previous file if it exists
>     if (fs.exists(filePath))
>       fs.delete(filePath, true);
>     SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf,
>         filePath, Text.class, VectorWritable.class);
>     // Run through the input file
>     String line;
>     System.out.println(args[3].length());
>     while ((line = reader.readLine()) != null) {
>       // Surround with try/catch to skip lines (e.g. a header) that
>       // fail to parse as numbers
>       try {
>         // Split with the given separator
>         String[] c = line.split(args[3]);
>         if (c.length > 1) {
>           double[] d = new double[c.length - 1];
>           // Get the feature set (column 0 is the label)
>           for (int i = 1; i < c.length; i++)
>             d[i - 1] = Double.parseDouble(c[i]);
>           // Put it in a vector
>           Vector vec = new DenseVector(d.length);
>           vec.assign(d);
>           VectorWritable writable = new VectorWritable();
>           writable.set(vec);
>
>           // Create a label with a / and the class label
>           String label = c[0] + "/" + c[0];
>
>           // Write it all to the seqfile
>           writer.append(new Text(label), writable);
>         }
>       } catch (NumberFormatException e) {
>         continue;
>       }
>     }
>     writer.close();
>     reader.close();
>   }
> }
>
> It generates the output, but then throws an error when I try to run the
> rowSimilarity job:
>
> 14/05/23 11:01:02 INFO mapreduce.Job: Task Id :
> attempt_1400790649200_0044_m_000000_1, Status : FAILED
> Error: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be
> cast to org.apache.hadoop.io.IntWritable
>   at org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$VectorNormMapper.map(RowSimilarityJob.java:184)
>   at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
>   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
>   at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>   at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
>
> 14/05/23 11:01:02 INFO mapreduce.Job: Task Id :
> attempt_1400790649200_0044_m_000001_1, Status : FAILED
> Error: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be
> cast to org.apache.hadoop.io.IntWritable
>   at org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$VectorNormMapper.map(RowSimilarityJob.java:184)
>   at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
>   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
>   at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>   at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
>
> Any clues?
>
>
> On Fri, May 23, 2014 at 1:55 AM, Suneel Marthi <[email protected]> wrote:
>
> > The input needs to be converted to a sequence file of vectors in order
> > to be processed by Mahout's pipeline. This has been asked a few times
> > recently; search the mail archives for Kevin Moulart's recent posts on
> > how to do this.
> >
> > The converted vectors are then fed to RowIdJob, which outputs a matrix
> > and a docIndex; then feed the matrix (which is a DRM) to
> > RowSimilarityJob.
> >
> >
> > On Fri, May 23, 2014 at 1:31 AM, jamal sasha <[email protected]>
> > wrote:
> >
> > > Hi,
> > > I have data where each row is a comma-separated vector, spread over
> > > a bunch of text files, like:
> > > 0.123,0.1433,0.932
> > > 0.129,0.932,0.123
> > > And I want to run Mahout's rowSimilarity module on it, but I am
> > > guessing the input requirement is different.
> > > How do I convert these CSV vectors into the format consumed by
> > > Mahout's rowSimilarity module?
> > > Thanks
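The ClassCastException comes from RowSimilarityJob's VectorNormMapper casting each
sequence-file key to IntWritable: the job expects <IntWritable, VectorWritable>
input, while SequenceOutput above writes Text keys. One way out is to write integer
row ids as keys in the first place. A minimal, untested sketch along those lines;
the class name IntKeyedSequenceOutput is made up for illustration, and the args
layout (args[1] = local CSV, args[2] = HDFS output, args[3] = separator) is copied
from the code above:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class IntKeyedSequenceOutput {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration(true);
    FileSystem fs = FileSystem.get(conf);

    BufferedReader reader = new BufferedReader(new FileReader(args[1]));
    Path filePath = new Path(args[2]);
    if (fs.exists(filePath))
      fs.delete(filePath, true);

    // IntWritable keys instead of Text: RowSimilarityJob's mapper casts
    // the key to IntWritable, which is what caused the ClassCastException
    SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf,
        filePath, IntWritable.class, VectorWritable.class);

    int row = 0; // sequential row id, used as the key
    String line;
    while ((line = reader.readLine()) != null) {
      try {
        String[] c = line.split(args[3]);
        if (c.length > 1) {
          double[] d = new double[c.length - 1];
          for (int i = 1; i < c.length; i++) // column 0 is the label
            d[i - 1] = Double.parseDouble(c[i]);
          Vector vec = new DenseVector(d.length);
          vec.assign(d);
          VectorWritable writable = new VectorWritable();
          writable.set(vec);
          writer.append(new IntWritable(row++), writable);
        }
      } catch (NumberFormatException e) {
        // skip unparseable lines such as a header
      }
    }
    writer.close();
    reader.close();
  }
}

Note this drops the c[0] label entirely. If you need to keep the Text labels, write
them as before and let RowIdJob produce the integer ids plus a docIndex mapping,
as in the next sketch.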

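Alternatively, keep the Text-keyed sequence file and follow Suneel's route:
RowIdJob rewrites <Text, VectorWritable> input into an <IntWritable, VectorWritable>
matrix (a DRM) plus a docIndex mapping each int id back to its original Text key,
and that matrix is what RowSimilarityJob consumes. A rough sketch of driving both
jobs from Java, assuming Mahout 0.9 package locations (RowIdJob lives in
mahout-integration); the paths and column count are placeholders, so check the
option names against your version:

import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob;
import org.apache.mahout.utils.vectors.RowIdJob;

public class CsvRowSimilarityPipeline {
  public static void main(String[] args) throws Exception {
    // 1. <Text, VectorWritable> seqfile -> <IntWritable, VectorWritable>
    //    matrix under rowid/matrix, plus rowid/docIndex (id -> original key)
    ToolRunner.run(new RowIdJob(), new String[] {
        "--input", "vectors",      // placeholder: output of SequenceOutput
        "--output", "rowid"        // placeholder
    });

    // 2. Run RowSimilarityJob on the DRM produced above
    ToolRunner.run(new RowSimilarityJob(), new String[] {
        "--input", "rowid/matrix", // the matrix part of RowIdJob's output
        "--output", "similarity",  // placeholder
        "--numberOfColumns", "3",  // placeholder: features per row
        "--similarityClassname", "SIMILARITY_COSINE"
    });
  }
}

From the shell, the same two steps should be available as "mahout rowid" followed
by "mahout rowsimilarity" with the matching options.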