You could also look at using Pig with the elephant-bird package for creating sequence files.
There's an example in the Readme at
https://github.com/kevinweil/elephant-bird/blob/master/Readme.md

On Fri, May 23, 2014 at 11:05 AM, jamal sasha <[email protected]> wrote:

> Hi,
> I tried to use one of the implementations. Here is a copy-paste for
> reference:
>
> import java.io.BufferedReader;
> import java.io.FileReader;
> import java.io.IOException;
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.io.IntWritable;
> import org.apache.hadoop.io.SequenceFile;
> import org.apache.hadoop.io.Text;
> import org.apache.mahout.math.DenseVector;
> import org.apache.mahout.math.Vector;
> import org.apache.mahout.math.VectorWritable;
>
> public class SequenceOutput {
>   public static void main(String[] args) throws IOException,
>       InterruptedException, ClassNotFoundException {
>     Configuration conf = new Configuration(true);
>     FileSystem fs = FileSystem.get(conf);
>
>     // The input file is not in HDFS
>     BufferedReader reader = new BufferedReader(new FileReader(args[1]));
>     Path filePath = new Path(args[2]);
>     // Delete the previous file if it exists
>     if (fs.exists(filePath))
>       fs.delete(filePath, true);
>     SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf,
>         filePath, Text.class, VectorWritable.class);
>     // Run through the input file
>     String line;
>     System.out.println(args[3].length());
>     while ((line = reader.readLine()) != null) {
>       // Surround with try/catch to skip lines (e.g. a header) that
>       // fail to parse as numbers
>       try {
>         // Split with the given separator
>         String[] c = line.split(args[3]);
>         if (c.length > 1) {
>           double[] d = new double[c.length - 1];
>           // Get the feature set (column 0 is the label)
>           for (int i = 1; i < c.length; i++)
>             d[i - 1] = Double.parseDouble(c[i]);
>           // Put it in a vector
>           Vector vec = new DenseVector(d.length);
>           vec.assign(d);
>           VectorWritable writable = new VectorWritable();
>           writable.set(vec);
>
>           // Create a label with a / and the class label
>           String label = c[0] + "/" + c[0];
>
>           // Write it all to the seqfile
>           writer.append(new Text(label), writable);
>         }
>       } catch (NumberFormatException e) {
>         continue;
>       }
>     }
>     writer.close();
>     reader.close();
>   }
> }
>
> It generates the output, but then throws an error when I try to run the
> rowSimilarity job:
>
> 14/05/23 11:01:02 INFO mapreduce.Job: Task Id :
> attempt_1400790649200_0044_m_000000_1, Status : FAILED
> Error: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be
> cast to org.apache.hadoop.io.IntWritable
>   at org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$VectorNormMapper.map(RowSimilarityJob.java:184)
>   at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
>   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
>   at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>   at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
>
> 14/05/23 11:01:02 INFO mapreduce.Job: Task Id :
> attempt_1400790649200_0044_m_000001_1, Status : FAILED
> Error: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be
> cast to org.apache.hadoop.io.IntWritable
>   at org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$VectorNormMapper.map(RowSimilarityJob.java:184)
>   at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
>   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
>   at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>   at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
>
> Any clues?
>
>
> On Fri, May 23, 2014 at 1:55 AM, Suneel Marthi <[email protected]> wrote:
>
> > The input needs to be converted to a sequence file of vectors in order
> > to be processed by Mahout's pipeline. This has been asked a few times
> > recently; search the mail archives for Kevin Moulart's recent posts on
> > how to do this.
> >
> > The converted vectors are then fed to RowIdJob, which outputs a matrix
> > and a docIndex; then feed the matrix (which is a DRM) to
> > RowSimilarityJob.
> >
> >
> > On Fri, May 23, 2014 at 1:31 AM, jamal sasha <[email protected]>
> > wrote:
> >
> > > Hi,
> > > I have data where each row is a comma-separated vector, spread over
> > > a bunch of text files, like:
> > > 0.123,0.1433,0.932
> > > 0.129,0.932,0.123
> > > And I want to run Mahout's rowSimilarity module on it, but I am
> > > guessing the input requirement is different.
> > > How do I convert these CSV vectors into the format consumed by
> > > Mahout's rowSimilarity module?
> > > Thanks
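The ClassCastException comes from RowSimilarityJob's VectorNormMapper casting each
sequence-file key to IntWritable: the job expects <IntWritable, VectorWritable>
input, while SequenceOutput above writes Text keys. One way out is to write integer
row ids as keys in the first place. A minimal, untested sketch along those lines;
the class name IntKeyedSequenceOutput is made up for illustration, and the args
layout (args[1] = local CSV, args[2] = HDFS output, args[3] = separator) is copied
from the code above:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class IntKeyedSequenceOutput {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration(true);
    FileSystem fs = FileSystem.get(conf);

    BufferedReader reader = new BufferedReader(new FileReader(args[1]));
    Path filePath = new Path(args[2]);
    if (fs.exists(filePath))
      fs.delete(filePath, true);

    // IntWritable keys instead of Text: RowSimilarityJob's mapper casts
    // the key to IntWritable, which is what caused the ClassCastException
    SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf,
        filePath, IntWritable.class, VectorWritable.class);

    int row = 0; // sequential row id, used as the key
    String line;
    while ((line = reader.readLine()) != null) {
      try {
        String[] c = line.split(args[3]);
        if (c.length > 1) {
          double[] d = new double[c.length - 1];
          for (int i = 1; i < c.length; i++) // column 0 is the label
            d[i - 1] = Double.parseDouble(c[i]);
          Vector vec = new DenseVector(d.length);
          vec.assign(d);
          VectorWritable writable = new VectorWritable();
          writable.set(vec);
          writer.append(new IntWritable(row++), writable);
        }
      } catch (NumberFormatException e) {
        // skip unparseable lines such as a header
      }
    }
    writer.close();
    reader.close();
  }
}

Note this drops the c[0] label entirely. If you need to keep the Text labels, write
them as before and let RowIdJob produce the integer ids plus a docIndex mapping,
as in the next sketch.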

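Alternatively, keep the Text-keyed sequence file and follow Suneel's route:
RowIdJob rewrites <Text, VectorWritable> input into an <IntWritable, VectorWritable>
matrix (a DRM) plus a docIndex mapping each int id back to its original Text key,
and that matrix is what RowSimilarityJob consumes. A rough sketch of driving both
jobs from Java, assuming Mahout 0.9 package locations (RowIdJob lives in
mahout-integration); the paths and column count are placeholders, so check the
option names against your version:

import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob;
import org.apache.mahout.utils.vectors.RowIdJob;

public class CsvRowSimilarityPipeline {
  public static void main(String[] args) throws Exception {
    // 1. <Text, VectorWritable> seqfile -> <IntWritable, VectorWritable>
    //    matrix under rowid/matrix, plus rowid/docIndex (id -> original key)
    ToolRunner.run(new RowIdJob(), new String[] {
        "--input", "vectors",      // placeholder: output of SequenceOutput
        "--output", "rowid"        // placeholder
    });

    // 2. Run RowSimilarityJob on the DRM produced above
    ToolRunner.run(new RowSimilarityJob(), new String[] {
        "--input", "rowid/matrix", // the matrix part of RowIdJob's output
        "--output", "similarity",  // placeholder
        "--numberOfColumns", "3",  // placeholder: features per row
        "--similarityClassname", "SIMILARITY_COSINE"
    });
  }
}

From the shell, the same two steps should be available as "mahout rowid" followed
by "mahout rowsimilarity" with the matching options.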