Hi Jamal,
Probably I can answer here, since I modified the same code to get started. In the code you pasted, change the following to adhere to the requirement (RowSimilarityJob expects IntWritable keys, not Text):
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class SequenceOutput {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration(true);
    FileSystem fs = FileSystem.get(conf);

    // The input file is local, not in HDFS; the output path is in HDFS
    BufferedReader reader = new BufferedReader(new FileReader(args[1]));
    Path filePath = new Path(args[2]);

    // Delete the previous output if it exists
    if (fs.exists(filePath))
      fs.delete(filePath, true);

    // Key class changed from Text to IntWritable, which is what
    // RowSimilarityJob expects
    SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf,
        filePath, IntWritable.class, VectorWritable.class);

    // Row counter, used as the key for each vector
    int label = 0;

    // Run through the input file
    String line;
    while ((line = reader.readLine()) != null) {
      // The try/catch skips lines that don't parse, such as a header
      try {
        // Split on the separator given in args[3]
        String[] c = line.split(args[3]);
        if (c.length > 1) {
          // Columns after the first are the features
          double[] d = new double[c.length - 1];
          for (int i = 1; i < c.length; i++)
            d[i - 1] = Double.parseDouble(c[i]);
          // Put the features in a dense vector
          Vector vec = new DenseVector(d);
          VectorWritable writable = new VectorWritable();
          writable.set(vec);
          // Write the counter key and the vector to the seqfile
          writer.append(new IntWritable(label++), writable);
        }
      } catch (NumberFormatException e) {
        continue;
      }
    }
    writer.close();
    reader.close();
  }
}
Just figure out a way to assign a label (key) to each of your vectors; I just used a counter, as in the code above.
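For reference, here is roughly how I ran things end to end. This is only a sketch: the jar name and paths are placeholders, the first program argument is unused (the code reads args[1] through args[3]), and CLI flags can differ between Mahout versions, so check mahout rowsimilarity --help on yours.

hadoop jar your-job.jar SequenceOutput unused /local/input.csv /user/you/vectors ","
mahout rowsimilarity -i /user/you/vectors -o /user/you/similarity --similarityClassname SIMILARITY_COSINE

Because the sequence file is already keyed by IntWritable, rowsimilarity can read it directly. If you keep Text keys instead, run the file through rowid first, as Suneel suggested, and feed the matrix it produces to rowsimilarity.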
On Fri, May 23, 2014 at 11:39 AM, Andrew Musselman <[email protected]> wrote:
> You could also look at using Pig with the elephantbird package for creating
> sequence files.
>
> There's an example on the Readme at https://github.com/kevinweil/elephant-bird/blob/master/Readme.md
>
>
> On Fri, May 23, 2014 at 11:05 AM, jamal sasha <[email protected]> wrote:
>
> > Hi,
> > I tried to use one of the implementations. Here is a copy-paste for reference:
> >
> > import java.io.BufferedReader;
> > import java.io.FileReader;
> > import java.io.IOException;
> >
> > import org.apache.hadoop.conf.Configuration;
> > import org.apache.hadoop.fs.FileSystem;
> > import org.apache.hadoop.fs.Path;
> > import org.apache.hadoop.io.IntWritable;
> > import org.apache.hadoop.io.SequenceFile;
> > import org.apache.hadoop.io.Text;
> > import org.apache.mahout.math.DenseVector;
> > import org.apache.mahout.math.Vector;
> > import org.apache.mahout.math.VectorWritable;
> >
> > public class SequenceOutput{
> > public static void main(String[] args) throws IOException,
> > InterruptedException, ClassNotFoundException {
> > Configuration conf = new Configuration(true);
> > FileSystem fs = FileSystem.get(conf);
> >
> > // The input file is not in hdfs
> > BufferedReader reader = new BufferedReader(new FileReader(args[1]));
> > Path filePath = new Path(args[2]);
> > // Delete previous file if exists
> > if (fs.exists(filePath))
> > fs.delete(filePath, true);
> > SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf,
> > filePath, Text.class, VectorWritable.class);
> > // Run through the input file
> > String line;
> > System.out.println(args[3].length());
> > while ((line = reader.readLine()) != null) {
> > // We surround with try catch to get rid of the exception when the header is read
> >
> > try {
> > //System.out.println(line);
> > // Split with the given separator
> > String[] c = line.split(args[3]);
> > if (c.length > 1) {
> > double[] d = new double[c.length];
> > // Get the feature set
> > for (int i = 1; i < c.length; i++)
> > d[i] = Double.parseDouble(c[i]);
> > // Put it in a vector
> > Vector vec = new DenseVector(c.length);
> > vec.assign(d);
> > VectorWritable writable = new VectorWritable();
> > writable.set(vec);
> >
> > // Create a label with a / and the class label
> > String label = c[0] + "/" + c[0];
> >
> > // Write all in the seqfile
> > writer.append(new Text(label), writable);
> > }
> > } catch (NumberFormatException e) {
> > continue;
> > }
> > }
> > writer.close();
> > reader.close();
> > }
> > }
> >
> >
> > It generates the output, but then throws an error when I try to run the rowSimilarity job:
> > 14/05/23 11:01:02 INFO mapreduce.Job: Task Id : attempt_1400790649200_0044_m_000000_1, Status : FAILED
> > Error: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable
> > at org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$VectorNormMapper.map(RowSimilarityJob.java:184)
> > at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
> > at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
> > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
> > at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
> > at java.security.AccessController.doPrivileged(Native Method)
> > at javax.security.auth.Subject.doAs(Subject.java:415)
> > at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
> > at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
> >
> > (The second task attempt, attempt_1400790649200_0044_m_000001_1, fails with the same ClassCastException.)
> >
> > Any clues?
> >
> >
> > On Fri, May 23, 2014 at 1:55 AM, Suneel Marthi <[email protected]> wrote:
> >
> > > The input needs to be converted to a sequence file of vectors in order to be processed by Mahout's pipeline. This has been asked a few times recently; search the mail archives for Kevin Moulart's recent posts on doing this.
> > >
> > > The converted vectors are then fed to RowIdJob, which outputs a matrix and a docIndex; the matrix (which is a DRM) is then fed to RowSimilarityJob.
> > >
> > > On Fri, May 23, 2014 at 1:31 AM, jamal sasha <[email protected]> wrote:
> > >
> > > > Hi,
> > > > I have data where each row is a comma-separated vector, spread across a bunch of text files, like:
> > > > 0.123,01433,0.932
> > > > 0.129,0.932,0.123
> > > > And I want to run Mahout's rowIdSimilarity module on it, but I am guessing the input requirement is different.
> > > > How do I convert these csv vectors into the format consumed by Mahout's rowIdSimilarity module?
> > > > Thanks
> > > >
> > >
> >
>
--
Mohit
"When you want success as badly as you want the air, then you will get it.
There is no other secret of success."
-Socrates