Hi Jamal,
Probably I can answer here, since I modified the same code to get started. In the code you pasted, change the following to adhere to the requirement (RowSimilarityJob expects IntWritable keys, not Text):
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class SequenceOutput {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration(true);
    FileSystem fs = FileSystem.get(conf);

    // The input file is local, not in HDFS; the output path is in HDFS
    BufferedReader reader = new BufferedReader(new FileReader(args[1]));
    Path filePath = new Path(args[2]);

    // Delete the previous output if it exists
    if (fs.exists(filePath))
      fs.delete(filePath, true);

    // Key class changed from Text to IntWritable, which is what
    // RowSimilarityJob expects
    SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf,
        filePath, IntWritable.class, VectorWritable.class);

    // Row counter, used as the key for each vector
    int label = 0;

    // Run through the input file
    String line;
    while ((line = reader.readLine()) != null) {
      // The try/catch skips lines that don't parse, such as a header
      try {
        // Split on the separator given in args[3]
        String[] c = line.split(args[3]);
        if (c.length > 1) {
          // Columns after the first are the features
          double[] d = new double[c.length - 1];
          for (int i = 1; i < c.length; i++)
            d[i - 1] = Double.parseDouble(c[i]);
          // Put the features in a dense vector
          Vector vec = new DenseVector(d);
          VectorWritable writable = new VectorWritable();
          writable.set(vec);
          // Write the counter key and the vector to the seqfile
          writer.append(new IntWritable(label++), writable);
        }
      } catch (NumberFormatException e) {
        continue;
      }
    }
    writer.close();
    reader.close();
  }
}
Just figure out a way to assign a label (key) to each of your vectors; I just used a counter, as in the code above.
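For reference, here is roughly how I ran things end to end. This is only a sketch: the jar name and paths are placeholders, the first program argument is unused (the code reads args[1] through args[3]), and CLI flags can differ between Mahout versions, so check mahout rowsimilarity --help on yours.

hadoop jar your-job.jar SequenceOutput unused /local/input.csv /user/you/vectors ","
mahout rowsimilarity -i /user/you/vectors -o /user/you/similarity --similarityClassname SIMILARITY_COSINE

Because the sequence file is already keyed by IntWritable, rowsimilarity can read it directly. If you keep Text keys instead, run the file through rowid first, as Suneel suggested, and feed the matrix it produces to rowsimilarity.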
On Fri, May 23, 2014 at 11:39 AM, Andrew Musselman <[email protected]> wrote:
> You could also look at using Pig with the elephantbird package for creating
> sequence files.
>
> There's an example on the Readme at https://github.com/kevinweil/elephant-bird/blob/master/Readme.md
>
>
> On Fri, May 23, 2014 at 11:05 AM, jamal sasha <[email protected]> wrote:
>
> > Hi,
> > I tried to use one of the implementations. Here is a copy-paste for reference:
> >
> > import java.io.BufferedReader;
> > import java.io.FileReader;
> > import java.io.IOException;
> >
> > import org.apache.hadoop.conf.Configuration;
> > import org.apache.hadoop.fs.FileSystem;
> > import org.apache.hadoop.fs.Path;
> > import org.apache.hadoop.io.IntWritable;
> > import org.apache.hadoop.io.SequenceFile;
> > import org.apache.hadoop.io.Text;
> > import org.apache.mahout.math.DenseVector;
> > import org.apache.mahout.math.Vector;
> > import org.apache.mahout.math.VectorWritable;
> >
> > public class SequenceOutput{
> > public static void main(String[] args) throws IOException,
> > InterruptedException, ClassNotFoundException {
> > Configuration conf = new Configuration(true);
> > FileSystem fs = FileSystem.get(conf);
> >
> > // The input file is not in hdfs
> > BufferedReader reader = new BufferedReader(new FileReader(args[1]));
> > Path filePath = new Path(args[2]);
> > // Delete previous file if exists
> > if (fs.exists(filePath))
> > fs.delete(filePath, true);
> > SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf,
> > filePath, Text.class, VectorWritable.class);
> > // Run through the input file
> > String line;
> > System.out.println(args[3].length());
> > while ((line = reader.readLine()) != null) {
> > // We surround with try catch to get rid of the exception when the header is read
> >
> > try {
> > //System.out.println(line);
> > // Split with the given separator
> > String[] c = line.split(args[3]);
> > if (c.length > 1) {
> > double[] d = new double[c.length];
> > // Get the feature set
> > for (int i = 1; i < c.length; i++)
> > d[i] = Double.parseDouble(c[i]);
> > // Put it in a vector
> > Vector vec = new DenseVector(c.length);
> > vec.assign(d);
> > VectorWritable writable = new VectorWritable();
> > writable.set(vec);
> >
> > // Create a label with a / and the class label
> > String label = c[0] + "/" + c[0];
> >
> > // Write all in the seqfile
> > writer.append(new Text(label), writable);
> > }
> > } catch (NumberFormatException e) {
> > continue;
> > }
> > }
> > writer.close();
> > reader.close();
> > }
> > }
> >
> >
> > It generates the output, but then throws an error when I try to run the rowSimilarity job:
> > 14/05/23 11:01:02 INFO mapreduce.Job: Task Id : attempt_1400790649200_0044_m_000000_1, Status : FAILED
> > Error: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable
> > at org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$VectorNormMapper.map(RowSimilarityJob.java:184)
> > at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
> > at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
> > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
> > at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
> > at java.security.AccessController.doPrivileged(Native Method)
> > at javax.security.auth.Subject.doAs(Subject.java:415)
> > at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
> > at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
> >
> > (The second task attempt, attempt_1400790649200_0044_m_000001_1, fails with the same ClassCastException.)
> >
> > Any clues?
> >
> >
> > On Fri, May 23, 2014 at 1:55 AM, Suneel Marthi <[email protected]> wrote:
> >
> > > The input needs to be converted to a sequence file of vectors in order to be processed by Mahout's pipeline. This has been asked a few times recently; search the mail archives for Kevin Moulart's recent posts on doing this.
> > >
> > > The converted vectors are then fed to RowIdJob, which outputs a matrix and a docIndex; the matrix (which is a DRM) is then fed to RowSimilarityJob.
> > >
> > > On Fri, May 23, 2014 at 1:31 AM, jamal sasha <[email protected]> wrote:
> > >
> > > > Hi,
> > > > I have data where each row is a comma-separated vector, spread across a bunch of text files, like:
> > > > 0.123,01433,0.932
> > > > 0.129,0.932,0.123
> > > > And I want to run Mahout's rowIdSimilarity module on it, but I am guessing the input requirement is different.
> > > > How do I convert these csv vectors into the format consumed by Mahout's rowIdSimilarity module?
> > > > Thanks
> > > >
> > >
> >
>
--
Mohit
"When you want success as badly as you want the air, then you will get it.
There is no other secret of success."
-Socrates