Hello again, and sorry to bother you with this once again,

I'm having a bit of trouble.  My CSV files contain only numbers (doubles); 
each line looks something like this: 2.4135,1.1120.  I'm not sure if that 
makes a big difference, but when I try to do step #2, I can't figure out what 
I should put for the field and idField options for the input.  What would I 
put for these options?  Or, if they already exist, how could I find out what 
they are?

Thanks very much for your help,
Ronald


Oh, and if it helps, this is the Java code that I came up with to convert my 
CSV file into text files.  I then tried to use Lucene to build an index from 
those text files.  I did this because I couldn't quite follow the code from 
the link you gave me.  I don't think I needed to use a HashMap, but I just 
wanted to learn how to use them.

        public static void main(String[] args) throws IOException {

                String inputFile = "/home/r9r/seqTest/seqTestData.csv";
                String outputPath = "/home/r9r/seqTest/seqTestOut/";

                try {
                        File myFile = new File(inputFile);
                        BufferedReader reader = new BufferedReader(new FileReader(myFile));

                        String text = null;
                        int j = 0;

                        while ((text = reader.readLine()) != null) {

                                List<String> line = new ArrayList<String>();
                                line.add(text);

                                Map<String, List<String>> aHashMap = new HashMap<String, List<String>>();
                                aHashMap.put(Integer.toString(j), line);

                                // one output text file per CSV line, named after the line number
                                File newFile = new File(outputPath + Integer.toString(j));
                                PrintWriter writer = new PrintWriter(newFile);

                                for (Map.Entry<String, List<String>> me : aHashMap.entrySet()) {
                                        // strip the List's "[...]" wrapper and the commas
                                        String value = me.getValue().toString()
                                                .replace("[", " ").replace(",", " ").replace("]", " ");
                                        writer.println(me.getKey() + " " + value);
                                        System.out.println(me.getKey() + " " + value);
                                }

                                // close the writer each time around the loop, or the
                                // buffered output may never be flushed to disk
                                writer.close();

                                j++;
                        }

                        reader.close();

                } catch (Exception ex) {
                        ex.printStackTrace();
                }
        }
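P.S. Here is a rough sketch (untested, and the names are my own assumptions) 
of how the CSV lines could be written straight into a Lucene index, skipping 
the intermediate text files.  It targets the Lucene 4.x API that ships with 
Mahout 0.9.  The field names "id" and "text" are arbitrary choices of mine; 
whatever names are used when building the index are exactly what would then be 
passed to lucene2seq as idField and field.  As far as I understand, both 
fields need to be stored (Field.Store.YES) so that lucene2seq can read the 
values back out.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class CsvToLuceneIndex {

    public static void main(String[] args) throws Exception {
        String inputFile = "/home/r9r/seqTest/seqTestData.csv";
        String indexPath = "/home/r9r/seqTest/index/";  // hypothetical output location

        Directory dir = FSDirectory.open(new File(indexPath));
        IndexWriterConfig config = new IndexWriterConfig(
                Version.LUCENE_46, new StandardAnalyzer(Version.LUCENE_46));
        IndexWriter writer = new IndexWriter(dir, config);

        BufferedReader reader = new BufferedReader(new FileReader(new File(inputFile)));
        String text;
        int j = 0;
        while ((text = reader.readLine()) != null) {
            Document doc = new Document();
            // "id" and "text" are arbitrary field names -- they are what you
            // would later pass to lucene2seq as idField and field, respectively.
            // Store.YES matters: lucene2seq reads the *stored* values back out.
            doc.add(new StringField("id", Integer.toString(j), Field.Store.YES));
            doc.add(new TextField("text", text, Field.Store.YES));
            writer.addDocument(doc);
            j++;
        }
        reader.close();
        writer.close();
    }
}
```

To find out which fields an existing index already contains, a tool such as 
Luke (the Lucene index toolbox) can open the index directory and list its 
field names.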




________________________________________
From: Suneel Marthi [[email protected]]
Sent: Tuesday, February 11, 2014 5:44 PM
To: [email protected]
Subject: Re: seqdumper output?

You should run clusterdump on 
/home/r9r/seqTest/seqKmeans/clusters-1-final/part-xxxxx to see the points that 
are in the cluster.
But you need a dictionary for that, which wouldn't be available if the vectors 
were generated from CSV.

So one way to generate a dictionary for a CSV and verify the clustering output 
would be to go through the process below:

1. Convert the CSV file to a Lucene index (see 
http://glaforge.appspot.com/article/lucene-s-fun for sample code).
2. Run the Lucene index from (1) through Mahout's lucene2seq utility - this 
converts the Lucene index into sequence files.
3. Run the output of (2) through seq2sparse - this should generate tf-idf 
vectors, a dictionary, tf-vectors, and word counts.
4. Run the output of (3) through the KMeans driver.

Please give this a try.





On Tuesday, February 11, 2014 3:33 PM, "Allen, Ronald L." <[email protected]> 
wrote:

Hello,

I have done something wrong with clustering a CSV file and can't quite figure 
it out.  I am using Mahout 0.9 on a local machine only.  Below is the output 
from seqdumper, and I am not sure how to interpret it.  Can anyone help?

Input Path: file:/home/r9r/seqTest/seqKmeans/clusters-1-final/_policy
Key class: class org.apache.hadoop.io.Text Value Class: class 
org.apache.mahout.clustering.iterator.ClusteringPolicyWritable
Key: : Value: 
org.apache.mahout.clustering.iterator.ClusteringPolicyWritable@78be9eb3
Count: 1
Input Path: file:/home/r9r/seqTest/seqKmeans/clusters-1-final/part-00000
Key class: class org.apache.hadoop.io.IntWritable Value Class: class 
org.apache.mahout.clustering.iterator.ClusterWritable
Key: 0: Value: org.apache.mahout.clustering.iterator.ClusterWritable@592ea0f8
Count: 1
Input Path: file:/home/r9r/seqTest/seqKmeans/clusters-1-final/part-00001
Key class: class org.apache.hadoop.io.IntWritable Value Class: class 
org.apache.mahout.clustering.iterator.ClusterWritable
Key: 1: Value: org.apache.mahout.clustering.iterator.ClusterWritable@44a2786
Count: 1

There's a good chance I am still not getting my CSV data into something 
usable.  I can get it into a sequence file, but this is the output I see.

Thanks,
Ronald
