Chapter 14 of Mahout in Action talks about vectorizing data like this.

The basic idea is that you have to reduce these features to numerical form
so that a meaningful distance metric can be defined over them.
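
As a rough illustration of that reduction (a minimal sketch, not the book's
method and not a Mahout API), the car table from the quoted message could be
turned into plain double[] rows with one-hot slots for make and base colour
plus min-max scaled year and price. The class name CarVectorizer, the category
lists, and the year and price ranges are all assumptions read off the six
sample rows; price is assumed to be given in thousands of pounds.

```java
import java.util.Arrays;
import java.util.List;

public class CarVectorizer {
    // Assumed category lists, taken from the six sample rows only.
    static final List<String> MAKES = Arrays.asList("Audi", "BMW");
    static final List<String> COLORS = Arrays.asList("white", "black", "green", "red");

    // Assumed value ranges from the sample table (price in thousands of pounds).
    static final int MIN_YEAR = 2009, MAX_YEAR = 2014;
    static final double MIN_PRICE = 65.0, MAX_PRICE = 89.0;

    // Layout: [Audi, BMW, white, black, green, red, scaledYear, scaledPrice]
    static double[] encode(String car, String color, int year, double priceThousands) {
        double[] v = new double[MAKES.size() + COLORS.size() + 2];

        // One-hot the make, taken as the first word of the model name.
        int make = MAKES.indexOf(car.split(" ")[0]);
        if (make < 0) throw new IllegalArgumentException("unknown make: " + car);
        v[make] = 1.0;

        // One-hot the base colour: the last word, so "dark red" counts as "red".
        int base = COLORS.indexOf(color.substring(color.lastIndexOf(' ') + 1));
        if (base < 0) throw new IllegalArgumentException("unknown colour: " + color);
        v[MAKES.size() + base] = 1.0;

        // Min-max scale the numeric columns to [0, 1] so that neither one
        // dominates a Euclidean distance purely because of its units.
        int numeric = MAKES.size() + COLORS.size();
        v[numeric] = (year - MIN_YEAR) / (double) (MAX_YEAR - MIN_YEAR);
        v[numeric + 1] = (priceThousands - MIN_PRICE) / (MAX_PRICE - MIN_PRICE);
        return v;
    }

    // Plain Euclidean distance between two encoded rows.
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        double[] a1 = encode("Audi A1", "dark red", 2009, 65.0);
        double[] a3 = encode("Audi A3", "red", 2012, 80.0);
        double[] x3 = encode("BMW X3", "black", 2009, 72.0);
        System.out.println("A1 = " + Arrays.toString(a1));
        System.out.println("d(A1, A3) = " + distance(a1, a3));
        System.out.println("d(A1, X3) = " + distance(a1, x3));
    }
}
```

Each resulting array could then be wrapped in a Mahout vector, for example
with new DenseVector(values), and written out the same way as the numeric
points. How strongly make, colour, year, and price should count relative to
each other is a modelling choice; multiplying a group of slots by a weight
before clustering is one simple way to express it.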



On Tue, Jun 17, 2014 at 5:04 AM, Aleksander Sadecki <
[email protected]> wrote:

> Hi,
>
> I am using KMeansDriver to create a few clusters for my input data. For
> numeric data it works perfectly, but I have no idea how to make it work with
> more complex values. Let's assume that we have the following example:
>
>   | car     | color      | year of production | price
> --|-----------------------------------------------------
> 1 | Audi A1 | dark red   | 2009               | 65.000 £
> 2 | Audi A2 | green      | 2011               | 75.000 £
> 3 | Audi A3 | red        | 2012               | 80.000 £
> 4 | BMW X3  | black      | 2009               | 72.000 £
> 5 | BMW X4  | black      | 2013               | 82.000 £
> 6 | BMW X5  | white      | 2014               | 89.000 £
>
> I have no idea how to feed this data into KMeansDriver.
>
> I have a few requirements for my clusters:
>
> i) Audi A1, A2, A3 are really similar, so these 3 could form one cluster.
> The same goes for BMW X3, X4, X5.
>
> ii) For colors, we can find 4 groups: white, black, green, red + dark red
>
> iii) For year of production, e.g. 2 groups: before and after 2010
>
> iv) For price, e.g. 3 clusters: 60-70, 70-80, and 80+ thousand £
>
> Thank you in advance
> Alex
>
> Here you can see my piece of code:
>
>                 List<Vector> vectors = getPointsFromFile(filename);
>
>                 File testData = new File("testdata");
>
>                 if (!testData.exists()) { testData.mkdir();}
>
>                 testData = new File("testdata/points");
>
>                 if (!testData.exists()) {testData.mkdir();}
>
>                 Configuration conf = new Configuration();
>                 FileSystem fs = FileSystem.get(conf);
>                 ClusterHelper.writePointsToFile(vectors, conf,
>                                 new Path("testdata/points/file1"));
>
>                 Path path = new Path("testdata/clusters/part-00000");
>                 SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf,
>                                 path, Text.class, Kluster.class);
>
>                 for (int i = 0; i < k; i++) {
>                         Vector vec = vectors.get(i);
>                         Kluster cluster = new Kluster(vec, i,
>                                         new EuclideanDistanceMeasure());
>                         writer.append(new Text(cluster.getIdentifier()), cluster);
>                 }
>                 writer.close();
>
>                 Path output = new Path("output");
>                 HadoopUtil.delete(conf, output);
>
>                 KMeansDriver.run(conf, new Path("testdata/points"),
>                                 new Path("testdata/clusters"), output,
>                                 0.001, 10, true, 0.0, false);
>
>
> where the getPointsFromFile(filename) method looks like this:
>
>
>         // Reads one comma-separated row of numeric values per line.
>         public static List<Vector> getPointsFromFile(String file) {
>                 List<Vector> points = new ArrayList<Vector>();
>                 // try-with-resources closes the reader even if reading fails
>                 try (BufferedReader br = new BufferedReader(new FileReader(file))) {
>                         String sCurrentLine;
>                         while ((sCurrentLine = br.readLine()) != null) {
>                                 String[] splitted = sCurrentLine.split(",");
>                                 double[] fr = new double[splitted.length];
>                                 for (int i = 0; i < splitted.length; i++) {
>                                         // parseDouble throws NumberFormatException for
>                                         // non-numeric fields such as "Audi A1"
>                                         fr[i] = Double.parseDouble(splitted[i].trim());
>                                 }
>                                 Vector vec = new RandomAccessSparseVector(fr.length);
>                                 vec.assign(fr);
>                                 points.add(vec);
>                         }
>                 } catch (IOException e) {
>                         // also covers FileNotFoundException, a subclass of IOException
>                         e.printStackTrace();
>                 }
>
>                 return points;
>         }
>
