Hi,
I am using KMeansDriver to create a few clusters for my input data. For numeric
data it works perfectly, but I have no idea how to make it work with more
complex values. Let's assume that we have the following example:
  | car     | color    | year of production | price
--|---------|----------|--------------------|---------
1 | Audi A1 | dark red | 2009               | 65.000 £
2 | Audi A2 | green    | 2011               | 75.000 £
3 | Audi A3 | red      | 2012               | 80.000 £
4 | BMW X3  | black    | 2009               | 72.000 £
5 | BMW X4  | black    | 2013               | 82.000 £
6 | BMW X5  | white    | 2014               | 89.000 £
I have no idea how to feed this data into KMeansDriver.
I have a few requirements for my clusters:
i) Audi A1, A2 and A3 are really similar, so these 3 could form one cluster.
The same goes for BMW X3, X4 and X5.
ii) For colors, we can find 4 groups: white, black, green, red + dark red
iii) For year of production, e.g. 2 groups: before and after 2010
iv) For price, e.g. 3 clusters: 60-70, 70-80 and 80+ thousand £
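To make the requirements above concrete, here is a minimal sketch of how one row of the table might be turned into a purely numeric vector. It assumes one-hot encoding for make and color and min-max scaling for year and price. The class CarEncoder, its methods, the category lists and the scaling ranges are all hypothetical, taken only from the six example rows, and it uses plain Java arrays rather than any Mahout API. Whether this encoding suits k-means for this data is an open question.

```java
import java.util.Arrays;
import java.util.List;

public class CarEncoder {
    // Hypothetical category lists, read off the example table.
    static final List<String> MAKES = Arrays.asList("Audi", "BMW");
    static final List<String> COLORS = Arrays.asList("white", "black", "green", "red");

    // Folds "dark red" into "red" so both shades share one group (requirement ii).
    static String normalizeColor(String color) {
        String c = color.trim().toLowerCase();
        return c.endsWith("red") ? "red" : c;
    }

    // Layout: [make one-hot (2) | color one-hot (4) | scaled year | scaled price].
    static double[] encode(String car, String color, int year, double priceThousands) {
        double[] v = new double[MAKES.size() + COLORS.size() + 2];

        // The make is assumed to be the first word of the car name (requirement i).
        int m = MAKES.indexOf(car.trim().split("\\s+")[0]);
        if (m >= 0) {
            v[m] = 1.0;
        }

        int c = COLORS.indexOf(normalizeColor(color));
        if (c >= 0) {
            v[MAKES.size() + c] = 1.0;
        }

        // Assumed ranges from the table: years 2009..2014, prices 65..89 thousand.
        v[MAKES.size() + COLORS.size()] = (year - 2009) / 5.0;
        v[MAKES.size() + COLORS.size() + 1] = (priceThousands - 65.0) / 24.0;
        return v;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(encode("Audi A1", "dark red", 2009, 65.0)));
        System.out.println(Arrays.toString(encode("BMW X5", "white", 2014, 89.0)));
    }
}
```

Note that clustering such a combined vector yields a single grouping over all attributes at once. If the four groupings in i) to iv) are meant to be independent, each attribute would probably have to be clustered separately.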
Thank you in advance
Alex
Here is the relevant piece of my code:
    List<Vector> vectors = getPointsFromFile(filename);

    // create the local input directories
    File testData = new File("testdata");
    if (!testData.exists()) {
        testData.mkdir();
    }
    testData = new File("testdata/points");
    if (!testData.exists()) {
        testData.mkdir();
    }

    // write the input points as a sequence file
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    ClusterHelper.writePointsToFile(vectors, conf,
            new Path("testdata/points/file1"));

    // use the first k points as the initial cluster centres
    Path path = new Path("testdata/clusters/part-00000");
    SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf,
            path, Text.class, Kluster.class);
    for (int i = 0; i < k; i++) {
        Vector vec = vectors.get(i);
        Kluster cluster = new Kluster(vec, i, new EuclideanDistanceMeasure());
        writer.append(new Text(cluster.getIdentifier()), cluster);
    }
    writer.close();

    // remove old output and run k-means
    Path output = new Path("output");
    HadoopUtil.delete(conf, output);
    KMeansDriver.run(conf, new Path("testdata/points"),
            new Path("testdata/clusters"), output, 0.001, 10, true, 0.0, false);
where the getPointsFromFile(filename) method looks like this:
    public static List<Vector> getPointsFromFile(String file) {
        List<Vector> points = new ArrayList<Vector>();
        // try-with-resources closes the reader even if parsing fails
        try (BufferedReader br = new BufferedReader(new FileReader(file))) {
            String sCurrentLine;
            while ((sCurrentLine = br.readLine()) != null) {
                String[] splitted = sCurrentLine.split(",");
                double[] fr = new double[splitted.length];
                for (int i = 0; i < splitted.length; i++) {
                    // only works when every field is numeric
                    fr[i] = Double.parseDouble(splitted[i]);
                }
                Vector vec = new RandomAccessSparseVector(fr.length);
                vec.assign(fr);
                points.add(vec);
            }
        } catch (IOException e) {
            // also covers FileNotFoundException
            e.printStackTrace();
        }
        return points;
    }
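
To show why the reader above only copes with numeric input, the following small sketch checks which fields of the first example row Double.parseDouble accepts. The class ParseCheck and its isNumeric helper are hypothetical names used only for illustration.

```java
public class ParseCheck {
    // Returns true if Double.parseDouble accepts the field, false otherwise.
    static boolean isNumeric(String field) {
        try {
            Double.parseDouble(field.trim());
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // First row of the example table; \u00a3 is the pound sign.
        String[] row = {"Audi A1", "dark red", "2009", "65.000 \u00a3"};
        for (String field : row) {
            System.out.println(field + " -> " + isNumeric(field));
        }
    }
}
```

Only the year is accepted as it stands. Even with the currency sign stripped, "65.000" would be read as 65.0 rather than 65000, because the dot is treated as a decimal point.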