Dear all,
From Chapter 16.3.4 of "Mahout in Action" I learned that, for the SGD algorithm, we can use specialized parsing code to cut the time spent parsing the input features and putting them into the Vector. In my test with SimpleCsvExamples.java from the code base, this reduced parsing time by about 80%.
I tried the same approach as an optimization test for parsing categorical features, but it only seems to reduce the time by about 30%, using code like the following:
Vector v = new RandomAccessSparseVector(1000);
//.... some old codes
} else if ("--fast".equals(args[0])) {
  FastLineReader in = new FastLineReader(new FileInputStream(args[1]));
  try {
    FastLine line = in.read();
    while (line != null) {
      v.assign(0);
      for (int i = 0; i < FIELDS; i++) {
        // old numeric path:
        // double z = line.getDouble(i);
        // s[i].add(z);
        byte[] category = line.getBytes(i);
        encoder[i].addToVector(category, 1, v);
      }
      line = in.read();
    }
  } finally {
    IOUtils.quietClose(in);
  }
}
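One idea I am considering: since a categorical field usually has only a small number of distinct values, the hashed vector position for each value could be computed once and cached, so repeated occurrences skip the hashing work. This is only a rough sketch of that idea, not Mahout's actual encoder internals; the class name, the numFeatures parameter, and the FNV-1a hash (standing in for whatever hash Mahout uses) are all my own assumptions:

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: cache the vector index per distinct category value.
// Not Mahout's real FeatureVectorEncoder; names and hash are illustrative.
public class CachedCategoryEncoder {
  private final int numFeatures;
  private final Map<String, Integer> cache = new HashMap<>();

  public CachedCategoryEncoder(int numFeatures) {
    this.numFeatures = numFeatures;
  }

  // Returns the vector index for a category value, hashing only the
  // first time each distinct value is seen.
  public int indexFor(byte[] category) {
    String key = new String(category, StandardCharsets.UTF_8);
    Integer idx = cache.get(key);
    if (idx == null) {
      idx = Math.floorMod(hash(category), numFeatures);
      cache.put(key, idx);
    }
    return idx;
  }

  // FNV-1a hash, a stand-in for the encoder's real hash function.
  private static int hash(byte[] bytes) {
    int h = 0x811c9dc5;
    for (byte b : bytes) {
      h ^= b & 0xff;
      h *= 0x01000193;
    }
    return h;
  }
}
```

The String key still allocates, so a real version would want a cheaper lookup key, but it shows the shape of the caching.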
private static final class FastLine {
  // ... fields (base, start, length) and other methods omitted

  // Copies one field out of the shared line buffer into a fresh byte[].
  public byte[] getBytes(int field) {
    int offset = start.get(field);
    int size = length.get(field);
    byte[] result = new byte[size];
    System.arraycopy(base.array(), offset, result, 0, size);
    return result;
  }
}
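I suspect part of the remaining cost is that getBytes() allocates and copies a byte[] for every field of every line. If the encoder could hash directly over the shared line buffer given an offset and length, that allocation would disappear entirely. A minimal sketch of the idea (the SliceHasher name and the FNV-1a hash are my own placeholders, not an existing Mahout API):

```java
// Hypothetical sketch: hash a field in place from the backing buffer,
// instead of copying it into a fresh byte[] first.
public final class SliceHasher {
  // FNV-1a over a slice: no allocation, no System.arraycopy.
  public static int hash(byte[] buf, int offset, int length) {
    int h = 0x811c9dc5;
    for (int i = offset; i < offset + length; i++) {
      h ^= buf[i] & 0xff;
      h *= 0x01000193;
    }
    return h;
  }
}
```

Since the hash only depends on the bytes in the slice, it produces the same result as hashing the copied array, so the copy in getBytes() would become unnecessary for encoding.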
I am wondering if anyone could help me find a better solution. I found that about 80% of SGD's running time was spent parsing the features and adding them to the Vector, so if I could optimize the categorical features as well, the algorithm would be even faster and might be able to handle 100 million or even billions of lines of data on a single machine.
Thanks.
Best wishes,
Stanley Xu