I want to train a model for classification. In my case the text comes from a
database, and I really do not want to dump it to files just for Mahout
training. I checked out the MIA source code and adapted the following code
for a very basic training task. The usual issue with Mahout examples is that
they either show how to use Mahout from the command prompt with the 20
newsgroups data set, or the code has a lot of dependencies on Hadoop,
ZooKeeper, etc. I would really appreciate it if someone could have a look at
my code, or point me to a very simple tutorial that shows how to train a
model and then use it.
As of now the following code never gets past if (best != null),
because learningAlgorithm.getBest() always returns null!
Sorry for posting the whole code, but I didn't see any other option.
// Mahout 0.7-era packages; names may differ slightly in other versions.
import com.google.common.collect.LinkedListMultimap;
import com.google.common.collect.ListMultimap;
import org.apache.mahout.classifier.sgd.AdaptiveLogisticRegression;
import org.apache.mahout.classifier.sgd.CrossFoldLearner;
import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.ModelSerializer;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.ep.State;
import org.apache.mahout.math.Matrix;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.function.DoubleFunction;
import org.apache.mahout.math.function.Functions;
import org.apache.mahout.vectorizer.encoders.ConstantValueEncoder;
import org.apache.mahout.vectorizer.encoders.Dictionary;
import org.apache.mahout.vectorizer.encoders.FeatureVectorEncoder;
import org.apache.mahout.vectorizer.encoders.TextValueEncoder;

import java.util.Iterator;
import java.util.List;

public class Classifier {

    private static final int FEATURES = 10000;
    private static final TextValueEncoder encoder = new TextValueEncoder("body");
    private static final FeatureVectorEncoder bias = new ConstantValueEncoder("Intercept");
    private static final String[] LEAK_LABELS = {"none", "month-year", "day-month-year"};

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) throws Exception {
        int leakType = 0; // TODO code application logic here
        AdaptiveLogisticRegression learningAlgorithm =
                new AdaptiveLogisticRegression(20, FEATURES, new L1());
        Dictionary newsGroups = new Dictionary();
        //ModelDissector md = new ModelDissector();
        ListMultimap<String, String> noteBySection = LinkedListMultimap.create();
        noteBySection.put("good", "I love this product, the screen is a pleasure to work with and is a great choice for any business");
        noteBySection.put("good", "What a product!! Really amazing clarity and works pretty well");
        noteBySection.put("good", "This product has good battery life and is a little bit heavy but I like it");
        noteBySection.put("bad", "I am really bored with the same UI, this is their 5th version(or fourth or sixth, who knows) and it looks just like the first one");
        noteBySection.put("bad", "The phone is bulky and useless");
        noteBySection.put("bad", "I wish i had never bought this laptop. It died in the first year and now i am not able to return it");

        encoder.setProbes(2);
        double step = 0;
        int[] bumps = {1, 2, 5};
        double averageCorrect = 0;
        double averageLL = 0;
        int k = 0;

        for (String key : noteBySection.keySet()) {
            System.out.println(key);
            List<String> notes = noteBySection.get(key);
            for (Iterator<String> it = notes.iterator(); it.hasNext();) {
                String note = it.next();
                int actual = newsGroups.intern(key);
                Vector v = encodeFeatureVector(note);
                learningAlgorithm.train(actual, v);
                k++;
                int bump = bumps[(int) Math.floor(step) % bumps.length];
                int scale = (int) Math.pow(10, Math.floor(step / bumps.length));
                State<AdaptiveLogisticRegression.Wrapper, CrossFoldLearner> best =
                        learningAlgorithm.getBest();
                double maxBeta;
                double nonZeros;
                double positive;
                double norm;
                double lambda = 0;
                double mu = 0;
                if (best != null) {
                    CrossFoldLearner state = best.getPayload().getLearner();
                    averageCorrect = state.percentCorrect();
                    averageLL = state.logLikelihood();
                    OnlineLogisticRegression model = state.getModels().get(0);
                    // finish off pending regularization
                    model.close();
                    Matrix beta = model.getBeta();
                    maxBeta = beta.aggregate(Functions.MAX, Functions.ABS);
                    nonZeros = beta.aggregate(Functions.PLUS, new DoubleFunction() {
                        @Override
                        public double apply(double v) {
                            return Math.abs(v) > 1.0e-6 ? 1 : 0;
                        }
                    });
                    positive = beta.aggregate(Functions.PLUS, new DoubleFunction() {
                        @Override
                        public double apply(double v) {
                            return v > 0 ? 1 : 0;
                        }
                    });
                    norm = beta.aggregate(Functions.PLUS, Functions.ABS);
                    lambda = learningAlgorithm.getBest().getMappedParams()[0];
                    mu = learningAlgorithm.getBest().getMappedParams()[1];
                } else {
                    maxBeta = 0;
                    nonZeros = 0;
                    positive = 0;
                    norm = 0;
                }
                System.out.println(k % (bump * scale));
                if (k % (bump * scale) == 0) {
                    if (learningAlgorithm.getBest() != null) {
                        System.out.println("----------------------------");
                        ModelSerializer.writeBinary("c:/tmp/news-group-" + k + ".model",
                                learningAlgorithm.getBest().getPayload().getLearner().getModels().get(0));
                    }
                    step += 0.25;
                    System.out.printf("%.2f\t%.2f\t%.2f\t%.2f\t%.8g\t%.8g\t",
                            maxBeta, nonZeros, positive, norm, lambda, mu);
                    System.out.printf("%d\t%.3f\t%.2f\t%s\n",
                            k, averageLL, averageCorrect * 100, LEAK_LABELS[leakType % 3]);
                }
            }
        }
        learningAlgorithm.close();
    }

    private static Vector encodeFeatureVector(String text) {
        encoder.addText(text.toLowerCase());
        //System.out.println(encoder.asString(text));
        Vector v = new RandomAccessSparseVector(FEATURES);
        bias.addToVector((byte[]) null, 1, v);
        encoder.flush(1, v);
        return v;
    }
}
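As a side note, the bump/scale bookkeeping in the loop above implements a roughly logarithmic reporting schedule: a report fires when k is a multiple of bump * scale, and each report advances step, which walks bump through 1, 2, 5 and then multiplies scale by 10. A standalone sketch of just that schedule (class and method names are mine, for illustration only):

```java
import java.util.ArrayList;
import java.util.List;

public class ReportSchedule {
    // Reproduces the 1-2-5 schedule from the training loop: a report fires
    // when k is a multiple of bump * scale, and step only advances when a
    // report fires, so each threshold is hit four times (step += 0.25)
    // before bump/scale move on.
    public static List<Integer> reportPoints(int maxK) {
        int[] bumps = {1, 2, 5};
        double step = 0;
        List<Integer> points = new ArrayList<>();
        for (int k = 1; k <= maxK; k++) {
            int bump = bumps[(int) Math.floor(step) % bumps.length];
            int scale = (int) Math.pow(10, Math.floor(step / bumps.length));
            if (k % (bump * scale) == 0) {
                points.add(k);
                step += 0.25;
            }
        }
        return points;
    }

    public static void main(String[] args) {
        // Early reports come frequently, then thin out as k grows.
        System.out.println(reportPoints(100));
    }
}
```

Tracing it by hand: the first four reports fire at k = 1, 2, 3, 4 (threshold 1), the next four at k = 6, 8, 10, 12 (threshold 2), then at multiples of 5, then of 10, and so on, which is why the original prints progress often at the start and rarely later.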
Sapankumar Parikh