On Mon, May 2, 2011 at 12:39 AM, Tim Snyder <[email protected]> wrote:
> ...
>
> In your options a through c - I am not sure I understand the difference
> between (a) and (c). Is (a) the current state (let it fail), and (c) a
> small fix to let things complete, but understand that it is probably not
> valuable?

The current state is that it fails silently, which is not acceptable.

I will think about how to make small training data work well. It shouldn't
be too hard. My guess is that this will look a bit more like a batch
training interface, but I am not sure yet.

> Assuming that I could do a previous processing step on the messages,
> similar to spam exclusion, to get to a 1 in 50 or 1 in 20 potential
> interesting msg content, I could develop a larger training dataset.

Good. It is possible to build moderately good models with a few dozen
examples, but having so little training data commonly limits the
sophistication of the models you can build.

> With only 1 in 10,000 msgs of interest, I don't think I can get to a
> 10,000 training set. Any recommendations on how to do this?

The key problem is that with a very low hit rate, you have to do a lot of
work to find positive examples.

The general technique you need is called active learning. This is where the
model helps you find training data to hand tag. There are two sub-problems.
One is finding the training data and the other is dealing with the fact
that you now have a very strangely selected training set that isn't like
the real data. The first problem is the key in most practical situations.

> I am looking at Chapter 12 of MIA on clustering of Twitter msgs as a
> possible way of implementing unsupervised learning for clustering. I
> would need to take this output and be able to discard those clusters
> (and resultant msgs) which are not of interest.

This is an excellent way to stratify your search. You can also use the
output from any early models that you build to guide you. Sort by score and
judge examples from many different score ranges.
Then re-run the training (but keep all old training data, of course).
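The score-sorted judging step could be sketched roughly like this: score the unlabeled pool with an early model, split the pool into score buckets, and draw a few candidates from each bucket for hand tagging. All names and parameters here are illustrative, not part of any particular library:

```python
# Sketch of score-stratified sampling for active learning: draw a few
# candidates from every score range so hand-labeling covers the whole
# spectrum, not just the top-scoring messages. Illustrative names only.
import random

def stratified_sample(scores, per_bucket=5, n_buckets=10, seed=42):
    """Return indices of messages to hand-label, a few per score bucket."""
    rng = random.Random(seed)
    # Sort message indices by model score, then split into equal buckets.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    bucket_size = max(1, len(order) // n_buckets)
    chosen = []
    for start in range(0, len(order), bucket_size):
        bucket = order[start:start + bucket_size]
        chosen.extend(rng.sample(bucket, min(per_bucket, len(bucket))))
    return chosen

# Example: fake model scores for 1,000 unlabeled messages.
rng = random.Random(0)
scores = [rng.random() for _ in range(1000)]
to_label = stratified_sample(scores)  # 50 messages spread over all score ranges
```

After the hand-tagged labels come back, they are added to the existing training set and the model is retrained, exactly as described above.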
