I did a review of the code and it was definitely written to support having multiple training sets in the same collection. So, it sounds like something is not working as designed.
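For a quick check of what each training run actually stored, the model documents can be pulled straight back out of the model collection and compared across feature sets. A minimal sketch (terms_ss, weights_ds, and idfs_ds are the fields discussed below; name_s, iteration_i, and error_d are assumptions based on the reference guide's description of what train() emits):

    search(models2,
           q="name_s:threat2",
           qt="/select",
           fl="name_s, iteration_i, terms_ss, idfs_ds, weights_ds, error_d",
           sort="iteration_i desc",
           rows="1")

If the idfs_ds values come back identical across different training sets in the same collection, that is the symptom described further down the thread.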
I planned on testing out model building with different types of training sets anyway, so I'll comment on my findings in the ticket.

Joel Bernstein
http://joelsolr.blogspot.com/

On Wed, Mar 22, 2017 at 9:58 AM, Joe Obernberger <joseph.obernber...@gmail.com> wrote:

Thank you Tim. I appreciate the tips. At this point, I'm just trying to understand how to use it. The 30 tweets that I've selected so far are, in fact, threatening. The things people say! My favorite so far is 'disingenuous twat waffle'. No kidding.

The issue that I'm having is not with the model; it's with creating the model from a query other than *:*.

Example:

    update(models2, batchSize="50",
           train(TRAINING,
                 features(TRAINING,
                          q="*:*",
                          featureSet="threat1",
                          field="ClusterText",
                          outcome="out_i",
                          positiveLabel=1,
                          numTerms=100),
                 q="*:*",
                 name="threat1",
                 field="ClusterText",
                 outcome="out_i",
                 maxIterations="100"))

This works great. It makes a model, the model works, and I can see reasonable results. However, say I've tagged a training set inside a larger collection called COL1 with a field called JoeID, like this:

    update(models2, batchSize="50",
           train(COL1,
                 features(COL1,
                          q="JoeID:Training",
                          featureSet="threat2",
                          field="ClusterText",
                          outcome="out_i",
                          positiveLabel=1,
                          numTerms=1000),
                 q="JoeID:Training",
                 name="threat2",
                 field="ClusterText",
                 outcome="out_i",
                 maxIterations="100"))

This does not work as expected. I can query the COL1 collection for JoeID:Training and get the result set that I want to train on, but the model creation does not seem to work. At this point, if I want to make a model, I need to create a collection, put the training set into it, and then train on *:*. This is fine, but I'm not sure it's how it is supposed to work.

-Joe

On 3/21/2017 10:17 PM, Tim Casey wrote:

Joe,

To do this correctly and soundly, you will need to sample the data and mark each sample as threatening or neutral. You can probably expand on this quite a bit, but that would be a good start. You can then draw another set of samples and see how you did: you use one set to train and the other to validate.

What you are doing now is probably just noise from a model's point of view, and it will probably not make much difference how you index/query/model through the noise.

I don't mean this critically, just plainly. Effectively, the less mathematically correct the process, the more anecdotal the result.

tim
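Tim's train/validate split can be sketched with the random() stream source, which pulls back a pseudo-random sample of documents matching a query. A minimal sketch against the collection and fields named in this thread (the rows value is illustrative, and keeping the held-out sample disjoint from the training sample is left to the labeling step):

    random(COL1,
           q="JoeID:Training",
           rows="100",
           fl="id, ClusterText, out_i")

Drawing one labeled sample to train on and a second, held-out sample to score the model against gives the validation step Tim describes.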
On Mon, Mar 20, 2017 at 4:42 PM, Joel Bernstein <joels...@gmail.com> wrote:

I've only tested with the training data in its own collection, but it was designed for multiple training sets in the same collection.

I suspect your training set is too small to get a reliable model from. The training sets we tested with were considerably larger.

All the idfs_ds values being the same seems odd, though. The idfs_ds values in particular were designed to be accurate when there are multiple training sets in the same collection.

Joel Bernstein
http://joelsolr.blogspot.com/

On Mon, Mar 20, 2017 at 5:41 PM, Joe Obernberger <joseph.obernber...@gmail.com> wrote:

If I put the training data into its own collection and use q="*:*", then it works correctly. Is that a requirement?

Thank you.

-Joe

On 3/20/2017 3:47 PM, Joe Obernberger wrote:

I'm trying to build a model using tweets. I've manually tagged 30 tweets as threatening, and 50 random tweets as non-threatening. When I build the model with:

    update(models2, batchSize="50",
           train(UNCLASS,
                 features(UNCLASS,
                          q="ProfileID:PROFCLUST1",
                          featureSet="threatFeatures3",
                          field="ClusterText",
                          outcome="out_i",
                          positiveLabel=1,
                          numTerms=250),
                 q="ProfileID:PROFCLUST1",
                 name="threatModel3",
                 field="ClusterText",
                 outcome="out_i",
                 maxIterations="100"))

it appears to work, but all the idfs_ds values are identical. The terms_ss values look reasonable, but nearly all the weights_ds values are 1.0. The out_i field is -1 for non-threatening tweets and +1 for threatening tweets. I'm trying to follow along with Joel Bernstein's excellent post here:
http://joelsolr.blogspot.com/2017/01/deploying-ai-alerting-system-with-solrs.html

Tips?

Thank you!

-Joe
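Once a model does train cleanly, the post linked above applies it with the classify() function. With the names used in this thread, that would look roughly like the sketch below (the model() wrapper and its cacheMillis parameter follow the reference guide; the q here is illustrative, and the exact name of the appended score field, probability_d, should be treated as an assumption):

    classify(model(models2, id="threatModel3", cacheMillis=5000),
             search(UNCLASS,
                    q="ProfileID:PROFCLUST1",
                    fl="id, ClusterText",
                    sort="id asc"),
             field="ClusterText")

Each emitted tuple carries the original document plus the model's score, so alerts can be raised downstream by filtering on that probability.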