I reviewed the code, and it was definitely written to support having
multiple training sets in the same collection.  So it sounds like
something is not working as designed.

I planned on testing out model building with different types of training
sets anyway, so I'll comment on my findings in the ticket.

Joel Bernstein
http://joelsolr.blogspot.com/

On Wed, Mar 22, 2017 at 9:58 AM, Joe Obernberger <
joseph.obernber...@gmail.com> wrote:

> Thank you, Tim.  I appreciate the tips.  At this point, I'm just trying
> to understand how to use it.  The 30 tweets that I've selected so far
> are, in fact, threatening.  The things people say!  My favorite so far
> is 'disingenuous twat waffle'.  No kidding.
>
> The issue that I'm having is not with the model; it's with creating the
> model from a query other than *:*.
>
> Example:
>
> update(models2, batchSize="50",
>              train(TRAINING,
>                       features(TRAINING,
>                                      q="*:*",
>                                      featureSet="threat1",
>                                      field="ClusterText",
>                                      outcome="out_i",
>                                      positiveLabel=1,
>                                      numTerms=100),
>                       q="*:*",
>                       name="threat1",
>                       field="ClusterText",
>                       outcome="out_i",
>                       maxIterations="100"))
>
> Works great.  It makes a model, the model works, and I can see
> reasonable results.
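>
> (As a quick sanity check, the stored model can be pulled back with a
> plain search against the models collection.  A sketch only - the name_s
> and iteration_i field names are my assumptions about what train() writes;
> terms_ss, weights_ds, and idfs_ds are the fields discussed further down
> this thread:)
>
>   search(models2,
>          q="name_s:threat1",
>          fl="name_s, iteration_i, terms_ss, weights_ds, idfs_ds",
>          sort="iteration_i desc",
>          rows="1")
>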
> However, say I've tagged a training set inside a larger collection called
> COL1 with a field called JoeID - like this:
>
> update(models2, batchSize="50",
>              train(COL1,
>                       features(COL1,
>                                      q="JoeID:Training",
>                                      featureSet="threat2",
>                                      field="ClusterText",
>                                      outcome="out_i",
>                                      positiveLabel=1,
>                                      numTerms=1000),
>                       q="JoeID:Training",
>                       name="threat2",
>                       field="ClusterText",
>                       outcome="out_i",
>                       maxIterations="100"))
>
> This does not work as expected.  I can query the COL1 collection for
> JoeID:Training and get the result set that I want to train on, but the
> model creation seems not to work.  At this point, if I want to make a
> model, I need to create a collection, put the training set into it, and
> then train on *:*.  That's fine, but I'm not sure it's how this is
> supposed to work.
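>
> (If the separate collection remains necessary, the copy itself can be
> done with a streaming expression instead of re-indexing from source.  A
> rough sketch, assuming a pre-created TRAINING collection, that id,
> ClusterText, and out_i are the only fields the model needs, and that the
> commit() wrapper and rows value suit your Solr version:)
>
>   commit(TRAINING,
>          update(TRAINING, batchSize="50",
>                 search(COL1,
>                        q="JoeID:Training",
>                        fl="id, ClusterText, out_i",
>                        sort="id asc",
>                        rows="100000")))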
>
> -Joe
>
>
>
> On 3/21/2017 10:17 PM, Tim Casey wrote:
>
>> Joe,
>>
>> To do this correctly and soundly, you will need to sample the data and
>> mark each sample as threatening or neutral.  You can probably expand on
>> this quite a bit, but that would be a good start.  You can then draw
>> another set of samples and see how you did: you use one set to train and
>> the other to validate.
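>>
>> (In streaming terms, once a held-out sample is tagged - the
>> ProfileID:HOLDOUT filter below is purely a placeholder, and I'm assuming
>> the model can be fetched by the name it was trained under - the
>> validation pass could look roughly like this, with classify() scoring
>> each held-out document against the stored model:)
>>
>>   classify(model(models2, id="threatModel3", cacheMillis=5000),
>>            search(UNCLASS,
>>                   q="ProfileID:HOLDOUT",
>>                   fl="id, ClusterText",
>>                   sort="id asc",
>>                   rows="1000"),
>>            field="ClusterText")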
>>
>> What you are doing is probably just noise, from a model point of view, and
>> it will probably not make too much difference how you index/query/model
>> through the noise.
>>
>> I don't mean this critically, just plainly.  Effectively, the less
>> mathematically rigorous the process, the more anecdotal the result.
>>
>> tim
>>
>>
>> On Mon, Mar 20, 2017 at 4:42 PM, Joel Bernstein <joels...@gmail.com>
>> wrote:
>>
>>> I've only tested with the training data in its own collection, but it
>>> was designed for multiple training sets in the same collection.
>>>
>>> I suspect your training set is too small to get a reliable model from.
>>> The training sets we tested with were considerably larger.
>>>
>>> All the idfs_ds values being the same seems odd though. The idfs_ds in
>>> particular were designed to be accurate when there are multiple training
>>> sets in the same collection.
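>>>
>>> (For context: my understanding - an assumption about the implementation,
>>> not a quote from the code - is that the idf here is computed per
>>> feature-set query rather than over the whole index, roughly
>>> idf = log(docsMatchingQuery / termDocFreqWithinQuery), which is why
>>> idfs_ds should differ between training sets sharing one collection.)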
>>>
>>> Joel Bernstein
>>> http://joelsolr.blogspot.com/
>>>
>>> On Mon, Mar 20, 2017 at 5:41 PM, Joe Obernberger <
>>> joseph.obernber...@gmail.com> wrote:
>>>
>>> If I put the training data into its own collection and use q="*:*", then
>>>> it works correctly.  Is that a requirement?
>>>> Thank you.
>>>>
>>>> -Joe
>>>>
>>>>
>>>>
>>>> On 3/20/2017 3:47 PM, Joe Obernberger wrote:
>>>>
>>>>> I'm trying to build a model using tweets.  I've manually tagged 30
>>>>> tweets as threatening, and 50 random tweets as non-threatening.  When
>>>>> I build the model with:
>>>>>
>>>>> update(models2, batchSize="50",
>>>>>               train(UNCLASS,
>>>>>                        features(UNCLASS,
>>>>>                                       q="ProfileID:PROFCLUST1",
>>>>>                                       featureSet="threatFeatures3",
>>>>>                                       field="ClusterText",
>>>>>                                       outcome="out_i",
>>>>>                                       positiveLabel=1,
>>>>>                                       numTerms=250),
>>>>>                        q="ProfileID:PROFCLUST1",
>>>>>                        name="threatModel3",
>>>>>                        field="ClusterText",
>>>>>                        outcome="out_i",
>>>>>                        maxIterations="100"))
>>>>>
>>>>> It appears to work, but all the idfs_ds values are identical.  The
>>>>> terms_ss values look reasonable, but nearly all the weights_ds are
>>>>> 1.0.  out_i is -1 for non-threatening tweets and +1 for threatening
>>>>> tweets.  I'm trying to follow along with Joel Bernstein's excellent
>>>>> post here:
>>>>> http://joelsolr.blogspot.com/2017/01/deploying-ai-alerting-system-with-solrs.html
>>>>>
>>>>> Tips?
>>>>>
>>>>> Thank you!
>>>>>
>>>>> -Joe
>>>>>
>>>>>
>>>>>
>
