Hi Ram,

Yes, I completely agree.  An exception is a poor way to handle this case, and
training on a dataset that contains only 0 labels and no 1 labels should
simply work without exceptions.

Fortunately, it looks like someone else has recently patched the problem
with LogisticRegression:


https://github.com/apache/spark/commit/2388de51912efccaceeb663ac56fc500a79d2ceb

This should resolve the issue I'm experiencing.  I'll get hold of a build
from source and try it out.

Thanks for all your help!

David

On Wed, Jan 27, 2016 at 12:51 AM Ram Sriharsha <sriharsha....@gmail.com>
wrote:

> btw, OneVsRest is using the labels in the dataset that is fed to the fit
> method, in case the metadata is missing.
> So if the metadata contains a label, we expect that label to be present in
> the dataset passed to the fit method.
> If you want OneVsRest to compute the labels you can leave the label
> metadata empty in which case we first compute the # of
> labels in the training dataset.
>
> If the training dataset contains a given label, then logistic regression
> should work fine regardless of the rarity of that label (performance might
> be bad but it won't throw an exception afaik)
>
> if the training dataset does not contain a given label but the metadata
> does, then we do end up training classifiers which will never see that
> label.
> But even here, what gets passed to the underlying classifier is a dataset
> containing only, say, 0 labels and no 1 labels.
> A classifier should be able to handle this... but if it cannot for some
> reason, we can add a check in OneVsRest that skips training that classifier.
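The relabeling and the check described above can be sketched in plain Scala, without Spark. This is only an illustration under stated assumptions: `binarize` and `hasBothClasses` are hypothetical helper names, not functions in OneVsRest, and the labels are held in an ordinary Seq rather than a DataFrame.

```scala
object BinarizeLabels {
  // One-vs-rest relabeling: the target class becomes 1.0, everything else 0.0.
  def binarize(labels: Seq[Int], target: Int): Seq[Double] =
    labels.map(l => if (l == target) 1.0 else 0.0)

  // Hypothetical guard: a binary problem is only worth training
  // if both 0.0 and 1.0 occur in the relabeled data.
  def hasBothClasses(binary: Seq[Double]): Boolean =
    binary.contains(1.0) && binary.contains(0.0)

  def main(args: Array[String]): Unit = {
    // Class 1 is in the metadata (numClasses = 4) but absent from the data.
    val training = Seq(0, 2, 3, 0, 2)
    for (k <- 0 until 4) {
      val binary = binarize(training, k)
      println(s"class $k: trainable = ${hasBothClasses(binary)}")
    }
  }
}
```

With a guard of this kind, the classifier for the absent class would simply be skipped instead of being trained on all-zero labels.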
>
> On Tue, Jan 26, 2016 at 4:33 PM, Ram Sriharsha <sriharsha....@gmail.com>
> wrote:
>
>> Hey David
>>
>> In your scenario, OneVsRest is training a classifier for 1 vs not 1...
>> and the input dataset for fit (or train) has labeled data for label 1
>>
>> But the underlying binary classifier (LogisticRegression) samples a subset
>> of the data during each iteration, and it is possible that this sample
>> does not include any examples with label 1 (i.e. numClasses = 1).
>>
>> So the examples it selects in that iteration only include 0 labeled data
>> and nothing with label 1.
>>
>> But why should it throw an exception? If it does, then I would think we
>> need to fix the issue in the underlying algorithm, rather than have the
>> reduction somehow know that the binary classifier is sampling from the
>> training dataset.
>>
>> Or am I misunderstanding the issue here?
>>
>> I'll take a look at the gist you linked when I get a chance, thanks!
>>
>> Ram
>>
>> On Tue, Jan 26, 2016 at 4:06 PM, David Brooks <da...@whisk.co.uk> wrote:
>>
>>> Hi Ram, Joseph,
>>>
>>> That's right, but I will clarify:
>>>
>>> (a) a random split can generate a training set that does not contain
>>> some rare class
>>> (b) when LogisticRegression is run over a dataframe where all instances
>>> have the same class label, it throws an ArrayIndexOutOfBoundsException.
>>>
>>> When (a) occurs, (b) is the consequence.  The rare class is missing from
>>> the training set, so you would not expect OneVsRest to train a binary
>>> classifier on it; however, because OneVsRest trains binary classifiers on
>>> all class labels in the range (0 until numClasses), it *will* train a
>>> binary classifier on the missing class, which leads to the exception from
>>> (b).
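How often (a) happens can be estimated with a short calculation. The sketch below assumes each row lands in the training split independently with probability `trainFraction`; that independence is an assumption about the splitter, and `missProbability` is a hypothetical helper name rather than anything in Spark.

```scala
object RareClassOdds {
  // Probability that none of `rareCount` rows of a rare class lands in the
  // training split, assuming each row is kept independently with
  // probability `trainFraction`.
  def missProbability(trainFraction: Double, rareCount: Int): Double =
    math.pow(1.0 - trainFraction, rareCount.toDouble)

  def main(args: Array[String]): Unit = {
    for (n <- 1 to 3)
      println(f"rare rows = $n, P(missing from training) = ${missProbability(0.75, n)}%.4f")
  }
}
```

Under that assumption, a class with a single example is missing from a 75% training split a quarter of the time, so the failure is intermittent rather than guaranteed.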
>>>
>>> A concrete example:
>>>
>>>    - class labels 0, 1, 2, 3 are present in dataset (*numClasses* = 4);
>>>    - 0, 2, 3 are in the training set after random split (no *1*);
>>>    - The range (0 until 4) is used to train binary classifiers on each of
>>>    0, *1*, 2, 3
>>>    - As soon as the classifier is trained on *1*, the exception is
>>>    thrown
>>>
>>> I'd suggest:
>>>
>>>    1. In LogisticRegression, where numClasses == 1, throw a more
>>>    meaningful validation exception (avoiding the more cryptic
>>>    ArrayIndexOutOfBoundsException)
>>>    2. Only run OneVsRest for class labels that appear in the dataframe,
>>>    rather than all labels in the Range(0, numClasses).
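The second suggestion can be illustrated in plain Scala, without Spark, by comparing the labels implied by the metadata range with the labels actually present in the training data. The names `rangeLabels` and `observedLabels` are hypothetical and exist only for this sketch; it is not the patch itself.

```scala
object ObservedLabels {
  // Labels implied by metadata: every index in [0, numClasses).
  def rangeLabels(numClasses: Int): Seq[Int] = 0 until numClasses

  // Labels that actually occur in the training data, in ascending order.
  def observedLabels(trainingLabels: Seq[Int]): Seq[Int] =
    trainingLabels.distinct.sorted

  def main(args: Array[String]): Unit = {
    val numClasses = 4
    val training = Seq(0, 2, 3, 0, 2) // class 1 was lost in the split
    val missing = rangeLabels(numClasses).diff(observedLabels(training))
    println(s"labels with no training examples: ${missing.mkString(", ")}")
  }
}
```

Iterating over `observedLabels` instead of `rangeLabels` would mean no binary classifier is ever requested for a class with zero training examples.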
>>>
>>> I created a few simple test cases for running from SBT, like this one
>>> <https://github.com/junglebarry/SparkOneVsRestTest/blob/master/src/main/scala/SparkOneVsRestTest_2_Errors.scala>,
>>> but I've turned them into gists now for spark-shell:
>>>
>>>    - LogisticRegression throwing ArrayIndexOutOfBoundsException
>>>    <https://gist.github.com/junglebarry/a7cedce6eaf978d7b9ee>
>>>    - OneVsRest throwing ArrayIndexOutOfBoundsException
>>>    <https://gist.github.com/junglebarry/66234edfebaad6254ebe> (with a
>>>    simulated missing class from a Range)
>>>    - OneVsRest throwing ArrayIndexOutOfBoundsException with random split
>>>    <https://gist.github.com/junglebarry/6073aa474d89f3322063>.  This one
>>>    throws in only about 2/3 of runs, due to randomness.
>>>
>>> If these look good as test cases, I'll take a look at filing JIRAs and
>>> getting patches tomorrow morning.  It's late here!
>>>
>>> Thanks for the swift response,
>>> David
>>>
>>>
>>> On Tue, Jan 26, 2016 at 11:09 PM Ram Sriharsha <sriharsha....@gmail.com>
>>> wrote:
>>>
>>>> Hi David
>>>>
>>>> If I am reading the email right, there are two problems here, right?
>>>> a) for rare classes, the random split will likely miss the rare class.
>>>> b) if it misses the rare class, an exception is thrown.
>>>>
>>>> I thought the exception stems from b), is that right? I wouldn't
>>>> expect an exception to be thrown when the training dataset is
>>>> missing the rare class.
>>>> Could you reproduce this in a simple snippet of code that we can
>>>> quickly test on the shell?
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Jan 26, 2016 at 3:02 PM, Ram Sriharsha <sriharsha....@gmail.com>
>>>> wrote:
>>>>
>>>>> Hey David, yeah, absolutely! Feel free to create a JIRA and attach
>>>>> your patch to it. We can help review it and pull in the fix... happy to
>>>>> accept contributions!
>>>>> Cc'ing Joseph, who is one of the maintainers of MLlib as well. When
>>>>> creating the JIRA, can you attach a simple test case?
>>>>>
>>>>> On Tue, Jan 26, 2016 at 2:59 PM, David Brooks <da...@whisk.co.uk>
>>>>> wrote:
>>>>>
>>>>>> Hi again Ram,
>>>>>>
>>>>>> Sorry, I was too hasty in my previous response.  I've done a bit more
>>>>>> digging through the code, and StringIndexer does indeed provide metadata,
>>>>>> as a NominalAttribute with a known number of class labels.  I don't think
>>>>>> the issue is related to the use of metadata, however.
>>>>>>
>>>>>> It seems to me to be caused by the interaction between OneVsRest and
>>>>>> TrainValidationSplit.  For rare target classes under OneVsRest, it seems
>>>>>> quite possible for this random-split approach to select a training subset
>>>>>> where all items belong to non-target classes - all of which are given the
>>>>>> same class label by OneVsRest.  In this case, we start training
>>>>>> LogisticRegression on data of a single class, which seems odd.  The
>>>>>> exception stems from there.
>>>>>>
>>>>>> The cause looks to me to be that OneVsRest.fit runs binary
>>>>>> classifications from 0 to numClasses (OneVsRest.scala:209), and this
>>>>>> seems incompatible with the random split, which cannot guarantee
>>>>>> training examples for all labels in the range.  It might be preferable
>>>>>> to iterate over the observed labels in the training set, rather than
>>>>>> all labels in the range.  I don't know the performance effects of that
>>>>>> change, but it does look incompatible with using the label metadata as
>>>>>> a shortcut.
>>>>>>
>>>>>> Do you agree that there is an issue here?  Would you accept
>>>>>> contributions to the code to remedy it?  I'd gladly take a look if I
>>>>>> can be of help.
>>>>>>
>>>>>> Many thanks,
>>>>>> David
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Jan 26, 2016 at 1:29 PM David Brooks <da...@whisk.co.uk>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Ram,
>>>>>>>
>>>>>>> I didn't include an explicit label column in my reproduction as I
>>>>>>> thought it superfluous.  However, in my original use-case, I was using
>>>>>>> a StringIndexer, where the labels were indexed across the entire
>>>>>>> dataset (training+validation+test).  The (indexed) label column was
>>>>>>> then explicitly provided to the OneVsRest instance.
>>>>>>>
>>>>>>> Here's the abridged version:
>>>>>>>
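>>>>>>> import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}
>>>>>>> import org.apache.spark.ml.feature.StringIndexer
>>>>>>> import org.apache.spark.sql.DataFrame
>>>>>>>
>>>>>>> val textDocuments: DataFrame = ??? // real data here
>>>>>>>
>>>>>>> // Index labels, adding metadata to the label column.
>>>>>>> // Fit on whole dataset to include all labels in index.
>>>>>>> val labelIndexer = new StringIndexer()
>>>>>>>   .setInputCol("label")
>>>>>>>   .setOutputCol("labelIndexed")
>>>>>>>   .fit(textDocuments)
>>>>>>>
>>>>>>> val lrClassifier = new LogisticRegression()
>>>>>>>
>>>>>>> // Point OneVsRest at the indexed label column produced above.
>>>>>>> val classifier = new OneVsRest()
>>>>>>>   .setClassifier(lrClassifier)
>>>>>>>   .setLabelCol(labelIndexer.getOutputCol)
>>>>>>>
>>>>>>> // ...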
>>>>>>>
>>>>>>>
>>>>>>> There's an explicit reference to the label column, and when created,
>>>>>>> that column contains all possible values of the label (it's `fit` over
>>>>>>> all data).  It looks to me like StringIndexer computes label metadata
>>>>>>> at that point (in `transform`) and attaches it to the column.  This
>>>>>>> way, I'd hope that even once TrainValidationSplit returns a subset
>>>>>>> dataframe - which may not contain all labels - the metadata on the
>>>>>>> column should still contain all labels.
>>>>>>>
>>>>>>> Does my use of StringIndexer count as "metadata", here?  If so, I
>>>>>>> still see the exception as before.
>>>>>>>
>>>>>>> I've pushed a new example using StringIndexer to my earlier repo, so
>>>>>>> you can see the code and issue.  I'm happy to try a simpler method for
>>>>>>> providing column metadata, if one is available.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> David
>>>>>>>
>>>>>>> On Mon, Jan 25, 2016 at 11:13 PM Ram Sriharsha <
>>>>>>> sriharsha....@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi David
>>>>>>>>
>>>>>>>> What happens if you provide the class labels via metadata instead
>>>>>>>> of letting OneVsRest determine the labels?
>>>>>>>>
>>>>>>>> Ram
>>>>>>>>
>>>>>>>> On Mon, Jan 25, 2016 at 3:06 PM, David Brooks <da...@whisk.co.uk>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I've run into an exception using MLlib OneVsRest with logistic
>>>>>>>>> regression (v1.6.0, but also in previous versions).
>>>>>>>>>
>>>>>>>>> The issue is intermittent.  When running multiclass classification
>>>>>>>>> with K-fold cross validation, there are scenarios where the split
>>>>>>>>> does not contain instances for every target label.  In such cases,
>>>>>>>>> an ArrayIndexOutOfBoundsException is generated.
>>>>>>>>>
>>>>>>>>> I've tried to reproduce the problem in a simple SBT project here:
>>>>>>>>>
>>>>>>>>>    https://github.com/junglebarry/SparkOneVsRestTest
>>>>>>>>>
>>>>>>>>> I don't imagine this is typical - it first surfaced when running
>>>>>>>>> over a dataset with some very rare classes.
>>>>>>>>>
>>>>>>>>> I'm happy to look into patching the code, but I first wanted to
>>>>>>>>> confirm that the problem was real, and that I wasn't somehow
>>>>>>>>> misunderstanding how I should be using OneVsRest.
>>>>>>>>>
>>>>>>>>> Any guidance would be appreciated - I'm new to the list.
>>>>>>>>>
>>>>>>>>> Many thanks,
>>>>>>>>> David
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ram Sriharsha
>>>>>>>> Architect, Spark and Data Science
>>>>>>>> Hortonworks, 2550 Great America Way, 2nd Floor
>>>>>>>> Santa Clara, CA 95054
>>>>>>>> Ph: 408-510-8635
>>>>>>>> email: har...@apache.org
>>>>>>>>
>>>>>>>> [image: https://www.linkedin.com/in/harsha340]
>>>>>>>> <https://www.linkedin.com/in/harsha340>
>>>>>>>> <https://twitter.com/halfabrane> <https://github.com/harsha2010/>
>>>>>>>>
>>>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>
>>
>>
>>
>
>
>
>
