Hi Ram,

Yes, I completely agree. An exception is a poor way to handle this case, and training on a dataset that contains only 0 labels and no 1 labels should simply work without exceptions.
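To make that case concrete, here is a minimal sketch in plain Scala (no Spark) of what the one-vs-rest relabelling produces when the target class is missing from the split, and one way a classifier could cope with it. The object and method names are hypothetical, and this is not how LogisticRegression or any patch to it behaves; it only illustrates that the all-zero case has a sensible answer.

```scala
// Illustrative only: DegenerateLabels, binarize and constantIfSingleClass
// are hypothetical names, not part of Spark.
object DegenerateLabels {
  /** One-vs-rest relabelling: the target class becomes 1.0, everything else 0.0. */
  def binarize(labels: Seq[Double], target: Double): Seq[Double] =
    labels.map(l => if (l == target) 1.0 else 0.0)

  /**
   * If at most one distinct binary label is present, return a constant
   * prediction (the label seen, or 0.0 for empty input) instead of failing.
   * Return None when both labels are present and real training is needed.
   */
  def constantIfSingleClass(binaryLabels: Seq[Double]): Option[Double] = {
    val distinct = binaryLabels.distinct
    if (distinct.size <= 1) Some(distinct.headOption.getOrElse(0.0))
    else None
  }
}
```

With training labels 0, 2, 3 and target 1, `binarize` yields only 0.0 values, so `constantIfSingleClass` returns a constant prediction rather than attempting to fit.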
Fortunately, it looks like someone else has recently patched the problem with LogisticRegression:

https://github.com/apache/spark/commit/2388de51912efccaceeb663ac56fc500a79d2ceb

This should resolve the issue I'm experiencing. I'll get hold of a build from source and try it out.

Thanks for all your help!
David

On Wed, Jan 27, 2016 at 12:51 AM Ram Sriharsha <sriharsha....@gmail.com> wrote:

> btw, OneVsRest is using the labels in the dataset that is fed to the fit
> method, in case the metadata is missing.
> So if the metadata contains a label, we expect that label to be present in
> the dataset passed to the fit method.
> If you want OneVsRest to compute the labels you can leave the label
> metadata empty in which case we first compute the # of
> labels in the training dataset.
>
> If the training dataset contains a given label, then logistic regression
> should work fine regardless of the rarity of that label (performance might
> be bad but it won't throw an exception afaik)
>
> if the training dataset does not contain a given label but the metadata
> does, then we do end up training classifiers which will never see that
> label.
> But even here, what gets passed to the underlying classifier is a dataset
> with only say zero labels and no one labels.
> A classifier should be able to handle this... but if it cannot for some
> reason, we can have a check in OneVsRest that doesn't train that classifier
>
> On Tue, Jan 26, 2016 at 4:33 PM, Ram Sriharsha <sriharsha....@gmail.com>
> wrote:
>
>> Hey David
>>
>> In your scenario, OneVsRest is training a classifier for 1 vs not 1...
>> and the input dataset for fit (or train) has labeled data for label 1
>>
>> But the underlying binary classifier (LogisticRegression) uses sampling
>> to determine the subset of data to sample during each iteration and it is
>> possible that this sample does not include any examples with label 1 (ie
>> numClasses = 1)
>>
>> So the examples it selects in that iteration only include 0 labeled data
>> and nothing with label 1.
>>
>> But why should it throw an exception? if it does, then i would think we
>> need to fix the issue in the underlying algorithm instead of the
>> reduction somehow knowing that the binary classifier is sampling from the
>> training dataset.
>>
>> Or am I misunderstanding the issue here?
>>
>> I'll take a look at the gist you linked when i get a chance, thanks!
>>
>> Ram
>>
>> On Tue, Jan 26, 2016 at 4:06 PM, David Brooks <da...@whisk.co.uk> wrote:
>>
>>> Hi Ram, Joseph,
>>>
>>> That's right, but I will clarify:
>>>
>>> (a) a random split can generate a training set that does not contain
>>> some rare class
>>> (b) when LogisticRegression is run over a dataframe where all instances
>>> have the same class label, it throws an ArrayIndexOutOfBoundsException.
>>>
>>> When (a) occurs, (b) is the consequence. The rare class is missing from
>>> the training set, so you would not expect OneVsRest to train a binary
>>> classifier on it; however, because OneVsRest trains binary classifiers on
>>> all class labels in the range (0 to numClasses), it *will* train a
>>> binary classifier on the missing class, which leads to the exception from
>>> (b).
>>>
>>> A concrete example:
>>>
>>>    - class labels 0, 1, 2, 3 are present in dataset (*numClasses* = 4);
>>>    - 0, 2, 3 are in the training set after random split (no *1*);
>>>    - The range (0 to 4) is used to train binary classifiers on each of
>>>    0, *1*, 2, 3
>>>    - As soon as the classifier is trained on *1*, the exception is
>>>    thrown
>>>
>>> I'd suggest:
>>>
>>>    1. In LogisticRegression, where numClasses == 1, throw a more
>>>    meaningful validation exception (avoiding the more cryptic
>>>    ArrayIndexOutOfBoundsException)
>>>    2. Only run OneVsRest for class labels that appear in the dataframe,
>>>    rather than all labels in the Range(0, numClasses).
>>>
>>> I created a few simple test cases for running from SBT, like this one
>>> <https://github.com/junglebarry/SparkOneVsRestTest/blob/master/src/main/scala/SparkOneVsRestTest_2_Errors.scala>,
>>> but I've turned them into gists now for spark-shell:
>>>
>>>    - LogisticRegression throwing ArrayIndexOutOfBoundsException
>>>    <https://gist.github.com/junglebarry/a7cedce6eaf978d7b9ee>
>>>    - OneVsRest throwing ArrayIndexOutOfBoundsException
>>>    <https://gist.github.com/junglebarry/66234edfebaad6254ebe> (with a
>>>    simulated missing class from a Range)
>>>    - OneVsRest throwing ArrayIndexOutOfBoundsException with random split
>>>    <https://gist.github.com/junglebarry/6073aa474d89f3322063>. Only
>>>    exceptions in 2/3 of cases, due to randomness.
>>>
>>> If these look good as test cases, I'll take a look at filing JIRAs and
>>> getting patches tomorrow morning. It's late here!
>>>
>>> Thanks for the swift response,
>>> David
>>>
>>>
>>> On Tue, Jan 26, 2016 at 11:09 PM Ram Sriharsha <sriharsha....@gmail.com>
>>> wrote:
>>>
>>>> Hi David
>>>>
>>>> If I am reading the email right, there are two problems here right?
>>>> a) for rare classes the random split will likely miss the rare class.
>>>> b) if it misses the rare class an exception is thrown
>>>>
>>>> I thought the exception stems from b), is that right?... i wouldn't
>>>> expect an exception to be thrown in the case the training dataset is
>>>> missing the rare class.
>>>> could you reproduce this in a simple snippet of code that we can
>>>> quickly test on the shell?
>>>>
>>>> On Tue, Jan 26, 2016 at 3:02 PM, Ram Sriharsha <sriharsha....@gmail.com>
>>>> wrote:
>>>>
>>>>> Hey David, Yeah absolutely!, feel free to create a JIRA and attach
>>>>> your patch to it. We can help review it and pull in the fix... happy to
>>>>> accept contributions!
>>>>> ccing Joseph who is one of the maintainers of MLLib as well.. when
>>>>> creating the JIRA can you attach a simple test case?
>>>>>
>>>>> On Tue, Jan 26, 2016 at 2:59 PM, David Brooks <da...@whisk.co.uk>
>>>>> wrote:
>>>>>
>>>>>> Hi again Ram,
>>>>>>
>>>>>> Sorry, I was too hasty in my previous response. I've done a bit more
>>>>>> digging through the code, and StringIndexer does indeed provide metadata,
>>>>>> as a NominalAttribute with a known number of class labels. I don't think
>>>>>> the issue is related to the use of metadata, however.
>>>>>>
>>>>>> It seems to me to be caused by the interaction between OneVsRest and
>>>>>> TrainValidationSplit. For rare target classes under OneVsRest, it seems
>>>>>> quite possible for this random-split approach to select a training subset
>>>>>> where all items belong to non-target classes - all of which are given the
>>>>>> same class label by OneVsRest. In this case, we start training
>>>>>> LogisticRegression on data of a single class, which seems odd. The
>>>>>> exception stems from there.
>>>>>>
>>>>>> The cause looks to me to be that OneVsRest.fit runs binary
>>>>>> classifications from 0 to numClasses (OneVsRest.scala:209), and this seems
>>>>>> incompatible with the random split, which cannot guarantee training
>>>>>> examples for all labels in the range. It might be preferable to iterate
>>>>>> over the observed labels in the training set, rather than all labels in the
>>>>>> range. I don't know the performance effects of that change, but it does
>>>>>> look incompatible with using the label metadata as a shortcut.
>>>>>>
>>>>>> Do you agree that there is an issue here?
>>>>>> Would you accept contributions to the code to remedy it? I'd gladly
>>>>>> take a look if I can be of help.
>>>>>>
>>>>>> Many thanks,
>>>>>> David
>>>>>>
>>>>>>
>>>>>> On Tue, Jan 26, 2016 at 1:29 PM David Brooks <da...@whisk.co.uk>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Ram,
>>>>>>>
>>>>>>> I didn't include an explicit label column in my reproduction as I
>>>>>>> thought it superfluous. However, in my original use-case, I was using a
>>>>>>> StringIndexer, where the labels were indexed across the entire dataset
>>>>>>> (training+validation+test). The (indexed) label column was then explicitly
>>>>>>> provided to the OneVsRest instance.
>>>>>>>
>>>>>>> Here's the abridged version:
>>>>>>>
>>>>>>> val textDocuments = ??? // real data here
>>>>>>>
>>>>>>> // Index labels, adding metadata to the label column.
>>>>>>> // Fit on whole dataset to include all labels in index.
>>>>>>> val labelIndexer = new StringIndexer()
>>>>>>>   .setInputCol("label")
>>>>>>>   .setOutputCol("labelIndexed")
>>>>>>>   .fit(textDocuments)
>>>>>>>
>>>>>>> val lrClassifier = new LogisticRegression()
>>>>>>>
>>>>>>> val classifier = new OneVsRest()
>>>>>>>   .setClassifier(lrClassifier)
>>>>>>>   .setLabelCol(labelIndexer.getOutputCol)
>>>>>>>
>>>>>>> // ...
>>>>>>>
>>>>>>>
>>>>>>> There's an explicit reference to the label column, and when created,
>>>>>>> that column contains all possible values of the label (it's `fit` over all
>>>>>>> data). It looks to me like StringIndexer computes label metadata at that
>>>>>>> point (in `transform`) and attaches it to the column. This way, I'd hope
>>>>>>> that even once TrainValidationSplit returns a subset dataframe -
>>>>>>> which may not contain all labels - the metadata on the column
>>>>>>> should still contain all labels.
>>>>>>>
>>>>>>> Does my use of StringIndexer count as "metadata", here? If so, I
>>>>>>> still see the exception as before.
>>>>>>>
>>>>>>> I've pushed a new example using StringIndexer to my earlier repo, so
>>>>>>> you can see the code and issue. I'm happy to try a simpler method for
>>>>>>> providing column metadata, if one is available.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> David
>>>>>>>
>>>>>>> On Mon, Jan 25, 2016 at 11:13 PM Ram Sriharsha <sriharsha....@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi David
>>>>>>>>
>>>>>>>> What happens if you provide the class labels via metadata instead
>>>>>>>> of letting OneVsRest determine the labels?
>>>>>>>>
>>>>>>>> Ram
>>>>>>>>
>>>>>>>> On Mon, Jan 25, 2016 at 3:06 PM, David Brooks <da...@whisk.co.uk>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I've run into an exception using MLlib OneVsRest with logistic
>>>>>>>>> regression (v1.6.0, but also in previous versions).
>>>>>>>>>
>>>>>>>>> The issue is intermittent. When running multiclass classification
>>>>>>>>> with K-fold cross validation, there are scenarios where the split does not
>>>>>>>>> contain instances for every target label. In such cases, an
>>>>>>>>> ArrayIndexOutOfBoundsException is generated.
>>>>>>>>>
>>>>>>>>> I've tried to reproduce the problem in a simple SBT project here:
>>>>>>>>>
>>>>>>>>> https://github.com/junglebarry/SparkOneVsRestTest
>>>>>>>>>
>>>>>>>>> I don't imagine this is typical - it first surfaced when running
>>>>>>>>> over a dataset with some very rare classes.
>>>>>>>>>
>>>>>>>>> I'm happy to look into patching the code, but I first wanted to
>>>>>>>>> confirm that the problem was real, and that I wasn't somehow
>>>>>>>>> misunderstanding how I should be using OneVsRest.
>>>>>>>>>
>>>>>>>>> Any guidance would be appreciated - I'm new to the list.
>>>>>>>>>
>>>>>>>>> Many thanks,
>>>>>>>>> David
>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ram Sriharsha
>>>>>>>> Architect, Spark and Data Science
>>>>>>>> Hortonworks, 2550 Great America Way, 2nd Floor
>>>>>>>> Santa Clara, CA 95054
>>>>>>>> Ph: 408-510-8635
>>>>>>>> email: har...@apache.org
>>>>>>>>
>>>>>>>> [image: https://www.linkedin.com/in/harsha340]
>>>>>>>> <https://www.linkedin.com/in/harsha340>
>>>>>>>> <https://twitter.com/halfabrane> <https://github.com/harsha2010/>
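
For reference, the concrete example from the quoted discussion (labels 0 to 3, with 1 missing after the random split) can be sketched in plain Scala. This is only an illustration of the difference between iterating over the full label range and iterating over the labels actually observed; `ObservedLabels` and its methods are hypothetical names, and nothing here calls Spark or reflects how OneVsRest is actually implemented.

```scala
// Illustrative only: hypothetical helper, not Spark code.
object ObservedLabels {
  /** Labels a reduction would train on if it walks the whole index range. */
  def labelsFromRange(numClasses: Int): Seq[Int] = 0 until numClasses

  /** Labels a reduction would train on if it only uses what is in the split. */
  def labelsObserved(trainingLabels: Seq[Int]): Seq[Int] =
    trainingLabels.distinct.sorted

  /** Labels in the range that have no training examples at all. */
  def missingLabels(numClasses: Int, trainingLabels: Seq[Int]): Seq[Int] =
    labelsFromRange(numClasses).diff(labelsObserved(trainingLabels))
}
```

With `numClasses = 4` and a training split containing only 0, 2 and 3, `missingLabels` reports label 1: the class for which a range-based loop would still build a binary problem whose labels are all 0.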