Re: MLlib OneVsRest causing intermittent exceptions

2016-01-27 Thread David Brooks
Hi Ram, Yes, I completely agree. An exception is a poor way to handle this case, and training on a dataset containing only zero labels and no one labels should simply work without exceptions. Fortunately, it looks like someone else has recently patched the problem with LogisticRegression:

Re: MLlib OneVsRest causing intermittent exceptions

2016-01-26 Thread David Brooks
Hi Ram, I didn't include an explicit label column in my reproduction as I thought it superfluous. However, in my original use-case, I was using a StringIndexer, where the labels were indexed across the entire dataset (training+validation+test). The (indexed) label column was then explicitly

Re: MLlib OneVsRest causing intermittent exceptions

2016-01-26 Thread David Brooks
Hi again Ram, Sorry, I was too hasty in my previous response. I've done a bit more digging through the code, and StringIndexer does indeed provide metadata, as a NominalAttribute with a known number of class labels. I don't think the issue is related to the use of metadata, however. It seems

Re: MLlib OneVsRest causing intermittent exceptions

2016-01-26 Thread Ram Sriharsha
Hey David, Yeah, absolutely! Feel free to create a JIRA and attach your patch to it. We can help review it and pull in the fix; happy to accept contributions! CCing Joseph, who is one of the maintainers of MLlib as well. When creating the JIRA, can you attach a simple test case? On Tue, Jan 26,

Re: MLlib OneVsRest causing intermittent exceptions

2016-01-26 Thread Ram Sriharsha
Hi David, If I am reading the email right, there are two problems here, right? (a) For rare classes, the random split will likely miss the rare class. (b) If it misses the rare class, an exception is thrown. I thought the exception stems from (b), is that right? I wouldn't expect an exception to be

Re: MLlib OneVsRest causing intermittent exceptions

2016-01-26 Thread David Brooks
Hi Ram, Joseph, That's right, but I will clarify: (a) a random split can generate a training set that does not contain some rare class; (b) when LogisticRegression is run over a dataframe where all instances have the same class label, it throws an ArrayIndexOutOfBoundsException. When (a) occurs,
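
Point (a) can be illustrated without Spark. The following is a minimal sketch in plain Python, not the MLlib code path: the helper names (`k_fold_indices`, `missing_classes_per_fold`) and the toy label list are hypothetical, chosen only to show that a K-fold split over data with a rare class leaves some training set without that class.

```python
import random


def k_fold_indices(n, k, seed):
    """Shuffle the indices 0..n-1 and deal them into k folds of near-equal size."""
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]


def missing_classes_per_fold(labels, k, seed):
    """For each fold held out for validation, return the set of classes
    that do not appear anywhere in the remaining training portion."""
    all_classes = set(labels)
    result = []
    for held_out in k_fold_indices(len(labels), k, seed):
        held = set(held_out)
        train_classes = {
            label for i, label in enumerate(labels) if i not in held
        }
        result.append(all_classes - train_classes)
    return result


# Hypothetical toy data: three classes, where class 2 occurs exactly once.
labels = [0] * 50 + [1] * 49 + [2]
missing = missing_classes_per_fold(labels, k=5, seed=0)

for fold_number, absent in enumerate(missing):
    print(f"fold {fold_number}: classes absent from training set = {sorted(absent)}")
```

Because class 2 has a single instance, whichever fold holds that instance out produces a training set with no examples of it. With a few more instances of the rare class the gap would appear only for some shuffles, which is consistent with the intermittent behaviour described in this thread.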

Re: MLlib OneVsRest causing intermittent exceptions

2016-01-26 Thread Ram Sriharsha
Hey David, In your scenario, OneVsRest is training a classifier for 1 vs. not 1, and the input dataset for fit (or train) has labeled data for label 1. But the underlying binary classifier (LogisticRegression) uses sampling to determine the subset of data used during each iteration and it is

Re: MLlib OneVsRest causing intermittent exceptions

2016-01-26 Thread Ram Sriharsha
By the way, OneVsRest uses the labels in the dataset that is fed to the fit method if the metadata is missing. So if the metadata contains a label, we expect that label to be present in the dataset passed to the fit method. If you want OneVsRest to compute the labels you can leave the label

Re: MLlib OneVsRest causing intermittent exceptions

2016-01-25 Thread Ram Sriharsha
Hi David What happens if you provide the class labels via metadata instead of letting OneVsRest determine the labels? Ram On Mon, Jan 25, 2016 at 3:06 PM, David Brooks wrote: > Hi, > > I've run into an exception using MLlib OneVsRest with logistic regression > (v1.6.0, but

MLlib OneVsRest causing intermittent exceptions

2016-01-25 Thread David Brooks
Hi, I've run into an exception using MLlib OneVsRest with logistic regression (v1.6.0, but also in previous versions). The issue is intermittent. When running multiclass classification with K-fold cross validation, there are scenarios where the split does not contain instances for every target