Thanks Nick.
This Jira seems to have been stagnant for a while. Is there any update on when
this will be released?
On Mon, Aug 22, 2016 at 5:07 AM, Nick Pentreath
wrote:
> I believe it may be because of this issue (https://issues.apache.org/
> jira/browse/SPARK-13030). OHE is not an estimator - hence in c
I believe it may be because of this issue (
https://issues.apache.org/jira/browse/SPARK-13030). OHE is not an estimator
- hence in cases where the number of categories differ between train and
test, it's not usable in the current form.
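To illustrate the point above, here is a minimal sketch in plain Python (not Spark code; the helper names `fit_categories` and `one_hot` are hypothetical). It shows how deriving the encoding from each dataset separately can give vectors of different widths, while learning the category list once from the training data keeps them aligned. Mapping unseen values to an all-zero vector is an assumption of this sketch.

```python
def fit_categories(values):
    """Learn a fixed, ordered category list from a column of values."""
    return sorted(set(values))


def one_hot(value, categories):
    """Encode one value against a fixed category list.

    Values not in the list map to an all-zero vector
    (an assumption of this sketch).
    """
    return [1.0 if value == c else 0.0 for c in categories]


train = ["red", "green", "blue", "green"]
test = ["green", "red"]  # fewer distinct categories than train

# Deriving the encoding from each dataset on its own gives different widths.
naive_train_width = len(fit_categories(train))
naive_test_width = len(fit_categories(test))

# Fitting once on train and reusing the list keeps the widths equal.
categories = fit_categories(train)
train_vectors = [one_hot(v, categories) for v in train]
test_vectors = [one_hot(v, categories) for v in test]
```

With the category list fixed from the training column, every encoded row has the same length regardless of which categories happen to appear in the test data.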
It's tricky to work around, though one option is to use featur
Thanks Krishna for your response.
Features in the training set have more categories than in the test set, so when
VectorAssembler is used these numbers usually differ, and I believe that
is expected, right?
The test dataset usually will not have as many categories in its features as
Train is the belie
Hi,
Just after I sent the mail, I realized that the error might be with the
training-dataset not the test-dataset.
1. It might be that you are feeding the full Y vector for training.
2. That could mean you are using a ~50-50 training-test split.
3. Take a good look at the code that doe
Hi,
It looks like the test dataset has different sizes for X and Y. Possible steps:
1. What is the test data size?
   - If it is 15,909, check the prediction variable vector: it is now 29,471
     but should be 15,909.
   - If you expect it to be 29,471, then the X matrix is not right.
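The check described above can be sketched in plain Python. `check_sizes` is a hypothetical helper, and the placeholder rows only reuse the two sizes quoted in this message; the real data would come from your own test set.

```python
def check_sizes(features, labels):
    """Describe whether X (features) and Y (labels) have matching row counts."""
    n_x, n_y = len(features), len(labels)
    if n_x == n_y:
        return "ok: %d rows" % n_x
    return "size mismatch: X has %d rows, Y has %d" % (n_x, n_y)


# Placeholder data using the sizes mentioned in this thread.
features = [[0.0]] * 15909
labels = [0] * 29471
message = check_sizes(features, labels)
```

Running a check like this before prediction makes it clear which side (X or Y) has the unexpected length.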
Hi,
I have built a logistic regression model using the training dataset.
When I predict on the test dataset, it throws the size mismatch error
below.
Steps done:
1. String indexers on categorical features.
2. One hot encoding on these indexed features.
Any help is appreciated to resolv