Re: General question on using StringIndexer in SparkML

Vishnu Viswanath Tue, 01 Dec 2015 22:32:09 -0800

Hi Jeff,

I went through the link you provided and I now understand how fit() and
transform() work.
I tried to use the pipeline in my code and I am getting this exception:
Caused by: org.apache.spark.SparkException: Unseen label:


The reason for this error, as I understand it, is:
For the column on which I am doing StringIndexing, the test data has
values that were not present in the train data.
Since fit() is done only on the train data, the indexing fails.
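
To make that failure mode concrete, here is a minimal plain-Python sketch of
the fit/transform split. It is not Spark code and does not claim to mirror
Spark's implementation: `fit_string_index` and `transform` are hypothetical
helper names, and the alphabetical tie-break is an assumption added only to
keep the output deterministic.

```python
from collections import Counter


def fit_string_index(labels):
    """Build a label -> index map; the most frequent label gets 0.0."""
    counts = Counter(labels)
    # Assumption: ties are broken alphabetically so the result is stable.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


def transform(labels, mapping):
    """Apply an already-fitted map; fail on labels it has never seen."""
    out = []
    for label in labels:
        if label not in mapping:
            raise ValueError("Unseen label: " + label)
        out.append(mapping[label])
    return out


train = ["A", "A", "A", "B", "B", "C"]
test = ["C", "C", "B", "D"]  # "D" never appeared in the training data

mapping = fit_string_index(train)                # fitted once, on train only
seen_only = transform(["C", "C", "B"], mapping)  # reuses the train mapping

try:
    transform(test, mapping)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

The mapping comes from the train data alone, so labels seen during training
keep their indexes on new data, while a label such as "D" has no entry to look
up.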

Can you suggest what can be done in this situation?
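
For context, the usual choices for an unseen label are to fail, to drop the
affected rows, or to map every unseen label to one extra index. Spark's
StringIndexer has a `handleInvalid` parameter for this in newer releases;
whether it is available, and which values it accepts, depends on the Spark
version in use, so that should be checked against the documentation. The
plain-Python sketch below only illustrates the three behaviors. The function
`index_with_policy` and its `policy` argument are hypothetical, not Spark API.

```python
def index_with_policy(labels, mapping, policy="error"):
    """Index labels with a fitted map, handling unseen labels by policy.

    policy="error": raise on the first unseen label.
    policy="skip":  drop entries whose label is unseen.
    policy="keep":  send every unseen label to one extra index.
    """
    unseen_index = float(len(mapping))  # first index not used by the map
    out = []
    for label in labels:
        if label in mapping:
            out.append(mapping[label])
        elif policy == "skip":
            continue
        elif policy == "keep":
            out.append(unseen_index)
        else:
            raise ValueError("Unseen label: " + label)
    return out


mapping = {"A": 0.0, "B": 1.0, "C": 2.0}  # as fitted on the train data
test = ["C", "D", "A"]                    # "D" is unseen

skipped = index_with_policy(test, mapping, policy="skip")
kept = index_with_policy(test, mapping, policy="keep")
```

Dropping rows means no prediction is made for them, while the extra index
gives the downstream model a category it never saw during training. Which
trade-off is acceptable depends on the application.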

Thanks,

On Mon, Nov 30, 2015 at 12:32 AM, Vishnu Viswanath <
vishnu.viswanat...@gmail.com> wrote:

> Thank you Jeff.
>
> On Sun, Nov 29, 2015 at 7:36 PM, Jeff Zhang <zjf...@gmail.com> wrote:
>
>> StringIndexer is an estimator: fitting it trains a model that is used in
>> both training and prediction, so the indexing is consistent between the two.
>>
>> You may want to read this section of the Spark ML documentation:
>> http://spark.apache.org/docs/latest/ml-guide.html#how-it-works
>>
>>
>>
>> On Mon, Nov 30, 2015 at 12:52 AM, Vishnu Viswanath <
>> vishnu.viswanat...@gmail.com> wrote:
>>
>>> Thanks for the reply Yanbo.
>>>
>>> I understand that the model will be trained using the indexer map
>>> created during the training stage.
>>>
>>> But I get a new set of data during prediction, and I have to do
>>> StringIndexing on the new data as well.
>>> Right now I am using a new StringIndexer for this purpose. Is there
>>> any way I can reuse the indexer from the training stage?
>>>
>>> Note: I have a pipeline with a StringIndexer in it, and I fit my train
>>> data in it to build the model. Later, when I get the new data for
>>> prediction, I use the same pipeline to fit the data again and do the
>>> prediction.
>>>
>>> Thanks and Regards,
>>> Vishnu Viswanath
>>>
>>>
>>> On Sun, Nov 29, 2015 at 8:14 AM, Yanbo Liang <yblia...@gmail.com> wrote:
>>>
>>>> Hi Vishnu,
>>>>
>>>> The string-to-index map is generated at the model training step and
>>>> used at the model prediction step.
>>>> This means the string-to-index map will not change during prediction.
>>>> You will use the originally trained model when you do prediction.
>>>>
>>>> 2015-11-29 4:33 GMT+08:00 Vishnu Viswanath <
>>>> vishnu.viswanat...@gmail.com>:
>>>> > Hi All,
>>>> >
>>>> > I have a general question on using StringIndexer.
>>>> > StringIndexer gives an index to each label in the feature, starting
>>>> > from 0 (0 for the most frequent word).
>>>> >
>>>> > Suppose I am building a model, and I use StringIndexer for
>>>> > transforming one of my columns.
>>>> > E.g., suppose A was the most frequent word, followed by B and C.
>>>> >
>>>> > So the StringIndexer will generate
>>>> >
>>>> > A  0.0
>>>> > B  1.0
>>>> > C  2.0
>>>> >
>>>> > After building the model, I am going to do some prediction using this
>>>> > model, so I do the same transformation on the new data that I need to
>>>> > predict on. Suppose the new dataset has C as the most frequent word,
>>>> > followed by B and A. Then the StringIndexer will assign the indexes as
>>>> >
>>>> > C 0.0
>>>> > B 1.0
>>>> > A 2.0
>>>> >
>>>> > These indexes are different from what we used for modeling. So won’t
>>>> > this give me a wrong prediction if I use StringIndexer?
>>>> >
>>>> >
>>>>
>>>
>>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>
>
>
>
