Wait. I thought a "feature" is an abstract concept for clumps of
"meaning" that are found by analyzing the set of "feature vectors"
described above.

On Sun, May 22, 2011 at 12:04 PM, Em <[email protected]> wrote:
> Thank you Ted,
>
> your explanations really helped.
>
> Regards,
> Em
>
> Am 22.05.2011 19:43, schrieb Ted Dunning:
>> On Sun, May 22, 2011 at 10:32 AM, Em <[email protected]> wrote:
>>
>>> So, let's say I have a descriptive text of 100-200 words (text-like).
>>> Does this mean that I have one feature (the description) or does it mean
>>> that I have 100-200 features (the words)?
>>>
>>
>> There is a bit of confusion because the term feature can be used at two
>> points in the process.
>>
>> At raw data level, you have one feature that is text-like.
>>
>> You have to encode this feature, however, as a numerical vector.  You can do
>> that in a number of ways, but you can't encode text-like data into a single
>> numerical value.  You need to use lots of numerical values to encode it.
>>  That can be done with a dictionary encoding, where every possible word
>> gets its own numerical position, or with the hashed encoding, where you
>> pick the number of numerical values up front and the hashing encoder maps
>> your data into that many positions.
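As an illustration of the hashing idea, here is a minimal sketch in plain Java. This is hypothetical code, not Mahout's encoder API; the class and method names are invented here. It shows how arbitrary vocabulary can be squeezed into a pre-chosen number of numerical values:

```java
import java.util.Arrays;

// Hypothetical sketch (not Mahout's API): hash each word to a slot in a
// fixed-size vector and add 1 there, so any vocabulary fits into a
// pre-chosen number of numerical values.
public class HashedEncoderSketch {
    static double[] encode(String text, int numFeatures) {
        double[] vector = new double[numFeatures];
        for (String word : text.toLowerCase().split("\\s+")) {
            // Math.floorMod keeps the index non-negative even when
            // hashCode() is negative.
            int slot = Math.floorMod(word.hashCode(), numFeatures);
            vector[slot] += 1.0;
        }
        return vector;
    }

    public static void main(String[] args) {
        double[] v = encode("the quick brown fox jumps over the lazy dog", 1000);
        // 9 word occurrences land somewhere among the 1000 slots.
        System.out.println(Arrays.stream(v).sum()); // prints 9.0
    }
}
```

Different words can collide in the same slot, which is the price of fixing the vector size in advance; in practice a large enough size makes collisions rare.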
>>
>> After you encode the data, you are left with a typically sparse Vector.  The
>> learning algorithm never sees your original data, just this Vector.
>>
>> So, from the viewpoint of the learning algorithm, each element of this
>> Vector is a feature.
>>
>> Unfortunately, this dual use of nomenclature is completely widespread when
>> people describe supervised machine learning such as the classifiers in
>> Mahout.
>>
>>
>>
>>> The OnlineLogisticRegression-class requires me to tell it how many
>>> categories are there and how many features I like to provide.
>>>
>>
>> Categories refer to the target variable.  You have to say how many possible
>> values of the target there are.
>>
>> The number of features given here is *after* encoding.  Your text variable
>> would probably be encoded into a Vector of size 10,000-1,000,000, so that
>> size is what you should give the OnlineLogisticRegression.
>>
>>
>>> My question now is, if I have a categorical and a text-like feature, do
>>> I have to tell the class that I am going to add two features?
>>>
>>
>> With the hashed encoding what you would do is create two encoders with
>> different types and names.  Pick an output vector size that is pretty big
>> (100,000 should do).  Then use each encoder with the corresponding data.
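A minimal sketch of that idea: prefixing each value with its encoder's name before hashing lets a categorical feature and a text-like feature share one big vector without systematically colliding. This is hypothetical code, not Mahout's FeatureVectorEncoder classes; the names here are invented:

```java
// Hypothetical sketch (not Mahout's encoder classes): the encoder name is
// mixed into the hash so that, e.g., the word "red" in a description and
// the category "red" land in different slots.
public class TwoEncoderSketch {
    static void addToVector(String encoderName, String value, double[] vector) {
        int slot = Math.floorMod((encoderName + ":" + value).hashCode(),
                                 vector.length);
        vector[slot] += 1.0;
    }

    public static void main(String[] args) {
        double[] vector = new double[100_000]; // pick a big output size
        addToVector("color", "red", vector);           // categorical feature
        for (String word : "a short description".split(" ")) {
            addToVector("description", word, vector);  // text-like feature
        }
    }
}
```

Both encoders write into the same vector, so the learning algorithm still sees a single input of the size you picked.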
>>
>>
>>>
>>> What happens if I encode 20 different features into the vector but
>>> misconfigure the algorithm so that I tell it there are only 10
>>>
>>
>> You would have 20 different encoders and some sized Vector.
>>
>> If you give the learning algorithm a wrong-sized Vector, it should
>> immediately complain.  If it doesn't or if it doesn't complain clearly with
>> a good message, file a bug.
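A sketch of the kind of check meant here (hypothetical guard code, not Mahout's actual implementation): a learner configured for a given number of features should reject a vector of any other size with a clear message rather than silently mis-train.

```java
// Hypothetical guard (not Mahout code): reject wrong-sized input vectors
// immediately, with a message that says what was expected.
public class SizeCheckSketch {
    final int numFeatures;

    SizeCheckSketch(int numFeatures) { this.numFeatures = numFeatures; }

    void train(double[] instance) {
        if (instance.length != numFeatures) {
            throw new IllegalArgumentException(
                "Expected vector of size " + numFeatures
                + " but got " + instance.length);
        }
        // ... actual learning would happen here ...
    }

    public static void main(String[] args) {
        SizeCheckSketch learner = new SizeCheckSketch(10);
        try {
            learner.train(new double[20]); // 20 encoded slots, model expects 10
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```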
>>
>>> features? I miss a formula or something like that for
>>> the algorithms that are part of Mahout. This would make understanding
>>> the different parameters easier, I think.
>>>
>>
>> I think that this is genuinely confusing.  Keep going in the book.  The next
>> chapters go into more detail on this process.
>>
>



-- 
Lance Norskog
[email protected]
