The traditional meaning of "feature" in machine learning, as I understand
it, is an arbitrary piece of information about some object.  These
features are usually grouped by type into a feature vector, which
provides a uniform way to describe any object of the same class.

Daniel.

On Mon, May 23, 2011 at 5:00 PM, Lance Norskog <[email protected]> wrote:
> Wait. I thought a "feature" is an abstract concept for clumps of
> "meaning" that are found by analyzing the set of "feature vectors"
> described above.
>
> On Sun, May 22, 2011 at 12:04 PM, Em <[email protected]> wrote:
>> Thank you Ted,
>>
>> your explanations really helped.
>>
>> Regards,
>> Em
>>
>> On 22.05.2011 19:43, Ted Dunning wrote:
>>> On Sun, May 22, 2011 at 10:32 AM, Em <[email protected]> wrote:
>>>
>>>> So, let's say I have a descriptive text of 100-200 words (text-like).
>>>> Does this mean that I have one feature (the description), or does it
>>>> mean that I have 100-200 features (the words)?
>>>>
>>>
>>> There is a bit of confusion because the term "feature" can be used at two
>>> points in the process.
>>>
>>> At the raw data level, you have one feature that is text-like.
>>>
>>> You have to encode this feature, however, as a numerical vector.  You can do
>>> that in a number of ways, but you can't encode text-like data into a single
>>> numerical value.  You need to use lots of numerical values to encode it.
>>> That can be done by giving every possible word its own numerical value, or
>>> you can use the hashed encoding, where you pick the number of numerical
>>> values up front and the hashing encoder fits your data into that choice.
>>>
>>> After you encode the data, you are left with a typically sparse Vector.  The
>>> learning algorithm never sees your original data, just this Vector.
>>>
>>> So, from the viewpoint of the learning algorithm, each element of this
>>> Vector is a feature.
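>>>
>>> To make that concrete, here is a minimal sketch of the hashed encoding
>>> path (a sketch only, assuming the encoder classes in Mahout's
>>> org.apache.mahout.vectorizer.encoders package; the name "description"
>>> and the sample text are just placeholders):
>>>
>>>   import org.apache.mahout.math.RandomAccessSparseVector;
>>>   import org.apache.mahout.math.Vector;
>>>   import org.apache.mahout.vectorizer.encoders.TextValueEncoder;
>>>
>>>   // One raw feature: a text-like description.
>>>   TextValueEncoder encoder = new TextValueEncoder("description");
>>>
>>>   // Hashed encoding: we pick the number of numerical values up front,
>>>   // and the encoder hashes each word into one of those positions.
>>>   Vector v = new RandomAccessSparseVector(100000);
>>>   encoder.addToVector("a descriptive text of 100-200 words", v);
>>>
>>>   // The learning algorithm sees only v; each of its 100,000
>>>   // positions is a feature in the second sense of the word.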
>>>
>>> Unfortunately, this dual use of nomenclature is completely widespread when
>>> people describe supervised machine learning such as the classifiers in
>>> Mahout.
>>>
>>>
>>>
>>>> The OnlineLogisticRegression class requires me to tell it how many
>>>> categories there are and how many features I would like to provide.
>>>>
>>>
>>> Categories refer to the target variable.  You have to say how many possible
>>> values of the target there are.
>>>
>>> The number of features given here is *after* encoding.  Your text variable
>>> would probably be encoded into a Vector of size 10,000-1,000,000, so this
>>> size is what you should give to OnlineLogisticRegression.
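>>>
>>> In code that looks roughly like this (a sketch, assuming the SGD
>>> classes in org.apache.mahout.classifier.sgd; the sizes are just
>>> example numbers):
>>>
>>>   import org.apache.mahout.classifier.sgd.L1;
>>>   import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
>>>
>>>   int numCategories = 2;       // possible values of the target
>>>   int numFeatures = 100000;    // size of the *encoded* Vector
>>>   OnlineLogisticRegression learner =
>>>       new OnlineLogisticRegression(numCategories, numFeatures, new L1());
>>>
>>>   // Later, learner.train(targetValue, encodedVector) is called with
>>>   // Vectors of exactly numFeatures elements.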
>>>
>>>
>>>> My question now is: if I have a categorical and a text-like feature, do
>>>> I have to tell the class that I am going to add two features?
>>>>
>>>
>>> With the hashed encoding, what you would do is create two encoders with
>>> different types and names.  Pick an output vector size that is pretty big
>>> (100,000 should do).  Then use each encoder with the corresponding data.
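>>>
>>> A sketch of that pattern (the encoder types and names here are just an
>>> example, not the only possible choice):
>>>
>>>   import org.apache.mahout.math.RandomAccessSparseVector;
>>>   import org.apache.mahout.math.Vector;
>>>   import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;
>>>   import org.apache.mahout.vectorizer.encoders.TextValueEncoder;
>>>
>>>   // One encoder per raw feature, with different types and names so
>>>   // that their hashed locations don't collide systematically.
>>>   StaticWordValueEncoder categoryEncoder =
>>>       new StaticWordValueEncoder("category");
>>>   TextValueEncoder textEncoder = new TextValueEncoder("description");
>>>
>>>   // Both encoders write into the same, pretty big output vector.
>>>   Vector v = new RandomAccessSparseVector(100000);
>>>   categoryEncoder.addToVector("some-category-value", v);
>>>   textEncoder.addToVector("the descriptive text goes here", v);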
>>>
>>>
>>>>
>>>> What happens if I encode 20 different features into the vector but
>>>> misconfigured the algorithm in a way that I told it there were only 10
>>>>
>>>
>>> You would have 20 different encoders and a Vector of some size.
>>>
>>> If you give the learning algorithm a wrong-sized Vector, it should
>>> immediately complain.  If it doesn't, or if it doesn't complain clearly
>>> with a good message, file a bug.
>>>
>>>> features?  I somewhat miss a formula or something like that for the
>>>> algorithms that are part of Mahout.  This would make the different
>>>> parameters easier to understand, I think.
>>>>
>>>
>>> I think that this is genuinely confusing.  Keep going in the book.  The next
>>> chapters go into more detail on this process.
>>>
>>
>
>
>
> --
> Lance Norskog
> [email protected]
>
