On Mon, May 23, 2011 at 8:34 PM, Daniel McEnnis <[email protected]> wrote:
> The traditional meaning of "feature" in machine learning, as I understand
> it, is an arbitrary piece of information about some object.  These
> features are usually grouped by type into a feature vector, which
> provides a uniform way to describe any object of the same class.

Except when they aren't. Consider a sequence tagger: there are
features, but no vectors.


>
> Daniel.
>
> On Mon, May 23, 2011 at 5:00 PM, Lance Norskog <[email protected]> wrote:
>> Wait. I thought a "feature" was an abstract concept for clumps of
>> "meaning" that are found by analyzing the set of "feature vectors"
>> described above.
>>
>> On Sun, May 22, 2011 at 12:04 PM, Em <[email protected]> wrote:
>>> Thank you Ted,
>>>
>>> your explanations really helped.
>>>
>>> Regards,
>>> Em
>>>
>>> Am 22.05.2011 19:43, schrieb Ted Dunning:
>>>> On Sun, May 22, 2011 at 10:32 AM, Em <[email protected]> wrote:
>>>>
>>>>> So, let's say I have a descriptive text of 100-200 words (text-like).
>>>>> Does this mean that I have one feature (the description), or does it
>>>>> mean that I have 100-200 features (the words)?
>>>>>
>>>>
>>>> There is a bit of confusion because the term "feature" can be used at
>>>> two points in the process.
>>>>
>>>> At raw data level, you have one feature that is text-like.
>>>>
>>>> You have to encode this feature, however, as a numerical vector.  You
>>>> can do that in a number of ways, but you can't encode text-like data
>>>> into a single numerical value.  You need lots of numerical values to
>>>> encode it.  That can be done by giving every possible word its own
>>>> numerical value, or you can use the hashed encoding, where you pick the
>>>> number of numerical values up front and the hashing encoder fits your
>>>> data into that choice.
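>>>>
>>>> As a rough sketch of the hashed encoding (this assumes the encoder
>>>> classes in org.apache.mahout.vectorizer.encoders; the name
>>>> "description" and the size 100,000 are just placeholders):
>>>>
>>>>   import org.apache.mahout.math.RandomAccessSparseVector;
>>>>   import org.apache.mahout.math.Vector;
>>>>   import org.apache.mahout.vectorizer.encoders.TextValueEncoder;
>>>>
>>>>   public class TextEncodingSketch {
>>>>     public static void main(String[] args) {
>>>>       // Pick the number of numerical values up front.
>>>>       Vector v = new RandomAccessSparseVector(100000);
>>>>
>>>>       // The encoder hashes each word into positions of the Vector.
>>>>       TextValueEncoder enc = new TextValueEncoder("description");
>>>>       enc.addToVector("your descriptive text of 100-200 words", v);
>>>>
>>>>       // Each non-zero element of v is a feature to the learner.
>>>>       System.out.println(v.getNumNondefaultElements() + " non-zeros");
>>>>     }
>>>>   }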
>>>>
>>>> After you encode the data, you are left with a typically sparse Vector.
>>>> The learning algorithm never sees your original data, just this Vector.
>>>>
>>>> So, from the viewpoint of the learning algorithm, each element of this
>>>> Vector is a feature.
>>>>
>>>> Unfortunately, this dual use of nomenclature is completely widespread
>>>> when people describe supervised machine learning such as the
>>>> classifiers in Mahout.
>>>>
>>>>
>>>>
>>>>> The OnlineLogisticRegression class requires me to tell it how many
>>>>> categories there are and how many features I would like to provide.
>>>>>
>>>>
>>>> Categories refer to the target variable.  You have to say how many
>>>> possible values of the target there are.
>>>>
>>>> The number of features given here is *after* encoding.  Your text
>>>> variable would probably be encoded into a Vector of size
>>>> 10,000-1,000,000, so this size is what you should give the
>>>> OnlineLogisticRegression.
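>>>>
>>>> For example (a minimal sketch; 2 categories, a 100,000-element encoded
>>>> Vector, and the L1 prior are all just assumptions for illustration):
>>>>
>>>>   import org.apache.mahout.classifier.sgd.L1;
>>>>   import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
>>>>
>>>>   public class LearnerSketch {
>>>>     public static void main(String[] args) {
>>>>       int numCategories = 2;      // possible values of the target
>>>>       int numFeatures = 100000;   // size of the *encoded* Vector,
>>>>                                   // not the number of raw fields
>>>>       OnlineLogisticRegression learner =
>>>>           new OnlineLogisticRegression(numCategories, numFeatures,
>>>>                                        new L1());
>>>>       // learner.train(targetValue, encodedVector) is then called
>>>>       // once per training example.
>>>>     }
>>>>   }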
>>>>
>>>>
>>>>> My question now is: if I have a categorical and a text-like feature,
>>>>> do I have to tell the class that I am going to add two features?
>>>>>
>>>>
>>>> With the hashed encoding, what you would do is create two encoders
>>>> with different types and names.  Pick an output vector size that is
>>>> pretty big (100,000 should do).  Then use each encoder with the
>>>> corresponding data.
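>>>>
>>>> Something like this (a sketch; the field names "color" and
>>>> "description" are made up, and StaticWordValueEncoder is the usual
>>>> choice for word-like categorical values):
>>>>
>>>>   import org.apache.mahout.math.RandomAccessSparseVector;
>>>>   import org.apache.mahout.math.Vector;
>>>>   import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;
>>>>   import org.apache.mahout.vectorizer.encoders.TextValueEncoder;
>>>>
>>>>   public class TwoEncoderSketch {
>>>>     public static void main(String[] args) {
>>>>       // Two encoders with different types and names...
>>>>       StaticWordValueEncoder wordEnc =
>>>>           new StaticWordValueEncoder("color");
>>>>       TextValueEncoder textEnc = new TextValueEncoder("description");
>>>>
>>>>       // ...both hashing into the same big output Vector.
>>>>       Vector v = new RandomAccessSparseVector(100000);
>>>>       wordEnc.addToVector("red", v);
>>>>       textEnc.addToVector("the 100-200 word description goes here", v);
>>>>     }
>>>>   }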
>>>>
>>>>
>>>>>
>>>>> What happens if I encode 20 different features into the vector but
>>>>> misconfigured the algorithm so that I told it there were only 10
>>>>> features?
>>>>
>>>> You would have 20 different encoders and a Vector of some size.
>>>>
>>>> If you give the learning algorithm a wrong-sized Vector, it should
>>>> immediately complain.  If it doesn't, or if it doesn't complain clearly
>>>> with a good message, file a bug.
>>>>
>>>>> I somewhat miss a formula or something like that for the algorithms
>>>>> that are part of Mahout.  That would make the different parameters
>>>>> easier to understand, I think.
>>>>>
>>>>
>>>> I think that this is genuinely confusing.  Keep going in the book.
>>>> The next chapters go into more detail on this process.
>>>>
>>>
>>
>>
>>
>> --
>> Lance Norskog
>> [email protected]
>>
>
