Thank you Ted,

Your explanations really helped.

Regards,
Em

On 22.05.2011 19:43, Ted Dunning wrote:
> On Sun, May 22, 2011 at 10:32 AM, Em <[email protected]> wrote:
> 
>> So, let's say I have a descriptive text of 100-200 words (text-like).
>> Does this mean that I have one feature (the description), or does it
>> mean that I have 100-200 features (the words)?
>>
> 
> There is a bit of confusion because the term feature can be used at two
> points in the process.
> 
> At raw data level, you have one feature that is text-like.
> 
> You have to encode this feature, however, as a numerical vector.  You can do
> that in a number of ways, but you can't encode text-like data into a single
> numerical value.  You need to use lots of numerical values to encode it.
>  That can be done by giving every possible word its own position in the
> vector, or by using the hashed encoding, where you pick the number of
> numerical values up front and the hashing encoder maps your data into
> that many slots.
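[Editorial aside: a minimal Python sketch of what hashed encoding does, hashing each word into a slot of a fixed-size vector. The hash function and sizes here are illustrative only; Mahout's actual encoders use different hash functions and multiple probes.]

```python
import hashlib

def hash_encode(words, num_features):
    # Map each word to a slot in a fixed-size vector via a stable hash.
    # Collisions are possible and expected; with a large enough vector
    # they rarely hurt the learner much.
    vec = [0.0] * num_features
    for w in words:
        idx = int(hashlib.md5(w.encode()).hexdigest(), 16) % num_features
        vec[idx] += 1.0
    return vec

v = hash_encode("the quick brown fox".split(), 16)
```

Note that you never build a dictionary: the vector size is fixed before any data is seen, which is what lets you choose the feature count up front.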
> 
> After you encode the data, you are left with a typically sparse Vector.  The
> learning algorithm never sees your original data, just this Vector.
> 
> So, from the viewpoint of the learning algorithm, each element of this
> Vector is a feature.
> 
> Unfortunately, this dual use of nomenclature is widespread in descriptions
> of supervised machine learning, including the classifiers in Mahout.
> 
> 
> 
>> The OnlineLogisticRegression class requires me to tell it how many
>> categories there are and how many features I am going to provide.
>>
> 
> Categories refer to the target variable.  You have to say how many possible
> values of the target that there are.
> 
> The number of features given here is *after* encoding.  Your text variable
> would probably be encoded into a Vector of size 10,000-1,000,000 so this
> size is what you should give the OnlineLogisticRegression.
> 
> 
>> My question now is, if I have a categorical and a text-like feature, do
>> I have to tell the class that I am adding two features?
>>
> 
> With the hashed encoding what you would do is create two encoders with
> different types and names.  Pick an output vector size that is pretty big
> (100,000 should do).  Then use each encoder with the corresponding data.
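[Editorial aside: a hypothetical Python sketch of the two-encoder idea, not Mahout's API. Salting the hash with each encoder's name keeps values of the text feature and the categorical feature from colliding systematically, while both write into one shared vector.]

```python
import hashlib

NUM_FEATURES = 100_000  # "pretty big", as suggested above

def add_to_vector(name, value, vec):
    # Salt the hash with the encoder's name so the two features land in
    # (mostly) different slots of the shared vector. Illustrative only.
    h = int(hashlib.md5(f"{name}:{value}".encode()).hexdigest(), 16)
    vec[h % len(vec)] += 1.0

vec = [0.0] * NUM_FEATURES
for word in "a short descriptive text".split():
    add_to_vector("description", word, vec)  # the text-like feature
add_to_vector("country", "DE", vec)          # the categorical feature
```

The learner then sees only `vec`; it has no idea which slots came from which original feature.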
> 
> 
>>
>> What happens if I encode 20 different features into the vector but
>> misconfigured the algorithm so that I told it there were only 10
>>
> 
> You would have 20 different encoders and some sized Vector.
> 
> If you give the learning algorithm a wrong-sized Vector, it should
> immediately complain.  If it doesn't or if it doesn't complain clearly with
> a good message, file a bug.
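[Editorial aside: the kind of fail-fast check meant here, sketched in Python. The function name is hypothetical; the point is that a size mismatch should raise a clear error immediately.]

```python
def dot(weights, vec):
    # Fail fast on a wrong-sized input vector with a clear message,
    # rather than silently truncating or crashing deep inside the math.
    if len(vec) != len(weights):
        raise ValueError(
            f"expected a vector of size {len(weights)}, got {len(vec)}")
    return sum(w * x for w, x in zip(weights, vec))
```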
> 
>> features? I somewhat miss a formula or the like for
>> the algorithms that are part of Mahout. This would make understanding
>> the different parameters easier, I think.
>>
> 
> I think that this is genuinely confusing.  Keep going in the book.  The next
> chapters go into more detail on this process.
> 
