On Sun, May 22, 2011 at 10:32 AM, Em wrote:

> So, let's say I got a descriptional-text of 100-200 words (text-like).
> Does this mean that I got one feature (the description) or does it mean
> that I got 100-200 features (the words)?
>

The confusion arises because the term "feature" is used at two different
points in the process.

At the raw data level, you have one feature: the text-like description.

You have to encode this feature as a numerical vector, however.  You can do
that in a number of ways, but you can't encode text-like data into a single
numerical value; you need many numerical values.  One option is to give
every possible word its own position in the vector.  The other is the hashed
encoding, where you pick the number of numerical values up front and the
hashing encoder maps your data into a vector of that size.

After you encode the data, you are left with a typically sparse Vector.  The
learning algorithm never sees your original data, just this Vector.

So, from the viewpoint of the learning algorithm, each element of this
Vector is a feature.
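
To make the two meanings concrete, here is a minimal sketch of the hashed
encoding idea in plain Java.  It is not Mahout's implementation: the class
name HashedTextSketch, the method encode, the whitespace tokenizer and the use
of String.hashCode are all assumptions made for illustration.  One raw feature
(the description) goes in, and a sparse vector with numFeatures slots comes
out.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; Mahout's real encoders differ in detail.
class HashedTextSketch {
    // Encode one raw text field into a sparse vector with numFeatures slots.
    // The map holds only the non-zero slots: index -> weight.
    static Map<Integer, Double> encode(String text, int numFeatures) {
        Map<Integer, Double> vector = new TreeMap<>();
        for (String word : text.toLowerCase().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // skip the empty token produced by leading whitespace
            }
            // floorMod keeps the index in [0, numFeatures) even when the
            // hash code is negative.
            int index = Math.floorMod(word.hashCode(), numFeatures);
            vector.merge(index, 1.0, Double::sum);
        }
        return vector;
    }

    public static void main(String[] args) {
        Map<Integer, Double> v =
            encode("the quick brown fox jumps over the lazy dog", 100_000);
        System.out.println("non-zero slots: " + v.size() + " of 100000");
    }
}
```

The learner would be told about 100,000 features here, even though the raw
record had a single text field and only a handful of slots are non-zero.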

Unfortunately, this dual use of the term is widespread in descriptions of
supervised machine learning, including the classifiers in Mahout.



> The OnlineLogisticRegression-class requires me to tell it how many
> categories are there and how many features I like to provide.
>

Categories refer to the target variable.  You have to say how many possible
values the target can take.

The number of features given here is *after* encoding.  Your text variable
would probably be encoded into a Vector of size 10,000 to 1,000,000, and that
size is what you should give the OnlineLogisticRegression.


> My question now is, if I got a categorical- and a text-like feature, do
> I have to tell the class that I am going to add two features?
>

With the hashed encoding, you would create two encoders with different types
and names.  Pick an output vector size that is pretty big (100,000 should
do).  Then use each encoder with the corresponding data, adding both into
the same vector.  The number you give the class is that vector size, not two.
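
As a rough sketch of why the encoders need different names, the plain-Java
example below mixes the encoder name into the hashed key.  The class
NamedHashEncoderSketch and its addToVector method are hypothetical and only
loosely modelled on the idea behind Mahout's encoders; the "name:value" key
format is an assumption.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not a Mahout class.
class NamedHashEncoderSketch {
    private final String name;

    NamedHashEncoderSketch(String name) {
        this.name = name;
    }

    // Add one value to the shared vector.  The encoder name is part of the
    // hashed key, so "red" as a category and "red" as a word in the
    // description usually land in different slots.
    void addToVector(String value, Map<Integer, Double> vector, int numFeatures) {
        int index = Math.floorMod((name + ":" + value).hashCode(), numFeatures);
        vector.merge(index, 1.0, Double::sum);
    }

    public static void main(String[] args) {
        int numFeatures = 100_000; // the size the learner is told about
        Map<Integer, Double> vector = new TreeMap<>();

        NamedHashEncoderSketch category = new NamedHashEncoderSketch("category");
        NamedHashEncoderSketch description = new NamedHashEncoderSketch("description");

        // One categorical value and one text field, both into the same vector.
        category.addToVector("red", vector, numFeatures);
        for (String word : "a red apple".split("\\s+")) {
            description.addToVector(word, vector, numFeatures);
        }
        System.out.println(vector);
    }
}
```

Both raw features share one vector, so adding a third or a twentieth encoder
does not change the size that the learning algorithm sees.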


>
> What happens, if I encode 20 different features into the vector but
> misconfigured the algorithm in a way that I told there were only 10
>

You would have 20 different encoders and a Vector of whatever size you chose.

If you give the learning algorithm a wrong-sized Vector, it should
immediately complain.  If it doesn't or if it doesn't complain clearly with
a good message, file a bug.
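
The kind of check meant here can be sketched in a few lines of plain Java.
This is an assumption about what a clear complaint could look like, not
Mahout's actual code or error message; SizeCheckSketch and score are
hypothetical names.

```java
// Hypothetical illustration only; not a Mahout class.
class SizeCheckSketch {
    private final int numFeatures;
    private final double[] weights;

    SizeCheckSketch(int numFeatures) {
        this.numFeatures = numFeatures;
        this.weights = new double[numFeatures]; // all zeros before training
    }

    // Linear score of a dense input.  A vector of the wrong size is
    // rejected up front with a message that names both sizes.
    double score(double[] input) {
        if (input.length != numFeatures) {
            throw new IllegalArgumentException(
                "Expected a vector of size " + numFeatures
                    + " but got " + input.length);
        }
        double sum = 0.0;
        for (int i = 0; i < numFeatures; i++) {
            sum += weights[i] * input[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        SizeCheckSketch model = new SizeCheckSketch(10);
        try {
            model.score(new double[20]); // wrong size on purpose
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```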

> features? I miss a little bit some formula or something like that for
> the algorithms that are part of mahout. This would make understanding
> the different parameters more easy, I think.
>

I think that this is genuinely confusing.  Keep going in the book.  The next
chapters go into more detail on this process.
