On Sun, May 22, 2011 at 10:32 AM, Em <[email protected]> wrote:
> So, let's say I got a descriptional-text of 100-200 words (text-like).
> Does this mean that I got one feature (the description) or does it mean
> that I got 100-200 features (the words)?

There is a bit of confusion because the term "feature" is used at two
points in the process.

At the raw data level, you have one feature that is text-like. You have
to encode this feature as a numerical vector, however. You can do that
in a number of ways, but you can't encode text-like data into a single
numerical value; you need many numerical values. Either every possible
word gets its own position in the vector, or you use the hashed
encoding, where you pick the number of numerical values and the hashing
encoder fits your data into that size.

After you encode the data, you are left with a (typically sparse)
Vector. The learning algorithm never sees your original data, just this
Vector. So, from the viewpoint of the learning algorithm, each element
of this Vector is a feature.

Unfortunately, this dual use of the term is widespread when people
describe supervised machine learning such as the classifiers in Mahout.

> The OnlineLogisticRegression-class requires me to tell it how many
> categories are there and how many features I like to provide.

Categories refer to the target variable. You have to say how many
possible values the target can take.

The number of features given here is the size *after* encoding. Your
text variable would probably be encoded into a Vector of size
10,000-1,000,000, so that size is what you should give the
OnlineLogisticRegression.

> My question now is, if I got a categorical- and a text-like feature, do
> I have to tell the class that I am going to add two features?

No. With the hashed encoding, you would create two encoders with
different types and names. Pick an output vector size that is pretty big
(100,000 should do). Then use each encoder with the corresponding data,
adding both into the same Vector. The number of features you give the
class is the Vector size, not the number of raw fields.
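
To make the two-encoders-one-Vector idea concrete, here is a minimal
plain-Java sketch of hashed encoding. It is not Mahout's encoder code:
the class name, the method names, and the use of CRC32 as the hash are
all assumptions made for illustration, and it uses a dense double[]
where Mahout would use a sparse Vector.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/**
 * Illustrative sketch of hashed feature encoding (hypothetical names,
 * not Mahout's API). One categorical field and one text field are
 * hashed into the same fixed-size vector.
 */
final class HashedEncodingSketch {

    /** Hashes "encoderName:token" to a slot in [0, numFeatures). */
    static int slot(String encoderName, String token, int numFeatures) {
        CRC32 crc = new CRC32();
        crc.update((encoderName + ":" + token).getBytes(StandardCharsets.UTF_8));
        // getValue() is a non-negative long, so the remainder is non-negative too.
        return (int) (crc.getValue() % numFeatures);
    }

    /** Adds a single categorical value under the given encoder name. */
    static void addCategorical(String encoderName, String value, double[] vector) {
        vector[slot(encoderName, value, vector.length)] += 1.0;
    }

    /** Adds each whitespace-separated word of a text field. */
    static void addText(String encoderName, String text, double[] vector) {
        for (String word : text.toLowerCase().trim().split("\\s+")) {
            if (!word.isEmpty()) {
                vector[slot(encoderName, word, vector.length)] += 1.0;
            }
        }
    }

    /** Encodes one record: two raw fields, one vector of numFeatures elements. */
    static double[] encode(String category, String description, int numFeatures) {
        double[] vector = new double[numFeatures];
        addCategorical("category", category, vector);   // encoder 1
        addText("description", description, vector);    // encoder 2
        return vector;
    }

    public static void main(String[] args) {
        int numFeatures = 100_000; // the size the learner would be told about
        double[] v = encode("books", "a short description of a used paperback", numFeatures);
        int nonZero = 0;
        for (double x : v) {
            if (x != 0.0) {
                nonZero++;
            }
        }
        System.out.println("vector size = " + v.length + ", non-zero slots = " + nonZero);
    }
}
```

The encoder name is mixed into the hash so that the word "books" in the
description and the category "books" usually land in different slots.
The learner only ever needs to know numFeatures, however many raw fields
were hashed in.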

> What happens, if I encode 20 different features into the vector but
> misconfigured the algorithm in a way that I told there were only 10
> features?

You would have 20 different encoders and a Vector of some size. If you
give the learning algorithm a wrong-sized Vector, it should immediately
complain. If it doesn't, or if it doesn't complain clearly with a good
message, file a bug.

> I miss a little bit some formula or something like that for
> the algorithms that are part of mahout. This would make understanding
> the different parameters more easy, I think.

I think that this is genuinely confusing. Keep going in the book. The
next chapters go into more detail on this process.
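
On the request for a formula: the following is the textbook multinomial
logistic regression model, given only as a sketch of where the two
constructor parameters enter. The symbols are notation introduced here:
k is the number of categories, d is the number of features after
encoding, x is the encoded Vector, and each beta_c is a weight vector
with d elements. That Mahout's implementation uses exactly this
parameterization is an assumption.

```latex
p(y = c \mid x) = \frac{\exp(\beta_c \cdot x)}
                       {1 + \sum_{j=1}^{k-1} \exp(\beta_j \cdot x)},
\qquad c = 1, \dots, k-1,
\\[1ex]
p(y = 0 \mid x) = \frac{1}
                       {1 + \sum_{j=1}^{k-1} \exp(\beta_j \cdot x)},
\qquad x \in \mathbb{R}^{d},\ \beta_j \in \mathbb{R}^{d}.
```

Every dot product needs x to have exactly d elements, which is why the
Vector size, not the count of raw fields or encoders, is the number the
learner has to be told.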
