Thank you, Ted. Your explanations really helped.
Regards,
Em

On 22.05.2011 19:43, Ted Dunning wrote:
> On Sun, May 22, 2011 at 10:32 AM, Em <[email protected]> wrote:
>
>> So, let's say I have a descriptive text of 100-200 words (text-like).
>> Does this mean that I have one feature (the description), or does it
>> mean that I have 100-200 features (the words)?
>>
>
> There is a bit of confusion because the term "feature" can be used at
> two points in the process.
>
> At the raw data level, you have one feature that is text-like.
>
> You have to encode this feature, however, as a numerical vector. You
> can do that in a number of ways, but you can't encode text-like data
> into a single numerical value. You need to use lots of numerical
> values to encode it. That can be done where every possible word has a
> different numerical value, or you can use the hashed encoding, where
> you pick the number of numerical values and the hashing encoder deals
> with your data and your choice.
>
> After you encode the data, you are left with a typically sparse
> Vector. The learning algorithm never sees your original data, just
> this Vector.
>
> So, from the viewpoint of the learning algorithm, each element of this
> Vector is a feature.
>
> Unfortunately, this dual use of nomenclature is completely widespread
> when people describe supervised machine learning such as the
> classifiers in Mahout.
>
>
>> The OnlineLogisticRegression class requires me to tell it how many
>> categories there are and how many features I would like to provide.
>>
>
> Categories refer to the target variable. You have to say how many
> possible values of the target there are.
>
> The number of features given here is *after* encoding. Your text
> variable would probably be encoded into a Vector of size
> 10,000-1,000,000, so this size is what you should give the
> OnlineLogisticRegression.
>
>
>> My question now is, if I have a categorical and a text-like feature,
>> do I have to tell the class that I am going to add two features?
>>
>
> With the hashed encoding, what you would do is create two encoders
> with different types and names. Pick an output vector size that is
> pretty big (100,000 should do). Then use each encoder with the
> corresponding data.
>
>
>> What happens if I encode 20 different features into the vector but
>> misconfigured the algorithm so that I told it there were only 10
>> features?
>>
>
> You would have 20 different encoders and some sized Vector.
>
> If you give the learning algorithm a wrong-sized Vector, it should
> immediately complain. If it doesn't, or if it doesn't complain clearly
> with a good message, file a bug.
>
>> I miss some kind of formula or similar reference for the algorithms
>> that are part of Mahout. This would make understanding the different
>> parameters easier, I think.
>>
>
> I think that this is genuinely confusing. Keep going in the book. The
> next chapters go into more detail on this process.
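
P.S. For anyone reading this later in the archives, here is a minimal
sketch of what I understood from Ted's description: one encoder per raw
field (a categorical field and a text field), both hashed into the same
sparse Vector, and the Vector size passed to OnlineLogisticRegression as
the feature count. The class names come from Mahout's
org.apache.mahout.vectorizer.encoders and
org.apache.mahout.classifier.sgd packages; the field names "category"
and "description", the vector size of 100,000, the L1 prior, and the
sample record are just made-up assumptions, so please check the exact
signatures against your Mahout version.

import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.vectorizer.encoders.FeatureVectorEncoder;
import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;
import org.apache.mahout.vectorizer.encoders.TextValueEncoder;

public class TwoFeatureSketch {

  // Size of the hashed vector: this is the "number of features" given
  // to OnlineLogisticRegression, NOT the number of raw fields (2 here).
  private static final int FEATURES = 100000;

  public static void main(String[] args) {
    // One encoder per raw field, each with its own name so that the
    // two fields hash into different (probe) locations.
    FeatureVectorEncoder categoryEncoder =
        new StaticWordValueEncoder("category");
    TextValueEncoder descriptionEncoder =
        new TextValueEncoder("description");

    // Two target categories, FEATURES columns after encoding, L1 prior
    // (prior choice is an assumption, not something Ted specified).
    OnlineLogisticRegression learner =
        new OnlineLogisticRegression(2, FEATURES, new L1());

    // Hypothetical raw record: a categorical value plus a 100-200 word
    // description, with a known target category for training.
    String category = "books";
    String description = "a longer descriptive text of the item ...";
    int target = 1;

    // Both raw fields are hashed into the same sparse Vector.
    Vector v = new RandomAccessSparseVector(FEATURES);
    categoryEncoder.addToVector(category, v);
    descriptionEncoder.addToVector(description, v);

    // Train on the encoded Vector; the learner never sees the raw text.
    learner.train(target, v);

    // Later: classifyFull returns one probability per target category.
    Vector p = learner.classifyFull(v);
    System.out.println("p(target = 1) = " + p.get(1));
  }
}

If I misunderstood something here, corrections are welcome.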
