Wait. I thought a "feature" was an abstract concept: a clump of "meaning" found by analyzing the set of "feature vectors" described in this thread.

On Sun, May 22, 2011 at 12:04 PM, Em <[email protected]> wrote:
> Thank you Ted,
>
> your explanations really helped.
>
> Regards,
> Em
>
> On 22.05.2011 19:43, Ted Dunning wrote:
>> On Sun, May 22, 2011 at 10:32 AM, Em <[email protected]> wrote:
>>
>>> So, let's say I got a descriptional-text of 100-200 words (text-like).
>>> Does this mean that I got one feature (the description) or does it mean
>>> that I got 100-200 features (the words)?
>>>
>>
>> There is a bit of confusion because the term feature can be used at two
>> points in the process.
>>
>> At raw data level, you have one feature that is text-like.
>>
>> You have to encode this feature, however, as a numerical vector. You can do
>> that in a number of ways, but you can't encode text-like data into a single
>> numerical value. You need to use lots of numerical values to encode it.
>> That can be done where every possible word has a different numerical value
>> or you can use the hashed encoding where you pick the number of numerical
>> values and the hashing encoder deals with your data and your choice.
>>
>> After you encode the data, you are left with a typically sparse Vector. The
>> learning algorithm never sees your original data, just this Vector.
>>
>> So, from the viewpoint of the learning algorithm, each element of this
>> Vector is a feature.
>>
>> Unfortunately this dual use of nomenclature is completely wide-spread when
>> people describe supervised machine learning such as the classifiers in
>> Mahout do.
>>
>>
>>
>>> The OnlineLogisticRegression-class requires me to tell it how many
>>> categories are there and how many features I like to provide.
>>>
>>
>> Categories refer to the target variable. You have to say how many possible
>> values of the target that there are.
>>
>> The number of features given here is *after* encoding. Your text variable
>> would probably be encoded into a Vector of size 10,000-1,000,000 so this
>> size is what you should give the OnlineLogisticRegression.
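
To make the "one raw feature, many encoded features" point concrete, here
is a minimal sketch of the hashed encoding idea. It is plain Python, not
Mahout's Java encoders, and the names hashed_index, encode_text and
NUM_FEATURES are hypothetical, made up for illustration. It assumes a
simple lowercase word split and a count per hashed slot; the real encoders
may weight and probe differently.

```python
import hashlib
import re


def hashed_index(encoder_name: str, token: str, num_features: int) -> int:
    """Map a token to a slot in [0, num_features) with a stable hash."""
    # hashlib is used instead of the built-in hash(), which is salted
    # per process and would give different slots on every run.
    digest = hashlib.md5(f"{encoder_name}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_features


def encode_text(text: str, num_features: int,
                encoder_name: str = "description") -> dict:
    """Encode one text-like raw feature as a sparse vector {index: count}."""
    vector = {}
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        idx = hashed_index(encoder_name, word, num_features)
        # Colliding words simply add up in the same slot.
        vector[idx] = vector.get(idx, 0.0) + 1.0
    return vector


# Hypothetical size: you pick it, and it is the feature count
# that the learning algorithm would be told about.
NUM_FEATURES = 100_000

description = "red wool sweater with a red collar"
vec = encode_text(description, NUM_FEATURES)

# One raw feature (the description) became a handful of non-zero
# elements in a vector of NUM_FEATURES elements.
print(len(vec), "non-zero entries out of", NUM_FEATURES)
```

The learner would be configured with NUM_FEATURES, not with the number of
raw fields and not with the number of words in the description.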
>>
>>
>>> My question now is, if I got a categorical- and a text-like feature, do
>>> I have to tell the class that I am going to add two features?
>>>
>>
>> With the hashed encoding what you would do is create two encoders with
>> different types and names. Pick an output vector size that is pretty big
>> (100,000 should do). Then use each encoder with the corresponding data.
>>
>>
>>>
>>> What happens, if I encode 20 different features into the vector but
>>> missconfigured the algorithm in a way that I told there were only 10
>>>
>>
>> You would have 20 different encoders and some sized Vector.
>>
>> If you give the learning algorithm a wrong-sized Vector, it should
>> immediately complain. If it doesn't or if it doesn't complain clearly with
>> a good message, file a bug.
>>
>>> features? I miss a little bit some formula or something like that for
>>> the algorithms that are part of mahout. This would make understanding
>>> the different parameters more easy, I think.
>>>
>>
>> I think that this is genuinely confusing. Keep going in the book. The next
>> chapters go into more detail on this process.
>>
>

--
Lance Norskog
[email protected]
