The traditional meaning of "feature" in machine learning, as I understand it, is an arbitrary piece of information about some object. These features are usually grouped by type into a feature vector, which provides a uniform way to describe any object of the same class.
Daniel.

On Mon, May 23, 2011 at 5:00 PM, Lance Norskog <[email protected]> wrote:
> Wait. I thought a "feature" is an abstract concept for clumps of
> "meaning" that are found by analyzing the set of "feature vectors"
> described above.
>
> On Sun, May 22, 2011 at 12:04 PM, Em <[email protected]> wrote:
>> Thank you Ted,
>>
>> your explanations really helped.
>>
>> Regards,
>> Em
>>
>> On 22.05.2011 19:43, Ted Dunning wrote:
>>> On Sun, May 22, 2011 at 10:32 AM, Em <[email protected]> wrote:
>>>
>>>> So, let's say I have a descriptive text of 100-200 words (text-like).
>>>> Does this mean that I have one feature (the description) or does it
>>>> mean that I have 100-200 features (the words)?
>>>
>>> There is a bit of confusion because the term feature can be used at
>>> two points in the process.
>>>
>>> At the raw data level, you have one feature that is text-like.
>>>
>>> You have to encode this feature, however, as a numerical vector. You
>>> can do that in a number of ways, but you can't encode text-like data
>>> into a single numerical value. You need to use lots of numerical
>>> values to encode it. That can be done where every possible word has a
>>> different numerical value, or you can use the hashed encoding where
>>> you pick the number of numerical values and the hashing encoder deals
>>> with your data and your choice.
>>>
>>> After you encode the data, you are left with a typically sparse
>>> Vector. The learning algorithm never sees your original data, just
>>> this Vector.
>>>
>>> So, from the viewpoint of the learning algorithm, each element of
>>> this Vector is a feature.
>>>
>>> Unfortunately this dual use of nomenclature is completely widespread
>>> when people describe supervised machine learning systems such as the
>>> classifiers in Mahout.
>>>
>>>> The OnlineLogisticRegression class requires me to tell it how many
>>>> categories there are and how many features I would like to provide.
>>>
>>> Categories refer to the target variable.
>>> You have to say how many possible values of the target there are.
>>>
>>> The number of features given here is *after* encoding. Your text
>>> variable would probably be encoded into a Vector of size
>>> 10,000-1,000,000, so this size is what you should give the
>>> OnlineLogisticRegression.
>>>
>>>> My question now is, if I have a categorical and a text-like feature,
>>>> do I have to tell the class that I am going to add two features?
>>>
>>> With the hashed encoding, what you would do is create two encoders
>>> with different types and names. Pick an output vector size that is
>>> pretty big (100,000 should do). Then use each encoder with the
>>> corresponding data.
>>>
>>>> What happens if I encode 20 different features into the vector but
>>>> misconfigured the algorithm by telling it there were only 10
>>>> features?
>>>
>>> You would have 20 different encoders and some sized Vector.
>>>
>>> If you give the learning algorithm a wrong-sized Vector, it should
>>> immediately complain. If it doesn't, or if it doesn't complain
>>> clearly with a good message, file a bug.
>>>
>>>> I somewhat miss a formula or something like that for the algorithms
>>>> that are part of Mahout. This would make understanding the different
>>>> parameters easier, I think.
>>>
>>> I think that this is genuinely confusing. Keep going in the book. The
>>> next chapters go into more detail on this process.
>
> --
> Lance Norskog
> [email protected]
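The hashed encoding Ted describes above can be sketched in a few lines. The following is a minimal illustration of the idea (the hashing trick), in Python for brevity; it is not Mahout's actual API, and the function and parameter names are made up. The caller picks the number of numerical values up front, and every word is hashed into one of those slots, so no dictionary of all possible words is ever needed:

```python
import hashlib

def hashed_encode(text, num_slots=20):
    """Encode a text-like feature as a fixed-size list of counts.

    num_slots is the vector size chosen up front; this is the number
    the learning algorithm would later be told about, not "1 feature".
    """
    vec = [0.0] * num_slots
    for word in text.lower().split():
        # A stable hash, so the same word always lands in the same slot.
        h = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)
        vec[h % num_slots] += 1.0
    return vec

v = hashed_encode("the quick brown fox jumps over the lazy dog")
print(len(v), sum(v))  # 20 slots in total, 9 word occurrences counted
```

Note that different words may collide in the same slot when num_slots is small; in practice one picks a large vector size (as Ted suggests, 10,000 and up) to keep collisions rare.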

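Ted's "two encoders with different types and names, one big output vector" advice can also be sketched in the same illustrative style (again, not Mahout's API; all names below are hypothetical). Salting the hash with each encoder's name keeps the categorical slots and the text slots from systematically colliding, and the size the learner must be told is the shared vector size, not the number of raw features:

```python
import hashlib

VECTOR_SIZE = 100_000  # pick once; pass this same number to the learner

def slot(encoder_name, token, size=VECTOR_SIZE):
    """Map (encoder name, token) to a slot in the shared vector."""
    key = (encoder_name + ":" + token).encode("utf-8")
    return int(hashlib.md5(key).hexdigest(), 16) % size

def encode_example(category, description):
    """Encode one categorical and one text-like feature sparsely."""
    vec = {}  # sparse vector: slot index -> weight
    # The categorical feature contributes one slot.
    i = slot("category", category)
    vec[i] = vec.get(i, 0.0) + 1.0
    # The text-like feature contributes one slot per word.
    for word in description.lower().split():
        j = slot("description", word)
        vec[j] = vec.get(j, 0.0) + 1.0
    return vec

v = encode_example("red", "small round sweet fruit")
# Every index fits the size the learner was configured with:
assert all(0 <= k < VECTOR_SIZE for k in v)
print(len(v))  # at most 5 nonzero slots (fewer if hashes collide)
```

This also shows why a wrong-sized configuration fails fast: if the learner were told 10 dimensions but the encoders write indices up to 100,000, the very first out-of-range index should trigger an immediate complaint, which is the behavior Ted says to expect (and to report as a bug if missing).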