On Mon, May 23, 2011 at 8:34 PM, Daniel McEnnis <[email protected]> wrote:
> The traditional meaning of feature in machine learning, as I understand
> it, is an arbitrary piece of information about some object.  These
> features are usually grouped by type into a feature vector, which
> provides a uniform way to describe any object of the same class.
Except when they aren't.  Consider a sequence tagger.  There are
features, but no vectors.

> Daniel.
>
> On Mon, May 23, 2011 at 5:00 PM, Lance Norskog <[email protected]> wrote:
>> Wait.  I thought a "feature" is an abstract concept for clumps of
>> "meaning" that are found by analyzing the set of "feature vectors"
>> described above.
>>
>> On Sun, May 22, 2011 at 12:04 PM, Em <[email protected]> wrote:
>>> Thank you Ted,
>>>
>>> your explanations really helped.
>>>
>>> Regards,
>>> Em
>>>
>>> On 22.05.2011 19:43, Ted Dunning wrote:
>>>> On Sun, May 22, 2011 at 10:32 AM, Em <[email protected]> wrote:
>>>>
>>>>> So, let's say I've got a descriptive text of 100-200 words
>>>>> (text-like).  Does this mean that I've got one feature (the
>>>>> description), or does it mean that I've got 100-200 features (the
>>>>> words)?
>>>>
>>>> There is a bit of confusion because the term feature can be used at
>>>> two points in the process.
>>>>
>>>> At the raw data level, you have one feature that is text-like.
>>>>
>>>> You have to encode this feature, however, as a numerical vector.  You
>>>> can do that in a number of ways, but you can't encode text-like data
>>>> into a single numerical value.  You need to use lots of numerical
>>>> values to encode it.  That can be done where every possible word has a
>>>> different numerical value, or you can use the hashed encoding, where
>>>> you pick the number of numerical values and the hashing encoder deals
>>>> with your data and your choice.
>>>>
>>>> After you encode the data, you are left with a typically sparse
>>>> Vector.  The learning algorithm never sees your original data, just
>>>> this Vector.
>>>>
>>>> So, from the viewpoint of the learning algorithm, each element of this
>>>> Vector is a feature.
>>>>
>>>> Unfortunately, this dual use of nomenclature is completely widespread
>>>> when people describe supervised machine learning such as the
>>>> classifiers in Mahout.
>>>>
>>>>> The OnlineLogisticRegression class requires me to tell it how many
>>>>> categories there are and how many features I'd like to provide.
>>>>
>>>> Categories refer to the target variable.  You have to say how many
>>>> possible values of the target there are.
>>>>
>>>> The number of features given here is *after* encoding.  Your text
>>>> variable would probably be encoded into a Vector of size
>>>> 10,000-1,000,000, so this size is what you should give the
>>>> OnlineLogisticRegression.
>>>>
>>>>> My question now is, if I've got a categorical and a text-like
>>>>> feature, do I have to tell the class that I am going to add two
>>>>> features?
>>>>
>>>> With the hashed encoding, what you would do is create two encoders
>>>> with different types and names.  Pick an output vector size that is
>>>> pretty big (100,000 should do).  Then use each encoder with the
>>>> corresponding data.
>>>>
>>>>> What happens if I encode 20 different features into the vector but
>>>>> misconfigured the algorithm so that I told it there were only 10
>>>>> features?
>>>>
>>>> You would have 20 different encoders and a Vector of some size.
>>>>
>>>> If you give the learning algorithm a wrong-sized Vector, it should
>>>> immediately complain.  If it doesn't, or if it doesn't complain
>>>> clearly with a good message, file a bug.
>>>>
>>>>> I somewhat miss a formula or something like that for the algorithms
>>>>> that are part of Mahout.  This would make understanding the different
>>>>> parameters easier, I think.
>>>>
>>>> I think that this is genuinely confusing.  Keep going in the book.
>>>> The next chapters go into more detail on this process.
>>>
>>
>> --
>> Lance Norskog
>> [email protected]
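
To make the two-encoder recipe above concrete, here is a minimal sketch
against the Mahout 0.5 SGD classes (OnlineLogisticRegression and the
hashed encoders).  The encoder names "type" and "description", the
vector size, and the sample values are made up for illustration:

import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

public class TwoFeatureSketch {
  public static void main(String[] args) {
    int numFeatures = 100000;  // hashed output size -- "pretty big", as above
    int numCategories = 2;     // number of possible target values

    // One encoder per raw feature; the encoder's name salts the hash so
    // the two raw features land in (mostly) distinct slots of one Vector.
    StaticWordValueEncoder typeEncoder =
        new StaticWordValueEncoder("type");
    StaticWordValueEncoder wordEncoder =
        new StaticWordValueEncoder("description");

    // Encode one example: one categorical value plus one text-like value.
    Vector v = new RandomAccessSparseVector(numFeatures);
    typeEncoder.addToVector("book", v);
    for (String w : "a descriptive text of 100-200 words".split("\\s+")) {
      wordEncoder.addToVector(w, v);
    }

    // The learner never sees the raw data, only the encoded Vector, so
    // numFeatures here must match the Vector size chosen above.
    OnlineLogisticRegression learner =
        new OnlineLogisticRegression(numCategories, numFeatures, new L1());
    learner.train(1, v);  // 1 = this example's target category
  }
}

From the raw-data point of view this is two features; from the
algorithm's point of view it is 100,000.  And if numFeatures disagrees
with the Vector size (the 20-vs-10 case above), train() should fail
immediately; if it fails silently or with a poor message, that is the
bug to file.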
