On Fri, Dec 13, 2013 at 09:14:12PM -0500, Michael Crawford wrote:
> I found this piece of code on github
>
> https://gist.github.com/kljensen/5452382
>
> def one_hot_dataframe(data, cols, replace=False):
>     """ Takes a dataframe and a list of columns that need to be encoded.
>         Returns a 3-tuple comprising the data, the vectorized data,
>         and the fitted vectorizor.
>     """
>     vec = DictVectorizer()
>     mkdict = lambda row: dict((col, row[col]) for col in cols)
>     #<<<<<<<<<<<<<<<<<<
>     vecData = pandas.DataFrame(vec.fit_transform(data[cols].apply(mkdict, axis=1)).toarray())
>     vecData.columns = vec.get_feature_names()
>     vecData.index = data.index
>     if replace is True:
>         data = data.drop(cols, axis=1)
>         data = data.join(vecData)
>     return (data, vecData, vec)
>
> I don't understand how that lambda expression works.

Lambda is just syntactic sugar for a function. It is exactly the same as
a def function, except with two limitations:

- there is no name, or to be precise, the name of all lambda functions
  is the same, "<lambda>";

- the body of the function is limited to exactly a single expression.

So we can take the lambda:

    lambda row: dict((col, row[col]) for col in cols)

give it a more useful name, and turn it into this:

    def mkdict(row):
        return dict((col, row[col]) for col in cols)

Now let's analyse that function. It takes a single argument, "row". That
means that when you call the function, you have to provide a value for
the row variable. To take a simpler example, when you call the len()
function, you have to provide a value to take the length of!

    len()          => gives an error, because there's nothing to take
                      the length of
    len("abcdef")  => returns 6

Same here with mkdict. It needs to be given a row argument. That is the
responsibility of the caller, which we'll get to in a moment.

mkdict also has two other variables:

- col, which is defined inside the function: it is a loop variable
  created by the "for col in cols" part;

- cols, which is taken from the one_hot_dataframe argument of the same
  name. Technically, this makes the mkdict function a so-called
  "closure", but don't worry about that. You'll learn about closures in
  due course.

[For pedants: technically, "dict" is also a variable, but that's not
really important as Python ends up using the built-in dict function for
that.]

> For starters where did row come from?
> How did it know it was working on data?

To answer these questions, we have to look at the next line, where the
mkdict function is actually used:

    vecData = pandas.DataFrame(
                vec.fit_transform(
                  data[cols].apply(mkdict, axis=1)
                  ).toarray()
                )

I've spread that line over multiple physical lines to make it easier to
read. The first thing you'll notice is that it does a lot of work in a
single call: it calls DataFrame, fit_transform, toarray, whatever they
are.
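
(A quick aside before going on. If you want to convince yourself that the
lambda and the def really are interchangeable, here is a small sketch you
can paste into the interpreter. The names make_both, as_lambda and as_def
are made up for the demonstration, the column names are invented, and a
plain dict stands in for a row, on the assumption that all mkdict needs
is for row[col] to work.)

```python
def make_both(cols):
    """Hypothetical stand-in for one_hot_dataframe: cols is its argument."""
    # The lambda exactly as it appears in the original code...
    as_lambda = lambda row: dict((col, row[col]) for col in cols)

    # ...and the same thing spelled with def.
    def as_def(row):
        return dict((col, row[col]) for col in cols)

    return as_lambda, as_def


as_lambda, as_def = make_both(["colour", "size"])

# A plain dict plays the part of one row (invented values).
row = {"colour": "red", "size": "L", "price": 10}

print(as_lambda(row))    # {'colour': 'red', 'size': 'L'}
print(as_def(row))       # {'colour': 'red', 'size': 'L'}

# The only visible difference is the name.
print(as_lambda.__name__, as_def.__name__)    # <lambda> as_def
```

Both functions still remember cols after make_both has returned, which
is the "closure" business mentioned above. So much for the lambda
itself; back to that long line.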

But the critical part for your question is the middle part:

    data[cols].apply(mkdict, axis=1)

This extracts data[cols] (whatever that gives!) and then calls the
"apply" method on it. I don't know what that actually is, I've never
used pandas, but judging by the name I can guess that "apply" takes some
sort of array of values:

    1   2   3
    4   5   6
    7   8   9
    10  11  12
    13  14  15
    16  17  18
    ...

extracts out either the rows (axis=1) or columns (axis=0 or axis=2
perhaps?), and feeds them to a callback function. In this case, the
callback function will be mkdict.

So, and remember this is just my guess based on the name, the apply
method does something like this:

- extract row 1, giving [1, 2, 3] (or whatever the values happen to be);

- pass that row to mkdict, giving mkdict([1, 2, 3]), which calculates
  a dict {blah blah blah};

- stuff that resulting dict somewhere for later use;

- do the same for row 2, then row 3, and so on.

That's my expectation.


-- 
Steven
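
P.S. Here is that guess written out as plain Python, in case it helps.
This is only a sketch of the idea, not how pandas implements it:
apply_rows is a made-up helper, the table is a plain list, and the
column names and values are invented. One thing the mkdict code does
tell us is that each row must support lookup by column name, since it
does row[col], so each row here is a dict rather than a bare list like
[1, 2, 3].

```python
def apply_rows(table, func):
    """Made-up helper: call func once per row and collect the results.

    A guess at the behaviour of apply(func, axis=1), not the pandas code.
    """
    results = []
    for row in table:
        results.append(func(row))   # this call is where "row" comes from
    return results


cols = ["colour", "size"]

def mkdict(row):
    return dict((col, row[col]) for col in cols)


# Invented data; each row allows lookup by column name.
table = [
    {"colour": "red", "size": "L"},
    {"colour": "blue", "size": "S"},
]

dicts = apply_rows(table, mkdict)
print(dicts)
# [{'colour': 'red', 'size': 'L'}, {'colour': 'blue', 'size': 'S'}]
```

So mkdict never needs to know about data at all. Whoever calls it hands
over one row at a time, and that caller is the thing that knows about
the data.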