My own recommendation here would be to run a word-count first and then create a dense dictionary using a sequential process. This sequential step should be very fast because the number of items is quite modest.
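
A minimal sketch of that two-step idea, in Python for brevity. The in-memory Counter only stands in for the real word-count job, and the names count_keys and build_dense_dictionary as well as the sample message ids are made up for illustration, not taken from Mahout:

```python
from collections import Counter


def count_keys(keys):
    """Word-count step: tally how often each distinct key occurs.

    In practice this would be a distributed word-count job; a Counter
    is only a stand-in here.
    """
    return Counter(keys)


def build_dense_dictionary(counts):
    """Sequential step: give every distinct key a consecutive id 0..n-1.

    Keys are sorted so the assignment is deterministic across runs.
    """
    return {key: index for index, key in enumerate(sorted(counts))}


# Hypothetical input: message ids as they might appear in the mail archive.
observed = [
    "<msg-1@example.org>",
    "<msg-2@example.org>",
    "<msg-1@example.org>",
]

counts = count_keys(observed)
dictionary = build_dense_dictionary(counts)
```

The sequential pass only touches the distinct keys, not the raw input, which is why it should stay cheap when the number of items is modest.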
I would create an additional dictionary at the same time for email addresses. Once you have the dictionary, you can distribute it to all nodes so that each node can extract a slice of the required matrix using a map-only program.

On Wed, Aug 31, 2011 at 8:21 AM, Grant Ingersoll <[email protected]> wrote:

>
> On Aug 22, 2011, at 12:14 PM, Sean Owen wrote:
>
> > Here are two ideas:
> >
> > Recommend threads to users.
> > Users are people, items are threads. This might suggest discussions
> > you should be a party to, or may be of interest since it concerns
> > people you often share a thread with. I think it has slightly more
> > potential to be useful, but, probably a non-starter in practice as
> > it's not generally true that you're welcome to see a thread you
> > weren't copied on.
>
> This is the one I am doing. But it brings up an interesting question in
> how best to convert the input to ids.
>
> To do this, I need to convert the strings (message id, from) to ids. Thus,
> I more or less modeled the code after what DictionaryVectorizer does.
> Creating the dictionaries is pretty straightforward and we likely now have
> an opportunity to make a general purpose tool that does it in an M/R way.
>
> Digging in a bit more, I am now working on doing the actual matrix
> creation. In my case, I have two dictionaries: message ids and from
> emails. In DictionaryVectorizer (used to take text to sparse vectors, which
> is comparable to what I need to do), it creates the matrix by running:
>
> for each dictionary chunk
>   for each piece of text // i.e. the input sequence file, handled by Hadoop
>     create the (partial) vector
>
> My initial thoughts for my case are to do:
>
> for each from id dictionary chunk
>   for each message id dictionary chunk
>     for each piece of text // i.e. the input seq. file, handled by Hadoop
>       create the vector
>
> The output would be, for each "from", a list of message ids that the person
> interacted with (initiated or replied). It's likely that some of this is moot
> b/c there will only ever be 1 or two chunks, esp. for the "froms".
>
> As you can no doubt see, that's a lot of loops and add on top of it you
> figure the hit ratio is pretty sparse. I believe the reason we do this in
> DictionaryVectorizer is so that we can use a predictable amount of memory in
> dealing with the dictionaries.
>
> Is there a better way of doing this?
>
> -Grant
