On Aug 31, 2011, at 11:43 AM, Ted Dunning wrote:

> My own recommendation here would be to run a word-count first and then
> create a dense dictionary using a sequential process. This sequential step
> should be very fast because the number of items is quite modest.
>
> I would create an additional dictionary at the same time for email
> addresses.
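[A minimal sketch of Ted's two-step idea, in plain Java rather than Mahout's actual classes; the class and method names here are hypothetical. Step one is the word-count (which would normally be the M/R pass); step two is the fast sequential pass that assigns consecutive dense ids.]

```java
import java.util.*;

public class DenseDictionaryBuilder {

    // Step 1 (normally done as an M/R word-count): term -> frequency.
    static Map<String, Integer> countTerms(List<String> tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (String t : tokens) {
            counts.merge(t, 1, Integer::sum);
        }
        return counts;
    }

    // Step 2 (sequential; fast because the number of items is modest):
    // assign consecutive dense ids, here in descending-frequency order.
    static Map<String, Integer> buildDictionary(Map<String, Integer> counts) {
        List<String> terms = new ArrayList<>(counts.keySet());
        terms.sort((a, b) -> counts.get(b) - counts.get(a));
        Map<String, Integer> dict = new LinkedHashMap<>();
        for (String t : terms) {
            dict.put(t, dict.size()); // next free dense id
        }
        return dict;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("a", "b", "a", "c", "a", "b");
        Map<String, Integer> dict = buildDictionary(countTerms(tokens));
        System.out.println(dict); // {a=0, b=1, c=2}
    }
}
```

The same sequential pass could build the second (email-address) dictionary at the same time by keeping two maps side by side.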
That's what I've done so far.

> Once you have the dictionary, you can distribute it to all nodes so that
> each node can extract a slice of the required matrix using a map-only
> program.

So, does the mapper just loop over the dictionary chunks, or are you saying
that the dictionary(ies) all fit into memory? I think this is what I was
doing, except that I modeled it after DictionaryVectorizer, which assumes
the dictionary may become too big to fit into memory and thus works on one
chunk at a time, even though in many cases there will only ever be one chunk.

> On Wed, Aug 31, 2011 at 8:21 AM, Grant Ingersoll <[email protected]> wrote:
>
>> On Aug 22, 2011, at 12:14 PM, Sean Owen wrote:
>>
>>> Here are two ideas:
>>>
>>> Recommend threads to users.
>>> Users are people, items are threads. This might suggest discussions
>>> you should be a party to, or that may be of interest since they concern
>>> people you often share a thread with. I think it has slightly more
>>> potential to be useful, but it's probably a non-starter in practice,
>>> as it's not generally true that you're welcome to see a thread you
>>> weren't copied on.
>>
>> This is the one I am doing. But it brings up an interesting question of
>> how best to convert the input to ids.
>>
>> To do this, I need to convert the strings (message id, from) to ids. Thus,
>> I more or less modeled the code after what DictionaryVectorizer does.
>> Creating the dictionaries is pretty straightforward, and we likely now have
>> an opportunity to make a general-purpose tool that does it in an M/R way.
>>
>> Digging in a bit more, I am now working on the actual matrix creation. In
>> my case, I have two dictionaries: message ids and from emails. In
>> DictionaryVectorizer (used to turn text into sparse vectors, which is
>> comparable to what I need to do), the matrix is created by running:
>>
>>   for each dictionary chunk
>>     for each piece of text   // i.e. the input sequence file, handled by Hadoop
>>       create the (partial) vector
>>
>> My initial thoughts for my case are to do:
>>
>>   for each from id dictionary chunk
>>     for each message id dictionary chunk
>>       for each piece of text   // i.e. the input seq. file, handled by Hadoop
>>         create the vector
>>
>> The output would be, for each "from", a list of message ids that the person
>> interacted with (initiated or replied to). It's likely that some of this is
>> moot b/c there will only ever be one or two chunks, esp. for the "froms".
>>
>> As you can no doubt see, that's a lot of loops, and on top of that the hit
>> ratio is pretty sparse. I believe the reason we do this in
>> DictionaryVectorizer is so that we can use a predictable amount of memory
>> in dealing with the dictionaries.
>>
>> Is there a better way of doing this?
>>
>> -Grant
>>
>> --------------------------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com
>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
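[If both dictionaries do fit in memory, as Ted's map-only suggestion implies, the nested chunk loops collapse to a single pass over the messages: load both dictionaries once, then emit one row per "from" id containing the message ids that person touched. A sketch in plain Java with hypothetical names and toy ids like "ted@x", not Mahout's or Hadoop's actual APIs; in a real mapper, buildRows' loop body would be the map() call and the dictionaries would load in setup().]

```java
import java.util.*;

public class InteractionMatrixSketch {

    // Hypothetical input record: who sent the message, and its message id.
    static final class Message {
        final String from;
        final String messageId;
        Message(String from, String messageId) {
            this.from = from;
            this.messageId = messageId;
        }
    }

    // Single pass over the input, both dictionaries resident in memory:
    // from-id -> sorted set of message ids the person interacted with.
    // One loop over the data, instead of one loop per chunk pair.
    static Map<Integer, SortedSet<Integer>> buildRows(
            List<Message> messages,
            Map<String, Integer> fromDict,
            Map<String, Integer> msgDict) {
        Map<Integer, SortedSet<Integer>> rows = new HashMap<>();
        for (Message m : messages) {
            Integer fromId = fromDict.get(m.from);
            Integer msgId = msgDict.get(m.messageId);
            if (fromId == null || msgId == null) {
                continue; // string not present in a dictionary
            }
            rows.computeIfAbsent(fromId, k -> new TreeSet<>()).add(msgId);
        }
        return rows;
    }

    public static void main(String[] args) {
        Map<String, Integer> fromDict = Map.of("ted@x", 0, "grant@x", 1);
        Map<String, Integer> msgDict = Map.of("m1", 0, "m2", 1);
        List<Message> msgs = List.of(
                new Message("ted@x", "m1"),
                new Message("grant@x", "m1"),
                new Message("grant@x", "m2"));
        System.out.println(buildRows(msgs, fromDict, msgDict));
    }
}
```

The chunked variant in DictionaryVectorizer buys a predictable memory bound at the cost of one extra pass per chunk; with only one or two chunks of "froms", the in-memory version should behave the same while dropping the outer loops.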
