On Aug 31, 2011, at 11:43 AM, Ted Dunning wrote:

> My own recommendation here would be to run a word-count first and then
> create a dense dictionary using a sequential process. This sequential step
> should be very fast because the number of items is quite modest.
>
> I would create an additional dictionary at the same time for email
> addresses.
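[A minimal sketch of Ted's two-step idea, in plain Java rather than Mahout's actual classes; the class and method names here are hypothetical. Step one is the word-count (which would normally be the M/R pass); step two is the fast sequential pass that assigns consecutive dense ids.]

```java
import java.util.*;

public class DenseDictionaryBuilder {

    // Step 1 (normally done as an M/R word-count): term -> frequency.
    static Map<String, Integer> countTerms(List<String> tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (String t : tokens) {
            counts.merge(t, 1, Integer::sum);
        }
        return counts;
    }

    // Step 2 (sequential; fast because the number of items is modest):
    // assign consecutive dense ids, here in descending-frequency order.
    static Map<String, Integer> buildDictionary(Map<String, Integer> counts) {
        List<String> terms = new ArrayList<>(counts.keySet());
        terms.sort((a, b) -> counts.get(b) - counts.get(a));
        Map<String, Integer> dict = new LinkedHashMap<>();
        for (String t : terms) {
            dict.put(t, dict.size()); // next free dense id
        }
        return dict;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("a", "b", "a", "c", "a", "b");
        Map<String, Integer> dict = buildDictionary(countTerms(tokens));
        System.out.println(dict); // {a=0, b=1, c=2}
    }
}
```

The same sequential pass could build the second (email-address) dictionary at the same time by keeping two maps side by side.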
That's what I've done so far.

> Once you have the dictionary, you can distribute it to all nodes so that
> each node can extract a slice of the required matrix using a map-only
> program.

So, does the mapper just loop over the dictionary chunks, or are you saying
that the dictionary(ies) all fit into memory? I think this is what I was
doing, except that I modeled it after DictionaryVectorizer, which assumes
the dictionary may become too big to fit into memory and thus works on one
chunk at a time, even though in many cases there will only ever be one chunk.

> On Wed, Aug 31, 2011 at 8:21 AM, Grant Ingersoll <[email protected]> wrote:
>
>> On Aug 22, 2011, at 12:14 PM, Sean Owen wrote:
>>
>>> Here are two ideas:
>>>
>>> Recommend threads to users.
>>> Users are people, items are threads. This might suggest discussions
>>> you should be a party to, or that may be of interest since they concern
>>> people you often share a thread with. I think it has slightly more
>>> potential to be useful, but it's probably a non-starter in practice,
>>> as it's not generally true that you're welcome to see a thread you
>>> weren't copied on.
>>
>> This is the one I am doing. But it brings up an interesting question of
>> how best to convert the input to ids.
>>
>> To do this, I need to convert the strings (message id, from) to ids. Thus,
>> I more or less modeled the code after what DictionaryVectorizer does.
>> Creating the dictionaries is pretty straightforward, and we likely now have
>> an opportunity to make a general-purpose tool that does it in an M/R way.
>>
>> Digging in a bit more, I am now working on the actual matrix creation. In
>> my case, I have two dictionaries: message ids and from emails. In
>> DictionaryVectorizer (used to turn text into sparse vectors, which is
>> comparable to what I need to do), the matrix is created by running:
>>
>>   for each dictionary chunk
>>     for each piece of text   // i.e. the input sequence file, handled by Hadoop
>>       create the (partial) vector
>>
>> My initial thoughts for my case are to do:
>>
>>   for each from id dictionary chunk
>>     for each message id dictionary chunk
>>       for each piece of text   // i.e. the input seq. file, handled by Hadoop
>>         create the vector
>>
>> The output would be, for each "from", a list of message ids that the person
>> interacted with (initiated or replied to). It's likely that some of this is
>> moot b/c there will only ever be one or two chunks, esp. for the "froms".
>>
>> As you can no doubt see, that's a lot of loops, and on top of that the hit
>> ratio is pretty sparse. I believe the reason we do this in
>> DictionaryVectorizer is so that we can use a predictable amount of memory
>> in dealing with the dictionaries.
>>
>> Is there a better way of doing this?
>>
>> -Grant
>>
>> --------------------------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com
>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
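[If both dictionaries do fit in memory, as Ted's map-only suggestion implies, the nested chunk loops collapse to a single pass over the messages: load both dictionaries once, then emit one row per "from" id containing the message ids that person touched. A sketch in plain Java with hypothetical names and toy ids like "ted@x", not Mahout's or Hadoop's actual APIs; in a real mapper, buildRows' loop body would be the map() call and the dictionaries would load in setup().]

```java
import java.util.*;

public class InteractionMatrixSketch {

    // Hypothetical input record: who sent the message, and its message id.
    static final class Message {
        final String from;
        final String messageId;
        Message(String from, String messageId) {
            this.from = from;
            this.messageId = messageId;
        }
    }

    // Single pass over the input, both dictionaries resident in memory:
    // from-id -> sorted set of message ids the person interacted with.
    // One loop over the data, instead of one loop per chunk pair.
    static Map<Integer, SortedSet<Integer>> buildRows(
            List<Message> messages,
            Map<String, Integer> fromDict,
            Map<String, Integer> msgDict) {
        Map<Integer, SortedSet<Integer>> rows = new HashMap<>();
        for (Message m : messages) {
            Integer fromId = fromDict.get(m.from);
            Integer msgId = msgDict.get(m.messageId);
            if (fromId == null || msgId == null) {
                continue; // string not present in a dictionary
            }
            rows.computeIfAbsent(fromId, k -> new TreeSet<>()).add(msgId);
        }
        return rows;
    }

    public static void main(String[] args) {
        Map<String, Integer> fromDict = Map.of("ted@x", 0, "grant@x", 1);
        Map<String, Integer> msgDict = Map.of("m1", 0, "m2", 1);
        List<Message> msgs = List.of(
                new Message("ted@x", "m1"),
                new Message("grant@x", "m1"),
                new Message("grant@x", "m2"));
        System.out.println(buildRows(msgs, fromDict, msgDict));
    }
}
```

The chunked variant in DictionaryVectorizer buys a predictable memory bound at the cost of one extra pass per chunk; with only one or two chunks of "froms", the in-memory version should behave the same while dropping the outer loops.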
