Re: Email and Collab. Filtering

Ted Dunning Wed, 31 Aug 2011 08:44:31 -0700

My own recommendation here would be to run a word-count first and then
create a dense dictionary using a sequential process.  This sequential step
should be very fast because the number of items is quite modest.


I would create an additional dictionary at the same time for email
addresses.
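
A rough sketch of the dictionary step, in Python rather than Hadoop code. The in-memory Counter stands in for the output of the word-count job, and build_dense_dictionary and the sample records are made-up names for illustration, not anything in Mahout:

```python
from collections import Counter


def build_dense_dictionary(keys):
    """Count occurrences, then assign consecutive integer ids starting at 0."""
    counts = Counter(keys)  # stand-in for the distributed word-count output
    # Sort by descending count, then by key, so the id assignment is deterministic.
    ordered = sorted(counts, key=lambda k: (-counts[k], k))
    return {key: idx for idx, key in enumerate(ordered)}


# Hypothetical (message id, from address) pairs standing in for parsed mail.
records = [
    ("<msg-1@example.org>", "alice@example.org"),
    ("<msg-1@example.org>", "bob@example.org"),
    ("<msg-2@example.org>", "alice@example.org"),
]

# One dictionary for message ids and one for email addresses, built in the same pass.
message_ids = build_dense_dictionary(m for m, _ in records)
from_ids = build_dense_dictionary(f for _, f in records)
```

The ids are dense (0..n-1 with no gaps), so they can be used directly as row and column indexes.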

Once you have the dictionary, you can distribute it to all nodes so that
each node can extract a slice of the required matrix using a map-only
program.
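
Roughly, each mapper could then do something like the following. This is again a Python sketch under assumptions: map_slice and to_sparse_rows are hypothetical helpers, the dictionaries are small literals standing in for the distributed copies, and the list called split stands in for one input split:

```python
from collections import defaultdict


def map_slice(records, from_ids, message_ids):
    """Map-only step: translate one input split into (row, column) entries."""
    for message, sender in records:
        # Both lookups hit the in-memory dictionaries, so no reducer or join is needed.
        yield from_ids[sender], message_ids[message]


def to_sparse_rows(entries):
    """Group the entries of one split into partial sparse rows."""
    rows = defaultdict(set)
    for row, col in entries:
        rows[row].add(col)
    return {row: sorted(cols) for row, cols in rows.items()}


# Dictionaries as they would look after being shipped to the node.
from_ids = {"alice@example.org": 0, "bob@example.org": 1}
message_ids = {"<msg-1@example.org>": 0, "<msg-2@example.org>": 1}

# One hypothetical input split of (message id, from address) pairs.
split = [
    ("<msg-1@example.org>", "alice@example.org"),
    ("<msg-2@example.org>", "alice@example.org"),
    ("<msg-1@example.org>", "bob@example.org"),
]

slice_rows = to_sparse_rows(map_slice(split, from_ids, message_ids))
```

Each split yields only partial rows for the senders it happens to contain, which is what makes it a slice of the matrix. The whole dictionary is held in memory on every node, so this only works while the dictionaries stay modest in size.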

On Wed, Aug 31, 2011 at 8:21 AM, Grant Ingersoll <[email protected]> wrote:

>
> On Aug 22, 2011, at 12:14 PM, Sean Owen wrote:
>
> > Here are two ideas:
> >
> > Recommend threads to users.
> > Users are people, items are threads. This might suggest discussions
> > you should be a party to, or may be of interest since it concerns
> > people you often share a thread with. I think it has slightly more
> > potential to be useful, but, probably a non-starter in practice as
> > it's not generally true that you're welcome to see a thread you
> > weren't copied on.
>
> This is the one I am doing.  But it brings up an interesting question in
> how best to convert the input to ids.
>
> To do this, I need to convert the strings (message id, from) to ids.  Thus,
> I more or less modeled the code after what DictionaryVectorizer does.
> Creating the dictionaries is pretty straightforward and we likely now have
> an opportunity to make a general-purpose tool that does it in an M/R way.
>
> Digging in a bit more, I am now working on doing the actual matrix
> creation.  In my case, I have two dictionaries:  message ids and from
> emails.  In DictionaryVectorizer (used to take text to sparse vectors, which
> is comparable to what I need to do), it creates the matrix by running:
>
> for each dictionary chunk
>        for each piece of text  // i.e. the input sequence file, handled by Hadoop
>                create the (partial) vector
>
> My initial thoughts for my case are to do:
>
> for each from id dictionary chunk
>        for each message id dictionary chunk
>                for each piece of text  // i.e. the input seq. file, handled by Hadoop
>                        create the vector
>
> The output would be, for each "from", a list of message ids that the person
> interacted with (initiated or replied to). It's likely that some of this is
> moot b/c there will only ever be 1 or 2 chunks, esp. for the "froms".
>
> As you can no doubt see, that's a lot of loops, and on top of that the hit
> ratio is likely to be pretty low.  I believe the reason we do this in
> DictionaryVectorizer is so that we can use a predictable amount of memory in
> dealing with the dictionaries.
>
> Is there a better way of doing this?
>
> -Grant
