On Aug 22, 2011, at 12:14 PM, Sean Owen wrote:
> Here are two ideas:
>
> Recommend threads to users.
> Users are people, items are threads. This might suggest discussions
> you should be a party to, or may be of interest since it concerns
> people you often share a thread with. I think it has slightly more
> potential to be useful, but, probably a non-starter in practice as
> it's not generally true that you're welcome to see a thread you
> weren't copied on.
This is the one I am doing, but it raises an interesting question about how
best to convert the input to ids.
To do this, I need to convert the strings (message id, from) to ids. Thus, I
more or less modeled the code after what DictionaryVectorizer does. Creating
the dictionaries is pretty straightforward and we likely now have an
opportunity to make a general purpose tool that does it in an M/R way.
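
As a rough, in-memory sketch of the dictionary step (not the M/R version, and
not what DictionaryVectorizer actually does internally), each distinct string
just gets the next integer id. The sample records and the build_dictionary
name are made up for illustration:

```python
def build_dictionary(values):
    """Assign consecutive integer ids to distinct strings, in first-seen order."""
    dictionary = {}
    for value in values:
        if value not in dictionary:
            dictionary[value] = len(dictionary)
    return dictionary


# Hypothetical (message id, from) pairs standing in for the real input.
records = [
    ("<msg-1@example.org>", "alice@example.org"),
    ("<msg-2@example.org>", "bob@example.org"),
    ("<msg-1@example.org>", "bob@example.org"),
]

# One dictionary per field: message ids and from addresses.
message_ids = build_dictionary(message for message, _ in records)
from_ids = build_dictionary(sender for _, sender in records)
```
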
Digging in a bit more, I am now working on doing the actual matrix creation.
In my case, I have two dictionaries: message ids and from emails. In
DictionaryVectorizer (used to take text to sparse vectors, which is comparable
to what I need to do), it creates the matrix by running:
  for each dictionary chunk
    for each piece of text  // i.e. the input sequence file, handled by Hadoop
      create the (partial) vector
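
To make that loop concrete, here is a small single-process sketch of the idea
as I understand it: only one dictionary chunk is held at a time, each document
yields a partial vector per chunk, and the partials are summed. The function
names and the Counter-based "vector" are my own stand-ins, not Mahout classes:

```python
from collections import Counter


def chunks(dictionary, size):
    """Yield the dictionary in pieces of at most `size` entries, ordered by id."""
    items = sorted(dictionary.items(), key=lambda kv: kv[1])
    for start in range(0, len(items), size):
        yield dict(items[start:start + size])


def partial_vector(tokens, chunk):
    """Count only the tokens whose ids live in this dictionary chunk."""
    vector = Counter()
    for token in tokens:
        if token in chunk:
            vector[chunk[token]] += 1
    return vector


def vectorize(docs, dictionary, chunk_size):
    """One pass over all docs per chunk; partial vectors are merged by addition."""
    vectors = [Counter() for _ in docs]
    for chunk in chunks(dictionary, chunk_size):
        for i, tokens in enumerate(docs):
            vectors[i].update(partial_vector(tokens, chunk))
    return [dict(v) for v in vectors]


docs = [["a", "b", "a"], ["c"]]
dictionary = {"a": 0, "b": 1, "c": 2}
vectors = vectorize(docs, dictionary, chunk_size=2)
```

The result does not depend on the chunk size; the chunking only bounds how
much of the dictionary is in memory at once, at the cost of re-reading the
input once per chunk.
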
My initial thoughts for my case are to do:
  for each from id dictionary chunk
    for each message id dictionary chunk
      for each piece of text  // i.e. the input seq. file, handled by Hadoop
        create the vector
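
Sketched in plain Python (again just an illustration with made-up names and
data, no Hadoop involved), the two-dictionary version would look roughly like
this. Note that the input is scanned once per (from chunk, message chunk) pair,
which is where the cost comes from:

```python
def chunks(dictionary, size):
    """Yield the dictionary in pieces of at most `size` entries, ordered by id."""
    items = sorted(dictionary.items(), key=lambda kv: kv[1])
    for start in range(0, len(items), size):
        yield dict(items[start:start + size])


def build_from_vectors(records, from_dict, message_dict, chunk_size):
    """Map each from id to the sorted list of message ids that sender touched."""
    result = {}
    for from_chunk in chunks(from_dict, chunk_size):
        for message_chunk in chunks(message_dict, chunk_size):
            # Full pass over the input for every pair of chunks.
            for message, sender in records:
                if sender in from_chunk and message in message_chunk:
                    row = result.setdefault(from_chunk[sender], set())
                    row.add(message_chunk[message])
    return {from_id: sorted(ids) for from_id, ids in result.items()}


# Hypothetical input and dictionaries.
records = [
    ("<msg-1@example.org>", "alice@example.org"),
    ("<msg-2@example.org>", "bob@example.org"),
    ("<msg-1@example.org>", "bob@example.org"),
]
from_dict = {"alice@example.org": 0, "bob@example.org": 1}
message_dict = {"<msg-1@example.org>": 0, "<msg-2@example.org>": 1}

from_vectors = build_from_vectors(records, from_dict, message_dict, chunk_size=1)
```
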
The output would be, for each "from", a list of the message ids that the person
interacted with (initiated or replied to). It's likely that some of this is
moot because there will only ever be one or two chunks, esp. for the "froms".
As you can no doubt see, that's a lot of loops, and on top of that the hit
ratio is pretty low, since the data is sparse. I believe the reason we do this
in DictionaryVectorizer is so that we can use a predictable amount of memory in
dealing with the dictionaries.
Is there a better way of doing this?
-Grant