Hi Grant,

On Jan 3, 2011, at 8:54am, Grant Ingersoll wrote:

Hi,

I wanted to pick people's brains a little bit on the subject of determining importance. This isn't necessarily Mahout related, although I think we have some tools that help in the area.

One of the emerging trends it seems these days with all our connectivity and content is a notion of importance/priority. Some examples:

1. Google now has "Priority Inbox" for instance and I think most would agree that for things like Twitter and Facebook it would be really nice if you could separate out the Important updates/people from the less important.
2. Identifying important phrases, etc. in text across a corpus.
3. One of the things I think most researchers do when exploring a new topic is to identify the one or two seminal papers in the field, read them, and then read the ones that cite those papers and so on.
4. Take in all the day's news and figure out what the key articles are to read (in some sense it's picking the most representative document in a cluster) or that the article talking about raising Federal income taxes is likely more important than the one talking about raising local sales tax (or vice versa!)
5. PageRank, TextRank, etc. and other approaches to calculating authority

What I'm looking for is help in researching this area. Is there a name for this (sub-)field (importance theory? prioritization theory?), particularly in machine learning and NLP, that is geared towards this? I realize some (most) of these problems can be solved with classifiers, amongst other things like graph algorithms (particularly ones that use the social graph), but it also seems like the area is bigger than a particular implementation, so I wanted to hear what others thought. How would you go about solving these problems? Do you have any pointers to useful references on the subject (theoretical or practical)? What other examples have you run up against?

For what it's worth, we took a run at this issue last February...

1. Collect all of your tweets, and the tweets of people you follow, where the tweet has a URL.

2. Assign importance by author: your own tweets rank high, and tweets from the people you follow are weighted by their # of followers.

3. Fetch and parse referenced pages.

4. Use Mahout's kmeans to generate 50 clusters or so.

5. Take the top clusters (up to 5), where "top" means a tight grouping and significant number of members.

6. Use these top clusters to filter all tweets in the firehose, to generate a ranked list of "important" tweets.
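
A rough Python sketch of how steps 2, 5 and 6 might fit together. It is illustrative only and is not the code from the original effort: the log-scaled follower weight, the `min_size` cutoff, the use of mean distance to the centroid as the "tightness" measure, the cosine-similarity threshold, and all function and field names are assumptions.

```python
import math
from dataclasses import dataclass


@dataclass
class Cluster:
    cluster_id: int
    size: int             # number of member pages
    mean_distance: float  # average member-to-centroid distance (smaller = tighter)


def author_weight(is_self, followers):
    """Step 2: own tweets get the top weight; others scale with follower count."""
    if is_self:
        return 1.0
    # Assumed log scaling so very large accounts do not dominate; capped below 1.0.
    return min(0.9, math.log10(1 + followers) / 10.0)


def top_clusters(clusters, max_clusters=5, min_size=10):
    """Step 5: keep sizeable clusters, then rank by tightness and size."""
    candidates = [c for c in clusters if c.size >= min_size]
    ranked = sorted(candidates, key=lambda c: (c.mean_distance, -c.size))
    return ranked[:max_clusters]


def cosine(a, b):
    """Cosine similarity between two sparse term vectors stored as dicts."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_tweets(tweets, centroids, threshold=0.3):
    """Step 6: score each (tweet_id, vector) by its best match to a top centroid."""
    scored = []
    for tweet_id, vector in tweets:
        score = max((cosine(vector, c) for c in centroids), default=0.0)
        if score >= threshold:
            scored.append((tweet_id, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

In this sketch the centroids passed to `rank_tweets` would be the centroid vectors of whatever `top_clusters` returns; the clustering itself (step 4) is left to Mahout.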

The main challenge here was getting good clustering results. We tried a number of different sparsification techniques on the page data, but by our target deadline we still weren't getting great results in identifying nice, crisp "topics" that were likely to be of interest. It feels like something that would eventually have gotten good enough if we'd spent a lot more time playing with all of the combinations, but we stuck to our timeline and wound up putting it on the shelf.
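
For readers unfamiliar with the term, here is one common way to sparsify page term vectors before clustering, sketched in Python. The message does not say which techniques were actually tried, so everything here is an assumption: pruning by document frequency, keeping the top TF-IDF terms per page, and the `min_df`, `max_df_ratio` and `top_n` parameters are all hypothetical choices.

```python
import math
from collections import Counter


def sparsify(docs, min_df=2, max_df_ratio=0.5, top_n=3):
    """Reduce tokenized docs to small term-weight dicts.

    Drops terms that are too rare or too common across the corpus,
    then keeps only the top_n TF-IDF terms in each document.
    """
    n = len(docs)
    # Document frequency: in how many docs does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    keep = {t for t, c in df.items() if c >= min_df and c / n <= max_df_ratio}

    vectors = []
    for doc in docs:
        tf = Counter(t for t in doc if t in keep)
        weights = {t: f * math.log(n / df[t]) for t, f in tf.items()}
        # Sort by weight (descending), breaking ties alphabetically.
        top = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
        vectors.append(dict(top))
    return vectors
```

Tightening or loosening the three parameters changes how many terms survive per page, which is the kind of combination space the paragraph above describes exploring.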

-- Ken
--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g




