list of documents sentiment analysis - problem with defining proper approach with Spark

xnts Wed, 17 Sep 2014 06:58:38 -0700

Hi,

For the last few days I have been working on an exercise in which I want to 
determine the sentiment of a set of articles.


As input I have an XML file with articles and the AFINN-111.txt file, which 
defines the sentiment of a few hundred words.

What I can do without any problem is load the data and put it into 
structures (classes for articles, and (word, sentiment-value) tuples for 
sentiments).

Then what I think I need to do (from a logical point of view) is:

foreach article
   articleWords = split the body by " "
   join the two lists (articleWords and sentimentWords) together
   calculate the sentiment for the article by summing up the sentiments of
   all the words it contains
dump the (article id, sentiment) pairs into a flat file
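
To make the steps above concrete, here is a minimal sketch in plain Scala (no 
Spark involved). The names SentimentSketch, Article, parseAfinn, score and 
toLines are made up for illustration, and I am assuming that each AFINN line 
is a word and an integer score separated by a tab:

```scala
object SentimentSketch {
  // Hypothetical article type; the real classes are not shown in this post.
  final case class Article(id: String, body: String)

  // Parse lines of the assumed form "word<TAB>score" into a lookup table.
  def parseAfinn(lines: Iterator[String]): Map[String, Int] =
    lines.flatMap { line =>
      line.split('\t') match {
        case Array(word, value) => Some(word -> value.trim.toInt)
        case _                  => None // skip lines that do not match the assumed format
      }
    }.toMap

  // Sum the scores of all words in the body; unknown words contribute 0.
  def score(body: String, sentiments: Map[String, Int]): Int =
    body.split(" ").map(w => sentiments.getOrElse(w, 0)).sum

  // One "id<TAB>sentiment" line per article, ready to be written to a flat file.
  def toLines(articles: Seq[Article], sentiments: Map[String, Int]): Seq[String] =
    articles.map(a => s"${a.id}\t${score(a.body, sentiments)}")
}
```

Since the sentiment table is small, I assume the same score function could be 
called inside a map over the articles RDD, with the table shipped to the 
workers via sc.broadcast, but I have not verified that.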

And this is where I am stuck :) I tried multiple combinations of 
map/reduceByKey, but all of them either didn't make much sense (like getting 
the sentiment for all articles combined) or resulted in errors saying that a 
function cannot be serialised. Today I even tried to implement this with a 
brute-force approach:

articles.foreach(calculateSentiment)

where calculateSentiment looks like this:

val words = sc.parallelize(post.body.split(" ")) // split body by " "
// create tuples of (word, #occurrences in article)
val wordPairs = words.map(w => (w, 1)).reduceByKey(_+_, 1)
val joinedValues = wordPairs.join(sentiments_) // join

But I had a feeling this was not the best idea, and I think I was right: the 
job has been running for about an hour (and I only have a few hundred GB to 
process).
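
For comparison, one shape I have been considering instead of a separate job 
per article is a single flat pipeline over all articles: emit (word, article 
id) pairs, join them with the sentiment tuples on the word, and sum per 
article id. The sketch below uses ordinary Scala collections as a stand-in for 
RDDs, so it only illustrates the data flow; FlatJoinSketch, Article and 
sentimentsByArticle are hypothetical names, and I assume the RDD equivalents 
would be flatMap, join and reduceByKey:

```scala
object FlatJoinSketch {
  // Hypothetical article type; the real classes are not shown in this post.
  final case class Article(id: String, body: String)

  def sentimentsByArticle(articles: Seq[Article],
                          sentiments: Seq[(String, Int)]): Map[String, Int] = {
    // One (word, articleId) pair per word occurrence, across all articles.
    val wordToArticle = articles.flatMap(a => a.body.split(" ").map(w => (w, a.id)))

    // Inner join on the word: keep only words that have a sentiment value.
    val byWord = sentiments.toMap
    val joined = wordToArticle.collect {
      case (w, id) if byWord.contains(w) => (id, byWord(w))
    }

    // Sum the values per article id (the role reduceByKey(_ + _) would play).
    joined.groupBy(_._1).map { case (id, pairs) => id -> pairs.map(_._2).sum }
  }
}
```

As with an inner join, articles that contain no scored words do not appear in 
the result at all.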

So the question is: what am I doing wrong? Any hints or suggestions for a 
direction would be really appreciated!

Thank you,
Leszek


