That is correct. My problem is not the categories that were developed (which
are meaningful, by the way) but the fact that a given document is not assigned
to the proper (LDA-generated) category. The document-to-topic assignment
is really bad...


On Thu, Feb 6, 2014 at 5:08 PM, Ted Dunning <[email protected]> wrote:

> I can't comment on the specific question that you ask, but it should not
> necessarily be expected that LDA will reconstruct the categories that you
> have in mind. It will develop categories that explain the data as well as
> it can, but those won't necessarily match the categories you intend.
>
> It is likely, however, that the topics that LDA derives would make a good
> set of features for a classifier.
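>
> For example (just a sketch on my part, with hypothetical labels, and
> assuming the per-document topic proportions have already been read from
> the docTopics output), Mahout's own SGD classifier could consume those
> proportions directly as features:
>
> import org.apache.mahout.classifier.sgd.L1;
> import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
> import org.apache.mahout.math.DenseVector;
> import org.apache.mahout.math.Vector;
>
> public class TopicsAsFeatures {
>   public static void main(String[] args) {
>     // 10 hand-made categories as targets, 10 LDA topics as features
>     OnlineLogisticRegression learner =
>         new OnlineLogisticRegression(10, 10, new L1());
>     // placeholder data: in reality this would be one p(topic|doc) row
>     // per tweet from docTopics, plus each tweet's known category index
>     Vector topics = new DenseVector(new double[] {
>         0.02, 0.26, 0.03, 0.17, 0.51, 0.0, 0.02, 0.0, 0.0, 0.0});
>     int label = 4;
>     learner.train(label, topics);
>     System.out.println("predicted category: "
>         + learner.classifyFull(topics).maxValueIndex());
>   }
> }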
>
>
>
>
> On Thu, Feb 6, 2014 at 2:56 PM, Stamatis Rapanakis
> <[email protected]> wrote:
>
> >   I am trying to run the LDA algorithm. I can create meaningful topics
> > but the document/topic assignment is of very bad quality.
> >
> >   I have assigned 30 tweets to each of the following 10 topics:
> >
> > /grammy awards
> > /greek crisis
> > /greek islands
> > /premier inn
> > /premier league
> > /rihanna
> > /syria
> > /terrorism
> > /winter olympics
> > /winter sales
> >
> >   I have a total of 300 tweets, and my purpose is to run the LDA
> > algorithm to see how well these tweets are assigned: for example, with
> > the number-of-topics parameter set to 10, how closely do the resulting
> > assignments match the original ones?
> >
> > 1. I start by creating a file that contains the tweets in random order
> > (*tweets.tsv*). This file will be used to compare against the final
> > tweet-to-topic assignments.
> >
> > 2. I remove stopwords, URLs, and replies, and create a file containing
> > only the tweet text (*tweets_no_stopwords.tsv*), one tweet (document)
> > per line. This will be the LDA input file; a rough sketch of the
> > preprocessing is shown below.
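> >
> > (A minimal sketch of what this step might look like; the stopword list
> > here is only illustrative:)
> >
> > import java.util.Arrays;
> > import java.util.HashSet;
> > import java.util.Set;
> >
> > public class CleanTweet {
> >   // tiny illustrative stopword list; a real one is much longer
> >   private static final Set<String> STOPWORDS =
> >       new HashSet<String>(Arrays.asList("the", "a", "an", "and", "to"));
> >
> >   public static String clean(String tweet) {
> >     String s = tweet
> >         .replaceAll("https?://\\S+", " ")  // drop URLs
> >         .replaceAll("@\\w+", " ");         // drop @replies
> >     StringBuilder out = new StringBuilder();
> >     for (String token : s.split("\\s+")) {
> >       if (!token.isEmpty() && !STOPWORDS.contains(token.toLowerCase())) {
> >         out.append(token).append(' ');
> >       }
> >     }
> >     return out.toString().trim();
> >   }
> > }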
> >
> > 3. I use some Java code to create a sequence file from
> > *tweets_no_stopwords.tsv*. I use a SequenceFile.Writer object whose key
> > is an integer and whose value is the tweet text (extract the attached
> > tweets_no_stopwords.rar, which contains a chunk-0 file). A sketch of
> > the writer follows.
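> >
> > (A minimal sketch of such a writer, assuming Text keys and values,
> > which is what seq2sparse expects:)
> >
> > import java.io.BufferedReader;
> > import java.io.FileReader;
> >
> > import org.apache.hadoop.conf.Configuration;
> > import org.apache.hadoop.fs.FileSystem;
> > import org.apache.hadoop.fs.Path;
> > import org.apache.hadoop.io.SequenceFile;
> > import org.apache.hadoop.io.Text;
> >
> > public class TweetsToSequenceFile {
> >   public static void main(String[] args) throws Exception {
> >     Configuration conf = new Configuration();
> >     FileSystem fs = FileSystem.get(conf);
> >     Path out = new Path("tweets_no_stopwords/chunk-0");
> >     SequenceFile.Writer writer =
> >         new SequenceFile.Writer(fs, conf, out, Text.class, Text.class);
> >     BufferedReader in =
> >         new BufferedReader(new FileReader("tweets_no_stopwords.tsv"));
> >     try {
> >       String line;
> >       int key = 1;  // one tweet per line; the key is the line number
> >       while ((line = in.readLine()) != null) {
> >         writer.append(new Text(Integer.toString(key++)), new Text(line));
> >       }
> >     } finally {
> >       in.close();
> >       writer.close();
> >     }
> >   }
> > }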
> >
> >  Executing the command *mahout seqdumper -i
> > tweets_no_stopwords/chunk-0* shows the chunk-0 file contents correctly:
> >
> > *Key: 1: Value: #nowplaying Rihanna - Unfaithful !! 💙 trop belle !!*
> > *Key: 2: Value: Grammy Awards Hairstyles: Memorable Moments*
> > *...*
> > *Key: 299: Value: team scored goal matches! (Man City)*
> > *Key: 300: Value: Rocsi Diaz Wearing 5th Mercer- Grammy Awards*
> >
> > 4. I convert the data to vectors:
> >
> > bin/mahout seq2sparse -i tweets_no_stopwords -o
> > tweets_no_stopwords-vectors -ow
> >
> > (I review the file with the command: *bin/mahout seqdumper -i
> > tweets_no_stopwords-vectors/tf-vectors/part-r-00000*)
> >
> > 5. I convert keys to IntWritables
> >
> > bin/mahout rowid -i tweets_no_stopwords-vectors/tf-vectors/ -o
> > tweets_no_stopwords-vectors/tf-vectors-cvb
> >
> > The created tf-vectors-cvb/docIndex and tf-vectors-cvb/matrix files
> > have keys from 0 to 299 (300 instances). The docIndex mapping can be
> > checked as shown below.
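> >
> > (For reference, the row-id-to-original-key mapping that rowid writes
> > can be inspected with *bin/mahout seqdumper -i
> > tweets_no_stopwords-vectors/tf-vectors-cvb/docIndex*; each entry
> > should pair a matrix row id with the original sequence file key.)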
> >
> > 6. Finally I run the LDA algorithm:
> >
> > *bin/mahout cvb -i tweets_no_stopwords-vectors/tf-vectors-cvb/matrix/ -o
> > lda_output/topicterm -mt lda_output/models -dt lda_output/docTopics -k 10
> > -x 40 -dict tweets_no_stopwords-vectors/dictionary.file-0*
> >
> > Note: I have to enter Ctrl+C to stop the command execution (after it
> > has finished and the message "Program took XXXX ms" appears). But the
> > folders are created as expected.
> >
> > The topics created (lda_output/topicterm) seem fine. I execute the
> > command:
> >
> > *bin/mahout vectordump -i lda_output/topicterm -d
> > tweets_no_stopwords-vectors/dictionary.file-0 -dt sequencefile -c csv -p
> > true -o p_term_topic.txt -sort lda_output/topicterm -vs 10*
> >
> > and follow the steps described in this link
> > (http://sujitpal.blogspot.gr/2013/10/topic-modeling-with-mahout-on-amazon-emr.html)
> > to create a file *p_term_topic.txt* and show a report with the output.
> >
> > *Topic 0*: winter, sales, olympics, love, played, people, big, photo,
> > sale, trail
> > *Topic 1*: terrorism, grammy, awards, blaindianexus, 56th, balochistan,
> > bla, rock, 2014, photos
> > *Topic 2*: islands, greek, greece, travel, find, book, make, kea, days,
> > holiday
> > *Topic 3*: greek, crisis, β, lol, s, top, economic, tomorrow, job, eu
> > *Topic 4*: grammys, found, style, red, hairdressers, room, mata, good,
> > ty, walks
> > *Topic 5*: sochi, team, time, all, usa, war, free, syria, sending, check
> > *Topic 6*: syria, city, manchester, united, back, hit, watching,
> > chelsea, week, matchday
> > *Topic 7*: syria, support, olympic, economy, video, today, competition,
> > arab, u.s, inn's
> > *Topic 8*: rihanna, time, watch, unapologetic, follow, great, euro,
> > congrats, bet, hotels
> > *Topic 9*: premier, inn, league, stay, season, β, year, home, goals, won
> >
> >
> >
> > These results are good if you keep in mind the (10) categories the
> > tweets originally belonged to:
> >
> > /grammy awards
> > /greek crisis
> > /greek islands
> > /premier inn
> > /premier league
> > /rihanna
> > /syria
> > /terrorism
> > /winter olympics
> > /winter sales
> >
> > But the results in the folder *lda_output/docTopics* are really bad!
> >
> > bin/mahout seqdumper -i lda_output/docTopics/part-m-00000  (to display
> > the results)
> >
> > Key: 0: Value: {0:2.7932644743653218E-5,1:0.2582390963222569,2:0.03389979994715306,3:0.16986766822778876,4:*0.5144069716184998*,5:6.134281324000599E-5,6:0.022817498374309925,7:1.2427551415773865E-4,8:4.7632128287483606E-4,9:7.909325497553191E-5}
> > Key: 1: Value: {0:0.004101560509130678,1:0.02531905947518225,2:0.14528444920763148,3:*0.32904199007739116*,4:0.06024210378042988,5:0.15510210839789676,6:0.0364093686560865,7:0.13256015086012124,8:0.0613456311044372,9:0.05059357793169288}
> > Key: 2: Value: {0:2.093051210521087E-4,1:0.0242076645518674,2:0.12014785226603218,3:0.15589333731396188,4:0.022516226489811282,5:0.015141667919690474,6:0.08494844406302673,7:0.150039462386397,8:0.15927498562672762,9:*0.2676210542614334*}
> >
> >
> > Tweet 1 -> Topic 4: #nowplaying Rihanna Unfaithful !! 💙 trop belle !!
> > Tweet 2 -> Topic 3: Grammy Awards Hairstyles: Memorable Moments
> > Tweet 3 -> Topic 9: Preeminent #terrorism research center website.
> > Check out: cc
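> >
> > (For reference, a minimal sketch of how the topic column above can be
> > derived, assuming docTopics holds IntWritable keys and VectorWritable
> > values, which is what cvb writes:)
> >
> > import org.apache.hadoop.conf.Configuration;
> > import org.apache.hadoop.fs.FileSystem;
> > import org.apache.hadoop.fs.Path;
> > import org.apache.hadoop.io.IntWritable;
> > import org.apache.hadoop.io.SequenceFile;
> > import org.apache.mahout.math.VectorWritable;
> >
> > public class DumpMaxTopic {
> >   public static void main(String[] args) throws Exception {
> >     Configuration conf = new Configuration();
> >     Path path = new Path("lda_output/docTopics/part-m-00000");
> >     SequenceFile.Reader reader =
> >         new SequenceFile.Reader(FileSystem.get(conf), path, conf);
> >     IntWritable docId = new IntWritable();
> >     VectorWritable topics = new VectorWritable();
> >     while (reader.next(docId, topics)) {
> >       // print the most probable topic for each matrix row
> >       System.out.println(docId.get() + " -> topic "
> >           + topics.get().maxValueIndex());
> >     }
> >     reader.close();
> >   }
> > }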
> >
> >
> >  Am I missing something? Doesn't key 0 correspond to the first tweet
> > (document), key 1 to the second tweet, and so on?
> >
> >   Thank you in advance for your responses.
> >
> >
> >
>
