That is correct. My problem is not the categories that were developed (which are meaningful, by the way), but the fact that a given document is not assigned to the proper (LDA-generated) category. The document-to-topic assignment is really bad...
On Thu, Feb 6, 2014 at 5:08 PM, Ted Dunning <[email protected]> wrote:

> I can't comment on the specific question that you ask, but it should not
> necessarily be expected that LDA will reconstruct the categories that you
> have in mind. It will develop categories that explain the data as well as
> it can, but those won't necessarily match the categories you intend.
>
> It is likely, however, that the topics that LDA derives would make a good
> set of features for a classifier.
>
>
> On Thu, Feb 6, 2014 at 2:56 PM, Stamatis Rapanakis
> <[email protected]> wrote:
>
> > I am trying to run the LDA algorithm. I can create meaningful topics,
> > but the document/topic assignment is of very poor quality.
> >
> > I have assigned 30 tweets to each of the following 10 topics:
> >
> > /grammy awards
> > /greek crisis
> > /greek islands
> > /premier inn
> > /premier league
> > /rihanna
> > /syria
> > /terrorism
> > /winter olympics
> > /winter sales
> >
> > I have a total of 300 tweets, and my purpose is to run the LDA
> > algorithm to see how well these tweets are assigned. For example, if
> > the number-of-topics parameter is set to 10, how closely do the
> > results match the original assignment?
> >
> > 1. I start by creating a file that contains the tweets in random order
> > (tweets.tsv). This file will be used to compare against the final
> > tweet/topic assignment.
> >
> > 2. I remove stopwords, URLs, and replies, and create a file with the
> > tweet text only (tweets_no_stopwords.tsv), one tweet (document) per
> > line. This will be the LDA input file.
> >
> > 3. I use some Java code to create a sequence file from
> > tweets_no_stopwords.tsv. I use a SequenceFile.Writer object with an
> > integer key and the tweet text as the value (the attached
> > tweets_no_stopwords.rar contains a chunk-0 file).
> >
> > By executing the command
> >
> >   mahout seqdumper -i tweets_no_stopwords/chunk-0
> >
> > the chunk-0 file contents appear correctly:
> >
> >   Key: 1: Value: #nowplaying Rihanna - Unfaithful !! trop belle !!
> >   Key: 2: Value: Grammy Awards Hairstyles: Memorable Moments
> >   ...
> >   Key: 299: Value: team scored goal matches! (Man City)
> >   Key: 300: Value: Rocsi Diaz Wearing 5th Mercer- Grammy Awards
> >
> > 4. I convert the data to vectors:
> >
> >   bin/mahout seq2sparse -i tweets_no_stopwords
> >     -o tweets_no_stopwords-vectors -ow
> >
> > (I review the output with: bin/mahout seqdumper -i
> > tweets_no_stopwords-vectors/tf-vectors/part-r-00000)
> >
> > 5. I convert the keys to IntWritables:
> >
> >   bin/mahout rowid -i tweets_no_stopwords-vectors/tf-vectors/
> >     -o tweets_no_stopwords-vectors/tf-vectors-cvb
> >
> > The created tf-vectors-cvb/docIndex and tf-vectors-cvb/matrix files
> > have keys from 0 to 299 (300 instances).
> >
> > 6. Finally, I run the LDA algorithm:
> >
> >   bin/mahout cvb -i tweets_no_stopwords-vectors/tf-vectors-cvb/matrix/
> >     -o lda_output/topicterm -mt lda_output/models
> >     -dt lda_output/docTopics -k 10 -x 40
> >     -dict tweets_no_stopwords-vectors/dictionary.file-0
> >
> > Note: I have to press Ctrl+C to stop the command execution (after it
> > has finished and the message "Program took XXXX ms" appears), but the
> > folders are created as expected.
> >
> > The topics created (lda_output/topicterm) seem fine.
> > I execute the command
> >
> >   bin/mahout vectordump -i lda_output/topicterm
> >     -d tweets_no_stopwords-vectors/dictionary.file-0 -dt sequencefile
> >     -c csv -p true -o p_term_topic.txt -sort lda_output/topicterm -vs 10
> >
> > and follow the steps described in this link (
> > http://sujitpal.blogspot.gr/2013/10/topic-modeling-with-mahout-on-amazon-emr.html
> > ) to create a file p_term_topic.txt and produce a report from the
> > output:
> >
> >   Topic 0: winter, sales, olympics, love, played, people, big, photo, sale, trail
> >   Topic 1: terrorism, grammy, awards, blaindianexus, 56th, balochistan, bla, rock, 2014, photos
> >   Topic 2: islands, greek, greece, travel, find, book, make, kea, days, holiday
> >   Topic 3: greek, crisis, β, lol, s, top, economic, tomorrow, job, eu
> >   Topic 4: grammys, found, style, red, hairdressers, room, mata, good, ty, walks
> >   Topic 5: sochi, team, time, all, usa, war, free, syria, sending, check
> >   Topic 6: syria, city, manchester, united, back, hit, watching, chelsea, week, matchday
> >   Topic 7: syria, support, olympic, economy, video, today, competition, arab, u.s, inn's
> >   Topic 8: rihanna, time, watch, unapologetic, follow, great, euro, congrats, bet, hotels
> >   Topic 9: premier, inn, league, stay, season, β, year, home, goals, won
> >
> > These results are good if you keep in mind the 10 categories the
> > tweets belonged to:
> >
> > /grammy awards
> > /greek crisis
> > /greek islands
> > /premier inn
> > /premier league
> > /rihanna
> > /syria
> > /terrorism
> > /winter olympics
> > /winter sales
> >
> > But the results in the folder lda_output/docTopics are really bad!
> > The command
> >
> >   bin/mahout seqdumper -i lda_output/docTopics/part-m-00000
> >
> > displays the results:
> >
> >   Key: 0: Value: {0:2.7932644743653218E-5,1:0.2582390963222569,
> >     2:0.03389979994715306,3:0.16986766822778876,4:0.5144069716184998,
> >     5:6.134281324000599E-5,6:0.022817498374309925,7:1.2427551415773865E-4,
> >     8:4.7632128287483606E-4,9:7.909325497553191E-5}
> >   Key: 1: Value: {0:0.004101560509130678,1:0.02531905947518225,
> >     2:0.14528444920763148,3:0.32904199007739116,4:0.06024210378042988,
> >     5:0.15510210839789676,6:0.0364093686560865,7:0.13256015086012124,
> >     8:0.0613456311044372,9:0.05059357793169288}
> >   Key: 2: Value: {0:2.093051210521087E-4,1:0.0242076645518674,
> >     2:0.12014785226603218,3:0.15589333731396188,4:0.022516226489811282,
> >     5:0.015141667919690474,6:0.08494844406302673,7:0.150039462386397,
> >     8:0.15927498562672762,9:0.2676210542614334}
> >
> > The highest probability is topic 4 for key 0, topic 3 for key 1, and
> > topic 9 for key 2, i.e.:
> >
> >   Tweet  Topic  Tweet text
> >   1      4      #nowplaying Rihanna Unfaithful !! trop belle !!
> >   2      3      Grammy Awards Hairstyles: Memorable Moments
> >   3      9      Preeminent #terrorism research center website. Check out: cc
> >
> > Am I missing something? Doesn't key 0 correspond to the first tweet
> > (document), key 1 to the second tweet, and so on?
> >
> > Thank you in advance for your responses.
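
For reference, the Java code described in step 3 above could look
something like the following minimal sketch. This is not the original
poster's code: the class name is hypothetical, the paths are the ones
from the thread, and it uses Text for both key and value (the post only
says the key is an integer, but seq2sparse expects Text keys).

    import java.io.BufferedReader;
    import java.io.FileReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    // Writes one tweet per line of tweets_no_stopwords.tsv into a sequence
    // file (tweets_no_stopwords/chunk-0) keyed by the 1-based line number,
    // which matches the "Key: 1" ... "Key: 300" seqdumper output above.
    public class TweetsToSequenceFile {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, new Path("tweets_no_stopwords/chunk-0"),
                Text.class, Text.class);
            BufferedReader in =
                new BufferedReader(new FileReader("tweets_no_stopwords.tsv"));
            try {
                String line;
                int key = 1;
                while ((line = in.readLine()) != null) {
                    writer.append(new Text(Integer.toString(key++)),
                                  new Text(line));
                }
            } finally {
                writer.close();
                in.close();
            }
        }
    }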
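
Similarly, a minimal sketch for cross-checking the docTopics keys from
steps 5 and 6, assuming (per the Mahout rowid job) that docIndex is a
SequenceFile mapping an IntWritable row id to the original Text key, and
that docTopics maps an IntWritable row id to a VectorWritable topic
distribution. The class name is hypothetical. Joining the two files shows
which original tweet key each docTopics row actually refers to.

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.mahout.math.VectorWritable;

    // Joins the rowid docIndex (matrix row -> original document key) with
    // the cvb docTopics output (matrix row -> p(topic|doc)) and prints the
    // most probable topic for each original key.
    public class DocTopicReport {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Load the row-id -> original-key mapping written by rowid.
            Map<Integer, String> docIndex = new HashMap<Integer, String>();
            SequenceFile.Reader idx = new SequenceFile.Reader(fs,
                new Path("tweets_no_stopwords-vectors/tf-vectors-cvb/docIndex"),
                conf);
            IntWritable row = new IntWritable();
            Text originalKey = new Text();
            while (idx.next(row, originalKey)) {
                docIndex.put(row.get(), originalKey.toString());
            }
            idx.close();

            // Walk the topic distributions and report the argmax per row.
            SequenceFile.Reader dt = new SequenceFile.Reader(fs,
                new Path("lda_output/docTopics/part-m-00000"), conf);
            IntWritable doc = new IntWritable();
            VectorWritable topics = new VectorWritable();
            while (dt.next(doc, topics)) {
                int best = topics.get().maxValueIndex();
                System.out.println(docIndex.get(doc.get()) + "\t" + best
                    + "\t" + topics.get().get(best));
            }
            dt.close();
        }
    }

If the printed original keys line up as 1, 2, 3, ... against rows
0, 1, 2, ..., then key 0 does correspond to the first tweet; if not, the
docIndex mapping explains the apparent mismatch.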
