Markus,

Sorry for the delay in responding. Below is another approach that involves a JOIN between the term frequency table and an LDA output table. It is a bit simpler, at least for me.
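First, a note on the unnest/svec query from your last mail: if you want the docid-wordid-topicid triples straight from the LDA output table, another option is to expand the parallel arrays positionally instead of going through madlib.svec. This is only a rough, untested sketch; it uses the lda_output column names from your example, assumes topic_assignment lists one 0-based topic per token in the same order your svec cast produces (hence the +1), and needs PostgreSQL 9.3+ for LATERAL:

SELECT docid,
       words[s.i]                  AS wordid,
       topic_assignment[s.pos] + 1 AS topicid   -- shift 0-based topic ids to 1-based
FROM lda_output,
     LATERAL (
         -- one row per token: repeat each array position counts[i] times,
         -- numbering the tokens so they line up with topic_assignment
         SELECT i, row_number() OVER (ORDER BY i) AS pos
         FROM generate_subscripts(words, 1) AS i,
              generate_series(1, counts[i])
     ) s
WHERE docid = 6
ORDER BY s.pos;

If you also want the word text, the vocabulary table produced by term_frequency should join on wordid.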
Regarding the indexing, you are correct that it is inconsistent. It seems the term frequency module uses 0-based indexing (e.g., docid) whereas LDA uses 1-based indexing (e.g., topicid). I created a JIRA to address this: https://issues.apache.org/jira/browse/MADLIB-1160. Probably the best way is to standardize on 1-based indexing.

----

1) Create data set

DROP TABLE IF EXISTS documents;
CREATE TABLE documents(docid INT4, contents TEXT);
INSERT INTO documents VALUES
(0, 'Statistical topic models are a class of Bayesian latent variable models, originally developed for analyzing the semantic content of large document corpora.'),
(1, 'By the late 1960s, the balance between pitching and hitting had swung in favor of the pitchers. In 1968 Carl Yastrzemski won the American League batting title with an average of just .301, the lowest in history.'),
(2, 'Machine learning is closely related to and often overlaps with computational statistics; a discipline that also specializes in prediction-making. It has strong ties to mathematical optimization, which deliver methods, theory and application domains to the field.'),
(3, 'California''s diverse geography ranges from the Sierra Nevada in the east to the Pacific Coast in the west, from the Redwood–Douglas fir forests of the northwest, to the Mojave Desert areas in the southeast. The center of the state is dominated by the Central Valley, a major agricultural area. ');

2) Convert docs to a list of words

ALTER TABLE documents ADD COLUMN words TEXT[];
UPDATE documents SET words = regexp_split_to_array(lower(contents), E'[\\s+\\.\\,]');

3) Term frequency

DROP TABLE IF EXISTS my_training, my_training_vocabulary;
SELECT madlib.term_frequency('documents', 'docid', 'words', 'my_training', TRUE);
SELECT * FROM my_training ORDER BY docid LIMIT 20;

 docid | wordid | count
-------+--------+-------
     0 |     57 |     1
     0 |     86 |     1
     0 |      4 |     1
     0 |     55 |     1
     0 |     69 |     2
     0 |     81 |     1
     0 |     30 |     1
     0 |     33 |     1
     0 |     36 |     1
     0 |     43 |     1
     0 |     25 |     1
     0 |     65 |     2
     0 |     72 |     1
     0 |      9 |     1
     0 |      0 |     2
     0 |     29 |     1
     0 |     18 |     1
     0 |     12 |     1
     0 |     96 |     1
     0 |     91 |     1
(20 rows)

4) Run LDA

DROP TABLE IF EXISTS my_model, my_outdata;
SELECT madlib.lda_train( 'my_training',   -- Data table
                         'my_model',      -- Model table
                         'my_outdata',    -- Output table
                         104,             -- Vocab size
                         5,               -- Num topics
                         10,              -- Num iterations
                         5,               -- Alpha (Dirichlet param)
                         0.01             -- Beta (Dirichlet param)
                       );

5) Get topic description by top-k words

DROP TABLE IF EXISTS my_topic_desc;
SELECT madlib.lda_get_topic_desc( 'my_model',
                                  'my_training_vocabulary',
                                  'my_topic_desc',
                                  15);

6) Join topic description with term frequency

SELECT docid, wordid, topicid
FROM my_topic_desc
JOIN my_training USING (wordid)
WHERE docid = 3
ORDER BY docid, wordid;

 docid | wordid | topicid
-------+--------+---------
     3 |      0 |       3
     3 |      4 |       2
     3 |      5 |       3
     3 |     13 |       2
     3 |     14 |       5
     3 |     20 |       5
     3 |     21 |       2
     3 |     23 |       2
     3 |     24 |       3
     3 |     27 |       2
     3 |     32 |       3
     3 |     35 |       3
     3 |     38 |       5
     3 |     39 |       5
etc.

Frank

On Tue, Aug 29, 2017 at 11:34 PM, Markus Paaso <markus.pa...@gmail.com> wrote:

> Hi Frank,
>
> I want to explain the LDA results for a single document (in this case for
> docid = 6) by binding topicid to each wordid in the document.
> The SQL query below gives exactly what I want, but I am not sure if that is
> the most effective way to build docid-wordid-topicid triples.
>
> SELECT docid, unnest((counts::text || ':' ||
> words::text)::madlib.svec::float[]) AS wordid,
> unnest(topic_assignment) + 1 AS topicid
> FROM lda_output WHERE docid = 6;
>
> I have trained LDA with 'lda_output' as the output_data_table argument of
> madlib.lda_train.
>
> Regards, Markus
>
> 2017-08-28 23:19 GMT+03:00 Frank McQuillan <fmcquil...@pivotal.io>:
>
>> Markus,
>>
>> Please see example 4 in the user docs
>> http://madlib.apache.org/docs/latest/group__grp__lda.html#examples
>> which provides helper functions for learning more about the learned model.
>>
>> -- The topic description by top-k words
>> DROP TABLE IF EXISTS my_topic_desc;
>> SELECT madlib.lda_get_topic_desc( 'my_model',
>>                                   'my_training_vocabulary',
>>                                   'my_topic_desc',
>>                                   15);
>> SELECT * FROM my_topic_desc ORDER BY topicid, prob DESC;
>>
>> produces:
>>
>>  topicid | wordid |        prob        |  word
>> ---------+--------+--------------------+---------
>>        1 |     69 |  0.181900726392252 | of
>>        1 |     52 | 0.0608353510895884 | is
>>        1 |     65 | 0.0608353510895884 | models
>>        1 |     30 | 0.0305690072639225 | corpora
>>        1 |      1 | 0.0305690072639225 | 1960s
>>        1 |     57 | 0.0305690072639225 | latent
>>
>> Please let us know if this is of use, or if you are looking for something else.
>>
>> Frank
>>
>> On Fri, Aug 11, 2017 at 6:45 AM, Markus Paaso <markus.pa...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I found a working but quite awkward way to form docid-wordid-topicid
>>> pairings with a single SQL query:
>>>
>>> SELECT docid, unnest((counts::text || ':' ||
>>> words::text)::madlib.svec::float[]) AS wordid,
>>> unnest(topic_assignment) + 1 AS topicid
>>> FROM lda_output WHERE docid = 6;
>>>
>>> Output:
>>>
>>>  docid | wordid | topicid
>>> -------+--------+---------
>>>      6 |   7386 |       3
>>>      6 |  42021 |      17
>>>      6 |  42021 |      17
>>>      6 |   7705 |      12
>>>      6 | 105334 |      16
>>>      6 |  18083 |       3
>>>      6 |  89364 |       3
>>>      6 |  31073 |       3
>>>      6 |  28934 |       3
>>>      6 |  28934 |      16
>>>      6 |  56286 |      16
>>>      6 |  61921 |       3
>>>      6 |  61921 |       3
>>>      6 |  59142 |      17
>>>      6 |  33364 |       3
>>>      6 |  79035 |      17
>>>      6 |  37792 |      11
>>>      6 |  91823 |      11
>>>      6 |  30422 |       3
>>>      6 |  94672 |      17
>>>      6 |  62107 |       3
>>>      6 |  94673 |       2
>>>      6 |  62080 |      16
>>>      6 | 101046 |      17
>>>      6 |   4379 |       8
>>>      6 |   4379 |       8
>>>      6 |   4379 |       8
>>>      6 |   4379 |       8
>>>      6 |   4379 |       8
>>>      6 |  26503 |      12
>>>      6 |  61105 |       3
>>>      6 |  19193 |       3
>>>      6 |  28929 |       3
>>>
>>> Is there any simpler way to do that?
>>>
>>> Regards,
>>> Markus Paaso
>>>
>>> 2017-08-11 15:23 GMT+03:00 Markus Paaso <markus.pa...@gmail.com>:
>>>
>>>> Hi,
>>>>
>>>> I am having some problems reading the LDA output.
>>>>
>>>> Please see this row of madlib.lda_train output:
>>>>
>>>> docid            | 6
>>>> wordcount        | 33
>>>> words            | {7386,42021,7705,105334,18083,89364,31073,28934,56286,61921,59142,33364,79035,37792,91823,30422,94672,62107,94673,62080,101046,4379,26503,61105,19193,28929}
>>>> counts           | {1,2,1,1,1,1,1,2,1,2,1,1,1,1,1,1,1,1,1,1,1,5,1,1,1,1}
>>>> topic_count      | {0,1,13,0,0,0,0,5,0,0,2,2,0,0,0,4,6,0,0,0}
>>>> topic_assignment | {2,16,16,11,15,2,2,2,2,15,15,2,2,16,2,16,10,10,2,16,2,1,15,16,7,7,7,7,7,11,2,2,2}
>>>>
>>>> It is hard to tell which topic ids are assigned to which word ids when the
>>>> *words* array has a different length than the *topic_assignment* array.
>>>> It would be nice if the *words* array were the same length as the
>>>> *topic_assignment* array.
>>>>
>>>> 1. What kind of SQL query would give a result with wordid - topicid pairs?
>>>> I tried to match them by hand but failed for wordid 28934. I wonder if
>>>> a repeating wordid can have different topic assignments in the same document?
>>>>
>>>>  wordid | topicid
>>>> --------+---------
>>>>    7386 |       2
>>>>   42021 |      16
>>>>    7705 |      11
>>>>  105334 |      15
>>>>   18083 |       2
>>>>   89364 |       2
>>>>   31073 |       2
>>>>   28934 |       2 OR 15 ?
>>>>   56286 |      15
>>>>   61921 |       2
>>>>   59142 |      16
>>>>   33364 |       2
>>>>   79035 |      16
>>>>   37792 |      10
>>>>   91823 |      10
>>>>   30422 |       2
>>>>   94672 |      16
>>>>   62107 |       2
>>>>   94673 |       1
>>>>   62080 |      15
>>>>  101046 |      16
>>>>    4379 |       7
>>>>   26503 |      11
>>>>   61105 |       2
>>>>   19193 |       2
>>>>   28929 |       2
>>>>
>>>> 2. Why is *topic_assignment* using zero-based indexing while the other
>>>> results use one-based indexing?
>>>>
>>>> Regards,
>>>> Markus Paaso
>>>
>>> --
>>> Markus Paaso
>>> Tel: +358504067849
>>
>
> --
> Markus Paaso
> Tel: +358504067849