Hi Frank,

I want to explain the LDA results for a single document (in this case docid = 6) by attaching a topicid to each wordid in the document. The SQL query below gives exactly what I want, but I am not sure it is the most efficient way to build docid-wordid-topicid triples.

SELECT docid,
       unnest((counts::text || ':' || words::text)::madlib.svec::float[]) AS wordid,
       unnest(topic_assignment) + 1 AS topicid
FROM lda_output
WHERE docid = 6;

I have trained LDA with 'lda_output' as the output_data_table argument in madlib.lda_train.

Regards,
Markus

2017-08-28 23:19 GMT+03:00 Frank McQuillan <fmcquil...@pivotal.io>:

> Markus,
>
> Please see example 4 in the user docs
> http://madlib.apache.org/docs/latest/group__grp__lda.html#examples
> which provides helper functions for learning more about the learned model.
>
> -- The topic description by top-k words
> DROP TABLE IF EXISTS my_topic_desc;
> SELECT madlib.lda_get_topic_desc( 'my_model',
>                                   'my_training_vocabulary',
>                                   'my_topic_desc',
>                                   15);
> select * from my_topic_desc order by topicid, prob DESC;
>
> produces:
>
>  topicid | wordid |        prob        |       word
> ---------+--------+--------------------+-------------------
>        1 |     69 |  0.181900726392252 | of
>        1 |     52 | 0.0608353510895884 | is
>        1 |     65 | 0.0608353510895884 | models
>        1 |     30 | 0.0305690072639225 | corpora
>        1 |      1 | 0.0305690072639225 | 1960s
>        1 |     57 | 0.0305690072639225 | latent
>
> Please let us know if this is of use, or you are looking for something
> else?
>
> Frank
>
>
> On Fri, Aug 11, 2017 at 6:45 AM, Markus Paaso <markus.pa...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I found a working but quite awkward way to form docid-wordid-topicid
>> triples with a single SQL query:
>>
>> SELECT docid,
>>        unnest((counts::text || ':' || words::text)::madlib.svec::float[]) AS wordid,
>>        unnest(topic_assignment) + 1 AS topicid
>> FROM lda_output
>> WHERE docid = 6;
>>
>> Output:
>>
>>  docid | wordid | topicid
>> -------+--------+---------
>>      6 |   7386 |       3
>>      6 |  42021 |      17
>>      6 |  42021 |      17
>>      6 |   7705 |      12
>>      6 | 105334 |      16
>>      6 |  18083 |       3
>>      6 |  89364 |       3
>>      6 |  31073 |       3
>>      6 |  28934 |       3
>>      6 |  28934 |      16
>>      6 |  56286 |      16
>>      6 |  61921 |       3
>>      6 |  61921 |       3
>>      6 |  59142 |      17
>>      6 |  33364 |       3
>>      6 |  79035 |      17
>>      6 |  37792 |      11
>>      6 |  91823 |      11
>>      6 |  30422 |       3
>>      6 |  94672 |      17
>>      6 |  62107 |       3
>>      6 |  94673 |       2
>>      6 |  62080 |      16
>>      6 | 101046 |      17
>>      6 |   4379 |       8
>>      6 |   4379 |       8
>>      6 |   4379 |       8
>>      6 |   4379 |       8
>>      6 |   4379 |       8
>>      6 |  26503 |      12
>>      6 |  61105 |       3
>>      6 |  19193 |       3
>>      6 |  28929 |       3
>>
>>
>> Is there any simpler way to do that?
>>
>>
>> Regards,
>> Markus Paaso
>>
>>
>>
>> 2017-08-11 15:23 GMT+03:00 Markus Paaso <markus.pa...@gmail.com>:
>>
>>> Hi,
>>>
>>> I am having some problems reading the LDA output.
>>>
>>>
>>> Please see this row of madlib.lda_train output:
>>>
>>> docid            | 6
>>> wordcount        | 33
>>> words            | {7386,42021,7705,105334,18083,89364,31073,28934,56286,61921,59142,33364,79035,
>>>                     37792,91823,30422,94672,62107,94673,62080,101046,4379,26503,61105,19193,28929}
>>> counts           | {1,2,1,1,1,1,1,2,1,2,1,1,1,1,1,1,1,1,1,1,1,5,1,1,1,1}
>>> topic_count      | {0,1,13,0,0,0,0,5,0,0,2,2,0,0,0,4,6,0,0,0}
>>> topic_assignment | {2,16,16,11,15,2,2,2,2,15,15,2,2,16,2,16,10,10,2,16,2,1,15,16,7,7,7,7,7,11,2,2,2}
>>>
>>>
>>> It's hard to tell which word ids the topic ids are assigned to when the
>>> *words* array has a different length than the *topic_assignment* array.
>>> It would be nice if the *words* array had the same length as the
>>> *topic_assignment* array.
>>>
>>> 1. What kind of SQL query would give a result with wordid - topicid
>>> pairs?
>>> I tried to match them by hand but failed for wordid 28934. I wonder
>>> whether a repeated wordid can have different topic assignments in the
>>> same document?
>>>
>>> wordid | topicid
>>> -------+-----------
>>>   7386 | 2
>>>  42021 | 16
>>>   7705 | 11
>>> 105334 | 15
>>>  18083 | 2
>>>  89364 | 2
>>>  31073 | 2
>>>  28934 | 2 OR 15 ?
>>>  56286 | 15
>>>  61921 | 2
>>>  59142 | 16
>>>  33364 | 2
>>>  79035 | 16
>>>  37792 | 10
>>>  91823 | 10
>>>  30422 | 2
>>>  94672 | 16
>>>  62107 | 2
>>>  94673 | 1
>>>  62080 | 15
>>> 101046 | 16
>>>   4379 | 7
>>>  26503 | 11
>>>  61105 | 2
>>>  19193 | 2
>>>  28929 | 2
>>>
>>>
>>> 2. Why does *topic_assignment* use zero-based indexing while other
>>> results use one-based indexing?
>>>
>>>
>>>
>>> Regards,
>>> Markus Paaso
>>>
>>
>>
>>
>> --
>> Markus Paaso
>> Tel: +358504067849
>>
>

--
Markus Paaso
Tel: +358504067849
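
Note for readers of this thread: the query discussed above works by repeating each entry of *words* as many times as its *counts* entry says, which yields one word id per token, and then pairing that sequence position by position with *topic_assignment*. Below is a minimal sketch of the same alignment in plain Python, using the docid = 6 row quoted in the thread. The function name expand_assignments is hypothetical and not part of MADlib, and the sketch assumes, as the query itself does, that *topic_assignment* is stored in the same order as the expanded word list.

```python
def expand_assignments(docid, words, counts, topic_assignment):
    """Return (docid, wordid, topicid) triples with 1-based topic ids.

    Hypothetical helper for illustration only; not a MADlib function.
    Assumes topic_assignment follows the order of the count-expanded words.
    """
    # Repeat each word id according to its count: one entry per token.
    expanded = [w for w, c in zip(words, counts) for _ in range(c)]
    if len(expanded) != len(topic_assignment):
        raise ValueError("expanded words and topic_assignment differ in length")
    # topic_assignment is 0-based in the quoted row; add 1 to match topicid.
    return [(docid, w, t + 1) for w, t in zip(expanded, topic_assignment)]


# The lda_train output row for docid = 6, copied from the thread.
words = [7386, 42021, 7705, 105334, 18083, 89364, 31073, 28934, 56286,
         61921, 59142, 33364, 79035, 37792, 91823, 30422, 94672, 62107,
         94673, 62080, 101046, 4379, 26503, 61105, 19193, 28929]
counts = [1, 2, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          5, 1, 1, 1, 1]
topic_assignment = [2, 16, 16, 11, 15, 2, 2, 2, 2, 15, 15, 2, 2, 16, 2, 16,
                    10, 10, 2, 16, 2, 1, 15, 16, 7, 7, 7, 7, 7, 11, 2, 2, 2]

triples = expand_assignments(6, words, counts, topic_assignment)
for triple in triples:
    print(triple)
```

Under that ordering assumption, the two occurrences of wordid 28934 come out with topicid 3 and topicid 16, which agrees with the query output quoted earlier in the thread and suggests that a repeated word can carry different topic assignments within one document.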