Markus,

Please see example 4 in the user docs
http://madlib.apache.org/docs/latest/group__grp__lda.html#examples
which provides helper functions for learning more about the learned model.
-- The topic description by top-k words
DROP TABLE IF EXISTS my_topic_desc;
SELECT madlib.lda_get_topic_desc( 'my_model',
                                  'my_training_vocabulary',
                                  'my_topic_desc',
                                  15);
select * from my_topic_desc order by topicid, prob DESC;

produces:

 topicid | wordid |        prob        |       word
---------+--------+--------------------+-------------------
       1 |     69 |  0.181900726392252 | of
       1 |     52 | 0.0608353510895884 | is
       1 |     65 | 0.0608353510895884 | models
       1 |     30 | 0.0305690072639225 | corpora
       1 |      1 | 0.0305690072639225 | 1960s
       1 |     57 | 0.0305690072639225 | latent

Please let us know if this is of use, or you are looking for something else?

Frank

On Fri, Aug 11, 2017 at 6:45 AM, Markus Paaso <markus.pa...@gmail.com> wrote:

> Hi,
>
> I found a working but quite awkward way to form docid-wordid-topicid
> pairing with a single SQL query:
>
> SELECT docid, unnest((counts::text || ':' ||
> words::text)::madlib.svec::float[])
> AS wordid, unnest(topic_assignment) + 1 AS topicid FROM lda_output WHERE
> docid = 6;
>
> Output:
>
>  docid | wordid | topicid
> -------+--------+---------
>      6 |   7386 |       3
>      6 |  42021 |      17
>      6 |  42021 |      17
>      6 |   7705 |      12
>      6 | 105334 |      16
>      6 |  18083 |       3
>      6 |  89364 |       3
>      6 |  31073 |       3
>      6 |  28934 |       3
>      6 |  28934 |      16
>      6 |  56286 |      16
>      6 |  61921 |       3
>      6 |  61921 |       3
>      6 |  59142 |      17
>      6 |  33364 |       3
>      6 |  79035 |      17
>      6 |  37792 |      11
>      6 |  91823 |      11
>      6 |  30422 |       3
>      6 |  94672 |      17
>      6 |  62107 |       3
>      6 |  94673 |       2
>      6 |  62080 |      16
>      6 | 101046 |      17
>      6 |   4379 |       8
>      6 |   4379 |       8
>      6 |   4379 |       8
>      6 |   4379 |       8
>      6 |   4379 |       8
>      6 |  26503 |      12
>      6 |  61105 |       3
>      6 |  19193 |       3
>      6 |  28929 |       3
>
> Is there any simpler way to do that?
>
> Regards,
> Markus Paaso
>
> 2017-08-11 15:23 GMT+03:00 Markus Paaso <markus.pa...@gmail.com>:
>
>> Hi,
>>
>> I am having some problems reading the LDA output.
>>
>> Please see this row of madlib.lda_train output:
>>
>> docid            | 6
>> wordcount        | 33
>> words            | {7386,42021,7705,105334,18083,89364,31073,28934,
>>                     56286,61921,59142,33364,79035,37792,91823,30422,
>>                     94672,62107,94673,62080,101046,4379,26503,61105,
>>                     19193,28929}
>> counts           | {1,2,1,1,1,1,1,2,1,2,1,1,1,1,1,1,1,1,1,1,1,5,1,1,1,1}
>> topic_count      | {0,1,13,0,0,0,0,5,0,0,2,2,0,0,0,4,6,0,0,0}
>> topic_assignment | {2,16,16,11,15,2,2,2,2,15,15,2,2,16,2,16,10,10,2,
>>                     16,2,1,15,16,7,7,7,7,7,11,2,2,2}
>>
>> It's hard to find which word ids are topic ids assigned to given when
>> *words* array have different length than *topic_assignment* array.
>> It would be nice if *words* array was same length than *topic_assignment*
>> array
>>
>> 1. What kind of SQL query would give a result with wordid - topicid pairs?
>> I tried to match them by hand but failed for wordid: 28934. I wonder if a
>> repeating wordid can have different topic assignments in a same document?
>>
>>  wordid | topicid
>> --------+-----------
>>    7386 | 2
>>   42021 | 16
>>    7705 | 11
>>  105334 | 15
>>   18083 | 2
>>   89364 | 2
>>   31073 | 2
>>   28934 | 2 OR 15 ?
>>   56286 | 15
>>   61921 | 2
>>   59142 | 16
>>   33364 | 2
>>   79035 | 16
>>   37792 | 10
>>   91823 | 10
>>   30422 | 2
>>   94672 | 16
>>   62107 | 2
>>   94673 | 1
>>   62080 | 15
>>  101046 | 16
>>    4379 | 7
>>   26503 | 11
>>   61105 | 2
>>   19193 | 2
>>   28929 | 2
>>
>> 2. Why is the *topic_assignment* using zero based indexing while other
>> results use one base indexing?
>>
>> Regards,
>> Markus Paaso
>>
>
> --
> Markus Paaso
> Tel: +358504067849
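
On the pairing question in the quoted messages: below is a minimal Python sketch, run outside the database, of what the unnest query above appears to do for docid 6. It rests on an assumption inferred from the output in this thread rather than from the MADlib documentation: topic_assignment holds one zero-based topic per word occurrence, in the order of the words array with each word repeated counts times. The function name pair_words_with_topics is made up for illustration and is not part of MADlib.

```python
from itertools import chain, repeat


def pair_words_with_topics(words, counts, topic_assignment):
    """Return one (wordid, topicid) pair per word occurrence.

    Assumption (inferred from the thread, not from MADlib docs):
    topic_assignment is zero-based and ordered like words expanded by counts.
    The returned topicid is shifted to one-based.
    """
    # Repeat each word id as many times as it occurs in the document.
    expanded = list(chain.from_iterable(repeat(w, c) for w, c in zip(words, counts)))
    if len(expanded) != len(topic_assignment):
        raise ValueError("sum(counts) must equal len(topic_assignment)")
    return [(w, t + 1) for w, t in zip(expanded, topic_assignment)]


# The lda_train output row for docid 6 quoted in this thread.
words = [7386, 42021, 7705, 105334, 18083, 89364, 31073, 28934, 56286,
         61921, 59142, 33364, 79035, 37792, 91823, 30422, 94672, 62107,
         94673, 62080, 101046, 4379, 26503, 61105, 19193, 28929]
counts = [1, 2, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          5, 1, 1, 1, 1]
topic_assignment = [2, 16, 16, 11, 15, 2, 2, 2, 2, 15, 15, 2, 2, 16, 2, 16,
                    10, 10, 2, 16, 2, 1, 15, 16, 7, 7, 7, 7, 7, 11, 2, 2, 2]

pairs = pair_words_with_topics(words, counts, topic_assignment)
```

Under that assumption, the pairs line up with the output of the SQL query quoted above, and wordid 28934 comes out once with topic 3 and once with topic 16. That suggests repeated occurrences of a word within one document can carry different topic assignments.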