Markus,

Thanks for the extra explanation.
I agree that having topic_assignment as a dense vector but words as a sparse vector is not good practice; I am not sure why it was done that way in the first place. They should be consistent. I believe sparse vectors are used to keep the LDA model state as small as possible, since it is possible to hit the 1GB PostgreSQL field limit in certain real-world use cases. So what you are doing seems reasonable, and I don't have a simpler way to do it at this point, given the current way LDA is built.

I have updated this JIRA https://issues.apache.org/jira/browse/MADLIB-1160 proposing:
1) use consistent indexes (starting at 1)
2) use consistent array types (sparse)
which hopefully will help some.

Frank

On Fri, Sep 22, 2017 at 12:18 AM, Markus Paaso <markus.pa...@gmail.com> wrote:

> Hi Frank,
>
> Thanks for the response. The solution you suggest doesn't fully answer my question.
>
> I don't want to join the input table with the top-k wordids from each topic, as this is not what lda_predict is doing. Am I right? I want to explain the results of madlib.lda_predict exactly: why lda_predict gives some topic distribution for some set of wordids.
>
> I understand your approach, but unfortunately it doesn't fulfill my needs. I have found a solution for my needs, and was just asking if you know a better way to handle the results of lda_predict (or the data in *output_data_table* of lda_train).
>
> So my question was more about how to construct the wordid array from the 'counts' and 'words' columns of the madlib.lda_predict results, or e.g. 'my_outdata' in your example. Now I am doing it with:
>
> (counts::text || ':' || words::text)::madlib.svec::float[]
>
> The thing I don't understand in the *output_data_table* format is why *topic_assignment* is a dense vector but *words* is a sparse vector.
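[Editor's note: the svec cast in Markus's expression is, in effect, a run-length decode of the sparse (counts, words) pair into one wordid per token. A minimal Python sketch of that expansion, with a made-up function name and toy values that are not taken from any MADlib table:

```python
def expand_sparse(counts, words):
    # Repeat each wordid by its count, mirroring the effect of the SQL cast
    # (counts::text || ':' || words::text)::madlib.svec::float[]
    dense = []
    for count, wordid in zip(counts, words):
        dense.extend([wordid] * count)
    return dense

# Toy values for illustration only:
print(expand_sparse([1, 2, 1], [10, 20, 30]))  # [10, 20, 20, 30]
```

The resulting dense list has one entry per token, so it lines up positionally with topic_assignment.]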
> I would suggest changing the *words* format into a dense vector, to remove the need of playing with the madlib.svec object to explain the lda_predict results exactly. And then the *counts* column would not be needed anymore in *output_data_table*.
>
> Regards,
> Markus Paaso
>
> 2017-09-21 22:25 GMT+03:00 Frank McQuillan <fmcquil...@pivotal.io>:
>
>> Markus,
>>
>> Sorry for the delay in response. Below is another approach that involves a JOIN between term frequency and an LDA output table. It is a bit simpler, at least for me.
>>
>> Regarding the indexing, you are correct that it is inconsistent. It seems the term frequency module uses 0-based indexing (e.g., docid) whereas LDA uses 1-based indexing (e.g., topicid). I created a JIRA to address this:
>> https://issues.apache.org/jira/browse/MADLIB-1160
>> Probably the best way is to standardize on 1-based indexing.
>>
>> ----
>>
>> 1) Create data set
>>
>> DROP TABLE IF EXISTS documents;
>> CREATE TABLE documents(docid INT4, contents TEXT);
>> INSERT INTO documents VALUES
>> (0, 'Statistical topic models are a class of Bayesian latent variable models, originally developed for analyzing the semantic content of large document corpora.'),
>> (1, 'By the late 1960s, the balance between pitching and hitting had swung in favor of the pitchers. In 1968 Carl Yastrzemski won the American League batting title with an average of just .301, the lowest in history.'),
>> (2, 'Machine learning is closely related to and often overlaps with computational statistics; a discipline that also specializes in prediction-making. It has strong ties to mathematical optimization, which deliver methods, theory and application domains to the field.'),
>> (3, 'California''s diverse geography ranges from the Sierra Nevada in the east to the Pacific Coast in the west, from the Redwood–Douglas fir forests of the northwest, to the Mojave Desert areas in the southeast. The center of the state is dominated by the Central Valley, a major agricultural area.');
>>
>> 2) Convert docs to a list of words
>>
>> ALTER TABLE documents ADD COLUMN words TEXT[];
>> UPDATE documents SET words = regexp_split_to_array(lower(contents), E'[\\s+\\.\\,]');
>>
>> 3) Term frequency
>>
>> DROP TABLE IF EXISTS my_training, my_training_vocabulary;
>> SELECT madlib.term_frequency('documents', 'docid', 'words', 'my_training', TRUE);
>> SELECT * FROM my_training ORDER BY docid LIMIT 20;
>>
>>  docid | wordid | count
>> -------+--------+-------
>>      0 |     57 |     1
>>      0 |     86 |     1
>>      0 |      4 |     1
>>      0 |     55 |     1
>>      0 |     69 |     2
>>      0 |     81 |     1
>>      0 |     30 |     1
>>      0 |     33 |     1
>>      0 |     36 |     1
>>      0 |     43 |     1
>>      0 |     25 |     1
>>      0 |     65 |     2
>>      0 |     72 |     1
>>      0 |      9 |     1
>>      0 |      0 |     2
>>      0 |     29 |     1
>>      0 |     18 |     1
>>      0 |     12 |     1
>>      0 |     96 |     1
>>      0 |     91 |     1
>> (20 rows)
>>
>> 4) Run LDA
>>
>> DROP TABLE IF EXISTS my_model, my_outdata;
>> SELECT madlib.lda_train( 'my_training', -- Data table
>>                          'my_model',    -- Model table
>>                          'my_outdata',  -- Output table
>>                          104,           -- Vocab size
>>                          5,             -- Num topics
>>                          10,            -- Num iterations
>>                          5,             -- Alpha (Dirichlet param)
>>                          0.01           -- Beta (Dirichlet param)
>>                        );
>>
>> 5) Get topic description by top-k words
>>
>> DROP TABLE IF EXISTS my_topic_desc;
>> SELECT madlib.lda_get_topic_desc( 'my_model',
>>                                   'my_training_vocabulary',
>>                                   'my_topic_desc',
>>                                   15);
>>
>> 6) Join topic description with term frequency
>>
>> SELECT docid, wordid, topicid FROM my_topic_desc JOIN my_training USING (wordid)
>> WHERE docid=3 ORDER BY docid, wordid;
>>
>>  docid | wordid | topicid
>> -------+--------+---------
>>      3 |      0 |       3
>>      3 |      4 |       2
>>      3 |      5 |       3
>>      3 |     13 |       2
>>      3 |     14 |       5
>>      3 |     20 |       5
>>      3 |     21 |       2
>>      3 |     23 |       2
>>      3 |     24 |       3
>>      3 |     27 |       2
>>      3 |     32 |       3
>>      3 |     35 |       3
>>      3 |     38 |       5
>>      3 |     39 |       5
>> etc
>>
>> Frank
>>
>> On Tue, Aug 29, 2017 at 11:34 PM, Markus Paaso <markus.pa...@gmail.com> wrote:
>>
>>> Hi Frank,
>>>
>>> I want to explain the LDA results for a single document (in this case for docid = 6) by binding topicid to each wordid in the document. The SQL query below gives exactly what I want, but I am not sure if it is the most effective way to build docid-wordid-topicid triples.
>>>
>>> SELECT docid,
>>>        unnest((counts::text || ':' || words::text)::madlib.svec::float[]) AS wordid,
>>>        unnest(topic_assignment) + 1 AS topicid
>>> FROM lda_output WHERE docid = 6;
>>>
>>> I have trained LDA with 'lda_output' as the output_data_table argument in madlib.lda_train.
>>>
>>> Regards, Markus
>>>
>>> 2017-08-28 23:19 GMT+03:00 Frank McQuillan <fmcquil...@pivotal.io>:
>>>
>>>> Markus,
>>>>
>>>> Please see example 4 in the user docs
>>>> http://madlib.apache.org/docs/latest/group__grp__lda.html#examples
>>>> which provides helper functions for learning more about the learned model.
>>>>
>>>> -- The topic description by top-k words
>>>> DROP TABLE IF EXISTS my_topic_desc;
>>>> SELECT madlib.lda_get_topic_desc( 'my_model',
>>>>                                   'my_training_vocabulary',
>>>>                                   'my_topic_desc',
>>>>                                   15);
>>>> SELECT * FROM my_topic_desc ORDER BY topicid, prob DESC;
>>>>
>>>> produces:
>>>>
>>>>  topicid | wordid |        prob        |  word
>>>> ---------+--------+--------------------+---------
>>>>        1 |     69 |  0.181900726392252 | of
>>>>        1 |     52 | 0.0608353510895884 | is
>>>>        1 |     65 | 0.0608353510895884 | models
>>>>        1 |     30 | 0.0305690072639225 | corpora
>>>>        1 |      1 | 0.0305690072639225 | 1960s
>>>>        1 |     57 | 0.0305690072639225 | latent
>>>>
>>>> Please let us know if this is of use, or if you are looking for something else?
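[Editor's note: as a rough illustration of what the "top-k words per topic" description computes, here is a Python sketch over a toy topic-to-word probability table. All names and probabilities below are invented for illustration; they are not MADlib output or MADlib API.

```python
from heapq import nlargest

# Toy topic -> word -> probability table (values invented for illustration).
topic_word_prob = {
    1: {"of": 0.18, "is": 0.06, "models": 0.06, "corpora": 0.03},
    2: {"the": 0.15, "batting": 0.05, "title": 0.04},
}

def top_k_words(probs_by_topic, k):
    # For each topic, keep the k words with the highest probability,
    # analogous to the per-topic listing madlib.lda_get_topic_desc produces.
    return {
        topicid: nlargest(k, probs.items(), key=lambda kv: kv[1])
        for topicid, probs in probs_by_topic.items()
    }

print(top_k_words(topic_word_prob, 2)[1][0])  # ('of', 0.18)
```

The real function also resolves wordids to words via the vocabulary table; this sketch only shows the ranking step.]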
>>>>
>>>> Frank
>>>>
>>>> On Fri, Aug 11, 2017 at 6:45 AM, Markus Paaso <markus.pa...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I found a working but quite awkward way to form the docid-wordid-topicid pairing with a single SQL query:
>>>>>
>>>>> SELECT docid,
>>>>>        unnest((counts::text || ':' || words::text)::madlib.svec::float[]) AS wordid,
>>>>>        unnest(topic_assignment) + 1 AS topicid
>>>>> FROM lda_output WHERE docid = 6;
>>>>>
>>>>> Output:
>>>>>
>>>>>  docid | wordid | topicid
>>>>> -------+--------+---------
>>>>>      6 |   7386 |       3
>>>>>      6 |  42021 |      17
>>>>>      6 |  42021 |      17
>>>>>      6 |   7705 |      12
>>>>>      6 | 105334 |      16
>>>>>      6 |  18083 |       3
>>>>>      6 |  89364 |       3
>>>>>      6 |  31073 |       3
>>>>>      6 |  28934 |       3
>>>>>      6 |  28934 |      16
>>>>>      6 |  56286 |      16
>>>>>      6 |  61921 |       3
>>>>>      6 |  61921 |       3
>>>>>      6 |  59142 |      17
>>>>>      6 |  33364 |       3
>>>>>      6 |  79035 |      17
>>>>>      6 |  37792 |      11
>>>>>      6 |  91823 |      11
>>>>>      6 |  30422 |       3
>>>>>      6 |  94672 |      17
>>>>>      6 |  62107 |       3
>>>>>      6 |  94673 |       2
>>>>>      6 |  62080 |      16
>>>>>      6 | 101046 |      17
>>>>>      6 |   4379 |       8
>>>>>      6 |   4379 |       8
>>>>>      6 |   4379 |       8
>>>>>      6 |   4379 |       8
>>>>>      6 |   4379 |       8
>>>>>      6 |  26503 |      12
>>>>>      6 |  61105 |       3
>>>>>      6 |  19193 |       3
>>>>>      6 |  28929 |       3
>>>>>
>>>>> Is there any simpler way to do that?
>>>>>
>>>>> Regards,
>>>>> Markus Paaso
>>>>>
>>>>> 2017-08-11 15:23 GMT+03:00 Markus Paaso <markus.pa...@gmail.com>:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I am having some problems reading the LDA output.
>>>>>>
>>>>>> Please see this row of madlib.lda_train output:
>>>>>>
>>>>>> docid            | 6
>>>>>> wordcount        | 33
>>>>>> words            | {7386,42021,7705,105334,18083,89364,31073,28934,56286,61921,59142,33364,79035,37792,91823,30422,94672,62107,94673,62080,101046,4379,26503,61105,19193,28929}
>>>>>> counts           | {1,2,1,1,1,1,1,2,1,2,1,1,1,1,1,1,1,1,1,1,1,5,1,1,1,1}
>>>>>> topic_count      | {0,1,13,0,0,0,0,5,0,0,2,2,0,0,0,4,6,0,0,0}
>>>>>> topic_assignment | {2,16,16,11,15,2,2,2,2,15,15,2,2,16,2,16,10,10,2,16,2,1,15,16,7,7,7,7,7,11,2,2,2}
>>>>>>
>>>>>> It is hard to tell which topic ids are assigned to which word ids when the *words* array has a different length than the *topic_assignment* array. It would be nice if the *words* array were the same length as the *topic_assignment* array.
>>>>>>
>>>>>> 1. What kind of SQL query would give a result with wordid-topicid pairs? I tried to match them by hand but failed for wordid 28934. I wonder if a repeating wordid can have different topic assignments in the same document?
>>>>>>
>>>>>>  wordid | topicid
>>>>>> --------+---------
>>>>>>    7386 | 2
>>>>>>   42021 | 16
>>>>>>    7705 | 11
>>>>>>  105334 | 15
>>>>>>   18083 | 2
>>>>>>   89364 | 2
>>>>>>   31073 | 2
>>>>>>   28934 | 2 OR 15 ?
>>>>>>   56286 | 15
>>>>>>   61921 | 2
>>>>>>   59142 | 16
>>>>>>   33364 | 2
>>>>>>   79035 | 16
>>>>>>   37792 | 10
>>>>>>   91823 | 10
>>>>>>   30422 | 2
>>>>>>   94672 | 16
>>>>>>   62107 | 2
>>>>>>   94673 | 1
>>>>>>   62080 | 15
>>>>>>  101046 | 16
>>>>>>    4379 | 7
>>>>>>   26503 | 11
>>>>>>   61105 | 2
>>>>>>   19193 | 2
>>>>>>   28929 | 2
>>>>>>
>>>>>> 2. Why is *topic_assignment* using zero-based indexing while other results use one-based indexing?
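[Editor's note: the pairing in question 1 can be checked in plain Python using the row quoted above, assuming topic_assignment is ordered to match *words* expanded by *counts* (the same assumption behind the svec query earlier in this thread, and consistent with the docid=6 output Markus posted):

```python
# Data copied from the lda_train output row for docid = 6 quoted above.
words = [7386, 42021, 7705, 105334, 18083, 89364, 31073, 28934,
         56286, 61921, 59142, 33364, 79035, 37792, 91823, 30422,
         94672, 62107, 94673, 62080, 101046, 4379, 26503, 61105,
         19193, 28929]
counts = [1, 2, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1, 5, 1, 1, 1, 1]
topic_assignment = [2, 16, 16, 11, 15, 2, 2, 2, 2, 15, 15, 2, 2,
                    16, 2, 16, 10, 10, 2, 16, 2, 1, 15, 16, 7, 7,
                    7, 7, 7, 11, 2, 2, 2]

# Run-length expansion gives one wordid per token (length == wordcount).
dense_words = [w for w, c in zip(words, counts) for _ in range(c)]
assert len(dense_words) == len(topic_assignment) == 33

# Pair positionally; +1 converts 0-based topic ids to 1-based.
pairs = [(w, t + 1) for w, t in zip(dense_words, topic_assignment)]

# A repeated wordid can indeed carry different topics in one document:
print([t for w, t in pairs if w == 28934])  # [3, 16]
```

So the answer to question 1 is "yes, expand by counts first", and wordid 28934 resolves to both topics (2 and 15 zero-based; 3 and 16 one-based), matching the docid=6 query output earlier in the thread.]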
>>>>>>
>>>>>> Regards,
>>>>>> Markus Paaso
>>>>>
>>>>> --
>>>>> Markus Paaso
>>>>> Tel: +358504067849