Thanks Frank! I will keep track on this issue.
Regards, Markus 2017-09-22 23:32 GMT+03:00 Frank McQuillan <fmcquil...@pivotal.io>: > Markus, > > Thanks for the extra explanation. > > I agree that having topic_assignment as a dense vector but words as a > sparse vector is not good practice, I am not sure why it was done that way > in the first place. They should be consistent. > > The reason sparse vectors are used I believe is to keep the LDA model > state as small as possible, since it is possible to hit the 1GB PostgreSQL > field limit in certain real world use cases. > > So I think what you are doing seems reasonable. I don’t have a simpler > way to do it at this point, given the current way LDA is built. > > I have updated this JIRA > https://issues.apache.org/jira/browse/MADLIB-1160 > proposing: > > 1) use consistent indexes (start at 1) > 2) use consistent array types (sparse) > > which hopefully will help some. > > Frank > > > On Fri, Sep 22, 2017 at 12:18 AM, Markus Paaso <markus.pa...@gmail.com> > wrote: > >> Hi Frank, >> >> Thanks for the response. >> The solution you suggest doesn't fully answer to my question. >> >> I don't want to join input table with top-k wordids from each topic as >> this is not what lda_predict is doing. Am I right? >> I want to explain the results of madlib.lda_predict exactly. >> I want to explain why lda_predict gives some topic distribution for some >> set of wordids. >> >> I understand your approach but unfortunately it doesn't fulfill my needs. >> I have found the solution for my needs but and was just asking if you >> know better way to handle the results of lda_predict (or data in >> *output_data_table *of lda_train). >> >> So my question was more about how to construct wordid array from 'counts' >> and 'words' columns of madlib.lda_predict results or e.g. 'my_outdata' in >> your example. >> Now I am doing it with (counts::text || ':' || >> words::text)::madlib.svec::float[] >> >> The thing I don't understand in *output_data_table* format is that why >> *topic_assignment* is a dense vector but *words* is a sparse vector. >> I would suggest to change *words* format into dense vector to remove the >> need of playing with madlib.svec object to explain the lda_predict results >> exactly. >> And then the *counts* column is not needed anymore in *output_data_table* >> . >> >> >> Regards, >> Markus Paaso >> >> >> >> >> >> >> 2017-09-21 22:25 GMT+03:00 Frank McQuillan <fmcquil...@pivotal.io>: >> >>> Markus, >>> >>> Sorry for the delay in response. Below is another approach that >>> involves a JOIN between term frequency and an LDA output table. It is a >>> bit simpler at least for me. >>> >>> Regarding the indexing, you are correct that it is inconsistent. It >>> seems term frequency module uses 0 based indexing (e.g., docid) whereas LDA >>> uses 1 based indexing (e.g., topicid). I created a JIRA to address this >>> https://issues.apache.org/jira/browse/MADLIB-1160 >>> probably the best way is to standardize on 1 based indexing. >>> >>> ---- >>> >>> 1) Create data set >>> >>> DROP TABLE IF EXISTS documents; >>> CREATE TABLE documents(docid INT4, contents TEXT); >>> INSERT INTO documents VALUES >>> (0, 'Statistical topic models are a class of Bayesian latent variable >>> models, originally developed for analyzing the semantic content of large >>> document corpora.'), >>> (1, 'By the late 1960s, the balance between pitching and hitting had >>> swung in favor of the pitchers. In 1968 Carl Yastrzemski won the American >>> League batting title with an average of just .301, the lowest in history.'), >>> (2, 'Machine learning is closely related to and often overlaps with >>> computational statistics; a discipline that also specializes in >>> prediction-making. It has strong ties to mathematical optimization, which >>> deliver methods, theory and application domains to the field.'), >>> (3, 'California''s diverse geography ranges from the Sierra Nevada in >>> the east to the Pacific Coast in the west, from the Redwood–Douglas fir >>> forests of the northwest, to the Mojave Desert areas in the southeast. The >>> center of the state is dominated by the Central Valley, a major >>> agricultural area. '); >>> >>> >>> 2) Convert docs to a list of words >>> >>> ALTER TABLE documents ADD COLUMN words TEXT[]; >>> UPDATE documents SET words = regexp_split_to_array(lower(contents), >>> E'[\\s+\\.\\,]'); >>> >>> >>> 3) Term frequency >>> >>> DROP TABLE IF EXISTS my_training, my_training_vocabulary; >>> SELECT madlib.term_frequency('documents', 'docid', 'words', >>> 'my_training', TRUE); >>> SELECT * FROM my_training order by docid limit 20; >>> docid | wordid | count >>> -------+--------+------- >>> 0 | 57 | 1 >>> 0 | 86 | 1 >>> 0 | 4 | 1 >>> 0 | 55 | 1 >>> 0 | 69 | 2 >>> 0 | 81 | 1 >>> 0 | 30 | 1 >>> 0 | 33 | 1 >>> 0 | 36 | 1 >>> 0 | 43 | 1 >>> 0 | 25 | 1 >>> 0 | 65 | 2 >>> 0 | 72 | 1 >>> 0 | 9 | 1 >>> 0 | 0 | 2 >>> 0 | 29 | 1 >>> 0 | 18 | 1 >>> 0 | 12 | 1 >>> 0 | 96 | 1 >>> 0 | 91 | 1 >>> (20 rows) >>> >>> >>> 4) Run LDA >>> >>> DROP TABLE IF EXISTS my_model, my_outdata; >>> SELECT madlib.lda_train( 'my_training', -- Data table >>> 'my_model', -- Model table >>> 'my_outdata', -- Output table >>> 104, -- Vocab size >>> 5, -- Num topics >>> 10, -- Num iterations >>> 5, -- Alpha (Dirichlet param) >>> 0.01 -- Beta (Dirichlet param) >>> ); >>> >>> >>> 5) Get topic description by top-k words >>> >>> DROP TABLE IF EXISTS my_topic_desc; >>> SELECT madlib.lda_get_topic_desc( 'my_model', >>> 'my_training_vocabulary', >>> 'my_topic_desc', >>> 15); >>> >>> 6) Join topic description with term frequency >>> SELECT docid, wordid, topicid FROM my_topic_desc JOIN my_training USING >>> (wordid) WHERE docid=3 ORDER BY docid, wordid; >>> >>> docid | wordid | topicid >>> -------+--------+--------- >>> 3 | 0 | 3 >>> 3 | 4 | 2 >>> 3 | 5 | 3 >>> 3 | 13 | 2 >>> 3 | 14 | 5 >>> 3 | 20 | 5 >>> 3 | 21 | 2 >>> 3 | 23 | 2 >>> 3 | 24 | 3 >>> 3 | 27 | 2 >>> 3 | 32 | 3 >>> 3 | 35 | 3 >>> 3 | 38 | 5 >>> 3 | 39 | 5 >>> etc >>> >>> Frank >>> >>> >>> >>> On Tue, Aug 29, 2017 at 11:34 PM, Markus Paaso <markus.pa...@gmail.com> >>> wrote: >>> >>>> Hi Frank, >>>> >>>> I want to explain the LDA results for a single document (in this case >>>> for docid = 6) by binding topicid into each wordid in the document. >>>> The SQL query below gives exactly what I want but I am not sure if that >>>> is the most effective way to build docid-wordid-topicid triples. >>>> >>>> SELECT docid, unnest((counts::text || ':' || >>>> words::text)::madlib.svec::float[]) AS wordid, >>>> unnest(topic_assignment) + 1 AS topicid FROM lda_output WHERE docid = 6; >>>> >>>> I have trained LDA with 'lda_output' as the output_data_table argument >>>> in madlib.lda_train. >>>> >>>> >>>> Regards, Markus >>>> >>>> 2017-08-28 23:19 GMT+03:00 Frank McQuillan <fmcquil...@pivotal.io>: >>>> >>>>> Markus, >>>>> >>>>> Please see example 4 in the user docs >>>>> http://madlib.apache.org/docs/latest/group__grp__lda.html#examples >>>>> which provides helper functions for learning more about the learned >>>>> model. >>>>> >>>>> -- The topic description by top-k words >>>>> DROP TABLE IF EXISTS my_topic_desc; >>>>> SELECT madlib.lda_get_topic_desc( 'my_model', >>>>> 'my_training_vocabulary', >>>>> 'my_topic_desc', >>>>> 15); >>>>> select * from my_topic_desc order by topicid, prob DESC; >>>>> >>>>> produces: >>>>> >>>>> topicid | wordid | prob | word >>>>> ---------+--------+--------------------+------------------- >>>>> 1 | 69 | 0.181900726392252 | of >>>>> 1 | 52 | 0.0608353510895884 | is >>>>> 1 | 65 | 0.0608353510895884 | models >>>>> 1 | 30 | 0.0305690072639225 | corpora >>>>> 1 | 1 | 0.0305690072639225 | 1960s >>>>> 1 | 57 | 0.0305690072639225 | latent >>>>> >>>>> Please let us know if this is of use, or you are looking for something >>>>> else? >>>>> >>>>> Frank >>>>> >>>>> >>>>> On Fri, Aug 11, 2017 at 6:45 AM, Markus Paaso <markus.pa...@gmail.com> >>>>> wrote: >>>>> >>>>>> Hi, >>>>>> >>>>>> I found a working but quite awkward way to form docid-wordid-topicid >>>>>> pairing with a single SQL query: >>>>>> >>>>>> SELECT docid, unnest((counts::text || ':' || >>>>>> words::text)::madlib.svec::float[]) AS wordid, >>>>>> unnest(topic_assignment) + 1 AS topicid FROM lda_output WHERE docid = 6; >>>>>> >>>>>> Output: >>>>>> >>>>>> docid | wordid | topicid >>>>>> -------+--------+--------- >>>>>> 6 | 7386 | 3 >>>>>> 6 | 42021 | 17 >>>>>> 6 | 42021 | 17 >>>>>> 6 | 7705 | 12 >>>>>> 6 | 105334 | 16 >>>>>> 6 | 18083 | 3 >>>>>> 6 | 89364 | 3 >>>>>> 6 | 31073 | 3 >>>>>> 6 | 28934 | 3 >>>>>> 6 | 28934 | 16 >>>>>> 6 | 56286 | 16 >>>>>> 6 | 61921 | 3 >>>>>> 6 | 61921 | 3 >>>>>> 6 | 59142 | 17 >>>>>> 6 | 33364 | 3 >>>>>> 6 | 79035 | 17 >>>>>> 6 | 37792 | 11 >>>>>> 6 | 91823 | 11 >>>>>> 6 | 30422 | 3 >>>>>> 6 | 94672 | 17 >>>>>> 6 | 62107 | 3 >>>>>> 6 | 94673 | 2 >>>>>> 6 | 62080 | 16 >>>>>> 6 | 101046 | 17 >>>>>> 6 | 4379 | 8 >>>>>> 6 | 4379 | 8 >>>>>> 6 | 4379 | 8 >>>>>> 6 | 4379 | 8 >>>>>> 6 | 4379 | 8 >>>>>> 6 | 26503 | 12 >>>>>> 6 | 61105 | 3 >>>>>> 6 | 19193 | 3 >>>>>> 6 | 28929 | 3 >>>>>> >>>>>> >>>>>> Is there any simpler way to do that? >>>>>> >>>>>> >>>>>> Regards, >>>>>> Markus Paaso >>>>>> >>>>>> >>>>>> >>>>>> 2017-08-11 15:23 GMT+03:00 Markus Paaso <markus.pa...@gmail.com>: >>>>>> >>>>>>> Hi, >>>>>>> >>>>>>> I am having some problems reading the LDA output. >>>>>>> >>>>>>> >>>>>>> Please see this row of madlib.lda_train output: >>>>>>> >>>>>>> docid | 6 >>>>>>> wordcount | 33 >>>>>>> words | {7386,42021,7705,105334,18083, >>>>>>> 89364,31073,28934,56286,61921,59142,33364,79035,37792,91823, >>>>>>> 30422,94672,62107,94673,62080,101046, 4379,26503,61105,19193,28929} >>>>>>> counts | {1,2,1,1,1,1,1,2,1,2,1,1,1,1,1 >>>>>>> ,1,1,1,1,1,1,5,1,1,1,1} >>>>>>> topic_count | {0,1,13,0,0,0,0,5,0,0,2,2,0,0,0,4,6,0,0,0} >>>>>>> topic_assignment | {2,16,16,11,15,2,2,2,2,15,15,2 >>>>>>> ,2,16,2,16,10,10,2,16,2,1,15,16,7,7,7,7,7,11,2,2,2} >>>>>>> >>>>>>> >>>>>>> It's hard to find which word ids are topic ids assigned to given when >>>>>>> *words* array have different length than *topic_assignment* array. >>>>>>> It would be nice if *words* array was same length than >>>>>>> *topic_assignment* array >>>>>>> >>>>>>> 1. What kind of SQL query would give a result with wordid - topicid >>>>>>> pairs? >>>>>>> I tried to match them by hand but failed for wordid: 28934. I wonder >>>>>>> if a repeating wordid can have different topic assignments in a same >>>>>>> document? >>>>>>> >>>>>>> wordid | topicid >>>>>>> ---------------- >>>>>>> 7386 | 2 >>>>>>> 42021 | 16 >>>>>>> 7705 | 11 >>>>>>> 105334 | 15 >>>>>>> 18083 | 2 >>>>>>> 89364 | 2 >>>>>>> 31073 | 2 >>>>>>> 28934 | 2 OR 15 ? >>>>>>> 56286 | 15 >>>>>>> 61921 | 2 >>>>>>> 59142 | 16 >>>>>>> 33364 | 2 >>>>>>> 79035 | 16 >>>>>>> 37792 | 10 >>>>>>> 91823 | 10 >>>>>>> 30422 | 2 >>>>>>> 94672 | 16 >>>>>>> 62107 | 2 >>>>>>> 94673 | 1 >>>>>>> 62080 | 15 >>>>>>> 101046 | 16 >>>>>>> 4379 | 7 >>>>>>> 26503 | 11 >>>>>>> 61105 | 2 >>>>>>> 19193 | 2 >>>>>>> 28929 | 2 >>>>>>> >>>>>>> >>>>>>> 2. Why is the *topic_assignment* using zero based indexing while >>>>>>> other results use one base indexing? >>>>>>> >>>>>>> >>>>>>> >>>>>>> Regards, >>>>>>> Markus Paaso >>>>>>> >>>>>> >>>>>> >>>>>> >>>>>> -- >>>>>> Markus Paaso >>>>>> Tel: +358504067849 <+358%2050%204067849> >>>>>> >>>>> >>>>> >>>> >>>> >>>> -- >>>> Markus Paaso >>>> Tel: +358504067849 <+358%2050%204067849> >>>> >>> >>> >> >> >> -- >> Markus Paaso >> Tel: +358504067849 <+358%2050%204067849> >> >