[GitHub] madlib pull request #232: Multiple LDA improvements and fixes

2018-02-20 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/madlib/pull/232


---


[GitHub] madlib pull request #232: Multiple LDA improvements and fixes

2018-02-12 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/232#discussion_r167717958
  
--- Diff: src/ports/postgres/modules/utilities/text_utilities.sql_in ---
@@ -74,175 +81,231 @@ tasks related to text.
 Flag to indicate if a vocabulary is to be created. If TRUE, an additional
 output table is created containing the vocabulary of all words, with an id
 assigned to each word. The table is called output_table_vocabulary
-(suffix added to the output_table name) and contains the
+(i.e., suffix added to the output_table name) and contains the
 following columns:
-- \c wordid: An id assignment for each word
-- \c word: The word/term
+- \c wordid: An id for each word.
--- End diff --

We can mention it is in alphabetic ordering.
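
To make the suggested ordering concrete, here is a minimal Python sketch, assuming ids are handed out in alphabetical order starting at 0. The helper `build_vocabulary` and the sample data are hypothetical and only illustrate the idea; this is not MADlib code, and Python's string sort may differ from the database collation.

```python
def build_vocabulary(documents):
    """Assign ids 0..N-1 to the distinct words, in alphabetical order."""
    distinct_words = sorted({word for words in documents for word in words})
    return {word: wordid for wordid, word in enumerate(distinct_words)}


# Two tokenized documents taken from the example in the diff above.
docs = [
    ["chinchillas", "and", "kittens", "are", "cute"],
    ["my", "sister", "adopted", "two", "kittens", "yesterday"],
]
vocabulary = build_vocabulary(docs)
```

With this scheme the ids are contiguous by construction, which is what the LDA input format in this pull request asks for.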


---


[GitHub] madlib pull request #232: Multiple LDA improvements and fixes

2018-02-12 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/232#discussion_r167715245
  
--- Diff: src/ports/postgres/modules/utilities/text_utilities.sql_in ---
@@ -74,175 +81,231 @@ tasks related to text.
 Flag to indicate if a vocabulary is to be created. If TRUE, an additional
 output table is created containing the vocabulary of all words, with an id
 assigned to each word. The table is called output_table_vocabulary
-(suffix added to the output_table name) and contains the
+(i.e., suffix added to the output_table name) and contains the
 following columns:
-- \c wordid: An id assignment for each word
-- \c word: The word/term
+- \c wordid: An id for each word.
+- \c word: The word/term corresponding to the id.
 
 
 
 @anchor examples
 @par Examples
 
--# Prepare datasets with some example documents
+-# First we create a document table with one document per row:
 
 DROP TABLE IF EXISTS documents;
-CREATE TABLE documents(docid INTEGER, doc_contents TEXT);
+CREATE TABLE documents(docid INT4, contents TEXT);
 INSERT INTO documents VALUES
-(1, 'I like to eat broccoli and banana. I ate a banana and spinach smoothie for breakfast.'),
-(2, 'Chinchillas and kittens are cute.'),
-(3, 'My sister adopted two kittens yesterday'),
-(4, 'Look at this cute hamster munching on a piece of broccoli');
+(0, 'I like to eat broccoli and bananas. I ate a banana and spinach smoothie for breakfast.'),
+(1, 'Chinchillas and kittens are cute.'),
+(2, 'My sister adopted two kittens yesterday.'),
+(3, 'Look at this cute hamster munching on a piece of broccoli.');
 
-
--# Add a new column containing the words (lower-cased) in a text array
+You can apply stemming, stop word removal and tokenization at this point 
+in order to prepare the documents for text processing. 
+Depending upon your database version, various tools are 
+available. Databases based on more recent versions of 
+PostgreSQL may do something like:
+
+SELECT tsvector_to_array(to_tsvector('english',contents)) from documents;
+
+
+tsvector_to_array 
++--
+ {ate,banana,breakfast,broccoli,eat,like,smoothi,spinach}
+ {chinchilla,cute,kitten}
+ {adopt,kitten,sister,two,yesterday}
+ {broccoli,cute,hamster,look,munch,piec}
+(4 rows)
+
+In this example, we assume a database based on an older 
+version of PostgreSQL and just perform basic punctuation 
+removal and tokenization. The array of words is added as 
+a new column to the documents table:
 
 ALTER TABLE documents ADD COLUMN words TEXT[];
-UPDATE documents SET words = regexp_split_to_array(lower(doc_contents), E'[\\\\s+\\\\.]');
+UPDATE documents SET words = 
+regexp_split_to_array(lower(
+regexp_replace(contents, E'[,.;\\']','', 'g')
+), E'[\\\\s+]');
+\\x on   
+SELECT * FROM documents ORDER BY docid;
+
+
+-[ RECORD 1 ]
+docid    | 0
+contents | I like to eat broccoli and bananas. I ate a banana and spinach smoothie for breakfast.
+words    | {i,like,to,eat,broccoli,and,bananas,i,ate,a,banana,and,spinach,smoothie,for,breakfast}
+-[ RECORD 2 ]
+docid    | 1
+contents | Chinchillas and kittens are cute.
+words    | {chinchillas,and,kittens,are,cute}
+-[ RECORD 3 ]
+docid    | 2
+contents | My sister adopted two kittens yesterday.
+words    | {my,sister,adopted,two,kittens,yesterday}
+-[ RECORD 4 ]
+docid    | 3
+contents | Look at this cute hamster munching on a piece of broccoli.
+words    | {look,at,this,cute,hamster,munching,on,a,piece,of,broccoli}
 
 
--# Compute the frequency of each word in each document
+-# Compute the frequency of each word in each document:
 
-DROP TABLE IF EXISTS documents_tf;
-SELECT madlib.term_frequency('documents', 'docid', 'words', 'documents_tf');
-SELECT * FROM documents_tf order by docid;
+DROP TABLE IF EXISTS documents_tf, documents_tf_vocabulary;
+SELECT madlib.term_frequency('documents',-- input table
+ 'docid',-- document id
--- End diff --

document id column
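
As background for what the document id column selects, here is a rough Python sketch of a per-document term count: each row of the result is a (docid, word, count) triple. The function name `term_frequency` below is only a stand-in for illustration and is not the MADlib implementation; the sample documents are copied from the example table in the diff.

```python
from collections import Counter


def term_frequency(documents):
    """Return sorted (docid, word, count) rows for a {docid: [words]} mapping."""
    rows = []
    for docid, words in sorted(documents.items()):
        # Counter tallies how often each word occurs within one document.
        for word, count in sorted(Counter(words).items()):
            rows.append((docid, word, count))
    return rows


documents = {
    0: ["i", "like", "to", "eat", "broccoli", "and", "bananas", "i", "ate",
        "a", "banana", "and", "spinach", "smoothie", "for", "breakfast"],
    1: ["chinchillas", "and", "kittens", "are", "cute"],
}
rows = term_frequency(documents)
```

The document id is what keeps counts for the same word in different documents apart, which is why the argument names a column rather than a single id.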


---


[GitHub] madlib pull request #232: Multiple LDA improvements and fixes

2018-02-12 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/232#discussion_r167708065
  
--- Diff: src/ports/postgres/modules/lda/lda.sql_in ---
@@ -182,324 +105,789 @@ lda_train( data_table,
 \b Arguments
 
 data_table
-TEXT. The name of the table storing the training dataset. Each row is
+TEXT. Name of the table storing the training dataset. Each row is
 in the form docid, wordid, count where \c docid, \c wordid, and \c count
-are non-negative integers.
-
+are non-negative integers.
 The \c docid column refers to the document ID, the \c wordid column is the
 word ID (the index of a word in the vocabulary), and \c count is the
-number of occurrences of the word in the document.
-
-Please note that column names for \c docid, \c wordid, and \c count are currently fixed, so you must use these
-exact names in the data_table.
+number of occurrences of the word in the document. Please note:
+
+- \c wordid must be
+contiguous integers going from 0 to \c voc_size &minus; \c 1.
+- column names for \c docid, \c wordid, and \c count are currently fixed,
+so you must use these exact names in the data_table.
+
+The function Term Frequency
+can be used to generate vocabulary in the required format from raw documents.
+
 
 model_table
-TEXT. The name of the table storing the learned models. This table has one row and the following columns.
+TEXT. This is an output table generated by LDA which contains the learned model.
+It has one row with the following columns:
 
 
 voc_size
-INTEGER. Size of the vocabulary. Note that the \c wordid should be continous integers starting from 0 to \c voc_size &minus; \c 1.  A data validation routine is called to validate the dataset.
+INTEGER. Size of the vocabulary. As mentioned above for the input
+table, \c wordid consists of contiguous integers going
+from 0 to \c voc_size &minus; \c 1.
+
 
 
 topic_num
 INTEGER. Number of topics.
 
 
 alpha
-DOUBLE PRECISION. Dirichlet parameter for the per-doc topic multinomial (e.g. 50/topic_num).
+DOUBLE PRECISION. Dirichlet prior for the per-document
+topic multinomial.
 
 
 beta
-DOUBLE PRECISION. Dirichlet parameter for the per-topic word multinomial (e.g. 0.01).
+DOUBLE PRECISION. Dirichlet prior for the per-topic
+word multinomial.
 
 
 model
-BIGINT[].
+BIGINT[]. The encoded model description (not human readable).
 
 
 
 output_data_table
-TEXT. The name of the table to store the output data. It has the following columns:
+TEXT. The name of the table generated by LDA that stores 
+the output data. It has the following columns:
 
 
 docid
-INTEGER.
+INTEGER. Document id from input 'data_table'.
 
 
 wordcount
-INTEGER.
+INTEGER. Count of number of words in the document, 
+including repeats. For example, if a word appears 3 times 
+in the document, it is counted 3 times.
 
 
 words
-INTEGER[].
+INTEGER[]. Array of \c wordid in the document, not
+including repeats.  For example, if a word appears 3 times
+in the document, it appears only once in the \c words array.
 
 
 counts
-INTEGER[].
+INTEGER[]. Frequency of occurrence of a word in the document,
+indexed the same as the \c words array above.  For example, if the
+2nd element of the \c counts array is 4, it means that the word
+in the 2nd element of the \c words array occurs 4 times in the
+document.
 
 
 topic_count
-INTEGER[].
+INTEGER[]. Array of the count of words in the document
+that correspond to each topic.
--- End diff --

maybe mention array index corresponds to 0 
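
The relationship between the \c wordcount, \c words and \c counts columns described in the diff can be sketched in Python as follows. This is an illustration only: `summarize_documents` is a hypothetical helper, the input rows are made up, and sorting \c words by id is an assumption of the sketch rather than documented MADlib behaviour.

```python
from collections import defaultdict


def summarize_documents(rows):
    """Group (docid, wordid, count) rows into per-document summaries."""
    per_doc = defaultdict(dict)
    for docid, wordid, count in rows:
        per_doc[docid][wordid] = per_doc[docid].get(wordid, 0) + count

    summaries = {}
    for docid, word_counts in per_doc.items():
        words = sorted(word_counts)                # each wordid listed once
        counts = [word_counts[w] for w in words]   # same positions as words
        summaries[docid] = {
            "wordcount": sum(counts),              # total words, repeats included
            "words": words,
            "counts": counts,
        }
    return summaries


# Hypothetical training rows in the (docid, wordid, count) input format.
rows = [(0, 3, 1), (0, 7, 4), (0, 1, 3), (1, 7, 2)]
summaries = summarize_documents(rows)
```

Here document 0 has three distinct word ids, and the third element of its counts list (4) belongs to the third element of its words list (word id 7), matching the "2nd element" example in the proposed text.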

[GitHub] madlib pull request #232: Multiple LDA improvements and fixes

2018-02-12 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/232#discussion_r167708360
  
--- Diff: src/ports/postgres/modules/lda/lda.sql_in ---
@@ -182,324 +105,789 @@ lda_train( data_table,

[GitHub] madlib pull request #232: Multiple LDA improvements and fixes

2018-02-12 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/232#discussion_r167709835
  
--- Diff: src/ports/postgres/modules/lda/lda.sql_in ---
@@ -182,324 +105,789 @@ lda_train( data_table,