Hi,
   I am trying to take the contents of a PDF and vectorize it, following
the sequence of commands listed below. Although the sequence file is large,
the resulting vector file is only 163 bytes. Is this expected, and how can I
tell whether the resulting vector file is good or not? I ask because I am
unable to run a subsequent k-means clustering on it, and I am trying to
trace the error. (After the directory listing below I have sketched the kind
of check I was planning to try.)

- Ubuntu 11.10
- Mahout 0.5-cdh3u2
- Hadoop 0.20.2-cdh3u2
- pseudo-distributed mode, with intermediate outputs written to HDFS

=================================================================
pd@PeriyaData:~$ hadoop fs -ls /input/preprocessed/
Found 1 items
-rw-r--r--   1 pd supergroup     455886 2011-12-20 17:25
/input/preprocessed/full_report.txt
pd@PeriyaData:~$
pd@PeriyaData:~$ hadoop fs -ls /input/vectorized
Found 7 items
drwxr-xr-x   - pd supergroup          0 2011-12-20 17:27
/input/vectorized/df-count
-rw-r--r--   1 pd supergroup     226062 2011-12-20 17:26
/input/vectorized/dictionary.file-0
-rw-r--r--   1 pd supergroup     187713 2011-12-20 17:27
/input/vectorized/frequency.file-0
drwxr-xr-x   - pd supergroup          0 2011-12-20 17:26
/input/vectorized/tf-vectors
drwxr-xr-x   - pd supergroup          0 2011-12-20 17:27
/input/vectorized/tfidf-vectors
drwxr-xr-x   - pd supergroup          0 2011-12-20 17:25
/input/vectorized/tokenized-documents
drwxr-xr-x   - pd supergroup          0 2011-12-20 17:26
/input/vectorized/wordcount
pd@PeriyaData:~$ hadoop fs -ls /input/vectorized/tfidf-vectors
Found 3 items
-rw-r--r--   1 pd supergroup          0 2011-12-20 17:27
/input/vectorized/tfidf-vectors/_SUCCESS
drwxr-xr-x   - pd supergroup          0 2011-12-20 17:27
/input/vectorized/tfidf-vectors/_logs
-rw-r--r--   1 pd supergroup        163 2011-12-20 17:27
/input/vectorized/tfidf-vectors/part-r-00000
pd@PeriyaData:~$
pd@PeriyaData:~$ hadoop fs -cat /input/vectorized/tfidf-vectors/_SUCCESS
pd@PeriyaData:~$
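
The only sanity check I could come up with so far is to dump the part file
and look at what is actually in it. This is roughly what I was planning to
run (I am not certain the option names are exactly the same in Mahout 0.5,
so the flags below may need adjusting):

# dump the tf-idf vectors to a local text file for inspection
# (I believe seqdumper in 0.5 takes --seqFile rather than --input)
$MAHOUT_HOME/bin/mahout seqdumper \
                        --seqFile   /input/vectorized/tfidf-vectors/part-r-00000 \
                        --output    /tmp/tfidf-dump.txt

I would expect one named vector per document in the dump; an empty or
near-empty vector would at least explain the 163-byte part file.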

=================================================================
#!/bin/bash

java -jar $TIKA_HOME/tika-app/target/tika-app-1.0.jar --text \
     ~/bigdata/examples/input/raw/full_report.pdf \
     > ~/bigdata/examples/input/preprocessed/MGI_big_data_full_report.txt

wait

hadoop dfs -put ~/bigdata/examples/input/preprocessed/full_report.txt \
                /input/preprocessed/

wait

$MAHOUT_HOME/bin/mahout seqdirectory --input             /input/preprocessed/ \
                        --output            /input/seqFiles/ \
                        --charset           utf-8
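# (as I understand it, seqdirectory writes one <Text,Text> entry per input
#  file, keyed by the file name, so with a single .txt file the sequence
#  file should contain exactly one document)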

wait

$MAHOUT_HOME/bin/mahout seq2sparse   --input             /input/seqFiles        \
                        --output            /input/vectorized     \
                        --maxNGramSize      2                        \
                        --namedVector                                \
                        --minDF             4                        \
                        --maxDFPercent      75                       \
                        --weight            TFIDF                    \
                        --norm              2
=============================================================================
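
I was also thinking of dumping the vectors together with the dictionary, to
see which terms (if any) survived the minDF / maxDFPercent pruning. Again, I
am not sure the flag names are identical in 0.5, so treat this as a sketch:

# dump tf-idf vectors with the dictionary so the entries show up as terms
# (dictionary.file-0 is a sequence file, hence --dictionaryType sequencefile)
$MAHOUT_HOME/bin/mahout vectordump \
                        --seqFile        /input/vectorized/tfidf-vectors/part-r-00000 \
                        --dictionary     /input/vectorized/dictionary.file-0 \
                        --dictionaryType sequencefile \
                        --output         /tmp/tfidf-terms.txt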


Thanks for your suggestions.

PD
