Re: vectorized output from a sequence file

Periya.Data Wed, 21 Dec 2011 12:57:04 -0800

Hi Ted,


Before tokenization (raw text file obtained from the Tika output of a PDF file):
==================================================
pd@PeriyaData:~$ hadoop fs -ls /input/preprocessed
Found 1 items
-rw-r--r--   1 pd supergroup     455886 2011-12-20 17:25 /input/preprocessed/MGI_big_data_full_report.txt
pd@PeriyaData:~$
pd@PeriyaData:~$ hadoop fs -cat /input/preprocessed/MGI_big_data_full_report.txt | more

McKinsey Global Institute

Big data: The next frontier
for innovation, competition,
and productivity

June 2011

The McKinsey Global Institute

The McKinsey Global Institute (MGI), established in 1990, is McKinsey &
Company’s business and economics research arm.

MGI’s mission is to help leaders in the commercial, public, and social sectors
develop a deeper understanding of the evolution of the global economy and to
provide a fact base that contributes to decision making on critical management
and policy issues.
[....]


After Tokenization:
===============
pd@PeriyaData:~$
pd@PeriyaData:~$ hadoop fs -ls /input/vectorized/tokenized-documents
Found 3 items
-rw-r--r--   1 pd supergroup          0 2011-12-21 12:38 /input/vectorized/tokenized-documents/_SUCCESS
drwxr-xr-x   - pd supergroup          0 2011-12-21 12:38 /input/vectorized/tokenized-documents/_logs
-rw-r--r--   1 pd supergroup     366025 2011-12-21 12:38 /input/vectorized/tokenized-documents/part-m-00000

pd@PeriyaData:~$ hadoop fs -cat /input/vectorized/tokenized-documents/part-m-00000 | more
SEQorg.apache.hadoop.io.Text$org.apache.mahout.common.StringTupleanexfrontier
innovation
          competition^Lproductivityjune201mckinseyglobal
institutmckinseyglobal    institutemgi

established199mckinsey

company’business    -economicresearcharmmgi’smissionhelpleaders
understandingblievolutionglobaleconomyprovidefactbase

contributedecisionmakincritical
managementpolicyissuesmgresearccombinestwo
                                              disciplines    economics
management
economistsoftenhavelimitedaccess
practicaproblemsfacingseniomanagerswhileseniomanagersoftenlacktime
incentivelookbeyondowindustryla
rgerissuesglobaleconomy
macroeconomictrendserm
affectinbusinesstrategypolicymakingnearlytwodecadesmgihautilizedmicromacrapproacresearccoveringmorethan20
countri
es3industrysectorsmgi’scurrenresearchagendafocusesthreebroadareas^Lpr

===================================================


pd@PeriyaData:~$ hadoop fs -ls /input/vectorized/
Found 7 items
drwxr-xr-x   - pd supergroup          0 2011-12-20 17:27 /input/vectorized/df-count
-rw-r--r--   1 pd supergroup     226062 2011-12-20 17:26 /input/vectorized/dictionary.file-0
-rw-r--r--   1 pd supergroup     187713 2011-12-20 17:27 /input/vectorized/frequency.file-0
drwxr-xr-x   - pd supergroup          0 2011-12-20 17:26 /input/vectorized/tf-vectors
drwxr-xr-x   - pd supergroup          0 2011-12-20 17:27 /input/vectorized/tfidf-vectors
drwxr-xr-x   - pd supergroup          0 2011-12-20 17:25 /input/vectorized/tokenized-documents
drwxr-xr-x   - pd supergroup          0 2011-12-20 17:26 /input/vectorized/wordcount


pd@PeriyaData:~$ hadoop fs -ls /input/
Found 4 items
drwxr-xr-x   - pd supergroup          0 2011-12-20 17:22 /input/novels
drwxr-xr-x   - pd supergroup          0 2011-12-20 17:25 /input/preprocessed
drwxr-xr-x   - pd supergroup          0 2011-12-20 17:25 /input/seqFiles
drwxr-xr-x   - pd supergroup          0 2011-12-20 17:27 /input/vectorized


pd@PeriyaData:~$ hadoop fs -ls /input/vectorized/tf-vectors
Found 3 items
-rw-r--r--   1 pd supergroup          0 2011-12-21 12:39 /input/vectorized/tf-vectors/_SUCCESS
drwxr-xr-x   - pd supergroup          0 2011-12-21 12:39 /input/vectorized/tf-vectors/_logs
-rw-r--r--   1 pd supergroup      92926 2011-12-21 12:39 /input/vectorized/tf-vectors/part-r-00000

pd@PeriyaData:~$ hadoop fs -ls /input/vectorized/wordcount
Found 2 items
drwxr-xr-x   - pd supergroup          0 2011-12-21 12:38 /input/vectorized/wordcount/ngrams
drwxr-xr-x   - pd supergroup          0 2011-12-21 12:38 /input/vectorized/wordcount/subgrams

pd@PeriyaData:~$ hadoop fs -ls /input/vectorized/wordcount/ngrams
Found 3 items
-rw-r--r--   1 pd supergroup          0 2011-12-21 12:38 /input/vectorized/wordcount/ngrams/_SUCCESS
drwxr-xr-x   - pd supergroup          0 2011-12-21 12:38 /input/vectorized/wordcount/ngrams/_logs
-rw-r--r--   1 pd supergroup     263581 2011-12-21 12:38 /input/vectorized/wordcount/ngrams/part-r-00000

I noticed that the other outputs, such as the tf-vectors, look reasonably
large. Only the tfidf-vectors are small.

Again, I am using Mahout 0.5 and Hadoop 0.20.2-cdh3u2. All of this is to track
down why my k-means clustering fails with an IndexOutOfBoundsException, and it
looks like the tfidf-vector generation may be the culprit.

Thanks,
/PD


On Wed, Dec 21, 2011 at 12:21 PM, Ted Dunning wrote:

> It should be smaller, but this seems a bit too small.
>
> Can you dump the intermediate data containing the untokenized words?  What
> about the words after tokenization?
>
> On Wed, Dec 21, 2011 at 11:45 AM, Isabel Drost wrote:
>
> > On 21.12.2011 Periya.Data wrote:
> > > Though the sequence-file is large, the vector file is relatively small
> > > (163 bytes). Is this expected?
> >
> > Upon vectorizing, your text is split into tokens. For each token, the
> > vector will contain its id and an entry indicating how often that token
> > was found in your document (weighted by the overall number of documents in
> > your input that contain that token). So yes, the resulting vector should
> > be smaller than the original document.
> >
> > Vectors are stored as Sequence Files, so you should be able to just print
> > their content and see whether that makes sense.
> >
> > Isabel
> >
> >
>
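
To make the weighting Isabel describes concrete, here is a minimal sketch in
Python, not Mahout code. It assumes the textbook formula tf * log(N / df);
the formula Mahout 0.5 actually applies may differ, and the function name
tfidf_vectors and the dictionary layout are hypothetical names chosen for
illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors from a list of tokenized documents.

    docs is a list of token lists. Returns (dictionary, vectors), where
    dictionary maps token -> integer id and each vector maps id -> weight.
    """
    # Assign each distinct token an integer id in order of first appearance
    # (a hypothetical layout, not Mahout's dictionary file format).
    dictionary = {}
    for tokens in docs:
        for tok in tokens:
            dictionary.setdefault(tok, len(dictionary))

    # Document frequency: in how many documents each token occurs.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    n_docs = len(docs)
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        # Textbook weighting (an assumption): term count times log(N / df).
        vectors.append(
            {dictionary[t]: count * math.log(n_docs / df[t])
             for t, count in tf.items()}
        )
    return dictionary, vectors


if __name__ == "__main__":
    dictionary, vectors = tfidf_vectors(
        [["big", "data", "big"], ["data", "report"]]
    )
    print(dictionary)
    print(vectors)
```

Each vector holds one entry per distinct token, which is why it is smaller
than the document it came from. Under this textbook formula, a corpus that
contains a single document gives every term a weight of zero, because
log(1 / 1) = 0. Whether something comparable happens in Mahout 0.5 is not
established by this sketch.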
