Hi Ted,
Before tokenization (raw text file obtained from Tika's output for a PDF file):
==================================================
pd@PeriyaData:~$ hadoop fs -ls /input/preprocessed
Found 1 items
-rw-r--r-- 1 pd supergroup 455886 2011-12-20 17:25 /input/preprocessed/MGI_big_data_full_report.txt
pd@PeriyaData:~$
pd@PeriyaData:~$ hadoop fs -cat /input/preprocessed/MGI_big_data_full_report.txt | more
McKinsey Global Institute
Big data: The next frontier
for innovation, competition,
and productivity
June 2011
The McKinsey Global Institute
The McKinsey Global Institute (MGI), established in 1990, is McKinsey &
Company’s business and economics research arm.
MGI’s mission is to help leaders in the commercial, public, and social sectors
develop a deeper understanding of the evolution of the global economy and to
provide a fact base that contributes to decision making on critical management
and policy issues.
[....]
After Tokenization:
===============
pd@PeriyaData:~$
pd@PeriyaData:~$ hadoop fs -ls /input/vectorized/tokenized-documents
Found 3 items
-rw-r--r-- 1 pd supergroup 0 2011-12-21 12:38 /input/vectorized/tokenized-documents/_SUCCESS
drwxr-xr-x - pd supergroup 0 2011-12-21 12:38 /input/vectorized/tokenized-documents/_logs
-rw-r--r-- 1 pd supergroup 366025 2011-12-21 12:38 /input/vectorized/tokenized-documents/part-m-00000
pd@PeriyaData:~$ hadoop fs -cat /input/vectorized/tokenized-documents/part-m-00000 | more
SEQorg.apache.hadoop.io.Text$org.apache.mahout.common.StringTupleanexfrontier
innovation
competition^Lproductivityjune201mckinseyglobal
institutmckinseyglobal institutemgi
established199mckinsey
company’business -economicresearcharmmgi’smissionhelpleaders
understandingblievolutionglobaleconomyprovidefactbase
contributedecisionmakincritical
managementpolicyissuesmgresearccombinestwo
disciplines economics
management
economistsoftenhavelimitedaccess
practicaproblemsfacingseniomanagerswhileseniomanagersoftenlacktime
incentivelookbeyondowindustryla
rgerissuesglobaleconomy
macroeconomictrendserm
affectinbusinesstrategypolicymakingnearlytwodecadesmgihautilizedmicromacrapproacresearccoveringmorethan20
countri
es3industrysectorsmgi’scurrenresearchagendafocusesthreebroadareas^Lpr
===================================================
pd@PeriyaData:~$ hadoop fs -ls /input/vectorized/
Found 7 items
drwxr-xr-x - pd supergroup 0 2011-12-20 17:27 /input/vectorized/df-count
-rw-r--r-- 1 pd supergroup 226062 2011-12-20 17:26 /input/vectorized/dictionary.file-0
-rw-r--r-- 1 pd supergroup 187713 2011-12-20 17:27 /input/vectorized/frequency.file-0
drwxr-xr-x - pd supergroup 0 2011-12-20 17:26 /input/vectorized/tf-vectors
drwxr-xr-x - pd supergroup 0 2011-12-20 17:27 /input/vectorized/tfidf-vectors
drwxr-xr-x - pd supergroup 0 2011-12-20 17:25 /input/vectorized/tokenized-documents
drwxr-xr-x - pd supergroup 0 2011-12-20 17:26 /input/vectorized/wordcount
pd@PeriyaData:~$ hadoop fs -ls /input/
Found 4 items
drwxr-xr-x - pd supergroup 0 2011-12-20 17:22 /input/novels
drwxr-xr-x - pd supergroup 0 2011-12-20 17:25 /input/preprocessed
drwxr-xr-x - pd supergroup 0 2011-12-20 17:25 /input/seqFiles
drwxr-xr-x - pd supergroup 0 2011-12-20 17:27 /input/vectorized
pd@PeriyaData:~$ hadoop fs -ls /input/vectorized/tf-vectors
Found 3 items
-rw-r--r-- 1 pd supergroup 0 2011-12-21 12:39 /input/vectorized/tf-vectors/_SUCCESS
drwxr-xr-x - pd supergroup 0 2011-12-21 12:39 /input/vectorized/tf-vectors/_logs
-rw-r--r-- 1 pd supergroup 92926 2011-12-21 12:39 /input/vectorized/tf-vectors/part-r-00000
pd@PeriyaData:~$ hadoop fs -ls /input/vectorized/wordcount
Found 2 items
drwxr-xr-x - pd supergroup 0 2011-12-21 12:38 /input/vectorized/wordcount/ngrams
drwxr-xr-x - pd supergroup 0 2011-12-21 12:38 /input/vectorized/wordcount/subgrams
pd@PeriyaData:~$ hadoop fs -ls /input/vectorized/wordcount/ngrams
Found 3 items
-rw-r--r-- 1 pd supergroup 0 2011-12-21 12:38 /input/vectorized/wordcount/ngrams/_SUCCESS
drwxr-xr-x - pd supergroup 0 2011-12-21 12:38 /input/vectorized/wordcount/ngrams/_logs
-rw-r--r-- 1 pd supergroup 263581 2011-12-21 12:38 /input/vectorized/wordcount/ngrams/part-r-00000
I noticed that the other outputs, such as tf-vectors, look reasonably large.
Only the tfidf-vectors are small.
Again, I am using Mahout 0.5 and Hadoop 0.20.2-cdh3u2. All of this is to track
down why my k-means clustering fails with an IndexOutOfBoundsException, and it
looks like the tfidf-vector generation may be the culprit.
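
One possible explanation, sketched here as a toy and not taken from Mahout's
code: if TF-IDF generation drops terms whose document frequency exceeds some
percentage of the corpus, then an input with a single document loses every
term, because each term occurs in 100% of the documents. The function name,
the max_df_percent parameter, and the IDF formula below are assumptions made
for illustration. seq2sparse does have a maximum document frequency
percentage option, but whether it explains the small file here is unverified.

```python
import math
from collections import Counter


def tfidf_vectors(docs, max_df_percent=99):
    """Toy TF-IDF. docs is a list of token lists; returns one {term: weight} dict per doc."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = {}
        for term, count in tf.items():
            # Hypothetical pruning rule: skip terms present in too many documents.
            if 100.0 * df[term] / n_docs > max_df_percent:
                continue
            # Illustrative IDF only; not Mahout's exact weighting.
            idf = math.log(n_docs / df[term]) + 1.0
            vec[term] = count * idf
        vectors.append(vec)
    return vectors


single = [["big", "data", "big", "frontier"]]
print(tfidf_vectors(single))                      # [{}]: every term is pruned
print(tfidf_vectors(single, max_df_percent=100))  # terms survive when the cutoff is 100
```

With one document the first call yields an empty vector while the raw term
counts are non-empty, which would match a large tf-vectors file next to a
tiny tfidf-vectors file.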
Thanks,
/PD
On Wed, Dec 21, 2011 at 12:21 PM, Ted Dunning <[email protected]> wrote:
> It should be smaller, but this seems a bit too small.
>
> Can you dump the intermediate data containing the untokenized words? What
> about the words after tokenization?
>
> On Wed, Dec 21, 2011 at 11:45 AM, Isabel Drost <[email protected]> wrote:
>
> > On 21.12.2011 Periya.Data wrote:
> > > Though the sequence-file is large, the vector file is relatively small
> > > (163 bytes). Is this expected?
> >
> > Upon vectorizing, your text is split into tokens. For each token, the
> > vector will contain its id and an entry indicating how often that token
> > was found in your document (weighted by the overall number of documents in
> > your input that contain that token). So yes, the resulting vector should
> > be smaller than the original document.
> >
> > Vectors are stored as Sequence Files - you should be able to just print
> > their content and see whether that makes sense.
> >
> > Isabel
> >
> >
>