Re: Data Vectorization

Suneel Marthi Mon, 16 Dec 2013 13:14:00 -0800

Sorry, I have not been following this thread, but the output of seqdirectory 
seems suspect.  Are you working off of Mahout 0.8 or trunk?
Could you try running seqdirectory again with the '-xm sequential' option and 
repeat the subsequent steps?



Given that your tf-vectors, tokenized documents, etc. are all zero bytes, I 
suspect your seqdirectory has failed.
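
A minimal sketch of how such a listing could be checked for empty output files, offered only as an illustration. The helper names (`parse_listing`, `empty_data_files`) are hypothetical and not part of Hadoop or Mahout; the sample text is copied from the seqdirectory listing quoted below. Note that `hadoop dfs -ls` shows size 0 for directory entries themselves, so the sketch flags only regular files; the part files inside a directory such as tf-vectors would need their own listing. Entries whose names start with `_` (such as `_SUCCESS` and `_logs`) are skipped on the assumption that they are markers or logs rather than data.

```python
import re

# One entry of `hadoop dfs -ls` output as it appears in this thread:
# permissions, replication (or "-"), owner, group, size, date, time, path.
LS_LINE = re.compile(
    r"^(?P<perm>[d-][rwx-]{9})\s+(?P<repl>\S+)\s+(?P<owner>\S+)\s+"
    r"(?P<group>\S+)\s+(?P<size>\d+)\s+\S+\s+\S+\s+(?P<path>\S+)$"
)


def parse_listing(text):
    """Return one dict per listing entry; non-entry lines are ignored."""
    entries = []
    for line in text.splitlines():
        m = LS_LINE.match(line.strip())
        if m is None:
            continue  # "Found N items", warnings, blank lines
        entries.append({
            "path": m.group("path"),
            "size": int(m.group("size")),
            "is_dir": m.group("perm").startswith("d"),
        })
    return entries


def empty_data_files(entries):
    """Paths of zero-byte regular files, skipping names that start with '_'."""
    flagged = []
    for e in entries:
        name = e["path"].rsplit("/", 1)[-1]
        if e["is_dir"] or name.startswith("_"):
            continue
        if e["size"] == 0:
            flagged.append(e["path"])
    return flagged


SAMPLE = """\
Found 3 items
-rw-r--r--   1 userid supergroup          0 2013-12-12 11:33 /scratch/VectorizedOutputSeqdir/_SUCCESS
drwxr-xr-x   - userid supergroup          0 2013-12-12 11:33 /scratch/VectorizedOutputSeqdir/_logs
-rw-r--r--   1 userid supergroup      61940 2013-12-12 11:33 /scratch/VectorizedOutputSeqdir/part-m-00000
"""

entries = parse_listing(SAMPLE)
print(empty_data_files(entries))
```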




On Monday, December 16, 2013 3:58 PM, Sameer Tilak <[email protected]> wrote:
 
It does not seem to work :(.
Here is how I use the generated sequence file (described in my last email) for 
clustering. 
./mahout seqdirectory -i /scratch/VectorizedInput -o /scratch/VectorizedOutputSeqdir -c UTF-8 -chunk 5
13/12/12 11:33:53 INFO driver.MahoutDriver: Program took 24433 ms (Minutes: 0.40721666666666667)
-bash-4.1$ hadoop dfs -ls /scratch/VectorizedOutputSeqdir
Warning: $HADOOP_HOME is deprecated.
Found 3 items
-rw-r--r--   1 userid supergroup          0 2013-12-12 11:33 /scratch/VectorizedOutputSeqdir/_SUCCESS
drwxr-xr-x   - userid supergroup          0 2013-12-12 11:33 /scratch/VectorizedOutputSeqdir/_logs
-rw-r--r--   1 userid supergroup      61940 2013-12-12 11:33 /scratch/VectorizedOutputSeqdir/part-m-00000

./mahout seq2sparse -i /scratch/VectorizedOutputSeqdir -o /scratch/VectorizedOutputSparce
Program took 361827 ms (Minutes: 6.03045)
-bash-4.1$ hadoop dfs -ls /scratch/VectorizedOutputSparce
Warning: $HADOOP_HOME is deprecated.
Found 7 items
drwxr-xr-x   - userid supergroup          0 2013-12-12 11:38 /scratch/VectorizedOutputSparce/df-count
-rw-r--r--   1 userid supergroup      19692 2013-12-12 11:36 /scratch/VectorizedOutputSparce/dictionary.file-0
-rw-r--r--   1 userid supergroup      22893 2013-12-12 11:38 /scratch/VectorizedOutputSparce/frequency.file-0
drwxr-xr-x   - userid supergroup          0 2013-12-12 11:39 /scratch/VectorizedOutputSparce/tf-vectors
drwxr-xr-x   - userid supergroup          0 2013-12-12 11:41 /scratch/VectorizedOutputSparce/tfidf-vectors
drwxr-xr-x   - userid supergroup          0 2013-12-12 11:35 /scratch/VectorizedOutputSparce/tokenized-documents
drwxr-xr-x   - userid supergroup          0 2013-12-12 11:36 /scratch/VectorizedOutputSparce/wordcount

./mahout kmeans -i /scratch/VectorizedOutputSparce/tfidf-vectors/ -c /scratch/clusters -o /scratch/kmeans -x 10 -k 20 -ow
Program took 43559 ms (Minutes: 0.7259833333333333)
-bash-4.1$ hadoop dfs -ls /scratch/kmeans
Warning: $HADOOP_HOME is deprecated.
Found 2 items
drwxr-xr-x   - userid supergroup          0 2013-12-12 11:57 /scratch/kmeans/clusters-0
drwxr-xr-x   - userid supergroup          0 2013-12-12 11:58 /scratch/kmeans/clusters-1-final

./mahout clusterdump -i /scratch/kmeans/clusters-*-final -d /scratch/VectorizedOutputSparce/dictionary.file-0 -dt sequencefile -b 100 -n 20
13/12/12 12:28:43 INFO clustering.ClusterDumper: Wrote 7 clusters
13/12/12 12:28:43 INFO driver.MahoutDriver: Program took 1066 ms (Minutes: 0.017766666666666667)

Running on hadoop, using /users/p529444/software/hadoop-1.0.3/bin/hadoop and HADOOP_CONF_DIR=/apps/hadoop/hadoop-conf
MAHOUT-JOB: /apps/mahout/trunk/examples/target/mahout-examples-0.9-SNAPSHOT-job.jar
Warning: $HADOOP_HOME is deprecated.
13/12/16 12:52:49 INFO common.AbstractJob: Command line arguments: {--dictionary=[/scratch/VectorizedOutputSparce/dictionary.file-0], --dictionaryType=[sequencefile], --distanceMeasure=[org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure], --endPhase=[2147483647], --input=[/scratch/kmeans/clusters-*-final], --numWords=[20], --outputFormat=[TEXT], --startPhase=[0], --substring=[100], --tempDir=[temp]}
:VL-0{n=2 c=[0:13.950, 1:4.932, 110:3.186, 1107280:3.186, 1118:3.186, 1177:3.186, 120:2.612, 12065996
    Top Terms:
        records                                 =>  23.841001510620117
        output                                  =>  23.411399841308594
        org                                     =>  19.115327835083008
        bytes                                   =>  19.115327835083008
        apache                                  =>   17.00798797607422
        map                                     =>  15.120783805847168
        counter                                 =>  14.424721717834473
        counters                                =>  14.247724533081055
        hadoop                                  =>  14.189364433288574
        0                                       =>  13.949626922607422
        input                                   =>  13.516578674316406
        task_type                               =>  12.743552207946777
        taskid                                  =>  12.743552207946777
        file_bytes_written                      =>  12.743552207946777
        filesystemcounters                      =>  12.743552207946777
        memory                                  =>  12.743552207946777
        task                                    =>  12.338891983032227
        _0_b                                    =>  11.036240577697754
        _1_au                                   =>  11.036240577697754
        _2_vectorizedinput                      =>  11.036240577697754
:VL-1{n=2 c=[0:6.975, 0.0.0.0:6.758, 0.05:3.186, 0.11.1:6.758, 0.9:7.124, 1:6.430, 1.0:3.186, 1.0.3:1
    Top Terms:
        name                                    =>    59.7300910949707
        property                                =>   59.68759536743164
        value                                   =>   59.68759536743164
        jar                                     =>  17.303800582885742
        users                                   =>  17.156532287597656
        software                                =>   17.00798797607422
        1.0.3                                   =>  16.858135223388672
        lib                                     =>  16.625680923461914
        libexec                                 =>   16.08794593811035
        hadoop                                  =>   15.67484188079834
        tmp                                     =>  15.607600212097168
        userid                                  =>  14.545638084411621
        true                                    =>  12.338891983032227
        temp                                    =>  12.131500244140625
        578898841                               =>  11.920500755310059
        false                                   =>   10.56639575958252
        institution                             =>  10.323457717895508
        usr                                     =>    9.81956672668457
        openjdk                                 =>   9.557663917541504
        1.7.0                                   =>   9.288379669189453


> Date: Mon, 16 Dec 2013 12:13:52 -0800
> Subject: Re: Data Vectorization
> From: [email protected]
> To: [email protected]
> 
> Looks reasonable.  Does it work?
> 
> 
> On Mon, Dec 16, 2013 at 12:09 PM, Sameer Tilak <[email protected]> wrote:
> 
> > Hi All,
> > I have some questions regarding vectorization.
> >
> > Here is my Pig script snippet.
> >
> > AU = FOREACH A GENERATE myparser.myUDF(param1, param2);
> > STORE AU into '/scratch/AU';
> > AU has the following format:
> > (userid, (item_view_history))
> > (27,(0,1,1,0,0))
> > (28,(0,0,1,0,0))
> > (29,(0,0,1,0,1))
> > (30,(1,0,1,0,1))
> > I will have at least a few hundred thousand numbers in the
> > (item_view_history); for readability I am just showing 5 here.
> > I am not sure how to get this data written to a format that Mahout's
> > clustering algorithms will be able to parse. I have the following steps,
> > but I am not sure if my understanding is correct. Any help with this will
> > be great!
> >
> > VectorizedInput = FOREACH AU GENERATE FLATTEN($0);
> >
> > /* I am assuming the field userid will be used as a key and will be written
> > using $INT_CONVERTER, and the tuple will be written using
> > $VECTOR_CONVERTER. Is this correct? */
> > STORE VectorizedInput into '/scratch/VectorizedInput' using
> > $SEQFILE_STORAGE ('-c $INT_CONVERTER', '-c $VECTOR_CONVERTER');
> >
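
To make the key/value layout asked about above concrete (userid as the key, item_view_history as the vector), here is a small illustrative sketch in plain Python. `parse_records` is a hypothetical helper, not part of Pig or Mahout, and it says nothing about what `$SEQFILE_STORAGE` actually writes; it only shows the intended shape of the data, using the four sample records from the message.

```python
import re

# One record in the format shown in the thread: (userid,(v1,v2,...)).
RECORD = re.compile(r"\((\d+),\(([^()]*)\)\)")


def parse_records(text):
    """Return {userid: [values...]} for every (userid,(...)) record in text."""
    result = {}
    for userid, values in RECORD.findall(text):
        result[int(userid)] = [float(v) for v in values.split(",") if v.strip()]
    return result


sample = "(27,(0,1,1,0,0))(28,(0,0,1,0,0))(29,(0,0,1,0,1))(30,(1,0,1,0,1))"
vectors = parse_records(sample)
for userid, vec in sorted(vectors.items()):
    print(userid, vec)
```

Each key maps to a fixed-length list of numbers, which is the shape a clustering input would need: one vector per user, all of the same dimension.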
