Hi,

First of all, I use Mahout 0.7.

The input for seqdirectory is not a matrix. It's a directory containing the text files you want to build a SequenceFile from. I suspect that's also why your svd run blew up: chunk-0 holds Text values, not VectorWritable rows, so the solver read no vectors at all (note "Map input records=0" in your log) and had nothing to return when it fetched the result vector.

I've got a directory called infiles, which contains these files:
01.txt  02.txt  03.txt  04.txt  05.txt  06.txt  07.txt

These are my demo files. They just contain plain text (only letters in my case) to try Mahout out, e.g.:
01.txt:
x a a a b
02.txt:
a a b b
03.txt:
c c d d x x x x x x x x x x x 0 0 0
04.txt:
c c c c
05.txt:
a b c d
06.txt:
b b b b
07.txt:
a x a a a

After running:
mahout seqdirectory -c UTF-8 -i infiles/ -o seqfiles

Mahout creates a seqfiles folder containing a chunk-0 file. This is a binary SequenceFile with all those input files merged into it.
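
If you want to sanity-check what's inside, seqdumper should print the key/value pairs (the keys are the file names, the values are the raw file text):

mahout seqdumper -i seqfiles/chunk-0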

Then I run:
mahout seq2sparse -i seqfiles/ -o vectors -ow

which produces a folder 'vectors' with the following structure:

/vectors$ ls -R
.:
df-count dictionary.file-0 frequency.file-0 tfidf-vectors tf-vectors tokenized-documents wordcount

./df-count:
part-r-00000  _SUCCESS

./tfidf-vectors:
part-r-00000  _SUCCESS

./tf-vectors:
part-r-00000  _SUCCESS

./tokenized-documents:
part-m-00000  _SUCCESS

./wordcount:
part-r-00000  _SUCCESS


All created files are binary SequenceFiles; dictionary.file-0 maps each term to the integer index used as its column number in the vectors.

vectors/tfidf-vectors holds the tf-idf matrix (one row per document) that I should run SVD on.
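
If you're curious what that "matrix" actually looks like on disk: as far as I can tell, tfidf-vectors is a SequenceFile of Text keys (document names) and VectorWritable values (sparse tf-idf rows, indexed by dictionary position). Here's a minimal sketch that prints it, assuming the Mahout and Hadoop jars are on the classpath; the class name is just for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.VectorWritable;

public class DumpTfidf {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // one record per document: Text key = doc name, VectorWritable value = tf-idf row
    SequenceFile.Reader reader =
        new SequenceFile.Reader(fs, new Path("vectors/tfidf-vectors/part-r-00000"), conf);
    Text key = new Text();
    VectorWritable value = new VectorWritable();
    while (reader.next(key, value)) {
      System.out.println(key + " => " + value.get());
    }
    reader.close();
  }
}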

mahout svd -i vectors/tfidf-vectors/ -o svd-values --numRows 8 --numCols 5 -r 3

This returns the eigenvectors from Lanczos. However, we should use ssvd instead, as "[email protected]" pointed out. I haven't had time to try it yet, but https://cwiki.apache.org/MAHOUT/stochastic-singular-value-decomposition.data/SSVD-CLI.pdf describes it very well. I'll get some time this evening to try it, and if you're interested I'll share my results.
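
If it helps in the meantime, from skimming that PDF the basic invocation should look roughly like this (-k being the decomposition rank; I haven't verified the flags myself yet):

mahout ssvd -i vectors/tfidf-vectors -o ssvd-out -k 3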

Everything I've done here is also covered in:
https://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html
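
And to answer your original question directly: as far as I understand, the SVD job just wants a SequenceFile whose values are VectorWritable rows (keyed by IntWritable row indices), so for a plain numeric matrix like yours you can skip seqdirectory/seq2sparse entirely and write that file yourself. A rough sketch, using the matrix from your second mail; the output path and class name are made up:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.VectorWritable;

public class MatrixToSeq {
  public static void main(String[] args) throws Exception {
    double[][] rows = {
      {1.1, 1.2, 1.3},
      {2.1, 2.2, 2.3},
      {3.1, 3.2, 3.3}
    };
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // each record is one matrix row: IntWritable row index -> VectorWritable row
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path("small_matrix_seq/matrix"),
        IntWritable.class, VectorWritable.class);
    try {
      for (int i = 0; i < rows.length; i++) {
        writer.append(new IntWritable(i), new VectorWritable(new DenseVector(rows[i])));
      }
    } finally {
      writer.close();
    }
  }
}

Then you should be able to run something like: mahout svd -i small_matrix_seq/matrix -o SVDOutput --numRows 3 --numCols 3 -r 2 (if I remember right, the rank has to stay below the smaller matrix dimension for Lanczos).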




On 23.11.2012 00:01, Chui-Hui Chiu wrote:
Hello, all,

I have a small matrix stored in a local file, small_matrix, in ASCII
format as follows.

1 0 0 0 2
0 0 3 0 0
0 0 0 0 0
0 4 0 0 0

I run the following command to convert it to Sequence Files.

mahout seqdirectory -i <some local dir>/small_matrix -o small_matrix_seq -c ASCII -chunk 5

I see there's a new directory, "small_matrix_seq", in my root directory in
HDFS, with a single file "chunk-0" inside. Then I launch the Mahout SVD
with the following command line.

mahout-distribution-0.6/bin/mahout svd -i <my root dir in HDFS>/small_matrix_seq/chunk-0 -o <my root dir in HDFS>/SVDOutput -nr 4 -nc 5 -r 4

This is the output.

MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/usr/local/hadoop-0.20.203.0
No HADOOP_CONF_DIR set, using /usr/local/hadoop-0.20.203.0/conf
MAHOUT-JOB: /home/lanet/Downloads/mahout-distribution-0.6/mahout-examples-0.6-job.jar
12/11/22 16:38:36 INFO common.AbstractJob: Command line arguments: {--endPhase=2147483647, --inMemory=false, --input=/user/lanet/small_matrix_seq/chunk-0, --maxError=0.05, --minEigenvalue=0.0, --numCols=5, --numRows=4, --output=/user/lanet/SVDOutput, --rank=4, --startPhase=0, --tempDir=temp}
12/11/22 16:38:37 INFO lanczos.LanczosSolver: Finding 4 singular vectors of matrix with 4 rows, via Lanczos
12/11/22 16:38:37 INFO mapred.FileInputFormat: Total input paths to process : 1
12/11/22 16:38:38 INFO mapred.JobClient: Running job: job_201211121510_0027
12/11/22 16:38:39 INFO mapred.JobClient:  map 0% reduce 0%
12/11/22 16:38:57 INFO mapred.JobClient:  map 100% reduce 0%
12/11/22 16:39:08 INFO mapred.JobClient:  map 100% reduce 100%
12/11/22 16:39:13 INFO mapred.JobClient: Job complete: job_201211121510_0027
12/11/22 16:39:13 INFO mapred.JobClient: Counters: 26
12/11/22 16:39:13 INFO mapred.JobClient:   Job Counters
12/11/22 16:39:13 INFO mapred.JobClient:     Launched reduce tasks=1
12/11/22 16:39:13 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=15207
12/11/22 16:39:13 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
12/11/22 16:39:13 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
12/11/22 16:39:13 INFO mapred.JobClient:     Rack-local map tasks=1
12/11/22 16:39:13 INFO mapred.JobClient:     Launched map tasks=1
12/11/22 16:39:13 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=11080
12/11/22 16:39:13 INFO mapred.JobClient:   File Input Format Counters
12/11/22 16:39:13 INFO mapred.JobClient:     Bytes Read=78
12/11/22 16:39:13 INFO mapred.JobClient:   File Output Format Counters
12/11/22 16:39:13 INFO mapred.JobClient:     Bytes Written=98
12/11/22 16:39:13 INFO mapred.JobClient:   FileSystemCounters
12/11/22 16:39:13 INFO mapred.JobClient:     FILE_BYTES_READ=6
12/11/22 16:39:13 INFO mapred.JobClient:     HDFS_BYTES_READ=334
12/11/22 16:39:13 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=45393
12/11/22 16:39:13 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=98
12/11/22 16:39:13 INFO mapred.JobClient:   Map-Reduce Framework
12/11/22 16:39:13 INFO mapred.JobClient:     Map output materialized bytes=6
12/11/22 16:39:13 INFO mapred.JobClient:     Map input records=0
12/11/22 16:39:13 INFO mapred.JobClient:     Reduce shuffle bytes=0
12/11/22 16:39:13 INFO mapred.JobClient:     Spilled Records=0
12/11/22 16:39:13 INFO mapred.JobClient:     Map output bytes=0
12/11/22 16:39:13 INFO mapred.JobClient:     Map input bytes=0
12/11/22 16:39:13 INFO mapred.JobClient:     Combine input records=0
12/11/22 16:39:13 INFO mapred.JobClient:     SPLIT_RAW_BYTES=108
12/11/22 16:39:13 INFO mapred.JobClient:     Reduce input records=0
12/11/22 16:39:13 INFO mapred.JobClient:     Reduce input groups=0
12/11/22 16:39:13 INFO mapred.JobClient:     Combine output records=0
12/11/22 16:39:13 INFO mapred.JobClient:     Reduce output records=0
12/11/22 16:39:13 INFO mapred.JobClient:     Map output records=0
Exception in thread "main" java.util.NoSuchElementException
    at com.google.common.collect.AbstractIterator.next(AbstractIterator.java:152)
    at org.apache.mahout.math.hadoop.TimesSquaredJob.retrieveTimesSquaredOutputVector(TimesSquaredJob.java:190)
    at org.apache.mahout.math.hadoop.DistributedRowMatrix.timesSquared(DistributedRowMatrix.java:238)
    at org.apache.mahout.math.decomposer.lanczos.LanczosSolver.solve(LanczosSolver.java:104)
    at org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver.run(DistributedLanczosSolver.java:200)
    at org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver.run(DistributedLanczosSolver.java:123)
    at org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver$DistributedLanczosSolverJob.run(DistributedLanczosSolver.java:283)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver.main(DistributedLanczosSolver.java:289)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:616)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:188)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:616)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)


What does the exception imply?


Thanks,
Chiu

On Thu, Nov 22, 2012 at 8:13 AM, Chui-Hui Chiu <[email protected]> wrote:

Hello, all,

I read the introduction page of the SVD in Mahout. The SVD application
requires that the input matrix be in the SequenceFile format. Now, I have a
matrix with real-number elements in ASCII format. Rows are separated by
newline characters and columns by spaces. The matrix looks like

1.1 1.2 1.3
2.1 2.2 2.3
3.1 3.2 3.3

How is the matrix stored in Sequence Files?  How do I convert the matrix
into the appropriate format for the SVD application?


Thanks,
Chiu

