Re: .txt to vector

Lance Norskog Thu, 19 Jul 2012 18:16:53 -0700

There are two books out on Mahout and text processing. "Mahout in
Action" covers all of the apps in Mahout. "Taming Text" gives a good,
detailed explanation of the text processing programs in Mahout, and
also covers other text processing problems.


Mahout in Action is very good, and can help you use most of the Mahout features.

http://www.manning.com/owen
http://www.manning.com/ingersoll

On Thu, Jul 19, 2012 at 8:08 AM, Videnova, Svetlana
<[email protected]> wrote:
> Hi again,
> Just finished.
> Here is what I did:
>
> Mahout .txt to seqfile
> https://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html
> Converting a directory of documents to SequenceFile format:
> csi@csi-SCENIC-W:/usr/local/apache-mahout-d6d6ee8$ ./bin/mahout seqdirectory --input /usr/local/apache-mahout-d6d6ee8/toto --output /usr/local/apache-mahout-d6d6ee8/examples/output/
> - This first step creates a chunk-0 file in the output path you gave.
> Creating vectors from the SequenceFile:
> ./bin/mahout seq2sparse --input ./examples/output/chunk-0 --output ./toto/output/ --maxNGramSize 3
> * Don't forget to give full access rights on ./toto/output.
> - This second step takes the chunk-0 file created by the first step and
> creates the output directory given in the --output option.
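
The two steps above chain together: the chunk-0 file written by seqdirectory is the input of seq2sparse. A minimal sketch of that chaining, assuming the launcher path and directories quoted in this thread. The helper names are made up for illustration, and the script only prints the commands instead of running them:

```python
import shlex

MAHOUT = "./bin/mahout"  # launcher path as used in the thread


def seqdirectory_cmd(text_dir, seq_dir):
    """Step 1: turn a directory of text files into SequenceFile chunks."""
    return [MAHOUT, "seqdirectory", "--input", text_dir, "--output", seq_dir]


def seq2sparse_cmd(seq_dir, vector_dir, max_ngram=3):
    """Step 2: read the chunk written by step 1 and write sparse vectors."""
    chunk = seq_dir.rstrip("/") + "/chunk-0"  # pass the chunk, not the directory
    return [MAHOUT, "seq2sparse", "--input", chunk, "--output", vector_dir,
            "--maxNGramSize", str(max_ngram)]


steps = [
    seqdirectory_cmd("./toto", "./examples/output"),
    seq2sparse_cmd("./examples/output", "./toto/output"),
]
for argv in steps:
    # Dry run: print each command; hand argv to subprocess.run to execute it.
    print(" ".join(shlex.quote(a) for a in argv))
```

Passing the chunk file rather than its parent directory matters here because the parent directory in this setup also held unrelated output from earlier runs.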
>
> Clustering the vectors with k-means:
> ./bin/mahout kmeans -i ./toto/output/tfidf-vectors/ -c ./toto/centroides_kmeans/ -cl -o ./toto/cluster_kmeans/ -k 20 -ow -x 10
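
As I read the flags in the command above, -k 20 asks for 20 clusters and -x 10 caps the run at 10 iterations. The following is a plain Lloyd's k-means in pure Python to show what those two numbers control. It is an illustration only, not Mahout's implementation: it uses squared Euclidean distance on small dense tuples, and the function name and sample points are invented.

```python
import random


def kmeans(points, k, max_iter, seed=0):
    """Plain Lloyd's k-means on tuples of floats (squared Euclidean distance)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # like -k: pick k starting centroids
    assignment = []
    for _ in range(max_iter):                  # like -x: upper bound on iterations
        new_assignment = [
            min(range(k), key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignment == assignment:       # converged before the cap
            break
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                        # keep the old centroid if empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, assignment


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centroids, assignment = kmeans(points, k=2, max_iter=10)
print(assignment)
```

With 20 requested clusters, the input needs at least 20 documents for the initial sampling in this sketch to work at all.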
>
> Transforming the cluster vectors to human-readable form (does not work yet):
> csi@csi-SCENIC-W:/usr/local/apache-mahout-d6d6ee8$ ./bin/mahout clusterdump -i ./toto/cluster_kmeans/clusters-1-final/ -o ./toto/clusters-dump/ -of TEXT -d ./toto/output/dictionary.file-0 -dt sequencefile -b 100 -n 20 --evaluate -dm org.apache.mahout.common.distance.CosineDistanceMeasure --pointsDir ./toto/cluster_kmeans/clusteredPoints/
> * -s was changed to -i in Mahout 0.7
> * This works: ./bin/mahout clusterdump -i ./toto/cluster_kmeans/clusters-1-final/ -o ./toto/clusters-dump/ --pointsDir ./toto/cluster_kmeans/clusteredPoints/
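
The -dm option above names CosineDistanceMeasure. As a rough illustration of what a cosine distance computes (one minus the cosine of the angle between two vectors), here is a small sketch on sparse vectors stored as dictionaries. The function name and the zero-vector convention are my own choices, not taken from Mahout.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity of two sparse vectors given as {index: weight}."""
    dot = sum(w * b.get(i, 0.0) for i, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # convention chosen for this sketch when a vector is all zeros
    return 1.0 - dot / (norm_a * norm_b)


same_direction = cosine_distance({0: 1.0, 3: 2.0}, {0: 2.0, 3: 4.0})
no_shared_terms = cosine_distance({0: 1.0}, {5: 1.0})
print(same_direction, no_shared_terms)
```

Two documents with the same term proportions are close regardless of length, which is why this measure is a common choice for text vectors.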
>
>
>
>
>
> Can somebody please explain the files below? What exactly do they contain,
> how do I use them, etc.?
> dictionary.file-0 ; tfidf-vectors ; tokenized-documents ; df-count ;
> frequency.file-0 ; tf-vectors ; wordcount
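
Going by the names, these look like the successive stages of a TF-IDF pipeline: tokenized documents, a dictionary mapping each term to an integer index, per-term document frequencies, raw term-frequency vectors, and finally TF-IDF weighted vectors. The toy example below rebuilds those stages in memory so the relationship is visible. It is an assumption-laden sketch: the documents are invented, and it uses the textbook weight tf * log(N / df), which may differ from the exact formula Mahout applies.

```python
import math
from collections import Counter

# Stand-in for tokenized-documents: document name -> list of tokens.
docs = {
    "doc1": ["mahout", "clusters", "text"],
    "doc2": ["mahout", "vectors", "text", "text"],
    "doc3": ["lucene", "tokenizes", "text"],
}

# Stand-in for the dictionary: term -> integer index.
dictionary = {t: i for i, t in enumerate(sorted({t for d in docs.values() for t in d}))}

# Document frequency: term index -> number of documents containing the term.
df = Counter(dictionary[t] for tokens in docs.values() for t in set(tokens))

# Term-frequency vectors: document -> {term index: count}.
tf_vectors = {name: dict(Counter(dictionary[t] for t in tokens))
              for name, tokens in docs.items()}

# TF-IDF vectors with the textbook weighting tf * log(N / df).
n_docs = len(docs)
tfidf_vectors = {
    name: {i: tf * math.log(n_docs / df[i]) for i, tf in vec.items()}
    for name, vec in tf_vectors.items()
}

for name, vec in tfidf_vectors.items():
    print(name, vec)
```

A term that occurs in every document ("text" here) ends up with weight zero, which is the point of the IDF factor. The tfidf-vectors directory is the one passed to kmeans earlier in this message.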
>
>
> What exactly is the chunk-0 file?
>
>
> What does the clusters-dump output created by the clusterdump command
> represent?
>
>
> Thank you all!
>
>
> -----Original Message-----
> From: Videnova, Svetlana [mailto:[email protected]]
> Sent: Thursday, 19 July 2012 15:07
> To: [email protected]
> Subject: RE: .txt to vector
>
> The problem was that I gave seq2sparse the directory as its input instead of
> the chunk file directly.
> Also, "group" and "others" did not have write permission on my output directory.
>
> After running the command -> ./bin/mahout seq2sparse --input ./examples/output/chunk-0 --output ./toto/output/ --maxNGramSize 3
> I got -> 12/07/19 13:57:10 INFO driver.MahoutDriver: Program took 57093 ms (Minutes: 0.95155)
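
The --maxNGramSize 3 option asks for word sequences of up to three tokens in addition to single words. The sketch below only enumerates the candidate n-grams for a token list. As far as I understand, Mahout additionally scores candidates (the log prints a "Minimum LLR value") before keeping them, and that filtering step is not reproduced here. The function name is invented.

```python
def ngrams(tokens, max_size):
    """All contiguous n-grams of length 1..max_size, joined with spaces."""
    grams = []
    for n in range(1, max_size + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


candidates = ngrams(["text", "to", "vector"], 3)
print(candidates)
```

A larger maximum size grows the dictionary quickly, so the resulting vectors get wider as well.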
>
>
> So I went to my output directory, which contains ->
> root@csi-SCENIC-W:/usr/local/apache-mahout-d6d6ee8/toto/output# ls
> df-count           frequency.file-0  tf-vectors           wordcount
> dictionary.file-0  tfidf-vectors     tokenized-documents
>
> What should the vector files look like?
> And can somebody please explain what each directory in the output above
> represents?
>
>
>
> Thank you
>
> -----Original Message-----
> From: Videnova, Svetlana [mailto:[email protected]]
> Sent: Thursday, 19 July 2012 14:26
> To: [email protected]
> Subject: RE: .txt to vector
>
> Yes, that is what I was saying.
>
> But I have no idea where in the code Mahout calls or creates the data that I
> don't have.
> And the clusters that I have (especially clusters-8) are old and were not
> generated by seqdirectory or by seq2sparse...
> Should I do anything else before the seqdirectory or seq2sparse step?
>
>
> Thank you
>
>
> -----Original Message-----
> From: Alexander Aristov [mailto:[email protected]]
> Sent: Thursday, 19 July 2012 12:05
> To: [email protected]
> Subject: Re: .txt to vector
>
> you've got another problem now
>
> Exception in thread "main" java.io.FileNotFoundException: File 
> file:/usr/local/apache-mahout-d6d6ee8/examples/output/clusters-8/data does 
> not exist.
>
> Best Regards
> Alexander Aristov
>
>
> On 19 July 2012 12:30, Videnova, Svetlana <[email protected]> wrote:
>
>> Hi Lance,
>>
>> Thank you for your fast answer.
>> I changed my CLASSPATH to:
>> CLASSPATH=/opt/lucene-3.6.0/lucene-core-3.6.0.jar:/opt/lucene-3.6.0/lucene-core-3.6.0-javadoc.jar:/opt/lucene-3.6.0/lucene-test-framework-3.6.0.jar:/opt/lucene-3.6.0/lucene-test-framework-3.6.0-javadoc.jar:.
>>
>> And put 3.6.0 in the pom.xml
>>
>>
>> But:
>>
>> csi@csi-SCENIC-W:/usr/local/apache-mahout-d6d6ee8$ ./bin/mahout seq2sparse --input ./examples/output/ --output ./toto/output/
>> hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running locally
>> SLF4J: Class path contains multiple SLF4J bindings.
>> SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/mahout-examples-0.8-SNAPSHOT-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/dependency/slf4j-jcl-1.6.6.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/dependency/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
>> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
>> 12/07/19 09:03:55 INFO vectorizer.SparseVectorsFromSequenceFiles: Maximum n-gram size is: 1
>> 12/07/19 09:03:55 INFO vectorizer.SparseVectorsFromSequenceFiles: Minimum LLR value: 1.0
>> 12/07/19 09:03:55 INFO vectorizer.SparseVectorsFromSequenceFiles: Number of reduce tasks: 1
>> 12/07/19 09:03:56 INFO input.FileInputFormat: Total input paths to process : 15
>> 12/07/19 09:03:56 INFO mapred.JobClient: Cleaning up the staging area file:/tmp/hadoop-csi/mapred/staging/csi-379951768/.staging/job_local_0001
>> Exception in thread "main" java.io.FileNotFoundException: File file:/usr/local/apache-mahout-d6d6ee8/examples/output/clusters-8/data does not exist.
>>         at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:371)
>>         at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
>>         at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:63)
>>         at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:252)
>>         at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:919)
>>         at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:936)
>>         at org.apache.hadoop.mapred.JobClient.access$500(JobClient.java:170)
>>         at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:854)
>>         at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:807)
>>         at java.security.AccessController.doPrivileged(Native Method)
>>         at javax.security.auth.Subject.doAs(Subject.java:396)
>>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>>         at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:807)
>>         at org.apache.hadoop.mapreduce.Job.submit(Job.java:465)
>>         at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:495)
>>         at org.apache.mahout.vectorizer.DocumentProcessor.tokenizeDocuments(DocumentProcessor.java:93)
>>         at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.run(SparseVectorsFromSequenceFiles.java:255)
>>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>>         at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.main(SparseVectorsFromSequenceFiles.java:55)
>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>         at java.lang.reflect.Method.invoke(Method.java:597)
>>         at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>         at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>         at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
>>
>> csi@csi-SCENIC-W:/usr/local/apache-mahout-d6d6ee8/examples/output/clusters-8$ ls
>> _logs  part-r-00000  _policy  _SUCCESS
>>
>>  There is no
>> /usr/local/apache-mahout-d6d6ee8/examples/output/clusters-8/data here!
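
A likely explanation, given that the trace goes through SequenceFileInputFormat.listStatus: as far as I know, Hadoop's SequenceFileInputFormat treats every directory it finds under the input path as a MapFile and looks for a file named data inside it. An old clusters-8 directory sitting next to chunk-0 would then trigger exactly this error. The sketch below rebuilds that layout in a temporary directory and lists the subdirectories that would trip the check. The helper name is invented and it touches no real Mahout output.

```python
import os
import tempfile


def stray_subdirs(input_dir):
    """Subdirectories of input_dir that contain no file named 'data'."""
    stray = []
    for name in sorted(os.listdir(input_dir)):
        path = os.path.join(input_dir, name)
        if os.path.isdir(path) and not os.path.isfile(os.path.join(path, "data")):
            stray.append(name)
    return stray


# Rebuild the layout described in the thread inside a temporary directory.
with tempfile.TemporaryDirectory() as root:
    open(os.path.join(root, "chunk-0"), "w").close()
    os.mkdir(os.path.join(root, "clusters-8"))
    open(os.path.join(root, "clusters-8", "part-r-00000"), "w").close()
    found = stray_subdirs(root)
    print(found)
```

Pointing --input at the chunk file itself, or at a directory that holds only the seqdirectory output, sidesteps the problem.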
>>
>>
>> Thank you
>>
>> -----Original Message-----
>> From: Lance Norskog [mailto:[email protected]]
>> Sent: Thursday, 19 July 2012 09:33
>> To: [email protected]
>> Subject: Re: .txt to vector
>>
>> Yes, the Mahout analyzer would have to be updated for Lucene 4.0. I
>> suggest using an earlier version. Mahout uses Lucene in a very simple
>> way, and it is OK to use any earlier Lucene from 3.1 to 3.6.
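
One quick way to see which Lucene a classpath actually pulls in is to parse the lucene-core jar names. The sketch below does that for the two CLASSPATH values quoted in this thread (shortened to the relevant entries) and checks them against the 3.1 to 3.6 range mentioned above. The regular expression and helper names are my own, and the check says nothing about what Mahout itself was compiled against.

```python
import re

# Matches e.g. lucene-core-3.6.0.jar or lucene-core-4.0.0-ALPHA.jar
LUCENE_CORE = re.compile(r"lucene-core-(\d+)\.(\d+)(?:\.\d+)?[^/:]*\.jar$")


def lucene_core_versions(classpath):
    """Return (major, minor) for every lucene-core jar on a ':'-separated classpath."""
    versions = []
    for entry in classpath.split(":"):
        match = LUCENE_CORE.search(entry)
        if match and "javadoc" not in entry:   # skip documentation jars
            versions.append((int(match.group(1)), int(match.group(2))))
    return versions


def in_supported_range(version, low=(3, 1), high=(3, 6)):
    """True if the version lies in the 3.1 to 3.6 range named in this thread."""
    return low <= version <= high


old = "/opt/lucene-4.0.0-ALPHA/core/lucene-core-4.0.0-ALPHA.jar:."
new = "/opt/lucene-3.6.0/lucene-core-3.6.0.jar:/opt/lucene-3.6.0/lucene-core-3.6.0-javadoc.jar:."
for cp in (old, new):
    print([(v, in_supported_range(v)) for v in lucene_core_versions(cp)])
```

A VerifyError about overriding a final method is typical of running code against a different library version than it was built with, so the jar on the runtime classpath is the first thing to check.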
>>
>> On Wed, Jul 18, 2012 at 11:50 PM, Videnova, Svetlana <
>> [email protected]> wrote:
>> > Hi Sean,
>> >
>> > In fact I was using Lucene version 3.6.0 (I saw that in the pom.xml),
>> > but in my classpath I was using Lucene version 4.0.0.
>> >
>> > I changed pom.xml to 4.0.0 => <lucene.version>4.0.0</lucene.version>
>> >
>> > But still the same error:
>> > ###
>> > Exception in thread "main" java.lang.VerifyError: class org.apache.mahout.vectorizer.DefaultAnalyzer overrides final method tokenStream.(Ljava/lang/String;Ljava/io/Reader;)Lorg/apache/lucene/analysis/TokenStream;
>> > ###
>> >
>> > Should I change something else? Or maybe Lucene 4.0 is too recent for
>> > Mahout?
>> >
>> >
>> >
>> > Thank you
>> >
>> > -----Original Message-----
>> > From: Sean Owen [mailto:[email protected]]
>> > Sent: Wednesday, 18 July 2012 22:52
>> > To: [email protected]
>> > Subject: Re: .txt to vector
>> >
>> > This means you're using it with an incompatible version of Lucene. I
>> > think we're on 3.1. Check the version that Mahout depends upon and use
>> > at least that version or later.
>> >
>> > On Wed, Jul 18, 2012 at 6:04 PM, Videnova, Svetlana <
>> [email protected]> wrote:
>> >
>> >> I'm working with Mahout. I'm trying to write a web service in Java
>> >> that takes the output of Solr and gives this file to Mahout.
>> >> So far I have the recommendation part working.
>> >> Now I'm trying to cluster. For this I have to vectorise the
>> >> output of Solr.
>> >> Do you have any idea how to do it, please? I was following
>> >> https://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html
>> >> BUT it doesn't work very well (at all...).
>> >>
>> >> I'm trying to find out how to transform .txt files to vectors for Mahout
>> >> in order to cluster and categorise my information. Is it possible?
>> >> I saw that I have to use seqdirectory and seq2sparse.
>> >>
>> >> seqdirectory creates a file (with some numbers and everything...);
>> >> this step is OK. But when I then use seq2sparse, it gives
>> >> me this error:
>> >>
>> >> csi@csi-SCENIC-W:/usr/local/apache-mahout-d6d6ee8$ ./bin/mahout seq2sparse --input ./examples/output/ --output ./toto/output/
>> >> hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running locally
>> >> SLF4J: Class path contains multiple SLF4J bindings.
>> >> SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/mahout-examples-0.8-SNAPSHOT-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> >> SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/dependency/slf4j-jcl-1.6.6.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> >> SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/dependency/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> >> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
>> >> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
>> >> 12/07/18 15:53:33 INFO vectorizer.SparseVectorsFromSequenceFiles: Maximum n-gram size is: 1
>> >> 12/07/18 15:53:33 INFO vectorizer.SparseVectorsFromSequenceFiles: Minimum LLR value: 1.0
>> >> 12/07/18 15:53:33 INFO vectorizer.SparseVectorsFromSequenceFiles: Number of reduce tasks: 1
>> >> Exception in thread "main" java.lang.VerifyError: class org.apache.mahout.vectorizer.DefaultAnalyzer overrides final method tokenStream.(Ljava/lang/String;Ljava/io/Reader;)Lorg/apache/lucene/analysis/TokenStream;
>> >>         at java.lang.ClassLoader.defineClass1(Native Method)
>> >>         at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
>> >>         at java.lang.ClassLoader.defineClass(ClassLoader.java:615)
>> >>         at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
>> >>         at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
>> >>         at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
>> >>         at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
>> >>         at java.security.AccessController.doPrivileged(Native Method)
>> >>         at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
>> >>         at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
>> >>         at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
>> >>         at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
>> >>         at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.run(SparseVectorsFromSequenceFiles.java:199)
>> >>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>> >>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>> >>         at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.main(SparseVectorsFromSequenceFiles.java:55)
>> >>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> >>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>> >>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>> >>         at java.lang.reflect.Method.invoke(Method.java:597)
>> >>         at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>> >>         at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>> >>         at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
>> >>
>> >> I'm using only Lucene 4.0!
>> >>
>> >> CLASSPATH=/opt/lucene-4.0.0-ALPHA/demo/lucene-demo-4.0.0-ALPHA.jar:/opt/lucene-4.0.0-ALPHA/core/lucene-core-4.0.0-ALPHA.jar:/opt/lucene-4.0.0-ALPHA/analysis/common/lucene-analyzers-common-4.0.0-ALPHA.jar:/opt/lucene-4.0.0-ALPHA/queryparser/lucene-queryparser-4.0.0-ALPHA.jar:.
>> >>
>> >> Please, where am I going wrong?
>> >>
>> >>
>> >> Thank you all
>> >> Regards
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> Think green - keep it on the screen.
>> >>
>> >> This e-mail and any attachment is for authorised use by the
>> >> intended
>> >> recipient(s) only. It may contain proprietary material,
>> >> confidential information and/or be subject to legal privilege. It
>> >> should not be copied, disclosed to, retained or used by, any other
>> >> party. If you are not an intended recipient then please promptly
>> >> delete this e-mail and any attachment and all copies and inform the 
>> >> sender. Thank you.
>> >>
>> >>
>>
>>
>>
>> --
>> Lance Norskog
>> [email protected]
>>



-- 
Lance Norskog
[email protected]
