Yes, I saw the help; that's why I was trying values between 0 and 1, but I get the same error every time, which is:

12/07/25 08:05:41 ERROR lucene.LuceneIterator: There are too many documents that do not have a term vector for bananas
Exception in thread "main" java.lang.IllegalStateException: There are too many documents that do not have a term vector for bananas

-----Original Message-----
From: Lance Norskog [mailto:[email protected]]
Sent: Wednesday, July 25, 2012 08:59
To: [email protected]
Subject: Re: .txt to vector

You're making progress! Run "bin/mahout lucene.vector" and look at the help message:

  --maxPercentErrorDocs (-err) maxPercentErrorDocs
      The max percentage of docs that can have a null term vector. These are
      noise documents and can occur if the analyzer used strips out all terms
      in the target field. This percentage is expressed as a value between 0
      and 1. The default is 0.

You want .3, not 30!

On Tue, Jul 24, 2012 at 1:27 AM, Videnova, Svetlana <[email protected]> wrote:

I found this:
http://comments.gmane.org/gmane.comp.apache.mahout.devel/16422

When I run this:
apache-mahout-d6d6ee8$ ./bin/mahout lucene.vector --dir ./toto/index_bananas/ -o ./toto/lucene_vector_test/tom_indexes_output --maxPercentErrorDocs 30 --field bananas -t ./toto/lucene_vector_test/dictionnary/ -n 2

I get this error:
12/07/24 09:25:22 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
12/07/24 09:25:22 INFO compress.CodecPool: Got brand-new compressor
Exception in thread "main" java.lang.IllegalArgumentException

-----Original Message-----
From: Videnova, Svetlana [mailto:[email protected]]
Sent: Tuesday, July 24, 2012 09:16
To: [email protected]
Subject: RE: .txt to vector

Hi Lance,

My dir now contains _0.tvf and the others.

With the command:
apache-mahout-d6d6ee8$ ./bin/mahout lucene.vector --dir ./toto/index_bananas/ -o ./toto/lucene_vector_test/tom_indexes_output --field bananas -t ./toto/lucene_vector_test/dictionnary/ -n 2
the output is:
...
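For reference, Lance's advice applied to the command from this thread would look like the sketch below (same paths as above; the flag takes a fraction between 0 and 1, which is why passing 30 triggered the IllegalArgumentException):

```shell
# Allow up to 30% of documents to lack a term vector for the field.
# --maxPercentErrorDocs is a fraction in [0, 1], so 0.3 -- not 30.
./bin/mahout lucene.vector \
  --dir ./toto/index_bananas/ \
  -o ./toto/lucene_vector_test/tom_indexes_output \
  --field bananas \
  -t ./toto/lucene_vector_test/dictionnary/ \
  --maxPercentErrorDocs 0.3 \
  -n 2
```

Note that if more than 30% of the documents still have no term vector for the field, the IllegalStateException above will persist; the field itself must have been indexed with term vectors first.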
12/07/24 08:13:01 ERROR lucene.LuceneIterator: There are too many documents that do not have a term vector for bananas
Exception in thread "main" java.lang.IllegalStateException: There are too many documents that do not have a term vector for bananas
...

I still can't understand the error...

Thank you

-----Original Message-----
From: Lance Norskog [mailto:[email protected]]
Sent: Tuesday, July 24, 2012 04:28
To: [email protected]
Subject: Re: .txt to vector

You have to add term vectors to the field type you want to use. Then, you have to reindex all of the data. You will now have another file in the index with the suffix .tvf. This has the data which the Mahout lucene job looks for.

On Mon, Jul 23, 2012 at 8:03 AM, Videnova, Svetlana <[email protected]> wrote:

Hello again,

I have got my indexed files from Solr on Windows and copied them into a directory on Ubuntu. They look like this:
###
index_test$ ls
_4d.fdt  _4d.frq  _4d.tis  _4e.fdx  _4e.frq  _4e.prx  _4e.tis  segments.gen
_4d.fdx  _4d.prx  _4e.fdt  _4e.fnm  _4e.nrm  _4e.tii  segments_55
###

_4d.tis looks like:
###
]0 - PA – savoir où se trouve un panier workflow, statut
###

Then I'm using Mahout like this:
apache-mahout-d6d6ee8$ ./bin/mahout lucene.vector --dir ./toto/index_test/ -o ./toto/lucene_vector_test/tom_indexes_output --field PA -t ./toto/lucene_vector_test/dictionnary/ -n 2

The output is:

hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running locally
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/mahout-examples-0.8-SNAPSHOT-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/dependency/slf4j-jcl-1.6.6.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/dependency/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
12/07/23 15:50:09 INFO lucene.Driver: Output File: ./toto/lucene_vector_test/tom_indexes_output
12/07/23 15:50:10 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
12/07/23 15:50:10 INFO compress.CodecPool: Got brand-new compressor
12/07/23 15:50:10 ERROR lucene.LuceneIterator: There are too many documents that do not have a term vector for PA
Exception in thread "main" java.lang.IllegalStateException: There are too many documents that do not have a term vector for PA
        at org.apache.mahout.utils.vectors.lucene.LuceneIterator.computeNext(LuceneIterator.java:118)
        at org.apache.mahout.utils.vectors.lucene.LuceneIterator.computeNext(LuceneIterator.java:41)
        at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
        at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
        at org.apache.mahout.utils.vectors.io.SequenceFileVectorWriter.write(SequenceFileVectorWriter.java:44)
        at org.apache.mahout.utils.vectors.lucene.Driver.dumpVectors(Driver.java:109)
        at org.apache.mahout.utils.vectors.lucene.Driver.main(Driver.java:250)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
        at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
        at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)

I'm extracting the field "PA", which is used in a lot of files, so I don't understand why the exception says "too many documents that do not have a term vector for PA".

Can somebody explain how I should use the lucene.vector command? Apparently I'm missing something...

Thank you all!

-----Original Message-----
From: Videnova, Svetlana [mailto:[email protected]]
Sent: Monday, July 23, 2012 10:18
To: [email protected]
Subject: RE: .txt to vector

I'm using Mahout on Ubuntu and Solr on Windows. I guess that with a web service I can fetch the indexed files from Solr, and then, via a Java program in the web service, call the Mahout library to classify/cluster and categorise my database. For the moment I'm just training on a directory on Ubuntu (my dir contains .xml, .txt, .csv), because I don't know where I can get the indexed files from Solr on Ubuntu...?!
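Regarding pulling data out of a remote Solr instance: as Lance notes later in the thread, Solr can be queried for content in several formats over plain HTTP, so a fetch could be sketched like this (host, port, and handler path are illustrative assumptions for a stock Solr 3.x install, not taken from this thread):

```shell
# Fetch up to 100 documents from a remote Solr instance as JSON.
# Swap wt=json for wt=xml to get XML instead.
curl 'http://windows-host:8983/solr/select?q=*:*&rows=100&wt=json' -o solr-output.json
```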
Also, I'm using the latest version, called apache-mahout-d6d6ee8.

When I'm using lucene.vector like:
$ ./bin/mahout lucene.vector -d ./toto/lucene_vector_test/ -o ./toto/lucene_vector_test/ -t ./toto/ -f content -n 2
Exception in thread "main" org.apache.lucene.index.IndexNotFoundException: no segments* file found in org.apache.lucene.store.NIOFSDirectory@/usr/local/apache-mahout-d6d6ee8/toto/lucene_vector_test lockFactory=org.apache.lucene.store.NativeFSLockFactory@157aa53: files: []

Thank you

-----Original Message-----
From: Lance Norskog [mailto:[email protected]]
Sent: Saturday, July 21, 2012 05:55
To: [email protected]
Subject: Re: .txt to vector

Solr creates Lucene index files. You can query it for content in several formats. You will have to fetch the data with a program.

bin/mahout lucene.vector
creates vector sequence files from a Lucene index. I have not tried this. You have to configure Solr to create term vectors for the field you want. This is in the field type declaration; see the Introduction in:
http://wiki.apache.org/solr/TermVectorComponent

I don't know if lucene.vector is in the Mahout 0.5 release.

For cluster outputs, the current cluster dumper supports 'graphml' format. Giraph is an interactive graph browser. You can look at small cluster jobs.

On Thu, Jul 19, 2012 at 11:34 PM, Videnova, Svetlana <[email protected]> wrote:

Hi,
I already have Mahout in Action, but nothing is working with the latest Mahout version... I will look again.
Does "Taming Text" also cover .xml and JSON files? My goal is to take the output of Solr (which is .xml, JSON or PHP).
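As a concrete illustration of Lance's point above about configuring Solr to create term vectors: the change goes into the field declaration in schema.xml. This is a hedged sketch only; the field name matches the one used elsewhere in this thread, and the `text_general` type is an assumption borrowed from the stock Solr example schema:

```xml
<!-- schema.xml: termVectors="true" is what makes Lucene write the .tvf
     file that "mahout lucene.vector" reads. Reindex after changing it. -->
<field name="bananas" type="text_general" indexed="true" stored="true"
       termVectors="true"/>
```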
Regards

-----Original Message-----
From: Lance Norskog [mailto:[email protected]]
Sent: Friday, July 20, 2012 03:16
To: [email protected]
Subject: Re: .txt to vector

There are two books out for Mahout and text processing. "Mahout in Action" covers all of the apps in Mahout. "Taming Text" gives a good detailed explanation of the text processing programs in Mahout, and otherwise covers other text processing problems.

Mahout in Action is very good, and can help you use most of the Mahout features.

http://www.manning.com/owen
http://www.manning.com/ingersoll

On Thu, Jul 19, 2012 at 8:08 AM, Videnova, Svetlana <[email protected]> wrote:

Hi again,
Just finished. Here is what I did:

Mahout .txt to seqfile
https://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html

Converting a directory of documents to SequenceFile format:
csi@csi-SCENIC-W:/usr/local/apache-mahout-d6d6ee8$ ./bin/mahout seqdirectory --input /usr/local/apache-mahout-d6d6ee8/toto --output /usr/local/apache-mahout-d6d6ee8/examples/output/
This first step creates a chunk-0 file in the output path you gave.

Creating vectors from the SequenceFile:
./bin/mahout seq2sparse --input ./examples/output/chunk-0 --output ./toto/output/ -maxNGramSize
(Don't forget to give ./toto/output full rights.) This second step takes the chunk-0 created by the first step and creates the output dir you specified with the --output option.

Creating clusters with kmeans:
./bin/mahout kmeans -i ./toto/output/tfidf-vectors/ -c ./toto/centroides_kmeans/ -cl -o ./toto/cluster_kmeans/ -k 20 -ow -x 10

Transforming the vectors to human-readable form (does not work yet):
csi@csi-SCENIC-W:/usr/local/apache-mahout-d6d6ee8$ ./bin/mahout clusterdump -i ./toto/cluster_kmeans/clusters-1-final/ -o ./toto/clusters-dump/ -of TEXT -d ./toto/output/dictionary.file-0 -dt sequencefile -b 100 -n 20 --evaluate -dm org.apache.mahout.common.distance.CosineDistanceMeasure --pointsDir ./toto/cluster_kmeans/clusteredPoints/
(-s got changed to -i for Mahout 0.7.)
This works:
./bin/mahout clusterdump -i ./toto/cluster_kmeans/clusters-1-final/ -o ./toto/clusters-dump/ --pointsDir ./toto/cluster_kmeans/clusteredPoints/

Can somebody please explain the files below? What exactly do they contain, how do I use them, etc.?
dictionary.file-0 ; tfidf-vectors ; tokenized-documents ; df-count ; frequency.file-0 ; tf-vectors ; wordcount

What is the chunk-0 file exactly?

What does clusters-dump, created at the end by the clusterdump command, represent?

Thank you all!

-----Original Message-----
From: Videnova, Svetlana [mailto:[email protected]]
Sent: Thursday, July 19, 2012 15:07
To: [email protected]
Subject: RE: .txt to vector

The problem was that I gave seq2sparse the directory as input instead of the chunk directly. Also, I hadn't given write rights for "group" and "others" on my output file.

After running the command:
./bin/mahout seq2sparse --input ./examples/output/chunk-0 --output ./toto/output/ --maxNGramSize 3
I got:
12/07/19 13:57:10 INFO driver.MahoutDriver: Program took 57093 ms (Minutes: 0.95155)

So I went to my output and there is:
root@csi-SCENIC-W:/usr/local/apache-mahout-d6d6ee8/toto/output# ls
df-count  frequency.file-0  tf-vectors  wordcount
dictionary.file-0  tfidf-vectors  tokenized-documents

What should the vector files look like? And can somebody please explain what each directory of the output above represents?
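None of these files are plain text; they are Hadoop SequenceFiles. Roughly: tokenized-documents holds the analyzer output per document, dictionary.file-0 maps each term to an integer id, wordcount and tf-vectors hold raw term frequencies, df-count and frequency.file-0 hold document frequencies, and tfidf-vectors holds the final TF-IDF-weighted vectors (chunk-0 is simply a SequenceFile of filename/file-contents pairs produced by seqdirectory). One way to inspect any of them, sketched here with the paths used above and assuming Mahout 0.7 option names, is the seqdumper utility:

```shell
# Dump the term -> integer-id dictionary as text
./bin/mahout seqdumper -i ./toto/output/dictionary.file-0

# Dump the final TF-IDF vectors (keys are document names)
./bin/mahout seqdumper -i ./toto/output/tfidf-vectors/part-r-00000
```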
Thank you

-----Original Message-----
From: Videnova, Svetlana [mailto:[email protected]]
Sent: Thursday, July 19, 2012 14:26
To: [email protected]
Subject: RE: .txt to vector

Yes, that's what I was saying.

But I have no idea where in the code Mahout calls/creates the data that I don't have. And the clusters that I have (especially clusters-8) are old and were not generated by seqdirectory or by seq2sparse... Should I do other manipulations before the seqdirectory or seq2sparse step?

Thank you

-----Original Message-----
From: Alexander Aristov [mailto:[email protected]]
Sent: Thursday, July 19, 2012 12:05
To: [email protected]
Subject: Re: .txt to vector

You've got another problem now:

Exception in thread "main" java.io.FileNotFoundException: File file:/usr/local/apache-mahout-d6d6ee8/examples/output/clusters-8/data does not exist.

Best Regards
Alexander Aristov

On 19 July 2012 12:30, Videnova, Svetlana <[email protected]> wrote:

Hi Lance,

Thank you for your fast answer. I changed my classpath to:
CLASSPATH=/opt/lucene-3.6.0/lucene-core-3.6.0.jar:/opt/lucene-3.6.0/lucene-core-3.6.0-javadoc.jar:/opt/lucene-3.6.0/lucene-test-framework-3.6.0.jar:/opt/lucene-3.6.0/lucene-test-framework-3.6.0-javadoc.jar:.

And put 3.6.0 in the pom.xml.

But:

csi@csi-SCENIC-W:/usr/local/apache-mahout-d6d6ee8$ ./bin/mahout seq2sparse --input ./examples/output/ --output ./toto/output/
hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running locally
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/mahout-examples-0.8-SNAPSHOT-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/dependency/slf4j-jcl-1.6.6.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/dependency/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
12/07/19 09:03:55 INFO vectorizer.SparseVectorsFromSequenceFiles: Maximum n-gram size is: 1
12/07/19 09:03:55 INFO vectorizer.SparseVectorsFromSequenceFiles: Minimum LLR value: 1.0
12/07/19 09:03:55 INFO vectorizer.SparseVectorsFromSequenceFiles: Number of reduce tasks: 1
12/07/19 09:03:56 INFO input.FileInputFormat: Total input paths to process : 15
12/07/19 09:03:56 INFO mapred.JobClient: Cleaning up the staging area file:/tmp/hadoop-csi/mapred/staging/csi-379951768/.staging/job_local_0001
Exception in thread "main" java.io.FileNotFoundException: File file:/usr/local/apache-mahout-d6d6ee8/examples/output/clusters-8/data does not exist.
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:371)
        at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
        at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:63)
        at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:252)
        at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:919)
        at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:936)
        at org.apache.hadoop.mapred.JobClient.access$500(JobClient.java:170)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:854)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:807)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:807)
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:465)
        at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:495)
        at org.apache.mahout.vectorizer.DocumentProcessor.tokenizeDocuments(DocumentProcessor.java:93)
        at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.run(SparseVectorsFromSequenceFiles.java:255)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.main(SparseVectorsFromSequenceFiles.java:55)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
        at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
        at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)

csi@csi-SCENIC-W:/usr/local/apache-mahout-d6d6ee8/examples/output/clusters-8$ ls
_logs  part-r-00000  _policy  _SUCCESS

There is no /usr/local/apache-mahout-d6d6ee8/examples/output/clusters-8/data here!

Thank you

-----Original Message-----
From: Lance Norskog [mailto:[email protected]]
Sent: Thursday, July 19, 2012 09:33
To: [email protected]
Subject: Re: .txt to vector

Yes, the Mahout analyzer would have to be updated for Lucene 4.0. I suggest using an earlier one. Mahout uses Lucene in a very simple way, and it is OK to use any earlier Lucene from 3.1 to 3.6.

On Wed, Jul 18, 2012 at 11:50 PM, Videnova, Svetlana <[email protected]> wrote:

Hi Sean,

In fact I was using Lucene version 3.6.0 (saw that in the pom.xml), but in my classpath I was using Lucene version 4.0.0.

I changed pom.xml to 4.0.0 => <lucene.version>4.0.0</lucene.version>

But still the same error:
###
Exception in thread "main" java.lang.VerifyError: class org.apache.mahout.vectorizer.DefaultAnalyzer overrides final method tokenStream.(Ljava/lang/String;Ljava/io/Reader;)Lorg/apache/lucene/analysis/TokenStream;
###

Should I change something else? Or maybe Lucene 4.0 is too recent for Mahout!?
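One way to settle the version question, assuming Maven is installed and the command is run from the Mahout source checkout, is to ask Maven which Lucene it actually resolves; the jars on CLASSPATH then need to match that major version:

```shell
# Print Mahout's resolved Lucene dependencies; the CLASSPATH jars must
# match this version (a 3.x release for this snapshot, not 4.0).
mvn dependency:tree -Dincludes='org.apache.lucene'
```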
Thank you

-----Original Message-----
From: Sean Owen [mailto:[email protected]]
Sent: Wednesday, July 18, 2012 22:52
To: [email protected]
Subject: Re: .txt to vector

This means you're using it with an incompatible version of Lucene. I think we're on 3.1. Check the version that Mahout depends upon and use at least that version or later.

On Wed, Jul 18, 2012 at 6:04 PM, Videnova, Svetlana <[email protected]> wrote:

I'm working with Mahout. I'm trying to build a web service in Java myself that will take the output of Solr and give that file to Mahout. For the moment I have successfully done the recommendation part. Now I'm trying to cluster. For this I have to vectorise the output of Solr. Do you have any idea how to do it, please? I was following https://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html BUT it doesn't work very well (at all...).

I'm trying to find out how to transform .txt to vectors for Mahout in order to cluster and categorise my information. Is it possible? I saw that I have to use seqdirectory and seq2sparse.

Seqdirectory creates a file (with some numbers and everything...); this step is OK. But then when I use seq2sparse it gives me this error:

csi@csi-SCENIC-W:/usr/local/apache-mahout-d6d6ee8$ ./bin/mahout seq2sparse --input ./examples/output/ --output ./toto/output/
hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running locally
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/mahout-examples-0.8-SNAPSHOT-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/dependency/slf4j-jcl-1.6.6.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/dependency/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
12/07/18 15:53:33 INFO vectorizer.SparseVectorsFromSequenceFiles: Maximum n-gram size is: 1
12/07/18 15:53:33 INFO vectorizer.SparseVectorsFromSequenceFiles: Minimum LLR value: 1.0
12/07/18 15:53:33 INFO vectorizer.SparseVectorsFromSequenceFiles: Number of reduce tasks: 1
Exception in thread "main" java.lang.VerifyError: class org.apache.mahout.vectorizer.DefaultAnalyzer overrides final method tokenStream.(Ljava/lang/String;Ljava/io/Reader;)Lorg/apache/lucene/analysis/TokenStream;
        at java.lang.ClassLoader.defineClass1(Native Method)
        at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
        at java.lang.ClassLoader.defineClass(ClassLoader.java:615)
        at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
        at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
        at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
        at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.run(SparseVectorsFromSequenceFiles.java:199)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.main(SparseVectorsFromSequenceFiles.java:55)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
        at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
        at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)

I'm using only Lucene 4.0!

CLASSPATH=/opt/lucene-4.0.0-ALPHA/demo/lucene-demo-4.0.0-ALPHA.jar:/opt/lucene-4.0.0-ALPHA/core/lucene-core-4.0.0-ALPHA.jar:/opt/lucene-4.0.0-ALPHA/analysis/common/lucene-analyzers-common-4.0.0-ALPHA.jar:/opt/lucene-4.0.0-ALPHA/queryparser/lucene-queryparser-4.0.0-ALPHA.jar:.

Please, where am I wrong?

Thank you all
Regards

Think green - keep it on the screen.

This e-mail and any attachment is for authorised use by the intended recipient(s) only. It may contain proprietary material, confidential information and/or be subject to legal privilege. It should not be copied, disclosed to, retained or used by, any other party. If you are not an intended recipient then please promptly delete this e-mail and any attachment and all copies and inform the sender. Thank you.
--
Lance Norskog
[email protected]
