My problem was solved. An index contains a sequence of documents. A document is a sequence of fields. A field is a named sequence of terms. A term is a string.
I was searching for a term while the option was --field ... grrrrr. Now my output looks like this:

  9
  #term       doc freq  idx
  artichokes  14        0
  cheese      17        1
  deron       14        2
  french      14        3
  fries       14        4
  hamburger   14        5
  nicole      17        6
  salad       17        7
  steak       14        8

-----Original Message-----
From: Videnova, Svetlana [mailto:[email protected]]
Sent: Wednesday, July 25, 2012 11:05
To: [email protected]
Subject: RE: .txt to vector

OK, thank you. All good. 31 docs, 3 fields: content (term count 16), filename (term count 17), indexDate (term count 1).
There are "bananas" in at least 3 files, so I can't understand why I get:

  12/07/25 10:03:02 ERROR lucene.LuceneIterator: There are too many documents that do not have a term vector for bananas
  Exception in thread "main" java.lang.IllegalStateException: There are too many documents that do not have a term vector for bananas

-----Original Message-----
From: Lance Norskog [mailto:[email protected]]
Sent: Wednesday, July 25, 2012 09:57
To: [email protected]
Subject: Re: .txt to vector

It is a jar file, so just: java -jar luke.....jar

But there's a problem: Luke releases are keyed to different Lucene releases. You need the right Luke download for your version of Lucene.
http://code.google.com/p/luke/downloads/list

On Wed, Jul 25, 2012 at 12:52 AM, Videnova, Svetlana <[email protected]> wrote:

Sorry, but what is the command line for running Luke?

-----Original Message-----
From: Lance Norskog [mailto:[email protected]]
Sent: Wednesday, July 25, 2012 09:24
To: [email protected]
Subject: Re: .txt to vector

The Luke program lets you examine a Lucene index. Try that and check for your term vectors.
http://code.google.com/p/luke/

It uses Swing, so you need the index on your local PC.

On Wed, Jul 25, 2012 at 12:15 AM, Videnova, Svetlana <[email protected]> wrote:

Yes, I saw the help; that's why I was trying values between 0 and 1, but I get the same error every time:

  12/07/25 08:05:41 ERROR lucene.LuceneIterator: There are too many documents that do not have a term vector for bananas
  Exception in thread "main" java.lang.IllegalStateException: There are too many documents that do not have a term vector for bananas

-----Original Message-----
From: Lance Norskog [mailto:[email protected]]
Sent: Wednesday, July 25, 2012 08:59
To: [email protected]
Subject: Re: .txt to vector

You're making progress! Run "bin/mahout lucene.vector" and look at the help message:

  --maxPercentErrorDocs (-err) maxPercentErrorDocs
      The max percentage of docs that can have a null term vector. These are
      noise documents and can occur if the analyzer used strips out all terms
      in the target field. This percentage is expressed as a value between 0
      and 1. The default is 0.

You want .3, not 30!
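In other words, the corrected invocation would presumably be the command quoted below with the percentage expressed as a fraction, and nothing else changed:

  # Same paths as in the failing command below; 30% becomes 0.3.
  ./bin/mahout lucene.vector --dir ./toto/index_bananas/ \
      -o ./toto/lucene_vector_test/tom_indexes_output \
      --maxPercentErrorDocs 0.3 --field bananas \
      -t ./toto/lucene_vector_test/dictionnary/ -n 2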
On Tue, Jul 24, 2012 at 1:27 AM, Videnova, Svetlana <[email protected]> wrote:

I found this: http://comments.gmane.org/gmane.comp.apache.mahout.devel/16422

When I run this:

  apache-mahout-d6d6ee8$ ./bin/mahout lucene.vector --dir ./toto/index_bananas/ -o ./toto/lucene_vector_test/tom_indexes_output --maxPercentErrorDocs 30 --field bananas -t ./toto/lucene_vector_test/dictionnary/ -n 2

I get this error:

  12/07/24 09:25:22 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
  12/07/24 09:25:22 INFO compress.CodecPool: Got brand-new compressor
  Exception in thread "main" java.lang.IllegalArgumentException

-----Original Message-----
From: Videnova, Svetlana [mailto:[email protected]]
Sent: Tuesday, July 24, 2012 09:16
To: [email protected]
Subject: RE: .txt to vector

Hi Lance,

My index directory now contains _0.tvf and the other files.

With the command:

  apache-mahout-d6d6ee8$ ./bin/mahout lucene.vector --dir ./toto/index_bananas/ -o ./toto/lucene_vector_test/tom_indexes_output --field bananas -t ./toto/lucene_vector_test/dictionnary/ -n 2

the output is:

  ...
  12/07/24 08:13:01 ERROR lucene.LuceneIterator: There are too many documents that do not have a term vector for bananas
  Exception in thread "main" java.lang.IllegalStateException: There are too many documents that do not have a term vector for bananas
  ...

I still can't understand the error...

Thank you

-----Original Message-----
From: Lance Norskog [mailto:[email protected]]
Sent: Tuesday, July 24, 2012 04:28
To: [email protected]
Subject: Re: .txt to vector

You have to add term vectors to the field type you want to use. Then you have to reindex all of the data. You will then have another file in the index with the suffix .tvf. This holds the data which the Mahout lucene job looks for.
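For a Solr index, enabling term vectors is done in schema.xml. A minimal sketch, where the field name and field type are placeholders for whatever the real schema declares:

  <!-- Hypothetical schema.xml excerpt: termVectors="true" makes Lucene
       write the term-vector (.tvf/.tvd/.tvx) files that lucene.vector reads. -->
  <field name="bananas" type="text_general" indexed="true" stored="true"
         termVectors="true"/>

After changing the schema, reindex everything so the .tvf data actually gets written.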
On Mon, Jul 23, 2012 at 8:03 AM, Videnova, Svetlana <[email protected]> wrote:

Hello again,

I got my indexed files from Solr on Windows and copied them into a directory on Ubuntu.
They look like this:

  index_test$ ls
  _4d.fdt  _4d.frq  _4d.tis  _4e.fdx  _4e.frq  _4e.prx  _4e.tis  segments.gen
  _4d.fdx  _4d.prx  _4e.fdt  _4e.fnm  _4e.nrm  _4e.tii  segments_55

_4d.tis looks like:

  ]0 - PA – savoir où se trouve un panier workflow, statut

Then I use Mahout like this:

  apache-mahout-d6d6ee8$ ./bin/mahout lucene.vector --dir ./toto/index_test/ -o ./toto/lucene_vector_test/tom_indexes_output --field PA -t ./toto/lucene_vector_test/dictionnary/ -n 2

The output is:

  hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running locally
  SLF4J: Class path contains multiple SLF4J bindings.
  SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/mahout-examples-0.8-SNAPSHOT-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/dependency/slf4j-jcl-1.6.6.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/dependency/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
  SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
  12/07/23 15:50:09 INFO lucene.Driver: Output File: ./toto/lucene_vector_test/tom_indexes_output
  12/07/23 15:50:10 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
  12/07/23 15:50:10 INFO compress.CodecPool: Got brand-new compressor
  12/07/23 15:50:10 ERROR lucene.LuceneIterator: There are too many documents that do not have a term vector for PA
  Exception in thread "main" java.lang.IllegalStateException: There are too many documents that do not have a term vector for PA
          at org.apache.mahout.utils.vectors.lucene.LuceneIterator.computeNext(LuceneIterator.java:118)
          at org.apache.mahout.utils.vectors.lucene.LuceneIterator.computeNext(LuceneIterator.java:41)
          at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
          at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
          at org.apache.mahout.utils.vectors.io.SequenceFileVectorWriter.write(SequenceFileVectorWriter.java:44)
          at org.apache.mahout.utils.vectors.lucene.Driver.dumpVectors(Driver.java:109)
          at org.apache.mahout.utils.vectors.lucene.Driver.main(Driver.java:250)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
          at java.lang.reflect.Method.invoke(Method.java:597)
          at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
          at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
          at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)

I'm looking for field = "PA", which is used in a lot of files, so I don't understand why the exception tells me "too many documents that do not have a term vector for PA".

Can somebody explain how I am supposed to use the lucene.vector command? Apparently I'm missing something...

Thank you all!

-----Original Message-----
From: Videnova, Svetlana [mailto:[email protected]]
Sent: Monday, July 23, 2012 10:18
To: [email protected]
Subject: RE: .txt to vector

I'm using Mahout on Ubuntu and Solr on Windows. I guess that with a web service I can get the indexed files from Solr, and then a Java program in the web service can call the Mahout library to classify/cluster and categorize my database.
For the moment I'm just training with a directory on Ubuntu (my dir contains .xml, .txt and .csv files), because I don't know where I can get the indexed files from Solr on Ubuntu...?!
Also, I'm using the latest version, called apache-mahout-d6d6ee8.

When I use lucene.vector like this:

  $ ./bin/mahout lucene.vector -d ./toto/lucene_vector_test/ -o ./toto/lucene_vector_test/ -t ./toto/ -f content -n 2
  Exception in thread "main" org.apache.lucene.index.IndexNotFoundException: no segments* file found in org.apache.lucene.store.NIOFSDirectory@/usr/local/apache-mahout-d6d6ee8/toto/lucene_vector_test lockFactory=org.apache.lucene.store.NativeFSLockFactory@157aa53: files: []

Thank you
-----Original Message-----
From: Lance Norskog [mailto:[email protected]]
Sent: Saturday, July 21, 2012 05:55
To: [email protected]
Subject: Re: .txt to vector

Solr creates Lucene index files. You can query it for content in several formats. You will have to fetch the data with a program.

"bin/mahout lucene.vector" creates vector sequence files from a Lucene index. I have not tried this. You have to configure Solr to create term vectors for the field you want. This is in the field type declaration; see the Introduction in:
http://wiki.apache.org/solr/TermVectorComponent

I don't know if lucene.vector is in the Mahout 0.5 release.

For cluster outputs, the current cluster dumper supports 'graphml' format. Giraph is an interactive graph browser. You can look at small cluster jobs.
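As a sketch of that "fetch the data with a program" step, Solr's standard select handler can return query results in several formats; the host, port and query here are assumptions, not taken from this thread:

  # Hypothetical example: pull documents from a local Solr instance as JSON.
  curl 'http://localhost:8983/solr/select?q=*:*&rows=100&wt=json'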
On Thu, Jul 19, 2012 at 11:34 PM, Videnova, Svetlana <[email protected]> wrote:

Hi,
I already have Mahout in Action, but nothing in it works with the latest Mahout version...
I will look again...
Does "Taming Text" treat .xml and JSON files too? My goal is to take the output of Solr (which is .xml, JSON or PHP).

Regards

-----Original Message-----
From: Lance Norskog [mailto:[email protected]]
Sent: Friday, July 20, 2012 03:16
To: [email protected]
Subject: Re: .txt to vector

There are two books out for Mahout and text processing. "Mahout in Action" covers all of the apps in Mahout. "Taming Text" gives a good detailed explanation of the text processing programs in Mahout, and otherwise covers other text processing problems.

Mahout in Action is very good, and can help you use most of the Mahout features.

http://www.manning.com/owen
http://www.manning.com/ingersoll

On Thu, Jul 19, 2012 at 8:08 AM, Videnova, Svetlana <[email protected]> wrote:

Hi again,
Just finished. Here is what I did:

Mahout .txt to seqfile
https://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html

Converting a directory of documents to SequenceFile format:

  csi@csi-SCENIC-W:/usr/local/apache-mahout-d6d6ee8$ ./bin/mahout seqdirectory --input /usr/local/apache-mahout-d6d6ee8/toto --output /usr/local/apache-mahout-d6d6ee8/examples/output/

- This first step creates a chunk-0 file in the output path you gave.

Creating vectors from the SequenceFile:

  ./bin/mahout seq2sparse --input ./examples/output/chunk-0 --output ./toto/output/ -maxNGramSize

* Don't forget to give ./toto/output full rights.
- This second step takes the chunk-0 created by the first step and creates the output dir you specified in the --output option.

Creating vectors with k-means:

  ./bin/mahout kmeans -i ./toto/output/tfidf-vectors/ -c ./toto/centroides_kmeans/ -cl -o ./toto/cluster_kmeans/ -k 20 -ow -x 10

Transforming vectors to human-readable form (does not work yet):

  csi@csi-SCENIC-W:/usr/local/apache-mahout-d6d6ee8$ ./bin/mahout clusterdump -i ./toto/cluster_kmeans/clusters-1-final/ -o ./toto/clusters-dump/ -of TEXT -d ./toto/output/dictionary.file-0 -dt sequencefile -b 100 -n 20 --evaluate -dm org.apache.mahout.common.distance.CosineDistanceMeasure --pointsDir ./toto/cluster_kmeans/clusteredPoints/

* -s got changed to -i for Mahout 0.7.
* This works:

  ./bin/mahout clusterdump -i ./toto/cluster_kmeans/clusters-1-final/ -o ./toto/clusters-dump/ --pointsDir ./toto/cluster_kmeans/clusteredPoints/

Can somebody please explain the files below? What exactly do they contain, and how do I use them?
dictionary.file-0; tfidf-vectors; tokenized-documents; df-count; frequency.file-0; tf-vectors; wordcount

What is the chunk-0 file exactly?

What does the clusters-dump created by the clusterdump command represent?

Thank you all!
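One way to see what these SequenceFiles contain is Mahout's seqdumper utility, which prints the key/value pairs of a SequenceFile as text. A sketch using the paths from this thread (flag spellings may differ between Mahout releases, as with the -s/-i change noted above, and the part-file name is an assumption):

  # Dump the term -> integer-id dictionary produced by seq2sparse.
  ./bin/mahout seqdumper -i ./toto/output/dictionary.file-0

  # Dump the TF-IDF vectors themselves (one sparse vector per document).
  ./bin/mahout seqdumper -i ./toto/output/tfidf-vectors/part-r-00000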
-----Original Message-----
From: Videnova, Svetlana [mailto:[email protected]]
Sent: Thursday, July 19, 2012 15:07
To: [email protected]
Subject: RE: .txt to vector

The problem was that I gave seq2sparse the directory as input, and not the chunk directly.
Also, I had not given write rights for "group" and "others" on my output file.

After running the command:

  ./bin/mahout seq2sparse --input ./examples/output/chunk-0 --output ./toto/output/ --maxNGramSize 3

I got:

  12/07/19 13:57:10 INFO driver.MahoutDriver: Program took 57093 ms (Minutes: 0.95155)

So I went to my output, and there is:

  root@csi-SCENIC-W:/usr/local/apache-mahout-d6d6ee8/toto/output# ls
  df-count           frequency.file-0  tf-vectors           wordcount
  dictionary.file-0  tfidf-vectors     tokenized-documents

What should the vector files look like?
And can somebody please explain what each directory of the output above represents?

Thank you

-----Original Message-----
From: Videnova, Svetlana [mailto:[email protected]]
Sent: Thursday, July 19, 2012 14:26
To: [email protected]
Subject: RE: .txt to vector

Yes, that's what I was saying.

But I have no idea where in the code Mahout calls/creates the data that I don't have.
And the clusters that I do have (especially clusters-8) are old and were not generated by seqdirectory or by seq2sparse...
Should I make other manipulations before the seqdirectory or seq2sparse step?

Thank you

-----Original Message-----
From: Alexander Aristov [mailto:[email protected]]
Sent: Thursday, July 19, 2012 12:05
To: [email protected]
Subject: Re: .txt to vector

You've got another problem now:

  Exception in thread "main" java.io.FileNotFoundException: File file:/usr/local/apache-mahout-d6d6ee8/examples/output/clusters-8/data does not exist.

Best Regards
Alexander Aristov

On 19 July 2012 12:30, Videnova, Svetlana <[email protected]> wrote:

Hi Lance,

Thank you for your fast answer.
I changed my classpath to:

  CLASSPATH=/opt/lucene-3.6.0/lucene-core-3.6.0.jar:/opt/lucene-3.6.0/lucene-core-3.6.0-javadoc.jar:/opt/lucene-3.6.0/lucene-test-framework-3.6.0.jar:/opt/lucene-3.6.0/lucene-test-framework-3.6.0-javadoc.jar:.

and put 3.6.0 in the pom.xml.

But:

  csi@csi-SCENIC-W:/usr/local/apache-mahout-d6d6ee8$ ./bin/mahout seq2sparse --input ./examples/output/ --output ./toto/output/
  hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running locally
  SLF4J: Class path contains multiple SLF4J bindings.
  SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/mahout-examples-0.8-SNAPSHOT-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/dependency/slf4j-jcl-1.6.6.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/dependency/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
  SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
  12/07/19 09:03:55 INFO vectorizer.SparseVectorsFromSequenceFiles: Maximum n-gram size is: 1
  12/07/19 09:03:55 INFO vectorizer.SparseVectorsFromSequenceFiles: Minimum LLR value: 1.0
  12/07/19 09:03:55 INFO vectorizer.SparseVectorsFromSequenceFiles: Number of reduce tasks: 1
  12/07/19 09:03:56 INFO input.FileInputFormat: Total input paths to process : 15
  12/07/19 09:03:56 INFO mapred.JobClient: Cleaning up the staging area file:/tmp/hadoop-csi/mapred/staging/csi-379951768/.staging/job_local_0001
  Exception in thread "main" java.io.FileNotFoundException: File file:/usr/local/apache-mahout-d6d6ee8/examples/output/clusters-8/data does not exist.
          at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:371)
          at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
          at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:63)
          at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:252)
          at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:919)
          at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:936)
          at org.apache.hadoop.mapred.JobClient.access$500(JobClient.java:170)
          at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:854)
          at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:807)
          at java.security.AccessController.doPrivileged(Native Method)
          at javax.security.auth.Subject.doAs(Subject.java:396)
          at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
          at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:807)
          at org.apache.hadoop.mapreduce.Job.submit(Job.java:465)
          at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:495)
          at org.apache.mahout.vectorizer.DocumentProcessor.tokenizeDocuments(DocumentProcessor.java:93)
          at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.run(SparseVectorsFromSequenceFiles.java:255)
          at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
          at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
          at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.main(SparseVectorsFromSequenceFiles.java:55)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
          at java.lang.reflect.Method.invoke(Method.java:597)
          at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
          at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
          at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)

  csi@csi-SCENIC-W:/usr/local/apache-mahout-d6d6ee8/examples/output/clusters-8$ ls
  _logs  part-r-00000  _policy  _SUCCESS

There is no /usr/local/apache-mahout-d6d6ee8/examples/output/clusters-8/data here!

Thank you
-----Original Message-----
From: Lance Norskog [mailto:[email protected]]
Sent: Thursday, July 19, 2012 09:33
To: [email protected]
Subject: Re: .txt to vector

Yes, the Mahout analyzer would have to be updated for Lucene 4.0. I suggest using an earlier one. Mahout uses Lucene in a very simple way, and it is OK to use any earlier Lucene from 3.1 to 3.6.
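Following that advice, pinning a 3.x Lucene in Mahout's pom.xml would presumably mean setting the same property that gets changed to 4.0.0 below back to a 3.x release; a minimal sketch:

  <!-- Hypothetical pom.xml excerpt: keep the lucene.version property on a
       3.x release matching the Lucene jars on the runtime classpath. -->
  <properties>
    <lucene.version>3.6.0</lucene.version>
  </properties>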
On Wed, Jul 18, 2012 at 11:50 PM, Videnova, Svetlana <[email protected]> wrote:

Hi Sean,

In fact, I was using Lucene version 3.6.0 (I saw that in the pom.xml), but in my classpath I was using Lucene version 4.0.0.

I changed the pom.xml to 4.0.0 => <lucene.version>4.0.0</lucene.version>

But I still get the same error:

  Exception in thread "main" java.lang.VerifyError: class org.apache.mahout.vectorizer.DefaultAnalyzer overrides final method tokenStream.(Ljava/lang/String;Ljava/io/Reader;)Lorg/apache/lucene/analysis/TokenStream;

Should I change something else? Or maybe Lucene 4.0 is too recent for Mahout!?

Thank you

-----Original Message-----
From: Sean Owen [mailto:[email protected]]
Sent: Wednesday, July 18, 2012 22:52
To: [email protected]
Subject: Re: .txt to vector

This means you're using it with an incompatible version of Lucene. I think we're on 3.1. Check the version that Mahout depends upon and use at least that version or later.
On Wed, Jul 18, 2012 at 6:04 PM, Videnova, Svetlana <[email protected]> wrote:

I'm working with Mahout. I'm trying to write a web service in Java myself that will take the output of Solr and give that file to Mahout.
For the moment I have successfully done the recommendation part.
Now I'm trying to cluster. For this I have to vectorize the output of Solr.
Do you have any idea how to do it, please? I was following https://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html BUT it doesn't work very well (at all...).

I'm trying to find out how to transform .txt into vectors for Mahout in order to cluster and categorize my information. Is that possible?
I saw that I have to use seqdirectory and seq2sparse.

seqdirectory creates a file (with some numbers and everything...); this step is OK. But then when I use seq2sparse, it gives me this error:

  csi@csi-SCENIC-W:/usr/local/apache-mahout-d6d6ee8$ ./bin/mahout seq2sparse --input ./examples/output/ --output ./toto/output/
  hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running locally
  SLF4J: Class path contains multiple SLF4J bindings.
  SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/mahout-examples-0.8-SNAPSHOT-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/dependency/slf4j-jcl-1.6.6.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/dependency/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
  SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
  12/07/18 15:53:33 INFO vectorizer.SparseVectorsFromSequenceFiles: Maximum n-gram size is: 1
  12/07/18 15:53:33 INFO vectorizer.SparseVectorsFromSequenceFiles: Minimum LLR value: 1.0
  12/07/18 15:53:33 INFO vectorizer.SparseVectorsFromSequenceFiles: Number of reduce tasks: 1
  Exception in thread "main" java.lang.VerifyError: class org.apache.mahout.vectorizer.DefaultAnalyzer overrides final method tokenStream.(Ljava/lang/String;Ljava/io/Reader;)Lorg/apache/lucene/analysis/TokenStream;
          at java.lang.ClassLoader.defineClass1(Native Method)
          at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
          at java.lang.ClassLoader.defineClass(ClassLoader.java:615)
          at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
          at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
          at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
          at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
          at java.security.AccessController.doPrivileged(Native Method)
          at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
          at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
          at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
          at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
          at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.run(SparseVectorsFromSequenceFiles.java:199)
          at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
          at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
          at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.main(SparseVectorsFromSequenceFiles.java:55)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
          at java.lang.reflect.Method.invoke(Method.java:597)
          at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
          at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
          at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)

I'm using only Lucene 4.0!

  CLASSPATH=/opt/lucene-4.0.0-ALPHA/demo/lucene-demo-4.0.0-ALPHA.jar:/opt/lucene-4.0.0-ALPHA/core/lucene-core-4.0.0-ALPHA.jar:/opt/lucene-4.0.0-ALPHA/analysis/common/lucene-analyzers-common-4.0.0-ALPHA.jar:/opt/lucene-4.0.0-ALPHA/queryparser/lucene-queryparser-4.0.0-ALPHA.jar:.

Please, where am I going wrong?

Thank you all
Regards
