The Luke program lets you examine a Lucene index. Try that and check for your term vectors. http://code.google.com/p/luke/
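Luke ships as a single runnable jar. A minimal launch sketch (the jar name/version and the `-index` flag are assumptions about the Luke build you download; if your version differs, just open the index from the GUI). The command is echoed rather than executed here:

```shell
# Build the launch command for Luke against a local index directory.
# LUKE_JAR is a placeholder version; INDEX_DIR is the path from this thread.
LUKE_JAR="lukeall-3.5.0.jar"
INDEX_DIR="./toto/index_bananas"

# Echoed, not run: Luke is a Swing GUI, so run the printed command on the
# desktop machine that holds the index.
echo "java -jar $LUKE_JAR -index $INDEX_DIR"
```

In Luke, open the field in the Documents tab and check whether the term vector button is enabled for it.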
It uses Swing, so you need the index on your local PC.

On Wed, Jul 25, 2012 at 12:15 AM, Videnova, Svetlana <[email protected]> wrote:
> Yes, I saw the help; that's why I was trying something between 0 and 1, but
> every time I get the same error, which is:
>
> 12/07/25 08:05:41 ERROR lucene.LuceneIterator: There are too many documents
> that do not have a term vector for bananas
> Exception in thread "main" java.lang.IllegalStateException: There are too
> many documents that do not have a term vector for bananas
>
> -----Original Message-----
> From: Lance Norskog [mailto:[email protected]]
> Sent: Wednesday, July 25, 2012 08:59
> To: [email protected]
> Subject: Re: .txt to vector
>
> You're making progress! Run "bin/mahout lucene.vector" and look at the help
> message:
>
>   --maxPercentErrorDocs (-err) maxPercentErrorDocs
>       The max percentage of docs that can have a null term vector. These are
>       noise document and can occur if the analyzer used strips out all terms
>       in the target field. This percentage is expressed as a value between 0
>       and 1. The default is 0.
>
> You want 0.3, not 30!
>
> On Tue, Jul 24, 2012 at 1:27 AM, Videnova, Svetlana <[email protected]> wrote:
>> I found this:
>> http://comments.gmane.org/gmane.comp.apache.mahout.devel/16422
>>
>> When I run this:
>> apache-mahout-d6d6ee8$ ./bin/mahout lucene.vector --dir ./toto/index_bananas/ -o ./toto/lucene_vector_test/tom_indexes_output --maxPercentErrorDocs 30 --field bananas -t ./toto/lucene_vector_test/dictionnary/ -n 2
>>
>> I get this error:
>> 12/07/24 09:25:22 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform...
>> using builtin-java classes where applicable
>> 12/07/24 09:25:22 INFO compress.CodecPool: Got brand-new compressor
>> Exception in thread "main" java.lang.IllegalArgumentException
>>
>> -----Original Message-----
>> From: Videnova, Svetlana [mailto:[email protected]]
>> Sent: Tuesday, July 24, 2012 09:16
>> To: [email protected]
>> Subject: RE: .txt to vector
>>
>> Hi Lance,
>>
>> My dir now contains _0.tvf and the others.
>>
>> With the command:
>> apache-mahout-d6d6ee8$ ./bin/mahout lucene.vector --dir ./toto/index_bananas/ -o ./toto/lucene_vector_test/tom_indexes_output --field bananas -t ./toto/lucene_vector_test/dictionnary/ -n 2
>> the output is:
>> ...
>> 12/07/24 08:13:01 ERROR lucene.LuceneIterator: There are too many documents that do not have a term vector for bananas
>> Exception in thread "main" java.lang.IllegalStateException: There are too many documents that do not have a term vector for bananas
>> ...
>>
>> I still can't understand the error...
>>
>> Thank you
>>
>> -----Original Message-----
>> From: Lance Norskog [mailto:[email protected]]
>> Sent: Tuesday, July 24, 2012 04:28
>> To: [email protected]
>> Subject: Re: .txt to vector
>>
>> You have to add term vectors to the field type you want to use. Then you
>> have to reindex all of the data. You will then have another file in the
>> index with the suffix .tvf. This has the data that the Mahout lucene job
>> looks for.
>>
>> On Mon, Jul 23, 2012 at 8:03 AM, Videnova, Svetlana <[email protected]> wrote:
>>> Hello again,
>>>
>>> I got my indexed files from Solr on Windows and copied them into a
>>> directory on Ubuntu.
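The IllegalArgumentException above is exactly the 0-to-1 issue from the help text: --maxPercentErrorDocs is a fraction, so 30 is out of range and 0.3 means "30% of documents may lack a term vector". A sanity check one could wrap around the call (paths copied from the thread; the `check_err_pct` helper is a hypothetical addition, not part of Mahout):

```shell
# --maxPercentErrorDocs must be a fraction in [0, 1]; reject anything else
# before handing it to bin/mahout.
check_err_pct() {
  awk -v v="$1" 'BEGIN { exit !(v >= 0 && v <= 1) }'
}

ERR=0.3   # i.e. tolerate up to 30% of documents with a null term vector
if check_err_pct "$ERR"; then
  # Echo the corrected invocation (run it for real once bin/mahout is on hand).
  echo "./bin/mahout lucene.vector --dir ./toto/index_bananas/ \
    -o ./toto/lucene_vector_test/tom_indexes_output \
    --maxPercentErrorDocs $ERR --field bananas \
    -t ./toto/lucene_vector_test/dictionnary/ -n 2"
else
  echo "maxPercentErrorDocs must be between 0 and 1, got: $ERR" >&2
fi
```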
>>> They look like this:
>>> ###
>>> index_test$ ls
>>> _4d.fdt  _4d.frq  _4d.tis  _4e.fdx  _4e.frq  _4e.prx  _4e.tis  segments.gen
>>> _4d.fdx  _4d.prx  _4e.fdt  _4e.fnm  _4e.nrm  _4e.tii  segments_55
>>> ###
>>>
>>> _4d.tis looks like:
>>> ###
>>> ]0 - PA – savoir où se trouve un panier workflow, statut
>>> ###
>>>
>>> Then I'm using Mahout like this:
>>> apache-mahout-d6d6ee8$ ./bin/mahout lucene.vector --dir ./toto/index_test/ -o ./toto/lucene_vector_test/tom_indexes_output --field PA -t ./toto/lucene_vector_test/dictionnary/ -n 2
>>>
>>> The output is:
>>> hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running locally
>>> SLF4J: Class path contains multiple SLF4J bindings.
>>> SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/mahout-examples-0.8-SNAPSHOT-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>>> SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/dependency/slf4j-jcl-1.6.6.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>>> SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/dependency/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>>> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
>>> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
>>> 12/07/23 15:50:09 INFO lucene.Driver: Output File: ./toto/lucene_vector_test/tom_indexes_output
>>> 12/07/23 15:50:10 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform...
>>> using builtin-java classes where applicable
>>> 12/07/23 15:50:10 INFO compress.CodecPool: Got brand-new compressor
>>> 12/07/23 15:50:10 ERROR lucene.LuceneIterator: There are too many documents that do not have a term vector for PA
>>> Exception in thread "main" java.lang.IllegalStateException: There are too many documents that do not have a term vector for PA
>>>     at org.apache.mahout.utils.vectors.lucene.LuceneIterator.computeNext(LuceneIterator.java:118)
>>>     at org.apache.mahout.utils.vectors.lucene.LuceneIterator.computeNext(LuceneIterator.java:41)
>>>     at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
>>>     at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
>>>     at org.apache.mahout.utils.vectors.io.SequenceFileVectorWriter.write(SequenceFileVectorWriter.java:44)
>>>     at org.apache.mahout.utils.vectors.lucene.Driver.dumpVectors(Driver.java:109)
>>>     at org.apache.mahout.utils.vectors.lucene.Driver.main(Driver.java:250)
>>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>     at java.lang.reflect.Method.invoke(Method.java:597)
>>>     at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>>     at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>>     at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
>>>
>>> I'm looking for the field "PA", which is used in a lot of files, so I don't
>>> understand why the exception tells me "too many documents that do not have a
>>> term vector for PA".
>>>
>>> Can somebody explain how I should use the lucene.vector command, because
>>> apparently I'm missing something...
>>>
>>> Thank you all!
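For the "too many documents that do not have a term vector" error: term vectors have to be switched on per field in Solr's schema.xml before indexing. A sketch of what the declaration looks like (the field name comes from the thread; the `text_general` type is an assumption, so verify against your own schema):

```shell
# Write a sample schema.xml field declaration with term vectors enabled.
# After editing the real schema.xml, ALL documents must be reindexed;
# only then does the index gain the .tvf term-vector file.
cat > /tmp/pa-field.xml <<'EOF'
<field name="PA" type="text_general" indexed="true" stored="true"
       termVectors="true"/>
EOF

# The attribute Mahout's lucene.vector depends on:
grep -o 'termVectors="true"' /tmp/pa-field.xml
```

If only some documents were indexed after the schema change, the older ones still have no term vector, which is why a nonzero --maxPercentErrorDocs can paper over a partial reindex.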
>>>
>>> -----Original Message-----
>>> From: Videnova, Svetlana [mailto:[email protected]]
>>> Sent: Monday, July 23, 2012 10:18
>>> To: [email protected]
>>> Subject: RE: .txt to vector
>>>
>>> I'm using Mahout on Ubuntu and Solr on Windows. I guess with a web service
>>> I can get the indexed files from Solr and then, through a Java program in
>>> the web service, call the Mahout library to classify/cluster and categorize
>>> my database.
>>> For the moment I'm just training with a directory on Ubuntu (my dir
>>> contains .xml, .txt, .csv), because I don't know where I can get the
>>> indexed files from Solr on Ubuntu...?!
>>> I'm also using the latest version, called apache-mahout-d6d6ee8.
>>>
>>> When I use lucene.vector like this:
>>> $ ./bin/mahout lucene.vector -d ./toto/lucene_vector_test/ -o ./toto/lucene_vector_test/ -t ./toto/ -f content -n 2
>>> Exception in thread "main" org.apache.lucene.index.IndexNotFoundException: no segments* file found in org.apache.lucene.store.NIOFSDirectory@/usr/local/apache-mahout-d6d6ee8/toto/lucene_vector_test lockFactory=org.apache.lucene.store.NativeFSLockFactory@157aa53: files: []
>>>
>>> Thank you
>>>
>>> -----Original Message-----
>>> From: Lance Norskog [mailto:[email protected]]
>>> Sent: Saturday, July 21, 2012 05:55
>>> To: [email protected]
>>> Subject: Re: .txt to vector
>>>
>>> Solr creates Lucene index files. You can query it for content in several
>>> formats. You will have to fetch the data with a program.
>>>
>>> bin/mahout lucene.vector
>>> creates vector sequence files from a Lucene index. I have not tried this.
>>> You have to configure Solr to create term vectors for the field you want.
>>> This is in the field type declaration; see the Introduction in:
>>> http://wiki.apache.org/solr/TermVectorComponent
>>>
>>> I don't know if lucene.vector is in the Mahout 0.5 release.
>>>
>>> For cluster outputs, the current cluster dumper supports 'graphml' format.
>>> Giraph is an interactive graph browser. You can look at small cluster jobs.
>>>
>>> On Thu, Jul 19, 2012 at 11:34 PM, Videnova, Svetlana <[email protected]> wrote:
>>>> Hi,
>>>> I already have Mahout in Action, but nothing works with the latest Mahout
>>>> version... I will look again.
>>>> Does "Taming Text" cover .xml and JSON files too? My goal is to take the
>>>> output of Solr (which is .xml, JSON or PHP).
>>>>
>>>> Regards
>>>>
>>>> -----Original Message-----
>>>> From: Lance Norskog [mailto:[email protected]]
>>>> Sent: Friday, July 20, 2012 03:16
>>>> To: [email protected]
>>>> Subject: Re: .txt to vector
>>>>
>>>> There are two books out for Mahout and text processing. "Mahout in Action"
>>>> covers all of the apps in Mahout. "Taming Text" gives a good detailed
>>>> explanation of the text processing programs in Mahout, and otherwise
>>>> covers other text processing problems.
>>>>
>>>> Mahout in Action is very good, and can help you use most of the Mahout
>>>> features.
>>>>
>>>> http://www.manning.com/owen
>>>> http://www.manning.com/ingersoll
>>>>
>>>> On Thu, Jul 19, 2012 at 8:08 AM, Videnova, Svetlana <[email protected]> wrote:
>>>>> Hi again,
>>>>> Just finished.
>>>>> Here is what I did:
>>>>>
>>>>> Mahout .txt to seqfile:
>>>>> https://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html
>>>>>
>>>>> Converting a directory of documents to SequenceFile format:
>>>>> csi@csi-SCENIC-W:/usr/local/apache-mahout-d6d6ee8$ ./bin/mahout seqdirectory --input /usr/local/apache-mahout-d6d6ee8/toto --output /usr/local/apache-mahout-d6d6ee8/examples/output/
>>>>> - This first step will create a chunk-0 file in the output path that you gave.
>>>>>
>>>>> Creating vectors from the SequenceFile:
>>>>> ./bin/mahout seq2sparse --input ./examples/output/chunk-0 --output ./toto/output/ -maxNGramSize
>>>>> * Don't forget to give ./toto/output full rights.
>>>>> - This second step will take the chunk-0 created by the first step and
>>>>> will create the output dir you specified in the --output option.
>>>>>
>>>>> Clustering the vectors with k-means:
>>>>> ./bin/mahout kmeans -i ./toto/output/tfidf-vectors/ -c ./toto/centroides_kmeans/ -cl -o ./toto/cluster_kmeans/ -k 20 -ow -x 10
>>>>>
>>>>> Transforming the vectors to human-readable form (does not work yet):
>>>>> csi@csi-SCENIC-W:/usr/local/apache-mahout-d6d6ee8$ ./bin/mahout clusterdump -i ./toto/cluster_kmeans/clusters-1-final/ -o ./toto/clusters-dump/ -of TEXT -d ./toto/output/dictionary.file-0 -dt sequencefile -b 100 -n 20 --evaluate -dm org.apache.mahout.common.distance.CosineDistanceMeasure --pointsDir ./toto/cluster_kmeans/clusteredPoints/
>>>>> * -s changed to -i in Mahout 0.7
>>>>> * This works: ./bin/mahout clusterdump -i ./toto/cluster_kmeans/clusters-1-final/ -o ./toto/clusters-dump/ --pointsDir ./toto/cluster_kmeans/clusteredPoints/
>>>>>
>>>>> Can somebody please explain the files below? What exactly do they
>>>>> contain, how are they used, etc.?
>>>>> dictionary.file-0 ; tfidf-vectors ; tokenized-documents ; df-count ;
>>>>> frequency.file-0 ; tf-vectors ; wordcount
>>>>>
>>>>> What is the chunk-0 file exactly?
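The four steps above can be collected into one dry-run script. This sketch only echoes each command so the plumbing can be checked without Mahout installed (all paths are the ones used in the thread; drop the echo wrapper in `run` to execute for real):

```shell
# Text-clustering pipeline: raw docs -> SequenceFile -> sparse tf-idf
# vectors -> k-means clusters -> human-readable dump. Echo-only dry run.
MAHOUT="./bin/mahout"
run() { echo "+ $*"; }   # replace the echo with "$@" to actually execute

run "$MAHOUT" seqdirectory --input ./toto --output ./examples/output/
run "$MAHOUT" seq2sparse --input ./examples/output/chunk-0 --output ./toto/output/ --maxNGramSize 3
run "$MAHOUT" kmeans -i ./toto/output/tfidf-vectors/ -c ./toto/centroides_kmeans/ -cl -o ./toto/cluster_kmeans/ -k 20 -ow -x 10
run "$MAHOUT" clusterdump -i ./toto/cluster_kmeans/clusters-1-final/ -o ./toto/clusters-dump/ --pointsDir ./toto/cluster_kmeans/clusteredPoints/
```

chunk-0 itself is a Hadoop SequenceFile of (document id, document text) pairs: an opaque binary container, not meant to be read directly, which is why seq2sparse rather than a text tool consumes it.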
>>>>>
>>>>> What does the clusters-dump created at the end by the clusterdump
>>>>> command represent?
>>>>>
>>>>> Thank you all!
>>>>>
>>>>> -----Original Message-----
>>>>> From: Videnova, Svetlana [mailto:[email protected]]
>>>>> Sent: Thursday, July 19, 2012 15:07
>>>>> To: [email protected]
>>>>> Subject: RE: .txt to vector
>>>>>
>>>>> The problem was that I gave seq2sparse the directory as input, and not
>>>>> the chunk directly.
>>>>> Also, my output file didn't have write rights for "group" and "others".
>>>>>
>>>>> After running the command:
>>>>> ./bin/mahout seq2sparse --input ./examples/output/chunk-0 --output ./toto/output/ --maxNGramSize 3
>>>>> I got:
>>>>> 12/07/19 13:57:10 INFO driver.MahoutDriver: Program took 57093 ms (Minutes: 0.95155)
>>>>>
>>>>> So I went to my output and there is:
>>>>> root@csi-SCENIC-W:/usr/local/apache-mahout-d6d6ee8/toto/output# ls
>>>>> df-count  frequency.file-0  tf-vectors  wordcount
>>>>> dictionary.file-0  tfidf-vectors  tokenized-documents
>>>>>
>>>>> What should the vector files look like?
>>>>> And can somebody please explain what each directory of the output above
>>>>> represents?
>>>>>
>>>>> Thank you
>>>>>
>>>>> -----Original Message-----
>>>>> From: Videnova, Svetlana [mailto:[email protected]]
>>>>> Sent: Thursday, July 19, 2012 14:26
>>>>> To: [email protected]
>>>>> Subject: RE: .txt to vector
>>>>>
>>>>> Yes, that is what I was saying.
>>>>>
>>>>> But I have no idea where in the code Mahout calls/creates the data that
>>>>> I don't have.
>>>>> And the clusters that I have (especially clusters-8) are old and were not
>>>>> generated by seqdirectory or by seq2sparse...
>>>>> Should I make other manipulations before the seqdirectory or seq2sparse step?
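The stale clusters-8 directory mentioned here is what trips up seq2sparse when it is pointed at the whole output directory: the Hadoop input format walks every entry and expects a clusters-8/data file that never existed. Passing the chunk file itself sidesteps that. A reproducible sketch (the /tmp layout below is fabricated to mirror the thread's directory):

```shell
# Recreate the problem layout: a fresh chunk-0 next to a stale clusters dir.
mkdir -p /tmp/mahout-demo/output/clusters-8
touch /tmp/mahout-demo/output/chunk-0
touch /tmp/mahout-demo/output/clusters-8/part-r-00000

# Feed seq2sparse only the chunk files, never a directory that also
# contains old cluster output:
find /tmp/mahout-demo/output -maxdepth 1 -type f -name 'chunk-*'
```

Alternatively, deleting or moving the leftover clusters-* directories out of the seq2sparse input path achieves the same thing.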
>>>>>
>>>>> Thank you
>>>>>
>>>>> -----Original Message-----
>>>>> From: Alexander Aristov [mailto:[email protected]]
>>>>> Sent: Thursday, July 19, 2012 12:05
>>>>> To: [email protected]
>>>>> Subject: Re: .txt to vector
>>>>>
>>>>> You've got another problem now:
>>>>>
>>>>> Exception in thread "main" java.io.FileNotFoundException: File
>>>>> file:/usr/local/apache-mahout-d6d6ee8/examples/output/clusters-8/data
>>>>> does not exist.
>>>>>
>>>>> Best Regards
>>>>> Alexander Aristov
>>>>>
>>>>> On 19 July 2012 12:30, Videnova, Svetlana <[email protected]> wrote:
>>>>>
>>>>>> Hi Lance,
>>>>>>
>>>>>> Thank you for your fast answer.
>>>>>> I changed my:
>>>>>> CLASSPATH=/opt/lucene-3.6.0/lucene-core-3.6.0.jar:/opt/lucene-3.6.0/lucene-core-3.6.0-javadoc.jar:/opt/lucene-3.6.0/lucene-test-framework-3.6.0.jar:/opt/lucene-3.6.0/lucene-test-framework-3.6.0-javadoc.jar:.
>>>>>>
>>>>>> And put 3.6.0 in the pom.xml.
>>>>>>
>>>>>> But:
>>>>>>
>>>>>> csi@csi-SCENIC-W:/usr/local/apache-mahout-d6d6ee8$ ./bin/mahout seq2sparse --input ./examples/output/ --output ./toto/output/
>>>>>> hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running locally
>>>>>> SLF4J: Class path contains multiple SLF4J bindings.
>>>>>> SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/mahout-examples-0.8-SNAPSHOT-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>>>>>> SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/dependency/slf4j-jcl-1.6.6.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>>>>>> SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/dependency/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>>>>>> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
>>>>>> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
>>>>>> 12/07/19 09:03:55 INFO vectorizer.SparseVectorsFromSequenceFiles: Maximum n-gram size is: 1
>>>>>> 12/07/19 09:03:55 INFO vectorizer.SparseVectorsFromSequenceFiles: Minimum LLR value: 1.0
>>>>>> 12/07/19 09:03:55 INFO vectorizer.SparseVectorsFromSequenceFiles: Number of reduce tasks: 1
>>>>>> 12/07/19 09:03:56 INFO input.FileInputFormat: Total input paths to process : 15
>>>>>> 12/07/19 09:03:56 INFO mapred.JobClient: Cleaning up the staging area file:/tmp/hadoop-csi/mapred/staging/csi-379951768/.staging/job_local_0001
>>>>>> Exception in thread "main" java.io.FileNotFoundException: File file:/usr/local/apache-mahout-d6d6ee8/examples/output/clusters-8/data does not exist.
>>>>>>     at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:371)
>>>>>>     at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
>>>>>>     at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:63)
>>>>>>     at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:252)
>>>>>>     at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:919)
>>>>>>     at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:936)
>>>>>>     at org.apache.hadoop.mapred.JobClient.access$500(JobClient.java:170)
>>>>>>     at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:854)
>>>>>>     at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:807)
>>>>>>     at java.security.AccessController.doPrivileged(Native Method)
>>>>>>     at javax.security.auth.Subject.doAs(Subject.java:396)
>>>>>>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>>>>>>     at
>>>>>> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:807)
>>>>>>     at org.apache.hadoop.mapreduce.Job.submit(Job.java:465)
>>>>>>     at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:495)
>>>>>>     at org.apache.mahout.vectorizer.DocumentProcessor.tokenizeDocuments(DocumentProcessor.java:93)
>>>>>>     at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.run(SparseVectorsFromSequenceFiles.java:255)
>>>>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>>>>>>     at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.main(SparseVectorsFromSequenceFiles.java:55)
>>>>>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>>>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>>     at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>>     at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>>>>>     at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>>>>>     at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
>>>>>>
>>>>>> csi@csi-SCENIC-W:/usr/local/apache-mahout-d6d6ee8/examples/output/clusters-8$ ls
>>>>>> _logs  part-r-00000  _policy  _SUCCESS
>>>>>>
>>>>>> There is no /usr/local/apache-mahout-d6d6ee8/examples/output/clusters-8/data here!
>>>>>>
>>>>>> Thank you
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Lance Norskog [mailto:[email protected]]
>>>>>> Sent: Thursday, July 19, 2012 09:33
>>>>>> To: [email protected]
>>>>>> Subject: Re: .txt to vector
>>>>>>
>>>>>> Yes, the Mahout analyzer would have to be updated for Lucene 4.0.
>>>>>> I suggest using an earlier one.
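The VerifyError in this sub-thread is the classic symptom of mixing Lucene major versions: DefaultAnalyzer was compiled against a Lucene in which Analyzer.tokenStream was not yet final. A quick way to spot more than one lucene-core on a classpath (the CLASSPATH value below is illustrative, mixing the two versions mentioned in the thread):

```shell
# Count lucene-core jars on a classpath; more than one usually explains
# a VerifyError like the one above. Illustrative 3.6.0 + 4.0.0 mix.
CP="/opt/lucene-3.6.0/lucene-core-3.6.0.jar:/opt/lucene-4.0.0-ALPHA/core/lucene-core-4.0.0-ALPHA.jar:."
n=$(echo "$CP" | tr ':' '\n' | grep -c 'lucene-core')
if [ "$n" -gt 1 ]; then
  echo "WARNING: $n lucene-core jars on classpath -- keep exactly one version"
fi
```

Keeping only the Lucene version Mahout's pom.xml declares (and removing the other jars from CLASSPATH) is the fix Lance and Sean describe.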
>>>>>> Mahout uses Lucene in a very simple way, and it is OK to use any
>>>>>> earlier Lucene from 3.1 to 3.6.
>>>>>>
>>>>>> On Wed, Jul 18, 2012 at 11:50 PM, Videnova, Svetlana <[email protected]> wrote:
>>>>>> > Hi Sean,
>>>>>> >
>>>>>> > In fact I was using Lucene version 3.6.0 (I saw that in the pom.xml),
>>>>>> > but in my classpath I was using Lucene version 4.0.0.
>>>>>> >
>>>>>> > I changed the pom.xml to 4.0.0 => <lucene.version>4.0.0</lucene.version>
>>>>>> >
>>>>>> > But I still get the same error:
>>>>>> > ###
>>>>>> > Exception in thread "main" java.lang.VerifyError: class
>>>>>> > org.apache.mahout.vectorizer.DefaultAnalyzer overrides final method
>>>>>> > tokenStream.(Ljava/lang/String;Ljava/io/Reader;)Lorg/apache/lucene/analysis/TokenStream;
>>>>>> > ###
>>>>>> >
>>>>>> > Should I change something else? Or maybe Lucene 4.0 is too recent for Mahout!?
>>>>>> >
>>>>>> > Thank you
>>>>>> >
>>>>>> > -----Original Message-----
>>>>>> > From: Sean Owen [mailto:[email protected]]
>>>>>> > Sent: Wednesday, July 18, 2012 22:52
>>>>>> > To: [email protected]
>>>>>> > Subject: Re: .txt to vector
>>>>>> >
>>>>>> > This means you're using it with an incompatible version of Lucene. I
>>>>>> > think we're on 3.1. Check the version that Mahout depends upon and use
>>>>>> > at least that version or later.
>>>>>> >
>>>>>> > On Wed, Jul 18, 2012 at 6:04 PM, Videnova, Svetlana <[email protected]> wrote:
>>>>>> >
>>>>>> >> I'm working with Mahout. I'm trying to write a web service in Java
>>>>>> >> myself that will take the output of Solr and give this file to Mahout.
>>>>>> >> For the moment I have successfully done the recommendation part.
>>>>>> >> Now I'm trying to cluster. For this I have to vectorize the output
>>>>>> >> of Solr.
>>>>>> >> Do you have any idea how to do it, please?
>>>>>> >> I was following
>>>>>> >> https://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html
>>>>>> >> BUT it doesn't work very well (at all...).
>>>>>> >>
>>>>>> >> I'm trying to find out how to transform .txt to vectors for Mahout
>>>>>> >> in order to cluster and categorize my information. Is it possible?
>>>>>> >> I saw that I have to use seqdirectory and seq2sparse.
>>>>>> >>
>>>>>> >> seqdirectory creates a file (with some numbers and everything...);
>>>>>> >> this step is OK. But then when I have to use seq2sparse, it gives me
>>>>>> >> this error:
>>>>>> >>
>>>>>> >> csi@csi-SCENIC-W:/usr/local/apache-mahout-d6d6ee8$ ./bin/mahout seq2sparse --input ./examples/output/ --output ./toto/output/
>>>>>> >> hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running locally
>>>>>> >> SLF4J: Class path contains multiple SLF4J bindings.
>>>>>> >> SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/mahout-examples-0.8-SNAPSHOT-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>>>>>> >> SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/dependency/slf4j-jcl-1.6.6.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>>>>>> >> SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/dependency/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>>>>>> >> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
>>>>>> >> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
>>>>>> >> 12/07/18 15:53:33 INFO vectorizer.SparseVectorsFromSequenceFiles: Maximum n-gram size is: 1
>>>>>> >> 12/07/18 15:53:33 INFO vectorizer.SparseVectorsFromSequenceFiles: Minimum LLR value: 1.0
>>>>>> >> 12/07/18 15:53:33 INFO vectorizer.SparseVectorsFromSequenceFiles: Number of reduce tasks: 1
>>>>>> >> Exception in thread "main" java.lang.VerifyError: class org.apache.mahout.vectorizer.DefaultAnalyzer overrides final method tokenStream.(Ljava/lang/String;Ljava/io/Reader;)Lorg/apache/lucene/analysis/TokenStream;
>>>>>> >>     at java.lang.ClassLoader.defineClass1(Native Method)
>>>>>> >>     at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
>>>>>> >>     at java.lang.ClassLoader.defineClass(ClassLoader.java:615)
>>>>>> >>     at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
>>>>>> >>     at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
>>>>>> >>     at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
>>>>>> >>     at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
>>>>>> >>     at java.security.AccessController.doPrivileged(Native Method)
>>>>>> >>     at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
>>>>>> >>     at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
>>>>>> >>     at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
>>>>>> >>     at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
>>>>>> >>     at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.run(SparseVectorsFromSequenceFiles.java:199)
>>>>>> >>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>>> >>     at
>>>>>> >> org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>>>>>> >>     at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.main(SparseVectorsFromSequenceFiles.java:55)
>>>>>> >>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>> >>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>>>> >>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>> >>     at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>> >>     at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>>>>> >>     at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>>>>> >>     at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
>>>>>> >>
>>>>>> >> I'm using only Lucene 4.0!
>>>>>> >>
>>>>>> >> CLASSPATH=/opt/lucene-4.0.0-ALPHA/demo/lucene-demo-4.0.0-ALPHA.jar:/opt/lucene-4.0.0-ALPHA/core/lucene-core-4.0.0-ALPHA.jar:/opt/lucene-4.0.0-ALPHA/analysis/common/lucene-analyzers-common-4.0.0-ALPHA.jar:/opt/lucene-4.0.0-ALPHA/queryparser/lucene-queryparser-4.0.0-ALPHA.jar:.
>>>>>> >>
>>>>>> >> Where am I going wrong?
>>>>>> >>
>>>>>> >> Thank you all
>>>>>> >> Regards
>>>>>> >>
>>>>>> >> Think green - keep it on the screen.
>>>>>> >>
>>>>>> >> This e-mail and any attachment is for authorised use by the intended
>>>>>> >> recipient(s) only. It may contain proprietary material, confidential
>>>>>> >> information and/or be subject to legal privilege. It should not be
>>>>>> >> copied, disclosed to, retained or used by, any other party.
>>>>>> >> If you are not an intended recipient then please promptly delete
>>>>>> >> this e-mail and any attachment and all copies and inform the sender.
>>>>>> >> Thank you.
>>>>>>
>>>>>> --
>>>>>> Lance Norskog
>>>>>> [email protected]
>>>>
>>>> --
>>>> Lance Norskog
>>>> [email protected]
>>>
>>> --
>>> Lance Norskog
>>> [email protected]
>>
>> --
>> Lance Norskog
>> [email protected]
>
> --
> Lance Norskog
> [email protected]

--
Lance Norskog
[email protected]
