You're making progress! Run "bin/mahout lucene.vector" and look at the
help message:
  --maxPercentErrorDocs (-err) maxPercentErrorDocs    The max percentage of
                                                      docs that can have a null
                                                      term vector. These are
                                                      noise document and can
                                                      occur if the analyzer
                                                      used strips out all terms
                                                      in the target field. This
                                                      percentage is expressed
                                                      as a value between 0 and
                                                      1. The default is 0.

You want 0.3, not 30!
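To see why 30 blows up immediately: the option takes a fraction in [0, 1], not a percentage. Here is a quick sketch of the two checks involved, based on my reading of the help text above and the errors in this thread; the driver's actual validation code may differ:

```shell
# maxPercentErrorDocs must lie in [0, 1]; a value of 30 fails that range
# check, which is consistent with the bare IllegalArgumentException below.
awk 'BEGIN { err = 30; print ((err >= 0 && err <= 1) ? "valid" : "invalid: must be between 0 and 1") }'

# With 0.3, up to 30% of documents may lack a term vector before the
# iterator gives up, e.g. 25 null-vector docs out of 100 is acceptable:
awk 'BEGIN { err = 0.3; total = 100; bad = 25; print ((bad/total <= err) ? "OK" : "too many null term vectors") }'
```

So pass --maxPercentErrorDocs 0.3 instead of 30.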

On Tue, Jul 24, 2012 at 1:27 AM, Videnova, Svetlana
<[email protected]> wrote:
> I found this: http://comments.gmane.org/gmane.comp.apache.mahout.devel/16422
>
> When I run this:
> apache-mahout-d6d6ee8$ ./bin/mahout lucene.vector --dir ./toto/index_bananas/ -o ./toto/lucene_vector_test/tom_indexes_output --maxPercentErrorDocs 30 --field bananas -t ./toto/lucene_vector_test/dictionnary/ -n 2
>
> I get this error:
> 12/07/24 09:25:22 WARN util.NativeCodeLoader: Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
> 12/07/24 09:25:22 INFO compress.CodecPool: Got brand-new compressor
> Exception in thread "main" java.lang.IllegalArgumentException
>
> -----Original Message-----
> From: Videnova, Svetlana [mailto:[email protected]]
> Sent: Tuesday, July 24, 2012 09:16
> To: [email protected]
> Subject: RE: .txt to vector
>
> Hi Lance,
>
> My dir now contains _0.tvf and the other index files.
>
> With the command:
> apache-mahout-d6d6ee8$ ./bin/mahout lucene.vector --dir ./toto/index_bananas/ -o ./toto/lucene_vector_test/tom_indexes_output --field bananas -t ./toto/lucene_vector_test/dictionnary/ -n 2
> the output is:
> ...
> 12/07/24 08:13:01 ERROR lucene.LuceneIterator: There are too many documents that do not have a term vector for bananas
> Exception in thread "main" java.lang.IllegalStateException: There are too many documents that do not have a term vector for bananas
> ...
>
>
> Still can't understand the error ...
>
> Thank you
>
>
> -----Original Message-----
> From: Lance Norskog [mailto:[email protected]]
> Sent: Tuesday, July 24, 2012 04:28
> To: [email protected]
> Subject: Re: .txt to vector
>
> You have to add termvectors to the field type you want to use. Then, you have 
> to reindex all of the data. You will now have another file in the index with 
> the suffix .tvf. This has the data which the Mahout lucene job looks for.
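The schema change Lance describes looks roughly like this in Solr's schema.xml; the field name and type below are placeholders taken from this thread, not a definitive configuration:

```xml
<!-- Sketch: termVectors="true" tells Solr/Lucene to store per-document
     term vectors for this field; "mahout lucene.vector" reads these.
     After changing the schema you must reindex, which produces the
     .tvx/.tvd/.tvf files in the index directory. -->
<field name="bananas" type="text_general" indexed="true" stored="true"
       termVectors="true"/>
```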
>
> On Mon, Jul 23, 2012 at 8:03 AM, Videnova, Svetlana 
> <[email protected]> wrote:
>> Hello again,
>>
>> I got my indexed files from Solr on Windows and copied them into a
>> directory on Ubuntu.
>> They look like this:
>> ###
>> index_test$ ls
>> _4d.fdt  _4d.frq  _4d.tis  _4e.fdx  _4e.frq  _4e.prx  _4e.tis  segments.gen
>> _4d.fdx  _4d.prx  _4e.fdt  _4e.fnm  _4e.nrm  _4e.tii  segments_55
>> ###
>>
>> _4d.tis looks like:
>> ###
>>              ]0 - PA – savoir où se trouve un panier        workflow, statut
>> ###
>>
>>
>> Then I'm using Mahout like this:
>> apache-mahout-d6d6ee8$ ./bin/mahout lucene.vector --dir ./toto/index_test/ -o ./toto/lucene_vector_test/tom_indexes_output --field PA -t ./toto/lucene_vector_test/dictionnary/ -n 2
>> The output is:
>>
>> hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin,
>> running locally
>> SLF4J: Class path contains multiple SLF4J bindings.
>> SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/mahout-examples-0.8-SNAPSHOT-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/dependency/slf4j-jcl-1.6.6.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/dependency/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
>> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
>> 12/07/23 15:50:09 INFO lucene.Driver: Output File: ./toto/lucene_vector_test/tom_indexes_output
>> 12/07/23 15:50:10 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>> 12/07/23 15:50:10 INFO compress.CodecPool: Got brand-new compressor
>> 12/07/23 15:50:10 ERROR lucene.LuceneIterator: There are too many documents that do not have a term vector for PA
>> Exception in thread "main" java.lang.IllegalStateException: There are too many documents that do not have a term vector for PA
>>         at org.apache.mahout.utils.vectors.lucene.LuceneIterator.computeNext(LuceneIterator.java:118)
>>         at org.apache.mahout.utils.vectors.lucene.LuceneIterator.computeNext(LuceneIterator.java:41)
>>         at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
>>         at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
>>         at org.apache.mahout.utils.vectors.io.SequenceFileVectorWriter.write(SequenceFileVectorWriter.java:44)
>>         at org.apache.mahout.utils.vectors.lucene.Driver.dumpVectors(Driver.java:109)
>>         at org.apache.mahout.utils.vectors.lucene.Driver.main(Driver.java:250)
>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>         at java.lang.reflect.Method.invoke(Method.java:597)
>>         at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>         at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>         at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
>>
>>
>>
>> I'm looking for field = "PA", which is used in a lot of files, so I don't
>> understand why the exception tells me "too many documents that do not have a
>> term vector for PA".
>>
>> Can somebody explain how I should use the lucene.vector command, because
>> apparently I'm missing something...
>>
>> Thank you all!
>>
>>
>> -----Original Message-----
>> From: Videnova, Svetlana [mailto:[email protected]]
>> Sent: Monday, July 23, 2012 10:18
>> To: [email protected]
>> Subject: RE: .txt to vector
>>
>> I'm using Mahout on Ubuntu and Solr on Windows. I guess that with a web
>> service I can get the indexed files from Solr, and then a Java program in
>> the web service can call the Mahout library to classify/cluster and
>> categorize my database.
>> For the moment I'm just training with a directory on Ubuntu (my dir
>> contains .xml, .txt, .csv), because I don't know where I can get the
>> indexed files from Solr on Ubuntu...?!
>> Also, I'm using the latest version, called apache-mahout-d6d6ee8.
>>
>> When I'm using lucene.vector like:
>> $ ./bin/mahout lucene.vector -d ./toto/lucene_vector_test/ -o ./toto/lucene_vector_test/ -t ./toto/ -f content -n 2
>> I get:
>> Exception in thread "main" org.apache.lucene.index.IndexNotFoundException: no segments* file found in org.apache.lucene.store.NIOFSDirectory@/usr/local/apache-mahout-d6d6ee8/toto/lucene_vector_test lockFactory=org.apache.lucene.store.NativeFSLockFactory@157aa53: files: []
>>
>>
>> Thank you
>>
>>
>>
>> -----Original Message-----
>> From: Lance Norskog [mailto:[email protected]]
>> Sent: Saturday, July 21, 2012 05:55
>> To: [email protected]
>> Subject: Re: .txt to vector
>>
>> Solr creates Lucene index files. You can query it for content in several 
>> formats. You will have to fetch the data with a program.
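A sketch of what "fetch the data with a program" can look like against a default Solr install; the localhost URL and query parameters here are assumptions, so adjust them to your setup:

```shell
# Query Solr over HTTP and save the response; wt selects the response
# format (json here; xml and php are also supported, as mentioned in
# this thread).
curl 'http://localhost:8983/solr/select?q=*:*&rows=100&wt=json' -o solr_docs.json
```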
>>
>> bin/mahout lucene.vector
>> creates vector sequencefiles from a lucene index. I have not tried
>> this. You have to configure Solr to create termvectors for the field
>> you want. This is in the field type declaration, see the Introduction
>> in:
>> http://wiki.apache.org/solr/TermVectorComponent
>>
>> I don't know if lucene.vector is in the Mahout 0.5 release.
>>
>> For cluster outputs, the current cluster dumper supports 'graphml'
>> format. Giraph is an interactive graph browser. You can look at small
>> cluster jobs.
>>
>> On Thu, Jul 19, 2012 at 11:34 PM, Videnova, Svetlana 
>> <[email protected]> wrote:
>>> Hi,
>>> I already have Mahout in Action, but nothing works with the latest Mahout
>>> version..
>>> I will look again..
>>> Does "Taming Text" also handle .xml and JSON files? My goal is to process
>>> the output of Solr (which is XML, JSON, or PHP).
>>>
>>>
>>>
>>> Regards
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Lance Norskog [mailto:[email protected]]
>>> Sent: Friday, July 20, 2012 03:16
>>> To: [email protected]
>>> Subject: Re: .txt to vector
>>>
>>> There are two books out for Mahout and text processing. "Mahout in Action" 
>>> covers all of the apps in Mahout. "Taming Text" gives a good detailed 
>>> explanation of the text processing programs in Mahout, and otherwise covers 
>>> other text processing problems.
>>>
>>> Mahout in Action is very good, and can help you use most of the Mahout 
>>> features.
>>>
>>> http://www.manning.com/owen
>>> http://www.manning.com/ingersoll
>>>
>>> On Thu, Jul 19, 2012 at 8:08 AM, Videnova, Svetlana 
>>> <[email protected]> wrote:
>>>> Hi again,
>>>> Just finished.
>>>> That's what I did:
>>>>
>>>> Mahout .txt to seqfile:
>>>> https://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html
>>>>
>>>> Converting a directory of documents to SequenceFile format:
>>>> csi@csi-SCENIC-W:/usr/local/apache-mahout-d6d6ee8$ ./bin/mahout seqdirectory --input /usr/local/apache-mahout-d6d6ee8/toto --output /usr/local/apache-mahout-d6d6ee8/examples/output/
>>>> - This first step will create a chunk-0 file in the output path that you gave.
>>>>
>>>> Creating vectors from the SequenceFile:
>>>> ./bin/mahout seq2sparse --input ./examples/output/chunk-0 --output ./toto/output/ -maxNGramSize
>>>> * Don't forget to give ./toto/output full rights.
>>>> - This second step takes the chunk-0 created by the first step and creates the output dir you specified in the --output option.
>>>>
>>>> Clustering the vectors with kmeans:
>>>> ./bin/mahout kmeans -i ./toto/output/tfidf-vectors/ -c ./toto/centroides_kmeans/ -cl -o ./toto/cluster_kmeans/ -k 20 -ow -x 10
>>>>
>>>> Transform the clusters to human-readable form (does not work yet):
>>>> csi@csi-SCENIC-W:/usr/local/apache-mahout-d6d6ee8$ ./bin/mahout clusterdump -i ./toto/cluster_kmeans/clusters-1-final/ -o ./toto/clusters-dump/ -of TEXT -d ./toto/output/dictionary.file-0 -dt sequencefile -b 100 -n 20 --evaluate -dm org.apache.mahout.common.distance.CosineDistanceMeasure --pointsDir ./toto/cluster_kmeans/clusteredPoints/
>>>> * -s got changed to -i for mahout 0.7
>>>> * works: ./bin/mahout clusterdump -i ./toto/cluster_kmeans/clusters-1-final/ -o ./toto/clusters-dump/ --pointsDir ./toto/cluster_kmeans/clusteredPoints/
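Taken together, the steps above amount to the following pipeline. This is a sketch using the paths and parameters from this thread, not a recommendation; option names are from Mahout 0.7-era builds and should be checked against bin/mahout <command> --help:

```shell
# 1. Plain text -> SequenceFile (produces chunk-0 in the output dir)
./bin/mahout seqdirectory --input ./toto --output ./examples/output/

# 2. SequenceFile -> sparse tf/tf-idf vectors (dictionary, tf-vectors, tfidf-vectors, ...)
./bin/mahout seq2sparse --input ./examples/output/chunk-0 --output ./toto/output/

# 3. Cluster the tf-idf vectors with k-means (k=20, at most 10 iterations)
./bin/mahout kmeans -i ./toto/output/tfidf-vectors/ -c ./toto/centroides_kmeans/ \
  -cl -o ./toto/cluster_kmeans/ -k 20 -ow -x 10

# 4. Dump the final clusters to a human-readable file
./bin/mahout clusterdump -i ./toto/cluster_kmeans/clusters-1-final/ \
  -o ./toto/clusters-dump/ --pointsDir ./toto/cluster_kmeans/clusteredPoints/
```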
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Can somebody please explain the files below? What exactly do they
>>>> contain, and how are they used, etc.?
>>>> dictionary.file-0 ; tfidf-vectors ; tokenized-documents ; df-count ; frequency.file-0 ; tf-vectors ; wordcount
>>>>
>>>>
>>>> What is the chunk-0 file exactly?
>>>>
>>>>
>>>> What does the clusters-dump output created at the end by the clusterdump
>>>> command represent?
>>>>
>>>>
>>>> Thank you all!
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Videnova, Svetlana [mailto:[email protected]]
>>>> Sent: Thursday, July 19, 2012 15:07
>>>> To: [email protected]
>>>> Subject: RE: .txt to vector
>>>>
>>>> The problem was that I gave seq2sparse the directory as input, and not
>>>> the chunk directly.
>>>> Also, my output file didn't have write rights for "group" and "others".
>>>>
>>>> After running the command -> ./bin/mahout seq2sparse --input ./examples/output/chunk-0 --output ./toto/output/ --maxNGramSize 3
>>>> I got -> 12/07/19 13:57:10 INFO driver.MahoutDriver: Program took 57093 ms (Minutes: 0.95155)
>>>>
>>>>
>>>> So I went to my output and there is ->
>>>> root@csi-SCENIC-W:/usr/local/apache-mahout-d6d6ee8/toto/output# ls
>>>> df-count           frequency.file-0  tf-vectors           wordcount
>>>> dictionary.file-0  tfidf-vectors     tokenized-documents
>>>>
>>>> What should the vector files look like?
>>>> And can somebody please explain what each directory of the output above
>>>> represents?
>>>>
>>>>
>>>>
>>>> Thank you
>>>>
>>>> -----Original Message-----
>>>> From: Videnova, Svetlana [mailto:[email protected]]
>>>> Sent: Thursday, July 19, 2012 14:26
>>>> To: [email protected]
>>>> Subject: RE: .txt to vector
>>>>
>>>> Yes, that's what I was saying.
>>>>
>>>> But I have no idea where in the code Mahout calls/creates the data that I
>>>> don't have.
>>>> And the clusters that I have (especially clusters-8) are old and were not
>>>> generated by seqdirectory or by seq2sparse...
>>>> Should I do other manipulations before the seqdirectory or seq2sparse step?
>>>>
>>>>
>>>> Thank you
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Alexander Aristov [mailto:[email protected]]
>>>> Sent: Thursday, July 19, 2012 12:05
>>>> To: [email protected]
>>>> Subject: Re: .txt to vector
>>>>
>>>> You've got another problem now:
>>>>
>>>> Exception in thread "main" java.io.FileNotFoundException: File 
>>>> file:/usr/local/apache-mahout-d6d6ee8/examples/output/clusters-8/data does 
>>>> not exist.
>>>>
>>>> Best Regards
>>>> Alexander Aristov
>>>>
>>>>
>>>> On 19 July 2012 12:30, Videnova, Svetlana 
>>>> <[email protected]>wrote:
>>>>
>>>>> Hi Lance,
>>>>>
>>>>> Thank you for your fast answer.
>>>>> I changed my:
>>>>> CLASSPATH=/opt/lucene-3.6.0/lucene-core-3.6.0.jar:/opt/lucene-3.6.0/lucene-core-3.6.0-javadoc.jar:/opt/lucene-3.6.0/lucene-test-framework-3.6.0.jar:/opt/lucene-3.6.0/lucene-test-framework-3.6.0-javadoc.jar:.
>>>>>
>>>>> And put 3.6.0 in the pom.xml
>>>>>
>>>>>
>>>>> But:
>>>>>
>>>>> csi@csi-SCENIC-W:/usr/local/apache-mahout-d6d6ee8$ ./bin/mahout
>>>>> seq2sparse --input ./examples/output/ --output ./toto/output/
>>>>> hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin,
>>>>> running locally
>>>>> SLF4J: Class path contains multiple SLF4J bindings.
>>>>> SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/mahout-examples-0.8-SNAPSHOT-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>>>>> SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/dependency/slf4j-jcl-1.6.6.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>>>>> SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/dependency/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>>>>> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
>>>>> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
>>>>> 12/07/19 09:03:55 INFO vectorizer.SparseVectorsFromSequenceFiles: Maximum n-gram size is: 1
>>>>> 12/07/19 09:03:55 INFO vectorizer.SparseVectorsFromSequenceFiles: Minimum LLR value: 1.0
>>>>> 12/07/19 09:03:55 INFO vectorizer.SparseVectorsFromSequenceFiles: Number of reduce tasks: 1
>>>>> 12/07/19 09:03:56 INFO input.FileInputFormat: Total input paths to process : 15
>>>>> 12/07/19 09:03:56 INFO mapred.JobClient: Cleaning up the staging area file:/tmp/hadoop-csi/mapred/staging/csi-379951768/.staging/job_local_0001
>>>>> Exception in thread "main" java.io.FileNotFoundException: File file:/usr/local/apache-mahout-d6d6ee8/examples/output/clusters-8/data does not exist.
>>>>>         at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:371)
>>>>>         at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
>>>>>         at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:63)
>>>>>         at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:252)
>>>>>         at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:919)
>>>>>         at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:936)
>>>>>         at org.apache.hadoop.mapred.JobClient.access$500(JobClient.java:170)
>>>>>         at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:854)
>>>>>         at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:807)
>>>>>         at java.security.AccessController.doPrivileged(Native Method)
>>>>>         at javax.security.auth.Subject.doAs(Subject.java:396)
>>>>>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>>>>>         at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:807)
>>>>>         at org.apache.hadoop.mapreduce.Job.submit(Job.java:465)
>>>>>         at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:495)
>>>>>         at org.apache.mahout.vectorizer.DocumentProcessor.tokenizeDocuments(DocumentProcessor.java:93)
>>>>>         at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.run(SparseVectorsFromSequenceFiles.java:255)
>>>>>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>>>>>         at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.main(SparseVectorsFromSequenceFiles.java:55)
>>>>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>         at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>         at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>>>>         at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>>>>         at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
>>>>>
>>>>> csi@csi-SCENIC-W:/usr/local/apache-mahout-d6d6ee8/examples/output/clusters-8$ ls
>>>>> _logs  part-r-00000  _policy  _SUCCESS
>>>>>
>>>>> There is no /usr/local/apache-mahout-d6d6ee8/examples/output/clusters-8/data here!
>>>>>
>>>>>
>>>>> Thank you
>>>>>
>>>>> -----Original Message-----
>>>>> From: Lance Norskog [mailto:[email protected]]
>>>>> Sent: Thursday, July 19, 2012 09:33
>>>>> To: [email protected]
>>>>> Subject: Re: .txt to vector
>>>>>
>>>>> Yes, the Mahout analyzer would have to be updated for Lucene 4.0. I
>>>>> suggest using an earlier one. Mahout uses Lucene in a very
>>>>> simple way, and it is OK to use any earlier Lucene from 3.1 to 3.6.
>>>>>
>>>>> On Wed, Jul 18, 2012 at 11:50 PM, Videnova, Svetlana <
>>>>> [email protected]> wrote:
>>>>> > Hi Sean,
>>>>> >
>>>>> > In fact I was using Lucene version 3.6.0 (I saw that in the
>>>>> > pom.xml), but in my classpath I was using Lucene version 4.0.0.
>>>>> >
>>>>> > I changed pom.xml to 4.0.0 => <lucene.version>4.0.0</lucene.version>
>>>>> >
>>>>> > But still the same error:
>>>>> > ###
>>>>> > Exception in thread "main" java.lang.VerifyError: class org.apache.mahout.vectorizer.DefaultAnalyzer overrides final method tokenStream.(Ljava/lang/String;Ljava/io/Reader;)Lorg/apache/lucene/analysis/TokenStream;
>>>>> > ###
>>>>> >
>>>>> > Should I change something else? Or maybe Lucene 4.0 is too recent for
>>>>> > Mahout!?
>>>>> >
>>>>> >
>>>>> >
>>>>> > Thank you
>>>>> >
>>>>> > -----Original Message-----
>>>>> > From: Sean Owen [mailto:[email protected]]
>>>>> > Sent: Wednesday, July 18, 2012 22:52
>>>>> > To: [email protected]
>>>>> > Subject: Re: .txt to vector
>>>>> >
>>>>> > This means you're using it with an incompatible version of Lucene.
>>>>> > I think we're on 3.1. Check the version that Mahout depends upon and
>>>>> > use at least that version or later.
>>>>> >
>>>>> > On Wed, Jul 18, 2012 at 6:04 PM, Videnova, Svetlana <
>>>>> [email protected]> wrote:
>>>>> >
>>>>> >> I'm working with Mahout. I'm trying to write a Java web service
>>>>> >> myself that will take the output of Solr and give this file to Mahout.
>>>>> >> For the moment I have successfully done the recommendation part.
>>>>> >> Now I'm trying to cluster. For this I have to vectorize the
>>>>> >> output of Solr.
>>>>> >> Do you have any idea how to do it, please? I was following
>>>>> >> https://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html
>>>>> >> BUT: it doesn't work very well (at all...).
>>>>> >>
>>>>> >> I'm trying to find out how to transform .txt to vectors for Mahout in
>>>>> >> order to cluster and categorize my information. Is it possible?
>>>>> >> I saw that I have to use seqdirectory and seq2sparse.
>>>>> >>
>>>>> >> seqdirectory creates a file (with some numbers and everything...);
>>>>> >> this step is OK. But then when I use seq2sparse it gives me this
>>>>> >> error:
>>>>> >>
>>>>> >> csi@csi-SCENIC-W:/usr/local/apache-mahout-d6d6ee8$ ./bin/mahout
>>>>> >> seq2sparse --input ./examples/output/ --output ./toto/output/
>>>>> >> hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin,
>>>>> >> running locally
>>>>> >> SLF4J: Class path contains multiple SLF4J bindings.
>>>>> >> SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/mahout-examples-0.8-SNAPSHOT-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>>>>> >> SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/dependency/slf4j-jcl-1.6.6.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>>>>> >> SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/dependency/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>>>>> >> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
>>>>> >> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
>>>>> >> 12/07/18 15:53:33 INFO vectorizer.SparseVectorsFromSequenceFiles: Maximum n-gram size is: 1
>>>>> >> 12/07/18 15:53:33 INFO vectorizer.SparseVectorsFromSequenceFiles: Minimum LLR value: 1.0
>>>>> >> 12/07/18 15:53:33 INFO vectorizer.SparseVectorsFromSequenceFiles: Number of reduce tasks: 1
>>>>> >> Exception in thread "main" java.lang.VerifyError: class org.apache.mahout.vectorizer.DefaultAnalyzer overrides final method tokenStream.(Ljava/lang/String;Ljava/io/Reader;)Lorg/apache/lucene/analysis/TokenStream;
>>>>> >>                 at java.lang.ClassLoader.defineClass1(Native Method)
>>>>> >>                 at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
>>>>> >>                 at java.lang.ClassLoader.defineClass(ClassLoader.java:615)
>>>>> >>                 at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
>>>>> >>                 at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
>>>>> >>                 at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
>>>>> >>                 at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
>>>>> >>                 at java.security.AccessController.doPrivileged(Native Method)
>>>>> >>                 at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
>>>>> >>                 at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
>>>>> >>                 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
>>>>> >>                 at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
>>>>> >>                 at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.run(SparseVectorsFromSequenceFiles.java:199)
>>>>> >>                 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>> >>                 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>>>>> >>                 at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.main(SparseVectorsFromSequenceFiles.java:55)
>>>>> >>                 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>> >>                 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>>> >>                 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>> >>                 at java.lang.reflect.Method.invoke(Method.java:597)
>>>>> >>                 at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>>>> >>                 at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>>>> >>                 at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
>>>>> >>
>>>>> >> I'm using only Lucene 4.0!
>>>>> >>
>>>>> >>
>>>>> CLASSPATH=/opt/lucene-4.0.0-ALPHA/demo/lucene-demo-4.0.0-ALPHA.jar:/opt/lucene-4.0.0-ALPHA/core/lucene-core-4.0.0-ALPHA.jar:/opt/lucene-4.0.0-ALPHA/analysis/common/lucene-analyzers-common-4.0.0-ALPHA.jar:/opt/lucene-4.0.0-ALPHA/queryparser/lucene-queryparser-4.0.0-ALPHA.jar:.
>>>>> >>
>>>>> >> Please, where am I wrong?
>>>>> >>
>>>>> >>
>>>>> >> Thank you all
>>>>> >> Regards
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >> Think green - keep it on the screen.
>>>>> >>
>>>>> >> This e-mail and any attachment is for authorised use by the
>>>>> >> intended
>>>>> >> recipient(s) only. It may contain proprietary material,
>>>>> >> confidential information and/or be subject to legal privilege.
>>>>> >> It should not be copied, disclosed to, retained or used by, any
>>>>> >> other party. If you are not an intended recipient then please
>>>>> >> promptly delete this e-mail and any attachment and all copies and 
>>>>> >> inform the sender. Thank you.
>>>>> >>
>>>>> >>
>>>>> >
>>>>> >
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Lance Norskog
>>>>> [email protected]
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Lance Norskog
>>> [email protected]
>>>
>>>
>>>
>>
>>
>>
>> --
>> Lance Norskog
>> [email protected]
>>
>>
>>
>>
>>
>
>
>
> --
> Lance Norskog
> [email protected]
>
>
>
>
>



-- 
Lance Norskog
[email protected]
