The Luke program lets you examine a Lucene index. Try that and check for your term vectors. http://code.google.com/p/luke/
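Luke ships as a single runnable jar. A minimal launch sketch (the jar name/version and the `-index` flag are assumptions about the Luke build you download; if your version differs, just open the index from the GUI). The command is echoed rather than executed here:

```shell
# Build the launch command for Luke against a local index directory.
# LUKE_JAR is a placeholder version; INDEX_DIR is the path from this thread.
LUKE_JAR="lukeall-3.5.0.jar"
INDEX_DIR="./toto/index_bananas"

# Echoed, not run: Luke is a Swing GUI, so run the printed command on the
# desktop machine that holds the index.
echo "java -jar $LUKE_JAR -index $INDEX_DIR"
```

In Luke, open the field in the Documents tab and check whether the term vector button is enabled for it.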
It uses Swing, so you need the index on your local PC.

On Wed, Jul 25, 2012 at 12:15 AM, Videnova, Svetlana <[email protected]> wrote:
> Yes, I saw the help; that's why I was trying something between 0 and 1, but
> every time I get the same error, which is:
>
> 12/07/25 08:05:41 ERROR lucene.LuceneIterator: There are too many documents
> that do not have a term vector for bananas
> Exception in thread "main" java.lang.IllegalStateException: There are too
> many documents that do not have a term vector for bananas
>
> -----Original Message-----
> From: Lance Norskog [mailto:[email protected]]
> Sent: Wednesday, July 25, 2012 08:59
> To: [email protected]
> Subject: Re: .txt to vector
>
> You're making progress! Run "bin/mahout lucene.vector" and look at the help
> message:
>
>   --maxPercentErrorDocs (-err) maxPercentErrorDocs
>       The max percentage of docs that can have a null term vector. These are
>       noise document and can occur if the analyzer used strips out all terms
>       in the target field. This percentage is expressed as a value between 0
>       and 1. The default is 0.
>
> You want 0.3, not 30!
>
> On Tue, Jul 24, 2012 at 1:27 AM, Videnova, Svetlana <[email protected]> wrote:
>> I found this:
>> http://comments.gmane.org/gmane.comp.apache.mahout.devel/16422
>>
>> When I run this:
>> apache-mahout-d6d6ee8$ ./bin/mahout lucene.vector --dir ./toto/index_bananas/ -o ./toto/lucene_vector_test/tom_indexes_output --maxPercentErrorDocs 30 --field bananas -t ./toto/lucene_vector_test/dictionnary/ -n 2
>>
>> I get this error:
>> 12/07/24 09:25:22 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform...
>> using builtin-java classes where applicable
>> 12/07/24 09:25:22 INFO compress.CodecPool: Got brand-new compressor
>> Exception in thread "main" java.lang.IllegalArgumentException
>>
>> -----Original Message-----
>> From: Videnova, Svetlana [mailto:[email protected]]
>> Sent: Tuesday, July 24, 2012 09:16
>> To: [email protected]
>> Subject: RE: .txt to vector
>>
>> Hi Lance,
>>
>> My dir now contains _0.tvf and the others.
>>
>> With the command:
>> apache-mahout-d6d6ee8$ ./bin/mahout lucene.vector --dir ./toto/index_bananas/ -o ./toto/lucene_vector_test/tom_indexes_output --field bananas -t ./toto/lucene_vector_test/dictionnary/ -n 2
>> the output is:
>> ...
>> 12/07/24 08:13:01 ERROR lucene.LuceneIterator: There are too many documents that do not have a term vector for bananas
>> Exception in thread "main" java.lang.IllegalStateException: There are too many documents that do not have a term vector for bananas
>> ...
>>
>> I still can't understand the error...
>>
>> Thank you
>>
>> -----Original Message-----
>> From: Lance Norskog [mailto:[email protected]]
>> Sent: Tuesday, July 24, 2012 04:28
>> To: [email protected]
>> Subject: Re: .txt to vector
>>
>> You have to add term vectors to the field type you want to use. Then you
>> have to reindex all of the data. You will then have another file in the
>> index with the suffix .tvf. This has the data that the Mahout lucene job
>> looks for.
>>
>> On Mon, Jul 23, 2012 at 8:03 AM, Videnova, Svetlana <[email protected]> wrote:
>>> Hello again,
>>>
>>> I got my indexed files from Solr on Windows and copied them into a
>>> directory on Ubuntu.
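The IllegalArgumentException above is exactly the 0-to-1 issue from the help text: --maxPercentErrorDocs is a fraction, so 30 is out of range and 0.3 means "30% of documents may lack a term vector". A sanity check one could wrap around the call (paths copied from the thread; the `check_err_pct` helper is a hypothetical addition, not part of Mahout):

```shell
# --maxPercentErrorDocs must be a fraction in [0, 1]; reject anything else
# before handing it to bin/mahout.
check_err_pct() {
  awk -v v="$1" 'BEGIN { exit !(v >= 0 && v <= 1) }'
}

ERR=0.3   # i.e. tolerate up to 30% of documents with a null term vector
if check_err_pct "$ERR"; then
  # Echo the corrected invocation (run it for real once bin/mahout is on hand).
  echo "./bin/mahout lucene.vector --dir ./toto/index_bananas/ \
    -o ./toto/lucene_vector_test/tom_indexes_output \
    --maxPercentErrorDocs $ERR --field bananas \
    -t ./toto/lucene_vector_test/dictionnary/ -n 2"
else
  echo "maxPercentErrorDocs must be between 0 and 1, got: $ERR" >&2
fi
```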
>>> They look like this:
>>> ###
>>> index_test$ ls
>>> _4d.fdt  _4d.frq  _4d.tis  _4e.fdx  _4e.frq  _4e.prx  _4e.tis  segments.gen
>>> _4d.fdx  _4d.prx  _4e.fdt  _4e.fnm  _4e.nrm  _4e.tii  segments_55
>>> ###
>>>
>>> _4d.tis looks like:
>>> ###
>>> ]0 - PA – savoir où se trouve un panier workflow, statut
>>> ###
>>>
>>> Then I'm using Mahout like this:
>>> apache-mahout-d6d6ee8$ ./bin/mahout lucene.vector --dir ./toto/index_test/ -o ./toto/lucene_vector_test/tom_indexes_output --field PA -t ./toto/lucene_vector_test/dictionnary/ -n 2
>>>
>>> The output is:
>>> hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running locally
>>> SLF4J: Class path contains multiple SLF4J bindings.
>>> SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/mahout-examples-0.8-SNAPSHOT-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>>> SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/dependency/slf4j-jcl-1.6.6.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>>> SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/dependency/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>>> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
>>> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
>>> 12/07/23 15:50:09 INFO lucene.Driver: Output File: ./toto/lucene_vector_test/tom_indexes_output
>>> 12/07/23 15:50:10 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform...
>>> using builtin-java classes where applicable
>>> 12/07/23 15:50:10 INFO compress.CodecPool: Got brand-new compressor
>>> 12/07/23 15:50:10 ERROR lucene.LuceneIterator: There are too many documents that do not have a term vector for PA
>>> Exception in thread "main" java.lang.IllegalStateException: There are too many documents that do not have a term vector for PA
>>>     at org.apache.mahout.utils.vectors.lucene.LuceneIterator.computeNext(LuceneIterator.java:118)
>>>     at org.apache.mahout.utils.vectors.lucene.LuceneIterator.computeNext(LuceneIterator.java:41)
>>>     at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
>>>     at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
>>>     at org.apache.mahout.utils.vectors.io.SequenceFileVectorWriter.write(SequenceFileVectorWriter.java:44)
>>>     at org.apache.mahout.utils.vectors.lucene.Driver.dumpVectors(Driver.java:109)
>>>     at org.apache.mahout.utils.vectors.lucene.Driver.main(Driver.java:250)
>>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>     at java.lang.reflect.Method.invoke(Method.java:597)
>>>     at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>>     at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>>     at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
>>>
>>> I'm looking for the field "PA", which is used in a lot of files, so I don't
>>> understand why the exception tells me "too many documents that do not have a
>>> term vector for PA".
>>>
>>> Can somebody explain how I should use the lucene.vector command, because
>>> apparently I'm missing something...
>>>
>>> Thank you all!
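For the "too many documents that do not have a term vector" error: term vectors have to be switched on per field in Solr's schema.xml before indexing. A sketch of what the declaration looks like (the field name comes from the thread; the `text_general` type is an assumption, so verify against your own schema):

```shell
# Write a sample schema.xml field declaration with term vectors enabled.
# After editing the real schema.xml, ALL documents must be reindexed;
# only then does the index gain the .tvf term-vector file.
cat > /tmp/pa-field.xml <<'EOF'
<field name="PA" type="text_general" indexed="true" stored="true"
       termVectors="true"/>
EOF

# The attribute Mahout's lucene.vector depends on:
grep -o 'termVectors="true"' /tmp/pa-field.xml
```

If only some documents were indexed after the schema change, the older ones still have no term vector, which is why a nonzero --maxPercentErrorDocs can paper over a partial reindex.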
>>>
>>> -----Original Message-----
>>> From: Videnova, Svetlana [mailto:[email protected]]
>>> Sent: Monday, July 23, 2012 10:18
>>> To: [email protected]
>>> Subject: RE: .txt to vector
>>>
>>> I'm using Mahout on Ubuntu and Solr on Windows. I guess with a web service
>>> I can get the indexed files from Solr and then, through a Java program in
>>> the web service, call the Mahout library to classify/cluster and categorize
>>> my database.
>>> For the moment I'm just training with a directory on Ubuntu (my dir
>>> contains .xml, .txt, .csv), because I don't know where I can get the
>>> indexed files from Solr on Ubuntu...?!
>>> I'm also using the latest version, called apache-mahout-d6d6ee8.
>>>
>>> When I use lucene.vector like this:
>>> $ ./bin/mahout lucene.vector -d ./toto/lucene_vector_test/ -o ./toto/lucene_vector_test/ -t ./toto/ -f content -n 2
>>> Exception in thread "main" org.apache.lucene.index.IndexNotFoundException: no segments* file found in org.apache.lucene.store.NIOFSDirectory@/usr/local/apache-mahout-d6d6ee8/toto/lucene_vector_test lockFactory=org.apache.lucene.store.NativeFSLockFactory@157aa53: files: []
>>>
>>> Thank you
>>>
>>> -----Original Message-----
>>> From: Lance Norskog [mailto:[email protected]]
>>> Sent: Saturday, July 21, 2012 05:55
>>> To: [email protected]
>>> Subject: Re: .txt to vector
>>>
>>> Solr creates Lucene index files. You can query it for content in several
>>> formats. You will have to fetch the data with a program.
>>>
>>> bin/mahout lucene.vector
>>> creates vector sequence files from a Lucene index. I have not tried this.
>>> You have to configure Solr to create term vectors for the field you want.
>>> This is in the field type declaration; see the Introduction in:
>>> http://wiki.apache.org/solr/TermVectorComponent
>>>
>>> I don't know if lucene.vector is in the Mahout 0.5 release.
>>>
>>> For cluster outputs, the current cluster dumper supports 'graphml' format.
>>> Giraph is an interactive graph browser. You can look at small cluster jobs.
>>>
>>> On Thu, Jul 19, 2012 at 11:34 PM, Videnova, Svetlana <[email protected]> wrote:
>>>> Hi,
>>>> I already have Mahout in Action, but nothing works with the latest Mahout
>>>> version... I will look again.
>>>> Does "Taming Text" cover .xml and JSON files too? My goal is to take the
>>>> output of Solr (which is .xml, JSON or PHP).
>>>>
>>>> Regards
>>>>
>>>> -----Original Message-----
>>>> From: Lance Norskog [mailto:[email protected]]
>>>> Sent: Friday, July 20, 2012 03:16
>>>> To: [email protected]
>>>> Subject: Re: .txt to vector
>>>>
>>>> There are two books out for Mahout and text processing. "Mahout in Action"
>>>> covers all of the apps in Mahout. "Taming Text" gives a good detailed
>>>> explanation of the text processing programs in Mahout, and otherwise
>>>> covers other text processing problems.
>>>>
>>>> Mahout in Action is very good, and can help you use most of the Mahout
>>>> features.
>>>>
>>>> http://www.manning.com/owen
>>>> http://www.manning.com/ingersoll
>>>>
>>>> On Thu, Jul 19, 2012 at 8:08 AM, Videnova, Svetlana <[email protected]> wrote:
>>>>> Hi again,
>>>>> Just finished.
>>>>> Here is what I did:
>>>>>
>>>>> Mahout .txt to seqfile:
>>>>> https://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html
>>>>>
>>>>> Converting a directory of documents to SequenceFile format:
>>>>> csi@csi-SCENIC-W:/usr/local/apache-mahout-d6d6ee8$ ./bin/mahout seqdirectory --input /usr/local/apache-mahout-d6d6ee8/toto --output /usr/local/apache-mahout-d6d6ee8/examples/output/
>>>>> - This first step will create a chunk-0 file in the output path that you gave.
>>>>>
>>>>> Creating vectors from the SequenceFile:
>>>>> ./bin/mahout seq2sparse --input ./examples/output/chunk-0 --output ./toto/output/ -maxNGramSize
>>>>> * Don't forget to give ./toto/output full rights.
>>>>> - This second step will take the chunk-0 created by the first step and
>>>>> will create the output dir you specified in the --output option.
>>>>>
>>>>> Clustering the vectors with k-means:
>>>>> ./bin/mahout kmeans -i ./toto/output/tfidf-vectors/ -c ./toto/centroides_kmeans/ -cl -o ./toto/cluster_kmeans/ -k 20 -ow -x 10
>>>>>
>>>>> Transforming the vectors to human-readable form (does not work yet):
>>>>> csi@csi-SCENIC-W:/usr/local/apache-mahout-d6d6ee8$ ./bin/mahout clusterdump -i ./toto/cluster_kmeans/clusters-1-final/ -o ./toto/clusters-dump/ -of TEXT -d ./toto/output/dictionary.file-0 -dt sequencefile -b 100 -n 20 --evaluate -dm org.apache.mahout.common.distance.CosineDistanceMeasure --pointsDir ./toto/cluster_kmeans/clusteredPoints/
>>>>> * -s changed to -i in Mahout 0.7
>>>>> * This works: ./bin/mahout clusterdump -i ./toto/cluster_kmeans/clusters-1-final/ -o ./toto/clusters-dump/ --pointsDir ./toto/cluster_kmeans/clusteredPoints/
>>>>>
>>>>> Can somebody please explain the files below? What exactly do they
>>>>> contain, how are they used, etc.?
>>>>> dictionary.file-0 ; tfidf-vectors ; tokenized-documents ; df-count ;
>>>>> frequency.file-0 ; tf-vectors ; wordcount
>>>>>
>>>>> What is the chunk-0 file exactly?
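The four steps above can be collected into one dry-run script. This sketch only echoes each command so the plumbing can be checked without Mahout installed (all paths are the ones used in the thread; drop the echo wrapper in `run` to execute for real):

```shell
# Text-clustering pipeline: raw docs -> SequenceFile -> sparse tf-idf
# vectors -> k-means clusters -> human-readable dump. Echo-only dry run.
MAHOUT="./bin/mahout"
run() { echo "+ $*"; }   # replace the echo with "$@" to actually execute

run "$MAHOUT" seqdirectory --input ./toto --output ./examples/output/
run "$MAHOUT" seq2sparse --input ./examples/output/chunk-0 --output ./toto/output/ --maxNGramSize 3
run "$MAHOUT" kmeans -i ./toto/output/tfidf-vectors/ -c ./toto/centroides_kmeans/ -cl -o ./toto/cluster_kmeans/ -k 20 -ow -x 10
run "$MAHOUT" clusterdump -i ./toto/cluster_kmeans/clusters-1-final/ -o ./toto/clusters-dump/ --pointsDir ./toto/cluster_kmeans/clusteredPoints/
```

chunk-0 itself is a Hadoop SequenceFile of (document id, document text) pairs: an opaque binary container, not meant to be read directly, which is why seq2sparse rather than a text tool consumes it.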
>>>>>
>>>>> What does the clusters-dump created at the end by the clusterdump
>>>>> command represent?
>>>>>
>>>>> Thank you all!
>>>>>
>>>>> -----Original Message-----
>>>>> From: Videnova, Svetlana [mailto:[email protected]]
>>>>> Sent: Thursday, July 19, 2012 15:07
>>>>> To: [email protected]
>>>>> Subject: RE: .txt to vector
>>>>>
>>>>> The problem was that I gave seq2sparse the directory as input, and not
>>>>> the chunk directly.
>>>>> Also, my output file didn't have write rights for "group" and "others".
>>>>>
>>>>> After running the command:
>>>>> ./bin/mahout seq2sparse --input ./examples/output/chunk-0 --output ./toto/output/ --maxNGramSize 3
>>>>> I got:
>>>>> 12/07/19 13:57:10 INFO driver.MahoutDriver: Program took 57093 ms (Minutes: 0.95155)
>>>>>
>>>>> So I went to my output and there is:
>>>>> root@csi-SCENIC-W:/usr/local/apache-mahout-d6d6ee8/toto/output# ls
>>>>> df-count  frequency.file-0  tf-vectors  wordcount
>>>>> dictionary.file-0  tfidf-vectors  tokenized-documents
>>>>>
>>>>> What should the vector files look like?
>>>>> And can somebody please explain what each directory of the output above
>>>>> represents?
>>>>>
>>>>> Thank you
>>>>>
>>>>> -----Original Message-----
>>>>> From: Videnova, Svetlana [mailto:[email protected]]
>>>>> Sent: Thursday, July 19, 2012 14:26
>>>>> To: [email protected]
>>>>> Subject: RE: .txt to vector
>>>>>
>>>>> Yes, that is what I was saying.
>>>>>
>>>>> But I have no idea where in the code Mahout calls/creates the data that
>>>>> I don't have.
>>>>> And the clusters that I have (especially clusters-8) are old and were not
>>>>> generated by seqdirectory or by seq2sparse...
>>>>> Should I make other manipulations before the seqdirectory or seq2sparse step?
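The stale clusters-8 directory mentioned here is what trips up seq2sparse when it is pointed at the whole output directory: the Hadoop input format walks every entry and expects a clusters-8/data file that never existed. Passing the chunk file itself sidesteps that. A reproducible sketch (the /tmp layout below is fabricated to mirror the thread's directory):

```shell
# Recreate the problem layout: a fresh chunk-0 next to a stale clusters dir.
mkdir -p /tmp/mahout-demo/output/clusters-8
touch /tmp/mahout-demo/output/chunk-0
touch /tmp/mahout-demo/output/clusters-8/part-r-00000

# Feed seq2sparse only the chunk files, never a directory that also
# contains old cluster output:
find /tmp/mahout-demo/output -maxdepth 1 -type f -name 'chunk-*'
```

Alternatively, deleting or moving the leftover clusters-* directories out of the seq2sparse input path achieves the same thing.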
>>>>>
>>>>> Thank you
>>>>>
>>>>> -----Original Message-----
>>>>> From: Alexander Aristov [mailto:[email protected]]
>>>>> Sent: Thursday, July 19, 2012 12:05
>>>>> To: [email protected]
>>>>> Subject: Re: .txt to vector
>>>>>
>>>>> You've got another problem now:
>>>>>
>>>>> Exception in thread "main" java.io.FileNotFoundException: File
>>>>> file:/usr/local/apache-mahout-d6d6ee8/examples/output/clusters-8/data
>>>>> does not exist.
>>>>>
>>>>> Best Regards
>>>>> Alexander Aristov
>>>>>
>>>>> On 19 July 2012 12:30, Videnova, Svetlana <[email protected]> wrote:
>>>>>
>>>>>> Hi Lance,
>>>>>>
>>>>>> Thank you for your fast answer.
>>>>>> I changed my:
>>>>>> CLASSPATH=/opt/lucene-3.6.0/lucene-core-3.6.0.jar:/opt/lucene-3.6.0/lucene-core-3.6.0-javadoc.jar:/opt/lucene-3.6.0/lucene-test-framework-3.6.0.jar:/opt/lucene-3.6.0/lucene-test-framework-3.6.0-javadoc.jar:.
>>>>>>
>>>>>> And put 3.6.0 in the pom.xml.
>>>>>>
>>>>>> But:
>>>>>>
>>>>>> csi@csi-SCENIC-W:/usr/local/apache-mahout-d6d6ee8$ ./bin/mahout seq2sparse --input ./examples/output/ --output ./toto/output/
>>>>>> hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running locally
>>>>>> SLF4J: Class path contains multiple SLF4J bindings.
>>>>>> SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/mahout-examples-0.8-SNAPSHOT-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>>>>>> SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/dependency/slf4j-jcl-1.6.6.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>>>>>> SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/dependency/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>>>>>> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
>>>>>> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
>>>>>> 12/07/19 09:03:55 INFO vectorizer.SparseVectorsFromSequenceFiles: Maximum n-gram size is: 1
>>>>>> 12/07/19 09:03:55 INFO vectorizer.SparseVectorsFromSequenceFiles: Minimum LLR value: 1.0
>>>>>> 12/07/19 09:03:55 INFO vectorizer.SparseVectorsFromSequenceFiles: Number of reduce tasks: 1
>>>>>> 12/07/19 09:03:56 INFO input.FileInputFormat: Total input paths to process : 15
>>>>>> 12/07/19 09:03:56 INFO mapred.JobClient: Cleaning up the staging area file:/tmp/hadoop-csi/mapred/staging/csi-379951768/.staging/job_local_0001
>>>>>> Exception in thread "main" java.io.FileNotFoundException: File file:/usr/local/apache-mahout-d6d6ee8/examples/output/clusters-8/data does not exist.
>>>>>>     at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:371)
>>>>>>     at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
>>>>>>     at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:63)
>>>>>>     at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:252)
>>>>>>     at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:919)
>>>>>>     at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:936)
>>>>>>     at org.apache.hadoop.mapred.JobClient.access$500(JobClient.java:170)
>>>>>>     at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:854)
>>>>>>     at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:807)
>>>>>>     at java.security.AccessController.doPrivileged(Native Method)
>>>>>>     at javax.security.auth.Subject.doAs(Subject.java:396)
>>>>>>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>>>>>>     at
>>>>>> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:807)
>>>>>>     at org.apache.hadoop.mapreduce.Job.submit(Job.java:465)
>>>>>>     at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:495)
>>>>>>     at org.apache.mahout.vectorizer.DocumentProcessor.tokenizeDocuments(DocumentProcessor.java:93)
>>>>>>     at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.run(SparseVectorsFromSequenceFiles.java:255)
>>>>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>>>>>>     at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.main(SparseVectorsFromSequenceFiles.java:55)
>>>>>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>>>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>>     at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>>     at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>>>>>     at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>>>>>     at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
>>>>>>
>>>>>> csi@csi-SCENIC-W:/usr/local/apache-mahout-d6d6ee8/examples/output/clusters-8$ ls
>>>>>> _logs  part-r-00000  _policy  _SUCCESS
>>>>>>
>>>>>> There is no /usr/local/apache-mahout-d6d6ee8/examples/output/clusters-8/data here!
>>>>>>
>>>>>> Thank you
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Lance Norskog [mailto:[email protected]]
>>>>>> Sent: Thursday, July 19, 2012 09:33
>>>>>> To: [email protected]
>>>>>> Subject: Re: .txt to vector
>>>>>>
>>>>>> Yes, the Mahout analyzer would have to be updated for Lucene 4.0.
>>>>>> I suggest using an earlier one.
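The VerifyError in this sub-thread is the classic symptom of mixing Lucene major versions: DefaultAnalyzer was compiled against a Lucene in which Analyzer.tokenStream was not yet final. A quick way to spot more than one lucene-core on a classpath (the CLASSPATH value below is illustrative, mixing the two versions mentioned in the thread):

```shell
# Count lucene-core jars on a classpath; more than one usually explains
# a VerifyError like the one above. Illustrative 3.6.0 + 4.0.0 mix.
CP="/opt/lucene-3.6.0/lucene-core-3.6.0.jar:/opt/lucene-4.0.0-ALPHA/core/lucene-core-4.0.0-ALPHA.jar:."
n=$(echo "$CP" | tr ':' '\n' | grep -c 'lucene-core')
if [ "$n" -gt 1 ]; then
  echo "WARNING: $n lucene-core jars on classpath -- keep exactly one version"
fi
```

Keeping only the Lucene version Mahout's pom.xml declares (and removing the other jars from CLASSPATH) is the fix Lance and Sean describe.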
>>>>>> Mahout uses Lucene in a very simple way, and it is OK to use any
>>>>>> earlier Lucene from 3.1 to 3.6.
>>>>>>
>>>>>> On Wed, Jul 18, 2012 at 11:50 PM, Videnova, Svetlana <[email protected]> wrote:
>>>>>> > Hi Sean,
>>>>>> >
>>>>>> > In fact I was using Lucene version 3.6.0 (I saw that in the pom.xml),
>>>>>> > but in my classpath I was using Lucene version 4.0.0.
>>>>>> >
>>>>>> > I changed the pom.xml to 4.0.0 => <lucene.version>4.0.0</lucene.version>
>>>>>> >
>>>>>> > But I still get the same error:
>>>>>> > ###
>>>>>> > Exception in thread "main" java.lang.VerifyError: class
>>>>>> > org.apache.mahout.vectorizer.DefaultAnalyzer overrides final method
>>>>>> > tokenStream.(Ljava/lang/String;Ljava/io/Reader;)Lorg/apache/lucene/analysis/TokenStream;
>>>>>> > ###
>>>>>> >
>>>>>> > Should I change something else? Or maybe Lucene 4.0 is too recent for Mahout!?
>>>>>> >
>>>>>> > Thank you
>>>>>> >
>>>>>> > -----Original Message-----
>>>>>> > From: Sean Owen [mailto:[email protected]]
>>>>>> > Sent: Wednesday, July 18, 2012 22:52
>>>>>> > To: [email protected]
>>>>>> > Subject: Re: .txt to vector
>>>>>> >
>>>>>> > This means you're using it with an incompatible version of Lucene. I
>>>>>> > think we're on 3.1. Check the version that Mahout depends upon and use
>>>>>> > at least that version or later.
>>>>>> >
>>>>>> > On Wed, Jul 18, 2012 at 6:04 PM, Videnova, Svetlana <[email protected]> wrote:
>>>>>> >
>>>>>> >> I'm working with Mahout. I'm trying to write a web service in Java
>>>>>> >> myself that will take the output of Solr and give this file to Mahout.
>>>>>> >> For the moment I have successfully done the recommendation part.
>>>>>> >> Now I'm trying to cluster. For this I have to vectorize the output
>>>>>> >> of Solr.
>>>>>> >> Do you have any idea how to do it, please?
>>>>>> >> I was following
>>>>>> >> https://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html
>>>>>> >> BUT it doesn't work very well (at all...).
>>>>>> >>
>>>>>> >> I'm trying to find out how to transform .txt to vectors for Mahout
>>>>>> >> in order to cluster and categorize my information. Is it possible?
>>>>>> >> I saw that I have to use seqdirectory and seq2sparse.
>>>>>> >>
>>>>>> >> seqdirectory creates a file (with some numbers and everything...);
>>>>>> >> this step is OK. But then when I have to use seq2sparse, it gives me
>>>>>> >> this error:
>>>>>> >>
>>>>>> >> csi@csi-SCENIC-W:/usr/local/apache-mahout-d6d6ee8$ ./bin/mahout seq2sparse --input ./examples/output/ --output ./toto/output/
>>>>>> >> hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running locally
>>>>>> >> SLF4J: Class path contains multiple SLF4J bindings.
>>>>>> >> SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/mahout-examples-0.8-SNAPSHOT-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>>>>>> >> SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/dependency/slf4j-jcl-1.6.6.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>>>>>> >> SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/dependency/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>>>>>> >> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
>>>>>> >> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
>>>>>> >> 12/07/18 15:53:33 INFO vectorizer.SparseVectorsFromSequenceFiles: Maximum n-gram size is: 1
>>>>>> >> 12/07/18 15:53:33 INFO vectorizer.SparseVectorsFromSequenceFiles: Minimum LLR value: 1.0
>>>>>> >> 12/07/18 15:53:33 INFO vectorizer.SparseVectorsFromSequenceFiles: Number of reduce tasks: 1
>>>>>> >> Exception in thread "main" java.lang.VerifyError: class org.apache.mahout.vectorizer.DefaultAnalyzer overrides final method tokenStream.(Ljava/lang/String;Ljava/io/Reader;)Lorg/apache/lucene/analysis/TokenStream;
>>>>>> >>     at java.lang.ClassLoader.defineClass1(Native Method)
>>>>>> >>     at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
>>>>>> >>     at java.lang.ClassLoader.defineClass(ClassLoader.java:615)
>>>>>> >>     at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
>>>>>> >>     at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
>>>>>> >>     at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
>>>>>> >>     at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
>>>>>> >>     at java.security.AccessController.doPrivileged(Native Method)
>>>>>> >>     at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
>>>>>> >>     at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
>>>>>> >>     at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
>>>>>> >>     at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
>>>>>> >>     at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.run(SparseVectorsFromSequenceFiles.java:199)
>>>>>> >>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>>> >>     at
>>>>>> >> org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>>>>>> >>     at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.main(SparseVectorsFromSequenceFiles.java:55)
>>>>>> >>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>> >>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>>>> >>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>> >>     at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>> >>     at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>>>>> >>     at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>>>>> >>     at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
>>>>>> >>
>>>>>> >> I'm using only Lucene 4.0!
>>>>>> >>
>>>>>> >> CLASSPATH=/opt/lucene-4.0.0-ALPHA/demo/lucene-demo-4.0.0-ALPHA.jar:/opt/lucene-4.0.0-ALPHA/core/lucene-core-4.0.0-ALPHA.jar:/opt/lucene-4.0.0-ALPHA/analysis/common/lucene-analyzers-common-4.0.0-ALPHA.jar:/opt/lucene-4.0.0-ALPHA/queryparser/lucene-queryparser-4.0.0-ALPHA.jar:.
>>>>>> >>
>>>>>> >> Where am I going wrong?
>>>>>> >>
>>>>>> >> Thank you all
>>>>>> >> Regards
>>>>>> >>
>>>>>> >> Think green - keep it on the screen.
>>>>>> >>
>>>>>> >> This e-mail and any attachment is for authorised use by the intended
>>>>>> >> recipient(s) only. It may contain proprietary material, confidential
>>>>>> >> information and/or be subject to legal privilege. It should not be
>>>>>> >> copied, disclosed to, retained or used by, any other party.
>>>>>> >> If you are not an intended recipient then please promptly delete
>>>>>> >> this e-mail and any attachment and all copies and inform the sender.
>>>>>> >> Thank you.
>>>>>>
>>>>>> --
>>>>>> Lance Norskog
>>>>>> [email protected]
>>>>
>>>> --
>>>> Lance Norskog
>>>> [email protected]
>>>
>>> --
>>> Lance Norskog
>>> [email protected]
>>
>> --
>> Lance Norskog
>> [email protected]
>
> --
> Lance Norskog
> [email protected]

--
Lance Norskog
[email protected]
