My problem was solved. An index contains a sequence of documents. A document is a sequence of fields. A field is a named sequence of terms. A term is a string.
I was searching for a term while the option was --field ... grrrrr. Now my output looks like this:

  9
  #term       doc freq  idx
  artichokes  14        0
  cheese      17        1
  deron       14        2
  french      14        3
  fries       14        4
  hamburger   14        5
  nicole      17        6
  salad       17        7
  steak       14        8

-----Original Message-----
From: Videnova, Svetlana [mailto:[email protected]]
Sent: Wednesday, July 25, 2012 11:05
To: [email protected]
Subject: RE: .txt to vector

OK, thank you. All good. 31 docs, 3 fields: content (term count 16), filename (term count 17), indexDate (term count 1).
There are "bananas" in at least 3 files, so I can't understand why I get:

  12/07/25 10:03:02 ERROR lucene.LuceneIterator: There are too many documents that do not have a term vector for bananas
  Exception in thread "main" java.lang.IllegalStateException: There are too many documents that do not have a term vector for bananas

-----Original Message-----
From: Lance Norskog [mailto:[email protected]]
Sent: Wednesday, July 25, 2012 09:57
To: [email protected]
Subject: Re: .txt to vector

It is a jar file, so just: java -jar luke.....jar

But there's a problem: Luke releases are keyed to different Lucene releases. You need the right Luke download for your version of Lucene.
http://code.google.com/p/luke/downloads/list

On Wed, Jul 25, 2012 at 12:52 AM, Videnova, Svetlana <[email protected]> wrote:

Sorry, but what is the command line for running Luke?

-----Original Message-----
From: Lance Norskog [mailto:[email protected]]
Sent: Wednesday, July 25, 2012 09:24
To: [email protected]
Subject: Re: .txt to vector

The Luke program lets you examine a Lucene index. Try that and check for your term vectors.
http://code.google.com/p/luke/

It uses Swing, so you need the index on your local PC.

On Wed, Jul 25, 2012 at 12:15 AM, Videnova, Svetlana <[email protected]> wrote:

Yes, I saw the help; that's why I was trying values between 0 and 1, but I get the same error every time:

  12/07/25 08:05:41 ERROR lucene.LuceneIterator: There are too many documents that do not have a term vector for bananas
  Exception in thread "main" java.lang.IllegalStateException: There are too many documents that do not have a term vector for bananas

-----Original Message-----
From: Lance Norskog [mailto:[email protected]]
Sent: Wednesday, July 25, 2012 08:59
To: [email protected]
Subject: Re: .txt to vector

You're making progress! Run "bin/mahout lucene.vector" and look at the help message:

  --maxPercentErrorDocs (-err) maxPercentErrorDocs
      The max percentage of docs that can have a null term vector. These are
      noise documents and can occur if the analyzer used strips out all terms
      in the target field. This percentage is expressed as a value between 0
      and 1. The default is 0.

You want .3, not 30!
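In other words, the corrected invocation would presumably be the command quoted below with the percentage expressed as a fraction, and nothing else changed:

  # Same paths as in the failing command below; 30% becomes 0.3.
  ./bin/mahout lucene.vector --dir ./toto/index_bananas/ \
      -o ./toto/lucene_vector_test/tom_indexes_output \
      --maxPercentErrorDocs 0.3 --field bananas \
      -t ./toto/lucene_vector_test/dictionnary/ -n 2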
On Tue, Jul 24, 2012 at 1:27 AM, Videnova, Svetlana <[email protected]> wrote:

I found this: http://comments.gmane.org/gmane.comp.apache.mahout.devel/16422

When I run this:

  apache-mahout-d6d6ee8$ ./bin/mahout lucene.vector --dir ./toto/index_bananas/ -o ./toto/lucene_vector_test/tom_indexes_output --maxPercentErrorDocs 30 --field bananas -t ./toto/lucene_vector_test/dictionnary/ -n 2

I get this error:

  12/07/24 09:25:22 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
  12/07/24 09:25:22 INFO compress.CodecPool: Got brand-new compressor
  Exception in thread "main" java.lang.IllegalArgumentException

-----Original Message-----
From: Videnova, Svetlana [mailto:[email protected]]
Sent: Tuesday, July 24, 2012 09:16
To: [email protected]
Subject: RE: .txt to vector

Hi Lance,

My index directory now contains _0.tvf and the other files.

With the command:

  apache-mahout-d6d6ee8$ ./bin/mahout lucene.vector --dir ./toto/index_bananas/ -o ./toto/lucene_vector_test/tom_indexes_output --field bananas -t ./toto/lucene_vector_test/dictionnary/ -n 2

the output is:

  ...
  12/07/24 08:13:01 ERROR lucene.LuceneIterator: There are too many documents that do not have a term vector for bananas
  Exception in thread "main" java.lang.IllegalStateException: There are too many documents that do not have a term vector for bananas
  ...

I still can't understand the error...

Thank you

-----Original Message-----
From: Lance Norskog [mailto:[email protected]]
Sent: Tuesday, July 24, 2012 04:28
To: [email protected]
Subject: Re: .txt to vector

You have to add term vectors to the field type you want to use. Then you have to reindex all of the data. You will then have another file in the index with the suffix .tvf. This holds the data which the Mahout lucene job looks for.
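For a Solr index, enabling term vectors is done in schema.xml. A minimal sketch, where the field name and field type are placeholders for whatever the real schema declares:

  <!-- Hypothetical schema.xml excerpt: termVectors="true" makes Lucene
       write the term-vector (.tvf/.tvd/.tvx) files that lucene.vector reads. -->
  <field name="bananas" type="text_general" indexed="true" stored="true"
         termVectors="true"/>

After changing the schema, reindex everything so the .tvf data actually gets written.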
On Mon, Jul 23, 2012 at 8:03 AM, Videnova, Svetlana <[email protected]> wrote:

Hello again,

I got my indexed files from Solr on Windows and copied them into a directory on Ubuntu.
They look like this:

  index_test$ ls
  _4d.fdt  _4d.frq  _4d.tis  _4e.fdx  _4e.frq  _4e.prx  _4e.tis  segments.gen
  _4d.fdx  _4d.prx  _4e.fdt  _4e.fnm  _4e.nrm  _4e.tii  segments_55

_4d.tis looks like:

  ]0 - PA – savoir où se trouve un panier workflow, statut

Then I use Mahout like this:

  apache-mahout-d6d6ee8$ ./bin/mahout lucene.vector --dir ./toto/index_test/ -o ./toto/lucene_vector_test/tom_indexes_output --field PA -t ./toto/lucene_vector_test/dictionnary/ -n 2

The output is:

  hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running locally
  SLF4J: Class path contains multiple SLF4J bindings.
  SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/mahout-examples-0.8-SNAPSHOT-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/dependency/slf4j-jcl-1.6.6.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/dependency/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
  SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
  12/07/23 15:50:09 INFO lucene.Driver: Output File: ./toto/lucene_vector_test/tom_indexes_output
  12/07/23 15:50:10 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
  12/07/23 15:50:10 INFO compress.CodecPool: Got brand-new compressor
  12/07/23 15:50:10 ERROR lucene.LuceneIterator: There are too many documents that do not have a term vector for PA
  Exception in thread "main" java.lang.IllegalStateException: There are too many documents that do not have a term vector for PA
          at org.apache.mahout.utils.vectors.lucene.LuceneIterator.computeNext(LuceneIterator.java:118)
          at org.apache.mahout.utils.vectors.lucene.LuceneIterator.computeNext(LuceneIterator.java:41)
          at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
          at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
          at org.apache.mahout.utils.vectors.io.SequenceFileVectorWriter.write(SequenceFileVectorWriter.java:44)
          at org.apache.mahout.utils.vectors.lucene.Driver.dumpVectors(Driver.java:109)
          at org.apache.mahout.utils.vectors.lucene.Driver.main(Driver.java:250)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
          at java.lang.reflect.Method.invoke(Method.java:597)
          at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
          at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
          at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)

I'm looking for field = "PA", which is used in a lot of files, so I don't understand why the exception tells me "too many documents that do not have a term vector for PA".

Can somebody explain how I am supposed to use the lucene.vector command? Apparently I'm missing something...

Thank you all!

-----Original Message-----
From: Videnova, Svetlana [mailto:[email protected]]
Sent: Monday, July 23, 2012 10:18
To: [email protected]
Subject: RE: .txt to vector

I'm using Mahout on Ubuntu and Solr on Windows. I guess that with a web service I can get the indexed files from Solr, and then a Java program in the web service can call the Mahout library to classify/cluster and categorize my database.
For the moment I'm just training with a directory on Ubuntu (my dir contains .xml, .txt and .csv files), because I don't know where I can get the indexed files from Solr on Ubuntu...?!
Also, I'm using the latest version, called apache-mahout-d6d6ee8.

When I use lucene.vector like this:

  $ ./bin/mahout lucene.vector -d ./toto/lucene_vector_test/ -o ./toto/lucene_vector_test/ -t ./toto/ -f content -n 2
  Exception in thread "main" org.apache.lucene.index.IndexNotFoundException: no segments* file found in org.apache.lucene.store.NIOFSDirectory@/usr/local/apache-mahout-d6d6ee8/toto/lucene_vector_test lockFactory=org.apache.lucene.store.NativeFSLockFactory@157aa53: files: []

Thank you
-----Original Message-----
From: Lance Norskog [mailto:[email protected]]
Sent: Saturday, July 21, 2012 05:55
To: [email protected]
Subject: Re: .txt to vector

Solr creates Lucene index files. You can query it for content in several formats. You will have to fetch the data with a program.

"bin/mahout lucene.vector" creates vector sequence files from a Lucene index. I have not tried this. You have to configure Solr to create term vectors for the field you want. This is in the field type declaration; see the Introduction in:
http://wiki.apache.org/solr/TermVectorComponent

I don't know if lucene.vector is in the Mahout 0.5 release.

For cluster outputs, the current cluster dumper supports 'graphml' format. Giraph is an interactive graph browser. You can look at small cluster jobs.
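As a sketch of that "fetch the data with a program" step, Solr's standard select handler can return query results in several formats; the host, port and query here are assumptions, not taken from this thread:

  # Hypothetical example: pull documents from a local Solr instance as JSON.
  curl 'http://localhost:8983/solr/select?q=*:*&rows=100&wt=json'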
On Thu, Jul 19, 2012 at 11:34 PM, Videnova, Svetlana <[email protected]> wrote:

Hi,
I already have Mahout in Action, but nothing in it works with the latest Mahout version...
I will look again...
Does "Taming Text" treat .xml and JSON files too? My goal is to take the output of Solr (which is .xml, JSON or PHP).

Regards

-----Original Message-----
From: Lance Norskog [mailto:[email protected]]
Sent: Friday, July 20, 2012 03:16
To: [email protected]
Subject: Re: .txt to vector

There are two books out for Mahout and text processing. "Mahout in Action" covers all of the apps in Mahout. "Taming Text" gives a good detailed explanation of the text processing programs in Mahout, and otherwise covers other text processing problems.

Mahout in Action is very good, and can help you use most of the Mahout features.

http://www.manning.com/owen
http://www.manning.com/ingersoll

On Thu, Jul 19, 2012 at 8:08 AM, Videnova, Svetlana <[email protected]> wrote:

Hi again,
Just finished. Here is what I did:

Mahout .txt to seqfile
https://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html

Converting a directory of documents to SequenceFile format:

  csi@csi-SCENIC-W:/usr/local/apache-mahout-d6d6ee8$ ./bin/mahout seqdirectory --input /usr/local/apache-mahout-d6d6ee8/toto --output /usr/local/apache-mahout-d6d6ee8/examples/output/

- This first step creates a chunk-0 file in the output path you gave.

Creating vectors from the SequenceFile:

  ./bin/mahout seq2sparse --input ./examples/output/chunk-0 --output ./toto/output/ -maxNGramSize

* Don't forget to give ./toto/output full rights.
- This second step takes the chunk-0 created by the first step and creates the output dir you specified in the --output option.

Creating vectors with k-means:

  ./bin/mahout kmeans -i ./toto/output/tfidf-vectors/ -c ./toto/centroides_kmeans/ -cl -o ./toto/cluster_kmeans/ -k 20 -ow -x 10

Transforming vectors to human-readable form (does not work yet):

  csi@csi-SCENIC-W:/usr/local/apache-mahout-d6d6ee8$ ./bin/mahout clusterdump -i ./toto/cluster_kmeans/clusters-1-final/ -o ./toto/clusters-dump/ -of TEXT -d ./toto/output/dictionary.file-0 -dt sequencefile -b 100 -n 20 --evaluate -dm org.apache.mahout.common.distance.CosineDistanceMeasure --pointsDir ./toto/cluster_kmeans/clusteredPoints/

* -s got changed to -i for Mahout 0.7.
* This works:

  ./bin/mahout clusterdump -i ./toto/cluster_kmeans/clusters-1-final/ -o ./toto/clusters-dump/ --pointsDir ./toto/cluster_kmeans/clusteredPoints/

Can somebody please explain the files below? What exactly do they contain, and how do I use them?
dictionary.file-0; tfidf-vectors; tokenized-documents; df-count; frequency.file-0; tf-vectors; wordcount

What is the chunk-0 file exactly?

What does the clusters-dump created by the clusterdump command represent?

Thank you all!
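One way to see what these SequenceFiles contain is Mahout's seqdumper utility, which prints the key/value pairs of a SequenceFile as text. A sketch using the paths from this thread (flag spellings may differ between Mahout releases, as with the -s/-i change noted above, and the part-file name is an assumption):

  # Dump the term -> integer-id dictionary produced by seq2sparse.
  ./bin/mahout seqdumper -i ./toto/output/dictionary.file-0

  # Dump the TF-IDF vectors themselves (one sparse vector per document).
  ./bin/mahout seqdumper -i ./toto/output/tfidf-vectors/part-r-00000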
-----Original Message-----
From: Videnova, Svetlana [mailto:[email protected]]
Sent: Thursday, July 19, 2012 15:07
To: [email protected]
Subject: RE: .txt to vector

The problem was that I gave seq2sparse the directory as input, and not the chunk directly.
Also, I had not given write rights for "group" and "others" on my output file.

After running the command:

  ./bin/mahout seq2sparse --input ./examples/output/chunk-0 --output ./toto/output/ --maxNGramSize 3

I got:

  12/07/19 13:57:10 INFO driver.MahoutDriver: Program took 57093 ms (Minutes: 0.95155)

So I went to my output, and there is:

  root@csi-SCENIC-W:/usr/local/apache-mahout-d6d6ee8/toto/output# ls
  df-count           frequency.file-0  tf-vectors           wordcount
  dictionary.file-0  tfidf-vectors     tokenized-documents

What should the vector files look like?
And can somebody please explain what each directory of the output above represents?

Thank you

-----Original Message-----
From: Videnova, Svetlana [mailto:[email protected]]
Sent: Thursday, July 19, 2012 14:26
To: [email protected]
Subject: RE: .txt to vector

Yes, that's what I was saying.

But I have no idea where in the code Mahout calls/creates the data that I don't have.
And the clusters that I do have (especially clusters-8) are old and were not generated by seqdirectory or by seq2sparse...
Should I make other manipulations before the seqdirectory or seq2sparse step?

Thank you

-----Original Message-----
From: Alexander Aristov [mailto:[email protected]]
Sent: Thursday, July 19, 2012 12:05
To: [email protected]
Subject: Re: .txt to vector

You've got another problem now:

  Exception in thread "main" java.io.FileNotFoundException: File file:/usr/local/apache-mahout-d6d6ee8/examples/output/clusters-8/data does not exist.

Best Regards
Alexander Aristov

On 19 July 2012 12:30, Videnova, Svetlana <[email protected]> wrote:

Hi Lance,

Thank you for your fast answer.
I changed my classpath to:

  CLASSPATH=/opt/lucene-3.6.0/lucene-core-3.6.0.jar:/opt/lucene-3.6.0/lucene-core-3.6.0-javadoc.jar:/opt/lucene-3.6.0/lucene-test-framework-3.6.0.jar:/opt/lucene-3.6.0/lucene-test-framework-3.6.0-javadoc.jar:.

and put 3.6.0 in the pom.xml.

But:

  csi@csi-SCENIC-W:/usr/local/apache-mahout-d6d6ee8$ ./bin/mahout seq2sparse --input ./examples/output/ --output ./toto/output/
  hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running locally
  SLF4J: Class path contains multiple SLF4J bindings.
  SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/mahout-examples-0.8-SNAPSHOT-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/dependency/slf4j-jcl-1.6.6.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/dependency/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
  SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
  12/07/19 09:03:55 INFO vectorizer.SparseVectorsFromSequenceFiles: Maximum n-gram size is: 1
  12/07/19 09:03:55 INFO vectorizer.SparseVectorsFromSequenceFiles: Minimum LLR value: 1.0
  12/07/19 09:03:55 INFO vectorizer.SparseVectorsFromSequenceFiles: Number of reduce tasks: 1
  12/07/19 09:03:56 INFO input.FileInputFormat: Total input paths to process : 15
  12/07/19 09:03:56 INFO mapred.JobClient: Cleaning up the staging area file:/tmp/hadoop-csi/mapred/staging/csi-379951768/.staging/job_local_0001
  Exception in thread "main" java.io.FileNotFoundException: File file:/usr/local/apache-mahout-d6d6ee8/examples/output/clusters-8/data does not exist.
          at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:371)
          at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
          at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:63)
          at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:252)
          at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:919)
          at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:936)
          at org.apache.hadoop.mapred.JobClient.access$500(JobClient.java:170)
          at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:854)
          at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:807)
          at java.security.AccessController.doPrivileged(Native Method)
          at javax.security.auth.Subject.doAs(Subject.java:396)
          at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
          at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:807)
          at org.apache.hadoop.mapreduce.Job.submit(Job.java:465)
          at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:495)
          at org.apache.mahout.vectorizer.DocumentProcessor.tokenizeDocuments(DocumentProcessor.java:93)
          at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.run(SparseVectorsFromSequenceFiles.java:255)
          at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
          at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
          at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.main(SparseVectorsFromSequenceFiles.java:55)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
          at java.lang.reflect.Method.invoke(Method.java:597)
          at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
          at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
          at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)

  csi@csi-SCENIC-W:/usr/local/apache-mahout-d6d6ee8/examples/output/clusters-8$ ls
  _logs  part-r-00000  _policy  _SUCCESS

There is no /usr/local/apache-mahout-d6d6ee8/examples/output/clusters-8/data here!

Thank you
-----Original Message-----
From: Lance Norskog [mailto:[email protected]]
Sent: Thursday, July 19, 2012 09:33
To: [email protected]
Subject: Re: .txt to vector

Yes, the Mahout analyzer would have to be updated for Lucene 4.0. I suggest using an earlier one. Mahout uses Lucene in a very simple way, and it is OK to use any earlier Lucene from 3.1 to 3.6.
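Following that advice, pinning a 3.x Lucene in Mahout's pom.xml would presumably mean setting the same property that gets changed to 4.0.0 below back to a 3.x release; a minimal sketch:

  <!-- Hypothetical pom.xml excerpt: keep the lucene.version property on a
       3.x release matching the Lucene jars on the runtime classpath. -->
  <properties>
    <lucene.version>3.6.0</lucene.version>
  </properties>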
On Wed, Jul 18, 2012 at 11:50 PM, Videnova, Svetlana <[email protected]> wrote:

Hi Sean,

In fact, I was using Lucene version 3.6.0 (I saw that in the pom.xml), but in my classpath I was using Lucene version 4.0.0.

I changed the pom.xml to 4.0.0 => <lucene.version>4.0.0</lucene.version>

But I still get the same error:

  Exception in thread "main" java.lang.VerifyError: class org.apache.mahout.vectorizer.DefaultAnalyzer overrides final method tokenStream.(Ljava/lang/String;Ljava/io/Reader;)Lorg/apache/lucene/analysis/TokenStream;

Should I change something else? Or maybe Lucene 4.0 is too recent for Mahout!?

Thank you

-----Original Message-----
From: Sean Owen [mailto:[email protected]]
Sent: Wednesday, July 18, 2012 22:52
To: [email protected]
Subject: Re: .txt to vector

This means you're using it with an incompatible version of Lucene. I think we're on 3.1. Check the version that Mahout depends upon and use at least that version or later.
On Wed, Jul 18, 2012 at 6:04 PM, Videnova, Svetlana <[email protected]> wrote:

I'm working with Mahout. I'm trying to write a web service in Java myself that will take the output of Solr and give that file to Mahout.
For the moment I have successfully done the recommendation part.
Now I'm trying to cluster. For this I have to vectorize the output of Solr.
Do you have any idea how to do it, please? I was following https://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html BUT it doesn't work very well (at all...).

I'm trying to find out how to transform .txt into vectors for Mahout in order to cluster and categorize my information. Is that possible?
I saw that I have to use seqdirectory and seq2sparse.

seqdirectory creates a file (with some numbers and everything...); this step is OK. But then when I use seq2sparse, it gives me this error:

  csi@csi-SCENIC-W:/usr/local/apache-mahout-d6d6ee8$ ./bin/mahout seq2sparse --input ./examples/output/ --output ./toto/output/
  hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running locally
  SLF4J: Class path contains multiple SLF4J bindings.
  SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/mahout-examples-0.8-SNAPSHOT-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/dependency/slf4j-jcl-1.6.6.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  SLF4J: Found binding in [jar:file:/usr/local/apache-mahout-d6d6ee8/examples/target/dependency/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
  SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
  12/07/18 15:53:33 INFO vectorizer.SparseVectorsFromSequenceFiles: Maximum n-gram size is: 1
  12/07/18 15:53:33 INFO vectorizer.SparseVectorsFromSequenceFiles: Minimum LLR value: 1.0
  12/07/18 15:53:33 INFO vectorizer.SparseVectorsFromSequenceFiles: Number of reduce tasks: 1
  Exception in thread "main" java.lang.VerifyError: class org.apache.mahout.vectorizer.DefaultAnalyzer overrides final method tokenStream.(Ljava/lang/String;Ljava/io/Reader;)Lorg/apache/lucene/analysis/TokenStream;
          at java.lang.ClassLoader.defineClass1(Native Method)
          at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
          at java.lang.ClassLoader.defineClass(ClassLoader.java:615)
          at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
          at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
          at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
          at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
          at java.security.AccessController.doPrivileged(Native Method)
          at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
          at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
          at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
          at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
          at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.run(SparseVectorsFromSequenceFiles.java:199)
          at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
          at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
          at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.main(SparseVectorsFromSequenceFiles.java:55)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
          at java.lang.reflect.Method.invoke(Method.java:597)
          at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
          at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
          at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)

I'm using only Lucene 4.0!

  CLASSPATH=/opt/lucene-4.0.0-ALPHA/demo/lucene-demo-4.0.0-ALPHA.jar:/opt/lucene-4.0.0-ALPHA/core/lucene-core-4.0.0-ALPHA.jar:/opt/lucene-4.0.0-ALPHA/analysis/common/lucene-analyzers-common-4.0.0-ALPHA.jar:/opt/lucene-4.0.0-ALPHA/queryparser/lucene-queryparser-4.0.0-ALPHA.jar:.

Please, where am I going wrong?

Thank you all
Regards
