Re: Mahout lucene UTFDataFormatException: encoded string too long
This sounds pretty fishy. What this is saying is that you have a document in your index whose name is longer than 65,535 characters. That doesn't sound very plausible. Don't you have a more appropriate ID column?

The problem starts where you say --idField text. Pick a better field.

On Wed, Apr 24, 2013 at 10:34 PM, nishant rathore nishant.rathor...@gmail.com wrote:

> Hi,
>
> I am trying to import vectors from Lucene using the command:
>
>     ./bin/mahout lucene.vector -d /home/pacman/DownloadedCodes/solr-4.2.0/example/example-DIH/solr/plaintext/data/index --idField text -o ../output/fetise/luceneVector --field text -w TFIDF --dictOut ../output/fetise/luceneDictionary -err 0.10
>
> but I am getting the following error:
>
>     Exception in thread "main" java.io.UTFDataFormatException: encoded string too long: 94944 bytes
>         at java.io.DataOutputStream.writeUTF(DataOutputStream.java:364)
>         at java.io.DataOutputStream.writeUTF(DataOutputStream.java:323)
>         at org.apache.mahout.math.VectorWritable.writeVector(VectorWritable.java:188)
>         at org.apache.mahout.math.VectorWritable.write(VectorWritable.java:84)
>         at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90)
>         at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77)
>         at org.apache.hadoop.io.SequenceFile$RecordCompressWriter.append(SequenceFile.java:1190)
>         at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:1039)
>         at org.apache.mahout.utils.vectors.io.SequenceFileVectorWriter.write(SequenceFileVectorWriter.java:49)
>         at org.apache.mahout.utils.vectors.lucene.Driver.dumpVectors(Driver.java:111)
>         at org.apache.mahout.utils.vectors.lucene.Driver.main(Driver.java:252)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:601)
>         at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>         at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>         at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
>
> I understand that since the value is written as UTF, it cannot be longer than 64 KB, but I am confused about how to deal with this. I changed Mahout to read and write bytes rather than UTF, but then the later clustering step failed with a byte-mismatch error, so I reverted the changes.
>
> What can I do to circumvent the UTF limitation? This seems like too obvious an issue not to be solved inside Mahout itself.
>
> Thanks,
> Nishant
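For context on the limit itself: java.io.DataOutputStream.writeUTF prefixes the string with its encoded length as an unsigned 16-bit value, so any string whose modified-UTF-8 encoding exceeds 65,535 bytes throws UTFDataFormatException. A minimal, Mahout-free sketch that reproduces the error:

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.io.UTFDataFormatException;
    import java.util.Arrays;

    public class WriteUtfLimit {
        public static void main(String[] args) throws IOException {
            DataOutputStream out =
                new DataOutputStream(new ByteArrayOutputStream());

            // 70,000 ASCII chars encode to 70,000 bytes of modified UTF-8,
            // which overflows writeUTF's unsigned 16-bit length prefix.
            char[] chars = new char[70000];
            Arrays.fill(chars, 'x');

            try {
                out.writeUTF(new String(chars));
            } catch (UTFDataFormatException e) {
                // Prints: encoded string too long: 70000 bytes
                System.out.println(e.getMessage());
            }
        }
    }

This is why using the 94,944-byte document body as the vector's name blows up, and why a short ID field sidesteps the problem entirely.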
Re: Mahout lucene UTFDataFormatException: encoded string too long
Hi Ted,

That was a stupid mistake. Thanks a lot for the quick reply and for pointing out the issue. I have changed the idField to the document's link:

    ./bin/mahout lucene.vector -d /home/pacman/DownloadedCodes/solr-4.2.0/example/example-DIH/solr/plaintext/data/index --idField link -o ../output/fetise/luceneVector --field text -w TFIDF --dictOut ../output/fetise/luceneDictionary -err 0.10

and ran fkmeans clustering with:

    bin/mahout fkmeans -i ../output/fetise/luceneVector -c ../output/fetise/fetise-fkmeans-centroids -o ../output/fetise/fetise-fkmeans-clusters -cd 1.0 -k 40 -m 2 -ow -x 10 -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure

But when running the cluster dumper

    ./bin/mahout clusterdump -i ../output/fetise/fetise-fkmeans-clusters/ -o ../output/fetise/clusterdump -p ../output/fetise/fetise-fkmeans-centroids/ -d ../output/fetise/luceneDictionary -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure

I got the following error:

    Exception in thread "main" java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable
        at org.apache.mahout.utils.clustering.ClusterDumper.readPoints(ClusterDumper.java:306)
        at org.apache.mahout.utils.clustering.ClusterDumper.init(ClusterDumper.java:252)
        at org.apache.mahout.utils.clustering.ClusterDumper.run(ClusterDumper.java:155)
        at org.apache.mahout.utils.clustering.ClusterDumper.main(ClusterDumper.java:100)

Dumping the centroid seed file

    ./bin/mahout seqdumper -i ../output/fetise/fetise-fkmeans-centroids/part-randomSeed | more

shows:

    Input Path: ../output/fetise/fetise-fkmeans-centroids/part-randomSeed
    Key class: class org.apache.hadoop.io.Text
    Value Class: class org.apache.mahout.clustering.iterator.ClusterWritable
    Key: 662: Value: org.apache.mahout.clustering.iterator.ClusterWritable@17cc6cf
    Key: 1014: Value: org.apache.mahout.clustering.iterator.ClusterWritable@17cc6cf

Why am I getting the keys in the centroids as Text?

Thanks,
Nishant
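An aside on checking key types: the "Key class" / "Value Class" lines that seqdumper prints can also be read programmatically with Hadoop's SequenceFile.Reader. A minimal sketch, using the Hadoop 1.x-era reader constructor and the seed-file path from the thread, handy for confirming which sequence files carry IntWritable keys and which carry Text keys:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;

    public class ShowKeyValueClasses {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Path from the thread; point it at any sequence file to inspect.
            Path path = new Path(
                "../output/fetise/fetise-fkmeans-centroids/part-randomSeed");

            SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
            try {
                // The same information seqdumper reports.
                System.out.println("Key class:   " + reader.getKeyClassName());
                System.out.println("Value class: " + reader.getValueClassName());
            } finally {
                reader.close();
            }
        }
    }

Note that the stack trace and the seqdumper output together already explain the crash: ClusterDumper.readPoints casts keys to IntWritable, but the directory passed to -p here is the randomly seeded centroid directory, whose keys are Text.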
Re: Mahout lucene UTFDataFormatException: encoded string too long
Hi,

After running the command

    ./bin/mahout clusterdump -i ../output/fetise/fetise-fkmeans-clusters/ -o ../output/fetise/clusterdump -p ../output/fetise/fetise-fkmeans-centroids/ -d ../output/fetise/luceneDictionary -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure

my output directory looks like this:

    pacman@pacman:~/DownloadedCodes/mahout/output/fetise$ ls -lR
    .:
    total 3148
    drwxrwxr-x 2 pacman pacman    4096 Apr 25 20:09 centroids
    -rw-rw-r-- 1 pacman pacman       0 Apr 26 08:51 clusterdump
    drwxrwxr-x 4 pacman pacman    4096 Apr 25 20:09 clusters
    -rw-rw-r-- 1 pacman pacman  173057 Apr 25 20:09 luceneDictionary
    -rwxrwxrwx 1 pacman pacman 3038677 Apr 25 20:09 luceneVector

    ./centroids:
    total 188
    -rwxrwxrwx 1 pacman pacman 191155 Apr 25 20:09 part-randomSeed

    ./clusters:
    total 8
    drwxrwxr-x 2 pacman pacman 4096 Apr 25 20:09 clusters-0
    drwxrwxr-x 2 pacman pacman 4096 Apr 25 20:09 clusters-1-final

    ./clusters/clusters-0:
    total 324
    -rwxrwxrwx 1 pacman pacman 4888 Apr 25 20:09 part-0
    ...
    -rwxrwxrwx 1 pacman pacman 4888 Apr 25 20:09 part-00039
    -rwxrwxrwx 1 pacman pacman  207 Apr 25 20:09 _policy

    ./clusters/clusters-1-final:
    total 7212
    -rwxrwxrwx 1 pacman pacman 7377533 Apr 25 20:09 part-r-0
    -rwxrwxrwx 1 pacman pacman     207 Apr 25 20:09 _policy
    -rwxrwxrwx 1 pacman pacman       0 Apr 25 20:09 _SUCCESS

So when running clusterdump, I am confused: which directory holds the cluster points and which holds the clusters?

Thanks,
Nishant
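For orientation on that last question, a hedged sketch of how the dump is typically wired up in Mahout of this vintage. The assumptions: fkmeans writes a clusteredPoints directory (keyed by IntWritable, which is what clusterdump's readPoints expects) under its output directory only when run with the -cl flag, and clusterdump's -p option should point at that directory rather than at the centroid seeds:

    # Assumption: -cl makes fkmeans emit clusteredPoints under the -o directory.
    bin/mahout fkmeans -i ../output/fetise/luceneVector \
      -c ../output/fetise/fetise-fkmeans-centroids \
      -o ../output/fetise/fetise-fkmeans-clusters \
      -cd 1.0 -k 40 -m 2 -ow -x 10 -cl \
      -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure

    # -i: the final cluster state; -p: the clustered points, not the seed directory.
    ./bin/mahout clusterdump \
      -i ../output/fetise/fetise-fkmeans-clusters/clusters-1-final \
      -p ../output/fetise/fetise-fkmeans-clusters/clusteredPoints \
      -d ../output/fetise/luceneDictionary \
      -o ../output/fetise/clusterdump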