Re: Mahout lucene UTFDataFormatException: encoded string too long:

2013-04-25 Thread Ted Dunning
This sounds pretty fishy.

What this is saying is that you have a document in your index whose name is
longer than 65,535 characters.

That doesn't sound very plausible.  Don't you have a more appropriate ID
column?

The problem starts where you say --idField text.  Pick a better field.
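
The limit itself is just java.io.DataOutputStream.writeUTF: it stores the encoded length in an unsigned 16-bit prefix, so any string whose modified-UTF-8 encoding exceeds 65,535 bytes throws exactly this exception. A minimal, self-contained sketch (plain JDK, nothing Mahout-specific; the 94944 figure is copied from the report below):

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.UTFDataFormatException;
    import java.util.Arrays;

    public class WriteUtfLimit {
      public static void main(String[] args) throws Exception {
        // 94,944 ASCII chars encode to 94,944 bytes in modified UTF-8,
        // well past the 65,535-byte ceiling of the 2-byte length prefix.
        char[] chars = new char[94944];
        Arrays.fill(chars, 'a');
        DataOutputStream out =
            new DataOutputStream(new ByteArrayOutputStream());
        try {
          out.writeUTF(new String(chars));
        } catch (UTFDataFormatException e) {
          // Prints: encoded string too long: 94944 bytes
          System.out.println(e.getMessage());
        }
      }
    }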



On Wed, Apr 24, 2013 at 10:34 PM, nishant rathore 
nishant.rathor...@gmail.com wrote:

 Hi,

 I am trying to import vector from lucene using the command,

 ./bin/mahout lucene.vector -d
 /home/pacman/DownloadedCodes/solr-4.2.0/example/example-DIH/solr/plaintext/data/index
 --idField text -o ../output/fetise/luceneVector --field text -w TFIDF
 --dictOut ../output/fetise/luceneDictionary -err 0.10

 ./bin/mahout lucene.vector -d
 /home/pacman/DownloadedCodes/solr-4.2.0/example/example-DIH/solr/plaintext/data/index
 --idField text -o ../output/luceneVector --field text -w TFIDF
 --dictOut ../output/luceneDictionary -err 0.10

 But I am getting the following error:
 Exception in thread "main" java.io.UTFDataFormatException: encoded string too long: 94944 bytes
 at java.io.DataOutputStream.writeUTF(DataOutputStream.java:364)
 at java.io.DataOutputStream.writeUTF(DataOutputStream.java:323)
 at org.apache.mahout.math.VectorWritable.writeVector(VectorWritable.java:188)
 at org.apache.mahout.math.VectorWritable.write(VectorWritable.java:84)
 at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90)
 at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77)
 at org.apache.hadoop.io.SequenceFile$RecordCompressWriter.append(SequenceFile.java:1190)
 at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:1039)
 at org.apache.mahout.utils.vectors.io.SequenceFileVectorWriter.write(SequenceFileVectorWriter.java:49)
 at org.apache.mahout.utils.vectors.lucene.Driver.dumpVectors(Driver.java:111)
 at org.apache.mahout.utils.vectors.lucene.Driver.main(Driver.java:252)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:601)
 at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
 at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
 at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)

 I understand that since this is the writeUTF format, the encoded string
 cannot exceed 64 KB, but I am confused about how to deal with this. I
 changed Mahout to read and write raw bytes rather than UTF (a sketch of
 the framing issue follows this message), but later, while clustering, I
 got a byte-mismatch error.

 So I reverted the changes. What can I do to work around the UTF
 limitation? This seems like too obvious an issue not to have a solution
 inside Mahout itself.


 Thanks,
 Nishant
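
On the attempted workaround above (reading and writing raw bytes instead of UTF): the trap is that DataInput framing must match on both sides. A hedged sketch, not Mahout's actual code, with made-up helper names, showing a length-prefixed byte encoding that avoids the 64 KB ceiling, and why a reader left on readUTF would garble the stream:

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.DataInput;
    import java.io.DataInputStream;
    import java.io.DataOutput;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.util.Arrays;

    public class LengthPrefixedStrings {
      // Hypothetical writeUTF replacement: 4-byte length, then raw UTF-8.
      static void writeLongString(DataOutput out, String s) throws IOException {
        byte[] bytes = s.getBytes("UTF-8");
        out.writeInt(bytes.length);  // int prefix instead of writeUTF's short
        out.write(bytes);
      }

      // The matching reader. A reader still calling readUTF would instead
      // interpret the first two bytes of the int as a UTF length -- the
      // kind of mismatch that surfaces later as a byte error.
      static String readLongString(DataInput in) throws IOException {
        byte[] bytes = new byte[in.readInt()];
        in.readFully(bytes);
        return new String(bytes, "UTF-8");
      }

      public static void main(String[] args) throws Exception {
        char[] chars = new char[94944];
        Arrays.fill(chars, 'a');
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        writeLongString(new DataOutputStream(buf), new String(chars));
        DataInputStream in =
            new DataInputStream(new ByteArrayInputStream(buf.toByteArray()));
        System.out.println(readLongString(in).length());  // 94944
      }
    }

The simpler fix from the thread still stands, though: a sane --idField avoids needing any of this.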



Re: Mahout lucene UTFDataFormatException: encoded string too long:

2013-04-25 Thread nishant rathore
Hi Ted,

That was a stupid mistake. Thanks a lot for the quick reply and for
pointing out the issue.

I have changed the idField to the document's link field:
*./bin/mahout lucene.vector -d
/home/pacman/DownloadedCodes/solr-4.2.0/example/example-DIH/solr/plaintext/data/index
--idField link  -o ../output/fetise/luceneVector --field text -w TFIDF
--dictOut ../output/fetise/luceneDictionary -err 0.10*

and ran fkmeans clustering using the command:
*bin/mahout fkmeans -i ../output/fetise/luceneVector -c
../output/fetise/fetise-fkmeans-centroids -o
../output/fetise/fetise-fkmeans-clusters -cd 1.0 -k 40 -m 2 -ow -x 10 -dm
org.apache.mahout.common.distance.TanimotoDistanceMeasure*

But when running the cluster dumper
*./bin/mahout clusterdump -i ../output/fetise/fetise-fkmeans-clusters/ -o
../output/fetise/clusterdump -p ../output/fetise/fetise-fkmeans-centroids/
-d ../output/fetise/luceneDictionary  -dm
org.apache.mahout.common.distance.TanimotoDistanceMeasure*

I got the following error:
Exception in thread "main" java.lang.ClassCastException:
org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable
at org.apache.mahout.utils.clustering.ClusterDumper.readPoints(ClusterDumper.java:306)
at org.apache.mahout.utils.clustering.ClusterDumper.init(ClusterDumper.java:252)
at org.apache.mahout.utils.clustering.ClusterDumper.run(ClusterDumper.java:155)
at org.apache.mahout.utils.clustering.ClusterDumper.main(ClusterDumper.java:100)


 *./bin/mahout seqdumper -i
../output/fetise/fetise-fkmeans-centroids/part-randomSeed | more*
Input Path: ../output/fetise/fetise-fkmeans-centroids/part-randomSeed
Key class: *class org.apache.hadoop.io.Text* Value Class: class
org.apache.mahout.clustering.iterator.ClusterWritable
Key: 662: Value:
org.apache.mahout.clustering.iterator.ClusterWritable@17cc6cf
Key: 1014: Value:
org.apache.mahout.clustering.iterator.ClusterWritable@17cc6cf

Why am I getting Text keys in the centroids file?
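
For reference, a minimal way (a sketch, assuming the Hadoop 1.x SequenceFile API that Mahout builds against) to check what a sequence file declares in its header before a consumer like ClusterDumper.readPoints casts the keys to IntWritable; the path is the same part-randomSeed file dumped above:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;

    public class KeyClassCheck {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path(
            "../output/fetise/fetise-fkmeans-centroids/part-randomSeed");
        // Every sequence file records its key/value classes in its header;
        // seqdumper above shows this one as Text / ClusterWritable, which
        // is what the cast to IntWritable trips over.
        SequenceFile.Reader reader =
            new SequenceFile.Reader(FileSystem.get(conf), path, conf);
        try {
          System.out.println("key:   " + reader.getKeyClassName());
          System.out.println("value: " + reader.getValueClassName());
        } finally {
          reader.close();
        }
      }
    }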


Thanks,
Nishant





Re: Mahout lucene UTFDataFormatException: encoded string too long:

2013-04-25 Thread nishant rathore
Hi,

After running the command

*./bin/mahout clusterdump -i ../output/fetise/fetise-fkmeans-clusters/ -o
../output/fetise/clusterdump -p ../output/fetise/fetise-fkmeans-centroids/
-d ../output/fetise/luceneDictionary -dm
org.apache.mahout.common.distance.TanimotoDistanceMeasure*

my directory structure looks like this:

pacman@pacman:~/DownloadedCodes/mahout/output/fetise$ ls -lR
.:
total 3148
drwxrwxr-x 2 pacman pacman4096 Apr 25 20:09 centroids
-rw-rw-r-- 1 pacman pacman   0 Apr 26 08:51 clusterdump
drwxrwxr-x 4 pacman pacman4096 Apr 25 20:09 clusters
-rw-rw-r-- 1 pacman pacman  173057 Apr 25 20:09 luceneDictionary
-rwxrwxrwx 1 pacman pacman 3038677 Apr 25 20:09 luceneVector

./centroids:
total 188
-rwxrwxrwx 1 pacman pacman 191155 Apr 25 20:09 part-randomSeed

./clusters:
total 8
drwxrwxr-x 2 pacman pacman 4096 Apr 25 20:09 clusters-0
drwxrwxr-x 2 pacman pacman 4096 Apr 25 20:09 clusters-1-final

./clusters/clusters-0:
total 324
-rwxrwxrwx 1 pacman pacman  4888 Apr 25 20:09 part-0
...
-rwxrwxrwx 1 pacman pacman  4888 Apr 25 20:09 part-00039
-rwxrwxrwx 1 pacman pacman   207 Apr 25 20:09 _policy

./clusters/clusters-1-final:
total 7212
-rwxrwxrwx 1 pacman pacman 7377533 Apr 25 20:09 part-r-0
-rwxrwxrwx 1 pacman pacman 207 Apr 25 20:09 _policy
-rwxrwxrwx 1 pacman pacman   0 Apr 25 20:09 _SUCCESS


*So, when running clusterdump, I am confused: what should the cluster
points (-p) and the clusters directory (-i) point to?*


Thanks,
Nishant


