Re: Mahout lucene UTFDataFormatException: encoded string too long:
This sounds pretty fishy. What this is saying is that you have a document in your index whose name is longer than 65,535 bytes. That doesn't sound very plausible. Don't you have a more appropriate ID column? The problem starts where you say --idField text. Pick a better field.

On Wed, Apr 24, 2013 at 10:34 PM, nishant rathore nishant.rathor...@gmail.com wrote:

Hi,

I am trying to import vectors from Lucene using the command:

./bin/mahout lucene.vector -d /home/pacman/DownloadedCodes/solr-4.2.0/example/example-DIH/solr/plaintext/data/index --idField text -o ../output/fetise/luceneVector --field text -w TFIDF --dictOut ../output/fetise/luceneDictionary -err 0.10

./bin/mahout lucene.vector -d /home/pacman/DownloadedCodes/solr-4.2.0/example/example-DIH/solr/plaintext/data/index --idField text -o ../output/luceneVector --field text -w TFIDF --dictOut ../output/luceneDictionary -err 0.10

But I am getting the following error:

Exception in thread main java.io.UTFDataFormatException: encoded string too long: 94944 bytes
at java.io.DataOutputStream.writeUTF(DataOutputStream.java:364)
at java.io.DataOutputStream.writeUTF(DataOutputStream.java:323)
at org.apache.mahout.math.VectorWritable.writeVector(VectorWritable.java:188)
at org.apache.mahout.math.VectorWritable.write(VectorWritable.java:84)
at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90)
at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77)
at org.apache.hadoop.io.SequenceFile$RecordCompressWriter.append(SequenceFile.java:1190)
at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:1039)
at org.apache.mahout.utils.vectors.io.SequenceFileVectorWriter.write(SequenceFileVectorWriter.java:49)
at org.apache.mahout.utils.vectors.lucene.Driver.dumpVectors(Driver.java:111)
at org.apache.mahout.utils.vectors.lucene.Driver.main(Driver.java:252)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)

I understand that since this is written as UTF, the string cannot be larger than 64 KB, but I am confused about how to deal with that. I changed Mahout to read and write using bytes rather than UTF, but later, while doing clustering, I got a byte-mismatch error, so I reverted the changes. What can I do to work around the UTF limitation? This seems like too obvious an issue not to have been solved inside Mahout already.

Thanks,
Nishant
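[Editor's note] For anyone hitting the same error: lucene.vector stores each document's ID as the name of a NamedVector, and that name is serialized with DataOutputStream.writeUTF, which rejects any string whose encoded form exceeds 65,535 bytes. So the fix is simply a short, unique ID field rather than the document body. A minimal sketch of the corrected command, assuming the index has a short unique field called "id" (that field name is hypothetical; substitute whatever your schema actually provides):

./bin/mahout lucene.vector \
  -d /home/pacman/DownloadedCodes/solr-4.2.0/example/example-DIH/solr/plaintext/data/index \
  --idField id \
  --field text \
  -o ../output/fetise/luceneVector \
  --dictOut ../output/fetise/luceneDictionary \
  -w TFIDF -err 0.10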
mahout 0.7 NaiveBayes usage
I'm using the Mahout 0.7 Naive Bayes algorithm and I want to use my own data with it. Can anyone give some example code for the Naive Bayes algorithm? I have browsed the Mahout 0.7 Naive Bayes source code; the training code is in the class org.apache.mahout.classifier.naivebayes.training.TrainNaiveBayesJob. I am confused about the input format: how can I use my own data with this Mahout Naive Bayes algorithm? Thanks a lot
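[Editor's note] Not an authoritative answer, but the 20newsgroups example bundled with Mahout 0.7 is the clearest illustration of the expected input: TrainNaiveBayesJob consumes a SequenceFile of Text keys and VectorWritable values, where each key carries the class label of the document and each value is its feature vector. Starting from plain text, the usual route is seqdirectory followed by seq2sparse. A rough sketch with placeholder paths, assuming my-data/ holds one subdirectory per category with one text file per document (option names are recalled from the 0.7 example script; double-check them with ./bin/mahout trainnb --help):

# convert the directory tree into SequenceFiles; keys become /category/filename
./bin/mahout seqdirectory -i my-data -o my-data-seq

# tokenize and build TF-IDF vectors: SequenceFile<Text, VectorWritable>
./bin/mahout seq2sparse -i my-data-seq -o my-data-vectors -lnorm -nv -wt tfidf

# hold out 40% of the vectors for testing
./bin/mahout split -i my-data-vectors/tfidf-vectors --trainingOutput train-vectors --testOutput test-vectors --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential

# train; -el tells the job to extract the class label from each vector's key
./bin/mahout trainnb -i train-vectors -el -o model -li labelindex -ow

# evaluate the model on the held-out vectors
./bin/mahout testnb -i test-vectors -m model -l labelindex -ow -o testing-results

If the vectors are produced some other way, the only real requirement is that each training record be a Text key containing the label and a VectorWritable value.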
Re: Mahout lucene UTFDataFormatException: encoded string too long:
Hi Ted,

That was a stupid mistake. Thanks a lot for the quick reply and for pointing out the issue. I have changed the idField to the link of the document:

./bin/mahout lucene.vector -d /home/pacman/DownloadedCodes/solr-4.2.0/example/example-DIH/solr/plaintext/data/index --idField link -o ../output/fetise/luceneVector --field text -w TFIDF --dictOut ../output/fetise/luceneDictionary -err 0.10

and ran fkmeans clustering using the command:

bin/mahout fkmeans -i ../output/fetise/luceneVector -c ../output/fetise/fetise-fkmeans-centroids -o ../output/fetise/fetise-fkmeans-clusters -cd 1.0 -k 40 -m 2 -ow -x 10 -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure

But when running the cluster dumper

./bin/mahout clusterdump -i ../output/fetise/fetise-fkmeans-clusters/ -o ../output/fetise/clusterdump -p ../output/fetise/fetise-fkmeans-centroids/ -d ../output/fetise/luceneDictionary -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure

I got the following error:

Exception in thread main java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable
at org.apache.mahout.utils.clustering.ClusterDumper.readPoints(ClusterDumper.java:306)
at org.apache.mahout.utils.clustering.ClusterDumper.init(ClusterDumper.java:252)
at org.apache.mahout.utils.clustering.ClusterDumper.run(ClusterDumper.java:155)
at org.apache.mahout.utils.clustering.ClusterDumper.main(ClusterDumper.java:100)

./bin/mahout seqdumper -i ../output/fetise/fetise-fkmeans-centroids/part-randomSeed | more

Input Path: ../output/fetise/fetise-fkmeans-centroids/part-randomSeed
Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.mahout.clustering.iterator.ClusterWritable
Key: 662: Value: org.apache.mahout.clustering.iterator.ClusterWritable@17cc6cf
Key: 1014: Value: org.apache.mahout.clustering.iterator.ClusterWritable@17cc6cf

Why am I getting the key in centroids as Text?

Thanks,
Nishant

On Thu, Apr 25, 2013 at 12:20 PM, Ted Dunning ted.dunn...@gmail.com wrote:

This sounds pretty fishy. What this is saying is that you have a document in your index whose name is longer than 65,535 bytes. That doesn't sound very plausible. Don't you have a more appropriate ID column? The problem starts where you say --idField text. Pick a better field.

On Wed, Apr 24, 2013 at 10:34 PM, nishant rathore nishant.rathor...@gmail.com wrote:

Hi,

I am trying to import vectors from Lucene using the command:

./bin/mahout lucene.vector -d /home/pacman/DownloadedCodes/solr-4.2.0/example/example-DIH/solr/plaintext/data/index --idField text -o ../output/fetise/luceneVector --field text -w TFIDF --dictOut ../output/fetise/luceneDictionary -err 0.10

./bin/mahout lucene.vector -d /home/pacman/DownloadedCodes/solr-4.2.0/example/example-DIH/solr/plaintext/data/index --idField text -o ../output/luceneVector --field text -w TFIDF --dictOut ../output/luceneDictionary -err 0.10

But I am getting the following error:

Exception in thread main java.io.UTFDataFormatException: encoded string too long: 94944 bytes
at java.io.DataOutputStream.writeUTF(DataOutputStream.java:364)
at java.io.DataOutputStream.writeUTF(DataOutputStream.java:323)
at org.apache.mahout.math.VectorWritable.writeVector(VectorWritable.java:188)
at org.apache.mahout.math.VectorWritable.write(VectorWritable.java:84)
at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90)
at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77)
at org.apache.hadoop.io.SequenceFile$RecordCompressWriter.append(SequenceFile.java:1190)
at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:1039)
at org.apache.mahout.utils.vectors.io.SequenceFileVectorWriter.write(SequenceFileVectorWriter.java:49)
at org.apache.mahout.utils.vectors.lucene.Driver.dumpVectors(Driver.java:111)
at org.apache.mahout.utils.vectors.lucene.Driver.main(Driver.java:252)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)

I understand that since this is written as UTF, the string cannot be larger than 64 KB, but I am confused about how to deal with that. I changed Mahout to read and write using bytes rather than UTF, but later, while doing clustering, I got a byte-mismatch error, so I reverted the changes. What can I do to work around the UTF limitation? This seems like too obvious an issue not to have been solved inside Mahout already.
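[Editor's note] A hedged suggestion on the ClassCastException above: clusterdump's -p option expects the clusteredPoints directory that fkmeans produces only when it is run with the -cl (--clustering) flag; the records there are keyed by IntWritable cluster IDs. The part-randomSeed file under the centroids directory keeps the original vector names (now the link field) as Text keys, which is exactly what readPoints cannot cast. A sketch of the re-run under that assumption, reusing the paths from the commands above (the clusters-N-final directory name depends on how many iterations actually run):

# -cl makes fkmeans assign points to clusters and write <output>/clusteredPoints
./bin/mahout fkmeans -i ../output/fetise/luceneVector \
  -c ../output/fetise/fetise-fkmeans-centroids \
  -o ../output/fetise/fetise-fkmeans-clusters \
  -cd 1.0 -k 40 -m 2 -ow -x 10 -cl \
  -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure

# point -i at the final-iteration clusters and -p at clusteredPoints, not at the random seeds
./bin/mahout clusterdump -i ../output/fetise/fetise-fkmeans-clusters/clusters-*-final \
  -p ../output/fetise/fetise-fkmeans-clusters/clusteredPoints \
  -o ../output/fetise/clusterdump \
  -d ../output/fetise/luceneDictionary \
  -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure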
kmeans local vs mapreduce difference
Hi,

I'm running a k-means clustering job on a small sequence file (around 50 KB) on a 2-node cluster. The block size for this file is 20 KB, so it uses 3 mappers. I am using CDH 4.2.0 with YARN and Mahout 0.7.

If the job runs locally on only one node, CPU usage is around 20% and the job finishes in 30 seconds with 11 iterations. If the job runs on both nodes, CPU usage is around 80-90% on one node at a time and the job finishes in 15 minutes with 30 iterations (the maximum set by parameter).

The command is:

mahout kmeans \
  -i $Path/${SEQUENCE} \
  -c $Path/${CENTROIDS_IN} \
  -o $Path/${CLUSTERS_OUT} \
  -x 30 \
  -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure \
  -ow \
  -cd 0.5

Last iteration on local:

Cluster Iterator running iteration 11 over priorPath: **/clusters-out/clusters-10
13/04/24 16:42:32 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
13/04/24 16:42:32 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/04/24 16:42:32 INFO input.FileInputFormat: Total input paths to process : 1
13/04/24 16:42:32 INFO mapred.JobClient: Running job: job_local103137904_0011
13/04/24 16:42:32 INFO mapred.LocalJobRunner: OutputCommitter set in config null
13/04/24 16:42:32 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
13/04/24 16:42:32 INFO mapred.LocalJobRunner: Waiting for map tasks
13/04/24 16:42:32 INFO mapred.LocalJobRunner: Starting task: attempt_local103137904_0011_m_00_0
13/04/24 16:42:32 WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
13/04/24 16:42:32 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@603b88a8
13/04/24 16:42:32 INFO mapred.MapTask: Processing split: **/points-0:0+20480
13/04/24 16:42:32 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
13/04/24 16:42:32 INFO mapred.MapTask: io.sort.mb = 100
13/04/24 16:42:32 INFO mapred.MapTask: data buffer = 79691776/99614720
13/04/24 16:42:32 INFO mapred.MapTask: record buffer = 262144/327680
13/04/24 16:42:32 INFO mapred.LocalJobRunner:
13/04/24 16:42:32 INFO mapred.MapTask: Starting flush of map output
13/04/24 16:42:32 INFO mapred.MapTask: Finished spill 0
13/04/24 16:42:32 INFO mapred.Task: Task:attempt_local103137904_0011_m_00_0 is done. And is in the process of commiting
13/04/24 16:42:32 INFO mapred.LocalJobRunner:
13/04/24 16:42:32 INFO mapred.Task: Task 'attempt_local103137904_0011_m_00_0' done.
13/04/24 16:42:32 INFO mapred.LocalJobRunner: Finishing task: attempt_local103137904_0011_m_00_0
13/04/24 16:42:32 INFO mapred.LocalJobRunner: Starting task: attempt_local103137904_0011_m_01_0
13/04/24 16:42:32 WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
13/04/24 16:42:32 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@3183a45e
13/04/24 16:42:32 INFO mapred.MapTask: Processing split: **/points-0:20480+20480
13/04/24 16:42:32 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
13/04/24 16:42:32 INFO mapred.MapTask: io.sort.mb = 100
13/04/24 16:42:32 INFO mapred.MapTask: data buffer = 79691776/99614720
13/04/24 16:42:32 INFO mapred.MapTask: record buffer = 262144/327680
13/04/24 16:42:32 INFO mapred.LocalJobRunner:
13/04/24 16:42:32 INFO mapred.MapTask: Starting flush of map output
13/04/24 16:42:32 INFO mapred.MapTask: Finished spill 0
13/04/24 16:42:32 INFO mapred.Task: Task:attempt_local103137904_0011_m_01_0 is done. And is in the process of commiting
13/04/24 16:42:32 INFO mapred.LocalJobRunner:
13/04/24 16:42:32 INFO mapred.Task: Task 'attempt_local103137904_0011_m_01_0' done.
13/04/24 16:42:32 INFO mapred.LocalJobRunner: Finishing task: attempt_local103137904_0011_m_01_0
13/04/24 16:42:32 INFO mapred.LocalJobRunner: Starting task: attempt_local103137904_0011_m_02_0
13/04/24 16:42:32 WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
13/04/24 16:42:32 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@3273d766
13/04/24 16:42:32 INFO mapred.MapTask: Processing split: **/points-0:40960+10127
13/04/24 16:42:32 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
13/04/24 16:42:32 INFO mapred.MapTask: io.sort.mb = 100
13/04/24 16:42:32 INFO mapred.MapTask: data buffer = 79691776/99614720
13/04/24 16:42:32 INFO mapred.MapTask: record buffer = 262144/327680
13/04/24 16:42:32 INFO mapred.LocalJobRunner:
13/04/24 16:42:32 INFO
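[Editor's note] One way to make the local-vs-cluster comparison explicit: the 0.7 k-means driver exposes an execution-method switch (it appears as -xm / --method in the builds I have used; confirm with mahout kmeans --help), so the identical job can be run entirely in-process and compared against the cluster run. A sketch, reusing the variables from the command above:

mahout kmeans \
  -i $Path/${SEQUENCE} \
  -c $Path/${CENTROIDS_IN} \
  -o $Path/${CLUSTERS_OUT} \
  -x 30 -cd 0.5 -ow \
  -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure \
  -xm sequential

If the sequential run converges in ~11 iterations while the distributed run hits the 30-iteration cap, the difference is convergence behaviour; if both need the same number of iterations, the extra 15 minutes is most likely per-iteration MapReduce job setup overhead, which dominates on a 50 KB input.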
Random Forest implementation in Mahout
I just ran the RF example, the non-distributed version: BreimanExample with the glass data, 10 iterations with 100 trees. Here is the unexpected output:

13/04/25 15:38:40 INFO df.BreimanExample:
13/04/25 15:38:40 INFO df.BreimanExample: Random Input Test Error : 1.0
13/04/25 15:38:40 INFO df.BreimanExample: Single Input Test Error : 1.0
13/04/25 15:38:40 INFO df.BreimanExample: Mean Random Input Time : 0h 0m 1s 79
13/04/25 15:38:40 INFO df.BreimanExample: Mean Single Input Time : 0h 0m 0s 845
13/04/25 15:38:40 INFO df.BreimanExample: Mean Random Input Num Nodes : 6719
13/04/25 15:38:40 INFO df.BreimanExample: Mean Single Input Num Nodes : 11260

Is this a Mahout bug, or just an unverified implementation? Or, as the name indicates, does 'random forest' just give random predictions? Does anyone know of any verified random forest Java code available online?

thanks,
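[Editor's note] A test error of exactly 1.0 might point at the input rather than the forest: with the glass data the column layout matters, so one thing worth double-checking is the dataset descriptor generated before running the example. A rough sketch of the two steps, recalled from the old Mahout wiki page for this example (paths are placeholders, the descriptor string assumes the UCI glass layout of an ID column, nine numeric attributes and the class label, and the exact option names should be verified against the tools' --help):

# describe the dataset: I = ignored id column, 9 N = nine numeric attributes, L = label
hadoop jar mahout-examples-0.7-job.jar org.apache.mahout.classifier.df.tools.Describe -p glass.data -f glass.info -d I 9 N L

# run the Breiman example: 10 iterations, 100 trees per forest
hadoop jar mahout-examples-0.7-job.jar org.apache.mahout.classifier.df.BreimanExample -d glass.data -ds glass.info -i 10 -t 100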
CfP 2013 Workshop on Middleware for HPC and Big Data Systems (MHPC'13)
we apologize if you receive multiple copies of this message

=== CALL FOR PAPERS ===

2013 Workshop on Middleware for HPC and Big Data Systems (MHPC '13)
as part of Euro-Par 2013, Aachen, Germany

Date: August 27, 2013
Workshop URL: http://m-hpc.org
Springer LNCS

SUBMISSION DEADLINES:
May 31, 2013 - LNCS full paper submission (rolling abstract submission)
June 28, 2013 - Lightning talk abstracts

SCOPE

Extremely large, diverse, and complex data sets are generated from scientific applications, the Internet, social media and other applications. Data may be physically distributed and shared by an ever larger community. Collecting, aggregating, storing and analyzing large data volumes presents major challenges, and processing such amounts of data efficiently has become an obstacle to scientific discovery and technological advancement. In addition, making the data accessible, understandable and interoperable poses unsolved problems. Novel middleware architectures, algorithms, and application development frameworks are required.

In this workshop we are particularly interested in original work at the intersection of HPC and Big Data with regard to middleware handling and optimizations. In scope are existing and proposed middleware for HPC and big data, including analytics libraries and frameworks.

The goal of this workshop is to bring together software architects, middleware and framework developers, and data-intensive application developers, as well as users from the scientific and engineering community, to exchange their experience in processing large datasets and to report their scientific achievements and innovative ideas. The workshop also offers a dedicated forum for these researchers to access the state of the art, to discuss problems and requirements, to identify gaps in current and planned designs, and to collaborate on strategies for scalable data-intensive computing.

The workshop will be one day in length, composed of 20 min paper presentations, each followed by a 10 min discussion. Presentations may be accompanied by interactive demonstrations.

TOPICS

Topics of interest include, but are not limited to:

- Middleware including: Hadoop, Apache Drill, YARN, Spark/Shark, Hive, Pig, Sqoop, HBase, HDFS, S4, CIEL, Oozie, Impala, Storm and Hyrack
- Data intensive middleware architecture
- Libraries/Frameworks including: Apache Mahout, Giraph, UIMA and GraphLab
- NG Databases including Apache Cassandra, MongoDB and CouchDB/Couchbase
- Schedulers including Cascading
- Middleware for optimized data locality/in-place data processing
- Data handling middleware for deployment in virtualized HPC environments
- Parallelization and distributed processing architectures at the middleware level
- Integration with cloud middleware and application servers
- Runtime environments and system level support for data-intensive computing
- Skeletons and patterns
- Checkpointing
- Programming models and languages
- Big Data ETL
- Stream processing middleware
- In-memory databases for HPC
- Scalability and interoperability
- Large-scale data storage and distributed file systems
- Content-centric addressing and networking
- Execution engines, languages and environments including CIEL/Skywriting
- Performance analysis, evaluation of data-intensive middleware
- In-depth analysis and performance optimizations in existing data-handling middleware, focusing on indexing/fast storing or retrieval between compute and storage nodes
- Highly scalable middleware optimized for minimum communication
- Use cases and experience for popular Big Data middleware
- Middleware security, privacy and trust architectures

DATES

Papers (rolling abstract submission):
May 31, 2013 - Full paper submission
July 8, 2013 - Acceptance notification
October 3, 2013 - Camera-ready version due

Lightning Talks:
June 28, 2013 - Deadline for lightning talk abstracts
July 15, 2013 - Lightning talk notification

August 27, 2013 - Workshop Date

TPC CHAIRS

Michael Alexander (chair), TU Wien, Austria
Anastassios Nanos (co-chair), NTUA, Greece
Jie Tao (co-chair), Karlsruhe Institute of Technology, Germany
Lizhe Wang (co-chair), Chinese Academy of Sciences, China
Gianluigi Zanetti (co-chair), CRS4, Italy

PROGRAM COMMITTEE

Amitanand Aiyer, Facebook, USA
Costas Bekas, IBM, Switzerland
Jakob Blomer, CERN, Switzerland
William Gardner, University of Guelph, Canada
José Gracia, HPC Center of the University of Stuttgart, Germany
Zhenghua Guom, Indiana University, USA
Marcus Hardt, Karlsruhe Institute of Technology, Germany
Sverre Jarp, CERN, Switzerland
Christopher Jung, Karlsruhe Institute of Technology, Germany
Andreas Knüpfer, Technische Universität Dresden, Germany
Nectarios Koziris, National Technical University of Athens, Greece
Yan Ma, Chinese Academy of Sciences, China
Martin Schulz, Lawrence Livermore National Laboratory
Viral Shah,
Re: Mahout lucene UTFDataFormatException: encoded string too long:
Hi,

After running the command

./bin/mahout clusterdump -i ../output/fetise/fetise-fkmeans-clusters/ -o ../output/fetise/clusterdump -p ../output/fetise/fetise-fkmeans-centroids/ -d ../output/fetise/luceneDictionary -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure

my directory structure looks like this:

pacman@pacman:~/DownloadedCodes/mahout/output/fetise$ ls -lR
.:
total 3148
drwxrwxr-x 2 pacman pacman    4096 Apr 25 20:09 centroids
-rw-rw-r-- 1 pacman pacman       0 Apr 26 08:51 clusterdump
drwxrwxr-x 4 pacman pacman    4096 Apr 25 20:09 clusters
-rw-rw-r-- 1 pacman pacman  173057 Apr 25 20:09 luceneDictionary
-rwxrwxrwx 1 pacman pacman 3038677 Apr 25 20:09 luceneVector

./centroids:
total 188
-rwxrwxrwx 1 pacman pacman 191155 Apr 25 20:09 part-randomSeed

./clusters:
total 8
drwxrwxr-x 2 pacman pacman 4096 Apr 25 20:09 clusters-0
drwxrwxr-x 2 pacman pacman 4096 Apr 25 20:09 clusters-1-final

./clusters/clusters-0:
total 324
-rwxrwxrwx 1 pacman pacman 4888 Apr 25 20:09 part-0
...
-rwxrwxrwx 1 pacman pacman 4888 Apr 25 20:09 part-00039
-rwxrwxrwx 1 pacman pacman  207 Apr 25 20:09 _policy

./clusters/clusters-1-final:
total 7212
-rwxrwxrwx 1 pacman pacman 7377533 Apr 25 20:09 part-r-0
-rwxrwxrwx 1 pacman pacman     207 Apr 25 20:09 _policy
-rwxrwxrwx 1 pacman pacman       0 Apr 25 20:09 _SUCCESS

So I am confused: when running clusterdump, what are the cluster points and what is the cluster directory?

Thanks,
Nishant

On Thu, Apr 25, 2013 at 1:37 PM, nishant rathore nishant.rathor...@gmail.com wrote:

Hi Ted,

That was a stupid mistake. Thanks a lot for the quick reply and for pointing out the issue. I have changed the idField to the link of the document:

./bin/mahout lucene.vector -d /home/pacman/DownloadedCodes/solr-4.2.0/example/example-DIH/solr/plaintext/data/index --idField link -o ../output/fetise/luceneVector --field text -w TFIDF --dictOut ../output/fetise/luceneDictionary -err 0.10

and ran fkmeans clustering using the command:

bin/mahout fkmeans -i ../output/fetise/luceneVector -c ../output/fetise/fetise-fkmeans-centroids -o ../output/fetise/fetise-fkmeans-clusters -cd 1.0 -k 40 -m 2 -ow -x 10 -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure

But when running the cluster dumper

./bin/mahout clusterdump -i ../output/fetise/fetise-fkmeans-clusters/ -o ../output/fetise/clusterdump -p ../output/fetise/fetise-fkmeans-centroids/ -d ../output/fetise/luceneDictionary -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure

I got the following error:

Exception in thread main java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable
at org.apache.mahout.utils.clustering.ClusterDumper.readPoints(ClusterDumper.java:306)
at org.apache.mahout.utils.clustering.ClusterDumper.init(ClusterDumper.java:252)
at org.apache.mahout.utils.clustering.ClusterDumper.run(ClusterDumper.java:155)
at org.apache.mahout.utils.clustering.ClusterDumper.main(ClusterDumper.java:100)

./bin/mahout seqdumper -i ../output/fetise/fetise-fkmeans-centroids/part-randomSeed | more

Input Path: ../output/fetise/fetise-fkmeans-centroids/part-randomSeed
Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.mahout.clustering.iterator.ClusterWritable
Key: 662: Value: org.apache.mahout.clustering.iterator.ClusterWritable@17cc6cf
Key: 1014: Value: org.apache.mahout.clustering.iterator.ClusterWritable@17cc6cf

Why am I getting the key in centroids as Text?

Thanks,
Nishant

On Thu, Apr 25, 2013 at 12:20 PM, Ted Dunning ted.dunn...@gmail.com wrote:

This sounds pretty fishy. What this is saying is that you have a document in your index whose name is longer than 65,535 bytes. That doesn't sound very plausible. Don't you have a more appropriate ID column? The problem starts where you say --idField text. Pick a better field.

On Wed, Apr 24, 2013 at 10:34 PM, nishant rathore nishant.rathor...@gmail.com wrote:

Hi,

I am trying to import vectors from Lucene using the command:

./bin/mahout lucene.vector -d /home/pacman/DownloadedCodes/solr-4.2.0/example/example-DIH/solr/plaintext/data/index --idField text -o ../output/fetise/luceneVector --field text -w TFIDF --dictOut ../output/fetise/luceneDictionary -err 0.10

./bin/mahout lucene.vector -d /home/pacman/DownloadedCodes/solr-4.2.0/example/example-DIH/solr/plaintext/data/index --idField text -o ../output/luceneVector --field text -w TFIDF --dictOut ../output/luceneDictionary -err 0.10

But I am getting the following error:

Exception in thread main java.io.UTFDataFormatException: encoded string
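[Editor's note] On the "cluster points vs. cluster directory" question, my reading of the 0.7 tools (worth verifying against ./bin/mahout clusterdump --help): the cluster directory passed to -i should be the directory holding the final cluster centers, e.g. clusters/clusters-1-final in the listing above, while the points directory passed to -p should be the clusteredPoints directory that fkmeans writes only when invoked with -cl, keyed by IntWritable cluster IDs. There is no clusteredPoints directory in the listing, which matches the earlier ClassCastException. A sketch using the paths from the listing (adjust the names to your actual run, and see the fkmeans re-run sketched further up):

# after re-running fkmeans with -cl, the output directory gains a clusteredPoints subdirectory
./bin/mahout clusterdump -i ../output/fetise/clusters/clusters-1-final \
  -p ../output/fetise/clusters/clusteredPoints \
  -o ../output/fetise/clusterdump \
  -d ../output/fetise/luceneDictionary

# sanity check: clusteredPoints should report IntWritable keys, unlike part-randomSeed
# (point -i at one of its part files if your seqdumper build expects a single file)
./bin/mahout seqdumper -i ../output/fetise/clusters/clusteredPoints | more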