Re: Mahout lucene UTFDataFormatException: encoded string too long:
This sounds pretty fishy. What this is saying is that you have a document in your index whose name is longer than 65,535 bytes. That doesn't sound very plausible. Don't you have a more appropriate ID column? The problem starts where you say --idField text. Pick a better field.

On Wed, Apr 24, 2013 at 10:34 PM, nishant rathore nishant.rathor...@gmail.com wrote:

Hi,

I am trying to import vectors from Lucene using the command:

./bin/mahout lucene.vector -d /home/pacman/DownloadedCodes/solr-4.2.0/example/example-DIH/solr/plaintext/data/index --idField text -o ../output/fetise/luceneVector --field text -w TFIDF --dictOut ../output/fetise/luceneDictionary -err 0.10

./bin/mahout lucene.vector -d /home/pacman/DownloadedCodes/solr-4.2.0/example/example-DIH/solr/plaintext/data/index --idField text -o ../output/luceneVector --field text -w TFIDF --dictOut ../output/luceneDictionary -err 0.10

But I am getting the following error:

Exception in thread main java.io.UTFDataFormatException: encoded string too long: 94944 bytes
at java.io.DataOutputStream.writeUTF(DataOutputStream.java:364)
at java.io.DataOutputStream.writeUTF(DataOutputStream.java:323)
at org.apache.mahout.math.VectorWritable.writeVector(VectorWritable.java:188)
at org.apache.mahout.math.VectorWritable.write(VectorWritable.java:84)
at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90)
at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77)
at org.apache.hadoop.io.SequenceFile$RecordCompressWriter.append(SequenceFile.java:1190)
at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:1039)
at org.apache.mahout.utils.vectors.io.SequenceFileVectorWriter.write(SequenceFileVectorWriter.java:49)
at org.apache.mahout.utils.vectors.lucene.Driver.dumpVectors(Driver.java:111)
at org.apache.mahout.utils.vectors.lucene.Driver.main(Driver.java:252)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)

I understand that since this is written as UTF, the string cannot be larger than 64 KB, but I am confused about how to deal with that. I changed Mahout to read and write using bytes rather than UTF, but later, while doing clustering, I got a byte-mismatch error, so I reverted the changes. What can I do to work around the UTF limitation? This seems like too obvious an issue not to have been solved inside Mahout already.

Thanks,
Nishant
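[Editor's note] For anyone hitting the same error: lucene.vector stores each document's ID as the name of a NamedVector, and that name is serialized with DataOutputStream.writeUTF, which rejects any string whose encoded form exceeds 65,535 bytes. So the fix is simply a short, unique ID field rather than the document body. A minimal sketch of the corrected command, assuming the index has a short unique field called "id" (that field name is hypothetical; substitute whatever your schema actually provides):

./bin/mahout lucene.vector \
  -d /home/pacman/DownloadedCodes/solr-4.2.0/example/example-DIH/solr/plaintext/data/index \
  --idField id \
  --field text \
  -o ../output/fetise/luceneVector \
  --dictOut ../output/fetise/luceneDictionary \
  -w TFIDF -err 0.10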
mahout 0.7 NaiveBayes usage
I'm using the Mahout 0.7 Naive Bayes algorithm and I want to use my own data with it. Can anyone give some example code for the Naive Bayes algorithm? I have browsed the Mahout 0.7 Naive Bayes source code; the training code is in the class org.apache.mahout.classifier.naivebayes.training.TrainNaiveBayesJob. I am confused about the input format: how can I use my own data with this Mahout Naive Bayes algorithm? Thanks a lot
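[Editor's note] Not an authoritative answer, but the 20newsgroups example bundled with Mahout 0.7 is the clearest illustration of the expected input: TrainNaiveBayesJob consumes a SequenceFile of Text keys and VectorWritable values, where each key carries the class label of the document and each value is its feature vector. Starting from plain text, the usual route is seqdirectory followed by seq2sparse. A rough sketch with placeholder paths, assuming my-data/ holds one subdirectory per category with one text file per document (option names are recalled from the 0.7 example script; double-check them with ./bin/mahout trainnb --help):

# convert the directory tree into SequenceFiles; keys become /category/filename
./bin/mahout seqdirectory -i my-data -o my-data-seq

# tokenize and build TF-IDF vectors: SequenceFile<Text, VectorWritable>
./bin/mahout seq2sparse -i my-data-seq -o my-data-vectors -lnorm -nv -wt tfidf

# hold out 40% of the vectors for testing
./bin/mahout split -i my-data-vectors/tfidf-vectors --trainingOutput train-vectors --testOutput test-vectors --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential

# train; -el tells the job to extract the class label from each vector's key
./bin/mahout trainnb -i train-vectors -el -o model -li labelindex -ow

# evaluate the model on the held-out vectors
./bin/mahout testnb -i test-vectors -m model -l labelindex -ow -o testing-results

If the vectors are produced some other way, the only real requirement is that each training record be a Text key containing the label and a VectorWritable value.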
Re: Mahout lucene UTFDataFormatException: encoded string too long:
Hi Ted,

That was a stupid mistake. Thanks a lot for the quick reply and for pointing out the issue. I have changed the idField to the link of the document:

./bin/mahout lucene.vector -d /home/pacman/DownloadedCodes/solr-4.2.0/example/example-DIH/solr/plaintext/data/index --idField link -o ../output/fetise/luceneVector --field text -w TFIDF --dictOut ../output/fetise/luceneDictionary -err 0.10

and ran fkmeans clustering using the command:

bin/mahout fkmeans -i ../output/fetise/luceneVector -c ../output/fetise/fetise-fkmeans-centroids -o ../output/fetise/fetise-fkmeans-clusters -cd 1.0 -k 40 -m 2 -ow -x 10 -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure

But when running the cluster dumper

./bin/mahout clusterdump -i ../output/fetise/fetise-fkmeans-clusters/ -o ../output/fetise/clusterdump -p ../output/fetise/fetise-fkmeans-centroids/ -d ../output/fetise/luceneDictionary -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure

I got the following error:

Exception in thread main java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable
at org.apache.mahout.utils.clustering.ClusterDumper.readPoints(ClusterDumper.java:306)
at org.apache.mahout.utils.clustering.ClusterDumper.init(ClusterDumper.java:252)
at org.apache.mahout.utils.clustering.ClusterDumper.run(ClusterDumper.java:155)
at org.apache.mahout.utils.clustering.ClusterDumper.main(ClusterDumper.java:100)

./bin/mahout seqdumper -i ../output/fetise/fetise-fkmeans-centroids/part-randomSeed | more

Input Path: ../output/fetise/fetise-fkmeans-centroids/part-randomSeed
Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.mahout.clustering.iterator.ClusterWritable
Key: 662: Value: org.apache.mahout.clustering.iterator.ClusterWritable@17cc6cf
Key: 1014: Value: org.apache.mahout.clustering.iterator.ClusterWritable@17cc6cf

Why am I getting the key in centroids as Text?

Thanks,
Nishant

On Thu, Apr 25, 2013 at 12:20 PM, Ted Dunning ted.dunn...@gmail.com wrote:

This sounds pretty fishy. What this is saying is that you have a document in your index whose name is longer than 65,535 bytes. That doesn't sound very plausible. Don't you have a more appropriate ID column? The problem starts where you say --idField text. Pick a better field.

On Wed, Apr 24, 2013 at 10:34 PM, nishant rathore nishant.rathor...@gmail.com wrote:

Hi,

I am trying to import vectors from Lucene using the command:

./bin/mahout lucene.vector -d /home/pacman/DownloadedCodes/solr-4.2.0/example/example-DIH/solr/plaintext/data/index --idField text -o ../output/fetise/luceneVector --field text -w TFIDF --dictOut ../output/fetise/luceneDictionary -err 0.10

./bin/mahout lucene.vector -d /home/pacman/DownloadedCodes/solr-4.2.0/example/example-DIH/solr/plaintext/data/index --idField text -o ../output/luceneVector --field text -w TFIDF --dictOut ../output/luceneDictionary -err 0.10

But I am getting the following error:

Exception in thread main java.io.UTFDataFormatException: encoded string too long: 94944 bytes
at java.io.DataOutputStream.writeUTF(DataOutputStream.java:364)
at java.io.DataOutputStream.writeUTF(DataOutputStream.java:323)
at org.apache.mahout.math.VectorWritable.writeVector(VectorWritable.java:188)
at org.apache.mahout.math.VectorWritable.write(VectorWritable.java:84)
at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90)
at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77)
at org.apache.hadoop.io.SequenceFile$RecordCompressWriter.append(SequenceFile.java:1190)
at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:1039)
at org.apache.mahout.utils.vectors.io.SequenceFileVectorWriter.write(SequenceFileVectorWriter.java:49)
at org.apache.mahout.utils.vectors.lucene.Driver.dumpVectors(Driver.java:111)
at org.apache.mahout.utils.vectors.lucene.Driver.main(Driver.java:252)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)

I understand that since this is written as UTF, the string cannot be larger than 64 KB, but I am confused about how to deal with that. I changed Mahout to read and write using bytes rather than UTF, but later, while doing clustering, I got a byte-mismatch error, so I reverted the changes. What can I do to work around the UTF limitation? This seems like too obvious an issue not to have been solved inside Mahout already.
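[Editor's note] A hedged suggestion on the ClassCastException above: clusterdump's -p option expects the clusteredPoints directory that fkmeans produces only when it is run with the -cl (--clustering) flag; the records there are keyed by IntWritable cluster IDs. The part-randomSeed file under the centroids directory keeps the original vector names (now the link field) as Text keys, which is exactly what readPoints cannot cast. A sketch of the re-run under that assumption, reusing the paths from the commands above (the clusters-N-final directory name depends on how many iterations actually run):

# -cl makes fkmeans assign points to clusters and write <output>/clusteredPoints
./bin/mahout fkmeans -i ../output/fetise/luceneVector \
  -c ../output/fetise/fetise-fkmeans-centroids \
  -o ../output/fetise/fetise-fkmeans-clusters \
  -cd 1.0 -k 40 -m 2 -ow -x 10 -cl \
  -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure

# point -i at the final-iteration clusters and -p at clusteredPoints, not at the random seeds
./bin/mahout clusterdump -i ../output/fetise/fetise-fkmeans-clusters/clusters-*-final \
  -p ../output/fetise/fetise-fkmeans-clusters/clusteredPoints \
  -o ../output/fetise/clusterdump \
  -d ../output/fetise/luceneDictionary \
  -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure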
kmeans local vs mapreduce difference
Hi,

I'm running a k-means clustering job on a small sequence file (around 50 KB) on a 2-node cluster. The block size for this file is 20 KB, so it uses 3 mappers. I am using CDH 4.2.0 with YARN and Mahout 0.7.

If the job runs locally on only one node, CPU usage is around 20% and the job finishes in 30 seconds with 11 iterations. If the job runs on both nodes, CPU usage is around 80-90% on one node at a time and the job finishes in 15 minutes with 30 iterations (the maximum set by parameter).

The command is:

mahout kmeans \
  -i $Path/${SEQUENCE} \
  -c $Path/${CENTROIDS_IN} \
  -o $Path/${CLUSTERS_OUT} \
  -x 30 \
  -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure \
  -ow \
  -cd 0.5

Last iteration on local:

Cluster Iterator running iteration 11 over priorPath: **/clusters-out/clusters-10
13/04/24 16:42:32 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
13/04/24 16:42:32 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/04/24 16:42:32 INFO input.FileInputFormat: Total input paths to process : 1
13/04/24 16:42:32 INFO mapred.JobClient: Running job: job_local103137904_0011
13/04/24 16:42:32 INFO mapred.LocalJobRunner: OutputCommitter set in config null
13/04/24 16:42:32 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
13/04/24 16:42:32 INFO mapred.LocalJobRunner: Waiting for map tasks
13/04/24 16:42:32 INFO mapred.LocalJobRunner: Starting task: attempt_local103137904_0011_m_00_0
13/04/24 16:42:32 WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
13/04/24 16:42:32 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@603b88a8
13/04/24 16:42:32 INFO mapred.MapTask: Processing split: **/points-0:0+20480
13/04/24 16:42:32 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
13/04/24 16:42:32 INFO mapred.MapTask: io.sort.mb = 100
13/04/24 16:42:32 INFO mapred.MapTask: data buffer = 79691776/99614720
13/04/24 16:42:32 INFO mapred.MapTask: record buffer = 262144/327680
13/04/24 16:42:32 INFO mapred.LocalJobRunner:
13/04/24 16:42:32 INFO mapred.MapTask: Starting flush of map output
13/04/24 16:42:32 INFO mapred.MapTask: Finished spill 0
13/04/24 16:42:32 INFO mapred.Task: Task:attempt_local103137904_0011_m_00_0 is done. And is in the process of commiting
13/04/24 16:42:32 INFO mapred.LocalJobRunner:
13/04/24 16:42:32 INFO mapred.Task: Task 'attempt_local103137904_0011_m_00_0' done.
13/04/24 16:42:32 INFO mapred.LocalJobRunner: Finishing task: attempt_local103137904_0011_m_00_0
13/04/24 16:42:32 INFO mapred.LocalJobRunner: Starting task: attempt_local103137904_0011_m_01_0
13/04/24 16:42:32 WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
13/04/24 16:42:32 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@3183a45e
13/04/24 16:42:32 INFO mapred.MapTask: Processing split: **/points-0:20480+20480
13/04/24 16:42:32 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
13/04/24 16:42:32 INFO mapred.MapTask: io.sort.mb = 100
13/04/24 16:42:32 INFO mapred.MapTask: data buffer = 79691776/99614720
13/04/24 16:42:32 INFO mapred.MapTask: record buffer = 262144/327680
13/04/24 16:42:32 INFO mapred.LocalJobRunner:
13/04/24 16:42:32 INFO mapred.MapTask: Starting flush of map output
13/04/24 16:42:32 INFO mapred.MapTask: Finished spill 0
13/04/24 16:42:32 INFO mapred.Task: Task:attempt_local103137904_0011_m_01_0 is done. And is in the process of commiting
13/04/24 16:42:32 INFO mapred.LocalJobRunner:
13/04/24 16:42:32 INFO mapred.Task: Task 'attempt_local103137904_0011_m_01_0' done.
13/04/24 16:42:32 INFO mapred.LocalJobRunner: Finishing task: attempt_local103137904_0011_m_01_0
13/04/24 16:42:32 INFO mapred.LocalJobRunner: Starting task: attempt_local103137904_0011_m_02_0
13/04/24 16:42:32 WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
13/04/24 16:42:32 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@3273d766
13/04/24 16:42:32 INFO mapred.MapTask: Processing split: **/points-0:40960+10127
13/04/24 16:42:32 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
13/04/24 16:42:32 INFO mapred.MapTask: io.sort.mb = 100
13/04/24 16:42:32 INFO mapred.MapTask: data buffer = 79691776/99614720
13/04/24 16:42:32 INFO mapred.MapTask: record buffer = 262144/327680
13/04/24 16:42:32 INFO mapred.LocalJobRunner:
13/04/24 16:42:32 INFO
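[Editor's note] One way to make the local-vs-cluster comparison explicit: the 0.7 k-means driver exposes an execution-method switch (it appears as -xm / --method in the builds I have used; confirm with mahout kmeans --help), so the identical job can be run entirely in-process and compared against the cluster run. A sketch, reusing the variables from the command above:

mahout kmeans \
  -i $Path/${SEQUENCE} \
  -c $Path/${CENTROIDS_IN} \
  -o $Path/${CLUSTERS_OUT} \
  -x 30 -cd 0.5 -ow \
  -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure \
  -xm sequential

If the sequential run converges in ~11 iterations while the distributed run hits the 30-iteration cap, the difference is convergence behaviour; if both need the same number of iterations, the extra 15 minutes is most likely per-iteration MapReduce job setup overhead, which dominates on a 50 KB input.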
Random Forest implementation in Mahout
I just ran the RF example, the non-distributed version: BreimanExample with the glass data, 10 iterations with 100 trees. Here is the unexpected output:

13/04/25 15:38:40 INFO df.BreimanExample:
13/04/25 15:38:40 INFO df.BreimanExample: Random Input Test Error : 1.0
13/04/25 15:38:40 INFO df.BreimanExample: Single Input Test Error : 1.0
13/04/25 15:38:40 INFO df.BreimanExample: Mean Random Input Time : 0h 0m 1s 79
13/04/25 15:38:40 INFO df.BreimanExample: Mean Single Input Time : 0h 0m 0s 845
13/04/25 15:38:40 INFO df.BreimanExample: Mean Random Input Num Nodes : 6719
13/04/25 15:38:40 INFO df.BreimanExample: Mean Single Input Num Nodes : 11260

Is this a Mahout bug, or just an unverified implementation? Or, as the name indicates, does 'random forest' just give random predictions? Does anyone know of any verified random forest Java code available online?

thanks,
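[Editor's note] A test error of exactly 1.0 might point at the input rather than the forest: with the glass data the column layout matters, so one thing worth double-checking is the dataset descriptor generated before running the example. A rough sketch of the two steps, recalled from the old Mahout wiki page for this example (paths are placeholders, the descriptor string assumes the UCI glass layout of an ID column, nine numeric attributes and the class label, and the exact option names should be verified against the tools' --help):

# describe the dataset: I = ignored id column, 9 N = nine numeric attributes, L = label
hadoop jar mahout-examples-0.7-job.jar org.apache.mahout.classifier.df.tools.Describe -p glass.data -f glass.info -d I 9 N L

# run the Breiman example: 10 iterations, 100 trees per forest
hadoop jar mahout-examples-0.7-job.jar org.apache.mahout.classifier.df.BreimanExample -d glass.data -ds glass.info -i 10 -t 100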
CfP 2013 Workshop on Middleware for HPC and Big Data Systems (MHPC'13)
we apologize if you receive multiple copies of this message

=== CALL FOR PAPERS ===

2013 Workshop on Middleware for HPC and Big Data Systems (MHPC '13)
as part of Euro-Par 2013, Aachen, Germany

Date: August 27, 2013
Workshop URL: http://m-hpc.org
Springer LNCS

SUBMISSION DEADLINES:
May 31, 2013 - LNCS full paper submission (rolling abstract submission)
June 28, 2013 - Lightning talk abstracts

SCOPE

Extremely large, diverse, and complex data sets are generated from scientific applications, the Internet, social media and other applications. Data may be physically distributed and shared by an ever larger community. Collecting, aggregating, storing and analyzing large data volumes presents major challenges, and processing such amounts of data efficiently has become an obstacle to scientific discovery and technological advancement. In addition, making the data accessible, understandable and interoperable poses unsolved problems. Novel middleware architectures, algorithms, and application development frameworks are required.

In this workshop we are particularly interested in original work at the intersection of HPC and Big Data with regard to middleware handling and optimizations. In scope are existing and proposed middleware for HPC and big data, including analytics libraries and frameworks.

The goal of this workshop is to bring together software architects, middleware and framework developers, and data-intensive application developers, as well as users from the scientific and engineering community, to exchange their experience in processing large datasets and to report their scientific achievements and innovative ideas. The workshop also offers a dedicated forum for these researchers to access the state of the art, to discuss problems and requirements, to identify gaps in current and planned designs, and to collaborate on strategies for scalable data-intensive computing.

The workshop will be one day in length, composed of 20 min paper presentations, each followed by a 10 min discussion. Presentations may be accompanied by interactive demonstrations.

TOPICS

Topics of interest include, but are not limited to:

- Middleware including: Hadoop, Apache Drill, YARN, Spark/Shark, Hive, Pig, Sqoop, HBase, HDFS, S4, CIEL, Oozie, Impala, Storm and Hyrack
- Data intensive middleware architecture
- Libraries/Frameworks including: Apache Mahout, Giraph, UIMA and GraphLab
- NG Databases including Apache Cassandra, MongoDB and CouchDB/Couchbase
- Schedulers including Cascading
- Middleware for optimized data locality/in-place data processing
- Data handling middleware for deployment in virtualized HPC environments
- Parallelization and distributed processing architectures at the middleware level
- Integration with cloud middleware and application servers
- Runtime environments and system level support for data-intensive computing
- Skeletons and patterns
- Checkpointing
- Programming models and languages
- Big Data ETL
- Stream processing middleware
- In-memory databases for HPC
- Scalability and interoperability
- Large-scale data storage and distributed file systems
- Content-centric addressing and networking
- Execution engines, languages and environments including CIEL/Skywriting
- Performance analysis, evaluation of data-intensive middleware
- In-depth analysis and performance optimizations in existing data-handling middleware, focusing on indexing/fast storing or retrieval between compute and storage nodes
- Highly scalable middleware optimized for minimum communication
- Use cases and experience for popular Big Data middleware
- Middleware security, privacy and trust architectures

DATES

Papers (rolling abstract submission):
May 31, 2013 - Full paper submission
July 8, 2013 - Acceptance notification
October 3, 2013 - Camera-ready version due

Lightning Talks:
June 28, 2013 - Deadline for lightning talk abstracts
July 15, 2013 - Lightning talk notification

August 27, 2013 - Workshop Date

TPC CHAIRS

Michael Alexander (chair), TU Wien, Austria
Anastassios Nanos (co-chair), NTUA, Greece
Jie Tao (co-chair), Karlsruhe Institute of Technology, Germany
Lizhe Wang (co-chair), Chinese Academy of Sciences, China
Gianluigi Zanetti (co-chair), CRS4, Italy

PROGRAM COMMITTEE

Amitanand Aiyer, Facebook, USA
Costas Bekas, IBM, Switzerland
Jakob Blomer, CERN, Switzerland
William Gardner, University of Guelph, Canada
José Gracia, HPC Center of the University of Stuttgart, Germany
Zhenghua Guom, Indiana University, USA
Marcus Hardt, Karlsruhe Institute of Technology, Germany
Sverre Jarp, CERN, Switzerland
Christopher Jung, Karlsruhe Institute of Technology, Germany
Andreas Knüpfer, Technische Universität Dresden, Germany
Nectarios Koziris, National Technical University of Athens, Greece
Yan Ma, Chinese Academy of Sciences, China
Martin Schulz, Lawrence Livermore National Laboratory
Viral Shah,
Re: Mahout lucene UTFDataFormatException: encoded string too long:
Hi,

After running the command

./bin/mahout clusterdump -i ../output/fetise/fetise-fkmeans-clusters/ -o ../output/fetise/clusterdump -p ../output/fetise/fetise-fkmeans-centroids/ -d ../output/fetise/luceneDictionary -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure

my directory structure looks like this:

pacman@pacman:~/DownloadedCodes/mahout/output/fetise$ ls -lR
.:
total 3148
drwxrwxr-x 2 pacman pacman    4096 Apr 25 20:09 centroids
-rw-rw-r-- 1 pacman pacman       0 Apr 26 08:51 clusterdump
drwxrwxr-x 4 pacman pacman    4096 Apr 25 20:09 clusters
-rw-rw-r-- 1 pacman pacman  173057 Apr 25 20:09 luceneDictionary
-rwxrwxrwx 1 pacman pacman 3038677 Apr 25 20:09 luceneVector

./centroids:
total 188
-rwxrwxrwx 1 pacman pacman 191155 Apr 25 20:09 part-randomSeed

./clusters:
total 8
drwxrwxr-x 2 pacman pacman 4096 Apr 25 20:09 clusters-0
drwxrwxr-x 2 pacman pacman 4096 Apr 25 20:09 clusters-1-final

./clusters/clusters-0:
total 324
-rwxrwxrwx 1 pacman pacman 4888 Apr 25 20:09 part-0
...
-rwxrwxrwx 1 pacman pacman 4888 Apr 25 20:09 part-00039
-rwxrwxrwx 1 pacman pacman  207 Apr 25 20:09 _policy

./clusters/clusters-1-final:
total 7212
-rwxrwxrwx 1 pacman pacman 7377533 Apr 25 20:09 part-r-0
-rwxrwxrwx 1 pacman pacman     207 Apr 25 20:09 _policy
-rwxrwxrwx 1 pacman pacman       0 Apr 25 20:09 _SUCCESS

So I am confused: when running clusterdump, what are the cluster points and what is the cluster directory?

Thanks,
Nishant

On Thu, Apr 25, 2013 at 1:37 PM, nishant rathore nishant.rathor...@gmail.com wrote:

Hi Ted,

That was a stupid mistake. Thanks a lot for the quick reply and for pointing out the issue. I have changed the idField to the link of the document:

./bin/mahout lucene.vector -d /home/pacman/DownloadedCodes/solr-4.2.0/example/example-DIH/solr/plaintext/data/index --idField link -o ../output/fetise/luceneVector --field text -w TFIDF --dictOut ../output/fetise/luceneDictionary -err 0.10

and ran fkmeans clustering using the command:

bin/mahout fkmeans -i ../output/fetise/luceneVector -c ../output/fetise/fetise-fkmeans-centroids -o ../output/fetise/fetise-fkmeans-clusters -cd 1.0 -k 40 -m 2 -ow -x 10 -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure

But when running the cluster dumper

./bin/mahout clusterdump -i ../output/fetise/fetise-fkmeans-clusters/ -o ../output/fetise/clusterdump -p ../output/fetise/fetise-fkmeans-centroids/ -d ../output/fetise/luceneDictionary -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure

I got the following error:

Exception in thread main java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable
at org.apache.mahout.utils.clustering.ClusterDumper.readPoints(ClusterDumper.java:306)
at org.apache.mahout.utils.clustering.ClusterDumper.init(ClusterDumper.java:252)
at org.apache.mahout.utils.clustering.ClusterDumper.run(ClusterDumper.java:155)
at org.apache.mahout.utils.clustering.ClusterDumper.main(ClusterDumper.java:100)

./bin/mahout seqdumper -i ../output/fetise/fetise-fkmeans-centroids/part-randomSeed | more

Input Path: ../output/fetise/fetise-fkmeans-centroids/part-randomSeed
Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.mahout.clustering.iterator.ClusterWritable
Key: 662: Value: org.apache.mahout.clustering.iterator.ClusterWritable@17cc6cf
Key: 1014: Value: org.apache.mahout.clustering.iterator.ClusterWritable@17cc6cf

Why am I getting the key in centroids as Text?

Thanks,
Nishant

On Thu, Apr 25, 2013 at 12:20 PM, Ted Dunning ted.dunn...@gmail.com wrote:

This sounds pretty fishy. What this is saying is that you have a document in your index whose name is longer than 65,535 bytes. That doesn't sound very plausible. Don't you have a more appropriate ID column? The problem starts where you say --idField text. Pick a better field.

On Wed, Apr 24, 2013 at 10:34 PM, nishant rathore nishant.rathor...@gmail.com wrote:

Hi,

I am trying to import vectors from Lucene using the command:

./bin/mahout lucene.vector -d /home/pacman/DownloadedCodes/solr-4.2.0/example/example-DIH/solr/plaintext/data/index --idField text -o ../output/fetise/luceneVector --field text -w TFIDF --dictOut ../output/fetise/luceneDictionary -err 0.10

./bin/mahout lucene.vector -d /home/pacman/DownloadedCodes/solr-4.2.0/example/example-DIH/solr/plaintext/data/index --idField text -o ../output/luceneVector --field text -w TFIDF --dictOut ../output/luceneDictionary -err 0.10

But I am getting the following error:

Exception in thread main java.io.UTFDataFormatException: encoded string
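[Editor's note] On the "cluster points vs. cluster directory" question, my reading of the 0.7 tools (worth verifying against ./bin/mahout clusterdump --help): the cluster directory passed to -i should be the directory holding the final cluster centers, e.g. clusters/clusters-1-final in the listing above, while the points directory passed to -p should be the clusteredPoints directory that fkmeans writes only when invoked with -cl, keyed by IntWritable cluster IDs. There is no clusteredPoints directory in the listing, which matches the earlier ClassCastException. A sketch using the paths from the listing (adjust the names to your actual run, and see the fkmeans re-run sketched further up):

# after re-running fkmeans with -cl, the output directory gains a clusteredPoints subdirectory
./bin/mahout clusterdump -i ../output/fetise/clusters/clusters-1-final \
  -p ../output/fetise/clusters/clusteredPoints \
  -o ../output/fetise/clusterdump \
  -d ../output/fetise/luceneDictionary

# sanity check: clusteredPoints should report IntWritable keys, unlike part-randomSeed
# (point -i at one of its part files if your seqdumper build expects a single file)
./bin/mahout seqdumper -i ../output/fetise/clusters/clusteredPoints | more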