RE: Mahout parallel K-Means - algorithms analysis

2014-03-18 Thread hiroshi leon
Thank you Wei and Suneel, 

By the way, does somebody know whether the parallel K-Means in Mahout uses 
Canopy clustering at the beginning to generate the initial k clusters in the K-Means 
driver class?

Best regards,

Hiroshi

 Date: Mon, 17 Mar 2014 13:05:01 -0700
 Subject: Re: Mahout parallel K-Means - algorithms analysis
 From: weish...@gmail.com
 To: user@mahout.apache.org
 CC: ted.dunn...@gmail.com
 
 You could take a look
 at org.apache.mahout.clustering.classify/ClusterClassificationMapper
 
 Enjoy,
 Wei Shung
 
 
  On Sat, Mar 15, 2014 at 2:51 PM, Suneel Marthi suneel_mar...@yahoo.com wrote:
  
   The clustering code is in CIMapper and CIReducer. Following the clustering,
   there is cluster classification, which is mapper-only.
  
   Not sure about the reference paper; this stuff has been around for a long time, but
   the documentation for k-means on mahout.apache.org should explain the
   approach.
 
  Sent from my iPhone
 
   On Mar 15, 2014, at 5:36 PM, hiroshi leon hiroshi_8...@hotmail.com
  wrote:
  
   Hello Ted,
  
    Thank you so much for your reply. The code I was checking is the
   KMeansDriver class with its run function, the buildClusters function in the
   same class, and from there the ClusterIterator class with its iterateMR function.
   
    I would like to know where I can check the code that is implemented for the
   mapper and the reducer. Is it in CIMapper.class and CIReducer.class?
  
    Is there a research paper or pseudo-code on which Mahout's parallel
   K-Means was based?
  
   Thank you so much and have a nice day.
  
   Best regards
  
  
   From: ted.dunn...@gmail.com
   Date: Sat, 15 Mar 2014 13:56:56 -0700
   Subject: Re: Mahout parallel K-Means - algorithms analysis
   To: user@mahout.apache.org
  
   We would love to help.
  
   Can you say which program and which classes you are looking at?
  
  
    On Sat, Mar 15, 2014 at 12:58 PM, hiroshi leon 
   hiroshi_8...@hotmail.com wrote:
  
   To whom it may correspond,
  
    Hello, I have been checking the k-means algorithm of Mahout 0.9
    using MapReduce, and I would like to know where I can check the code for
    what is happening inside the map function and in the reducer.
  
  
   I was debugging using NetBeans and I was not able to find what is
  exactly
   implemented in the Map and Reduce functions...
  
  
  
    The reason I am doing this is that I would like to know exactly what
    is implemented in Mahout 0.9, in order to see
    which parts of the K-Means MapReduce algorithm were optimized.
  
  
  
    Do you know which research paper the Mahout K-Means was based on, or
   where
    I can read the pseudo-code?
  
  
  
   Thank you so much!
  
  
  
   Best regards!
  
   Hiroshi
  
 
  

Command line vector to sequence file

2014-03-18 Thread Margusja

Hi

I am looking for a simple command-line way to convert vectors to a 
sequence file.

For example, I have a data.txt file containing vectors:
1,1
2,1
1,2
2,2
3,3
8,8
8,9
9,8
9,9

So is there a command-line way to convert that into a sequence file?

I tried mahout seqdirectory, but afterwards hdfs dfs -text 
output2/part-m-0 gives me something like:

/data.txt1,1
2,1
1,2
2,2
3,3
8,8
8,9
9,8
9,9

and that is not sequence file format, as I understand it.

I know there is a Java API, but I am looking for a command-line option.


--
Best regards, Margus (Margusja) Roo
+372 51 48 780
http://margus.roo.ee
http://ee.linkedin.com/in/margusroo
skype: margusja
ldapsearch -x -h ldap.sk.ee -b c=EE (serialNumber=37303140314)
-BEGIN PUBLIC KEY-
MIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQKBgQCvbeg7LwEC2SCpAEewwpC3ajxE
5ZsRMCB77L8bae9G7TslgLkoIzo9yOjPdx2NN6DllKbV65UjTay43uUDyql9g3tl
RhiJIcoAExkSTykWqAIPR88LfilLy1JlQ+0RD8OXiWOVVQfhOHpQ0R/jcAkM2lZa
BjM8j36yJvoBVsfOHQIDAQAB
-END PUBLIC KEY-



Re: Command line vector to sequence file

2014-03-18 Thread Margusja

Thank you, I am going to try it.

Best regards, Margus (Margusja) Roo
+372 51 48 780
http://margus.roo.ee
http://ee.linkedin.com/in/margusroo
skype: margusja
ldapsearch -x -h ldap.sk.ee -b c=EE (serialNumber=37303140314)
-BEGIN PUBLIC KEY-
MIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQKBgQCvbeg7LwEC2SCpAEewwpC3ajxE
5ZsRMCB77L8bae9G7TslgLkoIzo9yOjPdx2NN6DllKbV65UjTay43uUDyql9g3tl
RhiJIcoAExkSTykWqAIPR88LfilLy1JlQ+0RD8OXiWOVVQfhOHpQ0R/jcAkM2lZa
BjM8j36yJvoBVsfOHQIDAQAB
-END PUBLIC KEY-

On 18/03/14 10:58, Kevin Moulart wrote:

Hi,

I did the same search a few weeks back and found that there is nothing in
the current API to do that from the command line.

However, I did write a Java program that transforms a CSV into a
SequenceFile, which can be used to train a Naive Bayes classifier (amongst other
things).

Here are the sources:
https://gist.github.com/kmoulart/9616125

You'll find all you need to build a runnable jar with dependencies and a
proper command line (using JCommander).
Both the sequential version and the MapReduce one are in the given files.

If you're lazy, I'll put the whole maven project on my github later today.

Hope it helps you

Kévin Moulart
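
For reference, a minimal sequential sketch of that kind of conversion could look
like the snippet below. It is only an illustration (the class name, input file and
output path are made up, and it is not taken from the gist above); it reads
comma-separated lines and writes a SequenceFile<Text, VectorWritable> that the
clustering jobs can consume:

import java.io.BufferedReader;
import java.io.FileReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.VectorWritable;

public class CsvToVectorSeqFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path output = new Path("vectors/part-m-00000");   // illustrative output path

    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, output, Text.class, VectorWritable.class);
    BufferedReader reader = new BufferedReader(new FileReader("data.txt"));
    try {
      String line;
      int i = 0;
      while ((line = reader.readLine()) != null) {
        String[] parts = line.split(",");
        double[] values = new double[parts.length];
        for (int j = 0; j < parts.length; j++) {
          values[j] = Double.parseDouble(parts[j].trim());
        }
        // key is an arbitrary row id; value is the dense vector for this row
        writer.append(new Text(String.valueOf(i++)),
                      new VectorWritable(new DenseVector(values)));
      }
    } finally {
      reader.close();
      writer.close();
    }
  }
}

A file written this way can then be passed to the clustering drivers as input vectors.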


2014-03-18 9:41 GMT+01:00 Margusja mar...@roo.ee:


Hi

I am looking a simple way in a command line how to convert vector to
sequence file.
in example I have data.txt file contains vectors.
1,1
2,1
1,2
2,2
3,3
8,8
8,9
9,8
9,9

So is there command line possibility to convert that into sequence file?

I tried mahout seqdirectory but after it  hdfs dfs -text
output2/part-m-0 gives me something like:
/data.txt1,1
2,1
1,2
2,2
3,3
8,8
8,9
9,8
9,9

and that is not sequence file format as I understand.

I know there are java API but I am looking command line.


--
Best regards, Margus (Margusja) Roo
+372 51 48 780
http://margus.roo.ee
http://ee.linkedin.com/in/margusroo
skype: margusja
ldapsearch -x -h ldap.sk.ee -b c=EE (serialNumber=37303140314)
-BEGIN PUBLIC KEY-
MIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQKBgQCvbeg7LwEC2SCpAEewwpC3ajxE
5ZsRMCB77L8bae9G7TslgLkoIzo9yOjPdx2NN6DllKbV65UjTay43uUDyql9g3tl
RhiJIcoAExkSTykWqAIPR88LfilLy1JlQ+0RD8OXiWOVVQfhOHpQ0R/jcAkM2lZa
BjM8j36yJvoBVsfOHQIDAQAB
-END PUBLIC KEY-






Re: Command line vector to sequence file

2014-03-18 Thread Kevin Moulart
You're welcome !

Here's the repository if need be :
https://github.com/kmoulart/hadoop_mahout_utils



Kévin Moulart


2014-03-18 10:00 GMT+01:00 Margusja mar...@roo.ee:

 Thank you, I am going to try it.


 Best regards, Margus (Margusja) Roo
 +372 51 48 780
 http://margus.roo.ee
 http://ee.linkedin.com/in/margusroo
 skype: margusja
 ldapsearch -x -h ldap.sk.ee -b c=EE (serialNumber=37303140314)
 -BEGIN PUBLIC KEY-
 MIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQKBgQCvbeg7LwEC2SCpAEewwpC3ajxE
 5ZsRMCB77L8bae9G7TslgLkoIzo9yOjPdx2NN6DllKbV65UjTay43uUDyql9g3tl
 RhiJIcoAExkSTykWqAIPR88LfilLy1JlQ+0RD8OXiWOVVQfhOHpQ0R/jcAkM2lZa
 BjM8j36yJvoBVsfOHQIDAQAB
 -END PUBLIC KEY-

 On 18/03/14 10:58, Kevin Moulart wrote:

 Hi,

 I did the same search a few weeks back and found that there is nothing in
 the current API to do that from command line.

 However I did write a java program that transforms a csv into a
 SequenceFile which can be used to train a naive bayes (amongst other
 things).

 Here are the sources :
 https://gist.github.com/kmoulart/9616125

 You'll find all you need to make a jar with dependecies running and with a
 proper command line (using JCommander).
 Both the sequential version and the MapReduce one are in the given files.

 If you're lazy, I'll put the whole maven project on my github later today.

 Hope it helps you

 Kévin Moulart


 2014-03-18 9:41 GMT+01:00 Margusja mar...@roo.ee:

  Hi

 I am looking a simple way in a command line how to convert vector to
 sequence file.
 in example I have data.txt file contains vectors.
 1,1
 2,1
 1,2
 2,2
 3,3
 8,8
 8,9
 9,8
 9,9

 So is there command line possibility to convert that into sequence file?

 I tried mahout seqdirectory but after it  hdfs dfs -text
 output2/part-m-0 gives me something like:
 /data.txt1,1
 2,1
 1,2
 2,2
 3,3
 8,8
 8,9
 9,8
 9,9

 and that is not sequence file format as I understand.

 I know there are java API but I am looking command line.


 --
 Best regards, Margus (Margusja) Roo
 +372 51 48 780
 http://margus.roo.ee
 http://ee.linkedin.com/in/margusroo
 skype: margusja
 ldapsearch -x -h ldap.sk.ee -b c=EE (serialNumber=37303140314)
 -BEGIN PUBLIC KEY-
 MIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQKBgQCvbeg7LwEC2SCpAEewwpC3ajxE
 5ZsRMCB77L8bae9G7TslgLkoIzo9yOjPdx2NN6DllKbV65UjTay43uUDyql9g3tl
 RhiJIcoAExkSTykWqAIPR88LfilLy1JlQ+0RD8OXiWOVVQfhOHpQ0R/jcAkM2lZa
 BjM8j36yJvoBVsfOHQIDAQAB
 -END PUBLIC KEY-






Re: Mahout parallel K-Means - algorithms analysis

2014-03-18 Thread Suneel Marthi
Canopy and KMeans run independently and do not call each other.

For KMeans, the k value has to be specified when invoking KMeans.

Typically you run Canopy first and then invoke KMeans with the appropriate 
k value as inferred from Canopy.
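
As a rough sketch, that workflow driven from Java would look something like the
snippet below. The paths are made up and the run(...) argument lists are as I
recall them from the 0.9 drivers, so check the CanopyDriver/KMeansDriver javadoc
for your release before relying on it:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.clustering.canopy.CanopyDriver;
import org.apache.mahout.clustering.kmeans.KMeansDriver;
import org.apache.mahout.common.distance.EuclideanDistanceMeasure;

public class CanopyThenKMeans {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path vectors = new Path("vectors");           // input vector sequence files
    Path canopies = new Path("canopy-centroids"); // canopy output = initial clusters
    Path output = new Path("kmeans-output");

    // Step 1: Canopy decides how many initial centroids there are and roughly where.
    CanopyDriver.run(conf, vectors, canopies, new EuclideanDistanceMeasure(),
        3.0, 1.5, false, 0.0, false);

    // Step 2: KMeans refines those centroids; k is implied by the canopy output
    // (the exact sub-directory name of the canopy clusters may differ by release).
    KMeansDriver.run(conf, vectors, new Path(canopies, "clusters-0-final"),
        output, 0.01, 10, true, 0.0, false);
  }
}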







On Tuesday, March 18, 2014 4:33 AM, hiroshi leon hiroshi_8...@hotmail.com 
wrote:
 
Thank you Wei and Suneel, 

By the way, does somebody know if the Parallel K-means of Mahout is using 
Canopy clustering at the beginning to generate the initial K in the K-Means 
driver class?

Best regards,

Hiroshi

 Date: Mon, 17 Mar 2014 13:05:01 -0700
 Subject: Re: Mahout parallel K-Means - algorithms analysis
 From: weish...@gmail.com
 To: user@mahout.apache.org
 CC: ted.dunn...@gmail.com
 
 You could take a look
 at org.apache.mahout.clustering.classify/ClusterClassificationMapper
 
 Enjoy,
 Wei Shung
 
 
 On Sat, Mar 15, 2014 at 2:51 PM, Suneel Marthi suneel_mar...@yahoo.comwrote:
 
   The clustering code is in CIMapper and CIReducer. Following the clustering,
   there is cluster classification, which is mapper-only.
  
   Not sure about the reference paper; this stuff has been around for a long time, but
   the documentation for k-means on mahout.apache.org should explain the
   approach.
 
  Sent from my iPhone
 
   On Mar 15, 2014, at 5:36 PM, hiroshi leon hiroshi_8...@hotmail.com
  wrote:
  
   Hello Ted,
  
   Thank you so much for your reply, the program that I was checking is the
  KMeansDriver class with the run function,
   the buildCluster function in the same class and following the
  ClusterIterator class with
   the iterateMR function.
  
    I would like to know where I can check the code that is implemented
   for the mapper and the
    reducer. Is it in CIMapper.class and CIReducer.class?
  
   Is there a research paper or pseudo-code in which Mahout parallel
  K-means was based on?
  
   Thank you so much and have a nice day.
  
   Best regards
  
  
   From: ted.dunn...@gmail.com
   Date: Sat, 15 Mar 2014 13:56:56 -0700
   Subject: Re: Mahout parallel K-Means - algorithms analysis
   To: user@mahout.apache.org
  
   We would love to help.
  
   Can you say which program and which classes you are looking at?
  
  
   On Sat, Mar 15, 2014 at 12:58 PM, hiroshi leon 
  hiroshi_8...@hotmail.comwrote:
  
   To whom it may correspond,
  
   Hello, I have been checking the algorithm of Mahout 0.9 version k-means
   using MapReduce and I would like to know where can I check the code of
   what is happening inside the map function and in the reducer?
  
  
   I was debugging using NetBeans and I was not able to find what is
  exactly
   implemented in the Map and Reduce functions...
  
  
  
   The reason what I am doing this is because I would like to know what
   is exactly implemented in the version of Mahout 0.9 in order to see
   which parts where optimized on the K-Means mapReduce algorithm.
  
  
  
   Do you know  which research paper the Mahout K-means was based on or
  where
   can I read the pseudo code?
  
  
  
   Thank you so much!
  
  
  
   Best regards!
  
   Hiroshi
  
 

RE: Mahout parallel K-Means - algorithms analysis

2014-03-18 Thread hiroshi leon
Thanks Suneel,

Can someone please explain a little bit about the ClusteringPolicy and the 
ClusterClassifier, and what the benefits are when using them with parallel K-Means?

Thank you so much,

Best regards.

 Date: Tue, 18 Mar 2014 04:35:14 -0700
 From: suneel_mar...@yahoo.com
 Subject: Re: Mahout parallel K-Means - algorithms analysis
 To: user@mahout.apache.org
 
 Canopy and KMeans run independently and do not call each other.
 
 For KMeans, the k value has to be specified when invoking KMeans.
 
 Typically you run Canopy first and then invoke KMeans with the appropriate 
 k value as inferred from Canopy.
 
 
 
 
 
 
 
 On Tuesday, March 18, 2014 4:33 AM, hiroshi leon hiroshi_8...@hotmail.com 
 wrote:
  
 Thank you Wei and Suneel, 
 
 By the way, does somebody know if the Parallel K-means of Mahout is using 
 Canopy clustering at the beginning to generate the initial K in the K-Means 
 driver class?
 
 Best regards,
 
 Hiroshi
 
  Date: Mon, 17 Mar 2014 13:05:01 -0700
  Subject: Re: Mahout parallel K-Means - algorithms analysis
  From: weish...@gmail.com
  To: user@mahout.apache.org
  CC: ted.dunn...@gmail.com
  
  You could take a look
  at org.apache.mahout.clustering.classify/ClusterClassificationMapper
  
  Enjoy,
  Wei Shung
  
  
  On Sat, Mar 15, 2014 at 2:51 PM, Suneel Marthi 
  suneel_mar...@yahoo.comwrote:
  
    The clustering code is in CIMapper and CIReducer. Following the clustering,
    there is cluster classification, which is mapper-only.
   
    Not sure about the reference paper; this stuff has been around for a long time, but
    the documentation for k-means on mahout.apache.org should explain the
    approach.
  
   Sent from my iPhone
  
On Mar 15, 2014, at 5:36 PM, hiroshi leon hiroshi_8...@hotmail.com
   wrote:
   
Hello Ted,
   
Thank you so much for your reply, the program that I was checking is the
   KMeansDriver class with the run function,
the buildCluster function in the same class and following the
   ClusterIterator class with
the iterateMR function.
   
 I would like to know where I can check the code that is implemented
    for the mapper and the
 reducer. Is it in CIMapper.class and CIReducer.class?
   
Is there a research paper or pseudo-code in which Mahout parallel
   K-means was based on?
   
Thank you so much and have a nice day.
   
Best regards
   
   
From: ted.dunn...@gmail.com
Date: Sat, 15 Mar 2014 13:56:56 -0700
Subject: Re: Mahout parallel K-Means - algorithms analysis
To: user@mahout.apache.org
   
We would love to help.
   
Can you say which program and which classes you are looking at?
   
   
On Sat, Mar 15, 2014 at 12:58 PM, hiroshi leon 
   hiroshi_8...@hotmail.comwrote:
   
To whom it may correspond,
   
Hello, I have been checking the algorithm of Mahout 0.9 version 
k-means
using MapReduce and I would like to know where can I check the code of
what is happening inside the map function and in the reducer?
   
   
I was debugging using NetBeans and I was not able to find what is
   exactly
implemented in the Map and Reduce functions...
   
   
   
The reason what I am doing this is because I would like to know what
is exactly implemented in the version of Mahout 0.9 in order to see
which parts where optimized on the K-Means mapReduce algorithm.
   
   
   
Do you know  which research paper the Mahout K-means was based on or
   where
can I read the pseudo code?
   
   
   
Thank you so much!
   
   
   
Best regards!
   
Hiroshi
   
  
  

Re: Naive Bayes classification

2014-03-18 Thread Frank Scholten
Hi Tharindu,

If I understand correctly, seqdirectory creates labels based on the file
name, but this is not what you want. What do you want the labels to be?

Cheers,

Frank


On Tue, Mar 18, 2014 at 2:22 PM, Tharindu Rusira
tharindurus...@gmail.comwrote:

 Hi everyone,
 I'm developing an application where I need to train a Naive Bayes
 classification model and use this model to classify new entities (in this
 case, text files based on their content).

 I observed that the seqdirectory command always adds the file/directory name as
 the key field for each document, which will be used as the label in
 classification jobs.
 This makes sense when I need to train a model and create the labelindex
 since I have organized my training data according to their labels in
 separate directories.

 Now I'm trying to use this model and infer the best label for an unknown
 document.
 My requirement is to ask Mahout to read my new file and output the
 predicted category by looking at the labelindex and the tf-idf vector of the
 new content.
 I tried creating vectors from the new content (seqdirectory and
 seq2sparse), and then using this vector to run the testnb command. But
 unfortunately the seqdirectory command adds file names as labels, which does
 not make sense in classification.

 The following error message will further demonstrate this behavior.
 input0.txt is the file name of my new document.

 [main] ERROR com.me.classifier.mahout.MahoutClassifier - Error while
 classifying documents
 java.lang.IllegalArgumentException: Label not found: input0.txt
 at
 com.google.common.base.Preconditions.checkArgument(Preconditions.java:125)
 at

 org.apache.mahout.classifier.ConfusionMatrix.getCount(ConfusionMatrix.java:182)
 at

 org.apache.mahout.classifier.ConfusionMatrix.incrementCount(ConfusionMatrix.java:205)
 at

 org.apache.mahout.classifier.ConfusionMatrix.incrementCount(ConfusionMatrix.java:209)
 at

 org.apache.mahout.classifier.ConfusionMatrix.addInstance(ConfusionMatrix.java:173)
 at

 org.apache.mahout.classifier.ResultAnalyzer.addInstance(ResultAnalyzer.java:70)
 at

 org.apache.mahout.classifier.naivebayes.test.TestNaiveBayesDriver.analyzeResults(TestNaiveBayesDriver.java:160)
 at

 org.apache.mahout.classifier.naivebayes.test.TestNaiveBayesDriver.run(TestNaiveBayesDriver.java:125)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at

 org.apache.mahout.classifier.naivebayes.test.TestNaiveBayesDriver.main(TestNaiveBayesDriver.java:66)


 So how can I achieve what I'm trying to do here?

 Thanks,


 --
 M.P. Tharindu Rusira Kumara

 Department of Computer Science and Engineering,
 University of Moratuwa,
 Sri Lanka.
 +94757033733
 www.tharindu-rusira.blogspot.com



Introducing PredictionIO: A developer-friendly Mahout stack for production

2014-03-18 Thread Simon Chan
Hi,

After a year of work, I would like to present PredictionIO project (
https://github.com/PredictionIO) to this community.

When a few of us were doing our PhD studies, Mahout was the de facto Java package
that we used in much of our research work. It is a very powerful algorithm
library, yet we see that something needs to be done to make it more
accessible to developers in production environments.

Therefore, we started the idea of PredictionIO, which adds a
developer-friendly REST API, a web admin UI and an integrated
infrastructure on top of Mahout. The project is still at an early stage;
Mahout's CF (collaborative filtering) algorithms are currently supported.

*REST API and SDK* in Python, Ruby, Java, PHP, Node.js etc.
Through the API layer, which supports both sync and async calls, users can:

- Record data
  A sample SDK call:
  cli.identify("John")
  cli.record_action_on_item("view", "Mahout Page 1")

- Query recommendations in real time
  A sample geo-based recommendation query:
  r = cli.get_itemrec_topn("myEngine", 5, {"pio_latlng": [37.9, 91.2]})


*Web Admin UI*
Through the UI, users can:
- conduct algorithm evaluation with metrics such as MAP@k
- deploy / switch algorithm on production
- adjust recommendation preferences, such as Freshness, Serendipity,
Unseen-only filter etc


*Integrated Infrastructure*
PredictionIO helps users link Mahout, Hadoop, the data store, the job scheduler,
etc. together. The whole stack can be installed and configured in minutes.
It takes care of a lot of production issues, such as model re-training with
new data and prediction result indexing.


We are working hard to make it extremely easy for developers to build
Machine Learning into web and apps. Hopefully, PredictionIO can get Mahout
into the hands of a wider audience.

Love to hear your feedback. If you are interested in the project, just
remember that contributors are always welcome!


Regards,
Simon


Text clustering with hashing vector encoders

2014-03-18 Thread Frank Scholten
Hi all,

Would it be possible to use hashing vector encoders for text clustering
just like when classifying?

Currently we vectorize using a dictionary where we map each token to a
fixed position in the dictionary. After the clustering we have to
retrieve the dictionary to determine the cluster labels.
This is quite a complex process where multiple outputs are read and written
in the entire clustering process.

I think it would be great if both algorithms could use the same encoding
process but I don't know if this is possible.

The problem is that we lose the mapping between token and position when
hashing. We need this mapping to determine cluster labels.

However, maybe we could make it so hashed encoders can be used and that
determining top labels is left to the user. This might be a possibility
because I noticed a problem with the current cluster labeling code. This is
what happens: first documents are vectorized with TF-IDF and clustered. Then
the labels are ranked, but again according to TF-IDF instead of TF. So it
is possible that a token becomes the top-ranked label, even though it is
rare within the cluster. The document with that token is in the cluster
because of other tokens. If the labels are determined based on a TF score
within the cluster I think you would have better labels. But this requires
a post-processing step on your original data and doing a TF count.

Thoughts?

Cheers,

Frank


Re: Text clustering with hashing vector encoders

2014-03-18 Thread Ted Dunning
Yes.  Hashing vector encoders will preserve distances when used with
multiple probes.

Interpretation becomes somewhat difficult, but there is code available to
reverse engineer labels on hashed vectors.

IDF weighting is slightly tricky, but quite doable if you keep a dictionary
of, say, the most common 50-200 thousand words and assume all others have
constant and equal frequency.
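
A small sketch of what that could look like with Mahout's feature encoders (the
dictionary, dimensionality and weights below are made-up assumptions, not a tested
recipe):

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

public class HashedTextEncoding {
  public static void main(String[] args) {
    // assumed small IDF dictionary of frequent words; everything else gets a constant weight
    Map<String, Double> idf = new HashMap<String, Double>();
    idf.put("cluster", 2.1);
    idf.put("mahout", 3.7);
    double defaultIdf = 5.0;   // assumed constant weight for out-of-dictionary words

    StaticWordValueEncoder encoder = new StaticWordValueEncoder("text");
    encoder.setProbes(2);      // multiple probes help preserve distances

    Vector doc = new RandomAccessSparseVector(20000);  // fixed hashed dimensionality
    List<String> tokens = Arrays.asList("mahout", "cluster", "hashing");
    for (String token : tokens) {
      Double w = idf.get(token);
      double weight = (w != null) ? w : defaultIdf;
      encoder.addToVector(token, weight, doc);
    }
    // 'doc' can go straight into clustering without a global dictionary pass,
    // but mapping hashed positions back to tokens for cluster labels needs extra work.
  }
}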



On Tue, Mar 18, 2014 at 2:40 PM, Frank Scholten fr...@frankscholten.nlwrote:

 Hi all,

 Would it be possible to use hashing vector encoders for text clustering
 just like when classifying?

 Currently we vectorize using a dictionary where we map each token to a
 fixed position in the dictionary. After the clustering we use have to
 retrieve the dictionary to determine the cluster labels.
 This is quite a complex process where multiple outputs are read and written
 in the entire clustering process.

 I think it would be great if both algorithms could use the same encoding
 process but I don't know if this is possible.

 The problem is that we lose the mapping between token and position when
 hashing. We need this mapping to determine cluster labels.

 However, maybe we could make it so hashed encoders can be used and that
 determining top labels is left to the user. This might be a possibility
 because I noticed a problem with the current cluster labeling code. This is
 what happens: first vectors are vectorized with TF-IDF and clustered. Then
 the labels are ranked, but again according to TF-IDF, instead of TF. So it
 is possible that a token becomes the top ranked label, even though it is
 rare within the cluster. The document with that token is in the cluster
 because of other tokens. If the labels are determined based on a TF score
 within the cluster I think you would have better labels. But this requires
 a post-processing step on your original data and doing a TF count.

 Thoughts?

 Cheers,

 Frank



Re: reduce is too slow in StreamingKmeans

2014-03-18 Thread Suneel Marthi
When dealing with Streaming KMeans, it would be helpful for troubleshooting 
purposes if you could provide the values for k (# of clusters), km (= k log n) 
and n (# of data points).
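
For instance, a quick back-of-the-envelope computation of km (note that the log here
is the natural log, which also comes up later in this thread; the k and n values are
just examples):

public class StreamingKMeansSizing {
  public static void main(String[] args) {
    int k = 200;                                 // example: desired number of clusters
    long n = 2000000L;                           // example: number of data points
    long km = (long) Math.ceil(k * Math.log(n)); // k * ln(n), roughly 2902 sketch clusters here
    System.out.println("km = " + km);
  }
}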

Try setting -Xmx to a higher heap size and run the sequential version again.

I have seen OOM errors happen during the Reduce phase while running the MR 
version; my reduce heap size was set to 2 GB and I was trying to cluster about 
2M data points, each of cardinality 100 (that's after running them through SSVD-PCA).

Speaking from my experience, either the Reducer fails with OOM 
errors or it is stuck forever at 76% (and raises alarms with Operations 
because it is not making any progress).


How big is your dataset and how long did it take for the map phase to complete? 



On Tuesday, March 18, 2014 12:54 AM, fx MA XIAOJUN xiaojun...@fujixerox.co.jp 
wrote:
 
As mahout streamingkmeans has no problems in sequential mode, 
I would like to try sequential mode.
However, a java.lang.OutOfMemoryError occurs.

I wonder where to set the JVM heap size for sequential mode?
Is it the same as for mapreduce mode?




-Original Message-
From: fx MA XIAOJUN [mailto:xiaojun...@fujixerox.co.jp] 
Sent: Tuesday, March 18, 2014 10:50 AM
To: Suneel Marthi; user@mahout.apache.org
Subject: RE:
 reduce is too slow in StreamingKmeans

Thank you for your extremely quick reply.

 What do you mean by this? kmeans hasn't changed between 0.8 and 0.9. Did you 
 mean Streaming KMeans here?
I want to try using -rskm in streaming kmeans. 
But in mahout 0.8, if -rskm is set to true, errors occur.
I heard that the bug has been fixed in 0.9, so I upgraded from 0.8 to 0.9.


The Hadoop I installed is CDH5-MRv1, corresponding to Hadoop 0.20, not Hadoop 
2.x (YARN).
CDH5-MRv1 has a compatible version of Mahout (mahout-0.8+cdh5.0.0b2+28) which is 
compiled by Cloudera.
So I uninstalled mahout-0.8+cdh5.0.0b2+28 and installed the Apache Mahout 0.9 
distribution.
It turned out that Mahout kmeans runs very well on MapReduce.
However, Mahout streamingkmeans runs properly in sequential mode, but fails in mapreduce mode.

If it were a problem of incompatibility between Hadoop and Mahout, I don't 
think mahout kmeans could run properly.

Is Mahout 0.9 compatible with Hadoop 0.20?





-Original Message-
From: Suneel Marthi [mailto:suneel_mar...@yahoo.com] 
Sent: Monday, March 17, 2014 6:21 PM
To: fx MA XIAOJUN; user@mahout.apache.org
Subject: Re: reduce is too slow in StreamingKmeans





On Monday, March 17, 2014 3:43 AM, fx MA XIAOJUN xiaojun...@fujixerox.co.jp 
wrote:

Thank you for your quick reply.

As to -km, I thought it was log10 instead of ln. I was wrong...
This time I set -km 14 and ran mahout streamingkmeans again (CDH 5.0 MRv1, 
Mahout 0.8). The maps ran faster than before, but the reduce was still stuck at 
76% forever.

 This has been my experience too both with 0.8 and 0.9. 

So, I uninstalled mahout 0.8, and installed mahout 0.9 in order to use -rskm 
option.

Mahout kmeans can be executed properly, so I think the installation of mahout 
0.9 is successful.

 What do you mean by
 this? kmeans hasn't changed between 0.8 and 0.9. Did you mean Streaming KMeans 
here?

However, when executing mahout streamingkmeans, I got the following errors.
The Hadoop I installed is CDH5-beta1, MapReduce version 1.

Exception in thread "main" java.lang.IncompatibleClassChangeError: Found 
interface org.apache.hadoop.mapreduce.JobContext, but class was expected
    at org.apache.mahout.common.HadoopUtil.getCustomJobName(HadoopUtil.java:174)
    at 
org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.runMapReduce(StreamingKMeansDriver.java:464)
    at 
org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.run(StreamingKMeansDriver.java:419)
    at 
org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.run(StreamingKMeansDriver.java:240)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
    at 
org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.main(StreamingKMeansDriver.java:491)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at 
org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
    at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:152)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at 

Re: Naive Bayes classification

2014-03-18 Thread Suneel Marthi
Tharindu,

If I understand what you are trying to do:

a) You have a trained Bayes model.
b) You would like to classify new documents using this trained model.
c) You were trying to use TestNaiveBayesDriver to classify the documents in (b).

Option 1:
---

You could write a custom MapReduce job that creates sequence files from the 
documents (without the label key). You could feed these sequence files to 
seq2sparse to generate your vectors, then call TestNaiveBayes with this input. Let 
me know if you need code for the earlier part.
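
For illustration, a minimal sequential sketch of that earlier part could be something
like this (assuming plain-text documents in a local directory; the class name, paths
and key format are made up):

import java.io.File;
import java.nio.charset.Charset;
import java.nio.file.Files;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class DocsToSeqFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path output = new Path("unlabeled-docs/chunk-0");   // illustrative output path

    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, output, Text.class, Text.class);
    try {
      for (File doc : new File("new-docs").listFiles()) {
        String content = new String(
            Files.readAllBytes(doc.toPath()), Charset.forName("UTF-8"));
        // the key is just a document id here, not a class label;
        // seq2sparse builds its vectors from the text value
        writer.append(new Text("/" + doc.getName()), new Text(content));
      }
    } finally {
      writer.close();
    }
  }
}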


Option 2:
---
Work with your existing tf-idf vectors generated from seqdirectory -> 
seq2sparse. But instead of invoking Mahout TestNaiveBayes, create a custom 
MapReduce job (or a plain Java program, if that's fine with you) that basically 
does the following:

a) Instantiate a classifier with the trained model (pseudo-code below):

 NaiveBayesModel naiveBayesModel = NaiveBayesModel.materialize(new 
Path(outputDir.getAbsolutePath()), conf);

 AbstractVectorClassifier classifier = new 
StandardNaiveBayesClassifier(naiveBayesModel);

// Parse through the input tf-idf vectors <Text, VectorWritable> and feed them
// to the classifier

for (Pair<Text,VectorWritable> vector : new 
SequenceFileDirIterable<Text,VectorWritable>(getInputPath(), PathType.LIST, 
PathFilters.logsCRCFilter(), null, true, conf)) {
    // invoke the classifier on the incoming vector
 Vector result = classifier.classifyFull(vector.getSecond().get());
 context.write(vector.getFirst(), new VectorWritable(result));
}

You can have the above code as part of a mapper in an MR job.









On Tuesday, March 18, 2014 5:49 PM, Kevin Moulart kevinmoul...@gmail.com 
wrote:
 
To use Naive Bayes you need a SequenceFile<Text, VectorWritable> with the
key formatted so that it contains the label; for some reason (I checked with the
sources to be sure) it parses the key looking for a '/'.

When you used seqdirectory, it told Naive Bayes to classify the content of
each file (e.g. file1.txt) with the label corresponding to its name (here,
file1.txt). So when you tried testing with input0.txt it failed because
input0.txt was not a valid label.

I designed a MapReduce Java job that transforms a CSV with numeric values
into a proper SequenceFile; if you want, you can take the source and update
it to suit your needs: https://github.com/kmoulart/hadoop_mahout_utils

Good luck.

Kévin Moulart



2014-03-18 20:13 GMT+01:00 Frank Scholten fr...@frankscholten.nl:

 Hi Tharindu,

 If I understand correctly seqdirectory creates labels based on the file
 name but this is not what you want. What do you want the labels to be?

 Cheers,

 Frank


 On Tue, Mar 18, 2014 at 2:22 PM, Tharindu Rusira
 tharindurus...@gmail.comwrote:

  Hi everyone,
  I'm developing an application where I need to train a Naive Bayes
  classification model and use this model to classify new entities(In this
  case text files based on their content)
 
  I observed that seqdirectory command always adds the file/directory name
 as
  the key field for each document which will be used as the label in
  classification jobs.
  This makes sense when I need to train a model and create the labelindex
  since I have organized my training data according to their labels in
  separate
 directories.
 
  Now I'm trying to use this model and infer the best label for an unknown
  document.
  My requirement is to ask Mahout to read my new file and output the
  predicted category by looking at the labelindex and the tfidf vector of
 the
  new content.
  I tried creating vectors from the new content (seqdirectory and
  seq2sparse), and then using this vector to run testnb command. But
  unfortunately seqdirectory commands adds file names as labels which does
  not make sense in classification.
 
  The following error message will further demonstrate this behavior.
  input0.txt is the file name of my new document.
 
  [main] ERROR com.me.classifier.mahout.MahoutClassifier - Error while
  classifying documents
  java.lang.IllegalArgumentException: Label not found: input0.txt
      at
 
 com.google.common.base.Preconditions.checkArgument(Preconditions.java:125)
      at
 
 
 org.apache.mahout.classifier.ConfusionMatrix.getCount(ConfusionMatrix.java:182)
      at
 
 
 org.apache.mahout.classifier.ConfusionMatrix.incrementCount(ConfusionMatrix.java:205)
      at
 
 

 
org.apache.mahout.classifier.ConfusionMatrix.incrementCount(ConfusionMatrix.java:209)
      at
 
 
 org.apache.mahout.classifier.ConfusionMatrix.addInstance(ConfusionMatrix.java:173)
      at
 
 
 org.apache.mahout.classifier.ResultAnalyzer.addInstance(ResultAnalyzer.java:70)
      at
 
 
 org.apache.mahout.classifier.naivebayes.test.TestNaiveBayesDriver.analyzeResults(TestNaiveBayesDriver.java:160)
      at
 
 
 org.apache.mahout.classifier.naivebayes.test.TestNaiveBayesDriver.run(TestNaiveBayesDriver.java:125)
     
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
      at
 
 
 

clusterdump samplePoints parameter

2014-03-18 Thread Terry Blankers

Hi all,

Can someone please answer a quick question about the --samplePoints 
parameter in the clusterdump utility? I understand it specifies the 
number of points returned per cluster. But are the points per cluster 
ordered or ranked in any way before this truncation occurs?


Thanks,

Terry


Re: clusterdump samplePoints parameter

2014-03-18 Thread Suneel Marthi
It's the max number of points to include from each cluster in the clusterdump. If 
not specified, all points will be included.





On Tuesday, March 18, 2014 11:25 PM, Terry Blankers te...@amritanet.com wrote:
 
Hi all,

Can someone please answer a quick question about the --samplePoints 
parameter in the clusterdump utility? I understand it specifies the 
number of points returned per cluster. But are the points per cluster 
ordered or ranked in any way before this truncation occurs?

Thanks,

Terry

Re: Text clustering with hashing vector encoders

2014-03-18 Thread Suneel Marthi
+1 to this. We could then use Hamming distance to compute the distances between 
hashed vectors.

We have the code for HashedVector.java, based on Moses Charikar's SimHash paper.







On Tuesday, March 18, 2014 7:14 PM, Ted Dunning ted.dunn...@gmail.com wrote:
 
Yes.  Hashing vector encoders will preserve distances when used with
multiple probes.

Interpretation becomes somewhat difficult, but there is code available to
reverse engineer labels on hashed vectors.

IDF weighting is slightly tricky, but quite doable if you keep a dictionary
of, say, the most common 50-200 thousand words and assume all others have
constant and equal frequency.




On Tue, Mar 18, 2014 at 2:40 PM, Frank Scholten fr...@frankscholten.nlwrote:

 Hi all,

 Would it be possible to use hashing vector encoders for text clustering
 just like when classifying?

 Currently we vectorize using a dictionary where we map each token to a
 fixed position in the dictionary. After the clustering we use have to
 retrieve the dictionary to determine the cluster labels.
 This is quite a complex process where multiple outputs are read and written
 in the entire clustering process.

 I think it would be great if both algorithms could use the same encoding
 process but I don't know if this is possible.

 The problem is that we lose the mapping between token and position when
 hashing. We need this mapping to determine cluster labels.

 However, maybe we could make it so hashed encoders can be used and that
 determining top labels is left to the user. This might be a possibility
 because I noticed a problem with the current cluster labeling code. This is
 what happens: first vectors are vectorized with TF-IDF and clustered. Then
 the labels are ranked, but again according to TF-IDF, instead of TF. So it
 is possible that a token becomes the top ranked label, even though it is
 rare within the cluster. The document with that token is in the cluster
 because of other tokens. If the labels are determined based on a TF score
 within the cluster I think you would have better labels. But this requires
 a post-processing step on your original data and doing a TF count.

 Thoughts?

 Cheers,

 Frank


Re: Text clustering with hashing vector encoders

2014-03-18 Thread Andrew Musselman
How does using multiple probes affect distance preservation, and how would
IDF weighting get tricky just by hashing strings?

Would we be computing distance between hashed strings, or distance between
vectors based on counts of hashed strings?


On Tue, Mar 18, 2014 at 8:50 PM, Suneel Marthi suneel_mar...@yahoo.comwrote:

 +1 to this. We could then use Hamming Distance to compute the distances
 between Hashed Vectors.

 We have  the code for HashedVector.java based on Moses Charikar's SimHash
 paper.







 On Tuesday, March 18, 2014 7:14 PM, Ted Dunning ted.dunn...@gmail.com
 wrote:

 Yes.  Hashing vector encoders will preserve distances when used with
 multiple probes.

 Interpretation becomes somewhat difficult, but there is code available to
 reverse engineer labels on hashed vectors.

 IDF weighting is slightly tricky, but quite doable if you keep a dictionary
 of, say, the most common 50-200 thousand words and assume all others have
 constant and equal frequency.




 On Tue, Mar 18, 2014 at 2:40 PM, Frank Scholten fr...@frankscholten.nl
 wrote:

  Hi all,
 
  Would it be possible to use hashing vector encoders for text clustering
  just like when classifying?
 
  Currently we vectorize using a dictionary where we map each token to a
  fixed position in the dictionary. After the clustering we use have to
  retrieve the dictionary to determine the cluster labels.
  This is quite a complex process where multiple outputs are read and
 written
  in the entire clustering process.
 
  I think it would be great if both algorithms could use the same encoding
  process but I don't know if this is possible.
 
  The problem is that we lose the mapping between token and position when
  hashing. We need this mapping to determine cluster labels.
 
  However, maybe we could make it so hashed encoders can be used and that
  determining top labels is left to the user. This might be a possibility
  because I noticed a problem with the current cluster labeling code. This
 is
  what happens: first vectors are vectorized with TF-IDF and clustered.
 Then
  the labels are ranked, but again according to TF-IDF, instead of TF. So
 it
  is possible that a token becomes the top ranked label, even though it is
  rare within the cluster. The document with that token is in the cluster
  because of other tokens. If the labels are determined based on a TF score
  within the cluster I think you would have better labels. But this
 requires
  a post-processing step on your original data and doing a TF count.
 
  Thoughts?
 
  Cheers,
 
  Frank
 



Re: Naive Bayes classification

2014-03-18 Thread Tharindu Rusira
Hi, first of all I'm sorry that my previous mail was vague and poorly
formulated.
Yes, Suneel got exactly what I was asking. Both options will address my
requirement.
Thanks a lot.
-Tharindu
On Mar 19, 2014 8:51 AM, Suneel Marthi suneel_mar...@yahoo.com wrote:

 Tharindu,

 If I understand what you are trying to do:

 a) You have a trained Bayes model.
 b) You would like to classify new documents using this trained model.
 c) You were trying to use TestNaiveBayesDriver to classify the documents
 in (b).

 Option 1:
 ---

 You could write a custom MapReduce job that creates sequence files from
 the documents (without the label key). You could feed these sequencefiles
 to seq2sparse to generate ur vectors - call TestNAiveBayes with this
 input. Let me know if u need code for the earlier part.


 Option 2:
 ---
 Work with your existing tf-idf vectors generated from seqdirectory -
 seq2sparse.  But instead of invoking Mahout TestNaiveBayes, create a custom
 MapReduce job (or a plain java program if that's fine with u) that
 basically does the following:

 a) Instantiate a classifier with trained model: (Pseudo code below)

  NaiveBayesModel naiveBayesModel = NaiveBayesModel.materialize(new
 Path(outputDir.getAbsolutePath()), conf);

  AbstractVectorClassifier classifier = new
 StandardNaiveBayesClassifier(naiveBayesModel);

  // Parse through the input tf-idf vectors <Text, VectorWritable> and feed
  // them to the classifier

  for (Pair<Text,VectorWritable> vector : new
  SequenceFileDirIterable<Text,VectorWritable>(getInputPath(), PathType.LIST,
  PathFilters.logsCRCFilter(), null, true, conf)) {
  // invoke the classifier on the incoming vector
   Vector result = classifier.classifyFull(vector.getSecond().get());
   context.write(vector.getFirst(), new VectorWritable(result));
  }

 You can have the above code as part of a mapper in an MR job.









 On Tuesday, March 18, 2014 5:49 PM, Kevin Moulart kevinmoul...@gmail.com
 wrote:

  To use Naive Bayes you need a SequenceFile<Text, VectorWritable> with the
  key formatted so that it contains the label; for some reason (I checked with the
  sources to be sure) it parses the key looking for a '/'.

 When y used seqdirectory, it told Naive Bayes to classify the content of
 each file (ex : file1.txt) with the label corresponding to its name (here,
 file1.txt). So when you tried testing with input0.txt it failed because
 input0.txt was not a valid label.

 I designed a MapReduce java job that transforms a csv with numeric values
 into a proper SequenceFile, if you want you can take the source and update
 if to suit your need : https://github.com/kmoulart/hadoop_mahout_utils

 Good luck.

 Kévin Moulart



 2014-03-18 20:13 GMT+01:00 Frank Scholten fr...@frankscholten.nl:

  Hi Tharindu,
 
  If I understand correctly seqdirectory creates labels based on the file
  name but this is not what you want. What do you want the labels to be?
 
  Cheers,
 
  Frank
 
 
  On Tue, Mar 18, 2014 at 2:22 PM, Tharindu Rusira
  tharindurus...@gmail.comwrote:
 
   Hi everyone,
   I'm developing an application where I need to train a Naive Bayes
   classification model and use this model to classify new entities(In
 this
   case text files based on their content)
  
   I observed that seqdirectory command always adds the file/directory
 name
  as
   the key field for each document which will be used as the label in
   classification jobs.
   This makes sense when I need to train a model and create the labelindex
   since I have organized my training data according to their labels in
   separate
  directories.
  
   Now I'm trying to use this model and infer the best label for an
 unknown
   document.
   My requirement is to ask Mahout to read my new file and output the
   predicted category by looking at the labelindex and the tfidf vector of
  the
   new content.
   I tried creating vectors from the new content (seqdirectory and
   seq2sparse), and then using this vector to run testnb command. But
   unfortunately seqdirectory commands adds file names as labels which
 does
   not make sense in classification.
  
   The following error message will further demonstrate this behavior.
    input0.txt is the file name of my new document.
  
   [main] ERROR com.me.classifier.mahout.MahoutClassifier - Error while
   classifying documents
   java.lang.IllegalArgumentException: Label not found: input0.txt
   at
  
 
 com.google.common.base.Preconditions.checkArgument(Preconditions.java:125)
   at
  
  
 
 org.apache.mahout.classifier.ConfusionMatrix.getCount(ConfusionMatrix.java:182)
   at
  
  
 
 org.apache.mahout.classifier.ConfusionMatrix.incrementCount(ConfusionMatrix.java:205)
   at
  
  
 

  
 org.apache.mahout.classifier.ConfusionMatrix.incrementCount(ConfusionMatrix.java:209)
   at
  
  
 
 org.apache.mahout.classifier.ConfusionMatrix.addInstance(ConfusionMatrix.java:173)
   at
  
  
 
 

Multiple errors and messages

2014-03-18 Thread Mahmood Naderan
Hello
When I run the following command on Mahout 0.9 and Hadoop 1.2.1, I get multiple 
errors and I cannot figure out what the problem is. Sorry for the long post.



[hadoop@solaris ~]$ mahout wikipediaDataSetCreator -i wikipedia/chunks -o 
wikipediainput -c ~/categories.txt 
Running on hadoop, using /export/home/hadoop/hadoop-1.2.1/bin/hadoop and 
HADOOP_CONF_DIR=
MAHOUT-JOB: 
/export/home/hadoop/mahout-distribution-0.9/mahout-examples-0.9-job.jar
14/03/18 20:28:28 WARN driver.MahoutDriver: No wikipediaDataSetCreator.props 
found on classpath, will use command-line arguments only
14/03/18 20:28:29 INFO wikipedia.WikipediaDatasetCreatorDriver: Input: 
wikipedia/chunks Out: wikipediainput Categories: 
/export/home/hadoop/categories.txt
14/03/18 20:28:30 WARN mapred.JobClient: Use GenericOptionsParser for parsing 
the arguments. Applications should implement Tool for the same.
14/03/18 20:28:32 INFO input.FileInputFormat: Total input paths to process : 699
14/03/18 20:28:32 WARN util.NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
14/03/18 20:28:32 WARN snappy.LoadSnappy: Snappy native library not loaded
14/03/18 20:28:37 INFO mapred.JobClient: Running job: job_201403181916_0001
14/03/18 20:28:38 INFO mapred.JobClient:  map 0% reduce 0%
14/03/18 20:41:44 INFO mapred.JobClient:  map 1% reduce 0%
14/03/18 20:52:57 INFO mapred.JobClient:  map 2% reduce 0%
14/03/18 21:04:02 INFO mapred.JobClient:  map 3% reduce 0%
14/03/18 21:15:13 INFO mapred.JobClient:  map 4% reduce 0%
14/03/18 21:26:30 INFO mapred.JobClient:  map 5% reduce 0%
14/03/18 21:29:07 INFO mapred.JobClient:  map 5% reduce 1%
14/03/18 21:34:45 INFO mapred.JobClient: Task Id : 
attempt_201403181916_0001_m_40_0, Status : FAILED
14/03/18 21:34:46 WARN mapred.JobClient: Error reading task 
outputhttp://solaris:50060/tasklog?plaintext=trueattemptid=attempt_201403181916_0001_m_40_0filter=stdout
14/03/18 21:34:46 WARN mapred.JobClient: Error reading task 
outputhttp://solaris:50060/tasklog?plaintext=trueattemptid=attempt_201403181916_0001_m_40_0filter=stderr
14/03/18 21:38:29 INFO mapred.JobClient:  map 6% reduce 1%
14/03/18 21:41:48 INFO mapred.JobClient:  map 6% reduce 2%
14/03/18 21:50:05 INFO mapred.JobClient:  map 7% reduce 2%
14/03/18 22:00:59 INFO mapred.JobClient:  map 8% reduce 2%
14/03/18 22:12:38 INFO mapred.JobClient:  map 9% reduce 2%
14/03/18 22:14:53 INFO mapred.JobClient:  map 9% reduce 3%
14/03/18 22:24:30 INFO mapred.JobClient:  map 10% reduce 3%
14/03/18 22:35:49 INFO mapred.JobClient:  map 11% reduce 3%
14/03/18 22:47:41 INFO mapred.JobClient:  map 12% reduce 3%
14/03/18 22:48:18 INFO mapred.JobClient:  map 12% reduce 4%
14/03/18 22:59:26 INFO mapred.JobClient:  map 13% reduce 4%
14/03/18 23:10:39 INFO mapred.JobClient:  map 14% reduce 4%
14/03/18 23:21:32 INFO mapred.JobClient:  map 15% reduce 4%
14/03/18 23:24:54 INFO mapred.JobClient:  map 15% reduce 5%
14/03/18 23:32:48 INFO mapred.JobClient:  map 16% reduce 5%
14/03/18 23:43:53 INFO mapred.JobClient:  map 17% reduce 5%
14/03/18 23:54:57 INFO mapred.JobClient:  map 18% reduce 5%
14/03/18 23:58:59 INFO mapred.JobClient:  map 18% reduce 6%
14/03/19 00:05:59 INFO mapred.JobClient:  map 19% reduce 6%
14/03/19 00:16:43 INFO mapred.JobClient:  map 20% reduce 6%
14/03/19 00:17:30 INFO mapred.JobClient: Task Id : 
attempt_201403181916_0001_m_000137_0, Status : FAILED
Map output lost, rescheduling: 
getMapOutput(attempt_201403181916_0001_m_000137_0,0) failed :
java.io.IOException: Error Reading IndexFile
    at 
org.apache.hadoop.mapred.IndexCache.readIndexFileToCache(IndexCache.java:113)
    at 
org.apache.hadoop.mapred.IndexCache.getIndexInformation(IndexCache.java:66)
    at 
org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:4070)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
    at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
    at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
    at 
org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:914)
    at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
    at 
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
    at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
    at 
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
    at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
    at org.mortbay.jetty.Server.handle(Server.java:326)
    at