RE: Not able to write output to local filesystem from Standalone mode.

2016-05-24 Thread Stuti Awasthi



Thanks Mathieu,
So I must have either a shared filesystem or Hadoop (HDFS) as the filesystem in order to write data from a Standalone-mode cluster setup. Thanks for your input.


Regards
Stuti Awasthi



From: Mathieu Longtin [math...@closetwork.org]
Sent: Tuesday, May 24, 2016 7:34 PM
To: Stuti Awasthi; Jacek Laskowski
Cc: user
Subject: Re: Not able to write output to local filesystem from Standalone mode.




In standalone mode, the executors assume they have access to a shared file system. The driver creates the output directory and the executors write the files, so the executors end up not writing anything since that directory does not exist on their local filesystems.
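For reference, two commonly suggested workarounds, sketched here for the spark-shell; the local file path and the HDFS URI below are only hypothetical examples:

// Workaround 1: if the result is small enough, bring it to the driver and write there
val lines = data.collect()
val out = new java.io.PrintWriter("/home/stuti/test1.txt")
try lines.foreach(out.println) finally out.close()

// Workaround 2: point saveAsTextFile at storage every node can reach,
// e.g. an NFS mount shared by master and slaves, or HDFS:
// data.saveAsTextFile("hdfs://namenode:8020/user/stuti/test1")

The first option only works when the output fits in driver memory; otherwise some form of shared or distributed storage is needed, which is exactly the point above.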


On Tue, May 24, 2016 at 8:01 AM Stuti Awasthi <stutiawas...@hcl.com> wrote:



Hi Jacek,


The parent directory is already present; it is my home directory. I am using a 64-bit Linux (Red Hat) machine.
Also, I noticed that a "test1" folder is created on my master with an empty "_temporary" subdirectory, but on the slaves no such directory is created under /home/stuti.


Thanks
Stuti 


From: Jacek Laskowski [ja...@japila.pl]
Sent: Tuesday, May 24, 2016 5:27 PM
To: Stuti Awasthi
Cc: user
Subject: Re: Not able to write output to local filesystem from Standalone mode.


Hi, 
What happens when you create the parent directory /home/stuti? I think the failure is due to missing parent directories. What's the OS?

Jacek
On 24 May 2016 11:27 a.m., "Stuti Awasthi" <stutiawas...@hcl.com> wrote:



Hi All,
I have a 3-node Spark 1.6 Standalone-mode cluster with 1 master and 2 slaves, and I am not using Hadoop as the filesystem. I am able to launch the shell, read the input file from the local filesystem, and perform transformations successfully. When I try to write my output to a local filesystem path, I receive the error below.

I searched the web and found a similar JIRA:
https://issues.apache.org/jira/browse/SPARK-2984 . Even though it is marked resolved for Spark 1.3+, people have posted that the same issue still persists in later versions.

ERROR
scala> data.saveAsTextFile("/home/stuti/test1")
16/05/24 05:03:42 WARN TaskSetManager: Lost task 1.0 in stage 1.0 (TID 2, server1): java.io.IOException: The temporary job-output directory file:/home/stuti/test1/_temporary doesn't exist!
    at org.apache.hadoop.mapred.FileOutputCommitter.getWorkPath(FileOutputCommitter.java:250)
    at org.apache.hadoop.mapred.FileOutputFormat.getTaskOutputPath(FileOutputFormat.java:244)
    at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:116)
    at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:91)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1193)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1185)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
 
What is the best way to resolve this issue if I don't want to have Hadoop installed, or is it mandatory to have Hadoop in order to write output from a Standalone-mode cluster?
 
Please suggest.
 
Thanks 
Stuti Awasthi
 
---

RE: Not able to write output to local filesystem from Standalone mode.

2016-05-24 Thread Stuti Awasthi



Hi Jacek,


The parent directory is already present; it is my home directory. I am using a 64-bit Linux (Red Hat) machine.
Also, I noticed that a "test1" folder is created on my master with an empty "_temporary" subdirectory, but on the slaves no such directory is created under /home/stuti.


Thanks
Stuti 


From: Jacek Laskowski [ja...@japila.pl]
Sent: Tuesday, May 24, 2016 5:27 PM
To: Stuti Awasthi
Cc: user
Subject: Re: Not able to write output to local filesystem from Standalone mode.




Hi, 
What happens when you create the parent directory /home/stuti? I think the failure is due to missing parent directories. What's the OS?

Jacek
On 24 May 2016 11:27 a.m., "Stuti Awasthi" <stutiawas...@hcl.com> wrote:



Hi All,
I have a 3-node Spark 1.6 Standalone-mode cluster with 1 master and 2 slaves, and I am not using Hadoop as the filesystem. I am able to launch the shell, read the input file from the local filesystem, and perform transformations successfully. When I try to write my output to a local filesystem path, I receive the error below.

I searched the web and found a similar JIRA:
https://issues.apache.org/jira/browse/SPARK-2984 . Even though it is marked resolved for Spark 1.3+, people have posted that the same issue still persists in later versions.

ERROR
scala> data.saveAsTextFile("/home/stuti/test1")
16/05/24 05:03:42 WARN TaskSetManager: Lost task 1.0 in stage 1.0 (TID 2, server1): java.io.IOException: The temporary job-output directory file:/home/stuti/test1/_temporary doesn't exist!
    at org.apache.hadoop.mapred.FileOutputCommitter.getWorkPath(FileOutputCommitter.java:250)
    at org.apache.hadoop.mapred.FileOutputFormat.getTaskOutputPath(FileOutputFormat.java:244)
    at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:116)
    at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:91)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1193)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1185)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
 
What is the best way to resolve this issue if I don't want to have Hadoop installed, or is it mandatory to have Hadoop in order to write output from a Standalone-mode cluster?
 
Please suggest.
 
Thanks 
Stuti Awasthi
 


Not able to write output to local filesystem from Standalone mode.

2016-05-24 Thread Stuti Awasthi
Hi All,
I have a 3-node Spark 1.6 Standalone-mode cluster with 1 master and 2 slaves, and I am not using Hadoop as the filesystem. I am able to launch the shell, read the input file from the local filesystem, and perform transformations successfully. When I try to write my output to a local filesystem path, I receive the error below.

I searched the web and found a similar JIRA:
https://issues.apache.org/jira/browse/SPARK-2984 . Even though it is marked resolved for Spark 1.3+, people have posted that the same issue still persists in later versions.

ERROR
scala> data.saveAsTextFile("/home/stuti/test1")
16/05/24 05:03:42 WARN TaskSetManager: Lost task 1.0 in stage 1.0 (TID 2, server1): java.io.IOException: The temporary job-output directory file:/home/stuti/test1/_temporary doesn't exist!
    at org.apache.hadoop.mapred.FileOutputCommitter.getWorkPath(FileOutputCommitter.java:250)
    at org.apache.hadoop.mapred.FileOutputFormat.getTaskOutputPath(FileOutputFormat.java:244)
    at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:116)
    at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:91)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1193)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1185)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

What is the best way to resolve this issue if I don't want to have Hadoop installed, or is it mandatory to have Hadoop in order to write output from a Standalone-mode cluster?

Please suggest.

Thanks 
Stuti Awasthi



Joins in Spark

2016-03-19 Thread Stuti Awasthi
Hi All,

I have to join 2 files, both quite small (a few MBs each), but the result can be huge, on the order of 500 GB to several TB of data. I have tried Spark's join() function, but I notice that the join executes on only 1 or 2 nodes at most. Since I have a cluster of 5 nodes, I tried passing "join(otherDataset, [numTasks])" with numTasks=10, but 9 of the tasks finish instantly and only 1 executor ends up processing all the data.

I searched on the internet and found that we can use a broadcast variable to send the data from one file to all nodes and then use a map function to do the join; that way I should be able to run multiple tasks on different executors.
Now my question is: since Spark provides the join functionality, I had assumed it would handle the data parallelism automatically. Does Spark provide something I can use directly for this join, rather than implementing a map-side join with a broadcast on my own? Any other, better approach is also welcome.

I assume this is a fairly common problem and am looking for suggestions.
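For what it's worth, a minimal map-side (broadcast) join sketch for the spark-shell, assuming two hypothetical comma-separated files small.txt and large.txt with the join key in the first field, and that the smaller side fits comfortably in driver memory:

val small = sc.textFile("small.txt").map { line => val f = line.split(","); (f(0), f(1)) }
val large = sc.textFile("large.txt").map { line => val f = line.split(","); (f(0), f(1)) }

// group the smaller side by key, pull it to the driver and broadcast it
val smallByKey = sc.broadcast(small.groupByKey().collectAsMap())

// repartition the other side so the (large) output is produced in parallel,
// then emit one record per matching pair
val joined = large.repartition(40).flatMap { case (k, v) =>
  smallByKey.value.getOrElse(k, Iterable.empty[String]).map(sv => (k, (v, sv)))
}

The partition count (40 here) is arbitrary. The built-in join(other, numTasks) also accepts a partition count, but when the key space is small or skewed the shuffle can still land almost everything on one or two reducers, which matches the behaviour described above; broadcasting the small side sidesteps that shuffle.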

Thanks 
Stuti Awasthi



RE: Launch Spark shell using different python version

2016-03-15 Thread Stuti Awasthi
Thanks Prabhu,
I tried starting in local mode but it still picks up Python 2.6 only. I have exported "DEFAULT_PYTHON" as a session variable and also included it in PATH.

Export:
export DEFAULT_PYTHON="/home/stuti/Python/bin/python2.7"
export PATH="/home/stuti/Python/bin/python2.7:$PATH"

$ pyspark --master local
Python 2.6.6 (r266:84292, Jul 23 2015, 15:22:56)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-11)] on linux2
….

Thanks 
Stuti Awasthi

From: Prabhu Joseph [mailto:prabhujose.ga...@gmail.com]
Sent: Tuesday, March 15, 2016 2:22 PM
To: Stuti Awasthi
Cc: user@spark.apache.org
Subject: Re: Launch Spark shell using different python version

Hi Stuti,
   You can try local mode, but not Spark master or YARN mode, if python 2.7 is not installed on all Spark Worker / NodeManager machines. To run in master mode:

   1. Check whether the user is able to access python2.7.
   2. Check that python 2.7 is installed on all NodeManager / Spark Worker machines and that they have been restarted.

  An executor running inside a Spark Worker is able to get the full path of python2.7, but inside a NodeManager the executor does not find python2.7 even though the script is in PATH. To make the NodeManager find the path, set the full path of python 2.7 in the pyspark script, like below:

   DEFAULT_PYTHON="/ANACONDA/anaconda2/bin/python2.7"
Thanks,
Prabhu Joseph


On Tue, Mar 15, 2016 at 11:52 AM, Stuti Awasthi <stutiawas...@hcl.com> wrote:
Hi All,

I have a CentOS cluster (without any sudo permissions) which has Python 2.6 by default. I have installed Python 2.7 for my user account and made the changes in bashrc so that Python 2.7 is picked up by default. Then I set the following properties in bashrc in order to launch the Spark shell using Python 2.7, but it is not working.

Bashrc details :
alias python='/home/stuti/Python/bin/python2.7'
alias python2='/home/stuti/Python/bin/python2.7'
export PYSPARK_PYTHON=/home/stuti/Python/bin/python2.7
export LD_LIBRARY_PATH=/home/stuti/Python/lib:$LD_LIBRARY_PATH
export PATH=$HOME/bin:$PATH

It is also worth noting that the Spark cluster is configured under a different user account, and I have not installed Python 2.7 on all the nodes in the cluster as I don't have the required permissions.
So is there any way I can launch my Spark shell using Python 2.7?
Please suggest

Thanks 
Stuti Awasthi



Launch Spark shell using different python version

2016-03-15 Thread Stuti Awasthi
Hi All,

I have a CentOS cluster (without any sudo permissions) which has Python 2.6 by default. I have installed Python 2.7 for my user account and made the changes in bashrc so that Python 2.7 is picked up by default. Then I set the following properties in bashrc in order to launch the Spark shell using Python 2.7, but it is not working.

Bashrc details :
alias python='/home/stuti/Python/bin/python2.7'
alias python2='/home/stuti/Python/bin/python2.7'
export PYSPARK_PYTHON=/home/stuti/Python/bin/python2.7
export LD_LIBRARY_PATH=/home/stuti/Python/lib:$LD_LIBRARY_PATH
export PATH=$HOME/bin:$PATH

It is also worth noting that the Spark cluster is configured under a different user account, and I have not installed Python 2.7 on all the nodes in the cluster as I don't have the required permissions.
So is there any way I can launch my Spark shell using Python 2.7?
Please suggest

Thanks 
Stuti Awasthi



Survival Curves using AFT implementation in Spark

2016-02-25 Thread Stuti Awasthi
Hi All,
I wanted to apply Survival Analysis using Spark AFT algorithm implementation. 
Now I perform the same in R using coxph model and passing the model in 
Survfit() function to generate survival curves
Then I can visualize the survival curve on validation data to understand how 
good my model fits.



R: Code

fit <- coxph(Surv(futime, fustat) ~ age, data = ovarian)

plot(survfit(fit,newdata=data.frame(age=60)))


I want to achieve something similar with Spark, so I created the AFT model using Spark and passed my test DataFrame for prediction. The result is a single prediction value per input row, as expected. But how can I now use this model to generate survival curves for visualization?

Eg: Spark Code model.transform(test_final).show()

+---------------------+-----------------+
|standardized_features|       prediction|
+---------------------+-----------------+
| [0.0,0.0,0.743853...|48.33071792204102|
+---------------------+-----------------+

Can anyone suggest how to use the trained model to plot survival curves for the "test_final" data, which is a DataFrame of feature vectors?

Thanks
Stuti Awasthi



RE: mllib:Survival Analysis : assertion failed: AFTAggregator loss sum is infinity. Error for unknown reason.

2016-02-16 Thread Stuti Awasthi
Thanks a lot Yanbo, this will really help. Since I was unaware of this, I was wondering whether my vectors were not being generated correctly. Thanks!!

Thanks 
Stuti Awasthi

From: Yanbo Liang [mailto:yblia...@gmail.com]
Sent: Wednesday, February 17, 2016 11:51 AM
To: Stuti Awasthi
Cc: user@spark.apache.org
Subject: Re: mllib:Survival Analysis : assertion failed: AFTAggregator loss sum 
is infinity. Error for unknown reason.

Hi Stuti,

The features should be standardized before training the model. Currently AFTSurvivalRegression does not support standardization. Here is a workaround for this issue, and I will send a PR to fix it soon.

import org.apache.spark.ml.feature.{StandardScaler, VectorAssembler}
import org.apache.spark.ml.regression.AFTSurvivalRegression
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

val ovarian = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true") // Use first line of all files as header
  .option("inferSchema", "true") // Automatically infer data types
  .load("..")
  .toDF("label", "censor", "age", "resid_ds", "rx", "ecog_ps")

val assembler = new VectorAssembler()
  .setInputCols(Array("age", "resid_ds", "rx", "ecog_ps"))
  .setOutputCol("features")

val ovarian2 = assembler.transform(ovarian)
  .select(col("censor").cast(DoubleType), col("label").cast(DoubleType), col("features"))

val standardScaler = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("standardized_features")
val ssModel = standardScaler.fit(ovarian2)
val ovarian3 = ssModel.transform(ovarian2)

val aft = new AFTSurvivalRegression().setFeaturesCol("standardized_features")

val model = aft.fit(ovarian3)

val newCoefficients = model.coefficients.toArray.zip(ssModel.std.toArray).map { x =>
  x._1 / x._2
}
println(newCoefficients.toSeq.mkString(","))
println(model.intercept)
println(model.scale)

Yanbo

2016-02-15 16:07 GMT+08:00 Yanbo Liang <yblia...@gmail.com>:
Hi Stuti,

This is a bug of AFTSurvivalRegression, we did not handle "lossSum == infinity" 
properly.
I have open https://issues.apache.org/jira/browse/SPARK-13322 to track this 
issue and will send a PR.
Thanks for reporting this issue.

Yanbo

2016-02-12 15:03 GMT+08:00 Stuti Awasthi <stutiawas...@hcl.com>:
Hi All,
I wanted to try survival analysis on Spark 1.6. I am able to run the provided AFT example successfully. Now I tried to train the model with the Ovarian data, which is a standard dataset that comes with the survival library in R.
Default column names: Futime, fustat, age, resid_ds, rx, ecog_ps

Here are the steps I have done :

• Loaded the data from csv to dataframe labeled as
val ovarian_data = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true") // Use first line of all files as header
  .option("inferSchema", "true") // Automatically infer data types
  .load("Ovarian.csv").toDF("label", "censor", "age", "resid_ds", "rx", 
"ecog_ps")

• Utilize the VectorAssembler() to create features from "age", 
"resid_ds", "rx", "ecog_ps" like
val assembler = new VectorAssembler()
.setInputCols(Array("age", "resid_ds", "rx", "ecog_ps"))
.setOutputCol("features")


• Then I create a new dataframe with only 3 columns:
val training = finalDf.select("label", "censor", "features")



• Finally I pass it to AFT
val model = aft.fit(training)

I get the following error:
java.lang.AssertionError: assertion failed: AFTAggregator loss sum is infinity. Error for unknown reason.
   at scala.Predef$.assert(Predef.scala:179)
   at org.apache.spark.ml.regression.AFTAggregator.add(AFTSurvivalRegression.scala:480)
   at org.apache.spark.ml.regression.AFTCostFun$$anonfun$5.apply(AFTSurvivalRegression.scala:522)
   at org.apache.spark.ml.regression.AFTCostFun$$anonfun$5.apply(AFTSurvivalRegression.scala:521)
   at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
   at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
   at scala.collection.Iterator$class.foreach(Iterator.scala:727)

I have tried to print the schema :
()root
|-- label: double (nullable = true)
|-- censor: double (nullable = true)
|-- features: vector (nullable = true)

Sample data training looks like
[59.0,1.0,[72.3315,2.0,1.0,1.0]]
[115.0,1.0,[74.4932,2.0,1.0,1.0]]
[156.0,1.0,[66.4658,2.0,1.0,2.0]]
[421.0,0.0,[53.3644,2.0,2.0,1.0]]
[431.0,1.0,[50.3397,2.0,1.0,1.0]]

I'm not able to understand the error: if I use the same data and create the dense vectors as given in the AFT sample example, the code works completely fine. But I would like to read the data from a CSV file and then proceed.

mllib:Survival Analysis : assertion failed: AFTAggregator loss sum is infinity. Error for unknown reason.

2016-02-11 Thread Stuti Awasthi
Hi All,
I wanted to try survival analysis on Spark 1.6. I am able to run the provided AFT example successfully. Now I tried to train the model with the Ovarian data, which is a standard dataset that comes with the survival library in R.
Default column names: Futime, fustat, age, resid_ds, rx, ecog_ps

Here are the steps I have done :

* Loaded the data from csv to dataframe labeled as
val ovarian_data = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true") // Use first line of all files as header
  .option("inferSchema", "true") // Automatically infer data types
  .load("Ovarian.csv").toDF("label", "censor", "age", "resid_ds", "rx", 
"ecog_ps")

* Utilize the VectorAssembler() to create features from "age", 
"resid_ds", "rx", "ecog_ps" like
val assembler = new VectorAssembler()
.setInputCols(Array("age", "resid_ds", "rx", "ecog_ps"))
.setOutputCol("features")


* Then I create a new dataframe with only 3 columns:
val training = finalDf.select("label", "censor", "features")



* Finally I pass it to AFT
val model = aft.fit(training)

I get the following error:
java.lang.AssertionError: assertion failed: AFTAggregator loss sum is infinity. Error for unknown reason.
   at scala.Predef$.assert(Predef.scala:179)
   at org.apache.spark.ml.regression.AFTAggregator.add(AFTSurvivalRegression.scala:480)
   at org.apache.spark.ml.regression.AFTCostFun$$anonfun$5.apply(AFTSurvivalRegression.scala:522)
   at org.apache.spark.ml.regression.AFTCostFun$$anonfun$5.apply(AFTSurvivalRegression.scala:521)
   at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
   at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
   at scala.collection.Iterator$class.foreach(Iterator.scala:727)

I have tried to print the schema :
()root
|-- label: double (nullable = true)
|-- censor: double (nullable = true)
|-- features: vector (nullable = true)

Sample data training looks like
[59.0,1.0,[72.3315,2.0,1.0,1.0]]
[115.0,1.0,[74.4932,2.0,1.0,1.0]]
[156.0,1.0,[66.4658,2.0,1.0,2.0]]
[421.0,0.0,[53.3644,2.0,2.0,1.0]]
[431.0,1.0,[50.3397,2.0,1.0,1.0]]

I'm not able to understand the error: if I use the same data and create the dense vectors as given in the AFT sample example, the code works completely fine. But I would like to read the data from a CSV file and then proceed.

Please suggest

Thanks 
Stuti Awasthi



Issue in executing Spark Application from Eclipse

2014-12-03 Thread Stuti Awasthi
14/12/04 11:05:46 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 2.4 KB, free 133.6 MB)
14/12/04 11:05:46 INFO MemoryStore: ensureFreeSpace(1541) called with curMem=37486, maxMem=140142182
14/12/04 11:05:46 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 1541.0 B, free 133.6 MB)
14/12/04 11:05:46 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on HOSTNAME_DESKTOP:62311 (size: 1541.0 B, free: 133.6 MB)
14/12/04 11:05:46 INFO BlockManagerMaster: Updated info of block broadcast_1_piece0
14/12/04 11:05:46 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 (D:/Workspace/Spark/Test/README MappedRDD[1] at textFile at Test.scala:14)
14/12/04 11:05:46 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
14/12/04 11:06:01 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
14/12/04 11:06:03 INFO AppClient$ClientActor: Connecting to master spark://10.112.67.80:7077...
14/12/04 11:06:16 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
14/12/04 11:06:23 INFO AppClient$ClientActor: Connecting to master spark://10.112.67.80:7077...
14/12/04 11:06:31 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
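For context, that warning usually means the application registered with the master but no worker can satisfy the requested cores/memory, or the workers cannot connect back to the driver. A minimal driver-side sketch, with hypothetical values that must stay within what the workers advertise on the master UI:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("Test")
  .setMaster("spark://10.112.67.80:7077")
  // keep the per-executor memory at or below what each worker offers
  .set("spark.executor.memory", "512m")
  // the workers must be able to resolve and reach this driver host
  .set("spark.driver.host", "HOSTNAME_DESKTOP")
val sc = new SparkContext(conf)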

Thanks
Stuti Awasthi



RE: Cannot run program Rscript using SparkR

2014-08-19 Thread Stuti Awasthi
Thanks Shivaram,

This was the issue. I have now installed Rscript on all the nodes in the Spark cluster, and it now works both from the script as well as from the R prompt.
Thanks

Stuti Awasthi

From: Shivaram Venkataraman [mailto:shiva...@eecs.berkeley.edu]
Sent: Tuesday, August 19, 2014 1:17 PM
To: Stuti Awasthi
Cc: user@spark.apache.org
Subject: Re: Cannot run program Rscript using SparkR

Hi Stuti

Could you check if Rscript is installed on all of the worker machines in the Spark cluster? You can ssh into the machines and check whether Rscript can be found in $PATH.
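A best-effort way to run that check from the Spark shell itself; a sketch only, since tasks are not guaranteed to land on every worker, so ssh-ing into each machine remains the reliable check:

import scala.sys.process._
import scala.util.Try

// run `which Rscript` inside tasks and report what each host sees
val slots = sc.defaultParallelism
val report = sc.parallelize(1 to slots, slots).map { _ =>
  val host = java.net.InetAddress.getLocalHost.getHostName
  val rscript = Try(Seq("which", "Rscript").!!.trim).getOrElse("NOT FOUND")
  (host, rscript)
}.distinct().collect()

report.foreach(println)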

Thanks
Shivaram

On Mon, Aug 18, 2014 at 10:05 PM, Stuti Awasthi <stutiawas...@hcl.com> wrote:
Hi All,

I am using R 3.1 and Spark 0.9, and I installed SparkR successfully. When I execute the "pi.R" example using the local Spark master, the script executes fine.
But when I try to execute the same example using the Spark cluster master, it throws an Rscript error.

Error :
java.io.IOException: Cannot run program "Rscript": java.io.IOException: error=2, No such file or directory
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:475)
    at edu.berkeley.cs.amplab.sparkr.RRDD.compute(RRDD.scala:113)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:109)
    at org.apache.spark.scheduler.Task.run(Task.scala:53)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:46)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:45)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:416)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
    at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:679)

I have checked that Rscript is present on my system, and I also exported it in the CLASSPATH and PATH variables. The script has permission 777, as there are multiple users of the cluster.
$ which Rscript
/usr/local/bin/Rscript

$ type -a Rscript
Rscript is /usr/local/bin/Rscript

$ echo $PATH
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/var/local/maven3/bin/:/var/local/ant/bin/:/usr/lib/jvm/java-6-openjdk:/usr/local/bin/Rscript

$ echo $CLASSPATH
:/usr/local/bin/Rscript

I also get the same error if I open the R prompt and execute the commands one after another, or if I execute the script.

Please suggest

Thanks
Stuti Awasthi



RSpark installation on Windows

2014-07-01 Thread Stuti Awasthi
Hi All

Can we install RSpark on a Windows setup of R and use it to access a remote Spark cluster?

Thanks
Stuti Awasthi



SparkR Installation

2014-06-18 Thread Stuti Awasthi
Hi All,

I want to try SparkR. Do we need R preinstalled on all the nodes of the cluster before installing the SparkR package? Please guide me on how to proceed with this. As of now, I work with R on a single node only.
Please suggest

Thanks
Stuti Awasthi



Convert text into tfidf vectors for Classification

2014-06-13 Thread Stuti Awasthi
Hi all,

I want to perform text classification using Spark 1.0 Naïve Bayes. I am looking for a way to convert text into sparse vectors with a TF-IDF weighting scheme.
I found that the MLI library supports this, but it is only compatible with Spark 0.8.

What options are available to achieve text vectorization? Are there any pre-built APIs that can be used, or another way in which we can achieve this?
Please suggest
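If moving past Spark 1.0 is an option, a minimal sketch with the MLlib feature transformers that appeared in Spark 1.1; the input path and whitespace tokenization are only illustrative:

import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// one document per line, tokenized naively on whitespace
val docs: RDD[Seq[String]] = sc.textFile("docs.txt").map(_.split("\\s+").toSeq)

val tf: RDD[Vector] = new HashingTF().transform(docs) // hashed term frequencies
tf.cache()                                            // IDF makes two passes over tf
val tfidf: RDD[Vector] = new IDF().fit(tf).transform(tf)

// tfidf can then be zipped with labels to build LabeledPoints for NaiveBayes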

Thanks
Stuti Awasthi



Inter and Inra Cluster Density in KMeans

2014-05-28 Thread Stuti Awasthi
Hi,

I want to calculate the inter-cluster and intra-cluster density of the clusters generated by KMeans.
How can I achieve that? Is there any existing code/API I can use for this purpose?
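There is no single built-in "density" API, but a rough sketch of the usual proxies with MLlib's KMeansModel; it assumes a trained model, an RDD[Vector] of the clustered points, and an MLlib version that has Vectors.sqdist:

import org.apache.spark.mllib.clustering.KMeansModel
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

def clusterDensities(model: KMeansModel, points: RDD[Vector]): Unit = {
  val centers = model.clusterCenters

  // intra-cluster: average squared distance of each point to its own centroid
  // (model.computeCost(points) also reports the summed within-cluster squared error)
  val (sum, count) = points.map { p =>
    (Vectors.sqdist(p, centers(model.predict(p))), 1L)
  }.reduce { case ((d1, n1), (d2, n2)) => (d1 + d2, n1 + n2) }
  println(s"mean intra-cluster squared distance: ${sum / count}")

  // inter-cluster: pairwise distances between centroids
  for (i <- centers.indices; j <- centers.indices if i < j)
    println(s"distance between centers $i and $j: ${math.sqrt(Vectors.sqdist(centers(i), centers(j)))}")
}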

Thanks
Stuti Awasthi



RE: How to use Mahout VectorWritable in Spark.

2014-05-14 Thread Stuti Awasthi
The "<console>:12: error: not found: type Text" issue is resolved by an import statement, but I am still facing an issue with the import of VectorWritable.
The mahout-math jar is added to the classpath, as I can verify on the WebUI as well as in the shell:

scala> System.getenv
res1: java.util.Map[String,String] = {TERM=xterm, 
JAVA_HOME=/usr/lib/jvm/java-6-openjdk, SHLVL=2, 
SHELL_JARS=/home/hduser/installations/work-space/mahout-math-0.7.jar, 
SPARK_MASTER_WEBUI_PORT=5050, LESSCLOSE=/usr/bin/lesspipe %s %s, 
SSH_CLIENT=10.112.67.149 55123 22, 
SPARK_HOME=/home/hduser/installations/spark-0.9.0, MAIL=/var/mail/hduser, 
SPARK_WORKER_DIR=/tmp/spark-hduser-worklogs/work, 
XDG_SESSION_COOKIE=fbd2e4304c8c75dd606c36100186-1400039480.256868-916349946,
 https_proxy=https://DS-1078D2486320:3128/, NICKNAME=vm01, JAVA_OPTS=  
-Djava.library.path= -Xms512m -Xmx512m, 
PWD=/home/hduser/installations/work-space/KMeansClustering_1, 
SSH_TTY=/dev/pts/0, SPARK_MASTER_PORT=7077, LOGNAME=hduser, 
MASTER=spark://VM-52540048731A:7077, SPARK_WORKER_MEMORY=2g, 
HADOOP_HOME=/usr/lib/hadoop, SS...

I am still not able to import the Mahout classes. Any ideas?
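For reference, a consolidated sketch of the read, assuming the mahout-math jar really is on both the driver and executor classpaths (e.g. via ADD_JARS as above); the path is hypothetical, and note the org.apache.hadoop.io.Text import:

import org.apache.hadoop.io.Text
import org.apache.mahout.math.VectorWritable

val raw = sc.sequenceFile("/path/to/KMeans_dataset_seq/part-r-00000",
  classOf[Text], classOf[VectorWritable])

// copy each Mahout vector's values into a plain Array[Double], which the
// 0.9-era MLlib KMeans.train accepts directly (and which also avoids problems
// with Hadoop reusing Writable instances)
val features = raw.values.map { vw =>
  val v = vw.get()
  Array.tabulate(v.size)(i => v.get(i))
}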

Thanks
Stuti Awasthi

-Original Message-
From: Stuti Awasthi 
Sent: Wednesday, May 14, 2014 1:13 PM
To: user@spark.apache.org
Subject: RE: How to use Mahout VectorWritable in Spark.

Hi Xiangrui,
Thanks for the response. I tried a few ways to include the mahout-math jar while launching the Spark shell, but with no success. Can you please point out what I am doing wrong?

1. mahout-math.jar exported in CLASSPATH and PATH
2. Tried launching the Spark shell with: MASTER=spark://HOSTNAME:PORT ADD_JARS=~/installations/work-space/mahout-math-0.7.jar spark-0.9.0/bin/spark-shell

 After launching, I checked the environment details on WebUi: It looks like 
mahout-math jar is included.
spark.jars  /home/hduser/installations/work-space/mahout-math-0.7.jar

Then I try:
scala> import org.apache.mahout.math.VectorWritable
<console>:10: error: object mahout is not a member of package org.apache
       import org.apache.mahout.math.VectorWritable

scala> val raw = sc.sequenceFile(path, classOf[Text], classOf[VectorWritable])
<console>:12: error: not found: type Text
       val data = sc.sequenceFile("/stuti/ML/Clustering/KMeans/HAR/KMeans_dataset_seq/part-r-0", classOf[Text], classOf[VectorWritable])
                                                                                                          ^
I'm using Spark 0.9, Hadoop 1.0.4, and Mahout 0.7.

Thanks
Stuti 



-Original Message-
From: Xiangrui Meng [mailto:men...@gmail.com]
Sent: Wednesday, May 14, 2014 11:56 AM
To: user@spark.apache.org
Subject: Re: How to use Mahout VectorWritable in Spark.

You need

 val raw = sc.sequenceFile(path, classOf[Text], classOf[VectorWritable])

to load the data. After that, you can do

 val data = raw.values.map(_.get)

To get an RDD of mahout's Vector. You can use `--jar mahout-math.jar` when you 
launch spark-shell to include mahout-math.

Best,
Xiangrui

On Tue, May 13, 2014 at 10:37 PM, Stuti Awasthi stutiawas...@hcl.com wrote:
 Hi All,

 I am very new to Spark and trying to play around with Mllib hence 
 apologies for the basic question.



 I am trying to run KMeans algorithm using Mahout and Spark MLlib to 
 see the performance. Now initial datasize was 10 GB. Mahout converts 
 the data in Sequence File Text,VectorWritable which is used for KMeans 
 Clustering.
 The Sequence File crated was ~ 6GB in size.



 Now I wanted if I can use the Mahout Sequence file to be executed in 
 Spark MLlib for KMeans . I have read that SparkContext.sequenceFile 
 may be used here. Hence I tried to read my sequencefile as below but getting 
 the error :



 Command on Spark Shell :

 scala val data = sc.sequenceFile[String,VectorWritable](/
 KMeans_dataset_seq/part-r-0,String,VectorWritable)

 console:12: error: not found: type VectorWritable

val data = sc.sequenceFile[String,VectorWritable](
 /KMeans_dataset_seq/part-r-0,String,VectorWritable)



 Here I have 2 ques:

 1.  Mahout has “Text” as Key but Spark is printing “not found: type:Text”
 hence I changed it to String.. Is this correct ???

 2. How will VectorWritable be found in Spark. Do I need to include 
 Mahout jar in Classpath or any other option ??



 Please Suggest



 Regards

 Stuti Awasthi



RE: How to use Mahout VectorWritable in Spark.

2014-05-14 Thread Stuti Awasthi
Hi Xiangrui,
Thanks for the response. I tried a few ways to include the mahout-math jar while launching the Spark shell, but with no success. Can you please point out what I am doing wrong?

1. mahout-math.jar exported in CLASSPATH and PATH
2. Tried launching the Spark shell with: MASTER=spark://HOSTNAME:PORT ADD_JARS=~/installations/work-space/mahout-math-0.7.jar spark-0.9.0/bin/spark-shell

 After launching, I checked the environment details on WebUi: It looks like 
mahout-math jar is included.
spark.jars  /home/hduser/installations/work-space/mahout-math-0.7.jar

Then I try:
scala> import org.apache.mahout.math.VectorWritable
<console>:10: error: object mahout is not a member of package org.apache
       import org.apache.mahout.math.VectorWritable

scala> val raw = sc.sequenceFile(path, classOf[Text], classOf[VectorWritable])
<console>:12: error: not found: type Text
       val data = sc.sequenceFile("/stuti/ML/Clustering/KMeans/HAR/KMeans_dataset_seq/part-r-0", classOf[Text], classOf[VectorWritable])
                                                                                                          ^
I'm using Spark 0.9, Hadoop 1.0.4, and Mahout 0.7.

Thanks
Stuti 



-Original Message-
From: Xiangrui Meng [mailto:men...@gmail.com] 
Sent: Wednesday, May 14, 2014 11:56 AM
To: user@spark.apache.org
Subject: Re: How to use Mahout VectorWritable in Spark.

You need

 val raw = sc.sequenceFile(path, classOf[Text], classOf[VectorWritable])

to load the data. After that, you can do

 val data = raw.values.map(_.get)

To get an RDD of mahout's Vector. You can use `--jar mahout-math.jar` when you 
launch spark-shell to include mahout-math.

Best,
Xiangrui

On Tue, May 13, 2014 at 10:37 PM, Stuti Awasthi stutiawas...@hcl.com wrote:
 Hi All,

 I am very new to Spark and trying to play around with Mllib hence 
 apologies for the basic question.



 I am trying to run KMeans algorithm using Mahout and Spark MLlib to 
 see the performance. Now initial datasize was 10 GB. Mahout converts 
 the data into a Sequence File <Text, VectorWritable> which is used for KMeans
 Clustering.
 The Sequence File created was ~ 6 GB in size.



 Now I wanted if I can use the Mahout Sequence file to be executed in 
 Spark MLlib for KMeans . I have read that SparkContext.sequenceFile 
 may be used here. Hence I tried to read my sequencefile as below but getting 
 the error :



 Command on Spark Shell :

 scala val data = sc.sequenceFile[String,VectorWritable](/
 KMeans_dataset_seq/part-r-0,String,VectorWritable)

 console:12: error: not found: type VectorWritable

val data = sc.sequenceFile[String,VectorWritable](
 /KMeans_dataset_seq/part-r-0,String,VectorWritable)



 Here I have 2 ques:

 1.  Mahout has “Text” as Key but Spark is printing “not found: type:Text”
 hence I changed it to String.. Is this correct ???

 2. How will VectorWritable be found in Spark. Do I need to include 
 Mahout jar in Classpath or any other option ??



 Please Suggest



 Regards

 Stuti Awasthi

