RE: Not able to write output to local filesystem from Standalone mode.
Thanks Mathieu. So I must have either a shared filesystem or Hadoop as the filesystem in order to write data from a Standalone-mode cluster setup. Thanks for your input.

Regards
Stuti Awasthi

From: Mathieu Longtin [math...@closetwork.org]
Sent: Tuesday, May 24, 2016 7:34 PM
To: Stuti Awasthi; Jacek Laskowski
Cc: user
Subject: Re: Not able to write output to local filesystem from Standalone mode.

In standalone mode, executors assume they have access to a shared file system. The driver creates the directory and the executors write the files, so the executors end up not writing anything since there is no local directory.

On Tue, May 24, 2016 at 8:01 AM Stuti Awasthi <stutiawas...@hcl.com> wrote:

Hi Jacek,
The parent directory is already present; it is my home directory. I'm using a 64-bit Linux (Red Hat) machine. I also noticed that a "test1" folder is created on my master with an empty "_temporary" subdirectory, but on the slaves no such directory is created under /home/stuti.
Thanks
Stuti

From: Jacek Laskowski [ja...@japila.pl]
Sent: Tuesday, May 24, 2016 5:27 PM
To: Stuti Awasthi
Cc: user
Subject: Re: Not able to write output to local filesystem from Standalone mode.

Hi,
What happens when you create the parent directory /home/stuti? I think the failure is due to missing parent directories. What's the OS?
Jacek

On 24 May 2016 11:27 a.m., "Stuti Awasthi" <stutiawas...@hcl.com> wrote:

Hi All,
I have a 3-node Spark 1.6 Standalone-mode cluster with 1 master and 2 slaves, and I am not using Hadoop as the filesystem. I am able to launch the shell, read the input file from the local filesystem, and perform transformations successfully. When I try to write my output to a local filesystem path, I receive the error below. I searched the web and found a similar JIRA: https://issues.apache.org/jira/browse/SPARK-2984. Even though it shows as resolved for Spark 1.3+, people have posted that the same issue still persists in later versions.

ERROR
scala> data.saveAsTextFile("/home/stuti/test1")
16/05/24 05:03:42 WARN TaskSetManager: Lost task 1.0 in stage 1.0 (TID 2, server1): java.io.IOException: The temporary job-output directory file:/home/stuti/test1/_temporary doesn't exist!
        at org.apache.hadoop.mapred.FileOutputCommitter.getWorkPath(FileOutputCommitter.java:250)
        at org.apache.hadoop.mapred.FileOutputFormat.getTaskOutputPath(FileOutputFormat.java:244)
        at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:116)
        at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:91)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1193)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1185)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
        at org.apache.spark.scheduler.Task.run(Task.scala:89)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

What is the best way to resolve this issue if I don't want to have Hadoop installed, or is it mandatory to have Hadoop in order to write output from a Standalone-mode cluster? Please suggest.

Thanks
Stuti Awasthi
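A minimal sketch of the two options discussed above: write to a path that every node can see (a shared mount or HDFS), or, when the result is small enough to fit in driver memory, collect it to the driver and write locally there. The paths are illustrative, and `data` is assumed to be the RDD of strings from the thread.

  // Option 1: a path mounted identically on the driver and all workers (e.g. NFS), or an HDFS URI.
  data.saveAsTextFile("file:///shared/output/test1")

  // Option 2: no shared filesystem; only safe for results that fit in driver memory.
  import java.io.PrintWriter
  val local = data.collect()                          // pulls everything to the driver
  val out = new PrintWriter("/home/stuti/test1.txt")
  try local.foreach(out.println) finally out.close()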
RE: Not able to write output to local filesystem from Standalone mode.
Hi Jacek,
The parent directory is already present; it is my home directory. I'm using a 64-bit Linux (Red Hat) machine. I also noticed that a "test1" folder is created on my master with an empty "_temporary" subdirectory, but on the slaves no such directory is created under /home/stuti.
Thanks
Stuti

From: Jacek Laskowski [ja...@japila.pl]
Sent: Tuesday, May 24, 2016 5:27 PM
To: Stuti Awasthi
Cc: user
Subject: Re: Not able to write output to local filesystem from Standalone mode.

Hi,
What happens when you create the parent directory /home/stuti? I think the failure is due to missing parent directories. What's the OS?
Jacek

On 24 May 2016 11:27 a.m., "Stuti Awasthi" <stutiawas...@hcl.com> wrote:

Hi All,
I have a 3-node Spark 1.6 Standalone-mode cluster with 1 master and 2 slaves, and I am not using Hadoop as the filesystem. I am able to launch the shell, read the input file from the local filesystem, and perform transformations successfully. When I try to write my output to a local filesystem path, I receive the error below. I searched the web and found a similar JIRA: https://issues.apache.org/jira/browse/SPARK-2984. Even though it shows as resolved for Spark 1.3+, people have posted that the same issue still persists in later versions.

ERROR
scala> data.saveAsTextFile("/home/stuti/test1")
16/05/24 05:03:42 WARN TaskSetManager: Lost task 1.0 in stage 1.0 (TID 2, server1): java.io.IOException: The temporary job-output directory file:/home/stuti/test1/_temporary doesn't exist!
        at org.apache.hadoop.mapred.FileOutputCommitter.getWorkPath(FileOutputCommitter.java:250)
        at org.apache.hadoop.mapred.FileOutputFormat.getTaskOutputPath(FileOutputFormat.java:244)
        at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:116)
        at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:91)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1193)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1185)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
        at org.apache.spark.scheduler.Task.run(Task.scala:89)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

What is the best way to resolve this issue if I don't want to have Hadoop installed, or is it mandatory to have Hadoop in order to write output from a Standalone-mode cluster? Please suggest.

Thanks
Stuti Awasthi
Not able to write output to local filesystem from Standalone mode.
Hi All,
I have a 3-node Spark 1.6 Standalone-mode cluster with 1 master and 2 slaves, and I am not using Hadoop as the filesystem. I am able to launch the shell, read the input file from the local filesystem, and perform transformations successfully. When I try to write my output to a local filesystem path, I receive the error below. I searched the web and found a similar JIRA: https://issues.apache.org/jira/browse/SPARK-2984. Even though it shows as resolved for Spark 1.3+, people have posted that the same issue still persists in later versions.

ERROR
scala> data.saveAsTextFile("/home/stuti/test1")
16/05/24 05:03:42 WARN TaskSetManager: Lost task 1.0 in stage 1.0 (TID 2, server1): java.io.IOException: The temporary job-output directory file:/home/stuti/test1/_temporary doesn't exist!
        at org.apache.hadoop.mapred.FileOutputCommitter.getWorkPath(FileOutputCommitter.java:250)
        at org.apache.hadoop.mapred.FileOutputFormat.getTaskOutputPath(FileOutputFormat.java:244)
        at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:116)
        at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:91)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1193)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1185)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
        at org.apache.spark.scheduler.Task.run(Task.scala:89)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

What is the best way to resolve this issue if I don't want to have Hadoop installed, or is it mandatory to have Hadoop in order to write output from a Standalone-mode cluster? Please suggest.

Thanks
Stuti Awasthi
Joins in Spark
Hi All,
I have to join 2 files, both not very big (a few MBs each), but the result can be huge, say 500 GB to terabytes of data. I have tried Spark's join() function, but I notice the join executes on only 1 or 2 nodes at most. Since I have a cluster of 5 nodes, I tried passing join(otherDataset, [numTasks]) with numTasks=10, but again 9 of the tasks finish instantly and only 1 executor processes all the data. I searched the internet and found that a broadcast variable can be used to send one file's data to all nodes and then a map function can perform the join; that way I should be able to run multiple tasks on different executors. Now my question is: since Spark provides the join functionality, I assumed it would handle the data parallelism automatically. Does Spark provide some functionality that I can use directly for the join, rather than implementing a map-side join with a broadcast variable on my own? Any other better approach is also welcome. I assume this is a common problem, so I am looking for suggestions.
Thanks
Stuti Awasthi
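If most of the output comes from a few hot keys, a regular join funnels those keys into a single task regardless of numTasks, which matches the behaviour described above; a broadcast map-side join avoids the shuffle entirely. A minimal sketch, assuming a spark-shell with `sc` available and comma-delimited key,value files (the paths and format are illustrative):

  // The small side is collected to the driver and broadcast to every executor.
  val small = sc.textFile("/path/small.txt").map(_.split(",", 2)).map(a => (a(0), a(1)))
  val large = sc.textFile("/path/large.txt").map(_.split(",", 2)).map(a => (a(0), a(1)))

  val smallMap = sc.broadcast(small.groupByKey().collectAsMap())

  // Each partition of `large` joins locally against the broadcast map,
  // so parallelism follows the partitioning of `large` (repartition as needed).
  val joined = large.repartition(50).flatMap { case (k, v) =>
    smallMap.value.getOrElse(k, Iterable.empty).map(sv => (k, (v, sv)))
  }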
RE: Launch Spark shell using different python version
Thanks Prabhu,
I tried starting in local mode, but it is still picking up Python 2.6 only. I have exported "DEFAULT_PYTHON" as a session variable and also included it in PATH.

Export:
export DEFAULT_PYTHON="/home/stuti/Python/bin/python2.7"
export PATH="/home/stuti/Python/bin/python2.7:$PATH"

$ pyspark --master local
Python 2.6.6 (r266:84292, Jul 23 2015, 15:22:56)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-11)] on linux2
...

Thanks
Stuti Awasthi

From: Prabhu Joseph [mailto:prabhujose.ga...@gmail.com]
Sent: Tuesday, March 15, 2016 2:22 PM
To: Stuti Awasthi
Cc: user@spark.apache.org
Subject: Re: Launch Spark shell using different python version

Hi Stuti,
You can try local mode, but not spark master or yarn mode, if python-2.7 is not installed on all Spark Worker / NodeManager machines. To run with master mode:
1. Check whether the user is able to access python2.7.
2. Check if you have installed python-2.7 on all NodeManager / Spark Worker machines and restarted, so that the executor running inside the Spark Worker is able to get the full path of python2.7.
Inside the NodeManager, the executor does not find python2.7 even though the script is in PATH. To make the NodeManager find the path, set the full path of python-2.7 like below in the pyspark script:
DEFAULT_PYTHON="/ANACONDA/anaconda2/bin/python2.7"
Thanks,
Prabhu Joseph

On Tue, Mar 15, 2016 at 11:52 AM, Stuti Awasthi <stutiawas...@hcl.com> wrote:

Hi All,
I have a CentOS cluster (without any sudo permissions) which has Python 2.6 by default. I have installed Python 2.7 for my user account and made the changes in .bashrc so that Python 2.7 is picked up by default. Then I set the following properties in .bashrc in order to launch the Spark shell using Python 2.7, but it is not working.

.bashrc details:
alias python='/home/stuti/Python/bin/python2.7'
alias python2='/home/stuti/Python/bin/python2.7'
export PYSPARK_PYTHON=/home/stuti/Python/bin/python2.7
export LD_LIBRARY_PATH=/home/stuti/Python/lib:$LD_LIBRARY_PATH
export PATH=$HOME/bin:$PATH

Also note that the Spark cluster is configured with a different user account, and I have not installed Python 2.7 on all the nodes in the cluster as I don't have permission to do so. So is there any way that I can launch my Spark shell using Python 2.7? Please suggest.

Thanks
Stuti Awasthi
Launch Spark shell using different python version
Hi All,
I have a CentOS cluster (without any sudo permissions) which has Python 2.6 by default. I have installed Python 2.7 for my user account and made the changes in .bashrc so that Python 2.7 is picked up by default. Then I set the following properties in .bashrc in order to launch the Spark shell using Python 2.7, but it is not working.

.bashrc details:
alias python='/home/stuti/Python/bin/python2.7'
alias python2='/home/stuti/Python/bin/python2.7'
export PYSPARK_PYTHON=/home/stuti/Python/bin/python2.7
export LD_LIBRARY_PATH=/home/stuti/Python/lib:$LD_LIBRARY_PATH
export PATH=$HOME/bin:$PATH

Also note that the Spark cluster is configured with a different user account, and I have not installed Python 2.7 on all the nodes in the cluster as I don't have permission to do so. So is there any way that I can launch my Spark shell using Python 2.7? Please suggest.

Thanks
Stuti Awasthi
Survival Curves using AFT implementation in Spark
Hi All,
I want to apply survival analysis using Spark's AFT algorithm implementation. Currently I do the same in R by fitting a coxph model and passing it to survfit() to generate survival curves, which I then visualize on validation data to understand how well my model fits.

R code:
  fit <- coxph(Surv(futime, fustat) ~ age, data = ovarian)
  plot(survfit(fit, newdata = data.frame(age = 60)))

I want to achieve something similar with Spark, so I created the AFT model in Spark and passed my test dataframe for prediction. The result is a single prediction value per input row, which is as expected. But how can I use this model to generate survival curves for visualization?

Spark code:
  model.transform(test_final).show()

  +---------------------+------------------+
  |standardized_features|        prediction|
  +---------------------+------------------+
  | [0.0,0.0,0.743853...| 48.33071792204102|
  +---------------------+------------------+

Can anyone suggest how to use the developed model for plotting survival curves for the "test_final" data, which is a dataframe with a feature vector column?

Thanks
Stuti Awasthi
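One possible direction (a sketch, not a verified recipe for this dataset): in Spark ML, AFTSurvivalRegressionModel can emit predicted quantiles of the survival time per row, and plotting each row's quantiles against 1 minus the corresponding probabilities traces an approximate survival curve. The probability grid below is an illustrative assumption.

  import org.apache.spark.ml.regression.AFTSurvivalRegression

  // Assumes the `training` and `test_final` DataFrames from the thread above.
  val probs = Array(0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95)
  val aft = new AFTSurvivalRegression()
    .setFeaturesCol("standardized_features")
    .setQuantileProbabilities(probs)
    .setQuantilesCol("quantiles")
  val model = aft.fit(training)

  // Each output row carries the predicted time t_p for every p in `probs`,
  // where the survival probability S(t_p) = 1 - p; plot (t_p, 1 - p) per row.
  model.transform(test_final).select("standardized_features", "prediction", "quantiles").show(false)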
RE: mllib:Survival Analysis : assertion failed: AFTAggregator loss sum is infinity. Error for unknown reason.
Thanks a lot Yanbo, this will really help. Since I was unaware of this, I was speculating that my vectors were not getting generated correctly. Thanks!!

Thanks
Stuti Awasthi

From: Yanbo Liang [mailto:yblia...@gmail.com]
Sent: Wednesday, February 17, 2016 11:51 AM
To: Stuti Awasthi
Cc: user@spark.apache.org
Subject: Re: mllib:Survival Analysis : assertion failed: AFTAggregator loss sum is infinity. Error for unknown reason.

Hi Stuti,
The features should be standardized before training the model. Currently AFTSurvivalRegression does not support standardization. Here is the workaround for this issue, and I will send a PR to fix it soon.

val ovarian = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true") // Use first line of all files as header
  .option("inferSchema", "true") // Automatically infer data types
  .load("..")
  .toDF("label", "censor", "age", "resid_ds", "rx", "ecog_ps")

val assembler = new VectorAssembler()
  .setInputCols(Array("age", "resid_ds", "rx", "ecog_ps"))
  .setOutputCol("features")
val ovarian2 = assembler.transform(ovarian)
  .select(col("censor").cast(DoubleType), col("label").cast(DoubleType), col("features"))

val standardScaler = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("standardized_features")
val ssModel = standardScaler.fit(ovarian2)
val ovarian3 = ssModel.transform(ovarian2)

val aft = new AFTSurvivalRegression().setFeaturesCol("standardized_features")
val model = aft.fit(ovarian3)

val newCoefficients = model.coefficients.toArray.zip(ssModel.std.toArray).map { x => x._1 / x._2 }
println(newCoefficients.toSeq.mkString(","))
println(model.intercept)
println(model.scale)

Yanbo

2016-02-15 16:07 GMT+08:00 Yanbo Liang <yblia...@gmail.com>:

Hi Stuti,
This is a bug of AFTSurvivalRegression; we did not handle "lossSum == infinity" properly. I have opened https://issues.apache.org/jira/browse/SPARK-13322 to track this issue and will send a PR. Thanks for reporting this issue.
Yanbo

2016-02-12 15:03 GMT+08:00 Stuti Awasthi <stutiawas...@hcl.com>:

Hi All,
I wanted to try survival analysis on Spark 1.6. I am successfully able to run the AFT example provided. Now I tried to train the model with the Ovarian data, which is standard data that comes with the Survival library in R. Default column names: Futime, fustat, age, resid_ds, rx, ecog_ps. Here are the steps I have done:

• Loaded the data from CSV into a dataframe:
  val ovarian_data = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true") // Use first line of all files as header
    .option("inferSchema", "true") // Automatically infer data types
    .load("Ovarian.csv").toDF("label", "censor", "age", "resid_ds", "rx", "ecog_ps")
• Used VectorAssembler() to create features from "age", "resid_ds", "rx", "ecog_ps":
  val assembler = new VectorAssembler()
    .setInputCols(Array("age", "resid_ds", "rx", "ecog_ps"))
    .setOutputCol("features")
• Then I created a new dataframe with only 3 columns:
  val training = finalDf.select("label", "censor", "features")
• Finally I passed it to AFT:
  val model = aft.fit(training)

I am getting the error:
java.lang.AssertionError: assertion failed: AFTAggregator loss sum is infinity. Error for unknown reason.
        at scala.Predef$.assert(Predef.scala:179)
        at org.apache.spark.ml.regression.AFTAggregator.add(AFTSurvivalRegression.scala:480)
        at org.apache.spark.ml.regression.AFTCostFun$$anonfun$5.apply(AFTSurvivalRegression.scala:522)
        at org.apache.spark.ml.regression.AFTCostFun$$anonfun$5.apply(AFTSurvivalRegression.scala:521)
        at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
        at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)

I have tried to print the schema:
root
 |-- label: double (nullable = true)
 |-- censor: double (nullable = true)
 |-- features: vector (nullable = true)

Sample training data looks like:
[59.0,1.0,[72.3315,2.0,1.0,1.0]]
[115.0,1.0,[74.4932,2.0,1.0,1.0]]
[156.0,1.0,[66.4658,2.0,1.0,2.0]]
[421.0,0.0,[53.3644,2.0,2.0,1.0]]
[431.0,1.0,[50.3397,2.0,1.0,1.0]]

I am not able to understand the error; if I use the same data and create dense vectors as given in the AFT sample example, the code works completely fine. But I would like to read the data from a CSV file and then proceed. Please suggest.
mllib:Survival Analysis : assertion failed: AFTAggregator loss sum is infinity. Error for unknown reason.
Hi All,
I wanted to try survival analysis on Spark 1.6. I am successfully able to run the AFT example provided. Now I tried to train the model with the Ovarian data, which is standard data that comes with the Survival library in R. Default column names: Futime, fustat, age, resid_ds, rx, ecog_ps. Here are the steps I have done:

* Loaded the data from CSV into a dataframe:
  val ovarian_data = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true") // Use first line of all files as header
    .option("inferSchema", "true") // Automatically infer data types
    .load("Ovarian.csv").toDF("label", "censor", "age", "resid_ds", "rx", "ecog_ps")
* Used VectorAssembler() to create features from "age", "resid_ds", "rx", "ecog_ps":
  val assembler = new VectorAssembler()
    .setInputCols(Array("age", "resid_ds", "rx", "ecog_ps"))
    .setOutputCol("features")
* Then I created a new dataframe with only 3 columns:
  val training = finalDf.select("label", "censor", "features")
* Finally I passed it to AFT:
  val model = aft.fit(training)

I am getting the error:
java.lang.AssertionError: assertion failed: AFTAggregator loss sum is infinity. Error for unknown reason.
        at scala.Predef$.assert(Predef.scala:179)
        at org.apache.spark.ml.regression.AFTAggregator.add(AFTSurvivalRegression.scala:480)
        at org.apache.spark.ml.regression.AFTCostFun$$anonfun$5.apply(AFTSurvivalRegression.scala:522)
        at org.apache.spark.ml.regression.AFTCostFun$$anonfun$5.apply(AFTSurvivalRegression.scala:521)
        at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
        at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)

I have tried to print the schema:
root
 |-- label: double (nullable = true)
 |-- censor: double (nullable = true)
 |-- features: vector (nullable = true)

Sample training data looks like:
[59.0,1.0,[72.3315,2.0,1.0,1.0]]
[115.0,1.0,[74.4932,2.0,1.0,1.0]]
[156.0,1.0,[66.4658,2.0,1.0,2.0]]
[421.0,0.0,[53.3644,2.0,2.0,1.0]]
[431.0,1.0,[50.3397,2.0,1.0,1.0]]

I am not able to understand the error; if I use the same data and create dense vectors as given in the AFT sample example, the code works completely fine. But I would like to read the data from a CSV file and then proceed. Please suggest.

Thanks
Stuti Awasthi
Issue in executing Spark Application from Eclipse
14/12/04 11:05:46 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 2.4 KB, free 133.6 MB)
14/12/04 11:05:46 INFO MemoryStore: ensureFreeSpace(1541) called with curMem=37486, maxMem=140142182
14/12/04 11:05:46 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 1541.0 B, free 133.6 MB)
14/12/04 11:05:46 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on HOSTNAME_DESKTOP:62311 (size: 1541.0 B, free: 133.6 MB)
14/12/04 11:05:46 INFO BlockManagerMaster: Updated info of block broadcast_1_piece0
14/12/04 11:05:46 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 (D:/Workspace/Spark/Test/README MappedRDD[1] at textFile at Test.scala:14)
14/12/04 11:05:46 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
14/12/04 11:06:01 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
14/12/04 11:06:03 INFO AppClient$ClientActor: Connecting to master spark://10.112.67.80:7077...
14/12/04 11:06:16 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
14/12/04 11:06:23 INFO AppClient$ClientActor: Connecting to master spark://10.112.67.80:7077...
14/12/04 11:06:31 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory

Thanks
Stuti Awasthi
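The repeated "Initial job has not accepted any resources" warning from a standalone master usually means no registered worker has enough free memory/cores for the application, or the executors cannot reach the driver running inside the IDE. A minimal sketch of an explicit configuration to try from Eclipse; the master URL matches the log above, but the memory size, driver host and jar path are illustrative assumptions.

  import org.apache.spark.{SparkConf, SparkContext}

  object Test {
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf()
        .setAppName("EclipseTest")
        .setMaster("spark://10.112.67.80:7077")    // must match the URL shown on the master web UI
        .set("spark.executor.memory", "512m")      // keep within what the workers report as free
        .set("spark.driver.host", "10.112.67.xx")  // driver address reachable from the workers (placeholder)
        .setJars(Seq("target/test-app.jar"))       // ship application classes to the executors (illustrative path)
      val sc = new SparkContext(conf)
      println(sc.textFile("D:/Workspace/Spark/Test/README").count())
      sc.stop()
    }
  }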
RE: Cannot run program Rscript using SparkR
Thanks Shivaram, that was the issue. I have now installed Rscript on all the nodes in the Spark cluster, and it works both from the script as well as from the R prompt.

Thanks
Stuti Awasthi

From: Shivaram Venkataraman [mailto:shiva...@eecs.berkeley.edu]
Sent: Tuesday, August 19, 2014 1:17 PM
To: Stuti Awasthi
Cc: user@spark.apache.org
Subject: Re: Cannot run program Rscript using SparkR

Hi Stuti
Could you check if Rscript is installed on all of the worker machines in the Spark cluster? You can ssh into the machines and check if Rscript can be found in $PATH.
Thanks
Shivaram

On Mon, Aug 18, 2014 at 10:05 PM, Stuti Awasthi <stutiawas...@hcl.com> wrote:

Hi All,
I am using R 3.1 and Spark 0.9 and installed SparkR successfully. When I execute the "pi.R" example using the Spark master as local, the script executes fine. But when I try to execute the same example using the Spark cluster master, it throws an Rscript error.

Error:
java.io.IOException: Cannot run program Rscript: java.io.IOException: error=2, No such file or directory
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:475)
        at edu.berkeley.cs.amplab.sparkr.RRDD.compute(RRDD.scala:113)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:109)
        at org.apache.spark.scheduler.Task.run(Task.scala:53)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
        at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:46)
        at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:45)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:416)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
        at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:679)

I have checked that Rscript is present on my system, and I also exported it in the CLASSPATH and PATH variables. The script has permission 777 as there are multiple users of the cluster.

$ which Rscript
/usr/local/bin/Rscript
$ type -a Rscript
Rscript is /usr/local/bin/Rscript
$ echo $PATH
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/var/local/maven3/bin/:/var/local/ant/bin/:/usr/lib/jvm/java-6-openjdk:/usr/local/bin/Rscript
$ echo $CLASSPATH
:/usr/local/bin/Rscript

Also, I am getting the same error if I open the R prompt and then execute the commands one after another, or if I execute the script. Please suggest.

Thanks
Stuti Awasthi
RSpark installation on Windows
Hi All,
Can we install RSpark on a Windows setup of R and use it to access a remote Spark cluster?
Thanks
Stuti Awasthi
SparkR Installation
Hi All,
I wanted to try SparkR. Do we need R preinstalled on all the nodes of the cluster before installing the SparkR package? Please guide me on how to proceed with this. As of now, I work with R only on a single node. Please suggest.
Thanks
Stuti Awasthi
Convert text into TF-IDF vectors for Classification
Hi all,
I want to perform text classification using Spark 1.0 Naïve Bayes. I was looking for a way to convert text into sparse vectors with a TF-IDF weighting scheme. I found that the MLI library supports this, but it is compatible with Spark 0.8. What options are available to achieve text vectorization? Are there any pre-built APIs that can be used, or another way in which this can be achieved? Please suggest.
Thanks
Stuti Awasthi
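A sketch of one option using MLlib's HashingTF and IDF; note these were added after Spark 1.0 (around 1.1), so this assumes an upgrade is possible, and the input path, the tab-delimited "label<TAB>text" format and the tokenization are illustrative assumptions.

  import org.apache.spark.mllib.feature.{HashingTF, IDF}
  import org.apache.spark.mllib.regression.LabeledPoint
  import org.apache.spark.mllib.classification.NaiveBayes

  // One document per line: numeric label, a tab, then the text.
  val docs = sc.textFile("/path/labeled_docs.txt").map { line =>
    val parts = line.split("\t", 2)
    (parts(0).toDouble, parts(1).toLowerCase.split("\\s+").toSeq)
  }

  val hashingTF = new HashingTF(1 << 18)             // hash terms into a fixed-size sparse vector
  val tf = hashingTF.transform(docs.map(_._2)).cache()
  val idfModel = new IDF().fit(tf)                    // document frequencies over the whole corpus
  val tfidf = idfModel.transform(tf)

  // Re-attach labels positionally (both sides keep the original partitioning) and train Naive Bayes.
  val training = docs.map(_._1).zip(tfidf).map { case (label, vec) => LabeledPoint(label, vec) }
  val nbModel = NaiveBayes.train(training)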
Inter and Intra Cluster Density in KMeans
Hi,
I want to calculate the inter-cluster and intra-cluster density of the clusters generated by KMeans. How can I achieve that? Is there any existing code/API that can be used for this purpose?
Thanks
Stuti Awasthi
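MLlib does not expose such metrics directly (KMeansModel only provides computeCost, the within-set sum of squared errors, in newer releases), but both densities can be assembled from the model. A sketch, assuming a recent MLlib with Vectors.sqdist and an existing RDD[Vector] named `points`; k and the iteration count are illustrative.

  import org.apache.spark.mllib.clustering.KMeans
  import org.apache.spark.mllib.linalg.Vectors

  val model = KMeans.train(points, 5, 20)

  // Intra-cluster density: average distance from each point to its own centroid.
  val intra = points.map { p =>
    val c = model.clusterCenters(model.predict(p))
    math.sqrt(Vectors.sqdist(p, c))
  }.mean()

  // Inter-cluster density: average pairwise distance between centroids.
  val centers = model.clusterCenters
  val pairs = for (i <- centers.indices; j <- centers.indices if i < j)
    yield math.sqrt(Vectors.sqdist(centers(i), centers(j)))
  val inter = pairs.sum / pairs.size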
RE: How to use Mahout VectorWritable in Spark.
The "<console>:12: error: not found: type Text" issue is resolved by an import statement, but I am still facing the issue with the import of VectorWritable. The mahout-math jar is added to the classpath, as I can check on the WebUI as well as in the shell:

scala> System.getenv
res1: java.util.Map[String,String] = {TERM=xterm, JAVA_HOME=/usr/lib/jvm/java-6-openjdk, SHLVL=2, SHELL_JARS=/home/hduser/installations/work-space/mahout-math-0.7.jar, SPARK_MASTER_WEBUI_PORT=5050, LESSCLOSE=/usr/bin/lesspipe %s %s, SSH_CLIENT=10.112.67.149 55123 22, SPARK_HOME=/home/hduser/installations/spark-0.9.0, MAIL=/var/mail/hduser, SPARK_WORKER_DIR=/tmp/spark-hduser-worklogs/work, XDG_SESSION_COOKIE=fbd2e4304c8c75dd606c36100186-1400039480.256868-916349946, https_proxy=https://DS-1078D2486320:3128/, NICKNAME=vm01, JAVA_OPTS= -Djava.library.path= -Xms512m -Xmx512m, PWD=/home/hduser/installations/work-space/KMeansClustering_1, SSH_TTY=/dev/pts/0, SPARK_MASTER_PORT=7077, LOGNAME=hduser, MASTER=spark://VM-52540048731A:7077, SPARK_WORKER_MEMORY=2g, HADOOP_HOME=/usr/lib/hadoop, SS...

Still not able to import the Mahout classes. Any ideas?

Thanks
Stuti Awasthi

-----Original Message-----
From: Stuti Awasthi
Sent: Wednesday, May 14, 2014 1:13 PM
To: user@spark.apache.org
Subject: RE: How to use Mahout VectorWritable in Spark.

Hi Xiangrui,
Thanks for the response. I tried a few ways to include the mahout-math jar while launching the Spark shell, but with no success. Can you please point out what I am doing wrong?

1. mahout-math.jar exported in CLASSPATH and PATH
2. Tried launching the Spark shell by:
   MASTER=spark://HOSTNAME:PORT ADD_JARS=~/installations/work-space/mahout-math-0.7.jar spark-0.9.0/bin/spark-shell

After launching, I checked the environment details on the WebUI; it looks like the mahout-math jar is included:
   spark.jars /home/hduser/installations/work-space/mahout-math-0.7.jar

Then I try:

scala> import org.apache.mahout.math.VectorWritable
<console>:10: error: object mahout is not a member of package org.apache
       import org.apache.mahout.math.VectorWritable

scala> val raw = sc.sequenceFile(path, classOf[Text], classOf[VectorWritable])
<console>:12: error: not found: type Text
       val data = sc.sequenceFile("/stuti/ML/Clustering/KMeans/HAR/KMeans_dataset_seq/part-r-0", classOf[Text], classOf[VectorWritable])
                                                                                                           ^

I'm using Spark 0.9, Hadoop 1.0.4 and Mahout 0.7.

Thanks
Stuti

-----Original Message-----
From: Xiangrui Meng [mailto:men...@gmail.com]
Sent: Wednesday, May 14, 2014 11:56 AM
To: user@spark.apache.org
Subject: Re: How to use Mahout VectorWritable in Spark.

You need

  val raw = sc.sequenceFile(path, classOf[Text], classOf[VectorWritable])

to load the data. After that, you can do

  val data = raw.values.map(_.get)

to get an RDD of Mahout's Vector. You can use `--jar mahout-math.jar` when you launch spark-shell to include mahout-math.

Best,
Xiangrui

On Tue, May 13, 2014 at 10:37 PM, Stuti Awasthi stutiawas...@hcl.com wrote:

Hi All,
I am very new to Spark and trying to play around with MLlib, hence apologies for the basic question. I am trying to run the KMeans algorithm using Mahout and Spark MLlib to compare performance. The initial data size was 10 GB. Mahout converts the data into a sequence file of <Text, VectorWritable>, which is used for KMeans clustering; the sequence file created was ~6 GB in size. Now I wondered whether I can use the Mahout sequence file to be executed in Spark MLlib for KMeans. I have read that SparkContext.sequenceFile may be used here. Hence I tried to read my sequence file as below, but I am getting an error.

Command on Spark shell:
scala> val data = sc.sequenceFile[String,VectorWritable]("/KMeans_dataset_seq/part-r-0", String, VectorWritable)
<console>:12: error: not found: type VectorWritable
       val data = sc.sequenceFile[String,VectorWritable]("/KMeans_dataset_seq/part-r-0", String, VectorWritable)

Here I have 2 questions:
1. Mahout has "Text" as the key, but Spark is printing "not found: type: Text", hence I changed it to String. Is this correct?
2. How will VectorWritable be found in Spark? Do I need to include the Mahout jar in the classpath, or is there any other option?

Please suggest

Regards
Stuti Awasthi
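For reference, a sketch of what the read could look like once mahout-math is actually on both the driver and executor classpaths (e.g. via ADD_JARS/--jars when launching the shell). The path is the one from the thread; the conversion to an MLlib dense vector assumes a Spark release with the org.apache.spark.mllib.linalg.Vectors API (1.0+), and is only one way to feed the data into MLlib KMeans.

  import org.apache.hadoop.io.Text
  import org.apache.mahout.math.VectorWritable
  import org.apache.spark.mllib.linalg.Vectors

  val raw = sc.sequenceFile("/stuti/ML/Clustering/KMeans/HAR/KMeans_dataset_seq/part-r-0",
    classOf[Text], classOf[VectorWritable])

  // Hadoop Writables are reused between records, so copy the values out immediately.
  val vectors = raw.map { case (_, vw) =>
    val mv = vw.get()
    Vectors.dense((0 until mv.size()).map(i => mv.getQuick(i)).toArray)
  }
  vectors.cache()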
RE: How to use Mahout VectorWritable in Spark.
Hi Xiangrui,
Thanks for the response. I tried a few ways to include the mahout-math jar while launching the Spark shell, but with no success. Can you please point out what I am doing wrong?

1. mahout-math.jar exported in CLASSPATH and PATH
2. Tried launching the Spark shell by:
   MASTER=spark://HOSTNAME:PORT ADD_JARS=~/installations/work-space/mahout-math-0.7.jar spark-0.9.0/bin/spark-shell

After launching, I checked the environment details on the WebUI; it looks like the mahout-math jar is included:
   spark.jars /home/hduser/installations/work-space/mahout-math-0.7.jar

Then I try:

scala> import org.apache.mahout.math.VectorWritable
<console>:10: error: object mahout is not a member of package org.apache
       import org.apache.mahout.math.VectorWritable

scala> val raw = sc.sequenceFile(path, classOf[Text], classOf[VectorWritable])
<console>:12: error: not found: type Text
       val data = sc.sequenceFile("/stuti/ML/Clustering/KMeans/HAR/KMeans_dataset_seq/part-r-0", classOf[Text], classOf[VectorWritable])
                                                                                                           ^

I'm using Spark 0.9, Hadoop 1.0.4 and Mahout 0.7.

Thanks
Stuti

-----Original Message-----
From: Xiangrui Meng [mailto:men...@gmail.com]
Sent: Wednesday, May 14, 2014 11:56 AM
To: user@spark.apache.org
Subject: Re: How to use Mahout VectorWritable in Spark.

You need

  val raw = sc.sequenceFile(path, classOf[Text], classOf[VectorWritable])

to load the data. After that, you can do

  val data = raw.values.map(_.get)

to get an RDD of Mahout's Vector. You can use `--jar mahout-math.jar` when you launch spark-shell to include mahout-math.

Best,
Xiangrui

On Tue, May 13, 2014 at 10:37 PM, Stuti Awasthi stutiawas...@hcl.com wrote:

Hi All,
I am very new to Spark and trying to play around with MLlib, hence apologies for the basic question. I am trying to run the KMeans algorithm using Mahout and Spark MLlib to compare performance. The initial data size was 10 GB. Mahout converts the data into a sequence file of <Text, VectorWritable>, which is used for KMeans clustering; the sequence file created was ~6 GB in size. Now I wondered whether I can use the Mahout sequence file to be executed in Spark MLlib for KMeans. I have read that SparkContext.sequenceFile may be used here. Hence I tried to read my sequence file as below, but I am getting an error.

Command on Spark shell:
scala> val data = sc.sequenceFile[String,VectorWritable]("/KMeans_dataset_seq/part-r-0", String, VectorWritable)
<console>:12: error: not found: type VectorWritable
       val data = sc.sequenceFile[String,VectorWritable]("/KMeans_dataset_seq/part-r-0", String, VectorWritable)

Here I have 2 questions:
1. Mahout has "Text" as the key, but Spark is printing "not found: type: Text", hence I changed it to String. Is this correct?
2. How will VectorWritable be found in Spark? Do I need to include the Mahout jar in the classpath, or is there any other option?

Please suggest

Regards
Stuti Awasthi