RE: Control default partition when load a RDD from HDFS

2014-12-16 Thread Sun, Rui
Hi, Shuai, How did you turn off file splitting in Hadoop? I guess you might have implemented a customized FileInputFormat which overrides isSplitable() to return FALSE. If you do have such a FileInputFormat, you can simply pass it as a constructor parameter to HadoopRDD or NewHadoopRDD in Spark.

RE: pyspark sc.textFile uses only 4 out of 32 threads per node

2014-12-16 Thread Sun, Rui
Gautham, How many gz files do you have? Maybe the reason is that a gz file is compressed in a format that can't be split for processing by MapReduce. A single gz file can only be processed by a single mapper, so the CPU threads can't be fully utilized. -Original Message- From:

weird bytecode incompatability issue between spark-core jar from mvn repo and official spark prebuilt binary

2014-12-17 Thread Sun, Rui
Hi, I encountered a weird bytecode incompatibility issue between the spark-core jar from the mvn repo and the official Spark prebuilt binary. Steps to reproduce: 1. Download the official pre-built Spark binary 1.1.1 at http://d3kbcqa49mib13.cloudfront.net/spark-1.1.1-bin-hadoop1.tgz 2. Launch

RE: weird bytecode incompatability issue between spark-core jar from mvn repo and official spark prebuilt binary

2014-12-17 Thread Sun, Rui
? -Original Message- From: Sean Owen [mailto:so...@cloudera.com] Sent: Wednesday, December 17, 2014 8:39 PM To: Sun, Rui Cc: user@spark.apache.org Subject: Re: weird bytecode incompatability issue between spark-core jar from mvn repo and official spark prebuilt binary You should use

RE: weird bytecode incompatability issue between spark-core jar from mvn repo and official spark prebuilt binary

2014-12-17 Thread Sun, Rui
...@eecs.berkeley.edu] Sent: Thursday, December 18, 2014 2:20 AM To: Sean Owen Cc: Sun, Rui; user@spark.apache.org Subject: Re: weird bytecode incompatability issue between spark-core jar from mvn repo and official spark prebuilt binary Just to clarify, are you running the application using spark-submit

RE: weird bytecode incompatability issue between spark-core jar from mvn repo and official spark prebuilt binary

2014-12-18 Thread Sun, Rui
in such case? -Original Message- From: Sean Owen [mailto:so...@cloudera.com] Sent: Thursday, December 18, 2014 5:23 PM To: Sun, Rui Cc: shiva...@eecs.berkeley.edu; user@spark.apache.org Subject: Re: weird bytecode incompatability issue between spark-core jar from mvn repo and official spark

RE: Error in creating spark RDD

2015-04-23 Thread Sun, Rui
Hi, SparkContext.newAPIHadoopRDD() is for working with the new Hadoop mapreduce API. So, you should import org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat instead of org.apache.accumulo.core.client.mapred.AccumuloInputFormat. -Original Message- From: madhvi

RE: SparkR csv without headers

2015-08-20 Thread Sun, Rui
Hi, You can create a DataFrame using read.df() with a specified schema. Something like: schema <- structType(structField("a", "string"), structField("b", "integer"), …) read.df(…, schema = schema) From: Franc Carter [mailto:franc.car...@rozettatech.com] Sent: Wednesday, August 19, 2015 1:48 PM
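A minimal sketch of this approach (assuming SparkR 1.4/1.5 with the spark-csv package on the classpath; the column names and file path are illustrative):

library(SparkR)
sc <- sparkR.init(master = "local[*]")
sqlContext <- sparkRSQL.init(sc)

# Define the schema explicitly because the CSV file has no header row
schema <- structType(structField("a", "string"),
                     structField("b", "integer"))

# Read the headerless CSV with the user-defined schema
df <- read.df(sqlContext, "/path/to/data.csv",
              source = "com.databricks.spark.csv",
              schema = schema, header = "false")
printSchema(df)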

RE: SparkR

2015-07-27 Thread Sun, Rui
Simply no. Currently SparkR is the R API of the Spark DataFrame; existing algorithms can't benefit from it unless they are rewritten to be based on that API. There is on-going development on supporting MLlib and ML Pipelines in SparkR: https://issues.apache.org/jira/browse/SPARK-6805 From: Mohit

RE: SparkR Supported Types - Please add bigint

2015-07-23 Thread Sun, Rui
Exie, I reported your issue: https://issues.apache.org/jira/browse/SPARK-9302 SparkR has support for the long (bigint) type in serde. This issue is related to supporting complex Scala types in serde. -Original Message- From: Exie [mailto:tfind...@prodevelop.com.au] Sent: Friday, July 24, 2015

RE: SparkR Supported Types - Please add bigint

2015-07-24 Thread Sun, Rui
printSchema calls StructField.buildFormattedString() to output schema information. buildFormattedString() uses DataType.typeName as the string representation of the data type. LongType.typeName = "long", while LongType.simpleString = "bigint". I am not sure about the difference between these two type names

RE: unserialize error in sparkR

2015-07-27 Thread Sun, Rui
Hi, Do you mean you are running the script with https://github.com/amplab-extras/SparkR-pkg and Spark 1.2? I am afraid that currently there is no development effort or support for the SparkR-pkg, since it has been integrated into Spark as of Spark 1.4. Unfortunately, the RDD API and RDD-like

RE: Including additional scala libraries in sparkR

2015-07-13 Thread Sun, Rui
Hi, Michal, SparkR comes with a JVM backend that supports Java object instantiation and calling Java instance and static methods from the R side. As defined in https://github.com/apache/spark/blob/master/R/pkg/R/backend.R, newJObject() is to create an instance of a Java class; callJMethod() is to call
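A small sketch of how these backend functions can be used. They are internal, non-exported functions, so they must be accessed with the SparkR::: prefix and may change between versions; the Java class below is only an illustration:

library(SparkR)
sc <- sparkR.init(master = "local[*]")

# Instantiate a Java object from R
sb <- SparkR:::newJObject("java.lang.StringBuilder", "hello")

# Call instance methods on it
SparkR:::callJMethod(sb, "append", " world")
SparkR:::callJMethod(sb, "toString")          # returns "hello world"

# Call a static method
SparkR:::callJStatic("java.lang.Math", "max", 1L, 2L)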

RE: Share RDD from SparkR and another application

2015-07-14 Thread Sun, Rui
Hi, hari, I don't think job-server can work with SparkR (also pySpark). It seems it would be technically possible but needs support from job-server and SparkR(also pySpark), which doesn't exist yet. But there may be some in-direct ways of sharing RDDs between SparkR and an application. For

RE: Including additional scala libraries in sparkR

2015-07-14 Thread Sun, Rui
Could you give more details about the misbehavior of --jars for SparkR? Maybe it's a bug. From: Michal Haris [michal.ha...@visualdna.com] Sent: Tuesday, July 14, 2015 5:31 PM To: Sun, Rui Cc: Michal Haris; user@spark.apache.org Subject: Re: Including additional

RE: [SparkR] creating dataframe from json file

2015-07-15 Thread Sun, Rui
is not so complete. You may use scala documentation as reference, and try if some method is supported in SparkR. From: jianshu Weng [jian...@gmail.com] Sent: Wednesday, July 15, 2015 9:37 PM To: Sun, Rui Cc: user@spark.apache.org Subject: Re: [SparkR] creating dataframe

RE: [SparkR] creating dataframe from json file

2015-07-15 Thread Sun, Rui
suppose df <- jsonFile(sqlContext, "json file") You can extract hashtags.text as a Column object using the following command: t <- getField(df$hashtags, "text") and then you can perform operations on the column. You can extract hashtags.text as a DataFrame using the following command: t <-

RE: SparkR Error in sparkR.init(master=“local”) in RStudio

2015-07-13 Thread Sun, Rui
Hi, Kachau, If you are using SparkR with RStudio, have you followed the guidelines in the section Using SparkR from RStudio in https://github.com/apache/spark/tree/master/R ? From: kachau [umesh.ka...@gmail.com] Sent: Saturday, July 11, 2015 12:30 AM
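A minimal init sketch following those RStudio guidelines (the SPARK_HOME path is a placeholder for your local Spark installation):

# Point R at the local Spark installation and its bundled SparkR package
Sys.setenv(SPARK_HOME = "/path/to/spark")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))

library(SparkR)
sc <- sparkR.init(master = "local[*]")
sqlContext <- sparkRSQL.init(sc)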

RE: Spark_1.5.1_on_HortonWorks

2015-10-22 Thread Sun, Rui
Frans, SparkR runs with R 3.1+. If possible, the latest version of R is recommended. From: Saisai Shao [mailto:sai.sai.s...@gmail.com] Sent: Thursday, October 22, 2015 11:17 AM To: Frans Thamura Cc: Ajay Chander; Doug Balog; user spark mailing list Subject: Re: Spark_1.5.1_on_HortonWorks SparkR is

RE: How to set memory for SparkR with master="local[*]"

2015-10-25 Thread Sun, Rui
As documented in http://spark.apache.org/docs/latest/configuration.html#available-properties, Note for “spark.driver.memory”: Note: In client mode, this config must not be set through the SparkConf directly in your application, because the driver JVM has already started at that point. Instead,
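A hedged sketch of the two usual workarounds when launching SparkR from an interactive R session (the memory size is illustrative; the sparkEnvir route for spark.driver.memory only takes effect on builds containing the change discussed later in this thread):

# Option 1: pass --driver-memory through the submit args before the JVM backend starts
Sys.setenv(SPARKR_SUBMIT_ARGS = "--driver-memory 4g sparkr-shell")
sc <- sparkR.init(master = "local[*]")

# Option 2 (newer builds, use instead of Option 1): pass it via sparkEnvir
sc <- sparkR.init(master = "local[*]",
                  sparkEnvir = list(spark.driver.memory = "4g"))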

RE: Connecting SparkR through Yarn

2015-11-13 Thread Sun, Rui
To: Sun, Rui; user@spark.apache.org Subject: Re: Connecting SparkR through Yarn Hi Sun, Thank you for reply. I did the same, but now I am getting another issue. i.e: Not able to connect to ResourceManager after submitting the job the Error message showing something like this Connecting

RE: Connecting SparkR through Yarn

2015-11-10 Thread Sun, Rui
Amit, You can simply set "MASTER" to "yarn-client" before calling sparkR.init(): Sys.setenv("MASTER"="yarn-client"). I assume that you have set the "YARN_CONF_DIR" env variable required for running Spark on YARN. If you want to set more YARN-specific configurations, you can for example Sys.setenv
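Putting that suggestion together as a small sketch (the YARN_CONF_DIR path is a placeholder for your cluster's Hadoop configuration directory):

# Environment must be set before sparkR.init() launches the JVM backend
Sys.setenv(MASTER = "yarn-client")
Sys.setenv(YARN_CONF_DIR = "/etc/hadoop/conf")

library(SparkR)
sc <- sparkR.init()   # per the advice above, MASTER is picked up from the environment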

RE: sparkR 1.5.1 batch yarn-client mode failing on daemon.R not found

2015-11-01 Thread Sun, Rui
Tom, Have you set the "MASTER" env variable on your machine? What is the value if set? From: Tom Stewart [mailto:stewartthom...@yahoo.com.INVALID] Sent: Friday, October 30, 2015 10:11 PM To: user@spark.apache.org Subject: sparkR 1.5.1 batch yarn-client mode failing on daemon.R not found I have

RE: SparkR job with >200 tasks hangs when calling from web server

2015-11-01 Thread Sun, Rui
I guess that this is not related to SparkR, but to something wrong in Spark Core. Could you try your application logic within spark-shell (you have to use the Scala DataFrame API) instead of the SparkR shell and see if this issue still happens? -Original Message- From: rporcio

RE: How to set memory for SparkR with master="local[*]"

2015-11-01 Thread Sun, Rui
, spark.driver.extraJavaOptions, spark.driver.extraLibraryPath) in the sparkEnvir parameter for sparkR.init() to take effect. Would you like to give it a try? Note the change is on the master branch, you have to build Spark from source before using it. From: Sun, Rui [mailto:rui@intel.com] Sent: Monday

RE: [sparkR] Any insight on java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-11-07 Thread Sun, Rui
k.driver.memory”, (also other similar options, like: spark.driver.extraClassPath, spark.driver.extraJavaOptions, spark.driver.extraLibraryPath) in the sparkEnvir parameter for sparkR.init() to take effect. Would you like to give it a try? Note the change is on the master branch, you have to build Spark from source befo

RE: [Spark R]could not allocate memory (2048 Mb) in C function 'R_AllocStringBuffer'

2015-11-06 Thread Sun, Rui
Hi, Todd, The "--driver-memory" option specifies the maximum heap memory size of the JVM backend for SparkR. The error you faced is a memory allocation error of your R process. They are different. I guess that the 2G memory bound for a string is a limitation of the R interpreter? That's the reason why we

RE: SparkR Error in sparkR.init(master=“local”) in RStudio

2015-10-07 Thread Sun, Rui
Not sure "/C/DevTools/spark-1.5.1/bin/spark-submit.cmd" is a valid? From: Hossein [mailto:fal...@gmail.com] Sent: Wednesday, October 7, 2015 12:46 AM To: Khandeshi, Ami Cc: Sun, Rui; akhandeshi; user@spark.apache.org Subject: Re: SparkR Error in sparkR.init(master=“local”) in RStudio

RE: SparkR Error in sparkR.init(master=“local”) in RStudio

2015-10-08 Thread Sun, Rui
Can you extract the spark-submit command from the console output, and run it on the Shell, and see if there is any error message? From: Khandeshi, Ami [mailto:ami.khande...@fmr.com] Sent: Wednesday, October 7, 2015 9:57 PM To: Sun, Rui; Hossein Cc: akhandeshi; user@spark.apache.org Subject: RE

RE: How can I read file from HDFS i sparkR from RStudio

2015-10-08 Thread Sun, Rui
Amit, sqlContext <- sparkRSQL.init(sc) peopleDF <- read.df(sqlContext, "hdfs://master:9000/sears/example.csv") have you restarted the R session in RStudio between the two lines? From: Amit Behera [mailto:amit.bd...@gmail.com] Sent: Thursday, October 8, 2015 5:59 PM To: user@spark.apache.org

RE: [SparkR] Float type coercion with hiveContext

2015-07-08 Thread Sun, Rui
Hi, Evgeny, I reported a JIRA issue for your problem: https://issues.apache.org/jira/browse/SPARK-8897. You can track it to see how it will be solved. Ray -Original Message- From: Evgeny Sinelnikov [mailto:esinelni...@griddynamics.com] Sent: Monday, July 6, 2015 7:27 PM To:

RE: SparkR dataFrame read.df fails to read from aws s3

2015-07-09 Thread Sun, Rui
Hi, Ben 1) I guess this may be a JDK version mismatch. Could you check the JDK version? 2) I believe this is a bug in SparkR. I will file a JIRA issue for it. From: Ben Spark [mailto:ben_spar...@yahoo.com.au] Sent: Thursday, July 9, 2015 12:14 PM To: user Subject: SparkR dataFrame

RE: Support of other languages?

2015-09-09 Thread Sun, Rui
Hi, Rahul, Supporting a new language other than Java/Scala in Spark differs between the RDD API and the DataFrame API. For the RDD API: an RDD is a distributed collection of language-specific data types whose representation is unknown to the JVM. Also, transformation functions for RDDs are written

RE: reading files on HDFS /s3 in sparkR -failing

2015-09-10 Thread Sun, Rui
Hi, Roni, For parquetFile(), it is just a warning; you can get the DataFrame successfully, right? It is a bug that has been fixed in the latest repo: https://issues.apache.org/jira/browse/SPARK-8952 For S3, it is not related to SparkR. I guess it is related to

RE: SparkR - calling as.vector() with rdd dataframe causes error

2015-09-17 Thread Sun, Rui
Existing algorithms operating on R data.frames can't simply operate on SparkR DataFrames. They have to be re-implemented based on the SparkR DataFrame API. -Original Message- From: ekraffmiller [mailto:ellen.kraffmil...@gmail.com] Sent: Thursday, September 17, 2015 3:30 AM To:

RE: textFile() and includePackage() not found

2015-09-27 Thread Sun, Rui
Eugene, The SparkR RDD API is private for now (https://issues.apache.org/jira/browse/SPARK-7230). You can use the SparkR::: prefix to access those private functions. -Original Message- From: Eugene Cao [mailto:eugene...@163.com] Sent: Monday, September 28, 2015 8:02 AM To:
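For illustration, a hedged sketch of calling the private functions named in the thread title (internal API, so names and behavior may differ across Spark versions; the paths are placeholders):

library(SparkR)
sc <- sparkR.init(master = "local[*]")

# Private RDD API: note the triple-colon prefix
rdd <- SparkR:::textFile(sc, "/path/to/input.txt")

# Ship an R package to the workers for use inside RDD functions
SparkR:::includePackage(sc, "Matrix")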

RE: Support of other languages?

2015-09-22 Thread Sun, Rui
Palamuttam [mailto:rahulpala...@gmail.com] Sent: Thursday, September 17, 2015 3:09 PM To: Sun, Rui Cc: user@spark.apache.org Subject: Re: Support of other languages? Hi, Thank you for both responses. Sun you pointed out the exact issue I was referring to, which is copying,serializing, deserializing

RE: SparkR for accumulo

2015-09-23 Thread Sun, Rui
transformations on it. -Original Message- From: madhvi.gupta [mailto:madhvi.gu...@orkash.com] Sent: Wednesday, September 23, 2015 11:42 AM To: Sun, Rui; user Subject: Re: SparkR for accumulo Hi Rui, Cant we use the accumulo data RDD created from JAVA in spark, in sparkR? Thanks and Regards Madhvi

RE: SparkR Error in sparkR.init(master=“local”) in RStudio

2015-10-06 Thread Sun, Rui
What you have done is supposed to work. Need more debugging information to find the cause. Could you add the following lines before calling sparkR.init()? Sys.setenv(SPARKR_SUBMIT_ARGS="--verbose sparkr-shell") Sys.setenv(SPARK_PRINT_LAUNCH_COMMAND=1) Then see if you can find any hint in

RE: SparkR read.df failed to read file from local directory

2015-12-08 Thread Sun, Rui
Hi, Boyu, Does the local file “/home/myuser/test_data/sparkR/flights.csv” really exist? I just tried, and had no problem creating a DataFrame from a local CSV file. From: Boyu Zhang [mailto:boyuzhan...@gmail.com] Sent: Wednesday, December 9, 2015 1:49 AM To: Felix Cheung Cc:

RE: Do existing R packages work with SparkR data frames

2015-12-22 Thread Sun, Rui
Hi, Lan, Generally, it is hard for existing R packages that work on R data frames to work with SparkR DataFrames transparently. Typically the algorithms have to be rewritten to use the SparkR DataFrame API. collect() is for collecting the data from a SparkR DataFrame into a local data.frame.
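If the data fits in the driver's memory, a common pattern is to collect first and then hand the result to an ordinary R function; a minimal sketch (column names are illustrative):

# sdf is a SparkR DataFrame
local_df <- collect(sdf)        # pulls the distributed data into a local data.frame

# Any existing R package can now be used on the local data
fit <- lm(y ~ x, data = local_df)
summary(fit)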

RE: Re: how to use sparkR or spark MLlib load csv file on hdfs then calculate covariance

2015-12-28 Thread Sun, Rui
Spark does not support computing a covariance matrix yet, but there is a PR for it. Maybe you can try it: https://issues.apache.org/jira/browse/SPARK-11057 From: zhangjp [mailto:592426...@qq.com] Sent: Tuesday, December 29, 2015 3:21 PM To: Felix Cheung; Andy Davidson; Yanbo Liang Cc: user Subject: Re:

RE: SparkR DataFrame , Out of memory exception for very small file.

2015-11-22 Thread Sun, Rui
Vipul, Not sure if I understand your question. DataFrame is immutable. You can't update a DataFrame. Could you paste some log info for the OOM error? -Original Message- From: vipulrai [mailto:vipulrai8...@gmail.com] Sent: Friday, November 20, 2015 12:11 PM To: user@spark.apache.org

Re: SparkR interaction with R libraries (currently 1.5.2)

2016-06-07 Thread Sun Rui
Hi, Ian, You should not use the Spark DataFrame a_df in your closure. For an R function passed to lapplyPartition, the parameter is a list of lists representing the rows in the corresponding partition. In Spark 2.0, SparkR provides a new public API called dapply, which can apply an R function to each
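A hedged sketch of dapply from SparkR 2.0 (the column names and transformation are illustrative; dapply requires the output schema to be declared):

# df is a SparkDataFrame with columns "a" (double) and "b" (double)
schema <- structType(structField("a", "double"),
                     structField("b", "double"),
                     structField("sum", "double"))

result <- dapply(df, function(part) {
  # part is a local R data.frame holding the rows of one partition
  part$sum <- part$a + part$b
  part
}, schema)

head(result)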

Re: Slow collecting of large Spark Data Frames into R

2016-06-11 Thread Sun Rui
Hi, Jonathan, Thanks for reporting. This is a known issue that the community would like to address later. Please refer to https://issues.apache.org/jira/browse/SPARK-14037. It would be better that you can profile your use case using the method discussed in the JIRA issue and paste the

Re: SparkR : glm model

2016-06-11 Thread Sun Rui
You were looking at some old code. The poisson family is supported in the latest master branch. You can try the Spark 2.0 preview release from http://spark.apache.org/news/spark-2.0.0-preview.html > On Jun 10, 2016, at 12:14, april_ZMQ

Re: Can we use existing R model in Spark

2016-05-30 Thread Sun Rui
Unfortunately no. Spark does not support loading external models (for example, PMML) for now. Maybe you can try using the existing random forest model in Spark. > On May 30, 2016, at 18:21, Neha Mehta wrote: > > Hi, > > I have an existing random forest model created

Re: Can we use existing R model in Spark

2016-05-30 Thread Sun Rui
er@spark.apache.org>> > > > Try to invoke a R script from Spark using rdd pipe method , get the work done > & and receive the model back in RDD. > > > for ex :- > . rdd.pipe("") > > > On Mon, May 30, 2016 at 3:57 PM, Sun Rui <sunr

Re: --driver-cores for Standalone and YARN only?! What about Mesos?

2016-06-02 Thread Sun Rui
Yes, I think you can file a JIRA issue for this. But why remove the default value? It seems the default number of cores is 1 according to https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/rest/mesos/MesosRestServer.scala#L110 On Jun 2, 2016, at 05:18, Jacek Laskowski

Re: get and append file name in record being reading

2016-06-02 Thread Sun Rui
You can use SparkContext.wholeTextFiles(). For example, suppose all your files are under /tmp/ABC_input/: val rdd = sc.wholeTextFiles("file:///tmp/ABC_input") val rdd1 = rdd.flatMap { case (path, content) => val fileName = new java.io.File(path).getName content.split("\n").map { line =>

Re: Windows Rstudio to Linux spakR

2016-06-01 Thread Sun Rui
Selvam, First, deploy the Spark distribution on your Windows machine, of the same version as the Spark in your Linux cluster. Second, follow the instructions at https://github.com/apache/spark/tree/master/R#using-sparkr-from-rstudio. Specify the Spark master URL for your Linux Spark

Re: Unable to execute sparkr jobs through Chronos

2016-06-16 Thread Sun Rui
It seems that spark master URL is not correct. What is it? > On Jun 16, 2016, at 18:57, Rodrick Brown wrote: > > Master must start with yarn, spark, mesos, or local

Re: Unable to execute sparkr jobs through Chronos

2016-06-16 Thread Sun Rui
I saw in the job definition an Env Var: SPARKR_MASTER. What is that for? I don’t think SparkR uses it. > On Jun 17, 2016, at 10:08, Sun Rui <sunrise_...@163.com> wrote: > > It seems that spark master URL is not correct. What is it? >> On Jun 16, 2016, at 18:57, Rodrick B

Re: sparkR.init() can not load sparkPackages.

2016-06-19 Thread Sun Rui
Hi, Joseph, This is a known issue but not a bug. It does not occur when you use an interactive SparkR session, but it does occur when you execute an R file. The reason is that when you execute an R file, the R backend launches before the R interpreter, so there is no

Re: Adding h5 files in a zip to use with PySpark

2016-06-15 Thread Sun Rui
have you tried --files ? > On Jun 15, 2016, at 18:50, ar7 wrote: > > I am using PySpark 1.6.1 for my spark application. I have additional modules > which I am loading using the argument --py-files. I also have a h5 file > which I need to access from one of the modules for

RE: building spark 1.6 throws error Rscript: command not found

2016-01-19 Thread Sun, Rui
Hi, Mich, Building Spark with SparkR profile enabled requires installation of R on your building machine. From: Ted Yu [mailto:yuzhih...@gmail.com] Sent: Tuesday, January 19, 2016 5:27 AM To: Mich Talebzadeh Cc: user @spark Subject: Re: building spark 1.6 throws error Rscript: command not found

RE: different behavior while using createDataFrame and read.df in SparkR

2016-02-08 Thread Sun, Rui
le has to be re-assigned to reference a column in the new DataFrame. From: Devesh Raj Singh [mailto:raj.deves...@gmail.com] Sent: Saturday, February 6, 2016 8:31 PM To: Sun, Rui <rui@intel.com> Cc: user@spark.apache.org Subject: Re: different behavior while using createDataFrame and read.d

RE: different behavior while using createDataFrame and read.df in SparkR

2016-02-05 Thread Sun, Rui
: Devesh Raj Singh [mailto:raj.deves...@gmail.com] Sent: Friday, February 5, 2016 2:44 PM To: user@spark.apache.org Cc: Sun, Rui Subject: different behavior while using createDataFrame and read.df in SparkR Hi, I am using Spark 1.5.1 When I do this df <- createDataFrame(sqlContext, i

RE: sparkR not able to create /append new columns

2016-02-03 Thread Sun, Rui
Devesh, Note that DataFrame is immutable. withColumn returns a new DataFrame instead of adding a column in-place to the DataFrame being operated on. So, you can modify the for loop like: for (j in 1:lev) { dummy.df.new<-withColumn(df, paste0(colnames(cat.column),j),
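The key point is to keep reassigning the result of withColumn; a hedged, self-contained sketch assuming a SparkR 1.x sqlContext (column names and constants are illustrative):

df <- createDataFrame(sqlContext, data.frame(value = c(1, 2, 3)))

# Each withColumn call returns a *new* DataFrame, so reassign it
for (j in 1:3) {
  df <- withColumn(df, paste0("dummy", j), df$value * j)
}
head(df)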

RE: can we do column bind of 2 dataframes in spark R? similar to cbind in R?

2016-02-02 Thread Sun, Rui
Devesh, A cbind-like operation is not supported by the Scala DataFrame API, so it is also not supported in SparkR. You may try to work around this with the approach in http://stackoverflow.com/questions/32882529/how-to-zip-twoor-more-dataframe-in-spark You could also submit a JIRA

RE: Apache Arrow + Spark examples?

2016-02-24 Thread Sun, Rui
Spark has not supported Arrow yet. There is a JIRA https://issues.apache.org/jira/browse/SPARK-13391 requesting working on it. From: Robert Towne [mailto:robert.to...@webtrends.com] Sent: Wednesday, February 24, 2016 5:21 AM To: user@spark.apache.org Subject: Apache Arrow + Spark examples? I

RE: Running synchronized JRI code

2016-02-14 Thread Sun, Rui
Yes, JRI loads an R dynamic library into the executor JVM, which faces thread-safety issues when there are multiple task threads within the executor. If you are running Spark in Standalone mode, it is possible to run multiple workers per node and, at the same time, limit the cores per worker to be

RE: Running synchronized JRI code

2016-02-14 Thread Sun, Rui
For YARN mode, you can set --executor-cores 1 -Original Message- From: Sun, Rui [mailto:rui@intel.com] Sent: Monday, February 15, 2016 11:35 AM To: Simon Hafner <reactorm...@gmail.com>; user <user@spark.apache.org> Subject: RE: Running synchronized JRI code Yes, JR

RE: Running synchronized JRI code

2016-02-15 Thread Sun, Rui
On computation, RRDD launches one R process for each partition, so there won't be thread-safety issues. Could you give more details on your new environment? -Original Message- From: Simon Hafner [mailto:reactorm...@gmail.com] Sent: Monday, February 15, 2016 7:31 PM To: Sun, Rui <

RE: sparkR issues ?

2016-03-15 Thread Sun, Rui
I have submitted https://issues.apache.org/jira/browse/SPARK-13905 and a PR for it. From: Alex Kozlov [mailto:ale...@gmail.com] Sent: Wednesday, March 16, 2016 12:52 AM To: roni <roni.epi...@gmail.com> Cc: Sun, Rui <rui@intel.com>; user@spark.apache.org Subject: Re: sparkR issue

RE: sparkR issues ?

2016-03-18 Thread Sun, Rui
Sorry, I was wrong. The issue is not related to as.data.frame(). It seems to be related to a DataFrame naming conflict between S4Vectors and SparkR. Refer to https://issues.apache.org/jira/browse/SPARK-12148 From: Sun, Rui [mailto:rui@intel.com] Sent: Wednesday, March 16, 2016 9:33 AM To: Alex

RE: sparkR issues ?

2016-03-15 Thread Sun, Rui
It seems that as.data.frame() defined in SparkR masks the version in the R base package. We can try to see if we can change the implementation of as.data.frame() in SparkR to avoid such masking. From: Alex Kozlov [mailto:ale...@gmail.com] Sent: Tuesday, March 15, 2016 2:59 PM To: roni
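As a hedged, generic workaround for this kind of masking, the base implementation can be called with an explicit namespace prefix until the conflict is resolved (the data below is illustrative):

# Force the base-R implementation rather than the masking SparkR generic
local_df <- base::as.data.frame(list(a = 1:3, b = c("x", "y", "z")))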

RE: lint-r checks failing

2016-03-10 Thread Sun, Rui
This is probably because the installed lintr package got updated. After the update, lintr can detect errors that were skipped before. I will submit a PR for this issue. -Original Message- From: Gayathri Murali [mailto:gayathri.m.sof...@gmail.com] Sent: Friday, March 11, 2016 12:48 PM To:

RE: How does spark-submit handle Python scripts (and how to repeat it)?

2016-04-12 Thread Sun, Rui
c/main/scala/org/apache/spark/deploy/PythonRunner.scala#L47 and https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/RRunner.scala#L65 From: Andrei [mailto:faithlessfri...@gmail.com] Sent: Wednesday, April 13, 2016 4:32 AM To: Sun, Rui <rui@intel.com&

RE: Run a self-contained Spark app on a Spark standalone cluster

2016-04-12 Thread Sun, Rui
Which py file is your main file (primary py file)? Zip the other two py files. Leave the main py file alone. Don't copy them to S3 because it seems that only local primary and additional py files are supported. ./bin/spark-submit --master spark://... --py-files -Original Message-

RE: Can i have a hive context and sql context in the same app ?

2016-04-12 Thread Sun, Rui
val ALLOW_MULTIPLE_CONTEXTS = booleanConf("spark.sql.allowMultipleContexts", defaultValue = Some(true), doc = "When set to true, creating multiple SQLContexts/HiveContexts is allowed." + "When set to false, only one SQLContext/HiveContext is allowed to be created " +

RE: How does spark-submit handle Python scripts (and how to repeat it)?

2016-04-12 Thread Sun, Rui
There is much deployment preparation work handling different deployment modes for pyspark and SparkR in SparkSubmit. It is difficult to summarize it briefly; you had better refer to the source code. Supporting running Julia scripts in SparkSubmit is more than implementing a ‘JuliaRunner’. One

RE: How to process one partition at a time?

2016-04-06 Thread Sun, Rui
Maybe you can try SparkContext.submitJob: def submitJob[T, U, R](rdd: RDD[T], processPartition: (Iterator[T]) ⇒ U, partitions: Seq[Int], resultHandler: (Int, U) ⇒ Unit, resultFunc: ⇒ R):

RE: Error in "java.io.IOException: No input paths specified in job"

2016-03-19 Thread Sun, Rui
It complains about the file path "./examples/src/main/resources/people.json" You can try to use absolute path instead of relative path, and make sure the absolute path is correct. If that still does not work, you can prefix the path with "file://" in case the default file schema for Hadoop is

RE: What's the benifit of RDD checkpoint against RDD save

2016-03-24 Thread Sun, Rui
As Mark said, checkpoint() can be called before calling any action on the RDD. The choice between checkpoint and saveXXX depends. If you just want to cut the long RDD lineage, and the data won’t be re-used later, then use checkpoint, because it is simple and the checkpoint data will be cleaned

RE: Run External R script from Spark

2016-03-21 Thread Sun, Rui
It’s a possible approach. It actually leverages Spark’s parallel execution. PipeRDD’s launching of external processes is just like that in PySpark and SparkR for the RDD API. The concern is that pipeRDD relies on text-based serialization/deserialization. Whether the performance is acceptable actually

RE: SparkR Count vs Take performance

2016-03-02 Thread Sun, Rui
This has nothing to do with object serialization/deserialization. It is expected behavior that take(1) most likely runs slower than count() on an empty RDD. This is all about the algorithm with which take() is implemented. take(): 1. reads one partition to get the elements; 2. if the fetched

RE: How does spark-submit handle Python scripts (and how to repeat it)?

2016-04-13 Thread Sun, Rui
...@gmail.com] Sent: Thursday, April 14, 2016 5:45 AM To: Sun, Rui <rui@intel.com> Cc: user <user@spark.apache.org> Subject: Re: How does spark-submit handle Python scripts (and how to repeat it)? Julia can pick the env var, and set the system properties or directly fill the co

Re: Splitting RDD by partition

2016-05-20 Thread Sun Rui
I think the latter approach is better, which can avoid unnecessary computations by filtering out unneeded partitions. It is better to cache the previous RDD so that it won’t be computed twice > On May 20, 2016, at 16:59, shlomi wrote: > > Another approach I found: > >

Re: SparkR query

2016-05-17 Thread Sun Rui
nd workers looking for Windows path, > Which must be being passed through by the driver I guess. I checked the > spark-env.sh on each node and the appropriate SPARK_HOME is set > correctly…. > > > From: Sun Rui [mailto:sunrise_...@163.com] > Sent: 17 May 2016 11:32 >

Re: Tar File: On Spark

2016-05-19 Thread Sun Rui
from python? > > On 19 May 2016 16:57, "Sun Rui" <sunrise_...@163.com > <mailto:sunrise_...@163.com>> wrote: > 1. create a temp dir on HDFS, say “/tmp” > 2. write a script to create in the temp dir one file for each tar file. Each > file has only one

Re: dataframe stat corr for multiple columns

2016-05-19 Thread Sun Rui
There is an existing JIRA issue for it: https://issues.apache.org/jira/browse/SPARK-11057 Also there is an PR. Maybe we should help to review and merge it with a higher priority. > On May 20, 2016, at 00:09, Xiangrui Meng

Re: Does spark support Apache Arrow

2016-05-19 Thread Sun Rui
1. I don’t think so. 2. Arrow is for in-memory columnar execution, while cache is for in-memory columnar storage. > On May 20, 2016, at 10:16, Todd wrote: > > From the official site http://arrow.apache.org/, Apache Arrow is used for > Columnar In-Memory storage. I have two quick

Re: Tar File: On Spark

2016-05-19 Thread Sun Rui
1. create a temp dir on HDFS, say “/tmp” 2. write a script to create in the temp dir one file for each tar file. Each file has only one line: 3. Write a spark application. It is like: val rdd = sc.textFile () rdd.map { line => construct an untar command using the path information in

Re: Spark 1.6.0: substring on df.select

2016-05-12 Thread Sun Rui
Alternatively, you may try the built-in function: regexp_extract > On May 12, 2016, at 20:27, Ewan Leith wrote: > > You could use a UDF pretty easily, something like this should work, the > lastElement function could be changed to do pretty much any string >

Re: SparkR query

2016-05-17 Thread Sun Rui
Lewis, 1. Could you check the value of the “SPARK_HOME” environment variable on all of your worker nodes? 2. How did you start your SparkR shell? > On May 17, 2016, at 18:07, Mike Lewis wrote: > > Hi, > > I have a SparkR driver process that connects to a master running on

Re: Spark 2.0 on YARN - Dynamic Resource Allocation Behavior change?

2016-07-28 Thread Sun Rui
Yes, this is a change in Spark 2.0. You can take a look at https://issues.apache.org/jira/browse/SPARK-13723 In the latest Spark on YARN documentation for Spark 2.0, there is

Re: SPARK Exception thrown in awaitResult

2016-07-28 Thread Sun Rui
Are you using Mesos? If not, https://issues.apache.org/jira/browse/SPARK-16522 is not relevant. Could you describe more about your Spark environment and provide the full stack trace? > On Jul 28, 2016, at 17:44, Carlo.Allocca

Re: Spark 2.0 SparkSession, SparkConf, SparkContext

2016-07-27 Thread Sun Rui
If you want to keep using RDD API, then you still need to create SparkContext first. If you want to use just Dataset/DataFrame/SQL API, then you can directly create a SparkSession. Generally the SparkContext is hidden although it is internally created and held within the SparkSession. Anytime

Re: Application not showing in Spark History

2016-08-02 Thread Sun Rui
bin/spark-submit will set some env variables, like SPARK_HOME, that Spark later uses to locate spark-defaults.conf, from which default settings for Spark will be loaded. I would guess that some configuration option like spark.eventLog.enabled in spark-defaults.conf is skipped by

Re: [2.0.0] mapPartitions on DataFrame unable to find encoder

2016-08-02 Thread Sun Rui
import org.apache.spark.sql.catalyst.encoders.RowEncoder implicit val encoder = RowEncoder(df.schema) df.mapPartitions(_.take(1)) > On Aug 3, 2016, at 04:55, Dragisa Krsmanovic wrote: > > I am trying to use mapPartitions on DataFrame. > > Example: > > import

Re: How to partition a SparkDataFrame using all distinct column values in sparkR

2016-08-03 Thread Sun Rui
SparkDataFrame.repartition() uses hash partitioning; it can guarantee that all rows with the same column value go to the same partition, but it does not guarantee that each partition contains only a single column value. Fortunately, Spark 2.0 comes with gapply() in SparkR. You can apply an R
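A hedged sketch of gapply() for this use case in SparkR 2.0 (column names and the aggregation are illustrative; the output schema must be declared):

# df is a SparkDataFrame with columns "key" (string) and "value" (double)
schema <- structType(structField("key", "string"),
                     structField("mean_value", "double"))

result <- gapply(df, "key", function(key, part) {
  # part is a local R data.frame containing all rows for this key
  data.frame(key = key[[1]], mean_value = mean(part$value),
             stringsAsFactors = FALSE)
}, schema)

head(result)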

Re: Executors assigned to STS and number of workers in Stand Alone Mode

2016-08-03 Thread Sun Rui
--num-executors does not work for Standalone mode. Try --total-executor-cores > On Jul 26, 2016, at 00:17, Mich Talebzadeh wrote: > > Hi, > > > I am doing some tests > > I have started Spark in Standalone mode. > > For simplicity I am using one node only with 8

Re: how to run local[k] threads on a single core

2016-08-04 Thread Sun Rui
I don’t think it is possible, as Spark does not support thread-to-CPU affinity. > On Aug 4, 2016, at 14:27, sujeet jog wrote: > > Is there a way we can run multiple tasks concurrently on a single core in > local mode. > > for ex :- i have 5 partition ~ 5 tasks, and only a

Re: Issue in spark job. Remote rpc client dissociated

2016-07-14 Thread Sun Rui
Where is argsList defined? Is Launcher.main() thread-safe? Note that if multiple folders are processed in a node, multiple threads may concurrently run in the executor, each processing a folder. > On Jul 14, 2016, at 12:28, Balachandar R.A. wrote: > > Hello Ted, >

Re: How to convert from DataFrame to Dataset[Row]?

2016-07-16 Thread Sun Rui
For Spark 1.6.x, a DataFrame can't be directly converted to a Dataset[Row], but it can be done indirectly as follows: import org.apache.spark.sql.catalyst.encoders.RowEncoder // assume df is a DataFrame implicit val encoder: ExpressionEncoder[Row] = RowEncoder(df.schema) val ds = df.as[Row] However,

Re: Saving data frames on Spark Master/Driver

2016-07-14 Thread Sun Rui
You can simply save the join result distributedly, for example, as an HDFS file, and then copy the HDFS file to a local file. There is an alternative memory-efficient way to collect distributed data back to the driver other than collect(): toLocalIterator. The iterator will consume as much

Re: Enforcing shuffle hash join

2016-07-04 Thread Sun Rui
You can try setting the "spark.sql.join.preferSortMergeJoin" conf option to false. For detailed join strategies, take a look at the source code of SparkStrategies.scala: /** * Select the proper physical plan for join based on joining keys and size of logical plan. * * At first, uses the

Re: SparkR error when repartition is called

2016-08-09 Thread Sun Rui
I can’t reproduce your issue with len=1 in local mode. Could you give more environment information? > On Aug 9, 2016, at 11:35, Shane Lee wrote: > > Hi All, > > I am trying out SparkR 2.0 and have run into an issue with repartition. > > Here is the R code
