Re: Spark Streaming to Kafka

2015-05-19 Thread Saisai Shao
I think this is the PR you could refer to: https://github.com/apache/spark/pull/2994. 2015-05-20 13:41 GMT+08:00 twinkle sachdeva : > Hi, > > As Spark streaming is being nicely integrated with consuming messages from > Kafka, so I thought of asking the forum, that is there any implementation > ava

Re: Is the executor number fixed during the lifetime of one app ?

2015-05-26 Thread Saisai Shao
It depends on how you use Spark: if you run Spark on YARN and enable dynamic allocation, the number of executors is not fixed and will change dynamically according to the load. Thanks, Jerry 2015-05-27 14:44 GMT+08:00 canan chen : > It seems the executor number is fixed for the standalone mode, not

Re: Is the executor number fixed during the lifetime of one app ?

2015-05-27 Thread Saisai Shao
ean does it related > with parallelism of my RDD and how does driver know how many executor it > needs ? > > On Wed, May 27, 2015 at 2:49 PM, Saisai Shao > wrote: > >> It depends on how you use Spark, if you use Spark with Yarn and enable >> dynamic allocation, the numb

Re: Is the executor number fixed during the lifetime of one app ?

2015-05-27 Thread Saisai Shao
ode but can not figure out a way to do >> it in Yarn. >> >> On Wednesday, May 27, 2015, Saisai Shao wrote: >> >>> The driver has a heuristic mechanism to decide the number of executors at >>> run-time according to the pending tasks. You could enable with >

Re: ERROR cluster.YarnScheduler: Lost executor

2015-06-03 Thread Saisai Shao
I think you could check the YARN NodeManager log or the Spark executor logs to see the details. The exception stacks you listed above are just the symptom, not the cause. Normally there are several situations that will lead to executor loss: 1. Killed by YARN because memory exceeded,

Re: Spark 1.4: Python API for getting Kafka offsets in direct mode?

2015-06-11 Thread Saisai Shao
Hi, what do you mean by getting the offsets from the RDD? From my understanding, the offsetRange is a parameter you provide to KafkaRDD, so why do you still want to get back the one you previously set? Thanks, Jerry 2015-06-12 12:36 GMT+08:00 Amit Ramesh : > > Congratulations on the release of 1.

Re: Spark 1.4: Python API for getting Kafka offsets in direct mode?

2015-06-11 Thread Saisai Shao
processing guarantees across code deployments. > Spark does not support any state persistence across deployments so this is > something we need to handle on our own. > > Hope that helps. Let me know if not. > > Thanks! > Amit > > > On Thu, Jun 11, 2015 at 10:02 P

Re: Spark 1.4: Python API for getting Kafka offsets in direct mode?

2015-06-11 Thread Saisai Shao
amesh : > > Thanks, Jerry. That's what I suspected based on the code I looked at. Any > pointers on what is needed to build in this support would be great. This is > critical to the project we are currently working on. > > Thanks! > > > On Thu, Jun 11, 2015 at 10:54

Re: Reading Really Big File Stream from HDFS

2015-06-12 Thread Saisai Shao
Using sc.textFile will also read the file from HDFS line by line through an iterator; it doesn't need to fit everything into memory, so even with a small amount of memory it can still work. 2015-06-12 13:19 GMT+08:00 SLiZn Liu : > Hmm, you have a good point. So should I load the file by `sc.textFile()` >
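
A minimal sketch of the point (Scala; the HDFS path and the filter are only illustrative):

  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setAppName("big-file-scan"))
  // textFile is lazy: each partition is read through an iterator,
  // so the whole file never has to fit in memory at once.
  val lines = sc.textFile("hdfs:///data/really-big-file.txt")
  val errorCount = lines.filter(_.contains("ERROR")).count()
  println(s"error lines: $errorCount")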

Re: How to use Window Operations with kafka Direct-API?

2015-06-12 Thread Saisai Shao
I think you cannot use offsetRange in such a way: when you transform a DirectKafkaInputDStream into a WindowedDStream, internally the KafkaRDD is changed into a normal RDD, but offsetRange is an attribute specific to KafkaRDD, so when you cast a normal RDD to HasOffsetRanges, you will meet such exc
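
A common workaround is to grab the offset ranges from the direct stream before any windowing turns the KafkaRDDs into plain RDDs. A rough sketch against the 0.8 direct API (Scala; broker address, topic and window sizes are made up, and an existing StreamingContext ssc with a 10-second batch interval is assumed):

  import kafka.serializer.StringDecoder
  import org.apache.spark.streaming.Seconds
  import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}

  val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
  val direct = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, Set("events"))

  var offsetRanges = Array.empty[OffsetRange]
  val tagged = direct.transform { rdd =>
    // still a KafkaRDD at this point, so the cast is safe
    offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    rdd
  }
  // window afterwards; the windowed RDDs are plain RDDs, so never cast them to HasOffsetRanges
  tagged.map(_._2).window(Seconds(60), Seconds(20)).count().print()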

Re: Spark Configuration of spark.worker.cleanup.appDataTtl

2015-06-16 Thread Saisai Shao
I think you have to use "604800" instead of "7 * 24 * 3600"; obviously SparkConf will not do the multiplication for you. The exception is quite clear: "Caused by: java.lang.NumberFormatException: For input string: "3 * 24 * 3600"" 2015-06-16 14:52 GMT+08:00 : > Hi guys: > >I added a pa
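
For reference, 7 * 24 * 3600 = 604800 seconds. A sketch of how the cleanup settings are usually passed to the standalone workers (property names as in the standalone docs; the value is just the 7-day example):

  # spark-env.sh on each standalone worker
  export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.appDataTtl=604800"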

Re: Spark streaming: StorageLevel.MEMORY_AND_DISK_SER setting for KafkaUtils.createDirectStream

2016-03-02 Thread Saisai Shao
You don't have to specify the storage level for the direct Kafka API, since it doesn't need to store the input data ahead of time. Only the receiver-based approach lets you specify the storage level. Thanks Saisai On Wed, Mar 2, 2016 at 7:08 PM, Vinti Maheshwari wrote: > Hi All, > > I wanted to set *St
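
A rough sketch of the two APIs (Scala, 0.8-style Kafka integration; broker/ZooKeeper addresses and topic names are placeholders, and an existing StreamingContext ssc is assumed):

  import kafka.serializer.StringDecoder
  import org.apache.spark.storage.StorageLevel
  import org.apache.spark.streaming.kafka.KafkaUtils

  // Direct API: no receiver and no stored blocks, so there is no storage level to set
  val direct = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, Map("metadata.broker.list" -> "broker1:9092"), Set("events"))

  // Receiver-based API: blocks are stored before processing, so the storage level matters
  val receiverBased = KafkaUtils.createStream(
    ssc, "zk1:2181", "my-consumer-group", Map("events" -> 1),
    StorageLevel.MEMORY_AND_DISK_SER)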

Re: Spark executor killed without apparent reason

2016-03-03 Thread Saisai Shao
If it is due to a heartbeat problem and the driver explicitly killed the executors, there should be some driver logs mentioning it, so you could check the driver log. Container (executor) logs are also useful: if the container is killed, there'll be some signal-related logs, like (SIG

Re: How to compile Spark with private build of Hadoop

2016-03-08 Thread Saisai Shao
I think the first step is to publish your in-house Hadoop-related jars to your local Maven or Ivy repo, and then change the Spark build profiles, like -Phadoop-2.x (you could use 2.7, or change the pom file if you hit jar conflicts) and -Dhadoop.version=3.0.0-SNAPSHOT, to build agains
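
A sketch of the flow (commands are only illustrative; the profile name depends on the Spark version you are building):

  # 1. publish the in-house Hadoop artifacts to the local Maven repo
  cd /path/to/your/hadoop && mvn install -DskipTests
  # 2. build Spark against that Hadoop version
  cd /path/to/spark && ./build/mvn -Pyarn -Phadoop-2.7 -Dhadoop.version=3.0.0-SNAPSHOT -DskipTests clean package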

Re: Dynamic allocation doesn't work on YARN

2016-03-09 Thread Saisai Shao
Would you please send out your dynamic allocation configurations so we can understand better? On Wed, Mar 9, 2016 at 4:29 PM, Jy Chen wrote: > Hello everyone: > > I'm trying the dynamic allocation in Spark on YARN. I have followed > configuration steps and started the shuffle service. > > Now it c
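
For reference, the settings usually involved, sketched as spark-defaults.conf entries (values are just examples):

  spark.dynamicAllocation.enabled              true
  spark.shuffle.service.enabled                true
  spark.dynamicAllocation.minExecutors         1
  spark.dynamicAllocation.maxExecutors         20
  spark.dynamicAllocation.executorIdleTimeout  60s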

Re: Dynamic allocation doesn't work on YARN

2016-03-09 Thread Saisai Shao
"--conf" was lost > when I copied it to mail. > > -- Forwarded message -- > From: Jy Chen > Date: 2016-03-10 10:09 GMT+08:00 > Subject: Re: Dynamic allocation doesn't work on YARN > To: Saisai Shao , user@spark.apache.org > > > Hi, > My

Re: Spark streaming - update configuration while retaining write ahead log data?

2016-03-15 Thread Saisai Shao
Currently the configuration is part of the checkpoint data, and when recovering from a failure, Spark Streaming will fetch the configuration from the checkpoint data, so even if you change the configuration file, the recovered Spark Streaming application will not use it. So from my understanding, currently there's
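
The usual recovery pattern, sketched in Scala (checkpoint directory, app name and batch interval are placeholders) — whatever was captured in the checkpoint, including the SparkConf, is restored as-is:

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val checkpointDir = "hdfs:///checkpoints/my-streaming-app"

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("wal-app")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDir)
    // define input streams and transformations here
    ssc
  }

  // On restart this rebuilds the context from the checkpoint, old configuration included,
  // so edits to the configuration file are not picked up.
  val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
  ssc.start()
  ssc.awaitTermination()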

Re: Job failed while submitting python to yarn programatically

2016-03-15 Thread Saisai Shao
You cannot directly invoke a Spark application using yarn#client as you mentioned; that is deprecated and not supported. You have to use spark-submit to submit a Spark application to YARN. Also, the specific problem here is that you're invoking yarn#client to run the Spark app in yarn-client mode

Re: Enabling spark_shuffle service without restarting YARN Node Manager

2016-03-16 Thread Saisai Shao
If you want to avoid failing existing jobs while restarting the NM, you could enable work-preserving restart for the NM; in this case, restarting the NM will not affect the running containers (they can keep running). That could alleviate the NM restart problem. Thanks Saisai On Wed, Mar 16, 2016 at 6:30 PM, Alex D

Re: Issues facing while Running Spark Streaming Job in YARN cluster mode

2016-03-22 Thread Saisai Shao
I guess in local mode you're using the local FS instead of HDFS; here the exception is mainly thrown from HDFS when running on YARN. I think it would be better to check the status and configuration of HDFS to see whether it is normal or not. Thanks Saisai On Tue, Mar 22, 2016 at 5:46 PM, Soni spark wrote: > H

Re: Is there a way to submit spark job to your by YARN REST API?

2016-03-22 Thread Saisai Shao
I'm afraid submitting applications through the YARN REST API is currently not supported by Spark. However, the YARN AMRMClient is functionally equivalent to the REST API; I'm not sure which specific features you are referring to. Thanks Saisai On Tue, Mar 22, 2016 at 5:27 PM, tony@tendcloud.com < tony@tendcloud.

Re: Re: Is there a way to submit spark job to your by YARN REST API?

2016-03-22 Thread Saisai Shao
> WeChat: zhitao_yan > QQ: 4707059 > Address: Room 602, Aviation Service Building, Building 2, No. 39 Courtyard, Dongzhimenwai Street, Dongcheng District, Beijing > Postal code: 100027 > > -------- > TalkingData.com <http://talkingdata.com/> - Let data speak > > > *From:* Saisai Shao > *Date:* 2016-03-22 18:03 > *To:* tony@tendcloud.com > *CC:* user > *Subject:* Re: Is there a way to s

Re: Spark Metrics : Why is the Sink class declared private[spark] ?

2016-04-01 Thread Saisai Shao
There's a JIRA (https://issues.apache.org/jira/browse/SPARK-14151) about it, please take a look. Thanks Saisai On Sat, Apr 2, 2016 at 6:48 AM, Walid Lezzar wrote: > Hi, > > I looked into the spark code at how spark report metrics using the > MetricsSystem class. I've seen that the spark Metrics

Re: Eliminating shuffle write and spill disk IO reads/writes in Spark

2016-04-01 Thread Saisai Shao
Hi Michael, shuffle data (mapper output) has to be materialized to disk in the end, no matter how much memory you have; that is by design in Spark. In your scenario, since you have a lot of memory, shuffle spill should not happen frequently, and most of the disk IO you see might be the final shuffle fi

Re: Eliminating shuffle write and spill disk IO reads/writes in Spark

2016-04-01 Thread Saisai Shao
Fri, Apr 1, 2016, 7:25 PM Saisai Shao wrote: > >> Hi Michael, shuffle data (mapper output) have to be materialized into >> disk finally, no matter how large memory you have, it is the design purpose >> of Spark. In you scenario, since you have a big memory, shuffle spill >&g

Re: --packages configuration equivalent item name?

2016-04-04 Thread Saisai Shao
spark.jars.ivy, spark.jars.packages and spark.jars.excludes are the configurations you can use. Thanks Saisai On Sun, Apr 3, 2016 at 1:59 AM, Russell Jurney wrote: > Thanks, Andy! > > On Mon, Mar 28, 2016 at 8:44 AM, Andy Davidson < > a...@santacruzintegration.com> wrote: > >> Hi Russell >> >> I us
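
For example, the --packages equivalents sketched as spark-defaults.conf entries (the coordinates are illustrative; the format is the same groupId:artifactId:version list that --packages takes):

  spark.jars.packages   org.apache.spark:spark-streaming-kafka_2.10:1.6.1
  spark.jars.excludes   com.google.guava:guava
  spark.jars.ivy        /tmp/.ivy2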

Re: Detecting application restart when running in supervised cluster mode

2016-04-05 Thread Saisai Shao
Hi Deepak, I don't think supervise works with YARN; it is a Standalone- and Mesos-specific feature. Thanks Saisai On Tue, Apr 5, 2016 at 3:23 PM, Deepak Sharma wrote: > Hi Rafael > If you are using yarn as the engine , you can always use RM UI to see the > application progress. > > Than

Re: kafka direct streaming python API fromOffsets

2016-05-03 Thread Saisai Shao
I guess the problem is that py4j automatically translates the Python int into a Java int or long according to the value of the data. If the value is small it will be translated to a Java int, otherwise into a Java long. But in the Java code the parameter must be of long type, so that's the excep

Re: Fw: Significant performance difference for same spark job in scala vs pyspark

2016-05-05 Thread Saisai Shao
Writing an RDD-based application with PySpark brings additional overhead: Spark runs on the JVM whereas your Python code runs in the Python runtime, so data has to be communicated between the JVM world and the Python world, which requires additional serialization/deserialization and IPC. Also, oth

Re: How big the spark stream window could be ?

2016-05-09 Thread Saisai Shao
For window-related operators, Spark Streaming will cache the data in memory within the window. In your case the window size is up to 24 hours, which means data has to stay in the executors' memory for more than a day; this may introduce several problems when memory is not enough. On Mon, May 9, 2016

Re: How big the spark stream window could be ?

2016-05-09 Thread Saisai Shao
ile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* > > > > http://talebzadehmich.wordpress.com > > > > On 9 May 2016 at 08:14, Saisai Shao wrote: > >> For window related operators, Spark Streaming will cache the data into >> memory within this window,

Re: Re: How big the spark stream window could be ?

2016-05-09 Thread Saisai Shao
; > > > > At 2016-05-09 15:14:47, "Saisai Shao" wrote: > > For window related operators, Spark Streaming will cache the data into > memory within this window, in your case your window size is up to 24 hours, > which means data has to be in Executor's memory f

Re: Re: How big the spark stream window could be ?

2016-05-09 Thread Saisai Shao
06 PM, Ashok Kumar wrote: > hi, > > so if i have 10gb of streaming data coming in does it require 10gb of > memory in each node? > > also in that case why do we need using > > dstream.cache() > > thanks > > > On Monday, 9 May 2016, 9:58, Saisai Shao wrote

Re: Re: How big the spark stream window could be ?

2016-05-09 Thread Saisai Shao
y, it will be distributed evenly across the executors; this is also a target for tuning. Normally it depends on several conditions, like receiver distribution and partition distribution. > > The issue raises if the amount of streaming data does not fit into these 4 > caches? Will the job crash? > &

Re: spark uploading resource error

2016-05-10 Thread Saisai Shao
What version of Spark are you using? From my understanding, there's no code in yarn#client that uploads "__hadoop_conf__" into the distributed cache. On Tue, May 10, 2016 at 3:51 PM, 朱旻 wrote: > hi all: > I found a problem using spark . > WHEN I use spark-submit to launch a task. it works >

Re: Re: spark uploading resource error

2016-05-10 Thread Saisai Shao
, May 10, 2016 at 4:17 PM, 朱旻 wrote: > > > it was a product sold by huawei . name is FusionInsight. it says spark was > 1.3 with hadoop 2.7.1 > > where can i find the code or config file which define the files to be > uploaded? > > > At 2016-05-10 16:06:05, "S

Re: How to use Kafka as data source for Structured Streaming

2016-05-17 Thread Saisai Shao
It is not supported yet; currently only the file stream source is supported. Thanks Jerry On Wed, May 18, 2016 at 10:14 AM, Todd wrote: > Hi, > I am wondering whether structured streaming supports Kafka as data source. > I brief the source code(meanly related with DataSourceRegister trait), and > didn't fi

Re: Re: How to change output mode to Update

2016-05-17 Thread Saisai Shao
> .mode(SaveMode.Overwrite) From my understanding mode is not supported in continuous queries. def mode(saveMode: SaveMode): DataFrameWriter = { // mode() is used for non-continuous queries // outputMode() is used for continuous queries assertNotStreaming("mode() can only be called on non-co
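
In the released 2.0 API the streaming writer exposes outputMode() for this; a rough sketch (Scala; `counts` is assumed to be a streaming DataFrame produced by an aggregation):

  // mode() is only for batch writes; continuous queries use outputMode()
  val query = counts.writeStream
    .outputMode("complete")
    .format("console")
    .start()
  query.awaitTermination()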

Re: duplicate jar problem in yarn-cluster mode

2016-05-17 Thread Saisai Shao
I think it is already fixed, if your problem is exactly the same as the one mentioned in this JIRA (https://issues.apache.org/jira/browse/SPARK-14423). Thanks Jerry On Wed, May 18, 2016 at 2:46 AM, satish saley wrote: > Hello, > I am executing a simple code with yarn-cluster > > --master > yarn-clu

Re: File not found exception while reading from folder using textFileStream

2016-05-18 Thread Saisai Shao
From my understanding, we should copy the file into another folder and then move it into the source folder after the copy is finished; otherwise we will read half-copied data or hit the issue you mentioned above. On Wed, May 18, 2016 at 8:32 PM, Ted Yu wrote: > The following should handle the situation y
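
A sketch of the copy-then-rename pattern with the Hadoop FileSystem API (Scala; paths are placeholders and the default filesystem is assumed to be HDFS) — a rename within one filesystem is atomic, so the monitored folder only ever sees complete files:

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.{FileSystem, Path}

  val fs = FileSystem.get(new Configuration())
  val staging   = new Path("/staging/part-0001.txt")     // not watched by textFileStream
  val monitored = new Path("/incoming/part-0001.txt")    // watched by textFileStream

  // 1. copy the file somewhere outside the monitored folder
  fs.copyFromLocalFile(new Path("/local/data/part-0001.txt"), staging)
  // 2. move it into the monitored folder in one atomic step
  fs.rename(staging, monitored)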

Re: Map tuple to case class in Dataset

2016-05-31 Thread Saisai Shao
It works fine in my local test; I'm using the latest master, so maybe this bug has already been fixed. On Wed, Jun 1, 2016 at 7:29 AM, Michael Armbrust wrote: > Version of Spark? What is the exception? > > On Tue, May 31, 2016 at 4:17 PM, Tim Gautier > wrote: > >> How should I go about mapping from say a Da

YARN Application Timeline service with Spark 2.0.0 issue

2016-06-17 Thread Saisai Shao
Hi Community, in Spark 2.0.0 we upgraded to Jersey 2 ( https://issues.apache.org/jira/browse/SPARK-12154) instead of Jersey 1.9, while Hadoop as a whole still sticks with the old version. This will bring in some issues when the YARN timeline service is enabled ( https://issues.apache.org/jira/bro

Re: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher

2016-06-21 Thread Saisai Shao
spark.yarn.jar (none) The location of the Spark jar file, in case overriding the default location is desired. By default, Spark on YARN will use a Spark jar installed locally, but the Spark jar can also be in a world-readable location on HDFS. This allows YARN to cache it on nodes so that it doesn'

Re: problem running spark with yarn-client not using spark-submit

2016-06-26 Thread Saisai Shao
It means several jars are missing in the YARN container environment. If you want to submit your application in some way other than spark-submit, you have to take care of all the environment setup yourself. Since we don't know the implementation of your Java web service, it is hard to provide

Re: deploy-mode flag in spark-sql cli

2016-06-29 Thread Saisai Shao
I think you cannot use the SQL client in cluster mode; the same goes for spark-shell/pyspark, which have a REPL. All of these applications can only be started in client deploy mode. On Thu, Jun 30, 2016 at 12:46 PM, Mich Talebzadeh wrote: > Hi, > > When you use spark-shell or for that matter spark-sql, you a

Re: spark local dir to HDFS ?

2016-07-05 Thread Saisai Shao
Configuring local dirs to point to HDFS will not work. Local dirs are mainly used for data spill and shuffle data persistence, and HDFS is not suitable for that. If you hit a capacity problem, you could configure multiple dirs located on different mounted disks. On Wed, Jul 6, 2016 at 9:05 AM, Sri wrote:

Re: It seemed JavaDStream.print() did not work when launching via yarn on a single node

2016-07-06 Thread Saisai Shao
DStream.print() will collect some of the data to the driver and display it; please see the implementation of DStream.print(). RDD.take() will collect some of the data to the driver. Normally the behavior should be consistent between cluster and local mode; please find the root cause of this problem, like
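
Roughly what print() does per batch, sketched by hand (Scala; `stream` stands for any DStream):

  stream.foreachRDD { rdd =>
    // take() pulls a handful of records back into the driver JVM before printing
    rdd.take(11).foreach(println)
  }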

Re: scala.MatchError on stand-alone cluster mode

2016-07-15 Thread Saisai Shao
The error stack is thrown from your code: Caused by: scala.MatchError: [Ljava.lang.String;@68d279ec (of class [Ljava.lang.String;) at com.jd.deeplog.LogAggregator$.main(LogAggregator.scala:29) at com.jd.deeplog.LogAggregator.main(LogAggregator.scala) I think you should debug the

Re: How to submit app in cluster mode? port 7077 or 6066

2016-07-21 Thread Saisai Shao
I think both 6066 and 7077 will work. 6066 uses the REST way to submit applications, while 7077 is the legacy way. From the user's perspective it should be transparent, so there is no need to worry about the difference. - *URL:* spark://hw12100.local:7077 - *REST URL:* spark://hw12100.local:6066 (clu

Re: yarn.exceptions.ApplicationAttemptNotFoundException when trying to shut down spark applicaiton via yarn applicaiton --kill

2016-07-25 Thread Saisai Shao
Some useful information can be found here ( https://issues.apache.org/jira/browse/YARN-1842), though personally I haven't hit this problem before. Thanks Saisai On Tue, Jul 26, 2016 at 2:21 PM, Yu Wei wrote: > Hi guys, > > > When I tried to shut down spark application via "yarn application -

Re: Getting error, when I do df.show()

2016-08-01 Thread Saisai Shao
> > java.lang.NoClassDefFoundError: spray/json/JsonReader > > at > com.memsql.spark.pushdown.MemSQLPhysicalRDD$.fromAbstractQueryTree(MemSQLPhysicalRDD.scala:95) > > at > com.memsql.spark.pushdown.MemSQLPushdownStrategy.apply(MemSQLPushdownStrategy.scala:49) > Looks

Re: Spark on yarn, only 1 or 2 vcores getting allocated to the containers getting created.

2016-08-03 Thread Saisai Shao
Using the dominant resource calculator instead of the default resource calculator will get you the expected vcores. Basically, by default YARN does not honor CPU cores as a resource, so you will always see vcores = 1 no matter how many cores you set in Spark. On Wed, Aug 3, 2016 at 12:11 PM, sa
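
The switch is made on the YARN side; a sketch for capacity-scheduler.xml (property name as in the Hadoop capacity scheduler docs):

  <property>
    <name>yarn.scheduler.capacity.resource-calculator</name>
    <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
  </property>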

Re: spark 2.0.0 - how to build an uber-jar?

2016-08-03 Thread Saisai Shao
I guess you're referring to the Spark assembly uber jar. In Spark 2.0 there's no uber jar; instead there's a jars folder which contains all the jars required at run-time. For the end user it is transparent; the way to submit a Spark application is still the same. On Wed, Aug 3, 2016 at 4:51 PM, Mic

Re: submitting spark job with kerberized Hadoop issue

2016-08-07 Thread Saisai Shao
1. Standalone mode doesn't support accessing kerberized Hadoop, simply because it lacks a mechanism to distribute delegation tokens via the cluster manager. 2. For the HBase token fetching failure, I think you have to run kinit to generate a TGT before starting the Spark application ( http://hbase.apache.org/0

Re: Apache Spark toDebugString producing different output for python and scala repl

2016-08-15 Thread Saisai Shao
The RDD implementation behind the Python API and the Scala API is slightly different, so the difference in the RDD lineage you printed is expected. On Tue, Aug 16, 2016 at 10:58 AM, DEEPAK SHARMA wrote: > Hi All, > > > Below is the small piece of code in scala and python REPL in Apache > Spark.Howev

Re: Can't run spark on yarn

2015-12-17 Thread Saisai Shao
Please check the YARN AM log to see why the AM failed to start. That's why using `sc` gives such a complaint. On Fri, Dec 18, 2015 at 4:25 AM, Eran Witkon wrote: > Hi, > I am trying to install spark 1.5.2 on Apache hadoop 2.6 and Hive and yarn > > spark-env.sh > export HADOOP_CONF_DIR

Re: Spark Streaming - Number of RDDs in Dstream

2015-12-20 Thread Saisai Shao
Normally there will be one RDD in each batch. You could refer to the implementation of DStream#getOrCompute. On Mon, Dec 21, 2015 at 11:04 AM, Arun Patel wrote: > It may be simple question...But, I am struggling to understand this > > DStream is a sequence of RDDs created in a batch window

Re: Spark Streaming - Number of RDDs in Dstream

2015-12-21 Thread Saisai Shao
Yes, basically with the current implementation it should be. On Mon, Dec 21, 2015 at 6:39 PM, Arun Patel wrote: > So, Does that mean only one RDD is created by all receivers? > > > > On Sun, Dec 20, 2015 at 10:23 PM, Saisai Shao > wrote: > >> Normally there wi

Re: spark-submit is ignoring "--executor-cores"

2015-12-21 Thread Saisai Shao
Hi Siva, How did you know that --executor-cores is ignored and where did you see that only 1 Vcore is allocated? Thanks Saisai On Tue, Dec 22, 2015 at 9:08 AM, Siva wrote: > Hi Everyone, > > Observing a strange problem while submitting spark streaming job in > yarn-cluster mode through spark-s

Re: spark-submit is ignoring "--executor-cores"

2015-12-21 Thread Saisai Shao
; Thanks, > Sivakumar Bhavanari. > > On Mon, Dec 21, 2015 at 5:21 PM, Saisai Shao > wrote: > >> Hi Siva, >> >> How did you know that --executor-cores is ignored and where did you see >> that only 1 Vcore is allocated? >> >> Thanks >> Saisai &

Re: Job Error:Actor not found for: ActorSelection[Anchor(akka.tcp://sparkDriver@130.1.10.108:23600/)

2015-12-25 Thread Saisai Shao
I think SparkContext is thread-safe; you can concurrently submit jobs from different threads, so the problem you hit might not be related to this. Can you reproduce this issue every time you concurrently submit jobs, or does it happen occasionally? BTW, I guess you're using an old version of Spark

Re: Job Error:Actor not found for: ActorSelection[Anchor(akka.tcp://sparkDriver@130.1.10.108:23600/)

2015-12-25 Thread Saisai Shao
ource might be one potential cause; you'd better increase the VM resources and try again, just to verify your assumption. On Fri, Dec 25, 2015 at 4:28 PM, donhoff_h <165612...@qq.com> wrote: > Hi, Saisai Shao > > Many thanks for your reply. I used spark v1.3. Unfortunately I can not

Re: Opening Dynamic Scaling Executors on Yarn

2015-12-27 Thread Saisai Shao
Replacing all the shuffle jars and restarting the NodeManager is enough; no need to restart the NN. On Mon, Dec 28, 2015 at 2:05 PM, Jeff Zhang wrote: > See > http://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation > > > > On Mon, Dec 28, 2015 at 2:00 PM, 顾亮亮 wrote: > >> Hi a

Re: Opening Dynamic Scaling Executors on Yarn

2015-12-27 Thread Saisai Shao
luster? > > > > *From:* Saisai Shao [mailto:sai.sai.s...@gmail.com] > *Sent:* Monday, December 28, 2015 2:29 PM > *To:* Jeff Zhang > *Cc:* 顾亮亮; user@spark.apache.org; 刘骋昺 > *Subject:* Re: Opening Dynamic Scaling Executors on Yarn > > > > Replace all the shuffle jars

Re: Problem About Worker System.out

2015-12-28 Thread Saisai Shao
Stdout will not be sent back to the driver, whether you use Scala or Java. You must be doing something wrong that makes you think this is the expected behavior. On Mon, Dec 28, 2015 at 5:33 PM, David John wrote: > I have used Spark *1.4* for 6 months. Thanks all the members of this > community for yo

Re: OOM on yarn-cluster mode

2016-01-19 Thread Saisai Shao
You could try increasing the driver memory with "--driver-memory"; it looks like the OOM came from the driver side, so the simple solution is to increase the driver's memory. On Tue, Jan 19, 2016 at 1:15 PM, Julio Antonio Soto wrote: > Hi, > > I'm having trouble when uploadig spark jobs in yarn-cluster

Re: streaming textFileStream problem - got only ONE line

2016-01-26 Thread Saisai Shao
Is there any possibility that this file is still being written by another application, so that what Spark Streaming processed was an incomplete file? On Tue, Jan 26, 2016 at 5:30 AM, Shixiong(Ryan) Zhu wrote: > Did you move the file into "hdfs://helmhdfs/user/patcharee/cerdata/", or > write into it directly? `textFile

Re: How data locality is honored when spark is running on yarn

2016-01-27 Thread Saisai Shao
Hi Todd, there are two levels of locality-based scheduling when you run Spark on YARN with dynamic allocation enabled: 1. Container allocation is based on the locality ratio of pending tasks; this is YARN-specific and only works with dynamic allocation enabled. 2. Task scheduling is locality-aware,

Re: help with enabling spark dynamic allocation

2016-01-27 Thread Saisai Shao
You should also check the available YARN resources; overall, the number of containers that can be allocated is restricted by the YARN resources. I guess your YARN cluster can only allocate 3 containers here, so even if you set the initial number to 10, it still cannot be satisfied. On Wed, Jan 27, 2

Re: Programmatically launching spark on yarn-client mode no longer works in spark 1.5.2

2016-01-28 Thread Saisai Shao
I think I've met this problem before; it might be due to some race conditions during the exit period. The way you mentioned is still valid; this problem only occurs when stopping the application. Thanks Saisai On Fri, Jan 29, 2016 at 10:22 AM, Nirav Patel wrote: > Hi, we were using spark 1.3.1 a

Re: Programmatically launching spark on yarn-client mode no longer works in spark 1.5.2

2016-01-28 Thread Saisai Shao
g sparkcontext manually in your application > still works then I'll investigate more on my side. It just before I dig > more I wanted to know if it was still supported. > > Nir > > On Thu, Jan 28, 2016 at 7:47 PM, Saisai Shao > wrote: > >> I think I met th

Re: IllegalStateException : When use --executor-cores option in YARN

2016-02-14 Thread Saisai Shao
Hi Divya, would you please provide the full exception stack? From my understanding --executor-cores should work; we could understand better if you provide the full stack trace. Performance relies on many different aspects; I'd recommend you check the Spark web UI to see the application runt

Re: Yarn client mode: Setting environment variables

2016-02-17 Thread Saisai Shao
IIUC, if for example you want to set the environment variable FOO=bar on the executor side, you could use "spark.executorEnv.FOO=bar" in the conf file; the AM will pick up this configuration and set it as an environment variable when launching the container. Just list all the envs you want to set on the executor side like spark.executorEnv.xx
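
A minimal sketch in Scala (FOO=bar is the example from the thread; the app name is made up):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .setAppName("env-demo")
    // becomes the environment variable FOO=bar inside every executor container
    .set("spark.executorEnv.FOO", "bar")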

Re: Kafka streaming receiver approach - new topic not read from beginning

2016-02-22 Thread Saisai Shao
You could set the configuration "auto.offset.reset" through the "kafkaParams" parameter, which is accepted by some of the other overloaded createStream APIs. By default Kafka will pick up data from the latest offset unless you explicitly set it; this is the behavior of Kafka, not Spark. Thanks Saisai On Mon, Feb
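
A rough sketch with the overloaded receiver-based createStream that accepts kafkaParams (Scala, 0.8 integration; the ZooKeeper address, group id and topic are placeholders, and an existing StreamingContext ssc is assumed):

  import kafka.serializer.StringDecoder
  import org.apache.spark.storage.StorageLevel
  import org.apache.spark.streaming.kafka.KafkaUtils

  val kafkaParams = Map(
    "zookeeper.connect" -> "zk1:2181",
    "group.id"          -> "my-consumer-group",
    "auto.offset.reset" -> "smallest")   // start from the beginning when no offset is committed yet

  val stream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, Map("events" -> 1), StorageLevel.MEMORY_AND_DISK_SER)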

Re: Submitting with --deploy-mode cluster: uploading the jar

2015-09-30 Thread Saisai Shao
As I remember, you don't need to upload the application jar manually; Spark will do it for you when you use spark-submit. Would you mind posting your spark-submit command? On Wed, Sep 30, 2015 at 3:13 PM, Christophe Schmitz wrote: > Hi there, > > I am trying to use the "--deploy-mode cluste

Re: Submitting with --deploy-mode cluster: uploading the jar

2015-09-30 Thread Saisai Shao
rovide the full path of > the file. > > On the other hand with --deploy-mode client I don't need to do that, but > then I need to accept incoming connection in my client to serve the jar > (which is not possible behind a firewall I don't control) > > Thanks, > >

Re: an problem about zippartition

2015-10-13 Thread Saisai Shao
You have to call checkpoint regularly on rdd0 to cut the dependency chain; otherwise you will hit the problem you mentioned, and eventually even a stack overflow. This is a classic problem for highly iterative jobs; you can google it for the fix. On Tue, Oct 13, 2015 at 7:09 PM, 张仪yf1 wro
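
A sketch of the usual fix (Scala; the data, the transformation and the every-10-iterations interval are arbitrary, and an existing SparkContext sc is assumed):

  sc.setCheckpointDir("hdfs:///checkpoints/iterative-job")

  var rdd0 = sc.parallelize(1 to 1000000)
  for (i <- 1 to 200) {
    rdd0 = rdd0.map(_ + 1)      // stand-in for the real per-iteration transformation
    if (i % 10 == 0) {
      rdd0.checkpoint()         // truncates the lineage every 10 iterations
      rdd0.count()              // an action is needed to actually materialize the checkpoint
    }
  }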

Re: localhost webui port

2015-10-13 Thread Saisai Shao
By configuring "spark.ui.port" to a port you can bind to. On Tue, Oct 13, 2015 at 8:47 PM, Langston, Jim wrote: > Hi all, > > Is there anyway to change the default port 4040 for the localhost webUI, > unfortunately, that port is blocked and I have no control of that. I have > not found any conf
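
For example (the port value is arbitrary):

  val conf = new org.apache.spark.SparkConf().set("spark.ui.port", "4050")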

Re: an problem about zippartition

2015-10-13 Thread Saisai Shao
Maybe you could try "localCheckpoint" instead. On Wednesday, 14 October 2015, 张仪yf1 wrote: > Thank you for your reply. It helped a lot. But when the data became > bigger, the action cost more, is there any optimizer > > > > *From:* Saisai Shao [mailto:sai.sai.s...@gmail.com > ] >

Re: Node afinity for Kafka-Direct Stream

2015-10-14 Thread Saisai Shao
You could check the code of KafkaRDD: the locality (host) is obtained from Kafka's partition and set in KafkaRDD, and this is a hint for Spark to schedule the task on the preferred location. override def getPreferredLocations(thePart: Partition): Seq[String] = { val part = thePart.asInstanceOf[KafkaRDDPart

Re: Node afinity for Kafka-Direct Stream

2015-10-14 Thread Saisai Shao
This preferred locality is a hint for Spark to schedule Kafka tasks on the preferred nodes. If Kafka and Spark are two separate clusters, this locality hint obviously takes no effect, and Spark will schedule tasks following the node-local -> rack-local -> any pattern, like any other Spark tasks. On Wed,

Re: Spark_1.5.1_on_HortonWorks

2015-10-20 Thread Saisai Shao
Hi Frans, you could download the Spark 1.5.1 pre-built for Hadoop 2.6 tarball and copy it onto the HDP 2.3 sandbox or master node. Then copy all the conf files from /usr/hdp/current/spark-client/ to your /conf, or you could refer to this tech preview ( http://hortonworks.com/hadoop-tutorial/apache-spark-1-4-1-te

Re: Spark_1.5.1_on_HortonWorks

2015-10-21 Thread Saisai Shao
ually go and copy > spark-1.5.1 tarbal to all the nodes or is there any alternative so that I > can get it upgraded through Ambari UI ? If possible can anyone point me to > a documentation online? Thank you. > > Regards, > Ajay > > > On Wednesday, October 21, 2015, Saisai S

Re: Spark_1.5.1_on_HortonWorks

2015-10-21 Thread Saisai Shao
ryServer.scala) > > > I went to the lib folder and noticed that > "spark-assembly-1.5.1-hadoop2.6.0.jar" is missing that class. I was able to > get the spark history server started with 1.3.1 but not 1.5.1. Any inputs > on this? > > Really appreciate your help. Thank

Re: Spark_1.5.1_on_HortonWorks

2015-10-21 Thread Saisai Shao
ian.org > > "We grow because we share the same belief." > > > On Thu, Oct 22, 2015 at 8:56 AM, Saisai Shao > wrote: > > How you start history server, do you still use the history server of > 1.3.1, > > or you started the history server in 1.5.1? > > &g

Re: How to restart a failed Spark Streaming Application automatically in client mode on YARN

2015-10-22 Thread Saisai Shao
It looks like there's currently no way for Spark Streaming to restart automatically in yarn-client mode: in yarn-client mode the AM and the driver are two separate processes, and YARN only controls the restart of the AM, not the driver, so this is not supported in yarn-client mode. You can write some scripts to monitor you

Re: Exception while reading from kafka stream

2015-10-30 Thread Saisai Shao
What Spark version are you using? Also, a small code snippet of how you use Spark Streaming would be very helpful. On Fri, Oct 30, 2015 at 3:57 PM, Ramkumar V wrote: > I can able to read and print few lines. Afterthat i'm getting this > exception. Any idea for this ? > > *Thanks*, >

Re: Exception while reading from kafka stream

2015-10-30 Thread Saisai Shao
Void call(JavaPairRDD tuple) { > JavaRDDrdd = tuple.values(); > rdd.saveAsTextFile("hdfs://myuser:8020/user/hdfs/output"); > return null; > } >}); > > > *Thanks*, > <https://in.linkedin.com/in/ram

Re: Exception while reading from kafka stream

2015-10-30 Thread Saisai Shao
ystem unstable. On Fri, Oct 30, 2015 at 5:13 PM, Ramkumar V wrote: > No, i dont have any special settings. if i keep only reading line in my > code, it's throwing NPE. > > *Thanks*, > <https://in.linkedin.com/in/ramkumarcs31> > > > On Fri, Oct 30, 2015 at 2:14 PM, Sa

Re: Exception while reading from kafka stream

2015-10-30 Thread Saisai Shao
com/in/ramkumarcs31> > > > On Fri, Oct 30, 2015 at 2:50 PM, Saisai Shao > wrote: > >> I just did a local test with your code, seems everything is fine, the >> only difference is that I use the master branch, but I don't think it >> changes a lot in this part.

Re: Exception while reading from kafka stream

2015-10-30 Thread Saisai Shao
mingContext.start(JavaStreamingContext.scala:622) > > > *Thanks*, > <https://in.linkedin.com/in/ramkumarcs31> > > > On Fri, Oct 30, 2015 at 3:25 PM, Saisai Shao > wrote: > >> From the code, I think this field "rememberDuration" shouldn't be null, >>

Re: Does the Standalone cluster and Applications need to be same Spark version?

2015-11-03 Thread Saisai Shao
I think it will work unless you use some new APIs that only exist in the 1.5.1 release (mostly this will not happen). You'd better give it a try to see whether it runs or not. On Tue, Nov 3, 2015 at 10:11 AM, pnpritchard < nicholas.pritch...@falkonry.com> wrote: > The title gives the gist of it:

Re: How to unpersist a DStream in Spark Streaming

2015-11-04 Thread Saisai Shao
Hi Swetha, would you mind elaborating on your usage scenario for DStream unpersisting? From my understanding: 1. Spark Streaming will automatically unpersist outdated data (you already mentioned the configurations). 2. Once the streaming job is started, I think you may lose control of the job,

Re: [Yarn] Executor cores isolation

2015-11-10 Thread Saisai Shao
From my understanding, it depends on whether you enabled CGroups isolation in YARN. By default it is not enabled, which means you could allocate one core but spawn a lot of threads in your task to occupy CPU resources; it is just a logical limit. For YARN CPU isolation you may refer to this po

Re: Python Kafka support?

2015-11-10 Thread Saisai Shao
Hi Darren, functionality like messageHandler is missing from the Python API; it is still not included in version 1.5.1. Thanks Jerry On Wed, Nov 11, 2015 at 7:37 AM, Darren Govoni wrote: > Hi, > I read on this page > http://spark.apache.org/docs/latest/streaming-kafka-integration.html > about python supp

Re: dynamic allocation w/ spark streaming on mesos?

2015-11-11 Thread Saisai Shao
I think for receiver-less streaming connectors like the direct Kafka input stream or the HDFS connector, dynamic allocation can work, unlike with receiver-based streaming connectors, since for receiver-less connectors the behavior of the streaming app is more like a normal Spark app, so dynamic al

Re: dynamic allocation w/ spark streaming on mesos?

2015-11-11 Thread Saisai Shao
y to use all the > executors all the time, and no executor will remain idle for long. That is > why the heuristic doesnt work that well. > > > On Wed, Nov 11, 2015 at 6:32 PM, Saisai Shao > wrote: > >> I think for receiver-less Streaming connectors like direct Kafka inp

Re: How to enable MetricsServlet sink in Spark 1.5.0?

2015-11-16 Thread Saisai Shao
It should work. I tested in my local environment with "curl http://localhost:4040/metrics/json/", and metrics were dumped. For cluster metrics, you have to change your base URL to point to the cluster manager. Thanks Jerry On Mon, Nov 16, 2015 at 5:42 PM, ihavethepotential < ihavethepotent...@gmail

Re: YARN Labels

2015-11-16 Thread Saisai Shao
Node labels for the AM are not yet supported in Spark; currently only executors are supported. On Tue, Nov 17, 2015 at 7:57 AM, Ted Yu wrote: > Wangda, YARN committer, told me that support for selecting which nodes the > application master is running on is integrated to the upcoming hadoop 2.8.0 >

Re: A Problem About Running Spark 1.5 on YARN with Dynamic Alloction

2015-11-23 Thread Saisai Shao
Hi Tingwen, would you mind sharing your changes in ExecutorAllocationManager#addExecutors()? From my understanding and testing, dynamic allocation can work when you set the min and max number of executors to the same number. Please check your Spark and YARN logs to make sure the executors ar
