Optimal Amount of Tasks Per size of data in memory

2016-07-20 Thread Brandon White
What is the best heuristic for setting the number of partitions/task on an RDD based on the size of the RDD in memory? The Spark docs say that the number of partitions/tasks should be 2-3x the number of CPU cores but this does not make sense for all data sizes. Sometimes, this number is way to
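As a back-of-the-envelope sketch (not an official rule), one common approach is to derive the partition count from an estimated dataset size and keep the 2-3x-cores figure only as a lower bound; the 128 MB target per partition and the `sc` / `rdd` names below are assumptions:

    // minimal Scala sketch, assuming an existing SparkContext `sc` and RDD `rdd`
    val totalCores = sc.defaultParallelism                 // parallelism known to the scheduler
    val targetPartitionBytes = 128L * 1024 * 1024          // assumed target size per partition
    val estimatedBytes = 50L * 1024 * 1024 * 1024          // your own estimate of the RDD's in-memory size, e.g. ~50 GB
    val numPartitions = math.max(totalCores * 3, (estimatedBytes / targetPartitionBytes).toInt)
    val repartitioned = rdd.repartition(numPartitions)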

Re: ML PipelineModel to be scored locally

2016-07-20 Thread Simone
Thanks for your reply. I cannot rely on jpmml due to licensing issues. I can evaluate writing my own prediction code, but I am looking for a more general-purpose approach. Any other thoughts? Best Simone - Original Message - From: "Peyman Mohajerian" Sent:

Ratings in mllib.recommendation

2016-07-20 Thread glen

calculate time difference between consecutive rows

2016-07-20 Thread Divya Gehlot
I have a dataset of time as shown below : Time1 07:30:23 07:34:34 07:38:23 07:39:12 07:45:20 I need to find the diff between two consecutive rows. I googled and found that the *lag* function in *spark* helps in finding it, but it's giving me *null* in the result set. Would really appreciate the help.
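A minimal sketch of the usual fix, assuming a DataFrame `df` with a string column "time1" in HH:mm:ss format: lag only works over a window with an ordering, and the first row of the window is always null because it has no predecessor.

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, lag, unix_timestamp}

    val w = Window.orderBy("time1")                        // no partitionBy: fine for a small demo set
    val withDiff = df
      .withColumn("ts", unix_timestamp(col("time1"), "HH:mm:ss"))
      .withColumn("prev_ts", lag(col("ts"), 1).over(w))
      .withColumn("time_diff", col("ts") - col("prev_ts")) // null only on the first row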

Re: Understanding spark concepts cluster, master, slave, job, stage, worker, executor, task

2016-07-20 Thread Sachin Mittal
Hi, Thanks for the links, is there any English translation for the same? Sachin On Thu, Jul 21, 2016 at 8:34 AM, Taotao.Li wrote: > Hi, Sachin, here are two posts about the basic concepts about spark: > > >- spark-questions-concepts >

Re: the spark job is so slow - almost frozen

2016-07-20 Thread Zhiliang Zhu
Thanks a lot for your kind help. On Wednesday, July 20, 2016 11:35 AM, Andrew Ehrlich wrote: Try: - filtering down the data as soon as possible in the job, dropping columns you don't need - processing fewer partitions of the hive tables at a time - caching

Re: run spark apps in linux crontab

2016-07-20 Thread Mich Talebzadeh
you should source the environment file beforehand or in the script. For example, this one is ksh type: 0,5,10,15,20,25,30,35,40,45,50,55 * * * * (/home/hduser/dba/bin/send_messages_to_Kafka.ksh > /var/tmp/send_messages_to_Kafka.err 2>&1) in that shell it sources the environment file # # Main Section #

Re: write and call UDF in spark dataframe

2016-07-20 Thread Mich Talebzadeh
something similar def ChangeToDate (word : String) : Date = { //return TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(word,"dd/MM/yyyy"),"yyyy-MM-dd")) val d1 = Date.valueOf(ReverseDate(word)) return d1 } sqlContext.udf.register("ChangeToDate", ChangeToDate(_:String)) Dr Mich Talebzadeh LinkedIn

getting null when calculating time diff with unix_timestamp + spark 1.6

2016-07-20 Thread Divya Gehlot
Hi, val lags=sqlContext.sql("select *,(unix_timestamp(time1,'$timeFmt') - lag(unix_timestamp(time2,'$timeFmt'))) as time_diff from df_table"); Instead of the time difference in seconds I am getting null. Would really appreciate the help. Thanks, Divya

Re: run spark apps in linux crontab

2016-07-20 Thread Chanh Le
you should use command.sh | tee file.log > On Jul 21, 2016, at 10:36 AM, > wrote: > > > thank you focus, and all. > this problem solved by adding a line ". /etc/profile" in my shell. > > > > > ThanksBest

Re: Re: run spark apps in linux crontab

2016-07-20 Thread luohui20001
thank you focus, and all. This problem was solved by adding the line ". /etc/profile" in my shell. Thanks & Best regards! San.Luo - Original Message - From: "focus" To: "luohui20001", "user@spark.apache.org"

Re: XLConnect in SparkR

2016-07-20 Thread Felix Cheung
From looking at the XLConnect package, its loadWorkbook() function only supports reading from a local file path, so you might need a way to call an HDFS command to get the file from HDFS first. SparkR currently does not support this - you could read it in as a text file (I don't think .xlsx is a

Re: Understanding spark concepts cluster, master, slave, job, stage, worker, executor, task

2016-07-20 Thread Taotao.Li
Hi, Sachin, here are two posts about the basic concepts about spark: - spark-questions-concepts - deep-into-spark-exection-model And, I fully recommend

Re: write and call UDF in spark dataframe

2016-07-20 Thread Divya Gehlot
Hi, To be very specific, I am looking for the UDF syntax, for example one which takes a String as parameter and returns an integer... how do we define the return type? Thanks, Divya On 21 July 2016 at 00:24, Andy Davidson wrote: > Hi Divya > > In general you will get
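For what it's worth, a minimal sketch of such a UDF in the DataFrame DSL; the return type is inferred from the Scala function's signature, and the column name "name" is just an assumption:

    import org.apache.spark.sql.functions.udf

    // String => Int UDF; the Int return type comes from the function itself
    val wordLength = udf((s: String) => if (s == null) 0 else s.length)
    val withLen = df.withColumn("name_length", wordLength(df("name")))

    // SQL-style registration works the same way:
    sqlContext.udf.register("wordLength", (s: String) => if (s == null) 0 else s.length)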

Re: Role-based S3 access outside of EMR

2016-07-20 Thread Everett Anderson
Thanks, Andy. I am indeed often doing something similar, now -- copying data locally rather than dealing with the S3 impl selection and AWS credentials issues. It'd be nice if it worked a little easier out of the box, though! On Tue, Jul 19, 2016 at 2:47 PM, Andy Davidson <

Re: Subquery in having-clause (Spark 1.1.0)

2016-07-20 Thread rickn
Seeing the same results on the current 1.6.2 release ... just wanted to confirm. Are there any workarounds? Do I need to wait for 2.0 for support? https://issues.apache.org/jira/browse/SPARK-12543 Thank you -- View this message in context:

Re: HiveThriftServer and spark.sql.hive.thriftServer.singleSession setting

2016-07-20 Thread Chang Lim
It's an issue with the preview build. Switched to RC5 and all is working. Thanks to Michael Armbrust. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/HiveThriftServer-and-spark-sql-hive-thriftServer-singleSession-setting-tp27340p27379.html Sent from the

Re: MultiThreading in Spark 1.6.0

2016-07-20 Thread Maciej Bryński
RK Aduri, Another idea is to union all results and then run collect. The question is how big the collected data is. 2016-07-20 20:32 GMT+02:00 RK Aduri : > Spark version: 1.6.0 > So, here is the background: > > I have a data frame (Large_Row_DataFrame) which I have
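A minimal sketch of that suggestion, assuming `results` is a non-empty Seq[DataFrame] with identical schemas (Spark 1.6 API):

    val combined = results.reduce(_ unionAll _)   // DataFrame.unionAll in Spark 1.6
    val rows = combined.collect()                 // only safe if the combined result fits in driver memory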

Re: PySpark 2.0 Structured Streaming Question

2016-07-20 Thread Tathagata Das
foreachWriter is not currently available in Python. We don't have a clear plan yet on when foreachWriter will be available in Python. On Wed, Jul 20, 2016 at 1:22 PM, A.W. Covert III wrote: > Hi All, > > I've been digging into spark 2.0, I have some streaming jobs running

Re: Little idea needed

2016-07-20 Thread Aakash Basu
Thanks for the detailed description buddy. But this will actually be done through NiFi (end to end) so we need to add the delta logic inside NiFi to automate the whole process. That's why we need a good (best) solution to solve this problem. Since this is a classic issue which we can face any

Re: Little idea needed

2016-07-20 Thread Aakash Basu
Your second point: That's going to be a bottleneck for all the programs which will fetch the data from that folder and again add extra filters onto the DF. I want to finish that off there itself. And that merge logic is weak when one table is huge and the other is very small (which is the case

PySpark 2.0 Structured Streaming Question

2016-07-20 Thread A.W. Covert III
Hi All, I've been digging into spark 2.0, I have some streaming jobs running well on YARN, and I'm working on some Spark Structured Streaming jobs now. I have a couple of jobs I'd like to move to Structured Streaming with the `foreachWriter` but it's not available in PySpark yet. Is it just

SparkWebUI and Master URL on EC2

2016-07-20 Thread KhajaAsmath Mohammed
Hi, I got access to a spark cluster and have instantiated spark-shell on aws using the command $spark-shell. The Spark shell started successfully but I am looking to access the WebUI and Master URL. Does anyone know how to access that in AWS? I tried http://IPMaster:4040 and http://IpMaster:8080 but it

Re: ML PipelineModel to be scored locally

2016-07-20 Thread Peyman Mohajerian
One option is to save the model in parquet or json format and then build your own prediction code. Some also use: https://github.com/jpmml/jpmml-sparkml It depends on the model, e.g. ml vs mllib and other factors, whether this works or not. A couple of weeks ago there was a long discussion on
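For the save/load part, a minimal sketch using ML pipeline persistence (available for many stages in Spark 1.6+/2.0); the path is an assumption, and scoring the reloaded model still needs a Spark context, which is exactly the limitation discussed in this thread:

    import org.apache.spark.ml.PipelineModel

    fittedModel.write.overwrite().save("/models/my-pipeline")   // hypothetical path
    val reloaded = PipelineModel.load("/models/my-pipeline")
    val scored = reloaded.transform(testDF)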

Using multiple data sources in one stream

2016-07-20 Thread Joe Panciera
Hi, I have a rather complicated situation that's raised an issue regarding consuming multiple data sources for processing. Unlike the use cases I've found, I have 3 sources of different formats. There's one 'main' stream A that does the processing, and 2 sources B and C that provide elements

Re: HiveThriftServer and spark.sql.hive.thriftServer.singleSession setting

2016-07-20 Thread Chang Lim
Would appreciate it if someone: 1. Can confirm if this is an issue, or 2. Can share how to get HiveThriftServer2.startWithContext working with a shared temp table. I am using Beeline as the JDBC client to access the temp tables of the running Spark app. -- View this message in context:

MultiThreading in Spark 1.6.0

2016-07-20 Thread RK Aduri
Spark version: 1.6.0 So, here is the background: I have a data frame (Large_Row_DataFrame) which I have created from an array of row objects and also have another array of unique ids (U_ID) which I’m going to use to look up into the Large_Row_DataFrame (which is cached) to do a

Re: Saving a pyspark.ml.feature.PCA model

2016-07-20 Thread Ajinkya Kale
Just found Google dataproc has a preview of spark 2.0. Tried it and save/load works! Thanks Shuai. Followup question - is there a way to export the pyspark.ml models to PMML ? If not, what is the best way to integrate the model for inference in a production service ? On Tue, Jul 19, 2016 at 8:22

Re: Building standalone spark application via sbt

2016-07-20 Thread Sachin Mittal
I got the error during run time. It was for mongo-spark-connector class files. My build.sbt is like this name := "Test Advice Project" version := "1.0" scalaVersion := "2.10.6" libraryDependencies ++= Seq( "org.mongodb.spark" %% "mongo-spark-connector" % "1.0.0", "org.apache.spark" %%

Re: Building standalone spark application via sbt

2016-07-20 Thread Marco Mistroni
that will work but ideally you should not include any of the spark-related jars as they are provided to you by the spark environment whenever you launch your app via spark-submit (this will prevent unexpected errors, e.g. when you kick off your app using a different version of spark where some of
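A minimal build.sbt sketch of that advice, mirroring the snippet above but marking the Spark artifacts as "provided" so the assembly jar does not bundle them (versions are assumptions, match them to your cluster):

    name := "Test Advice Project"
    version := "1.0"
    scalaVersion := "2.10.6"
    libraryDependencies ++= Seq(
      "org.mongodb.spark" %% "mongo-spark-connector" % "1.0.0",
      "org.apache.spark"  %% "spark-core"            % "1.6.1" % "provided",
      "org.apache.spark"  %% "spark-sql"             % "1.6.1" % "provided"
    )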

Attribute name "sum(proceeds)" contains invalid character(s) among " ,;{}()\n\t="

2016-07-20 Thread Chanh Le
Hi everybody, I got an error because the column names don't follow the naming rule. Please tell me the way to fix it. Here is my code. metricFields is a Seq of metrics: spent, proceed, click, impression. sqlContext .sql(s"select * from hourly where time between '$dateStr-00' and
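One way to avoid the invalid characters is to alias every aggregate so the generated name never contains "()"; a hedged sketch along the lines of the query above ($dateStr and the grouping column are assumptions):

    val aggregated = sqlContext.sql(
      s"""select time,
         |       sum(spent)      as spent,
         |       sum(proceeds)   as proceeds,
         |       sum(click)      as click,
         |       sum(impression) as impression
         |from hourly
         |where time between '$dateStr-00' and '$dateStr-23'
         |group by time""".stripMargin)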

Re: Understanding spark concepts cluster, master, slave, job, stage, worker, executor, task

2016-07-20 Thread Jean Georges Perrin
Hey, I love when questions are numbered, it's easier :) 1) Yes (but I am not an expert) 2) You don't control... One of my processes goes to 8k tasks, so... 3) Yes, if you have HT, it doubles. My servers have 12 cores, but HT, so it makes 24. 4) From my understanding: Slave is the logical

Re: OutOfMemory when doing joins in spark 2.0 while same code runs fine in spark 1.5.2

2016-07-20 Thread Ian O'Connell
Ravi, did your issue ever get solved for this? I think I've been hitting the same thing; it looks like the spark.sql.autoBroadcastJoinThreshold stuff isn't kicking in as expected. If I set that to -1 then the computation proceeds successfully. On Tue, Jun 14, 2016 at 12:28 AM, Ravi Aggarwal
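For reference, a minimal sketch of that workaround (Spark 2.0 SparkSession API; the spark-submit line is just one way to pass it):

    // disable automatic broadcast joins so the planner falls back to SortMergeJoin
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

    // or at submit time:
    //   spark-submit --conf spark.sql.autoBroadcastJoinThreshold=-1 ...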

Understanding spark concepts cluster, master, slave, job, stage, worker, executor, task

2016-07-20 Thread Sachin Mittal
Hi, I was able to build and run my spark application via spark submit. I have understood some of the concepts by going through the resources at https://spark.apache.org but few doubts still remain. I have few specific questions and would be glad if someone could share some light on it. So I

Re: RandomForestClassifier

2016-07-20 Thread Marco Mistroni
Hi, afaik yes (others please correct me). Generally, in RandomForest and DecisionTree you have a column which you are trying to 'predict' (the label) and a set of features that are used to predict the outcome. I would assume that if you specify the label column and the 'features' column, everything
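A minimal sketch confirming that reading of the API: only the columns named by labelCol and featuresCol are consumed during fitting, and any extra columns in the training DataFrame are simply carried along unused (column names below are assumptions):

    import org.apache.spark.ml.classification.RandomForestClassifier

    val rf = new RandomForestClassifier()
      .setLabelCol("label")        // assumed label column
      .setFeaturesCol("features")  // assumed vector column, e.g. built by VectorAssembler
      .setNumTrees(50)

    val model = rf.fit(trainingDF) // trainingDF may carry extra columns; they are ignored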

Re: Building standalone spark application via sbt

2016-07-20 Thread Sachin Mittal
NoClassDefFound error was for spark classes like say SparkContext. When running a standalone spark application I was not passing external jars using the --jars option. However I have fixed this by making a fat jar using the sbt assembly plugin. Now all the dependencies are included in that jar and I use

RandomForestClassifier

2016-07-20 Thread pseudo oduesp
hi, we have parameters named labelCol="label" and featuresCol="features". When I specify the values here (label and features), if I train my model on a data frame with other columns, does the algorithm choose only the label column and the features column? thanks

Re: Spark driver getting out of memory

2016-07-20 Thread RK Aduri
Cache defaults to MEMORY_ONLY. Can you try different storage levels, i.e., MEMORY_ONLY_SER or even DISK_ONLY? You may want to use persist() instead of cache(). Or there is an experimental storage level OFF_HEAP which might also help. On Tue, Jul 19, 2016 at 11:08 PM, Saurav Sinha
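A minimal sketch of that suggestion, assuming a cached DataFrame named `largeDF`:

    import org.apache.spark.storage.StorageLevel

    // explicit storage level instead of cache() (which is MEMORY_ONLY)
    val persisted = largeDF.persist(StorageLevel.MEMORY_ONLY_SER)  // or DISK_ONLY
    // release it once the lookups are done
    persisted.unpersist()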

Storm HDFS bolt equivalent in Spark Streaming.

2016-07-20 Thread Rajesh_Kalluri
While writing to HDFS from Storm, the HDFS bolt provides a nice way to batch the messages, rotate files, apply a file name convention, etc., as shown below. Do you know of something similar in Spark Streaming or do we have to roll our own? If anyone attempted this can

How to connect HBase and Spark using Python?

2016-07-20 Thread Def_Os
I'd like to know whether there's any way to query HBase with Spark SQL via the PySpark interface. See my question on SO: http://stackoverflow.com/questions/38470114/how-to-connect-hbase-and-spark-using-python The new HBase-Spark module in HBase, which introduces the HBaseContext/JavaHBaseContext,

Re: write and call UDF in spark dataframe

2016-07-20 Thread Andy Davidson
Hi Divya In general you will get better performance if you can minimize your use of UDFs. Spark 2.0/Tungsten does a lot of code generation. It will have to treat your UDF as a black box. Andy From: Rishabh Bhardwaj Date: Wednesday, July 20, 2016 at 4:22 AM To: Rabin

Re: Building standalone spark application via sbt

2016-07-20 Thread Marco Mistroni
Hello Sachin pls paste the NoClassDefFound Exception so we can see what's failing, also please advise how you are running your Spark App. For an extremely simple case, let's assume you have your MyFirstSparkApp packaged in your myFirstSparkApp.jar Then all you need to do would be to kick off

difference between two consecutive rows of same column + spark + dataframe

2016-07-20 Thread Divya Gehlot
Hi, I have a dataset of time as shown below : Time1 07:30:23 07:34:34 07:38:23 07:39:12 07:45:20 I need to find the diff between two consecutive rows. I googled and found that the *lag* function in *spark* helps in finding it, but it's not giving me *null* in the result set. Would really appreciate

Re: spark worker continuously trying to connect to master and failed in standalone mode

2016-07-20 Thread Igor Berman
In addition, check what IP the master is binding to (with netstat). On 20 July 2016 at 06:12, Andrew Ehrlich wrote: > Troubleshooting steps: > > $ telnet localhost 7077 (on master, to confirm port is open) > $ telnet 7077 (on slave, to confirm port is blocked) > > If the port

Re: Is it good choice to use DAO to store results generated by spark application?

2016-07-20 Thread Yu Wei
This is a startup project. We don't know how much data will be written every day. Definitely, there is not too much data at the beginning. But data will increase later. And we want to use spark streaming to receive data via MQTT Util. We're now evaluating which components could be used for storing

Re: Building standalone spark application via sbt

2016-07-20 Thread Mich Talebzadeh
you need an uber jar file. Have you actually followed the dependencies and project sub-directory build? check this. http://stackoverflow.com/questions/28459333/how-to-build-an-uber-jar-fat-jar-using-sbt-within-intellij-idea under three answers the top one. I started reading the official SBT

Re: Is it good choice to use DAO to store results generated by spark application?

2016-07-20 Thread Ted Yu
You can decide which component(s) to use for storing your data. If you haven't used hbase before, it may be better to store data on hdfs and query through Hive or SparkSQL. Maintaining hbase is not trivial task, especially when the cluster size is large. How much data are you expecting to be

Re: Is it good choice to use DAO to store results generated by spark application?

2016-07-20 Thread Yu Wei
I'm beginner to big data. I don't have too much knowledge about hbase/hive. What's the difference between hbase and hive/hdfs for storing data for analytics? Thanks, Jared From: ayan guha Sent: Wednesday, July 20, 2016 9:34:24 PM To:

Re: Latest 200 messages per topic

2016-07-20 Thread Cody Koeninger
If they're files in a file system, and you don't actually need multiple kinds of consumers, have you considered streamingContext.fileStream instead of kafka? On Wed, Jul 20, 2016 at 5:40 AM, Rabin Banerjee wrote: > Hi Cody, > > Thanks for your reply . > >Let
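A minimal sketch of that alternative, watching the drop directory directly with Spark Streaming; the path and batch interval are assumptions, and fileStream (instead of textFileStream) would let you plug in a custom input format for the XML:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(30))
    val xmlFiles = ssc.textFileStream("hdfs:///incoming/xml")   // picks up newly arrived files
    xmlFiles.foreachRDD { rdd => /* parse the small XML payloads here */ }
    ssc.start()
    ssc.awaitTermination()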

ML PipelineModel to be scored locally

2016-07-20 Thread Simone Miraglia
Hi all, I am working on the following use case involving ML Pipelines. 1. I created a Pipeline composed from a set of stages 2. I called "fit" method on my training set 3. I validated my model by calling "transform" on my test set 4. I stored my fitted Pipeline to a shared folder Then I have a

Snappy initialization issue, spark assembly jar missing snappy classes?

2016-07-20 Thread Eugene Morozov
Greetings! We're reading input files with newApiHadoopFile that is configured with multiline split. Everything's fine, besides https://issues.apache.org/jira/browse/MAPREDUCE-6549. It looks like the issue is fixed, but within hadoop 2.7.2. Which means we have to download spark without hadoop and

Re: Spark Job trigger in production

2016-07-20 Thread Sathish Kumaran Vairavelu
If you are using Mesos, then you can use Chronos or Marathon On Wed, Jul 20, 2016 at 6:08 AM Rabin Banerjee wrote: > ++ crontab :) > > On Wed, Jul 20, 2016 at 9:07 AM, Andrew Ehrlich > wrote: > >> Another option is Oozie with the spark action: >>

Spark 1.6.2 Spark-SQL RACK_LOCAL

2016-07-20 Thread chandana
Hive - 1.2.1 AWS EMR 4.7.2 I have external tables with partitions from s3. With Spark 1.6.1 I had good performance, with NODE_LOCAL data about 7x faster than RACK_LOCAL data. With Spark 1.6.2 and AWS EMR 4.7.2, my node locality is 0! Rack locality 100%. I am using the default settings and didn't

Re: Is it good choice to use DAO to store results generated by spark application?

2016-07-20 Thread ayan guha
Just as a sanity check, saving data to hbase for analytics may not be the best choice. Any specific reason for not using hdfs or hive? On 20 Jul 2016 20:57, "Rabin Banerjee" wrote: > Hi Wei , > > You can do something like this , > > foreachPartition( (part) => {

lift coefficient

2016-07-20 Thread pseudo oduesp
Hi, how can we calculate the lift coefficient from a pyspark prediction result? thanks

Re: Little idea needed

2016-07-20 Thread Mich Talebzadeh
In reality, true real-time analytics will require interrogating the transaction (redo) log of the RDBMS database to look for changes. An RDBMS will only keep one current record (the most recent), so if a record has been deleted since the last import into HDFS, that record will not exist. If the record has been

Re: write and call UDF in spark dataframe

2016-07-20 Thread Mich Talebzadeh
Yep, something along the lines of val df = sqlContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss') as time ") Note that this does not require a column from an already existing table. HTH Dr Mich Talebzadeh LinkedIn *

Re: write and call UDF in spark dataframe

2016-07-20 Thread Rishabh Bhardwaj
Hi Divya, There is already a "from_unixtime" function in org.apache.spark.sql.functions; Rabin has used that in the sql query. If you want to use it in the dataframe DSL you can try like this: val new_df = df.select(from_unixtime($"time").as("newtime")) Thanks, Rishabh. On Wed, Jul 20, 2016 at 4:21

Re: Spark Job trigger in production

2016-07-20 Thread Rabin Banerjee
++ crontab :) On Wed, Jul 20, 2016 at 9:07 AM, Andrew Ehrlich wrote: > Another option is Oozie with the spark action: > https://oozie.apache.org/docs/4.2.0/DG_SparkActionExtension.html > > On Jul 18, 2016, at 12:15 AM, Jagat Singh wrote: > > You can

Re: Storm HDFS bolt equivalent in Spark Streaming.

2016-07-20 Thread Rabin Banerjee
++Deepak, There is also an option to use saveAsHadoopFile & saveAsNewAPIHadoopFile, in which you can customize (filename and many other things...) the way you want to save it. :) Happy Sparking Regards, Rabin Banerjee On Wed, Jul 20, 2016 at 10:01 AM, Deepak Sharma wrote:
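A minimal sketch of what that control looks like, assuming a DStream named `stream`; the paths and the timestamp-based directory naming are assumptions:

    // one output directory per batch: events-<batchTime>.txt
    stream.saveAsTextFiles("hdfs:///out/events", "txt")

    // or drop to the RDD level for full control over naming
    stream.foreachRDD { (rdd, time) =>
      rdd.saveAsTextFile(s"hdfs:///out/custom/batch-${time.milliseconds}")
    }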

Re: Running multiple Spark Jobs on Yarn( Client mode)

2016-07-20 Thread Rabin Banerjee
Hi Vaibhav, Please check your YARN configuration and make sure you have available resources. Please try creating multiple queues, and submit jobs on those queues: --queue thequeue Regards, Rabin Banerjee On Wed, Jul 20, 2016 at 12:05 PM, vaibhavrtk wrote: > I have a silly

Re: Is it good choice to use DAO to store results generated by spark application?

2016-07-20 Thread Rabin Banerjee
Hi Wei , You can do something like this , foreachPartition( (part) => {val conn = ConnectionFactory.createConnection(HBaseConfiguration.create()); val table = conn.getTable(TableName.valueOf(tablename)); //part.foreach((inp)=>{println(inp);table.put(inp)}) //This is line by line put
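A slightly fuller sketch of that pattern with the plain HBase client API, assuming an RDD of (rowKey, payload) string pairs and a hypothetical table name; one connection is opened per partition rather than per record:

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
    import org.apache.hadoop.hbase.util.Bytes

    rdd.foreachPartition { part =>
      val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
      val table = conn.getTable(TableName.valueOf("mqtt_events"))   // table name is an assumption
      part.foreach { case (rowKey, payload) =>
        val put = new Put(Bytes.toBytes(rowKey))
        put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("payload"), Bytes.toBytes(payload))
        table.put(put)
      }
      table.close()
      conn.close()
    }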

Re: write and call UDF in spark dataframe

2016-07-20 Thread Rabin Banerjee
Hi Divya , Try, val df = sqlContext.sql("select from_unixtime(ts,'yyyy-MM-dd') as `ts` from mr") Regards, Rabin On Wed, Jul 20, 2016 at 12:44 PM, Divya Gehlot wrote: > Hi, > Could somebody share example of writing and calling udf which converts > unix tme stamp to

Re: XLConnect in SparkR

2016-07-20 Thread Rabin Banerjee
Hi Yogesh , I have never tried reading XLS files using Spark. But I think you can use sc.wholeTextFiles to read the complete xls at once; as xls files are xml internally, you need to read them all to parse. Then I think you can use Apache POI to read them. Also, you can copy your XLS data

Re: run spark apps in linux crontab

2016-07-20 Thread Rabin Banerjee
Hi, Please check your deploy mode and master. For example, if you want to deploy in yarn cluster mode you should use --master yarn-cluster; if you want to do it in yarn client mode you should use --master yarn-client. Please note that for your case deploying yarn-cluster will be better, as cluster

Re: Latest 200 messages per topic

2016-07-20 Thread Rabin Banerjee
Hi Cody, Thanks for your reply. Let me elaborate a bit: we have a directory where small xml (90 KB) files are continuously arriving (pushed from another node). Each file has an ID & timestamp in its name and also inside the record. Data coming into the directory has to be pushed to Kafka to finally get into

Re: run spark apps in linux crontab

2016-07-20 Thread focus
Hi, I just met this problem, too! The reason is that the crontab runtime doesn't have the variables you defined, such as $SPARK_HOME. I defined SPARK_HOME and other variables in /etc/profile like this: export MYSCRIPTS=/opt/myscripts export SPARK_HOME=/opt/spark then, in my crontab job script

RE: run spark apps in linux crontab

2016-07-20 Thread Joaquin Alzola
Remember that you need to source your .bashrc for your PATH to be set up. From: luohui20...@sina.com [mailto:luohui20...@sina.com] Sent: 20 July 2016 11:01 To: user Subject: run spark apps in linux crontab hi guys: I add a spark-submit job into my Linux crontab

run spark apps in linux crontab

2016-07-20 Thread luohui20001
hi guys: I added a spark-submit job into my Linux crontab list by the means below, however none of them works. If I change it to a normal shell script, it is ok. I don't quite understand why. I checked the 8080 web UI of my spark cluster: no job was submitted, and there are no messages in

Re: Building standalone spark application via sbt

2016-07-20 Thread Sachin Mittal
Hi, I am following the example under https://spark.apache.org/docs/latest/quick-start.html For standalone scala application. I added all my dependencies via build.sbt (one dependency is under lib folder). When I run sbt package I see the jar created under target/scala-2.10/ So compile seems to

XLConnect in SparkR

2016-07-20 Thread Yogesh Vyas
Hi, I am trying to load and read excel sheets from HDFS in sparkR using XLConnect package. Can anyone help me in finding out how to read xls files from HDFS in sparkR ? Regards, Yogesh - To unsubscribe e-mail:

How spark decides whether to do BroadcastHashJoin or SortMergeJoin

2016-07-20 Thread raaggarw
Hi, How does Spark decide/optimize internally when it needs to do a BroadcastHashJoin vs a SortMergeJoin? Is there any way we can guide it from outside, or through options, as to which join to use? Because in my case when I am trying to do a join, spark makes that join a BroadcastHashJoin internally and when
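For reference, a hedged sketch of the two usual knobs: Spark picks BroadcastHashJoin when one side's estimated size is under spark.sql.autoBroadcastJoinThreshold (default ~10 MB) and falls back to SortMergeJoin otherwise; you can force a broadcast with the broadcast() hint or disable auto-broadcasting entirely (DataFrame and column names below are assumptions):

    import org.apache.spark.sql.functions.broadcast

    val joined = largeDF.join(broadcast(smallDF), Seq("id"))            // force broadcast of smallDF
    sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "-1")    // disable auto-broadcast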

write and call UDF in spark dataframe

2016-07-20 Thread Divya Gehlot
Hi, Could somebody share an example of writing and calling a UDF which converts a unix timestamp to date time? Thanks, Divya

Re: Is it good choice to use DAO to store results generated by spark application?

2016-07-20 Thread Yu Wei
I need to write all data received from MQTT into hbase for further processing. It's not the final result. I also need to read the data from hbase for analysis. Is it a good choice to use DAO in such a situation? Thx, Jared From: Deepak Sharma

Running multiple Spark Jobs on Yarn( Client mode)

2016-07-20 Thread vaibhavrtk
I have a silly question: Do multiple spark jobs running on yarn have any impact on each other? e.g. If the traffic on one streaming job increases too much does it have any effect on second job? Will it slow it down or any other consequences? I have enough resources(memory,cores) for both jobs in

Re: Is it good choice to use DAO to store results generated by spark application?

2016-07-20 Thread Yu Wei
Hi Ted, I also noticed HBASE-13992. I have never used anything similar to a DAO. As a general rule, which is the better choice when working with Spark and HBase: the hbase-spark module, a DAO, or the HBase client API? I'm a beginner to big data. Any guidance is very helpful for me. Thanks, Jared

Re: Spark driver getting out of memory

2016-07-20 Thread Saurav Sinha
Hi, I have set driver memory to 10 GB and the job ran with intermediate failures which were recovered by Spark. But I still want to know: if the number of parts increases, does the driver RAM need to be increased, and what is the ratio of number of parts to RAM? @RK : I am using cache on RDD. Is this the reason for high RAM

Running multiple Spark Jobs on Yarn( Client mode)

2016-07-20 Thread Vaibhav Nagpal
I have a silly question: Do multiple spark jobs running on yarn have any impact on each other? e.g. If the traffic on one streaming job increases too much does it have any effect on second job? Will it slow it down or any other consequences? Thanks Vaibhav