Debug spark core and streaming programs in scala

2016-05-15 Thread Deepak Sharma
Hi, I have a Scala program consisting of Spark Core and Spark Streaming APIs. Is there any open source tool that I can use to debug the program for performance reasons? My primary interest is to find the blocks of code that would be executed on the driver and what would go to the executors. Is there JMX
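For context, a minimal sketch of the driver/executor split the question is about, assuming a simple RDD job (all names here are illustrative, not from the original post):

    import org.apache.spark.{SparkConf, SparkContext}

    object WhereDoesItRun {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("where-does-it-run"))
        val threshold = 10                   // evaluated on the driver
        val data = sc.parallelize(1 to 100)  // RDD is defined on the driver
        // the closure passed to filter is serialized and executed on the executors
        val filtered = data.filter(_ > threshold)
        println(filtered.count())            // action triggered from, and result returned to, the driver
        sc.stop()
      }
    }

In short: code that builds RDDs and calls actions runs on the driver, while the functions passed into transformations run on the executors, which is usually the split a profiler or JMX-based tool needs to account for.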

Re: How to use the spark submit script / capability

2016-05-15 Thread John Trengrove
Assuming you are referring to running SparkSubmit.main programmatically; otherwise read this [1]. I can't find any Scaladocs for org.apache.spark.deploy.* but Oozie's [2] example of using SparkSubmit is pretty comprehensive. [1] http://spark.apache.org/docs/latest/submitting-applications.html [2]

Re: spark udf can not change a json string to a map

2016-05-15 Thread 喜之郎
This is my use case: another system uploads CSV files to my system. In the CSV files there are complicated data types such as map. In order to express complicated data types, and ordinary strings having special characters, we put urlencoded strings in the CSV files. So we use urlencoded json string to

Re: Executors and Cores

2016-05-15 Thread Mail.com
Hi Mich, We have HDP 2.3.2 where Spark will run on 21 nodes, each having 250 GB of memory. Jobs run in yarn-client and yarn-cluster mode. We have other teams using the same cluster to build their applications. Regards, Pradeep > On May 15, 2016, at 1:37 PM, Mich Talebzadeh

Errors when running SparkPi on a clean Spark 1.6.1 on Mesos

2016-05-15 Thread Richard Siebeling
Well, the task itself is completed (it indeed gives a result), but the tasks in Mesos say killed and it gives the error "Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues." Kind regards, Richard On Monday, May 16, 2016, Jacek Laskowski

Re: Structured Streaming in Spark 2.0 and DStreams

2016-05-15 Thread Yuval Itzchakov
Hi Ofir, Thanks for the detailed answer. I have read both documents, where they only touch lightly on infinite DataFrames/Datasets. However, they do not go into depth on how existing transformations on DStreams, for example, will be translated into the Dataset APIs. I've been browsing

Kafka stream message sampling

2016-05-15 Thread Samuel Zhou
Hi, I was trying to use filter to sample a Kafka direct stream, and the filter function just takes 1 message out of 10 by using hashcode % 10 == 0, but the number of input events for each batch didn't shrink to 10% of the original traffic. So I want to ask if there is any way to shrink the batch
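For reference, a minimal sketch of the sampling filter being described, assuming the Spark 1.x direct Kafka API; the broker address and topic name are placeholders:

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object SampledStream {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("kafka-sampling"), Seconds(10))
        val kafkaParams = Map("metadata.broker.list" -> "broker:9092")  // assumed broker address
        val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
          ssc, kafkaParams, Set("events"))                              // assumed topic name
        // keep roughly 1 message in 10; the other 9 are still consumed from Kafka,
        // so the reported input size per batch does not shrink, only what flows downstream
        val sampled = stream.filter { case (_, value) => math.abs(value.hashCode) % 10 == 0 }
        sampled.count().print()
        ssc.start()
        ssc.awaitTermination()
      }
    }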

Re: Executors and Cores

2016-05-15 Thread Jacek Laskowski
On Sun, May 15, 2016 at 8:19 AM, Mail.com wrote: > In all that I have seen, it seems each job has to be given the max resources > allowed in the cluster. Hi, I'm fairly sure it was because FIFO scheduling mode was used. You could change it to FAIR and make some
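A minimal sketch of flipping the in-application scheduler from FIFO to FAIR, assuming that is the scheduling mode being referred to (the pool name and allocation file path are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("fair-scheduling")
      .set("spark.scheduler.mode", "FAIR")  // default is FIFO
      .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")  // optional pool definitions
    val sc = new SparkContext(conf)
    // jobs submitted from a given thread can then be directed into a named pool
    sc.setLocalProperty("spark.scheduler.pool", "analytics")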

Re: Errors when running SparkPi on a clean Spark 1.6.1 on Mesos

2016-05-15 Thread Jacek Laskowski
On Sun, May 15, 2016 at 5:50 PM, Richard Siebeling wrote: > I'm getting the following errors running SparkPi on a clean just compiled > and checked Mesos 0.29.0 installation with Spark 1.6.1 > > 16/05/15 23:05:52 ERROR TaskSchedulerImpl: Lost executor >

Re: Errors when running SparkPi on a clean Spark 1.6.1 on Mesos

2016-05-15 Thread Jacek Laskowski
Hi Richard, I don't know the answer, but I just saw the way you've executed the examples and thought I'd share a slightly easier (?) way using run-example as follows: ./bin/run-example --verbose --master yarn --deploy-mode cluster SparkPi 1000 (I use YARN, so change that and possibly the deploy-mode).

Re: Structured Streaming in Spark 2.0 and DStreams

2016-05-15 Thread Benjamin Kim
Ofir, Thanks for the clarification. I was confused for the moment. The links will be very helpful. > On May 15, 2016, at 2:32 PM, Ofir Manor wrote: > > Ben, > I'm just a Spark user - but at least in March Spark Summit, that was the main > term used. > Taking a step

Re: Errors when running SparkPi on a clean Spark 1.6.1 on Mesos

2016-05-15 Thread Richard Siebeling
By the way, this is on a single-node cluster. On Sunday, May 15, 2016, Richard Siebeling wrote the following: > Hi, > > I'm getting the following errors running SparkPi on a clean just compiled > and checked Mesos 0.29.0 installation with Spark 1.6.1 > > 16/05/15 23:05:52

Errors when running SparkPi on a clean Spark 1.6.1 on Mesos

2016-05-15 Thread Richard Siebeling
Hi, I'm getting the following errors running SparkPi on a clean, just compiled and checked Mesos 0.29.0 installation with Spark 1.6.1: 16/05/15 23:05:52 ERROR TaskSchedulerImpl: Lost executor e23f2d53-22c5-40f0-918d-0d73805fdfec-S0/0 on xxx Remote RPC client disassociated. Likely due to containers

Re: Structured Streaming in Spark 2.0 and DStreams

2016-05-15 Thread Ofir Manor
Ben, I'm just a Spark user - but at least in March Spark Summit, that was the main term used. Taking a step back from the details, maybe this new post from Reynold is a better intro to Spark 2.0 highlights

Re: Structured Streaming in Spark 2.0 and DStreams

2016-05-15 Thread Benjamin Kim
Hi Ofir, I just recently saw the webinar with Reynold Xin. He mentioned the Spark Session unification efforts, but I don't remember the Dataset for Structured Streaming, aka Continuous Applications as he put it. He did mention streaming or unlimited DataFrames for Structured Streaming, so one

Re: Structured Streaming in Spark 2.0 and DStreams

2016-05-15 Thread Ofir Manor
Hi Yuval, let me share my understanding based on similar questions I had. First, Spark 2.x aims to replace a whole bunch of its APIs with just two main ones - SparkSession (replacing Hive/SQL/Spark Context) and Dataset (merging of Dataset and Dataframe - which is why it inherits all the SparkSQL

Re: JDBC SQL Server RDD

2016-05-15 Thread Mich Talebzadeh
Hi, Which version of Spark are you using? Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw * http://talebzadehmich.wordpress.com On 15 May

Re: How to use the spark submit script / capability

2016-05-15 Thread Marcelo Vanzin
As I mentioned, the "user document" is the Spark API documentation. On Sun, May 15, 2016 at 12:20 PM, Stephen Boesch wrote: > Hi Marcelo, here is the JIRA > https://issues.apache.org/jira/browse/SPARK-4924 > > Jeff Zhang >

Re: pyspark.zip and py4j-0.9-src.zip

2016-05-15 Thread Ted Yu
For py4j, adjust the version according to your need; the Maven coordinates are groupId net.sf.py4j, artifactId py4j, version 0.10.1. FYI On Sun, May 15, 2016 at 11:55 AM, satish saley wrote: > Hi, > Is there any way to pull in pyspark.zip and py4j-0.9-src.zip in maven > project? > > >

Re: How to use the spark submit script / capability

2016-05-15 Thread Stephen Boesch
Hi Marcelo, here is the JIRA https://issues.apache.org/jira/browse/SPARK-4924 Jeff Zhang added a comment - 26/Nov/15 08:15 Marcelo Vanzin Is there any user

JDBC SQL Server RDD

2016-05-15 Thread KhajaAsmath Mohammed
Hi, I am trying to test a SQL Server connection with JdbcRDD but am unable to connect. val myRDD = new JdbcRDD( sparkContext, () => DriverManager.getConnection(sqlServerConnectionString) , "select CTRY_NA,CTRY_SHRT_NA from dbo.CTRY limit ?, ?", 0, 5, 1, r => r.getString("CTRY_NA") + ",
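For comparison, a sketch of a JdbcRDD call that compiles against the Spark 1.x constructor; the driver class and the CTRY_ID bound column are assumptions, and note that the two '?' placeholders are bound by JdbcRDD to a numeric partitioning range (SQL Server does not accept MySQL-style "limit ?, ?"):

    import java.sql.DriverManager
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.JdbcRDD

    def countryRdd(sc: SparkContext, url: String): JdbcRDD[String] = {
      new JdbcRDD(
        sc,
        () => {
          Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver")  // assumed JDBC driver on the classpath
          DriverManager.getConnection(url)
        },
        "select CTRY_NA, CTRY_SHRT_NA from dbo.CTRY where CTRY_ID >= ? and CTRY_ID <= ?",  // CTRY_ID is an assumed numeric key
        0, 5, 1,
        r => r.getString("CTRY_NA") + "," + r.getString("CTRY_SHRT_NA"))
    }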

Re: How to use the spark submit script / capability

2016-05-15 Thread Marcelo Vanzin
I don't understand your question. The PR you mention is not about spark-submit. If you want help with spark-submit, check the Spark docs or "spark-submit -h". If you want help with the library added in the PR, check Spark's API documentation. On Sun, May 15, 2016 at 9:33 AM, Stephen Boesch

pyspark.zip and py4j-0.9-src.zip

2016-05-15 Thread satish saley
Hi, Is there any way to pull in pyspark.zip and py4j-0.9-src.zip in maven project?

Re: Executors and Cores

2016-05-15 Thread Mich Talebzadeh
Hi Pradeep, In your case, what type of cluster are we talking about? A standalone cluster? HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw *

Re: spark udf can not change a json string to a map

2016-05-15 Thread Ted Yu
Can you let us know more about your use case ? I wonder if you can structure your udf by not returning Map. Cheers On Sun, May 15, 2016 at 9:18 AM, 喜之郎 <251922...@qq.com> wrote: > Hi, all. I want to implement a udf which is used to change a json string > to a map. > But some

Re: orgin of error

2016-05-15 Thread Ted Yu
Adding back user@spark. From the namenode audit log, you should be able to find out who deleted part-r-00163-e94fa2c5-aa0d-4a08-b4c3-9fe7087ca493.gz.parquet and when. There might be other errors in the executor log which would give you more clues. On Sun, May 15, 2016 at 9:08 AM, pseudo oduesp

How to use the spark submit script / capability

2016-05-15 Thread Stephen Boesch
There is a committed PR from Marcelo Vanzin addressing that capability: https://github.com/apache/spark/pull/3916/files Is there any documentation on how to use this? The PR itself has two comments asking for docs that went unanswered.
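For what it's worth, the capability added in that PR is exposed as org.apache.spark.launcher.SparkLauncher; below is a minimal sketch of driving spark-submit through it, where the Spark home, jar path, main class and master are all placeholders:

    import org.apache.spark.launcher.SparkLauncher

    object LaunchExample {
      def main(args: Array[String]): Unit = {
        val process = new SparkLauncher()
          .setSparkHome("/opt/spark")              // assumed Spark installation path
          .setAppResource("/path/to/my-app.jar")   // assumed application jar
          .setMainClass("com.example.Main")        // assumed main class
          .setMaster("yarn-cluster")
          .setConf(SparkLauncher.DRIVER_MEMORY, "2g")
          .launch()                                // spawns spark-submit as a child process
        process.waitFor()
      }
    }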

spark udf can not change a json string to a map

2016-05-15 Thread ??????
Hi all, I want to implement a UDF which is used to change a JSON string to a map, but some problems occur. My Spark version: 1.5.1. My udf code: public Map evaluate(final String s) { if (s == null)
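For comparison, a rough Scala sketch of the same idea, assuming a flat JSON object whose values are strings and the json4s library that Spark already bundles (the registered name and the payload/events identifiers are illustrative):

    import org.json4s._
    import org.json4s.jackson.JsonMethods.parse

    def jsonToMap(s: String): Map[String, String] = {
      implicit val formats = DefaultFormats
      if (s == null) Map.empty[String, String]
      else parse(s).extract[Map[String, String]]   // fails if values are not plain strings
    }

    // registered on a SQLContext/HiveContext it can then be called from SQL:
    //   sqlContext.udf.register("json_to_map", jsonToMap _)
    //   sqlContext.sql("select json_to_map(payload) from events")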

Re: orgin of error

2016-05-15 Thread Ted Yu
bq. ExecutorLostFailure (executor 4 lost) Can you check the executor log for more clues? Which Spark release are you using? Cheers On Sun, May 15, 2016 at 8:47 AM, pseudo oduesp wrote: > someone can help me about this issues > > > > py4j.protocol.Py4JJavaError: An error

orgin of error

2016-05-15 Thread pseudo oduesp
Can someone help me with this issue? py4j.protocol.Py4JJavaError: An error occurred while calling o126.parquet. : org.apache.spark.SparkException: Job aborted. at

Re: Executors and Cores

2016-05-15 Thread Ted Yu
For the last question, have you looked at: https://spark.apache.org/docs/latest/configuration.html#dynamic-allocation FYI On Sun, May 15, 2016 at 5:19 AM, Mail.com wrote: > Hi , > > I have seen multiple videos on spark tuning which shows how to determine # > cores,
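A minimal sketch of the dynamic allocation settings that page describes, assuming YARN with the external shuffle service enabled; the numbers are illustrative:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("sized-by-demand")
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.shuffle.service.enabled", "true")      // required for dynamic allocation on YARN
      .set("spark.dynamicAllocation.minExecutors", "2")
      .set("spark.dynamicAllocation.maxExecutors", "20")
      .set("spark.executor.memory", "4g")
      .set("spark.executor.cores", "2")

With this, the number of executors grows and shrinks with the pending task backlog rather than being fixed per job.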

Executors and Cores

2016-05-15 Thread Mail.com
Hi, I have seen multiple videos on Spark tuning which show how to determine #cores, #executors and the memory size of a job. In all that I have seen, it seems each job has to be given the max resources allowed in the cluster. How do we factor in input size as well? I am processing a 1gb

Re: "collecting" DStream data

2016-05-15 Thread Daniel Haviv
I mistyped; the code is foreachRDD(r => arr ++= r.collect), and it does work for ArrayBuffer but not for HashMap. On Sun, May 15, 2016 at 3:04 PM, Silvio Fiorito < silvio.fior...@granturing.com> wrote: > Hi Daniel, > > Given your example, “arr” is defined on the driver, but the “foreachRDD” >

Re: "collecting" DStream data

2016-05-15 Thread Silvio Fiorito
Hi Daniel, Given your example, “arr” is defined on the driver, but the “foreachRDD” function is run on the executors. If you want to collect the results of the RDD/DStream down to the driver you need to call RDD.collect. You have to be careful though that you have enough memory on the driver
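A minimal sketch of that pattern, assuming a hypothetical DStream[(String, Int)] named values:

    import scala.collection.mutable
    import org.apache.spark.streaming.dstream.DStream

    val acc = mutable.HashMap.empty[String, Int]   // lives on the driver

    def collectToDriver(values: DStream[(String, Int)]): Unit = {
      values.foreachRDD { rdd =>
        // the body of foreachRDD runs on the driver; rdd.collect() ships the
        // partition data back, so ++= updates the driver-side map. Mutating acc
        // inside rdd.map or rdd.foreach would only touch executor-side copies.
        acc ++= rdd.collect()
      }
    }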

Structured Streaming in Spark 2.0 and DStreams

2016-05-15 Thread Yuval.Itzchakov
I've been reading/watching videos about the upcoming Spark 2.0 release which brings us Structured Streaming. One thing I've yet to understand is how this relates to the current state of working with Streaming in Spark with the DStream abstraction. All examples I can find, in the Spark

"collecting" DStream data

2016-05-15 Thread Daniel Haviv
Hi, I have a DStream whose values I'd like to collect and broadcast. To do so I've created a mutable HashMap which I'm filling with foreachRDD, but when I check it, it remains empty. If I use ArrayBuffer it works as expected. This is my code: val arr =

Re: spark sql write orc table on viewFS throws exception

2016-05-15 Thread Mich Talebzadeh
I am not sure this is going to resolve the INSERT OVERWRITE into ORC table issue. Can you go to Hive, run show create table custom.rank_less_orc_none and send the output? Is that table defined as transactional? Another alternative is to use Spark to insert into a normal text table and do an insert