Debug spark core and streaming programs in scala

2016-05-15 Thread Deepak Sharma
Hi
I have a Scala program consisting of Spark Core and Spark Streaming APIs.
Is there any open source tool that I can use to debug the program for
performance reasons?
My primary interest is to find the blocks of code that would be executed on
the driver and what would go to the executors.
Is there a JMX extension for Spark?

-- 
Thanks
Deepak
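
As a rough, illustrative sketch (not from the thread; the input path is a
placeholder), the split usually works out like this: code inside RDD closures
runs in executor tasks, everything outside a closure runs on the driver. Spark's
metrics system also ships a JMX sink (org.apache.spark.metrics.sink.JmxSink)
that can be enabled in conf/metrics.properties.

val conf = new org.apache.spark.SparkConf().setAppName("driver-vs-executor")
val sc = new org.apache.spark.SparkContext(conf)   // runs on the driver

val lines = sc.textFile("hdfs:///some/input")      // driver: only builds the lineage
val lengths = lines.map(_.length)                  // closure executes on executors
                   .filter(_ > 0)                  // executors
val total = lengths.reduce(_ + _)                  // partial reduces on executors,
                                                   // final merge on the driver
println(s"total = $total")                         // driver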


Re: How to use the spark submit script / capability

2016-05-15 Thread John Trengrove
Assuming you are referring to running SparkSubmit.main programmatically;
otherwise read this [1].

I can't find any scaladocs for org.apache.spark.deploy.* but Oozie's [2]
example of using SparkSubmit is pretty comprehensive.

[1] http://spark.apache.org/docs/latest/submitting-applications.html
[2]
https://github.com/apache/oozie/blob/master/sharelib/spark/src/main/java/org/apache/oozie/action/hadoop/SparkMain.java

John

2016-05-16 2:33 GMT+10:00 Stephen Boesch :

>
> There is a committed PR from Marcelo Vanzin addressing that capability:
>
> https://github.com/apache/spark/pull/3916/files
>
> Is there any documentation on how to use this?  The PR itself has two
> comments asking for the docs that were not answered.
>


Re: spark udf can not change a json string to a map

2016-05-15 Thread 喜之郎
this is my use case:
   Another system uploads csv files to my system. In the csv files there are
complicated data types such as map. In order to express complicated data types,
and ordinary strings that contain special characters, we put urlencoded strings
in the csv files. So we use urlencoded json strings to express map, string and array.


second stage:
  load the csv files into a Spark text table.
###
CREATE TABLE `a_text`(
  parameters  string
)
load data inpath 'XXX' into table a_text;
#
Third stage:
 insert into a Spark parquet table by selecting from the text table. In order to take
advantage of complicated data types, we use a udf to transform the json string into a
map, and put the map into the table.


CREATE TABLE `a_parquet`(
  parameters   map
)



insert into a_parquet select UDF(parameters ) from a_text;


So do you have any suggestions?












------------------ Original message ------------------
From: "Ted Yu";
Sent: Monday, May 16, 2016, 0:44
To: "喜之郎" <251922...@qq.com>;
Cc: "user";
Subject: Re: spark udf can not change a json string to a map



Can you let us know more about your use case ?

I wonder if you can structure your udf by not returning Map.


Cheers


On Sun, May 15, 2016 at 9:18 AM, 喜之郎 <251922...@qq.com> wrote:
Hi, all. I want to implement a udf which is used to change a json string to a
map.
But a problem occurs. My Spark version: 1.5.1.




my udf code:

public Map evaluate(final String s) {
if (s == null)
return null;
return getString(s);
}


@SuppressWarnings("unchecked")
public static Map getString(String s) {
try {
String str =  URLDecoder.decode(s, "UTF-8");
ObjectMapper mapper = new ObjectMapper();
Map  map = mapper.readValue(str, 
Map.class);

return map;
} catch (Exception e) {
return new HashMap();
}
}

#
exception infos:


16/05/14 21:05:22 ERROR CliDriver: org.apache.spark.sql.AnalysisException: Map 
type in java is unsupported because JVM type erasure makes spark fail to catch 
key and value types in Map<>; line 1 pos 352
at 
org.apache.spark.sql.hive.HiveInspectors$class.javaClassToDataType(HiveInspectors.scala:230)
at 
org.apache.spark.sql.hive.HiveSimpleUDF.javaClassToDataType(hiveUDFs.scala:107)
at org.apache.spark.sql.hive.HiveSimpleUDF.<init>(hiveUDFs.scala:136)






I have seen that there is a test suite in Spark which says Spark does not support this
kind of udf.
But is there a way to implement this udf?
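
One possible direction, offered only as a sketch and not verified on 1.5.1: register a
Scala UDF instead of the Hive Java UDF, since Spark SQL can infer MapType from a Scala
function returning Map[String, String]. The function name is made up, and the Jackson
import assumes the com.fasterxml ObjectMapper.

import java.net.URLDecoder
import scala.collection.JavaConverters._
import com.fasterxml.jackson.databind.ObjectMapper

sqlContext.udf.register("jsonToMap", (s: String) => {
  if (s == null) null
  else {
    val decoded = URLDecoder.decode(s, "UTF-8")
    // parse into a java.util.Map, then convert to an immutable Scala Map
    val parsed = new ObjectMapper().readValue(decoded, classOf[java.util.Map[String, String]])
    parsed.asScala.toMap
  }
})

After registering, the insert could stay roughly as before:
insert into a_parquet select jsonToMap(parameters) from a_text;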

Re: Executors and Cores

2016-05-15 Thread Mail.com
Hi Mich,

We have HDP 2.3.2, where Spark will run on 21 nodes, each having 250 GB of memory.
Jobs run in yarn-client and yarn-cluster mode.

We have other teams using the same cluster to build their applications.

Regards,
Pradeep


> On May 15, 2016, at 1:37 PM, Mich Talebzadeh  
> wrote:
> 
> Hi Pradeep,
> 
> In your case, what type of cluster are we talking about? A standalone cluster?
> 
> HTh
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
>  
> 
>> On 15 May 2016 at 13:19, Mail.com  wrote:
>> Hi ,
>> 
>> I have seen multiple videos on Spark tuning which show how to determine the #
>> cores, # executors and memory size of the job.
>>
>> In all that I have seen, it seems each job has to be given the max resources
>> allowed in the cluster.
>>
>> How do we factor in input size as well? If I am processing a 1 GB compressed
>> file, then I can live with, say, 10 executors and not 21, etc.
>>
>> Also, do we consider other jobs in the cluster that could be running? I will
>> use only 20 GB out of the available 300 GB, etc.
>> 
>> Thanks,
>> Pradeep
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
> 
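
For what it's worth, a purely illustrative way to cap a single job below the cluster
maximum on YARN (the numbers are made up for a small input, not a recommendation):

val conf = new org.apache.spark.SparkConf()
  .setAppName("small-input-job")
  .set("spark.executor.instances", "10")   // fewer than one executor per node
  .set("spark.executor.cores", "2")
  .set("spark.executor.memory", "4g")      // well under the 250 GB per node
val sc = new org.apache.spark.SparkContext(conf)

The same settings can be passed to spark-submit as --num-executors,
--executor-cores and --executor-memory.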


Errors when running SparkPi on a clean Spark 1.6.1 on Mesos

2016-05-15 Thread Richard Siebeling
Well, the task itself is completed (it indeed gives a result), but the tasks
in Mesos say "killed" and it gives the error "Remote RPC client
disassociated. Likely due to containers exceeding thresholds, or network
issues."

Kind regards,
Richard

On Monday, May 16, 2016, Jacek Laskowski wrote:

> On Sun, May 15, 2016 at 5:50 PM, Richard Siebeling 
> wrote:
>
> > I'm getting the following errors running SparkPi on a clean just compiled
> > and checked Mesos 0.29.0 installation with Spark 1.6.1
> >
> > 16/05/15 23:05:52 ERROR TaskSchedulerImpl: Lost executor
> > e23f2d53-22c5-40f0-918d-0d73805fdfec-S0/0 on xxx Remote RPC client
> > disassociated. Likely due to containers exceeding thresholds, or network
> > issues. Check driver logs for WARN messages.
>
> Looking at it again, I don't see an issue here. Why do you think it
> doesn't work for you? After Pi is calculated, the executors were taken
> down since the driver finished calculation (and closed SparkContext).
>
> Jacek
>


Re: Structured Streaming in Spark 2.0 and DStreams

2016-05-15 Thread Yuval Itzchakov
Hi Ofir,
Thanks for the detailed answer. I have read both documents, but they only
touch lightly on infinite DataFrames/Datasets. However, they do not go into
depth as regards how existing transformations on DStreams, for example,
will be transformed into the Dataset APIs. I've been browsing the 2.0
branch and have not yet been able to understand how they correlate.

Also, placing SparkSession in the sql package seems like a peculiar choice,
since this is going to be the global abstraction over
SparkContext/StreamingContext from now on.

On Sun, May 15, 2016, 23:42 Ofir Manor  wrote:

> Hi Yuval,
> let me share my understanding based on similar questions I had.
> First, Spark 2.x aims to replace a whole bunch of its APIs with just two
> main ones - SparkSession (replacing Hive/SQL/Spark Context) and Dataset
> (merging of Dataset and Dataframe - which is why it inherits all the
> SparkSQL goodness), while RDD seems as a low-level API only for special
> cases. The new Dataset should also support both batch and streaming -
> replacing (eventually) DStream as well. See the design docs in SPARK-13485
> (unified API) and SPARK-8360 (StructuredStreaming) for a good intro.
> However, as you noted, not all will be fully delivered in 2.0. For
> example, it seems that streaming from / to Kafka using StructuredStreaming
> didn't make it (so far?) to 2.0 (which is a showstopper for me).
> Anyway, as far as I understand, you should be able to apply stateful
> operators (non-RDD) on Datasets (for example, the new event-time window
> processing SPARK-8360). The gap I see is mostly limited streaming sources /
> sinks migrated to the new (richer) API and semantics.
> Anyway, I'm pretty sure once 2.0 gets to RC, the documentation and
> examples will align with the current offering...
>
>
> Ofir Manor
>
> Co-Founder & CTO | Equalum
>
> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
>
> On Sun, May 15, 2016 at 1:52 PM, Yuval.Itzchakov 
> wrote:
>
>> I've been reading/watching videos about the upcoming Spark 2.0 release
>> which
>> brings us Structured Streaming. One thing I've yet to understand is how
>> this
>> relates to the current state of working with Streaming in Spark with the
>> DStream abstraction.
>>
>> All examples I can find, in the Spark repository/different videos is
>> someone
>> streaming local JSON files or reading from HDFS/S3/SQL. Also, when
>> browsing
>> the source, SparkSession seems to be defined inside org.apache.spark.sql,
>> so
>> this gives me a hunch that this is somehow all related to SQL and the
>> likes,
>> and not really to DStreams.
>>
>> What I'm failing to understand is: Will this feature impact how we do
>> Streaming today? Will I be able to consume a Kafka source in a streaming
>> fashion (like we do today when we open a stream using KafkaUtils)? Will we
>> be able to do state-full operations on a Dataset[T] like we do today using
>> MapWithStateRDD? Or will there be a subset of operations that the catalyst
>> optimizer can understand such as aggregate and such?
>>
>> I'd be happy anyone could shed some light on this.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Structured-Streaming-in-Spark-2-0-and-DStreams-tp26959.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>


Kafka stream message sampling

2016-05-15 Thread Samuel Zhou
Hi,

I was trying to use filter to sample a Kafka direct stream; the filter
function just takes 1 message out of 10 by using hashcode % 10 == 0,
but the number of input events for each batch didn't shrink to 10% of the
original traffic. So I want to ask if there is any way to shrink the batch
size with a sampling function, to save the traffic from Kafka?

Thanks!
Samuel
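
For what it's worth, a filter (or an RDD-level sample) only reduces what is processed
downstream; the direct stream still fetches every message in the offset range from
Kafka, so the reported input size per batch stays the same. A sketch of sampling inside
the stream (the stream name and fraction are illustrative), which trims processing but
not the Kafka traffic itself:

// keep roughly 10% of each batch for downstream processing
val sampled = directStream.transform { rdd =>
  rdd.sample(withReplacement = false, fraction = 0.1)
}

Reducing the data actually pulled from Kafka would have to happen at the source side,
e.g. by capping the ingest rate with spark.streaming.kafka.maxRatePerPartition or by
sampling before the data reaches the topic.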


Re: Executors and Cores

2016-05-15 Thread Jacek Laskowski
On Sun, May 15, 2016 at 8:19 AM, Mail.com  wrote:

> In all that I have seen, it seems each job has to be given the max resources 
> allowed in the cluster.

Hi,

I'm fairly sure it was because FIFO scheduling mode was used. You
could change it to FAIR and make some adjustments.

https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application

It may also depend a little bit on your resource manager (aka cluster
manager), but just a little bit, since after resources are assigned and
executors spawned, the resources are handled by Spark itself.

Jacek

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
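
A minimal sketch of the FAIR setup mentioned above (the pool name and the allocation
file path are placeholders):

val conf = new org.apache.spark.SparkConf()
  .setAppName("shared-app")
  .set("spark.scheduler.mode", "FAIR")                                   // default is FIFO
  .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")  // optional pool definitions
val sc = new org.apache.spark.SparkContext(conf)

// jobs submitted from this thread go into the named pool
sc.setLocalProperty("spark.scheduler.pool", "reports")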



Re: Errors when running SparkPi on a clean Spark 1.6.1 on Mesos

2016-05-15 Thread Jacek Laskowski
On Sun, May 15, 2016 at 5:50 PM, Richard Siebeling  wrote:

> I'm getting the following errors running SparkPi on a clean just compiled
> and checked Mesos 0.29.0 installation with Spark 1.6.1
>
> 16/05/15 23:05:52 ERROR TaskSchedulerImpl: Lost executor
> e23f2d53-22c5-40f0-918d-0d73805fdfec-S0/0 on xxx Remote RPC client
> disassociated. Likely due to containers exceeding thresholds, or network
> issues. Check driver logs for WARN messages.

Looking at it again, I don't see an issue here. Why do you think it
doesn't work for you? After Pi is calculated, the executors were taken
down since the driver finished calculation (and closed SparkContext).

Jacek

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Errors when running SparkPi on a clean Spark 1.6.1 on Mesos

2016-05-15 Thread Jacek Laskowski
Hi Richard,

I don't know the answer, but I just saw the way you've executed the
examples and thought I'd share a slightly easier (?) way using
run-example as follows:

./bin/run-example --verbose --master yarn --deploy-mode cluster SparkPi 1000

(I use YARN, so change that and possibly the deploy mode).

Also, deploy-mode client is the default deploy mode so you may safely remove it.

Regards,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Sun, May 15, 2016 at 5:50 PM, Richard Siebeling  wrote:
> Hi,
>
> I'm getting the following errors running SparkPi on a clean just compiled
> and checked Mesos 0.29.0 installation with Spark 1.6.1
>
> 16/05/15 23:05:52 ERROR TaskSchedulerImpl: Lost executor
> e23f2d53-22c5-40f0-918d-0d73805fdfec-S0/0 on xxx Remote RPC client
> disassociated. Likely due to containers exceeding thresholds, or network
> issues. Check driver logs for WARN messages.
>
>
> The Mesos examples are running fine, only the SparkPi example isn't...
> I'm not sure what to do, I thought it had to do with the installation so I
> installed and compiled everything again, but without any good results.
>
> Please help,
> thanks in advance,
> Richard
>
>
> The complete logs are
>
> sudo ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master
> mesos://192.168.33.10:5050 --deploy-mode client ./lib/spark-examples* 10
>
> 16/05/15 23:05:36 WARN NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
>
> I0515 23:05:38.393546 10915 sched.cpp:224] Version: 0.29.0
>
> I0515 23:05:38.402220 10909 sched.cpp:328] New master detected at
> master@192.168.33.10:5050
>
> I0515 23:05:38.403033 10909 sched.cpp:338] No credentials provided.
> Attempting to register without authentication
>
> I0515 23:05:38.431784 10909 sched.cpp:710] Framework registered with
> e23f2d53-22c5-40f0-918d-0d73805fdfec-0006
>
> Pi is roughly 3.145964
>
> 16/05/15 23:05:52 ERROR TaskSchedulerImpl: Lost executor
> e23f2d53-22c5-40f0-918d-0d73805fdfec-S0/0 on xxx: Remote RPC client
> disassociated. Likely due to containers exceeding thresholds, or network
> issues. Check driver logs for WARN messages.
>
> 16/05/15 23:05:52 ERROR LiveListenerBus: SparkListenerBus has already
> stopped! Dropping event
> SparkListenerExecutorRemoved(1463346352364,e23f2d53-22c5-40f0-918d-0d73805fdfec-S0/0,Remote
> RPC client disassociated. Likely due to containers exceeding thresholds, or
> network issues. Check driver logs for WARN messages.)
>
> I0515 23:05:52.380164 10810 sched.cpp:1921] Asked to stop the driver
>
> I0515 23:05:52.382272 10910 sched.cpp:1150] Stopping framework
> 'e23f2d53-22c5-40f0-918d-0d73805fdfec-0006'
>
>
> The Mesos sandbox gives the following messages in STDERR
>
> 16/05/15 23:05:52 INFO Executor: Finished task 7.0 in stage 0.0 (TID 7).
> 1029 bytes result sent to driver
>
> 16/05/15 23:05:52 INFO CoarseGrainedExecutorBackend: Got assigned task 8
>
> 16/05/15 23:05:52 INFO Executor: Running task 8.0 in stage 0.0 (TID 8)
>
> 16/05/15 23:05:52 INFO Executor: Finished task 8.0 in stage 0.0 (TID 8).
> 1029 bytes result sent to driver
>
> 16/05/15 23:05:52 INFO CoarseGrainedExecutorBackend: Got assigned task 9
>
> 16/05/15 23:05:52 INFO Executor: Running task 9.0 in stage 0.0 (TID 9)
>
> 16/05/15 23:05:52 INFO Executor: Finished task 9.0 in stage 0.0 (TID 9).
> 1029 bytes result sent to driver
>
> 16/05/15 23:05:52 INFO CoarseGrainedExecutorBackend: Driver commanded a
> shutdown
>
> 16/05/15 23:05:52 INFO MemoryStore: MemoryStore cleared
>
> 16/05/15 23:05:52 INFO BlockManager: BlockManager stopped
>
> 16/05/15 23:05:52 INFO RemoteActorRefProvider$RemotingTerminator: Shutting
> down remote daemon.
>
> 16/05/15 23:05:52 INFO RemoteActorRefProvider$RemotingTerminator: Remote
> daemon shut down; proceeding with flushing remote transports.
>
> 16/05/15 23:05:52 WARN CoarseGrainedExecutorBackend: An unknown
> (anabrix:45663) driver disconnected.
>
> 16/05/15 23:05:52 ERROR CoarseGrainedExecutorBackend: Driver
> 192.168.33.10:45663 disassociated! Shutting down.
>
> I0515 23:05:52.388991 11120 exec.cpp:399] Executor asked to shutdown
>
> 16/05/15 23:05:52 INFO ShutdownHookManager: Shutdown hook called
>
> 16/05/15 23:05:52 INFO ShutdownHookManager: Deleting directory
> /tmp/mesos/slaves/e23f2d53-22c5-40f0-918d-0d73805fdfec-S0/frameworks/e23f2d53-22c5-40f0-918d-0d73805fdfec-0006/executors/0/runs/b9df4275-a597-4b8e-9a7b-45e7fb79bd93/spark-a99d0380-2d0d-4bbd-a593-49ad885e5430
>
>
> And the following messages in STDOUT
>
> Registered executor on xxx
>
> Starting task 0
>
> sh -c 'cd spark-1*;  ./bin/spark-class
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url
> spark://CoarseGrainedScheduler@192.168.33.10:45663 --executor-id
> e23f2d53-22c5-40f0-918d-0d73805fdfec-S0/0 --hostname xxx --cores 1 --app-id
> 

Re: Structured Streaming in Spark 2.0 and DStreams

2016-05-15 Thread Benjamin Kim
Ofir,

Thanks for the clarification. I was confused for the moment. The links will be 
very helpful.


> On May 15, 2016, at 2:32 PM, Ofir Manor  wrote:
> 
> Ben,
> I'm just a Spark user - but at least in March Spark Summit, that was the main 
> term used.
> Taking a step back from the details, maybe this new post from Reynold is a 
> better intro to Spark 2.0 highlights 
> https://databricks.com/blog/2016/05/11/spark-2-0-technical-preview-easier-faster-and-smarter.html
>  
> 
> 
> If you want to drill down, go to SPARK-8360 "Structured Streaming (aka 
> Streaming DataFrames)". The design doc (written by Reynold in March) is very 
> readable:
>  https://issues.apache.org/jira/browse/SPARK-8360 
> 
> 
> Regarding directly querying (SQL) the state managed by a streaming process - 
> I don't know if that will land in 2.0 or only later.
> 
> Hope that helps,
> 
> Ofir Manor
> 
> Co-Founder & CTO | Equalum
> 
> 
> Mobile: +972-54-7801286  | Email: 
> ofir.ma...@equalum.io 
> On Sun, May 15, 2016 at 11:58 PM, Benjamin Kim  > wrote:
> Hi Ofir,
> 
> I just recently saw the webinar with Reynold Xin. He mentioned the Spark 
> Session unification efforts, but I don’t remember the DataSet for Structured 
> Streaming aka Continuous Applications as he put it. He did mention streaming 
> or unlimited DataFrames for Structured Streaming so one can directly query 
> the data from it. Has something changed since then?
> 
> Thanks,
> Ben
> 
> 
>> On May 15, 2016, at 1:42 PM, Ofir Manor > > wrote:
>> 
>> Hi Yuval,
>> let me share my understanding based on similar questions I had.
>> First, Spark 2.x aims to replace a whole bunch of its APIs with just two 
>> main ones - SparkSession (replacing Hive/SQL/Spark Context) and Dataset 
>> (merging of Dataset and Dataframe - which is why it inherits all the 
>> SparkSQL goodness), while RDD seems as a low-level API only for special 
>> cases. The new Dataset should also support both batch and streaming - 
>> replacing (eventually) DStream as well. See the design docs in SPARK-13485 
>> (unified API) and SPARK-8360 (StructuredStreaming) for a good intro. 
>> However, as you noted, not all will be fully delivered in 2.0. For example, 
>> it seems that streaming from / to Kafka using StructuredStreaming didn't 
>> make it (so far?) to 2.0 (which is a showstopper for me). 
>> Anyway, as far as I understand, you should be able to apply stateful 
>> operators (non-RDD) on Datasets (for example, the new event-time window 
>> processing SPARK-8360). The gap I see is mostly limited streaming sources / 
>> sinks migrated to the new (richer) API and semantics.
>> Anyway, I'm pretty sure once 2.0 gets to RC, the documentation and examples 
>> will align with the current offering...
>> 
>> 
>> Ofir Manor
>> 
>> Co-Founder & CTO | Equalum
>> 
>> 
>> Mobile: +972-54-7801286  | Email: 
>> ofir.ma...@equalum.io 
>> On Sun, May 15, 2016 at 1:52 PM, Yuval.Itzchakov > > wrote:
>> I've been reading/watching videos about the upcoming Spark 2.0 release which
>> brings us Structured Streaming. One thing I've yet to understand is how this
>> relates to the current state of working with Streaming in Spark with the
>> DStream abstraction.
>> 
>> All examples I can find, in the Spark repository/different videos is someone
>> streaming local JSON files or reading from HDFS/S3/SQL. Also, when browsing
>> the source, SparkSession seems to be defined inside org.apache.spark.sql, so
>> this gives me a hunch that this is somehow all related to SQL and the likes,
>> and not really to DStreams.
>> 
>> What I'm failing to understand is: Will this feature impact how we do
>> Streaming today? Will I be able to consume a Kafka source in a streaming
>> fashion (like we do today when we open a stream using KafkaUtils)? Will we
>> be able to do state-full operations on a Dataset[T] like we do today using
>> MapWithStateRDD? Or will there be a subset of operations that the catalyst
>> optimizer can understand such as aggregate and such?
>> 
>> I'd be happy anyone could shed some light on this.
>> 
>> 
>> 
>> --
>> View this message in context: 
>> http://apache-spark-user-list.1001560.n3.nabble.com/Structured-Streaming-in-Spark-2-0-and-DStreams-tp26959.html
>>  
>> 
>> Sent from the Apache Spark User List mailing list archive at Nabble.com 
>> .
>> 
>> -
>> To unsubscribe, e-mail: 

Re: Errors when running SparkPi on a clean Spark 1.6.1 on Mesos

2016-05-15 Thread Richard Siebeling
By the way, this is on a single-node cluster.

On Sunday, May 15, 2016, Richard Siebeling wrote:

> Hi,
>
> I'm getting the following errors running SparkPi on a clean just compiled
> and checked Mesos 0.29.0 installation with Spark 1.6.1
>
> 16/05/15 23:05:52 ERROR TaskSchedulerImpl: Lost executor
> e23f2d53-22c5-40f0-918d-0d73805fdfec-S0/0 on xxx Remote RPC client
> disassociated. Likely due to containers exceeding thresholds, or network
> issues. Check driver logs for WARN messages.
>
> The Mesos examples are running fine, only the SparkPi example isn't...
> I'm not sure what to do, I thought it had to do with the installation so I
> installed and compiled everything again, but without any good results.
>
> Please help,
> thanks in advance,
> Richard
>
>
> The complete logs are
>
> sudo ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master
> mesos://192.168.33.10:5050 --deploy-mode client ./lib/spark-examples* 10
>
> 16/05/15 23:05:36 WARN NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
>
> I0515 23:05:38.393546 10915 sched.cpp:224] Version: 0.29.0
>
> I0515 23:05:38.402220 10909 sched.cpp:328] New master detected at
> master@192.168.33.10:5050
>
> I0515 23:05:38.403033 10909 sched.cpp:338] No credentials provided.
> Attempting to register without authentication
>
> I0515 23:05:38.431784 10909 sched.cpp:710] Framework registered with
> e23f2d53-22c5-40f0-918d-0d73805fdfec-0006
>
> Pi is roughly 3.145964
>
>
> 16/05/15 23:05:52 ERROR TaskSchedulerImpl: Lost executor
> e23f2d53-22c5-40f0-918d-0d73805fdfec-S0/0 on xxx: Remote RPC client
> disassociated. Likely due to containers exceeding thresholds, or network
> issues. Check driver logs for WARN messages.
>
> 16/05/15 23:05:52 ERROR LiveListenerBus: SparkListenerBus has already
> stopped! Dropping event
> SparkListenerExecutorRemoved(1463346352364,e23f2d53-22c5-40f0-918d-0d73805fdfec-S0/0,Remote
> RPC client disassociated. Likely due to containers exceeding thresholds, or
> network issues. Check driver logs for WARN messages.)
>
> I0515 23:05:52.380164 10810 sched.cpp:1921] Asked to stop the driver
>
> I0515 23:05:52.382272 10910 sched.cpp:1150] Stopping framework
> 'e23f2d53-22c5-40f0-918d-0d73805fdfec-0006'
>
> The Mesos sandbox gives the following messages in STDERR
>
> 16/05/15 23:05:52 INFO Executor: Finished task 7.0 in stage 0.0 (TID 7).
> 1029 bytes result sent to driver
>
> 16/05/15 23:05:52 INFO CoarseGrainedExecutorBackend: Got assigned task 8
>
> 16/05/15 23:05:52 INFO Executor: Running task 8.0 in stage 0.0 (TID 8)
>
> 16/05/15 23:05:52 INFO Executor: Finished task 8.0 in stage 0.0 (TID 8).
> 1029 bytes result sent to driver
>
> 16/05/15 23:05:52 INFO CoarseGrainedExecutorBackend: Got assigned task 9
>
> 16/05/15 23:05:52 INFO Executor: Running task 9.0 in stage 0.0 (TID 9)
>
> 16/05/15 23:05:52 INFO Executor: Finished task 9.0 in stage 0.0 (TID 9).
> 1029 bytes result sent to driver
>
> 16/05/15 23:05:52 INFO CoarseGrainedExecutorBackend: Driver commanded a
> shutdown
>
> 16/05/15 23:05:52 INFO MemoryStore: MemoryStore cleared
>
> 16/05/15 23:05:52 INFO BlockManager: BlockManager stopped
>
> 16/05/15 23:05:52 INFO RemoteActorRefProvider$RemotingTerminator: Shutting
> down remote daemon.
>
> 16/05/15 23:05:52 INFO RemoteActorRefProvider$RemotingTerminator: Remote
> daemon shut down; proceeding with flushing remote transports.
>
> 16/05/15 23:05:52 WARN CoarseGrainedExecutorBackend: An unknown
> (anabrix:45663) driver disconnected.
>
> 16/05/15 23:05:52 ERROR CoarseGrainedExecutorBackend: Driver
> 192.168.33.10:45663 disassociated! Shutting down.
>
> I0515 23:05:52.388991 11120 exec.cpp:399] Executor asked to shutdown
>
> 16/05/15 23:05:52 INFO ShutdownHookManager: Shutdown hook called
>
> 16/05/15 23:05:52 INFO ShutdownHookManager: Deleting directory
> /tmp/mesos/slaves/e23f2d53-22c5-40f0-918d-0d73805fdfec-S0/frameworks/e23f2d53-22c5-40f0-918d-0d73805fdfec-0006/executors/0/runs/b9df4275-a597-4b8e-9a7b-45e7fb79bd93/spark-a99d0380-2d0d-4bbd-a593-49ad885e5430
>
> And the following messages in STDOUT
>
> Registered executor on xxx
>
> Starting task 0
>
> sh -c 'cd spark-1*;  ./bin/spark-class
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://
> CoarseGrainedScheduler@192.168.33.10:45663 --executor-id
> e23f2d53-22c5-40f0-918d-0d73805fdfec-S0/0 --hostname xxx --cores 1 --app-id
> e23f2d53-22c5-40f0-918d-0d73805fdfec-0006'
>
> Forked command at 11124
>
> Shutting down
>
> Sending SIGTERM to process tree at pid 11124
>
> Sent SIGTERM to the following process trees:
>
> [
>
> -+- 11124 sh -c cd spark-1*;  ./bin/spark-class
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://
> CoarseGrainedScheduler@192.168.33.10:45663 --executor-id
> e23f2d53-22c5-40f0-918d-0d73805fdfec-S0/0 --hostname xxx --cores 1 --app-id
> e23f2d53-22c5-40f0-918d-0d73805fdfec-0006
>
>  \--- 11125
> 

Errors when running SparkPi on a clean Spark 1.6.1 on Mesos

2016-05-15 Thread Richard Siebeling
Hi,

I'm getting the following errors running SparkPi on a clean just compiled
and checked Mesos 0.29.0 installation with Spark 1.6.1

16/05/15 23:05:52 ERROR TaskSchedulerImpl: Lost executor
e23f2d53-22c5-40f0-918d-0d73805fdfec-S0/0 on xxx Remote RPC client
disassociated. Likely due to containers exceeding thresholds, or network
issues. Check driver logs for WARN messages.

The Mesos examples are running fine, only the SparkPi example isn't...
I'm not sure what to do, I thought it had to do with the installation so I
installed and compiled everything again, but without any good results.

Please help,
thanks in advance,
Richard


The complete logs are

sudo ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master
mesos://192.168.33.10:5050 --deploy-mode client ./lib/spark-examples* 10

16/05/15 23:05:36 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable

I0515 23:05:38.393546 10915 sched.cpp:224] Version: 0.29.0

I0515 23:05:38.402220 10909 sched.cpp:328] New master detected at
master@192.168.33.10:5050

I0515 23:05:38.403033 10909 sched.cpp:338] No credentials provided.
Attempting to register without authentication

I0515 23:05:38.431784 10909 sched.cpp:710] Framework registered with
e23f2d53-22c5-40f0-918d-0d73805fdfec-0006

Pi is roughly 3.145964


16/05/15 23:05:52 ERROR TaskSchedulerImpl: Lost executor
e23f2d53-22c5-40f0-918d-0d73805fdfec-S0/0 on xxx: Remote RPC client
disassociated. Likely due to containers exceeding thresholds, or network
issues. Check driver logs for WARN messages.

16/05/15 23:05:52 ERROR LiveListenerBus: SparkListenerBus has already
stopped! Dropping event
SparkListenerExecutorRemoved(1463346352364,e23f2d53-22c5-40f0-918d-0d73805fdfec-S0/0,Remote
RPC client disassociated. Likely due to containers exceeding thresholds, or
network issues. Check driver logs for WARN messages.)

I0515 23:05:52.380164 10810 sched.cpp:1921] Asked to stop the driver

I0515 23:05:52.382272 10910 sched.cpp:1150] Stopping framework
'e23f2d53-22c5-40f0-918d-0d73805fdfec-0006'

The Mesos sandbox gives the following messages in STDERR

16/05/15 23:05:52 INFO Executor: Finished task 7.0 in stage 0.0 (TID 7).
1029 bytes result sent to driver

16/05/15 23:05:52 INFO CoarseGrainedExecutorBackend: Got assigned task 8

16/05/15 23:05:52 INFO Executor: Running task 8.0 in stage 0.0 (TID 8)

16/05/15 23:05:52 INFO Executor: Finished task 8.0 in stage 0.0 (TID 8).
1029 bytes result sent to driver

16/05/15 23:05:52 INFO CoarseGrainedExecutorBackend: Got assigned task 9

16/05/15 23:05:52 INFO Executor: Running task 9.0 in stage 0.0 (TID 9)

16/05/15 23:05:52 INFO Executor: Finished task 9.0 in stage 0.0 (TID 9).
1029 bytes result sent to driver

16/05/15 23:05:52 INFO CoarseGrainedExecutorBackend: Driver commanded a
shutdown

16/05/15 23:05:52 INFO MemoryStore: MemoryStore cleared

16/05/15 23:05:52 INFO BlockManager: BlockManager stopped

16/05/15 23:05:52 INFO RemoteActorRefProvider$RemotingTerminator: Shutting
down remote daemon.

16/05/15 23:05:52 INFO RemoteActorRefProvider$RemotingTerminator: Remote
daemon shut down; proceeding with flushing remote transports.

16/05/15 23:05:52 WARN CoarseGrainedExecutorBackend: An unknown
(anabrix:45663) driver disconnected.

16/05/15 23:05:52 ERROR CoarseGrainedExecutorBackend: Driver
192.168.33.10:45663 disassociated! Shutting down.

I0515 23:05:52.388991 11120 exec.cpp:399] Executor asked to shutdown

16/05/15 23:05:52 INFO ShutdownHookManager: Shutdown hook called

16/05/15 23:05:52 INFO ShutdownHookManager: Deleting directory
/tmp/mesos/slaves/e23f2d53-22c5-40f0-918d-0d73805fdfec-S0/frameworks/e23f2d53-22c5-40f0-918d-0d73805fdfec-0006/executors/0/runs/b9df4275-a597-4b8e-9a7b-45e7fb79bd93/spark-a99d0380-2d0d-4bbd-a593-49ad885e5430

And the following messages in STDOUT

Registered executor on xxx

Starting task 0

sh -c 'cd spark-1*;  ./bin/spark-class
org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://
CoarseGrainedScheduler@192.168.33.10:45663 --executor-id
e23f2d53-22c5-40f0-918d-0d73805fdfec-S0/0 --hostname xxx --cores 1 --app-id
e23f2d53-22c5-40f0-918d-0d73805fdfec-0006'

Forked command at 11124

Shutting down

Sending SIGTERM to process tree at pid 11124

Sent SIGTERM to the following process trees:

[

-+- 11124 sh -c cd spark-1*;  ./bin/spark-class
org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://
CoarseGrainedScheduler@192.168.33.10:45663 --executor-id
e23f2d53-22c5-40f0-918d-0d73805fdfec-S0/0 --hostname xxx --cores 1 --app-id
e23f2d53-22c5-40f0-918d-0d73805fdfec-0006

 \--- 11125
/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.91-0.b14.el7_2.x86_64/jre/bin/java
-cp

Re: Structured Streaming in Spark 2.0 and DStreams

2016-05-15 Thread Ofir Manor
Ben,
I'm just a Spark user - but at least in March Spark Summit, that was the
main term used.
Taking a step back from the details, maybe this new post from Reynold is a
better intro to Spark 2.0 highlights
https://databricks.com/blog/2016/05/11/spark-2-0-technical-preview-easier-faster-and-smarter.html

If you want to drill down, go to SPARK-8360 "Structured Streaming (aka
Streaming DataFrames)". The design doc (written by Reynold in March) is
very readable:
 https://issues.apache.org/jira/browse/SPARK-8360

Regarding directly querying (SQL) the state managed by a streaming process
- I don't know if that will land in 2.0 or only later.

Hope that helps,

Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io

On Sun, May 15, 2016 at 11:58 PM, Benjamin Kim  wrote:

> Hi Ofir,
>
> I just recently saw the webinar with Reynold Xin. He mentioned the Spark
> Session unification efforts, but I don’t remember the DataSet for
> Structured Streaming aka Continuous Applications as he put it. He did
> mention streaming or unlimited DataFrames for Structured Streaming so one
> can directly query the data from it. Has something changed since then?
>
> Thanks,
> Ben
>
>
> On May 15, 2016, at 1:42 PM, Ofir Manor  wrote:
>
> Hi Yuval,
> let me share my understanding based on similar questions I had.
> First, Spark 2.x aims to replace a whole bunch of its APIs with just two
> main ones - SparkSession (replacing Hive/SQL/Spark Context) and Dataset
> (merging of Dataset and Dataframe - which is why it inherits all the
> SparkSQL goodness), while RDD seems as a low-level API only for special
> cases. The new Dataset should also support both batch and streaming -
> replacing (eventually) DStream as well. See the design docs in SPARK-13485
> (unified API) and SPARK-8360 (StructuredStreaming) for a good intro.
> However, as you noted, not all will be fully delivered in 2.0. For
> example, it seems that streaming from / to Kafka using StructuredStreaming
> didn't make it (so far?) to 2.0 (which is a showstopper for me).
> Anyway, as far as I understand, you should be able to apply stateful
> operators (non-RDD) on Datasets (for example, the new event-time window
> processing SPARK-8360). The gap I see is mostly limited streaming sources /
> sinks migrated to the new (richer) API and semantics.
> Anyway, I'm pretty sure once 2.0 gets to RC, the documentation and
> examples will align with the current offering...
>
>
> Ofir Manor
>
> Co-Founder & CTO | Equalum
>
> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
>
> On Sun, May 15, 2016 at 1:52 PM, Yuval.Itzchakov 
> wrote:
>
>> I've been reading/watching videos about the upcoming Spark 2.0 release
>> which
>> brings us Structured Streaming. One thing I've yet to understand is how
>> this
>> relates to the current state of working with Streaming in Spark with the
>> DStream abstraction.
>>
>> All examples I can find, in the Spark repository/different videos is
>> someone
>> streaming local JSON files or reading from HDFS/S3/SQL. Also, when
>> browsing
>> the source, SparkSession seems to be defined inside org.apache.spark.sql,
>> so
>> this gives me a hunch that this is somehow all related to SQL and the
>> likes,
>> and not really to DStreams.
>>
>> What I'm failing to understand is: Will this feature impact how we do
>> Streaming today? Will I be able to consume a Kafka source in a streaming
>> fashion (like we do today when we open a stream using KafkaUtils)? Will we
>> be able to do state-full operations on a Dataset[T] like we do today using
>> MapWithStateRDD? Or will there be a subset of operations that the catalyst
>> optimizer can understand such as aggregate and such?
>>
>> I'd be happy anyone could shed some light on this.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Structured-Streaming-in-Spark-2-0-and-DStreams-tp26959.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com
>> .
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>
>


Re: Structured Streaming in Spark 2.0 and DStreams

2016-05-15 Thread Benjamin Kim
Hi Ofir,

I just recently saw the webinar with Reynold Xin. He mentioned the Spark 
Session unification efforts, but I don’t remember the DataSet for Structured 
Streaming aka Continuous Applications as he put it. He did mention streaming or 
unlimited DataFrames for Structured Streaming so one can directly query the 
data from it. Has something changed since then?

Thanks,
Ben


> On May 15, 2016, at 1:42 PM, Ofir Manor  wrote:
> 
> Hi Yuval,
> let me share my understanding based on similar questions I had.
> First, Spark 2.x aims to replace a whole bunch of its APIs with just two main 
> ones - SparkSession (replacing Hive/SQL/Spark Context) and Dataset (merging 
> of Dataset and Dataframe - which is why it inherits all the SparkSQL 
> goodness), while RDD seems as a low-level API only for special cases. The new 
> Dataset should also support both batch and streaming - replacing (eventually) 
> DStream as well. See the design docs in SPARK-13485 (unified API) and 
> SPARK-8360 (StructuredStreaming) for a good intro. 
> However, as you noted, not all will be fully delivered in 2.0. For example, 
> it seems that streaming from / to Kafka using StructuredStreaming didn't make 
> it (so far?) to 2.0 (which is a showstopper for me). 
> Anyway, as far as I understand, you should be able to apply stateful 
> operators (non-RDD) on Datasets (for example, the new event-time window 
> processing SPARK-8360). The gap I see is mostly limited streaming sources / 
> sinks migrated to the new (richer) API and semantics.
> Anyway, I'm pretty sure once 2.0 gets to RC, the documentation and examples 
> will align with the current offering...
> 
> 
> Ofir Manor
> 
> Co-Founder & CTO | Equalum
> 
> 
> Mobile: +972-54-7801286  | Email: 
> ofir.ma...@equalum.io 
> On Sun, May 15, 2016 at 1:52 PM, Yuval.Itzchakov  > wrote:
> I've been reading/watching videos about the upcoming Spark 2.0 release which
> brings us Structured Streaming. One thing I've yet to understand is how this
> relates to the current state of working with Streaming in Spark with the
> DStream abstraction.
> 
> All examples I can find, in the Spark repository/different videos is someone
> streaming local JSON files or reading from HDFS/S3/SQL. Also, when browsing
> the source, SparkSession seems to be defined inside org.apache.spark.sql, so
> this gives me a hunch that this is somehow all related to SQL and the likes,
> and not really to DStreams.
> 
> What I'm failing to understand is: Will this feature impact how we do
> Streaming today? Will I be able to consume a Kafka source in a streaming
> fashion (like we do today when we open a stream using KafkaUtils)? Will we
> be able to do state-full operations on a Dataset[T] like we do today using
> MapWithStateRDD? Or will there be a subset of operations that the catalyst
> optimizer can understand such as aggregate and such?
> 
> I'd be happy anyone could shed some light on this.
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Structured-Streaming-in-Spark-2-0-and-DStreams-tp26959.html
>  
> 
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> 
> For additional commands, e-mail: user-h...@spark.apache.org 
> 
> 
> 



Re: Structured Streaming in Spark 2.0 and DStreams

2016-05-15 Thread Ofir Manor
Hi Yuval,
let me share my understanding based on similar questions I had.
First, Spark 2.x aims to replace a whole bunch of its APIs with just two
main ones - SparkSession (replacing Hive/SQL/Spark Context) and Dataset
(merging of Dataset and Dataframe - which is why it inherits all the
SparkSQL goodness), while RDD seems as a low-level API only for special
cases. The new Dataset should also support both batch and streaming -
replacing (eventually) DStream as well. See the design docs in SPARK-13485
(unified API) and SPARK-8360 (StructuredStreaming) for a good intro.
However, as you noted, not all will be fully delivered in 2.0. For example,
it seems that streaming from / to Kafka using StructuredStreaming didn't
make it (so far?) to 2.0 (which is a showstopper for me).
Anyway, as far as I understand, you should be able to apply stateful
operators (non-RDD) on Datasets (for example, the new event-time window
processing SPARK-8360). The gap I see is mostly limited streaming sources /
sinks migrated to the new (richer) API and semantics.
Anyway, I'm pretty sure once 2.0 gets to RC, the documentation and examples
will align with the current offering...


Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io

On Sun, May 15, 2016 at 1:52 PM, Yuval.Itzchakov  wrote:

> I've been reading/watching videos about the upcoming Spark 2.0 release
> which
> brings us Structured Streaming. One thing I've yet to understand is how
> this
> relates to the current state of working with Streaming in Spark with the
> DStream abstraction.
>
> All examples I can find, in the Spark repository/different videos is
> someone
> streaming local JSON files or reading from HDFS/S3/SQL. Also, when browsing
> the source, SparkSession seems to be defined inside org.apache.spark.sql,
> so
> this gives me a hunch that this is somehow all related to SQL and the
> likes,
> and not really to DStreams.
>
> What I'm failing to understand is: Will this feature impact how we do
> Streaming today? Will I be able to consume a Kafka source in a streaming
> fashion (like we do today when we open a stream using KafkaUtils)? Will we
> be able to do state-full operations on a Dataset[T] like we do today using
> MapWithStateRDD? Or will there be a subset of operations that the catalyst
> optimizer can understand such as aggregate and such?
>
> I'd be happy anyone could shed some light on this.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Structured-Streaming-in-Spark-2-0-and-DStreams-tp26959.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
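
As a rough sketch of the direction described above (API details may still change before
2.0 is released; the column name and path below are made up), the unified entry point
and the batch Dataset API look roughly like this:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("structured-demo")
  .master("local[*]")
  .getOrCreate()

// one Dataset/DataFrame API for batch; the streaming variant is planned to look similar
val events = spark.read.json("/path/to/events")   // DataFrame = Dataset[Row]
events.groupBy("userId").count().show()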


Re: JDBC SQL Server RDD

2016-05-15 Thread Mich Talebzadeh
Hi,

Which version of Spark are you using?

Dr Mich Talebzadeh



LinkedIn
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 15 May 2016 at 20:05, KhajaAsmath Mohammed 
wrote:

> Hi ,
>
> I am trying to test a SQL Server connection with JdbcRDD but am unable to
> connect.
>
> val myRDD = new JdbcRDD( sparkContext, () =>
> DriverManager.getConnection(sqlServerConnectionString) ,
>   "select CTRY_NA,CTRY_SHRT_NA from dbo.CTRY limit ?, ?",
>   0, 5, 1, r => r.getString("CTRY_NA") + ", " +
> r.getString("CTRY_SHRT_NA"))
>
>
> sqlServerConnectionString here is jdbc:sqlserver://
> usaecducc1ew1.ccgaco45mak.us-east-1.rds.amazonaws.com
> ;database=ProdAWS;user=sa;password=?s3iY2mv6.H
>
>
> Can you please let me know what I am doing wrong? I tried solutions from
> all the forums but didn't have any luck.
>
> Thanks,
> Asmath.
>


Re: How to use the spark submit script / capability

2016-05-15 Thread Marcelo Vanzin
As I mentioned, the "user document" is the Spark API documentation.

On Sun, May 15, 2016 at 12:20 PM, Stephen Boesch  wrote:

> Hi Marcelo,  here is the JIRA
> https://issues.apache.org/jira/browse/SPARK-4924
>
> Jeff Zhang
>  added
> a comment - 26/Nov/15 08:15
>
> Marcelo Vanzin
>  Is
> there any user document about it ? I didn't find it on the spark official
> site. If this is not production ready, I think adding documentation to let
> users know would be a good start.
> 
>
> 
> Jiahongchao
> 
>  added
> a comment - 28/Dec/15 03:51
>
> Where is the official document?
>
>
>
>
>
> 2016-05-15 12:04 GMT-07:00 Marcelo Vanzin :
>
>> I don't understand your question. The PR you mention is not about
>> spark-submit.
>>
>> If you want help with spark-submit, check the Spark docs or "spark-submit
>> -h".
>>
>> If you want help with the library added in the PR, check Spark's API
>> documentation.
>>
>>
>> On Sun, May 15, 2016 at 9:33 AM, Stephen Boesch 
>> wrote:
>> >
>> > There is a committed PR from Marcelo Vanzin addressing that capability:
>> >
>> > https://github.com/apache/spark/pull/3916/files
>> >
>> > Is there any documentation on how to use this?  The PR itself has two
>> > comments asking for the docs that were not answered.
>>
>>
>>
>> --
>> Marcelo
>>
>
>


-- 
Marcelo
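
For reference, a minimal sketch of what the launcher library from that PR looks like
when used from application code (the paths, class and master below are placeholders;
see the org.apache.spark.launcher javadoc for the authoritative API):

import org.apache.spark.launcher.SparkLauncher

val app = new SparkLauncher()
  .setSparkHome("/opt/spark")
  .setAppResource("/path/to/my-app.jar")
  .setMainClass("com.example.MyApp")
  .setMaster("yarn-cluster")
  .setConf(SparkLauncher.DRIVER_MEMORY, "2g")
  .launch()          // returns a java.lang.Process

app.waitFor()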


Re: pyspark.zip and py4j-0.9-src.zip

2016-05-15 Thread Ted Yu
For py4j, adjust version according to your need:


<dependency>
  <groupId>net.sf.py4j</groupId>
  <artifactId>py4j</artifactId>
  <version>0.10.1</version>
</dependency>


FYI

On Sun, May 15, 2016 at 11:55 AM, satish saley 
wrote:

> Hi,
> Is there any way to pull in pyspark.zip and py4j-0.9-src.zip in maven
> project?
>
>
>


Re: How to use the spark submit script / capability

2016-05-15 Thread Stephen Boesch
Hi Marcelo,  here is the JIRA
https://issues.apache.org/jira/browse/SPARK-4924

Jeff Zhang
 added
a comment - 26/Nov/15 08:15

Marcelo Vanzin
 Is
there any user document about it ? I didn't find it on the spark official
site. If this is not production ready, I think adding documentation to let
users know would be a good start.


Jiahongchao

added
a comment - 28/Dec/15 03:51

Where is the official document?





2016-05-15 12:04 GMT-07:00 Marcelo Vanzin :

> I don't understand your question. The PR you mention is not about
> spark-submit.
>
> If you want help with spark-submit, check the Spark docs or "spark-submit
> -h".
>
> If you want help with the library added in the PR, check Spark's API
> documentation.
>
>
> On Sun, May 15, 2016 at 9:33 AM, Stephen Boesch  wrote:
> >
> > There is a committed PR from Marcelo Vanzin addressing that capability:
> >
> > https://github.com/apache/spark/pull/3916/files
> >
> > Is there any documentation on how to use this?  The PR itself has two
> > comments asking for the docs that were not answered.
>
>
>
> --
> Marcelo
>


JDBC SQL Server RDD

2016-05-15 Thread KhajaAsmath Mohammed
Hi ,

I am trying to test a SQL Server connection with JdbcRDD but am unable to
connect.

val myRDD = new JdbcRDD( sparkContext, () =>
DriverManager.getConnection(sqlServerConnectionString) ,
  "select CTRY_NA,CTRY_SHRT_NA from dbo.CTRY limit ?, ?",
  0, 5, 1, r => r.getString("CTRY_NA") + ", " +
r.getString("CTRY_SHRT_NA"))


sqlServerConnectionString here is jdbc:sqlserver://
usaecducc1ew1.ccgaco45mak.us-east-1.rds.amazonaws.com
;database=ProdAWS;user=sa;password=?s3iY2mv6.H


Can you please let me know what I am doing wrong? I tried solutions from
all the forums but didn't have any luck.

Thanks,
Asmath.
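
Two things worth checking, offered only as a sketch: "limit ?, ?" is MySQL syntax, and
SQL Server will reject it (JdbcRDD expects a query with two '?' placeholders bound
against a numeric column for partitioning). An alternative is the DataFrame JDBC source,
roughly like this (connection details are placeholders, and the SQL Server JDBC driver
must be on the classpath):

val df = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:sqlserver://host:1433;databaseName=ProdAWS")
  .option("dbtable", "dbo.CTRY")
  .option("user", "sa")
  .option("password", "...")
  .load()

df.select("CTRY_NA", "CTRY_SHRT_NA").show()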


Re: How to use the spark submit script / capability

2016-05-15 Thread Marcelo Vanzin
I don't understand your question. The PR you mention is not about spark-submit.

If you want help with spark-submit, check the Spark docs or "spark-submit -h".

If you want help with the library added in the PR, check Spark's API
documentation.


On Sun, May 15, 2016 at 9:33 AM, Stephen Boesch  wrote:
>
> There is a committed PR from Marcelo Vanzin addressing that capability:
>
> https://github.com/apache/spark/pull/3916/files
>
> Is there any documentation on how to use this?  The PR itself has two
> comments asking for the docs that were not answered.



-- 
Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



pyspark.zip and py4j-0.9-src.zip

2016-05-15 Thread satish saley
Hi,
Is there any way to pull in pyspark.zip and py4j-0.9-src.zip in maven
project?


Re: Executors and Cores

2016-05-15 Thread Mich Talebzadeh
Hi Pradeep,

In your case, what type of cluster are we talking about? A standalone cluster?

HTh

Dr Mich Talebzadeh



LinkedIn
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 15 May 2016 at 13:19, Mail.com  wrote:

> Hi ,
>
> I have seen multiple videos on Spark tuning which show how to determine the #
> cores, # executors and memory size of the job.
>
> In all that I have seen, it seems each job has to be given the max
> resources allowed in the cluster.
>
> How do we factor in input size as well? If I am processing a 1 GB compressed
> file, then I can live with, say, 10 executors and not 21, etc.
>
> Also, do we consider other jobs in the cluster that could be running? I
> will use only 20 GB out of the available 300 GB, etc.
>
> Thanks,
> Pradeep
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: spark udf can not change a json string to a map

2016-05-15 Thread Ted Yu
Can you let us know more about your use case ?

I wonder if you can structure your udf by not returning Map.

Cheers

On Sun, May 15, 2016 at 9:18 AM, 喜之郎 <251922...@qq.com> wrote:

> Hi, all. I want to implement a udf which is used to change a json string
> to a map.
> But a problem occurs. My Spark version: 1.5.1.
>
>
> my udf code:
> 
> public Map evaluate(final String s) {
> if (s == null)
> return null;
> return getString(s);
> }
>
> @SuppressWarnings("unchecked")
> public static Map getString(String s) {
> try {
> String str =  URLDecoder.decode(s, "UTF-8");
> ObjectMapper mapper = new ObjectMapper();
> Map  map = mapper.readValue(str, Map.class);
> return map;
> } catch (Exception e) {
> return new HashMap();
> }
> }
> #
> exception infos:
>
> 16/05/14 21:05:22 ERROR CliDriver: org.apache.spark.sql.AnalysisException:
> Map type in java is unsupported because JVM type erasure makes spark fail
> to catch key and value types in Map<>; line 1 pos 352
> at
> org.apache.spark.sql.hive.HiveInspectors$class.javaClassToDataType(HiveInspectors.scala:230)
> at
> org.apache.spark.sql.hive.HiveSimpleUDF.javaClassToDataType(hiveUDFs.scala:107)
> at org.apache.spark.sql.hive.HiveSimpleUDF.<init>(hiveUDFs.scala:136)
> 
>
>
> I have seen that there is a test suite in Spark which says Spark does not support
> this kind of udf.
> But is there a way to implement this udf?
>
>
>


Re: origin of error

2016-05-15 Thread Ted Yu
Adding back user@spark

From the namenode audit log, you should be able to find out who deleted
part-r-00163-e94fa2c5-aa0d-4a08-b4c3-9fe7087ca493.gz.parquet and when.

There might be other errors in the executor log which would give you more
clues.

On Sun, May 15, 2016 at 9:08 AM, pseudo oduesp 
wrote:

> ERROR hdfs.DFSClient: Failed to close inode 49551738
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
> No lease on
> /user/data2015/df_join2015/_temporary/0/_temporary/attempt_201605151649_0019_m_000163_0/part-r-00163-e94fa2c5-aa0d-4a08-b4c3-9fe7087ca493.gz.parquet
> (inode 49551738): File does not exist. Holder
> DFSClient_NONMAPREDUCE_-24268488_69 does not have any open files.
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3602)
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:3690)
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:3660)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:738)
> at
> org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.complete(AuthorizationProviderProxyClientProtocol.java:243)
> at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:527)
> at
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1060)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
> at java.security.AccessController.doPrivileged(Native Method)
>at
> org.apache.hadoop.hdfs.DistributedFileSystem.close(DistributedFileSystem.java:985)
> at
> org.apache.hadoop.fs.FileSystem$Cache.closeAll(FileSystem.java:2740)
> at
> org.apache.hadoop.fs.FileSystem$Cache$ClientFinalizer.run(FileSystem.java:2757)
> at
> org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
> 16/05/15 16:51:39 ERROR hdfs.DFSClient: Failed to close inode 49551465
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
> No lease on
> /user/data2015/df_join2015/_temporary/0/_temporary/attempt_201605151647_0019_m_30_0/part-r-00030-e94fa2c5-aa0d-4a08-b4c3-9fe7087ca493.gz.parquet
> (inode 49551465): File does not exist. Holder
> DFSClient_NONMAPREDUCE_-24268488_69 does not have any open files.
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3602)
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:3690)
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:3660)
>
>
>
>at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3602)
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:3690)
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:3660)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:738)
> at
> org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.complete(AuthorizationProviderProxyClientProtocol.java:243)
> at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:527)
> at
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1060)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)
>
> at org.apache.hadoop.ipc.Client.call(Client.java:1472)
> at org.apache.hadoop.ipc.Client.call(Client.java:1403)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
> at 

How to use the spark submit script / capability

2016-05-15 Thread Stephen Boesch
There is a committed PR from Marcelo Vanzin addressing that capability:

https://github.com/apache/spark/pull/3916/files

Is there any documentation on how to use this?  The PR itself has two
comments asking for the docs that were not answered.


spark udf can not change a json string to a map

2016-05-15 Thread 喜之郎
Hi, all. I want to implement a udf which is used to change a json string to a
map.
But a problem occurs. My Spark version: 1.5.1.




my udf code:

public Map evaluate(final String s) {
if (s == null)
return null;
return getString(s);
}


@SuppressWarnings("unchecked")
public static Map getString(String s) {
try {
String str =  URLDecoder.decode(s, "UTF-8");
ObjectMapper mapper = new ObjectMapper();
Map  map = mapper.readValue(str, 
Map.class);

return map;
} catch (Exception e) {
return new HashMap();
}
}

#
exception infos:


16/05/14 21:05:22 ERROR CliDriver: org.apache.spark.sql.AnalysisException: Map 
type in java is unsupported because JVM type erasure makes spark fail to catch 
key and value types in Map<>; line 1 pos 352
at 
org.apache.spark.sql.hive.HiveInspectors$class.javaClassToDataType(HiveInspectors.scala:230)
at 
org.apache.spark.sql.hive.HiveSimpleUDF.javaClassToDataType(hiveUDFs.scala:107)
at org.apache.spark.sql.hive.HiveSimpleUDF.<init>(hiveUDFs.scala:136)






I have seen that there is a test suite in Spark which says Spark does not support this
kind of udf.
But is there a way to implement this udf?

Re: origin of error

2016-05-15 Thread Ted Yu
bq. ExecutorLostFailure (executor 4 lost)

Can you check the executor log for more clues?

Which Spark release are you using ?

Cheers

On Sun, May 15, 2016 at 8:47 AM, pseudo oduesp 
wrote:

> Can someone help me with this issue?
>
>
>
> py4j.protocol.Py4JJavaError: An error occurred while calling o126.parquet.
> : org.apache.spark.SparkException: Job aborted.
> at
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:156)
> at
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
> at
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
> at
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
> at
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108)
> at
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
> at
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
> at
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69)
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
> at
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
> at
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933)
> at
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933)
> at
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:197)
> at
> org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146)
> at
> org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:137)
> at
> org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:304)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
> at
> py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
> at py4j.Gateway.invoke(Gateway.java:259)
> at
> py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:207)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.SparkException: Job aborted due to stage
> failure: Task 69 in stage 19.0 failed 4 times, most recent failure: Lost
> task 69.3 in stage 19.0 (TID 3788, prssnbd1s003.bigplay.bigdata.intraxa):
> ExecutorLostFailure (executor 4 lost)
> Driver stacktrace:
> at org.apache.spark.scheduler.DAGScheduler.org
> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1294)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1282)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1281)
> at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1281)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
> at scala.Option.foreach(Option.scala:236)
> at
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1507)
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1469)
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> at
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1824)
> at 

origin of error

2016-05-15 Thread pseudo oduesp
Can someone help me with this issue?



py4j.protocol.Py4JJavaError: An error occurred while calling o126.parquet.
: org.apache.spark.SparkException: Job aborted.
at
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:156)
at
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
at
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
at
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
at
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108)
at
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
at
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
at
org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at
org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
at
org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933)
at
org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933)
at
org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:197)
at
org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146)
at
org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:137)
at
org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:304)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at
py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Job aborted due to stage
failure: Task 69 in stage 19.0 failed 4 times, most recent failure: Lost
task 69.3 in stage 19.0 (TID 3788, prssnbd1s003.bigplay.bigdata.intraxa):
ExecutorLostFailure (executor 4 lost)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org
$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1294)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1282)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1281)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1281)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at scala.Option.foreach(Option.scala:236)
at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1507)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1469)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1824)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1837)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1914)
at
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:150)
... 28 more


Re: Executors and Cores

2016-05-15 Thread Ted Yu
For the last question, have you looked at:

https://spark.apache.org/docs/latest/configuration.html#dynamic-allocation

FYI

On Sun, May 15, 2016 at 5:19 AM, Mail.com  wrote:

> Hi ,
>
> I have seen multiple videos on Spark tuning which show how to determine the
> number of cores, number of executors and memory size for a job.
>
> In all that I have seen, it seems each job has to be given the maximum
> resources allowed in the cluster.
>
> How do we factor in input size as well? If I am processing a 1 GB compressed
> file, then I can live with, say, 10 executors and not 21, etc.
>
> Also, do we consider other jobs in the cluster that could be running? I
> will use only 20 GB out of the available 300 GB, etc.
>
> Thanks,
> Pradeep
>
>
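As a side note on the dynamic allocation pointer above, here is a minimal sketch of the relevant settings on a SparkConf. The min/max executor values are purely illustrative, and the external shuffle service must also be running on the cluster nodes for this to work:

import org.apache.spark.{SparkConf, SparkContext}

// Dynamic allocation lets Spark grow and shrink the executor count with the workload,
// instead of sizing every job for the maximum resources in the cluster.
val conf = new SparkConf()
  .setAppName("dynamic-allocation-sketch")              // placeholder app name
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")         // required for dynamic allocation
  .set("spark.dynamicAllocation.minExecutors", "2")     // illustrative value
  .set("spark.dynamicAllocation.maxExecutors", "10")    // illustrative value

val sc = new SparkContext(conf)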


Executors and Cores

2016-05-15 Thread Mail.com
Hi ,

I have seen multiple videos on Spark tuning which show how to determine the
number of cores, number of executors and memory size for a job.

In all that I have seen, it seems each job has to be given the maximum resources
allowed in the cluster.

How do we factor in input size as well? If I am processing a 1 GB compressed file,
then I can live with, say, 10 executors and not 21, etc.

Also, do we consider other jobs in the cluster that could be running? I will use
only 20 GB out of the available 300 GB, etc.

Thanks,
Pradeep



Re: "collecting" DStream data

2016-05-15 Thread Daniel Haviv
I mistyped, the code is
foreachRDD(r=> arr++=r.collect)

And it does work for ArrayBuffer but not for HashMap

On Sun, May 15, 2016 at 3:04 PM, Silvio Fiorito <
silvio.fior...@granturing.com> wrote:

> Hi Daniel,
>
> Given your example, “arr” is defined on the driver, but the “foreachRDD”
> function is run on the executors. If you want to collect the results of the
> RDD/DStream down to the driver you need to call RDD.collect. You have to be
> careful though that you have enough memory on the driver JVM to hold the
> results, otherwise you’ll have an OOM exception. Also, you can’t update the
> value of a broadcast variable, since it’s immutable.
>
> Thanks,
> Silvio
>
> From: Daniel Haviv 
> Date: Sunday, May 15, 2016 at 6:23 AM
> To: user 
> Subject: "collecting" DStream data
>
> Hi,
> I have a DStream whose values I'd like to collect and broadcast.
> To do so I've created a mutable HashMap which I'm filling with foreachRDD
> but when I'm checking it, it remains empty. If I use ArrayBuffer it works
> as expected.
>
> This is my code:
>
> val arr = scala.collection.mutable.HashMap.empty[String,String]
> MappedVersionsToBrodcast.foreachRDD(r=> { r.foreach(r=> { arr+=r})   } )
>
>
> What am I missing here?
>
> Thank you,
> Daniel
>
>


Re: "collecting" DStream data

2016-05-15 Thread Silvio Fiorito
Hi Daniel,

Given your example, “arr” is defined on the driver, but the “foreachRDD” 
function is run on the executors. If you want to collect the results of the 
RDD/DStream down to the driver you need to call RDD.collect. You have to be 
careful though that you have enough memory on the driver JVM to hold the 
results, otherwise you’ll have an OOM exception. Also, you can’t update the 
value of a broadcast variable, since it’s immutable.

Thanks,
Silvio

From: Daniel Haviv
Date: Sunday, May 15, 2016 at 6:23 AM
To: user
Subject: "collecting" DStream data

Hi,
I have a DStream whose values I'd like to collect and broadcast.
To do so I've created a mutable HashMap which I'm filling with foreachRDD but
when I'm checking it, it remains empty. If I use ArrayBuffer it works as 
expected.

This is my code:

val arr = scala.collection.mutable.HashMap.empty[String,String]
MappedVersionsToBrodcast.foreachRDD(r=> { r.foreach(r=> { arr+=r})   } )


What am I missing here?

Thank you,
Daniel
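Building on the reply above, here is a minimal sketch of the collect-then-broadcast pattern, assuming a DStream[(String, String)] named versions; all names are illustrative. Since a broadcast variable is immutable, the sketch creates a fresh broadcast per batch instead of trying to mutate one:

import org.apache.spark.broadcast.Broadcast
import org.apache.spark.streaming.dstream.DStream

def publishVersions(versions: DStream[(String, String)]): Unit = {
  // Both of these live on the driver; rdd.collect() brings each batch's data
  // back to the driver, so keep it small enough to fit in driver memory.
  var current = Map.empty[String, String]
  var snapshot: Option[Broadcast[Map[String, String]]] = None

  versions.foreachRDD { rdd =>
    current = current ++ rdd.collect()                     // merge the new batch
    snapshot.foreach(_.unpersist())                        // drop the previous broadcast
    snapshot = Some(rdd.sparkContext.broadcast(current))   // re-broadcast the updated map
  }
}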



Structured Streaming in Spark 2.0 and DStreams

2016-05-15 Thread Yuval.Itzchakov
I've been reading/watching videos about the upcoming Spark 2.0 release which
brings us Structured Streaming. One thing I've yet to understand is how this
relates to the current state of working with Streaming in Spark with the
DStream abstraction.

All examples I can find, in the Spark repository/different videos is someone
streaming local JSON files or reading from HDFS/S3/SQL. Also, when browsing
the source, SparkSession seems to be defined inside org.apache.spark.sql, so
this gives me a hunch that this is somehow all related to SQL and the likes,
and not really to DStreams.

What I'm failing to understand is: Will this feature impact how we do
Streaming today? Will I be able to consume a Kafka source in a streaming
fashion (like we do today when we open a stream using KafkaUtils)? Will we
be able to do stateful operations on a Dataset[T] like we do today using
MapWithStateRDD? Or will there be a subset of operations that the catalyst
optimizer can understand such as aggregate and such?

I'd be happy if anyone could shed some light on this.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Structured-Streaming-in-Spark-2-0-and-DStreams-tp26959.html




"collecting" DStream data

2016-05-15 Thread Daniel Haviv
Hi,
I have a DStream whose values I'd like to collect and broadcast.
To do so I've created a mutable HashMap which I'm filling with foreachRDD
but when I'm checking it, it remains empty. If I use ArrayBuffer it works
as expected.

This is my code:

val arr = scala.collection.mutable.HashMap.empty[String,String]
MappedVersionsToBrodcast.foreachRDD(r=> { r.foreach(r=> { arr+=r})   } )


What am I missing here?

Thank you,
Daniel


Re: spark sql write orc table on viewFS throws exception

2016-05-15 Thread Mich Talebzadeh
I am not sure this is going to resolve the INSERT OVERWRITE into ORC table
issue. Can you go to Hive and run

show create table custom.rank_less_orc_none

and send the output.

Is that table defined as transactional?

Another alternative is to use Spark to insert into a normal text table and then
insert from the text table into ORC using HiveContext, or do it purely in
Hive.
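A minimal sketch of that two-step alternative, assuming the data has already been landed in a plain text staging table: the staging table name below is a placeholder, while the ORC table name is the one mentioned above.

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

// Copy from the text staging table into the ORC table through Hive's own INSERT path.
// "custom.rank_less_text_staging" is a placeholder staging table.
hiveContext.sql(
  "INSERT INTO TABLE custom.rank_less_orc_none SELECT * FROM custom.rank_less_text_staging")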


Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com



On 15 May 2016 at 04:01, linxi zeng  wrote:

> Hi all,
> Recently we encountered a problem while using Spark SQL to write an ORC
> table, which is related to
> https://issues.apache.org/jira/browse/HIVE-10790.
> In order to fix this problem we decided to patch the PR onto the Hive
> branch which Spark 1.5 relies on.
> We pulled the Hive branch (
> https://github.com/pwendell/hive/tree/release-1.2.1-spark) and compiled it
> with: mvn clean package -Phadoop-2,dist -DskipTests, and then uploaded it to
> the Nexus without any problem.
> 
> But when we compile Spark against this Hive (group: org.spark-project.hive,
> version: 1.2.1.spark) using: ./make-distribution.sh --tgz -Phive
> -Phive-thriftserver -Psparkr -Pyarn -Dhadoop.version=2.4.1
> -Dprotobuf.version=2.5.0 -DskipTests
> we get this error message:
>
> [INFO] --- maven-resources-plugin:2.6:resources (default-resources) @
> spark-hive_2.10 ---
> [INFO] Using 'UTF-8' encoding to copy filtered resources.
> [INFO] Copying 1 resource
> [INFO] Copying 3 resources
> [INFO][INFO] --- scala-maven-plugin:3.2.2:compile (scala-compile-first) @
> spark-hive_2.10 ---
> [INFO] Using zinc server for incremental compilation
> [info] Compiling 27 Scala sources and 1 Java source to
> /home/sankuai/zenglinxi/spark/sql/hive/target/scala-2.10/classes...
> [warn] Class org.apache.hadoop.hive.shims.HadoopShims not found -
> continuing with a stub.
> [error]
> /home/sankuai/zenglinxi/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/ClientWrapper.scala:35:
> object shims is not a member of package org.apache.hadoop.hive
> [error] import org.apache.hadoop.hive.shims.{HadoopShims, ShimLoader}
> [error]   ^
> [error]
> /home/sankuai/zenglinxi/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/ClientWrapper.scala:114:
> not found: value ShimLoader
> [error] val loadedShimsClassName =
> ShimLoader.getHadoopShims.getClass.getCanonicalName
> [error]^
> [error]
> /home/sankuai/zenglinxi/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/ClientWrapper.scala:123:
> not found: type ShimLoader
> [error]   val shimsField =
> classOf[ShimLoader].getDeclaredField("hadoopShims")
> [error]^
> [error]
> /home/sankuai/zenglinxi/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/ClientWrapper.scala:127:
> not found: type HadoopShims
> [error]   val shims =
> classOf[HadoopShims].cast(shimsClass.newInstance())
> [error]   ^
> [warn] Class org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge not
> found - continuing with a stub.
> [warn] Class org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge not
> found - continuing with a stub.
> [warn] Class org.apache.hadoop.hive.shims.HadoopShims not found -
> continuing with a stub.
> [warn] four warnings found
> [error] four errors found
> [error] Compile failed at 2016-5-13 16:34:44 [4.348s]
> [INFO]
> 
> [INFO] Reactor Summary:
> [INFO]
> [INFO] Spark Project Parent POM ... SUCCESS [
>  3.105 s]
> [INFO] Spark Project Launcher . SUCCESS [
>  8.360 s]
> [INFO] Spark Project Networking ... SUCCESS [
>  8.491 s]
> [INFO] Spark Project Shuffle Streaming Service  SUCCESS [
>  5.110 s]
> [INFO] Spark Project Unsafe ... SUCCESS [
>  6.854 s]
> [INFO] Spark Project Core . SUCCESS [02:33
> min]
> [INFO] Spark Project Bagel  SUCCESS [
>  5.183 s]
> [INFO] Spark Project GraphX ... SUCCESS [
> 15.744 s]
> [INFO] Spark Project Streaming  SUCCESS [
> 39.070 s]
> [INFO] Spark Project Catalyst . SUCCESS [
> 57.416 s]
> [INFO] Spark Project SQL .. SUCCESS [01:11
> min]
> [INFO] Spark Project ML Library ... SUCCESS [01:28
> min]
> [INFO] Spark Project Tools  SUCCESS [
>  2.539 s]
> [INFO] Spark Project Hive . FAILURE [
> 13.273 s]
> [INFO] Spark Project REPL