Re: Help me learn about JOB TASK and DAG in Apache Spark

2023-04-01 Thread Mich Talebzadeh
Good stuff Khalid.

I have created a section in the Apache Spark Community Stack called
spark-foundation (spark-foundation - Apache Spark Community - Slack).

I invite you to add your web link to that section.

HTH
Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sat, 1 Apr 2023 at 13:12, Khalid Mammadov 
wrote:

> Hey AN-TRUONG
>
> I have got some articles about this subject that should help.
> E.g.
> https://khalidmammadov.github.io/spark/spark_internals_rdd.html
>
> Also check other Spark Internals on web.
>
> Regards
> Khalid
>
> On Fri, 31 Mar 2023, 16:29 AN-TRUONG Tran Phan, 
> wrote:
>
>> Thank you for your information,
>>
>> I have tracked the spark history server on port 18080 and the spark UI on
>> port 4040. I see the result of these two tools as similar right?
>>
>> I want to know what each Task ID (Example Task ID 0, 1, 3, 4, 5, ) in
>> the images does, is it possible?
>> https://i.stack.imgur.com/Azva4.png
>>
>> Best regards,
>>
>> An - Truong
>>
>>
>> On Fri, Mar 31, 2023 at 9:38 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Are you familiar with spark GUI default on port 4040?
>>>
>>> have a look.
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Lead Solutions Architect/Engineering Lead
>>> Palantir Technologies Limited
>>>
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Fri, 31 Mar 2023 at 15:15, AN-TRUONG Tran Phan <
>>> tr.phan.tru...@gmail.com> wrote:
>>>
 Hi,

 I am learning about Apache Spark and want to know the meaning of each
 Task created on the Jobs recorded on Spark history.

 For example, the application I write creates 17 jobs, in which job 0
 runs for 10 minutes, there are 2384 small tasks and I want to learn about
 the meaning of these 2384, is it possible?

 I found a picture of DAG in the Jobs and want to know the relationship
 between DAG and Task, is it possible (Specifically from the attached file
 DAG and 2384 tasks below)?

 Thank you very much, have a nice day everyone.

 Best regards,

 An-Trường.

 -
 To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>>
>>
>> --
>> Trân Trọng,
>>
>> An Trường.
>>
>


Re: Help me learn about JOB TASK and DAG in Apache Spark

2023-04-01 Thread Khalid Mammadov
Hey AN-TRUONG

I have got some articles about this subject that should help.
E.g.
https://khalidmammadov.github.io/spark/spark_internals_rdd.html

Also check other Spark Internals on web.

Regards
Khalid

On Fri, 31 Mar 2023, 16:29 AN-TRUONG Tran Phan, 
wrote:

> Thank you for your information,
>
> I have tracked the spark history server on port 18080 and the spark UI on
> port 4040. I see the result of these two tools as similar right?
>
> I want to know what each Task ID (Example Task ID 0, 1, 3, 4, 5, ) in
> the images does, is it possible?
> https://i.stack.imgur.com/Azva4.png
>
> Best regards,
>
> An - Truong
>
>
> On Fri, Mar 31, 2023 at 9:38 PM Mich Talebzadeh 
> wrote:
>
>> Are you familiar with spark GUI default on port 4040?
>>
>> have a look.
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Lead Solutions Architect/Engineering Lead
>> Palantir Technologies Limited
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Fri, 31 Mar 2023 at 15:15, AN-TRUONG Tran Phan <
>> tr.phan.tru...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I am learning about Apache Spark and want to know the meaning of each
>>> Task created on the Jobs recorded on Spark history.
>>>
>>> For example, the application I write creates 17 jobs, in which job 0
>>> runs for 10 minutes, there are 2384 small tasks and I want to learn about
>>> the meaning of these 2384, is it possible?
>>>
>>> I found a picture of DAG in the Jobs and want to know the relationship
>>> between DAG and Task, is it possible (Specifically from the attached file
>>> DAG and 2384 tasks below)?
>>>
>>> Thank you very much, have a nice day everyone.
>>>
>>> Best regards,
>>>
>>> An-Trường.
>>>
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>
> --
> Trân Trọng,
>
> An Trường.
>


Re: Help me learn about JOB TASK and DAG in Apache Spark

2023-03-31 Thread Mich Talebzadeh
Yes, the history server shows completed jobs; port 4040 shows the jobs that are
currently running.

You should also take screenshots of the Executors and Stages tabs.
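
To make the relationship concrete, here is a minimal spark-shell sketch (the sizes
are made up): each action submits a job, each shuffle boundary splits the job's DAG
into stages, and every stage runs one task per partition, which is what the Jobs and
Stages pages (and the history server) list.

val rdd = sc.parallelize(1 to 1000000, 8)                  // an RDD with 8 partitions
val counts = rdd.map(x => (x % 10, 1)).reduceByKey(_ + _)  // shuffle boundary => the DAG splits into 2 stages
counts.count()                                             // the action => 1 job; each stage runs one task per partition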

HTH

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Fri, 31 Mar 2023 at 16:17, AN-TRUONG Tran Phan 
wrote:

> Thank you for your information,
>
> I have tracked the spark history server on port 18080 and the spark UI on
> port 4040. I see the result of these two tools as similar right?
>
> I want to know what each Task ID (Example Task ID 0, 1, 3, 4, 5, ) in
> the images does, is it possible?
> https://i.stack.imgur.com/Azva4.png
>
> Best regards,
>
> An - Truong
>
>
> On Fri, Mar 31, 2023 at 9:38 PM Mich Talebzadeh 
> wrote:
>
>> Are you familiar with spark GUI default on port 4040?
>>
>> have a look.
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Lead Solutions Architect/Engineering Lead
>> Palantir Technologies Limited
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Fri, 31 Mar 2023 at 15:15, AN-TRUONG Tran Phan <
>> tr.phan.tru...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I am learning about Apache Spark and want to know the meaning of each
>>> Task created on the Jobs recorded on Spark history.
>>>
>>> For example, the application I write creates 17 jobs, in which job 0
>>> runs for 10 minutes, there are 2384 small tasks and I want to learn about
>>> the meaning of these 2384, is it possible?
>>>
>>> I found a picture of DAG in the Jobs and want to know the relationship
>>> between DAG and Task, is it possible (Specifically from the attached file
>>> DAG and 2384 tasks below)?
>>>
>>> Thank you very much, have a nice day everyone.
>>>
>>> Best regards,
>>>
>>> An-Trường.
>>>
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>
> --
> Trân Trọng,
>
> An Trường.
>


Re: Help me learn about JOB TASK and DAG in Apache Spark

2023-03-31 Thread AN-TRUONG Tran Phan
Thank you for your information.

I have tracked the Spark history server on port 18080 and the Spark UI on
port 4040. The results of these two tools look similar to me, is that right?

I want to know what each Task ID (for example Task ID 0, 1, 3, 4, 5, ...) in
the images does. Is that possible?
https://i.stack.imgur.com/Azva4.png

Best regards,

An - Truong


On Fri, Mar 31, 2023 at 9:38 PM Mich Talebzadeh 
wrote:

> Are you familiar with spark GUI default on port 4040?
>
> have a look.
>
> HTH
>
> Mich Talebzadeh,
> Lead Solutions Architect/Engineering Lead
> Palantir Technologies Limited
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Fri, 31 Mar 2023 at 15:15, AN-TRUONG Tran Phan <
> tr.phan.tru...@gmail.com> wrote:
>
>> Hi,
>>
>> I am learning about Apache Spark and want to know the meaning of each
>> Task created on the Jobs recorded on Spark history.
>>
>> For example, the application I write creates 17 jobs, in which job 0 runs
>> for 10 minutes, there are 2384 small tasks and I want to learn about the
>> meaning of these 2384, is it possible?
>>
>> I found a picture of DAG in the Jobs and want to know the relationship
>> between DAG and Task, is it possible (Specifically from the attached file
>> DAG and 2384 tasks below)?
>>
>> Thank you very much, have a nice day everyone.
>>
>> Best regards,
>>
>> An-Trường.
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>

-- 
Trân Trọng,

An Trường.


Re: Help me learn about JOB TASK and DAG in Apache Spark

2023-03-31 Thread Mich Talebzadeh
Are you familiar with the Spark GUI, which runs on port 4040 by default?

Have a look.

HTH

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Fri, 31 Mar 2023 at 15:15, AN-TRUONG Tran Phan 
wrote:

> Hi,
>
> I am learning about Apache Spark and want to know the meaning of each Task
> created on the Jobs recorded on Spark history.
>
> For example, the application I write creates 17 jobs, in which job 0 runs
> for 10 minutes, there are 2384 small tasks and I want to learn about the
> meaning of these 2384, is it possible?
>
> I found a picture of DAG in the Jobs and want to know the relationship
> between DAG and Task, is it possible (Specifically from the attached file
> DAG and 2384 tasks below)?
>
> Thank you very much, have a nice day everyone.
>
> Best regards,
>
> An-Trường.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org


Unusual bug, please help me, I can do nothing!!!

2022-03-30 Thread spark User
Hello, I am a Spark user. I use the "spark-shell.cmd" startup command in Windows
cmd. The first startup is normal, but after I use "ctrl+c" to force-close the Spark
window, it cannot start normally again. The error message is as follows: "Failed to
initialize Spark session. org.apache.spark.SparkException: Invalid Spark URL:
spark://HeartbeatReceiver@x.168.137.41:49963".
When I try adding "x.168.137.41" to 'etc/hosts' it works fine, but after using
"ctrl+c" again it once more cannot start normally. Please help me.

Error bug, please help me!!!

2022-03-20 Thread spark User
Hello, I am a Spark user. I use the "spark-shell.cmd" startup command in Windows
cmd. The first startup is normal, but after I use "ctrl+c" to force-close the Spark
window, it cannot start normally again. The error message is as follows: "Failed to
initialize Spark session. org.apache.spark.SparkException: Invalid Spark URL:
spark://HeartbeatReceiver@x.168.137.41:49963".
When I try adding "x.168.137.41" to 'etc/hosts' it works fine, but after using
"ctrl+c" again it once more cannot start normally. Please help me.

Please help me: when I write code to connect Kafka with Spark using Python and run it in Jupyter, an error is displayed

2018-09-16 Thread hager
I wrote code to connect Kafka with Spark using Python and I run the code in
Jupyter. My code:
import os
#os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars
/home/hadoop/Desktop/spark-program/kafka/spark-streaming-kafka-0-8-assembly_2.10-2.0.0-preview.jar
pyspark-shell'
os.environ['PYSPARK_SUBMIT_ARGS'] = "--packages
org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0 pyspark-shell"

os.environ['PYSPARK_SUBMIT_ARGS'] = "--packages
org.apache.spark:spark-streaming-kafka-0-8_2.11:2.3.0 pyspark-shell"

import pyspark
from pyspark.streaming.kafka import KafkaUtils
from pyspark.streaming import StreamingContext

#sc = SparkContext()
ssc = StreamingContext(sc,1)

broker = "iotmsgs"
directKafkaStream = KafkaUtils.createDirectStream(ssc, ["test1"],
{"metadata.broker.list": broker})
directKafkaStream.pprint()
ssc.start()

The error displayed is:
Spark Streaming's Kafka libraries not found in class path. Try one of the
following.

  1. Include the Kafka library and its dependencies with in the
 spark-submit command as

 $ bin/spark-submit --packages
org.apache.spark:spark-streaming-kafka-0-8:2.3.0 ...

  2. Download the JAR of the artifact from Maven Central
http://search.maven.org/,
 Group Id = org.apache.spark, Artifact Id =
spark-streaming-kafka-0-8-assembly, Version = 2.3.0.
 Then, include the jar in the spark-submit command as

 $ bin/spark-submit --jars  ... 
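
For reference, a minimal Scala sketch of the same direct-stream approach (the code
above uses PySpark; the broker address and topic here are hypothetical), assuming
the spark-streaming-kafka-0-8 artifact has been put on the classpath with --packages
or --jars as the message describes:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("kafka-direct-sketch").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(1))

// metadata.broker.list expects the host:port of a Kafka broker, not a topic name
val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("test1"))

stream.map(_._2).print()   // print the message values of each batch
ssc.start()
ssc.awaitTermination()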



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Spark streaming giving me a bunch of WARNINGS, please help me understand them

2017-07-09 Thread shyla deshpande
WARN  Use an existing SparkContext, some configuration may not take effect.
I wanted to restart the Spark streaming app, so I stopped the running one and
issued a new spark-submit. Why and how would it use an existing SparkContext?

WARN  Spark is not running in local mode, therefore the checkpoint
directory must not be on the local filesystem. Directory
'file:/efs/checkpoint' appears to be on the local filesystem.

WARN  overriding enable.auto.commit to false for executor
WARN  overriding auto.offset.reset to none for executor
WARN  overriding executor group.id to spark-executor-mygroupid
WARN  overriding receive.buffer.bytes to 65536 see KAFKA-3135
WARN  overriding enable.auto.commit to false for executor
WARN  overriding auto.offset.reset to none for executor
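
On the first warning, one common way a restart ends up reusing an existing context
is checkpoint recovery. A minimal sketch (the app name and paths are hypothetical)
of the StreamingContext.getOrCreate pattern: when a checkpoint already exists, the
context and its configuration are rebuilt from it rather than from the factory
function, and the checkpoint directory should sit on a shared filesystem when not
running in local mode.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("my-streaming-app")
  val ssc = new StreamingContext(conf, Seconds(10))
  // on a cluster the checkpoint dir should be on a shared filesystem, not file:/
  ssc.checkpoint("hdfs:///checkpoints/my-streaming-app")
  // ... define DStreams and output operations here ...
  ssc
}

// reuses the checkpointed context if one exists, otherwise calls createContext()
val ssc = StreamingContext.getOrCreate("hdfs:///checkpoints/my-streaming-app", createContext _)
ssc.start()
ssc.awaitTermination()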


the function of countByValueAndWindow and foreachRDD in DStream, would you please help me understand it?

2017-06-27 Thread ??????????
Hi all,


I have code like below:
Logger.getLogger("org.apache.spark").setLevel( Level.ERROR)
//Logger.getLogger("org.apache.spark.streaming.dstream").setLevel( 
Level.DEBUG)
val conf = new SparkConf().setAppName("testDstream").setMaster("local[4]")
//val sc = SparkContext.getOrCreate( conf)
val ssc = new StreamingContext(conf, Seconds(1))

ssc.checkpoint( "E:\\spark\\tmp\\cp")
val lines = ssc.socketTextStream("127.0.0.1", )
lines.foreachRDD( r=>{
  println("RDD" + r.id + "begin" + "   " + new SimpleDateFormat("-mm-dd 
 HH:MM:SS").format( new Date()))
  r.foreach( ele => println(":::" + ele))
  println("RDD" + r.id + "end")
})
lines.countByValueAndWindow( Seconds(4), Seconds(1)).foreachRDD( s => { 
 // here is key code 
  println( "countByValueAndWindow RDD ID IS : " + s.id + "begin")
  println("time is " + new SimpleDateFormat("-mm-dd  HH:MM:SS").format( 
new Date()))
  s.foreach( e => println("data is " + e._1 + " :" + e._2))
  println("countByValueAndWindow RDD ID IS : " + s.id + "end")
})

ssc.start() // Start the computation
ssc.awaitTermination()  // Wait for the computation to terminate
I run the code and use "nc" to send the messages manually. The speed at which I
input messages is about one letter per second. I know the times in the log do not
exactly equal the window duration, but I think they are very near. The output and
my comments are:

---------------------------------------------------------------
RDD1begin   2017-41-27  22:06:16
RDD1end
countByValueAndWindow RDD ID IS : 7  begin
time is 2017-41-27  22:06:16
countByValueAndWindow RDD ID IS : 7  end
RDD8begin   2017-41-27  22:06:17
RDD8end
countByValueAndWindow RDD ID IS : 13  begin
time is 2017-41-27  22:06:17
countByValueAndWindow RDD ID IS : 13  end
RDD14begin   2017-41-27  22:06:18
:::1
RDD14end
countByValueAndWindow RDD ID IS : 19  begin
time is 2017-41-27  22:06:18   <== data from 22:06:15 -- 22:06:18 is in RDD 14.
data is 1 :1
countByValueAndWindow RDD ID IS : 19  end
RDD20begin   2017-41-27  22:06:19
:::2
RDD20end
countByValueAndWindow RDD ID IS : 25  begin
time is 2017-41-27  22:06:19   <== data from 22:06:16 -- 22:06:19 is in RDD 14, 20.
data is 1 :1
data is 2 :1
countByValueAndWindow RDD ID IS : 25  end
RDD26begin   2017-41-27  22:06:20
:::3
RDD26end
countByValueAndWindow RDD ID IS : 31  begin
time is 2017-41-27  22:06:20   <== data from 22:06:17 -- 22:06:20 is in RDD 14, 20, 26.
data is 2 :1
data is 1 :1
data is 3 :1
countByValueAndWindow RDD ID IS : 31  end
RDD32begin   2017-41-27  22:06:21
:::4
RDD32end
countByValueAndWindow RDD ID IS : 37  begin
time is 2017-41-27  22:06:21   <== data from 22:06:18 -- 22:06:21 is in RDD 14, 20, 26, 32.
data is 2 :1
data is 1 :1
data is 4 :1
data is 3 :1
countByValueAndWindow RDD ID IS : 37  end
RDD38begin   2017-41-27  22:06:22
:::5
:::6
RDD38end
countByValueAndWindow RDD ID IS : 43  begin
time is 2017-41-27  22:06:22   <== data from 22:06:19 -- 22:06:22 is in RDD 20, 26, 32, 38. Here 14 is out of the window.
data is 4 :1
data is 5 :1
data is 6 :1
data is 2 :1
data is 3 :1
countByValueAndWindow RDD ID IS : 43  end
RDD44begin   2017-41-27  22:06:23
:::7
RDD44end
countByValueAndWindow RDD ID IS : 49  begin
time is 2017-41-27  22:06:23   <== data from 22:06:20 -- 22:06:23 is in RDD 26, 32, 38, 44. Here 20 is out of the window.
data is 5 :1
data is 4 :1
data is 6 :1
data is 7 :1
data is 3 :1
countByValueAndWindow RDD ID IS : 49  end
---------------------------------------------------------------

I think the foreachRDD function outputs the last RDD calculated by
countByValueAndWindow, and the above log validates my idea. Now I change the key
line to

lines.countByValueAndWindow( Seconds(4), Seconds(6)).foreachRDD( s => {  // here is key code

so the slide duration is 6 seconds. The log and my comments are below:

---------------------------------------------------------------
RDD1begin   2017-59-27  10:59:12
RDD1end
RDD2begin   2017-59-27  10:59:13
:::1
:::2
RDD2end
RDD3begin   2017-59-27  10:59:14
:::3
RDD3end
RDD4begin   2017-59-27  10:59:15
:::4
RDD4end
RDD5begin   2017-59-27  10:59:16
:::5
RDD5end
RDD6begin   2017-59-27  10:59:17
RDD6end
countByValueAndWindow RDD ID IS : 22  begin
time is 2017-59-27  10:59:17   <== I think here is OK, even RDD2 is calculated.
data is 4 :1
data is 5 :1
data is 1 :1
data is 2 :1
data is 3 :1
countByValueAndWindow RDD ID IS : 22  end
RDD23begin   2017-59-27  10:59:18
:::6
RDD23end
RDD24begin   2017-59-27  10:59:19
:::8
:::7
RDD24end
RDD25begin   2017-59-27  10:59:20
:::9
RDD25end
RDD26begin   2017-59-27  10:59:21
:::0
RDD26end
RDD27begin   2017-59-27  10:59:22
:::-
RDD27end
RDD28begin   2017-59-27  10:59:23
:::p
RDD28end
countByValueAndWindow RDD ID IS : 43  begin
time is 2017-59-27  10:59:23   <== the data between 10:59:20 -- 10:59:23 should be RDD 25, 26, 27, 28, but the data is wrong.
data is 6 :1
data is 2 :1
data is 9 :1
data is - :1
data is 1 :1
data is 8 :1
data is p :1
dat
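
For reference, a minimal standalone sketch (the port and checkpoint path are
hypothetical) of the two parameters involved: countByValueAndWindow(windowDuration,
slideDuration) emits, every slideDuration, the value counts over the last
windowDuration of data, so Seconds(4), Seconds(1) prints every 1-second batch while
Seconds(4), Seconds(6) prints only every 6 seconds, each time covering just the
preceding 4 seconds of input.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("windowSketch").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(1))
// countByValueAndWindow maintains running counts across batches, so checkpointing is required
ssc.checkpoint("/tmp/window-checkpoint")

val lines = ssc.socketTextStream("127.0.0.1", 9999)

// window = the last 4 seconds of data, slide = emit every 1 second
// (change the slide to Seconds(6) to emit only every 6th batch)
lines.countByValueAndWindow(Seconds(4), Seconds(1)).print()

ssc.start()
ssc.awaitTermination()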

Re: the compile of Spark stopped without any hints, would you please help me?

2017-06-25 Thread Ted Yu
build-classpath
> (generate-test-classpath) @ spark-sketch_2.11 ---
> [INFO]
> [INFO] --- maven-surefire-plugin:2.19.1:test (default-test) @
> spark-sketch_2.11 ---
> [INFO] Tests are skipped.
> [INFO]
> [INFO] --- maven-surefire-plugin:2.19.1:test (test) @ spark-sketch_2.11
> ---
> [INFO] Tests are skipped.
> [INFO]
> [INFO] --- scalatest-maven-plugin:1.0:test (test) @ spark-sketch_2.11 ---
> [INFO] Tests are skipped.
> [INFO]
> [INFO] --- maven-jar-plugin:2.6:test-jar (prepare-test-jar) @
> spark-sketch_2.11 ---
> [INFO] Building jar: E:\spark\fromweb\spark-branch-
> 2.1\common\sketch\target\spark-sketch_2.11-2.1.2-SNAPSHOT-tests.jar
> [INFO]
> [INFO] --- maven-jar-plugin:2.6:jar (default-jar) @ spark-sketch_2.11 ---
> [INFO] Building jar: E:\spark\fromweb\spark-branch-
> 2.1\common\sketch\target\spark-sketch_2.11-2.1.2-SNAPSHOT.jar
> [INFO]
> [INFO] --- maven-site-plugin:3.3:attach-descriptor (attach-descriptor) @
> spark-sketch_2.11 ---
> [INFO]
> [INFO] --- maven-shade-plugin:2.4.3:shade (default) @ spark-sketch_2.11 ---
> [INFO] Excluding org.apache.spark:spark-tags_2.11:jar:2.1.2-SNAPSHOT from
> the shaded jar.
> [INFO] Excluding org.scala-lang:scala-library:jar:2.11.8 from the shaded
> jar.
> [INFO] Including org.spark-project.spark:unused:jar:1.0.0 in the shaded
> jar.
> [INFO] Replacing original artifact with shaded artifact.
> [INFO] Replacing E:\spark\fromweb\spark-branch-2.1\common\sketch\target\
> spark-sketch_2.11-2.1.2-SNAPSHOT.jar with E:\spark\fromweb\spark-branch-
> 2.1\common\sketch\target\spark-sketch_2.11-2.1.2-SNAPSHOT-shaded.jar
> [INFO] Dependency-reduced POM written at: E:\spark\fromweb\spark-branch-
> 2.1\common\sketch\dependency-reduced-pom.xml
> [INFO] Dependency-reduced POM written at: E:\spark\fromweb\spark-branch-
> 2.1\common\sketch\dependency-reduced-pom.xml
> [INFO]
> [INFO] --- maven-source-plugin:2.4:jar-no-fork (create-source-jar) @
> spark-sketch_2.11 ---
> [INFO] Building jar: E:\spark\fromweb\spark-branch-
> 2.1\common\sketch\target\spark-sketch_2.11-2.1.2-SNAPSHOT-sources.jar
> [INFO]
> [INFO] --- maven-source-plugin:2.4:test-jar-no-fork (create-source-jar) @
> spark-sketch_2.11 ---
> [INFO] Building jar: E:\spark\fromweb\spark-branch-
> 2.1\common\sketch\target\spark-sketch_2.11-2.1.2-SNAPSHOT-test-sources.jar
> [INFO]
> [INFO] 
> 
> [INFO] Building Spark Project Networking 2.1.2-SNAPSHOT
> [INFO] 
> 
> [INFO]
> [INFO] --- maven-enforcer-plugin:1.4.1:enforce (enforce-versions) @
> spark-network-common_2.11 ---
> [INFO]
> [INFO] --- scala-maven-plugin:3.2.2:add-source (eclipse-add-source) @
> spark-network-common_2.11 ---
> [INFO] Add Source directory: E:\spark\fromweb\spark-branch-
> 2.1\common\network-common\src\main\scala
> [INFO] Add Test Source directory: E:\spark\fromweb\spark-branch-
> 2.1\common\network-common\src\test\scala
> [INFO]
> [INFO] --- maven-dependency-plugin:2.10:build-classpath (default-cli) @
> spark-network-common_2.11 ---
> [INFO] Dependencies classpath:
> C:\Users\shaof\.m2\repository\io\netty\netty-all\4.0.42.
> Final\netty-all-4.0.42.Final.jar;C:\Users\shaof\.m2\
> repository\com\fasterxml\jackson\core\jackson-core\2.6.
> 5\jackson-core-2.6.5.jar;C:\Users\shaof\.m2\repository\
> org\spark-project\spark\unused\1.0.0\unused-1.0.0.jar;
> C:\Users\shaof\.m2\repository\com\fasterxml\jackson\core\
> jackson-annotations\2.6.5\jackson-annotations-2.6.5.jar;
> C:\Users\shaof\.m2\repository\com\google\code\findbugs\
> jsr305\1.3.9\jsr305-1.3.9.jar;E:\spark\fromweb\spark-branch-
> 2.1\common\tags\target\spark-tags_2.11-2.1.2-SNAPSHOT.jar;
> C:\Users\shaof\.m2\repository\org\scala-lang\scala-library\
> 2.11.8\scala-library-2.11.8.jar;C:\Users\shaof\.m2\
> repository\org\fusesource\leveldbjni\leveldbjni-all\1.8\
> leveldbjni-all-1.8.jar;C:\Users\shaof\.m2\repository\
> org\apache\commons\commons-lang3\3.5\commons-lang3-3.5.
> jar;C:\Users\shaof\.m2\repository\com\fasterxml\
> jackson\core\jackson-databind\2.6.5\jackson-databind-2.6.5.
> jar;C:\Users\shaof\.m2\repository\com\google\guava\
> guava\14.0.1\guava-14.0.1.jar
> [INFO]
> [INFO] --- maven-remote-resources-plugin:1.5:process (default) @
> spark-network-common_2.11 ---
> [INFO]
> [INFO] --- maven-resources-plugin:2.6:resources (default-resources) @
> spark-network-common_2.11 ---
> [INFO] Using 'UTF-8' encoding to copy filtered resources.
> [INFO] skip non existing resourceDirectory E:\spark\fromweb\spark-branch-
> 2.1\common\network-common\src\main\resources
> [INFO] Copying 3 resources
> [INFO]
> [INFO] --- scala-maven-plugin:3.2.2:compile (scala-compile-first) @
> spark-network-common_2.11 ---  <stop here for more than 30
> minutes.
>
> 
>
> I stopped it and retried again. It stopped at the same point.
>
> Would you like help me please?Or tell me which log file I can check to
> find the reason.
>
> My OS is WIN10, SPARK code is 2.1.
> Maven is 3.5.0.
> jdk is 1.8.
> Compile command is "mvn  -Phadoop-2.7 -Dhadoop.version=2.7.0 -DskipTests
> clean package"
>
>
> Thanks
> Fei  Shao
>


the compile of Spark stopped without any hints, would you please help me?

2017-06-25 Thread ??????????
ect.spark:unused:jar:1.0.0 in the shaded jar.
[INFO] Replacing original artifact with shaded artifact.
[INFO] Replacing 
E:\spark\fromweb\spark-branch-2.1\common\sketch\target\spark-sketch_2.11-2.1.2-SNAPSHOT.jar
 with 
E:\spark\fromweb\spark-branch-2.1\common\sketch\target\spark-sketch_2.11-2.1.2-SNAPSHOT-shaded.jar
[INFO] Dependency-reduced POM written at: 
E:\spark\fromweb\spark-branch-2.1\common\sketch\dependency-reduced-pom.xml
[INFO] Dependency-reduced POM written at: 
E:\spark\fromweb\spark-branch-2.1\common\sketch\dependency-reduced-pom.xml
[INFO]
[INFO] --- maven-source-plugin:2.4:jar-no-fork (create-source-jar) @ 
spark-sketch_2.11 ---
[INFO] Building jar: 
E:\spark\fromweb\spark-branch-2.1\common\sketch\target\spark-sketch_2.11-2.1.2-SNAPSHOT-sources.jar
[INFO]
[INFO] --- maven-source-plugin:2.4:test-jar-no-fork (create-source-jar) @ 
spark-sketch_2.11 ---
[INFO] Building jar: 
E:\spark\fromweb\spark-branch-2.1\common\sketch\target\spark-sketch_2.11-2.1.2-SNAPSHOT-test-sources.jar
[INFO]
[INFO] 
[INFO] Building Spark Project Networking 2.1.2-SNAPSHOT
[INFO] 
[INFO]
[INFO] --- maven-enforcer-plugin:1.4.1:enforce (enforce-versions) @ 
spark-network-common_2.11 ---
[INFO]
[INFO] --- scala-maven-plugin:3.2.2:add-source (eclipse-add-source) @ 
spark-network-common_2.11 ---
[INFO] Add Source directory: 
E:\spark\fromweb\spark-branch-2.1\common\network-common\src\main\scala
[INFO] Add Test Source directory: 
E:\spark\fromweb\spark-branch-2.1\common\network-common\src\test\scala
[INFO]
[INFO] --- maven-dependency-plugin:2.10:build-classpath (default-cli) @ 
spark-network-common_2.11 ---
[INFO] Dependencies classpath:
C:\Users\shaof\.m2\repository\io\netty\netty-all\4.0.42.Final\netty-all-4.0.42.Final.jar;C:\Users\shaof\.m2\repository\com\fasterxml\jackson\core\jackson-core\2.6.5\jackson-core-2.6.5.jar;C:\Users\shaof\.m2\repository\org\spark-project\spark\unused\1.0.0\unused-1.0.0.jar;C:\Users\shaof\.m2\repository\com\fasterxml\jackson\core\jackson-annotations\2.6.5\jackson-annotations-2.6.5.jar;C:\Users\shaof\.m2\repository\com\google\code\findbugs\jsr305\1.3.9\jsr305-1.3.9.jar;E:\spark\fromweb\spark-branch-2.1\common\tags\target\spark-tags_2.11-2.1.2-SNAPSHOT.jar;C:\Users\shaof\.m2\repository\org\scala-lang\scala-library\2.11.8\scala-library-2.11.8.jar;C:\Users\shaof\.m2\repository\org\fusesource\leveldbjni\leveldbjni-all\1.8\leveldbjni-all-1.8.jar;C:\Users\shaof\.m2\repository\org\apache\commons\commons-lang3\3.5\commons-lang3-3.5.jar;C:\Users\shaof\.m2\repository\com\fasterxml\jackson\core\jackson-databind\2.6.5\jackson-databind-2.6.5.jar;C:\Users\shaof\.m2\repository\com\google\guava\guava\14.0.1\guava-14.0.1.jar
[INFO]
[INFO] --- maven-remote-resources-plugin:1.5:process (default) @ 
spark-network-common_2.11 ---
[INFO]
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ 
spark-network-common_2.11 ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] skip non existing resourceDirectory 
E:\spark\fromweb\spark-branch-2.1\common\network-common\src\main\resources
[INFO] Copying 3 resources
[INFO]
[INFO] --- scala-maven-plugin:3.2.2:compile (scala-compile-first) @ 
spark-network-common_2.11 ---  <stop here for more than 30 
minutes.






I stopped it and retried again. It stopped at the same point.


Would you please help me? Or tell me which log file I can check to find the
reason.


My OS is WIN10, SPARK code is 2.1.
Maven is 3.5.0.
jdk is 1.8.
Compile command is "mvn  -Phadoop-2.7 -Dhadoop.version=2.7.0 -DskipTests clean 
package"




Thanks
Fei  Shao

Please help me out!!! Getting an error while trying to run a Hive Java generic UDF in Spark

2017-01-17 Thread Sirisha Cheruvu
Hi Everyone..


I am getting the below error while running a Hive Java UDF from the SQL context:


org.apache.spark.sql.AnalysisException: No handler for Hive udf class
com.nexr.platform.hive.udf.GenericUDFNVL2 because:
com.nexr.platform.hive.udf.GenericUDFNVL2.; line 1 pos 26
at
org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:105)
at
org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:64)
at scala.util.Try.getOrElse(Try.scala:77)





My script is this


import org.apache.spark.sql.hive.HiveContext
val hc = new org.apache.spark.sql.hive.HiveContext(sc) ;
hc.sql("add jar /home/cloudera/Downloads/genudnvl2.jar");
hc.sql("create temporary function nexr_nvl2 as
'com.nexr.platform.hive.udf.GenericUDFNVL2'");
hc.sql("select nexr_nvl2(name,let,ret) from testtab5").show;
System.exit(0);




Attached is the Java Hive UDF which I am trying to run on Spark.

Please help.

Regards,
SIrisha


GenericUDFNVL2.java
Description: Binary data

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: Help me! Spark WebUI is corrupted!

2015-12-31 Thread Aniket Bhatnagar
Are you running on YARN or standalone?

On Thu, Dec 31, 2015, 3:35 PM LinChen  wrote:

> *Screenshot1(Normal WebUI)*
>
>
>
> *Screenshot2(Corrupted WebUI)*
>
>
>
> As screenshot2 shows, the format of my Spark WebUI looks strange and I
> cannot click the description of active jobs. It seems there is something
> missing in my opearing system. I googled it but find nothing. Could anybody
> help me?
>
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org


Help me! Spark WebUI is corrupted!

2015-12-31 Thread LinChen
Screenshot1(Normal WebUI)


Screenshot2(Corrupted WebUI)


As screenshot2 shows, the format of my Spark WebUI looks strange and I cannot
click the description of active jobs. It seems there is something missing in my
operating system. I googled it but found nothing. Could anybody help me?


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

anyone who can help me out with this error please

2015-12-04 Thread Mich Talebzadeh
 

Hi,

 

 

I am trying to make Hive work with Spark.

 

I have been told that I need to use Spark 1.3 and build it from source code
WITHOUT HIVE libraries.

 

I have built it as follows:

 

./make-distribution.sh --name "hadoop2-without-hive" --tgz
"-Pyarn,hadoop-provided,hadoop-2.4,parquet-provided"

 

 

Now the issue I have is that I cannot start the master node.

 

When I try

 

hduser@rhes564::/usr/lib/spark-1.3.0-bin-hadoop2-without-hive/sbin

> ./start-master.sh

starting org.apache.spark.deploy.master.Master, logging to
/usr/lib/spark-1.3.0-bin-hadoop2-without-hive/sbin/../logs/spark-hduser-org.
apache.spark.deploy.master.Master-1-rhes564.out

failed to launch org.apache.spark.deploy.master.Master:

at java.lang.ClassLoader.loadClass(ClassLoader.java:357)

... 6 more

full log in
/usr/lib/spark-1.3.0-bin-hadoop2-without-hive/sbin/../logs/spark-hduser-org.
apache.spark.deploy.master.Master-1-rhes564.out

 

I get

 

Spark Command: /usr/java/latest/bin/java -cp
:/usr/lib/spark-1.3.0-bin-hadoop2-without-hive/sbin/../conf:/usr/lib/spark-1
.3.0-bin-hadoop2-without-hive/lib/spark-assembly-1.3.0-hadoop2.4.0.jar:/home
/hduser/hadoop-2.6.0/etc/hadoop -XX:MaxPermSize=128m
-Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m
org.apache.spark.deploy.master.Master --ip 50.140.197.217 --port 7077
--webui-port 8080



 

Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/Logger

at java.lang.Class.getDeclaredMethods0(Native Method)

at java.lang.Class.privateGetDeclaredMethods(Class.java:2521)

at java.lang.Class.getMethod0(Class.java:2764)

at java.lang.Class.getMethod(Class.java:1653)

at
sun.launcher.LauncherHelper.getMainMethod(LauncherHelper.java:494)

at
sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:486)

Caused by: java.lang.ClassNotFoundException: org.slf4j.Logger

at java.net.URLClassLoader$1.run(URLClassLoader.java:366)

at java.net.URLClassLoader$1.run(URLClassLoader.java:355)

at java.security.AccessController.doPrivileged(Native Method)

at java.net.URLClassLoader.findClass(URLClassLoader.java:354)

at java.lang.ClassLoader.loadClass(ClassLoader.java:424)

at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)

at java.lang.ClassLoader.loadClass(ClassLoader.java:357)

... 6 more

 

Any advice will be appreciated.

 

Thanks,

 

Mich

 

 

NOTE: The information in this email is proprietary and confidential. This
message is for the designated recipient only, if you are not the intended
recipient, you should destroy it immediately. Any information in this
message shall not be understood as given or endorsed by Peridale Technology
Ltd, its subsidiaries or their employees, unless expressly so stated. It is
the responsibility of the recipient to ensure that this email is virus free,
therefore neither Peridale Ltd, its subsidiaries nor their employees accept
any responsibility.

 



Re: Please help me understand TF-IDF Vector structure

2015-03-14 Thread Xi Shen
Hey, I worked it out myself :)

The "Vector" is actually a "SparesVector", so when it is written into a
string, the format is

(size, [coordinate], [value...])


Simple!
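
A tiny sketch (made-up numbers) of that layout: the first field is the vector size,
which is the HashingTF dimension (2^20 = 1048576 by default), the second is the
hashed term indices, and the third is the TF-IDF weights at those indices.

import org.apache.spark.mllib.linalg.Vectors

val v = Vectors.sparse(1048576, Array(0, 4, 7), Array(0.5, 1.2, 0.3))
println(v)        // prints (1048576,[0,4,7],[0.5,1.2,0.3])
println(v.size)   // 1048576, the hashing dimension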


On Sat, Mar 14, 2015 at 6:05 PM Xi Shen  wrote:

> Hi,
>
> I read this document,
> http://spark.apache.org/docs/1.2.1/mllib-feature-extraction.html, and
> tried to build a TF-IDF model of my documents.
>
> I have a list of documents, each word is represented as a Int, and each
> document is listed in one line.
>
> doc_name, int1, int2...
> doc_name, int3, int4...
>
> This is how I load my documents:
> val documents: RDD[Seq[Int]] = sc.objectFile[(String,
> Seq[Int])](s"$sparkStore/documents") map (_._2) cache()
>
> Then I did:
>
> val hashingTF = new HashingTF()
> val tf: RDD[Vector] = hashingTF.transform(documents)
> val idf = new IDF().fit(tf)
> val tfidf = idf.transform(tf)
>
> I write the tfidf model to a text file and try to understand the structure.
> FileUtils.writeLines(new File("tfidf.out"),
> tfidf.collect().toList.asJavaCollection)
>
> What I is something like:
>
> (1048576,[0,4,7,8,10,13,17,21],[...some float numbers...])
> ...
>
> I think it s a tuple with 3 element.
>
>- I have no idea what the 1st element is...
>- I think the 2nd element is a list of the word
>- I think the 3rd element is a list of tf-idf value of the words in
>the previous list
>
> Please help me understand this structure.
>
>
> Thanks,
> David
>
>
>
>


Please help me understand TF-IDF Vector structure

2015-03-14 Thread Xi Shen
Hi,

I read this document,
http://spark.apache.org/docs/1.2.1/mllib-feature-extraction.html, and tried
to build a TF-IDF model of my documents.

I have a list of documents; each word is represented as an Int, and each
document is listed on one line.

doc_name, int1, int2...
doc_name, int3, int4...

This is how I load my documents:
val documents: RDD[Seq[Int]] = sc.objectFile[(String,
Seq[Int])](s"$sparkStore/documents") map (_._2) cache()

Then I did:

val hashingTF = new HashingTF()
val tf: RDD[Vector] = hashingTF.transform(documents)
val idf = new IDF().fit(tf)
val tfidf = idf.transform(tf)

I write the tfidf model to a text file and try to understand the structure.
FileUtils.writeLines(new File("tfidf.out"),
tfidf.collect().toList.asJavaCollection)

What I get is something like:

(1048576,[0,4,7,8,10,13,17,21],[...some float numbers...])
...

I think it is a tuple with 3 elements.

   - I have no idea what the 1st element is...
   - I think the 2nd element is a list of the words
   - I think the 3rd element is a list of the tf-idf values of the words in the
   previous list

Please help me understand this structure.


Thanks,
David


Re: Help me understand the partition, parallelism in Spark

2015-02-26 Thread Yana Kadiyska
Yong, for the 200 tasks in stage 2 and 3 -- this actually comes from the
shuffle setting: spark.sql.shuffle.partitions
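
For reference, a minimal sketch of that setting on the Spark 1.x API (the value here
is only illustrative); it is a separate knob from spark.default.parallelism, which
plain RDD shuffles use:

// every post-shuffle stage of a Spark SQL query runs this many tasks (default 200)
sqlContext.setConf("spark.sql.shuffle.partitions", "400")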

On Thu, Feb 26, 2015 at 5:51 PM, java8964  wrote:

> Imran, thanks for your explaining about the parallelism. That is very
> helpful.
>
> In my test case, I am only use one box cluster, with one executor. So if I
> put 10 cores, then 10 concurrent task will be run within this one executor,
> which will handle more data than 4 core case, then leaded to OOM.
>
> I haven't setup Spark on our production cluster yet, but assume that we
> have 100 nodes cluster, if I guess right, set up to 1000 cores mean that on
>  average, each box's executor will run 10 threads to process data. So
> lowering core will reduce the speed of spark, but can help to avoid the
> OOM, as less data to be processed in the memory.
>
> My another guess is that each partition will be processed by one core
> eventually. So make bigger partition count can decrease partition size,
> which should help the memory footprint. In my case, I guess that Spark SQL
> in fact doesn't use the "spark.default.parallelism" setting, or at least in
> my query, it is not used. So no matter what I changed, it doesn't matter.
> The reason I said that is that there is always 200 tasks in stage 2 and 3
> of my query job, no matter what I set the "spark.default.parallelism".
>
> I think lowering the core is to exchange lower memory usage vs speed. Hope
> my understanding is correct.
>
> Thanks
>
> Yong
>
> --
> Date: Thu, 26 Feb 2015 17:03:20 -0500
> Subject: Re: Help me understand the partition, parallelism in Spark
> From: yana.kadiy...@gmail.com
> To: iras...@cloudera.com
> CC: java8...@hotmail.com; user@spark.apache.org
>
>
> Imran, I have also observed the phenomenon of reducing the cores helping
> with OOM. I wanted to ask this (hopefully without straying off topic): we
> can specify the number of cores and the executor memory. But we don't get
> to specify _how_ the cores are spread among executors.
>
> Is it possible that with 24G memory and 4 cores we get a spread of 1 core
> per executor thus ending up with 24G for the task, but with 24G memory and
> 10 cores some executor ends up with 3 cores on the same machine and thus we
> have only 8G per task?
>
> On Thu, Feb 26, 2015 at 4:42 PM, Imran Rashid 
> wrote:
>
> Hi Yong,
>
> mostly correct except for:
>
>
>- Since we are doing reduceByKey, shuffling will happen. Data will be
>shuffled into 1000 partitions, as we have 1000 unique keys.
>
> no, you will not get 1000 partitions.  Spark has to decide how many
> partitions to use before it even knows how many unique keys there are.  If
> you have 200 as the default parallelism (or you just explicitly make it the
> second parameter to reduceByKey()), then you will get 200 partitions.  The
> 1000 unique keys will be distributed across the 200 partitions.  ideally
> they will be distributed pretty equally, but how they get distributed
> depends on the partitioner (by default you will have a HashPartitioner, so
> it depends on the hash of your keys).
>
> Note that this is more or less the same as in Hadoop MapReduce.
>
> the amount of parallelism matters b/c there are various places in spark
> where there is some overhead proportional to the size of a partition.  So
> in your example, if you have 1000 unique keys in 200 partitions, you expect
> about 5 unique keys per partitions -- if instead you had 10 partitions,
> you'd expect 100 unique keys per partitions, and thus more data and you'd
> be more likely to hit an OOM.  But there are many other possible sources of
> OOM, so this is definitely not the *only* solution.
>
> Sorry I can't comment in particular about Spark SQL -- hopefully somebody
> more knowledgeable can comment on that.
>
>
>
> On Wed, Feb 25, 2015 at 8:58 PM, java8964  wrote:
>
> Hi, Sparkers:
>
> I come from the Hadoop MapReducer world, and try to understand some
> internal information of spark. From the web and this list, I keep seeing
> people talking about increase the parallelism if you get the OOM error. I
> tried to read document as much as possible to understand the RDD partition,
> and parallelism usage in the spark.
>
> I understand that for RDD from HDFS, by default, one partition will be one
> HDFS block, pretty straightforward. I saw that lots of RDD operations
> support 2nd parameter of parallelism. This is the part confuse me. From my
> understand, the parallelism is totally controlled by how many cores you
> give to your job. Adjust that parameter, or "spark.default.parallelism"
> shouldn't have any impact.
>
>

RE: Help me understand the partition, parallelism in Spark

2015-02-26 Thread java8964
Imran, thanks for your explanation of the parallelism. That is very helpful.
In my test case I am only using a one-box cluster with one executor. So if I give it
10 cores, then 10 concurrent tasks run within this one executor, which handles more
data than in the 4-core case and then led to the OOM.
I haven't set up Spark on our production cluster yet, but assume we have a 100-node
cluster; if I guess right, setting up to 1000 cores means that on average each box's
executor will run 10 threads to process data. So lowering the cores will reduce the
speed of Spark, but can help to avoid the OOM, as less data is processed in memory
at once.
My other guess is that each partition will eventually be processed by one core. So a
bigger partition count can decrease the partition size, which should help the memory
footprint. In my case, I guess that Spark SQL in fact doesn't use the
"spark.default.parallelism" setting, or at least in my query it is not used, so no
matter what I changed, it didn't matter. The reason I say that is that there are
always 200 tasks in stages 2 and 3 of my query job, no matter what I set
"spark.default.parallelism" to.
I think lowering the cores trades speed for lower memory usage. I hope my
understanding is correct.
Thanks
Yong
Date: Thu, 26 Feb 2015 17:03:20 -0500
Subject: Re: Help me understand the partition, parallelism in Spark
From: yana.kadiy...@gmail.com
To: iras...@cloudera.com
CC: java8...@hotmail.com; user@spark.apache.org

Imran, I have also observed the phenomenon of reducing the cores helping with 
OOM. I wanted to ask this (hopefully without straying off topic): we can 
specify the number of cores and the executor memory. But we don't get to 
specify _how_ the cores are spread among executors.
Is it possible that with 24G memory and 4 cores we get a spread of 1 core per 
executor thus ending up with 24G for the task, but with 24G memory and 10 cores 
some executor ends up with 3 cores on the same machine and thus we have only 8G 
per task?
On Thu, Feb 26, 2015 at 4:42 PM, Imran Rashid  wrote:
Hi Yong,
mostly correct except for:

   - Since we are doing reduceByKey, shuffling will happen. Data will be shuffled
   into 1000 partitions, as we have 1000 unique keys.

no, you will not get 1000 partitions.  Spark has to decide how many partitions to
use before it even knows how many unique keys there are.  If you have 200 as the
default parallelism (or you just explicitly make it the second parameter to
reduceByKey()), then you will get 200 partitions.  The 1000 unique keys will be
distributed across the 200 partitions.  ideally they will be distributed pretty
equally, but how they get distributed depends on the partitioner (by default you
will have a HashPartitioner, so it depends on the hash of your keys).
Note that this is more or less the same as in Hadoop MapReduce.
the amount of parallelism matters b/c there are various places in spark where 
there is some overhead proportional to the size of a partition.  So in your 
example, if you have 1000 unique keys in 200 partitions, you expect about 5 
unique keys per partitions -- if instead you had 10 partitions, you'd expect 
100 unique keys per partitions, and thus more data and you'd be more likely to 
hit an OOM.  But there are many other possible sources of OOM, so this is 
definitely not the *only* solution.
Sorry I can't comment in particular about Spark SQL -- hopefully somebody more 
knowledgeable can comment on that.


On Wed, Feb 25, 2015 at 8:58 PM, java8964  wrote:



Hi, Sparkers:
I come from the Hadoop MapReducer world, and try to understand some internal 
information of spark. From the web and this list, I keep seeing people talking 
about increase the parallelism if you get the OOM error. I tried to read 
document as much as possible to understand the RDD partition, and parallelism 
usage in the spark.
I understand that for RDD from HDFS, by default, one partition will be one HDFS 
block, pretty straightforward. I saw that lots of RDD operations support 2nd 
parameter of parallelism. This is the part confuse me. From my understand, the 
parallelism is totally controlled by how many cores you give to your job. 
Adjust that parameter, or "spark.default.parallelism" shouldn't have any impact.
For example, if I have a 10G data in HDFS, and assume the block size is 128M, 
so we get 100 blocks/partitions in RDD. Now if I transfer that RDD to a Pair 
RDD, with 1000 unique keys in the pair RDD, and doing reduceByKey action, using 
200 as the default parallelism. Here is what I assume:
   - We have 100 partitions, as the data comes from 100 blocks. Most likely the
   spark will generate 100 tasks to read and shuffle them?
   - The 1000 unique keys mean the 1000 reducer group, like in MR
   - If I set the max core to be 50, so there will be up to 50 tasks can be run
   concurrently. The rest tasks just have to wait for the core, if there are 50
   tasks are r

Re: Help me understand the partition, parallelism in Spark

2015-02-26 Thread Zhan Zhang
Here is my understanding.

When running on top of yarn, the cores means the number of tasks can run in one 
executor. But all these cores are located in the same JVM.

Parallelism typically control the balance of tasks. For example, if you have 
200 cores, but only 50 partitions. There will be 150 cores sitting idle.

OOM: increase the memory size, and JVM memory overhead may help here.

Thanks.

Zhan Zhang
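
To make those knobs concrete, a hedged SparkConf sketch (the values are made up, not
recommendations): fewer cores per executor means fewer concurrent tasks sharing one
executor heap, more executor memory plus the YARN overhead gives each task more
headroom, and the default parallelism controls how many partitions, and hence tasks,
an RDD shuffle produces.

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.cores", "4")                    // concurrent tasks per executor
  .set("spark.executor.memory", "8g")                  // heap shared by those tasks
  .set("spark.yarn.executor.memoryOverhead", "1024")   // extra off-heap room on YARN, in MB (Spark 1.x name)
  .set("spark.default.parallelism", "200")             // shuffle partitions for RDD operations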

On Feb 26, 2015, at 2:03 PM, Yana Kadiyska 
mailto:yana.kadiy...@gmail.com>> wrote:

Imran, I have also observed the phenomenon of reducing the cores helping with 
OOM. I wanted to ask this (hopefully without straying off topic): we can 
specify the number of cores and the executor memory. But we don't get to 
specify _how_ the cores are spread among executors.

Is it possible that with 24G memory and 4 cores we get a spread of 1 core per 
executor thus ending up with 24G for the task, but with 24G memory and 10 cores 
some executor ends up with 3 cores on the same machine and thus we have only 8G 
per task?

On Thu, Feb 26, 2015 at 4:42 PM, Imran Rashid 
mailto:iras...@cloudera.com>> wrote:
Hi Yong,

mostly correct except for:

  *   Since we are doing reduceByKey, shuffling will happen. Data will be 
shuffled into 1000 partitions, as we have 1000 unique keys.

no, you will not get 1000 partitions.  Spark has to decide how many partitions 
to use before it even knows how many unique keys there are.  If you have 200 as 
the default parallelism (or you just explicitly make it the second parameter to 
reduceByKey()), then you will get 200 partitions.  The 1000 unique keys will be 
distributed across the 200 partitions.  ideally they will be distributed pretty 
equally, but how they get distributed depends on the partitioner (by default 
you will have a HashPartitioner, so it depends on the hash of your keys).

Note that this is more or less the same as in Hadoop MapReduce.

the amount of parallelism matters b/c there are various places in spark where 
there is some overhead proportional to the size of a partition.  So in your 
example, if you have 1000 unique keys in 200 partitions, you expect about 5 
unique keys per partitions -- if instead you had 10 partitions, you'd expect 
100 unique keys per partitions, and thus more data and you'd be more likely to 
hit an OOM.  But there are many other possible sources of OOM, so this is 
definitely not the *only* solution.

Sorry I can't comment in particular about Spark SQL -- hopefully somebody more 
knowledgeable can comment on that.



On Wed, Feb 25, 2015 at 8:58 PM, java8964 
mailto:java8...@hotmail.com>> wrote:
Hi, Sparkers:

I come from the Hadoop MapReducer world, and try to understand some internal 
information of spark. From the web and this list, I keep seeing people talking 
about increase the parallelism if you get the OOM error. I tried to read 
document as much as possible to understand the RDD partition, and parallelism 
usage in the spark.

I understand that for RDD from HDFS, by default, one partition will be one HDFS 
block, pretty straightforward. I saw that lots of RDD operations support 2nd 
parameter of parallelism. This is the part confuse me. From my understand, the 
parallelism is totally controlled by how many cores you give to your job. 
Adjust that parameter, or "spark.default.parallelism" shouldn't have any impact.

For example, if I have a 10G data in HDFS, and assume the block size is 128M, 
so we get 100 blocks/partitions in RDD. Now if I transfer that RDD to a Pair 
RDD, with 1000 unique keys in the pair RDD, and doing reduceByKey action, using 
200 as the default parallelism. Here is what I assume:


  *   We have 100 partitions, as the data comes from 100 blocks. Most likely 
the spark will generate 100 tasks to read and shuffle them?
  *   The 1000 unique keys mean the 1000 reducer group, like in MR
  *   If I set the max core to be 50, so there will be up to 50 tasks can be 
run concurrently. The rest tasks just have to wait for the core, if there are 
50 tasks are running.
  *   Since we are doing reduceByKey, shuffling will happen. Data will be 
shuffled into 1000 partitions, as we have 1000 unique keys.
  *   I don't know these 1000 partitions will be processed by how many tasks, 
maybe this is the parallelism parameter comes in?
  *   No matter what parallelism this will be, there are ONLY 50 task can be 
run concurrently. So if we set more cores, more partitions' data will be 
processed in the executor (which runs more thread in this case), so more memory 
needs. I don't see how increasing parallelism could help the OOM in this case.
  *   In my test case of Spark SQL, I gave 24G as the executor heap, my join 
between 2 big datasets keeps getting OOM. I keep increasing the 
"spark.default.parallelism", from 200 to 400, to 2000, even to 4000, no help. 
What really makes the query finish finally without OOM is after I change the 
"--total-executor-cores" from 10 to 4.

So my questions are:
1) What is the parallelism rea

Re: Help me understand the partition, parallelism in Spark

2015-02-26 Thread Yana Kadiyska
Imran, I have also observed the phenomenon of reducing the cores helping
with OOM. I wanted to ask this (hopefully without straying off topic): we
can specify the number of cores and the executor memory. But we don't get
to specify _how_ the cores are spread among executors.

Is it possible that with 24G memory and 4 cores we get a spread of 1 core
per executor thus ending up with 24G for the task, but with 24G memory and
10 cores some executor ends up with 3 cores on the same machine and thus we
have only 8G per task?
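
For what it is worth, on YARN the spread can at least be pinned by fixing both
sides. A hedged SparkConf sketch with made-up numbers (the equivalents of
--num-executors, --executor-cores and --executor-memory on spark-submit):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.instances", "6")   // number of executors (YARN)
  .set("spark.executor.cores", "4")       // cores, i.e. concurrent tasks, per executor
  .set("spark.executor.memory", "24g")    // heap per executor, shared by its 4 tasks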

On Thu, Feb 26, 2015 at 4:42 PM, Imran Rashid  wrote:

> Hi Yong,
>
> mostly correct except for:
>
>>
>>- Since we are doing reduceByKey, shuffling will happen. Data will be
>>shuffled into 1000 partitions, as we have 1000 unique keys.
>>
>> no, you will not get 1000 partitions.  Spark has to decide how many
> partitions to use before it even knows how many unique keys there are.  If
> you have 200 as the default parallelism (or you just explicitly make it the
> second parameter to reduceByKey()), then you will get 200 partitions.  The
> 1000 unique keys will be distributed across the 200 partitions.  ideally
> they will be distributed pretty equally, but how they get distributed
> depends on the partitioner (by default you will have a HashPartitioner, so
> it depends on the hash of your keys).
>
> Note that this is more or less the same as in Hadoop MapReduce.
>
> the amount of parallelism matters b/c there are various places in spark
> where there is some overhead proportional to the size of a partition.  So
> in your example, if you have 1000 unique keys in 200 partitions, you expect
> about 5 unique keys per partitions -- if instead you had 10 partitions,
> you'd expect 100 unique keys per partitions, and thus more data and you'd
> be more likely to hit an OOM.  But there are many other possible sources of
> OOM, so this is definitely not the *only* solution.
>
> Sorry I can't comment in particular about Spark SQL -- hopefully somebody
> more knowledgeable can comment on that.
>
>
>
> On Wed, Feb 25, 2015 at 8:58 PM, java8964  wrote:
>
>> Hi, Sparkers:
>>
>> I come from the Hadoop MapReducer world, and try to understand some
>> internal information of spark. From the web and this list, I keep seeing
>> people talking about increase the parallelism if you get the OOM error. I
>> tried to read document as much as possible to understand the RDD partition,
>> and parallelism usage in the spark.
>>
>> I understand that for RDD from HDFS, by default, one partition will be
>> one HDFS block, pretty straightforward. I saw that lots of RDD operations
>> support 2nd parameter of parallelism. This is the part confuse me. From my
>> understand, the parallelism is totally controlled by how many cores you
>> give to your job. Adjust that parameter, or "spark.default.parallelism"
>> shouldn't have any impact.
>>
>> For example, if I have a 10G data in HDFS, and assume the block size is
>> 128M, so we get 100 blocks/partitions in RDD. Now if I transfer that RDD to
>> a Pair RDD, with 1000 unique keys in the pair RDD, and doing reduceByKey
>> action, using 200 as the default parallelism. Here is what I assume:
>>
>>
>>- We have 100 partitions, as the data comes from 100 blocks. Most
>>likely the spark will generate 100 tasks to read and shuffle them?
>>- The 1000 unique keys mean the 1000 reducer group, like in MR
>>- If I set the max core to be 50, so there will be up to 50 tasks can
>>be run concurrently. The rest tasks just have to wait for the core, if
>>there are 50 tasks are running.
>>- Since we are doing reduceByKey, shuffling will happen. Data will be
>>shuffled into 1000 partitions, as we have 1000 unique keys.
>>- I don't know these 1000 partitions will be processed by how many
>>tasks, maybe this is the parallelism parameter comes in?
>>- No matter what parallelism this will be, there are ONLY 50 task can
>>be run concurrently. So if we set more cores, more partitions' data will 
>> be
>>processed in the executor (which runs more thread in this case), so more
>>memory needs. I don't see how increasing parallelism could help the OOM in
>>this case.
>>- In my test case of Spark SQL, I gave 24G as the executor heap, my
>>join between 2 big datasets keeps getting OOM. I keep increasing the
>>"spark.default.parallelism", from 200 to 400, to 2000, even to 4000, no
>>help. What really makes the query finish finally without OOM is after I
>>change the "--total-executor-cores" from 10 to 4.
>>
>>
>> So my questions are:
>> 1) What is the parallelism really mean in the Spark? In the simple
>> example above, for reduceByKey, what difference it is between parallelism
>> change from 10 to 20?
>> 2) When we talk about partition in the spark, for the data coming from
>> HDFS, I can understand the partition clearly. For the intermediate data,
>> the partition will be same as key, right? For group, reducing, join action,

Re: Help me understand the partition, parallelism in Spark

2015-02-26 Thread Imran Rashid
Hi Yong,

mostly correct except for:

>
>- Since we are doing reduceByKey, shuffling will happen. Data will be
>shuffled into 1000 partitions, as we have 1000 unique keys.
>
No, you will not get 1000 partitions.  Spark has to decide how many
partitions to use before it even knows how many unique keys there are.  If
you have 200 as the default parallelism (or you just explicitly make it the
second parameter to reduceByKey()), then you will get 200 partitions.  The
1000 unique keys will be distributed across the 200 partitions.  Ideally
they will be distributed pretty equally, but how they get distributed
depends on the partitioner (by default you will have a HashPartitioner, so
it depends on the hash of your keys).

Note that this is more or less the same as in Hadoop MapReduce.
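
For illustration, a minimal sketch (assuming a SparkContext `sc`, e.g. in
spark-shell; the data and the 1000-key space are made up) showing that the
partition count comes from the numPartitions argument rather than from the
number of distinct keys:

// made-up pair RDD with 1000 unique keys
val pairs = sc.parallelize(1 to 1000000).map(i => (i % 1000, 1))

// the shuffle's partition count comes from the numPartitions argument
// (or spark.default.parallelism), not from the number of distinct keys
val reduced = pairs.reduceByKey(_ + _, 200)
println(reduced.partitions.length)   // 200
println(reduced.partitioner)         // a HashPartitioner, the default for reduceByKey

val reducedDefault = pairs.reduceByKey(_ + _)   // falls back to the default parallelism
println(reducedDefault.partitions.length)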

The amount of parallelism matters because there are various places in Spark
where there is some overhead proportional to the size of a partition.  So
in your example, if you have 1000 unique keys in 200 partitions, you expect
about 5 unique keys per partition -- if instead you had 10 partitions, you'd
expect 100 unique keys per partition, and thus more data per task and you'd
be more likely to hit an OOM.  But there are many other possible sources of
OOM, so this is definitely not the *only* solution.
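
A quick way to sanity-check that distribution (same made-up data as the
sketch above; illustrative only) is to count how many keys land in each
post-shuffle partition:

val pairs = sc.parallelize(1 to 1000000).map(i => (i % 1000, 1))
val reduced = pairs.reduceByKey(_ + _, 200)
val sizes = reduced
  .mapPartitionsWithIndex((idx, iter) => Iterator((idx, iter.size)))
  .collect()
sizes.sortBy(_._2).reverse.take(5).foreach { case (idx, n) =>
  println(s"partition $idx holds $n keys")   // roughly 5 keys each here
}

If a few partitions hold far more keys (or far bigger values) than the rest,
raising the parallelism alone will not fix those skewed partitions.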

Sorry I can't comment in particular about Spark SQL -- hopefully somebody
more knowledgeable can comment on that.



On Wed, Feb 25, 2015 at 8:58 PM, java8964  wrote:

> Hi, Sparkers:
>
> I come from the Hadoop MapReducer world, and try to understand some
> internal information of spark. From the web and this list, I keep seeing
> people talking about increase the parallelism if you get the OOM error. I
> tried to read document as much as possible to understand the RDD partition,
> and parallelism usage in the spark.
>
> I understand that for RDD from HDFS, by default, one partition will be one
> HDFS block, pretty straightforward. I saw that lots of RDD operations
> support 2nd parameter of parallelism. This is the part confuse me. From my
> understand, the parallelism is totally controlled by how many cores you
> give to your job. Adjust that parameter, or "spark.default.parallelism"
> shouldn't have any impact.
>
> For example, if I have a 10G data in HDFS, and assume the block size is
> 128M, so we get 100 blocks/partitions in RDD. Now if I transfer that RDD to
> a Pair RDD, with 1000 unique keys in the pair RDD, and doing reduceByKey
> action, using 200 as the default parallelism. Here is what I assume:
>
>
>- We have 100 partitions, as the data comes from 100 blocks. Most
>likely the spark will generate 100 tasks to read and shuffle them?
>- The 1000 unique keys mean the 1000 reducer group, like in MR
>- If I set the max core to be 50, so there will be up to 50 tasks can
>be run concurrently. The rest tasks just have to wait for the core, if
>there are 50 tasks are running.
>- Since we are doing reduceByKey, shuffling will happen. Data will be
>shuffled into 1000 partitions, as we have 1000 unique keys.
>- I don't know these 1000 partitions will be processed by how many
>tasks, maybe this is the parallelism parameter comes in?
>- No matter what parallelism this will be, there are ONLY 50 task can
>be run concurrently. So if we set more cores, more partitions' data will be
>processed in the executor (which runs more thread in this case), so more
>memory needs. I don't see how increasing parallelism could help the OOM in
>this case.
>- In my test case of Spark SQL, I gave 24G as the executor heap, my
>join between 2 big datasets keeps getting OOM. I keep increasing the
>"spark.default.parallelism", from 200 to 400, to 2000, even to 4000, no
>help. What really makes the query finish finally without OOM is after I
>change the "--total-executor-cores" from 10 to 4.
>
>
> So my questions are:
> 1) What is the parallelism really mean in the Spark? In the simple example
> above, for reduceByKey, what difference it is between parallelism change
> from 10 to 20?
> 2) When we talk about partition in the spark, for the data coming from
> HDFS, I can understand the partition clearly. For the intermediate data,
> the partition will be same as key, right? For group, reducing, join action,
> uniqueness of the keys will be partition. Is that correct?
> 3) Why increasing parallelism could help OOM? I don't get this part. From
> my limited experience, adjusting the core count really matters for memory.
>
> Thanks
>
> Yong
>


RE: Help me understand the partition, parallelism in Spark

2015-02-26 Thread java8964
Can anyone share any thoughts related to my questions?
Thanks

From: java8...@hotmail.com
To: user@spark.apache.org
Subject: Help me understand the partition, parallelism in Spark
Date: Wed, 25 Feb 2015 21:58:55 -0500




Hi, Sparkers:
I come from the Hadoop MapReducer world, and try to understand some internal 
information of spark. From the web and this list, I keep seeing people talking 
about increase the parallelism if you get the OOM error. I tried to read 
document as much as possible to understand the RDD partition, and parallelism 
usage in the spark.
I understand that for RDD from HDFS, by default, one partition will be one HDFS 
block, pretty straightforward. I saw that lots of RDD operations support 2nd 
parameter of parallelism. This is the part confuse me. From my understand, the 
parallelism is totally controlled by how many cores you give to your job. 
Adjust that parameter, or "spark.default.parallelism" shouldn't have any impact.
For example, if I have a 10G data in HDFS, and assume the block size is 128M, 
so we get 100 blocks/partitions in RDD. Now if I transfer that RDD to a Pair 
RDD, with 1000 unique keys in the pair RDD, and doing reduceByKey action, using 
200 as the default parallelism. Here is what I assume:
- We have 100 partitions, as the data comes from 100 blocks. Most likely the
  spark will generate 100 tasks to read and shuffle them?
- The 1000 unique keys mean the 1000 reducer group, like in MR
- If I set the max core to be 50, so there will be up to 50 tasks can be run
  concurrently. The rest tasks just have to wait for the core, if there are
  50 tasks are running.
- Since we are doing reduceByKey, shuffling will happen. Data will be
  shuffled into 1000 partitions, as we have 1000 unique keys.
- I don't know these 1000 partitions will be processed by how many tasks,
  maybe this is the parallelism parameter comes in?
- No matter what parallelism this will be, there are ONLY 50 task can be run
  concurrently. So if we set more cores, more partitions' data will be
  processed in the executor (which runs more thread in this case), so more
  memory needs. I don't see how increasing parallelism could help the OOM in
  this case.
- In my test case of Spark SQL, I gave 24G as the executor heap, my join
  between 2 big datasets keeps getting OOM. I keep increasing the
  "spark.default.parallelism", from 200 to 400, to 2000, even to 4000, no
  help. What really makes the query finish finally without OOM is after I
  change the "--total-executor-cores" from 10 to 4.

So my questions are:
1) What is the parallelism really mean in the Spark? In the simple example
   above, for reduceByKey, what difference it is between parallelism change
   from 10 to 20?
2) When we talk about partition in the spark, for the data coming from HDFS,
   I can understand the partition clearly. For the intermediate data, the
   partition will be same as key, right? For group, reducing, join action,
   uniqueness of the keys will be partition. Is that correct?
3) Why increasing parallelism could help OOM? I don't get this part. From my
   limited experience, adjusting the core count really matters for memory.
Thanks
Yong
  

Help me understand the partition, parallelism in Spark

2015-02-25 Thread java8964
Hi, Sparkers:
I come from the Hadoop MapReducer world, and try to understand some internal 
information of spark. From the web and this list, I keep seeing people talking 
about increase the parallelism if you get the OOM error. I tried to read 
document as much as possible to understand the RDD partition, and parallelism 
usage in the spark.
I understand that for RDD from HDFS, by default, one partition will be one HDFS 
block, pretty straightforward. I saw that lots of RDD operations support 2nd 
parameter of parallelism. This is the part confuse me. From my understand, the 
parallelism is totally controlled by how many cores you give to your job. 
Adjust that parameter, or "spark.default.parallelism" shouldn't have any impact.
For example, if I have a 10G data in HDFS, and assume the block size is 128M, 
so we get 100 blocks/partitions in RDD. Now if I transfer that RDD to a Pair 
RDD, with 1000 unique keys in the pair RDD, and doing reduceByKey action, using 
200 as the default parallelism. Here is what I assume:
- We have 100 partitions, as the data comes from 100 blocks. Most likely the
  spark will generate 100 tasks to read and shuffle them?
- The 1000 unique keys mean the 1000 reducer group, like in MR
- If I set the max core to be 50, so there will be up to 50 tasks can be run
  concurrently. The rest tasks just have to wait for the core, if there are
  50 tasks are running.
- Since we are doing reduceByKey, shuffling will happen. Data will be
  shuffled into 1000 partitions, as we have 1000 unique keys.
- I don't know these 1000 partitions will be processed by how many tasks,
  maybe this is the parallelism parameter comes in?
- No matter what parallelism this will be, there are ONLY 50 task can be run
  concurrently. So if we set more cores, more partitions' data will be
  processed in the executor (which runs more thread in this case), so more
  memory needs. I don't see how increasing parallelism could help the OOM in
  this case.
- In my test case of Spark SQL, I gave 24G as the executor heap, my join
  between 2 big datasets keeps getting OOM. I keep increasing the
  "spark.default.parallelism", from 200 to 400, to 2000, even to 4000, no
  help. What really makes the query finish finally without OOM is after I
  change the "--total-executor-cores" from 10 to 4.

So my questions are:
1) What is the parallelism really mean in the Spark? In the simple example
   above, for reduceByKey, what difference it is between parallelism change
   from 10 to 20?
2) When we talk about partition in the spark, for the data coming from HDFS,
   I can understand the partition clearly. For the intermediate data, the
   partition will be same as key, right? For group, reducing, join action,
   uniqueness of the keys will be partition. Is that correct?
3) Why increasing parallelism could help OOM? I don't get this part. From my
   limited experience, adjusting the core count really matters for memory.
Thanks
Yong  

Re: Please help me get started on Apache Spark

2014-11-20 Thread Guibert. J Tchinde
For Spark,
You can start with a new book like:
https://www.safaribooksonline.com/library/view/learning-spark/9781449359034/ch01.html
I think the paper book is out now.

You can also have a look on tutorials documentation guide available on :
https://spark.apache.org/docs/1.1.0/mllib-guide.html

There are a lot of good tutorials out there (Google around), but I think
the best approach remains to work through a case study.

Cheers

2014-11-20 15:04 GMT+01:00 Saurabh Agrawal :

>
>
> Friends,
>
>
>
> I am pretty new to Spark as much as to Scala, MLib and the entire Hadoop
> stack!! It would be so much help if I could be pointed to some good books
> on Spark and MLib?
>
>
>
> Further, does MLib support any algorithms for B2B cross sell/ upsell or
> customer retention (out of the box preferably) that I could run on my Sales
> force data? I am currently using Collaborative filtering but that’s
> essentially B2C.
>
>
>
> Thanks in advance!!
>
>
>
> Regards,
>
> Saurabh Agrawal
>


Re: Please help me get started on Apache Spark

2014-11-20 Thread Darin McBeath
Take a look at the O'Reilly Learning Spark (Early Release) book.  I've found 
this very useful.
Darin.
  From: Saurabh Agrawal 
 To: "user@spark.apache.org"  
 Sent: Thursday, November 20, 2014 9:04 AM
 Subject: Please help me get started on Apache Spark
   
Friends,

I am pretty new to Spark as much as to Scala, MLib and the entire Hadoop
stack!! It would be so much help if I could be pointed to some good books on
Spark and MLib?

Further, does MLib support any algorithms for B2B cross sell/ upsell or
customer retention (out of the box preferably) that I could run on my Sales
force data? I am currently using Collaborative filtering but that's
essentially B2C.

Thanks in advance!!

Regards,
Saurabh Agrawal


  

Please help me get started on Apache Spark

2014-11-20 Thread Saurabh Agrawal

Friends,

I am pretty new to Spark as much as to Scala, MLib and the entire Hadoop 
stack!! It would be so much help if I could be pointed to some good books on 
Spark and MLib?

Further, does MLib support any algorithms for B2B cross sell/ upsell or 
customer retention (out of the box preferably) that I could run on my Sales 
force data? I am currently using Collaborative filtering but that's essentially 
B2C.

Thanks in advance!!

Regards,
Saurabh Agrawal


This e-mail, including accompanying communications and attachments, is strictly 
confidential and only for the intended recipient. Any retention, use or 
disclosure not expressly authorised by Markit is prohibited. This email is 
subject to all waivers and other terms at the following link: 
http://www.markit.com/en/about/legal/email-disclaimer.page

Please visit http://www.markit.com/en/about/contact/contact-us.page? for 
contact information on our offices worldwide.

MarkitSERV Limited has its registered office located at Level 4, Ropemaker 
Place, 25 Ropemaker Street, London, EC2Y 9LY and is authorized and regulated by 
the Financial Conduct Authority with registration number 207294


Re: Can anyone help me set memory for standalone cluster?

2014-06-01 Thread Aaron Davidson
In addition to setting the Standalone memory, you'll also need to tell your
SparkContext to claim the extra resources. Set "spark.executor.memory" to
1600m as well. This should be a system property set in SPARK_JAVA_OPTS in
conf/spark-env.sh (in 0.9.1, which you appear to be using) -- e.g.,
export SPARK_JAVA_OPTS="-Dspark.executor.memory=1600m"
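
The same property can also go on the SparkConf in the driver program instead
of SPARK_JAVA_OPTS -- a minimal Scala sketch (the app name here is just for
illustration). Note that SPARK_WORKER_MEMORY only caps what a worker can hand
out, while spark.executor.memory is what your application actually requests
per executor:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("JavaKafkaWordCount")        // illustrative name
  .set("spark.executor.memory", "1600m")   // per-executor request, within SPARK_WORKER_MEMORY
val sc = new SparkContext(conf)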


On Sun, Jun 1, 2014 at 7:36 PM, Yunmeng Ban  wrote:

> Hi,
>
> I'm running the example of JavaKafkaWordCount in a standalone cluster. I
> want to set 1600MB memory for each slave node. I wrote in the
> spark/conf/spark-env.sh
>
> SPARK_WORKER_MEMORY=1600m
>
> But the logs on the slave nodes look like this:
> Spark Executor Command: "/usr/java/latest/bin/java" "-cp"
> ":/~path/spark/conf:/~path/spark/assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar"
> "-Xms512M" "-Xmx512M"
> "org.apache.spark.executor.CoarseGrainedExecutorBackend"
>
> The memory seems to be the default number, not 1600M.
> I don't know how to make SPARK_WORKER_MEMORY work.
> Can anyone help me?
> Many thanks in advance.
>
> Yunmeng
>


Can anyone help me set memory for standalone cluster?

2014-06-01 Thread Yunmeng Ban
Hi,

I'm running the example of JavaKafkaWordCount in a standalone cluster. I
want to set 1600MB memory for each slave node. I wrote in the
spark/conf/spark-env.sh

SPARK_WORKER_MEMORY=1600m

But the logs on the slave nodes look like this:
Spark Executor Command: "/usr/java/latest/bin/java" "-cp"
":/~path/spark/conf:/~path/spark/assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar"
"-Xms512M" "-Xmx512M"
"org.apache.spark.executor.CoarseGrainedExecutorBackend"

The memory seems to be the default number, not 1600M.
I don't know how to make SPARK_WORKER_MEMORY work.
Can anyone help me?
Many thanks in advance.

Yunmeng


help me: Out of memory when spark streaming

2014-05-16 Thread Francis . Hu
hi, All

I encountered OOM when streaming.

I send data to Spark Streaming through ZeroMQ at a speed of 600 records per
second, but the streaming job only handles 10 records per 5 seconds (I set
this in the streaming program).

My two workers have 4-core CPUs and 1G RAM.

These workers always run out of memory after a few moments.

I tried to adjust the JVM GC arguments to speed up the GC process. It made a
little bit of difference in performance, but the workers eventually hit OOM
anyway.

Is there any way to resolve it?

It would be appreciated if anyone could help me get this fixed!

Thanks,

Francis.Hu



Re: help me

2014-05-03 Thread Chris Fregly
as Mayur indicated, it's odd that you are seeing better performance from a
less-local configuration.  however, the non-deterministic behavior that you
describe is likely caused by GC pauses in your JVM process.

take note of the *spark.locality.wait* configuration parameter described
here: http://spark.apache.org/docs/latest/configuration.html

this is the amount of time the Spark execution engine waits before
launching a new task on a less-data-local node (ie. process -> node ->
rack).  by default, this is 3 seconds.

if there is excessive GC occurring on the original process-local JVM, it is
possible that another node-local JVM process could actually load the data
from HDFS (on the same node) and complete the processing before the
original process's GC finishes.

you could bump up the *spark.locality.wait* default (not recommended) or
increase your number of nodes/partitions to increase parallelism and reduce
hotspots.
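
a rough sketch of those two knobs (values made up for illustration; in this
era of Spark the locality wait is given in milliseconds, newer releases also
accept strings like "10s"):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("locality-example")        // illustrative name
  .set("spark.locality.wait", "10000")   // wait longer before falling back to node_local
val sc = new SparkContext(conf)

val a = sc.textFile("/user/exobrain/batselem/LUBM1000")
val spread = a.repartition(sc.defaultParallelism * 4)   // more, smaller partitions
println(spread.count())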

also, keep an eye on your GC characteristics.  perhaps you need to increase
your Eden size to reduce the amount of movement through the GC generations
and reduce major compactions.  (the usual GC tuning fun.)

curious if others have experienced this behavior, as well?

-chris


On Fri, May 2, 2014 at 6:07 AM, Mayur Rustagi wrote:

> Spark would be much faster on process_local instead of node_local.
> Node_local means the task reads data from the local hard disk, while
> process_local means the data is already in memory in the same executor
> process.
>
> Mayur Rustagi
> Ph: +1 (760) 203 3257
> http://www.sigmoidanalytics.com
> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>
>
>
> On Tue, Apr 22, 2014 at 4:45 PM, Joe L  wrote:
>
>> I got the following performance; is it normal for Spark to be like this?
>> Sometimes Spark switches into node_local mode from process_local and it
>> becomes 10x faster. I am very confused.
>>
>> scala> val a = sc.textFile("/user/exobrain/batselem/LUBM1000")
>> scala> a.count()
>>
>> Long = 137805557
>> took 130.809661618 s
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/help-me-tp4598.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>
>


Re: help me

2014-05-02 Thread Mayur Rustagi
Spark would be much faster on process_local instead of node_local.
Node_local means the task reads data from the local hard disk, while
process_local means the data is already in memory in the same executor
process.
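
A small illustration (assuming a SparkContext `sc`, e.g. spark-shell): the
first count() reads the file from HDFS, so its tasks typically show up as
NODE_LOCAL (or ANY) in the web UI; after cache(), a second count() is served
from in-memory blocks and the tasks show up as PROCESS_LOCAL.

val a = sc.textFile("/user/exobrain/batselem/LUBM1000")
a.cache()
a.count()   // fills the cache while counting
a.count()   // reads the cached blocks inside the executor processes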

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>



On Tue, Apr 22, 2014 at 4:45 PM, Joe L  wrote:

> I got the following performance; is it normal for Spark to be like this?
> Sometimes Spark switches into node_local mode from process_local and it
> becomes 10x faster. I am very confused.
>
> scala> val a = sc.textFile("/user/exobrain/batselem/LUBM1000")
> scala> a.count()
>
> Long = 137805557
> took 130.809661618 s
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/help-me-tp4598.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


help me

2014-04-22 Thread Joe L
I got the following performance; is it normal for Spark to be like this?
Sometimes Spark switches into node_local mode from process_local and it
becomes 10x faster. I am very confused.

scala> val a = sc.textFile("/user/exobrain/batselem/LUBM1000")
scala> a.count()

Long = 137805557
took 130.809661618 s




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/help-me-tp4598.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.