Re: Add jar files on classpath when submitting tasks to Spark

2016-11-01 Thread Vinod Mangipudi
unsubscribe

On Tue, Nov 1, 2016 at 8:56 AM, Jan Botorek  wrote:

> Thank you for the reply.
>
> I am aware of the parameters used when submitting the tasks (--jars is
> working for us).
>
>
>
> But isn’t there any way to specify a location (a directory) of jars
> globally – in spark-defaults.conf?
>
>
>
>
>
> *From:* ayan guha [mailto:guha.a...@gmail.com]
> *Sent:* Tuesday, November 1, 2016 1:49 PM
> *To:* Jan Botorek 
> *Cc:* user 
> *Subject:* Re: Add jar files on classpath when submitting tasks to Spark
>
>
>
> There are options to specify external jars in the form of --jars,
> --driver-class-path etc., depending on Spark version and cluster manager.
> Please see the configuration sections of the Spark documentation and/or run
> spark-submit --help to see the available options.
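>
> For example, a minimal sketch (the class name, application jar and
> dependency jar paths below are placeholders, not the actual ones in use):
>
>   spark-submit \
>     --class com.example.Main \
>     --jars /opt/libs/dep1.jar,/opt/libs/dep2.jar \
>     --driver-class-path /opt/libs/dep1.jar:/opt/libs/dep2.jar \
>     my-app.jar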
>
> On 1 Nov 2016 23:13, "Jan Botorek"  wrote:
>
> Hello,
>
> I have a problem making jar files available on the classpath when
> submitting tasks to Spark.
>
>
>
> In my spark-defaults.conf file I have configuration:
>
> *spark.driver.extraClassPath = path/to/folder/with/jars*
>
> With this, all jars in the folder are available in SPARK-SHELL.
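>
> (For completeness, a sketch of the relevant spark-defaults.conf entries;
> the wildcard form and the executor-side entry are assumptions here, not
> something we have verified:)
>
>   # driver-side classpath, passed to the driver JVM (directory wildcard
>   # assumed to expand as a normal JVM classpath wildcard)
>   spark.driver.extraClassPath    /path/to/folder/with/jars/*
>   # executor-side counterpart, assumed to be needed when tasks use the classes
>   spark.executor.extraClassPath  /path/to/folder/with/jars/*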
>
>
>
> The problem is that the jars are not on the classpath for SPARK-MASTER; more
> precisely – when I submit any job that utilizes any jar from the external
> folder, a *java.lang.ClassNotFoundException* is thrown.
>
> Moving all external jars into the *jars* folder solves the situation, but
> we need to keep the external files separate.
>
>
>
> Thank you for any help
>
> Best regards,
>
> Jan
>
>


graphx - trianglecount of 2B edges

2015-11-11 Thread Vinod Mangipudi
I was attempting to use the GraphX triangle count method on a 2B-edge graph
(the Friendster dataset on SNAP). I have access to a 60-node cluster with 90 GB
of memory and 30 vcores per node.
I am running into memory issues.


I am using 1000 partitions with the RandomVertexCut partition strategy. Here is my submit script:

spark-submit --executor-cores 5 --num-executors 100 --executor-memory 32g \
  --driver-memory 6g --conf spark.yarn.executor.memoryOverhead=8000 \
  --conf "spark.executor.extraJavaOptions=-XX:-UseGCOverheadLimit" \
  trianglecount_2.10-1.0.jar
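
For context, here is a rough Scala sketch of what the job does (the edge-list
path and loading options are placeholders, not the exact code inside
trianglecount_2.10-1.0.jar):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{GraphLoader, PartitionStrategy}

object TriangleCountJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("TriangleCount"))

    // Load the edge list with 1000 edge partitions; canonicalOrientation=true
    // keeps only edges with srcId < dstId, which triangle counting expects.
    val graph = GraphLoader
      .edgeListFile(sc, "hdfs:///data/friendster/edges.txt",
        canonicalOrientation = true, numEdgePartitions = 1000)
      .partitionBy(PartitionStrategy.RandomVertexCut)

    // Each triangle is counted once at each of its three vertices,
    // so the global count is the sum of per-vertex counts divided by 3.
    val total = graph.triangleCount().vertices.map(_._2.toLong).reduce(_ + _) / 3
    println(s"Total triangles: $total")

    sc.stop()
  }
}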

There was one particular stage where it shuffled 3.7 TB

Active Stages (1)

Stage Id | Description                               | Submitted           | Duration | Tasks: Succeeded/Total | Input   | Shuffle Read | Shuffle Write
11       | mapPartitions at VertexRDDImpl.scala:218 | 2015/11/12 01:33:06 | 7.3 min  | 316/344                | 22.6 GB | 57.0 GB      | 3.7 TB
In the subsequent stage it fails while reading the shuffle, at around the
half-terabyte mark, with a java.lang.OutOfMemoryError: Java heap space.


Active Stages (1)

Stage Id | Description                          | Submitted           | Duration | Tasks: Succeeded/Total | Input   | Shuffle Read | Shuffle Write
12       | mapPartitions at GraphImpl.scala:235 | 2015/11/12 01:41:25 | 5.2 min  | 0/1000                 | 26.3 GB | 533.8 GB     |




Compared to the benchmarking cluster used on the Twitter dataset (2.5B edges)
in http://arxiv.org/pdf/1402.2394v1.pdf, the resources I am providing for the
job seem reasonable. Can anyone point out any optimizations or other tweaks I
need to make to get this to work?

Thanks!
Vinod