Re: Kerberos setup in Apache spark connecting to remote HDFS/Yarn

2016-06-17 Thread akhandeshi
Little more progress...

I added a few environment variables; now I get the following error message:

 InvocationTargetException: Can't get Master Kerberos principal for use as
renewer -> [Help 1]
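
In case it helps narrow things down, here is a minimal sketch of what I am
experimenting with next.  It assumes (I have not confirmed this) that the
"renewer" error means the YARN resource manager principal is missing from the
Hadoop configuration visible to the driver; the paths and realm below are
placeholders:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

val hadoopConf = new Configuration()
// Make the remote cluster's site files visible to the client (placeholder paths).
hadoopConf.addResource(new Path("/path/to/core-site.xml"))
hadoopConf.addResource(new Path("/path/to/yarn-site.xml"))
// The property I believe the token-renewer lookup needs (placeholder principal).
hadoopConf.set("yarn.resourcemanager.principal", "yarn/_HOST@MY.REALM")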






--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Kerberos-setup-in-Apache-spark-connecting-to-remote-HDFS-Yarn-tp27181p27189.html



Re: Kerberos setup in Apache spark connecting to remote HDFS/Yarn

2016-06-16 Thread akhandeshi
The rest of the stack trace:

[WARNING]
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:294)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IllegalArgumentException: Can't get Kerberos realm
at
org.apache.hadoop.security.HadoopKerberosName.setConfiguration(HadoopKerberosName.java:65)
at
org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:227)
at
org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:214)
at
org.apache.hadoop.security.UserGroupInformation.isAuthenticationMethodEnabled(UserGroupInformation.java:275)
at
org.apache.hadoop.security.UserGroupInformation.isSecurityEnabled(UserGroupInformation.java:269)
at
org.apache.hadoop.security.UserGroupInformation.loginUserFromKeytab(UserGroupInformation.java:820)
at org.apache.spark.examples.SparkYarn$.launchClient(SparkYarn.scala:57)
at org.apache.spark.examples.SparkYarn$.main(SparkYarn.scala:84)
at org.apache.spark.examples.SparkYarn.main(SparkYarn.scala)
... 6 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at
org.apache.hadoop.security.authentication.util.KerberosUtil.getDefaultRealm(KerberosUtil.java:75)
at
org.apache.hadoop.security.HadoopKerberosName.setConfiguration(HadoopKerberosName.java:63)
... 14 more
Caused by: KrbException: Cannot locate default realm
at sun.security.krb5.Config.getDefaultRealm(Config.java:1029)

I did add krb5.conf to the classpath and also defined KRB5_CONFIG.
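
As far as I understand, the JVM's Kerberos code (sun.security.krb5.Config) reads
the java.security.krb5.conf system property and the OS default locations rather
than the KRB5_CONFIG environment variable, so the next thing I am trying is along
these lines (the path is a placeholder):

// Point the JVM at krb5.conf explicitly, before any Hadoop security call runs.
System.setProperty("java.security.krb5.conf", "/etc/krb5.conf")
// Equivalent when launching: -Djava.security.krb5.conf=/etc/krb5.conf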




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Kerberos-setup-in-Apache-spark-connecting-to-remote-HDFS-Yarn-tp27181p27183.html



Kerberos setup in Apache spark connecting to remote HDFS/Yarn

2016-06-16 Thread akhandeshi
I am trying to set up my IDE for a Scala Spark application.  I want to access
HDFS files on a remote Hadoop server that has Kerberos enabled.  My
understanding is that I should be able to do that from Spark.  Here is my code so
far:

import java.io.IOException

import org.apache.commons.lang3.StringUtils  // or org.apache.commons.lang.StringUtils, whichever is on the classpath
import org.apache.hadoop.security.UserGroupInformation
import org.apache.spark.{SparkConf, SparkContext}

val sparkConf = new SparkConf().setAppName(appName).setMaster(master)

if (jars.length > 0) {
  sparkConf.setJars(jars)
}

if (!properties.isEmpty) {
  // Apply any externally supplied Spark properties.
  for ((k, v) <- properties)
    sparkConf.set(k, v)
} else {
  sparkConf
    .set("spark.executor.memory", "1024m")
    .set("spark.cores.max", "1")
    .set("spark.default.parallelism", "4")
}

try {
  if (!StringUtils.isBlank(principal) && !StringUtils.isBlank(keytab)) {
    // UserGroupInformation.setConfiguration(config)
    UserGroupInformation.loginUserFromKeytab(principal, keytab)
  }
} catch {
  case ioe: IOException =>
    println("Failed to login to Hadoop [principal = " + principal +
      ", keytab = " + keytab + "]")
    ioe.printStackTrace()
}

val sc = new SparkContext(sparkConf)
val MY_FILE: String = "hdfs://remoteserver:port/file.out"
val rDD = sc.textFile(MY_FILE, 10)
println("Lines " + rDD.count)

I have core-site.xml on my classpath.  I changed hadoop.ssl.enabled to false
because it was expecting a secret key.  The principal I am using is correct: I
tried both username/_HOST@fully.qualified.domain and
username@fully.qualified.domain with no success.  I tried running Spark in
local mode and in yarn-client mode.  I am hoping someone has a recipe for this or
has already solved this problem; any pointers that help me set up or debug it will
be helpful.

I am getting the following error message:

Caused by: java.lang.IllegalArgumentException: Can't get Kerberos realm
at
org.apache.hadoop.security.HadoopKerberosName.setConfiguration(HadoopKerberosName.java:65)
at
org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:227)
at
org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:249)
at org.apache.spark.examples.SparkYarn$.launchClient(SparkYarn.scala:55)
at org.apache.spark.examples.SparkYarn$.main(SparkYarn.scala:83)
at org.apache.spark.examples.SparkYarn.main(SparkYarn.scala)
... 6 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at
org.apache.hadoop.security.authentication.util.KerberosUtil.getDefaultRealm(KerberosUtil.java:75)
at
org.apache.hadoop.security.HadoopKerberosName.setConfiguration(HadoopKerberosName.java:63)
... 11 more
Caused by: KrbException: Cannot locate default realm
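
For what it is worth, the variation I am trying next makes the Hadoop security
configuration explicit before the keytab login (that is the call commented out in
the snippet above); it reuses the same principal and keytab values, and the file
locations are placeholders:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.security.UserGroupInformation

val principal = "username@FULLY.QUALIFIED.DOMAIN"   // same value I am already using
val keytab    = "/path/to/user.keytab"              // placeholder path

val hadoopConf = new Configuration()
// Load the remote cluster's settings (placeholder paths).
hadoopConf.addResource(new Path("/path/to/core-site.xml"))
hadoopConf.addResource(new Path("/path/to/hdfs-site.xml"))
hadoopConf.set("hadoop.security.authentication", "kerberos")
UserGroupInformation.setConfiguration(hadoopConf)
UserGroupInformation.loginUserFromKeytab(principal, keytab)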



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Kerberos-setup-in-Apache-spark-connecting-to-remote-HDFS-Yarn-tp27181.html



This post has NOT been accepted by the mailing list yet.

2015-10-07 Thread akhandeshi
I seem to see this for many of my posts... does anyone have a solution?




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/This-post-has-NOT-been-accepted-by-the-mailing-list-yet-tp24969.html



Re: SparkR Error in sparkR.init(master=“local”) in RStudio

2015-10-06 Thread akhandeshi
I couldn't get this working...

I have JAVA_HOME set.
I have defined SPARK_HOME:

Sys.setenv(SPARK_HOME = "c:\\DevTools\\spark-1.5.1")   # backslashes must be doubled in R strings
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR, lib.loc = "c:\\DevTools\\spark-1.5.1\\lib")
sc <- sparkR.init(master = "local")

I get 
Error in sparkR.init(master = "local") : 
  JVM is not ready after 10 seconds

What am I missing??






--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkR-Error-in-sparkR-init-master-local-in-RStudio-tp23768p24949.html



Re: SparkR Error in sparkR.init(master=“local”) in RStudio

2015-10-06 Thread akhandeshi
It seems it is failing at

  path <- tempfile(pattern = "backend_port")

I do not see a backend_port file or directory being created...



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkR-Error-in-sparkR-init-master-local-in-RStudio-tp23768p24958.html



Loading status

2015-02-02 Thread akhandeshi
I am not sure what the LOADING status means versus RUNNING.  In the
application UI, I see:
Executor Summary

ExecutorID  Worker                                                         Cores  Memory  State    Logs
1           worker-20150202144112-hadoop-w-1.c.fi-mdd-poc.internal-38874  16     83971   LOADING  stdout stderr
0           worker-20150202144112-hadoop-w-2.c.fi-mdd-poc.internal-58685  16     83971   RUNNING  stdout stderr

Looking at the executor on hadoop-w-2, I see the status is LOADING.  Why the
different statuses, and what do they mean?

Please see below for details:

ID: worker-20150202144112-hadoop-w-2.c.fi-mdd-poc.internal-58685
Master URL: spark://hadoop-m:7077
Cores: 16 (16 Used)
Memory: 82.0 GB (82.0 GB Used)

Running Executors (1)

ExecutorID  Cores   State   Memory  Job Details Logs
0   16  LOADING 82.0 GB 
ID: app-20150202152154-0001
Name: Simple File Merge Application
User: hadoop
stdout stderr

Thank you, 

Ami



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Loading-status-tp21468.html



ExternalSorter - spilling in-memory map

2015-01-13 Thread akhandeshi
I am using Spark 1.2, and I see a lot of messages like:

ExternalSorter: Thread 66 spilling in-memory map of 5.0 MB to disk (13160
times so far)

I seem to have a lot of memory:

URL: spark://hadoop-m:7077
Workers: 4
Cores: 64 Total, 64 Used
Memory: 328.0 GB Total, 327.0 GB Used
_

Executors (4)
Memory: 3.5 GB Used (114.8 GB Total)

Executor ID  Address     RDD Blocks  Memory Used
0            hadoop-w-1  176         2.8 GB / 28.8 GB
1            hadoop-w-0  42          680.9 MB / 28.8 GB
2            hadoop-w-2  0           0.0 B / 28.8 GB
driver       hadoop-w-3  0           0.0 B / 28.4 GB

Also, I am not sure why hadoop-w-2 is in the LOADING state.
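
In case it is relevant, the only knob I understand to control how much memory
these aggregation buffers may use before spilling in 1.2 is
spark.shuffle.memoryFraction (default 0.2, I believe).  A sketch of what I may
try, with a guessed value:

// Allow the shuffle/aggregation buffers a larger fraction of the heap before
// spilling to disk; 0.4 is only a guess for experimentation.
val conf = new org.apache.spark.SparkConf()
  .set("spark.shuffle.memoryFraction", "0.4")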

Thanks,

Ami





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/ExternalSorter-spilling-in-memory-map-tp21125.html



Re: Any ideas why a few tasks would stall

2014-12-04 Thread akhandeshi
This did not work for me; that is, rdd.coalesce(200, forceShuffle) did not help.
Does anyone have ideas on how to distribute your data evenly and co-locate
partitions of interest?
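
For clarity, this is the kind of call I tried; forceShuffle here is just a boolean
that I set to true, which as far as I know makes coalesce behave like repartition:

import org.apache.spark.rdd.RDD

// rdd stands for the RDD from my job (placeholder element type).
def rebalance(rdd: RDD[String]): RDD[String] =
  // Forcing the shuffle redistributes rows evenly across 200 partitions rather
  // than merely merging existing partitions into fewer ones.
  rdd.coalesce(200, shuffle = true)   // equivalent to rdd.repartition(200)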



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Any-ideas-why-a-few-tasks-would-stall-tp20207p20387.html



Re: Help understanding - Not enough space to cache rdd

2014-12-03 Thread akhandeshi
hmm...
33.6 GB is the sum of the memory used by the two RDDs that are cached.  You're
right: when I put serialized RDDs in the cache, the memory footprint for these
RDDs becomes a lot smaller.

Serialized memory footprint shown below:

RDD Name  Storage Level                    Cached Partitions  Fraction Cached  Size in Memory  Size in Tachyon  Size on Disk
2         Memory Serialized 1x Replicated  239                100%             3.1 GB          0.0 B            0.0 B
5         Memory Serialized 1x Replicated  100                100%             1254.9 MB       0.0 B            0.0 B

I don't know what the 73.7 figure reflects.  In the application UI I can see
4.3 GB used out of 73.7 GB total by the cached RDDs, but I am not sure how 73.7
is calculated.  I have the following configuration:

conf.set("spark.storage.memoryFraction", "0.9");
conf.set("spark.shuffle.memoryFraction", "0.1");

Based on my understanding, 0.9 * 95g (the memory allocated to the driver) = 85.5 g
should be the available storage memory, correct?  Out of which 10% is taken out
for shuffle (85.5 - 8.55 = 76.95), which would lead to 76.95 GB of usable memory.
Is that right?  The two RDDs that are cached are not using nearly that much.
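
As a back-of-the-envelope check (this is only my guess, not confirmed): if the 1.x
MemoryStore also applies a storage safety fraction of 0.9 on top of
spark.storage.memoryFraction, and the JVM reports roughly 91 GB of usable heap for
a 95g setting, the numbers line up with what the UI shows:

// Hypothetical reconstruction of the 73.7 GB storage limit (all values are estimates).
val jvmReportedHeapGb = 91.0   // approx Runtime.getRuntime.maxMemory for a 95g heap
val memoryFraction    = 0.9    // spark.storage.memoryFraction, as set above
val safetyFraction    = 0.9    // spark.storage.safetyFraction default in 1.x (assumed)
val storageLimitGb    = jvmReportedHeapGb * memoryFraction * safetyFraction   // ~73.7 GB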

The two systemic problems I keep running into are the Integer.MAX_VALUE size limit
and "Requested array size exceeds VM limit".  No matter how much I tweak the
parallelism/memory configuration, there seems to be little or no impact.  Is
there someone who can help me understand the internals so that I can get this
working?  I know this platform is a great, viable solution for the use case we
have in mind, if I can get it running successfully.  At this point, the data size
is not that huge compared to some published white papers, so I am thinking it
boils down to the configuration and to validating what I have with an expert.  We
can take this offline if need be; please feel free to email me directly.

Thank you,

Ami




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Help-understanding-Not-enough-space-to-cache-rdd-tp20186p20269.html



Re: Help understanding - Not enough space to cache rdd

2014-12-03 Thread akhandeshi
I think the memory calculation is correct; what I didn't account for is the
memory already used.  I am still puzzled as to how I can successfully process the
RDD in Spark.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Help-understanding-Not-enough-space-to-cache-rdd-tp20186p20273.html



Help understanding - Not enough space to cache rdd

2014-12-02 Thread akhandeshi
I am running in local mode, on a Google n1-highmem-16 (16 vCPU, 104 GB memory)
machine.

I have allocated SPARK_DRIVER_MEMORY=95g.

I see "Memory: 33.6 GB Used (73.7 GB Total)" for the executor.

In the log output below, I see that 33.6 GB of blocks are used by the 2 RDDs that
I have cached.  I should still have about 40.1 GB left.

However, I see  messages like:

14/12/02 18:15:04 WARN storage.MemoryStore: Not enough space to cache
rdd_15_9 in memory! (computed 8.1 GB so far)
14/12/02 18:15:04 INFO storage.MemoryStore: Memory use = 33.6 GB (blocks) +
40.1 GB (scratch space shared across 14 thread(s)) = 73.7 GB. Storage limit
= 73.7 GB.
14/12/02 18:15:04 WARN spark.CacheManager: Persisting partition rdd_15_9 to
disk instead.
.
.
.
.
further down I see:
14/12/02 18:30:08 INFO storage.BlockManagerInfo: Added rdd_15_9 on disk on
localhost:41889 (size: 6.9 GB)
14/12/02 18:30:08 INFO storage.BlockManagerMaster: Updated info of block
rdd_15_9
14/12/02 18:30:08 ERROR executor.Executor: Exception in task 9.0 in stage
2.0 (TID 348)
java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE

I don't understand a couple of things:

1) In this case, I am joining 2 RDDs (16.3 GB and 17.2 GB); both RDDs are
created by reading HDFS files.  The size of each .part file is 24.87 MB, and I am
reading these files into 250 partitions, so I shouldn't have any individual
partition over 25 MB.  How, then, could rdd_15_9 be 8.1 GB?  (A sketch of the
read-and-join pattern is included below.)

2) Even if the partition is 8.1 GB, Spark should have enough memory to write it,
and I would expect to hit the Integer.MAX_VALUE 2 GB limitation.  However, I
don't get that error message here, and a partial dataset is written to disk
(6.9 GB).  I don't understand how and why only a partial dataset is written.

3) Why do I get java.lang.IllegalArgumentException: Size exceeds
Integer.MAX_VALUE after the partial dataset has been written?

I would love to hear from anyone that can shed some light into this...
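
For reference, here is a minimal sketch of the read-and-join pattern I am
describing; the paths, the key function, and the caching point are illustrative
placeholders rather than my exact code:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("join-sketch").setMaster("local[16]"))

def extractKey(line: String): String = line.split(",")(0)   // placeholder key function

val a = sc.textFile("hdfs://remoteserver:port/inputA", 250).keyBy(extractKey)   // ~16.3 GB input
val b = sc.textFile("hdfs://remoteserver:port/inputB", 250).keyBy(extractKey)   // ~17.2 GB input
val joined = a.join(b).cache()   // the cached join output is where blocks like rdd_15_9 appear
println(joined.count())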







--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Help-understanding-Not-enough-space-to-cache-rdd-tp20186.html



packaging from source gives protobuf compatibility issues.

2014-12-01 Thread akhandeshi
scala> textFile.count()
java.lang.VerifyError: class
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$CompleteRequestProto
overrides final method getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;

I tried both

  ./make-distribution.sh -Dhadoop.version=2.5.0

and

  /usr/local/apache-maven-3.2.3/bin/mvn -Dhadoop.version=2.5.0 -DskipTests clean package

and both give the same error.  I am connecting to HDFS hosted on Hadoop version 2.5.0.

I will appreciate any help anyone can provide!
Thanks,

Ami



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/packaging-from-source-gives-protobuf-compatibility-issues-tp20112.html



Re: java.lang.OutOfMemoryError: Requested array size exceeds VM limit

2014-11-17 Thread akhandeshi
"The only option is to split your problem further by increasing parallelism."  My
understanding is that this means increasing the number of partitions; is that
right?  That didn't seem to help, because the partitions are not uniformly sized.
My observation is that when I increase the number of partitions, it creates many
empty block partitions, and the larger partitions are not broken down into
smaller ones.  Any hints on how I can get uniform partitions?  I noticed many
threads on this, but was not able to do anything effective from the Java API.
I will appreciate any help/insight you can provide.
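
For context, here is the kind of repartitioning I have been attempting, sketched
in Scala for brevity (my real code uses the Java API, and the key choice is a
placeholder):

import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

// rdd stands for the skewed data set from earlier in the job (placeholder type).
def rebalance(rdd: RDD[String]): RDD[String] = {
  // Hashing on a skewed key just left many of the partitions empty:
  val byKey = rdd.keyBy(_.toString).partitionBy(new HashPartitioner(500))
  // Forcing a full shuffle spreads rows roughly evenly regardless of key skew:
  rdd.repartition(500)
}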




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-OutOfMemoryError-Requested-array-size-exceeds-VM-limit-tp16809p19097.html



SparkSubmitDriverBootstrapper and JVM parameters

2014-11-06 Thread akhandeshi
When I execute /usr/local/spark-1.1.0/bin/spark-submit local[32] for my app, I see
two processes get spun off.  One is

  /usr/lib/jvm/java-1.7.0-openjdk-amd64/bin/java org.apache.spark.deploy.SparkSubmitDriverBootstrapper

and the other is org.apache.spark.deploy.SparkSubmit.  My understanding is that
the first one is the driver and the latter is the executor; can you confirm?  If
that is true, my application defaults don't seem to be picked up from the
following parameters, although SparkSubmit does pick up its JVM parameters from
here:

spark-defaults.conf
spark.daemon.memory=45g
spark.driver.memory=45g
spark.executor.memory=45g

It is not clear to me when Spark uses spark-defaults.conf and when it uses
spark-env.sh.  Can someone help me understand?

spark-env.sh
SPARK_DAEMON_MEMORY=30g
SPARK_EXECUTOR_MEMORY=30g
SPARK_DRIVER_MEMORY=30g

I am running into GC/OOM issues, and I am wondering whether tweaking the
SparkSubmitDriverBootstrapper or SparkSubmit JVM parameters will help.  I did
look at the configuration documentation on Spark's site and tried many different
approaches suggested there.

Thanks,
Ami



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkSubmitDriverBootstrapper-and-JVM-parameters-tp18290.html



Re: CANNOT FIND ADDRESS

2014-11-03 Thread akhandeshi
no luck :(!  Still observing the same behavior!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/CANNOT-FIND-ADDRESS-tp17637p17988.html



OOM - Requested array size exceeds VM limit

2014-11-03 Thread akhandeshi
I am running in local (client) mode.  My VM has 16 CPUs and 108 GB RAM.  My
configuration is as follows:

spark.executor.extraJavaOptions  -XX:+PrintGCDetails -XX:+UseCompressedOops
-XX:+UseParallelGC -XX:+UseParallelOldGC -XX:+DisableExplicitGC
-XX:MaxPermSize=1024m 

spark.daemon.memory=20g
spark.driver.memory=20g
spark.executor.memory=20g

export SPARK_DAEMON_JAVA_OPTS=-XX:+UseConcMarkSweepGC -XX:+PrintGCDetails
-XX:+PrintGCTimeStamps -XX:+UseCompressedOops -XX:+UseParallelGC
-XX:+UseParallelOldGC -XX:+DisableExplicitGC -XX:MaxPermSize=1024m

/usr/local/spark-1.1.0/bin/spark-submit --class main.java.MyAppMainProcess
--master local[32]  MyApp.jar  myapp.out

14/11/03 20:45:43 INFO BlockManager: Removing block broadcast_4
14/11/03 20:45:43 INFO MemoryStore: Block broadcast_4 of size 3872 dropped
from memory (free 16669590422)
14/11/03 20:45:43 INFO ContextCleaner: Cleaned broadcast 4
14/11/03 20:46:00 WARN BlockManager: Putting block rdd_19_5 failed
14/11/03 20:46:00 ERROR Executor: Exception in task 5.0 in stage 3.0 (TID
70)
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
at java.util.Arrays.copyOf(Arrays.java:2271)
at
java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
at
java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at
java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
at
java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
at
java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876)
at
java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785)
at
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188)
at
java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
at
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
at
org.apache.spark.serializer.SerializationStream.writeAll(Serializer.scala:110)
at
org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:1047)
at
org.apache.spark.storage.BlockManager.dataSerialize(BlockManager.scala:1056)
at
org.apache.spark.storage.TachyonStore.putIterator(TachyonStore.scala:60)
at
org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:743)
at
org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:594)
at
org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:145)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

It is hard to see from this output at which stage it fails, but the output step is
saving a text file.  Each individual record (key and value) is relatively small,
but the number of records in the collection is large.  There seems to be a
bottleneck that I have run into and can't seem to get past.  Any pointers in the
right direction will be helpful!

Thanks,
Ami



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/OOM-Requested-array-size-exceeds-VM-limit-tp17996.html



Re: CANNOT FIND ADDRESS

2014-10-31 Thread akhandeshi
Thanks for the pointers!  I did try them, but they didn't seem to help...

In my latest try, I am running spark-submit with a local master, but I still see
the same message in the Spark application UI (port 4040):

localhost   CANNOT FIND ADDRESS

In the logs, I see a lot of in-memory maps being spilled to disk.  I don't
understand why that is the case: there should be over 35 GB of RAM available for
input that is not significantly large.  It may be linked to the performance
issues I am seeing (I have another post seeking advice on that).  It seems I am
not able to tune Spark sufficiently to execute my process successfully.

14/10/31 13:45:12 INFO ExternalAppendOnlyMap: Thread 206 spilling in-memory map of 1777 MB to disk (2 times so far)
14/10/31 13:45:12 INFO ExternalAppendOnlyMap: Thread 228 spilling in-memory map of 393 MB to disk (1 time so far)
14/10/31 13:45:12 INFO ExternalAppendOnlyMap: Thread 259 spilling in-memory map of 392 MB to disk (2 times so far)
14/10/31 13:45:14 INFO ExternalAppendOnlyMap: Thread 230 spilling in-memory map of 554 MB to disk (2 times so far)
14/10/31 13:45:15 INFO ExternalAppendOnlyMap: Thread 235 spilling in-memory map of 3990 MB to disk (1 time so far)
14/10/31 13:45:15 INFO ExternalAppendOnlyMap: Thread 236 spilling in-memory map of 2667 MB to disk (1 time so far)
14/10/31 13:45:17 INFO ExternalAppendOnlyMap: Thread 259 spilling in-memory map of 825 MB to disk (3 times so far)
14/10/31 13:45:24 INFO ExternalAppendOnlyMap: Thread 228 spilling in-memory map of 4618 MB to disk (2 times so far)
14/10/31 13:45:26 INFO ExternalAppendOnlyMap: Thread 233 spilling in-memory map of 15869 MB to disk (1 time so far)
14/10/31 13:45:37 INFO ExternalAppendOnlyMap: Thread 206 spilling in-memory map of 3026 MB to disk (3 times so far)
14/10/31 13:45:48 INFO ExternalAppendOnlyMap: Thread 228 spilling in-memory map of 401 MB to disk (3 times so far)
14/10/31 13:45:48 INFO ExternalAppendOnlyMap: Thread 259 spilling in-memory map of 392 MB to disk (4 times so far)

My Spark properties are:

Name                                    Value
spark.akka.frameSize                    50
spark.akka.timeout                      900
spark.app.name                          Simple File Merge Application
spark.core.connection.ack.wait.timeout  900
spark.default.parallelism               10
spark.driver.host                       spark-single.c.fi-mdd-poc.internal
spark.driver.memory                     35g
spark.driver.port                       40255
spark.eventLog.enabled                  true
spark.fileserver.uri                    http://10.240.106.135:59255
spark.jars                              file:/home/ami_khandeshi_gmail_com/SparkExample-1.0.jar
spark.master                            local[16]
spark.scheduler.mode                    FIFO
spark.shuffle.consolidateFiles          true
spark.storage.memoryFraction            0.3
spark.tachyonStore.baseDir              /tmp
spark.tachyonStore.folderName           spark-21ad0fd2-2177-48ce-9242-8dbb33f2a1f1
spark.tachyonStore.url                  tachyon://mdd:19998



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/CANNOT-FIND-ADDRESS-tp17637p17824.html



CANNOT FIND ADDRESS

2014-10-29 Thread akhandeshi
The Spark application UI shows that one of the executors "CANNOT FIND ADDRESS".

Aggregated Metrics by Executor

Executor ID  Address                                 Task Time  Total Tasks  Failed Tasks  Succeeded Tasks  Input  Shuffle Read  Shuffle Write  Shuffle Spill (Memory)  Shuffle Spill (Disk)
0            mddworker1.c.fi-mdd-poc.internal:42197  0 ms       0            0             0                0.0 B  136.1 MB      184.9 MB       146.8 GB                135.4 MB
1            CANNOT FIND ADDRESS                     0 ms       0            0             0                0.0 B  87.4 MB       142.0 MB       61.4 GB                 81.4 MB

I also see the following in the logs of one of the executors, the one with which
the driver may have lost communication:

14/10/29 13:18:33 WARN : Master_Client Heartbeat last execution took 90859
ms. Longer than  the FIXED_EXECUTION_INTERVAL_MS 5000
14/10/29 13:18:33 WARN : WorkerClientToWorkerHeartbeat last execution took
90859 ms. Longer than  the FIXED_EXECUTION_INTERVAL_MS 1000
14/10/29 13:18:33 WARN AkkaUtils: Error sending message in 1 attempts
java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:176)
at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:362)

I have also seen other variation of timeouts

14/10/29 06:21:05 WARN SendingConnection: Error finishing connection to
mddworker1.c.fi-mdd-poc.internal/10.240.179.241:40442
java.net.ConnectException: Connection refused
14/10/29 06:21:05 ERROR BlockManager: Failed to report broadcast_6_piece0 to
master; giving up.

or

14/10/29 07:23:40 WARN AkkaUtils: Error sending message in 1 attempts
java.util.concurrent.TimeoutException: Futures timed out after [10 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:176)
at
org.apache.spark.storage.BlockManagerMaster.askDriverWithReply(BlockManagerMaster.scala:218)
at
org.apache.spark.storage.BlockManagerMaster.updateBlockInfo(BlockManagerMaster.scala:58)
at
org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$tryToReportBlockStatus(BlockManager.scala:310)
at
org.apache.spark.storage.BlockManager$$anonfun$reportAllBlocks$3.apply(BlockManager.scala:190)
at
org.apache.spark.storage.BlockManager$$anonfun$reportAllBlocks$3.apply(BlockManager.scala:188)
at
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at
org.apache.spark.util.TimeStampedHashMap.foreach(TimeStampedHashMap.scala:107)
at
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at
org.apache.spark.storage.BlockManager.reportAllBlocks(BlockManager.scala:188)
at 
org.apache.spark.storage.BlockManager.reregister(BlockManager.scala:207)
at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:366)

How do I track down what is causing this problem?  Any suggestions on a solution,
a debugging approach, or a workaround will be helpful!
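
For reference, these are the timeout-related settings I am planning to raise while
I debug; the values are guesses on my part, not recommendations:

// Give the driver/executor RPC more headroom before connections are declared dead.
val conf = new org.apache.spark.SparkConf()
  .set("spark.akka.timeout", "900")                       // seconds
  .set("spark.core.connection.ack.wait.timeout", "900")   // seconds
  .set("spark.akka.frameSize", "50")                      // MB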







--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/CANNOT-FIND-ADDRESS-tp17637.html



Spark Performance

2014-10-29 Thread akhandeshi
I am relatively new to Spark processing.  I am using the Spark Java API to process
data.  I am having trouble processing a data set that I don't think is
significantly large: it is a join of datasets that are around 3-4 GB each
(around 12 GB of data in total).

The workflow is (keyFn and keyFn2 stand for my key-extraction functions):

x = RDD1.keyBy(keyFn).partitionBy(new HashPartitioner(10)).cache();
y = RDD2.keyBy(keyFn).partitionBy(new HashPartitioner(10)).cache();
z = RDD3.keyBy(keyFn).partitionBy(new HashPartitioner(10)).cache();
o = RDD4.keyBy(keyFn2).partitionBy(new HashPartitioner(10)).cache();
out = x.join(y).join(z).keyBy(keyFn2).partitionBy(new HashPartitioner(10)).cache().join(o);
out.saveAsObjectFile(Out);

The Spark processor seems to hang at the out= step indefinitely.  I am using Kryo
for serialization, running local mode with SPARK_MEM=90g.  I have 16 CPUs and
108 GB RAM.  I am saving the output to Hadoop.
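
For completeness, Kryo is enabled in my code roughly like this (the registrator
class name is a placeholder):

val sparkConf = new org.apache.spark.SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "com.example.MyKryoRegistrator")   // placeholder registrator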

I have also tried a standalone cluster with 2 workers, 8 CPUs and 52 GB RAM.
My VMs are on Google Cloud.


Below is the table from the completed stages:

Stage Id  Description                   Submitted         Duration  Tasks: Succeeded/Total  Input     Shuffle Write
8         keyBy at ProcessA.java:1094   10/27/2014 12:40  2.0 min   10/10
3         filter at ProcessA.java:1079  10/27/2014 12:40  2.0 min   10/10
2         keyBy at ProcessA.java:1071   10/27/2014 12:39  39 s      11/11                   268.4 MB  25.7 MB
1         filter at ProcessA.java:1103  10/27/2014 12:39  16 s      9/9                     58.8 MB   30.4 MB
7         keyBy at ProcessA.java:1045   10/27/2014 12:39  32 s      24/24                   2.8 GB    573.8 MB
6         keyBy at ProcessA.java:1045   10/27/2014 12:39  40 s      11/11                   268.4 MB  24.5 MB

Some things I don't understand: I see output in the log files indicating that
in-memory maps are being spilled to disk, and the spill size is greater than the
input.  I am not sure how to avoid or reduce that.  I also tried cluster mode and
observed the same behavior, so I question whether the processes are running in
parallel or serially.

14/10/27 14:11:33 INFO collection.ExternalAppendOnlyMap: Thread 94 spilling in-memory map of 1000 MB to disk (15 times so far)
14/10/27 14:11:34 INFO collection.ExternalAppendOnlyMap: Thread 107 spilling in-memory map of 2351 MB to disk (2 times so far)
14/10/27 14:11:36 INFO collection.ExternalAppendOnlyMap: Thread 94 spilling in-memory map of 1000 MB to disk (16 times so far)
14/10/27 14:11:37 INFO collection.ExternalAppendOnlyMap: Thread 91 spilling in-memory map of 4781 MB to disk (10 times so far)
14/10/27 14:11:38 INFO collection.ExternalAppendOnlyMap: Thread 112 spilling in-memory map of 1243 MB to disk (10 times so far)
14/10/27 14:11:39 INFO collection.ExternalAppendOnlyMap: Thread 94 spilling in-memory map of 983 MB to disk (17 times so far)
14/10/27 14:11:39 INFO collection.ExternalAppendOnlyMap: Thread 96 spilling in-memory map of 75546 MB to disk (11 times so far)
14/10/27 14:11:56 INFO collection.ExternalAppendOnlyMap: Thread 106 spilling in-memory map of 2324 MB to disk (7 times so far)
14/10/27 14:11:56 INFO collection.ExternalAppendOnlyMap: Thread 112 spilling in-memory map of 1729 MB to disk (11 times so far)
14/10/27 14:11:58 INFO collection.ExternalAppendOnlyMap: Thread 96 spilling in-memory map of 2410 MB to disk (12 times so far)
14/10/27 14:11:58 INFO collection.ExternalAppendOnlyMap: Thread 91 spilling in-memory map of 1211 MB to disk

I would appreciate any pointers in the right direction!
___
By the way, I also see error messages like

  Not enough space to cache partition rdd_21_4

indicating that perhaps nothing is getting cached, per
http://mail-archives.apache.org/mod_mbox/spark-issues/201409.mbox/%3cjira.12744773.141202099.148323.1412021014...@atlassian.jira%3E




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Performance-tp17640.html



Re: CANNOT FIND ADDRESS

2014-10-29 Thread akhandeshi
Thanks... hmm, it seems to be a timeout issue, perhaps?  I am not sure what is
causing it or how to debug it.

I see the following error message:

4/10/29 13:26:04 ERROR ContextCleaner: Error cleaning broadcast 9
akka.pattern.AskTimeoutException: Timed out
at
akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:334)
at akka.actor.Scheduler$$anon$11.run(Scheduler.scala:118)
at
scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$
$unbatchedExecute(Future.scala:694)
at
scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:691)
at
akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:455)
at
akka.actor.LightArrayRevolverScheduler$$anon$12.executeBucket$1(Scheduler.scala:407)
at
akka.actor.LightArrayRevolverScheduler$$anon$12.nextTick(Scheduler.scala:411)
at
akka.actor.LightArrayRevolverScheduler$$anon$12.run(Scheduler.scala:363)
at java.lang.Thread.run(Thread.java:745)
14/10/29 13:26:04 WARN BlockManagerMaster: Failed to remove broadcast 9 with
removeFromMaster = true - Timed out



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/CANNOT-FIND-ADDRESS-tp17637p17646.html



Spark-submt job Killed

2014-10-28 Thread akhandeshi
I recently started seeing a new problem where spark-submit is terminated with a
"Killed" message but no error message indicating what happened.  I have enabled
logging in the Spark configuration.  Has anyone seen this, or does anyone know
how to troubleshoot it?




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-submt-job-Killed-tp17560.html



Re: Spark-submt job Killed

2014-10-28 Thread akhandeshi
I did have it as rdd.saveAsTextFile(...);
and now I have it as:
Log.info("RDD Counts " + rdd.persist(StorageLevel.MEMORY_AND_DISK_SER()).count());





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-submt-job-Killed-tp17560p17598.html