newAPIHadoopFile throws a JsonMappingException: Infinite recursion (StackOverflowError) error

2016-11-17 Thread David Robison
I am trying to create a new JavaPairRDD from data in an HDFS file. My code is:

sparkContext = new JavaSparkContext("yarn-client", "SumFramesPerTimeUnit", sparkConf);
JavaPairRDD<LongWritable, BytesWritable> inputRDD =
    sparkContext.newAPIHadoopFile(fileFilter, FixedLengthInputFormat.class,
        LongWritable.class, BytesWritable.class, config);

However, when I run the job I get the following error:

com.fasterxml.jackson.databind.JsonMappingException: Infinite recursion (StackOverflowError) (through reference chain: scala.collection.convert.IterableWrapper[0]->org.apache.spark.rdd.RDDOperationScope["allScopes"]->scala.collection.convert.IterableWrapper[0]->org.apache.spark.rdd.RDDOperationScope["allScopes"]->...)
    at com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:680)
    at com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:156)
    at com.fasterxml.jackson.databind.ser.std.CollectionSerializer.serializeContents(CollectionSerializer.java:132)
    at com.fasterxml.jackson.module.scala.ser.IterableSerializer.serializeContents(IterableSerializerModule.scala:30)
    at com.fasterxml.jackson.module.scala.ser.IterableSerializer.serializeContents(IterableSerializerModule.scala:16)
    at com.fasterxml.jackson.databind.ser.std.AsArraySerializerBase.serialize(AsArraySerializerBase.java:185)
    at com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:575)
    at com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:666)
    at com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:156)

Any thoughts as to what may be going wrong?
David
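
One commonly reported cause of this particular recursion is a Jackson version conflict: the application classpath (an application server such as WildFly can bundle its own Jackson modules) resolving a different jackson-databind/jackson-module-scala than the version bundled in the Spark assembly. The following is a minimal diagnostic sketch, an assumption rather than something taken from this thread, that prints which jackson-databind version and jar the JVM actually loads:

import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.cfg.PackageVersion;

// Hedged diagnostic sketch: report the jackson-databind version and jar resolved by
// this JVM, to compare against the version shipped with the Spark assembly.
public class JacksonVersionCheck {
    public static void main(String[] args) {
        System.out.println("jackson-databind version: " + PackageVersion.VERSION);
        System.out.println("loaded from: "
                + ObjectMapper.class.getProtectionDomain().getCodeSource().getLocation());
    }
}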

David R Robison
Senior Systems Engineer
O. +1 512 247 3700
M. +1 757 286 0022
david.robi...@psgglobal.net
www.psgglobal.net

Prometheus Security Group Global, Inc.
3019 Alvin Devane Boulevard
Building 4, Suite 450
Austin, TX 78741




RE: submitting a spark job using yarn-client and getting NoClassDefFoundError: org/apache/spark/Logging

2016-11-16 Thread David Robison
I’ve gotten a little further along. It now submits the job via YARN, but the job now exits immediately with the following error:

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/Logging
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
    at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:646)
    at org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.Logging
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)

I’ve checked, and the class does live in the spark assembly. Any thoughts as to what might be wrong?
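
For reference, org.apache.spark.Logging is present in Spark 1.x but was removed from that package in Spark 2.x, so this error often indicates that the ApplicationMaster is running against a different Spark version than the one the job was built for. Below is a hedged diagnostic sketch (not from the thread) that reports where the Spark classes come from and whether the class resolves at all:

// Hedged diagnostic sketch: show which jar provides SparkContext and whether
// org.apache.spark.Logging (a Spark 1.x class, absent in 2.x) can be loaded.
public class SparkClasspathCheck {
    public static void main(String[] args) {
        System.out.println("SparkContext loaded from: "
                + org.apache.spark.SparkContext.class
                      .getProtectionDomain().getCodeSource().getLocation());
        try {
            Class<?> logging = Class.forName("org.apache.spark.Logging");
            System.out.println("org.apache.spark.Logging loaded from: "
                    + logging.getProtectionDomain().getCodeSource().getLocation());
        } catch (ClassNotFoundException e) {
            System.out.println("org.apache.spark.Logging not found; the Spark jars on "
                    + "this classpath are likely 2.x while the job expects 1.x.");
        }
    }
}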


Best Regards,

David R Robison
Senior Systems Engineer

From: David Robison [mailto:david.robi...@psgglobal.net]
Sent: Wednesday, November 16, 2016 9:04 AM
To: Rohit Verma 
Cc: user@spark.apache.org
Subject: RE: Problem submitting a spark job using yarn-client as master


Unfortunately, it doesn’t get that far in my code where I have a SparkContext 
from which to set the Hadoop config parameters. Here is my Java code:

SparkConf sparkConf = new SparkConf()
    .setJars(new String[] { "file:///opt/wildfly/mapreduce/mysparkjob-5.0.0.jar" })
    .setSparkHome("/usr/hdp/" + getHdpVersion() + "/spark")
    .set("fs.defaultFS", config.get("fs.defaultFS"));
sparkContext = new JavaSparkContext("yarn-client", "SumFramesPerTimeUnit", sparkConf);

The job dies in the constructor of the JavaSparkContext. I have a logging call right after creating the SparkContext, and it is never executed.
Any idea what I’m doing wrong? David
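
One hedged workaround sketch (an assumption, not confirmed in this thread): SparkConf entries prefixed with "spark.hadoop." are copied into the Hadoop configuration that Spark builds internally, so a setting such as fs.defaultFS can take effect before the JavaSparkContext constructor runs. Here "hdfs://master_node:8020" is a placeholder for the real NameNode address:

// Hedged sketch: pass Hadoop settings through SparkConf via the "spark.hadoop." prefix
// so they are applied before the JavaSparkContext constructor runs.
SparkConf sparkConf = new SparkConf()
    .setMaster("yarn-client")
    .setAppName("SumFramesPerTimeUnit")
    .setJars(new String[] { "file:///opt/wildfly/mapreduce/mysparkjob-5.0.0.jar" })
    .set("spark.hadoop.fs.defaultFS", "hdfs://master_node:8020"); // placeholder NameNode address
JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);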

Best Regards,

David R Robison
Senior Systems Engineer

From: Rohit Verma [mailto:rohit.ve...@rokittech.com]
Sent: Tuesday, November 15, 2016 9:27 PM
To: David Robison <david.robi...@psgglobal.net>
Cc: user@spark.apache.org
Subject: Re: Problem submitting a spark job using yarn-client as master

You can set HDFS as the default file system:

sparksession.sparkContext().hadoopConfiguration().set("fs.defaultFS", "hdfs://master_node:8020");

Regards
Rohit
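
For the Spark 1.x Java API used elsewhere in this thread, a hedged equivalent sketch (an assumption, since the line above uses the Spark 2.x SparkSession API) is to set the property on the configuration held by the JavaSparkContext; "hdfs://master_node:8020" remains a placeholder for the real NameNode address:

// Hedged Spark 1.x equivalent: set the default file system on the Hadoop
// configuration owned by the existing JavaSparkContext.
sparkContext.hadoopConfiguration().set("fs.defaultFS", "hdfs://master_node:8020");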

On Nov 16, 2016, at 3:15 AM, David Robison <david.robi...@psgglobal.net> wrote:

I am trying to submit a spark job through the yarn-client master setting. The job gets created and submitted to the cluster but immediately errors out. Here is the relevant portion of the log:

15:39:37,385 INFO  [org.apache.spark.deploy.yarn.Client] (default task-1) 
Requesting a new application from cluster with 1 NodeManagers
15:39:37,397 INFO  [org.apache.spark.deploy.yarn.Client] (default task-1) 
Verifying our application has not requested more than the maximum memory 
capability of the cluster (4608 MB per container)
15:39:37,398 INFO  [org.apache.spark.deploy.yarn.Client] (default task-1) Will 
allocate AM container, with 896 MB memory including 384 MB overhead
15:39:37,399 INFO  [org.apache.spark.deploy.yarn.Client] (default task-1) 
Setting up container launch context for our AM
15:39:37,403 INFO  [org.apache.spark.deploy.yarn.Client] (default task-1) 
Setting up the launch environment for our AM container
15:39:37,427 INFO  [org.apache.spark.deploy.yarn.Client] (default task-1) 
Preparing resources for our AM container
15:39:37,845 INFO  [org.apache.spark.deploy.yarn.Client] (default task-1) 
Source and destination file systems are the same. Not copying 
file:/opt/wildfly/modules/org/apache/hadoop/client/main/spark-yarn_2.10-1.6.2.jar
15:39:38,050 INFO  [org.apache.spark.deploy.yarn.Client] (default task-1) 
Source and destination file systems are the same. Not copying 
file:/tmp/spark-fa954c4

RE: Problem submitting a spark job using yarn-client as master

2016-11-16 Thread David Robison
Unfortunately, it doesn’t get that far in my code where I have a SparkContext 
from which to set the Hadoop config parameters. Here is my Java code:

SparkConf sparkConf = new SparkConf()
    .setJars(new String[] { "file:///opt/wildfly/mapreduce/mysparkjob-5.0.0.jar" })
    .setSparkHome("/usr/hdp/" + getHdpVersion() + "/spark")
    .set("fs.defaultFS", config.get("fs.defaultFS"));
sparkContext = new JavaSparkContext("yarn-client", "SumFramesPerTimeUnit", sparkConf);

The job dies in the constructor of the JavaSparkContext. I have a logging call right after creating the SparkContext, and it is never executed.
Any idea what I’m doing wrong? David

Best Regards,

David R Robison
Senior Systems Engineer

From: Rohit Verma [mailto:rohit.ve...@rokittech.com]
Sent: Tuesday, November 15, 2016 9:27 PM
To: David Robison 
Cc: user@spark.apache.org
Subject: Re: Problem submitting a spark job using yarn-client as master

You can set HDFS as the default file system:

sparksession.sparkContext().hadoopConfiguration().set("fs.defaultFS", "hdfs://master_node:8020");

Regards
Rohit

On Nov 16, 2016, at 3:15 AM, David Robison <david.robi...@psgglobal.net> wrote:

I am trying to submit a spark job through the yarn-client master setting. The job gets created and submitted to the cluster but immediately errors out. Here is the relevant portion of the log:

15:39:37,385 INFO  [org.apache.spark.deploy.yarn.Client] (default task-1) 
Requesting a new application from cluster with 1 NodeManagers
15:39:37,397 INFO  [org.apache.spark.deploy.yarn.Client] (default task-1) 
Verifying our application has not requested more than the maximum memory 
capability of the cluster (4608 MB per container)
15:39:37,398 INFO  [org.apache.spark.deploy.yarn.Client] (default task-1) Will 
allocate AM container, with 896 MB memory including 384 MB overhead
15:39:37,399 INFO  [org.apache.spark.deploy.yarn.Client] (default task-1) 
Setting up container launch context for our AM
15:39:37,403 INFO  [org.apache.spark.deploy.yarn.Client] (default task-1) 
Setting up the launch environment for our AM container
15:39:37,427 INFO  [org.apache.spark.deploy.yarn.Client] (default task-1) 
Preparing resources for our AM container
15:39:37,845 INFO  [org.apache.spark.deploy.yarn.Client] (default task-1) 
Source and destination file systems are the same. Not copying 
file:/opt/wildfly/modules/org/apache/hadoop/client/main/spark-yarn_2.10-1.6.2.jar
15:39:38,050 INFO  [org.apache.spark.deploy.yarn.Client] (default task-1) 
Source and destination file systems are the same. Not copying 
file:/tmp/spark-fa954c4a-a6cd-4675-8610-67ce858b4842/__spark_conf__1435451360463636119.zip
15:39:38,102 INFO  [org.apache.spark.SecurityManager] (default task-1) Changing 
view acls to: wildfly,hdfs
15:39:38,105 INFO  [org.apache.spark.SecurityManager] (default task-1) Changing 
modify acls to: wildfly,hdfs
15:39:38,105 INFO  [org.apache.spark.SecurityManager] (default task-1) 
SecurityManager: authentication disabled; ui acls disabled; users with view 
permissions: Set(wildfly, hdfs); users with modify permissions: Set(wildfly, 
hdfs)
15:39:38,138 INFO  [org.apache.spark.deploy.yarn.Client] (default task-1) 
Submitting application 5 to ResourceManager
15:39:38,256 INFO  [org.apache.hadoop.yarn.client.api.impl.YarnClientImpl] 
(default task-1) Submitted application application_1479240217825_0005
15:39:39,269 INFO  [org.apache.spark.deploy.yarn.Client] (default task-1) 
Application report for application_1479240217825_0005 (state: ACCEPTED)
15:39:39,279 INFO  [org.apache.spark.deploy.yarn.Client] (default task-1)
 client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1479242378159
final status: UNDEFINED
tracking URL: 
http://vb1.localdomain:8088/proxy/application_1479240217825_0005/
user: hdfs
15:39:40,285 INFO  [org.apache.spark.deploy.yarn.Client] (default task-1) 
Application report for application_1479240217825_0005 (state: ACCEPTED)
15:39:41,290 INFO  [org.apache.spark.deploy.yarn.Client] (default task-1) 
Application report for application_1479240217825_0005 (state: ACCEPTED)
15:39:42,295 INFO  [org.apache.spark.deploy.yarn.Client] (default task-1) 
Application report for application_1479240217825_0005 (state: FAILED)
15:39:42,295 INFO  [org.apache.spark.deploy.yarn.Client] (default task-1)
 client token: N/A
diagnostics: Application application_1479240217825_0005 
failed 2 times due to AM Container for appattempt_1479240217825_0005_02 
exited with  exitCode: -1000
For more detailed output, check application tracking 
page:http://vb1.localdomain:8088/c

Problem submitting a spark job using yarn-client as master

2016-11-15 Thread David Robison
I am trying to submit a spark job through the yarn-client master setting. The job gets created and submitted to the cluster but immediately errors out. Here is the relevant portion of the log:

15:39:37,385 INFO  [org.apache.spark.deploy.yarn.Client] (default task-1) 
Requesting a new application from cluster with 1 NodeManagers
15:39:37,397 INFO  [org.apache.spark.deploy.yarn.Client] (default task-1) 
Verifying our application has not requested more than the maximum memory 
capability of the cluster (4608 MB per container)
15:39:37,398 INFO  [org.apache.spark.deploy.yarn.Client] (default task-1) Will 
allocate AM container, with 896 MB memory including 384 MB overhead
15:39:37,399 INFO  [org.apache.spark.deploy.yarn.Client] (default task-1) 
Setting up container launch context for our AM
15:39:37,403 INFO  [org.apache.spark.deploy.yarn.Client] (default task-1) 
Setting up the launch environment for our AM container
15:39:37,427 INFO  [org.apache.spark.deploy.yarn.Client] (default task-1) 
Preparing resources for our AM container
15:39:37,845 INFO  [org.apache.spark.deploy.yarn.Client] (default task-1) 
Source and destination file systems are the same. Not copying 
file:/opt/wildfly/modules/org/apache/hadoop/client/main/spark-yarn_2.10-1.6.2.jar
15:39:38,050 INFO  [org.apache.spark.deploy.yarn.Client] (default task-1) 
Source and destination file systems are the same. Not copying 
file:/tmp/spark-fa954c4a-a6cd-4675-8610-67ce858b4842/__spark_conf__1435451360463636119.zip
15:39:38,102 INFO  [org.apache.spark.SecurityManager] (default task-1) Changing 
view acls to: wildfly,hdfs
15:39:38,105 INFO  [org.apache.spark.SecurityManager] (default task-1) Changing 
modify acls to: wildfly,hdfs
15:39:38,105 INFO  [org.apache.spark.SecurityManager] (default task-1) 
SecurityManager: authentication disabled; ui acls disabled; users with view 
permissions: Set(wildfly, hdfs); users with modify permissions: Set(wildfly, 
hdfs)
15:39:38,138 INFO  [org.apache.spark.deploy.yarn.Client] (default task-1) 
Submitting application 5 to ResourceManager
15:39:38,256 INFO  [org.apache.hadoop.yarn.client.api.impl.YarnClientImpl] 
(default task-1) Submitted application application_1479240217825_0005
15:39:39,269 INFO  [org.apache.spark.deploy.yarn.Client] (default task-1) 
Application report for application_1479240217825_0005 (state: ACCEPTED)
15:39:39,279 INFO  [org.apache.spark.deploy.yarn.Client] (default task-1)
 client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1479242378159
final status: UNDEFINED
tracking URL: 
http://vb1.localdomain:8088/proxy/application_1479240217825_0005/
user: hdfs
15:39:40,285 INFO  [org.apache.spark.deploy.yarn.Client] (default task-1) 
Application report for application_1479240217825_0005 (state: ACCEPTED)
15:39:41,290 INFO  [org.apache.spark.deploy.yarn.Client] (default task-1) 
Application report for application_1479240217825_0005 (state: ACCEPTED)
15:39:42,295 INFO  [org.apache.spark.deploy.yarn.Client] (default task-1) 
Application report for application_1479240217825_0005 (state: FAILED)
15:39:42,295 INFO  [org.apache.spark.deploy.yarn.Client] (default task-1)
 client token: N/A
diagnostics: Application application_1479240217825_0005 
failed 2 times due to AM Container for appattempt_1479240217825_0005_02 
exited with  exitCode: -1000
For more detailed output, check application tracking 
page:http://vb1.localdomain:8088/cluster/app/application_1479240217825_0005Then,
 click on links to logs of each attempt.
Diagnostics: File file:/tmp/spark-fa954c4a-a6cd-4675-8610-67ce858b4842/__spark_conf__1435451360463636119.zip does not exist
java.io.FileNotFoundException: File file:/tmp/spark-fa954c4a-a6cd-4675-8610-67ce858b4842/__spark_conf__1435451360463636119.zip does not exist
    at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:609)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:822)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:599)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)


Notice that the file __spark_conf__1435451360463636119.zip is not copied because it already exists, I believe on HDFS. However, when the client goes to fetch it, it reports that the file does not exist, probably because it is trying to get it from "file:/tmp" rather than from HDFS. Any idea how I can get this to work?
Thanks, David
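
A hedged diagnostic sketch (an assumption, not from the thread): if the Hadoop configuration visible to the submitting JVM does not define fs.defaultFS, it falls back to file:///, which would explain why the staging files land under file:/tmp and are then missing on the node managers. The client-side default can be checked like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

// Hedged diagnostic sketch: print the default file system the client-side Hadoop
// configuration resolves. "file:///" here would mean core-site.xml is not on the
// classpath or fs.defaultFS is unset.
public class DefaultFsCheck {
    public static void main(String[] args) {
        Configuration hadoopConf = new Configuration();
        System.out.println("fs.defaultFS resolves to: " + FileSystem.getDefaultUri(hadoopConf));
    }
}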

David R Robison
Senior Systems Engineer
O. +1 512 247 3700
M. +1 757 286 0022
david.robi...@psgglobal.net
www.psgglobal.net

creating a javaRDD using newAPIHadoopFile and FixedLengthInputFormat

2016-11-15 Thread David Robison
I am trying to create a Spark JavaRDD using newAPIHadoopFile and the FixedLengthInputFormat. Here is my code snippet:

Configuration config = new Configuration();
config.setInt(FixedLengthInputFormat.FIXED_RECORD_LENGTH, JPEG_INDEX_SIZE);
config.set("fs.hdfs.impl", DistributedFileSystem.class.getName());
String fileFilter = config.get("fs.defaultFS") + "/A/B/C/*.idx";
JavaPairRDD<LongWritable, BytesWritable> inputRDD =
    sparkContext.newAPIHadoopFile(fileFilter, FixedLengthInputFormat.class,
        LongWritable.class, BytesWritable.class, config);

At this point I get the following exception:

Error executing mapreduce job: com.fasterxml.jackson.databind.JsonMappingException: Infinite recursion (StackOverflowError)

Any idea what I am doing wrong? I am new to Spark. David
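
For reference, a hedged, compilable sketch of the same read (assuming Hadoop's org.apache.hadoop.mapreduce.lib.input.FixedLengthInputFormat and the Spark 1.x Java API; the path pattern and record length are placeholders supplied by the caller):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.lib.input.FixedLengthInputFormat;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Hedged sketch: same call as above with explicit key/value generics.
// FixedLengthInputFormat.setRecordLength() writes the same FIXED_RECORD_LENGTH key.
public class FixedLengthRead {
    public static JavaPairRDD<LongWritable, BytesWritable> readIndex(
            JavaSparkContext sc, String fileFilter, int recordLength) {
        Configuration config = new Configuration();
        FixedLengthInputFormat.setRecordLength(config, recordLength);
        return sc.newAPIHadoopFile(fileFilter,
                FixedLengthInputFormat.class,
                LongWritable.class,
                BytesWritable.class,
                config);
    }
}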

David R Robison
Senior Systems Engineer
O. +1 512 247 3700
M. +1 757 286 0022
david.robi...@psgglobal.net
www.psgglobal.net

Prometheus Security Group Global, Inc.
3019 Alvin Devane Boulevard
Building 4, Suite 450
Austin, TX 78741