Hi all,

I have written a small ETL Spark application that reads data from GCS,
transforms it, and writes the result to another GCS bucket.
I am running this application for different ids on a Spark cluster in
Google's Dataproc, tweaking only the default configuration to use the FAIR
scheduler with a FIFO queue, by setting
  in /etc/hadoop/conf/yarn-site.xml:
  yarn.resourcemanager.scheduler.class =
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler
  yarn.scheduler.fair.allocation.file = /etc/hadoop/conf/fair-scheduler.xml
  yarn.scheduler.fair.user-as-default-queue = false
  and in /etc/hadoop/conf/fair-scheduler.xml, the allocation
  <queueMaxAppsDefault>1</queueMaxAppsDefault>
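
In property form, the same settings look roughly like this (a paraphrase
of what I applied, not a verbatim copy of my files):
'''
<!-- /etc/hadoop/conf/yarn-site.xml: switch YARN to the fair scheduler -->
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
<property>
  <name>yarn.scheduler.fair.allocation.file</name>
  <value>/etc/hadoop/conf/fair-scheduler.xml</value>
</property>
<property>
  <name>yarn.scheduler.fair.user-as-default-queue</name>
  <value>false</value>
</property>

<!-- /etc/hadoop/conf/fair-scheduler.xml: at most one running app per queue -->
<allocations>
  <queueMaxAppsDefault>1</queueMaxAppsDefault>
</allocations>
'''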


On a Spark cluster with
  1 master  - 2 cores, 4 GB RAM
  2 workers - 4 cores, 16 GB RAM
I tested 5 Spark submissions, and everything worked as expected: all the
applications ran one after the other without any exceptions.

When I ran the same exercise with 100 submissions, some of the submissions
failed with out-of-memory errors. When I re-ran the failed submissions
individually, they completed without any error.
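
For context, the batch test submits the jobs roughly like the sketch
below; the cluster name, region, main class, jar path, and ids file are
placeholders, not my real values:
'''
# Hypothetical sketch of the 100-submission batch test: one Dataproc
# Spark job per id, submitted without waiting (--async) so that YARN's
# FIFO queue can run them one after the other.
while read -r id; do
  gcloud dataproc jobs submit spark \
    --cluster=my-cluster \
    --region=us-east1 \
    --class=com.example.EtlJob \
    --jars=gs://my-bucket/etl.jar \
    --async \
    -- "$id"
done < ids.txt
'''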

The log of one submission that hit the out-of-memory error:
'''
20/06/05 19:44:23 INFO org.spark_project.jetty.util.log: Logging
initialized @5463ms
20/06/05 19:44:24 INFO org.spark_project.jetty.server.Server:
jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown
20/06/05 19:44:24 INFO org.spark_project.jetty.server.Server: Started
@5599ms
20/06/05 19:44:24 WARN org.apache.spark.util.Utils: Service 'SparkUI' could
not bind on port 4040. Attempting port 4041.
20/06/05 19:44:24 WARN org.apache.spark.util.Utils: Service 'SparkUI' could
not bind on port 4041. Attempting port 4042.
20/06/05 19:44:24 WARN org.apache.spark.util.Utils: Service 'SparkUI' could
not bind on port 4042. Attempting port 4043.
20/06/05 19:44:24 WARN org.apache.spark.util.Utils: Service 'SparkUI' could
not bind on port 4043. Attempting port 4044.
20/06/05 19:44:24 WARN org.apache.spark.util.Utils: Service 'SparkUI' could
not bind on port 4044. Attempting port 4045.
20/06/05 19:44:24 INFO org.spark_project.jetty.server.AbstractConnector:
Started ServerConnector@723f98fa{HTTP/1.1,[http/1.1]}{0.0.0.0:4045}
20/06/05 19:44:24 WARN org.apache.spark.scheduler.FairSchedulableBuilder:
Fair Scheduler configuration file not found so jobs will be scheduled in
FIFO order. To use fair scheduling, configure pools in fairscheduler.xml or
set spark.scheduler.allocation.file to a file that contains the
configuration.
20/06/05 19:44:26 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to
ResourceManager at airf-m-2c-w-4c-4-faff-m/10.160.0.156:8032
20/06/05 19:44:27 INFO org.apache.hadoop.yarn.client.AHSProxy: Connecting
to Application History server at airf-m-2c-w-4c-4-faff-m/10.160.0.156:10200
20/06/05 19:44:29 INFO
org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted
application application_1591383928453_0047
20/06/05 19:46:34 WARN org.apache.spark.sql.SparkSession$Builder: Using an
existing SparkSession; some configuration may not take effect.
20/06/05 19:46:41 INFO
com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl:
Repairing batch of 24 missing directories.
20/06/05 19:46:44 INFO
com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl:
Successfully repaired 24/24 implicit directories.
OpenJDK 64-Bit Server VM warning: INFO:
os::commit_memory(0x0000000098200000, 46661632, 0) failed; error='Cannot
allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 46661632 bytes for
committing reserved memory.
# An error report file with more information is saved as:
#
/tmp/9e22ca5b-5bf8-47b7-12ee-69cd9e37e7c8_spark_submit_20200605_82b0375c/hs_err_pid9917.log
Job output is complete
'''

Also, when I test-ran an application on its own, I never saw this log line:
  Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
(I assume it means several drivers were running on the master at the same
time during the 100-submission run.)

I am very new to Spark, so I don't know which configurations might help me
debug this, and the log above hasn't helped either. I also lost the hs_err
file when the cluster was deleted.
What can I do to debug this?
Thanks for taking the time to read this.
