Hi Ram,
Below is the information.
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   712    0   712    0     0   3807      0 --:--:-- --:--:-- --:--:--  3807
{
    "clusterInfo": {
        "haState": "ACTIVE",
        "haZooKeeperConnectionState": "CONNECTED",
        "hadoopBuildVersion": "2.7.1.2.3.2.0-2950 from 5cc60e0003e33aa98205f18bccaeaf36cb193c1c by jenkins source checksum 69a3bf8c667267c2c252a54fbbf23d",
        "hadoopVersion": "2.7.1.2.3.2.0-2950",
        "hadoopVersionBuiltOn": "2015-09-30T18:08Z",
        "id": 1465495186350,
        "resourceManagerBuildVersion": "2.7.1.2.3.2.0-2950 from 5cc60e0003e33aa98205f18bccaeaf36cb193c1c by jenkins source checksum 48db4b572827c2e9c2da66982d147626",
        "resourceManagerVersion": "2.7.1.2.3.2.0-2950",
        "resourceManagerVersionBuiltOn": "2015-09-30T18:20Z",
        "rmStateStoreName": "org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore",
        "startedOn": 1465495186350,
        "state": "STARTED"
    }
}
Regards,
Surya Vamshi
From: Munagala Ramanath [mailto:[email protected]]
Sent: 2016, June, 16 2:57 PM
To: [email protected]
Subject: Re: Multiple directories
Can you ssh to one of the cluster nodes? If so, can you run this command and
show the output
(where {rm} is the host:port running the resource manager, aka YARN):
curl http://{rm}/ws/v1/cluster | python -mjson.tool
Ram
ps. You can determine the node running YARN with:
hdfs getconf -confKey yarn.resourcemanager.webapp.address
hdfs getconf -confKey yarn.resourcemanager.webapp.https.address
On Thu, Jun 16, 2016 at 11:15 AM, Mukkamula, Suryavamshivardhan (CWM-NR)
<[email protected]> wrote:
Hi,
I am facing a weird issue and the logs are not clear to me!
I have created an .apa file which works fine within my local sandbox, but I am
facing problems when I upload it to the enterprise Hadoop cluster using the DT
Console. Below is the error message from the yarn logs. Please help me
understand the issue.
###################### Error Logs ########################################################
Log Type: AppMaster.stderr
Log Upload Time: Thu Jun 16 14:07:46 -0400 2016
Log Length: 1259
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/grid/06/hadoop/yarn/local/usercache/mukkamula/appcache/application_1465495186350_2224/filecache/36/slf4j-log4j12-1.7.19.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.3.2.0-2950/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Exception in thread "main" java.lang.IllegalArgumentException: Invalid ContainerId: container_e35_1465495186350_2224_01_000001
        at org.apache.hadoop.yarn.util.ConverterUtils.toContainerId(ConverterUtils.java:182)
        at com.datatorrent.stram.StreamingAppMaster.main(StreamingAppMaster.java:90)
Caused by: java.lang.NumberFormatException: For input string: "e35"
        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
        at java.lang.Long.parseLong(Long.java:441)
        at java.lang.Long.parseLong(Long.java:483)
        at org.apache.hadoop.yarn.util.ConverterUtils.toApplicationAttemptId(ConverterUtils.java:137)
        at org.apache.hadoop.yarn.util.ConverterUtils.toContainerId(ConverterUtils.java:177)
        ... 1 more
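[Editor's note] The NumberFormatException is the key clue: YARN container IDs gained an optional epoch field (the "e35" here) in Hadoop 2.6, and a ConverterUtils from an older Hadoop client jar on the AppMaster classpath cannot parse it. A minimal sketch, in plain Python rather than the actual Hadoop code (both function names are invented for illustration), of why a pre-epoch parser fails on this ID:

```python
# Sketch only: why a pre-Hadoop-2.6 container-ID parser rejects an
# epoch-style ID such as container_e35_1465495186350_2224_01_000001.

def parse_old(container_id):
    """Pre-2.6 style: expects container_<clusterTs>_<appId>_<attempt>_<id>."""
    parts = container_id.split("_")
    # int(parts[1]) fails when the epoch field "e35" is present
    return int(parts[1]), int(parts[2]), int(parts[3]), int(parts[4])

def parse_epoch_aware(container_id):
    """2.6+ style: tolerates an optional epoch field like 'e35'."""
    parts = container_id.split("_")
    if parts[1].startswith("e"):
        parts = parts[:1] + parts[2:]   # drop the epoch field
    return int(parts[1]), int(parts[2]), int(parts[3]), int(parts[4])

cid = "container_e35_1465495186350_2224_01_000001"
try:
    parse_old(cid)
except ValueError as e:
    # mirrors the Java NumberFormatException: For input string: "e35"
    print("old parser fails:", e)
print("epoch-aware parser:", parse_epoch_aware(cid))
```

If this is the cause, it is worth checking that the application package does not bundle pre-2.6 Hadoop jars that shadow the cluster's 2.7.1 libraries.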
Log Type: AppMaster.stdout
Log Upload Time: Thu Jun 16 14:07:46 -0400 2016
Log Length: 0
Log Type: dt.log
Log Upload Time: Thu Jun 16 14:07:46 -0400 2016
Log Length: 29715
Showing 4096 bytes of 29715 total. Full log: http://guedlpdhdp001.saifg.rbc.com:19888/jobhistory/logs/guedlpdhdp012.saifg.rbc.com:45454/container_e35_1465495186350_2224_01_000001/container_e35_1465495186350_2224_01_000001/mukkamula/dt.log/?start=0
56m -Xloggc:/var/log/hadoop/yarn/gc.log-201606140038 -verbose:gc
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xms4096m
-Xmx4096m -Dhadoop.security.logger=INFO,DRFAS -Dhdfs.audit.logger=INFO,DRFAAUDIT
SHLVL=3
HADOOP_SSH_OPTS=-o ConnectTimeout=5 -o SendEnv=HADOOP_CONF_DIR
HADOOP_USER_NAME=datatorrent/[email protected]
HADOOP_NAMENODE_OPTS=-server -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC
-XX:ErrorFile=/var/log/hadoop/yarn/hs_err_pid%p.log -XX:NewSize=200m
-XX:MaxNewSize=200m -XX:PermSize=128m -XX:MaxPermSize=256m
-Xloggc:/var/log/hadoop/yarn/gc.log-201606140038 -verbose:gc
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xms8192m
-Xmx8192m -Dhadoop.security.logger=INFO,DRFAS
-Dhdfs.audit.logger=INFO,DRFAAUDIT
-XX:OnOutOfMemoryError="/usr/hdp/current/hadoop-hdfs-namenode/bin/kill-name-node"
-Dorg.mortbay.jetty.Request.maxFormContentSize=-1 -server
-XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC
-XX:ErrorFile=/var/log/hadoop/yarn/hs_err_pid%p.log -XX:NewSize=200m
-XX:MaxNewSize=200m -XX:PermSize=128m -XX:MaxPermSize=256m
-Xloggc:/var/log/hadoop/yarn/gc.log-201606140038 -verbose:gc
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xms8192m
-Xmx8192m -Dhadoop.security.logger=INFO,DRFAS
-Dhdfs.audit.logger=INFO,DRFAAUDIT
-XX:OnOutOfMemoryError="/usr/hdp/current/hadoop-hdfs-namenode/bin/kill-name-node"
-Dorg.mortbay.jetty.Request.maxFormContentSize=-1
HADOOP_IDENT_STRING=yarn
HADOOP_MAPRED_LOG_DIR=/var/log/hadoop-mapreduce/yarn
NM_HOST=guedlpdhdp012.saifg.rbc.com
XFILESEARCHPATH=/usr/dt/app-defaults/%L/Dt
HADOOP_SECURE_DN_LOG_DIR=/var/log/hadoop/hdfs
YARN_HISTORYSERVER_HEAPSIZE=1024
JVM_PID=2638
YARN_PID_DIR=/var/run/hadoop-yarn/yarn
HADOOP_HOME_WARN_SUPPRESS=1
NM_PORT=45454
LOGNAME=mukkamula
YARN_CONF_DIR=/usr/hdp/current/hadoop-client/conf
HADOOP_YARN_USER=yarn
QTDIR=/usr/lib64/qt-3.3
_=/usr/lib/jvm/java-1.7.0/bin/java
MSM_PRODUCT=MSM
HADOOP_HOME=/usr/hdp/2.3.2.0-2950/hadoop
MALLOC_ARENA_MAX=4
HADOOP_OPTS=-Dhdp.version=2.3.2.0-2950 -Djava.net.preferIPv4Stack=true
-Dhdp.version= -Djava.net.preferIPv4Stack=true
-Dhadoop.log.dir=/var/log/hadoop/yarn -Dhadoop.log.file=hadoop.log
-Dhadoop.home.dir=/usr/hdp/2.3.2.0-2950/hadoop -Dhadoop.id.str=yarn
-Dhadoop.root.logger=INFO,console
-Djava.library.path=:/usr/hdp/2.3.2.0-2950/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.2.0-2950/hadoop/lib/native
-Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true
-Dhdp.version=2.3.2.0-2950 -Dhadoop.log.dir=/var/log/hadoop/yarn
-Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/usr/hdp/2.3.2.0-2950/hadoop
-Dhadoop.id.str=yarn -Dhadoop.root.logger=INFO,console
-Djava.library.path=:/usr/hdp/2.3.2.0-2950/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.2.0-2950/hadoop/lib/native:/var/lib/ambari-agent/tmp/hadoop_java_io_tmpdir:/usr/hdp/2.3.2.0-2950/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.2.0-2950/hadoop/lib/native
-Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true
SHELL=/bin/bash
YARN_ROOT_LOGGER=INFO,EWMA,RFA
HADOOP_TOKEN_FILE_LOCATION=/grid/11/hadoop/yarn/local/usercache/mukkamula/appcache/application_1465495186350_2224/container_e35_1465495186350_2224_01_000001/container_tokens
CLASSPATH=./*:/usr/hdp/current/hadoop-client/conf:/usr/hdp/current/hadoop-client/*:/usr/hdp/current/hadoop-client/lib/*:/usr/hdp/current/hadoop-hdfs-client/*:/usr/hdp/current/hadoop-hdfs-client/lib/*:/usr/hdp/current/hadoop-yarn-client/*:/usr/hdp/current/hadoop-yarn-client/lib/*
HADOOP_MAPRED_PID_DIR=/var/run/hadoop-mapreduce/yarn
YARN_NODEMANAGER_HEAPSIZE=1024
QTINC=/usr/lib64/qt-3.3/include
USER=mukkamula
HADOOP_CLIENT_OPTS=-Xmx2048m -XX:MaxPermSize=512m -Xmx2048m -XX:MaxPermSize=512m
CONTAINER_ID=container_e35_1465495186350_2224_01_000001
HADOOP_SECURE_DN_PID_DIR=/var/run/hadoop/hdfs
HISTCONTROL=ignoredups
HOME=/home/
HADOOP_NAMENODE_INIT_HEAPSIZE=-Xms8192m
MSM_HOME=/usr/local/MegaRAID Storage Manager
LESSOPEN=||/usr/bin/lesspipe.sh %s
LANG=en_US.UTF-8
YARN_NICENESS=0
YARN_IDENT_STRING=yarn
HADOOP_MAPRED_HOME=/usr/hdp/2.3.2.0-2950/hadoop-mapreduce
Regards,
Surya Vamshi
From: Mukkamula, Suryavamshivardhan (CWM-NR)
Sent: 2016, June, 16 8:58 AM
To: [email protected]<mailto:[email protected]>
Subject: RE: Multiple directories
Thank you for the inputs.
Regards,
Surya Vamshi
From: Thomas Weise [mailto:[email protected]]
Sent: 2016, June, 15 5:08 PM
To: [email protected]<mailto:[email protected]>
Subject: Re: Multiple directories
On Wed, Jun 15, 2016 at 1:55 PM, Mukkamula, Suryavamshivardhan (CWM-NR)
<[email protected]> wrote:
Hi Ram/Team,
I could create an operator which reads multiple directories, parses each file
with respect to an individual configuration file, and generates output files
to different directories.
However, I have some questions regarding the design.
==> We have 120 directories to scan on HDFS. If we use parallel partitioning
with operator memory around 250MB, that is around 30GB of RAM for this
operator's processing. Are these figures going to create any problem in
production?
You can benchmark this with a single partition. If the downstream operators can
keep up with the rate at which the file reader emits, then the memory
consumption should be minimal. Keep in mind, though, that the container memory
is not just heap space for the operator, but also the memory the JVM requires
to run and the memory that the buffer server consumes. You can see the
allocated memory in the UI if you use the DT community edition (container list
in the physical plan).
==> Should I use a scheduler for running the batch job, or define the next scan
time and keep the DT job running continuously? If I run the DT job
continuously, I assume its memory is continuously utilized and not available to
other applications on the cluster; please clarify.
It is possible to set this up elastically as well, so that when there is no
input available, the number of reader partitions is reduced and the memory is
given back (Apex supports dynamic scaling).
Regards,
Surya Vamshi
From: Munagala Ramanath [mailto:[email protected]]
Sent: 2016, June, 05 10:24 PM
To: [email protected]
Subject: Re: Multiple directories
Some sample code to monitor multiple directories is now available at:
https://github.com/DataTorrent/examples/tree/master/tutorials/fileIO-multiDir
It shows how to use a custom implementation of definePartitions() to create
multiple partitions of the file input operator and group them
into "slices" where each slice monitors a single directory.
Ram
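[Editor's note] The "slices" idea in that example can be illustrated with a plain-Python sketch (this is not the Apex Partitioner API; `assign_slices` and the plan format are invented for illustration):

```python
# Sketch: map partition replicas to "slices", one directory per slice.
# This mirrors what a custom definePartitions() arranges: each contiguous
# slice of partition indices monitors exactly one directory.

def assign_slices(directories, partitions_per_directory):
    """Return (partition_index, directory) pairs, giving each directory
    its own contiguous slice of partitions."""
    plan, idx = [], 0
    for d in directories:
        for _ in range(partitions_per_directory):
            plan.append((idx, d))
            idx += 1
    return plan

plan = assign_slices(["/data/feedA", "/data/feedB"], 2)
# partitions 0-1 watch /data/feedA; partitions 2-3 watch /data/feedB
```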
On Wed, May 25, 2016 at 9:55 AM, Munagala Ramanath <[email protected]> wrote:
I'm hoping to have a sample sometime next week.
Ram
On Wed, May 25, 2016 at 9:30 AM, Mukkamula, Suryavamshivardhan (CWM-NR)
<[email protected]> wrote:
Thank you so much, Ram, for your advice; option (a) would be ideal for my
requirement.
Do you have sample usage for partitioning, with individual configuration
set-ups for the different partitions?
Regards,
Surya Vamshi
From: Munagala Ramanath [mailto:[email protected]]
Sent: 2016, May, 25 12:11 PM
To: [email protected]
Subject: Re: Multiple directories
You have 2 options: (a) AbstractFileInputOperator (b) FileSplitter/BlockReader
For (a), each partition (i.e. replica of the operator) can scan only a single
directory, so if you have 100 directories, you can simply start with 100
partitions; since each partition is scanning its own directory, you don't need
to worry about which files the lines came from. This approach however needs a
custom definePartitions() implementation in your subclass to assign the
appropriate directory and XML parsing config file to each partition; it also
needs adequate cluster resources to be able to spin up the required number of
partitions.
For (b), there is some documentation in the Operators section at
http://docs.datatorrent.com/ including sample code. These operators support
scanning multiple directories out of the box but have more elaborate
configuration options. Check this out and see if it works for your use case.
Ram
On Wed, May 25, 2016 at 8:17 AM, Mukkamula, Suryavamshivardhan (CWM-NR)
<[email protected]> wrote:
Hello Ram/Team,
My requirement is to read input feeds from different locations on HDFS and
parse those files by reading XML configuration files (each input feed has a
configuration file which defines the fields inside that feed).
My approach: I would like to define a mapping file which contains each feed's
identifier, feed location, and configuration file location. I would like to
read this mapping file during initial load within the setup() method and
define my DirectoryScan.acceptFiles. My challenge is that when I read the
files, I should parse each line by reading the corresponding configuration
file. How do I know which file a line came from? If I know this, I can read
the corresponding configuration file before parsing the line.
Please let me know how to handle this.
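[Editor's note] The mapping-file lookup described here can be sketched in plain Python. The three-column CSV format (`feed_id,feed_location,config_location`) and the helper names are assumptions for illustration; the point is that the parent directory of a file is enough to resolve its config, since each partition (or each opened file) knows its own path:

```python
# Sketch: resolve the per-feed XML config from a mapping file keyed by
# the feed's input directory. Format and names are hypothetical.
import csv
import io
import os

MAPPING = """feed1,/input/feed1,/conf/feed1.xml
feed2,/input/feed2,/conf/feed2.xml"""

def load_mapping(text):
    """Load feed_id,feed_location,config_location rows into a dict
    keyed by feed directory (would run once, e.g. in setup())."""
    table = {}
    for feed_id, feed_dir, conf in csv.reader(io.StringIO(text)):
        table[feed_dir] = (feed_id, conf)
    return table

def config_for(path, table):
    """Pick the XML config for a file by its parent directory."""
    return table[os.path.dirname(path)][1]

table = load_mapping(MAPPING)
print(config_for("/input/feed2/part-0001.txt", table))  # /conf/feed2.xml
```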
Regards,
Surya Vamshi
From: Munagala Ramanath [mailto:[email protected]]
Sent: 2016, May, 24 5:49 PM
To: Mukkamula, Suryavamshivardhan (CWM-NR)
Subject: Multiple directories
One way of addressing the issue is to use some sort of external tool (like a
script) to
copy all the input files to a common directory (making sure that the file names
are
unique to prevent one file from overwriting another) before the Apex
application starts.
The Apex application then starts and processes files from this directory.
If you set the partition count of the file input operator to N, it will create
N partitions and
the files will be automatically distributed among the partitions. The
partitions will work
in parallel.
Ram
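[Editor's note] The external copy step Ram describes can be sketched in plain Python (an HDFS or shell script would do the same job; the `feed__file` naming convention is just one illustrative way to keep names unique):

```python
# Sketch: copy files from per-feed directories into one common staging
# directory, prefixing each name with its source directory so files
# from different feeds cannot overwrite each other.
import os
import shutil

def unique_copy(src_files, dest_dir):
    os.makedirs(dest_dir, exist_ok=True)
    for path in src_files:
        feed = os.path.basename(os.path.dirname(path))
        dest = os.path.join(dest_dir, feed + "__" + os.path.basename(path))
        shutil.copy(path, dest)

# e.g. /input/feed1/part-0.txt -> <dest_dir>/feed1__part-0.txt
```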
_______________________________________________________________________
This [email] may be privileged and/or confidential, and the sender does not
waive any related rights and obligations. Any distribution, use or copying of
this [email] or the information it contains by other than an intended recipient
is unauthorized. If you received this [email] in error, please advise the
sender (by return [email] or otherwise) immediately. You have consented to
receive the attached electronically at the above-noted address; please retain a
copy of this confirmation for future reference.