It looks like you may be including old Hadoop jars in your .apa package, since the stack trace shows ConverterUtils.toContainerId calling ConverterUtils.toApplicationAttemptId, but recent versions don't have that call sequence. In 2.7.1 (which is what your cluster has) the method looks like this:

    public static ContainerId toContainerId(String containerIdStr) {
      return ContainerId.fromString(containerIdStr);
    }
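If you want to confirm at runtime where the class is actually coming from, a tiny diagnostic like the one below can help. This is an illustrative sketch only: the class name ContainerIdDebug is made up, and step 2 mimics the old parsing behavior rather than quoting the actual Hadoop source.

    import org.apache.hadoop.yarn.util.ConverterUtils;

    public class ContainerIdDebug {
      public static void main(String[] args) {
        // 1) Which jar is ConverterUtils loaded from? If this prints a jar
        //    bundled inside your .apa rather than one under /usr/hdp/...,
        //    old Hadoop classes are being packaged with your application.
        System.out.println(ConverterUtils.class
            .getProtectionDomain().getCodeSource().getLocation());

        // 2) Why "e35" fails: pre-epoch parsers treated the token right
        //    after "container_" as the numeric cluster timestamp.
        String id = "container_e35_1465495186350_2224_01_000001";
        Long.parseLong(id.split("_")[1]); // NumberFormatException: "e35"
      }
    }

If the old classes are indeed coming from your .apa, the usual fix is to mark the Hadoop dependencies as provided scope in your pom so they are picked up from the cluster instead of being bundled.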
Could you post the output of "jar tvf {your-apa-file}" as well as "mvn dependency:tree"?

Ram

On Thu, Jun 16, 2016 at 12:38 PM, Mukkamula, Suryavamshivardhan (CWM-NR) <suryavamshivardhan.mukkam...@rbc.com> wrote:

> Hi Ram,
>
> Below is the information.
>
>   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
>                                  Dload  Upload   Total   Spent    Left  Speed
> 100   712    0   712    0     0   3807      0 --:--:-- --:--:-- --:--:--  3807
> {
>     "clusterInfo": {
>         "haState": "ACTIVE",
>         "haZooKeeperConnectionState": "CONNECTED",
>         "hadoopBuildVersion": "2.7.1.2.3.2.0-2950 from 5cc60e0003e33aa98205f18bccaeaf36cb193c1c by jenkins source checksum 69a3bf8c667267c2c252a54fbbf23d",
>         "hadoopVersion": "2.7.1.2.3.2.0-2950",
>         "hadoopVersionBuiltOn": "2015-09-30T18:08Z",
>         "id": 1465495186350,
>         "resourceManagerBuildVersion": "2.7.1.2.3.2.0-2950 from 5cc60e0003e33aa98205f18bccaeaf36cb193c1c by jenkins source checksum 48db4b572827c2e9c2da66982d147626",
>         "resourceManagerVersion": "2.7.1.2.3.2.0-2950",
>         "resourceManagerVersionBuiltOn": "2015-09-30T18:20Z",
>         "rmStateStoreName": "org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore",
>         "startedOn": 1465495186350,
>         "state": "STARTED"
>     }
> }
>
> Regards,
> Surya Vamshi
>
> From: Munagala Ramanath [mailto:r...@datatorrent.com]
> Sent: 2016, June, 16 2:57 PM
> To: users@apex.apache.org
> Subject: Re: Multiple directories
>
> Can you ssh to one of the cluster nodes? If so, can you run this command and show the output
> (where {rm} is the host:port running the resource manager, aka YARN):
>
>     curl http://{rm}/ws/v1/cluster | python -mjson.tool
>
> Ram
>
> ps. You can determine the node running YARN with:
>
>     hdfs getconf -confKey yarn.resourcemanager.webapp.address
>     hdfs getconf -confKey yarn.resourcemanager.webapp.https.address
>
> On Thu, Jun 16, 2016 at 11:15 AM, Mukkamula, Suryavamshivardhan (CWM-NR) <suryavamshivardhan.mukkam...@rbc.com> wrote:
>
> Hi,
>
> I am facing a weird issue and the logs are not clear to me!!
>
> I have created an .apa file which works fine within my local sandbox, but I am facing problems when I upload it on the enterprise Hadoop cluster using the DT Console.
>
> Below is the error message from the YARN logs. Please help in understanding the issue.
>
> ###################### Error Logs ########################################################
>
> Log Type: AppMaster.stderr
> Log Upload Time: Thu Jun 16 14:07:46 -0400 2016
> Log Length: 1259
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in [jar:file:/grid/06/hadoop/yarn/local/usercache/mukkamula/appcache/application_1465495186350_2224/filecache/36/slf4j-log4j12-1.7.19.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in [jar:file:/usr/hdp/2.3.2.0-2950/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> Exception in thread "main" java.lang.IllegalArgumentException: Invalid ContainerId: container_e35_1465495186350_2224_01_000001
>         at org.apache.hadoop.yarn.util.ConverterUtils.toContainerId(ConverterUtils.java:182)
>         at com.datatorrent.stram.StreamingAppMaster.main(StreamingAppMaster.java:90)
> Caused by: java.lang.NumberFormatException: For input string: "e35"
>         at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>         at java.lang.Long.parseLong(Long.java:441)
>         at java.lang.Long.parseLong(Long.java:483)
>         at org.apache.hadoop.yarn.util.ConverterUtils.toApplicationAttemptId(ConverterUtils.java:137)
>         at org.apache.hadoop.yarn.util.ConverterUtils.toContainerId(ConverterUtils.java:177)
>         ... 1 more
>
> Log Type: AppMaster.stdout
> Log Upload Time: Thu Jun 16 14:07:46 -0400 2016
> Log Length: 0
>
> Log Type: dt.log
> Log Upload Time: Thu Jun 16 14:07:46 -0400 2016
> Log Length: 29715
> Showing 4096 bytes of 29715 total. Click here <http://guedlpdhdp001.saifg.rbc.com:19888/jobhistory/logs/guedlpdhdp012.saifg.rbc.com:45454/container_e35_1465495186350_2224_01_000001/container_e35_1465495186350_2224_01_000001/mukkamula/dt.log/?start=0> for the full log.
>
> 56m -Xloggc:/var/log/hadoop/yarn/gc.log-201606140038 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xms4096m -Xmx4096m -Dhadoop.security.logger=INFO,DRFAS -Dhdfs.audit.logger=INFO,DRFAAUDIT
> SHLVL=3
> HADOOP_SSH_OPTS=-o ConnectTimeout=5 -o SendEnv=HADOOP_CONF_DIR
> HADOOP_USER_NAME=datatorrent/gueulvahal003.saifg.rbc....@saifg.rbc.com
> HADOOP_NAMENODE_OPTS=-server -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC -XX:ErrorFile=/var/log/hadoop/yarn/hs_err_pid%p.log -XX:NewSize=200m -XX:MaxNewSize=200m -XX:PermSize=128m -XX:MaxPermSize=256m -Xloggc:/var/log/hadoop/yarn/gc.log-201606140038 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xms8192m -Xmx8192m -Dhadoop.security.logger=INFO,DRFAS -Dhdfs.audit.logger=INFO,DRFAAUDIT -XX:OnOutOfMemoryError="/usr/hdp/current/hadoop-hdfs-namenode/bin/kill-name-node" -Dorg.mortbay.jetty.Request.maxFormContentSize=-1 -server -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC -XX:ErrorFile=/var/log/hadoop/yarn/hs_err_pid%p.log -XX:NewSize=200m -XX:MaxNewSize=200m -XX:PermSize=128m -XX:MaxPermSize=256m -Xloggc:/var/log/hadoop/yarn/gc.log-201606140038 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xms8192m -Xmx8192m -Dhadoop.security.logger=INFO,DRFAS -Dhdfs.audit.logger=INFO,DRFAAUDIT -XX:OnOutOfMemoryError="/usr/hdp/current/hadoop-hdfs-namenode/bin/kill-name-node" -Dorg.mortbay.jetty.Request.maxFormContentSize=-1
> HADOOP_IDENT_STRING=yarn
> HADOOP_MAPRED_LOG_DIR=/var/log/hadoop-mapreduce/yarn
> NM_HOST=guedlpdhdp012.saifg.rbc.com
> XFILESEARCHPATH=/usr/dt/app-defaults/%L/Dt
> HADOOP_SECURE_DN_LOG_DIR=/var/log/hadoop/hdfs
> YARN_HISTORYSERVER_HEAPSIZE=1024
> JVM_PID=2638
> YARN_PID_DIR=/var/run/hadoop-yarn/yarn
> HADOOP_HOME_WARN_SUPPRESS=1
> NM_PORT=45454
> LOGNAME=mukkamula
> YARN_CONF_DIR=/usr/hdp/current/hadoop-client/conf
> HADOOP_YARN_USER=yarn
> QTDIR=/usr/lib64/qt-3.3
> _=/usr/lib/jvm/java-1.7.0/bin/java
> MSM_PRODUCT=MSM
> HADOOP_HOME=/usr/hdp/2.3.2.0-2950/hadoop
> MALLOC_ARENA_MAX=4
> HADOOP_OPTS=-Dhdp.version=2.3.2.0-2950 -Djava.net.preferIPv4Stack=true -Dhdp.version= -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/var/log/hadoop/yarn -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/usr/hdp/2.3.2.0-2950/hadoop -Dhadoop.id.str=yarn -Dhadoop.root.logger=INFO,console -Djava.library.path=:/usr/hdp/2.3.2.0-2950/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.2.0-2950/hadoop/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Dhdp.version=2.3.2.0-2950 -Dhadoop.log.dir=/var/log/hadoop/yarn -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/usr/hdp/2.3.2.0-2950/hadoop -Dhadoop.id.str=yarn -Dhadoop.root.logger=INFO,console -Djava.library.path=:/usr/hdp/2.3.2.0-2950/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.2.0-2950/hadoop/lib/native:/var/lib/ambari-agent/tmp/hadoop_java_io_tmpdir:/usr/hdp/2.3.2.0-2950/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.2.0-2950/hadoop/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true
> SHELL=/bin/bash
> YARN_ROOT_LOGGER=INFO,EWMA,RFA
> HADOOP_TOKEN_FILE_LOCATION=/grid/11/hadoop/yarn/local/usercache/mukkamula/appcache/application_1465495186350_2224/container_e35_1465495186350_2224_01_000001/container_tokens
> CLASSPATH=./*:/usr/hdp/current/hadoop-client/conf:/usr/hdp/current/hadoop-client/*:/usr/hdp/current/hadoop-client/lib/*:/usr/hdp/current/hadoop-hdfs-client/*:/usr/hdp/current/hadoop-hdfs-client/lib/*:/usr/hdp/current/hadoop-yarn-client/*:/usr/hdp/current/hadoop-yarn-client/lib/*
> HADOOP_MAPRED_PID_DIR=/var/run/hadoop-mapreduce/yarn
> YARN_NODEMANAGER_HEAPSIZE=1024
> QTINC=/usr/lib64/qt-3.3/include
> USER=mukkamula
> HADOOP_CLIENT_OPTS=-Xmx2048m -XX:MaxPermSize=512m -Xmx2048m -XX:MaxPermSize=512m
> CONTAINER_ID=container_e35_1465495186350_2224_01_000001
> HADOOP_SECURE_DN_PID_DIR=/var/run/hadoop/hdfs
> HISTCONTROL=ignoredups
> HOME=/home/
> HADOOP_NAMENODE_INIT_HEAPSIZE=-Xms8192m
> MSM_HOME=/usr/local/MegaRAID Storage Manager
> LESSOPEN=||/usr/bin/lesspipe.sh %s
> LANG=en_US.UTF-8
> YARN_NICENESS=0
> YARN_IDENT_STRING=yarn
> HADOOP_MAPRED_HOME=/usr/hdp/2.3.2.0-2950/hadoop-mapreduce
>
> Regards,
> Surya Vamshi
>
> From: Mukkamula, Suryavamshivardhan (CWM-NR)
> Sent: 2016, June, 16 8:58 AM
> To: users@apex.apache.org
> Subject: RE: Multiple directories
>
> Thank you for the inputs.
>
> Regards,
> Surya Vamshi
>
> From: Thomas Weise [mailto:thomas.we...@gmail.com]
> Sent: 2016, June, 15 5:08 PM
> To: users@apex.apache.org
> Subject: Re: Multiple directories
>
> On Wed, Jun 15, 2016 at 1:55 PM, Mukkamula, Suryavamshivardhan (CWM-NR) <suryavamshivardhan.mukkam...@rbc.com> wrote:
>
> Hi Ram/Team,
>
> I could create an operator which reads multiple directories, parses each file against its individual configuration file, and generates output files to different directories.
>
> However, I have some questions regarding the design:
>
> 1. We have 120 directories to scan on HDFS. If we use parallel partitioning with operator memory around 250MB, that comes to roughly 30GB of RAM for this operator. Are these figures going to create any problem in production?
>
> You can benchmark this with a single partition. If the downstream operators can keep up with the rate at which the file reader emits, then the memory consumption should be minimal. Keep in mind though that the container memory is not just heap space for the operator, but also memory the JVM requires to run and the memory that the buffer server consumes. You can see the allocated memory in the UI if you use the DT community edition (container list in the physical plan).
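For reference, the memory a given operator requests from YARN can also be pinned explicitly. A minimal sketch, assuming the standard Apex MEMORY_MB operator attribute and the Malhar FileLineInputOperator; the application class, operator name, and directory are hypothetical:

    import org.apache.hadoop.conf.Configuration;

    import com.datatorrent.api.Context.OperatorContext;
    import com.datatorrent.api.DAG;
    import com.datatorrent.api.StreamingApplication;
    import com.datatorrent.lib.io.fs.AbstractFileInputOperator.FileLineInputOperator;

    public class MemoryCapDemo implements StreamingApplication {
      @Override
      public void populateDAG(DAG dag, Configuration conf) {
        FileLineInputOperator reader = dag.addOperator("reader", new FileLineInputOperator());
        reader.setDirectory("/data/feed1"); // hypothetical input directory
        // Request roughly 250MB of container memory per partition of this
        // operator; the same attribute can also be set in configuration as
        // dt.operator.reader.attr.MEMORY_MB
        dag.setAttribute(reader, OperatorContext.MEMORY_MB, 250);
      }
    }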
> 2. Should I use a scheduler for running the batch job, or define the next scan time and keep the DT job running continuously? If I run the DT job continuously, I assume memory will be continuously utilized by the DT job and will not be available to other resources on the cluster. Please clarify.
>
> It is possible to set this up elastically also, so that when there is no input available, the number of reader partitions is reduced and the memory given back (Apex supports dynamic scaling).
>
> Regards,
> Surya Vamshi
>
> From: Munagala Ramanath [mailto:r...@datatorrent.com]
> Sent: 2016, June, 05 10:24 PM
> To: users@apex.apache.org
> Subject: Re: Multiple directories
>
> Some sample code to monitor multiple directories is now available at:
> https://github.com/DataTorrent/examples/tree/master/tutorials/fileIO-multiDir
>
> It shows how to use a custom implementation of definePartitions() to create multiple partitions of the file input operator and group them into "slices" where each slice monitors a single directory.
>
> Ram
>
> On Wed, May 25, 2016 at 9:55 AM, Munagala Ramanath <r...@datatorrent.com> wrote:
>
> I'm hoping to have a sample sometime next week.
>
> Ram
>
> On Wed, May 25, 2016 at 9:30 AM, Mukkamula, Suryavamshivardhan (CWM-NR) <suryavamshivardhan.mukkam...@rbc.com> wrote:
>
> Thank you so much, Ram, for your advice; option (a) would be ideal for my requirement.
>
> Do you have sample usage for partitioning, with individual configuration set up for different partitions?
>
> Regards,
> Surya Vamshi
>
> From: Munagala Ramanath [mailto:r...@datatorrent.com]
> Sent: 2016, May, 25 12:11 PM
> To: users@apex.apache.org
> Subject: Re: Multiple directories
>
> You have 2 options: (a) AbstractFileInputOperator (b) FileSplitter/BlockReader
>
> For (a), each partition (i.e. replica of the operator) can scan only a single directory, so if you have 100 directories, you can simply start with 100 partitions; since each partition is scanning its own directory you don't need to worry about which files the lines came from. This approach however needs a custom definePartitions() implementation in your subclass to assign the appropriate directory and XML parsing config file to each partition (a rough sketch follows just below this message); it also needs adequate cluster resources to be able to spin up the required number of partitions.
>
> For (b), there is some documentation in the Operators section at http://docs.datatorrent.com/ including sample code. These operators support scanning multiple directories out of the box but have more elaborate configuration options. Check this out and see if it works for your use case.
>
> Ram
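A very rough sketch of option (a), assuming the Malhar AbstractFileInputOperator API. The class name MultiDirReader and the hard-coded directory/config lists are made up for illustration; the real fileIO-multiDir example linked above additionally clones the checkpointed operator state and handles dynamic repartitioning, which this sketch ignores:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.Collection;
    import java.util.List;

    import com.datatorrent.api.DefaultPartition;
    import com.datatorrent.api.Partitioner.Partition;
    import com.datatorrent.api.Partitioner.PartitioningContext;
    import com.datatorrent.lib.io.fs.AbstractFileInputOperator;

    public class MultiDirReader extends AbstractFileInputOperator.FileLineInputOperator {
      // Hypothetical pairing: directory i is parsed using config file i.
      private List<String> directories = Arrays.asList("/data/feed1", "/data/feed2");
      private List<String> configFiles = Arrays.asList("/conf/feed1.xml", "/conf/feed2.xml");
      private String configFile; // the XML config this particular partition uses

      @Override
      public Collection<Partition<AbstractFileInputOperator<String>>> definePartitions(
          Collection<Partition<AbstractFileInputOperator<String>>> partitions,
          PartitioningContext context) {
        // Ignore the incoming collection and build one partition per directory,
        // so each replica scans exactly one directory with its own config.
        List<Partition<AbstractFileInputOperator<String>>> result = new ArrayList<>();
        for (int i = 0; i < directories.size(); i++) {
          MultiDirReader reader = new MultiDirReader();
          reader.setDirectory(directories.get(i)); // each partition scans one directory
          reader.configFile = configFiles.get(i);  // and knows its own parsing config
          result.add(new DefaultPartition<AbstractFileInputOperator<String>>(reader));
        }
        return result;
      }
    }

In a real application the directory/config pairing would come from the mapping file rather than hard-coded lists, and the partition count would be driven by its size.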
> On Wed, May 25, 2016 at 8:17 AM, Mukkamula, Suryavamshivardhan (CWM-NR) <suryavamshivardhan.mukkam...@rbc.com> wrote:
>
> Hello Ram/Team,
>
> My requirement is to read input feeds from different locations on HDFS and parse those files by reading XML configuration files (each input feed has a configuration file which defines the fields inside the feed).
>
> My approach: I would like to define a mapping file which contains the individual feed identifier, feed location, and configuration file location. I would like to read this mapping file at initial load within the setup() method and define my DirectoryScan.acceptFiles. My challenge here is that when I read the files, I should parse the lines by reading the individual configuration files. How do I know which file a line came from? If I knew this, I could read the corresponding configuration file before parsing the line.
>
> Please let me know how to handle this.
>
> Regards,
> Surya Vamshi
>
> From: Munagala Ramanath [mailto:r...@datatorrent.com]
> Sent: 2016, May, 24 5:49 PM
> To: Mukkamula, Suryavamshivardhan (CWM-NR)
> Subject: Multiple directories
>
> One way of addressing the issue is to use some sort of external tool (like a script) to copy all the input files to a common directory (making sure that the file names are unique to prevent one file from overwriting another) before the Apex application starts.
>
> The Apex application then starts and processes files from this directory.
>
> If you set the partition count of the file input operator to N, it will create N partitions and the files will be automatically distributed among the partitions. The partitions will work in parallel.
>
> Ram
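A minimal sketch of this simple common-directory variant, assuming the Malhar FileLineInputOperator and its partitionCount property; the application class, operator name, and directory are hypothetical:

    import org.apache.hadoop.conf.Configuration;

    import com.datatorrent.api.DAG;
    import com.datatorrent.api.StreamingApplication;
    import com.datatorrent.lib.io.fs.AbstractFileInputOperator.FileLineInputOperator;

    public class CommonDirApp implements StreamingApplication {
      @Override
      public void populateDAG(DAG dag, Configuration conf) {
        FileLineInputOperator reader = dag.addOperator("reader", new FileLineInputOperator());
        reader.setDirectory("/data/common"); // the single directory the script copies into
        // Ask for N = 4 replicas; the operator's own partitioning logic
        // distributes the files in the directory among them.
        reader.setPartitionCount(4);
        // ... connect reader.output to the rest of the DAG ...
      }
    }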