Hi,

I am trying to configure Apache Gobblin on YARN to pull data from PostgreSQL 
into HDFS and store it as Avro files in daily partitions. Gobblin uses Helix 
version 0.8.2 to manage the tasks.
The job gets stuck when the data volume is increased: some of the tasks are 
reported as completed in the debug logs, but their result files are missing.

This job has 63 tasks, one for each partition, and the logs show that 4 task 
runners are initialized and assigned tasks.
After most of the task result files have been created in the task output dir, 
the job gets stuck with no error message or exception.
With a reduced volume of data the same configuration works and the job 
finishes. When the job does get stuck, it happens roughly 25 - 30 minutes in.

One such task, which is logged as COMPLETED but has no file in the output dir, 
is ..._1577133620749_3, as shown below.

2019-12-23 20:40:28 UTC WARN  [GenericHelixController-event_process] 
org.apache.helix.task.assigner.AssignableInstance  - AssignableInstance does 
not have enough capacity for quotaType: DEFAULT. Task: 
23b17106-6e18-4516-9745-879a3f6a30b8, quotaType: DEFAULT, Instance name: 
GobblinYarnTaskRunner_2. Current capacity: 40 capacity needed to schedule: 40

2019-12-23 20:40:32 UTC DEBUG [GenericHelixController-event_process] 
org.apache.helix.task.AbstractTaskDispatcher  - Setting task partition 
job_ArchiveJob16_1577133620749_job_ArchiveJob16_1577133620749_3 state to 
RUNNING on instance GobblinYarnTaskRunner_1

2019-12-23 20:48:38 UTC DEBUG [GenericHelixController-event_process] 
org.apache.helix.task.AbstractTaskDispatcher  - Task partition 
job_ArchiveJob16_1577133620749_job_ArchiveJob16_1577133620749_3 has a pending 
state transition on instance GobblinYarnTaskRunner_4. Using the previous ideal 
state which was RUNNING.

2019-12-23 20:50:20 UTC DEBUG [GenericHelixController-event_process] 
org.apache.helix.task.AbstractTaskDispatcher  - Setting task partition 
job_ArchiveJob16_1577133620749_job_ArchiveJob16_1577133620749_3 state to 
RUNNING on instance GobblinYarnTaskRunner_4.

2019-12-23 20:50:22 UTC DEBUG [GenericHelixController-event_process] 
org.apache.helix.task.AbstractTaskDispatcher  - Task partition 
job_ArchiveJob16_1577133620749_job_ArchiveJob16_1577133620749_3 has completed 
with state COMPLETED. Marking as such in rebalancer context.
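
To follow the same sequence for all 63 tasks rather than one, a rough sketch 
like the one below could pull a per-task timeline out of the controller log. 
It assumes each log record sits on a single line in the actual log file (the 
excerpts above are wrapped by the mail client) and that the wording matches 
the two AbstractTaskDispatcher messages quoted above. The function names 
task_timeline and reassigned_tasks are made up for illustration and are not 
part of Gobblin or Helix.

```python
import re
from collections import defaultdict

# Patterns modelled on the two AbstractTaskDispatcher messages quoted above.
# Assumption: one log record per line in the real controller log.
TS = r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) UTC .*"
ASSIGN_RE = re.compile(
    TS + r"Setting task partition (?P<task>\S+) state to (?P<state>\w+) "
    r"on instance (?P<instance>\w+)"
)
DONE_RE = re.compile(
    TS + r"Task partition (?P<task>\S+) has completed with state (?P<state>\w+)"
)


def task_timeline(lines):
    """Map each task partition to its ordered (timestamp, state, instance) events."""
    timeline = defaultdict(list)
    for line in lines:
        line = " ".join(line.split())  # collapse runs of whitespace
        match = ASSIGN_RE.search(line)
        if match:
            timeline[match["task"]].append(
                (match["ts"], match["state"], match["instance"])
            )
            continue
        match = DONE_RE.search(line)
        if match:
            # Completion messages do not name an instance.
            timeline[match["task"]].append((match["ts"], match["state"], None))
    return dict(timeline)


def reassigned_tasks(timeline):
    """Return tasks that were set to a state on more than one instance."""
    result = {}
    for task, events in timeline.items():
        instances = []
        for _, _, instance in events:
            if instance is not None and instance not in instances:
                instances.append(instance)
        if len(instances) > 1:
            result[task] = instances
    return result
```

Passing the lines of the controller log to task_timeline and the result to 
reassigned_tasks would list every task that moved between task runners. Going 
by the excerpt above, ..._1577133620749_3 would be among them, since it was 
set to RUNNING on GobblinYarnTaskRunner_1 and later on GobblinYarnTaskRunner_4.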

I am not sure how to figure out what is happening with the job, and I would 
appreciate any advice or suggestions.

Thanks & Regards,
Praveen
