Sangjin Lee commented on YARN-3901:

(2) colliding puts in the co-processor
We found another issue on the write side via a unit test. It fails 
only occasionally (and more often in some environments than others). It happens 
when 2 puts on the same column arrive very close together (namely within 1 
millisecond). The code in question is {{FlowRunCoprocessor.prePut()}}:

      for (Map.Entry<byte[], List<Cell>> entry : put.getFamilyCellMap()
          .entrySet()) {
        List<Cell> newCells = new ArrayList<>(entry.getValue().size());
        for (Cell cell : entry.getValue()) {
          // for each cell in the put add the tags
          // Assumption is that all the cells in
          // one put are the same operation
          newCells.add(new KeyValue(CellUtil.cloneRow(cell),
              CellUtil.cloneFamily(cell), CellUtil.cloneQualifier(cell),
              cell.getTimestamp(), KeyValue.Type.Put,
              CellUtil.cloneValue(cell), Tag.fromList(tags)));
        }
        newFamilyMap.put(entry.getKey(), newCells);
      } // for each entry

If 2 cells carry the same timestamp (along with the same row, family and 
qualifier), the later one ends up overwriting the earlier one, effectively 
losing one put. This was triggered by one of the tests in 
{{TestHBaseTimelineWriterImplFlowRun.java}}.
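
To make the collision concrete, here is a minimal sketch in plain Java. It does not use the HBase API: the store is a toy in-memory map, and the class name, method names and the "flowRun1" row are hypothetical. It only assumes that a cell is identified by its row, qualifier and timestamp, so a second write to the same coordinates replaces the first.

```java
import java.util.HashMap;
import java.util.Map;

// Toy stand-in for a single column family; NOT the HBase API.
class CollidingPutsDemo {
  // Cells keyed by "row/qualifier/timestamp" (assumed cell identity).
  private final Map<String, Long> cells = new HashMap<>();

  void put(String row, String qualifier, long timestamp, long value) {
    // Same coordinates: the later write silently replaces the earlier one.
    cells.put(row + "/" + qualifier + "/" + timestamp, value);
  }

  int versionCount() {
    return cells.size();
  }

  // Writes two values to the same column and reports how many survive.
  static int countAfterTwoPuts(long ts1, long ts2) {
    CollidingPutsDemo store = new CollidingPutsDemo();
    store.put("flowRun1", "min_start_time", ts1, 100L); // hypothetical row key
    store.put("flowRun1", "min_start_time", ts2, 200L);
    return store.versionCount();
  }

  public static void main(String[] args) {
    System.out.println(countAfterTwoPuts(1000L, 1000L)); // same millisecond: prints 1
    System.out.println(countAfterTwoPuts(1000L, 1001L)); // distinct timestamps: prints 2
  }
}
```

With identical timestamps only one of the two values remains, which matches the lost put seen in the test.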

It's an edge case that is rather unlikely to happen in normal operation, but 
it is an issue nonetheless, and solving it is fairly complicated. We'll soon 
post possible approaches for handling it.

But at any rate, I suspect we could isolate this issue into a separate JIRA, 
and tackle it post-UI-POC. I'd appreciate your feedback.

> Populate flow run data in the flow_run & flow activity tables
> -------------------------------------------------------------
>                 Key: YARN-3901
>                 URL: https://issues.apache.org/jira/browse/YARN-3901
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>            Reporter: Vrushali C
>            Assignee: Vrushali C
>         Attachments: YARN-3901-YARN-2928.1.patch, 
> YARN-3901-YARN-2928.2.patch, YARN-3901-YARN-2928.3.patch, 
> YARN-3901-YARN-2928.WIP.2.patch, YARN-3901-YARN-2928.WIP.patch
> As per the schema proposed in YARN-3815 in 
> https://issues.apache.org/jira/secure/attachment/12743391/hbase-schema-proposal-for-aggregation.pdf
> filing jira to track creation and population of data in the flow run table. 
> Some points that are being considered:
> - Stores per flow run information aggregated across applications, flow version
> - RM’s collector writes to it on app creation and app completion
> - Per App collector writes to it for metric updates at a slower frequency 
> than the metric updates to application table
> - Primary key: cluster ! user ! flow ! flow run id
> - Only the latest version of flow-level aggregated metrics will be kept, even 
> if the entity and application level keep a timeseries.
> - The running_apps column will be incremented on app creation, and 
> decremented on app completion.
> - For min_start_time the RM writer will simply write a value with the tag for 
> the applicationId. A coprocessor will return the min value of all written 
> values.
> - Upon flush and compactions, the min value between all the cells of this 
> column will be written to the cell without any tag (empty tag) and all the 
> other cells will be discarded.
> - Ditto for the max_end_time, but then the max will be kept.
> - Tags are represented as #type:value. The type can be not set (0), or can 
> indicate running (1) or complete (2). In those cases (for metrics) only 
> complete app metrics are collapsed on compaction.
> - The m! values are aggregated (summed) upon read. Only when applications are 
> completed (indicated by tag type 2) can the values be collapsed.
> - The application ids that have completed and been aggregated into the flow 
> numbers are retained in a separate column for historical tracking: we don’t 
> want to re-aggregate for those upon replay
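
The min and sum rules in the quoted description can be sketched roughly as follows. This is plain Java rather than coprocessor code, and every class and method name here is hypothetical. It also assumes that an untagged cell (type 0) holds a previously collapsed value and can be folded in again at compaction, which the description does not state explicitly.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; not the FlowRunCoprocessor implementation.
class FlowRunAggregationSketch {
  // Tag types as listed in the description.
  static final int NOT_SET = 0;
  static final int RUNNING = 1;
  static final int COMPLETE = 2;

  // Hypothetical cell: a value, its tag type, and the app that wrote it.
  static final class TaggedCell {
    final long value;
    final int tagType;
    final String appId; // null for an untagged (collapsed) cell

    TaggedCell(long value, int tagType, String appId) {
      this.value = value;
      this.tagType = tagType;
      this.appId = appId;
    }
  }

  // min_start_time: the minimum over all written values
  // (Long.MAX_VALUE if there are no cells).
  static long min(List<TaggedCell> cells) {
    long min = Long.MAX_VALUE;
    for (TaggedCell c : cells) {
      min = Math.min(min, c.value);
    }
    return min;
  }

  // Metric values are summed upon read.
  static long sumOnRead(List<TaggedCell> cells) {
    long sum = 0;
    for (TaggedCell c : cells) {
      sum += c.value;
    }
    return sum;
  }

  // Compaction: running cells are kept as-is; complete cells (and, by
  // assumption, already-untagged cells) collapse into one untagged cell.
  static List<TaggedCell> compact(List<TaggedCell> cells) {
    List<TaggedCell> out = new ArrayList<>();
    long collapsed = 0;
    boolean anyCollapsed = false;
    for (TaggedCell c : cells) {
      if (c.tagType == RUNNING) {
        out.add(c);
      } else {
        collapsed += c.value;
        anyCollapsed = true;
      }
    }
    if (anyCollapsed) {
      out.add(new TaggedCell(collapsed, NOT_SET, null));
    }
    return out;
  }

  public static void main(String[] args) {
    List<TaggedCell> cells = new ArrayList<>();
    cells.add(new TaggedCell(10L, COMPLETE, "app1"));
    cells.add(new TaggedCell(5L, RUNNING, "app2"));
    cells.add(new TaggedCell(7L, COMPLETE, "app3"));
    List<TaggedCell> compacted = compact(cells);
    // Compaction must not change the value seen on read.
    System.out.println(sumOnRead(cells) == sumOnRead(compacted));
  }
}
```

The point of the sketch is the invariant checked in main: collapsing completed cells reduces the number of cells but leaves the summed value unchanged.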
