[jira] [Updated] (YARN-3904) Refactor timelineservice.storage to add support to online and offline aggregation writers
[ https://issues.apache.org/jira/browse/YARN-3904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Varun Saxena updated YARN-3904:
    Fix Version/s: 2.9.0

> Refactor timelineservice.storage to add support to online and offline aggregation writers
>
> Key: YARN-3904
> URL: https://issues.apache.org/jira/browse/YARN-3904
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: timelineserver
> Reporter: Li Lu
> Assignee: Li Lu
> Fix For: 2.9.0, 3.0.0-alpha1
>
> Attachments: YARN-3904-YARN-2928.001.patch, YARN-3904-YARN-2928.002.patch, YARN-3904-YARN-2928.003.patch, YARN-3904-YARN-2928.004.patch, YARN-3904-YARN-2928.005.patch, YARN-3904-YARN-2928.006.patch, YARN-3904-YARN-2928.007.patch, YARN-3904-YARN-2928.008.patch, YARN-3904-YARN-2928.009.patch
>
> Now that we have finished the design for time-based aggregation, we can adopt our existing Phoenix storage as the storage for the aggregated data. In this JIRA, I'm proposing to refactor writers to add support for aggregation writers. Offline aggregation writers typically have less contextual information. We can distinguish these writers by special naming. We can also use CollectorContexts to model all contextual information and use it in our writer interfaces.

-- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (YARN-3904) Refactor timelineservice.storage to add support to online and offline aggregation writers
[ https://issues.apache.org/jira/browse/YARN-3904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Li Lu updated YARN-3904:
    Attachment: YARN-3904-YARN-2928.009.patch

Fixed the typo raised by [~vrushalic].

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3904) Refactor timelineservice.storage to add support to online and offline aggregation writers
[ https://issues.apache.org/jira/browse/YARN-3904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Li Lu updated YARN-3904:
    Attachment: YARN-3904-YARN-2928.008.patch

v008 patch, rebased onto the latest YARN-2928 branch.
[jira] [Updated] (YARN-3904) Refactor timelineservice.storage to add support to online and offline aggregation writers
[ https://issues.apache.org/jira/browse/YARN-3904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Li Lu updated YARN-3904:
    Attachment: YARN-3904-YARN-2928.007.patch

Thanks [~zjshen] and [~sjlee0]! I've uploaded the v007 patch to address your comments. Specifically:

bq. However, it's better to have config such as blah.blah.backend.type. When backend.type = hbase, the user can access HBase both directly and via Phoenix, and we allow aggregation. This may not need to be part of this jira, but just thinking out loud.

Yes, this proposal makes sense. It significantly simplifies deployment, since users no longer need to know the exact class name of the backend. However, if we decide to move in this direction, there is some foreseeable nontrivial work, such as changing class loading strategies. To keep the current workload separate, I propose addressing this issue in a separate JIRA (we may not address it immediately).

bq. Make sense, but can we still make table creation centralized? I think we can make some option to create raw entity tables and aggregation tables separately.

bq. As for createTables(), I'm also of the opinion that it might be better if we moved it to a dedicated creator class.

I agree it is appealing to centralize table creation. After giving it some thought, I think what we really want is a centralized _workflow_ for storage schema creation. That is, when setting up a v2 timeline server, users can simply run the data schema creator once to create the necessary storage schemas. With this in mind, I added Phoenix schema creation to the existing data schema creator, behind a separate option {{-p}}. However, I'm keeping the SQL statements for table creation inside the writer file so that we also have a centralized place for the Phoenix storage schema.

bq. Actually I'd like to ask whether this needs to be a service. Note that it is possible (or likely) that the writer will be executed in a mapreduce task.

We implement offline writers as {{AbstractService}}s to reuse the existing logic for service initialization, start, and stop. This pattern matches the use cases of our offline writers nicely. I admit it sounds a little awkward to call something inside a mapreduce job a service. However, the Hadoop {{Service}} is just a lightweight package for service lifecycle management. It is not strongly tied to server-side or non-application use cases. Therefore I changed the writer into an {{AbstractService}}, per [~zjshen]'s suggestion.

bq. For the user aggregation tables, I believe the cluster needs to be included in the row key.

Yes. Fixed.

bq. l.156: My JDBC knowledge is bit outdated, but do you want to prepare the statement every time write is done? Don't you want to prepare it once and reuse it? That optimization will follow later?

Nice catch. We can definitely reuse this PreparedStatement (as well as the connections) after we integrate the aggregation writer with the aggregator. My plan is to use this (relatively) stable writer to unblock the future patch on flow and user level offline aggregation. Once we have the whole workflow, we can gradually add optimizations. Thoughts?

bq. I would enforce the notion that this is a read-only object by making the members final

Yes. Fixed.

bq. Should the primary key be user and then cluster, or cluster and user? I think it might be better if it is cluster and user although it is different than the entity table. Vrushali C?

I'm OK with either. Any suggestions [~vrushalic]? In any case, we can decide this the same way we did for TimelineEvents in the HBase storage, so I don't think this blocks the JIRA?
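To illustrate why a lifecycle wrapper is not tied to server-side code, here is a minimal, self-contained sketch. It does not use Hadoop: {{LifecycleSketch}} and {{OfflineWriterSketch}} are hypothetical stand-ins that only mirror the init, start, stop pattern, and the in-memory list stands in for a storage connection.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a lifecycle base class. It only mirrors the
// init -> start -> stop pattern; it is NOT Hadoop's AbstractService.
abstract class LifecycleSketch {
  enum State { NOTINITED, INITED, STARTED, STOPPED }

  private State state = State.NOTINITED;

  final void init() {
    requireState(State.NOTINITED);
    serviceInit();
    state = State.INITED;
  }

  final void start() {
    requireState(State.INITED);
    serviceStart();
    state = State.STARTED;
  }

  // Stopping is idempotent so cleanup code can call it unconditionally.
  final void stop() {
    if (state == State.STOPPED) {
      return;
    }
    serviceStop();
    state = State.STOPPED;
  }

  final State getState() {
    return state;
  }

  private void requireState(State expected) {
    if (state != expected) {
      throw new IllegalStateException("expected " + expected + " but was " + state);
    }
  }

  // Subclasses override only the hooks they need.
  protected void serviceInit() {}
  protected void serviceStart() {}
  protected void serviceStop() {}
}

// Hypothetical offline writer. The in-memory list stands in for a storage
// connection that a real writer would open on start and release on stop.
class OfflineWriterSketch extends LifecycleSketch {
  private List<String> openSink;
  private int writtenCount;

  @Override
  protected void serviceStart() {
    openSink = new ArrayList<>();
  }

  @Override
  protected void serviceStop() {
    openSink = null;
  }

  void write(String record) {
    if (getState() != State.STARTED) {
      throw new IllegalStateException("writer is not started");
    }
    openSink.add(record);
    writtenCount++;
  }

  int getWrittenCount() {
    return writtenCount;
  }
}
```

Under these assumptions, a task would call {{init()}} and {{start()}} during its setup, {{write()}} per record, and {{stop()}} in a finally block; nothing in the pattern requires a long-running daemon.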
[jira] [Updated] (YARN-3904) Refactor timelineservice.storage to add support to online and offline aggregation writers
[ https://issues.apache.org/jira/browse/YARN-3904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Li Lu updated YARN-3904:
    Attachment: YARN-3904-YARN-2928.006.patch

Uploaded the 006 version of the patch. In this patch I addressed most of [~zjshen]'s review comments. I think the following two points need some discussion:

bq. moving the table creation stuff into TimelineSchemaCreator.

I'm not 100% sure that's what we want to do. We may want to decouple the offline aggregation module from our normal entity storage. If so, it may also be appealing to let users specify whether they need to create the data schema in the offline aggregation process, for example by setting a flag in the offline aggregator?

bq. As HBase backend is accessed both directly and via Phoenix, it's good for us to cleanup the configuration to say we're using the HBase backend (comparing to FS backend) instead of specifically HBase or Phoenix writer/reader.

After the changes in this JIRA, we will have only two types of TimelineWriters, one for FS (test only) and one for HBase. The setting for the offline storage should be independent of this setting, I assume?
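As a rough illustration of the backend-type idea discussed in this thread, the sketch below resolves a backend name from configuration to a writer class name, so a deployer never types a class name. Everything in it is an assumption: the key {{timeline.example.backend.type}}, the default of hbase, and the class names are placeholders, and plain {{java.util.Properties}} stands in for the Hadoop configuration.

```java
import java.util.Locale;
import java.util.Properties;

// Hypothetical sketch: maps a backend type string to a writer class name.
final class BackendTypeSketch {
  // Placeholder key; the thread only refers to it as "blah.blah.backend.type".
  static final String BACKEND_TYPE_KEY = "timeline.example.backend.type";

  enum Backend {
    FS("example.FsWriterSketch"),        // placeholder class name
    HBASE("example.HBaseWriterSketch");  // placeholder class name

    private final String writerClassName;

    Backend(String writerClassName) {
      this.writerClassName = writerClassName;
    }

    String getWriterClassName() {
      return writerClassName;
    }
  }

  private BackendTypeSketch() {}

  // Reads the backend type, defaulting to "hbase" (an assumption of this
  // sketch). Unknown values fail with IllegalArgumentException from valueOf.
  static Backend resolve(Properties conf) {
    String raw = conf.getProperty(BACKEND_TYPE_KEY, "hbase");
    return Backend.valueOf(raw.trim().toUpperCase(Locale.ROOT));
  }
}
```

A setting for the offline aggregation storage could use its own key and be resolved the same way, which keeps the two settings independent.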
[jira] [Updated] (YARN-3904) Refactor timelineservice.storage to add support to online and offline aggregation writers
[ https://issues.apache.org/jira/browse/YARN-3904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Li Lu updated YARN-3904:
    Attachment: YARN-3904-YARN-2928.005.patch

Refreshed my patch according to [~sjlee0]'s comments. Specifically, I set up a new interface (OfflineAggregationWriter) for aggregation writers. With this new interface I decoupled PhoenixOfflineAggregationWriter from TimelineWriter. Having a separate offline writer interface also gives us more freedom to design the aggregation storage interface. In the new writer API, the type of the offline aggregation is specified by the incoming {{OfflineAggregationInfo}}.

I also considered combining the reader and writer interfaces into an OfflineAggregationStorage interface, but it turned out that we may have some reader-only implementations (such as reading app level aggregations from HBase). Separating offline readers and writers gives us more freedom in this case.
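The writer/reader split described in this update might look roughly like the sketch below. It is not the patch's code: the interface names, method signatures, and table names are hypothetical simplifications, and {{AggregationInfoSketch}} only stands in for the role that {{OfflineAggregationInfo}} plays in selecting the aggregation type.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for OfflineAggregationInfo: it names the aggregation
// level and the table that level is stored in. Table names are made up.
final class AggregationInfoSketch {
  static final AggregationInfoSketch FLOW =
      new AggregationInfoSketch("flow_aggregation_sketch");
  static final AggregationInfoSketch USER =
      new AggregationInfoSketch("user_aggregation_sketch");

  private final String tableName;

  private AggregationInfoSketch(String tableName) {
    this.tableName = tableName;
  }

  String getTableName() {
    return tableName;
  }
}

// Writer and reader are separate interfaces, so a backend may implement
// only one of them.
interface OfflineWriterApiSketch {
  void writeAggregated(AggregationInfoSketch info, String rowKey, long value);
}

interface OfflineReaderApiSketch {
  List<Long> readAggregated(AggregationInfoSketch info, String rowKey);
}

// In-memory backend that happens to support both directions.
class InMemoryOfflineStoreSketch
    implements OfflineWriterApiSketch, OfflineReaderApiSketch {
  private final Map<String, List<Long>> rows = new HashMap<>();

  @Override
  public void writeAggregated(AggregationInfoSketch info, String rowKey, long value) {
    // The incoming info object decides which table the value lands in.
    rows.computeIfAbsent(info.getTableName() + "/" + rowKey, k -> new ArrayList<>())
        .add(value);
  }

  @Override
  public List<Long> readAggregated(AggregationInfoSketch info, String rowKey) {
    List<Long> values = rows.get(info.getTableName() + "/" + rowKey);
    return values == null
        ? Collections.<Long>emptyList()
        : Collections.unmodifiableList(values);
  }
}
```

A reader-only backend would implement just {{OfflineReaderApiSketch}}, which is the freedom that a combined storage interface would have taken away.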
[jira] [Updated] (YARN-3904) Refactor timelineservice.storage to add support to online and offline aggregation writers
[ https://issues.apache.org/jira/browse/YARN-3904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Li Lu updated YARN-3904:
    Attachment: YARN-3904-YARN-2928.004.patch

Updated to the 004 version of the patch. This patch addresses the following two major issues:

# Rebuild the current Phoenix writer into an offline aggregation writer. Specifically, the writer writes info and metric data into the newly created Phoenix offline aggregation table.
# Simplify the writer interface by using TimelineCollectorContext. In this way both normal writers and offline aggregation writers can use the same interface to write data.

One thing pending discussion is the {{aggregation}} method. I feel this method is a little outdated. Could anyone remind me of its intended use case? Does it fit real-time aggregation only?
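The second point, sharing one write signature by passing a context object, can be sketched as follows. This is an illustration under assumptions, not the patch itself: {{CollectorContextSketch}} is a hypothetical stand-in for TimelineCollectorContext with only four fields, the "!" separator is arbitrary, and each writer simply returns the row key it would write to.

```java
import java.util.Objects;

// Hypothetical stand-in for TimelineCollectorContext. appId is null when the
// caller (for example a flow-level offline aggregator) has no application.
final class CollectorContextSketch {
  private final String clusterId;
  private final String userId;
  private final String flowName;
  private final String appId;

  CollectorContextSketch(String clusterId, String userId, String flowName, String appId) {
    this.clusterId = Objects.requireNonNull(clusterId, "clusterId");
    this.userId = Objects.requireNonNull(userId, "userId");
    this.flowName = Objects.requireNonNull(flowName, "flowName");
    this.appId = appId;
  }

  String getClusterId() { return clusterId; }
  String getUserId() { return userId; }
  String getFlowName() { return flowName; }
  String getAppId() { return appId; }
}

// One write signature for every writer; the context carries whatever is known.
interface ContextWriterSketch {
  String write(CollectorContextSketch context, String entityId);
}

// Online writer: needs the full context, including the application id.
class OnlineWriterSketch implements ContextWriterSketch {
  @Override
  public String write(CollectorContextSketch context, String entityId) {
    String appId = Objects.requireNonNull(context.getAppId(), "online writes need an appId");
    return String.join("!", context.getClusterId(), context.getUserId(),
        context.getFlowName(), appId, entityId);
  }
}

// Offline flow-level writer: has less context and ignores the application id.
class FlowAggregationWriterSketch implements ContextWriterSketch {
  @Override
  public String write(CollectorContextSketch context, String entityId) {
    return String.join("!", context.getClusterId(), context.getUserId(),
        context.getFlowName(), entityId);
  }
}
```

Because the context's members are final, a writer can treat it as read-only, and adding a new contextual field later changes the context class rather than every writer signature.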
[jira] [Updated] (YARN-3904) Refactor timelineservice.storage to add support to online and offline aggregation writers
[ https://issues.apache.org/jira/browse/YARN-3904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Li Lu updated YARN-3904:
    Description: Now that we have finished the design for time-based aggregation, we can adopt our existing Phoenix storage as the storage for the aggregated data. In this JIRA, I'm proposing to refactor writers to add support for aggregation writers. Offline aggregation writers typically have less contextual information. We can distinguish these writers by special naming. We can also use CollectorContexts to model all contextual information and use it in our writer interfaces.

(was: Now that we have finished the design for time-based aggregation, we can adopt our existing Phoenix storage as the storage for the aggregated data. This JIRA proposes to move the Phoenix storage implementation from o.a.h.yarn.server.timelineservice.storage to o.a.h.yarn.server.timelineservice.aggregation.timebased, and make it a dedicated writer for time-based aggregation.)
[jira] [Updated] (YARN-3904) Refactor timelineservice.storage to add support to online and offline aggregation writers
[ https://issues.apache.org/jira/browse/YARN-3904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Li Lu updated YARN-3904:
    Summary: Refactor timelineservice.storage to add support to online and offline aggregation writers (was: Adopt PhoenixTimelineWriter into time-based aggregation storage)

Refactor timelineservice.storage to add support to online and offline aggregation writers

Key: YARN-3904
URL: https://issues.apache.org/jira/browse/YARN-3904
Project: Hadoop YARN
Issue Type: Sub-task
Components: timelineserver
Reporter: Li Lu
Assignee: Li Lu
Attachments: YARN-3904-YARN-2928.001.patch, YARN-3904-YARN-2928.002.patch, YARN-3904-YARN-2928.003.patch

Now that we have finished the design for time-based aggregation, we can adopt our existing Phoenix storage as the storage for the aggregated data. This JIRA proposes to move the Phoenix storage implementation from o.a.h.yarn.server.timelineservice.storage to o.a.h.yarn.server.timelineservice.aggregation.timebased, and make it a dedicated writer for time-based aggregation.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)