[jira] [Commented] (YARN-3411) [Storage implementation] explore the native HBase write schema for storage
[ https://issues.apache.org/jira/browse/YARN-3411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14515163#comment-14515163 ] Zhijie Shen commented on YARN-3411:
---

bq. Can we make both the implementations use hBase-client from 0.98 irrespective of what the server uses?

I guess it's not a client problem but a server problem. To use Phoenix, the HBase daemon needs to start with the Phoenix server lib installed. That means we won't be able to have an HBase 1.0 cluster with Phoenix 4.3 installed. On the other hand, given that Vrushali's comment "Yes, since hbase 1.0 is both on-wire and on-disk compatible with HBase 0.98.x, I believe we should be able to use the 0.98 client to write to a hbase 1.0 cluster." is right, the client side should be fine: the Phoenix 4.3 client uses the HBase 0.98.x client, so it can talk to an HBase 1.0 server.

> [Storage implementation] explore the native HBase write schema for storage
> --------------------------------------------------------------------------
>
>                 Key: YARN-3411
>                 URL: https://issues.apache.org/jira/browse/YARN-3411
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>            Reporter: Sangjin Lee
>            Assignee: Vrushali C
>            Priority: Critical
>         Attachments: ATSv2BackendHBaseSchemaproposal.pdf, YARN-3411.poc.2.txt, YARN-3411.poc.txt
>
>
> There is work in progress to implement the storage based on a Phoenix schema (YARN-3134).
> In parallel, we would like to explore an implementation based on a native HBase schema for the write path. Such a schema does not exclude using Phoenix, especially for reads and offline queries.
> Once we have basic implementations of both options, we could evaluate them in terms of performance, scalability, usability, etc. and make a call.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (YARN-3411) [Storage implementation] explore the native HBase write schema for storage
[ https://issues.apache.org/jira/browse/YARN-3411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14515135#comment-14515135 ] Vrushali C commented on YARN-3411:
---

Hi [~vinodkv],

bq. Can we make both the implementations use hBase-client from 0.98 irrespective of what the server uses?

Yes, since hbase 1.0 is both on-wire and on-disk compatible with HBase 0.98.x, I believe we should be able to use the 0.98 client to write to an hbase 1.0 cluster. But that means we would still be using the 0.98 APIs in the timeline writer and would need code changes to move to the 1.0 client later. (My current patch uses the new 1.0 APIs.) Using the 0.98.x client also means we won't be able to take advantage of the 1.0 features that would really be useful in ATSv2.

thanks
Vrushali
[jira] [Commented] (YARN-3411) [Storage implementation] explore the native HBase write schema for storage
[ https://issues.apache.org/jira/browse/YARN-3411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14515075#comment-14515075 ] Li Lu commented on YARN-3411:
---

One alternative plan is to try the 4.4.0 snapshot version of Phoenix in our benchmark; it is yet to be released but (probably) usable. I'll double check with the Phoenix/HBase team to see how hard this is.
[jira] [Commented] (YARN-3411) [Storage implementation] explore the native HBase write schema for storage
[ https://issues.apache.org/jira/browse/YARN-3411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14515072#comment-14515072 ] Vinod Kumar Vavilapalli commented on YARN-3411:
---

I think the bigger question is this: if the Phoenix-based storage impl needs HBase 0.98 and the native HBase impl needs 1.0, how can they both be used by the yarn-timeline-service module / reside in the same JVM space?

Can we make both the implementations use hBase-client from 0.98 irrespective of what the server uses?
[jira] [Commented] (YARN-3411) [Storage implementation] explore the native HBase write schema for storage
[ https://issues.apache.org/jira/browse/YARN-3411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14514942#comment-14514942 ] Vrushali C commented on YARN-3411:
---

Hi Li,

I see. Hmm. There are some major changes between hbase 0.98 and hbase 1.0: the client-facing APIs (HTableInterface, etc.) have been deprecated and replaced with new interfaces (connection management moved to the new ConnectionFactory class; a table is now referred to only by TableName, not String or byte[]; etc.). This means we would need several code changes and upgrade steps to move from 0.98 to 1.0 in the future.

Also, I would like to mention that HBase 1.0 comes with a whole bunch of improvements, performance fixes (improved WAL pipeline using the disruptor, multi-WAL, more off-heap, etc.) and bug fixes, some of which I think would be very beneficial for ATS v2. For instance:
- Per-cell TTLs can be set.
- Better support for the HBase Cell interface internally in the read and write paths, for better performance and flexibility.
- It now has the coprocessor functionality to make endpoint calls against a region server, which would be very helpful with aggregations.
- A Dockerfile to easily build and run HBase from source (which would be helpful during deploy and setup for users).
- A feature wherein a region can be hosted by multiple region servers in read-only mode. One of the replicas for the region will be primary, accepting writes, and the other replicas will share the same data files. Read requests can be served by any replica for the region, with backup RPCs for high availability with timeline consistency guarantees. This should help us significantly on the reader side.

I think for the writer performance testing we can have Phoenix on 0.98 and this native approach on 1.0, but that means we need two hbase clusters, one on 0.98 and one on 1.0. What do you say?
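To make the API change concrete, here is a minimal sketch of the two client styles (table name and configuration are illustrative; this requires an hbase-client dependency and a running cluster, so it is shown for comparison only, not as runnable test code):

```java
// Illustration of the client API change between HBase 0.98 and 1.0.
// Requires hbase-client on the classpath and a running HBase cluster.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Table;

public class ClientApiComparison {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();

    // 0.98 style: construct an HTable directly from a String table name
    // (deprecated in 1.0).
    @SuppressWarnings("deprecation")
    HTable oldStyle = new HTable(conf, "ats.entity");
    oldStyle.close();

    // 1.0 style: connection management lives in ConnectionFactory, and
    // tables are identified only by TableName.
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table newStyle = conn.getTable(TableName.valueOf("ats.entity"))) {
      // use newStyle for puts/gets ...
    }
  }
}
```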
[jira] [Commented] (YARN-3411) [Storage implementation] explore the native HBase write schema for storage
[ https://issues.apache.org/jira/browse/YARN-3411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14514872#comment-14514872 ] Sangjin Lee commented on YARN-3411:
---

[~vinodkv], that's what I understood as well. The remaining concern is that we need to pick versions carefully, lest HBase (or any other library in this situation) be forced onto an uncertified/incompatible version of hadoop. But it is true that that problem may exist no matter how we structure the code/projects.
[jira] [Commented] (YARN-3411) [Storage implementation] explore the native HBase write schema for storage
[ https://issues.apache.org/jira/browse/YARN-3411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14514591#comment-14514591 ] Li Lu commented on YARN-3411:
---

Hi [~vrushalic], one more quick question about the version numbers. The current Phoenix release only works with 0.98. Right now we're waiting for Phoenix 4.4 to support hbase 1.0, but that may take a while. So for now, will there be any significant problem if we use hbase 0.98 as the standard version? I know the unit tests may not run with 0.98 on trunk, but once the main logic works, that should not block the performance benchmark, I guess? Thanks!
[jira] [Commented] (YARN-3411) [Storage implementation] explore the native HBase write schema for storage
[ https://issues.apache.org/jira/browse/YARN-3411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14514554#comment-14514554 ] Vinod Kumar Vavilapalli commented on YARN-3411:
---

There is no single YARN artifact. hbase-client may depend on yarn-client, but yarn-timeline-service may depend on hbase-client. There is no cause for concern: yarn-timeline-service should depend on the latest stable hbase-client that we will test with and support.
[jira] [Commented] (YARN-3411) [Storage implementation] explore the native HBase write schema for storage
[ https://issues.apache.org/jira/browse/YARN-3411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14514466#comment-14514466 ] Sangjin Lee commented on YARN-3411:
---

bq. In detail thinking, HBase relies on Hadoop/HDFS only (not YARN), so it should be fine for YARN to rely on HBase component especially it is downloading jar rather than build from source?

OK, I got clearer about this. hbase-client does depend on YARN indirectly, as it depends on hadoop-mapreduce-client-core. But timelineservice is high enough in the YARN project dependency hierarchy that they don't form a cycle. I think this specific situation might be OK, so we can move forward. As a to-do item, though, I think we want to think about the code structure for adding library dependencies that themselves depend on hadoop, as things like versions can become issues. Are there any precedents?

bq. Do we have solid case for non-numeric metrics so far? Boolean case should be fine as we can represent true and false with 1 and 0.

I think we should stick with numbers (perhaps java.lang.Number as the base class for the value types). I'm not even sure how boolean "aggregation" would work, so we may just decide not to support it.
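A minimal sketch of this idea (class and method names are hypothetical, not from the patch): metric values are normalized through java.lang.Number, with booleans mapped to 1/0 as suggested above, and everything else rejected.

```java
// Illustrative only: restrict metric values to java.lang.Number so the
// storage layer only ever serializes numbers; booleans are still accepted
// by mapping true/false to 1/0.
public class MetricValues {
  public static Number normalize(Object value) {
    if (value instanceof Number) {
      return (Number) value;
    }
    if (value instanceof Boolean) {
      // represent true/false as 1/0
      return ((Boolean) value) ? 1L : 0L;
    }
    throw new IllegalArgumentException(
        "unsupported metric value type: " + value.getClass().getName());
  }
}
```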
[jira] [Commented] (YARN-3411) [Storage implementation] explore the native HBase write schema for storage
[ https://issues.apache.org/jira/browse/YARN-3411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14513989#comment-14513989 ] Junping Du commented on YARN-3411:
---

Just 2 cents on the comments above:

bq. By timelineservice depending on HBase, we now have a circular dependency between hadoop and HBase. Evidently it builds (which is a bit surprising), but it creates an interesting situation. I'm wondering how we should handle this. I suppose the same issue exists with Phoenix.

Thinking about it in detail, HBase relies on Hadoop/HDFS only (not YARN), so it should be fine for YARN to rely on an HBase component, especially as it downloads a jar rather than building from source. From the perspective of upstream/downstream project relationships, I do agree it brings extra complexity between the Hadoop and HBase projects in syncing releases. However, I think we expected this and already decided to take that pain when we decided to move to HBase. Didn't we? Though I don't remember seeing public discussion of this concern before.

bq. For metrics, shall we be more generalized to support all kinds of numeric value, boolean and so on?

Do we have a solid case for non-numeric metrics so far? The boolean case should be fine, as we can represent true and false with 1 and 0.
[jira] [Commented] (YARN-3411) [Storage implementation] explore the native HBase write schema for storage
[ https://issues.apache.org/jira/browse/YARN-3411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14512098#comment-14512098 ] Zhijie Shen commented on YARN-3411:
---

[~vrushalic], I've commented on [YARN-3134|https://issues.apache.org/jira/browse/YARN-3134?focusedCommentId=14512080&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14512080] about the question of config/info/metric values. I scanned through the hbase implementation. It seems the config value is treated as a string, info is stored directly in bytes, and metrics are treated as Long. I generally agree on config/info, and I think it should be fine if config is assumed to be a string (maybe we need to adjust the data model), but let's see the community's opinion. For metrics, shall we be more general and support all kinds of numeric values, booleans, and so on?
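As a sketch of the per-column typing conventions described above (helper names are hypothetical; the actual patch uses HBase's Bytes utilities): config values as UTF-8 strings, info values as raw bytes, metrics as 8-byte longs.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Illustrative sketch of the typing conventions under discussion:
// config values as strings, info values as raw bytes, metrics as longs.
public class ColumnValues {
  public static byte[] configToBytes(String value) {
    return value.getBytes(StandardCharsets.UTF_8);   // config: string
  }

  public static byte[] infoToBytes(byte[] value) {
    return value;                                    // info: stored as-is
  }

  public static byte[] metricToBytes(long value) {
    // metric: serialized as an 8-byte big-endian long
    return ByteBuffer.allocate(Long.BYTES).putLong(value).array();
  }

  public static long metricFromBytes(byte[] bytes) {
    return ByteBuffer.wrap(bytes).getLong();
  }
}
```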
[jira] [Commented] (YARN-3411) [Storage implementation] explore the native HBase write schema for storage
[ https://issues.apache.org/jira/browse/YARN-3411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14511470#comment-14511470 ] Sangjin Lee commented on YARN-3411:
---

Sorry [~vrushalic] for my late comments. I took a quick pass over the POC patch, and as with the Phoenix one I haven't fully delved into the schema-related code, but here are some initial quick comments.

- I'm sure you're aware, but please don't forget to add the license to the new files later.

(pom.xml)
- l.77: So do we need to depend on hbase-server? That's a bit unexpected?
- Come to think of it, this is a bit interesting. By timelineservice depending on HBase, we now have a circular dependency between hadoop and HBase. Evidently it builds (which is a bit surprising), but it creates an interesting situation. I'm wondering how we should handle this. I suppose the same issue exists with Phoenix.

(EntityTableDetails.java)
- l.7: I think we're now moving away from the acronym "ats". Perhaps we should simply use "timeline.entity"?

(HBaseTimelineWriterImpl.java)
- l.40: Just curious, does this mean a single writer instance has only one HBase client connection? Should we be able to have multiple connections? What are your thoughts on this?
- Also, the initialization operations inside the constructor should probably belong in serviceInit() or serviceStart()?
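The last review point can be sketched roughly as follows. This is a simplified, hypothetical illustration: `SimpleService` stands in for YARN's actual org.apache.hadoop.service.AbstractService, and `Connection` stands in for an HBase client connection. The point is just that the constructor stays cheap and the heavyweight setup moves into the service lifecycle.

```java
// Hypothetical stand-in for org.apache.hadoop.service.AbstractService:
// heavyweight setup belongs in serviceInit()/serviceStart(), not the
// constructor.
abstract class SimpleService {
  private boolean inited;
  public final void init() { serviceInit(); inited = true; }
  public boolean isInited() { return inited; }
  protected abstract void serviceInit();
}

class HBaseTimelineWriterSketch extends SimpleService {
  static class Connection { }          // stand-in for an HBase Connection

  private Connection conn;

  HBaseTimelineWriterSketch() {
    // constructor stays cheap: no I/O, no connections opened here
  }

  @Override
  protected void serviceInit() {
    conn = new Connection();           // open the HBase connection here instead
  }

  boolean isConnected() { return conn != null; }
}
```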
[jira] [Commented] (YARN-3411) [Storage implementation] explore the native HBase write schema for storage
[ https://issues.apache.org/jira/browse/YARN-3411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14511334#comment-14511334 ] Vrushali C commented on YARN-3411:
---

Hi [~gtCarrera9] and [~djp],

Thanks for the comments. I will reply to these shortly, but wanted to quickly respond to:

bq. Can we turn this into java code? As we haven't add any ruby code before, it could bring extra complexity/dependency on ruby.

Sure, I can change it to Java, but I don't think there should be any problem running Ruby code with HBase as such; it does not add to the dependencies. In fact, the HBase shell, which is a command-line interpreter for HBase, is written in Ruby.
[jira] [Commented] (YARN-3411) [Storage implementation] explore the native HBase write schema for storage
[ https://issues.apache.org/jira/browse/YARN-3411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14511323#comment-14511323 ] Junping Du commented on YARN-3411:
---

Hi [~vrushalic], thanks for updating the patch! Just a quick pass through the latest patch: it sounds like we have Ruby code to create the schema:
{code}
+create 'ats.entity',
+  {NAME => 'i', COMPRESSION => 'LZO', BLOOMFILTER => 'ROWCOL'},
+  {NAME => 'm', VERSIONS => 2147483647, MIN_VERSIONS => 1, COMPRESSION => 'LZO', BLOCKCACHE => false, TTL => '2592000'},
+  {NAME => 'c', COMPRESSION => 'LZO', BLOCKCACHE => false, BLOOMFILTER => 'ROWCOL' }
{code}
Can we turn this into Java code? As we haven't added any Ruby code before, it could bring extra complexity/dependency on Ruby.

BTW, I ran into a problem building locally with the patch applied. I think the reason is that my local machine uses JDK 1.8, but the HBase version we are using here (1.0.0) depends on jdk.tools 1.7. I assume we will move to Java 8 in the short term (maybe in the 2.8 release cycle? need to confirm later). If so, we will need a solution for this problem.
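For reference, a Java equivalent of that shell script could look roughly like this, using the HBase 1.0 Admin API (a sketch, not the patch's actual code; it needs hbase-client on the classpath and a running cluster, so it is effectively a schema/config fragment):

```java
// Sketch: create the 'ats.entity' table with the same three column
// families as the Ruby shell script, via the HBase 1.0 Java client.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.io.compress.Compression;
import org.apache.hadoop.hbase.regionserver.BloomType;

public class EntitySchemaCreator {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      HTableDescriptor table =
          new HTableDescriptor(TableName.valueOf("ats.entity"));

      HColumnDescriptor info = new HColumnDescriptor("i");
      info.setCompressionType(Compression.Algorithm.LZO);
      info.setBloomFilterType(BloomType.ROWCOL);
      table.addFamily(info);

      HColumnDescriptor metrics = new HColumnDescriptor("m");
      metrics.setMaxVersions(Integer.MAX_VALUE);  // VERSIONS => 2147483647
      metrics.setMinVersions(1);
      metrics.setCompressionType(Compression.Algorithm.LZO);
      metrics.setBlockCacheEnabled(false);
      metrics.setTimeToLive(2592000);             // TTL in seconds (30 days)
      table.addFamily(metrics);

      HColumnDescriptor configs = new HColumnDescriptor("c");
      configs.setCompressionType(Compression.Algorithm.LZO);
      configs.setBlockCacheEnabled(false);
      configs.setBloomFilterType(BloomType.ROWCOL);
      table.addFamily(configs);

      admin.createTable(table);
    }
  }
}
```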
[jira] [Commented] (YARN-3411) [Storage implementation] explore the native HBase write schema for storage
[ https://issues.apache.org/jira/browse/YARN-3411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509961#comment-14509961 ] Li Lu commented on YARN-3411:
---

Oh, one thing to add: in the added pom file, maybe we can centralize the version of hbase (the Phoenix patch also has this problem)? That would make version management slightly easier. Maybe we can address this together with the Phoenix one in YARN-3529?
[jira] [Commented] (YARN-3411) [Storage implementation] explore the native HBase write schema for storage
[ https://issues.apache.org/jira/browse/YARN-3411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509951#comment-14509951 ] Li Lu commented on YARN-3411:
---

Hi [~vrushalic], thanks for the patch! I'm OK with the major part of this patch for now. Here, I'm listing some questions we can discuss.
# About null checks: so far we do not have a fixed standard on if and where we need to do null checks. I noticed you assumed info, config, event, and other similar fields are not null. Maybe we'd like to explicitly decide when all those fields can be null or empty.
# Maybe we'd like to change TimelineWriterUtils to the default access modifier? I think it would be sufficient to make it visible within the package.
# One thing I'd like to open for discussion is deciding the way to store and process metrics. Currently, in the hbase patch, startTime and endTime are not used. In the Phoenix patch, I store time series as flattened, non-queryable strings. I think this part also requires some input from the time-based aggregation work.
# Another thing I'd like to discuss is if and how we'd like to set up a separate "fast path" for metric-only updates. On the storage layer, I'd strongly +1 a separate fast path, so that we only touch the (frequently updated) metrics table. Any proposals, everyone?
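To illustrate what "flattened, non-queryable strings" means in point 3 (the exact encoding is hypothetical, not necessarily what the Phoenix patch emits): the whole time series collapses into one string column value, so SQL can store it but cannot filter or aggregate on individual points.

```java
import java.util.Map;
import java.util.StringJoiner;
import java.util.TreeMap;

// Hypothetical illustration of a flattened time series: all
// (timestamp, value) points are collapsed into a single string.
class TimeSeriesFlattener {
  static String flatten(TreeMap<Long, Long> series) {
    StringJoiner joiner = new StringJoiner(";");
    for (Map.Entry<Long, Long> point : series.entrySet()) {
      joiner.add(point.getKey() + "=" + point.getValue());
    }
    return joiner.toString();
  }
}
```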
[jira] [Commented] (YARN-3411) [Storage implementation] explore the native HBase write schema for storage
[ https://issues.apache.org/jira/browse/YARN-3411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14507504#comment-14507504 ] Junping Du commented on YARN-3411:
---

Thanks [~vrushalic] for the reply!

bq. But I will be uploading a refined patch + some more changes like Metric writing soon.

+1. The plan sounds good to me.
[jira] [Commented] (YARN-3411) [Storage implementation] explore the native HBase write schema for storage
[ https://issues.apache.org/jira/browse/YARN-3411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14498366#comment-14498366 ] Vrushali C commented on YARN-3411: -- Thanks [~djp]!
bq. Just quickly go through the poc patch which is good but only have EntityTable so far. Do we have plan to split other tables to other JIRAs?
Yes, we can file JIRAs for the other tables as we add in those functionalities. Right now the PoC is focused only on entity writes, hence this patch contains only the code related to that table.
bq. Some quick comments on poc patch is we should reuse many operations here like split() or join() in other classes, so better to create a utility class with putting common methods to share.
Absolutely agreed; I am refining the patch. With hRaven we have a bunch of such utility classes. I was trying to see how many I could pull in, since it's not yet confirmed that this approach is the way to go, and I did not want to mix in too much code. But I will be uploading a refined patch + some more changes like Metric writing soon.
[jira] [Commented] (YARN-3411) [Storage implementation] explore the native HBase write schema for storage
[ https://issues.apache.org/jira/browse/YARN-3411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14498278#comment-14498278 ] Junping Du commented on YARN-3411: -- Just took a quick pass through the PoC patch, which looks good but only has the EntityTable so far. Do we plan to split the other tables into separate JIRAs? I would support that, because mid-size patches (not too large, not too small) can make the development/review iteration move faster. One quick comment on the PoC patch: many operations here, like split() or join(), will be reused in other classes, so it would be better to create a utility class holding the common methods to share.
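The shared utility class discussed in the two comments above might look something like the following sketch. This is an illustration only: the actual TimelineWriterUtils would be Java code in the timeline writer, and the `"!"` separator and function names here are assumptions, not taken from the patch.

```python
# Illustrative sketch of a shared row-key utility holding the common
# split()/join() operations (the real TimelineWriterUtils is Java; the
# "!" separator is an assumption for this example).

SEPARATOR = "!"

def join_components(*components):
    """Join row-key components into a single composite key."""
    return SEPARATOR.join(str(c) for c in components)

def split_components(row_key):
    """Split a composite row key back into its components."""
    return row_key.split(SEPARATOR)

key = join_components("user1", "cluster1", "flow_run_1", "application_1")
# Every table's reader and writer shares these two helpers instead of
# re-implementing the encoding per class.
assert split_components(key) == ["user1", "cluster1", "flow_run_1", "application_1"]
```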
[jira] [Commented] (YARN-3411) [Storage implementation] explore the native HBase write schema for storage
[ https://issues.apache.org/jira/browse/YARN-3411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496538#comment-14496538 ] Junping Du commented on YARN-3411: -- Thanks [~vrushalic] for delivering the proposal and PoC patch; excellent job! Some quick comments from walking through the proposal:
bq. Entity Table - primary key components-putting the UserID first helps to distribute writes across the regions in the hbase cluster. Pros: avoids single region hotspotting. Cons: connections would be open to several region servers during writes from per node ATS.
It looks like we are trying to get rid of region-server hotspotting issues, and I agree this design could help. However, it is still possible that one specific user submits many more applications than anyone else, and in that case the region hotspot issue would still appear, wouldn't it? I think the more general way to solve this problem is salting keys with a prefix. Thoughts?
bq. Entity Table - column families-config needs to be stored as key value, not as a blob to enable efficient key based querying based on config param name. storing it in a separate column family helps to avoid scanning over config while reading metrics and vice versa
+1. This leverages the strength of a columnar database. We should also avoid storing default values for keys, although that sounds challenging if the TimelineClient only has a Configuration object.
bq. Entity Table - metrics are written to with an hbase cell timestamp set to top of the minute or top of the 5 minute interval or whatever is decided. This helps in timeseries storage and retrieval in case of querying at the entity level.
Can we also let the TimelineCollector aggregate metrics over a similar time interval, rather than sending every metric to HBase/Phoenix as it is received? This may help ease some pressure on the backend.
bq. Flow by application id table
I still think we should figure out some way to store application-attempt info. The typical use case here: for some reason (a bug, or a hardware-capability issue), some flow/application's AM could fail more often than those of other flows/applications. Keeping this info would help us track down such issues, wouldn't it?
bq. flow summary daily table (aggregation table managed by Phoenix) - could be triggered via coprocessor with each put in flow table or a cron run once per day to aggregate for yesterday (with catchup functionality in case of backlog etc)
Triggering on each put in the flow table sounds a little expensive, especially when puts are very frequent; maybe we should use some batch mode here? In addition, I think we can leverage the per-node TimelineCollector to do some first-level aggregation, which would help relieve the workload on the backend.
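Two of the ideas discussed above, salting row keys with a prefix to spread one heavy user's writes across regions, and setting metric cell timestamps to the top of a 5-minute interval, can be sketched as follows. The bucket count, separator, hash choice, and function names are all assumptions for illustration, not details from the proposal.

```python
import zlib

# Illustrative sketch (names and constants assumed, not from the proposal):
# (1) salt a row key with a hash-derived bucket prefix so that a single
#     user's writes spread across NUM_SALT_BUCKETS regions;
# (2) round a metric cell timestamp down to the top of a 5-minute interval.

NUM_SALT_BUCKETS = 16  # assumed bucket count

def salted_row_key(user_id, rest_of_key):
    # A stable hash (crc32 here) is required so readers can recompute the
    # prefix; a range scan then fans out over all NUM_SALT_BUCKETS prefixes.
    bucket = zlib.crc32(user_id.encode("utf-8")) % NUM_SALT_BUCKETS
    return f"{bucket}!{user_id}!{rest_of_key}"

FIVE_MIN_MS = 5 * 60 * 1000

def round_to_interval(ts_millis, interval_millis=FIVE_MIN_MS):
    """Round a timestamp down to the top of the interval, as proposed for
    metric cell timestamps."""
    return ts_millis - (ts_millis % interval_millis)

assert round_to_interval(301_000) == 300_000  # 5m01s rounds down to 5m00s
```

The trade-off salting introduces is exactly the "cons" noted in the quoted proposal text: writes spread out, but a read for one user must now consult every salt bucket.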
[jira] [Commented] (YARN-3411) [Storage implementation] explore the native HBase write schema for storage
[ https://issues.apache.org/jira/browse/YARN-3411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14384531#comment-14384531 ] Li Lu commented on YARN-3411: - Hi [~vrushalic], thanks for working on this! It would be good for us to have both HBase and Phoenix storage implementations for comparison. Just keeping a record here that I think we can do the evaluation before we move on to implementing the aggregations; that way we may save duplicated effort in designing and implementing the aggregations.