Re: Google Summer Of Code 2016
Cool. Congrats.

Enis
Re: Google Summer Of Code 2016
Hi all,

I am really thankful to the Apache HBase community for sharing your ideas and accepting my GSoC 2016 proposal. Special thanks to Enis for sharing a good idea and to Stack for volunteering to mentor my project.

I am really excited to work with you :)

Talat
Re: Google Summer Of Code 2016
On Fri, Apr 22, 2016 at 12:07 PM, Stack wrote:
> Congrats Talat. You are our GSoC. We'll try and be nice (smile).

Congrats. That's awesome!
Re: Google Summer Of Code 2016
Congrats Talat. You are our GSoC. We'll try and be nice (smile).

St.Ack
Re: Google Summer Of Code 2016
Thanks Talat... I shoved some comments up in it but it looks basically sound. Thanks for sending it in.

St.Ack
Re: Google Summer Of Code 2016
Hi all,

I created my GSoC proposal for Block Encoding and Compression for the RPC Layer [1]. I would appreciate it if you could review it and share your comments. The JIRA issue is at [2].

[1] https://docs.google.com/document/d/10MEsmGN5UCh6m-de_nhIG5QYnDRTkmwBTLQ0CRmwOMk/edit?usp=sharing
[2] https://issues.apache.org/jira/browse/HBASE-15530

Thanks
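[Editor's aside: the proposal's subject, block encodings such as PREFIX that today apply only in the HFile context, comes down to exploiting the long shared prefixes between adjacent sorted keys. A minimal sketch of the PREFIX idea follows; it is illustrative only, not HBase's actual DataBlockEncoding wire format, and the key strings are made up.]

```python
def common_prefix_len(a: bytes, b: bytes) -> int:
    """Length of the longest common prefix of two byte strings."""
    n = min(len(a), len(b))
    for i in range(n):
        if a[i] != b[i]:
            return i
    return n

def prefix_encode(sorted_keys):
    """Encode each key as (shared-prefix length, remaining suffix).

    Because the input is sorted, consecutive keys tend to share long
    prefixes, so most entries carry only a short suffix on the wire.
    """
    out, prev = [], b""
    for key in sorted_keys:
        shared = common_prefix_len(prev, key)
        out.append((shared, key[shared:]))
        prev = key
    return out

def prefix_decode(encoded):
    """Rebuild the original keys from the (shared, suffix) pairs."""
    keys, prev = [], b""
    for shared, suffix in encoded:
        key = prev[:shared] + suffix
        keys.append(key)
        prev = key
    return keys
```

For keys of the row/family:qualifier shape, `prefix_encode([b"row1/cf:a", b"row1/cf:b"])` stores the second key as just `(8, b"b")`, which is the size win the proposal wants to extend from HFiles to the RPC and WAL paths.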
Re: Google Summer Of Code 2016
Hi,

I would appreciate you being my mentor, Stack :) As far as I know, the ASF already participates, so you can still sign up [1] -- last year I was a mentor. I just sent an email to private@ and [email protected]. Would you like to check it?

[1] https://community.apache.org/gsoc.html#prospective-asf-mentors-read-this
Re: Google Summer Of Code 2016
I think you guys missed enrolling as mentors. From my experience last year, Goog is very strict about their deadlines, but you'd need to ask over on the Apache Mentors list.

On Tuesday, March 22, 2016, Enis Söztutar wrote:
> I did not check the deadline; if that is the case, does it mean this year
> is over?
Re: Google Summer Of Code 2016
> I didn't sign up for GSOC Talat. Not sure anyone else did either. Is it
> too late for us to participate now?

ASF participates in GSOC, so HBase automatically can participate AFAIK.

> I'd mentor you (it'd be easy-peasy -- smile) but I think I've missed the
> mentor signup deadline.

I did not check the deadline; if that is the case, does it mean this year is
over?

Your list is pretty good. We can PoC with Cap'n Proto as well as gRPC.
Re: Google Summer Of Code 2016
On Tue, Mar 22, 2016 at 3:04 AM, Talat Uyarer wrote:
> Hi All,
>
> I am Talat UYARER. I am a PMC member and committer at Nutch and Gora. I
> have a few contributions to HBase and want to work on HBase in GSoC 2016.
> As far as I know, you haven't selected any issue for GSoC.

I didn't sign up for GSOC Talat. Not sure anyone else did either. Is it too
late for us to participate now?

> I am wondering, is there anybody who can be a mentor for GSoC in HBase?

I'd mentor you (it'd be easy-peasy -- smile) but I think I've missed the
mentor signup deadline.

> BTW I talked with Enis Soztutar. He offered some topics for GSoC. These
> are:
> - He mentioned that data blocks are stored with PREFIX, FAST_DIFF, etc.
> encodings, but these encodings can only be used in the HFile context. In
> RPC and WAL we use KeyValueEncoding for Cell blocks. He said "You can
> improve them, or use the HFile encodings in RPC and WAL." (He didn't give
> the issue number, but I guessed it is HBASE-12883, Support block encoding
> based on knowing the set of column qualifiers up front.)

Sounds like a fine project (Someone was just asking about this offline...)

> - HBASE-14379 Replication V2
> - HBASE-8691 High-Throughput Streaming Scan API
> - HBASE-3529 Native Solr Indexer for HBase (He just mentioned HBase ->
> SOLR indexing. I guess it could be this issue.)
>
> Could you help me select a topic, or could you suggest another issue?

All of the above are good.

Here's a few others made for another context:

+ Become a Jepsen distributed systems test tool expert: run it against HBase
and HDFS. Analyze results. E.g. see
https://www.datastax.com/dev/blog/testing-apache-cassandra-with-jepsen
+ Deep dive on HBase compactions. Own it. Review current options: the
defaults, the experimental, and the stale. Build tooling and surface metrics
that give better insight into the effectiveness of compaction mechanics and
policies. Develop tunings and alternate, new policies. For further credit,
develop a master-orchestrated compaction algorithm.
+ Reimplement HBase append and increment as write-only with rollup on read,
or using CRDTs
(https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type).
+ Make the HBase server async/event-driven/SEDA, moving it off its current
thread-per-request basis.
+ UI: build out more pages and tabs on the HBase master exposing more of our
cluster metrics (make the master into a metrics sink). Extra points for
views, histograms, or dashboards that are both informative AND pretty (D3,
etc.). A good benchmark would be subsuming the Hannibal tool:
https://github.com/sentric/hannibal
+ Build an example application on HBase for test and illustration: e.g. use
Jimmy Lin's/The Internet Archive's https://github.com/lintool/warcbase to
load Common Crawl regular webcrawls (https://commoncrawl.org/), or load
HBase with Wikipedia, the Flickr dataset, or any dataset that appeals. Extra
credit for documenting the steps involved and filing issues where the API is
awkward or hard to follow.
+ Add actionable statistics to HBase internals that capture vitals about the
data being served and that we exploit when responding to queries; e.g. rough
sizes of rows, column families, columns-per-row-per-region, etc. For
example, if a client has been stepping sequentially through the data, the
stats would allow us to recognize this state so we could switch to a
different scan type, one that is optimal for a sequential progression.
+ Review and redo our fundamental merge sort, the basis of our read. There
are a few techniques to try, such as a "loser tree merge"
(http://sandbox.mc.edu/~bennet/cs402/lec/losedex.html), but ideally we'd
make our merge sort block-based rather than Cell-based. Set yourself up in a
rig and try different Cell formats to get yourself to a cache-friendly Cell
format that maximizes instructions per cycle.
+ Our client is heavyweight and has accumulated lots of logic over time.
E.g. it is hard to set a single timeout for a request because the client is
layered, each layer with its own running timeouts. At its core is a
mostly-done async engine. Review, and finish the async work. Rewrite where
it makes sense after analysis.
+ Our RPC is based on protobuf Service, where we plugged in our own RPC
transport. An exploratory PoC putting HBase up on gRPC was done by the gRPC
team. Bring this project home. Extra points if you reveal a streaming
interface between client and server.
+ Tiering... if regions are cold, close them so they don't occupy resources
(close files, purge their data from cache...); reopen when a request comes
in.
+ Dynamic configuration of running HBase.

St.Ack

> Thanks
> --
> Talat UYARER
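[Editor's aside: the CRDT item above, increments as write-only with rollup on read, is the grow-only counter (G-Counter) pattern. A toy sketch, not tied to any HBase API; the replica ids are invented for illustration.]

```python
class GCounter:
    """Grow-only counter CRDT: one slot per replica; merge takes per-slot max."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.slots = {}  # replica id -> count contributed by that replica

    def increment(self, n=1):
        # Write-only: touch just this replica's own slot, no coordination.
        self.slots[self.replica_id] = self.slots.get(self.replica_id, 0) + n

    def value(self):
        # "Rollup on read": the counter's value is the sum over all slots.
        return sum(self.slots.values())

    def merge(self, other):
        # Merge is commutative, associative, and idempotent (per-slot max),
        # so replicas can exchange state in any order, any number of times.
        for rid, count in other.slots.items():
            self.slots[rid] = max(self.slots.get(rid, 0), count)
```

The point for HBase increments is that writers never read-modify-write a shared cell; conflicting replicas converge simply by merging.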
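[Editor's aside: the merge-sort item above is about combining one sorted stream per store file on every read. The k-way merge itself can be sketched with a binary heap, as below; the loser tree or the block-based variant Stack describes would replace the heap, cutting the comparisons done per emitted element. This sketch uses Python's heapq and plain lists standing in for scanners.]

```python
import heapq

def kway_merge(streams):
    """Merge k sorted iterables into one sorted iterator.

    Each heap entry is (head, stream_index); after yielding a head we pull
    the next element from the same stream, mirroring how a scanner heap
    advances whichever store-file scanner produced the smallest Cell.
    """
    iters = [iter(s) for s in streams]
    heap = []
    for i, it in enumerate(iters):
        head = next(it, None)
        if head is not None:
            heap.append((head, i))
    heapq.heapify(heap)
    while heap:
        head, i = heapq.heappop(heap)
        yield head
        nxt = next(iters[i], None)
        if nxt is not None:
            heapq.heappush(heap, (nxt, i))
```

With a heap, every emitted element costs O(log k) comparisons against arbitrary streams; a loser tree fixes the comparison path per stream, which is the refinement the project would explore.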
