Re: Google Summer Of Code 2016

2016-04-22 Thread Enis Söztutar
Cool. Congrats.

Enis



Re: Google Summer Of Code 2016

2016-04-22 Thread Talat Uyarer
Hi all,
I am really thankful to the Apache HBase community for sharing your ideas and
accepting my GSoC 2016 proposal. Special thanks to Enis for sharing a good
idea and to Stack for volunteering to mentor my project.

I am really excited to work with you :)

Talat


Re: Google Summer Of Code 2016

2016-04-22 Thread Elliott Clark

Congrats. That's awesome!


Re: Google Summer Of Code 2016

2016-04-22 Thread Stack
Congrats Talat. You are our GSoC. We'll try and be nice (smile).
St.Ack


Re: Google Summer Of Code 2016

2016-03-25 Thread Stack
Thanks Talat... I shoved some comments up in it but looks basically sound.
Thanks for sending it in.
St.Ack


Re: Google Summer Of Code 2016

2016-03-25 Thread Talat Uyarer
Hi all,

I created my GSoC proposal for Block Encoding and Compression for the RPC
Layer [1]. I would appreciate it if you could review it and share your comments.

[1] 
https://docs.google.com/document/d/10MEsmGN5UCh6m-de_nhIG5QYnDRTkmwBTLQ0CRmwOMk/edit?usp=sharing
[2] https://issues.apache.org/jira/browse/HBASE-15530

Thanks


Re: Google Summer Of Code 2016

2016-03-22 Thread Talat Uyarer
Hi,

I would appreciate you being my mentor, Stack :) As far as I know, the ASF
already participates and you can sign up [1]. Last year I was a mentor; I just
sent an email to private and [email protected]. Would you
like to check it?

[1] https://community.apache.org/gsoc.html#prospective-asf-mentors-read-this


Re: Google Summer Of Code 2016

2016-03-22 Thread Nick Dimiduk
I think you guys missed enrolling as mentors. From my experience last year,
Goog is very strict about their deadlines, but you'd need to ask over on
the Apache Mentors list.


Re: Google Summer Of Code 2016

2016-03-22 Thread Enis Söztutar
>
> I didn't sign up for GSOC Talat. Not sure anyone else did either. Is it too
> late for us to participate now?
>
>
ASF participates in GSOC, so HBase automatically can participate AFAIK.


> I'd mentor you (it'd be easy-peasy -- smile) but I think I've missed the
> mentor signup deadline.
>

I did not check the deadline; if that is the case, does it mean this year is
over?

Your list is pretty good. We can PoC with Cap'n Proto as well as gRPC.


>
>
> > BTW I talked with Enis Soztutar. He offered some topics for GSoC. These
> > are:
> > - He mentioned that data blocks are stored with PREFIX, FAST_DIFF, etc.
> > encodings, but these encodings can only be used in the HFile context. In RPC
> > and WAL we use KeyValueEncoding for cell blocks. He said, "You can
> > improve them or use HFile encodings in RPC and WAL" (he didn't say
> > the issue number, but I guessed it is HBASE-12883, Support block
> > encoding based on knowing set of column qualifiers up front)
> >
>
> Sounds like a fine project (Someone was just asking about this offline...)
>
>
>
> > - HBASE-14379 Replication V2
> > - HBASE-8691 High-Throughput Streaming Scan API
> > - HBASE-3529 Native Solr Indexer for HBase (he just mentioned HBase ->
> > Solr indexing. I guess it could be this issue.)
> >
> > Could you help me select a topic, or could you suggest another issue?
> >
> >
> All above are good.
>
> Here's a few others made for another context:
>
> + Become Jepsen distributed systems test tool expert: run it against HBase
> and HDFS. Analyze results. E.g. see
> https://www.datastax.com/dev/blog/testing-apache-cassandra-with-jepsen
> + Deep dive on hbase Compactions. Own it. Review current options both the
> defaults, experimental, and the stale. Build tooling and surface metrics
> that give better insight on effectiveness of compaction mechanics and
> policies. Develop tunings and alternate, new policies. For further credit,
> develop master-orchestrated compaction algorithm.
> + Reimplement HBase append and increment as write-only with rollup on read
> or using CRDTs (
> https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type)
> + Make the HBase Server async/event driven/SEDA moving it off its current
> thread-per-request basis
> + UI: build out more pages and tabs on the HBase master exposing more of
> our cluster metrics (make the master into a metrics sink). Extra points for
> views, histograms, or dashboards that are both informative AND pretty (D3,
> etc.). A good benchmark would be subsuming the Hannibal tool
> https://github.com/sentric/hannibal
> + Build an example application on HBase for test and illustration: e.g. use
> Jimmy Lin's/The Internet Archive https://github.com/lintool/warcbase to
> load common crawl regular webcrawls https://commoncrawl.org/ or, load
> hbase
> with wikipedia, the flickr dataset, or any dataset that appeals. Extra
> credit for documenting steps involved and filing issues where API is
> awkward or hard to follow.
> + Add actionable statistics to hbase internals that capture vitals about
> the data being served and that we exploit responding to queries; e.g. rough
> sizes of rows, column-families, columns-per-row-per-region, etc. For
> example, if a client has been stepping sequentially through the data, the
> stats would allow us to recognize this state so we could switch to a different
> scan type; one that is optimal to a sequential progression.
> + Review and redo our fundamental merge sort, the basis of our read. There
> are a few techniques to try such as a "loser tree merge" (
> http://sandbox.mc.edu/~bennet/cs402/lec/losedex.html) but ideally we'd
> make
> our merge sort block-based rather than Cell-based. Set yourself up in a rig
> and try different Cell formats to get yourself to a cache-friendly Cell
> format that maximizes instructions per cycle.
> + Our client is heavy-weight and has accumulated lots of logic over time.
> E.g. it is hard to set a single timeout for a request because client is
> layered each with its own running timeouts. At its core is a mostly-done
> async engine. Review, and finish the async work. Rewrite where it makes
> sense after analysis.
> + Our RPC is based on protobuf Service where we plugged in our own RPC
> transport. An exploratory PoC putting HBase up on grpc was done by the grpc
> team. Bring this project home. Extra points if you reveal a Streaming
> Interface between Client and Server.
> + Tiering... if regions are cold, close them so they don't occupy resources
> (close files, purge its data from cache...) reopen when a request comes
> in
> + Dynamic configuration of running HBase
>
>
> St.Ack
>
>
>
>
> > Thanks
> > --
> > Talat UYARER
> >
>

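As a rough illustration of the block-encoding idea discussed above (data blocks stored with PREFIX, FAST_DIFF, etc. encodings), here is a toy prefix/delta encoder for sorted keys. It sketches the general technique only; the function names and the (prefix length, suffix) layout are invented for illustration and do not match HBase's actual encoder formats.

```python
def prefix_encode(sorted_keys):
    """Encode each key as (shared_prefix_len, suffix) vs the previous key.

    Sorted keys in a block share long prefixes, so storing only the
    differing suffix shrinks the block. Toy sketch, not HBase code.
    """
    encoded, prev = [], b""
    for key in sorted_keys:
        common = 0
        limit = min(len(prev), len(key))
        while common < limit and prev[common] == key[common]:
            common += 1
        encoded.append((common, key[common:]))
        prev = key
    return encoded


def prefix_decode(encoded):
    """Invert prefix_encode by rebuilding each key from its predecessor."""
    keys, prev = [], b""
    for common, suffix in encoded:
        key = prev[:common] + suffix
        keys.append(key)
        prev = key
    return keys


keys = [b"row1/cf:a", b"row1/cf:b", b"row2/cf:a"]
enc = prefix_encode(keys)
# Second key shares the 8-byte prefix b"row1/cf:", so only b"b" is stored.
assert prefix_decode(enc) == keys
```

The same round-trip property is what a real encoder must preserve; the win comes from suffixes being much shorter than full keys when rows and column families repeat.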
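The "reimplement append and increment using CRDTs" idea above can be sketched with the simplest CRDT, a grow-only counter: each replica increments only its own slot, reads roll the slots up, and merge takes the per-replica maximum, so concurrent increments never conflict. This is a generic, hypothetical sketch, not HBase code; replica ids like "rs1" are made up.

```python
class GCounter:
    """Grow-only counter (G-Counter) CRDT sketch."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}  # replica id -> that replica's local total

    def increment(self, delta=1):
        # A replica only ever writes its own slot.
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + delta

    def value(self):
        # Read is a rollup over every replica's slot.
        return sum(self.counts.values())

    def merge(self, other):
        # Per-slot max is commutative, associative, and idempotent,
        # so replicas can exchange state in any order, any number of times.
        for rid, v in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), v)


a, b = GCounter("rs1"), GCounter("rs2")
a.increment(3)
b.increment(4)
a.merge(b)
b.merge(a)
assert a.value() == 7 and b.value() == 7
```

This matches the "write-only with rollup on read" framing: increments are blind writes, and the expensive reconciliation happens at read or merge time.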
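The "fundamental merge sort, the basis of our read" mentioned above is a k-way merge of sorted streams (one per store file or memstore). A minimal heap-based version, standing in for the real scanner machinery, looks like this; it is illustrative only, and a loser-tree variant as suggested would reduce comparisons per step.

```python
import heapq


def merge_sorted_streams(*streams):
    """k-way merge of sorted iterables via a min-heap.

    Each stream stands in for one sorted source of cells. The stream
    index i breaks ties so equal values never force comparing iterators.
    """
    heap = []
    for i, s in enumerate(streams):
        it = iter(s)
        first = next(it, None)
        if first is not None:
            heapq.heappush(heap, (first, i, it))
    out = []
    while heap:
        val, i, it = heapq.heappop(heap)
        out.append(val)
        nxt = next(it, None)
        if nxt is not None:
            heapq.heappush(heap, (nxt, i, it))
    return out


assert merge_sorted_streams([1, 4, 7], [2, 5], [3, 6]) == [1, 2, 3, 4, 5, 6, 7]
```

The block-based variant the thread proposes would pull and compare runs of cells at a time instead of one element per heap operation, which is where the cache-friendliness gains come from.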

Re: Google Summer Of Code 2016

2016-03-22 Thread Stack
On Tue, Mar 22, 2016 at 3:04 AM, Talat Uyarer  wrote:

> Hi All,
>
> I am Talat UYARER. I am a PMC member and committer at Nutch and Gora. I
> have a few contributions to HBase and want to work on HBase in GSoC
> 2016. As far as I know, you haven't selected any issue for GSoC.
>
>
I didn't sign up for GSOC Talat. Not sure anyone else did either. Is it too
late for us to participate now?


> I am wondering, is there anybody who could be a mentor for GSoC in HBase?
>
>
I'd mentor you (it'd be easy-peasy -- smile) but I think I've missed the
mentor signup deadline.



> BTW I talked with Enis Soztutar. He offered some topics for GSoC. These
> are:
> - He mentioned that data blocks are stored with PREFIX, FAST_DIFF, etc.
> encodings, but these encodings can only be used in the HFile context. In
> RPC and WAL we use KeyValueEncoding for Cell Blocks. He said "You can
> improve them or use the HFile encodings in RPC and WAL" (He didn't say
> the issue number, but I guessed it is HBASE-12883, Support block
> encoding based on knowing the set of column qualifiers up front)
>

Sounds like a fine project (Someone was just asking about this offline...)
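For a feel of what prefix-style encoding buys on sorted keys, here is a toy sketch (class and method names are made up; the real HBase DataBlockEncoding implementations are considerably more involved): each key is stored as the number of leading characters it shares with the previous key, plus only the differing suffix.

```java
import java.util.ArrayList;
import java.util.List;

// Toy prefix encoder for sorted string keys: each entry stores how many
// leading characters it shares with the previous key, plus the remaining suffix.
public class PrefixEncodeSketch {

    // Encodes sorted keys as (sharedPrefixLen, suffix) pairs.
    static List<Object[]> encode(List<String> sortedKeys) {
        List<Object[]> out = new ArrayList<>();
        String prev = "";
        for (String key : sortedKeys) {
            int shared = 0;
            int max = Math.min(prev.length(), key.length());
            while (shared < max && prev.charAt(shared) == key.charAt(shared)) {
                shared++;
            }
            out.add(new Object[] { shared, key.substring(shared) });
            prev = key;
        }
        return out;
    }

    // Decodes back to the full key list by re-using the previous key's prefix.
    static List<String> decode(List<Object[]> encoded) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (Object[] e : encoded) {
            String key = prev.substring(0, (Integer) e[0]) + e[1];
            out.add(key);
            prev = key;
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> keys = List.of("row0001/cf:a", "row0001/cf:b", "row0002/cf:a");
        if (!decode(encode(keys)).equals(keys)) throw new AssertionError("round-trip failed");
        System.out.println("round-trip ok");
    }
}
```

Since adjacent HBase keys (row/family/qualifier) tend to share long prefixes, the stored suffixes stay short; applying the same idea to Cell blocks on the RPC and WAL paths is the gist of the topic above.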



> - HBASE-14379 Replication V2
> - HBASE-8691 High-Throughput Streaming Scan API
> - HBASE-3529 Native Solr Indexer for HBase (He just mentioned HBase ->
> SOLR indexing. I guess it could be this issue.)
>
> Could you help me select a topic, or could you suggest another issue?
>
>
All of the above are good.

Here are a few others, written up for another context:

+ Become an expert in Jepsen, the distributed systems test tool: run it against HBase
and HDFS. Analyze results. E.g. see
https://www.datastax.com/dev/blog/testing-apache-cassandra-with-jepsen
+ Deep dive on hbase Compactions. Own it. Review current options both the
defaults, experimental, and the stale. Build tooling and surface metrics
that give better insight on effectiveness of compaction mechanics and
policies. Develop tunings and alternate, new policies. For further credit,
develop master-orchestrated compaction algorithm.
+ Reimplement HBase append and increment as write-only with rollup on read
or using CRDTs (
https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type)
+ Make the HBase Server async/event-driven/SEDA, moving it off its current
thread-per-request basis
+ UI: build out more pages and tabs on the HBase master exposing more of
our cluster metrics (make the master into a metrics sink). Extra points for
views, histograms, or dashboards that are both informative AND pretty (D3,
etc.). A good benchmark would be subsuming the Hannibal tool
https://github.com/sentric/hannibal
+ Build an example application on HBase for test and illustration: e.g. use
Jimmy Lin's/The Internet Archive https://github.com/lintool/warcbase to
load common crawl regular webcrawls https://commoncrawl.org/ or, load hbase
with wikipedia, the flickr dataset, or any dataset that appeals. Extra
credit for documenting steps involved and filing issues where API is
awkward or hard to follow.
+ Add actionable statistics to HBase internals that capture vitals about
the data being served and that we exploit when responding to queries; e.g. rough
sizes of rows, column-families, columns-per-row-per-region, etc. For
example, if a client has been stepping sequentially through the data, the
stats would allow us to recognize this state so we could switch to a different
scan type; one that is optimal for a sequential progression.
+ Review and redo our fundamental merge sort, the basis of our read. There
are a few techniques to try such as a "loser tree merge" (
http://sandbox.mc.edu/~bennet/cs402/lec/losedex.html) but ideally we'd make
our merge sort block-based rather than Cell-based. Set yourself up in a rig
and try different Cell formats to get yourself to a cache-friendly Cell
format that maximizes instructions per cycle.
+ Our client is heavy-weight and has accumulated lots of logic over time.
E.g. it is hard to set a single timeout for a request because the client is
layered, each layer with its own running timeout. At its core is a mostly-done
async engine. Review, and finish the async work. Rewrite where it makes
sense after analysis.
+ Our RPC is based on protobuf Service where we plugged in our own RPC
transport. An exploratory PoC putting HBase up on grpc was done by the grpc
team. Bring this project home. Extra points if you reveal a Streaming
Interface between Client and Server.
+ Tiering... if regions are cold, close them so they don't occupy resources
(close files, purge their data from cache...); reopen them when a request
comes in
+ Dynamic configuration of running HBase
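On the CRDT idea above: a grow-only counter (G-Counter) is the textbook starting point for increments. This is a minimal sketch, not HBase code: each replica bumps only its own slot, and merge takes the per-replica max, which makes merging commutative, associative, and idempotent, so increments can land on any replica in any order.

```java
import java.util.HashMap;
import java.util.Map;

// G-Counter CRDT sketch: per-replica increment slots; merge = element-wise max.
public class GCounter {
    private final Map<String, Long> slots = new HashMap<>();

    // A replica only ever increments its own slot.
    void increment(String replicaId, long delta) {
        slots.merge(replicaId, delta, Long::sum);
    }

    // Total value is the sum over all replica slots.
    long value() {
        return slots.values().stream().mapToLong(Long::longValue).sum();
    }

    // Merging takes the max per replica, so repeated/out-of-order merges are safe.
    void merge(GCounter other) {
        other.slots.forEach((id, v) -> slots.merge(id, v, Math::max));
    }
}
```

Reimplementing increment this way trades read-time rollup work for conflict-free, order-independent writes.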

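On the merge-sort bullet above, here is a minimal k-way merge over sorted runs using a binary heap (a loser tree, as in the linked lecture notes, does the same job with fewer comparisons per popped element; this sketch just shows the shape of the problem):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// K-way merge of sorted runs with a PriorityQueue seeded by each run's head.
public class KWayMerge {
    private static class Entry {
        final int value;
        final Iterator<Integer> source;
        Entry(int value, Iterator<Integer> source) { this.value = value; this.source = source; }
    }

    static List<Integer> merge(List<List<Integer>> sortedRuns) {
        PriorityQueue<Entry> heap = new PriorityQueue<>(Comparator.comparingInt(e -> e.value));
        for (List<Integer> run : sortedRuns) {
            Iterator<Integer> it = run.iterator();
            if (it.hasNext()) heap.add(new Entry(it.next(), it)); // seed with each run's head
        }
        List<Integer> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            Entry top = heap.poll();                // smallest current head across all runs
            out.add(top.value);
            if (top.source.hasNext()) heap.add(new Entry(top.source.next(), top.source));
        }
        return out;
    }
}
```

The block-based variant suggested above would pull whole blocks instead of single values per heap operation, which is where the cache-friendliness would come from.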

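And on the actionable-statistics bullet, a toy heuristic for spotting a sequential access pattern (the threshold and the long-valued key stand-in are invented for illustration; real keys would be byte[] compared lexicographically): once the last N requested keys have been strictly increasing, a scanner could switch to a read-ahead/streaming mode.

```java
// Tracks whether recent requests look sequential; a scanner could use the
// signal to flip from point-lookup behavior to a streaming scan type.
public class SequentialDetector {
    private final int threshold;       // how many increasing requests before we flag
    private long prev = Long.MIN_VALUE;
    private int streak = 0;

    SequentialDetector(int threshold) { this.threshold = threshold; }

    // Returns true once `threshold` consecutive requests arrived in increasing order.
    boolean record(long key) {
        streak = (key > prev) ? streak + 1 : 0;
        prev = key;
        return streak >= threshold;
    }
}
```

A real implementation would keep such stats per region or per scanner and feed them into scan-type selection, as the bullet suggests.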
St.Ack




> Thanks
> --
> Talat UYARER
>