Re: Week 2 Report and A Question

2019-06-11 Thread Renato Marroquín Mogrovejo
Hey Sheriffo,

Cool to hear you are making progress! :) and great to see that we have
some numbers already! :)
Regarding optimization point (1), regardless of whether this was the
cause of the issue, Alfonso's suggestions are something we should
follow; many short-lived objects in Java may create a
performance problem sooner or later. Also, about your comment:

"Also, I may be wrong but the way I understand YCSB framework is, it
will execute an insert operation for each user object, so I thought it
was right to create a user object within the insert method."

As you pointed out, YCSB is about inserting the objects, and NOT about
creating them, so it doesn't matter if we reuse the objects, as long
as the values that we insert are actually correct. We don't want to
end up measuring object creation + GC. I think Alfonso's comment was
hinting in that direction (please feel free to correct me @Alfonso if
I am misunderstanding you) and I think his comments are spot
on.
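To make that concrete, here is a minimal, hypothetical Java sketch of the reuse-one-bean idea. The `User` class and `put` method below are stubs standing in for the generated Gora bean and `dataStore.put`; this is NOT the actual GoraBenchmarkClient code, just an illustration of the allocation pattern:

```java
import java.util.HashMap;
import java.util.Map;

public class Main {
    // Stub bean standing in for the generated Gora User class (illustrative only).
    static class User {
        String field0;
        void setField0(String v) { field0 = v; }
    }

    static int puts = 0;
    // Stub for dataStore.put(key, user); here it just counts calls.
    static void put(String key, User u) { puts++; }

    // One bean allocated once (as if in init()) and reused for every insert,
    // so the benchmark measures inserts rather than object creation + GC.
    static final User user = new User();

    static int insert(String key, Map<String, String> values) {
        user.setField0(values.get("field0")); // overwrite fields in place
        put(key, user);
        return 0; // SUCCESS
    }

    public static void main(String[] args) {
        Map<String, String> values = new HashMap<>();
        for (int i = 0; i < 5; i++) {
            values.put("field0", "value" + i);
            insert("user" + i, values);
        }
        System.out.println(puts); // five inserts, a single User allocation
    }
}
```

Whether the bean can safely be reused depends on when the backend serializes it (at put() time or at flush() time), so treat this only as a sketch of the idea.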
I have some other questions regarding the numbers you sent around:
- are you running YCSB for each data store with a warm JVM? or are these
numbers each from a clean, cold JVM? I suppose the latter, right?
- did you try setting gora.hbasestore.scanner.caching to a lower value?
- which is the command that you are using to run/start this code?
- did you try flushing the commits more regularly in:
https://github.com/sneceesay77/gora/blob/GORA-532/gora-benchmark/src/main/java/org/apache/gora/benchmark/GoraBenchmarkClient.java#L142
say, every 1000 elements, or something like that, instead
of once at the end of the 1M elements?
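A hedged sketch of that periodic-flush idea, with a stub in place of the real Gora DataStore (in GoraBenchmarkClient the call would be dataStore.flush(); the batch size of 1000 is just an example value):

```java
import java.util.ArrayList;
import java.util.List;

public class Main {
    static final int BATCH_SIZE = 1000; // example value, worth tuning

    // Minimal stub standing in for the Gora DataStore: counts buffered puts
    // and records how many were pending at each flush.
    static class StubDataStore {
        int pending = 0;
        List<Integer> flushSizes = new ArrayList<>();
        void put(String key, String value) { pending++; }
        void flush() { flushSizes.add(pending); pending = 0; }
    }

    public static void main(String[] args) {
        StubDataStore store = new StubDataStore();
        int total = 3500;
        for (int i = 1; i <= total; i++) {
            store.put("user" + i, "payload");
            if (i % BATCH_SIZE == 0) {
                store.flush(); // bound the memory held by buffered puts
            }
        }
        store.flush(); // flush the final partial batch
        System.out.println(store.flushSizes); // [1000, 1000, 1000, 500]
    }
}
```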

Thanks a lot for the report Sheriffo!


Best,

Renato M.

On Tue, Jun 11, 2019 at 16:12, Sheriffo Ceesay
() wrote:
>
> Hello,
>
> I have taken a proper look at the recommendations from @Alfonso and @Renato 
> and below are the outcomes.
>
> Failed Attempts
> 1. Optimisation: for the insert operation, to avoid the concatenation issue, 
> I took the quickest route by calling the methods directly without 
> reflection. Below are those calls. Note: I have moved all reusable code to 
> the init method.
>
>> public int insert(String table, String key, HashMap<String, ByteIterator> values) {
>>   try {
>>     user.setField0(values.get("field0").toString());
>>     user.setField1(values.get("field1").toString());
>>     user.setField2(values.get("field2").toString());
>>     user.setField3(values.get("field3").toString());
>>     user.setField4(values.get("field4").toString());
>>     user.setField5(values.get("field5").toString());
>>     user.setField6(values.get("field6").toString());
>>     user.setField7(values.get("field7").toString());
>>     user.setField8(values.get("field8").toString());
>>     user.setField9(values.get("field9").toString());
>>     dataStore.put(user.getUserId().toString(), user);
>>   } catch (Exception e) {
>>     return FAILED;
>>   }
>>   return SUCCESS;
>> }
>
>
> If the above had worked, I would have changed the code as suggested by 
> Alfonso. Also, I may be wrong, but the way I understand the YCSB framework is 
> that it will execute an insert operation for each user object, so I thought it 
> was right to create a user object within the insert method.
>
>
> 2. I used different config values for -Xmx (256MB, 512MB, 1GB, 2GB) and even 
> disabled the GC overhead limit check using -XX:-UseGCOverheadLimit, but they 
> all failed with the same GC error.
>
> Successful Attempt -- There may be room for improvement
> Using the configurations below worked but I think it is not the best for 
> write performance.
>
> First, I read from [1], related to [2], that the following one-liner should 
> be executed for better HBase performance when using YCSB. It basically avoids 
> overloading a single region server.
>
> hbase(main):001:0> n_splits = 200 # HBase recommends (10 * number of 
> regionservers)
> hbase(main):002:0> create 'users', 'info', {SPLITS => (1..n_splits).map {|i| 
> "user#{1000+i*(9999-1000)/n_splits}"}}
>
> Second, as suggested by @Renato Marroquín Mogrovejo, it only works when I set
>
> hbase.client.autoflush.default=true
>
> However, from [3], I found "HBase autoflushing. Enabling autoflush decreases 
> write performance. Available since Gora 0.2. Defaults to disabled.". So I am 
> of the opinion that the problem is not entirely solved.
>
> I have done the following testing to insert 1M records into MongoDB and 
> HBase, so I think this may not be bad after all, but more benchmarks may be 
> required to validate this. HBase through Gora has almost the same performance 
> as benchmarking HBase directly with vanilla YCSB.
>
> Backend         Ave Time Taken (sec)
> MongoDB         ~90
> HBase in Gora   ~160
> HBase (YCSB)    ~160
>
>
> [1] https://github.com/brianfrankcooper/YCSB/tree/master/hbase098
> [2] https://issues.apache.org/jira/browse/HBASE-4163
> [3] https://gora.apache.org/current/gora-hbase.html
>
> Comments are welcome.
>
> Thank you.
> **Sheriffo Ceesay**

Re: Week 1 Report and Some Questions

2019-06-11 Thread lewis john mcgibbney
Excellent.
Apologies for being absent. I am undergoing a job transition and it has
been very busy.
I suggest that we start a weekly tagup as well.
Lewis

On Sun, Jun 2, 2019 at 1:14 PM Sheriffo Ceesay 
wrote:

> The code so far is available at the GitHub link below.
>
> https://github.com/sneceesay77/gora/tree/GORA-532/gora-benchmark
>
>
>
> **Sheriffo Ceesay**
>
>
> On Sun, Jun 2, 2019 at 8:34 PM Sheriffo Ceesay 
> wrote:
>
>> Hi Renato,
>>
>> Thanks for the detailed reply. I agree with your recommendations on the
>> way forward. I will go ahead and implement the rest of the functionality
>> using reflection and we can follow your recommendations on the next
>> iterations.
>>
>> As for the backend, I am using both HBase and MongoDB and all seems well
>> at the moment.
>>
>> I will let you all know when I push my code to GitHub.
>>
>> Thank you.
>>
>>
>> **Sheriffo Ceesay**
>>
>>
>> On Sun, Jun 2, 2019 at 7:01 PM Renato Marroquín Mogrovejo <
>> renatoj.marroq...@gmail.com> wrote:
>>
>>> Hi Sheriffo,
>>>
>>> Some opinions about your questions, but others are more than welcome
>>> to suggest other things as well.
>>>
>>> Q1: Are we going to consider arbitrary field length, e.g. if we set
>>> the fieldcount to 100 then we have to create the respective Avro and
>>> mapping files? Currently,
>>> I don't think this process is automated and may be tedious for large
>>> field counts.
>>> I think for the first code iteration, we should use whatever
>>> fieldcount you have already generated. Ideally, we should be able to
>>> invoke the Gora bean generator and generate as many fields as required
>>> by the benchmark configuration.
>>>
>>> Q2: The second problem has to do with the first one: if we
>>> allow arbitrary field counts, then there has to be a mechanism to call
>>> each of the set or get methods during CRUD operations. So to avoid
>>> this I used Java reflection. See the sample code below.
>>> We have some options to deal with having an arbitrary number of fields.
>>> 1) Use reflection, as you have, which might be ok for the first code
>>> iteration, but if we want to have decent performance against
>>> using datastores natively (no Gora), we should move away from it.
>>> 2) Do Gora class generation (and also generate the method used to
>>> insert data through Gora) in a step before the benchmark starts.
>>> Something like this:
>>> # passing config parameters to generate Gora beans with the number of
>>> required fields
>>> # this should output the generated class and the method that does the
>>> insertion
>>> $ gora_compiler.sh --benchmark --fields_required 4
>>> The output path containing the result of this should then be included
>>> (or passed) as a runtime dependency to the benchmark class.
>>> 3) Because Gora uses Avro, we can use complex data types, e.g.,
>>> arrays, maps. So we could represent the number of fields as the
>>> number of elements inside an array. I would think that this option
>>> gives us the best performance.
>>> I think we should continue with option (1) until we have the entire
>>> pipeline working and we understand how every piece fits together
>>> (YCSB, Gora, Gora compiler, benchmark setup steps). Then we
>>> should do (2), which is the most general and the one that reflects how
>>> people usually use Gora, and then we test with (3). I think all of
>>> these steps are totally doable in our time frame as we build upon
>>> previous steps.
>>> The other thing that we should decide is which backend to use, as some
>>> backends are more mature than others. I'd say to use the
>>> HBase backend, as it is the most stable one and the one with the most
>>> features, and if we feel brave we can try other backends (and fix them
>>> if necessary!)
>>>
>>>
>>> Best,
>>>
>>> Renato M.
>>>
>>> On Sun, Jun 2, 2019 at 19:10, Sheriffo Ceesay
>>> () wrote:
>>> >
>>> > Dear Mentors,
>>> >
>>> > My week one report is available at
>>> >
>>> https://cwiki.apache.org/confluence/display/GORA/%5BGORA-532%5D+Apache+Gora+Benchmark+Module+Weekly+Report
>>> >
>>> > I have also included a detailed question and I will need your
>>> guidance
>>> > on that.
>>> >
>>> > Please let me know what your thoughts are.
>>> >
>>> > Thank you.
>>> >
>>> > **Sheriffo Ceesay**
>>>
>>

-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc


Re: Week 2 Report and A Question

2019-06-11 Thread Sheriffo Ceesay
Hello,

I have taken a proper look at the recommendations from @Alfonso and @Renato
and below are the outcomes.

Failed Attempts
1. Optimisation: for the insert operation, to avoid the concatenation
issue, I took the quickest route by calling the methods directly
without reflection. Below are those calls. Note: I have moved all reusable
code to the init method.

> public int insert(String table, String key, HashMap<String, ByteIterator> values) {
>   try {
>     user.setField0(values.get("field0").toString());
>     user.setField1(values.get("field1").toString());
>     user.setField2(values.get("field2").toString());
>     user.setField3(values.get("field3").toString());
>     user.setField4(values.get("field4").toString());
>     user.setField5(values.get("field5").toString());
>     user.setField6(values.get("field6").toString());
>     user.setField7(values.get("field7").toString());
>     user.setField8(values.get("field8").toString());
>     user.setField9(values.get("field9").toString());
>     dataStore.put(user.getUserId().toString(), user);
>   } catch (Exception e) {
>     return FAILED;
>   }
>   return SUCCESS;
> }
>

If the above had worked, I would have changed the code as suggested by
Alfonso. Also, I may be wrong, but the way I understand the YCSB framework
is that it will execute an insert operation for each user object, so I
thought it was right to create a user object within the insert method.


2. I used different config values for *-Xmx (256MB, 512MB, 1GB, 2GB)* and
even disabled the GC overhead limit check using *-XX:-UseGCOverheadLimit*,
but they all failed with the same GC error.

Successful Attempt -- There may be room for improvement
Using the configurations below worked but I think it is not the best for
write performance.

First, I read from [1], related to [2], that the following one-liner
should be executed for better HBase performance when using YCSB. It
basically avoids overloading a single region server.

hbase(main):001:0> n_splits = 200 # HBase recommends (10 * number of
regionservers)
hbase(main):002:0> create 'users', 'info', {SPLITS =>
(1..n_splits).map {|i| "user#{1000+i*(9999-1000)/n_splits}"}}
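As a sanity check on what that one-liner produces (the split expression is the one in the YCSB hbase098 README [1]; Java and Ruby integer division agree for these positive values), here is a small sketch printing the first few split keys:

```java
public class Main {
    public static void main(String[] args) {
        int nSplits = 200;
        StringBuilder sb = new StringBuilder();
        // Same integer arithmetic as the Ruby one-liner:
        // "user#{1000+i*(9999-1000)/n_splits}"
        for (int i = 1; i <= 3; i++) {
            sb.append("user").append(1000 + i * (9999 - 1000) / nSplits);
            if (i < 3) sb.append(' ');
        }
        System.out.println(sb); // first three region split boundaries
    }
}
```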

Second, as suggested by @Renato Marroquín Mogrovejo, it only works when I set

*hbase.client.autoflush.default=true*

However, from [3], I found "HBase autoflushing. Enabling autoflush
decreases write performance. Available since Gora 0.2. Defaults to
disabled.". So I am of the opinion that the problem is not entirely solved.

I have done the following testing to insert 1M records into MongoDB and
HBase, so I think this may not be bad after all, but more benchmarks may be
required to validate this. HBase through Gora has almost the same
performance as benchmarking HBase directly with vanilla YCSB.

*Backend         Ave Time Taken (sec)*
MongoDB          ~90
HBase in Gora    ~160
HBase (YCSB)     ~160


[1] https://github.com/brianfrankcooper/YCSB/tree/master/hbase098
[2] https://issues.apache.org/jira/browse/HBASE-4163
[3] https://gora.apache.org/current/gora-hbase.html

Comments are welcome.

Thank you.

**Sheriffo Ceesay**


On Tue, Jun 11, 2019 at 12:04 AM Sheriffo Ceesay 
wrote:

> Hello Alfonso and Renato,
>
> Thank you for getting in touch and thanks for the detailed replies.
>
> I will have a proper look at this tomorrow morning. I did some
> troubleshooting yesterday (mostly playing with Xmx and zookeeper timeout
> settings); that improved the conditions, but it did not entirely solve the
> problem. Preliminarily, it seems the problem has to do with configuration or
> how HBaseStore is implemented (this may not be entirely true).
>
> I will keep you all posted once I have had a thorough look at your
> suggestions.
>
> Thanks again.
>
>
> **Sheriffo Ceesay**
>
>
> On Mon, Jun 10, 2019 at 11:14 PM Alfonso Nishikawa <
> alfonso.nishik...@gmail.com> wrote:
>
>> Hi!
>>
>> My hypothesis is that the difference between MongoDB and HBase is
>> that HBase puts more stress on serializing with Avro. It could also
>> matter that if the HBase tests are performed after MongoDB's, then
>> the GC starts from a "bad" situation.
>>
>> From [A], linked by @Renato: if the error were an OutOfMemoryError I would
>> have recommended lowering gora.hbasestore.scanner.caching to 100, 10, or
>> even 1, but with a GC error I am not so sure. In any case,
>> @Sheriffo:
>> you can try this if it still doesn't work with the optimizations :)
>>
>> @Renato: Thx for the links!
>>
>> Regards,
>>
>> Alfonso Nishikawa
>>
>>
>>
>> On Mon, Jun 10, 2019 at 22:02, Renato Marroquín Mogrovejo (<
>> renatoj.marroq...@gmail.com>) wrote:
>>
>> > @Alfonso,
>> > Thank you very much for the suggestions! you are totally right about
>> > all of your points! Sheriffo, please benefit from them ;)
>> >
>> > Also, what is strange (although it can be optimized as Alfonso
>> > pointed out) is that it works for the MongoDB backend. So I would also
>> > suspect the configuration of the Gora-HBase