Re: How much heap to allocate

eugene miretsky Wed, 12 Sep 2018 14:16:30 -0700

Thanks!

For #2: wouldn't H2 need to bring the data into the heap to make the
queries? Or at least some of the date to do the group_by and sum operation?


On Mon, Sep 10, 2018 at 6:19 AM Vladimir Ozerov <[email protected]>
wrote:

> Hi Eugene,
>
> Answering your questions:
> 1) Grouping is performed on both mapper and reducer (coordinator). If you
> group be affinity key, you may try setting "SqlFieldsQuery.colocated=true"
> to bypass grouping on reducer
> 2) For this specific query H2 will store (customer_id, count(*),
> sum(views)) for every customer_id. It is hard to guess how much space it
> would take in heap, but I think it would be ~50-100 bytes per customer_id.
> So if you have N customers, it would be (100 * N) bytes
> 3) Please see
> https://apacheignite-sql.readme.io/docs/performance-and-debugging
>
> Vladimir.
>
> On Thu, Aug 30, 2018 at 5:57 PM eugene miretsky <[email protected]>
> wrote:
>
>> Thanks against for the detailed response!
>>
>> Our main use case is preforming large SQL queries over tables with 200M+
>> rows  - wanted to give you a bit more details and context you can pass along
>>
>> A simple example would be:
>>
>>    - Table: customer_id, date, category, views, clicks ( pkey =
>>    "customer_id, date", affinity key = date )
>>    - Query: SELECT count(*) where date < X AND categroy in (C1, C2, C3)
>>    GROUP BY customer_id HAVING SUM(views) > 20
>>
>> My main concernse are
>> 1) How is the group by performed. You mentioend that it is performend on
>> the coordinator, I was coping that singe we are grouping using an colomn
>> that is an affintiy key, each node will be able to do it's own group by
>> 2) How much heap should I allocate for the group by stage
>> 3) General performance tips
>>
>> Cheers,
>> Eugene
>>
>>
>> On Thu, Aug 30, 2018 at 1:32 AM Denis Magda <[email protected]> wrote:
>>
>>> Eugene,
>>>
>>> Just want to be sure you know about the existence of the following pages
>>> which elaborate on Ignite memory architecture in details:
>>>
>>>    -
>>>    
>>> https://cwiki.apache.org/confluence/display/IGNITE/Ignite+Durable+Memory+-+under+the+hood#IgniteDurableMemory-underthehood-Entriesandpagesindurablememory
>>>    -
>>>    
>>> https://cwiki.apache.org/confluence/display/IGNITE/Ignite+Persistent+Store+-+under+the+hood
>>>
>>>
>>>
>>>> 1) Are indexs loaded into heap (when used)?
>>>>
>>>
>>> Something might be copied to disk but in most of the cases we perform
>>> comparisons and other operations directly off-heap.
>>> See 
>>> org.apache.ignite.internal.processors.query.h2.database.InlineIndexHelper
>>> and related classes.
>>>
>>> 2) Are full pages loaded into heap, or only the matching records?
>>>>
>>>
>>> Matching records (result set) are presently loaded. The pages are not.
>>>
>>>
>>>> 3) When the query needs more processing than the exisiting index
>>>> (non-indexed columns, groupBy, aggreag) where/how does it happen?
>>>>
>>>
>>> We will be doing a full scan. Grouping and aggregations are finalized on
>>> the query coordinator which needs to get a full result set.
>>>
>>> 4) How is the query coordinator chosen? Is it the client node? How about
>>>> when using the web console?
>>>>
>>>
>>> That's your application. Web Console uses Ignite SQL APIs as well.
>>>
>>>
>>>> 5) What paralalism settings would your recomend, we were thinking to
>>>> set parallelJobsNumber  to 1  and task parallelism to number of cores * 2 -
>>>> this way we can make sure that each job gets al the heap memory instead of
>>>> all jobs fighting each other. Not sure if it makes sense, and it will also
>>>> prevent us from making real time transactional transactional queries.(we
>>>> are hoping to use ignite for both olap and simple real time queries)
>>>
>>>
>>> I would start a separate discussion for this bringing this question to
>>> the attention of our SQL experts. I'm not the one of them.
>>>
>>> --
>>> Denis
>>>
>>> On Mon, Aug 27, 2018 at 8:54 PM eugene miretsky <
>>> [email protected]> wrote:
>>>
>>>> Denis, thanks for the detailed response.
>>>>
>>>> A few more follow up questions
>>>> 1) Are indexs loaded into heap (when used)?
>>>> 2) Are full pages loaded into heap, or only the matching records?
>>>> 3) When the query needs more processing than the exisiting index
>>>> (non-indexed columns, groupBy, aggreag) where/how does it happen?
>>>> 4) How is the query coordinator chosen? Is it the client node? How
>>>> about when using the web console?
>>>> 5) What paralalism settings would your recomend, we were thinking to
>>>> set parallelJobsNumber  to 1  and task parallelism to number of cores * 2 -
>>>> this way we can make sure that each job gets al the heap memory instead of
>>>> all jobs fighting each other. Not sure if it makes sense, and it will also
>>>> prevent us from making real time transactional transactional queries.(we
>>>> are hoping to use ignite for both olap and simple real time queries)
>>>>
>>>> Cheers,
>>>> Eugene
>>>>
>>>>
>>>> On Sat, Aug 25, 2018 at 3:25 AM Denis Magda <[email protected]> wrote:
>>>>
>>>>> Hello Eugene,
>>>>>
>>>>> 1) In what format is data stored off heap?
>>>>>
>>>>>
>>>>> Data is always stored in the binary format let it be on-heap, off-heap
>>>>> or Ignite persistence.
>>>>> https://apacheignite.readme.io/docs/binary-marshaller
>>>>>
>>>>> 2) What happens when a SQL query is executed, in particular
>>>>>
>>>>>>
>>>>>>    - How is H2 used? How is data loaded in H2? What if some of the
>>>>>>    data is on disk?
>>>>>>
>>>>>> H2 is used to build execution plans for SELECTs. H2 calls Ignite's
>>>>> B+Tree based indexing implementation to see which indexes are set. All the
>>>>> data and indexes are always stored in Ignite (off-heap + disk).
>>>>>
>>>>>>
>>>>>>    - When is data loaded into heap, and how much? Is only the output
>>>>>>    of H2 loaded, or everything?
>>>>>>
>>>>>> Queries results are stored in Java heap temporarily. Once the result
>>>>> set is read by your application, it will be garbage collected.
>>>>>
>>>>>>
>>>>>>    - How is the reduce stage performed? Is it performed only on one
>>>>>>    node (hence that node needs to load all the data into memory)
>>>>>>
>>>>>> Correct, the final result set is reduced on a query coordinator -
>>>>> your application that executed a SELECT.
>>>>>
>>>>> 3) What happens when Ingite runs out of memory during execution? Is
>>>>>> data evictied to disk (if persistence is enabled)?
>>>>>
>>>>>
>>>>> I guess you mean what happens if a result set doesn't fit in RAM
>>>>> during the execution, right? If so, then OOM will occur. We're working on
>>>>> an improvement that will offload the result set to disk to avoid OOM for
>>>>> all the scenarious:
>>>>> https://issues.apache.org/jira/browse/IGNITE-7526
>>>>>
>>>>>
>>>>>
>>>>>> 4) Based on the code, it looks like I need to set my data region size
>>>>>> to at most 50% of available memory (to avoid the warning), this seems a 
>>>>>> bit
>>>>>> wastefull.
>>>>>
>>>>>
>>>>> There is no such a requirement. I know many deployments use cases when
>>>>> one data region is given 20% of RAM, the other is given 40% and everything
>>>>> else is persisted to disk.
>>>>>
>>>>> 5) Do you have any general advice on benchmarking the memory
>>>>>> requirpement? So far I have not been able to find a way to check how much
>>>>>> memory each table takes on and off heap, and how much memory each query
>>>>>> takes.
>>>>>
>>>>>
>>>>> We use Yardstick for performance benchmarking:
>>>>> https://apacheignite.readme.io/docs/perfomance-benchmarking
>>>>>
>>>>> --
>>>>> Denis
>>>>>
>>>>> On Fri, Aug 24, 2018 at 7:06 AM eugene miretsky <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> I am trying to understand when and how data is moved from off-heap to
>>>>>> on heap, particularly when using SQL.  I took a look at the wiki
>>>>>> <https://cwiki.apache.org/confluence/display/IGNITE/Ignite+Durable+Memory+-+under+the+hood>
>>>>>>  but
>>>>>> still have a few questions
>>>>>>
>>>>>> My understanding is that data is always store off-heap
>>>>>>
>>>>>> 1) In what format is data stored off heap?
>>>>>> 2) What happens when a SQL query is executed, in particular
>>>>>>
>>>>>>    - How is H2 used? How is data loaded in H2? What if some of the
>>>>>>    data is on disk?
>>>>>>    - When is data loaded into heap, and how much? Is only the output
>>>>>>    of H2 loaded, or everything?
>>>>>>    - How is the reduce stage performed? Is it performed only on one
>>>>>>    node (hence that node needs to load all the data into memory)
>>>>>>
>>>>>> 3) What happens when Ingite runs out of memory during execution? Is
>>>>>> data evictied to disk (if persistence is enabled)?
>>>>>> 4) Based on the code, it looks like I need to set my data region size
>>>>>> to at most 50% of available memory (to avoid the warning), this seems a 
>>>>>> bit
>>>>>> wastefull.
>>>>>> 5) Do you have any general advice on benchmarking the memory
>>>>>> requirpement? So far I have not been able to find a way to check how much
>>>>>> memory each table takes on and off heap, and how much memory each query
>>>>>> takes.
>>>>>>
>>>>>> Cheers,
>>>>>> Eugene
>>>>>>
>>>>>> On Fri, Aug 24, 2018 at 8:06 AM, NSAmelchev <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Eugene,
>>>>>>>
>>>>>>> Yes, it's a misprint as Dmitry wrote.
>>>>>>>
>>>>>>> Ignite print this warning if nodes on local machine require more
>>>>>>> than 80% of
>>>>>>> physical RAM.
>>>>>>>
>>>>>>> From code, you can see that total heap/offheap memory summing
>>>>>>> from nodes having the same mac address. This way calculates total
>>>>>>> memory
>>>>>>> used
>>>>>>> by the local machine.
>>>>>>>
>>>>>>> --
>>>>>>> Best wishes,
>>>>>>> Amelchev Nikita
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>>>>>>>
>>>>>>
>>>>>>

Re: How much heap to allocate

Reply via email to