Thanks! For #2: wouldn't H2 need to bring the data into the heap to make the queries? Or at least some of the date to do the group_by and sum operation?
On Mon, Sep 10, 2018 at 6:19 AM Vladimir Ozerov <[email protected]> wrote: > Hi Eugene, > > Answering your questions: > 1) Grouping is performed on both mapper and reducer (coordinator). If you > group be affinity key, you may try setting "SqlFieldsQuery.colocated=true" > to bypass grouping on reducer > 2) For this specific query H2 will store (customer_id, count(*), > sum(views)) for every customer_id. It is hard to guess how much space it > would take in heap, but I think it would be ~50-100 bytes per customer_id. > So if you have N customers, it would be (100 * N) bytes > 3) Please see > https://apacheignite-sql.readme.io/docs/performance-and-debugging > > Vladimir. > > On Thu, Aug 30, 2018 at 5:57 PM eugene miretsky <[email protected]> > wrote: > >> Thanks against for the detailed response! >> >> Our main use case is preforming large SQL queries over tables with 200M+ >> rows - wanted to give you a bit more details and context you can pass along >> >> A simple example would be: >> >> - Table: customer_id, date, category, views, clicks ( pkey = >> "customer_id, date", affinity key = date ) >> - Query: SELECT count(*) where date < X AND categroy in (C1, C2, C3) >> GROUP BY customer_id HAVING SUM(views) > 20 >> >> My main concernse are >> 1) How is the group by performed. You mentioend that it is performend on >> the coordinator, I was coping that singe we are grouping using an colomn >> that is an affintiy key, each node will be able to do it's own group by >> 2) How much heap should I allocate for the group by stage >> 3) General performance tips >> >> Cheers, >> Eugene >> >> >> On Thu, Aug 30, 2018 at 1:32 AM Denis Magda <[email protected]> wrote: >> >>> Eugene, >>> >>> Just want to be sure you know about the existence of the following pages >>> which elaborate on Ignite memory architecture in details: >>> >>> - >>> >>> https://cwiki.apache.org/confluence/display/IGNITE/Ignite+Durable+Memory+-+under+the+hood#IgniteDurableMemory-underthehood-Entriesandpagesindurablememory >>> - >>> >>> https://cwiki.apache.org/confluence/display/IGNITE/Ignite+Persistent+Store+-+under+the+hood >>> >>> >>> >>>> 1) Are indexs loaded into heap (when used)? >>>> >>> >>> Something might be copied to disk but in most of the cases we perform >>> comparisons and other operations directly off-heap. >>> See >>> org.apache.ignite.internal.processors.query.h2.database.InlineIndexHelper >>> and related classes. >>> >>> 2) Are full pages loaded into heap, or only the matching records? >>>> >>> >>> Matching records (result set) are presently loaded. The pages are not. >>> >>> >>>> 3) When the query needs more processing than the exisiting index >>>> (non-indexed columns, groupBy, aggreag) where/how does it happen? >>>> >>> >>> We will be doing a full scan. Grouping and aggregations are finalized on >>> the query coordinator which needs to get a full result set. >>> >>> 4) How is the query coordinator chosen? Is it the client node? How about >>>> when using the web console? >>>> >>> >>> That's your application. Web Console uses Ignite SQL APIs as well. >>> >>> >>>> 5) What paralalism settings would your recomend, we were thinking to >>>> set parallelJobsNumber to 1 and task parallelism to number of cores * 2 - >>>> this way we can make sure that each job gets al the heap memory instead of >>>> all jobs fighting each other. Not sure if it makes sense, and it will also >>>> prevent us from making real time transactional transactional queries.(we >>>> are hoping to use ignite for both olap and simple real time queries) >>> >>> >>> I would start a separate discussion for this bringing this question to >>> the attention of our SQL experts. I'm not the one of them. >>> >>> -- >>> Denis >>> >>> On Mon, Aug 27, 2018 at 8:54 PM eugene miretsky < >>> [email protected]> wrote: >>> >>>> Denis, thanks for the detailed response. >>>> >>>> A few more follow up questions >>>> 1) Are indexs loaded into heap (when used)? >>>> 2) Are full pages loaded into heap, or only the matching records? >>>> 3) When the query needs more processing than the exisiting index >>>> (non-indexed columns, groupBy, aggreag) where/how does it happen? >>>> 4) How is the query coordinator chosen? Is it the client node? How >>>> about when using the web console? >>>> 5) What paralalism settings would your recomend, we were thinking to >>>> set parallelJobsNumber to 1 and task parallelism to number of cores * 2 - >>>> this way we can make sure that each job gets al the heap memory instead of >>>> all jobs fighting each other. Not sure if it makes sense, and it will also >>>> prevent us from making real time transactional transactional queries.(we >>>> are hoping to use ignite for both olap and simple real time queries) >>>> >>>> Cheers, >>>> Eugene >>>> >>>> >>>> On Sat, Aug 25, 2018 at 3:25 AM Denis Magda <[email protected]> wrote: >>>> >>>>> Hello Eugene, >>>>> >>>>> 1) In what format is data stored off heap? >>>>> >>>>> >>>>> Data is always stored in the binary format let it be on-heap, off-heap >>>>> or Ignite persistence. >>>>> https://apacheignite.readme.io/docs/binary-marshaller >>>>> >>>>> 2) What happens when a SQL query is executed, in particular >>>>> >>>>>> >>>>>> - How is H2 used? How is data loaded in H2? What if some of the >>>>>> data is on disk? >>>>>> >>>>>> H2 is used to build execution plans for SELECTs. H2 calls Ignite's >>>>> B+Tree based indexing implementation to see which indexes are set. All the >>>>> data and indexes are always stored in Ignite (off-heap + disk). >>>>> >>>>>> >>>>>> - When is data loaded into heap, and how much? Is only the output >>>>>> of H2 loaded, or everything? >>>>>> >>>>>> Queries results are stored in Java heap temporarily. Once the result >>>>> set is read by your application, it will be garbage collected. >>>>> >>>>>> >>>>>> - How is the reduce stage performed? Is it performed only on one >>>>>> node (hence that node needs to load all the data into memory) >>>>>> >>>>>> Correct, the final result set is reduced on a query coordinator - >>>>> your application that executed a SELECT. >>>>> >>>>> 3) What happens when Ingite runs out of memory during execution? Is >>>>>> data evictied to disk (if persistence is enabled)? >>>>> >>>>> >>>>> I guess you mean what happens if a result set doesn't fit in RAM >>>>> during the execution, right? If so, then OOM will occur. We're working on >>>>> an improvement that will offload the result set to disk to avoid OOM for >>>>> all the scenarious: >>>>> https://issues.apache.org/jira/browse/IGNITE-7526 >>>>> >>>>> >>>>> >>>>>> 4) Based on the code, it looks like I need to set my data region size >>>>>> to at most 50% of available memory (to avoid the warning), this seems a >>>>>> bit >>>>>> wastefull. >>>>> >>>>> >>>>> There is no such a requirement. I know many deployments use cases when >>>>> one data region is given 20% of RAM, the other is given 40% and everything >>>>> else is persisted to disk. >>>>> >>>>> 5) Do you have any general advice on benchmarking the memory >>>>>> requirpement? So far I have not been able to find a way to check how much >>>>>> memory each table takes on and off heap, and how much memory each query >>>>>> takes. >>>>> >>>>> >>>>> We use Yardstick for performance benchmarking: >>>>> https://apacheignite.readme.io/docs/perfomance-benchmarking >>>>> >>>>> -- >>>>> Denis >>>>> >>>>> On Fri, Aug 24, 2018 at 7:06 AM eugene miretsky < >>>>> [email protected]> wrote: >>>>> >>>>>> Thanks! >>>>>> >>>>>> I am trying to understand when and how data is moved from off-heap to >>>>>> on heap, particularly when using SQL. I took a look at the wiki >>>>>> <https://cwiki.apache.org/confluence/display/IGNITE/Ignite+Durable+Memory+-+under+the+hood> >>>>>> but >>>>>> still have a few questions >>>>>> >>>>>> My understanding is that data is always store off-heap >>>>>> >>>>>> 1) In what format is data stored off heap? >>>>>> 2) What happens when a SQL query is executed, in particular >>>>>> >>>>>> - How is H2 used? How is data loaded in H2? What if some of the >>>>>> data is on disk? >>>>>> - When is data loaded into heap, and how much? Is only the output >>>>>> of H2 loaded, or everything? >>>>>> - How is the reduce stage performed? Is it performed only on one >>>>>> node (hence that node needs to load all the data into memory) >>>>>> >>>>>> 3) What happens when Ingite runs out of memory during execution? Is >>>>>> data evictied to disk (if persistence is enabled)? >>>>>> 4) Based on the code, it looks like I need to set my data region size >>>>>> to at most 50% of available memory (to avoid the warning), this seems a >>>>>> bit >>>>>> wastefull. >>>>>> 5) Do you have any general advice on benchmarking the memory >>>>>> requirpement? So far I have not been able to find a way to check how much >>>>>> memory each table takes on and off heap, and how much memory each query >>>>>> takes. >>>>>> >>>>>> Cheers, >>>>>> Eugene >>>>>> >>>>>> On Fri, Aug 24, 2018 at 8:06 AM, NSAmelchev <[email protected]> >>>>>> wrote: >>>>>> >>>>>>> Hi Eugene, >>>>>>> >>>>>>> Yes, it's a misprint as Dmitry wrote. >>>>>>> >>>>>>> Ignite print this warning if nodes on local machine require more >>>>>>> than 80% of >>>>>>> physical RAM. >>>>>>> >>>>>>> From code, you can see that total heap/offheap memory summing >>>>>>> from nodes having the same mac address. This way calculates total >>>>>>> memory >>>>>>> used >>>>>>> by the local machine. >>>>>>> >>>>>>> -- >>>>>>> Best wishes, >>>>>>> Amelchev Nikita >>>>>>> >>>>>>> >>>>>>> >>>>>>> -- >>>>>>> Sent from: http://apache-ignite-users.70518.x6.nabble.com/ >>>>>>> >>>>>> >>>>>>
