Thanks for pointing out the docs issue! I opened IMPALA-6303 to track it.
On 10 December 2017 at 15:47, Lars Francke <[email protected]> wrote: > Thank you Bharath & Dimitris! > > That answers all the questions I have right now, thank you so much for > taking the time to write it up. > > Regarding the docs: > <https://impala.apache.org/docs/build/html/topics/impala_components.html> > >> The Impala component known as the catalog service relays the metadata >> changes from Impala SQL statements to all the DataNodes in a cluster. It is >> physically represented by a daemon process named catalogd; you only need >> such a process on one host in the cluster. Because the requests are passed >> through the statestore daemon, it makes sense to run the statestored and >> catalogd services on the same host. > > Reading it again now it also says "DataNodes" which is not correct. > > Cheers, > Lars > > > On Fri, Dec 8, 2017 at 7:00 PM, Bharath Vissapragada <[email protected]> > wrote: >> >> Looks like a topic for dev@. >> >> On Fri, Dec 8, 2017 at 2:48 AM, Lars Francke <[email protected]> >> wrote: >>> >>> Hi, >>> >>> I'm trying to understand how the communication between the components >>> works. >>> >>> I understand that an impala daemon subscribes to the statestore. The >>> statestore seems to have the concept of heartbeats and topics. But I'm not >>> sure what topics are all about. >> >> >> Statestore follows the standard pub-sub pattern where a publisher >> publishes messages and subscribers subscribe to the messages/categories they >> are interested in. Like you mentioned, statestore is like a mediator >> between the publishers and the subscribers. >> >> "Topic" is an abstraction that makes the content of these messages opaque >> to the statestore. The publishers (like Catalog server for example) >> serialize the messages (metadata for example) into a "Topic" to ship them to >> the statestore which then broadcasts that to the interested subscribers >> (coordinators). The coordinators then unpack/deserialize the topic into the >> corresponding object classes (like Tables/Functions etc.) and apply those >> updates locally. >> >> In Impala, currently we have the following topics: >> >> catalog-update - For Catalog metadata >> impala-membership - For tracking liveness of the coordinators/executors >> impala-request-queue - For admission control >> >> You can see these in the statestore web UI (/topics page) >> >>> >>> >>> The docs also say that only the statestore communicates with the catalog >>> service. How does that happen? >> >> >> Can you point us to which doc you are referring to here? >> >> Techincally speaking, the coordinators also connect to the Catalog service >> for executing DDLs, but I'm assuming you are speaking here in terms of the >> broadcast of the table updates, in which case Catalog sends those tables to >> the statestore (as a part of catalog-update topic) and those are broadcast >> by the statestore to all the coordinators. (described above) >> >> How is a INVALIDATE/REFRESH statement routed from a daemon to the catalog >> service and back? >> >> I'll take the example of REFRESH here. The metadata flow looks something >> like this >> >> - coordinator 'coo' gets 'refresh foo' >> - 'coo' makes an RPC to the catalog server 'cat' for executing 'refresh' >> - 'cat' refreshes the table 'foo', which changes the version of 'foo' from >> v1 to v2 (Internally Catalog versions all the objects to track which objects >> changed over time) >> - 'cat' returns 'foo' (v2) directly to the coordinator 'coo' (as the >> result of RPC) which then applies the update locally. >> - Additionally 'cat' also has a thread running in the background that >> figures out that the 'foo' has changed (v1 -> v2), which then repacks 'foo' >> into a "Topic" update and sends it to the statestore. >> - Statestore then broadcasts the new updates to all the coordinators. >> >> INVALIDATE is slightly different in the sense that the coordinator doesn't >> get foo(v2) back as the result of the rpc, instead it gets an >> "IncompleteTable" (Impala terminology) which means that the table is either >> missing the catalog metadata/it has been invalidated. >> >> There are many minor details on how the entire system works but "most" >> Catalog updates work as above (with some exceptions). >> >>> >>> I'm sure I'll have follow-up questions but this would already be very >>> helpful. Thank you! >> >> >> Sure, feel free to ask the list. Here are some code pointers incase you >> are interested. >> >> >> https://github.com/apache/impala/blob/master/be/src/statestore/statestore.h >> (Topic/TopicEntry and other SS abstractions) >> >> >> https://github.com/apache/impala/blob/master/common/thrift/CatalogService.thrift#L45 >> (thrift definitions for most Catalog operations) >> >> >> https://github.com/apache/impala/blob/master/be/src/service/impala-server.h#L324 >> (how coordinators apply the Catalog updates) >> >> >> https://github.com/apache/impala/blob/master/be/src/catalog/catalog-server.cc#L187 >> (An example of how "catalog-update" is created & subscribed) >> >> HTH. >> >>> >>> Cheers, >>> Lars >> >> >
