Re: Multiple Fuseki Servers in Distributed Environment
Apologies for resurrecting this thread...

Yes, Mosaic uses Thrift when distributed, i.e. across multiple JVMs. It was on hold because I changed jobs, yay! I'm starting to look at making it available as a Jena sidecar, i.e. jena-mosaic.

DickM

On 27 May 2018 at 12:02, ajs6f wrote:
> Dick Murray has written a system called Mosaic that (I believe) uses
> Apache Thrift to distribute the lower-level (DatasetGraph) primitives that
> ARQ uses to execute SPARQL. An advantage over your plan might be that he
> isn't serializing full results over HTTP to pass them around. I don't
> understand that system to be ready for use outside of Dick's deployment,
> but he could say more.
Re: Multiple Fuseki Servers in Distributed Environment
There are several systems that distribute SPARQL using Jena.

Dick Murray has written a system called Mosaic that (I believe) uses Apache Thrift to distribute the lower-level (DatasetGraph) primitives that ARQ uses to execute SPARQL. An advantage over your plan might be that he isn't serializing full results over HTTP to pass them around. I don't understand that system to be ready for use outside of Dick's deployment, but he could say more.

The SANSA project [1] has provided a system that I understand to use ARQ to execute queries over Apache Spark or Apache Flink. This sounds similar in some ways to what you are doing, and that system is available today. I think Jena committer Lorenz Bühmann is involved with that project; if I am correct, he may be able to say more.

There are doubtless others about which I don't know.

ajs6f

[1] http://sansa-stack.net/

> On May 26, 2018, at 5:47 AM, Mirko Kämpf wrote:
>
> My initial prototype works well. Now I want to go deeper. But I wonder
> whether such an activity has already been started, or whether you know
> of reasons why this is not a good approach.
Fwd: Multiple Fuseki Servers in Distributed Environment
Hello Fuseki experts,

I would like to ask for your experience with, and thoughts about, the following approach.

To enable semantic queries over "transient data", or over data persisted in HDFS / HBase, I run a Fuseki server (standalone or embedded) on each cluster node that hosts a Spark executor.

Since the data is partitioned, there are no references between the datasets (in this particular case).

A simple query broker distributes the query and consolidates the results. The next step would be to add a coordinator that uses graph statistics to optimize dataset dumps and reloading in case of failure.

A load balancer balances request and result flows towards clients. Eventually, the query broker will run in Docker.

A sketch is available here:
https://raw.githubusercontent.com/kamir/fuseki-cloud/master/Fuseki%20Cloud.png

My initial prototype works well, and now I want to go deeper. But I wonder whether such an activity has already been started, or whether you know of reasons why this is not a good approach.

If there is no reason not to implement such a "Fuseki-Cloud" approach, I will continue on that route, and I would like to contribute the results to the existing project.

Thanks for any hint or recommendation.

Best wishes,
Mirko
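
For readers following the thread, the "simple query broker" described above could look roughly like the sketch below. It is an illustration only, not the prototype from the message: the function names (`query_endpoint`, `merge_select_results`, `broadcast_select`) and the endpoint URLs in the usage note are hypothetical. It assumes each Fuseki node exposes a standard SPARQL protocol query endpoint that returns SPARQL JSON results, and that the partitions are disjoint, so the per-node bindings can simply be concatenated.

```python
import json
import urllib.parse
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def query_endpoint(endpoint_url, sparql, timeout=30):
    """POST a SELECT query to one SPARQL endpoint and return the parsed JSON results."""
    body = urllib.parse.urlencode({"query": sparql}).encode("utf-8")
    request = urllib.request.Request(
        endpoint_url,
        data=body,
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            "Accept": "application/sparql-results+json",
        },
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def merge_select_results(results):
    """Concatenate the bindings of several SPARQL JSON result documents."""
    variables = []
    bindings = []
    for result in results:
        # Keep the union of variable names, in first-seen order.
        for name in result["head"]["vars"]:
            if name not in variables:
                variables.append(name)
        bindings.extend(result["results"]["bindings"])
    return {"head": {"vars": variables}, "results": {"bindings": bindings}}


def broadcast_select(endpoints, sparql, fetch=query_endpoint):
    """Send the same query to every endpoint in parallel and merge the answers.

    `fetch` is injectable so the broker logic can be exercised without a network.
    """
    if not endpoints:
        return merge_select_results([])
    with ThreadPoolExecutor(max_workers=len(endpoints)) as pool:
        partial = list(pool.map(lambda url: fetch(url, sparql), endpoints))
    return merge_select_results(partial)
```

A call might look like `broadcast_select(["http://node1:3030/ds/query", "http://node2:3030/ds/query"], "SELECT ?s WHERE { ?s ?p ?o }")`, where both URLs are placeholders. Plain concatenation is only correct for queries whose answer is the union of the per-partition answers; queries using aggregates, `DISTINCT`, `ORDER BY` or `LIMIT` would need an extra consolidation step in the broker.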