Thanks Cezar/Mike.
Imagine we have a large number of relatively small XML documents stored
on a cluster of computers. Take, for example, the documents in the EDGAR
dataset (http://edgar.sec.gov/), which holds all the paperwork that public
companies need to file with the SEC every quarter, in XML format.
Initially, owing to the side-effect-free nature of the core XQuery
language, VXQuery could target analytics-style queries against such data
while harnessing the power (CPU and I/O) of multiple, possibly multi-core,
processors.
An example query like

    count(
      for $d in collection('EDGAR')
      where $d/COMPANY_NAME = 'IBM'
      return $d
    )

which counts the number of documents that have IBM as the company name,
would be evaluated by running the FLWOR inside the count independently
on each machine that stores a part of the collection, producing one
local count per machine. Finally, the local counts from the leaf
machines would be summed up in one place to produce the result of
the query.
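As a rough illustration of this evaluation strategy (not VXQuery or Hyracks code), the local-count-then-sum pattern can be sketched in Python. Everything here is an assumption made for the example: the PARTITIONS list stands in for the machines of a cluster, the FILING element name is made up, threads stand in for separate machines, and each document is treated as a root element with a COMPANY_NAME child.

```python
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the collection: each inner list plays the role
# of the documents stored on one machine of the cluster.
PARTITIONS = [
    ["<FILING><COMPANY_NAME>IBM</COMPANY_NAME></FILING>",
     "<FILING><COMPANY_NAME>Oracle</COMPANY_NAME></FILING>"],
    ["<FILING><COMPANY_NAME>IBM</COMPANY_NAME></FILING>"],
    ["<FILING><COMPANY_NAME>Intel</COMPANY_NAME></FILING>",
     "<FILING><COMPANY_NAME>IBM</COMPANY_NAME></FILING>"],
]


def local_count(docs, company):
    """Evaluate the FLWOR body over one partition and count the matches."""
    count = 0
    for text in docs:
        root = ET.fromstring(text)
        # Like the XQuery '=' comparison, this is true if ANY
        # COMPANY_NAME child has the requested value.
        if any(e.text == company for e in root.findall("COMPANY_NAME")):
            count += 1
    return count


def parallel_count(partitions, company):
    """Compute one local count per partition, then sum them in one place."""
    with ThreadPoolExecutor() as pool:
        local_counts = list(
            pool.map(lambda docs: local_count(docs, company), partitions)
        )
    return sum(local_counts)


if __name__ == "__main__":
    print(parallel_count(PARTITIONS, "IBM"))  # prints 3
```

The split works because the per-document test has no side effects and counting is associative, so partial counts can be combined in any order.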
For the past three years, Mike and I have been involved with another
Apache-licensed project called Hyracks
(http://code.google.com/p/hyracks/) that provides an efficient runtime
for data-parallel tasks. But what is more interesting from the VXQuery
point of view is a logical algebra abstraction in the Hyracks project
called Algebricks.
Algebricks is an extended nested-relational algebra library that can be
used to express query semantics of a large set of declarative data
processing languages (to which XQuery belongs). The Algebricks framework
automatically optimizes the specified algebraic expression into a
physical plan and parallelizes it to use the Hyracks runtime. In fact,
the goal of the Algebricks platform was to provide a simple path for
language implementors to quickly get new declarative data languages
running efficiently, in parallel, on a shared-nothing cluster of machines.
The high-level tasks that we will have to complete to get VXQuery
running on a cluster would be:
1. Build a translator that converts the existing XQuery AST object model
that is emitted by the parser into an Algebricks algebra expression.
2. Build an implementation of the Metadata interface needed by
Algebricks, which helps the runtime resolve things like the location of
base data.
3. Build an implementation of the runtime function call interface so the
actual function work is done by code that already exists in VXQuery, but
is invoked by the Hyracks runtime.
4. Implement serializers/deserializers for the various data model pieces
in VXQuery to be able to transport data across machines.
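To give a flavor of task 4, here is a minimal Python sketch of a tagged binary encoding for two kinds of atomic values. The type tags, the byte layout, and the function names are all hypothetical, chosen only for illustration; the real VXQuery data model and the Hyracks serialization interfaces may look quite different.

```python
import struct

# Hypothetical type tags, invented for this example only.
TAG_INTEGER = 1
TAG_STRING = 2


def serialize(value):
    """Encode one atomic value as a tag byte followed by its payload."""
    if isinstance(value, int):
        # 1-byte tag + 8-byte big-endian signed integer
        return struct.pack(">Bq", TAG_INTEGER, value)
    if isinstance(value, str):
        data = value.encode("utf-8")
        # 1-byte tag + 4-byte length + UTF-8 bytes
        return struct.pack(">BI", TAG_STRING, len(data)) + data
    raise TypeError("unsupported value type: %r" % type(value))


def deserialize(buf, offset=0):
    """Decode one value starting at offset; return (value, next_offset)."""
    tag = buf[offset]
    if tag == TAG_INTEGER:
        (value,) = struct.unpack_from(">q", buf, offset + 1)
        return value, offset + 9
    if tag == TAG_STRING:
        (length,) = struct.unpack_from(">I", buf, offset + 1)
        start = offset + 5
        return buf[start:start + length].decode("utf-8"), start + length
    raise ValueError("unknown type tag: %d" % tag)


if __name__ == "__main__":
    # A local count and a company name travelling in the same byte buffer.
    frame = serialize(3) + serialize("IBM")
    first, pos = deserialize(frame)
    second, pos = deserialize(frame, pos)
    print(first, second)
```

Returning the next offset from deserialize lets a receiver walk a buffer holding many values back to back, which is the shape of data that would move between machines.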
If someone is looking for a project in the context of GSoC, I think
this task of building a parallel XQuery engine could be an interesting one.
Thoughts?
Thanks,
Vinayak
On 02/08/2012 08:18 AM, Cezar Andrei wrote:
I like the idea, sounds very interesting.
Vinayak, will you put your thoughts in more detail and maybe make a list of
features that we can use for the GSOC list?
Cezar
On Wed, Feb 8, 2012 at 8:52 AM, Michael Carey wrote:
This sounds like a great direction, and one that would be very interesting
to the community! (Except maybe MarkLogic? :-))
Cheers,
Mike
On 2/8/12 1:44 AM, Vinayak Borkar wrote:
Guys,
Given that we are at a juncture where either we try to build out this
project and build a community OR remove the project from the incubator, I
have a proposal that I feel will help us get community interest in the
project.
My proposal is to slightly change the focus of the project to cater to
different use cases than originally proposed while still continuing to
build an XQuery processor.
In the original proposal, we proposed to build an XQuery processor to
target multiple input formats. In the beginning there was a lot of interest
in this direction from the mentors, which seems to have quieted down recently.
In the meantime, people have been increasingly interested in processing
large amounts of data. To this end, I propose that we switch the focus of
the VXQuery project to target big XML data use cases.
In terms of work done, we get to reuse a majority of the code that has
already been built. In terms of the tasks to be done, the immediate focus
will be on parallelizing the existing codebase to be able to handle large
amounts of XML data.
I feel that this slight change in focus will be a fun challenge from a
development standpoint and also will help us gain a community given the
growing interest in big data processing.
Looking forward to your thoughts.
Thanks,
Vinayak