Thanks Cezar/Mike.

Imagine we have a large number of relatively small XML documents stored on a cluster of computers, for example documents from the EDGAR dataset (http://edgar.sec.gov/), which contains, in XML format, all the paperwork that public companies need to file with the SEC every quarter.

Initially, owing to the side-effect-free nature of the core XQuery language, VXQuery could target analytics-style queries against such data while harnessing the power (CPU and I/O) of multiple, possibly multi-core, processors.

An example query like,

count(
 for $d in collection('EDGAR')
 where $d/COMPANY_NAME = 'IBM'
 return $d
)

which counts the number of documents that have IBM as the company name, would be evaluated by running the FLWOR expression inside the count independently on each machine that stores a part of the collection, producing a local count on each machine. The local counts from the leaf machines would then be summed in one place to produce the result of the query.
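To make the evaluation strategy above concrete, here is a minimal single-process sketch in Python. It is only an illustration, not how VXQuery or Hyracks would implement it: the PARTITIONS data, the FILING root element, and the function names are all made up for the example, and each inner list simply stands in for the documents held by one machine. It also assumes COMPANY_NAME is a direct child of each document's root element.

```python
import xml.etree.ElementTree as ET

# Hypothetical data: each inner list stands in for the documents on one machine.
PARTITIONS = [
    ["<FILING><COMPANY_NAME>IBM</COMPANY_NAME></FILING>",
     "<FILING><COMPANY_NAME>Oracle</COMPANY_NAME></FILING>"],
    ["<FILING><COMPANY_NAME>IBM</COMPANY_NAME></FILING>"],
    ["<FILING><COMPANY_NAME>SAP</COMPANY_NAME></FILING>",
     "<FILING><COMPANY_NAME>IBM</COMPANY_NAME></FILING>"],
]


def local_count(docs, company):
    """Count matching documents in one partition (the per-machine step)."""
    count = 0
    for text in docs:
        root = ET.fromstring(text)
        # Mirrors the general comparison in the query: a document matches
        # if any COMPANY_NAME child equals the requested name.
        if any(e.text == company for e in root.findall("COMPANY_NAME")):
            count += 1
    return count


def global_count(partitions, company):
    """Sum the local counts in one place (the final aggregation step)."""
    return sum(local_count(docs, company) for docs in partitions)


if __name__ == "__main__":
    print(global_count(PARTITIONS, "IBM"))
```

Because each local count depends only on the documents in its own partition, the per-machine step needs no communication; only the small integer counts have to travel to the place where they are summed.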

For the past three years, Mike and I have been involved with another Apache-licensed project called Hyracks (http://code.google.com/p/hyracks/) that provides an efficient runtime for data-parallel tasks. But what is more interesting from the VXQuery point of view is a logical algebra abstraction in the Hyracks project called Algebricks.

Algebricks is an extended nested-relational algebra library that can be used to express query semantics of a large set of declarative data processing languages (to which XQuery belongs). The Algebricks framework automatically optimizes the specified algebraic expression into a physical plan and parallelizes it to use the Hyracks runtime. In fact, the goal of the Algebricks platform was to provide a simple path for language implementors to quickly get new declarative data languages running efficiently, in parallel, on a shared-nothing cluster of machines.

The high-level tasks that we will have to complete to get VXQuery running on a cluster would be:

1. Build a translator that converts the existing XQuery AST object model that is emitted by the parser into an Algebricks algebra expression.
2. Build an implementation of the Metadata interface needed by Algebricks that helps the runtime resolve things like the location of base data.
3. Build an implementation of the runtime function call interface so the actual function work is done by code that already exists in VXQuery, but is invoked by the Hyracks runtime.
4. Implement serializers/deserializers for the various data model pieces in VXQuery to be able to transport data across machines.
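As a rough illustration of what task 4 involves, here is a small Python sketch of a length-prefixed encoding for a typed atomic value, with a round trip back to the original. This is not the VXQuery or Hyracks wire format; the layout (a 2-byte tag length, the tag, a 4-byte payload length, the payload) and the names serialize_item and deserialize_item are assumptions made up for the example.

```python
import struct


def serialize_item(type_tag: str, value: str) -> bytes:
    """Encode one (type, value) pair as length-prefixed UTF-8 fields.

    Hypothetical layout: >H tag length, tag bytes, >I payload length, payload.
    """
    tag = type_tag.encode("utf-8")
    payload = value.encode("utf-8")
    return (struct.pack(">H", len(tag)) + tag +
            struct.pack(">I", len(payload)) + payload)


def deserialize_item(buf: bytes, offset: int = 0):
    """Decode one item starting at offset; return ((type, value), next_offset)."""
    (tag_len,) = struct.unpack_from(">H", buf, offset)
    offset += 2
    tag = buf[offset:offset + tag_len].decode("utf-8")
    offset += tag_len
    (payload_len,) = struct.unpack_from(">I", buf, offset)
    offset += 4
    value = buf[offset:offset + payload_len].decode("utf-8")
    offset += payload_len
    return (tag, value), offset


if __name__ == "__main__":
    # Two items written back to back, as they might be in one network frame.
    frame = serialize_item("xs:string", "IBM") + serialize_item("xs:integer", "42")
    first, pos = deserialize_item(frame)
    second, pos = deserialize_item(frame, pos)
    print(first, second)
```

The length prefixes let a receiver walk a buffer of many items without parsing the payloads, which is the property that matters when data is shipped between machines.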

If someone is looking for a project in the context of GSoC, this task of building a parallel XQuery engine could be an interesting one.


Thoughts?


Thanks,
Vinayak


On 02/08/2012 08:18 AM, Cezar Andrei wrote:
I like the idea, sounds very interesting.
Vinayak, will you put your thoughts in more detail and maybe make a list of
features that we can use for the GSOC list?

Cezar

On Wed, Feb 8, 2012 at 8:52 AM, Michael Carey<[email protected]>  wrote:

This sounds like a great direction, and one that would be very interesting
to the community!  (Except maybe Marklogic? :-))

Cheers,
Mike



On 2/8/12 1:44 AM, Vinayak Borkar wrote:

Guys,


Given that we are at a juncture where either we try to build out this
project and build a community OR remove the project from the incubator, I
have a proposal that I feel will help us get community interest in the
project.

My proposal is to slightly change the focus of the project to cater to
different use cases than originally proposed while still continuing to
build an XQuery processor.

In the original proposal, we proposed to build an XQuery processor to
target multiple input formats. In the beginning there was a lot of interest
in this direction from the mentors which has seemed to quiet down recently.
In the meantime, people have been increasingly interested in processing
large amounts of data. To this end, I propose that we switch the focus of
the VXQuery project to target big XML data use cases.

In terms of work done, we get to reuse a majority of the code that has
already been built. In terms of the tasks to be done, the immediate focus
will be on parallelizing the existing codebase to be able to handle large
amounts of XML data.

I feel that this slight change in focus will be a fun challenge from a
development standpoint and also will help us gain a community given the
growing interest in Big data processing.

Looking forward to your thoughts.

Thanks,
Vinayak



