Re: VXQuery focus

Vinayak Borkar Wed, 08 Feb 2012 10:35:55 -0800

Thanks Cezar/Mike.

Imagine we have a large number of relatively small XML documents stored on a cluster of computers. For example, documents from the EDGAR dataset (http://edgar.sec.gov/), which is all the paperwork that public companies need to file every quarter with the SEC, in XML format.

Initially, owing to the side-effect-free nature of the core XQuery language, VXQuery could target analytics-style queries against such data while harnessing the power (CPU and I/O) of multiple, possibly multi-core, processors.


An example query like,

count(
 for $d in collection('EDGAR')
 where $d/COMPANY_NAME = 'IBM'
 return $d
)

which counts the number of documents that have IBM as the company name would be evaluated by essentially running the FLWOR inside the count independently on each machine that stores a part of the entire collection, to compute local counts. Finally, the local counts from each leaf machine would be summed up in one place to produce the result of the query.
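
To make the evaluation strategy above concrete, here is a minimal single-process sketch in Python. It is only an illustration, not VXQuery or Hyracks code: the PARTITIONS list, the FILING root element, and the function names are assumptions made up for the example, with each inner list standing in for the documents stored on one machine.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a collection spread across three machines;
# each inner list holds the small XML documents stored on one node.
PARTITIONS = [
    ["<FILING><COMPANY_NAME>IBM</COMPANY_NAME></FILING>",
     "<FILING><COMPANY_NAME>Oracle</COMPANY_NAME></FILING>"],
    ["<FILING><COMPANY_NAME>IBM</COMPANY_NAME></FILING>"],
    [],
]


def local_count(documents, company):
    """Apply the FLWOR body to one node's documents and count the matches."""
    count = 0
    for text in documents:
        root = ET.fromstring(text)
        # Assumed layout: COMPANY_NAME is a child of the root element.
        names = root.findall("COMPANY_NAME")
        if any(name.text == company for name in names):
            count += 1
    return count


def distributed_count(partitions, company):
    """Compute one local count per partition, then sum them in one place."""
    local_counts = [local_count(docs, company) for docs in partitions]
    return sum(local_counts)


if __name__ == "__main__":
    print(distributed_count(PARTITIONS, "IBM"))
```

In a real deployment each local_count call would run on the machine that holds the partition, and only the small integer results would travel over the network to be summed.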

For the past three years Mike and I have been involved with another Apache Licensed project called Hyracks (http://code.google.com/p/hyracks/) that provides an efficient runtime for data-parallel tasks. But what is more interesting from the VXQuery point of view is a logical algebra abstraction in the Hyracks project called Algebricks.

Algebricks is an extended nested-relational algebra library that can be used to express query semantics of a large set of declarative data processing languages (to which XQuery belongs). The Algebricks framework automatically optimizes the specified algebraic expression into a physical plan and parallelizes it to use the Hyracks runtime. In fact, the goal of the Algebricks platform was to provide a simple path for language implementors to quickly get new declarative data languages running efficiently, in parallel, on a shared-nothing cluster of machines.

The high-level tasks that we will have to complete to get VXQuery running on a cluster would be:

1. Build a translator that converts the existing XQuery AST object model that is emitted by the parser into an Algebricks algebra expression.

2. Build an implementation of the Metadata interface needed by Algebricks that helps the runtime resolve things like the location of base data.

3. Build an implementation of the runtime function call interface so the actual function work is done by code that already exists in VXQuery, but is invoked by the Hyracks runtime.

4. Implement serializers/deserializers for the various data model pieces in VXQuery to be able to transport data across machines.
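
As a rough illustration of what task 4 involves, the following Python sketch round-trips two kinds of data model items through a tagged binary encoding. Everything in it is hypothetical: the tag values, the byte layout, and the function names are invented for the example and do not describe the actual wire format used by VXQuery or Hyracks.

```python
import struct

# Hypothetical one-byte type tags; not VXQuery's actual wire format.
TAG_INTEGER = 1
TAG_STRING = 2


def serialize(item):
    """Encode one item as a type tag followed by a big-endian payload."""
    if isinstance(item, int):
        return struct.pack(">Bq", TAG_INTEGER, item)
    if isinstance(item, str):
        data = item.encode("utf-8")
        # Strings carry a 4-byte length prefix before the UTF-8 bytes.
        return struct.pack(">BI", TAG_STRING, len(data)) + data
    raise TypeError("unsupported item type: %r" % type(item))


def deserialize(buf, offset=0):
    """Decode one item at offset; return (value, offset of the next item)."""
    tag = buf[offset]
    if tag == TAG_INTEGER:
        (value,) = struct.unpack_from(">q", buf, offset + 1)
        return value, offset + 9
    if tag == TAG_STRING:
        (length,) = struct.unpack_from(">I", buf, offset + 1)
        start = offset + 5
        return buf[start:start + length].decode("utf-8"), start + length
    raise ValueError("unknown type tag: %d" % tag)


if __name__ == "__main__":
    frame = serialize(2) + serialize("IBM")
    first, pos = deserialize(frame)
    second, pos = deserialize(frame, pos)
    print(first, second)
```

The point of the tag is that the receiving machine can rebuild a typed sequence from a flat byte buffer without any shared in-memory state, which is what a shared-nothing runtime needs.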

If someone is looking for a project in the context of GSoC, I can see that this task of building a parallel XQuery engine could be an interesting one.



Thoughts?


Thanks,
Vinayak


On 02/08/2012 08:18 AM, Cezar Andrei wrote:

I like the idea, sounds very interesting.
Vinayak, will you put your thoughts in more detail and maybe make a list of
features that we can use for the GSOC list?

Cezar

On Wed, Feb 8, 2012 at 8:52 AM, Michael Carey<[email protected]>  wrote:

This sounds like a great direction, and one that would be very interesting
to the community!  (Except maybe Marklogic? :-))

Cheers,
Mike



On 2/8/12 1:44 AM, Vinayak Borkar wrote:

Guys,


Given that we are at a juncture where either we try to build out this
project and build a community OR remove the project from the incubator, I
have a proposal that I feel will help us get community interest in the
project.

My proposal is to slightly change the focus of the project to cater to
different use cases than originally proposed while still continuing to
build an XQuery processor.

In the original proposal, we proposed to build an XQuery processor to
target multiple input formats. In the beginning there was a lot of interest
in this direction from the mentors which has seemed to quiet down recently.
In the meantime, people have been increasingly interested in processing
large amounts of data. To this end, I propose that we switch the focus of
the VXQuery project to target big XML data use cases.

In terms of work done, we get to reuse a majority of the code that has
already been built. In terms of the tasks to be done, the immediate focus
will be on parallelizing the existing codebase to be able to handle large
amounts of XML data.

I feel that this slight change in focus will be a fun challenge from a
development standpoint and also will help us gain a community given the
growing interest in Big data processing.

Looking forward to your thoughts.

Thanks,
Vinayak
