Re: [Shepherd] Checking In With A Challenge

Vinayak Borkar Tue, 09 Apr 2013 17:39:16 -0700

Hi Dave,

The goal of VXQuery is to provide a parallel XML data processing system based on the XQuery language (http://www.w3.org/TR/xquery/). VXQuery achieves high performance by using lots of commodity machines to execute parts of a query in parallel, much like Hive does with SQL queries. In contrast with systems like Hive and Pig, VXQuery is built upon a more flexible data-parallel runtime platform called Hyracks (http://hyracks.googlecode.com).

To process large amounts of EDGAR data, one would start by distributing the XML files across the disks of the different machines running VXQuery, in, say, a folder on each machine (/data/edgar).

Q1:

A query as follows would then count the total number of XML files across all the machines:


count(collection('/data/edgar'))


Q2:

To answer your question of finding 10-Ks (sorted newest to oldest) for a company, you would say (I am making up the field names, but EDGAR contains equivalent fields in each document):



for $doc in collection('/data/edgar')
where $doc/documentType = '10-K'
  and $doc/company = 'ACME Inc.'
order by xs:dateTime($doc/filingDate) descending
return $doc

In the upcoming release of VXQuery, the only trick in the book to speed up queries will be to scan data on each machine in parallel to filter and aggregate results locally and then combine the partial results obtained at each machine.
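
To make the scan-locally-then-combine idea concrete, here is a rough Python sketch. It is not VXQuery or Hyracks code: the in-memory PARTITIONS list stands in for the files under /data/edgar on each machine, threads stand in for machines, and the names local_count, parallel_count and is_acme_10k are made up for illustration, as are the field names.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the XML files under /data/edgar on each machine.
# Each inner list is one machine's local partition; field names are made up.
PARTITIONS = [
    [{"documentType": "10-K", "company": "ACME Inc."},
     {"documentType": "8-K", "company": "ACME Inc."}],
    [{"documentType": "10-K", "company": "Other Corp."}],
    [{"documentType": "10-K", "company": "ACME Inc."}],
]


def local_count(docs, predicate):
    """Scan one partition, filter it, and return its partial count."""
    return sum(1 for doc in docs if predicate(doc))


def parallel_count(partitions, predicate=lambda doc: True):
    """Run the local scan on every partition, then combine the partial counts."""
    with ThreadPoolExecutor() as pool:
        partials = list(
            pool.map(lambda docs: local_count(docs, predicate), partitions)
        )
    return sum(partials)


def is_acme_10k(doc):
    """Filter corresponding to the where clause of Q2."""
    return doc["documentType"] == "10-K" and doc["company"] == "ACME Inc."


total_files = parallel_count(PARTITIONS)             # analogous to Q1
acme_10ks = parallel_count(PARTITIONS, is_acme_10k)  # count of Q2 matches
```

Only one small number per partition travels to the combining step, which is why this kind of plan parallelizes well for counts and similar aggregates.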

One of the projects planned for the future (next release hopefully) is to be able to build an index locally on each machine on various fields of the document so that Q2 can be answered even more quickly by performing an index lookup instead of having to scan all the files.
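
As a rough picture of what a per-machine index would buy for Q2, here is a small Python sketch. It only illustrates the idea and says nothing about how VXQuery will actually implement indexing: the dictionary-based index, the function names build_local_index and indexed_q2, and the sample documents are all assumptions made up for this example.

```python
from collections import defaultdict
from datetime import datetime


def build_local_index(docs, fields):
    """Map each tuple of field values to the local documents that carry it."""
    index = defaultdict(list)
    for doc in docs:
        index[tuple(doc[f] for f in fields)].append(doc)
    return index


def indexed_q2(indexes, document_type, company):
    """Look up matches in each machine's index, then merge them newest first."""
    hits = []
    for index in indexes:
        # A direct lookup replaces the scan over every local file.
        hits.extend(index.get((document_type, company), []))
    return sorted(
        hits,
        key=lambda d: datetime.fromisoformat(d["filingDate"]),
        reverse=True,
    )


# Hypothetical documents on two machines; field names are made up.
machine_a = [
    {"documentType": "10-K", "company": "ACME Inc.",
     "filingDate": "2011-03-01T00:00:00"},
    {"documentType": "8-K", "company": "ACME Inc.",
     "filingDate": "2012-06-15T00:00:00"},
]
machine_b = [
    {"documentType": "10-K", "company": "ACME Inc.",
     "filingDate": "2013-02-28T00:00:00"},
]

indexes = [
    build_local_index(docs, ("documentType", "company"))
    for docs in (machine_a, machine_b)
]
results = indexed_q2(indexes, "10-K", "ACME Inc.")
```

The index is built once per machine, and each query then touches only the matching documents rather than every file in the folder.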



Hope that gives you a glimpse into XML query processing with VXQuery.


Vinayak

On 4/8/13 5:00 PM, Dave Fisher wrote:

Hi VXQuery Devs,

I'm your volunteer incubator shepherd for this report. I can see from your 
activity and report that you seem to again have a need for mentors.

I signed up in part because I was intrigued by the comments regarding querying 
the filings at sec.gov managed by Edgar. I know a bit about the structure and 
challenges of this data. XBRL, HTML and PDF documents contain the accounting 
data with some great variation, etc.

Tell me how VXQuery would be a quick performant solution to make a query like latest 
10K, Total Revenue, All US Companies? Or all 13D Filings of >5% ownership of US 
companies in the last year?

Regards,
Dave
