Hi Dave,

The goal of VXQuery is to provide a parallel XML data processing system based on the XQuery language (http://www.w3.org/TR/xquery/). VXQuery achieves high performance by using lots of commodity machines to execute parts of a query in parallel, much like Hive does with SQL queries. In contrast with systems like Hive and Pig, VXQuery is built upon a more flexible data-parallel runtime platform called Hyracks (http://hyracks.googlecode.com).

To process large amounts of EDGAR data, one would start by distributing the XML files across the disks of the machines running VXQuery, placing them in, say, the same folder on each machine (/data/edgar).

Q1:
The following query would then count the total number of XML files across all the machines:

count(collection('/data/edgar'))


Q2:
To answer your question about finding the 10-Ks (sorted newest to oldest) for a company, you would write the following (I am making up the field names, but EDGAR contains equivalent fields in each document):


for $doc in collection('/data/edgar')
where $doc/documentType = '10-K'
  and $doc/company = 'ACME Inc.'
order by xs:dateTime($doc/filingDate) descending
return $doc


In the upcoming release of VXQuery, the only trick in the book to speed up queries will be to scan the data on each machine in parallel, filter and aggregate results locally, and then combine the partial results obtained at each machine.
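
To make the scan-locally-then-combine idea concrete, here is a minimal sketch in plain Python. It is not VXQuery or Hyracks code; the in-memory partitions, the function names, and the field names are all made up for illustration, with one inner list standing in for the documents under /data/edgar on one machine.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the documents stored on each machine;
# one inner list per machine.
PARTITIONS = [
    [{"documentType": "10-K", "company": "ACME Inc."},
     {"documentType": "10-Q", "company": "ACME Inc."}],
    [{"documentType": "10-K", "company": "Other Corp."}],
    [{"documentType": "10-K", "company": "ACME Inc."},
     {"documentType": "8-K", "company": "ACME Inc."}],
]


def local_count(docs, predicate):
    """Scan one machine's documents and return a partial count."""
    return sum(1 for doc in docs if predicate(doc))


def parallel_count(partitions, predicate=lambda doc: True):
    """Run the local scans in parallel, then combine the partial counts."""
    with ThreadPoolExecutor() as pool:
        partials = list(
            pool.map(lambda docs: local_count(docs, predicate), partitions)
        )
    return sum(partials)


def is_acme_10k(doc):
    """The filter from Q2, using the same made-up field names."""
    return doc["documentType"] == "10-K" and doc["company"] == "ACME Inc."


total_docs = parallel_count(PARTITIONS)               # analogous to Q1
acme_10ks = parallel_count(PARTITIONS, is_acme_10k)   # Q2's filter, counted
```

Only one small number per machine crosses the network in this sketch, which is why local aggregation before the combine step pays off.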

One of the projects planned for the future (hopefully the next release) is to build an index locally on each machine over various fields of the documents, so that Q2 can be answered even more quickly by performing an index lookup instead of scanning all the files.
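
As a rough picture of what a per-machine index lookup buys over a scan, here is a minimal sketch in plain Python. It assumes a simple in-memory hash index and says nothing about how VXQuery will actually implement indexing; the function names, field names, and sample documents are hypothetical.

```python
from collections import defaultdict

# Hypothetical documents held by one machine.
LOCAL_DOCS = [
    {"documentType": "10-K", "company": "ACME Inc.",
     "filingDate": "2011-03-01T00:00:00"},
    {"documentType": "10-Q", "company": "ACME Inc.",
     "filingDate": "2012-05-10T00:00:00"},
    {"documentType": "10-K", "company": "ACME Inc.",
     "filingDate": "2012-03-02T00:00:00"},
]


def build_local_index(docs, fields):
    """Map each combination of field values to the positions of matching docs."""
    index = defaultdict(list)
    for pos, doc in enumerate(docs):
        key = tuple(doc[field] for field in fields)
        index[key].append(pos)
    return index


def lookup(docs, index, key):
    """Fetch matching documents via the index instead of scanning every one."""
    return [docs[pos] for pos in index.get(key, [])]


# Built once per machine, then reused by every query on these fields.
index = build_local_index(LOCAL_DOCS, ("documentType", "company"))

# Q2 on this machine: look up the 10-Ks for one company, newest first.
# ISO-8601 timestamps of the same shape sort correctly as plain strings.
hits = lookup(LOCAL_DOCS, index, ("10-K", "ACME Inc."))
hits.sort(key=lambda doc: doc["filingDate"], reverse=True)
```

The lookup touches only the matching documents, so the cost of Q2 on each machine depends on the number of hits rather than on the number of files stored there.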


Hope that gives you a glimpse into XML query processing with VXQuery.


Vinayak

On 4/8/13 5:00 PM, Dave Fisher wrote:
Hi VXQuery Devs,

I'm your volunteer incubator shepherd for this report. I can see from your 
activity and report that you seem to again have a need for mentors.

I signed up in part because I was intrigued by the comments regarding querying 
the filings at sec.gov managed by Edgar. I know a bit about the structure and 
challenges of this data. XBRL, HTML and PDF documents contain the accounting 
data with some great variation, etc.

Tell me how VXQuery would be a quick, performant solution for a query like latest 
10-K, Total Revenue, All US Companies? Or all 13D filings of >5% ownership of US 
companies in the last year?

Regards,
Dave

