Hi Dave,
The goal of VXQuery is to provide a parallel XML data processing system
based on the XQuery language (http://www.w3.org/TR/xquery/). VXQuery
achieves high performance by using lots of commodity machines to execute
parts of a query in parallel, much like Hive does with SQL queries. In
contrast with systems like Hive and Pig, VXQuery is built upon a more
flexible data-parallel runtime platform called Hyracks
(http://hyracks.googlecode.com).
To process large amounts of EDGAR data, one would start by distributing
the XML files across the disks of the machines running VXQuery, say in
the same folder on each machine (/data/edgar).
Q1:
The following query would then count the total number of XML files
across all the machines:
    count(collection('/data/edgar'))
Q2:
To answer your question about finding the 10-Ks (sorted newest to
oldest) for a company, you would write (I am making up the field names,
but EDGAR contains equivalent fields in each document):
    for $doc in collection('/data/edgar')
    where $doc/documentType = '10-K'
      and $doc/company = 'ACME Inc.'
    order by xs:dateTime($doc/filingDate) descending
    return $doc
In the upcoming release of VXQuery, the only technique for speeding up
queries will be to scan the data on each machine in parallel, filter and
aggregate results locally, and then combine the partial results obtained
from each machine.
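As a rough illustration of that scan-locally-then-combine pattern, here
is a small Python sketch. It is not VXQuery or Hyracks code: each
"machine" is modelled as an in-memory list of documents, threads stand in
for the machines, and the names local_count and parallel_count are made
up for the example. The documentType field is the same invented field
name used in Q2.

```python
# Conceptual sketch only: this is not VXQuery or Hyracks code.
# Each "machine" is modelled as a list of documents (dicts) held in memory.
from concurrent.futures import ThreadPoolExecutor


def local_count(docs, doc_type):
    """Scan one machine's documents and return its partial count."""
    return sum(1 for d in docs if d["documentType"] == doc_type)


def parallel_count(machines, doc_type):
    """Run the local scan for every machine, then combine the partial counts."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda docs: local_count(docs, doc_type), machines))
    return sum(partials)


# Hypothetical data: two machines holding a few filings each.
machines = [
    [{"documentType": "10-K"}, {"documentType": "8-K"}],
    [{"documentType": "10-K"}, {"documentType": "10-K"}],
]
total = parallel_count(machines, "10-K")
```

Only one small number per machine has to travel over the network in this
scheme, rather than the documents themselves.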
One of the projects planned for the future (hopefully the next release)
is to build an index locally on each machine over various fields of the
documents, so that Q2 can be answered even more quickly by performing an
index lookup instead of scanning all the files.
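To make the difference between a scan and an index lookup concrete, here
is a minimal Python sketch of a per-machine index on a single field.
Again, this is an assumption-laden illustration and says nothing about
how VXQuery will actually implement indexing; build_index and lookup are
hypothetical helpers, and company is the invented field name from Q2.

```python
# Conceptual sketch only: not the planned VXQuery index implementation.
from collections import defaultdict


def build_index(docs, field):
    """Map each value of `field` to the positions of the documents holding it."""
    index = defaultdict(list)
    for pos, doc in enumerate(docs):
        index[doc[field]].append(pos)
    return index


def lookup(docs, index, value):
    """Fetch matching documents via the index, without scanning every document."""
    return [docs[pos] for pos in index.get(value, [])]


# Hypothetical documents stored on one machine.
docs = [
    {"company": "ACME Inc.", "documentType": "10-K"},
    {"company": "Other Corp.", "documentType": "10-K"},
    {"company": "ACME Inc.", "documentType": "8-K"},
]
company_index = build_index(docs, "company")
acme_docs = lookup(docs, company_index, "ACME Inc.")
```

The index is built once per machine; each later query on that field then
touches only the matching documents.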
Hope that gives you a glimpse into XML query processing with VXQuery.
Vinayak
On 4/8/13 5:00 PM, Dave Fisher wrote:
Hi VXQuery Devs,
I'm your volunteer incubator shepherd for this report. I can see from your
activity and report that you seem to again have a need for mentors.
I signed up in part because I was intrigued by the comments regarding querying
the filings at sec.gov managed by Edgar. I know a bit about the structure and
challenges of this data. XBRL, HTML and PDF documents contain the accounting
data with some great variation, etc.
Tell me how VXQuery would be a quick, performant solution for a query like latest
10-K, Total Revenue, All US Companies? Or all 13D Filings of >5% ownership of US
companies in the last year?
Regards,
Dave