Dear all,

We are happy to announce the latest alpha release of Sparksoniq.

Sparksoniq runs JSONiq queries on top of Spark, taking as input JSON datasets
stored on distributed file systems such as, but not limited to, HDFS. Its goal
is to increase productivity when querying heterogeneous, nested datasets that
are challenging to handle with DataFrames.

JSONiq is the JSON brother of XQuery (XQuery - XML + JSON) and shares 90% of 
its DNA.

Sparksoniq is open source (Apache 2.0) and can be downloaded for free. The jar
and the documentation are available at http://sparksoniq.org/.


Since the announcement of our initial prototype last year, we have made the
following progress:

- Many bug fixes following user feedback. Sparksoniq is becoming stable enough
that we are considering moving to beta soon, and it has already been used in
large classrooms.

- All FLWOR clauses are supported both in parallel and (new) locally. "Locally"
means that the expression is evaluated without invoking Spark transformations
through parallelize() or json-file() calls.

- FLWOR expressions can be nested freely, with one exception: expressions that
run in parallel cannot be nested inside each other (because Spark jobs do not
nest).

E.g.:

(: the outer FLWOR is executed in parallel on this large file, split along
   HDFS blocks :)
for $i in json-file("hdfs://path/to/orders.json")
where $i.customer eq "John Smith"
return {
  "total": sum($i.items[].amount),
  "sorted-items" : [
    for $j in $i.items[]
    order by $j.amount
    return $j
  ]
}
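
For readers less familiar with FLWOR expressions, here is a rough plain-Python
sketch of what the query above computes. It is only an analogy, not how
Sparksoniq executes the query. The sample orders and the "name" field are
made up for illustration; only customer, items and amount come from the query.

```python
# Hypothetical sample data. Only the field names customer, items and amount
# come from the query above; "name" is added purely for readability.
orders = [
    {"customer": "John Smith",
     "items": [{"name": "pen", "amount": 3}, {"name": "ink", "amount": 1}]},
    {"customer": "Jane Doe",
     "items": [{"name": "pad", "amount": 5}]},
]


def summarize(order):
    """Counterpart of the return clause: evaluated once per matching order."""
    items = order["items"]
    return {
        # sum($i.items[].amount)
        "total": sum(item["amount"] for item in items),
        # nested FLWOR: for $j in $i.items[] order by $j.amount return $j
        "sorted-items": sorted(items, key=lambda item: item["amount"]),
    }


# Counterpart of the outer for/where clauses. In Sparksoniq this is the part
# that runs in parallel; here it is an ordinary list comprehension.
results = [summarize(o) for o in orders if o["customer"] == "John Smith"]
print(results)
```

The outer loop corresponds to the parallel FLWOR, while summarize() plays the
role of the nested FLWOR that runs locally for each order.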

- We reduced the memory footprint. In particular, filtering queries are now
streamed through (within a task) rather than materialized.

- We worked on performance: Sparksoniq can handle files of 10,000,000+ objects
on a regular laptop for counting, filtering, grouping and ordering with a local
Spark execution. Performance has also improved noticeably when querying larger
datasets on clusters (tested with several billion objects on 64 machines).
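
To illustrate the difference between streaming and materializing mentioned in
the memory footprint item above, here is a minimal plain-Python sketch. It is
an analogy using generators, not Sparksoniq's actual implementation; the
function names and the generated input are hypothetical.

```python
from typing import Iterable, Iterator, List


def filter_materialized(objects: Iterable[dict], customer: str) -> List[dict]:
    """Builds the complete list of matches in memory before returning it."""
    return [o for o in objects if o.get("customer") == customer]


def filter_streamed(objects: Iterable[dict], customer: str) -> Iterator[dict]:
    """Yields matches one at a time, so no intermediate list is built."""
    for o in objects:
        if o.get("customer") == customer:
            yield o


def generate_orders(n: int) -> Iterator[dict]:
    """Hypothetical input: a lazy stream of n order objects."""
    for i in range(n):
        yield {"customer": "John Smith" if i % 2 == 0 else "Jane Doe", "id": i}


# Both variants produce the same matches. The streamed variant consumes the
# input lazily, so memory use does not grow with the number of matches.
streamed_count = sum(1 for _ in filter_streamed(generate_orders(1000), "John Smith"))
materialized_count = len(filter_materialized(generate_orders(1000), "John Smith"))
print(streamed_count, materialized_count)
```

With the streamed variant, a downstream consumer such as a count can process
each object as it arrives instead of waiting for the full result.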

Feedback is, as always, appreciated.

Kind regards
Ghislain
_______________________________________________
talk@x-query.com
http://x-query.com/mailman/listinfo/talk
