Re: [basex-talk] Func def & performance: element()* vs item()*

2019-04-12 Thread Chuck Bearden
The snapshot executes both versions of my recursive function (one with
'element()*' and one with 'item()*' for the sequence of elements)
equally fast, which is to say in about 4s. I verified that 9.1.2
executes the 'item()*' one in about the same time, but the
'element()*' one drags on for a few minutes before I stop it.

Note that I exit & restart BaseX between tests.

Thanks Christian!


On Fri, Apr 12, 2019 at 1:12 PM Christian Grün wrote:
>
> A new snapshot is online! Looking forward to feedback.


Re: [basex-talk] Func def & performance: element()* vs item()*

2019-04-12 Thread Christian Grün
A new snapshot is online! Looking forward to feedback.




Re: [basex-talk] Func def & performance: element()* vs item()*

2019-04-11 Thread Christian Grün
> You may be interested in my 
> https://github.com/rhdunn/xquery-intellij-plugin/blob/master/docs/XQuery%20IntelliJ%20Plugin%20Data%20Model.md
>  document. It is the result of previous investigations in supporting static 
> type analysis in my XQuery plugin.

Thanks, I’ll definitely have a look at your document.

In BaseX, we don’t have union types; apart from that, our static
typing system should be close to complete (if some types in the query
plan should indicate otherwise, feedback is welcome). For example, we
are iteratively computing the type of built-in higher-order functions
at compile time. The static type of the following expression is
xs:decimal+:

  fold-left(1 to 123, 4.5, function($tmp, $curr) {
    ($curr, $tmp)
  })

And the static type of the next expression is xs:short+, not
xs:integer+ as one might guess: the type is derived from the actual
types of the input arguments, which will never yield an xs:integer
result:

  fold-left((1 to 20) ! xs:byte(.), (), function($tmp, $curr) {
    $tmp,
    if($curr instance of xs:byte) then xs:short($curr)
    else xs:integer($curr)
  })

One missing link is that the static type is not always propagated to
the result (i.e., to the internal result representation, which can be
a plain untyped array of typed items, in particular if the result is
generated by untyped iterators). I think I’ve found a good trade-off
between performance and better runtime typing. In future, the
intersection of the type of the original expression and the resulting
type will be assigned to the resulting value instance.
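As a rough cross-check of the two examples above, the sketch below tests the dynamic types of the results with "instance of". This only inspects the runtime values; it cannot show the static type itself, which is what the query plan reports. The variable names $decimals and $shorts are made up for illustration.

```xquery
(: Sketch only: dynamic type checks on the two fold-left results. :)
let $decimals := fold-left(1 to 123, 4.5, function($tmp, $curr) {
  ($curr, $tmp)
})
let $shorts := fold-left((1 to 20) ! xs:byte(.), (), function($tmp, $curr) {
  $tmp,
  if ($curr instance of xs:byte) then xs:short($curr)
  else xs:integer($curr)
})
return (
  (: holds: xs:integer is derived from xs:decimal :)
  $decimals instance of xs:decimal+,
  (: does not hold: the seed value 4.5 is not an integer :)
  $decimals instance of xs:integer+,
  (: holds: every xs:byte input item is converted to xs:short :)
  $shorts instance of xs:short+
)
```

The three checks should come out as true, false and true, which is consistent with the static types named above.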


Re: [basex-talk] Func def & performance: element()* vs item()*

2019-04-11 Thread Reece Dunn
Hi Christian,

On Thu, 11 Apr 2019 at 13:37, Christian Grün wrote:

> Hi Chuck,
>
> Martin already suggested that map construction via map:merge is
> preferable and faster (my personal experience is that there are just
> few cases in which map:put is a better choice).
>
> Your query was an interesting one, though. In various cases, we drop
> type information at runtime, as it can be expensive to decorate all
> newly generated sequences with the correct type. As a result, the type
> of your function arguments is verified every time the function is
> called, and this takes additional time.
>
> But as it’s always recommendable to declare types, and as this is not
> the first time that this is chasing me, I had some more thoughts, and
> I have found a good answer on how to improve generally typing at
> runtime! You can already be sure that your query will benefit from the
> upcoming optimizations, i.e., with BaseX 9.2.
>

You may be interested in my
https://github.com/rhdunn/xquery-intellij-plugin/blob/master/docs/XQuery%20IntelliJ%20Plugin%20Data%20Model.md
document. It is the result of previous investigations in supporting static
type analysis in my XQuery plugin. Specifically:
1.  3.2.1 Item Type Union -- computing the best matching union type of two
item types.
2.  3.2.2 Sequence Type Union -- computing the union of two sequences for
use in disjoint expressions such as the if and else branches of an IfExpr.
3.  3.2.3 Sequence Type Addition -- computing the resulting type that best
matches an Expr.

The advantage of this is that the type information can be computed at
compile time.

I was able to get a basic prototype implementation working for some
expressions, and have tested the logic for the rules in that document. I
haven't worked on this recently, as I have been adding other features to my
plugin.

Kind regards,
Reece

> Due to this, and due to some other minor optimizations that are still
> in progress, we decided to delay the release until beginning of next
> week.
>
> Cheers
> Christian


Re: [basex-talk] Func def & performance: element()* vs item()*

2019-04-11 Thread Christian Grün
Hi Chuck,

Martin already suggested that map construction via map:merge is
preferable and faster (my personal experience is that there are only
a few cases in which map:put is a better choice).

Your query was an interesting one, though. In various cases, we drop
type information at runtime, as it can be expensive to decorate all
newly generated sequences with the correct type. As a result, the type
of your function arguments is verified every time the function is
called, and this takes additional time.
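One way to see the effect described above is to restructure the query so that the typed function is entered only once. The following is a minimal sketch, not the fix that went into BaseX: local:count-names and the inline sample data are made up for illustration, and whether this is actually faster on 9.1.2 has not been measured here.

```xquery
declare namespace map = "http://www.w3.org/2005/xpath-functions/map";

(: Sketch only: $elems is checked against element()* once, when
   local:count-names (a hypothetical name) is called. The inline
   accumulator function declares no argument types. :)
declare function local:count-names($elems as element()*) as map(*) {
  fold-left($elems, map {}, function($counts, $elem) {
    let $name := name($elem)
    (: map:get returns () for a missing key, so fall back to 0 :)
    return map:put($counts, $name, (map:get($counts, $name), 0)[1] + 1)
  })
};

(: Hypothetical sample data in place of a real database :)
let $sample := <items><a/><b/><a/></items>
return local:count-names($sample/*)
```

With the sample data, "a" is counted twice and "b" once.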

But as it’s always advisable to declare types, and as this is not
the first time that this issue has caught up with me, I had some more
thoughts, and I have found a good answer on how to generally improve
typing at runtime! You can already be sure that your query will
benefit from the upcoming optimizations, i.e., with BaseX 9.2.

Due to this, and due to some other minor optimizations that are still
in progress, we decided to delay the release until the beginning of
next week.

Cheers
Christian





Re: [basex-talk] Func def & performance: element()* vs item()*

2019-04-10 Thread Martin Honnen

On 11.04.2019 at 00:09, Chuck Bearden wrote:

> BaseX is a great tool for analyzing & characterizing large amounts of
> XML data. I have used it both at work and on personal projects. I hope
> the following observation is useful.
>
> When I define a function that recurs over a sequence of elements in
> order to build a map of element name counts, I find that when I
> specify the type of the element sequence as 'element()*', the function
> runs so slowly that I give up after 5 minutes or so. But when I
> specify the type as 'item()*', it finishes in 40 seconds or less.
> Here's an example:
>
> -begin code snippet-
> declare namespace local="w00fw00f";
> declare function local:count($elems as element()*, $elem_counts as map(*))
>     as map(*) {
>   let $elem := head($elems),
>       $elem_name := $elem/name(),
>       $elems_new := tail($elems),
>       $elem_name_count := if (map:contains($elem_counts, $elem_name))
>                           then map:get($elem_counts, $elem_name) + 1
>                           else 1,
>       $elem_counts_new := map:put($elem_counts, $elem_name, $elem_name_count)
>   return if (count($elems_new) = 0)
>          then $elem_counts_new
>          else local:count($elems_new, $elem_counts_new)
> };
>
> let $coll := collection('pure_20190402'),
>     $elems := $coll/result/items/*,
>     $elem_names_map := local:count($elems, map {})



It seems that the task of building the map can also be solved with
grouping:

let $elem_names_map := map:merge(
  for $item in $coll/result/items/*
  group by $name := name($item)
  return map { $name : count($item) }
)


Not sure whether that improves performance.
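For anyone who wants to try the grouping approach without the original database, here is a self-contained sketch. The inline document is hypothetical sample data standing in for collection('pure_20190402'); the element names in it are made up.

```xquery
declare namespace map = "http://www.w3.org/2005/xpath-functions/map";

(: Hypothetical sample data standing in for collection('pure_20190402') :)
let $coll := document {
  <result>
    <items>
      <article/><book/><article/><thesis/><article/>
    </items>
  </result>
}
(: After "group by", $item is bound to all elements sharing one name,
   so count($item) is the number of elements with that name. :)
let $elem_names_map := map:merge(
  for $item in $coll/result/items/*
  group by $name := name($item)
  return map { $name : count($item) }
)
return $elem_names_map
```

Each group contributes one single-entry map, and map:merge combines them; since the keys are distinct by construction, no duplicate handling is needed.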



[basex-talk] Func def & performance: element()* vs item()*

2019-04-10 Thread Chuck Bearden
BaseX is a great tool for analyzing & characterizing large amounts of
XML data. I have used it both at work and on personal projects. I hope
the following observation is useful.

When I define a function that recurs over a sequence of elements in
order to build a map of element name counts, I find that when I
specify the type of the element sequence as 'element()*', the function
runs so slowly that I give up after 5 minutes or so. But when I
specify the type as 'item()*', it finishes in 40 seconds or less.
Here's an example:

-begin code snippet-
declare namespace local="w00fw00f";
declare function local:count($elems as element()*, $elem_counts as map(*))
    as map(*) {
  let $elem := head($elems),
      $elem_name := $elem/name(),
      $elems_new := tail($elems),
      $elem_name_count := if (map:contains($elem_counts, $elem_name))
                          then map:get($elem_counts, $elem_name) + 1
                          else 1,
      $elem_counts_new := map:put($elem_counts, $elem_name, $elem_name_count)
  return if (count($elems_new) = 0)
         then $elem_counts_new
         else local:count($elems_new, $elem_counts_new)
};

let $coll := collection('pure_20190402'),
    $elems := $coll/result/items/*,
    $elem_names_map := local:count($elems, map {})
return json:serialize($elem_names_map, map {'format' : 'xquery'})
-end code snippet-

In the function declaration, changing "$elems as element()*" to
"$elems as item()*" makes the difference in performance. Replacing the
JSON serialization with a standard XML one does not change the
performance. I am running BaseX 9.1.2 under Ubuntu 16.04.6.
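To make the comparison easy to reproduce on a small scale, here is a minimal sketch of the 'item()*' variant that runs without the database. The inline <items> element is hypothetical sample data, the explicit map namespace declaration is added only for portability, and the empty-input guard is an addition that the snippet above does not have.

```xquery
declare namespace map = "http://www.w3.org/2005/xpath-functions/map";

(: Sketch only: same recursion as above, with $elems typed as item()*.
   Swap item()* for element()* in the signature to compare timings. :)
declare function local:count($elems as item()*, $elem_counts as map(*))
    as map(*) {
  if (empty($elems)) then
    (: guard added in this sketch; also ends the recursion :)
    $elem_counts
  else
    let $elem_name := name(head($elems)),
        $elem_name_count := if (map:contains($elem_counts, $elem_name))
                            then map:get($elem_counts, $elem_name) + 1
                            else 1
    return local:count(
      tail($elems),
      map:put($elem_counts, $elem_name, $elem_name_count)
    )
};

(: Hypothetical sample data in place of collection('pure_20190402') :)
let $sample := <items><article/><book/><article/><thesis/><article/></items>
return local:count($sample/*, map {})
```

On this sample, "article" is counted three times and "book" and "thesis" once each; the timing difference itself only shows up on large inputs.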

All the best,
Chuck Bearden