Re: [basex-talk] BaseX XQuery vs. python / lxml performance

2012-03-30 Thread Ronny Möbius
Hi again,

I found a big performance killer myself.

On 03/29/2012 12:18 PM, Ronny Möbius wrote:
> declare function vlvz:getlvs($semxml as node()*,$modabbr as xs:string)
> as node()*
> {
>   for $l in $semxml//Lesson
>   where $l[AssociatedModules/Module/@Abbr=$modabbr]
>   order by data($l/@ID)
>   return $l
> };

Replacing $semxml//Lesson with $semxml/Dataset/Lessons/Lesson saves
about one third of the time spent.

Why is that? I thought I didn't have to care about the inefficiency of
//. Isn’t that handled by indices?
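
A minimal sketch of the two path variants, written in Python with the standard library's xml.etree.ElementTree (an assumption made to keep it self-contained; it is not the lxml code from this thread). The sample document is hypothetical and only follows the Dataset/Lessons/Lesson layout posted in this thread. It shows that both paths select the same Lesson elements, while the descendant form has every element of the tree as a candidate. It says nothing about how BaseX itself evaluates either path.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, shaped like the structure posted in this thread.
SAMPLE = """
<Dataset>
  <Structure>
    <Institute Name="Physik">
      <Degree Abbr="ABC" Name="ABC">
        <Module Abbr="HIJ" Name="HIJ"/>
      </Degree>
    </Institute>
  </Structure>
  <Lessons>
    <Lesson ID="12345">
      <AssociatedModules><Module Abbr="HIJ"/></AssociatedModules>
    </Lesson>
    <Lesson ID="12346">
      <AssociatedModules><Module Abbr="ABC"/></AssociatedModules>
    </Lesson>
  </Lessons>
</Dataset>
"""

root = ET.fromstring(SAMPLE)  # root is the <Dataset> element

# Descendant step (comparable to //Lesson): every element is a candidate.
by_descendant = root.findall(".//Lesson")

# Explicit child steps (comparable to /Dataset/Lessons/Lesson):
# only the children of <Lessons> are looked at.
by_child_path = root.findall("Lessons/Lesson")

same_lessons = (
    [lesson.get("ID") for lesson in by_descendant]
    == [lesson.get("ID") for lesson in by_child_path]
)
elements_total = sum(1 for _ in root.iter())

print(same_lessons, len(by_child_path), elements_total)
```

The explicit path is only equivalent as long as Lesson elements never occur anywhere else in the document, which is an assumption about the data.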

All the best,
Ronny
___
BaseX-Talk mailing list
BaseX-Talk@mailman.uni-konstanz.de
https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk


Re: [basex-talk] BaseX XQuery vs. python / lxml performance

2012-03-29 Thread Michael Seiferle
Hi Ronny, 

Hi Johannes & Charles, thanks for joining the conversation.


In my opinion, and speaking officially for BaseX, I'd suppose that XML 
processing with BaseX databases should almost always[1] be faster than 
processing the XML sequentially via lxml.

However, performance may vary depending on the actual queries and/or the python 
glue code. 

I think Charles' approach of having as much logic in XQuery as possible will be 
the best option to pick here.
Maybe some of your Python code could be rewritten in XQuery as well; on the
other hand, this might not even be necessary given the XQuery rewrites Johannes
suggested.

@Ronny, maybe you could provide us with some sample code? In case it is not 
intended for the general public feel free to send it to supp...@basex.org.

Looking forward to seeing your code!

Greetings from Lake Constance

Michael 

[1] I can sure think of examples that prove me wrong ;-)

On 28.03.2012 at 23:19, Johannes.Lichtenberger wrote:

> Thus I suppose it
> would be the best to write the queries in a reply, such that the BaseX
> team can make suggestions for similar queries which better utilize
> index-structures and the query optimizations from the query processor.



Re: [basex-talk] BaseX XQuery vs. python / lxml performance

2012-03-29 Thread Ronny Möbius
Hi Johannes, Charles and Michael,

first of all, thanks for your immediate readiness to help.

I will briefly present the structure of the database:

<Dataset>
  <Structure>
    <Institute Name="Physik">
      <Degree Abbr="ABC" Name="ABC">
        <Module Abbr="HIJ" Name="HIJ">
          <!-- the Module nodes are arbitrarily nested in themselves -->
        </Module>
        <!-- more Module nodes -->
      </Degree>
      <!-- more Degree nodes -->
    </Institute>
    <!-- more Institute nodes -->
  </Structure>
  <!-- other information -->
  <Lessons>
    <Lesson ID="12345">
      <Name lang="de">Name of a Lesson</Name>
      <AssociatedModules>
        <Module Abbr="HIJ"/>
        <Module Abbr="ABC"/>
        <!-- there are 1..unbounded Modules per Lesson; only modules
             containing no modules are referenced -->
      </AssociatedModules>
      <!-- other information -->
    </Lesson>
  </Lessons>
</Dataset>

The task is now to create a list like this one:
http://vlvz1.physik.hu-berlin.de/ss2012/physik/verzeichnis/en/, that is,
the whole structure, but only with Modules that actually have associated
lessons.
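
The task can be sketched as a recursive walk over the module tree that keeps a module if it is referenced by a lesson itself or if anything nested below it is. The sketch is in Python with the standard library's xml.etree.ElementTree (an assumption for self-containment, not the lxml code from this thread). The helper name modules_to_show, the module abbreviations TOP, KLM and NOP, and the sample data are hypothetical; Module elements sit directly under Degree here, as in the structure posted above.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: TOP contains HIJ and KLM; only HIJ has a lesson.
SAMPLE = """
<Dataset>
  <Structure>
    <Institute Name="Physik">
      <Degree Abbr="ABC" Name="ABC">
        <Module Abbr="TOP" Name="TOP">
          <Module Abbr="HIJ" Name="HIJ"/>
          <Module Abbr="KLM" Name="KLM"/>
        </Module>
        <Module Abbr="NOP" Name="NOP"/>
      </Degree>
    </Institute>
  </Structure>
  <Lessons>
    <Lesson ID="12345">
      <AssociatedModules><Module Abbr="HIJ"/></AssociatedModules>
    </Lesson>
  </Lessons>
</Dataset>
"""


def modules_to_show(parent, used_abbrs):
    """Return (abbr, depth) pairs for modules with lessons in or below them."""
    def walk(node, depth):
        shown = []
        for module in node.findall("Module"):
            below = walk(module, depth + 1)
            # Keep a module if it is referenced directly or via a descendant.
            if module.get("Abbr") in used_abbrs or below:
                shown.append((module.get("Abbr"), depth))
                shown.extend(below)
        return shown
    return walk(parent, 0)


dataset = ET.fromstring(SAMPLE)
used = {
    m.get("Abbr")
    for m in dataset.findall("Lessons/Lesson/AssociatedModules/Module")
}
degree = dataset.find("Structure/Institute/Degree")
outline = modules_to_show(degree, used)
print(outline)
```

Each module is visited once, and the depth is carried along instead of being computed per node.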

The current query looks like this:

let $lang := data($ses/lang)
let $sem := data($ses/sem)
let $inst := data($ses/inst)
let $semxml := db:open("vlvz",concat($sem,'.xml'))
let $moduleswithlvs :=
  distinct-values($semxml//AssociatedModules/Module/@Abbr)
return
<span>
  <div class="struc">
  {
    for $degree in
      $semxml//Institute[@Name=$ses//inst]/Degree[Modules//Module/@Abbr=$moduleswithlvs]
    return <div class="indent">
      <span class="degree">{data($degree/@Abbr)}&#x20;{data($degree/@Name)}<br/></span>
      {
        for $module in $degree/Modules//Module[(* and
          */@Abbr=$moduleswithlvs) or @Abbr=$moduleswithlvs]
        let $leaf := not($module/*)
        let $depth := functx:depth-of-node($module)-7
        return
          <div class="indent depth{$depth}">
            {data($module/@Abbr)}&#x20;{data($module/@Name)}&#x20;<br/>
            {
              if ($leaf)
              then
                <div class="indent">
                {
                  for $lesson in vlvz:getlvs($semxml,data($module/@Abbr))
                  return <div class="lesson"><span
                    class="lessonid">{$lesson/@ID}</span><span
                    class="lessonname">{$lesson/Name[@lang=$ses//lang]}</span><span
                    class="lessonmodules">{string-join($lesson/AssociatedModules/Module/@Abbr,', ')}</span></div>
                  (: note [1] :)
                }
                </div>
              else ()
            }
          </div>
      }
    </div>
  }
  </div>
</span>

I already noticed that [1] is crucial: producing this node makes the
query run about 10 times longer than returning an empty sequence.
Returning just <div></div> instead makes no difference; it is as slow
as with its content.
I should also mention the function vlvz:getlvs:

declare function vlvz:getlvs($semxml as node()*,$modabbr as xs:string)
as node()*
{
 for $l in $semxml//Lesson
 where $l[AssociatedModules/Module/@Abbr=$modabbr]
 order by data($l/@ID)
 return $l
};
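
Since vlvz:getlvs walks over all Lesson elements again for every leaf module it is called for, one possible alternative is to group the lessons by module abbreviation in a single pass and look them up afterwards. The sketch below shows that idea in Python with the standard library's xml.etree.ElementTree (an assumption for self-containment, not the lxml code from this thread). The helper name index_lessons_by_module and the sample data are hypothetical, and whether the same restructuring pays off inside BaseX would have to be measured.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample, shaped like the Lessons part of the posted structure.
SAMPLE = """
<Dataset>
  <Lessons>
    <Lesson ID="12346">
      <AssociatedModules>
        <Module Abbr="HIJ"/>
        <Module Abbr="ABC"/>
      </AssociatedModules>
    </Lesson>
    <Lesson ID="12345">
      <AssociatedModules>
        <Module Abbr="HIJ"/>
      </AssociatedModules>
    </Lesson>
  </Lessons>
</Dataset>
"""


def index_lessons_by_module(dataset):
    """Map each module abbreviation to its lessons, sorted by ID as strings."""
    index = defaultdict(list)
    for lesson in dataset.findall("Lessons/Lesson"):
        # A set avoids listing a lesson twice if an abbreviation is repeated.
        abbrs = {
            m.get("Abbr")
            for m in lesson.findall("AssociatedModules/Module")
        }
        for abbr in abbrs:
            index[abbr].append(lesson)
    for lessons in index.values():
        # String comparison, like "order by data($l/@ID)" on untyped data.
        lessons.sort(key=lambda lesson: lesson.get("ID"))
    return dict(index)


dataset = ET.fromstring(SAMPLE)
lessons_by_module = index_lessons_by_module(dataset)

# One dictionary lookup per module replaces one scan of all lessons.
hij_ids = [lesson.get("ID") for lesson in lessons_by_module.get("HIJ", [])]
print(hij_ids)
```

The index is built once per document, so the cost of scanning the lessons no longer grows with the number of leaf modules.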


The queries are probably badly designed with respect to performance:
basically everything I have done with XQuery so far has been learning
by doing.

Best regards from the capital,
Ronny


Re: [basex-talk] BaseX XQuery vs. python / lxml performance

2012-03-28 Thread Charles Duffy

On 03/28/2012 01:49 PM, Ronny Möbius wrote:

> I'm now interested in your general opinion about this: Is it surprising
> that the XQuery implementation is slower than the lxml/python one? (For
> me it is, as I thought the indices etc. created when importing the data
> should decrease the computational effort of searching the tree.) Is
> there some catch in my approach? Might the reason be a badly designed
> query?


Howdy, Ronny --

I can't speak for the BaseX team, but I can certainly say that the 
specific queries in use make a very, very big difference.


(Personally, by the way, I'm in the process of moving as much logic as I 
can into code running in BaseX simply because using XQuery 3.0 makes it 
hard to go back to the minimal XPath 1.0 implementation that libxml2, 
and thus lxml, supports).

