Re: [basex-talk] Optimizing Element Access By Attribute Value Matching
Happy to know it can be done. I will definitely ping you when the project turns commercial.
Re: [basex-talk] Optimizing Element Access By Attribute Value Matching
Hi France,

I guess we won't be able to spend time on this unless we have a
commercial project in sight. However, I have created a new DITA wish
list [1] to collect the features that DITA users miss most in BaseX.
All your feedback is welcome!

Christian

[1] https://github.com/BaseXdb/basex/issues/1130

On Tue, Apr 21, 2015 at 4:00 PM, France Baril wrote:
> Christian,
>
> I'm also starting a few DITA projects with BaseX that use attributes
> for matches. Other projects were DITA-like and did not need to match
> attribute tokens in @class. There is a big and growing DITA community
> out there that needs native XML database support, so I think it might
> be interesting to include indexing of attribute tokens in the roadmap
> at some point, even if only as a longer-term item.
>
> This is a very important aspect of the DITA standard. HTML classes
> also use tokens more and more, so this is not just relevant to the
> DITA community.
>
> Regards,
>
> France
>
> On Mon, Apr 13, 2015 at 2:50 PM, Liam R. E. Quin wrote:
>>
>> On Mon, 2015-04-13 at 12:38 -0500, Eliot Kimber wrote:
>>
>> > For large repositories an XQuery like
>> > "//*[contains(@class, ' topic/topic ')]" is going to be quite slow
>>
>> I took this use case to the XQuery & XSLT Working Groups a year or
>> two ago (Jirka added the DITA case - I was thinking of (X)HTML), and
>> the result was contains-token(), which might be easier for the
>> database to optimize.
>>
>> Judging by comments submitted against my awful tests for it :) I
>> think BaseX may well support it already.
>>
>> Liam
>
> --
> France Baril
> Architecte documentaire / Documentation architect
> france.ba...@architextus.com
Re: [basex-talk] Optimizing Element Access By Attribute Value Matching
Christian,

I'm also starting a few DITA projects with BaseX that use attributes
for matches. Other projects were DITA-like and did not need to match
attribute tokens in @class. There is a big and growing DITA community
out there that needs native XML database support, so I think it might
be interesting to include indexing of attribute tokens in the roadmap
at some point, even if only as a longer-term item.

This is a very important aspect of the DITA standard. HTML classes
also use tokens more and more, so this is not just relevant to the
DITA community.

Regards,

France

On Mon, Apr 13, 2015 at 2:50 PM, Liam R. E. Quin wrote:
> On Mon, 2015-04-13 at 12:38 -0500, Eliot Kimber wrote:
>
> > For large repositories an XQuery like
> > "//*[contains(@class, ' topic/topic ')]" is going to be quite slow
>
> I took this use case to the XQuery & XSLT Working Groups a year or
> two ago (Jirka added the DITA case - I was thinking of (X)HTML), and
> the result was contains-token(), which might be easier for the
> database to optimize.
>
> Judging by comments submitted against my awful tests for it :) I
> think BaseX may well support it already.
>
> Liam

--
France Baril
Architecte documentaire / Documentation architect
france.ba...@architextus.com
Re: [basex-talk] Optimizing Element Access By Attribute Value Matching
On Mon, 2015-04-13 at 12:38 -0500, Eliot Kimber wrote:
> For large repositories an XQuery like
> "//*[contains(@class, ' topic/topic ')]" is going to be quite slow

I took this use case to the XQuery & XSLT Working Groups a year or two
ago (Jirka added the DITA case - I was thinking of (X)HTML), and the
result was contains-token(), which might be easier for the database to
optimize.

Judging by comments submitted against my awful tests for it :) I think
BaseX may well support it already.

Liam
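To make the difference between the two functions concrete, here is a minimal sketch in standard XQuery 3.1. It assumes a processor that implements fn:contains-token; the sample @class value is the one from the original post, and the partial token 'topic/top' is a made-up probe.

```xquery
(: Sample @class value as given in the original post. :)
let $class := '- topic/topic concept/concept '
return (
  (: Substring test; it relies on the blanks around the token. :)
  contains($class, ' topic/topic '),      (: true :)
  (: Without the blank padding, a partial token matches as well. :)
  contains($class, 'topic/top'),          (: true, a false positive :)
  (: Whole-token test on the whitespace-separated value. :)
  contains-token($class, 'topic/topic'),  (: true :)
  contains-token($class, 'topic/top')     (: false :)
)
```

contains-token compares whole whitespace-separated tokens, so the query no longer depends on the blank-padding convention. Whether a given database can answer it from an index is a separate question.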
Re: [basex-talk] Optimizing Element Access By Attribute Value Matching
Hi Eliot,

I (am sorry to) agree there is no straightforward solution to speed up
the lookup of single tokens in attributes. XQuery 3.1 provides a new
string function "contains-token" [1]...

  //*[contains-token(@class, 'topic/topic')]

...but (up to now) it is not index-driven in BaseX. Some users would
love to see us extend our full-text index to attributes. This way,
your queries could be sped up as follows:

  //*[@class contains text 'topic/topic'][contains-token(@class, 'topic/topic')]

The second predicate is still required, as the full-text query would
also potentially yield hits like "topic topic" or "ToPiC-!-tOpIc".

Currently, an efficient and (once you get used to it) rather simple way
out is to create your own index...

  (: The root element name is arbitrary; only class/@token is used below. :)
  let $index := <index>{
    for $element in db:open('db')//*[@class]
    let $id := db:node-id($element)
    for $token in $element/@class/tokenize(., '\s+')
    return <class token='{ $token }'>{ $id }</class>
  }</index>
  return db:create('index', $index, 'index.xml')

...and access it in the next step:

  for $id in db:open('index')//class[@token = 'topic/topic']
  return db:open-id('db', $id)

Hope this helps,
Christian

[1] http://docs.basex.org/wiki/XQuery_3.1#fn:contains-token

On Mon, Apr 13, 2015 at 7:38 PM, Eliot Kimber wrote:
> DITA defines the notion of a layered hierarchy of element types, where
> every DITA-defined element is either a base type or a "specialized"
> type derived from some base type. The type hierarchy of each element
> is specified by a @class attribute that lists the ancestry and leaf
> type of the element.
>
> For example, the element type "concept" is a specialization of the
> base type "topic" and so has a @class value of
> "- topic/topic concept/concept ". Each blank-delimited term is a
> module name/element name pair.
>
> Processing in DITA is "specialization aware" if selection of elements
> is in terms of a @class token rather than concrete element type. For
> example, you might apply processing to topics of any type by matching
> on "*[contains(@class, ' topic/topic ')]", which will match all DITA
> topics, regardless of their specialized type.
>
> The challenge this presents in a database context is optimizing the
> finding of things based on these @class values. For large repositories
> an XQuery like "//*[contains(@class, ' topic/topic ')]" is going to be
> quite slow, as it requires a string comparison of every @class value.
> Even if there is an attribute value index, it will still be slow.
>
> The obvious solution would be to index by @class token, e.g., an index
> where keys are "topic/topic", "topic/p", etc.
>
> Is there a way to construct such an index in BaseX? Is there a better
> way to address this type of string-match-based lookup?
>
> Thanks,
>
> Eliot
>
> --
> Eliot Kimber, Owner
> Contrext, LLC
> http://contrext.com
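The lookup step from Christian's reply could be wrapped in a small helper function, sketched below. The function name local:elements-by-class-token is hypothetical. The sketch assumes that the databases 'db' and 'index' exist, that the index holds class elements with a token attribute whose text is a node id (the shape the lookup query in the reply expects), and that the BaseX release in use still provides db:open and db:open-id under those names.

```xquery
(: Hypothetical helper; assumes the databases 'db' and 'index' exist
   and that each class element in 'index' holds one node id as text. :)
declare function local:elements-by-class-token(
  $token as xs:string
) as node()* {
  (: Exact match on @token; this is an attribute value comparison
     rather than a substring scan over every @class value. :)
  for $entry in db:open('index')//class[@token = $token]
  (: Resolve the stored node id back to the node in the main database. :)
  return db:open-id('db', xs:integer($entry))
};

local:elements-by-class-token('topic/topic')
```

The index is a snapshot: it would presumably have to be rebuilt, or updated alongside the main database, whenever @class-bearing elements are added or removed.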
[basex-talk] Optimizing Element Access By Attribute Value Matching
DITA defines the notion of a layered hierarchy of element types, where
every DITA-defined element is either a base type or a "specialized"
type derived from some base type. The type hierarchy of each element is
specified by a @class attribute that lists the ancestry and leaf type
of the element.

For example, the element type "concept" is a specialization of the base
type "topic" and so has a @class value of
"- topic/topic concept/concept ". Each blank-delimited term is a module
name/element name pair.

Processing in DITA is "specialization aware" if selection of elements
is in terms of a @class token rather than concrete element type. For
example, you might apply processing to topics of any type by matching
on "*[contains(@class, ' topic/topic ')]", which will match all DITA
topics, regardless of their specialized type.

The challenge this presents in a database context is optimizing the
finding of things based on these @class values. For large repositories
an XQuery like "//*[contains(@class, ' topic/topic ')]" is going to be
quite slow, as it requires a string comparison of every @class value.
Even if there is an attribute value index, it will still be slow.

The obvious solution would be to index by @class token, e.g., an index
where keys are "topic/topic", "topic/p", etc.

Is there a way to construct such an index in BaseX? Is there a better
way to address this type of string-match-based lookup?

Thanks,

Eliot

--
Eliot Kimber, Owner
Contrext, LLC
http://contrext.com
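The token-keyed index asked about above can be illustrated in memory with an XQuery 3.1 map. This is only a sketch of the idea, not a persistent database index and not a BaseX feature. The sample document and its id attributes are hypothetical, and the use of map:merge with the 'duplicates': 'combine' option is an illustrative choice.

```xquery
declare namespace map = "http://www.w3.org/2005/xpath-functions/map";

(: Hypothetical sample data; real DITA content would come from a database. :)
let $doc :=
  <root>
    <concept id="c1" class="- topic/topic concept/concept ">
      <p id="p1" class="- topic/p "/>
    </concept>
    <topic id="t1" class="- topic/topic "/>
  </root>

(: Token index: each blank-delimited @class token maps to its elements.
   normalize-space() drops the trailing blank so no empty token is indexed. :)
let $index := map:merge(
  for $element in $doc//*[@class]
  for $token in tokenize(normalize-space($element/@class), ' ')
  return map { $token : $element },
  map { 'duplicates': 'combine' }
)

(: Lookup by key instead of a string comparison on every @class value. :)
return $index('topic/topic')/@id/string()   (: returns "c1", "t1" :)
```

The keys here are exactly the tokens named in the post ("topic/topic", "topic/p", and so on, plus the leading "-"), so a lookup touches only the entries for one token.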