Re: [basex-talk] Optimizing Element Access By Attribute Value Matching

2015-04-21 Thread France Baril
Happy to know it can be done. I will definitely ping you when the project
turns commercial.


Re: [basex-talk] Optimizing Element Access By Attribute Value Matching

2015-04-21 Thread Christian Grün
Hi France,

I guess we won't be able to spend time on this unless we have a
commercial project in sight. However, I have created a new DITA
wish list [1] to collect the features that DITA users miss most in BaseX.
All your feedback is welcome!

Christian

[1] https://github.com/BaseXdb/basex/issues/1130





Re: [basex-talk] Optimizing Element Access By Attribute Value Matching

2015-04-21 Thread France Baril
Christian,

I'm also starting a few DITA projects with BaseX that use attributes for
matches. Other projects were DITA-like and did not need to match attribute
tokens in @class. There is a big and growing DITA community out there that
needs native XML database support, so I think it might be worthwhile to
include indexing of attribute tokens in the roadmap at some point, even if
only as a longer-term item.

This is a very important aspect of the DITA standard. HTML classes also use
tokens more and more, so this is not just relevant to the DITA community.

Regards,

France



-- 
France Baril
Architecte documentaire / Documentation architect
france.ba...@architextus.com


Re: [basex-talk] Optimizing Element Access By Attribute Value Matching

2015-04-13 Thread Liam R. E. Quin
On Mon, 2015-04-13 at 12:38 -0500, Eliot Kimber wrote:

> For large repositories an
> XQuery like
> "//*[contains(@class, ' topic/topic ')]" is going to be quite slow

I took this use case to the XQuery & XSLT Working Groups a year or two 
ago (Jirka added the DITA case - I was thinking of (X)HTML) and the 
result was contains-token() which might be easier for the database to 
optimize.

Judging by comments submitted against my awful tests for it :) I think 
BaseX may well support it already.

Liam



Re: [basex-talk] Optimizing Element Access By Attribute Value Matching

2015-04-13 Thread Christian Grün
Hi Eliot,

I (am sorry to) agree there is no straightforward solution to speed up
the lookup of single tokens in attributes. XQuery 3.1 provides a new
string function "contains-token" [1]...

  //*[contains-token(@class, 'topic/topic')]

...but (up to now) it is not index-driven in BaseX.
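
To make the difference between the two predicates concrete, here is a rough Python analogue (an illustration only, not BaseX code). The helper names are made up for this sketch. It assumes that contains-token splits the value on whitespace and compares whole tokens, using a plain case-sensitive comparison rather than a collation:

```python
def contains_padded(class_value: str, token: str) -> bool:
    """Analogue of contains(@class, ' token '): substring test with blank padding."""
    return f" {token} " in class_value


def contains_token(class_value: str, token: str) -> bool:
    """Analogue of contains-token(@class, 'token'): whole-token test."""
    # str.split() with no argument splits on runs of whitespace and drops empties.
    return token.strip() in class_value.split()


concept = "- topic/topic concept/concept "
print(contains_padded(concept, "topic/topic"))  # True
print(contains_token(concept, "topic/topic"))   # True

# Without the trailing blank, the padded substring test misses the last token,
# while the token-based test still finds it.
no_trailing_blank = "- topic/topic concept/concept"
print(contains_padded(no_trailing_blank, "concept/concept"))  # False
print(contains_token(no_trailing_blank, "concept/concept"))   # True
```

The token-based test does not depend on how the attribute value is padded, which is part of why it is a cleaner target for optimization.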

Some users would love to see us extend our full-text index to
attributes. This way, your queries could be sped up as follows:

  //*[@class contains text 'topic/topic'][contains-token(@class, 'topic/topic')]

The second predicate is still required, as the full-text query would
also potentially yield hits like "topic topic" or "ToPiC-!-tOpIc".
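
The "coarse filter first, exact check second" idea can be sketched in Python as follows. This is only an approximation: the tokenizer below (lowercase, split on anything that is not an ASCII letter or digit) is a crude stand-in that I am assuming for illustration, not the actual BaseX full-text tokenizer, and all function names are hypothetical.

```python
import re


def fulltext_terms(text: str) -> list[str]:
    """Crude stand-in for full-text tokenization (assumed, not BaseX's)."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]


def fulltext_match(value: str, query: str) -> bool:
    """True if the query's terms occur as a consecutive run in the value's terms."""
    terms, wanted = fulltext_terms(value), fulltext_terms(query)
    n = len(wanted)
    return n > 0 and any(
        terms[i:i + n] == wanted for i in range(len(terms) - n + 1)
    )


def exact_token_match(value: str, token: str) -> bool:
    """Whole-token, case-sensitive check (the second predicate)."""
    return token in value.split()


values = [
    "- topic/topic concept/concept ",
    "topic topic",
    "ToPiC-!-tOpIc",
    "- topic/p ",
]

# Step 1: the coarse full-text style filter lets false positives through.
candidates = [v for v in values if fulltext_match(v, "topic/topic")]
# Step 2: the exact token check removes them again.
hits = [v for v in candidates if exact_token_match(v, "topic/topic")]

print(len(candidates))  # 3
print(hits)             # ['- topic/topic concept/concept ']
```

In a database, the first step would be answered from the index, so the exact check only has to run on the few candidates that survive it.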

Currently, an efficient and (if you get used to it) rather simple way
out is to create your own index...

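  (: root element name 'index' is assumed; class/@token matches the lookup below :)
  let $index := <index>{
    for $element in db:open('db')//*[@class]
    let $id := db:node-id($element)
    for $token in $element/@class/tokenize(., '\s+')
    return <class token='{ $token }'>{ $id }</class>
  }</index>
  return db:create('index', $index, 'index.xml')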

...and access it in the next step:

  for $id in db:open('index')//class[@token = 'topic/topic']
  return db:open-id('db', $id)
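
For readers who want to see the idea without a database at hand, the same token index can be sketched in Python as a plain dictionary from token to node ids. The node ids and @class values below are invented for illustration, and the function name is hypothetical. Note that str.split() drops empty strings, which tokenize(., '\s+') would keep for values with leading or trailing blanks.

```python
from collections import defaultdict


def build_token_index(class_by_id: dict[int, str]) -> dict[str, list[int]]:
    """Map each whitespace-separated @class token to the ids of nodes carrying it."""
    index: dict[str, list[int]] = defaultdict(list)
    for node_id, class_value in class_by_id.items():
        for token in class_value.split():
            index[token].append(node_id)
    return dict(index)


# Hypothetical node ids and @class values.
nodes = {
    1: "- topic/topic concept/concept ",
    2: "- topic/title ",
    3: "- topic/topic task/task ",
    4: "- topic/p ",
}

index = build_token_index(nodes)

# One dictionary lookup replaces a string comparison against every @class value.
print(index.get("topic/topic", []))    # [1, 3]
print(index.get("concept/concept", []))  # [1]
print(index.get("topic/section", []))  # []
```

As with the XQuery version, the index has to be rebuilt (or updated) whenever the underlying documents change.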

Hope this helps,
Christian

[1] http://docs.basex.org/wiki/XQuery_3.1#fn:contains-token





[basex-talk] Optimizing Element Access By Attribute Value Matching

2015-04-13 Thread Eliot Kimber
DITA defines the notion of a layered hierarchy of element types, where every
DITA-defined element is either a base type or a "specialized" type derived
from some base type. The type hierarchy of each element is specified by a
@class attribute that lists the ancestry and leaf type of the element.

For example, the element type "concept" is a specialization of the base
type "topic" and so has a @class value of "- topic/topic concept/concept ".
Each blank-delimited term is a module name/element name pair.

Processing in DITA is "specialization aware" if selection of elements is
in terms of a @class token rather than concrete element type. For example,
you might apply processing to topics of any type by matching on
"*[contains(@class, ' topic/topic ')]", which will match all DITA topics,
regardless of their specialized type.
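
The @class convention described above can be sketched in Python as follows. This is a minimal illustration with hypothetical function names, not part of any DITA toolkit; it assumes that terms without a slash (such as the leading "-") carry no type information and can be skipped.

```python
def parse_class(class_value: str) -> list[tuple[str, str]]:
    """Split a DITA @class value into (module, element) pairs, base type first."""
    pairs = []
    for term in class_value.split():
        if "/" in term:  # skips the leading "-" marker
            module, element = term.split("/", 1)
            pairs.append((module, element))
    return pairs


def is_specialization_of(class_value: str, module: str, element: str) -> bool:
    """Specialization-aware test: does the ancestry include module/element?"""
    return (module, element) in parse_class(class_value)


concept = "- topic/topic concept/concept "
print(parse_class(concept))
# [('topic', 'topic'), ('concept', 'concept')]

print(is_specialization_of(concept, "topic", "topic"))  # True
print(is_specialization_of(concept, "task", "task"))    # False
```

Matching on the ("topic", "topic") pair selects a concept as well as a plain topic, which is what specialization-aware processing relies on.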

The challenge this presents in a database context is optimizing lookups
based on these @class values. For large repositories, an XQuery like
"//*[contains(@class, ' topic/topic ')]" is going to be quite slow, as it
requires a string comparison against every @class value. Even if there is an
attribute value index, it will still be slow.

The obvious solution would be to index by @class token, e.g., an index
where keys are "topic/topic", "topic/p", etc.

Is there a way to construct such an index in BaseX? Is there a better way to
address this type of string-match-based lookup?

Thanks,

Eliot

—
Eliot Kimber, Owner
Contrext, LLC
http://contrext.com