Re: [basex-talk] multi-language full-text indexing

2015-04-23 Thread Christopher Yocum
Hi Christian,

No problem.  I am always happy to help.

I did not try that, as I did not have the time to implement something
like that for the project.  On my machine and on the VPS we were
using, at least, it was fast enough for most uses.  I also told the
user about the problem: handling diacritics was an all-or-nothing
setting, and there was not much I could do to make it more granular.
They were content with that, but as the subject came up I thought
I would mention it.

Please let me know if you need any more information.

All the best,
Chris



Re: [basex-talk] multi-language full-text indexing

2015-04-22 Thread Chris Yocum

Hi,

I just want to say that for the dictionary that I used BaseX for,
having a multi-lingual full-text index would have been very nice.
Barring that, a partial index based on certain rules the user supplies
would also have been nice.  For instance, being able to distinguish
between á, a, and ā in a word.  In early Irish textual criticism, length
marks are often added by text editors with a macron to denote a long
vowel that has been identified by the editor but is not in the original
text.  Being able to say: build an index with á and a but not ā would be
helpful.

I would suggest, as a first pass, building the index by using xml:lang
attributes to determine what stemmer to use, etc.  If the document
supplies them, you could use them to build the indices differently
for each language.

All the best,
Chris
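
A minimal sketch of the macron rule described above, written in Python rather than anything BaseX provides. It assumes one reading of the rule: editorial macrons are folded away before indexing, while acute accents are kept as distinct. The function name strip_macrons and the sample words are only illustrative.

```python
import unicodedata

MACRON = "\u0304"  # combining macron, as in the decomposed form of "ā"


def strip_macrons(text):
    """Fold editorial macrons away but keep every other diacritic."""
    # NFD splits "ā" into "a" + combining macron and "á" into "a" + acute.
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch != MACRON)
    # NFC recomposes what is left, so "a" + acute becomes "á" again.
    return unicodedata.normalize("NFC", kept)


# Illustrative tokens only: the macron form folds to the plain vowel,
# the acute form is left untouched.
print(strip_macrons("b\u0101s"))
print(strip_macrons("b\u00e1s"))
```

Applied to the text before it is stored in a separate index database, a filter like this would give the finer-grained behaviour that a single all-or-nothing diacritics setting cannot.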



Re: [basex-talk] multi-language full-text indexing

2015-04-22 Thread Christian Grün
Chris,

Thanks for your feedback.

Yes, I see that there is a lot of demand for a more customizable
full-text index. Did you already try to build some additional index
databases, based on the rules you were listing here? It's not as
comfortable as a tightly coupled full-text index, but the more use
cases I hear of, the more I wonder if we could manage to satisfy
everyone's needs at all.

Cheers,
Christian




[basex-talk] multi-language full-text indexing

2015-04-22 Thread Goetz Heller
Here's another addendum: Even if multi-language full-text indexing is not going 
to be implemented in the near future, it would still be a useful feature to be 
able to restrict full-text indexing to parts of a document, e.g.

CREATE FULL-TEXT INDEX ON DATABASE XY STARTING WITH (
(path_a)/PART_A,
(path_b)/PART_B,…
)

Kind regards,

Goetz




Re: [basex-talk] multi-language full-text indexing

2015-04-22 Thread Christian Grün
This reminds me of an old GitHub issue. I have added a link to your
request: https://github.com/BaseXdb/basex/issues/59.





Re: [basex-talk] multi-language full-text indexing

2015-04-22 Thread Goetz Heller
Thanks, Fabrice!
I’ll work it out.
 
Kind regards,
 
Goetz
 
 


[basex-talk] multi-language full-text indexing

2015-04-22 Thread Goetz Heller
I'm working with documents destined to be consumed anywhere in the European
Community. Many of them have the same tags multiple times but with a
different language attribute. It therefore does not make sense to create a
single full-text index for the whole of these documents. It is desirable to
have documents indexed by locale-specific parts, e.g.

CREATE FULL-TEXT INDEX ON DATABASE XY STARTING WITH (
(path_a)/LOCALIZED_PART_A[@LANG=$lang],
(path_b)/LOCALIZED_PART_B[@LG=$lang],…
) FOR LANGUAGE $lang IN (
BG,
DN,
DE WITH STOPWORDS filepath_de WITH STEM = YES,
EN WITH STOPWORDS filepath_en,
FR, …
)  [USING language_code_map]

and then to write full-text retrieval queries with a clause such as 'FOR
LANGUAGE BG', for example. The index parts would be much smaller and
full-text retrieval therefore much faster. The language codes would be
mapped somehow to standard values recognized by BaseX in the
language_code_map file.

Are there any efforts towards such a feature?


Re: [basex-talk] multi-language full-text indexing

2015-04-22 Thread Fabrice Etanchaud
Dear Goetz,

I have the same requirement (patent documents containing text in different 
languages).
I ended up splitting/filtering each original document into localized parts, 
which are inserted into different collections (each collection having its own 
full-text index configuration).
BaseX is as flexible as our data!

Best regards,
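
The splitting step described above could be sketched roughly as follows. This is a plain Python illustration, not the code used for the patent data: the sample document, the LANG attribute name, the id attribute, and the part wrapper element are all assumptions made for the example.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical input: the same tags repeated once per language.
SAMPLE = """<doc id="d1">
  <title LANG="EN">Pump housing</title>
  <title LANG="DE">Pumpengehäuse</title>
  <abstract LANG="EN">A housing for a pump.</abstract>
  <abstract LANG="DE">Ein Gehäuse für eine Pumpe.</abstract>
</doc>"""


def split_by_language(xml_text, lang_attr="LANG"):
    """Return {language code: <part> element} holding the localized elements."""
    root = ET.fromstring(xml_text)
    parts = {}
    for elem in root.iter():
        lang = elem.get(lang_attr)
        if lang is None:
            continue  # language-neutral content stays with the original
        if lang not in parts:
            # Keep the source id so the parts can be traced back later.
            parts[lang] = ET.Element(
                "part", {"source": root.get("id", ""), "lang": lang}
            )
        parts[lang].append(copy.deepcopy(elem))
    return parts


parts = split_by_language(SAMPLE)
for lang, part in sorted(parts.items()):
    print(lang, [child.text for child in part])
```

Each resulting part would then be stored in the collection for its language, where the stemmer and stopword settings can differ from those of the other collections.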




Re: [basex-talk] multi-language full-text indexing

2015-04-22 Thread Christian Grün
 It is desirable to have
 documents indexed by locale-specific parts, e.g.

I can see that this would absolutely make sense, but it would be quite
some effort to realize it. There are also various conceptual issues
related to XQuery Full Text: If you don't specify the language in the
query, we'd need to dynamically decide what stemmers to use for the
query strings, depending on the nodes that are currently targeted.
This would pretty much blow up the existing architecture.

As there are so many other types of index structures that could be
helpful, depending on the particular use case, we usually recommend
that users create additional BaseX databases, which can then serve as
indexes. This can all be done in XQuery. I remember there have been
various examples of this on this mailing list (see e.g. [1,2]).

Hope this helps,
Christian

[1] 
https://www.mail-archive.com/basex-talk@mailman.uni-konstanz.de/msg04837.html
[2] 
https://www.mail-archive.com/basex-talk@mailman.uni-konstanz.de/msg06089.html









Re: [basex-talk] multi-language full-text indexing

2015-04-22 Thread Christian Grün
In a nutshell: It would take some more time to explain all the
implications. We know that there are various non-trivial issues to be
solved, as we already thought about adding such an index some years
ago.

Cheers,
Christian





Re: [basex-talk] multi-language full-text indexing

2015-04-22 Thread Fabrice Etanchaud
Great, Goetz!

One last thing:
If you need to rebuild the original document from its parts, be sure to have a 
way to retrieve them all (by document path, attribute index, or a separate 
index collection with node-id/pre values).

If disk space is not an issue, you could store the original document as it is 
and create localized collections for full-text indexing purposes.

Hope this helps,

Best regards,
Fabrice

From: basex-talk-boun...@mailman.uni-konstanz.de 
[mailto:basex-talk-boun...@mailman.uni-konstanz.de] On behalf of Goetz Heller
Sent: Wednesday, 22 April 2015 11:20
To: basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] multi-language full-text indexing

Fabrice,
For the time being, this sounds quite nice. I'd have to split up the files into 
a common part and a set of satellites, one satellite for each language present 
in the document.

Thanks!

Kind regards,

Goetz



[basex-talk] multi-language full-text indexing

2015-04-22 Thread Goetz Heller
The case you described should be made a non-issue:
If a multi-language full-text index has been created, then searches are surely 
intended to run within the confines of a specific language. Hence, if no 
language is specified in the query, a runtime error should be thrown.

Kind regards,

Goetz
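
To illustrate the proposed behaviour, here is a toy Python model of a per-language index whose search refuses to run when no language is given. It says nothing about how BaseX works internally; the class name MultiLanguageIndex, its methods, and the whitespace tokenizer are hypothetical and chosen only for this sketch.

```python
class MultiLanguageIndex:
    """Toy inverted index that is partitioned by language code."""

    def __init__(self):
        self._by_lang = {}  # language code -> {token: set of document ids}

    def add(self, lang, doc_id, text):
        index = self._by_lang.setdefault(lang, {})
        # Naive tokenizer: lower-case and split on whitespace.
        for token in text.lower().split():
            index.setdefault(token, set()).add(doc_id)

    def search(self, token, lang=None):
        if lang is None:
            # The rule proposed above: no language, no search.
            raise ValueError("a language must be specified for this index")
        if lang not in self._by_lang:
            raise KeyError(f"no index part for language {lang!r}")
        return self._by_lang[lang].get(token.lower(), set())


idx = MultiLanguageIndex()
idx.add("EN", "d1", "A housing for a pump")
idx.add("DE", "d1", "Ein Gehäuse für eine Pumpe")
print(idx.search("housing", lang="EN"))
```

Because every lookup names its language, only the matching index part is consulted, and the question of which stemmer to apply to the query string never has to be answered dynamically.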
