Re: [basex-talk] diacritics sensitive not working

2018-09-03 Thread Graydon Saunders
Let's suppose you've got a map like this (and that, by just typing this into
the email, I haven't left in any really horrible typos!):

let $drugInfo as map(xs:string, element()) := map:merge(
  for $element in collection('newDrugInfo')/descendant::infoElement
  let $name as xs:string := (: whatever you do to extract the official drug
    name from the update data :)
  return map:entry($name, $element)
)

Then in the other docbase you've got:

let $updatePlaces as map(xs:string, element()+) := map:merge(
  for $place in collection('updating-this-one')/descendant::couldBeInteresting
  let $drugName as xs:string? := (: whatever you're doing now to match the
    drug name; there's an assumption you expect to find only one :)
  where exists($drugName) (: because you might not have one! :)
  group by $drugName (: BaseX will magically make $place be a sequence of all
    the $place values with this drug name, effectively a sequence of pointers
    to those element nodes :)
  return map:entry($drugName, $place)
)

So now you can:

for $drug in map:keys($drugInfo) (: we're iterating through the official
  list :)
let $needsUpdate as element()* := $updatePlaces($drug) (: may be empty :)
for $place in $needsUpdate (: iterate through our sequence of pointers :)
return (: do whatever you're doing to insert the information in
  $drugInfo($drug) :)

It looks like the same old n-squared inner-loop/outer-loop update process,
but I have found that it doesn't perform like that.  I am almost never
updating the docbase so whatever magic is involved may go away when you do
that, but I've found this "map both sides" pattern to be very useful when
merging data.
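
As a self-contained illustration of the pattern above, here is a minimal
sketch that can be run as-is. The inline sample data, the name and drug
attributes, and the dose element are all hypothetical stand-ins for the two
docbases; only the shape of the two maps and the final loop follows the
description above.

```xquery
(: Hypothetical sample data standing in for the two docbases. :)
let $newInfo :=
  <updates>
    <infoElement name="aspirin"><dose>100 mg</dose></infoElement>
    <infoElement name="ibuprofen"><dose>200 mg</dose></infoElement>
  </updates>
let $target :=
  <doc>
    <couldBeInteresting id="p1" drug="aspirin"/>
    <couldBeInteresting id="p2" drug="ibuprofen"/>
    <couldBeInteresting id="p3" drug="aspirin"/>
    <couldBeInteresting id="p4"/>
  </doc>

(: Side one: official drug name -> element carrying the new information. :)
let $drugInfo as map(xs:string, element()) := map:merge(
  for $element in $newInfo/infoElement
  return map:entry(string($element/@name), $element)
)

(: Side two: drug name -> every place that mentions it. :)
let $updatePlaces as map(xs:string, element()+) := map:merge(
  for $place in $target/couldBeInteresting
  let $drugName as xs:string? := $place/@drug ! string()
  where exists($drugName)
  group by $drugName
  return map:entry($drugName, $place)
)

(: One pass over the keys; each lookup is a map access, not a scan. :)
for $drug in map:keys($drugInfo)
for $place in $updatePlaces($drug)
order by $place/@id
return string-join(($place/@id, $drug, $drugInfo($drug)/dose), ' | ')
```

This should return one line per matched place (p1, p2 and p3 here); p4 has
no drug name and is dropped by the where clause before grouping.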

-- Graydon


Re: [basex-talk] diacritics sensitive not working

2018-09-02 Thread Ron Katriel
Hi Graydon,

Thanks for the suggestion. Could you provide sample code to help with this? If 
needed I can share the relevant BaseX snippet.

Best,
Ron



Re: [basex-talk] diacritics sensitive not working

2018-09-02 Thread Graydon Saunders
Maps that reference nodes are pointers, rather than copies.  It sounds like
you could map every drug name to every "interesting" XML node that contains
it using grouping during map creation and then just iterate on the keys to
process the nodes.



Re: [basex-talk] diacritics sensitive not working

2018-09-02 Thread Ron Katriel
Hi Christian,

As promised here is a summary of my experimentation. I replaced the
expensive join with a map lookup and the program finished in 4 minutes vs.
1 hour using a naive loop over the two databases (the original 6 hours
reported were due to overly aggressive virus scanning software, which I
turned off for this benchmarking).

The downside of not using “contains text” inside the double loop (due to
its slowness) is that I had to tokenize the CT.gov interventions and remove
stopwords prior to looking them up in the DrugBank map. This is a subpar
solution as some drugs are missed (looking up all the possible word
combinations would be expensive).
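
For readers following along, the tokenize-and-look-up approach described
above could be sketched roughly as follows. This is not the actual script:
the drug names, the stopword list and the intervention strings are
hypothetical, and the tokenization rule (split on anything that is not a
letter or digit) is an assumption.

```xquery
(: Hypothetical stand-ins for the DrugBank names and a stopword list. :)
let $drugNames := ('Lenalidomide', 'Acetazolamide', 'Aspirin')
let $stopwords := ('and', 'or', 'with', 'plus', 'mg', 'daily')

(: Map from lower-cased drug name to its official spelling. :)
let $drugMap := map:merge(
  for $name in $drugNames
  return map:entry(lower-case($name), $name)
)

(: Hypothetical intervention strings as they might appear in a trial. :)
for $intervention in ('Lenalidomide plus Dexamethasone', 'Aspirin 100 mg daily')
let $tokens := tokenize(lower-case($intervention), '[^\p{L}\p{N}]+')[. != '']
let $hits :=
  for $token in $tokens[not(. = $stopwords)]
  where map:contains($drugMap, $token)
  return $drugMap($token)
return $intervention || ' -> ' || string-join($hits, ', ')
```

Because each token is looked up on its own, a drug whose name spans several
words cannot be found this way, which is the limitation noted above.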

It would be nice if there were a way to combine the matching flexibility of
the “contains text” construct (with its myriad of options) and the
efficiency of a map lookup, but that may require a finite-state automaton
such as the Aho–Corasick algorithm. If you are aware of any existing
solutions I would appreciate your sharing them.
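
One cheaper middle ground, short of Aho–Corasick, might be to look up only
contiguous word windows no longer than the longest drug name, instead of all
word combinations. The sketch below is that swapped-in technique, not an
Aho–Corasick implementation, and it offers none of the “contains text”
options (stemming, wildcards and so on). The drug names and the sample text
are hypothetical, and it assumes names consist of words separated by single
spaces.

```xquery
(: Hypothetical drug names, including multi-word ones. :)
let $drugNames := ('Acetylsalicylic acid', 'Folic acid', 'Lenalidomide')
let $drugMap := map:merge($drugNames ! map:entry(lower-case(.), .))

(: The longest name, counted in words, bounds the window size. :)
let $maxWords := max($drugNames ! count(tokenize(., '\s+')))

let $text := 'Low-dose acetylsalicylic acid with lenalidomide'
let $words := tokenize(lower-case($text), '[^\p{L}\p{N}]+')[. != '']

(: Try every contiguous window of 1..$maxWords words as a map key. :)
for $start in 1 to count($words)
for $length in 1 to min(($maxWords, count($words) - $start + 1))
let $candidate := string-join(subsequence($words, $start, $length), ' ')
where map:contains($drugMap, $candidate)
return $drugMap($candidate)
```

The number of lookups is at most the word count times $maxWords, so it grows
linearly with the length of the intervention text.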

Thanks,
Ron



Re: [basex-talk] diacritics sensitive not working

2018-08-04 Thread Ron Katriel
Hi Christian,

Thanks for the advice. The BaseX engine is phenomenal, so I realized quickly
that the problem was performing a naive cross product.

Since this query is run only once a month (to serialize XML to CSV) and applied 
to new data (DB) each time, a BaseX map will likely be the most straightforward 
solution (I used the same idea for another project with good results).

I will not be able to implement and test this for another couple of weeks but 
will summarize my findings to the group as soon as possible.

Best,
Ron




Re: [basex-talk] diacritics sensitive not working

2018-08-04 Thread Christian Grün
Hi Ron,

> I believe the slow execution may be due to a combinatorial issue: the cross 
> product of 280,000 clinical trials and ~10,000 drugs in DrugBank (not 
> counting synonyms).

Yes, this sounds like a pretty expensive operation. Having maps
(XQuery, Java) will be much faster indeed.

As Gerrit suggested, and if you will run your query more than once, it
would definitely be another interesting option to build an auxiliary,
custom "index database" that allows you to do exact searches (this
database may still have references to your original data sets). Since
version 9 of BaseX, volatile hash maps will be created for looped
string comparisons. See the following example:

  let $values1 := (1 to 500000) ! string()
  let $values2 := (500001 to 1000000) ! string()
  return $values1[. = $values2]

Algorithmically, 500'000 * 500'000 string comparisons will need to be
performed, resulting in a total of 250 billion operations (and no
results). The runtime is much faster than you might expect (and, as far
as I can judge, much faster than in any other XQuery processor).
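
For comparison, the same membership test over two 500'000-item sequences can
be written with an explicit XQuery map, which does not rely on the optimizer
creating a hash structure behind the scenes. This is only a sketch of the
idea; the variable name $lookup is made up for the example.

```xquery
let $values1 := (1 to 500000) ! string()
let $values2 := (500001 to 1000000) ! string()

(: Build the lookup set once: one map entry per string in $values2. :)
let $lookup := map:merge($values2 ! map:entry(., true()))

(: Each membership test is now a single map access. :)
return count($values1[map:contains($lookup, .)])
```

Since the two ranges do not overlap, the count is zero, matching the "no
results" remark above.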

Best,
Christian


Re: [basex-talk] diacritics sensitive not working

2018-08-03 Thread Ron Katriel
Christian,

Thanks for sharing that. I assumed all along that this happens
automatically. Anyway, I ran my query (for one drug, to save time) and see
the following in the Info view

- apply text index for "Lenalidomide"

I believe the slow execution may be due to a combinatorial issue: the cross
product of 280,000 clinical trials and ~10,000 drugs in DrugBank (not
counting synonyms).

I am considering an algorithmic solution that involves storing the DrugBank
information in a hash table (map) and looking it up while iterating through
the CT.gov  trials.

Best,
Ron

On August 3, 2018 at 5:49:30 PM, Christian Grün (christian.gr...@gmail.com)
wrote:

Our documentation should help you here: http://docs.basex.org/wiki/Indexes






Re: [basex-talk] diacritics sensitive not working

2018-08-03 Thread Christian Grün
Our documentation should help you here: http://docs.basex.org/wiki/Indexes





Re: [basex-talk] diacritics sensitive not working

2018-08-03 Thread Ron Katriel
Hi Gerrit,

Thanks for the suggestions. I would like to retain the original diacritics
(for output purposes) but only match them when warranted (e.g., match
acétazolamide to acétazolamide, but not acétazolamide to acetazolamide). I
am looking for a simple solution that does not involve modifying the
database or maintaining multiple copies (both for processing simplicity and
storage efficiency reasons).

Thanks,
Ron

On August 3, 2018 at 9:08:19 AM, Imsieke, Gerrit, le-tex (
gerrit.imsi...@le-tex.de) wrote:

Hi Ron,

You can add an extra element (or attribute) to the content when
importing or modifying it. (Or another document in another database if
you like – you can create and later find such an index document by
giving it the same db:path as the original document.)

In this extra database, document, element and/or attribute, you can
recreate the original text, except that you normalize the characters
with diacritical marks to a canonical decomposition form and then strip
away the diacritical marks like this:

replace(normalize-unicode($input, 'NFKD'), '\p{Mn}', '')

The full updating statement is beyond my cursory XQuery capabilities –
I’d probably do it in XSLT. Also I don’t know how to trigger an event
that would cause an update of the auxiliary fields when the underlying
data changes.
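
To make the normalization step concrete, here is a small sketch that wraps
the expression above in a function and applies it to a few sample strings.
The function name local:strip-diacritics and the sample names are
assumptions made for the example.

```xquery
declare function local:strip-diacritics($input as xs:string) as xs:string {
  (: Decompose accented characters, then drop the combining marks (\p{Mn}). :)
  replace(normalize-unicode($input, 'NFKD'), '\p{Mn}', '')
};

for $name in ('acétazolamide', 'Äpfel', 'aspirin')
return $name || ' -> ' || local:strip-diacritics($name)
```

The stripped form would serve only as a lookup key; the original string,
diacritics included, stays available for output.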

Gerrit




Re: [basex-talk] diacritics sensitive not working

2018-08-03 Thread Ron Katriel
Hi Christian,

Yes, I created a full-text index when the databases were loaded (see the
commands below). I also verified that FTINDEX is true for both databases
(in the GUI under Database > Open & Manage).

How do I ensure that my query is rewritten for index access?

Thanks,
Ron


SET FTINDEX true; SET TOKENINDEX true; CREATE DB CTGov "/Data Sets/ct.gov/xml"
SET FTINDEX true; SET TOKENINDEX true; SET STRIPNS true; CREATE DB DrugBank "/Data Sets/DrugBank/drugbank.xml"

On August 3, 2018 at 4:12:43 PM, Christian Grün (christian.gr...@gmail.com)
wrote:

Hi Ron,

Did you a) create a full-text index for your data and b) ensure that
your query is rewritten for index access?

Best,
Christian


On Fri, Aug 3, 2018 at 2:39 PM Ron Katriel  wrote:
>
> Christian,
>
> Adding diacritics sensitive slows execution by a factor of 3. My script
(fragment below), which joins two large databases, namely CT.gov and
DrugBank, takes 2 hours without the diacritics sensitive constraint but 6
hours with it. Given the combinatorics involved, I am wondering if there is
a better way to do this in BaseX.
>
> Thanks,
> Ron
>
>
> for $drug in db:open('DrugBank')/drugbank/drug
> let $drug_name := $drug/name/text()
> let $drug_synonyms :=
functx:value-union(normalize-space(lower-case($drug/name)),
local:drug-synonyms($drug_name))
> for $synonym_name in $drug_synonyms
> ...
> for $study in
db:open('CTGov')/clinical_study[intervention/intervention_name contains
text { $synonym_name } using case insensitive using diacritics sensitive]
> ...
>
>
> Ron Katriel, Ph.D. | Principal Data Scientist | Medidata Solutions
> 350 Hudson Street, 7th Floor, New York, NY 10014
> rkatr...@mdsol.com | direct: +1 201 337 3622 | mobile: +1 201 675 5598 |
main: +1 212 918 1800
>
> On August 1, 2018 at 12:41:26 PM, Ron Katriel (rkatr...@mdsol.com) wrote:
>
> Thanks, Christian. Strange, prior to contacting you and on a hunch, I
tried adding the missing “using” keyword but still got the syntax error.
Anyway, everything is good now!
>
> Best,
> Ron
>
> On August 1, 2018 at 3:57:51 AM, Christian Grün (christian.gr...@gmail.com)
wrote:
>
> I have fixed the example in the doc.
> Best, Christian
>
>
> On Wed, Aug 1, 2018 at 5:08 AM Ron Katriel  wrote:
> >
> > Hi,
> >
> > The following from your website (docs.basex.org/wiki/Full-Text) appears
to be syntactically incorrect
> >
> > "'Äpfel' will not be found..." contains text "Apfel" diacritics
sensitive
> >
> > In the BaseX GUI the keyword diacritics is underlined in red and the
following error is reported
> >
> > Unexpected end of query: 'diacritic sens...'.
> >
> > This happens in version 8.6.4 and also the latest (9.0.2).
> >
> > Thanks,
> > Ron
> >
> >
> > Ron Katriel, Ph.D. | Principal Data Scientist | Medidata Solutions
> >
> > 350 Hudson Street, 7th Floor, New York, NY 10014
> >
> > rkatr...@mdsol.com | direct: +1 201 337 3622 | mobile: +1 201 675 5598
> > | main: +1 212 918 1800
> >
> >


Re: [basex-talk] diacritics sensitive not working

2018-08-03 Thread Christian Grün
Hi Ron,

Did you a) create a full-text index for your data and b) ensure that
your query is rewritten for index access?

Best,
Christian
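
A minimal sketch of point a), not taken from the thread: it assumes the
database is named 'CTGov' as in Ron's fragment, and that the index should
retain diacritics and fold case so that its options line up with the
"using case insensitive using diacritics sensitive" options of the query.
As far as I understand the BaseX documentation, the full-text index is
only applied when the match options of the query are compatible with the
options the index was built with, so a mismatch here may be one reason
for the slowdown.

```xquery
(: Sketch only: rebuild the index structures of CTGov, including a
   full-text index. The option values are assumptions about what the
   query in this thread needs, not settings confirmed by the posters. :)
db:optimize('CTGov', true(), map {
  'ftindex':    true(),   (: create the full-text index :)
  'diacritics': true(),   (: retain diacritics in the index :)
  'casesens':   false()   (: fold case, matching "case insensitive" :)
})
```

For point b), run the join afterwards and inspect the query info (the
Info view in the GUI): the optimized query should mention that the
full-text index is applied for the contains text expression. If it does
not, the expression is still evaluated by scanning every
intervention_name.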


On Fri, Aug 3, 2018 at 2:39 PM Ron Katriel  wrote:
>
> Christian,
>
> Adding diacritics sensitive slows execution by a factor of 3. My script 
> (fragment below), which joins two large databases, namely CT.gov and 
> DrugBank, takes 2 hours without the diacritics sensitive constraint but 6 
> hours with it. Given the combinatorics involved, I am wondering if there is a 
> better way to do this in BaseX.
>
> Thanks,
> Ron
>
>
> for $drug in db:open('DrugBank')/drugbank/drug
> let $drug_name := $drug/name/text()
> let $drug_synonyms := functx:value-union(
>   normalize-space(lower-case($drug/name)),
>   local:drug-synonyms($drug_name))
> for $synonym_name in $drug_synonyms
> ...
> for $study in db:open('CTGov')/clinical_study[
>   intervention/intervention_name contains text { $synonym_name }
>     using case insensitive using diacritics sensitive]
> ...
>
>
> Ron Katriel, Ph.D. | Principal Data Scientist | Medidata Solutions
> 350 Hudson Street, 7th Floor, New York, NY 10014
> rkatr...@mdsol.com | direct: +1 201 337 3622 | mobile: +1 201 675 5598 | 
> main: +1 212 918 1800
>
> On August 1, 2018 at 12:41:26 PM, Ron Katriel (rkatr...@mdsol.com) wrote:
>
> Thanks, Christian. Strange, prior to contacting you and on a hunch, I tried 
> adding the missing “using” keyword but still got the syntax error. Anyway, 
> everything is good now!
>
> Best,
> Ron
>
> On August 1, 2018 at 3:57:51 AM, Christian Grün (christian.gr...@gmail.com) 
> wrote:
>
> I have fixed the example in the doc.
> Best, Christian
>
>
> On Wed, Aug 1, 2018 at 5:08 AM Ron Katriel  wrote:
> >
> > Hi,
> >
> > The following from your website (docs.basex.org/wiki/Full-Text) appears to 
> > be syntactically incorrect
> >
> > "'Äpfel' will not be found..." contains text "Apfel" diacritics sensitive
> >
> > In the BaseX GUI the keyword diacritics is underlined in red and the 
> > following error is reported
> >
> > Unexpected end of query: 'diacritic sens...'.
> >
> > This happens in version 8.6.4 and also the latest (9.0.2).
> >
> > Thanks,
> > Ron
> >
> >
> > Ron Katriel, Ph.D. | Principal Data Scientist | Medidata Solutions
> >
> > 350 Hudson Street, 7th Floor, New York, NY 10014
> >
> > rkatr...@mdsol.com | direct: +1 201 337 3622 | mobile: +1 201 675 5598 | 
> > main: +1 212 918 1800
> >
> >


Re: [basex-talk] diacritics sensitive not working

2018-08-03 Thread Imsieke, Gerrit, le-tex

Hi Ron,

You can add an extra element (or attribute) to the content when 
importing or modifying it. (Or another document in another database if 
you like – you can create and later find such an index document by 
giving it the same db:path as the original document.)


In this extra database, document, element and/or attribute, you can 
recreate the original text, except that you normalize the characters 
with diacritical marks to a canonical decomposition form and then strip 
away the diacritical marks like this:


replace(normalize-unicode($input, 'NFKD'), '\p{Mn}', '')

The full updating statement is beyond my cursory XQuery capabilities – 
I’d probably do it in XSLT. Also I don’t know how to trigger an event 
that would cause an update of the auxiliary fields when the underlying 
data changes.


Gerrit
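
The one-liner above can be wrapped in a small function. This is only a
sketch; the name local:strip-diacritics is hypothetical and does not
appear anywhere in the thread.

```xquery
(: Hypothetical helper, not part of BaseX or of Ron's script. :)
declare function local:strip-diacritics($input as xs:string?) as xs:string {
  (: NFKD splits 'Ä' into 'A' plus a combining diaeresis (U+0308);
     \p{Mn} matches such non-spacing marks, which are then removed. :)
  replace(normalize-unicode($input, 'NFKD'), '\p{Mn}', '')
};

(: 'Ä' loses its diaeresis, so this returns 'Apfel' :)
local:strip-diacritics('Äpfel')
```

Storing the result of such a function in an auxiliary element or
attribute at import time means the normalization is paid once, rather
than on every comparison in the join.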


On 03.08.2018 14:39, Ron Katriel wrote:

Christian,

Adding diacritics sensitive slows execution by a factor of 3. My script
(fragment below), which joins two large databases, namely CT.gov and
DrugBank, takes 2 hours without the diacritics sensitive constraint but 6
hours with it. Given the combinatorics involved, I am wondering if there is
a better way to do this in BaseX.


Thanks,
Ron


for $drug in db:open('DrugBank')/drugbank/drug
let $drug_name := $drug/name/text()
let $drug_synonyms := functx:value-union(
  normalize-space(lower-case($drug/name)),
  local:drug-synonyms($drug_name))
for $synonym_name in $drug_synonyms
...
for $study in db:open('CTGov')/clinical_study[
  intervention/intervention_name contains text { $synonym_name }
    using case insensitive using diacritics sensitive]
...


Ron Katriel, Ph.D. | Principal Data Scientist | Medidata Solutions
350 Hudson Street, 7th Floor, New York, NY 10014
rkatr...@mdsol.com | direct: +1 201 337 3622 | mobile: +1 201 675 5598 |
main: +1 212 918 1800

On August 1, 2018 at 12:41:26 PM, Ron Katriel (rkatr...@mdsol.com) wrote:


Thanks, Christian. Strange, prior to contacting you and on a hunch, I 
tried adding the missing “using” keyword but still got the syntax 
error. Anyway, everything is good now!


Best,
Ron

On August 1, 2018 at 3:57:51 AM, Christian Grün (christian.gr...@gmail.com)
wrote:



I have fixed the example in the doc.
Best, Christian


On Wed, Aug 1, 2018 at 5:08 AM Ron Katriel wrote:
>
> Hi,
>
> The following from your website (docs.basex.org/wiki/Full-Text) appears
> to be syntactically incorrect
>
> "'Äpfel' will not be found..." contains text "Apfel" diacritics sensitive
>
> In the BaseX GUI the keyword diacritics is underlined in red and the
> following error is reported
>
> Unexpected end of query: 'diacritic sens...'.
>
> This happens in version 8.6.4 and also the latest (9.0.2).
>
> Thanks,
> Ron
>
>
> Ron Katriel, Ph.D. | Principal Data Scientist | Medidata Solutions
>
> 350 Hudson Street, 7th Floor, New York, NY 10014
>
> rkatr...@mdsol.com | direct: +1 201 337 3622 | mobile: +1 201 675 5598 |
> main: +1 212 918 1800
>
>




Re: [basex-talk] diacritics sensitive not working

2018-08-03 Thread Ron Katriel
Christian,

Adding diacritics sensitive slows execution by a factor of 3. My script
(fragment below), which joins two large databases, namely CT.gov and
DrugBank, takes 2 hours without the diacritics sensitive constraint but 6
hours with it. Given the combinatorics involved, I am wondering if there is
a better way to do this in BaseX.

Thanks,
Ron


for $drug in db:open('DrugBank')/drugbank/drug
let $drug_name := $drug/name/text()
let $drug_synonyms := functx:value-union(
  normalize-space(lower-case($drug/name)),
  local:drug-synonyms($drug_name))
for $synonym_name in $drug_synonyms
...
for $study in db:open('CTGov')/clinical_study[
  intervention/intervention_name contains text { $synonym_name }
    using case insensitive using diacritics sensitive]
...


Ron Katriel, Ph.D. | Principal Data Scientist | Medidata Solutions

350 Hudson Street, 7th Floor, New York, NY 10014
rkatr...@mdsol.com | direct: +1 201 337 3622 | mobile: +1 201 675 5598 | main: +1 212 918 1800


On August 1, 2018 at 12:41:26 PM, Ron Katriel (rkatr...@mdsol.com) wrote:

Thanks, Christian. Strange, prior to contacting you and on a hunch, I tried
adding the missing “using” keyword but still got the syntax error. Anyway,
everything is good now!

Best,
Ron

On August 1, 2018 at 3:57:51 AM, Christian Grün (christian.gr...@gmail.com)
wrote:

I have fixed the example in the doc.
Best, Christian


On Wed, Aug 1, 2018 at 5:08 AM Ron Katriel  wrote:
>
> Hi,
>
> The following from your website (docs.basex.org/wiki/Full-Text) appears
> to be syntactically incorrect
>
> "'Äpfel' will not be found..." contains text "Apfel" diacritics sensitive
>
> In the BaseX GUI the keyword diacritics is underlined in red and the
> following error is reported
>
> Unexpected end of query: 'diacritic sens...'.
>
> This happens in version 8.6.4 and also the latest (9.0.2).
>
> Thanks,
> Ron
>
>
> Ron Katriel, Ph.D. | Principal Data Scientist | Medidata Solutions
>
> 350 Hudson Street, 7th Floor, New York, NY 10014
>
> rkatr...@mdsol.com | direct: +1 201 337 3622 | mobile: +1 201 675 5598 |
> main: +1 212 918 1800
>
>


Re: [basex-talk] diacritics sensitive not working

2018-08-01 Thread Ron Katriel
Thanks, Christian. Strange, prior to contacting you and on a hunch, I tried
adding the missing “using” keyword but still got the syntax error. Anyway,
everything is good now!

Best,
Ron

On August 1, 2018 at 3:57:51 AM, Christian Grün (christian.gr...@gmail.com)
wrote:

I have fixed the example in the doc.
Best, Christian


On Wed, Aug 1, 2018 at 5:08 AM Ron Katriel  wrote:
>
> Hi,
>
> The following from your website (docs.basex.org/wiki/Full-Text) appears
> to be syntactically incorrect
>
> "'Äpfel' will not be found..." contains text "Apfel" diacritics sensitive
>
> In the BaseX GUI the keyword diacritics is underlined in red and the
> following error is reported
>
> Unexpected end of query: 'diacritic sens...'.
>
> This happens in version 8.6.4 and also the latest (9.0.2).
>
> Thanks,
> Ron
>
>
> Ron Katriel, Ph.D. | Principal Data Scientist | Medidata Solutions
>
> 350 Hudson Street, 7th Floor, New York, NY 10014
>
> rkatr...@mdsol.com | direct: +1 201 337 3622 | mobile: +1 201 675 5598 |
> main: +1 212 918 1800
>
>


Re: [basex-talk] diacritics sensitive not working

2018-08-01 Thread Christian Grün
I have fixed the example in the doc.
Best, Christian
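
For reference, a sketch of the form that parses, using the string from
the report below. In XQuery Full Text every match option is introduced
by the keyword "using", which is what the original documentation example
lacked. The pairing of the two expressions is mine, added for
illustration.

```xquery
(: Default matching is diacritics insensitive, so "Apfel" finds "Äpfel". :)
"'Äpfel' will not be found..." contains text "Apfel",

(: With "using diacritics sensitive" the umlaut matters and nothing matches. :)
"'Äpfel' will not be found..." contains text "Apfel" using diacritics sensitive
```

Evaluated without an index, the first expression should yield true and
the second false.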


On Wed, Aug 1, 2018 at 5:08 AM Ron Katriel  wrote:
>
> Hi,
>
> The following from your website (docs.basex.org/wiki/Full-Text) appears to be 
> syntactically incorrect
>
> "'Äpfel' will not be found..." contains text "Apfel" diacritics sensitive
>
> In the BaseX GUI the keyword diacritics is underlined in red and the 
> following error is reported
>
> Unexpected end of query: 'diacritic sens...'.
>
> This happens in version 8.6.4 and also the latest (9.0.2).
>
> Thanks,
> Ron
>
>
> Ron Katriel, Ph.D. | Principal Data Scientist | Medidata Solutions
>
> 350 Hudson Street, 7th Floor, New York, NY 10014
>
> rkatr...@mdsol.com | direct: +1 201 337 3622 | mobile: +1 201 675 5598 | 
> main: +1 212 918 1800
>
>


[basex-talk] diacritics sensitive not working

2018-07-31 Thread Ron Katriel
Hi,

The following from your website (docs.basex.org/wiki/Full-Text) appears to
be syntactically incorrect

"'Äpfel' will not be found..." contains text "Apfel" diacritics sensitive

In the BaseX GUI the keyword diacritics is underlined in red and the
following error is reported

Unexpected end of query: 'diacritic sens...'.

This happens in version 8.6.4 and also the latest (9.0.2).

Thanks,
Ron


Ron Katriel, Ph.D. | Principal Data Scientist | Medidata Solutions

350 Hudson Street, 7th Floor, New York, NY 10014

rkatr...@mdsol.com | direct: +1 201 337 3622 | mobile: +1 201 675 5598 |
main: +1 212 918 1800