Re: [basex-talk] Issue with Full Text Retrieval

2017-09-11 Thread Ron Katriel
Thanks Fabrice and Michael. Solution (1) works great!

A parting question: when querying the textual representation of a document,
why not make the default behavior preserve critical word boundary delimiters
instead of "chopping" them away? In the example below it would then return

  XQuery
  and XPAth are   awesome

The munging together of "XPAth" and "are" seems counterintuitive to me.

Best,
Ron


Re: [basex-talk] Issue with Full Text Retrieval

2017-09-11 Thread Michael Seiferle
Hi Ron,
Hi Fabrice,

Your observation w.r.t. element boundaries is right: the document is
converted to a textual representation, which by default is the concatenation
of the string representations of all its nodes:

let $doc :=
(: the wrapper element name was lost in the archive; <doc> is assumed :)
<doc>
  XQuery
  <_>and XPAth</_>
  <_>are   awesome</_>
</doc>/data()

Will turn to:

  XQuery
  and XPAthare   awesome

So:

$doc contains text { 'XPath' }

will return false.
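
The effect can be reproduced without a database. The following is only a sketch: it builds the same fragment in memory (the wrapper element name doc is an assumption) and runs the test once against the whole element and once against its text nodes. In BaseX the first test should come out false, because the string value fuses "XPAth" and "are" into a single token, and the second should come out true, because each text node is tokenized on its own.

```xquery
(: Minimal in-memory sketch; <doc> is an assumed wrapper name. :)
let $doc :=
  <doc>
    XQuery
    <_>and XPAth</_>
    <_>are   awesome</_>
  </doc>
return (
  (: whole element: tested against the concatenated string value :)
  $doc contains text { 'XPath' },
  (: individual text nodes: each one is tokenized separately :)
  $doc//text() contains text { 'XPath' }
)
```

The match is case-insensitive by default, which is why 'XPath' can match the text "XPAth" at all.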

You have 3.5 options:

1) => as Fabrice showed, query the individual text nodes

2) use the ft:search() function to query the index directly,
http://docs.basex.org/wiki/Full-Text_Module#ft:search


ft:search(
  'CTGovDebug',
  'neoplasms'
)/.. (: get the parent element of the matching text() node :)

3) disable chopping when creating the database, 
http://docs.basex.org/wiki/Options#XML_Parsing 
 
db:create(
  'CTGovDebug',
  "Path/to/NCT00473512.xml",
  "NCT00473512.xml",
  map {
    'ftindex': true(),
    'chop': false()
  })


3.5) use the xml:space="preserve" attribute to tell the parser not to chop
the child nodes of the element that carries it when creating a database:

ClinicalTrials.gov processed this data on August 31, 2017
Link to the current ClinicalTrials.gov record.

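
The markup of the example above did not survive in the archive. As a stand-in, here is a minimal sketch of the same idea on a made-up document. The database name SpaceDemo, the file name demo.xml and the element names are all assumptions. The XML is passed as a string (assuming, as the Database Module documentation describes, that string inputs may be XML strings) so that the BaseX parser, and with it the chop setting, handles the whitespace.

```xquery
(: Hypothetical example: 'SpaceDemo', 'demo.xml' and <doc> are made-up names.
   xml:space="preserve" is meant to keep the whitespace-only text node between
   the two <_> elements, with 'chop' left at its BaseX 8.x default. :)
db:create(
  'SpaceDemo',
  '<doc xml:space="preserve">XQuery <_>and XPAth</_> <_>are   awesome</_></doc>',
  'demo.xml',
  map { 'ftindex': true() }
)
```

Run as a separate query afterwards, db:open('SpaceDemo') contains text { 'XPath' } should then succeed, since "XPAth" and "are" stay separated by a space.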



Hope this helped shed some light :-)

Best from Konstanz
Michael
--
Michael Seiferle, BaseX GmbH, http://www.basexgmbh.de
|-- Registered office: Obere Laube 73, 78462 Konstanz
|-- Register court: Freiburg, HRB: 708285, Managing directors:
|   Dr. Christian Grün, Dr. Alexander Holupirek, Michael Seiferle
`-- Tel: +49 7531 916 82 77



Re: [basex-talk] Issue with Full Text Retrieval

2017-09-11 Thread Fabrice ETANCHAUD
Hello Ron,

I don't know how ft operators behave on document nodes.
Supposing documents are converted to their data() representation, your query
would yield the same negative answer.
You should consider applying ft operators on text nodes like this:

for $trial in db:open('NCT00473512')//text() (: 
[clinical_study/id_info/nct_id='NCT00473512'] :)
return $trial[. contains text { 'neoplasms' }]
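
The query above returns the matching text nodes themselves. If the goal is to know which trials match, a variant along the following lines may be closer to the original intent. It is a sketch under assumptions: the path clinical_study/id_info/nct_id is taken from the commented-out predicate, and the database name CTGovDebug from the original post.

```xquery
(: Sketch: list the NCT ids of all trials in which any text node matches.
   Path and database name are assumptions taken from the thread. :)
for $trial in db:open('CTGovDebug')/clinical_study
where $trial//text() contains text { 'neoplasms' }
return $trial/id_info/nct_id/string()
```

Because the test runs per text node, words at the start or end of an element's text are no longer fused with their neighbours.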

Best regards,
Fabrice Etanchaud


From: basex-talk-boun...@mailman.uni-konstanz.de
[mailto:basex-talk-boun...@mailman.uni-konstanz.de] On behalf of Ron Katriel
Sent: Monday, September 11, 2017 00:42
To: BaseX
Subject: [basex-talk] Issue with Full Text Retrieval

Hi,

I am seeing strange behavior with Full Text retrieval. The following query 
fails for a number of words that are in the XML document (see attached):

for $trial in db:open('CTGovDebug') (: 
[clinical_study/id_info/nct_id='NCT00473512'] :)
return $trial contains text { 'neoplasms' }

It fails on a good number of words including neoplasms, cougar, industry, yes, 
completed, november, 2005, interventional, single, male, female, assignment, 
none, research, principal, primary, secondary, age, years, gender, etc. But it 
matches most of the words in the file.

Observation: The words that fail are located at the beginning and/or end of the 
text and do not occur anywhere else in the middle of any text.

The document is the only one in the database. It does not make a difference 
whether full text indexing is on or off. My BaseX version is 8.6.4.

Thanks,
Ron


Ron Katriel, Ph.D. | Principal Data Scientist | Medidata Solutions
350 Hudson Street, 7th Floor, New York, NY 10014
rkatr...@mdsol.com | direct: +1 201 337 3622 | mobile: +1 201 675 5598 | main: +1 212 918 1800