Re: [basex-talk] Text Index just over some elements

2014-09-28 Thread Oscar Herrera
Hi Dirk and Fabrice! Thank you so much for your help with these subjects.
It has been of great help to read these answers, to follow the link that
Fabrice suggested, and to read about indexes as well. It is now much clearer
to me how BaseX works. I have to admit that part of my brain still fights and
sticks to the RDBMS model after more than 12 years of working with it, but at
the same time I'm pretty excited about working with BaseX. I'll keep
you posted on how this project goes, and will come back
to you if I run into other questions ;)

Thank you so much!

Oscar H



Re: [basex-talk] Text Index just over some elements

2014-09-25 Thread Fabrice Etanchaud
Dear Oscar,

From what I read, I’m not sure you have had a look at the underlying BaseX data
structure yet.

XML files in BaseX are digested into a binary format:

http://docs.basex.org/wiki/Node_Storage

whereas ‘stored’ raw files are simply copied to the filesystem.

You can only index digested data.
Best regards,
Fabrice
Questel/Orbit



Re: [basex-talk] Text Index just over some elements

2014-09-25 Thread Dirk Kirsten
Hello Oscar,

As Fabrice already suggested, maintaining a separate collection with
node-id mappings might be a viable solution. Another option could be to
split your documents up in a way that the relevant information is stored
in one collection (which is indexed) and all the other supplemental
information is stored in another collection. This way, the first
collection should be rather small and the text index should work fine.

 
 So, of all the information we receive, I estimate that at this moment we
 only need around 25%. I thought about having different databases with full
 and partial information, but on the one hand the requirements are not
 entirely defined, and on the other there is information that we
 use in the queries and other information that we still need to display to
 its owner, which we're displaying using XSLT.

If you need to display additional information, it is no problem to
access multiple collections in a single XQuery. So splitting up the data
should not be a show-stopper.
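
As a rough illustration of the split (a Python model of the idea, not BaseX code; the two dictionaries, the field names and the function name are all made up for this sketch), a search can run against the small collection and then pull the display-only data from the other one by a shared id:

```python
# Hypothetical data: CORE holds the small, searchable part of each record,
# SUPPLEMENTAL holds the bulky rest. Both are keyed by the same person id.
CORE = {
    "A17": {"city": "Lima"},
    "B42": {"city": "Quito"},
}
SUPPLEMENTAL = {
    "A17": {"history": "full third-party report for A17"},
    "B42": {"history": "full third-party report for B42"},
}


def find_with_details(city):
    """Search the small collection, then pull display data from the other one."""
    results = []
    for person_id, fields in CORE.items():
        if fields["city"] == city:
            details = SUPPLEMENTAL.get(person_id, {})
            results.append({"id": person_id, **fields, **details})
    return results


print(find_with_details("Lima"))
```

In BaseX the same join would be written as a single XQuery over both collections; the sketch only shows that the split does not prevent combining the data at query time.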

 
 == Question 1: Indexes are only required for some fields ==
 We usually need to locate records by some id, or to query over some of the
 elements available in the XML files, and those are pretty much always the
 same, so those are the elements that I'd like to have indexed. I
 don't see a reason for indexing the contents of all the elements,
 since it is unlikely (at least right now) that we'll make use of those
 indexes, and they consume a lot of disk space.


You currently can't define an index to just select certain elements. It
would certainly be very nice to have super-flexible indexes, but as you
can guess this is a non-trivial task. Maintaining separate collections
is currently the way to go.
 
 == Question 2: to store files on the filesystem or as raw on BaseX? ==
 Right now, we're storing the information we receive as XML files on the
 file system, on a RAID 10. What's your advice: to keep the files
 stored on the filesystem directly, or to let BaseX handle them? (I think
 this is the difference between the add/replace and store commands, right?)
 Is there any article you could point me to for reference? As I see it,
 BaseX is currently handling the queries and the index information
 but depends on the filesystem to retrieve the entire document. Am I
 right?

If you have a non-small collection of documents, simply storing them in
the file system is certainly not very performant. Using XQuery, you can
read from the file system, but that means parsing has to be executed
each time.

As Fabrice pointed out (thanks!), the concept is different from what you
described here. Using add/replace parses an XML file and adds it to the
database. During parsing, the XML file is converted to a binary
format, which makes it possible to optimize queries and to access relevant
data much faster. You cannot add/replace arbitrary binary files in BaseX, as
they would not be parseable. Store, on the other hand, simply copies the file
and can therefore handle any binary file. This is useful if you e.g. want to
store media files within your DB, but you most likely do not want to
store XML files as raw binaries, as the performance is similar to
reading from the plain filesystem.

In short: You most likely want to add your documents to a collection.
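
The difference can be sketched in a few lines of Python (a conceptual model only, not how BaseX is implemented; the class names, the file name and the parse counters are invented for the illustration). A raw store has to parse on every query, while an "added" document is parsed once up front:

```python
import xml.etree.ElementTree as ET


class RawStore:
    """Keeps the bytes untouched, like a stored raw file; parses on every query."""

    def __init__(self):
        self.files = {}
        self.parse_count = 0

    def store(self, path, data):
        self.files[path] = data

    def city(self, path):
        self.parse_count += 1
        return ET.fromstring(self.files[path]).findtext("city")


class ParsedStore:
    """Parses once when the document is added; queries reuse the parsed tree."""

    def __init__(self):
        self.docs = {}
        self.parse_count = 0

    def add(self, path, data):
        self.parse_count += 1
        self.docs[path] = ET.fromstring(data)

    def city(self, path):
        return self.docs[path].findtext("city")


xml = b"<person><id>A17</id><city>Lima</city></person>"
raw, parsed = RawStore(), ParsedStore()
raw.store("a17.xml", xml)
parsed.add("a17.xml", xml)
for _ in range(3):
    raw.city("a17.xml")
    parsed.city("a17.xml")
print(raw.parse_count, parsed.parse_count)
```

With documents that average 2 MB and may reach 500 MB, the per-query parsing cost of the raw variant is the part that hurts.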

 
 == Question 3: dynamic optimize and index updates? ==
 As you can imagine, I'll need to keep the indexes updated,
 since data mining will be done on the information from the people
 registered on the site. I've seen that it is not possible to run the
 optimize command while the app is up, and I'm not sure whether the indexes
 get updated in real time either. This is troubling me, since the idea is to
 have the app running 24x7, and if we get a lot of registered users,
 updating the indexes or optimizing the db will take a long time, won't it?
 So, any strategies for this?

I don't quite get what you mean by optimize can not be run when the app
is up. Optimize can not be run if the database is opened by another
context (as it is updating and we maintain ACID), but your app shouldn't
hold open the database all the time.

One option you might want to look into is updating indexes (see
http://docs.basex.org/wiki/Options#UPDINDEX), it might be beneficial for
your use case. You still have to trigger the indexing by using optimize.

One common strategy for such scenarios is also to maintain separate
collections. One with the most current data, which is not indexed and
can be updated quite fast. And then another collection with the bulk of
the data, which is indexed and can be accessed fast. A cron job would then
schedule the current data to be transferred to the other collection
during times of low load on the server. This way, your updates will be
performed on a rather small collection without the need to optimize the
indexes all the time, while read operations can be fast as the majority
of the data is nicely indexed. Again, accessing both collections 
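
A minimal Python model of this two-collection strategy follows (purely illustrative: the class, its method names and the sample records are assumptions, and the merge method stands in for the scheduled job, which in BaseX would move the documents and then run optimize):

```python
from collections import defaultdict


class TwoTierStore:
    """Hypothetical model: small unindexed 'recent' tier plus indexed 'bulk' tier."""

    def __init__(self):
        self.recent = []                      # new records, scanned linearly
        self.bulk = []                        # older records
        self.bulk_index = defaultdict(list)   # key -> positions in self.bulk

    def add(self, key, record):
        # Writes touch only the small tier, so no index maintenance is needed.
        self.recent.append((key, record))

    def find(self, key):
        # Reads consult both tiers: index lookup on bulk, plain scan on recent.
        from_bulk = [self.bulk[pos][1] for pos in self.bulk_index.get(key, [])]
        from_recent = [rec for k, rec in self.recent if k == key]
        return from_bulk + from_recent

    def merge(self):
        # Stand-in for the scheduled low-load job: move recent data to bulk
        # and extend the index in one batch.
        for key, record in self.recent:
            self.bulk_index[key].append(len(self.bulk))
            self.bulk.append((key, record))
        self.recent.clear()


store = TwoTierStore()
store.add("A17", "first registration")
store.merge()
store.add("A17", "updated registration")
print(store.find("A17"))
```

The point of the design is that the linear scan only ever covers the data added since the last merge, so it stays cheap as long as the job runs regularly.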

Re: [basex-talk] Text Index just over some elements

2014-09-24 Thread Fabrice Etanchaud
Hi Oscar,

You will have to maintain a separate collection in order to do that.

That separate collection will contain the node-pre or node-id of each value to
be indexed.
Storing the node-pre is the faster way, but it requires an append-only main
collection if you do not want to recreate the entire separate
collection after each update of the main collection.


1.   Add the new map entries (value-to-be-indexed, node-pre or node-id) to
the separate collection

2.   Reindex the separate collection

An even faster solution is to store the values in text nodes and the node-pre
or node-id in attributes, so that only a text index has to be created (or vice
versa). That will speed up reindexing.

To use this custom index:

1.   Use the db:attribute or db:text function on the separate collection
to obtain the list of node-ids or node-pres associated with a given value.

2.   For each node-xx, use the db:open-xx function (db:open-id or db:open-pre)
on the main collection to obtain the real node.
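
The two lookup steps can be modelled in Python as follows (a language-neutral sketch of the idea, not BaseX code: the sample documents, the integer ids and the function names are invented, and a plain dictionary plays the role of the indexed separate collection):

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical main collection: node id -> parsed document.
MAIN = {
    1: ET.fromstring("<person><id>A17</id><city>Lima</city><notes>long text</notes></person>"),
    2: ET.fromstring("<person><id>B42</id><city>Quito</city><notes>more text</notes></person>"),
    3: ET.fromstring("<person><id>C99</id><city>Lima</city><notes>other text</notes></person>"),
}

# Only these element names are indexed; everything else is skipped.
INDEXED_ELEMENTS = {"id", "city"}


def build_side_index(main, indexed_elements):
    """Build the separate map: (element name, value) -> list of node ids."""
    index = defaultdict(list)
    for node_id, doc in main.items():
        for elem in doc.iter():
            if elem.tag in indexed_elements and elem.text:
                index[(elem.tag, elem.text)].append(node_id)
    return index


def lookup(index, main, name, value):
    """Step 1: ask the side index for ids. Step 2: fetch the real nodes."""
    return [main[node_id] for node_id in index.get((name, value), [])]


side_index = build_side_index(MAIN, INDEXED_ELEMENTS)
hits = lookup(side_index, MAIN, "city", "Lima")
print([doc.findtext("id") for doc in hits])
```

Because the notes elements never enter the side index, its size depends only on the elements that are actually searched, which is the goal of the original question.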

If you are familiar with CouchBase/CouchDB, it’s a little like creating a view 
;-)

But such a built-in feature would be great !

Best regards,
Fabrice Etanchaud
Questel/Orbit




Re: [basex-talk] Text Index just over some elements

2014-09-24 Thread Oscar Herrera
== The Scenario ==
What we have is a dynamic collection with information about people who
register on the site. The information is retrieved from third-party
companies that provide it to us as XML via web service
calls; we request each person's information from these third parties at
the moment the person registers. That's how we got into BaseX: we
consider it inconvenient to store large XML files in an RDBMS, and I don't
see the point of parsing all the information when we receive it just to
reorganize it, mostly because from my point of view the information is
already well structured in these large XML files.

These XML files are 2 MB each on average. Of course, some are very
small (80 KB), and we have been advised that some might get up to
500 MB.

So, of all the information we receive, I estimate that at this moment we
only need around 25%. I thought about having different databases with full
and partial information, but on the one hand the requirements are not
entirely defined, and on the other there is information that we
use in the queries and other information that we still need to display to
its owner, which we're displaying using XSLT.

== Question 1: Indexes are only required for some fields ==
We usually need to locate records by some id, or to query over some of the
elements available in the XML files, and those are pretty much always the
same, so those are the elements that I'd like to have indexed. I
don't see a reason for indexing the contents of all the elements,
since it is unlikely (at least right now) that we'll make use of those
indexes, and they consume a lot of disk space.

== Question 2: to store files on the filesystem or as raw on BaseX? ==
Right now, we're storing the information we receive as XML files on the
file system, on a RAID 10. What's your advice: to keep the files
stored on the filesystem directly, or to let BaseX handle them? (I think
this is the difference between the add/replace and store commands, right?)
Is there any article you could point me to for reference? As I see it,
BaseX is currently handling the queries and the index information
but depends on the filesystem to retrieve the entire document. Am I
right?

== Question 3: dynamic optimize and index updates? ==
As you can imagine, I'll need to keep the indexes updated,
since data mining will be done on the information from the people
registered on the site. I've seen that it is not possible to run the
optimize command while the app is up, and I'm not sure whether the indexes
get updated in real time either. This is troubling me, since the idea is to
have the app running 24x7, and if we get a lot of registered users,
updating the indexes or optimizing the db will take a long time, won't it?
So, any strategies for this?

== Question 4: connection pooling ==
I have only found XQJ-Pool for use with BaseX. Does anybody know of
any other pooling mechanism available for BaseX?

Thank you so much for your help with this subject, and sorry for the long
long email ;)

Oscar H







[basex-talk] Text Index just over some elements

2014-09-23 Thread Oscar Herrera
Hi,

I'm trying to tune my BaseX setup with a text index over certain elements
only. Is that possible? What I have found so far is how to create a text
index, but I have plenty of nodes in my documents that don't need to be
indexed, since it is very unlikely that a search over their values will occur.

So, is there any way to create a text index only over certain
elements and not all of them? If not, is this planned for the near future?

Thank you,

Oscar H