Re: [basex-talk] Text Index just over some elements
Hi Dirk and Fabrice,

Thank you so much for your help with these subjects! It has been a great help to read these answers, follow the link that Fabrice suggested, and read about indexes as well. It is now much clearer to me how BaseX works. I have to admit that part of my brain still fights and sticks to the RDBMS model after more than 12 years of working with it, but at the same time I'm pretty excited about working with BaseX.

I'll keep you posted on how this project goes, and I'll come back to you if I run into other questions ;)

Thank you so much!
Oscar H

(In reply to Dirk Kirsten d...@basex.org, 2014-09-25 5:54 GMT-05:00.)
Re: [basex-talk] Text Index just over some elements
Dear Oscar,

From what I read, I'm not sure you have looked at the underlying BaseX data structure yet. XML files in BaseX are digested into a binary format (http://docs.basex.org/wiki/Node_Storage), whereas 'stored' raw files are simply copied to the filesystem. Only digested data can be indexed.

Best regards,
Fabrice
Questel/Orbit

(In reply to Oscar Herrera's message "Re: [basex-talk] Text Index just over some elements", sent Wednesday, 24 September 2014, 19:41.)
Re: [basex-talk] Text Index just over some elements
Hello Oscar,

As Fabrice already suggested, maintaining a separate collection with node-id mappings might be a viable solution. Another option could be to split your documents up so that the relevant information is stored in one collection (which is indexed) and all the other, supplemental information is stored in another collection. This way, the first collection should be rather small and the text index should work fine.

> Of all the information we receive, I estimate that at this moment we only need around 25%. I thought about having different databases with full and partial information, but on one hand the requirements are not entirely defined, and on the other hand there is information that we use in the queries and other information that we still need to display to its owner, which we're displaying using XSLT.

If you need to display additional information, it is no problem to access multiple collections in a single XQuery. So splitting up the data should not be a show-stopper.

> == Question 1: Indexes are only required for some fields ==
> We usually need to locate the records by some id, or query over some of the elements available in the XML files, but those are pretty much always the same, so those are the elements that I'd like to have indexed. That's why I don't see a reason for having indexes over the contents of all the elements: it is unlikely (at least right now) that we'll make use of them, and they consume a lot of hard drive space.

You currently can't define an index that covers only certain elements. It would certainly be very nice to have super-flexible indexes, but as you can guess, this is a non-trivial task. Maintaining separate collections is currently the way to go.

> == Question 2: to store files on the filesystem or as raw on BaseX? ==
> Right now, we're storing the information we receive as XML files on the file system, on a RAID 10. What's your advice: to keep the files stored on the filesystem directly, or to let BaseX handle them? (I think this is the difference between the add/replace and store commands, right?) Is there any article you could point me to that I could use for reference? As I see it, BaseX is currently handling the queries and the index information but depends on the filesystem to retrieve the entire document. Am I right?

If you have a non-small collection of documents, simply storing them in the file system is certainly not very performant. Using XQuery, you can read from the file system, but that means parsing has to be executed each time.

As Fabrice pointed out (thanks!), the concept is different from what you described here. Using add/replace parses an XML file and adds it to the database. During parsing, the XML file is converted to a binary format, which makes it possible to optimize queries and to access relevant data much faster. You cannot add/replace arbitrary binary files in BaseX, as they would not be parseable. Store, on the other hand, simply copies the file and can therefore handle any binary file. This is useful if you want to store, e.g., media files within your DB, but you most likely do not want to store XML files as raw binaries, as the performance is similar to reading from the plain filesystem.

In short: you most likely want to add your documents to a collection.

> == Question 3: dynamic optimize and index updates? ==
> As you can imagine, I'll need to keep the indexes updated, since data mining will be done with the information from the people registered on the site. I've seen that it is not possible to run the optimize command while the app is up, and I'm not sure about the indexes getting updated in real time either. This is troubling me, since the idea is to have the app running 24x7, and if we get a lot of registered users, updating the indexes or optimizing the database will take a long time, won't it? Are there any strategies for this?

I don't quite get what you mean by "optimize cannot be run when the app is up". Optimize cannot be run if the database is opened by another context (as it is an updating operation and we maintain ACID), but your app shouldn't hold the database open all the time. One option you might want to look into is updating indexes (see http://docs.basex.org/wiki/Options#UPDINDEX); it might be beneficial for your use case. You still have to trigger the indexing by using optimize.

One common strategy for such scenarios is also to maintain separate collections: one with the most current data, which is not indexed and can be updated quite fast, and another with the bulk of the data, which is indexed and can be accessed fast. A cron job would then schedule the current data to be transferred to the other collection during times of low load on the server. This way, your updates are performed on a rather small collection without the need to optimize the indexes all the time, while read operations can be fast, as the majority of the data is nicely indexed. Again, accessing both collections
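
The two-collection strategy Dirk describes can be modelled in a few lines. What follows is only a plain-Python sketch of the idea, not BaseX code: the class `TwoTierStore`, its method names, and the sample records are all hypothetical, and the dictionary merely stands in for the index that optimize would build on the large collection.

```python
class TwoTierStore:
    """Model of a small unindexed 'current' tier plus a large indexed 'archive' tier."""

    def __init__(self):
        self.current = []        # recent records, scanned linearly (no index)
        self.archive = []        # bulk of the data
        self.archive_index = {}  # record id -> position in archive (stand-in for a real index)

    def add(self, record):
        # Writes only touch the small tier, so no index maintenance is needed here.
        self.current.append(record)

    def find(self, record_id):
        # Check the small tier first: it holds the most recent data.
        for record in self.current:
            if record["id"] == record_id:
                return record
        # Fall back to the indexed lookup in the bulk tier.
        pos = self.archive_index.get(record_id)
        return self.archive[pos] if pos is not None else None

    def transfer(self):
        # Scheduled job (e.g. cron at low load): move current data over, then rebuild the index.
        self.archive.extend(self.current)
        self.current.clear()
        self.archive_index = {r["id"]: i for i, r in enumerate(self.archive)}


store = TwoTierStore()
store.add({"id": "A-17", "name": "Ana"})
before = store.find("A-17")   # served by the linear scan of the small tier
store.transfer()
after = store.find("A-17")    # served by the index of the bulk tier
```

The point of the design is that the expensive step (rebuilding the index) happens only inside `transfer`, which can be scheduled, while reads consult both tiers so that freshly added records remain visible in the meantime.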
Re: [basex-talk] Text Index just over some elements
Hi Oscar,

You will have to maintain a separate collection in order to do that. That separate collection will contain the node-pre or node-id of each value to be indexed. Storing the node-pre is the faster way, but it requires an append-only main collection if you do not want to recreate the entire separate collection after each update of the main collection.

1. Add the new map entries (value-to-be-indexed, node-pre or node-id) to the separate collection.
2. Reindex the separate collection.

An even faster solution is to store the values in text nodes and the node-pre or node-id in attributes, so that only a text index needs to be created (or vice versa). That will speed up reindexing.

To use this custom index:

1. Use the db:attribute or db:text function on the separate collection to obtain the list of node-ids or node-pres associated with a given value.
2. For each node-xx, use the db:open-xx function on the main collection to obtain the real node.

If you are familiar with CouchBase/CouchDB, it's a little like creating a view ;-)

But such a built-in feature would be great!

Best regards,
Fabrice Etanchaud
Questel/Orbit

(In reply to Oscar Herrera's message "[basex-talk] Text Index just over some elements", sent Wednesday, 24 September 2014, 03:30.)
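
The two-step lookup Fabrice outlines (value to node-id, then node-id to node) can be illustrated with a small model. This is a plain-Python sketch under stated assumptions, not BaseX API: `main_collection`, `build_side_index`, `lookup`, and the sample data are hypothetical, and the dictionaries only stand in for the main collection and the separate mapping collection. In BaseX itself the two steps would go through db:text or db:attribute and db:open-xx, as described in the message.

```python
from collections import defaultdict

# Hypothetical stand-in for the main collection: node-id -> stored node.
main_collection = {
    101: {"person_id": "A-17", "name": "Ana"},
    102: {"person_id": "B-42", "name": "Luis"},
    103: {"person_id": "A-17", "name": "Ana (second report)"},
}


def build_side_index(collection, field):
    """Build the separate mapping: each value of `field` -> node-ids that carry it."""
    index = defaultdict(list)
    for node_id, node in collection.items():
        index[node[field]].append(node_id)
    return dict(index)


def lookup(index, collection, value):
    """Step 1: value -> node-ids via the side index. Step 2: node-ids -> real nodes."""
    return [collection[node_id] for node_id in index.get(value, [])]


side_index = build_side_index(main_collection, "person_id")
hits = lookup(side_index, main_collection, "A-17")
names = [hit["name"] for hit in hits]
```

Only the chosen field ends up in `side_index`, which is the whole purpose of the workaround: the mapping stays small even when the main collection holds large documents.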
Re: [basex-talk] Text Index just over some elements
== The Scenario ==

What we have is a dynamic collection with information about people who register on the site. The information is retrieved from third-party companies that provide it as XML via web service calls, so we request a person's information from these third parties at the moment the person registers.

That's how we got into BaseX: we consider it inconvenient to store large XML files in an RDBMS, and I don't see the point of parsing all the information when we receive it just to reorganize it, mostly because, from my point of view, the information is already well structured in these large XML files.

These XML files are about 2 MB each on average. Of course, some are very small (80 KB), and we have been advised that some might reach up to 500 MB.

Of all the information we receive, I estimate that at this moment we only need around 25%. I thought about having different databases with full and partial information, but on one hand the requirements are not entirely defined, and on the other hand there is information that we use in the queries and other information that we still need to display to its owner, which we're displaying using XSLT.

== Question 1: Indexes are only required for some fields ==

We usually need to locate the records by some id, or query over some of the elements available in the XML files, but those are pretty much always the same, so those are the elements that I'd like to have indexed. That's why I don't see a reason for having indexes over the contents of all the elements: it is unlikely (at least right now) that we'll make use of them, and they consume a lot of hard drive space.

== Question 2: to store files on the filesystem or as raw on BaseX? ==

Right now, we're storing the information we receive as XML files on the file system, on a RAID 10. What's your advice: to keep the files stored on the filesystem directly, or to let BaseX handle them? (I think this is the difference between the add/replace and store commands, right?) Is there any article you could point me to that I could use for reference? As I see it, BaseX is currently handling the queries and the index information but depends on the filesystem to retrieve the entire document. Am I right?

== Question 3: dynamic optimize and index updates? ==

As you can imagine, I'll need to keep the indexes updated, since data mining will be done with the information from the people registered on the site. I've seen that it is not possible to run the optimize command while the app is up, and I'm not sure about the indexes getting updated in real time either. This is troubling me, since the idea is to have the app running 24x7, and if we get a lot of registered users, updating the indexes or optimizing the database will take a long time, won't it? Are there any strategies for this?

== Question 4: connection pooling ==

I have only found XQJ-Pool to be used with BaseX. Does anybody know of any other pooling mechanism available for BaseX?

Thank you so much for your help with this subject, and sorry for the long, long email ;)

Oscar H

(In reply to Fabrice Etanchaud fetanch...@questel.com, 2014-09-24 3:21 GMT-05:00.)
[basex-talk] Text Index just over some elements
Hi,

I'm trying to tune my BaseX with a text index over only certain elements. Is that possible? What I have found so far is how to create a text index, but I have plenty of nodes in my documents that don't need to be indexed, since it is very unlikely that a search over those values will occur.

So, is there any way to create a text index over only certain elements and not all of them? If not, is this planned for the near future?

Thank you,
Oscar H