Re: [basex-talk] basex OOM on 30GB database upon running /dba/db-optimize/

2019-10-06 Thread first name last name
Regarding selective full-text indexing, I just tried
XQUERY db:optimize("linuxquestions.org-selective", true(), map { 'ftindex':
true(), 'ftinclude': 'div table td a' })
And I got an OOM on that as well; the exact stack trace is attached to this message.

I will open a separate thread regarding migrating the data from BaseX
shards to PostgreSQL (for the purpose of full-text indexing).
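For context, an even narrower variant of the call above, restricting the
full-text index to a single element, would look like this; whether 'td' is
where the post text actually sits in this dataset is only a guess, but the
fewer elements are included, the fewer tokens are indexed and the smaller
the final merge step becomes:

XQUERY db:optimize("linuxquestions.org-selective", true(), map { 'ftindex':
true(), 'ftinclude': 'td' })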

On Sun, Oct 6, 2019 at 10:19 AM Christian Grün 
wrote:

> The current full text index builder provides a similar outsourcing
> mechanism to that of the index builder for the default index structures;
> but the meta data structures are kept in main-memory, and they are more
> bulky. There are definitely ways to tackle this technically; it hasn't been
> of high priority so far, but this may change.
>
> Please note that you won't create an index over your whole data set in
> RDBMS. Instead, you'll usually create it for specific fields that you will
> query later on. It's a convenience feature in BaseX that you can build an
> index for all of your data. For large full-text corpora, however, it's
> recommendable in most cases to restrict indexing to the relevant XML
> elements.
>
>
>
>
> first name last name  schrieb am Sa., 5. Okt.
> 2019, 23:28:
>
>> Attached a more complete output of ./bin/basexhttp . Judging from this
>> output, it would seem that everything was ok, except for the full-text
>> index.
>> I now realize that I have another question about full-text indexes. It
>> seems like the full-text index here is dependent on the amount of memory
>> available (in other words, the more data to be indexed, the more RAM memory
>> required).
>>
>> I was using a certain popular RDBMS, for full-text indexing, and I never
>> bumped into problems like it running out of memory when building such
>> indexes.
>> I think their model uses a certain buffer in memory, and it keeps
>> multiple files on disk where it store data, and then it assembles together
>> the results in-memory
>> but always keeping the constraint of using only as much memory as was
>> declared to be allowed for it to use.
>> Perhaps the topic would be "external memory algorithms" or "full-text
>> search using secondary storage".
>> I'm not an expert in this field, but.. my question here would be if this
>> kind of thing is something that BaseX is looking to handle in the future?
>>
>> Thanks,
>> Stefan
>>
>>
>> On Sat, Oct 5, 2019 at 11:08 PM Christian Grün 
>> wrote:
>>
>>> The stack Trace indicates that you enabled the fulltext index as well.
>>> For this index, you definitely need more memory than available on your
>>> system.
>>>
>>> So I assume you didn't encounter trouble with the default index
>>> structures?
>>>
>>>
>>>
>>>
>>> first name last name  schrieb am Sa., 5. Okt.
>>> 2019, 20:52:
>>>
 Yes, I did, with -Xmx3100m (that's the maximum amount of memory I can
 allocate on that system for BaseX) and I got OOM.

 On Sat, Oct 5, 2019 at 2:19 AM Christian Grün <
 christian.gr...@gmail.com> wrote:

> About option 1: How much memory have you been able to assign to the
> Java VM?
>
>
>
>
>
> first name last name  schrieb am Sa., 5. Okt.
> 2019, 01:11:
>
>> I had another look at the script I wrote and realized that it's not
>> working as it's supposed to.
>> Apparently the order of operations should be this:
>> - turn on all the types of indexes required
>> - create the db
>> - the parser settings and the filter settings
>> - add all the files to the db
>> - run "OPTIMIZE"
>>
>> If I'm not doing them in this order (specifically with "OPTIMIZE" at
>> the end) the resulting db lacks all indexes.
>>
>>
>>
>> On Fri, Oct 4, 2019 at 11:32 PM first name last name <
>> randomcod...@gmail.com> wrote:
>>
>>> Hi Christian,
>>>
>>> About option 4:
>>> I agree with the options you laid out. I am currently diving deeper
>>> into option 4 in the list you wrote.
>>> Regarding the partitioning strategy, I agree. I did manage however
>>> to partition the files to be imported, into separate sets, with a
>>> constraint on max partition size (on disk) and max partition file count
>>> (the number of XML documents in each partition).
>>> The tool called fpart [5] made this possible (I can imagine more
>>> sophisticated bin-packing methods, involving pre-computed node count
>>> values, and other variables, can be achieved via glpk [6] but that 
>>> might be
>>> too much work).
>>> So, currently I am experimenting with a max partition size of 2.4GB
>>> and a max file count of 85k files, and fpart seems to have split the 
>>> file
>>> list into 11 partitions of 33k files each and the size of a partition 
>>> being
>>> ~ 2.4GB.
>>> So, I wrote a script for this, it's called sharded-import.sh and
>>> attached here. I'm also noticing that the /dba/ BaseX web interface is 
>>> not

Re: [basex-talk] basex OOM on 30GB database upon running /dba/db-optimize/

2019-10-06 Thread Christian Grün
The current full-text index builder provides an outsourcing mechanism
similar to that of the builder for the default index structures, but its
metadata structures are kept in main memory, and they are bulkier. There are
definitely ways to tackle this technically; it hasn't been a high priority
so far, but this may change.

Please note that in an RDBMS you won't create an index over your whole data
set either. Instead, you'll usually create it for specific fields that you
will query later on. It's a convenience feature of BaseX that you can build
an index over all of your data. For large full-text corpora, however, it's
advisable in most cases to restrict indexing to the relevant XML elements.
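A minimal sketch of such an element-restricted full-text index, expressed as
console commands; the database name, the source path, and the single
included element are placeholders rather than settings taken from this
thread:

SET FTINDEX true
SET FTINCLUDE td
CREATE DB linuxquestions-part01 /share/Public/archive/tech-sites/part01/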




first name last name  schrieb am Sa., 5. Okt. 2019,
23:28:

> Attached a more complete output of ./bin/basexhttp . Judging from this
> output, it would seem that everything was ok, except for the full-text
> index.
> I now realize that I have another question about full-text indexes. It
> seems like the full-text index here is dependent on the amount of memory
> available (in other words, the more data to be indexed, the more RAM memory
> required).
>
> I was using a certain popular RDBMS, for full-text indexing, and I never
> bumped into problems like it running out of memory when building such
> indexes.
> I think their model uses a certain buffer in memory, and it keeps multiple
> files on disk where it store data, and then it assembles together the
> results in-memory
> but always keeping the constraint of using only as much memory as was
> declared to be allowed for it to use.
> Perhaps the topic would be "external memory algorithms" or "full-text
> search using secondary storage".
> I'm not an expert in this field, but.. my question here would be if this
> kind of thing is something that BaseX is looking to handle in the future?
>
> Thanks,
> Stefan
>
>
> On Sat, Oct 5, 2019 at 11:08 PM Christian Grün 
> wrote:
>
>> The stack Trace indicates that you enabled the fulltext index as well.
>> For this index, you definitely need more memory than available on your
>> system.
>>
>> So I assume you didn't encounter trouble with the default index
>> structures?
>>
>>
>>
>>
>> first name last name  schrieb am Sa., 5. Okt.
>> 2019, 20:52:
>>
>>> Yes, I did, with -Xmx3100m (that's the maximum amount of memory I can
>>> allocate on that system for BaseX) and I got OOM.
>>>
>>> On Sat, Oct 5, 2019 at 2:19 AM Christian Grün 
>>> wrote:
>>>
 About option 1: How much memory have you been able to assign to the
 Java VM?





 first name last name  schrieb am Sa., 5. Okt.
 2019, 01:11:

> I had another look at the script I wrote and realized that it's not
> working as it's supposed to.
> Apparently the order of operations should be this:
> - turn on all the types of indexes required
> - create the db
> - the parser settings and the filter settings
> - add all the files to the db
> - run "OPTIMIZE"
>
> If I'm not doing them in this order (specifically with "OPTIMIZE" at
> the end) the resulting db lacks all indexes.
>
>
>
> On Fri, Oct 4, 2019 at 11:32 PM first name last name <
> randomcod...@gmail.com> wrote:
>
>> Hi Christian,
>>
>> About option 4:
>> I agree with the options you laid out. I am currently diving deeper
>> into option 4 in the list you wrote.
>> Regarding the partitioning strategy, I agree. I did manage however to
>> partition the files to be imported, into separate sets, with a constraint
>> on max partition size (on disk) and max partition file count (the number 
>> of
>> XML documents in each partition).
>> The tool called fpart [5] made this possible (I can imagine more
>> sophisticated bin-packing methods, involving pre-computed node count
>> values, and other variables, can be achieved via glpk [6] but that might 
>> be
>> too much work).
>> So, currently I am experimenting with a max partition size of 2.4GB
>> and a max file count of 85k files, and fpart seems to have split the file
>> list into 11 partitions of 33k files each and the size of a partition 
>> being
>> ~ 2.4GB.
>> So, I wrote a script for this, it's called sharded-import.sh and
>> attached here. I'm also noticing that the /dba/ BaseX web interface is 
>> not
>> blocked anymore if I run this script, as opposed to running the previous
>> import where I run
>>   CREATE DB db_name /directory/
>> which allows me to see the progress or allows me to run queries
>> before the big import finishes.
>> Maybe the downside is that it's more verbose, and prints out a ton of
>> lines like
>>   > ADD /share/Public/archive/tech-sites/
>> linuxquestions.org/threads/viewtopic_9_356613.html
>>   Resource(s) added in 47.76 ms.
>> along the way, and maybe that's slower than before.
>>
>> About option 1:
>> 

Re: [basex-talk] basex OOM on 30GB database upon running /dba/db-optimize/

2019-10-05 Thread first name last name
Attached is a more complete output of ./bin/basexhttp . Judging from this
output, it would seem that everything was OK except for the full-text
index.
I now realize that I have another question about full-text indexes. It
seems like building the full-text index depends on the amount of memory
available (in other words, the more data there is to index, the more RAM is
required).

I was using a certain popular RDBMS for full-text indexing, and I never ran
into out-of-memory problems when building such indexes.
I think its model uses a fixed in-memory buffer, keeps multiple files on
disk where it stores intermediate data, and then merges the results in
memory, always staying within the amount of memory it was configured to
use.
Perhaps the relevant topic is "external-memory algorithms" or "full-text
search using secondary storage".
I'm not an expert in this field, but my question is whether this kind of
thing is something that BaseX is looking to handle in the future?

Thanks,
Stefan


On Sat, Oct 5, 2019 at 11:08 PM Christian Grün 
wrote:

> The stack Trace indicates that you enabled the fulltext index as well. For
> this index, you definitely need more memory than available on your system.
>
> So I assume you didn't encounter trouble with the default index structures?
>
>
>
>
> first name last name  schrieb am Sa., 5. Okt.
> 2019, 20:52:
>
>> Yes, I did, with -Xmx3100m (that's the maximum amount of memory I can
>> allocate on that system for BaseX) and I got OOM.
>>
>> On Sat, Oct 5, 2019 at 2:19 AM Christian Grün 
>> wrote:
>>
>>> About option 1: How much memory have you been able to assign to the Java
>>> VM?
>>>
>>>
>>>
>>>
>>>
>>> first name last name  schrieb am Sa., 5. Okt.
>>> 2019, 01:11:
>>>
 I had another look at the script I wrote and realized that it's not
 working as it's supposed to.
 Apparently the order of operations should be this:
 - turn on all the types of indexes required
 - create the db
 - the parser settings and the filter settings
 - add all the files to the db
 - run "OPTIMIZE"

 If I'm not doing them in this order (specifically with "OPTIMIZE" at
 the end) the resulting db lacks all indexes.



 On Fri, Oct 4, 2019 at 11:32 PM first name last name <
 randomcod...@gmail.com> wrote:

> Hi Christian,
>
> About option 4:
> I agree with the options you laid out. I am currently diving deeper
> into option 4 in the list you wrote.
> Regarding the partitioning strategy, I agree. I did manage however to
> partition the files to be imported, into separate sets, with a constraint
> on max partition size (on disk) and max partition file count (the number 
> of
> XML documents in each partition).
> The tool called fpart [5] made this possible (I can imagine more
> sophisticated bin-packing methods, involving pre-computed node count
> values, and other variables, can be achieved via glpk [6] but that might 
> be
> too much work).
> So, currently I am experimenting with a max partition size of 2.4GB
> and a max file count of 85k files, and fpart seems to have split the file
> list into 11 partitions of 33k files each and the size of a partition 
> being
> ~ 2.4GB.
> So, I wrote a script for this, it's called sharded-import.sh and
> attached here. I'm also noticing that the /dba/ BaseX web interface is not
> blocked anymore if I run this script, as opposed to running the previous
> import where I run
>   CREATE DB db_name /directory/
> which allows me to see the progress or allows me to run queries before
> the big import finishes.
> Maybe the downside is that it's more verbose, and prints out a ton of
> lines like
>   > ADD /share/Public/archive/tech-sites/
> linuxquestions.org/threads/viewtopic_9_356613.html
>   Resource(s) added in 47.76 ms.
> along the way, and maybe that's slower than before.
>
> About option 1:
> Re: increase memory, I am running these experiments on a low-memory,
> old, network-attached storage, model QNAP TS-451+ [7] [8], which I had to
> take apart with a screwdriver to add 2GB of RAM (now it has 4GB of 
> memory),
> and I can't seem to find around the house any additional memory sticks to
> take it up to 8GB (which is also the maximum memory it supports). And if I
> want to find like 2 x 4GB sticks of RAM, the frequency of the memory has 
> to
> match what it supports, I'm having trouble finding the exact one, Corsair
> says it has memory sticks that would work, but I'd have to wait weeks for
> them to ship to Bucharest which is where I live.
> It seems like buying an Intel NUC that goes up to 64GB of memory would
> be a bit too expensive at $1639 [9] but .. people on reddit [10] were
> discussing some years back about this supermicro 

Re: [basex-talk] basex OOM on 30GB database upon running /dba/db-optimize/

2019-10-05 Thread Christian Grün
The stack trace indicates that you enabled the full-text index as well. For
this index, you definitely need more memory than is available on your system.

So I assume you didn't encounter trouble with the default index structures?




first name last name  schrieb am Sa., 5. Okt. 2019,
20:52:

> Yes, I did, with -Xmx3100m (that's the maximum amount of memory I can
> allocate on that system for BaseX) and I got OOM.
>
> On Sat, Oct 5, 2019 at 2:19 AM Christian Grün 
> wrote:
>
>> About option 1: How much memory have you been able to assign to the Java
>> VM?
>>
>>
>>
>>
>>
>> first name last name  schrieb am Sa., 5. Okt.
>> 2019, 01:11:
>>
>>> I had another look at the script I wrote and realized that it's not
>>> working as it's supposed to.
>>> Apparently the order of operations should be this:
>>> - turn on all the types of indexes required
>>> - create the db
>>> - the parser settings and the filter settings
>>> - add all the files to the db
>>> - run "OPTIMIZE"
>>>
>>> If I'm not doing them in this order (specifically with "OPTIMIZE" at the
>>> end) the resulting db lacks all indexes.
>>>
>>>
>>>
>>> On Fri, Oct 4, 2019 at 11:32 PM first name last name <
>>> randomcod...@gmail.com> wrote:
>>>
 Hi Christian,

 About option 4:
 I agree with the options you laid out. I am currently diving deeper
 into option 4 in the list you wrote.
 Regarding the partitioning strategy, I agree. I did manage however to
 partition the files to be imported, into separate sets, with a constraint
 on max partition size (on disk) and max partition file count (the number of
 XML documents in each partition).
 The tool called fpart [5] made this possible (I can imagine more
 sophisticated bin-packing methods, involving pre-computed node count
 values, and other variables, can be achieved via glpk [6] but that might be
 too much work).
 So, currently I am experimenting with a max partition size of 2.4GB and
 a max file count of 85k files, and fpart seems to have split the file list
 into 11 partitions of 33k files each and the size of a partition being ~
 2.4GB.
 So, I wrote a script for this, it's called sharded-import.sh and
 attached here. I'm also noticing that the /dba/ BaseX web interface is not
 blocked anymore if I run this script, as opposed to running the previous
 import where I run
   CREATE DB db_name /directory/
 which allows me to see the progress or allows me to run queries before
 the big import finishes.
 Maybe the downside is that it's more verbose, and prints out a ton of
 lines like
   > ADD /share/Public/archive/tech-sites/
 linuxquestions.org/threads/viewtopic_9_356613.html
   Resource(s) added in 47.76 ms.
 along the way, and maybe that's slower than before.

 About option 1:
 Re: increase memory, I am running these experiments on a low-memory,
 old, network-attached storage, model QNAP TS-451+ [7] [8], which I had to
 take apart with a screwdriver to add 2GB of RAM (now it has 4GB of memory),
 and I can't seem to find around the house any additional memory sticks to
 take it up to 8GB (which is also the maximum memory it supports). And if I
 want to find like 2 x 4GB sticks of RAM, the frequency of the memory has to
 match what it supports, I'm having trouble finding the exact one, Corsair
 says it has memory sticks that would work, but I'd have to wait weeks for
 them to ship to Bucharest which is where I live.
 It seems like buying an Intel NUC that goes up to 64GB of memory would
 be a bit too expensive at $1639 [9] but .. people on reddit [10] were
 discussing some years back about this supermicro server [11] which is only
 $668 and would allow to add up to 64GB of memory.
 Basically I would buy something cheap that I can jampack with a lot of
 RAM, but a hands-off approach would be best here, so if it comes
 pre-equipped with all the memory and everything, would be nice (would spare
 the trouble of having to buy the memory separate, making sure it matches
 the motherboard specs etc).

 About option 2:
 In fact, that's a great idea. But it would require me to write
 something that would figure out the XPath patterns where the actual content
 sits. I actually wanted to look for some algorithm that's designed to do
 that, and try to implement it, but I had no time.
 It would either have to detect the repetitive bloated nodes, and build
 XPaths for the rest of the nodes, where the actual content sits. I think
 this would be equivalent to computing the "web template" of a website,
 given all its pages.
 It would definitely decrease the size of the content that would have to
 be indexed.
 By the way, here I'm writing about a more general procedure, because
 it's not just this dataset that I want to import.. I want to import heavy,
 large amounts of data 

Re: [basex-talk] basex OOM on 30GB database upon running /dba/db-optimize/

2019-10-05 Thread first name last name
Yes, I did, with -Xmx3100m (that's the maximum amount of memory I can
allocate on that system for BaseX) and I got OOM.

On Sat, Oct 5, 2019 at 2:19 AM Christian Grün 
wrote:

> About option 1: How much memory have you been able to assign to the Java
> VM?
>
>
>
>
>
> first name last name  schrieb am Sa., 5. Okt.
> 2019, 01:11:
>
>> I had another look at the script I wrote and realized that it's not
>> working as it's supposed to.
>> Apparently the order of operations should be this:
>> - turn on all the types of indexes required
>> - create the db
>> - the parser settings and the filter settings
>> - add all the files to the db
>> - run "OPTIMIZE"
>>
>> If I'm not doing them in this order (specifically with "OPTIMIZE" at the
>> end) the resulting db lacks all indexes.
>>
>>
>>
>> On Fri, Oct 4, 2019 at 11:32 PM first name last name <
>> randomcod...@gmail.com> wrote:
>>
>>> Hi Christian,
>>>
>>> About option 4:
>>> I agree with the options you laid out. I am currently diving deeper into
>>> option 4 in the list you wrote.
>>> Regarding the partitioning strategy, I agree. I did manage however to
>>> partition the files to be imported, into separate sets, with a constraint
>>> on max partition size (on disk) and max partition file count (the number of
>>> XML documents in each partition).
>>> The tool called fpart [5] made this possible (I can imagine more
>>> sophisticated bin-packing methods, involving pre-computed node count
>>> values, and other variables, can be achieved via glpk [6] but that might be
>>> too much work).
>>> So, currently I am experimenting with a max partition size of 2.4GB and
>>> a max file count of 85k files, and fpart seems to have split the file list
>>> into 11 partitions of 33k files each and the size of a partition being ~
>>> 2.4GB.
>>> So, I wrote a script for this, it's called sharded-import.sh and
>>> attached here. I'm also noticing that the /dba/ BaseX web interface is not
>>> blocked anymore if I run this script, as opposed to running the previous
>>> import where I run
>>>   CREATE DB db_name /directory/
>>> which allows me to see the progress or allows me to run queries before
>>> the big import finishes.
>>> Maybe the downside is that it's more verbose, and prints out a ton of
>>> lines like
>>>   > ADD /share/Public/archive/tech-sites/
>>> linuxquestions.org/threads/viewtopic_9_356613.html
>>>   Resource(s) added in 47.76 ms.
>>> along the way, and maybe that's slower than before.
>>>
>>> About option 1:
>>> Re: increase memory, I am running these experiments on a low-memory,
>>> old, network-attached storage, model QNAP TS-451+ [7] [8], which I had to
>>> take apart with a screwdriver to add 2GB of RAM (now it has 4GB of memory),
>>> and I can't seem to find around the house any additional memory sticks to
>>> take it up to 8GB (which is also the maximum memory it supports). And if I
>>> want to find like 2 x 4GB sticks of RAM, the frequency of the memory has to
>>> match what it supports, I'm having trouble finding the exact one, Corsair
>>> says it has memory sticks that would work, but I'd have to wait weeks for
>>> them to ship to Bucharest which is where I live.
>>> It seems like buying an Intel NUC that goes up to 64GB of memory would
>>> be a bit too expensive at $1639 [9] but .. people on reddit [10] were
>>> discussing some years back about this supermicro server [11] which is only
>>> $668 and would allow to add up to 64GB of memory.
>>> Basically I would buy something cheap that I can jampack with a lot of
>>> RAM, but a hands-off approach would be best here, so if it comes
>>> pre-equipped with all the memory and everything, would be nice (would spare
>>> the trouble of having to buy the memory separate, making sure it matches
>>> the motherboard specs etc).
>>>
>>> About option 2:
>>> In fact, that's a great idea. But it would require me to write something
>>> that would figure out the XPath patterns where the actual content sits. I
>>> actually wanted to look for some algorithm that's designed to do that, and
>>> try to implement it, but I had no time.
>>> It would either have to detect the repetitive bloated nodes, and build
>>> XPaths for the rest of the nodes, where the actual content sits. I think
>>> this would be equivalent to computing the "web template" of a website,
>>> given all its pages.
>>> It would definitely decrease the size of the content that would have to
>>> be indexed.
>>> By the way, here I'm writing about a more general procedure, because
>>> it's not just this dataset that I want to import.. I want to import heavy,
>>> large amounts of data :)
>>>
>>> These are my thoughts for now
>>>
>>> [5] https://github.com/martymac/fpart
>>> [6] https://www.gnu.org/software/glpk/
>>> [7] https://www.amazon.com/dp/B015VNLGF8
>>> [8] https://www.qnap.com/en/product/ts-451+
>>> [9]
>>> https://www.amazon.com/Intel-NUC-NUC8I7HNK-Gaming-Mini/dp/B07WGWWSWT/
>>> [10]
>>> 

Re: [basex-talk] basex OOM on 30GB database upon running /dba/db-optimize/

2019-10-04 Thread Christian Grün
About option 1: How much memory have you been able to assign to the Java VM?





first name last name  schrieb am Sa., 5. Okt. 2019,
01:11:

> I had another look at the script I wrote and realized that it's not
> working as it's supposed to.
> Apparently the order of operations should be this:
> - turn on all the types of indexes required
> - create the db
> - the parser settings and the filter settings
> - add all the files to the db
> - run "OPTIMIZE"
>
> If I'm not doing them in this order (specifically with "OPTIMIZE" at the
> end) the resulting db lacks all indexes.
>
>
>
> On Fri, Oct 4, 2019 at 11:32 PM first name last name <
> randomcod...@gmail.com> wrote:
>
>> Hi Christian,
>>
>> About option 4:
>> I agree with the options you laid out. I am currently diving deeper into
>> option 4 in the list you wrote.
>> Regarding the partitioning strategy, I agree. I did manage however to
>> partition the files to be imported, into separate sets, with a constraint
>> on max partition size (on disk) and max partition file count (the number of
>> XML documents in each partition).
>> The tool called fpart [5] made this possible (I can imagine more
>> sophisticated bin-packing methods, involving pre-computed node count
>> values, and other variables, can be achieved via glpk [6] but that might be
>> too much work).
>> So, currently I am experimenting with a max partition size of 2.4GB and a
>> max file count of 85k files, and fpart seems to have split the file list
>> into 11 partitions of 33k files each and the size of a partition being ~
>> 2.4GB.
>> So, I wrote a script for this, it's called sharded-import.sh and attached
>> here. I'm also noticing that the /dba/ BaseX web interface is not blocked
>> anymore if I run this script, as opposed to running the previous import
>> where I run
>>   CREATE DB db_name /directory/
>> which allows me to see the progress or allows me to run queries before
>> the big import finishes.
>> Maybe the downside is that it's more verbose, and prints out a ton of
>> lines like
>>   > ADD /share/Public/archive/tech-sites/
>> linuxquestions.org/threads/viewtopic_9_356613.html
>>   Resource(s) added in 47.76 ms.
>> along the way, and maybe that's slower than before.
>>
>> About option 1:
>> Re: increase memory, I am running these experiments on a low-memory, old,
>> network-attached storage, model QNAP TS-451+ [7] [8], which I had to take
>> apart with a screwdriver to add 2GB of RAM (now it has 4GB of memory), and
>> I can't seem to find around the house any additional memory sticks to take
>> it up to 8GB (which is also the maximum memory it supports). And if I want
>> to find like 2 x 4GB sticks of RAM, the frequency of the memory has to
>> match what it supports, I'm having trouble finding the exact one, Corsair
>> says it has memory sticks that would work, but I'd have to wait weeks for
>> them to ship to Bucharest which is where I live.
>> It seems like buying an Intel NUC that goes up to 64GB of memory would be
>> a bit too expensive at $1639 [9] but .. people on reddit [10] were
>> discussing some years back about this supermicro server [11] which is only
>> $668 and would allow to add up to 64GB of memory.
>> Basically I would buy something cheap that I can jampack with a lot of
>> RAM, but a hands-off approach would be best here, so if it comes
>> pre-equipped with all the memory and everything, would be nice (would spare
>> the trouble of having to buy the memory separate, making sure it matches
>> the motherboard specs etc).
>>
>> About option 2:
>> In fact, that's a great idea. But it would require me to write something
>> that would figure out the XPath patterns where the actual content sits. I
>> actually wanted to look for some algorithm that's designed to do that, and
>> try to implement it, but I had no time.
>> It would either have to detect the repetitive bloated nodes, and build
>> XPaths for the rest of the nodes, where the actual content sits. I think
>> this would be equivalent to computing the "web template" of a website,
>> given all its pages.
>> It would definitely decrease the size of the content that would have to
>> be indexed.
>> By the way, here I'm writing about a more general procedure, because it's
>> not just this dataset that I want to import.. I want to import heavy, large
>> amounts of data :)
>>
>> These are my thoughts for now
>>
>> [5] https://github.com/martymac/fpart
>> [6] https://www.gnu.org/software/glpk/
>> [7] https://www.amazon.com/dp/B015VNLGF8
>> [8] https://www.qnap.com/en/product/ts-451+
>> [9] https://www.amazon.com/Intel-NUC-NUC8I7HNK-Gaming-Mini/dp/B07WGWWSWT/
>> [10]
>> https://www.reddit.com/r/sysadmin/comments/64x2sb/nuc_like_system_but_with_64gb_ram/
>> [11]
>> https://www.amazon.com/Supermicro-SuperServer-E300-8D-Mini-1U-D-1518/dp/B01M0VTV3E
>>
>>
>> On Thu, Oct 3, 2019 at 1:30 PM Christian Grün 
>> wrote:
>>
>>> Exactly, it seems to be the final MERGE step during index creation
>>> that blows up your system. If you are 

Re: [basex-talk] basex OOM on 30GB database upon running /dba/db-optimize/

2019-10-04 Thread first name last name
I had another look at the script I wrote and realized that it's not working
as it's supposed to.
Apparently the order of operations should be this (see the command-script
sketch below):
- turn on all the types of indexes required
- create the db
- apply the parser settings and the filter settings
- add all the files to the db
- run "OPTIMIZE"

If I don't run them in this order (specifically with "OPTIMIZE" at the end),
the resulting db lacks all indexes.
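A command-script sketch of that order of operations, as referenced above;
the database name, the source path, and the parser and index options are
placeholders, not the actual settings used here:

SET TEXTINDEX true
SET ATTRINDEX true
SET FTINDEX true
SET FTINCLUDE div table td a
CREATE DB linuxquestions-part01
SET PARSER html
SET CREATEFILTER *.html
ADD /share/Public/archive/tech-sites/part01/
OPTIMIZE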



On Fri, Oct 4, 2019 at 11:32 PM first name last name 
wrote:

> Hi Christian,
>
> About option 4:
> I agree with the options you laid out. I am currently diving deeper into
> option 4 in the list you wrote.
> Regarding the partitioning strategy, I agree. I did manage however to
> partition the files to be imported, into separate sets, with a constraint
> on max partition size (on disk) and max partition file count (the number of
> XML documents in each partition).
> The tool called fpart [5] made this possible (I can imagine more
> sophisticated bin-packing methods, involving pre-computed node count
> values, and other variables, can be achieved via glpk [6] but that might be
> too much work).
> So, currently I am experimenting with a max partition size of 2.4GB and a
> max file count of 85k files, and fpart seems to have split the file list
> into 11 partitions of 33k files each and the size of a partition being ~
> 2.4GB.
> So, I wrote a script for this, it's called sharded-import.sh and attached
> here. I'm also noticing that the /dba/ BaseX web interface is not blocked
> anymore if I run this script, as opposed to running the previous import
> where I run
>   CREATE DB db_name /directory/
> which allows me to see the progress or allows me to run queries before the
> big import finishes.
> Maybe the downside is that it's more verbose, and prints out a ton of
> lines like
>   > ADD /share/Public/archive/tech-sites/
> linuxquestions.org/threads/viewtopic_9_356613.html
>   Resource(s) added in 47.76 ms.
> along the way, and maybe that's slower than before.
>
> About option 1:
> Re: increase memory, I am running these experiments on a low-memory, old,
> network-attached storage, model QNAP TS-451+ [7] [8], which I had to take
> apart with a screwdriver to add 2GB of RAM (now it has 4GB of memory), and
> I can't seem to find around the house any additional memory sticks to take
> it up to 8GB (which is also the maximum memory it supports). And if I want
> to find like 2 x 4GB sticks of RAM, the frequency of the memory has to
> match what it supports, I'm having trouble finding the exact one, Corsair
> says it has memory sticks that would work, but I'd have to wait weeks for
> them to ship to Bucharest which is where I live.
> It seems like buying an Intel NUC that goes up to 64GB of memory would be
> a bit too expensive at $1639 [9] but .. people on reddit [10] were
> discussing some years back about this supermicro server [11] which is only
> $668 and would allow to add up to 64GB of memory.
> Basically I would buy something cheap that I can jampack with a lot of
> RAM, but a hands-off approach would be best here, so if it comes
> pre-equipped with all the memory and everything, would be nice (would spare
> the trouble of having to buy the memory separate, making sure it matches
> the motherboard specs etc).
>
> About option 2:
> In fact, that's a great idea. But it would require me to write something
> that would figure out the XPath patterns where the actual content sits. I
> actually wanted to look for some algorithm that's designed to do that, and
> try to implement it, but I had no time.
> It would either have to detect the repetitive bloated nodes, and build
> XPaths for the rest of the nodes, where the actual content sits. I think
> this would be equivalent to computing the "web template" of a website,
> given all its pages.
> It would definitely decrease the size of the content that would have to be
> indexed.
> By the way, here I'm writing about a more general procedure, because it's
> not just this dataset that I want to import.. I want to import heavy, large
> amounts of data :)
>
> These are my thoughts for now
>
> [5] https://github.com/martymac/fpart
> [6] https://www.gnu.org/software/glpk/
> [7] https://www.amazon.com/dp/B015VNLGF8
> [8] https://www.qnap.com/en/product/ts-451+
> [9] https://www.amazon.com/Intel-NUC-NUC8I7HNK-Gaming-Mini/dp/B07WGWWSWT/
> [10]
> https://www.reddit.com/r/sysadmin/comments/64x2sb/nuc_like_system_but_with_64gb_ram/
> [11]
> https://www.amazon.com/Supermicro-SuperServer-E300-8D-Mini-1U-D-1518/dp/B01M0VTV3E
>
>
> On Thu, Oct 3, 2019 at 1:30 PM Christian Grün 
> wrote:
>
>> Exactly, it seems to be the final MERGE step during index creation
>> that blows up your system. If you are restricted to the 2 GB of
>> main-memory, this is what you could try next:
>>
>> 1. Did you already try to tweak the JVM memory limit via -Xmx? What’s
>> the largest value that you can assign on your system?
>>
>> 2. If you will query only specific values of your data sets, you 

Re: [basex-talk] basex OOM on 30GB database upon running /dba/db-optimize/

2019-10-04 Thread first name last name
Hi Christian,

About option 4:
I agree with the options you laid out. I am currently diving deeper into
option 4 from your list.
Regarding the partitioning strategy, I agree. I did, however, manage to
partition the files to be imported into separate sets, with a constraint on
the maximum partition size (on disk) and the maximum partition file count
(the number of XML documents in each partition).
The tool fpart [5] made this possible. (I can imagine that more
sophisticated bin-packing methods, involving pre-computed node counts and
other variables, could be implemented with glpk [6], but that might be too
much work.)
Currently I am experimenting with a maximum partition size of 2.4GB and a
maximum file count of 85k files, and fpart seems to have split the file list
into 11 partitions of about 33k files each, with each partition weighing in
at ~2.4GB.
So I wrote a script for this; it's called sharded-import.sh and is attached
here. I'm also noticing that the /dba/ BaseX web interface is no longer
blocked while this script runs, as opposed to the previous import, where I
ran
  CREATE DB db_name /directory/
so I can now watch the progress and run queries before the big import
finishes.
Maybe the downside is that it's more verbose and prints out a ton of lines
like
  > ADD /share/Public/archive/tech-sites/
linuxquestions.org/threads/viewtopic_9_356613.html
  Resource(s) added in 47.76 ms.
along the way, and maybe that's slower than before.
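For reference, a rough sketch of what a sharded import along these lines
could look like as a shell script; the real sharded-import.sh attached to
this mail is not reproduced in the archive, and the partition list names,
database names, and paths below are assumptions:

#!/bin/sh
# Assumes fpart wrote one file list per partition, e.g. parts.1 .. parts.11.
LISTS=/share/Public/partitions
for LIST in "$LISTS"/parts.*; do
  N=${LIST##*.}                      # partition number (1, 2, ...)
  {
    echo "CREATE DB linuxquestions-$N"
    while IFS= read -r FILE; do
      echo "ADD $FILE"               # one ADD per HTML/XML document
    done < "$LIST"
    echo "OPTIMIZE"
  } | basex                          # pipe the generated commands to the BaseX console
done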

About option 1:
Re: increasing memory, I am running these experiments on a low-memory, old
network-attached storage device, a QNAP TS-451+ [7] [8], which I had to take
apart with a screwdriver to add 2GB of RAM (so it now has 4GB), and I can't
find any additional memory sticks around the house to take it up to 8GB
(which is also the maximum it supports). And if I want to buy something like
2 x 4GB sticks of RAM, the memory frequency has to match what the device
supports, and I'm having trouble finding the exact model; Corsair says it
has compatible sticks, but I'd have to wait weeks for them to ship to
Bucharest, which is where I live.
It seems like buying an Intel NUC that goes up to 64GB of memory would be a
bit too expensive at $1639 [9], but people on reddit [10] were discussing,
some years back, this Supermicro server [11], which is only $668 and would
take up to 64GB of memory.
Basically I would buy something cheap that I can pack with a lot of RAM,
though a hands-off approach would be best here, so something that comes
pre-equipped with all the memory would be nice (it would spare me the
trouble of buying the memory separately and making sure it matches the
motherboard specs, etc.).

About option 2:
In fact, that's a great idea. But it would require me to write something
that figures out the XPath patterns where the actual content sits. I
actually wanted to look for an algorithm designed to do that, and try to
implement it, but I had no time.
It would have to detect the repetitive boilerplate nodes and build XPaths
for the remaining nodes, where the actual content sits. I think this would
be equivalent to computing the "web template" of a website, given all its
pages.
It would definitely decrease the amount of content that has to be indexed.
By the way, I'm describing a more general procedure here, because it's not
just this dataset that I want to import; I want to import heavy, large
amounts of data :)

These are my thoughts for now

[5] https://github.com/martymac/fpart
[6] https://www.gnu.org/software/glpk/
[7] https://www.amazon.com/dp/B015VNLGF8
[8] https://www.qnap.com/en/product/ts-451+
[9] https://www.amazon.com/Intel-NUC-NUC8I7HNK-Gaming-Mini/dp/B07WGWWSWT/
[10]
https://www.reddit.com/r/sysadmin/comments/64x2sb/nuc_like_system_but_with_64gb_ram/
[11]
https://www.amazon.com/Supermicro-SuperServer-E300-8D-Mini-1U-D-1518/dp/B01M0VTV3E


On Thu, Oct 3, 2019 at 1:30 PM Christian Grün 
wrote:

> Exactly, it seems to be the final MERGE step during index creation
> that blows up your system. If you are restricted to the 2 GB of
> main-memory, this is what you could try next:
>
> 1. Did you already try to tweak the JVM memory limit via -Xmx? What’s
> the largest value that you can assign on your system?
>
> 2. If you will query only specific values of your data sets, you can
> restrict your indexes to specific elements or attributes; this will
> reduce memory consumption (see [1] for details). If you observe that
> no indexes will be utilized in your queries anyway, you can simply
> disable the text and attribute indexes, and memory usage will shrink
> even more.
>
> 3. Create your database on a more powerful system [2] and move it to
> your target machine (makes only sense if there’s no need for further
> updates).
>
> 4. Distribute your data across multiple databases. In some way, this
> is comparable to sharding; it cannot be automated, though, as the
> partitioning strategy depends on the 

Re: [basex-talk] basex OOM on 30GB database upon running /dba/db-optimize/

2019-10-03 Thread Christian Grün
Exactly, it seems to be the final MERGE step during index creation that
blows up your system. If you are restricted to 2 GB of main memory, this is
what you could try next:

1. Did you already try to tweak the JVM memory limit via -Xmx? What’s
the largest value that you can assign on your system?

2. If you will query only specific values of your data sets, you can
restrict your indexes to specific elements or attributes; this will
reduce memory consumption (see [1] for details and the sketch after this
list). If you observe that no indexes will be utilized in your queries
anyway, you can simply disable the text and attribute indexes, and memory
usage will shrink even more.

3. Create your database on a more powerful system [2] and move it to
your target machine (this only makes sense if there's no need for further
updates).

4. Distribute your data across multiple databases. In some way, this
is comparable to sharding; it cannot be automated, though, as the
partitioning strategy depends on the characteristics of your XML input
data (some people have huge standalone documents, others have millions
of small documents, …).

[1] http://docs.basex.org/wiki/Indexes
[2] A single CREATE call may be sufficient: CREATE DB database
sample-data-for-basex-mailing-list-linuxquestions.org.tar.gz
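For illustration, option 2 could look roughly like the following db:optimize
call (the sketch mentioned in point 2 above); the database name and the
element/attribute lists are assumptions and would have to match the actual
structure of the pages:

XQUERY db:optimize("linuxquestions.org", true(), map {
  'textindex': true(), 'textinclude': 'td div',
  'attrindex': true(), 'attrinclude': 'href class',
  'ftindex': false()
})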




On Thu, Oct 3, 2019 at 8:53 AM first name last name
 wrote:
>
> I tried again, using SPLITSIZE = 12 in the .basex config file
> The batch(console) script I used is attached mass-import.xq
> This time I didn't do the optimize or index creation post-import, but 
> instead, I did it as part of the import similar to what
> is described in [4].
> This time I got a different error, that is, "org.basex.core.BaseXException: 
> Out of Main Memory."
> So right now.. I'm a bit out of ideas. Would AUTOOPTIMIZE make any difference 
> here?
>
> Thanks
>
> [4] http://docs.basex.org/wiki/Indexes#Performance
>
>
> On Wed, Oct 2, 2019 at 11:06 AM first name last name  
> wrote:
>>
>> Hey Christian,
>>
>> Thank you for your answer :)
>> I tried setting in .basex the SPLITSIZE = 24000 but I've seen the same OOM 
>> behavior. It looks like the memory consumption is moderate until when it 
>> reaches about 30GB (the size of the db before optimize) and
>> then memory consumption spikes, and OOM occurs. Now I'm trying with 
>> SPLITSIZE = 1000 and will report back if I get OOM again.
>> Regarding what you said, it might be that the merge step is where the OOM 
>> occurs (I wonder if there's any way to control how much memory is being used 
>> inside the merge step).
>>
>> To quote the statistics page from the wiki:
>> Databases in BaseX are light-weight. If a database limit is reached, you 
>> can distribute your documents across multiple database instances and access 
>> all of them with a single XQuery expression.
>> This to me sounds like sharding. I would probably be able to split the 
>> documents into chunks and upload them under a db with the same prefix, but 
>> varying suffix.. seems a lot like shards. By doing this
>> I think I can avoid OOM, but if BaseX provides other, better, maybe native 
>> mechanisms of avoiding OOM, I would try them.
>>
>> Best regards,
>> Stefan
>>
>>
>> On Tue, Oct 1, 2019 at 4:22 PM Christian Grün  
>> wrote:
>>>
>>> Hi first name,
>>>
>>> If you optimize your database, the indexes will be rebuilt. In this
>>> step, the builder tries to guess how much free memory is still
>>> available. If memory is exhausted, parts of the index will be split
>>> (i. e., partially written to disk) and merged in a final step.
>>> However, you can circumvent the heuristics by manually assigning a
>>> static split value; see [1] for more information. If you use the DBA,
>>> you’ll need to assign this value to your .basex or the web.xml file
>>> [2]. In order to find the best value for your setup, it may be easier
>>> to play around with the BaseX GUI.
>>>
>>> As you have already seen in our statistics, an XML document has
>>> various properties that may represent a limit for a single database.
>>> Accordingly, these properties make it difficult to decide for the
>>> system when the memory will be exhausted during an import or index
>>> rebuild.
>>>
>>> In general, you’ll get best performance (and your memory consumption
>>> will be lower) if you create your database and specify the data to be
>>> imported in a single run. This is currently not possible via the DBA;
>>> use the GUI (Create Database) or console mode (CREATE DB command)
>>> instead.
>>>
>>> Hope this helps,
>>> Christian
>>>
>>> [1] http://docs.basex.org/wiki/Options#SPLITSIZE
>>> [2] http://docs.basex.org/wiki/Configuration
>>>
>>>
>>>
>>> On Mon, Sep 30, 2019 at 7:09 AM first name last name
>>>  wrote:
>>> >
>>> > Hi,
>>> >
>>> > Let's say there's a 30GB dataset [3] containing most threads/posts from 
>>> > [1].
>>> > After importing all of it, when I try to run /dba/db-optimize/ on it 
>>> > (which must have some corresponding command) I get the OOM error in the 
>>> > stacktrace attached. I 

Re: [basex-talk] basex OOM on 30GB database upon running /dba/db-optimize/

2019-10-03 Thread Imsieke, Gerrit, le-tex

Hi,

just saying that 16 GB of DDR3 RAM cost about 40 € now.

Gerrit

On 03.10.2019 08:53, first name last name wrote:

I tried again, using SPLITSIZE = 12 in the .basex config file
The batch(console) script I used is attached mass-import.xq
This time I didn't do the optimize or index creation post-import, but
instead, I did it as part of the import similar to what
is described in [4].
This time I got a different error, that is,
"org.basex.core.BaseXException: Out of Main Memory."
So right now.. I'm a bit out of ideas. Would AUTOOPTIMIZE make any 
difference here?


Thanks

[4] http://docs.basex.org/wiki/Indexes#Performance


On Wed, Oct 2, 2019 at 11:06 AM first name last name  wrote:


Hey Christian,

Thank you for your answer :)
I tried setting in .basex the SPLITSIZE = 24000 but I've seen the
same OOM behavior. It looks like the memory consumption is moderate
until when it reaches about 30GB (the size of the db before
optimize) and
then memory consumption spikes, and OOM occurs. Now I'm trying with
SPLITSIZE = 1000 and will report back if I get OOM again.
Regarding what you said, it might be that the merge step is where
the OOM occurs (I wonder if there's any way to control how much
memory is being used inside the merge step).

To quote the statistics page from the wiki:
Databases in BaseX are
light-weight. If a database limit is reached, you can distribute
your documents across multiple database instances and access all of
them with a single XQuery expression.
This to me sounds like sharding. I would probably be able to split
the documents into chunks and upload them under a db with the same
prefix, but varying suffix.. seems a lot like shards. By doing this
I think I can avoid OOM, but if BaseX provides other, better, maybe
native mechanisms of avoiding OOM, I would try them.

Best regards,
Stefan


On Tue, Oct 1, 2019 at 4:22 PM Christian Grün  wrote:

Hi first name,

If you optimize your database, the indexes will be rebuilt. In this
step, the builder tries to guess how much free memory is still
available. If memory is exhausted, parts of the index will be split
(i. e., partially written to disk) and merged in a final step.
However, you can circumvent the heuristics by manually assigning a
static split value; see [1] for more information. If you use the
DBA,
you’ll need to assign this value to your .basex or the web.xml file
[2]. In order to find the best value for your setup, it may be
easier
to play around with the BaseX GUI.

As you have already seen in our statistics, an XML document has
various properties that may represent a limit for a single database.
Accordingly, these properties make it difficult to decide for the
system when the memory will be exhausted during an import or index
rebuild.

In general, you’ll get best performance (and your memory consumption
will be lower) if you create your database and specify the data
to be
imported in a single run. This is currently not possible via the
DBA;
use the GUI (Create Database) or console mode (CREATE DB command)
instead.

Hope this helps,
Christian

[1] http://docs.basex.org/wiki/Options#SPLITSIZE
[2] http://docs.basex.org/wiki/Configuration



On Mon, Sep 30, 2019 at 7:09 AM first name last name  wrote:
 >
 > Hi,
 >
 > Let's say there's a 30GB dataset [3] containing most
threads/posts from [1].
 > After importing all of it, when I try to run
/dba/db-optimize/ on it (which must have some corresponding
command) I get the OOM error in the stacktrace attached. I am
using -Xmx2g so BaseX is limited to 2GB of memory (the machine
I'm running this on doesn't have a lot of memory).
 > I was looking at [2] for some estimates of peak memory usage
for this "db-optimize" operation, but couldn't find any.
 > Actually it would be nice to know peak memory usage because..
of course, for any database (including BaseX) a common operation
is to do server sizing, to know what kind of server would be needed.
 > In this case, it seems like 2GB memory is enough to import
340k documents, weighing in at 30GB total, but it's not enough
to run "dba-optimize".
 > Is there any info about peak memory usage on [2] ? And are
there guidelines for large-scale collection imports like I'm
trying to do?
 >
 > Thanks,
 > Stefan
 >
 > [1] https://www.linuxquestions.org/
 > [2] 

Re: [basex-talk] basex OOM on 30GB database upon running /dba/db-optimize/

2019-10-03 Thread first name last name
I tried again, using SPLITSIZE = 12 in the .basex config file.
The batch (console) script I used is attached as mass-import.xq.
This time I didn't run the optimize or the index creation after the import;
instead, I did it as part of the import, similar to what is described in
[4].
This time I got a different error: "org.basex.core.BaseXException:
Out of Main Memory."
So right now I'm a bit out of ideas. Would AUTOOPTIMIZE make any
difference here?

Thanks

[4] http://docs.basex.org/wiki/Indexes#Performance
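Creating the indexes as part of the import, as described in [4], essentially
amounts to a single db:create call with index options. A sketch only: the
attached mass-import.xq is not reproduced in the archive and may differ, and
the path and option values below are placeholders:

db:create(
  "linuxquestions.org",
  "/share/Public/archive/tech-sites/linuxquestions.org/",
  "linuxquestions.org",
  map { 'textindex': true(), 'attrindex': true(),
        'ftindex': true(), 'ftinclude': 'div table td a' }
)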


On Wed, Oct 2, 2019 at 11:06 AM first name last name 
wrote:

> Hey Christian,
>
> Thank you for your answer :)
> I tried setting in .basex the SPLITSIZE = 24000 but I've seen the same OOM
> behavior. It looks like the memory consumption is moderate until when it
> reaches about 30GB (the size of the db before optimize) and
> then memory consumption spikes, and OOM occurs. Now I'm trying with
> SPLITSIZE = 1000 and will report back if I get OOM again.
> Regarding what you said, it might be that the merge step is where the OOM
> occurs (I wonder if there's any way to control how much memory is being
> used inside the merge step).
>
> To quote the statistics page from the wiki:
> Databases in BaseX are
> light-weight. If a database limit is reached, you can distribute your
> documents across multiple database instances and access all of them with a
> single XQuery expression.
> This to me sounds like sharding. I would probably be able to split the
> documents into chunks and upload them under a db with the same prefix, but
> varying suffix.. seems a lot like shards. By doing this
> I think I can avoid OOM, but if BaseX provides other, better, maybe native
> mechanisms of avoiding OOM, I would try them.
>
> Best regards,
> Stefan
>
>
> On Tue, Oct 1, 2019 at 4:22 PM Christian Grün 
> wrote:
>
>> Hi first name,
>>
>> If you optimize your database, the indexes will be rebuilt. In this
>> step, the builder tries to guess how much free memory is still
>> available. If memory is exhausted, parts of the index will be split
>> (i. e., partially written to disk) and merged in a final step.
>> However, you can circumvent the heuristics by manually assigning a
>> static split value; see [1] for more information. If you use the DBA,
>> you’ll need to assign this value to your .basex or the web.xml file
>> [2]. In order to find the best value for your setup, it may be easier
>> to play around with the BaseX GUI.
>>
>> As you have already seen in our statistics, an XML document has
>> various properties that may represent a limit for a single database.
>> Accordingly, these properties make it difficult to decide for the
>> system when the memory will be exhausted during an import or index
>> rebuild.
>>
>> In general, you’ll get best performance (and your memory consumption
>> will be lower) if you create your database and specify the data to be
>> imported in a single run. This is currently not possible via the DBA;
>> use the GUI (Create Database) or console mode (CREATE DB command)
>> instead.
>>
>> Hope this helps,
>> Christian
>>
>> [1] http://docs.basex.org/wiki/Options#SPLITSIZE
>> [2] http://docs.basex.org/wiki/Configuration
>>
>>
>>
>> On Mon, Sep 30, 2019 at 7:09 AM first name last name
>>  wrote:
>> >
>> > Hi,
>> >
>> > Let's say there's a 30GB dataset [3] containing most threads/posts from
>> [1].
>> > After importing all of it, when I try to run /dba/db-optimize/ on it
>> (which must have some corresponding command) I get the OOM error in the
>> stacktrace attached. I am using -Xmx2g so BaseX is limited to 2GB of memory
>> (the machine I'm running this on doesn't have a lot of memory).
>> > I was looking at [2] for some estimates of peak memory usage for this
>> "db-optimize" operation, but couldn't find any.
>> > Actually it would be nice to know peak memory usage because.. of
>> course, for any database (including BaseX) a common operation is to do
>> server sizing, to know what kind of server would be needed.
>> > In this case, it seems like 2GB memory is enough to import 340k
>> documents, weighing in at 30GB total, but it's not enough to run
>> "dba-optimize".
>> > Is there any info about peak memory usage on [2] ? And are there
>> guidelines for large-scale collection imports like I'm trying to do?
>> >
>> > Thanks,
>> > Stefan
>> >
>> > [1] https://www.linuxquestions.org/
>> > [2] http://docs.basex.org/wiki/Statistics
>> > [3] https://drive.google.com/open?id=1lTEGA4JqlhVf1JsMQbloNGC-tfNkeQt2
>>
>

Re: [basex-talk] basex OOM on 30GB database upon running /dba/db-optimize/

2019-10-02 Thread first name last name
Hey Christian,

Thank you for your answer :)
I tried setting SPLITSIZE = 24000 in .basex, but I've seen the same OOM
behavior. It looks like the memory consumption stays moderate until it
reaches about 30GB (the size of the db before the optimize), and then it
spikes and the OOM occurs. Now I'm trying SPLITSIZE = 1000 and will report
back if I get an OOM again.
Regarding what you said, it might be that the merge step is where the OOM
occurs (I wonder if there's any way to control how much memory is used
inside the merge step).

To quote the statistics page from the wiki:
"Databases in BaseX are light-weight. If a database limit is reached, you
can distribute your documents across multiple database instances and access
all of them with a single XQuery expression."
This sounds like sharding to me. I could probably split the documents into
chunks and upload each chunk into a database with the same prefix but a
varying suffix, which looks a lot like shards (see the sketch below). By
doing this I think I can avoid the OOM, but if BaseX provides other, better,
maybe native mechanisms for avoiding OOM, I would try them.
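A minimal sketch of such a cross-shard query, assuming eleven shard
databases named linuxquestions-1 through linuxquestions-11, each with a
full-text index (the names and the search term are placeholders):

for $i in 1 to 11
for $hit in ft:search("linuxquestions-" || $i, "kernel panic")
return base-uri($hit)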

Best regards,
Stefan


On Tue, Oct 1, 2019 at 4:22 PM Christian Grün 
wrote:

> Hi first name,
>
> If you optimize your database, the indexes will be rebuilt. In this
> step, the builder tries to guess how much free memory is still
> available. If memory is exhausted, parts of the index will be split
> (i. e., partially written to disk) and merged in a final step.
> However, you can circumvent the heuristics by manually assigning a
> static split value; see [1] for more information. If you use the DBA,
> you’ll need to assign this value to your .basex or the web.xml file
> [2]. In order to find the best value for your setup, it may be easier
> to play around with the BaseX GUI.
>
> As you have already seen in our statistics, an XML document has
> various properties that may represent a limit for a single database.
> Accordingly, these properties make it difficult to decide for the
> system when the memory will be exhausted during an import or index
> rebuild.
>
> In general, you’ll get best performance (and your memory consumption
> will be lower) if you create your database and specify the data to be
> imported in a single run. This is currently not possible via the DBA;
> use the GUI (Create Database) or console mode (CREATE DB command)
> instead.
>
> Hope this helps,
> Christian
>
> [1] http://docs.basex.org/wiki/Options#SPLITSIZE
> [2] http://docs.basex.org/wiki/Configuration
>
>
>
> On Mon, Sep 30, 2019 at 7:09 AM first name last name
>  wrote:
> >
> > Hi,
> >
> > Let's say there's a 30GB dataset [3] containing most threads/posts from
> [1].
> > After importing all of it, when I try to run /dba/db-optimize/ on it
> (which must have some corresponding command) I get the OOM error in the
> stacktrace attached. I am using -Xmx2g so BaseX is limited to 2GB of memory
> (the machine I'm running this on doesn't have a lot of memory).
> > I was looking at [2] for some estimates of peak memory usage for this
> "db-optimize" operation, but couldn't find any.
> > Actually it would be nice to know peak memory usage because.. of course,
> for any database (including BaseX) a common operation is to do server
> sizing, to know what kind of server would be needed.
> > In this case, it seems like 2GB memory is enough to import 340k
> documents, weighing in at 30GB total, but it's not enough to run
> "dba-optimize".
> > Is there any info about peak memory usage on [2] ? And are there
> guidelines for large-scale collection imports like I'm trying to do?
> >
> > Thanks,
> > Stefan
> >
> > [1] https://www.linuxquestions.org/
> > [2] http://docs.basex.org/wiki/Statistics
> > [3] https://drive.google.com/open?id=1lTEGA4JqlhVf1JsMQbloNGC-tfNkeQt2
>


Re: [basex-talk] basex OOM on 30GB database upon running /dba/db-optimize/

2019-10-01 Thread Christian Grün
Hi first name,

If you optimize your database, the indexes will be rebuilt. In this
step, the builder tries to guess how much free memory is still
available. If memory is exhausted, parts of the index will be split
(i. e., partially written to disk) and merged in a final step.
However, you can circumvent the heuristics by manually assigning a
static split value; see [1] for more information. If you use the DBA,
you’ll need to assign this value to your .basex or the web.xml file
[2]. In order to find the best value for your setup, it may be easier
to play around with the BaseX GUI.

As you have already seen in our statistics, an XML document has various
properties, each of which may represent a limit for a single database. These
properties also make it difficult for the system to predict when memory will
be exhausted during an import or index rebuild.

In general, you’ll get best performance (and your memory consumption
will be lower) if you create your database and specify the data to be
imported in a single run. This is currently not possible via the DBA;
use the GUI (Create Database) or console mode (CREATE DB command)
instead.

Hope this helps,
Christian

[1] http://docs.basex.org/wiki/Options#SPLITSIZE
[2] http://docs.basex.org/wiki/Configuration
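For reference, assigning a static split value in the .basex configuration
file amounts to a single line such as the following; the value itself is
only an example to experiment with:

# in the .basex configuration file
SPLITSIZE = 1000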



On Mon, Sep 30, 2019 at 7:09 AM first name last name
 wrote:
>
> Hi,
>
> Let's say there's a 30GB dataset [3] containing most threads/posts from [1].
> After importing all of it, when I try to run /dba/db-optimize/ on it (which 
> must have some corresponding command) I get the OOM error in the stacktrace 
> attached. I am using -Xmx2g so BaseX is limited to 2GB of memory (the machine 
> I'm running this on doesn't have a lot of memory).
> I was looking at [2] for some estimates of peak memory usage for this 
> "db-optimize" operation, but couldn't find any.
> Actually it would be nice to know peak memory usage because.. of course, for 
> any database (including BaseX) a common operation is to do server sizing, to 
> know what kind of server would be needed.
> In this case, it seems like 2GB memory is enough to import 340k documents, 
> weighing in at 30GB total, but it's not enough to run "dba-optimize".
> Is there any info about peak memory usage on [2] ? And are there guidelines 
> for large-scale collection imports like I'm trying to do?
>
> Thanks,
> Stefan
>
> [1] https://www.linuxquestions.org/
> [2] http://docs.basex.org/wiki/Statistics
> [3] https://drive.google.com/open?id=1lTEGA4JqlhVf1JsMQbloNGC-tfNkeQt2