Re: Reindexing a workspace ...

2009-06-15 Thread Bart van der Schans
Hi Thomas,

Thanks for your answer. This is also exactly what we already do in most cases ;-)

I guess currently it's a trade-off: if you don't want to stop
Jackrabbit to make a backup, you must have a cluster node that
you can dedicate to making the backup.

Regards,
Bart

On Mon, Jun 15, 2009 at 11:49 AM, Thomas Müller  wrote:
> Hi,
>
> If you use database persistence managers and a database journal, you
> could use the following procedure:
>
> 1) stop the cluster node
> 2) backup the lucene index, config files, and revision.log of this cluster 
> node
> 3) later on, backup the persistence manager data, journal data, and data store
>
> This backup should be consistent, because the journal includes the list
> of changes, so the Lucene index is brought up to date accordingly.
>
> Regards,
> Thomas
>
> On Tue, Jun 9, 2009 at 3:48 PM, KÖLL Claus wrote:
>> hi (thomas),
>>
>> your post was clear, thanks for the info ...
>> ok, the lucene index is consistent, but you will not get a snapshot of the
>> repository, as bart wrote.
>>
>> I see some problems with bart's solution ...
>> if you have a large repository, a write lock that runs for hours is not good,
>> but maybe some others have good ideas ...
>>
>> I have tested the environment as you mentioned with the cluster, and it works
>> fine at the moment for us, because we can re-index the backup cluster in the
>> background if we get a crash ... hopefully not :-)
>>
>> greets
>> claus
>>
>



-- 
Hippo B.V.  -  Amsterdam
Oosteinde 11, 1017 WT, Amsterdam, +31(0)20-5224466

Hippo USA Inc.  -  San Francisco
101 H Street, Suite Q, Petaluma CA, 94952-3329, +1 (707) 773-4646
-
http://www.onehippo.com   -  [email protected]
-


Re: Reindexing a workspace ...

2009-06-15 Thread Thomas Müller
Hi,

If you use database persistence managers and a database journal, you
could use the following procedure:

1) stop the cluster node
2) backup the lucene index, config files, and revision.log of this cluster node
3) later on, backup the persistence manager data, journal data, and data store

This backup should be consistent, because the journal includes the list
of changes, so the Lucene index is brought up to date accordingly.
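A minimal sketch of the backup order in steps 2 and 3 above, using plain
file copies (the directory layout and file names are assumptions about a
typical repository home, not fixed by Jackrabbit; a real setup would also
dump the persistence manager and journal tables, e.g. with mysqldump):

```java
import java.io.IOException;
import java.nio.file.*;

public class BackupOrder {
    // Copies the parts of a (stopped) cluster node's repository home
    // in the order described above: index, config, and revision.log
    // first; the database-side data is dumped separately, later.
    public static void backup(Path repoHome, Path backupDir) throws IOException {
        Files.createDirectories(backupDir.resolve("index"));
        // step 2: lucene index files, config, and this node's revision.log
        try (DirectoryStream<Path> idx =
                 Files.newDirectoryStream(repoHome.resolve("index"))) {
            for (Path f : idx) {
                Files.copy(f, backupDir.resolve("index").resolve(f.getFileName()));
            }
        }
        Files.copy(repoHome.resolve("repository.xml"),
                   backupDir.resolve("repository.xml"));
        Files.copy(repoHome.resolve("revision.log"),
                   backupDir.resolve("revision.log"));
        // step 3 happens later and outside this sketch: dump the
        // persistence manager tables and the journal table, and copy
        // the data store
    }
}
```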

Regards,
Thomas

On Tue, Jun 9, 2009 at 3:48 PM, KÖLL Claus wrote:
> hi (thomas),
>
> your post was clear, thanks for the info ...
> ok, the lucene index is consistent, but you will not get a snapshot of the
> repository, as bart wrote.
>
> I see some problems with bart's solution ...
> if you have a large repository, a write lock that runs for hours is not good,
> but maybe some others have good ideas ...
>
> I have tested the environment as you mentioned with the cluster, and it works
> fine at the moment for us, because we can re-index the backup cluster in the
> background if we get a crash ... hopefully not :-)
>
> greets
> claus
>


Re: Reindexing a workspace ...

2009-06-09 Thread Bart van der Schans
On Tue, Jun 9, 2009 at 3:48 PM, KÖLL Claus  wrote:
> hi (thomas),
>
> your post was clear, thanks for the info ...
> ok, the lucene index is consistent, but you will not get a snapshot of the
> repository, as bart wrote.
>
> I see some problems with bart's solution ...
> if you have a large repository, a write lock that runs for hours is not good,
> but maybe some others have good ideas ...
Of course, but it depends on your definition of large. For example,
dumping 12 GB of data to disk from MySQL takes something like half an
hour; in other terms, that's about 1,000,000 node bundles and about
4,500,000 version bundles. Running in read-only mode for half an hour
during low-traffic hours is imo quite acceptable in a lot of environments.

>
> I have tested the environment as you mentioned with the cluster, and it works
> fine at the moment for us, because we can re-index the backup cluster in the
> background if we get a crash ... hopefully not :-)
Keep in mind that re-indexing can take quite a lot of time. IIRC, a
full re-index of the repository mentioned above took somewhere between
6 and 12 hours.

Regards,
Bart


Re: Reindexing a workspace ...

2009-06-09 Thread Bart van der Schans
On Tue, Jun 9, 2009 at 11:01 AM, Thomas Müller  wrote:
> Hi,
>
>> let's say you have a disk crash and you must restore the index folder, but it
>> was backed up a day before.
>> to get a consistent state with the data, you must re-index the whole
>> workspace.
>
> Probably my original mail was unclear. I repeat: "One solution is to
> use clustering. One cluster node (the 'master') is used for regular
> requests, while the other (the 'backup') is used for backup. The
> master node continuously runs, while the backup node is stopped from
> time to time to create a backup." In that case the Lucene index in the
> backup is consistent. After a crash, you restore the backup. Like
> that, you don't have an inconsistent index.

We use exactly such a solution, with some success. The problem I see
with this solution, apart from the obvious extra resources you will
need, is that you also have to back up the database at 'about the same
time'. It just feels like a big workaround, and you never feel really
sure you've got 'everything'...

If we could have some kind of 'flush everything to disk/database/index
and hold a write lock until further notice' method and a 'please
continue as normal' method that could be called remotely somehow it
would make creating consistent backups much easier:
- issue 'flush and hold'
- use your favorite backup method, rsync, scp, db dumps, etc.
- issue 'continue'
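The proposed pair of methods could be sketched roughly as follows. This is
not an existing Jackrabbit API; the class and method names are invented
purely to illustrate the idea, using a ReentrantReadWriteLock as the gate
(writers share the read side; the backup takes the write side, which waits
for in-flight writes to drain and then blocks new ones):

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class BackupGate {
    private final ReentrantReadWriteLock gate = new ReentrantReadWriteLock();

    /** Wraps every repository write; many writes may run concurrently. */
    public void write(Runnable change) {
        gate.readLock().lock();
        try {
            change.run();
        } finally {
            gate.readLock().unlock();
        }
    }

    /** The 'flush and hold' call: blocks further writes until resume(). */
    public void flushAndHold() {
        gate.writeLock().lock();   // waits for in-flight writes to finish
        // here: flush persistence manager, journal, and index to disk
    }

    /** The 'please continue as normal' call. */
    public void resume() {
        gate.writeLock().unlock();
    }
}
```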

Any thoughts on whether such a thing would be possible? If so, I could
help implement it.

Regards,
Bart




Re: Reindexing a workspace ...

2009-06-09 Thread Thomas Müller
Hi,

> let's say you have a disk crash and you must restore the index folder, but it
> was backed up a day before.
> to get a consistent state with the data, you must re-index the whole workspace.

Probably my original mail was unclear. I repeat: "One solution is to
use clustering. One cluster node (the 'master') is used for regular
requests, while the other (the 'backup') is used for backup. The
master node continuously runs, while the backup node is stopped from
time to time to create a backup." In that case the Lucene index in the
backup is consistent. After a crash, you restore the backup. Like
that, you don't have an inconsistent index.

Regards,
Thomas


Re: Reindexing a workspace ...

2009-05-29 Thread Thomas Müller
Hi,

> with your solution I will be able to back up the index, but that does not
> solve my problem if you get into the situation where you must reindex a
> workspace from scratch.

How would you get into that situation?

> I don't know what happens in a cluster environment if I start one member
> without the index, to start a reindex process, and work with the other
> cluster member ... is that possible?

If you start a cluster node and the Lucene index files are missing, I
guess they are created automatically. I didn't test this, however.

Regards,
Thomas


On Wed, May 20, 2009 at 11:40 AM, KÖLL Claus  wrote:
> hi thomas,
>
> with your solution I will be able to back up the index, but that does not
> solve my problem if you get into the situation where you must reindex a
> workspace from scratch.
>
> I don't know what happens in a cluster environment if I start one member
> without the index, to start a reindex process, and work with the other
> cluster member ... is that possible?
>
> greets
> claus
>


Re: Reindexing a workspace ...

2009-05-19 Thread Thomas Müller
Hi,

One solution is to use clustering. One cluster node (the 'master') is
used for regular requests, while the other (the 'backup') is used for
backup. The master node continuously runs, while the backup node is
stopped from time to time to create a backup.
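For reference, such a backup node is configured through the Cluster
element in its repository.xml. A rough fragment, assuming a database
journal (the id, driver, URL, and prefix values are placeholders; check
the Jackrabbit clustering documentation for the exact parameters):

```xml
<Cluster id="backup-node" syncDelay="2000">
  <Journal class="org.apache.jackrabbit.core.journal.DatabaseJournal">
    <param name="driver" value="com.mysql.jdbc.Driver"/>
    <param name="url" value="jdbc:mysql://dbhost/journal"/>
    <param name="schemaObjectPrefix" value="journal_"/>
  </Journal>
</Cluster>
```

Each cluster node needs a unique id; the journal is what lets the backup
node catch up on changes made while it was stopped.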

[Advertisement] Day CRX (http://www.day.com) supports online backup
when using the Tar persistence manager and the Lucene index.

Regards,
Thomas