[midgard-user] Repligard notes

Henri Bergius Thu, 21 Jun 2001 05:33:54 -0700
Greetings!

Here are my notes from today's Repligard seminar here at the
Nemein offices. I guess this is currently the most comprehensive
set of documentation on replication and packaging in Midgard, so
it would probably be a useful addition to the Midgard Manual.

We also videotaped Alexander's presentation, and will publish
it on our site, hopefully next week.

I'll be on vacation for the next couple of weeks. See you guys
after that..

/Bergie

Repligard seminar

June 21st 2001
Nemein headquarters, Espoo, Finland

Speaker:
Alexander Bokovoy, Belarusian State University

Introduction to Repligard

Repligard is Midgard's utility for replicating information between
servers. Its most frequent use is synchronizing development and
production servers. Another use for Repligard is installing packages.

Repligard is a Midgard application that doesn't use mod_midgard or the
PHP bindings. However, Repligard relies on the PHP bindings being used
for database manipulation, as they keep the repligard change
information in a consistent state.

The base of a regular Midgard installation is an operating system,
which can be either GNU/Linux, Solaris, FreeBSD, Mac OS X or some
other brand of UNIX. On top of the OS there is the Midgard base
library, which handles all communication with Midgard's database
server (currently MySQL).

The database stores all information including replication change
information.

Each Midgard object is an entry in the database, and it also has an
entry in the database's repligard table, used for replication
information. The repligard table links id numbers of individual
Midgard objects to GUIDs (globally unique identifiers) using the realm
(table name) and id columns. While ids may vary between Midgard
installations, GUIDs remain the same.

The repligard table also contains two timestamps for each entry,
changed and updated. The changed timestamp notes the date when the
entry was last changed using Midgard's PHP bindings. The updated
timestamp is used by Repligard for determining whether to update a
record in the database.

When exporting Midgard objects, Repligard checks whether the changed
field is greater than the updated field. If it is, the object is
replicated to the other database. During the import phase it does the
reverse comparison.

If an object was deleted from the source database, it should be also
deleted from the target database. To enable this, we have an action
field in the repligard table. It either contains 'create', 'update' or
'delete' value depending on which action was last used for the object.

The repligard table also keeps information about objects that have
been deleted, so that it is able to link the GUIDs to the correct
states of objects during the export phase.
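
The export rule described above can be sketched in Python. This is only
an illustration, assuming each repligard row is available as a small
record. The RepligardEntry class and both functions are hypothetical
names, not part of Repligard itself.

```python
from dataclasses import dataclass


@dataclass
class RepligardEntry:
    """Hypothetical in-memory view of one row of the repligard table."""
    guid: str
    realm: str    # table name of the object, e.g. "article"
    id: int       # local id, differs between installations
    changed: int  # last change made through the PHP bindings
    updated: int  # timestamp maintained by Repligard
    action: str   # "create", "update" or "delete"


def needs_export(entry: RepligardEntry) -> bool:
    """An entry is exported when changed is greater than updated."""
    return entry.changed > entry.updated


def entries_to_export(entries):
    """Return (guid, action) pairs for every entry that needs exporting.

    Deleted objects are included so the target can delete them as well.
    """
    return [(e.guid, e.action) for e in entries if needs_export(e)]
```

Passing the action along with the GUID is what lets the importing side
tell a deletion apart from an update.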

Replication schemas

The replication output file is in the XML format. XML is flexible,
standardized, and supports replication needs easily. If the structure
of the Midgard database is changed, no changes to Repligard code are
needed as it can be made to support the new objects by simple changes
to the XML schema file.

The XML schema file is very simple. It contains several XML elements
in a two level structure. First element is type. Type specifies object
name. Inside type we have a second structure that describes fields for
that object.

Example:

<type name="person">
   <field type="text" name="username" />
   <field type="password" name="password" />
   <link link="topic" name="department" />
</type>

Besides field, we also have an element called link, which links objects
to each other. After parsing the XML schema file, Repligard creates a
graph noting relations between different objects. If the person has a
link to a topic, then when exporting the person, the topic should also
be exported, and vice versa.

Sometimes such a bidirectional relation isn't needed, as in the case of
the page and host relation. When a page is exported there is no need to
export the host, because the page might be used in several hosts
simultaneously. If exporting of bidirectional dependencies is not
desired, the link element takes an additional attribute, reverse='no'.

An object may contain either fields or links, but usually contains a
mixture of both.
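
To illustrate how a relation graph could be derived from such a schema,
here is a minimal Python sketch. It is not Repligard's own code. The
wrapping <schema> root element, the host type with its "root" link, and
the function name are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Sample schema; the <schema> wrapper and the host entry are assumed.
SCHEMA = """
<schema>
  <type name="person">
    <field type="text" name="username" />
    <link link="topic" name="department" />
  </type>
  <type name="host">
    <link link="page" name="root" reverse="no" />
  </type>
</schema>
"""


def build_dependency_graph(schema_xml: str) -> dict:
    """Map each type name to the set of types exported along with it."""
    graph = defaultdict(set)
    for type_el in ET.fromstring(schema_xml).iter("type"):
        source = type_el.get("name")
        for link_el in type_el.findall("link"):
            target = link_el.get("link")
            graph[source].add(target)
            # A link is bidirectional unless reverse="no" is given.
            if link_el.get("reverse") != "no":
                graph[target].add(source)
    return dict(graph)
```

With this sample, exporting a person pulls in its topic and the other
way round, while exporting a page does not pull in any host.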

Repligard doesn't care which database software is being used, as it uses
Midgard-lib for database interaction. This was done to minimize the
impact of porting Midgard to databases other than MySQL.

The schema XML files can be modified for application needs. As an
example, Midgard articles contain three number fields that can be used
for linking to other objects. To make these replication safe, change
your schema file to make Midgard aware of the linkage.

Example:

Originally Midgard has:

<type name="article">
   ...
   <field name="extra1" type="text" />
   ...
</type>

Change this to:

<type name="article">
   ...
   <link name="extra1" link="person" />
   ...
</type>

In this case the schema XML file needs to be distributed with the
application.

Repligard configuration

Repligard uses XML files because of two requirements, flexibility and
minimizing the amount of code needed to process external files. It was
easier to use XML processing than some custom configuration format.

In its current version, Midgard supports only 8-bit encoded
information. Because of this, Unicode can't be used in storage. This
limitation is caused both by Midgard's design and by the current
capabilities of MySQL. However, Repligard is Unicode-safe, so Unicode
can be used as soon as Midgard and the database support it. The XML
parser already supports Unicode.

Since XML is flexible, information can be stored using whichever
encoding is desired. Repligard handles translating XML files between
latin-1 and other encodings, so users don't need to handle these
matters. The only requirement is ICONV(3) support on your system,
which is standard on GNU systems and can easily be added to other
systems using the libiconv package. The ICONV(3) interface is part of
the X/Open Portability Guide version 2 standard and is supported, more
or less, by most commercial and open source Unix-like systems.

The encoding needs to be specified in Repligard's configuration file
using one of the encoding names understood by the ICONV library.
(ICONV generally provides a command line tool named 'iconv', which can
be used for querying the names of supported encodings. On GNU systems
the corresponding command line call is 'iconv -l'.)
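
As a rough illustration of the conversion step, the following Python
sketch re-encodes text between UTF-8 and latin-1. It uses Python's
built-in codecs rather than ICONV(3), and the function name is
hypothetical.

```python
def convert_encoding(data: bytes, source: str, target: str) -> bytes:
    """Re-encode bytes from one character encoding to another.

    Raises UnicodeEncodeError if a character does not exist in the
    target encoding, which is the failure an 8-bit encoding would hit.
    """
    return data.decode(source).encode(target)


utf8_text = "Nemein, Espoo: päivää".encode("utf-8")
latin1_text = convert_encoding(utf8_text, "utf-8", "iso-8859-1")
```

The latin-1 result is shorter than the UTF-8 input, because each
accented character takes one byte instead of two.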

The configuration file includes many elements. First one is database,
which contains information about the database used, including schema
file used, and database's location and administrator account, similar
to the MidgardDatabase directive in Apache configuration.

Example:

<database
   schema="/path/to/schema.xml"
   database="midgard"
   username="midgard"
   password="midgard"
   encoding="ISO-8859-1"
   blobdir="/var/www/blobs"
/>

All information from the imported XML file will be translated to the
encoding specified before it is entered into the database.

After we've said what database we want to use, we need to specify
which Midgard administrative account we will use for accessing the
database.

Example:

<login
   username="admin"
   password="password"
/>

If you desire to login to a sitegroup, you need to specify which
sitegroup to use since Repligard can't use host information for
determining it. The format is the same as with regular admin site,
username+sitegroup. All modifiers noted in Midgard manual can be used
here. Repligard will handle the exports and imports using the
privileges of the user specified.

Because of this, administrators of sitegroups in a co-hosted setup can
create their own replication configurations. However, in this setup,
the configuration files should not include database administrator
accounts. For this, there is an include directive for including global
configuration files.

Example:

<include name="/path/to/global/config.xml" />

Repligard supports two general operations, export and import. Import
is the easier one. When importing, the user doesn't need to specify
anything besides database and login information in the configuration
file.

For export, users need to specify which objects to export. For this,
there is the replicate element. The easiest option is replicating
everything in the database.

Example:

<replicate all="yes" />

Another way is to specify resources using resource structures within
the replicate element.

Example:

<replicate>
   <resource id="1" type="article" />
   <resource guid="3e6729abe2891fca92fad03" />
</replicate>

Make sure that the IDs specified here are the local IDs of the objects,
as the IDs of objects vary between Midgard installations.

Another possibility is to specify the object using its GUID. Since
GUIDs don't change between databases, the same resource string can be
used on many databases. The GUID is a 32 character string, currently
produced using a sophisticated algorithm to ensure uniqueness of the
resource in a 128-bit space. This gives us about 3.4*10^38 different
objects (the exact value is 340282366920938463463374607431768211456).
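
The arithmetic behind these numbers can be checked with a few lines of
Python. The random_guid_like helper is purely hypothetical and only
shows what a 32 character hexadecimal string looks like. It is not the
algorithm Repligard uses.

```python
import secrets

GUID_BITS = 128

# Number of distinct values in a 128-bit space.
guid_space = 2 ** GUID_BITS

# Each hexadecimal character carries 4 bits, so 128 bits need 32 of them.
guid_length = GUID_BITS // 4


def random_guid_like() -> str:
    """Return a random 32-character hex string (NOT Repligard's method)."""
    return secrets.token_hex(GUID_BITS // 8)
```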

As many resources can be specified as needed. However, if the resource
has dependencies, Repligard will replicate the whole dependency tree
under it. For example, topics are trees, so if a topic is specified,
its subtopics and articles will be exported as well.
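
Collecting a dependency tree like this amounts to a plain tree walk. The
sketch below assumes the child relations are already known as a
dictionary. The function name, the key format and the sample data are
made up for the example.

```python
def collect_subtree(root, children):
    """Return root and everything below it, in depth-first order.

    `children` maps an object to the objects directly below it.
    """
    result = []
    seen = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        result.append(node)
        # Reverse so that children are visited in their listed order.
        stack.extend(reversed(children.get(node, [])))
    return result


CHILDREN = {
    "topic:Company": ["topic:Products", "article:Contact us"],
    "topic:Products": ["article:Widgets"],
}
```

Specifying only "topic:Company" as the resource would then export the
subtopic and both articles as well.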

Repligard can export either only changes since last replication, or
everything. This can be decided using Repligard's commandline
arguments. If the option '-a' is used, Repligard will export the
complete database. Otherwise, only changes will be replicated.

Another setting in the configuration file is the location of the BLOB
directory. Repligard will embed BLOB files into the replication XML
file, so they don't need to be transferred separately. Repligard uses
streams for handling objects, so it doesn't matter whether an object
is 5 KB long or 5 GB long.

Currently BLOBs are not streamed in the import phase, so if a BLOB
doesn't fit into system memory, it can't be imported. But since only
one object is processed at a time, BLOBs can still be quite big, for
example 50 MB or even more, if you have enough memory to store them.

Selective replication

If an application requires selective export, like exporting only
approved articles, additional tools will be needed. In the near future,
Repligard will support such functionality by itself.

The CVS version of Repligard has basic support for this feature. Users
can specify these rules in the configuration file. However, Repligard
will only check the syntax, and will not act on the rules yet. These
are called export hooks and might even appear before the Midgard 1.4.2
release.

Export hook is the Repligard term for little scripts that can be used
to select which objects to replicate. Export hooks are specific to an
object type, and only one hook can be used for each type. The export
hook will be executed for each object of that type, and will return
either 1 or 0, for "let's export" or "don't export".

Example:

<exportHook type="article">
<![CDATA[
   if (object["approved"]) {
         return 1;
   }
   return 0;
]]>
</exportHook>

The example uses pseudocode. Each article would be passed through the
script, and the script would receive the objects as associative
arrays.
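
The same idea can be tried out in Python, with dictionaries standing in
for the associative arrays. This is only a sketch of the selection
logic. The hook registry and function names are hypothetical and do not
reflect how Repligard will actually call its hooks.

```python
def article_hook(obj: dict) -> int:
    """Return 1 ("let's export") for approved articles, 0 otherwise."""
    return 1 if obj.get("approved") else 0


# One hook per object type, as described above.
EXPORT_HOOKS = {"article": article_hook}


def select_for_export(objects):
    """Keep objects whose type has no hook or whose hook returns 1."""
    selected = []
    for obj_type, obj in objects:
        hook = EXPORT_HOOKS.get(obj_type)
        if hook is None or hook(obj) == 1:
            selected.append((obj_type, obj))
    return selected
```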

The language used can be any which Repligard has bindings for. The
first binding will be for the S-Lang language, which has a C-like syntax.
The script will not be able to communicate with the operating system
for security reasons, but only have access to a limited range of
libraries, like regular expression matching. This will be determined
in the language binding.

This functionality should be available in the near future, probably
not later than early fall 2001.

Repligard usage

Repligard is used on the command line, and supports many options. The
most common are -i for import and -e for export.

Example:

$ repligard -c repligard.conf -e export.xml.gz

Repligard automatically compresses replication XML files using gzip
during export. Because of this, the system doesn't require much
bandwidth when transferring files using, for example, scp, rsync or
HTTP.

Repligard will first read the configuration file, and then export the
resources specified to the compressed XML file.

The XML file can be empty of data if there have been no changes to the
database. In this case the XML file is three lines long, containing
only an empty Database element. This will be about 130 bytes long.

Example:

<?xml ... ?>
<Database>
</Database>

This can be checked by uncompressing the file using zcat and running
wc -l for it. If the database container is empty, it doesn't need to
be transferred to the remote host.
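
The same check can be scripted. The following Python sketch treats an
export as empty when its root element has no children. The function
name and the sample documents, including the XML declaration, are
assumptions for the example.

```python
import gzip
import xml.etree.ElementTree as ET


def is_empty_export(compressed: bytes) -> bool:
    """Return True if a gzip-compressed export holds no objects."""
    root = ET.fromstring(gzip.decompress(compressed))
    # An empty export has a root element without any child elements.
    return len(root) == 0


# Sample data standing in for real export files.
empty = gzip.compress(b'<?xml version="1.0"?>\n<Database>\n</Database>\n')
full = gzip.compress(b"<Database><article id='x' /></Database>")
```

For a file on disk, the bytes can be read with open(path, "rb").read()
before being passed to the function.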

The XML file will contain all data of the Midgard objects that are
replicated in the format specified in the schema XML file, and changed
and GUID information.

Example:

   <article id="fa2a09efe24980fba823ecf" changed="200106211314">
     <name>....</name>
     <content>.....</content>
     <author>guid of the object</author>
     ...
   </article>

For text-type fields, the data will be contained within a CDATA section
to preserve formatting. For links, the referenced GUID will be listed.

When importing, Repligard will check for all referenced GUIDs. If the
object specified by the GUID is not found in the database, an empty
object will be created. This did not function properly in older
versions, so many Midgard databases were left in an inconsistent state.

The replication XML files are not valid XML files in the sense that
there is no official schema for the document type. However, Repligard
uses its own schema XML files, and so the documents stay valid for
Repligard's use as long as the schema XML file is consistent. It is
possible to generate a proper XML DTD for the replication file, but it
isn't needed because the XML parser we are using is not a validating
one.

If an object that is being imported already exists in the database,
the changed field in database will be compared to the one in the XML
file, and the newer one will be saved to the database.
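
This "newer one wins" rule can be sketched as follows. The records are
plain dictionaries and the function name is hypothetical. Timestamps are
assumed to be strings in the fixed-width form shown in the export
example, so that string comparison orders them correctly.

```python
def merge_on_import(local, incoming):
    """Return the record to keep: the one with the newer `changed` value.

    `local` is None when the GUID is not yet known in the target
    database, in which case the incoming record is always taken.
    """
    if local is None or incoming["changed"] > local["changed"]:
        return incoming
    return local
```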

Due to the instability of Repligard in previous releases, sitegroup
information in the repligard table and in the actual records can
differ, causing broken databases. There is a script in the CVS version
of Midgard-data for fixing these: 'cvs::midgard/data/fixsg'.

Repligard-safe applications

There are several things about application writing that affect
replicability. Originally Midgard used server-specific id numbers
for referring to objects in the database.

When replicating, local ids change to whatever is available on the
target server. Instead, application writers should use GUIDs or names
for referring to objects.

The issues of properly replicated applications have been addressed many
times in Midgard development.

The first attempt at this was a year and a half ago, when addressing of
topics and articles by name was enabled in Midgard-PHP.

Example:

Use
   mgd_get_article_by_name(topic id, 'article name');
instead of
   mgd_get_article(id);

However, the name functions still require using local ids when
referring to root topics. Of course, the topic id could first be
dynamically queried by using a series of successive
mgd_get_topic_by_name queries starting from the topic id 0.

Example:

$topic = mgd_get_topic_by_name(0, 'Company');

$article = mgd_get_article_by_name($topic->id, 'Contact us');
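
For deeper topic trees, the successive lookups can be wrapped in a small
loop. The Python sketch below shows only the control flow. The lookup
function stands in for mgd_get_topic_by_name, and the topic table and
its ids are invented for the example.

```python
def resolve_topic_path(path, get_topic_by_name, root_id=0):
    """Walk a list of topic names from the root, return the final id.

    `get_topic_by_name(parent_id, name)` must return the local id of
    the child topic, or None if there is no such topic.
    """
    topic_id = root_id
    for name in path:
        topic_id = get_topic_by_name(topic_id, name)
        if topic_id is None:
            return None
    return topic_id


# Toy lookup table: (parent id, name) -> local id on this installation.
TOPICS = {(0, "Company"): 17, (17, "Products"): 42}


def lookup(parent_id, name):
    return TOPICS.get((parent_id, name))
```

Only the names are written into the application, so the code keeps
working when the local ids differ on another installation.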

When writing applications, many developers refer to objects by
hard-coding id numbers into the application code. However, this
will not work after replication. The named functions provide a
working, if cumbersome, solution to this. However, not all Midgard
objects are available through names.

As another solution, we have defined the GUID system for providing a
method of addressing documents that is uniform across all Midgard
installations, and is still relatively quick to use.

To provide the GUID information for an object, a guid() method was
added to all objects, and for retrieving objects based on GUIDs, a
function mgd_get_object_by_guid was added.

Example:

$topic = mgd_get_object_by_guid("af3a4eafe7a83bca73323421");

if ($topic) {
    $article = mgd_list_topic_articles($topic->id);
    if ($article) {
        while ($article->fetch()) {
            ...
        }
    }
}

By using GUIDs for referring to all objects, your application will be
completely replication safe, as Repligard maintains relations between
GUID entries, and local ids of objects.

How to write code to support these new addressing methods depends on
how your original code has been written. It may even require complete
redesign of your code.

However, at its easiest, it requires only replacing original local ids
with GUIDs retrieved from the repligard table.

Example:

$ mysql midgard

mysql> select guid from repligard where id=123 and realm='article';

Example 2:

$ export ID=123
$ export TYPE='article'
$ echo "select guid from repligard where id=$ID and realm='$TYPE';" | mysql midgard | tail -n +2
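
The same lookup can be done from a script. The sketch below uses an
in-memory SQLite table as a stand-in for the repligard table in MySQL,
so it only demonstrates the query itself. The function name, the
reduced column set and the sample row are assumptions.

```python
import sqlite3


def guid_for(conn, realm, local_id):
    """Return the GUID for a local id, or None if there is no such row."""
    row = conn.execute(
        "SELECT guid FROM repligard WHERE id = ? AND realm = ?",
        (local_id, realm),
    ).fetchone()
    return row[0] if row else None


# In-memory stand-in for the repligard table (SQLite, not MySQL).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE repligard (guid TEXT, id INTEGER, realm TEXT)")
conn.execute(
    "INSERT INTO repligard VALUES (?, ?, ?)",
    ("892d9343acc66c0a730fd5e6ffc6cbdf", 123, "article"),
)
```

Both id and realm are needed in the query, since the same local id can
exist in several tables.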

This method solves most problems with replicable
applications. Current Midgard administration interfaces do not show
GUID information for objects.

In the future, Repligard will probably be able to provide GUID
information with some commandline argument.

Once you have changed your application to refer to objects by GUIDs,
you can freely use their properties to refer to other objects. As
these properties (like up) are dynamically retrieved with each query,
they will be consistent for the local database.

Example:

$topic = mgd_get_object_by_guid("892d9343acc66c0a730fd5e6ffc6cbdf");

$article = mgd_get_article_by_name($topic->id, "Contact us");

It does not matter whether the application being worked on is a real
application that you want to distribute, or a web site prepared for
moving to a production server. As long as you use GUIDs, the
applications should be distributable.

When you are packaging an application, you must ensure that all
objects referenced in the application go into the package. Because of
this, the replicate statements in the Repligard configuration file
should be used to refer to them.

Currently Repligard does not know how to inherit objects from an upper
dependency level. For example, if you have specified a page in the
resource statement, the page elements will also be replicated. However,
the page usually does not refer to a style itself; the common case is
to inherit style information from the upper level, which for a page is
the host. Repligard will not replicate the style, even though the host
refers to it, because the relation between pages and hosts is
unidirectional. (A reverse link would create a one-to-many relation
that can't be resolved easily, and you would end up manually selecting
the host from which style information should be retrieved.)

All distributable applications should be owned by person 0, the
Midgard Administrator, as that is available on all Midgard
systems. Otherwise regular user accounts might be replicated as well.

After Repligard has produced the XML file, it can then be put to a web
site with some documentation for distribution.

Internationalization also causes problems. Since Repligard's XML
format supports only one encoding per file, this can mess up localized
translation catalogues in your application. Because of this, it is a
good idea to distribute the application code in one package, and each
translation catalogue as its own package file.

At the moment, translation catalogues are best stored as snippets,
which can then be referred to using the snippet path, which is
replication safe. This way the snippet name can be the locale, which
helps usage within the application.

Example:

mgd_include_snippet("/my/snippet/path/$lang");

Another option is to store the translation catalogues as page or style
elements. However, this will make mod_midgard load all catalogues from
the database, causing some overhead.

Example:

mgd_eval("?><($lang)><?");

If your application has images or other files, it is best to store
them as BLOBs. This way they will be embedded to the replication XML
file, and will be installed automatically alongside the application
package.

If the files are big, it might be better for performance reasons to
keep them in the file system.

