Re: [CODE4LIB] anyone know about Inera?

2008-07-13 Thread Min-Yen Kan
[I forgot to CC: this to the list, I've edited my reply a bit from the
original email to Godmar.]

Hi Godmar, all:

We'd love to do this and may consider doing so in the future.
As we are primarily a research unit, offering such services is
wonderful, but only possible when staff have time.

Just FYI, the web service we offer is running on one machine for the
public, but internally in our group we have a number of machines that
handle ParsCit web service calls that are brokered by a load balancing
mechanism; however we cannot spare the computational bandwidth for our
public interfaces.  For us in-house, it is a production system (though
we have yet to really stress test this).  This is why we are hoping
others will find the system useful and install it on their own.

If someone does make ParsCit available in a scalable web service
environment free of charge, we'd certainly link to it from the main
ParsCit website.  Any takers for some volunteer work?

Cheers,

Min

PS - Godmar suggested that we (and others providing like web services)
consider designing our web services in a scalable way from the
beginning so that we don't have to worry or focus bandwidth on making
our services scalable.  I'm very happy to learn such technologies, if
someone can point us in the appropriate direction -- Google App Engine
or otherwise.

On Sat, Jul 12, 2008 at 10:46 PM, Godmar Back [EMAIL PROTECTED] wrote:
> Min, Eric, and others working in this domain -
>
> have you considered designing your software as a scalable web service
> from the get-go, using such frameworks as Google App Engine? You may
> be able to use Montepython for the CRF computations
> (http://montepython.sourceforge.net/)
>
> I know Min offers a WSDL wrapper around their software, but that's
> simply a gateway to one single-machine installation, and it's not
> intended as a production service at that.
>
>  - Godmar



-- 
Min-Yen KAN (Dr) :: Assistant Professor :: National University of
Singapore :: School of Computing, AS6 05-12, Law Link, Singapore
117590 :: 65-6516 1885(DID) :: 65-6779 4580 (Fax) ::
[EMAIL PROTECTED] (E) :: www.comp.nus.edu.sg/~kanmy (W)



[CODE4LIB] a brief summary of the Google App Engine

2008-07-13 Thread Godmar Back
Hi,

since I brought up the issue of the Google App Engine (GAE) (or
similar services, such as Amazon's Elastic Compute Cloud, EC2), I
thought I'd give a brief overview of what it can and cannot do, so
that we may judge its potential use for library services.

GAE is a cloud infrastructure into which developers can upload
applications. These applications are replicated among Google's network
of data centers and they have access to its computational resources.
Each application has access to a certain amount of resources at no
fee; Google recently announced the pricing for applications whose
resource use exceeds the no-fee threshold [1]. The no-fee threshold
is rather substantial: 500MB of persistent storage, and, according to
Google, enough bandwidth and cycles to serve about 5 million page
views per month.
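
To put the page-view figure in perspective, here is a quick
back-of-the-envelope conversion into an average request rate. The
30-day month is an assumption made for the arithmetic, and real traffic
is of course bursty rather than evenly spread.

```python
# Convert the quoted free quota into an average sustained rate.
PAGE_VIEWS_PER_MONTH = 5_000_000   # figure quoted from Google above
DAYS_PER_MONTH = 30                # assumption: a 30-day month
SECONDS_PER_MONTH = DAYS_PER_MONTH * 24 * 60 * 60

average_rate = PAGE_VIEWS_PER_MONTH / SECONDS_PER_MONTH
print(f"{average_rate:.2f} page views per second on average")
```

That works out to roughly two page views per second around the clock.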

GAE applications must be written in Python. They run in a sandboxed
environment. This environment limits what applications can do and how
they communicate with the outside world.  Overall, the sandbox is very
flexible - in particular, application developers have the option of
uploading additional Python libraries of their choice with their
application. The restrictions lie primarily in security and resource
management. For instance, you cannot use arbitrary socket connections
(all outside world communication must be through GAE's fetch service
which supports http/https only), you cannot fork processes or threads
(which would use up CPU cycles), and you cannot write to the
filesystem (instead, you must store all of your persistent data in
Google's scalable datastore, which is built on BigTable).

All resource usage (CPU, Bandwidth, Persistent Storage - though not
memory) is accounted for and you can see your use in the application's
dashboard control panel. Resources are replenished on the fly where
possible, as in the case of CPU and Bandwidth. Developers are
currently restricted to 3 applications per account. Making
applications in multiple accounts work in tandem to work around quota
limitations is against Google's terms of use.

Applications are described by a configuration file that maps URI paths
to scripts in a manner similar to how you would use Apache
mod_rewrite.  URIs can also be mapped to explicitly named static
resources such as images. Static resources are uploaded along with
your application and, like the application, are replicated in Google's
server network.
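
The idea of mapping URI paths to scripts or static resources can be
sketched in a few lines of plain Python. This is only an illustration
of pattern-based dispatch; it is not GAE's configuration format, and
the route table, the "script:"/"static:" prefixes, and the resolve()
helper are all hypothetical names chosen for the example.

```python
import re

# Hypothetical routing table: (URI pattern, target) pairs, first match wins.
ROUTES = [
    (r"/images/(.*)", "static:images/\\1"),  # static resource, path kept
    (r"/api/.*", "script:api.py"),           # everything under /api/
    (r"/.*", "script:main.py"),              # catch-all handler
]

def resolve(path):
    """Return the target for a request path, or None if nothing matches."""
    for pattern, target in ROUTES:
        match = re.fullmatch(pattern, path)
        if match:
            # expand() substitutes captured groups such as \1 into the target.
            return match.expand(target)
    return None

print(resolve("/images/logo.png"))
print(resolve("/api/parse"))
print(resolve("/"))
```

As with rewrite rules, the order of the entries matters: the catch-all
pattern has to come last or it would shadow the more specific ones.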

The programming environment is CGI 1.1.  Google suggests, but doesn't
require, the use of supporting interfaces for this model, such as WSGI.
Such high-level libraries allow applications to be written in
a very compact, high-level style, the way one is used to from Python.
In addition to WSGI frameworks, you can use several template
libraries, such as Django's.  Since the model is CGI 1.1, there
are few or no restrictions on what can be returned - you can
return, for instance, XML or JSON, and you have full control over the
Content-Type: header returned.
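
As a minimal sketch of what "full control over the Content-Type" looks
like in the WSGI style, the following uses only the Python standard
library (written for a current Python 3 interpreter, not for the GAE
runtime). The application() handler, its JSON payload, and the call()
helper are made up for illustration; nothing here is GAE-specific.

```python
import json
from wsgiref.util import setup_testing_defaults

def application(environ, start_response):
    """A tiny WSGI app that answers every request with a JSON document."""
    payload = {"path": environ.get("PATH_INFO", "/"), "status": "ok"}
    body = json.dumps(payload).encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "application/json"),  # chosen by the app itself
        ("Content-Length", str(len(body))),
    ])
    return [body]

def call(app, path):
    """Invoke a WSGI app in-process, without starting a web server."""
    environ = {"PATH_INFO": path}
    setup_testing_defaults(environ)  # fills in the remaining WSGI keys
    captured = {}

    def start_response(status, headers):
        captured["status"] = status
        captured["headers"] = dict(headers)

    body = b"".join(app(environ, start_response))
    return captured["status"], captured["headers"], body

status, headers, body = call(application, "/cite")
print(status, headers["Content-Type"], body.decode("utf-8"))
```

Returning XML instead would only mean changing the header value and
the way the body is serialized.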

The execution model is request-based.  When a client request arrives,
GAE will start a new instance (or reuse an existing instance if
possible), then invoke the main() method. At this point, you have a
set time limit to process this request (though not explicitly stated
in Google's doc, the limit currently appears to be 9 seconds) and
return a result to the client. Note that this per-request limit is a
maximum; you should usually be much quicker in your response. Also
note that any CPU cycles you use during those 9 seconds (but not time
you spend waiting to fetch results from other application tiers) count
against your overall CPU budget.
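
The difference between elapsed time and CPU time can be seen with
the standard library's two clocks. This is a local illustration only
and says nothing about how GAE does its own accounting; the timed()
helper and the sleeping handler are hypothetical stand-ins.

```python
import time

def timed(handler):
    """Run handler() and report (result, wall seconds, CPU seconds)."""
    wall_start = time.perf_counter()   # elapsed real time
    cpu_start = time.process_time()    # CPU time used by this process
    result = handler()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return result, wall, cpu

def waiting_handler():
    # Stands in for a request that mostly waits on another tier.
    time.sleep(0.2)
    return "done"

result, wall, cpu = timed(waiting_handler)
print(f"wall={wall:.3f}s cpu={cpu:.3f}s")
```

A handler that mostly waits uses up wall-clock time, which is what a
per-request deadline measures, while consuming very little CPU.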

The key service the GAE runtime libraries provide is the Google
datastore, which is built on BigTable [2].
You can think of this service as a highly efficient, persistent store
for structured data, or as a simplified database that
allows the creation, retrieval, updating, and deletion (CRUD) of
entries using keys and, optionally, indices. It provides limited
support for transactions as well. Though it is less powerful than
conventional relational databases - which aren't nearly as scalable -
it can be accessed using GQL, a query language that's similar in
spirit to SQL.  Notably, GQL (or BigTable) does not support JOINs,
which means that you will have to adjust your traditional approach to
database normalization.
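
To make the normalization point concrete, here is a small sketch using
plain Python dictionaries in place of a datastore. The record layout
and all names are invented for the example. Without JOINs you either
do the second lookup in application code or store a copy of the
needed field with each record.

```python
# Normalized layout: papers refer to authors by key only.
authors = {1: {"name": "Author A"}, 2: {"name": "Author B"}}
papers = [
    {"title": "First paper", "author_id": 1},
    {"title": "Second paper", "author_id": 2},
]

def titles_with_authors_two_step(papers, authors):
    """Application-side 'join': one extra key lookup per paper."""
    return [(p["title"], authors[p["author_id"]]["name"]) for p in papers]

# Denormalized layout: the author's name is copied into each paper,
# so a single query over papers is enough.
papers_denormalized = [
    {"title": "First paper", "author_id": 1, "author_name": "Author A"},
    {"title": "Second paper", "author_id": 2, "author_name": "Author B"},
]

def titles_with_authors_denormalized(papers):
    """No second lookup needed; the copy travels with the record."""
    return [(p["title"], p["author_name"]) for p in papers]

print(titles_with_authors_two_step(papers, authors))
print(titles_with_authors_denormalized(papers_denormalized))
```

The price of the denormalized layout is that a renamed author has to
be updated in every paper record that carries the copy.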

The Python binding for the structured data is intuitive and seamless.
You simply declare a Python class for the properties of objects you
wish to store, along with the types of the properties you wish
included, and you can subsequently use the put() or delete() method to
write and delete them. Queries return instances of the objects you
placed in a given table.  Tables are named after the Python classes.
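
The declare-a-class, then put() and delete() pattern can be imitated
with a toy in-memory stand-in. To be clear, this is not the GAE SDK:
the Model base class, the "properties" dictionary, and the all()
method below are hypothetical and exist only to show the shape of the
idea, including tables being named after the classes.

```python
class Model:
    """Toy in-memory stand-in for a datastore-backed base class."""
    _tables = {}     # table name -> {key: instance}
    _next_key = 1

    def __init__(self, **values):
        self.key = None
        for name, value in values.items():
            expected = self.properties[name]  # KeyError if undeclared
            if not isinstance(value, expected):
                raise TypeError(f"{name} must be {expected.__name__}")
            setattr(self, name, value)

    @classmethod
    def _table(cls):
        # The table is named after the Python class.
        return Model._tables.setdefault(cls.__name__, {})

    def put(self):
        """Store the object, assigning a key on first write."""
        if self.key is None:
            self.key = Model._next_key
            Model._next_key += 1
        self._table()[self.key] = self
        return self.key

    def delete(self):
        """Remove the object from its table if it was stored."""
        self._table().pop(self.key, None)

    @classmethod
    def all(cls):
        """Return the stored instances of this class."""
        return list(cls._table().values())

class Citation(Model):
    # Declared properties and their types.
    properties = {"title": str, "year": int}

entry = Citation(title="An example citation", year=2008)
entry.put()
print([c.title for c in Citation.all()])
entry.delete()
print(Citation.all())
```

Queries against the real datastore likewise hand back instances of the
declared class rather than raw rows.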

Google provides a number of additional runtime libraries, such as for
simple image processing a la Google Picasa, for the sending of email
(subject to resource limits), and for user authentication, solely
using Google