Re: [OPEN-ILS-DEV] PATCH: osrf_json.c (rewrite of jsonObjectToJSON[Raw])

2008-11-18 Thread Bill Erickson

On Mon, 17 Nov 2008 23:32:21 -0500, Scott McKellar [EMAIL PROTECTED] wrote:

This patch is a rewrite of the jsonObjectToJSON and jsonObjectToJSONRaw  
functions.  It is dependent on my previous patch to utils.c and utils.h,

adding the new buffer_append_uescape function.

One purpose is to replace a call to the uescape function with a call to
the faster buffer_append_uescape function.  The other purpose is to
introduce a faster way to translate a jsonObject into a string.

(Also in one spot I broke up a very long string literal into several
smaller pieces so that it wouldn't wrap around in the editor.)

In the existing jsonObjectToJSON function, we receive a pointer to a
jsonObject and return a string of JSON.  However we don't translate the
original jsonObject directly.  Instead, we create a modified clone of the
original, inserting an additional JSON_HASH node wherever we find a
classname.  Then we translate the clone, and finally destroy it.

It always struck me as an egregious waste to create and destroy a whole
parallel object just so that we could turn it into a string.  So I looked
for a way to eliminate the cloning.

The result is a modification of add_json_to_buffer(), a local function
that recursively traverses and translates the jsonObject.  When it sees a
classname (and it has been asked to expand classnames), the new version
inserts additional gibberish into the output text and then continues the
traversal, without modifying or copying the original jsonObject at all.

In my benchmark, this new approach was faster than the original by a
factor of about 5.  When I combined this change with the use of the new
buffer_append_uencode function, it was faster by a factor of about 7.2.
The benchmark used a moderately complex jsonObject about 5 or 6 levels
deep, with both hashes and arrays, with classnames at several levels.
The performance gain will no doubt depend on the contents of the
jsonObject,but I haven't tried to isolate the variables.

The new version is a bit trickier and harder to read than the old.  In my
opinion the speedup is worth the obscurity, because a lot of places in
Evergreen will benefit.



With this change, the jsonObjectEncodeClass function, which had been
called only from jsonObjectToJSON, is no longer called from anywhere.
It's the function that created the modified copy of the original
jsonObject.  I see no reason to keep it around.  After a decent interval,
if no one objects I shall submit a patch to eliminate it.


Agreed.



Likewise the jsonObjectToJSONRaw function is no longer called from
anywhere.  It translates a jsonObject to JSON without expanding  
classnames.

In this rewritten version it is identical to jsonObjectToJSON except that
it calls add_json_to_buffer() with a different parameter value.

If we eliminate jsonObjectToJSONRaw, I can simplify add_json_to_buffer()
a bit.  However we might someday want to be able to translate a  
jsonObject

without expanding classnames.  If nobody minds, I'll leave this function
where it is.


Also agreed.  We (or other code using OpenSRF) may need the Raw() version  
in the future.


This code path is used /a lot/, so the speedups are greatly appreciated.

I've run into one issue with these last 2 patches, in particular, there  
seems to be a subtle difference in uescape() and buffer_append_uescape().   
The majority of the JSON parsing succeeds, but parsing certain  
bibliographic data (MARC records) always results in NULL when attempting  
to encode the containing object (biblio_record entry objects).If I  
replace the call to buffer_append_uescape() with a call to uescape() in  
osrf_json_object.c, then all is well.


I would send you some sample data to test, but I've yet to recreate the  
issue outside of a running system.  For now, I'm adding some logging to  
see if I can track it down.  I just wanted to pass this on in case any  
lightbulbs went off.  I'll let you if I find something...


Thanks, Scott

-b


--
Bill Erickson
| VP, Software Development  Integration
| Equinox Software, Inc. / The Evergreen Experts
| phone: 877-OPEN-ILS (673-6457)
| email: [EMAIL PROTECTED]
| web: http://esilibrary.com


Re: [OPEN-ILS-DEV] PATCH: osrf_json.c (rewrite of jsonObjectToJSON[Raw])

2008-11-18 Thread Scott McKellar
--- On Tue, 11/18/08, Bill Erickson [EMAIL PROTECTED] wrote:

 From: Bill Erickson [EMAIL PROTECTED]
snip

 I've run into one issue with these last 2 patches, in
 particular, there  
 seems to be a subtle difference in uescape() and
 buffer_append_uescape().   
 The majority of the JSON parsing succeeds, but parsing
 certain  
 bibliographic data (MARC records) always results in NULL
 when attempting  
 to encode the containing object (biblio_record entry
 objects).If I  
 replace the call to buffer_append_uescape() with a call to
 uescape() in  
 osrf_json_object.c, then all is well.
 
 I would send you some sample data to test, but I've yet
 to recreate the  
 issue outside of a running system.  For now, I'm adding
 some logging to  
 see if I can track it down.  I just wanted to pass this on
 in case any  
 lightbulbs went off.  I'll let you if I find
 something...

I'm not sure what you mean by always results in NULL.  What is NULL?
The return from the original uescape()?  That would indicate invalid
UTF-8 characters in the input string.

Maybe you mean that the buffer_append_uescape() seems to work, but the
resulting JSON string later fails to parse correctly.

In my test harness I translated things both ways and tested with strcmp()
to verify that I got the same results.  However I didn't test with
non-ASCII characters, so that's likely to be the weakest link.  It may
be that I'm mangling the UTF-8.

I'm more inclined to think that you have invalid UTF-8 characters in some
of your MARC records.  You wouldn't know about it because uescape()
silently discards an entire input string if any of it is invalid.  The
new function translates as far as it can before it hits the invalid
characters.  I *think* that the result would still be valid JSON, because
we only translate the values of strings, which get enclosed by double
quotes anyway.  However the semantics of that string might be invalid
at some higher level.

For diagnostic purposes, I suggest either or both of the following:

1. Check the return code from buffer_append_uescape(), and log a 
message if it's non-zero.

2. Tweak buffer_append_uescape() so that, if it detects invalid UTF-8,
it restores the growing_buffer to its original length.  In effect, this
change would make buffer_append_uescape() behave like uescape() in the
face of invalid UTF-8.  I can send you a patch for this if you like, but
I don't know if you would want it based on the original code or on the 
already patched version.

In the long run we should probably think about how best to respond to
invalid UTF-8.  For now we just want to identify the problem and fix it.

Meanwhile you can get most of the speedup by reverting to the original
uescape function but otherwise keeping the rewrite of jsonObjectToJSON().

Scott McKellar
http://home.swbell.net/mck9/ct/




Re: [OPEN-ILS-DEV] PATCH: osrf_json.c (rewrite of jsonObjectToJSON[Raw])

2008-11-18 Thread Scott McKellar
--- On Tue, 11/18/08, Bill Erickson [EMAIL PROTECTED] wrote:

 OK, I tracked down 2 problems, the first makes sense, the
 second, not so  
 much.
 
 The first problem was the result of attempting to escape
 quotes () and  
 backslashes (\) in the final if/else block of
 buffer_append_uescape().   
 Since  and \ are both = ' ', they
 were not getting escaped.

Doh! blush/ Sorry.
 
 The second was caused by NULL jsonObject's being passed
 from  
 OSRF_LIST_GET_INDEX (osrf_json_object:337) into
 add_json_to_buffer().   
 osrfList allows NULL entries, so it seems the old could
 should have been  
 dying here, but for some reason only the new code suffers.

This one I don't understand.  Is it working now, somehow?  If it is,
then I won't worry about it.

Scott McKellar
http://home.swbell.net/mck9/ct/



[OPEN-ILS-DEV] PATCH: osrf_json.c (rewrite of jsonObjectToJSON[Raw])

2008-11-17 Thread Scott McKellar
This patch is a rewrite of the jsonObjectToJSON and jsonObjectToJSONRaw 
functions.  It is dependent on my previous patch to utils.c and utils.h,
adding the new buffer_append_uescape function.

One purpose is to replace a call to the uescape function with a call to
the faster buffer_append_uescape function.  The other purpose is to 
introduce a faster way to translate a jsonObject into a string.

(Also in one spot I broke up a very long string literal into several
smaller pieces so that it wouldn't wrap around in the editor.)

In the existing jsonObjectToJSON function, we receive a pointer to a 
jsonObject and return a string of JSON.  However we don't translate the
original jsonObject directly.  Instead, we create a modified clone of the
original, inserting an additional JSON_HASH node wherever we find a
classname.  Then we translate the clone, and finally destroy it.

It always struck me as an egregious waste to create and destroy a whole
parallel object just so that we could turn it into a string.  So I looked
for a way to eliminate the cloning.

The result is a modification of add_json_to_buffer(), a local function
that recursively traverses and translates the jsonObject.  When it sees a 
classname (and it has been asked to expand classnames), the new version
inserts additional gibberish into the output text and then continues the
traversal, without modifying or copying the original jsonObject at all.

In my benchmark, this new approach was faster than the original by a
factor of about 5.  When I combined this change with the use of the new
buffer_append_uencode function, it was faster by a factor of about 7.2.
The benchmark used a moderately complex jsonObject about 5 or 6 levels
deep, with both hashes and arrays, with classnames at several levels.  
The performance gain will no doubt depend on the contents of the 
jsonObject,but I haven't tried to isolate the variables.

The new version is a bit trickier and harder to read than the old.  In my
opinion the speedup is worth the obscurity, because a lot of places in 
Evergreen will benefit.



With this change, the jsonObjectEncodeClass function, which had been
called only from jsonObjectToJSON, is no longer called from anywhere.
It's the function that created the modified copy of the original
jsonObject.  I see no reason to keep it around.  After a decent interval, 
if no one objects I shall submit a patch to eliminate it.

Likewise the jsonObjectToJSONRaw function is no longer called from
anywhere.  It translates a jsonObject to JSON without expanding classnames.
In this rewritten version it is identical to jsonObjectToJSON except that
it calls add_json_to_buffer() with a different parameter value.

If we eliminate jsonObjectToJSONRaw, I can simplify add_json_to_buffer()
a bit.  However we might someday want to be able to translate a jsonObject
without expanding classnames.  If nobody minds, I'll leave this function
where it is.

Scott McKellar
http://home.swbell.net/mck9/ct/

Developer's Certificate of Origin 1.1 By making a contribution to
this project, I certify that:

(a) The contribution was created in whole or in part by me and I
have the right to submit it under the open source license indicated
in the file; or

(b) The contribution is based upon previous work that, to the best
of my knowledge, is covered under an appropriate open source license
and I have the right under that license to submit that work with
modifications, whether created in whole or in part by me, under the
same open source license (unless I am permitted to submit under a
different license), as indicated in the file; or

(c) The contribution was provided directly to me by some other person
who certified (a), (b) or (c) and I have not modified it; and

(d) In the case of each of (a), (b), or (c), I understand and agree
that this project and the contribution are public and that a record
of the contribution (including all personal information I submit
with it, including my sign-off) is maintained indefinitely and may
be redistributed consistent with this project or the open source
license indicated in the file.*** ./trunk/src/libopensrf/osrf_json_object.c	2008-11-17 16:12:14.0 -0600
--- ./trunk-mod/src/libopensrf/osrf_json_object.c	2008-11-17 22:25:01.0 -0600
***
*** 61,67 
  
  static unusedObj* freeObjList = NULL;
  
! static void add_json_to_buffer( const jsonObject* obj, growing_buffer * buf );
  
  /**
   * Return all unused jsonObjects to the heap
--- 61,68 
  
  static unusedObj* freeObjList = NULL;
  
! static void add_json_to_buffer( const jsonObject* obj,
! 	growing_buffer * buf, int do_classname, int second_pass );
  
  /**
   * Return all unused jsonObjects to the heap
***
*** 194,200 
  unusedObjCapture++;
  currentListLen++;
  if (unusedObjCapture  1  !(unusedObjCapture % 1000))
! osrfLogDebug( OSRF_LOG_MARK, Objects malloc()'d: %d, Reusable objects captured: %d, Objects