I would go further than Steve on this.

There is only one thread-safe thing in Daffodil. This is by design/intention. 
Given a DataProcessor object, one may call its parse and unparse methods from 
multiple threads.

These are thread safe because all the shared state of DataProcessor (the 
compiled schema) is read-only, and all structures allocated by a parse/unparse 
call are private to the one thread running that call.
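
To make the pattern concrete, here is a minimal Java sketch of "read-only 
shared state, private per-call state". SharedProcessorSketch and its methods 
are hypothetical stand-ins made up for illustration; this is not the Daffodil 
API, just the shape of the design.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.regex.Pattern;

// Hypothetical stand-in for a processor holding a compiled schema.
// NOT the Daffodil API.
final class SharedProcessorSketch {
    // Shared state: set once in the constructor and never mutated,
    // so any number of threads may read it without locking.
    private final String delimiter;

    SharedProcessorSketch(String delimiter) {
        this.delimiter = delimiter;
    }

    // Everything allocated here belongs to this call only, so a plain
    // non-thread-safe ArrayList is fine.
    List<String> parse(String data) {
        List<String> fields = new ArrayList<>();
        for (String field : data.split(Pattern.quote(delimiter))) {
            fields.add(field.trim());
        }
        return fields;
    }

    // Calls parse() on ONE shared instance from several threads at once
    // and returns the results in input order.
    static List<List<String>> parseConcurrently(SharedProcessorSketch p,
                                                List<String> inputs) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (String input : inputs) {
                futures.add(pool.submit(() -> p.parse(input)));
            }
            List<List<String>> results = new ArrayList<>();
            for (Future<List<String>> f : futures) {
                results.add(f.get());
            }
            return results;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

No synchronization appears anywhere in parse(), which is the point: nothing 
mutable is reachable from more than one thread.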

Btw: there is currently one known thread-safety bug in Daffodil:
https://issues.apache.org/jira/browse/DAFFODIL-2216

Everywhere in Daffodil, developers are expected to avoid state or, where 
state is required, to use local state and *not* protect it from multi-thread 
access, because only one thread should ever be accessing it. Code is expected 
to use the faster, lower-overhead, non-thread-safe collection classes rather 
than worry about state sharing, and we look for this in code review.

The Daffodil compiler has a single global synchronized method lock. So I 
believe you can't compile schemas in parallel unless you run more than one JVM 
instance to do it. The compilation is all serialized on purpose so that we 
don't have to worry about use of singleton objects.
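
For anyone unfamiliar with the idea, here is a rough Java sketch of what a 
single global lock around compilation does. SerializedCompilerSketch is a 
hypothetical stand-in, not Daffodil's compiler code; it only shows that 
callers on different threads end up running one at a time.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a compiler guarded by one global lock.
// NOT Daffodil's actual implementation.
final class SerializedCompilerSketch {
    private static final AtomicInteger active = new AtomicInteger();
    private static final AtomicInteger maxActive = new AtomicInteger();

    // A static synchronized method locks on the class object, so every
    // caller in this JVM queues up behind the same lock.
    static synchronized String compile(String schemaName) {
        int now = active.incrementAndGet();
        maxActive.accumulateAndGet(now, Math::max); // record peak overlap
        try {
            Thread.sleep(10); // pretend to do compilation work
            return "compiled:" + schemaName;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            active.decrementAndGet();
        }
    }

    // Starts several threads that all call compile() and reports the
    // largest number of compilations observed running at the same time.
    static int maxConcurrentCompiles(int threads) {
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            final int id = i;
            workers[i] = new Thread(() -> compile("schema-" + id));
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return maxActive.get();
    }
}
```

Because the lock is per JVM, separate JVM instances would each get their own 
lock, which is why running more than one JVM is the way to compile in parallel.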


________________________________
From: Steve Lawrence <slawre...@apache.org>
Sent: Thursday, August 6, 2020 9:04 AM
To: users@daffodil.apache.org <users@daffodil.apache.org>
Subject: Re: Caching, thread safety, optimizations

I'm not 100% sure if the Compiler and ProcessorFactory are thread safe.
We fix issues as they come up and try our best, but I'm not sure we
guarantee thread-safety. For example, there are definitely known issues
if you use the set*() functions. The newer with*() functions were added
to deal with these potential issues and should be used instead.

The DataProcessor is thread-safe, and we work hard to make sure it stays
that way, since this is the thing that does most of the work. So every
DataProcessor parse() or unparse() call can definitely be made in
different threads without a problem.

The ScalaXMLInfosetOutputter (like most of the other InfosetOutputters)
is stateful, and so should not be shared among different threads, but it
can be reused by calling the reset() function. I would recommend one
InfosetOutputter per thread, calling reset() in between uses. Or just
create a new one each time parse/unparse is needed--these should be
pretty lightweight to allocate.
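
A minimal Java sketch of the one-per-thread-plus-reset() approach, assuming
a stateful outputter-like object. ReusableOutputterSketch and its methods
are hypothetical stand-ins for illustration, not the Daffodil
InfosetOutputter classes.

```java
// Hypothetical stateful outputter. It stands in for an InfosetOutputter
// but is NOT the Daffodil class.
final class ReusableOutputterSketch {
    // Mutable state accumulated during one use; must not be shared.
    private final StringBuilder buffer = new StringBuilder();

    void append(String event) {
        buffer.append(event);
    }

    String result() {
        return buffer.toString();
    }

    // Clears accumulated state so the instance can be used again.
    void reset() {
        buffer.setLength(0);
    }

    // One instance per thread, created lazily on first use.
    private static final ThreadLocal<ReusableOutputterSketch> PER_THREAD =
        ThreadLocal.withInitial(ReusableOutputterSketch::new);

    // Reuses the calling thread's instance, resetting it before use.
    static String render(String... events) {
        ReusableOutputterSketch out = PER_THREAD.get();
        out.reset();
        for (String e : events) {
            out.append(e);
        }
        return out.result();
    }
}
```

If allocation really is cheap, skipping the ThreadLocal and creating a new
instance per call is simpler and equally safe.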

In general, I would recommend a workflow of creating a unique
Compiler/ProcessorFactory/DataProcessor for each unique schema that you
want to parse/unparse data with. Once you have the DataProcessor, throw
away the Compiler/ProcessorFactory and cache and reuse that
DataProcessor anytime you need to parse/unparse data using that schema.
And then create/reset the InfosetOutputter as mentioned above.
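
The caching part of that workflow could look roughly like the Java sketch
below. ProcessorCacheSketch is a hypothetical helper, and the function
passed to its constructor stands in for the whole
Compiler/ProcessorFactory/DataProcessor chain; no real Daffodil calls are
shown.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical cache: one processor per schema, built on first request.
// P stands in for the processor type; this is NOT a Daffodil class.
final class ProcessorCacheSketch<P> {
    private final ConcurrentHashMap<String, P> cache = new ConcurrentHashMap<>();
    private final Function<String, P> compileSchema;
    private final AtomicInteger compileCount = new AtomicInteger();

    // compileSchema is an assumed callback that turns a schema path
    // into a ready-to-use processor.
    ProcessorCacheSketch(Function<String, P> compileSchema) {
        this.compileSchema = compileSchema;
    }

    // computeIfAbsent runs the compile step at most once per key,
    // even when several threads ask for the same schema at once.
    P get(String schemaPath) {
        return cache.computeIfAbsent(schemaPath, path -> {
            compileCount.incrementAndGet();
            return compileSchema.apply(path);
        });
    }

    // Number of times the compile step has actually run.
    int compileCount() {
        return compileCount.get();
    }
}
```

The compiler and factory objects live only inside the callback, so they
become garbage as soon as the processor is cached.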

- Steve

On 8/5/20 2:26 PM, Patrick Grandjean wrote:
> Hi,
>
> I am looking to optimize applications that use Apache Daffodil and would like to
> know which classes or functions are thread-safe, reusable, can be cached in a
> singleton, etc. For instance, I believe that ScalaXMLInfosetOutputter is
> reusable since it has a reset() function. Here is a list of
> classes/functions/instances I am currently using:
> - Daffodil.compiler()
> - ProcessorFactory
> - ProcessorFactory.onPath(String)
> - DataProcessor
> - ScalaXMLInfosetOutputter
>
> I would like to avoid having to instantiate each class at every call. Otherwise,
> what are the common optimizations that can be done when using Apache Daffodil's
> Java/Scala API?
>
> Patrick.
>
