Re: Use of CASes with sofaURI?

2019-10-25 Thread Eddie Epstein
Besides very large documents and remote data, another major motivation was
for non-text data, such as audio or video.
Eddie

On Fri, Oct 25, 2019 at 1:33 PM Marshall Schor  wrote:

> Hi,
>
> Here's what I vaguely remember was the driving use-cases for the sofa as a
> URI.
>
> 1.  The main use case was for applications where the data was so large, it
> would
> be unreasonable to read it all in and save as a string.
>
> 2.  The prohibition on changing a sofa spec (without resetting the CAS)
> was that
> it has the potential for users to invalidate the results, in this
> (imagined)
> scenario:
>
> a) User creates cas with some sofa data,
> b) User runs annotators, which create annotations that "point into"
> the sofa
> data
> c) User changes the sofa spec, to different data, but now all the
> annotations still are pointing into "offsets" in the original data.
>
> You can change the sofa data setting, but only after resetting the CAS.
>
> Did you have a use case for wanting to change the sofa data without
> resetting the CAS?
>
>
> It sounds like you have another interesting use case:
>
> a) want to convert the sofa data uri -> a string and have the normal
> getDocumentText etc. work, but
> b) have the serialization serialize the sofaURI, and not the data
> that's
> present there.
>
> This might be a nice convenience.
>
> I can see a couple of issues:
>   a) it might need to have a good strategy for handling very large data.
> E.g.,
> the convert method might need to include a max string size spec.
>   b) Since the serialization would serialize the annotations, but not the
> data
> (it would only serialize the URI), the data at that URI could easily
> change,
> making the annotation results meaningless.  Perhaps some "fingerprinting"
> (developing a checksum of the data, and serializing that to be able to
> signal if
> that did happen) would be a reasonable protection.
>
> Maybe do a new feature-request issue?
>
> -Marshall
>
> Imagine the JavaDoc for this method would be saying something like: has the
> potential to exceed your memory, at run time, due to the potential size of
> the
> data...
>
>
> On 10/25/2019 12:59 PM, Richard Eckart de Castilho wrote:
> > Hi,
> >
> > On 25. Oct 2019, at 17:53, Marshall Schor  wrote:
> >> One other useful source for examples:  The test cases for UIMA, e.g.
> search the
> >> uimaj-core projects *.java files for "getSofaDataStream".
> > Ok, let me elaborate :)
> >
> > One can use setSofaDataURI(url) to tell the CAS that the sofa data is
> actually external.
> > One can then use getSofaDataStream() to resolve the URL and retrieve the
> data as a stream.
> >
> > So let's assume I have a CAS containing annotations on a text and the
> text is in an external file:
> >
> >   CAS cas = CasCreationUtils.createCas((TypeSystemDescription) null,
> null, null);
> >   cas.setSofaDataURI("file:/path/to/my/file", "text/plain");
> >
> > Works nice when I use getSofaDataStream() to retrieve the data.
> >
> > But I can't use the "normal" methods like getDocumentText() or
> getCoveredText() at all.
> >
> > Also, I cannot call setSofaDataString(urlContent, "text/plain") - it
> throws an exception
> > because there is already a sofaURI set. This is a major inconvenience.
> >
> > The ClearTK guys came up with an approach that tries to make this a bit
> more convenient:
> >
> > * they introduce a well-known view named "UriView" and set the
> sofaDataURI in that view.
> > * then they use a special reader which looks up the URI in that view,
> resolves it and
> >   drops the content into the sofaDataString of the "_defaultView".
> >
> > That way they get the benefit of the externally stored sofa as well as
> the ability to use
> > the usual methods to access the text.
> >
> > When I looked at setSofaDataURI(), I naively expected that it would be
> resolved the first
> > time I try to access the sofa data (e.g. via getDocumentText()) - but
> that doesn't happen.
> >
> > Then I expected that I would just call getSofaDataStream() and manually
> drop the contents
> > into setSofaDataString() and that this data string would be "transient",
> i.e. not saved
> > into XMI because we already have a setSofaDataURI set... but that
> expectation was also
> > not fulfilled.
> >
> > Could it be useful to introduce some place where we can transiently drop
> data obtained
> > from the sofaDataURI such that methods like getDocumentText() and
> getCoveredText() do
> > something useful but also such that the data is not included when
> serializing the CAS to
> > whatever format?
> >
> > Cheers,
> >
> > -- Richard
>


Re: Use of CASes with sofaURI?

2019-10-25 Thread Marshall Schor
Hi,

Here's what I vaguely remember were the driving use cases for the sofa as a URI.

1.  The main use case was for applications where the data was so large, it would
be unreasonable to read it all in and save as a string.

2.  The prohibition on changing a sofa spec (without resetting the CAS) exists
because such a change has the potential to invalidate the results, in this
(imagined) scenario:

    a) User creates cas with some sofa data,
    b) User runs annotators, which create annotations that "point into" the sofa
data
    c) User changes the sofa spec, to different data, but now all the
annotations still are pointing into "offsets" in the original data.

You can change the sofa data setting, but only after resetting the CAS. 

    Did you have a use case for wanting to change the sofa data without
resetting the CAS?


It sounds like you have another interesting use case:

    a) want to convert the sofa data uri -> a string and have the normal
getDocumentText etc. work, but
    b) have the serialization serialize the sofaURI, and not the data that's
present there.

This might be a nice convenience.

I can see a couple of issues:
  a) it might need to have a good strategy for handling very large data.  E.g.,
the convert method might need to include a max string size spec.
  b) Since the serialization would serialize the annotations, but not the data
(it would only serialize the URI), the data at that URI could easily change,
making the annotation results meaningless.  Perhaps some "fingerprinting"
(developing a checksum of the data, and serializing that to be able to signal if
that did happen) would be a reasonable protection.
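
The "fingerprinting" idea could be sketched roughly as follows. This is only an illustration using the JDK, not anything UIMA provides: the class and method names are hypothetical, SHA-256 is an assumed choice of checksum, and the input stream is assumed to be whatever getSofaDataStream() (or plain Java I/O on the URI) returns. Where the fingerprint would be stored in the serialized CAS is left open.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: not part of UIMA.
final class SofaFingerprint {

    /** Streams the data through SHA-256 and returns the digest as lowercase hex. */
    static String sha256Hex(InputStream in) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] buffer = new byte[8192];
            int n;
            // Read in chunks so very large sofa data never has to fit in memory.
            while ((n = in.read(buffer)) != -1) {
                digest.update(buffer, 0, n);
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    /** True if the data behind the stream still matches the stored fingerprint. */
    static boolean matches(InputStream in, String storedHex) {
        return sha256Hex(in).equalsIgnoreCase(storedHex);
    }
}
```

The fingerprint would be computed once when the annotations are created and checked again after deserialization; a mismatch signals that the offsets may no longer point at the text they were computed for.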

Maybe do a new feature-request issue?

-Marshall

Imagine the JavaDoc for this method would be saying something like: has the
potential to exceed your memory, at run time, due to the potential size of the
data...
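
A "max string size spec" for such a convert method might look something like the sketch below. It is JDK-only and hypothetical: the class name, the method name, and the choice to express the limit in bytes (rather than characters) are all assumptions, and the stream is assumed to come from getSofaDataStream().

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

// Hypothetical helper: not part of UIMA.
final class BoundedSofaReader {

    /**
     * Reads the whole stream into a String, but gives up with an
     * IllegalStateException as soon as more than maxBytes have been seen.
     */
    static String readAtMost(InputStream in, Charset charset, int maxBytes) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        try {
            int n;
            while ((n = in.read(buffer)) != -1) {
                // Check before buffering so memory use stays within the limit.
                if ((long) out.size() + n > maxBytes) {
                    throw new IllegalStateException(
                            "Sofa data exceeds the limit of " + maxBytes + " bytes");
                }
                out.write(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), charset);
    }
}
```

Failing fast keeps the caller from accidentally pulling a multi-gigabyte audio or video sofa into a string; a real API might prefer to truncate or return an empty result instead.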


On 10/25/2019 12:59 PM, Richard Eckart de Castilho wrote:
> Hi,
>
> On 25. Oct 2019, at 17:53, Marshall Schor  wrote:
>> One other useful source for examples:  The test cases for UIMA, e.g. search 
>> the
>> uimaj-core projects *.java files for "getSofaDataStream".
> Ok, let me elaborate :)
>
> One can use setSofaDataURI(url) to tell the CAS that the sofa data is 
> actually external.
> One can then use getSofaDataStream() to resolve the URL and retrieve the data as 
> a stream.
>
> So let's assume I have a CAS containing annotations on a text and the text is 
> in an external file:
>
>   CAS cas = CasCreationUtils.createCas((TypeSystemDescription) null, null, 
> null);
>   cas.setSofaDataURI("file:/path/to/my/file", "text/plain");
>
> Works nice when I use getSofaDataStream() to retrieve the data. 
>
> But I can't use the "normal" methods like getDocumentText() or 
> getCoveredText() at all.
>
> Also, I cannot call setSofaDataString(urlContent, "text/plain") - it throws 
> an exception 
> because there is already a sofaURI set. This is a major inconvenience.
>
> The ClearTK guys came up with an approach that tries to make this a bit more 
> convenient:
>
> * they introduce a well-known view named "UriView" and set the sofaDataURI in 
> that view.
> * then they use a special reader which looks up the URI in that view, 
> resolves it and 
>   drops the content into the sofaDataString of the "_defaultView".
>
> That way they get the benefit of the externally stored sofa as well as the 
> ability to use
> the usual methods to access the text.
>
> When I looked at setSofaDataURI(), I naively expected that it would be 
> resolved the first
> time I try to access the sofa data (e.g. via getDocumentText()) - but that 
> doesn't happen.
>
> Then I expected that I would just call getSofaDataStream() and manually drop 
> the contents
> into setSofaDataString() and that this data string would be "transient", i.e. 
> not saved
> into XMI because we already have a setSofaDataURI set... but that expectation 
> was also
> not fulfilled.
>
> Could it be useful to introduce some place where we can transiently drop data 
> obtained
> from the sofaDataURI such that methods like getDocumentText() and 
> getCoveredText() do 
> something useful but also such that the data is not included when serializing 
> the CAS to
> whatever format?
>
> Cheers,
>
> -- Richard


Re: Use of CASes with sofaURI?

2019-10-25 Thread Richard Eckart de Castilho
Hi,

On 25. Oct 2019, at 17:53, Marshall Schor  wrote:
> 
> One other useful source for examples:  The test cases for UIMA, e.g. search 
> the
> uimaj-core projects *.java files for "getSofaDataStream".

Ok, let me elaborate :)

One can use setSofaDataURI(url) to tell the CAS that the sofa data is actually 
external.
One can then use getSofaDataStream() to resolve the URL and retrieve the data as a 
stream.

So let's assume I have a CAS containing annotations on a text and the text is 
in an external file:

  CAS cas = CasCreationUtils.createCas((TypeSystemDescription) null, null, null);
  cas.setSofaDataURI("file:/path/to/my/file", "text/plain");

Works nice when I use getSofaDataStream() to retrieve the data. 

But I can't use the "normal" methods like getDocumentText() or getCoveredText() 
at all.

Also, I cannot call setSofaDataString(urlContent, "text/plain") - it throws an 
exception 
because there is already a sofaURI set. This is a major inconvenience.

The ClearTK guys came up with an approach that tries to make this a bit more 
convenient:

* they introduce a well-known view named "UriView" and set the sofaDataURI in 
that view.
* then they use a special reader which looks up the URI in that view, resolves 
it and 
  drops the content into the sofaDataString of the "_defaultView".

That way they get the benefit of the externally stored sofa as well as the 
ability to use
the usual methods to access the text.

When I looked at setSofaDataURI(), I naively expected that it would be resolved 
the first
time I try to access the sofa data (e.g. via getDocumentText()) - but that 
doesn't happen.

Then I expected that I would just call getSofaDataStream() and manually drop 
the contents
into setSofaDataString() and that this data string would be "transient", i.e. 
not saved
into XMI because we already have a setSofaDataURI set... but that expectation 
was also
not fulfilled.

Could it be useful to introduce some place where we can transiently drop data 
obtained
from the sofaDataURI such that methods like getDocumentText() and 
getCoveredText() do 
something useful but also such that the data is not included when serializing 
the CAS to
whatever format?

Cheers,

-- Richard

Re: Use of CASes with sofaURI?

2019-10-25 Thread Marshall Schor
One other useful source for examples:  The test cases for UIMA, e.g. search the
uimaj-core projects *.java files for "getSofaDataStream".

-Marshall

On 10/24/2019 6:11 PM, Richard Eckart de Castilho wrote:
> Hi there,
>
> does somebody have an example of how to work with CASes where the sofa 
> data is not set using setDocumentText() but rather using setSofaDataURI(...)? 
> 
>
> It looks like the CAS text is then not accessible via the usual means:
>
>   CAS cas = CasCreationUtils.createCas((TypeSystemDescription) null, null, null);
>   cas.setSofaDataURI("https://www.apache.org/licenses/LICENSE-2.0.txt", "text/plain");
>   CasIOUtils.save(cas, System.out, SerialFormat.XMI);
>   System.out.println(cas.getDocumentText()); // -> prints "null"
>   System.out.println(cas.getSofaDataString()); // -> prints "null"
>
> Apparently, one needs to call getSofaDataStream() - but even after calling 
> that, getDocumentAnnotation().getCoveredText() returns null.
>
> So how is one expected to work with CASes that are using this data URI 
> concept?
>
> Cheers,
>
> -- Richard


Re: Use of CASes with sofaURI?

2019-10-25 Thread Marshall Schor
Hi, not my area of expertise, but the docs at

  http://uima.apache.org/d/uimaj-current/tutorials_and_users_guides.html#ugr.tug.aas.accessing_sofa_data

say that if you're using a URI, then you use cas.getSofaDataURI(), which returns
a string representation of the URI.

To get the data, the docs say you need to set up some standard Java I/O.

There's also a special cas method, getSofaDataStream, which returns an input
stream, and works with both local and remote data.
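
For the "standard Java I/O" route, a minimal JDK-only sketch might look like this. The class and method names are hypothetical, the URI string is assumed to be the value returned by cas.getSofaDataURI(), and UTF-8 is an assumed encoding. Note that it reads everything into memory, so it only suits data known to be small.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: not part of UIMA.
final class SofaUriReader {

    /** Opens the given absolute URI (e.g. file: or https:) and returns its content as text. */
    static String readText(String sofaUri) {
        // try-with-resources closes the stream even if reading fails.
        try (InputStream in = URI.create(sofaUri).toURL().openStream()) {
            // Assumes the external data is UTF-8 encoded text (Java 9+ for readAllBytes).
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```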

-Marshall

On 10/24/2019 6:11 PM, Richard Eckart de Castilho wrote:
> Hi there,
>
> does somebody have an example of how to work with CASes where the sofa 
> data is not set using setDocumentText() but rather using setSofaDataURI(...)? 
> 
>
> It looks like the CAS text is then not accessible via the usual means:
>
>   CAS cas = CasCreationUtils.createCas((TypeSystemDescription) null, null, null);
>   cas.setSofaDataURI("https://www.apache.org/licenses/LICENSE-2.0.txt", "text/plain");
>   CasIOUtils.save(cas, System.out, SerialFormat.XMI);
>   System.out.println(cas.getDocumentText()); // -> prints "null"
>   System.out.println(cas.getSofaDataString()); // -> prints "null"
>
> Apparently, one needs to call getSofaDataStream() - but even after calling 
> that, getDocumentAnnotation().getCoveredText() returns null.
>
> So how is one expected to work with CASes that are using this data URI 
> concept?
>
> Cheers,
>
> -- Richard


Dictionary Lookup Change in Ruta?

2019-10-25 Thread Viorel Morari

Hello,

With ticket UIMA-6092 in mind, we are looking for ways to improve 
the Ruta dictionary lookup on wordlists and tables. This would target 
primarily the MARKFAST, MARKTABLE and TRIE actions. One option to 
make the lookup more robust would be to set the default value of the 
existing parameter ignoreWS (i.e. ignore whitespace) to true 
(currently it is false).


Ignoring whitespace by default, however, would also enable matching 
across whitespace (if there is any). This might have undesired 
side effects for those who use dictionary lookup extensively in 
production. Therefore, before we make any changes: if this concerns 
you in any way, feel free to object.



Regards,
Viorel Morari