unsubscribe
- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Pulling data from a secured SQL database
I am working in an environment where data is stored in MS SQL Server. The database has been secured so that only a specific set of machines can access it, through a Microsoft JDBC connection with integrated security. We also have a couple of beefy Linux machines that we can use to host a Spark cluster, but those machines do not have direct access to the database. How can I pull the data from the SQL database on the smaller development machine and then have it distributed to the Spark cluster for processing? Can the driver pull the data and then distribute execution? Thanks, Thomas Ginter 801-448-7676 thomas.gin...@utah.edu
Re: UIMAj3 ideas
Richard,
There is an API in UIMA for generating Analysis Engine descriptors as well as Aggregate and Type System descriptions. I use that API to generate the XML descriptor at runtime after the configuration has been completed. I wrote my own logic to track the delegates of an Aggregate descriptor in order to propagate updates to and from the delegates, which allows the user to specify Analysis Engine parameters dynamically. I also merged the scale-out parameters for UIMA-AS into the Analysis Engine object for ease of configuration. In addition, I wrote my own code to generate the deployment descriptor from the programmatic parameters provided. The resulting XML is what the framework uses to generate the Spring Bean file you mentioned. That being said, the existing API definitely has a learning curve, which was part of the motivation for creating Leo.
Thanks, Thomas Ginter 801-448-7676 thomas.gin...@utah.edu
On Jul 16, 2015, at 1:51 PM, Richard Eckart de Castilho r...@apache.org wrote:
Hi Thomas,
On 16.07.2015, at 21:42, Thomas Ginter thomas.gin...@utah.edu wrote:
Have you looked into using Leo? It allows you to programmatically create Analysis Engines, Aggregates, the type system, and launch everything in UIMA-AS without having to manage any XML descriptors at all. Furthermore it is available via Maven so your code can compile and run.
Did you find an API in UIMA AS to handle the programmatic generation of descriptors, or did you implement that yourself in Leo (as I had tried to in DKPro Lab)? If I remember correctly, UIMA AS loaded plain XML descriptor files, transformed them to a Spring Bean file using XSLT, and then used Spring to instantiate it. But I may have missed something.
Cheers, -- Richard
Re: UIMAj3 ideas
Hi Petr, Have you looked into using Leo? It allows you to programmatically create Analysis Engines, Aggregates, the type system, and launch everything in UIMA-AS without having to manage any XML descriptors at all. Furthermore it is available via Maven so your code can compile an run. http://department-of-veterans-affairs.github.io/Leo/userguide.html The only catch to running UIMA-AS is making sure the broker is running. A manual step that we have not yet automated. Other than that it can scale most pipelines with the notable exception of pipelines that have really large resources. As for ideas for UIMA 3 I would love to see a much simpler CAS system that didn’t require a pre-definition of types before execution. Such as a very simple abstract base class that defines an “annotation” and is then extended in order to create/use a new type. It seems like the basic location based indexes could still be provided that way as well as the option of extending to provide custom indexes. If the CAS was implemented as a base set of very simple Java objects we would also have more serialization options. Possibly even making it possible for the user to plug in a different serializer if required such as protobuff. Just a thought. Thanks, Thomas Ginter 801-448-7676 thomas.gin...@utah.edu On Jul 16, 2015, at 10:25 AM, Petr Baudis pa...@ucw.cz wrote: Hi! On Fri, Jul 10, 2015 at 10:28:08AM -0400, Eddie Epstein wrote: Good comments which will likely generate lots of responses. For now please see comments on scaleout below. On Thu, Jul 9, 2015 at 6:52 PM, Petr Baudis pa...@ucw.cz wrote: * UIMAfit is not part of core UIMA and UIMA-AS is not part of core UIMA. It seems to me that UIMA-AS is doing things a bit differently than what the original UIMA idea of doing scaleout was. The two things don't play well together. I'd love a way to easily take my plain UIMA pipeline and scale it out, ideally without any code changes, *and* avoid the terrible XML config files. 
Not clear what you are referring to as the original UIMA idea of doing scaleout, the CPE? Core UIMA is a single threaded, embeddable framework. UIMA-AS is also an embeddable framework that offers flexible vertical (multi-threading) and horizontal (multi-process) options for deploying an arbitrary pipeline. Admittedly scaleout with UIMA-AS is complicated and the minimal support for process management makes it difficult to do scaleout simply. In what ways do you think UIMA-AS is inconsistent with UIMA or UIMA scaleout?
Well, my impression after delving into some UIMA internals was that the original idea was to use the Analysis Structure Broker to control the pipeline flow and it would seem natural that when doing scale-out, one would simply provide a different ASB. Its javadoc even reads: The Analysis Structure Broker (ASB) is the component responsible for the details of communicating with Analysis Engines that may potentially be distributed across different physical machines. Of course, maybe I got it wrong.
DUCC is a full cluster management application that will scale out a plain UIMA pipeline with no code changes, assuming that the application code is threadsafe. But a typical pipeline with a single collection reader creating input CASes and a single CAS consumer will limit scaleout performance pretty quickly. DUCC makes it easy to eliminate the input data bottleneck. DUCC sample apps show one approach to eliminating the output bottleneck. Have you looked at DUCC?
I use a UIMA pipeline for question answering, where each question currently takes ~30s (single-threaded) to process (a lot of it spent waiting on databases), so I don't think I'd hit such a bottleneck. I did spend a few tens of minutes looking at DUCC, but I got the impression that it's not really trivial to set up. One of my goals is to minimize setup hassles for anyone who wants to run my software - ideally, they should be able to just compile and run. 
If I started to use DUCC, I'm not sure to what degree I could preserve this, but at least it's another element in the already steep learning curve for anyone who wants to tinker with the system. (Then there's this whole issue of UIMA-AS vs. UIMAfit and in-memory resource sharing - though from one of your previous emails, I got the impression that I could run multiple AEs in threads of a single java process; but I guess at that point I was already decided that I want to try something less complex.) -- Petr Baudis If you have good ideas, good data and fast computers, you can do almost anything. -- Geoffrey Hinton
Re: Generics in 2.8.0 getAllIndexedFS
So long as the Runtime error is meaningful and documented then I vote for option 3. T extends TOP still limits the user to the family of the UIMA universe so to speak without limiting them to an explicit FS inheritance which is a useful flexibility in spite of the risk of a casting error. Thanks, Thomas Ginter 801-448-7676 thomas.gin...@utah.edu On Jul 8, 2015, at 09:14, Marshall Schor m...@schor.com wrote: I agree that (3) is not safe. However it imposes a burden on the user (assuming they want to use some method that's in the type but not in TOP) to cast the result to the type. This cast could also throw a runtime error, of course. So, what I'm thinking is that there's no particular value in not allowing 3) - the user could cause a runtime error in either case; but not doing 3) would make UIMA get in the way of coders trying to get their work done :-) - for the case where they were doing proper type casting. On balance, it seems to me to be better to allow 3. re: using the older forms: yes, that's really not needed (except perhaps for edge cases), so could be deprecated. At this point, I'm not sure that's worth doing, though... Here's one edge case (these are hard to think of :-) ). The coder has a type hierarchy A - B . They define JCas class for A, but not for B. To get all instances of B, they would need the older format. -Marshall On 7/8/2015 9:44 AM, Richard Eckart de Castilho wrote: If type inferencing from the surrounding context wasn't done, and the user needed to cast the result, the user would be exposed to the same runtime error. So, unless there's some other pros/cons, it seems to me it would be best to allow generic type inferencing in cases where there's a type specified (by any means) in the getAllIndexedFS method call. I'd not say by any means. 
using JCas APIs:
1) FSIterator<TOP> getAllIndexedFS(aType);
2) <T extends TOP> FSIterator<T> getAllIndexedFS(Class<T> clazz)
3) <T extends TOP> FSIterator<T> getAllIndexedFS(aType)
I'd consider 1 and 2 to be safe and ok:
- 1 is guaranteed to return TOP or a subtype of it.
- 2 is guaranteed to return clazz or a subtype of it.
3 is not safe:
FSIterator<Token> i = getAllIndexedFS(Sentence.type)
This causes a runtime error. Question: except for history reasons, why do we need the aType signature in a JCas context at all? Couldn't it be deprecated in favor of the type-safe clazz variant?
-- Richard
On 08.07.2015, at 15:24, Marshall Schor m...@schor.com wrote:
More about the signatures and type inference. We have the following cases:
(maybe) not JCas, using CAS APIs: (maybe because a JCas user might get a CAS - not a JCas - in some routine)
(no arguments in getAllIndexedFS) FSIterator<...> getAllIndexedFS();
(type argument in getAllIndexedFS) FSIterator<...> getAllIndexedFS(aType);
using JCas APIs:
(no arguments in getAllIndexedFS) FSIterator<...> getAllIndexedFS();
(type argument in getAllIndexedFS) FSIterator<...> getAllIndexedFS(aType);
FSIterator<...> getAllIndexedFS(Class<Foo> clazz)
For the getAllIndexedFS() (no argument) kinds of calls, I think there's agreement to use the generic FeatureStructure for the CAS APIs, and TOP for the JCas APIs. When getAllIndexedFS is given a type argument, the method returns an iterator over that type and its subtypes. Here it seems best to use the JCas type corresponding to the type argument. This is easy to do in the last case, above. It can be allowed if the other calls use generic method forms and pick up the type from the surrounding context. The pro for doing this is that it makes UIMA more coder-friendly, by not requiring the coder to cast the result. The con for doing this is that it allows the coder to make a mistake (specifying the wrong type). This would only be caught at run time.
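The difference between options 2 and 3 can be illustrated with a minimal plain-Java sketch. It does not use the real UIMA classes: Top, Sentence, Token, and GenericsSketch are hypothetical stand-ins, and getAll / getAllUnchecked only mimic the shape of the signatures discussed in this thread.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Toy stand-ins for the type hierarchy; these are NOT the real UIMA classes.
class Top {}
class Sentence extends Top {}
class Token extends Top {}

class GenericsSketch {
    private final List<Top> indexed = new ArrayList<>();

    void add(Top fs) { indexed.add(fs); }

    // Shape of option 2: the Class token ties T to the requested type,
    // so the compiler can check the declared result type.
    <T extends Top> Iterator<T> getAll(Class<T> clazz) {
        List<T> out = new ArrayList<>();
        for (Top fs : indexed) {
            if (clazz.isInstance(fs)) out.add(clazz.cast(fs));
        }
        return out.iterator();
    }

    // Shape of option 3: T is inferred from the call site only, so nothing
    // connects it to the runtime type argument.
    @SuppressWarnings("unchecked")
    <T extends Top> Iterator<T> getAllUnchecked(Class<? extends Top> runtimeType) {
        List<Top> out = new ArrayList<>();
        for (Top fs : indexed) {
            if (runtimeType.isInstance(fs)) out.add(fs);
        }
        return (Iterator<T>) out.iterator();
    }

    // Returns true if a mismatched call compiles but fails when an element is used.
    static boolean mismatchFailsAtRuntime() {
        GenericsSketch index = new GenericsSketch();
        index.add(new Sentence());
        Iterator<Token> it = index.getAllUnchecked(Sentence.class); // compiles fine
        try {
            Token t = it.next(); // hidden cast to Token fails here
            return t == null;    // not reached
        } catch (ClassCastException e) {
            return true;
        }
    }
}
```

In this sketch the mismatch in the option 3 shape surfaces only when an element is read, which matches the runtime error described above; the option 2 shape rejects the same mistake at compile time.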
Re: UIMAFit and UIMA-AS deployment
There is also Leo, which allows you to programmatically create pipelines, launch them as UIMA-AS services, and manage type systems and clients without having to touch any descriptor files. You can find documentation at the site below: http://department-of-veterans-affairs.github.io/Leo/userguide.html
Thanks, Thomas Ginter 801-448-7676 thomas.gin...@utah.edu
On Apr 30, 2015, at 11:33, Richard Eckart de Castilho r...@apache.org wrote:
Hi, I have tried once to use UIMA-AS and I did it in conjunction with uimaFIT. At the time, I didn't find any API to programmatically create UIMA-AS deployment descriptors. It appeared to me as if UIMA-AS extracted information directly from the XML stream - but then I maybe didn't dig deep enough. Anyway, I created a class to programmatically build a subset of the UIMA-AS deployment descriptor. You find this class and some code using it here: https://code.google.com/p/dkpro-lab/source/browse/dkpro-lab-uima-engine-uimaas/src/main/java/de/tudarmstadt/ukp/dkpro/lab/uima/engine/uimaas/AsDeploymentDescription.java Maybe it helps you.
Cheers, -- Richard
On 30.04.2015, at 17:40, Sylvain Surcin sur...@kwaga.com wrote:
Hello, I'm trying to see if I can adapt our UIMA-AS architecture to UIMAFit. And I'm wondering how to actually do it from the main level where I have a class
    UimaAsynchronousEngine myEngine = new BaseUIMAAsynchronousEngine_impl();
    myEngine.addStatusCallbackListener(myListener);
    myEngine.deploy(myAsDeploymentDescriptorFile, applicationContext);
The AS deployment descriptor file has a section
    <topDescriptor> <import location="./MyAggregateChain.xml"/> </topDescriptor>
Now, if I want to be smart and use UIMAFit's AggregateBuilder, how do I reconcile that with the deployment descriptor file? Is there a way to do that entirely from within the Java code? Or do I have to use UIMAFit to generate the aggregate descriptor file from the AnalysisEngine built by the AggregateBuilder? 
Thanks for your help, Sylvain SURCIN, Ph.D. *KWAGA* Senior Software Architect 15, rue Jean-Baptiste Berlier 75013 Paris France Tél.: +33 (0)1.55.43.79.20
Re: Read file name in an annotator
Hi Debbie,
The file name is not provided by default in UIMA, although I believe the UIMA FileReader does populate a SourceDocumentInformation annotation with this information. Our group has a set of readers that populate our own annotation type to provide location data and other meta-information for each record (CAS) being processed. In short, you will be better off writing your own reader to provide that information for you.
Thanks, Thomas Ginter 801-448-7676 thomas.gin...@utah.edu
On Jul 9, 2014, at 5:41, Debbie Zhang debbie.d.zh...@gmail.com wrote:
Hi, Can anyone tell me how to read the file name in an annotator using the JCas? It seems the DocumentAnnotation doesn't contain the file name. Thank you! Best regards, Debbie Zhang
Re: FilteredIterator is very slow
Larry,
A faster way to get the list of types that you will skip would be to do the following:
    FSIndex<TitlePersonHonorificAnnotation> titlePersonHAIndex = aJCas.getAnnotationIndex(TitlePersonHonorificAnnotation.type);
Doing this for each type will yield an index that points to just the annotations in the CAS of each type you are interested in. From there you can get an iterator reference (titlePersonHAIndex.iterator()) and either traverse each one separately or else add them to a common Collection such as an ArrayList and iterate through that. You could also take advantage of the fact that the default index in UIMA sorts in ascending order on the begin offset and in descending order on the end offset, so you can stop once you have traversed the list past the end offset of the dictTerm.
An important design decision, though, is whether the dictTerm annotations are much more numerous than the TitlePersonHonorificAnnotation, MeasurementAnnotation, and ProgFactorTerm filtering annotation types. Generally, if the filter types are much more plentiful and the dictTerm type is rarer, then looking for overlapping filter types for each dictTerm will yield fewer iterations of your algorithm. However, if there are a lot of dictTerm occurrences and only a few of the filter types, then it may be more efficient to iterate through the filter types and eliminate the dictTerms that overlap or are covered.
Thanks, Thomas Ginter 801-448-7676 thomas.gin...@utah.edu
On Mar 31, 2014, at 11:47 AM, Kline, Larry larry.kl...@mckesson.com wrote:
When I use a filtered FSIterator it's an order of magnitude slower than a non-filtered iterator. 
Here's my code:
Create the iterator:
    private FSIterator<Annotation> createConstrainedIterator(JCas aJCas) throws CASException {
        FSIterator<Annotation> it = aJCas.getAnnotationIndex().iterator();
        FSTypeConstraint constraint = aJCas.getConstraintFactory().createTypeConstraint();
        constraint.add((new TitlePersonHonorificAnnotation(aJCas)).getType());
        constraint.add((new MeasurementAnnotation(aJCas)).getType());
        constraint.add((new ProgFactorTerm(aJCas)).getType());
        it = aJCas.createFilteredIterator(it, constraint);
        return it;
    }
Use the iterator:
    public void process(JCas aJCas) throws AnalysisEngineProcessException {
        ...
        // The following is done in a loop
        if (shouldSkip(dictTerm, skipIter)) continue;
        ...
    }
Here's the method called:
    private boolean shouldSkip(G2DictTerm dictTerm, FSIterator<Annotation> skipIter) throws CASException {
        boolean shouldSkip = false;
        skipIter.moveToFirst();
        while (skipIter.hasNext()) {
            Annotation annotation = skipIter.next();
            if (UIMAUtils.annotationsOverlap(dictTerm, annotation)) {
                shouldSkip = true;
                break;
            }
        }
        return shouldSkip;
    }
If I change the method, createConstrainedIterator(), to this (that is, no constraints):
    private FSIterator<Annotation> createConstrainedIterator(JCas aJCas) throws CASException {
        FSIterator<Annotation> it = aJCas.getAnnotationIndex().iterator();
        return it;
    }
It runs literally 10 times faster. Doing some profiling I see that all of the time is spent in the skipIter.moveToFirst() call. I also tried creating the filtered iterator each time anew in the shouldSkip() method instead of passing it in, but that has even slightly worse performance. Given this performance I suppose I should probably use a non-filtered iterator and just check for the types I'm interested in inside the loop. Any other suggestions welcome.
Thanks, Larry Kline
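The early-termination idea from the reply in this thread (stop scanning once the sorted annotations start past the end of the dictTerm) can be sketched in plain Java. This is only an illustration under stated assumptions: Span and OverlapSketch are hypothetical stand-ins for annotations and the skip check, spans are treated as half-open [begin, end) ranges, and the filter annotations are assumed to have been collected into one list sorted in the index order described above.

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for an annotation: a half-open [begin, end) span.
final class Span {
    final int begin;
    final int end;
    Span(int begin, int end) { this.begin = begin; this.end = end; }
}

class OverlapSketch {
    // The ordering the thread attributes to the default annotation index:
    // begin ascending, then end descending.
    static final Comparator<Span> INDEX_ORDER = (a, b) ->
        a.begin != b.begin ? Integer.compare(a.begin, b.begin)
                           : Integer.compare(b.end, a.end);

    // sortedFilters must already be sorted by INDEX_ORDER.
    static boolean overlapsAny(Span target, List<Span> sortedFilters) {
        for (Span f : sortedFilters) {
            if (f.begin >= target.end) break;       // all later spans start after the target
            if (f.end > target.begin) return true;  // the two spans intersect
        }
        return false;
    }
}
```

The break is what saves work: because the list is ordered by begin offset, no span after the first one that starts at or beyond the target's end can overlap it.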
Re: uima jcas get annotation type from string
Once you have the Type object you can get an index to all the annotations of that type in the CAS using:
    AnnotationIndex<Annotation> mySentenceIndex = jcas.getAnnotationIndex(mySentenceTypeObj);
Then you can get an iterator over the index using:
    FSIterator<Annotation> mySentenceIterator = mySentenceIndex.iterator();
or you could just use the for-each loop syntax in Java, such as:
    for (Annotation sentence : mySentenceIndex) { /** Do something cool **/ }
The AnnotationLibrarian class in the Leo framework provides some pretty convenient methods for this as well, such as:
    Collection<Sentence> sentenceList = AnnotationLibrarian.getAllAnnotationsOfType(jcas, mySentenceTypeObj);
which returns a collection of Sentence annotations. You can find more information about the Leo framework at the following URL: http://decipher.chpc.utah.edu/sites/gov.va.vinci/leo/2014.01.8/
Thanks, Thomas Ginter 801-448-7676 thomas.gin...@utah.edu
On Feb 14, 2014, at 2:50 AM, Richard Eckart de Castilho r...@apache.org wrote:
On 14.02.2014, at 09:50, hannes schantl johannes.scha...@gmail.com wrote:
thanks for the answers. Is there also a way to get a Type from a String, which can be used as argument for the JCasUtil.select method?
The JCasUtil methods assume that you have access to JCas classes, e.g.
    import mypackage.AnnotationType;
    JCasUtil.select(jcas, AnnotationType.class)
If you want to select based on names/types, not on JCas-classes, you could consider using the CasUtil methods:
    CAS cas = jcas.getCas(); // Or inherit from CasAnnotator_ImplBase
    Type annotationType = CasUtil.getType(cas, "mypackage.AnnotationType");
    CasUtil.select(cas, annotationType);
Of course, you could also use reflection to get the class for your annotation type and pass it to JCasUtil - but that would be redundant and would require handling various exceptions:
    JCasUtil.select(jcas, Class.forName("mypackage.AnnotationType"))
I want to use the type object to get all Annotations of type Sentence from the Cas. 
And further extract all Annotations within this sentence. There for sure other ways to solve this issue without using JCasUtil, but it seems JCasUtil provide an easy way to do this by using the methods JCasUtil.select and JCasUtil.selectCovered. CasUtil largely mirrors the functionality of JCasUtil. In fact, JCasUtil calls out to CasUtil for most of the grunt work. Cheers, -- Richard greetings Hannes Am 13.02.2014 22:11, schrieb Thomas Ginter: There are a couple of different ways to get a pointer to specific Type object. jcas.getRequiredType(mypackage.AnnotationType); (cas|jcas).getTypeSystem.getType(mypackage.AnnotationType); The question is what do you want to do with the Type object once you have it. Thanks, Thomas ginter801-448-7676thomas.gin...@utah.edu On Feb 13, 2014, at 6:03 AM, hannes schantl johannes.scha...@gmail.com johannes.scha...@gmail.com wrote: Hi, Is there a way to get an annotation Type from the cas(or Jcas) from a string. For example, i am looking for something like that: jcas.getCasType(AnnotationName) greetings Hannes
Re: uima jcas get annotation type from string
There are a couple of different ways to get a reference to a specific Type object:
    jcas.getRequiredType("mypackage.AnnotationType");
    (cas|jcas).getTypeSystem().getType("mypackage.AnnotationType");
The question is what do you want to do with the Type object once you have it.
Thanks, Thomas Ginter 801-448-7676 thomas.gin...@utah.edu
On Feb 13, 2014, at 6:03 AM, hannes schantl johannes.scha...@gmail.com wrote:
Hi, Is there a way to get an annotation Type from the CAS (or JCas) from a string? For example, I am looking for something like this: jcas.getCasType(AnnotationName)
greetings Hannes
Re: uima-as 2.3.1 - java.io.IOException: Frame size of 147 MB larger than max allowed 100 MB
1. Your annotators can remove as well as add annotations. Perhaps if there is a large number of annotations that you don’t really need you could have a clean up annotator that removes the extra stuff, or else just don’t generate it in the first place, whatever works best for your algorithm. 2. Remote services in your pipeline are serialized the same way as the serialization with the client. In fact the framework essentially creates a client interface for sending and receiving CAS objects and then passing them to/from your pipeline. It is likely then that your expansion is happening after the remote service is called or else is not yet big enough to be over the 100MB limit. Thanks, Thomas Ginter 801-448-7676 thomas.gin...@utah.edu On Jan 23, 2014, at 12:53 AM, Mihaela M mmihaela1...@yahoo.com wrote: 1. I will upgrade uima-as and review the annotations gathered in the CAS, but is it a way to have the CAS reset before sending it to the client? In my case I only want to get the status of the processing, not all the annotations found, because they were handled by the consumers configured in the pipeline anyway. 2. Do you know whether the aggregates communicate with the clients the same as with the remote CAS consumers? I wonder why it did not complain while sending the exploded CAS to the remote consumer, but it did when communicating with the client. Thank you! Mihaela On Wednesday, January 22, 2014 7:07 PM, Thomas Ginter thomas.gin...@utah.edu wrote: Mihaela, There are two things that you should probably do in order to get started with these issues. 1. Upgrade to UIMA-AS 2.4.2 which uses a newer version of ActiveMQ and contains numerous bug fixes for UIMA-AS related to how the JMS queues are handled. 2. The UIMA-AS framework adds very little as far as overhead space for the CAS objects which means the vast majority of the size expansion from 48KB to 147MB is coming from annotations/metadata being added by your service. 
Increasing the frame size in ActiveMQ may allow your CAS objects to be transferred in JMS but it is more important to find out what is causing this dramatic expansion and whether or not the service can be written differently so that the expansion is much smaller. Thanks, Thomas Ginter 801-448-7676 thomas.gin...@utah.edu On Jan 22, 2014, at 9:44 AM, Mihaela M mmihaela1...@yahoo.com wrote: Hello, I have a uima pipeline that uses uima-as 2.3.1 which has one aggregator with one local annotator, one remote consumer and one remote annotator. It actually has more components but I will get into exactly the configuration only if needed. I have developed also a UIMA client for it using class: UimaAsynchronousEngine, method sendCas (async as far I understood) and a callback listener that waits for the processing to complete. 1. I have noticed that the CAS returned, in general is quite big. Is it a way to send, at least to the client, a CAS that does not contain all the types that the various annotators added? When could I remove those things from the CAS? 2. I send a text message for processing which has 48 KB - it gets processed successfully by the pipeline, but the pipeline fails to send a reply to the client. 
The exception that I get is: 01/21/2014 07:36:02.978 [ActiveMQ Transport: tcp://localhost/127.0.0.1:61616] [DEBUG] org.apache.activemq.ActiveMQConnection - Async exception with no exception listener: java.io.IOException: Frame size of 147 MB larger than max allowed 100 MB java.io.IOException: Frame size of 147 MB larger than max allowed 100 MB at org.apache.activemq.openwire.OpenWireFormat.unmarshal(OpenWireFormat.java:277) ~[activemq-core-5.6.0.jar:5.6.0] at org.apache.activemq.transport.tcp.TcpTransport.readCommand(TcpTransport.java:229) ~[activemq-core-5.6.0.jar:5.6.0] at org.apache.activemq.transport.tcp.TcpTransport.doRun(TcpTransport.java:221) ~[activemq-core-5.6.0.jar:5.6.0] at org.apache.activemq.transport.tcp.TcpTransport.run(TcpTransport.java:204) ~[activemq-core-5.6.0.jar:5.6.0] at java.lang.Thread.run(Thread.java:662) [na:1.6.0_30] 01/21/2014 07:36:03.093 [ActiveMQ Connection Executor: tcp://localhost/127.0.0.1:61616] [DEBUG] org.apache.activemq.transport.tcp.TcpTransport - Stopping transport tcp://localhost/127.0.0.1:61616 As far as I understood, the client connects via JMS to the uima pipeline and a temporary reply queue gets created where the reply from the pipeline should be sent and then consumed by the client. After the above exception is thrown, the connection to the pipeline gets closed and automatically the temp queue gets deleted hence the client does not receive anymore the reply. I am wondering why the error I was mentioning is not thrown while the aggregator sends the CAS to the consumer, because the consumer
Re: how to dynamically set a required annotation type from within a UIMAfit annotator?
Renaud,
We (the clinical NLP group at the University of Utah) have written a platform that sits on top of UIMA-AS and allows you to dynamically assign and even generate types for annotation engines. We have a whole family of annotators whose parameters are dynamic using this platform. We are almost ready to release this as open source, though it is still probably another month or two out. Until that time we are open to collaboration opportunities wherein we give you access to the software and teach you how it is used.
Thanks, Thomas Ginter 801-448-7676 thomas.gin...@utah.edu
On Dec 5, 2013, at 3:43 AM, Richard Eckart de Castilho r...@apache.org wrote:
To my knowledge, the capabilities are part of the descriptor which must be available before the AE is initialized. You cannot retroactively change the descriptor of a method from within its initialize() method. It would be nice to have something like this, though. But that would also mean switching any flow controllers which use this information from a static planning to a dynamic planning approach. How about filing a feature request against the UIMA framework?
-- Richard
On 05.12.2013, at 08:35, Renaud Richardet renaud.richar...@gmail.com wrote:
I find it very convenient to add
    @TypeCapability(inputs = { TOKEN, SENTENCE, COOCCURRENCE })
so that I can ensure that dependencies are met. But sometimes, the dependencies are dynamic (e.g. an input type capability is part of the config of an annotator, and is loaded dynamically, see code below). Is there a way to dynamically set a required annotation type from within a UIMAfit annotator? Something like:
    @Override
    public void initialize(UimaContext context) throws ResourceInitializationException {
        super.initialize(context);
        try {
            // loading annotation class dynamically
            requiredAnnotation = (Class<? extends Annotation>) Class.forName("org.uima.MyRequiredAnnotation");
            // adding it as TypeCapability's input
            context.getMetadata().addCapabilityInput(requiredAnnotation);
        } catch (Exception e) {
            throw new ResourceInitializationException(e);
        }
    }
Thanks, Renaud
Re: Working with very large text documents
Armin, It would probably be more efficient to have a CollectionReader that splits the log file so your not passing a gigantic file in RAM from the reader to the annotators before splitting it. If it were me I would split the log file by days or hours with a max size that auto segments lines. If your using UIMA-AS you can further scale your processing pipeline to increase throughput way beyond what CPE can provide. Also with UIMA-AS it is easy to create a listener that gathers the aggregate processed data from the segments that are returned. Thanks, Thomas Ginter 801-448-7676 thomas.gin...@utah.edu On Oct 18, 2013, at 7:58 AM, armin.weg...@bka.bund.de wrote: Dear Jens, dear Richard, Looks like I have to use a log file specific pipeline. The problem was that I did not knew it before the process crashed. It would be so nice having a general approach. Thanks, Armin -Ursprüngliche Nachricht- Von: Richard Eckart de Castilho [mailto:r...@apache.org] Gesendet: Freitag, 18. Oktober 2013 12:32 An: user@uima.apache.org Betreff: Re: Working with very large text documents Hi Armin, that's a good point. It's also an issue with UIMA then, because the begin/end offsets are likewise int values. If it is a log file, couldn't you split it into sections of e.g. one CAS per day and analyze each one. If there are long-distance relations that span days, you could add a second pass which reads in all analyzed cases for a rolling window of e.g. 7 days and tries to find the long distance relations in that window. -- Richard On 18.10.2013, at 10:48, armin.weg...@bka.bund.de wrote: Hi Richard, As far as I know, Java strings can not be longer than 2 GB on 64bit VMs. Armin -Ursprüngliche Nachricht- Von: Richard Eckart de Castilho [mailto:r...@apache.org] Gesendet: Freitag, 18. 
Oktober 2013 10:43 An: user@uima.apache.org Betreff: Re: Working with very large text documents On 18.10.2013, at 10:06, armin.weg...@bka.bund.de wrote: Hi, What are you doing with very large text documents in an UIMA Pipeline, for example 9 GB in size. In that order of magnitude, I'd probably try to get a computer with more memory ;) A. I expect that you split the large file before putting it into the pipeline. Or do you use a multiplier in the pipeline to split it? Anyway, where do you split the input file? You can not just split it anywhere. There is a not so slight possibility to break the content. Is there a preferred chunk size for UIMA? The chunk size would likely not depend on UIMA, but rather on the machine you are using. If you cannot split the data in defined locations, maybe you can use a windowing approach where two splits have a certain overlap? B. Another possibility might be not to save the data in the CAS at all and use an URI reference instead. It's up to the analysis engine then how to load the data. My first idea was to use java.util.Scanner for regular expressions for examples. But I think that you need to have the whole text loaded to iterator over annotations. Or is just AnnotationFS.getCoveredText() not working. Any suggestions here? No idea unfortunately, never used the stream so far. -- Richard
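The suggestion in this thread to split the log at line boundaries with a maximum segment size could look roughly like the following plain-Java sketch. It is not a UIMA CollectionReader: LogChunker and chunk are hypothetical names, the size limit is counted in characters, and the chunks are collected in a list only to keep the example short, whereas a real reader would presumably hand each chunk to a new CAS as soon as it is complete.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class LogChunker {
    // Splits the input into chunks of at most maxChars characters, always
    // breaking between lines. A single line longer than maxChars becomes
    // its own oversized chunk rather than being cut in the middle.
    static List<String> chunk(Reader source, int maxChars) {
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                // Close the current chunk if this line would push it past the limit.
                if (current.length() > 0
                        && current.length() + line.length() + 1 > maxChars) {
                    chunks.add(current.toString());
                    current.setLength(0);
                }
                current.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (current.length() > 0) {
            chunks.add(current.toString());
        }
        return chunks;
    }
}
```

Splitting by day or hour, as proposed above, would replace the size check with a check on the timestamp at the start of each line; the line-by-line streaming structure stays the same, so the whole file never has to be held in memory at once.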
Re: HashMap as type feature
Armin,

Yes. Extracting the key set results in an array wherein the n-th element of the key array corresponds to the n-th element of the values array, provided the map is not modified between the two calls. That is part of how the hash map is handled in Java. Even if you implemented your own sorting algorithm for insertion, the value would get inserted with the key, and the corresponding key and values arrays would still match. The only caveat would be if you decided to manipulate the keys array independently after getting it from the HashMap.

Thanks, Thomas Ginter 801-448-7676 thomas.gin...@utah.edu

On Oct 17, 2013, at 8:43 AM, armin.weg...@bka.bund.de wrote:

Hi Thomas, thanks for your answer. Using HashMap, does the n-th element of keySet() always correspond to the n-th element of values()? Is this a defined behavior in Java? Cheers, Armin

-Original Message- From: Thomas Ginter [mailto:thomas.gin...@utah.edu] Sent: Wednesday, October 16, 2013 18:53 To: user@uima.apache.org Subject: Re: HashMap as type feature

Armin,

Our team does this with an annotation type designed to store feature vectors for Machine Learning applications. In this case we use a StringArray feature for the keys and a StringArray feature for the values. The StringArrays are pulled from a HashMap<String, String> variable named vector and inserted into the features with the following code:

    int size = vector.size();
    StringArray keys = new StringArray(jcas, size);
    StringArray values = new StringArray(jcas, size);
    keys.copyFromArray(vector.keySet().toArray(new String[size]), 0, 0, size);
    values.copyFromArray(vector.values().toArray(new String[size]), 0, 0, size);

Retrieving the values is fairly straightforward. If you are using a static annotation type it can be as simple as:

    StringArray keys = vector.getKeysArray();

If you parameterize the annotation type in the annotator, you can use the name of the feature to get a Feature object reference and then pull the StringArrays like so:

    // parameter is the canonical name of the Annotation type
    Type annotationTypeObj = aJCas.getRequiredType("com.my.Annotation");
    // the actual name of the feature storing the keys StringArray reference
    Feature keysFeature = annotationTypeObj.getFeatureByBaseName(keysFeatureName);
    // the name of the values feature
    Feature valuesFeature = annotationTypeObj.getFeatureByBaseName(valuesFeatureName);

    // Get a list of the annotation objects in the CAS, then iterate through the list;
    // for each annotation 'a' do the following to retrieve the keys and values
    StringArray keys = (StringArray) a.getFeatureValue(keysFeature);
    StringArray values = (StringArray) a.getFeatureValue(valuesFeature);

If necessary you can retrieve a String[] from the StringArray FeatureStructure by calling the .toArray() method, such as:

    String[] keysArray = keys.toArray();

Let me know if you have any questions.

Thanks, Thomas Ginter 801-448-7676 thomas.gin...@utah.edu

On Oct 16, 2013, at 9:55 AM, Dr. Armin Wegner arminweg...@googlemail.com wrote:

Hi, I'd like to have a type feature that is a list of key-value pairs. The number of pairs is unknown. What's best for this? Is it even possible? Thanks, Armin
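The approach in this thread relies on keySet() and values() iterating in the same order. A sketch that does not need that assumption fills both arrays in a single pass over entrySet(), so each key and its value always land at the same index. The class name ParallelArrays is hypothetical and uses plain Java only; the resulting String[] arrays could then be handed to StringArray.copyFromArray() as in the message above.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper (not part of UIMA): flattens a map into two parallel arrays. */
final class ParallelArrays {
    final String[] keys;
    final String[] values;

    ParallelArrays(Map<String, String> vector) {
        keys = new String[vector.size()];
        values = new String[vector.size()];
        int i = 0;
        // One pass over the entries: key and value are read together,
        // so index i always refers to the same pair in both arrays.
        for (Map.Entry<String, String> entry : vector.entrySet()) {
            keys[i] = entry.getKey();
            values[i] = entry.getValue();
            i++;
        }
    }

    /** Rebuilds a map from the two arrays, e.g. after reading them back from a CAS. */
    Map<String, String> toMap() {
        Map<String, String> result = new HashMap<>();
        for (int i = 0; i < keys.length; i++) {
            result.put(keys[i], values[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> vector = new HashMap<>();
        vector.put("token", "fever");
        vector.put("pos", "NN");
        ParallelArrays arrays = new ParallelArrays(vector);
        for (int i = 0; i < arrays.keys.length; i++) {
            System.out.println(arrays.keys[i] + " = " + arrays.values[i]);
        }
    }
}
```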
Re: HashMap as type feature
Armin,

Our team does this with an annotation type designed to store feature vectors for Machine Learning applications. In this case we use a StringArray feature for the keys and a StringArray feature for the values. The StringArrays are pulled from a HashMap<String, String> variable named vector and inserted into the features with the following code:

    int size = vector.size();
    StringArray keys = new StringArray(jcas, size);
    StringArray values = new StringArray(jcas, size);
    keys.copyFromArray(vector.keySet().toArray(new String[size]), 0, 0, size);
    values.copyFromArray(vector.values().toArray(new String[size]), 0, 0, size);

Retrieving the values is fairly straightforward. If you are using a static annotation type it can be as simple as:

    StringArray keys = vector.getKeysArray();

If you parameterize the annotation type in the annotator, you can use the name of the feature to get a Feature object reference and then pull the StringArrays like so:

    // parameter is the canonical name of the Annotation type
    Type annotationTypeObj = aJCas.getRequiredType("com.my.Annotation");
    // the actual name of the feature storing the keys StringArray reference
    Feature keysFeature = annotationTypeObj.getFeatureByBaseName(keysFeatureName);
    // the name of the values feature
    Feature valuesFeature = annotationTypeObj.getFeatureByBaseName(valuesFeatureName);

    // Get a list of the annotation objects in the CAS, then iterate through the list;
    // for each annotation 'a' do the following to retrieve the keys and values
    StringArray keys = (StringArray) a.getFeatureValue(keysFeature);
    StringArray values = (StringArray) a.getFeatureValue(valuesFeature);

If necessary you can retrieve a String[] from the StringArray FeatureStructure by calling the .toArray() method, such as:

    String[] keysArray = keys.toArray();

Let me know if you have any questions.

Thanks, Thomas Ginter 801-448-7676 thomas.gin...@utah.edu

On Oct 16, 2013, at 9:55 AM, Dr. Armin Wegner arminweg...@googlemail.com wrote:

Hi, I'd like to have a type feature that is a list of key-value pairs. The number of pairs is unknown. What's best for this? Is it even possible? Thanks, Armin
Re: SimpleServer, instantiating CAS with custom typesystem?
Helen, You might also consider using UIMA-AS instead. UIMA-AS allows you to deploy a service (your AAE) that can be remotely accessed by UIMA-AS clients on other machines or in other JVMs for scalable deployments. Each client provides a CollectionReader to supply documents to the service and a Listener to catch return events from the service to know when processing is complete. You can find some additional getting started information about UIMA-AS at the following: http://uima.apache.org/doc-uimaas-what.html Thanks, Thomas Ginter 801-448-7676 thomas.gin...@utah.edu On Feb 19, 2013, at 7:04 AM, Helen Johnson -X (heljohns - Infobahn Softworld Inc at Cisco) heljo...@cisco.com wrote: Thanks for your reply, Jens. I admit I had been avoiding setting the text of the CAS to be the entire XML string I get back from the first REST service because it is a massive string and I only want a couple nodes from that xml string to be processed throughout the UIMA pipeline. But I see your point. So then, in this new AE, I retrieve the entire XML string from the CAS, do the zone-information processing from the specific nodes of the XML. I assume it is straightforward to then reset the CAS text to be just this text I have found in the original XML. Specifically, I would use CAS.reset() to empty the CAS of the original (full XML) text, then jCAS.setDocumentText() with the new string of just the relevant text, as well as load all the doc-zone annotations at this point. Is this right? Cheers, Helen -Original Message- From: Jens Grivolla [mailto:j+...@grivolla.net] Sent: Tuesday, February 19, 2013 3:20 AM To: user@uima.apache.org Subject: Re: SimpleServer, instantiating CAS with custom typesystem? Hi, SimpleServer itself is in a way your CR, creating a CAS with the document text you sent. Why do you want to change SimpleServer, it seems that you only want to add annotations to the CAS, not fundamentally change how the CAS is created. 
It seems to me that it would be far easier to just create an AE that adds those annotations. Then you won't have any typesystem issues either, since the AE would have the appropriate typesystem. HTH, Jens

On 02/18/2013 10:37 PM, Helen Johnson -X (heljohns - Infobahn Softworld Inc at Cisco) wrote:

I'm stumped: I have a UIMA pipeline that starts with a CollectionReader that
- reads XML input (response from a REST service),
- identifies a couple of relevant XML nodes,
- makes document-level annotations from the relevant nodes (title, document body, footnote section).

From there, the AnalysisEngine portion of the pipeline has many AEs that I've wrapped into a single AggregateAnalysisEngine. The CollectionReader and the AAE all work correctly in this pipeline. Now I need to transfer this pipeline into a SimpleServer REST service environment. I've created a PEAR of the AAE portion of the pipeline, but I can't include the CollectionReader in this PEAR.

First question: It is my understanding that the CR cannot be included in the PEAR for the SimpleServer. Am I correct in this?

In order to get those document-zoning annotations of title, body, and footnote, I have added some methods to the Service.java class in the SimpleServer package that do the XML parsing and then add these annotations to the JCas before the AAE is called. The error that is being thrown at this point is this:

The server encountered an internal error (JCas type myPackage.DocClass.ArticleMainTitle used in Java code, but was not declared in the XML type descriptor.) that prevented it from fulfilling this request.

Second question: Where is Service.java looking for the typesystem XML file? I have tried all of the following, with the same error result:
- put the typesystem descriptor file, myTSD.xml, in SimpleServer/lib
- create a jar containing myTSD.xml, put it into SimpleServer/lib and add that to the build path
- (after the two above attempts), in SimpleServer project properties, add lib to the UIMA CDE Property Page
- in SimpleServer project properties, in UIMA Type System, point to the myTSD.xml file in lib
- put myTSD.xml in SimpleServer/WebContent/WEB-INF/lib
- put the jar containing myTSD.xml in SimpleServer/WebContent/WEB-INF/lib
- put myTSD.xml in SimpleServer/WebContent/WEB-INF/resources

Final question: When a CAS gets instantiated (or reset, as it does in Service.java), how can I tell it to use a custom typesystem, and where will it look for that typesystem.xml file within the SimpleServer project?

Thank you, Helen Johnson
Re: CollectionProcessComplete Event thrown with Outstanding CAS Count
Thanks Jerry. BTW will we be seeing a UIMA-AS 2.4.0 sometime soon?

Thanks, Thomas Ginter 801-448-7676 thomas.gin...@utah.edu

On Jun 20, 2012, at 1:03 PM, Jaroslaw Cwiklik wrote:

I've checked the code and indeed this is a bug in the uima-as client when running with a CR. As soon as the CR returns false from hasNext(), the uima-as client process() method calls collectionProcessComplete(). The fix for this is to wait until all outstanding CASes are processed before calling collectionProcessComplete(). I will fix the trunk in a day or two.

To deal with this problem, you can run the CR outside of the uima-as client and call either the send() or sendAndReceive() method to process your CASes. Alternatively, if you want to patch 2.3.1, you can modify the process() method to:

    while (initialized && running) {
      try {
        if ((hasNext = collectionReader.hasNext()) == true) {
          cas = getCAS();
          collectionReader.getNext(cas);
          sendCAS(cas);
        } else {
          break;
        }
      } catch (Exception e) {
        e.printStackTrace();
      }
    }
    Object waitMonitor = new Object();
    if (hasNext == false) {
      while (running && clientCache.size() > 0) {
        try {
          // polling loop waiting for outstanding CASes to come back from the service
          synchronized (waitMonitor) {
            waitMonitor.wait(100);
          }
        } catch (Exception exx) {
        }
      }
      collectionProcessingComplete();
    }

Jerry

On Thu, Jun 14, 2012 at 9:29 PM, Thomas Ginter thomas.gin...@utah.edu wrote:

My UIMA-AS 2.3.1 service is returning the CollectionProcessComplete event while there are still CAS objects outstanding. The client log shows:

INFO: Client in CollecitonProcessComplete - OutstandingCasCount=2 TotalCasRequestsSentBetweenCpCs=

I always seem to end up losing 2 CAS objects because the UimaAsynchronousEngine object stops blocking the process() method when the CollectionProcessComplete event is returned. My program then calls the stop() method, assuming the entire collection is finished processing. This is a problem because the stop() method appears to be disconnecting from the service before the listener can process the last two CAS objects. Is there a setting I am missing to give the client more time to handle entityProcessComplete events? What I have found in the documentation so far refers to input queues for remote delegates only.

Thanks, Thomas Ginter 801-448-7676 thomas.gin...@utah.edu
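For readers who cannot patch the client, the same idea (do not call stop() until every sent CAS has come back) can be approximated on the application side. The following is a minimal sketch in plain Java; the class OutstandingCasTracker and its method names are hypothetical and not part of the UIMA-AS API. The assumption is that casSent() is called just before each sendCAS(), casCompleted() is called from the listener's entityProcessComplete callback, and awaitNoneOutstanding() is called before stop().

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper (not a UIMA-AS class): counts CASes that are still in flight. */
final class OutstandingCasTracker {
    private int outstanding = 0;

    /** Call just before a CAS is sent to the service. */
    synchronized void casSent() {
        outstanding++;
    }

    /** Call from the listener when a reply for a CAS has been handled. */
    synchronized void casCompleted() {
        if (outstanding > 0) {
            outstanding--;
        }
        if (outstanding == 0) {
            notifyAll(); // wake up a thread waiting in awaitNoneOutstanding()
        }
    }

    /**
     * Blocks until no CAS is outstanding or the timeout expires.
     * Returns true if everything came back, false on timeout or interruption.
     */
    synchronized boolean awaitNoneOutstanding(long timeoutMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (outstanding > 0) {
            long remainingMillis = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
            if (remainingMillis <= 0) {
                return false;
            }
            try {
                wait(remainingMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        OutstandingCasTracker tracker = new OutstandingCasTracker();
        tracker.casSent();
        tracker.casSent();
        // Simulate the listener thread receiving both replies a little later.
        new Thread(() -> {
            tracker.casCompleted();
            tracker.casCompleted();
        }).start();
        System.out.println("all returned: " + tracker.awaitNoneOutstanding(5000));
    }
}
```

A timeout is used rather than an unbounded wait so that a lost reply cannot hang the client indefinitely.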
Re: Exception thrown during CAS serialization for Remote UIMA-AS Service
Jörn,

Thanks for the link to that section of documentation. The mention of the XMLUtils class was just what I needed. I wrote an XmlFilter class that uses XMLUtils to detect invalid XML characters and replace them with spaces so that our annotation offsets will still match the original text. I was thinking about the issue all wrong: I was assuming that all ASCII-8 characters are also valid XML 1.0 characters.

Thanks, Thomas Ginter 801-448-7676 thomas.gin...@utah.edu

On Jun 14, 2012, at 3:52 PM, Jörn Kottmann wrote:

You write a string to the CAS which contains a non-XML character. This character cannot be serialized into XMI, and that's what this exception is about. Have a look at our documentation explaining the issue: http://uima.apache.org/d/uimaj-2.4.0/tutorials_and_users_guides.html#ugr.tug.xmi_emf.xml_character_issues

Hope that helps, Jörn

On 06/14/2012 11:39 PM, Thomas Ginter wrote:

We are getting an odd error while trying to process large datasets using UIMA-AS 2.3.1. There is an exception thrown by the XmiCasSerializer in the Client when it is in the process of serializing a CAS to be sent to a remote service.
The exception is as follows:

org.apache.uima.resource.ResourceProcessException
    at org.apache.uima.adapter.jms.client.BaseUIMAAsynchronousEngineCommon_impl.sendCAS(BaseUIMAAsynchronousEngineCommon_impl.java:854)
    at org.apache.uima.adapter.jms.client.BaseUIMAAsynchronousEngineCommon_impl.sendCAS(BaseUIMAAsynchronousEngineCommon_impl.java:885)
    at org.apache.uima.adapter.jms.client.BaseUIMAAsynchronousEngineCommon_impl.process(BaseUIMAAsynchronousEngineCommon_impl.java:734)
    at gov.va.vinci.flap.Client.run(Client.java:181)
    at gov.va.vinci.density.DensityClient.main(DensityClient.java:137)
Caused by: org.xml.sax.SAXParseException: Trying to serialize non-XML 1.0 character: _, 0x1a
    at org.apache.uima.util.XMLSerializer$CharacterValidatingContentHandler.checkForInvalidXmlChars(XMLSerializer.java:254)
    at org.apache.uima.util.XMLSerializer$CharacterValidatingContentHandler.startElement(XMLSerializer.java:174)
    at org.apache.uima.cas.impl.XmiCasSerializer$XmiCasDocSerializer.startElement(XmiCasSerializer.java:1003)
    at org.apache.uima.cas.impl.XmiCasSerializer$XmiCasDocSerializer.encodeFS(XmiCasSerializer.java:755)
    at org.apache.uima.cas.impl.XmiCasSerializer$XmiCasDocSerializer.encodeIndexed(XmiCasSerializer.java:700)
    at org.apache.uima.cas.impl.XmiCasSerializer$XmiCasDocSerializer.serialize(XmiCasSerializer.java:268)
    at org.apache.uima.cas.impl.XmiCasSerializer$XmiCasDocSerializer.access$700(XmiCasSerializer.java:108)
    at org.apache.uima.cas.impl.XmiCasSerializer.serialize(XmiCasSerializer.java:1539)
    at org.apache.uima.aae.UimaSerializer.serializeCasToXmi(UimaSerializer.java:136)
    at org.apache.uima.adapter.jms.client.BaseUIMAAsynchronousEngineCommon_impl.serializeCAS(BaseUIMAAsynchronousEngineCommon_impl.java:260)
    at org.apache.uima.adapter.jms.client.BaseUIMAAsynchronousEngineCommon_impl.sendCAS(BaseUIMAAsynchronousEngineCommon_impl.java:779)
    ... 4 more

It happens at apparently random points when processing the corpus and is never actually thrown but is simply written to StdErr. Also, the serializer never seems to return, which means the UimaAsynchronousEngine.process() method never returns and the client simply hangs until it is manually terminated.

To resolve this issue I have implemented text filters for the incoming CAS data to prevent anything out of the ASCII-8 range. I have also tried switching the server and client to binary serialization strategies, but that causes the XmiCasSerializer in my UimaAsBaseListener object to return errors when attempting to serialize CAS objects received in the entityProcessingComplete event.

Any suggestions from the UIMA masters? How can I debug further so that I can find out A: where this illegal character is coming from, and B: how I can prevent it from happening?

Thanks, Thomas Ginter 801-448-7676 thomas.gin...@utah.edu
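The filter described in the reply at the top of this thread (replace each invalid XML character with a space so that annotation offsets still match the original text) could look roughly like the sketch below. It is an assumption-laden stand-in, not the XmlFilter class from the thread: instead of calling UIMA's XMLUtils it checks characters directly against the Char production of the XML 1.0 specification, and the class and method names are hypothetical.

```java
/** Hypothetical stand-in for the XmlFilter mentioned in the thread; plain Java only. */
final class XmlCharFilter {

    /** True if the code point is allowed by the XML 1.0 "Char" production. */
    static boolean isValidXmlChar(int codePoint) {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
                || (codePoint >= 0x20 && codePoint <= 0xD7FF)
                || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
                || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    /**
     * Returns a copy of the text in which every invalid character is replaced
     * by a space. The result has the same length as the input, so existing
     * begin/end offsets remain valid.
     */
    static String replaceInvalid(String text) {
        char[] out = text.toCharArray();
        int i = 0;
        while (i < out.length) {
            // Reads a full surrogate pair when present; a lone surrogate is
            // returned as-is and fails the validity check below.
            int codePoint = Character.codePointAt(out, i);
            int width = Character.charCount(codePoint);
            if (!isValidXmlChar(codePoint)) {
                for (int j = 0; j < width; j++) {
                    out[i + j] = ' ';
                }
            }
            i += width;
        }
        return new String(out);
    }

    public static void main(String[] args) {
        // 0x1A is the character reported in the stack trace above.
        String dirty = "blood" + (char) 0x1A + "pressure";
        String clean = replaceInvalid(dirty);
        System.out.println(clean + " (length unchanged: " + (clean.length() == dirty.length()) + ")");
    }
}
```

Replacing rather than deleting is the design choice that matters here: deleting a character would shift every later offset by one.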
CollectionProcessComplete Event thrown with Outstanding CAS Count
My UIMA-AS 2.3.1 service is returning the CollectionProcessComplete event while there are still CAS objects outstanding. The client log shows:

INFO: Client in CollecitonProcessComplete - OutstandingCasCount=2 TotalCasRequestsSentBetweenCpCs=

I always seem to end up losing 2 CAS objects because the UimaAsynchronousEngine object stops blocking the process() method when the CollectionProcessComplete event is returned. My program then calls the stop() method, assuming the entire collection is finished processing. This is a problem because the stop() method appears to be disconnecting from the service before the listener can process the last two CAS objects. Is there a setting I am missing to give the client more time to handle entityProcessComplete events? What I have found in the documentation so far refers to input queues for remote delegates only.

Thanks, Thomas Ginter 801-448-7676 thomas.gin...@utah.edu
Re: Maven UIMA and import by name
We use Maven for our UIMA-AS projects. Here is the build section from our standard POM entries:

    <build>
      <resources>
        <resource>
          <directory>src/main/desc/</directory>
        </resource>
        <resource>
          <directory>src/main/resources/</directory>
        </resource>
      </resources>
      <pluginManagement>
        <plugins>
          <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <configuration>
              <source>1.6</source>
              <target>1.6</target>
            </configuration>
          </plugin>
        </plugins>
      </pluginManagement>
    </build>

This adds the desc and resources directories as resource directories, which puts their contents on the classpath and allows you to resolve the import of descriptors by name.

Thanks, Thomas Ginter 801-448-7676 thomas.gin...@utah.edu

On May 11, 2012, at 9:56 AM, Erik Fäßler wrote:

Hello all, I have a question on how you deal with a specific use case and would like to know if you have some suggestions for me. I use Maven for all my Java projects, and so I do for my UIMA-related projects. Now I have a quite large pipeline with lots of descriptors. They reside in (or in subdirectories of) the 'desc' directory of the 'UIMA nature' structure. Currently I am about to pack these single-AE descriptors into aggregates. For importing all single AEs into the AAE descriptor, I would like to use import by name. However, the 'desc' directory is not a library for Eclipse and thus the AAE descriptor editor doesn't list the descriptors residing in this directory. I can't add them (and when I edit the XML, I get error messages about descriptors not found).

I would like to just add the 'desc' directory to the build path as a class folder (not a source folder, this won't work), i.e. as a library. When I do this manually, Maven overwrites it the next time it updates my project configuration. Do you have any ideas here? Do you use 'import by name' for your PEARs? Do you just live with the error messages and edit the XML directly? I would just like to know how you do it, and if anyone knows a way to tell Maven that 'desc' should be a library, I'd be glad :-)

Best regards, Erik
Re: Running UIMA on a cluster
UIMA-AS was created to handle the message passing, job distribution, etc. Try going through the UIMA-AS documentation first. We have had pretty good success using it here.

Thanks, Thomas Ginter 801-448-7676 thomas.gin...@utah.edu

On Apr 27, 2012, at 1:35 PM, John David Osborne wrote:

Hello, Is there any best practice documentation out there for running UIMA/UIMA-AS on a cluster? I have only run single machine instances of UIMA (mostly through Eclipse) and have not investigated the ability to perform multiple simultaneous analyses in order to process large document collections. It's not clear to me how UIMA would operate in a cluster environment. Do people really do message passing using JMS? I'm guessing this is the case, as I am not seeing references to MPICH, SGE or other things I am more used to.

I've looked through some of the documentation (including all of the Overview and SDK setup) but am not finding anything helpful. I've also tried googling but I am not getting much except this: http://comments.gmane.org/gmane.comp.apache.uima.general/2131 which makes me think it is possible.

Currently, with my level of confusion, I think it may be best to have multiple instances of UIMA on a cluster and just submit jobs processing discrete document sets to our SGE cluster, and to ignore whatever scaling features are actually present in UIMA, since the document processing I plan to do is data parallel.

-John