Hey Julien,

There are two aspects to making UIMA work with Hadoop.
First, make it run at all: get it working on a small data set as a proof of concept, and only then worry about scalability.

Have you gone through these links?
http://cwiki.apache.org/confluence/display/UIMA/Running+UIMA+Apps+on+Hadoop
http://rohanrai.blogspot.com/2008/06/uima-hadoop.html

Once you have understood what is going on there, you should look at this thread, which deals specifically with the scalability issues. Feel free to ask more if you are still unable to make progress.

Regards
Rohan

On Tue, Aug 19, 2008 at 3:49 PM, Julien Nioche <[EMAIL PROTECTED]> wrote:
> Hi Rohan,
>
> I saw that thread on the uima list and am in a similar situation. Would you
> mind telling me how you built the job file? I have one which contains all my
> libs and XML configuration files, but it does not get automatically extracted,
> and I can't access my files using the ClassLoader.
>
> Do you use conf.setJar() at all?
>
> Thanks
>
> Julien
>
> 2008/6/30 rohan rai <[EMAIL PROTECTED]>
>
>> Sorry for misleading you guys by keeping a few facts to myself.
>> Let me elaborate and describe the actual problem and the solution I found.
>>
>> I am running my UIMA app over Hadoop. There I encountered a big problem,
>> which I had asked about on this forum before. I then found the solution,
>> which later got posted here:
>> http://cwiki.apache.org/UIMA/running-uima-apps-on-hadoop.html
>>
>> This solved one set of problems, but it introduced performance issues.
>> Instead of speeding up and scaling, I started facing two new problems
>> because of the solution described in the wiki.
>>
>> Problem 1) Out of memory errors
>> The wiki suggests using
>>
>>   XMLInputSource in = new
>>       XMLInputSource(ClassLoader.getSystemResourceAsStream(aeXmlDescriptor), null);
>>
>> to load the XML descriptors, together with a resource manager.
>>
>> But if this is done inside the map/reduce processing itself, one eventually
>> gets an out of memory error, in spite of increasing the heap size considerably.
>>
>> The solution is to initialize the analysis engine (and related objects) in the
>> configure(JobConf) method of the Mapper/Reducer class, so that a single
>> instance is created per Hadoop task. One can even reuse the CAS by calling
>> its cas.reset() method.
>>
>> That solved the out of memory problem.
>>
>> Then I started facing another problem, this time with performance. Its source
>> was the usage of the Resource Manager mentioned in the wiki to solve the
>> other problem: every class mentioned in the descriptor was brought from the
>> job temp directory into the task temp directory.
>>
>> The question then became how to solve the problem the wiki entry addresses
>> without using the Resource Manager.
>>
>> The solution is to fake imports (ironic indeed that faking proved useful :)).
>> In the class file where the Map/Reduce task is implemented, we import all the
>> classes required by the descriptor that is initialized in that class. This
>> ensures the presence of these classes at each individual task and thus gives
>> a considerable increase in performance.
>>
>> Keeping these points in mind, I can now use UIMA and Hadoop together to my
>> own benefit.
>>
>> Regards
>> Rohan
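Putting Rohan's two fixes together, a minimal sketch of such a mapper (using the old org.apache.hadoop.mapred API that these posts refer to) might look like the code below. The class name UimaMapper, the descriptor name MyAnalysisEngine.xml, and the Text/Text output are made up for illustration, and the descriptor is assumed to sit at the root of the job jar so that ClassLoader.getSystemResourceAsStream() can find it; this is a sketch of the approach described above, not code from the thread.

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.uima.UIMAFramework;
    import org.apache.uima.analysis_engine.AnalysisEngine;
    import org.apache.uima.cas.CAS;
    import org.apache.uima.cas.FSIterator;
    import org.apache.uima.cas.text.AnnotationFS;
    import org.apache.uima.resource.ResourceSpecifier;
    import org.apache.uima.util.XMLInputSource;

    // Per Rohan's "fake import" trick, also import (or otherwise reference) the
    // annotator classes named in the descriptor here, so that they travel with
    // the job classes and are available to every task.

    public class UimaMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

      // Hypothetical descriptor name; assumed to be packed at the root of the job jar.
      private static final String DESCRIPTOR = "MyAnalysisEngine.xml";

      // One engine and one CAS per task JVM, created once in configure().
      private AnalysisEngine ae;
      private CAS cas;

      public void configure(JobConf job) {
        try {
          // Load the descriptor through the classloader, as in the wiki recipe.
          XMLInputSource in = new XMLInputSource(
              ClassLoader.getSystemResourceAsStream(DESCRIPTOR), null);
          ResourceSpecifier spec = UIMAFramework.getXMLParser().parseResourceSpecifier(in);
          ae = UIMAFramework.produceAnalysisEngine(spec);
          cas = ae.newCAS();
        } catch (Exception e) {
          throw new RuntimeException("Could not initialize the UIMA analysis engine", e);
        }
      }

      public void map(LongWritable key, Text value,
          OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        try {
          cas.reset();                          // reuse the same CAS for every record
          cas.setDocumentText(value.toString());
          ae.process(cas);
          // Emit one (covered text, type name) pair per annotation, just to have
          // something observable coming out of the map phase.
          FSIterator it = cas.getAnnotationIndex().iterator();
          while (it.hasNext()) {
            AnnotationFS a = (AnnotationFS) it.next();
            output.collect(new Text(a.getCoveredText()), new Text(a.getType().getName()));
          }
        } catch (Exception e) {
          throw new IOException("UIMA processing failed: " + e.getMessage());
        }
      }

      public void close() {
        if (ae != null) {
          ae.destroy();
        }
      }
    }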
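On Julien's question about building the job file and conf.setJar(): the thread does not spell this out, so the following driver is only a sketch under the assumption that the job jar contains the mapper classes, the UIMA jars under lib/, and the XML descriptors at the jar root. Passing a class to the JobConf constructor serves the same purpose as conf.setJar(), letting Hadoop locate and ship that jar. The class names (UimaJob, and a mapper like the UimaMapper sketched nearby) are placeholders.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class UimaJob {
      public static void main(String[] args) throws Exception {
        // Locates the jar containing UimaJob, like conf.setJar() but by class.
        JobConf conf = new JobConf(UimaJob.class);
        conf.setJobName("uima-annotation");
        conf.setMapperClass(UimaMapper.class);
        conf.setNumReduceTasks(0);             // map-only job in this sketch
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
      }
    }

Hadoop unpacks the job jar for each task and puts it (and any lib/*.jar inside it) on the task classpath, which is what should make the ClassLoader-based descriptor loading from the wiki recipe work without touching file paths.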
>>
>> On Thu, Jun 26, 2008 at 10:52 PM, Thilo Goetz <[EMAIL PROTECTED]> wrote:
>>
>> > rohan rai wrote:
>> >
>> >> @Pascal: As I have already said, the timing does not scale linearly.
>> >> Secondly, those are approximate times that I have specified.
>> >> @Frank:
>> >> I was talking about the actual adding of annotations to the CAS.
>> >> A record is, let's say, the content of tags like these: <a>.....</a>,
>> >> and the document consists of such records.
>> >> Annotation is done via this method:
>> >>   MyType annotation = new MyType(jCas);
>> >>   annotation.setBegin(start);
>> >>   annotation.setEnd(end);
>> >>   annotation.addToIndexes();
>> >> This takes a lot of time, which is not desirable.
>> >
>> > I don't know what you mean by a lot of time, but
>> > you can create hundreds of thousands of annotations
>> > like this per second on a standard windows machine.
>> > You can easily verify this by running this code in
>> > isolation (with mock data).
>> >
>> > You're more likely seeing per-document overhead.
>> > For example, resetting the CAS after processing
>> > a document is not so cheap. However, I still don't
>> > know why things are so slow for you. For example,
>> > I ran the following experiment. I installed the
>> > Whitespace Tokenizer pear file into c:\tmp and ran
>> > it 10000 times on its own descriptor. That creates
>> > approx 10Mio annotations. On my 18-month-old Xeon
>> > this ran in about 4 seconds. Code and output are
>> > below, for you to recreate. So I'm not sure you have
>> > correctly identified your bottleneck.
>> >
>> > public static void main(String[] args) {
>> >   try {
>> >     System.out.println("Starting setup.");
>> >     XMLParser parser = UIMAFramework.getXMLParser();
>> >     ResourceSpecifier spec = parser.parseResourceSpecifier(new XMLInputSource(
>> >         new File("c:\\tmp\\WhitespaceTokenizer\\WhitespaceTokenizer_pear.xml")));
>> >     AnalysisEngine ae = UIMAFramework.produceAnalysisEngine(spec, null, null);
>> >     String text = FileUtils.file2String(new File(
>> >         "c:\\tmp\\WhitespaceTokenizer\\desc\\WhitespaceTokenizer.xml"));
>> >     CAS cas = ae.newCAS();
>> >     System.out.println("Setup done, starting processing.");
>> >     final int max = 10000;
>> >     long time = System.currentTimeMillis();
>> >     for (int i = 0; i < max; i++) {
>> >       cas.reset();
>> >       cas.setDocumentText(text);
>> >       ae.process(cas);
>> >       if (cas.getAnnotationIndex().size() != 1080) {
>> >         // There are 1080 annotations created for each run
>> >         System.out.println("Processing error.");
>> >       }
>> >     }
>> >     time = System.currentTimeMillis() - time;
>> >     System.out.println("Time for processing " + max + " documents, " + max * 1080
>> >         + " annotations: " + new TimeSpan(time));
>> >   } catch (Exception e) {
>> >     e.printStackTrace();
>> >   }
>> > }
>> >
>> > Output on my machine:
>> >
>> > Starting setup.
>> > Setup done, starting processing.
>> > Time for processing 10000 documents, 10800000 annotations: 4.078 sec
>> >
>> > --Thilo
>> >
>> >> Regards
>> >> Rohan
>> >>
>> >> On Thu, Jun 26, 2008 at 8:15 PM, LeHouillier, Frank D. <[EMAIL PROTECTED]> wrote:
>> >>
>> >>> Just to clarify, what do you mean by "annotation"? Is there a specific
>> >>> Analysis Engine that you are using? What is a "record"? Is this a
>> >>> document?
>> >>> It would actually be surprising for many applications if annotation were
>> >>> not the bottleneck, given that some annotation processes are quite
>> >>> expensive, but this doesn't seem to be what you mean here. I can't tell
>> >>> from your question whether the burden is the process that determines the
>> >>> annotations, or the actual adding of the annotations to the CAS.
>> >>>
>> >>> -----Original Message-----
>> >>> From: rohan rai [mailto:[EMAIL PROTECTED]
>> >>> Sent: Thursday, June 26, 2008 7:36 AM
>> >>> To: [email protected]
>> >>> Subject: Annotation (Indexing) a bottleneck in UIMA in terms of speed
>> >>>
>> >>> When I profile a UIMA application, I see that annotation takes a lot of
>> >>> time: around 0.06 seconds to annotate one record. You may say that is
>> >>> good; now scale up. It does not scale linearly, but here is a rough
>> >>> estimate from experiments: 6000 records take 6 min to annotate, and
>> >>> 800000 records take around 10 hrs to annotate, which is bad.
>> >>> One thing is that I am treating each record individually as a CAS. Even
>> >>> if I treat all the records as a single CAS, it takes around 6-7 hrs,
>> >>> which is still not good in terms of speed.
>> >>>
>> >>> Is there a way out?
>> >>> Can I improve performance by any means?
>> >>>
>> >>> Regards
>> >>> Rohan
>> >>>
>> >>
>> >
>>
>
> --
> DigitalPebble Ltd
> http://www.digitalpebble.com
