Author: pkluegl
Date: Fri Jun 7 17:11:07 2013
New Revision: 1490733

URL: http://svn.apache.org/r1490733
Log:
UIMA-2777 - started to rewrite textruler section in documentation

Added:
    uima/sandbox/ruta/trunk/ruta-docbook/src/docbook/images/tools/ruta/workbench/textruler/
    uima/sandbox/ruta/trunk/ruta-docbook/src/docbook/images/tools/ruta/workbench/textruler/textruler.png   (with props)
    uima/sandbox/ruta/trunk/ruta-docbook/src/docbook/images/tools/ruta/workbench/textruler/textruler_pref.png   (with props)
Modified:
    uima/sandbox/ruta/trunk/ruta-docbook/src/docbook/tools.ruta.workbench.textruler.xml

Added: uima/sandbox/ruta/trunk/ruta-docbook/src/docbook/images/tools/ruta/workbench/textruler/textruler.png
URL: http://svn.apache.org/viewvc/uima/sandbox/ruta/trunk/ruta-docbook/src/docbook/images/tools/ruta/workbench/textruler/textruler.png?rev=1490733&view=auto
==============================================================================
Binary file - no diff available.

Propchange: uima/sandbox/ruta/trunk/ruta-docbook/src/docbook/images/tools/ruta/workbench/textruler/textruler.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: uima/sandbox/ruta/trunk/ruta-docbook/src/docbook/images/tools/ruta/workbench/textruler/textruler_pref.png
URL: http://svn.apache.org/viewvc/uima/sandbox/ruta/trunk/ruta-docbook/src/docbook/images/tools/ruta/workbench/textruler/textruler_pref.png?rev=1490733&view=auto
==============================================================================
Binary file - no diff available.

Propchange: uima/sandbox/ruta/trunk/ruta-docbook/src/docbook/images/tools/ruta/workbench/textruler/textruler_pref.png ------------------------------------------------------------------------------ svn:mime-type = application/octet-stream Modified: uima/sandbox/ruta/trunk/ruta-docbook/src/docbook/tools.ruta.workbench.textruler.xml URL: http://svn.apache.org/viewvc/uima/sandbox/ruta/trunk/ruta-docbook/src/docbook/tools.ruta.workbench.textruler.xml?rev=1490733&r1=1490732&r2=1490733&view=diff ============================================================================== --- uima/sandbox/ruta/trunk/ruta-docbook/src/docbook/tools.ruta.workbench.textruler.xml (original) +++ uima/sandbox/ruta/trunk/ruta-docbook/src/docbook/tools.ruta.workbench.textruler.xml Fri Jun 7 17:11:07 2013 @@ -24,199 +24,60 @@ specific language governing permissions under the License. --> -<section id="section.ugr.tools.ruta.workbench.textruler"> +<section id="section.tools.ruta.workbench.textruler"> <title>TextRuler</title> - <para> Using the knowledge engineering approach, a knowledge engineer normally writes handcrafted - rules to create a domain dependent information extraction application often supported by a gold - standard. When starting the engineering process for the acquisition of the extraction knowledge - for possible new slot or more generally for new concepts, machine learning methods are often able - to offer support in an iterative engineering process. This section gives a conceptual overview - of the process model for the semi-automatic development of rule-based information extraction - applications. + <para> + Apache UIMA Ruta TextRuler is a framework for supervised rule induction included in the UIMA Ruta Workbench. + It provides several configurable algorithms, which are able to learn new rules based on given labeled data. + The framework was created in order to support the user by suggesting new rules for the given task. 
+ The user selects a suitable learning algorithm and adapts its configuration parameters. Furthermore, + the user engineers a set of annotation-based features, which enable the algorithms to form efficient, effective and comprehensive rules. + The rule learning algorithms present their suggested rules in a new view, in which the user can either copy + the complete script or single rules to a new script file, where the rules can be further refined. </para> - <para> First, a suitable set of documents that contains the text fragments with patterns needs to be selected and annotated with the target concepts. Then, the knowledge - engineer chooses and configures the methods for automatic rule acquisition to the best of his - knowledge for the learning task: Lambda expressions based on tokens and linguistic features, for - example, differ in their application domain from wrappers that process generated HTML pages. + <para> + This section gives a short introduction to the included features and learners, and how to use the framework to learn UIMA Ruta rules. First, the + available rule learning algorithms are introduced in <xref linkend="section.tools.ruta.workbench.textruler.learner"/>. Then, + the user interface and the usage are explained in <xref linkend="section.tools.ruta.workbench.textruler.ui"/> using an exemplary UIMA Ruta project. </para> - <para> Furthermore, parameters like the window size defining relevant features need to be set to - an appropriate level. Before the annotated training documents form the input of the learning - task, they are enriched with features generated by the partial rule set of the developed - application. The result of the methods, which are the learned rules, are proposed to the knowledge - engineer for the extraction of the target concept. 
- </para> - <para> The knowledge engineer has different options to proceed: If the quality, amount or - generality of the presented rules is not sufficient, then additional training documents need to - be annotated or additional rules have to be handcrafted to provide more features in general or - more appropriate features. Rules or rule sets of high quality can be modified, combined or - generalized and transfered to the rule set of the application in order to support the extraction - task of the target concept. In the case that the methods did not learn reasonable rules at all, - the knowledge engineer proceeds with writing handcrafted rules. - </para> - <para> Having gathered enough extraction knowledge for the current concept, the semi-automatic - process is iterated and the focus is moved to the next concept until the development of the - application is completed. - </para> - <section id="ugr.tools.ruta.textruler.learner"> - <title>Available Learners</title> - <para> - The available learners are based on the following publications: - <orderedlist numeration="arabic"> - <!-- - <listitem> - <para> Dayne Freitag and Nicholas Kushmerick. Boosted Wrapper Induction. In AAAI/IAAI, - pages 577-583, 2000.</para> - </listitem> - --> - <listitem> - <para> F. Ciravegna. (LP)2, Rule Induction for Information Extraction Using Linguistic - Constraints. Technical Report CS-03-07, Department of Computer Science, University of - Sheffield, Sheffield, 2003.</para> - </listitem> - <listitem> - <para> Mary Elaine Califf and Raymond J. Mooney. Bottom-up Relational Learning of Pattern - Matching Rules for Information Extraction. Journal of Machine Learning Research, - 4:177-210, 2003.</para> - </listitem> - <listitem> - <para> Stephen Soderland, Claire Cardie, and Raymond Mooney. Learning Information - Extraction Rules for Semi-Structured and Free Text. In Machine Learning, volume 34, - pages 233-272, 1999.</para> - </listitem> - <listitem> - <para> N. Kushmerick, D. Weld, and B. 
Doorenbos. Wrapper Induction for Information - Extraction. In Proc. IJC Artificial Intelligence, 1997.</para> - </listitem> - </orderedlist> - </para> + <section id="section.tools.ruta.workbench.textruler.learner"> + <title>Included rule learning algorithms</title> <para> - Each available learner has several features. Their meaning is explained here: - <itemizedlist> - <listitem> - <para> Strategy: The used strategy of the learning methods are commonly coverage - algorithms.</para> - </listitem> - <listitem> - <para> - Document: The type of the document may be <quote>free</quote> - like in newspapers, <quote>semi</quote> - or <quote>struct</quote> like in HTML pages. - </para> - </listitem> - <listitem> - <para> Slots: The slots refer to a single annotation that represents the goal of the - learning task. Some rule are able to create several annotations at once in the same - context (multi-slot). However, only single slots are supported by the current - implementations.</para> - </listitem> - <listitem> - <para> Status: The current status of the implementation in the TextRuler framework.</para> - </listitem> - </itemizedlist> - </para> - <para> - The following table gives an overview: - <table id="table.ugr.tools.ruta.workbench.textruler.available_learners" frame="all"> - <title>Overview of available learners</title> - <tgroup cols="6" colsep="1" rowsep="1"> - <colspec colname="c1" colwidth="1*" /> - <colspec colname="c2" colwidth="1*" /> - <colspec colname="c3" colwidth="1*" /> - <colspec colname="c4" colwidth="1*" /> - <colspec colname="c5" colwidth="1*" /> - <colspec colname="c6" colwidth="1*" /> - <thead> - <row> - <entry align="center">Name</entry> - <entry align="center">Strategy</entry> - <entry align="center">Document</entry> - <entry align="center">Slots</entry> - <entry align="center">Status</entry> - <entry align="center">Publication</entry> - </row> - </thead> - <tbody> - <!-- - <row> - <entry>BWI</entry> - <entry>Boosting, Top Down</entry> - 
<entry>Struct, Semi</entry> - <entry>Single, Boundary</entry> - <entry>Planning</entry> - <entry>1</entry> - </row> - --> - <row> - <entry>LP2</entry> - <entry>Bottom Up Cover</entry> - <entry>All</entry> - <entry>Single, Boundary</entry> - <entry>Prototype</entry> - <entry>1</entry> - </row> - <row> - <entry>RAPIER</entry> - <entry>Top Down/Bottom Up Compr.</entry> - <entry>Semi</entry> - <entry>Single</entry> - <entry>Experimental</entry> - <entry>2</entry> - </row> - <row> - <entry>WHISK</entry> - <entry>Top Down Cover</entry> - <entry>All</entry> - <entry>Multi</entry> - <entry>Prototype</entry> - <entry>3</entry> - </row> - <row> - <entry>WIEN</entry> - <entry>CSP</entry> - <entry>Struct</entry> - <entry>Multi, Rows</entry> - <entry>Prototype</entry> - <entry>4</entry> - </row> - </tbody> - </tgroup> - </table> - </para> - <!-- - <section id="section.ugr.tools.ruta.workbench.textruler.bwi"> - <title>BWI (Boosted Wrapper Induction)</title> - <para> BWI uses boosting techniques to improve the performance of simple pattern matching - single-slot boundary wrappers (boundary detectors). Two sets of detectors are learned: the - "fore" and the "aft" detectors. Weighted by their confidences and combined with a slot - length histogram derived from the training data they can classify a given pair of boundaries - within a document. BWI can be used for structured, semi-structured and free text. The - patterns are token-based with special wildcards for more general rules. </para> - <para> Implementations No implementations are yet available. </para> - <para> Parameters No parameters are yet available. </para> - </section> - --> - <section id="section.ugr.tools.ruta.workbench.textruler.lp2"> + This section gives a short description of the rule learning algorithms, + which are provided in the UIMA Ruta TextRuler framework. + </para> + <section id="section.tools.ruta.workbench.textruler.lp2"> <title>LP2</title> - <para>This method operates on all three kinds of documents. 
It learns separate rules for - the beginning and the end of a single slot. Tagging rules insert boundary SGML - tags and, additionally, induced correction rules shift misplaced tags to their correct - positions in order to improve precision. The learning strategy is a bottom-up covering + <note> + <para> + This rule learner is an experimental implementation of the ideas and algorithms published in: + F. Ciravegna. (LP)2, Rule Induction for Information Extraction Using Linguistic + Constraints. Technical Report CS-03-07, Department of Computer Science, University of + Sheffield, Sheffield, 2003. + </para> + </note> + <para>This algorithms learns separate rules for + the beginning and the end of a single slot, which are later combined + in order to identify the targeted annotation. The learning strategy is a bottom-up covering algorithm. It starts by creating a specific seed instance with a window of w tokens to the - left and right of the target boundary and searches for the best generalization. Other - linguistic NLP-features can be used in order to generalize over the flat word sequence. + left and right of the target boundary and searches for the best generalization. Additional context rules are + induced in order to identify missing boundaries. The current implementation does not support correction rules. + The TextRuler framework provides two versions of this algorithm: LP2 (naive) is a straightforward implementation + with limited expressiveness concerning the resulting Ruta rules. LP2 (optimized) is an improved + version with a dynamic programming approach and is providing better results in general. + The following parameters are available. For a more detailed description of the parameters, + please refer to the implementation and the publication. 
</para> <para> - Parameters: - </para> <itemizedlist> <listitem> <para>Context Window Size (to the left and right)</para> </listitem> <listitem> - <para>Best Rules List Size: Minimum</para> + <para>Best Rules List Size</para> </listitem> <listitem> - <para>Covered Positives per Rule</para> + <para>Minimum Covered Positives per Rule</para> </listitem> <listitem> <para>Maximum Error Threshold</para> @@ -225,55 +86,28 @@ under the License. <para>Contextual Rules List Size</para> </listitem> </itemizedlist> - </section> - <section id="section.ugr.tools.ruta.workbench.textruler.rapier"> - <title>RAPIER</title> - <para>RAPIER induces single slot extraction rules for semi-structured documents. The rules - consist of three patterns: a pre-filler, a filler and a post-filler pattern. Each pattern can hold - several constraints on tokens and their according POS-tag- and semantic information. The - algorithm uses a bottom-up compression strategy starting with a most specific seed rule for - each training instance. This initial rule base is compressed by randomly selecting rule - pairs and search for the best generalization. Considering two rules, the least general - generalization (LGG) of the slot fillers are created and specialized by adding rule items to - the pre- and post-filler until the new rules operate well on the training set. The best of - the k rules (k-beam search) is added to the rule base and all empirically subsumed rules are - removed. 
</para> - <para> - Parameters: </para> - <itemizedlist> - <listitem> - <para>Parameters Maximum Compression Fail Count</para> - </listitem> - <listitem> - <para>Internal Rules List Size: Rule Pairs for Generalizing</para> - </listitem> - <listitem> - <para>Maximum 'No improvement' Count</para> - </listitem> - <listitem> - <para>Maximum Noise Threshold: Minimum Covered Positives Per Rule</para> - </listitem> - <listitem> - <para>PosTag Root Type</para> - </listitem> - <listitem> - <para>Use All 3 GenSets at Specialization</para> - </listitem> - </itemizedlist> </section> - <section id="section.ugr.tools.ruta.workbench.textruler.whisk"> + + <section id="section.tools.ruta.workbench.textruler.whisk"> <title>WHISK</title> - <para> WHISK is a multi-slot method that operates on all three kinds of documents and learns - single- or multi-slot rules looking similar to regular expressions. The top-down covering - algorithm begins with the most general rule and specializes it by adding single rule terms - until the rule does not make errors anymore on the training set. Domain specific classes or linguistic - information obtained by a syntactic analyzer can be used as additional features. The exact - definition of a rule term (e.g., a token) and of a problem instance (e.g., a whole document or - a single sentence) depends on the operating domain and document type. </para> + <note> <para> - Parameters: + This rule learner is an experimental implementation of the ideas and algorithms published in: + Stephen Soderland, Claire Cardie, and Raymond Mooney. Learning Information + Extraction Rules for Semi-Structured and Free Text. In Machine Learning, volume 34, + pages 233-272, 1999. </para> + </note> + <para>WHISK is a multi-slot method that operates on all three kinds of documents and learns + single- or multi-slot rules looking similar to regular expressions. However, the current implementation only supports single-slot rules. 
+ The top-down covering algorithm begins with the most general rule and specializes it by adding single rule terms + until the rule does not make errors anymore on the training set. The TextRuler framework provides two versions of this algorithm: + WHISK (token) is a naive token-based implementation. WHISK (generic) is an optimized and improved implementation, + which is able to refer to arbitrary annotations and also supports primitive features. The following parameters are available. For a more detailed description of the parameters, + please refer to the implementation and the publication. + </para> + <para> <itemizedlist> <listitem> <para>Parameters Window Size</para> @@ -284,17 +118,51 @@ under the License. <listitem> <para>PosTag Root Type</para> </listitem> + <listitem> + <para>Considered Features (comma-separated) - only WHISK (generic)</para> + </listitem> </itemizedlist> - </section> - <section id="section.ugr.tools.ruta.workbench.textruler.wien"> - <title>WIEN </title> - <para> WIEN is the only method listed here that operates on highly structured texts only. It - induces wrappers that anchor the slots by their structured context. - The HLRT (head left right tail) wrapper class for example can determine and extract several - multi-slot-templates by first separating the important information block from unimportant - head and tail portions and extracting multiple data rows from table like data - structures from the remaining document. Inducing a wrapper is done by solving a CSP for all - possible pattern combinations from the training data. 
</para> - </section> - </section> -</section> \ No newline at end of file + </para> + </section> + </section> + <section id="section.tools.ruta.workbench.textruler.ui"> + <title>The TextRuler view</title> + <para> + </para> + <figure id="figure.tools.ruta.workbench.textruler.main"> + <title>The UIMA Ruta TextRuler framework + </title> + <mediaobject> + <imageobject role="html"> + <imagedata width="776px" format="PNG" align="center" + fileref="&imgroot;textruler/textruler.png" /> + </imageobject> + <imageobject role="fo"> + <imagedata width="5.4in" format="PNG" align="center" + fileref="&imgroot;textruler/textruler.png" /> + </imageobject> + <textobject> + <phrase>UIMA Ruta TextRuler framework</phrase> + </textobject> + </mediaobject> + </figure> + <figure id="figure.tools.ruta.workbench.textruler.pref"> + <title>The UIMA Ruta TextRuler Preferences + </title> + <mediaobject> + <imageobject role="html"> + <imagedata width="576px" format="PNG" align="center" + fileref="&imgroot;textruler/textruler_pref.png" /> + </imageobject> + <imageobject role="fo"> + <imagedata width="3.3in" format="PNG" align="center" + fileref="&imgroot;textruler/textruler_pref.png" /> + </imageobject> + <textobject> + <phrase>UIMA Ruta TextRuler Preferences</phrase> + </textobject> + </mediaobject> + </figure> + + </section> +</section>