Added: uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/tools.textmarker.workbench.xml URL: http://svn.apache.org/viewvc/uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/tools.textmarker.workbench.xml?rev=1363750&view=auto ============================================================================== --- uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/tools.textmarker.workbench.xml (added) +++ uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/tools.textmarker.workbench.xml Fri Jul 20 12:27:14 2012 @@ -0,0 +1,1483 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
"http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"[
<!ENTITY imgroot "images/tools/tools.textmarker/" >
<!ENTITY % uimaents SYSTEM "../../target/docbook-shared/entities.ent" >
%uimaents;
]>
<!-- Licensed to the Apache Software Foundation (ASF) under one or more contributor
  license agreements. See the NOTICE file distributed with this work for additional
  information regarding copyright ownership. The ASF licenses this file to
  you under the Apache License, Version 2.0 (the "License"); you may not use
  this file except in compliance with the License. You may obtain a copy of
  the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required
  by applicable law or agreed to in writing, software distributed under the
  License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS
  OF ANY KIND, either express or implied. See the License for the specific
  language governing permissions and limitations under the License. -->

<chapter id="ugr.tools.tm.workbench">
  <title>TextMarker Workbench</title>
  <para>
  </para>

  <section id="ugr.tools.tm.install">
    <title>Installation</title>
    <para>
      <orderedlist>
        <listitem>
          <para>Download, install and start Eclipse 3.5 or Eclipse 3.6.</para>
        </listitem>
        <listitem>
          <para>Add the Apache UIMA update site
            (http://www.apache.org/dist/uima/eclipse-update-site/) and the
            TextMarker update site
            (http://ki.informatik.uni-wuerzburg.de/~pkluegl/updatesite/) to the
            available software sites of your Eclipse installation. This can be
            achieved in the "Install New Software" dialog in the help menu of
            Eclipse.</para>
        </listitem>
        <listitem>
          <para>Eclipse 3.6 only: TextMarker is currently based on DLTK 1.0.
            Therefore, adding the DLTK 1.0 update site
            (http://download.eclipse.org/technology/dltk/updates-dev/1.0/) is
            required, since the Eclipse 3.6 update site only provides DLTK 2.0.</para>
        </listitem>
        <listitem>
          <para>Select "Install New Software" in the help menu of Eclipse, if
            not done yet.</para>
        </listitem>
        <listitem>
          <para>Select the TextMarker update site at "Work with", deselect
            "Group items by category" and select "Contact all update sites
            during install to find required software".</para>
        </listitem>
        <listitem>
          <para>Select the TextMarker feature and continue the dialog. The CEV
            feature is already contained in the TextMarker feature. Eclipse will
            automatically install the Apache UIMA (version 2.3) plugins and the
            DLTK Core Framework (version 1.x) plugins.</para>
        </listitem>
        <listitem>
          <para>(Optional) If additional HTML visualizations are desired, also
            install the CEV HTML feature. However, you need to install the
            XPCom and XULRunner features beforehand, for example by using an
            appropriate update site
            (http://ftp.mozilla.org/pub/mozilla.org/xulrunner/eclipse/). Please
            refer to the CEV installation instructions for details.</para>
        </listitem>
        <listitem>
          <para>After the successful installation, switch to the TextMarker
            perspective.</para>
        </listitem>
      </orderedlist>

      You can also download the TextMarker plugins from SourceForge.net
      (https://sourceforge.net/projects/textmarker/) and install the plugins
      mentioned above manually.
    </para>
  </section>
  <section id="ugr.tools.tm.project">
    <title>TextMarker Projects</title>
    <para>
      Similar to Java projects in Eclipse, the TextMarker workbench
      provides the possibility to create TextMarker projects.
      TextMarker projects require a certain folder structure, which is created
      together with the project. The most important folders are the script
      folder, which contains the TextMarker rule files organized in packages,
      and the descriptor folder, which contains the generated UIMA components.
      The input folder contains the text or xmiCAS files that are processed
      when a TextMarker script is launched. The results are placed in the
      output folder.

      <programlisting><![CDATA[
Project element             Used for
Project                     the TextMarker project
- script                    source folder with TextMarker scripts
-- my.package               the package, resulting in several folders
--- Script.tm               a TextMarker script
- descriptor                build folder for UIMA components
-- my/package               the folder structure for the components
--- ScriptEngine.xml        the analysis engine of the Script.tm script
--- ScriptTypeSystem.xml    the type system of the Script.tm script
-- BasicEngine.xml          the analysis engine template for all generated engines in this project
-- BasicTypeSystem.xml      the type system template for all generated type systems in this project
-- InternalTypeSystem.xml   a type system with TextMarker types
-- Modifier.xml             the analysis engine of the optional modifier that creates the "modified" view
- input                     folder with the files that are processed when launching a TextMarker script
-- test.html                an input file containing HTML
-- test.xmi                 an input file containing text and annotations
- output                    folder with the files that were processed by a TextMarker script
-- test.html.modified.html  the result of the modifier: replaced text and colored HTML
-- test.html.xmi            the result CAS with optional information
-- test.xmi.modified.html   the result of the modifier: replaced text and colored HTML
-- test.xmi.xmi             the result CAS with optional information
- resources                 default folder for word lists and dictionaries
-- Dictionary.mtwl          a dictionary in the "multi tree word list" format
-- FirstNames.txt           a simple word list with first names: one first name per line
- test                      folder for test-driven development (still under construction)
]]></programlisting>

    </para>

  </section>
  <section id="ugr.tools.tm.explain">
    <title>Explanation</title>
    <para>
      Handcrafting rules is laborious, especially if newly written rules do
      not behave as expected. The TextMarker system is therefore able to
      record the application of each single rule and block in order to
      provide an explanation of the rule inference and a minimal debugging
      functionality.

      The explanation component is built upon the CEV plugin. The information
      about the application of the rules is stored in the result xmiCAS, if
      the parameters of the executed engine are configured accordingly. The
      simplest way to generate this information is to open a TextMarker file
      and click the common "Debug" button (the green bug icon) in Eclipse.
      The current TextMarker file is then executed on the text files in the
      input directory, and xmiCAS files are created in the output directory,
      containing the additional UIMA feature structures that describe the
      rule inference. The resulting xmiCAS needs to be opened with the CEV
      plugin. However, only additional views are capable of displaying the
      debug information. In order to open the necessary views, you can either
      open the "Explain" perspective or open the views separately and arrange
      them as you like.

      There are currently seven views that display information about the
      execution of the rules: Applied Rules, Selected Rules, Rule List,
      Matched Rules, Failed Rules, Rule Elements and Basic Stream.

    </para>

  </section>
  <section id="ugr.tools.tm.dictionaries">
    <title>Dictionaries</title>
    <para>

      The TextMarker system currently supports the usage of dictionaries in
      four different ways.
      The files are always encoded in UTF-8. The generated analysis engines
      provide a parameter "resourceLocation" that specifies the folder
      containing the external dictionary files. The parameter is initially
      set to the resource folder of the current TextMarker project. In order
      to use a different folder, set the value of the parameter accordingly
      and rebuild all TextMarker rule files in the project in order to update
      all analysis engines.

      The algorithm for the detection of the entries of a dictionary:

      <programlisting><![CDATA[
for all basic annotations of the matched annotation do
  set current candidate to current basic annotation
  loop
    if the dictionary contains the current candidate then
      remember the candidate
    else if an entry of the dictionary starts with the current candidate then
      add the next basic annotation to the current candidate
      continue loop
    else
      stop loop
]]></programlisting>

      Word List (.txt):
      Word lists are simple text files that contain one term or string per
      line. The strings may include white spaces and are separated by line
      breaks.

      Usage:
      Content of a file named FirstNames.txt (located in the resource folder
      of a TextMarker project):
      <programlisting><![CDATA[
Peter
Jochen
Joachim
Martin
]]></programlisting>

      Example rules:
      <programlisting><![CDATA[
LIST FirstNameList = 'FirstNames.txt';
DECLARE FirstName;
Document{-> MARKFAST(FirstName, FirstNameList)};
]]></programlisting>

      In this example, all first names given in the word list are annotated
      in the input document with the type FirstName.

      Tree Word List (.twl):
      A tree word list is a compiled word list similar to a trie. A .twl file
      is an XML file that contains a tree-like structure with a node for each
      character. The nodes themselves refer to child nodes that represent all
      characters that succeed the character of the parent node.
      For single word entries, this results in a complexity of O(m*log(n))
      instead of O(m*n) for a simple .txt file, where m is the number of
      basic annotations in the document and n is the number of entries in the
      dictionary.

      Usage:
      A .twl file is generated using the popup menu. Select one or more .txt
      files (or a folder containing .txt files), click the right mouse button
      and choose "Convert to TWL". One or more .twl files are then generated
      with the according file names.

      Example rules:

      <programlisting><![CDATA[
LIST FirstNameList = 'FirstNames.twl';
DECLARE FirstName;
Document{-> MARKFAST(FirstName, FirstNameList)};
]]></programlisting>

      In this example, all first names given in the word list are again
      annotated in the input document with the type FirstName.

      Multi Tree Word List (.mtwl):
      A multi tree word list is generated from multiple .txt files and
      contains special nodes: its nodes provide additional information about
      the originating file. The .mtwl files are useful if several different
      dictionaries are used in a TextMarker file. For five dictionaries, for
      example, five MARKFAST rules are necessary. The matched text is
      therefore searched five times, and the complexity is 5 * O(m*log(n)).
      Using a .mtwl file reduces the complexity to about O(m*log(5*n)).

      Usage:
      A .mtwl file is generated using the popup menu. Select one or more .txt
      files (or a folder containing .txt files), click the right mouse button
      and choose "Convert to MTWL". A .mtwl file named "generated.mtwl" is
      then created that contains the word lists of all selected .txt files.
      Renaming the .mtwl file is recommended.
      If, for example, two or more word lists named "FirstNames.txt",
      "Companies.txt" and so on are given and the generated .mtwl file is
      renamed to "Dictionary.mtwl", then the following rule annotates all
      companies and first names in the complete document.

      Example rules:

      <programlisting><![CDATA[
LIST Dictionary = 'Dictionary.mtwl';
DECLARE FirstName, Company;
Document{-> TRIE("FirstNames.txt" = FirstName, "Companies.txt" = Company, Dictionary, false, 0, false, 0, "")};
]]></programlisting>

      Table (.csv):
      The TextMarker system also supports .csv files, i.e., tables.

      Usage:
      Content of a file named TestTable.csv (located in the resource folder
      of a TextMarker project):
      <programlisting><![CDATA[
Peter;P;
Jochen;J;
Joba;J;
]]></programlisting>

      Example rules:
      <programlisting><![CDATA[
PACKAGE de.uniwue.tm;
TABLE TestTable = 'TestTable.csv';
DECLARE Annotation Struct (STRING first);
Document{-> MARKTABLE(Struct, 1, TestTable, "first" = 2)};
]]></programlisting>
      In this example, the document is searched for all occurrences of the
      entries of the first column of the given table. For each occurrence, an
      annotation of the type Struct is created and its feature "first" is
      filled with the entry of the second column.

      For an input document with the content "Peter", the result is a single
      annotation of the type Struct with "P" assigned to its feature "first".

    </para>

  </section>
  <section id="ugr.tools.tm.parameters">
    <title>Parameters</title>
    <para>
      <itemizedlist>
        <listitem>
          <para>mainScript (String): This is the TextMarker script that will
            be loaded and executed by the generated engine. The string
            references the name of the file without file extension, but with
            its complete namespace, e.g., my.package.Main.
          </para>
        </listitem>

        <listitem>
          <para>scriptPaths (Multiple Strings): The given strings specify the
            folders that contain TextMarker script files, in particular the
            main script file and the additional script files. Currently, only
            one folder is supported in the TextMarker workbench (script).
          </para>
        </listitem>

        <listitem>
          <para>enginePaths (Multiple Strings): The given strings specify the
            folders that contain additional analysis engines that are called
            from within a script file. Currently, only one folder is supported
            in the TextMarker workbench (descriptor).
          </para>
        </listitem>

        <listitem>
          <para>resourcePaths (Multiple Strings): The given strings specify
            the folders that contain the word lists and dictionaries.
            Currently, only one folder is supported in the TextMarker
            workbench (resources).
          </para>
        </listitem>

        <listitem>
          <para>additionalScripts (Multiple Strings): This parameter contains
            a list of all known script files, referenced with their complete
            namespace, e.g., my.package.AnotherOne.
          </para>
        </listitem>

        <listitem>
          <para>additionalEngines (Multiple Strings): This parameter contains
            a list of all known analysis engines.
          </para>
        </listitem>

        <listitem>
          <para>additionalEngineLoaders (Multiple Strings): This parameter
            contains the class names of the implementations that help to load
            more complex analysis engines.
          </para>
        </listitem>

        <listitem>
          <para>scriptEncoding (String): The encoding of the script files.
            Not yet supported; please use UTF-8.
          </para>
        </listitem>

        <listitem>
          <para>defaultFilteredTypes (Multiple Strings): The complete names
            of the types that are filtered by default.
          </para>
        </listitem>

        <listitem>
          <para>defaultFilteredMarkups (Multiple Strings): The names of the
            markups that are filtered by default.
          </para>
        </listitem>

        <listitem>
          <para>seeders (Multiple Strings):
          </para>
        </listitem>

        <listitem>
          <para>useBasics (String):
          </para>
        </listitem>

        <listitem>
          <para>removeBasics (Boolean):
          </para>
        </listitem>

        <listitem>
          <para>debug (Boolean):
          </para>
        </listitem>

        <listitem>
          <para>profile (Boolean):
          </para>
        </listitem>

        <listitem>
          <para>debugWithMatches (Boolean):
          </para>
        </listitem>

        <listitem>
          <para>statistics (Boolean):
          </para>
        </listitem>

        <listitem>
          <para>debugOnlyFor (Multiple Strings):
          </para>
        </listitem>

        <listitem>
          <para>style (Boolean):
          </para>
        </listitem>

        <listitem>
          <para>styleMapLocation (String):
          </para>
        </listitem>
      </itemizedlist>
    </para>

  </section>
  <section id="ugr.tools.tm.query">
    <title>Query</title>
    <para>
      The query view can be used to write queries in the TextMarker language
      on several documents within a folder.

      A short example of how to use the Query view:
      <itemizedlist>
        <listitem>
          <para> In the first field, "Query Data", the folder in which the
            query is executed is added, for example by drag and drop from the
            script explorer. If the checkbox is activated, all subfolders are
            included in the query.
          </para>
        </listitem>
        <listitem>
          <para> The next field, "Type System", must contain a type system or
            a TextMarker script that specifies all types that are used in the
            query.
          </para>
        </listitem>
        <listitem>
          <para> The query, in the form of one or more TextMarker rules, is
            specified in the text field in the middle of the view. In the
            example of the screenshot, all "Author" annotations are selected
            that contain a "FalsePositive" or "FalseNegative" annotation.
          </para>
        </listitem>
        <listitem>
          <para> If the start button near the tab of the view in the upper
            right corner is pressed, the results are displayed.
          </para>
        </listitem>
      </itemizedlist>
      <screenshot>
        <mediaobject>
          <imageobject>
            <imagedata scale="80" format="PNG" fileref="&imgroot;Query.png" />
          </imageobject>
          <textobject>
            <phrase>Query View</phrase>
          </textobject>
        </mediaobject>
      </screenshot>

    </para>
  </section>
  <section id="ugr.tools.tm.views">
    <title>Views</title>
    <para>

    </para>
    <section id="ugr.tools.tm.views.browser">
      <title>Annotation Browser</title>
      <para>
      </para>
    </section>
    <section id="ugr.tools.tm.views.editor">
      <title>Annotation Editor</title>
      <para>
      </para>
    </section>
    <section id="ugr.tools.tm.views.palette">
      <title>Marker Palette</title>
      <para>
      </para>
    </section>
    <section id="ugr.tools.tm.views.selection">
      <title>Selection</title>
      <para>
      </para>
    </section>

    <section id="ugr.tools.tm.views.stream">
      <title>Basic Stream</title>
      <para>
        The Basic Stream view contains a listing of the complete disjoint
        partition of the document into the TextMarkerBasic annotations that
        are used for the inference and the annotation seeding.
      </para>
    </section>

    <section id="ugr.tools.tm.views.applied">
      <title>Applied Rules</title>
      <para>
        The Applied Rules view displays how often a rule tried to apply and
        how often it succeeded. Additionally, some profiling information is
        added after a short verbalization of the rule. The information is
        structured: if BLOCK constructs were used in the executed TextMarker
        file, the rules contained in such a block are represented as child
        nodes in the tree of the view. Each TextMarker file is itself a BLOCK
        construct named after the file. Therefore, the root node of the view
        is always a BLOCK containing the rules of the executed TextMarker
        script. Additionally, if a rule calls a different TextMarker file,
        then the root block of that file is a child of that rule. The
        selection of a rule in this view directly changes the information
        visualized in the other views.
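        As an illustration of this nesting, consider the following sketch of
        a script fragment using the BLOCK construct of the TextMarker
        language (the block name forEachSentence and the type Sentence are
        made up for this example and assumed to be annotated elsewhere):
        <programlisting><![CDATA[
LIST FirstNameList = 'FirstNames.txt';
DECLARE Sentence, FirstName;
BLOCK(forEachSentence) Sentence{} {
    Document{-> MARKFAST(FirstName, FirstNameList)};
}
]]></programlisting>
        In the Applied Rules tree, the MARKFAST rule would appear as a child
        node of the block "forEachSentence", which is itself a child of the
        root block named after the script file.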
      </para>
    </section>
    <section id="ugr.tools.tm.views.selected">
      <title>Selected Rules</title>
      <para>
        This view is very similar to the Applied Rules view, but displays
        only rules and blocks for a given selection. If the user clicks on
        the document, a view like the Applied Rules view is generated that
        contains only the elements that affect that position in the document.
        The Rule Elements view then only contains match information for that
        position, but the result of the rule element match is still
        displayed.
      </para>
    </section>

    <section id="ugr.tools.tm.views.rulelist">
      <title>Rule List</title>
      <para>
        This view is very similar to the Applied Rules view and the Selected
        Rules view, but displays only rules and no blocks for a given
        selection. If the user clicks on the document, a list of rules is
        generated that matched or tried to match at that position in the
        document. The Rule Elements view then only contains match information
        for that position, but the result of the rule element match is still
        displayed. Additionally, this view provides a text field for
        filtering the rules: only those rules remain that contain the entered
        text in their verbalization.
      </para>
    </section>

    <section id="ugr.tools.tm.views.matched">
      <title>Matched Rules</title>
      <para>
        If a rule is selected in the Applied Rules view, this view displays
        the instances (text passages) where this rule matched.
      </para>
    </section>

    <section id="ugr.tools.tm.views.failed">
      <title>Failed Rules</title>
      <para>
        If a rule is selected in the Applied Rules view, this view displays
        the instances (text passages) where this rule failed to match.
      </para>
    </section>

    <section id="ugr.tools.tm.views.elements">
      <title>Rule Elements</title>
      <para>
        If a successful or failed rule match is selected in the Matched Rules
        view or the Failed Rules view, this view contains a listing of the
        rule elements and their conditions. Detailed information is available
        on what text each rule element matched and which conditions evaluated
        to true.
      </para>
    </section>

    <section id="ugr.tools.tm.views.statistics">
      <title>Statistics</title>
      <para>
        This view displays the conditions and actions of the TextMarker
        language that were used. Three numbers are given for each element:
        the total execution time, the number of executions and the time per
        execution.
      </para>
    </section>
    <section id="ugr.tools.tm.views.fp">
      <title>False Positive</title>
      <para>
      </para>
    </section>

    <section id="ugr.tools.tm.views.fn">
      <title>False Negative</title>
      <para>
      </para>
    </section>

    <section id="ugr.tools.tm.views.tp">
      <title>True Positive</title>
      <para>

      </para>
    </section>
  </section>
  <section id="ugr.tools.tm.testing">
    <title>Testing</title>
    <para>
      The TextMarker software comes bundled with its own testing environment
      that allows you to test and evaluate TextMarker scripts. It provides
      full back-end testing capabilities and allows you to examine test
      results in detail. As a product of the testing operation, a new
      document file is created, and detailed information on how well the
      script performed in the test is added to this document.
    </para>
    <section id="ugr.tools.tm.testing.overview">
      <title>Overview</title>
      <para>
        The testing procedure compares a previously annotated gold standard
        file with the result of the selected TextMarker script, using an
        evaluator.
        The evaluators compare the offsets of annotations in both documents
        and, depending on the evaluator, mark a result document with true
        positive, false positive or false negative annotations. Afterwards,
        the F1 score is calculated for the whole set of tests, for each test
        file and for each type in the test file.
        The testing environment contains the following parts:
        <itemizedlist>
          <listitem>
            <para>Main view</para>
          </listitem>
          <listitem>
            <para>Result views: true positive, false positive and false
              negative view
            </para>
          </listitem>
          <listitem>
            <para>Preference page</para>
          </listitem>
        </itemizedlist>
        <screenshot>
          <mediaobject>
            <imageobject>
              <imagedata scale="80" format="PNG"
                fileref="&imgroot;Screenshot_main.png" />
            </imageobject>
            <textobject>
              <phrase>Eclipse with open TextMarker and testing environment.
              </phrase>
            </textobject>
          </mediaobject>
        </screenshot>
        All control elements that are needed for the interaction with the
        testing environment are located in the main view. This is also where
        test files can be selected and information on how well the script
        performed is displayed. During the testing process, a result CAS file
        is produced that contains new annotation types such as true positives
        (tp), false positives (fp) and false negatives (fn). While the result
        .xmi file is displayed in the script editor, additional views allow
        easy navigation through the new annotations. Additional tree views,
        like the true positive view, display the corresponding annotations in
        a hierarchical structure. This allows easy tracing of the results
        inside the testing document. A preference page allows customization
        of the behavior of the testing plug-in.
      </para>
      <section id="ugr.tools.tm.testing.overview.main">
        <title>Main View</title>
        <para>
          The following picture shows a close-up of the testing environment's
          main view.
          The toolbar contains all buttons needed to operate the plug-in. The
          first line shows the name of the script that is going to be tested
          and a combo box in which the view that should be tested is
          selected. To the right are fields that show some basic information
          about the results of the test run.
          Below, on the left, the test list is located. This list contains
          the different test files. Right beside it, you will find a table
          with statistical information. It shows the total tp, fp and fn
          counts, as well as precision, recall and F1 score, for every test
          file and for every type in each file.
          <screenshot>
            <mediaobject>
              <imageobject>
                <imagedata scale="80" format="PNG"
                  fileref="&imgroot;Screenshot_testing_desc_3_resize.png" />
              </imageobject>
              <textobject>
                <phrase>The main view of the testing environment.</phrase>
              </textobject>
            </mediaobject>
          </screenshot>
        </para>
      </section>
      <section id="ugr.tools.tm.testing.overview.result">
        <title>Result Views</title>
        <para>
          These views add additional information to the CAS view once a
          result file is opened. Each view displays one of the following
          annotation types in a hierarchical tree structure: true positives,
          false positives and false negatives. Adding a check mark to one of
          the annotations in a result view highlights the annotation in the
          CAS Editor.
          <screenshot>
            <mediaobject>
              <imageobject>
                <imagedata scale="80" format="PNG"
                  fileref="&imgroot;Screenshot_result.png" />
              </imageobject>
              <textobject>
                <phrase>The result views of the testing environment.</phrase>
              </textobject>
            </mediaobject>
          </screenshot>
        </para>
      </section>
      <section id="ugr.tools.tm.testing.overview.preferences">
        <title>Preference Page</title>
        <para>
          The preference page offers a few options that modify the plug-in's
          general behavior. For example, the preloading of previously
          collected result data can be turned off, should it produce too long
          a loading time.
          An important option on the preference page is the evaluator. By
          default, the "exact evaluator" is selected, which compares the
          offsets of the annotations contained in the file produced by the
          selected script with the annotations in the test file. Other
          evaluators compare annotations in different ways.
          <screenshot>
            <mediaobject>
              <imageobject>
                <imagedata scale="80" format="PNG"
                  fileref="&imgroot;Screenshot_preferences.png" />
              </imageobject>
              <textobject>
                <phrase>The preference page of the testing environment.
                </phrase>
              </textobject>
            </mediaobject>
          </screenshot>
        </para>
      </section>
      <section id="ugr.tools.tm.testing.overview.project">
        <title>The TextMarker Project Structure</title>
        <para>
          The picture shows the TextMarker script explorer. Every TextMarker
          project contains a folder called "test". This folder is the default
          location for the test files. In this folder, each script file has
          its own subfolder with a relative path equal to the script's
          package path in the "script" folder. This subfolder contains the
          test files. In every script's test folder you will also find a
          result folder with the results of the tests. Should you use test
          files from another location in the file system, the results will be
          saved in the "temp" subfolder of the project's "test" folder. All
          files in the "temp" folder are deleted once Eclipse is closed.
          <screenshot>
            <mediaobject>
              <imageobject>
                <imagedata scale="80" format="PNG"
                  fileref="&imgroot;folder_struc_sep_desc_cut.png" />
              </imageobject>
              <textobject>
                <phrase>Script Explorer with the test folder expanded.</phrase>
              </textobject>
            </mediaobject>
          </screenshot>
        </para>
      </section>
    </section>
    <section id="ugr.tools.tm.testing.usage">
      <title>Usage</title>
      <para>
        This section demonstrates how to use the testing environment. It
        shows the basic actions needed to perform a test run.
      </para>
      <para>
        Preparing Eclipse:
        The testing environment provides its own perspective called
        "TextMarker Testing". It displays the main view as well as the
        different result views on the right-hand side. It is encouraged to
        use this perspective, especially when working with the testing
        environment for the first time.
      </para>
      <para>
        Selecting a script for testing:
        TextMarker always tests the script that is currently open in the
        script editor. Should another editor be open, for example a Java
        editor with some Java class being displayed, you will see that the
        testing view is not available.
      </para>
      <para>
        Creating a test file:
        A test file is a previously annotated .xmi file that can be used as a
        gold standard for the test. The testing environment provides no
        additional tools to create such a file; instead, the tools that the
        TextMarker system already provides can be used.
      </para>
      <para>
        Selecting a test file:
        Test files can be added to the test list by simply dragging them from
        the Script Explorer into the test file list. Depending on the
        settings on the preference page, test files from a script's "test"
        folder might already be loaded into the list. A different way to add
        test files is to use the "Add files from folder" button, which can be
        used to add all .xmi files from a selected folder. The "del" key can
        be used to remove files from the test list.
      </para>
      <para>
        Selecting a CAS view to test:
        TextMarker supports different views that allow you to operate on
        different levels of a document. The InitialView is selected by
        default; however, you can also switch the evaluation to another view
        by typing the view's name into the list or selecting the view you
        wish to use from the list.
      </para>
      <para>
        Selecting the evaluator:
        The testing environment supports different evaluators that allow a
        sophisticated analysis of the behavior of a TextMarker script.
        The evaluator can be chosen on the testing environment's preference
        page. The preference page can be opened either through the menu or by
        clicking the blue preference button in the testing view's toolbar.
        The default evaluator is the "Exact CAS Evaluator", which compares
        the offsets of the annotations between the test file and the file
        annotated by the tested script.
      </para>
      <para>
        Excluding types:
        During a test run it might be convenient to disable testing for
        specific types like punctuation or tags. The "exclude types" button
        opens a dialog where all types can be selected that should not be
        considered in the test.
      </para>
      <para>
        Running the test:
        A test run can be started by clicking on the green start button in
        the toolbar.
      </para>
      <para>
        Result overview:
        After every test run, the testing main view displays some information
        on how well the script did. It displays the overall numbers of true
        positive, false positive and false negative annotations of all result
        files, as well as an overall F1 score. Furthermore, a table is
        displayed that contains the overall statistics of the selected test
        file as well as statistics for every single type in the test file.
        The information displayed comprises true positives, false positives,
        false negatives, precision, recall and F1 measure.
      </para>
      <para>
        The testing environment also supports the export of the overall data
        in the form of a comma-separated table. Clicking the export
        evaluation data button opens a dialog window that contains this
        table. The text in this table can be copied and easily imported into
        OpenOffice.org or MS Excel.
      </para>
      <para>
        Result files:
        When running a test, the evaluator creates a new result .xmi file and
        adds new true positive, false positive and false negative
        annotations.
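        For reference, the precision, recall and F1 values reported in the
        result overview follow the standard definitions based on these tp,
        fp and fn counts; a small worked example:
        <programlisting><![CDATA[
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)

Example: tp = 8, fp = 2, fn = 4
precision = 8 / 10 = 0.80
recall    = 8 / 12 = 0.67
f1        = 2 * 0.80 * 0.67 / (0.80 + 0.67) = 0.73
]]></programlisting>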
      By clicking on a file in the test-file list, you can open the
      corresponding result .xmi file in the TextMarker script editor.
      When opening a result file in the script explorer, additional
      views open that allow easy access to and browsing of the
      additional debugging annotations.
      <screenshot>
        <mediaobject>
          <imageobject>
            <imagedata scale="80" format="PNG"
              fileref="&imgroot;Screenshot_Result_TP_desc_close_cut.png" />
          </imageobject>
          <textobject>
            <phrase>Open result file and selected true positive annotation
              in the true positive view.
            </phrase>
          </textobject>
        </mediaobject>
      </screenshot>
    </para>
  </section>
  <section id="ugr.tools.tm.testing.evaluators">
    <title>Evaluators</title>
    <para>
      When testing a CAS file, the system compares the offsets of the
      annotations of a previously annotated gold standard file with the
      offsets of the annotations of the result file the script produced.
      Evaluators are responsible for comparing the annotations in the
      two CAS files. These evaluators implement different methods and
      strategies for comparing the annotations. An extension point is
      also provided that allows the easy implementation of new
      evaluators.
    </para>
    <para>
      Exact Match Evaluator:
      The Exact Match Evaluator compares the offsets of the annotations
      in the result and the gold standard file. Any difference is marked
      with either a false positive or a false negative annotation.
    </para>
    <para>
      Partial Match Evaluator:
      The Partial Match Evaluator compares the offsets of the
      annotations in the result and the gold standard file, but allows
      differences at the beginning or the end of an annotation. For
      example, "corresponding" and "corresponding " are not counted as
      an error.
    </para>
    <para>
      Core Match Evaluator:
      The Core Match Evaluator accepts annotations that share a core
      expression.
      In this context a core expression is at least four characters long
      and starts with a capitalized letter. For example, the two
      annotations "L404-123-421" and "L404-321-412" are considered a
      true positive match, because "L404" is a core expression that is
      contained in both annotations.
    </para>
    <para>
      Word Accuracy Evaluator:
      This evaluator compares the labels of all words/numbers in an
      annotation, where the label equals the type of the annotation. As
      a consequence, each word or number that is not part of the
      annotation is counted as a single false negative. Consider, for
      example, the sentence: "Christmas is on the 24.12 every year."
      The script labels "Christmas is on the 12" as a single sentence,
      while the test file correctly marks the whole sentence with a
      single sentence annotation. While the Exact CAS Evaluator, for
      example, would only assign a single false negative annotation, the
      Word Accuracy Evaluator marks every missing word or number as a
      single false negative.
    </para>
    <para>
      Template Only Evaluator:
      This evaluator compares the offsets of the annotations and the
      features that have been created by the script. For example, the
      text "Alan Mathison Turing" is marked with the author annotation,
      and "author" contains two features: "FirstName" and "LastName". If
      the script now creates an author annotation with only one feature,
      the annotation is marked as a false positive.
    </para>
    <para>
      Template on Word Level Evaluator:
      The Template on Word Level Evaluator compares the offsets of the
      annotations. In addition, it also compares the features and
      feature structures and the values stored in the features. For
      example, the annotation "author" might have features like
      "FirstName" and "LastName". The author's name is "Alan Mathison
      Turing" and the script correctly assigns the author annotation.
      The features assigned by the script are "FirstName : Alan",
      "LastName : Mathison", while the correct feature values would be
      "FirstName : Alan", "LastName : Turing". In this case the
      evaluator marks the annotation as a false positive, since the
      feature values differ.
    </para>
  </section>

  </section>
  <section id="ugr.tools.tm.textruler">
    <title>TextRuler</title>
    <para>
      Using the knowledge engineering approach, a knowledge engineer
      normally writes handcrafted rules to create a domain-dependent
      information extraction application, often supported by a gold
      standard. When starting the engineering process for the
      acquisition of the extraction knowledge for possibly new slots or,
      more generally, for new concepts, machine learning methods are
      often able to offer support in an iterative engineering process.
      This section gives a conceptual overview of the process model for
      the semi-automatic development of rule-based information
      extraction applications.
    </para>
    <para>
      First, a suitable set of documents that contain the text fragments
      with interesting patterns needs to be selected and annotated with
      the target concepts. Then, the knowledge engineer chooses and
      configures the methods for automatic rule acquisition to the best
      of his knowledge for the learning task: lambda expressions based
      on tokens and linguistic features, for example, differ in their
      application domain from wrappers that process generated HTML
      pages.
    </para>
    <para>
      Furthermore, parameters like the window size defining relevant
      features need to be set to an appropriate level. Before the
      annotated training documents form the input of the learning task,
      they are enriched with features generated by the partial rule set
      of the developed application. The results of the methods, that is,
      the learned rules, are proposed to the knowledge engineer for the
      extraction of the target concept.
    </para>
    <para>
      The knowledge engineer has different options to proceed: If the
      quality, amount or generality of the presented rules is not
      sufficient, then additional training documents need to be
      annotated or additional rules have to be handcrafted to provide
      more features in general or more appropriate features. Rules or
      rule sets of high quality can be modified, combined or generalized
      and transferred to the rule set of the application in order to
      support the extraction task of the target concept. In the case
      that the methods did not learn reasonable rules at all, the
      knowledge engineer proceeds with writing handcrafted rules.
    </para>
    <para>
      Having gathered enough extraction knowledge for the current
      concept, the semi-automatic process is iterated and the focus is
      moved to the next concept until the development of the application
      is completed.
    </para>
    <section id="ugr.tools.tm.textruler.learner">
      <title>Available Learners</title>
      <para>
        Overview:
      </para>
      <informaltable>
        <tgroup cols="5">
          <thead>
            <row>
              <entry>Name</entry>
              <entry>Strategy</entry>
              <entry>Document</entry>
              <entry>Slots</entry>
              <entry>Status</entry>
            </row>
          </thead>
          <tbody>
            <row>
              <entry>BWI (1)</entry>
              <entry>Boosting, Top Down</entry>
              <entry>Struct, Semi</entry>
              <entry>Single, Boundary</entry>
              <entry>Planning</entry>
            </row>
            <row>
              <entry>LP2 (2)</entry>
              <entry>Bottom Up Cover</entry>
              <entry>All</entry>
              <entry>Single, Boundary</entry>
              <entry>Prototype</entry>
            </row>
            <row>
              <entry>RAPIER (3)</entry>
              <entry>Top Down/Bottom Up Compr.</entry>
              <entry>Semi</entry>
              <entry>Single</entry>
              <entry>Experimental</entry>
            </row>
            <row>
              <entry>WHISK (4)</entry>
              <entry>Top Down Cover</entry>
              <entry>All</entry>
              <entry>Multi</entry>
              <entry>Prototype</entry>
            </row>
            <row>
              <entry>WIEN (5)</entry>
              <entry>CSP</entry>
              <entry>Struct</entry>
              <entry>Multi, Rows</entry>
              <entry>Prototype</entry>
            </row>
          </tbody>
        </tgroup>
      </informaltable>
      <para>
        <itemizedlist>
          <listitem>
            <para>
              Strategy: The strategies used by the learning methods are
              commonly covering algorithms.
            </para>
          </listitem>
          <listitem>
            <para>
              Document: The type of the document may be "free" as in
              newspapers, or "semi" and "struct" like HTML pages.
            </para>
          </listitem>
          <listitem>
            <para>
              Slots: A slot refers to a single annotation that
              represents the goal of the learning task. Some rules are
              able to create several annotations at once in the same
              context (multi-slot). However, only single slots are
              supported by the current implementations.
            </para>
          </listitem>
          <listitem>
            <para>
              Status: The current status of the implementation in the
              TextRuler framework.
            </para>
          </listitem>
        </itemizedlist>
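    </para>
    <para>
      The covering strategy shared by several of these methods can be
      sketched as a simple separate-and-conquer loop. The following
      Python fragment is an illustrative toy model, not TextRuler code:
      a "rule" is just a set of required features, and find_best_rule is
      a hypothetical stand-in for the far more elaborate rule search of
      the real learners.
    </para>

```python
# Toy sketch of a covering (separate-and-conquer) algorithm: repeatedly
# learn the best rule, keep it, and remove the positive examples it
# covers, until all positives are covered or no safe rule remains.

def covers(rule, example):
    # A rule is a set of required features; it covers an example if the
    # example exhibits all of them.
    return rule <= example

def find_best_rule(positives, negatives):
    # Hypothetical search: pick the single feature covering the most
    # positive examples while covering no negative example.
    candidates = set().union(*positives)
    best, best_score = None, 0
    for feature in sorted(candidates):
        rule = {feature}
        if any(covers(rule, n) for n in negatives):
            continue
        score = sum(covers(rule, p) for p in positives)
        if score > best_score:
            best, best_score = rule, score
    return best

def learn_rule_set(positives, negatives):
    rules, remaining = [], list(positives)
    while remaining:
        rule = find_best_rule(remaining, negatives)
        if rule is None:          # no rule left that avoids all negatives
            break
        rules.append(rule)
        remaining = [p for p in remaining if not covers(rule, p)]
    return rules

print(learn_rule_set([{"cap", "noun"}, {"cap"}, {"num"}], [{"noun"}]))
```

    <para>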
      </para>
      <para>
        Publications:
      </para>
      <para>
        (1) Dayne Freitag and Nicholas Kushmerick. Boosted Wrapper
        Induction. In AAAI/IAAI, pages 577–583, 2000.
      </para>
      <para>
        (2) F. Ciravegna. (LP)2, Rule Induction for Information
        Extraction Using Linguistic Constraints. Technical Report
        CS-03-07, Department of Computer Science, University of
        Sheffield, Sheffield, 2003.
      </para>
      <para>
        (3) Mary Elaine Califf and Raymond J. Mooney. Bottom-up
        Relational Learning of Pattern Matching Rules for Information
        Extraction. Journal of Machine Learning Research, 4:177–210,
        2003.
      </para>
      <para>
        (4) Stephen Soderland, Claire Cardie, and Raymond Mooney.
        Learning Information Extraction Rules for Semi-Structured and
        Free Text. In Machine Learning, volume 34, pages 233–272, 1999.
      </para>
      <para>
        (5) N. Kushmerick, D. Weld, and B. Doorenbos. Wrapper Induction
        for Information Extraction. In Proc. International Joint
        Conference on Artificial Intelligence, 1997.
      </para>
      <para>
        BWI:
        BWI (Boosted Wrapper Induction) uses boosting techniques to
        improve the performance of simple pattern-matching single-slot
        boundary wrappers (boundary detectors). Two sets of detectors
        are learned: the "fore" and the "aft" detectors. Weighted by
        their confidences and combined with a slot length histogram
        derived from the training data, they can classify a given pair
        of boundaries within a document. BWI can be used for structured,
        semi-structured and free text. The patterns are token-based with
        special wildcards for more general rules.
      </para>
      <para>
        Implementations:
        No implementations are available yet.
      </para>
      <para>
        Parameters:
        No parameters are available yet.
      </para>
      <para>
        LP2:
        This method operates on all three kinds of documents. It learns
        separate rules for the beginning and the end of a single slot.
        So-called tagging rules insert boundary SGML tags, and
        additionally induced correction rules shift misplaced tags to
        their correct positions in order to improve precision. The
        learning strategy is a bottom-up covering algorithm. It starts
        by creating a specific seed instance with a window of w tokens
        to the left and right of the target boundary and searches for
        the best generalization. Other linguistic NLP features can be
        used in order to generalize over the flat word sequence.
      </para>
      <para>
        Implementations:
        LP2 (naive):
        LP2 (optimized):
      </para>
      <para>
        Parameters:
        Context Window Size (to the left and right):
        Best Rules List Size:
        Minimum Covered Positives per Rule:
        Maximum Error Threshold:
        Contextual Rules List Size:
      </para>
      <para>
        RAPIER:
        RAPIER induces single-slot extraction rules for semi-structured
        documents. The rules consist of three patterns: a pre-filler, a
        filler and a post-filler pattern. Each can hold several
        constraints on tokens and their corresponding POS tag and
        semantic information. The algorithm uses a bottom-up compression
        strategy, starting with a most specific seed rule for each
        training instance. This initial rule base is compressed by
        randomly selecting rule pairs and searching for the best
        generalization. Considering two rules, the least general
        generalization (LGG) of the slot fillers is created and
        specialized by adding rule items to the pre- and post-filler
        until the new rules operate well on the training set. The best
        of the k rules (k-beam search) is added to the rule base and all
        empirically subsumed rules are removed.
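      </para>
      <para>
        The least general generalization step can be illustrated on flat
        token sequences. The following Python fragment is a toy model,
        not RAPIER's actual pattern language: it merely merges two
        equal-length fillers element by element, keeping equal tokens
        and turning differing tokens into a disjunction.
      </para>

```python
# Toy sketch of a least general generalization (LGG) of two flat token
# patterns: equal tokens are kept, differing tokens become a disjunction.
# RAPIER's real pattern language (wildcards, POS constraints, variable
# lengths) is far richer; this only illustrates the idea.

def lgg(pattern_a, pattern_b):
    if len(pattern_a) != len(pattern_b):
        return None               # simplification: equal-length patterns only
    merged = []
    for a, b in zip(pattern_a, pattern_b):
        merged.append(a if a == b else (a, b))   # (a, b) = disjunction
    return merged

# Two specific slot fillers from hypothetical training instances:
print(lgg(["located", "in", "Austin"], ["located", "in", "Dallas"]))
```

      <para>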
      </para>
      <para>
        Implementations:
        RAPIER:
      </para>
      <para>
        Parameters:
        Maximum Compression Fail Count:
        Internal Rules List Size:
        Rule Pairs for Generalizing:
        Maximum 'No improvement' Count:
        Maximum Noise Threshold:
        Minimum Covered Positives Per Rule:
        PosTag Root Type:
        Use All 3 GenSets at Specialization:
      </para>
      <para>
        WHISK:
        WHISK is a multi-slot method that operates on all three kinds of
        documents and learns single- or multi-slot rules that look
        similar to regular expressions. The top-down covering algorithm
        begins with the most general rule and specializes it by adding
        single rule terms until the rule makes no errors on the training
        set. Domain-specific classes or linguistic information obtained
        by a syntactic analyzer can be used as additional features. The
        exact definition of a rule term (e.g., a token) and of a problem
        instance (e.g., a whole document or a single sentence) depends
        on the operating domain and document type.
      </para>
      <para>
        Implementations:
        WHISK (token):
        WHISK (generic):
      </para>
      <para>
        Parameters:
        Window Size:
        Maximum Error Threshold:
        PosTag Root Type:
      </para>
      <para>
        WIEN:
        WIEN is the only method listed here that operates on highly
        structured texts only. It induces so-called wrappers that anchor
        the slots by the structured context around them. The HLRT (head
        left right tail) wrapper class, for example, can determine and
        extract several multi-slot templates by first separating the
        important information block from unimportant head and tail
        portions and then extracting multiple data rows from table-like
        data structures in the remaining document. Inducing a wrapper is
        done by solving a CSP for all possible pattern combinations from
        the training data.
      </para>
      <para>
        Implementations:
        WIEN:
      </para>
      <para>
        Parameters:
        No parameters are available.
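      </para>
      <para>
        The HLRT wrapper class can be illustrated with a small sketch.
        The page and the delimiter strings below are hypothetical
        examples, not output of WIEN; a real wrapper's delimiters are
        induced from training pages by the CSP mentioned above.
      </para>

```python
# Toy sketch of applying an HLRT (head-left-right-tail) wrapper: skip the
# page head, stop at the tail, and repeatedly extract one data row per
# pass over the (left, right) delimiter pairs. All delimiters here are
# hypothetical; WIEN induces them from annotated training pages.

def hlrt_extract(page, head, left_right, tail):
    # head ends the page header; tail starts the footer; left_right holds
    # one (left, right) delimiter pair per slot of a data row.
    body = page.split(head, 1)[1].split(tail, 1)[0]
    rows, pos = [], 0
    while True:
        row = []
        for left, right in left_right:
            start = body.find(left, pos)
            if start == -1:
                return rows       # no further data row
            start += len(left)
            end = body.index(right, start)
            row.append(body[start:end])
            pos = end + len(right)
        rows.append(row)

page = ("<html><b>Country codes</b><ul>"
        "<li><i>Congo</i> <tt>242</tt></li>"
        "<li><i>Spain</i> <tt>34</tt></li>"
        "</ul><hr></html>")
print(hlrt_extract(page, "<ul>", [("<i>", "</i>"), ("<tt>", "</tt>")], "<hr>"))
```

      <para>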
      </para>
    </section>
  </section>

</chapter>