RE: External rule files
Thanks Dave. I am not an XML expert. I understand the phrase 'define a
transform' to mean 'specify a mapping'. If my understanding is not correct,
please tell me.

There is not a 1:1 mapping between the term checker postags and the LT
postags. Thus, I cannot define a transform for all the postags, but I can
define a transform for some of them. However, there are possible problems,
as the examples below show.

Example 1. Ignoring technical verbs that LT does not 'know', a verb that
has the postag STE_VERB_LEXICAL_BASE usually has the LT postag VB. However,
although the verb 'do' has the LT postag VB, it does not have the postag
STE_VERB_LEXICAL_BASE. (It has the postags STE_VERB_AUXILIARY_DO and
STE_VERB_AUXILIARY_CAN_DO_MUST_WILL.) Thus, without excluding 'do' from a
rule, you cannot map STE_VERB_LEXICAL_BASE to VB.

Example 2. With an approved 2-word plural noun, the first word has the
postag STE_TN_NOUN_MULTI_WORD_PLURAL_1 and the second word has the postag
STE_TN_NOUN_MULTI_WORD_PLURAL_2. (TN is an abbreviation of 'Technical
Name', which is a term from the STE specification.) The 3 terms that follow
are approved 2-word nouns. The LT postags that relate to nouns are
different for the first word. The LT postags for the nouns are in brackets:

* circuit breakers (NN, NNS)
* duty cycles (NN:UN, NNS)
* operating systems (-, NNS)

In a related e-mail, Marcin wrote:

> Hm, that means I will have to look at them and manually create a generic
> version, if that only is possible. That is already a big help for me, as
> it's not trivial to find regularities that create good disambiguation
> rules.

Marcin, if a partial mapping helps you, let me know, and I will define one.
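[Editor's note: one way to express the Example 1 exclusion directly in an LT
pattern is to match the LT postag VB but add an exception for the token
'do'. The fragment below is an illustrative sketch of that idea, not a rule
taken from the STE rule set:]

```xml
<!-- Sketch: an approximate LT equivalent of a token that has the postag
     STE_VERB_LEXICAL_BASE. Matches tokens tagged VB, but not 'do'. -->
<token postag="VB">
  <exception>do</exception>
</token>
```

[Even so, the 'do' exception is not sufficient for a full mapping: the
technical verbs that LT does not 'know' (the opening caveat of Example 1)
still have no LT postag to map to.]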
Regards,

Mike Unwalla

Contact: www.techscribe.co.uk/techw/contact.htm

-----Original Message-----
From: Dave Pawson [mailto:dave.paw...@gmail.com]
Sent: 05 April 2014 19:50
To: development discussion for LanguageTool
Subject: Re: External rule files

On 5 April 2014 17:11, Mike Unwalla m...@techscribe.co.uk wrote:
snip
> Most of the rules that I developed are specifically for STE and contain
> customized postags. Example:
>
> <token postag_regexp="yes" postag="STE_VERB_LEXICAL_BASE|STE_TVb_BASE|STE_TVb_2_WORD_BASE|PROJECT_TVb_BASE|PROJECT_TVb_2_WORD_BASE"/>
>
> The STE rules must be 'fail safe'. To develop rules that give correct
> results with all words in the English lexicon is difficult.

If you can define a transform I'll write a stylesheet to do it (perhaps
leaving the extra tags as comments)

HTH
snip

------------------------------------------------------------------------------
Put Bad Developers to Shame. Dominate Development with Jenkins Continuous
Integration. Continuously Automate Build, Test & Deployment. Start a new
project now. Try Jenkins in the cloud.
http://p.sf.net/sfu/13600_Cloudbees_APR
_______________________________________________
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel
RE: External rule files
> and how you want these in the output, we can start from there.

I think that we have a miscommunication. I don't need a mapping from the
STE postags to the LT postags. I created the STE postags for the term
checker because I can't do what I want to do with only the LT postags.

> I need the XML source markup (is the source XML?)

The source is XML. It is available from
www.simplified-english.co.uk/installation.html in the file
term-checker-evaluation--mm-dd.zip. (I do not give the current file name in
this e-mail because the .zip file name contains a date, and I put only the
most recent version of the file on the website.) But, if 'source markup'
means a marked-up document in which terms are annotated with a postag, then
no, I do not have source markup.

> I'm not sure I understand this... If you can express the conditions, then
> I can write a transform based on those conditions.

Yes. (But I don't understand why someone would want this transformation.)

> E.g. (guessing)
> input STE_VERB_LEXICAL_BASE -> VB
> input do -> VB
> Although that sounds too simple?

In principle, yes. But the mappings are much more complex. Also, there are
verbs that LT does not 'know' as verbs, such as the approved verb 'safety'.
And there is the not-approved verb 'safety-clip', for which there is no LT
postag (except for what it finds with the chunker
[http://wiki.languagetool.org/using-chunks]).

> then maps to ... Again I do not understand the English explanation,
> perhaps an XML example? following terms - are these XML children (nested
> within the parent) or siblings?

Sorry, I don't know how to give an XML example. There is no formal XML
specification for the STE postags. I used the method that is in 'Adding
only POS tags or tokens'
(http://wiki.languagetool.org/developing-a-disambiguator#toc8).
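[Editor's note: for readers who do not know the method, 'Adding only POS
tags or tokens' attaches an extra postag to a matched token in a
disambiguation file, without removing the tags the token already has. The
rule below is an invented illustration of that shape; the id, name, and
choice of token are not from the STE files:]

```xml
<!-- Illustrative sketch only: a disambiguation rule that adds a custom
     postag to the token 'safety', keeping its existing readings. -->
<rule id="STE_SAFETY_SKETCH" name="Add an STE postag to 'safety'">
  <pattern>
    <token>safety</token>
  </pattern>
  <disambig action="add"><wd pos="STE_VERB_LEXICAL_BASE"/></disambig>
</rule>
```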
Regards,

Mike Unwalla

Contact: www.techscribe.co.uk/techw/contact.htm

-----Original Message-----
From: Dave Pawson [mailto:dave.paw...@gmail.com]
Sent: 07 April 2014 12:55
To: development discussion for LanguageTool
Subject: Re: External rule files

On 7 April 2014 11:08, Mike Unwalla m...@techscribe.co.uk wrote:
> Thanks Dave. I am not an XML expert. I understand the phrase 'define a
> transform' to mean 'specify a mapping'. If my understanding is not
> correct, please tell me.

That's right. As a trial, if you give me a few examples, and how you want
these in the output, we can start from there.

> There is not a 1:1 mapping between the term checker postags and the LT
> postags. Thus, I cannot define a transform for all the postags, but I can
> define a transform for some of them. However, there are possible problems
> as the examples below show.

I need the XML source markup (is the source XML?) XSLT works on XML in and
XML out.

> Example 1. Ignoring technical verbs that LT does not 'know', a verb that
> has the postag STE_VERB_LEXICAL_BASE usually has the LT postag VB.
> However, although the verb 'do' has the LT postag VB, it does not have
> the postag STE_VERB_LEXICAL_BASE. (It has the postags
> STE_VERB_AUXILIARY_DO and STE_VERB_AUXILIARY_CAN_DO_MUST_WILL.) Thus,
> without excluding 'do' from a rule, you cannot map STE_VERB_LEXICAL_BASE
> to VB.

I'm not sure I understand this... If you can express the conditions, then I
can write a transform based on those conditions. E.g. (guessing)

input STE_VERB_LEXICAL_BASE -> VB
input do -> VB

Although that sounds too simple?

> Example 2. With an approved 2-word plural noun, the first word has the
> postag STE_TN_NOUN_MULTI_WORD_PLURAL_1 and the second word has the postag
> STE_TN_NOUN_MULTI_WORD_PLURAL_2. (TN is an abbreviation of 'Technical
> Name', which is a term from the STE specification.) The 3 terms that
> follow are approved 2-word nouns. The LT postags that relate to nouns are
> different for the first word. The LT postags for nouns are in brackets:
>
> circuit breakers (NN, NNS)
> duty cycles (NN:UN, NNS)
> operating systems (-, NNS)

STE_TN_NOUN_MULTI_WORD_PLURAL_1 + STE_TN_NOUN_MULTI_WORD_PLURAL_2 (written
as

<xsl:template match="STE_TN_NOUN_MULTI_WORD_PLURAL_1[following-sibling::STE_TN_NOUN_MULTI_WORD_PLURAL_2[1]]">

) then maps to ... Again I do not understand the English explanation,
perhaps an XML example?

following terms - are these XML children (nested within the parent) or
siblings?

<p><child/></p>
<sibling/>

regards

--
Dave Pawson
XSLT XSL-FO FAQ. Docbook FAQ. http://www.dpawson.co.uk
Re: External rule files
On 7 April 2014 14:43, Mike Unwalla m...@techscribe.co.uk wrote:
>> and how you want these in the output, we can start from there.
>
> I think that we have a miscommunication. I don't need a mapping from the
> STE postags to the LT postags. I created the STE postags for the term
> checker because I can't do what I want to do with only the LT postags.

Yes, I think we do have a difference of understanding.

>> I need the XML source markup (is the source XML?)
>
> The source is XML. It is available from
> www.simplified-english.co.uk/installation.html in the file
> term-checker-evaluation--mm-dd.zip. (I do not give the current file name
> in this e-mail because the .zip file name contains a date, and I put only
> the most recent version of the file on the website.) But, if 'source
> markup' means a marked-up document in which terms are annotated with a
> postag, then no, I do not have source markup.

No, I was thinking of a mapping from the valid syntax of your form to that
which is required. Either a schema or a DTD. Examples of marked-up text
would suffice, just take longer?

>> I'm not sure I understand this... If you can express the conditions,
>> then I can write a transform based on those conditions.
>
> Yes. (But I don't understand why someone would want this transformation.)

My assumption. I may be wrong. You have many files marked up using schema A
(or simply a tagset A). You want to transform these files to use a more
recent LT tagset. If we can share an understanding of the tagsets, and how
to get from one to the other, I can help automate it.

>> E.g. (guessing)
>> input STE_VERB_LEXICAL_BASE -> VB
>> input do -> VB
>> Although that sounds too simple?
>
> In principle, yes. But the mappings are much more complex. Also, there
> are verbs that LT does not 'know' as verbs, such as the approved verb
> 'safety'. And there is the not-approved verb 'safety-clip', for which
> there is no LT postag (except for what it finds with the chunker
> [http://wiki.languagetool.org/using-chunks]).

No problem. For 'unknowns' I will mark the items as
<unknown original="xxx"/> where xxx is the source markup.

>> then maps to ... Again I do not understand the English explanation,
>> perhaps an XML example? following terms - are these XML children (nested
>> within the parent) or siblings?
>
> Sorry, I don't know how to give an XML example. There is no formal XML
> specification for the STE postags. I used the method that is in 'Adding
> only POS tags or tokens'
> (http://wiki.languagetool.org/developing-a-disambiguator#toc8).

The link points to XML? If that is not available, then XSLT will not help?

regards (Oh the joys of miscommunication :-)

Dave P

snip
Re: External rule files
On 2014-04-07 15:58, Dave Pawson wrote:
> On 7 April 2014 14:43, Mike Unwalla m...@techscribe.co.uk wrote:
snip
> My assumption. I may be wrong. You have many files marked up using schema
> A (or simply a tagset A). You want to transform these files to use a more
> recent LT tagset. If we can share an understanding of the tagsets, and
> how to get from one to the other, I can help automate it.

No, Mike does not want to transform or retag his files. He's using a
specialized tagset, and that's fine. I simply want to steal some of his
disambiguation rules, but for that, I'll have to use my brain instead of my
Ctrl+C/Ctrl+V ;)

Best,
Marcin

snip
RE: External rule files
Hi All,

> But maybe the standard LT would benefit from your rules as well?

I am happy to donate all or some of the rules that I developed for STE
issue 3. The most recent version of the rules is on
www.simplified-english.co.uk/installation.html.

Most of the rules that I developed are specifically for STE and contain
customized postags. Example:

<token postag_regexp="yes" postag="STE_VERB_LEXICAL_BASE|STE_TVb_BASE|STE_TVb_2_WORD_BASE|PROJECT_TVb_BASE|PROJECT_TVb_2_WORD_BASE"/>

The STE rules must be 'fail safe'. To develop rules that give correct
results with all words in the English lexicon is difficult.

> I don't want to make the rule set for the journal part of the standard
> distribution, as they're quite specific. At the same time, I want to use
> standard rules. So I simply want to open the additional rule set before I
> make the check.

This is similar to my situation. Also, when I check a text, I use more than
one rule set. The STE rules that are on the simplified-english website are
the 'core', as defined by the STEMG (www.asd-ste100.org). For each project,
I have a grammar file and a disambiguation file
(www.simplified-english.co.uk/design.html has a picture). When I check a
text, I use both the core STE files and the project files.

Some scenarios for the use of user files are as follows:

* Single-user environment. User wants to use standalone LT and LT in
OpenOffice. Currently, the user must copy/paste the files from the
standalone directory to an OpenOffice directory. (Testrules is available
only with standalone; thus, to develop user rules, that version of LT is
always necessary.)
* Multi-user environment. Grammar and disambiguation files are on a server.
LT accesses these files only.
* Multi-user environment. Grammar and disambiguation files are on a server.
LT simultaneously accesses these files and project-specific grammar files
that are on a user's computer.

Possibly, one option is to split the disambiguation file into 2 parts. (And
similarly with the grammar file.) The first part is only a 'wrapper', which
refers to the default LT disambiguation file:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE rules [
  <!ENTITY DefaultLTDisambiguation SYSTEM "org/languagetool/resource/en/disambiguation-default.xml">
]>
<rules lang="en" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:noNamespaceSchemaLocation="http://svn.code.sf.net/p/languagetool/code/trunk/languagetool/languagetool-core/src/main/resources/org/languagetool/resource/disambiguation.xsd">
  &DefaultLTDisambiguation;
  <!-- The content of the current disambiguation.xml, but without the
       rules element -->
  <!-- An explanation of how to add external entities goes here. -->
</rules>

'Out of the box', LT works as usual. However, a user can edit the 'wrapper'
disambiguation file to make LT use other rule sets.

Possible problem 1: Because the user can install LT anywhere, the path for
DefaultLTDisambiguation must be relative to the installation directory.
But, that can cause a validating XML editor to show an error and not open
the file. If the user wants to use a validating XML editor, the solution is
to edit the file with the full path.

Possible problem 2: Dave Pawson suggested that XInclude is preferable to
entities (http://sourceforge.net/p/languagetool/mailman/message/32177932/).

Possible problem 3: Each time that the user updates LT, the user must edit
the 'wrapper' disambiguation file or copy/paste from the previous LT
version. (But, with the integrate attribute, presumably a user must specify
the location of the user file(s), so the same problem exists with that.)
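[Editor's note: as a point of comparison for possible problem 2, an
XInclude version of the same 'wrapper' idea might look like the sketch
below. It is untested: the file path is copied from the entity example, and
whether LT's loader resolves XInclude at all is exactly the open question
that Dave's suggestion would need to settle.]

```xml
<?xml version="1.0" encoding="utf-8"?>
<!-- Sketch: XInclude variant of the 'wrapper' disambiguation file.
     Untested assumption; paths and composition are illustrative. -->
<rules lang="en" xmlns:xi="http://www.w3.org/2001/XInclude">
  <!-- Pull the rule elements out of the default LT file. The xpointer
       selects the children of the included file's root rules element. -->
  <xi:include href="org/languagetool/resource/en/disambiguation-default.xml"
              xpointer="xpointer(/rules/*)"/>
  <!-- User-specific rules go here. -->
</rules>
```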
Regards,

Mike Unwalla

Contact: www.techscribe.co.uk/techw/contact.htm

-----Original Message-----
From: Marcin Milkowski [mailto:list-addr...@wp.pl]
Sent: 05 April 2014 08:03
To: languagetool-devel@lists.sourceforge.net
Subject: Re: External rule files

On 2014-04-04 19:24, Mike Unwalla wrote:
> Hi All,
>
>> I'm not sure why Mike Unwalla doesn't want to use our disambiguation
>> rules
>
> I do not have a fundamental objection to using the LT disambiguation file
> with the STE rules. Part of the reason that I now do not use the LT
> disambiguation rules is historical. The LT disambiguation rules are not
> sufficient for the STE term checker. Examples:
>
> * A part-of-speech disambiguator is necessary (primarily for noun/verb
> disambiguation).
> * Each term that is in the STE specification must be specified in the
> disambiguation rules with its approved and not-approved parts of speech.
>
> When I started to write the STE disambiguation rules, I did not know how
> to add rules to an external file
> (http://wiki.languagetool.org/tips-and-tricks#toc2). Therefore, the
> disambiguation file was in <installation path>\org\languagetool\resource\en.
> If I add the STE rules at the end of the LT disambiguation file, each
> time that I update LT, I must copy/paste the STE rules into the new LT
> disambiguation file. If some part of the new LT disambiguation has an
> effect on the STE rules, I must change the STE
Re: External rule files
On 5 April 2014 17:11, Mike Unwalla m...@techscribe.co.uk wrote:
> Hi All,
>
>> But maybe the standard LT would benefit from your rules as well?
>
> I am happy to donate all or some of the rules that I developed for STE
> issue 3. The most recent version of the rules is on
> www.simplified-english.co.uk/installation.html.
>
> Most of the rules that I developed are specifically for STE and contain
> customized postags. Example:
>
> <token postag_regexp="yes" postag="STE_VERB_LEXICAL_BASE|STE_TVb_BASE|STE_TVb_2_WORD_BASE|PROJECT_TVb_BASE|PROJECT_TVb_2_WORD_BASE"/>
>
> The STE rules must be 'fail safe'. To develop rules that give correct
> results with all words in the English lexicon is difficult.

If you can define a transform I'll write a stylesheet to do it (perhaps
leaving the extra tags as comments)

HTH

snip