Re: XML element and attribute statistics

2014-04-05 Thread Daniel Naber
On 2014-04-05 03:29, Andriy Rysin wrote:

 Here's the patch for the solution that I think should be acceptable: for
 multiple grammar files when validating we extract all unification
 elements from the first file and prepend them to the rest of the files.

I think that's okay. It still means we're shipping XML files that are 
not validating and that XML editors will complain about them, but I have 
no better solution either.

Regards
  Daniel


--
___
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel


Re: XML element and attribute statistics

2014-04-05 Thread Marcin Miłkowski
On 2014-04-05 10:46, Daniel Naber wrote:
 On 2014-04-05 03:29, Andriy Rysin wrote:

 Here's the patch for the solution that I think should be acceptable: for
 multiple grammar files when validating we extract all unification
 elements from the first file and prepend them to the rest of the files.

 I think that's okay. It still means we're shipping XML files that are
 not validating and that XML editors will complain about them, but I have
 no better solution either.

Well, I think we may document this on our wiki so that people would know 
the implications of using multiple rule files.

Another solution is to define unification outside rule files, in a 
separate file. This seems to be a bit cleaner, though there will be no 
validation on IDs then (validating them in pure XML would involve 
inclusions and exclusions, which creates unnecessary complexity), so 
then we should drop IDs and use any other attribute in XML. Instead, we 
might

(1) restrict the unification attribute values in the Schema, using the 
namespaces to get country-specific values, or

(2) check the attributes in the XML rule loader, which should already 
know the attribute values and complain if they are not correct.

Overall, I think the gain from validating IDs in unification is very 
small, and we don't even have to move unification to a separate file if 
we use solution (2), which should be very easy to implement. Right now 
an unknown ID causes an NPE in Unifier.isSatisfied().

I am leaning towards moving unification to a separate file and using the 
rule loader to check the values. This should make it easier to use 
unification in the rule editor online as well - the unification file 
could be simply loaded beforehand, just like disambiguation.
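
To make option (2) a bit more concrete, here is a minimal sketch of a
loader-side check. The class and method names (UnificationRegistry, declare,
requireDeclared) are hypothetical and not existing LanguageTool API; the
only point is that an unknown feature name gets reported with a readable
message at load time instead of surfacing later as an NPE.

```java
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper (not LanguageTool API): tracks declared unification features. */
final class UnificationRegistry {

  // Sorted set so that error messages list the known features in a stable order.
  private final Set<String> declaredFeatures = new TreeSet<>();

  /** Would be called when the loader reads a unification definition. */
  void declare(String feature) {
    declaredFeatures.add(feature);
  }

  /** Would be called when a rule references a feature; fails early if it is unknown. */
  void requireDeclared(String feature, String ruleId) {
    if (!declaredFeatures.contains(feature)) {
      throw new IllegalArgumentException("Rule '" + ruleId
          + "' uses undefined unification feature '" + feature
          + "'; known features: " + declaredFeatures);
    }
  }
}
```

If the registry were filled from a separate unification file before any rule
file is read, the same check would work regardless of how many rule files a
language has.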

Regards,
Marcin



Re: XML element and attribute statistics

2014-04-04 Thread Andriy Rysin
Well, moving the rules to a common file would defeat the purpose of the
split, especially as more and more rules use unification...

I looked at the code a bit more, and it looks like we disable
validation when we load rules in LT. That's why loading at run-time
works fine, and then I guess Java references the unifications that
are already loaded.

I'll keep looking for a good solution,
Andriy


2014-04-03 12:23 GMT-04:00 Daniel Naber daniel.na...@languagetool.org:
 On 2014-04-03 01:12, Andriy Rysin wrote:

 I guess we have two ways to go from here: adjust the tests to load files
 and keep them (I am not sure how easy it is - depends on how flexible
 our XMLValidator is)

 We're just using standard XML validation, I don't think there's much we
 can do (other than catching that specific exception, which would be very
 ugly). But not many rules are affected, what about moving those to
 grammar.xml? (I know, that's not very elegant either).

 Regards
   Daniel




Re: XML element and attribute statistics

2014-04-04 Thread Dave Pawson
On 4 April 2014 18:29, Andriy Rysin ary...@gmail.com wrote:
 Well moving the rules to common file will defeat the purpose of the
 split, especially when more and more rules will use unification...

In a similar scenario, I create a temporary common file, validate against
that, then rm the temp file.
  Would that suit both uses?
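
A rough Java sketch of that temp-file idea, in case it helps. It assumes the
merged XML is already available as a string; the FileCheck callback and the
class name are made up for illustration and stand in for whatever validator
is actually used.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper: validate merged content through a short-lived temp file. */
final class TempFileValidation {

  /** Hypothetical callback; any validator that reads a file would fit here. */
  interface FileCheck {
    void check(Path file) throws IOException;
  }

  static void validateViaTempFile(String mergedXml, FileCheck check) throws IOException {
    Path tmp = Files.createTempFile("grammar-merged", ".xml");
    try {
      Files.write(tmp, mergedXml.getBytes(StandardCharsets.UTF_8));
      check.check(tmp);
    } finally {
      // The temp file is removed even when validation fails.
      Files.deleteIfExists(tmp);
    }
  }
}
```

The shipped rule files stay untouched this way; only the validation step
sees the merged content.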

HTH DaveP



 I looked in the code a bit more and it looks that we disable
 validation when we load rules in LT, that's why loading in run-time is
 working fine, and then I guess Java references the unifications that
 are already loaded.

 I'll keep looking for good solution,
 Andriy


 2014-04-03 12:23 GMT-04:00 Daniel Naber daniel.na...@languagetool.org:
 On 2014-04-03 01:12, Andriy Rysin wrote:

 I guess we have two ways to go from here: adjust the tests to load files
 and keep them (I am not sure how easy it is - depends on how flexible
 our XMLValidator is)

 We're just using standard XML validation, I don't think there's much we
 can do (other than catching that specific exception, which would be very
 ugly). But not many rules are affected, what about moving those to
 grammar.xml? (I know, that's not very elegant either).

 Regards
   Daniel





-- 
Dave Pawson
XSLT XSL-FO FAQ.
Docbook FAQ.
http://www.dpawson.co.uk



Re: XML element and attribute statistics

2014-04-04 Thread Andriy Rysin
On 04/03/2014 12:23 PM, Daniel Naber wrote:
 On 2014-04-03 01:12, Andriy Rysin wrote:

 I guess we have two ways to go from here: adjust the tests to load files
 and keep them (I am not sure how easy it is - depends on how flexible
 our XMLValidator is)
 We're just using standard XML validation, I don't think there's much we 
 can do (other than catching that specific exception, which would be very 
 ugly). But not many rules are affected, what about moving those to 
 grammar.xml? (I know, that's not very elegant either).

Here's the patch for the solution that I think should be acceptable: for
multiple grammar files when validating we extract all unification
elements from the first file and prepend them to the rest of the files.
Advantages:
* only tests are affected by this change
* only languages with multiple grammar xml files are affected
* low overhead (re-including only the elements we need)

I would appreciate any feedback,
Thanks
Andriy
diff --git a/languagetool-core/src/test/java/org/languagetool/XMLValidator.java b/languagetool-core/src/test/java/org/languagetool/XMLValidator.java
index e113dbb..ce9c6d4 100644
--- a/languagetool-core/src/test/java/org/languagetool/XMLValidator.java
+++ b/languagetool-core/src/test/java/org/languagetool/XMLValidator.java
@@ -27,15 +27,22 @@ import java.util.regex.Matcher;
 import java.util.regex.Pattern;
 
 import javax.xml.XMLConstants;
+import javax.xml.parsers.DocumentBuilder;
+import javax.xml.parsers.DocumentBuilderFactory;
 import javax.xml.parsers.ParserConfigurationException;
 import javax.xml.parsers.SAXParser;
 import javax.xml.parsers.SAXParserFactory;
+import javax.xml.transform.Source;
+import javax.xml.transform.dom.DOMSource;
 import javax.xml.transform.stream.StreamSource;
 import javax.xml.validation.Schema;
 import javax.xml.validation.SchemaFactory;
 import javax.xml.validation.Validator;
 
 import org.languagetool.tools.StringTools;
+import org.w3c.dom.Document;
+import org.w3c.dom.Node;
+import org.w3c.dom.NodeList;
 import org.xml.sax.InputSource;
 import org.xml.sax.SAXException;
 import org.xml.sax.SAXParseException;
@@ -123,6 +130,61 @@ public final class XMLValidator {
 
   /**
    * Validate XML file using the given XSD. Throws an exception on error.
+   * @param baseFilename File to prepend common parts (unification) from before validating main file
+   * @param filename File in classpath to validate
+   * @param xmlSchemaPath XML schema file in classpath
+   */
+  public void validateWithXmlSchema(String baseFilename, String filename, String xmlSchemaPath) throws IOException {
+    try {
+      final InputStream xmlStream = this.getClass().getResourceAsStream(filename);
+      final InputStream baseXmlStream = this.getClass().getResourceAsStream(baseFilename);
+      if (xmlStream == null || baseXmlStream == null ) {
+        throw new IOException("File not found in classpath: " + filename);
+      }
+      try {
+        final URL schemaUrl = this.getClass().getResource(xmlSchemaPath);
+        if (schemaUrl == null) {
+          throw new IOException("XML schema not found in classpath: " + xmlSchemaPath);
+        }
+        validateInternal(mergeIntoSource(baseXmlStream, xmlStream, this.getClass().getResource(xmlSchemaPath)), schemaUrl);
+      } finally {
+        xmlStream.close();
+      }
+    } catch (Exception e) {
+      throw new IOException("Cannot load or parse '" + filename + "'", e);
+    }
+  }
+
+
+  private static Source mergeIntoSource(InputStream baseXmlStream, InputStream xmlStream, URL xmlSchema) throws Exception {
+    DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
+    domFactory.setIgnoringComments(true);
+    domFactory.setValidating(false);
+    domFactory.setNamespaceAware(true);
+
+    //SchemaFactory sf = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
+    //Schema schema = sf.newSchema(xmlSchema);
+    //domFactory.setSchema(schema);
+
+    DocumentBuilder builder = domFactory.newDocumentBuilder();
+    Document baseDoc = builder.parse(baseXmlStream);
+    Document ruleDoc = builder.parse(xmlStream);
+
+    // Shall this be more generic, i.e. reuse not just unification ???
+    NodeList unificationNodes = baseDoc.getElementsByTagName("unification");
+    Node ruleNode = ruleDoc.getElementsByTagName("rules").item(0);
+    Node firstChildRuleNode = ruleNode.getChildNodes().item(1);
+
+    for(int i=0; i<unificationNodes.getLength(); i++) {
+      Node unificationNode = ruleDoc.importNode(unificationNodes.item(i), true);
+      ruleNode.insertBefore(unificationNode, firstChildRuleNode);
+    }
+
+    return new DOMSource(ruleDoc);
+  }
+  
+  /**
+   * Validate XML file using the given XSD. Throws an exception on error.
    * @param xml the XML string to be validated
    * @param xmlSchemaPath XML schema file in classpath
    * @since 2.3
@@ -171,6 +233,14 @@ public final class XMLValidator {
 validator.validate(new StreamSource(xml));
   }
 
+  private void 

Re: XML element and attribute statistics

2014-04-02 Thread Andriy Rysin
Thanks Daniel!

I can't figure out what's wrong with those tests you commented out
though, the error is this:
cvc-id.1: There is no ID/IDREF binding for IDREF 'gender'. Problem
found at line 484, column 9.
but the gender is properly defined in grammar.xml:
<unification feature="gender">

Provided those rules work in 2.5, do you think we just didn't include
grammar.xml before testing grammar-style.xml in our tests?
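
To illustrate what I suspect is happening, here is a small self-contained
sketch using standard javax.xml.validation. The schema is a toy one that I
made up for the demo (it is not our rules.xsd): one attribute is typed
xs:ID and another xs:IDREF. A document that only contains the reference,
like a rule file validated without grammar.xml, should be rejected, while
the same reference validates once the definition is in the same document.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

/** Demo only: ID/IDREF checks are scoped to the single document being validated. */
final class IdRefDemo {

  // Toy schema (an assumption for this demo, not LanguageTool's rules.xsd).
  static final String XSD =
      "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
    + "<xs:element name='rules'><xs:complexType><xs:sequence>"
    + "<xs:element name='unification' minOccurs='0' maxOccurs='unbounded'>"
    + "<xs:complexType><xs:attribute name='feature' type='xs:ID'/></xs:complexType>"
    + "</xs:element>"
    + "<xs:element name='unify' minOccurs='0' maxOccurs='unbounded'>"
    + "<xs:complexType><xs:attribute name='feature' type='xs:IDREF'/></xs:complexType>"
    + "</xs:element>"
    + "</xs:sequence></xs:complexType></xs:element></xs:schema>";

  /** Returns true if the XML string validates against the toy schema. */
  static boolean isValid(String xml) throws Exception {
    SchemaFactory sf = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
    Schema schema = sf.newSchema(new StreamSource(new StringReader(XSD)));
    try {
      schema.newValidator().validate(new StreamSource(new StringReader(xml)));
      return true;
    } catch (SAXException e) {
      // Validation errors (including a dangling IDREF) end up here.
      return false;
    }
  }

  public static void main(String[] args) throws Exception {
    // Definition and reference in the same document.
    System.out.println(isValid(
        "<rules><unification feature='gender'/><unify feature='gender'/></rules>"));
    // Reference only, as in a rule file validated on its own.
    System.out.println(isValid("<rules><unify feature='gender'/></rules>"));
  }
}
```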

grammar.xml should be returned first in Ukrainian.getRuleFileNames()
list of filenames.

Thanks
Andriy

2014-04-01 16:11 GMT-04:00 Daniel Naber daniel.na...@languagetool.org:
 On 2014-04-01 05:00, Andriy Rysin wrote:

 Oops, my bad, but the interesting thing is that the tests pass on this
 file and the rule actually works in the final release...

 This is fixed now I think. I commented out the Ukrainian rules that
 would have made the tests fail now.

 Regards
   Daniel




Re: XML element and attribute statistics

2014-04-02 Thread Daniel Naber
On 2014-04-02 19:29, Andriy Rysin wrote:

 When I was splitting grammar.xml file I actually spent almost a day
 trying to use xml include features to include component grammar files,
 I must say I was not able to make it work properly in all scenarios:

I guess you tried this one?
http://wiki.languagetool.org/tips-and-tricks#toc2
If that doesn't work, there's no other approach I know of.

 If we can't do that can we consider loading all files together
 similarly to how it's done in production code?

Mhh, I can't see us doing anything special in production code. All files 
are handled separately. Are you really 100% sure that these rules 
actually worked? Or did they maybe work by chance, e.g. because the 
unify wasn't actually needed for the examples you tried?

Regards
  Daniel




Re: XML element and attribute statistics

2014-04-02 Thread Andriy Rysin
On 04/02/2014 04:44 PM, Daniel Naber wrote:
 On 2014-04-02 19:29, Andriy Rysin wrote:

 When I was splitting grammar.xml file I actually spent almost a day
 trying to use xml include features to include component grammar files,
 I must say I was not able to make it work properly in all scenarios:
 I guess you tried this one?
 http://wiki.languagetool.org/tips-and-tricks#toc2
 If that doesn't work, there's no other approach I know of.
Yes, that's what I tried. I could not make the URL work for both the
filesystem and the jar, and I even saw some differences in how LT code and
xmllint include files (the simple include that worked for xmllint didn't
work in LT), so I abandoned that path.

 If we can't do that can we consider loading all files together
 similarly to how it's done in production code?
 Mhh, I can't see us doing anything special in production code. All files 
 are handled separately. Are you really 100% sure that these rules 
 actually worked? Or did they maybe work by chance, e.g. because the 
 unify wasn't actually needed for the examples you tried?
Yes, I can confirm that one of the rules (rulegroup id SAMYI) works
correctly in 2.5 and takes unification into account.

It looks like PatternRuleTest.validatePatternFile() checks the XML files
one at a time (loading one, validating it, moving on to the next), while
JLanguageTool.activateDefaultPatternRules() loads them all into memory,
which (if I understand correctly) keeps the first grammar.xml (which
contains the common parts) already loaded and parsed when loading and
parsing the rest of them.

I guess we have two ways to go from here: adjust the tests to load files
and keep them (I am not sure how easy it is - depends on how flexible
our XMLValidator is) or change our getRuleFileNames() API to require
those files to be independent (which may not be very efficient if all
rule files will have to load and parse the same common parts, like
unification etc)

Regards,
Andriy




Re: XML element and attribute statistics

2014-04-01 Thread Dave Pawson
Schematron can check for errors such as this, e.g. with an assertion like
count(xpath1) = count(xpath2)
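
The same comparison can be approximated without Schematron. The sketch
below uses plain javax.xml.xpath from Java, which is a substitute for a
Schematron assertion, not Schematron itself; the class name and the example
paths are only illustrative.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

/** Illustrative check: do two XPath node sets have the same size? */
final class CountCheck {

  /** Evaluates count(path) against the given XML string. */
  static double count(String xml, String path) throws Exception {
    XPath xpath = XPathFactory.newInstance().newXPath();
    return (Double) xpath.evaluate("count(" + path + ")",
        new InputSource(new StringReader(xml)), XPathConstants.NUMBER);
  }

  /** True if both paths select the same number of nodes. */
  static boolean sameCount(String xml, String path1, String path2) throws Exception {
    return count(xml, path1) == count(xml, path2);
  }
}
```

For example, sameCount(xml, "//rule", "//rule/pattern") would flag a file
in which some rule has no pattern.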

HTH

On 31 March 2014 21:36, Daniel Naber daniel.na...@languagetool.org wrote:
 On 2014-03-31 22:06, Dominique Pellé wrote:

 I would have expected those 3 numbers to be equal.
 How can the number of pattern tags be less than
 the number of rule tags?

 It turns out there's an XML syntax error in uk/grammar-barbarism.xml in
 line 999. Too bad our unit tests didn't catch that, probably because
 they still assume there's only one 'grammar.xml' file.

 Regards
   Daniel





-- 
Dave Pawson
XSLT XSL-FO FAQ.
Docbook FAQ.
http://www.dpawson.co.uk



Re: XML element and attribute statistics

2014-04-01 Thread Marcin Miłkowski
On 2014-03-31 21:57, Daniel Naber wrote:
 Hi,

 I added a tiny tool at org.languagetool.dev.XmlUsageCounter that counts
 elements and attributes used in our grammar.xml files. Here's what the
 result looks like. There's no immediate use for this, but it's
 interesting to see what features get adopted:


It might be interesting to count elements in disambiguation as well.
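
For reference, counting of this kind can be done in a few lines with StAX.
This is only a sketch of the general technique and not how
XmlUsageCounter is actually implemented; the class name and the key
format ("name" for elements, "@name" for attributes) are my own choices.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

/** Sketch: count element and attribute names in an XML string using StAX. */
final class UsageCountSketch {

  static Map<String, Integer> count(String xml) throws Exception {
    Map<String, Integer> counts = new TreeMap<>();
    XMLStreamReader reader =
        XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
    try {
      while (reader.hasNext()) {
        if (reader.next() == XMLStreamConstants.START_ELEMENT) {
          // Elements are keyed by local name, attributes by "@" plus local name.
          counts.merge(reader.getLocalName(), 1, Integer::sum);
          for (int i = 0; i < reader.getAttributeCount(); i++) {
            counts.merge("@" + reader.getAttributeLocalName(i), 1, Integer::sum);
          }
        }
      }
    } finally {
      reader.close();
    }
    return counts;
  }
}
```

Since it only looks at element and attribute names, the same code would
work unchanged on disambiguation files.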

Regards,
Marcin



Re: XML element and attribute statistics

2014-04-01 Thread Daniel Naber
On 2014-04-01 05:00, Andriy Rysin wrote:

 Oops, my bad, but the interesting thing is that the tests pass on this
 file and the rule actually works in the final release...

This is fixed now I think. I commented out the Ukrainian rules that 
would have made the tests fail now.

Regards
  Daniel




Re: XML element and attribute statistics

2014-03-31 Thread Daniel Naber
On 2014-03-31 22:06, Dominique Pellé wrote:

 I would have expected those 3 numbers to be equal.
 How can the number of pattern tags be less than
 the number of rule tags?

It turns out there's an XML syntax error in uk/grammar-barbarism.xml in 
line 999. Too bad our unit tests didn't catch that, probably because 
they still assume there's only one 'grammar.xml' file.

Regards
  Daniel




Re: XML element and attribute statistics

2014-03-31 Thread Andriy Rysin
Oops, my bad, but the interesting thing is that the tests pass on this file
and the rule actually works in the final release...

Regards
Andriy
On Mar 31, 2014 4:36 PM, Daniel Naber daniel.na...@languagetool.org
wrote:

 On 2014-03-31 22:06, Dominique Pellé wrote:

  I would have expected those 3 numbers to be equal.
  How can the number of pattern tags be less than
  the number of rule tags?

 It turns out there's an XML syntax error in uk/grammar-barbarism.xml in
 line 999. Too bad our unit tests didn't catch that, probably because
 they still assume there's only one 'grammar.xml' file.

 Regards
   Daniel


