[GitHub] [any23] HansBrende commented on issue #104: Any23 295: Implement ability to use librdfa

2019-09-14 Thread GitBox
HansBrende commented on issue #104: Any23 295: Implement ability to use librdfa
URL: https://github.com/apache/any23/pull/104#issuecomment-531527033
 
 
   @lewismc I've resolved that issue, moving the semargl-specific bugfixes into 
the semargl extractors. 
   
   @JulioCCBUcuenca a couple of recommendations for this branch whenever you get a 
chance:
   
   (1) Synchronize this branch with master
   (2) Rerun the test suite, making sure all tests still pass
   (3) If possible, benchmark the `Extractor` wrappers (not just the underlying 
rdf4j parsers). Doing so may reveal a performance edge for the librdfa extractor, 
as the semargl parser requires the HTML stream to be preprocessed and 
transformed into strictly conforming XML.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [any23] HansBrende commented on issue #104: Any23 295: Implement ability to use librdfa

2019-09-12 Thread GitBox
HansBrende commented on issue #104: Any23 295: Implement ability to use librdfa
URL: https://github.com/apache/any23/pull/104#issuecomment-531074410
 
 
   I created an [issue](https://issues.apache.org/jira/browse/ANY23-442) for 
this. We should resolve that issue first and then make sure the librdfa test 
suite still passes before adding the librdfa module.




[GitHub] [any23] HansBrende commented on issue #104: Any23 295: Implement ability to use librdfa

2019-09-12 Thread GitBox
HansBrende commented on issue #104: Any23 295: Implement ability to use librdfa
URL: https://github.com/apache/any23/pull/104#issuecomment-531068423
 
 
   @lewismc My first thought is: if the performance of this module is not as 
good as that of our current implementation, then in its current form, what is 
the added value?
   
   My second thought is: the benchmarks do not test the Any23 `Extractor` 
wrappers around these rdf4j parsers, only the underlying parsers themselves. 
However, in Any23's `BaseRDFExtractor`, due to a lot of bugs in the semargl 
HTML parser, we had to preprocess the input stream using jsoup before passing 
it into the underlying parser. I am curious whether the `librdfa` parser has 
any of those same HTML parsing bugs. If _not_, and if I can move the 
preprocessing logic out of `BaseRDFExtractor` and into the semargl parser 
specifically, and **if** the librdfa parser can still pass the entire test 
suite without using the jsoup-preprocessed stream, then there would be a much 
better case for including it (as its performance would then likely eclipse our 
current RDFa performance, which carries the preprocessing overhead).
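
   To illustrate why timing only the underlying parser can be misleading, here 
is a minimal sketch in plain Java. Everything in it is hypothetical: `parse` and 
`preprocess` are stand-ins invented for illustration (they are not the semargl, 
librdfa, jsoup, or Any23 APIs), and the hand-rolled timing loop is only a rough 
approximation of what a proper harness such as JMH would measure.

```java
import java.nio.charset.StandardCharsets;
import java.util.function.ToIntFunction;

// Hypothetical sketch: compares a "parser only" pipeline with a
// "preprocess, then parse" pipeline. None of these names exist in Any23.
class WrapperBenchmarkSketch {

    // Stand-in for an underlying parser: counts '<' bytes in the input.
    static int parse(byte[] input) {
        int tags = 0;
        for (byte b : input) {
            if (b == '<') {
                tags++;
            }
        }
        return tags;
    }

    // Stand-in for a preprocessing step: decodes, trims, and re-encodes
    // the document, which costs an extra pass and extra allocations.
    static byte[] preprocess(byte[] input) {
        String text = new String(input, StandardCharsets.UTF_8);
        return text.trim().getBytes(StandardCharsets.UTF_8);
    }

    // Runs the pipeline `warmup` times untimed, then `runs` times timed,
    // and returns the average elapsed nanoseconds per timed run.
    static long averageNanos(ToIntFunction<byte[]> pipeline, byte[] input,
                             int warmup, int runs) {
        int sink = 0;
        for (int i = 0; i < warmup; i++) {
            sink += pipeline.applyAsInt(input);
        }
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            sink += pipeline.applyAsInt(input);
        }
        long elapsed = System.nanoTime() - start;
        // Use the accumulated result so the loops are not optimized away.
        if (sink == Integer.MIN_VALUE) {
            System.out.println(sink);
        }
        return elapsed / runs;
    }

    public static void main(String[] args) {
        StringBuilder html = new StringBuilder();
        for (int i = 0; i < 10_000; i++) {
            html.append("<p property=\"name\">item ").append(i).append("</p>\n");
        }
        byte[] doc = html.toString().getBytes(StandardCharsets.UTF_8);

        ToIntFunction<byte[]> parserOnly = WrapperBenchmarkSketch::parse;
        ToIntFunction<byte[]> wrapper = in -> parse(preprocess(in));

        System.out.println("parser only (ns/run):        "
                + averageNanos(parserOnly, doc, 50, 200));
        System.out.println("preprocess + parse (ns/run): "
                + averageNanos(wrapper, doc, 50, 200));
    }
}
```

   The point of the sketch is only that the second pipeline is what an 
`Extractor` wrapper with preprocessing would actually execute, so that is the 
path a comparison between extractors would need to time.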




[GitHub] [any23] HansBrende commented on issue #104: Any23 295: Implement ability to use librdfa

2019-09-11 Thread GitBox
HansBrende commented on issue #104: Any23 295: Implement ability to use librdfa
URL: https://github.com/apache/any23/pull/104#issuecomment-530605395
 
 
   @lewismc do we have any benchmarks comparing this version with our current 
RDFa parser?

