[GitHub] [any23] HansBrende commented on issue #104: Any23 295: Implement ability to use librdfa

2019-09-12 Thread GitBox
HansBrende commented on issue #104: Any23 295: Implement ability to use librdfa
URL: https://github.com/apache/any23/pull/104#issuecomment-531074410
 
 
   I created an [issue](https://issues.apache.org/jira/browse/ANY23-442) for 
this. We should resolve that issue first and then make sure the librdfa test 
suite still passes before adding the librdfa module.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Created] (ANY23-442) Move HTML preprocessing logic from BaseRDFExtractor to semargl Extractors

2019-09-12 Thread Hans Brende (Jira)
Hans Brende created ANY23-442:
---------------------------------

             Summary: Move HTML preprocessing logic from BaseRDFExtractor to semargl Extractors
                 Key: ANY23-442
                 URL: https://issues.apache.org/jira/browse/ANY23-442
             Project: Apache Any23
          Issue Type: Improvement
            Reporter: Hans Brende


Cf. https://github.com/apache/any23/pull/104#issuecomment-531068423



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[GitHub] [any23] HansBrende edited a comment on issue #104: Any23 295: Implement ability to use librdfa

2019-09-12 Thread GitBox
HansBrende edited a comment on issue #104: Any23 295: Implement ability to use librdfa
URL: https://github.com/apache/any23/pull/104#issuecomment-531068423
 
 
   @lewismc My first thought is: if this module performs worse than our 
current implementation, what value does it add in its current form?
   
   My second thought is: the benchmarks test only the underlying rdf4j 
parsers, not the Any23 `Extractor` wrappers around them. However, because of 
numerous bugs in the semargl HTML parser, Any23's `BaseRDFExtractor` had to 
preprocess the input stream using jsoup before passing "clean html" into the 
underlying parser. I am curious whether the `librdfa` parser has any of those 
same HTML parsing bugs. If it does _not_, I could take the preprocessing logic 
out of `BaseRDFExtractor` and move it to the semargl parser specifically. 
**If** the librdfa parser can then still pass the entire test suite without 
using the jsoup-preprocessed stream, there would be a much better case for 
including it, since, without the preprocessing overhead, its performance 
would likely eclipse our current RDFa performance.
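
   To make the proposed refactoring concrete, here is a minimal sketch of the idea, not Any23's actual code. All names in it (`PreprocessingSketch`, `Parser`, `BaseExtractor`, `CleaningExtractor`, `RawExtractor`, `ECHO`) are hypothetical, the parser is a stand-in that only echoes its input, and `trim()` is a placeholder for the real jsoup cleanup. It assumes the base class can expose an overridable hook so that only the extractor that needs cleaned HTML pays for it.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class PreprocessingSketch {

    /** Hypothetical stand-in for an underlying RDFa parser. */
    interface Parser {
        String parse(InputStream in) throws IOException;
    }

    /** Echoes whatever bytes it receives, so the effect of preprocessing is visible. */
    static final Parser ECHO = in -> new String(in.readAllBytes(), StandardCharsets.UTF_8);

    /** Hypothetical base class: no cleanup of its own, only an overridable hook. */
    abstract static class BaseExtractor {
        private final Parser parser;

        BaseExtractor(Parser parser) {
            this.parser = parser;
        }

        /** Default hook: pass the stream through untouched. */
        protected InputStream preprocess(InputStream in) throws IOException {
            return in;
        }

        /** Runs the hook, then hands the resulting stream to the parser. */
        final String run(String html) {
            InputStream in = new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8));
            try {
                return parser.parse(preprocess(in));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    /** Extractor for a parser that needs cleaned input (the semargl case). */
    static final class CleaningExtractor extends BaseExtractor {
        CleaningExtractor(Parser parser) {
            super(parser);
        }

        @Override
        protected InputStream preprocess(InputStream in) throws IOException {
            String html = new String(in.readAllBytes(), StandardCharsets.UTF_8);
            // Placeholder only: the real cleanup uses jsoup; trim() just marks where it would go.
            String cleaned = html.trim();
            return new ByteArrayInputStream(cleaned.getBytes(StandardCharsets.UTF_8));
        }
    }

    /** Extractor for a parser assumed to cope with raw HTML (the hoped-for librdfa case). */
    static final class RawExtractor extends BaseExtractor {
        RawExtractor(Parser parser) {
            super(parser);
        }
    }

    public static void main(String[] args) {
        String html = "  <p property=\"name\">x</p>\n";
        System.out.println("[" + new CleaningExtractor(ECHO).run(html) + "]");
        System.out.println("[" + new RawExtractor(ECHO).run(html) + "]");
    }
}
```

   With a hook like this, the test suite could be run against the raw-input extractor to check whether the parser behind it really copes without the cleaned stream.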




[GitHub] [any23] HansBrende commented on issue #104: Any23 295: Implement ability to use librdfa

2019-09-12 Thread GitBox
HansBrende commented on issue #104: Any23 295: Implement ability to use librdfa
URL: https://github.com/apache/any23/pull/104#issuecomment-531068423
 
 
   @lewismc My first thought is: if this module performs worse than our 
current implementation, what value does it add in its current form?
   
   My second thought is: the benchmarks test only the underlying rdf4j 
parsers, not the Any23 `Extractor` wrappers around them. However, because of 
numerous bugs in the semargl HTML parser, Any23's `BaseRDFExtractor` had to 
preprocess the input stream using jsoup before passing it into the underlying 
parser. I am curious whether the `librdfa` parser has any of those same HTML 
parsing bugs. If it does _not_, I could take the preprocessing logic out of 
`BaseRDFExtractor` and move it to the semargl parser specifically. **If** the 
librdfa parser can then still pass the entire test suite without using the 
jsoup-preprocessed stream, there would be a much better case for including it, 
since, without the preprocessing overhead, its performance would likely 
eclipse our current RDFa performance.

