Hi Jean-Marc!
Lorenz is correct. You can use pretty much any Lucene analyzer with
jena-text, but there isn't one for HTML AFAIK so you'd have to write
your own and add it to the jena-text codebase (or Lucene itself).
I see that Elasticsearch has an HTML Strip Char Filter:
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-htmlstrip-charfilter.html
I don't think the current jena-text Elasticsearch backend is
configurable enough to just start using it as it is, but it probably
wouldn't be very difficult to add. The Lucene side already supports
arbitrary analyzers (including filters) through assembler configuration.
-Osma
Jean-Marc Vanel wrote on 29.01.2018 at 12:31:
Many thanks, Lorenz!
This is annoying; I can't preprocess the literals before putting them in
TDB, because TDB *is* the database for my CMS + social network.
And duplication of data would be a mess.
But maybe there is a way to preprocess the literals before putting them in
the underlying Lucene.
That said, the most frequent tags, <p> and <div>, are not likely to
be search strings from the user.
So this is not a big problem,
but I found it an interesting one.
2018-01-29 11:12 GMT+01:00 Lorenz Buehmann <[email protected]>:
I guess it simply uses the Lucene StandardAnalyzer, so yes, the tags
will be indexed. There isn't an HTML analyzer in Lucene AFAIK, which
means you have to preprocess the literals first, via Apache Tika or
something like JSoup, before you add them to the triple store.
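
A minimal sketch of what such a preprocessing step could look like, using
only the JDK. This is a naive regex-based stand-in, not what Tika or JSoup
do internally: the class name HtmlLiteralPreprocessor and the method
stripTags are hypothetical, and the pattern only removes things that look
like element tags (no comments, scripts, or entity decoding). For real
content a proper HTML parser would be the safer choice.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Jena, Lucene, Tika or JSoup.
public class HtmlLiteralPreprocessor {

    // Matches element tags such as <p>, </div>, <br/> or <a href="...">.
    // Requiring a letter after "<" leaves plain text like "a < b" untouched.
    private static final Pattern TAG = Pattern.compile("</?[A-Za-z][^>]*>");

    // Runs of whitespace, including those left behind by removed tags.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    /** Returns the literal's text with tags replaced by single spaces. */
    public static String stripTags(String html) {
        // Replace each tag with a space so adjacent words do not fuse.
        String noTags = TAG.matcher(html).replaceAll(" ");
        // Collapse the resulting whitespace and trim both ends.
        return WHITESPACE.matcher(noTags).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String literal = "<p>Hello <b>world</b></p><div>again</div>";
        // Prints: Hello world again
        System.out.println(stripTags(literal));
    }
}
```

The cleaned string is what would then be stored as the literal, so the
indexer only ever sees plain text.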
Lorenz
On 29.01.2018 10:14, Jean-Marc Vanel wrote:
Hi
With semantic_forms one can create content with an HTML editor in
JavaScript.
Example:
http://semantic-forms.cc:9112/download?url=http%3A%2F%2Fsemantic-forms.cc%3A9112%2Fldp%2F1515780312176-31461258964949990&syntax=Turtle
and how it looks in the UI:
http://semantic-forms.cc:9112/ldp/1515780312176-31461258964949990
My question is:
Does Jena text indexing process the tags in HTML (or XML) content?
If so, <bold> would be indexed in Lucene, which is not desirable.
Nothing is said about this on these two pages:
https://jena.apache.org/documentation/notes/typed-literals.html
https://jena.apache.org/documentation/query/text-query.html
--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
[email protected]
http://www.nationallibrary.fi