Re: indexing text in HTML content

Lorenz Buehmann Mon, 29 Jan 2018 02:13:29 -0800

I guess it simply uses the Lucene StandardAnalyzer, so yes, the tag
names will be indexed as ordinary terms. There isn't an HTML analyzer in
Lucene AFAIK, which means you have to preprocess the literals first, via
Apache Tika or something like JSoup, before you add them to the triple
store.
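
A rough sketch of that preprocessing step, in case it helps. To keep it
free of extra dependencies it uses the HTML parser that ships with the JDK
(javax.swing.text.html) rather than Tika or JSoup; with JSoup the same
thing should be roughly Jsoup.parse(html).text(). The class and method
names (HtmlLiteralText, strip) are made up for illustration and are not
part of Jena. The idea is to run each HTML literal through something like
this and store or index the returned plain text instead of the markup.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper (not a Jena or Lucene class): reduces an HTML
// fragment to its text content so that tag names never reach the index.
final class HtmlLiteralText {

    static String strip(String html) {
        StringBuilder text = new StringBuilder();

        // The callback only reacts to text nodes; tags are ignored.
        HTMLEditorKit.ParserCallback collector = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Separate chunks with a space so that adjacent blocks
                // such as <p>a</p><p>b</p> do not merge into one token.
                text.append(' ').append(data);
            }
        };

        try {
            new ParserDelegator().parse(new StringReader(html), collector, true);
        } catch (IOException e) {
            // A StringReader should not fail, but parse() declares IOException.
            throw new UncheckedIOException(e);
        }

        // Collapse runs of whitespace and trim the ends.
        return text.toString().replaceAll("\\s+", " ").trim();
    }
}
```

Note that putting a space between text chunks also splits a word that is
only partly wrapped in a tag; for full-text search that seemed the lesser
evil compared with gluing neighbouring paragraphs together.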



Lorenz



On 29.01.2018 10:14, Jean-Marc Vanel wrote:
> Hi
>
> With semantic_forms one can create content with an HTML editor in
> JavaScript.
>
> Example:
> http://semantic-forms.cc:9112/download?url=http%3A%2F%2Fsemantic-forms.cc%3A9112%2Fldp%2F1515780312176-31461258964949990&syntax=Turtle
> and how it looks in the UI:
> http://semantic-forms.cc:9112/ldp/1515780312176-31461258964949990
>
> My question is:
> Does Jena text indexing process the tags in HTML (or XML) content?
> If yes, <bold> would be indexed in Lucene, which is not desirable.
>
> Nothing is said in these 2 pages:
> https://jena.apache.org/documentation/notes/typed-literals.html
> https://jena.apache.org/documentation/query/text-query.html
>
