Re: indexing text in HTML content

Jean-Marc Vanel Mon, 29 Jan 2018 02:41:13 -0800

Many thanks, Lorenz!

This is annoying: I can't preprocess the literals before putting them in
TDB, because TDB *is* the database for my CMS + social network,
and duplicating the data would be a mess.
But maybe there is a way to preprocess the literals before they reach
the underlying Lucene index.
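
To make the idea concrete, here is a minimal sketch of the kind of
preprocessing meant here: dropping the tags from an HTML literal so that
only the text would be handed to the indexer. It is written in Python
with the standard library's html.parser purely as an illustration; it is
not a Jena or Lucene API, and the names TagStripper and strip_tags are
made up for this example. In a real Jena setup the equivalent step would
have to run in Java (for instance with JSoup, as Lorenz suggests), and
where to hook it in is still an open question.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Hypothetical helper: keeps the text nodes of an HTML fragment, drops the tags."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; before handle_data is called
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_data(self, data):
        # Called for every run of text between tags
        self._chunks.append(data)

    def text(self):
        # Join chunks with a space so words from adjacent elements do not run
        # together, then collapse repeated whitespace. Caveat: a word split by
        # inline markup ("<b>bo</b>ld") comes out as two tokens.
        return " ".join(" ".join(self._chunks).split())


def strip_tags(html_literal):
    """Return the plain text of an HTML literal (illustrative only)."""
    parser = TagStripper()
    parser.feed(html_literal)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()
```

For example, strip_tags("<p>Hello <b>world</b></p>") returns
"Hello world", so neither "p" nor "b" would end up as an index term.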


That said, the most frequent tags, <p> and <div>, are unlikely to
be search strings from the user.
So this is not a big problem,
but I found it an interesting one.





2018-01-29 11:12 GMT+01:00 Lorenz Buehmann <
[email protected]>:

> I guess it simply uses the Lucene Standard Analyzer, so yes, the tags
> will be indexed. There isn't an HTML analyzer in Lucene AFAIK, which
> means you have to preprocess the literals first, via Apache Tika or
> something like JSoup, before you add them to the triple store.
>
>
> Lorenz
>
>
>
> On 29.01.2018 10:14, Jean-Marc Vanel wrote:
> > Hi
> >
> > With semantic_forms one can create content with an HTML editor in
> > JavaScript.
> >
> > Example:
> > http://semantic-forms.cc:9112/download?url=http%3A%2F%2Fsemantic-forms.cc%3A9112%2Fldp%2F1515780312176-31461258964949990&syntax=Turtle
> > and how it looks in the UI:
> > http://semantic-forms.cc:9112/ldp/1515780312176-31461258964949990
> >
> > My question is:
> > Does Jena text indexing process the tags in HTML (or XML) content?
> > If yes, <bold> would be indexed in Lucene, which is not desirable.
> >
> > Nothing is said in these 2 pages:
> > https://jena.apache.org/documentation/notes/typed-literals.html
> > https://jena.apache.org/documentation/query/text-query.html
> >
>
>


-- 
Jean-Marc Vanel
http://www.semantic-forms.cc:9111/display?displayuri=http://jmvanel.free.fr/jmv.rdf%23me#subject
Déductions SARL - Consulting, services, training,
Rule-based programming, Semantic Web
+33 (0)6 89 16 29 52
Twitter: @jmvanel , @jmvanel_fr ; chat: irc://irc.freenode.net#eulergui
