Re: Can Jena Full Text search work with other Jena based API like Virtuoso Jena or MarkLogic Jena ?

Dan Davis Sat, 21 Sep 2019 09:33:32 -0700

It would be of tremendous value to my project if this works; I wish I had
time to try it also.


On Wed, Sep 18, 2019, 10:03 PM Alex To <tonhud...@gmail.com> wrote:

> Hi Dan
> Thanks for your suggestion, but I am not trying to load a large dataset yet.
>
> I am trying to see if I can use Jena full-text search with other Jena-based
> APIs such as MarkLogic or Virtuoso, but it seems it doesn't work as
> expected. Not a Jena problem, though. My setup is:
>
> 1. Input file: dbpedia.owl (2.5MB)
> 2. Import using MarkLogic Jena without TextDataset: 1 minute
> 3. Import using MarkLogic Jena with TextDataset wrapped around it: 13 minutes
>
> Regards
>
> On Thu, Sep 19, 2019 at 10:54 AM Dan Davis <dansm...@gmail.com> wrote:
>
> > dbpedia is not actually that large.  Make sure you test with RDF datasets
> > that really represent your data.
> >
> > On Wed, Sep 18, 2019 at 8:14 PM Alex To <tonhud...@gmail.com> wrote:
> >
> > > Update: I switched from Lucene to Elasticsearch 6.4.3 and Kibana. Both
> > > Jena and MarkLogic Jena work with indexing; I haven't tried querying
> > > MarkLogic with text:query though.
> > >
> > > Using Kibana, I could see the number of documents increasing while
> > > importing data with MarkLogic; however, it is very slow.
> > >
> > > Importing dbpedia.owl (2.5MB) with MarkLogic Jena takes less than a
> > > minute without indexing.
> > >
> > > With TextDataset wrapped around the MarkLogic dataset, it takes 13
> > > minutes, so I guess the MarkLogic dataset does not send triples in
> > > batches when used with TextDataset.
> > >
> > >
> > >
> > > On Tue, Sep 17, 2019 at 9:58 AM Alex To <tonhud...@gmail.com> wrote:
> > >
> > > > Hi Andy
> > > >
> > > > I ended up creating separate implementations of full-text search for
> > > > Jena and MarkLogic for now, due to the time constraints of the project.
> > > > I will investigate further at a later time.
> > > >
> > > > Thank you
> > > >
> > > > Best Regards
> > > >
> > > > On Sun, Sep 15, 2019 at 6:53 PM Andy Seaborne <a...@apache.org> wrote:
> > > >
> > > >> Alex,
> > > >>
> > > >> I can't try it out - I don't have a Marklogic system.
> > > >>
> > > >> Can you see in the server logs what is happening?
> > > >>
> > > >>  > Pure speculation but parts 1 & 2 sound like the data load is not
> > > >>  > going to MarkLogic as a single transaction but as "autocommit" - one
> > > >>  > transaction for each triple added.
> > > >>
> > > >>      Andy
> > > >>
> > > >> On 13/09/2019 23:04, Andy Seaborne wrote:
> > > >> > The Maven Central artifact com.marklogic:marklogic-jena is 3.0.6 but
> > > >> > our code depends on 3.1.0 - what code is it using?
> > > >> >
> > > >> > On 13/09/2019 01:18, Alex To wrote:
> > > >> >> I created a small program to try out Lucene with MarkLogic Jena here:
> > > >> >>
> > > >> >> https://github.com/AlexTo/jena-lab/blob/master/src/main/java/com/company/MainMarkLogic.java
> > > >> >>
> > > >> >>
> > > >> >>
> > > >> >> My observations are as follows (see my comments at lines 54 & 56):
> > > >> >>
> > > >> >> 1. If the model reads a small file with 2 triples, the loading
> > > >> >> finishes quickly
> > > >> >> 2. If the model reads a slightly larger file (1.5MB), the loading
> > > >> >> takes forever so I have to terminate it
> > > >> >
> > > >> > Pure speculation but parts 1 & 2 sound like the data load is not
> > > >> > going to MarkLogic as a single transaction but as "autocommit" - one
> > > >> > transaction for each triple added.
> > > >> >
> > > >> >      Andy
> > > >> >
> > > >> >
> > > >> >> 3. After loading the small file, searching the Lucene index
> > > >> >> directly shows that the triples are indexed
> > > >> >> 4. After loading the small file, running a SPARQL query with
> > > >> >> "text:query" won't finish
> > > >> >>
> > > >> >> For now I created 2 separate implementations in my program to
> > > >> >> support full-text search with Jena or MarkLogic, but I look forward
> > > >> >> to knowing whether it is still possible to use Jena Elastic indexing
> > > >> >> with TextDataset, because then I can provide a single UI for users to
> > > >> >> configure their search regardless of the back end. :)
> > > >> >>
> > > >> >>
> > > >> >> On Fri, Sep 13, 2019 at 1:07 AM Dan Davis <dansm...@gmail.com> wrote:
> > > >> >>
> > > >> >>> I am incorrect, and apologize. Virtuoso's Jena 3 driver includes an
> > > >> >>> implementation of Dataset, and so while the application is only using
> > > >> >>> virtuoso.jena.driver.VirtGraph and
> > > >> >>> virtuoso.jena.driver.VirtuosoQueryExecution (and factory), a more
> > > >> >>> flexible integration is possible. I look forward to experimenting
> > > >> >>> with it and seeing what I can do on the backend.
> > > >> >>>
> > > >> >>> On Thu, Sep 12, 2019 at 10:19 AM Dan Davis <dansm...@gmail.com> wrote:
> > > >> >>>
> > > >> >>>> Virtuoso's Jena driver implements the Model interface, rather than
> > > >> >>>> the DatasetGraph API. It translates the SPARQL query into its own
> > > >> >>>> JDBC interface. You can see the architecture at
> > > >> >>>> http://docs.openlinksw.com/virtuoso/rdfnativestorageprovidersjena/#rdfnativestorageprovidersjenawhatisv
> > > >> >>>>
> > > >> >>>> However, Virtuoso has its own full-text indexing, which can be
> > > >> >>>> effective. Its rules for translating words into queries are not as
> > > >> >>>> flexible as lucene/solr/elastic, but it does allow you to specify
> > > >> >>>> what should be indexed - e.g. which objects from which data
> > > >> >>>> properties in which graphs.
> > > >> >>>>
> > > >> >>>> I use Virtuoso behind virt_jena and virt_jdbc. You can see the code
> > > >> >>>> at https://github.com/HHS/lodestar, which is run underneath
> > > >> >>>> https://github.com/HHS/meshrdf. You will see that
> > > >> >>>> https://github.com/HHS/lodestar is a fork from EBI, but the NLM copy
> > > >> >>>> has been updated to Jena 3. The EBI version is ahead on UI features,
> > > >> >>>> however.
> > > >> >>>>
> > > >> >>>> I cannot speak to MarkLogic, Stardog, etc.
> > > >> >>>>
> > > >> >>>>
> > > >> >>>>
> > > >> >>>>
> > > >> >>>>
> > > >> >>>> EBI's lodestar still uses Jena 2, but the fork at HHS has been
> > > >> >>>> updated to Jena 3.
> > > >> >>>>
> > > >> >>>> Virtuoso has its own full-text indexing, which is not as flexible
> > > >> >>>> in how it indexes as Elastic/Solr/Lucene. It still works.
> > > >> >>>>
> > > >> >>>>
> > > >> >>>>
> > > >> >>>>
> > > >> >>>> On Thu, Sep 12, 2019 at 7:03 AM Andy Seaborne <a...@apache.org> wrote:
> > > >> >>>>
> > > >> >>>>> Yes, probably - but.
> > > >> >>>>>
> > > >> >>>>> The Jena text index will work in conjunction with any (Jena)
> > > >> >>>>> DatasetGraph API implementation. 3rd party systems are not tested
> > > >> >>>>> in the build.
> > > >> >>>>>
> > > >> >>>>> The "but" is efficiency. Both those systems have their own built-in
> > > >> >>>>> text indexing, which executes as part of the native query engine.
> > > >> >>>>> This may be a factor for you, it may not.
> > > >> >>>>>
> > > >> >>>>> Let us know how you get on trying it.
> > > >> >>>>>
> > > >> >>>>> ----
> > > >> >>>>>
> > > >> >>>>> There is a SPARQL 1.2 issue about standardizing text query.
> > > >> >>>>>
> > > >> >>>>> Issue 40 : SPARQL 1.2 Community Group:
> > > >> >>>>> https://github.com/w3c/sparql-12/issues/40
> > > >> >>>>>
> > > >> >>>>>       Andy
> > > >> >>>>>
> > > >> >>>>> On 12/09/2019 02:53, Alex To wrote:
> > > >> >>>>>> Hi
> > > >> >>>>>>
> > > >> >>>>>> I have so far been happy with Jena + Lucene / Elastic. Just trying
> > > >> >>>>>> to get a quick answer on whether it can work with other Jena-based
> > > >> >>>>>> APIs like Virtuoso / MarkLogic.
> > > >> >>>>>>
> > > >> >>>>>> If I wrap a MarkLogic Dataset in a Jena TextDataset, will it work
> > > >> >>>>>> as expected?
> > > >> >>>>>>
> > > >> >>>>>> Given that a MarkLogic / Virtuoso Dataset implements the Jena
> > > >> >>>>>> Dataset interface, it may work, but I am not sure because
> > > >> >>>>>> "text:query" seems to be more Jena-specific.
> > > >> >>>>>>
> > > >> >>>>>> I will try it out myself in the next couple of days to see if it
> > > >> >>>>>> works, but if there is a quick answer it may save me a couple of
> > > >> >>>>>> hours :)
> > > >> >>>>>>
> > > >> >>>>>> Thanks a lot
> > > >> >>>>>>
> > > >> >>>>>> Regards
> > > >> >>>>>>
> > > >> >>>>>
> > > >> >>>>
> > > >> >>>
> > > >> >>
> > > >> >>
> > > >
> > > >
> > >
> >
>
>
> --
>
> Alex To
>
> PhD Candidate
>
> School of Computer Science
>
> Knowledge Discovery and Management Research Group
>
> Faculty of Engineering & IT
>
> THE UNIVERSITY OF SYDNEY | NSW | 2006
>
> Desk 4e69 | Building J12| 1 Cleveland Street
>
> M. +61423330656
>
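
A rough illustration of the "autocommit" speculation in the thread above (one transaction per triple instead of a single transaction for the whole load). This is only a sketch: CountingStore is a hypothetical in-memory stand-in, not the MarkLogic, Virtuoso, or Jena API, and it counts commits rather than measuring real latency. The point is just that the number of commits grows with the number of triples in the autocommit case, which would explain a load that is fast for 2 triples and very slow for a larger file if each commit is a round trip to the server.

```java
import java.util.ArrayList;
import java.util.List;

public class CommitCostSketch {

    /** Hypothetical stand-in for a remote triple store; it only counts commits. */
    static final class CountingStore {
        private final List<String> pending = new ArrayList<>();
        private final List<String> committed = new ArrayList<>();
        private int commits = 0;

        void add(String triple) {
            pending.add(triple);
        }

        /** Makes all pending triples durable; stands in for one server round trip. */
        void commit() {
            committed.addAll(pending);
            pending.clear();
            commits++;
        }

        int commits() {
            return commits;
        }

        int size() {
            return committed.size();
        }
    }

    /** Autocommit style: every added triple is its own transaction. */
    static CountingStore loadAutocommit(List<String> triples) {
        CountingStore store = new CountingStore();
        for (String triple : triples) {
            store.add(triple);
            store.commit();
        }
        return store;
    }

    /** Batch style: all triples are added, then committed once. */
    static CountingStore loadSingleTransaction(List<String> triples) {
        CountingStore store = new CountingStore();
        for (String triple : triples) {
            store.add(triple);
        }
        store.commit();
        return store;
    }

    /** Generates n placeholder triples (made-up example.org IRIs). */
    static List<String> sampleTriples(int n) {
        List<String> triples = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            triples.add("<http://example.org/s" + i + "> <http://example.org/p> \"label " + i + "\" .");
        }
        return triples;
    }

    public static void main(String[] args) {
        List<String> triples = sampleTriples(1000);
        System.out.println("autocommit commits: " + loadAutocommit(triples).commits());
        System.out.println("single-transaction commits: " + loadSingleTransaction(triples).commits());
    }
}
```

Both loads end with the same data in the store; only the commit count differs. Whether the MarkLogic driver actually behaves this way under TextDataset was not confirmed in the thread, so checking the server logs, as Andy suggested, would be the way to verify it.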
