Re: Extending Jena Text to Support ElasticSearch as Indexing/Querying Engine

anuj kumar Wed, 01 Mar 2017 10:46:28 -0800

Thanks Osma. I sent my previous email just a minute early. I will try your
suggestion and if it doesn't work will send you the entire example.


Thanks again.
Anuj

On 1 Mar 2017 19:36, "Osma Suominen" <[email protected]> wrote:

> Hi Anuj!
>
> Generally I use assembler descriptions to configure the jena-text index.
> An example with multiple properties (SKOS label properties) is here:
> https://github.com/NatLibFi/Skosmos/wiki/InstallTutorial#creating-a-text-index
>
> For examples on how to use assembler descriptions from Java code, take a
> look at the jena-text unit tests. They generally contain a snippet of
> assembler definition that configures the text index in a particular way,
> then test that it does what it should when using that configuration.
>
> You didn't provide a full example. What is your data and what query did
> you use? What results did you expect? What happened instead?
>
> One possible problem in your configuration is that you have set the
> primary predicate to rdfs:label, but not set a field for it. Try adding
> this:
>
> entDef.set("label", RDFS.label.asNode());
>
> For querying everything else but the default field, you need to specify
> the predicate at query time. With your configuration, it should be possible
> to query rdfs:comment values like this:
>
> ?s text:query (rdfs:comment "word") .
>
> Hope this helps!
>
> -Osma
>
> 01.03.2017, 17:33, anuj kumar wrote:
>
>> BTW, I have one more question:
>>
>> How do I add more than one field to be indexed in my Index?
>> Basically, if I want to index rdfs:label , rdfs:comment in the same index
>> document, how do I do it?
>>
>> I tried :
>>
>> EntityDefinition entDef = new EntityDefinition(DOC_TYPE, FIELD_TO_SEARCH);
>> entDef.setPrimaryPredicate(RDFS.label);
>> entDef.setGraphField(GRAPH_FIELD_NAME);
>> entDef.set("comment", RDFS.comment.asNode());
>>
>> But it doesn't work. Can you please point me to a way to do it? This
>> is an important piece of functionality I need.
>>
>> Thanks,
>> Anuj Kumar
>>
>>
>> On Wed, Mar 1, 2017 at 3:59 PM, anuj kumar <[email protected]>
>> wrote:
>>
>>> I personally have no preference as to how the code in Jena should be
>>> structured, as long as I am able to use it :).
>>> I have a personal preference for doing it in a specific way because, IMO,
>>> it is modular, which makes it much easier to maintain in the long run. But
>>> again, it may not be the quickest one.
>>>
>>> I have already been given a deadline by the company to have the ES
>>> extension implemented in the next 15 days :). What this means is that I
>>> will be maintaining the ES code extension to Jena Text, at least locally,
>>> for the coming period. I would be more than happy to contribute to the Jena
>>> community whatever is required to have a proper ElasticSearch
>>> implementation in place, whether within the jena-text module or as a
>>> separate module. Until Lucene and Solr are upgraded to the latest
>>> version, I will have to maintain a separate module for jena-text-es.
>>>
>>> Cheers!
>>> Anuj Kumar
>>>
>>>
>>> On Wed, Mar 1, 2017 at 3:36 PM, A. Soroka <[email protected]> wrote:
>>>
>>>> Osma--
>>>>
>>>> The short answer is that yes, given the right tools you _can_ have
>>>> different versions of code accessible in different ways. The longer
>>>> answer is that it's probably not a viable alternative for Jena for this
>>>> problem, at least not without a lot of other change.
>>>>
>>>> You are right to point to the classloader mechanism as being at the
>>>> heart of this question, but I must alter your remark just slightly. From
>>>> "the Java classloader only sees a single, flat package/class namespace
>>>> and a set of compiled classes" to "ANY GIVEN Java classloader only sees a
>>>> single, flat package/class namespace and a set of compiled classes".
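
The "any given classloader" point can be tried out with nothing but the
JDK. What follows is only a minimal sketch, not anything from Jena or OSGi:
the names LoaderNamespaces, Library and IsolatingLoader are made up for
illustration, and it assumes the compiled classes are available as .class
resources on the classpath. Two loader instances each define the same class
name, and the JVM treats the results as two unrelated classes.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class LoaderNamespaces {

    /** Hypothetical stand-in for a library class that ships in two versions. */
    public static class Library {}

    /** Defines Library itself instead of delegating that one name to its parent. */
    static class IsolatingLoader extends ClassLoader {
        IsolatingLoader(ClassLoader parent) {
            super(parent);
        }

        @Override
        protected Class<?> loadClass(String name, boolean resolve)
                throws ClassNotFoundException {
            if (!name.equals(Library.class.getName())) {
                return super.loadClass(name, resolve); // normal parent-first delegation
            }
            synchronized (getClassLoadingLock(name)) {
                Class<?> c = findLoadedClass(name);
                if (c == null) {
                    byte[] bytes = readClassBytes(name);
                    c = defineClass(name, bytes, 0, bytes.length);
                }
                if (resolve) {
                    resolveClass(c);
                }
                return c;
            }
        }

        /** Reads the compiled class file through the parent loader's resources. */
        private byte[] readClassBytes(String name) throws ClassNotFoundException {
            String path = name.replace('.', '/') + ".class";
            try (InputStream in = getParent().getResourceAsStream(path)) {
                if (in == null) {
                    throw new ClassNotFoundException(name);
                }
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buffer = new byte[4096];
                int n;
                while ((n = in.read(buffer)) != -1) {
                    out.write(buffer, 0, n);
                }
                return out.toByteArray();
            } catch (IOException e) {
                throw new ClassNotFoundException(name, e);
            }
        }
    }

    /** True when two loaders yield same-named classes that are distinct to the JVM. */
    public static boolean sameNameDifferentClass() {
        try {
            ClassLoader parent = LoaderNamespaces.class.getClassLoader();
            String name = Library.class.getName();
            Class<?> a = new IsolatingLoader(parent).loadClass(name);
            Class<?> b = new IsolatingLoader(parent).loadClass(name);
            return a.getName().equals(b.getName()) && a != b && a != Library.class;
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("same name, distinct classes: " + sameNameDifferentClass());
    }
}
```

This is the mechanism that would let two Lucene versions coexist in one
JVM, but it also shows the cost: something has to create and wire up those
loaders, which is the job OSGi takes on.
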
>>>>
>>>> This is the fact that OSGi uses to make it possible to maintain strict
>>>> module boundaries (and even dynamic module relationships at run-time).
>>>> Each OSGi bundle sees its own classloader, and the framework is
>>>> responsible for connecting bundles up to ensure that every bundle has
>>>> what it needs in the way of types to function, based on metadata that
>>>> the bundles provide to the framework. It's an incredibly powerful system
>>>> (I use it every day and enjoy it enormously) but it's also very "heavy"
>>>> and requires a good deal of investment to use. In particular, it's
>>>> probably too large to put _inside_ Jena. (I frequently put Jena inside an
>>>> OSGi instance, on the other hand.)
>>>>
>>>> Java 9 Jigsaw [1] offers some possibility for strong modularization of
>>>> this kind, but it's really meant for the JDK itself, not application
>>>> libraries. In theory, we could "roll our own" classloader management for
>>>> this problem. That sounds like more than a bit of a rabbit hole to me.
>>>> There might be another, more lightweight, toolkit out there to this
>>>> purpose, but I'm not aware of any myself.
>>>>
>>>> Otherwise, yes, you get into shading and the like. We have to do that
>>>> for Guava for now because of HADOOP-10101 (grumble grumble) but it's
>>>> hardly a thing we want to do any more of than needed, I don't think.
>>>>
>>>> ---
>>>> A. Soroka
>>>> The University of Virginia Library
>>>>
>>>> [1] http://openjdk.java.net/projects/jigsaw/
>>>>
>>>> On Mar 1, 2017, at 9:03 AM, Osma Suominen <[email protected]> wrote:
>>>>>
>>>>> Hi Anuj!
>>>>>
>>>>> Thanks for the clarification.
>>>>>
>>>>> However, I'm still not sure I understand the situation completely. I
>>>>> know Maven can perform a lot of tricks, but Maven modules are just
>>>>> convenient ways to structure a Java project. Maven cannot change the
>>>>> fact that at runtime, module divisions don't really matter (except that
>>>>> they usually correspond to package sub-namespaces) and the Java
>>>>> classloader only sees a single, flat package/class namespace and a set
>>>>> of compiled classes (usually within JARs) in the classpath that it needs
>>>>> to check to find the right classes, and if there are two versions of the
>>>>> same library (e.g. Lucene) with overlapping class names, that's going to
>>>>> cause trouble. The only way around that is to shade some of the
>>>>> libraries, i.e. rename them so that they end up in another,
>>>>> non-conflicting namespace. Apparently Elasticsearch also did some of
>>>>> that in the past [1] but nowadays tries to avoid it.
>>>>
>>>>>
>>>>> Does your assumption 1 ("At a given point in time, only a single
>>>>> Indexing Technology is used") imply that in the assembler configuration,
>>>>> you cannot have ja:loadClass declarations for both Lucene and ES
>>>>> backends? Or how do you run something like Fuseki that contains (in a
>>>>> single big JAR) both the jena-text and jena-text-es modules with all
>>>>> their dependencies, one of which requires the Lucene 4.x classes and the
>>>>> other one the Lucene 6.4.1 classes? How do you ensure that only one of
>>>>> them is used at a time, and that the Java classloader, even though it
>>>>> has access to both versions of Lucene, only loads classes from the
>>>>> single, correct one and not the other? Or do you need to have separate
>>>>> "Fuseki-Lucene" and "Fuseki-ES" packages, so that you don't end up with
>>>>> two Lucene versions within the same Fuseki JAR?
>>>>
>>>>>
>>>>> -Osma
>>>>>
>>>>> [1] https://www.elastic.co/blog/to-shade-or-not-to-shade
>>>>>
>>>>> 01.03.2017, 11:03, anuj kumar wrote:
>>>>>
>>>>>> Hi Osma,
>>>>>>
>>>>>> I understand what you are saying. There are ways to mitigate risks and
>>>>>> balance the refactoring without affecting the existing modules. But I
>>>>>> will not delve into those now. I am not enough of a Jena expert to say
>>>>>> convincingly that it is possible without any hiccups. But I can take a
>>>>>> guess and say that it is indeed possible :)
>>>>>>
>>>>>> For the question: "is it even possible to mix modules that depend on
>>>>>> different versions of the Lucene libraries within the same project?"
>>>>>>
>>>>>> I actually do not understand what you mean by mixing modules. I assume
>>>>>> you mean having jena-text and jena-text-es as dependencies in a build
>>>>>> without causing the build to conflict. If that is what you mean, then
>>>>>> the answer is yes, it is possible and quite simple as well. Let me
>>>>>> explain how it is possible. But before that, some assumptions which I
>>>>>> want to call out explicitly.
>>>>>>
>>>>>> *Assumptions:*
>>>>>> 1. At a given point in time, only a single Indexing Technology is used
>>>>>> for text based indexing and searching via Jena. What this means is that
>>>>>> we will either use Lucene Implementation OR Solr Implementation OR ES
>>>>>> Implementation at any given point in time.
>>>>>> 2. Fuseki build does not depend on any Lucene 4.9.1 specific classes
>>>>>> but only on jena-text classes, if at all.
>>>>>>
>>>>>> Based on these assumptions it is possible to create a build that
>>>>>> contains jena-text based common classes + ES specific classes without
>>>>>> any compatibility issues. And it is in fact quite simple. I did it in
>>>>>> the current jena-text-es module and ran the entire build, which
>>>>>> succeeded. The key is to include the latest Lucene dependencies at the
>>>>>> very beginning in the pom and then include the jena-text dependency.
>>>>>> Maven will then automatically resolve the dependency issues by
>>>>>> including the Lucene libraries that we included in our ES specific pom.
>>>>>> Have a look at the pom of the jena-text-es module here to see how it can
>>>>>> be done:
>>>>>> https://github.com/EaseTech/jena/blob/master/jena-text-es/pom.xml
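
For readers unfamiliar with how Maven picks one version here: its
documented dependency mediation rule is "nearest definition wins", with
ties at the same depth going to the first declaration. The following is a
toy model of that rule only, not Maven's actual resolver. The Dep class and
the "x.y.z" jena-text version are placeholders invented for this sketch;
the Lucene versions are the ones mentioned in this thread.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NearestWins {

    /** One node of a dependency tree: artifact, version, and its own dependencies. */
    static final class Dep {
        final String artifact;
        final String version;
        final List<Dep> children;

        Dep(String artifact, String version, Dep... children) {
            this.artifact = artifact;
            this.version = version;
            this.children = Arrays.asList(children);
        }
    }

    /** Breadth-first walk: the shallowest declaration wins, ties go to the first declared. */
    static Map<String, String> resolve(List<Dep> direct) {
        Map<String, String> chosen = new LinkedHashMap<>();
        Deque<Dep> queue = new ArrayDeque<>(direct);
        while (!queue.isEmpty()) {
            Dep dep = queue.removeFirst();
            if (chosen.putIfAbsent(dep.artifact, dep.version) == null) {
                queue.addAll(dep.children); // only the winning node's subtree is followed
            }
        }
        return chosen;
    }

    /** Lucene version picked for a toy jena-text-es pom, in either declaration order. */
    public static String luceneVersion(boolean luceneDeclaredFirst) {
        Dep lucene = new Dep("lucene-core", "6.4.1");
        // "x.y.z" is a placeholder; jena-text pulls in Lucene 4.9.1 transitively.
        Dep jenaText = new Dep("jena-text", "x.y.z", new Dep("lucene-core", "4.9.1"));
        List<Dep> direct = luceneDeclaredFirst
                ? Arrays.asList(lucene, jenaText)
                : Arrays.asList(jenaText, lucene);
        return resolve(direct).get("lucene-core");
    }

    public static void main(String[] args) {
        System.out.println(luceneVersion(true));
        System.out.println(luceneVersion(false));
    }
}
```

In this model both declaration orders select 6.4.1, which suggests that it
is the direct declaration of Lucene in the pom, rather than its position,
that overrides the transitive 4.9.1. Either way only one Lucene version
ends up on the classpath, so any jena-text code compiled against 4.x would
still run against 6.4.1.
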
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Anuj Kumar
>>>>>>
>>>>>>
>>>>>> On Wed, Mar 1, 2017 at 7:27 AM, Osma Suominen <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Anuj,
>>>>>>>
>>>>>>> I understand your concerns. However, we also need to balance between
>>>>>>> the needs of individual modules/features and the whole codebase. I'm
>>>>>>> willing to put in the effort to keep the other modules up to date with
>>>>>>> newer Lucene versions. Lucene upgrade requirements are well
>>>>>>> documented; the only hitches seen in JENA-1250 were related to how
>>>>>>> jena-text (ab)used some Lucene features that were dropped from newer
>>>>>>> versions.
>>>>>>>
>>>>>>> A perhaps stupid question to more experienced Java developers: is it
>>>>>>> even possible to mix modules that depend on different versions of the
>>>>>>> Lucene libraries within the same project? In my (quite limited)
>>>>>>> understanding of Java projects and libraries, this requires special
>>>>>>> arrangements (e.g. shading) as the Java package/class namespace is
>>>>>>> shared by all the code running within the same JVM.
>>>>>>>
>>>>>>> So can you create, say, a Fuseki build that contains the current
>>>>>>> jena-text module (depending on Lucene 4.x) and the new jena-text-es
>>>>>>> module (depending on Lucene 6.4.1) without any compatibility issues?
>>>>>>>
>>>>>>> -Osma
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> 01.03.2017, 00:47, anuj kumar wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> My 2 cents:
>>>>>>>>
>>>>>>>> The reason I proposed to have separate modules for Lucene, Solr and
>>>>>>>> ES is exactly to avoid the "All or Nothing" approach we need to take
>>>>>>>> if we club them all together. If they stay together and if in the
>>>>>>>> near future I want to upgrade ES to another version, I also need to
>>>>>>>> again upgrade Lucene and Solr and possibly another implementation
>>>>>>>> that may have been added in the meantime. As we all know, this means
>>>>>>>> weeks of work, if not months, to get the changes released. This will
>>>>>>>> personally de-motivate me to do anything and I will probably start
>>>>>>>> maintaining my own version of Jena-Text, as that would be much simpler
>>>>>>>> to do than to upgrade and test and in the process own (read: fix bugs
>>>>>>>> in) the upgrade for each and every technology.
>>>>>>>>
>>>>>>>> If they are developed as separate modules, they can evolve
>>>>>>>> independently of each other and we can avoid situations where we
>>>>>>>> can't upgrade to the latest version of Lucene because we do not know
>>>>>>>> what effect it will have on the Solr implementation.
>>>>>>>>
>>>>>>>> We can start with having a separate module for Jena Text ES and see
>>>>>>>> how things go. If they go well, we could extract Solr and Lucene out
>>>>>>>> of Jena Text.
>>>>>>>>
>>>>>>>> Again, this is just a suggestion based on my limited industry
>>>>>>>> experience.
>>>>
>>>>>
>>>>>>>> Thanks,
>>>>>>>> Anuj Kumar
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Feb 28, 2017 at 5:23 PM, Osma Suominen <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> 28.02.2017, 17:12, A. Soroka wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> https://lists.apache.org/thread.html/dce0d502b11891c28e57bbcbb0cdef27d8374d58d9634076b8ef4cd7@1431107516@%3Cdev.jena.apache.org%3E
>>>>>>>>>> ? In other words, might it be better to factor out between -text
>>>>>>>>>> and -spatial and _then_ try to upgrade the Lucene version?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>> I certainly wouldn't object to that, but somebody has to volunteer
>>>>>>>>> to do the actual work!
>>>>>>>>>
>>>>>>>>>> I don't use the Solr component now, but I could easily see so
>>>>>>>>>> doing... that's pretty vague, I know, and I'm not in a position to
>>>>>>>>>> do any work to maintain it, so consider that just a very small and
>>>>>>>>>> blurry data point. :)
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>> Last time I tried it (it was a while ago) I couldn't figure out
>>>>>>>>> how to get it running... If you could just try that with some toy
>>>>>>>>> data, then your data point would be a lot less blurry :) I haven't
>>>>>>>>> used Solr for anything, so I'm not very familiar with how to set it
>>>>>>>>> up, and the jena-text instructions are pretty vague unfortunately.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> -Osma
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Osma Suominen
>>>>>>>>> D.Sc. (Tech), Information Systems Specialist
>>>>>>>>> National Library of Finland
>>>>>>>>> P.O. Box 26 (Kaikukatu 4)
>>>>>>>>> 00014 HELSINGIN YLIOPISTO
>>>>>>>>> Tel. +358 50 3199529
>>>>>>>>> [email protected]
>>>>>>>>> http://www.nationallibrary.fi
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>> --
>>>>>>> Osma Suominen
>>>>>>> D.Sc. (Tech), Information Systems Specialist
>>>>>>> National Library of Finland
>>>>>>> P.O. Box 26 (Kaikukatu 4)
>>>>>>> 00014 HELSINGIN YLIOPISTO
>>>>>>> Tel. +358 50 3199529
>>>>>>> [email protected]
>>>>>>> http://www.nationallibrary.fi
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Osma Suominen
>>>>> D.Sc. (Tech), Information Systems Specialist
>>>>> National Library of Finland
>>>>> P.O. Box 26 (Kaikukatu 4)
>>>>> 00014 HELSINGIN YLIOPISTO
>>>>> Tel. +358 50 3199529
>>>>> [email protected]
>>>>> http://www.nationallibrary.fi
>>>>>
>>>>
>>>>
>>>>
>>>
>>> --
>>> *Anuj Kumar*
>>>
>>>
>>
>>
>>
>
> --
> Osma Suominen
> D.Sc. (Tech), Information Systems Specialist
> National Library of Finland
> P.O. Box 26 (Kaikukatu 4)
> 00014 HELSINGIN YLIOPISTO
> Tel. +358 50 3199529
> [email protected]
> http://www.nationallibrary.fi
>
