Re: Extending Jena Text to Support ElasticSearch as Indexing/Querying Engine

anuj kumar Thu, 02 Mar 2017 02:25:21 -0800

Just FYI, I was able to index multiple fields in ElasticSearch using the Jena
Text capability.
The issue was in my ElasticSearch code, where I was doing an insert every time
instead of an update :/
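
For anyone who hits the same symptom, here is a minimal sketch of the
difference between the two behaviours. It is not the actual jena-text or
ElasticSearch code: it assumes the indexer is handed one field value per call
for the same document id, and it uses a hypothetical in-memory map as a
stand-in for the index. All class and method names are made up for
illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class InsertVsUpdateSketch {
    // Hypothetical stand-in for an index: document id -> (field name -> value).
    private final Map<String, Map<String, String>> docs = new HashMap<>();

    // "Insert" semantics: the stored document is replaced wholesale,
    // so any field written by an earlier call is lost.
    public void insert(String id, String field, String value) {
        Map<String, String> doc = new TreeMap<>();
        doc.put(field, value);
        docs.put(id, doc);
    }

    // "Update" (upsert) semantics: the field is merged into the existing
    // document, which is created on first use.
    public void upsert(String id, String field, String value) {
        docs.computeIfAbsent(id, k -> new TreeMap<>()).put(field, value);
    }

    // Returns the stored fields for the id, or an empty map if none exist.
    public Map<String, String> get(String id) {
        return docs.getOrDefault(id, Collections.emptyMap());
    }

    public static void main(String[] args) {
        String id = "http://example.org/s1"; // made-up subject URI

        InsertVsUpdateSketch replaced = new InsertVsUpdateSketch();
        replaced.insert(id, "label", "a label");
        replaced.insert(id, "comment", "a comment");
        System.out.println(replaced.get(id).keySet()); // prints [comment]

        InsertVsUpdateSketch merged = new InsertVsUpdateSketch();
        merged.upsert(id, "label", "a label");
        merged.upsert(id, "comment", "a comment");
        System.out.println(merged.get(id).keySet()); // prints [comment, label]
    }
}
```

With insert semantics only the last field written survives, which looks like
"only one field gets indexed"; with update semantics both fields end up in
the same document.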


Cheers!
Anuj Kumar

On Wed, Mar 1, 2017 at 7:40 PM, anuj kumar <[email protected]> wrote:

> Thanks Osma. I sent my previous email just a minute early. I will try your
> suggestion and if it doesn't work will send you the entire example.
>
> Thanks again.
> Anuj
>
> On 1 Mar 2017 19:36, "Osma Suominen" <[email protected]> wrote:
>
>> Hi Anuj!
>>
>> Generally I use assembler descriptions to configure the jena-text index.
>> An example with multiple properties (SKOS label properties) is here:
>> https://github.com/NatLibFi/Skosmos/wiki/InstallTutorial#creating-a-text-index
>>
>> For examples on how to use assembler descriptions from Java code, take a
>> look at the jena-text unit tests. They generally contain a snippet of
>> assembler definition that configures the text index in a particular way,
>> then test that it does what it should when using that configuration.
>>
>> You didn't provide a full example. What is your data and what query did
>> you use? What results did you expect? What happened instead?
>>
>> One possible problem in your configuration is that you have set the
>> primary predicate to rdfs:label, but not set a field for it. Try adding
>> this:
>>
>> entDef.set("label", RDFS.label.asNode());
>>
>> For querying everything else but the default field, you need to specify
>> the predicate at query time. With your configuration, it should be possible
>> to query rdfs:comment values like this:
>>
>> ?s text:query (rdfs:comment "word") .
>>
>> Hope this helps!
>>
>> -Osma
>>
>> 01.03.2017, 17:33, anuj kumar wrote:
>>
>>> BTW, I have one more question:
>>>
>>> How do I add more than one field to be indexed in my Index?
>>> Basically, if I want to index rdfs:label , rdfs:comment in the same index
>>> document, how do I do it?
>>>
>>> I tried :
>>>
>>> EntityDefinition entDef = new EntityDefinition(DOC_TYPE,
>>> FIELD_TO_SEARCH);
>>> entDef.setPrimaryPredicate(RDFS.label);
>>> entDef.setGraphField(GRAPH_FIELD_NAME);
>>> entDef.set("comment", RDFS.comment.asNode());
>>>
>>> But it doesn't work. Could you please point me to a way to do it? This
>>> is an important piece of functionality I need.
>>>
>>> Thanks,
>>> Anuj Kumar
>>>
>>>
>>> On Wed, Mar 1, 2017 at 3:59 PM, anuj kumar <[email protected]>
>>> wrote:
>>>
>>>> I personally have no preference as to how the code in Jena should be
>>>> structured, as long as I am able to use it :).
>>>> I do prefer doing it in a specific way because, IMO, it is modular,
>>>> which makes it much easier to maintain in the long run. But again, it
>>>> may not be the quickest one.
>>>>
>>>> I have already been given a deadline by the company to have the ES
>>>> extension implemented in the next 15 days :). What this means is that I
>>>> will be maintaining the ES code extension to Jena Text, at least
>>>> locally, for the coming period. I would be more than happy to contribute
>>>> to the Jena community whatever is required to have a proper
>>>> ElasticSearch implementation in place, whether within the jena-text
>>>> module or as a separate module. Until Lucene and Solr are upgraded to
>>>> the latest version, I will have to maintain a separate module for
>>>> jena-text-es.
>>>> Cheers!
>>>> Anuj Kumar
>>>>
>>>>
>>>> On Wed, Mar 1, 2017 at 3:36 PM, A. Soroka <[email protected]> wrote:
>>>>
>>>>> Osma--
>>>>>
>>>>> The short answer is that yes, given the right tools you _can_ have
>>>>> different versions of code accessible in different ways. The longer
>>>>> answer is that it's probably not a viable alternative for Jena for
>>>>> this problem, at least not without a lot of other change.
>>>>>
>>>>> You are right to point to the classloader mechanism as being at the
>>>>> heart of this question, but I must alter your remark just slightly.
>>>>> From "the Java classloader only sees a single, flat package/class
>>>>> namespace and a set of compiled classes" to "ANY GIVEN Java
>>>>> classloader only sees a single, flat package/class namespace and a set
>>>>> of compiled classes".
>>>>>
>>>>> This is the fact that OSGi uses to make it possible to maintain strict
>>>>> module boundaries (and even dynamic module relationships at run-time).
>>>>> Each OSGi bundle sees its own classloader, and the framework is
>>>>> responsible for connecting bundles up to ensure that every bundle has
>>>>> what it needs in the way of types to function, based on metadata that
>>>>> the bundles provide to the framework. It's an incredibly powerful
>>>>> system (I use it every day and enjoy it enormously) but it's also very
>>>>> "heavy" and requires a good deal of investment to use. In particular,
>>>>> it's probably too large to put _inside_ Jena. (I frequently put Jena
>>>>> inside an OSGi instance, on the other hand.)
>>>>>
>>>>> Java 9 Jigsaw [1] offers some possibility for strong modularization of
>>>>> this kind, but it's really meant for the JDK itself, not application
>>>>> libraries. In theory, we could "roll our own" classloader management
>>>>> for this problem. That sounds like more than a bit of a rabbit hole to
>>>>> me. There might be another, more lightweight, toolkit out there to
>>>>> this purpose, but I'm not aware of any myself.
>>>>>
>>>>> Otherwise, yes, you get into shading and the like. We have to do that
>>>>> for Guava for now because of HADOOP-10101 (grumble grumble) but it's
>>>>> hardly a thing we want to do any more of than needed, I don't think.
>>>>>
>>>>> ---
>>>>> A. Soroka
>>>>> The University of Virginia Library
>>>>>
>>>>> [1] http://openjdk.java.net/projects/jigsaw/
>>>>>
>>>>> On Mar 1, 2017, at 9:03 AM, Osma Suominen <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi Anuj!
>>>>>>
>>>>>> Thanks for the clarification.
>>>>>>
>>>>>> However, I'm still not sure I understand the situation completely. I
>>>>>> know Maven can perform a lot of tricks, but Maven modules are just
>>>>>> convenient ways to structure a Java project. Maven cannot change the
>>>>>> fact that at runtime, module divisions don't really matter (except
>>>>>> that they usually correspond to package sub-namespaces) and the Java
>>>>>> classloader only sees a single, flat package/class namespace and a set
>>>>>> of compiled classes (usually within JARs) in the classpath that it
>>>>>> needs to check to find the right classes, and if there are two
>>>>>> versions of the same library (eg Lucene) with overlapping class names,
>>>>>> that's going to cause trouble. The only way around that is to shade
>>>>>> some of the libraries, i.e. rename them so that they end up in
>>>>>> another, non-conflicting namespace. Apparently Elasticsearch also did
>>>>>> some of that in the past [1] but nowadays tries to avoid it.
>>>>>>
>>>>>> Does your assumption 1 ("At a given point in time, only a single
>>>>>> Indexing Technology is used") imply that in the assembler
>>>>>> configuration, you cannot have ja:loadClass declarations for both
>>>>>> Lucene and ES backends? Or how do you run something like Fuseki that
>>>>>> contains (in a single big JAR) both the jena-text and jena-text-es
>>>>>> modules with all their dependencies, one of which requires the Lucene
>>>>>> 4.x classes and the other one the Lucene 6.4.1 classes? How do you
>>>>>> ensure that only one of them is used at a time, and that the Java
>>>>>> classloader, even though it has access to both versions of Lucene,
>>>>>> only loads classes from the single, correct one and not the other? Or
>>>>>> do you need to have separate "Fuseki-Lucene" and "Fuseki-ES" packages,
>>>>>> so that you don't end up with two Lucene versions within the same
>>>>>> Fuseki JAR?
>>>>>>
>>>>>> -Osma
>>>>>>
>>>>>> [1] https://www.elastic.co/blog/to-shade-or-not-to-shade
>>>>>>
>>>>>> 01.03.2017, 11:03, anuj kumar wrote:
>>>>>>
>>>>>>> Hi Osma,
>>>>>>>
>>>>>>> I understand what you are saying. There are ways to mitigate risks
>>>>>>> and balance the refactoring without affecting the existing modules.
>>>>>>> But I will not delve into those now. I am not an expert in Jena to
>>>>>>> convincingly say that it is possible, without any hiccups. But I can
>>>>>>> take a guess and say that it is indeed possible :)
>>>>>>>
>>>>>>> For the question: "is it even possible to mix modules that depend on
>>>>>>> different versions of the Lucene libraries within the same project?"
>>>>>>>
>>>>>>> I actually do not understand what you mean by mixing modules. I
>>>>>>> assume you mean having jena-text and jena-text-es as dependencies in
>>>>>>> a build without causing the build to conflict. If that is what you
>>>>>>> mean, then the answer is yes, it is possible and quite simple as
>>>>>>> well. Let me explain how it is possible. But before that, some
>>>>>>> assumptions which I want to call out explicitly.
>>>>>>>
>>>>>>> *Assumptions:*
>>>>>>> 1. At a given point in time, only a single Indexing Technology is
>>>>>>> used for text based indexing and searching via Jena. What this means
>>>>>>> is that we will either use Lucene Implementation OR Solr
>>>>>>> Implementation OR ES Implementation at any given point in time.
>>>>>>> 2. Fuseki build does not depend on any Lucene 4.9.1 specific classes
>>>>>>> but only on jena-text classes, if at all.
>>>>>>>
>>>>>>> Based on these assumptions it is possible to create a build that
>>>>>>> contains jena-text based common classes + ES specific classes without
>>>>>>> any compatibility issues. And it is in fact quite simple. I did it in
>>>>>>> the current jena-text-es module and ran the entire build, which
>>>>>>> succeeded.
>>>>>>> The key is to include the latest Lucene dependencies at the very
>>>>>>> beginning in the pom and then include the jena-text dependency. Maven
>>>>>>> will then automatically resolve the dependency issues by including
>>>>>>> the Lucene libraries that we included in our ES specific pom. Have a
>>>>>>> look at the pom of the jena-text-es module here to see how it can be
>>>>>>> done:
>>>>>>> https://github.com/EaseTech/jena/blob/master/jena-text-es/pom.xml
>>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Anuj Kumar
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Mar 1, 2017 at 7:27 AM, Osma Suominen <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Anuj,
>>>>>>>>
>>>>>>>> I understand your concerns. However, we also need to balance between
>>>>>>>> the needs of individual modules/features and the whole codebase. I'm
>>>>>>>> willing to put in the effort to keep the other modules up to date
>>>>>>>> with newer Lucene versions. Lucene upgrade requirements are well
>>>>>>>> documented, the only hitches seen in JENA-1250 were related to how
>>>>>>>> jena-text (ab)used some Lucene features that were dropped from newer
>>>>>>>> versions.
>>>>>>>>
>>>>>>>> A perhaps stupid question to more experienced Java developers: is it
>>>>>>>> even possible to mix modules that depend on different versions of
>>>>>>>> the Lucene libraries within the same project? In my (quite limited)
>>>>>>>> understanding of Java projects and libraries, this requires special
>>>>>>>> arrangements (e.g. shading) as the Java package/class namespace is
>>>>>>>> shared by all the code running within the same JVM.
>>>>>>>>
>>>>>>>> So can you create, say, a Fuseki build that contains the current
>>>>>>>> jena-text module (depending on Lucene 4.x) and the new jena-text-es
>>>>>>>> module (depending on Lucene 6.4.1) without any compatibility issues?
>>>>>>>>
>>>>>>>> -Osma
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> 01.03.2017, 00:47, anuj kumar wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> My 2 Cents :
>>>>>>>>>
>>>>>>>>> The reason I proposed to have separate modules for Lucene, Solr and
>>>>>>>>> ES is exactly for avoiding the "All or Nothing" approach we need to
>>>>>>>>> take if we club them all together. If they stay together and if in
>>>>>>>>> the near future I want to upgrade ES to another version, I also
>>>>>>>>> need to again upgrade Lucene and Solr and possibly another
>>>>>>>>> implementation that may have been added during the time. As we all
>>>>>>>>> know, this means weeks of work if not months to get the changes
>>>>>>>>> released. This will personally de-motivate me to do anything and I
>>>>>>>>> will probably start maintaining my version of Jena-Text as that
>>>>>>>>> would be much simpler to do than to upgrade and test and in the
>>>>>>>>> process own (read: fix bugs) the upgrade for each and every
>>>>>>>>> technology.
>>>>>>>>>
>>>>>>>>> If they are developed as separate modules, they can evolve
>>>>>>>>> independently of each other and we can avoid situations where we
>>>>>>>>> can't upgrade to the latest version of Lucene because we do not
>>>>>>>>> know what effect it will have on the Solr Implementation.
>>>>>>>>>
>>>>>>>>> We can start with having a separate Module for Jena Text ES and see
>>>>>>>>> how things go. If they go well, we could extract Solr and Lucene
>>>>>>>>> out of Jena Text.
>>>>>>>>>
>>>>>>>>> Again this is just a suggestion based on my limited industry
>>>>>>>>> experience.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Anuj Kumar
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Feb 28, 2017 at 5:23 PM, Osma Suominen <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> 28.02.2017, 17:12, A. Soroka wrote:
>>>>>>>>>>
>>>>>>>>>>> https://lists.apache.org/thread.html/dce0d502b11891c28e57bbcbb0cdef27d8374d58d9634076b8ef4cd7@1431107516@%3Cdev.jena.apache.org%3E
>>>>>>>>>>> ? In other words, might it be better to factor out between -text
>>>>>>>>>>> and -spatial and _then_ try to upgrade the Lucene version?
>>>>>>>>>>
>>>>>>>>>> I certainly wouldn't object to that, but somebody has to volunteer
>>>>>>>>>> to do the actual work!
>>>>>>>>>>
>>>>>>>>>>> I don't use the Solr component now, but I could easily see so
>>>>>>>>>>> doing... that's pretty vague, I know, and I'm not in a position
>>>>>>>>>>> to do any work to maintain it, so consider that just a very small
>>>>>>>>>>> and blurry data point. :)
>>>>>>>>>>
>>>>>>>>>> Last time I tried it (it was a while ago) I couldn't figure out
>>>>>>>>>> how to get it running... If you could just try that with some toy
>>>>>>>>>> data, then your data point would be a lot less blurry :) I haven't
>>>>>>>>>> used Solr for anything, so I'm not very familiar with how to set
>>>>>>>>>> it up, and the jena-text instructions are pretty vague
>>>>>>>>>> unfortunately.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> -Osma
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Osma Suominen
>>>>>>>>>> D.Sc. (Tech), Information Systems Specialist
>>>>>>>>>> National Library of Finland
>>>>>>>>>> P.O. Box 26 (Kaikukatu 4)
>>>>>>>>>> 00014 HELSINGIN YLIOPISTO
>>>>>>>>>> Tel. +358 50 3199529
>>>>>>>>>> [email protected]
>>>>>>>>>> http://www.nationallibrary.fi
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>> --
>>>>>>>> Osma Suominen
>>>>>>>> D.Sc. (Tech), Information Systems Specialist
>>>>>>>> National Library of Finland
>>>>>>>> P.O. Box 26 (Kaikukatu 4)
>>>>>>>> 00014 HELSINGIN YLIOPISTO
>>>>>>>> Tel. +358 50 3199529
>>>>>>>> [email protected]
>>>>>>>> http://www.nationallibrary.fi
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Osma Suominen
>>>>>> D.Sc. (Tech), Information Systems Specialist
>>>>>> National Library of Finland
>>>>>> P.O. Box 26 (Kaikukatu 4)
>>>>>> 00014 HELSINGIN YLIOPISTO
>>>>>> Tel. +358 50 3199529
>>>>>> [email protected]
>>>>>> http://www.nationallibrary.fi
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>> --
>>>> *Anuj Kumar*
>>>>
>>>>
>>>
>>>
>>>
>>
>> --
>> Osma Suominen
>> D.Sc. (Tech), Information Systems Specialist
>> National Library of Finland
>> P.O. Box 26 (Kaikukatu 4)
>> 00014 HELSINGIN YLIOPISTO
>> Tel. +358 50 3199529
>> [email protected]
>> http://www.nationallibrary.fi
>>
>


-- 
*Anuj Kumar*
