Hi Anuj!

Congratulations on getting the PoC working!

I'm not sure I like the idea of having a separate jena-text-es module.

Am I right that your main reason for creating a separate module is that the Elasticsearch client library requires a newer Lucene version than what jena-text currently uses? In that case, I think the solution should be to upgrade the Lucene version everywhere, i.e. in the current jena-text and jena-spatial modules. This work has already started (see JENA-1250), but it has recently stalled and has not yet been merged.

I don't think it should be a problem to have multiple implementations (Lucene, Solr, ES) within the same module. Ideally a lot of the infrastructure could be shared (which is of course possible also with separate modules, as you have done), and I would hope that also the unit tests could be reused for the different implementations, although that is currently not the case (the unit tests only target Lucene).
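
To make the idea of shared infrastructure and reusable tests concrete, here is a minimal sketch. All names in it (LabelIndex, InMemoryLabelIndex, LabelIndexContract) are hypothetical stand-ins invented for illustration, not the actual jena-text classes. The in-memory backend only plays the role that a Lucene, Solr or ES implementation would play.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical backend-neutral interface; NOT the real jena-text TextIndex API.
interface LabelIndex {
    void add(String entityUri, String label);
    List<String> search(String term);
}

// Toy in-memory backend standing in for a Lucene, Solr or ES implementation.
class InMemoryLabelIndex implements LabelIndex {
    private final Map<String, String> labels = new LinkedHashMap<>();

    @Override
    public void add(String entityUri, String label) {
        labels.put(entityUri, label);
    }

    @Override
    public List<String> search(String term) {
        // Case-insensitive substring match over the stored labels.
        String needle = term.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> e : labels.entrySet()) {
            if (e.getValue().toLowerCase(Locale.ROOT).contains(needle)) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }
}

// Backend-agnostic check: every implementation is run through the same expectations.
final class LabelIndexContract {
    private LabelIndexContract() {}

    static boolean holdsFor(LabelIndex index) {
        index.add("http://example.org/a", "Helsinki University Library");
        index.add("http://example.org/b", "National Archive");
        return index.search("library").equals(Collections.singletonList("http://example.org/a"))
            && index.search("missing").isEmpty();
    }
}
```

With this shape, a test suite would call LabelIndexContract.holdsFor(...) once per backend, so adding ES would mean adding one implementation and one line in the test setup.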

The Solr side of jena-text has unfortunately bitrotted even more than the Lucene support. I've previously suggested that it should be removed entirely [1], but there were no responses to my suggestion at the time.

-Osma

[1] https://www.mail-archive.com/[email protected]/msg16380.html

27.02.2017, 14:08, anuj kumar wrote:
Hi All,

*Apologies for the long email.*

As some of you know, I have been working on extending Jena to support
ElasticSearch for text indexing (in addition to Lucene and Solr).

I have reached a point where I have basic (read: non-production) code that
can index rdfs:label text data into ElasticSearch 5.2.1.
The code works and is testable. You simply have to download Elasticsearch
5.2.1 and run it locally to execute the test within the ES
implementation.
The code is NOT production ready; it is just a PoC. You can find the
first cut of the code here: https://github.com/EaseTech/jena (look inside
the module jena-text-es).

I need feedback from the Jena maintainers and community on the
structuring of the code, as it is important for me to finalize this before
I move on to implementing the full-blown, production-ready code for the
Jena Text ElasticSearch integration.

Here is the short description of what I did and the reasoning behind it:

1. Created a separate module, *jena-text-es*, that extends *jena-text*
and excludes all the Lucene-related and Solr-related dependencies. The
reason I had to do this was that the *jena-text* module depends on Lucene
version 4.9.1 whereas ElasticSearch 5.2.1 depends on Lucene 6.4.1. This
resulted in Lucene version conflicts when I created the code for
ElasticSearch support within the *jena-text* module. Thus the need to
create a separate module.
2. A side effect of creating a separate module was that I had to extend the
TextDataSetFactory.java class present in the *jena-text* module to include
methods for creating ElasticSearch index objects. I named it
ESTextDataSetFactory. At this point in time I do not know if this is the
right approach or if Jena ALWAYS instantiates index objects using the
TextDataSetFactory.java class. My initial investigation showed it is fine,
but I would like the Jena experts to please confirm.
3. I have tested a simple integration with ElasticSearch by defining a test
class under
src/test/java/org/apache/jena/query/text/TestBuildTextDataSet.java. You can
run this test by first starting an instance of Elasticsearch 5.2.1 locally.
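
To illustrate the factory extension described in point 2, here is a minimal sketch using stand-in types. IndexHandle, BaseIndexFactory and EsIndexFactory are hypothetical names, and the host and port parameters are assumptions made for illustration; this is not the code in the repository above.

```java
// Hypothetical stand-ins; these are NOT the real jena-text classes.
interface IndexHandle {
    String backend();
}

// Plays the role of the existing factory with its Lucene creator.
class BaseIndexFactory {
    static IndexHandle createLuceneIndex() {
        return () -> "lucene";
    }
}

// The subclass inherits the existing static creators and adds one for Elasticsearch.
class EsIndexFactory extends BaseIndexFactory {
    static IndexHandle createEsIndex(String host, int port) {
        return () -> "elasticsearch@" + host + ":" + port;
    }
}
```

One thing the sketch makes visible: the new creator is reachable only through EsIndexFactory, so any code path that calls the base factory directly would never see it. That is why it matters whether index objects are always instantiated through the factory.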

*My Queries*
1. Is it acceptable to the Jena community that I create a separate module
for ElasticSearch support and call it *jena-text-es*?
2. Is it fine if I extend the TextDataSetFactory.java class within the
*jena-text-es* module?

*Food for Thought*

While implementing the ElasticSearch integration, I could not help but
notice that the *jena-text* module contains not only the core classes for
performing text queries but also technology-specific classes (e.g. for
Lucene and Solr).
IMO, these should be separated out and defined in their own modules to
enable separation of concerns.
This would also make maintenance easier and simplify adding extensions
later on.

I think we should have the following modules:

jena-text - Containing core Jena text specific classes that are technology
agnostic.
jena-text-lucene - Lucene specific implementation of Jena-Text
jena-text-solr - Solr specific implementation of Jena-Text
jena-text-es - ElasticSearch specific implementation of Jena-Text

What does everyone think?

Thanks,
Anuj Kumar


On Tue, Feb 14, 2017 at 2:27 PM, anuj kumar <[email protected]> wrote:

My saviour Osma. It worked :)
Thanks for pointing that out. Really appreciate it.
I am now on to my next task: implementing the actual code for the
ElasticSearch integration with Jena.

Thanks once again.

Anuj Kumar

On Tue, Feb 14, 2017 at 2:22 PM, Osma Suominen <[email protected]>
wrote:

14.02.2017, 15:15, anuj kumar wrote:

I will do it. But I need to first get the simple test working in order to
move forward. I hope someone here can help me.


Maybe you need to add an implementWith declaration to TextAssembler.java?
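
The idea behind such a declaration is that a configuration type has to be bound to the code that builds it before it can be assembled. A minimal sketch of that registration pattern follows, using a stand-in registry rather than Jena's own assembler classes; IndexAssemblerRegistry and the urn:example: type names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical stand-in for an assembler registry; NOT Jena's Assembler API.
final class IndexAssemblerRegistry {
    private final Map<String, Supplier<String>> builders = new HashMap<>();

    // Binds a configuration type to the builder that knows how to construct it.
    IndexAssemblerRegistry implementWith(String typeUri, Supplier<String> builder) {
        builders.put(typeUri, builder);
        return this;
    }

    // Fails for any type that was never registered, which is the symptom
    // a missing declaration would produce.
    String open(String typeUri) {
        Supplier<String> builder = builders.get(typeUri);
        if (builder == null) {
            throw new IllegalArgumentException("No assembler registered for " + typeUri);
        }
        return builder.get();
    }
}
```

In this sketch, opening urn:example:TextIndexES throws until implementWith has been called for that type, after which the registered builder is used.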


-Osma

--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
[email protected]
http://www.nationallibrary.fi




--
*Anuj Kumar*






