Re: Solr Repository Connector

Rafa Haro Tue, 13 Aug 2019 08:25:39 -0700

Hi Dileepa,

IMHO, Furkan's approach makes the most sense here. As Olivier pointed out,
to retrieve the original content from a Lucene-based index, all the fields
you are interested in must be stored. If that is the case for you, you can
probably implement a Repository connector. You can enable incremental
crawling by querying for all the Solr documents (q=*:*), using pagination,
and using one of the fields as a filter to locate only the documents that
are new or modified since the previous crawl.
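
To make the incremental crawl idea a bit more concrete, here is a rough
sketch in Python of the request parameters such a connector might send on
each crawl. It is only an illustration under assumptions: the field names
"last_modified" and "id" are hypothetical (your schema may use something
else), and I am using Solr's cursorMark deep paging, which requires sorting
on the uniqueKey field. It only builds the parameters, it does not call Solr.

```python
from urllib.parse import urlencode


def build_crawl_params(since, cursor="*", rows=100,
                       modified_field="last_modified", id_field="id"):
    """Build Solr select parameters for one page of an incremental crawl.

    since          -- ISO-8601 timestamp of the previous crawl
    cursor         -- cursorMark value; "*" starts a new pass
    modified_field -- hypothetical date field used as the change filter
    id_field       -- assumed name of the uniqueKey field
    """
    return {
        "q": "*:*",                                  # match every document
        "fq": f"{modified_field}:[{since} TO *]",    # keep only new/modified docs
        "sort": f"{id_field} asc",                   # cursorMark needs a uniqueKey sort
        "rows": rows,                                # page size
        "cursorMark": cursor,                        # pagination position
        "wt": "json",
    }


# Query string for the first page of a crawl
query_string = urlencode(build_crawl_params("2019-08-01T00:00:00Z"))
```

For the following pages you would pass the nextCursorMark value returned by
Solr as the cursor argument, and stop when it no longer changes.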


However, it seems to make more sense to include your Solr index as a new
distributed index alongside the other index (ES or Solr) that you plan to
populate using ManifoldCF. The typical components you will need for that
are 1) a query adapter to convert the user query to a query language
supported by all your indexes (easy in this case, because both can talk
Lucene query syntax) and 2) a module to normalize the scores of the results
from all your indexes. You can use a min-max approach for normalizing, for
example.
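
As an illustration of the min-max idea, here is a small Python sketch. It
assumes each index returns a list of (document id, raw score) pairs; the
function names are made up for this example. Each list is rescaled to the
range [0, 1] on its own and the lists are then merged by normalized score.

```python
def min_max_normalize(scores):
    """Rescale raw scores to [0, 1] using (s - min) / (max - min)."""
    if not scores:
        return []
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores are equal; treat them all as top-ranked
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def merge_results(*result_lists):
    """Normalize each index's results separately, then merge and rank them.

    Each argument is a list of (doc_id, raw_score) pairs from one index.
    """
    merged = []
    for results in result_lists:
        normalized = min_max_normalize([score for _, score in results])
        merged.extend((doc_id, n) for (doc_id, _), n in zip(results, normalized))
    # Highest normalized score first
    return sorted(merged, key=lambda item: item[1], reverse=True)
```

Note that min-max is sensitive to outliers and always sends the lowest hit
of each index to 0, so take it as a starting point only.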

This is quite a typical scenario, so I'm sure you can easily find good
literature on how to architect a distributed federated search engine.

Cheers,
Rafa

On Tue, Aug 6, 2019 at 2:52 PM Dileepa Jayakody <[email protected]>
wrote:

> Hi All,
>
> Thank you for your replies.
>
> @Furkan, Olivier, thanks for the pointers. I will check the approach of
> the Solr repository connector as per the given references.
> @Olivier, if you can contribute the Solr repo-connector you are working on
> to MCF, that would be awesome! I will be looking forward to an update on
> that.
>
> Regards,
> Dileepa
>
>
> On Mon, Aug 5, 2019 at 5:01 PM Olivier Tavard <
> [email protected]> wrote:
>
>> Hello,
>>
>> We are currently working on this kind of repository connector for a
>> customer. We plan to give the code to the MCF project if the customer lets
>> us do it legally. We will know at the end of the month or at the
>> beginning of next month.
>>
>> For this to work, all the fields of the target Solr need to be stored;
>> this condition is mandatory. You can have a look at the Solr entity
>> processor of the Data Import Handler component:
>> https://lucene.apache.org/solr/guide/8_0/uploading-structured-data-store-data-with-the-data-import-handler.html#entity-processors
>> We were inspired by that for the development of the connector.
>>
>> Best regards,
>>
>> Olivier
>>
>>
>>
>> On Aug 5, 2019, at 16:38, Furkan KAMACI <[email protected]> wrote:
>>
>> Hi Dileepa,
>>
>> Writing a custom repository connector can let you achieve your goal: read
>> from Solr and write directly to an output connector.
>>
>> You should check your requirements, i.e. which data sources you will
>> connect. MCF may rid you of huge integration pains compared to many other
>> ETL tools in your case.
>>
>> On the other hand, if you want to achieve a federated search, you could
>> search across distributed indexes. Otherwise, it is a heterogeneous-source
>> indexing architecture. You can federate your search query into Solr without
>> ingesting it to any other place. By the way, MCF lets you apply
>> document-level security; you would have to handle that manually in such a
>> case.
>>
>> Kind Regards,
>> Furkan KAMACI
>>
>> On Mon, Aug 5, 2019 at 17:11, Dileepa Jayakody <
>> [email protected]> wrote:
>>
>>> Hi Karl and all,
>>>
>>> In my use-case, one of the data-sources is an already populated Solr
>>> index which is an e-commerce web-site data index (customers, products &
>>> services).
>>> Apart from the Solr index, I need to ingest several other heterogeneous
>>> data-sources such as PostgreSQL databases, CRM data etc. into the
>>> federated search index (the output index will be either Solr or
>>> Elasticsearch. We haven't yet finalized the output index, but I know
>>> that both of these are supported in MCF as output connectors).
>>>
>>> @Karl, based on your comments, I would appreciate your opinion on the
>>> ingestion flow below.
>>> Solr repository/data-source > Solr schema transformations >
>>> Solr/Elasticsearch search-index
>>>
>>> For such a scenario, do you think MCF is not the ideal option as the
>>> ETL/ingestion tool? Should I go for a lower-level ETL tool such as Apache
>>> NiFi?
>>> Or will writing an MCF Solr repository connector be useful to achieve
>>> this?
>>> WDYT?
>>>
>>> Thanks a lot.
>>> Regards,
>>> Dileepa
>>>
>>>
>>>
>>> On Mon, Aug 5, 2019 at 3:40 PM Karl Wright <[email protected]> wrote:
>>>
>>>> If you are trying to extract data from a Solr index, I know of no way
>>>> to do that.
>>>> Karl
>>>>
>>>>
>>>> On Mon, Aug 5, 2019 at 9:08 AM Dileepa Jayakody <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> Thanks for your replies.
>>>>> I'm looking for a repository connector. I've used the Solr output
>>>>> connector before. But now what I need is to connect to a Solr index as a
>>>>> repository and retrieve the documents from there. So I need a Solr
>>>>> repository connector.
>>>>>
>>>>> @Karl
>>>>> I will look at the Solr connector, but this is an output connector,
>>>>> isn't it? Can I use this as a repository connector to retrieve docs?
>>>>>
>>>>> Thanks,
>>>>> Dileepa
>>>>>
>>>>> On Mon, Aug 5, 2019 at 12:45 PM Cihad Guzel <[email protected]> wrote:
>>>>>
>>>>>> Hi Dileepa,
>>>>>>
>>>>>> You can check the list of all MCF connectors at
>>>>>> https://manifoldcf.apache.org/release/release-2.13/en_US/included-connectors.html
>>>>>>
>>>>>> MCF has a Solr Output Connector. It is not a repository connector.
>>>>>> If you want to use Solr as a repository, you should write a new
>>>>>> repository connector.
>>>>>>
>>>>>> Regards,
>>>>>> Cihad Guzel
>>>>>>
>>>>>>
>>>>>> On Mon, Aug 5, 2019 at 13:18, Dileepa Jayakody <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> I'm working on a project which needs to implement a federated search
>>>>>>> solution with heterogeneous data repositories. One repository is a Solr
>>>>>>> index. I would like to use ManifoldCF as the data ingestion engine in
>>>>>>> this project as I have worked with MCF before.
>>>>>>>
>>>>>>> Does ManifoldCF have a Solr repository connector which I can use
>>>>>>> here? Or will I need to implement a new repository connector for Solr?
>>>>>>> Any guidance here is much appreciated.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Dileepa
>>>>>>>
>>>>>>
>>
