Hi Rafa,

Thank you for your valuable suggestions.

On Tue, Aug 13, 2019 at 5:25 PM Rafa Haro <rh...@apache.org> wrote:

> Hi Dileepa,
>
> IMHO, Furkan's approach makes the most sense here. As Olivier pointed out,
> to retrieve the original content from a Lucene-based index, all the fields
> you are interested in must be stored. If that is your case, you can probably
> implement a repository connector. You can enable incremental crawling
> by querying for all the Solr documents (q=*:*), using pagination and using
> one of the fields as a filter to locate only new or modified documents at
> each crawl.
>
> But it seems to make more sense to include your Solr index as a new
> distributed index along with the other index (ES or Solr) that you plan to
> populate using ManifoldCF. The typical resources you are going to need for
> achieving that are 1) a query adapter to convert the user query to a query
> language supported by all your indexes (easy in this case, because both
> can talk Lucene query syntax) and 2) a module to normalize the scores of
> the results from all your indexes. You can use a min-max approach for
> normalising, for example.
>

Are you referring to a query-time-merge approach rather than an
index-time-merge approach (getting all the content into a central index
first) as the solution here?
If so, yes, I am considering that approach as well. My concern is that some
of the data sources are RDF data stores. In my view, querying RDF stores is
typically slow, so the whole search would be slow with a query-federation
approach.
WDYT?

Thanks,
Dileepa

>
> This is quite a typical scenario, so I'm sure you can easily find good
> literature about how to architect a distributed federated search engine.
>
> Cheers,
> Rafa
>
> On Tue, Aug 6, 2019 at 2:52 PM Dileepa Jayakody <dileepajayak...@gmail.com>
> wrote:
>
>> Hi All,
>>
>> Thank you for your replies.
>>
>> @Furkan, Olivier, thanks for the pointers. I will check the approach of
>> the Solr repository connector as per the given references.
>> @Olivier, if you can contribute the Solr repo-connector you are working
>> on to MCF, that will be awesome! I will be looking forward to an update on
>> that.
>>
>> Regards,
>> Dileepa
>>
>>
>> On Mon, Aug 5, 2019 at 5:01 PM Olivier Tavard <
>> olivier.tav...@francelabs.com> wrote:
>>
>>> Hello,
>>>
>>> We are currently working on this kind of repository connector for a
>>> customer. We plan to give the code to the MCF project if the customer lets
>>> us do it legally. We will know at the end of the month or at the
>>> beginning of next month.
>>>
>>> For this to work, all the fields of the target Solr need to
>>> be stored; this condition is mandatory. You can have a look at the Solr
>>> entity processor of the Data Import Handler component:
>>> https://lucene.apache.org/solr/guide/8_0/uploading-structured-data-store-data-with-the-data-import-handler.html#entity-processors.
>>> We were inspired by that for the development of the connector.
>>>
>>> Best regards,
>>>
>>> Olivier
>>>
>>>
>>>
>>> On 5 August 2019 at 16:38, Furkan KAMACI <furkankam...@gmail.com> wrote:
>>>
>>> Hi Dileepa,
>>>
>>> Writing a custom repository connector can let you achieve your goal:
>>> read from Solr and write directly to an output connector.
>>>
>>> You should check your requirements, i.e. which data sources you will
>>> connect. MCF may rid you of huge integration pains compared to many other
>>> ETL tools in your case.
>>>
>>> On the other hand, if you want to achieve a federated search, you could
>>> search across distributed indexes. Otherwise, it is a heterogeneously
>>> sourced indexing architecture. You can federate your search query into Solr
>>> without ingesting it to any other place. By the way, MCF will let you apply
>>> document-level security; you should handle it manually in such a case.
>>>
>>> Kind Regards,
>>> Furkan KAMACI
>>>
>>> On Mon, 5 Aug 2019 at 17:11, Dileepa Jayakody <
>>> dileepajayak...@gmail.com> wrote:
>>>
>>>> Hi Karl and all,
>>>>
>>>> In my use-case, one of the data-sources is an already populated Solr
>>>> index which is an e-commerce web-site data index (customers, products &
>>>> services).
>>>> Apart from the Solr index, I need to ingest several other heterogeneous
>>>> data-sources such as PostgreSQL databases, CRM data etc. into the federated
>>>> search index (the output index will be either Solr or Elasticsearch. We
>>>> haven't yet finalized the output index, but I know that both of these
>>>> are supported in MCF as output connectors.).
>>>>
>>>> @Karl, based on your comments, I would appreciate your opinion on the
>>>> ingestion flow below.
>>>> Solr repository/data-source > Solr schema transformations >
>>>> Solr/Elasticsearch search-index
>>>>
>>>> For such a scenario, do you think MCF is not the ideal option as the
>>>> ETL/ingestion tool? Should I go for a lower-level ETL tool such as Apache
>>>> NiFi?
>>>> Or will writing an MCF Solr repository connector be useful to achieve
>>>> this?
>>>> WDYT?
>>>>
>>>> Thanks a lot.
>>>> Regards,
>>>> Dileepa
>>>>
>>>>
>>>>
>>>> On Mon, Aug 5, 2019 at 3:40 PM Karl Wright <daddy...@gmail.com> wrote:
>>>>
>>>>> If you are trying to extract data from a Solr index, I know of no way
>>>>> to do that.
>>>>> Karl
>>>>>
>>>>>
>>>>> On Mon, Aug 5, 2019 at 9:08 AM Dileepa Jayakody <
>>>>> dileepajayak...@gmail.com> wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> Thanks for your replies.
>>>>>> I'm looking for a repository connector. I've used the Solr output
>>>>>> connector before. But now what I need is to connect to a Solr index as a
>>>>>> repository and retrieve the documents from there. So I need a Solr
>>>>>> repository connector.
>>>>>>
>>>>>> @Karl
>>>>>> I will look at the Solr connector, but this is an output connector,
>>>>>> isn't it?
>>>>>> Can I use this as a repository connector to retrieve docs?
>>>>>>
>>>>>> Thanks,
>>>>>> Dileepa
>>>>>>
>>>>>> On Mon, Aug 5, 2019 at 12:45 PM Cihad Guzel <cguz...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Dileepa,
>>>>>>>
>>>>>>> You can check the list of all MCF connectors at
>>>>>>> https://manifoldcf.apache.org/release/release-2.13/en_US/included-connectors.html
>>>>>>>
>>>>>>> MCF has a Solr Output Connector. It is not a repository connector.
>>>>>>> If you want to use Solr as a repository, you should write a new
>>>>>>> repository connector.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Cihad Guzel
>>>>>>>
>>>>>>>
>>>>>>> On Mon, 5 Aug 2019 at 13:18, Dileepa Jayakody
>>>>>>> <dileepajayak...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>> I'm working on a project which needs to implement a federated
>>>>>>>> search solution with heterogeneous data repositories. One repository
>>>>>>>> is a Solr index. I would like to use ManifoldCF as the data ingestion
>>>>>>>> engine in this project as I have worked with MCF before.
>>>>>>>>
>>>>>>>> Does ManifoldCF have a Solr repository connector which I can use
>>>>>>>> here? Or will I need to implement a new repository connector for Solr?
>>>>>>>> Any guidance here is much appreciated.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Dileepa
>>>>>>>>
>>>>>>>
>>>
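
For reference, the min-max score normalization mentioned in the thread for a query-time merge could be sketched roughly as below. This is only an illustration in Python, not a ManifoldCF, Solr, or Elasticsearch API: the function names (min_max_normalize, merge_results), the (id, score) tuple shape, and the index labels are all assumptions made up for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical shape for one search hit: (document id, raw score from one index).
Hit = Tuple[str, float]


def min_max_normalize(hits: List[Hit]) -> List[Hit]:
    """Rescale the raw scores of a single index into the range [0, 1]."""
    if not hits:
        return []
    scores = [score for _, score in hits]
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # No spread to rescale: treat every hit of this index as a top hit.
        return [(doc_id, 1.0) for doc_id, _ in hits]
    return [(doc_id, (score - lo) / (hi - lo)) for doc_id, score in hits]


def merge_results(
    per_index_hits: Dict[str, List[Hit]], limit: int = 10
) -> List[Tuple[str, str, float]]:
    """Normalize each index on its own, then merge into one ranked list."""
    merged = []
    for index_name, hits in per_index_hits.items():
        for doc_id, score in min_max_normalize(hits):
            merged.append((index_name, doc_id, score))
    # Highest normalized score first; the sort is stable, so ties keep
    # the order in which the indexes were supplied.
    merged.sort(key=lambda row: row[2], reverse=True)
    return merged[:limit]
```

Calling merge_results with one list of hits per index (for example the existing Solr index and the new output index) returns a single list ordered by normalized score. Note that min-max only makes the score ranges comparable; it does not by itself address the latency concern with slow sources raised above.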