Dan, if you're bound to federated search, then I would say you need to work on the service guarantees of each node and, perhaps, create strategies to cope with bad nodes.
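One such coping strategy, sketched in Python purely for illustration (these are hypothetical names, not Solr code): fan the query out to all nodes in parallel with a single deadline, keep whatever answers arrive in time, and drop the laggards.

```python
import concurrent.futures
import time

def query_node(node):
    """Stand-in for a remote search call; a real node would issue HTTP."""
    latency = {"fast": 0.01, "ok": 0.05, "slow": 2.0}[node]
    time.sleep(latency)
    return ["%s-doc-%d" % (node, i) for i in range(3)]

def federated_search(nodes, deadline_s=0.5):
    """Fan out to every node in parallel; keep only the answers that
    arrive before the deadline, so one bad node cannot stall the search."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(nodes))
    futures = {pool.submit(query_node, n): n for n in nodes}
    done, not_done = concurrent.futures.wait(futures, timeout=deadline_s)
    results = {futures[f]: f.result() for f in done}
    for f in not_done:
        f.cancel()               # drop the laggard; its answer is ignored
    pool.shutdown(wait=False)    # do not block waiting on the slow node
    return results

hits = federated_search(["fast", "ok", "slow"])
```

Here the "slow" node misses the deadline and is simply absent from the blended result; a production version would also track repeat offenders so they can be demoted or skipped.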
Paul

On 26 Aug 2013, at 22:57, Dan Davis wrote:

> First answer:
>
> My employer is a library and does not have a license to harvest everything
> indexed by a "web-scale discovery service" such as Primo or Summon. If
> our design automatically relays searches entered by users, and then
> periodically purges results, I think it is reasonable from a licensing
> perspective.
>
> Second answer:
>
> What if you wanted your Apache Solr powered search to include all results
> from Google Scholar for any query? Do you think you could easily or
> cheaply configure a ZooKeeper cluster large enough to harvest and index
> all of Google Scholar? Would that violate robot rules? Is it even
> possible to do this from an API perspective? Wouldn't Google notice?
>
> Third answer:
>
> On Gartner's 2013 Enterprise Search Magic Quadrant, LucidWorks and the
> other enterprise search firm based on Apache Solr were dinged for the lack
> of federated search. I do not have the hubris to think I can fix that, and
> it is not really my role to try, but something that works without
> harvesting and local indexing is obviously desirable to enterprise search
> users.
>
> On Mon, Aug 26, 2013 at 4:46 PM, Paul Libbrecht <p...@hoplahup.net> wrote:
>
>> Why not simply create a meta search engine that indexes everything from
>> each of the nodes? (I think one calls this harvesting.)
>>
>> I believe that is the way to avoid all sorts of performance bottlenecks.
>> As far as I could analyze, the performance of a federated search is the
>> performance of the slowest node, which can turn out to be quite bad if
>> you do not exercise guarantees over remote sources.
>>
>> Or are the "remote cores" below actually things that you manage on your
>> side? If yes, guarantees are easy to manage.
>>
>> Paul
>>
>> On 26 Aug 2013, at 22:38, Dan Davis wrote:
>>
>>> I have now come to the task of estimating man-days to add "Blended
>>> Search Results" to Apache Solr.
>>> The argument has been made that this is not
>>> desirable (see Jonathan Rochkind's blog entries on bento search with
>>> Blacklight), but the estimate remains. No estimate is worth much
>>> without a design, so I have come to the difficulty of estimating this
>>> without having in-depth knowledge of the Apache Solr core. Here is my
>>> design, likely imperfect, as it stands.
>>>
>>> - Configure a core specific to each search source (local or remote).
>>> - On cores that index remote content, implement a periodic delete query
>>> that deletes documents whose timestamp is too old.
>>> - Implement a custom requestHandler for the "remote" cores that goes
>>> out and queries the remote source. For each result in the top N
>>> (configurable), it computes an id that is stable (e.g. based on the
>>> remote resource URL, DOI, or a hash of the data returned). It uses
>>> that id to look up the document in the Lucene index. If the data is
>>> not there, it updates the Lucene core and sets a flag that a commit is
>>> required. Once it is done, it commits if needed.
>>> - Configure a core that uses a custom SearchComponent to call the
>>> requestHandler that goes and gets new documents and commits them.
>>> Since the cores for remote content are separate cores, they can reopen
>>> their searchers at this point if any commit is needed. The custom
>>> SearchComponent will wait for the commit and reload to complete.
>>> Then search continues using the other cores as "shards".
>>> - Auto-warming will assure that the most recently requested data
>>> is present.
>>>
>>> It will, of course, be very slow a good part of the time.
>>>
>>> Erick and others, I need to know whether this design has legs and what
>>> other alternatives I might consider.
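The stable-id and periodic-purge steps of the design above can be sketched as follows (a minimal Python illustration with hypothetical names; in Solr the purge would be a delete-by-query against a timestamp field and the cache would be the remote core itself):

```python
import hashlib

def stable_id(resource_url):
    """Derive a repeatable document id from the remote resource URL or DOI,
    so the same remote record always maps to the same local document."""
    return hashlib.sha1(resource_url.encode("utf-8")).hexdigest()

class RemoteCoreCache:
    """Toy stand-in for a 'remote' core: a map of id -> (doc, fetched_at)."""

    def __init__(self, max_age_s):
        self.max_age_s = max_age_s
        self.docs = {}

    def upsert(self, url, doc, now):
        """Index the doc only if unseen; return True when a commit is needed."""
        doc_id = stable_id(url)
        if doc_id in self.docs:
            return False
        self.docs[doc_id] = (doc, now)
        return True

    def purge_stale(self, now):
        """The periodic delete query: drop docs older than max_age_s."""
        stale = [i for i, (_, t) in self.docs.items() if now - t > self.max_age_s]
        for i in stale:
            del self.docs[i]
        return len(stale)
```

The design choice being illustrated is that the stable id makes re-fetches idempotent: relaying the same user query twice only triggers one commit, and the purge keeps the cache inside whatever retention window licensing allows.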
>>>
>>> On Sun, Aug 18, 2013 at 3:14 PM, Erick Erickson
>>> <erickerick...@gmail.com> wrote:
>>>
>>>> The lack of global TF/IDF has been answered in the past,
>>>> in the sharded case, by "usually you have similar enough
>>>> stats that it doesn't matter". This presupposes a fairly
>>>> evenly distributed set of documents.
>>>>
>>>> But if you're talking about federated search across different
>>>> types of documents, then what would you "rescore" with?
>>>> How would you even consider scoring docs that are somewhat or
>>>> totally different? Think magazine articles and metadata associated
>>>> with pictures.
>>>>
>>>> What I've usually found is that one can use grouping to show
>>>> the top N of a variety of result types. Or show tabs with different
>>>> types. Or have the app intelligently combine the different types
>>>> of documents in a way that "makes sense". But I don't know
>>>> how you'd just get "the right thing" to happen with some kind
>>>> of scoring magic.
>>>>
>>>> Best,
>>>> Erick
>>>>
>>>> On Fri, Aug 16, 2013 at 4:07 PM, Dan Davis <dansm...@gmail.com> wrote:
>>>>
>>>>> I've thought about it, and I have no time to really do a meta-search
>>>>> during evaluation. What I need to do is to create a single core that
>>>>> contains both of my data sets, and then describe the architecture
>>>>> that would be required to do blended results, with liberal estimates.
>>>>>
>>>>> From the perspective of evaluation, I need to understand whether any
>>>>> of the solutions to better ranking in the absence of global IDF have
>>>>> been explored. I suspect that one could retrieve a much larger than N
>>>>> set of results from a set of shards and re-score in some way that
>>>>> doesn't require global IDF, e.g. storing all results in the same
>>>>> priority queue and *re-scoring* before *re-ranking*.
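One hedged reading of the re-score-then-re-rank idea above, in illustrative Python (this is not an existing Solr feature; per-shard min-max normalization is just one simple stand-in for global IDF): pull each shard's hits, rescale every shard's scores into a comparable range, then merge everything through a single priority queue.

```python
import heapq

def normalize(shard_hits):
    """Min-max normalize one shard's raw scores into [0, 1] so shards
    with incompatible scoring scales can be merged. This is one crude
    substitute for true global IDF, not the only possibility."""
    scores = [s for _, s in shard_hits]
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0
    return [(doc, (s - lo) / span) for doc, s in shard_hits]

def blend(shards, n):
    """Re-score every shard's hits, push all into one heap, re-rank top n."""
    heap = []
    for hits in shards:
        for doc, score in normalize(hits):
            heapq.heappush(heap, (-score, doc))  # max-heap via negation
    return [heapq.heappop(heap)[1] for _ in range(min(n, len(heap)))]

shard_a = [("a1", 12.0), ("a2", 3.0)]   # e.g. article core, large raw scores
shard_b = [("b1", 0.9), ("b2", 0.1)]    # e.g. metadata core, small raw scores
top = blend([shard_a, shard_b], n=3)    # top hits from both scales interleave
```

As Erick notes, no rescaling makes scores from genuinely different document types truly commensurable; this only prevents one shard's scale from dominating the merged list outright.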
>>>>>
>>>>> The other way to do this would be to have a custom SearchHandler that
>>>>> works differently: it performs the query, retrieves all results
>>>>> deemed relevant by another engine, adds them to the Lucene index, and
>>>>> then performs the query again in the standard way. This would be
>>>>> quite slow, but perhaps useful as a way to evaluate my method.
>>>>>
>>>>> I still welcome any suggestions on how such a SearchHandler could be
>>>>> implemented.
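The two-pass handler described above could be prototyped roughly like this (a Python sketch with entirely hypothetical names; a real implementation would be a Java Solr request handler, not this):

```python
def blended_search(query, local_index, remote_engine, rerun):
    """Sketch of the two-pass handler: fetch remote hits, fold them into
    the local index, then run the normal local query over everything."""
    for doc in remote_engine(query):            # 1. results the other engine deems relevant
        local_index.setdefault(doc["id"], doc)  # 2. add them to the local index
    return rerun(query, local_index)            # 3. perform the query again, standard path

# Toy stand-ins (all hypothetical):
def fake_remote(query):
    return [{"id": "r1", "title": "remote hit for %s" % query}]

def fake_rerun(query, index):
    return sorted(index)  # a real second pass would score; here we list ids

index = {"l1": {"id": "l1", "title": "local doc"}}
hits = blended_search("solr", index, fake_remote, fake_rerun)
```

The point of the sketch is the evaluation value Dan mentions: because the second pass is an ordinary local query, the blended ranking can be compared directly against a plain local ranking of the same enlarged index.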