Re: Update Plugins (was Re: Handling disparate data sources in Solr)
On 1/16/07, Ryan McKinley [EMAIL PROTECTED] wrote: ...I think a DocumentParser registry is a good way to isolate this top level task... With all this talk about plugins, registries etc., /me can't help thinking that this would be a good time to introduce the Spring IoC container to manage this stuff. More info at http://www.springframework.org/docs/reference/beans.html for people who are not familiar with it. It's very easy to use for simple cases like the ones we're talking about. -Bertrand
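For those who have not used it, here is a minimal sketch of what assembling plugins through the Spring IoC container could look like. The DocumentParser interface and the solr-plugins.xml file name are hypothetical placeholders for discussion, not anything that exists in Solr today:

    // Minimal sketch: letting Spring's IoC container assemble Solr plugins.
    // solr-plugins.xml would hold the bean definitions (class names, constructor args).
    import java.util.Map;
    import org.springframework.context.ApplicationContext;
    import org.springframework.context.support.ClassPathXmlApplicationContext;

    interface DocumentParser { /* hypothetical plugin contract from this thread */ }

    public class PluginAssembly {
        public static void main(String[] args) {
            ApplicationContext ctx =
                new ClassPathXmlApplicationContext("solr-plugins.xml");
            // Collect every configured bean implementing the (hypothetical) DocumentParser interface.
            Map parsers = ctx.getBeansOfType(DocumentParser.class);
            System.out.println("registered parsers: " + parsers.keySet());
        }
    }

The point of the container here is only the lookup-by-interface and the externalized wiring; handlers never need to know how their collaborators were constructed.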
[jira] Commented: (SOLR-69) PATCH:MoreLikeThis support
[ https://issues.apache.org/jira/browse/SOLR-69?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12465092 ]

Bertrand Delacretaz commented on SOLR-69:
-----------------------------------------

SOLR-69.patch updated

PATCH: MoreLikeThis support
---------------------------
    Key: SOLR-69
    URL: https://issues.apache.org/jira/browse/SOLR-69
    Project: Solr
    Issue Type: Improvement
    Components: search
    Reporter: Bertrand Delacretaz
    Priority: Minor
    Attachments: lucene-queries-2.0.0.jar, SOLR-69.patch, SOLR-69.patch

Here's a patch that implements simple support of Lucene's MoreLikeThis class. The MoreLikeThisHelper code is heavily based on (hmm... "lifted from" might be more appropriate ;-) Erik Hatcher's example mentioned in http://www.mail-archive.com/solr-user@lucene.apache.org/msg00878.html

To use it, add at least the following parameters to a standard or dismax query:

  mlt=true
  mlt.fl=list,of,fields,which,define,similarity

See the MoreLikeThisHelper source code for more parameters.

Here are two URLs that work with the example config, after loading all documents found in exampledocs into the index (just to show that it seems to work - of course you need a larger corpus to make it interesting):

  http://localhost:8983/solr/select/?stylesheet=&q=apache&qt=standard&mlt=true&mlt.fl=manu,cat&mlt.mindf=1&mlt.mintf=1&fl=id,score
  http://localhost:8983/solr/select/?stylesheet=&q=apache&qt=dismax&mlt=true&mlt.fl=manu,cat&mlt.mindf=1&mlt.mintf=1&fl=id,score

Results are added to the output like this:

  <response>
    ...
    <lst name="moreLikeThis">
      <result name="UTF8TEST" numFound="1" start="0" maxScore="1.5293242">
        <doc>
          <float name="score">1.5293242</float>
          <str name="id">SOLR1000</str>
        </doc>
      </result>
      <result name="SOLR1000" numFound="1" start="0" maxScore="1.5293242">
        <doc>
          <float name="score">1.5293242</float>
          <str name="id">UTF8TEST</str>
        </doc>
      </result>
    </lst>
  </response>

I haven't tested this extensively yet, will do in the next few days. But comments are welcome of course.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Created: (SOLR-110) Factor out common code in our SolrRequestHandler classes
Factor out common code in our SolrRequestHandler classes Key: SOLR-110 URL: https://issues.apache.org/jira/browse/SOLR-110 Project: Solr Issue Type: Improvement Components: search Reporter: Bertrand Delacretaz DisMaxRequestHandler and StandardRequestHandler are similar enough to warrant a common base class, or helper classes to factor out common code. I don't have the time (or courage ;-) to do that right now, but it should be done to save time when implementing features that impact both classes. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (SOLR-69) PATCH:MoreLikeThis support
[ https://issues.apache.org/jira/browse/SOLR-69?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12465114 ]

Bertrand Delacretaz commented on SOLR-69:
-----------------------------------------

The method used to compute includeScore in MoreLikeThisHelper was inconsistent with what the XmlWriter does. I have changed it to take this info from SolrQueryResponse.getReturnFields().

The md5 sum of the current SOLR-69 patch is b6178d11d33f19b296b741a67df00d45

With this change, all the following requests should work (standard and dismax handlers, with no fl param, id only, and id + score as return fields):

  http://localhost:8983/solr/select/?stylesheet=&q=apache&qt=standard&mlt=true&mlt.fl=manu,cat&mlt.mindf=1&mlt.mintf=1
  http://localhost:8983/solr/select/?stylesheet=&q=apache&qt=standard&mlt=true&mlt.fl=manu,cat&mlt.mindf=1&mlt.mintf=1&fl=id
  http://localhost:8983/solr/select/?stylesheet=&q=apache&qt=standard&mlt=true&mlt.fl=manu,cat&mlt.mindf=1&mlt.mintf=1&fl=id,score
  http://localhost:8983/solr/select/?stylesheet=&q=apache&qt=dismax&mlt=true&mlt.fl=manu,cat&mlt.mindf=1&mlt.mintf=1
  http://localhost:8983/solr/select/?stylesheet=&q=apache&qt=dismax&mlt=true&mlt.fl=manu,cat&mlt.mindf=1&mlt.mintf=1&fl=id
  http://localhost:8983/solr/select/?stylesheet=&q=apache&qt=dismax&mlt=true&mlt.fl=manu,cat&mlt.mindf=1&mlt.mintf=1&fl=id,score

PATCH: MoreLikeThis support
---------------------------
    Key: SOLR-69
    URL: https://issues.apache.org/jira/browse/SOLR-69
    Project: Solr
    Issue Type: Improvement
    Components: search
    Reporter: Bertrand Delacretaz
    Priority: Minor
    Attachments: lucene-queries-2.0.0.jar, SOLR-69.patch, SOLR-69.patch, SOLR-69.patch

Here's a patch that implements simple support of Lucene's MoreLikeThis class. The MoreLikeThisHelper code is heavily based on (hmm... "lifted from" might be more appropriate ;-) Erik Hatcher's example mentioned in http://www.mail-archive.com/solr-user@lucene.apache.org/msg00878.html

To use it, add at least the following parameters to a standard or dismax query:

  mlt=true
  mlt.fl=list,of,fields,which,define,similarity

See the MoreLikeThisHelper source code for more parameters.

Here are two URLs that work with the example config, after loading all documents found in exampledocs into the index (just to show that it seems to work - of course you need a larger corpus to make it interesting):

  http://localhost:8983/solr/select/?stylesheet=&q=apache&qt=standard&mlt=true&mlt.fl=manu,cat&mlt.mindf=1&mlt.mintf=1&fl=id,score
  http://localhost:8983/solr/select/?stylesheet=&q=apache&qt=dismax&mlt=true&mlt.fl=manu,cat&mlt.mindf=1&mlt.mintf=1&fl=id,score

Results are added to the output like this:

  <response>
    ...
    <lst name="moreLikeThis">
      <result name="UTF8TEST" numFound="1" start="0" maxScore="1.5293242">
        <doc>
          <float name="score">1.5293242</float>
          <str name="id">SOLR1000</str>
        </doc>
      </result>
      <result name="SOLR1000" numFound="1" start="0" maxScore="1.5293242">
        <doc>
          <float name="score">1.5293242</float>
          <str name="id">UTF8TEST</str>
        </doc>
      </result>
    </lst>
  </response>

I haven't tested this extensively yet, will do in the next few days. But comments are welcome of course.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
query to pull all document ids in an index
I'm new to solr (working on solrb with Erik). We have some functional tests that run against a live solr instance, and I'd like the tests to periodically remove all the documents from the index. This way tests will have a predictable outcome that is independent of the state of the index before the test. I was thinking I could do a query that pulls back all the document ids in the index, and then delete each one... but I'm not quite sure how I could perform such a select. Does anyone have any ideas? //Ed
Re: query to pull all document ids in an index
On 1/16/07, Edward Summers [EMAIL PROTECTED] wrote: I was thinking I could do a query that pulls back all the document ids in the index, and then delete each one... The delete by query feature will do this without requiring an iteration on the client side, see http://incubator.apache.org/solr/tutorial.html#Deleting+Data -Bertrand
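For reference, the delete-by-query commands for that use case are tiny; a sketch assuming the stock example schema, whose id field exists on every document (post each snippet to the update URL, e.g. with post.sh):

  <delete><query>id:[* TO *]</query></delete>
  <commit/>

The range query id:[* TO *] matches every document that has an id, so the whole index is emptied in one round trip instead of one delete per document.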
SOLR-67 query interface
I'm new to SOLR and would like to contribute. I think my skills would best lend themselves to helping with a nice query interface. I'm a java web dev by profession (couple of the sites/companies I have worked with are below) www.ptplace.com www.colinx.com www.getlocalbiz.com www.kemperinvestors.com (don't blame me, client wanted it that way) Is someone else working on this already ? How can I help ? Thanks, Rick -- View this message in context: http://www.nabble.com/SOLR-67-query-interface-tf3020838.html#a8389856 Sent from the Solr - Dev mailing list archive at Nabble.com.
Re: SOLR-67 query interface
On 1/16/07, rlawson [EMAIL PROTECTED] wrote: ...I'm new to SOLR and would like to contribute. I think my skills would best lend themselves to helping with a nice query interface. I'm a java web dev by profession... If you mean graphic design of the admin webpages, there are two issues about this currently: http://issues.apache.org/jira/browse/SOLR-84 http://issues.apache.org/jira/browse/SOLR-76 Your opinions and contributions are of course welcome! ...www.kemperinvestors.com (don't blame me, client wanted it that way)... ouch ;-) -Bertrand
Can this be achieved? (Was: document support for file system crawling)
First: Please pardon the cross-post to solr-user for reference. I hope to continue this thread in solr-dev. Please answer to solr-dev.

1) more documentation (and possibly some locking configuration options) on how you can use Solr to access an index generated by the nutch crawler (i think Thorsten has already done this) or by Compass, or any other system that builds a Lucene index.

Thorsten Scherler? Is this code available anywhere? Sounds very interesting to me. Maybe someone could elaborate on the differences between the indexes created by Nutch/Solr/Compass/etc., or point me in the direction of an answer?

2) contrib code that runs as its own process to crawl documents and send them to a Solr server. (maybe it parses them, or maybe it relies on the next item...)

Do you know FAST? It uses a step-by-step approach (pipeline) in which all of these tasks are done. Much of it is tuned in an easy web tool. The point I'm trying to make is that contrib code is nice, but a complete package with these possibilities could broaden Solr's appeal somewhat.

3) Stock update plugins that can each read a raw inputstream of some widely used file format (PDF, RDF, HTML, XML of any schema) and have configuration options telling them what fields in the schema each part of their document type should go in.

Exactly, this sounds more like it. But if similar inputstreams can be handled by Nutch, what's the point in using Solr at all? The http APIs? In other words, both Nutch and Solr seem to have functionality that enterprises would want. But neither gives you the total solution. Don't get it wrong, I don't want to bloat the products, even though it would be nice to have a crossover solution which is easy to set up.

The architecture could look something like this:

  Connector -> Parser -> DocProc -> (via schema) -> Index

Possible connectors: JDBC, filesystem, crawler, manual feed
Possible parsers: PDF, whatever

Both connectors, parsers AND the document processors would be plugins. The DocProcs would typically be adjusted for each enterprise's needs, so that it fits with their schema.xml.

Problem is; I haven't worked enough with Solr, Nutch, Lucene etc. to really know all possibilities and limitations. But I do believe that the outlined architecture would be flexible and answer many needs. So the question is: What is Solr missing? Could parts of Nutch be used in Solr to achieve this? How? Have I misunderstood completely? :)

Eivind
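To make the outlined pipeline concrete, here is a rough sketch of what such plugin contracts could look like in Java; every name in it (Connector, Parser, DocProcessor, RawDocument) is hypothetical and purely illustrative, not existing Solr or Nutch API:

    // Hypothetical plugin contracts for the Connector -> Parser -> DocProc -> Index idea.
    // None of these types exist in Solr; this only illustrates the proposed separation.
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.Iterator;
    import org.apache.lucene.document.Document;

    interface Connector {                      // JDBC, filesystem, crawler, manual feed
        Iterator<InputStream> fetch() throws IOException;
    }

    interface Parser {                         // PDF, HTML, whatever
        RawDocument parse(InputStream in) throws IOException;
    }

    interface DocProcessor {                   // per-site mapping onto schema.xml fields
        Document process(RawDocument raw);
    }

    class RawDocument {                        // parser output before field mapping
        // title, body text and metadata extracted by the parser would live here
    }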
solr.solr.home- what I have to do?
Hello! I'm a novice in Lucene technologies, and am now only trying to install Solr. The main problem appeared because I have to use the Sun Java (bla-bla) web server as servlet container. So: Who can explain to me what this phrase in solr's docs means: "Solr now looks in ./solr/conf for config, ./solr/data for data - configurable via solr.solr.home system property"? Is this system property really a system-property ... / tag in the web.xml file? Or do I have to define some environment var with the name solr.solr.home? Or something else?

Sincerely yours,
Buharkin Y.A.
Moscow
Merging Results from Multiple Solr Instances
I have three instances of Solr on a single machine that I would like to query as if they were a single instance. I was wondering if there's a facility, or if anyone has any recommendations, for searching across multiple instances with a single query, or merging the results of multiple instances into one result set. -STA
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
Bertrand Delacretaz wrote: With all this talk about plugins, registries etc., /me can't help thinking that this would be a good time to introduce the Spring IoC container to manage this stuff. More info at http://www.springframework.org/docs/reference/beans.html for people who are not familiar with it. It's very easy to use for simple cases like the ones we're talking about. Please, no. I work on a big webapp that uses spring - it's a complete nightmare to figure out what's going on. -- Alan Burlison --
To Spring or not to Spring? (was: Update Plugins)
On 1/16/07, Alan Burlison [EMAIL PROTECTED] wrote: Bertrand Delacretaz wrote: .../me can't help thinking that this would be a good time to introduce the Spring IoC container to manage this stuff... Please, no. I work on a big webapp that uses spring - it's a complete nightmare to figure out what's going on. Using just the IoC container? I'm not talking about full-blown Spring magic, *just* IoC to assemble plugins. Spring's IoC is not complicated, and logging statements and debuggers are here to find out exactly what's happening if needed. I don't think it'd be more complicated than using our homegrown plugin system. Only more tested, documented and well-known. -Bertrand
Re: To Spring or not to Spring? (was: Update Plugins)
Bertrand Delacretaz wrote: Using just the IoC container? I'm not talking about full-blown Spring magic, *just* IoC to assemble plugins. Spring's IoC is not complicated, and logging statements and debuggers are here to find out exactly what's happening if needed. I don't think it'd be more complicated than using our homegrown plugin system. Only more tested, documented and well-known. It just seems like a big hammer to crack a small nut. I've had *bad* experiences with apps where people pulled in just about every framework, component and widget you can think of - to understand what the hell is going on you end up having to be an expert in all of them. Yes, I'm probably just paranoid ;-) -- Alan Burlison --
Re: To Spring or not to Spring? (was: Update Plugins)
On 1/16/07, Alan Burlison [EMAIL PROTECTED] wrote: ...I've had *bad* experiences with apps where people pulled in just about every framework, component and widget you can think of... That's what your previous message seemed to imply ;-) I agree that, if we start using Spring (or another) IoC container, we must be careful to use what actually helps us, and not let it become our Code Dictator... -Bertrand
Re: Can this be achieved? (Was: document support for file system crawling)
On Tue, 2007-01-16 at 16:28 +0100, Eivind Hasle Amundsen wrote:

First: Please pardon the cross-post to solr-user for reference. I hope to continue this thread in solr-dev. Please answer to solr-dev.

1) more documentation (and possibly some locking configuration options) on how you can use Solr to access an index generated by the nutch crawler (i think Thorsten has already done this) or by Compass, or any other system that builds a Lucene index.

Thorsten Scherler?

Hmm, I did the exact opposite. Let me explain my use case. I am working on a part of a portal, http://andaluciajunta.es. The new version of http://andaluciajunta.es/BOJA is this part. The current version is based on a proprietary CMS in a dynamic environment. The new development uses Apache Forrest to generate static html.

Now coming to solr/nutch: at http://andaluciajunta.es/portal/aj-bojaBuscador/0,22815,,00.html you can find the current search engine especially for the BOJA. This will be changed to a solr-powered solution. Like I said, I am only doing one part of the portal, and the main portal has a search engine as well: http://andaluciajunta.es/aj-sea-.html This search engine will be based on nutch in the next version. The special requirement is that this main portal search engine has to search against the solr-based BOJA index. Meaning Nutch will have to search the solr index and not vice versa.

What I did before we decided to go with solr is a simple test: I copied my solr index into a nutch instance and dispatched a couple of queries. The only thing that you need is to keep your solr schema as close as possible to the one nutch uses. For example nutch uses content, url and title as default fields when returning the search result. If you do not have these fields in your solr schema then nutch will return null.

Is this code available anywhere?

Like stated above, it is a couple of lines in the solr schema:

  <field name="title" type="string" stored="true" />
  <field name="content" type="text" indexed="true" stored="true" />
  <field name="url" type="string" stored="true" />

Then you just need to point your nutch instance to this index for searching. The same is true (I guess) for solr searching a nutch index. You could use nutch to update the index, point solr to the index and it should work (if you have defined all fields in the schema).

Sounds very interesting to me. Maybe someone could elaborate on the differences between the indexes created by Nutch/Solr/Compass/etc., or point me in the direction of an answer?

I am far from being an expert, but actually the only real difference I see is the usage of field names. All indexes could be searched with a raw lucene component (if they are based on the same lucene version).

2) contrib code that runs as its own process to crawl documents and send them to a Solr server. (maybe it parses them, or maybe it relies on the next item...)

Do you know FAST? It uses a step-by-step approach (pipeline) in which all of these tasks are done. Much of it is tuned in an easy web tool. The point I'm trying to make is that contrib code is nice, but a complete package with these possibilities could broaden Solr's appeal somewhat.

Hmm, I think like Hoss on this: why do we want to do the same work as nutch? If you need a crawler, why not use the one from nutch and change some lines?

I actually use Forrest as crawler when I generate the new sites, which will then push the content to the solr server via a plugin I developed: http://forrest.apache.org/pluginDocs/plugins_0_80/org.apache.forrest.plugin.output.solr/

3) Stock update plugins that can each read a raw inputstream of some widely used file format (PDF, RDF, HTML, XML of any schema) and have configuration options telling them what fields in the schema each part of their document type should go in.

Exactly, this sounds more like it. But if similar inputstreams can be handled by Nutch, what's the point in using Solr at all? The http APIs? In other words, both Nutch and Solr seem to have functionality that enterprises would want. But neither gives you the total solution.

Not sure. I am using solr because I did not have to develop three different nutch plugins to make it work. Further, I have punctual updates where I push a certain set of documents to the server, so no need for a crawler.

Don't get it wrong, I don't want to bloat the products, even though it would be nice to have a crossover solution which is easy to set up. The architecture could look something like this: Connector -> Parser -> DocProc -> (via schema) -> Index. Possible connectors: JDBC, filesystem, crawler, manual feed. Possible parsers: PDF, whatever. Both connectors, parsers AND the document processors would be plugins. The DocProcs would typically be adjusted for each enterprise's needs, so that it fits with their schema.xml.

Problem is; I haven't worked enough with Solr, Nutch, Lucene etc. to really know all
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
I'm in frantic deadline mode so I'm just going to throw in some (hopefully) short comments...

At 11:02 PM -0800 1/15/07, Ryan McKinley wrote:

the one thing that still seems missing is those micro-plugins i was [SNIP]

  interface SolrRequestParser {
    SolrRequest process( HttpServletRequest req );
  }

I left out micro-plugins because i don't quite have a good answer yet :) This may be a place where a custom dispatcher servlet/filter defined in web.xml is the most appropriate solution.

If the issue is munging HttpServletRequest information, then a proper separation of concerns suggests responsibility should lie with a Servlet Filter, as Ryan suggests. For example, while the Servlet 2.4 spec doesn't have specifications for how the servlet container can/should burst a multipart-MIME payload into separate files or streams, there are a number of 3rd-party Filters which do this.

The IteratorContentStream is a great idea because, if each stream is read to completion before the next is opened, it doesn't impose any limitation on individual stream length and doesn't require disk buffering. (Of course some handlers may require access to more than one stream at a time; each time next() is called on the iterator before the current stream is closed, the remainder of that stream will have to be buffered in memory or on disk, depending on the part length. Nonetheless that detail can be entirely hidden from the handler, as it should be. I am not sure if any available ServletFilter implementations work this way, but it's certainly doable.)

But that detail is irrelevant for now; as I suggest below, using this API lets one immediately implement it with a single next() value holding the entire POST stream; that would answer the needs of the existing update request handling code, but establish an API to handle multi-part. Whenever someone wants to write a multi-stream handler, they can write or find a better IteratorContentStream implementation, which would best be cast as a ServletFilter.

I like the SolrRequestParser suggestion.

Me too. It answers a hole in my vision for how this can all fit together.

Consider: qt='RequestHandler' wt='ResponseWriter' rp='RequestParser' (rb='SolrBuilder'?) To avoid possible POST read-ahead stream mungling: qt, wt, and rp should be defined by the URL, not parameters. (We can add special logic to allow /query?qt=xxx) For qt, I like J.J. Larrea's suggestion on SOLR-104 to let people define arbitrary path mapping for qt. We could append 'wt', 'rb', and arbitrary text to the registered path, something like /registered/path/wt:json/rb:standard/more/stuff/in/the/path?params... (any other syntax ideas?)

No need for new syntax, I think. The pathInfo or qt or other source resolves to a requestHandler CONFIG name. The handler config is read to determine the handler class name. It also can be consulted (with URL or form-POST params overriding if allowed by the config) to decide which RequestParser to invoke BEFORE IT IS CALLED and which ResponseWriter to invoke AFTER. Once those objects are set up, the request body gets executed. Handler config inheritance (as I proposed in SOLR-104 point #2) would greatly simplify, for example, creating a dozen query handlers which used a particular invariant combination of qt, wt, and rp.

The 'standard' RequestParser would: GET: fill up SolrParams directly with req.getParameterMap(); if there is a 'post' parameter (post=XXX) return a stream with XXX as its content, else an empty iterator.

Perhaps add a standard way to reference a remote URI stream.
POST:
  if( multipart ) {
    read all form fields into parameter map.

This should use the same req.getParameterMap as for GET, which Servlet 2.4 says is supposed to be done automatically by the servlet container if the payload is application/x-www-form-urlencoded; in that case the input stream should be null.

    return an iterator over the collection of files

Collection of streams, per Hoss.

  } else {
    no parameters? parse parameters from the URL? /name:value/
    return the body stream

As above, this introduces unneeded complexity and should be avoided.

  }

DEL: throw unsupported exception?

Maybe each RequestHandler could have a default RequestParser. If we limited the 'arbitrary path' to one level, this could be used to generate more RESTful URLs. Consider: /myadder//// /myadder maps to MyCustomHandler and that gives you MyCustomRequestBuilder that maps /// to SolrParams

I think these are best left for an extra-SOLR layer, especially since SOLR URLs are meant for interprogram communication and not direct use by non-developer end users. For example, for my org's website I have hundreds of Apache mod_rewrite rules which do URL munging such as /journals/abc/7/3/192a.pdf into /journalroot/index.cfm?journal=abc&volume=7&issue=3&page=192&seq=a&format=pdf

Or someone could custom-code a subclass of SolrServlet which
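To summarize the contract this thread keeps circling, here is a compact sketch in Java; the names follow the thread's working vocabulary (SolrRequestParser, ContentStream, SolrParams), but the exact signatures are assumptions for illustration, not a committed Solr API:

    // Sketch only: thread vocabulary, not committed Solr API.
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.Iterator;
    import javax.servlet.http.HttpServletRequest;

    // A single named stream of request content (one POST body, one multipart file, ...).
    interface ContentStream {
        String getContentType();
        InputStream getStream() throws IOException;
    }

    // Turns the raw servlet request into query parameters plus zero or more content streams.
    interface SolrRequestParser {
        // SolrParams stands in for Solr's existing parameter abstraction.
        SolrParams parseParams(HttpServletRequest req);

        // GET: usually an empty iterator; simple POST: one stream (the body);
        // multipart POST: one stream per uploaded part, read lazily in order.
        Iterator<ContentStream> parseContentStreams(HttpServletRequest req) throws IOException;
    }

The handler only ever sees SolrParams plus an iterator of streams, so the HTTP specifics stay in the parser (or in a servlet filter in front of it).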
Re: solr.solr.home- what I have to do?
: Solr now looks in ./solr/conf for config, ./solr/data for data
: configurable via solr.solr.home system property...
: ??
:
: Is this system property really a system-property ... / tag in the web.xml file?
: Or do I have to define some environment var with the name solr.solr.home?

it's a system property that can be defined using whatever means your servlet container lets you define system properties before loading web applications ... i don't know much about the Sun Servlet Container, but assuming it's pure java, and you have a shell script somewhere that starts it like this...

  java ... com.sun.SomeMainClass

you can pass system properties on the command line like this...

  java -Dsolr.solr.home=/your/path ... com.sun.SomeMainClass

-Hoss
[jira] Commented: (SOLR-106) new facet params: facet.sort, facet.mincount, facet.offset
[ https://issues.apache.org/jira/browse/SOLR-106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12465254 ]

Yonik Seeley commented on SOLR-106:
------------------------------------

Thanks for the info JJ... didn't see your update until after I committed this (I'm running a bit behind all the solr traffic :-)

Case for Facet Count Caching: Paging through the hitlist

Hmmm, yes, that would be good for a more stateless client. Even more efficient would be to recognize in the client that, since you are only changing a page in the hitlist, the facets won't change (and hence don't re-query). It occurs to me that facet.limit should NOT do double-duty for paging: or, it should *only* be used for paging, specifying the number to be returned. The BoundedTreeSet size and caching are an implementation detail and shouldn't be in the API unless really necessary. If it matters in the future, we could add a hint specifying how much extra should be computed.

Case for pulling response generation out of getFieldCacheCounts and getFacetTermEnumCounts

Sure, makes sense. Don't view the current facet code as done... I have a *lot* of little ideas on how to make it better, esp for cases like faceting on author.

TermFreqVectors

Regarding this, do you have any performance data on it... my assumption was that it would be too slow for a large number of hits. Perhaps still a good option to have if the number of hits is small and the fieldcache isn't an option though.

Just had an idea: It would be even nicer if the counting logic could be passed some object,

Yup, separating those things was on my todo list.

new facet params: facet.sort, facet.mincount, facet.offset
----------------------------------------------------------
    Key: SOLR-106
    URL: https://issues.apache.org/jira/browse/SOLR-106
    Project: Solr
    Issue Type: Improvement
    Components: search
    Reporter: Yonik Seeley
    Attachments: facet_params.patch

a couple of new facet params:
- facet lists become pageable with facet.offset, facet.limit (idea from Erik)
- facet.sort explicitly specifies sort order (true for count descending, false for natural index order)
- facet.mincount: minimum count for facets included in response (idea from JJ, deprecate zeros)

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
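For context, a request exercising the new parameters might look like the following; the field name (cat) comes from the example schema and the exact combination is an assumed illustration, not taken from the patch or its tests:

  http://localhost:8983/solr/select?q=solr&facet=true&facet.field=cat&facet.mincount=1&facet.offset=0&facet.limit=10&facet.sort=true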
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
On 1/15/07, Chris Hostetter [EMAIL PROTECTED] wrote: : The most important issue is to nail down the external HTTP interface. I'm not sure if i agree with that statement .. i would think that figuring out the model or how updates should be handled in a generic way, what all of the Plugin types are, and what their APIs should be is the most important issue -- once we have those issues settled we could allways write a new SolrServlet2 that made the URL structure work anyway we want. The number of people writing update plugins will be small compared to the number of users using the external HTTP API (the URL + query parameters, and the relationship URL-wise between different update formats). My main concern is making *that* as nice and utilitarian as possible, and any plugin stuff is implementation and a secondary concern IMO. -Yonik
[jira] Commented: (SOLR-86) [PATCH] standalone updater cli based on httpClient
[ https://issues.apache.org/jira/browse/SOLR-86?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12465276 ]

Hoss Man commented on SOLR-86:
-------------------------------

regarding Bertrand's comment, i'm not sure if there is any benefit in having this code and SOLR-20 share a common SolrUpdateClientInterface, since this code will be dealing with pure streaming of UTF8 data, while SOLR-20 is focused on a better object abstraction for SolrDocuments ... i'm not sure what kinds of methods such an interface might have.

regarding Thorsten's comment: yeah, i removed the directory support from your patch while i was refactoring just because it was confusing me and i was trying to keep things simple (i kept trying to run "java -jar post.jar exampledocs/" and it would fail because of the .svn directory). that's no reason not to include it though, since it's so simple.

[PATCH] standalone updater cli based on httpClient
---------------------------------------------------
    Key: SOLR-86
    URL: https://issues.apache.org/jira/browse/SOLR-86
    Project: Solr
    Issue Type: New Feature
    Components: update
    Reporter: Thorsten Scherler
    Attachments: simple-post-using-urlconnection-approach.patch, solr-86.diff, solr-86.diff

We need a cross platform replacement for the post.sh. The attached code is a direct replacement of the post.sh since it is actually doing the same exact thing. In the future one can extend the CLI with other features like auto commit, etc. Right now the code assumes that SOLR-85 is applied since we are using the servlet of this issue to actually do the update.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
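For readers following along, this is roughly what the urlconnection approach named in the attachment looks like; a minimal sketch only, with the /solr/update URL and content type assumed (per SOLR-85) rather than copied from the patch:

    // Minimal sketch: stream one XML file to the Solr update servlet over HTTP POST.
    // Assumes http://localhost:8983/solr/update accepts the raw post body (see SOLR-85).
    import java.io.*;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class SimplePost {
        public static void postFile(File file, URL solrUpdateUrl) throws IOException {
            HttpURLConnection conn = (HttpURLConnection) solrUpdateUrl.openConnection();
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            conn.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
            InputStream in = new FileInputStream(file);
            OutputStream out = conn.getOutputStream();
            byte[] buf = new byte[4096];
            for (int n; (n = in.read(buf)) != -1; ) {
                out.write(buf, 0, n);          // copy the file straight into the request body
            }
            out.close();
            in.close();
            if (conn.getResponseCode() != 200) {
                throw new IOException("Solr returned HTTP " + conn.getResponseCode());
            }
        }
    }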
[jira] Created: (SOLR-111) new response classes and connection enhancements
new response classes and connection enhancements
-------------------------------------------------
    Key: SOLR-111
    URL: https://issues.apache.org/jira/browse/SOLR-111
    Project: Solr
    Issue Type: Improvement
    Components: clients - ruby - flare
    Environment: Darwin frizz 8.8.1 Darwin Kernel Version 8.8.1: Mon Sep 25 19:42:00 PDT 2006; root:xnu-792.13.8.obj~1/RELEASE_I386 i386 i386
    Reporter: Ed Summers
    Attachments: response_connection_changes.diff

Similar to Solr::Request::*, a Solr::Response::* hierarchy was created, as well as a Solr::Response::Base which has a factory method for creating the appropriate response based on the request type and the raw response.

Also added delete(), delete_by_query(), add(), update() and query() methods to Solr::Connection. This gets a bit closer to a DSL type of syntax which doesn't require the user to know the inner workings of solrb. I adjusted README accordingly. Solr::Connection also operates with autocommit turned *on*, so commit() messages are not required when doing add(), update(), delete() calls. It can be turned off if the user doesn't want the extra http traffic.

Added the ability to iterate over search results, although we still need to add the ability to iterate over complete results, fetching data behind the scenes as necessary.

Unit tests have been added and functional tests improved.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Updated: (SOLR-111) new response classes and connection enhancements
[ https://issues.apache.org/jira/browse/SOLR-111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ed Summers updated SOLR-111:
-----------------------------

    Attachment: response_connection_changes.diff

new response classes and connection enhancements
-------------------------------------------------
    Key: SOLR-111
    URL: https://issues.apache.org/jira/browse/SOLR-111
    Project: Solr
    Issue Type: Improvement
    Components: clients - ruby - flare
    Environment: Darwin frizz 8.8.1 Darwin Kernel Version 8.8.1: Mon Sep 25 19:42:00 PDT 2006; root:xnu-792.13.8.obj~1/RELEASE_I386 i386 i386
    Reporter: Ed Summers
    Attachments: response_connection_changes.diff

Similar to Solr::Request::*, a Solr::Response::* hierarchy was created, as well as a Solr::Response::Base which has a factory method for creating the appropriate response based on the request type and the raw response.

Also added delete(), delete_by_query(), add(), update() and query() methods to Solr::Connection. This gets a bit closer to a DSL type of syntax which doesn't require the user to know the inner workings of solrb. I adjusted README accordingly. Solr::Connection also operates with autocommit turned *on*, so commit() messages are not required when doing add(), update(), delete() calls. It can be turned off if the user doesn't want the extra http traffic.

Added the ability to iterate over search results, although we still need to add the ability to iterate over complete results, fetching data behind the scenes as necessary.

Unit tests have been added and functional tests improved.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Updated: (SOLR-111) new response classes and connection enhancements
[ https://issues.apache.org/jira/browse/SOLR-111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ed Summers updated SOLR-111:
-----------------------------

    Attachment: response_connection_changes.diff

new response classes and connection enhancements
-------------------------------------------------
    Key: SOLR-111
    URL: https://issues.apache.org/jira/browse/SOLR-111
    Project: Solr
    Issue Type: Improvement
    Components: clients - ruby - flare
    Environment: Darwin frizz 8.8.1 Darwin Kernel Version 8.8.1: Mon Sep 25 19:42:00 PDT 2006; root:xnu-792.13.8.obj~1/RELEASE_I386 i386 i386
    Reporter: Ed Summers
    Attachments: response_connection_changes.diff, response_connection_changes.diff

Similar to Solr::Request::*, a Solr::Response::* hierarchy was created, as well as a Solr::Response::Base which has a factory method for creating the appropriate response based on the request type and the raw response.

Also added delete(), delete_by_query(), add(), update() and query() methods to Solr::Connection. This gets a bit closer to a DSL type of syntax which doesn't require the user to know the inner workings of solrb. I adjusted README accordingly. Solr::Connection also operates with autocommit turned *on*, so commit() messages are not required when doing add(), update(), delete() calls. It can be turned off if the user doesn't want the extra http traffic.

Added the ability to iterate over search results, although we still need to add the ability to iterate over complete results, fetching data behind the scenes as necessary.

Unit tests have been added and functional tests improved.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
Re: svn patches and directories...
On 1/16/07, Chris Hostetter [EMAIL PROTECTED] wrote:

: Date: Sat, 13 Jan 2007 19:12:27 -0800 (PST)
: Subject: [jira] Commented: (SOLR-104) SQL Upload Plugin
: 2) download HandlerRefactoring.DRAFT.zip and extract the contents to:
: \solr\src\java\org\apache\solr\handler
:
: (svn patches don't let you add new directories!)

that shouldn't be true .. using the Linux SVN client you can definitely "svn add" a directory and then generate a diff from it (even using anonymous svn) ... you may want to double check the docs for your SVN client on how to do the same thing on your platform

aaah. I'm running TortoiseSVN on XP. I ran 'svn add' on everything. When I create a patch using TortoiseSVN I get a message that says:

  "You've selected added folders. The patch won't contain added files within such added folders. Do you want to proceed anyway?"

But when you run from the command line:

  svn diff > XXX.patch

it seems to work ok.

(pure patches are always easier to deal with than patches+zips)

got it
Re: Can this be achieved? (Was: document support for file system crawling)
(...)

http://andaluciajunta.es/aj-sea-.html This search engine will be based on nutch in the next version. The special requirement is that this main portal search engine has to search against the solr-based BOJA index. Meaning Nutch will have to search the solr index and not vice versa.

Looks interesting, too bad I don't understand the language :) But I do get the idea.

  <field name="title" type="string" stored="true" />
  <field name="content" type="text" indexed="true" stored="true" />
  <field name="url" type="string" stored="true" />

This is valuable info to a newbie like me. Thanks a lot! It also makes me wonder why they didn't make Nutch more general, but I guess they wanted consistency (and it's probably configurable in Nutch, hidden somewhere, anyway).

Hmm, I think like Hoss on this: why do we want to do the same work as nutch? If you need a crawler, why not use the one from nutch and change some lines? I actually use Forrest as crawler when I generate the new sites, which will then push the content to the solr server via a plugin I developed: http://forrest.apache.org/pluginDocs/plugins_0_80/org.apache.forrest.plugin.output.solr/

Nice one. I didn't know about Forrest, so thanks for the advice. My needs are actually not related to a certain site or application at all. I am here for pure interest in Lucene/Solr/Nutch/etc, and the search field in general (enterprise in particular). Think of my needs as more of R&D, if you'd like. Ultimately I hope to be able to contribute, but don't know where to start (and how much time/resources I have).

Not sure. I am using solr because I did not have to develop three different nutch plugins to make it work. Further, I have punctual updates where I push a certain set of documents to the server, so no need for a crawler.

My suggestion is independent of how often docs are indexed. Everything should be possible - manual feed, crawler, filesystem surveillance, database transaction reports - as long as this is kept separate; the limit lies in one's imagination.

Problem is; I haven't worked enough with Solr, Nutch, Lucene etc. to really know all possibilities and limitations. But I do believe that the outlined architecture would be flexible and answer many needs.

Not sure. Well, I am thinking about a way to meet the same market as some commercial vendors. They should not and may not be copied, so don't get me wrong. But I do know something about this market, or at least I like to think so.

(...)

I must say that interproject collaboration is very hard to achieve.

I take your word for it :) I guess one way is to just code/create the damn thing, not talk about it like I do now. *dreaming*

Anyway, if you need a crawler but want to use solr, then see the crawling code of nutch and write a standalone crawler that will update the solr index.

Will do! Thanks for a full and wise reply.

Eivind
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
On 1/16/07, J.J. Larrea [EMAIL PROTECTED] wrote: - Revise the XML-based update code (broken out of SolrCore into a RequestHandler) to use all the above. +++1, that's been needed forever. If one has the time, I'd also advocate moving to StAX (via woodstox for Java5, but it's built into Java6). -Yonik
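To make the suggestion concrete, here is a small sketch of what pull-parsing Solr's update XML with the StAX API (javax.xml.stream, supplied by Woodstox on Java 5, built into Java 6) could look like; field handling is deliberately minimal and only assumes Solr's existing <add><doc><field name="..."> format:

    // Sketch: pull-parsing Solr update XML with StAX instead of a DOM/SAX pipeline.
    import java.io.Reader;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamException;
    import javax.xml.stream.XMLStreamReader;

    public class StaxUpdateSketch {
        public static void readAdd(Reader xml) throws XMLStreamException {
            XMLStreamReader r = XMLInputFactory.newInstance().createXMLStreamReader(xml);
            while (r.hasNext()) {
                if (r.next() == XMLStreamConstants.START_ELEMENT
                        && "field".equals(r.getLocalName())) {
                    String name = r.getAttributeValue(null, "name");
                    String value = r.getElementText();   // reads up to the end tag
                    // ... add (name, value) to the document being built
                }
            }
            r.close();
        }
    }

Because the reader only ever holds the current event, large <add> payloads stream through without building a DOM.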
[jira] Resolved: (SOLR-111) new response classes and connection enhancements
[ https://issues.apache.org/jira/browse/SOLR-111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Erik Hatcher resolved SOLR-111.
--------------------------------

    Resolution: Fixed
    Assignee: Erik Hatcher

Applied, except tweaked autocommit to off by default. Good stuff, Ed!

new response classes and connection enhancements
-------------------------------------------------
    Key: SOLR-111
    URL: https://issues.apache.org/jira/browse/SOLR-111
    Project: Solr
    Issue Type: Improvement
    Components: clients - ruby - flare
    Environment: Darwin frizz 8.8.1 Darwin Kernel Version 8.8.1: Mon Sep 25 19:42:00 PDT 2006; root:xnu-792.13.8.obj~1/RELEASE_I386 i386 i386
    Reporter: Ed Summers
    Assigned To: Erik Hatcher
    Attachments: response_connection_changes.diff, response_connection_changes.diff

Similar to Solr::Request::*, a Solr::Response::* hierarchy was created, as well as a Solr::Response::Base which has a factory method for creating the appropriate response based on the request type and the raw response.

Also added delete(), delete_by_query(), add(), update() and query() methods to Solr::Connection. This gets a bit closer to a DSL type of syntax which doesn't require the user to know the inner workings of solrb. I adjusted README accordingly. Solr::Connection also operates with autocommit turned *on*, so commit() messages are not required when doing add(), update(), delete() calls. It can be turned off if the user doesn't want the extra http traffic.

Added the ability to iterate over search results, although we still need to add the ability to iterate over complete results, fetching data behind the scenes as necessary.

Unit tests have been added and functional tests improved.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
Java version for solr development (was Re: Update Plugins)
On Tue, 2007-01-16 at 15:49 -0500, Yonik Seeley wrote:

On 1/16/07, J.J. Larrea [EMAIL PROTECTED] wrote: - Revise the XML-based update code (broken out of SolrCore into a RequestHandler) to use all the above.

+++1, that's been needed forever. If one has the time, I'd also advocate moving to StAX (via woodstox for Java5, but it's built into Java6).

I was about to have a look at this. Seeing this comment makes me think. I am on 1.5 ATM and using:

  |-- stax-1.2.0-dev.jar
  `-- stax-utils.jar

Two more dependencies. Setting the min version

  <!-- Java Version we are compatible with -->
  <property name="java.compat.version" value="1.6" />

would get rid of this. Should I use 1.6 for a patch, or the above mentioned libs? wdyt?

salu2
--
thorsten

"Together we stand, divided we fall!" Hey you (Pink Floyd)
[jira] Commented: (SOLR-86) [PATCH] standalone updater cli based on httpClient
[ https://issues.apache.org/jira/browse/SOLR-86?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12465327 ]

Thorsten Scherler commented on SOLR-86:
----------------------------------------

Yeah, I know what you mean (had a similar problem today).

  if (!file.isDirectory()) {
    tool.postFile(file, out);
  }

should fix that. TIA

[PATCH] standalone updater cli based on httpClient
---------------------------------------------------
    Key: SOLR-86
    URL: https://issues.apache.org/jira/browse/SOLR-86
    Project: Solr
    Issue Type: New Feature
    Components: update
    Reporter: Thorsten Scherler
    Attachments: simple-post-using-urlconnection-approach.patch, solr-86.diff, solr-86.diff

We need a cross platform replacement for the post.sh. The attached code is a direct replacement of the post.sh since it is actually doing the same exact thing. In the future one can extend the CLI with other features like auto commit, etc. Right now the code assumes that SOLR-85 is applied since we are using the servlet of this issue to actually do the update.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
Re: Java version for solr development (was Re: Update Plugins)
On 1/16/07, Thorsten Scherler [EMAIL PROTECTED] wrote: I am on 1.5 ATM and using |-- stax-1.2.0-dev.jar `-- stax-utils.jar I don't know where those jars are from, but I guess one would need the stax API jar, and the implementation (woodstox I would think) jar. That's two jars instead of one, but they could go away with a move to Java6. The API is likely to have a much longer lifetime too. Two more dependencies. Setting min version !-- Java Version we are compatible with -- property name=java.compat.version value=1.6 / would get rid of this. Should I use 1.6 for a patch or above mentioned libs? I think it's a bit soon to move to 1.6 - I don't know how many platforms it's available for yet. -Yonik
[jira] Updated: (SOLR-107) Iterable NamedList with java5 generics
[ https://issues.apache.org/jira/browse/SOLR-107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan McKinley updated SOLR-107: --- Attachment: IterableNamedList.patch Iterable NamedList with java5 generics -- Key: SOLR-107 URL: https://issues.apache.org/jira/browse/SOLR-107 Project: Solr Issue Type: Improvement Reporter: Ryan McKinley Priority: Trivial Attachments: IterableNamedList.patch, IterableNamedList.patch Iterators and generics are nice! this patch adds both to NamedList.java -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (SOLR-107) Iterable NamedList with java5 generics
[ https://issues.apache.org/jira/browse/SOLR-107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12465357 ] Ryan McKinley commented on SOLR-107: updated patch for 1,2, and 3 Iterable NamedList with java5 generics -- Key: SOLR-107 URL: https://issues.apache.org/jira/browse/SOLR-107 Project: Solr Issue Type: Improvement Reporter: Ryan McKinley Priority: Trivial Attachments: IterableNamedList.patch, IterableNamedList.patch Iterators and generics are nice! this patch adds both to NamedList.java -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
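To illustrate the idea behind the patch, here is a simplified standalone sketch (not the actual patch, which modifies Solr's own NamedList.java): a generic name/value list whose entries can be walked with the Java 5 for-each loop.

    // Simplified sketch of an iterable, generic name/value list (not the real patch).
    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;

    public class SimpleNamedList<T> implements Iterable<SimpleNamedList.NamedValue<T>> {

        public static class NamedValue<V> {
            public final String name;
            public final V value;
            NamedValue(String name, V value) { this.name = name; this.value = value; }
        }

        private final List<NamedValue<T>> entries = new ArrayList<NamedValue<T>>();

        public void add(String name, T value) {
            entries.add(new NamedValue<T>(name, value));
        }

        public Iterator<NamedValue<T>> iterator() {
            return entries.iterator();
        }

        // usage:  for (SimpleNamedList.NamedValue<Integer> nv : counts) { ... }
    }

The generics let callers get typed values back without casts, and implementing Iterable is what enables the for-each syntax shown in the usage comment.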
Re: Java version for solr development (was Re: Update Plugins)
On 1/16/07 8:03 PM, Yonik Seeley [EMAIL PROTECTED] wrote: I think it's a bit soon to move to 1.6 - I don't know how many platforms it's available for yet. It is still in early release from IBM for their PowerPC servers, so requiring 1.6 would be a serious problem for us. wunder -- Walter Underwood Search Guru, Netflix
Re: Java version for solr development (was Re: Update Plugins)
On 1/17/07, Thorsten Scherler [EMAIL PROTECTED] wrote: ...Should I use 1.6 for a patch or above mentioned libs?... IMHO moving to 1.6 is way too soon, and if it's only to save two jars it's not worth it. -Bertrand
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
: The number of people writing update plugins will be small compared to
: the number of users using the external HTTP API (the URL + query
: parameters, and the relationship URL-wise between different update
: formats). My main concern is making *that* as nice and utilitarian as
: possible, and any plugin stuff is implementation and a secondary
: concern IMO.

Agreed, but my point was that we should try to design the internal APIs independently of the URL structure ... if we have a set of APIs, it's easy to come up with a URL structure that will map well (we could theoretically have several URL structures using different servlets), but if we worry too much about what the URL should look like, we may hamstring the model design.

-Hoss
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
kind of like a binary stream equivalent to the way analyzers can be customized -- is that kind of what you had in mind?

exactly.

  interface SolrDocumentParser {
    public void init(NamedList args);
    Document parse(SolrParams p, ContentStream content);
  }

yes
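As a usage sketch of that contract: a trivial parser that treats the whole stream as plain text and maps it to a single field. The field name and the surrounding types (NamedList, SolrParams, ContentStream) follow the discussion above; treat the whole thing as hypothetical, since none of it is committed API:

    // Hypothetical example of the parser contract sketched above: read the stream
    // as UTF-8 text and index it into a single "content" field.
    // NamedList, SolrParams and ContentStream imports are omitted; their packages
    // depend on where these types end up.
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.io.Reader;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class PlainTextParser implements SolrDocumentParser {

        public void init(NamedList args) {
            // no configuration needed for this example
        }

        public Document parse(SolrParams p, ContentStream content) {
            try {
                Reader reader = new InputStreamReader(content.getStream(), "UTF-8");
                StringBuilder text = new StringBuilder();
                char[] buf = new char[4096];
                for (int n; (n = reader.read(buf)) != -1; ) {
                    text.append(buf, 0, n);
                }
                Document doc = new Document();
                doc.add(new Field("content", text.toString(),
                                  Field.Store.YES, Field.Index.TOKENIZED));
                return doc;
            } catch (IOException e) {
                throw new RuntimeException("could not read content stream", e);
            }
        }
    }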