Re: unsubscribe
On Thu, 2007-05-10 at 10:05 +0100, Kainth, Sachin wrote: unsubscribe Hi Sachin, you need to send to a different mailing address: [EMAIL PROTECTED] HTH salu2 -- Thorsten Scherler thorsten.at.apache.org Open Source Java consulting, training and solutions
Re: Question about delete
but index file size not changed and maxDoc not changed. 2007/5/10, Nick Jenkin [EMAIL PROTECTED]: Hi James, As I understand it, numDocs is the number of documents in your index, and maxDoc is the most documents you have ever had in your index. You currently have no documents in your index by the looks of it, so your delete query must have deleted everything. That would be why you are getting no results. -Nick On 5/10/07, James liu [EMAIL PROTECTED] wrote: i use commands like this: curl http://localhost:8983/solr/update --data-binary '<delete><query>name:DDR</query></delete>' curl http://localhost:8983/solr/update --data-binary '<commit/>' and i get numDocs : 0 maxDoc : 1218819 When I search for something that existed before the delete, I find nothing. But the index file size has not changed and maxDoc has not changed. Why does this happen? -- regards jl -- - Nick -- regards jl
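For reference, the two update messages James posts are just small XML bodies; this is a minimal Python sketch (the helper name is hypothetical, but the element names are Solr's standard update syntax):

```python
def delete_by_query_xml(query):
    # Build the body of a Solr delete-by-query update message.
    return "<delete><query>" + query + "</query></delete>"

# A commit is a separate message; deletes only become visible once it is sent.
COMMIT_XML = "<commit/>"
```

Each body is POSTed to the /update URL, exactly as the curl commands above do.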
Re: Does solr support index which made by lucene 1.4.3
On 5/10/07, James liu [EMAIL PROTECTED] wrote: I tried it; it shows me this error information: Solr could support a Lucene 1.4.3 index if the schema was configured to match it. I see the following buried in your logs: java.lang.RuntimeException: Can't find resource 'solrconfig.xml' -Yonik
Custom response writer
I have written a custom response writer and added it to solrconfig.xml. When I run the program I can see the custom response writer is initialized, but when I run a search with the custom writer's name as the wt parameter, the search is executed but the response writer is not called (even the first line of the write function in the custom writer, which is log.info(...), is not written out). Any leads on what might be the cause? Thank you, Debra -- View this message in context: http://www.nabble.com/Costume-response-writer-tf3721357.html#a10412462 Sent from the Solr - User mailing list archive at Nabble.com.
fast update handlers
I'm trying to set up a system to have very low index latency (1-2 seconds), and one of the javadocs intrigued me: DirectUpdateHandler2 implements an UpdateHandler where documents are added directly to the main Lucene index as opposed to adding to a separate smaller index. The plain DirectUpdateHandler also has the same in its docs. Does this imply that there used to be another handler that could send docs to a smaller/faster index and then merge them in with a larger one, or that someone could in the future? I read through a good bit of the code and didn't see how it could be handled from a searcher perspective, but perhaps I'm missing some key piece. - will
Re: fast update handlers
On 5/10/07, Will Johnson [EMAIL PROTECTED] wrote: I'm trying to set up a system to have very low index latency (1-2 seconds) and one of the javadocs intrigued me: DirectUpdateHandler2 implements an UpdateHandler where documents are added directly to the main Lucene index as opposed to adding to a separate smaller index The plain DirectUpdateHandler also had the same in its docs. Does this imply that there used to be another handler that could send docs to a small/faster index and then merge them in with a larger one or that someone could in the future? That was the original design, before I thought of the current method in DUH2. DirectUpdateHandler was just meant to get things working to establish the external interface (it's only for testing... very slow at overwriting docs). Adding documents to a separate index and then merging would have no real indexing speed advantage (it's essentially what Lucene does anyway when adding to a large index). There would be some advantage for index distribution, but it would complicate things greatly. High latency is caused by segment merges... this would happen when you periodically had to merge the smaller index into the larger one anyway. You could do some other tricks for more predictable index times... set a large mergeFactor and then call optimize after you have added your batch of documents. Stay tuned though... there has been some work on a Lucene patch to do merging in a background thread. -Yonik
RE: fast update handlers
I guess I was more concerned with doing the frequent commits and how that would affect the caches. Say I have 2M docs in my main index but I want to add docs every 2 seconds, all while doing queries. If I do commits every 2 seconds I basically lose any caching advantage and my faceting performance goes down the tube. If however, I were to add things to a smaller index and then roll it into the larger one every ~30 minutes then I only take the hit of computing the larger filter caches on that interval. Further, if my smaller index were based on a RAMDirectory instead of an FSDirectory I assume computing the filter sets for the smaller index should be fast enough even every 2 seconds. - will -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik Seeley Sent: Thursday, May 10, 2007 9:49 AM To: solr-user@lucene.apache.org Subject: Re: fast update handlers On 5/10/07, Will Johnson [EMAIL PROTECTED] wrote: I'm trying to set up a system to have very low index latency (1-2 seconds) and one of the javadocs intrigued me: DirectUpdateHandler2 implements an UpdateHandler where documents are added directly to the main Lucene index as opposed to adding to a separate smaller index The plain DirectUpdateHandler also had the same in its docs. Does this imply that there used to be another handler that could send docs to a small/faster index and then merge them in with a larger one or that someone could in the future? That was the original design, before I thought of the current method in DUH2. DirectUpdateHandler was just meant to get things working to establish the external interface (it's only for testing... very slow at overwriting docs). Adding documents to a separate index and then merging would have no real indexing speed advantage (it's essentially what Lucene does anyway when adding to a large index). There would be some advantage for index distribution, but it would complicate things greatly. High latency is caused by segment merges... 
this would happen when you periodically had to merge the smaller index into the larger anyway. You could do some other tricks for more predictable index times... set a large mergeFactor and then call optimize after you have added your batch of documents. Stay tuned though... there has been some work on a lucene patch to do merging in a background thread. -Yonik
Re: fast update handlers
On 5/10/07, Will Johnson [EMAIL PROTECTED] wrote: I guess I was more concerned with doing the frequent commits and how that would affect the caches. Say I have 2M docs in my main index but I want to add docs every 2 seconds all while doing queries. if I do commits every 2 seconds I basically lose any caching advantage and my faceting performance goes down the tube. If however, I were to add things to a smaller index and then roll it into the larger one every ~30 minutes then I only take the hit of computing the larger filter caches on that interval. Further, if my smaller index were based on a RAMDirectory instead of an FSDirectory I assume computing the filter sets for the smaller index should be fast enough even every 2 seconds. There isn't currently any support for incrementally updating filters. -Yonik
Re: Question about delete
I believe that in Lucene, at least, deleting documents only marks them for deletion. The actual delete happens only after closing the IndexReader. Not sure about Solr. Ajanta. James liu wrote: but index file size not changed and maxDoc not changed. 2007/5/10, Nick Jenkin [EMAIL PROTECTED]: Hi James, As I understand it numDocs is the number of documents in your index, maxDoc is the most documents you have ever had in your index. You currently have no documents in your index by the looks, thus your delete query must have deleted everything. That would be why you are getting no results. -Nick On 5/10/07, James liu [EMAIL PROTECTED] wrote: i use commands like this curl http://localhost:8983/solr/update --data-binary '<delete><query>name:DDR</query></delete>' curl http://localhost:8983/solr/update --data-binary '<commit/>' and i get numDocs : 0 maxDoc : 1218819 when i search something which existed before the delete and find nothing. but index file size not changed and maxDoc not changed. why does it happen? -- regards jl -- - Nick
RE: fast update handlers
What about issuing separate commits to the index on a regularly scheduled basis? For example, you add documents to the index every 2 seconds, or however often, but these operations don't commit. Instead, you have a cron'd script or something that just issues a commit every 5 or 10 minutes or whatever interval you'd like. I had to do something similar when I was running a re-index of my entire dataset. My program wasn't issuing commits, so I just cron'd a commit for every half hour so it didn't overload the server. Thanks, Charlie -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik Seeley Sent: Thursday, May 10, 2007 9:07 AM To: solr-user@lucene.apache.org Subject: Re: fast update handlers On 5/10/07, Will Johnson [EMAIL PROTECTED] wrote: I guess I was more concerned with doing the frequent commits and how that would affect the caches. Say I have 2M docs in my main index but I want to add docs every 2 seconds all while doing queries. if I do commits every 2 seconds I basically lose any caching advantage and my faceting performance goes down the tube. If however, I were to add things to a smaller index and then roll it into the larger one every ~30 minutes then I only take the hit of computing the larger filter caches on that interval. Further, if my smaller index were based on a RAMDirectory instead of an FSDirectory I assume computing the filter sets for the smaller index should be fast enough even every 2 seconds. There isn't currently any support for incrementally updating filters. -Yonik
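The cron'd-commit idea Charlie describes amounts to POSTing a commit body on a schedule; here is a minimal Python sketch (the base URL is the example server's default, and `build_commit_request` is a hypothetical helper name):

```python
def build_commit_request(base_url="http://localhost:8983/solr"):
    # Return the (url, body) pair for a Solr commit; documents added since
    # the last commit only become searchable after this body is POSTed.
    return (base_url + "/update", "<commit/>")

# A cron job would POST the body, e.g.:
#   url, body = build_commit_request()
#   urllib.request.urlopen(url, data=body.encode())
```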
RE: fast update handlers
The problem is I want the newly added documents to be made searchable every 1-2 seconds, so I need the commits. I was hoping that the caches could be stored/tied to the IndexSearcher; then a MultiSearcher could take advantage of the multiple sub indexes and their respective caches. I think the best approach now will be to write a top level federator that can merge the large ~static index and the smaller more dynamic index. - will -Original Message- From: Charlie Jackson [mailto:[EMAIL PROTECTED] Sent: Thursday, May 10, 2007 10:53 AM To: solr-user@lucene.apache.org Subject: RE: fast update handlers What about issuing separate commits to the index on a regularly scheduled basis? For example, you add documents to the index every 2 seconds, or however often, but these operations don't commit. Instead, you have a cron'd script or something that just issues a commit every 5 or 10 minutes or whatever interval you'd like. I had to do something similar when I was running a re-index of my entire dataset. My program wasn't issuing commits, so I just cron'd a commit for every half hour so it didn't overload the server. Thanks, Charlie -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik Seeley Sent: Thursday, May 10, 2007 9:07 AM To: solr-user@lucene.apache.org Subject: Re: fast update handlers On 5/10/07, Will Johnson [EMAIL PROTECTED] wrote: I guess I was more concerned with doing the frequent commits and how that would affect the caches. Say I have 2M docs in my main index but I want to add docs every 2 seconds all while doing queries. if I do commits every 2 seconds I basically lose any caching advantage and my faceting performance goes down the tube. If however, I were to add things to a smaller index and then roll it into the larger one every ~30 minutes then I only take the hit of computing the larger filter caches on that interval. 
Further, if my smaller index were based on a RAMDirectory instead of a FSDirectory I assume computing the filter sets for the smaller index should be fast enough even every 2 seconds. There isn't currently any support for incrementally updating filters. -Yonik
Re: Question about delete
On 5/10/07, Ajanta Phatak [EMAIL PROTECTED] wrote: I believe in lucene at least deleting documents only marks them for deletion. The actual delete happens only after closing the IndexReader. Not sure about Solr Closing an IndexReader only flushes the list of deleted docids to the index... it doesn't actually delete them. Deletions only happen when the deleted docs segment is involved in a merge, or when an optimize is done (which is a merge of all segments). -Yonik
Re: Costume response writer
On 5/10/07, Debra [EMAIL PROTECTED] wrote: I have written a custom response writer and added it to solrconfig.xml. When I run the program I can see the custom response writer is initialized, but when I run a search with the custom writer's name as the wt parameter, the search is executed but the response writer is not called (even the first line of the write function in the custom writer, which is log.info(...), is not written out). Any leads on what might be the cause? That doesn't make sense... something like the dismax handler is the same to Solr as any other custom request handler. Perhaps look for the dismax handler init in the log files and compare it to your handler. -Yonik
Re: Requests per second/minute monitor?
Yes, that is possible, but we also monitor Apache, Tomcat, the JVM, and OS through JMX and other live monitoring interfaces. Why invent a real-time HTTP log analysis system when I can fetch /search/stats.jsp at any time? By number of rows fetched, do you mean number of documents matched? The log you describe is pretty useful. Ultraseek has something similar and that is the log most often used by admins. I'd recommend also logging the start and rows part of the request so you can distinguish between new queries and second page requests. If possible, make the timestamp the same as the HTTP access log so you can correlate the entries. wunder On 5/9/07 9:43 PM, Ian Holsman [EMAIL PROTECTED] wrote: Walter Underwood wrote: This is for monitoring -- what happened in the last 30 seconds. Log file analysis doesn't really do that. I would respectfully disagree. Log file analysis of each request can give you that, and a whole lot more. you could either grab the stats via a regular cron job, or create a separate filter to parse them real time. It would then let you grab more sophisticated stats if you choose to. What I would like to know is (and excuse the newbieness of the question) how to enable solr to log a file with the following data. - time spent (ms) in the request. - IP# of the incoming request - what the request was (and what handler executed it) - a status code to signal if the request failed for some reasons - number of rows fetched and - the number of rows actually returned is this possible? (I'm using tomcat if that changes the answer). regards Ian
dates times
After writing my 3rd parser in my third scripting language in as many months to go from unix timestamps to Solr Time (8601), I have to ask: shouldn't the date/time field type be more resilient? I assume there's a good reason that it's 8601 internally, but certainly it would be excellent for Solr to transcode different types into Solr Time. My main problem (as a normal Solr end user) is that it's hard to do math directly on 8601 dates or really parse them without specific packages. My XSL 2.0 parsers don't even like it without some massaging (forget about XSL 1.0). UNIX time (seconds since the epoch) is super easy, as are sortable delimitable strings like 20070510125403. I'm not advocating replacing 8601 as the known good Solr Time, just that some leeway be given in the parser to accept unix time or something else, and the conversion to 8601 happens internally. And a further dream is to have a strftime formatter in solrconfig for the query response, so I can always have my date fields come back as May 10th, 2007, 12:58pm. -Brian
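For what it's worth, the unix-time-to-Solr-Time conversion Brian keeps rewriting is a one-liner in Python; a minimal sketch (the function name is made up for illustration):

```python
import time

def unix_to_solr(ts):
    # Seconds since the epoch (UTC) -> Solr's ISO 8601 date format.
    return time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(ts))
```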
Re: fast update handlers
I don't know if this helps, but... Do *all* your queries need to include the fast updates? I have a setup where there are some cases that need the newest stuff but most cases can wait 5 mins (or so) In that case, I have two solr instances pointing to the same index files. One is used for updates and queries that need everything. The other is a read-only index that serves the majority of queries. What is nice about this is that you can set different cache sizes and auto-warming for the different cases. ryan Will Johnson wrote: The problem is I want the newly added documents to be made searchable every 1-2 seconds so I need the commits. I was hoping that the caches could be stored/tied to the IndexSearcher then a MultiSearcher could take advantage of the multiple sub indexes and their respective caches. I think the best approach now will be to write a top level federator that can merge the large ~static index and the smaller more dynamic index. - will -Original Message- From: Charlie Jackson [mailto:[EMAIL PROTECTED] Sent: Thursday, May 10, 2007 10:53 AM To: solr-user@lucene.apache.org Subject: RE: fast update handlers What about issuing separate commits to the index on a regularly scheduled basis? For example, you add documents to the index every 2 seconds, or however often, but these operations don't commit. Instead, you have a cron'd script or something that just issues a commit every 5 or 10 minutes or whatever interval you'd like. I had to do something similar when I was running a re-index of my entire dataset. My program wasn't issuing commits, so I just cron'd a commit for every half hour so it didn't overload the server. 
Thanks, Charlie -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik Seeley Sent: Thursday, May 10, 2007 9:07 AM To: solr-user@lucene.apache.org Subject: Re: fast update handlers On 5/10/07, Will Johnson [EMAIL PROTECTED] wrote: I guess I was more concerned with doing the frequent commits and how that would affect the caches. Say I have 2M docs in my main index but I want to add docs every 2 seconds all while doing queries. if I do commits every 2 seconds I basically lose any caching advantage and my faceting performance goes down the tube. If however, I were to add things to a smaller index and then roll it into the larger one every ~30 minutes then I only take the hit of computing the larger filter caches on that interval. Further, if my smaller index were based on a RAMDirectory instead of an FSDirectory I assume computing the filter sets for the smaller index should be fast enough even every 2 seconds. There isn't currently any support for incrementally updating filters. -Yonik
Re: dates times
On 5/10/07, Brian Whitman [EMAIL PROTECTED] wrote: After writing my 3rd parser in my third scripting language in so many months to go from unix timestamps to Solr Time (8601) I have to ask: shouldn't the date/time field type be more resilient? I assume there's a good reason that it's 8601 internally, but certainly it would be excellent for Solr to transcode different types into Solr Time. My main problem (as a normal Solr end user) is that it's hard to do math directly on 8601 dates or really parse them without specific packages. My XSL 2.0 parsers don't even like it without some massaging (forget about XSL 1.0.) UNIX time (seconds since the epoch) is super easy, as are sortable delimitable strings like 20070510125403. I'm not sure what delimitable means, but Solr datetimes _are_ essentially sortable inverse-magnitude like the above, with a few punctuation symbols thrown in. I have no XSLT-fu, but is it not possible to do regexp-replace s/[TZ:-]//g on the solrdate to get the above? I'm not advocating replacing 8601 as the known good Solr Time, just that some leeway be given in the parser to accept unix time or something else and the conversion to 8601 happens internally. And a further dream is to have a strftime formatter in solrconfig for the query response, so I can always have my date fields come back as May 10th, 2007, 12:58pm. Those are interesting ideas and it probably would not be difficult to create a patch if you were interested, but I'm curious: What about XSL makes what seems to me an elementary string-processing task so difficult? regards -Mike
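The regexp-replace Mike suggests, sketched in Python rather than XSLT (function name is hypothetical):

```python
import re

def solr_date_to_sortable(d):
    # Strip the ISO 8601 punctuation, leaving a plain sortable digit string.
    return re.sub(r"[TZ:\-]", "", d)
```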
RE: dates times
You can get at some of this functionality in the built-in XSLT 1.0 engine (Xalan) by using the EXSLT date-time extensions: see http://exslt.org/date/index.html, and for Xalan's implementation see http://xml.apache.org/xalan-j/extensionslib.html#exslt . There are some examples here: http://www-128.ibm.com/developerworks/library/x-exslt.html . I haven't tried this in Solr but I don't think there's any reason why it wouldn't work; I've used it in other Xalan-J environments, notably Cocoon. Peter -Original Message- From: Brian Whitman [mailto:[EMAIL PROTECTED] Sent: Thursday, May 10, 2007 11:49 AM To: solr-user@lucene.apache.org Subject: Re: dates times Those are interesting ideas and it probably would not be difficult to create a patch if you were interested, but I'm curious: What about XSL makes what seems to me an elementary string-processing task so difficult? Well, XSL 1.0 (which is the one that comes for free with Solr/java) doesn't handle dates and times at all. XSL 2.0 handles it well enough, but it's only supported through a GPL jar, which we can't distribute. It's more than string processing, anyway. I would want to convert the Solr Time 2007-03-15T00:41:52Z to March 15th, 2007 in a web app. I'd also like to say 'Posted 3 days ago'. In my vision of things, that work is done on Solr's side. (The former case with a strftime type formatter in solrconfig, the latter by having strftime return the day number this year.)
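The "Posted 3 days ago" case Brian mentions is simple once the date is parsed; a Python sketch (names are hypothetical, and `now` is passed in rather than taken from the clock, so the result is deterministic):

```python
from datetime import datetime

def days_ago(solr_date, now):
    # Parse a Solr ISO 8601 UTC date and count whole days back from `now`.
    then = datetime.strptime(solr_date, "%Y-%m-%dT%H:%M:%SZ")
    return (now - then).days
```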
Re: dates times
You can get at some of this functionality in the built-in XSLT 1.0 engine (Xalan) by using the EXSLT date-time extensions: see http://exslt.org/date/index.html, and for Xalan's implementation see http://xml.apache.org/xalan-j/extensionslib.html#exslt . The EXSLT stuff looks good, thanks! I'll have to try it out. That's only one direction though; I still want parsing of unix timestamp-like formats into the indexer on doc adds and updates. Just FYI, the license for the EXSLT stuff seems OK w/ the APL: http://lists.fourthought.com/pipermail/exslt-manage/2004-June/000603.html So if it works out we might want to put the date/time XSL stuff in the Solr distribution in lieu of shipping with an XSL 2.0 processor. Those are interesting ideas and it probably would not be difficult to create a patch if you were interested, but I'm curious: What about XSL makes what seems to me an elementary string-processing task so difficult? Well, XSL 1.0 (which is the one that comes for free with Solr/java) doesn't handle dates and times at all. XSL 2.0 handles it well enough, but it's only supported through a GPL jar, which we can't distribute. It's more than string processing, anyway. I would want to convert the Solr Time 2007-03-15T00:41:52Z to March 15th, 2007 in a web app. I'd also like to say 'Posted 3 days ago'. In my vision of things, that work is done on Solr's side. (The former case with a strftime type formatter in solrconfig, the latter by having strftime return the day number this year.) -- http://variogr.am/ [EMAIL PROTECTED]
Re: Costume response writer
This is from the log: ... INFO: adding queryResponseWriter jdbc=com.lss.search.request.JDBCResponseWriter 10/05/2007 21:11:39 com.lss.search.request.JDBCResponseWriter init INFO: Init JDBC response writer //This is added from the init of the class to see that it's actually finding the right one ... 10/05/2007 21:11:44 org.apache.solr.core.SolrCore execute INFO: null jdsn=4&start=0&q=white&wt=jdbc&qt=standard&rows=90 0 1442 10/05/2007 21:11:44 org.apache.solr.core.SolrCore close This is from the JDBCResponseWriter code: public void write(Writer writer, SolrQueryRequest request, SolrQueryResponse response) throws IOException { log.info("USING JDBC RESPONSE WRITER"); The line USING JDBC RESPONSE WRITER doesn't appear in the log. Thanks, Debra -- View this message in context: http://www.nabble.com/Costume-response-writer-tf3721357.html#a10418873 Sent from the Solr - User mailing list archive at Nabble.com.
Re: dates times
: It's more than string processing, anyway. I would want to convert the : Solr Time 2007-03-15T00:41:52Z to March 15th, 2007 in a web app. : I'd also like to say 'Posted 3 days ago'. In my vision of things, : that work is done on Solr's side. (The former case with a strftime : type formatter in solrconfig, the latter by having strftime return : the day number this year.) One of the early architecture/design principles of the Solr search APIs was: compute secondary info about a result if it's more efficient or easier to compute in Solr than it would be for a client to do it -- DocSet caches, facet counts, and sorting/pagination being great examples of things where Solr can do less work to get the same info out of raw data than a client app would, because of its low-level access to the data, and because of how much data would need to go over the wire for the client to do the same computation. ... that's largely just a little bit of historical trivia however; Solr has a lot of features now which might not hold up to that yardstick, but I mention it only to clarify one of the reasons Solr didn't have more configurable date formatting to start with. It has been on the TaskList since the start of incubation however... * a DateTime field (or Query Parser extension) that allows flexible input for easier human entered queries * allow alternate format for date output to ease client creation of date objects? One of the reasons I don't think anyone has tackled them yet is because it's hard to get a holistic view of a solution, because there are really several loosely related problems with date formatting issues: The first is a discussion of the internal format and what resolution the dates are stored at in the index itself. If you *know* that you never plan on querying with anything more fine grained than day resolution, storing your dates with only day resolution can make your index a lot smaller (and make date searches a lot faster). 
With the current DateField the same performance benefits can be achieved by rounding your dates before indexing them, but if we were to make it a config option on DateField itself to automatically round, we would need to take this info into account when parsing updates -- should the client be expected to know what precision each date field uses? Do they send dates expressed using the internal format, or as fully qualified times? Is it an error/warning to attempt to index more datetime precision than a field supports? The second is a discussion of external format (which seems to be what you are mostly discussing). The most trivial way to address this would be options on the ResponseWriters that allow them to be configured with DateFormatter strings they would use to process any date they return .. but that raises questions about the QueryParsing aspect as well ... should date formatting be a property of the response, or a property of the request, such that both input and output formats are identical? Third is how the discussions of the internal format and the external format shouldn't be treated as completely independent. It's tempting to say that there will be a clean abstraction between the two, that all client interaction will be done using configured external formatter(s) to create internal java Date objects, which will then be translated back to Strings by an internal formatter for the purpose of indexing (and querying), but what happens when a query expresses a date range too precise for the granularity expressed by the internal format? Do we match nothing/everything? ... What if the indexed granularity is *more* precise than the query granularity .. how do we know if a range query between March 6, 2007 and May 10, 2007 on a field that stores millisecond granularity is supposed to go from the first millisecond of each day or the last? Questions like these are why I'm glad Solr currently keeps it simple and makes people deal in absolutes .. less room for confusion :) -Hoss
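Hoss's "round your dates before indexing" workaround is straightforward because the ISO 8601 string is fixed-width; a Python sketch under that assumption (the function name is made up):

```python
def round_to_day(solr_date):
    # Truncate a full-precision Solr date down to day resolution
    # before indexing; the first 10 chars are always YYYY-MM-DD.
    return solr_date[:10] + "T00:00:00Z"
```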
Re: Costume response writer
: INFO: adding queryResponseWriter : jdbc=com.lss.search.request.JDBCResponseWriter : 10/05/2007 21:11:44 org.apache.solr.core.SolrCore execute : INFO: null jdsn=4&start=0&q=white&wt=jdbc&qt=standard&rows=90 0 1442 that's very strange ... the only thing that jumps out at me is the null there where the context path is supposed to be logged; it suggests that you aren't using the standard /select URL, so maybe this is a bug with some of the new request handler path based stuff? can you clarify: 1) which version of Solr you are using (the Solr Implementation Version from /admin/registry.jsp gives the best answer) 2) exactly what URL you are hitting to generate this request 3) what the solrconfig.xml looks like for your queryResponseWriter and requestHandler configurations 4) Lastly: what response does your client get? is it the default XML response, or just nothing at all? -Hoss
Re: dates times
On May 10, 2007, at 2:30 PM, Chris Hostetter wrote: Questions like these are whiy I'm glad Solr currently keeps it simple and makes people deal in absolutes .. less room for confusion :) I get all that, thanks for the great explanation. I imagine most of my problems can be solved with a custom field analyzer (converting other date strings to 8601 during indexing) and then XSL on the select?q= end (which we do anyway.) In other words, leaving core solr absolute with an option for different date analyzers. I see the need to not clutter it up, especially at this stage. What would, say, a filter that converted unix timestamps to 8601 before indexing as a solr.DateField look like? Is that a custom filter, or a tokenizer?
Re: dates times
On 5/10/07, Brian Whitman [EMAIL PROTECTED] wrote: On May 10, 2007, at 2:30 PM, Chris Hostetter wrote: Questions like these are whiy I'm glad Solr currently keeps it simple and makes people deal in absolutes .. less room for confusion :) I get all that, thanks for the great explanation. I imagine most of my problems can be solved with a custom field analyzer (converting other date strings to 8601 during indexing) and then XSL on the select?q= end (which we do anyway.) In other words, leaving core solr absolute with an option for different date analyzers. I see the need to not clutter it up, especially at this stage. What would, say, a filter that converted unix timestamps to 8601 before indexing as a solr.DateField look like? Is that a custom filter, or a tokenizer? That would be a custom filter which is currently only supported by text fields, so the XML output would be str instead of date (if that matters to you). One could also just store seconds or milliseconds in an int or long field. That's fine for small devel teams, but not ideal since it's a bit less expressive. The right approach for more flexible date parsing is probably to add more functionality to the date field and configure via optional attributes. -Yonik
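Yonik's alternative -- storing seconds or milliseconds in an int or long field -- needs the reverse conversion when reading results back; a Python sketch (function name is hypothetical):

```python
import calendar
import time

def solr_to_unix(d):
    # Parse a Solr ISO 8601 UTC date back to seconds since the epoch.
    return calendar.timegm(time.strptime(d, "%Y-%m-%dT%H:%M:%SZ"))
```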
RE: cwd requirement to run Solr with Tomcat
BTW, The Simple Example Install section in http://wiki.apache.org/solr/SolrTomcat leaves the unzipped directory apache-solr-nightly-incubating intact, but this is not needed after copying the solr.war and the example solr directory, is it? Can I edit the instruction to insert: rm -r apache-solr-nightly-incubating after the cp line? -kuro
Does Solr XSL writer work with Arabic text?
I'm trying to search an index of docs which have text fields in Arabic, using the XSL writer (wt=xslt&tr=example.xsl). But the Arabic text gets all garbled. Is the XSL writer known to work for Arabic text? Is anybody using it? -kuro
Re: Does Solr XSL writer work with Arabic text?
In example.xsl change the output type from <xsl:output media-type="text/html"/> to <xsl:output media-type="text/html; charset=UTF-8" encoding="UTF-8"/> and see if that helps. I had the same problem (different language.) If this works we should file a JIRA to fix it up in trunk. On May 10, 2007, at 4:13 PM, Teruhiko Kurosaka wrote: I'm trying to search an index of docs which have text fields in Arabic, using the XSL writer (wt=xslt&tr=example.xsl). But the Arabic text gets all garbled. Is the XSL writer known to work for Arabic text? Is anybody using it? -kuro -- http://variogr.am/ [EMAIL PROTECTED]
Re: dates times
: The right approach for more flexible date parsing is probably to add : more functionality to the date field and configure via optional : attributes.

Adding configuration options to DateField seems like it might ultimately be the right choice for changing the *internal* format, but assuming we want to keep the internal representation of DateField fixed and unconfigurable for the time being and address the various *external* formatting issues, I imagine the simplest things to tackle this (in a way that is consistent with the other datatypes) would be...

1) change DateField to support Analyzers. That way you could have separate analyzers for indexing vs querying just like a text field (so you could for example send Solr seconds since epoch when indexing, and query using MM/DD/). The Analyzers used would be responsible for producing Tokens which match what values the current DateField.toInternal() already considers legal (either a DateMath string or an iso8601 string).

(In general a DateTranslatingTokenFilter class would be a pretty cool addition to Lucene. It could take as constructor args two DateFormatters (one for parsing the incoming tokens, and one for formatting the outgoing tokens) and a boolean indicating whether its job was to replace matching tokens or inject duplicate tokens in the same position ... maybe another option indicating whether incoming Tokens that can't be parsed should be stripped or passed through ... the idea being that for something like DateField you would use KeywordTokenizer along with an instance of this to parse whatever format you wanted -- but when parsing generic text you might have several of these TokenFilters configured with different DateFormatters, so if they see a Token in the text that matches a known DateFormat they could inject the name of the month, or the day of the week, into the text at the same position.)

2) add options to the various QueryResponseWriters to control which format they use when writing fields out.
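The parse-or-pass-through core of that DateTranslatingTokenFilter idea could be sketched like this in plain Java (the class name and its eventual wiring into Lucene's TokenFilter API are hypothetical — only the translation logic is shown):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

// Sketch of the translation step: try to parse a token with an incoming
// format; on success emit it in the outgoing format, otherwise either
// drop it or pass it through unchanged (the "strip vs pass through" option).
public class DateTokenTranslator {
    private final DateTimeFormatter in;
    private final DateTimeFormatter out;
    private final boolean passThroughUnparsed;

    public DateTokenTranslator(String inPattern, String outPattern,
                               boolean passThroughUnparsed) {
        this.in = DateTimeFormatter.ofPattern(inPattern);
        this.out = DateTimeFormatter.ofPattern(outPattern);
        this.passThroughUnparsed = passThroughUnparsed;
    }

    /** Returns the translated token, the original token, or null (stripped). */
    public String translate(String token) {
        try {
            return out.format(LocalDate.parse(token, in));
        } catch (DateTimeParseException e) {
            return passThroughUnparsed ? token : null;
        }
    }

    public static void main(String[] args) {
        DateTokenTranslator t = new DateTokenTranslator(
            "MM/dd/yyyy", "yyyy-MM-dd'T'00:00:00'Z'", true);
        System.out.println(t.translate("05/10/2007")); // parsed and reformatted
        System.out.println(t.translate("not-a-date")); // passed through
    }
}
```

A real TokenFilter would call something like this per token and use the replace/inject boolean to decide token positions; that plumbing is omitted here.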
in the case of XmlResponseWriter it would still produce a date tag, but the value would be formatted according to the configuration. -Hoss
Re: dates times
(In general a DateTranslatingTokenFilter class would be a pretty cool addition to Lucene, it could take as constructor args two DateFormatters (one for parsing the incoming tokens, and one for formatting the outgoing If this happens, it would be nice (perhaps overkill) to have a chronic input filter: http://chronic.rubyforge.org/ the java port: https://jchronic.dev.java.net/ --- brian, for a quick easy solution, if you find working with unix timestamps easier, perhaps you just want to put the dates in as a SortableLongField and deal with the formatting that way.
Re: Costume response writer
hossman_lucene wrote: can you clarify: 1) which version of Solr you are using (the Solr Implementation Version from /admin/registry.jsp gives the best answer) ... -Hoss Just downloaded the latest nightly build and voilà, it's back on track (with the other bugs...) -- View this message in context: http://www.nabble.com/Costume-response-writer-tf3721357.html#a10421865 Sent from the Solr - User mailing list archive at Nabble.com.
RE: dates times
Regarding Hoss's points about the internal format, resolution of date-times, etc.: maybe a good starting point would be to implement the date-time algorithms of XML Schema (http://www.w3.org/TR/xmlschema-2/#isoformats), where these behaviors are spelled out in reasonably precise terms. There must be code somewhere that Solr could steal to help with this. This would mesh well with XSLT 2.0, and presumably other modern XML environments. peter

-Original Message- From: Chris Hostetter [mailto:[EMAIL PROTECTED] Sent: Thursday, May 10, 2007 12:30 PM To: solr-user@lucene.apache.org Subject: Re: dates times

: It's more than string processing, anyway. I would want to convert the : Solr time 2007-03-15T00:41:52Z to March 15th, 2007 in a web app. : I'd also like to say 'Posted 3 days ago'. In my vision of things, : that work is done on Solr's side. (The former case with a strftime : type formatter in solrconfig, the latter by having strftime return : the day number this year.)

One of the early architecture/design principles of the Solr search APIs was: compute secondary info about a result if it's more efficient or easier to compute in Solr than it would be for a client to do it -- DocSet caches, facet counts, and sorting/pagination being great examples of things where Solr can do less work to get the same info out of raw data than a client app would, because of its low-level access to the data, and because of how much data would need to go over the wire for the client to do the same computation. ... that's largely just a little bit of historical trivia however; Solr has a lot of features now which might not hold up to that yardstick, but I mention it only to clarify one of the reasons Solr didn't have more configurable date formatting to start with. It has been on the TaskList since the start of incubation however...
* a DateTime field (or Query Parser extension) that allows flexible input for easier human entered queries * allow alternate format for date output to ease client creation of date objects?

One of the reasons I don't think anyone has tackled them yet is because it's hard to get a holistic view of a solution, because there are really several loosely related problems with date formatting issues:

The first is a discussion of the internal format and what resolution the dates are stored at in the index itself. If you *know* that you never plan on querying with anything more fine grained than day resolution, storing your dates with only day resolution can make your index a lot smaller (and make date searches a lot faster). With the current DateField the same performance benefits can be achieved by rounding your dates before indexing them, but if we were to make it a config option on DateField itself to automatically round, we would need to take this info into account when parsing updates -- should the client be expected to know what precision each date field uses? Do they send dates expressed using the internal format, or as fully qualified times? Is it an error/warning to attempt to index more datetime precision than a field supports?

The second is a discussion of external format (which seems to be what you are mostly discussing). The most trivial way to address this would be options on the ResponseWriters that allow them to be configured with DateFormatter strings they would use to process any date they return .. but that raises questions about the QueryParsing aspect as well ... should date formatting be a property of the response, or a property of the request, such that both input and output formats are identical?

Third is how the discussions of the internal format and the external format shouldn't be treated as completely independent.
It's tempting to say that there will be a clean abstraction between the two: that all client interaction will be done using configured external formatter(s) to create internal java Date objects, which will then be translated back to Strings by an internal formatter for the purpose of indexing (and querying). But what happens when a query expresses a date range too precise for the granularity expressed by the internal format? Do we match nothing/everything? ... What if the indexed granularity is *more* precise than the query granularity .. how do we know if a range query between March 6, 2007 and May 10, 2007 on a field that stores millisecond granularity is supposed to go from the first millisecond of each day or the last?

Questions like these are why I'm glad Solr currently keeps it simple and makes people deal in absolutes .. less room for confusion :) -Hoss
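The "round your dates before indexing them" workaround from the first point above is a one-liner with java.time (shown here standalone and hypothetically named; how it would hang off a DateField config option is exactly the open question in the thread):

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

// Rounding a timestamp to day resolution before indexing: fewer distinct
// terms in the index, faster date range searches, at the cost of losing
// intra-day precision.
public class DayRounder {
    public static String roundToDay(String iso8601) {
        return Instant.parse(iso8601)
                      .truncatedTo(ChronoUnit.DAYS)
                      .toString();
    }

    public static void main(String[] args) {
        System.out.println(roundToDay("2007-05-10T13:45:12Z"));
    }
}
```

Doing this client-side keeps DateField unmodified, which sidesteps the parsing questions above -- the client simply never sends more precision than it intends to query with.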
RE: Facet only support english?
If my memory is correct, UTF-8 has been the default encoding per XML specification from a very early stage. If the XML parser is not defaulting to UTF-8 in absence of the encoding attribute, that means the XML parser has a bug, and the code should be corrected. (I don't have an objection to add the encoding attribute for clarity, however.) -kuro -Original Message- From: Walter Underwood [mailto:[EMAIL PROTECTED] Sent: Wednesday, May 09, 2007 4:33 PM To: solr-user@lucene.apache.org Subject: Re: Facet only support english? I didn't remember that requirement, so I looked it up. It was added in XML 1.0 2nd edition. Originally, unspecified encodings were open for auto-detection. Content type trumps encoding declarations, of course, per RFC 3023 and allowed by the XML spec. wunder On 5/9/07 4:19 PM, Mike Klaas [EMAIL PROTECTED] wrote: I thought that conformant parsers use UTF-8 as the default anyway: http://www.w3.org/TR/REC-xml/#charencoding -Mike
Re: Index Concurrency
Though, isn't there a recent patch to allow multiple indices under a single Solr instance in JIRA? Otis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Simpy -- http://www.simpy.com/ - Tag - Search - Share - Original Message From: Yonik Seeley [EMAIL PROTECTED] To: solr-user@lucene.apache.org Sent: Wednesday, May 9, 2007 6:32:33 PM Subject: Re: Index Concurrency On 5/9/07, joestelmach [EMAIL PROTECTED] wrote: My first intuition is to give each user their own index. My thinking here is that querying would be faster (since each user's index would be much smaller than one big index,) and, more importantly, that I would dodge any concurrency issues stemming from multiple threads trying to update the same index simultaneously. I realize that Lucene implements a locking mechanism to protect against concurrent access, but I seem to hit the lock access timeout quite easily with only a couple threads. After looking at solr, I would really like to take advantage of the many features it adds to Lucene, but it doesn't look like I'll be able to achieve multiple indexes. No, not currently. Start your implementation with just a single index... unless it is very large, it will likely be fast enough. Solr also handles all the concurrency issues, and you should never hit lock access timeout when updating from multiple threads. -Yonik
Re: Index Concurrency
Yes, coordination between the main index searcher, the index writer, and the index reader needed to delete other documents. Can you point me to any documentation/code that describes this implementation? That's weird... I've never seen that. The lucene write lock is only obtained when the IndexWriter is created. Can you post the relevant part of the log file where the exception happens? After doing some more testing, I believe it was a stale lock file that was causing me to have these lock issues yesterday - sorry for the false alarm :) Also, unless you have at least 6 CPU cores or so, you are unlikely to see greater throughput with 10 threads. If you add multiple documents per HTTP-POST (such that HTTP latency is minimized), the best setting would probably be nThreads == nCores. For a single doc per POST, more threads will serve to cover the latency and keep Solr busy. I agree with your thinking here. My requirement for a large number of threads is somewhat of an artifact of my current system design. I'm trying not to serialize the system's processing at the point of indexing. -- View this message in context: http://www.nabble.com/Index-Concurrency-tf3718634.html#a10424207 Sent from the Solr - User mailing list archive at Nabble.com.
Re: Question about delete
Got it, thanks Yonik. 2007/5/10, Yonik Seeley [EMAIL PROTECTED]: On 5/10/07, Ajanta Phatak [EMAIL PROTECTED] wrote: I believe in lucene at least deleting documents only marks them for deletion. The actual delete happens only after closing the IndexReader. Not sure about Solr Closing an IndexReader only flushes the list of deleted docids to the index... it doesn't actually delete them. Deletions only happen when the deleted docs segment is involved in a merge, or when an optimize is done (which is a merge of all segments). -Yonik -- regards jl
Solr concurrent commit not updated
Hello all, I have tested by using post.sh in the example directory to add xml documents into Solr. It works when I add them one by one. But when I have a lot of .xml files to be posted (say about 500-1000 files) and I wrote a shell script to call post.sh one by one, I found those xml files are not searchable after the post. But from the Solr admin page / statistics I found that it records committed numbers, yet numDocs is not updated. So why does it work fine when I use post.sh to post one xml, but behave differently when I use post.sh 500 times, one xml each time? Regards, David
Re: Solr concurrent commit not updated
You should know that id is a unique key -- adding a document with an existing id replaces it. 2007/5/11, David Xiao [EMAIL PROTECTED]: Hello all, I have tested by use post.sh in example directory to add xml documents into solr. It works when I add one by one. But when I have a lot of .xml file to be posted (say about 500-1000 files) and I wrote a shell script to call post.sh one by one. I found those xml files are not searchable after post. But from solr admin page / statistics I found that it records commited numbers. But numDocs is not updated. So why, when I use post.sh to post one xml it will be fine, but if I use post.sh for 500 times, each time one xml will be different behavior? Regards, David -- regards jl
RE: cwd requirement to run Solr with Tomcat
that section was never really intended to be *the* set of instructions for installing Solr on Tomcat, just the *simplest* set of things you could do to see it working; many additional things could be done (besides deleting the unzipped dir). If we start listing more things, people may get confused and assume those things *have* to be done. I've added some better comments to try and clarify that it's a minimal set of steps. : The Simple Example Install section in : http://wiki.apache.org/solr/SolrTomcat : leaves the unzipped directory apache-solr-nightly-incubating : intact, but this is not needed after copying the : solr.war and the example solr directory, is it? : Can I edit the instruction to insert: : rm -r apache-solr-nightly-incubating : after the cp line? -Hoss
Re: Question about delete
: Closing an IndexReader only flushes the list of deleted docids to the : index... it doesn't actually delete them. Deletions only happen when : the deleted docs segment is involved in a merge, or when an optimize : is done (which is a merge of all segments). just to clarify slightly, because deleted can be different things to different people... think of executing a delete command as logically deleting documents, by adding them to a list of documents to be ignored by IndexSearchers. A commit will ensure that deleted docs list is written to disk, and reopen the IndexSearcher, which will treat any documents in that list as if they didn't exist. When segment merges happen sometime in the future, document information is physically deleted, in the sense that the data associated with docs in the deleted list is actually removed from the index files, and disk/ram space is freed up. -Hoss
Re: Solr Sorting, merging/weighting sort fields
The boost is a way to adjust the weight of that field, just like you adjust the weight of any other field. If the boost is dominating the score, reduce the weight and vice versa. wunder On 5/10/07 9:22 PM, Chris Hostetter [EMAIL PROTECTED] wrote: : Is this correct? bf is a boosting function, so a function is needed there, no? : If I'm not missing something, the ^0.5 is just a boost, and popularity : is just a (numeric) field. So boosting a numeric field wouldn't make : sense, but applying it to a function would. Am I missing something? the function parser does the right thing when you give it a bare field name, from the javadocs... http://lucene.apache.org/solr/api/org/apache/solr/search/QueryParsing.html#parseFunction(java.lang.String,%20org.apache.solr.schema.IndexSchema) // Numeric fields default to correct type // (ie: IntFieldSource or FloatFieldSource) // Others use implicit ord(...) to generate numeric field value myfield you are correct about 0.5 being the boost; using either the _val_ hack on the SolrQueryParser or the bf param of dismax, the ^0.5 will be used as a boost on the resulting function query... qt=standard&q=%2Bfoo%20_val_:popularity^0.5 qt=dismax&q=foo&bf=popularity^0.5 -Hoss
RE: fast update handlers
: want to add docs every 2 seconds all while doing queries. if I do : commits every 2 seconds I basically lose any caching advantage and my : faceting performance goes down the tube. If however, I were to add : things to a smaller index and then roll it into the larger one every ~30 : minutes then I only take the hit on computing the larger filter caches searching across both of these indexes (the big and the little) would require something like a MultiReader, a way to unify DocSets between the two, and the ability to cache on the sub indexes and on the main MultiReader. Fortunately, a MultiReader is exactly what Lucene uses under the covers when dealing with an FSDirectory, so we're halfway there. Something like these might get us the rest of the way... https://issues.apache.org/jira/browse/LUCENE-831 https://issues.apache.org/jira/browse/LUCENE-743 -Hoss