Re: DIH transformer script size limitations with Jetty?
On Thu, Aug 12, 2010 at 5:42 AM, harrysmith harrysmith...@gmail.com wrote: To follow up on my own question, it appears this is only an issue when using the DataImport console debugging tools. It looks like when submitting the debugging request, the data-config.xml is sent via a GET request, which would fail. However, using the exact same data-config.xml via a full-import operation (i.e. not a dry-run debug), the request is sent via POST and the import works fine. You are right. In debug mode, the data-config is sent as a GET request. Can you open a Jira issue? -- Regards, Shalin Shekhar Mangar.
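For reference, the full-import path never puts the config on the URL, and the debug console's dataConfig parameter can be forced into a POST body instead; a rough sketch with curl, assuming Solr on localhost:8983 with the DIH handler registered at /dataimport:

  # full-import reads data-config.xml from disk; nothing large travels on the URL
  curl 'http://localhost:8983/solr/dataimport?command=full-import'

  # the debug console sends the whole data-config as a dataConfig parameter;
  # posting it in the request body sidesteps Jetty's GET size limit
  curl 'http://localhost:8983/solr/dataimport?command=full-import&debug=on' \
       --data-urlencode dataConfig@data-config.xml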
Indexing Hanging during GC?
Hi, When indexing large amounts of data I hit a problem whereby Solr becomes unresponsive and doesn't recover (even when left overnight!). I think I've hit some GC problems / some GC tuning is required, and I wanted to know if anyone has ever hit this problem. I can replicate this error (albeit taking longer to do so) using Solr/Lucene analysers only, so I thought other people might have hit this issue before over large data sets.

Background on my problem follows -- but I guess my main question is -- can Solr become so overwhelmed by update posts that it becomes completely unresponsive? Right now I think the problem is that the Java GC is hanging, but I've been working on this all week and it took a while to figure out it might be GC-based / wasn't a direct result of my custom analysers, so I'd appreciate any advice anyone has about indexing large document collections. I also have a second question for those in the know -- do we have a chance of indexing/searching over our large dataset with what little hardware we already have available?

thanks in advance :) bec

A bit of background:

I've got a large collection of articles we want to index/search over -- about 180k in total. Each article has say 500-1000 sentences, and each sentence has about 15 fields, many of which are multi-valued; we store most fields as well for display/highlighting purposes. So I'd guess over 100 million index documents. In our small test collection of 700 articles this results in a single index of about 13GB.

Our pipeline processes PDF files through to Solr native XML, which we call index.xml files, i.e. in <add><doc>... format ready to post straight to Solr's update handler. We create the index.xml files as we pull in information from a few sources, and creation of these files from their original PDF form is farmed out across a grid and is quite time-consuming, so we distribute this process rather than creating index.xml files on the fly...

We do a lot of linguistic processing, and enabling search over our resulting terms requires analysers that split terms / join terms together, i.e. custom analysers that perform string operations and are quite time-consuming / have large overhead compared to most analysers (they take approx 20-30% more time and use twice as many short-lived objects as the "text" field type).

Right now I'm working on my new iMac: quad-core 2.8 GHz Intel Core i7, 16 GB 1067 MHz DDR3 RAM, 2TB hard drive (about half free), OS X 10.6.4. Production environment: 2 Linux boxes, each with: 8-core Intel(R) Xeon(R) CPU @ 2.00GHz, 16GB RAM. I use Java 1.6 and Solr version 1.4.1 with multi-cores (a single core right now).

I set up Solr to use autocommit as we'll have several document collections / post to Solr from different data sets:

<!-- autocommit pending docs if certain criteria are met. Future versions may expand the available criteria -->
<autoCommit>
  <maxDocs>500000</maxDocs> <!-- every 1000 articles -->
  <maxTime>900000</maxTime> <!-- every 15 minutes -->
</autoCommit>

I also have:

<useCompoundFile>false</useCompoundFile>
<ramBufferSizeMB>1024</ramBufferSizeMB>
<mergeFactor>10</mergeFactor>

*** First question: Has anyone else found that Solr hangs/becomes unresponsive after too many documents are indexed at once, i.e. Solr can't keep up with the post rate? I've got LCF crawling my local test set (file system connection required only) and posting documents to Solr using 6GB of RAM.

As I said above, these documents are in native Solr XML format (<add><doc>) with one file per article, so each add contains all the sentence-level documents for the article. With LCF I post about 2.5-3k articles (files) per hour -- so about 2.5k * 500 / 3600 = ~350 docs per second post rate -- is this normal/expected?

Eventually, after about 3000 files (an hour or so), Solr starts to hang / becomes unresponsive, and with JConsole / GC logging I can see that the Old-Gen space is about 90% full; the following is the end of the Solr log file -- where you can see GC has been called:

3012.290: [GC Before GC:
Statistics for BinaryTreeDictionary:
Total Free Space: 53349392
Max Chunk Size: 3200168
Number of Blocks: 66
Av. Block Size: 808324
Tree Height: 13
Before GC:
Statistics for BinaryTreeDictionary:
Total Free Space: 0
Max Chunk Size: 0
Number of Blocks: 0
Tree Height: 0
3012.290: [ParNew (promotion failed): 143071K->142663K(153344K), 0.0769802 secs]3012.367: [CMS

I can replicate this with Solr using "text" field types in place of those that use my custom analysers -- whereby Solr takes longer to become unresponsive (about 3 hours / 13k docs) but there is the same kind of GC message at the end of the log file / JConsole shows that the Old-Gen space was
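For anyone trying to reproduce this, the GC behaviour is easiest to capture with GC logging switched on when the JVM starts; a minimal sketch, assuming the stock Jetty start.jar from the Solr example directory and the 6GB heap mentioned above:

  java -Xms6g -Xmx6g \
       -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
       -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
       -Xloggc:gc.log \
       -jar start.jar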
Re: Analysing SOLR logfiles
Thanks - Splunk looks like overkill. We're extremely small scale - we're hoping for something open source :-)

----- Original Message ----- From: Jan Høydahl / Cominvent jan@cominvent.com To: solr-user@lucene.apache.org Sent: Wed, August 11, 2010 11:14:37 PM Subject: Re: Analysing SOLR logfiles

Have a look at www.splunk.com

-- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Training in Europe - www.solrtraining.com

On 11. aug. 2010, at 19.34, Jay Flattery wrote:
> Hi there, Just wondering what tools people use to analyse SOLR log files. We're looking to do things like extracting common queries, calculating average QTime and hits, returning particularly slow/expensive queries, etc. Would prefer not to code something (completely) from scratch. Thanks!
Re: Improve Query Time For Large Index
Hi Robert!

> Since the example given was http being slow, it's worth mentioning that if queries are one-word URLs [for example http://lucene.apache.org] these will actually form slow phrase queries by default.

Do you mean that http://lucene.apache.org will be split up into "http lucene apache org" and Solr will perform a phrase query? Regards, Peter.
Re: Improve Query Time For Large Index
Hi Tom, I tried again with:

<queryResultCache class="solr.LRUCache" size="1" initialSize="1" autowarmCount="1"/>

and even now the hit ratio is still 0. What could be wrong with my setup? ('free -m' shows that the machine has over 2 GB free.) Regards, Peter.

> Hi Peter, Can you give a few more examples of slow queries? Are they phrase queries? Boolean queries? Prefix or wildcard queries? If one-word queries are your slow queries, then CommonGrams won't help. CommonGrams will only help with phrase queries. How are you using termvectors? That may be slowing things down. I don't have experience with termvectors, so someone else on the list might speak to that.
> When you say the query time for common terms stays slow, do you mean if you re-issue the exact query, the second query is not faster? That seems very strange. You might restart Solr and send a first query (the first query always takes a relatively long time). Then pick one of your slow queries and send it 2 times. The second time you send the query it should be much faster due to the Solr caches, and you should be able to see the cache hit in the Solr admin panel. If you send the exact query a second time (without enough intervening queries to evict data from the cache), the Solr queryResultCache should get hit and you should see a response time in the .01-5 millisecond range.
> What settings are you using for your Solr caches? How much memory is on the machine? If your bottleneck is disk I/O for frequent terms, then you want to make sure you have enough memory for the OS disk cache. I assume that http is not in your stopwords.
> CommonGrams was committed and is in Solr 1.4. If you decide to use CommonGrams you definitely need to re-index, and you also need to use both the index-time filter and the query-time filter. Your index will be larger.
>
> <fieldType name="foo" ...>
>   <analyzer type="index">
>     <filter class="solr.CommonGramsFilterFactory" words="new400common.txt"/>
>   </analyzer>
>   <analyzer type="query">
>     <filter class="solr.CommonGramsQueryFilterFactory" words="new400common.txt"/>
>   </analyzer>
> </fieldType>
>
> Tom

-----Original Message----- From: Peter Karich [mailto:peat...@yahoo.de] Sent: Tuesday, August 10, 2010 3:32 PM To: solr-user@lucene.apache.org Subject: Re: Improve Query Time For Large Index

Hi Tom, my index is around 3GB large and I am using 2GB RAM for the JVM, although some more is available. If I look at the RAM usage while a slow query runs (via jvisualvm) I see that only 750MB of the JVM RAM is used.

> Can you give us some examples of the slow queries?

for example the empty query solr/select?q= takes very long, or solr/select?q=http where 'http' is the most common term

> Are you using stop words?

yes, a lot. I stored them into stopwords.txt

> http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2

this looks interesting. I read through https://issues.apache.org/jira/browse/SOLR-908 and it seems to be in 1.4. I only need to enable it via:

<filter class="solr.CommonGramsFilterFactory" ignoreCase="true" words="stopwords.txt"/>

right? Do I need to reindex? Regards, Peter.

> Hi Peter, A few more details about your setup would help list members to answer your questions. How large is your index? How much memory is on the machine and how much is allocated to the JVM? Besides the Solr caches, Solr and Lucene depend on the operating system's disk caching for caching of postings lists, so you need to leave some memory for the OS. On the other hand, if you are optimizing and refreshing every 10-15 minutes, that will invalidate all the caches, since an optimized index is essentially a set of new files. Can you give us some examples of the slow queries? Are you using stop words? If your slow queries are phrase queries, then you might try either adding the most frequent terms in your index to the stopwords list, or try CommonGrams and add them to the common words list. (Details on CommonGrams here: http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2) Tom Burton-West

-----Original Message----- From: Peter Karich [mailto:peat...@yahoo.de] Sent: Tuesday, August 10, 2010 9:54 AM To: solr-user@lucene.apache.org Subject: Improve Query Time For Large Index

Hi, I have 5 million small documents/tweets (= ~3GB), and the slave index replicates itself from the master every 10-15 minutes, so the index is optimized before querying. We are using Solr 1.4.1 (patched with SOLR-1624) via SolrJ. Now the search speed is slow: ~2s for common terms which hit more than 2 million docs, and acceptable for others: 0.5s. For those numbers I don't use highlighting or facets. I am using the following schema [1], and from the Luke handler I know that numTerms = ~20 million. The query for common terms stays slow if I
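The cache hits Tom describes can be checked without restarting anything: in Solr 1.4 the admin stats page exposes lookups, hits and hit ratio per cache (host and port assumed):

  http://localhost:8983/solr/admin/stats.jsp
  (see the CACHE section: queryResultCache -> lookups / hits / hitratio / evictions)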
Re: Improve Query Time For Large Index
Hi Tom!

> Hi Peter, Can you give a few more examples of slow queries? Are they phrase queries? Boolean queries? Prefix or wildcard queries?

I am experimenting with one-word queries only at the moment.

> If one-word queries are your slow queries, then CommonGrams won't help. CommonGrams will only help with phrase queries.

hmmh, ok.

> How are you using termvectors?

yes.

> That may be slowing things down. I don't have experience with termvectors, so someone else on the list might speak to that.

ok. But for highlighting I'll need them to speed things up (a lot).

> When you say the query time for common terms stays slow, do you mean if you re-issue the exact query, the second query is not faster? That seems very strange.

Yes. Indeed. The queryResultCache has no hits at all. Strange.

> You might restart Solr and send a first query (the first query always takes a relatively long time). Then pick one of your slow queries and send it 2 times. The second time you send the query it should be much faster due to the Solr caches, and you should be able to see the cache hit in the Solr admin panel. If you send the exact query a second time (without enough intervening queries to evict data from the cache), the Solr queryResultCache should get hit and you should see a response time in the .01-5 millisecond range.

That's not the case. The second query is only a few milliseconds faster (but stays around 2s). But I'm not sure what I am doing wrong. The other 3 caches have a good hit ratio, but queryResultCache has 0. For queryResultCache I am using:

<queryResultCache class="solr.LRUCache" size="400" initialSize="400" autowarmCount="400"/>

But even if I double that, the hit ratio stays 0.

> How much memory is on the machine? If your bottleneck is disk I/O for frequent terms, then you want to make sure you have enough memory for the OS disk cache.

Yes, there should be enough memory for the OS disk cache.

> I assume that http is not in your stopwords.

exactly.

> CommonGrams will only help with phrase queries. CommonGrams was committed and is in Solr 1.4. If you decide to use CommonGrams you definitely need to re-index, and you also need to use both the index-time filter and the query-time filter. Your index will be larger.
>
> <fieldType name="foo" ...>
>   <analyzer type="index">
>     <filter class="solr.CommonGramsFilterFactory" words="new400common.txt"/>
>   </analyzer>
>   <analyzer type="query">
>     <filter class="solr.CommonGramsQueryFilterFactory" words="new400common.txt"/>
>   </analyzer>
> </fieldType>

Thanks, I will try that, if I can solve the current issue :-) And thanks for all your answers, I will try to experiment with my setup in more detail now...

Kind regards, Peter.

[...]
Re: Analysing SOLR logfiles
we've just started using awstats - as suggested by the Solr 1.4 book. It's open source!: http://awstats.sourceforge.net/

On 12 August 2010 18:18, Jay Flattery jayc...@rocketmail.com wrote:
> Thanks - Splunk looks like overkill. We're extremely small scale - we're hoping for something open source :-) [...]
Re: Improve Query Time For Large Index
exactly!

On Thu, Aug 12, 2010 at 5:26 AM, Peter Karich peat...@yahoo.de wrote:
> Hi Robert!
> > Since the example given was http being slow, it's worth mentioning that if queries are one-word URLs [for example http://lucene.apache.org] these will actually form slow phrase queries by default.
> Do you mean that http://lucene.apache.org will be split up into "http lucene apache org" and Solr will perform a phrase query? Regards, Peter.

-- Robert Muir rcm...@gmail.com
Re: Multiple Facet Dates
On 05/08/2010 09:59, Raphaël Droz wrote:
Hi, I saw this post: http://lucene.472066.n3.nabble.com/Multiple-Facet-Dates-td495480.html. I didn't see work in progress or plans about this feature on the list or bug tracker. Has someone already created a patch, proof of concept, etc. that I wouldn't have been able to find? From my naïve point of view, the usefulness / added-code-complexity ratio appears high. My use case is to provide, in one request:
- the results count for each one of several years (tag-based exclusion)
- the results count for each month of a given year
- the results count for each day of a given month and year
I'm pretty sure someone here has already encountered the above, hasn't someone?

After having understood "This parameter can be specified on a per field basis," I created 3 more copy-fields; it's then obvious:

// the real constraint requested
fq={!tag=datefq}date
f.date.facet.date.start=2008-12-08T06:00:00Z
f.date.facet.date.end=2008-12-09T06:00:00Z
f.date.facet.date.gap=+1DAY

// three more fields for the totals
facet.date={!ex%3Ddatefq}date_for_year
facet.date={!ex%3Ddatefq}date_for_year_month
facet.date={!ex%3Ddatefq}date_for_year_month_day

// the count for all years without the constraint
f.date_for_year.facet.date.start=1970-01-01T06:00:00Z
f.date_for_year.facet.date.end=2011-01-01T06:00:00Z
f.date_for_year.facet.date.gap=+1YEAR

// the count for all months of the year requested (2008) without the constraint
f.date_for_year_month.facet.date.start=2008-01-01T06:00:00Z
f.date_for_year_month.facet.date.end=2008-12-31T06:00:00Z
f.date_for_year_month.facet.date.gap=+1MONTH

// idem for the days...

Thanks for your work! Raph
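Assembled into a single request, the approach above looks roughly like this (a sketch: the fq range is illustrative, the gap values need URL-encoding, and only one of the three copy-fields is shown; the month and day fields follow the same pattern):

  http://localhost:8983/solr/select?q=*:*
    &fq={!tag%3Ddatefq}date:[2008-12-08T06:00:00Z TO 2008-12-09T06:00:00Z]
    &facet=true
    &facet.date={!ex%3Ddatefq}date_for_year
    &f.date_for_year.facet.date.start=1970-01-01T06:00:00Z
    &f.date_for_year.facet.date.end=2011-01-01T06:00:00Z
    &f.date_for_year.facet.date.gap=%2B1YEAR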
Solr branches
Hi, I'm having OOME (OutOfMemoryError) problems with Solr. From random browsing I'm getting the impression that a lot of memory fixes happened recently in Solr and Lucene. Could you give me a quick summary of how (un)stable the different Lucene/Solr branches are, and how much improvement I can expect?
Re: Analysing SOLR logfiles
I wonder, too, that there isn't a special tool which analyzes Solr logfiles (e.g. parses QTime and the parameters q, fq, ...), because there are some other open source log analyzers out there: http://yaala.org/ and http://www.mrunix.net/webalizer/. Another free tool is newrelic.com (you will submit your query data to this site, same as for Google Analytics). Setup is easy. For traffic on our site which triggers the Solr search we use Piwik, and common queries can be extracted easily. Setup was done in 5 minutes. Regards, Peter.

> we've just started using awstats - as suggested by the Solr 1.4 book. It's open source!: http://awstats.sourceforge.net/ [...]
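If all that's wanted is average QTime and the most common queries, the default request log already carries both, and a shell one-liner goes a long way; a sketch, assuming Solr 1.4's standard log lines of the form "path=/select params={...} hits=... status=0 QTime=...":

  # average QTime across all /select requests
  grep 'path=/select' solr.log | grep -o 'QTime=[0-9]*' | cut -d= -f2 \
    | awk '{ sum += $1; n++ } END { if (n) print sum / n }'

  # ten most frequent q= parameters
  grep -o 'q=[^&}]*' solr.log | sort | uniq -c | sort -rn | head -10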
indexing???
Hi all, The indexing part of Solr is going well, but I got an error on indexing a single PDF file. When I searched for the error in the mailing list, I found that the error was due to the copyright of that file. Can't we index a file which has a copyright or other digital rights protection? regards, satya
Indexing large files using Solr Cell causes OutOfMemory error
Hi, I'm trying to index a txt file (~150MB) using Solr Cell/Tika. The curl command aborts due to a java.lang.OutOfMemoryError:

java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOfRange(Arrays.java:3209)
    at java.lang.String.<init>(String.java:215)
    at java.lang.StringBuilder.toString(StringBuilder.java:430)
    at org.apache.solr.handler.extraction.SolrContentHandler.newDocument(SolrContentHandler.java:124)
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.doAdd(ExtractingDocumentLoader.java:119)
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.addDoc(ExtractingDocumentLoader.java:125)
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:195)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
    at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:237)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1323)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:337)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:240)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:852)
    at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
    at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
    at java.lang.Thread.run(Thread.java:619)

(followed by the Tomcat 6.0.26 error page: "...that prevented it from fulfilling this request.")

AFAIK Tika keeps the whole file in RAM and posts it as one single string to Solr. I'm using the JVM arg -Xmx1024M and the Solr default config with:

<mainIndex>
  <!-- options specific to the main on-disk lucene index -->
  <useCompoundFile>false</useCompoundFile>
  <ramBufferSizeMB>32</ramBufferSizeMB>
  <mergeFactor>10</mergeFactor>
  ...
</mainIndex>

<requestDispatcher handleSelect="true">
  <!-- Make sure your system has some authentication before enabling remote streaming! -->
  <requestParsers enableRemoteStreaming="true" multipartUploadLimitInKB="2048000"/>
  ...
</requestDispatcher>

Is there a chance to force Solr/Tika to flush the memory while indexing a file? Increasing RAM to match the size of the largest file to index seems not very nice. Did I miss some configuration option, or do I have to modify Java code? I just found http://osdir.com/ml/tika-dev.lucene.apache.org/2009-02/msg00020.html and I'm wondering if there is a solution yet. Carina
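One thing worth trying before touching Java code: with enableRemoteStreaming already on, Solr can pull the file from disk itself instead of receiving it in the request body; a sketch, with host/port and the id field assumed (note this avoids buffering the upload, but Tika may still build the extracted text as one in-memory string):

  curl "http://localhost:8080/solr/update/extract?stream.file=/path/to/bigfile.txt&literal.id=bigfile1&commit=true"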
Re: Solr branches
(10/08/12 21:06), Tomasz Wegrzanowski wrote:
> Hi, I'm having oome problems with solr. From random browsing I'm getting an impression that a lot of memory fixes happened recently in solr and lucene. Could you give me a quick summary how (un)stable are different lucene / solr branches and how much improvement I can expect?

Lucene/Solr have CHANGES.txt. You can refer to it to see how much Lucene/Solr have improved since the previous release.

Koji
--
http://www.rondhuit.com/en/
Re: Schema Definition Question
One way I've handled this (it works only for some types of data) is to put the searchable part of the sub-doc in a search field (indexed=true) and put an XML or JSON representation of the sub-doc in a stored-only field. Then if the main doc is hit via search, I can grab the XML or JSON, convert it to an object graph and do whatever I want. If you need to search on a variety of elements in the sub-doc, this becomes a less useful approach. But in some use cases it worked for me.
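A sketch of that layout in schema.xml; the field names here (sub_doc_search, sub_doc_payload) are made up for illustration:

  <!-- searchable text extracted from the sub-document -->
  <field name="sub_doc_search" type="text" indexed="true" stored="false" multiValued="true"/>
  <!-- opaque stored-only payload: the sub-document serialized as XML or JSON -->
  <field name="sub_doc_payload" type="string" indexed="false" stored="true"/>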
Re: Indexing Hanging during GC?
I am a little confused - how did 180k documents become 100m index documents? We have over 20 indices (for different content sets), one with 5m documents (about a couple of pages each) and another with 100k+ docs. We can index the 5m collection in a couple of days (the limitation is in the source), which is 100k documents an hour without breaking a sweat.

On 8/12/10, Rebecca Watson bec.wat...@gmail.com wrote:
> Hi, When indexing large amounts of data I hit a problem whereby Solr becomes unresponsive and doesn't recover (even when left overnight!). [...]
Re: Solr branches
On 12 August 2010 13:46, Koji Sekiguchi k...@r.email.ne.jp wrote:
> Lucene/Solr have CHANGES.txt. You can refer to it to see how much Lucene/Solr have improved since the previous release. [...]

This is technically true, but I'm not sufficiently familiar with the Solr/Lucene development process to infer much about the performance and stability of the different branches from it.
Re: Indexing large files using Solr Cell causes OutOfMemory error
On Thu, 12 Aug 2010 14:32:19 +0200 Lannig Carina lan...@ssi-schaefer-noell.com wrote:
> Hi, I'm trying to index a txt file (~150MB) using Solr Cell/Tika. The curl command aborts due to a java.lang.OutOfMemoryError. [...] AFAIK Tika keeps the whole file in RAM and posts it as one single string to Solr. I'm using the JVM arg -Xmx1024M and the Solr default config. [...]

I do not know about Tika, but what is the size of your Solr index, and the number of documents in it? Solr seems to need RAM, and while we did not do real benchmarks, even with a few tens of thousands of documents performance seemed to improve by allocating 2GB RAM. Besides, unless you are on a very tight budget, throwing a few GB more RAM at the problem seems to be an easy, and not very expensive, way out.

Regards, Gora
Re: Indexing Hanging during GC?
sorry -- I used the term "documents" too loosely! 180k scientific articles with between 500-1000 sentences each, and we index sentence-level index documents, so I'm guessing about 100 million Lucene index documents in total.

An update on my progress: I used GC settings of:

-XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSPermGenSweepingEnabled
-XX:NewSize=2g -XX:MaxNewSize=2g -XX:SurvivorRatio=8
-XX:CMSInitiatingOccupancyFraction=70

which allowed the indexing process to run to 11.5k articles and for about 2 hours before I got the same kind of hanging/unresponsive Solr, with this as the tail of the Solr logs:

Before GC:
Statistics for BinaryTreeDictionary:
Total Free Space: 2416734
Max Chunk Size: 2412032
Number of Blocks: 3
Av. Block Size: 805578
Tree Height: 3
5980.480: [ParNew: 1887488K->1887488K(1887488K), 0.193 secs]5980.480: [CMS

I also saw (in JConsole) that the number of threads rose from the steady 32 used for the 2 hours to 72 before Solr finally became unresponsive...

I've got the following GC info params switched on (as many as I could find!):

-XX:+PrintClassHistogram -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
-XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCApplicationStoppedTime
-XX:PrintFLSStatistics=1

With 11.5k docs in about 2 hours, this was 11.5k * 500 / 2 = 2.875 million fairly small docs per hour!! This produced an index of about 40GB, to give you an idea of index size...

Because I've already got the documents in Solr native XML format, i.e. one file per article, each with <add><doc>...</doc>..., i.e. posting each set of sentence docs per article in every LCF file post... this means that LCF can throw documents at Solr very fast, and I think I'm breaking it GC-wise. I'm going to try adding in System.gc() calls to see if this runs OK (albeit slower)... otherwise I'm pretty much at a loss as to what could be causing this GC issue / Solr hanging if it's not a GC issue...

thanks :) bec

On 12 August 2010 21:42, dc tech dctech1...@gmail.com wrote:
> I am a little confused - how did 180k documents become 100m index documents? [...]
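When the old generation fills like this, it can help to confirm what is actually occupying it before tuning further; a sketch using the standard JDK 6 tools, where <pid> stands for the Solr JVM's process id:

  # live-set histogram: which classes hold the heap
  jmap -histo:live <pid> | head -30

  # GC utilisation sampled every 5 seconds: watch the O (old gen) column climb
  jstat -gcutil <pid> 5000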
Deleting with the DIH sometimes doesn't delete
I'm doing deletes with the DIH but getting mixed results. Sometimes the documents get deleted, other times I can still find them in the index. What would prevent a doc from getting deleted? For example, I delete 594039 and get this in the logs:

2010-08-12 14:41:55,625 [Thread-210] INFO [DataImporter] Starting Delta Import
2010-08-12 14:41:55,625 [Thread-210] INFO [SolrWriter] Read productimportupdate.properties
2010-08-12 14:41:55,625 [Thread-210] INFO [DocBuilder] Starting delta collection.
2010-08-12 14:41:55,625 [Thread-210] INFO [DocBuilder] Running ModifiedRowKey() for Entity: item
2010-08-12 14:41:55,625 [Thread-210] INFO [DocBuilder] Completed ModifiedRowKey for Entity: item rows obtained : 0
2010-08-12 14:41:55,625 [Thread-210] INFO [DocBuilder] Completed DeletedRowKey for Entity: item rows obtained : 1
2010-08-12 14:41:55,625 [Thread-210] INFO [DocBuilder] Completed parentDeltaQuery for Entity: item
2010-08-12 14:41:55,625 [Thread-210] INFO [DocBuilder] Deleting stale documents
2010-08-12 14:41:55,625 [Thread-210] INFO [SolrWriter] Deleting document: 594039
2010-08-12 14:41:55,703 [Thread-210] INFO [SolrDeletionPolicy] newest commit = 1281030128383
2010-08-12 14:41:55,718 [Thread-210] DEBUG [SolrIndexWriter] Opened Writer DirectUpdateHandler2
2010-08-12 14:41:55,718 [Thread-210] INFO [DocBuilder] Delta Import completed successfully
2010-08-12 14:41:55,718 [Thread-210] INFO [DocBuilder] Import completed successfully
2010-08-12 14:41:55,718 [Thread-210] INFO [DirectUpdateHandler2] start commit(optimize=true,waitFlush=false,waitSearcher=true,expungeDeletes=false)
2010-08-12 14:42:08,562 [Thread-210] DEBUG [SolrIndexWriter] Closing Writer DirectUpdateHandler2
2010-08-12 14:42:10,437 [Thread-210] INFO [SolrDeletionPolicy] SolrDeletionPolicy.onCommit: commits:num=2
 commit{dir=E:\SOLR\kiosk\data\index,segFN=segments_8,version=1281030128383,generation=8,filenames=[_39.frq, _2i.fdx, _39.tis, _39.prx, _39.fnm, _2i.fdt, _39.tii, _39.nrm, segments_8]
 commit{dir=E:\SOLR\kiosk\data\index,segFN=segments_9,version=1281030128384,generation=9,filenames=[_3a.prx, _3a.tis, _3a.fnm, _3a.nrm, _3a.fdt, _3a.tii, _3a.fdx, _3a.frq, segments_9]
2010-08-12 14:42:10,437 [Thread-210] INFO [SolrDeletionPolicy] newest commit = 1281030128384

..this works fine; I can no longer find 594039 in the index. But a little later I delete a couple more (33252 and 105224) and get the following (I added two docs at the same time):

2010-08-12 15:27:42,828 [Thread-217] INFO [DataImporter] Starting Delta Import
2010-08-12 15:27:42,828 [Thread-217] INFO [SolrWriter] Read productimportupdate.properties
2010-08-12 15:27:42,828 [Thread-217] INFO [DocBuilder] Starting delta collection.
2010-08-12 15:27:42,843 [Thread-217] INFO [DocBuilder] Running ModifiedRowKey() for Entity: item
2010-08-12 15:27:42,843 [Thread-217] INFO [DocBuilder] Completed ModifiedRowKey for Entity: item rows obtained : 2
2010-08-12 15:27:42,843 [Thread-217] INFO [DocBuilder] Completed DeletedRowKey for Entity: item rows obtained : 2
2010-08-12 15:27:42,843 [Thread-217] INFO [DocBuilder] Completed parentDeltaQuery for Entity: item
2010-08-12 15:27:42,843 [Thread-217] INFO [DocBuilder] Deleting stale documents
2010-08-12 15:27:42,843 [Thread-217] INFO [SolrWriter] Deleting document: 33252
2010-08-12 15:27:42,906 [Thread-217] INFO [SolrDeletionPolicy] SolrDeletionPolicy.onInit: commits:num=1
 commit{dir=E:\SOLR\kiosk\data\index,segFN=segments_9,version=1281030128384,generation=9,filenames=[_3a.prx, _3a.tis, _3a.fnm, _3a.nrm, _3a.fdt, _3a.tii, _3a.fdx, _3a.frq, segments_9]
2010-08-12 15:27:42,906 [Thread-217] INFO [SolrDeletionPolicy] newest commit = 1281030128384
2010-08-12 15:27:42,906 [Thread-217] DEBUG [SolrIndexWriter] Opened Writer DirectUpdateHandler2
2010-08-12 15:27:42,906 [Thread-217] INFO [SolrWriter] Deleting document: 105224
2010-08-12 15:27:42,906 [Thread-217] INFO [DocBuilder] Delta Import completed successfully
2010-08-12 15:27:42,906 [Thread-217] INFO [DocBuilder] Import completed successfully
2010-08-12 15:27:42,906 [Thread-217] INFO [DirectUpdateHandler2] start commit(optimize=true,waitFlush=false,waitSearcher=true,expungeDeletes=false)
2010-08-12 15:27:55,578 [Thread-217] DEBUG [SolrIndexWriter] Closing Writer DirectUpdateHandler2
2010-08-12 15:27:56,875 [Thread-217] INFO [SolrDeletionPolicy] SolrDeletionPolicy.onCommit: commits:num=2
 commit{dir=E:\SOLR\kiosk\data\index,segFN=segments_9,version=1281030128384,generation=9,filenames=[_3a.prx, _3a.tis, _3a.fnm, _3a.nrm, _3a.fdt, _3a.tii, _3a.fdx, _3a.frq, segments_9]
 commit{dir=E:\SOLR\kiosk\data\index,segFN=segments_a,version=1281030128385,generation=10,filenames=[_3c.tis, _3c.fdt, _3c.fnm, _3c.nrm, _3c.tii, segments_a, _3c.fdx, _3c.prx, _3c.frq]
2010-08-12 15:27:56,875 [Thread-217] INFO [SolrDeletionPolicy] newest commit = 1281030128385
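For reference, delta deletes in DIH hinge on the entity's deletedPkQuery, and the keys it returns must match the index's uniqueKey values exactly; a sketch of the relevant data-config.xml piece, with table and column names made up:

  <entity name="item" pk="ID"
          query="SELECT * FROM item"
          deltaQuery="SELECT ID FROM item WHERE last_modified > '${dataimporter.last_index_time}'"
          deletedPkQuery="SELECT ID FROM item_deletions WHERE deleted_at > '${dataimporter.last_index_time}'">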
index pdf files
I wrote a simple Java program to import a PDF file. I can get a result when I do a *:* search from the admin page. I get nothing if I search for a word. I wonder if I did something wrong or missed setting something. Here is part of the result I get when doing a *:* search:

<doc>
  <arr name="attr_Author"><str>Hristovski D</str></arr>
  <arr name="attr_Content-Type"><str>application/pdf</str></arr>
  <arr name="attr_Keywords"><str>microarray analysis, literature-based discovery, semantic predications, natural language processing</str></arr>
  <arr name="attr_Last-Modified"><str>Thu Aug 12 10:58:37 EDT 2010</str></arr>
  <arr name="attr_content"><str>Combining Semantic Relations and DNA Microarray Data for Novel Hypotheses Generation Combining Semantic Relations and DNA Microarray Data for Novel Hypotheses Generation Dimitar Hristovski, PhD,1 Andrej Kastrin,2...</str></arr>
</doc>

Please help me out if anyone has experience with PDF files. I really appreciate it! Thanks so much,
Re: index pdf files
To help you we need the description of your fields in your schema.xml and the query that you do when you search only a single word.

Marco Martínez Bautista
http://www.paradigmatecnologico.com
Avenida de Europa, 26. Ática 5. 3ª Planta
28224 Pozuelo de Alarcón
Tel.: 91 352 59 42

2010/8/12 Ma, Xiaohui (NIH/NLM/LHC) [C] xiao...@mail.nlm.nih.gov:
> I wrote a simple java program to import a pdf file. I can get a result when I do search *:* from admin page. I get nothing if I search a word. [...]
RE: index pdf files
Thanks so much. I didn't know how to make any changes in schema.xml for PDF files; I used the Solr default schema.xml. Please tell me what I need to do in schema.xml. The simple Java program I use is the following. I also attached that PDF file. I really appreciate your help!

public class importPDF {
    public static void main(String[] args) {
        try {
            String fileName = "pub2009001.pdf";
            String solrId = "pub2009001.pdf";
            indexFilesSolrCell(fileName, solrId);
        } catch (Exception ex) {
            System.out.println(ex.toString());
        }
    }

    public static void indexFilesSolrCell(String fileName, String solrId)
            throws IOException, SolrServerException {
        String urlString = "http://lhcinternal.nlm.nih.gov:8989/solr/lhcpdf";
        SolrServer solr = new CommonsHttpSolrServer(urlString);
        // send the file through the extract (Solr Cell) handler
        ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");
        up.addFile(new File(fileName));
        up.setParam("literal.id", solrId);
        up.setParam("uprefix", "attr_");
        up.setParam("fmap.content", "attr_content");
        up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
        solr.request(up);
    }
}

-----Original Message----- From: Marco Martinez [mailto:mmarti...@paradigmatecnologico.com] Sent: Thursday, August 12, 2010 11:45 AM To: solr-user@lucene.apache.org Subject: Re: index pdf files

To help you we need the description of your fields in your schema.xml and the query that you do when you search only a single word. [...]
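Since uprefix routes unrecognized fields into attr_* names, a bare q=word only matches if the default search field covers them; querying the extracted text field explicitly is a quick check (a sketch against the URL used in the program above):

  http://lhcinternal.nlm.nih.gov:8989/solr/lhcpdf/select?q=attr_content:microarray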
how to update solr to older 1.5 builds instead of to trunk
please excuse this newbie question, but: I want to upgrade Solr to a newer version, but not to the latest version in the trunk (because there are so many changes that I would have to test against, modify my custom classes for, deal with behavior changes and the Lucene index change, etc.).

My thought was to look at versions that are post 903398 (2010-01-26 20:21:09Z) but pre the change in the Lucene index, eventually picking up the version that had the features I wanted but with as few other changes as feasible. I know I could probably apply a bunch of patches, but some of the patches seem to rely on other patches which rely on other patches which rely on... It just seems easier to pick the version that has just the features/patches I want.

I have no trouble seeing/using the trunk at http://svn.apache.org/repos/asf/lucene/dev/trunk/ but it only seems to have builds 984777 thru 984832. So where would I find significantly older builds (i.e. like the one I am currently using - 903398)? I tried using svn on repository http://svn.apache.org/viewvc/lucene/solr/branches/branch-1.5-dev/ but get a "Repository moved permanently to '/viewvc/lucene/solr/branches/branch-1.5-dev/'" message. Any help would be great.
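Older revisions never disappear from Subversion, so a specific build can still be checked out by revision number; a sketch, assuming the pre-merge Solr trunk path that revision 903398 lived under:

  # peg-revision syntax (@REV) resolves the path as it existed at that revision,
  # even though Solr trunk has since moved under lucene/dev
  svn checkout http://svn.apache.org/repos/asf/lucene/solr/trunk@903398 solr-r903398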
Re: Indexing Hanging during GC?
1) I assume you are doing batching interspersed with commits
2) Why do you need sentence-level Lucene docs?
3) Are your custom handlers/parsers a part of the Solr JVM? Would not be surprised if you have a memory/connection leak there (or something is not releasing a resource explicitly).

In general, we have NEVER had a problem in loading Solr.

On 8/12/10, Rebecca Watson bec.wat...@gmail.com wrote:
> sorry -- i used the term documents too loosely! 180k scientific articles with between 500-1000 sentences each and we index sentence-level index documents so i'm guessing about 100 million lucene index documents in total. [...]
Re: how to update solr to older 1.5 builds instead of to trunk
Another option is the 3x branch - that should still be able to read indexes from Solr 1.4 / Lucene 2.9. I personally don't expect a 1.5 release to ever materialize. There will eventually be a Lucene/Solr 3.1 release off of the 3x branch, and a Lucene/Solr 4.0 release off of trunk.

-Yonik http://www.lucidimagination.com

On Thu, Aug 12, 2010 at 11:59 AM, solr-user solr-u...@hotmail.com wrote:
> please excuse this newbie question, but: I want to upgrade solr to a version but not to the latest version in the trunk [...]
Re: Solr Doc Lucene Doc !?
no help? =(
Re: how to update solr to older 1.5 builds instead of to trunk
Thanks Yonik, but http://svn.apache.org/repos/asf/lucene/dev/branches/branch_3x/solr/CHANGES.txt says that the Lucene index has changed:

Upgrading from Solr 1.4
--
* The Lucene index format has changed and as a result, once you upgrade, previous versions of Solr will no longer be able to read your indices. In a master/slave configuration, all searchers/slaves should be upgraded before the master. If the master were to be updated first, the older searchers would not be able to read the new index format.

Not to mention that regression testing is a pain. Is there any way to get a set of builds with versions prior to 3.x?
Re: Indexing Hanging during GC?
hi,

> 1) I assume you are doing batching interspersed with commits

As each file I crawl is article-level, each add contains all the sentences for the article, so they are naturally batched into about 500 documents per post in LCF. I use auto-commit in Solr:

<autoCommit>
  <maxDocs>500000</maxDocs> <!-- every 1000 articles -->
  <maxTime>900000</maxTime> <!-- every 15 minutes -->
</autoCommit>

> 2) Why do you need sentence level Lucene docs?

That's an application-specific need, due to linguistic info needed on a per-sentence basis.

> 3) Are your custom handlers/parsers a part of SOLR jvm? Would not be surprised if you have a memory/connection leak there (or it is not releasing some resource explicitly)

I thought this could be the case too -- but if I replace the use of my custom analysers and specify my fields as type "text" instead (i.e. the stock field type using Solr-based analysers) then I get this kind of hanging too -- at least it did when I didn't have any explicit GC settings... It does take longer to replicate, as my analysers/field types are more complex than the "text" field type. I will try it again with the different GC settings tomorrow and post the results.

> In general, we have NEVER had a problem in loading Solr.

I'm not sure if we would either if we posted as we created the index.xml format... but because we post 500+ documents at a time (one article file per LCF post) and LCF can post these files quickly, I'm not sure if I need to try and slow down the post rate!?

thanks for your replies, bec :)

On 8/12/10, Rebecca Watson bec.wat...@gmail.com wrote:
> sorry -- i used the term documents too loosely! [...]
thanks :) bec

On 12 August 2010 21:42, dc tech dctech1...@gmail.com wrote: I am a little confused - how did 180k documents become 100m index documents? We have over 20 indices (for different content sets), one with 5m documents (about a couple of pages each) and another with 100k+ docs. We can index the 5m collection in a couple of days (the limitation is in the source), which is 100k documents an hour without breaking a sweat.

On 8/12/10, Rebecca Watson bec.wat...@gmail.com wrote: Hi, When indexing large amounts of data I hit a problem whereby Solr becomes unresponsive and doesn't recover (even when left overnight!). ...
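For reference, here is how the GC flags listed above might be combined on a single Solr launch command. This is a sketch only -- the heap size and the Jetty start.jar launcher are assumptions, not taken from the posts above:

java -Xms6g -Xmx6g \
     -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
     -XX:+CMSPermGenSweepingEnabled \
     -XX:NewSize=2g -XX:MaxNewSize=2g \
     -XX:SurvivorRatio=8 \
     -XX:CMSInitiatingOccupancyFraction=70 \
     -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
     -jar start.jar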
Re: how to update solr to older 1.5 builds instead of to trunk
On Thu, Aug 12, 2010 at 12:24 PM, solr-user solr-u...@hotmail.com wrote: Thanks Yonik but http://svn.apache.org/repos/asf/lucene/dev/branches/branch_3x/solr/CHANGES.txt says that the lucene index has changed Right - but it will be able to read your older index. Do you need Solr 1.4 to be able to read the new index once you upgrade? -Yonik http://www.lucidimagination.com
edismax pf2 and ps
Short summary: Is there any way I can specify that I want a lot of phrase slop for the pf parameter, but none at all for the pf2 parameter? I find the 'pf' parameter with a pretty large 'ps' to do a very nice job for providing a modest boost to many documents that are quite well related to many queries in my system. In contrast, I find the 'pf2' parameter with zero 'ps' does extremely well at providing a high boost to documents that are often exactly what someone's searching for. Is there any way I can get both effects? Edismax's pf2 parameter is really nice for boosting exact phrases in queries like 'black jacket red cap white shoes'. But as soon as even a little phrase slop (ps) is added, it seems like it starts boosting documents with red jackets and white caps just as much as those with black jackets and red caps. My gut feeling is that if I could have pf with a large phrase slop and the pf2 with zero phrase slop, it'd give me better overall results than any single phrase slop setting that gets applied to both. Is there any good way for me to test that? Thanks, Ron
Re: how to update solr to older 1.5 builds instead of to trunk
No, once upgraded I wouldn't need to have an older Solr read the indexes. I misunderstood the note. Thx
RE: index pdf files
Does anyone know if I need to define fields in schema.xml for indexing pdf files? If I need to, please tell me how I can do it. I defined fields in schema.xml and created a data-configuration file using XPath for xml files. Would you please tell me if I need to do the same for pdf files, and how? Thanks so much for your help as always!

-Original Message- From: Marco Martinez [mailto:mmarti...@paradigmatecnologico.com] Sent: Thursday, August 12, 2010 11:45 AM To: solr-user@lucene.apache.org Subject: Re: index pdf files To help you we need the description of your fields in your schema.xml and the query that you do when you search only a single word. Marco Martínez Bautista http://www.paradigmatecnologico.com Avenida de Europa, 26. Ática 5. 3ª Planta 28224 Pozuelo de Alarcón Tel.: 91 352 59 42

2010/8/12 Ma, Xiaohui (NIH/NLM/LHC) [C] xiao...@mail.nlm.nih.gov I wrote a simple java program to import a pdf file. I can get a result when I search *:* from the admin page. I get nothing if I search a word. I wonder if I did something wrong or missed setting something. Here is part of the result I get when doing a *:* search:

<doc>
  <arr name="attr_Author"><str>Hristovski D</str></arr>
  <arr name="attr_Content-Type"><str>application/pdf</str></arr>
  <arr name="attr_Keywords"><str>microarray analysis, literature-based discovery, semantic predications, natural language processing</str></arr>
  <arr name="attr_Last-Modified"><str>Thu Aug 12 10:58:37 EDT 2010</str></arr>
  <arr name="attr_content"><str>Combining Semantic Relations and DNA Microarray Data for Novel Hypotheses Generation Combining Semantic Relations and DNA Microarray Data for Novel Hypotheses Generation Dimitar Hristovski, PhD,1 Andrej Kastrin,2...</str></arr>
</doc>

Please help me out if anyone has experience with pdf files. I really appreciate it! Thanks so much,
RE: Improve Query Time For Large Index
Hi Peter,

If hits aren't showing up, and you aren't getting any queryResultCache hits even with the exact query being repeated, something is very wrong. I'd suggest first getting the query result cache working, and then moving on to look at other possible bottlenecks. What are your settings for queryResultWindowSize and queryResultMaxDocsCached? Following up on Robert's point, you might also try to run a few queries in the admin interface with the debug flag on to see if the query parser is creating phrase queries (assuming you have queries like http://foo.bar.baz). The debug/explain will indicate whether the parsed query is a PhraseQuery.

Tom

-Original Message- From: Peter Karich [mailto:peat...@yahoo.de] Sent: Thursday, August 12, 2010 5:36 AM To: solr-user@lucene.apache.org Subject: Re: Improve Query Time For Large Index

Hi Tom, I tried again with:

<queryResultCache class="solr.LRUCache" size="1" initialSize="1" autowarmCount="1"/>

and even now the hit ratio is still 0. What could be wrong with my setup? ('free -m' shows that the cache has over 2 GB free.) Regards, Peter.

> Hi Peter, Can you give a few more examples of slow queries? Are they phrase queries? Boolean queries? Prefix or wildcard queries? If one-word queries are your slow queries, then CommonGrams won't help. CommonGrams will only help with phrase queries. How are you using termvectors? That may be slowing things down. I don't have experience with termvectors, so someone else on the list might speak to that. When you say the query time for common terms stays slow, do you mean if you re-issue the exact query, the second query is not faster? That seems very strange. You might restart Solr, and send a first query (the first query always takes a relatively long time). Then pick one of your slow queries and send it 2 times. The second time you send the query it should be much faster due to the Solr caches, and you should be able to see the cache hit in the Solr admin panel. If you send the exact query a second time (without enough intervening queries to evict data from the cache), the Solr queryResultCache should get hit and you should see a response time in the 0.01-5 millisecond range. What settings are you using for your Solr caches? How much memory is on the machine? If your bottleneck is disk I/O for frequent terms, then you want to make sure you have enough memory for the OS disk cache. I assume that "http" is not in your stopwords. CommonGrams was committed and is in Solr 1.4. If you decide to use CommonGrams you definitely need to re-index, and you also need to use both the index-time filter and the query-time filter. Your index will be larger.

<fieldType name="foo" ...>
  <analyzer type="index">
    <filter class="solr.CommonGramsFilterFactory" words="new400common.txt"/>
  </analyzer>
  <analyzer type="query">
    <filter class="solr.CommonGramsQueryFilterFactory" words="new400common.txt"/>
  </analyzer>
</fieldType>

Tom

-Original Message- From: Peter Karich [mailto:peat...@yahoo.de] Sent: Tuesday, August 10, 2010 3:32 PM To: solr-user@lucene.apache.org Subject: Re: Improve Query Time For Large Index

Hi Tom, my index is around 3GB large and I am using 2GB RAM for the JVM, although some more is available. If I look at the RAM usage while a slow query runs (via jvisualvm) I see that only 750MB of the JVM RAM is used.

> Can you give us some examples of the slow queries?

For example, the empty query solr/select?q= takes very long, or solr/select?q=http where 'http' is the most common term.

> Are you using stop words?

Yes, a lot.
I stored them into stopwords.txt. http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2 -- this looks interesting. I read through https://issues.apache.org/jira/browse/SOLR-908 and it seems to be in 1.4. I only need to enable it via:

<filter class="solr.CommonGramsFilterFactory" ignoreCase="true" words="stopwords.txt"/>

right? Do I need to reindex? Regards, Peter.

> Hi Peter, A few more details about your setup would help list members to answer your questions. How large is your index? How much memory is on the machine and how much is allocated to the JVM? Besides the Solr caches, Solr and Lucene depend on the operating system's disk caching for caching of postings lists. So you need to leave some memory for the OS. On the other hand, if you are optimizing and refreshing every 10-15 minutes, that will invalidate all the caches, since an optimized index is essentially a set of new files. Can you give us some examples of the slow queries? Are you using stop words? If your slow queries are phrase queries, then you might try either adding the most frequent terms in your index to the stopwords list, or try CommonGrams and add them to the common words list. (Details on
Re: index pdf files
Maybe this helps: http://www.packtpub.com/article/indexing-data-solr-1.4-enterprise-search-server-2

Cheers, Stefan

Am 12.08.2010 19:45, schrieb Ma, Xiaohui (NIH/NLM/LHC) [C]: Does anyone know if I need to define fields in schema.xml for indexing pdf files? ...

--
***
Stefan Moises
Senior Softwareentwickler
shoptimax GmbH
Guntherstraße 45 a
90461 Nürnberg
Amtsgericht Nürnberg HRB 21703
GF Friedrich Schreieck
Tel.: 0911/25566-25
Fax: 0911/25566-29
moi...@shoptimax.de
http://www.shoptimax.de
***
Re: Solr Doc Lucene Doc !?
Are you just trying to learn the tiny details of how Solr and DIH work? Is this just an intellectual curiosity? Or are you having some specific problem that you are trying to solve? If you have a problem, could you describe the symptoms of the problem? I am using Solr, DIH, and several other related technologies and have never needed to know the difference between a SolrDocument and a LuceneDocument or how the UpdateHandler chains. So I'm curious about what your ultimate goal is with these questions.
Results from More than One Core?
Hello users... I tried to get results from more than one core, but I don't know how. Maybe you have an idea? I need it in PHP. King
RE: index pdf files
Thanks so much for your help! I defined a dynamic field in schema.xml as follows:

<dynamicField name="metadata_*" type="string" indexed="true" stored="true" multiValued="false"/>

But I wonder what I should put for <uniqueKey></uniqueKey>. I really appreciate your help!

-Original Message- From: Stefan Moises [mailto:moi...@shoptimax.de] Sent: Thursday, August 12, 2010 1:58 PM To: solr-user@lucene.apache.org Subject: Re: index pdf files Maybe this helps: http://www.packtpub.com/article/indexing-data-solr-1.4-enterprise-search-server-2 Cheers, Stefan ...
Re: Solr Doc Lucene Doc !?
I am writing a little thesis about this, and I need to know how Solr uses Lucene - in which way, for example when using DIH and searching. So, for my better understanding .. ;-)
RE: index pdf files
Thanks so much. I got it to work now. I really appreciate your help! Xiaohui

-Original Message- From: Stefan Moises [mailto:moi...@shoptimax.de] Sent: Thursday, August 12, 2010 1:58 PM To: solr-user@lucene.apache.org Subject: Re: index pdf files Maybe this helps: http://www.packtpub.com/article/indexing-data-solr-1.4-enterprise-search-server-2 Cheers, Stefan ...
possible bug in sorting by Function?
I was looking at the ability to sort by function that was added to Solr. For the most part it seems to work. However, Solr doesn't seem to like sorting by certain functions. For example, this sum works:

http://10.0.11.54:8994/solr/select?q=*:*&sort=sum(1,Latitude,Longitude,sum(Latitude,Longitude)) asc

but this hsin doesn't work:

http://10.0.11.54:8994/solr/select?q=*:*&sort=sum(3959,rad(47.544594),rad(-122.38723),rad(Latitude),rad(Longitude))

and gives me a "Must declare sort field or function" error, pointing to a line in QueryParsing.java. Note that I did apply the SOLR-1297-2.patch supplied by Koji Sekiguchi, but it didn't seem to help. I am using solr 903398 2010-01-26 20:21:09Z. Any suggestions appreciated.
Re: possible bug in sorting by Function?
Small typo in my last email: the second sum should have been hsin, but I notice that the problem also occurs when I leave it as sum.
Field getting tokenized prior to charFilter on select query
I'm attempting to make use of PatternReplaceCharFilterFactory, but am running into issues on both 1.4.1 (I ported it) and on nightly (4.0-2010-07-27). It seems that on a real query the charFilter isn't executed prior to the tokenizer. I modified the example configuration included in the distribution with the following fieldType in schema.xml and mapped a new field to it.

<!-- Field definition for name text field -->
<fieldtype name="nameText" class="solr.TextField">
  <analyzer>
    <!-- Replace (char & char) or (char and char) with (char&char) -->
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(.*?)(\b(\w) (&amp;|and) (\w))(.*?)" replacement="$1$3&amp;$5$6"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory"/>
  </analyzer>
</fieldtype>

<field name="name" type="nameText" indexed="true" stored="true" required="false" omitNorms="true"/>

I validated that the regex works properly outside of Solr using just Java. The regex attempts to normalize single word characters around an '&' into something consistent for searching. For example, it will turn "A & B Company" into "A&B Company". The user can then search on "A&B", "A and B", or "A & B" and the proper result will be located. However, when I import a document with "A & B Company" I can't ever locate it with an "A & B" query. It can be located with an "A&B" query. When I run analysis.jsp it works properly and it will match using any of the combinations. So from this I concluded that it was being indexed properly, but for some reason the query wasn't applying the regex properly. I hooked up a debugger and could see a difference in how the analyzer was applying the charFilter and how the query was applying the charFilter. When the analyzer invoked PatternReplaceCharFilterFactory.create(CharStream) the entire field was provided in a single call. When the query invoked PatternReplaceCharFilterFactory.create(CharStream) it invoked it 3 times with 3 separate tokens (A, &, B). Because of this the regex won't ever locate the full string in the field. I'm using the following encoded URLs to perform the query. This works:

http://localhost:8983/solr/select?q=name:%28a%26b%29

But this doesn't:

http://localhost:8983/solr/select?q=name:%28a+%26+b%29

Why is the query parser tokenizing the name field prior to the charFilter getting a chance to perform processing?
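One workaround that may be worth testing, based on the phrase-analysis behaviour described elsewhere in this digest (an assumption, not a confirmed fix from this thread): quote the value, so the query parser hands the whole string to the analyzer in one call and the charFilter sees the full pattern:

http://localhost:8983/solr/select?q=name:%22a+%26+b%22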
XSL import/include relative to app server home directory...
Hello, I'm customizing my XML response with the XSLTResponseWriter using wt=xslt&tr=transform.xsl. Because I have a few use-cases to support, I wanted to break up the common bits and import/include them from multiple top-level xslt files, but it appears that the base directory of the transform is the directory the application was launched in. Inside my transform.xsl I have this, for example:

<xsl:import href="common/image-links.xsl"/>

which results in stack traces such as (copied only the relevant bits):

Caused by: java.io.IOException: Unable to initialize Templates 'transform.xsl'
Caused by: javax.xml.transform.TransformerException: Had IO Exception with stylesheet file: common/image-links.xsl
Caused by: java.io.FileNotFoundException: C:\dev\jboss-5.1.0.GA\bin\common\image-links.xsl

This appears to be caused by a lack of a provided systemId on the StreamSource of the xslt document I've requested. I've copied the relevant lines that I believe are the root cause of the problem here for reference.

TransformFactory.getTemplates(), lines 105-106:

final InputStream xsltStream = loader.openResource("xslt/" + filename);
result = tFactory.newTemplates(new StreamSource(xsltStream));

The loader variable is an instance of Solr's ResourceLoader, which has no ability to provide the systemId to set on the StreamSource to make relative references work in the xslt. It seems like we need something along the lines of:

String systemId = loader.getResourceURL().toString() + "xslt/";
result = tFactory.newTemplates(new StreamSource(xsltStream, systemId));

I looked for a bug/patch and didn't see anything. Please let me know if I missed the patch or if there is another way to solve this problem (aside from not using xsl:include or xsl:import). Thanks in advance, Brian

For reference...
http://onjava.com/pub/a/onjava/excerpt/java_xslt_ch5/index.html?page=5
https://jira.springframework.org/secure/attachment/10163/AbstractXsltView.patch (similar bug that was in Spring)
Require some advice
Hi, I am new to text search and mining and have been doing research on the different available products. My application requires reading an SMS message (unstructured) and finding entities such as person name, area, zip, city, and skills associated with the person. The SMS would be free text. The parsed data would be stored in a database and used by Solr to display results. An SMS message could be in the following form:

John Mayer Mumbai 411004 Juhu, car driver, also capable of body guard

We need to interpret it in the following manner:

first name - John
last name - Mayer
city - Mumbai
zip - 411004
area - Juhu
skills - car driver, body guard

1. Is Solr capable enough to handle this application, considering that the SMS message would be unstructured?
2. How is Solr/Lucene compared to other tools such as UIMA, GATE, NER (Stanford University), LingPipe?
3. Is Solr only text search, or can it be used for information extraction?
4. Is it recommended to use Solr with other products such as UIMA and GATE?

There are companies that specialize in making meaning out of unstructured SMS messages. Do we have something similar in the open source world? Can we extend Solr for the same purpose? Your reply would be appreciated. Thanking you. Regards, Pavan
RE: index pdf files
I got the following error when I index some pdf files. I wonder if anyone has seen this issue before and how to fix it. Thanks so much in advance!

***
<html><head><meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/><title>Error 500 </title></head><body><h2>HTTP ERROR: 500</h2><pre>org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.pdfpar...@44ffb2

org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.pdfpar...@44ffb2
        at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:211)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
        at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
        at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
        at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
        at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
        at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
        at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
        at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
        at org.mortbay.jetty.Server.handle(Server.java:285)
        at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
        at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:202)
        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
        at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
        at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.pdfpar...@44ffb2
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:121)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:105)
        at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:190)
***

-Original Message- From: Stefan Moises [mailto:moi...@shoptimax.de] Sent: Thursday, August 12, 2010 1:58 PM To: solr-user@lucene.apache.org Subject: Re: index pdf files Maybe this helps: http://www.packtpub.com/article/indexing-data-solr-1.4-enterprise-search-server-2 Cheers, Stefan ...
Free Webinar: Findability: Designing the Search Experience
Here's perhaps the coolest webinar we've done to date, IMO :) I attended Tyler's presentation at Lucene EuroCon* and thoroughly enjoyed it. Search UI/UX is a fascinating topic to me, and really important to do well for the applications most of us are building. I'm pleased to pass along the blurb below. See you there! Erik

* http://lucene-eurocon.org/sessions-track2-day2.html#3

Lucid Imagination presents a free webinar: Wednesday, August 18, 2010, 10:00 AM PST / 1:00 PM EST / 19:00 CET. Sign up at http://www.eventsvc.com/lucidimagination/081810?trk=ap

You don't need billions of dollars or users to build a user-friendly search application. In fact, studies of how and why people search have revealed a set of principles that can result in happy users who find what they're seeking with as little friction as possible -- and help you build a better, more successful search application. Join special guest Tyler Tate, user experience designer at UK-based TwigKit Search, for a high-level discussion of key user interface strategies for search that can be leveraged with Lucene and Solr. The presentation covers:

* Ten things to know about designing the search experience
* When to assume users know what they’re looking for – and when not to
* Navigation/discovery techniques, such as faceted navigation, tag clouds, histograms and more
* Practical considerations in leveraging suggestions into search interactions

About the presenter: Tyler Tate is co-founder of TwigKit, a UK-based company focused on building truly usable interfaces for search. Tyler has led user experience design for enterprise applications from CMS to CRM, and is the creator of the popular 1KB CSS Grid. Tyler also organizes a monthly Enterprise Search Meetup in London, and blogs at blog.twigkit.com.

- Join the Revolution! Don't miss Lucene Revolution, the Lucene & Solr User Conference. Boston | October 7-8 2010 http://lucenerevolution.org -

This webinar is sponsored by Lucid Imagination, the commercial entity exclusively dedicated to Apache Lucene/Solr open source search technology. Our solutions can help you develop and deploy search solutions with confidence: SLA-based support subscriptions, professional training, and best practices consulting, along with value-add software, free documentation, and certified distributions of Lucene and Solr. Apache Lucene and Apache Solr are trademarks of the Apache Software Foundation.
Re: possible bug in sorting by Function?
The problem could be related to some oddity in sum()?? Some more examples (note: Latitude and Longitude are fields of type="double").

Works:
http://10.0.11.54:8994/solr/select?q=*:*&sort=sum(sum(1,1.0))%20asc
http://10.0.11.54:8994/solr/select?q=*:*&sort=sum(Latitude,Latitude)%20asc
http://10.0.11.54:8994/solr/select?q=*:*&sort=sum(rad(Latitude))%20asc
http://10.0.11.54:8994/solr/select?q=*:*&sort=sum(sum(Latitude,1))%20asc
http://10.0.11.54:8994/solr/select?q=*:*&sort=sum(sum(Latitude,1.0))%20asc

Fails:
http://10.0.11.54:8994/solr/select?q=*:*&sort=sum(sum(Latitude,1),sum(Latitude,1))%20asc
http://10.0.11.54:8994/solr/select?q=*:*&sort=sum(sum(Latitude,1.0),sum(Latitude,1.0))%20asc
http://10.0.11.54:8994/solr/select?q=*:*&sort=sum(rad(Latitude),rad(Latitude))%20asc
RE: Require some advice
Solr is a search engine, not an entity extraction tool. While there are some decent open source entity extraction tools, they are focused on processing sentences and paragraphs. The structural differences in text messages mean you'd need to do a fair amount of work to get decent entity extraction. That said, you may want to look into simple word/phrase matching if your domain is sufficiently small. Use a regex to extract the ZIP, and use dictionaries to extract city/area, skills, and names. Much simpler and cheaper.

-Original Message- From: Pavan Gupta [mailto:pavan@gmail.com] Sent: Thursday, August 12, 2010 2:58 PM To: solr-user@lucene.apache.org Subject: Require some advice Hi, I am new to text search and mining and have been doing research on the different available products. ...
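A minimal sketch of the word/phrase matching idea suggested above. The regex, the dictionary contents, and the sample message are illustrative assumptions, not part of the original advice:

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SmsExtractor {
    // Six-digit PIN/ZIP pattern -- an assumption chosen to fit the example message.
    private static final Pattern ZIP = Pattern.compile("\\b(\\d{6})\\b");
    // Tiny illustrative dictionaries; a real system would load much larger lists.
    private static final Set<String> CITIES = new HashSet<String>(Arrays.asList("mumbai", "pune", "delhi"));
    private static final Set<String> SKILLS = new HashSet<String>(Arrays.asList("car driver", "body guard", "cook"));

    public static void main(String[] args) {
        String sms = "John Mayer Mumbai 411004 Juhu, car driver, also capable of body guard";
        String lower = sms.toLowerCase();

        // Extract the ZIP with the regex.
        Matcher m = ZIP.matcher(sms);
        if (m.find()) System.out.println("zip: " + m.group(1));

        // Extract city and skills by dictionary lookup.
        for (String city : CITIES)
            if (lower.contains(city)) System.out.println("city: " + city);
        for (String skill : SKILLS)
            if (lower.contains(skill)) System.out.println("skill: " + skill);
    }
}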
SOLR-788 - disributed More Like This
I tried some time ago to use SOLR-788. Ultimately I was able to get both patch versions to apply (separately), but neither worked. The suggestion I received when I commented on the issue was to download the specific release mentioned in the patch and then update, but the patch was created before the merge with Lucene, so I have no idea how to go about that. Without a much better understanding of Solr internals and a bunch more time to learn Java, there's no way that I can work on it myself. Is there anyone who has the time and inclination to get distributed MLT working with branch_3x? A further goal would be to have it actually committed before release. Thanks, Shawn
Re: possible bug in sorting by Function?
Issue resolved. The problem was that solr.war was silently not being overwritten by the new version. I will try to spend more time debugging before posting.
Re: General questions about distributed solr shards
On 8/11/2010 3:27 PM, JohnRodey wrote: 1) Is there any information on preferred maximum sizes for a single solr index. I've read some people say 10 million, some say 80 million, etc... Is there any official recommendation or has anyone experimented with large datasets into the tens of billions? 2) Is there any down side to running multiple solr shard instances on a single machine rather than one shard instance with a larger index per machine? I would think that having 5 instances with 1/5 the index would return results approx 5 times faster. 3) Say you have a solr configuration with multiple shards. If you attempt to query while one of the shards is down you will receive a HTTP 500 on the client due to a connection refused on the server. Is there a way to tell the server to ignore this and return as many results as possible? In other words if you have 100 shards, it is possible that occasionally a process may die, but I would still like to return results from the active shards. 1) It highly depends on what's in your index. I'll let someone more qualified address this question in more detail. 2) Distributed search adds overhead. It has to query the individual shards, send additional requests to gather the matching records, and then assemble the results. If you create enough shards that you can fit all (or most) of each index in whatever RAM is left for the OS disk cache, you'll see a VERY significant boost in search speed by using shards. If 3) There are a couple of patches that address this, but in the end, you'll be better served by setting up a replicated pair and using a load balancer. I've got a distributed index with two machines per shard, the master and the slave. The load balancer checks the ping status URL every 5 seconds to see whether each machine is up. If one goes down, it is removed from the load balancer and everything keeps working. Each of my shards is about 12.5GB in size and the VMs that access the data have 9GB total RAM. I wish I had more memory!
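For reference, the kind of distributed query being described looks like this (host names are placeholders):

http://host1:8983/solr/select?q=foo&shards=host1:8983/solr,host2:8983/solr

and the health check mentioned above is the standard ping handler, e.g. http://host1:8983/solr/admin/ping.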
Re: clustering component
Hey thanks Stanislaw! I'm going to try this against the current trunk tonight and see what happens. Matt On Wed, Jul 28, 2010 at 8:41 AM, Stanislaw Osinski stanislaw.osin...@carrotsearch.com wrote: The patch should also work with trunk, but I haven't verified it yet. I've just added a patch against solr trunk to https://issues.apache.org/jira/browse/SOLR-1804. S.
Hierarchical faceting
Hey all, I am doing a search on hierarchical data, and I have a hard time getting my head around the following problem. I want a result as follows, in one single query only:

USA (3)
  California (2)
  Arizona (1)
Europe (4)
  Norway (3)
    Oslo (3)
  Sweden (1)

How it looks in the XML/JSON response is not really important; this is more a presentation issue. I guess I could store the values USA, USA/California, Europe/Norway/Oslo as strings for each document, and do some JavaScript-ing to show the hierarchies appropriately. When a specific item in the facet is selected, for example Norway, Solr could be queried with a filter query on Europe/Norway*? Does anyone have some experience they could please share with me? I have tried out SOLR-64, and it gives me the results I look for. However, I do not have the opportunity to use a patch in the production environment ...

-- Thanks, Mats Bolstad
Re: Phrase search
: I'm trying to match "Apple 2" but not "Apple2" using phrase search, this is why I have it quoted.
: I was under the impression --when I use phrase search-- all the
: analyzer magic would not apply, but it is!!! Otherwise, how would I
: search for a phrase?!

well .. yes ... even with phrase searches your query is analyzed. the only difference is that with a quoted phrase search, the entire phrase is analyzed at one time -- when the input isn't quoted, the whitespace is evaluated by the QueryParser as markup just like quotes and +/-, etc... (unless it's escaped) and the individual words are analyzed independently.

: Using Google, when I search for "Windows 7" (with quotes), unlike Solr,
: I don't get hits on "Window7". I want to use catenateNumbers=1 which
: I want it to take effect on other searches but not phrase searches. Is
: this possible ?

you need to elaborate more on what you do and don't want to match -- so far you've given one example of a query you want to execute, and a document you *don't* want to match that query, but not an example of what types of documents you *do* want to match that query -- you also haven't given examples of queries that you *do* want that example document to match.

i suspect that catenateNumbers=1 isn't actually your problem ... it sounds like you don't actually want WordDelimiterFilter doing the split at index time at all. Forget the phrase queries for a second: the question to ask yourself is: when you index a document containing "Windows7", do you want a search for the word "Windows" to match that document? If the answer is no, then you probably don't want WordDelimiterFilter at all.

-Hoss
Re: Solr query result cache size and expire property
: please help - how can I calculate queryresultcache size (how much RAM should
: be dedicated for that). I have 1,5 index size, 4 mio docs.
: QueryResultWindowSize is 20.
: Could I use expire property on the documents in this cache?

There is no expire property. Items are automatically removed from the cache if the cache gets full, and the entire cache is thrown out when a new searcher is loaded (that's the only time it would make sense to expire anything).

honestly: trial and error is typically the best bet for sizing your queryResultsCache ... the size of your index is much less relevant than the types of queries you get. If you typically only have 200 unique queries over and over again, and no one ever does any other queries, then any number above 200 is going to be essentially the same. if you have 200 queries that get hit a *lot* and 100 other queries that get hit once or twice ever ... then something ~250 is probably a good idea ... any more is probably just a waste of RAM, any less is probably a waste of CPU.

-Hoss
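For reference, the cache being discussed is sized in solrconfig.xml along these lines (the numbers here are purely illustrative, not a recommendation):

<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="128"/>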
Re: How to extend the BinaryResponseWriter imposed by Solrj
: I'm trying to extend the writer used by solrj
: (org.apache.solr.response.BinaryResponseWriter), i have declared it in
...
: I see that it is initialized, but when i try to set the 'wt' param to
: 'myWriter'
:
: solrQuery.setParam("wt","myWriter"), nothing happens, it's still using the
: 'javabin' writer.

I'm not certain, but i don't think SolrJ respects a wt param set by the caller .. i think the ResponseParser dictates what wt param is sent to the server -- that's why javabin is the default and calling server.setParser(new XMLResponseParser()) causes XML to be sent by the server (even if you don't set wt=xml in your SolrParams).

If you've customized the BinaryResponseWriter then presumably you've had to write a custom ResponseParser as well, correct? (otherwise how would you take advantage of your customizations to the format) ... so take a look at the existing ResponseParsers to see how they force the wt param and do the same thing in your custom ResponseParser.

(Note: this is all mostly speculation on my part)

-Hoss
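A minimal sketch of what is being described, assuming the customized writer still emits javabin-compatible output (the class and writer name here are hypothetical):

import org.apache.solr.client.solrj.impl.BinaryResponseParser;

// getWriterType() is what determines the wt param SolrJ sends,
// so overriding it forces wt=myWriter on every request made through
// a server configured with server.setParser(new MyWriterResponseParser()).
public class MyWriterResponseParser extends BinaryResponseParser {
  @Override
  public String getWriterType() {
    return "myWriter"; // must match the writer name registered in solrconfig.xml
  }
}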
can searcher.getReader().getFieldNames() return only stored fields?
Collection<String> myFL = searcher.getReader().getFieldNames(IndexReader.FieldOption.ALL);

will return all fields in the schema (i.e. indexed, stored, and indexed+stored).

Collection<String> myFL = searcher.getReader().getFieldNames(IndexReader.FieldOption.INDEXED);

likely returns all fields that are indexed (I haven't tried). However, both of these can/will return fields that are not stored. Is there a parameter that I can use to only return fields that are stored? There does not seem to be an IndexReader.FieldOption.STORED, and I can't tell if any of the others might work. Any info helpful. Thx
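One alternative, sketched here under the assumption that the explicitly declared schema fields are sufficient (dynamic fields would need separate handling): ask the IndexSchema instead of the IndexReader, since each SchemaField knows whether it is stored. This reuses the same searcher variable as above:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import org.apache.solr.schema.IndexSchema;
import org.apache.solr.schema.SchemaField;

// Collect the names of all declared fields that are stored.
IndexSchema schema = searcher.getSchema();
List<String> storedFields = new ArrayList<String>();
for (Map.Entry<String, SchemaField> e : schema.getFields().entrySet()) {
  if (e.getValue().stored()) {
    storedFields.add(e.getKey());
  }
}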
Re: Data Import Handler Query
Thanks Alexey. That solved the issue. I am now able to get all images information in the index.

On Thu, Aug 12, 2010 at 12:47 AM, Alexey Serba ase...@gmail.com wrote: Try to define the image solr fields - db columns mapping explicitly in the image entity, i.e.

<entity name="image" query="select filename, filepath, type from images where story_id='${story.story_id}'">
  <field column="filename" name="filename" />
  <field column="filepath" name="filepath" />
  <field column="type" name="type" />
</entity>

See http://www.lucidimagination.com/search/document/c8f2ed065ee75651/dih_and_multivariable_fields_problems

On Thu, Aug 12, 2010 at 2:30 AM, Manali Joshi joshi.man...@gmail.com wrote: I tried making the schema fields that get the image data multiValued="true", but it still gets only the first image data. It doesn't have information about all the images.

On Wed, Aug 11, 2010 at 1:15 PM, kenf_nc ken.fos...@realestate.com wrote: It may not be the data config. Do you have the fields in the schema.xml that the image data is going to, set to be multiValued="true"? Although, I would think the last image would be stored, not the first, but I haven't really tested this.
Re: Duplicate a core
: Is it possible to duplicate a core? I want to have one core contain only
: documents within a certain date range (ex: 3 days old), and one core with
: all documents that have ever been in the first core. The small core is then
: replicated to other servers which do real-time processing on it, but the
: archive core exists for longer term searching.

It's not something i've ever dealt with, but if i were going to pursue it i would investigate whether this works...

1) have three+ solr instances: "master", "archive" and one or more "query" machines
2) index everything to a core named "recent" on server "master"
3) configure the query machines to replicate "recent" from "master"
4) configure the "archive" machine to replicate "recent" from "master"
5) configure the "archive" machine to also have an "all" core
6) on some timed basis:
   - delete docs from "recent" on "master" that are *older* than X
   - delete docs from "recent" on "archive" that are *newer* than X
   - use the index merge command on "archive" to merge the "recent" core into the "all" core

...i'm pretty sure that merge command will require that you shut down both cores on "archive" during the merge, but that's a good idea anyway. if you need continuous searching of the "all" core to be available, then just set up that core on "archive" as a repeater and have some archive-query machines slaving off of it.

that should work.

-Hoss
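For step 6, the merge command being referred to is the CoreAdmin mergeindexes action. A sketch of the call, with the host, core name, and path as placeholders rather than values from this thread:

http://archive:8983/solr/admin/cores?action=mergeindexes&core=all&indexDir=/path/to/recent/data/index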
SOLR Query
Hi there, I have a problem querying SOLR for a specific field with a query string that contains spaces. I added the following lines in the schema.xml to add my own defined fields. The fields are: ap_name, ap_address, ap_dob, ap_desg, ap_sec. Since all these fields begin with ap_, I included the following dynamicField:

<dynamicField name="ap_*" type="text" indexed="true" stored="true"/>

I included this line to make a query against all fields instead of a specific field:

<copyField source="ap_*" dest="text"/>

I added the following document to my index:

<add>
  <doc>
    <field name="id">1</field>
    <field name="ap_name">Tom Cruise</field>
    <field name="ap_address">San Fransisco</field>
  </doc>
</add>

1. When I query q=Tom+Cruise, I should get the above document since it is available in "text", which is my default query field. [Works as expected]
2. When I query q=ap_address:Tom, I should not get the above document since Tom is not available in ap_address. [Works as expected]
3. When I query q=ap_address:Tom+Cruise, I should not get the above document BUT I GET IT. [Doesn't work as expected]

Could anyone please explain what mistake I am making? Thanks a lot, appreciate any help! Moiz
Re: analysis tool vs. reality
: Furthermore, I would like to add its not just the highlight matches
: functionality that is horribly broken here, but the output of the analysis
: itself is misleading.
:
: lets say i take 'textTight' from the example, and add the following synonym:
:
: this is broken = broke
:
: the query time analysis is wrong, as it clearly shows synonymfilter
: collapsing "this is broken" to "broke", but in reality with the qp for that
: field, you are gonna get 3 separate tokenstreams and this will never
: actually happen (because the qp will divide it up on whitespace first)
:
: So really the output from 'Query Analyzer' is completely bogus.

analysis.jsp is only intended to explain *analysis* ... it accurately tells you what the <analyzer type="query"> ... for the specified field (or fieldType) is going to produce given a hunk of text. That is what it does, that is all that it does, that is all it has ever done, and all it has ever purported to do.

You say it's bogus because the qp will divide on whitespace first -- but you're assuming you know what query parser will be used ... the "field" query parser (to name one) doesn't split on whitespace first. That's my point: analysis.jsp doesn't make any assumptions about what query parser *might* be used, it just tells you what your analyzers do with strings.

Saying the output of analysis.jsp is bogus because it doesn't take into account query parsing is like saying the output of stats.jsp is bogus because those are only the stats of the local solr instance on that machine, and it doesn't do distributed stats -- yeah, that would be nice to have, but stats.jsp never implies that's what it's giving you.

If there are ways we can make the purpose of analysis.jsp more obvious, and less misleading for people who don't understand the distinction between query parsing and analysis, then i am all for it. if you really believe getting rid of the highlight check box is going to help, then fine -- but i have yet to see any evidence that people who don't understand the relationship between query parsing and analysis are confused by the blue boxes. what people seem to be confused by is when they see the same tokens ultimately produced by both the index analyzer and the query analyzer -- it doesn't matter if those tokens are in blue or not; if they see that the tokens in the index analyzer output are a superset of the tokens in the query analyzer output, then they tend to assume that means searching for the string in the query box will match documents containing the string in the index text box. Getting rid of the blue table cell is just going to make it harder to notice matching tokens in the output -- not reduce the confusion when those matching tokens exist in the output.

My question is: What can we do to make it more clear what the *purpose* of analysis.jsp is? is there verbiage we can add to the page to make it more obvious?

NOTE: I'm not just asking Robert, this is a question for the solr-user community as a whole. I *know* what analysis.jsp is for; i've never been confused -- for people who have been confused in the past (or are still confused), please help us understand what type of changes we could make to the output of analysis.jsp to make its functionality more understandable.

-Hoss
Re: analysis tool vs. reality
On Thu, Aug 12, 2010 at 7:55 PM, Chris Hostetter hossman_luc...@fucit.orgwrote: You say it's bogus because the qp will divide on whitesapce first -- but you're assuming you know what query parser will be used ... the field query parser (to name one) doesn't split on whitespace first. That's my point: analysis.jsp doesn't make any assumptions about what query parser *might* be used, it just tells you what your analyzers do with strings. you're right, we should just fix the bug that the queryparser tokenizes on whitespace first. then analysis.jsp will be significantly less confusing. -- Robert Muir rcm...@gmail.com
Re: Solrj ContentStreamUpdateRequest Slow
: It returns in around a second. When I execute the attached code it takes just : over three minutes. The optimal for me would be able get closer to the : performance I'm seeing with curl using Solrj. I think your problem may be that StreamingUpdateSolrServer buffers up commands and sends them in batches in a background thread. if you want to send individual updates in real time (and time them) you should just use CommonsHttpSolrServer -Hoss
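A minimal sketch of the CommonsHttpSolrServer approach being suggested (the URL, file name, and literal.id param are placeholders for whatever the attached code was doing; exception handling is omitted):

import java.io.File;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

// Build one update request and send it synchronously -- no background
// buffering, so timing the request() call is meaningful.
ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
req.addFile(new File("document.pdf"));
req.setParam("literal.id", "doc1");
req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
server.request(req);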
Re: analysis tool vs. reality
: You say it's bogus because the qp will divide on whitespace first -- but
: you're assuming you know what query parser will be used ... the field
: query parser (to name one) doesn't split on whitespace first. That's my
: point: analysis.jsp doesn't make any assumptions about what query parser
: *might* be used, it just tells you what your analyzers do with strings.
:
: you're right, we should just fix the bug that the queryparser tokenizes on
: whitespace first. then analysis.jsp will be significantly less confusing.

dude .. not trying to get into a holy war here.

even if you change the Lucene QueryParser so that whitespace isn't a meta character, it doesn't affect the underlying issue: analysis.jsp is agnostic about QueryParsers. Some other QParser the user uses might have other special behavior, and if people don't understand the distinction between query parsing and analysis they can still be confused -- hell, even if the only QParser anyone ever uses is the lucene QParser, and even if you get the QueryParser changed so that whitespace isn't a metacharacter, we are still going to be left with the fact that *other* characters (like '+' and '-' and '"' and '*' and ...) are metacharacters for that query parser, and have special meaning. analysis.jsp isn't going to know about those, or do anything special for them -- so people can still be easily confused when analysis.jsp says one thing about how the string +foo* -bar gets analyzed, but that string as a query means something completely different.

Hence my point: leave arguments about QueryParser out of it -- how do we make the function of analysis.jsp more clear?

-Hoss
Re: Hierarchical faceting
We were able to get the hierarchy faceting working with a workaround approach, e.g. if you have Europe//Norway//Oslo as an entry:

1. Create a new multivalued field with string type:

<field name="country_facet" type="string" indexed="true" stored="true" multiValued="true"/>

2. Index the field for Europe//Norway//Oslo with the values:

0//Europe
1//Europe//Norway
2//Europe//Norway//Oslo

3. The facet can now be used in queries:

1st level -- would return all entries @ 1st level, e.g. 0//USA, 0//Europe:
fq=
f.country_facet.facet.prefix=0//
facet.field=country_facet

2nd level -- would return all entries @ second level in Europe, e.g. 1//Europe//Norway, 1//Europe//Sweden:
fq=country_facet:0//Europe
f.country_facet.facet.prefix=1//Europe
facet.field=country_facet

3rd level -- would return 1//Europe//Norway entries:
fq=country_facet:1//Europe//Norway
f.country_facet.facet.prefix=2//Europe//Norway
facet.field=country_facet

Increment the facet.prefix by 1 so that you limit the facet results to that prefix. This also works for any depth.

Regards, Jayendra

On Thu, Aug 12, 2010 at 6:01 PM, Mats Bolstad mat...@stud.ntnu.no wrote: Hey all, I am doing a search on hierarchical data, and I have a hard time getting my head around the following problem. ...
Re: Index compatibility 1.4 Vs 3.1 Trunk
: : That should still be true in the official 4.0 release (i really should
: : have said "When 4.0 can no longer read Solr 1.4 indexes"), ...
: : i haven't been following the details closely, but i suspect that tool
: : hasn't been written yet because there isn't much point until the full
: : details of the trunk index format are nailed down.
: This is news to me?
:
: File formats are back-compatible between major versions. Version X.N should
: be able to read indexes generated by any version after and including version
: X-1.0, but may-or-may-not be able to read indexes generated by version
: X-2.N.

It was a big part of the proposal regarding the creation of the 3x branch ... that index format compatibility between major versions would no longer be supported by silently converting on first write -- instead there would be a tool for explicit conversion...

http://search.lucidimagination.com/search/document/c10057266d3471c6/proposal_about_version_api_relaxation
http://search.lucidimagination.com/search/document/c494a78f1ec1bfb5/lucene_3_x_branch_created

-Hoss
Re: edismax pf2 and ps
We pretty much had the same issue, and ended up customizing the ExtendedDismax code. In your case it's just a change of a single line, from

addShingledPhraseQueries(query, normalClauses, phraseFields2, 2, tiebreaker, pslop);

to

addShingledPhraseQueries(query, normalClauses, phraseFields2, 2, tiebreaker, 0);

Regards, Jayendra

On Thu, Aug 12, 2010 at 1:04 PM, Ron Mayer r...@0ape.com wrote: Short summary: Is there any way I can specify that I want a lot of phrase slop for the pf parameter, but none at all for the pf2 parameter? I find the 'pf' parameter with a pretty large 'ps' does a very nice job of providing a modest boost to many documents that are quite well related to many queries in my system. In contrast, I find the 'pf2' parameter with zero 'ps' does extremely well at providing a high boost to documents that are often exactly what someone's searching for. Is there any way I can get both effects? Edismax's pf2 parameter is really nice for boosting exact phrases in queries like 'black jacket red cap white shoes'. But as soon as even a little phrase slop (ps) is added, it seems like it starts boosting documents with red jackets and white caps just as much as those with black jackets and red caps. My gut feeling is that if I could have pf with a large phrase slop and pf2 with zero phrase slop, it'd give me better overall results than any single phrase slop setting that gets applied to both. Is there any good way for me to test that? Thanks, Ron
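For context, the effect Ron is after can be written out as illustrative request parameters (field names and boost/slop values are made up); stock edismax applies the single ps value to both pf and pf2, which is exactly what the hardcoded 0 in the patched line above overrides:

q=black jacket red cap white shoes&defType=edismax&qf=text&pf=text^5&ps=50&pf2=text^10

With the patched line, pf2 bigram phrases are built with zero slop regardless of ps, while pf keeps the generous slop of 50.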
Re: DIH and multivariable fields problems
Please add a JIRA issue for this. https://issues.apache.org/jira/secure/BrowseProject.jspa On Tue, Aug 10, 2010 at 6:59 PM, kenf_nc ken.fos...@realestate.com wrote: Glad I could help. I also would think it was a very common issue. Personally my schema is almost all dynamic fields. I have unique_id, content, last_update_date and maybe one other field specifically defined, the rest are all dynamic. This lets me accept an almost endless variety of document types into the same schema. So if I planned on using DIH I had to come up with a way, and stitching together solutions to a couple related issues got me to my script transform. Mine is more convoluted than the one I gave here, but obviously you got the gist of the idea. -- View this message in context: http://lucene.472066.n3.nabble.com/DIH-and-multivariable-fields-problems-tp1032893p1081738.html Sent from the Solr - User mailing list archive at Nabble.com. -- Lance Norskog goks...@gmail.com
Re: DataImportHandler in Solr 1.4.1: exception handling in FileListEntityProcessor
Please add a JIRA issue for this. On Wed, Aug 11, 2010 at 6:24 AM, Sascha Szott sz...@zib.de wrote: Sorry, there was a mistake in the stack trace. The correct one is:

SEVERE: Full Import failed
org.apache.solr.handler.dataimport.DataImportHandlerException: 'baseDir' value: /home/doe/foo is not a directory Processing Document # 3
 at org.apache.solr.handler.dataimport.FileListEntityProcessor.init(FileListEntityProcessor.java:122)
 at org.apache.solr.handler.dataimport.EntityProcessorWrapper.init(EntityProcessorWrapper.java:71)
 at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:319)
 at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:383)
 at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:242)
 at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:180)
 at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:331)
 at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:389)
 at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:370)

-Sascha

On 11.08.2010 15:18, Sascha Szott wrote: Hi folks, why does FileListEntityProcessor ignore onError="continue" and abort indexing if a directory or a file does not exist? I'm using both XPathEntityProcessor and FileListEntityProcessor with onError set to "continue". In case a directory or file is not present, an exception is thrown and indexing is stopped immediately. Below you can find the stack trace that is generated in case the directory /home/doe/foo does not exist:

SEVERE: Full Import failed
org.apache.solr.handler.dataimport.DataImportHandlerException: 'baseDir' value: /home/doe/foo/bar.xml is not a directory Processing Document # 3
 at org.apache.solr.handler.dataimport.FileListEntityProcessor.init(FileListEntityProcessor.java:122)
 at org.apache.solr.handler.dataimport.EntityProcessorWrapper.init(EntityProcessorWrapper.java:71)
 at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:319)
 at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:383)
 at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:242)
 at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:180)
 at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:331)
 at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:389)
 at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:370)

How should I configure both processors so that missing directories and files are ignored and the indexing process does not stop immediately? Best, Sascha -- Lance Norskog goks...@gmail.com
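For reference, a data-config.xml along the lines Sascha describes might look like this (paths, file pattern, and XPaths are illustrative, not from his actual config); the onError attribute on each entity is what the report says FileListEntityProcessor fails to honor when baseDir is missing:

<dataConfig>
  <dataSource type="FileDataSource"/>
  <document>
    <entity name="files" processor="FileListEntityProcessor"
            baseDir="/home/doe/foo" fileName=".*\.xml"
            rootEntity="false" onError="continue">
      <entity name="records" processor="XPathEntityProcessor"
              url="${files.fileAbsolutePath}" forEach="/record"
              onError="continue">
        <field column="id" xpath="/record/id"/>
      </entity>
    </entity>
  </document>
</dataConfig>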
Re: analysis tool vs. reality
On Thu, Aug 12, 2010 at 8:07 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : You say it's bogus because the qp will divide on whitespace first -- but : you're assuming you know what query parser will be used ... the field : query parser (to name one) doesn't split on whitespace first. That's my : point: analysis.jsp doesn't make any assumptions about what query parser : *might* be used, it just tells you what your analyzers do with strings. : : : you're right, we should just fix the bug that the queryparser tokenizes on : whitespace first. then analysis.jsp will be significantly less confusing.

dude .. not trying to get into a holy war here

actually I'm suggesting the practical solution: that we fix the primary problem that makes it confusing.

even if you change the Lucene QueryParser so that whitespace isn't a meta character it doesn't affect the underlying issue: analysis.jsp is agnostic about QueryParsers.

analysis.jsp isn't agnostic about queryparsers, it's ignorant of them, and your default queryparser is actually a de-facto whitespace tokenizer; don't try to sugarcoat it. -- Robert Muir rcm...@gmail.com
Re: Solr 1.4.1 and 3x: Grouping of query changes results
: Does not return document as expected: : id:1234 AND (-indexid:1 AND -indexid:2) AND -indexid:3 : : Has anyone else experienced this? The exact placement of the parens isn't : key, just adding a level of nesting changes the query results. ... : I could be wrong but I think this has to do with Solr's lack of support for : purely negative queries, try the following and see if it behaves correctly: : : id:1234 AND (*:* AND -indexid:1 AND -indexid:2) AND -indexid:3

1) Correct. In general a purely negative query can't work -- queries must select something; it doesn't matter if they are nested in another query or not. The query string A AND (-B AND -C) AND -D says that a document must match A, and it must match a query which does not match anything, and it must not match D ... it's that middle clause that prevents anything from matching. Solr does support purely negative queries if they are the top-level query (ie: q=-foo) but it doesn't rewrite nested sub queries (ie: q=foo (-bar -baz)).

2) FWIW: setting aside the pure negative query aspect of this question, changing the grouping of clauses can always affect the results of a query -- this is because the grouping dictates the scoring (due to queryNorms and coord factors), so A (B C (D E)) F can produce results in a very different order than A B C D E F ... likewise A C -B will match different documents than A (C -B) (the latter will match a document containing both A and B, the former will not). -Hoss
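To make point 1 concrete, here is the rewrite rule using the inStock field from the Solr example schema (data hypothetical):

q=-inStock:false                 works: Solr rewrites a top-level pure negative
q=foo OR (-inStock:false)        the nested subclause can never match anything
q=foo OR (*:* -inStock:false)    anchoring it with the match-all query restores the intent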
Re: index pdf files
: Subject: index pdf files : References: aanlktim1wgref511p+unovqcu=b0usxnm8vxzn5bu...@mail.gmail.com : 4c63ed43.4030...@r.email.ne.jp : aanlkti=28tulxqjtibrwcbxtok0avwbvbrjnxpdej...@mail.gmail.com : In-Reply-To: aanlkti=28tulxqjtibrwcbxtok0avwbvbrjnxpdej...@mail.gmail.com http://people.apache.org/~hossman/#threadhijack Thread Hijacking on Mailing Lists When starting a new discussion on a mailing list, please do not reply to an existing message, instead start a fresh email. Even if you change the subject line of your email, other mail headers still track which thread you replied to and your question is hidden in that thread and gets less attention. It makes following discussions in the mailing list archives particularly difficult. See Also: http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking -Hoss
Re: Indexing large files using Solr Cell causes OutOfMemory error
: Subject: Indexing large files using Solr Cell causes OutOfMemory error : References: aanlktinfbtudv4lpjh40vjzderto1-dn7gztnjxfv...@mail.gmail.com : In-Reply-To: aanlktinfbtudv4lpjh40vjzderto1-dn7gztnjxfv...@mail.gmail.com http://people.apache.org/~hossman/#threadhijack Thread Hijacking on Mailing Lists When starting a new discussion on a mailing list, please do not reply to an existing message, instead start a fresh email. Even if you change the subject line of your email, other mail headers still track which thread you replied to and your question is hidden in that thread and gets less attention. It makes following discussions in the mailing list archives particularly difficult. See Also: http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking -Hoss
Re: Filter Performance in Solr 1.3
There was a major Lucene change in filter handling from Solr 1.3 to Solr 1.4 (really, Lucene 2.4.1 to Lucene 2.9.2). Filters are much, much faster in 1.4: the filter is now consulted much earlier in the search process, thus weeding out many more documents early. It sounds like in Solr 1.3 you should only use filter queries for queries with large document sets. On Wed, Aug 11, 2010 at 12:21 PM, Bargar, Matthew B matthew.bar...@verizonwireless.com wrote: The search with the filter takes longer than a search for the same term but no filter, even after repeated searches, after the cache should have come into play. To be more specific, this happens on filters that exclude very few results from the overall set. For instance, type:video returns few results and, as one would expect, returns much quicker than a search without that filter. -type:video, on the other hand, returns a lot of results and excludes very few, and actually takes longer than a search without any filter at all. Is this what one might expect when using a filter that excludes few results, or does it still seem like something strange might be happening? Thanks, Matt -----Original Message----- From: Geert-Jan Brits [mailto:gbr...@gmail.com] Sent: Wednesday, August 11, 2010 2:55 PM To: solr-user@lucene.apache.org Subject: Re: Filter Performance in Solr 1.3 fq's are the preferred way to do filtering when the same filter is often used (since the filter set can be cached separately). As to your direct question, "My question is whether there is anything that can be done in 1.3 to help alleviate the problem, before upgrading to 1.4?": I don't think so (perhaps some patches that I'm not aware of). When are you seeing increased search time? Is it the first time the filter is used? If that's the case, that's logical, since the filter needs to be built; (fq) filters only show their strength (as said above) when you use them repeatedly. If, on the other hand, you're seeing slower response times with an fq filter applied all the time than for the same queries without the fq filter, then something strange must be going on, since this really shouldn't happen in normal situations. Geert-Jan 2010/8/11 Bargar, Matthew B matthew.bar...@verizonwireless.com Hi there, I have a question about filter (fq) performance in Solr 1.3. After doing some testing it seems as though adding a filter increases search time. From what I've read here http://www.derivante.com/2009/06/23/solr-filtering-performance-increase/ and here http://www.lucidimagination.com/blog/2009/05/27/filtered-query-performance-increases-for-solr-14/ it seems as though upgrading to 1.4 would solve this problem. My question is whether there is anything that can be done in 1.3 to help alleviate the problem, before upgrading to 1.4? It becomes an issue because the majority of searches that are done on our site need some content type excluded or filtered for. Does it make sense to use the fq parameter in this way, or is there some better approach since filters are almost always used? Thank you! -- Lance Norskog goks...@gmail.com
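To make the pattern concrete, illustrative requests (field and term are made up):

q=jacket&fq=type:video     selective filter: a small set that caches well
q=jacket&fq=-type:video    negative filter: the filter set is nearly the whole index

In 1.3 that second, near-universal filter set still has to be built and intersected against the query results, which can cost more than it saves -- consistent with the slowdown Matt describes below.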
Re: PDF file
: Subject: PDF file : References: 20100729152139.321c4...@ibis : aanlktinhby5iasd3q9iep7dr8tymajozvk8curih1...@mail.gmail.com : In-Reply-To: aanlktinhby5iasd3q9iep7dr8tymajozvk8curih1...@mail.gmail.com http://people.apache.org/~hossman/#threadhijack Thread Hijacking on Mailing Lists When starting a new discussion on a mailing list, please do not reply to an existing message, instead start a fresh email. Even if you change the subject line of your email, other mail headers still track which thread you replied to and your question is hidden in that thread and gets less attention. It makes following discussions in the mailing list archives particularly difficult. See Also: http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking -Hoss
Re: In multicore env, can I make it access core0 by default
: In-Reply-To: aanlktimwvhxxdhpup5hl-2e1teh9pu6yetopgu=98...@mail.gmail.com : References: aanlktimwvhxxdhpup5hl-2e1teh9pu6yetopgu=98...@mail.gmail.com : aanlktim46b_hcfpf2r6t=b8y_weq4bbhgi=8mappz...@mail.gmail.com : Subject: In multicore env, can I make it access core0 by default http://people.apache.org/~hossman/#threadhijack Thread Hijacking on Mailing Lists When starting a new discussion on a mailing list, please do not reply to an existing message, instead start a fresh email. Even if you change the subject line of your email, other mail headers still track which thread you replied to and your question is hidden in that thread and gets less attention. It makes following discussions in the mailing list archives particularly difficult. See Also: http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking -Hoss
Re: hl.usePhraseHighlighter
: Subject: hl.usePhraseHighlighter : References: 1281125904548-1031951.p...@n3.nabble.com : 960560.55971...@web52904.mail.re2.yahoo.com : In-Reply-To: 960560.55971...@web52904.mail.re2.yahoo.com http://people.apache.org/~hossman/#threadhijack Thread Hijacking on Mailing Lists When starting a new discussion on a mailing list, please do not reply to an existing message, instead start a fresh email. Even if you change the subject line of your email, other mail headers still track which thread you replied to and your question is hidden in that thread and gets less attention. It makes following discussions in the mailing list archives particularly difficult. See Also: http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking -Hoss
Re: Indexing and ExtractingRequestHandler
This is probably true about Luke. The trunk has a new Lucene index format and does not read any previous format. The trunk is a busy code base. The 3.1 branch is slated to be the next Solr release, and is probably a better base for your testing. Best of all is to use the Solr 1.4.1 binary release. On Wed, Aug 11, 2010 at 8:08 PM, Harry Hochheiser hsh...@gmail.com wrote: Thanks. I've run the Tika command line to parse the Excel file, and I see contents in it that don't appear to be indexed. I've tried the path of using Tika to parse the Excel file and then using ExtractingRequestHandler to index the resulting text, and that doesn't work either. As far as Luke goes, I've built it from scratch. Still bombs. Is it possible that it's not compatible with Lucene builds based on trunk? thanks, -harry On Wed, Aug 11, 2010 at 6:48 PM, Jan Høydahl / Cominvent jan@cominvent.com wrote: Hi, You can try the Tika command line to parse your Excel file; then you will see the exact textual output from it, which is what gets indexed into Solr, and thus inspect whether something is missing. Are you sure you use a version of Luke which supports your version of Lucene? -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Training in Europe - www.solrtraining.com On 11. aug. 2010, at 23.33, Harry Hochheiser wrote: I'm trying to use Solr to index the contents of an Excel file, using the ExtractingRequestHandler (the CSV handler won't work for me -- I need to consider the whole spreadsheet as one document), and I'm running into some trouble. Is there any way to see what's going on during the indexing process? I'm concerned that I may be losing some terms, and I'd like to see if I can snoop on the terms that are added to the index as they go along. How might I do this? Barring that, how can I inspect the index after the fact? I have tried to use Luke to see what's in the index, but I get an error: Unknown format version -10. Is it possible to get Luke to work? My Solr build is straight out of SVN. thanks, harry -- Lance Norskog goks...@gmail.com
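For anyone wanting to try Jan's suggestion, the standalone Tika jar can dump its extracted text directly (jar name and version are illustrative; the --text flag writes plain text to stdout):

java -jar tika-app-0.7.jar --text spreadsheet.xls > extracted.txt

Diffing extracted.txt against what actually ends up searchable is a quick way to tell whether terms are being lost in extraction or in indexing/analysis.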
Re: Deleting with the DIH sometimes doesn't delete
Which version of Solr is this? How many documents are there in the index? Etc. It is hard for us to help you without more details. On Thu, Aug 12, 2010 at 8:32 AM, Qwerky neil.j.tay...@hmv.co.uk wrote: I'm doing deletes with the DIH but getting mixed results. Sometimes the documents get deleted, other times I can still find them in the index. What would prevent a doc from getting deleted? For example, I delete 594039 and get this in the logs;

2010-08-12 14:41:55,625 [Thread-210] INFO [DataImporter] Starting Delta Import
2010-08-12 14:41:55,625 [Thread-210] INFO [SolrWriter] Read productimportupdate.properties
2010-08-12 14:41:55,625 [Thread-210] INFO [DocBuilder] Starting delta collection.
2010-08-12 14:41:55,625 [Thread-210] INFO [DocBuilder] Running ModifiedRowKey() for Entity: item
2010-08-12 14:41:55,625 [Thread-210] INFO [DocBuilder] Completed ModifiedRowKey for Entity: item rows obtained : 0
2010-08-12 14:41:55,625 [Thread-210] INFO [DocBuilder] Completed DeletedRowKey for Entity: item rows obtained : 1
2010-08-12 14:41:55,625 [Thread-210] INFO [DocBuilder] Completed parentDeltaQuery for Entity: item
2010-08-12 14:41:55,625 [Thread-210] INFO [DocBuilder] Deleting stale documents
2010-08-12 14:41:55,625 [Thread-210] INFO [SolrWriter] Deleting document: 594039
2010-08-12 14:41:55,703 [Thread-210] INFO [SolrDeletionPolicy] newest commit = 1281030128383
2010-08-12 14:41:55,718 [Thread-210] DEBUG [SolrIndexWriter] Opened Writer DirectUpdateHandler2
2010-08-12 14:41:55,718 [Thread-210] INFO [DocBuilder] Delta Import completed successfully
2010-08-12 14:41:55,718 [Thread-210] INFO [DocBuilder] Import completed successfully
2010-08-12 14:41:55,718 [Thread-210] INFO [DirectUpdateHandler2] start commit(optimize=true,waitFlush=false,waitSearcher=true,expungeDeletes=false)
2010-08-12 14:42:08,562 [Thread-210] DEBUG [SolrIndexWriter] Closing Writer DirectUpdateHandler2
2010-08-12 14:42:10,437 [Thread-210] INFO [SolrDeletionPolicy] SolrDeletionPolicy.onCommit: commits:num=2
 commit{dir=E:\SOLR\kiosk\data\index,segFN=segments_8,version=1281030128383,generation=8,filenames=[_39.frq, _2i.fdx, _39.tis, _39.prx, _39.fnm, _2i.fdt, _39.tii, _39.nrm, segments_8]
 commit{dir=E:\SOLR\kiosk\data\index,segFN=segments_9,version=1281030128384,generation=9,filenames=[_3a.prx, _3a.tis, _3a.fnm, _3a.nrm, _3a.fdt, _3a.tii, _3a.fdx, _3a.frq, segments_9]
2010-08-12 14:42:10,437 [Thread-210] INFO [SolrDeletionPolicy] newest commit = 1281030128384

..this works fine; I can no longer find 594039 in the index. But a little later I delete a couple more (33252 and 105224) and get the following (I added two docs at the same time);

2010-08-12 15:27:42,828 [Thread-217] INFO [DataImporter] Starting Delta Import
2010-08-12 15:27:42,828 [Thread-217] INFO [SolrWriter] Read productimportupdate.properties
2010-08-12 15:27:42,828 [Thread-217] INFO [DocBuilder] Starting delta collection.
2010-08-12 15:27:42,843 [Thread-217] INFO [DocBuilder] Running ModifiedRowKey() for Entity: item
2010-08-12 15:27:42,843 [Thread-217] INFO [DocBuilder] Completed ModifiedRowKey for Entity: item rows obtained : 2
2010-08-12 15:27:42,843 [Thread-217] INFO [DocBuilder] Completed DeletedRowKey for Entity: item rows obtained : 2
2010-08-12 15:27:42,843 [Thread-217] INFO [DocBuilder] Completed parentDeltaQuery for Entity: item
2010-08-12 15:27:42,843 [Thread-217] INFO [DocBuilder] Deleting stale documents
2010-08-12 15:27:42,843 [Thread-217] INFO [SolrWriter] Deleting document: 33252
2010-08-12 15:27:42,906 [Thread-217] INFO [SolrDeletionPolicy] SolrDeletionPolicy.onInit: commits:num=1
 commit{dir=E:\SOLR\kiosk\data\index,segFN=segments_9,version=1281030128384,generation=9,filenames=[_3a.prx, _3a.tis, _3a.fnm, _3a.nrm, _3a.fdt, _3a.tii, _3a.fdx, _3a.frq, segments_9]
2010-08-12 15:27:42,906 [Thread-217] INFO [SolrDeletionPolicy] newest commit = 1281030128384
2010-08-12 15:27:42,906 [Thread-217] DEBUG [SolrIndexWriter] Opened Writer DirectUpdateHandler2
2010-08-12 15:27:42,906 [Thread-217] INFO [SolrWriter] Deleting document: 105224
2010-08-12 15:27:42,906 [Thread-217] INFO [DocBuilder] Delta Import completed successfully
2010-08-12 15:27:42,906 [Thread-217] INFO [DocBuilder] Import completed successfully
2010-08-12 15:27:42,906 [Thread-217] INFO [DirectUpdateHandler2] start commit(optimize=true,waitFlush=false,waitSearcher=true,expungeDeletes=false)
2010-08-12 15:27:55,578 [Thread-217] DEBUG [SolrIndexWriter] Closing Writer DirectUpdateHandler2
2010-08-12 15:27:56,875 [Thread-217] INFO [SolrDeletionPolicy] SolrDeletionPolicy.onCommit: commits:num=2
 commit{dir=E:\SOLR\kiosk\data\index,segFN=segments_9,version=1281030128384,generation=9,filenames=[_3a.prx, _3a.tis, _3a.fnm, _3a.nrm, _3a.fdt, _3a.tii, _3a.fdx, _3a.frq, segments_9]
 commit{dir=E:\SOLR\kiosk\data\index,segFN=segments_a,version=1281030128385,generation=10,filenames=[_3c.tis, _3c.fdt,
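For readers unfamiliar with DIH deletes: during a delta import the keys to remove come from the entity's deletedPkQuery. A minimal illustrative entity (table and column names are hypothetical, not from Qwerky's actual config):

<entity name="item" pk="id"
        query="select * from item"
        deltaQuery="select id from item where updated_at > '${dataimporter.last_index_time}'"
        deltaImportQuery="select * from item where id = '${dataimporter.delta.id}'"
        deletedPkQuery="select id from item_deleted where deleted_at > '${dataimporter.last_index_time}'"/>

In both logs above the deletes and the commit are reported as successful, so the question is why the second pair of documents remains visible afterwards.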
Re: indexing???
Can you provide more details? What is the error you're receiving? What do you think is going on? It might be helpful if you reviewed: http://wiki.apache.org/solr/UsingMailingLists Best Erick On Thu, Aug 12, 2010 at 8:21 AM, satya swaroop sswaro...@gmail.com wrote: Hi all, The indexing part of Solr is going well, but I got an error on indexing a single PDF file. When I searched for the error in the mailing list I found that the error was due to the copyright on that file. Can't we index a file which has a copyright or any digital rights? regards, satya
Re: Results from More than One Core?
There is no information to go on here. Please review http://wiki.apache.org/solr/UsingMailingLists and add some more details... Best Erick On Thu, Aug 12, 2010 at 2:09 PM, Jörg Agatz joerg.ag...@googlemail.com wrote: Hello users... I tried to get results from more than one core, but I don't know how. Maybe you have an idea? I need it in PHP. King
Re: SOLR Query
You'll get a lot of insight into what's actually happening if you append debugQuery=true to your queries, or check the debug checkbox in the Solr admin page. But I suspect (and it's a guess, since you haven't included your schema) that your problem is that you're mixing explicit and default fields. Something like q=ap_address:Tom+Cruise, I think, gets parsed into something like ap_address:tom + default_field:cruise. What happens if you try ap_address:(tom +cruise)? Best Erick On Thu, Aug 12, 2010 at 7:19 PM, Moiz Bhukhiya moiz.bhukh...@gmail.com wrote: Hi there, I've a problem querying SOLR for a specific field with a query string that contains spaces. I added the following lines in the schema.xml to add my own defined fields. Fields are: ap_name, ap_address, ap_dob, ap_desg, ap_sec. Since all these fields begin with ap_, I included the following dynamicField:

<dynamicField name="ap_*" type="text" indexed="true" stored="true"/>

I included this line to make a query search all fields instead of a specific field:

<copyField source="ap_*" dest="text"/>

I added the following document to my index:

<add>
  <doc>
    <field name="id">1</field>
    <field name="ap_name">Tom Cruise</field>
    <field name="ap_address">San Fransisco</field>
  </doc>
</add>

1. When I query q=Tom+Cruise, I should get the above document since it is available in text, which is my default query field. [Works as expected]
2. When I query q=ap_address:Tom, I should not get the above document since Tom is not available in ap_address. [Works as expected]
3. When I query q=ap_address:Tom+Cruise, I should not get the above document BUT I GET IT. [Doesn't work as expected]

Could anyone please explain what mistake I am making? Thanks a lot, appreciate any help! Moiz
Re: Index compatibility 1.4 Vs 3.1 Trunk
On Thu, Aug 12, 2010 at 8:29 PM, Chris Hostetter hossman_luc...@fucit.org wrote: It was a big part of the proposal regarding the creation of the 3x branch ... that index format compatibility between major versions would no longer be supported by silently converting on first write -- instead there would be a tool for explicit conversion... http://search.lucidimagination.com/search/document/c10057266d3471c6/proposal_about_version_api_relaxation http://search.lucidimagination.com/search/document/c494a78f1ec1bfb5/lucene_3_x_branch_created

Hoss, did you actually *read* these documents? "We will only provide a conversion tool that can convert indexes from the last branch_3x up to this trunk (4.0) release, so they can be read later, but may not contain terms with all current analyzers, so people need mostly reindexing. Older indexes will not be able to be read natively without conversion first (with maybe loss of analyzer compatibility)."

The fact that 4.0 can read 3.x indexes *at all* without a converter tool is only because Mike McCandless went the extra mile. I don't see anything suggesting we should support any tools for 2.x indexes! -- Robert Muir rcm...@gmail.com
DataImportHandler and SAXParseExceptions with Jetty
Win XP, Solr 1.4.1 out-of-the-box install, using Jetty. If I add a greater-than or less-than (i.e. > or <) in any XML field and attempt to load or run from the DataImport console I receive a SAXParseException. Example follows. If I don't have a 'less than' it works just fine. I know this must work, because the examples given on the wiki show deltaQueries using a greater-than/less-than compare. Relevant snippet from data-config.xml:

<entity name="item" query="select * from project_items where rownum < 500"

Stack trace received:

org.apache.solr.common.SolrException: FATAL: Could not create importer. DataImporter config invalid
 at org.apache.solr.handler.dataimport.DataImportHandler.inform(DataImportHandler.java:121)
 at org.apache.solr.handler.dataimport.DataImportHandler.handleRequestBody(DataImportHandler.java:222)
 at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
 at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
 at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
 at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
 at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
 at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
 at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
 at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
 at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
 at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
 at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
 at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
 at org.mortbay.jetty.Server.handle(Server.java:285)
 at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
 at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:821)
 at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:513)
 at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208)
 at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
 at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
 at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException: Exception occurred while initializing context
 at org.apache.solr.handler.dataimport.DataImporter.loadDataConfig(DataImporter.java:190)
 at org.apache.solr.handler.dataimport.DataImporter.init(DataImporter.java:101)
 at org.apache.solr.handler.dataimport.DataImportHandler.inform(DataImportHandler.java:113)
 ... 22 more
Caused by: org.xml.sax.SAXParseException: The value of attribute "query" associated with an element type "null" must not contain the '<' character.
 at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown Source)
 at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(Unknown Source)
 at org.apache.solr.handler.dataimport.DataImporter.loadDataConfig(DataImporter.java:178)
 ... 24 more

-- View this message in context: http://lucene.472066.n3.nabble.com/DataImportHandler-and-SAXParseExceptions-with-Jetty-tp1125898p1125898.html Sent from the Solr - User mailing list archive at Nabble.com.
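The usual fix -- standard XML practice rather than anything DIH-specific, and not stated in this message -- is to escape the operator, since a raw '<' is illegal inside an XML attribute value:

<entity name="item" query="select * from project_items where rownum &lt; 500">

Note that wrapping the SQL in a CDATA section won't help here: CDATA is only valid in element content, not inside attribute values, so &lt; (and &amp; for a literal ampersand) is the way to go.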
Re: SOLR Query
I tried ap_address:(tom+cruise) and that worked. I am sure it's the same problem as you suspected! Thanks a lot, Erick (& users!), for your time. Moiz On Thu, Aug 12, 2010 at 8:51 PM, Erick Erickson erickerick...@gmail.com wrote: You'll get a lot of insight into what's actually happening if you append debugQuery=true to your queries, or check the debug checkbox in the Solr admin page. But I suspect (and it's a guess, since you haven't included your schema) that your problem is that you're mixing explicit and default fields. Something like q=ap_address:Tom+Cruise, I think, gets parsed into something like ap_address:tom + default_field:cruise. What happens if you try ap_address:(tom +cruise)? Best Erick On Thu, Aug 12, 2010 at 7:19 PM, Moiz Bhukhiya moiz.bhukh...@gmail.com wrote: Hi there, I've a problem querying SOLR for a specific field with a query string that contains spaces. I added the following lines in the schema.xml to add my own defined fields. Fields are: ap_name, ap_address, ap_dob, ap_desg, ap_sec. Since all these fields begin with ap_, I included the following dynamicField: <dynamicField name="ap_*" type="text" indexed="true" stored="true"/> I included this line to make a query search all fields instead of a specific field: <copyField source="ap_*" dest="text"/> I added the following document to my index: <add><doc><field name="id">1</field><field name="ap_name">Tom Cruise</field><field name="ap_address">San Fransisco</field></doc></add> 1. When I query q=Tom+Cruise, I should get the above document since it is available in text, which is my default query field. [Works as expected] 2. When I query q=ap_address:Tom, I should not get the above document since Tom is not available in ap_address. [Works as expected] 3. When I query q=ap_address:Tom+Cruise, I should not get the above document BUT I GET IT. [Doesn't work as expected] Could anyone please explain what mistake I am making? Thanks a lot, appreciate any help! Moiz
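For the archives, the two idioms that keep both terms bound to the field (query strings illustrative):

q=ap_address:(Tom Cruise)      both terms searched in ap_address (default operator applies)
q=ap_address:"Tom Cruise"      exact phrase in ap_address

Without the parentheses or quotes, only the token immediately after the colon belongs to ap_address; the rest falls through to the default field, which is why query 3 matched.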