RE: java GC overhead limit exceeded
Hi, which version do you use? 1.4.1 is highly recommended, since previous versions contained some bugs related to memory usage that could lead to memory leaks. I had this GC overhead limit in my setup as well. The only workaround that helped was a daily restart of all instances. With 1.4.1 this issue seems to be fixed. -Original Message- From: Jonathan Rochkind [mailto:rochk...@jhu.edu] Sent: Tuesday, 27 July 2010 01:18 To: solr-user@lucene.apache.org Subject: java GC overhead limit exceeded I am now occasionally getting a Java "GC overhead limit exceeded" error in my Solr. This may or may not be related to recently adding much better (and more) warming queries. I can get it when trying a 'commit', after deleting all documents in my index, or in other cases. Has anyone run into this, and have suggestions as to how to set my Java options to eliminate it? I'm not sure this simply means that my heap size needs to be bigger; it seems to be something else. Any advice appreciated. Googling didn't get me much I trusted. Jonathan
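For reference, the overhead limit can be disabled (or the heap raised) via JVM options; a minimal illustration, with the heap size just an example value to tune for your setup:

    java -Xmx2048m -XX:-UseGCOverheadLimit -jar start.jar

Disabling the limit only hides the symptom, though: if the JVM spends nearly all its time in GC, the heap is usually too small for the working set (large caches, warming queries, etc.).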
Re: Design questions/Schema Help
Hi, IMHO you can do this with date range queries and (date) facets. The DateMathParser will allow you to normalize dates on min/hours/days. If you hit a limit there, then just add a field with an integer for either min/hour/day. This way you'll lose the month information - which is sometimes what you want. You probably want the document entity to be a query with fields: query, user (id? if you have that), sessionid, date. The most popular query within a date range is the query that was logged most times? Do a search on the date range: q=date:[start TO end] with a facet on the query field, which gives you the count - similar to the GROUP BY/COUNT aggregation functionality in an RDBMS. You can do multiple facets at the same time, but be careful what you are querying for - it will impact the facet count. You can use functions to change the base of each facet. http://wiki.apache.org/solr/SimpleFacetParameters Cheers, Chantal On Tue, 2010-07-27 at 01:43 +0200, Mark wrote: We are thinking about using Cassandra to store our search logs. Can someone point me in the right direction/lend some guidance on design? I am new to Cassandra and I am having trouble wrapping my head around some of these new concepts. My brain keeps wanting to go back to an RDBMS design. We will be storing the user query, # of hits returned and their session id. We would like to be able to answer the following questions. - What are the n most popular queries and their counts within the last x (mins/hours/days/etc)? Basically the most popular searches within a given time range. - What is the most popular query within the last x where hits = 0? Same as above but with an extra where clause. - For session id x give me all their other queries. - What are all the session ids that searched for 'foos'? We accomplish the above functionality w/ MySQL using 2 tables. One for the raw search log information and the other to keep the aggregate/running counts of queries. Would this sort of ad-hoc querying be better implemented using Hadoop + Hive? If so, should I be storing all this information in Cassandra and then using Hadoop to retrieve it? Thanks for your suggestions
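As an illustration of the facet approach (field names "query" and "date" and the range bounds are assumed here), the "top n queries in a time range" question maps to a single request like:

    q=date:[2010-07-01T00:00:00Z TO 2010-07-27T00:00:00Z]&facet=true&facet.field=query&facet.limit=10&facet.mincount=1&rows=0

The facet counts on the query field are the per-query totals within the range; adding fq=hits:0 would restrict the counts to zero-hit searches.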
Re: How to Combine Drupal solrconfig.xml with Nutch solrconfig.xml?
I would use the string version, as Drupal will probably populate it with a URL-like thing - something that may not validate as type url. On 27 Jul 2010, at 04:00, Savannah Beckett wrote: I am trying to merge the schema.xml that is in the solr/nutch setup with the one from the Drupal apachesolr module. I encountered a field that is not mergeable. From the Drupal module: <field name="url" type="string" indexed="true" stored="true"/> From the solr/nutch setup: <field name="url" type="url" stored="true" indexed="true" required="true"/> I am not sure if there is any more stuff like this that is not mergeable. Is there an easy way to deal with schema.xml? Thanks. From: David Stuart david.stu...@progressivealliance.co.uk To: solr-user@lucene.apache.org Sent: Mon, July 26, 2010 1:46:58 PM Subject: Re: How to Combine Drupal solrconfig.xml with Nutch solrconfig.xml? Hi Savannah, I have just answered this question over on drupal.org: http://drupal.org/node/811062 Responses number 5 and 11 will help you. On the solrconfig.xml side of things you will only really need Drupal's version. Although still in alpha, my Nutch module will help you out with integration: http://drupal.org/project/nutch Regards, David Stuart On 26 Jul 2010, at 21:37, Savannah Beckett wrote: I am using the Drupal ApacheSolr module to integrate Solr with Drupal. I already integrated Solr with Nutch. I already moved Nutch's solrconfig.xml and schema.xml to Solr's example directory, and it works. I tried to append the Drupal ApacheSolr module's own solrconfig.xml and schema.xml into the same XML files, but I got the following error when I ran java -jar start.jar: Jul 26, 2010 1:18:31 PM org.apache.solr.common.SolrException log SEVERE: Exception during parsing file: solrconfig.xml:org.xml.sax.SAXParseException: The markup in the document following the root element must be well-formed. at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:249) at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:284) at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:124) at org.apache.solr.core.Config.<init>(Config.java:110) at org.apache.solr.core.SolrConfig.<init>(SolrConfig.java:130) at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:134) at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:83) Why? Does solrconfig.xml allow two config sections? Does schema.xml allow two schema sections? Thanks.
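A merged definition along David's suggestion might look like this (a sketch - whether required="true" is safe depends on whether Drupal always supplies a url value):

    <field name="url" type="string" indexed="true" stored="true" required="true"/>

Note that appending a second <config> or <schema> root element is exactly what produces the "markup following the root element" SAXParseException above: each file must have a single root element, so the contents have to be merged inside one root rather than concatenated.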
Any tips/guidelines to turning the Solr/luence performance in a master/slave/sharding environment
How can we reduce the index file size, decrease the sync time between nodes, and decrease the index create/update time? Thanks.
Russian stemmer
Hello, I'm using SnowballPorterFilterFactory with language="Russian". The stemming works OK except for people's names and geographical places. Here are some examples: searching for Ковров should also find Коврова, Коврову, Ковровом, Коврове. Are there other stemming plugins for the Russian language that can handle this? If not, what are the options? A simple solution may be to use wildcard queries in Standard mode instead of the DisMaxQueryHandler: Ковров* - but I'd like to avoid that. Thanks.
Re: Russian stemmer
All of your examples stem to ковров:

assertAnalyzesTo(a, "Коврова Коврову Ковровом Коврове",
    new String[] { "ковров", "ковров", "ковров", "ковров" });

Are you sure you enabled this at *both* index and query time? 2010/7/27 Oleg Burlaca o...@burlaca.com Hello, I'm using SnowballPorterFilterFactory with language="Russian". The stemming works OK except for people's names and geographical places. Here are some examples: searching for Ковров should also find Коврова, Коврову, Ковровом, Коврове. Are there other stemming plugins for the Russian language that can handle this? If not, what are the options? A simple solution may be to use wildcard queries in Standard mode instead of the DisMaxQueryHandler: Ковров* - but I'd like to avoid that. Thanks. -- Robert Muir rcm...@gmail.com
Spellchecking and frequency
Hi, I've recently been looking into spellchecking in Solr, and was struck by how limited the usefulness of the tool was. Like most corpora, ours contains lots of different spelling mistakes for the same word, so 'spellcheck.onlyMorePopular' is not really that useful unless you click on it numerous times. I was thinking that since most of the time people spell words correctly, why is there no other frequency parameter that could enter into the score? I.e. something like: spell_score ~ edit_dist * freq I'm sure others have come across this issue and was wondering what steps/algorithms they have used to overcome these limitations? Cheers, Dan
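A minimal sketch of the idea (the names and weighting below are invented for illustration, not a Solr API - an inverse edit distance damped by log frequency):

    // score a candidate correction: prefer close edits, break ties by corpus frequency
    static float suggestionScore(int editDistance, int docFreq) {
        float similarity = 1.0f / (1 + editDistance);      // 1.0 for exact match, decays with distance
        float popularity = (float) Math.log(1 + docFreq);  // damp raw frequency
        return similarity * popularity;
    }

The log damping keeps one extremely common word from outranking a much closer but rarer candidate.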
Re: Russian stemmer
On another look, your problem is ковров itself... it's mapped to ковр. A workaround might be to use the protected words functionality to keep ковров and any other problematic people/geo names as-is. Separately, in trunk there is an alternative Russian stemmer (RussianLightStemFilterFactory), which might give you fewer problems on average, but I noticed it has this same problem with the example you gave. On Tue, Jul 27, 2010 at 4:25 AM, Robert Muir rcm...@gmail.com wrote: All of your examples stem to ковров: assertAnalyzesTo(a, "Коврова Коврову Ковровом Коврове", new String[] { "ковров", "ковров", "ковров", "ковров" }); Are you sure you enabled this at *both* index and query time? 2010/7/27 Oleg Burlaca o...@burlaca.com Hello, I'm using SnowballPorterFilterFactory with language="Russian". The stemming works OK except for people's names and geographical places. Here are some examples: searching for Ковров should also find Коврова, Коврову, Ковровом, Коврове. Are there other stemming plugins for the Russian language that can handle this? If not, what are the options? A simple solution may be to use wildcard queries in Standard mode instead of the DisMaxQueryHandler: Ковров* - but I'd like to avoid that. Thanks. -- Robert Muir rcm...@gmail.com -- Robert Muir rcm...@gmail.com
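For reference, the protected words hook is the protected attribute on the stemmer factory; a sketch (protwords.txt is the conventional file name, and since the filter usually runs after lowercasing, entries should be in the lowercased form the stemmer actually sees):

    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Russian" protected="protwords.txt"/>

protwords.txt takes one term per line, e.g. ковров; a reindex is needed for the change to affect already-indexed documents.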
Re: Russian stemmer
Yes, I'm sure I've enabled SnowballPorterFilterFactory both at index and query time, because the search works OK, except for names and geo locations. I've noticed that searching by Коврова also shows documents that contain Коврову, Коврове. Search by Ковров, 7 results: http://www.sova-center.ru/search/?q=%D0%BA%D0%BE%D0%B2%D1%80%D0%BE%D0%B2 Search by Коврова, 26 results: http://www.sova-center.ru/search/?lg=1&q=%D0%BA%D0%BE%D0%B2%D1%80%D0%BE%D0%B2%D0%B0 Adding such words to a protected words file will be a tedious task, as there are 7 million Russian names :) Kind Regards, Oleg Burlaca On Tue, Jul 27, 2010 at 11:35 AM, Robert Muir rcm...@gmail.com wrote: On another look, your problem is ковров itself... it's mapped to ковр. A workaround might be to use the protected words functionality to keep ковров and any other problematic people/geo names as-is. Separately, in trunk there is an alternative Russian stemmer (RussianLightStemFilterFactory), which might give you fewer problems on average, but I noticed it has this same problem with the example you gave. On Tue, Jul 27, 2010 at 4:25 AM, Robert Muir rcm...@gmail.com wrote: All of your examples stem to ковров: assertAnalyzesTo(a, "Коврова Коврову Ковровом Коврове", new String[] { "ковров", "ковров", "ковров", "ковров" }); Are you sure you enabled this at *both* index and query time? 2010/7/27 Oleg Burlaca o...@burlaca.com Hello, I'm using SnowballPorterFilterFactory with language="Russian". The stemming works OK except for people's names and geographical places. Here are some examples: searching for Ковров should also find Коврова, Коврову, Ковровом, Коврове. Are there other stemming plugins for the Russian language that can handle this? If not, what are the options? A simple solution may be to use wildcard queries in Standard mode instead of the DisMaxQueryHandler: Ковров* - but I'd like to avoid that. Thanks. -- Robert Muir rcm...@gmail.com -- Robert Muir rcm...@gmail.com
Re: Russian stemmer
A similar word is Немцов. The strange thing is that searching for Немцова will not find documents containing Немцов. Немцова: 14 articles http://www.sova-center.ru/search/?lg=1&q=%D0%BD%D0%B5%D0%BC%D1%86%D0%BE%D0%B2%D0%B0 Немцов: 74 articles http://www.sova-center.ru/search/?lg=1&q=%D0%BD%D0%B5%D0%BC%D1%86%D0%BE%D0%B2
Re: Russian stemmer
Actually the situation with Немцов is OK; I've just checked how Yandex works with Немцов and Немцова: http://nano.yandex.ru/project/inflect/ I think there are two solutions: a) manually search for both Немцов and then Немцова b) use a wildcard query: Немцов* Robert, thanks for the RussianLightStemFilterFactory info. I've found this page http://www.mail-archive.com/solr-comm...@lucene.apache.org/msg06857.html that somehow describes it. Where can I read more about RussianLightStemFilterFactory? Regards, Oleg 2010/7/27 Oleg Burlaca o...@burlaca.com A similar word is Немцов. The strange thing is that searching for Немцова will not find documents containing Немцов. Немцова: 14 articles http://www.sova-center.ru/search/?lg=1&q=%D0%BD%D0%B5%D0%BC%D1%86%D0%BE%D0%B2%D0%B0 Немцов: 74 articles http://www.sova-center.ru/search/?lg=1&q=%D0%BD%D0%B5%D0%BC%D1%86%D0%BE%D0%B2
Re: Russian stemmer
2010/7/27 Oleg Burlaca o...@burlaca.com Actually the situation with Немцов is OK; I've just checked how Yandex works with Немцов and Немцова: http://nano.yandex.ru/project/inflect/ I think there are two solutions: a) manually search for both Немцов and then Немцова b) use a wildcard query: Немцов* Well, here is one idea of a more general solution. The problem with protected words is you must have a complete list. One idea would be to add a filter that protects any words from stemming that match a regular expression. In English maybe someone wants to avoid any capitalized words to reduce trouble: [A-Z].* In your case then some pattern like [А-Я].*ов might prevent problems. Robert, thanks for the RussianLightStemFilterFactory info. I've found this page http://www.mail-archive.com/solr-comm...@lucene.apache.org/msg06857.html that somehow describes it. Where can I read more about RussianLightStemFilterFactory? Here is the link: http://doc.rero.ch/lm.php?url=1000,43,4,20091209094227-CA/Dolamic_Ljiljana_-_Indexing_and_Searching_Strategies_for_the_Russian_20091209.pdf Regards, Oleg 2010/7/27 Oleg Burlaca o...@burlaca.com A similar word is Немцов. The strange thing is that searching for Немцова will not find documents containing Немцов. Немцова: 14 articles http://www.sova-center.ru/search/?lg=1&q=%D0%BD%D0%B5%D0%BC%D1%86%D0%BE%D0%B2%D0%B0 Немцов: 74 articles http://www.sova-center.ru/search/?lg=1&q=%D0%BD%D0%B5%D0%BC%D1%86%D0%BE%D0%B2 -- Robert Muir rcm...@gmail.com
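A minimal sketch of such a filter, assuming a Lucene version with KeywordAttribute (trunk/3.1+, where the bundled stemmers skip tokens marked as keywords); the class name and pattern are invented for illustration:

    import java.io.IOException;
    import java.util.regex.Pattern;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.KeywordAttribute;

    /** Marks tokens matching a pattern as keywords so downstream stemmers leave them as-is. */
    public final class PatternProtectFilter extends TokenFilter {
      private final Pattern pattern;
      private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
      private final KeywordAttribute keywordAtt = addAttribute(KeywordAttribute.class);

      public PatternProtectFilter(TokenStream input, Pattern pattern) {
        super(input);
        this.pattern = pattern;
      }

      @Override
      public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) return false;
        // CharTermAttribute is a CharSequence, so it can be matched directly
        if (pattern.matcher(termAtt).matches()) {
          keywordAtt.setKeyword(true);
        }
        return true;
      }
    }

The filter would sit between the tokenizer and the stemmer; if it runs after lowercasing, the pattern has to match lowercase forms, e.g. Pattern.compile("[а-я]+ов").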
clustering component
Hi, I'm attempting to get the carrot based clustering component (in trunk) to work. I see that the clustering contrib has been disabled for the time being. Does anyone know if this will be re-enabled soon, or even better, know how I could get it working as it is? Thanks, Matt
Re: clustering component
Hi Matt, I'm attempting to get the carrot based clustering component (in trunk) to work. I see that the clustering contrib has been disabled for the time being. Does anyone know if this will be re-enabled soon, or even better, know how I could get it working as it is? I've recently created a patch to update the clustering algorithms in branch_3x: https://issues.apache.org/jira/browse/SOLR-1804 The patch should also work with trunk, but I haven't verified it yet. S.
Re: slave index is bigger than master index
We have three dedicated servers for Solr, two for slaves and one for the master, all with Linux/Debian packages installed. I understand that replication always copies over the index in the exact form it has in the master index directory (or it is supposed to do that at least), and if the master index was optimized after indexing, one doesn't need to run an optimize call again to optimize the slave's index. But in our case that's what fixed it, and I agree it is even more confusing now :s Another problem is, we are serving live services using the slave nodes, so I don't want to affect the live search while playing with the slave nodes' indices. We will be running the indexing on the master node today over the night. Let's see if it does it again.
Re: DIH : SQL query (sub-entity) is executed although variable is not set (null or empty list)
Hi Mitch, thanks for that suggestion. I wasn't aware of that. I've already added a temporary field in my ScriptTransformer that does basically the same. However, with this approach indexing time went up from 20min to more than 5 hours. The new approach is to query the Solr index for that other database that I've already set up. This is only a bit slower than the original query (20min). (I'm using URLDataSource to be 1.4.1 conform.) As with the db entity before, for every document a request is sent to the Solr core even if it is useless because the input variable is empty. It seems that once an entity processor kicks in you cannot avoid the initial request to its data source? Thanks, Chantal On Mon, 2010-07-26 at 16:22 +0200, MitchK wrote: Hi Chantal, did you try to write a custom DIH function (http://wiki.apache.org/solr/DIHCustomFunctions)? If not, I think this will be a solution. Just check whether ${prog.vip} is an empty string or null. If so, you need to replace it with a value that can never match anything, so the vip field will always be empty for such queries. Maybe that helps? Hopefully, the variable resolver is able to resolve something like ${dih.functions.getReplacementIfNeeded(prog.vip)}. Kind regards, - Mitch Chantal Ackermann wrote: Hi, my use case is the following: In a sub-entity I request rows from a database for an input list of strings:

<entity name="prog" ...>
  <field name="vip" .../> <!-- multivalued, not required -->
  <entity name="ssc_entry" dataSource="ssc" onError="continue"
          query="select SSC_VALUE from SSC_VALUE where SSC_ATTRIBUTE_ID=1 and SSC_VALUE in (${prog.vip})">
    <field column="SSC_VALUE" name="vip_ssc"/>
  </entity>
</entity>

The root entity is prog and it has an optional multivalued field called vip. When the list of vip values is empty, the SQL for the sub-entity above throws an SQLException. (Working with Oracle, which does not allow an empty expression in the in-clause.) Two things: (A) best would be not to run the query whenever ${prog.vip} is null or empty. (B) From the documentation, it is not clear that onError is only checked in the transformer runs but not when the SQL for the entity throws an exception. (Trunk version JdbcDataSource lines 250pp.) IMHO, (A) is the better fix, and if so, (B) is the right decision. (If (A) is not easily fixable, making (B) work would be helpful.) Looking through the code, I've realized that the replacement of the variables is done in a very generic way. I've not yet seen an appropriate way to check on those variables in order to stop the processing of the entity if the variable is empty. Is there a way to do this? Or maybe there is a completely different way to get my use case working. Any help most appreciated! Thanks, Chantal
LucidWorks 1.4 compilation
Good Morning, afternoon or evening... If someone installed Solr using the LucidWorks.jar (1.4) installation, how can one make a small change and recompile? Is there a LucidWorks (Tomcat) build somewhere? Regards ericz
Re: Extracting PDF text/comment/callout/typewriter boxes with Solr CELL/Tika/PDFBox
Hi Jon, During the last days we faced the same problem. Using Solr 1.4.1 classic (Tika 0.4), from some PDF files we can't extract content, and from others Solr throws an exception during the indexing process. You must: Update the Tika libraries (in /contrib/extraction/lib) with tika-core 0.8-SNAPSHOT and tika-parsers 0.8. Update PDFBox and all related libraries. After that you have to patch Solr 1.4.1 following this patch: https://issues.apache.org/jira/browse/SOLR-1902?page=com.atlassian.jira.plugin.ext.subversion%3Asubversion-commits-tabpanel This is the first way to solve the problem. Using Solr 1.4.1 (with Tika 0.8-SNAPSHOT and PDFBox updated) no exception is thrown during the indexing process, but no content is extracted. Using the latest Solr trunk (with Tika 0.8-SNAPSHOT and PDFBox updated) all sounds good, but we don't know how stable it is! I hope you now have a clear vision of this issue. Best Regards 2010/7/26 Sharp, Jonathan jsh...@coh.org Every so often I need to index new batches of scanned PDFs, and occasionally Adobe's OCR can't recognize the text in a couple of these documents. In these situations I would like to type in a small amount of text onto the document and have it be extracted by Solr CELL. Adobe Pro 9 has a number of different ways to add text directly to a PDF file: *Typewriter *Sticky Note *Callout boxes *Text boxes I tried indexing documents with each of these text additions with Solr 1.4.1 + Solr CELL but can't extract the text in any of these boxes. If someone has modified their Solr CELL installation to use more recent versions of Tika (above 0.4) or PDFBox (above 0.7.3) and/or can comment on whether newer versions can pull the text out of any of these various text boxes, I'd appreciate that very much. -Jon -- -- Benedetti Alessandro Personal Page: http://tigerbolt.altervista.org Tyger, tyger burning bright In the forests of the night, What immortal hand or eye Could frame thy fearful symmetry? William Blake - Songs of Experience -1794 England
DIH $deleteDocByQuery
Hi, I have been using DIH to index documents from a database. I am now hoping to use DIH to delete documents from the index as well. I searched the wiki and found the special commands in DIH to do so: http://wiki.apache.org/solr/DataImportHandler#Special_Commands But there is no example of how to use them. I tried searching the web but couldn't find any samples. Any help regarding this would be most welcome. Thanks, Maddy.
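The special commands are picked up as pseudo-fields in the rows an entity returns, so one hedged sketch (the table and column names are invented, and the exact alias quoting depends on your database) is to alias a column of a delete-tracking table in its own entity:

    <entity name="removed_items"
            query="SELECT item_id AS '$deleteDocById' FROM item_deletes"/>

Each returned row then deletes the document with that unique id instead of adding one; $deleteDocByQuery works the same way but takes a Lucene query string as the value.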
Re: NullPointerException with CURL, but not in browser
Ouch! Absolutely correct - quoting the URL fixed it. Thanks for saving me a sleepless night! cheers - rene 2010/7/26 Chris Hostetter hossman_luc...@fucit.org : However, when I'm trying this very URL with curl within my (perl) script, I : receive a NullPointerException: : CURL-COMMAND: curl -sL : http://localhost:8983/solr/select?indent=on&version=2.2&q=*&fq=ListId%3A881&start=0&rows=0&fl=*%2Cscore&qt=standard&wt=standard it appears you aren't quoting the URL, so the first & character is causing the shell to think you are done with the command and want it backgrounded (although I'm not certain, since it depends on how you are having Perl execute curl). I would suggest that you avoid exec/system calls to curl from Perl, and use an LWP::UserAgent instead. -Hoss
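For the record, the shell-safe form just wraps the URL in quotes so the & characters are passed to curl instead of being interpreted by the shell:

    curl -sL "http://localhost:8983/solr/select?indent=on&version=2.2&q=*&fq=ListId%3A881&start=0&rows=0&fl=*%2Cscore&qt=standard&wt=standard"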
Re: DIH : SQL query (sub-entity) is executed although variable is not set (null or empty list)
Hi Chantal, However, with this approach indexing time went up from 20min to more than 5 hours. This is 15x slower than the initial solution... wow. From MySQL I know that IN ()-clauses are the embodiment of endlessness - they perform very, very badly. New idea: create a method which returns the query string:

returnString(theVIP) {
  if (theVIP != null && theVIP != "") {
    return a query-string to find the vip
  } else {
    return "SELECT 1" // you need to modify this, so that it matches your field-definition
  }
}

The main idea is to perform a blazing fast query instead of a complex IN-clause query. Does this sound like a solution? The new approach is to query the solr index for that other database that I've already setup. This is only a bit slower than the original query (20min). (I'm using URLDataSource to be 1.4.1 conform.) Unfortunately I cannot follow you. You are querying a Solr index for a database? Kind regards, - Mitch
Re: LucidWorks 1.4 compilation
I did not realize the LucidWorks.jar comes with an option to install the sources :-) On Tue, Jul 27, 2010 at 10:59 AM, Eric Grobler impalah...@googlemail.com wrote: Good Morning, afternoon or evening... If someone installed Solr using the LucidWorks.jar (1.4) installation, how can one make a small change and recompile? Is there a LucidWorks (Tomcat) build somewhere? Regards ericz
Re: DIH : SQL query (sub-entity) is executed although variable is not set (null or empty list)
Hi Mitch, New idea: create a method which returns the query string: returnString(theVIP) { if (theVIP != null && theVIP != "") { return a query-string to find the vip } else { return "SELECT 1" // you need to modify this, so that it matches your field-definition } } The main idea is to perform a blazing fast query instead of a complex IN-clause query. Does this sound like a solution? I was using "in" because it's a multivalued input that results in multivalued output (not necessarily, but it's most probable - it's either empty or multiple values). I don't understand how I can make your solution work with multivalued input/output? The new approach is to query the solr index for that other database that I've already setup. This is only a bit slower than the original query (20min). (I'm using URLDataSource to be 1.4.1 conform.) Unfortunately I cannot follow you. You are querying a Solr index for a database? Yes, because I've already put one up (second core) and used SolrJ to get what I want later on, but it would be better to compute the relation between the two indexes at index time instead of at query time. (If it had worked with the db entity, the second index wouldn't have been required anymore.) But now that it works well with the url entity, I'm fine with maintaining that second index. It's not that much effort. I've subclassed URLDataSource to add a check whether the list of input values is empty, and to only proceed when this is not the case. I realized that I have to throw an exception and add the onError attribute to the entity to make that work. Thanks! Chantal
Re: slave index is bigger than master index
We have three dedicated servers for Solr, two for slaves and one for the master, all with Linux/Debian packages installed. I understand that replication always copies over the index in the exact form it has in the master index directory (or it is supposed to do that at least), and if the master index was optimized after indexing, one doesn't need to run an optimize call again to optimize the slave's index. But in our case that's what fixed it, and I agree it is even more confusing now :s That's why I said: try it on the slaves too ;-) In our case it also helped to shrink 2*index to 1*index. I think the data necessary for the replication won't be cleaned up before the next replication or before an optimize. For us it was crucial to shrink the size because of limited disc resources, and to make sure that the next replication does not increase the index to 3 times its initial size. @muneeb So I think optimization is not necessary - or do you have disc limitations too? @Hoss or others: does this explanation sound logical? Another problem is, we are serving live services using slave nodes, so I don't want to affect the live search while playing with the slave nodes' indices. What do you mean here? Optimizing is too CPU expensive? We will be running the indexing on the master node today over the night. Let's see if it does it again. Do you mean increase to double size?
Re: DIH : SQL query (sub-entity) is executed although variable is not set (null or empty list)
Hi Chantal, instead of:

<entity name="prog" ...>
  <field name="vip" .../> <!-- multivalued, not required -->
  <entity name="ssc_entry" dataSource="ssc" onError="continue"
          query="select SSC_VALUE from SSC_VALUE where SSC_ATTRIBUTE_ID=1 and SSC_VALUE in (${prog.vip})">
    <field column="SSC_VALUE" name="vip_ssc"/>
  </entity>
</entity>

you do:

<entity name="prog" ...>
  <field name="vip" .../> <!-- multivalued, not required -->
  <entity name="ssc_entry" dataSource="ssc" onError="continue"
          query="${yourCustomFunctionToReturnAQueryString(prog.vip, ..., ...)}">
    <field column="SSC_VALUE" name="vip_ssc"/>
  </entity>
</entity>

The function:

yourCustomFunctionToReturnAQueryString(vip, querystring1, querystring2) {
  if (vip != null && !vip.equals("")) {
    StringBuilder sb = new StringBuilder(50);
    sb.append(querystring1); // "SELECT SSC_VALUE from SSC_VALUE where SSC_ATTRIBUTE_ID=1 and SSC_VALUE in ("
    sb.append(vip);          // VIP-value
    sb.append(querystring2); // just the closing ")"
    return sb.toString();
  } else {
    return "SELECT '' AS yourFieldName";
  }
}

I expect that this method is called for every vip-value, if there is one. Solr DIH uses the returned query string to query the database. So, if the vip-value is empty or null, you can use a different query that is blazing fast (i.e. SELECT '' AS yourFieldName - just an example to show the logic). This query should return a row with an empty string, so Solr fills the current field with an empty string. I don't know how to prevent Solr from calling your ssc_entry entity when vip is null or empty, but this would be a solution to handle empty vip-strings as efficiently as possible. I realized that I have to throw an exception and add the onError attribute to the entity to make that work. I am curious: can you show how to make a method throw an exception that is accepted by the onError attribute? I hope we do not talk past each other here. :-) Kind regards, - Mitch
question: solrCloud with multiple cores on each machine
Hi, I am using SolrCloud. Suppose I have a total of 4 machines dedicated to Solr. I want to have 2 machines as replication (slaves) and 2 as masters, but I want to work with 8 logical cores rather than 2, i.e. each master (and each slave) will have 4 cores on it. The reason is that I can optimize the cores one at a time, so the IO intensity at any given moment will be low and will not degrade the online performance. Is there a way to configure my solr.xml so that when I am doing a distributed search (distrib=true) it will know to query all 8 cores? Thanks Yatir
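Short of SolrCloud doing it automatically, plain distributed search lets you enumerate the cores explicitly via the shards parameter; a sketch with invented host and core names:

    shards=master1:8983/solr/core0,master1:8983/solr/core1,master1:8983/solr/core2,master1:8983/solr/core3,master2:8983/solr/core0,master2:8983/solr/core1,master2:8983/solr/core2,master2:8983/solr/core3

This can also be baked into the request handler defaults in solrconfig.xml so clients don't have to pass it on every request.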
Re: DIH : SQL query (sub-entity) is executed although variable is not set (null or empty list)
Hi Mitch, thanks for the code. Currently I've got a different solution running, but it's always good to have examples. I realized that I have to throw an exception and add the onError attribute to the entity to make that work. I am curious: can you show how to make a method throw an exception that is accepted by the onError attribute? The catch clause looks for Exception, so it's actually easy. :-D Anyway, I've found a cleaner way. It is better to subclass the XPathEntityProcessor and put it in a state that prevents it from calling initQuery, which triggers the dataSource.getData() call. I have overridden the initContext() method, setting a go/no-go flag that I use in the overridden nextRow() to find out whether to delegate to the superclass or not. This way I can also avoid the code that fills the tmp field with an empty value if there is no value to query on. Cheers, Chantal
RE: Spellcheck help
Thanks for the input, I'll check it out! Marc Subject: RE: Spellcheck help Date: Fri, 23 Jul 2010 13:12:04 -0500 From: james.d...@ingrambook.com To: solr-user@lucene.apache.org In org.apache.solr.spelling.SpellingQueryConverter, find the line (#84):

final static String PATTERN = "(?:(?!(" + NMTOKEN + ":|\\d+)))[\\p{L}_\\-0-9]+";

and remove the |\\d+ to make it:

final static String PATTERN = "(?:(?!" + NMTOKEN + ":))[\\p{L}_\\-0-9]+";

My testing shows this solves your problem. The caution is to test it against all your use cases, because obviously someone thought we should ignore leading digits from keywords. Surely there's a reason why, although I can't think of it. James Dyer E-Commerce Systems Ingram Book Company (615) 213-4311 -Original Message- From: dekay...@hotmail.com [mailto:dekay...@hotmail.com] Sent: Saturday, July 17, 2010 12:41 PM To: solr-user@lucene.apache.org Subject: Re: Spellcheck help Can anybody help me with this? :( -Original Message- From: Marc Ghorayeb Sent: Thursday, July 08, 2010 9:46 AM To: solr-user@lucene.apache.org Subject: Spellcheck help Hello, I've been trying to get rid of a bug when using the spellcheck, but so far with no success :( When searching for a word that starts with a number, for example 3dsmax, I get the results that I want, BUT the spellcheck says it is not correctly spelled AND the collation gives me 33dsmax. Further investigation shows that the spellcheck is actually only checking dsmax, which it considers does not exist, and gives me 3dsmax for better results. But since I have spellcheck.collate = true, the collation that I show is 33dsmax, with the first 3 being the one discarded by the spellchecker... Otherwise, the spellcheck works correctly for normal words... any ideas? :( My spellcheck field is fairly classic: whitespace tokenizer with lowercase filter... Any help would be greatly appreciated :) Thanks, Marc
Re: Russian stemmer
Thanks Robert for all your help. The idea of protecting [A-Z].* words is ideal for the English language, although in Russian nouns are inflected: Борис, Борису, Бориса, Борисом. I'll try the RussianLightStemFilterFactory (the article in the PDF mentioned it's more accurate). Once again thanks, Oleg Burlaca On Tue, Jul 27, 2010 at 12:07 PM, Robert Muir rcm...@gmail.com wrote: 2010/7/27 Oleg Burlaca o...@burlaca.com Actually the situation with Немцов is OK; I've just checked how Yandex works with Немцов and Немцова: http://nano.yandex.ru/project/inflect/ I think there are two solutions: a) manually search for both Немцов and then Немцова b) use a wildcard query: Немцов* Well, here is one idea of a more general solution. The problem with protected words is you must have a complete list. One idea would be to add a filter that protects any words from stemming that match a regular expression. In English maybe someone wants to avoid any capitalized words to reduce trouble: [A-Z].* In your case then some pattern like [А-Я].*ов might prevent problems. Robert, thanks for the RussianLightStemFilterFactory info. I've found this page http://www.mail-archive.com/solr-comm...@lucene.apache.org/msg06857.html that somehow describes it. Where can I read more about RussianLightStemFilterFactory? Here is the link: http://doc.rero.ch/lm.php?url=1000,43,4,20091209094227-CA/Dolamic_Ljiljana_-_Indexing_and_Searching_Strategies_for_the_Russian_20091209.pdf Regards, Oleg 2010/7/27 Oleg Burlaca o...@burlaca.com A similar word is Немцов. The strange thing is that searching for Немцова will not find documents containing Немцов. Немцова: 14 articles http://www.sova-center.ru/search/?lg=1&q=%D0%BD%D0%B5%D0%BC%D1%86%D0%BE%D0%B2%D0%B0 Немцов: 74 articles http://www.sova-center.ru/search/?lg=1&q=%D0%BD%D0%B5%D0%BC%D1%86%D0%BE%D0%B2 -- Robert Muir rcm...@gmail.com
Re: Russian stemmer
right, but your problem is that this is the current output:

Ковров -> Ковр
Коврову -> Ковров
Ковровом -> Ковров
Коврове -> Ковров

so, if Ковров was simply left alone, all your forms would match... 2010/7/27 Oleg Burlaca o...@burlaca.com Thanks Robert for all your help. The idea of protecting [A-Z].* words is ideal for the English language, although in Russian nouns are inflected: Борис, Борису, Бориса, Борисом. I'll try the RussianLightStemFilterFactory (the article in the PDF mentioned it's more accurate). Once again thanks, Oleg Burlaca On Tue, Jul 27, 2010 at 12:07 PM, Robert Muir rcm...@gmail.com wrote: 2010/7/27 Oleg Burlaca o...@burlaca.com Actually the situation with Немцов is OK; I've just checked how Yandex works with Немцов and Немцова: http://nano.yandex.ru/project/inflect/ I think there are two solutions: a) manually search for both Немцов and then Немцова b) use a wildcard query: Немцов* Well, here is one idea of a more general solution. The problem with protected words is you must have a complete list. One idea would be to add a filter that protects any words from stemming that match a regular expression. In English maybe someone wants to avoid any capitalized words to reduce trouble: [A-Z].* In your case then some pattern like [А-Я].*ов might prevent problems. Robert, thanks for the RussianLightStemFilterFactory info. I've found this page http://www.mail-archive.com/solr-comm...@lucene.apache.org/msg06857.html that somehow describes it. Where can I read more about RussianLightStemFilterFactory? Here is the link: http://doc.rero.ch/lm.php?url=1000,43,4,20091209094227-CA/Dolamic_Ljiljana_-_Indexing_and_Searching_Strategies_for_the_Russian_20091209.pdf Regards, Oleg 2010/7/27 Oleg Burlaca o...@burlaca.com A similar word is Немцов. The strange thing is that searching for Немцова will not find documents containing Немцов. Немцова: 14 articles http://www.sova-center.ru/search/?lg=1&q=%D0%BD%D0%B5%D0%BC%D1%86%D0%BE%D0%B2%D0%B0 Немцов: 74 articles http://www.sova-center.ru/search/?lg=1&q=%D0%BD%D0%B5%D0%BC%D1%86%D0%BE%D0%B2 -- Robert Muir rcm...@gmail.com -- Robert Muir rcm...@gmail.com
Highlighting parameters wiki
The wiki entry for hl.highlightMultiTerm: http://wiki.apache.org/solr/HighlightingParameters#hl.highlightMultiTerm doesn't appear to be correct. It says: If the SpanScorer is also being used, enables highlighting for range/wildcard/fuzzy/prefix queries. Default is false. But the code in DefaultSolrHighlighter (both on the 1.4 branch that I'm using and in the trunk) does:

Boolean highlightMultiTerm = request.getParams().getBool(HighlightParams.HIGHLIGHT_MULTI_TERM, true);
if (highlightMultiTerm == null) {
  highlightMultiTerm = false;
}

which looks to me like it's going to default to true, since getBool will never return null, and if it gets a null value from the parameters internally, it will return true. Shall I file a Jira on this one? Perhaps it's easier just to fix the Wiki page? Steve -- Stephen Green http://thesearchguy.wordpress.com
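In other words, the null check is dead code: SolrParams.getBool(name, defaultValue) already substitutes the default when the parameter is absent. If the intent were the documented default of false, the call would presumably read like this (a sketch of the fix, not the committed code):

    Boolean highlightMultiTerm = request.getParams().getBool(HighlightParams.HIGHLIGHT_MULTI_TERM, false);

As written, with true passed as the default, the effective default is true, and the wiki is what disagrees with the code.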
RE: Spellcheck help
If you could, let me know how your testing goes with this change. I too am interested in having the collate work as well as it can. It looks like the code would be better with this change, but then again I don't know what the original author was thinking when this was put in. James Dyer E-Commerce Systems Ingram Book Company (615) 213-4311 -Original Message- From: Marc Ghorayeb [mailto:dekay...@hotmail.com] Sent: Tuesday, July 27, 2010 8:07 AM To: solr-user@lucene.apache.org Subject: RE: Spellcheck help Thanks for the input, I'll check it out! Marc Subject: RE: Spellcheck help Date: Fri, 23 Jul 2010 13:12:04 -0500 From: james.d...@ingrambook.com To: solr-user@lucene.apache.org In org.apache.solr.spelling.SpellingQueryConverter, find the line (#84):

final static String PATTERN = "(?:(?!(" + NMTOKEN + ":|\\d+)))[\\p{L}_\\-0-9]+";

and remove the |\\d+ to make it:

final static String PATTERN = "(?:(?!" + NMTOKEN + ":))[\\p{L}_\\-0-9]+";

My testing shows this solves your problem. The caution is to test it against all your use cases, because obviously someone thought we should ignore leading digits from keywords. Surely there's a reason why, although I can't think of it. James Dyer E-Commerce Systems Ingram Book Company (615) 213-4311 -Original Message- From: dekay...@hotmail.com [mailto:dekay...@hotmail.com] Sent: Saturday, July 17, 2010 12:41 PM To: solr-user@lucene.apache.org Subject: Re: Spellcheck help Can anybody help me with this? :( -Original Message- From: Marc Ghorayeb Sent: Thursday, July 08, 2010 9:46 AM To: solr-user@lucene.apache.org Subject: Spellcheck help Hello, I've been trying to get rid of a bug when using the spellcheck, but so far with no success :( When searching for a word that starts with a number, for example 3dsmax, I get the results that I want, BUT the spellcheck says it is not correctly spelled AND the collation gives me 33dsmax. Further investigation shows that the spellcheck is actually only checking dsmax, which it considers does not exist, and gives me 3dsmax for better results. But since I have spellcheck.collate = true, the collation that I show is 33dsmax, with the first 3 being the one discarded by the spellchecker... Otherwise, the spellcheck works correctly for normal words... any ideas? :( My spellcheck field is fairly classic: whitespace tokenizer with lowercase filter... Any help would be greatly appreciated :) Thanks, Marc
RE: Querying throws java.util.ArrayList.RangeCheck
Hi Yonik, I am using the Solr 1.4 release dated Feb 9, 2010. There is no custom code. I am using the regular out-of-the-box dismax request handler. The query is a simple one with 4 filter queries (fq's) and one sort query. During the index generation, I delete a set of rows based on a date filter, then add new rows to the index. Then another process queries the index, generates some stats and updates the index again. Not sure if during this process something is going wrong with the index. Thanks Kalyan -Original Message- From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley Sent: Tuesday, July 27, 2010 12:15 AM To: solr-user@lucene.apache.org Subject: Re: Querying throws java.util.ArrayList.RangeCheck Do you have any custom code, or is this stock Solr (and which version, and what is the request)? -Yonik http://www.lucidimagination.com On Tue, Jul 27, 2010 at 12:30 AM, Manepalli, Kalyan kalyan.manepa...@orbitz.com wrote: Hi, I am stuck at this weird problem during querying. While querying the Solr index I am getting the following error. Index: 52, Size: 16 java.lang.IndexOutOfBoundsException: Index: 52, Size: 16 at java.util.ArrayList.RangeCheck(ArrayList.java:547) at java.util.ArrayList.get(ArrayList.java:322) at org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:288) at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:217) at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:948) at org.apache.lucene.index.DirectoryReader.document(DirectoryReader.java:506) at org.apache.lucene.index.IndexReader.document(IndexReader.java:947) at org.apache.solr.search.SolrIndexReader.document(SolrIndexReader.java:444) at During debugging I found that the SolrIndexReader is trying to read a document which doesn't exist in the index. I tried optimizing the index and restarting the server, but still no luck. Any help in resolving this issue will be appreciated. Thanks Kalyan
Is it possible to get keyword/match's position?
According to SO: http://stackoverflow.com/questions/1557616/retrieving-per-keyword-field-match-position-in-lucene-solr-possible it is not possible, but that was a year ago - is it still true now? Thanks.
Re: java GC overhead limit exceeded
Look into -XX:-UseGCOverheadLimit On 7/26/10, Jonathan Rochkind rochk...@jhu.edu wrote: I am now occasionally getting a Java "GC overhead limit exceeded" error in my Solr. This may or may not be related to recently adding much better (and more) warming queries. I can get it when trying a 'commit', after deleting all documents in my index, or in other cases. Has anyone run into this, and have suggestions as to how to set my Java options to eliminate it? I'm not sure this simply means that my heap size needs to be bigger; it seems to be something else. Any advice appreciated. Googling didn't get me much I trusted. Jonathan -- Sent from my mobile device
RE: Total number of terms in an index?
Hi Jason, Are you looking for the total number of unique terms or the total number of term occurrences? CheckIndex reports both, but does a bunch of other work, so it is probably not the fastest. If you are looking for the total number of term occurrences, you might look at contrib/org/apache/lucene/misc/HighFreqTerms.java. If you are just looking for the total number of unique terms, I wonder if there is some low-level API that would allow you to just access the in-memory representation of the tii file and then multiply the number of terms in it by your indexDivisor (default 128). I haven't dug into the code, so I don't actually know how the tii file gets loaded into a data structure in memory. If there is API access, it seems like this might be the quickest way to get the number of unique terms. (Of course you would have to do this for each segment.) Tom -Original Message- From: Chris Hostetter [mailto:hossman_luc...@fucit.org] Sent: Monday, July 26, 2010 8:39 PM To: solr-user@lucene.apache.org Subject: Re: Total number of terms in an index? : Sorry, like the subject, I mean the total number of terms. it's not stored anywhere, so the only way to fetch it is to actually iterate all of the terms and count them (that's why LukeRequestHandler is so slow to compute this particular value) If I remember right, someone mentioned at one point that flex would let you store data about stuff like this in your index as part of the segment writing, but frankly I'm still not sure how that will help -- because unless your index is fully optimized, you still have to iterate the terms in each segment to 'de-dup' them. -Hoss
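For completeness, the brute-force count with the pre-flex API is short, just slow on big indexes; a sketch (IndexReader.terms() returns a merged, de-duplicated enumeration across segments):

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.TermEnum;

    static long countUniqueTerms(IndexReader reader) throws IOException {
        TermEnum terms = reader.terms();
        long count = 0;
        try {
            while (terms.next()) {
                count++; // each next() lands on a distinct term
            }
        } finally {
            terms.close();
        }
        return count;
    }

This is essentially the iteration the LukeRequestHandler has to do, which is why computing this value there is slow.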
SpatialSearch: sorting by distance
Hi, I'm trying to sort by distance like this: sort=dist(2,lat,lon,55.755786,37.617633) asc In general results are sorted, but some documents are not in the right order. I'm using DistanceUtils.getDistanceMi(...) from Lucene spatial to calculate the real distance after reading documents from Solr. Solr version from trunk.

<fieldType name="double" class="solr.TrieDoubleField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
<field name="lat" type="double" indexed="true" stored="true"/>
<field name="lon" type="double" indexed="true" stored="true"/>

Thanks. -- Pavel Minchenkov
does this indicate a commit happened for every add?
I'm adding lots of small docs with several threads to Solr, and the adds start fast but then slow down. I didn't do any explicit commits and autocommit is turned off, but the logs show lots of commit activity on this core, and restarting this Solr core logged the line below. Where did all these commits come from - the exact same number as my adds? I'm stumped... Jul 27, 2010 10:07:17 AM org.apache.solr.update.DirectUpdateHandler2 close INFO: closed DirectUpdateHandler2{commits=456389,autocommits=0,optimizes=0,rollbacks=0,expungeDeletes=0,docsPending=0,adds=0,deletesById=0,deletesByQuery=0,errors=0,cumulative_adds=456393,cumulative_deletesById=0,cumulative_deletesByQuery=0,cumulative_errors=0}
Re: Spellchecking and frequency
Hi, I found the suggestions returned from the standard Solr spellcheck not to be that relevant. By contrast, aspell, given the same dictionary and misspelled words, gives much more accurate suggestions. I therefore wrote an implementation of SolrSpellChecker that wraps Jazzy, the Java aspell library. I also extended the SpellCheckComponent to take the matrix of suggested words and query the corpus to find the first combination of suggestions which returned a match. This works well for my use case, where term frequency is irrelevant to spelling or scoring. I'd like to publish the code in case someone finds it useful (although it's a bit crude at the moment and will need a decent tidy-up). Would it be appropriate to open up a Jira issue for this? Cheers, ~mark On 27 July 2010 09:33, dan sutton danbsut...@gmail.com wrote: Hi, I've recently been looking into spellchecking in Solr, and was struck by how limited the usefulness of the tool was. Like most corpora, ours contains lots of different spelling mistakes for the same word, so 'spellcheck.onlyMorePopular' is not really that useful unless you click on it numerous times. I was thinking that since most of the time people spell words correctly, why is there no other frequency parameter that could enter into the score? I.e. something like: spell_score ~ edit_dist * freq I'm sure others have come across this issue and was wondering what steps/algorithms they have used to overcome these limitations? Cheers, Dan
RE: Extracting PDF text/comment/callout/typewriter boxes with Solr CELL/Tika/PDFBox
Alessandro & all, I was having the same issue with Tika crashing on certain PDFs. I also noticed the bug where no content was extracted after upgrading Tika. When I went to the SOLR issue you link to below, I applied all the patches, downloaded the Tika 0.8 jars, restarted Tomcat, posted a file via curl, and got the following error: SEVERE: java.lang.NoSuchMethodError: org.apache.solr.core.SolrResourceLoader.getClassLoader()Ljava/lang/ClassLoader; at org.apache.solr.handler.extraction.ExtractingRequestHandler.inform(ExtractingRequestHandler.java:93) at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.getWrappedHandler(RequestHandlers.java:244) at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:231) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298) at org.apache.coyote.http11.Http11AprProcessor.process(Http11AprProcessor.java:859) at org.apache.coyote.http11.Http11AprProtocol$Http11ConnectionHandler.process(Http11AprProtocol.java:579) at org.apache.tomcat.util.net.AprEndpoint$Worker.run(AprEndpoint.java:1555) at java.lang.Thread.run(Thread.java:619) This is really weird because I DID apply the SolrResourceLoader patch that adds the getClassLoader method. I even verified by opening up the JARs and looking at the class file in Eclipse... I can see the SolrResourceLoader.getClassLoader() method. Does anyone know why it can't find the method? After patching the source I did ant clean dist in the base directory of the Solr source tree and everything looked like it compiled (BUILD SUCCESSFUL). Then I copied all the jars from dist/ and all the library dependencies from contrib/extraction/lib/ into my SOLR_HOME. Restarting Tomcat, everything in the logs looked good. I'm stumped. It would be very nice to have a Solr implementation using the newest versions of PDFBox & Tika and actually have content being extracted... =) Best, Dave -Original Message- From: Alessandro Benedetti [mailto:benedetti.ale...@gmail.com] Sent: Tuesday, July 27, 2010 6:09 AM To: solr-user@lucene.apache.org Subject: Re: Extracting PDF text/comment/callout/typewriter boxes with Solr CELL/Tika/PDFBox Hi Jon, During the last days we faced the same problem. Using Solr 1.4.1 classic (Tika 0.4), from some PDF files we can't extract content, and from others Solr throws an exception during the indexing process. You must: Update the Tika libraries (in /contrib/extraction/lib) with tika-core 0.8-SNAPSHOT and tika-parsers 0.8. Update PDFBox and all related libraries.
After that You have to patch Solr 1.4.1 following this patch : https://issues.apache.org/jira/browse/SOLR-1902?page=com.atlassian.jira.plugin.ext.subversion%3Asubversion-commits-tabpanel This is the firts way to solve the problem. Using Solr 1.4.1 (with tika 0.8 snapshot and pdfbox updated) no exception is thrown during the Indexing process, but no content is extracted. Using last Solr trunk (with tika 0.8 snapshot and pdfbox updated) all sounds good but we don't know how stableit is! I hope you have now a clear vision of this issue, Best Regards 2010/7/26 Sharp, Jonathan jsh...@coh.org Every so often I need to index new batches of scanned PDFs and occasionally Adobe's OCR can't recognize the text in a couple of these documents. In these situations I would like to type in a small amount of text onto the document and have it be extracted by Solr CELL. Adobe Pro 9 has a number of different ways to add text directly to a PDF file: *Typewriter *Sticky Note *Callout boxes *Text boxes I tried indexing documents with each of these text additions with Solr 1.4.1 + Solr CELL but can't extract the text in any of these boxes. If someone has modified their Solr CELL installation to use more recent versions of Tika (above 0.4) or PDFBox (above 0.7.3) and/or can can comment on whether newer versions can pull the text out of any of these various text boxes I'd appreciate that very much. -Jon - SECURITY/CONFIDENTIALITY WARNING: This
Re: Total number of terms in an index?
In trunk (flex) you can ask each segment for its unique term count. But to compute the unique term count across all segments is necessarily costly (requires merging them, to de-dup), as Hoss described. Mike On Tue, Jul 27, 2010 at 12:27 PM, Burton-West, Tom tburt...@umich.edu wrote: Hi Jason, Are you looking for the total number of unique terms or total number of term occurrences? Checkindex reports both, but does a bunch of other work so is probably not the fastest. If you are looking for total number of term occurrences, you might look at contrib/org/apache/lucene/misc/HighFreqTerms.java. If you are just looking for the total number of unique terms, I wonder if there is some low level API that would allow you to just access the in-memory representation of the tii file and then multiply the number of terms in it by your indexDivisor (default 128). I haven't dug in to the code so I don't actually know how the tii file gets loaded into a data structure in memory. If there is api access, it seems like this might be the quickest way to get the number of unique terms. (Of course you would have to do this for each segment). Tom -Original Message- From: Chris Hostetter [mailto:hossman_luc...@fucit.org] Sent: Monday, July 26, 2010 8:39 PM To: solr-user@lucene.apache.org Subject: Re: Total number of terms in an index? : Sorry, like the subject, I mean the total number of terms. it's not stored anywhere, so the only way to fetch it is to actually iteate all of the terms and count them (that's why LukeRequestHandler is slow slow to compute this particular value) If i remember right, someone mentioned at one point that flex would let you store data about stuff like this in your index as part of the segment writing, but frankly i'm still not sure how that iwll help -- because you unless your index is fully optimized, you still have to iterate the terms in each segment to 'de-dup' them. -Hoss
RE: Spellchecking and frequency
Mark, I'd like to see your code if you open a JIRA for this. I recently opened SOLR-2010 with a patch that does something similar to the second part only of what you describe (find combinations that actually return a match). But I'm not sure if my approach is the best one so I would like to see yours to compare. James Dyer E-Commerce Systems Ingram Book Company (615) 213-4311 -Original Message- From: Mark Holland [mailto:mark.holl...@zoopla.co.uk] Sent: Tuesday, July 27, 2010 1:04 PM To: solr-user@lucene.apache.org Subject: Re: Spellchecking and frequency Hi, I found the suggestions returned from the standard solr spellcheck not to be that relevant. By contrast, aspell, given the same dictionary and mispelled words, gives much more accurate suggestions. I therefore wrote an implementation of SolrSpellChecker that wraps jazzy, the java aspell library. I also extended the SpellCheckComponent to take the matrix of suggested words and query the corpus to find the first combination of suggestions which returned a match. This works well for my use case, where term frequency is irrelevant to spelling or scoring. I'd like to publish the code in case someone finds it useful (although it's a bit crude at the moment and will need a decent tidy up). Would it be appropriate to open up a Jira issue for this? Cheers, ~mark On 27 July 2010 09:33, dan sutton danbsut...@gmail.com wrote: Hi, I've recently been looking into Spellchecking in solr, and was struck by how limited the usefulness of the tool was. Like most corpora , ours contains lots of different spelling mistakes for the same word, so the 'spellcheck.onlyMorePopular' is not really that useful unless you click on it numerous times. I was thinking that since most of the time people spell words correctly why was there no other frequency parameter that could enter into the score? i.e. something like: spell_score ~ edit_dist * freq I'm sure others have come across this issue and was wonding what steps/algorithms they have used to overcome these limitations? Cheers, Dan
Re: Timeout in distributed search
: Is there any way to have time out support in distributed search? I : searched https://issues.apache.org/jira/browse/SOLR-502 but it looks like it is : not in the main release of solr1.4 note that issue is marked Fix Version/s: 1.3 ... that means it was fixed in Solr 1.3, well before 1.4 came out. You should also take a look at the functionality added in SOLR-850, which explicitly deals with hard timeouts in distributed searching... https://issues.apache.org/jira/browse/SOLR-850 ...that was first included in Solr 1.4 -Hoss
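If it helps: the timeout those issues added is exposed as the timeAllowed request parameter (in milliseconds), and a search that hits the limit is flagged with partialResults in the response header. A made-up example (hosts and value are placeholders):

    q=ipod&timeAllowed=1000&shards=solr1:8983/solr,solr2:8983/solr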
Re: SolrCore has a large number of SolrIndexSearchers retained in infoRegistry
: : I was wondering if anyone has found any resolution to this email thread? As Grant asked in his reply when this thread was first started (December 2009)... It sounds like you are either using embedded mode or you have some custom code. Are you sure you are releasing your resources correctly? ...there was no response to his question for clarification. the problem, given the info we have to work with, definitely seems to be that the custom code utilizing the SolrCore directly is not releasing the resources that it is using in every case. if you are calling the execute method, that means you have a SolrQueryRequest object -- which means you somehow got an instance of a SolrIndexSearcher (every SolrQueryRequest has one associated with it) and you are somehow not releasing that SolrIndexSearcher (probably because you are not calling close() on your SolrQueryRequest) But it really all depends on how you got ahold of that SolrQueryRequest/SolrIndexSearcher pair in the first place ... every method in SolrCore that gives you access to a SolrIndexSearcher is documented very clearly on how to release it when you are done with it so the ref count can be decremented. -Hoss
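For anyone hitting this from embedded or custom code, the usual fix is to close the request in a finally block so the searcher's reference count is always decremented. A minimal sketch (LocalSolrQueryRequest is just one way to build a request; core, handler, params and rsp are assumed to already exist):

    import org.apache.solr.request.LocalSolrQueryRequest;
    import org.apache.solr.request.SolrQueryRequest;

    SolrQueryRequest req = new LocalSolrQueryRequest(core, params);
    try {
      core.execute(handler, req, rsp);
    } finally {
      req.close(); // releases the SolrIndexSearcher ref held by this request
    }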
Re: help finding illegal chars in XML doc
: Thanks for your reply. I could not find in the log files any mention of : that. By the way I only have _MM_DD.request.log files in my directory. : : Do I have to enable any specific log or level to catch those errors? if you are using that java -jar start.jar command for the example jetty instance then the log messages i'm referring to are written directly to your console. if you are running solr in some other servlet container, then it all depends on the servlet container... http://wiki.apache.org/solr/SolrLogging http://wiki.apache.org/solr/LoggingInDefaultJettySetup -Hoss
Difficulties with Highlighting
I'm a relative beginner at SOLR, indexing and searching Unicode Tibetan texts. I am trying to use the highlighter but it just returns empty elements, such as: lst name=highlighting lst name=kt-d-0103-text-v4p262a/ /lst What am I doing wrong? The query that generated that is: http://www.thlib.org:8080/thdl-solr/thdl-texts/select?indent=onversion=2.2q=%E0%BD%91%E0%BD%84%E0%BD%B4%E0%BD%A3%E0%BC%8B%E0%BD%98%E0%BD%81%E0%BD%93%E0%BC%8B+AND+type%3Atextstart=0rows=10fl=*%2Cscoreqt=standardwt=standardhl=truehl.fl=pg_bohl.snippets=50 The hit is in the multivalued field named pg_bo and in a doc with that id #. I've looked at the various highlighting parameters (not that I fully understand them) and tried fiddling with those but nothing helped. I did notice that if you change to hl.fl=*, then you get the type field highlighted: lst name=highlighting lst name=kt-d-0103-text-v4p262a arr name=type stremtext/em/str /arr /lst /lst But that's not much help. We are using a custom Tibetan tokenizer for the Unicode Tibetan text fields. Would this have something to do with it? Any suggestions would be appreciated! Thanks for your help, Than Grove -- Nathaniel Grove Research Associate Technical Director Tibetan Himalayan Library University of Virginia http://www.thlib.org
Re: SolrCore has a large number of SolrIndexSearchers retained in infoRegistry
On Jul 27, 2010, at 12:21pm, Chris Hostetter wrote: : : I was wondering if anyone has found any resolution to this email thread? As Grant asked in his reply when this thread was first started (December 2009)... It sounds like you are either using embedded mode or you have some custom code. Are you sure you are releasing your resources correctly? ...there was no response to his question for clarification. the problem, given the info we have to work with, definitely seems to be that the custom code utilizing the SolrCore directly is not releasing the resources that it is using in every case. if you are calling the execute method, that means you have a SolrQueryRequest object -- which means you somehow got an instance of a SolrIndexSearcher (every SolrQueryRequest has one associated with it) and you are somehow not releasing that SolrIndexSearcher (probably because you are not calling close() on your SolrQueryRequest) One thing that bit me previously with using APIs in this area of Solr is that if you call CoreContainer.getCore(), this increments the open count, so you have to balance each getCore() call with a close() call. The naming here could be better - I think it's common to have an expectation that calls to get something don't change any state. Maybe openCore()? -- Ken But it really all depends on how you got ahold of that SolrQueryRequest/SolrIndexSearcher pair in the first place ... every method in SolrCore that gives you access to a SolrIndexSearcher is documented very clearly on how to release it when you are done with it so the ref count can be decremented. -Hoss Ken Krugler +1 530-210-6378 http://bixolabs.com e l a s t i c w e b m i n i n g
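A tiny sketch of the balanced pattern Ken describes ("mycore" is a placeholder name):

    SolrCore core = coreContainer.getCore("mycore"); // increments the open count
    try {
      // ... use the core ...
    } finally {
      core.close(); // decrements it again
    }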
Re: Difficulties with Highlighting
Than - Looks like maybe your text_bo field type isn't analyzing how you'd like? Though that's just a hunch. I pasted the value of that field returned in the link you provided into your analysis.jsp page and it chunked tokens by whitespace. Though I could be experiencing a copy/paste/i18n issue. Also looks like you're on Solr 1.3 - so it's likely quite worth upgrading to 1.4.1 (don't know if that directly affects this highlighting issue, just a general recommendation). Erik On Jul 27, 2010, at 3:43 PM, Nathaniel Grove wrote: I'm a relative beginner at SOLR, indexing and searching Unicode Tibetan texts. I am trying to use the highlighter but it just returns empty elements, such as: lst name=highlighting lst name=kt-d-0103-text-v4p262a/ /lst What am I doing wrong? The query that generated that is: http://www.thlib.org:8080/thdl-solr/thdl-texts/select?indent=onversion=2.2q=%E0%BD%91%E0%BD%84%E0%BD%B4%E0%BD%A3%E0%BC%8B%E0%BD%98%E0%BD%81%E0%BD%93%E0%BC%8B+AND+type%3Atextstart=0rows=10fl=*%2Cscoreqt=standardwt=standardhl=truehl.fl=pg_bohl.snippets=50 The hit is in the multivalued field named pg_bo and in a doc with that id #. I've looked at the various highlighting parameters (not that I fully understand them) and tried fiddling with those but nothing helped. I did notice that if you change to hl.fl=*, then you get the type field highlighted: lst name=highlighting lst name=kt-d-0103-text-v4p262a arr name=type stremtext/em/str /arr /lst /lst But that's not much help. We are using a custom Tibetan tokenizer for the Unicode Tibetan text fields. Would this have something to do with it? Any suggestions would be appreciated! Thanks for your help, Than Grove -- Nathaniel Grove Research Associate Technical Director Tibetan Himalayan Library University of Virginia http://www.thlib.org
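One more thing worth ruling out (my suggestion, not something raised in the thread): the highlighter can only build snippets from fields that are stored, so the pg_bo definition needs stored=true -- something along these lines in schema.xml, where the termVectors/termPositions/termOffsets attributes are optional but speed up highlighting of large fields:

    <field name="pg_bo" type="text_bo" indexed="true" stored="true"
           multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>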
Re: Querying throws java.util.ArrayList.RangeCheck
I am getting a similar error with today's nightly build: HTTP Status 500 - Index: 54, Size: 24 java.lang.IndexOutOfBoundsException: Index: 54, Size: 24 at java.util.ArrayList.RangeCheck(ArrayList.java:547) at java.util.ArrayList.get(ArrayList.java:322) at org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:264) at I'm adding and deleting a batch of documents. Currently, during indexing, there is a commit for each document. In some cases the document is deleted just before it is added, with a commit for the delete and a commit for the add. It appears that if I wait to commit until the end of all indexing, I avoid this error. Jason On Tue, Jul 27, 2010 at 10:25 AM, Manepalli, Kalyan kalyan.manepa...@orbitz.com wrote: Hi Yonik, I am using Solr 1.4 release dated Feb-9 2010. There is no custom code. I am using the regular out of box dismax requesthandler. The query is a simple one with 4 filter queries (fq's) and one sort query. During the index generation, I delete a set of rows based on a date filter, then add new rows to the index. Then another process queries the index and generates some stats and updates the index again. Not sure if during this process something is going wrong with the index. Thanks Kalyan -Original Message- From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley Sent: Tuesday, July 27, 2010 12:15 AM To: solr-user@lucene.apache.org Subject: Re: Querying throws java.util.ArrayList.RangeCheck Do you have any custom code, or is this stock solr (and which version, and what is the request)? -Yonik http://www.lucidimagination.com On Tue, Jul 27, 2010 at 12:30 AM, Manepalli, Kalyan kalyan.manepa...@orbitz.com wrote: Hi, I am stuck at this weird problem during querying. While querying the solr index I am getting the following error. Index: 52, Size: 16 java.lang.IndexOutOfBoundsException: Index: 52, Size: 16 at java.util.ArrayList.RangeCheck(ArrayList.java:547) at java.util.ArrayList.get(ArrayList.java:322) at org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:288) at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:217) at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:948) at org.apache.lucene.index.DirectoryReader.document(DirectoryReader.java:506) at org.apache.lucene.index.IndexReader.document(IndexReader.java:947) at org.apache.solr.search.SolrIndexReader.document(SolrIndexReader.java:444) at During debugging I found that the SolrIndexReader is trying to read a document which doesn't exist in the index. I tried optimizing the index and restarting the server but still no luck. Any help in resolving this issue will be appreciated. Thanks Kalyan
min/max, StatsComponent, performance
I thought I asked a variation of this before, but I don't see it on the list; apologies if this is a duplicate, but I have new questions. So I need to find the min and max value of a result set, which can be several million documents. One way to do this is the StatsComponent. One problem is that I'm having performance problems with StatsComponent across so many documents; adding the stats component on the field I'm interested in is adding 10s to my query response time. So one question is if there's any way to increase StatsComponent performance. Does it use any caches, or does it operate without caches? My Solr is running near the top of its heap size, although I'm not currently getting any OOM errors; perhaps not enough free memory is somehow hurting StatsComponent performance. Or any other ideas for increasing StatsComponent performance? But it also occurs to me that the StatsComponent is doing a lot more than I need. I just need min/max. And the cardinality of this field is a couple orders of magnitude lower than the total number of documents. But StatsComponent is also doing a bunch of other things, like sum, median, etc. Perhaps if there were a way to _just_ get min/max, it would be faster. Is there any way to get min/max values in a result set other than StatsComponent? Jonathan
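One workaround worth trying (my suggestion; it didn't come up in the thread): if min and max are all you need, two cheap queries with rows=1 sorted on the field avoid StatsComponent entirely. Assuming the field (called price here as a stand-in) is indexed and sortable:

    q=*:*&fq=<your filters>&sort=price asc&rows=1&fl=price    -> the one row holds the min
    q=*:*&fq=<your filters>&sort=price desc&rows=1&fl=price   -> the one row holds the max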
Indexing Problem: Where's my data?
Hi, (The first version of this was rejected for spam). I'm setting up a test instance of Solr, and keep running into the problem of having Solr not work the way I think it should work. Specifically, the data I want to go into the index isn't there after indexing. I'm extracting the data from MSSQL via DataImportHandler, JDBC 4.0. My data is set up so that for every product ID there is one category (hierarchical, but I'm not dealing with that ATM), a family, and a set of attributes (which includes name, etc). After indexing, I get Category, Family, and Product ID - but nothing from my attribute values (STRING_VALUE, below) - which is the most useful data. Is there something wrong with my schema? I thought it might be that the schema.xml file wasn't respecting the names I assigned via the DataImportHandler; when I changed to the column names in the schema.xml, I picked up Family and Category (previously, it was only product ID). I'm really banging my head against the wall at this point, so I'd appreciate any help. My next step will probably be to do a considerably more complicated denormalization (in terms of the SQL), which would make the Solr end simpler (but that has problems of its own). Config information below. Any help appreciated. Thanks, Michael Data Config: dataConfig dataSource driver=com.microsoft.sqlserver.jdbc.SQLServerDriver url=jdbc:sqlserver://localhost\DEVELOPMENT/Databases/data:1433 / document name=products entity onError=continue name=product query=select Product_ID,Category_ID from TB_Product field column=PRODUCT_ID name=pid / field column=CATEGORY_ID name=cid / entity name=facets query=select * from TB_PROD_SPECS where PRODUCT_ID=${product.Product_ID} field column=STRING_VALUE / field column=NUMERIC_VALUE / entity name=attributes query=select ATTRIBUTE_NAME,ATTRIBUTE_TYPE from TB_ATTRIBUTE where ATTRIBUTE_ID=${facets.ATTRIBUTE_ID} field column=Attribute_Name name=Attribute Name / /entity /entity entity name=category query=select CATEGORY_NAME,PARENT_CATEGORY from TB_CATEGORY where CATEGORY_ID='${product.Category_ID}' field column=Category_Name name=Category / field column=Parent_Category name=Parent Category / /entity entity name=family_id query=select FAMILY_ID from TB_PROD_FAMILY where Product_ID = ${product.Product_ID} entity name=family query=select FAMILY_Name,PARENT_FAMILY_ID,ROOT_FAMILY,CATEGORY_ID from TB_Family where Family_ID = ${family_id.FAMILY_ID} field column=FAMILY_NAME name=Family / field column=ROOT_FAMILY name=Root Family / field column=PARENT_FAMILY name=Parent Family / field column=Category_id name=Category ID / /entity /entity /entity /document /dataConfig Schema: fields field name=Product_ID type=int indexed=true stored=true required=true / field name=Family_NAME type=textTight indexed=true stored=false multivalued=true/ field name=Category_Name type=textTight indexed=true stored=true multiValued=true omitNorms=true / field name=STRING_VALUE type=textTight indexed=true stored=false multivalued=true/ field name=ATTRIBUTE_NAME type=textTight indexed=true stored=false multivalued=true/ field name=text type=text indexed=true stored=false multiValued=true/ dynamicField name=*_i type=stringindexed=true stored=true multivalued=true/ /fields uniqueKeyProduct_ID/uniqueKey defaultSearchFieldtext/defaultSearchField solrQueryParser defaultOperator=OR/ copyField source=* dest=text/
RE: Querying throws java.util.ArrayList.RangeCheck
Yonik, One more update on this. I used the filter query that was throwing the error and used it to delete a subset of results. After that, the queries started working correctly, which indicates that the particular docId was present in the index somewhere, but Lucene was not able to find it. -Kalyan -Original Message- From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley Sent: Tuesday, July 27, 2010 4:46 PM To: solr-user@lucene.apache.org Subject: Re: Querying throws java.util.ArrayList.RangeCheck I haven't been able to reproduce anything... But if you guys are sure you're not running any custom code, then there definitely seems to be a bug somewhere. Can anyone reproduce this in something you can share? -Yonik http://www.lucidimagination.com
Re: Indexing Problem: Where's my data?
For STRING_VALUE, I assume there is a property in the 'select *' results called string_value? If so, I'm not sure why it wouldn't work. If not, then that's why: it doesn't have anything to put there. For ATTRIBUTE_NAME, is it possibly a case issue? You called it 'Attribute_Name' in your query, but ATTRIBUTE_NAME in your schema... just something to check I guess. Also, not sure why you are using name= in your fields, for example, field column=PARENT_FAMILY name=Parent Family / I thought 'column' was the source field name and 'name' was supposed to be the schema field name, and that if 'name' is absent it would fall back to the 'column' name. You don't have a schema field called Parent Family, so it looks like it's defaulting to the column name too, which is lucky for you I suppose. But you may want to either remove 'name=' or make it match the schema. (And I may be completely wrong on this, it's been a while since I got DIH going.) -- View this message in context: http://lucene.472066.n3.nabble.com/Indexing-Problem-Where-s-my-data-tp1000660p1000843.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Highlighting parameters wiki
(10/07/27 23:16), Stephen Green wrote: The wiki entry for hl.highlightMultiTerm: http://wiki.apache.org/solr/HighlightingParameters#hl.highlightMultiTerm doesn't appear to be correct. It says: If the SpanScorer is also being used, enables highlighting for range/wildcard/fuzzy/prefix queries. Default is false. But the code in DefaultSolrHighlighter (both on the 1.4 branch that I'm using and in the trunk) does:

    Boolean highlightMultiTerm =
        request.getParams().getBool(HighlightParams.HIGHLIGHT_MULTI_TERM, true);
    if (highlightMultiTerm == null) {
      highlightMultiTerm = false;
    }

which looks to me like it's going to default to true, since getBool will never return null, and if it gets a null value from the parameters internally, it will return true. Shall I file a Jira on this one? Perhaps it's easier just to fix the Wiki page? Steve Hi Steve, Please just fix the wiki page. Thank you for reporting this! Koji -- http://www.rondhuit.com/en/
How to 'filter' facet results
Is there a way to tell Solr to only return a specific set of facet values? I feel like the facet query must be able to do this, but I'm not really understanding the facet query. In my specific case, I'd like to only see facet values for the same values I pass in as query filters, i.e. if I run this query: fq=keyword:man OR keyword:bear OR keyword:pig facet=on facet.field=keyword then I only want it to return the facet counts for man, bear, and pig. The resulting docs might have a number of different values for keyword, in addition to those specified in the filter, because keyword is a multiValued field. How can I tell it to only return the facet values for man, bear, and pig? On the client side I could programmatically remove the other facets that I don't care about, except that the resulting docs could return hundreds of different values. If I were faceting on a single value, I could say facet.prefix=man, and that would work, but mostly I need this to work for more than one filter value. Is there a way to set multiple facet.prefix values? Any ideas? -dKt
RE: How to 'filter' facet results
Is there a way to tell Solr to only return a specific set of facet values? I feel like the facet query must be able to do this, but I'm not really understanding the facet query. In my specific case, I'd like to only see facet values for the same values I pass in as query filters, i.e. if I run this query: fq=keyword:man OR keyword:bear OR keyword:pig facet=on facet.field=keyword then I only want it to return the facet counts for man, bear, and pig. The resulting docs might have a number of different values for keyword, in addition For the general case of filtering facet values, I've wanted to do that too in more complex situations, and there is no good way I've found. For your very specific use case though, yeah, you can do it with facet.query. Leave out the facet.field, but instead: facet.query=keyword:man facet.query=keyword:bear facet.query=keyword:pig You'll get three facet.query results in the response, one each for man, bear, pig. Solr behind the scenes will kind of do three separate 'sub-queries', one for each facet.query, but since the query itself should be cached, you shouldn't notice much difference. Especially if you have a warming query that facets on the keyword field (I'm never entirely sure when caches created by warming queries will be used by a facet.query, or if it depends on the facet method in use, but it can't hurt). Jonathan
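Putting that together, the whole request would look something like this (URL-encoding of spaces omitted for readability); each facet.query then comes back with its own count under facet_counts/facet_queries in the response:

    q=*:*&fq=keyword:man OR keyword:bear OR keyword:pig&facet=true
      &facet.query=keyword:man&facet.query=keyword:bear&facet.query=keyword:pig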
Re: Tika, Solr running under Tomcat 6 on Debian
I would start over from the Solr 1.4.1 binary distribution and follow the instructions on the wiki: http://wiki.apache.org/solr/ExtractingRequestHandler (Java classpath stuff is notoriously difficult, especially when dynamically configured and loaded. I often cannot tell if Java cannot load the class it prints, or if that class requires others.) On Sat, Jul 24, 2010 at 11:21 PM, Tim AtLee timat...@gmail.com wrote: Hello, I desperately hope someone can help me here... I'm a bit out of my league. I am trying to implement content extraction using Tika and Solr as part of a search package for a product I am using. I have been successful in getting Solr to work so far as indexing text, and returning search results, however I am hitting a wall when I try to use Tika for content extraction. I add the following configuration to solrconfig.xml: requestHandler name=/extract/tika class=org.apache.solr.handler.extraction.ExtractingRequestHandler lst name=defaults /lst !-- This path only extracts - never updates -- lst name=invariants bool name=extractOnlytrue/bool /lst /requestHandler During a test, I receive the following error: org.apache.solr.common.SolrException: Error loading class 'org.apache.solr.handler.extraction.ExtractingRequestHandler' The full text of this error is listed below. So, as I indicated in the subject line, I am using Debian Linux Squeeze (testing). Tomcat is at version 6.0.26 and is installed by apt. Solr is also installed from apt, and is at version: 1.4.0.2010.04.24.07.20.22. Java -version looks like this: java version 1.6.0_20 Java(TM) SE Runtime Environment (build 1.6.0_20-b02) The JDK is also at the same version, and also from apt. I have built Tika from source (nightly build) using mvn2, and placed the compiled jars in /lib. /lib is located at /var/solr/site/lib, along with /var/solr/site/conf and /var/solr/site/data. Hopefully this is the right place to put the jars. I also tried building solr from source (also the nightly build), and was able to get solr sort of working (not Tika). I could run a single instance, but getting multiple instances running didn't seem to be in the cards. I didn't pursue this any further. If this is the route I should go down, if anyone can direct me on how to install a built Solr war and configure it so I can use multiple instances, I'll gladly try it out. I found a similar issue to mine at http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200911.mbox/%3cd2b0462d72664840b72118cb4437cbd403e2a...@ndhamrexm22.amer.pfizer.com%3e. From that email, I tried copying the built Solr jars into the Solr site's lib directory, then realized that the likelihood of that working was pretty slim - jars built from a nightly build trying to work with a .war from 1.4.0 was probably not going to work. As you might have guessed, it didn't. This is when I tried building Solr from source (thinking that if all the Solr stuff was at the same revision, it might work). I have not tried all of this under Jetty. It's my understanding that Jetty won't let me do multiple instances, and since this is a requirement for what I'm doing, I'm more or less constrained to Tomcat. I have also seen some other references to using OpenJDK instead of Sun JDK. This resulted in the same error (don't recall the site where I saw this referenced). Any help would be greatly appreciated.
I am new to Tomcat and Solr, so I may have some dumb follow-up questions that will be googled thoroughly first. Sorry in advance.. Tim -- org.apache.solr.common.SolrException: Error loading class 'org.apache.solr.handler.extraction.ExtractingRequestHandler' at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:373) at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:414) at org.apache.solr.core.SolrCore.createRequestHandler(SolrCore.java:450) at org.apache.solr.core.RequestHandlers.initHandlersFromConfig(RequestHandlers.java:152) at org.apache.solr.core.SolrCore.<init>(SolrCore.java:557) at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:137) at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:83) at org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:295) at org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:422) at org.apache.catalina.core.ApplicationFilterConfig.<init>(ApplicationFilterConfig.java:115) at org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:3838) at
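Incidentally, Solr 1.4's solrconfig.xml can point at extra jar directories explicitly, which takes some of the guesswork out of where the Tika and extraction jars must live -- e.g., using the directory from Tim's setup:

    <lib dir="/var/solr/site/lib" />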
Re: Spellchecking and frequency
Yonik's Law of Patches reads: A half-baked patch in Jira, with no documentation, no tests and no backwards compatibility is better than no patch at all. It'd be perfectly appropriate, IMO, for you to post an outline of what your enhancements do over on the SOLR dev list and get a reaction from the folks over there as to whether it should be a Jira or not... see solr-...@lucene.apache.org Best Erick On Tue, Jul 27, 2010 at 2:04 PM, Mark Holland mark.holl...@zoopla.co.uk wrote: Hi, I found the suggestions returned from the standard solr spellcheck not to be that relevant. By contrast, aspell, given the same dictionary and misspelled words, gives much more accurate suggestions. I therefore wrote an implementation of SolrSpellChecker that wraps jazzy, the java aspell library. I also extended the SpellCheckComponent to take the matrix of suggested words and query the corpus to find the first combination of suggestions which returned a match. This works well for my use case, where term frequency is irrelevant to spelling or scoring. I'd like to publish the code in case someone finds it useful (although it's a bit crude at the moment and will need a decent tidy up). Would it be appropriate to open up a Jira issue for this? Cheers, ~mark On 27 July 2010 09:33, dan sutton danbsut...@gmail.com wrote: Hi, I've recently been looking into Spellchecking in solr, and was struck by how limited the usefulness of the tool was. Like most corpora, ours contains lots of different spelling mistakes for the same word, so the 'spellcheck.onlyMorePopular' is not really that useful unless you click on it numerous times. I was thinking that since most of the time people spell words correctly why was there no other frequency parameter that could enter into the score? i.e. something like: spell_score ~ edit_dist * freq I'm sure others have come across this issue and was wondering what steps/algorithms they have used to overcome these limitations? Cheers, Dan
Re: Solr 3.1 and ExtractingRequestHandler resulting in blank content
There are two different datasets that Solr (Lucene really) saves from a document: raw storage and the indexed terms. I don't think the ExtractingRequestHandler ever automatically stored the raw data; in fact Lucene works in Strings internally, not raw byte arrays (this is changing). It should be indexed - that means if you search 'text' with a word from the document, it will find those documents and bring back the file name. Your app has to then use the file name. Solr/Lucene is not intended as a general-purpose content store, only an index. The ERH wiki page doesn't quite say this. It describes what the ERH does rather than what it does not do :) On Mon, Jul 26, 2010 at 12:00 PM, David Thibault dthiba...@esperion.com wrote: Hello all, I’m working on a project with Solr. I had 1.4.1 working OK using ExtractingRequestHandler except that it was crashing on some PDFs. I noticed that Tika bundled with 1.4.1 was 0.4, which was kind of old. I decided to try updating to 0.7 as per the directions here: http://wiki.apache.org/solr/ExtractingRequestHandler but it was giving me errors (I forget what they were specifically). Then I tried downloading Solr 3.1 from the source repository, which I noticed came with Tika 0.7. I figured this would be an easier route to get working. Now I’m testing with 3.1 and 0.7 and I’m noticing my documents are going into Solr OK, but they all have blank content (no document text stored in Solr). I did see that the default “text” field is not stored. Changing that to stored=true didn’t help. Changing to fmap.content=attr_content&uprefix=attr_content didn’t help either. I have attached all relevant info here. Please let me know if someone sees something I don’t (it’s entirely possible as I’m relatively new to Solr). Schema.xml: ?xml version=1.0 encoding=UTF-8 ?
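For reference, the usual way to exercise the handler and keep the extracted text is to map the content onto a stored field and post a file with curl -- roughly like this, where the id value and file name are placeholders and the target field (text here) must be stored=true in schema.xml for the content to come back in search results:

    curl "http://localhost:8983/solr/update/extract?literal.id=doc1&fmap.content=text&commit=true" -F "myfile=@test.pdf"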
schema name=example version=1.3 types fieldType name=string class=solr.StrField sortMissingLast=true omitNorms=true/ fieldType name=boolean class=solr.BoolField sortMissingLast=true omitNorms=true/ fieldtype name=binary class=solr.BinaryField/ fieldType name=int class=solr.TrieIntField precisionStep=0 omitNorms=true positionIncrementGap=0/ fieldType name=float class=solr.TrieFloatField precisionStep=0 omitNorms=true positionIncrementGap=0/ fieldType name=long class=solr.TrieLongField precisionStep=0 omitNorms=true positionIncrementGap=0/ fieldType name=double class=solr.TrieDoubleField precisionStep=0 omitNorms=true positionIncrementGap=0/ fieldType name=tint class=solr.TrieIntField precisionStep=8 omitNorms=true positionIncrementGap=0/ fieldType name=tfloat class=solr.TrieFloatField precisionStep=8 omitNorms=true positionIncrementGap=0/ fieldType name=tlong class=solr.TrieLongField precisionStep=8 omitNorms=true positionIncrementGap=0/ fieldType name=tdouble class=solr.TrieDoubleField precisionStep=8 omitNorms=true positionIncrementGap=0/ fieldType name=date class=solr.TrieDateField omitNorms=true precisionStep=0 positionIncrementGap=0/ fieldType name=tdate class=solr.TrieDateField omitNorms=true precisionStep=6 positionIncrementGap=0/ fieldType name=pint class=solr.IntField omitNorms=true/ fieldType name=plong class=solr.LongField omitNorms=true/ fieldType name=pfloat class=solr.FloatField omitNorms=true/ fieldType name=pdouble class=solr.DoubleField omitNorms=true/ fieldType name=pdate class=solr.DateField sortMissingLast=true omitNorms=true/ fieldType name=sint class=solr.SortableIntField sortMissingLast=true omitNorms=true/ fieldType name=slong class=solr.SortableLongField sortMissingLast=true omitNorms=true/ fieldType name=sfloat class=solr.SortableFloatField sortMissingLast=true omitNorms=true/ fieldType name=sdouble class=solr.SortableDoubleField sortMissingLast=true omitNorms=true/ fieldType name=random class=solr.RandomSortField indexed=true / fieldType name=text_ws class=solr.TextField positionIncrementGap=100 analyzer tokenizer class=solr.WhitespaceTokenizerFactory/ /analyzer /fieldType fieldType name=text class=solr.TextField positionIncrementGap=100 autoGeneratePhraseQueries=true analyzer type=index tokenizer class=solr.WhitespaceTokenizerFactory/ filter class=solr.StopFilterFactory ignoreCase=true words=stopwords.txt enablePositionIncrements=true / filter class=solr.WordDelimiterFilterFactory generateWordParts=1 generateNumberParts=1 catenateWords=1 catenateNumbers=1 catenateAll=0 splitOnCaseChange=1/ filter class=solr.LowerCaseFilterFactory/ filter class=solr.KeywordMarkerFilterFactory protected=protwords.txt/ filter class=solr.PorterStemFilterFactory/ /analyzer analyzer type=query tokenizer class=solr.WhitespaceTokenizerFactory/ filter
Re: slave index is bigger than master index
Ah! You have junk files piling up in the slave index directory. When this happens, you may have to remove data/index entirely. I'm not sure if Solr replication will handle that, or if you have to copy the whole index to reset it. You said the slaves time out - maybe the files are so large that the master and slave need socket timeouts changed? In solrconfig.xml, these two lines control that. Maybe they need to be increased. str name=httpConnTimeout5000/str str name=httpReadTimeout1/str On Tue, Jul 27, 2010 at 3:59 AM, Peter Karich peat...@yahoo.de wrote: We have three dedicated servers for solr, two for slaves and one for master, all with linux/debian packages installed. I understand that replication always copies over the index in an exact form as in the master index directory (or it is supposed to do that at least), and if the master index was optimized after indexing, one doesn't need to run an optimize call again on master to optimize the slave's index. But in our case that's what fixed it and I agree it is even more confusing now :s That's why I said: try it on the slaves too ;-) In our case it helped too to shrink 2*index to 1*index. I think the data which is necessary for the replication won't be cleaned up before the next replication or before an optimize. For us it was crucial to shrink the size because of limited disc-resources and to make sure that the next replication does not increase the index to 3*times of the initial size. @muneeb so I think, optimization is not necessary or do you have disc limitations too? @Hoss or others: does this explanation sound logical? Another problem is, we are serving live services using slave nodes, so I don't want to affect the live search while playing with slave nodes' indices. What do you mean here? Optimizing is too CPU expensive? We will be running the indexing on master node today over the night. Let's see if it does it again. Do you mean increase to double size? -- Lance Norskog goks...@gmail.com
Re: Indexing Problem: Where's my data?
Solr respects case for field names. Database fields are supplied in lower-case, so it should be 'attribute_name' and 'string_value'. Also 'product_id', etc. It is easier if you carefully emulate every detail in the examples, for example lower-case names. On Tue, Jul 27, 2010 at 2:59 PM, kenf_nc ken.fos...@realestate.com wrote: For STRING_VALUE, I assume there is a property in the 'select *' results called string_value? If so, I'm not sure why it wouldn't work. If not, then that's why: it doesn't have anything to put there. For ATTRIBUTE_NAME, is it possibly a case issue? You called it 'Attribute_Name' in your query, but ATTRIBUTE_NAME in your schema... just something to check I guess. Also, not sure why you are using name= in your fields, for example, field column=PARENT_FAMILY name=Parent Family / I thought 'column' was the source field name and 'name' was supposed to be the schema field name, and that if 'name' is absent it would fall back to the 'column' name. You don't have a schema field called Parent Family, so it looks like it's defaulting to the column name too, which is lucky for you I suppose. But you may want to either remove 'name=' or make it match the schema. (And I may be completely wrong on this, it's been a while since I got DIH going.) -- View this message in context: http://lucene.472066.n3.nabble.com/Indexing-Problem-Where-s-my-data-tp1000660p1000843.html Sent from the Solr - User mailing list archive at Nabble.com. -- Lance Norskog goks...@gmail.com
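Following that advice, the facets sub-entity from the original config would be rewritten along these lines (a sketch based on the config posted earlier; the lower-case column keys are exactly the assumption being tested):

    <entity name="facets" query="select * from TB_PROD_SPECS where PRODUCT_ID='${product.product_id}'">
      <field column="string_value" name="STRING_VALUE"/>
    </entity>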
Re: DIH : SQL query (sub-entity) is executed although variable is not set (null or empty list)
Should this go into the trunk, or does it only solve problems unique to your use case? On Tue, Jul 27, 2010 at 5:49 AM, Chantal Ackermann chantal.ackerm...@btelligent.de wrote: Hi Mitch, thanks for the code. Currently, I've got a different solution running but it's always good to have examples. I realized that I have to throw an exception and add the onError attribute to the entity to make that work. I am curious: Can you show how to make a method throwing an exception that is accepted by the onError-attribute? The catch clause looks for Exception so it's actually easy. :-D Anyway, I've found a cleaner way. It is better to subclass the XPathEntityProcessor and put it in a state that prevents it from calling initQuery, which triggers the dataSource.getData() call. I have overridden the initContext() method, setting a go/no go flag that I am using in the overridden nextRow() to find out whether to delegate to the superclass or not. This way I can also avoid the code that fills the tmp field with an empty value if there is no value to query on. Cheers, Chantal -- Lance Norskog goks...@gmail.com
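A rough, untested sketch of the subclass approach Chantal describes -- the hook names vary between DIH versions (she mentions initContext(); init(Context) is the public entry point in 1.4), and ${parent.someField} stands in for whatever variable the sub-entity depends on:

    import java.util.Map;
    import org.apache.solr.handler.dataimport.Context;
    import org.apache.solr.handler.dataimport.XPathEntityProcessor;

    public class GuardedXPathEntityProcessor extends XPathEntityProcessor {
      private boolean hasInput;

      @Override
      public void init(Context context) {
        // resolve the variable this entity depends on; skip work if it's empty
        String key = context.replaceTokens("${parent.someField}");
        hasInput = key != null && key.trim().length() > 0;
        if (hasInput) {
          super.init(context); // only then let the superclass prepare the query
        }
      }

      @Override
      public Map<String, Object> nextRow() {
        // returning null means "no rows", so dataSource.getData() is never reached
        return hasInput ? super.nextRow() : null;
      }
    }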
Re: Russian stemmer
I have studied some Russian. I kind of got the picture from the texts that all the exceptions had already been 'found', and were listed in the book. I do know that languages are living, changing organisms, but Russian has got to be more regular than English I would think, even WITH all six cases and 3 genders. Dennis Gearon Signature Warning EARTH has a Right To Life, otherwise we all die. Read 'Hot, Flat, and Crowded' Laugh at http://www.yert.com/film.php --- On Tue, 7/27/10, Robert Muir rcm...@gmail.com wrote: From: Robert Muir rcm...@gmail.com Subject: Re: Russian stemmer To: solr-user@lucene.apache.org Date: Tuesday, July 27, 2010, 7:12 AM right, but your problem is this is the current output: Ковров - Ковр Коврову - Ковров Ковровом - Ковров Коврове - Ковров so, if Ковров was simply left alone, all your forms would match... 2010/7/27 Oleg Burlaca o...@burlaca.com Thanks Robert for all your help, The idea of [A-Z].* stopwords is ideal for the English language, although in Russian nouns are inflected: Борис, Борису, Бориса, Борисом I'll try the RussianLightStemFilterFactory (the article in the PDF mentioned it's more accurate). Once again thanks, Oleg Burlaca On Tue, Jul 27, 2010 at 12:07 PM, Robert Muir rcm...@gmail.com wrote: 2010/7/27 Oleg Burlaca o...@burlaca.com Actually the situation with Немцов is ok, I've just checked how Yandex works with Немцов and Немцова: http://nano.yandex.ru/project/inflect/ I think there are two solutions: a) manually search for both Немцов and then Немцова b) use wildcard query: Немцов* Well, here is one idea of a more general solution. The problem with protected words is you must have a complete list. One idea would be to add a filter that protects any words from stemming that match a regular expression: In English maybe someone wants to avoid any capitalized words to reduce trouble: [A-Z].* in your case then some pattern like [A-Я].*ов might prevent problems. Robert, thanks for the RussianLightStemFilterFactory info, I've found this page http://www.mail-archive.com/solr-comm...@lucene.apache.org/msg06857.html that somehow describes it. Where can I read more about RussianLightStemFilterFactory ? Here is the link: http://doc.rero.ch/lm.php?url=1000,43,4,20091209094227-CA/Dolamic_Ljiljana_-_Indexing_and_Searching_Strategies_for_the_Russian_20091209.pdf Regards, Oleg 2010/7/27 Oleg Burlaca o...@burlaca.com A similar word is Немцов. The strange thing is that searching for Немцова will not find documents containing Немцов Немцова: 14 articles http://www.sova-center.ru/search/?lg=1q=%D0%BD%D0%B5%D0%BC%D1%86%D0%BE%D0%B2%D0%B0 Немцов: 74 articles http://www.sova-center.ru/search/?lg=1q=%D0%BD%D0%B5%D0%BC%D1%86%D0%BE%D0%B2 -- Robert Muir rcm...@gmail.com -- Robert Muir rcm...@gmail.com
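For what it's worth, Robert's protect-by-regex idea can be sketched as a TokenFilter against the Lucene 3.x analysis API (my sketch; as far as I know no stock filter did this at the time). Tokens matching the pattern are marked as keywords so that KeywordAttribute-aware stemmers leave them alone; it would sit in the analyzer chain just before the stemmer, built with e.g. Pattern.compile("[А-Я].*ов"):

    import java.io.IOException;
    import java.util.regex.Pattern;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.KeywordAttribute;

    public final class PatternProtectFilter extends TokenFilter {
      private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
      private final KeywordAttribute keywordAtt = addAttribute(KeywordAttribute.class);
      private final Pattern pattern;

      public PatternProtectFilter(TokenStream in, Pattern pattern) {
        super(in);
        this.pattern = pattern;
      }

      @Override
      public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) return false;
        // CharTermAttribute is a CharSequence, so it can be matched directly
        if (pattern.matcher(termAtt).matches()) {
          keywordAtt.setKeyword(true); // stemmers that honor this skip the token
        }
        return true;
      }
    }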