Fwd: facet.pivot and facet.sort does not work with fq
Hello again! The missing pivot facet when sorting by index can also be reproduced in Solr 4.3.1. Does anyone have an idea how to debug this? Best regards, Johannes

-- Forwarded message --
From: jotpe jotpe@gmail.com
Date: 2013/6/25
Subject: facet.pivot and facet.sort does not work with fq
To: solr-user@lucene.apache.org

Hello, I'm trying to display a hierarchical structure with a facet.pivot, which should be sorted by index. I followed the idea from http://wiki.apache.org/solr/HierarchicalFaceting#Pivot_Facets and created path_levelX fields from 0 to 7. My tokens are not unique per level and I need to sort them like in the original structure, so I added a prefix consisting of a sort-order number of static length and a unique id (always 8 digits). Later this prefix will be hidden by using substring.

Format: SORTORDER/UNIQUE_ID/NAME_TO_DISPLAY

Example:
path_level0:000/123/Chief
path_level0:000/123/Chief path_level1:000/124/Staff
path_level0:000/123/Chief path_level1:000/124/Staff path_level2:00/125/Chief
path_level0:001/126/Legal Adviser

Displaying the pivot works fine.

Sorted by count, OK:
http://localhost:8080/solr/collection1/select?wt=xml&q=*:*&rows=2&facet=on&facet.pivot=path_level1,path_level2,path_level3&facet.pivot.mincount=1&facet.sort=count

Sorted by index, OK:
http://localhost:8080/solr/collection1/select?wt=xml&q=*:*&rows=2&facet=on&facet.pivot=path_level1,path_level2,path_level3&facet.pivot.mincount=1&facet.sort=index

Now I must reduce my global structure to one office by using the fq parameter.

Reduced to one office, sorted by count, OK:
http://localhost:8080/solr/collection1/select?wt=xml&q=*:*&rows=2&facet=on&facet.pivot=path_level1,path_level2,path_level3&facet.pivot.mincount=1&facet.sort=count&fq=office:xyz

Reduced to one office, sorted by index: failure.
http://localhost:8080/solr/collection1/select?wt=xml&q=*:*&rows=2&facet=on&facet.pivot=path_level1,path_level2,path_level3&facet.pivot.mincount=1&facet.sort=index&fq=office:xyz

The facet.pivot element stays empty, so what is wrong?

<lst name="facet_pivot"><arr name="path_level1,path_level2,path_level3"/></lst>

Maybe this is a bug... On the other hand, maybe this is a bad way to obtain a hierarchical structure with a custom sort. Better ideas? Best regards, Johannes
Re: Shard identification
When you say you moved to different machines, did you copy the zoo_data from your old setup, or did you just start up ZooKeeper and your shards one by one? Also, did you use the Collection API to create the collection, or just start up your cores and let them attach to ZK?

I believe the ZK rules for assigning shards have changed somewhere around 4.2. We had a setup with 4.0 and it simply assigned them in order: shard 1, shard 2, shard 3, etc., and then when all shards were filled, it started with replicas. In 4.3 (we skipped the intermediates) the ordering wasn't obvious; I had to do a bit of trial and error to determine the right order to start things in to get shard assignments correct, but that isn't really the recommended way of doing it.

If you want specific assignments (cores to shards), then I think the Core API/Collection API are the recommended way to go. Create a collection using the Collection API (http://wiki.apache.org/solr/SolrCloud) and then copy the data to the right servers once it has assigned the shards (it should make sure that replicas don't exist on the same machine, and things like that).

I believe the general direction (for the next major Solr release) is to start a system with a blank solr.xml and create cores/collections that way, rather than have a file and then have to connect to ZK and merge the data with what's there. We have a slightly odd requirement in that we need to determine the dataDir for each core, and I haven't yet worked out the right sequence of commands (the Collection API doesn't support dataDir but the Core API does). It should be possible though, I just haven't found the time to get to it!

On 25 June 2013 18:40, Erick Erickson erickerick...@gmail.com wrote:

Try sending requests to your shards with distrib=false. See if the results agree with the SolrCloud graph, or whether the docs you get back are inconsistent with the shard labels in the admin page. The distrib=false bit keeps the query from going to other shards and will tell you whether the current state is consistent or not. Best, Erick

On Tue, Jun 25, 2013 at 1:02 AM, Shalin Shekhar Mangar shalinman...@gmail.com wrote:

Firstly, using 1 ZooKeeper machine is not at all ideal. See http://wiki.apache.org/hadoop/ZooKeeper/FAQ#A7 I've never personally seen such an issue. Can you give screenshots of the cloud graph on each node? Use an image hosting service because the mailing list won't allow attachments.

On Tue, Jun 18, 2013 at 2:07 PM, Ophir Michaeli micha...@wesee.com wrote:

Hi, I built a 2-shard, 2-replica system that works OK on a local machine, with 1 ZooKeeper on shard 1. It appears OK on the Solr monitoring page, cloud tab (http://localhost:8983/solr/#/~cloud). When I move to using different machines, each shard/replica on a different machine, I get a wrong cloud graph on the Solr monitoring page. The machine that has shard 2 appears on the graph on shard 1, and the replicas are also mixed: shard 2 appears as 1 and shard 1 appears as 2. Any ideas why this happens? Thanks, Ophir

-- Regards, Shalin Shekhar Mangar.
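For reference, the non-distributed check Erick describes and the Collection API call Daniel mentions look roughly like the following; the host name, collection name and shard/replica counts are placeholders, not values from this thread:

http://host1:8983/solr/collection1/select?q=*:*&fl=id&rows=10&distrib=false
http://host1:8983/solr/admin/collections?action=CREATE&name=collection1&numShards=2&replicationFactor=2

The first form queries only the core it is sent to, so the returned ids can be compared against the cloud graph; the second lets the Collection API pick the shard/replica layout instead of relying on core start-up order.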
OOM fieldCache problem
Hi all, I have some memory problems (OOM) with Solr 3.5.0, and I suspect it has something to do with the fieldCache. The entry count of the fieldCache grows and grows; why is it not rebuilt after a commit? I commit every 60 seconds, but the memory consumption of Solr increased within one day from 2GB to 10GB (index size: ~200MB). I tried to solve the problem by reducing the other cache sizes (filterCache, documentCache, queryResultCache). That delayed the OOM exception, but it did not solve the problem that the memory consumption increases continuously. Is it possible to reset the fieldCache explicitly? Markus
Re: URL search and indexing
Ok, thank you all for the great help! Now I'm ready to start playing with my index! Best, Flavio

On Tue, Jun 25, 2013 at 11:40 PM, Jack Krupansky j...@basetechnology.com wrote:

Yeah, URL Classify only does so much. That's why you need to combine multiple methods. As a fourth method, you could code up a short JavaScript StatelessScriptUpdateProcessor that did something like take a full domain name (such as output by URL Classify) and turn it into multiple values, each with more of the prefix removed, so that lucene.apache.org would index as:

lucene.apache.org
apache.org
apache
.org
org

And then the user could query by any of those partial domain names. But if you simply tokenize the URL (copy the URL string to a text field), you automatically get most of that. The user can query by a URL fragment, such as apache.org, .org, lucene.apache.org, etc., and the tokenization will strip out the punctuation. I'll add this script to my list of examples to add in the next rev of my book. -- Jack Krupansky

-Original Message- From: Flavio Pompermaier Sent: Tuesday, June 25, 2013 10:06 AM To: solr-user@lucene.apache.org Subject: Re: URL search and indexing

I bought the book, and looking at the example I still don't understand if it is possible to query all sub-URLs of my URL. For example, if the URLClassifyProcessorFactory takes as input url_s: http://lucene.apache.org/solr/4_0_0/changes/Changes.html and produces outputs like
- url_domain_s: lucene.apache.org
- url_canonical_s: http://lucene.apache.org/solr/4_0_0/changes/Changes.html

How should I configure url_domain_s in order to be able to make queries like '*.apache.org'? How should I configure url_canonical_s in order to be able to make queries like 'http://lucene.apache.org/solr/*'? Is it better to have two different fields for the two queries, or could I create just one field for both kinds of queries (obviously for the former case I should then query something like *://.apache.org/*)?

On Tue, Jun 25, 2013 at 3:15 PM, Jack Krupansky j...@basetechnology.com wrote:

There are examples in my book: http://www.lulu.com/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-1/ebook/product-21079719.html But... I still think you should use a tokenized text field as well - use all three: raw string, tokenized text, and URL classification fields. -- Jack Krupansky

-Original Message- From: Flavio Pompermaier Sent: Tuesday, June 25, 2013 9:02 AM To: solr-user@lucene.apache.org Subject: Re: URL search and indexing

That sounds exactly like what I'm looking for! However, I cannot find an example of how to use it.. could you help me please? Moreover, about the id field, isn't it true that the id field shouldn't be analyzed, as suggested in http://wiki.apache.org/solr/UniqueKey#Text_field_in_the_document ?

On Tue, Jun 25, 2013 at 2:47 PM, Jan Høydahl jan@cominvent.com wrote: Sure you can query the url directly.
Or if you choose, you can split it up into multiple components, e.g. using http://lucene.apache.org/solr/4_3_0/solr-core/org/apache/solr/update/processor/URLClassifyProcessor.html

-- Jan Høydahl, search solution architect, Cominvent AS - www.cominvent.com

On 25 June 2013 at 14:10, Flavio Pompermaier pomperma...@okkam.it wrote:

Sorry, but maybe I'm missing something here.. could I declare url as the key field and query it too..? At the moment, my schema.xml looks like:

<fields>
  <field name="url" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
  <field name="category" type="string" indexed="true" stored="true"/>
  <field name="language" type="string" indexed="true" stored="true"/>
  ...
  <field name="_version_" type="long" indexed="true" stored="true"/>
</fields>
<uniqueKey>url</uniqueKey>

Is that ok? Or should I add a baseurl field of some kind to be able to query all urls coming from a certain domain (1st or 2nd level as well)? Best, Flavio

On Tue, Jun 25, 2013 at 12:28 PM,
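A stripped-down sketch of the kind of StatelessScriptUpdateProcessor script Jack describes above, producing just the dotted suffixes of the domain; the source field (url_domain_s) and target field (url_domain_suffixes_ss, which would need to be declared as a multiValued string field) are placeholder assumptions, not names from this thread. The chain is wired into solrconfig.xml and selected with the update.chain request parameter:

<updateRequestProcessorChain name="domain-suffixes">
  <processor class="solr.StatelessScriptUpdateProcessorFactory">
    <str name="script">domain-suffixes.js</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

domain-suffixes.js:

function processAdd(cmd) {
  var doc = cmd.solrDoc;                          // the SolrInputDocument being indexed
  var domain = doc.getFieldValue("url_domain_s");
  if (domain != null) {
    var d = "" + domain;                          // e.g. "lucene.apache.org"
    doc.addField("url_domain_suffixes_ss", d);
    var dot = d.indexOf(".");
    while (dot >= 0) {
      d = d.substring(dot + 1);                   // "apache.org", then "org"
      doc.addField("url_domain_suffixes_ss", d);
      dot = d.indexOf(".");
    }
  }
}
function processDelete(cmd) { }
function processCommit(cmd) { }
function processRollback(cmd) { }
function processMergeIndexes(cmd) { }
function finish() { }

A query for url_domain_suffixes_ss:apache.org would then match any document whose domain ends in apache.org.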
How to truncate a particular field, LimitTokenCountAnalyzer or LimitTokenCountFilter?
We have a requirement to grab the first N words in a particular field and weight them differently for scoring purposes. So I thought to use a copyField and have some extra filter on the destination to truncate it down (post tokenization). A quick search found both a LimitTokenCountAnalyzer and a LimitTokenCountFilter mentioned. If I read the wiki right, the Filter is the correct approach for Solr, since we have the schema-able analyzer chain, so we don't need to code anything, right? The Analyzer version would be more useful if we were explicitly coding up a set of operations in Java, so that's what direct Lucene users would tend to use. Just in search of confirmation really.
Re: Result Grouping
What type of field are you grouping on? What happens when you distribute it? I.e., what specifically goes wrong? Upayavira

On Tue, Jun 25, 2013, at 09:12 PM, Bryan Bende wrote:

I was reading this documentation on Result Grouping... http://docs.lucidworks.com/display/solr/Result+Grouping which says...

sort - sortspec - Specifies how Solr sorts the groups relative to each other. For example, sort=popularity desc will cause the groups to be sorted according to the highest popularity document in each group. The default value is score desc.

group.sort - sortspec - Specifies how Solr sorts documents within a single group. The default value is score desc.

Is it possible to use these parameters such that group.sort would first sort within each group, and then the overall sort would be applied according to the first element of each sorted group? For example, using the scenario above where it has sort=popularity desc, could you also have group.sort=date asc, resulting in the most recent document of each group being sorted by decreasing popularity? It seems to work the way I described when running a single-node Solr 4.3 instance, but in a 2-shard configuration it appears to work differently. -Bryan
multiValued field score and count
Hi everybody, I have some multiValued (single-token) fields, for example authorid and itemid, and what I'd like to know is whether there's a way to find out how many times a match was found in a document for some field, and whether the score is higher when multiple matches are found. For example, my docs are:

<doc>
  <id>1</id>
  <authorid>11</authorid>
  <authorid>9</authorid>
  <itemid>1000</itemid>
  <itemid>1000</itemid>
  <itemid>1000</itemid>
  <itemid>5000</itemid>
</doc>
<doc>
  <id>2</id>
  <authorid>3</authorid>
  <itemid>1000</itemid>
</doc>

Would the first document have a higher score than the second if I search for itemid=1000? Is it possible to know how many times the match was found (3 for doc1 and 1 for doc2)? Otherwise, how could I achieve that result? Best, Flavio

-- Flavio Pompermaier, Development Department, OKKAM Srl - www.okkam.it
Re: multiValued field score and count
Add fl=[explain],* to your query, and review the output in the new field. It will tell you how the score was calculated. Look at the TF or termfreq values, as this is the number of times the term appears.

Also, you could add this to your fl= param: count:termfreq(authorid, '1000') which would give you a new field telling you how many times the term 1000 appears in the authorid field for each document.

Upayavira

On Wed, Jun 26, 2013, at 09:34 AM, Flavio Pompermaier wrote:

Hi everybody, I have some multiValued (single-token) fields, for example authorid and itemid, and what I'd like to know is whether there's a way to find out how many times a match was found in a document for some field, and whether the score is higher when multiple matches are found. For example, my docs are:

<doc>
  <id>1</id>
  <authorid>11</authorid>
  <authorid>9</authorid>
  <itemid>1000</itemid>
  <itemid>1000</itemid>
  <itemid>1000</itemid>
  <itemid>5000</itemid>
</doc>
<doc>
  <id>2</id>
  <authorid>3</authorid>
  <itemid>1000</itemid>
</doc>

Would the first document have a higher score than the second if I search for itemid=1000? Is it possible to know how many times the match was found (3 for doc1 and 1 for doc2)? Otherwise, how could I achieve that result? Best, Flavio
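As a concrete illustration of both suggestions in one request (the core name and the field/term values are just the ones from this thread's example):

http://localhost:8983/solr/collection1/select?q=itemid:1000&fl=id,score,[explain],cnt:termfreq(itemid,'1000')

Each returned document then carries an [explain] field describing how its score was computed and a cnt field with the raw term frequency of 1000 in itemid.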
Re: multiValued field score and count
So, in order to achieve that feature, do I have to declare my fields (authorid and itemid) with termVectors=true termPositions=true termOffsets=false? Should that be enough?

On Wed, Jun 26, 2013 at 10:42 AM, Upayavira u...@odoko.co.uk wrote:

Add fl=[explain],* to your query, and review the output in the new field. It will tell you how the score was calculated. Look at the TF or termfreq values, as this is the number of times the term appears. Also, you could add this to your fl= param: count:termfreq(authorid, '1000') which would give you a new field telling you how many times the term 1000 appears in the authorid field for each document. Upayavira

On Wed, Jun 26, 2013, at 09:34 AM, Flavio Pompermaier wrote:

-- Flavio Pompermaier, Development Department, OKKAM Srl - www.okkam.it
Re: multiValued field score and count
I mentioned two features, [explain] and termfreq(field, 'value'). Neither of these requires anything special, as they use stuff central to Lucene's scoring mechanisms. I think you can turn off the storage of term frequencies, and obviously that would spoil things, but that's certainly not on by default. I typed the syntax below from memory, so I might not have got it exactly right.

Upayavira

On Wed, Jun 26, 2013, at 10:22 AM, Flavio Pompermaier wrote:

So, in order to achieve that feature, do I have to declare my fields (authorid and itemid) with termVectors=true termPositions=true termOffsets=false? Should that be enough?

On Wed, Jun 26, 2013 at 10:42 AM, Upayavira u...@odoko.co.uk wrote:
Re: Is there a way to capture div tag by id?
Hi. I ran into this issue a while ago. In my case, the div I was trying to extract was the main content of the page. If that is your case, boilerpipe may help. There is a patch at https://issues.apache.org/jira/browse/SOLR-3808 that worked for me. Arcadius.

On 25 June 2013 18:17, eShard zim...@yahoo.com wrote:

Let's say I have a div with id=myDiv. Is there a way to set up the Solr update/extract handler to capture just that particular div?

--
View this message in context: http://lucene.472066.n3.nabble.com/Is-there-a-way-to-capture-div-tag-by-id-tp4073120.html
Sent from the Solr - User mailing list archive at Nabble.com.
StatsComponent doesn't work if field's type is TextField - can I change field's type to String
Hi all,

StatsComponent doesn't work if field's type is TextField. I get the following message:

Field type textstring{class=org.apache.solr.schema.TextField,analyzer=org.apache.solr.analysis.TokenizerChain,args={positionIncrementGap=100, sortMissingLast=true}} is not currently supported.

My field configuration is:

<fieldType name="mvstring" class="solr.TextField" positionIncrementGap="100" sortMissingLast="true">
  <analyzer type="index">
    <tokenizer class="solr.PatternTokenizerFactory" pattern="\n"/>
  </analyzer>
</fieldType>
<field name="myField" type="mvstring" indexed="true" stored="false" multiValued="true"/>

So, the reason my field is of type TextField is that in the document indexed there may be multiple values in the field separated by new lines. The tokenizer is splitting it to multiple values and the field is indexed as multi-valued field.

Is there a way I can define the field as regular String field? Or a way to make StatsComponent work with TextField? Thank you very much.
Re: multiValued field score and count
I tried to play a little with the tools you suggested. However, I'm probably missing something, because the term frequency is not what I expected. My itemid field is defined (in schema.xml) as:

<field name="itemid" type="string" indexed="true" stored="true" multiValued="true"/>

I was expecting that, indexing the xml mentioned in the first mail via post.sh, the term frequency of itemid 1000 would be 3 in the first doc and 1 in the second! Instead, I got that result only if I change my settings to:

<field name="itemid" type="text_ws" indexed="true" stored="true" multiValued="true"/>
<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

and I modify my populating xml as:

<doc>
  <id>1</id>
  <authorid>11</authorid>
  <authorid>9</authorid>
  <itemid>1000 1000 1000</itemid>
  <itemid>5000</itemid>
</doc>
<doc>
  <id>2</id>
  <authorid>3</authorid>
  <itemid>1000</itemid>
</doc>

Is there a way to achieve termFrequency=3 for doc1 while also using my initial settings (itemid as string and just one value per itemid tag)? Best, Flavio

On Wed, Jun 26, 2013 at 12:38 PM, Upayavira u...@odoko.co.uk wrote:

I mentioned two features, [explain] and termfreq(field, 'value'). Neither of these requires anything special, as they use stuff central to Lucene's scoring mechanisms. I think you can turn off the storage of term frequencies, and obviously that would spoil things, but that's certainly not on by default. I typed the syntax below from memory, so I might not have got it exactly right. Upayavira

On Wed, Jun 26, 2013, at 10:22 AM, Flavio Pompermaier wrote:

-- Flavio Pompermaier, Development Department, OKKAM Srl - www.okkam.it
Re: Is there a way to capture div tag by id?
On 06/25/2013 01:17 PM, eShard wrote:

Let's say I have a div with id=myDiv. Is there a way to set up the Solr update/extract handler to capture just that particular div?

--
View this message in context: http://lucene.472066.n3.nabble.com/Is-there-a-way-to-capture-div-tag-by-id-tp4073120.html
Sent from the Solr - User mailing list archive at Nabble.com.

You might be interested in Lux (see http://luxdb.org), which provides XML-aware indexing for Solr. It indexes text in the context of every element, and also allows you to explicitly define indexes using any XPath 2.0 expression, including //div[@id='myDiv'], for example.

-- Michael Sokolov, Senior Architect, Safari Books Online
Re: StatsComponent doesn't work if field's type is TextField - can I change field's type to String
You could use an update processor to turn the text string into multiple string values. A short snippet of JavaScript in a StatelessScriptUpdateProcessor could do the trick. The field could then be a multivalued string field.

-- Jack Krupansky

-Original Message- From: Elran Dvir Sent: Wednesday, June 26, 2013 7:14 AM To: solr-user@lucene.apache.org Subject: StatsComponent doesn't work if field's type is TextField - can I change field's type to String

Hi all,

StatsComponent doesn't work if field's type is TextField. I get the following message:

Field type textstring{class=org.apache.solr.schema.TextField,analyzer=org.apache.solr.analysis.TokenizerChain,args={positionIncrementGap=100, sortMissingLast=true}} is not currently supported.

My field configuration is:

<fieldType name="mvstring" class="solr.TextField" positionIncrementGap="100" sortMissingLast="true">
  <analyzer type="index">
    <tokenizer class="solr.PatternTokenizerFactory" pattern="\n"/>
  </analyzer>
</fieldType>
<field name="myField" type="mvstring" indexed="true" stored="false" multiValued="true"/>

So, the reason my field is of type TextField is that in the document indexed there may be multiple values in the field separated by new lines. The tokenizer is splitting it to multiple values and the field is indexed as multi-valued field. Is there a way I can define the field as regular String field? Or a way to make StatsComponent work with TextField? Thank you very much.
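A bare-bones sketch of the kind of script Jack suggests, splitting a newline-separated value into multiple values; the incoming field name (myField_txt) and the final multiValued string field (myField) are placeholder assumptions, and the script would be referenced from a solr.StatelessScriptUpdateProcessorFactory in an updateRequestProcessorChain:

function processAdd(cmd) {
  var doc = cmd.solrDoc;                           // the SolrInputDocument being indexed
  var raw = doc.getFieldValue("myField_txt");      // incoming newline-separated text (placeholder field)
  if (raw != null) {
    doc.removeField("myField_txt");
    var vals = ("" + raw).split("\n");             // split on new lines
    for (var i = 0; i < vals.length; i++) {
      if (vals[i].length > 0) {
        doc.addField("myField", vals[i]);          // myField declared as a multiValued string field
      }
    }
  }
}
function processDelete(cmd) { }
function processCommit(cmd) { }
function processRollback(cmd) { }
function processMergeIndexes(cmd) { }
function finish() { }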
index analyzer vs query analyzer
Hello, what are the criteria for putting an analyzer under query or index? E.g., I want to use NGramFilterFactory; is there a difference whether I put it under analyzer type="index" or analyzer type="query"? Thanks. Mugoma
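For reference, an analyzer defined under type="index" runs when documents are indexed, while the one under type="query" runs on query terms; a common n-gram pattern generates grams only at index time so that a whole query term can match any one of its grams. A sketch of such a field type (the type name and gram sizes are arbitrary choices, not from this message):

<fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>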
Re: [solr cloud] solr hangs when indexing large number of documents from multiple threads
Right, unfortunately this is a gremlin lurking in the weeds, see: http://wiki.apache.org/solr/DistributedSearch#Distributed_Deadlock

There are a couple of ways to deal with this:
1. Go ahead and up the limit and re-compile; if you look at SolrCmdDistributor, the semaphore is defined there.
2. https://issues.apache.org/jira/browse/SOLR-4816 should address this as well as improve indexing throughput. I'm totally sure Joel (the guy working on this) would be thrilled if you were able to verify these two points; I'd ask him (on the JIRA) whether he thinks it's ready to test.
3. Reduce the number of threads you're indexing with.
4. Index docs in small packets, perhaps even one, and just rack together a zillion threads to get throughput.

FWIW, Erick

On Tue, Jun 25, 2013 at 8:55 AM, Vinay Pothnis poth...@gmail.com wrote:

Jason and Scott, thanks for the replies and pointers! Yes, I will consider the 'maxDocs' value as well. How do I monitor the transaction logs during the interval between commits? Thanks, Vinay

On Mon, Jun 24, 2013 at 8:48 PM, Jason Hellman jhell...@innoventsolutions.com wrote:

Scott, my comment was meant to be a bit tongue-in-cheek, but my intent in the statement was to represent hard failure along the lines Vinay is seeing. We're talking about OutOfMemoryException conditions, total cluster paralysis requiring restart, or other similar and disastrous conditions. Where that line is is impossible to define generically, but trivial to accomplish. What any of us running Solr has to achieve is a realistic simulation of our desired production load (probably well above peak) and to see what limits are reached. Armed with that information we tweak.

In this case, we look at finding the point where data ingestion reaches a natural limit. For some that may be JVM GC, for others memory buffer size on the client load, and yet for others it may be I/O limits on multithreaded reads from a database or file system.

In old Solr days we had a little less to worry about. We might play with a commitWithin parameter, ramBufferSizeMB tweaks, or contemplate partial commits and rollback recoveries. But with 4.x we now have more durable write options and NRT to consider, and SolrCloud begs to use this. So we have to consider transaction logs, the file handles they leave open until commit operations occur, and how we want to manage writing to all cores simultaneously instead of a more narrow master/slave relationship.

It's all manageable, all predictable (with some load testing) and all filled with many possibilities to meet our specific needs. Considering that each person's data model, ingestion pipeline, request processors, and field analysis steps will be different, 5 threads of input at face value doesn't really contemplate the whole problem. We have to measure our actual data against our expectations and find where the weak chain links are to strengthen them. The symptoms aren't necessarily predictable in advance of this testing, but they're likely addressable and not difficult to decipher.

For what it's worth, SolrCloud is new enough that we're still experiencing some uncharted territory with unknown ramifications, but with continued dialog through channels like these there are fewer territories without good cartography :)

Hope that's of use! Jason

On Jun 24, 2013, at 7:12 PM, Scott Lundgren scott.lundg...@carbonblack.com wrote:

Jason, regarding your statement "push you over the edge" - what does that mean? Does it mean uncharted territory with unknown ramifications, or something more like specific, known symptoms? I ask because our use is similar to Vinay's in some respects, and we want to be able to push the capabilities of write perf - but not over the edge! In particular, I am interested in knowing the symptoms of failure, to help us troubleshoot the underlying problems if and when they arise. Thanks, Scott

On Monday, June 24, 2013, Jason Hellman wrote:

Vinay, you may wish to pay attention to how many transaction logs are being created along the way to your hard autoCommit, which should truncate the open handles for those files. I might suggest setting a maxDocs value in parallel with your maxTime value (you can use both) to ensure the commit occurs at either breakpoint. 30 seconds is plenty of time for 5 parallel processes of 20 document submissions to push you over the edge. Jason

On Jun 24, 2013, at 2:21 PM, Vinay Pothnis poth...@gmail.com wrote:

I have 'softAutoCommit' at 1 second and 'hardAutoCommit' at 30 seconds.

On Mon, Jun 24, 2013 at 1:54 PM, Jason Hellman jhell...@innoventsolutions.com wrote:

Vinay, what autoCommit settings do you have for your indexing process? Jason

On Jun 24, 2013, at 1:28 PM, Vinay Pothnis poth...@gmail.com wrote:

Here is the ulimit -a output: core file size (blocks, -c) 0 data seg size (kbytes, -d)
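For reference, the autoCommit settings discussed in this thread live in the updateHandler section of solrconfig.xml; the values below mirror the 30-second hard / 1-second soft setup Vinay describes, with a maxDocs cap added along the lines Jason suggests (the 10000 figure is only an illustrative placeholder):

<autoCommit>
  <maxTime>30000</maxTime>
  <maxDocs>10000</maxDocs>
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>1000</maxTime>
</autoSoftCommit>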
Get the query result from one collection and send it to another collection for merging the result sets
Hi, we will have two categories of data, where one category will be the list of primary data (for example products) and the other collection (which could be spread across shards) holds the transaction data (for example product sales data). We have a search scenario where we need to show the products along with the number of sales for each product. For this we need to do a facet-based search on the second collection, and then this has to be shown together with the primary data. Is there any way to handle this kind of scenario? Please suggest any other approaches to get the desired result. Thank you, Jilani
Re: Solr indexer and Hadoop
Well, it's been merged into trunk according to the comments, so: try it on trunk, help with any bugs, buy Mark beer. And, most especially, document up what it takes to make it work. Mark is juggling a zillion things and I'm sure he'd appreciate any help there. Erick

On Tue, Jun 25, 2013 at 11:25 AM, Michael Della Bitta michael.della.bi...@appinions.com wrote:

zomghowcanihelp? :)

Michael Della Bitta, Applications Developer, appinions inc. - http://www.appinions.com/

On Tue, Jun 25, 2013 at 2:08 PM, Erick Erickson erickerick...@gmail.com wrote:

You might be interested in following: https://issues.apache.org/jira/browse/SOLR-4916 Best, Erick

On Tue, Jun 25, 2013 at 7:28 AM, Michael Della Bitta michael.della.bi...@appinions.com wrote:

Jack, sorry, but I don't agree that it's that cut and dried. I've very successfully worked with terabytes of data in Hadoop that was stored on an Isilon mounted via NFS, for example. In cases like this, you're using MapReduce purely for its execution model (which existed far before Hadoop and HDFS ever did).

Michael Della Bitta, Applications Developer, appinions inc. - http://www.appinions.com/

On Tue, Jun 25, 2013 at 8:58 AM, Jack Krupansky j...@basetechnology.com wrote:

??? Hadoop=HDFS. If the data is not in Hadoop/HDFS, just use the normal Solr indexing tools, including SolrCell and Data Import Handler, and possibly ManifoldCF. -- Jack Krupansky

-Original Message- From: engy.morsy Sent: Tuesday, June 25, 2013 8:10 AM To: solr-user@lucene.apache.org Subject: Re: Solr indexer and Hadoop

Thank you Jack. So, I need to convert those nodes holding data to HDFS.

--
View this message in context: http://lucene.472066.n3.nabble.com/Solr-indexer-and-Hadoop-tp4072951p4073013.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Result Grouping
The field I am grouping on is a single-valued string.

It looks like in non-distributed mode, if I use group=true, sort, group.sort, and group.limit=1, it will:
- group the results
- sort within each group
- limit down to 1 result per group
- apply the sort between groups using the single result of each group

When I run with numShards > 1, it will:
- group the results
- apply the sort between groups using the document from each group based on the sort; for example, if sort=popularity desc then it uses the highest popularity from each group
- sort within the group
- limit down to 1 result per group

I was trying to confirm whether this is the expected behavior, or whether there is something I could do to get the first behavior in a distributed configuration. I posted this a few days ago describing the scenario in more detail if you are interested...
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201306.mbox/%3CCALo_M18WVoLKvepJMu0wXk_x2H8cv3UaX9RQYtEh4-mksQHLBA%40mail.gmail.com%3E

What type of field are you grouping on? What happens when you distribute it? I.e., what specifically goes wrong? Upayavira

On Tue, Jun 25, 2013, at 09:12 PM, Bryan Bende wrote:

I was reading this documentation on Result Grouping... http://docs.lucidworks.com/display/solr/Result+Grouping which says...

sort - sortspec - Specifies how Solr sorts the groups relative to each other. For example, sort=popularity desc will cause the groups to be sorted according to the highest popularity document in each group. The default value is score desc.

group.sort - sortspec - Specifies how Solr sorts documents within a single group. The default value is score desc.

Is it possible to use these parameters such that group.sort would first sort within each group, and then the overall sort would be applied according to the first element of each sorted group? For example, using the scenario above where it has sort=popularity desc, could you also have group.sort=date asc, resulting in the most recent document of each group being sorted by decreasing popularity? It seems to work the way I described when running a single-node Solr 4.3 instance, but in a 2-shard configuration it appears to work differently. -Bryan
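For the record, the kind of request being discussed looks roughly like this (the sort fields match the popularity/date example in the thread; the grouping field name is a placeholder):

http://localhost:8983/solr/collection1/select?q=*:*&group=true&group.field=category_s&group.limit=1&group.sort=date+asc&sort=popularity+desc

i.e. documents are sorted by date within each group, one document is kept per group, and the groups themselves are ordered by popularity.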
Re: how to replicate Solr Cloud
On the lengthy TODO list is making SolrCloud nodes rack-aware; that should help with this, but it's not real high in the priority queue as I recall. The current architecture sends updates and requests all over the cluster, so there are lots of messages that go across the presumably expensive pipe between data centers. Not to mention the ZooKeeper quorum problem.

Hmmm, ZooKeeper quorum problem: say 1 ZK node is in DC1 and 2 are in DC2. If DC2 goes down, DC1 will not accept updates because there is no available ZK quorum. I've seen one proposal where you use 3 DCs, each with a ZK node, to ameliorate this.

But all this is an issue only if the communications link between the datacenters is expensive, where that term can mean that it literally costs more, that it is slow, whatever.

Best, Erick

On Tue, Jun 25, 2013 at 12:14 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote:

Uh, I remember that email, but can't recall where we did it... will try to recall it some more and reply if I can manage to dig it out of my brain... Otis -- Solr & ElasticSearch Support -- http://sematext.com/ Performance Monitoring -- http://sematext.com/spm

On Tue, Jun 25, 2013 at 2:24 PM, Kevin Osborn kevin.osb...@cbsi.com wrote:

Otis, I did actually stumble upon this link: http://comments.gmane.org/gmane.comp.jakarta.lucene.solr.user/74870 This was from you. You were attempting to replicate data from SolrCloud to some other slaves for heavy-duty queries. You said that you accomplished this. Can you provide a few pointers on how you did this? Thanks.

On Tue, Jun 25, 2013 at 10:25 AM, Otis Gospodnetic otis.gospodne...@gmail.com wrote:

I think what is needed is a Leader that, while being a Leader for its own Slice in its local Cluster and Collection (I think I'm using all the latest terminology correctly here), is at the same time a Replica of its own Leader counterpart in the Primary Cluster. Not currently possible, AFAIK. Or maybe there is a better way? Otis -- Solr & ElasticSearch Support -- http://sematext.com/ Performance Monitoring -- http://sematext.com/spm

On Tue, Jun 25, 2013 at 1:07 PM, Kevin Osborn kevin.osb...@cbsi.com wrote:

We are going to have two datacenters, each with their own SolrCloud and ZooKeeper quorums. The end result will be that they should be replicas of each other. One method that has been mentioned is that we should add documents to each cluster separately. For various reasons, this may not be ideal for us. Instead, we are playing around with the idea of always indexing to one datacenter and then having that replicate to the other datacenter. And this is where I am having some trouble on how to proceed.

The nice thing about SolrCloud is that there are no masters and slaves. Each node is equal, has the same configs, etc. But in this case, I want to have a node in one datacenter poll for changes in another datacenter. Before SolrCloud, I would have used slave/master replication. But in the SolrCloud world, I am not sure how to configure this setup. Or are there any better ideas on how to use replication to push or pull data from one datacenter to another? In my case, NRT is not a requirement. And I will also be dealing with about 3 collections and 5 or 6 shards. Thanks.
-- KEVIN OSBORN, Lead Software Engineer, CNET Content Solutions
Re: How to truncate a particular field, LimitTokenCountAnalyzer or LimitTokenCountFilter?
Yes, the LimitTokenCountFilterFactory will do the trick. I have some examples in the book, showing for a given input string what the output tokens will be. Otherwise, the Solr Javadoc does give one generic example, but without showing how it actually works:
http://lucene.apache.org/core/4_3_1/analyzers-common/org/apache/lucene/analysis/miscellaneous/LimitTokenCountFilterFactory.html

The new Apache Solr Reference? No mention of the filter.

-- Jack Krupansky

-Original Message- From: Daniel Collins Sent: Wednesday, June 26, 2013 3:38 AM To: solr-user@lucene.apache.org Subject: How to truncate a particular field, LimitTokenCountAnalyzer or LimitTokenCountFilter?

We have a requirement to grab the first N words in a particular field and weight them differently for scoring purposes. So I thought to use a copyField and have some extra filter on the destination to truncate it down (post tokenization). A quick search found both a LimitTokenCountAnalyzer and a LimitTokenCountFilter mentioned. If I read the wiki right, the Filter is the correct approach for Solr, since we have the schema-able analyzer chain, so we don't need to code anything, right? The Analyzer version would be more useful if we were explicitly coding up a set of operations in Java, so that's what direct Lucene users would tend to use. Just in search of confirmation really.
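For illustration, a copyField plus a field type using the factory might look like the following; the field names, the type name and maxTokenCount=10 are placeholder choices, not values from this thread:

<fieldType name="text_first10" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LimitTokenCountFilterFactory" maxTokenCount="10"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="body_first10" type="text_first10" indexed="true" stored="false"/>
<copyField source="body" dest="body_first10"/>

The truncated copy can then be boosted independently of the full field at query time, e.g. via qf=body body_first10^2 with edismax.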
Re: Querying multiple collections in SolrCloud
bq: Would the above setup qualify as multiple compatible collections?

No. While there may be enough fields in common to form a single query, the TF/IDF calculations will not be compatible and the scores from the various collections will NOT be comparable. So simply getting the list of top N docs will probably be dominated by the docs from a single type.

bq: How does SolrCloud combine the query results from multiple collections?

It doesn't. SolrCloud sorts the results from multiple nodes in the _same_ collection according to whatever sort criteria are specified, defaulting to score. Say you ask for the top 20 docs. A node from each shard returns the top 20 docs for that shard. The node processing them just merges all the returned lists and only keeps the top 20.

I don't think your last two questions are really relevant; SolrCloud isn't built to query multiple collections and return the results coherently.

The root problem here is that you're trying to compare docs from different collections for "goodness" to return the top N. This isn't actually hard _except_ when goodness is the score; then it just doesn't work. You can't even compare scores from different queries on the _same_ collection, much less different ones. Consider two collections, books and songs. One consists of lots and lots of text, and the term frequency and inverse doc freq (TF/IDF) will be hugely different than songs. Not to mention field length normalization.

Now, all that aside, there's an option. Index all the docs in a single collection and use grouping (aka field collapsing) to get a single response that has the top N docs from each type (they'll be in different sections of the original response) and present them to the user however makes sense. You'll get hands-on experience in why this isn't something that's easy to do automatically if you try to sort these into a single list by relevance G...

Best, Erick

On Tue, Jun 25, 2013 at 3:35 PM, Chris Toomey ctoo...@gmail.com wrote:

Thanks Jack for the alternatives. The first is interesting but has the downside of requiring multiple queries to get the full matching docs. The second is interesting and very simple, but has the downside of not being modular and being difficult to configure field boosting when the collections have overlapping field names with different boosts being needed for the same field in different document types.

I'd still like to know about the viability of my original approach though too. Chris

On Tue, Jun 25, 2013 at 3:19 PM, Jack Krupansky j...@basetechnology.com wrote:

One simple scenario to consider: N+1 collections - one collection per document type with detailed fields for that document type, and one common collection that indexes a subset of the fields. The main user query would be an edismax over the common fields in that main collection. You can then display summary results from the common collection. You can also then support drill-down into the type-specific collection based on a type field for each document in the main collection. Or, sure, you actually CAN index multiple document types in the same collection - add all the fields to one schema - there is no time or space penalty if most of the fields are empty for most documents. -- Jack Krupansky

-Original Message- From: Chris Toomey Sent: Tuesday, June 25, 2013 6:08 PM To: solr-user@lucene.apache.org Subject: Querying multiple collections in SolrCloud

Hi, I'm investigating using SolrCloud for querying documents of different but similar/related types, and have read through docs on the wiki and done many searches in these archives, but still have some questions. Thanks in advance for your help.

Setup:
* Say that I have N distinct types of documents and I want to do queries that return the best matches regardless of document type. I.e., something akin to a Google search where I'd like to get the best matches from the web, news, images, and maps.
* Our main use case is supporting simple user-entered searches, which would just contain terms / phrases and wouldn't specify fields.
* The document types will not all have the same fields, though there may be some overlap in the fields.
* We plan to use a separate collection for each document type and to use the eDisMax query parser. Each collection would have a document-specific schema configuration with appropriate defaults for query fields and boosts, etc.

Questions:
* Would the above setup qualify as multiple compatible collections, such that we could search all N collections with a single SolrCloud query, as in the example query http://localhost:8983/solr/collection1/select?q=apple%20pie&collection=c1,c2,...,cN? Again, we're not querying against specific fields.
* How does SolrCloud combine the query results from multiple collections? Does it re-sort the combined result set, or does it just return
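A rough sketch of the single-collection grouping request Erick describes, assuming the documents carry a type field such as doctype_s (a placeholder name, not from this thread):

http://localhost:8983/solr/collection1/select?q=apple+pie&defType=edismax&group=true&group.field=doctype_s&group.limit=5

This returns up to 5 top-scoring docs per document type in one response, leaving it to the application to decide how to interleave them.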
Re: URL search and indexing
Flavio:

You mention that you're new to Solr, so I thought I'd make sure you know that the admin/analysis page is your friend! I flat guarantee that as you try to index/search following the suggestions, you'll scratch your head at your results and you'll discover that the analysis process isn't doing quite what you expect. The admin/analysis page shows you the transformation of the input at each stage, i.e. how the input is tokenized, what transformations are applied to each token, etc. It's invaluable!

Best, Erick

P.S. Feel free to un-check the verbose box; it provides lots of information but can be overwhelming, especially at first!

On Wed, Jun 26, 2013 at 12:20 AM, Flavio Pompermaier pomperma...@okkam.it wrote:

Ok, thank you all for the great help! Now I'm ready to start playing with my index! Best, Flavio

On Tue, Jun 25, 2013 at 11:40 PM, Jack Krupansky j...@basetechnology.com wrote:
Re: StatsComponent doesn't work if field's type is TextField - can I change field's type to String
From the stats component page: "The stats component returns simple statistics for indexed numeric fields within the DocSet." So string, text, anything non-numeric won't work.

You can declare it multiValued, but then you have to add multiple values for the field when you send the doc to Solr, or implement a custom update component to break them up. At least there's no filter that I know of that takes a delimited set of numbers and transforms them.

FWIW, Erick

On Wed, Jun 26, 2013 at 4:14 AM, Elran Dvir elr...@checkpoint.com wrote:

Hi all,

StatsComponent doesn't work if field's type is TextField. I get the following message:

Field type textstring{class=org.apache.solr.schema.TextField,analyzer=org.apache.solr.analysis.TokenizerChain,args={positionIncrementGap=100, sortMissingLast=true}} is not currently supported.

My field configuration is:

<fieldType name="mvstring" class="solr.TextField" positionIncrementGap="100" sortMissingLast="true">
  <analyzer type="index">
    <tokenizer class="solr.PatternTokenizerFactory" pattern="\n"/>
  </analyzer>
</fieldType>
<field name="myField" type="mvstring" indexed="true" stored="false" multiValued="true"/>

So, the reason my field is of type TextField is that in the document indexed there may be multiple values in the field separated by new lines. The tokenizer is splitting it to multiple values and the field is indexed as multi-valued field. Is there a way I can define the field as regular String field? Or a way to make StatsComponent work with TextField? Thank you very much.
Dynamic Type For Solr Schema
I use Solr 4.3.1 as SolrCloud. I know that I can define an analyzer at schema.xml. Let's assume that I have specialized my analyzer for Turkish. However I want to have another analyzer too, i.e. for English. I have these fields at my schema: ... field name=content type=text_tr stored=true indexed=true/ field name=title type=text_tr stored=true indexed=true/ ... I have a field type as text_tr that is tailored for Turkish. I have another field type as text_en that is tailored for English. I have another field at my schema as lang. lang holds the language of the document as en or tr. If I get a document whose lang field holds *tr* I want that: ... field name=content type=*text_tr* stored=true indexed=true/ field name=title type=*text_tr* stored=true indexed=true/ ... If I get a document whose lang field holds *en* I want that: ... field name=content type=*text_en* stored=true indexed=true/ field name=title type=*text_en* stored=true indexed=true/ ... I want dynamic types just for those fields; the others will stay the same. How can I do that properly at Solr? (UpdateRequestProcessor, ...?)
Re: URL search and indexing
I was doing exactly that and, thanks to the administration page and explanation/debugging, I checked whether the results were those expected. Unfortunately, the results were not correct when submitting updates through the post.sh script (which uses curl in the end). Probably, if it finds the same tag (the same value for the same field name), it collapses them. Rewriting the same document in Java and submitting the updates made things work correctly. In my opinion this is a bug (of the entire process; I don't know whether it is a problem of curl or of the script itself). Best, Flavio On Wed, Jun 26, 2013 at 4:18 PM, Erick Erickson erickerick...@gmail.com wrote: Flavio: You mention that you're new to Solr, so I thought I'd make sure you know that the admin/analysis page is your friend! I flat guarantee that as you try to index/search following the suggestions you'll scratch your head at your results and you'll discover that the analysis process isn't doing quite what you expect. The admin/analysis page shows you the transformation of the input at each stage, i.e. how the input is tokenized, what transformations are applied to each token etc. It's invaluable! Best Erick P.S. Feel free to un-check the verbose box, it provides lots of information but can be overwhelming, especially at first!
Re: URL search and indexing
If there is a bug... we should identify it. What's a sample post command that you issued? -- Jack Krupansky -Original Message- From: Flavio Pompermaier Sent: Wednesday, June 26, 2013 10:53 AM To: solr-user@lucene.apache.org Subject: Re: URL search and indexing I was doing exactly that and, thanks to the administration page and explanation/debugging, I checked whether the results were those expected. Unfortunately, the results were not correct when submitting updates through the post.sh script (which uses curl in the end). Probably, if it finds the same tag (the same value for the same field name), it collapses them. Rewriting the same document in Java and submitting the updates made things work correctly. In my opinion this is a bug (of the entire process; I don't know whether it is a problem of curl or of the script itself). Best, Flavio
Re: index analyzer vs query analyzer
Yes! A rather extreme difference and you probably want it in both. The admin/analysis page is your friend. Basically, putting stuff in the type=index section dictates what goes into the index, and that is _all_ that is searchable. The result of the full analysis chain is what's in the index and searchable. Putting stuff in the type=query section dictates what terms the index is searched for. So if the two don't match, you will get surprising results. I'd advise that you keep them both identical until you're more familiar with how all this works or use one of the pre-defined examples and add or remove filters _in the same order_. Best Erick On Wed, Jun 26, 2013 at 6:23 AM, Mugoma Joseph O. mug...@yengas.com wrote: Hello, What's the criteria used in putting an analyzer at query or index? e.g. I want to use NGramFilterFactory, is there a difference whether I put it under analyzer type=index or analyzer type=query ? Thanks. Mugoma
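For example, a common pattern with NGramFilterFactory is to generate grams only at index time, so that a whole query term can match against the indexed grams; the gram sizes below are purely illustrative and should be tuned:

<fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

As Erick says, the admin/analysis page will show exactly which tokens each side produces for a given input.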
Re: Solr indexer and Hadoop
Pardon, my unfamiliarity with the Solr development process. Now that it's in the trunk, will it appear in the next 4.X release? -- David On Wed, Jun 26, 2013 at 9:42 AM, Erick Erickson erickerick...@gmail.comwrote: Well, it's been merged into trunk according to the comments, so Try it on trunk, help with any bugs, buy Mark beer. And, most especially, document up what it takes to make it work. Mark is juggling a zillion things and I'm sure he'd appreciate any help there. Erick On Tue, Jun 25, 2013 at 11:25 AM, Michael Della Bitta michael.della.bi...@appinions.com wrote: zomghowcanihelp? :) Michael Della Bitta Applications Developer o: +1 646 532 3062 | c: +1 917 477 7906 appinions inc. “The Science of Influence Marketing” 18 East 41st Street New York, NY 10017 t: @appinions https://twitter.com/Appinions | g+: plus.google.com/appinions w: appinions.com http://www.appinions.com/ On Tue, Jun 25, 2013 at 2:08 PM, Erick Erickson erickerick...@gmail.com wrote: You might be interested in following: https://issues.apache.org/jira/browse/SOLR-4916 Best Erick On Tue, Jun 25, 2013 at 7:28 AM, Michael Della Bitta michael.della.bi...@appinions.com wrote: Jack, Sorry, but I don't agree that it's that cut and dried. I've very successfully worked with terabytes of data in Hadoop that was stored on an Isilon mounted via NFS, for example. In cases like this, you're using MapReduce purely for it's execution model (which existed far before Hadoop and HDFS ever did). Michael Della Bitta Applications Developer o: +1 646 532 3062 | c: +1 917 477 7906 appinions inc. “The Science of Influence Marketing” 18 East 41st Street New York, NY 10017 t: @appinions https://twitter.com/Appinions | g+: plus.google.com/appinions w: appinions.com http://www.appinions.com/ On Tue, Jun 25, 2013 at 8:58 AM, Jack Krupansky j...@basetechnology.com wrote: ??? Hadoop=HDFS If the data is not in Hadoop/HDFS, just use the normal Solr indexing tools, including SolrCell and Data Import Handler, and possibly ManifoldCF. -- Jack Krupansky -Original Message- From: engy.morsy Sent: Tuesday, June 25, 2013 8:10 AM To: solr-user@lucene.apache.org Subject: Re: Solr indexer and Hadoop Thank you Jack. So, I need to convert those nodes holding data to HDFS. -- View this message in context: http://lucene.472066.n3.** nabble.com/Solr-indexer-and-**Hadoop-tp4072951p4073013.html http://lucene.472066.n3.nabble.com/Solr-indexer-and-Hadoop-tp4072951p4073013.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: URL search and indexing
Your other best friend is debug=query on the URL, you might be seeing different parsed queries than you expect, although that doesn't really hold water given you say SolrJ fixes things. I'd be surprised if posting the xml was the culprit, but you never know. Did you re-index after schema changes etc? Best Erick On Wed, Jun 26, 2013 at 8:18 AM, Jack Krupansky j...@basetechnology.com wrote: If there is a bug... we should identify it. What's a sample post command that you issued? -- Jack Krupansky -Original Message- From: Flavio Pompermaier Sent: Wednesday, June 26, 2013 10:53 AM To: solr-user@lucene.apache.org Subject: Re: URL search and indexing I was doing exactly that and, thanks to the administration page and explanation/debugging, I checked whether the results were those expected. Unfortunately, the results were not correct when submitting updates through the post.sh script (which uses curl in the end). Probably, if it finds the same tag (the same value for the same field name), it collapses them. Rewriting the same document in Java and submitting the updates made things work correctly. In my opinion this is a bug (of the entire process; I don't know whether it is a problem of curl or of the script itself). Best, Flavio
Re: Solr indexer and Hadoop
See Mark's comments on the Jira when I asked that question. My take: If 4.4 happens real soon (which some people have proposed), then it may not make it into 4.4. But if a 4.4 RC doesn't happen for another couple of weeks (my inclination), then the HDFS support could well make it into 4.4. If not in 4.4, 4.5 is probably a slam-dunk. -- Jack Krupansky -Original Message- From: David Larochelle Sent: Wednesday, June 26, 2013 11:24 AM To: solr-user@lucene.apache.org Subject: Re: Solr indexer and Hadoop Pardon, my unfamiliarity with the Solr development process. Now that it's in the trunk, will it appear in the next 4.X release? -- David On Wed, Jun 26, 2013 at 9:42 AM, Erick Erickson erickerick...@gmail.comwrote: Well, it's been merged into trunk according to the comments, so Try it on trunk, help with any bugs, buy Mark beer. And, most especially, document up what it takes to make it work. Mark is juggling a zillion things and I'm sure he'd appreciate any help there. Erick On Tue, Jun 25, 2013 at 11:25 AM, Michael Della Bitta michael.della.bi...@appinions.com wrote: zomghowcanihelp? :) Michael Della Bitta Applications Developer o: +1 646 532 3062 | c: +1 917 477 7906 appinions inc. “The Science of Influence Marketing” 18 East 41st Street New York, NY 10017 t: @appinions https://twitter.com/Appinions | g+: plus.google.com/appinions w: appinions.com http://www.appinions.com/ On Tue, Jun 25, 2013 at 2:08 PM, Erick Erickson erickerick...@gmail.com wrote: You might be interested in following: https://issues.apache.org/jira/browse/SOLR-4916 Best Erick On Tue, Jun 25, 2013 at 7:28 AM, Michael Della Bitta michael.della.bi...@appinions.com wrote: Jack, Sorry, but I don't agree that it's that cut and dried. I've very successfully worked with terabytes of data in Hadoop that was stored on an Isilon mounted via NFS, for example. In cases like this, you're using MapReduce purely for it's execution model (which existed far before Hadoop and HDFS ever did). Michael Della Bitta Applications Developer o: +1 646 532 3062 | c: +1 917 477 7906 appinions inc. “The Science of Influence Marketing” 18 East 41st Street New York, NY 10017 t: @appinions https://twitter.com/Appinions | g+: plus.google.com/appinions w: appinions.com http://www.appinions.com/ On Tue, Jun 25, 2013 at 8:58 AM, Jack Krupansky j...@basetechnology.com wrote: ??? Hadoop=HDFS If the data is not in Hadoop/HDFS, just use the normal Solr indexing tools, including SolrCell and Data Import Handler, and possibly ManifoldCF. -- Jack Krupansky -Original Message- From: engy.morsy Sent: Tuesday, June 25, 2013 8:10 AM To: solr-user@lucene.apache.org Subject: Re: Solr indexer and Hadoop Thank you Jack. So, I need to convert those nodes holding data to HDFS. -- View this message in context: http://lucene.472066.n3.** nabble.com/Solr-indexer-and-**Hadoop-tp4072951p4073013.html http://lucene.472066.n3.nabble.com/Solr-indexer-and-Hadoop-tp4072951p4073013.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: URL search and indexing
Obviously I messed up the email thread... however I found a problem indexing my document via post.sh. This is basically my schema.xml: schema name=dopa-schema version=1.5 fields field name=url type=string indexed=true stored=true required=true multiValued=false / field name=itemid type=string indexed=true stored=true multiValued=true/ field name=_version_ type=long indexed=true stored=true/ /fields uniqueKeyurl/uniqueKey types fieldType name=string class=solr.StrField sortMissingLast=true / fieldType name=long class=solr.TrieLongField precisionStep=0 positionIncrementGap=0/ /types /schema and this is the document I tried to upload via post.sh: add doc field name=urlhttp://test.example.org/first.html/field field name=itemid1000/field field name=itemid1000/field field name=itemid1000/field field name=itemid5000/field /doc doc field name=urlhttp://test.example.org/second.html/field field name=itemid1000/field field name=itemid5000/field /doc /add When playing with the administration and debugging tools I discovered that searching for q=itemid:5000 gave me the same score for those docs, while I was expecting different term frequencies between the first and the second. In fact, using Java to upload the documents led to the correct results (3 occurrences of item 1000 in the first doc and 1 in the second), e.g.: document1.addField(itemid, 1000); document1.addField(itemid, 1000); document1.addField(itemid, 1000); Am I right or am I missing something else? On Wed, Jun 26, 2013 at 5:18 PM, Jack Krupansky j...@basetechnology.com wrote: If there is a bug... we should identify it. What's a sample post command that you issued?
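One way to see the per-document term frequency directly, rather than inferring it from the score, is the termfreq() function in the fl list; a sketch, with host and core names as placeholders:

http://localhost:8983/solr/collection1/select?q=itemid:1000&fl=url,freq:termfreq(itemid,'1000')&wt=xml

If the post.sh upload really is collapsing the repeated values, the first document should report freq=1 instead of the expected 3.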
RE: StatsComponent doesn't work if field's type is TextField - can I change field's type to String
Erick, thanks for the response. I think the stats component works with strings. In StatsValuesFactory, I see the following code: public static StatsValues createStatsValues(SchemaField sf) { ... else if (StrField.class.isInstance(fieldType)) { return new StringStatsValues(sf); } } -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Wednesday, June 26, 2013 5:30 PM To: solr-user@lucene.apache.org Subject: Re: StatsComponent doesn't work if field's type is TextField - can I change field's type to String From the stats component page: The stats component returns simple statistics for indexed numeric fields within the DocSet So string, text, anything non-numeric won't work. You can declare it multiValued but then you have to add multiple values for the field when you send the doc to Solr or implement a custom update component to break them up. At least there's no filter that I know of that takes a delimited set of numbers and transforms them. FWIW, Erick On Wed, Jun 26, 2013 at 4:14 AM, Elran Dvir elr...@checkpoint.com wrote: Hi all, StatsComponent doesn't work if field's type is TextField. I get the following message: Field type textstring{class=org.apache.solr.schema.TextField,analyzer=org.apache. solr.analysis.TokenizerChain,args={positionIncrementGap=100, sortMissingLast=true}} is not currently supported. My field configuration is: fieldType name=mvstring class=solr.TextField positionIncrementGap= 100 sortMissingLast=true analyzer type=index tokenizer class=solr.PatternTokenizerFactory pattern=\n / /analyzer /fieldType field name=myField type=mvstring indexed=true stored=false multiValued=true/ So, the reason my field is of type TextField is that in the document indexed there may be multiple values in the field separated by new lines. The tokenizer is splitting it to multiple values and the field is indexed as multi-valued field. Is there a way I can define the field as regular String field? Or a way to make StatsComponent work with TextField? Thank you very much. Email secured by Check Point
Re: Need Help in migrating Solr version 1.4 to 4.3
On 6/25/2013 11:52 PM, Sandeep Gupta wrote: Also in application development side, as I said that I am going to use HTTPSolrServer API and I found that we shouldn't create this object multiple times (as per the wiki document http://wiki.apache.org/solr/Solrj#HttpSolrServer) So I am planning to have my Server class as singleton. Please advice little bit in this front also. This is always the way that SolrServer objects are intended to be used, including CommonsHttpSolrServer in version 1.4. The only major difference between the two objects is that the new one uses HttpComponents 4.x and the old one uses HttpClient 3.x. There are other differences, but they are just the result of incremental improvements from version to version. Thanks, Shawn
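A minimal sketch of the singleton Shawn and the wiki recommend, assuming Solr 4.x SolrJ; the URL is just a placeholder:

import org.apache.solr.client.solrj.impl.HttpSolrServer;

public final class SolrClientHolder {
    // HttpSolrServer is thread-safe, so one instance can be shared by the whole application
    private static final HttpSolrServer SERVER =
            new HttpSolrServer("http://localhost:8983/solr/collection1");

    private SolrClientHolder() {}

    public static HttpSolrServer get() {
        return SERVER;
    }
}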
Re: Dynamic Type For Solr Schema
On Wed, Jun 26, 2013 at 11:46 AM, Jack Krupansky j...@basetechnology.com wrote: But there are also built-in language identifier update processors that can simultaneously identify what language is used in the input value for a field AND do the redirection to a language-specific field AND store the language code. I have an example of using this as well (for English/Russian): https://github.com/arafalov/solr-indexing-book/tree/master/published/languages . This includes the collection data files, so you can see the end result and play with it. The instructions on how to recreate this and explanation behind routing and field aliases setup are in my book : http://blog.outerthoughts.com/2013/06/my-book-on-solr-is-now-published/ :-) Regards, Alex. Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)
Re: Is it possible to search Solr with a longer query string?
On 6/25/2013 6:15 PM, Jack Krupansky wrote: Are you using Tomcat? See: http://wiki.apache.org/solr/SolrTomcat#Enabling_Longer_Query_Requests Enabling Longer Query Requests If you try to submit too long a GET query to Solr, then Tomcat will reject your HTTP request on the grounds that the HTTP header is too large; symptoms may include an HTTP 400 Bad Request error or (if you execute the query in a web browser) a blank browser window. If you need to enable longer queries, you can set the maxHttpHeaderSize attribute on the HTTP Connector element in your server.xml file. The default value is 4K. (See http://tomcat.apache.org/tomcat-5.5-doc/config/http.html) Even better would be to force SolrJ to use a POST request. In newer versions (4.1 and later) Solr sets the servlet container's POST buffer size and defaults it to 2MB. In older versions, you'd have to adjust this in your servlet container config, but the default should be considerably larger than the header buffer used for GET requests. I thought that SolrJ used POST by default, but after looking at the code, it seems that I was wrong. Here's how to send a POST query: response = server.query(query, METHOD.POST); The import required for this is: import org.apache.solr.client.solrj.SolrRequest.METHOD; Gary, if you can avoid it, you should not be creating a new HttpSolrServer object every time you make a query. It is completely thread-safe, so create a singleton and use it for all queries against the medline core. Thanks, Shawn
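For reference, if the container is Tomcat, the header limit mentioned above is raised on the HTTP Connector in server.xml; everything here other than maxHttpHeaderSize is just the stock connector definition, and the value is illustrative:

<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           maxHttpHeaderSize="65536"
           redirectPort="8443"/>

Switching SolrJ to METHOD.POST, as Shawn shows, sidesteps the header limit entirely and is usually the simpler fix.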
Re: Dynamic Type For Solr Schema
You can certainly do redirection of input values in an update processing, even in a JavaScript script. But there are also built-in language identifier update processors that can simultaneously identify what language is used in the input value for a field AND do the redirection to a language-specific field AND store the language code. See: LangDetectLanguageIdentifierUpdateProcessorFactory TikaLanguageIdentifierUpdateProcessorFactory http://lucene.apache.org/solr/4_3_0/solr-langid/org/apache/solr/update/processor/LangDetectLanguageIdentifierUpdateProcessorFactory.html http://lucene.apache.org/solr/4_3_0/solr-langid/org/apache/solr/update/processor/TikaLanguageIdentifierUpdateProcessorFactory.html http://wiki.apache.org/solr/LanguageDetection The non-Tika version may be better, depending on the nature of your input. Neither processor is in the new Apache Solr Reference Guide nor current release from Lucid, but see the detailed examples in my book. -- Jack Krupansky -Original Message- From: Furkan KAMACI Sent: Wednesday, June 26, 2013 10:51 AM To: solr-user@lucene.apache.org Subject: Dynamic Type For Solr Schema I use Solr 4.3.1 as SolrCloud. I know that I can define analyzer at schema.xml. Let's assume that I have specialized my analyzer for Turkish. However I want to have another analzyer too, i.e. for English. I have that fields at my schema: ... field name=content type=text_tr stored=true indexed=true/ field name=title type=text_tr stored=true indexed=true/ ... I have a field type as text_tr that is combined for Turkish. I have another field type as text_en that is combined for Englished. I have another field at my schema as lang. lang holds the language of document as en or tr. If I get a document that has a lang field holds *tr* I want that: ... field name=content type=*text_tr* stored=true indexed=true/ field name=title type=*text_tr* stored=true indexed=true/ ... If I get a document that has a lang field holds *en* I want that: ... field name=content type=*text_en* stored=true indexed=true/ field name=title type=*text_en* stored=true indexed=true/ ... I want dynamic types just for that fields other will be same. How can I do that properly at Solr? (UpdateRequestProcessor, ...?)
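A sketch of what such a chain might look like in solrconfig.xml, assuming the solr-langid contrib jar (and its dependencies) is on the classpath, that the schema defines content_en/title_en as text_en and content_tr/title_tr as text_tr, and that the chain is selected with update.chain=langid on the update handler; the parameter values are illustrative:

<updateRequestProcessorChain name="langid">
  <processor class="solr.LangDetectLanguageIdentifierUpdateProcessorFactory">
    <str name="langid.fl">title,content</str>
    <str name="langid.langField">lang</str>
    <str name="langid.whitelist">en,tr</str>
    <str name="langid.fallback">en</str>
    <bool name="langid.map">true</bool>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

With langid.map enabled, an incoming content value detected as Turkish is indexed into content_tr, which is about as close as Solr gets to the dynamic typing asked about in the original question.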
Re: Dynamic Type For Solr Schema
On 6/26/2013 8:51 AM, Furkan KAMACI wrote: If I get a document that has a lang field holds *tr* I want that: ... field name=content type=*text_tr* stored=true indexed=true/ field name=title type=*text_tr* stored=true indexed=true/ Changing the TYPE of a field based on the contents of another field isn't possible. The language detection that has been mentioned in your other replies makes it possible to direct different languages to different fields, but won't change the type. Solr is highly dependent on its schema. The schema is necessarily fairly static. This is changing to some degree with the schema REST API in newer versions, but even with that, types aren't dynamic. If you change them, you have to reindex. Making them dynamic would require a major rewrite of Solr internals, and it's very likely that nobody would be able to agree on the criteria used to choose a type. What you are trying to do could be done by writing a custom Lucene application, because Lucene has no schema. Field types are determined by whatever code you write yourself. The problem with this approach is that you have to write ALL the server code, something that you get for free with Solr. It would not be a trivial task. Thanks, Shawn
MoreLikeThis handler and pivot facets
Hi, I have the current workflow, which works fine: - User enters search text - Text is sent to SOLR as query. Quite some faceting is also included in the request. - Result comes back and extensive facet information is displayed. Now I want to allow my user to enter a whole reference text as search text. So I do the same as above, but send the text via POST to a MoreLikeThis handler. Therefore I add those additional parameters: mlt.fl = 'text_field' mlt.minwl = 1 mlt.maxqt = 20 mlt.minf = 0 and remove of course the q parameter. The rest of the request - i.e. the faceting parameters - is identical. But I do not get facets back. For my sample request, I can see that 499 documents were found, but all facets are just empty. And the facet_pivot key does not exist at all. Is there any known issue with MLT + facets? I know that MLT + facets worked for me, but not yet when using pivot facets. kind regards, Achim
Parallel Import Process on same core. Solr 3.5
Hello, I'm trying to execute a parallel DIH process and running into heap related issues, hoping somebody has experienced this and can recommend some options.. Using Solr 3.5 on CentOS. Currently have JVM heap 4GB min , 8GB max When executing the entities in a sequential process (entities executing in sequence by default), my heap never exceeds 3GB. When executing the parallel process, everything runs fine for roughly an hour, then I reach the 8GB max heap size and the process stalls/fails. More specifically, here's how I'm executing the parallel import process: I target a logical range (i.e WHERE some field BETWEEN 'SOME VALUE' AND 'SOME VALUE') within my entity queries. And within Solrconfig.xml, I've created corresponding data import handlers, one for each of these entities. My total rows fetch/count is 9M records. And when I initiate the import, I call each one, similar to the below (obviously I've stripped out my server naming conventions. http://[server]/[solrappname]/[corename]/[ImportHandlerName]?command=full-importentity=[NameOfEntityTargetting1]clean=true http://[server]/[solrappname]/[corename]/[ImportHandlerName]?command=full-importentity=[NameOfEntityTargetting2] I assume that when doing this, only the first import request needs to contain the clean=true param. I've divided each import query to target roughly the same amount of data, and in solrconfig, I've tried various things in hopes to reduce heap size. Here's my current config: useCompoundFilefalse/useCompoundFile mergeFactor15/mergeFactor !-- I've experimented with 10, 15,25 and haven't seen much differences -- ramBufferSizeMB100/ramBufferSizeMB maxMergeDocs2147483647/maxMergeDocs maxFieldLength1/maxFieldLength writeLockTimeout1000/writeLockTimeout commitLockTimeout1/commitLockTimeout lockTypesingle/lockType /indexDefaults mainIndex useCompoundFilefalse/useCompoundFile ramBufferSizeMB100/ramBufferSizeMB !-- I've bumped this up from 32 -- mergeFactor15/mergeFactor maxMergeDocs2147483647/maxMergeDocs maxFieldLength1/maxFieldLength unlockOnStartupfalse/unlockOnStartup /mainIndex updateHandler class=solr.DirectUpdateHandler2 autoCommit maxTime6/maxTime !-- I've experimented with various times here as well -- maxDocs25000/maxDocs !-- I've experimented with 25k, 500k, 100k -- /autoCommit maxPendingDeletes10/maxPendingDeletes /updateHandler What gets tricky is finding the sweet spot with these parameters, but wondering if anybody has any recommendations for an optimal config. Also, regarding autoCommit, I've even turned that feature off, but my heap size reaches its max sooner. I am wondering though, what would be the difference with autoCommit and passing in the commit=true param on each import query. Thanks in advance! Mike
Re: Parallel Import Process on same core. Solr 3.5
Hi Mike, Have you considered trying something like jhat or visualvm to see what's taking up room on the heap? http://docs.oracle.com/javase/6/docs/technotes/tools/share/jhat.html http://visualvm.java.net/ Michael Della Bitta Applications Developer o: +1 646 532 3062 | c: +1 917 477 7906 appinions inc. “The Science of Influence Marketing” 18 East 41st Street New York, NY 10017 t: @appinions https://twitter.com/Appinions | g+: plus.google.com/appinions w: appinions.com http://www.appinions.com/ On Wed, Jun 26, 2013 at 12:58 PM, Mike L. javaone...@yahoo.com wrote: Hello, I'm trying to execute a parallel DIH process and running into heap related issues, hoping somebody has experienced this and can recommend some options.. Using Solr 3.5 on CentOS. Currently have JVM heap 4GB min , 8GB max When executing the entities in a sequential process (entities executing in sequence by default), my heap never exceeds 3GB. When executing the parallel process, everything runs fine for roughly an hour, then I reach the 8GB max heap size and the process stalls/fails. More specifically, here's how I'm executing the parallel import process: I target a logical range (i.e WHERE some field BETWEEN 'SOME VALUE' AND 'SOME VALUE') within my entity queries. And within Solrconfig.xml, I've created corresponding data import handlers, one for each of these entities. My total rows fetch/count is 9M records. And when I initiate the import, I call each one, similar to the below (obviously I've stripped out my server naming conventions. http:// [server]/[solrappname]/[corename]/[ImportHandlerName]?command=full-importentity=[NameOfEntityTargetting1]clean=true http:// [server]/[solrappname]/[corename]/[ImportHandlerName]?command=full-importentity=[NameOfEntityTargetting2] I assume that when doing this, only the first import request needs to contain the clean=true param. I've divided each import query to target roughly the same amount of data, and in solrconfig, I've tried various things in hopes to reduce heap size. Here's my current config: useCompoundFilefalse/useCompoundFile mergeFactor15/mergeFactor!-- I've experimented with 10, 15,25 and haven't seen much differences -- ramBufferSizeMB100/ramBufferSizeMB maxMergeDocs2147483647/maxMergeDocs maxFieldLength1/maxFieldLength writeLockTimeout1000/writeLockTimeout commitLockTimeout1/commitLockTimeout lockTypesingle/lockType /indexDefaults mainIndex useCompoundFilefalse/useCompoundFile ramBufferSizeMB100/ramBufferSizeMB !-- I've bumped this up from 32 -- mergeFactor15/mergeFactor maxMergeDocs2147483647/maxMergeDocs maxFieldLength1/maxFieldLength unlockOnStartupfalse/unlockOnStartup /mainIndex updateHandler class=solr.DirectUpdateHandler2 autoCommit maxTime6/maxTime !-- I've experimented with various times here as well -- maxDocs25000/maxDocs !-- I've experimented with 25k, 500k, 100k -- /autoCommit maxPendingDeletes10/maxPendingDeletes /updateHandler What gets tricky is finding the sweet spot with these parameters, but wondering if anybody has any recommendations for an optimal config. Also, regarding autoCommit, I've even turned that feature off, but my heap size reaches its max sooner. I am wondering though, what would be the difference with autoCommit and passing in the commit=true param on each import query. Thanks in advance! Mike
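For example (the pid and file names are placeholders, and the dump is best taken while the import is running and the heap is growing):

# dump the live heap of the Solr JVM
jmap -dump:live,format=b,file=solr-heap.bin 12345

# browse the dump at http://localhost:7000 (give jhat plenty of heap of its own)
jhat -J-Xmx4g solr-heap.bin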
Re: Parallel Import Process on same core. Solr 3.5
On 6/26/2013 10:58 AM, Mike L. wrote: Hello, I'm trying to execute a parallel DIH process and running into heap related issues, hoping somebody has experienced this and can recommend some options.. Using Solr 3.5 on CentOS. Currently have JVM heap 4GB min , 8GB max When executing the entities in a sequential process (entities executing in sequence by default), my heap never exceeds 3GB. When executing the parallel process, everything runs fine for roughly an hour, then I reach the 8GB max heap size and the process stalls/fails. More specifically, here's how I'm executing the parallel import process: I target a logical range (i.e WHERE some field BETWEEN 'SOME VALUE' AND 'SOME VALUE') within my entity queries. And within Solrconfig.xml, I've created corresponding data import handlers, one for each of these entities. My total rows fetch/count is 9M records. And when I initiate the import, I call each one, similar to the below (obviously I've stripped out my server naming conventions. http://[server]/[solrappname]/[corename]/[ImportHandlerName]?command=full-importentity=[NameOfEntityTargetting1]clean=true http://[server]/[solrappname]/[corename]/[ImportHandlerName]?command=full-importentity=[NameOfEntityTargetting2] I assume that when doing this, only the first import request needs to contain the clean=true param. I've divided each import query to target roughly the same amount of data, and in solrconfig, I've tried various things in hopes to reduce heap size. Thanks for including some solrconfig snippets, but I think what we really need is your DIH configuration(s). Use a pastebin site and choose the proper document type. http://apaste.info is available and the proper type there would be (X)HTML. If you need to sanitize these to remove host/user/pass, please replace the values with something else rather than deleting them entirely. With full-import, clean defaults to true, so including it doesn't change anything. What I would actually do is have clean=true on the first import you run, then after waiting a few seconds to be sure it is running, start the others with clean=false so that they don't do ANOTHER clean. I suspect that you might be running into JDBC driver behavior where the entire result set is being buffered into RAM. Thanks, Shawn
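If the source database happens to be MySQL, the usual suggestion for that buffering problem is batchSize="-1" on the JdbcDataSource, which makes the driver stream rows instead of materializing the whole result set; the connection details below are placeholders:

<dataSource type="JdbcDataSource"
            driver="com.mysql.jdbc.Driver"
            url="jdbc:mysql://dbhost/dbname"
            user="solr"
            password="*****"
            batchSize="-1"/>

Other drivers have their own fetch-size knobs, so this is only a sketch of the general idea.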
Solr 4.2.1 - master taking long time to respond after tomcat restart
Upgraded from Solr 3.6.1 to 4.2.1. Since we wanted to use atomic updates, we enabled updateLog and made the few unstored int and boolean fields as stored. We have a single master and a single slave and all the queries go only to the slave. We make only max. 50 atomic update requests/hour to the master. Noticing that on restarting tomcat, the master Solr server takes several minutes to respond. This was not happening in 3.6.1. The slave is responding as quickly as before after restarting tomcat. Any ideas why only master would take this long?
Re: Solr 4.2.1 - master taking long time to respond after tomcat restart
On 6/26/2013 11:18 AM, Arun Rangarajan wrote: Upgraded from Solr 3.6.1 to 4.2.1. Since we wanted to use atomic updates, we enabled updateLog and made the few unstored int and boolean fields as stored. We have a single master and a single slave and all the queries go only to the slave. We make only max. 50 atomic update requests/hour to the master. Noticing that on restarting tomcat, the master Solr server takes several minutes to respond. This was not happening in 3.6.1. The slave is responding as quickly as before after restarting tomcat. Any ideas why only master would take this long? Classic problem after enabling the updateLog: http://wiki.apache.org/solr/SolrPerformanceProblems#Slow_startup Thanks, Shawn
Need help with indexing names in a pdf
We receive about 100 documents a day of various sizes. The documents could pertain to any of the 40,000 contacts stored in our database, and could involve more than one. For each file we have, we maintain a list of contacts that are related to or involved in that file. I know it will never be exact, but I'd like to index possible names in the text, and then attempt to identify which files the document might pertain to, by looking at files that are tied to contacts mentioned in the document. I've found some regex code to parse names from the text, but does anyone have any ideas on how to set up the index? There are currently approximately 900,000 documents in our library. --Warren
Re: Solr 4.2.1 - master taking long time to respond after tomcat restart
You need to do occasional hard commits, otherwise the update log just grows and grows and gets replayed on each server start. -- Jack Krupansky -Original Message- From: Arun Rangarajan Sent: Wednesday, June 26, 2013 1:18 PM To: solr-user@lucene.apache.org Subject: Solr 4.2.1 - master taking long time to respond after tomcat restart Upgraded from Solr 3.6.1 to 4.2.1. Since we wanted to use atomic updates, we enabled updateLog and made the few unstored int and boolean fields as stored. We have a single master and a single slave and all the queries go only to the slave. We make only max. 50 atomic update requests/hour to the master. Noticing that on restarting tomcat, the master Solr server takes several minutes to respond. This was not happening in 3.6.1. The slave is responding as quickly as before after restarting tomcat. Any ideas why only master would take this long?
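A sketch of the sort of setting Jack and Shawn are pointing at, for solrconfig.xml on the master; the interval is only illustrative, and openSearcher=false keeps the periodic hard commit from affecting search visibility or caches while still truncating the transaction log:

<updateHandler class="solr.DirectUpdateHandler2">
  <updateLog>
    <str name="dir">${solr.ulog.dir:}</str>
  </updateLog>
  <autoCommit>
    <maxTime>60000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
</updateHandler>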
Re: [solr cloud] solr hangs when indexing large number of documents from multiple threads
Thank you Erick! Will look at all these suggestions. -Vinay On Wed, Jun 26, 2013 at 6:37 AM, Erick Erickson erickerick...@gmail.comwrote: Right, unfortunately this is a gremlin lurking in the weeds, see: http://wiki.apache.org/solr/DistributedSearch#Distributed_Deadlock There are a couple of ways to deal with this: 1 go ahead and up the limit and re-compile, if you look at SolrCmdDistributor the semaphore is defined there. 2 https://issues.apache.org/jira/browse/SOLR-4816 should address this as well as improve indexing throughput. I'm totally sure Joel (the guy working on this) would be thrilled if you were able to verify that these two points, I'd ask him (on the JIRA) whether he thinks it's ready to test. 3 Reduce the number of threads you're indexing with 4 index docs in small packets, perhaps even one and just rack together a zillion threads to get throughput. FWIW, Erick On Tue, Jun 25, 2013 at 8:55 AM, Vinay Pothnis poth...@gmail.com wrote: Jason and Scott, Thanks for the replies and pointers! Yes, I will consider the 'maxDocs' value as well. How do i monitor the transaction logs during the interval between commits? Thanks Vinay On Mon, Jun 24, 2013 at 8:48 PM, Jason Hellman jhell...@innoventsolutions.com wrote: Scott, My comment was meant to be a bit tongue-in-cheek, but my intent in the statement was to represent hard failure along the lines Vinay is seeing. We're talking about OutOfMemoryException conditions, total cluster paralysis requiring restart, or other similar and disastrous conditions. Where that line is is impossible to generically define, but trivial to accomplish. What any of us running Solr has to achieve is a realistic simulation of our desired production load (probably well above peak) and to see what limits are reached. Armed with that information we tweak. In this case, we look at finding the point where data ingestion reaches a natural limit. For some that may be JVM GC, for others memory buffer size on the client load, and yet others it may be I/O limits on multithreaded reads from a database or file system. In old Solr days we had a little less to worry about. We might play with a commitWithin parameter, ramBufferSizeMB tweaks, or contemplate partial commits and rollback recoveries. But with 4.x we now have more durable write options and NRT to consider, and SolrCloud begs to use this. So we have to consider transaction logs, the file handles they leave open until commit operations occur, and how we want to manage writing to all cores simultaneously instead of a more narrow master/slave relationship. It's all manageable, all predictable (with some load testing) and all filled with many possibilities to meet our specific needs. Considering hat each person's data model, ingestion pipeline, request processors, and field analysis steps will be different, 5 threads of input at face value doesn't really contemplate the whole problem. We have to measure our actual data against our expectations and find where the weak chain links are to strengthen them. The symptoms aren't necessarily predictable in advance of this testing, but they're likely addressable and not difficult to decipher. For what it's worth, SolrCloud is new enough that we're still experiencing some uncharted territory with unknown ramifications but with continued dialog through channels like these there are fewer territories without good cartography :) Hope that's of use! 
Jason On Jun 24, 2013, at 7:12 PM, Scott Lundgren scott.lundg...@carbonblack.com wrote: Jason, Regarding your statement push you over the edge- what does that mean? Does it mean uncharted territory with unknown ramifications or something more like specific, known symptoms? I ask because our use is similar to Vinay's in some respects, and we want to be able to push the capabilities of write perf - but not over the edge! In particular, I am interested in knowing the symptoms of failure, to help us troubleshoot the underlying problems if and when they arise. Thanks, Scott On Monday, June 24, 2013, Jason Hellman wrote: Vinay, You may wish to pay attention to how many transaction logs are being created along the way to your hard autoCommit, which should truncate the open handles for those files. I might suggest setting a maxDocs value in parallel with your maxTime value (you can use both) to ensure the commit occurs at either breakpoint. 30 seconds is plenty of time for 5 parallel processes of 20 document submissions to push you over the edge. Jason On Jun 24, 2013, at 2:21 PM, Vinay Pothnis poth...@gmail.com wrote: I have 'softAutoCommit' at 1 second and 'hardAutoCommit' at 30 seconds. On Mon, Jun 24, 2013 at 1:54 PM, Jason Hellman jhell...@innoventsolutions.com
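A rough sketch of option 4 above (indexing in small packets) with SolrJ; the batch size is only a starting point to tune, and commits are left to autoCommit/commitWithin:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {
    private static final int BATCH_SIZE = 100;  // deliberately small packets

    public static void index(SolrServer server, Iterable<SolrInputDocument> docs)
            throws SolrServerException, IOException {
        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>(BATCH_SIZE);
        for (SolrInputDocument doc : docs) {
            batch.add(doc);
            if (batch.size() >= BATCH_SIZE) {
                server.add(batch);  // send the packet, then reuse the list
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            server.add(batch);
        }
    }
}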
Re: Querying multiple collections in SolrCloud
Thanks Erick, that's a very helpful answer. Regarding the grouping option, does that require all the docs to be put into a single collection, or could it be done with across N collections (assuming each collection had a common type field for grouping on)? Chris On Wed, Jun 26, 2013 at 7:01 AM, Erick Erickson erickerick...@gmail.comwrote: bq: Would the above setup qualify as multiple compatible collections No. While there may be enough fields in common to form a single query, the TF/IDF calculations will not be compatible and the scores from the various collections will NOT be comparable. So simply getting the list of top N docs will probably be dominated by the docs from a single type. bq: How does SolrCloud combine the query results from multiple collections? It doesn't. SolrCloud sorts the results from multiple nodes in the _same_ collection according to whatever sort criteria are specified, defaulting to score. Say you ask for the top 20 docs. A node from each shard returns the top 20 docs for that shard. The node processing them just merges all the returned lists and only keeps the top 20. I don't think your last two questions are really relevant, SolrCloud isn't built to query multiple collections and return the results coherently. The root problem here is that you're trying to compare docs from different collections for goodness to return the top N. This isn't actually hard _except_ when goodness is the score, then it just doesn't work. You can't even compare scores from different queries on the _same_ collection, much less different ones. Consider two collections, books and songs. One consists of lots and lots of text and the ter frequency and inverse doc freq (TF/IDF) will be hugely different than songs. Not to mention field length normalization. Now, all that aside there's an option. Index all the docs in a single collection and use grouping (aka field collapsing) to get a single response that has the top N docs from each type (they'll be in different sections of the original response) and present them to the user however makes sense. You'll get hands on experience in why this isn't something that's easy to do automatically if you try to sort these into a single list by relevance G... Best Erick On Tue, Jun 25, 2013 at 3:35 PM, Chris Toomey ctoo...@gmail.com wrote: Thanks Jack for the alternatives. The first is interesting but has the downside of requiring multiple queries to get the full matching docs. The second is interesting and very simple, but has the downside of not being modular and being difficult to configure field boosting when the collections have overlapping field names with different boosts being needed for the same field in different document types. I'd still like to know about the viability of my original approach though too. Chris On Tue, Jun 25, 2013 at 3:19 PM, Jack Krupansky j...@basetechnology.com wrote: One simple scenario to consider: N+1 collections - one collection per document type with detailed fields for that document type, and one common collection that indexes a subset of the fields. The main user query would be an edismax over the common fields in that main collection. You can then display summary results from the common collection. You can also then support drill down into the type-specific collection based on a type field for each document in the main collection. 
Or, sure, you actually CAN index multiple document types in the same collection - add all the fields to one schema - there is no time or space penalty if most of the field are empty for most documents. -- Jack Krupansky -Original Message- From: Chris Toomey Sent: Tuesday, June 25, 2013 6:08 PM To: solr-user@lucene.apache.org Subject: Querying multiple collections in SolrCloud Hi, I'm investigating using SolrCloud for querying documents of different but similar/related types, and have read through docs. on the wiki and done many searches in these archives, but still have some questions. Thanks in advance for your help. Setup: * Say that I have N distinct types of documents and I want to do queries that return the best matches regardless document type. I.e., something akin to a Google search where I'd like to get the best matches from the web, news, images, and maps. * Our main use case is supporting simple user-entered searches, which would just contain terms / phrases and wouldn't specify fields. * The document types will not all have the same fields, though there may be some overlap in the fields. * We plan to use a separate collection for each document type, and to use the eDisMax query parser. Each collection would have a document-specific schema configuration with appropriate defaults for query fields and boosts, etc. Questions: * Would the above setup qualify as multiple compatible collections, such
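For reference, a grouped request along the lines Erick describes might look like this, assuming everything is indexed into one collection with a single-valued doc_type field (collection and field names are placeholders):

http://localhost:8983/solr/combined/select?q=user+query&group=true&group.field=doc_type&group.limit=5

Each doc_type comes back as its own group with its own top 5 documents, so the per-type result lists never have to be merged by score.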
Solr document auto-upload?
Is it possible to configure Solr to automatically grab documents in a specified directory, without having to use the post command? I've not found any way to do this, though admittedly, I'm not terribly experienced with config files of this type. Thanks! - | A.Spielman | In theory there is no difference between theory and practice. In practice there is. - Chuck Reid -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-document-auto-upload-tp4073373.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Need help with indexing names in a pdf
This kind of text processing is called entity extraction. I'm not up to date on what is available in Solr, but search on that. wunder On Jun 26, 2013, at 10:26 AM, Warren H. Prince wrote: We receive about 100 documents a day of various sizes. The documents could pertain to any of 40,000 contacts stored in our database, and could include more than one. For each file we have, we maintain a list of contacts that are related to or involved in that file. I know it will never be exact, but I'd like to index possible names in the text, and then attempt to identify which files the document might pertain to, looking at the files that are tied to contacts contained in the document. I've found some regex code to parse names from the text, but does anyone have any ideas on how to set up the index? There are currently approximately 900,000 documents in our library. --Warren
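For what it's worth, a rough sketch of the pipeline Warren describes; the name pattern is deliberately naive and the possible_names field is a hypothetical multiValued string field, not something defined in this thread:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class NameIndexer {
        // naive "Firstname Lastname" matcher; a real entity extractor is far more robust
        private static final Pattern NAME = Pattern.compile("\\b[A-Z][a-z]+\\s+[A-Z][a-z]+\\b");

        public static void main(String[] args) throws Exception {
            SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/documents");

            String docId = "file-123";   // hypothetical document id
            String text = "Met with John Smith and Mary Jones about the estate.";

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", docId);
            doc.addField("text", text);
            Matcher m = NAME.matcher(text);
            while (m.find()) {
                // possible_names: assumed multiValued string field used to join back to contacts
                doc.addField("possible_names", m.group());
            }
            solr.add(doc);
            solr.commit();
        }
    }

Matching possible_names against the 40,000 known contacts can then be a per-document filter query, though a proper entity extractor would cut down the false positives considerably.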
OOM killer script woes
Recently upgraded to 4.3.1 but this problem has persisted for a while now ... I'm using the following configuration when starting Jetty: -XX:OnOutOfMemoryError=/home/solr/oom_killer.sh 83 %p If an OOM is triggered during Solr web app initialization (such as by me lowering -Xmx to a value that is too low to initialize Solr with), then the script gets called and does what I expect! However, once the Solr webapp initializes and Solr is happily responding to updates and queries, the script doesn't actually get invoked when an OOM occurs! All I see is the following in the stdout/stderr log of my process: # # java.lang.OutOfMemoryError: Java heap space # -XX:OnOutOfMemoryError=/home/solr/oom_killer.sh 83 %p # Executing /bin/sh -c /home/solr/oom_killer.sh 83 21358... The oom_killer.sh script doesn't actually get called! So to recap, it works if an OOM occurs during initialization, but once Solr is running, the OOM killer doesn't fire correctly. This leads me to believe my script is fine and there's something else going wrong. Here's the oom_killer.sh script (pretty basic): #!/bin/bash SOLR_PORT=$1 SOLR_PID=$2 NOW=$(date +%Y%m%d_%H%M) ( echo Running OOM killer script for process $SOLR_PID for Solr on port 89$SOLR_PORT kill -9 $SOLR_PID echo Killed process $SOLR_PID exec /home/solr/solr-dg/dg-solr.sh recover $SOLR_PORT echo Restarted Solr on 89$SOLR_PORT after OOM ) | tee oom_killer-89$SOLR_PORT-$NOW.log Anyone see anything like this before? Suggestions on where to begin tracking down this issue? Cheers, Tim
Is there a way to build indexes using SOLRJ without SOLR instance?
I currently have a SOLRJ program which I am using for indexing the data in SOLR. I am trying to figure out a way to build an index without depending on a running instance of SOLR. I should be able to supply the solrconfig and schema.xml to the indexing program, which in turn creates index files that I can use with any SOLR instance. Is it possible to implement this? -- View this message in context: http://lucene.472066.n3.nabble.com/Is-there-a-way-to-build-indexes-using-SOLRJ-without-SOLR-instance-tp4073383.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr document auto-upload?
Take a look at LucidWorks Search for automated crawler scheduling: http://docs.lucidworks.com/display/help/Create+or+Edit+a+Schedule http://docs.lucidworks.com/display/lweug/Data+Source+Schedules ManifoldCF also has crawler job scheduling: http://manifoldcf.apache.org/release/trunk/en_US/end-user-documentation.html I think the general idea on Unix is that cron is the obvious way to schedule periodic operations. You could certainly do a custom request handler that initializes with a thread on a timer and initiates custom directory crawling of your own. But there is no such feature directly implemented in Solr. -- Jack Krupansky -Original Message- From: aspielman Sent: Wednesday, June 26, 2013 2:16 PM To: solr-user@lucene.apache.org Subject: Solr document auto-upload? Is it possible to to configure Solr to automatically grab documents in a specidfied directory, with having to use the post command? I've not found any way to do this, though admittedly, I'm not terribly experienced with config files of this type. Thanks! - | A.Spielman | In theory there is no difference between theory and practice. In practice there is. - Chuck Reid -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-document-auto-upload-tp4073373.html Sent from the Solr - User mailing list archive at Nabble.com.
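In the absence of a built-in watcher, a small external SolrJ job run from cron stands in for the post command. This is only a sketch: the drop directory, core name and use of the extracting handler (Solr Cell) are assumptions, and addFile's exact signature varies a little between SolrJ releases:

    import java.io.File;

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

    public class DirectoryUploader {
        public static void main(String[] args) throws Exception {
            SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
            File[] files = new File("/data/incoming").listFiles();   // assumed drop directory
            if (files == null) return;

            for (File f : files) {
                // push each file through the extracting handler
                ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
                req.addFile(f, "application/octet-stream");
                req.setParam("literal.id", f.getName());
                solr.request(req);
                // move or delete f here so it is not re-indexed on the next run
            }
            solr.commit();
        }
    }

Scheduled from cron every few minutes, this behaves like an auto-upload without any Solr-side configuration.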
Re: Is there a way to build indexes using SOLRJ without SOLR instance?
Yes, it is possible by running an embedded Solr inside the SolrJ process. The nice thing is that the index is portable, so you can then access it from the standalone Solr server later. I have an example here: https://github.com/arafalov/solr-indexing-book/tree/master/published/solrj , which shows SolrJ running both as a client and with an embedded container. Notice that you will probably need more jars than you expect for the standalone Solr to work, including a number of servlet jars. Regards, Alex. Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book) On Wed, Jun 26, 2013 at 2:59 PM, Learner bbar...@gmail.com wrote: I currently have a SOLRJ program which I am using for indexing the data in SOLR. I am trying to figure out a way to build index without depending on running instance of SOLR. I should be able to supply the solrconfig and schema.xml to the indexing program which in turn create index files that I can use with any SOLR instance. Is it possible to implement this? -- View this message in context: http://lucene.472066.n3.nabble.com/Is-there-a-way-to-build-indexes-using-SOLRJ-without-SOLR-instance-tp4073383.html Sent from the Solr - User mailing list archive at Nabble.com.
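A minimal sketch of that embedded approach, assuming a standard solr home layout with a core named collection1; the CoreContainer bootstrap differs slightly across 4.x releases, so treat this as illustrative rather than exact:

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.solr.core.CoreContainer;

    public class OfflineIndexer {
        public static void main(String[] args) throws Exception {
            // solr home contains solr.xml plus collection1/conf/{solrconfig.xml,schema.xml}
            CoreContainer container = new CoreContainer("/path/to/solr-home");
            container.load();
            SolrServer solr = new EmbeddedSolrServer(container, "collection1");

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "1");
            doc.addField("title", "indexed without a running Solr server");
            solr.add(doc);
            solr.commit();

            container.shutdown();   // the index under collection1/data is now portable
        }
    }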
Re: Parallel Import Process on same core. Solr 3.5
Thanks for the response. Here's the scrubbed version of my DIH: http://apaste.info/6uGH It contains everything I'm more or less doing...pretty straight forward.. One thing to note and I don't know if this is a bug or not, but the batchSize=-1 streaming feature doesn't seem to work, at least with informix jdbc drivers. I set the batchsize to 500, but have tested it with various numbers including 5000, 1. I'm aware that behind the scenes this should be just setting the fetchsize, but it's a bit puzzling why I don't see a difference regardless of what value I actually use. I was told by one of our DBA's that our value is set as a global DB param and can't be modified (which I haven't looked into afterward.) As far as HEAP patterns, I watch the process via WILY and notice GC occurs every 15 minutes or so, but becomes infrequent and not as significant as the previous one. It's almost as if some memory is never released until it eventually catches up to the max heap size. I did assume that perhaps there could have been some locking issues, which is why I made the following modifications: readOnly=true transactionIsolation=TRANSACTION_READ_UNCOMMITTED What do you recommend for the mergeFactor, ramBufferSize and autoCommit options? My general understanding is the higher the mergeFactor, the less frequent the merges, which should improve index time but slow down query response time. I also read somewhere that an increase on the ramBufferSize should help prevent frequent merges...but confused why I didn't really see an improvement...perhaps my combination of these values wasn't right in relation to my total fetch size. Also - my impression is the lower the autoCommit maxDocs/maxTime numbers (i.e. the defaults) the better on memory management, but at a cost on index time as you pay for the overhead of committing. That is a number I've been experimenting with as well and have seen some variations in heap trends but unfortunately, have not completed the job quite yet with any config... I did get very close.. I'd hate to throw additional memory at the problem if there is something else I can tweak.. Thanks! Mike From: Shawn Heisey s...@elyograg.org To: solr-user@lucene.apache.org Sent: Wednesday, June 26, 2013 12:13 PM Subject: Re: Parallal Import Process on same core. Solr 3.5 On 6/26/2013 10:58 AM, Mike L. wrote: Hello, I'm trying to execute a parallel DIH process and running into heap related issues, hoping somebody has experienced this and can recommend some options.. Using Solr 3.5 on CentOS. Currently have JVM heap 4GB min, 8GB max When executing the entities in a sequential process (entities executing in sequence by default), my heap never exceeds 3GB. When executing the parallel process, everything runs fine for roughly an hour, then I reach the 8GB max heap size and the process stalls/fails. More specifically, here's how I'm executing the parallel import process: I target a logical range (i.e. WHERE some field BETWEEN 'SOME VALUE' AND 'SOME VALUE') within my entity queries. And within Solrconfig.xml, I've created corresponding data import handlers, one for each of these entities. My total rows fetch/count is 9M records. And when I initiate the import, I call each one, similar to the below (obviously I've stripped out my server naming conventions).
http://[server]/[solrappname]/[corename]/[ImportHandlerName]?command=full-import&entity=[NameOfEntityTargetting1]&clean=true http://[server]/[solrappname]/[corename]/[ImportHandlerName]?command=full-import&entity=[NameOfEntityTargetting2] I assume that when doing this, only the first import request needs to contain the clean=true param. I've divided each import query to target roughly the same amount of data, and in solrconfig, I've tried various things in hopes to reduce heap size. Thanks for including some solrconfig snippets, but I think what we really need is your DIH configuration(s). Use a pastebin site and choose the proper document type. http://apaste.info/ is available and the proper type there would be (X)HTML. If you need to sanitize these to remove host/user/pass, please replace the values with something else rather than deleting them entirely. With full-import, clean defaults to true, so including it doesn't change anything. What I would actually do is have clean=true on the first import you run, then after waiting a few seconds to be sure it is running, start the others with clean=false so that they don't do ANOTHER clean. I suspect that you might be running into JDBC driver behavior where the entire result set is being buffered into RAM. Thanks, Shawn
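A hedged sketch of kicking the imports off the way Shawn suggests, with clean=true only on the first handler; the handler names are placeholders rather than anything defined in this thread:

    import java.net.HttpURLConnection;
    import java.net.URL;

    public class ParallelImportKickoff {
        public static void main(String[] args) throws Exception {
            String base = "http://localhost:8080/solr/collection1";
            // placeholder DIH handler names, one per entity/range
            String[] handlers = {"dataimport1", "dataimport2", "dataimport3"};

            for (int i = 0; i < handlers.length; i++) {
                // only the first import cleans the index; the rest must say clean=false explicitly
                String clean = (i == 0) ? "true" : "false";
                URL url = new URL(base + "/" + handlers[i] + "?command=full-import&clean=" + clean);
                HttpURLConnection con = (HttpURLConnection) url.openConnection();
                System.out.println(handlers[i] + " -> HTTP " + con.getResponseCode());
                con.disconnect();
                Thread.sleep(5000);   // give the first import a head start before the others begin
            }
        }
    }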
How to get values of external file field(s) in Solr query?
http://docs.lucidworks.com/display/solr/Working+with+External+Files+and+Processes says this about external file fields: They can be used only for function queries or display. I understand how to use them in function queries, but how do I retrieve the values for display? If I want to fetch only the values of a single external file field for a set of primary keys, I can do: q=_val_:EXT_FILE_FIELD&fq=id:(doc1 doc2 doc3)&fl=id,score For this query, the score is the value of the external file field. But how to get the values for docs that match some arbitrary query? Is there a syntax trick that will work where the value of the ext file field does not affect the score of the main query, but I can still retrieve its value? Also is it possible to retrieve the values of more than one external file field in a single query?
Re: Solr 4.2.1 - master taking long time to respond after tomcat restart
Thanks, Shawn & Jack. I will go with the wiki and use autoCommit with openSearcher set to false. On Wed, Jun 26, 2013 at 10:23 AM, Jack Krupansky j...@basetechnology.com wrote: You need to do occasional hard commits, otherwise the update log just grows and grows and gets replayed on each server start. -- Jack Krupansky -Original Message- From: Arun Rangarajan Sent: Wednesday, June 26, 2013 1:18 PM To: solr-user@lucene.apache.org Subject: Solr 4.2.1 - master taking long time to respond after tomcat restart Upgraded from Solr 3.6.1 to 4.2.1. Since we wanted to use atomic updates, we enabled updateLog and made the few unstored int and boolean fields stored. We have a single master and a single slave and all the queries go only to the slave. We make only max. 50 atomic update requests/hour to the master. We are noticing that on restarting tomcat, the master Solr server takes several minutes to respond. This was not happening in 3.6.1. The slave is responding as quickly as before after restarting tomcat. Any ideas why only the master would take this long?
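Configuring autoCommit with openSearcher=false in solrconfig.xml is the usual fix; as a hedged alternative sketch, the client that sends the atomic updates can simply issue an occasional explicit hard commit itself (the URL and interval here are assumptions):

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;

    public class PeriodicHardCommit {
        public static void main(String[] args) throws Exception {
            SolrServer master = new HttpSolrServer("http://master-host:8080/solr/core1");
            while (true) {
                // a plain commit() is a hard commit: it flushes segments and keeps the
                // update log from growing without bound between restarts
                master.commit();
                Thread.sleep(15 * 60 * 1000L);   // every 15 minutes
            }
        }
    }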
Re: Is there a way to build indexes using SOLRJ without SOLR instance?
AFAIK solrj is just the network client that connects to a Solr server using Java, now, if you just need to index your data on your local HDD you might want to step back to Lucene. I'm assuming you are using Java so you could also annotate your POJO's with Lucene annotations, google hibernate-search, maybe that's what you are looking for. HTH, Guido. On 26/06/13 19:59, Learner wrote: I currently have a SOLRJ program which I am using for indexing the data in SOLR. I am trying to figure out a way to build index without depending on running instance of SOLR. I should be able to supply the solrconfig and schema.xml to the indexing program which in turn create index files that I can use with any SOLR instance. Is it possible to implement this? -- View this message in context: http://lucene.472066.n3.nabble.com/Is-there-a-way-to-build-indexes-using-SOLRJ-without-SOLR-instance-tp4073383.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: OOM killer script woes
Ooh, I guess Jetty is trapping that java.lang.OutOfMemoryError, and throwing it/packaging it as a java.lang.RuntimeException. The -XX option assumes that the application doesn't handle the Errors and so they would reach the JVM and thus invoke the handler. Since Jetty has an exception handler that is dealing with anything (including Errors), they never reach the JVM, hence no handler. Not much we can do short of not using Jetty? That's a pain, I'd just written a nice OOM handler too! On 26 June 2013 20:37, Timothy Potter thelabd...@gmail.com wrote: A little more to this ... Just on chance this was a weird Jetty issue or something, I tried with the latest 9 and the problem still occurs :-( This is on Java 7 on debian: java version 1.7.0_21 Java(TM) SE Runtime Environment (build 1.7.0_21-b11) Java HotSpot(TM) 64-Bit Server VM (build 23.21-b01, mixed mode) Here is an example stack trace from the log 2013-06-26 19:31:33,801 [qtp632640515-62] ERROR solr.servlet.SolrDispatchFilter Q:22 - null:java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:670) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:380) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1423) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:450) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138) at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564) at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1083) at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:379) at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1017) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136) at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:258) at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97) at org.eclipse.jetty.server.Server.handle(Server.java:445) at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:260) at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:225) at org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358) at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:596) at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:527) at java.lang.Thread.run(Thread.java:722) Caused by: java.lang.OutOfMemoryError: Java heap space On Wed, Jun 26, 2013 at 12:27 PM, Timothy Potter thelabd...@gmail.com wrote: Recently upgraded to 4.3.1 but this problem has persisted for a while now ... I'm using the following configuration when starting Jetty: -XX:OnOutOfMemoryError=/home/solr/oom_killer.sh 83 %p If an OOM is triggered during Solr web app initialization (such as by me lowering -Xmx to a value that is too low to initialize Solr with), then the script gets called and does what I expect! However, once the Solr webapp initializes and Solr is happily responding to updates and queries.
When an OOM occurs in this situation, then the script doesn't actually get invoked! All I see is the following in the stdout/stderr log of my process: # # java.lang.OutOfMemoryError: Java heap space # -XX:OnOutOfMemoryError=/home/solr/oom_killer.sh 83 %p # Executing /bin/sh -c /home/solr/oom_killer.sh 83 21358... The oom_killer.sh script doesn't actually get called! So to recap, it works if an OOM occurs during initialization but once Solr is running, the OOM killer doesn't fire correctly. This leads me to believe my script is fine and there's something else going wrong. Here's the oom_killer.sh script (pretty basic): #!/bin/bash SOLR_PORT=$1 SOLR_PID=$2 NOW=$(date +%Y%m%d_%H%M) ( echo Running OOM killer script for process $SOLR_PID for Solr on port 89$SOLR_PORT kill -9 $SOLR_PID echo Killed process $SOLR_PID exec /home/solr/solr-dg/dg-solr.sh recover $SOLR_PORT echo Restarted Solr on 89$SOLR_PORT after OOM ) | tee oom_killer-89$SOLR_PORT-$NOW.log Anyone see anything like this before? Suggestions on where to begin tracking down this issue? Cheers, Tim
Re: Is there a way to build indexes using SOLRJ without SOLR instance?
Never heard of embedded Solr server, isn't it better to just use Lucene alone for that purpose? Using a helper like Hibernate? Since most applications that require indexes will have a relational DB behind the scene, it would not be a bad idea to use an ORM combined with Lucene annotations (aka hibernate-search) Guido. On 26/06/13 20:30, Alexandre Rafalovitch wrote: Yes, it is possible by running an embedded Solr inside SolrJ process. The nice thing is that the index is portable, so you can then access it from the standalone Solr server later. I have an example here: https://github.com/arafalov/solr-indexing-book/tree/master/published/solrj , which shows SolrJ running both as a client and with an embedded container. Notice that you will probably need more jars than you expect for the standalone Solr to work, including a number of servlet jars. Regards, Alex. Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book) On Wed, Jun 26, 2013 at 2:59 PM, Learner bbar...@gmail.com wrote: I currently have a SOLRJ program which I am using for indexing the data in SOLR. I am trying to figure out a way to build index without depending on running instance of SOLR. I should be able to supply the solrconfig and schema.xml to the indexing program which in turn create index files that I can use with any SOLR instance. Is it possible to implement this? -- View this message in context: http://lucene.472066.n3.nabble.com/Is-there-a-way-to-build-indexes-using-SOLRJ-without-SOLR-instance-tp4073383.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Parallel Import Process on same core. Solr 3.5
On 6/26/2013 1:36 PM, Mike L. wrote: Here's the scrubbed version of my DIH: http://apaste.info/6uGH It contains everything I'm more or less doing...pretty straight forward.. One thing to note and I don't know if this is a bug or not, but the batchSize=-1 streaming feature doesn't seem to work, at least with informix jdbc drivers. I set the batchsize to 500, but have tested it with various numbers including 5000, 1. I'm aware that behind the scenes this should be just setting the fetchsize, but its a bit puzzling why I don't see a difference regardless of what value I actually use. I was told by one of our DBA's that our value is set as a global DB param and can't be modified (which I haven't looked into afterward.) Setting the batchSize to -1 causes DIH to set fetchSize to Integer.MIN_VALUE (around negative two billion), which seems to be a MySQL-specific hack to enable result streaming. I've never heard of it working on any other JDBC driver. Assuming that the Informix JDBC driver is actually honoring the fetchSize, setting batchSize in the DIH config should be enough. If it's not, then it's a bug in the JDBC driver or possibly a server misconfiguration. As far as HEAP patterns, I watch the process via WILY and notice GC occurs every 15min's or so, but becomes infrequent and not as significant as the previous one. It's almost as if some memory is never released until it eventually catches up to the max heap size. I did assume that perhaps there could have been some locking issues, which is why I made the following modifications: readOnly=true transactionIsolation=TRANSACTION_READ_UNCOMMITTED I can't really comment here. It does appear that the Informix JDBC driver is not something you can download from IBM's website without paying them money. I would suggest going to IBM (or an informix-related support avenue) for some help, ESPECIALLY if you've paid money for it. What do you recommend for the mergeFactor,ramBufferSize and autoCommit options? My general understanding is the higher the mergeFactor, the less frequent merges which should improve index time, but slow down query response time. I also read somewhere that an increase on the ramBufferSize should help prevent frequent merges...but confused why I didn't really see an improvement...perhaps my combination of these values wasn't right in relation to my total fetch size. Of these, ramBufferSizeMB is the only one that should have a *significant* effect on RAM usage, and at a value of 100, I would not expect there to be a major issue unless you are doing a lot of imports at the same time. Because you are using Solr 3.5, if you do not need your import results to be visible until the end, I wouldn't worry about using autoCommit. If you were using Solr 4.x, I would recommend that you turn autoCommit on, but with openSearcher set to false. Also- my impression is the lower the autoCommit maxDocs/maxTime numbers (i.e the defaults) the better on memory management, but cost on index time as you pay for the overhead of committing. That is a number I've been experimenting with as well and have scene some variations in heap trends but unfortunately, have not completed the job quite yet with any config... I did get very close.. I'd hate to throw additional memory at the problem if there is something else I can tweak.. General impressions: Unless the amount of data involved in each Solr document is absolutely enormous, this is very likely bugs (memory leaks or fetchSize problems) in the Informix JDBC driver. 
I did find the following page, but it's REALLY REALLY old, which hopefully means that it doesn't apply. http://www-01.ibm.com/support/docview.wss?uid=swg21260832 If your documents ARE huge, then you probably need to give more memory to the java heap ... but you might still have memory leak bugs in the JDBC driver. When it comes to Java and Lucene/Solr, IBM has a *terrible* track record, especially for people using the IBM Java VM. I would not be surprised if their JDBC driver is plagued by similar problems. If you do find a support resource and they tell you that you should change your JDBC code to work differently, then you need to tell them that you can't change the JDBC code and that they need to give you a configuration URL workaround. Here's another possibility of a bug that causes memory leaks: http://www-01.ibm.com/support/docview.wss?uid=swg1IC58469 You might ask whether the problem could be a memory leak in Solr. It's always possible, but I've had a lot of experience with DIH from MySQL on Solr 1.4.0, 1.4.1, 3.2.0, 3.5.0, and 4.2.1. I've never seen any signs of a leak. Thanks, Shawn
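For reference, DIH's batchSize is essentially handed to the JDBC driver as the fetch size, so the behaviour under discussion boils down to whether the driver honors a plain fetch-size hint. A sketch in plain JDBC, with the Informix-style URL, credentials and table purely illustrative:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class FetchSizeDemo {
        public static void main(String[] args) throws Exception {
            // placeholder JDBC URL and credentials
            Connection con = DriverManager.getConnection("jdbc:informix-sqli://host:1526/db", "user", "pass");
            con.setReadOnly(true);

            Statement stmt = con.createStatement();
            // this is roughly what DIH does with batchSize="500": a hint the driver may ignore
            stmt.setFetchSize(500);
            // MySQL's streaming trick is setFetchSize(Integer.MIN_VALUE); most other drivers ignore or reject it
            ResultSet rs = stmt.executeQuery("SELECT id, name FROM contacts");
            while (rs.next()) {
                System.out.println(rs.getString("id"));
            }
            rs.close();
            stmt.close();
            con.close();
        }
    }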
Re: Is it possible to search Solr with a longer query string?
Oh this is good! On Wed, Jun 26, 2013 at 12:05 PM, Shawn Heisey s...@elyograg.org wrote: On 6/25/2013 6:15 PM, Jack Krupansky wrote: Are you using Tomcat? See: http://wiki.apache.org/solr/SolrTomcat#Enabling_Longer_Query_Requests Enabling Longer Query Requests If you try to submit too long a GET query to Solr, then Tomcat will reject your HTTP request on the grounds that the HTTP header is too large; symptoms may include an HTTP 400 Bad Request error or (if you execute the query in a web browser) a blank browser window. If you need to enable longer queries, you can set the maxHttpHeaderSize attribute on the HTTP Connector element in your server.xml file. The default value is 4K. (See http://tomcat.apache.org/tomcat-5.5-doc/config/http.html) Even better would be to force SolrJ to use a POST request. In newer versions (4.1 and later) Solr sets the servlet container's POST buffer size and defaults it to 2MB. In older versions, you'd have to adjust this in your servlet container config, but the default should be considerably larger than the header buffer used for GET requests. I thought that SolrJ used POST by default, but after looking at the code, it seems that I was wrong. Here's how to send a POST query: response = server.query(query, METHOD.POST); The import required for this is: import org.apache.solr.client.solrj.SolrRequest.METHOD; Gary, if you can avoid it, you should not be creating a new HttpSolrServer object every time you make a query. It is completely thread-safe, so create a singleton and use it for all queries against the medline core. Thanks, Shawn
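Putting Shawn's two suggestions together, a small sketch; the core name and query text are placeholders:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrRequest.METHOD;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class MedlineSearcher {
        // one thread-safe instance reused for every query against the core
        private static final HttpSolrServer SERVER =
                new HttpSolrServer("http://localhost:8983/solr/medline");

        public static QueryResponse search(String veryLongQueryString) throws Exception {
            SolrQuery query = new SolrQuery(veryLongQueryString);
            // POST keeps long query strings out of the HTTP request line/headers entirely
            return SERVER.query(query, METHOD.POST);
        }
    }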
Re: How to get values of external file field(s) in Solr query?
The only way is using a frange (function range) query: q={!frange l=0 u=10}my_external_field Will pull out documents that have your external field with a value between zero and 10. Upayavira On Wed, Jun 26, 2013, at 09:02 PM, Arun Rangarajan wrote: http://docs.lucidworks.com/display/solr/Working+with+External+Files+and+Processes says this about external file fields: They can be used only for function queries or display. I understand how to use them in function queries, but how do I retrieve the values for display? If I want to fetch only the values of a single external file field for a set of primary keys, I can do: q=_val_:EXT_FILE_FIELDfq=id:(doc1 doc2 doc3)fl=id,score For this query, the score is the value of the external file field. But how to get the values for docs that match some arbitrary query? Is there a syntax trick that will work where the value of the ext file field does not affect the score of the main query, but I can still retrieve its value? Also is it possible to retrieve the values of more than one external file field in a single query?
Re: How to get values of external file field(s) in Solr query?
On Wed, Jun 26, 2013 at 4:02 PM, Arun Rangarajan arunrangara...@gmail.com wrote: http://docs.lucidworks.com/display/solr/Working+with+External+Files+and+Processes says this about external file fields: They can be used only for function queries or display. I understand how to use them in function queries, but how do I retrieve the values for display? If I want to fetch only the values of a single external file field for a set of primary keys, I can do: q=_val_:EXT_FILE_FIELDfq=id:(doc1 doc2 doc3)fl=id,score For this query, the score is the value of the external file field. But how to get the values for docs that match some arbitrary query? Pseudo-fields allow you to retrieve the value for any arbitrary function per returned document. Should work here, but I haven't tried it. fl=id, score, field(EXT_FILE_FIELD) or you can alias it: fl=id, score, myfield:field(EXT_FILE_FIELD) -Yonik http://lucidworks.com
Re: How to get values of external file field(s) in Solr query?
Yonik, Thanks, your answer works! On Wed, Jun 26, 2013 at 2:07 PM, Yonik Seeley yo...@lucidworks.com wrote: On Wed, Jun 26, 2013 at 4:02 PM, Arun Rangarajan arunrangara...@gmail.com wrote: http://docs.lucidworks.com/display/solr/Working+with+External+Files+and+Processes says this about external file fields: They can be used only for function queries or display. I understand how to use them in function queries, but how do I retrieve the values for display? If I want to fetch only the values of a single external file field for a set of primary keys, I can do: q=_val_:EXT_FILE_FIELDfq=id:(doc1 doc2 doc3)fl=id,score For this query, the score is the value of the external file field. But how to get the values for docs that match some arbitrary query? Pseudo-fields allow you to retrieve the value for any arbitrary function per returned document. Should work here, but I haven't tried it. fl=id, score, field(EXT_FILE_FIELD) or you can alias it: fl=id, score, myfield:field(EXT_FILE_FIELD) -Yonik http://lucidworks.com
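Pulling the two answers together, a sketch of fetching several external file field values alongside an arbitrary query via SolrJ; the core name and the OTHER_EXT_FILE_FIELD name are assumptions:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class ExternalFieldValues {
        public static void main(String[] args) throws Exception {
            HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

            SolrQuery q = new SolrQuery("some arbitrary query");
            // pseudo-fields return each external file field's value without touching the score
            q.setFields("id", "score",
                    "eff1:field(EXT_FILE_FIELD)",
                    "eff2:field(OTHER_EXT_FILE_FIELD)");

            QueryResponse rsp = solr.query(q);
            for (SolrDocument doc : rsp.getResults()) {
                System.out.println(doc.getFieldValue("id") + " -> " + doc.getFieldValue("eff1"));
            }
        }
    }

Upayavira's {!frange} form is the one to reach for when you want to filter or select by the external value rather than just display it.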
Configuring Solr to retrieve documents?
Is it possible to configure Solr to automatically grab documents in a specified directory, without having to use the post command? I've not found any way to do this, though admittedly, I'm not terribly experienced with config files of this type. Thanks! - | A.Spielman | In theory there is no difference between theory and practice. In practice there is. - Chuck Reid -- View this message in context: http://lucene.472066.n3.nabble.com/Configuring-Solr-to-retrieve-documents-tp4073372.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Is there a way to build indexes using SOLRJ without SOLR instance?
On Wed, Jun 26, 2013 at 4:43 PM, Guido Medina guido.med...@temetra.comwrote: Never heard of embedded Solr server, I guess that's the exciting part about Solr. Always more nuances to learn: https://wiki.apache.org/solr/EmbeddedSolr :-) Regards, Alex. Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)
Re: OOM killer script woes
Thanks for the feedback Daniel ... For now, I've opted to just kill the JVM with System.exit(1) in the SolrDispatchFilter code and will restart it with a Linux supervisor. Not elegant but the alternative of having a zombie Solr instance walking around my cluster is much worse ;-) Will try to dig into the code that is trapping this error but for now I've lost too many hours on this problem. Cheers, Tim On Wed, Jun 26, 2013 at 2:43 PM, Daniel Collins danwcoll...@gmail.com wrote: Ooh, I guess Jetty is trapping that java.lang.OutOfMemoryError, and throwing it/packaging it as a java.lang.RuntimeException. The -XX option assumes that the application doesn't handle the Errors and so they would reach the JVM and thus invoke the handler. Since Jetty has an exception handler that is dealing with anything (included Errors), they never reach the JVM, hence no handler. Not much we can do short of not using Jetty? That's a pain, I'd just written a nice OOM handler too! On 26 June 2013 20:37, Timothy Potter thelabd...@gmail.com wrote: A little more to this ... Just on chance this was a weird Jetty issue or something, I tried with the latest 9 and the problem still occurs :-( This is on Java 7 on debian: java version 1.7.0_21 Java(TM) SE Runtime Environment (build 1.7.0_21-b11) Java HotSpot(TM) 64-Bit Server VM (build 23.21-b01, mixed mode) Here is an example stack trace from the log 2013-06-26 19:31:33,801 [qtp632640515-62] ERROR solr.servlet.SolrDispatchFilter Q:22 - null:java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:670) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:380) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1423) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:450) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138) at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564) at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1083) at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:379) at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1017) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136) at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:258) at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97) at org.eclipse.jetty.server.Server.handle(Server.java:445) at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:260) at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:225) at org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358) at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:596) at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:527) at java.lang.Thread.run(Thread.java:722) Caused by: java.lang.OutOfMemoryError: Java heap space On Wed, Jun 26, 2013 at 12:27 PM, Timothy Potter thelabd...@gmail.com wrote: Recently 
upgraded to 4.3.1 but this problem has persisted for a while now ... I'm using the following configuration when starting Jetty: -XX:OnOutOfMemoryError=/home/solr/oom_killer.sh 83 %p If an OOM is triggered during Solr web app initialization (such as by me lowering -Xmx to a value that is too low to initialize Solr with), then the script gets called and does what I expect! However, once the Solr webapp initializes and Solr is happily responding to updates and queries. When an OOM occurs in this situation, then the script doesn't actually get invoked! All I see is the following in the stdout/stderr log of my process: # # java.lang.OutOfMemoryError: Java heap space # -XX:OnOutOfMemoryError=/home/solr/oom_killer.sh 83 %p # Executing /bin/sh -c /home/solr/oom_killer.sh 83 21358... The oom_killer.sh script doesn't actually get called! So to recap, it works if an OOM occurs during initialization but once Solr is running, the OOM killer doesn't fire correctly. This leads me to believe my script is fine and there's something else going wrong. Here's the oom_killer.sh script (pretty basic): #!/bin/bash SOLR_PORT=$1 SOLR_PID=$2 NOW=$(date +%Y%m%d_%H%M) ( echo Running OOM killer script for process $SOLR_PID for Solr on port 89$SOLR_PORT kill -9 $SOLR_PID echo Killed process
Replicating files containing external file fields
From https://wiki.apache.org/solr/SolrReplication I understand that index dir and any files under the conf dir can be replicated to slaves. I want to know if there is any way the files under the data dir containing external file fields can be replicated. These are not replicated by default. Currently we are running the ext file field reload script on both the master and the slave and then running reloadCache on each server once they are loaded.
Re: Is there a way to build indexes using SOLRJ without SOLR instance?
If hibernate search is like regular hibernate ORM I'm not sure I'd trust it to pick the most optimal solutions... Otis Solr & ElasticSearch Support http://sematext.com/ On Jun 26, 2013 4:44 PM, Guido Medina guido.med...@temetra.com wrote: Never heard of embedded Solr server, isn't better to just use lucene alone for that purpose? Using a helper like Hibernate? Since most applications that require indexes will have a relational DB behind the scene, it would not be a bad idea to use a ORM combined with Lucene annotations (aka hibernate-search) Guido. On 26/06/13 20:30, Alexandre Rafalovitch wrote: Yes, it is possible by running an embedded Solr inside SolrJ process. The nice thing is that the index is portable, so you can then access it from the standalone Solr server later. I have an example here: https://github.com/arafalov/solr-indexing-book/tree/master/published/solrj , which shows SolrJ running both as a client and with an embedded container. Notice that you will probably need more jars than you expect for the standalone Solr to work, including a number of servlet jars. Regards, Alex. Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book) On Wed, Jun 26, 2013 at 2:59 PM, Learner bbar...@gmail.com wrote: I currently have a SOLRJ program which I am using for indexing the data in SOLR. I am trying to figure out a way to build index without depending on running instance of SOLR. I should be able to supply the solrconfig and schema.xml to the indexing program which in turn create index files that I can use with any SOLR instance. Is it possible to implement this? -- View this message in context: http://lucene.472066.n3.nabble.com/Is-there-a-way-to-build-indexes-using-SOLRJ-without-SOLR-instance-tp4073383.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Need Help in migrating Solr version 1.4 to 4.3
Thanks Shawn. To implement the singleton design pattern for SolrServer object creation, I found that there are many ways described at http://en.wikipedia.org/wiki/Singleton_pattern So which is the best one, out of the 5 examples mentioned in the above URL, for a web application in general practice? I am sure lots of people (on this mailing list) will have practical experience as to which type of singleton pattern should be implemented for creating the SolrServer object. Waiting for some comments on this front. Regards Sandeep On Wed, Jun 26, 2013 at 9:20 PM, Shawn Heisey s...@elyograg.org wrote: On 6/25/2013 11:52 PM, Sandeep Gupta wrote: Also in application development side, as I said that I am going to use HTTPSolrServer API and I found that we shouldn't create this object multiple times (as per the wiki document http://wiki.apache.org/solr/Solrj#HttpSolrServer) So I am planning to have my Server class as singleton. Please advice little bit in this front also. This is always the way that SolrServer objects are intended to be used, including CommonsHttpSolrServer in version 1.4. The only major difference between the two objects is that the new one uses HttpComponents 4.x and the old one uses HttpClient 3.x. There are other differences, but they are just the result of incremental improvements from version to version. Thanks, Shawn
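Since HttpSolrServer is thread-safe, any of the standard variants is fine; a sketch using the initialization-on-demand holder idiom, with the URL as a placeholder:

    import org.apache.solr.client.solrj.impl.HttpSolrServer;

    public final class SolrServerHolder {
        private SolrServerHolder() {}

        // lazy, thread-safe initialization without explicit synchronization
        private static class Holder {
            static final HttpSolrServer INSTANCE =
                    new HttpSolrServer("http://localhost:8983/solr/core1");
        }

        public static HttpSolrServer get() {
            return Holder.INSTANCE;
        }
    }

An enum singleton or a plain static final field wired up by the web framework works just as well; the important part is reusing one instance per Solr endpoint for the life of the webapp.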