Managed schema used with Cloudera MapReduceIndexerTool and morphlines?
I've got a very difficult project to tackle. I've been tasked with using schemaless mode to index JSON files that we receive. The structure of the JSON files will always be very different, since we're receiving files from different customers that are totally unrelated to one another. We are attempting to build a "one size fits all" approach to receiving documents from a wide variety of sources and then indexing them into Solr. We're running Solr 5.3.

The schemaless approach works well enough - until it doesn't. It seems to fail on type guessing and also gets confused indexing to different shards. If it were reliable it would be the perfect solution for our task, but the larger the JSON file, the more likely it is to fail; at a certain size it just doesn't work. I've been advised by some experts and committers that schemaless is a good tool for prototyping but risky to run in production. We thought we would try it anyway by doing offline indexing using the Cloudera MapReduceIndexerTool to build offline indexes - but still using managed schemas.

This MapReduce tool uses morphlines, a nifty ETL tool that pipes together a series of commands to transform data. For example, a JSON or CSV file can be processed and loaded into a Solr index with a "readJson" command piped to a "loadSolr" command, to take a simple example. But the kite-sdk that manages the morphlines only seems to offer, as their latest version, solr *4.10.3*-cdh5.10.0 (their customized version of 4.10.3). So I can't see any way to integrate schemaless (which has dependencies after 4.10.3) with the morphlines.

But I thought I would ask here: has anybody had ANY experience with morphlines to index to Solr? Any info would help me make sense of this. Cheers to all!
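For anyone unfamiliar with what such a pipeline looks like, here is a minimal sketch of the readJson-to-loadSolr flow described above (the collection name, zkHost, and extracted paths are hypothetical placeholders, and the exact command set depends on which kite-morphlines modules are on the classpath):

    morphlines : [
      {
        id : indexJsonDocs
        importCommands : ["org.kitesdk.**", "org.apache.solr.**"]
        commands : [
          # parse each input stream as JSON
          { readJson {} }
          # flatten selected JSON paths into record fields (paths are hypothetical)
          { extractJsonPaths { flatten : true, paths : { id : /id, body : /body } } }
          # hand the record to Solr via the configured solrLocator
          { loadSolr { solrLocator : { collection : myCollection, zkHost : "zk1:2181/solr" } } }
        ]
      }
    ]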
Re: Very long running replication.
Bumping this. I'm seeing the error mentioned earlier in the thread - "Unable to download <segment filename> completely. Downloaded 0!=<size>" - often in my logs. I'm dealing with a situation where the maxDoc count is growing at a faster rate than numDocs and is now almost twice as large. I'm not optimizing, but rather relying on the normal merge process to initiate the purging of deleted docs. No purging has happened for months now, and it snuck up on me. Slaves are getting the newly indexed docs, but docs marked for delete are never getting purged.

Index size is 23GB
Indexing about 3K docs an hour
Replication poll time is 60 seconds
Running Solr 3.6 (I know, we should upgrade... working on that)
Autocommit every 30 seconds or 5K docs (so usually hitting the 30-second threshold rather than the doc count)

Any pointers greatly appreciated!

On Fri, Jan 3, 2014 at 7:14 AM, anand chandak anand.chan...@oracle.com wrote: Folks, would really appreciate if somebody can help/throw some light on the issue below. This issue is blocking our upgrade; we are doing a 3.x to 4.x upgrade and indexing around 100g of data. Any help would be highly appreciated. Thanks, Anand

On 1/3/2014 11:46 AM, anand chandak wrote: Thanks Shalin. I am facing one issue while replicating: as my replication (very large index, 100g) is happening, I am also doing indexing, and I believe the segments_N file is changing because of new commits. So would the replication fail if the filename is different from what it found when fetching the filename list? Basically, I am seeing this exception:

[explicit-fetchindex-cmd] ERROR org.apache.solr.handler.ReplicationHandler - SnapPull failed: org.apache.solr.common.SolrException: Unable to download _av3.fdt completely. Downloaded 0!=497037
  at org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.cleanup(SnapPuller.java:1268)
  at org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.fetchFile(SnapPuller.java:1148)
  at org.apache.solr.handler.SnapPuller.downloadIndexFiles(SnapPuller.java:743)
  at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:407)
  at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:319)
  at org.apache.solr.handler.ReplicationHandler$1.run(ReplicationHandler.java:220)

And I am trying to find the root cause of this issue. Any help? Thanks, Anand

On 1/2/2014 5:32 PM, Shalin Shekhar Mangar wrote: Replications won't run concurrently. They are scheduled at a fixed rate, and if a particular pull takes longer than the time period then subsequent executions are delayed until the running one finishes.

On Tue, Dec 31, 2013 at 4:46 PM, anand chandak anand.chan...@oracle.com wrote: Quick question about Solr replication: what happens if there's a replication running, for a very large index, that runs longer than the interval for 2 replications? Would the automatic runs of replication interfere with the currently running one, or would it not even spawn the next iteration of replication? Can somebody throw some light?
Loading custom update request handler on startup
I'm writing a custom update request handler that will poll a hot directory for Solr XML files and index anything it finds there. The custom class implements Runnable, and when the run() method is called the loop starts to do the polling. How can I tell Solr to load this class on startup to fire off the run() method? Thanks, -Jay
Re: Loading custom update request handler on startup
I may have found a good solution. I implemented my own SolrEventListener:

public class DynamicIndexerEventListener implements org.apache.solr.core.SolrEventListener {
...

and then registered it with a firstSearcher element in solrconfig.xml:

<listener event="firstSearcher" class="com.bestbuy.search.foundation.solr.DynamicIndexerEventListener"/>

Then in the newSearcher() method I start up the thread for my polling UpdateRequestHandler. This seems to work, but if anyone has a better (or more tested) approach please let us know. -Jay

On Mon, Jul 9, 2012 at 2:33 PM, Jay Hill jayallenh...@gmail.com wrote: I'm writing a custom update request handler that will poll a hot directory for Solr XML files and index anything it finds there. The custom class implements Runnable, and when the run() method is called the loop starts to do the polling. How can I tell Solr to load this class on startup to fire off the run() method? Thanks, -Jay
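For readers wanting to flesh that out, a minimal sketch of such a listener follows (the HotDirectoryPoller Runnable is hypothetical, and the SolrEventListener method set shown is the 3.x/4.x one, so check your version):

    public class DynamicIndexerEventListener implements org.apache.solr.core.SolrEventListener {

      // guard so the polling thread is started only once, on the first event
      private final java.util.concurrent.atomic.AtomicBoolean started =
          new java.util.concurrent.atomic.AtomicBoolean(false);

      public void init(org.apache.solr.common.util.NamedList args) {}

      // firstSearcher events are delivered through newSearcher() with
      // currentSearcher == null, so this fires on core startup
      public void newSearcher(org.apache.solr.search.SolrIndexSearcher newSearcher,
                              org.apache.solr.search.SolrIndexSearcher currentSearcher) {
        if (started.compareAndSet(false, true)) {
          new Thread(new HotDirectoryPoller(), "hot-dir-poller").start(); // HotDirectoryPoller is hypothetical
        }
      }

      public void postCommit() {}
    }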
Re: TermsComponent show only terms that matched query?
Yes, per-doc. I mentioned TermsComponent but meant TermVectorComponent, where we get back all the terms in the doc. Just wondering if there was a way to only get back the terms that matched the query. Thanks EE, -Jay

On Sat, Feb 25, 2012 at 2:54 PM, Erick Erickson erickerick...@gmail.com wrote: Jay: I've seen this question go 'round before, but don't remember a satisfactory solution. Are you talking on a per-document basis here? If so, I vaguely remember it being possible to do something with highlighting, just counting the tags returned after highlighting. Best, Erick

On Fri, Feb 24, 2012 at 3:31 PM, Jay Hill jayallenh...@gmail.com wrote: I have a situation where I want to show the term counts as is done in the TermsComponent, but *only* for terms that are *matched* in a query, so I get something returned like this (pseudo code):

q=title:(golf swing)

<doc>
  title: golf legends show how to improve your golf swing on the golf course
  ...other fields
</doc>
<terms>
  golf (3)
  swing (1)
</terms>

rather than getting back all of the terms in the doc. Thanks, -Jay
TermsComponent show only terms that matched query?
I have a situation where I want to show the term counts as is done in the TermsComponent, but *only* for terms that are *matched* in a query, so I get something returned like this (pseudo code):

q=title:(golf swing)

<doc>
  title: golf legends show how to improve your golf swing on the golf course
  ...other fields
</doc>
<terms>
  golf (3)
  swing (1)
</terms>

rather than getting back all of the terms in the doc. Thanks, -Jay
Complex query, need filtering after query not before
I have a project where we need to search 1B docs and still return results in under 700ms. The problem is, we are using geofiltering, and that is happening *before* the queries, so we have to geofilter the 1B docs to restrict our set of docs first, and only then do the query on a name field. But it seems that it would be better and faster to run the main query first, and only then filter that subset of docs by geo. Here is what a typical query looks like:

?shards=<list of 20 nodes>
&q={!boost b=sum(recip(geodist(geo_lat_long,38.2493581,-122.0399663),1,1,1))}(given_name:Barack OR given_name_exact:Barack^4.0) AND family_name:Obama
&fq={!geofilt pt=38.2493581,-122.0399663 sfield=geo_lat_long d=120}
&fq=(-source:somedatasource)
&rows=4

QTime=1040

I've looked at the cache=false param and the cost= param, but that's not going to help much because we still have to do the filtering. (We *will* use cache=false to avoid the overhead of caching queries that will very rarely be the same.) Is there any way to indicate a filter query should happen *after* the other results? The other fq on source restricts the docset somewhat, but different variations don't eliminate a high number of docs, so we could use the cost param to run the fq on source before the fq on geo, but it would only help very minimally in some cases. Thanks, -Jay
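One avenue worth testing, assuming a Solr build where geofilt on a LatLonType field supports post filtering: a filter marked cache=false with cost >= 100 is run as a post filter, i.e. evaluated only against documents that already matched the main query and the cheaper filters, which is exactly the "filter after the query" behavior asked about. A sketch of the fq above rewritten that way:

&fq={!geofilt pt=38.2493581,-122.0399663 sfield=geo_lat_long d=120 cache=false cost=200}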
Shard timeouts on large (1B docs) Solr cluster
I'm on a project where we have 1B docs sharded across 20 servers. We're not in production yet, and we're doing load tests now, sending enough load to hit 100qps per server. As the load increases we're seeing query times sporadically spiking to 10 seconds, 20 seconds, etc. What we're trying to do is set a shard timeout so that responses longer than 2 seconds are discarded. We can live with fewer results in these cases. We're not replicating yet, as we want to see how the 20 shards perform first (plus we're waiting on the massive amount of hardware).

I've tried setting the following config in our default request handler:

<int name="shard-socket-timeout">2000</int>
<int name="shard-connection-timeout">2000</int>

I've just added these, and am testing now, but this doesn't look promising either:

<int name="timeAllowed">2000</int>
<bool name="partialResults">true</bool>

Couldn't find much on the wiki about these params - I'm looking for more details about how they work. I'll be happy to update the wiki with more details based on the discussion here. Any details about exactly how I can achieve my goal of timing out and disregarding queries longer than 2 seconds would be greatly appreciated.

The index is insanely lean - no stored fields, no norms, no stop words, etc. RAM buffer is 128, and we're using the standard search request handler. Essentially we're running Solr as a NoSQL data store, which suits this project, but we need responses to be no longer than 2 seconds at the max. Thanks, -Jay
Re: Shard timeouts on large (1B docs) Solr cluster
We're on the trunk: 4.0-2011-10-26_08-46-59 1189079 - hudson - 2011-10-26 08:51:47. Client timeouts are set to 4 seconds. Thanks, -Jay

On Thu, Jan 26, 2012 at 1:40 PM, Mark Miller markrmil...@gmail.com wrote: On Jan 26, 2012, at 1:28 PM, Jay Hill wrote: I've tried setting the following config in our default req. handler:

<int name="shard-socket-timeout">2000</int>
<int name="shard-connection-timeout">2000</int>

What version are you using, Jay? At least on trunk, I took a look and it appears at some point these were renamed to socketTimeout and connTimeout. What about a timeout on your clients? - Mark Miller, lucidimagination.com
Re: Shard timeouts on large (1B docs) Solr cluster
I'm changing the params to socketTimeout and connTimeout and will test this afternoon. The client timeout was actually removed today, which helped a bit. What about the other params, timeAllowed and partialResults? My expectation was that these were specifically for distributed search, meaning if a response wasn't received within timeAllowed, and if partialResults is true, then that shard would not be waited on for results. Is that correct? Thanks, -Jay

On Thu, Jan 26, 2012 at 2:23 PM, Jay Hill jayallenh...@gmail.com wrote: We're on the trunk: 4.0-2011-10-26_08-46-59 1189079 - hudson - 2011-10-26 08:51:47. Client timeouts are set to 4 seconds. Thanks, -Jay

On Thu, Jan 26, 2012 at 1:40 PM, Mark Miller markrmil...@gmail.com wrote: On Jan 26, 2012, at 1:28 PM, Jay Hill wrote: I've tried setting the following config in our default req. handler:

<int name="shard-socket-timeout">2000</int>
<int name="shard-connection-timeout">2000</int>

What version are you using, Jay? At least on trunk, I took a look and it appears at some point these were renamed to socketTimeout and connTimeout. What about a timeout on your clients? - Mark Miller, lucidimagination.com
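Per Mark's note, the renamed settings would look like this in the request handler defaults (a sketch; exact placement and availability depend on the trunk revision in use):

<int name="socketTimeout">2000</int>
<int name="connTimeout">2000</int>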
/no_coord in dismax scoring explain
What does /no_coord mean in the dismax scoring output? I've looked through the wiki, mail archives, lucidfind, and can't find any reference. -- ¡jah!
Re: facet search and UnInverted multi-valued field?
UnInvertedField is similar to Lucene's FieldCache, except that, while the FieldCache cannot work with multivalued fields, UnInvertedField is designed for that very purpose. So since your f_dcperson field is multivalued, by default you use UnInvertedField. You're not doing anything wrong; that's default and normal behavior. -Jay http://lucidimagination.com

On Tue, May 3, 2011 at 7:03 AM, Bernd Fehling bernd.fehl...@uni-bielefeld.de wrote: Dear list, we use Solr 3.1.0. My logs have the following entry:

May 3, 2011 2:01:39 PM org.apache.solr.request.UnInvertedField uninvert
INFO: UnInverted multi-valued field {field=f_dcperson,memSize=1966237,tindexSize=35730,time=849,phase1=782,nTerms=12,bigTerms=0,termInstances=368008,uses=0}

The schema.xml has the field:

<field name="f_dcperson" type="string" indexed="true" stored="true" multiValued="true"/>

The query was:

May 3, 2011 2:01:40 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=null path=null params={facet=true&fl=score&facet.mincount=1&facet.sort=&start=0&event=firstSearcher&q=text:antigone^200&facet.prefix=&facet.limit=100&facet.field=f_dcperson&facet.field=f_dcsubject&facet.field=f_dcyear&facet.field=f_dccollection&facet.field=f_dctypenorm&facet.field=f_dccontenttype&rows=10} hits=1 status=0 QTime=1816

The log entry is just an INFO, but what does it tell me? Am I doing something wrong, or can something be done better? Regards, Bernd
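As an aside (my suggestion, not part of the original reply): for a low-cardinality field like this one (nTerms=12 in the log), you can skip the un-inverting step entirely by switching that field's facet method to term enumeration, e.g.:

facet=true&facet.field=f_dcperson&f.f_dcperson.facet.method=enum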
Scaling Search with Big Data/Hadoop and Solr now available at Lucene Revolution
I've worked with a lot of different Solr implementations, and one area that is emerging more and more is using Solr in combination with other big data solutions. My company, Lucid Imagination, has added a two-day course to our upcoming Lucene Revolution conference, Scaling Search with Big Data and Solr, that covers Hadoop & Solr. It runs May 23-24 at Lucene Revolution in San Francisco (the conference itself is on May 25-26 -- see lucenerevolution.org).

Description: The class covers Hadoop from the ground up, including MapReduce, the Hadoop Distributed File System (HDFS), cluster management, etc., before continuing on to connect it to Solr. Students will study common use cases for generating search indexes from big data, typical patterns for the data processing workflow, and how to make it all work reliably at scale. We will explore in depth an example of processing 1 billion records to create a faceted Solr search solution.

Details here: http://lucenerevolution.org/training#solr-scaling

I've been asked by a lot of Solr users whether Lucid offers anything like this, so I know there is a lot of interest out there. -Jay
Re: Multiple Tags and Facets
I don't think I understand what you're trying to do. Are you trying to preserve all facets after a user clicks on a facet, and thereby triggers a filter query which excludes the other facets? If that's the case, you can use local parameters to tag the filter queries so they are not used for the facets.

Let's say I have the following facets:

- Solr
- Lucene
- Nutch
- Mahout

And I do a search for solr. All of these links will have a filter query:

- Solr [ ?q=solr&fq=project:solr ]
- Lucene [ ?q=solr&fq=project:lucene ]
- Nutch [ ?q=solr&fq=project:nutch ]
- Mahout [ ?q=solr&fq=project:mahout ]

But if a user clicks on the Solr facet, the resulting query will exclude the other facets, so you only see this facet:

- Solr

By using local parameters like this:

?q=solr&fq={!tag=myTag}project:solr&facet=on&facet.field={!ex=myTag}project

I can preserve all my facets, so that my query is filtered but all facets still remain:

- Solr
- Lucene
- Nutch
- Mahout

Hope this helps, but I'm not sure that's what you were after. -Jay

On Wed, Apr 20, 2011 at 8:03 AM, Em mailformailingli...@yahoo.de wrote: Hello, I watched an online video with Chris Hostetter from Lucid Imagination. He showed the possibility of having some facets that exclude *all* filters, while also having some facets that take care of some of the set filters while ignoring other filters. Unfortunately the webinar did not explain how they did this, and I wasn't able to give a filter/facet more than one tag.

Here is an example. Facets and filters: DocType, Author.

Facet:
- Author
-- George (10)
-- Brian (12)
-- Christian (78)
-- Julia (2)
- Doctype
-- PDF (70)
-- ODT (10)
-- Word (20)
-- JPEG (1)
-- PNG (1)

When clicking on Julia I would like to achieve the following:

Facet:
- Author
-- George (10)
-- Brian (12)
-- Christian (78)
-- Julia (2)

Julia's Doctypes:
-- JPEG (1)
-- PNG (1)

- Doctype
-- PDF (70)
-- ODT (10)
-- Word (20)
-- JPEG (1)
-- PNG (1)

Another example which adds special options to your GUI could be as follows. Imagine a fashion store. If you search for shirt you get a color facet:

colors:
- red (19)
- green (12)
- blue (4)
- black (2)

As well as a brand facet:

brands:
- puma (18)
- nike (19)

When I click on the red color facet, I would like to get the following back:

colors:
- red (19)
- green (12)*
- blue (4)*
- black (2)*

brands:
- puma (18)*
- nike (19)

All those filters marked by an * could be displayed half-transparent or so - they just show the user that those filter options exist for his/her search but aren't included in the result set, since he/she excluded them by clicking the red filter. This case is more interesting if not all red shirts were from nike. This way you can show the user that i.e. 8 of 19 red shirts are from the brand you selected / you see 8 of 19 red shirts. I hope I explained what I want to achieve. Thank you!

--
View this message in context: http://lucene.472066.n3.nabble.com/Multiple-Tags-and-Facets-tp2843130p2843130.html
Sent from the Solr - User mailing list archive at Nabble.com.
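If you're building queries with SolrJ, the same tag/exclusion trick looks roughly like this (a sketch using the field and tag names from the example above):

    SolrQuery q = new SolrQuery("solr");
    q.setFacet(true);
    // tag the filter so the facet below can ignore it
    q.addFilterQuery("{!tag=myTag}project:solr");
    // facet over all projects, excluding the tagged filter from the counts
    q.addFacetField("{!ex=myTag}project");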
Re: Understanding the DisMax tie parameter
Looks good, thanks Tom. -Jay

On Fri, Apr 15, 2011 at 8:55 AM, Burton-West, Tom tburt...@umich.edu wrote: Thanks everyone. I updated the wiki. If you have a chance please take a look and check to make sure I got it right on the wiki: http://wiki.apache.org/solr/DisMaxQParserPlugin#tie_.28Tie_breaker.29 Tom

-----Original Message----- From: Chris Hostetter [mailto:hossman_luc...@fucit.org] Sent: Thursday, April 14, 2011 5:41 PM To: solr-user@lucene.apache.org; yo...@lucidimagination.com Cc: Burton-West, Tom Subject: Re: Understanding the DisMax tie parameter

: Perhaps the parameter could have had a better name. It's essentially : max(score of matching clauses) + tie * (score of matching clauses that : are not the max) : : So it can be used and thought of as a tiebreak only in the sense that : if two docs match a clause (with essentially the same score), then a : small tie value will act as a tiebreaker *if* one of those docs also : matches some other fields.

Correct. Without a tiebreaker value, a dismax query will only look at the maximum scoring clause for each doc -- the tie param is named for its ability to help break ties when multiple documents have the same score from the max scoring clause -- by adding in a small portion of the scores (based on the 0-1 ratio of the tie param) from the other clauses. -Hoss
Re: Understanding the DisMax tie parameter
Dismax works by first selecting the highest scoring sub-query of all the sub-queries that were run. If I want to search on three fields, manu, name and features, I can configure dismax like this:

<requestHandler name="search_dismax" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <float name="tie">0.0</float>
    <str name="qf">manu name features</str>
    <str name="q.alt">*:*</str>
  </lst>
</requestHandler>

Now I'll use this query:

http://localhost:8983/solr/select/?qt=search_dismax&q=cord

Dismax will search for the term cord on the 3 fields I defined in the qf parameter, like this:

+(features:cord | manu:cord | name:cord)

Of those 3 sub-queries dismax will pick the highest-scoring one as the main part of the score. The tie parameter is used like this:

Final Score = highest scoring sub-query + (*tie* * sum of scores for all other sub-queries)

So with a tie value of *0*, the max scoring sub-query is added to 0 * other sub-queries:

Final Score = 0.9645969 + (*0* * sum of other sub-queries)

and this results in ONLY the max sub-query being used, hence a disjunction max. If I had a value of *1* for the tie parameter I get this:

Final Score = 0.9645969 + (*1* * sum of other sub-queries)

so the sum of all the other sub-queries is multiplied by 1, resulting in a disjunction sum. And then, of course, values between 0 and 1 result in the non-highest sub-queries being multiplied by a fraction, and factoring into the scoring that way. -Jay

On Thu, Apr 14, 2011 at 2:04 PM, Burton-West, Tom tburt...@umich.edu wrote: Hello, I'm having trouble understanding the relationship of the words tie and tiebreaker to the explanation of this parameter on the wiki. What two (or more) things are in a tie? And how does the number in the range from 0 to 1 break the tie? http://wiki.apache.org/solr/DisMaxQParserPlugin#tie_.28Tie_breaker.29 A value of 0.0 makes the query a pure disjunction max query -- only the maximum scoring sub query contributes to the final score. A value of 1.0 makes the query a pure disjunction sum query where it doesn't matter what the maximum scoring sub query is, the final score is the sum of the sub scores. Typically a low value (ie: 0.1) is useful. Tom Burton-West
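To make that concrete with made-up numbers: suppose the three sub-queries score 0.96, 0.40, and 0.20 for some document. Then:

tie = 0.0:  score = 0.96 + 0.0 * (0.40 + 0.20) = 0.96  (pure disjunction max)
tie = 0.1:  score = 0.96 + 0.1 * (0.40 + 0.20) = 1.02
tie = 1.0:  score = 0.96 + 1.0 * (0.40 + 0.20) = 1.56  (pure disjunction sum)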
Re: partial optimize does not reduce the segment number to maxNumSegments
As Hoss mentioned earlier in the thread, you can use the statistics page from the admin console to view the current number of segments. But if you want to know by looking at the files, each segment will have a unique prefix, such as _u. There will be one unique prefix for every segment in the index. -Jay

On Tue, Apr 12, 2011 at 3:16 PM, Renee Sun renee_...@mcafee.com wrote: OK, I dug more into this and realized the file extensions can vary depending on the schema, right? For instance we don't have *.tvx, *.tvd, *.tvf (not using term vectors)... and I suspect the file extensions may change with future Lucene releases? Now it seems we can't just count the files using any formula; we have to list all the files in that directory and count that way... any insight will be appreciated. Thanks, Renee

--
View this message in context: http://lucene.472066.n3.nabble.com/partial-optimize-does-not-reduce-the-segment-number-to-maxNumSegments-tp2682195p2813561.html
Sent from the Solr - User mailing list archive at Nabble.com.
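If you'd rather not count prefixes by hand, Lucene's CheckIndex tool will print the segment count (among much else) straight from the index directory; a sketch, with the jar path as a placeholder:

    java -cp /path/to/lucene-core.jar org.apache.lucene.index.CheckIndex /path/to/solr/data/index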
Re: phrase, individual term, prefix, fuzzy and stemming search
You mentioned that dismax does not support wildcards, but edismax does. Not sure if dismax would have solved your other problems, or whether you just had to shift gears because of the wildcard issue, but you might want to have a look at edismax. -Jay http://www.lucidimagination.com

On Mon, Jan 31, 2011 at 2:22 PM, cyang2010 ysxsu...@hotmail.com wrote: My current project has the requirement to support search when a user inputs any number of terms across a few index fields (movie title, actor, director). In order to maximize results, I plan to support all the searches listed in the subject: phrase, individual term, prefix, fuzzy and stemming. Of course, score relevance in the right order is also important.

I have considered using the dismax query. However, it does not support prefix queries. I am not sure if it supports fuzzy queries; my guess is it does not. Therefore, I still need to use the standard query. For example, if someone searches deim moer (a typo for demi moore), I compare the phrase and terms with each searchable field (title, actor, director):

title_display: deim moer~30  OR
actors: deim moer~30  OR
directors: deim moer~30  OR
title_display: deim  OR
actors: deim  OR
directors: deim  OR
title_display: deim*  OR
actors: deim*  OR
directors: deim*  OR
title_display: deim~0.6  OR
actors: deim~0.6  OR
directors: deim~0.6  OR
title_display: moer  OR
actors: moer  OR
directors: moer  OR
title_display: moer*  OR
actors: moer*  OR
directors: moer*  OR
title_display: moer~0.6  OR
actors: moer~0.6  OR
directors: moer~0.6

The Solr relevance score is the sum over all those ORs. That way, I can make sure relevance scores are in the right order. For example, an exact match (deim moer) will match the phrase, term, prefix and fuzzy queries all at the same time, and will therefore score higher than input text that only matches a term, a prefix or a fuzzy query. At the same time, I can apply a boost to a particular search field if the requirements need it.

Does it sound right to you? Is there a better way to achieve the same thing? My concern is that my query is not going to perform, since it tries to do too much. But isn't that what people want (maximized results) when they just type in a few search words?

Another question: can I combine the results of two queries? For example, first I query for phrase and term matches, next I query for prefix matches. Can I just append the results for the prefix match to those for the phrase/term match? I thought the two queries have different queryNorms, so the scores are not comparable to each other and can't be combined. Is that correct?

Thanks. Love to hear what your thoughts are.

--
View this message in context: http://lucene.472066.n3.nabble.com/phrase-inidividual-term-prefix-fuzzy-and-stemming-search-tp239p239.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: WordDelimiterFilterFactory
You can always try something like this out on the analysis.jsp page, accessible from the Solr admin home. Check out that page and see how it allows you to enter text to represent what was indexed, and text for a query. You can then see if there are matches. Very handy for seeing how the various filters in a field type act on text. Make sure to check verbose output for both index and query.

For this specific issue: yes, a query for cls500 will match both of those examples. To get the exact match to score higher:

- create a text field (or a custom type that uses the WordDelimiterFilterFactory) (let's name the field foo)
- create a string field (let's name it foo_string)
- create a copyField with the source being foo and the dest being foo_string
- use dismax (or edismax) to search both of those fields:

http://localhost:8983/solr/select/?q=cls500&defType=edismax&qf=foo foo_string

This should score the string field higher, but you could also add a boost to it to make sure:

http://localhost:8983/solr/select/?q=cls500&defType=edismax&qf=foo foo_string^4.0

-Jay http://lucidimagination.com

On Fri, Feb 4, 2011 at 4:25 PM, John Kim hongs...@gmail.com wrote: If I use WordDelimiterFilterFactory during indexing and at query time, will a search for cls500 find cls 500 and cls500x? If so, will it find and score exact matches higher? If not, how do you get exact matches to display first?
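In schema.xml those steps would look something like this (a sketch; the text field type is assumed to include WordDelimiterFilterFactory in its analysis chain):

    <field name="foo" type="text" indexed="true" stored="true"/>
    <field name="foo_string" type="string" indexed="true" stored="false"/>
    <copyField source="foo" dest="foo_string"/>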
Re: Tuning Solr
Removing those components is not likely to impact performance very much, if at all. I would focus on other areas when tuning performance, such as looking at memory usage and configuration, query design, etc. But there isn't any harm in removing them either. Why not do some load tests with the components included in the configuration, and then some comparison tests with the components removed from solrconfig.xml? -Jay http://www.lucidimagination.com

On Mon, Oct 4, 2010 at 11:36 PM, Floyd Wu floyd...@gmail.com wrote: Hi there, if I don't need MoreLikeThis, spellcheck, or highlighting, can I remove their configuration sections in solrconfig.xml? In other words, does Solr load and use these SearchComponents on startup and during runtime? Will removing this configuration speed up queries or not? Thanks
Creating new Solr cores using relative paths
I'm having trouble getting the core CREATE command to work with relative paths in the solr.xml configuration. I'm working with a layout like this:

/opt/solr                            [this is solr.solr.home: $SOLR_HOME]
/opt/solr/solr.xml
/opt/solr/core0/                     [this is the template core]
/opt/solr/core0/conf/schema.xml      [etc.]
/opt/tomcat/bin                      [where tomcat is started from: $TOMCAT_HOME/bin]

My very basic solr.xml:

<solr persistent="true">
  <cores adminPath="/admin/cores">
    <core name="core0" instanceDir="core0/"/>
  </cores>
</solr>

The CREATE core command works fine with absolute paths, but I have a requirement to use relative paths. I want to be able to create a new core like this:

http://localhost:8080/solr/admin/cores?action=CREATE&name=core1&instanceDir=core1&config=core0/conf/solrconfig.xml&schema=core0/conf/schema.xml

(core1 is the name for the new core to be created, and I want to use the config and schema from core0 to create the new core.)

But the error is always due to the servlet container thinking $TOMCAT_HOME/bin is the current working directory:

Caused by: java.lang.RuntimeException: Can't find resource 'core0/conf/solrconfig.xml' in classpath or '/opt/solr/core1/conf/', cwd=/opt/tomcat/bin

Does anyone know how to make this happen? Thanks, -Jay
Re: OutOfMemoryErrors
A merge factor of 100 is very high and out of the norm. Try starting with a value of 10; I've never seen a running system with a value anywhere near this high. Also, what is your setting for ramBufferSizeMB? -Jay

On Tue, Aug 17, 2010 at 10:46 AM, rajini maski rajinima...@gmail.com wrote: Yeah, sorry, I forgot to mention the others...

<mergeFactor>100</mergeFactor>
<maxBufferedDocs>1000</maxBufferedDocs>
<maxMergeDocs>10</maxMergeDocs>
<maxFieldLength>1</maxFieldLength>

Above are the values. Is this because of the values here? Initially I had the mergeFactor parameter at 10 and maxMergeDocs at 1. With the same error I changed them to the above values... yet I got that error after the index was about 2 lacs (~200K) docs...

On Tue, Aug 17, 2010 at 11:04 PM, Erick Erickson erickerick...@gmail.com wrote: There are more merge parameters; what values do you have for these:

<mergeFactor>10</mergeFactor>
<maxBufferedDocs>1000</maxBufferedDocs>
<maxMergeDocs>2147483647</maxMergeDocs>
<maxFieldLength>1</maxFieldLength>

See: http://wiki.apache.org/solr/SolrConfigXml (hope that formatting comes through the various mail programs OK). Also, what else happens while you're indexing? Do you search while indexing? How often do you commit your changes?

On Tue, Aug 17, 2010 at 1:18 PM, rajini maski rajinima...@gmail.com wrote: <mergeFactor>100</mergeFactor>. JVM initial memory pool - 256MB, maximum memory pool - 1024MB.

<add>
  <doc>
    <field>long : ID</field>
    <field>str : Body</field>
    ... 12 fields ...
  </doc>
</add>

I have a Solr instance in a solr folder (D:/Solr); free space on disc is 24.3GB... How will I get to know what portion of memory Solr is using?

On Tue, Aug 17, 2010 at 10:11 PM, Erick Erickson erickerick...@gmail.com wrote: You shouldn't be getting this error at all unless you're doing something out of the ordinary. So, it'd help if you told us: what parameters you have set for merging, what parameters you have set for the JVM, and what kind of documents you are indexing. The memory you have is irrelevant if you only allocate a small portion of it for the running process... Best, Erick

On Tue, Aug 17, 2010 at 7:35 AM, rajini maski rajinima...@gmail.com wrote: I am getting it while indexing data to Solr, not while querying... Though I have enough memory space, up to 40GB, and my indexing data is just 5-6 GB, yet that particular error is seldom observed... (SEVERE ERROR: JAVA HEAP SPACE, OUT OF MEMORY ERROR). I could see one lock file generated in the data/index path just after this error.

On Tue, Aug 17, 2010 at 4:49 PM, Peter Karich peat...@yahoo.de wrote: Is there a way to verify that I have added it correctly? On Linux you can do ps -elf | grep Boot and see if the java command has the parameters added. @all: why and when do you get those OOMs? While querying? Which queries in detail? Regards, Peter.
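For reference, a more conventional starting point in solrconfig.xml would be something along these lines (the numbers are illustrative defaults, not tuned values):

    <mergeFactor>10</mergeFactor>
    <ramBufferSizeMB>128</ramBufferSizeMB>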
SolrJ: Setting multiple parameters
Working with SolrJ, I'm doing a query using the StatsComponent and the stats.facet parameter, but I'm not able to set multiple fields for the stats.facet parameter using SolrJ. Here is the query I'm trying to create:

http://localhost:8983/solr/select/?q=*:*&stats=on&stats.field=fieldForStats&stats.facet=fieldA&stats.facet=fieldB&stats.facet=fieldC

This works perfectly, and I'm able to pull the sum value from all three stats.facet fields, no problem. Trying in SolrJ I have this:

SolrQuery solrQuery = new SolrQuery();
solrQuery.setQuery("*:*");
solrQuery.setParam("stats", "on");
solrQuery.setParam("stats.field", "fieldForStats");
solrQuery.setParam("stats.facet", "fieldA");
solrQuery.setParam("stats.facet", "fieldB");
solrQuery.setParam("stats.facet", "fieldC");

But when I try to retrieve the sum values, it seems as if only the LAST setParam I called on stats.facet is taking effect. So in this case I can get the sum for fieldC, but not the other two:

// works
Map<String, FieldStatsInfo> statsInfoMap = queryResponse.getFieldStatsInfo();
FieldStatsInfo roomCountElement = statsInfoMap.get("fieldForStats");
ArrayList fsi = (ArrayList) roomCountElement.getFacets().get("fieldC");
for (int i = 0; i < fsi.size(); i++) {
  FieldStatsInfo m = (FieldStatsInfo) fsi.get(i);
  System.out.println("--" + m.getName() + " " + m.getSum());
}

// doesn't work; I get a null pointer, as fieldB doesn't seem to have been passed to stats.facet
Map<String, FieldStatsInfo> statsInfoMap = queryResponse.getFieldStatsInfo();
FieldStatsInfo roomCountElement = statsInfoMap.get("fieldForStats");
ArrayList fsi = (ArrayList) roomCountElement.getFacets().get("fieldB");
for (int i = 0; i < fsi.size(); i++) {
  FieldStatsInfo m = (FieldStatsInfo) fsi.get(i);
  System.out.println("--" + m.getName() + " " + m.getSum());
}

Is there a way to set multiple values for stats.facet using the setParam method? I noticed that there is a setGetFieldStatistics method which can be used to set the stats.field, but there don't seem to be any methods that reach as deep as setting the stats.facet. Thanks, -Jay
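The cause, for anyone hitting this later: setParam() replaces any existing value for a parameter, so each call overwrites the previous one. Two sketches that should keep all three values - pass them as varargs in one call, or append with ModifiableSolrParams.add():

    // all values in one call (setParam takes varargs)
    solrQuery.setParam("stats.facet", "fieldA", "fieldB", "fieldC");

    // or append values one at a time
    solrQuery.add("stats.facet", "fieldA");
    solrQuery.add("stats.facet", "fieldB");
    solrQuery.add("stats.facet", "fieldC");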
Anyone using Solr spatial from trunk?
I was wondering about the production readiness of the new-in-trunk spatial functionality. Is anyone using this in a production environment? -Jay
Re: Index-time vs. search-time boosting performance
I've done a lot of recency boosting of documents, and I'm wondering why you would want to do that at index time. If you are continuously indexing new documents, what was recent when it was indexed becomes, over time, less recent. Are you unsatisfied with your current performance with the boost function? Query-time recency boosting is a fairly common thing to do, and, if done correctly, shouldn't be a performance concern. -Jay http://lucidimagination.com

On Fri, Jun 4, 2010 at 4:50 PM, Asif Rahman a...@newscred.com wrote: Perhaps I should have been more specific in my initial post. I'm doing date-based boosting on the documents in my index, so as to assign a higher score to more recent documents. Currently I'm using a boost function to achieve this. I'm wondering if there would be a performance improvement if, instead of using the boost function at search time, I indexed the documents with a date-based boost.

On Fri, Jun 4, 2010 at 7:30 PM, Erick Erickson erickerick...@gmail.com wrote: Index-time boosting is different from search-time boosting, so asking about performance is irrelevant. Paraphrasing Hossman from years ago on the Lucene list (from memory): ...index-time boosting is a way of saying this document's title is more important than other documents' titles. Search-time boosting is a way of saying I care about documents whose titles contain this term more than other documents whose titles may match other parts of this query. HTH, Erick

On Fri, Jun 4, 2010 at 5:10 PM, Asif Rahman a...@newscred.com wrote: Hi, what are the performance ramifications of using a function-based boost at search time (through bf in the dismax parser) versus an index-time boost? Currently I'm using boost functions on a 15GB index of ~14mm documents. Our queries generally match many thousands of documents. I'm wondering if I would see a performance improvement by switching over to index-time boosting. Thanks, Asif

--
Asif Rahman, Lead Engineer - NewsCred, a...@newscred.com http://platform.newscred.com
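For readers looking for the concrete idiom, a common query-time recency boost looks like this (a sketch; the date field name is hypothetical, and 3.16e-11 is roughly 1/(one year in milliseconds), so a year-old document's boost is about half that of a brand-new one):

    {!boost b=recip(ms(NOW,published_date),3.16e-11,1,1)}your query here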
Auto-suggest internal terms
I've got a situation where I'm looking to build an auto-suggest where any term entered will lead to suggestions. For example, if I type wine I want to see suggestions like this:

french *wine* classes
*wine* book discounts
burgundy *wine*
etc.

I've tried some tricks with shingles, but the only solution that worked was pre-processing my queries, in all their variations, into a separate core. Does anyone know any tricks to accomplish this in Solr without doing any custom work? -Jay
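For context, the shingle tricks mentioned above usually start from a field type like the following (a sketch only - it puts multi-word phrases into the index as single terms, but by itself it doesn't rank or return suggestions that merely contain the typed word):

    <fieldType name="shingled_text" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="true"/>
      </analyzer>
    </fieldType>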
Re: field length normalization
The fieldNorm is computed like this:

fieldNorm = lengthNorm * documentBoost * documentFieldBoosts

and the lengthNorm is:

lengthNorm = 1/(numTermsInField)**.5

[note that the value is encoded as a single byte, so there is some precision loss]

So the values are not pre-set for the lengthNorm, but for some counts the lengthNorm value winds up being the same because of the precision loss. Here is a list of lengthNorm values for 1- to 10-term fields:

# of terms   lengthNorm
 1           1.0
 2           0.625
 3           0.5
 4           0.5
 5           0.4375
 6           0.375
 7           0.375
 8           0.3125
 9           0.3125
10           0.3125

That's why, in your example, the lengthNorm for 3 and 4 terms is the same. -Jay http://www.lucidimagination.com

On Thu, Mar 11, 2010 at 9:50 AM, muneeb muneeba...@hotmail.com wrote: : Did you reindex after setting omitNorms to false? I'm not sure whether or : not it is needed, but it makes sense.

Yes, I deleted the old index and reindexed it. Just to add another fact: the titles' length is less than 10 terms. I am not sure if Solr has pre-set values for length normalization, because for titles with 3 as well as 4 terms the fieldNorm comes up as 0.5 (in the debugQuery section).

--
View this message in context: http://old.nabble.com/field-length-normalization-tp27862618p27867025.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Question about fieldNorms
Yes, if omitNorms=true then no lengthNorm calculation will be done; the fieldNorm value will be 1.0, and the length of the field in question will not be a factor in the score. To see an example of this you can do a quick test. Add two text fields, and omit norms on one:

<field name="foo" type="text" indexed="true" stored="true"/>
<field name="bar" type="text" indexed="true" stored="true" omitNorms="true"/>

Index a doc with the same value for both fields:

<field name="foo">1 2 3 4 5</field>
<field name="bar">1 2 3 4 5</field>

Set debugQuery=true and do two queries: q=foo:5 and q=bar:5. In the explain section of the debug output, note that the fieldNorm value for the foo query is this:

0.4375 = fieldNorm(field=foo, doc=1)

and the value for the bar query is this:

1.0 = fieldNorm(field=bar, doc=1)

A simplified description of how the fieldNorm value is computed:

fieldNorm = lengthNorm * documentBoost * documentFieldBoosts

and the lengthNorm is calculated like this:

lengthNorm = 1/(numTermsInField)**.5

[note that the value is encoded as a single byte, so there is some precision loss]

When omitNorms=true no norm calculation is done, so the fieldNorm will always be 1.0 on those fields. You can also use the Luke utility to view the document in the index; it will show that there is a norm value for the foo field but not the bar field. -Jay http://www.lucidimagination.com

On Sun, Mar 7, 2010 at 5:55 AM, Siddhant Goel siddhantg...@gmail.com wrote: Hi everyone, is the fieldNorm calculation altered by the omitNorms factor? I saw on this page (http://old.nabble.com/Question-about-fieldNorm-td17782701.html) the formula for the calculation of fieldNorms (fieldNorm = fieldBoost/sqrt(numTermsForField)). Does this mean that for a document containing a string like A B C D E in its field, its fieldNorm would be boost/sqrt(5), and for another document containing the string A B C in the same field, its fieldNorm would be boost/sqrt(3)? Is that correct? If yes, then is *this* what omitNorms affects? Thanks, -- - Siddhant
Re: Free Webinar: Mastering Solr 1.4 with Yonik Seeley
Yes, it will be recorded and available to view after the presentation. -Jay

On Thu, Feb 25, 2010 at 2:19 PM, Bernadette Houghton bernadette.hough...@deakin.edu.au wrote: Yonik, can you please advise whether this event will be recorded and available for later download? (It starts 5am our time ;-) ) Regards, Bern

-----Original Message----- From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley Sent: Thursday, 25 February 2010 10:23 AM To: solr-user@lucene.apache.org Subject: Free Webinar: Mastering Solr 1.4 with Yonik Seeley

I'd like to invite you to join me for an in-depth review of Solr's powerful, versatile new features and functions. The free webinar, sponsored by my company, Lucid Imagination, covers an intensive how-to for the features you need to make the most of Solr for your search application:

* Faceting deep dive, from document fields to performance management
* Best practices for sharding, index partitioning and scaling
* How to construct efficient range queries and function queries
* Sneak preview: Solr 1.5 roadmap

Join us for a free webinar: Thursday, March 4, 2010, 10:00 AM PST / 1:00 PM EST / 18:00 GMT. Follow this link to sign up: http://www.eventsvc.com/lucidimagination/030410?trk=WR-MAR2010-AP Thanks, -Yonik http://www.lucidimagination.com
Re: What is largest reasonable setting for ramBufferSizeMB?
Looks like multi-threaded support was added to the DIH recently: http://issues.apache.org/jira/browse/SOLR-1352 -Jay

On Fri, Feb 19, 2010 at 6:27 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Glen may be referring to LuSql indexing with multiple threads? Does/can DIH do that, too? Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Hadoop ecosystem search :: http://search-hadoop.com/

----- Original Message ----- From: Yonik Seeley yo...@lucidimagination.com To: solr-user@lucene.apache.org Sent: Fri, February 19, 2010 11:41:07 AM Subject: Re: What is largest reasonable setting for ramBufferSizeMB?

On Fri, Feb 19, 2010 at 5:03 AM, Glen Newton wrote: You may consider using LuSql[1] to create the indexes, if your source content is in a JDBC-accessible db. It is quite a bit faster than Solr, as it is a tool specifically created and tuned for Lucene indexing.

Any idea why it's faster? AFAIK, the main purpose of DIH is indexing databases too. If DIH is much slower, we should speed it up! -Yonik http://www.lucidimagination.com
Re: score computation for dismax handler
Set the tie parameter to 1.0. This param can be set between 0.0 (pure disjunction maximum) and 1.0 (pure disjunction sum): http://wiki.apache.org/solr/DisMaxRequestHandler#tie_.28Tie_breaker.29 -Jay

On Thu, Feb 18, 2010 at 4:24 AM, bharath venkatesh bharathv6.proj...@gmail.com wrote: Hi, when a query is made across multiple fields in the dismax handler using the parameter qf, I have observed (with debug query enabled) that the resultant score is the max of the scores of the query across each field. But I want the resultant score to be the sum of the scores across fields (like the standard handler). Can anyone tell me how this can be achieved?
Re: optimize is taking too much time
With a mergeFactor set to anything > 1 you would never have only one segment - unless you optimized. So Lucene will never naturally merge all the segments into one. Unless, I suppose, the mergeFactor was set to 1, but I've never tested that, and it's hard to picture how that would work. If I understand correctly, the same actions occur (deleted documents are removed, etc.) because an optimize is only a multiway merge down to one segment, whereas normal merging is triggered by the mergeFactor but does not have a target segment count to merge down to. -Jay

On Sun, Feb 21, 2010 at 11:20 AM, David Smiley @MITRE.org dsmi...@mitre.org wrote: I've always thought that these two events were effectively equivalent -- the results of an optimize vs. the results of Lucene _naturally_ merging all segments together into one. If they don't have the same effect, then what is the difference? ~ David Smiley

Otis Gospodnetic wrote: Hello, Solr will never optimize the whole index without somebody explicitly asking for it. Lucene will merge index segments on the master as documents are indexed. How often it does that depends on mergeFactor. See: http://search-lucene.com/?q=mergeFactor+segment+merge&fc_project=Lucene&fc_project=Solr&fc_type=mail+_hash_+user Otis

----- Original Message ----- From: mklprasad mklpra...@gmail.com To: solr-user@lucene.apache.org Sent: Fri, February 19, 2010 1:02:11 AM Subject: Re: optimize is taking too much time

Jagdish Vasani-2 wrote: Hi, you should not optimize the index after each insert of a document; instead you should optimize it after inserting a good number of documents, because optimize will merge all segments into one according to the settings of the Lucene index. Thanks, Jagdish

On Fri, Feb 12, 2010 at 4:01 PM, mklprasad wrote: Hi, in my Solr I have 1,42,45,223 (about 14.2 million) records taking some 50GB. Now when I am loading a new record and it tries to optimize the docs, it takes too much memory and time. Can anybody please tell me whether we have any property in Solr to get rid of this? Thanks in advance

Yes, thanks for the reply. I have removed the optimize() from the code. But I have a doubt: 1. Will mergeFactor internally do any optimization, or do we have to specify it? 2. Even if Solr initiates an optimize, if I have a large index like 52GB, will that take a huge amount of time? Thanks, Prasad

--
View this message in context: http://old.nabble.com/optimize-is-taking-too-much-time-tp27561570p27676881.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: optimize is taking too much time
Thanks for clearing that up, guys - I misspoke slightly. It's just that, in a running system, it's probably very rare that there is only a single segment for any meaningful length of time. Unless that merge-down-to-one occurs right when indexing stops, there will almost always be a new (small) segment following immediately after the merge. It would be interesting to observe, over a long time, how often and for how long everything is merged down to a single segment. Probably with a very low mergeFactor (2 or 3?) merges-to-one might occur often enough to make optimizing unnecessary. But I'm guessing that the merge-to-one happens so infrequently in most situations that optimizing is more important. -Jay

On Mon, Feb 22, 2010 at 12:16 PM, Mark Miller markrmil...@gmail.com wrote: Also, a mergeFactor of 1 is actually invalid - 2 is the lowest you can go. -- - Mark http://www.lucidimagination.com
Solr Analysis Webinar Jan 28, 2010
My colleague at Lucid Imagination, Tom Hill, will be presenting a free webinar focused on analysis in Lucene/Solr. If you're interested, please sign up and join us. Here is the official notice:

We'd like to invite you to a free webinar our company is offering next Thursday, 28 January, at 2PM Eastern / 11AM Pacific / 1900 GMT. Join Lucid Imagination Senior Staff Engineer Tom Hill for a free, in-depth technical workshop to learn how the Lucene/Solr analyzer can grab and index text and field data, overcome grammatical and semantic variations, and how a little careful preparation and tuning lets you unleash the full power of Lucene/Solr open source search.

* Introduction to analysis, including tokens, tokenizers and token filters
* Tuning tokenization to improve index flexibility and content retrieval precision
* Avoiding common pitfalls by using special troubleshooting tools and techniques

Thursday, January 28, 2010, 11:00 AM PST / 2:00 PM EST / 1900 GMT. Register here: http://www.eventsvc.com/lucidimagination/012810?trk=WR-JAN2010-AP
Re: solr blocking on commit
A couple of follow-up questions:

- What type of garbage collector is in use?
- How often are you optimizing the index?
- In solrconfig.xml, what is the setting for <mainIndex><ramBufferSizeMB>?
- Right before and after you see this pause, check the output of http://host:port/solr/admin/system, specifically the output of <jvm><memory>, and send it to the list.

If possible, definitely watch memory usage with something like JConsole, or start the JVM with some of these params: -XX:+PrintGCDetails -XX:+PrintGCTimeStamps

-Jay

On Tue, Jan 19, 2010 at 5:16 PM, Steve Conover scono...@gmail.com wrote: I'll play with the GC settings and watch memory usage (I've done a little bit of this already), but I have a sense that this isn't the problem. I should also note that in order to create the really long pauses I need to post XML files full of documents that haven't been added in a long time / ever. Once a set of documents is posted to /update, if I re-post it Solr behaves pretty well - and that's true even if I restart Solr.

On Tue, Jan 19, 2010 at 3:05 PM, Yonik Seeley yo...@lucidimagination.com wrote: On Tue, Jan 19, 2010 at 5:57 PM, Steve Conover scono...@gmail.com wrote: I'm using the latest Solr 1.4 with Java 1.6 on Linux. I have a 3M-document index that's 10+GB. We currently give Solr 12GB of RAM to play in, and our machine has 32GB total. We're seeing a problem where Solr blocks during commit - it won't serve /select requests - in some cases for more than 15-30 seconds. We'd like to somehow configure things such that there's no interruption in /select service.

A commit shouldn't cause searches to block. Could this perhaps be a stop-the-world GC pause that coincides with the commit? -Yonik http://www.lucidimagination.com
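For the GC-logging suggestion above, a full set of startup flags might look something like this (a sketch for a Java 6-era JVM; the heap size and collector choice are illustrative, not recommendations):

    java -Xms12g -Xmx12g -XX:+UseConcMarkSweepGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -jar start.jar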
Re: Solr 1.4 - stats page slow
It's definitely still an issue. I've seen this with at least four different Solr implementations. It clearly seems to be a problem when there is a large field cache. It would be bad enough if stats.jsp were just slow to load (it usually takes 1 to 2 minutes), but when monitoring memory usage with JConsole there is a clear and serious spike as soon as the URL for stats.jsp is hit, on occasion causing OutOfMemory exceptions. -Jay

On Fri, Jan 8, 2010 at 9:46 AM, Yonik Seeley yo...@lucidimagination.com wrote: I thought this was fixed... http://issues.apache.org/jira/browse/SOLR-1292 http://www.lucidimagination.com/search/document/57103830f0655776/stats_page_slow_in_latest_nightly -Yonik http://www.lucidimagination.com
Re: Solr 1.4 - stats page slow
Actually my cases were all with customers I work with, not just one case. A common practice is to monitor cache stats to tune the caches properly, and to note the warmup times for new IndexSearchers, etc. I've worked with people who had excessive auto-warm count values, causing extremely long warmup times for the new searchers. So the stats.jsp page has always been a handy, simple tool for monitoring this stuff and setting caches appropriately. But at some point (around the release of 1.4) I started to notice this problem. Since it causes the memory spike, it pretty much prevents the use of stats.jsp in production. I've had to resort to log parsing and other tricks, which is a bit of a waste since it was so simple to do before this surfaced. -Jay

On Fri, Jan 8, 2010 at 10:41 AM, Chris Hostetter hossman_luc...@fucit.org wrote: : 2009-05-28) to the Solr 1.4.0 released code. Every 3 hours we have a : cron task to log some of the data from the stats.jsp page from each : core (about 100 cores, most of which are small indexes).

1) What stats are you actually interested in? ... in Jay's case the LukeRequestHandler made more sense to get the data he wanted anyway. 2) What does the output of stats.jsp say when you see these load spikes? ... it should be fairly lightweight unless it detects some insanity in the way the FieldCaches are being used, in which case it does memory estimation to make it clear how significant the problem is. -Hoss
Re: Indexing the latests MS Office documents
The version of Tika in the 1.4 release definitely parses the most current Office formats (.docx, .pptx, etc.), and they index as expected. -Jay

On Mon, Jan 4, 2010 at 6:02 PM, Peter Wolanin peter.wola...@acquia.com wrote: You must have been searching old documentation - I think Tika 0.3+ has support for the new MS formats. But don't take my word for it - why don't you build Tika and try it? -Peter

On Sun, Jan 3, 2010 at 7:00 PM, Roland Villemoes r...@alpha-solutions.dk wrote: Hi all, does anyone know how to index the latest MS Office documents like .docx and .xlsx? From searching, it seems like Tika only supports the earlier formats .doc and .xls. med venlig hilsen/best regards, Roland Villemoes Tel: (+45) 22 69 59 62 E-Mail: mailto:r...@alpha-solutions.dk

--
Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia Inc. peter.wola...@acquia.com
Re: Solr 1.4 - stats page slow
I've noticed this as well, usually when working with a large field cache. I haven't done an in-depth analysis of this yet, but it seems like when the stats page is trying to pull data from a large field cache it takes quite a long time. Are you doing a lot of sorting? If so, what are the field types of the fields you're sorting on? How large is the index, both in document count and file size?

Another approach to getting data from the Solr instance would be to use JMX. And I've been working on a request handler (started by Erik Hatcher) that will provide the same information as the stats page, but a little more efficiently. I may try to put up a patch with this soon. -Jay

On Wed, Dec 23, 2009 at 6:43 AM, Stephen Weiss swe...@stylesight.com wrote: We've been using Solr 1.4 for a few days now, and one slight downside we've noticed is that the stats page comes up very slowly for some reason - sometimes more than 10 seconds. We call this programmatically to retrieve the last commit date so that we can keep users from committing too frequently. This means some of our administration pages are now taking a long time to load. Is there anything we should be doing to ensure that this page comes up quickly? I see some notes on this back in October, but it looks like that update should already be applied by now. Or, better yet, is there now a better way to just retrieve the last commit date from Solr without pulling all of the statistics? Thanks in advance. -- Steve
Re: Solr 1.4 - stats page slow
Also, what is your heap size and the amount of RAM on the machine? I've also noticed that, when watching memory usage through JConsole or YourKit while loading the stats page, the memory usage spikes dramatically - are you seeing this as well? -Jay

On Thu, Dec 24, 2009 at 9:12 AM, Jay Hill jayallenh...@gmail.com wrote: I've noticed this as well, usually when working with a large field cache. I haven't done an in-depth analysis of this yet, but it seems like when the stats page is trying to pull data from a large field cache it takes quite a long time. Are you doing a lot of sorting? If so, what are the field types of the fields you're sorting on? How large is the index, both in document count and file size? Another approach to getting data from the Solr instance would be to use JMX. And I've been working on a request handler (started by Erik Hatcher) that will provide the same information as the stats page, but a little more efficiently. I may try to put up a patch with this soon. -Jay

On Wed, Dec 23, 2009 at 6:43 AM, Stephen Weiss swe...@stylesight.com wrote: We've been using Solr 1.4 for a few days now, and one slight downside we've noticed is that the stats page comes up very slowly for some reason - sometimes more than 10 seconds. We call this programmatically to retrieve the last commit date so that we can keep users from committing too frequently. This means some of our administration pages are now taking a long time to load. Is there anything we should be doing to ensure that this page comes up quickly? I see some notes on this back in October, but it looks like that update should already be applied by now. Or, better yet, is there now a better way to just retrieve the last commit date from Solr without pulling all of the statistics? Thanks in advance. -- Steve
Sort fields all look Strings in field cache, no matter schema type
I'm on a project where I'm trying to determine the size of the field cache. We're seeing lots of memory problems, and I suspect that the field cache is extremely large, but I'm trying to get exact counts on what's in the field cache. One thing that struck me as odd in the output of the stats.jsp page is that the field cache always shows a String type for a field, even if it is not a String. For example, the output below is for a field cscore that is a double: entry#0 : 'org.apache.lucene.index.readonlydirectoryrea...@6239da8a'='cscore',class org.apache.lucene.search.FieldCache$StringIndex,null=org.apache.lucene.search.FieldCache$StringIndex#297347471 The index has 4,292,426 documents, so I would expect the field cache size for this field to be: cscore: double (8 bytes) x 4,292,426 docs = 34,339,408 bytes But can someone explain why a double is using FieldCache$StringIndex please? No matter what the type of the field is in the schema the field cache stats always show FieldCache$StringIndex. Thanks, -Jay
Re: Sort fields all look Strings in field cache, no matter schema type
This field is of class type solr.SortableDoubleField. I'm actually migrating a project from Solr 1.1 to 1.4, and am in the process of trying to update the schema and solrconfig in stages. Updating the field to TrieDoubleField w/ precisionStep=0 definitely helped. Thanks Yonik! -Jay On Sat, Dec 19, 2009 at 11:37 AM, Yonik Seeley yo...@lucidimagination.comwrote: On Sat, Dec 19, 2009 at 2:25 PM, Jay Hill jayallenh...@gmail.com wrote: One thing that struck me as odd in the output of the stats.jsp page is that the field cache always shows a String type for a field, even if it is not a String. For example, the output below is for a field cscore that is a double: What's the class type of the double? Older style SortableDouble had to use the string index. Newer style trie-double based should use a double[]. It also matters what the FieldCache entry is being used for... certain things like faceting on single valued fields still use the StringIndex. I believe the stats component does too. Sorting and function queries should work as expected. -Yonik
Re: Sort fields all look Strings in field cache, no matter schema type
Oh, forgot to add (just to keep the thread complete), the field is being used for a sort, so it was able to use TrieDoubleField. Thanks again, -Jay On Sat, Dec 19, 2009 at 12:21 PM, Jay Hill jayallenh...@gmail.com wrote: This field is of class type solr.SortableDoubleField. I'm actually migrating a project from Solr 1.1 to 1.4, and am in the process of trying to update the schema and solrconfig in stages. Updating the field to TrieDoubleField w/ precisionStep=0 definitely helped. Thanks Yonik! -Jay On Sat, Dec 19, 2009 at 11:37 AM, Yonik Seeley yo...@lucidimagination.com wrote: On Sat, Dec 19, 2009 at 2:25 PM, Jay Hill jayallenh...@gmail.com wrote: One thing that struck me as odd in the output of the stats.jsp page is that the field cache always shows a String type for a field, even if it is not a String. For example, the output below is for a field cscore that is a double: What's the class type of the double? Older style SortableDouble had to use the string index. Newer style trie-double based should use a double[]. It also matters what the FieldCache entry is being used for... certain things like faceting on single valued fields still use the StringIndex. I believe the stats component does too. Sorting and function queries should work as expected. -Yonik
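For reference, a trie-based sortable double in a Solr 1.4 schema could be declared roughly like this; the cscore name comes from the thread, while the other attribute values are typical example-schema settings, not taken from the original messages:

  <fieldType name="tdouble" class="solr.TrieDoubleField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
  <field name="cscore" type="tdouble" indexed="true" stored="true"/>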
Re: nested queries
I don't think your queries are actually nested queries. Nested queries key off of the magic field name _query_. You're right, however, that there is very little in the way of documentation or examples of nested queries. If you haven't seen this blog about them yet, you might find it a helpful overview of nested queries: http://www.lucidimagination.com/blog/2009/03/31/nested-queries-in-solr/ -Jay On Thu, Nov 19, 2009 at 6:15 AM, Andrea Campi andrea.ca...@zephirworks.com wrote: Grant, Grant Ingersoll wrote: On Nov 19, 2009, at 7:02 AM, Andrea Campi wrote: To make things easier and more maintainable, I'd like to use nested queries for that; I'd like to be able to write: q={!boost b=$dateboost v='ftext:$terms^1000 OR text:$terms'}&dateboost=product(...etc.)&terms=something Or even better: q={!boost b=$dateboost v=$qq}&qq={!query v='ftext:$terms^1000 OR text:$terms'}&dateboost=product(...etc.)&terms=something Sounds like you might benefit from using the Dismax Parser. You can specify the field boosting thing in your config and also add the bf (boost function) capability. I tried that but the customer prefers the lucene syntax for the actual query. However, now that you mention this, I should probably be able to use Dismax but specify the lucene syntax for the actual search on the 'text' field, right? I will try that, thanks. Bye, Andrea
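As a concrete illustration of the _query_ magic field, a true nested query could look like the lines below; the text and created field names and the dismax parameters are illustrative assumptions, not values from the thread:

  q=_query_:"{!dismax qf=text}solr rocks"
  q={!boost b=$dateboost v=$qq}&qq={!dismax qf=text}solr rocks&dateboost=recip(ms(NOW,created),3.16e-11,1,1)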
Re: Wildcards at the Beginning of a Search.
There is a text_rev field type in the example schema.xml file in the official release of 1.4. It uses the ReversedWildcardFilterFactory to reverse a field. You can do a copyField from the field you want to use for leading wildcard searches to a field using the text_rev type, and then do a regular trailing wildcard search on the reversed field. -Jay http://www.lucidimagination.com On Thu, Nov 12, 2009 at 4:41 AM, Jörg Agatz joerg.ag...@googlemail.com wrote: Is there maybe a way in Solr 1.4 to search with a wildcard at the beginning? In 1.3 I can't activate it. KingArtus
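A minimal schema sketch of that copyField arrangement, with title standing in for whatever field needs leading-wildcard support (the field names are hypothetical):

  <field name="title" type="text" indexed="true" stored="true"/>
  <field name="title_rev" type="text_rev" indexed="true" stored="false"/>
  <copyField source="title" dest="title_rev"/>

A leading-wildcard search such as *ing against title then becomes a trailing-wildcard search against title_rev with the term reversed, e.g. title_rev:gni*.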
Replication admin page auto-reload
The replication admin page on slaves used to have an auto-reload set to reload every few seconds. In the official 1.4 release this doesn't seem to be working, but it does in a nightly build from early June. Was this changed on purpose or is this a bug? I looked through CHANGES.txt to see if anything was mentioned related to this but didn't see anything. If it's a bug I'll open an issue in JIRA -Jay
Re: Sending file to Solr via HTTP POST
Here is a brief example of how to use SolrJ with the ExtractingRequestHandler:

ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
req.addFile(fileToIndex);
req.setParam("literal.id", getId(fileToIndex));
req.setParam("literal.hostname", getHostname());
req.setParam("literal.filename", fileToIndex.getName());
try {
  getSolrServer().request(req);
} catch (SolrServerException e) {
  e.printStackTrace();
}

You'll need a request handler configured in solrconfig.xml:

<!-- Solr Cell Wiki: http://wiki.apache.org/solr/ExtractingRequestHandler -->
<requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler" startup="lazy">
  <lst name="defaults">
    <!-- All the main content goes into this field... if you need to return the extracted text or do highlighting, use a stored field. -->
    <str name="map.content">text</str>
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>
  </lst>
</requestHandler>

Note that the example also shows how to use the literal.* parameter to add metadata fields of your choice to the document. Hope that helps get you started. -Jay http://www.lucidimagination.com On Tue, Nov 3, 2009 at 10:38 PM, Caroline Tan caroline@gmail.com wrote: Hi, From the Solr wiki on the ExtractingRequestHandler tutorial, when it comes to the part to post a file to Solr, it always uses the curl command, e.g. curl 'http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true' -F myfi...@tutorial.html I have never used curl and I was thinking, is there any replacement for such a method? Is there any API that I can use to achieve the same thing in a Java project without relying on curl? Does SolrJ have such a method? Thanks ~caroLine
Re: specify multiple files in <lst> for DataImportHandler
You can set up multiple request handlers, each with its own configuration file. For example, in addition to the config you listed you could add something like this:

<requestHandler name="/dataimport-two" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-two-config.xml</str>
  </lst>
</requestHandler>

and so on with as many handlers as you need. -Jay http://www.lucidimagination.com On Thu, Nov 5, 2009 at 8:57 AM, javaxmlsoapdev vika...@yahoo.com wrote:

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>

Is there a way to list more than one file in the <lst> above configuration? I understand I can have multiple entities in the config itself but I need to keep two data-config files separate and still use the same DIH to create one index. -- View this message in context: http://old.nabble.com/specify-multiple-files-in-%3Clst%3E-for-DataImportHandler-tp26215805p26215805.html Sent from the Solr - User mailing list archive at Nabble.com.
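Each handler is then invoked at its own path; for example, using the handler names from the config above and the standard DIH command parameter:

  curl 'http://localhost:8983/solr/dataimport?command=full-import'
  curl 'http://localhost:8983/solr/dataimport-two?command=full-import'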
Re: CPU utilization and query time high on Solr slave when snapshot install
So assuming you set up a few sample sort queries to run in the firstSearcher config, and had very low query volume during that ten minutes so that there were no evictions before a new Searcher was loaded, would those queries run by the firstSearcher be passed along to the cache for the next Searcher as part of the autowarm? If so, it seems like you might want to load a few sort queries for the firstSearcher, but might not need any included in the newSearcher? -Jay On Mon, Nov 2, 2009 at 4:26 PM, Mark Miller markrmil...@gmail.com wrote: Hmm...I think you have to setup warming queries yourself and that autowarm just copies entries from the old cache to the new cache, rather than issuing queries - the value is how many entries it will copy. Though that's still going to take CPU and time. - Mark http://www.lucidimagination.com (mobile) On Nov 2, 2009, at 12:47 PM, Walter Underwood wun...@wunderwood.org wrote: If you are going to pull a new index every 10 minutes, try turning off cache autowarming. Your caches are never more than 10 minutes old, so spending a minute warming each new cache is a waste of CPU. Autowarm submits queries to the new Searcher before putting it in service. This will create a burst of query load on the new Searcher, often keeping one CPU pretty busy for several seconds. In solrconfig.xml, set autowarmCount to 0. Also, if you want the slaves to always have an optimized index, create the snapshot only in post-optimize. If you create snapshots in both post-commit and post-optimize, you are creating a non-optimized index (post-commit), then replacing it with an optimized one a few minutes later. A slave might get a non-optimized index one time, then an optimized one the next. wunder On Nov 2, 2009, at 1:45 AM, biku...@sapient.com wrote: Hi Solr Gurus, We have solr in 1 master, 2 slave configuration. Snapshot is created post commit, post optimization. We have autocommit after 50 documents or 5 minutes. Snapshot puller runs as a cron every 10 minutes. What we have observed is that whenever snapshot is installed on the slave, we see solrj client used to query slave solr, gets timedout and there is high CPU usage/load avg. on slave server. If we stop snapshot puller, then slaves work with no issues. The system has been running since 2 months and this issue has started to occur only now when load on website is increasing. Following are some details: Solr Details: apache-solr Version: 1.3.0 Lucene - 2.4-dev Master/Slave configurations: Master: - for indexing data HTTPRequests are made on Solr server. - autocommit feature is enabled for 50 docs and 5 minutes - caching params are disable for this server - mergeFactor of 10 is set - we were running optimize script after every 2 hours, but now have reduced the duration to twice a day but issue still persists Slave1/Slave2: - standard requestHandler is being used - default values of caching are set Machine Specifications: Master: - 4GB RAM - 1GB JVM Heap memory is allocated to Solr Slave1/Slave2: - 4GB RAM - 2GB JVM Heap memory is allocated to Solr Master and Slave1 (solr1)are on single box and Slave2(solr2) on different box. We use HAProxy to load balance query requests between 2 slaves. Master is only used for indexing. Please let us know if somebody has ever faced similar kind of issue or has some insight into it as we guys are literally struck at the moment with a very unstable production environment. As a workaround, we have started running optimize on master every 7 minutes. 
This seems to have reduced the severity of the problem, but the issue still occurs every 2 days now. Please suggest what could be the root cause of this. Thanks, Bipul
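For completeness, a firstSearcher (or newSearcher) warming listener in solrconfig.xml looks roughly like this; the query and sort field here are placeholders, not values from the thread:

  <listener event="firstSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst>
        <str name="q">solr</str>
        <str name="sort">price asc</str>
      </lst>
    </arr>
  </listener>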
Re: solr web ui
Have a look at the VelocityResponseWriter ( http://wiki.apache.org/solr/VelocityResponseWriter). It's in the contrib area, but the wiki has instructions on how to move it into your core Solr. Solr uses response writers to return results. The default is XML but responses can be returned in JSON, Ruby and other formats. The VelocityResponseWriter enables responses returned using Velocity templates. It sounds like exactly what you need. -Jay http://www.lucidimagination.com On Thu, Oct 29, 2009 at 6:17 PM, scabbage guans...@gmail.com wrote: Hi, I'm a new solr user. I would like to know if there are any easy to setup web UIs for solr. It can be as simple as a search box, term highlighting and basic faceting. Basically I'm using solr to store all our automation testing logs and would like to have a simple searchable UI. I don't wanna spent too much time writing my own. Thanks. -- View this message in context: http://www.nabble.com/solr-web-ui-tp26123604p26123604.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Facets - ORing attribute values
1.4 has a good chance of being released next week. There was a hope that it might make it this week, but another bug in Lucene 2.9.1 was found, pushing things back just a little bit longer. -Jay http://www.lucidimagination.com On Thu, Oct 29, 2009 at 11:43 AM, beaviebugeater mbr...@jdnholdings.com wrote: Do you have any (educated) guess on when 1.4 will be officially released? Weeks? Months? Years? Yonik Seeley-2 wrote: Perhaps something like this that's actually running Solr w/ multi-select? http://search.lucidimagination.com/ http://wiki.apache.org/solr/SimpleFacetParameters#Tagging_and_excluding_Filters You just need a recent version of Solr 1.4 -Yonik http://www.lucidimagination.com On Thu, Oct 29, 2009 at 1:51 PM, beaviebugeater mbr...@jdnholdings.com wrote: I have implemented faceting with Solr for an ecommerce project. However, I'd like to change the default behavior somewhat. Visualize with me the left nav that contains: Attribute A value1 (count) value2 (count) value3 (count) Attribute B value4 (count) value5 (count) The user interface has a checkbox for each attribute value. As a checkbox is checked, the list of products is refined to include those with the selected attribute(s). The default behavior is to AND all selected attributes. What I would like is if I check value1, none of the counts for Attribute A change, just the product result set. If I then check value3 the effect is that I'm saying products with values for Attribute A of value1 OR value3 (not AND). Counts for Attribute B do change as usual. If I then check value4, the effect is to return products with values for Attribute A of (value1 OR value3) AND values for Attribute B of value4. You can see this sort of thing in action here: http://www.beanbags.com/bean-bag-chairs/large/1618+1620+4225.cfm#N=1618+1620+4225+4229+4231&Ns=Preferred&view=36&display=grid_view Is this doable with Solr out of the box or do I need to build some logic around Solr's faceting functionality? Thanks. Matt -- View this message in context: http://www.nabble.com/Facets---ORing-attribute-values-tp26117763p26117763.html Sent from the Solr - User mailing list archive at Nabble.com. -- View this message in context: http://www.nabble.com/Facets---ORing-attribute-values-tp26117763p26118562.html Sent from the Solr - User mailing list archive at Nabble.com.
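The tagging/excluding approach from the wiki page above would translate to something like this for the example in the thread; attributeA is a hypothetical field name:

  q=*:*&facet=on&facet.mincount=1&facet.field={!ex=attrA}attributeA&fq={!tag=attrA}attributeA:(value1 OR value3)

The fq narrows the result set to value1 OR value3, while the {!ex=attrA} exclusion keeps the counts for Attribute A computed as if that filter were not applied.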
Re: DIH: Setting rows= on full-import has no effect
As always, you guys rock! Thanks, -Jay http://www.lucidimagination.com On Fri, Oct 9, 2009 at 2:57 AM, Shalin Shekhar Mangar shalinman...@gmail.com wrote: FYI - This is fixed in trunk. 2009/10/9 Noble Paul നോബിള് नोब्ळ् noble.p...@corp.aol.com I have raised an issue http://issues.apache.org/jira/browse/SOLR-1501 On Fri, Oct 9, 2009 at 6:10 AM, Jay Hill jayallenh...@gmail.com wrote: In the past setting rows=n with the full-import command has stopped the DIH importing at the number I passed in, but now this doesn't seem to be working. Here is the command I'm using: curl 'http://localhost:8983/solr/indexer/mediawiki?command=full-import&rows=100' But when 100 docs are imported the process keeps running. Here's the log output: Oct 8, 2009 5:23:32 PM org.apache.solr.handler.dataimport.DocBuilder buildDocument INFO: Indexing stopped at docCount = 100 Oct 8, 2009 5:23:33 PM org.apache.solr.handler.dataimport.DocBuilder buildDocument INFO: Indexing stopped at docCount = 200 Oct 8, 2009 5:23:35 PM org.apache.solr.handler.dataimport.DocBuilder buildDocument INFO: Indexing stopped at docCount = 300 Oct 8, 2009 5:23:36 PM org.apache.solr.handler.dataimport.DocBuilder buildDocument INFO: Indexing stopped at docCount = 400 Oct 8, 2009 5:23:38 PM org.apache.solr.handler.dataimport.DocBuilder buildDocument INFO: Indexing stopped at docCount = 500 and so on. Running on the most recent nightly: 1.4-dev 823366M - jayhill - 2009-10-08 17:31:22 I've used that exact url in the past and the indexing stopped at the rows number as expected, but I haven't run the command for about two months on a build from back in early July. Here's the dih config:

<dataConfig>
  <dataSource name="dsFiles" type="FileDataSource" encoding="UTF-8"/>
  <document>
    <entity name="f" processor="FileListEntityProcessor" baseDir="/path/to/files" fileName=".*xml" recursive="true" rootEntity="false" dataSource="null">
      <entity name="wikixml" processor="XPathEntityProcessor" forEach="/mediawiki/page" url="${f.fileAbsolutePath}" dataSource="dsFiles" onError="skip">
        <field column="id" xpath="/mediawiki/page/id"/>
        <field column="title" xpath="/mediawiki/page/title"/>
        <field column="contributor" xpath="/mediawiki/page/revision/contributor/username"/>
        <field column="comment" xpath="/mediawiki/page/revision/comment"/>
        <field column="text" xpath="/mediawiki/page/revision/text"/>
      </entity>
    </entity>
  </document>
</dataConfig>

-Jay -- - Noble Paul | Principal Engineer| AOL | http://aol.com -- Regards, Shalin Shekhar Mangar.
Re: concatenating tokens
Use copyField to copy to a field with a field type like this:

<fieldType name="special" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement=" " replace="all"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement=" " replace="all"/>
  </analyzer>
</fieldType>

This works for your example; however, I can't be sure if it will work for all of your content, but give it a try and see. -Jay http://www.lucidimagination.com On Fri, Oct 9, 2009 at 1:34 AM, Chantal Ackermann chantal.ackerm...@btelligent.de wrote: Hi Joe, WordDelimiterFilter removes different delimiters, and creates several token strings from the input. It can also concatenate and add that as an additional token to the stream. Though, it concatenates without a space. But maybe you can tweak it to your needs? You could also use two different fields, one creating the concatenated version with spaces, and the other producing the catenated tokens. (Both with WordDelimiter and/or RegexPattern filters etc.) Cheers, Chantal Joe Calderon schrieb: Hello *, I'm using a combination of tokenizers and filters that give me the desired tokens; however, for a particular field I want to concatenate these tokens back into a single string. Is there a filter to do that, and if not, what are the steps needed to make my own filter to concatenate tokens? For example, I start with "Sprocket (widget) - Blue"; the analyzers churn out the tokens [sprocket, widget, blue]; I want to end up with the string "sprocket widget blue". This is a simple example, and in the general case lowercasing and punctuation removal do not work, hence why I'm looking to concatenate tokens. --joe
Re: Dynamic Data Import from multiple identical tables
You could use separate DIH config files for each of your three tables. This might be overkill, but it would keep them separate. The DIH is not limited to one request handler setup, so you could create a unique handler for each case with a unique name:

<requestHandler name="/indexer/table1" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">table1-config.xml</str>
  </lst>
</requestHandler>
<requestHandler name="/indexer/table2" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">table2-config.xml</str>
  </lst>
</requestHandler>
<requestHandler name="/indexer/table3" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">table3-config.xml</str>
  </lst>
</requestHandler>

When you go to ...solr/admin/dataimport.jsp you should see a list of all DataImportHandlers that are configured, and can select them individually, if that works for your needs. -Jay http://www.lucidimagination.com On Fri, Oct 9, 2009 at 10:57 AM, solr.searcher solr.searc...@gmail.com wrote: Hi all, First of all, please accept my apologies if this has been asked and answered before. I tried my best to search and couldn't find anything on this. The problem I am trying to solve is as follows. I have multiple tables with identical schema - table_a, table_b, table_c ... and I am trying to create one big index with the data from each of these tables. The idea was to programmatically create the data-config file (just changing the table name) and do a reload-config followed by a full-import with clean set to false. In other words: 1. publish the data-config file 2. do a reload-config 3. do a full-import with clean = false 4. commit, optimize 5. repeat with new table name I wanted to then follow the same procedure for delta imports. The problem is that after I do a reload-config and then do a full-import, the old data in the index is getting lost. What am I missing here? Please note that I am new to solr. INFO: [] webapp=/solr path=/dataimport params={command=reload-config&clean=false} status=0 QTime=4 Oct 9, 2009 10:17:30 AM org.apache.solr.core.SolrCore execute INFO: [] webapp=/solr path=/dataimport params={command=full-import&clean=false} status=0 QTime=1 Oct 9, 2009 10:17:30 AM org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties INFO: Read dataimport.properties Oct 9, 2009 10:17:30 AM org.apache.solr.handler.dataimport.DataImporter doFullImport INFO: Starting Full Import Oct 9, 2009 10:17:30 AM org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties INFO: Read dataimport.properties Oct 9, 2009 10:17:30 AM org.apache.solr.handler.dataimport.JdbcDataSource$1 call INFO: Creating a connection for entity blah blah blah Oct 9, 2009 10:17:30 AM org.apache.solr.handler.dataimport.JdbcDataSource$1 call INFO: Time taken for getConnection(): 12 Oct 9, 2009 10:17:31 AM org.apache.solr.core.SolrDeletionPolicy onInit INFO: SolrDeletionPolicy.onInit: commits:num=1 commit{dir=/blah/blah/index,segFN=segments_1z,version=1255032607825,generation=71,filenames=[segments_1z, _cl.cfs] Oct 9, 2009 10:17:31 AM org.apache.solr.core.SolrDeletionPolicy updateCommits INFO: last commit = 1255032607825 Any help will be greatly appreciated. Is there any other way to automatically slurp data from multiple, identical tables? Thanks a lot. -- View this message in context: http://www.nabble.com/Dynamic-Data-Import-from-multiple-identical-tables-tp25825381p25825381.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: java -Dsolr.solr.home=core -jar start.jar not working for me
Shouldn't that be: java -Dsolr.solr.home=multicore -jar start.jar and then hit url: http://localhost:8983/solr/core0/admin/ or http://localhost:8983/solr/core1/admin/ -Jay http://www.lucidimagination.com On Fri, Oct 9, 2009 at 1:17 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: I have a fresh checkout from trunk, cd example, after running java -Dsolr.solr.home=core -jar start.jar, http://localhost:8983/solr/admin yields a 404 error.
Re: java -Dsolr.solr.home=core -jar start.jar not working for me
After checking out the latest revision did you do a build? I've made that mistake myself a few times: check out the latest revision and then fire up jetty before running ant example - could that be it? -Jay http://www.lucidimagination.com On Fri, Oct 9, 2009 at 1:38 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: Jay, I tried that as well, still nothing. When I run: java -Dsolr.solr.home=solr -jar start.jar I see: 2009-10-09 13:37:04.887::INFO: Logging to STDERR via org.mortbay.log.StdErrLog 2009-10-09 13:37:05.051::INFO: jetty-6.1.3 2009-10-09 13:37:05.096::INFO: Started SocketConnector @ 0.0.0.0:8983 And http://localhost:8983/solr/admin yields a 404 error. On Fri, Oct 9, 2009 at 1:27 PM, Jay Hill jayallenh...@gmail.com wrote: Shouldn't that be: java -Dsolr.solr.home=multicore -jar start.jar and then hit url: http://localhost:8983/solr/core0/admin/ or http://localhost:8983/solr/core1/admin/ -Jay http://www.lucidimagination.com On Fri, Oct 9, 2009 at 1:17 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: I have a fresh checkout from trunk, cd example, after running java -Dsolr.solr.home=core -jar start.jar, http://localhost:8983/solr/admin yields a 404 error.
DIH: Setting rows= on full-import has no effect
In the past setting rows=n with the full-import command has stopped the DIH importing at the number I passed in, but now this doesn't seem to be working. Here is the command I'm using: curl 'http://localhost:8983/solr/indexer/mediawiki?command=full-import&rows=100' But when 100 docs are imported the process keeps running. Here's the log output: Oct 8, 2009 5:23:32 PM org.apache.solr.handler.dataimport.DocBuilder buildDocument INFO: Indexing stopped at docCount = 100 Oct 8, 2009 5:23:33 PM org.apache.solr.handler.dataimport.DocBuilder buildDocument INFO: Indexing stopped at docCount = 200 Oct 8, 2009 5:23:35 PM org.apache.solr.handler.dataimport.DocBuilder buildDocument INFO: Indexing stopped at docCount = 300 Oct 8, 2009 5:23:36 PM org.apache.solr.handler.dataimport.DocBuilder buildDocument INFO: Indexing stopped at docCount = 400 Oct 8, 2009 5:23:38 PM org.apache.solr.handler.dataimport.DocBuilder buildDocument INFO: Indexing stopped at docCount = 500 and so on. Running on the most recent nightly: 1.4-dev 823366M - jayhill - 2009-10-08 17:31:22 I've used that exact url in the past and the indexing stopped at the rows number as expected, but I haven't run the command for about two months on a build from back in early July. Here's the dih config:

<dataConfig>
  <dataSource name="dsFiles" type="FileDataSource" encoding="UTF-8"/>
  <document>
    <entity name="f" processor="FileListEntityProcessor" baseDir="/path/to/files" fileName=".*xml" recursive="true" rootEntity="false" dataSource="null">
      <entity name="wikixml" processor="XPathEntityProcessor" forEach="/mediawiki/page" url="${f.fileAbsolutePath}" dataSource="dsFiles" onError="skip">
        <field column="id" xpath="/mediawiki/page/id"/>
        <field column="title" xpath="/mediawiki/page/title"/>
        <field column="contributor" xpath="/mediawiki/page/revision/contributor/username"/>
        <field column="comment" xpath="/mediawiki/page/revision/comment"/>
        <field column="text" xpath="/mediawiki/page/revision/text"/>
      </entity>
    </entity>
  </document>
</dataConfig>

-Jay
Re: TermsComponent or auto-suggest with filter
Something like this, building on each character typed: facet=on&facet.field=tc_query&facet.prefix=be&facet.mincount=1 -Jay http://www.lucidimagination.com On Tue, Oct 6, 2009 at 5:43 PM, R. Tan tanrihae...@gmail.com wrote: Nice. In comparison, how do you do it with faceting? Two other approaches are to use either the TermsComponent (new in Solr 1.4) or faceting. On Wed, Oct 7, 2009 at 1:51 AM, Jay Hill jayallenh...@gmail.com wrote: Have a look at a blog I posted on how to use EdgeNGrams to build an auto-suggest tool: http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/ You could easily add filter queries to this approach. For example, the query used in the blog could add filter queries like this: http://localhost:8983/solr/select/?q=user_query:"i"&wt=json&fl=user_query&indent=on&echoParams=none&rows=10&sort=count desc&fq=yourField:yourQuery&fq=anotherField:anotherQuery -Jay http://www.lucidimagination.com On Tue, Oct 6, 2009 at 4:40 AM, R. Tan tanrihae...@gmail.com wrote: Hello, What's the best way to get auto-suggested terms/keywords that are filtered by one or more fields? TermsComponent should have been the solution but filters are not supported. Thanks, Rihaed
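Put together as a full request, the faceting variant might look like this; tc_query is the field name used earlier in the thread, and rows=0 suppresses the document results since only the facet counts are needed:

  http://localhost:8983/solr/select?q=*:*&rows=0&facet=on&facet.field=tc_query&facet.prefix=be&facet.mincount=1&facet.limit=10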
Re: ISOLatin1AccentFilter before or after Snowball?
Correct me if I'm wrong, but wasn't the ISOLatin1AccentFilterFactory deprecated in 1.4 in favor of: <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/> -Jay http://www.lucidimagination.com On Wed, Oct 7, 2009 at 1:44 AM, Shalin Shekhar Mangar shalinman...@gmail.com wrote: On Tue, Oct 6, 2009 at 4:33 PM, Chantal Ackermann chantal.ackerm...@btelligent.de wrote: Hi all, from reading through previous posts on that subject, it seems like the accent filter has to come before the snowball filter. I'd just like to make sure this is so. If it is the case, I'm wondering whether snowball filters for e.g. French process accented language correctly at all, or whether they remove accents anyway... Or whether accents should be removed whenever making use of snowball filters. I'd think so but I'm not sure. Perhaps someone else can weigh in. And also: it really is meant to take UTF-8 as input, even though it is named ISOLatin1AccentFilter, isn't it? See http://markmail.org/message/hi25u5iqusfu542b -- Regards, Shalin Shekhar Mangar.
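In 1.4 the replacement would be wired into a field type roughly like this; the type name and the French stemmer are illustrative choices for the accented-language discussion, not taken from the original messages:

  <fieldType name="textFolded" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="French"/>
    </analyzer>
  </fieldType>

Because a charFilter runs before the tokenizer, the accents are folded before the Snowball stemmer ever sees the tokens, which addresses the ordering question in the thread.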
Re: TermsComponent or auto-suggest with filter
Have a look at a blog I posted on how to use EdgeNGrams to build an auto-suggest tool: http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/ You could easily add filter queries to this approach. For example, the query used in the blog could add filter queries like this: http://localhost:8983/solr/select/?q=user_query:"i"&wt=json&fl=user_query&indent=on&echoParams=none&rows=10&sort=count desc&fq=yourField:yourQuery&fq=anotherField:anotherQuery -Jay http://www.lucidimagination.com On Tue, Oct 6, 2009 at 4:40 AM, R. Tan tanrihae...@gmail.com wrote: Hello, What's the best way to get auto-suggested terms/keywords that are filtered by one or more fields? TermsComponent should have been the solution but filters are not supported. Thanks, Rihaed
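A sketch of an edge-n-gram field type along the lines the blog describes; the type name and gram sizes are illustrative assumptions, not values quoted from the blog or the thread:

  <fieldType name="edgytext" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

Indexing the popular queries into such a field makes every prefix of a query string matchable, so each keystroke can be answered with a simple term query against it.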
Batching requests using SolrCell with SolrJ
When working with SolrJ I have typically batched a Collection of SolrInputDocument objects before sending them to the Solr server. I'm working with the latest nightly build and using the ExtractingRequestHandler to index documents, and everything is working fine. Except I haven't been able to figure out how to batch documents when also including literals. Here's what I've got:

//Looping over a List of Files
ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
req.addFile(fileToIndex);
req.setParam("literal.id", fileToIndex.getCanonicalPath());
try {
  getSolrServer().request(req);
} catch (SolrServerException e) {
  e.printStackTrace();
}

Which works great, except that each document processed in the loop is sending a separate request. Previously I built a collection of SolrInputDocuments and had SolrJ send them in batches of 100 or whatever. It seems like I could batch documents by continuing to add them to the request (req.addFile(eachFileUpToACount)), but the literals seem to present a problem. By sending one at a time the contents and the literals all wind up in the same document. But in a batch there will just be an array of params for literal.id (in this example) not matched to the contents. Can anyone provide a code snippet of how to do this? Or is there no other approach than sending a request for each document? Thanks, -Jay http://www.lucidimagination.com
Any way to encrypt/decrypt stored fields?
For security reasons (say I'm indexing very sensitive data, medical records for example) is there a way to encrypt data that is stored in Solr? Some businesses I've encountered have such needs and this is a barrier to them adopting Solr to replace other legacy systems. Would it require a custom-written filter to encrypt during indexing and decrypt at query time, or is there something I'm unaware of already available to do this? -Jay
Re: Is it possible to query for everything ?
Use: ?q=*:* -Jay http://www.lucidimagination.com On Mon, Sep 14, 2009 at 4:18 PM, Jonathan Vanasco jvana...@2xlp.com wrote: I'm using Solr for seach and faceted browsing Is it possible to have solr search for 'everything' , at least as far as q is concerned ? The request handlers I've found don't like it if I don't pass in a q parameter
Re: Is it possible to query for everything ?
With dismax you can use q.alt when the q param is missing: q.alt=*:* should work. -Jay On Mon, Sep 14, 2009 at 5:38 PM, Jonathan Vanasco jvana...@2xlp.com wrote: Thanks Jay Matt I tried *:* on my app, and it didn't work I tried it on the solr admin, and it did I checked the solr config file, and realized that it works on standard, but not on dismax, queries So i have my app checking *:* on a standard qt, and then filtering what I need on other qts! I would never have figured this out without you two!
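The q.alt default can also be baked into the handler configuration so clients never have to send it; a minimal sketch for a 1.4-style solrconfig.xml (handler name and placement are illustrative):

  <requestHandler name="dismax" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="defType">dismax</str>
      <str name="q.alt">*:*</str>
    </lst>
  </requestHandler>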
Re: KStem download
The two jar files are all you should need, and the configuration is correct. However I noticed that you are on Solr 1.3. I haven't tested the Lucid KStemmer on a non-Lucid-certified distribution of 1.3. I have tested it on recent versions of 1.4 and it works fine (just tested with the most recent nightly build). So there are two options, but I don't know if either will work for you: 1. Move up to Solr 1.4, copy over the jars and configure. 2. Get the free Lucid certified distribution of 1.3 which already has the Lucid KStemmer (and other fixes which are an improvement over the standard 1.3). -Jay http://www.lucidimagination.com On Mon, Sep 14, 2009 at 6:09 PM, darniz rnizamud...@edmunds.com wrote: I was able to declare a field type when I use the Lucid distribution of Solr:

<fieldtype name="lucidkstemmer" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory" protected="protwords.txt"/>
  </analyzer>
</fieldtype>

But if I copy the two jars and put them in the lib directory of the Apache Solr distribution it still gives me the following error:

SEVERE: java.lang.NoClassDefFoundError: org/apache/solr/util/plugin/ResourceLoaderAware
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:621)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:124)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:260)
at java.net.URLClassLoader.access$000(URLClassLoader.java:56)
at java.net.URLClassLoader$1.run(URLClassLoader.java:195)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
at org.mortbay.jetty.webapp.WebAppClassLoader.loadClass(WebAppClassLoader.java:375)
at org.mortbay.jetty.webapp.WebAppClassLoader.loadClass(WebAppClassLoader.java:337)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:247)
at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:257)
at org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:278)
at org.apache.solr.util.plugin.AbstractPluginLoader.create(AbstractPluginLoader.java:83)
at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:140)
at org.apache.solr.schema.IndexSchema.readAnalyzer(IndexSchema.java:781)
at org.apache.solr.schema.IndexSchema.access$100(IndexSchema.java:56)
at org.apache.solr.schema.IndexSchema$1.create(IndexSchema.java:413)
at org.apache.solr.schema.IndexSchema$1.create(IndexSchema.java:431)
at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:140)
at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:440)
at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:92)
at org.apache.solr.core.SolrCore.<init>(SolrCore.java:412)
at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:119)
at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:69)
at org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:99)
at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
at org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:594)
at org.mortbay.jetty.servlet.Context.startContext(Context.java:139)
at org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1218)
at org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:500)
at org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:448)
at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:147)
at org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:161)
at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:147)
at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
at org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:117)
at org.mortbay.jetty.Server.doStart(Server.java:210)
at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
at org.mortbay.xml.XmlConfiguration.main(XmlConfiguration.java:929)
at
Re: Highlighting in SolrJ?
Will do Shalin. -Jay http://www.lucidimagination.com On Fri, Sep 11, 2009 at 9:23 PM, Shalin Shekhar Mangar shalinman...@gmail.com wrote: Jay, it would be great if you can add this example to the Solrj wiki: http://wiki.apache.org/solr/Solrj On Fri, Sep 11, 2009 at 5:15 AM, Jay Hill jayallenh...@gmail.com wrote: Set up the query like this to highlight a field named content: SolrQuery query = new SolrQuery(); query.setQuery(foo); query.setHighlight(true).setHighlightSnippets(1); //set other params as needed query.setParam(hl.fl, content); QueryResponse queryResponse =getSolrServer().query(query); Then to get back the highlight results you need something like this: IteratorSolrDocument iter = queryResponse.getResults(); while (iter.hasNext()) { SolrDocument resultDoc = iter.next(); String content = (String) resultDoc.getFieldValue(content)); String id = (String) resultDoc.getFieldValue(id); //id is the uniqueKey field if (queryResponse.getHighlighting().get(id) != null) { ListString highightSnippets = queryResponse.getHighlighting().get(id).get(content); } } Hope that gets you what you need. -Jay http://www.lucidimagination.com On Thu, Sep 10, 2009 at 3:19 PM, Paul Tomblin ptomb...@xcski.com wrote: Can somebody point me to some sample code for using highlighting in SolrJ? I understand the highlighted versions of the field comes in a separate NamedList? How does that work? -- http://www.linkedin.com/in/paultomblin -- Regards, Shalin Shekhar Mangar.
Re: Highlighting in SolrJ?
It's really just a matter of what you're intentions are. There are an awful lot of highlighting params and so highlighting is very flexible and customizable. Regarding snippets, as an example Google presents two snippets in results, which is fairly common. I'd recommend doing a lot of experimenting by changing the params on the query string to get what you want, and then setting them up in SolrJ. The example I sent was intended to be a generic starting point and mostly just to show how to set highlighting params and how to get back a List of highlighting results. -Jay http://www.lucidimagination.com On Thu, Sep 10, 2009 at 5:40 PM, Paul Tomblin ptomb...@xcski.com wrote: If I set snippets to 9 and mergeContinuous to true, will I get the entire contents of the field with all the search terms replaced? I don't see what good it would be just getting one line out of the whole field as a snippet. On Thu, Sep 10, 2009 at 7:45 PM, Jay Hill jayallenh...@gmail.com wrote: Set up the query like this to highlight a field named content: SolrQuery query = new SolrQuery(); query.setQuery(foo); query.setHighlight(true).setHighlightSnippets(1); //set other params as needed query.setParam(hl.fl, content); QueryResponse queryResponse =getSolrServer().query(query); Then to get back the highlight results you need something like this: IteratorSolrDocument iter = queryResponse.getResults(); while (iter.hasNext()) { SolrDocument resultDoc = iter.next(); String content = (String) resultDoc.getFieldValue(content)); String id = (String) resultDoc.getFieldValue(id); //id is the uniqueKey field if (queryResponse.getHighlighting().get(id) != null) { ListString highightSnippets = queryResponse.getHighlighting().get(id).get(content); } } Hope that gets you what you need. -Jay http://www.lucidimagination.com On Thu, Sep 10, 2009 at 3:19 PM, Paul Tomblin ptomb...@xcski.com wrote: Can somebody point me to some sample code for using highlighting in SolrJ? I understand the highlighted versions of the field comes in a separate NamedList? How does that work? -- http://www.linkedin.com/in/paultomblin -- http://www.linkedin.com/in/paultomblin
Re: Highlighting in SolrJ?
Try adding this param: hl.fragsize=3 (obviously set the fragsize to whatever very high number you need for your largest doc.) -Jay On Fri, Sep 11, 2009 at 7:54 AM, Paul Tomblin ptomb...@xcski.com wrote: What I want is the whole text of that field with every instance of the search term high lighted, even if the search term only occurs in the first line of a 300 page field. I'm not sure if mergeContinuous will do that, or if it will miss everything after the last line that contains the search term. On Fri, Sep 11, 2009 at 10:42 AM, Jay Hill jayallenh...@gmail.com wrote: It's really just a matter of what you're intentions are. There are an awful lot of highlighting params and so highlighting is very flexible and customizable. Regarding snippets, as an example Google presents two snippets in results, which is fairly common. I'd recommend doing a lot of experimenting by changing the params on the query string to get what you want, and then setting them up in SolrJ. The example I sent was intended to be a generic starting point and mostly just to show how to set highlighting params and how to get back a List of highlighting results. -Jay http://www.lucidimagination.com On Thu, Sep 10, 2009 at 5:40 PM, Paul Tomblin ptomb...@xcski.com wrote: If I set snippets to 9 and mergeContinuous to true, will I get the entire contents of the field with all the search terms replaced? I don't see what good it would be just getting one line out of the whole field as a snippet. On Thu, Sep 10, 2009 at 7:45 PM, Jay Hill jayallenh...@gmail.com wrote: Set up the query like this to highlight a field named content: SolrQuery query = new SolrQuery(); query.setQuery(foo); query.setHighlight(true).setHighlightSnippets(1); //set other params as needed query.setParam(hl.fl, content); QueryResponse queryResponse =getSolrServer().query(query); Then to get back the highlight results you need something like this: IteratorSolrDocument iter = queryResponse.getResults(); while (iter.hasNext()) { SolrDocument resultDoc = iter.next(); String content = (String) resultDoc.getFieldValue(content)); String id = (String) resultDoc.getFieldValue(id); //id is the uniqueKey field if (queryResponse.getHighlighting().get(id) != null) { ListString highightSnippets = queryResponse.getHighlighting().get(id).get(content); } } Hope that gets you what you need. -Jay http://www.lucidimagination.com On Thu, Sep 10, 2009 at 3:19 PM, Paul Tomblin ptomb...@xcski.com wrote: Can somebody point me to some sample code for using highlighting in SolrJ? I understand the highlighted versions of the field comes in a separate NamedList? How does that work? -- http://www.linkedin.com/in/paultomblin -- http://www.linkedin.com/in/paultomblin -- http://www.linkedin.com/in/paultomblin
Re: standard requestHandler components
RequestHandlers are configured in solrconfig.xml. If no components are explicitly declared in the request handler config then the defaults are used. They are:
- QueryComponent
- FacetComponent
- MoreLikeThisComponent
- HighlightComponent
- StatsComponent
- DebugComponent
If you wanted to have a custom list of components (either omitting defaults or adding custom) you can specify the components for a handler directly:

<arr name="components">
  <str>query</str>
  <str>facet</str>
  <str>mlt</str>
  <str>highlight</str>
  <str>debug</str>
  <str>someothercomponent</str>
</arr>

You can add components before or after the main ones like this:

<arr name="first-components">
  <str>mycomponent</str>
</arr>
<arr name="last-components">
  <str>myothercomponent</str>
</arr>

and that's how the spell check component can be added:

<arr name="last-components">
  <str>spellcheck</str>
</arr>

Note that a component (except the defaults) must be configured in solrconfig.xml with the name used in the <str> element as well. Have a look at the solrconfig.xml in the example directory (.../example/solr/conf/) for examples of how to set up the spellcheck component, and of how the request handlers are configured. -Jay http://www.lucidimagination.com On Fri, Sep 11, 2009 at 3:04 PM, michael8 mich...@saracatech.com wrote: Hi, I have a newbie question about the 'standard' requestHandler in solrconfig.xml. What I'd like to know is where the config information for this requestHandler is kept. When I go to http://localhost:8983/solr/admin, I see the following info, but am curious where the supposedly 'chained' components (e.g. QueryComponent, FacetComponent, MoreLikeThisComponent) are configured for this requestHandler. I see timing and process debug output from these components with debugQuery=true, so somewhere these components must have been configured for this 'standard' requestHandler. name: standard class: org.apache.solr.handler.component.SearchHandler version: $Revision: 686274 $ description: Search using components: org.apache.solr.handler.component.QueryComponent,org.apache.solr.handler.component.FacetComponent,org.apache.solr.handler.component.MoreLikeThisComponent,org.apache.solr.handler.component.HighlightComponent,org.apache.solr.handler.component.DebugComponent, stats: handlerStart: 1252703405335 requests: 3 errors: 0 timeouts: 0 totalTime: 201 avgTimePerRequest: 67.0 avgRequestsPerSecond: 0.015179728 What I'd like to do from understanding this is to properly integrate the spellcheck component into the standard requestHandler as suggested in a solr spellcheck example. Thanks for any info in advance. Michael -- View this message in context: http://www.nabble.com/%22standard%22-requestHandler-components-tp25409075p25409075.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Pagination with solr json data
All you have to do is use the start and rows parameters to get the results you want. For example, the query for the first page of results might look like this: ?q=solr&start=0&rows=10 (other params omitted). So you'll start at the beginning (0) and get 10 results. The next page would be ?q=solr&start=10&rows=10 - start at the 10th result and display the next 10 rows. Then ?q=solr&start=20&rows=10, and so on. -Jay http://www.lucidimagination.com On Wed, Sep 9, 2009 at 12:24 PM, Elaine Li elaine.bing...@gmail.com wrote: Hi, What is the best way to do pagination? I searched around and only found some YUI utilities that can do this. But their examples don't have a very close match to the pattern I have in mind. I would like to have a pretty plain display, something like the search results from Google. Thanks. Elaine
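The same paging in SolrJ is just two setters; a minimal sketch, assuming server is an already-initialized SolrServer:

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.response.QueryResponse;

  SolrQuery query = new SolrQuery("solr");
  query.setStart(10);  // offset of the first result (second page)
  query.setRows(10);   // page size
  QueryResponse rsp = server.query(query);
  long total = rsp.getResults().getNumFound();  // total hits, for computing the page count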
Re: TermsComponent
If you need an alternative to using the TermsComponent for auto-suggest, have a look at this blog on using EdgeNGrams instead of the TermsComponent. http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/ -Jay http://www.lucidimagination.com On Wed, Sep 9, 2009 at 3:35 PM, Todd Benge todd.be...@gmail.com wrote: We're using the StandardAnalyzer but I'm fairly certain that's not the issue. In fact, there doesn't appear to be any issue with Lucene or Solr. There are many instances of data in which users have removed the whitespace, so they have a high frequency, which means they bubble to the top of the sort. The result is that a search for a name shows a first and last name without the whitespace. One thing I've noticed is that since TermsComponent is working on a single Term, there doesn't seem to be a way to query against a phrase. The same example as above applies, so if you're querying for a name it'd be preferred to get multi-term responses back if a first name matches. Any suggestions? Thanks for all the help. It's much appreciated. Todd On Wed, Sep 9, 2009 at 12:11 PM, Grant Ingersoll gsing...@apache.org wrote: And what Analyzer are you using? I'm guessing that your words are being split up during analysis, which is why you aren't seeing whitespace. If you want to keep the whitespace, you will need to use the String field type or possibly the Keyword Analyzer. -Grant On Sep 9, 2009, at 11:06 AM, Todd Benge wrote: It's set as Field.Store.YES, Field.Index.ANALYZED. On Wed, Sep 9, 2009 at 8:15 AM, Grant Ingersoll gsing...@apache.org wrote: How are you tokenizing/analyzing the field you are accessing? On Sep 9, 2009, at 8:49 AM, Todd Benge wrote: Hi Rekha, Here's the link to the TermsComponent info: http://wiki.apache.org/solr/TermsComponent and another link Matt Weber did on autocompletion: http://www.mattweber.org/2009/05/02/solr-autosuggest-with-termscomponent-and-jquery/ We had to upgrade to the latest nightly to get the TermsComponent to work. Good Luck! Todd On Wed, Sep 9, 2009 at 5:17 AM, dharhsana rekha.dharsh...@gmail.com wrote: Hi, I have a requirement for autocompletion search; I am using Solr 1.4. Could you please tell me how you worked on that TermsComponent using Solr 1.4? I couldn't find the terms component in the Solr 1.4 which I have downloaded; is there any other configuration that should be done? Do you have code for autocompletion? Please share it with me. Regards Rekha tbenge wrote: Hi, I was looking at TermsComponent in Solr 1.4 as a way of building an autocomplete function. I have a prototype working but noticed that terms that have whitespace in them when indexed are absent the whitespace when returned from the TermsComponent. Any ideas on why that may be happening? Am I just missing a configuration option? Thanks, Todd -- View this message in context: http://www.nabble.com/TermsComponent-tp25302503p25362829.html Sent from the Solr - User mailing list archive at Nabble.com. -- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene: http://www.lucidimagination.com/search -- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene: http://www.lucidimagination.com/search
Re: Highlighting in SolrJ?
Set up the query like this to highlight a field named content:

SolrQuery query = new SolrQuery();
query.setQuery("foo");
query.setHighlight(true).setHighlightSnippets(1); //set other params as needed
query.setParam("hl.fl", "content");
QueryResponse queryResponse = getSolrServer().query(query);

Then to get back the highlight results you need something like this:

Iterator<SolrDocument> iter = queryResponse.getResults().iterator();
while (iter.hasNext()) {
  SolrDocument resultDoc = iter.next();
  String content = (String) resultDoc.getFieldValue("content");
  String id = (String) resultDoc.getFieldValue("id"); //id is the uniqueKey field
  if (queryResponse.getHighlighting().get(id) != null) {
    List<String> highlightSnippets = queryResponse.getHighlighting().get(id).get("content");
  }
}

Hope that gets you what you need. -Jay http://www.lucidimagination.com On Thu, Sep 10, 2009 at 3:19 PM, Paul Tomblin ptomb...@xcski.com wrote: Can somebody point me to some sample code for using highlighting in SolrJ? I understand the highlighted versions of the field come in a separate NamedList? How does that work? -- http://www.linkedin.com/in/paultomblin
Re: Sort a Multivalue field
Unfortunately you can't sort on a multi-valued field. In order to sort on a field it must be indexed but not multi-valued. Have a look at the FieldOptions wiki page for a good description of what values to set for different use cases: http://wiki.apache.org/solr/FieldOptionsByUseCase -Jay www.lucidimagination.com On Wed, Sep 9, 2009 at 2:37 AM, Jörg Agatz joerg.ag...@googlemail.com wrote: Hello friends, I have a problem... my search engine server has been running for many weeks... Now I get new XML, and one of the fields is multivalued. OK, I changed the schema.xml, set it to multivalued, and it works :-) no error on indexing. Now I go to the GUI, and want to sort this field, and BAM, I can't sort: it is impossible to sort a tokenized field. Then I thought, OK, I'll do it in a copyField and sort the copyField... and voila, I don't get an error, but it doesn't really sort; I get an output, but no change with desc or asc. What can I do to sort this field? I think, when I sort this field (only numbers) the value comes multiple times in the output, like this... xml: <aaa>1122</aaa> <aaa>2211</aaa> <aaa>3322</aaa> sort field aaa *1122* 1134 1145 *2211* 2233 3355 3311 3312 *3322* ... ... ... I hope you have an idea; I am at the end of my ideas. KingArtus
Re: Field names with whitespaces
This seems to work: ?q=field\ name:something Probably not a good idea to have field names with whitespace though. -Jay 2009/8/28 Marcin Kuptel marcinkup...@gmail.com Hi, Is there a way to query solr about fields which names contain whitespaces? Indexing such data does not cause any problems but I have been unable to retrieve it. Regards, Marcin Kuptel
Re: MoreLikeThis: How to get quality terms from html from content stream?
Solr Cell definitely sounds like it has a place here. But wouldn't it be needed as an extracting component earlier in the process for the MoreLikeThisHandler? The MLT Handler works great when it's directed to a content stream of plain text. If we could just use Solr Cell to identify the file type and do the content extraction earlier in the stream, that would do the trick I think. Then whether the URL pointed to HTML, a PDF, or whatever, MLT would be receiving a stream of extracted content. -Jay On Sun, Aug 9, 2009 at 7:17 AM, Grant Ingersoll gsing...@apache.org wrote: It's starting to sound like Solr Cell needs a SearchComponent as well, one that can come before the QueryComponent and can be used to map into the other components. Essentially, take the functionality of the extractOnly option and have it feed the other SearchComponents. On Aug 8, 2009, at 10:42 AM, Ken Krugler wrote: On Aug 7, 2009, at 5:23pm, Jay Hill wrote: I'm using the MoreLikeThisHandler with a content stream to get documents from my index that match content from an html page like this: http://localhost:8080/solr/mlt?stream.url=http://www.sfgate.com/cgi-bin/article.cgi?f=/c/a/2009/08/06/SP5R194Q13.DTL&mlt.fl=body&rows=4&debugQuery=true But, not surprisingly, the query generated is meaningless because a lot of the markup is picked out as terms: <str name="parsedquery_toString">body:li body:href body:div body:class body:a body:script body:type body:js body:ul body:text body:javascript body:style body:css body:h body:img body:var body:articl body:ad body:http body:span body:prop</str> Does anyone know a way to transform the html so that the content can be parsed out of the content stream and processed w/o the markup? Or do I need to write my own HTMLParsingMoreLikeThisHandler? You'd want to parse the HTML to extract only text first, and use that for your index data. Both the Nutch and Tika OSS projects have examples of using HTML parsers (based on TagSoup or CyberNeko) to generate content suitable for indexing. -- Ken If I parse the content out to a plain text file and point the stream.url param to file:///parsedfile.txt it works great. -Jay -- Ken Krugler TransPac Software, Inc. http://www.transpac.com +1 530-210-6378 -- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene: http://www.lucidimagination.com/search
MoreLikeThis: How to get quality terms from html from content stream?
I'm using the MoreLikeThisHandler with a content stream to get documents from my index that match content from an html page like this: http://localhost:8080/solr/mlt?stream.url=http://www.sfgate.com/cgi-bin/article.cgi?f=/c/a/2009/08/06/SP5R194Q13.DTL&mlt.fl=body&rows=4&debugQuery=true But, not surprisingly, the query generated is meaningless because a lot of the markup is picked out as terms:

  <str name="parsedquery_toString">body:li body:href body:div body:class body:a body:script body:type body:js body:ul body:text body:javascript body:style body:css body:h body:img body:var body:articl body:ad body:http body:span body:prop</str>

Does anyone know a way to transform the html so that the content can be parsed out of the content stream and processed w/o the markup? Or do I need to write my own HTMLParsingMoreLikeThisHandler? If I parse the content out to a plain text file and point the stream.url param to file:///parsedfile.txt it works great. -Jay
Re: DIH: Any way to make update on db table?
Excellent, thanks Avlesh and Noble. -Jay On Mon, Aug 3, 2009 at 9:28 PM, Avlesh Singh avl...@gmail.com wrote: datasource.getData("update mytable"); // though the name is getData() it can execute update commands also. Even when the dataSource is readOnly, Noble? Cheers Avlesh 2009/8/4 Noble Paul നോബിള്‍ नोब्ळ् noble.p...@corp.aol.com wrote: If you are writing a Transformer (or any other component) you can get hold of a dataSource instance: datasource = Context#getDataSource(name); // then you can invoke datasource.getData("update mytable"); // though the name is getData() it can execute update commands also. Ensure that you do a datasource.close() after you are done. On Tue, Aug 4, 2009 at 9:40 AM, Avlesh Singh avl...@gmail.com wrote: Couple of things - 1. Your dataSource is probably in readOnly mode. It is possible to fire updates by specifying readOnly="false" in your dataSource. 2. What you are trying to achieve is typically done using a select for update. For MySQL, here's the documentation - http://dev.mysql.com/doc/refman/5.0/en/innodb-locking-reads.html 3. You don't need to create a separate entity for firing updates. Writing a database procedure might be a good idea. In that case your query will simply be <entity name="mainEntity" query="call MyProcedure()" ... />. All the heavy lifting can be done by this query. Moreover, update queries only return the number of rows affected and not a resultSet; DIH expects one, and hence the exception. Cheers Avlesh On Tue, Aug 4, 2009 at 1:49 AM, Jay Hill jayallenh...@gmail.com wrote: Is it possible for the DataImportHandler to update records in the table it is querying? For example, say I have a query like this in my entity: query="select field1, field2 from someTable where hasBeenIndexed=false" Is there a way I can mark each record processed by updating the hasBeenIndexed field? Here's a config I tried:

<?xml version="1.0"?>
<dataConfig>
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost:3306/solrhacks" user="user" password="pass"/>
  <document name="testingDIHupdate">
    <entity name="mainEntity" pk="id" query="select id, name from tableToIndex where hasBeenIndexed=0">
      <field column="id" template="dihTestUpdate-${main.id}"/>
      <field column="name" name="name"/>
      <entity name="updateEntity" pk="id" query="update tableToIndex set hasBeenIndexed=1 where id=${mainEntity.id}">
      </entity>
    </entity>
  </document>
</dataConfig>

It does update the first record, but then an Exception is thrown:

Aug 3, 2009 1:15:24 PM org.apache.solr.handler.dataimport.DocBuilder buildDocument
SEVERE: Exception while processing: mainEntity document : SolrInputDocument[{id=id(1.0)={1}, name=name(1.0)={John Jones}}]
org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to execute query: update tableToIndex set hasBeenIndexed=1 where id=1 Processing Document # 1
  at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.init(JdbcDataSource.java:250)
  at org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:207)
  at org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:40)
  at org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:58)
  at org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:71)
  at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:237)
  at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:344)
  at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:370)
  at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:225)
  at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:167)
  at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:333)
  at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:393)
  at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:372)
Caused by: java.lang.NullPointerException
  at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.init(JdbcDataSource.java:248)
  ... 12 more

-Jay -- - Noble Paul | Principal Engineer| AOL | http://aol.com
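Following Avlesh's advice above, the smallest change that makes the update legal is marking the dataSource writable and pushing the bookkeeping into a procedure; a hedged sketch (the procedure name is hypothetical):

  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost:3306/solrhacks" user="user" password="pass"
              readOnly="false"/>
  <entity name="mainEntity" pk="id" query="call MarkAndFetchUnindexed()">
    <field column="name" name="name"/>
  </entity>

The procedure can set hasBeenIndexed=1 and return the selected rows in one round trip, so DIH always gets the result set it expects.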
DIH: Any way to make update on db table?
Is it possible for the DataImportHandler to update records in the table it is querying? For example, say I have a query like this in my entity: query="select field1, field2 from someTable where hasBeenIndexed=false" Is there a way I can mark each record processed by updating the hasBeenIndexed field? Here's a config I tried:

<?xml version="1.0"?>
<dataConfig>
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost:3306/solrhacks" user="user" password="pass"/>
  <document name="testingDIHupdate">
    <entity name="mainEntity" pk="id" query="select id, name from tableToIndex where hasBeenIndexed=0">
      <field column="id" template="dihTestUpdate-${main.id}"/>
      <field column="name" name="name"/>
      <entity name="updateEntity" pk="id" query="update tableToIndex set hasBeenIndexed=1 where id=${mainEntity.id}">
      </entity>
    </entity>
  </document>
</dataConfig>

It does update the first record, but then an Exception is thrown:

Aug 3, 2009 1:15:24 PM org.apache.solr.handler.dataimport.DocBuilder buildDocument
SEVERE: Exception while processing: mainEntity document : SolrInputDocument[{id=id(1.0)={1}, name=name(1.0)={John Jones}}]
org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to execute query: update tableToIndex set hasBeenIndexed=1 where id=1 Processing Document # 1
  at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.init(JdbcDataSource.java:250)
  at org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:207)
  at org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:40)
  at org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:58)
  at org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:71)
  at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:237)
  at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:344)
  at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:370)
  at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:225)
  at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:167)
  at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:333)
  at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:393)
  at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:372)
Caused by: java.lang.NullPointerException
  at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.init(JdbcDataSource.java:248)
  ... 12 more

-Jay
Re: How can i get lucene index format version information?
Check the system request handler: http://localhost:8983/solr/admin/system Should look something like this:

  <lst name="lucene">
    <str name="solr-spec-version">1.3.0.2009.07.28.10.39.42</str>
    <str name="solr-impl-version">1.4-dev 797693M - jayhill - 2009-07-28 10:39:42</str>
    <str name="lucene-spec-version">2.9-dev</str>
    <str name="lucene-impl-version">2.9-dev 794238 - 2009-07-15 18:05:08</str>
  </lst>

-Jay On Thu, Jul 30, 2009 at 10:32 AM, Walter Underwood wun...@wunderwood.org wrote: I think the properties page in the admin UI lists the Lucene version, but I don't have a live server to check that on at this instant. wunder On Jul 30, 2009, at 10:26 AM, Chris Hostetter wrote: : i want to get the lucene index format version from solr web app (as : the Luke request handler writes it out: : : indexInfo.add(version, reader.getVersion()); that's the index version (as in: I have added docs to the index, so the version number has changed); the question is about the format version (as in: I have upgraded Lucene from 2.1 to 2.3, so the index format has changed). I'm not sure how Luke gets that ... it's not exposed via a public API on an IndexReader. Hmm... SegmentInfos.readCurrentVersion(Directory) seems like it would do the trick, but I'm not sure how that would interact with customized IndexReader implementations. I suppose we could always make it non-fatal if it throws an exception (just print the exception message in place of the number). Anybody want to submit a patch to add this to the LukeRequestHandler? -Hoss
FieldCollapsing: Two response elements returned?
I'm doing some testing with field collapsing, and early results look good. One thing seems odd to me however. I would expect to get back one block of results, but I get two - the first one contains the collapsed results, the second one contains the full non-collapsed results:

  <result name="response" numFound="11" start="0"> ... </result>
  <result name="response" numFound="62" start="0"> ... </result>

This seems somewhat confusing. Is this intended or is this a bug? Thanks, -Jay
DIH: On import (full or delta) commit=false seems to not take effect
I am trying to run full and delta imports with the commit=false option, but it doesn't seem to take effect - after the import a commit always happens no matter what params I send. I've looked at the source and unless I'm missing something it doesn't seem to process the commit param. Here's the url I'm using: curl 'http://localhost:8080/solr/indexer/books?command=full-import&commit=false' But as soon as the import finishes a commit occurs. I want to set things up to let autoCommit control all commits as I have a series of DIH-configs importing data at different times. I will file an issue in JIRA, but I wanted to check the list first to see if this has come up for others. -Jay
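For context, the autoCommit the poster wants to rely on lives in solrconfig.xml; a sketch of the relevant block (the threshold values here are illustrative only):

  <updateHandler class="solr.DirectUpdateHandler2">
    <autoCommit>
      <maxDocs>5000</maxDocs>
      <maxTime>30000</maxTime>
    </autoCommit>
  </updateHandler>

With commit=false honored, imports from several DIH configs would then become visible only when these thresholds fire.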
Re: spellcheck with misspelled words in index
We had the same thing to deal with recently, and a great solution was posted to the list. Create a stopwords filter on the field you're using for your spell checking, and then populate a custom stopwords file with known misspelled words:

  <fieldType name="textSpell" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="misspelled_words.txt"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
  </fieldType>

Your spell field would look like this:

  <field name="spell" type="textSpell" indexed="true" stored="true" multiValued="true"/>

Then add words like cusine to misspelled_words.txt. -Jay On Tue, Jul 14, 2009 at 11:40 PM, Chris Williams cswilli...@gmail.com wrote: Hi, I'm having some trouble getting the correct results from the spellcheck component. I'd like to use it to suggest correct product titles on our site; however, some of our products have misspellings in them outside of our control. For example, there are 2 products with the misspelled word cusine (and 25k with the correct spelling cuisine). So if someone searches for the word cusine on our site, I would like to show the 2 misspelled products, and a suggestion with Did you mean cuisine?. However, I can't seem to ever get any spelling suggestions when I search by the word cusine, and correctlySpelled is always true. Misspelled words that don't appear in the index work fine. I noticed that setting onlyMorePopular to true will return suggestions for the misspelled word, but I've found that it doesn't work great for other words and produces suggestions too often for correctly spelled words. I had incorrectly thought that by setting thresholdTokenFrequency higher on my spelling dictionary these misspellings would not appear in my spelling index and thus I would get suggestions for them, but as I see now, the spellcheck doesn't quite work like that. Is there any way to somehow get spelling suggestions to work for these misspellings in my index if they have a low frequency? Thanks in advance, Chris
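The exclusion file itself is just the standard Solr stopwords format, one term per line; a small hypothetical example:

  # misspelled_words.txt - known-bad terms to keep out of the spelling index
  cusine
  beginnning

Because the filter runs at index time, the spelling index has to be rebuilt (e.g. via buildOnCommit) after the file changes before new exclusions take effect.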
Re: DIH: On import (full or delta) commit=false seems to not take effect
My bad, I had a configuration setting overriding this value. Sorry for the mistake. -Jay On Wed, Jul 15, 2009 at 12:07 PM, Jay Hill jayallenh...@gmail.com wrote: I am trying to run full and delta imports with the commit=false option, but it doesn't seem to take effect - after the import a commit always happens no matter what params I send. I've looked at the source and unless I'm missing something it doesn't seem to process the commit param. Here's the url I'm using: curl 'http://localhost:8080/solr/indexer/books?command=full-import&commit=false' But as soon as the import finishes a commit occurs. I want to set things up to let autoCommit control all commits as I have a series of DIH-configs importing data at different times. I will file an issue in JIRA, but I wanted to check the list first to see if this has come up for others. -Jay
Re: DIH: On import (full or delta) commit=false seems to not take effect
Actually, my good after all. The parameter does not take effect: if commit=false is passed in, a commit still happens. Will open a JIRA issue and supply a patch shortly. -Jay On Wed, Jul 15, 2009 at 5:50 PM, Jay Hill jayallenh...@gmail.com wrote: My bad, I had a configuration setting overriding this value. Sorry for the mistake. -Jay On Wed, Jul 15, 2009 at 12:07 PM, Jay Hill jayallenh...@gmail.com wrote: I am trying to run full and delta imports with the commit=false option, but it doesn't seem to take effect - after the import a commit always happens no matter what params I send. I've looked at the source and unless I'm missing something it doesn't seem to process the commit param. Here's the url I'm using: curl 'http://localhost:8080/solr/indexer/books?command=full-import&commit=false' But as soon as the import finishes a commit occurs. I want to set things up to let autoCommit control all commits as I have a series of DIH-configs importing data at different times. I will file an issue in JIRA, but I wanted to check the list first to see if this has come up for others. -Jay
Spell checking: Is there a way to exclude words known to be wrong?
We're building a spell index from a field in our main index with the following configuration:

  <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
    <str name="queryAnalyzerFieldType">textSpell</str>
    <lst name="spellchecker">
      <str name="name">default</str>
      <str name="field">spell</str>
      <str name="spellcheckIndexDir">./spellchecker</str>
      <str name="buildOnCommit">true</str>
    </lst>
  </searchComponent>

This works great and re-builds the spelling index on commits as expected. However, we know there are misspellings in the spell field of our main index. We could remove these from the spelling index using Luke; however, they will be added again on commits. What we need is something similar to how the protwords.txt file is used, so that when we notice misspelled words such as beginnning being pulled from our main index we could add them to an exclusion file and they would not be added to the spelling index again. Any tricks to make this possible? -Jay
Re: Creating DataSource for DIH to Oracle Database
Francis, your question is a little vague. Are you looking for the configuration for connecting the DIH to a JNDI datasource set up in Weblogic?

  <dataSource name="dsDb" jndiName="java:comp/env/jdbc/myWeblogicDatasource" type="JdbcDataSource" user=""/>

-Jay On Mon, Jul 6, 2009 at 2:41 PM, Francis Yakin fya...@liquid.com wrote: Has anyone had experience creating a datasource for DIH to an Oracle Database? Also, on the Solr side we are running Weblogic and deploy the application using Weblogic. I know in Weblogic we can create a datasource that can connect to an Oracle database; has anyone had experience with this? Thanks Francis
Re: about defaultSearchField
Just to be sure: You mentioned that you adjusted schema.xml - did you re-index after making your changes? -Jay On Wed, Jul 8, 2009 at 7:07 AM, Yang Lin beckl...@gmail.com wrote: Thanks for your reply. But it doesn't work. Yang 2009/7/8 Yao Ge yao...@gmail.com wrote: Try with fl=* or fl=*,score added to your request string. -Yao Yang Lin-2 wrote: Hi, I have some problems. For my Solr program, I want to type only the query string and get results from every field that includes the query string. But right now I can't get any result without specifying a field. For example, querying tina gets nothing, but Sentence:tina works. I have adjusted schema.xml like this:

  <fields>
    <field name="CategoryNamePolarity" type="text" indexed="true" stored="true" multiValued="true"/>
    <field name="CategoryNameStrenth" type="text" indexed="true" stored="true" multiValued="true"/>
    <field name="CategoryNameSubjectivity" type="text" indexed="true" stored="true" multiValued="true"/>
    <field name="Sentence" type="text" indexed="true" stored="true" multiValued="true"/>
    <field name="allText" type="text" indexed="true" stored="true" multiValued="true"/>
  </fields>
  <uniqueKey required="false">Sentence</uniqueKey>
  <!-- field for the QueryParser to use when an explicit fieldname is absent -->
  <defaultSearchField>allText</defaultSearchField>
  <!-- SolrQueryParser configuration: defaultOperator="AND|OR" -->
  <solrQueryParser defaultOperator="OR"/>
  <copyfield source="CategoryNamePolarity" dest="allText"/>
  <copyfield source="CategoryNameStrenth" dest="allText"/>
  <copyfield source="CategoryNameSubjectivity" dest="allText"/>
  <copyfield source="Sentence" dest="allText"/>

I think the problem is in defaultSearchField, but I don't know how to fix it. Could anyone help me? Thanks Yang
Re: Indexing rich documents from websites using ExtractingRequestHandler
I haven't tried this myself, but it sounds like what you're looking for is enabling remote streaming: http://wiki.apache.org/solr/ContentStream#head-7179a128a2fdd5dde6b1af553ed41735402aadbf As the link above shows, you should be able to enable remote streaming like this:

  <requestParsers enableRemoteStreaming="true" multipartUploadLimitInKB="2048" />

and then something like this might work: stream.url=http://www.sub.myDomain.com/files/pdfdocs/testfile.pdf So you use stream.url instead of stream.file. Hope this helps. -Jay On Wed, Jul 8, 2009 at 7:40 AM, ahammad ahmed.ham...@gmail.com wrote: Hello, I can index rich documents like pdf for instance that are on the filesystem. Can we use ExtractingRequestHandler to index files that are accessible on a website? For example, there is a file that can be reached like so: http://www.sub.myDomain.com/files/pdfdocs/testfile.pdf How would I go about indexing that file? I tried using the following combinations. I will put the errors in brackets: stream.file=http://www.sub.myDomain.com/files/pdfdocs/testfile.pdf (The filename, directory name, or volume label syntax is incorrect) stream.file=www.sub.myDomain.com/files/pdfdocs/testfile.pdf (The system cannot find the path specified) stream.file=//www.sub.myDomain.com/files/pdfdocs/testfile.pdf (The format of the specified network name is invalid) stream.file=sub.myDomain.com/files/pdfdocs/testfile.pdf (The system cannot find the path specified) stream.file=//sub.myDomain.com/files/pdfdocs/testfile.pdf (The network path was not found) I sort of understand why I get those errors. What are the alternative methods of doing this? I am guessing that the stream.file attribute doesn't support web addresses. Is there another attribute that does?
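Putting the two pieces together, a remote rich document could then be sent through Solr Cell in one request; a hedged sketch (the handler path and the literal.id value are assumptions):

  curl 'http://localhost:8080/solr/update/extract?stream.url=http://www.sub.myDomain.com/files/pdfdocs/testfile.pdf&literal.id=doc1&commit=true'

This only works once enableRemoteStreaming="true" is set, and it lets Solr fetch the URL itself rather than requiring a local path.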
Re: Indexing XML
Mathieu, have a look at Solr's DataImportHandler. It provides a configuration-based approach to index different types of datasources including relational databases and XML files. In particular have a look at the XPathEntityProcessor ( http://wiki.apache.org/solr/DataImportHandler#head-f1502b1ed71d98ef0120671db5762e137e63f9d2 ), which allows you to use xpath syntax to map xml data to index fields. -Jay On Tue, Jul 7, 2009 at 7:25 AM, Saeli Mathieu saeli.math...@gmail.com wrote: Hello. I'm a new user of Solr. I already used Lucene to index files and search, but my program was too slow, which is why I was looking for another solution, and I thought I had found it. I say I thought because I don't know if it's possible to use Solr with this kind of XML file:

  <lom xsi:schemaLocation="http://ltsc.ieee.org/xsd/lomv1.0 http://ltsc.ieee.org/xsd/lomv1.0/lom.xsd">
    <general>
      <identifier>
        <catalog>STRING HERE</catalog>
        <entry>STRING HERE</entry>
      </identifier>
      <title>
        <string language="fr">STRING HERE</string>
      </title>
      <language>fr</language>
      <description>
        <string language="fr">STRING HERE</string>
      </description>
    </general>
    <lifeCycle>
      <status>
        <source>STRING HERE</source>
        <value>STRING HERE</value>
      </status>
      <contribute>
        <role>
          <source>STRING HERE</source>
          <value>STRING HERE</value>
        </role>
        <entity>STRING HERE</entity>
      </contribute>
    </lifeCycle>
    <metaMetadata>
      <identifier>
        <catalog>STRING HERE</catalog>
        <entry>STRING HERE</entry>
      </identifier>
      <contribute>
        <role>
          <source>STRING HERE</source>
          <value>STRING HERE</value>
        </role>
        <entity>STRING HERE</entity>
        <date>
          <dateTime>STRING HERE</dateTime>
        </date>
      </contribute>
      <contribute>
        <role>
          <source>STRING HERE</source>
          <value>STRING HERE</value>
        </role>
        <entity>STRING HERE</entity>
        <entity>STRING HERE</entity>
        <entity>STRING HERE</entity>
        <date>
          <dateTime>STRING HERE</dateTime>
        </date>
      </contribute>
      <metadataSchema>STRING HERE</metadataSchema>
      <language>STRING HERE</language>
    </metaMetadata>
    <technical>
      <location>STRING HERE</location>
    </technical>
    <educational>
      <intendedEndUserRole>
        <source>STRING HERE</source>
        <value>STRING HERE</value>
      </intendedEndUserRole>
      <context>
        <source>STRING HERE</source>
        <value>STRING HERE</value>
      </context>
      <typicalAgeRange>
        <string language="fr">STRING HERE</string>
      </typicalAgeRange>
      <description>
        <string language="fr">STRING HERE</string>
      </description>
      <description>
        <string language="fr">STRING HERE</string>
      </description>
      <language>STRING HERE</language>
    </educational>
    <annotation>
      <entity>STRING HERE</entity>
      <date>
        <dateTime>STRING HERE</dateTime>
      </date>
    </annotation>
    <classification>
      <purpose>
        <source>STRING HERE</source>
        <value>STRING HERE</value>
      </purpose>
    </classification>
    <classification>
      <purpose>
        <source>STRING HERE</source>
        <value>STRING HERE</value>
      </purpose>
      <taxonPath>
        <source>
          <string language="fr">STRING HERE</string>
        </source>
        <taxon>
          <id>STRING HERE</id>
          <entry>
            <string language="fr">STRING HERE</string>
          </entry>
        </taxon>
      </taxonPath>
    </classification>
    <classification>
      <purpose>
        <source>STRING HERE</source>
        <value>STRING HERE</value>
      </purpose>
      <taxonPath>
        <source>
          <string language="fr">STRING HERE</string>
        </source>
        <taxon>
          <id>STRING HERE</id>
          <entry>
            <string language="fr">STRING HERE</string>
          </entry>
        </taxon>
      </taxonPath>
      <taxonPath>
        <source>
          <string language="fr">STRING HERE</string>
        </source>
        <taxon>
          <id>STRING HERE</id>
          <entry>
            <string language="fr">STRING HERE</string>
          </entry>
        </taxon>
      </taxonPath>
    </classification>
  </lom>

I don't know how I can use this kind of file with Solr, because the XML examples look like this one:
<add>
  <doc>
    <field name="id">SOLR1000</field>
    <field name="name">Solr, the Enterprise Search Server</field>
    <field name="manu">Apache Software Foundation</field>
    <field name="cat">software</field>
    <field name="cat">search</field>
    <field name="features">Advanced Full-Text Search Capabilities using Lucene</field>
    <field name="features">Optimized for High Volume Web Traffic</field>
    <field name="features">Standards Based Open Interfaces - XML and HTTP</field>
    <field name="features">Comprehensive HTML Administration Interfaces</field>
    <field name="features">Scalability - Efficient Replication to other Solr Search Servers</field>
    <field name="features">Flexible and Adaptable with XML configuration and Schema</field>
    <field name="features">Good unicode support: h&#xE9;llo (hello with an accent over the e)</field>
    <field name="price">0</field>
    <field name="popularity">10</field>
    <field name="inStock">true</field>
    <field name="incubationdate_dt">2006-01-17T00:00:00.000Z</field>
  </doc>
</add>

I understood that Solr needs this kind of architecture; by architecture I mean <field name="keyword">Value</field>. As you can see, I can't use this kind of architecture because I'm not allowed to change my XML files. I'm looking forward to reading you. Mathieu Saeli -- Saeli Mathieu.
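To make the reply concrete: the DIH can read the LOM file directly, with no conversion to the add/doc format. A hedged sketch of a data-config.xml (the file path and field names are assumptions, and only a few of the LOM elements are mapped):

  <dataConfig>
    <dataSource type="FileDataSource" encoding="UTF-8"/>
    <document>
      <entity name="lom" processor="XPathEntityProcessor"
              url="/path/to/lomfile.xml" forEach="/lom">
        <field column="catalog" xpath="/lom/general/identifier/catalog"/>
        <field column="title" xpath="/lom/general/title/string"/>
        <field column="description" xpath="/lom/general/description/string"/>
      </entity>
    </document>
  </dataConfig>

Keep in mind the caveats discussed in the DIH threads below: only a subset of XPath is supported, and repeated elements need multiValued fields or flatten="true".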
Re: DIH: Limited xpath syntax unable to parse all xml elements
Thanks Noble, I gave those examples a try. If I use <field column="body" xpath="/book/body/chapter/p" /> I only get the text from the last p element, not from all elements. If I use <field column="body" xpath="/book/body/chapter" flatten="true"/> or <field column="body" xpath="/book/body/chapter/" flatten="true"/> I don't get back anything for the body column. So the first example is close, but it only gets the text for the last p element. If I could get all p elements at the same level that would be what I need. The double-slash (/book/body/chapter//p) doesn't seem to be supported. Thanks, -Jay 2009/7/1 Noble Paul നോബിള്‍ नोब्ळ् noble.p...@corp.aol.com wrote: complete xpath is not supported. /book/body/chapter/p should work. If you wish all the text under chapter irrespective of nesting and tag names, use this: <field column="body" xpath="/book/body/chapter" flatten="true"/> On Thu, Jul 2, 2009 at 5:31 AM, Jay Hill jayallenh...@gmail.com wrote: I'm using the XPathEntityProcessor to parse an xml structure that looks like this:

  <book>
    <author>Joe Smith</author>
    <title>World Atlas</title>
    <body>
      <chapter>
        <p>Content I want is here</p>
        <p>More content I want is here.</p>
        <p>Still more content here.</p>
      </chapter>
    </body>
  </book>

The author and title parse out fine: <field column="title" xpath="/book/title"/> <field column="author" xpath="/book/author"/> But I can't get at the data inside the p tags. I want to get all non-markup text inside the body tag with something like this: <field column="body" xpath="/book/body/chapter//p"/> but that is not supported. Does anyone know of a way that I can get the content within the p tags without the markup? Thanks, -Jay -- - Noble Paul | Principal Engineer| AOL | http://aol.com
Re: DIH: Limited xpath syntax unable to parse all xml elements
It is not multivalued. The intention is to get all text under the body element into one body field in the index that is not multivalued - essentially everything within the body element minus the markup. Thanks, -Jay On Thu, Jul 2, 2009 at 8:55 AM, Fergus McMenemie fer...@twig.me.uk wrote: Thanks Noble, I gave those examples a try. If I use <field column="body" xpath="/book/body/chapter/p" /> I only get the text from the last p element, not from all elements. Hm, I am sure I have done this. In your schema.xml is the field body multiValued or not? If I use <field column="body" xpath="/book/body/chapter" flatten="true"/> or <field column="body" xpath="/book/body/chapter/" flatten="true"/> I don't get back anything for the body column. So the first example is close, but it only gets the text for the last p element. If I could get all p elements at the same level that would be what I need. The double-slash (/book/body/chapter//p) doesn't seem to be supported. Thanks, -Jay 2009/7/1 Noble Paul noble.p...@corp.aol.com wrote: complete xpath is not supported. /book/body/chapter/p should work. If you wish all the text under chapter irrespective of nesting and tag names, use this: <field column="body" xpath="/book/body/chapter" flatten="true"/> On Thu, Jul 2, 2009 at 5:31 AM, Jay Hill jayallenh...@gmail.com wrote: I'm using the XPathEntityProcessor to parse an xml structure that looks like this:

  <book>
    <author>Joe Smith</author>
    <title>World Atlas</title>
    <body>
      <chapter>
        <p>Content I want is here</p>
        <p>More content I want is here.</p>
        <p>Still more content here.</p>
      </chapter>
    </body>
  </book>

The author and title parse out fine: <field column="title" xpath="/book/title"/> <field column="author" xpath="/book/author"/> But I can't get at the data inside the p tags. I want to get all non-markup text inside the body tag with something like this: <field column="body" xpath="/book/body/chapter//p"/> but that is not supported. Does anyone know of a way that I can get the content within the p tags without the markup? Thanks, -Jay -- - Noble Paul | Principal Engineer| AOL | http://aol.com -- === Fergus McMenemie Email: fer...@twig.me.uk Techmore Ltd Phone: (UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer ===
Re: DIH: Limited xpath syntax unable to parse all xml elements
I'm on the trunk, built on July 2: 1.4-dev 789506 Thanks, -Jay On Thu, Jul 2, 2009 at 11:33 AM, Shalin Shekhar Mangar shalinman...@gmail.com wrote: On Thu, Jul 2, 2009 at 11:38 PM, Mark Miller markrmil...@gmail.com wrote: Shalin Shekhar Mangar wrote: It selects all matching nodes. But if the field is not multi-valued, it will store only the last value. I guess this is what is happening here. So do you think it should match them all and add the concatenated text as one field? That would be more XPath-like I think, and less arbitrary than just choosing the last one. I won't call it arbitrary because it creates a SolrInputDocument with values from all the matching nodes just like you'd create any multi-valued field. The problem is that his field is not declared to be multi-valued. The same would happen if you posted an XML document to /update with multiple values for a single-valued field. XPathEntityProcessor provides the flatten=true option if you want to add it as concatenated text. Jay mentioned that flatten did not work for him, which is something we should investigate. Jay, which version of Solr are you running? The flatten option is a 1.4 feature (added with SOLR-1003). -- Regards, Shalin Shekhar Mangar.
Re: DIH: Limited xpath syntax unable to parse all xml elements
Thanks Fergus, setting the field to multivalued did work: <field column="body" xpath="/book/body/chapter/p" flatten="true"/> gets all the p elements as multivalue fields in the body field. The only thing is, the body field is used by some other content sources, so I have to look at the implications setting it to multi-valued will have on the other data sources. Still, this might do the trick. Thanks to all that helped on this! -Jay On Thu, Jul 2, 2009 at 11:40 AM, Fergus McMenemie fer...@twig.me.uk wrote: Shalin Shekhar Mangar wrote: On Thu, Jul 2, 2009 at 11:08 PM, Mark Miller markrmil...@gmail.com wrote: It looks like DIH implements its own subset of the XPath spec. Right, DIH has a streaming implementation supporting a subset of XPath only. The supported things are in the wiki examples. I don't see any tests with multiple matching sub nodes, so perhaps DIH XPath does not properly support that and just selects the last matching node? It selects all matching nodes. But if the field is not multi-valued, it will store only the last value. I guess this is what is happening here. So do you think it should match them all and add the concatenated text as one field? That would be more XPath-like I think, and less arbitrary than just choosing the last one. Only when the field in schema.xml is not multiValued. If the field is multiValued it should still behave as at present? Also... what went wrong with the suggested: <field column="body" xpath="/book/body/chapter" flatten="true"/> Regards Fergus.
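Pulling the thread's resolution together, the working combination pairs a multiValued schema field with the flattened DIH mapping; a consolidated sketch (the type name is an assumption):

  schema.xml:
    <field name="body" type="text" indexed="true" stored="true" multiValued="true"/>

  data-config.xml:
    <field column="body" xpath="/book/body/chapter/p" flatten="true"/>

Each p element then arrives as one value of the multi-valued body field.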
DIH: Distributing docs to more than one Solr instance
I'm using the DIH to index records from a relational database. No problems, everything works great. But now, due to the size of the index (70GB w/ 25M+ docs), I need to shard and want the DIH to distribute documents evenly between two shards. The current approach is to modify the sql query in the config file to get only even-numbered ids on one host and odd-numbered ids on the other host. Is there a more elegant way to distribute the documents? Has anyone else come up with a better way to approach this? Thanks, -Jay
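One way to keep a single DIH config for both hosts is to parameterize the query with request parameters; a hedged sketch, assuming the ${dataimporter.request.*} substitution available in Solr 1.4's DIH (the parameter names and table are made up):

  <entity name="book"
          query="select id, name from books where mod(id, ${dataimporter.request.numShards}) = ${dataimporter.request.shard}">

  invoked per host with:
  http://host0:8080/solr/indexer/books?command=full-import&numShards=2&shard=0
  http://host1:8080/solr/indexer/books?command=full-import&numShards=2&shard=1

This is the same odd/even split, just driven from the request instead of two hand-edited config files.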
DIH: Limited xpath syntax unable to parse all xml elements
I'm using the XPathEntityProcessor to parse an xml structure that looks like this:

  <book>
    <author>Joe Smith</author>
    <title>World Atlas</title>
    <body>
      <chapter>
        <p>Content I want is here</p>
        <p>More content I want is here.</p>
        <p>Still more content here.</p>
      </chapter>
    </body>
  </book>

The author and title parse out fine: <field column="title" xpath="/book/title"/> <field column="author" xpath="/book/author"/> But I can't get at the data inside the p tags. I want to get all non-markup text inside the body tag with something like this: <field column="body" xpath="/book/body/chapter//p"/> but that is not supported. Does anyone know of a way that I can get the content within the p tags without the markup? Thanks, -Jay
PlainTextEntityProcessor not putting any text into a field in index
I'm having some trouble getting the PlainTextEntityProcessor to populate a field in an index. I'm using the TemplateTransformer to fill 2 fields, and have a timestamp field in schema.xml, and these fields make it into the index. Only the plainText data is missing. Here is my configuration:

  <dataConfig>
    <dataSource type="FileDataSource" encoding="UTF-8" />
    <document>
      <entity name="f" processor="FileListEntityProcessor" baseDir="/Users/jayhill/test/dir" fileName=".*txt" recursive="true" rootEntity="true">
        <entity name="pt" processor="PlainTextEntityProcessor" url="${f.fileAbsolutePath}" transformer="RegexTransformer,TemplateTransformer">
          <field column="plainText" name="text"/>
          <field column="datasource" template="textfiles" />
        </entity>
      </entity>
    </document>
  </dataConfig>

I've tried adding plainText as a field in schema.xml, but that didn't work either. When I look at what the PlainTextEntityProcessor class is doing I see that it has correctly parsed the file and has the text in a StringWriter: row.put(PLAIN_TEXT, sw.toString()); I just don't know how to get that text into a field in the index. Any pointers appreciated. -Jay
Re: query issue /special character and case
Regarding being able to search SCHOLKOPF (o with no umlaut) and match SCHÖLKOPF (with umlaut), try using the ISOLatin1AccentFilterFactory in your analysis chain: <filter class="solr.ISOLatin1AccentFilterFactory" /> This filter removes accented chars and replaces them with non-accented versions. As always, make sure to add it to the analyzer for both the index and query types. -Jay On Fri, Jun 5, 2009 at 11:10 PM, Shalin Shekhar Mangar shalinman...@gmail.com wrote: On Sat, May 30, 2009 at 9:48 AM, revas revas...@gmail.com wrote: Hi, When I give a query like the following, why does it become a phrase query as shown below? The field type is the default text field in the schema. <str name="querystring">volker-blanz</str> <str name="parsedquery">PhraseQuery(content:volker blanz)</str> What is the query that was sent to Solr? Also, when I have special characters in the query, as in SCHÖLKOPF, I am not able to convert the o with the special character to lower case on my Unix OS; it works fine on Windows XP. Also, if I have a special character in my query, I would like to search for it without the special character, as SCHOLKOPF; this works fine on Windows with strtr (the string-translate PHP function), but again not on the Unix OS. Hmm, not sure. If you are using Tomcat, have you enabled UTF-8? http://wiki.apache.org/solr/SolrTomcat#head-20147ee4d9dd5ca83ed264898280ab60457847c4 You can try using the analysis.jsp on the text field with this token and see how it is being analyzed. See if that gives some hints. -- Regards, Shalin Shekhar Mangar.
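For completeness, the filter has to appear in both analyzer sections of the field type; a minimal sketch (the fieldType name is made up, and the surrounding tokenizer and filters are illustrative):

  <fieldType name="textFolded" class="solr.TextField">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.ISOLatin1AccentFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.ISOLatin1AccentFilterFactory"/>
    </analyzer>
  </fieldType>

With the same chain on both sides, SCHÖLKOPF and SCHOLKOPF normalize to the same indexed term.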
Re: Query faceting
In order to get the values you want for the service field, you will need to change the fieldType definition in schema.xml for service to use something that doesn't alter your original values. Try the string fieldType to start, and look at the fieldType definition for string. I'm guessing you have it set to text or something else with a chain of filters during analysis. If you don't want back facets with a count of 0, set this param: facet.mincount=1 Have a look at all the values you can set on facets: http://wiki.apache.org/solr/SimpleFacetParameters -Jay On Mon, Jun 8, 2009 at 2:09 PM, siping liu siping...@hotmail.com wrote: Hi, I have a field called service with the following values: - Shuttle Services - Senior Discounts - Laundry Rooms - ... When I conduct a query with facet=true&facet.field=service&facet.limit=-1, I get something like this back: - shuttle 2 - service 3 - senior 0 - laundry 0 - room 3 - ... Questions: - How do I keep field values from being broken up into words, so I can get something like Shuttle Services 2 back? - How do I tell Solr not to return facets with a 0 count? The query takes a long time to finish, seemingly because of the long list of items with 0 count. Thanks for any advice.
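A sketch of the two-part fix, assuming the field is currently a tokenized text type:

  schema.xml:
    <field name="service" type="string" indexed="true" stored="true" multiValued="true"/>

  query:
    ...&facet=true&facet.field=service&facet.mincount=1&facet.limit=-1

The string type does no analysis, so Shuttle Services survives as one facet value, and facet.mincount=1 drops the zero-count entries.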
Re: Highlighting and Field options
Use the fl param to ask for only the fields you need, but also keep hl=true. Something like this: http://localhost:8080/solr/select/?q=bear&version=2.2&start=0&rows=10&indent=on&hl=true&fl=id Note that fl=id means the only field returned in the XML will be the id field. Highlights are still returned in the highlight element, but you won't get back the unneeded content field. -Jay On Mon, Jun 1, 2009 at 9:41 AM, ashokc ash...@qualcomm.com wrote: Hi, The 'content' field that I am indexing is usually large (e.g. a pdf doc of a few Mb in size). I need highlighting to be on. This 'seems' to require that I have to set the 'content' field to be STORED. This returns the whole content field in the search result XML for each matching document. The highlighted text also is returned in a separate block. But I do NOT need the entire content field to display the search results; I only use the highlighted segments to display a brief description of each hit. The fact that SOLR returns the entire content field makes the returned XML unnecessarily huge, and makes for larger response times. How can I have SOLR return ONLY the highlighted text for each hit and NOT the entire 'content' field? Thanks - ashok
Re: Question about field types and querying
Try using the admin analysis tool (http://host:port/solr/admin/analysis.jsp) to see what the analysis chain is doing to your query. Enter the field name (question in your case) and, for Field value (Index), enter customize (since that's what's in the document). For Field value (Query) enter customer. Check Verbose Output and click Analyze. This will show you each filter in the chain and the actions they are taking on your query. Note that the highlighted fields show where a match would occur. Then adjust your fieldTypes and fields to get the results you want. Create a new fieldType if needed and add/remove filters as needed. -Jay On Thu, May 28, 2009 at 12:07 PM, ahammad ahmed.ham...@gmail.com wrote: Hello, I have a field of type text in my collection called question. When I query for the word customer, for example, in the question field (i.e. q=question:customer), the first document with the highest score shows up, but does not contain the word customer at all. Instead, it contains the word customize. What would be a way around this? I tried changing the type to string instead of text, but then I wouldn't get any results if I don't have the exact statement in there...