Re: DataimportHandler development issue
On Fri, Jan 14, 2011 at 12:17 AM, Derek Werthmuller dwert...@ctg.albany.edu wrote:

It's not clear why it's not working. Advice? Also, is this the best way to load data? We intend to load several thousand DocBook documents once we understand how this all works. We stuck with the RSS/Atom example since we didn't want to deal with schema changes yet. Thanks, Derek

example-DIH/solr/rss/conf/rss-data-config.xml, modified source:

<dataConfig>
  <dataSource type="URLDataSource" />
  <document>
    <entity name="slashdot" pk="link"
            url="http://twitter.com/statuses/user_timeline/existdb.rss"
            processor="XPathEntityProcessor"
            forEach="/rss/channel | /rss/channel/item"
            transformer="DateFormatTransformer">
      <field column="source" xpath="/rss/channel/title" commonField="true" />
      <field column="source-link" xpath="/rss/channel/link" commonField="true" />
      <field column="subject" xpath="/rss/channel/subject" commonField="true" />
      <field column="title" xpath="/rss/channel/item/title" />
      <field column="link" xpath="/rss/channel/item/link" />
      <field column="description" xpath="/rss/channel/item/description" />
      <field column="creator" xpath="/rss/channel/item/creator" />
      <field column="item-subject" xpath="/rss/channel/item/subject" />
      <field column="date" xpath="/rss/channel/item/date" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss" />
      <field column="slash-department" xpath="/rss/channel/item/department" />
      <field column="slash-section" xpath="/rss/channel/item/section" />
      <field column="slash-comments" xpath="/rss/channel/item/comments" />
    </entity>
    <entity name="twitter" pk="link"
            url="http://twitter.com/statuses/user_timeline/ctg_ualbany.atom"
            processor="XPathEntityProcessor"
            forEach="/feed | /feed/entry"
            transformer="DateFormatTransformer">
      <field column="source" xpath="/feed/title" commonField="true" />
      <field column="source-link" xpath="/feed/link" commonField="true" />
      <field column="subject" xpath="/feed/subtitle" commonField="true" />
      <field column="title" xpath="/feed/entry/title" />
      <field column="link" xpath="/feed/entry/link" />
      <field column="description" xpath="/feed/entry/description" />
      <field column="creator" xpath="/feed/entry/creator" />
      <field column="item-subject" xpath="/feed/entry/subject" />
      <field column="date" xpath="/rss/channel/item/date" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss" />
      <field column="slash-department" xpath="/feed/entry/department" />
      <field column="slash-section" xpath="/feed/entry/section" />
      <field column="slash-comments" xpath="/feed/entry/comments" />
    </entity>
  </document>
</dataConfig>

Your problem is the second entity in the DIH configuration file. The Solr schema defines the unique key to be the field link. As noted in the comments in schema.xml, this means that the field is required. Solr is not able to populate the link field from the Atom feed. I have not tracked down why this is so, but it is probably because there is more than one link node under /feed/entry, and the link field is not multi-valued. Change the xpath to, say, /feed/entry/id, and the import works. Also, while this is not necessarily an issue, please note that several other fields have incorrect xpaths for this entity. To answer your other question, this way of importing data should work fine. Regards, Gora
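For reference, here is a sketch of the second entity with the suggested fix applied. Only the change to the link xpath comes from this thread; the remaining fields are illustrative:

```xml
<!-- sketch: populate the required uniqueKey field from the single <id>
     node each Atom entry has, instead of the repeated <link> nodes -->
<entity name="twitter" pk="link"
        url="http://twitter.com/statuses/user_timeline/ctg_ualbany.atom"
        processor="XPathEntityProcessor"
        forEach="/feed | /feed/entry">
  <field column="link" xpath="/feed/entry/id" />
  <field column="title" xpath="/feed/entry/title" />
</entity>
```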
Re: Improving Solr performance
The tests are performed with a self-made program. The arguments are the number of threads and the path to a file which contains the available queries (in the last test, only one). When each thread is created, it records the current time (in milliseconds), and when it gets the response to the query, the thread logs the difference from that initial time. In the last post, I wrote the results of the 100-thread example ordered by response date. The results ordered by creation date are:

100 simultaneous queries: 9265, 11922, 12375, 4109, 4890, 7093, 21875, 8547, 13562, 13219, 1531, 11875, 21281, 31985, 11703, 7391, 32031, 22172, 21469, 13875, 1969, 11406, 8172, 9609, 16953, 13828, 17282, 22141, 16625, 2203, 24985, 2375, 25188, 2891, 5047, 6422, 20860, 7594, 23125, 32281, 32016, 5312, 23125, 11484, 10344, 11500, 18172, 3937, 11547, 13500, 28297, 20594, 24641, 7063, 24797, 12922, 1297, 8984, 20625, 13407, 23203, 32016, 15922, 21875, 8750, 12875, 23203, 26453, 26016, 11797, 31782, 24672, 21625, 7672, 18985, 14672, 22157, 26485, 23328, 9907, 5563, 24625, 14078, 4703, 25844, 12328, 11484, 6437, 25937, 26437, 18484, 13719, 16328, 28687, 23141, 14016, 26437, 13187, 25031, 31969

--
View this message in context: http://lucene.472066.n3.nabble.com/Improving-Solr-performance-tp2210843p2254121.html
Sent from the Solr - User mailing list archive at Nabble.com.
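For readers wanting to reproduce this kind of measurement, here is a minimal sketch of such a harness (not the original program): each thread records its start time, runs one query, and logs the elapsed milliseconds. The query function is injected, e.g. an HTTP GET against Solr's /select handler.

```python
# Sketch of a concurrent latency-measurement harness as described above.
import threading
import time

def run_test(n_threads, query_fn):
    """Start n_threads simultaneously; return per-thread latency in ms."""
    latencies = [0] * n_threads

    def worker(i):
        start = time.time()                        # thread creation time
        query_fn()                                 # issue one query
        latencies[i] = int((time.time() - start) * 1000)

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return latencies
```

Note that latencies measured this way include thread scheduling and queueing time on the client, which is one reason the numbers above vary so widely under 100 concurrent requests.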
Re: Solr 4.0 = Spatial Search - How to
caman, how did you try to concat them? Perhaps some typecasting would do the trick? Stefan

On Fri, Jan 14, 2011 at 7:20 AM, caman aboxfortheotherst...@gmail.com wrote:

Thanks. Here was the issue: concatenating the two floats (lat, lng) on the MySQL end converted the result to a BLOB, and indexing would fail when storing a BLOB in a 'location' type field. After the BLOB issue was resolved, all worked OK. Thank you all for your help.
Re: Dismax, Sharding and Elevation
Hi, thank you for your reply, Grijesh. But elevation in general works with sharding - if I use the Standard Request Handler instead of Dismax. I just wonder how (or if) it could also work with Dismax. I think it's not a problem of distributed search, but one of Dismax (perhaps combined with distributed search). Oliver

Grijesh.singh schrieb: As I have seen in the code for QueryElevationComponent, there is no support for distributed search, i.e. query elevation does not work with shards. - Grijesh

--
Oliver Marahrens
TU Hamburg-Harburg / Universitätsbibliothek / Digitale Dienste
Denickestr. 22, 21071 Hamburg-Harburg
Tel. +49 (0)40 / 428 78 - 32 91
eMail o.marahr...@tu-harburg.de
GPG/PGP-Schlüssel: http://www.tub.tu-harburg.de/keys/Oliver_Marahrens_pub.asc
Projekt DISCUS http://discus.tu-harburg.de
Projekt TUBdok http://doku.b.tu-harburg.de
Solr and Ping PHP
Hello. I am using NRT, and for each search request, update request, and commit request (on the search instance) I start a ping to Solr with an HTTP request. But sometimes the ping fails even though Solr is available. Why can't Solr answer a ping while it is doing something like a commit on my searcher, or while a search request is running? I get at least one error message every night, and that really sucks...

--
System
- One server, 12 GB RAM, 2 Solr instances, 7 cores; 1 core with 31 million documents, the others under 100,000
- Solr1 for search requests - commit every minute - 4 GB Xmx
- Solr2 for update requests - delta every 2 minutes - 4 GB Xmx
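A common client-side workaround (a sketch, not from this thread; the URL, attempt count, and delay are illustrative) is to retry the ping a few times with a short delay before treating Solr as down, since a ping can fail transiently while a commit is opening a new searcher:

```python
# Sketch: retry the ping handler a few times before declaring Solr down.
import time
import urllib.request

def ping_with_retry(url, attempts=3, delay=0.5, fetch=None):
    """Return True if the ping handler answers within `attempts` tries.

    `fetch` is injectable for testing; by default it does an HTTP GET
    and treats status 200 as success."""
    if fetch is None:
        fetch = lambda u: urllib.request.urlopen(u, timeout=2).status == 200
    for i in range(attempts):
        try:
            if fetch(url):
                return True
        except OSError:
            pass                       # connection refused / timeout
        if i < attempts - 1:
            time.sleep(delay)          # give the commit a moment to finish
    return False
```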
Re: Searchers and Warmups
Hi David, maybe the wiki page on caching could be helpful: http://wiki.apache.org/solr/SolrCaching#newSearcher_and_firstSearcher_Event_Listeners Regards, - Savvas On 14 January 2011 00:08, David Cramer dcra...@gmail.com wrote: I'm trying to understand the mechanics behind warming up, when new searchers are registered, and their costs. A quick Google didn't point me in the right direction, so hoping for some of that here. -- David Cramer
Re: Searchers and Warmups
Hi David, The idea is that you can define listeners which issue a list of queries against an IndexSearcher. In particular, the firstSearcher event relates to the very first IndexSearcher created inside the Solr instance, while newSearcher is the event for the creation of each subsequent IndexSearcher (i.e. when a commit is done, the old searchers get closed once unused and new ones are created on the latest commit point). Warming up is simply the execution of particular queries against such IndexSearchers in order to put some documents into the caches before any user-entered query is executed, so that the searchers are warmed with the proper documents (e.g. for the most frequent queries). Also, depending on the cache configuration, some entries from the old caches are carried over into the caches of the new searchers [1]. I hope this clarifies things a little bit. Cheers, Tommaso

[1]: http://wiki.apache.org/solr/SolrPerformanceFactors#Cache_autoWarm_Count_Considerations

2011/1/14 David Cramer dcra...@gmail.com: I'm trying to understand the mechanics behind warming up, when new searchers are registered, and their costs. A quick Google didn't point me in the right direction, so hoping for some of that here. -- David Cramer
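In solrconfig.xml these listeners look roughly like this (a sketch; the warming queries themselves are placeholders you would replace with your most frequent ones):

```xml
<!-- sketch: QuerySenderListener runs the listed queries against the new
     searcher before it starts serving traffic -->
<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">frequent query here</str></lst>
  </arr>
</listener>
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">frequent query here</str></lst>
  </arr>
</listener>
```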
Re: Solr 4.0 = Spatial Search - How to
Absolutely no idea why it is a BLOB... but the following works as expected:

CAST( CONCAT( lat, ',', lng ) AS CHAR )

HTH, Stefan

On Fri, Jan 14, 2011 at 9:31 AM, caman aboxfortheotherst...@gmail.com wrote:

CONCAT(CAST(lat as CHAR),',',CAST(lng as CHAR))
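In a DIH data-config query this could look like the following sketch (table and column names are illustrative, not from the thread):

```sql
-- Cast the concatenation to CHAR so MySQL returns a string, not a BLOB,
-- for the Solr 'location' field. Table/column names are illustrative.
SELECT id, CAST(CONCAT(lat, ',', lng) AS CHAR) AS location
FROM places;
```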
Schema design FAQs/questions
Dear Solr users, is there a compilation of FAQs particularly targeting schema design? I have two questions that have probably been asked before:

- I have to map different kinds of documents into my schema. Some of these documents have one or multiple times/dates that might be relevant for querying or sorting. I feel it would be best to keep dates with different semantics in different fields. Does it pose any problems when some of these fields are filled only for certain documents? Presumably, when such a field is used for sorting, documents not providing a field value will end up at one end of the result set, depending on sort order?

- Sometimes several documents belong together as part of a bigger concept. I could keep a reference to this concept along with every document in the index. Would it be possible to perform a search where hits on documents are grouped by these concepts? That is, I would like to get a result list that contains *only one* entry per concept, but for each of these entries tells me which document(s) contained the match?

Thanks a lot! -mp.
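On the first question, one relevant Solr knob (mentioned here as background, not as a reply from the thread): the sortMissingLast / sortMissingFirst attributes on a field type control which end of a sorted result set documents without a value land on. A schema.xml sketch:

```xml
<!-- sketch: documents lacking a value in fields of this type sort last,
     regardless of ascending/descending order -->
<fieldType name="date" class="solr.DateField"
           sortMissingLast="true" omitNorms="true"/>
```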
solr speed issues..
I am working on an application that requires fetching results from Solr based on a date parameter. Earlier I was using sharding to fetch the results, but that was making things too slow, so instead of sharding I queried three different cores with the same parameters and merged the results. Still, things are slow. For one call I generally get around 500 to 1000 docs from Solr, so basically I am including the following parameters in the URL for the Solr call:

sort=created+desc
json.nl=map
wt=json
rows=1000
version=1.2
omitHeader=true
fl=title
start=0
q=apple
qt=standard
fq=created:[date1 TO date2]

It's taking a long time to get the results; any solution to the above problem would be great.

--
View this message in context: http://lucene.472066.n3.nabble.com/solr-speed-issues-tp2254823p2254823.html
Sent from the Solr - User mailing list archive at Nabble.com.
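For clarity, the request above can be assembled like this (a sketch; the host and core path are illustrative, and date1/date2 stay as the placeholders used in the post):

```python
# Sketch: assembling the Solr query described above with urlencode.
from urllib.parse import urlencode

params = {
    "q": "apple",
    "qt": "standard",
    "fq": "created:[date1 TO date2]",   # placeholders, as in the post
    "sort": "created desc",
    "fl": "title",
    "rows": 1000,
    "start": 0,
    "wt": "json",
    "json.nl": "map",
    "version": "1.2",
    "omitHeader": "true",
}
url = "http://localhost:8983/solr/select?" + urlencode(params)
```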
Query : FAQ? Forum?
Hi, I am trying to get Solr installed and working, and have some queries: is there a FAQ or a forum? How do I search to see whether someone has already asked my question and answered it? Regards, Cathy

--
Converteam UK Ltd. Registration Number: 5571739 and Converteam Ltd. Registration Number: 2416188. Registered in England and Wales. Registered office: Boughton Road, Rugby, Warwickshire, CV21 1BU. CONFIDENTIALITY: This e-mail and any attachments are confidential and may be privileged. If you are not a named recipient, please notify the sender immediately and do not disclose the contents to another person, use it for any purpose or store or copy the information in any medium. http://www.converteam.com Please consider the environment before printing this e-mail.
Re: Query : FAQ? Forum?
What about http://search.lucidimagination.com/search/#/p:solr ? :)

On Fri, Jan 14, 2011 at 12:45 PM, Cathy Hemsley cathy.hems...@converteam.com wrote: Hi, I am trying to get Solr installed and working, and have some queries: is there a FAQ or a forum? How do I search to see whether someone has already asked my question and answered it? Regards, Cathy
boilerpipe solr tika howto please
Hello, I would like to use Boilerpipe (a very good program which cleans HTML content of surplus clutter). I saw that Boilerpipe is included in Tika 0.8 and so should be accessible from Solr - am I right? How can I activate Boilerpipe in Solr? Do I need to change solrconfig.xml (with org.apache.solr.handler.extraction.ExtractingRequestHandler)? Or do I need to modify some code inside Solr? I saw something like TikaCLI -F in the Tika forum (http://www.lucidimagination.com/search/document/242ce3a17f30f466/boilerpipe_integration) - is that the right way? Thanks in advance, Arno.
Re: Problem with Tika and ExtractingRequestHandler (How to from lucidimagination)
OK, now in the 4th test it works? OK... I don't know... it works. But now I have another problem: I can't send content to the server. When I send content to Solr I get:

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>
<title>Error 400</title>
</head>
<body><h2>HTTP ERROR: 400</h2><pre>Document [null] missing required field: id</pre>
<p>RequestURI=/solr/update/extract</p><p><i><small><a href="http://jetty.mortbay.org/">Powered by Jetty://</a></small></i></p>
</body>
</html>

I do:

curl "http://192.168.105.66:8983/solr/update/extract?ext.idx.attr=true&ext.def.fl=text" -F "myfile=@test.txt"

Some ideas?
Re: segment gets corrupted (after background merge ?)
Right, but removing a segment out from under a live IW (when you run CheckIndex with -fix) is deadly, because that other IW doesn't know you've removed the segment, and will later commit a new segment infos still referencing that segment.

The nature of this particular exception from CheckIndex is very strange... I think it can only be a bug in Lucene, a bug in the JRE, or a hardware issue (bits are flipping somewhere). I don't think an error in the IO system can cause this particular exception (it would cause others), because the deleted docs are loaded up front when SegmentReader is init'd... This is why I'd really like to see whether a given corrupt index always hits precisely the same exception if you run CheckIndex more than once.

Mike

On Thu, Jan 13, 2011 at 10:56 PM, Lance Norskog goks...@gmail.com wrote:

1) CheckIndex is not supposed to change a corrupt segment, only remove it.
2) Are you using local hard disks, or do you run on a common SAN or remote file server? I have seen corruption errors on SANs, where existing files have random changes.

On Thu, Jan 13, 2011 at 11:06 AM, Michael McCandless luc...@mikemccandless.com wrote:

Generally it's not safe to run CheckIndex if a writer is also open on the index. It's not safe because CheckIndex could hit FNFEs on opening files, or, if you use -fix, CheckIndex will change the index out from under your other IndexWriter (which will then cause other kinds of corruption). That said, I don't think the corruption that CheckIndex is detecting in your index would be caused by having a writer open on the index. Your first CheckIndex has a different deletes file (_phe_p3.del, with 44824 deleted docs) than the 2nd time you ran it (_phe_p4.del, with 44828 deleted docs), so it must somehow have to do with that change. One question: if you have a corrupt index, and run CheckIndex on it several times in a row, does it always fail in the same way? (I.e. does the same term hit the below exception?) Is there any way I could get a copy of one of your corrupt cases? I can then dig... Mike

On Thu, Jan 13, 2011 at 10:52 AM, Stéphane Delprat stephane.delp...@blogspirit.com wrote:

I understand less and less what is happening to my Solr. I did a CheckIndex (without -fix) and there was an error... So I did another CheckIndex with -fix, and then the error was gone. The segment was alright. During CheckIndex I do not shut down the Solr server, I just make sure no client connects to the server. Should I shut down the Solr server during CheckIndex?

First CheckIndex:

  4 of 17: name=_phe docCount=264148
    compound=false
    hasProx=true
    numFiles=9
    size (MB)=928.977
    diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.26-2-amd64, os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06 01:30:55, source=merge, os.arch=amd64, java.version=1.6.0_20, java.vendor=Sun Microsystems Inc.}
    has deletions [delFileName=_phe_p3.del]
    test: open reader.........OK [44824 deleted docs]
    test: fields..............OK [51 fields]
    test: field norms.........OK [51 fields]
    test: terms, freq, prox...ERROR [term post_id:562 docFreq=1 != num docs seen 0 + num docs deleted 0]
java.lang.RuntimeException: term post_id:562 docFreq=1 != num docs seen 0 + num docs deleted 0
        at org.apache.lucene.index.CheckIndex.testTermIndex(CheckIndex.java:675)
        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:530)
        at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)
    test: stored fields.......OK [7206878 total field count; avg 32.86 fields per doc]
    test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
FAILED
    WARNING: fixIndex() would remove reference to this segment; full exception:
java.lang.RuntimeException: Term Index test failed
        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:543)
        at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)

A few minutes later:

  4 of 18: name=_phe docCount=264148
    compound=false
    hasProx=true
    numFiles=9
    size (MB)=928.977
    diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.26-2-amd64, os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06 01:30:55, source=merge, os.arch=amd64, java.version=1.6.0_20, java.vendor=Sun Microsystems Inc.}
    has deletions [delFileName=_phe_p4.del]
    test: open reader.........OK [44828 deleted docs]
    test: fields..............OK [51 fields]
    test: field norms.........OK [51 fields]
    test: terms, freq, prox...OK [3200899 terms; 26804334 terms/docs pairs; 28919124 tokens]
    test: stored fields.......OK [7206764 total field count; avg 32.86 fields per doc]
    test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]

On 12/01/2011 16:50, Michael McCandless wrote:

Curious... is it always a docFreq=1 != num docs seen 0 +
Solr: using to index large folders recursively containing lots of different documents, and querying over the web
Hi Solr users, I hope you can help. We are migrating our intranet web site management system to Windows 2008 and need a replacement for Index Server to do the text searching. I am trying to establish whether Lucene and Solr are a feasible replacement, but I cannot find the answers to these questions:

1. Can Solr be set up to recursively index a folder containing an indeterminate and variable large number of subfolders, containing files of all types: XML, HTML, PDF, DOC, spreadsheets, PowerPoint presentations, text files etc.? If so, how?
2. Can Solr be queried over the web and return a list of files that match a search query entered by a user, and also return the abstracts for these files, as well as hit highlighting? If so, how?
3. Can Solr be run as a service (like Index Server) that automatically detects changes to the files within the indexed folder and updates the index? If so, how?

Thanks for your help, Cathy Hemsley
Re: Problem with Tika and ExtractingRequestHandler (How to from lucidimagination)
Pass a value for your id field, as you already do for all the other fields? http://search.lucidimagination.com/search/document/ca95d06e700322ed/missing_required_field_id_using_extractingrequesthandler

On Fri, Jan 14, 2011 at 12:59 PM, Jörg Agatz joerg.ag...@googlemail.com wrote:

OK, now in the 4th test it works? OK... I don't know... it works. But now I have another problem: I can't send content to the server. When I send content to Solr I get:

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>
<title>Error 400</title>
</head>
<body><h2>HTTP ERROR: 400</h2><pre>Document [null] missing required field: id</pre>
<p>RequestURI=/solr/update/extract</p><p><i><small><a href="http://jetty.mortbay.org/">Powered by Jetty://</a></small></i></p>
</body>
</html>

I do:

curl "http://192.168.105.66:8983/solr/update/extract?ext.idx.attr=true&ext.def.fl=text" -F "myfile=@test.txt"

Some ideas?
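A sketch of supplying the id via request parameters (hedged: parameter names varied between Solr releases; literal.id is the Solr 1.4+ form, while older trunk builds used an ext.literal. prefix, and the host, port, and id value below are placeholders):

```python
# Sketch: build an /update/extract URL that fills the required 'id'
# field via literal.id.
from urllib.parse import urlencode

params = {
    "literal.id": "doc1",    # supplies the required unique key
    "fmap.content": "text",  # map extracted body into the 'text' field
    "commit": "true",
}
url = "http://localhost:8983/solr/update/extract?" + urlencode(params)
# then POST the file, e.g.: curl "<url>" -F "myfile=@test.txt"
```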
Re: Solr: using to index large folders recursively containing lots of different documents, and querying over the web
Please visit the Nutch project. It is a powerful crawler and can integrate with Solr. http://nutch.apache.org/

Hi Solr users, I hope you can help. We are migrating our intranet web site management system to Windows 2008 and need a replacement for Index Server to do the text searching. I am trying to establish if Lucene and Solr is a feasible replacement, but I cannot find the answers to these questions:

1. Can Solr be set up to recursively index a folder containing an indeterminate and variable large number of subfolders, containing files of all types: XML, HTML, PDF, DOC, spreadsheets, powerpoint presentations, text files etc. If so, how?
2. Can Solr be queried over the web and return a list of files that match a search query entered by a user, and also return the abstracts for these files, as well as 'hit highlighting'. If so, how?
3. Can Solr be run as a service (like Index Server) that automatically detects changes to the files within the indexed folder and updates the index? If so, how?

Thanks for your help Cathy Hemsley
Re: Solr: using to index large folders recursively containing lots of different documents, and querying over the web
On Fri, 2011-01-14 at 13:05 +0100, Cathy Hemsley wrote: I hope you can help. We are migrating our intranet web site management system to Windows 2008 and need a replacement for Index Server to do the text searching. I am trying to establish if Lucene and Solr is a feasible replacement, but I cannot find the answers to these questions: The answers to your questions are yes and no to all of them. Solr does not do what you ask out of the box, but it can certainly be done by extending Solr or using it at the core of another system. Some time ago I stumbled upon http://www.constellio.com/ which seems to be exactly what you're looking for.
Re: Adding a new site to existing solr configuration
Awesome! thx! :)
Re: Solr: using to index large folders recursively containing lots of different documents, and querying over the web
Nutch can crawl the file system as well. Nutch 1.x can also provide search, but this is delegated to Solr in Nutch 2.x. Solr can provide the search, and Nutch can provide Solr with content from your intranet.

On Friday 14 January 2011 13:17:52 Cathy Hemsley wrote:

Hi, thanks for suggesting this. However, I'm not sure a 'crawler' will work, as the various pages are not necessarily linked (it's complicated: basically our intranet is a dynamic and managed collection of independently published web sites, and users find information using categorisation and/or text searching), so we need something that will index all the files in a given folder, rather than follow links like a crawler. Can Nutch do this? As well as the other requirements below? Regards, Cathy

On 14 January 2011 12:09, Markus Jelsma markus.jel...@openindex.io wrote: Please visit the Nutch project. It is a powerful crawler and can integrate with Solr. http://nutch.apache.org/ Hi Solr users, I hope you can help. We are migrating our intranet web site management system to Windows 2008 and need a replacement for Index Server to do the text searching. I am trying to establish if Lucene and Solr is a feasible replacement, but I cannot find the answers to these questions: 1. Can Solr be set up to recursively index a folder containing an indeterminate and variable large number of subfolders, containing files of all types: XML, HTML, PDF, DOC, spreadsheets, powerpoint presentations, text files etc. If so, how? 2. Can Solr be queried over the web and return a list of files that match a search query entered by a user, and also return the abstracts for these files, as well as 'hit highlighting'. If so, how? 3. Can Solr be run as a service (like Index Server) that automatically detects changes to the files within the indexed folder and updates the index? If so, how? Thanks for your help Cathy Hemsley

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
Is deduplication possible during Tika extract?
Hello, here is an excerpt of my solrconfig.xml:

<requestHandler name="/update/extract"
    class="org.apache.solr.handler.extraction.ExtractingRequestHandler" startup="lazy">
  <lst name="defaults">
    <str name="update.processor">dedupe</str>
    <!-- All the main content goes into "text"... if you need to return the
         extracted text or do highlighting, use a stored field. -->
    <str name="fmap.content">text</str>
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>
    <!-- capture link hrefs but ignore div attributes -->
    <str name="captureAttr">true</str>
    <str name="fmap.a">links</str>
    <str name="fmap.div">ignored_</str>
  </lst>
</requestHandler>

and:

<updateRequestProcessorChain name="dedupe">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">text</str>
    <str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

Deduplication works when I use only /update, but not when Solr does an extract with Tika! Is deduplication possible during a Tika extract? Thanks in advance, Arno
Re: segment gets corrupted (after background merge ?)
So I ran CheckIndex (without -fix) 5 times in a row. Solr was running, but no client connected to it (just the slave, which was synchronizing every 5 minutes).

Summary:
1: all good
2: 2 errors (segments 1 and 2):
   test: terms, freq, prox...ERROR [term blog_id:104150: doc 324697 <= lastDoc 324697]
   test: terms, freq, prox...ERROR [term blog_keywords:SPORT: doc 174808 <= lastDoc 174808]
3: 1 error (segment 2):
   test: terms, freq, prox...ERROR [Index: 105, Size: 51]
4: all good
5: 1 error (segment 7):
   test: terms, freq, prox...ERROR [term blog_comments: %X docFreq=1 != num docs seen 0 + num docs deleted 0]

Seems to me that some random things are happening here. The file system is ext3, on a physical server.

Here are the logs of the interesting segments:

** 1 **

  1 of 17: name=_nqt docCount=431889
    compound=false
    hasProx=true
    numFiles=9
    size (MB)=1,671.375
    diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.26-2-amd64, os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06 01:30:55, source=merge, os.arch=amd64, java.version=1.6.0_20, java.vendor=Sun Microsystems Inc.}
    has deletions [delFileName=_nqt_1y2.del]
    test: open reader.........OK [41918 deleted docs]
    test: fields..............OK [51 fields]
    test: field norms.........OK [51 fields]
    test: terms, freq, prox...OK [5211271 terms; 39824029 terms/docs pairs; 59357374 tokens]
    test: stored fields.......OK [11505678 total field count; avg 29.504 fields per doc]
    test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]

  2 of 17: name=_ol7 docCount=913886
    compound=false
    hasProx=true
    numFiles=9
    size (MB)=3,567.739
    diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.26-2-amd64, os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06 01:30:55, source=merge, os.arch=amd64, java.version=1.6.0_20, java.vendor=Sun Microsystems Inc.}
    has deletions [delFileName=_ol7_1mc.del]
    test: open reader.........OK [74076 deleted docs]
    test: fields..............OK [51 fields]
    test: field norms.........OK [51 fields]
    test: terms, freq, prox...OK [9825896 terms; 93954470 terms/docs pairs; 132337348 tokens]
    test: stored fields.......OK [26933113 total field count; avg 32.07 fields per doc]
    test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]

** 2 **

  1 of 17: name=_nqt docCount=431889
    compound=false
    hasProx=true
    numFiles=9
    size (MB)=1,671.375
    diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.26-2-amd64, os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06 01:30:55, source=merge, os.arch=amd64, java.version=1.6.0_20, java.vendor=Sun Microsystems Inc.}
    has deletions [delFileName=_nqt_1y2.del]
    test: open reader.........OK [41918 deleted docs]
    test: fields..............OK [51 fields]
    test: field norms.........OK [51 fields]
    test: terms, freq, prox...ERROR [term blog_id:104150: doc 324697 <= lastDoc 324697]
java.lang.RuntimeException: term blog_id:104150: doc 324697 <= lastDoc 324697
        at org.apache.lucene.index.CheckIndex.testTermIndex(CheckIndex.java:644)
        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:530)
        at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)
    test: stored fields.......OK [11505678 total field count; avg 29.504 fields per doc]
    test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
FAILED
    WARNING: fixIndex() would remove reference to this segment; full exception:
java.lang.RuntimeException: Term Index test failed
        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:543)
        at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)

  2 of 17: name=_ol7 docCount=913886
    compound=false
    hasProx=true
    numFiles=9
    size (MB)=3,567.739
    diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.26-2-amd64, os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06 01:30:55, source=merge, os.arch=amd64, java.version=1.6.0_20, java.vendor=Sun Microsystems Inc.}
    has deletions [delFileName=_ol7_1mc.del]
    test: open reader.........OK [74076 deleted docs]
    test: fields..............OK [51 fields]
    test: field norms.........OK [51 fields]
    test: terms, freq, prox...ERROR [term blog_keywords:SPORT: doc 174808 <= lastDoc 174808]
java.lang.RuntimeException: term blog_keywords:SPORT: doc 174808 <= lastDoc 174808
        at org.apache.lucene.index.CheckIndex.testTermIndex(CheckIndex.java:644)
        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:530)
        at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)
    test: stored fields.......OK [26933113 total field count; avg 32.07 fields per doc]
    test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
LukeRequestHandler histogram?
Dear list, what is the LukeRequestHandler histogram telling me? Couldn't find any explanation and would be pleased to have it explained. Many thanks in advance, Bernd
Re: LukeRequestHandler histogram?
Hi Bernd, there is an explanation from Hoss: http://search.lucidimagination.com/search/document/149e7d25415c0a36/some_kind_of_crazy_histogram#b22563120f1ec32b HTH Stefan On Fri, Jan 14, 2011 at 3:15 PM, Bernd Fehling bernd.fehl...@uni-bielefeld.de wrote: Dear list, what is the LukeRequestHandler histogram telling me? Couldn't find any explanation and would be pleased to have it explained. Many thanks in advance, Bernd
Re: LukeRequestHandler histogram?
Hi Stefan, thanks a lot. Regards, Bernd Am 14.01.2011 15:25, schrieb Stefan Matheis: Hi Bernd, there is an explanation from Hoss: http://search.lucidimagination.com/search/document/149e7d25415c0a36/some_kind_of_crazy_histogram#b22563120f1ec32b HTH Stefan On Fri, Jan 14, 2011 at 3:15 PM, Bernd Fehling bernd.fehl...@uni-bielefeld.de wrote: Dear list, what is the LukeRequestHandler histogram telling me? Couldn't find any explanation and would be pleased to have it explained. Many thanks in advance, Bernd
Re: Query : FAQ? Forum?
http://wiki.apache.org/solr/FrontPage Solr Wiki http://wiki.apache.org/solr/FAQ Solr FAQ http://www.amazon.com/Solr-1-4-Enterprise-Search-Server/dp/1847195881/ref=sr_1_1?ie=UTF8&qid=1295018231&sr=8-1 A good book on Solr And this forum you posted to http://lucene.472066.n3.nabble.com/Solr-User-f472068.html (Solr-User) is one of the most active and useful tech forums I've ever used. Don't be afraid to ask stupid questions; folks here are pretty forgiving and patient, especially if you attempt to use the Wiki or FAQ first. Good Luck! Ken -- View this message in context: http://lucene.472066.n3.nabble.com/Query-FAQ-Forum-tp2254898p2256030.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: boilerpipe solr tika howto please
Is there a drastic difference between this and TagSoup, which is already included in Solr?

On Fri, Jan 14, 2011 at 6:57 AM, arnaud gaudinat arnaud.gaudi...@gmail.com wrote: Hello, I would like to use BoilerPipe (a very good program which cleans HTML content of surplus clutter). I saw that BoilerPipe is inside Tika 0.8 and so should be accessible from Solr, am I right? How can I activate BoilerPipe in Solr? Do I need to change solrconfig.xml (with org.apache.solr.handler.extraction.ExtractingRequestHandler)? Or do I need to modify some code inside Solr? I saw something like TikaCLI -F in the Tika forum (http://www.lucidimagination.com/search/document/242ce3a17f30f466/boilerpipe_integration) — is that the right way? Thanks in advance, Arno.
Re: boilerpipe solr tika howto please
I just looked at TagSoup, and it seems to clean up bad HTML tags to produce a well-formed HTML file. What BoilerPipe does is try to eliminate HTML content that is not part of the useful content for a human reader (i.e. navigation blocks, ads, comments...). Take a look here: http://boilerpipe-web.appspot.com/ and try it with one of your URLs. Another application of this type is 'Readability', which is aimed more at end users (http://lab.arc90.com/experiments/readability/).

Le 14.01.2011 16:51, Adam Estrada a écrit : Is there a drastic difference between this and TagSoup which is already included in Solr? [earlier message quoted in full, snipped]
Re: Improving Solr performance
On Fri, Jan 14, 2011 at 1:56 PM, supersoft elarab...@gmail.com wrote: The tests are performed with a selfmade program. [...]

May I ask what language the program is written in? The reason for asking is to eliminate the possibility of an issue with the threading model, e.g., if you were using Python. Would it be possible for you to run Apache Bench (ab) against your Solr setup, e.g., something like:

# For 10 simultaneous connections
ab -n 100 -c 10 'http://localhost:8983/solr/select/?q=my_query1'

# For 50 simultaneous connections
ab -n 500 -c 50 'http://localhost:8983/solr/select/?q=my_query2'

Please pay attention to the meaning of the -n parameter (there is a slight gotcha there). See man ab for details on usage, or see http://www.derivante.com/2009/05/05/solr-performance-benchmarks-single-vs-multi-core-index-shards/ for an example.

In the last post, I wrote the results of the 100 threads example ordered by the response date. The results ordered by the creation date are: [...]

OK, the numbers make more sense now. As someone else has pointed out, your throughput does increase with more simultaneous queries, and there are better ways to do the measurement. Nevertheless, your results are very much at odds with what we see, and I would like to understand the issue. Regards, Gora
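Gora's caution about the client's threading model can be cross-checked with a small thread-based benchmark on the client side. A minimal Python sketch (function names are mine, not from the thread; the fetch callable is whatever issues the HTTP request):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def timed_queries(fetch, queries, workers):
    """Run fetch() over all queries using `workers` threads.

    Returns (total_wall_time, per_query_latencies) so throughput at
    different concurrency levels can be compared against ab's numbers.
    """
    def one(q):
        t0 = time.perf_counter()
        fetch(q)  # e.g. an HTTP GET against /solr/select
        return time.perf_counter() - t0

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        latencies = list(pool.map(one, queries))
    return time.perf_counter() - start, latencies
```

With fetch being, say, `lambda q: urllib.request.urlopen('http://localhost:8983/solr/select?q=' + q).read()` (host/port taken from the ab examples above), comparing total wall time at 1, 10, and 50 workers should show whether throughput scales the way ab reports, or whether the client itself is the bottleneck.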
Re: boilerpipe solr tika howto please
Hi Arno,

On Jan 14, 2011, at 3:57am, arnaud gaudinat wrote: Hello, I would like to use BoilerPipe (a very good program which cleans the html content from surplus clutter). I saw that BoilerPipe is inside Tika 0.8 and so should be accessible from solr, am I right? [rest of the question quoted in full, snipped]

You need to add the BoilerpipeContentHandler into Tika's content handler chain. Which I'm pretty sure means you'd need to modify Solr, e.g. (in trunk) the TikaEntityProcessor.getHtmlHandler() method. I'd try something like: return new BoilerpipeContentHandler(new ContentHandlerDecorator( Though from a quick look at that code, I'm curious why it doesn't use BodyContentHandler, versus the current ContentHandlerDecorator. -- Ken -- Ken Krugler +1 530-210-6378 http://bixolabs.com e l a s t i c w e b m i n i n g
Re: Variable datasources
I was actually able to figure this out using a slightly different method. Since the databases exist on the same server, I simply made a single datasource with no database selected:

<dataSource url="jdbc:mysql://localhost/" name="content"/>

Then in the queries I qualify with the full database notation, database.table, rather than just table:

<document name="items">
  <entity dataSource="content" name="local" query="select code from master.locals" rootEntity="false">
    <entity dataSource="content" name="item" query="select *, ${local.code} as code from content_${local.code}.item"/>
  </entity>
</document>

It works as expected.
No system property or default value specified for...
I'm trying to dynamically add a core to a multicore system using the following command:

http://localhost:8983/solr/admin/cores?action=CREATE&name=items&instanceDir=items&config=data-config.xml&schema=schema.xml&dataDir=data&persist=true

The data-config.xml looks like this:

<dataConfig>
  <dataSource type="JdbcDataSource" url="jdbc:mysql://localhost/" ... name="server"/>
  <document name="items">
    <entity dataSource="server" name="locals" query="select code from master.locals" rootEntity="false">
      <entity dataSource="server" name="item" query="select '${local.code}' as localcode, items.* FROM ${local.code}_meta.item WHERE item.lastmodified > '${dataimporter.last_index_time}' OR '${dataimporter.request.clean}' != 'false' order by item.objid"/>
    </entity>
  </document>
</dataConfig>

This same configuration works for a core that is already imported into the system, but when trying to add the core with the above command I get the following error:

No system property or default value specified for local.code

So I added a <property/> tag in solr.xml, figuring that it needed some kind of default value for this to work, then I restarted Solr. But now when I try the import I get:

No system property or default value specified for dataimporter.last_index_time

Do I have to define a default value for every variable I will conceivably use for future cores? Is there a way to bypass this error? Thanks in advance
Re: segment gets corrupted (after background merge ?)
OK given that you're seeing non-deterministic results on the same index... I think this is likely a hardware issue or a JRE bug? If you move that index over to another env and run CheckIndex, is it consistent? Mike

On Fri, Jan 14, 2011 at 9:00 AM, Stéphane Delprat stephane.delp...@blogspirit.com wrote:
> So I ran CheckIndex (without -fix) 5 times in a row. Solr was running, but no client connected to it (just the slave, which was synchronizing every 5 minutes).
> Summary: 1: all good; 2: 2 errors (segments 1 & 2); 3: 1 error (segment 2); 4: all good; 5: 1 error (segment 7).
> Seems to me that some random things are happening here. File system is ext3, on a physical server.
> [full CheckIndex logs quoted in the original message above, snipped]
DataImportHandler: full import of a single entity
I've got a DataImportHandler set up with 5 entities. I would like to do a full import on just one entity. Is that possible? I worked around it temporarily by hand editing the dataimport.properties file and deleting the delta line for that one entity, and kicking off a delta. But for (hopefully) obvious reasons, delta is less efficient than full. -jsd-
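For what it's worth, the DataImportHandler wiki describes an entity parameter on the import command, which should let a full-import run just one named entity (handler path and entity name below are illustrative, not from the original message):

```text
http://localhost:8983/solr/dataimport?command=full-import&entity=my_entity&clean=false
```

Note the hedge on clean=false: full-import cleans the index by default, which would wipe the documents loaded by the other four entities, so it likely needs to be disabled when importing a single entity this way.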
MaxRows and disabling sort
Hi, I want to limit my Solr results so that it stops searching once it finds a certain number of records (just like LIMIT in MySQL). I know it has the timeAllowed property, but is there anything like a maxRows? I am NOT talking about the 'rows' attribute, which returns a specific number of rows to the client. This seems a very nice way to stop Solr from traversing the complete index, but I am not sure if there is anything like this. Also, I guess default sorting is on score, and sorting can only be done once it has the scores of all matches, so limiting it to the max rows becomes useless. So is there a way to disable sorting, e.g. so that it returns rows as it finds them, without any order? Thanks! -- Regards, Salman Akram Cell: +92-321-4391210
Re: Multi-word exact keyword case-insensitive search suggestions
This might work: Define your field to use WhitespaceTokenizer and LowerCaseFilterFactory, then use a filter query referencing this field. If you want the words to appear in their exact order, you could just define the pf field in your dismax. Best, Erick

On Thu, Jan 13, 2011 at 8:01 PM, Estrada Groups estrada.adam.gro...@gmail.com wrote: Ahhh... the fun of open source software ;-). Requires a ton of trial and error! I found what worked for me and figured it was worth passing it along. If you don't mind... when you sort everything out on your end, please post results for the rest of us to take a gander at. Cheers, Adam

On Jan 13, 2011, at 9:08 PM, Chamnap Chhorn chamnapchh...@gmail.com wrote: Thanks for your reply. However, it doesn't work for my case at all. I think it's a problem with the query parser or something else. It forces me to put double quotes around the search query in order to get results found:

<str name="rawquerystring">sim 010</str>
<str name="querystring">sim 010</str>
<str name="parsedquery">+DisjunctionMaxQuery((keyphrase:sim 010)) ()</str>
<str name="parsedquery_toString">+(keyphrase:sim 010) ()</str>

<str name="rawquerystring">smart mobile</str>
<str name="querystring">smart mobile</str>
<str name="parsedquery">+((DisjunctionMaxQuery((keyphrase:smart)) DisjunctionMaxQuery((keyphrase:mobile)))~2) ()</str>
<str name="parsedquery_toString">+(((keyphrase:smart) (keyphrase:mobile))~2) ()</str>

The intent here is to do a full-text search, part of which is to search the keyword field, so I can't quote it.

On Thu, Jan 13, 2011 at 10:30 PM, Adam Estrada estrada.adam.gro...@gmail.com wrote: Hi, the following seems to work pretty well.
<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="4" outputUnigrams="true" outputUnigramIfNoNgram="false"/>
  </analyzer>
</fieldType>

<!-- A text field that uses WordDelimiterFilter to enable splitting and matching of
     words on case-change, alpha numeric boundaries, and non-alphanumeric chars, so
     that a query of "wifi" or "wi fi" could match a document containing "Wi-Fi".
     Synonyms and stopwords are customized by external files, and stemming is enabled.
     The attribute autoGeneratePhraseQueries="true" (the default) causes words that get
     split to form phrase queries. For example, WordDelimiterFilter splitting text:pdp-11
     will cause the parser to generate text:"pdp 11" rather than (text:PDP OR text:11).
     NOTE: autoGeneratePhraseQueries="true" tends to not work well for
     non-whitespace-delimited languages. -->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <!-- Case insensitive stop word removal. Add enablePositionIncrements=true in both
         the index and query analyzers to leave a 'gap' for more accurate phrase queries. -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

<copyField source="cat" dest="text"/>
<copyField source="subject" dest="text"/>
<copyField source="summary" dest="text"/>
<copyField source="cause" dest="text"/>
<copyField source="status" dest="text"/>
<copyField source="urgency" dest="text"/>

I ingest the source fields as text_ws (I know I've changed it a bit) and then copy the field to text. This seems to do what you are asking for.
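Erick's earlier suggestion in this thread (WhitespaceTokenizer plus LowerCaseFilterFactory for case-insensitive, order-preserving matching) could look like the following schema fragment. The type and field names here are hypothetical, not from the thread:

```xml
<!-- Hypothetical type: whitespace-split tokens, lowercased, no stemming -->
<fieldType name="text_exactish" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="keyword_exact" type="text_exactish" indexed="true" stored="false" multiValued="true"/>
```

A filter query such as fq=keyword_exact:"printing house" would then match regardless of case, and setting this field as pf in a dismax handler would boost documents where the words appear in that exact order.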
Re: solr speed issues..
You haven't given us much information here; it might help to review http://wiki.apache.org/solr/UsingMailingLists. In addition to Kenf_nc's comments, your sorting may be an issue, especially if you're measuring the first query times. What does debugQuery=on show? How many docs are in your index? How much RAM are you allocating to the JVM? Have you looked at your cache statistics on the admin page? Best, Erick

On Fri, Jan 14, 2011 at 3:26 AM, saureen saureen_ad...@yahoo.co.in wrote: I am working on an application that requires fetching results from Solr based on a date parameter. Earlier I was using sharding to fetch the results, but that was making things too slow, so instead of sharding I queried three different cores with the same parameters and merged the results; things are still slow. For one call I generally get around 500 to 1000 docs from Solr, so basically I am including the following parameters in the URL for the Solr call:

sort=created+desc
json.nl=map
wt=json
rows=1000
version=1.2
omitHeader=true
fl=title
start=0
q=apple
qt=standard
fq=created:[date1 TO date2]

It's taking a long time to get the results; any solution to the above problem would be great.
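Erick's debugQuery=on suggestion applied to the query from the original message would look roughly like this (host and port assumed from the standard Solr examples; date1/date2 stand in for the real range, as in the original):

```text
http://localhost:8983/solr/select?q=apple&fq=created:[date1%20TO%20date2]&sort=created+desc&rows=1000&wt=json&debugQuery=on
```

The debug section of the response includes a timing breakdown per search component, which should show whether the time is going into the query itself, the sort, or elsewhere.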
Re: MaxRows and disabling sort
Why do you want to do this? That is, what problem do you think would be solved by this? Because there are other problems if you're trying to, say, return all rows that match But no, there's nothing that I know of that would do what you want (of course that doesn't mean there isn't). Best Erick On Fri, Jan 14, 2011 at 12:17 PM, Salman Akram salman.ak...@northbaysolutions.net wrote: Hi, I want to limit my SOLR results so that it stops further searching once it founds a certain number of records (just like 'limit' in MySQL). I know it has timeAllowed property but is there anything like MaxRows? I am NOT talking about 'rows' attribute which returns a specific no. of rows to client. This seems a very nice way to stop SOLR from traversing through the complete index but I am not sure if there is anything like this. Also I guess default sorting is on Scoring and sorting can only be done once it has the scores of all matches so then limiting it to the max rows becomes useless. So if there a way to disable sorting? e.g. it returns the rows as it finds without any order? Thanks! -- Regards, Salman Akram Cell: +92-321-4391210
Re: MaxRows and disabling sort
In some cases my search takes too long. I want to show the user partial matches when that happens. The problem with timeAllowed is that, say I set its value to 10 secs: for some queries that would be fine and it will return at least a few hundred rows, but in really bad scenarios it might not return even a few records in that time (even 0 is entirely possible), so the user would think nothing matched even though there were many matches. Telling Solr to return the first 20/50 records would ensure that it at least returns the user the first page, even if it takes more time.

On Sat, Jan 15, 2011 at 3:11 AM, Erick Erickson erickerick...@gmail.com wrote: Why do you want to do this? That is, what problem do you think would be solved by this? Because there are other problems if you're trying to, say, return all rows that match. But no, there's nothing that I know of that would do what you want (of course that doesn't mean there isn't). Best, Erick [earlier message quoted in full, snipped]

-- Regards, Salman Akram Senior Software Engineer - Tech Lead 80-A, Abu Bakar Block, Garden Town, Pakistan Cell: +92-321-4391210
Re: MaxRows and disabling sort
: Also I guess default sorting is on Scoring and sorting can only be done once : it has the scores of all matches so then limiting it to the max rows becomes : useless. So if there a way to disable sorting? e.g. it returns the rows as : it finds without any order? http://wiki.apache.org/solr/CommonQueryParameters#sort You can sort by index id using sort=_docid_ asc or sort=_docid_ desc if you specify _docid_ asc then solr should return as soon as it finds the first N matching results w/o scoring all docs (because no score will be computed) if you use any complex features however (faceting or what not) then it will still most likely need to scan all docs. -Hoss
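Hoss's _docid_ trick, written out as a concrete request (host, port, and query value are placeholders, matching the examples used elsewhere in this digest):

```text
http://localhost:8983/solr/select?q=your_query&sort=_docid_+asc&rows=50
```

Because no score is computed with this sort, Solr can stop as soon as the first 50 matches are collected, which is the closest equivalent to MySQL's LIMIT that the thread identifies; adding faceting or similar features would force a full scan again.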
Re: Multi-word exact keyword case-insensitive search suggestions
Ahh, thanks guys for helping me! Adam's solution doesn't work for me. Here are my field, fieldType, and Solr query:

<fieldType name="text_keyword" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="4" outputUnigrams="true" outputUnigramIfNoNgram="false"/>
  </analyzer>
</fieldType>

<field name="keyphrase" type="text_keyword" indexed="true" stored="false" multiValued="true"/>

http://localhost:8081/solr/select?q=printing%20house&qf=keyphrase&debugQuery=on&defType=dismax

<str name="parsedquery">+((DisjunctionMaxQuery((keyphrase:smart)) DisjunctionMaxQuery((keyphrase:mobile)))~2) ()</str>
<str name="parsedquery_toString">+(((keyphrase:smart) (keyphrase:mobile))~2) ()</str>
<lst name="explain"/>

The result is not found.

Erick's solution works for me. However, I can't use a filter query, since this is part of a full-text search; if I put an fq, it would return only documents that match the query exactly. I want to show the exact matches at the top, and the partially matching documents below them. The problem is that when the user searches for one word (e.g. "printing" from the keyword "printing house"), that document is also included in the results. The other problem is that if the user searches in reverse order (e.g. "house printing"), it's also found.

Cheers

On Sat, Jan 15, 2011 at 3:31 AM, Erick Erickson erickerick...@gmail.com wrote: This might work: Define your field to use WhitespaceTokenizer and LowerCaseFilterFactory... [rest of the thread quoted in full, snipped]