Re: Meaning of "Index" flag under properties and schema
This list strips attachments, so you'll have to figure out another way to show the difference. Cheers, Charlie

On 16/02/2021 15:16, ufuk yılmaz wrote:
There's a collection at our customer's site giving weird exceptions when a particular field is involved (I asked another question detailing that). When I inspected it, there's only one difference between it and the other dozens of fine-working collections: a text_general field in all the other collections has the above configuration (without my artsy paint edits), but only the problematic collection has an "Index" flag with "indexed", "tokenized" and "stored" checked. I never saw this "Index" flag before. What does it mean?

Sent from Mail <https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10

--
Charlie Hull - Managing Consultant at OpenSource Connections Limited
Founding member of The Search Network <https://thesearchnetwork.com/> and co-author of Searching the Enterprise <https://opensourceconnections.com/about-us/books-resources/>
tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828
Re: Why Solr questions on stackoverflow get very few views and answers, if at all?
I've answered a few in my time, but my experience is that if you do so you then get emailed a whole load more questions, some of which aren't even relevant to Solr! Also, quite a few of them are 'here is 3 pages of code, please debug it for me, no I won't tell you the actual error I got'.

This is the best place to come; there's also the IRC channel, the new Slack gateway to this list at https://s.apache.org/solr-slack, and in our own Relevance Slack at http://opensourceconnections.com/slack there's a #solr channel (as well as many others on search & relevance topics). Solr is 'hot' (but not as hot as Elasticsearch), and search is still a niche business overall. HTH. Cheers, Charlie

On 12/02/2021 10:37, ufuk yılmaz wrote:
Is it because the main place for questions is this mailing list, or somewhere else that I don't know? Or is Solr not as 'hot' as some other topics?

Sent from Mail for Windows 10
Re: SOLR upgrade
Hi Lulu,

I'm afraid you're going to have to recognise that Solr 5.2.1 is very out of date, and the changes between this version and the current 8.x releases are significant. A direct jump is, I think, the only sensible option. Although you could take the current configuration and attempt to upgrade it to work with 8.x, I recommend you take the chance to look at your whole infrastructure (from data ingestion through to query construction) and consider what needs upgrading or redesigning for both performance and future-proofing. You shouldn't just attempt a lift-and-shift of the current setup: some things just won't work, and some may lock you into future issues. If you're running at large scale (I've talked to some people at the BL before and I know you have some huge indexes there!) then a redesign may be necessary for scalability reasons (cost and feasibility). You should also consider your skills base and how the team can stay up to date with Solr changes and modern search practice.

Hope this helps - this is a common situation which I've seen many times before, and yours is certainly not the oldest version of Solr I've seen running recently either!

best Charlie

On 09/02/2021 01:14, Paul, Lulu wrote:
Hi SOLR team,

Please may I ask for advice regarding upgrading the SOLR version (our project is currently running on solr-5.2.1) to the latest version? What are the steps, breaking changes and potential issues? Could this be done as an incremental version upgrade or a direct jump to the newest version?

Much appreciate the advice, thank you!

Best Wishes
Lulu
Re: Solr Slack Workspace
Relevance Slack is open to anyone working on search & relevance - #solr is only one of the channels, there's lots more! Hope to see you there. Cheers, Charlie
https://opensourceconnections.com/slack

On 16/01/2021 02:18, matthew sporleder wrote:
IRC has kind of died off; https://lucene.apache.org/solr/community.html has a Slack mentioned. I'm on https://opensourceconnections.com/slack after taking their Solr training class, and assume it's mostly open to the Solr community.

On Fri, Jan 15, 2021 at 8:10 PM Justin Sweeney wrote:
Hi all, I did some googling and didn't find anything, but is there a Slack workspace for Solr? I think this could be useful to expand interaction within the community of Solr users and connect people solving similar problems. I'd be happy to get this set up if it does not exist already. Justin
Re: Handling acronyms
I'm wondering if you should be applying these acronyms at index time, not search time. It will make your index bigger and you'll have to re-index to add new synonyms (as they may apply to old documents), but this could be an occasional task, and in the meantime you could use query-time synonyms for the new ones. Maintaining 9,000 synonyms in Solr's synonyms.txt file seems unwieldy to me. Cheers, Charlie

On 15/01/2021 09:48, Shaun Campbell wrote:
I have a medical journals search application and I've a list of some 9,000 acronyms like this:

MSNQ=>MSNQ Multiple Sclerosis Neuropsychological Screening Questionnaire
SRN=>SRN Stroke Research Network
IGBP=>IGBP isolated gastric bypass
TOMADO=>TOMADO Trial of Oral Mandibular Advancement Devices for Obstructive sleep apnoea–hypopnoea
SRM=>SRM standardised response mean
SRT=>SRT substrate reduction therapy
SRS=>SRS Sexual Rating Scale
SRU=>SRU stroke rehabilitation unit
T2w=>T2w T2-weighted
Ab-P=>Ab-P Aberdeen participation restriction subscale
MSOA=>MSOA middle-layer super output area
SSA=>SSA site-specific assessment
SSC=>SSC Study Steering Committee
SSB=>SSB short-stretch bandage
SSE=>SSE sum squared error
SSD=>SSD social services department
NVPI=>NVPI Nausea and Vomiting of Pregnancy Instrument

I tried to put them in a synonyms file, either just with a comma between, or with an arrow in between and the acronym repeated on the right like above, and no matter what I try I'm getting really strange search results. It's like words in one acronym are matching the same word in another acronym, and then it's searching with that acronym, which is completely unrelated. I don't think Solr can handle this, but does anyone know of any crafty tricks in Solr to handle this situation where I can either search by the acronym or by the text?

Shaun
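Charlie's index-time suggestion could be sketched in the schema like this (a hedged sketch, not tested against Shaun's data; the field type name is illustrative). Expanding synonyms with the graph filter at index time, while keeping the query analyzer plain, bakes the multi-word expansions into the index and avoids the cross-acronym matching Shaun describes:

```xml
<!-- Sketch only: index-time synonym expansion for acronyms -->
<fieldType name="text_acronym" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="acronyms.txt" ignoreCase="true" expand="true"/>
    <!-- FlattenGraphFilter is required after graph filters at index time -->
    <filter class="solr.FlattenGraphFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Entries in acronyms.txt would then use the comma form (e.g. `SRN, Stroke Research Network`), and, as Charlie notes, adding new acronyms requires a re-index.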
Re: Solr using all available CPU and becoming unresponsive
[...] much higher, but we have reduced it to try to address this issue.

The behavior we see: Solr is normally using ~3-6GB of heap and we usually have ~20GB of free memory. Occasionally, though, Solr is not able to free up memory and the heap usage climbs. Analyzing the GC logs shows a sharp incline of usage, with the GC (the default CMS) working hard to free memory but not accomplishing much. Eventually it fills up the heap, maxes out the CPUs, and never recovers. We have tried to analyze the logs to see if there are particular queries causing issues or if there are network issues to ZooKeeper, but we haven't been able to find any patterns. After the issues start we often see session timeouts to ZooKeeper, but it doesn't appear that they are the cause. Does anyone have any recommendations on things to try, metrics to look into, or configuration issues I may be overlooking?

Thanks, Jeremy
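The thread doesn't include a resolution, but a common first step when CMS can't keep up is to try G1 and capture detailed GC logs for analysis. A sketch of the relevant solr.in.sh settings (a config fragment; Java 8-era flags shown, and the heap size is illustrative, not a recommendation):

```sh
# solr.in.sh (sketch) - sizes illustrative
SOLR_HEAP="8g"
GC_TUNE="-XX:+UseG1GC -XX:+ParallelRefProcEnabled"
GC_LOG_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps"
```

The GC logs these flags produce can then be fed to an analyzer to see whether the heap is genuinely exhausted or the collector is simply falling behind.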
Re: Improve results/relevance
Hi,

A few strategies you can use:

1. First you need to know why the result has matched. Solr provides detailed debug info, but it's not easy to interpret. Consider using something like www.splainer.io to give you better visibility (disclaimer: this is something we maintain; there are other alternatives, including a cool Chrome plugin). You can now see where scores are being calculated.

2. Next you should read up on how Lucene/Solr edismax scoring works - remember it's a 'winner takes all' strategy. Here's a great blog by Doug on this: https://opensourceconnections.com/blog/2013/07/02/getting-dissed-by-dismax-why-your-incorrect-assumptions-about-dismax-are-hurting-search-relevancy/ . Now you should know why your results are being ordered as they are.

3. You've now got lots of options. You should set up some tests (perhaps use Quepid? www.quepid.com - disclaimer: yes, that's us too!) to monitor what happens as you try each one and to check for side-effects. You could boost exact phrase matches - here's one way to do this: http://everydaydeveloper.blogspot.com/2012/02/solr-improve-relevancy-by-boosting.html - or you could use Querqy, which gives you much more flexibility: https://querqy.org/ (check out SMUI too, as this is a great way to manage Querqy rules).

4. What you're doing is active search tuning for ecommerce, and this won't be the last example you'll come across. You should also implement a system for tracking these kinds of issues, what you do to fix them, and the tests carried out: it's analogous to a bug tracker and something we call a 'Relevancy Register'. Otherwise you'll end up with a huge pile of hacks and will swiftly forget why they were implemented and what problem they were trying to solve!

5. We're running a blog series about ecommerce search which you might want to follow: https://opensourceconnections.com/blog/2020/07/07/meet-pete-the-e-commerce-search-product-manager/

HTH Charlie

On 17/10/2020 04:51, Jayadevan Maymala wrote:
Hi all, We have a catalogue of many products, including smart phones. We use the *edismax* query parser. If someone types in iPhone 11, we get the correct results, but iPhone 11 Pro is coming before iPhone 11. What options can be used to improve this? Regards, Jayadevan

--
Charlie Hull
OpenSource Connections, previously Flax
tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828
web: www.o19s.com
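Doug's 'winner takes all' point in step 2 can be seen in a toy calculation: a DisMax query scores a document by its best single field match, plus a tie-breaker fraction of the remaining field scores. This is a sketch of the scoring formula only, not Solr code, and the numbers are made up:

```python
# Sketch of Lucene/Solr's DisMax "winner takes all" scoring: the combined
# score is the best field score plus tie * (the sum of the rest).

def dismax_score(field_scores, tie=0.0):
    """Combine per-field scores the way a DisjunctionMaxQuery does."""
    best = max(field_scores)
    return best + tie * (sum(field_scores) - best)

# With tie=0.0 only the best-matching field counts, which is why a document
# matching well on one field can outrank one matching weakly on several.
print(dismax_score([2.0, 1.0, 0.5], tie=0.0))  # 2.0 - only the max counts
print(dismax_score([2.0, 1.0, 0.5], tie=0.5))  # 2.75 - others contribute a little
```

In edismax the `tie` request parameter controls this trade-off, which is one of the first knobs to look at once splainer-style debugging shows which field is "winning".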
Re: Solr 7.7 - Few Questions
Nested docs would be one approach; result grouping might be another. Regarding JOINs, the only way you're going to know is by some representative testing. Charlie

On 05/10/2020 05:49, Rahul Goswami wrote:
Charlie, thanks for providing an alternate approach to doing this. It would be interesting to know how one could go about organizing the docs in this case (nested documents?). How would join queries perform on a large index (200 million+ docs)? Thanks, Rahul

On Fri, Oct 2, 2020 at 5:55 AM Charlie Hull wrote:
Hi Rahul,

In addition to the wise advice below: remember in Solr, a 'document' is just the name for the thing that would appear as one of the results when you search (analogous to a database record). It's not the same conceptually as a 'Word document' or a 'PDF document'. If your source documents are so big, consider how they might be broken into parts, or whether you really need to index all of them for retrieval purposes, or what parts of them need to be extracted as text. Thus, the Solr documents don't necessarily need to be as large as your source documents.

Consider an email of size 20KB with ten PDF attachments, each 20MB. You probably shouldn't push all this data into a single Solr document, but you *could* index them as 11 separate Solr documents, with metadata to indicate that one is an email and ten are PDFs, and a shared ID of some kind to indicate they're related. Then at query time there are various ways for you to group these together, so for example if the query hit one of the PDFs you could show the user the original email, plus the 9 other attachments, using the shared ID as a key.

HTH, Charlie

On 02/10/2020 01:53, Rahul Goswami wrote:
Manisha, In addition to what Shawn has mentioned above, I would also like you to re-evaluate your use case. Do you *need to* index the whole document? E.g. if it's an email, the body of the email *might* be more important than any attachments, in which case you could choose to only index the email body and ignore (or only partially index) the text from attachments. If you can afford to index the documents partially, you could consider Solr's "Limit token count filter": see the link below.
https://lucene.apache.org/solr/guide/7_7/filter-descriptions.html#limit-token-count-filter
You'll need to configure it in the schema for the "index" analyzer for the data type of the field with large text. Indexing documents on the order of half a GB will definitely come back to hurt your operations, if not now then later (think OOM, extremely slow atomic updates, long-running merges etc.). - Rahul

On Thu, Oct 1, 2020 at 7:06 PM Shawn Heisey wrote:
On 10/1/2020 6:57 AM, Manisha Rahatadkar wrote:
We are using Apache Solr 7.7 on the Windows platform. The data is synced to Solr using Solr.Net commit. The data is being synced to SOLR in batches. The document size is very huge (~0.5GB average) and Solr indexing is taking a long time. Total document size is ~200GB. As the Solr commit is done as part of the API, the API calls are failing as document indexing is not completed.

A single document is five hundred megabytes? What kind of documents do you have? You can't even index something that big without tweaking configuration parameters that most people don't even know about. Assuming you can even get it working, there's no way that indexing a document like that is going to be fast.

1. What is your advice on syncing such a large volume of data to Solr KB.

What is "KB"? I have never heard of this in relation to Solr.

2. Because of the search requirements, almost 8 fields are defined as Text fields.

I can't figure out what you are trying to say with this statement.

3. Currently Solr_JAVA_MEM is set to 2gb. Is that enough for such a large volume of data?

If just one of the documents you're sending to Solr really is five hundred megabytes, then 2 gigabytes would probably be just barely enough to index one document into an empty index ... and it would probably be doing garbage collection so frequently that it would make things REALLY slow. I have no way to predict how much heap you will need; that will require experimentation. I can tell you that 2GB is definitely not enough.

4. How to set up Solr in production on Windows? Currently it's set up as a standalone engine and the client is requested to take a backup of the drive. Is there any better way to do this? How to set up for disaster recovery?

I would suggest NOT doing it on Windows. My reasons for that come down to costs -- a Windows Server license isn't cheap. That said, there's nothing wrong with running on Windows, but you're on your own as far as running it as a service. We only have a service installer for UNIX-type systems; most of the testing for that is done on Linux.

5. How to benchmark the system requirements for such huge data

I do not know what all your needs are, so I have no way to answer this. You're going to know a lot more about it than any of us are.

Thanks, Shawn
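Charlie's email-plus-attachments example could be sketched as follows. The field names (`doc_type`, `thread_id`) are hypothetical, not from any real schema; the resulting docs would be posted to Solr by whatever client you use:

```python
# Sketch of the email-plus-attachments idea above: one Solr document per
# part, tied together by a shared ID.

def explode_email(email_id, body_text, attachment_texts):
    """Return one doc for the email body plus one per attachment."""
    docs = [{"id": email_id, "doc_type": "email",
             "thread_id": email_id, "text": body_text}]
    for i, text in enumerate(attachment_texts):
        docs.append({"id": f"{email_id}-att{i}", "doc_type": "pdf",
                     "thread_id": email_id, "text": text})
    return docs

# An email with two PDF attachments becomes three Solr documents; a hit on
# either PDF can be grouped back to the email via the shared thread_id.
docs = explode_email("msg42", "email body text", ["pdf one", "pdf two"])
print(len(docs))  # 3
```

At query time, grouping or collapsing on `thread_id` would reassemble the parts into one logical result.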
Re: Solr 7.7 - Few Questions
Hi Rahul,

In addition to the wise advice below: remember in Solr, a 'document' is just the name for the thing that would appear as one of the results when you search (analogous to a database record). It's not the same conceptually as a 'Word document' or a 'PDF document'. If your source documents are so big, consider how they might be broken into parts, or whether you really need to index all of them for retrieval purposes, or what parts of them need to be extracted as text. Thus, the Solr documents don't necessarily need to be as large as your source documents.

Consider an email of size 20KB with ten PDF attachments, each 20MB. You probably shouldn't push all this data into a single Solr document, but you *could* index them as 11 separate Solr documents, with metadata to indicate that one is an email and ten are PDFs, and a shared ID of some kind to indicate they're related. Then at query time there are various ways for you to group these together, so for example if the query hit one of the PDFs you could show the user the original email, plus the 9 other attachments, using the shared ID as a key.

HTH, Charlie

On 02/10/2020 01:53, Rahul Goswami wrote:
Manisha, In addition to what Shawn has mentioned above, I would also like you to re-evaluate your use case. Do you *need to* index the whole document? E.g. if it's an email, the body of the email *might* be more important than any attachments, in which case you could choose to only index the email body and ignore (or only partially index) the text from attachments. If you can afford to index the documents partially, you could consider Solr's "Limit token count filter": see the link below.
https://lucene.apache.org/solr/guide/7_7/filter-descriptions.html#limit-token-count-filter
You'll need to configure it in the schema for the "index" analyzer for the data type of the field with large text. Indexing documents on the order of half a GB will definitely come back to hurt your operations, if not now then later (think OOM, extremely slow atomic updates, long-running merges etc.). - Rahul

On Thu, Oct 1, 2020 at 7:06 PM Shawn Heisey wrote:
On 10/1/2020 6:57 AM, Manisha Rahatadkar wrote:
We are using Apache Solr 7.7 on the Windows platform. The data is synced to Solr using Solr.Net commit. The data is being synced to SOLR in batches. The document size is very huge (~0.5GB average) and Solr indexing is taking a long time. Total document size is ~200GB. As the Solr commit is done as part of the API, the API calls are failing as document indexing is not completed.

A single document is five hundred megabytes? What kind of documents do you have? You can't even index something that big without tweaking configuration parameters that most people don't even know about. Assuming you can even get it working, there's no way that indexing a document like that is going to be fast.

1. What is your advice on syncing such a large volume of data to Solr KB.

What is "KB"? I have never heard of this in relation to Solr.

2. Because of the search requirements, almost 8 fields are defined as Text fields.

I can't figure out what you are trying to say with this statement.

3. Currently Solr_JAVA_MEM is set to 2gb. Is that enough for such a large volume of data?

If just one of the documents you're sending to Solr really is five hundred megabytes, then 2 gigabytes would probably be just barely enough to index one document into an empty index ... and it would probably be doing garbage collection so frequently that it would make things REALLY slow. I have no way to predict how much heap you will need; that will require experimentation. I can tell you that 2GB is definitely not enough.

4. How to set up Solr in production on Windows? Currently it's set up as a standalone engine and the client is requested to take a backup of the drive. Is there any better way to do this? How to set up for disaster recovery?

I would suggest NOT doing it on Windows. My reasons for that come down to costs -- a Windows Server license isn't cheap. That said, there's nothing wrong with running on Windows, but you're on your own as far as running it as a service. We only have a service installer for UNIX-type systems; most of the testing for that is done on Linux.

5. How to benchmark the system requirements for such huge data

I do not know what all your needs are, so I have no way to answer this. You're going to know a lot more about it than any of us are.

Thanks, Shawn
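The "Limit token count filter" Rahul links to is configured in the index-time analyzer. A sketch (the field type name and token cap are illustrative, not recommendations):

```xml
<!-- Sketch: cap the number of tokens indexed from very large text fields -->
<fieldType name="text_capped" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LimitTokenCountFilterFactory" maxTokenCount="10000" consumeAllTokens="false"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With `consumeAllTokens="false"` the analyzer stops reading input once the cap is reached, which is the point of the filter for half-gigabyte documents.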
Re: Solr training
Hi Matthew & all,

Why not? Try the code 'evenearlier' for a further discount! (Oh, and we extended the earlybird period for another week.) Cheers, Charlie

On 17/09/2020 21:00, matthew sporleder wrote:
Is there a friends-on-the-mailing-list discount? I had a bit of sticker shock!

On Wed, Sep 16, 2020 at 9:38 AM Charlie Hull wrote:
I do of course mean 'Group Discounts': you don't get a discount for being in a 'froup', sadly (I wasn't even aware that was a thing!) Charlie

On 16/09/2020 13:26, Charlie Hull wrote:
Hi all, We're running our Solr Think Like a Relevance Engineer training 6-9 Oct - you can find out more & book tickets at https://opensourceconnections.com/training/solr-think-like-a-relevance-engineer-tlre/ The course is delivered over 4 half-days from 9am EST / 2pm BST / 3pm CET and is led by Eric Pugh, who co-wrote the first book on Solr and is a Solr Committer. It's suitable for all members of the search team - search engineers, data scientists, even product owners who want to know how Solr search can be measured & tuned. Delivered by working relevance engineers, the course features practical exercises and will give you a great foundation in how to use Solr to build great search. The early bird discount expires at the end of this week, so do book soon if you're interested! Froup discounts also available. We're also running a more advanced course on Learning to Rank a couple of weeks later - you can find all our training courses and dates at https://opensourceconnections.com/training/ Cheers Charlie
Re: Solr training
I do of course mean 'Group Discounts': you don't get a discount for being in a 'froup', sadly (I wasn't even aware that was a thing!) Charlie

On 16/09/2020 13:26, Charlie Hull wrote:
Hi all, We're running our Solr Think Like a Relevance Engineer training 6-9 Oct - you can find out more & book tickets at https://opensourceconnections.com/training/solr-think-like-a-relevance-engineer-tlre/ The course is delivered over 4 half-days from 9am EST / 2pm BST / 3pm CET and is led by Eric Pugh, who co-wrote the first book on Solr and is a Solr Committer. It's suitable for all members of the search team - search engineers, data scientists, even product owners who want to know how Solr search can be measured & tuned. Delivered by working relevance engineers, the course features practical exercises and will give you a great foundation in how to use Solr to build great search. The early bird discount expires at the end of this week, so do book soon if you're interested! Froup discounts also available. We're also running a more advanced course on Learning to Rank a couple of weeks later - you can find all our training courses and dates at https://opensourceconnections.com/training/ Cheers Charlie
Solr training
Hi all,

We're running our Solr Think Like a Relevance Engineer training 6-9 Oct - you can find out more & book tickets at https://opensourceconnections.com/training/solr-think-like-a-relevance-engineer-tlre/

The course is delivered over 4 half-days from 9am EST / 2pm BST / 3pm CET and is led by Eric Pugh, who co-wrote the first book on Solr and is a Solr Committer. It's suitable for all members of the search team - search engineers, data scientists, even product owners who want to know how Solr search can be measured & tuned. Delivered by working relevance engineers, the course features practical exercises and will give you a great foundation in how to use Solr to build great search. The early bird discount expires at the end of this week, so do book soon if you're interested! Froup discounts also available.

We're also running a more advanced course on Learning to Rank a couple of weeks later - you can find all our training courses and dates at https://opensourceconnections.com/training/

Cheers Charlie
Re: PDF extraction using Tika
Hi Joe,

Tika is pretty amazing at coping with the things people throw at it, and I know the team behind it have added a very extensive testing framework. However, the reality is that malformed, huge or just plain crazy documents may cause crashes - PDFs are mad, you can even embed JavaScript in them I believe, and I've also seen PDFs running to thousands of pages. There's *no way* to design out every possible crash, and it's far better to design your system to cope if necessary by separating the PDF processing from Solr. Charlie

On 25/08/2020 11:46, Joe Doupnik wrote:
More properly, it would be best to fix Tika and thus not push extra complexity upon many, many users. Error handling is one thing; crashes, though, ought to be designed out. Thanks, Joe D.

On 25/08/2020 10:54, Charlie Hull wrote:
On 25/08/2020 06:04, Srinivas Kashyap wrote:
Hi Alexandre, Yes, these are the same PDF files running on Windows and Linux. There are around 30 PDF files and I tried indexing a single file, but faced the same error. Is it related to how the PDFs are stored on Linux?

Did you try running Tika (the same version as you're using in Solr) standalone on the file, as Alexandre suggested?

And with regard to DIH and Tika going away, can you share any program which extracts from PDF and pushes into Solr?

https://lucidworks.com/post/indexing-with-solrj/ is one example. You should run Tika separately, as it's entirely possible for it to fail to parse a PDF and crash - and if you're running it in DIH & Solr it then brings down everything. Separate your PDF processing from your Solr indexing.
Cheers Charlie

Thanks, Srinivas Kashyap

-----Original Message-----
From: Alexandre Rafalovitch
Sent: 24 August 2020 20:54
To: solr-user
Subject: Re: PDF extraction using Tika

The issue seems to be with a specific file, and at a level way below Solr's or possibly even Tika's:

Caused by: java.io.IOException: expected='>' actual=' ' at offset 2383
at org.apache.pdfbox.pdfparser.BaseParser.readExpectedChar(BaseParser.java:1045)

Are you indexing the same files on Windows and Linux? I am guessing not. I would try to narrow down which of the files it is. One way could be to get a standalone Tika (make sure to match the version Solr embeds) and run it over the documents by itself. It will probably complain with the same error. Regards, Alex.

P.s. Additionally, both DIH and embedded Tika are not recommended for production, and both will be going away in future Solr versions. You may have a much less brittle pipeline if you save the structured outputs from those standalone Tika runs and then index them into Solr, possibly pre-processed.

On Mon, 24 Aug 2020 at 11:09, Srinivas Kashyap wrote:
Hello, We are using TikaEntityProcessor to extract the content out of PDFs and make the content searchable. When Jetty is run on a Windows-based machine, we are able to successfully load documents using a full-import DIH (Tika entity). Here the PDFs are maintained in the Windows file system.
But when Jetty/Solr is run on a Linux machine and we try to run DIH, we get the below exception (here the PDFs are maintained in the Linux filesystem):

Full Import failed: java.lang.RuntimeException: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to read content Processing Document # 1
at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:271)
at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:424)
at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:483)
at org.apache.solr.handler.dataimport.DataImporter.lambda$runAsync$0(DataImporter.java:466)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to read content Processing Document # 1
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:417)
at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:330)
at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:233)
... 4 more
Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to read content Processing Document # 1
at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:69)
at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:171)
at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:267)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.j
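Charlie's "separate your PDF processing from Solr" advice boils down to isolating the crash-prone extraction step so one bad file never takes indexing down. A minimal sketch: here `parse` stands in for a real Tika invocation (e.g. shelling out to tika-app), and `fake_parse` just simulates the PDFBox error from the thread:

```python
# Sketch: run crash-prone document extraction in isolation, skip failures,
# and index the rest. Nothing here calls Solr or Tika directly.

def extract_safely(paths, parse):
    """Return (extracted, failed); never raises, whatever parse does."""
    extracted, failed = [], []
    for path in paths:
        try:
            extracted.append((path, parse(path)))
        except Exception:
            failed.append(path)  # log it, skip it, index the rest
    return extracted, failed

def fake_parse(path):
    # Simulates the kind of low-level parse error seen above
    if path.endswith("bad.pdf"):
        raise IOError("expected='>' actual=' ' at offset 2383")
    return "text of " + path

ok, failed = extract_safely(["a.pdf", "bad.pdf"], fake_parse)
print(failed)  # ['bad.pdf']
```

In a real pipeline the failed list would be logged for investigation while the successfully extracted text is posted to Solr, which is exactly the decoupling Alexandre and Charlie recommend over running Tika inside DIH.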
Re: SOLR indexing takes longer time
1. You could write some code to pull the items out of Mongo and dump them to disk - if this is still slow, then it's Mongo that's the problem.
2. Write a standalone indexer to replace DIH; it's single-threaded and deprecated anyway.
3. Minor point - consider whether you need to index everything every time, or just the deltas.
4. Upgrade Solr anyway - not for speed reasons, but because that's a very old version you're running.

HTH Charlie

On 17/08/2020 19:22, Abhijit Pawar wrote:
Hello, We are indexing some 200K-plus documents in SOLR 5.4.1 with no shards/replicas and just a single core. It takes almost 3.5 hours to index that data. I am using a data import handler to import data from the mongo database. Is there something we can do to reduce the time taken to index? Will an upgrade to a newer version help? Appreciate your help! Regards, Abhijit
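Points 2 and 3 together - a standalone indexer that sends only deltas, in batches - might be sketched like this. The field names and batch size are illustrative, and the real send step would POST each batch to Solr's update handler:

```python
# Sketch of a standalone indexer loop replacing DIH: select only records
# changed since the last run, then send them in bounded batches.

def deltas_since(records, last_run):
    """Keep only records modified after the last indexing run."""
    return [r for r in records if r["modified"] > last_run]

def batches(items, size):
    """Yield fixed-size chunks so each POST/commit stays bounded."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

records = [{"id": 1, "modified": 10},
           {"id": 2, "modified": 20},
           {"id": 3, "modified": 30}]
changed = deltas_since(records, last_run=15)
groups = list(batches(changed, size=1))
print(len(changed), len(groups))  # 2 2
```

A real indexer could also run several such batch senders in parallel threads, which is the main speed advantage over single-threaded DIH.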
Re: Querying solr using many QueryParser in one call
Hi, It's very hard to answer questions like 'how fast/slow might this be' - the best way to find out is to try, e.g. to build a prototype that you can time. To be useful this prototype should use representative data and queries. Once you have this, you can try improving performance with strategies like the caching you describe. Charlie On 16/07/2020 18:14, harjag...@gmail.com wrote: Hi All, Below are questions regarding querying Solr using many query parsers in one call. We need to do a search by keyword and also include a few specific documents in the result. We don't want to use the elevator component as that would put those mandatory documents at the top of the result. We would like to mix those mandatory documents with the organic keyword lookup result set and also make sure those mandatory documents take part in other scoring mechanisms like bq's. On top of this we would also need to classify documents matched by keyword lookup against mandatory docs. We ended up with the below Solr query params to achieve it. fl=id,title,isTermMatch:exists(query({!type=edismax qf=$qf v=blah})),score q=({!edismax qf=$qf v=$searchQuery mm=$mm}) OR ({!edismax qf=$qf v=$docIdQuery mm=0 sow=true}) docIdQuery=5985612 6339445 5357348 searchQuery=blah Below are my questions: 1. As you can see we are calling three query parsers in one call - what would be the performance implication of the search? 2. Two of those queries - the one in q and the one in fl - are the same; would the query result cache help? 3. In general, what are the implications on performance when we do a search calling multiple query parsers in a single call? -- Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html
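Charlie's "build a prototype and time it" advice can be as small as a timing harness like the one below. It times any zero-argument callable; in practice you would point it at a urllib/requests call issuing the real multi-parser query against a test core with representative data, and compare the first (cold) run against later runs to see whether the queryResultCache is helping question 2.

```python
import time

def time_call(fn, runs=20):
    """Call `fn` `runs` times and return per-call latencies in seconds.
    Repeated runs reveal cache warm-up effects, e.g. Solr's queryResultCache
    serving the repeated q/fl query after the first request."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        latencies.append(time.perf_counter() - start)
    return latencies
```

Compare `latencies[0]` against `min(latencies[1:])`: a large gap suggests caching is absorbing the repeated parser work.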
Re: Sitecore 9.3 / Solr 8.1.1 - Zookeeper Issue
rj.impl.Http2SolrClient.request(Http2SolrClient.java:416) Thanks! Austin Kimmel Software Developer Vail Resorts, Inc. 303-404-1922 akim...@vailresorts.com
Re: SOLR Exact phrase search issue
On 14/07/2020 12:48, Erick Erickson wrote: This is almost certainly a mismatch between what you think is happening and what you’ve actually told Solr to do ;). That's a great one-line explanation of 90% of the issues people face with Solr :-) Charlie Best, Erick On Jul 14, 2020, at 7:05 AM, Villalba Sans, Raúl wrote: Hello, We have an app that uses SOLR as its search engine. We have detected incorrect behavior for which we find no explanation. If we perform a search with the phrase "Què t’hi jugues" we do not receive any results, although we know that there is a result that contains this phrase. However, if we search for "Què t’hi" or for "t’hi jugues" we do find results, including "Què t’hi jugues". We attach screenshots of the search tool and the xml of the results. We would greatly appreciate it if you could lend a hand in trying to find a solution or identify the cause of the problem. Search 1 – “Què t’hi jugues” Search 2 – “Què t’hi” Search 3 – “t’hi jugues” Best regards, Raül Villalba Sans Delivery Centers – Centros de Producción Parque de Gardeny, Edificio 28 25071 Lleida, España T +34 973 193 580
I Became a Solr Committer in 4662 Days. Here’s how you can do it faster!
Hi all, Thought you might enjoy Eric's blog, it's taken him a while! Some good hints here for those of you interested in contributing more to Solr. https://opensourceconnections.com/blog/2020/07/10/i-became-a-solr-committer-in-4662-days-heres-how-you-can-do-it-faster/ Cheers Charlie
Re: solr fq with contains not returning any results
It looks like something in your query analyzer chain is turning the wildcard operators '*' into the word 'star' - maybe you need to dig into your analyzers, synonym lists etc. and see where this is happening. The admin/analysis panel that Erick suggests lets you enter data and see what happens once your analyzer chain has processed it - have a go and see what happens. Either that or newer Solr displays the debug information differently, but I don't have two versions here to compare... Charlie On 24/06/2020 19:18, yaswanth kumar wrote: Thanks Erick, I have now added debug=query and found a diff between the old Solr and the new Solr. The new Solr (8.2), which is not giving results, is as follows: "debug":{ "rawquerystring":"*:*", "querystring":"*:*", "parsedquery":"MatchAllDocsQuery(*:*)", "parsedquery_toString":"*:*", "explain":{}, "QParser":"LuceneQParser", "filter_queries":["auto_nsallschools:*bostonschool*"], "parsed_filter_queries":["auto_nsallschools:_star_bostonschool_star_"], Whereas Solr 5.5, which is getting me the results, is as follows: "debug":{ "rawquerystring":"*:*", "querystring":"*:*", "parsedquery":"MatchAllDocsQuery(*:*)", "parsedquery_toString":"*:*", "explain":{}, "QParser":"LuceneQParser", "filter_queries":["auto_nsallschools:*bostonschool*"], "parsed_filter_queries":["auto_nsallschools:*bostonschool*"], I know there are analyzers against this field in the schema, but I'm not getting why it's making a difference here. Thanks, On Wed, Jun 24, 2020 at 9:24 AM Erick Erickson wrote: You need to do several things to track down why. First, use something (admin UI, terms query, etc) to see exactly what’s in your index. The admin/analysis screen is useful here. Second, add debug=query to the query on both machines and see what the actual parsed query looks like. Comparing those should give you a clue. Best, Erick On Jun 24, 2020, at 9:20 AM, yaswanth kumar wrote: "nsallschools":["BostonSchool"] That's how the data is stored against the field. 
We have a functionality where we can do "Starts with, Contains, Ends with"; also if you look at the above schema we are using. Also the strange part is that it's working fine in Solr 5.5 but not in Solr 8.2 - any thoughts?? Thanks, On Wed, Jun 24, 2020 at 3:15 AM Jörn Franke wrote: I don’t know your data, but could it be that you tokenize differently? Why do you do the wildcard search at all? Maybe a different tokenizing strategy can bring you more efficient results? Depends on what you need to achieve of course ... On 24.06.2020 at 05:37, yaswanth kumar wrote: I am using Solr 8.2, and when trying to do fq=auto_nsallschools:*bostonschool*, the data is not being returned. But if I do the same in Solr 5.5 (which I already have; we are in the process of migrating to 8.2) it's returning results. If I do fq=auto_nsallschools:bostonschool or fq=auto_nsallschools:bostonschool* it's returning results, but when I try with contains as described above, or fq=auto_nsallschools:*bostonschool (ends with), it's not returning any results. The field we are using is a copy field and multivalued - am I doing something wrong? Or does 8.2 need some adjustment in the configs? Here is the schema stored="true" multiValued="true"/> indexed="true" stored="false" multiValued="true"/> Thanks, -- Thanks & Regards, Yaswanth Kumar Konathala. yaswanth...@gmail.com
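One way to follow Charlie's advice to inspect the analyzer chain without the admin UI is Solr's field-analysis handler, which reports what each stage of the chain emits for a given input. A sketch that builds such a request (base URL, core and field names here are illustrative, taken from the thread); fetching this on both 5.5 and 8.2 and diffing the output should show which filter turns `*` into `_star_`.

```python
from urllib.parse import urlencode

def analysis_url(base, core, field, value):
    """Build a request to Solr's field-analysis handler, which shows the
    output of every analyzer stage for `value` against `field`'s chain."""
    qs = urlencode({"analysis.fieldname": field,
                    "analysis.fieldvalue": value,
                    "wt": "json"})
    return "%s/solr/%s/analysis/field?%s" % (base, core, qs)
```

For example, `analysis_url("http://localhost:8983", "mycore", "auto_nsallschools", "*bostonschool*")` gives a URL you can open in a browser or with curl on each version.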
Re: Not all EML files are indexing during indexing
I think the OP is indexing flat files, not web pages (but otherwise, I agree with you that Scrapy is great - I know some of the people behind it too and they're a good bunch). Charlie On 02/06/2020 16:41, Walter Underwood wrote: On Jun 2, 2020, at 7:40 AM, Charlie Hull wrote: If it was me I'd probably build a standalone indexer script in Python that did the file handling, called out to a separate Tika service for extraction, posted to Solr. I would do the same thing, and I would base that script on Scrapy (https://scrapy.org). I worked on a Python-based web spider for about ten years. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog)
Re: Not all EML files are indexing during indexing
Ah OK. I haven't used SimplePostTool myself and I note the docs say "View this not as a best-practice code example, but as a standalone example built with an explicit purpose of not having external jar dependencies." I'm wondering if it's some kind of synchronisation issue between new files arriving in the folder and being picked up by your Powershell script. Hard to say really without seeing all the code...perhaps take out the Tika & Solr parts for now and verify the rest of your code really can spot every new or updated file that arrives? If it was me I'd probably build a standalone indexer script in Python that did the file handling, called out to a separate Tika service for extraction, posted to Solr. Cheers Charlie On 02/06/2020 14:48, Zheng Lin Edwin Yeo wrote: Hi Charlie, The main code that is doing the indexing is from the Solr's SimplePostTools, but we have done some modification to it. The walking through a folder is done by PowerShell script, the extracting of the content from .eml file is from Tika that comes with Solr, and the images in the .eml file are done by OCR that comes with Solr. As we have modified the SimplePostTool code to do the checking if the file already exists in the index by running a Solr search query of the ID, I'm thinking if this issue is caused by the PowerShell script or the query in the SimplePostTool code not being able to keep up with the large number of files? Regards, Edwin On Mon, 1 Jun 2020 at 17:19, Charlie Hull wrote: Hi Edwin, What code is actually doing the indexing? AFAIK Solr doesn't include any code for actually walking a folder, extracting the content from .eml files and pushing this data into its index, so I'm guessing you've built something external? 
Charlie On 01/06/2020 02:13, Zheng Lin Edwin Yeo wrote: Hi, I am running this on Solr 7.6.0. Currently I have a situation whereby there are more than 2 million EML files in a folder, and the folder is constantly updating the EML files with the latest information and adding new EML files. When I do the indexing, it is supposed to index the new EML files, and update the index for those EML files whose content has changed. However, I found that not all new EML files are updated with each run of the indexing. Could it be caused by the large number of files in the folder? Or due to some other reasons? Regards, Edwin
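A minimal sketch of the kind of standalone indexer script Charlie describes: mtime-based delta detection over the folder, plus a call out to a separate Tika server for extraction (the standard Tika server accepts a PUT to /tika with Accept: text/plain). The folder layout, URL and the .eml suffix are illustrative, and posting the extracted text to Solr is left out; the point is that the file-handling logic lives outside Solr where you can test it against 2 million files.

```python
import os
import urllib.request

def changed_since(folder, since_ts, suffix=".eml"):
    """Return paths under `folder` modified after `since_ts` - the delta to reindex."""
    out = []
    for root, _dirs, files in os.walk(folder):
        for name in files:
            path = os.path.join(root, name)
            if name.endswith(suffix) and os.path.getmtime(path) > since_ts:
                out.append(path)
    return sorted(out)

def extract_text(tika_url, path):
    """PUT the raw file to a standalone Tika server and get plain text back."""
    with open(path, "rb") as f:
        data = f.read()
    req = urllib.request.Request(tika_url, data=data, method="PUT",
                                 headers={"Accept": "text/plain"})
    return urllib.request.urlopen(req).read().decode("utf-8")
```

Persist the timestamp of each successful run and pass it as `since_ts` on the next one; that makes missed files easy to audit, since `changed_since` can be run on its own to list exactly what a pass should have picked up.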
Re: Not all EML files are indexing during indexing
Hi Edwin, What code is actually doing the indexing? AFAIK Solr doesn't include any code for actually walking a folder, extracting the content from .eml files and pushing this data into its index, so I'm guessing you've built something external? Charlie On 01/06/2020 02:13, Zheng Lin Edwin Yeo wrote: Hi, I am running this on Solr 7.6.0. Currently I have a situation whereby there are more than 2 million EML files in a folder, and the folder is constantly updating the EML files with the latest information and adding new EML files. When I do the indexing, it is supposed to index the new EML files, and update the index for those EML files whose content has changed. However, I found that not all new EML files are updated with each run of the indexing. Could it be caused by the large number of files in the folder? Or due to some other reasons? Regards, Edwin
Haystack is Back! Not just one - but three search conferences
Hi all, So there's no Haystack in Charlottesville this year - but we've done our very best to bring you some of the talks and training we planned online - find out more at https://opensourceconnections.com/blog/2020/05/18/haystack-is-back-go-virtual-for-relevant-search-talks-workshops-discussions-training/ One part of this is three conferences, Berlin Buzzwords, Haystack and MICES, have come together for a week of online talks, workshops, panels and discussions. There's lots of great search related content including Uwe Schindler on Lucene 9, Doug Turnbull & Trey Grainger on AI-Powered Search, Tim Allison of NASA on genetic algorithms, a panel on result diversity, a workshop on the opensource ecommerce search ecosystem...do check it out at www.berlinbuzzwords.de. I'm running a Lightning Talks session too (let me know if you've got a talk). Cheers Charlie
Combined virtual conference announced with content on Solr, search & relevance
The teams behind Berlin Buzzwords <https://berlinbuzzwords.de/>, Haystack <http://www.haystackconf.com> the search relevance conference, and MICES <http://mices.co> the ecommerce search event are happy to announce a week of virtual talks, panel discussions, workshops and training sessions covering themes of search, scale, store! To be held between *7th-12th June 2020*, this collaboration will bring together the best of the planned sessions from three annual conferences postponed or cancelled due to COVID-19 and make them available across the world. We aim to support our three communities and to bring them together to share knowledge, expertise and experiences. Read more here. <https://berlinbuzzwords.de/news/registration-online-event-now-available> Tickets are on sale now at https://berlinbuzzwords.de/tickets - see you there (virtually) we hope. Cheers Charlie
Re: Use TopicStream as percolator
Great! I ran Flax, where we created Luwak, up to last year when we merged with OSC, so this is great to see. Did you know we donated Luwak to Lucene recently? https://issues.apache.org/jira/browse/LUCENE-8766 It would be great to work this up into a Solr contrib module. Charlie On 01/05/2020 09:56, SOLR4189 wrote: Hi everyone, I wrote a SOLR update processor that wraps the Luwak library and implements Saved Searches a la the Elasticsearch Percolator. https://github.com/SOLR4189/solcolator for anyone who wants to use it.
Re: Solr indexing with Tika DIH - ZeroByteFileException
If users can upload any PDF, including broken or huge ones, and some cause a Tika error, you should decouple Tika from Solr and run it as a separate process to extract text before indexing with Solr. Otherwise some of what is uploaded *will* break Solr. https://lucidworks.com/post/indexing-with-solrj/ has some good hints. Cheers Charlie On 11/06/2019 15:27, neilb wrote: Hi, while going through solr logs, I found data import error for certain documents. Here are details about the error. Exception while processing: file document : null:org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to read content Processing Document # 7866 at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:69) at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:171) at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:267) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:476) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:517) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:415) at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:330) at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:233) at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:424) at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:483) at org.apache.solr.handler.dataimport.DataImporter.lambda$runAsync$0(DataImporter.java:466) at java.lang.Thread.run(Unknown Source) Caused by: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:122) at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:165) How do I know which document(document name with path) is 
#7866? And how do I ignore ZeroByteFileException as the document network share is not in my control. Users can upload any size pdfs to it. Thanks!
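Following the decoupling advice above, one way to sidestep ZeroByteFileException when the share itself can't be controlled is to filter empty files out before they ever reach Tika - a sketch, with the logging behaviour as an illustrative choice:

```python
import logging
import os

def files_to_index(paths):
    """Drop empty files - Tika's AutoDetectParser raises ZeroByteFileException
    on zero-byte input - and log what was skipped so it can be audited later."""
    keep = []
    for p in paths:
        if os.path.getsize(p) == 0:
            logging.warning("skipping zero-byte file: %s", p)
        else:
            keep.append(p)
    return keep
```

Running extraction in a separate process also means a single corrupt PDF can crash or hang the extractor without taking Solr down with it.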
Re: solr as a general search engine
Hi Matt, On 21/04/2020 13:41, matthew sporleder wrote: Sorry for the vague question and I appreciate the book recommendations -- I actually think I am mostly confused about suggest vs spellcheck vs morelikethis as they relate to what I referred to as "expected" behavior (like from a typed-in search bar). Suggest - here's some results that might match based on what you've typed so far (usually powered by a behind-the-scenes search of the index with some restrictions). Note the difference between this and autocompletion, which suggests complete search terms from the index based on the partial word you've typed so far. Spellcheck - The word you typed isn't anywhere in the index, so I've used an edit distance algorithm to suggest a few words you might have meant that are in the index (note this isn't spelling correction as the engine doesn't necessarily have the corrected form in its index) Morelikethis - here's some results that share some characteristics with the document you're looking at, e.g. they're indexed by some of the same terms For reference we have been using solr as search in some form for almost 10 years and it's always been great in finding things based on clear keywords, programmatic-type discovery, a nosql/distributed k:v (actually really really good at this) but has always fallen short (imho and also our fault, obviously) in the "typed in a search query" experience. I'm guessing you're bumping into the problem that most people type very little into a search bar, and expect the engine to magically know what they meant. It doesn't of course, so it has to suggest some ways for the user to tell it more specific information - facets for example, or some of the features above. 
We are in the midst of re-developing our internal content ranking system and it has me grasping on how to *really* elevate our game in terms of giving an excellent human-driven discovery vs our current behavior of: "here is everything we have that contains those words, minus ones I took out". I think you need to look at several angles: - What defines a 'good' result in your world/for your content? - Who judges this? How do you record this? Human/clicks/both? - What Solr features *could* help - and how are you going to test that they actually do using the two lines above? We think that building up this measurement-driven, experimental process is absolutely key to improving relevance. Cheers Charlie On Tue, Apr 21, 2020 at 5:35 AM Charlie Hull wrote: Hi Matt, Are you looking for a good, general purpose schema and config for Solr? Well, there's the problem: you need to define what you mean by general purpose. Every search application will have its own requirements and they'll be slightly different to every other application. Yes, there will be some commonalities too. I guess by "as a human might expect one to behave" you mean "a bit like how Google works" but unfortunately Google is a poor example: you won't have Google's money or staff or platform in your company, nor are you likely to be building a massive-scale web search engine, so at best you can just take inspiration from it, not replicate it. In practice, what a lot of people do is start with an example setup (perhaps from one of the examples supplied with Solr, e.g. 'techproducts') and adapt it: or they might start with the Solr configset provided by another framework, e.g. Drupal (yay! Pink Ponies!). Unfortunately the standard example configsets are littered with comments that say things like 'Here is how you *could* do XYZ but please don't actually attempt it this way' and other config sections that if you un-comment them may just get you into further trouble. 
It's grown rather than been built, and to my mind there's a good argument for starting with an absolutely minimal Solr configset and only adding things in as you need them and understand them (see https://lucene.472066.n3.nabble.com/minimal-solrconfig-example-td4322977.html for some background and a great presentation from Alex Rafalovitch on the examples). You're also going to need some background on *why* all these features should be used, and for that I'd recommend my colleague Doug's book Relevant Search https://www.manning.com/books/relevant-search - or maybe our training (quick plug: we're running some online training in a couple of weeks https://opensourceconnections.com/blog/2020/05/05/tlre-solr-remote/ ) Hope this helps, Cheers Charlie On 20/04/2020 23:43, matthew sporleder wrote: Is there a comprehensive/big set of tips for making solr into a search-engine as a human would expect one to behave? I poked around in the nutch github for a minute and found this: https://github.com/apache/nutch/blob/9e5ae7366f7dd51eaa76e77bee6eb69f812bd29b/src/plugin/indexer-solr/schema.xml but I was wondering if I was missing a very obvious document somewhere. I guess I'm looking for things like: use suggester here, use spell
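The suggest / spellcheck / morelikethis distinction drawn earlier in this thread maps onto standard Solr request parameters. A sketch of illustrative request URLs - the handler paths and the dictionary, field and document names all depend on your solrconfig.xml and schema (the doc id here is borrowed from the techproducts example), so treat these as shapes rather than copy-paste requests:

```python
from urllib.parse import urlencode

base = "http://localhost:8983/solr/mycore"  # illustrative core name

# Suggest: complete results/terms from a few typed characters
# (requires a suggester component configured in solrconfig.xml).
suggest = base + "/suggest?" + urlencode(
    {"suggest": "true", "suggest.q": "ipo", "suggest.dictionary": "mySuggester"})

# Spellcheck: "did you mean" terms by edit distance against the index
# (spellcheck component attached to a handler such as /spell).
spell = base + "/spell?" + urlencode(
    {"q": "delll", "spellcheck": "true", "spellcheck.collate": "true"})

# MoreLikeThis: documents sharing significant terms with a given document.
mlt = base + "/select?" + urlencode(
    {"q": "id:SP2514N", "mlt": "true", "mlt.fl": "name,features"})
```

Each of these is a distinct feature with its own configuration; none of them alone makes the "typed a query, got what I meant" experience - that comes from combining them with the measurement-driven tuning described above.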
Re: solr as a general search engine
Hi Matt, Are you looking for a good, general purpose schema and config for Solr? Well, there's the problem: you need to define what you mean by general purpose. Every search application will have its own requirements and they'll be slightly different to every other application. Yes, there will be some commonalities too. I guess by "as a human might expect one to behave" you mean "a bit like how Google works" but unfortunately Google is a poor example: you won't have Google's money or staff or platform in your company, nor are you likely to be building a massive-scale web search engine, so at best you can just take inspiration from it, not replicate it. In practice, what a lot of people do is start with an example setup (perhaps from one of the examples supplied with Solr, e.g. 'techproducts') and adapt it: or they might start with the Solr configset provided by another framework, e.g. Drupal (yay! Pink Ponies!). Unfortunately the standard example configsets are littered with comments that say things like 'Here is how you *could* do XYZ but please don't actually attempt it this way' and other config sections that if you un-comment them may just get you into further trouble. It's grown rather than been built, and to my mind there's a good argument for starting with an absolutely minimal Solr configset and only adding things in as you need them and understand them (see https://lucene.472066.n3.nabble.com/minimal-solrconfig-example-td4322977.html for some background and a great presentation from Alex Rafalovitch on the examples). 
You're also going to need some background on *why* all these features should be used, and for that I'd recommend my colleague Doug's book Relevant Search https://www.manning.com/books/relevant-search - or maybe our training (quick plug: we're running some online training in a couple of weeks https://opensourceconnections.com/blog/2020/05/05/tlre-solr-remote/ ) Hope this helps, Cheers Charlie On 20/04/2020 23:43, matthew sporleder wrote: Is there a comprehensive/big set of tips for making solr into a search-engine as a human would expect one to behave? I poked around in the nutch github for a minute and found this: https://github.com/apache/nutch/blob/9e5ae7366f7dd51eaa76e77bee6eb69f812bd29b/src/plugin/indexer-solr/schema.xml but I was wondering if I was missing a very obvious document somewhere. I guess I'm looking for things like: use suggester here, use spelling there, use DocValues around here, DIY pagerank, etc Thanks, Matt
Re: Indexing data from multiple data sources
The link you quote is Sematext's mirror of the Apache solr-user mailing list. There are others also providing copies of this list. As the cat is very much out of the bag, your best course of action is to change all the logins and passwords that have been leaked and review your security procedures. Cheers Charlie On 18/04/2020 13:27, RaviKiran Moola wrote: Hi, Greetings of the day!!! Unfortunately we enclosed our database source details in a Solr community post while sending our queries to Solr support, as mentioned in the mail below. We find that it has been posted at this link https://sematext.com/opensee/m/Solr/eHNlswSd1vD6AF?subj=RE+Indexing+data+from+multiple+data+sources As it is open to the world, what we are requesting here is: could you please remove that post as soon as possible, before it creates any security issues for us. Your help is very much appreciated!!! FYI, here I'm attaching the below screenshot. Thanks & Regards, Ravikiran Moola From: RaviKiran Moola Sent: Friday, April 17, 2020 9:13 PM To: solr-user@lucene.apache.org Subject: RE: Indexing data from multiple data sources Hi, Greetings!!! We are working on indexing data from multiple data sources (MySQL & MSSQL) in a single collection. We specified data source details like connection details along with the required fields for both data sources in a single data config file, along with the required field details in the managed schema, and here we are fetching the same columns from both data sources by specifying a common "unique key". We are unable to index the data from the data sources using Solr. Here I'm attaching the data config file and a screenshot. 
Data config file: url="jdbc:mysql://182.74.133.92:3306/ra_dev" user="devuser" password="Welcome_009" batchSize="1" /> driver="com.microsoft.sqlserver.jdbc.SQLServerDriver" url="jdbc:sqlserver://182.74.133.92;databasename=BB_SOLR" user="matuser" password="MatDev:07"/> Thanks & Regards, Ravikiran Moola +91-9494924492
Re: FW: Solr proximity search highlighting issue
I may be wrong here, but the problem may be that the match was on your terms pos1 and pos2 (you don't need the pos3 term to match, due to the OR operator) and thus that's what's been highlighted. There's a hl.q parameter that lets you supply a different query for highlighting to the one you're using for searching; perhaps that could have a different and more forgiving pattern that made sure all your terms were highlighted? Also, the XML didn't come through as this list strips attachments. Best Charlie On 31/03/2020 19:27, Anil Shingala wrote: Hello Dev Team, I found some problem in the highlighting module. Not all the search terms are getting highlighted. Sample query: q={!complexphrase+inOrder=true}"pos1 (pos2 OR pos3)"~30=true Indexed text: "pos1 pos2 pos3 pos4" Please find the attached response xml screenshot from Solr. You can see that only two terms are highlighted, like "pos1 pos2 pos3 pos4". The scenario has been the same in the Solr source code for a long time (I have checked Solr version 4 to version 7). The scenario is when term positions are in order in both document and query. Please let me know your view on this. Regards, Anil Shingala Knovos 10521 Rosehaven Street, Suite 300 | Fairfax, VA 22030 (USA) Office +1 703.226.1505 ashing...@knovos.com | www.knovos.com Washington DC | New York | London | Paris | Gandhinagar | Tokyo
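Charlie's hl.q suggestion can be sketched as a pair of parameters: the main q stays strict, while hl.q supplies a looser query used only to pick highlight terms, so all three words get marked regardless of which OR branch matched. The highlight field name here is illustrative.

```python
from urllib.parse import urlencode

# Strict match for scoring/selection; forgiving term list just for highlighting.
params = urlencode({
    "q": '{!complexphrase inOrder=true}"pos1 (pos2 OR pos3)"~30',
    "hl": "true",
    "hl.fl": "content",          # illustrative field name
    "hl.q": "pos1 pos2 pos3",    # loose OR-style query used only by the highlighter
})
```

Appended to the select handler URL, this returns the original result set but highlights every term that appears in hl.q.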
Re: Solr Instance Migration - Server Access
If you can get the server login details you should be able to copy the Solr installation and its configuration. If not, then Solr itself doesn't provide any way to get them - it's just a search engine, it's not responsible for securing a server in any way. Charlie On 26/03/2020 02:13, Landon Cowan wrote: Hello! I’m working on a website for a client that was migrated from another website development company. The previous company used Solr to build out the site search – but they did not send us the server credentials. The developers who built the tool are no longer with the company – is there a process we should follow to secure the credentials? I worry we may need to rebuild the feature from the ground up.
Re: How to get boosted field and values?
Try splainer.io - it parses the debug output to show in detail how the scores are calculated (disclaimer: I work for OSC, who created it - but it's free & open source of course). Charlie On 23/03/2020 01:26, Taisuke Miyazaki wrote: The blog looks like it's going to be useful from now on, so I'll take a look. Thank you. What I wanted, however, was a way to know which field was boosted as a result. But I couldn't find a way to do that, so instead I tried to get the field and value out of the resulting score by putting a binary bit on each field/value pair. It doesn't really matter to me whether it's done additively or multiplicatively, as long as I can know which field was boosted as a result. Do you see what I mean? On Fri, 20 Mar 2020 at 18:56, Alessandro Benedetti wrote: Hi Taisuke, there are various ways of approaching boosting and scoring in Apache Solr. First of all you must decide if you are interested in a multiplicative or additive boost. Multiplicative will multiply the score of your search result by a certain factor, while additive will just add the factor to the final score. 
Using advanced query parsers such as the dismax and edismax you can use the: *boost* parameter - multiplicative - takes a function as input - https://lucene.apache.org/solr/guide/6_6/the-extended-dismax-query-parser.html#TheExtendedDisMaxQueryParser-TheboostParameter *bq* (boost query) - additive - https://lucene.apache.org/solr/guide/6_6/the-dismax-query-parser.html#TheDisMaxQueryParser-Thebq_BoostQuery_Parameter *bf* (boost function) - additive - https://lucene.apache.org/solr/guide/6_6/the-dismax-query-parser.html#TheDisMaxQueryParser-Thebf_BoostFunctions_Parameter This blog post is old but should help: https://nolanlawson.com/2012/06/02/comparing-boost-methods-in-solr/ Then you can boost fields or even specific query clauses: 1) https://lucene.apache.org/solr/guide/6_6/the-dismax-query-parser.html#TheDisMaxQueryParser-Theqf_QueryFields_Parameter 2) q= features:2^1.0 AND features:3^5.0 1.0 is the default - you are multiplying the score contribution of the term by 1.0, so no effect. features:3^5.0 means that the score contribution of a match for the term '3' in the field 'features' will be multiplied by 5.0 (you can also see this by enabling debug=results). Finally you can force the score contribution of a term to be a constant; it's not recommended unless you are truly confident you don't need other types of scoring: q= features:2^=1.0 AND features:3^=5.0 In this example your document id: 3 will have a score of 6.0. Not sure if this answers your question, if not feel free to elaborate more. Cheers -- Alessandro Benedetti Search Consultant, R&D Software Engineer, Director www.sease.io On Thu, 19 Mar 2020 at 11:18, Taisuke Miyazaki wrote: I'm using Solr 7.5.0. I want to get the boosted field and values per document. e.g. documents: id: 1, features: [1] id: 2, features: [1,2] id: 3, features: [1,2,3] query: bq: features:2^1.0 AND features:3^1.0 I expect results like below. 
boosted: - id: 2 - field: features, value: 2 - id: 3 - field: features, value: 2 - field: features, value: 3 I have an idea to set boost scores like bit-flags, but I don't think it's good because I must send the query twice. bit-flag: bq: features:2^2.0 AND features:3^4.0 docs: - id: 1, score: 1.0(0x001) - id: 2, score: 3.0(0x011) # has feature:2 (2nd bit is 1) - id: 3, score: 7.0(0x111) # has feature:2 and feature:3 (2nd and 3rd bits are 1) Checking the score value, I can then work out the boosted fields. Is there a better way? -- Charlie Hull OpenSource Connections, previously Flax tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.o19s.com
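Taisuke's bit-flag workaround can at least be decoded in a single pass on the client side. A minimal sketch of that decoding, assuming (as in his example) the base query contributes exactly 1.0 and each boosted value is given a power-of-two boost (2.0, 4.0, 8.0, ...) — this is an illustration of his idea, not a Solr feature:

```python
def matched_features(score, boosted_values):
    """Given a document score produced by the bit-flag boost scheme
    (e.g. bq: features:2^2.0 AND features:3^4.0 on top of a base
    score of 1.0), return which boosted values actually matched.
    `boosted_values` lists the values in the order their power-of-two
    boosts were assigned."""
    bits = int(round(score))
    # bit 0 is the base score; bit i+1 corresponds to boosted value i
    return [v for i, v in enumerate(boosted_values)
            if bits & (1 << (i + 1))]

# Scores from the example: id 1 -> 1.0, id 2 -> 3.0, id 3 -> 7.0
print(matched_features(3.0, [2, 3]))  # [2]
print(matched_features(7.0, [2, 3]))  # [2, 3]
```

As the thread notes, this only works while the scores stay exact integers - any real tf-idf/BM25 contribution breaks the bit pattern.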
Re: FW: SOLR version 8 bug???
achedChain.doFilter(ServletHandler.java:1596)\n\tat org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:545)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)\n\tat org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:590)\n\tat org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235)\n\tat org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1607)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)\n\tat org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1297)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188)\n\tat org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:485)\n\tat org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1577)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186)\n\tat org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1212)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)\n\tat org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:221)\n\tat org.eclipse.jetty.server.handler.InetAccessHandler.handle(InetAccessHandler.java:177)\n\tat org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:146)\n\tat org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)\n\tat org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:322)\n\tat org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)\n\tat org.eclipse.jetty.server.Server.handle(Server.java:500)\n\tat org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:383)\n\tat 
org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:547)\n\tat org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:375)\n\tat org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:270)\n\tat org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)\n\tat org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)\n\tat org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)\n\tat org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:336)\n\tat org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:313)\n\tat org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:171)\n\tat org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:129)\n\tat org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:388)\n\tat org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:806)\n\tat org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:938)\n\tat java.lang.Thread.run(Thread.java:748)\n", "code":500}} in Drupal\search_api_solr\Plugin\search_api\backend\SearchApiSolrBackend->search() (line 1600 of /srv/www/dcfinternet/phil/modules/composer/search_api_solr/src/Plugin/search_api/backend/SearchApiSolrBackend.php). -- Charlie Hull OpenSource Connections, previously Flax tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.o19s.com
Haystack US tickets on sale!
Hi all, Very happy to announce that Haystack US 2020, the search relevance conference, is now open for business! See www.haystackconf.com for details of the event running during the week of April 27th in Charlottesville, including associated training. We have a fantastic lineup of speakers due to be published soon, and there will be fun social events, book signings and more. Earlybird discounts are active until the end of March. (If you can't wait that long, we're also running some Solr training in March in London https://www.eventbrite.co.uk/e/think-like-a-relevance-engineer-solr-march-2020-london-uk-tickets-92942813457 and holding our London Solr Meetup that same week https://www.meetup.com/Apache-Lucene-Solr-London-User-Group/) Cheers Charlie -- Charlie Hull OpenSource Connections, previously Flax tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.o19s.com
Re: Mongolian language in Solr
Hi, There's no Mongolian stemmer in Snowball, the stemmer project Lucene uses. I found one paper discussing how one might lemmatize Mongolian: https://www.researchgate.net/publication/220229332_A_lemmatization_method_for_Mongolian_and_its_application_to_indexing_for_information_retrieval https://dl.acm.org/doi/10.1016/j.ipm.2009.01.008 but no actual code. Of course, you could use Snowball to build your own stemmer. https://snowballstem.org/ I did have more success finding Mongolian stopwords https://github.com/elastic/elasticsearch/issues/40434 - someone over in Elasticsearch land seems to have the same problem as you do. Best Charlie On 12/02/2020 11:41, Samir Joshi wrote: Hi, Is it possible to get a Mongolian language in Solr indexing? Regards, Samir Joshi VFS GLOBAL EST. 2001 | Partnering Governments. Providing Solutions. 10th Floor, Tower A, Urmi Estate, 95, Ganpatrao Kadam Marg, Lower Parel (W), Mumbai 400 013, India Mob: +91 9987550070 | sami...@vfsglobal.com<mailto:sami...@vfsglobal.com> | www.vfsglobal.com<http://www.vfsglobal.com/> -- Charlie Hull OpenSource Connections, previously Flax tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.o19s.com
Re: Haystack CFP is open, come and tell us how you tune relevance for Lucene/Solr
Hi all, You have until this Friday to submit a talk to Haystack! Very much looking forward to your submissions. Charlie On 27/01/2020 21:53, Doug Turnbull wrote: Just an update: the CFP was extended to Feb 7th, less than 2 weeks away. -> http://haystackconf.com It's your ethical imperative to share! ;) https://opensourceconnections.com/blog/2020/01/23/opening-up-search-is-an-ethical-imperative/ And no talk is too small; people often underestimate what they're doing, and very much underestimate how interesting others will find their story! The best talks often come from the least expected people & orgs. On Thu, Jan 9, 2020 at 4:13 AM Charlie Hull wrote: Hi all, Haystack, the search relevance conference, is confirmed for 29th & 30th April 2020 in Charlottesville, Virginia - the CFP is open and we need your contributions! More information at www.haystackconf.com <http://www.haystackconf.com> including links to previous talks, deadline is 31st January. We'd love to hear your Lucene/Solr relevance stories. Cheers Charlie -- Charlie Hull Flax - Open Source Enterprise Search tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.flax.co.uk -- Charlie Hull OpenSource Connections, previously Flax tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.o19s.com
Re: Haystack CFP is open, come and tell us how you tune relevance for Lucene/Solr
We're expecting prices to be very similar to last year - early bird will be $300 ish for conference only and $2250 ish for conference plus a training (we're running no less than 5 different classes that week including Think Like a Relevance Engineer, Hello LTR and NLP) - hopefully this will give you enough information for budgeting. Speakers get a small discount too! Cheers Charlie On 27/01/2020 22:21, John Blythe wrote: Hey Doug. Do you know the pricing yet? Trying to get something submitted to VP so I can take my team to the conference. Thanks! On Mon, Jan 27, 2020 at 14:54 Doug Turnbull < dturnb...@opensourceconnections.com> wrote: Just an update: the CFP was extended to Feb 7th, less than 2 weeks away. -> http://haystackconf.com It's your ethical imperative to share! ;) https://opensourceconnections.com/blog/2020/01/23/opening-up-search-is-an-ethical-imperative/ And no talk is too small; people often underestimate what they're doing, and very much underestimate how interesting others will find their story! The best talks often come from the least expected people & orgs. On Thu, Jan 9, 2020 at 4:13 AM Charlie Hull wrote: Hi all, Haystack, the search relevance conference, is confirmed for 29th & 30th April 2020 in Charlottesville, Virginia - the CFP is open and we need your contributions! More information at www.haystackconf.com <http://www.haystackconf.com> including links to previous talks, deadline is 31st January. We'd love to hear your Lucene/Solr relevance stories. Cheers Charlie -- Charlie Hull Flax - Open Source Enterprise Search tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.flax.co.uk -- *Doug Turnbull **| CTO* | OpenSource Connections <http://opensourceconnections.com>, LLC | 240.476.9983 Author: Relevant Search <http://manning.com/turnbull> 
-- Charlie Hull OpenSource Connections, previously Flax tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.o19s.com
Re: Update synonyms.txt file based on values in the database
Try looking into Managed Resources: https://lucene.apache.org/solr/guide/8_4/managed-resources.html Charlie On 15/01/2020 10:35, seeteshh wrote: How do I update the synonyms.txt file if the data is being fetched from a database, say PostgreSQL? I won't be able to update the synonyms.txt file manually every time, and the data is related to a table and not known to Solr. I am using Apache Solr 8.4. Regards, Seetesh Hindlekar -- Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html -- Charlie Hull OpenSource Connections, previously Flax tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.o19s.com
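For the database-driven case above, a sync job could read the synonym rows out of PostgreSQL and push them through the Managed Resources REST API rather than editing synonyms.txt by hand. A hedged sketch of building that request - the collection name and the resource name "english" are placeholders, and it assumes the schema's analyzer chain uses a managed synonym filter whose `managed` attribute matches:

```python
import json

def managed_synonyms_request(solr_base, collection, mappings,
                             resource="english"):
    """Build the PUT request for Solr's Managed Resources API that
    registers synonym mappings, e.g. {"mb": ["megabytes"]}.
    Returns (url, json_body)."""
    url = (f"{solr_base}/solr/{collection}"
           f"/schema/analysis/synonyms/{resource}")
    return url, json.dumps(mappings)

url, body = managed_synonyms_request(
    "http://localhost:8983", "mycollection", {"mb": ["megabytes"]})
# Send with e.g. requests.put(url, data=body,
#     headers={"Content-Type": "application/json"})
# then reload the collection so the analyzers pick up the change.
```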
Re: Coming back to search after some time... SOLR or Elastic for text search?
On 15/01/2020 11:42, Dc Tech wrote: Thank you Jan and Charlie. I should say that in terms of posting to the community regarding Elastic vs Solr - this is probably the most civil and helpful community that I have been a part of, and your answers have only reinforced that notion! Thank you for your responses. I am glad to hear that both can do most of it, which was my gut feeling as well. Charlie, to your point - the team probably feels that Elastic is easier to get started with, hence the preference, as well as the hosting options (with the caveats you noted). Agree with you completely that tech is not the real issue. Jan, agree with the points you made on team skills. On our previous proprietary engine, that was in fact the biggest issue - the engine was powerful enough and had good references. However, we were not able to exploit it to good effect. Hi again, The dirty secret that few will voice is that...most search engines are basically the same. Once you've worked on a search project you can apply those skills to any future search engine. This is why I'm currently focused on supporting the search team, not the search tech - how do you learn and improve those relevance tuning skills, considering it's really, really hard to recruit people with existing high-level search skills (and if you can find them you probably can't afford them). Cheers Charlie Thank you again. On Jan 15, 2020, at 5:10 AM, Jan Høydahl wrote: Hi, Choosing the Solr community mailing list to ask advice on whether to choose ES - you already know what to expect, no? More often than not the choice comes down to policy, standardization, what skills you have in the house etc. rather than ticking off feature checkboxes. Sometimes company values also may drive a choice, i.e. Solr is 100% Apache and not open core, which may matter if you plan to get involved in the community, and contribute features or patches. 
However, if I were in your shoes as architect to evaluate tech stack, and there was not a clear choice based on the above, I'd do what projects normally do: ask yourself what you really need from the engine. Maybe you have some features in your requirement list that make one a much better choice over the other. Or maybe after that exercise you are still wondering what to choose, in which case you just follow your gut feeling and make a choice :) Jan On 15 Jan 2020 at 10:07, Charlie Hull wrote: On 15/01/2020 04:02, Dc Tech wrote: I am a SOLR fan and had implemented it in our company over 10 years ago. I moved away from that role and the new search team in the meanwhile implemented a proprietary (and expensive) nosql style search engine. The project did not go well, and now I am back on the project and reviewing the technology stack. Some of the team think that ElasticSearch could be a good option, especially since we can easily get hosted versions with AWS where we have all the contractual stuff sorted out. You can, but you should be aware that: 1. Amazon's hosted Elasticsearch isn't great, often lags behind the current version, doesn't allow plugins etc. 2. Amazon and Elastic are currently engaged in legal battles over who is the most open sourcey, who allegedly copied code that was 'open' but commercially licensed, who would like to capture the hosted search market...not sure how this will pan out (Google for details) 3. You can also buy fully hosted Solr from several places. While SOLR definitely seems more advanced (LTR, streaming expressions, graph, and all the knobs and dials for relevancy tuning), Elastic may be sufficient for our needs. It does not seem to have LTR out of the box but the relevancy tuning knobs and dials seem to be similar to what SOLR has. Yes, they're basically the same under the hood (unsurprising as they're both based on Lucene). 
If you need LTR there's an ES plugin for that (disclaimer, my new employer built and maintains it: https://github.com/o19s/elasticsearch-learning-to-rank). I've lost track of the number of times I've been asked 'Elasticsearch or Solr, which should I choose?' and my current thoughts are: 1. Don't switch from one to the other for the sake of it. Switching search engines rarely addresses underlying issues (content quality, team skills, relevance tuning methodology) 2. Elasticsearch is easier to get started with, but at some point you'll need to learn how it all works 3. Solr is harder to get started with, but you'll know more about how it all works earlier 4. Both can be used for most search projects, most features are the same, both can scale. 5. Lots of Elasticsearch projects (and developers) are focused on logs, which is often not really a 'search' project. The corpus size is not a challenge - we have about one million documents, of which about half have full text, while the rest are simpler (i.e. company directory etc.). The query volumes are also quite low (max 5/second at peak). We have
Re: Coming back to search after some time... SOLR or Elastic for text search?
On 15/01/2020 04:02, Dc Tech wrote: I am a SOLR fan and had implemented it in our company over 10 years ago. I moved away from that role and the new search team in the meanwhile implemented a proprietary (and expensive) nosql style search engine. The project did not go well, and now I am back on the project and reviewing the technology stack. Some of the team think that ElasticSearch could be a good option, especially since we can easily get hosted versions with AWS where we have all the contractual stuff sorted out. You can, but you should be aware that: 1. Amazon's hosted Elasticsearch isn't great, often lags behind the current version, doesn't allow plugins etc. 2. Amazon and Elastic are currently engaged in legal battles over who is the most open sourcey, who allegedly copied code that was 'open' but commercially licensed, who would like to capture the hosted search market...not sure how this will pan out (Google for details) 3. You can also buy fully hosted Solr from several places. While SOLR definitely seems more advanced (LTR, streaming expressions, graph, and all the knobs and dials for relevancy tuning), Elastic may be sufficient for our needs. It does not seem to have LTR out of the box but the relevancy tuning knobs and dials seem to be similar to what SOLR has. Yes, they're basically the same under the hood (unsurprising as they're both based on Lucene). If you need LTR there's an ES plugin for that (disclaimer, my new employer built and maintains it: https://github.com/o19s/elasticsearch-learning-to-rank). I've lost track of the number of times I've been asked 'Elasticsearch or Solr, which should I choose?' and my current thoughts are: 1. Don't switch from one to the other for the sake of it. Switching search engines rarely addresses underlying issues (content quality, team skills, relevance tuning methodology) 2. Elasticsearch is easier to get started with, but at some point you'll need to learn how it all works 3. 
Solr is harder to get started with, but you'll know more about how it all works earlier 4. Both can be used for most search projects, most features are the same, both can scale. 5. Lots of Elasticsearch projects (and developers) are focused on logs, which is often not really a 'search' project. The corpus size is not a challenge - we have about one million documents, of which about half have full text, while the rest are simpler (i.e. company directory etc.). The query volumes are also quite low (max 5/second at peak). We have implemented the content ingestion and processing pipelines already in python and SPARK, so most of the data will be pushed in using APIs. I would really appreciate any guidance from the community! Sounds like a pretty small setup to be honest, but as ever the devil is in the details. Cheers Charlie -- Charlie Hull Flax - Open Source Enterprise Search (now part of OpenSourceConnections) tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.o19s.com
Haystack CFP is open, come and tell us how you tune relevance for Lucene/Solr
Hi all, Haystack, the search relevance conference, is confirmed for 29th & 30th April 2020 in Charlottesville, Virginia - the CFP is open and we need your contributions! More information at www.haystackconf.com <http://www.haystackconf.com>including links to previous talks, deadline is 31st January. We'd love to hear your Lucene/Solr relevance stories. Cheers Charlie -- Charlie Hull Flax - Open Source Enterprise Search tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.flax.co.uk
Re: hi question about solr
Hi, https://livebook.manning.com/book/solr-in-action/chapter-3 may help (I'd suggest reading the whole book as well). Basically what you're looking for is the 'term position'. The TermVectorComponent in Solr will allow you to return this for each result. Cheers Charlie On 02/12/2019 11:24, eli chen wrote: hi, I'm kind of new to solr so please be patient. I'll try to explain what I need and what I'm trying to do. We have a lot of book content and we want to index it and allow search in the books. When someone searches for a term I need to get back the position of the matched word in the book. For example, if the book content is "hello my name is jeff" and someone searches for "my", I want to get back the position of "my" in the content field (which is 1 in this case). I tried to do that with payloads but with no success. And another problem I encountered: let's say the content field is "hello my name is jeff what is your name". Now if someone searches for "name", I want to get back the index of all occurrences, not just the first one. Is there any way to do that with solr without developing new plugins? thx -- Charlie Hull Flax - Open Source Enterprise Search tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.flax.co.uk
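A sketch of the request Charlie is pointing at - asking the TermVectorComponent for term positions. The `tv`, `tv.positions` and `tv.fl` parameters are the component's real names, but the `/tvrh` handler path and the field names here are assumptions that depend on your solrconfig.xml and schema:

```python
from urllib.parse import urlencode

def term_vector_query(term, field="content"):
    """Build query params asking Solr's TermVectorComponent to
    return the positions of each term in every matching document."""
    return {
        "q": f"{field}:{term}",
        "fl": "id",
        "tv": "true",            # enable term vectors in the response
        "tv.positions": "true",  # include each term's position
        "tv.fl": field,          # restrict term vectors to this field
    }

params = term_vector_query("name")
# GET http://localhost:8983/solr/books/tvrh? followed by these params;
# the field must be declared with termVectors="true" and
# termPositions="true" in the schema for positions to come back.
print(urlencode(params))
```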
Re: Icelandic support in Solr
On 26/11/2019 16:35, Mikhail Ibraheem wrote: Hi, Does Solr support the Icelandic language out of the box? If not, can you please let me know how to add that with custom analyzers? Thanks The Snowball stemmer project which is used by Solr (https://snowballstem.org/algorithms/ - co-created by Martin Porter, author of the famous stemmer) doesn't support Icelandic unfortunately. I can't find any other stemmers that you could use in Solr. Basis Technology offer various commercial software for language processing that can work with Solr and other engines; I'm not sure if they support Icelandic. So, not very positive I'm afraid: you could look into creating your own stemmer using Snowball, or some heuristic approaches, but you'd need a good grasp of the structure of the language. Best Charlie -- Charlie Hull Flax - Open Source Enterprise Search tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.flax.co.uk
Re: Active directory integration in Solr
Not out of the box; there are a few authentication plugins bundled but not for AD https://lucene.apache.org/solr/guide/7_2/authentication-and-authorization-plugins.html - there's also some useful stuff in Apache ManifoldCF https://www.francelabs.com/blog/tutorial-on-authorizations-for-manifold-cf-and-solr/ Best Charlie On 18/11/2019 15:08, Kommu, Vinodh K. wrote: Hi, Does anyone know whether Solr has any out-of-the-box capability to integrate with Active Directory (using LDAP) when security is enabled? Instead of creating users in the security.json file, we're planning to use users who already exist in Active Directory so they can use their individual credentials rather than defining them in Solr. Has anyone come across a similar requirement? If so, was there a working solution? Thanks, Vinodh -- Charlie Hull Flax - Open Source Enterprise Search tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.flax.co.uk
Re: solr UI collection dropdown sorting order
I think we looked at this at our recent Hackday in DC - check out the first part of this blog: https://opensourceconnections.com/blog/2019/09/23/what-happens-at-a-lucene-solr-hackday/ - hopefully a pointer towards getting this fixed. Best Charlie On 20/10/2019 09:06, Sotiris Fragkiskos wrote: Hi everyone! Is there any way the collections available on the left-hand side of the Solr UI can be sorted? I'm referring to the "collection selector" dropdown, but the same applies to the Collections button. The sorting seems kind of... random? Thanks in advance! Sotir -- Charlie Hull Flax - Open Source Enterprise Search tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.flax.co.uk
Re: Using Tesseract OCR to extract PDF files in EML file attachment
My colleagues Eric Pugh and Dan Worley covered OCR and Solr in a presentation at our recent London Lucene/Solr Meetup: https://www.meetup.com/Apache-Lucene-Solr-London-User-Group/events/264579498/ (direct link to slides if you can't find it in the comments: https://www.slideshare.net/o19s/payloads-and-ocr-with-solr) HTH Charlie On 14/10/2019 11:40, Retro wrote: Hello, thanks for the answer, but let me explain the setup. We are running our own backup solution for emails (messages from Exchange in MSG format). The content of these messages is then indexed in SOLR. But SOLR can not process attachments within those MSG files, can not OCR them. This is what I need - to OCR attachments and get their content indexed in SOLR. Davis, Daniel (NIH/NLM) [C] wrote: Nuance and ABBYY provide OCR capabilities as well. Looking at higher level solutions, both indexengines.com and Commvault can do email remediation for legal issues. AJ Weber wrote: There are alternative, paid, libraries to parse and extract attachments from EML files as well. EML attachments will have a mimetype associated with their metadata. -- Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html -- Charlie Hull Flax - Open Source Enterprise Search tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.flax.co.uk
Hackday in DC next Tuesday
Hi all, If you're in town for Activate next week, we're running another free Lucene Hackday on Tuesday: https://www.meetup.com/Apache-Lucene-Solr-London-User-Group/events/263993681/ - do come along if you can! It's only a block and a half from the Activate venue. Cheers Charlie -- Charlie Hull OpenSource Connections tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.o19s.com
Re: Ranking
From: Erik Hatcher Date: Sat, 27 Jul 2019 16:55:51 -0400 Subject: Re: Ranking To: solr-user@lucene.apache.org The details of the scoring can be seen by setting debug=true. Erik On Jul 27, 2019, at 15:40, Steven White wrote: Hi everyone, I have 2 files like so: FA has the letter "i" only 2 times, and the file size is 54,246 bytes; FB has the letter "i" 362 times, and the file size is 9,953 bytes. When I search on the letter "i", FB is ranked lower, which confuses me because I was under the impression that the occurrences of the term in a document and the document size are factors, so I was expecting FB to rank higher. Did I get this right? If not, what's causing FB to rank lower? I'm on Solr 8.1 Thanks Steven -- Charlie Hull Flax - Open Source Enterprise Search tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.flax.co.uk
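To see why the debug output is worth reading here: Solr 8's default similarity is BM25, where term frequency pushes the score up and above-average document length pulls it down. A rough back-of-envelope sketch of the core formula, using the byte sizes from the email as crude stand-ins for document lengths and ignoring Lucene's norm encoding and any boosts (so real explain output will differ):

```python
import math

def bm25(tf, doc_len, avg_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Core BM25 term score: rises with term frequency (tf) but
    saturates, and falls as the document grows longer than average."""
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    norm = k1 * (1 - b + b * doc_len / avg_len)
    return idf * (tf * (k1 + 1)) / (tf + norm)

avg = (54246 + 9953) / 2
fa = bm25(tf=2, doc_len=54246, avg_len=avg, n_docs=2, doc_freq=2)
fb = bm25(tf=362, doc_len=9953, avg_len=avg, n_docs=2, doc_freq=2)
print(fb > fa)  # True: on these numbers alone FB should outrank FA,
                # so the explain output is where to look for whatever
                # else is contributing to the scores.
```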
Quepid, the relevance testing tool for Solr, released as open source
Hi all, We've finally made Quepid, the relevance testing tool, open source. There's also a free hosted version at www.quepid.com. Looking forward to contributions driving the project forward! Quepid is a way to record human relevance judgements, and then to experiment with query tuning and see the results in real time. More details at https://opensourceconnections.com/blog/2019/07/25/2019-07-22-quepid-is-now-open-source/ (also particularly pleased to see Luwak, the stored query engine we built at Flax, become part of Lucene - it's a great day for open source!) Cheers Charlie -- Charlie Hull Flax - Open Source Enterprise Search tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.flax.co.uk
Re: Indexig excel (xlsx) file into SOLR 8.1.1
Simpler possibly, but not necessarily reliable. If you do everything inside Solr's DIH with Tika under the hood to extract data from Excel, a malformed Excel file could kill Tika and bring down your entire Solr cluster. Far better to do it outside of Solr as this blog describes: https://lucidworks.com/post/indexing-with-solrj/ If you want to see what Tika does to your Excel examples, this is quite a neat way to experiment: https://okfnlabs.org/projects/tika-server/ Cheers Charlie On 26/07/2019 09:44, Vipul Bahuguna wrote: Hi Charlie, Thanks for your suggestion, but I will have thousands of these files coming from different sources. It would become very tedious if I have to first convert them to CSV and then run line by line. I was hoping there could be a simpler way to achieve this using DIH, which I thought could be configured to read and ingest MS Excel (xlsx) files. I am not too sure what the configuration file would look like. Any pointers are welcome. Thanks! On Fri, 26 Jul, 2019, 1:56 PM Charlie Hull wrote: Convert the Excel file to a CSV and then write a teeny script to go through it line by line and submit to Solr over HTTP? Tika would probably work but it's a lot of heavy lifting for what seems to me like a simple problem. Cheers Charlie On 26/07/2019 09:19, Vipul Bahuguna wrote: Hi Guys - can anyone suggest how to achieve this? I have understood how to insert json documents. So one alternative that comes to my mind is that I can convert the rows in my excel to json format, with the header of my excel file becoming the json keys (corresponding to the fields I have defined in my managed-schema.xml). And then each cell in the excel file will become the value of this field. However, I am sure there must be a better way of directly ingesting the excel file to achieve the same. I was trying to read about DIH and Apache Tika, but I am not very sure of how the configuration works. My sample excel file has 4 columns namely - 1. First Name 2. Last Name 3. 
Phone 4. Website Link I want to index these fields into SOLR in a way that all these columns become my solr schema fields and later I can search based on these fields. Any suggestions please. thanks ! -- Charlie Hull Flax - Open Source Enterprise Search tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.flax.co.uk -- Charlie Hull Flax - Open Source Enterprise Search tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.flax.co.uk
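Charlie's suggestion above (convert to CSV, then a small script that submits to Solr over HTTP) can be sketched in a few lines of Python. The collection name and the sample field names below are hypothetical, not taken from any real setup:

```python
import csv
import io
import json
import urllib.request

def csv_to_solr_docs(csv_text):
    """Turn CSV rows into Solr JSON documents, using the header row as field names."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [dict(row) for row in reader]

def post_docs(docs, solr_url="http://localhost:8983/solr/mycollection/update?commit=true"):
    """POST the documents to Solr's JSON update endpoint (collection name is made up)."""
    req = urllib.request.Request(
        solr_url,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    return urllib.request.urlopen(req)

# Sample data mirroring the four columns in the question.
sample = "first_name,last_name,phone,website\nVipul,Bahuguna,555-0100,https://example.com\n"
docs = csv_to_solr_docs(sample)
print(docs[0]["first_name"])  # Vipul
```

Run `post_docs(docs)` once a Solr instance is up; for thousands of files, loop over them and batch the documents rather than posting one at a time.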
Re: Indexing excel (xlsx) file into SOLR 8.1.1
Convert the Excel file to a CSV and then write a teeny script to go through it line by line and submit to Solr over HTTP? Tika would probably work but it's a lot of heavy lifting for what seems to me like a simple problem. Cheers Charlie On 26/07/2019 09:19, Vipul Bahuguna wrote: Hi Guys - can anyone suggest how to achieve this? I have understood how to insert json documents. So one alternative that comes to my mind is that I can convert the rows in my excel to json format with the header of my excel file becoming the json keys (corresponding to the fields I have defined in my managed-schema.xml). And then each cell in the excel file will become the value of this field. However, I am sure there must be a better way of directly ingesting the excel file to achieve the same. I was trying to read about DIH and Apache Tika, but I am not very sure of how the configuration works. My sample excel file has 4 columns namely - 1. First Name 2. Last Name 3. Phone 4. Website Link I want to index these fields into SOLR in a way that all these columns become my solr schema fields and later I can search based on these fields. Any suggestions please. thanks ! -- Charlie Hull Flax - Open Source Enterprise Search tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.flax.co.uk
Re: Understanding DebugQuery
Hi Paresh, There are various tools available for breaking down the Debug query: www.splainer.io (disclaimer, I work for OSC who wrote this) and a few others - check out section 4 of this post for more http://www.flax.co.uk/blog/2018/11/15/defining-relevance-engineering-part-4-tools/ Cheers Charlie On 09/07/2019 06:43, Paresh Khandelwal wrote: Hi All, I tried to get the debug information about the query for my INNER JOIN and ACROSS JOIN and trying to understand it. See the query below - 1487 msec { "responseHeader":{ "status":0, "QTime":1487, "params":{ "q":"*:*", "fq.op":"AND", "indent":"on", "fl":"TC_0Y0_Item_ID", "fq":["TC_0Y0_Occurrence_Name:\"6935 style rear MY11+\"", "TC_0Y0_ProductScope:xtWNf_fTAaLUgD", "{!join to=TC_0Y0_Item_ID from=TC_0Y0_ItemRevision_0Y0_awp0Item_item_id fromIndex=collection1}TC_0Y0_ItemRevision_0Y0_awp0Item_item_id:92138773"], "wt":"json", "debugQuery":"on", "group.field":"TC_0Y0_Item_ID", .. "debug":{ "join":{ "{!join from=TC_0Y0_ItemRevision_0Y0_awp0Item_item_id to=TC_0Y0_Item_ID fromIndex=collection1}TC_0Y0_ItemRevision_0Y0_awp0Item_item_id:92138773":{ "time":955, "fromSetSize":3, "toSetSize":14560, "fromTermCount":6632106, "fromTermTotalDf":6632106, "fromTermDirectCount":6632106, "fromTermHits":1, "fromTermHitsTotalDf":1, "toTermHits":1, "toTermHitsTotalDf":14560, "toTermDirectCount":0, "smallSetsDeferred":1, "toSetDocsAdded":14560}}, "rawquerystring":"*:*", "querystring":"*:*", "parsedquery":"MatchAllDocsQuery(*:*)", "parsedquery_toString":"*:*", "explain":{ "AZD1uV0qgj6GxC":"\n1.0 = *:*, product of:\n 1.0 = boost\n 1.0 = queryNorm\n"}, "QParser":"LuceneQParser", "filter_queries":["TC_0Y0_Occurrence_Name:\"6935 style rear MY11+\"", "TC_0Y0_ProductScope:xtWNf_fTAaLUgD", "{!join to=TC_0Y0_Item_ID from=TC_0Y0_ItemRevision_0Y0_awp0Item_item_id fromIndex=collection1}TC_0Y0_ItemRevision_0Y0_awp0Item_item_id:92138773"], "parsed_filter_queries":["TC_0Y0_Occurrence_Name:6935 style rear MY11+", "TC_0Y0_ProductScope:xtWNf_fTAaLUgD", 
"JoinQuery({!join from=TC_0Y0_ItemRevision_0Y0_awp0Item_item_id to=TC_0Y0_Item_ID fromIndex=collection1}TC_0Y0_ItemRevision_0Y0_awp0Item_item_id:92138773)"], "timing":{ "time":1487.0, .. I am trying to see why fromTermCount is so high when fromSetSize and toSetSize are so small. Where can I find the details about all the contents of debugQuery and how to read each component? Any help is appreciated. Regards, Paresh -- Charlie Hull Flax - Open Source Enterprise Search tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.flax.co.uk
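For reference, debug output like the above is produced by adding debugQuery=on to the select request. A sketch of building such a request in Python (the core name and the fq value are taken from the question above; debug.explain.structured makes the explain section easier to parse programmatically):

```python
from urllib.parse import urlencode

# Parameters mirroring the join query in the question.
params = {
    "q": "*:*",
    "fq": "{!join to=TC_0Y0_Item_ID from=TC_0Y0_ItemRevision_0Y0_awp0Item_item_id "
          "fromIndex=collection1}TC_0Y0_ItemRevision_0Y0_awp0Item_item_id:92138773",
    "debugQuery": "on",                   # adds timing, parsed queries and explains
    "debug.explain.structured": "true",   # explain as nested JSON instead of a string
    "wt": "json",
}
url = "http://localhost:8983/solr/collection1/select?" + urlencode(params)
print("debugQuery=on" in url)  # True
```

Fetching that URL (with a running Solr) returns the same "debug" block shown above, including the per-join statistics such as fromSetSize and fromTermCount.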
Re: Solr 6.6.0 - DIH - Multiple entities - Multiple DBs
On 05/07/2019 14:33, Joseph_Tucker wrote: Thanks for your help / suggestion. I'm not sure I completely follow in this case. SolrJ looks like a method to allow Java applications to talk to Solr, or any other third party application would simply be a communication method between Solr and the language of your choosing. I guess what I'm after is, how would using SolrJ improve performance when indexing? It's not just about improving performance (although DIH is single threaded, so you could obtain a marked indexing performance gain using a client such as SolrJ). With DIH you will embed a lot of SQL code into Solr's configuration files, and the more sources you add the more complicated, hard to debug and unmaintainable it's going to be. You should thus consider writing a proper indexing script in Java, Python or whatever language you are most familiar with - this has always been our approach. Best Charlie *** I could be wrong in my assumptions as I'm still learning a great deal about Solr. *** I appreciate your help Regards, Joe -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html -- Charlie Hull Flax - Open Source Enterprise Search tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.flax.co.uk
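As a sketch of the kind of standalone indexing script Charlie describes - assuming Python rather than SolrJ, and a hypothetical collection name - batching documents and posting the batches from several threads sidesteps DIH's single-threaded bottleneck:

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

SOLR_UPDATE = "http://localhost:8983/solr/mycollection/update"  # hypothetical collection

def batches(docs, size):
    """Split the document list into fixed-size batches."""
    return [docs[i:i + size] for i in range(0, len(docs), size)]

def post_batch(batch):
    """POST one batch to Solr's JSON update endpoint."""
    req = urllib.request.Request(
        SOLR_UPDATE,
        data=json.dumps(batch).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

def index_all(docs, batch_size=500, workers=4):
    """Index batches in parallel - unlike DIH, which indexes on a single thread."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(post_batch, batches(docs, batch_size)))

print(len(batches(list(range(1200)), 500)))  # 3
```

The SQL stays in your own code where it can be unit tested, rather than embedded in Solr's configuration files.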
Re: Solr upgrade question
On 05/07/2019 14:49, Margo Breäs | INDI wrote: Hi all, At the moment we are working with Solr version 4.8.1 in combination with an older version of Intershop. We have recently migrated our entire shop to a new party, and so there is room for improvements. Are there any known issues with upgrading over that many versions in general, or with an Intershop version specifically? If so we would appreciate your experiences/stories, so we can mitigate things beforehand. If you're going to migrate from that old a version of Solr, I think you will need to re-index completely and also check that all your queries work as you expect...there have been a lot of changes since then and don't underestimate the task! Cheers Charlie Thanks in advance, best regards, Margo Breas | INDI Met vriendelijke groet / Kind regards, Margo Breäs Categoriespecialist T. +31 88 0666 000 E. *margo.br...@indi.nl* <mailto:margo.br...@indi.nl> *W. www.indi.nl* <https://www.indi.nl/nl-nl/?utm_medium=email_source=email_handtekening_campaign=margo_breas> INDI.nl website <https://www.indi.nl/nl-nl/?utm_medium=email_source=email_handtekening_campaign=margo_breas> -- Charlie Hull Flax - Open Source Enterprise Search tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.flax.co.uk
Re: SOLR (6.3.0) Initialization Issue
On 10/05/2019 08:52, Charlie Hull wrote: On 09/05/2019 19:18, SAMMAR UL HASSAN wrote: Hi Support Team, I hope all is well. Let me explain what we are, what we are currently doing & what we want from you. We are IT based healthcare company, providing healthcare software services (EHR/EMR) to doctors across the U.S. In many important modules of our products we have implemented SOLR based smart search. We know the basics of SOLR & we are doing well to achieve our requirements but we face some issues time to time and try to resolve as per our best knowledge. At this moment, we are facing the attached errors and need your support to resolve this issue permanently. We will appreciate if you arrange the call to discuss this issue. In case any additional information please let us know. Hi, Solr is an open source product, so you have various options to get support. I'm assuming you've already done your own research around the issues you're facing. 1. ask on this mailing list, providing as much detail as you can, and hopefully someone will be able to help - but be aware that those who respond are volunteering their time from often very busy lives - and no-one is likely to want to arrange a call. 2. engage a professional services company (disclaimer: I work for OpenSource Connections who provide this sort of help, there are many others - see https://wiki.apache.org/solr/Support for individuals and companies who know Solr) 3. train up your own team on Solr, there are many courses available. 4. this list sometimes strips attachments, so I'm afraid the list of errors you supplied didn't arrive - perhaps put them inline? 
HTH Charlie *Regards* Syed Sammar ul Hassan *Lead Surescripts-Development* MTBC | A Unique Healthcare IT Company® 7 Clyde Road | Somerset, NJ 08873 P: 732-873-5133 x319 | F: 732-873-3378 www.mtbc.com <http://www.mtbc.com/>| sammarulhas...@mtbc.com <mailto:sammarulhas...@mtbc.com> Follow MTBC on Twitter, LinkedIn and Facebook ONC-ACB Certified EHR | Deloitte® Technology Fast 500 | SureScripts® Solution Provider | Microsoft® Gold Certified Partner | Inc. 500|5000® NOTICE: The information contained in this e-mail message is confidential and intended only for the personal and confidential use of the designated recipient(s) named above. If the reader of this message is not the intended recipient or an agent responsible for delivering it to the intended recipient, you have received this document in error, and any review, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify us immediately by email or telephone and delete the original message in its entirety. MTBC, the stylized MTBC logo, A Unique Healthcare IT Company and other MTBC logos, product and service names are trademarks of MTBC. -- Charlie Hull Flax - Open Source Enterprise Search tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.flax.co.uk -- Charlie Hull Flax - Open Source Enterprise Search tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.flax.co.uk
Re: SOLR (6.3.0) Initialization Issue
On 09/05/2019 19:18, SAMMAR UL HASSAN wrote: Hi Support Team, I hope all is well. Let me explain what we are, what we are currently doing & what we want from you. We are IT based healthcare company, providing healthcare software services (EHR/EMR) to doctors across the U.S. In many important modules of our products we have implemented SOLR based smart search. We know the basics of SOLR & we are doing well to achieve our requirements but we face some issues time to time and try to resolve as per our best knowledge. At this moment, we are facing the attached errors and need your support to resolve this issue permanently. We will appreciate if you arrange the call to discuss this issue. In case any additional information please let us know. Hi, Solr is an open source product, so you have various options to get support. I'm assuming you've already done your own research around the issues you're facing. 1. ask on this mailing list, providing as much detail as you can, and hopefully someone will be able to help - but be aware that those who respond are volunteering their time from often very busy lives - and no-one is likely to want to arrange a call. 2. engage a professional services company (disclaimer: I work for OpenSource Connections who provide this sort of help, there are many others - see https://wiki.apache.org/solr/Support for individuals and companies who know Solr) 3. train up your own team on Solr, there are many courses available. HTH Charlie *Regards* Syed Sammar ul Hassan *Lead Surescripts-Development* MTBC | A Unique Healthcare IT Company® 7 Clyde Road | Somerset, NJ 08873 P: 732-873-5133 x319 | F: 732-873-3378 www.mtbc.com <http://www.mtbc.com/>| sammarulhas...@mtbc.com <mailto:sammarulhas...@mtbc.com> Follow MTBC on Twitter, LinkedIn and Facebook ONC-ACB Certified EHR | Deloitte® Technology Fast 500 | SureScripts® Solution Provider | Microsoft® Gold Certified Partner | Inc. 
500|5000® NOTICE: The information contained in this e-mail message is confidential and intended only for the personal and confidential use of the designated recipient(s) named above. If the reader of this message is not the intended recipient or an agent responsible for delivering it to the intended recipient, you have received this document in error, and any review, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify us immediately by email or telephone and delete the original message in its entirety. MTBC, the stylized MTBC logo, A Unique Healthcare IT Company and other MTBC logos, product and service names are trademarks of MTBC. -- Charlie Hull Flax - Open Source Enterprise Search tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.flax.co.uk
Re: solr search Ontology based data set
On 13/03/2019 17:01, Jie Luo wrote: Hi all, I have several ontology based data sets, and I would like to use solr as the search engine. Solr documents are flat documents. I would like to know the best way to handle the search. Simple search is fine. One possible search will need to retrieve the ontology tree or graph. Best regards Jie Are you aware of the BioSolr project? Have a chat to Sameer Velankar at EBI. There's some background here https://github.com/flaxsearch/BioSolr https://www.ebi.ac.uk/spot/BioSolr/ Various ontology indexing code for Solr was developed as part of this project. Best Charlie -- Charlie Hull Flax - Open Source Enterprise Search tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.flax.co.uk
Re: SOLR and AWS comprehend
On 13/02/2019 12:17, Gareth Baxendale wrote: This is perhaps more of an architecture question than dev code but appreciate collective thoughts! We are using Solr to order records and to categorise them to allow users to search and find specific medical conditions. We have an opportunity to make use of Machine Learning to aid and improve the results. AWS Comprehend is the product we are looking at but there is a question over whether one should replace the other as they would compete or if in fact both should work together to provide the solution we are after. One is an open source search engine and one is a closed source hosted NLP service you pay for. I think you're comparing chalk and cheese here: you would use an NLP service to enhance the source data before indexing with something like Solr, or extract information from a query before searching. Although Solr does contain some classification features it doesn't contain any NLP features - although as my colleague Liz writes you can now easily integrate Solr & OpenNLP, another open source toolkit. https://opensourceconnections.com/blog/2018/08/06/intro_solr_nlp_integrations/ By the way are you aware that NHS Wales are using Solr to power their patient records service? Best Charlie Appreciate any insights people have. Thanks Gareth -- Charlie Hull Flax - Open Source Enterprise Search tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.flax.co.uk
Re: Haystack Relevance Conference Announced; CFP ends Jan 9!
Hi all, Just to let you know the CFP has been extended until January 30th and we're really looking forward to seeing your proposals! http://haystackconf.com Cheers Charlie On 27/11/2018 22:33, Doug Turnbull wrote: Hey everyone, Many of you may know about/have been to Haystack - The Search Relevance Conference. http://haystackconf.com We're excited to announce 2019's Haystack, April 22-25 in Charlottesville, VA, USA. Our CFP due January 9th. We want to bring together practitioners that work on really interesting search relevance problems. We want talks that really get into the nitty-gritty of improving relevance, getting into technically meaty talks in applied Information Retrieval based on open source search. We know the Solr community is chock full of great ideas and problems solved, and we look forward to hearing about the tough problems you've solved with Solr/Lucene/Elasticsearch/Vespa/A Team of Trained Hamsters/whatever. Best -Doug -- Charlie Hull Flax - Open Source Enterprise Search tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.flax.co.uk
Re: SV: Tool to format the solr query for easier reading?
On 08/01/2019 09:20, Hullegård, Jimi wrote: Hi Charlie, Care to elaborate on that a little? I can't seem to find any tool in that blog entry that formats a given solr query. What tool did you have in mind? Hi Jimi, I recalled that the Chrome plugin would do this, obviously it's not a perfect solution for you as you'd prefer a Java formatter but it's a start - have you tried this one? Best Charlie /Jimi -Original message- From: Charlie Hull Sent: 8 January 2019 15:55 To: solr-user@lucene.apache.org Subject: Re: Tool to format the solr query for easier reading? On 08/01/2019 04:33, Hullegård, Jimi wrote: Hi, Hi Jimi, There are some suggestions in part 4 of my recent blog: http://www.flax.co.uk/blog/2018/11/15/defining-relevance-engineering-part-4-tools/ Cheers Charlie I often find myself having to analyze an already existing solr query. But when the number of clauses and/or number of nested parentheses reach a certain level I can no longer grasp what the query is about by just a quick glance. Sometimes I can look at the code generating the query, but it might be autogenerated in a complex way, or I might only have access to a log output of the query. Here is an example query, based on a real query in our system: system:(a) type:(x OR y OR z) date1:[* TO 2019-08-31T06:15:00Z/DAY+1DAYS] ((boolean1:false OR date2:[* TO 2019-08-31T06:15:00Z/DAY-30DAYS])) -date3:[2019-08-31T06:15:00Z/DAY+1DAYS TO *] (((*:* -date4:*) OR date5:* OR date3:[* TO 2019-08-31T06:15:00Z/DAY+1DAYS])) Here I find it quite difficult to see what clauses are grouped together (using parentheses). What I tend to do in these circumstances is to copy the query into a simple text editor, and then manually add line breaks and indentation matching the parentheses levels. 
For the query above, it would result in something like this: system:(a) type:(x OR y OR z) date1:[* TO 2019-08-31T06:15:00Z/DAY+1DAYS] ( (boolean1:false OR date2:[* TO 2019-08-31T06:15:00Z/DAY-30DAYS]) ) -date3:[2019-08-31T06:15:00Z/DAY+1DAYS TO *] ( ((*:* -date4:*) OR date5:* OR date3:[* TO 2019-08-31T06:15:00Z/DAY+1DAYS]) ) But that is a slow process, and I might make a mistake that messes up the interpretation completely. Especially when there are several levels of nested parentheses. Does anyone know of any kind of tool that would help automate this? It wouldn't have to format its output like my example, as long as it makes it easier to see what start and end parentheses belong to each other, preferably using multiple lines and indentation. A java tool would be perfect, because then I could easily integrate it into our existing debugging tools, but an online formatter (like http://jsonformatter.curiousconcept.com) would also be very useful. Regards /Jimi Svenskt Näringsliv processes your personal data in accordance with the GDPR. Read more about our processing and your rights here: Integritetspolicy<https://www.svensktnaringsliv.se/dataskydd/integritet-och-behandling-av-personuppgifter_697219.html> -- Charlie Hull Flax - Open Source Enterprise Search tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.flax.co.uk -- Charlie Hull Flax - Open Source Enterprise Search tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.flax.co.uk
Re: SV: Tool to format the solr query for easier reading?
On 08/01/2019 09:20, Hullegård, Jimi wrote: Hi Charlie, Care to elaborate on that a little? I can't seem to find any tool in that blog entry that formats a given solr query. What tool did you have in mind? This also does some basic URL splitting: https://www.freeformatter.com/url-parser-query-string-splitter.html Cheers Charlie /Jimi -Original message- From: Charlie Hull Sent: 8 January 2019 15:55 To: solr-user@lucene.apache.org Subject: Re: Tool to format the solr query for easier reading? On 08/01/2019 04:33, Hullegård, Jimi wrote: Hi, Hi Jimi, There are some suggestions in part 4 of my recent blog: http://www.flax.co.uk/blog/2018/11/15/defining-relevance-engineering-part-4-tools/ Cheers Charlie I often find myself having to analyze an already existing solr query. But when the number of clauses and/or number of nested parentheses reach a certain level I can no longer grasp what the query is about by just a quick glance. Sometimes I can look at the code generating the query, but it might be autogenerated in a complex way, or I might only have access to a log output of the query. Here is an example query, based on a real query in our system: system:(a) type:(x OR y OR z) date1:[* TO 2019-08-31T06:15:00Z/DAY+1DAYS] ((boolean1:false OR date2:[* TO 2019-08-31T06:15:00Z/DAY-30DAYS])) -date3:[2019-08-31T06:15:00Z/DAY+1DAYS TO *] (((*:* -date4:*) OR date5:* OR date3:[* TO 2019-08-31T06:15:00Z/DAY+1DAYS])) Here I find it quite difficult to see what clauses are grouped together (using parentheses). What I tend to do in these circumstances is to copy the query into a simple text editor, and then manually add line breaks and indentation matching the parentheses levels. 
For the query above, it would result in something like this: system:(a) type:(x OR y OR z) date1:[* TO 2019-08-31T06:15:00Z/DAY+1DAYS] ( (boolean1:false OR date2:[* TO 2019-08-31T06:15:00Z/DAY-30DAYS]) ) -date3:[2019-08-31T06:15:00Z/DAY+1DAYS TO *] ( ((*:* -date4:*) OR date5:* OR date3:[* TO 2019-08-31T06:15:00Z/DAY+1DAYS]) ) But that is a slow process, and I might make a mistake that messes up the interpretation completely. Especially when there are several levels of nested parentheses. Does anyone know of any kind of tool that would help automate this? It wouldn't have to format its output like my example, as long as it makes it easier to see what start and end parentheses belong to each other, preferably using multiple lines and indentation. A java tool would be perfect, because then I could easily integrate it into our existing debugging tools, but an online formatter (like http://jsonformatter.curiousconcept.com) would also be very useful. Regards /Jimi -- Charlie Hull Flax - Open Source Enterprise Search tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.flax.co.uk -- Charlie Hull Flax - Open Source Enterprise Search tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.flax.co.uk
Re: Tool to format the solr query for easier reading?
On 08/01/2019 04:33, Hullegård, Jimi wrote: Hi, Hi Jimi, There are some suggestions in part 4 of my recent blog: http://www.flax.co.uk/blog/2018/11/15/defining-relevance-engineering-part-4-tools/ Cheers Charlie I often find myself having to analyze an already existing solr query. But when the number of clauses and/or number of nested parentheses reach a certain level I can no longer grasp what the query is about by just a quick glance. Sometimes I can look at the code generating the query, but it might be autogenerated in a complex way, or I might only have access to a log output of the query. Here is an example query, based on a real query in our system: system:(a) type:(x OR y OR z) date1:[* TO 2019-08-31T06:15:00Z/DAY+1DAYS] ((boolean1:false OR date2:[* TO 2019-08-31T06:15:00Z/DAY-30DAYS])) -date3:[2019-08-31T06:15:00Z/DAY+1DAYS TO *] (((*:* -date4:*) OR date5:* OR date3:[* TO 2019-08-31T06:15:00Z/DAY+1DAYS])) Here I find it quite difficult to see what clauses are grouped together (using parentheses). What I tend to do in these circumstances is to copy the query into a simple text editor, and then manually add line breaks and indentation matching the parentheses levels. For the query above, it would result in something like this: system:(a) type:(x OR y OR z) date1:[* TO 2019-08-31T06:15:00Z/DAY+1DAYS] ( (boolean1:false OR date2:[* TO 2019-08-31T06:15:00Z/DAY-30DAYS]) ) -date3:[2019-08-31T06:15:00Z/DAY+1DAYS TO *] ( ((*:* -date4:*) OR date5:* OR date3:[* TO 2019-08-31T06:15:00Z/DAY+1DAYS]) ) But that is a slow process, and I might make a mistake that messes up the interpretation completely. Especially when there are several levels of nested parentheses. Does anyone know of any kind of tool that would help automate this? It wouldn't have to format its output like my example, as long as it makes it easier to see what start and end parentheses belong to each other, preferably using multiple lines and indentation. 
A java tool would be perfect, because then I could easily integrate it into our existing debugging tools, but an online formatter (like http://jsonformatter.curiousconcept.com) would also be very useful. Regards /Jimi -- Charlie Hull Flax - Open Source Enterprise Search tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.flax.co.uk
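The hand indentation Jimi describes is mechanical enough to automate. A rough Python sketch (it does not understand quoted phrases or escaped parentheses, so treat it as a starting point rather than a full Lucene-syntax parser):

```python
def format_query(query, indent="  "):
    """Pretty-print a Solr/Lucene query by indenting nested parentheses.
    Assumes balanced parentheses; quoted phrases containing ( or ) will confuse it."""
    out, depth, line = [], 0, ""
    for ch in query:
        if ch == "(":
            if line.strip():
                out.append(indent * depth + line.strip())
            out.append(indent * depth + "(")
            depth += 1
            line = ""
        elif ch == ")":
            if line.strip():
                out.append(indent * depth + line.strip())
            depth -= 1
            out.append(indent * depth + ")")
            line = ""
        else:
            line += ch
    if line.strip():
        out.append(indent * depth + line.strip())
    return "\n".join(out)

print(format_query("system:(a) ((boolean1:false OR date2:[* TO NOW]))"))
```

Running it over the example query in the thread produces output much like Jimi's hand-formatted version, one clause per line with matching open and close parentheses at the same indent level.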
Re: Debugging Solr Search results & Issues with Distributed IDF
On 01/01/2019 23:03, Lavanya Thirumalaisami wrote: Hi, I am trying to debug a query to find out why one documentgets more score than the other. The below are two similar products. You might take a look at OSC's Splainer http://splainer.io/ or some of the other tools I've written about recently at http://www.flax.co.uk/blog/2018/11/15/defining-relevance-engineering-part-4-tools/ - note that this also covers some commercial offerings (and also that I'm very happy to take any comments or additions!). Cheers Charlie Below is the debug results I get from Solr admin console. "Doc1": "\n15.20965 = sum of:\n 4.7573533 = max of:\n 4.7573533= weight(All:2x in 962) [], result of:\n 4.7573533 = score(doc=962,freq=2.0 =termFreq=2.0\n), product of:\n 3.4598935 = idf(docFreq=1346, docCount=42836)\n 1.375 = tfNorm, computed from:\n 2.0 = termFreq=2.0\n 1.2 = parameter k1\n 0.0 = parameter b (norms omitted forfield)\n 10.452296 = max of:\n 5.9166136 = weight(All:powerpoint in 962)[], result of:\n 5.9166136 =score(doc=962,freq=2.0 = termFreq=2.0\n), product of:\n 4.302992 = idf(docFreq=579,docCount=42836)\n 1.375 = tfNorm,computed from:\n 2.0 =termFreq=2.0\n 1.2 = parameterk1\n 0.0 = parameter b (normsomitted for field)\n 10.452296 =weight(All:\"socket outlet\" in 962) [], result of:\n 10.452296 = score(doc=962,freq=2.0 =phraseFreq=2.0\n), product of:\n 7.60167 = idf(), sum of:\n 3.5370626 = idf(docFreq=1246, docCount=42836)\n 4.064607 = idf(docFreq=735,docCount=42836)\n 1.375 = tfNorm,computed from:\n 2.0 =phraseFreq=2.0\n 1.2 = parameterk1\n 0.0 = parameter b (normsomitted for field)\n", "Doc15":"\n13.258003 = sum of:\n 5.7317085 = max of:\n 5.7317085 = weight(All:doubl in 2122) [],result of:\n 5.7317085 =score(doc=2122,freq=2.0 = termFreq=2.0\n), product of:\n 4.168515 = idf(docFreq=663,docCount=42874)\n 1.375 = tfNorm,computed from:\n 2.0 =termFreq=2.0\n 1.2 = parameterk1\n 0.0 = parameter b (normsomitted for field)\n 4.7657394 =weight(All:2x in 2122) [], result of:\n 4.7657394 
= score(doc=2122,freq=2.0 = termFreq=2.0\n), productof:\n 3.4659925 =idf(docFreq=1339, docCount=42874)\n 1.375 = tfNorm, computed from:\n 2.0 = termFreq=2.0\n 1.2= parameter k1\n 0.0 = parameterb (norms omitted for field)\n 5.390302= weight(All:2g in 2122) [], result of:\n 5.390302 = score(doc=2122,freq=2.0 = termFreq=2.0\n), product of:\n 3.9202197 = idf(docFreq=850,docCount=42874)\n 1.375 = tfNorm,computed from:\n 2.0 = termFreq=2.0\n 1.2 = parameter k1\n 0.0 = parameter b (norms omitted forfield)\n 7.526294 = max of:\n 5.8597584 = weight(All:powerpoint in 2122)[], result of:\n 5.8597584 =score(doc=2122,freq=2.0 = termFreq=2.0\n), product of:\n 4.2616425 = idf(docFreq=604,docCount=42874)\n 1.375 = tfNorm,computed from:\n 2.0 = termFreq=2.0\n 1.2 = parameter k1\n 0.0 = parameter b (norms omitted forfield)\n 7.526294 =weight(All:\"socket outlet\" in 2122) [], result of:\n 7.526294 = score(doc=2122,freq=1.0 =phraseFreq=1.0\n), product of:\n 7.526294 = idf(), sum of:\n 3.4955401 = idf(docFreq=1300, docCount=42874)\n 4.030754 = idf(docFreq=761,docCount=42874)\n 1.0 = tfNorm,computed from:\n 1.0 =phraseFreq=1.0\n 1.2 = parameterk1\n 0.0 = parameter b (normsomitted for field)\n", My Questions 1. IDF : I understand from solr documents that IDFis calculated for each separate shards, I have added the following stats cacheconfig to solrconfig.xml and reloaded collection But even after that there is no change incalculated IDF. 2. What are parameter b and parameter K1? 3. Why there are lots of parameters included in myDoc15 rather than Doc1? Is there any documentations I can refer to understand thesolr query calculations in depth. We are using Solr 6.1in Cloud with 3 zookeepers and 3 masters and 3 replicas. Regards, Lavanya -- Charlie Hull Flax - Open Source Enterprise Search tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.flax.co.uk
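On question 2 above: k1 and b are the parameters of the BM25 similarity, which Solr 6 uses by default. k1 controls term-frequency saturation and b controls document-length normalisation. The tfNorm values in the explain output can be reproduced with the standard Lucene BM25 term-frequency factor; a quick sketch:

```python
def bm25_tf_norm(freq, k1=1.2, b=0.75, doc_len=0.0, avg_len=1.0):
    """BM25 term-frequency normalisation as shown in Lucene explain output:
    tfNorm = (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * doc_len / avg_len))."""
    return (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * doc_len / avg_len))

# "norms omitted for field" in the explain output means b is effectively 0,
# matching the lines "1.375 = tfNorm ... 2.0 = termFreq ... 1.2 = parameter k1
# ... 0.0 = parameter b":
print(bm25_tf_norm(2.0, k1=1.2, b=0.0))  # ≈ 1.375
```

This is why the per-term scores differ mainly through idf: with norms omitted, tfNorm depends only on termFreq and k1, so two documents matching the same terms the same number of times get identical tfNorm values.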
Re: Questions about the IndexUpgrader tool.
On 18/12/2018 17:40, Erick Erickson wrote: You are far better off re-indexing totally. I would add '...if you have the original data'. Not everyone *can* re-index, and there are various hairy ways of updating an index in place, but they require deep-level magic. But if you have the original source data, you should re-index. Cheers Charlie Using IndexUpgraderTool has never guaranteed compatibility across multiple major releases. I.e. if you have an index built with 4x, using that tool will work for 5x, but then going from 5x to 6x _even after the entire index is rewritten from 4 x format_ has never been guaranteed to work. By "guaranteed to work" here, I mean that there can be subtle problems, regardless of appearances The two most succinct statements as to why this is true follow. I will not second guess _anything_ these two people have to say about how Lucene works ;) From Mike McCandless: “This really is the difference between an index and a database: we do not store, precisely, the original documents. We store an efficient derived/computed index from them.” From Robert Muir: “I think the key issue here is Lucene is an index not a database. Because it is a lossy index and does not retain all of the user's data, its not possible to safely migrate some things automagically... The function is y = f(x) and if x is not available its not possible, so lucene can't do it.” As of 6x, a marker is written into each segments and the lowest version is retained when segments are merged. 8x will refuse to start if it detects a 6x marker so this will be enforced soon. Best, Erick On Mon, Dec 17, 2018 at 12:27 PM Pushkar Raste wrote: Hi, I have questions about the IndexUpgrader tool. - I want to upgrade from Solr 4 to Solr 7. Can I run upgrade the index from 4 to 5 then 5 to 6 and finally 6 to 7 using appropriate version of the IndexUpgrader but without loading the Index in the Solr at all during the successive upgrades. 
- The note in the tool says "This tool only keeps last commit in an index". Does this mean I have to optimize the index before running the tool? - There is another note about a partially upgraded index. How can the index be partially upgraded? One scenario I can think of is 'If I upgraded let's say from Solr 5 to Solr 6 and then added some documents. The new documents will be in Lucene 6 format already, while old documents will still be Solr 5 format' Is my understanding correct? -- Charlie Hull Flax - Open Source Enterprise Search tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.flax.co.uk
Re: Solr Cloud - Store Data using multiple drives -2
On 22/11/2018 11:50, Tech Support wrote: Dear Solr Team, I am using Solr 7.5.0 on Windows (SolrCloud). My primary need is: if the current data storage drive is full, I need to use another drive without moving the existing data to the new location. If I add the new dataDir location in the core.properties file, only the new data is available in Solr; I can only access the old indexed data if I move it to the new location. Without moving the existing data, is it possible to use multiple data directories in Solr? You've already had some good and useful answers in a previous thread, so I'm not sure why you're asking the question again...but here goes: You are asking whether it is possible to split a Solr /core/ across two data drives. I don't think that is possible, as you've since found out, as there can only ever be one data directory set for a core. However it is possible to create a Solr /collection/ that consists of multiple cores. You /shard/ the collection into several parts, each of which resides in a different core. You can then easily search over all these parts by addressing the collection in your search request. Each core could use a different data drive. This usually assumes you know how big your index will be and how many parts it needs splitting into, although there are ways to re-shard after the fact using the SolrCloud Collections API. If you just want to keep adding disks as your data grows, you could also use an /alias/ across several /collections/, with each collection having one or more /cores/ on different data drives. Again, this alias feature is available via the SolrCloud Collections API. (I think I've got that all right - this stuff can be confusing and the difference between cores, shards, collections etc. is not always clear. 
This page is very helpful for understanding the basic concepts: https://lucene.apache.org/solr/guide/7_3/how-solrcloud-works.html#how-solrcloud-works) I'd recommend reading up on SolrCloud and planning how to distribute your index before you start. Another thing to think about is how you know that a disk is getting full - you can use Solr's metrics for this, and we've also written a proxy that will block further updates if a disk is getting full - see http://www.flax.co.uk/blog/2016/04/21/running-disk-space-elasticsearch-solr/ HTH, Charlie Thanks, Karthick Ramu -- Charlie Hull Flax - Open Source Enterprise Search tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.flax.co.uk
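The sharding and alias approaches described above are both driven through the Collections API over HTTP. A minimal sketch of those two calls, assuming a Solr node at localhost:8983; the collection, shard and alias names are made up for illustration:

```python
# Sketch: the two Collections API calls discussed above - sharding a
# collection across cores (each core can sit on its own drive via its
# core.properties dataDir), and an alias spanning several collections.
from urllib.parse import urlencode

SOLR = "http://localhost:8983/solr"

def collections_api(action, **params):
    """Build a Collections API URL for the given action."""
    return f"{SOLR}/admin/collections?" + urlencode({"action": action, **params})

# A collection split into two shards, so the index lives in two cores.
create_url = collections_api("CREATE", name="docs", numShards=2,
                             replicationFactor=1)

# An alias over several collections, so one search covers every drive.
alias_url = collections_api("CREATEALIAS", name="all_docs",
                            collections="docs_2018,docs_2019")

# import urllib.request; urllib.request.urlopen(create_url)  # to actually run
```

Searches then address the collection (or alias) name, and Solr fans the request out to every core regardless of which drive it lives on.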
Re: Solr cloud change collection index directory
On 13/11/2018 22:34, Shawn Heisey wrote: If it's important for you to have the data separated from the program, setting the solr home is in my opinion the right way to go. This separation is achieved by the service installer script that Solr includes, which runs on most operating systems other than Windows. A service installer for Windows is something that's been on my mind to try and pursue, but there's never enough time. The standard (but not only) way to install Solr as a Windows service is using NSSM and there are multiple guides available online. One *could* take these and write a detailed addendum to the Solr Ref Guide "Taking Solr to Production" page but it might be hard to cover the various ways to do this (batch files, Powershell scripts, runnable installers, Win32 vs Win64) and produce a definitive best practice guide. However, perhaps a short paragraph suggesting where else to look might be useful. Cheers Charlie Thanks, Shawn -- Charlie Hull Flax - Open Source Enterprise Search tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.flax.co.uk
Re: Slow import from MsSQL and down cluster during process
On 23/10/2018 02:57, Daniel Carrasco wrote: Hello, I have a Solr cluster that is created with 7 machines on AWS instances. The Solr version is 7.2.1 (b2b6438b37073bee1fca40374e85bf91aa457c0b), all nodes are running in NRT mode and I have one replica per node (7 replicas). One node is used to import, and the rest just serve data. My problem is that for about two weeks I've been having problems with an MsSQL import on my Solr cluster: when the process becomes slow or takes too long, the entire cluster goes down. How exactly are you importing from MsSQL to Solr? Are you using the Data Import Handler (DIH) and if so, how? What evidence do you have that this is slow or takes too long? Charlie I'm confused, because the main reason to have a cluster is HA, and every time the import node "fails" (it is not really failing, just taking more time to finish), the entire cluster fails and I have to stop the webpage until the nodes are green again. I don't know if maybe I have to change something in the configuration to allow the cluster to keep working even when the import freezes or the import node dies, but it is very annoying to wake up at 3AM to fix the cluster. Is there any way to avoid this? Maybe keeping the import node as NRT and converting the rest to TLOG? I'm a bit of a noob in Solr, so I don't know if I have to send something to help find the problem; the cluster was created just by creating a Zookeeper cluster, connecting the Solr nodes to that ZK cluster, importing the collections and adding replicas manually to every collection. Also, I've upgraded that cluster from Solr 6 to Solr 7.1 and later to Solr 7.2.1. Thanks and greetings! -- Charlie Hull Flax - Open Source Enterprise Search tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.flax.co.uk
Re: Status of the Zeppelin Solr Interpreter
Also, this was just mentioned in a talk here at Activate: http://www.streamsolr.tk - the presenter Amrit Sarkar was certainly using Zeppelin in his talk which would imply Lucidworks are still maintaining the connectors. Charlie On Wed, 17 Oct 2018 at 16:37, Charlie Hull wrote: > Eric Pugh of Open Source Connections has used Lucidworks' Spark connector > to allow SQL queries to be sent to Solr, is that another way you could use? > > Cheers > > Charlie > > On Wed, 17 Oct 2018 at 08:14, Jan Høydahl wrote: > >> Hi >> >> What is the status of this project? >> Looks pretty dead on GitHub: https://github.com/lucidworks/zeppelin-solr >> Would love to be able to use this in a project. >> >> -- >> Jan Høydahl, search solution architect >> Cominvent AS - www.cominvent.com >> >>
Re: Status of the Zeppelin Solr Interpreter
Eric Pugh of Open Source Connections has used Lucidworks' Spark connector to allow SQL queries to be sent to Solr, is that another way you could use? Cheers Charlie On Wed, 17 Oct 2018 at 08:14, Jan Høydahl wrote: > Hi > > What is the status of this project? > Looks pretty dead on GitHub: https://github.com/lucidworks/zeppelin-solr > Would love to be able to use this in a project. > > -- > Jan Høydahl, search solution architect > Cominvent AS - www.cominvent.com > >
Re: Zookeeper external vs internal
It's also important to remember that you don't need a particularly large or powerful node to run Zookeeper. Charlie On Sun, 14 Oct 2018 at 23:57, Shawn Heisey wrote: > On 10/14/2018 9:31 PM, Sourav Moitra wrote: > > My question does running separate zookeeper ensemble in the same boxes > > provides any advantage over using the solr embedded zookeeper ? > > The major disadvantage to having ZK embedded in Solr is this: If you > stop or restart the Solr process, part of your ZK ensemble goes down > too. It is vastly preferable to have it running as a separate process, > so that you can restart one of the services without causing disruption > in the other service. > > Thanks, > Shawn > >
Re: Modify the log directory for dih
On 04/10/2018 16:35, Shawn Heisey wrote: On 10/4/2018 12:30 AM, lala wrote: Hi, I am using: Solr: 7.4 OS: windows7 I start solr using a service on startup. In that case, I really have no idea where anything is on your system. There is no service installation from the Solr project for Windows -- either you obtained that from somewhere else, or it's something written in-house. Either way, you would need to talk to whoever created that service installation for help locating files on your setup. We usually use NSSM for service-ifying Solr on Windows, I'd recommend you consider that. Also, bear in mind that a Windows Service can't output to stdout or stderr so some messages simply won't go anywhere - but the NSSM documentation is helpful. Charlie In general, you need to find the log4j2.xml file that is controlling your logging configuration and modify it. It contains a sample of how to log something to a separate file -- the slow query log. That example redirects a specific logger name (which is similar to a full qualified class name and in most cases *is* the class name) to a different logfile. Version 7.4 has a bug when running on Windows that causes a lot of problems specific to logging. https://issues.apache.org/jira/browse/SOLR-12538 That problem has been fixed in the 7.5 release. You can also fix it by editing the solr.cmd script manually. Additional info: I am developing a web application that uses solr as search engine, I use DIH to index folders in solr using the FileListEntityProcessor. What I need is logging each index operation in a file that I can reach & read to be able to detect failed index files in the folder. The FileListEntityProcessor class has absolutely no logging in it. If you require that immediately, you would need to add logging commands to the source code and recompile Solr yourself to produce a package with your change. 
With an enhancement issue in Jira, we can review what logging is suitable for the class, and probably make it work like SQLEntityProcessor in that regard. If that's done the way I think it should be, then you could add config in log4j2.xml to enable DEBUG level logging for that class specifically and write its logs to a separate logfile. Thanks, Shawn -- Charlie Hull Flax - Open Source Enterprise Search tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.flax.co.uk
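For reference, a sketch of the kind of log4j2.xml addition Shawn describes: routing one logger to its own file. The appender name, file paths and rollover sizes here are illustrative; the logger name shown is the package the DIH classes live under, but you would narrow it to a specific class once that class actually logs anything.

```xml
<!-- Illustrative log4j2.xml fragment: send DIH logging to its own file. -->
<Appenders>
  <RollingFile name="DihLogFile"
               fileName="${sys:solr.log.dir}/dih.log"
               filePattern="${sys:solr.log.dir}/dih.log.%i">
    <PatternLayout pattern="%d{ISO8601} [%t] %-5p %c %x - %m%n"/>
    <Policies>
      <SizeBasedTriggeringPolicy size="32 MB"/>
    </Policies>
    <DefaultRolloverStrategy max="10"/>
  </RollingFile>
</Appenders>
<Loggers>
  <Logger name="org.apache.solr.handler.dataimport" level="debug"
          additivity="false">
    <AppenderRef ref="DihLogFile"/>
  </Logger>
</Loggers>
```

The slow-query-log example already shipped in Solr's log4j2.xml follows this same appender-plus-logger pattern, so it is a good template to copy from.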
Re: Unnecessary Components
An interesting problem, perhaps we'll look at this at one of the Hackdays we're running soon! Previously we managed to cut down the Solr config files to fewer lines than the Apache license statement. Charlie On 19/09/2018 21:25, Shawn Heisey wrote: On 9/19/2018 1:48 PM, oddtyme wrote: I am helping implement solr for a "downloadable library" of sorts. The objective is that communities without internet access will be able to access a library's worth of information on a small, portable device. As such, I am working within strict space constraints. What are some non-essential components of solr that can be cut to conserve space for more information? For basic functionality, the entire contrib directory could probably be removed. That's more than half of the download right there. Some of the jars in solr-webapp/webapp/WEB-INF/lib can likely be removed. Chances are that you won't need the jars starting with "hadoop" - those are for HDFS support. That's another 11 MB. If you don't need either HDFS or SolrCloud, you can remove the zookeeper jar, and I think you can also remove the curator jars. If you're not accessing Solr with a JDBC driver, you won't need the calcite jars. If you're not dealing with oriental characters (and sometimes even if you ARE), you can probably do without lucene-analyzers-kuromoji. With careful code analysis, you can probably find other jars that aren't needed, but there's not a huge amount of space saving to be gained with most of the others. Thanks, Shawn -- Charlie Hull Flax - Open Source Enterprise Search tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.flax.co.uk
Re: Haystack, the search relevance conference comes to London on October 2nd 2018
On 21/08/2018 15:14, Charlie Hull wrote: Hi all, We're very happy to announce the first Haystack Europe conference in London on October 2nd. Hi all, Just to note the full conference programme is now up, including talks on Learning to Rank, tools for visualising and tuning relevance, building search relevance teams and more. Hope to see some of you there! https://opensourceconnections.com/events/haystack-europe-2018/ Cheers Charlie https://opensourceconnections.com/events/haystack-europe-2018/ Come and hear talks by Doug Turnbull, co-author of Relevant Search, Karen Renshaw, Head of Search and Content for Grainger Global Online and other relevance experts, plus the usual networking and knowledge sharing. Hope to meet some of you there! Cheers Charlie -- Charlie Hull Flax - Open Source Enterprise Search tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.flax.co.uk
Re: MLT in Cloud Mode - Not Returning Fields?
On 31/08/2018 19:36, Doug Turnbull wrote: Hello, We're working on a Solr More Like This project (Solr 6.6.2), using the More Like This searchComponent. What we note is in standalone Solr, when we request MLT using the search component, we get every more like this document fully formed with complete fields in the moreLikeThis section. Hey Doug, IIRC there wasn't a lot of support for MLT in cloud mode a few years ago, and there are certainly still a few open issues around cloud support: https://issues.apache.org/jira/browse/SOLR-4414 https://issues.apache.org/jira/browse/SOLR-5480 Maybe there are some hints in the ticket comments about different ways to do what you want. Cheers Charlie In cloud, however, with the exact same query and config, we only get the doc ids under "moreLikeThis" requiring us to fetch the metadata associated with each document. I can't easily share an example due to confidentiality, but I want to check if we're missing something? Documentation doesn't mention any limitations. The only interesting note I've found is this one which points to a potential difference in behavior The Cloud MLT Query Parser uses the realtime get handler to retrieve the fields to be mined for keywords. Because of the way the realtime get handler is implemented, it does not return data for fields populated using copyField. https://stackoverflow.com/a/46307140/8123 Any thoughts? -Doug -- Charlie Hull Flax - Open Source Enterprise Search tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.flax.co.uk
Re: Want to start contributing.
On 20/08/2018 18:45, Rohan Chhabra wrote: Hi all, I am an absolute beginner (dummy) in the field of contributing open source. But I am interested in contributing to open source. How do i start? Solr is a java based search engine based on Lucene. I am good at Java and therefore chose this to start. I need guidance. Help required!! A related topic: we are running two free Lucene Hackdays, in London on October 9th and Montreal on October 15th (the week of the Activate conference): https://www.meetup.com/Apache-Lucene-Solr-London-User-Group/events/252740719/ https://www.meetup.com/Apache-Lucene-Solr-London-User-Group/events/253610289/ This would be a great place to meet and learn from existing Lucene committers. Best Charlie -- Charlie Hull Flax - Open Source Enterprise Search tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.flax.co.uk
Haystack, the search relevance conference comes to London on October 2nd 2018
Hi all, We're very happy to announce the first Haystack Europe conference in London on October 2nd. https://opensourceconnections.com/events/haystack-europe-2018/ Come and hear talks by Doug Turnbull, co-author of Relevant Search, Karen Renshaw, Head of Search and Content for Grainger Global Online and other relevance experts, plus the usual networking and knowledge sharing. Hope to meet some of you there! Cheers Charlie -- Charlie Hull Flax - Open Source Enterprise Search tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.flax.co.uk
Re: Hackdays in October, London & Montreal
On 13/07/2018 15:10, Charlie Hull wrote: On 12/07/2018 10:28, Charlie Hull wrote: Hi all, A couple of years ago I ran two free Lucene Hackdays in London and Boston (the latter just before Lucene Revolution). Here's what we got up to with the kind support of Alfresco, Bloomberg, BA Insight and Lucidworks http://www.flax.co.uk/blog/2016/10/21/tale-two-cities-two-lucene-hackdays/ I'd like to do this again during the weeks of 8th and 15th October in London and Montreal (so just before the Activate event). It's a great chance to get together IRL with other Lucene/Solr/Elasticsearch hackers! I have a venue for London but a sponsor for evening curry/drinks would be wonderful, and for Montreal I still need a venue and evening sponsor - do let me know if you or your employer can help. We have a placeholder event for London! https://www.meetup.com/Apache-Lucene-Solr-London-User-Group/events/252740719/ ...and we now have a venue for our Montreal event which will be on Monday 15th October https://www.meetup.com/Apache-Lucene-Solr-London-User-Group/events/253610289/ Hope to see some of you there! Cheers Charlie C I'll post again once there are more details and with a call for ideas as to what we should work on. Best Charlie -- Charlie Hull Flax - Open Source Enterprise Search tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.flax.co.uk
Re: Hackdays in October, London & Montreal
On 12/07/2018 10:28, Charlie Hull wrote: Hi all, A couple of years ago I ran two free Lucene Hackdays in London and Boston (the latter just before Lucene Revolution). Here's what we got up to with the kind support of Alfresco, Bloomberg, BA Insight and Lucidworks http://www.flax.co.uk/blog/2016/10/21/tale-two-cities-two-lucene-hackdays/ I'd like to do this again during the weeks of 8th and 15th October in London and Montreal (so just before the Activate event). It's a great chance to get together IRL with other Lucene/Solr/Elasticsearch hackers! I have a venue for London but a sponsor for evening curry/drinks would be wonderful, and for Montreal I still need a venue and evening sponsor - do let me know if you or your employer can help. We have a placeholder event for London! https://www.meetup.com/Apache-Lucene-Solr-London-User-Group/events/252740719/ C I'll post again once there are more details and with a call for ideas as to what we should work on. Best Charlie -- Charlie Hull Flax - Open Source Enterprise Search tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.flax.co.uk
Hackdays in October, London & Montreal
Hi all, A couple of years ago I ran two free Lucene Hackdays in London and Boston (the latter just before Lucene Revolution). Here's what we got up to with the kind support of Alfresco, Bloomberg, BA Insight and Lucidworks http://www.flax.co.uk/blog/2016/10/21/tale-two-cities-two-lucene-hackdays/ I'd like to do this again during the weeks of 8th and 15th October in London and Montreal (so just before the Activate event). It's a great chance to get together IRL with other Lucene/Solr/Elasticsearch hackers! I have a venue for London but a sponsor for evening curry/drinks would be wonderful, and for Montreal I still need a venue and evening sponsor - do let me know if you or your employer can help. I'll post again once there are more details and with a call for ideas as to what we should work on. Best Charlie -- Charlie Hull Flax - Open Source Enterprise Search tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.flax.co.uk
Re: Solr Issue after the DSE upgrade
On 17/06/2018 03:10, Umadevi Nalluri wrote: I am getting Connection refused (Connection refused) when I am running reload_core with dsetool after we set up JMX. This issue has been happening since the DSE upgrade to 5.0.12. Can someone please help with this issue - is this a bug? Is there a workaround for it? dsetool appears to be a utility from Datastax - have you tried asking them for support? Charlie Thanks Kantheti -- Charlie Hull Flax - Open Source Enterprise Search tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.flax.co.uk
Re: Parent product show in search result
On 04/06/2018 17:15, Apurba Hazra wrote: Hello, We are implementing Solr search for our website using Magento. Our requirement is that on the search results page we have to show only the parent product, not all the child products, if the parent exists; otherwise we have to show the child product. Will you please tell us how we can do that? Should we change settings in the Solr panel as well as the Magento admin panel? Please advise us, it's very urgent. Hi, How and more importantly *if* you can do this will depend on how Solr has been integrated with Magento. Magento documentation, mailing lists etc. should be your first port of call. Best Charlie *Thanks & Regards,* *Apurba Hazra* *Project Manager* *Navigator Software Pvt. Ltd.* Web Applications / Enterprise Mobility & Mobile Apps / Cloud Solutions / E-Commerce / Bespoke and Product development / Enterprise CMS / Online POS / VOIP Solutions / Internet Marketing / Business Intelligence & Analytics / Dedicated Hiring Solutions. www.needdevelopers.com www.boostmysale.com www.navsoft.in 20 Dr. E Moses Road, Mahalakshmi, Mumbai 400020 205 & 206 Haute Street Bldg., 86A Topsia Road; Kolkata 700046 Tel: (+91-33) 40259595 <00913340259595> -- Charlie Hull Flax - Open Source Enterprise Search tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.flax.co.uk
Re: How to create a solr collection providing as much searching flexibility as possible?
On 29/04/2018 22:25, Raymond Xie wrote: Thank you Alessandro, It looks like my requirement is vague, but indeed I already indicated my data is in FIX format, which is a standard format; here is an example from the Wiki link in my original question: 8=FIX.4.2 | 9=178 | 35=8 | 49=PHLX | 56=PERS | 52=20071123-05:30:00.000 | 11=ATOMNOCCC9990900 | 20=3 | 150=E | 39=E | 55=MSFT | 167=CS | 54=1 | 38=15 | 40=2 | 44=15 | 58=PHLX EQUITY TESTING | 59=0 | 47=C | 32=0 | 31=0 | 151=15 | 14=0 | 6=0 | 10=128 | As the data format is quite special, and commonly used in the financial sector (especially for trading data), I believe there must be plenty of prior work. That's why I want to find out. Hi, Start with the search functionality you want to provide: which fields should be covered by a standard search box; which fields should the user be able to facet on; which should they be able to sort on. From these requirements you should be able to work backwards and decide how to index the data appropriately. Cheers Charlie Thank you. *Sincerely yours,* *Raymond* On Sat, Apr 28, 2018 at 11:32 AM, Alessandro Benedetti <a.benede...@sease.io wrote: Hi Raymond, your requirements are quite vague, Solr offers you those capabilities but you need to model your configuration and data accordingly. https://lucene.apache.org/solr/guide/7_3/solr-tutorial.html is a good starting point. After that you can study your requirements and design the search solution accordingly. Cheers - --- Alessandro Benedetti Search Consultant, R Software Engineer, Director Sease Ltd. - www.sease.io -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html -- Charlie Hull Flax - Open Source Enterprise Search tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.flax.co.uk
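Working backwards from search requirements, as Charlie suggests, starts with getting each FIX tag into a named field you can search, facet or sort on. A minimal sketch of parsing a tag=value message like the example above into a flat document; the tag-to-field mapping is a small illustrative subset, not the full FIX dictionary:

```python
# Sketch: split a FIX message into {tag: value} pairs, then map tags to
# readable field names ready for indexing. Only a handful of tags are
# mapped here for illustration; unknown tags fall back to tag_<n>.
FIELD_NAMES = {"8": "begin_string", "35": "msg_type", "49": "sender",
               "52": "sending_time", "55": "symbol", "44": "price"}

def parse_fix(message, delimiter="|"):
    """Split a FIX message into a {tag: value} dict."""
    pairs = (part.strip() for part in message.split(delimiter) if part.strip())
    return dict(p.split("=", 1) for p in pairs)

def to_doc(message):
    """Map parsed FIX tags to named fields for a search document."""
    return {FIELD_NAMES.get(tag, f"tag_{tag}"): value
            for tag, value in parse_fix(message).items()}

msg = "8=FIX.4.2 | 35=8 | 49=PHLX | 52=20071123-05:30:00.000 | 55=MSFT | 44=15"
doc = to_doc(msg)
# doc["symbol"] == "MSFT" -- symbol could then be a facet field, price a sort field
```

From here the schema decisions follow the requirements: symbol and sender as facetable string fields, sending_time as a date field for sorting, and so on.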
Re: IndexFetcher cannot download index file
On 24/04/2018 16:44, Walter Underwood wrote: In Ultraseek, we checked free disk space before starting a merge or replication. If there wasn’t enough space, it emailed an error to the admin and disabled merging or replication, respectively. Checking free disk space on Windows was a pain. On a related topic, we built something that can block connections if there's no space to accept new documents for indexing: https://github.com/flaxsearch/harahachibu Cheers Charlie wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) On Apr 24, 2018, at 8:39 AM, Shawn Heisey <elyog...@elyograg.org> wrote: On 4/24/2018 6:52 AM, Markus Jelsma wrote: Forget about it, recovery got a java.io.IOException: No space left on device but it wasn't clear until I inspected the real logs. The logs in the web admin didn't show the disk space exception, even when I expanded the log line. Maybe that could be changed. What was the severity of the log entry showing the disk space exception? Can you share the whole message/stacktrace? If it doesn't show up in the admin UI logging tab, that would suggest that it was an INFO level log, when it should probably be ERROR. Thanks, Shawn -- Charlie Hull Flax - Open Source Enterprise Search tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.flax.co.uk
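The idea behind the checks Walter and Charlie describe - refuse new work when the index volume is nearly full, rather than letting a merge or replication die halfway - can be sketched in a few lines. This is the general pattern, not the harahachibu implementation; the threshold and path are illustrative:

```python
# Sketch: gate incoming index updates on free disk space, so the failure
# is a clean rejection up front instead of an IOException mid-merge.
import shutil

def has_headroom(path, min_free_fraction=0.15, total=None, free=None):
    """True if at least min_free_fraction of the volume holding `path` is
    free. total/free can be injected for testing; otherwise read from disk."""
    if total is None or free is None:
        usage = shutil.disk_usage(path)
        total, free = usage.total, usage.free
    return (free / total) >= min_free_fraction

def accept_update(path="/var/solr/data"):
    """Raise before handing documents to the indexer if the disk is nearly full."""
    if not has_headroom(path):
        raise RuntimeError("Rejecting update: index disk nearly full")
    # ...pass the documents on to Solr here...
```

A merge can temporarily need space comparable to the segments being merged, which is why the threshold is a fraction of the volume rather than a fixed byte count.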
Re: Specialized Solr Application
On 16/04/2018 19:48, Terry Steichen wrote: I have from time-to-time posted questions to this list (and received very prompt and helpful responses). But it seems that many of you are operating in a very different space from me. The problems (and lessons-learned) which I encounter are often very different from those that are reflected in exchanges with most other participants. Hi Terry, Sounds like a fascinating use case. We have some similar clients - small scale law firms and publishers - who have taken advantage of Solr. One thing I would encourage you to do is to blog and/or talk about what you've built. Lucene Revolution is worth applying to talk at and if you do manage to get accepted - or if you go anyway - you'll meet lots of others with similar challenges and come away with a huge amount of useful information and contacts. Otherwise there are lots of smaller Meetup events (we run the London, UK one). Don't assume just because some people here are describing their 350 billion document learning-to-rank clustered monster that the small applications don't matter - they really do, and the fact that they're possible to build at all is a testament to the open source model and how we share information and tips. Cheers Charlie So I thought it would be useful to describe what I'm about, and see if there are others out there with similar implementations (or interest in moving in that direction). A sort of pay-forward. We (the Lakota Peoples Law Office) are a small public interest, pro bono law firm actively engaged in defending Native American North Dakota Water Protector clients against (ridiculously excessive) criminal charges. I have a small Solr (6.6.0) implementation - just one shard. I'm using the cloud mode mainly to be able to implement access controls. The server is an ordinary (2.5GHz) laptop running Ubuntu 16.04 with 8GB of RAM and 4 cpu processors. We presently have 8 collections with a total of about 60,000 documents, mostly pdfs and emails. 
The indexed documents are partly our own files and partly those we obtain through legal discovery (which, surprisingly, is allowed in ND for criminal cases). We only have a few users (our lawyers and a couple of researchers mostly), so traffic is minimal. However, there's a premium on precision (and recall) in searches. The document repository is local to the server. I piggyback on the embedded Jetty httpd in order to serve files (selected from the hitlists). I just use a symbolic link to tie the repository to Solr/Jetty's "webapp" subdirectory. We provide remote access via ssh with port forwarding. It provides very snappy performance, with fully encrypted links. Appears quite stable. I've had some bizarre behavior apparently caused by an interaction between repository permissions, Solr permissions and the ssh link. It seems "solved" for the moment, but time will tell for how long. If there are any folks out there who have similar requirements, I'd be more than happy to share the insights I've gained and the problems I've encountered and (I think) overcome. There are so many unique parts of this small-scale, specialized application (many dimensions of which are not strictly internal to Solr) that it probably wouldn't be appropriate to dump them on this (excellent) Solr list. So, if you encounter problems peculiar to this kind of setup, we can perhaps help handle them off-list (although if they have more general Solr application, we should, of course, post them to the list). Terry Steichen -- Charlie Hull Flax - Open Source Enterprise Search tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.flax.co.uk
Re: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?
As a bonus here's a Dropwizard Tika wrapper that gives you a Tika web service https://github.com/mattflax/dropwizard-tika-server written by a colleague of mine at Flax. Hope this is useful. Cheers Charlie On 9 April 2018 at 19:26, Hanjan, Harinder <harinder.han...@calgary.ca> wrote: > Thank you Charlie, Tim. > I will integrate Tika in my Java app and use SolrJ to send data to Solr. > > > -Original Message- > From: Allison, Timothy B. [mailto:talli...@mitre.org] > Sent: Monday, April 09, 2018 11:24 AM > To: solr-user@lucene.apache.org > Subject: [EXT] RE: How to use Tika (Solr Cell) to extract content from > HTML document instead of Solr's MostlyPassthroughHtmlMapper ? > > +1 > > https://lucidworks.com/2012/02/14/indexing-with-solrj/ > > We should add a chatbot to the list that includes Charlie's advice and the > link to Erick's blog post whenever Tika is used. > > > -Original Message- > > From: Charlie Hull [mailto:char...@flax.co.uk] > > Sent: Monday, April 9, 2018 12:44 PM > > To: solr-user@lucene.apache.org > > Subject: Re: How to use Tika (Solr Cell) to extract content from HTML > document instead of Solr's MostlyPassthroughHtmlMapper ? > > > > I'd recommend you run Tika externally to Solr, which will allow you to > catch this kind of problem and prevent it bringing down your Solr > installation. > > > > Cheers > > > > Charlie > > > > On 9 April 2018 at 16:59, Hanjan, Harinder <harinder.han...@calgary.ca> > > wrote: > > > > > Hello! > > > > > > Solr (i.e. Tika) throws a "zip bomb" exception with certain documents > > > we have in our Sharepoint system. 
I have used the tika-app.jar > > > directly to extract the document in question and it does _not_ throw > > > an exception and extracts the contents just fine. So it would seem Solr > > > is doing something different than a Tika standalone installation. > > > > > > After some Googling, I found out that Solr uses its custom HtmlMapper > > > (MostlyPassthroughHtmlMapper) which passes through all elements in the > > > HTML document to Tika. As Tika limits nested elements to 100, this > > > causes Tika to throw an exception: Suspected zip bomb: 100 levels of > > > XML element nesting. This is mentioned in TIKA-2091 > > > (https://issues.apache.org/jira/browse/TIKA-2091?focusedCommentId=15514131&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15514131). The > > > "solution" is to use Tika's default parsing/mapping mechanism but no > > > details have been provided on how to configure this at Solr. > > > > > > I'm hoping some folks here have the knowledge on how to configure Solr > > > to effectively bypass its built-in MostlyPassthroughHtmlMapper and > > > use Tika's implementation. > > > > > > Thank you! > > > Harinder > > > > > > NOTICE - > > > This communication is intended ONLY for the use of the person or > > > entity named above and may contain information that is confidential or > > > legally privileged. If you are not the intended recipient named above > > > or a person responsible for delivering messages or communications to > > > the intended recipient, YOU ARE HEREBY NOTIFIED that any use, > > > distribution, or copying of this communication or any of the > > > information contained in it is strictly prohibited. 
If you have > > > received this communication in error, please notify us immediately by > > > telephone and then destroy or delete this communication, or return it > > > to us by mail if requested by us. The City of Calgary thanks you for > your attention and co-operation. > > > > >
Re: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?
I'd recommend you run Tika externally to Solr, which will allow you to catch this kind of problem and prevent it bringing down your Solr installation. Cheers Charlie On 9 April 2018 at 16:59, Hanjan, Harinder wrote: > Hello! > > Solr (i.e. Tika) throws a "zip bomb" exception with certain documents we > have in our Sharepoint system. I have used the tika-app.jar directly to > extract the document in question and it does _not_ throw an exception and > extracts the contents just fine. So it would seem Solr is doing something > different than a Tika standalone installation. > > After some Googling, I found out that Solr uses its custom HtmlMapper > (MostlyPassthroughHtmlMapper) which passes through all elements in the HTML > document to Tika. As Tika limits nested elements to 100, this causes Tika > to throw an exception: Suspected zip bomb: 100 levels of XML element > nesting. This is mentioned in TIKA-2091 (https://issues.apache.org/ > jira/browse/TIKA-2091?focusedCommentId=15514131&page=com.atlassian.jira. > plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15514131). The > "solution" is to use Tika's default parsing/mapping mechanism, but no > details have been provided on how to configure this in Solr. > > I'm hoping some folks here have the knowledge on how to configure Solr to > effectively bypass its built-in MostlyPassthroughHtmlMapper and use Tika's > implementation. > > Thank you! > Harinder > > > > NOTICE - > This communication is intended ONLY for the use of the person or entity > named above and may contain information that is confidential or legally > privileged. If you are not the intended recipient named above or a person > responsible for delivering messages or communications to the intended > recipient, YOU ARE HEREBY NOTIFIED that any use, distribution, or copying > of this communication or any of the information contained in it is strictly > prohibited. 
If you have received this communication in error, please notify > us immediately by telephone and then destroy or delete this communication, > or return it to us by mail if requested by us. The City of Calgary thanks > you for your attention and co-operation. >
Re: Query redg : diacritics in keyword search
On 29/03/2018 14:12, Peter Lancaster wrote: Hi, You don't say whether the AsciiFolding filter is at index time or query time. In any case you can easily look at what's happening using the admin analysis tool, which helpfully will even highlight where the analysed query and index token match. That said, I'd expect what you want to work if you simply use the ASCIIFolding filter on both index and query. Simply put: You use the filter at indexing time to collapse any variants of a term into a single variant, which is then stored in your index. You use the filter at query time to collapse any variants of a term that users type into a single variant, and if this exists in your index you get a match. If you don't use the same filter at both ends you won't get a match. Cheers Charlie Cheers, Peter. -----Original Message----- From: Paul, Lulu [mailto:lulu.p...@bl.uk] Sent: 29 March 2018 12:03 To: solr-user@lucene.apache.org Subject: Query redg : diacritics in keyword search Hi, The keyword search Carré returns values Carré and Carre (this works well as I added the tokenizer in the schema config to enable returning of both sets of values). Now it looks like we want Carre to return both Carré and Carre (and this doesn't work. Solr only returns Carre) – any ideas on how this scenario can be achieved? Thanks & Best Regards, Lulu Paul Experience the British Library online at www.bl.uk The British Library's latest Annual Report and Accounts: www.bl.uk/aboutus/annrep/index.html Help the British Library conserve the world's knowledge. Adopt a Book. www.bl.uk/adoptabook The Library's St Pancras site is WiFi-enabled. The information contained in this e-mail is confidential and may be legally privileged. It is intended for the addressee(s) only. 
If you are not the intended recipient, please delete this e-mail and notify the postmas...@bl.uk<mailto:postmas...@bl.uk> : The contents of this e-mail must not be disclosed or copied without the sender's consent. The statements and opinions expressed in this message are those of the author and do not necessarily reflect those of the British Library. The British Library does not take any responsibility for the views of the author. * Think before you print This message is confidential and may contain privileged information. You should not disclose its contents to any other person. If you are not the intended recipient, please notify the sender named above immediately. It is expressly declared that this e-mail does not constitute nor form part of a contract or unilateral obligation. Opinions, conclusions and other information in this message that do not relate to the official business of findmypast shall be understood as neither given nor endorsed by it. __ This email has been checked for virus and other malicious content prior to leaving our network. ______ -- Charlie Hull Flax - Open Source Enterprise Search tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.flax.co.uk
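For readers finding this thread later, a minimal sketch of the kind of field type Charlie describes (field type and class names follow standard Solr conventions, but the exact chain is illustrative rather than taken from Lulu's schema), with the ASCIIFolding filter applied in both the index and query analyzer chains:

```xml
<fieldType name="text_folded" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Collapses "Carré" to "carre" at index time, so one form is stored -->
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Collapses whatever the user types to the same form, so "Carre" and "Carré" both match -->
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```

With this in place a query for either Carre or Carré matches documents containing either form, and the Analysis screen in the admin UI shows the folding step by step, as Peter suggests.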
Re: Solr or Elasticsearch
On 22/03/2018 13:13, Steven White wrote: Hi everyone, There are some good write-ups on the internet comparing the two, and the one thing that keeps coming up about Elasticsearch being superior to Solr is its analytics capability. However, I cannot find what those analytics capabilities are and why they cannot be done using Solr. Can someone help me with this question? Hi Steve, As you've said there are lots of write-ups, some more out-of-date than others. http://solr-vs-elasticsearch.com/ is quite good on features. The analytics in ES are based on a number of custom aggregations (which I always think of as facet-counting-on-steroids, but I realise it's more complicated than that). Here's an early doc on them: https://www.elastic.co/guide/en/elasticsearch/guide/current/_analytics.html So you need a good grasp of Elasticsearch's DSL to use these. The integration with Kibana is good if you want to display your results. Solr's analytics capabilities use a Solr Search Component: https://lucene.apache.org/solr/guide/7_2/analytics.html . As with a lot of Solr features these can appear a lot more complex than Elasticsearch's offering. Yonik's blog is also worth reading as he often talks about new and upcoming Solr features like this: http://yonik.com/solr-facet-functions/ As we've always said, there are few cases where you can't build a solution using either engine, and I believe that's also true for analytics. Personally, I'm a Solr user and the thing that concerns me about Elasticsearch is the fact that it is owned by a company that could decide any day to stop making Elasticsearch available under the Apache license and even completely close free access to it. Yes, but why would they? It would be suicide for a company with such an established open source heritage - not least because a lot of Lucene developers who work for Elastic would object. 
I'd be a bit more annoyed about the fact they announced that their commercial X-Pack add-ons would be 'open code' and everyone thinks that means 'open source' - which it clearly isn't. So, this is a two-part question: 1) What are the analytics capabilities of Elasticsearch that cannot be done using Solr? I want to see a complete list if possible. 2) Should an Elasticsearch user be worried that Elasticsearch may close its open-source policy at any time, or that outsiders have no say about its road map? That's a slightly different question about road map - but you do have some say: Elastic's developers have always been very helpful and open to suggestions from outsiders (who are also users of course!). Cheers Charlie Thanks, Steve -- Charlie Hull Flax - Open Source Enterprise Search tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.flax.co.uk
Re: Solr dih extract text from inline images in pdf
On 07/03/2018 13:29, lala wrote: Thanks Charlie... It's just confusing for me. In the DIH configuration file, for the inner entity that takes "TikaEntityProcessor" as its processor, I can easily point a tikaConfig attribute at an xml file located inside the config folder in the core, where I should be able to override the PDFParser default properties... As in parseContext.Config... The thing is that I placed my tika-config.xml file in the config folder and set the "tikaConfig" attribute = "tika-config.xml"... But Tika is still not parsing images inside the PDF file! Let's say this is just experimenting with Solr DIH crawling... Why is it not working? This is my tika-config.xml file: [XML stripped by the mailing list - only two "true" values survive] I've read the code in both TikaEntityProcessor and TikaConfig... It should read the xml file from the config folder, extract params and override the original PDFParser attributes. But it DOESN'T! Any idea?? Hi, My reading of https://tika.apache.org/1.17/configuring.html#Using_a_Tika_Configuration_XML_file indicates that your PDF parser may not run unless you explicitly exclude PDFs, which I don't think you're doing above. I'm not an expert on Tika configuration, but I think you should first try this xml file with standalone Tika and see if it does what you think it should. Once you're sure, then try it with DIH or SolrJ. Cheers Charlie -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html -- Charlie Hull Flax - Open Source Enterprise Search tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.flax.co.uk
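The mailing list stripped the XML from lala's message, so for anyone trying to reproduce this, here is a sketch of a Tika 1.x configuration along these lines (the original file may well have differed). It enables inline image extraction on the PDF parser and, per Charlie's point about the documentation, also excludes PDFs from the default parser so the custom PDFParser entry actually takes effect:

```xml
<properties>
  <parsers>
    <!-- Exclude PDF from the default parser so the entry below is used instead -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="extractInlineImages" type="bool">true</param>
        <param name="extractUniqueInlineImagesOnly" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>
```

As Charlie advises, this file is easiest to verify with standalone Tika (e.g. `java -jar tika-app.jar --config=tika-config.xml`) before wiring it into DIH.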
Re: Solr dih extract text from inline images in pdf
On 07/03/2018 09:32, lala wrote: Thanks for your reply Erick, Actually I am using SolrJ to index files among other operations with Solr, but to index a large amount of different kinds of files, I'm sending a DIH request to Solr using the SolrJ API: FileListEntityProcessor with TikaEntityProcessor... Why not benefit from this technology if Solr offers it? It simplifies our work tremendously... It may simplify your work, but it isn't good practice. Tika has some heavy lifting to do to extract text from some formats and you should consider how this load will affect Solr. We've often put Tika into a different process for this reason. Isn't there any way to be able to extract inline images in PDF docs? https://stackoverflow.com/questions/31303735/how-to-extract-images-from-a-file-using-apache-tika has some useful suggestions. Charlie Awaiting your reply, best regards... -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html -- Charlie Hull Flax - Open Source Enterprise Search tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.flax.co.uk
Re: Word / PDF document snippet rendering in search
On 02/03/2018 00:15, T Wild wrote: I'm interested in building a software system which will connect to various document sources, extract the content from the documents contained within each source, and make the extracted content available to a search engine such as Solr. This search engine will serve as the back-end for a web-based search application. This is basically an 'enterprise search' system. You use 'connectors' to get text out of the source documents - in Solr applications we often use Apache Tika to extract text from common formats like Office or PDF; Apache ManifoldCF is another useful project for connecting to repositories. I'm interested in rendering snippets of these documents in the search results for well-known types, such as Microsoft Word and PDF. How would one go about implementing document snippet rendering in search? If you just want the snippets as text, you can use Solr highlighters, which can provide contextual snippets (i.e. chunks of text around the query matches). I'd be happy with serving up these snippets in any format, including as images. I just want to be able to give my users some kind of formatted preview of their results for well-known types. If you however want to show bits of the original documents, that's more difficult. You'll need to store a reference to the original document in Solr and use an external system to display it - you'll need specific systems for different doc types: PDFs can be shown in various browser plugins, for example. Another approach is illustrated in this open source code we wrote a while ago - it uses OpenOffice in 'headless' mode to provide images of the source document: https://github.com/flaxsearch/flaxcode/tree/master/flax_basic/libs/previewgen Hope this helps! Cheers Charlie Thank you! -- Charlie Hull Flax - Open Source Enterprise Search tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.flax.co.uk
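As a concrete sketch of the highlighter approach Charlie mentions, the relevant defaults can be set on the search handler in solrconfig.xml (the handler name and the `content` field are illustrative - use whichever field holds the Tika-extracted text):

```xml
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- Return up to three contextual snippets, ~150 chars each,
         from the extracted document text -->
    <str name="hl">true</str>
    <str name="hl.fl">content</str>
    <str name="hl.snippets">3</str>
    <str name="hl.fragsize">150</str>
  </lst>
</requestHandler>
```

The same parameters can equally be passed per-request (`hl=true&hl.fl=content&...`) while experimenting. Note the field being highlighted must be stored.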
Re: DocValues and in-place updates
On 12/02/2018 16:02, Brian Yee wrote: I asked a question here about fast inventory updates last week and I was recommended to use docValues with partial in-place updates. I think this will work well, but there is a problem I can't think of a good solution for. Consider this scenario: InStock = 1 for a product. InStock changes to 0, which triggers a fast in-place update with docValues. But it also triggers a slow update that will rebuild the entire document. Let's say that takes 10 minutes because we do updates in batches. During those 10 minutes, InStock changes again to 1, which triggers a fast update to Solr. So in Solr InStock=1, which is correct. The slow update finishes and overwrites InStock=0, which is incorrect. How can we deal with this situation? It's a slightly crazy idea, but in the past we've solved a similar problem by building a custom Lucene codec that is backed by a Redis database. You change the stock value in Redis and Lucene doesn't actually notice and re-index. http://www.flax.co.uk/blog/2012/06/22/updating-individual-fields-in-lucene-with-a-redis-backed-codec/ Not sure if this is a better way than DocValues; it was quite a while ago and Lucene has moved on a bit since then. Cheers Charlie -- Charlie Hull Flax - Open Source Enterprise Search tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.flax.co.uk
Re: Purchase of support
On 12/02/2018 07:58, Hon Fook Boey wrote: Hi, May I know if support/maintenance can be purchased for Solr? Hi, Various companies provide support for Solr (including us): what kind of support are you looking for? Best Charlie Thanks and regards, Boey HF eHoB Technology Sdn Bhd (Co Reg No 561898-X, GST Reg # 001282277376) No 12-2, Jln PJU 7/16A, Mutiara Damansara, 47800 Petaling Jaya, Malaysia Tel +6 03 7710 3308 Fax +6 03 7726 6228 Mobile +6 012 395 0213 WWW www.ehob-tech.com.my -- Charlie Hull Flax - Open Source Enterprise Search tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.flax.co.uk
Re: Opinions on ExtractingRequestHandler
On 08/02/2018 11:47, Frederik Van Hoyweghen wrote: Hey everyone, What are your experiences on making (in production) use of Solr's ExtractingRequestHandler? I've been reading some mixed remarks so I was wondering what your actual experiences with it are. Personally, I feel like setting up a separate service which is solely responsible for parsing file contents (to be indexed by Solr later on in the process) using Tika is a safer approach, so we can use whatever Tika version we want along with other things we might want to add. Yes, do this. It's entirely possible to bring down Tika with a nasty PDF, or end up consuming lots of resources in the extraction step and have these impact your Solr server. Run it separately and you can monitor it/kill it if necessary. You might like my colleague Matt Pearce's DropWizard wrapper for Tika https://github.com/mattflax/dropwizard-tika-server Cheers Charlie Looking forward to your response! Kind regards, Frederik -- Charlie Hull Flax - Open Source Enterprise Search tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.flax.co.uk
Re: Relevancy Tuning For Solr With Apache Nutch 2.3
On 07/02/2018 21:59, Mukhopadhyay, Aratrika wrote: Hello, I am attempting to tune the results that I retrieve from Solr to boost the importance of certain fields. The syntax of the query I am using is as follows: http://localhost:8983/solr/housegov_data/select?indent=on&q=QUERY&defType=edismax&qf=FIELD1^20.0_FIELD2^0.03&wt=json . The issue is that this is not boosting anything in most cases, or it isn't able to find any documents that match this criteria. I have used Nutch to crawl websites and indexed the data to Solr. I see that Nutch applies an index-time boost as well. Could that have something to do with this? Can anyone look at the format of this query and enlighten me about any mistakes that I am making. Hi, - You seem to have two fields incorrectly concatenated with an underscore: qf=FIELD1^20.0_FIELD2^0.03 - this should be a space or an encoded space - a large boost of 20 combined with a fractional boost of 0.03 worries me as it implies that one field is 666 times as important as another; are you sure this is the case? - you should turn off *all* the boosts, including the Nutch one, and start again, *gently* applying boosts where you can *prove* they improve relevancy - you should consider using a tool such as Quepid (disclaimer: we resell this, but there's a free trial period you can use) for relevancy tuning based on a set of test cases HTH, Charlie FYI: I am using a data-driven schema. Regards, Aratrika Mukhopadhyay -- Charlie Hull Flax - Open Source Enterprise Search tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.flax.co.uk
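To make Charlie's first two points concrete, a corrected qf setup could be baked into the search handler defaults rather than the URL (handler name and boost values are illustrative - the point is the space separator and the modest boosts to start from):

```xml
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- Fields are separated by spaces (or %20 in a URL), not underscores;
         start with gentle boosts and adjust only when tests prove an improvement -->
    <str name="qf">FIELD1^2.0 FIELD2^1.0</str>
  </lst>
</requestHandler>
```

The equivalent URL form would be `qf=FIELD1%5E2.0%20FIELD2%5E1.0`.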
Re: External file fields
On 01/02/2018 18:55, Brian Yee wrote: Hello, I want to use external file field to store frequently changing inventory and price data. I got a proof of concept working with a mock text file and this will suit my needs. What is the best way to keep this file updated in a fast way. Ideally I would like to read changes from a Kafka queue and write to the file. But it seems like I would have to open the whole file, read the whole file, find the line I want to change, and write the whole file for every change. Is there a better way to do that? That approach seems like it would be difficult/slow if the file is several million lines long. Also, once I come up with a way to update the file quickly, what is the best way to distribute the file to all the different solrcloud nodes in the correct directory? Another approach would be the XJoin plugin we wrote - if you wait a few days we should have an updated patch for Solr v6.5 and possibly v7. XJoin lets you filter/join/rank Solr results using an external data source. http://www.flax.co.uk/blog/2016/01/25/xjoin-solr-part-1-filtering-using-price-discount-data/ http://www.flax.co.uk/blog/2016/01/29/xjoin-solr-part-2-click-example/ Cheers Charlie -- Charlie Hull Flax - Open Source Enterprise Search tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.flax.co.uk
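For anyone following along, the external file field setup Brian refers to looks roughly like this in the schema (field and type names are illustrative). The values themselves do not live in the index: they go in a file named `external_<fieldname>` in the index data directory, one key=value line per document, picked up when a new searcher opens:

```xml
<!-- schema: values are keyed on the uniqueKey field, float-valued, default 0 -->
<fieldType name="eff_float" class="solr.ExternalFileField"
           keyField="id" defVal="0" valType="pfloat"/>
<field name="price" type="eff_float" indexed="false" stored="false"/>
```

The matching data file (`data/external_price`) would then contain lines such as `doc1=12.50`, which addresses Brian's update problem to a degree: the file can be rewritten and swapped in wholesale without reindexing any documents.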
Re: Adding virtual host in Jetty (Solr deployed)
On 01/02/2018 12:40, solr2020 wrote: Hi, We have installed solr which is running in jetty 9x version. We are trying to change the default solr url to required URL as given below. Default url: http://localhost:8983/solr Required URL :http://test.com/solr To achieve this we are trying to configure virtual host in jetty (solr-jetty-context.xml) with the below jetty documentation reference (https://wiki.eclipse.org/Jetty/Howto/Configure_Virtual_Hosts). But it is not working. You're going to need to give more details I'm afraid, such as exactly what you expect it to do and what happens when you test it. Cheers Charlie -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html -- Charlie Hull Flax - Open Source Enterprise Search tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.flax.co.uk
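For completeness, a sketch of what the virtual host entry would look like in solr-jetty-context.xml, per the Jetty howto linked above (untested against the poster's setup). Note this only tells Jetty which Host headers to serve the context for; actually routing test.com to the machine is a separate DNS or reverse-proxy concern, which may be why "it is not working":

```xml
<Configure class="org.eclipse.jetty.webapp.WebAppContext">
  <!-- Serve this context only for requests whose Host header matches -->
  <Set name="virtualHosts">
    <Array type="String">
      <Item>test.com</Item>
    </Array>
  </Set>
</Configure>
```

In practice, most deployments leave Jetty alone and put a reverse proxy (nginx, Apache httpd) in front of Solr to map a friendly hostname to localhost:8983.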
Re: Distributed search cross cluster
On 30/01/2018 16:09, Jan Høydahl wrote: Hi, A customer has 10 separate SolrCloud clusters, with the same schema across all, but different content. Now they want users in each location to be able to federate a search across all locations. Each location is 100% independent, with separate ZK etc. Bandwidth and latency between the clusters is not an issue; they are actually in the same physical datacenter. Now my first thought was using a custom parameter, and letting the receiving node fan out to all shards of all clusters. We'd need to contact the ZK for each environment, find all shards and replicas participating in the collection and then construct the shards=A1|A2,B1|B2… string, which would be quite big, but if we get it right, it should "just work". Now, my question is whether there are other smarter ways that would leave it up to existing Solr logic to select shards and load balance, and that would also take into account any shard.keys/_route_ info etc. I thought of these: * collection=collA,collB — but it only supports collections local to one cloud * Create a collection ALIAS to point to all 10 — but same here, only local to one cluster * Streaming expression top(merge(search(q=,zkHost=blabla))) — but we want it with the pure search API * Write a custom ShardHandler plugin that knows about all clusters — but this is complex stuff :) * Write a custom SearchComponent plugin that knows about all clusters and adds the shards= param Another approach would be for the originating cluster to fan out just ONE request to each of the other clusters and then write some SearchComponent to merge those responses. That would let us query the other clusters using one LB IP address instead of requiring full visibility to all Solr nodes of all clusters, but if we don't need that isolation, that extra merge code seems fairly complex. So far I opt for the custom SearchComponent and shards= param approach. Any useful input from someone who tried a similar approach would be priceless! 
Hi Jan, We actually looked at this for the BioSolr project - a SolrCloud of SolrClouds. Unfortunately the funding didn't appear for the project so we didn't take it any further than some rough ideas - as you say, if you get it right it should 'just work'. We had some extra complications in terms of shared partial schemas... Cheers Charlie -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com -- Charlie Hull Flax - Open Source Enterprise Search tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.flax.co.uk
Re: Using Solr with SharePoint Online
On 30/01/2018 07:57, Mohammed.Adnan2 wrote: Hello Team, I am a beginner learning Apache Solr. I am trying to check the compatibility of solr with SharePoint Online, but I am not getting anything concrete related to this in the website documentation. Can you please help me in providing some information on this? How I can index my SharePoint content with solr and then use solr on my SharePoint sites? I really appreciate your help on this. Thanks, Adnan Hi Adnan, There are various things you need to consider: 1. Why do you need Solr at all - Sharepoint Online has its own built-in search engine. 2. Installing Solr on a Windows server with access to Sharepoint Online - shouldn't be a huge problem, Solr is a Java application so you'll also need Java installed. You might want to run Solr as a Windows Service so it's always there in the background - look up NSSM. 3. You need a way to get the content out of Sharepoint and into Solr. The best way to do this will be to crawl the Sharepoint site. There are some commercially available connectors from BA Insight and Lucidworks or you'll have to roll your own. This https://github.com/golincode/SPOC might be a good starting point. If you go this route you'll certainly need to condition the data before you index it with Solr, so you'll have to understand how Solr schemas, analyzers etc. work. 4. Then you'll need a UI to talk to Solr to carry out queries - if this is to live within the Sharepoint world you'll need to write a web application compatible with Sharepoint. HTH, Charlie -- Charlie Hull Flax - Open Source Enterprise Search tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.flax.co.uk