Re: less search results in prod
Enable debugQuery and compare the queries evaluated in the development and production environments.

Regards,
Jayendra

On Sun, Dec 4, 2011 at 5:18 AM, alx...@aim.com wrote:
Hello, I built the solr-3.4.0 data folder on a dev server and copied it to the prod server. I searched for a keyword, then modified the qf and pf params in solrconfig.xml, searched for the same keywords, and then restored qf and pf to their original values. Now Solr returns far fewer docs for the same keywords than the dev server does. I tried other keywords; the issue is the same. I copied solrconfig.xml over from the dev server, but nothing changed. Looking at the statistics, the numDocs and maxDoc values are the same on both servers. Any ideas how to debug this issue? Thanks in advance. Alex.
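For example, running the same request against both environments with debugQuery enabled shows how the query was parsed and how each score was computed (host names and the keyword are placeholders):

    http://dev-host:8983/solr/select?q=keyword&debugQuery=true
    http://prod-host:8983/solr/select?q=keyword&debugQuery=true

Diffing the parsedquery and explain sections of the two responses usually pinpoints the configuration mismatch.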
Re: How to change the port of post.jar
You can pass the full URL to post.jar as an argument, for example:

    java -Durl=http://localhost:8080/solr/update -jar post.jar

Regards,
Jayendra

On Wed, Nov 9, 2011 at 2:37 AM, 刘浪 liu.l...@eisoo.com wrote:
Hi, I want to use post.jar to delete from the index, but my port is 8080 and the default is 8983. How can I change the port from 8983 to 8080? Thank you, Amos
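Since the goal here is deletion, the delete command itself can be posted the same way; the SimplePostTool behind post.jar accepts -Ddata=args to treat its command-line arguments as the request body (the document id below is a placeholder):

    java -Ddata=args -Durl=http://localhost:8080/solr/update -jar post.jar "<delete><id>42</id></delete>"
    java -Ddata=args -Durl=http://localhost:8080/solr/update -jar post.jar "<commit/>"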
Re: question about Field Collapsing/ grouping
Hi Ahson,

http://wiki.apache.org/solr/FieldCollapsing

group.ngroups has been added as a parameter, so you should not need to apply any patches. Solr 3.3 shipped with the grouping feature, so I presume it is already included.

Regards,
Jayendra

On Wed, Sep 14, 2011 at 4:22 AM, Ahson Iqbal mianah...@yahoo.com wrote:
Hi Jayendra, thanks a lot for your response. Now I have two questions: first, to get the count of groups, is it necessary to apply the specified patch? If so, can you help me a little with the steps to apply that patch, as I am new to Solr/Java? Regards, Ahsan

- Original Message -
From: Jayendra Patil jayendra.patil@gmail.com
To: solr-user@lucene.apache.org; Ahson Iqbal mianah...@yahoo.com
Sent: Tuesday, September 13, 2011 10:55 AM
Subject: Re: question about Field Collapsing/ grouping

At the time we implemented the feature, there was no straightforward solution. What we did was facet on the grouped-by field and count the facets, which gives you the distinct count for the groups. You may also want to check the patch @ https://issues.apache.org/jira/browse/SOLR-2242, which returns the facet counts that you would then need to count yourself.

Regards,
Jayendra

On Tue, Sep 13, 2011 at 1:27 AM, Ahson Iqbal mianah...@yahoo.com wrote:
Hi, is it possible to get the number of groups that matched a given query? Say there are three fields in the index - DocumentID, Content, Industry - and I query as +(Content:is Content:the)&group=true&group.field=industry. Is it possible to get how many industries matched the query? Please help. Regards, Ahsan
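For reference, the grouped request with distinct-group counts would look like this (field names taken from the question):

    http://localhost:8983/solr/select?q=%2B(Content:is Content:the)&group=true&group.field=Industry&group.ngroups=true

With group.ngroups=true, each grouped field in the response reports both matches (total documents) and ngroups (the number of distinct groups).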
Re: question about Field Collapsing/ grouping
Yup, it seems the group count feature is included now, as mentioned by Klein.

Regards,
Jayendra

On Tue, Sep 13, 2011 at 8:27 AM, O. Klein kl...@octoweb.nl wrote:
Isn't that what the parameter group.ngroups=true is for?
Re: question about Field Collapsing/ grouping
At the time we implemented the feature, there was no straightforward solution. What we did was facet on the grouped-by field and count the facets, which gives you the distinct count for the groups.

You may also want to check the patch @ https://issues.apache.org/jira/browse/SOLR-2242, which returns the facet counts that you would then need to count yourself.

Regards,
Jayendra

On Tue, Sep 13, 2011 at 1:27 AM, Ahson Iqbal mianah...@yahoo.com wrote:
Hi, is it possible to get the number of groups that matched a given query? Say there are three fields in the index - DocumentID, Content, Industry - and I query as +(Content:is Content:the)&group=true&group.field=industry. Is it possible to get how many industries matched the query? Please help. Regards, Ahsan
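A sketch of that facet-based workaround: request zero rows and facet on the grouped-by field; every bucket with a non-zero count is one group, so the number of buckets returned is the distinct group count:

    http://localhost:8983/solr/select?q=%2B(Content:is Content:the)&rows=0&facet=true&facet.field=Industry&facet.mincount=1&facet.limit=-1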
Re: Accessing a doc field while working at entity level
You should be able to do it using ${feed-source.last-update}. You can find examples and an explanation @ http://wiki.apache.org/solr/DataImportHandler

Regards,
Jayendra

On Mon, Sep 5, 2011 at 8:02 AM, penela pen...@gmail.com wrote:
Hi! This might be a stupid question, but I can't find clear info on how to do it (sorry if it is too obvious). I have the following document configuration (only key elements shown) with two entities, one embedded in the other:

    <dataConfig>
      <dataSource type="URLDataSource" name="rss-ds"/>
      <dataSource type="JdbcDataSource" name="db-ds" driver="com..."/>
      <document>
        <entity name="feed-source" dataSource="db-ds" ... rootEntity="false">
          <field column="last-update" dateTimeFormat="yyyy-MM-dd HH:mm:ss" locale="en"/>
          <entity name="feed-content" dataSource="rss-ds" pk="link" ...
                  transformer="DateFormatTransformer, DummyTransformer">
            <field column="timestamp" xpath="/rss/channel/item/pubDate"
                   dateTimeFormat="EEE, dd MMM yyyy HH:mm:ss z" locale="en"/>
          </entity>
        </entity>
      </document>
    </dataConfig>

What I want to do is access the outer entity field last-update while I'm in the inner entity's transformer, DummyTransformer. Debugging with Eclipse, it looks like that data is correctly stored at runtime in the Context variable passed as a parameter to the transformer, in context.doc.fields. So the question is: is there any way to access a higher-level entity's fields while in an embedded entity? Or document fields at least? Thanks! -Víctor
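A minimal sketch of the suggestion above: the inner entity can pull the outer entity's column into its own row with a TemplateTransformer, after which it is visible to any later transformer in the chain (the column name lastUpdate is illustrative):

    <entity name="feed-content" dataSource="rss-ds" pk="link"
            transformer="TemplateTransformer, DateFormatTransformer, DummyTransformer">
      <!-- copies the outer entity's last-update value into this entity's row -->
      <field column="lastUpdate" template="${feed-source.last-update}"/>
      ...
    </entity>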
Re: Search the contents of given URL in Solr.
For indexing webpages, you can use Nutch with Solr, which does the scraping and indexing of the pages. For finding similar documents/pages you can use http://wiki.apache.org/solr/MoreLikeThis: query for the indexed document (by id or search terms) and it will return similar documents from the index.

Regards,
Jayendra

On Tue, Aug 30, 2011 at 8:23 AM, Sheetal rituzprad...@gmail.com wrote:
Hi, is it possible to give the URL of a site and have the Solr search server read the contents of the given site and recommend similar projects? I scraped the web contents from the given URL and now have the contents as plain text. But when I pass that scraped text to Solr as a query, it doesn't work because the query is too large (depending on the size of the URL's contents). I read somewhere that it is possible to take a URL address and output the relevant projects for it, but I don't remember whether it used Solr or another search engine. Does anyone have any ideas or suggestions for this? I would highly appreciate your comments. Thank you in advance. - Sheetal
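For reference, once the page is indexed, a MoreLikeThis request avoids sending the page text as a query at all - the similarity is computed from the stored document (id and field names are placeholders):

    http://localhost:8983/solr/select?q=id:page-42&mlt=true&mlt.fl=content&mlt.mintf=1&mlt.mindf=1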
Re: How to get all the terms in a document as Luke does?
You might want to check http://wiki.apache.org/solr/TermVectorComponent - it should provide you with the term vectors plus a lot of additional info.

Regards,
Jayendra

On Tue, Aug 30, 2011 at 3:34 AM, Gabriele Kahlout gabri...@mysimpatico.com wrote:
Hello, this time I'm trying to duplicate Luke's functionality of knowing which terms occur in a search result/document (without parsing it again). Is there any SolrJ API to do that? P.S. I've also posted the question on SO: http://stackoverflow.com/q/7219111/300248

On Wed, Jul 6, 2011 at 11:09 AM, Gabriele Kahlout gabri...@mysimpatico.com wrote:
From your patch I see TermFreqVector, which provides the information I want. I also found FieldInvertState.getLength(), which seems to be exactly what I want: I'm after the word count (the sum of tf for every term in the doc). I'm just not sure whether FieldInvertState.getLength() returns just the number of terms (not multiplied by the frequency of each term, i.e. the word count) or not. It seems to return the word count, but I haven't tested it sufficiently.

On Wed, Jul 6, 2011 at 1:39 AM, Trey Grainger the.apache.t...@gmail.com wrote:
Gabriele, I created a patch that does this about a year ago. See https://issues.apache.org/jira/browse/SOLR-1837. It was written for Solr 1.4 and is based upon the Document Reconstructor in Luke. The patch adds a link on the main Solr admin page to a docinspector page which will reconstruct the document given a unique id (required). Keep in mind that for non-stored fields you're only looking at what's in the index, not the original text. If you have any issues using this on the most recent release, let me know and I'd be happy to create a new patch for Solr 3.3. One of these days I'll remove the JSP dependency and this may eventually make it into trunk. Thanks,
-Trey Grainger
Search Technology Development Team Lead, Careerbuilder.com
Site Architect, Celiaccess.com

On Tue, Jul 5, 2011 at 3:59 PM, Gabriele Kahlout gabri...@mysimpatico.com wrote:
Hello, with an inverted index the term is the key and the documents are the values. Is it still possible, given a document id, to get the terms indexed for that document?

Regards,
K. Gabriele
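For reference, the term vector component reads per-document term information at query time; it requires the field to be indexed with term vectors, and the stock example config exposes the component on a dedicated handler named tvrh (the field and id below are placeholders):

    <field name="content" type="text" indexed="true" stored="true"
           termVectors="true" termPositions="true" termOffsets="true"/>

    http://localhost:8983/solr/select?qt=tvrh&q=id:doc1&tv=true&tv.tf=true&tv.df=true&tv.positions=true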
Re: Upload doc and pdf in Solr 3.3.0
http://wiki.apache.org/solr/ExtractingRequestHandler may help.

Regards,
Jayendra

On Thu, Aug 25, 2011 at 3:24 AM, Moinsn felix.wieg...@googlemail.com wrote:
Good morning, I have to set up a Solr system to search in documents like PDF and DOC. My Solr system is running in the meantime, but I can't find a tutorial that tells me what I have to do to get the files into the system. I hope you can help me a bit to pull that off in a simple way. And please excuse my bad English.
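For reference, a typical way to push a PDF or Word file through the extracting request handler (Solr Cell) with curl, assuming the stock /update/extract handler is enabled (id and file name are placeholders):

    curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true" -F "myfile=@document.pdf"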
Re: Issue in indexing Zip file content with apache-solr-3.3.0
Solr doesn't index the contents of the files inside the zip, just the file names. You can apply these patches:
https://issues.apache.org/jira/browse/SOLR-2416
https://issues.apache.org/jira/browse/SOLR-2332

Regards,
Jayendra

On Tue, Aug 23, 2011 at 2:26 AM, Jagdish Kumar jagdish.thapar...@hotmail.com wrote:
Hi all, I am using apache-solr-3.3.0 with apache-solr-cell-3.3.0.jar. Though I am able to index zip files, I get no results if I search for content present in a zip file. Please suggest a possible solution. Thanks and regards, Jagdish
Re: How to start troubleshooting a content extraction issue
You can test the content extraction standalone with the tika-app.jar. To output in text format:

    java -jar tika-app-0.8.jar --text <file_path>

For more options:

    java -jar tika-app-0.8.jar --help

Use the tika-app version matching the one bundled with your Solr build.

Regards,
Jayendra

On Wed, Aug 10, 2011 at 1:53 PM, Tim AtLee timat...@gmail.com wrote:
Hello. I'm a newbie to Solr and Tika and whatnot, so please use simple words for me :P I am running Solr on Tomcat 7 on Windows Server 2008 R2, as the search engine for a Drupal web site. Up until recently, everything has been fine - searching works, faceting works, etc. Recently a user uploaded a 5 MB xltm file, which seems to cause Tomcat to spike in CPU usage and eventually error out. When the documents are submitted to be indexed, the Tomcat process spikes to 100% of one available CPU, with the eventual error in Drupal of "Exception occured sending sites/default/files/nodefiles/533/June 30, 2011.xltm to Solr 0 Status: Communication Error". I am looking for some help in figuring out where to troubleshoot this. I assume it's this file, but I'd like to be sure - so how can I submit this file for content extraction manually to see what happens? Thanks, Tim
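The same check can also be run inside Solr itself: the extracting handler accepts an extractOnly parameter that returns the extracted content without indexing it, which is handy for reproducing the failure in place (the file name is a placeholder):

    curl "http://localhost:8983/solr/update/extract?extractOnly=true" -F "myfile=@June30.xltm"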
Re: Possible bug in FastVectorHighlighter
Try using:

    <str name="hl.tag.pre"><![CDATA[<b>]]></str>
    <str name="hl.tag.post"><![CDATA[</b>]]></str>

Regards,
Jayendra

On Tue, Aug 9, 2011 at 4:46 AM, Massimo Schiavon mschia...@volunia.com wrote:
In my Solr (3.3) configuration I specified these two params:

    <str name="hl.simple.pre"><![CDATA[<b>]]></str>
    <str name="hl.simple.post"><![CDATA[</b>]]></str>

When I do a simple search I correctly obtain highlighted results where matches are enclosed in the right tag. If I do the same request with hl.useFastVectorHighlighter=true in the HTTP query string (or specify the same parameter in the config file), the matches are enclosed in the <em> tag (the default value). Has anyone encountered the same issue?
Re: Is there anyway to sort differently for facet values?
You can give it a try with facet.sort. We had such a requirement - sorting facets in an order determined by another field - and had to resort to a very crude way to get through it: we prepended the facet values with the order in which they had to be displayed and used facet.sort to sort alphabetically, e.g.

    Small  -> 0_Small
    Medium -> 1_Medium
    Large  -> 2_Large
    XL     -> 3_XL

You would need to handle the display part, though. Surely not the best way, but it worked for us.

Regards,
Jayendra

On Thu, Aug 4, 2011 at 4:38 PM, Sethi, Parampreet parampreet.se...@teamaol.com wrote:
It could be achieved by creating your own (app-specific) custom comparators for fields defined in schema.xml, with an extra attribute in the field tag to specify the comparator class. But that would require changes in Solr to support. (Not sure if it's feasible; just throwing out an idea.) -param

On 8/4/11 4:29 PM, Jonathan Rochkind rochk...@jhu.edu wrote:
No, it can not. It just sorts alphabetically - actually by raw byte order. No other facet sorting functionality is available, and it would be tricky to implement in a performant way because of the way Lucene works. But it would certainly be useful to me too if someone could figure out a way to do it.

On 8/4/2011 2:43 PM, Way Cool wrote:
Thanks Erick for your reply. I am aware of facet.sort, but I haven't used it. I will try it, though. Can it handle the values below in the correct order?

    Under 10
    10 - 20
    20 - 30
    Above 30

Or:

    Small
    Medium
    Large
    XL

My second question: if Solr can't do that for the values above using facet.sort, is there any other way in Solr? Thanks in advance, YH

On Wed, Aug 3, 2011 at 8:35 PM, Erick Erickson erickerick...@gmail.com wrote:
Have you looked at the facet.sort parameter? The index value is what I think you want. Best, Erick

On Aug 3, 2011 7:03 PM, Way Cool way1.wayc...@gmail.com wrote:
Hi guys, is there any way to sort facet values differently? For example, sometimes I want to sort facet values by their values instead of # of docs, and I want to be able to have a predefined order for certain facets as well. Is that possible in Solr? Thanks, YH
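As a sketch of the display-side handling this trick requires, the numeric sort prefix is stripped before rendering (Java; the prefix format follows the 0_Small example above):

    // Strip the leading "N_" ordering prefix from a facet value for display.
    String display = facetValue.replaceFirst("^\\d+_", "");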
Re: ' invisible ' words
Strange. The only other difference I see is the different configuration of the word delimiter filter, with catenateWords and catenateNumbers differing between index and query time - but that should not affect normal word searches.

As others suggested, you may want to use the same chain for both index and query to start with: begin with a plain tokenizer and then add the filters back one by one.

Regards,
Jayendra

On Wed, Jul 13, 2011 at 11:29 PM, deniz denizdurmu...@gmail.com wrote:
Hi Jayendra, I have changed the order and also removed the line related to synonyms, but the result is still the same: somehow some words are just invisible during my searches.
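A minimal sketch of that starting point - a single analyzer element (no type attribute) applies the identical chain at both index and query time, and filters can then be added back one at a time:

    <fieldType name="text_debug" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>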
Re: ' invisible ' words
Hi Deniz,

The order of the filters at index time and query time is different, e.g. the synonym filter. Do you have a custom synonyms text file that may be causing the issue? It usually works fine if you have the same filter order at index and query time. You can try that out.

Regards,
Jayendra

On Tue, Jul 12, 2011 at 11:19 PM, deniz denizdurmu...@gmail.com wrote:
Nothing changed... the result is still the same... Should I implement my own analyzer or tokenizer for the problem?
Re: Master Slave help
Do you mean the replication happens every time you restart the server? If so, you need to modify the events on which you want replication to happen. Check for the replicateAfter tag and remove the startup option if you don't need it:

    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="master">
        <!-- Replicate on 'startup' and 'commit'. 'optimize' is also a valid value for replicateAfter. -->
        <str name="replicateAfter">startup</str>
        <str name="replicateAfter">commit</str>
        <!-- Create a backup after 'optimize'. Other values can be 'commit', 'startup'.
             It is possible to have multiple entries of this config string.
             Note that this is just for backup; replication does not require this. -->
        <!-- <str name="backupAfter">optimize</str> -->
        <!-- If configuration files need to be replicated, give the names here, separated by commas. -->
        <str name="confFiles">schema.xml,stopwords.txt,elevate.xml</str>
        <!-- The default reservation is 10 secs. See the documentation below.
             Normally you should not need to specify this. -->
        <str name="commitReserveDuration">00:00:10</str>
      </lst>
    </requestHandler>

Regards,
Jayendra

On Mon, Jun 6, 2011 at 11:24 AM, Rohit Gupta ro...@in-rev.com wrote:
Hi, I have configured my master and slave servers and everything seems to be running fine; replication completed the first time it ran. But every time I go to the replication link in the admin panel after restarting the server, or at server startup, I notice the replication starting from scratch - or at least the stats show that. What could be wrong? Thanks, Rohit
Re: Hitting the URI limit, how to get around this?
Just a suggestion: if the shards are known, you can add them as default params in the request handler so they are always applied, and the URL would then just need the qt parameter.

The URI limit is browser/container dependent. How are you querying Solr - through a client API, or through a browser? Is it hitting the max header length? Can you use POST instead?

Regards,
Jayendra

On Thu, Jun 2, 2011 at 7:12 PM, JohnRodey timothydd...@yahoo.com wrote:
I have a master Solr instance that I send my requests to; it hosts no documents, it just farms the request out to a large number of shards. All the other Solr instances that host the data contain multiple cores. Therefore my search string looks like http://host:port/solr/select?...shards=nodeA:1234/solr/core01,nodeA:1234/solr/core02,nodeA:1234/solr/core03,... This shard list is pretty long and has finally hit the limit. So my question is: how best to avoid building such a long URI? Is there a way to have multiple tiers, where the master server has a list of servers (nodeA:1234, nodeB:1234, ...) and each of those nodes queries the cores it hosts (nodeA hosts core01, core02, core03, ...)? Thanks!
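A sketch of that suggestion - the shard list moves into solrconfig.xml as a default on a dedicated handler (handler name and hosts are illustrative), so requests shrink to http://host:port/solr/select?qt=distributed&q=...:

    <requestHandler name="distributed" class="solr.SearchHandler">
      <lst name="defaults">
        <str name="shards">nodeA:1234/solr/core01,nodeA:1234/solr/core02,nodeA:1234/solr/core03</str>
      </lst>
    </requestHandler>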
Re: Extracting contents of zipped files with Tika and Solr 1.4.1 (now Solr 3.1)
Hi Gary,

I tried the patch on the 3.1 source code (@ http://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_3_1/) as well and it worked fine.

@Patch - https://issues.apache.org/jira/browse/SOLR-2416, which deals with the Solr Cell module.

You may want to verify the contents of the results by enabling the stored attribute on the text field, e.g.:

    curl "http://localhost:8983/solr/update/extract?stream.file=C:/Test.zip&literal.id=777045&literal.title=Test&commit=true"

Let me know if it works. I would be happy to share the generated artifact you can test on.

Regards,
Jayendra

On Fri, May 20, 2011 at 11:15 AM, Gary Taylor g...@inovem.com wrote:
Hello again. Unfortunately, I'm still getting nowhere with this. I have checked out the 3.1 source and applied Jayendra's patches (see below), and it still appears that the contents of the files in the zipfile are not being indexed, only the filenames of those contained files. I'm using a simple curl invocation to test this:

    curl "http://localhost:8983/solr/core0/update/extract?literal.docid=74&fmap.content=text&literal.type=5" -F commit=true -F file=@solr1.zip

solr1.zip contains two simple txt files (doc1.txt and doc2.txt). I'm expecting the contents of those txt files to be extracted from the zip and indexed, but this isn't happening - or at least, I don't get the desired result when I do a query afterwards. I do get a match if I search for either doc1.txt or doc2.txt, but not if I search for a word that appears in their contents. If I index one of the txt files (instead of the zipfile), I can query the content OK, so I'm assuming my query is sensible and matches the field specified in the curl string (i.e. text). I'm also happy that the Solr Cell content extraction is working because I can successfully index PDF, Word, etc. files. In a fit of desperation I have added log.info statements into the files referenced by Jayendra's patches (SOLR-2416 and SOLR-2332) and I see those in the log when I submit the zipfile with curl, so I know I'm running those patched files in the build. If anyone can shed any light on what's happening here, I'd be very grateful. Thanks and kind regards, Gary.

On 11/04/2011 11:12, Gary Taylor wrote:
Jayendra, thanks for the info - I've been keeping an eye on this list in case this topic cropped up again. It's currently a background task for me, so I'll try to take a look at the patches and re-test soon. Joey - glad you brought this issue up again. I haven't progressed any further with it. I've not yet moved to Solr 3.1, but it's on my to-do list, as is testing the patches referenced by Jayendra. I'll post my findings on this thread - if you manage to test the patches before me, let me know how you get on. Thanks and kind regards, Gary.

On 11/04/2011 05:02, Jayendra Patil wrote:
The migration of Tika to the latest 0.8 version seems to have reintroduced the issue. I was able to get this working again with the following patches (Solr Cell and Data Import Handler):
https://issues.apache.org/jira/browse/SOLR-2416
https://issues.apache.org/jira/browse/SOLR-2332
You can try these. Regards, Jayendra

On Sun, Apr 10, 2011 at 10:35 PM, Joey Hanzel phan...@nearinfinity.com wrote:
Hi Gary, I have been experiencing the same problem... unable to extract content from archive file formats. I just tried again with a clean install of Solr 3.1.0 (using Tika 0.8) and continue to experience the same results. Did you have any success with this on Solr 1.4.1 or 3.1.0? I'm using this curl command to send data to Solr:

    curl "http://localhost:8080/solr/update/extract?literal.id=doc1&fmap.content=attr_content&commit=true" -H "Content-Type: application/octet-stream" -F myfile=@data.zip

No problem extracting single rich-text documents, but archive files only result in the file names within the archive being indexed. Am I missing something else in my configuration? Solr doesn't seem to be unpacking the archive files. Based on the email chain associated with your first message, some people have been able to get this functionality to work as desired.

--
Gary Taylor
INOVEM
Tel +44 (0)1488 648 480
Fax +44 (0)7092 115 933
gary.tay...@inovem.com
www.inovem.com
INOVEM Ltd is registered in England and Wales No 4228932
Registered Office 1, Weston Court, Weston, Berkshire. RG20 8JE
Re: Extracting contents of zipped files with Tika and Solr 1.4.1
The migration of Tika to the latest 0.8 version seems to have reintroduced the issue. I was able to get this working again with the following patches (Solr Cell and Data Import Handler):
https://issues.apache.org/jira/browse/SOLR-2416
https://issues.apache.org/jira/browse/SOLR-2332
You can try these.

Regards,
Jayendra

On Sun, Apr 10, 2011 at 10:35 PM, Joey Hanzel phan...@nearinfinity.com wrote:
Hi Gary, I have been experiencing the same problem... unable to extract content from archive file formats. I just tried again with a clean install of Solr 3.1.0 (using Tika 0.8) and continue to experience the same results. Did you have any success with this on Solr 1.4.1 or 3.1.0? I'm using this curl command to send data to Solr:

    curl "http://localhost:8080/solr/update/extract?literal.id=doc1&fmap.content=attr_content&commit=true" -H "Content-Type: application/octet-stream" -F myfile=@data.zip

No problem extracting single rich-text documents, but archive files only result in the file names within the archive being indexed. Am I missing something else in my configuration? Solr doesn't seem to be unpacking the archive files. Based on the email chain associated with your first message, some people have been able to get this functionality to work as desired.

On Mon, Jan 31, 2011 at 8:27 AM, Gary Taylor g...@inovem.com wrote:
Can anyone shed any light on this, and whether it could be a config issue? I'm now using the latest SVN trunk, which includes the Tika 0.8 jars. When I send a ZIP file (containing two txt files, doc1.txt and doc2.txt) to the ExtractingRequestHandler, I get the following log entry (formatted for ease of reading):

    SolrInputDocument[{
      ignored_meta=ignored_meta(1.0)={[stream_source_info, file, stream_content_type, application/octet-stream, stream_size, 260, stream_name, solr1.zip, Content-Type, application/zip]},
      ignored_=ignored_(1.0)={[package-entry, package-entry]},
      ignored_stream_source_info=ignored_stream_source_info(1.0)={file},
      ignored_stream_content_type=ignored_stream_content_type(1.0)={application/octet-stream},
      ignored_stream_size=ignored_stream_size(1.0)={260},
      ignored_stream_name=ignored_stream_name(1.0)={solr1.zip},
      ignored_content_type=ignored_content_type(1.0)={application/zip},
      docid=docid(1.0)={74},
      type=type(1.0)={5},
      text=text(1.0)={ doc2.txt doc1.txt }
    }]

So the data coming back from Tika when parsing a ZIP file does not include the file contents, only the names of the files contained therein. I've tried forcing stream.type=application/zip in the curl string, but that makes no difference. If I specify an invalid stream.type I get an exception response, so I know it's being used. When I send one of those txt files individually to the ExtractingRequestHandler, I get:

    SolrInputDocument[{
      ignored_meta=ignored_meta(1.0)={[stream_source_info, file, stream_content_type, text/plain, stream_size, 30, Content-Encoding, ISO-8859-1, stream_name, doc1.txt]},
      ignored_stream_source_info=ignored_stream_source_info(1.0)={file},
      ignored_stream_content_type=ignored_stream_content_type(1.0)={text/plain},
      ignored_stream_size=ignored_stream_size(1.0)={30},
      ignored_content_encoding=ignored_content_encoding(1.0)={ISO-8859-1},
      ignored_stream_name=ignored_stream_name(1.0)={doc1.txt},
      docid=docid(1.0)={74},
      type=type(1.0)={5},
      text=text(1.0)={ The quick brown fox }
    }]

and we see the file contents in the text field. I'm using the following requestHandler definition in solrconfig.xml:

    <!-- Solr Cell: http://wiki.apache.org/solr/ExtractingRequestHandler -->
    <requestHandler name="/update/extract"
                    class="org.apache.solr.handler.extraction.ExtractingRequestHandler" startup="lazy">
      <lst name="defaults">
        <!-- All the main content goes into "text"... if you need to return
             the extracted text or do highlighting, use a stored field. -->
        <str name="fmap.content">text</str>
        <str name="lowernames">true</str>
        <str name="uprefix">ignored_</str>
        <!-- capture link hrefs but ignore div attributes -->
        <str name="captureAttr">true</str>
        <str name="fmap.a">links</str>
        <str name="fmap.div">ignored_</str>
      </lst>
    </requestHandler>

Is there any further debug or diagnostic I can get out of Tika to help me work out why it's only returning the file names and not the file contents when parsing a ZIP file? Thanks and kind regards, Gary.

On 25/01/2011 16:48, Jayendra Patil wrote:
Hi Gary, the latest Solr trunk was able to extract and index the contents of the zip file using the ExtractingRequestHandler. The snapshot of trunk we worked on had the Tika 0.8 snapshot jars and worked pretty well. Tested again with a sample URL and it works fine:

    curl "http://localhost:8080/solr/core0/update/extract?stream.file=C:/temp/extract/777045.zip&literal.id=777045&literal.title=Test&commit=true"

You would probably need
Re: Solrcore.properties
Can you please attach the other files? It doesn't seem to find the enable.master property, so you may want to check that the properties file exists on the box having issues.

We have the following configuration in the core:

solrconfig.xml - Master/Slave:

    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="master">
        <str name="enable">${enable.master:false}</str>
        <str name="replicateAfter">commit</str>
        <str name="confFiles">solrcore_slave.properties:solrcore.properties,solrconfig.xml,schema.xml</str>
      </lst>
      <lst name="slave">
        <str name="enable">${enable.slave:false}</str>
        <str name="masterUrl">http://master_host:port/solr/corename/replication</str>
      </lst>
    </requestHandler>

solrcore.properties - Master:

    enable.master=true
    enable.slave=false

solrcore_slave.properties - Slave:

    enable.master=false
    enable.slave=true

We have default values and separate properties files for master and slave. Replication is enabled for the solrcore.properties file.

Regards,
Jayendra

On Mon, Mar 28, 2011 at 2:06 PM, Ezequiel Calderara ezech...@gmail.com wrote:
Hi all, I'm having problems when deploying Solr on the production machines. I have a master Solr and 3 slaves. The master replicates the schema and the solrconfig to the slaves (this file on the master is named solrconfig_slave.xml). The solrconfig of the slaves has, for example, the ${data.dir} and other values in the solrcore.properties. I think Solr isn't recognizing that file, because I get this error:

    HTTP Status 500 - Severe errors in solr configuration. Check your log files for more detailed
    information on what may be wrong. If you want solr to continue after configuration errors, change:
    <abortOnConfigurationError>false</abortOnConfigurationError> in null
    org.apache.solr.common.SolrException: No system property or default value specified for enable.master
    at org.apache.solr.common.util.DOMUtil.substituteProperty(DOMUtil.java:311)
    ... MORE STACK TRACE INFO ...

But here is the thing: "org.apache.solr.common.SolrException: No system property or default value specified for enable.master". I'm attaching the master schema, the master solrconfig, the solrconfig of the slaves, and the solrcore.properties. If anyone has any info on this I would be more than appreciative! Thanks

--
Ezequiel.
Http://www.ironicnet.com
Re: Solr - multivalue fields - please help
Just a suggestion: you can try using dynamic fields, with the company name (or ID) as a prefix. E.g., for the data:

    Employee ID   Employer   FromDate   ToDate
    21345         IBM        01/01/04   01/01/06
                  MS         01/01/07   01/01/08
                  BT         01/01/09   Present

index it as:

    Employee ID   - 21345
    Employer Name - IBM MS BT   (multivalued field)
    IBM_FROM_DATE - 01/01/04    (dynamic field)
    IBM_TO_DATE   - 01/01/06    (dynamic field)

You should be able to match the results, get the from and to dates for the companies, and handle the rest on the UI side.

Regards,
Jayendra

On Wed, Mar 23, 2011 at 8:24 AM, Sandra sclo...@consultant.com wrote:
Hi everyone, I know that Solr cannot match one value in a multi-valued field with the corresponding value in another multi-valued field; however, my data set appears to be in that form at the moment. With that in mind, does anyone know of any good articles or discussions that have addressed this issue, specifically the alternatives that can easily be done/considered, etc.? The data is of the following format: I have a unique Employee ID field, plus Employer (multi-valued), FromDate (multi-valued) and ToDate (multi-valued). For a given employee ID I am trying to return the relevant data. For example, for an ID of 21345 and employer IBM, return the work dates from and to. Or for the same ID and two work dates, return the company or companies that the ID was associated with, etc.

    Employee ID   Employer   FromDate   ToDate
    21345         IBM        01/01/04   01/01/06
                  MS         01/01/07   01/01/08
                  BT         01/01/09   Present

Any suggestions/comments/ideas/articles much appreciated... Thanks, S.
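A sketch of the schema side of this suggestion; the wildcard dynamic field declarations mean no per-employer field needs to be declared (string is used here because the sample dates are not in Solr's ISO-8601 date format):

    <dynamicField name="*_FROM_DATE" type="string" indexed="true" stored="true"/>
    <dynamicField name="*_TO_DATE"   type="string" indexed="true" stored="true"/>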
Re: Solr coding
Why not just add an extra field for the user to each document in the index? You could then easily filter the results on the user field and show only the documents submitted by that user.

Regards,
Jayendra

On Wed, Mar 23, 2011 at 9:20 AM, satya swaroop satya.yada...@gmail.com wrote:
Hi all, for my project requirement I need to keep searches of files private, so I need to modify the Solr code. For example, if there are 5 users and each user indexes some files:

    user1 - java1, c1, sap1
    user2 - java2, c2, sap2
    user3 - java3, c3, sap3
    user4 - java4, c4, sap4
    user5 - java5, c5, sap5

and user2 searches for the keyword "java", then only the file java2 should be displayed, not the other files. To do this filtering inside Solr itself, may I know where to modify the code? I will access a database to check the user's indexed files and then filter the result. I don't have any cores; I indexed all files into a single index. Regards, satya
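A sketch of the filtering this suggests, with no Solr code changes needed - each document is indexed with the submitting user (the field name user is illustrative) and every search is restricted with a filter query:

    http://localhost:8983/solr/select?q=java&fq=user:user2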
Re: Solr coding
In that case, you may want to store the groups that have access to the document in a multivalued field. A filter query on the user's groups should then filter the results as you expect.

You may also want to check Apache ManifoldCF, as suggested by Szott.

Regards,
Jayendra

On Wed, Mar 23, 2011 at 9:46 AM, satya swaroop satya.yada...@gmail.com wrote:
Hi Jayendra, I forgot to mention that the result also depends on the group of the user. It is somewhat complex, so I didn't mention it before; let me explain the exact requirement:

    user1, group1         - java1, c1, sap1
    user2, group2         - java2, c2, sap2
    user3, group1, group3 - java3, c3, sap3
    user4, group3         - java4, c4, sap4
    user5, group3         - java5, c5, sap5

"user1, group1" means user1 belongs to group1. Here the filter includes the group too: if, for example, user1 searches for "java", then the results should show java1 and java3, since the java3 file is accessible to all users in group1. So I thought of editing the code... Thanks, satya
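Continuing the sketch from the previous reply (field and value names are illustrative): a multivalued groups field in schema.xml, and a filter query expanded to the searching user's groups:

    <field name="groups" type="string" indexed="true" stored="true" multiValued="true"/>

    http://localhost:8983/solr/select?q=java&fq=groups:(group1 OR group3)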
Re: Logic operator with dismax
Dismax does not support boolean queries; you may try Extended Dismax (edismax) for boolean support: https://issues.apache.org/jira/browse/SOLR-1553

Regards,
Jayendra

On Mon, Mar 21, 2011 at 8:24 AM, Savvas-Andreas Moysidis savvas.andreas.moysi...@googlemail.com wrote:
Hello, the Dismax search handler doesn't have the concept of a logical operator in terms of OR/AND, but rather uses a feature called Min-Should-Match (mm). This parameter specifies the absolute number or percentage of the entered terms that need to match. For an OR-like effect you can specify mm=0%, and for an AND-like effect mm=100% should work. More information can be found here:
http://wiki.apache.org/solr/DisMaxQParserPlugin#mm_.28Minimum_.27Should.27_Match.29

On 21 March 2011 11:46, Gastone Penzo gastone.pe...@gmail.com wrote:
Hi, I have a problem with the logical OR operator in a dismax query. Some days ago the query worked well; now it returns nothing (0 documents). The query is:

    http://localhost:8983/solr/select/?q=1324 OR 4322 OR 2324 OR hello+world&defType=dismax&qf=code%20title

The schema has the fields code and title. I want to search for docs with "hello world" in the title, plus the docs with the codes 1324, 4322, 2324 (even if they don't have "hello world" in the title). The result is that the query returns the docs with these codes AND "hello world" in the title (logical AND, not OR). The default operator in the schema is OR. What happened? Thank you.

--
Gastone Penzo
www.solr-italia.it
The first Italian blog dedicated to Apache Solr
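Once a Solr version that bundles edismax (or the SOLR-1553 patch) is in place, the same query can be issued with the extended parser, which does honor boolean operators (a sketch using the fields from the question):

    http://localhost:8983/solr/select?defType=edismax&qf=code%20title&q=1324 OR 4322 OR 2324 OR "hello world"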
Re: SOLR DIH importing MySQL text column as a BLOB
Hi Kaushik,

If the field is being treated as a blob, you can try the FieldStreamDataSource mapping, which handles blob objects and extracts their contents. This feature is available only from Solr 3.1 onwards, I believe.

http://lucene.apache.org/solr/api/org/apache/solr/handler/dataimport/FieldStreamDataSource.html

Regards,
Jayendra

On Tue, Mar 15, 2011 at 11:57 PM, Kaushik Chakraborty kaych...@gmail.com wrote:
I have a column for posts in MySQL of type text. I've tried the corresponding field types for it in the Solr schema.xml, e.g. string, text, text_ws, but whenever I import it using the DIH it gets imported as a BLOB object. I checked: this happens only for columns of type text and not for varchar (those get indexed as strings). Hence the posts field is not searchable. I found out about this issue, after repeated search failures, when I did a *:* query on Solr. A sample response:

    <result name="response" numFound="223" start="0" maxScore="1.0">
      <doc>
        <float name="score">1.0</float>
        <str name="solr_post_bio">[B@10a33ce2</str>
        <date name="solr_post_created_at">2011-02-21T07:02:55Z</date>
        <str name="solr_post_email">test.acco...@gmail.com</str>
        <str name="solr_post_first_name">Test</str>
        <str name="solr_post_last_name">Account</str>
        <str name="solr_post_message">[B@2c93c4f1</str>
        <str name="solr_post_status_message_id">1</str>
      </doc>
    </result>

The data-config.xml:

    <document>
      <entity name="posts" dataSource="jdbc" query="select p.person_id as solr_post_person_id,
          pr.first_name as solr_post_first_name, pr.last_name as solr_post_last_name,
          u.email as solr_post_email, p.message as solr_post_message,
          p.id as solr_post_status_message_id, p.created_at as solr_post_created_at,
          pr.bio as solr_post_bio
          from posts p, users u, profiles pr
          where p.person_id = u.id and p.person_id = pr.person_id and p.type='StatusMessage'">
        <field column="solr_post_person_id"/>
        <field column="solr_post_first_name"/>
        <field column="solr_post_last_name"/>
        <field column="solr_post_email"/>
        <field column="solr_post_message"/>
        <field column="solr_post_status_message_id"/>
        <field column="solr_post_created_at"/>
        <field column="solr_post_bio"/>
      </entity>
    </document>

The schema.xml:

    <fields>
      <field name="solr_post_status_message_id" type="string" indexed="true" stored="true" required="true"/>
      <field name="solr_post_message" type="text_ws" indexed="true" stored="true" required="true"/>
      <field name="solr_post_bio" type="text" indexed="false" stored="true"/>
      <field name="solr_post_first_name" type="string" indexed="false" stored="true"/>
      <field name="solr_post_last_name" type="string" indexed="false" stored="true"/>
      <field name="solr_post_email" type="string" indexed="false" stored="true"/>
      <field name="solr_post_created_at" type="date" indexed="false" stored="true"/>
    </fields>
    <uniqueKey>solr_post_status_message_id</uniqueKey>
    <defaultSearchField>solr_post_message</defaultSearchField>

Thanks, Kaushik
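A sketch of the FieldStreamDataSource wiring the reply refers to: a nested entity streams the blob column from the outer row (via dataField) and a processor converts it to text. Names follow the question where possible; the rest is illustrative:

    <dataSource name="jdbc" type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://..."/>
    <dataSource name="fieldSource" type="FieldStreamDataSource"/>
    <document>
      <entity name="posts" dataSource="jdbc" query="select id, message from posts">
        <entity name="message" dataSource="fieldSource" processor="TikaEntityProcessor"
                dataField="posts.message" format="text">
          <field column="text" name="solr_post_message"/>
        </entity>
      </entity>
    </document>

(Alternatively, for MySQL text columns returned as byte arrays, setting convertType="true" on the JdbcDataSource is often enough to have them read as strings.)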
Re: docBoost
You can use the ScriptTransformer to perform the boost calculation and addition:
http://wiki.apache.org/solr/DataImportHandler#ScriptTransformer

    <dataConfig>
      <script><![CDATA[
        function f1(row) {
          // Add boost
          row.put('$docBoost', 1.5);
          return row;
        }
      ]]></script>
      <document>
        <entity name="e" pk="id" transformer="script:f1" query="select * from X"/>
      </document>
    </dataConfig>

Regards,
Jayendra

On Wed, Mar 9, 2011 at 2:01 PM, Brian Lamb brian.l...@journalexperts.com wrote:
Anyone have any clue on this one?

On Tue, Mar 8, 2011 at 2:11 PM, Brian Lamb brian.l...@journalexperts.com wrote:
Hi all, I am using dataimport to create my index and I want to use docBoost to assign higher weights to certain docs. I understand the concept behind docBoost, but I haven't been able to find an example anywhere that shows how to implement it. Assume the following config file:

    <document>
      <entity name="animal" dataSource="animals" pk="id" query="SELECT * FROM animals">
        <field column="id" name="id"/>
        <field column="genus" name="genus"/>
        <field column="species" name="species"/>
        <entity name="boosters" dataSource="boosts"
                query="SELECT boost_score FROM boosts WHERE animal_id = ${animal.id}">
          <field column="boost_score" name="boost_score"/>
        </entity>
      </entity>
    </document>

How do I add in a docBoost score? The boost score is currently in a separate table, as shown above.
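Tying the two together, the script can read the joined boost_score column off the row instead of hard-coding a value (a sketch; it assumes boost_score ends up on the row being transformed):

    function addBoost(row) {
      var score = row.get('boost_score');
      if (score != null) {
        row.put('$docBoost', score);
      }
      return row;
    }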
Re: Same index is ranking differently on 2 machines
queryNorm is just a normalizing factor and has the same value across all the results of a query; it only makes the scores comparable. So even if it varies between environments, you should not be worried about it.

http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html#formula_queryNorm

Definition: queryNorm(q) is just a normalizing factor used to make scores between queries comparable. This factor does not affect document ranking (since all ranked documents are multiplied by the same factor), but rather just attempts to make scores from different queries (or even different indexes) comparable.

Regards,
Jayendra

On Wed, Mar 9, 2011 at 4:22 PM, Allistair Crossley a...@roxxor.co.uk wrote:
Hi, I am seeing an issue I do not understand and hope that someone can shed some light on it. For a particular search we see a particular result rank in position 3 on one machine and position 8 on the production machine. Position 3 is our desired and roughly expected ranking. I have a local machine with Solr and a version deployed on a production server. My local machine's Solr and the production version are both checked out from our project's SVN trunk. They are identical files except for the data files (not in SVN) and database connection settings. The index is populated exclusively via data import handler queries to a database. I have exported the production database as-is to my local development machine, so that my local machine and production have access to the selfsame data. I executed a total full-import on both. Still, I see a different position for this document, which should surely rank in the same location, all else being equal. I ran a diff on the debugQuery output to see how the scores were being computed (see the appendix at the foot of this email). As far as I can tell, every single query normalisation block of the debug output is marginally different, e.g.:

    - 0.021368012 = queryNorm (local)
    + 0.009944122 = queryNorm (production)

which leads to a final score of -2.28 versus +1.06 - enough to skew the results from correct to incorrect (in terms of what we expect to see):

    - 2.286596  (local)
    + 1.0651637 (production)

I cannot explain this difference. The database is the same. The configuration is the same. I have fully imported from scratch on both servers. What am I missing? Thank you for your time, Allistair.

APPENDIX - debugQuery=on DIFF

    --- untitled
    +++ (clipboard)
    @@ -1,51 +1,49 @@
    -<str name="L12411p">
    +<str name="L12411">
    -2.286596 = (MATCH) sum of:
    -  1.6891675 = (MATCH) sum of:
    -    1.3198489 = (MATCH) max plus 0.01 times others of:
    -      0.023022119 = (MATCH) weight(text:dubai^0.1 in 1551), product of:
    -        0.011795795 = queryWeight(text:dubai^0.1), product of:
    -          0.1 = boost
    +1.0651637 = (MATCH) sum of:
    +  0.7871359 = (MATCH) sum of:
    +    0.6151879 = (MATCH) max plus 0.01 times others of:
    +      0.10713901 = (MATCH) weight(text:dubai in 1551), product of:
    +        0.05489459 = queryWeight(text:dubai), product of:
               5.520305 = idf(docFreq=65, maxDocs=6063)
    -          0.021368012 = queryNorm
    +          0.009944122 = queryNorm
             1.9517226 = (MATCH) fieldWeight(text:dubai in 1551), product of:
               1.4142135 = tf(termFreq(text:dubai)=2)
               5.520305 = idf(docFreq=65, maxDocs=6063)
               0.25 = fieldNorm(field=text, doc=1551)
    -      1.3196187 = (MATCH) weight(profile:dubai^2.0 in 1551), product of:
    -        0.32609802 = queryWeight(profile:dubai^2.0), product of:
    +      0.6141165 = (MATCH) weight(profile:dubai^2.0 in 1551), product of:
    +        0.15175761 = queryWeight(profile:dubai^2.0), product of:
               2.0 = boost
               7.6305184 = idf(docFreq=7, maxDocs=6063)
    -          0.021368012 = queryNorm
    +          0.009944122 = queryNorm
             4.0466933 = (MATCH) fieldWeight(profile:dubai in 1551), product of:
               1.4142135 = tf(termFreq(profile:dubai)=2)
               7.6305184 = idf(docFreq=7, maxDocs=6063)
               0.375 = fieldNorm(field=profile, doc=1551)
    -    0.36931866 = (MATCH) max plus 0.01 times others of:
    -      0.0018293816 = (MATCH) weight(text:product^0.1 in 1551), product of:
    -        0.003954251 = queryWeight(text:product^0.1), product of:
    -          0.1 = boost
    +    0.17194802 = (MATCH) max plus 0.01 times others of:
    +      0.00851347 = (MATCH) weight(text:product in 1551), product of:
    +        0.018402064 = queryWeight(text:product), product of:
               1.8505468 = idf(docFreq=2589, maxDocs=6063)
    -          0.021368012 = queryNorm
    +          0.009944122 = queryNorm
             0.4626367 = (MATCH) fieldWeight(text:product in 1551), product of:
               1.0 = tf(termFreq(text:product)=1)
               1.8505468 = idf(docFreq=2589, maxDocs=6063)
               0.25 = fieldNorm(field=text, doc=1551)
    -      0.36930037 = (MATCH) weight(profile:product^2.0 in 1551), product of:
    -        0.1725098 =
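For reference, the factor in question is computed (per the documentation for Lucene's DefaultSimilarity) as

    queryNorm(q) = 1 / sqrt(sumOfSquaredWeights)
    sumOfSquaredWeights = boost(q)^2 * sum over terms t of (idf(t) * boost(t))^2

so any difference in per-term boosts or idf between the two environments changes every queryNorm value - which is consistent with the boost mismatch identified in the follow-up below.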
Re: Same index is ranking differently on 2 machines
Are you sure you have the same config? The boost seems different for the text field - text:dubai^0.1 locally versus text:dubai in production:

    -2.286596 = (MATCH) sum of:
    -  1.6891675 = (MATCH) sum of:
    -    1.3198489 = (MATCH) max plus 0.01 times others of:
    -      0.023022119 = (MATCH) weight(text:dubai^0.1 in 1551), product of:
    -        0.011795795 = queryWeight(text:dubai^0.1), product of:
    -          0.1 = boost
    +1.0651637 = (MATCH) sum of:
    +  0.7871359 = (MATCH) sum of:
    +    0.6151879 = (MATCH) max plus 0.01 times others of:
    +      0.10713901 = (MATCH) weight(text:dubai in 1551), product of:
    +        0.05489459 = queryWeight(text:dubai), product of:

Regards,
Jayendra

On Wed, Mar 9, 2011 at 4:38 PM, Allistair Crossley a...@roxxor.co.uk wrote:
Thanks. Good to know, but even so my problem remains - the end score should not be different, and it is causing a dramatically different ranking of a document (3 versus 7 is dramatic for my client). This must be down to the scoring debug differences - it's the only difference I can find :(

On Mar 9, 2011, at 4:34 PM, Jayendra Patil wrote:
queryNorm is just a normalizing factor and has the same value across all the results of a query; it only makes the scores comparable. So even if it varies between environments, you should not be worried about it. queryNorm(q) is just a normalizing factor used to make scores between queries comparable; it does not affect document ranking (since all ranked documents are multiplied by the same factor), but rather just attempts to make scores from different queries (or even different indexes) comparable.

On Wed, Mar 9, 2011 at 4:22 PM, Allistair Crossley a...@roxxor.co.uk wrote:
Hi, I am seeing an issue I do not understand and hope that someone can shed some light on it. For a particular search we see a particular result rank in position 3 on one machine and position 8 on the production machine. Position 3 is our desired and roughly expected ranking. My local machine's Solr and the production version are both checked out from our project's SVN trunk; they are identical files except for the data files (not in SVN) and database connection settings. The index is populated exclusively via data import handler queries to a database. I have exported the production database as-is to my local development machine, and executed a total full-import on both. Still, I see a different position for this document, which should surely rank in the same location, all else being equal. As far as I can tell, every single query normalisation block of the debug output is marginally different, which leads to a final score of -2.28 versus +1.06 - enough to skew the results from correct to incorrect. I cannot explain this difference. The database is the same. The configuration is the same. I have fully imported from scratch on both servers. What am I missing? Thank you for your time, Allistair.
Solr Cell DataImport Tika handler broken - fails to index Zip file contents
Working with the latest Solr trunk code, it seems the Tika handlers for Solr Cell (ExtractingDocumentLoader.java) and the Data Import Handler (TikaEntityProcessor.java) fail to index zip file contents again; they just index the file names. This issue was addressed some time back, late last year, but seems to have reappeared with the latest code.

I had raised a JIRA for the Data Import Handler part with the patch and the test case: https://issues.apache.org/jira/browse/SOLR-2332. The same fix is needed for Solr Cell as well. I can raise a JIRA and provide the patch for that too, if the above patch seems good enough.

Regards,
Jayendra
Re: logical relation among filter queries
You can use boolean operators in the filter query, e.g.

    fq=rating:(PG-13 OR R)

Regards,
Jayendra

On Mon, Mar 7, 2011 at 9:25 PM, cyang2010 ysxsu...@hotmail.com wrote:
I wonder what the logical relation among filter queries is; I can't find much documentation on filter queries. For example, I want to find all titles that are either PG-13 or R through a filter query. The following query won't give me any results back, so I suppose that by default it is an intersection of the filter query results?

    fq=rating:PG-13&fq=rating:R&q=*:*

How do I change it to a union of the filter query results? Thanks.
Re: adding a document using curl
If you are using the ExtractingRequestHandler, you can also try using stream.file or stream.url, e.g.

    curl "http://localhost:8080/solr/core0/update/extract?stream.file=C:/777045.zip&literal.id=777045&literal.title=Test&commit=true"

A more detailed explanation is available @ http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Content-Extraction-Tika

The literal. prefix maps the passed attributes to normal fields, and the content extracted from the document is stored in the text field by default.

Regards,
Jayendra

On Thu, Mar 3, 2011 at 7:16 AM, Gary Taylor g...@inovem.com wrote:
As an example, I run this in the same directory as the msword1.doc file:

    curl "http://localhost:8983/solr/core0/update/extract?literal.docid=74&literal.type=5" -F file=@msword1.doc

The type literal is just part of my schema. Gary.

On 03/03/2011 11:45, Ken Foskey wrote:
On Thu, 2011-03-03 at 12:36 +0100, Markus Jelsma wrote:
Here's a complete example:
http://wiki.apache.org/solr/UpdateXmlMessages#Passing_commit_parameters_as_part_of_the_URL

I should have been clearer: a rich-text document. XML I can make work, and a script is in the example docs folder. http://wiki.apache.org/solr/ExtractingRequestHandler - I also read the Solr 1.4 book and tried the samples in there, but could not make them work. Ta
Re: solr different sizes on master and slave
Hi Mike,

There was an issue with the snappuller wherein it fails to clean up the old index directories on the slave side: https://issues.apache.org/jira/browse/SOLR-2156. The patch can be applied to fix the issue.

You can also delete the old index directories yourself, except for the current one, which is named in index.properties.

Regards,
Jayendra

On Tue, Mar 1, 2011 at 4:27 PM, Mike Franon kongfra...@gmail.com wrote:
OK, doing some more research I noticed that the slave keeps multiple index folders, for example:

    index
    index.20110204010900
    index.20110204013355
    index.20110218125400

and there is an index.properties that shows which index it is using. I am just curious why it keeps multiple copies. Is there a setting somewhere I can change to keep only one copy, so as not to lose space? Thanks

On Tue, Mar 1, 2011 at 3:26 PM, Mike Franon kongfra...@gmail.com wrote:
No pending commits. It looks like there are almost two copies of the index on the master; not sure how that happened.

On Tue, Mar 1, 2011 at 3:08 PM, Markus Jelsma markus.jel...@openindex.io wrote:
Are there pending commits on the master?

I was curious why the size would be dramatically different even though the index versions are the same. One is 1.2 GB, and on the slave it is 512 MB. I would think they should both be the same size, no? Thanks
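As a sketch of the manual cleanup (paths are placeholders - always check index.properties first to see which directory is live):

    cat /var/solr/data/index.properties
    # e.g. index=index.20110218125400
    rm -rf /var/solr/data/index.20110204010900 /var/solr/data/index.20110204013355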
Re: Groupped results
Hi Rok,

If I understood the use case correctly, grouping of results is possible in Solr: http://wiki.apache.org/solr/FieldCollapsing

You could create a new field holding the combination that defines the group and use the field collapsing feature to group the results:

    Id   Type1   Type2   Title   Group
    1    a       b       xfg     ab
    2    a       c       abd     ac
    3    a       d       thm     ad
    4    b       a       efd     ba
    5    b       b       ikj     bb
    6    b       c       azd     bc

It also provides sorting and group-sorting features.

Regards,
Jayendra

On Wed, Mar 2, 2011 at 6:37 AM, Rok Rejc rokrej...@gmail.com wrote:
I have an index with a number of documents. For example (this example is representative and contains many other fields):

    Id   Type1   Type2   Title
    1    a       b       xfg
    2    a       c       abd
    3    a       d       thm
    4    b       a       efd
    5    b       b       ikj
    6    b       c       azd
    ...

I want to query the index on a number of fields (not a problem), but I want the results ordered in groups, and within each group ordered alphabetically by Title. The groups are not fixed; they are created at runtime. For example:

    Group 1: documents with Type1=b and Type2=b
    Group 2: documents with Type1=a and Type2=b
    Group 3: documents with Type1=b and Type2=a
    Group 4: documents with Type1=b and Type2=c
    ...

So I want to retrieve results ordered by group (1, 2, 3, 4) and after that alphabetically by Title. I think I should create a query where each group is separated by an OR operator and boost each group with an appropriate factor, then order the results by this factor and Title. Is this possible? Any suggestions are appreciated. Many thanks, Rok
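A sketch of that suggestion: index a combined field (here called typeGroup, e.g. the concatenation of Type1 and Type2 as in the Group column above) and group on it, sorting documents inside each group by title:

    http://localhost:8983/solr/select?q=*:*&group=true&group.field=typeGroup&group.sort=Title asc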
Re: solr score issue
Check the "Need help in understanding output of searcher.explain() function" thread: http://mail-archives.apache.org/mod_mbox/lucene-java-user/201008.mbox/%3CAANLkTi=m9a1guhrahpeyqaxhu9gta9fjbnr7-8-zi...@mail.gmail.com%3E

Regards, Jayendra

On Fri, Feb 25, 2011 at 6:57 AM, Bagesh Sharma mail.bag...@gmail.com wrote: Hi sir, can anyone explain to me how this score is being calculated? I am searching here for "software engineer" using the dismax handler. Total documents indexed are 477 and the query returns 28 results. The query is like this: q=software+engineer&fq=location%3Adelhi The dismax setting is:

    <str name="qf">alltext title^2 functional_role^1</str>
    <str name="pf">body^100</str>

Here the alltext field is made by copying all fields. The body field contains the detail of the job. I am unable to understand how these scores have been calculated. Where do I start calculating the score from, and what are the default scores for any term match?

    <str name="20080604/3eb9a7b30131a782a0c0a0e2cdb2b6b8.html">
    0.5901718 = (MATCH) sum of:
      0.0032821721 = (MATCH) sum of:
        0.0026574256 = (MATCH) max plus 0.1 times others of:
          0.0026574256 = (MATCH) weight(alltext:softwar in 339), product of:
            0.0067262817 = queryWeight(alltext:softwar), product of:
              3.6121683 = idf(docFreq=34, maxDocs=477)
              0.0018621174 = queryNorm
            0.39508092 = (MATCH) fieldWeight(alltext:softwar in 339), product of:
              1.0 = tf(termFreq(alltext:softwar)=1)
              3.6121683 = idf(docFreq=34, maxDocs=477)
              0.109375 = fieldNorm(field=alltext, doc=339)
        6.2474643E-4 = (MATCH) max plus 0.1 times others of:
          6.2474643E-4 = (MATCH) weight(alltext:engin in 339), product of:
            0.0032613424 = queryWeight(alltext:engin), product of:
              1.7514161 = idf(docFreq=224, maxDocs=477)
              0.0018621174 = queryNorm
            0.19156113 = (MATCH) fieldWeight(alltext:engin in 339), product of:
              1.0 = tf(termFreq(alltext:engin)=1)
              1.7514161 = idf(docFreq=224, maxDocs=477)
              0.109375 = fieldNorm(field=alltext, doc=339)
      0.5868896 = weight(body:"softwar engin"^100.0 in 339), product of:
        0.9995919 = queryWeight(body:"softwar engin"^100.0), product of:
          100.0 = boost
          5.3680387 = idf(body: softwar=34 engin=223)
          0.0018621174 = queryNorm
        0.58712924 = fieldWeight(body:"softwar engin" in 339), product of:
          1.0 = tf(phraseFreq=1.0)
          5.3680387 = idf(body: softwar=34 engin=223)
          0.109375 = fieldNorm(field=body, doc=339)
    </str>

Please suggest. -- View this message in context: http://lucene.472066.n3.nabble.com/solr-score-issue-tp2574680p2574680.html Sent from the Solr - User mailing list archive at Nabble.com.
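To get this kind of explain output from a client rather than the browser, debugQuery can be switched on via SolrJ. A small sketch, assuming a SolrJ 1.4-era client and that the handler accepts dismax as in the thread (URL and values are just the thread's examples):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class ExplainDump {
        public static void main(String[] args) throws Exception {
            CommonsHttpSolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
            SolrQuery q = new SolrQuery("software engineer");
            q.set("defType", "dismax");
            q.addFilterQuery("location:delhi");
            q.set("debugQuery", "true"); // adds the per-document explain section to the response
            QueryResponse rsp = solr.query(q);
            System.out.println(rsp.getDebugMap().get("explain"));
        }
    }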
Re: query slop issue
qs is only the amount of slop on phrase queries explicitly specified in q, applied to the qf fields. So only if the search q were "water treatment plant" (as a quoted phrase) would qs come into the picture. Slop is the maximum allowable positional distance between terms for them still to be considered a match; the distance is the number of positional moves of terms needed to reconstruct the phrase in the same order. So with qs=1 you are allowed only one positional move to recreate the exact phrase. You may also want to check the pf and ps params for dismax.

Regards, Jayendra

On Thu, Feb 24, 2011 at 8:31 AM, Bagesh Sharma mail.bag...@gmail.com wrote: Hi all, I have a search string q=water+treatment+plant and I am using the dismax request handler where I have qs=1. How will the processing be done, meaning within how many words should water, treatment or plant occur to come into the result set? -- View this message in context: http://lucene.472066.n3.nabble.com/query-slop-issue-tp2567418p2567418.html Sent from the Solr - User mailing list archive at Nabble.com.
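For completeness, a minimal SolrJ sketch of a dismax query where qs actually applies, i.e. the q itself is a quoted phrase (the alltext field is borrowed from the earlier thread and is an assumption):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

    public class QuerySlop {
        public static void main(String[] args) throws Exception {
            CommonsHttpSolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
            SolrQuery q = new SolrQuery("\"water treatment plant\""); // qs only applies to explicit phrases
            q.set("defType", "dismax");
            q.set("qf", "alltext");
            q.set("qs", "1"); // one positional move allowed when matching the phrase
            System.out.println(solr.query(q).getResults().getNumFound());
        }
    }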
Re: Problem in full query searching
With the dismax or extended dismax parser you should be able to achieve this.

Dismax: qf, qs, pf and ps should give you exact control over the fields and boosts.
Extended dismax: in addition to qf, qs, pf and ps, you have pf2 and pf3 for the two- and three-word shingles.

As Grijesh mentioned, use more weight for phrase or proximity matches (see the sketch after this message).

Regards, Jayendra

On Thu, Feb 24, 2011 at 4:03 AM, Grijesh pintu.grij...@gmail.com wrote: Try to configure more weight on the ps and pf parameters of the dismax request handler to boost phrase-matching documents. Or, if you do not want to consider the term frequency, then use omitTermFreqAndPositions=true in the field definition - Thanx: Grijesh http://lucidimagination.com -- View this message in context: http://lucene.472066.n3.nabble.com/Problem-in-full-query-searching-tp2566054p2566230.html Sent from the Solr - User mailing list archive at Nabble.com.
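A quick sketch of what such an extended dismax request might look like from SolrJ; the field names and boosts are made up, and edismax assumes a Solr version that ships it (3.1+/trunk at the time):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

    public class ShinglePhraseBoost {
        public static void main(String[] args) throws Exception {
            CommonsHttpSolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
            SolrQuery q = new SolrQuery("black jacket red cap");
            q.set("defType", "edismax");
            q.set("qf", "title body"); // fields searched term by term
            q.set("pf", "body^10");    // whole-query phrase boost
            q.set("pf2", "body^5");    // two-word shingles: "black jacket", "jacket red", ...
            q.set("pf3", "body^3");    // three-word shingles
            System.out.println(solr.query(q).getResults());
        }
    }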
Re: Index MS office
http://wiki.apache.org/solr/ExtractingRequestHandler

Regards, Jayendra

On Wed, Feb 2, 2011 at 10:49 AM, Thumuluri, Sai sai.thumul...@verizonwireless.com wrote: Good Morning, I am planning to get started on indexing MS Office documents using Apache Solr - can someone please direct me to where I should start? Thanks, Sai Thumuluri
Re: configure httpclient to access solr with user credential on third party host
This should help:

    HttpClient client = new HttpClient();
    client.getParams().setAuthenticationPreemptive(true);
    AuthScope scope = new AuthScope(AuthScope.ANY_HOST, AuthScope.ANY_PORT);
    client.getState().setCredentials(scope, new UsernamePasswordCredentials(user, password));

Regards, Jayendra

On Thu, Jan 27, 2011 at 4:47 PM, Darniz rnizamud...@edmunds.com wrote: thanks, exactly. I asked my domain hosting provider and he provided me with some other port. I am wondering, can I specify credentials without the port? I mean, when I open the browser and type www.mydomainmame/solr I get the tomcat auth login screen. In the same way, can I configure the http client so that I don't have to specify the port? Thanks darniz -- View this message in context: http://lucene.472066.n3.nabble.com/configure-httpclient-to-access-solr-with-user-credential-on-third-party-host-tp2360364p2364190.html Sent from the Solr - User mailing list archive at Nabble.com.
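To wire that authenticated client into SolrJ, the CommonsHttpSolrServer constructor accepts a preconfigured HttpClient. A sketch with a hypothetical host and credentials; note the port simply lives in the URL, so omitting it means the protocol default (80 for http):

    import org.apache.commons.httpclient.HttpClient;
    import org.apache.commons.httpclient.UsernamePasswordCredentials;
    import org.apache.commons.httpclient.auth.AuthScope;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

    public class AuthenticatedSolr {
        public static void main(String[] args) throws Exception {
            HttpClient client = new HttpClient();
            client.getParams().setAuthenticationPreemptive(true);
            client.getState().setCredentials(
                    new AuthScope(AuthScope.ANY_HOST, AuthScope.ANY_PORT),
                    new UsernamePasswordCredentials("user", "password")); // hypothetical credentials
            // no explicit port in the URL: http defaults to port 80
            CommonsHttpSolrServer solr = new CommonsHttpSolrServer("http://www.example.com/solr", client);
            System.out.println(solr.ping().getStatus());
        }
    }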
Re: Extracting contents of zipped files with Tika and Solr 1.4.1
Hi Gary,

The latest Solr trunk was able to extract and index the contents of the zip file using the ExtractingRequestHandler. The snapshot of trunk we worked upon had the Tika 0.8 snapshot jars and worked pretty well. Tested again with a sample url and it works fine:

    curl "http://localhost:8080/solr/core0/update/extract?stream.file=C:/temp/extract/777045.zip&literal.id=777045&literal.title=Test&commit=true"

You would probably need to drill down to the Tika jars and the apache-solr-cell-4.0-dev.jar used for rich document indexing.

Regards, Jayendra

On Tue, Jan 25, 2011 at 11:08 AM, Gary Taylor g...@inovem.com wrote: OK, got past the schema.xml problem, but now I'm back to square one. I can index the contents of binary files (Word, PDF etc...), as well as text files, but it won't index the content of files inside a zip. As an example, I have two txt files - doc1.txt and doc2.txt. If I index either of them individually using:

    curl "http://localhost:8983/solr/core0/update/extract?literal.docid=74&fmap.content=text&literal.type=5" -F file=@doc1.txt

and commit, Solr will index the contents and searches will match. If I zip those two files up into solr1.zip, and index that using:

    curl "http://localhost:8983/solr/core0/update/extract?literal.docid=74&fmap.content=text&literal.type=5" -F file=@solr1.zip

and commit, the file names are indexed, but not their contents. I have checked that Tika can correctly process the zip file when used standalone with the tika-app jar - it outputs both the filenames and contents. Should I be able to index the contents of files stored in a zip by using extract? Thanks and kind regards, Gary.

On 25/01/2011 15:32, Gary Taylor wrote: Thanks Erlend. Not used SVN before, but have managed to download and build the latest trunk code. Now I'm getting an error when trying to access the admin page (via Jetty) because I specify HTMLStripStandardTokenizerFactory in my schema.xml, but this appears to be no longer supplied as part of the build, so I get an exception because it can't find that class. I've checked the CHANGES.txt and found the following in the change list for 1.4.0 (!?):

66. SOLR-1343: Added HTMLStripCharFilter and marked HTMLStripReader, HTMLStripWhitespaceTokenizerFactory and HTMLStripStandardTokenizerFactory deprecated. To strip HTML tags, HTMLStripCharFilter can be used with an arbitrary Tokenizer. (koji)

Unfortunately, I can't seem to get that to work correctly. Does anyone have an example fieldType stanza (for schema.xml) for stripping out HTML? Thanks and kind regards, Gary.

On 25/01/2011 14:17, Erlend Garåsen wrote: On 25.01.11 11.30, Erlend Garåsen wrote: Tika version 0.8 is not included in the latest release/trunk from SVN. Ouch, I wrote "not" instead of "now". Sorry, I replied in a hurry. And to clarify, by content I mean the main content of a Word file. Title and other kinds of metadata are successfully extracted by the old 0.4 version of Tika, but you need a newer Tika version (0.8) in order to fetch the main content as well. So try the newest Solr version from trunk. Erlend
Re: StopFilterFactory and qf containing some fields that use it and some that do not
Have used edismax and stopword filters as well, but usually with the fq parameter, e.g. fq=title:"the life", and never had any issues. Can you turn on debugQuery and check what query is formed for all the combinations you mentioned?

Regards, Jayendra

On Wed, Jan 12, 2011 at 5:19 PM, Dyer, James james.d...@ingrambook.com wrote: I'm running into a problem with StopFilterFactory in conjunction with (e)dismax queries that have a mix of fields, only some of which use StopFilterFactory. It seems that if even 1 field on the qf parameter does not use StopFilterFactory, then stop words are not removed when searching any fields. Here's an example of what I mean:

- I have 2 fields indexed: Title is "textStemmed", which includes StopFilterFactory (see below). Contributor is "textSimple", which does not include StopFilterFactory (see below).
- "The" is a stop word in stopwords.txt
- q=life&defType=edismax&qf=Title ... returns 277,635 results
- q=the life&defType=edismax&qf=Title ... returns 277,635 results
- q=life&defType=edismax&qf=Title Contributor ... returns 277,635 results
- q=the life&defType=edismax&qf=Title Contributor ... returns 0 results

It seems as if the stop words are not being stripped from the query because qf contains a field that doesn't use StopFilterFactory. I did testing with combining stemmed fields with non-stemmed fields in qf and it seems as if stemming gets applied regardless. But stop words do not. Does anyone have ideas on what is going on? Is this a feature or possibly a bug? Any known workarounds? Any advice is appreciated.

James Dyer E-Commerce Systems Ingram Content Group (615) 213-4311

    <fieldType name="textSimple" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

    <fieldType name="textStemmed" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0" stemEnglishPossessive="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0" stemEnglishPossessive="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
    </fieldType>
Re: solr wildcard queries and analyzers
Had the same issues with international characters and wildcard searches. One workaround we implemented was to index the field both with and without the ASCIIFoldingFilterFactory: you would have the original field and one with the English equivalents, to be used during searching. Wildcard searches with either the English-equivalent or the international terms would then match one of those. Also, lower-case the search terms if you are using a lowercase filter during indexing.

Regards, Jayendra

On Wed, Jan 12, 2011 at 7:46 AM, Kári Hreinsson k...@gagnavarslan.is wrote: Have you made any progress? Since the AnalyzingQueryParser doesn't inherit from QParserPlugin, solr doesn't want to use it, but I guess we could implement a similar parser that does inherit from QParserPlugin? Switching parser seems to be what is needed? Has really no one solved this before? - Kári

- Original Message - From: Matti Oinas matti.oi...@gmail.com To: solr-user@lucene.apache.org Sent: Tuesday, 11 January, 2011 12:47:52 PM Subject: Re: solr wildcard queries and analyzers

This might be the solution: http://lucene.apache.org/java/3_0_2/api/contrib-misc/org/apache/lucene/queryParser/analyzing/AnalyzingQueryParser.html

2011/1/11 Matti Oinas matti.oi...@gmail.com: Sorry, the message was not meant to be sent here. We are struggling with the same problem here.

2011/1/11 Matti Oinas matti.oi...@gmail.com: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#Analyzers "On wildcard and fuzzy searches, no text analysis is performed on the search word."

2011/1/11 Kári Hreinsson k...@gagnavarslan.is: Hi, I am having a problem with the fact that no text analysis is performed on wildcard queries. I have the following field type (a bit simplified):

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.TrimFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ASCIIFoldingFilterFactory"/>
      </analyzer>
    </fieldType>

My problem has to do with Icelandic characters. When I index a document with a text field including the word "sjálfsögðu", it gets indexed as "sjalfsogdu" (because of the ASCIIFoldingFilterFactory, which replaces the Icelandic characters with their English equivalents). Then, when I search (without a wildcard) for "sjálfsögðu" or "sjalfsogdu", I get that document as a result. This is convenient since it enables people to search without using accented characters and yet get the results they want (e.g. if they are working on computers with English keyboards). However, this all falls apart when using wildcard searches: then the search string isn't passed through the filters, and even if I search for "sjálf*" I don't get any results, because the index doesn't contain the original words (I do get results if I search for "sjalf*"). I know people have been having a similar problem with the case sensitivity of wildcard queries, and most often the solution seems to be to lowercase the string before passing it on to solr, which is not exactly an optimal solution (yet a simple one in that case). The Icelandic characters complicate things a bit, and applying the same solution (doing the lowercasing and character mapping) in my application seems like unnecessary duplication of code already part of solr, not to mention complication of my application and possible maintenance down the road. Is there any way around this? How are people solving this? Is there a way to apply the filters to wildcard queries?
I guess removing the ASCIIFoldingFilterFactory is the simplest solution but this normalization (of the text done by the filter) is often very useful. I hope I'm not overlooking some obvious explanation. :/ Thanks in advance, Kári Hreinsson
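The client-side half of the workaround (normalizing the wildcard term before sending it to Solr) can be approximated with the plain JDK. This is only an approximation of what ASCIIFoldingFilter does, not the filter's exact mapping table; letters without a Unicode decomposition (ð, þ, ...) need explicit mappings:

    import java.text.Normalizer;

    public class WildcardFolder {
        // approximate stand-in for LowerCaseFilter + ASCIIFoldingFilter,
        // applied to the wildcard term before it is sent to Solr
        static String foldForWildcard(String term) {
            String decomposed = Normalizer.normalize(term.toLowerCase(), Normalizer.Form.NFD);
            String stripped = decomposed.replaceAll("\\p{M}", ""); // removes combining accents: á -> a, ö -> o
            return stripped.replace("ð", "d").replace("þ", "th");  // no decomposition, so map explicitly
        }

        public static void main(String[] args) {
            System.out.println(foldForWildcard("sjálf*")); // prints sjalf*
        }
    }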
Re: Can't find source or jar for Solr class JaspellTernarySearchTrie
Check out and build the code from https://svn.apache.org/repos/asf/lucene/dev/trunk/

The class is at https://svn.apache.org/repos/asf/lucene/dev/trunk/solr/src/java/org/apache/solr/spelling/suggest/jaspell/JaspellTernarySearchTrie.java

Regards, Jayendra

On Wed, Jan 12, 2011 at 8:46 AM, Larry White ljw1...@gmail.com wrote: Hi, I'm trying to find the source code for the class JaspellTernarySearchTrie. It's supposed to be used for spelling suggestions. It's referenced in the javadoc: http://lucene.apache.org/solr/api/org/apache/solr/spelling/suggest/jaspell/JaspellTernarySearchTrie.html I realize this is a dumb question, but I've been looking through the downloads for several hours. I can't actually find the package org/apache/solr/spelling/suggest/ that it's supposed to be under. So if you would be so kind: What jar is it compiled into? Where is the source in the downloaded source tree? thanks.
Re: Failover setup (is this a bad idea)
Rather, use a master and multiple slaves, with the master only being used for writes and the slaves used for reads. Master-to-slave replication is easily configurable. Two Solr instances sharing the same index is not at all a good idea if both write to the same index.

Regards, Jayendra

On Tue, Nov 30, 2010 at 7:13 AM, Keith Pope keith.p...@inflightproductions.com wrote: Hi, I have a windows cluster that I would like to install Solr onto; there are two nodes that provide basic failover. I was thinking of this setup:

Tomcat installed as a win service
Two solr instances sharing the same index

The second instance would take over when the first fails, so you should never get two writes/reads at once. Is this a bad idea? Would I end up corrupting my index? Thx Keith
Re: Extracting and indexing content from multiple binary files into a single Solr document
The way we implemented the same scenario was to zip all the attachments into a single zip file, which can be passed to the ExtractingRequestHandler for indexing and included as part of a single Solr document.

Regards, Jayendra

On Wed, Nov 17, 2010 at 6:27 AM, Gary Taylor g...@inovem.com wrote: Hi, We're trying to use Solr to replace a custom Lucene server. One requirement we have is to be able to index the content of multiple binary files into a single Solr document. For example, a uniquely named object in our app can have multiple attached files (eg. Word, PDF etc.), and we want to index (but not store) the contents of those files in the single Solr doc for that named object. At the moment, we're issuing HTTP requests direct from ColdFusion and using the /update/extract servlet, but can only specify a single file on each request. Is the best way to achieve this to extend ExtractingRequestHandler to allow multiple binary files and thus specify our own RequestHandler, or would using the SolrJ interface directly be a better bet, or am I missing something fundamental? Thanks and regards, Gary.
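A rough sketch of that flow in SolrJ 1.4-era code: zip the attachments, then post the archive to /update/extract as one document. The file names, the docid literal and the server URL are all hypothetical:

    import java.io.*;
    import java.util.zip.*;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

    public class ZipAndExtract {
        public static void main(String[] args) throws Exception {
            // 1. zip the attachments of one logical object into a single archive
            File zip = new File("attachments-74.zip");
            ZipOutputStream out = new ZipOutputStream(new FileOutputStream(zip));
            for (String name : new String[]{"spec.doc", "contract.pdf"}) {
                out.putNextEntry(new ZipEntry(name));
                FileInputStream in = new FileInputStream(name);
                byte[] buf = new byte[8192];
                for (int n; (n = in.read(buf)) > 0; ) out.write(buf, 0, n);
                in.close();
                out.closeEntry();
            }
            out.close();

            // 2. send the archive to /update/extract as one Solr document
            SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
            ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
            req.addFile(zip);
            req.setParam("literal.docid", "74");
            req.setParam("commit", "true");
            solr.request(req);
        }
    }

Streaming the zip over http is slower than pointing stream.file at it (see the ContentStreamUpdateRequest thread elsewhere in this archive), but for a freshly built archive it is usually the simpler option.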
basic authentication for schema.url
We intend to use schema.url for indexing documents. However, the remote urls are secured and would need basic authentication to be able to access the document. The implementation with stream.file would mean downloading the files, which would cause duplication, whereas stream.body would have indexing performance issues with the huge amount of data being transferred over the network.

The current implementation for stream.url in ContentStreamBase.URLStream does not support authentication, but it could easily be supported by:

1. Passing an additional authentication parameter, e.g. stream.url.auth, with the encoded authentication value - SolrRequestParsers
2. Setting the Authorization request property on the connection - ContentStreamBase.URLStream:

    this.conn.setRequestProperty("Authorization", "Basic " + encodedAuthentication);

Any suggestions?

Regards, Jayendra
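As a standalone illustration of point 2 (not the actual patch), opening an authenticated URL stream would look roughly like this; the Base64 encoder from commons-codec is an assumed dependency:

    import java.io.InputStream;
    import java.net.URL;
    import java.net.URLConnection;
    import org.apache.commons.codec.binary.Base64;

    public class AuthenticatedUrlStream {
        public static InputStream open(String url, String user, String password) throws Exception {
            URLConnection conn = new URL(url).openConnection();
            String encoded = new String(Base64.encodeBase64((user + ":" + password).getBytes("UTF-8")));
            conn.setRequestProperty("Authorization", "Basic " + encoded); // the line proposed for URLStream
            return conn.getInputStream();
        }
    }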
Re: basic authentication for schema.url
I meant stream.url.

Regards, Jayendra

On Tue, Nov 16, 2010 at 5:37 PM, Jayendra Patil jayendra.patil@gmail.com wrote: (quoting the message above in full)
Re: Multiple Word Facets
The ShingleFilter breaks the words in a sentence into combinations of 2/3 words. For the faceting field you should use:

    <field name="facet_field" type="string" indexed="true" stored="true" multiValued="true"/>

The type of the field should be "string" so that it is not tokenised at all.

On Wed, Oct 27, 2010 at 9:12 AM, Adam Estrada estrada.a...@gmail.com wrote: Thanks guys, the solr.ShingleFilterFactory did work to get me multiple terms per facet, but now I am seeing some redundancy in the facet numbers. See below...

    Highway (62)
    Highway System (59)
    National (59)
    National Highway (59)
    National Highway System (59)
    System (59)

See what's going on here? How can I make my multi-token facets smarter so that the tokens aren't duplicated? Thanks in advance, Adam

On Tue, Oct 26, 2010 at 10:32 PM, Ahmet Arslan iori...@yahoo.com wrote: Facets are generated from indexed terms. Depending on your need/use-case: You can use an additional separate String field (which is not tokenized) for facets, populated via copyField. Search on the tokenized field, facet on the non-tokenized field. Or you can add solr.ShingleFilterFactory to your index analyzer to form multiple-word terms.

--- On Wed, 10/27/10, Adam Estrada estrada.a...@gmail.com wrote: From: Adam Estrada estrada.a...@gmail.com Subject: Multiple Word Facets To: solr-user@lucene.apache.org Date: Wednesday, October 27, 2010, 4:43 AM

All, I am new to Solr faceting and stuck on how to get multiple-word facets returned from a standard Solr query. See below for what is currently being returned.

    <lst name="facet_counts">
      <lst name="facet_queries"/>
      <lst name="facet_fields">
        <lst name="title">
          <int name="Federal">89</int>
          <int name="EFLHD">87</int>
          <int name="Eastern">87</int>
          <int name="Lands">87</int>
          <int name="Highways">84</int>
          <int name="FHWA">60</int>
          <int name="Transportation">32</int>
          <int name="GIS">22</int>
          <int name="Planning">19</int>
          <int name="Asset">15</int>
          <int name="Environment">15</int>
          <int name="Management">14</int>
          <int name="Realty">12</int>
          <int name="Highway">11</int>
          <int name="HEP">10</int>
          <int name="Program">9</int>
          <int name="HEPGIS">7</int>
          <int name="Resources">7</int>
          <int name="Roads">7</int>
          <int name="EEI">6</int>
          <int name="Environmental">6</int>
          <int name="Right">6</int>
          <int name="Way">6</int>
          ...etc...

There are many terms in there that are 2 or 3 word phrases. For example, "Eastern Federal Lands Highway Division" all gets broken down into the individual words that make up the total group of words. I've seen quite a few websites that do what it is I am trying to do here, so any suggestions at this point would be great. See my schema below (copied from the example schema).

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>

Similar for type="query". Please advise on how to group or cluster document terms so that they can be used as facets. Many thanks in advance, Adam Estrada
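Reading the string facet back from a client is then straightforward. A minimal SolrJ sketch, assuming the facet_field above is populated (via copyField or otherwise):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.FacetField;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class FacetExample {
        public static void main(String[] args) throws Exception {
            CommonsHttpSolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
            SolrQuery q = new SolrQuery("*:*");
            q.setFacet(true);
            q.addFacetField("facet_field"); // the untokenized string copy
            q.setFacetMinCount(1);
            QueryResponse rsp = solr.query(q);
            for (FacetField.Count c : rsp.getFacetField("facet_field").getValues()) {
                System.out.println(c.getName() + " (" + c.getCount() + ")");
            }
        }
    }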
Re: after the slave node pull index from master, when will solr del the tmp index dir
We faced the same issue. If you are executing a complete clean build, the slave copies the complete index and just switches the pointer in index.properties to point to the new index directory, leaving behind the old copies. And it does not clean them up.

I had logged a JIRA issue with a patch to the SnapPuller class; you may want to give it a try - https://issues.apache.org/jira/browse/SOLR-2156

Regards, Jayendra

2010/10/26 Chengyang atreey...@163.com: I noticed that the slave node has some tmp index.x dirs that were created during the index sync with the master, but they are not removed even after several days. So when will solr delete the tmp index dirs?
Re: Solr ExtractingRequestHandler with Compressed files
There was this issue with the previous version of Solr, wherein only the file names from the zip used to get indexed. We had faced the same issue and ended up using the Solr trunk, which has the Tika version upgraded and works fine. Solr version 1.4.1 should also have the fix included. Try using it.

Regards, Jayendra

On Fri, Oct 22, 2010 at 6:02 PM, Joey Hanzel phan...@nearinfinity.com wrote: Hi, Has anyone had success using ExtractingRequestHandler and Tika with any of the compressed file formats (zip, tar, gz, etc)? I am sending solr the archived.tar file using curl:

    curl "http://localhost:8983/solr/update/extract?literal.id=doc1&fmap.content=body_texts&commit=true" -H 'Content-type:application/octet-stream' --data-binary @/home/archived.tar

The result I get when I query the document is that the filenames inside the archive are indexed as the body_texts, but the content of those files is not extracted or included. This is not the behavior I expected. Ref: http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Content-Extraction-Tika#article.tika.example When I send one of the actual documents inside the archive using the same curl command, the extracted content is then stored in the body_texts field. Am I missing a step for the compressed files?

I have added all the extraction dependencies as indicated by mat in http://outoftime.lighthouseapp.com/projects/20339/tickets/98-solr-cell and am able to successfully extract data from MS Word, PDF and HTML documents. I'm using the following library versions: Solr 1.4.0, Solr Cell 1.4.1, with Tika Core 0.4. Given everything I have read, this version of Tika should support extracting data from all files within a compressed file. Any help or suggestions would be appreciated.
Re: Solr sorting problem
Need additional information. Sorting is easy in Solr, just by passing the sort parameter. However, when it comes to text sorting, it depends on how you analyse and tokenize your fields; sorting does not work on fields with multiple tokens. http://wiki.apache.org/solr/FAQ#Why_Isn.27t_Sorting_Working_on_my_Text_Fields.3F

On Thu, Oct 21, 2010 at 7:24 PM, Moazzam Khan moazz...@gmail.com wrote: Hey guys, I have a list of people indexed in Solr. I am trying to sort by their first names, but I keep getting results that are not alphabetically sorted (I see the names starting with W before the names starting with A). I have a feeling that the results are first being sorted by relevancy and then sorted by first name. Is there a way I can get the results to be sorted alphabetically? Thanks, Moazzam
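The usual fix is to copyField the name into an untokenized string field and sort on that. A minimal SolrJ sketch, where first_name_sort is a hypothetical single-token string copy of the first name:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

    public class SortExample {
        public static void main(String[] args) throws Exception {
            CommonsHttpSolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
            SolrQuery q = new SolrQuery("*:*");
            // sort on the single-token copy, not the tokenized text field
            q.addSortField("first_name_sort", SolrQuery.ORDER.asc);
            System.out.println(solr.query(q).getResults());
        }
    }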
Re: /update/extract
The extract request handler invokes the classes from the extraction package: https://svn.apache.org/repos/asf/lucene/dev/trunk/solr/contrib/extraction/src/main/java/org/apache/solr/handler/extraction/ExtractingRequestHandler.java This is packaged into the apache-solr-cell jar.

Regards, Jayendra

On Thu, Aug 19, 2010 at 10:04 AM, satya swaroop sswaro...@gmail.com wrote: Hi all, when we use the extract request handler, what class gets invoked? I need to know the navigation of classes when we send any files to solr. Can anybody tell me the classes, or any sources where I can get the answer? Or can anyone tell me what classes get invoked when we start solr? I would be thankful if anybody can help me with this. Regards, satya
Re: How to compile nightly build?
Yup. The nightly build you pointed out has pre-built code and does not include the lucene and module sources needed for compilation. In case you want to compile from the source, you can check out the code from https://svn.apache.org/repos/asf/lucene/dev/trunk

There are 3 folders - solr, lucene and modules. If you are making changes in any of the folders:

From the modules folder execute - ant compile
From the lucene folder execute - ant dist
From the solr folder execute - ant dist

This would require JDK 1.6.

Regards, Jayendra

On Fri, Aug 13, 2010 at 7:11 PM, Chris Hostetter hossman_luc...@fucit.org wrote: The nightly test artifacts don't currently contain everything needed to recompile the sources; this is a known issue: https://issues.apache.org/jira/browse/SOLR-1989 ...if you want to compile from source off the trunk or 3x branch, you need to check out the *entire* branch (not just the solr directory, but its parent, including lucene and the modules). This is the problem with the source in the nightly artifacts at the moment: they only include the solr source. -Hoss
Re: diacritics on query string
ASCIIFoldingFilter is probably the filter known to replace the accented chars with normal ones. However, I don't see it in your config. You can easily debug the issue through the solr analysis tool.

Regards, Jayendra

On Fri, Aug 13, 2010 at 3:20 AM, Andrea Gazzarini andrea.gazzar...@atcult.it wrote: Hi, I have a problem regarding a diacritic character in my query string: q=intertestualità, which is encoded as q=intertestualit%E0. What I'm not understanding is the following query response fragments:

    <lst name="responseHeader">
      <int name="status">0</int>
      <int name="QTime">23</int>
      <lst name="params">
        <str name="sort">score desc</str>
        <str name="fl">score,title</str>
        <str name="debugQuery">on</str>
        <str name="indent">on</str>
        <str name="start">0</str>
        <str name="q">intertestualit</str>
        <str name="version">2.2</str>
        <str name="rows">3</str>
      </lst>

and

    <lst name="debug">
      <str name="rawquerystring">intertestualit</str>
      <str name="querystring">intertestualit</str>

I saw that my index contains the token "intertestualita" (with the 'à' char replaced with 'a'). Indeed, if I query for "intertestualita" I find my results. The queried field is configured with the same chain:

    <fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="schema.UnicodeNormalizationFilterFactory" version="icu4j" composed="false" remove_diacritics="true" remove_modifiers="true" fold="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="schema.UnicodeNormalizationFilterFactory" version="icu4j" composed="false" remove_diacritics="true" remove_modifiers="true" fold="true"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldtype>

So my question is: who is removing the à (%E0) character from the input query? It seems that the query arrives at SOLR already without that character...

Regards, Andrea
Re: Hierarchical faceting
We were able to get hierarchical faceting working with a workaround approach, e.g. if you have Europe//Norway//Oslo as an entry:

1. Create a new multivalued field with string type:

    <field name="country_facet" type="string" indexed="true" stored="true" multiValued="true"/>

2. Index the field for Europe//Norway//Oslo with the values:

    0//Europe
    1//Europe//Norway
    2//Europe//Norway//Oslo

3. The facet can now be used in queries:

1st level - would return all entries @ the 1st level, e.g. 0//USA, 0//Europe:
f.country_facet.facet.prefix=0//&facet.field=country_facet

2nd level - would return all entries @ the second level in Europe, e.g. 1//Europe//Norway, 1//Europe//Sweden:
fq=country_facet:0//Europe&f.country_facet.facet.prefix=1//Europe&facet.field=country_facet

3rd level - would return the entries under 1//Europe//Norway, e.g. 2//Europe//Norway//Oslo:
fq=country_facet:1//Europe//Norway&f.country_facet.facet.prefix=2//Europe//Norway&facet.field=country_facet

Increment the facet.prefix by 1 so that you limit the facet results to that prefix. This also works for any depth.

Regards, Jayendra

On Thu, Aug 12, 2010 at 6:01 PM, Mats Bolstad mat...@stud.ntnu.no wrote: Hey all, I am doing a search on hierarchical data, and I have a hard time getting my head around the following problem. I want a result as follows, in one single query only:

    USA (3)
      California (2)
      Arizona (1)
    Europe (4)
      Norway (3)
        Oslo (3)
      Sweden (1)

How it looks in the XML/JSON response is not really important, this is more a presentation issue. I guess I could store the values "USA", "USA/California", "Europe/Norway/Oslo" as strings for each document, and do some JavaScript-ing to show the hierarchies appropriately. When a specific item in the facet is selected, for example Norway, Solr could be queried with a filter query on Europe/Norway*? Does anyone have some experiences they could please share with me? I have tried out SOLR-64, and it gives me the results I look for. However, I do not have the opportunity to use a patch in the production environment ...

-- Thanks, Mats Bolstad
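For reference, the 3rd-level query above translates to SolrJ like this (a sketch, drilling from Europe//Norway down one level; the server URL is assumed):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class HierarchyFacet {
        public static void main(String[] args) throws Exception {
            CommonsHttpSolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
            SolrQuery q = new SolrQuery("*:*");
            q.addFilterQuery("country_facet:\"1//Europe//Norway\""); // restrict to docs under Norway
            q.setFacet(true);
            q.addFacetField("country_facet");
            q.set("f.country_facet.facet.prefix", "2//Europe//Norway"); // only the next level down
            QueryResponse rsp = solr.query(q);
            System.out.println(rsp.getFacetField("country_facet").getValues());
        }
    }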
Re: edismax pf2 and ps
We pretty much had the same issue and ended up customizing the ExtendedDismax code. In your case it's just a change of a single line, from:

    addShingledPhraseQueries(query, normalClauses, phraseFields2, 2, tiebreaker, pslop);

to:

    addShingledPhraseQueries(query, normalClauses, phraseFields2, 2, tiebreaker, 0);

Regards, Jayendra

On Thu, Aug 12, 2010 at 1:04 PM, Ron Mayer r...@0ape.com wrote: Short summary: Is there any way I can specify that I want a lot of phrase slop for the pf parameter, but none at all for the pf2 parameter?

I find the 'pf' parameter with a pretty large 'ps' does a very nice job of providing a modest boost to many documents that are quite well related to many queries in my system. In contrast, I find the 'pf2' parameter with zero 'ps' does extremely well at providing a high boost to documents that are often exactly what someone's searching for. Is there any way I can get both effects?

Edismax's pf2 parameter is really nice for boosting exact phrases in queries like 'black jacket red cap white shoes'. But as soon as even a little phrase slop (ps) is added, it seems like it starts boosting documents with red jackets and white caps just as much as those with black jackets and red caps. My gut feeling is that if I could have pf with a large phrase slop and pf2 with zero phrase slop, it'd give me better overall results than any single phrase slop setting that gets applied to both. Is there any good way for me to test that? Thanks, Ron
Re: PDF file
Try:

    curl "http://lhcinternal.nlm.nih.gov:8989/solr/lhc/update/extract?stream.file=Full_Path_of_File/pub2009001.pdf&literal.id=777045&commit=true"

stream.file - specify the full path
literal.* params - specify any extra params if needed

Regards, Jayendra

On Tue, Aug 10, 2010 at 4:49 PM, Ma, Xiaohui (NIH/NLM/LHC) [C] xiao...@mail.nlm.nih.gov wrote: Thanks so much for your help! I tried to index a pdf file and got the following. The command I used is:

    curl 'http://lhcinternal.nlm.nih.gov:8989/solr/lhc/update/extract?map.content=text&map.stream_name=id&commit=true' -F file=@pub2009001.pdf

Did I do something wrong? Do I need to modify anything in schema.xml or another configuration file?

    [xiao...@lhcinternal lhc]$ curl 'http://lhcinternal.nlm.nih.gov:8989/solr/lhc/update/extract?map.content=text&map.stream_name=id&commit=true' -F file=@pub2009001.pdf
    <html>
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>
    <title>Error 404 </title>
    </head>
    <body><h2>HTTP ERROR: 404</h2><pre>NOT_FOUND</pre>
    <p>RequestURI=/solr/lhc/update/extract</p><p><i><small><a href="http://jetty.mortbay.org/">Powered by Jetty://</a></small></i></p>
    </body>
    </html>

***

-Original Message- From: Sharp, Jonathan [mailto:jsh...@coh.org] Sent: Tuesday, August 10, 2010 4:37 PM To: solr-user@lucene.apache.org Subject: RE: PDF file

Xiaohui, You need to add the following jars to the lib subdirectory of the solr config directory on your server (paths inside the solr 1.4.1 download): /dist/apache-solr-cell-1.4.1.jar plus all the jars in /contrib/extraction/lib HTH -Jon

From: Ma, Xiaohui (NIH/NLM/LHC) [C] [xiao...@mail.nlm.nih.gov] Sent: Tuesday, August 10, 2010 11:57 AM To: 'solr-user@lucene.apache.org' Subject: RE: PDF file

Does anyone have any experience with PDF files? I really appreciate your help! Thanks so much in advance.

-Original Message- From: Ma, Xiaohui (NIH/NLM/LHC) [C] Sent: Tuesday, August 10, 2010 10:37 AM To: 'solr-user@lucene.apache.org' Subject: PDF file

I have a lot of pdf files. I am trying to import pdf files to solr and index them. I added ExtractingRequestHandler to solrconfig.xml. Please tell me if I need to download some jar files. In the Solr 1.4 Enterprise Search Server book, the following command is used to import a mccm.pdf:

    curl 'http://localhost:8983/solr/solr-home/update/extract?map.content=text&map.stream_name=id&commit=true' -F file=@mccm.pdf

Please tell me if there is a way to import pdf files from a directory. Thanks so much for your help!
Re: Setting up apache solr in eclipse with Tomcat
Have got solr working in Eclipse and deployed on Tomcat through the Eclipse plugin. The crude approach was to:

1. Import the Solr war into Eclipse, which will be imported as a web project and can be deployed on tomcat.
2. Add multiple source folders to the project, linked to the checked-out solr source code, e.g. an entry in the .project file:

    <linkedResources>
      <link>
        <name>common</name>
        <type>2</type>
        <location>D:/Solr/solr/src/common</location>
      </link>
      ...
    </linkedResources>

3. Remove the solr jars from the WEB-INF lib, so that changes to the project sources can be deployed and debugged.

Let me know if you find a better approach.

On Wed, Aug 4, 2010 at 3:49 AM, Hando420 hando...@gmail.com wrote: I would like to setup apache solr in eclipse using tomcat. It is easy to setup with jetty but with tomcat it doesn't run solr at runtime. Has anyone done this before? Hando -- View this message in context: http://lucene.472066.n3.nabble.com/Setting-up-apache-solr-in-eclipse-with-Tomcat-tp1021673p1021673.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Setting up apache solr in eclipse with Tomcat
The solr home is configured in the web.xml of the application, which points to the folder holding the conf files and the data directory:

    <env-entry>
      <env-entry-name>solr/home</env-entry-name>
      <env-entry-value>D:/multicore</env-entry-value>
      <env-entry-type>java.lang.String</env-entry-type>
    </env-entry>

Regards, Jayendra

On Wed, Aug 4, 2010 at 12:21 PM, Hando420 hando...@gmail.com wrote: Thanks man, I haven't tried this, but where do I put that xml configuration? Is it in the web.xml in solr? Cheers, Hando -- View this message in context: http://lucene.472066.n3.nabble.com/Setting-up-apache-solr-in-eclipse-with-Tomcat-tp1021673p1023188.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solrj ContentStreamUpdateRequest Slow
ContentStreamUpdateRequest seems to read the file contents and transfer them over http, which slows down the indexing. Try using StreamingUpdateSolrServer with the stream.file param @ http://wiki.apache.org/solr/SolrPerformanceFactors#Embedded_vs_HTTP_Post e.g.

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
    import org.apache.solr.client.solrj.request.UpdateRequest;
    import org.apache.solr.common.params.ModifiableSolrParams;

    SolrServer server = new StreamingUpdateSolrServer(solrServerUrl, 20, 8);
    UpdateRequest req = new UpdateRequest("/update/extract");
    ModifiableSolrParams params = new ModifiableSolrParams();
    params.add("stream.file", localFilePath); // the file is read by the Solr server, not streamed over http
    params.set("literal.id", value);
    req.setParams(params);
    server.request(req);
    server.commit();

Regards, Jayendra

On Wed, Aug 4, 2010 at 3:01 PM, Tod listac...@gmail.com wrote: I'm running a slight variation of the example code referenced below and it takes a really long time to finally execute. In fact it hangs for a long time at solr.request(up) before finally executing. Is there anything I can look at or tweak to improve performance? I am also indexing a local pdf file, there are no firewall issues, solr is running on the same machine, and I tried the actual host name in addition to localhost, but nothing helps. Thanks - Tod http://wiki.apache.org/solr/ContentStreamUpdateRequestExample
Re: query about qf defaults
You can use "appends" for any additional fq parameters, which would be appended to the ones passed at query time. Check out the sample solrconfig.xml shipped with solr:

    <!-- In addition to defaults, "appends" params can be specified to identify
         values which should be appended to the list of multi-val params from
         the query (or the existing "defaults"). In this example, the param
         "fq=instock:true" will be appended to any query time fq params the
         user may specify, as a mechanism for partitioning the index,
         independent of any user selected filtering that may also be desired
         (perhaps as a result of faceted searching).
         NOTE: there is *absolutely* nothing a client can do to prevent these
         "appends" values from being used, so don't use this mechanism unless
         you are sure you always want it. -->
    <lst name="appends">
      <str name="fq">inStock:true</str>
    </lst>

Regards, Jayendra

On Tue, Aug 3, 2010 at 8:25 AM, Robert Neve robert.n...@gmx.co.uk wrote: Hi, I have in my solr config file the code below to create a default for fq, which works great. The problem I have is that if I want to use a custom fq, this one gets overwritten. Is there a way I can have it keep this fq alongside other custom ones? Basically this field sets whether the person is to show up or not, so it's important that anyone set to "d" is never shown regardless of any other query filters.

    <lst name="defaults">
      <str name="fq">ss_cck_field_status:d </str>

thanks in advance for any help Robert
QueryUtils API Change - Custom ExtendedDismaxQParserPlugin accessing QueryUtils.makeQueryable throws java.lang.IllegalAccessError
We have a custom implementation of ExtendedDismaxQParserPlugin, which we bundle into a jar and expose in the multicore shared lib. The custom ExtendedDismaxQParserPlugin implementation still uses the QueryUtils makeQueryable method, the same as the stock ExtendedDismaxQParserPlugin implementation. However, the method call throws a java.lang.IllegalAccessError, as it is being called from the inner ExtendedSolrQueryParser class and makeQueryable has no access modifier (basically package-private default).

Any reason for having it with the default access modifier, or any plans to make it public?

Regards, Jayendra
Document Boost with Solr Extraction - SolrContentHandler
We are using the Solr extract handler (/update/extract) for indexing document metadata with attachments. However, the SolrContentHandler doesn't seem to support the index-time document boost attribute. Probably document.setDocumentBoost(Float.parseFloat(boost)) is missing.

Regards, Jayendra