Distributed search component.
Hi all, I am trying to create a distributed search component in Solr, which is quite difficult (at least for me, because I am new to Solr and Java). Anyway, I have looked into the Solr source (FacetComponent, TermsComponent, ...) and created my own search component (it extends SearchComponent), but I still have two questions (for now):

1.) In the prepare method I have the following code:

    String shards = params.get(ShardParams.SHARDS);
    if (shards != null) {
      List<String> lst = StrUtils.splitSmart(shards, ",", true);
      rb.shards = lst.toArray(new String[lst.size()]);
      rb.isDistrib = true;
    }

If I remove the rb.isDistrib = true; line, the distributed methods are not called. But to set isDistrib my code must live in the org.apache.solr.handler.component package (because the field is not visible from outside it). Is this the correct procedure/behaviour/design?

2.) The methods (process, distributedProcess, handleResponses, ...) are all called properly. I can read partial responses in handleResponses, but I don't know how to build the final response. I see that, for example, TermsComponent has a helper in the ResponseBuilder which collects all the terms. Is this the only way (editing the ResponseBuilder source), or can I achieve that without editing Solr's source?

Many thanks,
Rok
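For question 2, one pattern that avoids touching ResponseBuilder is to stash per-request state in rb.req.getContext(). Below is a minimal sketch against the 1.4/3.x-era API; the class name, the "my.merged" context key and the "mydata" response key are invented for illustration, and the class sits in org.apache.solr.handler.component only because, as noted above, isDistrib is package-private there:

    package org.apache.solr.handler.component; // rb.isDistrib is package-private here

    import java.io.IOException;
    import java.util.List;

    import org.apache.solr.common.params.ShardParams;
    import org.apache.solr.common.params.SolrParams;
    import org.apache.solr.common.util.NamedList;
    import org.apache.solr.common.util.StrUtils;

    public class MyDistributedComponent extends SearchComponent {

      @Override
      public void prepare(ResponseBuilder rb) throws IOException {
        SolrParams params = rb.req.getParams();
        String shards = params.get(ShardParams.SHARDS);
        if (shards != null) {
          List<String> lst = StrUtils.splitSmart(shards, ",", true);
          rb.shards = lst.toArray(new String[lst.size()]);
          rb.isDistrib = true;
        }
      }

      @Override
      public void process(ResponseBuilder rb) throws IOException {
        // single-node (or per-shard) work goes here
      }

      @Override
      public int distributedProcess(ResponseBuilder rb) throws IOException {
        // fan custom ShardRequests out to the shards here at whatever
        // stage your component needs, e.g. rb.addRequest(this, sreq)
        return ResponseBuilder.STAGE_DONE;
      }

      @Override
      @SuppressWarnings("unchecked")
      public void handleResponses(ResponseBuilder rb, ShardRequest sreq) {
        // accumulate partial results in the per-request context map
        // instead of adding a helper field to ResponseBuilder
        NamedList<Object> merged =
            (NamedList<Object>) rb.req.getContext().get("my.merged");
        if (merged == null) {
          merged = new NamedList<Object>();
          rb.req.getContext().put("my.merged", merged);
        }
        for (ShardResponse srsp : sreq.responses) {
          merged.add(srsp.getShard(),
                     srsp.getSolrResponse().getResponse().get("mydata"));
        }
      }

      @Override
      public void finishStage(ResponseBuilder rb) {
        // once the last stage completes, attach the merged result
        if (rb.stage == ResponseBuilder.STAGE_GET_FIELDS) {
          Object merged = rb.req.getContext().get("my.merged");
          if (merged != null) {
            rb.rsp.add("mydata", merged);
          }
        }
      }

      @Override
      public String getDescription() { return "example distributed component"; }
      @Override
      public String getSource() { return "$URL$"; }
      @Override
      public String getSourceId() { return "$Id$"; }
      @Override
      public String getVersion() { return "$Revision$"; }
    }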
Re: Faceting on multivalued field
Are you implying a change to the DB query of the nested entity which fetches the comments (the query is in my post), or can something be done during indexing, e.g. using Transformers? Thanks, Kaushik

On Mon, Apr 4, 2011 at 8:07 AM, Erick Erickson erickerick...@gmail.com wrote:

Why not count them on the way in and just store that number along with the original e-mail?

Best,
Erick

On Sun, Apr 3, 2011 at 10:10 PM, Kaushik Chakraborty kaych...@gmail.com wrote:

Ok. My expectation was that since comment_post_id is a multiValued field it would appear multiple times (i.e. once per comment), and hence when I facet on that field it would give me the count of the documents in which each comment_post_id value appears. My requirement is getting a total for every document, i.e. finding the number of comments per post across the whole corpus. To explain it more clearly, I'm getting a result XML something like this:

    <str name="post_id">46</str>
    <str name="post_text">Hello World</str>
    <str name="person_id">20</str>
    <arr name="comment_id"><str>9</str><str>10</str></arr>
    <arr name="comment_person_id"><str>19</str><str>2</str></arr>
    <arr name="comment_post_id"><str>46</str><str>46</str></arr>
    <arr name="comment_text"><str>Hello - from World</str><str>Hi</str></arr>
    <lst name="facet_fields">
      <lst name="comment_post_id">
        <int name="46">1</int>

I need the count to be 2, as post 46 has 2 comments. What other way can I approach this? Thanks, Kaushik

On Mon, Apr 4, 2011 at 4:29 AM, Erick Erickson erickerick...@gmail.com wrote:

Hmmm, I think you're misunderstanding faceting. It counts the number of documents that have a particular value. So if you're faceting on comment_post_id, there is one and only one document with that value (assuming that the comment_post_ids are unique), which is what's being reported. This will be quite expensive on a large corpus, BTW. Is your task to show the totals for *every* document in your corpus, or just the ones in a display page? Because if the latter, your app could just count up the number of elements in the XML returned for the multiValued comments field. If that's not relevant, could you explain a bit more why you need this count?

Best,
Erick

On Sun, Apr 3, 2011 at 2:31 PM, Kaushik Chakraborty kaych...@gmail.com wrote:

Hi, my index contains a root entity Post and a child entity Comments. Each post can have multiple comments. data-config.xml:

    <document>
      <entity name="posts" transformer="TemplateTransformer" dataSource="jdbc" query="...">
        <field column="post_id"/>
        <field column="post_text"/>
        <field column="person_id"/>
        <entity name="comments" dataSource="jdbc"
                query="select * from comments where post_id = ${posts.post_id}">
          <field column="comment_id"/>
          <field column="comment_text"/>
          <field column="comment_person_id"/>
          <field column="comment_post_id"/>
        </entity>
      </entity>
    </document>

The schema has all columns of the comment entity as multiValued fields, and all fields are indexed and stored. My requirement is to count the number of comments for each post. The approach I'm taking is to query on *:* and facet the result on comment_post_id so that it gives the count of comments for that post. But I'm getting an incorrect result: e.g. if a post has 2 comments, the multivalued fields are populated alright but the facet count comes back as 1 (for that post_id). What else do I need to do? Thanks, Kaushik
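A sketch of what Erick's count-them-on-the-way-in suggestion might look like in data-config.xml; the scalar subquery and the comment_count column are assumptions for illustration, not from the thread, and comment_count would also need a matching (single-valued) field in schema.xml:

    <entity name="posts" dataSource="jdbc"
            query="select p.*,
                          (select count(*) from comments c
                            where c.post_id = p.post_id) as comment_count
                   from posts p">
      <field column="post_id"/>
      <field column="post_text"/>
      <field column="person_id"/>
      <field column="comment_count"/>
      <!-- nested comments entity unchanged -->
    </entity>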
Using MLT feature
Hi,

I would like to hear your opinion about the MLT feature and whether it's a good solution for what I need to implement. My index has fields like headline, body and medianame. What I need to do is, before adding a new doc, verify whether a similar doc already exists for this media. My idea is to use the MoreLikeThisHandler (http://wiki.apache.org/solr/MoreLikeThisHandler) in the following way: for each new doc, perform an MLT search with q=medianame and stream.body=headline+bodytext. If no similar docs are found, then I can safely add the doc.

Is this feasible using the MLT handler? Is it a good approach? Is there a better way to perform this comparison?

Thank you for your help.

Best regards,
Frederico Azeiteiro
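A sketch of what such a request might look like; the /mlt handler path, the field names and the thresholds are assumptions, and since the MLT handler treats a posted content stream as the source document, the medianame restriction is expressed here as an fq filter rather than as q:

    http://localhost:8983/solr/mlt?mlt.fl=headline,body
        &fq=medianame:"Some Media"
        &stream.body=headline+and+body+text+of+the+new+document
        &mlt.mintf=1&mlt.mindf=1&rows=1

Zero hits would then mean the new document can safely be added.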
Re: Using MLT feature
Do you want to not index if something is similar, or not index if it's exact? I would look into a hash code of the document if you don't want to index exact duplicates. Similar, though, I think has to be based off a document in the index.

On Apr 4, 2011, at 5:16, Frederico Azeiteiro frederico.azeite...@cision.com wrote:
Mongo REST interface and full data import
Hi everyone,

I'm trying to make a simple data import from MongoDB into Solr using the REST interface. As a test example I've created a schema.xml like:

    <?xml version="1.0" ?>
    ... isbn ... title ...

and a data-import.xml as: ...

Unfortunately it's not working and I'm stuck at this point. Could you please advise how to correctly parse JSON-format data? The data looks like:

    {
      "offset": 0,
      "rows": [
        { "_id": { "$oid": "4d9829412c8bd1064400" }, "isbn": "716739356", "title": "Proteins", "description": "" },
        { "_id": { "$oid": "4d9829412c8bd1064401" }, "isbn": "144433056X", "title": "How to Assess Doctors and Health Professionals", "description": "" },
        { "_id": { "$oid": "4d9829412c8bd1064402" }, "isbn": "1406208159", "title": "Freestyle: Time Travel Guides: Pack B", "description": "Takes you on a trip through history to visit the great ancient civilisations." }
      ],
      "total_rows": 3,
      "query": {},
      "millis": 0
    }

Thank you.
RE: Using MLT feature
Hi,

The idea is: don't index if something similar (headline+bodytext) exists for the same exact medianame. Do you mean I would need to index the doc first (maybe in a temp index) and then use the MLT feature to find similar docs before adding it to the final index?

Thanks,
Frederico

-----Original Message-----
From: Chris Fauerbach [mailto:chris.fauerb...@gmail.com]
Sent: segunda-feira, 4 de Abril de 2011 10:22
To: solr-user@lucene.apache.org
Subject: Re: Using MLT feature

Do you want to not index if something is similar, or not index if it's exact? I would look into a hash code of the document if you don't want to index exact duplicates. Similar, though, I think has to be based off a document in the index.
Re: Spellchecking Escaped Queries
Thanks Chris,

The field used for indexing and spellcheck is the same and is configured like this:

    <fieldType name="title" stored="true" indexed="true" multiValued="false" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.PatternReplaceFilterFactory" pattern="^([^!]+)\!([^!]+)$" replacement="$1i$2" replace="all"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="0" catenateAll="1" splitOnCaseChange="1" preserveOriginal="1"/>
        <filter class="solr.ASCIIFoldingFilterFactory"/>
      </analyzer>
    </fieldType>

I use the pattern replace filter to swap all instances of ! within a word to i. I know this part is working correctly, as performing a search works correctly. The spellcheck is initialized like this:

    <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
      <str name="queryAnalyzerFieldType">title</str>
      <lst name="spellchecker">
        <str name="name">default</str>
        <str name="field">searchfield</str>
        <str name="spellcheckIndexDir">./spellchecker</str>
        <str name="buildOnCommit">false</str>
      </lst>
    </searchComponent>

and is attached as a component to my search handler.

Thanks,
Colin

: I'm having an issue performing a spellcheck on some information and
: a search of the archive isn't helping.

For this type of question, there's not much feedback anyone can offer without knowing exactly what analyzers you have configured for the various fieldtypes (both the field you index/search and the fieldtype used for spellchecking). It's also fairly critical to know how you have the spellcheck component configured. Off the cuff, I'd guess that maybe WordDelimiterFilter is being used in a wonky way given your usecase -- but like I said, I would need to see the configs to make a guess.

-Hoss
Re: Spellchecking Escaped Queries
Apologies for the duplicate post. I'm having Evolution problems.

Thanks,
Colin
Re: Using MLT feature
http://wiki.apache.org/solr/Deduplication

On Monday 04 April 2011 11:34:52 Frederico Azeiteiro wrote:

Hi, the idea is: don't index if something similar (headline+bodytext) exists for the same exact medianame. Do you mean I would need to index the doc first (maybe in a temp index) and then use the MLT feature to find similar docs before adding it to the final index?

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
help with Jetty log message
Greetings all,

I am currently using Solr as the backend behind a log aggregation and search system my team is developing. All was well and good until I noticed a test server crashing quite unexpectedly. We'd like to dig more into the incident, but none of us has much experience with JVM crash logs, not to mention that our Java is very rusty. The crash log is attached. Could anyone help us understand what went wrong there? Also, would it be possible and/or wise to automatically restart the server in case of such a crash?

Thanks for your help. If you need any extra info about that case, do not hesitate to ask!

Matthieu Huin

    #
    # A fatal error has been detected by the Java Runtime Environment:
    #
    # SIGSEGV (0xb) at pc=0x7f051a618105, pid=5033, tid=1092958544
    #
    # JRE version: 6.0_18-b18
    # Java VM: OpenJDK 64-Bit Server VM (16.0-b13 mixed mode linux-amd64 )
    # Derivative: IcedTea6 1.8.3
    # Distribution: Debian GNU/Linux 5.0.8 (lenny), package 6b18-1.8.3-2~lenny1
    # Problematic frame:
    # V [libjvm.so+0x5dc105]
    #
    # If you would like to submit a bug report, please include
    # instructions how to reproduce the bug and visit:
    # http://icedtea.classpath.org/bugzilla
    #

    --------------- T H R E A D ---------------

    Current thread (0x0207d800): GCTaskThread [stack: 0x41153000,0x41254000] [id=5036]

    siginfo: si_signo=SIGSEGV, si_errno=0, si_code=128 (), si_addr=0x

    [register, top-of-stack and instruction byte dumps omitted]

    Stack: [0x41153000,0x41254000], sp=0x41252eb0, free space=3ff0018k
    Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
    V [libjvm.so+0x5dc105]
    V [libjvm.so+0x5e0b78]
    V [libjvm.so+0x378a10]
    V [libjvm.so+0x592090]

    --------------- P R O C E S S ---------------

    Java Threads: ( => current thread )
      0x0540f000 JavaThread "btpool0-12" [_thread_blocked, id=6839, stack(0x42623000,0x42724000)]
      0x0234a800 JavaThread "btpool0-11" [_thread_blocked, id=6796, stack(0x42522000,0x42623000)]
      0x02754000 JavaThread "btpool0-10" [_thread_blocked, id=6761, stack(0x42421000,0x42522000)]
      0x0246e800 JavaThread "TimeLimitedCollector timer thread" daemon [_thread_blocked, id=5307, stack(0x4232,0x42421000)]
      0x02317800 JavaThread "MultiThreadedHttpConnectionManager cleanup" daemon [_thread_blocked, id=5306,
Re: help with Jetty log message
This is not Solr crashing, per se; it is your JVM. I personally haven't had much success debugging these kinds of failures. See whether it happens again, and if it does, try updating your JVM, switching to another one, etc. Anyone have better advice?

Upayavira

On Mon, 04 Apr 2011 11:59 +0200, Matthieu Huin matthieu.h...@wallix.com wrote:

Email had 1 attachment: hs_err_pid5033.log 26k (text/x-log)

---
Enterprise Search Consultant at Sourcesense UK,
Making Sense of Open Source
RE: Using MLT feature
Thank you Markus, it looks great. But the wiki is not very detailed on this. Do you mean that if I:

1. Create:

    <updateRequestProcessorChain name="dedupe">
      <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
        <bool name="enabled">true</bool>
        <bool name="overwriteDupes">false</bool>
        <str name="signatureField">signature</str>
        <str name="fields">headline,body,medianame</str>
        <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
      </processor>
      <processor class="solr.LogUpdateProcessorFactory" />
      <processor class="solr.RunUpdateProcessorFactory" />
    </updateRequestProcessorChain>

2. Make this chain the default for update requests.

3. Add an indexed signature field to my schema.

Then, when adding a new doc to my index, it is only added if not considered a duplicate, using a Lookup3Signature over the fields defined? All duplicates are ignored and not added to my index? Is it as simple as that? Does it work even if the medianame must be an exact match (not a similar match, as the headline and bodytext are)?

Thank you for your help,
Frederico Azeiteiro
Developer

-----Original Message-----
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: segunda-feira, 4 de Abril de 2011 10:48
To: solr-user@lucene.apache.org
Subject: Re: Using MLT feature

http://wiki.apache.org/solr/Deduplication
Re: Mongo REST interface and full data import
I'm having trouble seeing your schema files, etc. I don't know if gmail is stripping them on my end or whether your e-mail client is stripping them on upload; anyone else seeing this?

But to your question: what version are you using? Solr 3.1 is the first version with JSON support for updates. See: http://wiki.apache.org/solr/UpdateJSON

Best,
Erick

On Mon, Apr 4, 2011 at 5:31 AM, andrew_s sharov1...@gmail.com wrote:
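For reference, the 3.1 JSON update format the wiki describes takes add commands rather than Mongo's REST envelope, so the offset/rows/total_rows wrapper and the $oid ids need to be stripped or mapped first. A sketch of what the reshaped body might look like, POSTed with Content-type: application/json to /solr/update/json (assuming isbn serves as the unique key):

    {
      "add": { "doc": { "isbn": "716739356",  "title": "Proteins", "description": "" } },
      "add": { "doc": { "isbn": "144433056X", "title": "How to Assess Doctors and Health Professionals", "description": "" } },
      "add": { "doc": { "isbn": "1406208159", "title": "Freestyle: Time Travel Guides: Pack B",
                        "description": "Takes you on a trip through history to visit the great ancient civilisations." } },
      "commit": {}
    }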
RE: Faceting on multivalued field
Is there a kind of function query that can count the number of values in a multi-valued field on a given document? I do not know.

________________________________
From: Erick Erickson [erickerick...@gmail.com]
Sent: Sunday, April 03, 2011 10:37 PM
To: solr-user@lucene.apache.org
Subject: Re: Faceting on multivalued field

Why not count them on the way in and just store that number along with the original e-mail?

Best,
Erick
Re: Solrj performance bottleneck
Hi All,

I just want to share some findings which clearly identified the reason for our performance bottleneck. We had looked into several areas for optimization, mostly directed at Solr configuration, stored fields, highlighting, the JVM, OS cache, etc. But it turned out that the main culprit was elsewhere. We were using the terms component for auto-suggestion, and while examining the Firebug output for the time taken during searches, we detected that multiple requests were being spawned for autosuggestion as we typed in the keyword to search (one request per character typed), and this in turn cost us great delay in getting the search results. Once we turned auto-suggestion off, performance was remarkably better and came down to a second or so (compared to the 8-10 seconds registered earlier).

If anybody has suggestions/experience on how to leverage autosuggestion without affecting search performance much, please do share them. Once again, thanks for your inputs in analyzing our issues.

Thanks,
RE: Using MLT feature
Hi again,

I guess I was wrong in my earlier post... There's no automated way to avoid indexing the duplicate doc. I guess I have 2 options:

1. Create a temp index with signatures and then have an app that, for each new doc, verifies if the sig exists in my primary index. If not, add the article.

2. Before adding the doc, create a signature (using the same algorithm that Solr uses) in my indexing app and then verify if the signature exists before adding.

Am I thinking the right way here? :)

Thank you,
Frederico

-----Original Message-----
From: Frederico Azeiteiro [mailto:frederico.azeite...@cision.com]
Sent: segunda-feira, 4 de Abril de 2011 11:59
To: solr-user@lucene.apache.org
Subject: RE: Using MLT feature
Re: Using MLT feature
> Hi again,
> I guess I was wrong in my earlier post... There's no automated way to
> avoid indexing the duplicate doc.

Yes there is: set overwriteDupes to true and documents yielding the same signature will be overwritten. If you need both fuzzy and exact matching, then add a second update processor inside the chain and create another signature field.

> I guess I have 2 options:
>
> 1. Create a temp index with signatures and then have an app that, for
> each new doc, verifies if the sig exists in my primary index. If not,
> add the article.
>
> 2. Before adding the doc, create a signature (using the same algorithm
> that Solr uses) in my indexing app and then verify if the signature
> exists before adding.
>
> Am I thinking the right way here? :)
>
> Thank you,
> Frederico
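For completeness, a sketch of the wiring for steps 2 and 3 from the earlier message; in 1.4-era configs the parameter is update.processor (later releases renamed it update.chain), and the field definition is just one plausible shape for the signature field:

    <!-- solrconfig.xml: run the dedupe chain on every update by default -->
    <requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
      <lst name="defaults">
        <str name="update.processor">dedupe</str>
      </lst>
    </requestHandler>

    <!-- schema.xml -->
    <field name="signature" type="string" stored="true" indexed="true" multiValued="false"/>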
Re: Solrj performance bottleneck
Dear Rahul,

Stefan has the right solution. The autosuggest must be handled both in JavaScript and in your backend. For JavaScript there are some really nice tools to do that, such as jQuery, whose autocomplete implements a tunable delay. It also has highlighting, you can add additional information, etc. It is actually quite impressive. Here is the address: http://jqueryui.com/demos/autocomplete/#remote-jsonp. It's open source, so you can just copy what they have done or see the method they used.

For the backend, limit the number of requests per second per IP or session, and/or cache results. As for caching, Solr normally caches common requests, but I don't know about the terms component.

Hope this helps you!

Victor

2011/4/4 Stefan Matheis matheis.ste...@googlemail.com:

rahul,

On Mon, Apr 4, 2011 at 4:18 PM, rahul asharud...@gmail.com wrote:

> if anybody has some suggestions/experience on how to leverage
> autosuggestion without affecting search performance much, please do
> share them.

We use javascript intervals for autosuggestion: regularly check the value of the monitored input field and, if it changed, trigger a new request. This covers both cases, slow-typing users and also ten-finger guys (who will type much faster). A new request for every added character is indeed too much, even if your backend is responding within a few ms.

Regards,
Stefan
dismax boost query not useful?
As I was reviewing the boosting capabilities of the dismax and edismax query parsers, it's not clear to me that the boost query (bq) has much use. The value of boost functions, particularly with the multiplied boost that edismax supports, is very clear -- there are a variety of uses. But I can't think of a useful case where I would want to both *add* a component to the ultimate score, and have that component be a non-function query (i.e. one using the lucene query parser). Also, you can basically get the same effect as a boost query via boost functions: bf=query(mybq)&mybq=... and note you would probably multiply this, via product(10,query(mybq)), to boost it to an appropriate number.

~ David Smiley
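To make the bf equivalence concrete, a hypothetical parameter set (the field name is invented; note that referencing another parameter from inside a function query uses a $ prefix):

    bq=category:electronics^10                              (additive boost query)
    bf=query($mybq)&mybq=category:electronics^10            (roughly the same effect via bf)
    bf=product(10,query($mybq))&mybq=category:electronics   (the scaled variant mentioned above)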
Problems indexing very large set of documents
Hey everybody,

I've been running into some issues indexing a very large set of documents. There are about 4000 PDF files, ranging in size from 160MB down to 10KB. Obviously this is a big task for Solr. I have a PHP script that iterates over the directory and uses PHP cURL to ask Solr to index the files. For now, commit is set to false to speed up the indexing, and I'm assuming that Solr should be auto-committing as necessary. I'm using the default solrconfig.xml file included in apache-solr-1.4.1\example\solr\conf. Once all the documents have been finished, the PHP script asks Solr to commit.

The main problem is that after a few thousand documents (around 2000 last time I tried), nearly every document begins causing Java exceptions in Solr:

    Apr 4, 2011 1:18:01 PM org.apache.solr.common.SolrException log
    SEVERE: org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@11d329d
        at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:211)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
        at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
        at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
        at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
        at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
        at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
        at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
        at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
        at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
        at org.mortbay.jetty.Server.handle(Server.java:285)
        at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
        at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:202)
        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
        at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
        at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
    Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@11d329d
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:125)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:105)
        at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:190)
        ... 23 more
    Caused by: java.io.IOException: expected='endobj' firstReadAttempt='' secondReadAttempt='' org.pdfbox.io.PushBackInputStream@b19bfc
        at org.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:502)
        at org.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:176)
        at org.pdfbox.pdmodel.PDDocument.load(PDDocument.java:707)
        at org.pdfbox.pdmodel.PDDocument.load(PDDocument.java:691)
        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:40)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:119)
        ... 25 more

As far as I know there's nothing special about these documents, so I'm wondering if it's not properly autocommitting. What would be appropriate settings in solrconfig.xml for this particular application? I'd like it to autocommit as soon as it needs to, but no more often than that, for the sake of efficiency. Obviously it takes long enough to index 4000 documents and there's no reason to make it take longer.

Thanks for your help!

~Brandon Waterloo
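On the auto-commit assumption: the stock 1.4.1 solrconfig.xml ships with the autoCommit block commented out, so Solr will not commit on its own unless it is enabled explicitly. A sketch, with arbitrary threshold values:

    <updateHandler class="solr.DirectUpdateHandler2">
      <autoCommit>
        <maxDocs>1000</maxDocs>  <!-- commit after this many buffered docs -->
        <maxTime>60000</maxTime> <!-- or after this many ms, whichever comes first -->
      </autoCommit>
    </updateHandler>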
Re: Problems indexing very large set of documents
This is related to Apache Tika. Which version are you using? Please see this thread for more details: http://lucene.472066.n3.nabble.com/PDF-parser-exception-td644885.html

Hope it helps.

Regards,
Anuj

On Mon, Apr 4, 2011 at 11:30 PM, Brandon Waterloo brandon.water...@matrix.msu.edu wrote:
RE: Problems indexing very large set of documents
Looks like I'm using Tika 0.4:

    apache-solr-1.4.1/contrib/extraction/lib/tika-core-0.4.jar
    .../tika-parsers-0.4.jar

~Brandon Waterloo

________________________________
From: Anuj Kumar [anujs...@gmail.com]
Sent: Monday, April 04, 2011 2:12 PM
To: solr-user@lucene.apache.org
Cc: Brandon Waterloo
Subject: Re: Problems indexing very large set of documents

This is related to Apache Tika. Which version are you using? Please see this thread for more details: http://lucene.472066.n3.nabble.com/PDF-parser-exception-td644885.html
Re: Problems indexing very large set of documents
In the log messages, are you able to locate the file at which it fails? Looks like TIKA is unable to parse one of your PDF files. We need to hunt that one out. Regards, Anuj

On Mon, Apr 4, 2011 at 11:57 PM, Brandon Waterloo brandon.water...@matrix.msu.edu wrote: Looks like I'm using Tika 0.4: apache-solr-1.4.1/contrib/extraction/lib/tika-core-0.4.jar .../tika-parsers-0.4.jar [remainder of the quoted thread, including the stack trace, is identical to the message above and has been snipped]
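One way to hunt it out is to run the same Tika parser chain that appears in the stack trace over the directory outside of Solr and log which files throw. A minimal sketch, written against a more recent Tika API than the 0.4 in this thread (ParseContext, for instance, did not exist yet in 0.4), so treat it as a starting point rather than drop-in code:

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class FindBadPdfs {
    public static void main(String[] args) throws Exception {
        // args[0]: the directory holding the PDFs to be indexed
        for (File f : new File(args[0]).listFiles()) {
            if (!f.getName().toLowerCase().endsWith(".pdf")) continue;
            InputStream in = new FileInputStream(f);
            try {
                // AutoDetectParser is the same entry point ExtractingDocumentLoader uses;
                // -1 disables BodyContentHandler's write limit so huge PDFs don't fail early
                new AutoDetectParser().parse(in, new BodyContentHandler(-1),
                                             new Metadata(), new ParseContext());
            } catch (Exception e) {
                // Files that fail here are the ones Solr's extract handler will choke on
                System.out.println("FAILED: " + f.getName() + " -> " + e);
            } finally {
                in.close();
            }
        }
    }
}

Anything it prints can be pulled out of the batch before the PHP loop posts files to the extract handler.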
Re: Matching the beginning of a word within a term
Thank you both for your replies. It looks like EdgeNGramFilter will do the job nicely. Time to reindex...again.

On Fri, Apr 1, 2011 at 8:31 AM, Jan Høydahl jan@cominvent.com wrote: Check out http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.EdgeNGramFilterFactory Don't know if it works with phrases though. -- Jan Høydahl, search solution architect, Cominvent AS - www.cominvent.com

On 31. mars 2011, at 16.49, Brian Lamb wrote: No, I don't really want to break down the words into subwords. In the example I provided, I would not want "kind" to match either record because it is not at the beginning of the word, even though "kind" appears in both records as part of a word.

On Wed, Mar 30, 2011 at 4:42 PM, lboutros boutr...@gmail.com wrote: Do you want to tokenize subwords based on dictionaries? A bit like disagglutination of German words? If so, something like this could help: DictionaryCompoundWordTokenFilter http://search.lucidimagination.com/search/document/CDRG_ch05_5.8.8 http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/compound/DictionaryCompoundWordTokenFilter.html Ludovic - Jouve, France.

2011/3/30 Brian Lamb [via Lucene]: Hi all, I have a field set up like this:

<field name="common_names" multiValued="true" type="text" indexed="true" stored="true" required="false" />

And I have some records:

RECORD1
<arr name="common_names">
  <str>companion to mankind</str>
  <str>pooch</str>
</arr>

RECORD2
<arr name="common_names">
  <str>companion to womankind</str>
  <str>man's worst enemy</str>
</arr>

I would like to write a query that will match the beginning of a word within the term. Here is the query I would use as it exists now:

http://localhost:8983/solr/search/?q=*:*&fq={!q.op=AND df=common_names}"companion man"~10

In the above example, I would want to return only RECORD1. The query as it exists right now is designed to only match records where both words are present in the same term. So if I changed "man" to "mankind" in the query, RECORD1 will be returned. Even though "companion" and "man" exist in the same term in RECORD2, I do not want RECORD2 to be returned because 'man' is not at the beginning of the word. How can I achieve this? Thanks, Brian Lamb
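For later readers, a sketch of what such a field type can look like (the type name is invented, gram sizes need tuning, and the attributes follow the Solr 1.4/3.x-era factories, so check them against your version's docs): edge n-grams at index time, plain tokens at query time, so a query term like man matches mankind only at the start of a word.

<fieldType name="text_prefix" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- "mankind" is indexed as m, ma, man, mank, manki, mankin, mankind -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" side="front"/>
  </analyzer>
  <analyzer type="query">
    <!-- no n-gramming at query time: the whole query term must equal some prefix gram -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>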
Re: Matching on a multi valued field
I just noticed Juan's response and I find that I am encountering that very issue in a few cases. Boosting is a good way to put the more relevant results at the top, but is it possible to have only the correct results returned?

On Wed, Mar 30, 2011 at 11:51 AM, Brian Lamb brian.l...@journalexperts.com wrote: Thank you all for your responses. The field had already been set up with positionIncrementGap=100, so I just needed to add in the slop.

On Tue, Mar 29, 2011 at 6:32 PM, Juan Pablo Mora jua...@informa.es wrote:

A multiValued field is actually a single field with all data separated with positionIncrement. Try setting that value high enough and use a PhraseQuery.

That is true, but you cannot do things like q="bar* foo*"~10 with default query search, and if you use dismax you will have the same problems with multivalued fields. Imagine the situation:

Doc1: field A: [foo bar, dooh] (2 values)
Doc2: field A: [bar dooh, whatever] (another 2 values)

The query qt=dismax, qf=fieldA, q=(bar dooh) will return both Doc1 and Doc2. The only thing you can do in this situation is boost the phrase query in Doc2 with the pf parameter in order to get Doc2 in the first position of the results: pf = fieldA^1

Thanks, JP.

On 29/03/2011, at 23:14, Markus Jelsma wrote:

orly, all replies came in while sending =) Hi, Your filter query is looking for a match of "man's friend" in a single field. Regardless of analysis of the common_names field, all terms are present in the common_names field of both documents. A multiValued field is actually a single field with all data separated with positionIncrement. Try setting that value high enough and use a PhraseQuery. That should work. Cheers,

Hi all, I have a field set up like this:

<field name="common_names" multiValued="true" type="text" indexed="true" stored="true" required="false" />

And I have some records:

RECORD1
<arr name="common_names">
  <str>man's best friend</str>
  <str>pooch</str>
</arr>

RECORD2
<arr name="common_names">
  <str>man's worst enemy</str>
  <str>friend to no one</str>
</arr>

Now if I do a search such as:

http://localhost:8983/solr/search/?q=*:*&fq={!q.op=AND df=common_names}man's friend

Both records are returned. However, I only want RECORD1 returned. I understand why RECORD2 is returned, but how can I structure my query so that only RECORD1 is returned? Thanks, Brian Lamb
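To make the slop trick concrete: with positionIncrementGap=100 on common_names, a phrase query whose slop stays below 100 can never straddle two values, so a request along these lines returns only RECORD1 (spaces and braces would need URL-encoding in a real request):

http://localhost:8983/solr/search/?q=*:*&fq={!df=common_names}"man's friend"~99

The gap inserts 100 phantom positions between values, so "man's" in one value and "friend" in the next are at least 100 positions apart and fall outside the slop; a slop of 100 or more would reintroduce cross-value matches.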
Re: Matching on a multi valued field
I have not found any solution to this. The only thing is to denormalize your multivalued field into several docs with a single-valued field. Try ComplexPhraseQueryParser (https://issues.apache.org/jira/browse/SOLR-1604) if you are using the Solr 1.4 version.

On 04/04/2011, at 21:21, Brian Lamb wrote: I just noticed Juan's response and I find that I am encountering that very issue in a few cases. Boosting is a good way to put the more relevant results at the top, but is it possible to have only the correct results returned? [rest of the quoted thread is identical to the message above and has been snipped]
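To make the denormalization option concrete: index one document per value, carrying the parent record's key, and the plain AND query then works because each document holds exactly one value. A sketch with invented field names (record_id, common_name):

<add>
  <doc>
    <field name="record_id">1</field>
    <field name="common_name">man's best friend</field>
  </doc>
  <doc>
    <field name="record_id">1</field>
    <field name="common_name">pooch</field>
  </doc>
  <doc>
    <field name="record_id">2</field>
    <field name="common_name">man's worst enemy</field>
  </doc>
  <doc>
    <field name="record_id">2</field>
    <field name="common_name">friend to no one</field>
  </doc>
</add>

Now fq={!q.op=AND df=common_name}man's friend matches only record 1's first document, and the application (or field collapsing, where available) groups hits back to records by record_id.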
RE: Using the Data Import Handler with SQLite
I was able to resolve this issue by using a different JDBC driver: http://www.xerial.org/trac/Xerial/wiki/SQLiteJDBC

-----Original Message-----
From: Zac Smith [mailto:z...@trinkit.com]
Sent: Friday, April 01, 2011 5:56 PM
To: solr-user@lucene.apache.org
Subject: Using the Data Import Handler with SQLite

I hope this question is being directed to the right place ... I am trying to use SQLite (v3) as a source for the Data Import Handler. I am using a SQLite JDBC driver (link below) and this works when using only one entity. As soon as I add a sub-entity it falls over with a locked DB error: java.sql.SQLException: database is locked. Now, I realize that you can only have one connection open to SQLite at a time. So I assume that the first query is leaving a connection open before it moves onto the sub-query. I am not sure if the issue is in the JDBC driver or the DIH. It works fine with SQL Server. Is this a bug? Or something that just isn't possible with SQLite? Here is a sample of my data config file:

<dataConfig>
  <dataSource type="JdbcDataSource" driver="org.sqlite.JDBC" url="jdbc:sqlite:SolrImportTest.db" />
  <document>
    <entity name="locations" pk="id" query="select * from locations">
      <field column="Id" name="Id" />
      <field column="Name" name="Name" />
      <field column="RegionId" name="RegionId" />
      <entity name="regions" pk="id" query="select * from regions where id = '${locations.RegionId}'">
        <field column="Name" name="RegionName" />
      </entity>
    </entity>
  </document>
</dataConfig>

SQLite JDBC driver: http://www.zentus.com/sqlitejdbc/
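For anyone hitting the same lock: the Xerial driver is a fork that registers the same org.sqlite.JDBC class name, so the data config above should work unchanged; only the jar on Solr's classpath changes. In other words, the dataSource line stays as it is:

<!-- unchanged: the Xerial jar also provides org.sqlite.JDBC -->
<dataSource type="JdbcDataSource" driver="org.sqlite.JDBC" url="jdbc:sqlite:SolrImportTest.db" />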
Re: does overwrite=false work with json
I tried it with the example JSON documents, and even if I add overwrite=false to the URL, it still overwrites. Do this twice:

curl 'http://localhost:8983/solr/update/json?commit=true&overwrite=false' --data-binary @books.json -H 'Content-type:application/json'

Then do this query:

curl 'http://localhost:8983/solr/select?q=title:monsters&wt=json&indent=true'

--Dave
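One variant worth trying while debugging this: the JSON update syntax can also carry the flag inside the message, per add command, rather than on the URL. A sketch with an invented document (whether a given Solr build honors either form is exactly what is in question here):

curl 'http://localhost:8983/solr/update/json?commit=true' -H 'Content-type:application/json' --data-binary '{"add": {"overwrite": false, "doc": {"id": "book-1", "title": "Sea Monsters"}}}'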
Re: Question about http://wiki.apache.org/solr/Deduplication
Thanks Hoss, Externalizing this part is exactly the path we are exploring now, not only for this reason. We have already started testing Hadoop SequenceFile as a write-ahead log for updates/deletes. SequenceFile supports append now (simply great!). It was a pain to have to add Hadoop into the mix for mortal collection sizes < 200 Mio, but on the other side, having Hadoop around offers huge flexibility. The write-ahead log catches update commands (all Solr slaves fronting clients accept updates, but only to forward them to the WAL). The Solr master tries to catch up with the update stream, indexing in async fashion, and finally the Solr slaves chase the master index with standard Solr replication. Overnight we run simple map-reduce jobs to consolidate, normalize and sort the update stream, and reindex at the end. Deduplication and collection sorting is for us only an optimization if done reasonably often, like once per day/week, but if we do not do it, it doubles HW resources. Imo, native WAL support in Solr would definitely be one nice to-have (for HA, update scalability...). What is charming with a WAL is that updates never wait/disappear; if there is too much traffic, we only have slightly higher update latency, but updates definitely get processed. Some basic primitives on the WAL (consolidation, replaying the update stream on Solr etc...) should be supported in this case, a sort of smallish Hadoop feature subset for Solr clusters, but nothing oversized.

Cheers, eks

On Sun, Apr 3, 2011 at 1:05 AM, Chris Hostetter hossman_luc...@fucit.org wrote:

: Is it possible in solr to have multivalued id? Or I need to make my
: own mv_ID for this? Any ideas how to achieve this efficiently?

This isn't something the SignatureUpdateProcessor is going to be able to help you with -- it does the deduplication by changing the low-level update (implemented as a delete then add) so that the key used to delete the older documents is based on the signature field instead of the id field. In order to do what you are describing, you would need to query the index for matching signatures, then add the resulting ids to your document before doing that update. You could possibly do this in a custom UpdateProcessor, but you'd have to do something tricky to ensure you didn't overlook docs that had been added but not yet committed when checking for dups. I don't have a good suggestion for how to do this internally in Solr -- it seems like the type of bulk processing logic that would be better suited for an external process before you ever start indexing (much like link analysis for back references).

-Hoss
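For reference, the append side of such a WAL is only a few lines of the Hadoop API. A rough sketch (the path, the key/value choice, and the XML-command-in-Text encoding are all invented, and flush/append semantics differ across Hadoop versions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class UpdateWal {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // One timestamp-keyed log; the replay job processes entries in key order
        SequenceFile.Writer wal = SequenceFile.createWriter(
                fs, conf, new Path("/solr/wal/updates.seq"),
                LongWritable.class, Text.class);
        try {
            // A slave logs each update command here before forwarding it
            wal.append(new LongWritable(System.currentTimeMillis()),
                       new Text("<add><doc>...</doc></add>"));
        } finally {
            wal.close();
        }
    }
}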
Re: Mongo REST interface and full data import
Sorry for the mistake with the Solr version ... I'm using Solr 3.1.
Re: Matching on a multi valued field
On 4/4/2011 3:21 PM, Brian Lamb wrote: I just noticed Juan's response and I find that I am encountering that very issue in a few cases. Boosting is a good way to put the more relevant results at the top, but is it possible to have only the correct results returned?

Only what's already been said in the thread. You can simulate a non-phrase, non-wildcard search that is forced to match entirely within the same value of a multi-valued field by using phrase queries with slop. That will only return hits that have all terms within the same value -- it's not a boosting solution. But if you need wildcards, or you need to find an actual phrase in the same value as additional term(s) or phrase(s), then no, you are out of luck in Solr -- exactly what Juan already said. If someone can think of a clever way to write some Java to do this in a new query component, that would be useful; I am not entirely sure how possible that is. I guess you'd have to make sure that ALL matching tokens or phrases are within the positionIncrementGap of each other, and I'm not sure how feasible that is, as I'm not too familiar with the Solr/Lucene source. But at any rate, there's no way to do it out of the box with Solr, no.
Very very large scale Solr Deployment = how to do (Expert Question)?
Hello Experts, I am a Solr newbie but have read quite a lot of docs. I still do not understand what would be the best way to set up very large scale deployments:

Goal (theoretical):
A) Index size: 1 Petabyte (1 document is about 5 KB in size)
B) Queries: 10 queries per second
C) Updates: 10 updates per second

Solr offers:
1) Replication => scales well for B), BUT A) and C) are not satisfied.
2) Sharding => scales well for A), BUT B) and C) are not satisfied. (As I understand the sharding approach, everything goes through a central server that dispatches the updates and assembles the queries retrieved from the different shards. But this central server also has some capacity limits...)

What is the right approach to handle such large deployments? I would be thankful for just a rough sketch of the concepts so I can experiment/search further. Maybe I am missing something very trivial, as I think some of the "Solr Users/Use Cases" on the homepage are that kind of large deployment. How are they implemented? Thank you very much!!! Jens
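A note on 2): the dispatcher/assembler is not a fixed central server. Any Solr node can accept a query and fan it out with the shards parameter, so query aggregation can sit behind a load balancer and scale with B), while updates go directly to the shard that owns each document and scale with C). An illustrative request (host names invented):

http://any-query-node:8983/solr/select?q=text:foo&shards=shard1.example.com:8983/solr,shard2.example.com:8983/solr,shard3.example.com:8983/solr

A) is the hard part: 1 PB at ~5 KB/doc is on the order of 200 billion documents, so this means hundreds or thousands of shards and likely a tree of aggregators rather than a single tier, with replication hanging off each shard for redundancy.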