Re: questions on query format
2. If I send Solr the following query: q=*:* I get nothing, just:

    <response><result name="response" numFound="0" start="0" maxScore="0.0"/><lst name="highlighting"/></response>

Would appreciate some insight into what is going on.

If you are using dismax as the query parser, then *:* won't function as a match-all-docs query. To retrieve all docs - with dismax - use the q.alt=*:* parameter. Also, adding debugQuery=on will display information about the parsed query.
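For example, a request along these lines should return every document and show the parsed query (host and handler here are illustrative, not from the original message):

    http://localhost:8983/solr/select?defType=dismax&q.alt=*:*&debugQuery=on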
Re: Dismax and phrases
On 10/23/2011 09:34 PM, Erick Erickson wrote: Hmmm, dismax is, indeed, different. Note that dismax doesn't respect the default operator at all, so don't be misled there. Could you paste the debug output for both the queries? Perhaps something will jump out at us. Best, Erick

Thank you Erick. I've tried to paste the query results here. The first one is the query with "s around the terms and returns 6888 results. I've hidden the explain parts of most of the results (and the timing) just to keep the email reasonably short. If you need to see them, let me know. + designates a hidden subtree. Best regards, Lauri

    <lst name="responseHeader">
      <int name="status">0</int>
      <int name="QTime">91</int>
      <lst name="params">
        <str name="explainOther"/>
        <str name="indent">on</str>
        <str name="hl.fl"/>
        <str name="wt">standard</str>
        <str name="version">2.2</str>
        <str name="rows">10</str>
        <str name="fl">*,score</str>
        <str name="debugQuery">on</str>
        <str name="start">0</str>
        <str name="q">"asuntojen hinnat"</str>
        <str name="qt">dismax</str>
        <str name="fq"/>
      </lst>
    </lst>
    +<result name="response" numFound="6888" start="0" maxScore="3.0879765">
    <lst name="debug">
      <lst name="queryBoosting">
        <str name="q">"asuntojen hinnat"</str>
        <null name="match"/>
      </lst>
      <str name="rawquerystring">"asuntojen hinnat"</str>
      <str name="querystring">"asuntojen hinnat"</str>
      <str name="parsedquery">+DisjunctionMaxQuery((table.title_t:"asuntojen hinnat"^2.0 | title_t:"asuntojen hinnat"^2.0 | ingress_t:"asuntojen hinnat" | (text_fi:asunto text_fi:hinta) | (table.description_fi:asunto table.description_fi:hinta) | table.description_t:"asuntojen hinnat" | graphic.title_t:"asuntojen hinnat"^2.0 | ((graphic.title_fi:asunto graphic.title_fi:hinta)^2.0) | ((table.title_fi:asunto table.title_fi:hinta)^2.0) | table.contents_t:"asuntojen hinnat" | text_t:"asuntojen hinnat" | (ingress_fi:asunto ingress_fi:hinta) | (table.contents_fi:asunto table.contents_fi:hinta) | ((title_fi:asunto title_fi:hinta)^2.0))~0.01) () type:tie^6.0 type:kuv^2.0 type:tau^2.0 FunctionQuery((1.0/(3.16E-11*float(ms(const(1319437912691),date(date.modified_dt)))+1.0))^100.0)</str>
      <str name="parsedquery_toString">+(table.title_t:"asuntojen hinnat"^2.0 | title_t:"asuntojen hinnat"^2.0 | ingress_t:"asuntojen hinnat" | (text_fi:asunto text_fi:hinta) | (table.description_fi:asunto table.description_fi:hinta) | table.description_t:"asuntojen hinnat" | graphic.title_t:"asuntojen hinnat"^2.0 | ((graphic.title_fi:asunto graphic.title_fi:hinta)^2.0) | ((table.title_fi:asunto table.title_fi:hinta)^2.0) | table.contents_t:"asuntojen hinnat" | text_t:"asuntojen hinnat" | (ingress_fi:asunto ingress_fi:hinta) | (table.contents_fi:asunto table.contents_fi:hinta) | ((title_fi:asunto title_fi:hinta)^2.0))~0.01 () type:tie^6.0 type:kuv^2.0 type:tau^2.0 (1.0/(3.16E-11*float(ms(const(1319437912691),date(date.modified_dt)))+1.0))^100.0</str>
      <lst name="explain">
        <str name="/media/nss/DATA2/data/wwwprod/til/ashi/2011/07/ashi_2011_07_2011-08-26_tie_001_fi.html">
    3.1653805 = (MATCH) sum of:
      1.9299976 = (MATCH) max plus 0.01 times others of:
        1.9211313 = weight(title_t:"asuntojen hinnat"^2.0 in 5891), product of:
          0.26658234 = queryWeight(title_t:"asuntojen hinnat"^2.0), product of:
            2.0 = boost
            14.413042 = idf(title_t: asuntojen=250 hinnat=329)
            0.009247955 = queryNorm
          7.206521 = fieldWeight(title_t:"asuntojen hinnat" in 5891), product of:
            1.0 = tf(phraseFreq=1.0)
            14.413042 = idf(title_t: asuntojen=250 hinnat=329)
            0.5 = fieldNorm(field=title_t, doc=5891)
        0.03292808 = (MATCH) sum of:
          0.016520109 = (MATCH) weight(text_fi:asunto in 5891), product of:
            0.044221584 = queryWeight(text_fi:asunto), product of:
              4.781769 = idf(docFreq=3251, maxDocs=142742)
              0.009247955 = queryNorm
            0.3735757 = (MATCH) fieldWeight(text_fi:asunto in 5891), product of:
              1.0 = tf(termFreq(text_fi:asunto)=1)
              4.781769 = idf(docFreq=3251, maxDocs=142742)
              0.078125 = fieldNorm(field=text_fi, doc=5891)
          0.016407972 = (MATCH) weight(text_fi:hinta in 5891), product of:
            0.03705935 = queryWeight(text_fi:hinta), product of:
              4.0073023 = idf(docFreq=7054, maxDocs=142742)
              0.009247955 = queryNorm
            0.44274852 = (MATCH) fieldWeight(text_fi:hinta in 5891), product of:
              1.4142135 = tf(termFreq(text_fi:hinta)=2)
              4.0073023 = idf(docFreq=7054, maxDocs=142742)
              0.078125 = fieldNorm(field=text_fi, doc=5891)
        0.34379265 = (MATCH) sum of:
          0.19207533 = (MATCH) weight(graphic.title_fi:asunto in 5891), product of:
            0.10662244 = queryWeight(graphic.title_fi:asunto), product of:
              5.76465 = idf(docFreq=1216, maxDocs=142742)
              0.01849591 = queryNorm
            1.8014531 = (MATCH) fieldWeight(graphic.title_fi:asunto in 5891), product of:
              1.0 = tf(termFreq(graphic.title_fi:asunto)=1)
              5.76465 = idf(docFreq=1216, maxDocs=142742)
              0.3125 = fieldNorm(field=graphic.title_fi, doc=5891)
          0.15171732 = (MATCH) weight(graphic.title_fi:hinta in 5891), product of:
            0.09476117 =
Re: Want to support did you mean xxx but is Chinese
Hi Li Li, thanks for your detailed explanation. Basically I have a similar implementation to yours; I just want to know if there is a better, more complete solution. I'll keep trying and see if I come up with any improvement I can share with you and the community. Any ideas or advice are welcome. Floyd

2011/10/21 Li Li fancye...@gmail.com: we have implemented one supporting "did you mean" and prefix suggestion for Chinese. But we based our work on Solr 1.4 and made many modifications, so it will take time to integrate it into current Solr/Lucene. Here is our solution; glad to hear any advice.

1. Offline word and phrase discovery: we discover new words and new phrases by mining query logs.

2. Online matching algorithm: for each word, e.g. 贝多芬, we convert it to the pinyin "bei duo fen", then we index it using n-grams, which means gram3:bei gram3:eid ... To get the "did you mean" result, we convert the query 背朵分 into n-grams as well; it's a boolean OR query, so there are many results (words whose pinyin is similar to the query are ranked top). Then we rerank the top 500 results with a fine-grained algorithm: we use edit distance to align query and result, and we also take the characters themselves into consideration. E.g. for the query 十度, the matches 十渡 and 是度 have exactly the same pinyin, but 十渡 is better than 是度 because 十 occurs in both query and match. You also need to consider the hotness (popularity) of different words/phrases, which can be learned from query logs. Another problem is converting Chinese into pinyin, because some characters have more than one pinyin. E.g. 长沙 vs 长大: 长's pinyin is "chang" in 长沙, so you should segment the query and the words/phrases first. Word segmentation is a basic problem in Chinese IR.

2011/10/21 Floyd Wu floyd...@gmail.com: Does anybody know how to implement this idea in Solr? Please kindly point me in a direction. For example, a user enters the keyword 貝多芬 (this is "Beethoven" in Chinese) but keys in a wrong combination of characters, 背多分 (this has the same pronunciation as the previous keyword 貝多芬). The token 貝多芬 actually exists in the Solr index. How can documents where 貝多芬 exists be hit when 背多分 is entered? This is a basic function of commercial search engines, especially in Chinese processing. I wonder how to implement it in Solr and where the starting point is. Floyd
Using CURL to index directory
Hi, I have been using curl for indexing individual files; does any one of you know how to index an entire directory using curl? Thanks, Jagdish
Re: Can Solr handle large text files?
Thanks for the reminder - I had that set to 214xxx... (the max), but perf was terrible when I injected large files. So what's the max recommended field size in KB? I can try chopping up the syslogs into arbitrarily small pieces, but would love to know where to start. Thanks! Sent from my iPhone

On Oct 23, 2011, at 2:01 PM, Erick Erickson erickerick...@gmail.com wrote: Also be aware that by default Solr is configured to only index the first 10,000 tokens of text. See maxFieldLength in solrconfig.xml. Best, Erick

On Fri, Oct 21, 2011 at 7:34 PM, Peter Spam ps...@mac.com wrote: Thanks for your note, Anand. What was the maximum chunk size for you? Could you post the relevant portions of your configuration file? Thanks! Pete

On Oct 21, 2011, at 4:20 AM, anand.ni...@rbs.com wrote: Hi, I was also facing the issue of highlighting large text files. I applied the solution proposed here and it worked, but now I am getting the following error: basically 'hitGrouped.vm' is not found. I am using solr-3.4.0. Where can I get this file? Its reference is present in browse.vm:

    <div class="results">
      #if($response.response.get('grouped'))
        #foreach($grouping in $response.response.get('grouped'))
          #parse("hitGrouped.vm")
        #end
      #else
        #foreach($doc in $response.results)
          #parse("hit.vm")
        #end
      #end
    </div>

    HTTP Status 500 - Can't find resource 'hitGrouped.vm' in classpath or 'C:\caprice\workspace\caprice\dist\DEV\solr\.\conf/', cwd=C:\glassfish3\glassfish\domains\domain1\config
    java.lang.RuntimeException: Can't find resource 'hitGrouped.vm' in classpath or 'C:\caprice\workspace\caprice\dist\DEV\solr\.\conf/', cwd=C:\glassfish3\glassfish\domains\domain1\config
      at org.apache.solr.core.SolrResourceLoader.openResource(SolrResourceLoader.java:268)
      at org.apache.solr.response.SolrVelocityResourceLoader.getResourceStream(SolrVelocityResourceLoader.java:42)
      at org.apache.velocity.Template.process(Template.java:98)
      at org.apache.velocity.runtime.resource.ResourceManagerImpl.loadResource(ResourceManagerImpl.java:446)
      at

Thanks & Regards, Anand
Anand Nigam
RBS Global Banking & Markets
Office: +91 124 492 5506

-----Original Message----- From: karsten-s...@gmx.de [mailto:karsten-s...@gmx.de] Sent: 21 October 2011 14:58 To: solr-user@lucene.apache.org Subject: Re: Can Solr handle large text files?

Hi Peter, highlighting in large text files cannot be fast without dividing the original text into small pieces. So take a look at http://xtf.cdlib.org/documentation/under-the-hood/#Chunking and at http://www.lucidimagination.com/blog/2010/09/16/2446/ This means that you should divide your files and use Result Grouping / Field Collapsing to list only one hit per original document. (xtf would also solve your problem out of the box, but xtf does not use Solr.) Best regards, Karsten

-------- Original Message -------- Date: Thu, 20 Oct 2011 17:59:04 -0700 From: Peter Spam ps...@mac.com To: solr-user@lucene.apache.org Subject: Can Solr handle large text files?

I have about 20k text files, some very small, but some up to 300MB, and would like to do text searching with highlighting. Imagine the text is the contents of your syslog. I would like to type in some terms, such as "error" and "mail", and have Solr return the syslog lines with those terms PLUS two lines of context, pretty much just like Google's highlighting. 1) Can Solr handle this? I had extremely long query times when I tried this with Solr 1.4.1 (yes, I was using TermVectors, etc.). I tried breaking the files into 1MB pieces, but searching would be wonky = return the wrong number of documents (i.e.
if one file had a term 5 times, and that was the only file that had the term, I want 1 result, not 5 results). 2) What sort of tokenizer would be best? Here's what I'm using:

    <field name="body" type="text_pl" indexed="true" stored="true"
           multiValued="false" termVectors="true" termPositions="true"
           termOffsets="true"/>

    <fieldType name="text_pl" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0"
                generateNumberParts="0" catenateWords="0" catenateNumbers="0"
                catenateAll="0" splitOnCaseChange="0"/>
      </analyzer>
    </fieldType>

Thanks! Pete
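For reference, the maxFieldLength setting Erick mentions lives in solrconfig.xml; a minimal illustrative snippet with the default value:

    <!-- maximum number of tokens indexed per field -->
    <maxFieldLength>10000</maxFieldLength>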
Re: Using CURL to index directory
Hi, try the attached post-text.sh file. It was not written by me; it's part of a great tutorial written by Avi Rappoport that you can find at: http://www.lucidimagination.com/devzone/technical-articles/whitepapers/indexing-text-and-html-files-solr Regards,

On Mon, Oct 24, 2011 at 9:13 AM, Jagdish Kumar jagdish.thapar...@hotmail.com wrote: Hi, I have been using curl for indexing individual files; does any one of you know how to index an entire directory using curl? Thanks, Jagdish

-- Dirceu Vieira Júnior --- +47 9753 2473 dirceuvjr.blogspot.com twitter.com/dirceuvjr

post-text.sh Description: Bourne shell script
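The attachment itself is not reproduced here, but a minimal sketch of the same idea - looping over a directory and posting each file with curl - might look like this (paths, host, and the .txt filter are assumptions):

    #!/bin/sh
    # Post every file in a directory to Solr's extract handler,
    # using the file name as the document id, then commit once.
    for f in /path/to/dir/*.txt; do
      curl "http://localhost:8983/solr/update/extract?literal.id=$(basename "$f")" \
           -F "myfile=@$f"
    done
    curl "http://localhost:8983/solr/update?commit=true"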
Re: Painfully slow indexing
Hey guys, your responses are welcome, but I still haven't gained much improvement.

*Are you posting through HTTP/SOLRJ?*
I am using the RSolr gem, which internally uses a Ruby HTTP lib to POST documents to Solr.

*Your script time 'T' includes time between sending the POST request -to- the response fetched after a successful response, right?*
Correct. It also includes the time taken to convert all those documents from a Ruby Hash to XML.

*generate the ready-for-indexing XML documents on a file system*
Alain, I have somewhere around 6M documents for indexing. You mean to say that I should convert all of them into one XML file and then index?

*are you calling commit after your batches or do an optimize by any chance?*
I am not optimizing, but I am performing an autocommit every 10 docs.

*Pranav Prakash* temet nosce Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com | Google http://www.google.com/profiles/pranny

On Fri, Oct 21, 2011 at 16:32, Simon Willnauer simon.willna...@googlemail.com wrote: On Wed, Oct 19, 2011 at 3:58 PM, Pranav Prakash pra...@gmail.com wrote: Hi guys, I have set up a Solr instance, and upon attempting to index documents, the whole process is painfully slow. I will try to put as much info as I can in this mail. Please feel free to ask me anything else that might be required. I am sending documents in batches not exceeding 2,000. The size of each batch varies but is usually around 10-15 MiB. My indexing script tells me that Solr took T seconds to add N documents of size S. For the same data, the Solr log add QTime is QT. Some sample data:

    N         | S                | T     | QT
    390 docs  | 3,478,804 bytes  | 14.5s | 2297
    852 docs  | 6,039,535 bytes  | 25.3s | 4237
    1345 docs | 11,147,512 bytes | 47s   | 8543
    1147 docs | 9,457,717 bytes  | 44s   | 2297
    1096 docs | 13,058,204 bytes | 54.3s | 8782

The time T includes the time to convert an array of Hash objects into XML, POST it to Solr, and have the response acknowledged by Solr. Clearly, there is a huge difference between time T and QT. After a lot of effort, I have no clue why these times do not match. The server has 16 cores and 48 GiB RAM. JVM options are -Xms5000M -Xmx5000M -XX:+UseParNewGC. I believe my indexing is getting slow. Relevant portions from my schema file are as follows. On a related note, every document has one dynamic field. At this rate, it takes me ~30 hrs to do a full index of my database. I would really appreciate the kindness of the community in getting this indexing faster.
    <indexDefaults>
      <useCompoundFile>false</useCompoundFile>
      <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
        <int name="maxMergeCount">10</int>
        <int name="maxThreadCount">10</int>
      </mergeScheduler>
      <ramBufferSizeMB>2048</ramBufferSizeMB>
      <maxMergeDocs>2147483647</maxMergeDocs>
      <maxFieldLength>300</maxFieldLength>
      <writeLockTimeout>1000</writeLockTimeout>
      <maxBufferedDocs>5</maxBufferedDocs>
      <termIndexInterval>256</termIndexInterval>
      <mergeFactor>10</mergeFactor>
      <useCompoundFile>false</useCompoundFile>
      <!--
      <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
        <int name="maxMergeAtOnceExplicit">19</int>
        <int name="segmentsPerTier">9</int>
      </mergePolicy>
      -->
    </indexDefaults>
    <mainIndex>
      <unlockOnStartup>true</unlockOnStartup>
      <reopenReaders>true</reopenReaders>
      <deletionPolicy class="solr.SolrDeletionPolicy">
        <str name="maxCommitsToKeep">1</str>
        <str name="maxOptimizedCommitsToKeep">0</str>
      </deletionPolicy>
      <infoStream file="INFOSTREAM.txt">false</infoStream>
    </mainIndex>
    <updateHandler class="solr.DirectUpdateHandler2">
      <autoCommit>
        <maxDocs>10</maxDocs>
      </autoCommit>
    </updateHandler>

*Pranav Prakash* temet nosce Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com | Google http://www.google.com/profiles/pranny

hey, are you calling commit after your batches, or doing an optimize by any chance? I would suggest you stream your documents to Solr and commit only if you really need to. Set your RAM buffer to something between 256 and 320 MB and remove the maxBufferedDocs setting completely. You can also experiment with your merge settings a little; 10 merging threads seem to be a lot. I know you have lots of CPU, but IO will be the bottleneck here. simon
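A sketch of what Simon's suggestions could look like in solrconfig.xml (values are illustrative, not from the thread):

    <indexDefaults>
      <useCompoundFile>false</useCompoundFile>
      <!-- suggested: RAM buffer between 256 and 320 MB; no maxBufferedDocs element at all -->
      <ramBufferSizeMB>320</ramBufferSizeMB>
      <mergeFactor>10</mergeFactor>
      <!-- suggested: far fewer merge threads, since IO is the bottleneck -->
      <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
        <int name="maxThreadCount">3</int>
      </mergeScheduler>
    </indexDefaults>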
Re: Solr indexing plugin: skip single faulty document?
Thanks Erick! I'll be reading that issue; it's pretty much everything I need! -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-indexing-plugin-skip-single-faulty-document-tp3427646p3447400.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr indexing plugin: skip single faulty document?
Don't get too excited; I don't know what state that patch is in. It's on my long TODO list to go back and look some more. If you want to work on it and bring it up to snuff, please feel free to do so and submit a modernized patch! Erick

On Mon, Oct 24, 2011 at 9:44 AM, samuele.mattiuzzo samum...@gmail.com wrote: Thanks Erick! I'll be reading that issue; it's pretty much everything I need! -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-indexing-plugin-skip-single-faulty-document-tp3427646p3447400.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: help needed on solr-uima integration
Hi, Where can I find test code for solr-uima component? Thanks, Xue-Feng

From: Xue-Feng Yang just4l...@yahoo.com To: solr-user@lucene.apache.org solr-user@lucene.apache.org Sent: Sunday, October 23, 2011 3:43:58 AM Subject: help needed on solr-uima integration

Hi, after googling online, some parts of the puzzle are still missing. The best would be to find a simple example showing the whole process. Is there any example like apache-uima/examples/descriptors/tutorial/ex3 (RoomNumber and DateTime) integrated into Solr? In particular, how do I feed text that has at least two fields into Solr for indexing? Thanks, Xue-Feng
RE: Using CURL to index directory
Thanks for the quick response. I am working on a Windows machine, and I also need to post text, zip, pdf, image files, etc. It would be great if you could help me out with multiple file types on Windows. Thanks, Jagdish

Date: Mon, 24 Oct 2011 09:30:49 +0200 Subject: Re: Using CURL to index directory From: dirceu...@gmail.com To: solr-user@lucene.apache.org

Hi, try the attached post-text.sh file. It was not written by me; it's part of a great tutorial written by Avi Rappoport that you can find at: http://www.lucidimagination.com/devzone/technical-articles/whitepapers/indexing-text-and-html-files-solr Regards,

On Mon, Oct 24, 2011 at 9:13 AM, Jagdish Kumar jagdish.thapar...@hotmail.com wrote: Hi, I have been using curl for indexing individual files; does any one of you know how to index an entire directory using curl? Thanks, Jagdish

-- Dirceu Vieira Júnior --- +47 9753 2473 dirceuvjr.blogspot.com twitter.com/dirceuvjr
Re: Solr indexing plugin: skip single faulty document?
OK, I'll surely check out what I can! -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-indexing-plugin-skip-single-faulty-document-tp3427646p3447537.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: data-import problem
Hi Radha Krishna, try the command full-import instead of fullimport; see http://wiki.apache.org/solr/DataImportHandler#Commands Best regards, Karsten

-------- Original Message -------- Date: Mon, 24 Oct 2011 11:10:22 +0530 From: Radha Krishna Reddy radhakrishn...@gmail.com To: solr-user@lucene.apache.org Subject: data-import problem

Hi, I am trying to configure Solr on an AWS Ubuntu instance. I have MySQL on a different server, so I created an SSH tunnel for MySQL on port 3309, then downloaded the MySQL JDBC driver and copied it to the lib folder. *I edited the example/solr/conf/solrconfig.xml* ... *When I tried to import data:*

    http://myservername/solr/dataimport?command=fullimport

*I am getting the following response:*

    <?xml version="1.0" encoding="UTF-8"?>
    <response>
      <lst name="responseHeader"><int name="status">0</int><int name="QTime">5</int></lst>
      <lst name="initArgs"><lst name="defaults"><str name="config">data-config.xml</str></lst></lst>
      <str name="command">fullimport</str>
      <str name="status">idle</str>
      <str name="importResponse"/>
      <lst name="statusMessages"/>
      <str name="WARNING">This response format is experimental. It is likely to change in the future.</str>
    </response>

Can someone help me with this? Also, where can I find the logs? Thanks and Regards, Radha Krishna.
Re: multiple document types in a core
Hi Erick, You're right, I think. On resources we gain a little bit on:
- disk (a production implementation with live data would save about 500 MB of disk usage on each slave and master)
- some reduction in network traffic on replication (we do a full re-index every 24 hours at present)

On design we gain a little by being able to support searches at various document levels (perform a destination search or hotel search and return documents at the correct level for the search without the need to perform field collapsing). But in the cold light of day I don't think we gain huge amounts (leaving aside the replication of a full index). cheers, lee c

On 23 October 2011 19:05, Erick Erickson erickerick...@gmail.com wrote: Yes, stored fields are placed verbatim for every doc. But I wonder at the utility of trying to share stored information. The stored info is put in certain files in the index, see: http://lucene.apache.org/java/3_0_2/fileformats.html#file-names and the files that store data are pretty much irrelevant to searching; the data in them is only referenced when assembling the document for return. So by adding this complexity you'll be saving a bit on file transfers when replicating your index, but not much else. Is it worth it? If so, why? Best, Erick

On Mon, Oct 17, 2011 at 11:07 AM, lee carroll lee.a.carr...@googlemail.com wrote: Just as a follow-up, it looks like stored fields are stored verbatim for every doc:

    hotel index and stored dest attributes - index size: 131M, number of records: 49147
    hotel index only dest attributes - index size: 111M, number of records: 49147

~400 chars (bytes) of destination data * 49147 (number of hotel docs) = ~19M; basically everything is being stored. No difference in time to index (very rough and not scientific :-) ). So it does seem an OK strategy to denormalise docs with indexed fields but normalise with stored fields? Or have I missed some problems with this? cheers, lee c

On 16 October 2011 11:54, lee carroll lee.a.carr...@googlemail.com wrote: Hi Chris, thanks for the response. "It's an inverted index, so *terms* exist once (per segment) and those terms point to the documents -- so having the same terms (in the same fields) for multiple types of documents in one index is going to take up less overall space than having distinct collections for each type of document." I'm not asking about the indexed terms but rather the stored values. By having two doc types, are we gaining anything by storing attributes only for that doc type? cheers, lee c
Re: help needed on solr-uima integration
(11/10/24 17:42), Xue-Feng Yang wrote: Hi, Where can I find test code for solr-uima component? You should find them under: solr/contrib/uima/src/test koji -- Check out Query Log Visualizer for Apache Solr http://www.rondhuit-demo.com/loganalyzer/loganalyzer.html http://www.rondhuit.com/en/
indexing key value pair into lucene solr index
hi, in my use case I have a list of key-value pairs in each document object. If I index them as separate index fields, then in the result doc object I will get two arrays corresponding to my keys and values. The problem I face here is that there won't be any mapping between those keys and values. Do we have any easy way to index this data in Solr? thanks in advance ... -- -JAME
Re: indexing key value pair into lucene solr index
Hi Jame, you can:
- generate one token for each pair (key, value) -- key_value
- insert a gap between each pair and use phrase queries
- use the key as the field name (if you have a restricted set of keys)
- wait for joins in Solr 4.0 (http://wiki.apache.org/solr/Join)
- use positions or payloads to connect key and value
- tell the forum your exact use-case with examples
Best regards, Karsten

-------- Original Message -------- Date: Mon, 24 Oct 2011 17:11:49 +0530 From: jame vaalet jamevaa...@gmail.com To: solr-user@lucene.apache.org Subject: indexing key value pair into lucene solr index

hi, in my use case I have a list of key-value pairs in each document object. If I index them as separate index fields, then in the result doc object I will get two arrays corresponding to my keys and values. The problem I face here is that there won't be any mapping between those keys and values. Do we have any easy way to index this data in Solr? thanks in advance ... -- -JAME
Re: indexing key value pair into lucene solr index
thanks karsten. Can we preserve order within an index field? If yes, I can index them separately and map them using their order.

On 24 October 2011 17:32, karsten-s...@gmx.de wrote: Hi Jame, you can:
- generate one token for each pair (key, value) -- key_value
- insert a gap between each pair and use phrase queries
- use the key as the field name (if you have a restricted set of keys)
- wait for joins in Solr 4.0 (http://wiki.apache.org/solr/Join)
- use positions or payloads to connect key and value
- tell the forum your exact use-case with examples
Best regards, Karsten

-------- Original Message -------- Date: Mon, 24 Oct 2011 17:11:49 +0530 From: jame vaalet jamevaa...@gmail.com To: solr-user@lucene.apache.org Subject: indexing key value pair into lucene solr index

hi, in my use case I have a list of key-value pairs in each document object. If I index them as separate index fields, then in the result doc object I will get two arrays corresponding to my keys and values. The problem I face here is that there won't be any mapping between those keys and values. Do we have any easy way to index this data in Solr? thanks in advance ... -- -JAME

-- -JAME
Re: indexing key value pair into lucene solr index
Hi Jame, on preserving order in index fields: if you don't want to use phrase queries on key or value, this order is simply the position. If you do use phrase queries, but no value has more than 50 tokens, you could also use positions and start each pair at position 100, 200, 300, ... Otherwise you could use payloads. Imho there is no standard way to connect the positions of two different fields; you would have to write your own Query. My tip: take org.apache.lucene.search.spans.TermSpans as a starting point and use the queryparser module. btw: normally there is a standard solution in Lucene for each problem, so please tell us more about your use-case and somebody may have an answer that saves you programming it on your own. Best regards, Karsten

-------- Original Message -------- Date: Mon, 24 Oct 2011 17:53:26 +0530 From: jame vaalet jamevaa...@gmail.com To: solr-user@lucene.apache.org Subject: Re: indexing key value pair into lucene solr index

thanks karsten. Can we preserve order within an index field? If yes, I can index them separately and map them using their order.

On 24 October 2011 17:32, karsten-s...@gmx.de wrote: Hi Jame, you can:
- generate one token for each pair (key, value) -- key_value
- insert a gap between each pair and use phrase queries
- use the key as the field name (if you have a restricted set of keys)
- wait for joins in Solr 4.0 (http://wiki.apache.org/solr/Join)
- use positions or payloads to connect key and value
- tell the forum your exact use-case with examples
Best regards, Karsten

-------- Original Message -------- Date: Mon, 24 Oct 2011 17:11:49 +0530 From: jame vaalet jamevaa...@gmail.com To: solr-user@lucene.apache.org Subject: indexing key value pair into lucene solr index

hi, in my use case I have a list of key-value pairs in each document object. If I index them as separate index fields, then in the result doc object I will get two arrays corresponding to my keys and values. The problem I face here is that there won't be any mapping between those keys and values. Do we have any easy way to index this data in Solr? thanks in advance ... -- -JAME

-- -JAME
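For the gap-between-pairs approach Karsten describes, the usual mechanism is a multivalued field whose type declares a positionIncrementGap, so each pair starts a fixed distance after the previous one; a minimal illustrative snippet (field and type names are hypothetical):

    <fieldType name="kv_text" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      </analyzer>
    </fieldType>
    <!-- one key/value pair per value of the multivalued field: pairs start
         100 positions apart, so phrase queries cannot match across pairs -->
    <field name="kv" type="kv_text" indexed="true" stored="true" multiValued="true"/>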
Re: indexing key value pair into lucene solr index
On Oct 24, 2011, at 1:41pm, jame vaalet wrote: hi, in my use case I have a list of key-value pairs in each document object. If I index them as separate index fields, then in the result doc object I will get two arrays corresponding to my keys and values. The problem I face here is that there won't be any mapping between those keys and values. Do we have any easy way to index this data in Solr? thanks in advance ...

As Karsten said, providing more detail about what you're actually trying to do usually makes for better and more helpful/accurate answers. But I'm guessing you only want to search on the key, not the value, right? If so, then:

1. Create a multi-valued field with a custom type, indexed and stored.
2. During indexing, add entries as key<tab>value.
3. In the custom type, set the index analyzer to strip off the <tab>value so you only index the key.

E.g.

    <fieldType name="key_value" class="solr.TextField" positionIncrementGap="100"
               autoGeneratePhraseQueries="true" omitTermFreqAndPositions="true" omitNorms="true">
      <analyzer type="index">
        <!-- Get rid of the <tab>value text at the end of each string -->
        <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\t\d+$" replacement=""/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
                catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
                catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
      </analyzer>
    </fieldType>

-- Ken

Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr
Re: help needed on solr-uima integration
Thanks Koji. I found it. I should find the solution there. Xue-Feng

From: Koji Sekiguchi k...@r.email.ne.jp To: solr-user@lucene.apache.org Sent: Monday, October 24, 2011 7:30:01 AM Subject: Re: help needed on solr-uima integration

(11/10/24 17:42), Xue-Feng Yang wrote: Hi, Where can I find test code for solr-uima component? You should find them under: solr/contrib/uima/src/test koji -- Check out "Query Log Visualizer" for Apache Solr http://www.rondhuit-demo.com/loganalyzer/loganalyzer.html http://www.rondhuit.com/en/
Re: java.lang.NoSuchMethodError: org.slf4j.spi.LocationAwareLogger.log
I am currently running into the exact same exception, but I'm not using Maven. What are my options to fix the issue? -- View this message in context: http://lucene.472066.n3.nabble.com/java-lang-NoSuchMethodError-org-slf4j-spi-LocationAwareLogger-log-tp3435001p3447968.html Sent from the Solr - User mailing list archive at Nabble.com.
Tag cloud from tweets
I have saved tweets related to some keywords in Solr. Can Solr be used to generate a tag cloud of important words from these tweets? Regards, Rohit
Re: Tag cloud from tweets
Sure. Just facet on a tokenized field of the tweet text. You'll want to tune the analysis configuration to suit your desires, but there's no problem getting counts back using facet=on&facet.field=tweet_text, kinda thing. Erik

On Oct 24, 2011, at 13:14, Rohit wrote: I have saved tweets related to some keywords in Solr. Can Solr be used to generate a tag cloud of important words from these tweets? Regards, Rohit
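A minimal illustrative request, assuming a field named tweet_text and the default select handler:

    http://localhost:8983/solr/select?q=*:*&rows=0&facet=on&facet.field=tweet_text&facet.limit=50

The per-term counts that come back can then be scaled into tag sizes on the client side.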
RE: Optimization /Commit memory
I have not spent a lot of time researching it, but one would expect the OS RAM requirement for optimization of an index to be minimal. My understanding is that during optimization an essentially new index is built. Once complete, it switches out the indexes and throws away the old one. (On Windows it may not throw away the old one until the next commit.) JRJ

-----Original Message----- From: Sujatha Arun [mailto:suja.a...@gmail.com] Sent: Friday, October 21, 2011 12:10 AM To: solr-user@lucene.apache.org Subject: Re: Optimization /Commit memory

Just one more thing: when we are talking about optimization, we are referring to HD free space for replicating the index (2 or 3 times the index size). What is the role of RAM (OS) here? Regards, Sujatha

On Fri, Oct 21, 2011 at 10:12 AM, Sujatha Arun suja.a...@gmail.com wrote: Thanks, that helps. Regards, Sujatha

On Thu, Oct 20, 2011 at 6:23 PM, Jaeger, Jay - DOT jay.jae...@dot.wi.gov wrote: Well, since the OS RAM includes the JVM RAM, that is part of your requirement, yes? Aside from the JVM and normal OS requirements, all you need OS RAM for is file caching. Thus, for updates, the OS RAM is not a major factor. For searches, you want sufficient OS RAM to cache enough of the index to get the query performance you need, and to cache queries inside the JVM if you get a lot of repeat queries (see solrconfig.xml for the various caches; we have not played with them much). So the amount of RAM necessary for that is very much dependent upon the size of your index, and I cannot give you a simple number. You seem to believe that you have to have sufficient memory to hold the entire index in memory. Except where extremely high performance is required, I have not found that to be the case. This is just one of those "your mileage may vary" things. There is not a single answer or formula that fits every situation. JRJ

-----Original Message----- From: Sujatha Arun [mailto:suja.a...@gmail.com] Sent: Wednesday, October 19, 2011 11:58 PM To: solr-user@lucene.apache.org Subject: Re: Optimization /Commit memory

Thanks Jay, I was trying to compute the *OS RAM requirement*, *not JVM RAM*, for a 14 GB index [the cumulative index size of all instances]. And I put it thus: the operating system RAM requirement for an index of 14 GB is the index size + 3 times the maximum index size of an individual instance, for optimize. That is to say, I have several instances with a combined index size of 14 GB. The maximum individual index size is 2.5 GB. So my requirement for OS RAM is 14 GB + 3 * 2.5 GB ~= 22 GB. Correct? Regards, Sujatha

On Thu, Oct 20, 2011 at 3:45 AM, Jaeger, Jay - DOT jay.jae...@dot.wi.gov wrote: Commit does not particularly spike disk or memory usage, unless you are adding a very large number of documents between commits. A commit can cause a need to merge indexes, which can increase disk space temporarily. An optimize is *likely* to merge indexes, which will usually increase disk space temporarily. How much disk space depends very much upon how big your index is in the first place. A 2 to 3 times factor of the sum of your peak index file size seems safe, to me. Solr uses only modest amounts of JVM memory for this stuff. JRJ

-----Original Message----- From: Sujatha Arun [mailto:suja.a...@gmail.com] Sent: Wednesday, October 19, 2011 4:04 AM To: solr-user@lucene.apache.org Subject: Optimization /Commit memory

Do we require 2 or 3 times OS RAM memory or hard disk space while performing commit or optimize, or both? What is the requirement in terms of size of RAM and HD for commit and optimize? Regards, Sujatha
some basic information on Solr
Hi all, I am doing a student project on search engine research. Right now I have some basic questions about Solr. 1. How many types of data file can Solr support (estimate)? i.e., the number of file types Solr can look at for indexing and searching. 2. What is the estimated cost of incidents per year for Solr? Since the numbers could vary across different platforms, we would like estimated answers for the general case. Thanks -- Dan Wu (Fiona Wu) 武丹 Master of Engineering Management Program Degree Candidate Duke University, North Carolina, USA Email: dan...@duke.edu Tel: 919-599-2730
RE: some basic information on Solr
1. Solr, proper, does not index files. An adjunct called Solr Cell can. See http://wiki.apache.org/solr/ExtractingRequestHandler . That article describes which kinds of files Solr Cell can handle.

2. I have no idea what you mean by "incidents per year". Please explain.

3. Even though you didn't ask: you are apparently a student at an advanced level. At your level I would guess that your professors expect *YOU* to read thru the material available on the Internet on Solr and figure it out on your own, rather than just asking others to do your work for you. ;^) In particular, before asking further questions you should probably read thru http://wiki.apache.org/solr/FrontPage and http://lucene.apache.org/solr/tutorial.html .

JRJ

-----Original Message----- From: Dan Wu [mailto:wudan1...@gmail.com] Sent: Monday, October 24, 2011 12:43 PM To: solr-user@lucene.apache.org Subject: some basic information on Solr

Hi all, I am doing a student project on search engine research. Right now I have some basic questions about Solr. 1. How many types of data file can Solr support (estimate)? i.e., the number of file types Solr can look at for indexing and searching. 2. What is the estimated cost of incidents per year for Solr? Since the numbers could vary across different platforms, we would like estimated answers for the general case. Thanks -- Dan Wu (Fiona Wu) 武丹 Master of Engineering Management Program Degree Candidate Duke University, North Carolina, USA Email: dan...@duke.edu Tel: 919-599-2730
RE: indexing key value pair into lucene solr index
Maybe put them in a single string field (or any other field type that is not analyzed -- certainly not text) using some character separator that will connect them, but won't confuse the Solr query parser? So maybe you start out with key value pairs of

    Key1 value1
    Key2 value2
    Key3 value3

preprocess them for indexing, and then index (and search for) them as, for example,

    Key1$value1
    Key2$value2
    Key3$value3

(You could also store their individual values in a separate field, of course.) JRJ

-----Original Message----- From: jame vaalet [mailto:jamevaa...@gmail.com] Sent: Monday, October 24, 2011 6:42 AM To: solr-user@lucene.apache.org Subject: indexing key value pair into lucene solr index

hi, in my use case I have a list of key-value pairs in each document object. If I index them as separate index fields, then in the result doc object I will get two arrays corresponding to my keys and values. The problem I face here is that there won't be any mapping between those keys and values. Do we have any easy way to index this data in Solr? thanks in advance ... -- -JAME
Re: org.apache.pdfbox.pdmodel.PDPage Error
Is this really a stumper? This is my first experience with Solr, and having spent only an hour or so with it I hit this barrier (below). I'm sure *I* am doing something completely wrong; just hoping someone more familiar with the platform can help me identify and fix it. For starters, what does "Could not initialize class ..." mean in Java, exactly? Maybe that the class (i.e. the code) itself doesn't exist? - so perhaps I haven't downloaded all the pieces of the project? Or could it be a hint that my kit is just not configured correctly? Sorry, I'm not a Java expert, but I would like to get this stabilized, if possible. If this is the wrong mailing list, just tell me and I'll go away... Thanks!

On Oct 20, 2011, at 2:54 PM, MBD wrote: Hi, I'm new to Solr and trying to get it to index PDFs, and I'm having trouble getting started. I'm following the examples in the ExtractingRequestHandler wiki http://wiki.apache.org/solr/ExtractingRequestHandler. I got Solr running and it indexes html, xml and txt files just fine, but when I try to feed it a .pdf it spits out an Error 500, "Could not initialize class org.apache.pdfbox.pdmodel.PDPage":

    $ curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true" -F myfile=@index.pdf
    <html>
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>
    <title>Error 500 Could not initialize class org.apache.pdfbox.pdmodel.PDPage</title>
    java.lang.NoClassDefFoundError: Could not initialize class org.apache.pdfbox.pdmodel.PDPage ...

I thought maybe it's because Tika isn't installed/included, so I tried downloading and installing Tika separately, but even the Tika install fails with:

    --- Test set: org.apache.tika.parser.pdf.PDFParserTest ---
    Tests run: 5, Failures: 0, Errors: 5, Skipped: 0, Time elapsed: 0.63 sec FAILURE!
    testVarious(org.apache.tika.parser.pdf.PDFParserTest) Time elapsed: 0.165 sec ERROR!
    java.lang.NoClassDefFoundError: Could not initialize class org.apache.pdfbox.pdmodel.PDPage

I don't know Java (and hopefully won't need to in order to get basic indexing up and running, as the ultimate goal is to query this via Sunspot from a Rails app), so go easy on me. Let me know if you want/need more of the error dump. Any help would be greatly appreciated! -Mike
Re: some basic information on Solr
JRJ, we did check the official Solr website but found it really technical. Since we are not on the developer side, we just want some basic information or numbers about its usage. Thanks for your answer, anyway.

2011/10/24 Jaeger, Jay - DOT jay.jae...@dot.wi.gov: 1. Solr, proper, does not index files. An adjunct called Solr Cell can. See http://wiki.apache.org/solr/ExtractingRequestHandler . That article describes which kinds of files Solr Cell can handle. 2. I have no idea what you mean by "incidents per year". Please explain. 3. Even though you didn't ask: you are apparently a student at an advanced level. At your level I would guess that your professors expect *YOU* to read thru the material available on the Internet on Solr and figure it out on your own, rather than just asking others to do your work for you. ;^) In particular, before asking further questions you should probably read thru http://wiki.apache.org/solr/FrontPage and http://lucene.apache.org/solr/tutorial.html . JRJ

-----Original Message----- From: Dan Wu [mailto:wudan1...@gmail.com] Sent: Monday, October 24, 2011 12:43 PM To: solr-user@lucene.apache.org Subject: some basic information on Solr

Hi all, I am doing a student project on search engine research. Right now I have some basic questions about Solr. 1. How many types of data file can Solr support (estimate)? i.e., the number of file types Solr can look at for indexing and searching. 2. What is the estimated cost of incidents per year for Solr? Since the numbers could vary across different platforms, we would like estimated answers for the general case. Thanks
joins and filter queries affecting scoring
I have 2 types of docs, users and posts. I want to view all the docs that belong to certain users by joining posts and users together. I have to filter the users with a filter query of is_active_boolean:true so that the score is not affected, but since I do a join, I have to move the filter query to the query parameter so that the filter gets applied. The problem is that since is_active_boolean is moved into the query, the score is affected, which returns an order that I don't want. If I leave is_active_boolean:true in the fq parameter, I get no results back. My question is: how can I apply a filter query to users so that the score is not affected?
Re: questions on query format
Thanks, q.alt=*:* worked for me -- how do I make sure that the standard query parser is configured? Thanks. MM.

On Mon, Oct 24, 2011 at 2:47 AM, Ahmet Arslan iori...@yahoo.com wrote: 2. If I send Solr the following query: q=*:* I get nothing, just: <response><result name="response" numFound="0" start="0" maxScore="0.0"/><lst name="highlighting"/></response> Would appreciate some insight into what is going on. If you are using dismax as the query parser, then *:* won't function as a match-all-docs query. To retrieve all docs - with dismax - use the q.alt=*:* parameter. Also, adding debugQuery=on will display information about the parsed query.
Is there a good web front end application / interface for solr
Greetings guys, Is there a good front-end application / interface for Solr? Features I'm looking for are:
- configure the query interface (using non-programmatic features)
- configure pagination
- configure bookmarking of results
- export results of a query to CSV or another format (JSON, etc.)
Is there any demand for such an application? Thanks.
DataImportHandler Nested Entities
Hi, I want to use Solr 3.1 to index the content of a website. Rather than using a web crawler to fetch the content and load it into Solr, I want to use the DIH to get the data from the Content Management Database that supports the website. It would be possible to write SQL to obtain a complete set of metadata (for example DC.subject or DC.type) for each page or binary document stored in the database, using the JdbcDataSource. One of the values obtained would be the HTTP URL of the actual page or document, and I would need to fetch and index this content as well. Could you tell me if it's possible to nest entities that use a URLDataSource inside entities that use a JdbcDataSource? Andy
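Nesting a URLDataSource entity inside a JdbcDataSource entity is a supported DIH pattern; a hedged sketch of what data-config.xml could look like (table, column, and field names are invented for illustration):

    <dataConfig>
      <dataSource name="db" type="JdbcDataSource"
                  driver="com.mysql.jdbc.Driver" url="jdbc:mysql://host/cms"/>
      <dataSource name="web" type="URLDataSource"/>
      <document>
        <entity name="page" dataSource="db"
                query="SELECT id, subject, url FROM pages">
          <!-- fetch the live page via the URL from the outer (JDBC) row -->
          <entity name="content" dataSource="web"
                  processor="PlainTextEntityProcessor" url="${page.url}">
            <field column="plainText" name="content"/>
          </entity>
        </entity>
      </document>
    </dataConfig>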
Re: questions on query format
q.alt=*:* worked for me -- how do I make sure that the standard query parser is configured? You can append &defType=lucene to your search URL. A more permanent way is to set a default defType parameter in solrconfig.xml.
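A sketch of the second option, defaulting defType inside a request handler in solrconfig.xml (the handler name is an assumption; use whichever handler serves your searches):

    <requestHandler name="/select" class="solr.SearchHandler">
      <lst name="defaults">
        <str name="defType">lucene</str>
      </lst>
    </requestHandler>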
A sort-by-geodist question
Hi, I've started to use Solr to build up a search service, but I have encountered a problem. When I use this URL, it always returns *sort param could not be parsed as a query, and is not a field that exists in the index: geodist()*

    http://localhost:8080/solr/select/?indent=true&fl=name,coordinates&q=*:*&sfield=coordinates&pt=45.15,-93.85&sort=geodist()%20asc

It works only when I specify coordinates in geodist():

    http://localhost:8080/solr/select/?indent=true&fl=name,coordinates&q=*:*&sfield=coordinates&pt=45.15,-93.85&sort=geodist(45.15,-93.85)%20asc

And even then the returned documents don't seem to be ranked by distance according to the criteria. My Lucene is 3.4. The field 'coordinates' is in geohash format. Can anyone here give me some pointers? Thank you very much. Yung-chung Lin
Solr main query response input to facet query
Hi, I am implementing a Solr solution where I want to use some field values from the main query output as input for building facets. How do I do that? E.g., response from the main query:

    <doc>
      <str name="name">name1</str>
      <int name="prod_id">200</int>
    </doc>
    <doc>
      <str name="name">name1</str>
      <int name="prod_id">400</int>
    </doc>

I want to build facets for the query where prod_id:200 OR prod_id:400. Ideally I'd like to do all this in a single query; if it can't be done in one query, I am OK with 2 queries as well. Please help. Thanks -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-main-query-response-input-to-facet-query-tp3449938p3449938.html Sent from the Solr - User mailing list archive at Nabble.com.
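If two queries are acceptable, a rough sketch of the second request: collect the prod_id values from the main response, then filter on them and facet (the facet field and host are placeholders, not from the original question):

    http://localhost:8983/solr/select?q=prod_id:(200 OR 400)&rows=0&facet=on&facet.field=your_facet_field

(with the q value URL-encoded in practice).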
Re: Sorting fields with letters?
Tried using the ord() function, but it was the same as the standard sort. Do I just need to bite the bullet and reindex everything? Thanks! Pete

On Oct 21, 2011, at 5:26 PM, Tomás Fernández Löbbe wrote: I don't know if you'll find exactly what you need, but you can sort by any field or FunctionQuery. See http://wiki.apache.org/solr/FunctionQuery

On Fri, Oct 21, 2011 at 7:03 PM, Peter Spam ps...@mac.com wrote: Is there a way to use a custom sorter, to avoid re-indexing? Thanks! Pete

On Oct 21, 2011, at 2:13 PM, Tomás Fernández Löbbe wrote: Well, yes. You probably have a string field for that content, right? So the content is being compared as strings, not as numbers; that's why something like 1000 is lower than 2. Leading zeros would be an option. Another option is to separate the field into numeric fields and sort by those (this last option is only recommended if your data always looks similar). Something like 11C15 to field1: 11, field2: C, field3: 15. Then use sort=field1,field2,field3. Anyway, both these options require reindexing. Regards, Tomás

On Fri, Oct 21, 2011 at 4:57 PM, Peter Spam ps...@mac.com wrote: Hi everyone, I have a field that has a letter in it (for example, 1A1, 2A1, 11C15, etc.). Sorting it seems to work most of the time, except for a few things, like 10A1 is lower than 8A100, and 10A100 is lower than 10A99. Any ideas? I bet if my data had leading zeros (i.e. 10A099), it would behave better? (But I can't really change my data now, as it would take a few days to re-inject - which is possible, but a hassle.) Thanks! Pete