Indexing multiple documents in Solr/SolrCell
Hi, I am new to this forum and would like to know if the function described below has been developed or exists in Solr. If it does not exist, is it a good idea and can I contribute? We need to index multiple documents with different formats, so we use Solr with Tika (Solr Cell). Question: can you index both metadata and content for multiple documents iteratively in Solr? For example, I have an XML file with metadata and links to the documents' content. There are many documents in this XML and I would like to index them all without firing multiple URLs. Example of the XML:

<add>
  <doc>
    <field name="id">34122</field>
    <field name="author">Michael</field>
    <field name="size">3MB</field>
    <field name="URL">URL of the document</field>
  </doc>
  <doc>...</doc>
  ...
  <doc>...</doc>
</add>

(documents 2 through N follow the same pattern). I need to index all these documents by sending this XML in a single request. The collection of documents to be indexed could be on a file system. I have altered the Solr code to be able to do this, but is there an already existing feature?
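(Until such a batch feature exists in Solr itself, one client-side workaround is to parse the metadata XML in a small program and fire one extract request per document. Below is a rough SolrJ sketch, not the poster's actual code: the Solr URL, the /update/extract handler path and the id/author/file values are assumptions, and the exact ContentStreamUpdateRequest API can differ slightly between SolrJ versions.)

import java.io.File;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class BatchExtractPoster {
    public static void main(String[] args) throws Exception {
        // Hypothetical (id, author, file) tuples parsed from the metadata XML
        String[][] docs = {
            {"34122", "Michael", "/data/docs/34122.pdf"},
            {"34123", "Anna",    "/data/docs/34123.doc"}
        };
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        for (String[] d : docs) {
            ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
            req.addFile(new File(d[2]));          // document content handed to Tika
            req.setParam("literal.id", d[0]);     // metadata passed as literal fields
            req.setParam("literal.author", d[1]);
            server.request(req);
        }
        server.commit();
    }
}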
Re: Tika trouble
Does anyone have a clue?

List, I somehow fail to index certain PDF files using the ExtractingRequestHandler in Solr 1.4 with the default solrconfig.xml but a modified schema. I have a very simple schema for this case, using only an ID field, a timestamp field and two dynamic fields, ignored_* and attr_*, both indexed, stored and multivalued strings. They are multivalued simply because some HTML files fail when storing multiple hyperlinks. I have posted multiple files to http://.../update/extract?literal.id=doc1 including:

1. the whitepaper at http://www.lucidimagination.com/whitepaper/whats-new-in-lucene-2-9?sc=AP
2. the html file of the frontpage of http://nu.nl/
3. another pdf at http://www.google.nl/url?sa=tsource=webct=rescd=1ved=0CAcQFjAAurl=http%3A%2F%2Fcsl.stanford.edu%2F~christos%2Fpublications%2F2007.cmp_mapreduce.hpca.pdfrct=jq=2007.cmp_mapreduce.hpca.pdfei=PPz7SpiiOM6l4QbZjKjRAwusg=AFQjCNHs-olxbUQrGCXpNMHfcZvY8aMk8A

For each document I have a corresponding select/?q=*:* result:

1. No text? Should I see something?

<doc>
  <str name="id">doc1</str>
  <arr name="ignored_content_type"><str>application/octet-stream</str></arr>
  <arr name="ignored_stream_content_type"><str>text/xml; charset=UTF-8; boundary=cf57b4ad644d</str></arr>
  <arr name="ignored_stream_size"><str>491238</str></arr>
  <arr name="ignored_text"><str/></arr>
  <date name="timestamp">2009-11-12T12:17:23.016Z</date>
</doc>

2. Plenty of data, this seems to be ok:

<doc>
  <str name="id">doc1</str>
  <arr name="ignored_content_type"><str>application/xhtml+xml</str></arr>
  <arr name="ignored_links">
    <str>http://www.nu.nl/</str>
    <str>http://www.nu.nl/</str>
    <str>http://www.nu.nl/algemeen/</str>
    <str>http://www.nu.nl/economie/</str>
  </arr>
  <arr name="ignored_stream_content_type"><str>text/xml; charset=UTF-8; boundary=b6e44d087bdd</str></arr>
  <arr name="ignored_stream_size"><str>36991</str></arr>
  <arr name="ignored_text"><str>A LOT OF TEXT HERE</str></arr>
  <date name="timestamp">2009-11-12T12:19:15.415Z</date>
</doc>

3. A lot of garbage:

<doc>
  <str name="id">doc1</str>
  <arr name="ignored_content_encoding"><str>windows-1252</str></arr>
  <arr name="ignored_content_language"><str>fr</str></arr>
  <arr name="ignored_content_type"><str>text/plain</str></arr>
  <arr name="ignored_language"><str>fr</str></arr>
  <arr name="ignored_stream_content_type"><str>text/xml; charset=UTF-8; boundary=83df0fd4d358</str></arr>
  <arr name="ignored_stream_size"><str>361458</str></arr>
  <arr name="ignored_text"><str>A LOT OF GARBAGE HERE including ió½·Þp™ó 40› š©xÓ ^CøùI3람š³î¨V ÚÜ¡yS4 ¹£ ² ›H 6õɨ5¤ ÅÜ磩bädÒøŸ\ �s%OîÐÙIÑYRäŠ ;4 ¢9r —!rEôˆÌ {SìûD²à £©ïœ«{‘ínÆ N÷ô¥F»�™ ±¡Ë'ú\³=·m„Þ »ý)³Å=j¶B¢)` Ñ „Ï™hjCu{£É5{¢¯ç6½Ñhr¢ºÃ=J M- AqsøtÜì ÿ^Rl S?¿óšM‰—lv‘Ø›Qüãý´ þžŽ $S;¾¦wze³Ù)qÉú§ ‰› ãqó…Ó ‰ªU:šBÝ‘GuŠë MM±Òv �~ ‚N‹t¢ä§~Ì ÞŒS—Êòö¼ÊÄQaº¸¿7tñ ¾Áç œãØŒ58$O 3Å~�8¿L ‡ëŽó©pk_ Ša Â=u×; (ä�...@.œ÷ä ù° µk+ÿ PP~ ¨*ݤ¿Œ™¡D» @fI$0°�Î Ù·p“Œ,Øâ †¶v ¤v1#8¼0 › èð€-†šZ 6¾ ! ñb ˆbˆ¤v)LS)T X² ¬ l...@€ 6E$Q endstream endobj 137 0 obj/Type/Encoding/BaseEncoding/WinAnsiEncoding/Differences[1/W/o/r/d/C/u/n/t/M/a/i/x/l/S/g/c/h/K/m/e/s/R/v/I/P/A/H/L/space/p] endobj 138 0 obj/Type/FontDescriptor/FontFile2 136 0 R/FontBBox[0 -210 942 728]/FontName/WQHWKD+TTE31911E0t00/Flags 4/MissingWidth 750/StemV 141/CapHeight 728/Ascent 728/Descent -210/ItalicAngle 0 endobj 139 0 obj/Count 12/Kids[140 0 R 141 0 R]/Type/Pages endobj 140 0 obj/Count 6/Kids[147 0 R 1 0 R 4 0 R 7 0 R 22 0 R 25 0 R]/Type/Pages/Parent 139 0 R endobj 141 0 obj/Count 6/Kids[39 0 R 42 0 R 45 0 R 82 0 R 92 0 R 122 0 R]/Type/Pages/Parent</str></arr>
  <date name="timestamp">2009-11-12T12:21:28.306Z</date>
</doc>

Any ideas? Why doesn't the whitepaper produce any results, and why is the other pdf full of garbage?
At least I'm happy that HTML works fine.

Regards,

-
Markus Jelsma          Buyways B.V.
Technisch Architect    Friesestraatweg 215c
http://www.buyways.nl  9743 AD Groningen
Alg. 050-853 6600      KvK 01074105
Tel. 050-853 6620      Fax. 050-3118124
Mob. 06-5025 8350      In: http://www.linkedin.com/in/markus17
Re: Tika trouble
What I could try to say is that if you want to index a PDF, then you should use a PDF extractor. A PDF extractor is able to extract the text content and the metadata of the files. I suppose you have just opened and indexed the PDF as is, so you stored binary data and stopped. For my application I've used PdfExtractor, but the PDFBox project could also be used.

Antonio

2009/11/16 Markus Jelsma - Buyways B.V. mar...@buyways.nl
> [...]
--
Antonio Calò
Re: Indexing multiple documents in Solr/SolrCell
Hi,

the problem you've described -- an integration of DataImportHandler (to traverse the XML file and get the document URLs) and Solr Cell (to extract content afterwards) -- is already addressed in issue SOLR-1358 (https://issues.apache.org/jira/browse/SOLR-1358).

Best,
Sascha

Kerwin wrote:
> [...]
Re: javabin in .NET?
Yep, I think I mostly nailed the unmarshalling. Need more tests though. And then integrate it into SolrNet. Is there any way (or are there any plans) to have an update handler that accepts javabin?

2009/11/16 Noble Paul നോബിള് नोब्ळ् noble.p...@corp.aol.com
> start with a JavabinDecoder only so that the class is simple to start with.

2009/11/16 Noble Paul നോബിള് नोब्ळ् noble.p...@corp.aol.com:
> For a client the marshal() part is not important. unmarshal() is probably all you need.

On Sun, Nov 15, 2009 at 12:36 AM, Mauricio Scheffer mauricioschef...@gmail.com wrote:
> Original code is here: http://bit.ly/hkCbI
> I just started porting it here: http://bit.ly/37hiOs
> It needs: tests/debugging, porting NamedList, SolrDocument, SolrDocumentList.
> Thanks for any help! Cheers, Mauricio

2009/11/14 Noble Paul നോബിള് नोब्ळ् noble.p...@corp.aol.com
> OK. Is there anyone trying it out? Where is this code? I can try to help.

On Fri, Nov 13, 2009 at 8:10 PM, Mauricio Scheffer mauricioschef...@gmail.com wrote:
> I meant the standard IO libraries. They are different enough that the code has to be manually ported. There were some automated tools back when Microsoft introduced .NET, but IIRC they never really worked. Anyway it's not a big deal, it should be a straightforward job. Testing it thoroughly cross-platform is another thing though.

2009/11/13 Noble Paul നോബിള് नोब्ळ् noble.p...@corp.aol.com
> The javabin format does not have many dependencies. It may have 3-4 classes and that is it.

On Fri, Nov 13, 2009 at 6:05 PM, Mauricio Scheffer mauricioschef...@gmail.com wrote:
> Nope. It has to be manually ported. Not so much because of the language itself but because of differences in the libraries.

2009/11/13 Noble Paul നോബിള് नोब्ळ् noble.p...@corp.aol.com
> Is there any tool to directly port Java to .NET? Then we can extract out the client part of the javabin code and convert it.

On Thu, Nov 12, 2009 at 9:56 PM, Erik Hatcher erik.hatc...@gmail.com wrote:
> Has anyone looked into using the javabin response format from .NET (instead of SolrJ)? It's mainly a curiosity. How much better could performance/bandwidth/throughput be? How difficult would it be to implement some .NET code (C#, I'd guess being the best choice) to handle this response format?
> Thanks, Erik

--
Noble Paul | Principal Engineer | AOL | http://aol.com
Re: Tika trouble
Thank you for your reply. I had the assumption Tika could also extract text content from various document types instead of only metadata. I'll use the CLI tools from http://www.foolabs.com/xpdf/ to extract text manually.

-
Markus Jelsma          Buyways B.V.
Technisch Architect    Friesestraatweg 215c
http://www.buyways.nl  9743 AD Groningen
Alg. 050-853 6600      KvK 01074105
Tel. 050-853 6620      Fax. 050-3118124
Mob. 06-5025 8350      In: http://www.linkedin.com/in/markus17

On Mon, 2009-11-16 at 12:06 +0100, Antonio Calò wrote:
> [...]
EmbeddedSolrServer: java.lang.NoClassDefFoundError: javax/servlet/ServletRequest
Hi, I'm a newbie using Solr and I'd like to run some tests against our data set. I have successfully tested Solr + Cell using the standard HTTP Solr server and now we need to test the embedded solution, but when I try to start the embedded server I get this exception:

INFO: registering core:
Exception in thread Thread-1 java.lang.NoClassDefFoundError: javax/servlet/ServletRequest
        at org.apache.solr.servlet.SolrRequestParsers.<init>(SolrRequestParsers.java:94)
        at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.<init>(EmbeddedSolrServer.java:90)
        at petrobras.ep.solrindexer.Embedded$1.run(Embedded.java:25)
Caused by: java.lang.ClassNotFoundException: javax.servlet.ServletRequest
        at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
        at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)

Does the EmbeddedSolrServer depend on servlet-api? I'm also facing a lack of documentation about EmbeddedSolrServer -- is all the documentation at http://wiki.apache.org/solr/Solrj#EmbeddedSolrServer?

Thanks in advance!

[ ]'s

Leonardo da S. Souza
°v°   Linux user #375225
/(_)\  http://counter.li.org/
^ ^
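(For reference: yes, EmbeddedSolrServer does pull in servlet API classes, so servlet-api.jar -- or the jar your container provides it in -- has to be on the classpath even though no servlet container is running. A minimal startup sketch, roughly following the SolrJ wiki of that era; the solr home path and the empty core name for a single-core setup are assumptions:)

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
import org.apache.solr.core.CoreContainer;

public class EmbeddedExample {
    public static void main(String[] args) throws Exception {
        // solr.solr.home must point at a directory containing conf/solrconfig.xml etc.
        System.setProperty("solr.solr.home", "/path/to/solr/home");
        CoreContainer.Initializer initializer = new CoreContainer.Initializer();
        CoreContainer container = initializer.initialize();
        // "" selects the default core in a single-core setup
        SolrServer server = new EmbeddedSolrServer(container, "");
        System.out.println("ping status: " + server.ping().getStatus());
        container.shutdown();
    }
}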
Experiences from migrating from FAST to Solr
We'd like to share with the solr users a recent news item from http://sesat.no. Sesam has spent some three months migrating all its indexes from FAST to Solr+Lucene. It was a joyful experience and allowed us to implement a number of improvements we never could under FAST. We've written a review on the whole process to help others wishing to take the same steps: http://sesat.no/moving-from-fast-to-solr-review.html And we've released (under the LGPLv3 license) our own Sesat document processing framework, which is compatible with the FAST document processing framework: http://sesat.no/documentprocessor.html mrtn
RE: solr stops running periodically
> By that I mean that the java/tomcat process just disappears.

I had a similar problem when I started Tomcat via SSH and then closed the SSH session improperly, without an exit command. In some cases (OutOfMemory) there is not enough memory left to write the log (or the CPU can be so overloaded by the garbage collector that you would have to wait a few days until the log is written), but the process can't disappear... A process can't simply disappear. If it is a JVM crash you should see a dump file (you may need to set a specific JVM option to generate a dump file in case of a crash).

-Original Message-
From: athir nuaimi [mailto:at...@nuaim.com] Sent: November-15-09 1:46 PM To: solr-user@lucene.apache.org Subject: solr stops running periodically

We have 4 machines running solr. On one of the machines, every 2-3 days solr stops running. By that I mean that the java/tomcat process just disappears. If I look at the catalina logs, I see normal log entries and then nothing. There are no shutdown messages like you would normally see if you sent a SIGTERM to the process. Obviously this is a problem. I'm new to solr/java, so if there are more diagnostic things I can do I'd appreciate any tips/advice.

thanks in advance
Athir
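(If a JVM crash or OutOfMemoryError is suspected, it can help to start the JVM with options that leave evidence behind. These are standard HotSpot options; the paths are placeholders to adapt, e.g. via JAVA_OPTS/CATALINA_OPTS:)

-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/var/log/tomcat/dumps
-XX:ErrorFile=/var/log/tomcat/hs_err_pid%p.log

(Also, a Tomcat started from an interactive SSH session can die with that session; starting it through the service scripts or nohup avoids that.)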
Solr - Load Increasing.
Hi All,

My solr box's CPU utilization is increasing, between 60 and 90%, and sometimes Solr goes down and we restart it manually.

No of documents in solr: 30 laks.
No of add/update requests to solr: 30 thousand / day. Avg of every 30 minutes, around 500 writes.
No of search requests: 9 laks / day.
Size of the data directory: 4gb.
My system ram is 8gb. System available space 12gb. Processor family: Pentium Pro.

Our solr data size can increase to around 90 laks of documents, and writes per day will be around 1 lak. - Hope it's possible with solr.

For write commit I have configured:

<autoCommit>
  <maxDocs>1</maxDocs>
  <maxTime>10</maxTime>
</autoCommit>

Is all of the above possible? 90 laks of data, 1 lak of writes per day and 30 laks of reads per day? - If yes, what type of system configuration would be required?

Please suggest us.

thanks,
Kalidoss.m,

Get your world in your inbox! Mail, widgets, documents, spreadsheets, organizer and much more with your Sifymail WIYI id! Log on to http://www.sify.com ** DISCLAIMER ** Information contained and transmitted by this E-MAIL is proprietary to Sify Limited and is intended for use only by the individual or entity to which it is addressed, and may contain information that is privileged, confidential or exempt from disclosure under applicable law. If this is a forwarded message, the content of this E-MAIL may not have been sent with the authority of the Company. If you are not the intended recipient, an agent of the intended recipient or a person responsible for delivering the information to the named recipient, you are notified that any use, distribution, transmission, printing, copying or dissemination of this information in any way or in any manner is strictly prohibited. If you have received this communication in error, please delete this mail and notify us immediately at ad...@sifycorp.com
Re: DataImportHandler Questions-Load data in parallel and temp tables
On Mon, Nov 16, 2009 at 6:25 PM, amitj am...@ieee.org wrote:
> Is there also a way we can include some kind of annotation on the schema field and send the data retrieved for that field to an external application? We have a requirement where we need some data fields (out of the fields for an entity defined in data-config.xml) to act as entities for entity extraction and auto-complete purposes, and we are using an external application.

No, it is not possible in Solr now.

Noble Paul നോബിള് नोब्ळ् wrote:
> Writing to a remote Solr through SolrJ is in the cards. I may even take it up after the 1.4 release. For now your best bet is to override the class SolrWriter and override the corresponding methods for add/delete.

2009/4/27 Amit Nithian anith...@gmail.com:
> All, I have a few questions regarding the data import handler. We have some pretty gnarly SQL queries to load our indices and our current loader implementation is extremely fragile. I am looking to migrate over to the DIH; however, I am looking to use SolrJ + EmbeddedSolr + some custom stuff to remotely load the indices so that my index loader and main search engine are separated. Currently, unless I am missing something, the data gathering from the entity and the data processing (i.e. conversion to a Solr Document) is done sequentially, and I was looking to make this execute in parallel so that I can have multiple threads processing different parts of the result set and loading documents into Solr. Secondly, I need to create temporary tables to store results of a few queries and use them later for inner joins, and was wondering how best to go about this. I am thinking to add support in DIH for the following:
> 1) Temporary tables (maybe call them temporary entities)? -- Specific only to SQL, though, unless it can be generalized to other sources.
> 2) Parallel support - including some mechanism to get the number of records (whether it be count or MAX(custom_id)-MIN(custom_id)).
> 3) Support in DIH or Solr to post documents to a remote index (i.e. create a new UpdateHandler instead of DirectUpdateHandler2).
> If any of these exist or anyone else is working on this (OR you have better suggestions), please let me know. Thanks! Amit

--
--Noble Paul

--
View this message in context: http://old.nabble.com/DataImportHandler-Questions-Load-data-in-parallel-and-temp-tables-tp23266396p26371403.html
Sent from the Solr - User mailing list archive at Nabble.com.

--
Noble Paul | Principal Engineer | AOL | http://aol.com
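(For the remote-posting part, the general shape of the workaround, independent of DIH internals, is simply to build SolrInputDocuments in your own loader and push them to the remote index over SolrJ. A rough sketch with a hypothetical result-set walker; host, field names and batch size are assumptions:)

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class RemoteLoader {
    public static void main(String[] args) throws Exception {
        SolrServer remote = new CommonsHttpSolrServer("http://indexer-host:8983/solr");
        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (int id = 0; id < 10000; id++) {      // stand-in for walking a JDBC ResultSet
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", id);
            doc.addField("name", "row " + id);
            batch.add(doc);
            if (batch.size() == 500) {            // post in chunks to limit memory use
                remote.add(batch);
                batch.clear();
            }
        }
        if (!batch.isEmpty()) remote.add(batch);
        remote.commit();
    }
}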
Re: javabin in .NET?
On Mon, Nov 16, 2009 at 5:55 PM, Mauricio Scheffer mauricioschef...@gmail.com wrote:
> Yep, I think I mostly nailed the unmarshalling. Need more tests though. And then integrate it into SolrNet. Is there any way (or are there any plans) to have an update handler that accepts javabin?

There is already one: look at BinaryRequestWriter. But I would say that may not make a lot of difference, as indexing is a back-end operation and slight perf improvements won't make much difference.

> [...]

--
Noble Paul | Principal Engineer | AOL | http://aol.com
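(For reference, on the Java side the binary update path is switched on in the client roughly like this; a sketch against the 1.4-era SolrJ API, assuming the binary update handler -- /update/javabin in the example solrconfig.xml -- is registered on the server:)

import org.apache.solr.client.solrj.impl.BinaryRequestWriter;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class JavabinUpdateExample {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        // Send updates as javabin instead of XML
        server.setRequestWriter(new BinaryRequestWriter());
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        server.add(doc);
        server.commit();
    }
}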
Re: Solr 1.3 query and index perf tank during optimize
Otis Gospodnetic otis_gospodne...@yahoo.com wrote on 11/13/2009 11:15:43 PM: Let's take a step back. Why do you need to optimize? You said: As long as I'm not optimizing, search and indexing times are satisfactory. :) You don't need to optimize just because you are continuously adding and deleting documents. On the contrary! That's a fair question. Basically, search entries are keyed to other documents. We have finite storage, so we purge old documents. My understanding was that deleted documents still take space until an optimize is done. Therefore, if I don't optimize, the index size on disk will grow without bound. Am I mistaken? If I don't ever have to optimize, it would make my life easier. Thanks, Jerry
Re: Stop solr without losing documents
On Fri, Nov 13, 2009 at 4:09 PM, Chris Hostetter hossman_luc...@fucit.org wrote:
> please don't kill -9 ... it's grossly overkill, and doesn't give your
> [ ... snip ... ]
> Alternately, you could take advantage of the enabled feature from your client (just have it test the enabled URL every N updates or so) and when it sees that you have disabled the port it can send one last commit and then stop sending updates until it sees the enabled URL work again -- as soon as you see the updates stop, you can safely shut down the port.

Thanks, Hoss. I'll use Catalina stop instead of kill -9. It's good to know about the enabled feature -- my team was just discussing whether something like that existed that we could use -- but as we'd also like to recover cleanly from power failures and other Solr terminations, I think we'll track which docs are uncommitted outside of Solr.

Michael
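(A sketch of the client-side half of that pattern: poll the ping/healthcheck URL every N updates, and when it stops answering, send one final commit and stop feeding. The URLs, the use of /solr/admin/ping and the feed loop are assumptions; the real healthcheck path is whatever solrconfig.xml defines:)

import java.net.HttpURLConnection;
import java.net.URL;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class GracefulFeeder {

    // True while the ping/healthcheck handler answers HTTP 200 (i.e. the server is "enabled").
    static boolean solrEnabled(String pingUrl) {
        try {
            HttpURLConnection c = (HttpURLConnection) new URL(pingUrl).openConnection();
            return c.getResponseCode() == 200;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
        String ping = "http://localhost:8983/solr/admin/ping";
        for (int i = 0; i < 100000; i++) {          // stand-in for the real feed loop
            if (i % 100 == 0 && !solrEnabled(ping)) {
                solr.commit();                      // flush whatever has been sent so far
                break;                              // stop feeding until re-enabled
            }
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", i);
            solr.add(doc);
        }
    }
}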
ext3 vs ext4 vs xfs for solr....recommendations needed...
Folks: For those of you experienced linux-solr hands, I am seeking recommendations for which file system you think would work best with Solr. We are currently running Ubuntu 9.04 on an Amazon EC2 instance; the default file system, I think, is ext3. I am seeking, of course, to ensure good performance with stability. What I have been reading is that ext4 may be a little too bleeding edge... but I defer to those of you who know more about this.

Thanks,
- Bill
Re: Stop solr without losing documents
On Fri, Nov 13, 2009 at 11:02 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote:
> So I think the question is really: If I stop the servlet container, does Solr issue a commit in the shutdown hook in order to ensure all buffered docs are persisted to disk before the JVM exits.

Exactly right, Otis.

> I don't have the Solr source handy, but if I did, I'd look for Shutdown, Hook and finalize in the code.

Thanks for the direction. There was some talk of close()ing a SolrCore that I found, but I don't believe this meant a commit. I somehow hadn't thought of actually *trying* to add a doc and then shut down a Solr instance; shame on me. Unfortunately, when I test this via

 * make a new solr
 * add a doc
 * commit
 * verify it shows up in a search -- it does
 * add a 2nd doc
 * shutdown

solr doesn't stop. It stops accepting connections, but java refuses to actually die. Not sure what we're doing wrong on our end, but I see this frequently and end up having to do a kill (usually not -9!). I guess we'll stick with externally tracking which docs have committed, so that when we inevitably have to kill Solr it doesn't cause a problem.

Michael
Re: Stop solr without losing documents
On Fri, Nov 13, 2009 at 11:45 PM, Lance Norskog goks...@gmail.com wrote: I would go with polling Solr to find what is not yet there. In production, it is better to assume that things will break, and have backstop janitors that fix them. And then test those janitors regularly. Good idea, Lance. I certainly agree with the idea of backstop janitors. We don't have a good way of polling Solr for what's in there or not -- we have a kind of asynchronous, multithreaded updating system sending docs to Solr -- but we always can find out *externally* which docs have been committed or not. Michael
Re: ext3 vs ext4 vs xfs for solr....recommendations needed...
William Pierce wrote:
> [...]

I'd prob stick to ext3 - there appear to be quite a few wins in terms of access speed, but ext4 has some sort of issue with writes - I think it involves fsync, which Lucene/Solr uses for an index commit. If you have Lucene's autocommit turned on (off by default, and removed in Lucene 3.0), the speed on ext4 is just hammered for indexing. It's not so bad without autocommit (fewer fsyncs, as they should only occur on Solr commits), but it makes the upgrade less compelling certainly. You can see the hit in this SQLite insert test - I'm guessing it's the same issue: http://www.phoronix.com/scan.php?page=articleitem=ext4_btrfs_nilfs2num=2

--
Mark
http://www.lucidimagination.com
Index time boosting troubles
Hi, I had working index-time boosting on documents, like so:

<doc boost="10.0">

Everything was great until I made some changes that I thought were not related to the doc boost, but after that my doc boosting appears to be missing. I'm having a tough time debugging this and didn't have the sense to put it under version control so I would have something to revert to (lesson learned). In schema.xml I have:

<fieldType name="float" class="solr.FloatField" omitNorms="false"/>

Is there something else I should be watching out for? Some query parameter perhaps? Or something else? I think wildcards in the query affect it, but I don't have any; some setting in solrconfig.xml or schema.xml?

Thanks!
Jon
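(For reference, the two places an index-time boost usually gets lost are the update message itself and omitNorms on the boosted fields. A quick way to sanity-check the update side from SolrJ -- field names here are hypothetical; the XML equivalent is the doc boost attribute shown above:)

import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BoostCheck {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrInputDocument doc = new SolrInputDocument();
        doc.setDocumentBoost(10.0f);            // same effect as <doc boost="10.0">
        doc.addField("id", "boosted-1");
        doc.addField("title", "hello", 2.0f);   // field-level boost; needs omitNorms="false" on the field
        server.add(doc);
        server.commit();
    }
}

(Checking a query with debugQuery=on and looking at the fieldNorm component of the explain output is usually the quickest way to see whether the boost actually made it into the index.)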
Re: Some guide about setting up local/geo search at solr
Localsolr is not in contrib yet. I am interested in knowing whether there is currently a better solution for setting up a local search. Cheers.

On Sun, Nov 15, 2009 at 9:25 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote:
> Nota bene: My understanding is the external versions of Local Lucene/Solr are eventually going to be deprecated in favour of what we have in contrib. Here's a stub page with a link to the spatial JIRA issue: http://wiki.apache.org/solr/SpatialSearch
> Otis
> --
> Sematext is hiring -- http://sematext.com/about/jobs.html?mls Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR

- Original Message From: Bertie Shen bertie.s...@gmail.com To: solr-user@lucene.apache.org Sent: Sat, November 14, 2009 3:32:01 AM Subject: Some guide about setting up local/geo search at solr

Hey, I spent some time figuring out how to set up local/geo/spatial search in Solr. I hope the following description can help, given the current status.

1) Download localsolr. I downloaded it from http://developer.k-int.com/m2snapshots/localsolr/localsolr/1.5/ and put the jar file (in my case, localsolr-1.5.jar) in your application's WEB-INF/lib directory of the application server.

2) Download locallucene. I downloaded it from http://sourceforge.net/projects/locallucene/ and put the jar file (in my case, locallucene.jar in the locallucene_r2.0/dist/ directory) in your application's WEB-INF/lib directory of the application server. I also needed to copy gt2-referencing-2.3.1.jar, geoapi-nogenerics-2.1-M2.jar, and jsr108-0.01.jar from the locallucene_r2.0/lib/ directory to WEB-INF/lib. Do not copy lucene-spatial-2.9.1.jar from the Lucene codebase: the namespace has been changed from com.pjaol.blah.blah.blah to org.apache.blah blah.

3) Update your solrconfig.xml and schema.xml. I copied them from http://www.gissearch.com/localsolr.

4) Restart the application server and try a query like /solr/select?qt=geo&lat=xx.xx&long=yy.yy&q=abc&radius=zz.
$DeleteDocbyQuery in solr 1.4 is not working
Hi, I have added a deleted field to my database, and am using the DataImportHandler to add rows to the index. I am using Solr 1.4. I have added the deleted field to the query and the RegexTransformer, with the field definition below:

<field column="$deleteDocByQuery" regex="^true$" replaceWith="id:${List.id}" sourceColName="deleted"/>

When I run the deltaImport command, I see the output below:

INFO: [] webapp=/solr path=/dataimport params={command=delta-import&debug=true&expungeDeletes=true} status=0 QTime=1
Nov 16, 2009 5:29:10 PM org.apache.solr.handler.dataimport.DataImporter doDeltaImport
INFO: Starting Delta Import
Nov 16, 2009 5:29:10 PM org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties
INFO: Read dataimport.properties
Nov 16, 2009 5:29:10 PM org.apache.solr.handler.dataimport.DocBuilder doDelta
INFO: Starting delta collection.
Nov 16, 2009 5:29:10 PM org.apache.solr.handler.dataimport.DocBuilder collectDelta
INFO: Running ModifiedRowKey() for Entity: List
Nov 16, 2009 5:29:10 PM org.apache.solr.handler.dataimport.JdbcDataSource$1 call
INFO: Creating a connection for entity List with URL: jdbc:postgresql://localhost:5432/tlists
Nov 16, 2009 5:29:10 PM org.apache.solr.handler.dataimport.JdbcDataSource$1 call
INFO: Time taken for getConnection(): 4
Nov 16, 2009 5:29:10 PM org.apache.solr.handler.dataimport.DocBuilder collectDelta
INFO: Completed ModifiedRowKey for Entity: List rows obtained : 1
Nov 16, 2009 5:29:10 PM org.apache.solr.handler.dataimport.DocBuilder collectDelta
INFO: Completed DeletedRowKey for Entity: List rows obtained : 0
Nov 16, 2009 5:29:10 PM org.apache.solr.handler.dataimport.DocBuilder collectDelta
INFO: Completed parentDeltaQuery for Entity: List
Nov 16, 2009 5:29:10 PM org.apache.solr.handler.dataimport.SolrWriter deleteByQuery
INFO: Deleting documents from Solr with query: id:api__list__365522
Nov 16, 2009 5:29:10 PM org.apache.solr.core.SolrDeletionPolicy onInit
INFO: SolrDeletionPolicy.onInit: commits:num=1 commit{dir=/mnt/solr-index/index,segFN=segments_r,version=1257863009839,generation=27,filenames=[_bg.fdt, _bg.tii, segments_r, _bg.fnm, _bg.nrm, _bg.fdx, _bg.prx, _bg.tis, _bg.frq]
Nov 16, 2009 5:29:10 PM org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: newest commit = 1257863009839
Nov 16, 2009 5:29:10 PM org.apache.solr.handler.dataimport.DocBuilder doDelta
INFO: Delta Import completed successfully

It says it is deleting the document, but when I do the search it still shows up. Any ideas?

Regards
Mark
Config Relationship between MaxWarmingSearchers and StreamingUpdateSolrServer
My application updates the master index frequently, sometimes very frequently. Is there a good rule of thumb for configuring: 1) maxWarmingSearchers in the master 2) the SUSS thread pool size (and perhaps queue length) to match the server settings?
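(For context, the two client-side knobs are set in the StreamingUpdateSolrServer constructor; a sketch, with queue size and thread count as placeholder values to tune against the master's capacity:)

import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class StreamingFeeder {
    public static void main(String[] args) throws Exception {
        // queueSize = 20 buffered update requests, threadCount = 4 background sender threads
        StreamingUpdateSolrServer server =
                new StreamingUpdateSolrServer("http://master:8983/solr", 20, 4);
        for (int i = 0; i < 1000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", i);
            server.add(doc);       // queued and streamed in the background
        }
        server.commit();           // commit once at the end
    }
}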
Re: SolrJ looping until I get all the results
On Mon, 2009-11-02 at 19:49 -0500, Paul Tomblin wrote:
> Here's what I'm thinking:
>
> final static int MAX_ROWS = 100;
> int start = 0;
> query.setRows(MAX_ROWS);
> while (true) {
>     QueryResponse resp = solrChunkServer.query(query);
>     SolrDocumentList docs = resp.getResults();
>     if (docs.size() == 0)
>         break;
>     start += MAX_ROWS;
>     query.setStart(start);
> }

Why not, after the first limited fetch, read how many hits there are and on the second fetch get all remaining documents? Example code (see the do-while loop): http://sesat.no/projects/sesat-kernel/xref/no/sesat/search/query/token/SolrTokenEvaluator.html#237

~mck

--
This above all: to thine own self be true. It must follow that you cannot then be false to any man. Shakespeare | semb.wever.org | sesat.no | finn.no
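(A variant of that loop which follows the suggestion above: read numFound from the first response and then page until it is exhausted. A sketch only; MAX_ROWS, the server and the query are whatever the application already uses:)

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

public class PagedFetch {
    static final int MAX_ROWS = 100;

    static void fetchAll(SolrServer server, SolrQuery query) throws Exception {
        query.setRows(MAX_ROWS);
        long numFound = Long.MAX_VALUE;               // unknown until the first page comes back
        for (int start = 0; start < numFound; start += MAX_ROWS) {
            query.setStart(start);
            QueryResponse resp = server.query(query);
            SolrDocumentList docs = resp.getResults();
            numFound = docs.getNumFound();            // total hits, learned on the first page
            for (SolrDocument doc : docs) {
                // process(doc);
            }
        }
    }
}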
Re: Wildcards at the Beginning of a Search.
There is a text_rev field type in the example schema.xml file in the official release of 1.4. It uses the ReversedWildcardFilterFactory to reverse a field. You can do a copyField from the field you want to use for leading wildcard searches to a field using the text_rev type, and then do a regular trailing wildcard search on the reversed field.

-Jay
http://www.lucidimagination.com

On Thu, Nov 12, 2009 at 4:41 AM, Jörg Agatz joerg.ag...@googlemail.com wrote:
> Is there maybe a way in Solr 1.4 to search with a wildcard at the beginning? In 1.3 I can't activate it.
> KingArtus
PhP, Solr and Delta Imports
Hello, I have an already working Solr service based on full imports, connected via PHP to a Zend Framework MVC (I connect it directly to the Controller). I use the SolrClient class for PHP, which is great: http://www.php.net/manual/en/class.solrclient.php

Right now, every time I want to edit a document I have to do a full import again, or I can delete the document by its id and add it again with the updated info... Can anyone guide me a bit on how to do delta imports? If it's via PHP, better!

Thanks in advance,
Pablo Ferrari
Tinkerlabs.net
Re: PhP, Solr and Delta Imports
On Mon, Nov 16, 2009 at 2:49 PM, Pablo Ferrari pabs.ferr...@gmail.com wrote:
> [...]

Hello Pablo,

You have a couple of options, and you do not have to do a full data re-import for the entire index. My example below uses 'doc_id' as the uniqueKey field in your schema. It also assumes that it is an integer type.

1. You can remove the document from the index by query or by id (assuming you have its id or uniqueKey field) if you want to just take it out of the active index.

$client = new SolrClient($options);
$client->deleteById(400); // I recommend this one

OR

$client->deleteByQuery('doc_id:400'); // This should work too.

2. If all you want to do is to replace/update an existing document in the Solr index and you still want the document to remain active in the index, then you can just update it by building a SolrInputDocument object and then submitting just that document using the SolrClient.

$client = new SolrClient($options);
$doc = new SolrInputDocument();
$doc->addField('doc_id', 334455);
$doc->addField('other_field', 'Other Field Value');
$doc->addField('another_field', 'Another Field Value');
$updateResponse = $client->addDocument($doc);

If your changes are coming from the db, it would be helpful to have a timestamp column that changes each time the record is modified. Then you can keep track of when the last index process was done, and the next time you can retrieve only 'active' documents that have been modified or created after this last re-index. You can send the SolrInputDocuments to the Solr index using the SolrClient object as shown above for each document.

Do not forget to save the changes to the index with a call to SolrClient::commit(). If you are updating a lot of records, I would recommend waiting till the end to do the commit (and optimize call if needed).

More examples are available here: http://us2.php.net/manual/en/solr.examples.php

--
Good Enough is not good enough. To give anything less than your best is to sacrifice the gift. Quality First. Measure Twice. Cut Once.
Re: Config Relationship between MaxWarmingSearchers and StreamingUpdateSolrServer
Hi Erik, I didn't look at the source code, and I think the javadoc for SUSS doesn't mention it, but I am under the impression that the number of threads to use should roughly match the number of CPU cores on the master. The maxWarmingSearchers should only be relevant to slaves, not masters, no? Otis -- Sematext is hiring -- http://sematext.com/about/jobs.html?mls Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR - Original Message From: Erik Earle erikea...@yahoo.com To: solr-user@lucene.apache.org Sent: Mon, November 16, 2009 1:20:23 PM Subject: Config Relationship between MaxWarmingSearchers and StreamingUpdateSolrServer My application updates the master index frequently, sometimes very frequently. Is there a good rule of thumb for configuring: 1) maxWarmingSearchers in the master 2) the SUSS thread pool size (and perhaps queue length) to match the server settings?
Re: Solr 1.3 query and index perf tank during optimize
I'd have to verify this to be sure, but I *believe* deleted docs data is expunged during index segment merges. See https://issues.apache.org/jira/browse/SOLR-1275 Otis -- Sematext is hiring -- http://sematext.com/about/jobs.html?mls Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR - Original Message From: Jerome L Quinn jlqu...@us.ibm.com To: solr-user@lucene.apache.org Sent: Mon, November 16, 2009 10:05:55 AM Subject: Re: Solr 1.3 query and index perf tank during optimize Otis Gospodnetic wrote on 11/13/2009 11:15:43 PM: Let's take a step back. Why do you need to optimize? You said: As long as I'm not optimizing, search and indexing times are satisfactory. :) You don't need to optimize just because you are continuously adding and deleting documents. On the contrary! That's a fair question. Basically, search entries are keyed to other documents. We have finite storage, so we purge old documents. My understanding was that deleted documents still take space until an optimize is done. Therefore, if I don't optimize, the index size on disk will grow without bound. Am I mistaken? If I don't ever have to optimize, it would make my life easier. Thanks, Jerry
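(That is the SOLR-1275 feature: in 1.4 a commit can carry expungeDeletes="true", which merges away deleted-document data without a full optimize. One way to send it from Java, as a sketch, assuming the default /update handler accepts the attribute in your version; DirectXmlRequest just posts the raw XML message:)

import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.DirectXmlRequest;

public class ExpungeDeletes {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        // A commit that also expunges data for deleted documents (SOLR-1275)
        server.request(new DirectXmlRequest("/update", "<commit expungeDeletes=\"true\"/>"));
    }
}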
Re: Solr - Load Increasing.
Hi,

Your autoCommit settings are very aggressive. I'm guessing that's what's causing the CPU load. btw. what is laks?

Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR

- Original Message From: kalidoss kalidoss.muthuramalin...@sifycorp.com To: solr-user@lucene.apache.org Sent: Mon, November 16, 2009 9:11:21 AM Subject: Solr - Load Increasing.
> [...]
Re: Solr - Load Increasing.
Probably lakh: 100,000. So, 900k qpd and 3M docs. http://en.wikipedia.org/wiki/Lakh

wunder

On Nov 16, 2009, at 2:17 PM, Otis Gospodnetic wrote:
> [...]
RE: Solr - Load Increasing.
Hi,

Lakh or Lac - 100,000
Crore - 100,00,000 (ten million)

Commonly used in India.

Sincerely,
Sithu D Sudarsan

-Original Message- From: Walter Underwood [mailto:wun...@wunderwood.org] Sent: Monday, November 16, 2009 5:22 PM To: solr-user@lucene.apache.org Subject: Re: Solr - Load Increasing.
> [...]
Re: Solr - Load Increasing.
On Mon, Nov 16, 2009 at 5:22 PM, Walter Underwood wun...@wunderwood.org wrote:
> Probably lakh: 100,000. So, 900k qpd and 3M docs. http://en.wikipedia.org/wiki/Lakh
> [...]

Thanks Walter for clarifying that. I too was wondering what laks meant. It was a bit distracting when I read the original post.

--
Good Enough is not good enough. To give anything less than your best is to sacrifice the gift. Quality First. Measure Twice. Cut Once.
Re: Solr - Load Increasing.
I think it would be useful for members of this list to realize that not everyone uses the same systems of measurement and terms. It is very easy for Americans to use the imperial system and presume everyone does the same, for Europeans to use the metric system, etc. Hopefully members of this list can be persuaded to use, or at least clarify, their terminology. While the apocryphal saying goes that the great thing about standards is there are so many to choose from, we should all make an effort to communicate across cultures and nations.

On Mon, Nov 16, 2009 at 5:33 PM, Israel Ekpo israele...@gmail.com wrote:
> [...]
Re: Solr - Load Increasing.
Nice to learn a new word for the day! But to answer your question, or at least part of it: I don't think you really want a configuration like

  <autoCommit>
    <maxDocs>1</maxDocs>
    <maxTime>10</maxTime>
  </autoCommit>

Committing on every doc, and every 10 milliseconds? That's just asking for problems. How about starting with 1000 docs, and five minutes for maxTime (5*60*1000, or about 3 laks of milliseconds)? That should help performance a lot. Try that, and see how it works.

Tom
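For reference, Tom's suggestion would go inside the updateHandler section of solrconfig.xml, roughly as sketched below; the element placement follows the stock example config, and the numbers are only a starting point to tune against the real write rate:

  <updateHandler class="solr.DirectUpdateHandler2">
    <autoCommit>
      <!-- commit after 1000 pending docs or 5 minutes (300000 ms), whichever comes first -->
      <maxDocs>1000</maxDocs>
      <maxTime>300000</maxTime>
    </autoCommit>
  </updateHandler>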
Re: exclude some fields from copying dynamic fields | schema.xml
Oh well. There is no direct feature for controlling what is copied. If you use the DataImportHandler, you can include Java plugins or JavaScript/JRuby/Groovy code to do the copying.

On Sun, Nov 15, 2009 at 9:37 PM, Vicky_Dev vikrantv_shirbh...@yahoo.co.in wrote:
Thanks for the response. Defining the field is not working :( Is there any way to stop the copy task for a particular set of values?
Thanks ~Vikrant

Lance Norskog-2 wrote:
There is no direct way. Let's say you have a nocopy_s field and you do not want a copy nocopy_str_s. This might work: declare nocopy_str_s as a field and make it not indexed and not stored. I don't know if this will work. It requires two overrides to work: 1) that declaring a field name that matches a wildcard will override the default wildcard rule, and 2) that stored="false" indexed="false" works.

On Fri, Nov 13, 2009 at 3:23 AM, Vicky_Dev vikrantv_shirbh...@yahoo.co.in wrote:
Hi, we are using the following entry in schema.xml to make a copy of one type of dynamic field to another:
  <copyField source="*_s" dest="*_str_s"/>
Is it possible to exclude some fields from copying? We are using Solr 1.3.
~Vikrant

-- Lance Norskog goks...@gmail.com
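Spelled out, the workaround Lance describes would look something like the snippet below in schema.xml. The nocopy_str_s name is just his example, the string type is an assumption, and, as he says himself, whether an explicit declaration actually overrides the wildcard copy target is untested:

  <!-- explicit field intended to shadow the *_str_s wildcard, neither indexed nor stored -->
  <field name="nocopy_str_s" type="string" indexed="false" stored="false"/>

  <!-- the existing wildcard copy stays in place -->
  <copyField source="*_s" dest="*_str_s"/>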
Re: Newbie Solr questions
Thanks. So there is no way to create custom documents/fields via the SolrJ client API at runtime?

On Nov 16, 2009, at 4:49 PM, Lance Norskog wrote:
There is no way to create custom documents/fields via the SolrJ client @ runtime.
Re: Newbie Solr questions
Sorry, I did not answer the question. Yes, that's right. SolrJ can only change the documents in the index; it has no power over the metadata.

On Mon, Nov 16, 2009 at 4:00 PM, yz5od2 woods5242-outdo...@yahoo.com wrote:
Thanks. So there is no way to create custom documents/fields via the SolrJ client API at runtime?

On Nov 16, 2009, at 4:49 PM, Lance Norskog wrote:
There is no way to create custom documents/fields via the SolrJ client @ runtime.

-- Lance Norskog goks...@gmail.com
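The usual way to get per-client custom fields at runtime, without editing the schema for each one, is to declare a dynamic field once in schema.xml and let clients pick field names that match the pattern. A rough sketch with made-up names; the *_s pattern and string type mirror the stock example schema:

  <!-- schema.xml: one declaration covers every field name ending in _s -->
  <dynamicField name="*_s" type="string" indexed="true" stored="true"/>

  <!-- an add request (from SolrJ or plain XML) can then use arbitrary matching names -->
  <add>
    <doc>
      <field name="id">42</field>
      <field name="color_s">red</field>
      <field name="customAttr_s">whatever the client decides at runtime</field>
    </doc>
  </add>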
core size
I'm planning out a system with large indexes and wondering what kind of performance boost I'd see if I split the documents out into many cores rather than using a single core and splitting by a field. I've got about 500GB worth of indexes ranging from 100MB to 50GB each. I'm assuming that if we split them out to multiple cores we would see the most dramatic benefit in searches on the smaller cores, but I'm just wondering what level of speedup I should expect. Eventually the cores will be split up anyway; I'm just trying to determine how to prioritize it. thanks, Phil
Replication admin page auto-reload
The replication admin page on slaves used to have an auto-reload set to reload every few seconds. In the official 1.4 release this doesn't seem to be working, but it does in a nightly build from early June. Was this changed on purpose or is this a bug? I looked through CHANGES.txt to see if anything was mentioned related to this but didn't see anything. If it's a bug I'll open an issue in JIRA -Jay
Re: Some guide about setting up local/geo search at solr
Not that I know of. It's not in contrib, but if you apply the patch from http://wiki.apache.org/solr/SpatialSearch I am guessing it puts things in contrib/spatial.
Otis
-- Sematext is hiring -- http://sematext.com/about/jobs.html?mls Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR

----- Original Message -----
From: Bertie Shen bertie.s...@gmail.com
To: solr-user@lucene.apache.org
Sent: Mon, November 16, 2009 12:41:38 PM
Subject: Re: Some guide about setting up local/geo search at solr

Localsolr is not in contrib yet. I am interested in knowing whether there is currently a better solution for setting up a local search. Cheers.

On Sun, Nov 15, 2009 at 9:25 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote:
Nota bene: my understanding is that the external versions of Local Lucene/Solr are eventually going to be deprecated in favour of what we have in contrib. Here's a stub page with a link to the spatial JIRA issue: http://wiki.apache.org/solr/SpatialSearch
Otis

----- Original Message -----
From: Bertie Shen
To: solr-user@lucene.apache.org
Sent: Sat, November 14, 2009 3:32:01 AM
Subject: Some guide about setting up local/geo search at solr

Hey, I spent some time figuring out how to set up local/geo/spatial search with Solr. I hope the following description helps, given the current status.

1) Download localsolr. I downloaded it from http://developer.k-int.com/m2snapshots/localsolr/localsolr/1.5/ and put the jar file (in my case, localsolr-1.5.jar) in the WEB-INF/lib directory of your application on the application server.

2) Download locallucene. I downloaded it from http://sourceforge.net/projects/locallucene/ and put the jar file (in my case, locallucene.jar from the locallucene_r2.0/dist/ directory) in the same WEB-INF/lib directory. I also needed to copy gt2-referencing-2.3.1.jar, geoapi-nogenerics-2.1-M2.jar, and jsr108-0.01.jar from the locallucene_r2.0/lib/ directory to WEB-INF/lib. Do not copy lucene-spatial-2.9.1.jar from the Lucene codebase; the namespace has been changed from com.pjaol.blah.blah.blah to org.apache.blah.blah.

3) Update your solrconfig.xml and schema.xml. I copied the entries from http://www.gissearch.com/localsolr.

4) Restart the application server and try a query like /solr/select?qt=geo&lat=xx.xx&long=yy.yy&q=abc&radius=zz
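With concrete values filled in, the request from step 4 would look something like the line below; the host, port, coordinates, radius and query term are made up, while the qt=geo handler name comes from the guide above:

  http://localhost:8983/solr/select?qt=geo&lat=37.77&long=-122.41&radius=10&q=coffee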
Re: core size
If an index fits in memory, I am guessing you'll see the speed change roughly proportionally to the size of the index. If an index does not fit into memory (i.e. disk head has to run around the disk to look for info), then the improvement will be even greater. I haven't explicitly tested this and am hoping somebody will correct me if this is wrong.
Otis
-- Sematext is hiring -- http://sematext.com/about/jobs.html?mls Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
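If it helps anyone weighing the same trade-off: the split itself is mostly a matter of listing the cores in solr.xml (available since Solr 1.3); the core names and instance directories below are purely illustrative:

  <solr persistent="true">
    <cores adminPath="/admin/cores">
      <core name="index_small" instanceDir="index_small"/>
      <core name="index_large" instanceDir="index_large"/>
    </cores>
  </solr>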
Re: Replication admin page auto-reload
On Nov 17, 2009, at 2:48 AM, Jay Hill wrote:
The replication admin page on slaves used to have an auto-reload set to reload every few seconds. In the official 1.4 release this doesn't seem to be working, but it does in a nightly build from early June. Was this changed on purpose or is this a bug? I looked through CHANGES.txt to see if anything was mentioned related to this but didn't see anything. If it's a bug I'll open an issue in JIRA

Noble changed this:

~/dev/solr/src/webapp/web/admin/replication: svn log header.jsp
r809125 | noble | 2009-08-29 14:46:54 +0200 (Sat, 29 Aug 2009) | 1 line
automatic refresh is very annoying. The user can do a refresh on his browser if required

~/dev/solr/src/webapp/web/admin/replication: svn diff -r800729:809125 header.jsp
Index: header.jsp
===================================================================
--- header.jsp  (revision 800729)
+++ header.jsp  (revision 809125)
@@ -67,12 +67,7 @@
     NamedList namedlist = executeCommand(details, core, rh);
     NamedList detailsMap = (NamedList)namedlist.get(details);
-    if(detailsMap != null)
-      if("true".equals((String)detailsMap.get("isSlave"))){
 %>
-      <meta http-equiv="refresh" content="10"/>
-<%}%>
   </head>
   <body>
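So the auto-reload was removed on purpose rather than broken. If the old behavior is wanted locally, the removed meta tag can simply be put back into src/webapp/web/admin/replication/header.jsp, inside the same isSlave check shown in the diff:

  <meta http-equiv="refresh" content="10"/>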