Re: Binary content index with multiple cores
OK, I found a way to use it; it was a problem with libraries. In fact, I don't want to index PDF or Word files directly, I just want to extract their content and add it to my document's content field, so I guess I will have to use Tika to get the XML and pull out the node that I want.
Re: Binary content index with multiple cores
To help find the solution, here is the stack trace from my JUnit test:

org.apache.solr.client.solrj.SolrServerException: Server at http://localhost:8983/solr/document returned non ok status:500, message:Internal Server Error
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:328)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:211)

And the console error from Apache Tomcat:

SEVERE: org.apache.solr.common.SolrException: lazy loading error
    at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.getWrappedHandler(RequestHandlers.java:260)
    at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:242)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1376)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:365)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:260)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
    at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:615)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:859)
    at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:602)
    at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
    at java.lang.Thread.run(Thread.java:722)
Caused by: org.apache.solr.common.SolrException: Error Instantiating Request Handler, solr.extraction.ExtractingRequestHandler is not a org.apache.solr.request.SolrRequestHandler
    at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:421)
    at org.apache.solr.core.SolrCore.createRequestHandler(SolrCore.java:455)
    at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.getWrappedHandler(RequestHandlers.java:251)
    ... 17 more

I hope this helps you find what's wrong.
Re: numFound inconsistent for different rows-param
I resolved my confusion and discovered that the documents of the second shard contained the same 'unique' id. rows=0 displayed the 'correct' numFound since (as I understand it) there was no merge of the results.

cheerio, patrick

On 25.07.2012 17:07, patrick wrote:

Hi, I'm running two Solr v3.6 instances:

rdta01:9983/solr/msg-core : 8 documents
rdta01:28983/solr/msg-core : 4 documents

The following two queries, with rows=10 and rows=0 respectively, return different numFound results, which confuses me. I hope someone can clarify this behaviour.

URL with rows=10:
http://rdta01:9983/solr/msg-core/select?q=*:*&shards=rdta01%3A9983%2Fsolr%2Fmsg-core%2Crdta01%3A28983%2Fsolr%2Fmsg-core&indent=on&start=0&rows=10
numFound=8 (incorrect, second shard is missing)

URL with rows=0:
http://rdta01:9983/solr/msg-core/select?q=*:*&shards=rdta01%3A9983%2Fsolr%2Fmsg-core%2Crdta01%3A28983%2Fsolr%2Fmsg-core&indent=on&start=0&rows=0
numFound=12 (correct)

cheerio, patrick
Re: Binary content index with multiple cores
Thanks for replying. Here is my dependency tree related to solr-cell:

org.apache.solr:solr-cell:jar:3.6.0:compile
+- com.ibm.icu:icu4j:jar:4.8.1.1:compile
+- org.apache.tika:tika-parsers:jar:1.0:compile
|  +- org.apache.tika:tika-core:jar:1.0:compile
|  +- edu.ucar:netcdf:jar:4.2-min:compile
|  +- org.apache.james:apache-mime4j-core:jar:0.7:compile
|  +- org.apache.james:apache-mime4j-dom:jar:0.7:compile
|  +- org.apache.commons:commons-compress:jar:1.3:compile
|  +- org.apache.pdfbox:pdfbox:jar:1.6.0:compile
|  |  +- org.apache.pdfbox:fontbox:jar:1.6.0:compile
|  |  \- org.apache.pdfbox:jempbox:jar:1.6.0:compile
|  +- org.bouncycastle:bcmail-jdk15:jar:1.45:compile
|  +- org.bouncycastle:bcprov-jdk15:jar:1.45:compile
|  +- org.apache.poi:poi:jar:3.8-beta4:compile
|  +- org.apache.poi:poi-scratchpad:jar:3.8-beta4:compile
|  +- org.apache.poi:poi-ooxml:jar:3.8-beta4:compile
|  |  \- org.apache.poi:poi-ooxml-schemas:jar:3.8-beta4:compile
|  |     \- org.apache.xmlbeans:xmlbeans:jar:2.3.0:compile
|  +- org.apache.geronimo.specs:geronimo-stax-api_1.0_spec:jar:1.0.1:compile
|  +- org.ccil.cowan.tagsoup:tagsoup:jar:1.2.1:compile
|  +- asm:asm:jar:3.1:compile
|  +- com.drewnoakes:metadata-extractor:jar:2.4.0-beta-1:compile
|  +- de.l3s.boilerpipe:boilerpipe:jar:1.1.0:compile
|  \- rome:rome:jar:0.9:compile
|     \- jdom:jdom:jar:1.0:compile
\- xerces:xercesImpl:jar:2.8.1:compile
   \- xml-apis:xml-apis:jar:1.3.03:compile

As you can see, I have the tika-parsers. About the solr.war: when I start mvn cargo:run, the pom.xml is set up to build the solr.war, and for solr-cell Tomcat needs some dependencies like solr-cell, solr-core, solr-solrj, tika-core and slf4j-api. Do you have any idea where my mistake is?
Bulk indexing data into solr
Hi, I am starting to use Solr. Now I need to index a rather large amount of data, and it seems that passing data to Solr through HTTP is rather inefficient. I am thinking of calling the Lucene API directly for bulk indexing but using Solr for search; is this design OK? Thanks very much for your help, Lisheng
Re: solr spellchecker hogging all of my memory
Do the spellcheck objects eventually get collected off the heap? Maybe you should dump the heap later and ensure those objects get collected, in which case I'd call this a normal heap expansion due to a temporary usage spike.

Michael Della Bitta
Appinions | 18 East 41st St., Suite 1806 | New York, NY 10017
www.appinions.com
Where Influence Isn't a Game

On Wed, Jul 25, 2012 at 10:03 PM, dboychuck dboych...@build.com wrote:

Before I optimize (build my spellchecker index), my Solr instance running in Tomcat uses about 2 GB of memory; as soon as I optimize, it jumps to about 5 GB: http://d.pr/i/oUQI It just doesn't seem right. http://pastebin.com/6Cg7F0dK Is there anything wrong with my configuration? When I dump the heap I can see that the spellchecker is using the majority of the memory.
Re: Skip first word
Hi Ahmet, the business asked me to apply EdgeNGram with minGramSize=1 on the first term and with minGramSize=3 on the later terms. We are developing a search-suggestion mechanism; the idea is that if the user types 'D', the engine should suggest 'Dolce & Gabbana', but if the user types 'G', it should suggest other brands. Only if the user types 'Gab' should it suggest 'Dolce & Gabbana'. Thanks
S

From: Ahmet Arslan [iori...@yahoo.com]
Sent: Wednesday, July 25, 2012, 18:10
To: solr-user@lucene.apache.org
Subject: Re: Skip first word

Is there a tokenizer and/or a combination of filters to remove the first term from a field? For example, 'The quick brown fox' should be tokenized as: 'quick brown fox'.

There is no such filter that I know of. Though, you could implement one by modifying the source code of LengthFilterFactory or StopFilterFactory; they both remove tokens. Out of curiosity, what is the use case for this?
solr host name on solrconfig.xml
Hello, I need the host name of my Solr server in my solrconfig.xml. Does anybody know the correct variable? Something like ${solr.host} or ${solr.host.name}? Does documentation exist about ALL available variables in the solr.* namespace? Thx a lot
Re: Bulk indexing data into solr
Hello! If you use Java (and I think you do, because you mention Lucene) you should take a look at StreamingUpdateSolrServer. It not only allows you to send data in batches, but also to index using multiple threads.

-- Regards, Rafał Kuć
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch
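As a rough sketch of what that looks like with SolrJ 3.6 (the URL, queue size, thread count and field names here are placeholders, not from the thread):

  import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class BulkIndexer {
    public static void main(String[] args) throws Exception {
      // queue up to 1000 docs internally; flush with 4 background threads
      StreamingUpdateSolrServer server =
          new StreamingUpdateSolrServer("http://localhost:8983/solr", 1000, 4);
      for (int i = 0; i < 1000000; i++) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", String.valueOf(i));
        doc.addField("name", "document " + i);
        server.add(doc); // returns quickly; sending happens on background threads
      }
      server.commit(); // waits for the internal queue to drain, then commits
    }
  }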
Re: Binary content index with multiple cores
About the solr.war, when I start my mvn cargo:run ... for solr-cell Tomcat needs some dependencies like solr-cell, solr-core, solr-solrj, tika-core and slf4j-api. Have you any idea where my mistake is?

OK: for solr-cell, Tomcat needs dependencies. These dependencies are shipped with the Solr download (apache-solr-3.6.1.tgz, for example). You don't need to embed those jars into solr.war; you can consume them using lib directives. That said, to enable solr-cell you don't need to re-create solr.war, nor use Maven.
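For reference, the stock example solrconfig.xml pulls them in with lib directives along these lines (the paths are relative to the core's instance dir and will differ per installation):

  <!-- in solrconfig.xml: load Solr Cell and the Tika jars it needs -->
  <lib dir="../../dist/" regex="apache-solr-cell-\d.*\.jar" />
  <lib dir="../../contrib/extraction/lib" />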
Solr - hl.fragsize Issue
I am using Solr 3.5, and in the search query I set hl.fragsize=100, but my fragments do not contain exactly 100 chars; the average fragment size is 120. Does anybody have an idea about this issue? Thanks
Expression Sort in Solr
I am working on Solr for search. I need to perform an expression sort such as:

ORDER BY (IF(COUNTRY=1,100,0) + IF(AVAILABLE=2,1000,IF(AVAILABLE=1,60,0)) + IF(DELIVERYIN IN (5,6,7),100,IF(DELIVERYIN IN (80,90),50,0))) DESC

Can anyone tell me how this is possible?
Re: Expression Sort in Solr
How dynamic are those numbers? If this expression can be computed at index time into a sort_order field, that'd be best. Otherwise, if these factors are truly dynamic at run-time, look at the function query sorting capability here: http://wiki.apache.org/solr/FunctionQuery#Sort_By_Function and build up the expression from there. I still encourage you to aim towards computing as much of this at index time as possible, to minimize the functions (and thus caches) you need at query time.

Erik
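For illustration, the IF logic from the question could be approximated with map() functions, assuming the five-argument map(x,min,max,target,default) form is available in your version (an untested sketch; note that map() works on ranges, so 'IN (80,90)' becomes the range [80,90]):

  sort=sum(map(COUNTRY,1,1,100,0),map(AVAILABLE,2,2,1000,0),map(AVAILABLE,1,1,60,0),map(DELIVERYIN,5,7,100,0),map(DELIVERYIN,80,90,50,0)) desc

Each map() yields the target value when the field falls in [min,max] and the default otherwise; because the ranges are disjoint, the sum behaves like the nested IFs.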
Re: Solr - hl.fragsize Issue
I am using Solr 3.5, and in the search query I set hl.fragsize=100, but my fragments do not contain exactly 100 chars; the average fragment size is 120.

Are you using FastVectorHighlighter or DefaultSolrHighlighter? Could it be that the 120 includes the characters of the <em> tags?
Re: leaks in solr
Did you find any more clues? I have this problem on my machines as well.

On Fri, Jun 29, 2012 at 6:04 AM, Bernd Fehling bernd.fehl...@uni-bielefeld.de wrote:

Hi list, while monitoring my Solr 3.6.1 installation I recognized an increase of memory usage in the OldGen JVM heap on my slave. I decided to force a Full GC from jvisualvm and send optimize to the already-optimized slave index. Normally this helps, as I have seen while monitoring this issue in the past. But not this time: the Full GC didn't free any memory. So I decided to take a heap dump and see what MemoryAnalyzer shows. The heap dump is about 23 GB in size.

1.) Report Top consumers - Biggest Objects: Total: 12.3 GB
org.apache.lucene.search.FieldCacheImpl : 8.1 GB
class java.lang.ref.Finalizer : 2.1 GB
org.apache.solr.util.ConcurrentLRUCache : 1.5 GB
org.apache.lucene.index.ReadOnlySegmentReader : 622.5 MB
...

As you can see, Finalizer has already reached 2.1 GB!!!
* java.util.concurrent.ConcurrentHashMap$Segment[16] @ 0x37b056fd0
* segments java.util.concurrent.ConcurrentHashMap @ 0x39b02d268
* map org.apache.solr.util.ConcurrentLRUCache @ 0x398f33c30
* referent java.lang.ref.Finalizer @ 0x37affa810
* next java.lang.ref.Finalizer @ 0x37affa838
...

Seems to be org.apache.solr.util.ConcurrentLRUCache. The attributes are:

Type    | Name                | Value
boolean | isDestroyed         | true
ref     | cleanupThread       | null
ref     | evictionListener    | null
long    | oldestEntry         | 0
int     | acceptableWaterMark | 9500
ref     | stats               | org.apache.solr.util.ConcurrentLRUCache$Stats @ 0x37b074dc8
boolean | islive              | true
boolean | newThreadForCleanup | false
boolean | isCleaning          | false
ref     | markAndSweepLock    | java.util.concurrent.locks.ReentrantLock @ 0x39bf63978
int     | lowerWaterMark      | 9000
int     | upperWaterMark      | 1
ref     | map                 | java.util.concurrent.ConcurrentHashMap @ 0x39b02d268

2.) While searching for open files and their references I noticed that there are references to index files which have already been deleted from disk. E.g. the recent index files are data/index/_2iqw.frq and data/index/_2iqx.frq, but I also see references to data/index/_2hid.frq, which is quite old and was deleted way back by earlier replications. I have to analyze this a bit deeper.

So far my report; I'll go on analyzing this huge heap dump. If you need any other info or even the heap dump, let me know.

Regards Bernd
Re: Bulk indexing data into solr
On 7/26/2012 7:34 AM, Rafał Kuć wrote: If you use Java (and I think you do, because you mention Lucene) you should take a look at StreamingUpdateSolrServer. It not only allows you to send data in batches, but also to index using multiple threads.

A caveat to what Rafał said: the streaming object has no error detection out of the box. It queues everything up internally and returns immediately. Behind the scenes, it uses multiple threads to send documents to Solr, but any errors encountered are simply sent to the logging mechanism, then ignored. When you use HttpSolrServer, all errors encountered will throw exceptions, but you have to wait for completion. If you need both concurrent capability and error detection, you would have to manage multiple indexing threads yourself.

There is a method in the concurrent class that you can override to handle errors differently, though I have not seen how to write code so your program would know that an error occurred. I filed an issue with a patch to solve this, but some of the developers have come up with an idea that might be better. None of the ideas have been committed to the project. https://issues.apache.org/jira/browse/SOLR-3284

Just an FYI, the streaming class was renamed to ConcurrentUpdateSolrServer in Solr 4.0 Alpha. Both are available in 3.6.x.

Thanks, Shawn
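A minimal sketch of that override (the AtomicBoolean flag is my own illustration, not part of the SolrJ API):

  import java.net.MalformedURLException;
  import java.util.concurrent.atomic.AtomicBoolean;
  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;

  public class ErrorAwareIndexer {
    static final AtomicBoolean indexingFailed = new AtomicBoolean(false);

    static SolrServer newServer() throws MalformedURLException {
      return new StreamingUpdateSolrServer("http://localhost:8983/solr", 1000, 4) {
        @Override
        public void handleError(Throwable ex) {
          indexingFailed.set(true); // remember that a background send failed
          super.handleError(ex);    // keep the default logging behaviour
        }
      };
    }
  }

After commit() you can check indexingFailed, though as noted above this only tells you *that* something failed, not which documents.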
querying using filter query and lots of possible values
Hi, I am facing the following issue: I have a couple of million documents, which have a field called source_id. My problem is that I want to retrieve all the documents which have a source_id in a specific range of values. This range can be pretty big, for example a list of 200 to 2000 source ids.

I was thinking that a filter query could be used, like fq=source_id:(1 2 3 4 5 6 ...), but this reminds me of SQL's WHERE IN (...), which was always a bit slow for a huge number of values.

Another solution that came to my mind was to assign all the documents I want to retrieve a new kind of filter id. So all the documents which I want to analyse get a new id. But I would need to update all the millions of documents for this and assign them a new id, which could take some time.

Can you think of a nicer way to solve this issue?

Regards & greetings, Daniel
Re: Expression Sort in Solr
Hi, I know we should look at computing it at index time; however, all the values are dynamic.

This is not working:

http://devjs.infoedge.com:8080/solr/select?q=*:*&fq=GENDER:FEMALE&sort=sum(if(exists(query(AGE:22)),100,20),INCOME)

and I also need nested IFs, e.g. if(exists(query(COUNTRY:(22 33 44))),100,20).
Re: querying using filter query and lots of possible values
Hi Daniel,

index the id into a field of type tint or tlong and use a range query (http://wiki.apache.org/solr/SolrQuerySyntax?highlight=%28rangequery%29):

fq=id:[200 TO 2000]

If you want to exclude certain ids, it might be wiser to simply add an exclusion query in addition to the range query instead of listing all the single values; you will run into problems with too-long request URLs. If you cannot avoid long URLs, you might want to increase maxBooleanClauses (see http://wiki.apache.org/solr/SolrConfigXml/#The_Query_Section).

Cheers, Chantal
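For reference, that setting lives in the query section of solrconfig.xml; 1024 is the default (raise it only as far as you actually need):

  <query>
    <!-- maximum number of clauses allowed in a single BooleanQuery -->
    <maxBooleanClauses>4096</maxBooleanClauses>
  </query>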
Re: Skip first word
Hi, use two fields:
1. KeywordTokenizer (= single token) with ngram minsize=1 and maxsize=2, for inputs of length < 3,
2. the other one tokenized as appropriate, with minsize=3 and longer, for all longer inputs.

Cheers, Chantal
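A sketch of what those two field types might look like in schema.xml (names and gram sizes are examples, not a tested config):

  <!-- whole input as one token, 1-2 char edge n-grams: 'D' matches 'Dolce & Gabbana' -->
  <fieldType name="suggest_short" class="solr.TextField">
    <analyzer type="index">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="2"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  <!-- word-by-word, 3+ char edge n-grams: 'Gab' matches 'Dolce & Gabbana' -->
  <fieldType name="suggest_long" class="solr.TextField">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="20"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

At query time you would search the short field for 1-2 character input and the long field otherwise (or query both with dismax and let scoring decide).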
RE: Bulk indexing data into solr
Thanks very much, both your and Rafal's advice are very helpful!
Re: Bulk indexing data into solr
Right on time, guys: https://issues.apache.org/jira/browse/SOLR-3585

Here is a server-side update processing fork. It does its best to halt processing when an exception occurs. Plug in this UpdateProcessor and specify the number of threads, then submit a lazy iterator to StreamingUpdateSolrServer on the client side.

PS: Don't do the following: send many, many docs one by one, or instantiate a huge ArrayList of SolrInputDocument on the client side.

-- Sincerely yours, Mikhail Khludnev
Tech Lead, Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
Re: querying using filter query and lots of possible values
Hey Chantal, thanks for your answer. The range queries would not work, because the values are not in a row; they can be randomly ordered, with gaps. The above was just an example. Excluding is also not a solution, because the list of excluded ids would be even longer.

To specify it even more: the IDs are not even integers, but UUIDs, and there are tens of thousands of them. And the document pool contains hundreds of millions of documents.

Thanks, Daniel
Is it possible or wise to query multiple cores in parallel in SolrCloud
Hi, I am playing around with a SolrCloud setup (4 shards) and thousands of cores. I am thinking of executing queries on hundreds of cores, like a distributed query. Is this possible at all from the SolrCloud side? And is this wise?

Thanks & regards, Daniel
Re: Bulk indexing data into solr
Coming back to your original question: I'm a little puzzled. It's not clear where you want to call the Lucene API from. If you mean that you have a standalone indexer which writes the index files, then stops, and then those files become available to the Solr process, it will work. Sharing an index between processes, or using EmbeddedSolrServer, is looking for problems (despite Lucene having a lock mechanism, which I'm not completely aware of).

I conclude that your data for indexing is colocated with the Solr server. In this case consider http://wiki.apache.org/solr/ContentStream#RemoteStreaming

Please give more details about your design.

-- Sincerely yours, Mikhail Khludnev
Tech Lead, Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
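For reference, remote streaming is off by default and gets switched on in the requestDispatcher section of solrconfig.xml (mind the security implications: anyone who can reach Solr can then make it read arbitrary files and URLs):

  <requestDispatcher handleSelect="true">
    <!-- allow stream.file / stream.url parameters on requests -->
    <requestParsers enableRemoteStreaming="true"
                    multipartUploadLimitInKB="2048" />
  </requestDispatcher>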
Re: querying using filter query and lots of possible values
You can't update the original documents except by reindexing them, so there is no easy group-assignment option.

If you create this 'collection' once but query it multiple times, you may be able to use the SOLR4 join, with the IDs stored separately and joined on. Still not great, because performance is an issue when joining on IDs: http://www.lucidimagination.com/blog/2012/06/20/solr-and-joins/

If the list is some sort of combination of smaller lists, you could probably precompute (at index time) those fragments and do a compound query over them. But if you have to query every time and the list is different every time, that could be complicated.

Regards, Alex.
Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
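To sketch the join idea: keep the ID list as tiny documents in a separate core (here hypothetically called 'idlist', with a source_id field) and join against it in Solr 4:

  fq={!join fromIndex=idlist from=source_id to=source_id}*:*

Performance then depends mainly on the number of distinct join terms, as the blog post above discusses.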
Re: querying using filter query and lots of possible values
Thanks Alexandre, the list of IDs is constant for a longer time. I will take a look at the join approach. Maybe another solution would be to create a whole new collection or set of documents from scratch, containing the aggregated documents (from the ids), and to execute queries on that collection. That would take some time, but maybe it's worth it because the querying will thank you.

Daniel
Re: Skip first word
That's the best option. I had also used ShingleFilterFactory.

THANKS AND REGARDS, SYED ABDUL KATHER
RE: Bulk indexing data into solr
Hi, I think that at least before Lucene 4.0 only one process/thread can write to a given Lucene folder. Based on this fact, my initial plan is:

1) There is one set of Lucene index folders.
2) The Solr server only performs queries on those folders.
3) A separate process (multi-threaded) indexes those Lucene folders (each folder is a separate app). Only one thread will index any given Lucene folder.

Thanks very much for your help, Lisheng
Re: querying using filter query and lots of possible values
Hi Daniel, depending on how you decide on the list of ids in the first place, you could also create a new index (core) and populate it with DIH, which would select only documents from your main index (core) in this range of ids. When updating, you could try a delta import. Of course, this is only worth the effort if that core would exist for some time - but you've written that the subset of ids is constant for a longer time. Just another idea on top ;-)

Chantal
Re: separation of indexes to optimize facet queries without fulltext
: My thought was, that I could separate indexes. So for the facet queries
: where I don't need fulltext search (so also no indexed fulltext field)
: I can use a completely new setup of a sharded Solr which doesn't include
: the indexed fulltext, so the index is kept small, containing just the few
: fields I have.
:
: And for the fulltext queries I have the current Solr configuration which
: includes, as mentioned above, all the fields incl. the indexed fulltext field.
:
: Is this a normal way of handling these requirements? That there are
: different kinds of Solr configurations for the different needs? Because the huge redundancy

It's definitely doable -- one thing I'm not clear on is why, if your faceting queries don't care about the full text, you would need to leave those small fields in your full index ... is your plan to do faceting and drill-down using the smaller index, but then display docs resulting from those queries by using the same fq params when querying the full index? If so, then it should work; if not -- you may not need those fields in that index.

In general there is nothing wrong with having multiple indexes to solve multiple use cases -- an index is usually an inverted denormalization of some structured source data, designed for fast queries/retrieval. If there are multiple distinct ways you want to query/retrieve data that don't lend themselves to the same denormalization, there's nothing wrong with multiple denormalizations.

Something else to consider is an approach I've used many times: having a single index, but using special-purpose replicas. You can have a master index that you update at the rate of change, one set of slaves that are used for one type of query pattern (faceting on X, Y, and Z for example) and a different set of slaves that are used for a different query pattern (faceting on A, B, and C), so each set of slaves gets a higher cache hit rate than if the queries were randomized across all machines.

-Hoss
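For the special-purpose-replica setup, each slave simply points at the same master in its solrconfig.xml; a 3.x-style sketch (URL and poll interval are examples):

  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="slave">
      <str name="masterUrl">http://master-host:8983/solr/core/replication</str>
      <str name="pollInterval">00:05:00</str>
    </lst>
  </requestHandler>

The two slave pools differ only in which queries you route to them (and therefore what ends up in their caches), not in their configuration.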
Re: querying using filter query and lots of possible values
Exactly. Creating a new index from the aggregated documents is the plan I described above. I don't really know how long this will take for each new index; hopefully under 1 hour or so. That would be tolerable.

Thanks, Daniel
Re: leaks in solr
Hi guys, I am also seeing this problem. I am using Solr 4 from trunk and seeing this issue repeat every day. Any input on how to resolve this would be great.

-Saroj
Re: language detection and phonetic
On Jul 26, 2012, at 21:22, Alireza Salimi wrote:

The question is: is there any cleaner way to do that?

I've always done phonetic matching using a separate phonetic field (title-ph, for example) and copyField. There's one considerable advantage to that: using dismax, for instance, you can say 'prefer exact matches, but also honour phonetic matches' (by boosting title-fr^2 title-ph^1.1).

Paul
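A sketch of that setup in schema.xml (type and field names are examples; DoubleMetaphone is one of several encoders PhoneticFilterFactory supports):

  <fieldType name="text_phonetic" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone" inject="false"/>
    </analyzer>
  </fieldType>

  <field name="title-ph" type="text_phonetic" indexed="true" stored="false"/>
  <copyField source="title-fr" dest="title-ph"/>

With dismax, qf=title-fr^2 title-ph^1.1 then gives exact matches roughly twice the weight of purely phonetic ones.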
Re: Bulk indexing data into solr
IIRC, a problem with such a scheme was discussed here about two months ago, but I can't remember the exact details. The scheme is generally correct, but you didn't say how you let Solr know that it needs to re-read the new index generation after the indexer fsyncs segments.gen.

Btw, it might be a possible issue: https://lucene.apache.org/core/old_versioned_docs//versions/3_0_1/api/all/org/apache/lucene/index/IndexWriter.html#commit()

'Note that this operation calls Directory.sync on the index files. That call should not return until the file contents metadata are on stable storage. For FSDirectory, this calls the OS's fsync. But, beware: some hardware devices may in fact cache writes even during fsync, and return before the bits are actually on stable storage, to give the appearance of faster performance.'

You should ensure that after segments.gen is fsync'ed, all other index files are fsynced for other processes too.

Could you tell us more about your data? What's the format? Is it located near the indexer? And why can't you use remote streaming via Solr's update handler, or an indexer client app with StreamingUpdateSolrServer?

-- Sincerely yours, Mikhail Khludnev
Tech Lead, Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
Re: separation of indexes to optimize facet queries without fulltext
Hi Chris, thanks for the answer. The plan is that in lots of queries I just need faceted values and don't even do a fulltext search. On the other hand, I need the fulltext search for exactly one task in my application, which is searching documents and returning them. Here no faceting at all is needed, only filtering with fields, which I also use for the other queries. So if 95% of the queries don't use the fulltext, I thought it would make sense to split them.

Your suggestion to have one main master index and several slave indexes sounds promising. Is it possible to have this replication in SolrCloud, e.g. with different kinds of schemas etc.?

Thanks, Daniel
RE: Bulk indexing data into solr
Hi, I really appreciate your quick help!

1) I want to make Solr not cache any IndexReader (hopefully that is possible), because our app is made of many Lucene folders and each of them is not very large; from my previous tests it seems that performance is fine if we just create an IndexReader each time. Hopefully this way we have no sync issues?

2) Our data is mainly in an RDB (currently in MySQL; it will move to Cassandra later). My main concern is that by using Solr we need to pass a rather large amount of data through the network layer via HTTP, which could be a problem?

Best regards, Lisheng
Map/Reduce directly against solr4 index.
Is it possible to run MapReduce jobs directly on Solr4? I'm asking this because I want to use Solr4 as the primary storage engine, and I want to be able to run near-real-time analytics against it as well, rather than exporting Solr4 data out to a Hadoop cluster.
Re: Map/Reduce directly against solr4 index.
Of course you can do it, but the question is whether this will produce the performance results you expect. I've seen talk about this in other forums, so you might find some prior work. Solr and HDFS serve somewhat different purposes. The key issue would be whether your map and reduce code overloads the Solr endpoint. Even using SolrCloud, I believe all requests will have to go through a single URL (to be routed), so if you have thousands of map/reduce jobs all running simultaneously, the question is whether your Solr is architected to handle that amount of throughput.
UUID generation not working
Hi, I am using UUID to generate the unique id in my collection, but when I tried to index the collection it could not find any documents. Can you please tell me how to use UUID in schema.xml?

Thanks, Sarala
Re: Map/Reduce directly against solr4 index.
It's not free (for production use anyway), but you might consider DataStax Enterprise: http://www.datastax.com/products/enterprise It is a very nice consolidation of Cassandra, Solr and Hadoop. No ETL required.

Cheers, Jeff
Re: leaks in solr
On Jul 26, 2012, at 3:18 PM, roz dev rozde...@gmail.com wrote: Hi Guys I am also seeing this problem. I am using SOLR 4 from Trunk and seeing this issue repeat every day. Any inputs about how to resolve this would be great -Saroj Trunk from what date? - Mark
Re: leaks in solr
it was from 4/11/12 -Saroj On Thu, Jul 26, 2012 at 4:21 PM, Mark Miller markrmil...@gmail.com wrote: On Jul 26, 2012, at 3:18 PM, roz dev rozde...@gmail.com wrote: Hi Guys I am also seeing this problem. I am using SOLR 4 from Trunk and seeing this issue repeat every day. Any inputs about how to resolve this would be great -Saroj Trunk from what date? - Mark
Re: UUID generation not working
: : 1. I am using UUID to generate a unique id in my collection but when I tried : to index the collection it could not find any documents. can you please : tell me how to use UUID in schema.xml in general, if you are having a problem achieving a goal, please post what you've tried and what kinds of errors/behavior you are getting instead -- ie: in this case telling us *how* you have already tried using UUID to generate a unique id would be helpful. In Solr 3.x, you can use the UUIDField like so... <fieldType name="uuid" class="solr.UUIDField" indexed="true"/> ... <field name="id" type="uuid" indexed="true" stored="true" default="NEW"/> ... <uniqueKey>id</uniqueKey> ...to generate a new UUID for every doc added. But for Solr 4.x some things have changed, as noted in the Upgrading section for Solr 4.0.0-ALPHA... * Due to low level changes to support SolrCloud, the uniqueKey field can no longer be populated via <copyField/> or <field default="..."/> in the schema.xml. Users wishing to have Solr automatically generate a uniqueKey value when adding documents should instead use an instance of solr.UUIDUpdateProcessorFactory in their update processor chain. See SOLR-2796 for more details. ... https://issues.apache.org/jira/browse/SOLR-2796 https://issues.apache.org/jira/browse/SOLR-3495 -Hoss
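For the Solr 4.x route, the update processor chain Hoss mentions looks roughly like this in solrconfig.xml (a sketch based on SOLR-2796; the chain name and the 'id' field name are assumptions):

<updateRequestProcessorChain name="uuid" default="true">
  <!-- fills in the uniqueKey field with a fresh UUID when the doc lacks one -->
  <processor class="solr.UUIDUpdateProcessorFactory">
    <str name="fieldName">id</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

In 4.x the id field itself is then declared as an ordinary stored field without a default, rather than the 3.x UUIDField-with-default="NEW" pattern.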
Re: solr host name on solrconfig.xml
: i need the host name of my solr-server in my solrconfig.xml : anybody knows the correct variable? : : something like ${solr.host} or ${solr.host.name} ... : : does documentation exist about ALL available variables in the solr : namespaces? Off the top of my head I don't know of any system properties that Solr creates for you in the solr.* namespace -- when you see examples of people talking about things like ${solr.data.dir}, that's just a convention in the example files: you set the property when you run Solr, and Solr *reads* that value because you use it in your solrconfig.xml. Any run-time Java system property should be available when the solrconfig.xml is read, and you can get a list of all the properties in your system from the Properties link in the Solr Admin UI. I don't think there is a standard Java system property for the hostname (machines can have multiple hostnames, even multiple IPs), but you could always do something like... java -Dsolr.my.hostname=`hostname` -jar start.jar ...when running Solr. -Hoss
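To make Hoss's trick concrete: any ${...} reference in solrconfig.xml is substituted from system properties when the config is read, with an optional fallback after a colon. The handler parameter below is a hypothetical example of consuming the property, not a standard Solr variable:

java -Dsolr.my.hostname=`hostname` -jar start.jar

<!-- solrconfig.xml: substitute the property, defaulting to localhost if unset -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="shards">${solr.my.hostname:localhost}:8983/solr</str>
  </lst>
</requestHandler>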
Re: leaks in solr
I'd take a look at this issue: https://issues.apache.org/jira/browse/SOLR-3392 Fixed late April. On Jul 26, 2012, at 7:41 PM, roz dev rozde...@gmail.com wrote: it was from 4/11/12 -Saroj On Thu, Jul 26, 2012 at 4:21 PM, Mark Miller markrmil...@gmail.com wrote: On Jul 26, 2012, at 3:18 PM, roz dev rozde...@gmail.com wrote: Hi Guys I am also seeing this problem. I am using SOLR 4 from Trunk and seeing this issue repeat every day. Any inputs about how to resolve this would be great -Saroj Trunk from what date? - Mark - Mark Miller lucidimagination.com
Re: leaks in solr
Thanks Mark. We are never calling commit or optimize with openSearcher=false. As per the logs, this is what is happening: openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false} -- but we are going to use 4.0 Alpha and see if that helps. -Saroj On Thu, Jul 26, 2012 at 5:12 PM, Mark Miller markrmil...@gmail.com wrote: I'd take a look at this issue: https://issues.apache.org/jira/browse/SOLR-3392 Fixed late April. On Jul 26, 2012, at 7:41 PM, roz dev rozde...@gmail.com wrote: it was from 4/11/12 -Saroj On Thu, Jul 26, 2012 at 4:21 PM, Mark Miller markrmil...@gmail.com wrote: On Jul 26, 2012, at 3:18 PM, roz dev rozde...@gmail.com wrote: Hi Guys I am also seeing this problem. I am using SOLR 4 from Trunk and seeing this issue repeat every day. Any inputs about how to resolve this would be great -Saroj Trunk from what date? - Mark - Mark Miller lucidimagination.com
Re: Binary content index with multiple cores
: Here is my solrconfig.xml for one of the cores : ... : <lib dir="../../dist/" regex="apache-solr-cell-\d.*\.jar" /> : <lib dir="../../contrib/extraction/lib" regex=".*\.jar" /> ... : I've added the maven dependencies like this for the solr war : ... : <dependency> : <groupId>org.apache.solr</groupId> : <artifactId>solr-cell</artifactId> : <classpath>shared</classpath> : </dependency> Doing both of these things is the precise cause of your problem. You now have two instances of all of the solr-cell classes in your classpath, at different levels of the hierarchy. Due to the eccentricities of Java classloading, this causes the classloader to not realize that the ExtractingRequestHandler class it finds is in fact a subclass of the SolrRequestHandler class it finds. If you want to modify the war, modify the war. If you want to load jars as a plugin, load them as plugins. Under no circumstances should you try to do both with the same jar(s). -Hoss
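Concretely, taking the plugin route would mean dropping the solr-cell <dependency> from the war's pom.xml and keeping only the <lib/> directives, e.g. (paths relative to the core's instanceDir, as in the stock example configs):

<!-- solrconfig.xml: load Solr Cell and its Tika dependencies as plugins -->
<lib dir="../../dist/" regex="apache-solr-cell-\d.*\.jar" />
<lib dir="../../contrib/extraction/lib" regex=".*\.jar" />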
Re: leaks in solr
Mark, we use Solr 3.6.0 on FreeBSD 9. Over a period of time, it accumulates a lot of disk space! On Thu, Jul 26, 2012 at 8:47 PM, roz dev rozde...@gmail.com wrote: Thanks Mark. We are never calling commit or optimize with openSearcher=false. As per logs, this is what is happening openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false} -- But, We are going to use 4.0 Alpha and see if that helps. -Saroj On Thu, Jul 26, 2012 at 5:12 PM, Mark Miller markrmil...@gmail.com wrote: I'd take a look at this issue: https://issues.apache.org/jira/browse/SOLR-3392 Fixed late April. On Jul 26, 2012, at 7:41 PM, roz dev rozde...@gmail.com wrote: it was from 4/11/12 -Saroj On Thu, Jul 26, 2012 at 4:21 PM, Mark Miller markrmil...@gmail.com wrote: On Jul 26, 2012, at 3:18 PM, roz dev rozde...@gmail.com wrote: Hi Guys I am also seeing this problem. I am using SOLR 4 from Trunk and seeing this issue repeat every day. Any inputs about how to resolve this would be great -Saroj Trunk from what date? - Mark - Mark Miller lucidimagination.com
Re: Map/Reduce directly against solr4 index.
I think the performance should be close to Hadoop running on HDFS, if somehow a Hadoop job can directly read the Solr index files while executing on the local Solr node. Kinda like how HBase and Cassandra integrate with Hadoop. Plus, we can run the map/reduce job on a standby Solr4 cluster. This way, the documents in Solr will be our primary source of truth, and we have the ability to run near-real-time search queries and analytics on it. No need to export data around. Solr4 is becoming a very interesting solution to many web-scale problems. Just missing the map/reduce component. :) On Thu, Jul 26, 2012 at 3:01 PM, Darren Govoni dar...@ontrenet.com wrote: Of course you can do it, but the question is whether this will produce the performance results you expect. I've seen talk about this in other forums, so you might find some prior work here. Solr and HDFS serve somewhat different purposes. The key issue would be if your map and reduce code overloads the Solr endpoint. Even using SolrCloud, I believe all requests will have to go through a single URL (to be routed), so if you have thousands of map/reduce jobs all running simultaneously, the question is whether your Solr is architected to handle that amount of throughput. On Thu, 2012-07-26 at 14:55 -0700, Trung Pham wrote: Is it possible to run map reduce jobs directly on Solr4? I'm asking this because I want to use Solr4 as the primary storage engine. And I want to be able to run near real time analytics against it as well. Rather than export solr4 data out to a hadoop cluster.
Re: Map/Reduce directly against solr4 index.
You raise an interesting possibility. A map/reduce solr handler over solrcloud... On Thu, 2012-07-26 at 18:52 -0700, Trung Pham wrote: I think the performance should be close to Hadoop running on HDFS, if somehow Hadoop job can directly read the Solr Index file while executing the job on the local solr node. Kindna like how HBase and Cassadra integrate with Hadoop. Plus, we can run the map reduce job on a standby Solr4 cluster. This way, the documents in Solr will be our primary source of truth. And we have the ability to run near real time search queries and analytics on it. No need to export data around. Solr4 is becoming a very interesting solution to many web scale problems. Just missing the map/reduce component. :) On Thu, Jul 26, 2012 at 3:01 PM, Darren Govoni dar...@ontrenet.com wrote: Of course you can do it, but the question is whether this will produce the performance results you expect. I've seen talk about this in other forums, so you might find some prior work here. Solr and HDFS serve somewhat different purposes. The key issue would be if your map and reduce code overloads the Solr endpoint. Even using SolrCloud, I believe all requests will have to go through a single URL (to be routed), so if you have thousands of map/reduce jobs all running simultaneously, the question is whether your Solr is architected to handle that amount of throughput. On Thu, 2012-07-26 at 14:55 -0700, Trung Pham wrote: Is it possible to run map reduce jobs directly on Solr4? I'm asking this because I want to use Solr4 as the primary storage engine. And I want to be able to run near real time analytics against it as well. Rather than export solr4 data out to a hadoop cluster.
Re: Map/Reduce directly against solr4 index.
Mahout includes a file reader for Lucene indexes. It will read from HDFS or local disks. On Thu, Jul 26, 2012 at 6:57 PM, Darren Govoni dar...@ontrenet.com wrote: You raise an interesting possibility. A map/reduce solr handler over solrcloud... On Thu, 2012-07-26 at 18:52 -0700, Trung Pham wrote: I think the performance should be close to Hadoop running on HDFS, if somehow Hadoop job can directly read the Solr Index file while executing the job on the local solr node. Kindna like how HBase and Cassadra integrate with Hadoop. Plus, we can run the map reduce job on a standby Solr4 cluster. This way, the documents in Solr will be our primary source of truth. And we have the ability to run near real time search queries and analytics on it. No need to export data around. Solr4 is becoming a very interesting solution to many web scale problems. Just missing the map/reduce component. :) On Thu, Jul 26, 2012 at 3:01 PM, Darren Govoni dar...@ontrenet.com wrote: Of course you can do it, but the question is whether this will produce the performance results you expect. I've seen talk about this in other forums, so you might find some prior work here. Solr and HDFS serve somewhat different purposes. The key issue would be if your map and reduce code overloads the Solr endpoint. Even using SolrCloud, I believe all requests will have to go through a single URL (to be routed), so if you have thousands of map/reduce jobs all running simultaneously, the question is whether your Solr is architected to handle that amount of throughput. On Thu, 2012-07-26 at 14:55 -0700, Trung Pham wrote: Is it possible to run map reduce jobs directly on Solr4? I'm asking this because I want to use Solr4 as the primary storage engine. And I want to be able to run near real time analytics against it as well. Rather than export solr4 data out to a hadoop cluster. -- Lance Norskog goks...@gmail.com
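For reference, the Mahout entry point Lance is likely referring to is the lucene.vector driver; a hedged sketch (paths are placeholders, and option spellings are per Mahout-0.x-era docs, so check them against your version):

# reads a Lucene index directory (local or on HDFS) and writes Mahout vectors
bin/mahout lucene.vector --dir /var/solr/data/index --field content --dictOut /tmp/dict.txt --output /tmp/content-vectors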
Re: Map/Reduce directly against solr4 index.
Can it read distributed lucene indexes in SolrCloud? On Jul 26, 2012 7:11 PM, Lance Norskog goks...@gmail.com wrote: Mahout includes a file reader for Lucene indexes. It will read from HDFS or local disks. On Thu, Jul 26, 2012 at 6:57 PM, Darren Govoni dar...@ontrenet.com wrote: You raise an interesting possibility. A map/reduce solr handler over solrcloud... On Thu, 2012-07-26 at 18:52 -0700, Trung Pham wrote: I think the performance should be close to Hadoop running on HDFS, if somehow Hadoop job can directly read the Solr Index file while executing the job on the local solr node. Kindna like how HBase and Cassadra integrate with Hadoop. Plus, we can run the map reduce job on a standby Solr4 cluster. This way, the documents in Solr will be our primary source of truth. And we have the ability to run near real time search queries and analytics on it. No need to export data around. Solr4 is becoming a very interesting solution to many web scale problems. Just missing the map/reduce component. :) On Thu, Jul 26, 2012 at 3:01 PM, Darren Govoni dar...@ontrenet.com wrote: Of course you can do it, but the question is whether this will produce the performance results you expect. I've seen talk about this in other forums, so you might find some prior work here. Solr and HDFS serve somewhat different purposes. The key issue would be if your map and reduce code overloads the Solr endpoint. Even using SolrCloud, I believe all requests will have to go through a single URL (to be routed), so if you have thousands of map/reduce jobs all running simultaneously, the question is whether your Solr is architected to handle that amount of throughput. On Thu, 2012-07-26 at 14:55 -0700, Trung Pham wrote: Is it possible to run map reduce jobs directly on Solr4? I'm asking this because I want to use Solr4 as the primary storage engine. And I want to be able to run near real time analytics against it as well. Rather than export solr4 data out to a hadoop cluster. -- Lance Norskog goks...@gmail.com
Re: leaks in solr
What does the Statistics page in the Solr admin say? There might be several searchers open: org.apache.solr.search.SolrIndexSearcher Each searcher holds open different generations of the index. If obsolete index files are held open, it may be old searchers. How big are the caches? How long does it take to autowarm them? On Thu, Jul 26, 2012 at 6:15 PM, Karthick Duraisamy Soundararaj karthick.soundara...@gmail.com wrote: Mark, We use solr 3.6.0 on freebsd 9. Over a period of time, it accumulates lots of space! On Thu, Jul 26, 2012 at 8:47 PM, roz dev rozde...@gmail.com wrote: Thanks Mark. We are never calling commit or optimize with openSearcher=false. As per logs, this is what is happening openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false} -- But, We are going to use 4.0 Alpha and see if that helps. -Saroj On Thu, Jul 26, 2012 at 5:12 PM, Mark Miller markrmil...@gmail.com wrote: I'd take a look at this issue: https://issues.apache.org/jira/browse/SOLR-3392 Fixed late April. On Jul 26, 2012, at 7:41 PM, roz dev rozde...@gmail.com wrote: it was from 4/11/12 -Saroj On Thu, Jul 26, 2012 at 4:21 PM, Mark Miller markrmil...@gmail.com wrote: On Jul 26, 2012, at 3:18 PM, roz dev rozde...@gmail.com wrote: Hi Guys I am also seeing this problem. I am using SOLR 4 from Trunk and seeing this issue repeat every day. Any inputs about how to resolve this would be great -Saroj Trunk from what date? - Mark - Mark Miller lucidimagination.com -- Lance Norskog goks...@gmail.com
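On Solr 3.x, one quick way to check Lance's questions is the stats page itself (assuming the default port; multicore installs use /solr/<corename>/admin/stats.jsp) -- look for more than one SolrIndexSearcher entry, and at each cache's size and warmupTime:

curl 'http://localhost:8983/solr/admin/stats.jsp'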
Re: Updating a SOLR index with a properties file
You can use the DataImportHandler. The DIH config would use a file reader, then the line reader tool, then split each line with a regular expression into two fields. If you need a unique ID, look up the UUID tools. I have never heard of this use case. On Thu, Jul 26, 2012 at 1:56 PM, Florian Popescu florian.pope...@gmail.com wrote: I am not sure if this is already possible with the built-in set of request handlers. I am trying to update the index using a properties file (one document per file). Is this something that can be done? I searched the wiki and none of the stuff there seems to address this. Thanks in advance, Florian -- Lance Norskog goks...@gmail.com
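A sketch of the DIH config Lance describes; the file path, field names, and regexes are illustrative assumptions. Note it yields one Solr document per line, not per file as Florian asked -- aggregating to one document per file would need a custom transformer:

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document>
    <!-- LineEntityProcessor exposes each line of the file as 'rawLine' -->
    <entity name="prop"
            processor="LineEntityProcessor"
            url="/data/bundles/messages.properties"
            omitLineRegex="^\s*#"
            transformer="RegexTransformer">
      <field column="key"   sourceColName="rawLine" regex="^\s*([^=]+?)\s*=.*$"/>
      <field column="value" sourceColName="rawLine" regex="^[^=]*=\s*(.*)$"/>
    </entity>
  </document>
</dataConfig>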
Re: Map/Reduce directly against solr4 index.
No. This is just a Hadoop file input class. Distributed Hadoop has to get files from a distributed file service. It sounds like you want some kind of distributed file service that maps a TaskNode (??) on a given server to the files available on that server. There might be something that does this. HDFS works very hard at doing this; are you sure it is not good enough? I am endlessly amazed at the speed of these distributed apps. Have you done a proof of concept? On Thu, Jul 26, 2012 at 7:40 PM, Trung Pham tr...@phamcom.com wrote: Can it read distributed lucene indexes in SolrCloud? On Jul 26, 2012 7:11 PM, Lance Norskog goks...@gmail.com wrote: Mahout includes a file reader for Lucene indexes. It will read from HDFS or local disks. On Thu, Jul 26, 2012 at 6:57 PM, Darren Govoni dar...@ontrenet.com wrote: You raise an interesting possibility. A map/reduce solr handler over solrcloud... On Thu, 2012-07-26 at 18:52 -0700, Trung Pham wrote: I think the performance should be close to Hadoop running on HDFS, if somehow Hadoop job can directly read the Solr Index file while executing the job on the local solr node. Kindna like how HBase and Cassadra integrate with Hadoop. Plus, we can run the map reduce job on a standby Solr4 cluster. This way, the documents in Solr will be our primary source of truth. And we have the ability to run near real time search queries and analytics on it. No need to export data around. Solr4 is becoming a very interesting solution to many web scale problems. Just missing the map/reduce component. :) On Thu, Jul 26, 2012 at 3:01 PM, Darren Govoni dar...@ontrenet.com wrote: Of course you can do it, but the question is whether this will produce the performance results you expect. I've seen talk about this in other forums, so you might find some prior work here. Solr and HDFS serve somewhat different purposes. The key issue would be if your map and reduce code overloads the Solr endpoint. Even using SolrCloud, I believe all requests will have to go through a single URL (to be routed), so if you have thousands of map/reduce jobs all running simultaneously, the question is whether your Solr is architected to handle that amount of throughput. On Thu, 2012-07-26 at 14:55 -0700, Trung Pham wrote: Is it possible to run map reduce jobs directly on Solr4? I'm asking this because I want to use Solr4 as the primary storage engine. And I want to be able to run near real time analytics against it as well. Rather than export solr4 data out to a hadoop cluster. -- Lance Norskog goks...@gmail.com -- Lance Norskog goks...@gmail.com
Re: Updating a SOLR index with a properties file
Thanks! I will try it out and see how it works. This is for indexing a bunch of Java resource bundles and trying to 'refactor' the keys -- basically trying to figure out if a key is used in multiple places and extracting it out if applicable. Florian On Jul 26, 2012, at 10:46 PM, Lance Norskog goks...@gmail.com wrote: You can use the DataImportHandler. The DIH config would use a file reader, then the line reader tool, then split each line with a regular expression into two fields. If you need a unique ID, look up the UUID tools. I have never heard of this use case. On Thu, Jul 26, 2012 at 1:56 PM, Florian Popescu florian.pope...@gmail.com wrote: I am not sure if this is already possible with the built-in set of request handlers. I am trying to update the index using a properties file (one document per file). Is this something that can be done? I searched the wiki and none of the stuff there seems to address this. Thanks in advance, Florian -- Lance Norskog goks...@gmail.com
Re: Significance of Analyzer Class attribute
: When I specify analyzer class in schema, something : like below and do : analysis on this field in analysis page : I cant see : verbose output on : tokenizer and filters The reason for that is that if you use an explicit Analyzer implementation, the analysis tool doesn't know what the individual phases of the token filters are -- the Analyzer API doesn't expose that information (some Analyzers may be monolithic and not made up of individual TokenFilters) : <fieldType name="text_chinese" : class="solr.TextField"> : <analyzer : class="org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer"> : <tokenizer ... : Above config is somehow wrong. You cannot use an analyzer class combined : with tokenizer and filter altogether. If you want to use a Lucene analyzer : in schema.xml there should be only the analyzer definition. Right. What's happening here is that since a class is specified for the analyzer, it is ignoring the tokenizer+tokenfilters listed. I've opened a bug to add better error checking to catch these kinds of configuration mistakes... https://issues.apache.org/jira/browse/SOLR-3683 -Hoss
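For anyone hitting the same mistake, the two valid forms are, schematically (the type names here are only examples):

<!-- Form 1: a monolithic Lucene Analyzer; no tokenizer/filter children -->
<fieldType name="text_chinese" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer"/>
</fieldType>

<!-- Form 2: a factory chain; no class attribute on the analyzer element -->
<fieldType name="text_general" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Only the second form can show per-stage verbose output in the analysis page, since the stages exist as separate factories.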
Re: Significance of Analyzer Class attribute
Hi All, Thank you for the replies. --Regards Rajani On Fri, Jul 27, 2012 at 9:58 AM, Chris Hostetter hossman_luc...@fucit.org wrote: : When I specify analyzer class in schema, something : like below and do : analysis on this field in analysis page : I cant see : verbose output on : tokenizer and filters The reason for that is that if you use an explicit Analyzer implementation, the analysis tool doesn't know what the individual phases of the token filters are -- the Analyzer API doesn't expose that information (some Analyzers may be monolithic and not made up of individual TokenFilters) : <fieldType name="text_chinese" : class="solr.TextField"> : <analyzer : class="org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer"> : <tokenizer ... : Above config is somehow wrong. You cannot use an analyzer class combined : with tokenizer and filter altogether. If you want to use a Lucene analyzer : in schema.xml there should be only the analyzer definition. Right. What's happening here is that since a class is specified for the analyzer, it is ignoring the tokenizer+tokenfilters listed. I've opened a bug to add better error checking to catch these kinds of configuration mistakes... https://issues.apache.org/jira/browse/SOLR-3683 -Hoss
Re: solr host name on solrconfig.xml
Okay, thanks. I know this way but it's not so nice :P I set a new variable in my core.properties file, which I load in solr.xml for each core =)) -- View this message in context: http://lucene.472066.n3.nabble.com/solr-host-name-on-solrconfig-xml-tp3997371p3997652.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Map/Reduce directly against solr4 index.
That is exactly what I want. I want the distributed Hadoop TaskNode to be running on the same server that is holding the local distributed solr index. This way there is no need to move any data around... I think other people call this feature 'data locality' of map/reduce. I believe HBase and Hadoop integration work exactly like this. The only difference here is we are substituting HDFS with the distributed Solr indexes. Since solr4 can manage the sharded/distributed index files, it's doing the exact work that HDFS is doing. In theory, this should be achievable. On Thu, Jul 26, 2012 at 7:51 PM, Lance Norskog goks...@gmail.com wrote: No. This is just a Hadoop file input class. Distributed Hadoop has to get files from a distributed file service. It sounds like you want some kind of distributed file service that maps a TaskNode (??) on a given server to the files available on that server. There might be something that does this. HDFS works very hard at doing this; are you sure it is not good enough? I am endlessly amazed at the speed of these distributed apps. Have you done a proof of concept? On Thu, Jul 26, 2012 at 7:40 PM, Trung Pham tr...@phamcom.com wrote: Can it read distributed lucene indexes in SolrCloud? On Jul 26, 2012 7:11 PM, Lance Norskog goks...@gmail.com wrote: Mahout includes a file reader for Lucene indexes. It will read from HDFS or local disks. On Thu, Jul 26, 2012 at 6:57 PM, Darren Govoni dar...@ontrenet.com wrote: You raise an interesting possibility. A map/reduce solr handler over solrcloud... On Thu, 2012-07-26 at 18:52 -0700, Trung Pham wrote: I think the performance should be close to Hadoop running on HDFS, if somehow Hadoop job can directly read the Solr Index file while executing the job on the local solr node. Kindna like how HBase and Cassadra integrate with Hadoop. Plus, we can run the map reduce job on a standby Solr4 cluster. This way, the documents in Solr will be our primary source of truth. And we have the ability to run near real time search queries and analytics on it. No need to export data around. Solr4 is becoming a very interesting solution to many web scale problems. Just missing the map/reduce component. :) On Thu, Jul 26, 2012 at 3:01 PM, Darren Govoni dar...@ontrenet.com wrote: Of course you can do it, but the question is whether this will produce the performance results you expect. I've seen talk about this in other forums, so you might find some prior work here. Solr and HDFS serve somewhat different purposes. The key issue would be if your map and reduce code overloads the Solr endpoint. Even using SolrCloud, I believe all requests will have to go through a single URL (to be routed), so if you have thousands of map/reduce jobs all running simultaneously, the question is whether your Solr is architected to handle that amount of throughput. On Thu, 2012-07-26 at 14:55 -0700, Trung Pham wrote: Is it possible to run map reduce jobs directly on Solr4? I'm asking this because I want to use Solr4 as the primary storage engine. And I want to be able to run near real time analytics against it as well. Rather than export solr4 data out to a hadoop cluster. -- Lance Norskog goks...@gmail.com -- Lance Norskog goks...@gmail.com
Re: Solr - hl.fragsize Issue
Hi @iorixxx, I use DefaultSolrHighlighter, and yes, the fragment size also includes the <em> tags, but even if we remove the <em> tags from the fragments, the average fragment size is 110 instead of 100. -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-hl-fragsize-Issue-tp3997457p3997656.html Sent from the Solr - User mailing list archive at Nabble.com.
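One likely explanation for the ~110 average: with the default GapFragmenter, hl.fragsize is a soft target -- a fragment is only closed at a token boundary at or past the target length, so fragments normally overshoot somewhat rather than landing on exactly 100. The relevant knob in the stock solrconfig.xml is:

<fragmenter name="gap" default="true" class="solr.highlight.GapFragmenter">
  <lst name="defaults">
    <int name="hl.fragsize">100</int>
  </lst>
</fragmenter>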