Atomic update bug
Hello, I've upgraded the SolrCloud from 7.6 to 8.8 and unfortunately I got the following exception on Atomic updates of some of the documents. And in some cases some fields are retrieved with an array of multi values in case the field is defined as a single value. Is there a bug on this version regarding the Atomic updates? And how can I solve this issue? org.apache.solr.common.SolrException: TransactionLog doesn't know how to serialize class org.apache.lucene.document.LazyDocument$LazyField; try implementing ObjectResolver? at org.apache.solr.update.TransactionLog$1.resolve(TransactionLog.java:100) at org.apache.solr.common.util.JavaBinCodec.writeVal(JavaBinCodec.java:266) at org.apache.solr.common.util.JavaBinCodec$BinEntryWriter.put(JavaBinCodec.java:441) at org.apache.solr.common.ConditionalKeyMapWriter$EntryWriterWrapper.put(ConditionalKeyMapWriter.java:44) at org.apache.solr.common.MapWriter$EntryWriter.putNoEx(MapWriter.java:101) at org.apache.solr.common.MapWriter$EntryWriter.lambda$getBiConsumer$0(MapWriter.java:161) at org.apache.solr.common.SolrInputDocument.lambda$writeMap$0(SolrInputDocument.java:59) at java.base/java.util.LinkedHashMap.forEach(LinkedHashMap.java:684) at org.apache.solr.common.SolrInputDocument.writeMap(SolrInputDocument.java:61) at org.apache.solr.common.util.JavaBinCodec.writeSolrInputDocument(JavaBinCodec.java:667) at org.apache.solr.update.TransactionLog.write(TransactionLog.java:397) at org.apache.solr.update.UpdateLog.add(UpdateLog.java:585) at org.apache.solr.update.UpdateLog.add(UpdateLog.java:557) at org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:351) at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:294) at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:241) at org.apache.solr.update.processor.RunUpdateProcessorFactory$RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:73) at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55) at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:256) at org.apache.solr.update.processor.DistributedUpdateProcessor.doVersionAdd(DistributedUpdateProcessor.java:495) at org.apache.solr.update.processor.DistributedUpdateProcessor.lambda$versionAdd$0(DistributedUpdateProcessor.java:336) at org.apache.solr.update.VersionBucket.runWithLock(VersionBucket.java:50) at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:336) at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:222) at org.apache.solr.update.processor.DistributedZkUpdateProcessor.processAdd(DistributedZkUpdateProcessor.java:245) at org.apache.solr.handler.loader.JavabinLoader$1.update(JavabinLoader.java:110) at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$StreamingCodec.readOuterMostDocIterator(JavaBinUpdateRequestCodec.java:343) at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$StreamingCodec.readIterator(JavaBinUpdateRequestCodec.java:291) at org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinCodec.java:338) at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:283) at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$StreamingCodec.readNamedList(JavaBinUpdateRequestCodec.java:244) at org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinCodec.java:303) at 
org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:283) at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:196) at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.unmarshal(JavaBinUpdateRequestCodec.java:131) at org.apache.solr.handler.loader.JavabinLoader.parseAndLoadDocs(JavabinLoader.java:122) at org.apache.solr.handler.loader.JavabinLoader.load(JavabinLoader.java:70) at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:97) at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:82) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:216) at org.apache.solr.core.SolrCore.execute(SolrCore.java:2646) at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:794) at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:567) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:427) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:357) at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:201) at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1601) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:548) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143) at
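For readers who land on this trace: the failing request is an ordinary atomic ("set") update. A minimal SolrJ sketch of that kind of update is below — collection and field names are made up, and it is only meant to show the operation that triggers the error on 8.8, not a fix for it. The LazyDocument$LazyField in the message appears to come from the existing stored document being loaded lazily while the partial update is rebuilt.

import java.util.Collections;
import java.util.Optional;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class AtomicUpdateExample {
    public static void main(String[] args) throws Exception {
        try (CloudSolrClient client = new CloudSolrClient.Builder(
                Collections.singletonList("zk1:2181"), Optional.empty()).build()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("document_id", "42");
            // atomic "set": replace the value of one field without resending the whole document
            doc.addField("title", Collections.singletonMap("set", "new title"));
            client.add("myCollection", doc);
            client.commit("myCollection");
        }
    }
}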
Re: Change field to DocValues
That's right, I want to avoid a complete reindexing process. But should I create another field with the docValues property or change the current field directly? Can I use streaming expressions to update the whole index or should I select and update using batches? Thanks, Mahmoud On Wed, Feb 17, 2021 at 4:51 PM xiefengchang wrote: > Hi: > I think you are just trying to avoid complete re-index right? > why don't you take a look at this: > https://lucene.apache.org/solr/guide/8_0/updating-parts-of-documents.html > > > > > > > > > > > > > > > > > > At 2021-02-17 21:14:11, "Mahmoud Almokadem" > wrote: > >Hello, > > > >I've an integer field on an index with billions of documents and need to > do > >facets on this field, unfortunately the field doesn't have the docValues > >property, so the FieldCache will be fired and use much memory. > > > >What is the best way to change the field to be docValues supported? > > > >Regards, > >Mahmoud >
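For anyone weighing the two options: a plain batched approach can be done entirely from SolrJ by paging with cursorMark and sending atomic "set" updates. The sketch below is untested and uses hypothetical names (a new my_int_field_dv field, collection myCollection); note that atomic updates rebuild each document from its stored/docValues fields, so this only works if every field in the schema is stored or has docValues.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.common.params.CursorMarkParams;

public class BackfillDocValuesField {
    public static void main(String[] args) throws Exception {
        String collection = "myCollection";                       // hypothetical
        try (SolrClient client = new CloudSolrClient.Builder(
                Collections.singletonList("zk1:2181"), Optional.empty()).build()) {
            SolrQuery q = new SolrQuery("*:*");
            q.setFields("document_id", "my_int_field");            // old stored field
            q.setRows(10000);
            q.setSort("document_id", SolrQuery.ORDER.asc);         // cursorMark needs a uniqueKey sort
            String cursor = CursorMarkParams.CURSOR_MARK_START;
            boolean done = false;
            while (!done) {
                q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
                QueryResponse rsp = client.query(collection, q);
                List<SolrInputDocument> batch = new ArrayList<>();
                for (SolrDocument d : rsp.getResults()) {
                    SolrInputDocument upd = new SolrInputDocument();
                    upd.addField("document_id", d.getFieldValue("document_id"));
                    // atomic "set": copy the old value into the new docValues-enabled field
                    upd.addField("my_int_field_dv",
                            Collections.singletonMap("set", d.getFieldValue("my_int_field")));
                    batch.add(upd);
                }
                if (!batch.isEmpty()) {
                    client.add(collection, batch);
                }
                String next = rsp.getNextCursorMark();
                done = cursor.equals(next);
                cursor = next;
            }
            client.commit(collection);
        }
    }
}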
Change field to DocValues
Hello, I have an integer field in an index with billions of documents and need to facet on this field. Unfortunately the field doesn't have the docValues property, so the FieldCache will be used and consume a lot of memory. What is the best way to change the field to be docValues enabled? Regards, Mahmoud
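For reference, if the choice is to add a separate docValues-enabled field rather than reindex, the field can be defined through the Schema API from SolrJ. This is a sketch under assumptions (a managed schema is in use; field name, type and URL are placeholders); the existing values still have to be copied into it, e.g. with the batch-update sketch shown earlier in this thread.

import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.schema.SchemaRequest;

public class AddDocValuesField {
    public static void main(String[] args) throws Exception {
        try (SolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/myCollection").build()) {
            // Define a sibling field that carries docValues; existing data still
            // has to be copied into it afterwards.
            Map<String, Object> attrs = new LinkedHashMap<>();
            attrs.put("name", "my_int_field_dv");   // hypothetical name
            attrs.put("type", "pint");
            attrs.put("indexed", true);
            attrs.put("stored", false);
            attrs.put("docValues", true);
            new SchemaRequest.AddField(attrs).process(client);
        }
    }
}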
Re: Reindex single shard on solr
You're right Erick. for the Hash.murmurhash3_x86_32 method I don't know should I pass my Id directly or with specific format like '1874f9aa-4cad-4839-a282-d624fe2c40c6!document_id', so I used a predefined method that get shard name directly. createCollection method doesn't create a collection physically on SolrCould, it's only a reference to the size of shards of the collection. Also the CloudSolrClient doesn't have a method called "getCollection" may be related to the SolrRequest class which is not used on my code. I used the following code to target my shards String id = document.getFieldValue("document_id").toString(); Slice slice = router.getTargetSlice(id, document, null, null, solrCollection ); String shard = slice.getName(); if(targetShards.contains(shard)){ bufferDocuments.add(document); } Thanks for your help, Mahmoud On Fri, Dec 14, 2018 at 11:20 PM Erick Erickson wrote: > Why do you need to create a collection? That's probably just there in > the test code to have something to test against. > > WARNING: I haven't verified this, but it should be something like the > following. What you need > is the hash range for the shard (slice) you're trying to update, then > send each doc ID through > the hash function and, if the result falls in the range of your target > shard, index the doc. > > CloudSolrClient cloudSolrClient = . > > DocCollection coll = cloudSolrClient.getCollection(collName); > Slice slice = coll.getSlice("shard_name_you_care_about"); // you can > get all the slices and interate BTW. > DocRouter.Range range = slice.getRange() > > for (each doc) { > int hash = Hash.murmurhash3_x86_32(whatever_your_unique_key_is, 0, > id.length(), 0); > if (range.includes(hash)) { > index it to Solr > } > } > > "Hash" is in org.apache.solr.common.util, in > > solr-solrj-##.jar, part of the normal distro. > > Best, > Erick > On Fri, Dec 14, 2018 at 11:53 AM Mahmoud Almokadem > wrote: > > > > Thanks Erick, > > > > I got it from TestHashPartitioner.java > > > > > https://github.com/apache/lucene-solr/blob/1d85cd783863f75cea133fb9c452302214165a4d/solr/core/src/test/org/apache/solr/cloud/TestHashPartitioner.java > > > > Here is a sample code > > > > router = DocRouter.getDocRouter(CompositeIdRouter.NAME); > > int shardsCount = 12; > > solrCollection = createCollection(shardsCount, router); > > > > SolrInputDocument document = getSolrDocument(item); //need to implement > > this method to get SolrInputDocument > > > > String id = "1874f9aa-4cad-4839-a282-d624fe2c40c6" > > Slice slice = router.getTargetSlice(id, document, null, null, > > solrCollection ); > > String shardName = slice.getName(); // shard1, shard2, ... etc > > > > //Helper methods from > > DocCollection createCollection(int nSlices, DocRouter router) { > > List ranges = router.partitionRange(nSlices, > > router.fullRange()); > > > > Map slices = new HashMap<>(); > > for (int i=0; i > DocRouter.Range range = ranges.get(i); > > Slice slice = new Slice("shard"+(i+1), null, > > map("range",range)); > > slices.put(slice.getName(), slice); > > } > > > > DocCollection coll = new DocCollection("collection1", slices, > null, > > router); > > return coll; > > } > > > > > > public static Map map(Object... params) { > > LinkedHashMap ret = new LinkedHashMap(); > > for (int i=0; i > Object o = ret.put(params[i], params[i+1]); > > // TODO: handle multi-valued map? 
> > } > > return ret; > > } > > > > > > Mahmoud > > > > On Fri, Dec 14, 2018 at 7:06 PM Mahmoud Almokadem < > prog.mahm...@gmail.com> > > wrote: > > > > > Thanks Erick, > > > > > > You know how to use this method. Or I need to dive into the code? > > > > > > I've the document_id as string uniqueKey and have 12 shards. > > > > > > On Fri, Dec 14, 2018 at 5:58 PM Erick Erickson < > erickerick...@gmail.com> > > > wrote: > > > > > >> Sure. Of course you have to make sure you use the exact same hashing > > >> algorithm on the . > > >> > > >> See CompositeIdRouter.sliceHash > > >> > > >> Best, > > >> Erick > > >> On Fri, Dec 14, 2018 at 3:36 AM Mahmoud Almokadem > > >> wrote: > > >> > > > >> > Hello, > > >> > > > >> > I've a corruption on some of the shards on my collection and I've a > full > > >> > dataset on my database, and I'm using CompositeId for routing > documents. > > >> > > > >> > Can I traverse the whole dataset and do something like hashing the > > >> > document_id to identify that this document belongs to a specific > shard > > >> to > > >> > send the desired documents only instead of reindex the whole > dataset? > > >> > > > >> > Sincerely, > > >> > Mahmoud > > >> > > > >
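Condensing Erick's suggestion from this thread into one helper, the per-document check looks roughly like the sketch below. It is untested, and it only matches what CompositeIdRouter does for plain IDs (no 'routeKey!id' prefix); for composite IDs, router.getTargetSlice(...) as used above is the safer route.

import org.apache.solr.common.cloud.DocRouter;
import org.apache.solr.common.cloud.Slice;
import org.apache.solr.common.util.Hash;

public class ShardRangeFilter {

    // True when a plain (non-composite) uniqueKey value hashes into the range owned by the slice.
    static boolean belongsTo(Slice slice, String uniqueKey) {
        DocRouter.Range range = slice.getRange();
        int hash = Hash.murmurhash3_x86_32(uniqueKey, 0, uniqueKey.length(), 0);
        return range.includes(hash);
    }
}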
Re: Reindex single shard on solr
Thanks Erick, I got it from TestHashPartitioner.java https://github.com/apache/lucene-solr/blob/1d85cd783863f75cea133fb9c452302214165a4d/solr/core/src/test/org/apache/solr/cloud/TestHashPartitioner.java Here is a sample code:

router = DocRouter.getDocRouter(CompositeIdRouter.NAME);
int shardsCount = 12;
solrCollection = createCollection(shardsCount, router);

SolrInputDocument document = getSolrDocument(item); // need to implement this method to get SolrInputDocument

String id = "1874f9aa-4cad-4839-a282-d624fe2c40c6";
Slice slice = router.getTargetSlice(id, document, null, null, solrCollection);
String shardName = slice.getName(); // shard1, shard2, ... etc

//Helper methods from
DocCollection createCollection(int nSlices, DocRouter router) {
    List<DocRouter.Range> ranges = router.partitionRange(nSlices, router.fullRange());

    Map<String, Slice> slices = new HashMap<>();
    for (int i = 0; i < ranges.size(); i++) {
        DocRouter.Range range = ranges.get(i);
        Slice slice = new Slice("shard" + (i + 1), null, map("range", range));
        slices.put(slice.getName(), slice);
    }

    DocCollection coll = new DocCollection("collection1", slices, null, router);
    return coll;
}

public static Map map(Object... params) {
    LinkedHashMap ret = new LinkedHashMap();
    for (int i = 0; i < params.length; i += 2) {
        Object o = ret.put(params[i], params[i + 1]);
        // TODO: handle multi-valued map?
    }
    return ret;
}

Mahmoud

On Fri, Dec 14, 2018 at 7:06 PM Mahmoud Almokadem <prog.mahm...@gmail.com> wrote: > Thanks Erick, > > You know how to use this method. Or I need to dive into the code? > > I've the document_id as string uniqueKey and have 12 shards. > > On Fri, Dec 14, 2018 at 5:58 PM Erick Erickson > wrote: >> Sure. Of course you have to make sure you use the exact same hashing >> algorithm on the <uniqueKey>. >> >> See CompositeIdRouter.sliceHash >> >> Best, >> Erick >> On Fri, Dec 14, 2018 at 3:36 AM Mahmoud Almokadem >> wrote: >> > >> > Hello, >> > >> > I've a corruption on some of the shards on my collection and I've a full >> > dataset on my database, and I'm using CompositeId for routing documents. >> > >> > Can I traverse the whole dataset and do something like hashing the >> > document_id to identify that this document belongs to a specific shard >> to >> > send the desired documents only instead of reindex the whole dataset? >> > >> > Sincerely, >> > Mahmoud >> >
Re: Reindex single shard on solr
Thanks Erick, Do you know how to use this method, or do I need to dive into the code? I have document_id as a string uniqueKey and have 12 shards. On Fri, Dec 14, 2018 at 5:58 PM Erick Erickson wrote: > Sure. Of course you have to make sure you use the exact same hashing > algorithm on the <uniqueKey>. > > See CompositeIdRouter.sliceHash > > Best, > Erick > On Fri, Dec 14, 2018 at 3:36 AM Mahmoud Almokadem > wrote: > > > > Hello, > > > > I've a corruption on some of the shards on my collection and I've a full > > dataset on my database, and I'm using CompositeId for routing documents. > > > > Can I traverse the whole dataset and do something like hashing the > > document_id to identify that this document belongs to a specific shard > to > > send the desired documents only instead of reindex the whole dataset? > > > > Sincerely, > > Mahmoud >
Re: no segments* file found
Thanks Eric, already tried to Lucene but cannot continue cause I need to get my collection up ASAP. So, I started my reindexing process and I'll investigate this issue while indexing. Mahmoud On Fri, Dec 14, 2018 at 6:08 PM Erick Erickson wrote: > You'd have to dive into the Lucene code and figure out the format, > offhand I don't know what it is. > > However, there's no guarantee here that it'll result in a consistent index. > Consider merging two segments, seg1 and seg2. Here's the merge sequence: > > 1> merge the segments, At the end of this you have seg1, seg2, and > seg3. Segments_N points only go seg1 and seg2. > 2> write a new segments_N+1 file that points _only_ to seg3 > 3> delete seg1 and seg2 > > So if you were part way through merging and had seg1, seg2, and seg3 on > disk, > reconstructing the segments_N file from the available segments on disk > will result > duplicate documents in your index. > > FWIW, > Erick > On Fri, Dec 14, 2018 at 3:27 AM Mahmoud Almokadem > wrote: > > > > Hello, > > > > I'm facing an issue that some shards of my SolrCloud collection is > > corrupted due to they don't have segments_N file but I think the whole > > segments are still available. Can I create a segment_N file from the > > available files? > > > > This is the stack trace: > > > > org.apache.solr.core.SolrCoreInitializationException: SolrCore > > 'my_collection_shard12_replica_n22' is not available due to init failure: > > Error opening new searcher > > at org.apache.solr.core.CoreContainer.getCore(CoreContainer.java:1495) > > at org.apache.solr.servlet.HttpSolrCall.init(HttpSolrCall.java:251) > > at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:471) > > at > > > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:384) > > at > > > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:330) > > at > > > org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1629) > > at > > > org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:533) > > at > > > org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143) > > at > > > org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548) > > at > > > org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132) > > at > > > org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:190) > > at > > > org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1595) > > at > > > org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:188) > > at > > > org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1253) > > at > > > org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:168) > > at > org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473) > > at > > > org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1564) > > at > > > org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:166) > > at > > > org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1155) > > at > > > org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > > at > > > org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219) > > at > > > org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:126) > > at > > > 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132) > > at > > > org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335) > > at > > > org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132) > > at org.eclipse.jetty.server.Server.handle(Server.java:530) > > at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:347) > > at > > > org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:256) > > at > > org.eclipse.jetty.io > .AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:279) > > at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:102) > > at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:124) > > at > > > org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:247) > > at > > > org.eclipse.jetty.util.thread.strategy.EatWhatYo
Reindex single shard on solr
Hello, Some of the shards in my collection are corrupted, but I have the full dataset in my database, and I'm using the CompositeId router for routing documents. Can I traverse the whole dataset and do something like hashing the document_id to identify which shard each document belongs to, so that I send only the desired documents instead of reindexing the whole dataset? Sincerely, Mahmoud
no segments* file found
Hello, I'm facing an issue that some shards of my SolrCloud collection is corrupted due to they don't have segments_N file but I think the whole segments are still available. Can I create a segment_N file from the available files? This is the stack trace: org.apache.solr.core.SolrCoreInitializationException: SolrCore 'my_collection_shard12_replica_n22' is not available due to init failure: Error opening new searcher at org.apache.solr.core.CoreContainer.getCore(CoreContainer.java:1495) at org.apache.solr.servlet.HttpSolrCall.init(HttpSolrCall.java:251) at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:471) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:384) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:330) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1629) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:533) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143) at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132) at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:190) at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1595) at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:188) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1253) at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:168) at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473) at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1564) at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:166) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1155) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219) at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:126) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132) at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132) at org.eclipse.jetty.server.Server.handle(Server.java:530) at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:347) at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:256) at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:279) at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:102) at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:124) at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:247) at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.produce(EatWhatYouKill.java:140) at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131) at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:382) at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:708) at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:626) at java.base/java.lang.Thread.run(Thread.java:844) Caused by: 
org.apache.solr.common.SolrException: Error opening new searcher at org.apache.solr.core.SolrCore.(SolrCore.java:1008) at org.apache.solr.core.SolrCore.(SolrCore.java:863) at org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1040) at org.apache.solr.core.CoreContainer.lambda$load$13(CoreContainer.java:640) at com.codahale.metrics.InstrumentedExecutorService$InstrumentedCallable.call(InstrumentedExecutorService.java:197) at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:188) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ... 1 more Caused by: org.apache.solr.common.SolrException: Error opening new searcher at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:2095) at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2215) at org.apache.solr.core.SolrCore.initSearcher(SolrCore.java:1091) at org.apache.solr.core.SolrCore.(SolrCore.java:980) ... 9 more Caused by: org.apache.lucene.index.IndexNotFoundException: no segments* file found in LockValidatingDirectoryWrapper(NRTCachingDirectory(MMapDirectory@/path/to/index/my_collection_shard12_replica_n22/data/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@659b5e1; maxCacheMB=48.0
Re: Dataimporter status
Thanks Shawn, I'm already using the admin UI and get URL for fetching the status of dataimporter from network console and tried it outside the admin UI. Admin UI have the same behavior, when I pressed on execute the status messages are swapped between "not started", "started and indexing", "completed on 3 seconds", "completed on 10 seconds" something like that. I understood what you mean that the dataimporter are load balanced between shards, that's made me using the old admin UI on using dataimporter to get accurate status of what is running now. Because the it's related to core not collection. I think the dataimporter feature must moved to the core level instead of collection level. Thanks, Mahmoud On Tue, Dec 5, 2017 at 6:57 AM, Shawn Heisey <apa...@elyograg.org> wrote: > On 12/3/2017 9:27 AM, Mahmoud Almokadem wrote: > >> We're facing an issue related to the dataimporter status on new Admin UI >> (7.0.1). >> >> Calling to the API >> http://solrip/solr/collection/dataimport?_=1512314812090 >> mand=status=on=json >> >> returns different status despite the importer is running >> The messages are swapped between the following when refreshing the page: >> > > > > The old Admin UI was working well. >> >> Is that a bug on the new Admin UI? >> > > What I'm going to say below is based on the idea that you're running > SolrCloud. If you're not, then this seems extremely odd and should not be > happening. > > The first part of your message has a URL that accesses the API directly, > *not* the admin UI, so I'm going to concentrate on that, and not discuss > the admin UI, because the admin UI is not involved when using that kind of > URL. > > When requests are sent to a collection name rather than directly to a > core, SolrCloud load balances those requests across the cloud, picking > different replicas and shards so each individual request ends up on a > different core, and possibly on a different server. > > This load balancing is a general feature of SolrCloud, and happens even > with the dataimport handler. You never know which shard/replica is going > to actually get a /dataimport request. So what is happening here is that > one of the cores in your collection is actually doing a dataimport, but all > the others aren't. When the status command is load balanced to the core > that did the import, then you see the status with actual data, and when > load balancing sends the request to one of the other cores, you see the > empty status. > > If you want to reliably see the status of an import on SolrCloud, you're > going to have to choose one of the cores (collection_shardN_replicaM) on > one of the servers in your cloud, and send both the import command and the > status command to that one core, instead of the collection. You might even > need to add a distrib=false parameter to the request to keep it from being > load balanced, but I am not sure whether that's needed for /dataimport. > > Thanks, > Shawn >
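To illustrate Shawn's point in code: both the import command and the status check should go to one specific core URL rather than the collection, so the request cannot be load balanced to a core that never ran the import. A rough SolrJ sketch with placeholder host and core names (distrib=false is included as the optional extra safeguard Shawn mentions):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrRequest;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.GenericSolrRequest;
import org.apache.solr.common.params.ModifiableSolrParams;
import org.apache.solr.common.util.NamedList;

public class DihStatusCheck {
    public static void main(String[] args) throws Exception {
        // Point at the core the import was started on, not at the collection.
        String coreUrl = "http://solrhost:8983/solr/collection_shard1_replica_n1";
        try (SolrClient client = new HttpSolrClient.Builder(coreUrl).build()) {
            ModifiableSolrParams params = new ModifiableSolrParams();
            params.set("command", "status");
            params.set("indent", "on");
            params.set("distrib", "false");   // keep the request on this core
            NamedList<Object> status = client.request(
                    new GenericSolrRequest(SolrRequest.METHOD.GET, "/dataimport", params));
            System.out.println(status);
        }
    }
}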
Slices not found for checkpointCollection
Hi all, I'm running Solr 7.0.1. When I tried to run TopicStream with the following expression String expression = "topic(checkpointCollection," + "myCollection" + "," + "q=\"*:*\"," + "fl=\"document_id,title,full_text\"," + "id=\"myTopic\"," + "rows=\"300\"," + "initialCheckpoint=\"0\"," + "wt=javabin)"; I got the error java.io.IOException: Slices not found for checkpointCollection Should I create a checkpointCollection on the cluster? And if yes, what is the schema for this collection? I used the topic stream instead of search to fetch all documents with no docValues fields. Thanks, Mahmoud
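For what it's worth, the checkpoint collection used by topic() appears to be nothing more than an ordinary collection that the stream writes checkpoint documents into, so it has to exist before the expression runs but needs no special schema. A hedged sketch of creating it from SolrJ, assuming the _default configset and a placeholder ZooKeeper address:

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class CreateCheckpointCollection {
    public static void main(String[] args) throws Exception {
        try (CloudSolrClient client = new CloudSolrClient.Builder()
                .withZkHost("zk1:2181,zk2:2181").build()) {
            // A single-shard, single-replica collection is typically enough to hold checkpoints.
            CollectionAdminRequest
                    .createCollection("checkpointCollection", "_default", 1, 1)
                    .process(client);
        }
    }
}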
Dataimporter status
We're facing an issue related to the dataimporter status on the new Admin UI (7.0.1). Calling the API http://solrip/solr/collection/dataimport?_=1512314812090&command=status&indent=on&wt=json returns a different status on each request even though the importer is running. The responses alternate between the following when refreshing the page:

{
  "responseHeader":{
    "status":0,
    "QTime":0},
  "initArgs":[
    "defaults",[
      "config","data-config-online-live-pervoice.xml"]],
  "command":"status",
  "status":"idle",
  "importResponse":"",
  "statusMessages":{}}

===

{
  "responseHeader":{
    "status":0,
    "QTime":0},
  "initArgs":[
    "defaults",[
      "config","data-config-online-live-pervoice.xml"]],
  "command":"status",
  "status":"idle",
  "importResponse":"",
  "statusMessages":{
    "Total Requests made to DataSource":"2",
    "Total Rows Fetched":"715",
    "Total Documents Processed":"679",
    "Total Documents Skipped":"0",
    "Full Dump Started":"2017-12-03 18:22:31",
    "":"Indexing completed. Added/Updated: 679 documents. Deleted 0 documents.",
    "Committed":"2017-12-03 18:22:32",
    "Total Documents Failed":"36",
    "Time taken":"0:0:54.638",
    "Full Import failed":"2017-12-03 18:22:32"}}

The old Admin UI was working well. Is this a bug in the new Admin UI? Thanks, Mahmoud
Log page auto refresh
Hello, I have an issue with the Log page on the new Admin UI (7.0.1): when I expand an item, it collapses again after a short time. This behavior is different from the old Admin UI. Thanks, Mahmoud
Re: Unbalanced CPU on SolrCloud
It takes more time after I stopped the indexing. The load firstly was with the first node and after I restarted the indexing process the load with changed to the second node the first node worked properly. Thanks, Mahmoud On Mon, Oct 16, 2017 at 5:29 PM, Emir Arnautović < emir.arnauto...@sematext.com> wrote: > Does the load stops when you stop indexing or it last for some more time? > Is it always one node that behaves like this and it starts as soon as you > start indexing? Is load different between nodes when you are doing lighter > indexing? > > -- > Monitoring - Log Management - Alerting - Anomaly Detection > Solr & Elasticsearch Consulting Support Training - http://sematext.com/ > > > > > On 16 Oct 2017, at 13:35, Mahmoud Almokadem <prog.mahm...@gmail.com> > wrote: > > > > The transition of the load happened after I restarted the bulk insert > > process. > > > > The size of the index on each server about 500GB. > > > > There are about 8 warnings on each server for "Not found segment file" > like > > that > > > > Error getting file length for [segments_2s4] > > > > java.nio.file.NoSuchFileException: > > /media/ssd_losedata/solr-home/data/documents_online_shard16_ > replica_n1/data/index/segments_2s4 > > at > > java.base/sun.nio.fs.UnixException.translateToIOException( > UnixException.java:92) > > at > > java.base/sun.nio.fs.UnixException.rethrowAsIOException( > UnixException.java:111) > > at > > java.base/sun.nio.fs.UnixException.rethrowAsIOException( > UnixException.java:116) > > at > > java.base/sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes( > UnixFileAttributeViews.java:55) > > at > > java.base/sun.nio.fs.UnixFileSystemProvider.readAttributes( > UnixFileSystemProvider.java:145) > > at > > java.base/sun.nio.fs.LinuxFileSystemProvider.readAttributes( > LinuxFileSystemProvider.java:99) > > at java.base/java.nio.file.Files.readAttributes(Files.java:1755) > > at java.base/java.nio.file.Files.size(Files.java:2369) > > at org.apache.lucene.store.FSDirectory.fileLength(FSDirectory.java:243) > > at > > org.apache.lucene.store.NRTCachingDirectory.fileLength( > NRTCachingDirectory.java:128) > > at > > org.apache.solr.handler.admin.LukeRequestHandler.getFileLength( > LukeRequestHandler.java:611) > > at > > org.apache.solr.handler.admin.LukeRequestHandler.getIndexInfo( > LukeRequestHandler.java:584) > > at > > org.apache.solr.handler.admin.LukeRequestHandler.handleRequestBody( > LukeRequestHandler.java:136) > > at > > org.apache.solr.handler.RequestHandlerBase.handleRequest( > RequestHandlerBase.java:177) > > at org.apache.solr.core.SolrCore.execute(SolrCore.java:2474) > > at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:720) > > at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:526) > > at > > org.apache.solr.servlet.SolrDispatchFilter.doFilter( > SolrDispatchFilter.java:378) > > at > > org.apache.solr.servlet.SolrDispatchFilter.doFilter( > SolrDispatchFilter.java:322) > > at > > org.eclipse.jetty.servlet.ServletHandler$CachedChain. > doFilter(ServletHandler.java:1691) > > at > > org.eclipse.jetty.servlet.ServletHandler.doHandle( > ServletHandler.java:582) > > at > > org.eclipse.jetty.server.handler.ScopedHandler.handle( > ScopedHandler.java:143) > > at > > org.eclipse.jetty.security.SecurityHandler.handle( > SecurityHandler.java:548) > > at > > org.eclipse.jetty.server.session.SessionHandler. > doHandle(SessionHandler.java:226) > > at > > org.eclipse.jetty.server.handler.ContextHandler. 
> doHandle(ContextHandler.java:1180) > > at org.eclipse.jetty.servlet.ServletHandler.doScope( > ServletHandler.java:512) > > at > > org.eclipse.jetty.server.session.SessionHandler. > doScope(SessionHandler.java:185) > > at > > org.eclipse.jetty.server.handler.ContextHandler. > doScope(ContextHandler.java:1112) > > at > > org.eclipse.jetty.server.handler.ScopedHandler.handle( > ScopedHandler.java:141) > > at > > org.eclipse.jetty.server.handler.ContextHandlerCollection.handle( > ContextHandlerCollection.java:213) > > at > > org.eclipse.jetty.server.handler.HandlerCollection. > handle(HandlerCollection.java:119) > > at > > org.eclipse.jetty.server.handler.HandlerWrapper.handle( > HandlerWrapper.java:134) > > at > > org.eclipse.jetty.rewrite.handler.RewriteHandler.handle( > RewriteHandler.java:335) > > at > > org.eclipse.jetty.server.handler.HandlerWrapper.handle( > HandlerWrapper.ja
Re: Unbalanced CPU on SolrCloud
The transition of the load happened after I restarted the bulk insert process. The size of the index on each server about 500GB. There are about 8 warnings on each server for "Not found segment file" like that Error getting file length for [segments_2s4] java.nio.file.NoSuchFileException: /media/ssd_losedata/solr-home/data/documents_online_shard16_replica_n1/data/index/segments_2s4 at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92) at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111) at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116) at java.base/sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:55) at java.base/sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:145) at java.base/sun.nio.fs.LinuxFileSystemProvider.readAttributes(LinuxFileSystemProvider.java:99) at java.base/java.nio.file.Files.readAttributes(Files.java:1755) at java.base/java.nio.file.Files.size(Files.java:2369) at org.apache.lucene.store.FSDirectory.fileLength(FSDirectory.java:243) at org.apache.lucene.store.NRTCachingDirectory.fileLength(NRTCachingDirectory.java:128) at org.apache.solr.handler.admin.LukeRequestHandler.getFileLength(LukeRequestHandler.java:611) at org.apache.solr.handler.admin.LukeRequestHandler.getIndexInfo(LukeRequestHandler.java:584) at org.apache.solr.handler.admin.LukeRequestHandler.handleRequestBody(LukeRequestHandler.java:136) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:177) at org.apache.solr.core.SolrCore.execute(SolrCore.java:2474) at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:720) at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:526) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:378) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:322) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1691) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143) at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548) at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213) at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) at org.eclipse.jetty.server.Server.handle(Server.java:534) at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320) at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273) at 
org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95) at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148) at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136) at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671) at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589) at java.base/java.lang.Thread.run(Thread.java:844) On Mon, Oct 16, 2017 at 1:08 PM, Emir Arnautović < emir.arnauto...@sematext.com> wrote: > I did not look at graph details - now I see that it is over 3h time span. > It seems that there was a load on the other server before this one and > ended with 14GB read spike and 10GB write spike, just before load started > on this server. Do you see any errors or suspicious logs lines? > How big is your index? > > Emir > -- > Monitoring - Log Management - Alerting - Anomaly Detection > Solr & Elasticsearch Consulting Support Training - http://sematext.com/ > > > > > On 16 Oct 2017, at 12:39, Mahmoud Almokadem <prog.mahm...@gmail.com> > wrote: > > > > Yes, it's constantly since I starte
Re: Unbalanced CPU on SolrCloud
Yes, it's constantly since I started this bulk indexing process. As you see the write operations on the loaded server are 3x the normal server despite Disk writes not 3x times. Mahmoud On Mon, Oct 16, 2017 at 12:32 PM, Emir Arnautović < emir.arnauto...@sematext.com> wrote: > Hi Mahmoud, > Is this something that you see constantly? Network charts suggests that > your servers are loaded equally and as you said - you are not using routing > so expected. Disk read/write and CPU are not equal and it is expected to > not be equal during heavy indexing since it also triggers segment merges > which require those resources. Even if host same documents (e.g. leader and > replica) merges are not likely to happen at the same time and you can > expect to see such cases. > > Thanks, > Emir > -- > Monitoring - Log Management - Alerting - Anomaly Detection > Solr & Elasticsearch Consulting Support Training - http://sematext.com/ > > > > > On 16 Oct 2017, at 11:58, Mahmoud Almokadem <prog.mahm...@gmail.com> > wrote: > > > > Here are the screen shots for the two server metrics on Amazon > > > > https://ibb.co/kxBQam > > https://ibb.co/fn0Jvm > > https://ibb.co/kUpYT6 > > > > > > > > On Mon, Oct 16, 2017 at 11:37 AM, Mahmoud Almokadem < > prog.mahm...@gmail.com> > > wrote: > > > >> Hi Emir, > >> > >> We doesn't use routing. > >> > >> Servers is already balanced and the number of documents on each shard > are > >> approximately the same. > >> > >> Nothing running on the servers except Solr and ZooKeeper. > >> > >> I initialized the client as > >> > >> String zkHost = "192.168.1.89:2181,192.168.1.99:2181"; > >> > >> CloudSolrClient solrCloud = new CloudSolrClient.Builder() > >>.withZkHost(zkHost) > >>.build(); > >> > >>solrCloud.setIdField("document_id"); > >>solrCloud.setDefaultCollection(collection); > >>solrCloud.setRequestWriter(new BinaryRequestWriter()); > >> > >> > >> And the documents are approximately the same size. > >> > >> I Used 10 threads with 10 SolrClients to send data to solr and every > >> thread send a batch of 1000 documents every time. > >> > >> Thanks, > >> Mahmoud > >> > >> > >> > >> On Mon, Oct 16, 2017 at 11:01 AM, Emir Arnautović < > >> emir.arnauto...@sematext.com> wrote: > >> > >>> Hi Mahmoud, > >>> Do you use routing? Are your servers equally balanced - do you end up > >>> having approximately the same number of documents hosted on both > servers > >>> (counted all shards)? > >>> Do you have anything else running on those servers? > >>> How do you initialise your SolrJ client? > >>> Are documents of similar size? > >>> > >>> Thanks, > >>> Emir > >>> -- > >>> Monitoring - Log Management - Alerting - Anomaly Detection > >>> Solr & Elasticsearch Consulting Support Training - > http://sematext.com/ > >>> > >>> > >>> > >>>> On 16 Oct 2017, at 10:46, Mahmoud Almokadem <prog.mahm...@gmail.com> > >>> wrote: > >>>> > >>>> We've installed SolrCloud 7.0.1 with two nodes and 8 shards per node. > >>>> > >>>> The configurations and the specs of the two servers are identical. > >>>> > >>>> When running bulk indexing using SolrJ we see one of the servers is > >>> fully > >>>> loaded as you see on the images and the other is normal. > >>>> > >>>> Images URLs: > >>>> > >>>> https://ibb.co/jkE6gR > >>>> https://ibb.co/hyzvam > >>>> https://ibb.co/mUpvam > >>>> https://ibb.co/e4bxo6 > >>>> > >>>> How can I figure this issue? > >>>> > >>>> Thanks, > >>>> Mahmoud > >>> > >>> > >> > >
Re: Unbalanced CPU on SolrCloud
Here are the screen shots for the two server metrics on Amazon https://ibb.co/kxBQam https://ibb.co/fn0Jvm https://ibb.co/kUpYT6 On Mon, Oct 16, 2017 at 11:37 AM, Mahmoud Almokadem <prog.mahm...@gmail.com> wrote: > Hi Emir, > > We doesn't use routing. > > Servers is already balanced and the number of documents on each shard are > approximately the same. > > Nothing running on the servers except Solr and ZooKeeper. > > I initialized the client as > > String zkHost = "192.168.1.89:2181,192.168.1.99:2181"; > > CloudSolrClient solrCloud = new CloudSolrClient.Builder() > .withZkHost(zkHost) > .build(); > > solrCloud.setIdField("document_id"); > solrCloud.setDefaultCollection(collection); > solrCloud.setRequestWriter(new BinaryRequestWriter()); > > > And the documents are approximately the same size. > > I Used 10 threads with 10 SolrClients to send data to solr and every > thread send a batch of 1000 documents every time. > > Thanks, > Mahmoud > > > > On Mon, Oct 16, 2017 at 11:01 AM, Emir Arnautović < > emir.arnauto...@sematext.com> wrote: > >> Hi Mahmoud, >> Do you use routing? Are your servers equally balanced - do you end up >> having approximately the same number of documents hosted on both servers >> (counted all shards)? >> Do you have anything else running on those servers? >> How do you initialise your SolrJ client? >> Are documents of similar size? >> >> Thanks, >> Emir >> -- >> Monitoring - Log Management - Alerting - Anomaly Detection >> Solr & Elasticsearch Consulting Support Training - http://sematext.com/ >> >> >> >> > On 16 Oct 2017, at 10:46, Mahmoud Almokadem <prog.mahm...@gmail.com> >> wrote: >> > >> > We've installed SolrCloud 7.0.1 with two nodes and 8 shards per node. >> > >> > The configurations and the specs of the two servers are identical. >> > >> > When running bulk indexing using SolrJ we see one of the servers is >> fully >> > loaded as you see on the images and the other is normal. >> > >> > Images URLs: >> > >> > https://ibb.co/jkE6gR >> > https://ibb.co/hyzvam >> > https://ibb.co/mUpvam >> > https://ibb.co/e4bxo6 >> > >> > How can I figure this issue? >> > >> > Thanks, >> > Mahmoud >> >> >
Re: Unbalanced CPU on SolrCloud
Hi Emir, We doesn't use routing. Servers is already balanced and the number of documents on each shard are approximately the same. Nothing running on the servers except Solr and ZooKeeper. I initialized the client as String zkHost = "192.168.1.89:2181,192.168.1.99:2181"; CloudSolrClient solrCloud = new CloudSolrClient.Builder() .withZkHost(zkHost) .build(); solrCloud.setIdField("document_id"); solrCloud.setDefaultCollection(collection); solrCloud.setRequestWriter(new BinaryRequestWriter()); And the documents are approximately the same size. I Used 10 threads with 10 SolrClients to send data to solr and every thread send a batch of 1000 documents every time. Thanks, Mahmoud On Mon, Oct 16, 2017 at 11:01 AM, Emir Arnautović < emir.arnauto...@sematext.com> wrote: > Hi Mahmoud, > Do you use routing? Are your servers equally balanced - do you end up > having approximately the same number of documents hosted on both servers > (counted all shards)? > Do you have anything else running on those servers? > How do you initialise your SolrJ client? > Are documents of similar size? > > Thanks, > Emir > -- > Monitoring - Log Management - Alerting - Anomaly Detection > Solr & Elasticsearch Consulting Support Training - http://sematext.com/ > > > > > On 16 Oct 2017, at 10:46, Mahmoud Almokadem <prog.mahm...@gmail.com> > wrote: > > > > We've installed SolrCloud 7.0.1 with two nodes and 8 shards per node. > > > > The configurations and the specs of the two servers are identical. > > > > When running bulk indexing using SolrJ we see one of the servers is fully > > loaded as you see on the images and the other is normal. > > > > Images URLs: > > > > https://ibb.co/jkE6gR > > https://ibb.co/hyzvam > > https://ibb.co/mUpvam > > https://ibb.co/e4bxo6 > > > > How can I figure this issue? > > > > Thanks, > > Mahmoud > >
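For context, a condensed sketch of the indexing setup described above — ten threads, one CloudSolrClient per thread, batches of 1000 documents. nextBatch() is purely a placeholder for whatever produces the documents:

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.solr.client.solrj.impl.BinaryRequestWriter;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BulkIndexer {

    public static void main(String[] args) throws Exception {
        String zkHost = "192.168.1.89:2181,192.168.1.99:2181";
        ExecutorService pool = Executors.newFixedThreadPool(10);

        for (int t = 0; t < 10; t++) {
            pool.submit(() -> {
                // One client per thread, configured as in the message above.
                CloudSolrClient client = new CloudSolrClient.Builder()
                        .withZkHost(zkHost).build();
                client.setIdField("document_id");
                client.setDefaultCollection("collection");
                client.setRequestWriter(new BinaryRequestWriter());
                try {
                    List<SolrInputDocument> batch;
                    // nextBatch() stands in for the real document source.
                    while (!(batch = nextBatch(1000)).isEmpty()) {
                        client.add(batch);
                    }
                } finally {
                    client.close();
                }
                return null;
            });
        }
        pool.shutdown();
    }

    static List<SolrInputDocument> nextBatch(int size) {
        return java.util.Collections.emptyList(); // placeholder
    }
}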
Unbalanced CPU on SolrCloud
We've installed SolrCloud 7.0.1 with two nodes and 8 shards per node. The configurations and the specs of the two servers are identical. When running bulk indexing using SolrJ, we see that one of the servers is fully loaded, as you can see in the images, while the other is normal. Image URLs: https://ibb.co/jkE6gR https://ibb.co/hyzvam https://ibb.co/mUpvam https://ibb.co/e4bxo6 How can I figure out this issue? Thanks, Mahmoud
Re: Move index directory to another partition
Thanks all for your commits. I followed Shawn steps (rsync) cause everything on that volume (ZooKeeper, Solr home and data) and everything went great. Thanks again, Mahmoud On Sun, Aug 6, 2017 at 12:47 AM, Erick Ericksonwrote: > bq: I was envisioning a scenario where the entire solr home is on the old > volume that's going away. If I were setting up a Solr install where the > large/fast storage was a separate filesystem, I would put the solr home > (or possibly even the entire install) under that mount point. It would > be a lot easier than setting dataDir in core.properties for every core, > especially in a cloud install. > > Agreed. Nothing in what I said precludes this. If you don't specify > dataDir, > then the index for a new replica goes in the default place, i.e. under > your install > directory usually. In your case under your new mount point. I usually don't > recommend trying to take control of where dataDir points, just let it > default. > I only mentioned it so you'd be aware it exists. So if your new install > is associated with a bigger/better/larger EBS it's all automatic. > > bq: If the dataDir property is already in use to relocate index data, then > ADDREPLICA and DELETEREPLICA would be a great way to go. I would not > expect most SolrCloud users to use that method. > > I really don't understand this. Each Solr replica has an associated > dataDir whether you specified it or not (the default is relative to > the core.properties file). ADDREPLICA creates a new replica in a new > place, initially the data directory and index are empty. The new > replica goes into recovery and uses the standard replication process > to copy the index via HTTP from a healthy replica and write it to its > data directory. Once that's done, the replica becomes live. There's > nothing about dataDir already being in use here at all. > > When you start Solr there's the default place Solr expects to find the > replicas. This is not necessarily where Solr is executing from, see > the "-s" option in bin/solr start -s. > > If you're talking about using dataDir to point to an existing index, > yes that would be a problem and not something I meant to imply at all. > > Why wouldn't most SolrCloud users use ADDREPLICA/DELTEREPLICA? It's > commonly used to more replicas around a cluster. > > Best, > Erick > > On Fri, Aug 4, 2017 at 11:15 AM, Shawn Heisey wrote: > > On 8/2/2017 9:17 AM, Erick Erickson wrote: > >> Not entirely sure about AWS intricacies, but getting a new replica to > >> use a particular index directory in the general case is just > >> specifying dataDir=some_directory on the ADDREPLICA command. The index > >> just needs an HTTP connection (uses the old replication process) so > >> nothing huge there. Then DELETEREPLICA for the old one. There's > >> nothing that ZK has to know about to make this work, it's all local to > >> the Solr instance. > > > > I was envisioning a scenario where the entire solr home is on the old > > volume that's going away. If I were setting up a Solr install where the > > large/fast storage was a separate filesystem, I would put the solr home > > (or possibly even the entire install) under that mount point. It would > > be a lot easier than setting dataDir in core.properties for every core, > > especially in a cloud install. > > > > If the dataDir property is already in use to relocate index data, then > > ADDREPLICA and DELETEREPLICA would be a great way to go. I would not > > expect most SolrCloud users to use that method. > > > > Thanks, > > Shawn > > >
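To make the ADDREPLICA-then-DELETEREPLICA route Erick describes concrete, a rough sketch of the two Collections API calls from SolrJ is below. Collection, shard, replica and dataDir values are placeholders, and the new replica should be fully active before the old one is removed:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrRequest;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.GenericSolrRequest;
import org.apache.solr.common.params.ModifiableSolrParams;

public class MoveReplicaToNewVolume {
    public static void main(String[] args) throws Exception {
        try (SolrClient client = new CloudSolrClient.Builder()
                .withZkHost("zk1:2181,zk2:2181").build()) {
            // 1) ADDREPLICA with dataDir pointing at the new, larger volume.
            ModifiableSolrParams add = new ModifiableSolrParams();
            add.set("action", "ADDREPLICA");
            add.set("collection", "myCollection");
            add.set("shard", "shard1");
            add.set("dataDir", "/mnt/new_ebs/myCollection_shard1_replica_n9/data");
            client.request(new GenericSolrRequest(
                    SolrRequest.METHOD.GET, "/admin/collections", add));

            // 2) Once the new replica is active, DELETEREPLICA for the old one.
            ModifiableSolrParams del = new ModifiableSolrParams();
            del.set("action", "DELETEREPLICA");
            del.set("collection", "myCollection");
            del.set("shard", "shard1");
            del.set("replica", "core_node3");   // name of the replica on the old volume
            client.request(new GenericSolrRequest(
                    SolrRequest.METHOD.GET, "/admin/collections", del));
        }
    }
}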
Re: Move index directory to another partition
Thanks Shawn, I'm using Ubuntu and I'll try the rsync command. Unfortunately I'm using a replication factor of one, but I think the downtime will be less than five minutes after following your steps. But how can I start Solr back up, or why should I run it although I copied the index and changed the path? And what do you mean by "Using multiple passes with rsync"? Thanks, Mahmoud On Tuesday, August 1, 2017, Shawn Heisey <apa...@elyograg.org> wrote: > On 7/31/2017 12:28 PM, Mahmoud Almokadem wrote: > > I've a SolrCloud of four instances on Amazon and the EBS volumes that > > contain the data on everynode is going to be full, unfortunately Amazon > > doesn't support expanding the EBS. So, I'll attach larger EBS volumes to > > move the index to. > > > > I can stop the updates on the index, but I'm afraid to use "cp" command to > > copy the files that are "on merge" operation. > > > > The copy operation may take several hours. > > > > How can I move the data directory without stopping the instance? > > Use rsync to do the copy. Do an initial copy while Solr is running, > then do a second copy, which should be pretty fast because rsync will > see the data from the first copy. Then shut Solr down and do a third > rsync which will only copy a VERY small changeset. Reconfigure Solr > and/or the OS to use the new location, and start Solr back up. Because > you mentioned "cp" I am assuming that you're NOT on Windows, and that > the OS will most likely allow you to do anything you need with index > files while Solr has them open. > > If you have set up your replicas with SolrCloud properly, then your > collections will not go offline when one Solr instance is shut down, and > that instance will be brought back into sync with the rest of the > cluster when it starts back up. Using multiple passes with rsync should > mean that Solr will not need to be shutdown for very long. > > The options I typically use for this kind of copy with rsync are "-avH > --delete". I would recommend that you research rsync options so that > you fully understand what I have suggested. > > Thanks, > Shawn > >
Move index directory to another partition
Hello, I've a SolrCloud of four instances on Amazon and the EBS volumes that contain the data on everynode is going to be full, unfortunately Amazon doesn't support expanding the EBS. So, I'll attach larger EBS volumes to move the index to. I can stop the updates on the index, but I'm afraid to use "cp" command to copy the files that are "on merge" operation. The copy operation may take several hours. How can I move the data directory without stopping the instance? Thanks, Mahmoud
Re: Clean checkbox on DIH
Thanks Shawn, We already use the admin UI for testing and bulk uploads. We are using curl scripts for automation process. I'll report the issues regarding the new UI on JIRA. Thanks, Mahmoud On Tuesday, May 2, 2017, Shawn Heisey <apa...@elyograg.org> wrote: > On 5/2/2017 6:53 AM, Mahmoud Almokadem wrote: > > And for the dataimport I always use the old UI cause the new UI > > doesn't show the live update and sometimes doesn't show the > > configuration. I think there are many bugs on the new UI. > > Do you know if these problems have been reported in the Jira issue > tracker? The old UI is going to disappear in Solr 7.0 when it is > released. If there are bugs in the new UI, we need to have them > reported so they can be fixed. > > As I stated earlier, when it comes to DIH, the admin UI is more useful > for testing and research than actual usage. The URLs for the admin UI > cannot be used in automation tools -- the API must be used directly. > > Thanks, > Shawn > >
Re: Clean checkbox on DIH
Thanks Shawn for your clarifications, I think showing a confirmation message saying that "The whole index will be cleaned" when the clean option is checked will be good. I always remove the check from the file /opt/solr/server/solr-webapp/webapp/tpl/dataimport.html after installing solr but when I upgraded it this time forget to do that and press Execute with the check and the whole index cleaned. And for the dataimport I always use the old UI cause the new UI doesn't show the live update and sometimes doesn't show the configuration. I think there are many bugs on the new UI. Thanks, Mahmoud On Mon, May 1, 2017 at 4:30 PM, Shawn Heisey <apa...@elyograg.org> wrote: > On 4/28/2017 9:01 AM, Mahmoud Almokadem wrote: > > We already using a shell scripts to do our import and using fullimport > > command to do our delta import and everything is doing well several > > years ago. But default of the UI is full import with clean and commit. > > If I press the Execute button by mistake the whole index is cleaned > > without any notification. > > I understand your frustration. What I'm worried about is the fallout if > we change the default to be unchecked, from people who didn't verify the > setting and expected full-import to wipe their index before it started > importing, just like it has always done for the last few years. > > The default value for the clean parameter when NOT using the admin UI is > true for full-import, and false for delta-import. That's not going to > change. I firmly believe that the admin UI should have the same > defaults as the API itself. The very nature of a full-import carries > the implication that you want to start over with an empty index. > > What if there were some bright red text in the UI near the execute > button that urged you to double-check that the "clean" box has the > setting you want? An alternate idea would be to pop up a yes/no > verification dialog on execute when the clean box is checked. > > Thanks, > Shawn > >
Re: Clean checkbox on DIH
Thanks Shawn, We already using a shell scripts to do our import and using fullimport command to do our delta import and everything is doing well several years ago. But default of the UI is full import with clean and commit. If I press the Execute button by mistake the whole index is cleaned without any notification. Thanks, Mahmoud On Fri, Apr 28, 2017 at 2:51 PM, Shawn Heisey <apa...@elyograg.org> wrote: > On 4/28/2017 5:11 AM, Mahmoud Almokadem wrote: > > I'd like to request to uncheck the "Clean" checkbox by default on DIH > page, > > cause it cleaned the whole index about 2TB when I click Execute button by > > wrong. Or show a confirmation message that the whole index will be > cleaned!! > > When somebody is doing a full-import, clean is what almost all users are > going to want. If you're wanting to do full-import without cleaning, > then you are in the minority. It is perhaps a fairly large minority, > but still not the majority. > > Also, once you move into production, you should not be using the admin > UI for this. You should be calling the DIH handler directly with HTTP > from another source, which might be a shell script using curl, or a > full-blown program in another language. > > Thanks, > Shawn > >
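Since the imports are driven from scripts anyway, passing clean=false explicitly on every full-import request makes the outcome independent of any UI default. A small SolrJ sketch with placeholder host and core names:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrRequest;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.GenericSolrRequest;
import org.apache.solr.common.params.ModifiableSolrParams;

public class DeltaViaFullImport {
    public static void main(String[] args) throws Exception {
        try (SolrClient client = new HttpSolrClient.Builder(
                "http://solrhost:8983/solr/myCore").build()) {
            ModifiableSolrParams params = new ModifiableSolrParams();
            params.set("command", "full-import");
            params.set("clean", "false");    // never wipe the index, whatever the UI default is
            params.set("commit", "true");
            client.request(new GenericSolrRequest(
                    SolrRequest.METHOD.GET, "/dataimport", params));
        }
    }
}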
Clean checkbox on DIH
Hello, I'd like to request that the "Clean" checkbox be unchecked by default on the DIH page, because it cleaned my whole index (about 2TB) when I clicked the Execute button by mistake. Alternatively, show a confirmation message that the whole index will be cleaned!! Sincerely, Mahmoud
TransactionLog doesn't know how to serialize class java.util.UUID; try implementing ObjectResolver?
Hello, When I try to update a document exists on solr cloud I got this message: TransactionLog doesn't know how to serialize class java.util.UUID; try implementing ObjectResolver? With the stack trace: {"data":{"responseHeader":{"status":500,"QTime":3},"error":{"metadata":["error-class","org.apache.solr.common.SolrException","root-error-class","org.apache.solr.common.SolrException"],"msg":"TransactionLog doesn't know how to serialize class java.util.UUID; try implementing ObjectResolver?","trace":"org.apache.solr.common.SolrException: TransactionLog doesn't know how to serialize class java.util.UUID; try implementing ObjectResolver?\n\tat org.apache.solr.update.TransactionLog$1.resolve(TransactionLog.java:100)\n\tat org.apache.solr.common.util.JavaBinCodec.writeVal(JavaBinCodec.java:234)\n\tat org.apache.solr.common.util.JavaBinCodec.writeSolrInputDocument(JavaBinCodec.java:589)\n\tat org.apache.solr.update.TransactionLog.write(TransactionLog.java:395)\n\tat org.apache.solr.update.UpdateLog.add(UpdateLog.java:524)\n\tat org.apache.solr.update.UpdateLog.add(UpdateLog.java:508)\n\tat org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:320)\n\tat org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:239)\n\tat org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:194)\n\tat org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:67)\n\tat org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)\n\tat org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:980)\n\tat org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1193)\n\tat org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:749)\n\tat org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:103)\n\tat org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.handleAdds(JsonLoader.java:502)\n\tat org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.processUpdate(JsonLoader.java:141)\n\tat org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.load(JsonLoader.java:117)\n\tat org.apache.solr.handler.loader.JsonLoader.load(JsonLoader.java:80)\n\tat org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:97)\n\tat org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)\n\tat org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:173)\n\tat org.apache.solr.core.SolrCore.execute(SolrCore.java:2440)\n\tat org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:723)\n\tat org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:529)\n\tat org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:347)\n\tat org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:298)\n\tat org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1691)\n\tat org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)\n\tat org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)\n\tat org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)\n\tat 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)\n\tat org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)\n\tat org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)\n\tat org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)\n\tat org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)\n\tat org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)\n\tat org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)\n\tat org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)\n\tat org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)\n\tat org.eclipse.jetty.server.Server.handle(Server.java:534)\n\tat org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)\n\tat org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)\n\tat org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)\n\tat org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)\n\tat org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)\n\tat
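A rough client-side sketch of one common way to avoid this kind of transaction-log serialization error: make sure the documents you send (including atomic updates) only contain plain types such as String, numbers and dates, for example by converting any java.util.UUID to a String before adding it. The URL, collection and field names below are placeholders for illustration, not values taken from this thread:

import java.util.Collections;
import java.util.UUID;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class UuidSafeAtomicUpdate {
    public static void main(String[] args) throws Exception {
        // placeholder base URL and collection name
        HttpSolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/mycollection").build();
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "42");
        // send the UUID as a String, never as a java.util.UUID object
        doc.addField("uuid_field", UUID.randomUUID().toString());
        // atomic update of another field on the same document
        doc.addField("title", Collections.singletonMap("set", "new title"));
        client.add(doc);
        client.commit();
        client.close();
    }
}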
Re: Enable Gzip compression Solr 6.0
Thanks Rick, I'm already running Solr on my infrastructure, behind a web application. The web application works as a proxy in front of Solr, so I think I could compress the content on the Solr end, but I have set it up on the proxy for now. Thanks again, Mahmoud > On Apr 12, 2017, at 4:31 PM, Rick Leir <rl...@leirtech.com> wrote: > > Hi Mahmoud > I assume you are running Solr 'behind' a web application, so Solr is not > directly on the net. > > The gzip compression is an Apache thing, and relates to your web application. > > Connections to Solr are within your infrastructure, so you might not want to > gzip them. But maybe your setup is different? > > Older versions of Solr used Tomcat which supported gzip. Newer versions use > Zookeeper and Jetty and you prolly will find a way. > Cheers -- Rick > >> On April 12, 2017 8:48:45 AM EDT, Mahmoud Almokadem <prog.mahm...@gmail.com> >> wrote: >> Hello, >> >> How can I enable Gzip compression for Solr 6.0 to save bandwidth >> between >> the server and clients? >> >> Thanks, >> Mahmoud > > -- > Sorry for being brief. Alternate email is rickleir at yahoo dot com
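For traffic between SolrJ clients and Solr itself (rather than between end users and the proxy), SolrJ can ask for gzip-compressed responses. This is only a sketch, assuming a SolrJ 6.x HttpSolrClient.Builder that exposes allowCompression; the URL and collection name are placeholders, and the bundled Jetty may still need gzip enabled on the server side before responses actually come back compressed:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class CompressedSolrClient {
    public static void main(String[] args) throws Exception {
        // placeholder base URL and collection name
        HttpSolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/mycollection")
                .allowCompression(true) // adds Accept-Encoding: gzip,deflate to outgoing requests
                .build();
        System.out.println(client.query(new SolrQuery("*:*")).getResults().getNumFound());
        client.close();
    }
}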
Enable Gzip compression Solr 6.0
Hello, How can I enable Gzip compression for Solr 6.0 to save bandwidth between the server and clients? Thanks, Mahmoud
Re: Indexing CPU performance
Thanks Toke, after sorting by Self Time (CPU) I see that FSDirectory$FSIndexOutput$1.write() is taking most of the CPU time, so is the bottleneck now the IO of the hard drive? https://drive.google.com/open?id=0BwLcshoSCVcdb2I4U1RBNnI0OVU On Tue, Mar 14, 2017 at 4:19 PM, Toke Eskildsen <t...@kb.dk> wrote: > On Tue, 2017-03-14 at 11:51 +0200, Mahmoud Almokadem wrote: > > Here is the profiler screenshot from VisualVM after upgrading > > > > https://drive.google.com/open?id=0BwLcshoSCVcddldVRTExaDR2dzg > > > > the jetty is taking the most time on CPU. Does this mean, the jetty > > is the bottleneck on indexing? > > You need to sort on and look at the "Self Time (CPU)" column in > VisualVM, not the default "Self Time", to see where the power is used. > The default is pretty useless for locating hot spots. > > - Toke Eskildsen, Royal Danish Library >
Re: Indexing CPU performance
After upgrading to 6.4.2 I got 3500+ docs/sec throughput with two uploading clients to solr which is good to me for the whole reindexing. I'll try Shawn code to posting to solr using HttpSolrClient instead of SolrCloudClient. Thanks to all, Mahmoud On Tue, Mar 14, 2017 at 10:23 AM, Mahmoud Almokadem <prog.mahm...@gmail.com> wrote: > > I'm using VisualVM and sematext to monitor my cluster. > > Below is screenshots for each of them. > > https://drive.google.com/open?id=0BwLcshoSCVcdWHRJeUNyekxWN28 > > https://drive.google.com/open?id=0BwLcshoSCVcdZzhTRGVjYVJBUzA > > https://drive.google.com/open?id=0BwLcshoSCVcdc0dQZGJtMWxDOFk > > https://drive.google.com/open?id=0BwLcshoSCVcdR3hJSHRZTjdSZm8 > > https://drive.google.com/open?id=0BwLcshoSCVcdUzRETDlFeFIxU2M > > Thanks, > Mahmoud > > On Tue, Mar 14, 2017 at 10:20 AM, Mahmoud Almokadem < > prog.mahm...@gmail.com> wrote: > >> Thanks Erick, >> >> I think there are something missing, the rate I'm talking about is for >> bulk upload and one time indexing to on-going indexing. >> My dataset is about 250 million documents and I need to index them to >> solr. >> >> Thanks Shawn for your clarification, >> >> I think that I got stuck on this version 6.4.1 I'll upgrade my cluster >> and test again. >> >> Thanks for help >> Mahmoud >> >> >> On Tue, Mar 14, 2017 at 1:20 AM, Shawn Heisey <apa...@elyograg.org> >> wrote: >> >>> On 3/13/2017 7:58 AM, Mahmoud Almokadem wrote: >>> > When I start my bulk indexer program the CPU utilization is 100% on >>> each >>> > server but the rate of the indexer is about 1500 docs per second. >>> > >>> > I know that some solr benchmarks reached 70,000+ doc. per second. >>> >>> There are *MANY* factors that affect indexing rate. When you say that >>> the CPU utilization is 100 percent, what operating system are you >>> running and what tool are you using to see CPU percentage? Within that >>> tool, where are you looking to see that usage level? >>> >>> On some operating systems with some reporting tools, a server with 8 CPU >>> cores can show up to 800 percent CPU usage, so 100 percent utilization >>> on the Solr process may not be full utilization of the server's >>> resources. It also might be an indicator of the full system usage, if >>> you are looking in the right place. >>> >>> > The question: What is the best way to determine the bottleneck on solr >>> > indexing rate? >>> >>> I have two likely candidates for you. The first one is a bug that >>> affects Solr 6.4.0 and 6.4.1, which is fixed by 6.4.2. If you don't >>> have one of those two versions, then this is not affecting you: >>> >>> https://issues.apache.org/jira/browse/SOLR-10130 >>> >>> The other likely bottleneck, which could be a problem whether or not the >>> previous bug is present, is single-threaded indexing, so every batch of >>> docs must wait for the previous batch to finish before it can begin, and >>> only one CPU gets utilized on the server side. Both Solr and SolrJ are >>> fully capable of handling several indexing threads at once, and that is >>> really the only way to achieve maximum indexing performance. If you >>> want multi-threaded (parallel) indexing, you must create the threads on >>> the client side, or run multiple indexing processes that each handle >>> part of the job. Multi-threaded code is not easy to write correctly. >>> >>> The fieldTypes and analysis that you have configured in your schema may >>> include classes that process very slowly, or may include so many filters >>> that the end result is slow performance. 
I am not familiar with the >>> performance of the classes that Solr includes, so I would not be able to >>> look at a schema and tell you which entries are slow. As Erick >>> mentioned, processing for 300+ fields could be one reason for slow >>> indexing. >>> >>> If you are doing a commit operation for every batch, that will slow it >>> down even more. If you have autoSoftCommit configured with a very low >>> maxTime or maxDocs value, that can result in extremely frequent commits >>> that make indexing much slower. Although frequent autoCommit is very >>> much desirable for good operation (as long as openSearcher set to >>> false), commits that open new searchers should be much less frequent. >>> The best option is to only commit (with a new searcher) *
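A minimal sketch of the client-side, multi-threaded indexing Shawn describes above, using an ExecutorService and one shared SolrJ client (SolrJ clients are safe to share between threads). The URL, thread count, batch source and field names are placeholders, not values from this thread; for many threads it should be combined with HttpClient connection settings like the ones Shawn shows in his full reply:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class ParallelIndexer {
    public static void main(String[] args) throws Exception {
        // placeholder URL and thread count
        HttpSolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/mycollection").build();
        ExecutorService pool = Executors.newFixedThreadPool(8);
        for (int b = 0; b < 100; b++) {            // placeholder: 100 batches of 500 docs
            final int batchId = b;
            pool.submit(() -> {
                List<SolrInputDocument> batch = new ArrayList<>();
                for (int i = 0; i < 500; i++) {
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", batchId + "-" + i);
                    doc.addField("title_t", "document " + i + " of batch " + batchId);
                    batch.add(doc);
                }
                try {
                    client.add(batch);             // each thread sends its own batch, no commit here
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        client.commit();                           // a single (searcher-opening) commit at the end of the run
        client.close();
    }
}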
Re: Indexing CPU performance
Here is the profiler screenshot from VisualVM after upgrading https://drive.google.com/open?id=0BwLcshoSCVcddldVRTExaDR2dzg the jetty is taking the most time on CPU. Does this mean, the jetty is the bottleneck on indexing? Thanks, Mahmoud On Tue, Mar 14, 2017 at 11:41 AM, Mahmoud Almokadem <prog.mahm...@gmail.com> wrote: > Thanks Shalin, > > I'm posting data to solr with SolrInputDocument using SolrJ. > > According to the profiler, the com.codahale.metrics.Meter.mark is take > much processing than others as mentioned on this issue > https://issues.apache.org/jira/browse/SOLR-10130. > > And I think the profiler of sematext is different than VisualVM. > > Thanks for help, > Mahmoud > > > > On Tue, Mar 14, 2017 at 11:08 AM, Shalin Shekhar Mangar < > shalinman...@gmail.com> wrote: > >> According to the profiler output, a significant amount of cpu is being >> spent in JSON parsing but your previous email said that you use SolrJ. >> SolrJ uses the javabin binary format to send documents to Solr and it >> never ever uses JSON so there is definitely some other indexing >> process that you have not accounted for. >> >> On Tue, Mar 14, 2017 at 12:31 AM, Mahmoud Almokadem >> <prog.mahm...@gmail.com> wrote: >> > Thanks Erick, >> > >> > I've commented out the line SolrClient.add(doclist) and get 5500+ docs >> per >> > second from single producer. >> > >> > Regarding more shards, you mean use 2 nodes with 8 shards per node so we >> > got 16 shards on the same 2 nodes or spread shards over more nodes? >> > >> > I'm using solr 6.4.1 with zookeeper on the same nodes. >> > >> > Here's what I got from sematext profiler >> > >> > 51% >> > Thread.java:745java.lang.Thread#run >> > >> > 42% >> > QueuedThreadPool.java:589 >> > org.eclipse.jetty.util.thread.QueuedThreadPool$2#run >> > Collapsed 29 calls (Expand) >> > >> > 43% >> > UpdateRequestHandler.java:97 >> > org.apache.solr.handler.UpdateRequestHandler$1#load >> > >> > 30% >> > JsonLoader.java:78org.apache.solr.handler.loader.JsonLoader#load >> > >> > 30% >> > JsonLoader.java:115 >> > org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader#load >> > >> > 13% >> > JavabinLoader.java:54org.apache.solr.handler.loader.JavabinLoader#load >> > >> > 9% >> > ThreadPoolExecutor.java:617 >> > java.util.concurrent.ThreadPoolExecutor$Worker#run >> > >> > 9% >> > ThreadPoolExecutor.java:1142 >> > java.util.concurrent.ThreadPoolExecutor#runWorker >> > >> > 33% >> > ConcurrentMergeScheduler.java:626 >> > org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread#run >> > >> > 33% >> > ConcurrentMergeScheduler.java:588 >> > org.apache.lucene.index.ConcurrentMergeScheduler#doMerge >> > >> > 33% >> > SolrIndexWriter.java:233org.apache.solr.update.SolrIndexWriter#merge >> > >> > 33% >> > IndexWriter.java:3920org.apache.lucene.index.IndexWriter#merge >> > >> > 33% >> > IndexWriter.java:4343org.apache.lucene.index.IndexWriter#mergeMiddle >> > >> > 20% >> > SegmentMerger.java:101org.apache.lucene.index.SegmentMerger#merge >> > >> > 11% >> > SegmentMerger.java:89org.apache.lucene.index.SegmentMerger#merge >> > >> > 2% >> > SegmentMerger.java:144org.apache.lucene.index.SegmentMerger#merge >> > >> > >> > On Mon, Mar 13, 2017 at 5:12 PM, Erick Erickson < >> erickerick...@gmail.com> >> > wrote: >> > >> >> Note that 70,000 docs/second pretty much guarantees that there are >> >> multiple shards. Lots of shards. 
>> >> >> >> But since you're using SolrJ, the very first thing I'd try would be >> >> to comment out the SolrClient.add(doclist) call so you're doing >> >> everything _except_ send the docs to Solr. That'll tell you whether >> >> there's any bottleneck on getting the docs from the system of record. >> >> The fact that you're pegging the CPUs argues that you are feeding Solr >> >> as fast as Solr can go so this is just a sanity check. But it's >> >> simple/fast. >> >> >> >> As far as what on Solr could be the bottleneck, no real way to know >> >> without profiling. But 300+ fields per doc probably just means you're >> >> doin
Re: Indexing CPU performance
Thanks Shalin, I'm posting data to solr with SolrInputDocument using SolrJ. According to the profiler, the com.codahale.metrics.Meter.mark is take much processing than others as mentioned on this issue https://issues.apache.org/jira/browse/SOLR-10130. And I think the profiler of sematext is different than VisualVM. Thanks for help, Mahmoud On Tue, Mar 14, 2017 at 11:08 AM, Shalin Shekhar Mangar < shalinman...@gmail.com> wrote: > According to the profiler output, a significant amount of cpu is being > spent in JSON parsing but your previous email said that you use SolrJ. > SolrJ uses the javabin binary format to send documents to Solr and it > never ever uses JSON so there is definitely some other indexing > process that you have not accounted for. > > On Tue, Mar 14, 2017 at 12:31 AM, Mahmoud Almokadem > <prog.mahm...@gmail.com> wrote: > > Thanks Erick, > > > > I've commented out the line SolrClient.add(doclist) and get 5500+ docs > per > > second from single producer. > > > > Regarding more shards, you mean use 2 nodes with 8 shards per node so we > > got 16 shards on the same 2 nodes or spread shards over more nodes? > > > > I'm using solr 6.4.1 with zookeeper on the same nodes. > > > > Here's what I got from sematext profiler > > > > 51% > > Thread.java:745java.lang.Thread#run > > > > 42% > > QueuedThreadPool.java:589 > > org.eclipse.jetty.util.thread.QueuedThreadPool$2#run > > Collapsed 29 calls (Expand) > > > > 43% > > UpdateRequestHandler.java:97 > > org.apache.solr.handler.UpdateRequestHandler$1#load > > > > 30% > > JsonLoader.java:78org.apache.solr.handler.loader.JsonLoader#load > > > > 30% > > JsonLoader.java:115 > > org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader#load > > > > 13% > > JavabinLoader.java:54org.apache.solr.handler.loader.JavabinLoader#load > > > > 9% > > ThreadPoolExecutor.java:617 > > java.util.concurrent.ThreadPoolExecutor$Worker#run > > > > 9% > > ThreadPoolExecutor.java:1142 > > java.util.concurrent.ThreadPoolExecutor#runWorker > > > > 33% > > ConcurrentMergeScheduler.java:626 > > org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread#run > > > > 33% > > ConcurrentMergeScheduler.java:588 > > org.apache.lucene.index.ConcurrentMergeScheduler#doMerge > > > > 33% > > SolrIndexWriter.java:233org.apache.solr.update.SolrIndexWriter#merge > > > > 33% > > IndexWriter.java:3920org.apache.lucene.index.IndexWriter#merge > > > > 33% > > IndexWriter.java:4343org.apache.lucene.index.IndexWriter#mergeMiddle > > > > 20% > > SegmentMerger.java:101org.apache.lucene.index.SegmentMerger#merge > > > > 11% > > SegmentMerger.java:89org.apache.lucene.index.SegmentMerger#merge > > > > 2% > > SegmentMerger.java:144org.apache.lucene.index.SegmentMerger#merge > > > > > > On Mon, Mar 13, 2017 at 5:12 PM, Erick Erickson <erickerick...@gmail.com > > > > wrote: > > > >> Note that 70,000 docs/second pretty much guarantees that there are > >> multiple shards. Lots of shards. > >> > >> But since you're using SolrJ, the very first thing I'd try would be > >> to comment out the SolrClient.add(doclist) call so you're doing > >> everything _except_ send the docs to Solr. That'll tell you whether > >> there's any bottleneck on getting the docs from the system of record. > >> The fact that you're pegging the CPUs argues that you are feeding Solr > >> as fast as Solr can go so this is just a sanity check. But it's > >> simple/fast. > >> > >> As far as what on Solr could be the bottleneck, no real way to know > >> without profiling. 
But 300+ fields per doc probably just means you're > >> doing a lot of processing, I'm not particularly hopeful you'll be able > >> to speed things up without either more shards or simplifying your > >> schema. > >> > >> Best, > >> Erick > >> > >> On Mon, Mar 13, 2017 at 6:58 AM, Mahmoud Almokadem > >> <prog.mahm...@gmail.com> wrote: > >> > Hi great community, > >> > > >> > I have a SolrCloud with the following configuration: > >> > > >> >- 2 nodes (r3.2xlarge 61GB RAM) > >> >- 4 shards. > >> >- The producer can produce 13,000+ docs per second > >> >- The schema contains about 300+ fields and the document size is > about > >> >3KB. > >> >- Using SolrJ and SolrCloudClient, each batch to solr contains 500 > >> docs. > >> > > >> > When I start my bulk indexer program the CPU utilization is 100% on > each > >> > server but the rate of the indexer is about 1500 docs per second. > >> > > >> > I know that some solr benchmarks reached 70,000+ doc. per second. > >> > > >> > The question: What is the best way to determine the bottleneck on solr > >> > indexing rate? > >> > > >> > Thanks, > >> > Mahmoud > >> > > > > -- > Regards, > Shalin Shekhar Mangar. >
Re: Indexing CPU performance
Thanks Erick, I think there are something missing, the rate I'm talking about is for bulk upload and one time indexing to on-going indexing. My dataset is about 250 million documents and I need to index them to solr. Thanks Shawn for your clarification, I think that I got stuck on this version 6.4.1 I'll upgrade my cluster and test again. Thanks for help Mahmoud On Tue, Mar 14, 2017 at 1:20 AM, Shawn Heisey <apa...@elyograg.org> wrote: > On 3/13/2017 7:58 AM, Mahmoud Almokadem wrote: > > When I start my bulk indexer program the CPU utilization is 100% on each > > server but the rate of the indexer is about 1500 docs per second. > > > > I know that some solr benchmarks reached 70,000+ doc. per second. > > There are *MANY* factors that affect indexing rate. When you say that > the CPU utilization is 100 percent, what operating system are you > running and what tool are you using to see CPU percentage? Within that > tool, where are you looking to see that usage level? > > On some operating systems with some reporting tools, a server with 8 CPU > cores can show up to 800 percent CPU usage, so 100 percent utilization > on the Solr process may not be full utilization of the server's > resources. It also might be an indicator of the full system usage, if > you are looking in the right place. > > > The question: What is the best way to determine the bottleneck on solr > > indexing rate? > > I have two likely candidates for you. The first one is a bug that > affects Solr 6.4.0 and 6.4.1, which is fixed by 6.4.2. If you don't > have one of those two versions, then this is not affecting you: > > https://issues.apache.org/jira/browse/SOLR-10130 > > The other likely bottleneck, which could be a problem whether or not the > previous bug is present, is single-threaded indexing, so every batch of > docs must wait for the previous batch to finish before it can begin, and > only one CPU gets utilized on the server side. Both Solr and SolrJ are > fully capable of handling several indexing threads at once, and that is > really the only way to achieve maximum indexing performance. If you > want multi-threaded (parallel) indexing, you must create the threads on > the client side, or run multiple indexing processes that each handle > part of the job. Multi-threaded code is not easy to write correctly. > > The fieldTypes and analysis that you have configured in your schema may > include classes that process very slowly, or may include so many filters > that the end result is slow performance. I am not familiar with the > performance of the classes that Solr includes, so I would not be able to > look at a schema and tell you which entries are slow. As Erick > mentioned, processing for 300+ fields could be one reason for slow > indexing. > > If you are doing a commit operation for every batch, that will slow it > down even more. If you have autoSoftCommit configured with a very low > maxTime or maxDocs value, that can result in extremely frequent commits > that make indexing much slower. Although frequent autoCommit is very > much desirable for good operation (as long as openSearcher set to > false), commits that open new searchers should be much less frequent. > The best option is to only commit (with a new searcher) *once* at the > end of the indexing run. If automatic soft commits are desired, make > them happen as infrequently as you can. 
> > https://lucidworks.com/understanding-transaction- > logs-softcommit-and-commit-in-sorlcloud/ > > Using CloudSolrClient will make single-threaded indexing fairly > efficient, by always sending documents to the correct shard leader. FYI > -- your 500 document batches are split into smaller batches (which I > think are only 10 documents) that are directed to correct shard leaders > by CloudSolrClient. Indexing with multiple threads becomes even more > important with these smaller batches. > > Note that with SolrJ, you will need to tweak the HttpClient creation, or > you will likely find that each SolrJ client object can only utilize two > threads to each Solr server. The default per-route maximum connection > limit for HttpClient is 2, with a total connection limit of 20. > > This code snippet shows how I create a Solr client that can do many > threads (300 per route, 5000 total) and also has custom timeout settings: > > RequestConfig rc = RequestConfig.custom().setConnectTimeout(15000) > .setSocketTimeout(Const.SOCKET_TIMEOUT).build(); > httpClient = HttpClients.custom().setDefaultRequestConfig(rc) > .setMaxConnPerRoute(300).setMaxConnTotal(5000) > .disableAutomaticRetries().build(); > client = new HttpSolrClient(serverBaseUrl, httpClient); > > This is using HttpSolrClient, but CloudSolrClient can be built in a > similar manner. I am not yet using the new SolrJ Builder paradigm found > in 6.x, I should switch my code to that. > > Thanks, > Shawn > >
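For reference, a sketch of how the HttpClient settings in Shawn's snippet above might look with the SolrJ 6.x Builder style he mentions at the end. This assumes HttpSolrClient.Builder exposes withHttpClient; the base URL and the socket timeout below stand in for his serverBaseUrl and Const.SOCKET_TIMEOUT:

import org.apache.http.client.HttpClient;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.impl.client.HttpClients;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class TunedClientBuilder {
    public static void main(String[] args) {
        String serverBaseUrl = "http://localhost:8983/solr/mycollection"; // placeholder
        int socketTimeout = 120000;                                       // placeholder for Const.SOCKET_TIMEOUT
        RequestConfig rc = RequestConfig.custom()
                .setConnectTimeout(15000)
                .setSocketTimeout(socketTimeout)
                .build();
        HttpClient httpClient = HttpClients.custom()
                .setDefaultRequestConfig(rc)
                .setMaxConnPerRoute(300)      // allow many indexing threads per Solr host
                .setMaxConnTotal(5000)
                .disableAutomaticRetries()
                .build();
        HttpSolrClient client = new HttpSolrClient.Builder(serverBaseUrl)
                .withHttpClient(httpClient)   // share the tuned HttpClient with the SolrJ client
                .build();
        System.out.println("Client ready for " + client.getBaseURL());
    }
}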
Re: Indexing CPU performance
I'm using VisualVM and sematext to monitor my cluster. Below is screenshots for each of them. https://drive.google.com/open?id=0BwLcshoSCVcdWHRJeUNyekxWN28 https://drive.google.com/open?id=0BwLcshoSCVcdZzhTRGVjYVJBUzA https://drive.google.com/open?id=0BwLcshoSCVcdc0dQZGJtMWxDOFk https://drive.google.com/open?id=0BwLcshoSCVcdR3hJSHRZTjdSZm8 https://drive.google.com/open?id=0BwLcshoSCVcdUzRETDlFeFIxU2M Thanks, Mahmoud On Tue, Mar 14, 2017 at 10:20 AM, Mahmoud Almokadem <prog.mahm...@gmail.com> wrote: > Thanks Erick, > > I think there are something missing, the rate I'm talking about is for > bulk upload and one time indexing to on-going indexing. > My dataset is about 250 million documents and I need to index them to solr. > > Thanks Shawn for your clarification, > > I think that I got stuck on this version 6.4.1 I'll upgrade my cluster and > test again. > > Thanks for help > Mahmoud > > > On Tue, Mar 14, 2017 at 1:20 AM, Shawn Heisey <apa...@elyograg.org> wrote: > >> On 3/13/2017 7:58 AM, Mahmoud Almokadem wrote: >> > When I start my bulk indexer program the CPU utilization is 100% on each >> > server but the rate of the indexer is about 1500 docs per second. >> > >> > I know that some solr benchmarks reached 70,000+ doc. per second. >> >> There are *MANY* factors that affect indexing rate. When you say that >> the CPU utilization is 100 percent, what operating system are you >> running and what tool are you using to see CPU percentage? Within that >> tool, where are you looking to see that usage level? >> >> On some operating systems with some reporting tools, a server with 8 CPU >> cores can show up to 800 percent CPU usage, so 100 percent utilization >> on the Solr process may not be full utilization of the server's >> resources. It also might be an indicator of the full system usage, if >> you are looking in the right place. >> >> > The question: What is the best way to determine the bottleneck on solr >> > indexing rate? >> >> I have two likely candidates for you. The first one is a bug that >> affects Solr 6.4.0 and 6.4.1, which is fixed by 6.4.2. If you don't >> have one of those two versions, then this is not affecting you: >> >> https://issues.apache.org/jira/browse/SOLR-10130 >> >> The other likely bottleneck, which could be a problem whether or not the >> previous bug is present, is single-threaded indexing, so every batch of >> docs must wait for the previous batch to finish before it can begin, and >> only one CPU gets utilized on the server side. Both Solr and SolrJ are >> fully capable of handling several indexing threads at once, and that is >> really the only way to achieve maximum indexing performance. If you >> want multi-threaded (parallel) indexing, you must create the threads on >> the client side, or run multiple indexing processes that each handle >> part of the job. Multi-threaded code is not easy to write correctly. >> >> The fieldTypes and analysis that you have configured in your schema may >> include classes that process very slowly, or may include so many filters >> that the end result is slow performance. I am not familiar with the >> performance of the classes that Solr includes, so I would not be able to >> look at a schema and tell you which entries are slow. As Erick >> mentioned, processing for 300+ fields could be one reason for slow >> indexing. >> >> If you are doing a commit operation for every batch, that will slow it >> down even more. 
If you have autoSoftCommit configured with a very low >> maxTime or maxDocs value, that can result in extremely frequent commits >> that make indexing much slower. Although frequent autoCommit is very >> much desirable for good operation (as long as openSearcher set to >> false), commits that open new searchers should be much less frequent. >> The best option is to only commit (with a new searcher) *once* at the >> end of the indexing run. If automatic soft commits are desired, make >> them happen as infrequently as you can. >> >> https://lucidworks.com/understanding-transaction-logs- >> softcommit-and-commit-in-sorlcloud/ >> >> Using CloudSolrClient will make single-threaded indexing fairly >> efficient, by always sending documents to the correct shard leader. FYI >> -- your 500 document batches are split into smaller batches (which I >> think are only 10 documents) that are directed to correct shard leaders >> by CloudSolrClient. Indexing with multiple threads becomes even more >> important with these smaller batches. >> >> Note that with
Re: Indexing CPU performance
Hi Erick, Thanks for detailed answer. The producer can sustain producing with that rate, it's not a spikes. So, I can ran more clients that write to Solr although I got that maximum utilization with a single client? Do you think it will increase throughput? And you advice me to add more shards on the same two nodes until I get the best throughput? autocommit is 15000 and softcommit is 6 Thanks, Mahmoud > On Mar 13, 2017, at 9:28 PM, Erick Erickson <erickerick...@gmail.com> wrote: > > OK, so you can get a 360% speedup by commenting out the solr.add. That > indicates that, indeed, you're pretty much running Solr flat out, not > surprising. You _might_ squeeze a little more out of Solr by adding > more client indexers, but that's not going to drive you to the numbers > you need. I do have one observation though. You say "...can produce > 13,000+ docs per second...". Is this sustained or occasional spikes? > If the latter, can you let Solr fall behind and pick up the extra > files when producer slows down? > > Second, you'll have to have at least three clients running to even do > the upstream processing without Solr in the picture at all. IOW, you > can't gather and generate the Solr documents fast enough with one > client, much less index them too. > > bq: Regarding more shards, you mean use 2 nodes with 8 shards per node so we > got 16 shards on the same 2 nodes or spread shards over more nodes? > > Yes ;). Once you have enough shards/replicas on a box that you're > running all the CPUs flat out, adding more shards won't do you any > good. And we're just skipping over what that'll do to your ability to > run queries. Plus, 13,000 docs/second will mount up pretty quickly, so > you have to do your capacity planning for the projected maximum number > of docs you'll host on this collection. My bet: If you size your > cluster appropriately for the eventual total size, your indexing > throughput will hit your numbers. Unless you have a very short > retention. > > See: > https://lucidworks.com/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/ > > Best, > Erick > > On Mon, Mar 13, 2017 at 12:01 PM, Mahmoud Almokadem > <prog.mahm...@gmail.com> wrote: >> Thanks Erick, >> >> I've commented out the line SolrClient.add(doclist) and get 5500+ docs per >> second from single producer. >> >> Regarding more shards, you mean use 2 nodes with 8 shards per node so we >> got 16 shards on the same 2 nodes or spread shards over more nodes? >> >> I'm using solr 6.4.1 with zookeeper on the same nodes. 
>> >> Here's what I got from sematext profiler >> >> 51% >> Thread.java:745java.lang.Thread#run >> >> 42% >> QueuedThreadPool.java:589 >> org.eclipse.jetty.util.thread.QueuedThreadPool$2#run >> Collapsed 29 calls (Expand) >> >> 43% >> UpdateRequestHandler.java:97 >> org.apache.solr.handler.UpdateRequestHandler$1#load >> >> 30% >> JsonLoader.java:78org.apache.solr.handler.loader.JsonLoader#load >> >> 30% >> JsonLoader.java:115 >> org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader#load >> >> 13% >> JavabinLoader.java:54org.apache.solr.handler.loader.JavabinLoader#load >> >> 9% >> ThreadPoolExecutor.java:617 >> java.util.concurrent.ThreadPoolExecutor$Worker#run >> >> 9% >> ThreadPoolExecutor.java:1142 >> java.util.concurrent.ThreadPoolExecutor#runWorker >> >> 33% >> ConcurrentMergeScheduler.java:626 >> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread#run >> >> 33% >> ConcurrentMergeScheduler.java:588 >> org.apache.lucene.index.ConcurrentMergeScheduler#doMerge >> >> 33% >> SolrIndexWriter.java:233org.apache.solr.update.SolrIndexWriter#merge >> >> 33% >> IndexWriter.java:3920org.apache.lucene.index.IndexWriter#merge >> >> 33% >> IndexWriter.java:4343org.apache.lucene.index.IndexWriter#mergeMiddle >> >> 20% >> SegmentMerger.java:101org.apache.lucene.index.SegmentMerger#merge >> >> 11% >> SegmentMerger.java:89org.apache.lucene.index.SegmentMerger#merge >> >> 2% >> SegmentMerger.java:144org.apache.lucene.index.SegmentMerger#merge >> >> >> On Mon, Mar 13, 2017 at 5:12 PM, Erick Erickson <erickerick...@gmail.com> >> wrote: >> >>> Note that 70,000 docs/second pretty much guarantees that there are >>> multiple shards. Lots of shards. >>> >>> But since you're using SolrJ, the very first thing I'd try would be >>> to comment out the Sol
Re: Indexing CPU performance
Thanks Erick, I've commented out the line SolrClient.add(doclist) and get 5500+ docs per second from single producer. Regarding more shards, you mean use 2 nodes with 8 shards per node so we got 16 shards on the same 2 nodes or spread shards over more nodes? I'm using solr 6.4.1 with zookeeper on the same nodes. Here's what I got from sematext profiler 51% Thread.java:745java.lang.Thread#run 42% QueuedThreadPool.java:589 org.eclipse.jetty.util.thread.QueuedThreadPool$2#run Collapsed 29 calls (Expand) 43% UpdateRequestHandler.java:97 org.apache.solr.handler.UpdateRequestHandler$1#load 30% JsonLoader.java:78org.apache.solr.handler.loader.JsonLoader#load 30% JsonLoader.java:115 org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader#load 13% JavabinLoader.java:54org.apache.solr.handler.loader.JavabinLoader#load 9% ThreadPoolExecutor.java:617 java.util.concurrent.ThreadPoolExecutor$Worker#run 9% ThreadPoolExecutor.java:1142 java.util.concurrent.ThreadPoolExecutor#runWorker 33% ConcurrentMergeScheduler.java:626 org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread#run 33% ConcurrentMergeScheduler.java:588 org.apache.lucene.index.ConcurrentMergeScheduler#doMerge 33% SolrIndexWriter.java:233org.apache.solr.update.SolrIndexWriter#merge 33% IndexWriter.java:3920org.apache.lucene.index.IndexWriter#merge 33% IndexWriter.java:4343org.apache.lucene.index.IndexWriter#mergeMiddle 20% SegmentMerger.java:101org.apache.lucene.index.SegmentMerger#merge 11% SegmentMerger.java:89org.apache.lucene.index.SegmentMerger#merge 2% SegmentMerger.java:144org.apache.lucene.index.SegmentMerger#merge On Mon, Mar 13, 2017 at 5:12 PM, Erick Erickson <erickerick...@gmail.com> wrote: > Note that 70,000 docs/second pretty much guarantees that there are > multiple shards. Lots of shards. > > But since you're using SolrJ, the very first thing I'd try would be > to comment out the SolrClient.add(doclist) call so you're doing > everything _except_ send the docs to Solr. That'll tell you whether > there's any bottleneck on getting the docs from the system of record. > The fact that you're pegging the CPUs argues that you are feeding Solr > as fast as Solr can go so this is just a sanity check. But it's > simple/fast. > > As far as what on Solr could be the bottleneck, no real way to know > without profiling. But 300+ fields per doc probably just means you're > doing a lot of processing, I'm not particularly hopeful you'll be able > to speed things up without either more shards or simplifying your > schema. > > Best, > Erick > > On Mon, Mar 13, 2017 at 6:58 AM, Mahmoud Almokadem > <prog.mahm...@gmail.com> wrote: > > Hi great community, > > > > I have a SolrCloud with the following configuration: > > > >- 2 nodes (r3.2xlarge 61GB RAM) > >- 4 shards. > >- The producer can produce 13,000+ docs per second > >- The schema contains about 300+ fields and the document size is about > >3KB. > >- Using SolrJ and SolrCloudClient, each batch to solr contains 500 > docs. > > > > When I start my bulk indexer program the CPU utilization is 100% on each > > server but the rate of the indexer is about 1500 docs per second. > > > > I know that some solr benchmarks reached 70,000+ doc. per second. > > > > The question: What is the best way to determine the bottleneck on solr > > indexing rate? > > > > Thanks, > > Mahmoud >
Indexing CPU performance
Hi great community, I have a SolrCloud with the following configuration: - 2 nodes (r3.2xlarge 61GB RAM) - 4 shards. - The producer can produce 13,000+ docs per second - The schema contains about 300+ fields and the document size is about 3KB. - Using SolrJ and SolrCloudClient, each batch to solr contains 500 docs. When I start my bulk indexer program the CPU utilization is 100% on each server but the rate of the indexer is about 1500 docs per second. I know that some solr benchmarks reached 70,000+ doc. per second. The question: What is the best way to determine the bottleneck on solr indexing rate? Thanks, Mahmoud
Re: Time of insert
Thanks Alessandro, I used the DIH as it is and no atomic updates were used with this DIH. I added this script to my script transformer section and everything worked properly: var now = java.time.LocalDateTime.now(); var dtf = java.time.format.DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss'Z'"); var val = dtf.format(now); var hash = new java.util.HashMap(); hash.put('add', val); row.put('time_stamp_log', hash); The time_stamp_log field now contains the log of the updates on the documents and created_date is set only once. I think hash.put('add', val); fires the atomic update on the documents. But when I remove this part of the script, the created_date field gets updated every time. Thanks for your help. On Tue, Feb 7, 2017 at 11:30 AM, alessandro.benedetti wrote: > Hi Mahomoud, > I need to double check but let's assume you use atomic updates and a > created_data stored with default to NOW. > > 1) First time the document is not in the index you will get the default > NOW. > 2) second time, using the atomic update you will update only a subset of > fields you send to Solr. > Under the hood Solr will fetch the existing Doc, change only few fields and > send it back to Solr. > created_date will have the date fetched from the old version of the > document, so the default will not be used this time. > > Have you tried ? > > Cheers > > > > - > --- > Alessandro Benedetti > Search Consultant, R&D Software Engineer, Director > Sease Ltd. - www.sease.io > -- > View this message in context: http://lucene.472066.n3. > nabble.com/Time-of-insert-tp4319040p4319122.html > Sent from the Solr - User mailing list archive at Nabble.com. >
Re: Time of insert
Thanks Alex for your reply. But the field created_date will be updated every time the document is inserted into Solr. I want to record the first time the document was indexed into Solr, and I'm using the DataImport handler. I tried solr.TimestampUpdateProcessorFactory but I got a NullPointerException, so I changed it to use a default value for the field in the schema, but this field contains the last update time of the document, not the first time the document was inserted. Thanks, Mahmoud On Tue, Feb 7, 2017 at 12:10 AM, Alexandre Rafalovitch <arafa...@gmail.com> wrote: > If you are reindexing full documents, there is no way. > > If you are actually doing updates using Solr updates XML/JSON, then > you can have a created_date field with default value of NOW. > Similarly, you could probably do something with UpdateRequestProcessor > chains to get that NOW added somewhere. > > Regards, >Alex. > > http://www.solr-start.com/ - Resources for Solr users, new and experienced > > > On 6 February 2017 at 15:32, Mahmoud Almokadem <prog.mahm...@gmail.com> > wrote: > > Hello, > > > > I'm using dih on solr 6 for indexing data from sql server. The document > can > > br indexed many times according to the updates on it. Is that available > to > > get the first time the document inserted to solr? > > > > And how to get the dates of the document updated? > > > > Thanks for help, > > Mahmoud >
Time of insert
Hello, I'm using DIH on Solr 6 for indexing data from SQL Server. The document can be indexed many times according to the updates on it. Is it possible to get the first time the document was inserted into Solr? And how can I get the dates on which the document was updated? Thanks for help, Mahmoud
Solr Kafka DIH
Hello, Is there a way to get SolrCloud to pull data from a topic in Kafka periodically using the DataImport Handler? Thanks Mahmoud
Re: Search with the start of field
Thanks all, I think SpanFirstQuery will solve my problem. But does Solr 4.8 support xmlparser so I can use it? And does SpanFirst support phrase search instead of a single term? Thanks, Mahmoud On Wed, Sep 21, 2016 at 10:45 AM, Markus Jelsma <markus.jel...@openindex.io> wrote: > Yes, definately SpanFirstQuery! But i didn't know you can invoke it via > XMLQueryParser, thank Mikhail for that! > > There is a tiny drawback to SpanFirst, there is no gradient boosting > depending on distance from the beginning. > > Markus > > > > -Original message- > > From:Mikhail Khludnev <m...@apache.org> > > Sent: Wednesday 21st September 2016 9:24 > > To: solr-user <solr-user@lucene.apache.org> > > Subject: Re: Search with the start of field > > > > You can experiment with {!xmlparser}.. see > > https://cwiki.apache.org/confluence/display/solr/Other+ Parsers#OtherParsers-XMLQueryParser > > > > On Wed, Sep 21, 2016 at 9:06 AM, Mahmoud Almokadem < prog.mahm...@gmail.com> > > wrote: > > > > > Hello, > > > > > > What is the best way to search with the start token of field? > > > > > > For example: the field contains these values > > > > > > Document1: ABC DEF GHI > > > Document2: DEF GHI JKL > > > > > > when I search with DEF, I want to get Document2 only. Is that possible? > > > > > > Thanks, > > > Mahmoud > > > > > > > > -- > > Sincerely yours > > Mikhail Khludnev >
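A rough sketch of what such a query could look like from SolrJ, assuming the XML query parser ({!xmlparser}) is available in the Solr version being used; the URL, field name, terms and the end position are placeholders, and the element names follow the Lucene XML query parser, so they should be double-checked against the documentation for that version. Nesting a SpanNear inside SpanFirst is one way to approximate a phrase anchored at the start of the field:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class StartOfFieldQuery {
    public static void main(String[] args) throws Exception {
        // placeholder URL and collection name
        HttpSolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/mycollection").build();
        // match documents whose Title field starts with the phrase "def ghi";
        // the terms must match the analyzed tokens, and end="2" limits the span
        // to the first two token positions
        String xml = "<SpanFirst end=\"2\">"
                   +   "<SpanNear slop=\"0\" inOrder=\"true\">"
                   +     "<SpanTerm fieldName=\"Title\">def</SpanTerm>"
                   +     "<SpanTerm fieldName=\"Title\">ghi</SpanTerm>"
                   +   "</SpanNear>"
                   + "</SpanFirst>";
        SolrQuery q = new SolrQuery("{!xmlparser}" + xml);
        System.out.println(client.query(q).getResults().getNumFound());
        client.close();
    }
}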
Search with the start of field
Hello, What is the best way to search with the start token of field? For example: the field contains these values Document1: ABC DEF GHI Document2: DEF GHI JKL when I search with DEF, I want to get Document2 only. Is that possible? Thanks, Mahmoud
insertion time
Hello, We always update the same document many times using the DataImportHandler. Can I add a field for the first time the document was inserted into the index and another field for the last time the document was updated? Thanks, Mahmoud
Re: Cold replication
Thanks Erick, I'll take a look at the replication handler in Solr, but I don't know whether it supports incremental backup or not. And I want to use SSDs because my index cannot be held in memory: the index is about 200GB on each instance, the RAM is 61GB, and the update frequency is high. So I want to use the SSDs equipped with the servers instead of EBS. Would you explain what you mean by proper warming? Thanks, Mahmoud On Mon, Jul 18, 2016 at 5:46 PM, Erick Erickson <erickerick...@gmail.com> wrote: > Have you tried the replication API backup command here? > > https://cwiki.apache.org/confluence/display/solr/Index+Replication#IndexReplication-HTTPAPICommandsfortheReplicationHandler > > Warning, I haven't worked with this personally in this > situation so test. > > I do have to ask why you think SSDs are required here and > if you've measured. With proper warming, most of the > index is held in memory anyway and the source of > the data (SSD or spinning) is not a huge issue. SSDs > certainly are better/faster, but have you measured whether > they are _enough_ faster to be worth the added > complexity? > > Best, > Erick > > Best, > Erick > > On Mon, Jul 18, 2016 at 4:05 AM, Mahmoud Almokadem > <prog.mahm...@gmail.com> wrote: > > Hi, > > > > We have SolrCloud 6.0 installed on 4 i2.2xlarge instances with 4 shards. We store the indices on EBS attached to these instances. Fortunately these instances are equipped with TEMPORARY SSDs. We need to the store the indices on the SSDs but they are not safe. > > > > The index is updated every five minutes. > > > > Could we use the SSDs to store the indices and create an incremental > backup or cold replication on the EBS? So we use EBS only for storing > indices not serving the data to the solr. > > > > Incase of losing the data on SSDs we can restore a backup from the EBS. Is it possible? > > > > Thanks, > > Mahmoud > > > >
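A small sketch of triggering the replication handler backup command Erick points to, using a plain HTTP call from Java. The host, core name, backup location and numberToKeep value are placeholders, and the exact parameters should be checked against the reference guide for the Solr version in use; the handler only acknowledges the request and runs the backup asynchronously:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class TriggerReplicationBackup {
    public static void main(String[] args) throws Exception {
        // placeholder host, core name and backup location on the EBS volume
        URL url = new URL("http://localhost:8983/solr/mycollection/replication"
                + "?command=backup&location=/mnt/ebs/solr-backups&numberToKeep=3");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // prints the handler's acknowledgement response
            }
        }
        conn.disconnect();
    }
}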
Cold replication
Hi, We have SolrCloud 6.0 installed on 4 i2.2xlarge instances with 4 shards. We store the indices on EBS volumes attached to these instances. Fortunately these instances are also equipped with temporary (ephemeral) SSDs. We would like to store the indices on the SSDs, but they are not safe. The index is updated every five minutes. Could we use the SSDs to store the indices and create an incremental backup or cold replica on EBS? That way EBS would only be used for storing the indices, not for serving data to Solr. In case of losing the data on the SSDs we could restore a backup from EBS. Is that possible? Thanks, Mahmoud
DIH Schedule Solr 6
Hello, We have a Solr 4.8.1 cluster installed on the Tomcat servlet container, and we're able to use the DIH Scheduler by adding these lines to web.xml in the installation directory: <listener><listener-class>org.apache.solr.handler.dataimport.scheduler.ApplicationListener</listener-class></listener> Now we are planning to migrate to Solr 6 and we have already installed it as a service. The question is: how do we install the DIH Scheduler on Solr 6 when it runs as a service? Thanks, Mahmoud
Searching and sorting using field aliasing
Hi all, I have two cores (core1, core2). core1 contains the fields (f1, f2, f3, date1) and core2 contains the fields (f2, f3, f4, date2). I want to search on the two cores by the date field. Is there an alias to query the two fields in a distributed search? For example, when q=dateField:NOW, perform the search on date1 and date2. And I want to sort on dateField so that it sorts by date1 and date2. Regards, Mahmoud
Re: Arabic analyser
Thank Alex, So BasisTech works for the latest version of solr? Sincerely, Mahmoud On Tue, Nov 10, 2015 at 5:28 PM, Alexandre Rafalovitch <arafa...@gmail.com> wrote: > If this is for a significant project and you are ready to pay for it, > BasisTech has commercial solutions in this area I believe. > > Regards, >Alex. > > Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: > http://www.solr-start.com/ > > > On 10 November 2015 at 08:46, Mahmoud Almokadem <prog.mahm...@gmail.com> > wrote: > > Thanks Pual, > > > > Arabic analyser applying filters of normalisation and stemming only for > > single terms out of standard tokenzier. > > Gathering all synonyms will be hard work. Should I customise my Tokenizer > > to handle this case? > > > > Sincerely, > > Mahmoud > > > > > > On Tue, Nov 10, 2015 at 3:06 PM, Paul Libbrecht <p...@hoplahup.net> > wrote: > > > >> Mahmoud, > >> > >> there is an arabic analyzer: > >> https://wiki.apache.org/solr/LanguageAnalysis#Arabic > >> doesn't it do what you describe? > >> Synonyms probably work there too. > >> > >> Paul > >> > >> > Mahmoud Almokadem <mailto:prog.mahm...@gmail.com> > >> > 9 novembre 2015 17:47 > >> > Thanks Jack, > >> > > >> > This is a good solution, but we have more combinations that I think > >> > can’t be handled as synonyms like every word starts with ‘عبد’ ‘Abd’ > >> > and ‘أبو’ ‘Abo’. When using Standard tokenizer on ‘أبو بكر’ ‘Abo > >> > Bakr’, It’ll be tokenised to ‘أبو’ and ‘بكر’ and the filters will be > >> > applied for each separate term. > >> > > >> > Is there available tokeniser to tokenise ‘أبو *’ or ‘عبد *' as a > >> > single term? > >> > > >> > Thanks, > >> > Mahmoud > >> > > >> > > >> > > >> > Jack Krupansky <mailto:jack.krupan...@gmail.com> > >> > 9 novembre 2015 16:47 > >> > Use an index-time (but not query time) synonym filter with a rule > like: > >> > > >> > Abd Allah,Abdallah > >> > > >> > This will index the combined word in addition to the separate words. > >> > > >> > -- Jack Krupansky > >> > > >> > On Mon, Nov 9, 2015 at 4:48 AM, Mahmoud Almokadem < > >> prog.mahm...@gmail.com> > >> > > >> > Mahmoud Almokadem <mailto:prog.mahm...@gmail.com> > >> > 9 novembre 2015 10:48 > >> > Hello, > >> > > >> > We are indexing Arabic content and facing a problem for tokenizing > multi > >> > terms phrases like 'عبد الله' 'Abd Allah', so users will search for > >> > 'عبدالله' 'Abdallah' without space and need to get the results of 'عبد > >> > الله' with space. We are using StandardTokenizer. > >> > > >> > > >> > Is there any configurations to handle this case? > >> > > >> > Thank you, > >> > Mahmoud > >> > > >> > >> >
Re: Arabic analyser
Thank you very much David, It's wonderful and I will try it. On Wed, Nov 11, 2015 at 1:37 PM, David Murgatroyd <dmu...@gmail.com> wrote: > >So BasisTech works for the latest version of solr? > > Yes, our latest Arabic analyzer supports up through 5.3.x. But since the > examples you give are names, it sounds like you might instead/also want our > fuzzy name matcher which will find "عبد الله" not only with "عبدالله" but > also with typos like "عبالله" or even translations into 'English' like > "abdollah". You can visit http://www.basistech.com/solutions/search/solr/ > and fill out the form there to learn more (mentioning this thread). See > also http://www.slideshare.net/dmurga/simple-fuzzy-name-matching-in-solr > for a talk I gave at the San Francisco Solr Meet-up in April on how it > plugs in to Solr by creating a special field type you can query just like > any other; this was also presented at Lucene/Solr Revolution last month ( > http://lucenerevolution.org/sessions/simple-fuzzy-name-matching-in-solr/). > > Best, > David Murgatroyd > (VP, Engineering, Basis Technology) > > On Wed, Nov 11, 2015 at 4:31 AM, Mahmoud Almokadem <prog.mahm...@gmail.com > > > wrote: > > > Thank Alex, > > > > So BasisTech works for the latest version of solr? > > > > Sincerely, > > Mahmoud > > > > On Tue, Nov 10, 2015 at 5:28 PM, Alexandre Rafalovitch < > arafa...@gmail.com > > > > > wrote: > > > > > If this is for a significant project and you are ready to pay for it, > > > BasisTech has commercial solutions in this area I believe. > > > > > > Regards, > > >Alex. > > > > > > Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: > > > http://www.solr-start.com/ > > > > > > > > > On 10 November 2015 at 08:46, Mahmoud Almokadem < > prog.mahm...@gmail.com> > > > wrote: > > > > Thanks Pual, > > > > > > > > Arabic analyser applying filters of normalisation and stemming only > for > > > > single terms out of standard tokenzier. > > > > Gathering all synonyms will be hard work. Should I customise my > > Tokenizer > > > > to handle this case? > > > > > > > > Sincerely, > > > > Mahmoud > > > > > > > > > > > > On Tue, Nov 10, 2015 at 3:06 PM, Paul Libbrecht <p...@hoplahup.net> > > > wrote: > > > > > > > >> Mahmoud, > > > >> > > > >> there is an arabic analyzer: > > > >> https://wiki.apache.org/solr/LanguageAnalysis#Arabic > > > >> doesn't it do what you describe? > > > >> Synonyms probably work there too. > > > >> > > > >> Paul > > > >> > > > >> > Mahmoud Almokadem <mailto:prog.mahm...@gmail.com> > > > >> > 9 novembre 2015 17:47 > > > >> > Thanks Jack, > > > >> > > > > >> > This is a good solution, but we have more combinations that I > think > > > >> > can’t be handled as synonyms like every word starts with ‘عبد’ > ‘Abd’ > > > >> > and ‘أبو’ ‘Abo’. When using Standard tokenizer on ‘أبو بكر’ ‘Abo > > > >> > Bakr’, It’ll be tokenised to ‘أبو’ and ‘بكر’ and the filters will > be > > > >> > applied for each separate term. > > > >> > > > > >> > Is there available tokeniser to tokenise ‘أبو *’ or ‘عبد *' as a > > > >> > single term? > > > >> > > > > >> > Thanks, > > > >> > Mahmoud > > > >> > > > > >> > > > > >> > > > > >> > Jack Krupansky <mailto:jack.krupan...@gmail.com> > > > >> > 9 novembre 2015 16:47 > > > >> > Use an index-time (but not query time) synonym filter with a rule > > > like: > > > >> > > > > >> > Abd Allah,Abdallah > > > >> > > > > >> > This will index the combined word in addition to the separate > words. 
> > > >> > > > > >> > -- Jack Krupansky > > > >> > > > > >> > On Mon, Nov 9, 2015 at 4:48 AM, Mahmoud Almokadem < > > > >> prog.mahm...@gmail.com> > > > >> > > > > >> > Mahmoud Almokadem <mailto:prog.mahm...@gmail.com> > > > >> > 9 novembre 2015 10:48 > > > >> > Hello, > > > >> > > > > >> > We are indexing Arabic content and facing a problem for tokenizing > > > multi > > > >> > terms phrases like 'عبد الله' 'Abd Allah', so users will search > for > > > >> > 'عبدالله' 'Abdallah' without space and need to get the results of > > 'عبد > > > >> > الله' with space. We are using StandardTokenizer. > > > >> > > > > >> > > > > >> > Is there any configurations to handle this case? > > > >> > > > > >> > Thank you, > > > >> > Mahmoud > > > >> > > > > >> > > > >> > > > > > >
Re: Arabic analyser
Thanks Pual, Arabic analyser applying filters of normalisation and stemming only for single terms out of standard tokenzier. Gathering all synonyms will be hard work. Should I customise my Tokenizer to handle this case? Sincerely, Mahmoud On Tue, Nov 10, 2015 at 3:06 PM, Paul Libbrecht <p...@hoplahup.net> wrote: > Mahmoud, > > there is an arabic analyzer: > https://wiki.apache.org/solr/LanguageAnalysis#Arabic > doesn't it do what you describe? > Synonyms probably work there too. > > Paul > > > Mahmoud Almokadem <mailto:prog.mahm...@gmail.com> > > 9 novembre 2015 17:47 > > Thanks Jack, > > > > This is a good solution, but we have more combinations that I think > > can’t be handled as synonyms like every word starts with ‘عبد’ ‘Abd’ > > and ‘أبو’ ‘Abo’. When using Standard tokenizer on ‘أبو بكر’ ‘Abo > > Bakr’, It’ll be tokenised to ‘أبو’ and ‘بكر’ and the filters will be > > applied for each separate term. > > > > Is there available tokeniser to tokenise ‘أبو *’ or ‘عبد *' as a > > single term? > > > > Thanks, > > Mahmoud > > > > > > > > Jack Krupansky <mailto:jack.krupan...@gmail.com> > > 9 novembre 2015 16:47 > > Use an index-time (but not query time) synonym filter with a rule like: > > > > Abd Allah,Abdallah > > > > This will index the combined word in addition to the separate words. > > > > -- Jack Krupansky > > > > On Mon, Nov 9, 2015 at 4:48 AM, Mahmoud Almokadem < > prog.mahm...@gmail.com> > > > > Mahmoud Almokadem <mailto:prog.mahm...@gmail.com> > > 9 novembre 2015 10:48 > > Hello, > > > > We are indexing Arabic content and facing a problem for tokenizing multi > > terms phrases like 'عبد الله' 'Abd Allah', so users will search for > > 'عبدالله' 'Abdallah' without space and need to get the results of 'عبد > > الله' with space. We are using StandardTokenizer. > > > > > > Is there any configurations to handle this case? > > > > Thank you, > > Mahmoud > > > >
Re: Arabic analyser
Thanks Jack, This is a good solution, but we have more combinations that I think can’t be handled as synonyms like every word starts with ‘عبد’ ‘Abd’ and ‘أبو’ ‘Abo’. When using Standard tokenizer on ‘أبو بكر’ ‘Abo Bakr’, It’ll be tokenised to ‘أبو’ and ‘بكر’ and the filters will be applied for each separate term. Is there available tokeniser to tokenise ‘أبو *’ or ‘عبد *' as a single term? Thanks, Mahmoud > On Nov 9, 2015, at 5:47 PM, Jack Krupansky <jack.krupan...@gmail.com> wrote: > > Use an index-time (but not query time) synonym filter with a rule like: > > Abd Allah,Abdallah > > This will index the combined word in addition to the separate words. > > -- Jack Krupansky > > On Mon, Nov 9, 2015 at 4:48 AM, Mahmoud Almokadem <prog.mahm...@gmail.com> > wrote: > >> Hello, >> >> We are indexing Arabic content and facing a problem for tokenizing multi >> terms phrases like 'عبد الله' 'Abd Allah', so users will search for >> 'عبدالله' 'Abdallah' without space and need to get the results of 'عبد >> الله' with space. We are using StandardTokenizer. >> >> >> Is there any configurations to handle this case? >> >> Thank you, >> Mahmoud >>
Arabic analyser
Hello, We are indexing Arabic content and facing a problem for tokenizing multi terms phrases like 'عبد الله' 'Abd Allah', so users will search for 'عبدالله' 'Abdallah' without space and need to get the results of 'عبد الله' with space. We are using StandardTokenizer. Is there any configurations to handle this case? Thank you, Mahmoud
Re: Invalid parsing with solr edismax operators
Thanks Jack. I have reported it as a bug on JIRA https://issues.apache.org/jira/browse/SOLR-8237 <https://issues.apache.org/jira/browse/SOLR-8237> Mahmoud > On Nov 4, 2015, at 5:30 PM, Jack Krupansky <jack.krupan...@gmail.com> wrote: > > I think you should go ahead and file a Jira ticket for this as a bug since > either it is an actual bug or some behavior nuance that needs to be > documented better. > > -- Jack Krupansky > > On Wed, Nov 4, 2015 at 8:24 AM, Mahmoud Almokadem <prog.mahm...@gmail.com> > wrote: > >> I removed the q.op=“AND” and add the mm=2 >> when searching for (public libraries) I got 19 with >> "parsedquery_toString": "+(((Title:public^200.0 | TotalField:public^0.1) >> (Title:libraries^200.0 | TotalField:libraries^0.1))~2)", >> >> and when adding + and searching for +(public libraries) I got 1189 with >> "parsedquery_toString": "+(+((Title:public^200.0 | TotalField:public^0.1) >> (Title:libraries^200.0 | TotalField:libraries^0.1)))", >> >> >> I think when adding + before parentheses I got all terms mandatory despite >> the value of mm=2 in the two cases. >> >> Mahmoud >> >> >> >>> On Nov 4, 2015, at 3:04 PM, Alessandro Benedetti <abenede...@apache.org> >> wrote: >>> >>> Here we go : >>> >>> Title^200 TotalField^1 >>> >>> + Jack explanation and you have the parsed query explained ! >>> >>> Cheers >>> >>> On 4 November 2015 at 12:56, Mahmoud Almokadem <prog.mahm...@gmail.com> >>> wrote: >>> >>>> Thank you Alessandro for your reply. >>>> >>>> Here is the request handler >>>> >>>> >>>> >>>> >>>>explicit >>>> 10 >>>> TotalField >>>> AND >>>> edismax >>>> Title^200 TotalField^1 >>>> >>>> >>>> >>>> >>>> >>>> >>>> Mahmoud >>>> >>>> >>>>> On Nov 4, 2015, at 2:43 PM, Alessandro Benedetti < >> abenede...@apache.org> >>>> wrote: >>>>> >>>>> Hi Mahmoud, >>>>> can you send us the solrconfig.xml snippet of your request handler >>>> please ? >>>>> >>>>> It's kinda strange you get a boost factor for the Title field and that >>>>> parsing query, according to your config. >>>>> >>>>> Cheers >>>>> >>>>> On 4 November 2015 at 08:39, Mahmoud Almokadem <prog.mahm...@gmail.com >>> >>>>> wrote: >>>>> >>>>>> Hello, >>>>>> >>>>>> I'm using solr 4.8.1. Using edismax as the parser we got the >> undesirable >>>>>> parsed queries and results. 
The following is two different cases with >>>>>> strange behavior: Searching with these parameters >>>>>> >>>>>> "mm":"2", >>>>>> "df":"TotalField", >>>>>> "debug":"true", >>>>>> "indent":"true", >>>>>> "fl":"Title", >>>>>> "start":"0", >>>>>> "q.op":"AND", >>>>>> "fq":"", >>>>>> "rows":"10", >>>>>> "wt":"json" >>>>>> and the query is >>>>>> >>>>>> "q":"+(public libraries)", >>>>>> Retrieve 502 documents with these parsed query >>>>>> >>>>>> "rawquerystring":"+(public libraries)", >>>>>> "querystring":"+(public libraries)", >>>>>> "parsedquery":"(+(+(DisjunctionMaxQuery((Title:public^200.0 | >>>>>> TotalField:public^0.1)) DisjunctionMaxQuery((Title:libraries^200.0 | >>>>>> TotalField:libraries^0.1)/no_coord", >>>>>> "parsedquery_toString":"+(+((Title:public^200.0 | >> TotalField:public^0.1) >>>>>> (Title:libraries^200.0 | TotalField:libraries^0.1)))" >>>>>> and if the query is >>>>>> >>>>>> "q":" (public libraries) " >>>>>> then it retrieves 8 documents with these parsed query >>>>>> >>>>>> "rawquerystring":" (public libraries) ", >>>>>> "querystring":" (public libraries) ", >>>>>> "parsedquery":"(+((DisjunctionMaxQuery((Title:public^200.0 | >>>>>> TotalField:public^0.1)) DisjunctionMaxQuery((Title:libraries^200.0 | >>>>>> TotalField:libraries^0.1)))~2))/no_coord", >>>>>> "parsedquery_toString":"+(((Title:public^200.0 | >> TotalField:public^0.1) >>>>>> (Title:libraries^200.0 | TotalField:libraries^0.1))~2)" >>>>>> So the results of adding "+" to get all tokens before the parenthesis >>>>>> retrieve more results than removing it. >>>>>> >>>>>> Is this a bug on this version or there are something missing? >>>>> >>>>> >>>>> >>>>> >>>>> -- >>>>> -- >>>>> >>>>> Benedetti Alessandro >>>>> Visiting card : http://about.me/alessandro_benedetti >>>>> >>>>> "Tyger, tyger burning bright >>>>> In the forests of the night, >>>>> What immortal hand or eye >>>>> Could frame thy fearful symmetry?" >>>>> >>>>> William Blake - Songs of Experience -1794 England >>>> >>>> >>> >>> >>> -- >>> -- >>> >>> Benedetti Alessandro >>> Visiting card : http://about.me/alessandro_benedetti >>> >>> "Tyger, tyger burning bright >>> In the forests of the night, >>> What immortal hand or eye >>> Could frame thy fearful symmetry?" >>> >>> William Blake - Songs of Experience -1794 England >> >>
Re: Invalid parsing with solr edismax operators
Thank you Alessandro for your reply. Here is the request handler:

explicit 10 TotalField AND edismax Title^200 TotalField^1

Mahmoud

> On Nov 4, 2015, at 2:43 PM, Alessandro Benedetti <abenede...@apache.org> wrote:
>
> Hi Mahmoud,
> can you send us the solrconfig.xml snippet of your request handler, please?
>
> It's kind of strange that you get a boost factor for the Title field and that parsed query, according to your config.
>
> Cheers
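The request handler above lost its XML markup in the archive; only the parameter values survived. A minimal sketch of what such a solrconfig.xml handler could look like, assuming the standard parameter names for the surviving values (the handler name and the element names are assumptions, not taken from the original message):

<requestHandler name="/select" class="solr.SearchHandler">  <!-- handler name assumed -->
  <lst name="defaults">
    <str name="echoParams">explicit</str>        <!-- "explicit" from the message -->
    <int name="rows">10</int>                    <!-- "10" -->
    <str name="df">TotalField</str>              <!-- default field -->
    <str name="q.op">AND</str>                   <!-- default operator -->
    <str name="defType">edismax</str>            <!-- query parser -->
    <str name="qf">Title^200 TotalField^1</str>  <!-- field boosts -->
  </lst>
</requestHandler>

Note that the parsed queries elsewhere in the thread show TotalField^0.1 rather than the ^1 configured here, which may be the oddity Alessandro refers to.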
Re: Invalid parsing with solr edismax operators
I removed q.op="AND" and added mm=2.

When searching for (public libraries) I got 19 results, with
"parsedquery_toString": "+(((Title:public^200.0 | TotalField:public^0.1) (Title:libraries^200.0 | TotalField:libraries^0.1))~2)",

and when adding + and searching for +(public libraries) I got 1189 results, with
"parsedquery_toString": "+(+((Title:public^200.0 | TotalField:public^0.1) (Title:libraries^200.0 | TotalField:libraries^0.1)))",

I think that when adding + before the parentheses, the terms become a single mandatory group and the mm=2 value is not applied, even though mm=2 was set in both cases.

Mahmoud

> On Nov 4, 2015, at 3:04 PM, Alessandro Benedetti <abenede...@apache.org> wrote:
>
> Here we go:
>
> Title^200 TotalField^1
>
> + Jack's explanation, and you have the parsed query explained!
>
> Cheers
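For readers following along, a sketch of the three request shapes in play here (host, collection and boost values are placeholders for illustration; the boosts mirror the parsed queries above). Without a leading +, edismax applies mm to the optional clauses inside the group; with the leading +, the group as a whole becomes required while its inner clauses stay optional, which matches why the ~2 disappears from the parsed query and the result count grows:

# mm=2 applied: the parsed query ends in ~2, both terms must match
http://localhost:8983/solr/collection1/select?defType=edismax&qf=Title^200+TotalField^0.1&mm=2&q=(public+libraries)&debugQuery=true

# leading +: the group is required, the inner clauses are not, so mm no longer constrains them
http://localhost:8983/solr/collection1/select?defType=edismax&qf=Title^200+TotalField^0.1&mm=2&q=%2B(public+libraries)&debugQuery=true

# explicit + on each term requires both regardless of mm
http://localhost:8983/solr/collection1/select?defType=edismax&qf=Title^200+TotalField^0.1&q=%2Bpublic+%2Blibraries&debugQuery=true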
Invalid parsing with solr edismax operators
Hello,

I'm using Solr 4.8.1. Using edismax as the parser, we get undesirable parsed queries and results. The following are two different cases with strange behavior. Searching with these parameters:

"mm":"2",
"df":"TotalField",
"debug":"true",
"indent":"true",
"fl":"Title",
"start":"0",
"q.op":"AND",
"fq":"",
"rows":"10",
"wt":"json"

and the query

"q":"+(public libraries)",

retrieves 502 documents with this parsed query:

"rawquerystring":"+(public libraries)",
"querystring":"+(public libraries)",
"parsedquery":"(+(+(DisjunctionMaxQuery((Title:public^200.0 | TotalField:public^0.1)) DisjunctionMaxQuery((Title:libraries^200.0 | TotalField:libraries^0.1)/no_coord",
"parsedquery_toString":"+(+((Title:public^200.0 | TotalField:public^0.1) (Title:libraries^200.0 | TotalField:libraries^0.1)))"

and if the query is

"q":" (public libraries) "

then it retrieves 8 documents with this parsed query:

"rawquerystring":" (public libraries) ",
"querystring":" (public libraries) ",
"parsedquery":"(+((DisjunctionMaxQuery((Title:public^200.0 | TotalField:public^0.1)) DisjunctionMaxQuery((Title:libraries^200.0 | TotalField:libraries^0.1)))~2))/no_coord",
"parsedquery_toString":"+(((Title:public^200.0 | TotalField:public^0.1) (Title:libraries^200.0 | TotalField:libraries^0.1))~2)"

So adding "+" before the parentheses, intended to make all tokens mandatory, retrieves more results than omitting it.

Is this a bug in this version, or is something missing?
edismax operators
Hello,

I'm seeing strange behaviour when using edismax with multiple words. When passing q=+(word1 word2) I get:

rawquerystring: +(word1 word2),
querystring: +(word1 word2),
parsedquery: (+(+(DisjunctionMaxQuery((title:word1)) DisjunctionMaxQuery((title:word2)/no_coord,
parsedquery_toString: +(+((title:word1) (title:word2))),

I expected both words to be mandatory, since I added + before the parentheses, so it should apply to all terms inside the parentheses. How can I apply the default operator AND to all words?

Thanks,
Mahmoud
Re: edismax operators
Thank you Jack for your clarifications. With the regular defType I set q.op=AND so that all terms without operators are mandatory. How can I get the same behaviour with edismax?

Thanks,
Mahmoud

On Thu, Apr 2, 2015 at 2:14 PM, Jack Krupansky jack.krupan...@gmail.com wrote:

The parentheses signal a nested query. Your plus operator applies to the overall nested query - that the nested query must match something. Use the plus operator on each of the discrete terms if each of them is mandatory. The plus and minus operators apply to the overall nested query - they do not distribute to each term within the nested query. They don't magically distribute to all nested queries. Let's see your full set of query parameters, both on the request and in solrconfig.

-- Jack Krupansky
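A brief sketch of the two usual ways to get AND-like behaviour with edismax (host and field names are placeholders; the values are illustrative, not taken from the thread). Note that mm generally acts only on clauses without an explicit +/- operator, which matches the follow-up below where mm=100% and mm=0% produce the same parsed query for q=+(word1 word2):

# require every term with explicit operators
http://localhost:8983/solr/collection1/select?defType=edismax&qf=title&q=%2Bword1+%2Bword2

# require every term with minimum-should-match
http://localhost:8983/solr/collection1/select?defType=edismax&qf=title&mm=100%25&q=word1+word2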
Re: edismax operators
Thanks all for your response. But the parsed query and the number of results stay the same when changing the mm parameter. The following are the results for mm=100% and mm=0%:

http://solrserver/solr/collection1/select?q=%2B(word1+word2)&rows=0&fl=Title&wt=json&indent=true&debugQuery=true&defType=edismax&qf=title&mm=100%25&stopwords=true&lowercaseOperators=true

rawquerystring: +(word1 word2),
querystring: +(word1 word2),
parsedquery: (+(+(DisjunctionMaxQuery((title:word1)) DisjunctionMaxQuery((title:word2)/no_coord,
parsedquery_toString: +(+((title:word1) (title:word2))),

http://solrserver/solr/collection1/select?q=%2B(word1+word2)&rows=0&fl=Title&wt=json&indent=true&debugQuery=true&defType=edismax&qf=title&mm=0%25&stopwords=true&lowercaseOperators=true

rawquerystring: +(word1 word2),
querystring: +(word1 word2),
parsedquery: (+(+(DisjunctionMaxQuery((title:word1)) DisjunctionMaxQuery((title:word2)/no_coord,
parsedquery_toString: +(+((title:word1) (title:word2))),

There aren't any changes between the two queries. Solr version 4.8.1.

Thanks,
Mahmoud

On Thu, Apr 2, 2015 at 6:56 PM, Davis, Daniel (NIH/NLM) [C] daniel.da...@nih.gov wrote:

Thanks Shawn. This is what I thought, but Solr often has features I don't anticipate.

-----Original Message-----
From: Shawn Heisey [mailto:apa...@elyograg.org]
Sent: Thursday, April 02, 2015 12:54 PM
To: solr-user@lucene.apache.org
Subject: Re: edismax operators

On 4/2/2015 9:59 AM, Davis, Daniel (NIH/NLM) [C] wrote:
Can the mm parameter be set per clause? I guess I've ignored it in the past, aside from setting it once to what seemed like a reasonable value. That is probably replicated across every collection, which cannot be ideal for relevance.

It applies to the whole query. You can have a different value on every query you send. Just like with other parameters, defaults can be configured in the solrconfig.xml request handler definition.

Thanks,
Shawn
Re: Create core problem in tomcat
You may have a field type in your schema that uses the stopwords.txt file, like this:

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ArabicNormalizationFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ar.txt"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ArabicNormalizationFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ar.txt"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  </analyzer>
</fieldType>

So you must have the files stopwords_ar.txt and stopwords_en.txt in INSTANCE_DIR/conf/lang/ and stopwords.txt in INSTANCE_DIR/conf/.

Sincerely,
Mahmoud

On Thu, Jan 1, 2015 at 9:18 AM, Noora noora.sa...@gmail.com wrote:

Hi,
I'm using Apache Solr 4.7.2 and Apache Tomcat. I can't create a core with a query in my Solr, while I can do it with Jetty using the same config. The first problem was solved by passing the system property -Dsolr.allow.unsafe.resourceloading=true to the JVM, which I did in my catalina.sh. Now my error is:

Unable to create core: uut8
Caused by: Can't find resource 'stopwords.txt' in classpath or conf

My query is:

http://10.1.221.210:8983/solr/admin/cores?action=CREATE&name=my_core&instanceDir=my_core&dataDir=data&configSet=myConfig

Can anyone help me?
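For reference, the conf layout implied above, relative to the instanceDir passed in the CREATE call (my_core is the name from Noora's query; solrconfig.xml and schema.xml are the usual companions and are assumed here, and if a configSet is used the same layout lives under that configset's directory instead):

my_core/
  conf/
    solrconfig.xml
    schema.xml
    stopwords.txt
    lang/
      stopwords_ar.txt
      stopwords_en.txt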
Re: Solr performance issues
Thanks all. I have the same index with a slightly different schema and 200M documents, installed on 3 r3.xlarge instances (30GB RAM and 600 General Purpose SSD). The size of that index is about 1.5TB; it takes many updates every 5 minutes and serves complex queries and faceting with a response time of 100ms, which is acceptable for us.

Toke Eskildsen,

Is the index updated while you are searching? *No*
Do you do any faceting or other heavy processing as part of a search? *No*
How many hits does a search typically have and how many documents are returned? *The test measured QTime only, with no documents returned; the number of hits varied from 50,000 to 50,000,000.*
How many concurrent searches do you need to support? How fast should the response time be? *Maybe 100 concurrent searches at 100ms, with facets.*

Is splitting the shard into two shards on the same node, so that every shard sits on a single EBS volume, better than using LVM?

Thanks

On Mon, Dec 29, 2014 at 2:00 AM, Toke Eskildsen t...@statsbiblioteket.dk wrote:

Mahmoud Almokadem [prog.mahm...@gmail.com] wrote:
We've installed a cluster of one collection of 350M documents on 3 r3.2xlarge (60GB RAM) Amazon servers. The size of the index on each shard is about 1.1TB and maximum storage on Amazon is 1TB, so we add 2 General Purpose SSD EBS volumes (1x1TB + 1x500GB) on each instance. Then we create a 1.5TB logical volume using LVM to fit our index.

Your search speed will be limited by the slowest storage in your group, which would be your 500GB EBS. The General Purpose SSD option means (as far as I can read at http://aws.amazon.com/ebs/details/#piops) that your baseline of 3 IOPS/GB = 1500 IOPS, with bursts of 3000 IOPS. Unfortunately they do not say anything about latency.

For comparison, I checked the system logs from a local test with our 21TB / 7 billion documents index. It used ~27,000 IOPS during the test, with a mean search time a bit below 1 second. That was with ~100GB RAM for disk cache, which is about ½% of index size. The test was with simple term queries (1-3 terms) and some faceting.

Back of the envelope: 27,000 IOPS for 21TB is ~1300 IOPS/TB. Your indexes are 1.1TB, so 1.1 * 1300 IOPS ~= 1400 IOPS. All else being equal (which is never the case), getting 1-3 second response times for a 1.1TB index, when one link in the storage chain is capped at a few thousand IOPS, you are using networked storage and you have little RAM for caching, does not seem unrealistic.

If possible, you could try temporarily boosting performance of the EBS, to see if raw IO is the bottleneck.

The response time is about 1 and 3 seconds for simple queries (1 token).

Is the index updated while you are searching? Do you do any faceting or other heavy processing as part of a search? How many hits does a search typically have and how many documents are returned? How many concurrent searches do you need to support? How fast should the response time be?

- Toke Eskildsen
Re: Solr performance issues
Thanks Shawn. What do you mean by the important parts of the index, and how can I calculate their size?

Thanks,
Mahmoud

Sent from my iPhone

On Dec 29, 2014, at 8:19 PM, Shawn Heisey apa...@elyograg.org wrote:

On 12/29/2014 2:36 AM, Mahmoud Almokadem wrote:
I have the same index with a slightly different schema and 200M documents, installed on 3 r3.xlarge instances (30GB RAM and 600 General Purpose SSD). The size of that index is about 1.5TB; it takes many updates every 5 minutes and serves complex queries and faceting with a response time of 100ms, which is acceptable for us. Is splitting the shard into two shards on the same node, so that every shard sits on a single EBS volume, better than using LVM?

The basic problem is simply that the system has so little memory that it must read large amounts of data from the disk when it does a query. There is not enough RAM to cache the important parts of the index. RAM is much faster than disk, even SSD.

Typical consumer-grade DDR3-1600 memory has a data transfer rate of about 12800 megabytes per second. If it's ECC memory (which I would say is a requirement) then the transfer rate is probably a little bit slower than that. Figuring 9 bits for every byte gets us about 11377 MB/s. That's only an estimate, and it could be wrong in either direction, but I'll go ahead and use it.

http://en.wikipedia.org/wiki/DDR3_SDRAM#JEDEC_standard_modules

If your SSD is SATA, the transfer rate will be limited to approximately 600MB/s -- the 6 gigabit per second transfer rate of the newest SATA standard. That makes memory about 18 times as fast as SATA SSD. I saw one PCI Express SSD that claimed a transfer rate of 2900 MB/s. Even that is only about one fourth of the estimated speed of DDR3-1600 with ECC.

I don't know what interface technology Amazon uses for their SSD volumes, but I would bet on it being the cheaper version, which would mean SATA. The networking between the EC2 instance and the EBS storage is unknown to me and may be a further bottleneck.

http://ocz.com/enterprise/z-drive-4500/specifications

Bottom line -- you need a lot more memory. Speeding up the disk may *help* ... but it will not replace that simple requirement. With EC2 as the platform, you may need more instances and more shards.

Your 200 million document index that works well with only 90GB of total memory ... that's surprising to me. That means that the important parts of that index *do* fit in memory ... but if the index gets much larger, performance is likely to drop off sharply.

Thanks,
Shawn
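A quick check of the arithmetic above, using only the figures Shawn gives (the rounding is mine):

12800 MB/s * 8/9        ≈ 11377 MB/s  (DDR3-1600 adjusted for the 9-bits-per-byte ECC estimate)
11377 MB/s / 600 MB/s   ≈ 19          (RAM vs. SATA-3 SSD, the "about 18 times" above)
2900 MB/s / 11377 MB/s  ≈ 0.25        (the PCI Express SSD runs at roughly one fourth of the RAM estimate)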
Solr performance issues
Dear all,

We've installed a cluster with one collection of 350M documents on 3 r3.2xlarge (60GB RAM) Amazon servers. The size of the index on each shard is about 1.1TB, and the maximum EBS volume size on Amazon is 1TB, so we added two General Purpose SSD EBS volumes (1x1TB + 1x500GB) to each instance. Then we created a 1.5TB logical volume using LVM to fit our index.

The response time is between 1 and 3 seconds for simple queries (1 token). Has the LVM setup become a bottleneck for our index?

Thanks for your help.
Re: Solr-Distributed search
Hi,

You can search using this sample URL:

http://localhost:8080/solr/core1/select?q=*:*&shards=localhost:8080/solr/core1,localhost:8080/solr/core2,localhost:8080/solr/core3

Mahmoud Almokadem

On Thu, Jun 5, 2014 at 8:13 AM, Anurag Verma vermanur...@gmail.com wrote:

Hi,
Can you please help me with Solr distributed search across multiple cores? I would be very happy, as I am stuck here. How do I implement distributed search in Java code?

--
Thanks & Regards
Anurag Verma
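Since the original question also asks about Java code, here is a minimal SolrJ sketch of the same request (class names as used in the SolrJ 4.x line; the core URLs are the localhost placeholders from the answer above):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class DistributedSearchExample {
    public static void main(String[] args) throws Exception {
        // Send the request to one core; the shards parameter fans it out to all listed cores.
        HttpSolrServer server = new HttpSolrServer("http://localhost:8080/solr/core1");
        SolrQuery query = new SolrQuery("*:*");
        query.set("shards", "localhost:8080/solr/core1,localhost:8080/solr/core2,localhost:8080/solr/core3");
        QueryResponse response = server.query(query);
        System.out.println("Total hits: " + response.getResults().getNumFound());
        server.shutdown();
    }
}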
Shards range error
, autoCreated:true},
news_english:{
  shards:{
    shard1:{ range:8000-, state:active, replicas:{ 10.0.1.237:8080_solr_news_english:{ state:active, base_url:http://10.0.1.237:8080/solr, core:news_english, node_name:10.0.1.237:8080_solr, leader:true}}},
    shard2:{ range:0-7fff, state:active, replicas:{ 10.0.1.6:8080_solr_news_english:{ state:active, base_url:http://10.0.1.6:8080/solr, core:news_english, node_name:10.0.1.6:8080_solr, leader:true,
  maxShardsPerNode:1,
  router:{name:compositeId},
  replicationFactor:1,
  autoCreated:true}}

So, what should I do to create new collections with 3 shards?

Thanks to all, and sorry for my poor English.

Mahmoud
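The cluster state above shows a collection that was auto-created with two shards and the compositeId router. A sketch of the usual way to get a three-shard collection instead, using the Collections API (the collection name, config name and replication values here are placeholders, not taken from the message):

http://10.0.1.237:8080/solr/admin/collections?action=CREATE&name=my_collection&numShards=3&replicationFactor=1&collection.configName=myConfig&maxShardsPerNode=2

With only two Solr nodes, maxShardsPerNode has to be at least 2 so that three shards can be placed.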