Atomic bug update

2021-02-23 Thread Mahmoud Almokadem
Hello,

I've upgraded SolrCloud from 7.6 to 8.8 and unfortunately I get the
following exception on atomic updates of some of the documents.

In some cases, fields are also retrieved as an array of multiple values
even though the field is defined as single-valued.

Is there a bug in this version regarding atomic updates, and how can I
solve this issue?
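
For reference, the failing updates are plain SolrJ atomic updates, roughly
like the following sketch (field names are illustrative, and solrClient is
an already-configured CloudSolrClient):

import java.util.HashMap;
import java.util.Map;
import org.apache.solr.common.SolrInputDocument;

SolrInputDocument doc = new SolrInputDocument();
doc.addField("document_id", "1874f9aa-4cad-4839-a282-d624fe2c40c6");

Map<String, Object> setOp = new HashMap<>();
setOp.put("set", "new title");          // atomic "set" on a stored, single-valued field
doc.addField("title", setOp);

solrClient.add("myCollection", doc);    // this is the call that ends in the exception below
solrClient.commit("myCollection");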

org.apache.solr.common.SolrException: TransactionLog doesn't know how to
serialize class org.apache.lucene.document.LazyDocument$LazyField; try
implementing ObjectResolver?
at org.apache.solr.update.TransactionLog$1.resolve(TransactionLog.java:100)
at org.apache.solr.common.util.JavaBinCodec.writeVal(JavaBinCodec.java:266)
at
org.apache.solr.common.util.JavaBinCodec$BinEntryWriter.put(JavaBinCodec.java:441)
at
org.apache.solr.common.ConditionalKeyMapWriter$EntryWriterWrapper.put(ConditionalKeyMapWriter.java:44)
at org.apache.solr.common.MapWriter$EntryWriter.putNoEx(MapWriter.java:101)
at
org.apache.solr.common.MapWriter$EntryWriter.lambda$getBiConsumer$0(MapWriter.java:161)
at
org.apache.solr.common.SolrInputDocument.lambda$writeMap$0(SolrInputDocument.java:59)
at java.base/java.util.LinkedHashMap.forEach(LinkedHashMap.java:684)
at
org.apache.solr.common.SolrInputDocument.writeMap(SolrInputDocument.java:61)
at
org.apache.solr.common.util.JavaBinCodec.writeSolrInputDocument(JavaBinCodec.java:667)
at org.apache.solr.update.TransactionLog.write(TransactionLog.java:397)
at org.apache.solr.update.UpdateLog.add(UpdateLog.java:585)
at org.apache.solr.update.UpdateLog.add(UpdateLog.java:557)
at
org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:351)
at
org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:294)
at
org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:241)
at
org.apache.solr.update.processor.RunUpdateProcessorFactory$RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:73)
at
org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
at
org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:256)
at
org.apache.solr.update.processor.DistributedUpdateProcessor.doVersionAdd(DistributedUpdateProcessor.java:495)
at
org.apache.solr.update.processor.DistributedUpdateProcessor.lambda$versionAdd$0(DistributedUpdateProcessor.java:336)
at org.apache.solr.update.VersionBucket.runWithLock(VersionBucket.java:50)
at
org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:336)
at
org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:222)
at
org.apache.solr.update.processor.DistributedZkUpdateProcessor.processAdd(DistributedZkUpdateProcessor.java:245)
at
org.apache.solr.handler.loader.JavabinLoader$1.update(JavabinLoader.java:110)
at
org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$StreamingCodec.readOuterMostDocIterator(JavaBinUpdateRequestCodec.java:343)
at
org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$StreamingCodec.readIterator(JavaBinUpdateRequestCodec.java:291)
at
org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinCodec.java:338)
at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:283)
at
org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$StreamingCodec.readNamedList(JavaBinUpdateRequestCodec.java:244)
at
org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinCodec.java:303)
at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:283)
at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:196)
at
org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.unmarshal(JavaBinUpdateRequestCodec.java:131)
at
org.apache.solr.handler.loader.JavabinLoader.parseAndLoadDocs(JavabinLoader.java:122)
at org.apache.solr.handler.loader.JavabinLoader.load(JavabinLoader.java:70)
at
org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:97)
at
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:82)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:216)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:2646)
at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:794)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:567)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:427)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:357)
at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:201)
at
org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1601)
at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:548)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
at

Re: Change field to DocValues

2021-02-17 Thread Mahmoud Almokadem
That's right, I want to avoid a complete reindexing process.
But should I create another field with the docValues property or change the
current field directly?

Can I use streaming expressions to update the whole index or should I
select and update using batches?
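
If I go the batch route, I imagine something along these lines with SolrJ
(only a sketch; it assumes all fields needed to rebuild each document are
stored so atomic updates can work, and the field/collection names are
illustrative):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.common.params.CursorMarkParams;

// page through the collection with a cursor and re-save every document so
// the field picks up the new docValues definition
SolrQuery q = new SolrQuery("*:*");
q.setRows(1000);
q.setSort(SolrQuery.SortClause.asc("document_id"));  // cursors need a sort on the uniqueKey
String cursorMark = CursorMarkParams.CURSOR_MARK_START;
boolean done = false;
while (!done) {
    q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
    QueryResponse rsp = solrClient.query("myCollection", q);
    List<SolrInputDocument> batch = new ArrayList<>();
    for (SolrDocument d : rsp.getResults()) {
        SolrInputDocument upd = new SolrInputDocument();
        upd.addField("document_id", d.getFieldValue("document_id"));
        // atomic "set" with the current value forces the document to be rewritten
        upd.addField("my_int_field", Collections.singletonMap("set", d.getFieldValue("my_int_field")));
        batch.add(upd);
    }
    if (!batch.isEmpty()) {
        solrClient.add("myCollection", batch);
    }
    String next = rsp.getNextCursorMark();
    done = cursorMark.equals(next);
    cursorMark = next;
}
solrClient.commit("myCollection");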


Thanks,
Mahmoud


On Wed, Feb 17, 2021 at 4:51 PM xiefengchang 
wrote:

> Hi:
> I think you are just trying to avoid complete re-index right?
> why don't you take a look at this:
> https://lucene.apache.org/solr/guide/8_0/updating-parts-of-documents.html
>
> At 2021-02-17 21:14:11, "Mahmoud Almokadem" 
> wrote:
> >Hello,
> >
> >I've an integer field on an index with billions of documents and need to
> do
> >facets on this field, unfortunately the field doesn't have the docValues
> >property, so the FieldCache will be fired and use much memory.
> >
> >What is the best way to change the field to be docValues supported?
> >
> >Regards,
> >Mahmoud
>


Change field to DocValues

2021-02-17 Thread Mahmoud Almokadem
Hello,

I have an integer field in an index with billions of documents and need to do
facets on this field. Unfortunately the field doesn't have the docValues
property, so the FieldCache will be used and consume a lot of memory.

What is the best way to change the field to support docValues?
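
Concretely, I mean going from a definition like the first line to the second
in the schema (the field and type names here are illustrative):

<field name="my_int_field" type="pint" indexed="true" stored="true"/>
<field name="my_int_field" type="pint" indexed="true" stored="true" docValues="true"/>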

Regards,
Mahmoud


Re: Reindex single shard on solr

2018-12-15 Thread Mahmoud Almokadem
You're right Erick.

For the Hash.murmurhash3_x86_32 method, I didn't know whether I should pass my
id directly or in a specific format like
'1874f9aa-4cad-4839-a282-d624fe2c40c6!document_id', so I used a predefined
method that gets the shard name directly.

The createCollection method doesn't create a collection physically on
SolrCloud; it only builds a reference with the same number of shards as the collection.

Also, CloudSolrClient doesn't have a method called "getCollection"; that may
be related to the SolrRequest class, which is not used in my code.

I used the following code to target my shards

String id = document.getFieldValue("document_id").toString();
Slice slice = router.getTargetSlice(id, document, null, null, solrCollection);
String shard = slice.getName();
if (targetShards.contains(shard)) {
    bufferDocuments.add(document);
}
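
For completeness, bufferDocuments is simply flushed to the cloud client in
batches, roughly like this (the batch size is illustrative):

if (bufferDocuments.size() >= 1000) {
    solrCloud.add(bufferDocuments);   // solrCloud is the CloudSolrClient
    bufferDocuments.clear();
}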

Thanks for your help,
Mahmoud

On Fri, Dec 14, 2018 at 11:20 PM Erick Erickson 
wrote:

> Why do you need to create a collection? That's probably just there in
> the test code to have something to test against.
>
> WARNING: I haven't verified this, but it should be something like the
> following. What you need
> is the hash range for the shard (slice) you're trying to update, then
> send each doc ID through
> the hash function and, if the result falls in the range of your target
> shard, index the doc.
>
> CloudSolrClient cloudSolrClient = .
>
> DocCollection coll = cloudSolrClient.getCollection(collName);
> Slice slice = coll.getSlice("shard_name_you_care_about"); // you can
> get all the slices and iterate BTW.
> DocRouter.Range range = slice.getRange()
>
> for (each doc) {
>   int hash =  Hash.murmurhash3_x86_32(whatever_your_unique_key_is, 0,
> id.length(), 0);
>   if (range.includes(hash)) {
>   index it to Solr
>   }
> }
>
> "Hash" is in org.apache.solr.common.util, in
>
> solr-solrj-##.jar, part of the normal distro.
>
> Best,
> Erick
> On Fri, Dec 14, 2018 at 11:53 AM Mahmoud Almokadem
>  wrote:
> >
> > Thanks Erick,
> >
> > I got it from TestHashPartitioner.java
> >
> >
> https://github.com/apache/lucene-solr/blob/1d85cd783863f75cea133fb9c452302214165a4d/solr/core/src/test/org/apache/solr/cloud/TestHashPartitioner.java
> >
> > Here is a sample code
> >
> > router = DocRouter.getDocRouter(CompositeIdRouter.NAME);
> > int shardsCount = 12;
> > solrCollection = createCollection(shardsCount, router);
> >
> > SolrInputDocument document = getSolrDocument(item); //need to implement
> > this method to get SolrInputDocument
> >
> > String id = "1874f9aa-4cad-4839-a282-d624fe2c40c6"
> > Slice slice = router.getTargetSlice(id, document, null, null,
> > solrCollection );
> > String shardName = slice.getName(); // shard1, shard2, ... etc
> >
> > //Helper methods from TestHashPartitioner.java
> > DocCollection createCollection(int nSlices, DocRouter router) {
> >     List<DocRouter.Range> ranges = router.partitionRange(nSlices, router.fullRange());
> >
> >     Map<String, Slice> slices = new HashMap<>();
> >     for (int i = 0; i < nSlices; i++) {
> >         DocRouter.Range range = ranges.get(i);
> >         Slice slice = new Slice("shard" + (i + 1), null, map("range", range));
> >         slices.put(slice.getName(), slice);
> >     }
> >
> >     DocCollection coll = new DocCollection("collection1", slices, null, router);
> >     return coll;
> > }
> >
> >
> > public static Map map(Object... params) {
> >     LinkedHashMap ret = new LinkedHashMap();
> >     for (int i = 0; i < params.length; i += 2) {
> >         Object o = ret.put(params[i], params[i + 1]);
> >         // TODO: handle multi-valued map?
> >     }
> >     return ret;
> > }
> >
> >
> > Mahmoud
> >
> > On Fri, Dec 14, 2018 at 7:06 PM Mahmoud Almokadem <
> prog.mahm...@gmail.com>
> > wrote:
> >
> > > Thanks Erick,
> > >
> > > You know how to use this method. Or I need to dive into the code?
> > >
> > > I've the document_id as string uniqueKey and have 12 shards.
> > >
> > > On Fri, Dec 14, 2018 at 5:58 PM Erick Erickson <
> erickerick...@gmail.com>
> > > wrote:
> > >
> > >> Sure. Of course you have to make sure you use the exact same hashing
> > >> algorithm on the <uniqueKey>.
> > >>
> > >> See CompositeIdRouter.sliceHash
> > >>
> > >> Best,
> > >> Erick
> > >> On Fri, Dec 14, 2018 at 3:36 AM Mahmoud Almokadem
> > >>  wrote:
> > >> >
> > >> > Hello,
> > >> >
> > >> > I've a corruption on some of the shards on my collection and I've a
> full
> > >> > dataset on my database, and I'm using CompositeId for routing
> documents.
> > >> >
> > >> > Can I traverse the whole dataset and do something like hashing the
> > >> > document_id to identify that this document belongs to a specific
> shard
> > >> to
> > >> > send the desired documents only instead of reindex the whole
> dataset?
> > >> >
> > >> > Sincerely,
> > >> > Mahmoud
> > >>
> > >
>


Re: Reindex single shard on solr

2018-12-14 Thread Mahmoud Almokadem
Thanks Erick,

I got it from TestHashPartitioner.java

https://github.com/apache/lucene-solr/blob/1d85cd783863f75cea133fb9c452302214165a4d/solr/core/src/test/org/apache/solr/cloud/TestHashPartitioner.java

Here is a sample code

router = DocRouter.getDocRouter(CompositeIdRouter.NAME);
int shardsCount = 12;
solrCollection = createCollection(shardsCount, router);

SolrInputDocument document = getSolrDocument(item); //need to implement
this method to get SolrInputDocument

String id = "1874f9aa-4cad-4839-a282-d624fe2c40c6"
Slice slice = router.getTargetSlice(id, document, null, null,
solrCollection );
String shardName = slice.getName(); // shard1, shard2, ... etc

//Helper methods from TestHashPartitioner.java
DocCollection createCollection(int nSlices, DocRouter router) {
    List<DocRouter.Range> ranges = router.partitionRange(nSlices, router.fullRange());

    Map<String, Slice> slices = new HashMap<>();
    for (int i = 0; i < nSlices; i++) {
        DocRouter.Range range = ranges.get(i);
        Slice slice = new Slice("shard" + (i + 1), null, map("range", range));
        slices.put(slice.getName(), slice);
    }

    DocCollection coll = new DocCollection("collection1", slices, null, router);
    return coll;
}

public static Map map(Object... params) {
    LinkedHashMap ret = new LinkedHashMap();
    for (int i = 0; i < params.length; i += 2) {
        Object o = ret.put(params[i], params[i + 1]);
        // TODO: handle multi-valued map?
    }
    return ret;
}


Mahmoud

On Fri, Dec 14, 2018 at 7:06 PM Mahmoud Almokadem 
wrote:

> Thanks Erick,
>
> You know how to use this method. Or I need to dive into the code?
>
> I've the document_id as string uniqueKey and have 12 shards.
>
> On Fri, Dec 14, 2018 at 5:58 PM Erick Erickson 
> wrote:
>
>> Sure. Of course you have to make sure you use the exact same hashing
>> algorithm on the <uniqueKey>.
>>
>> See CompositeIdRouter.sliceHash
>>
>> Best,
>> Erick
>> On Fri, Dec 14, 2018 at 3:36 AM Mahmoud Almokadem
>>  wrote:
>> >
>> > Hello,
>> >
>> > I've a corruption on some of the shards on my collection and I've a full
>> > dataset on my database, and I'm using CompositeId for routing documents.
>> >
>> > Can I traverse the whole dataset and do something like hashing the
>> > document_id to identify that this document belongs to a specific shard
>> to
>> > send the desired documents only instead of reindex the whole dataset?
>> >
>> > Sincerely,
>> > Mahmoud
>>
>


Re: Reindex single shard on solr

2018-12-14 Thread Mahmoud Almokadem
Thanks Erick,

Do you know how to use this method, or do I need to dive into the code?

I have document_id as a string uniqueKey and have 12 shards.

On Fri, Dec 14, 2018 at 5:58 PM Erick Erickson 
wrote:

> Sure. Of course you have to make sure you use the exact same hashing
> algorithm on the <uniqueKey>.
>
> See CompositeIdRouter.sliceHash
>
> Best,
> Erick
> On Fri, Dec 14, 2018 at 3:36 AM Mahmoud Almokadem
>  wrote:
> >
> > Hello,
> >
> > I've a corruption on some of the shards on my collection and I've a full
> > dataset on my database, and I'm using CompositeId for routing documents.
> >
> > Can I traverse the whole dataset and do something like hashing the
> > document_id to identify that this document belongs to a specific shard
> to
> > send the desired documents only instead of reindex the whole dataset?
> >
> > Sincerely,
> > Mahmoud
>


Re: no segments* file found

2018-12-14 Thread Mahmoud Almokadem
Thanks Erick,

I already tried digging into Lucene but couldn't continue because I need to
get my collection up ASAP. So I started my reindexing process and I'll
investigate this issue while indexing.

Mahmoud

On Fri, Dec 14, 2018 at 6:08 PM Erick Erickson 
wrote:

> You'd have to dive into the Lucene code and figure out the format,
> offhand I don't know what it is.
>
> However, there's no guarantee here that it'll result in a consistent index.
> Consider merging two segments, seg1 and seg2. Here's the merge sequence:
>
> 1> merge the segments. At the end of this you have seg1, seg2, and
> seg3. segments_N points only to seg1 and seg2.
> 2> write a new segments_N+1 file that points _only_ to seg3
> 3> delete seg1 and seg2
>
> So if you were part way through merging and had seg1, seg2, and seg3 on
> disk,
> reconstructing the segments_N file from the available segments on disk
> will result in duplicate documents in your index.
>
> FWIW,
> Erick
> On Fri, Dec 14, 2018 at 3:27 AM Mahmoud Almokadem
>  wrote:
> >
> > Hello,
> >
> > I'm facing an issue that some shards of my SolrCloud collection is
> > corrupted due to they don't have segments_N file but I think the whole
> > segments are still available. Can I create a segment_N file from the
> > available files?
> >
> > This is the stack trace:
> >
> > org.apache.solr.core.SolrCoreInitializationException: SolrCore
> > 'my_collection_shard12_replica_n22' is not available due to init failure:
> > Error opening new searcher
> > at org.apache.solr.core.CoreContainer.getCore(CoreContainer.java:1495)
> > at org.apache.solr.servlet.HttpSolrCall.init(HttpSolrCall.java:251)
> > at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:471)
> > at
> >
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:384)
> > at
> >
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:330)
> > at
> >
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1629)
> > at
> >
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:533)
> > at
> >
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
> > at
> >
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
> > at
> >
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
> > at
> >
> org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:190)
> > at
> >
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1595)
> > at
> >
> org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:188)
> > at
> >
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1253)
> > at
> >
> org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:168)
> > at
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473)
> > at
> >
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1564)
> > at
> >
> org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:166)
> > at
> >
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1155)
> > at
> >
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
> > at
> >
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219)
> > at
> >
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:126)
> > at
> >
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
> > at
> >
> org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
> > at
> >
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
> > at org.eclipse.jetty.server.Server.handle(Server.java:530)
> > at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:347)
> > at
> >
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:256)
> > at
> > org.eclipse.jetty.io
> .AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:279)
> > at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:102)
> > at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:124)
> > at
> >
> org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:247)
> > at
> >
> org.eclipse.jetty.util.thread.strategy.EatWhatYo

Reindex single shard on solr

2018-12-14 Thread Mahmoud Almokadem
Hello,

I have corruption on some of the shards of my collection, I have the full
dataset in my database, and I'm using CompositeId for routing documents.

Can I traverse the whole dataset and do something like hashing the
document_id to identify which shard each document belongs to, so that I send
only the desired documents instead of reindexing the whole dataset?

Sincerely,
Mahmoud


no segments* file found

2018-12-14 Thread Mahmoud Almokadem
Hello,

I'm facing an issue where some shards of my SolrCloud collection are
corrupted because they don't have a segments_N file, but I think all of the
segment files are still available. Can I recreate the segments_N file from
the available files?

This is the stack trace:

org.apache.solr.core.SolrCoreInitializationException: SolrCore
'my_collection_shard12_replica_n22' is not available due to init failure:
Error opening new searcher
at org.apache.solr.core.CoreContainer.getCore(CoreContainer.java:1495)
at org.apache.solr.servlet.HttpSolrCall.init(HttpSolrCall.java:251)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:471)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:384)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:330)
at
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1629)
at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:533)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
at
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
at
org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:190)
at
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1595)
at
org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:188)
at
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1253)
at
org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:168)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473)
at
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1564)
at
org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:166)
at
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1155)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219)
at
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:126)
at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
at
org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
at org.eclipse.jetty.server.Server.handle(Server.java:530)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:347)
at
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:256)
at
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:279)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:102)
at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:124)
at
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:247)
at
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.produce(EatWhatYouKill.java:140)
at
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131)
at
org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:382)
at
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:708)
at
org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:626)
at java.base/java.lang.Thread.run(Thread.java:844)
Caused by: org.apache.solr.common.SolrException: Error opening new searcher
at org.apache.solr.core.SolrCore.(SolrCore.java:1008)
at org.apache.solr.core.SolrCore.(SolrCore.java:863)
at
org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1040)
at org.apache.solr.core.CoreContainer.lambda$load$13(CoreContainer.java:640)
at
com.codahale.metrics.InstrumentedExecutorService$InstrumentedCallable.call(InstrumentedExecutorService.java:197)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:188)
at
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135)
at
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
... 1 more
Caused by: org.apache.solr.common.SolrException: Error opening new searcher
at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:2095)
at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2215)
at org.apache.solr.core.SolrCore.initSearcher(SolrCore.java:1091)
at org.apache.solr.core.SolrCore.(SolrCore.java:980)
... 9 more
Caused by: org.apache.lucene.index.IndexNotFoundException: no segments*
file found in
LockValidatingDirectoryWrapper(NRTCachingDirectory(MMapDirectory@/path/to/index/my_collection_shard12_replica_n22/data/index
lockFactory=org.apache.lucene.store.NativeFSLockFactory@659b5e1;
maxCacheMB=48.0 

Re: Dataimporter status

2017-12-06 Thread Mahmoud Almokadem
Thanks Shawn,

I'm already using the admin UI; I got the URL for fetching the dataimporter
status from the network console and tried it outside the admin UI. The admin
UI shows the same behavior: when I press Execute, the status messages swap
between "not started", "started and indexing", "completed in 3 seconds",
"completed in 10 seconds", and so on.

I understand what you mean about the dataimporter requests being load
balanced between shards. That's why I use the old admin UI for the
dataimporter, to get an accurate status of what is running now, because it's
related to a core, not a collection.

I think the dataimporter feature should be moved to the core level instead of
the collection level.
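
For now, sending the status command directly to one core, along the lines
Shawn suggests, should give a stable answer (the host, core name and
parameters here are illustrative):

http://solrip:8983/solr/collection_shard1_replica1/dataimport?command=status&indent=on&wt=json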

Thanks,
Mahmoud


On Tue, Dec 5, 2017 at 6:57 AM, Shawn Heisey <apa...@elyograg.org> wrote:

> On 12/3/2017 9:27 AM, Mahmoud Almokadem wrote:
>
>> We're facing an issue related to the dataimporter status on new Admin UI
>> (7.0.1).
>>
>> Calling to the API
>> http://solrip/solr/collection/dataimport?_=1512314812090&command=status&indent=on&wt=json
>>
>> returns different status despite the importer is running
>> The messages are swapped between the following when refreshing the page:
>>
>
> 
>
> The old Admin UI was working well.
>>
>> Is that a bug on the new Admin UI?
>>
>
> What I'm going to say below is based on the idea that you're running
> SolrCloud.  If you're not, then this seems extremely odd and should not be
> happening.
>
> The first part of your message has a URL that accesses the API directly,
> *not* the admin UI, so I'm going to concentrate on that, and not discuss
> the admin UI, because the admin UI is not involved when using that kind of
> URL.
>
> When requests are sent to a collection name rather than directly to a
> core, SolrCloud load balances those requests across the cloud, picking
> different replicas and shards so each individual request ends up on a
> different core, and possibly on a different server.
>
> This load balancing is a general feature of SolrCloud, and happens even
> with the dataimport handler.  You never know which shard/replica is going
> to actually get a /dataimport request.  So what is happening here is that
> one of the cores in your collection is actually doing a dataimport, but all
> the others aren't.  When the status command is load balanced to the core
> that did the import, then you see the status with actual data, and when
> load balancing sends the request to one of the other cores, you see the
> empty status.
>
> If you want to reliably see the status of an import on SolrCloud, you're
> going to have to choose one of the cores (collection_shardN_replicaM) on
> one of the servers in your cloud, and send both the import command and the
> status command to that one core, instead of the collection.  You might even
> need to add a distrib=false parameter to the request to keep it from being
> load balanced, but I am not sure whether that's needed for /dataimport.
>
> Thanks,
> Shawn
>


Slices not found for checkpointCollection

2017-12-06 Thread Mahmoud Almokadem
Hi all,

I'm running Solr 7.0.1. When I tried to run TopicStream with the following
expression

String expression = "topic(checkpointCollection," +
    "myCollection," +
    "q=\"*:*\"," +
    "fl=\"document_id,title,full_text\"," +
    "id=\"myTopic\"," +
    "rows=\"300\"," +
    "initialCheckpoint=\"0\"," +
    "wt=javabin)";

I got the error

java.io.IOException: Slices not found for checkpointCollection

Should I create the checkpointCollection on the cluster first? And if so, what
is the schema for this collection?

I'm using the topic stream instead of a search stream to fetch all documents
because the fields don't have docValues.

Thanks,
Mahmoud


Dataimporter status

2017-12-03 Thread Mahmoud Almokadem
We're facing an issue related to the dataimporter status in the new Admin UI
(7.0.1).

Calling the API
http://solrip/solr/collection/dataimport?_=1512314812090&command=status&indent=on&wt=json

returns a different status on each request even though the importer is running.
The messages swap between the following when refreshing the page:
{
  "responseHeader":{
"status":0,
"QTime":0},
  "initArgs":[
"defaults",[
  "config","data-config-online-live-pervoice.xml"]],
  "command":"status",
  "status":"idle",
  "importResponse":"",
  "statusMessages":{}}

===
{
  "responseHeader":{
"status":0,
"QTime":0},
  "initArgs":[
"defaults",[
  "config","data-config-online-live-pervoice.xml"]],
  "command":"status",
  "status":"idle",
  "importResponse":"",
  "statusMessages":{
"Total Requests made to DataSource":"2",
"Total Rows Fetched":"715",
"Total Documents Processed":"679",
"Total Documents Skipped":"0",
"Full Dump Started":"2017-12-03 18:22:31",
"":"Indexing completed. Added/Updated: 679 documents. Deleted 0
documents.",
"Committed":"2017-12-03 18:22:32",
"Total Documents Failed":"36",
"Time taken":"0:0:54.638",
"Full Import failed":"2017-12-03 18:22:32"}}


The old Admin UI was working well.

Is that a bug on the new Admin UI?

Thanks,
Mahmoud


Log page auto refresh

2017-12-03 Thread Mahmoud Almokadem
Hello,

I have an issue with the Log page in the new Admin UI (7.0.1): when I expand
an item, it collapses again after a short time.

This behavior is different from the old Admin UI.

Thanks,
Mahmoud


Re: Unbalanced CPU no SolrCloud

2017-10-16 Thread Mahmoud Almokadem
The load lasts for some more time after I stop the indexing.

The load was initially on the first node; after I restarted the indexing
process, the load moved to the second node and the first node worked
normally.

Thanks,
Mahmoud


On Mon, Oct 16, 2017 at 5:29 PM, Emir Arnautović <
emir.arnauto...@sematext.com> wrote:

> Does the load stops when you stop indexing or it last for some more time?
> Is it always one node that behaves like this and it starts as soon as you
> start indexing? Is load different between nodes when you are doing lighter
> indexing?
>
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
> > On 16 Oct 2017, at 13:35, Mahmoud Almokadem <prog.mahm...@gmail.com>
> wrote:
> >
> > The transition of the load happened after I restarted the bulk insert
> > process.
> >
> > The size of the index on each server about 500GB.
> >
> > There are about 8 warnings on each server for "Not found segment file"
> like
> > that
> >
> > Error getting file length for [segments_2s4]
> >
> > java.nio.file.NoSuchFileException:
> > /media/ssd_losedata/solr-home/data/documents_online_shard16_
> replica_n1/data/index/segments_2s4
> > at
> > java.base/sun.nio.fs.UnixException.translateToIOException(
> UnixException.java:92)
> > at
> > java.base/sun.nio.fs.UnixException.rethrowAsIOException(
> UnixException.java:111)
> > at
> > java.base/sun.nio.fs.UnixException.rethrowAsIOException(
> UnixException.java:116)
> > at
> > java.base/sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(
> UnixFileAttributeViews.java:55)
> > at
> > java.base/sun.nio.fs.UnixFileSystemProvider.readAttributes(
> UnixFileSystemProvider.java:145)
> > at
> > java.base/sun.nio.fs.LinuxFileSystemProvider.readAttributes(
> LinuxFileSystemProvider.java:99)
> > at java.base/java.nio.file.Files.readAttributes(Files.java:1755)
> > at java.base/java.nio.file.Files.size(Files.java:2369)
> > at org.apache.lucene.store.FSDirectory.fileLength(FSDirectory.java:243)
> > at
> > org.apache.lucene.store.NRTCachingDirectory.fileLength(
> NRTCachingDirectory.java:128)
> > at
> > org.apache.solr.handler.admin.LukeRequestHandler.getFileLength(
> LukeRequestHandler.java:611)
> > at
> > org.apache.solr.handler.admin.LukeRequestHandler.getIndexInfo(
> LukeRequestHandler.java:584)
> > at
> > org.apache.solr.handler.admin.LukeRequestHandler.handleRequestBody(
> LukeRequestHandler.java:136)
> > at
> > org.apache.solr.handler.RequestHandlerBase.handleRequest(
> RequestHandlerBase.java:177)
> > at org.apache.solr.core.SolrCore.execute(SolrCore.java:2474)
> > at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:720)
> > at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:526)
> > at
> > org.apache.solr.servlet.SolrDispatchFilter.doFilter(
> SolrDispatchFilter.java:378)
> > at
> > org.apache.solr.servlet.SolrDispatchFilter.doFilter(
> SolrDispatchFilter.java:322)
> > at
> > org.eclipse.jetty.servlet.ServletHandler$CachedChain.
> doFilter(ServletHandler.java:1691)
> > at
> > org.eclipse.jetty.servlet.ServletHandler.doHandle(
> ServletHandler.java:582)
> > at
> > org.eclipse.jetty.server.handler.ScopedHandler.handle(
> ScopedHandler.java:143)
> > at
> > org.eclipse.jetty.security.SecurityHandler.handle(
> SecurityHandler.java:548)
> > at
> > org.eclipse.jetty.server.session.SessionHandler.
> doHandle(SessionHandler.java:226)
> > at
> > org.eclipse.jetty.server.handler.ContextHandler.
> doHandle(ContextHandler.java:1180)
> > at org.eclipse.jetty.servlet.ServletHandler.doScope(
> ServletHandler.java:512)
> > at
> > org.eclipse.jetty.server.session.SessionHandler.
> doScope(SessionHandler.java:185)
> > at
> > org.eclipse.jetty.server.handler.ContextHandler.
> doScope(ContextHandler.java:1112)
> > at
> > org.eclipse.jetty.server.handler.ScopedHandler.handle(
> ScopedHandler.java:141)
> > at
> > org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(
> ContextHandlerCollection.java:213)
> > at
> > org.eclipse.jetty.server.handler.HandlerCollection.
> handle(HandlerCollection.java:119)
> > at
> > org.eclipse.jetty.server.handler.HandlerWrapper.handle(
> HandlerWrapper.java:134)
> > at
> > org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(
> RewriteHandler.java:335)
> > at
> > org.eclipse.jetty.server.handler.HandlerWrapper.handle(
> HandlerWrapper.ja

Re: Unbalanced CPU no SolrCloud

2017-10-16 Thread Mahmoud Almokadem
The transition of the load happened after I restarted the bulk insert
process.

The size of the index on each server is about 500GB.

There are about 8 warnings on each server for "Not found segment file", like
this:

Error getting file length for [segments_2s4]

java.nio.file.NoSuchFileException:
/media/ssd_losedata/solr-home/data/documents_online_shard16_replica_n1/data/index/segments_2s4
at
java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
at
java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
at
java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116)
at
java.base/sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:55)
at
java.base/sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:145)
at
java.base/sun.nio.fs.LinuxFileSystemProvider.readAttributes(LinuxFileSystemProvider.java:99)
at java.base/java.nio.file.Files.readAttributes(Files.java:1755)
at java.base/java.nio.file.Files.size(Files.java:2369)
at org.apache.lucene.store.FSDirectory.fileLength(FSDirectory.java:243)
at
org.apache.lucene.store.NRTCachingDirectory.fileLength(NRTCachingDirectory.java:128)
at
org.apache.solr.handler.admin.LukeRequestHandler.getFileLength(LukeRequestHandler.java:611)
at
org.apache.solr.handler.admin.LukeRequestHandler.getIndexInfo(LukeRequestHandler.java:584)
at
org.apache.solr.handler.admin.LukeRequestHandler.handleRequestBody(LukeRequestHandler.java:136)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:177)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:2474)
at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:720)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:526)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:378)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:322)
at
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1691)
at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
at
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
at
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
at
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
at
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
at
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
at
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
at
org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
at org.eclipse.jetty.server.Server.handle(Server.java:534)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)
at
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
at
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
at
org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
at
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
at
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
at
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
at
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
at
org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
at java.base/java.lang.Thread.run(Thread.java:844)

On Mon, Oct 16, 2017 at 1:08 PM, Emir Arnautović <
emir.arnauto...@sematext.com> wrote:

> I did not look at the graph details - now I see that it is over a 3h time span.
> It seems that there was load on the other server before this one, ending with
> a 14GB read spike and a 10GB write spike just before the load started
> on this server. Do you see any errors or suspicious log lines?
> How big is your index?
>
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
> > On 16 Oct 2017, at 12:39, Mahmoud Almokadem <prog.mahm...@gmail.com>
> wrote:
> >
> > Yes, it's constantly since I starte

Re: Unbalanced CPU no SolrCloud

2017-10-16 Thread Mahmoud Almokadem
Yes, it has been constant since I started this bulk indexing process.
As you can see, the write operations on the loaded server are 3x those of the
normal server, although disk writes are not 3x.

Mahmoud


On Mon, Oct 16, 2017 at 12:32 PM, Emir Arnautović <
emir.arnauto...@sematext.com> wrote:

> Hi Mahmoud,
> Is this something that you see constantly? The network charts suggest that
> your servers are loaded equally and, as you said, you are not using routing,
> so that is expected. Disk read/write and CPU are not equal, and that is
> expected during heavy indexing since it also triggers segment merges,
> which require those resources. Even nodes that host the same documents (e.g.
> leader and replica) are not likely to run merges at the same time, so you can
> expect to see such cases.
>
> Thanks,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
> > On 16 Oct 2017, at 11:58, Mahmoud Almokadem <prog.mahm...@gmail.com>
> wrote:
> >
> > Here are the screen shots for the two server metrics on Amazon
> >
> > https://ibb.co/kxBQam
> > https://ibb.co/fn0Jvm
> > https://ibb.co/kUpYT6
> >
> >
> >
> > On Mon, Oct 16, 2017 at 11:37 AM, Mahmoud Almokadem <
> prog.mahm...@gmail.com>
> > wrote:
> >
> >> Hi Emir,
> >>
> >> We doesn't use routing.
> >>
> >> Servers is already balanced and the number of documents on each shard
> are
> >> approximately the same.
> >>
> >> Nothing running on the servers except Solr and ZooKeeper.
> >>
> >> I initialized the client as
> >>
> >> String zkHost = "192.168.1.89:2181,192.168.1.99:2181";
> >>
> >> CloudSolrClient solrCloud = new CloudSolrClient.Builder()
> >>.withZkHost(zkHost)
> >>.build();
> >>
> >>solrCloud.setIdField("document_id");
> >>solrCloud.setDefaultCollection(collection);
> >>solrCloud.setRequestWriter(new BinaryRequestWriter());
> >>
> >>
> >> And the documents are approximately the same size.
> >>
> >> I Used 10 threads with 10 SolrClients to send data to solr and every
> >> thread send a batch of 1000 documents every time.
> >>
> >> Thanks,
> >> Mahmoud
> >>
> >>
> >>
> >> On Mon, Oct 16, 2017 at 11:01 AM, Emir Arnautović <
> >> emir.arnauto...@sematext.com> wrote:
> >>
> >>> Hi Mahmoud,
> >>> Do you use routing? Are your servers equally balanced - do you end up
> >>> having approximately the same number of documents hosted on both
> servers
> >>> (counted all shards)?
> >>> Do you have anything else running on those servers?
> >>> How do you initialise your SolrJ client?
> >>> Are documents of similar size?
> >>>
> >>> Thanks,
> >>> Emir
> >>> --
> >>> Monitoring - Log Management - Alerting - Anomaly Detection
> >>> Solr & Elasticsearch Consulting Support Training -
> http://sematext.com/
> >>>
> >>>
> >>>
> >>>> On 16 Oct 2017, at 10:46, Mahmoud Almokadem <prog.mahm...@gmail.com>
> >>> wrote:
> >>>>
> >>>> We've installed SolrCloud 7.0.1 with two nodes and 8 shards per node.
> >>>>
> >>>> The configurations and the specs of the two servers are identical.
> >>>>
> >>>> When running bulk indexing using SolrJ we see one of the servers is
> >>> fully
> >>>> loaded as you see on the images and the other is normal.
> >>>>
> >>>> Images URLs:
> >>>>
> >>>> https://ibb.co/jkE6gR
> >>>> https://ibb.co/hyzvam
> >>>> https://ibb.co/mUpvam
> >>>> https://ibb.co/e4bxo6
> >>>>
> >>>> How can I figure this issue?
> >>>>
> >>>> Thanks,
> >>>> Mahmoud
> >>>
> >>>
> >>
>
>


Re: Unbalanced CPU no SolrCloud

2017-10-16 Thread Mahmoud Almokadem
Here are the screen shots for the two server metrics on Amazon

https://ibb.co/kxBQam
https://ibb.co/fn0Jvm
https://ibb.co/kUpYT6



On Mon, Oct 16, 2017 at 11:37 AM, Mahmoud Almokadem <prog.mahm...@gmail.com>
wrote:

> Hi Emir,
>
> We doesn't use routing.
>
> Servers is already balanced and the number of documents on each shard are
> approximately the same.
>
> Nothing running on the servers except Solr and ZooKeeper.
>
> I initialized the client as
>
> String zkHost = "192.168.1.89:2181,192.168.1.99:2181";
>
> CloudSolrClient solrCloud = new CloudSolrClient.Builder()
> .withZkHost(zkHost)
> .build();
>
> solrCloud.setIdField("document_id");
> solrCloud.setDefaultCollection(collection);
> solrCloud.setRequestWriter(new BinaryRequestWriter());
>
>
> And the documents are approximately the same size.
>
> I Used 10 threads with 10 SolrClients to send data to solr and every
> thread send a batch of 1000 documents every time.
>
> Thanks,
> Mahmoud
>
>
>
> On Mon, Oct 16, 2017 at 11:01 AM, Emir Arnautović <
> emir.arnauto...@sematext.com> wrote:
>
>> Hi Mahmoud,
>> Do you use routing? Are your servers equally balanced - do you end up
>> having approximately the same number of documents hosted on both servers
>> (counted all shards)?
>> Do you have anything else running on those servers?
>> How do you initialise your SolrJ client?
>> Are documents of similar size?
>>
>> Thanks,
>> Emir
>> --
>> Monitoring - Log Management - Alerting - Anomaly Detection
>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>
>>
>>
>> > On 16 Oct 2017, at 10:46, Mahmoud Almokadem <prog.mahm...@gmail.com>
>> wrote:
>> >
>> > We've installed SolrCloud 7.0.1 with two nodes and 8 shards per node.
>> >
>> > The configurations and the specs of the two servers are identical.
>> >
>> > When running bulk indexing using SolrJ we see one of the servers is
>> fully
>> > loaded as you see on the images and the other is normal.
>> >
>> > Images URLs:
>> >
>> > https://ibb.co/jkE6gR
>> > https://ibb.co/hyzvam
>> > https://ibb.co/mUpvam
>> > https://ibb.co/e4bxo6
>> >
>> > How can I figure this issue?
>> >
>> > Thanks,
>> > Mahmoud
>>
>>
>


Re: Unbalanced CPU no SolrCloud

2017-10-16 Thread Mahmoud Almokadem
Hi Emir,

We don't use routing.

The servers are already balanced and the number of documents on each shard is
approximately the same.

Nothing is running on the servers except Solr and ZooKeeper.

I initialized the client as follows:

String zkHost = "192.168.1.89:2181,192.168.1.99:2181";

CloudSolrClient solrCloud = new CloudSolrClient.Builder()
.withZkHost(zkHost)
.build();

solrCloud.setIdField("document_id");
solrCloud.setDefaultCollection(collection);
solrCloud.setRequestWriter(new BinaryRequestWriter());


And the documents are approximately the same size.

I used 10 threads with 10 SolrClient instances to send data to Solr, and every
thread sends a batch of 1000 documents at a time.
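
For reference, each indexing thread runs a loop roughly like this (a
simplified sketch; the document source and batch size are illustrative):

// each of the 10 threads uses its own CloudSolrClient built as above
List<SolrInputDocument> batch = new ArrayList<>(1000);
for (SolrInputDocument doc : documentsForThisThread()) {   // illustrative document source
    batch.add(doc);
    if (batch.size() == 1000) {
        solrCloud.add(batch);   // CloudSolrClient routes each document to its shard leader
        batch.clear();
    }
}
if (!batch.isEmpty()) {
    solrCloud.add(batch);
}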

Thanks,
Mahmoud



On Mon, Oct 16, 2017 at 11:01 AM, Emir Arnautović <
emir.arnauto...@sematext.com> wrote:

> Hi Mahmoud,
> Do you use routing? Are your servers equally balanced - do you end up
> having approximately the same number of documents hosted on both servers
> (counted all shards)?
> Do you have anything else running on those servers?
> How do you initialise your SolrJ client?
> Are documents of similar size?
>
> Thanks,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
> > On 16 Oct 2017, at 10:46, Mahmoud Almokadem <prog.mahm...@gmail.com>
> wrote:
> >
> > We've installed SolrCloud 7.0.1 with two nodes and 8 shards per node.
> >
> > The configurations and the specs of the two servers are identical.
> >
> > When running bulk indexing using SolrJ we see one of the servers is fully
> > loaded as you see on the images and the other is normal.
> >
> > Images URLs:
> >
> > https://ibb.co/jkE6gR
> > https://ibb.co/hyzvam
> > https://ibb.co/mUpvam
> > https://ibb.co/e4bxo6
> >
> > How can I figure this issue?
> >
> > Thanks,
> > Mahmoud
>
>


Unbalanced CPU no SolrCloud

2017-10-16 Thread Mahmoud Almokadem
We've installed SolrCloud 7.0.1 with two nodes and 8 shards per node.

The configurations and the specs of the two servers are identical.

When running bulk indexing using SolrJ, one of the servers is fully loaded, as
you can see in the images, while the other is normal.

Images URLs:

https://ibb.co/jkE6gR
https://ibb.co/hyzvam
https://ibb.co/mUpvam
https://ibb.co/e4bxo6

How can I figure out this issue?

Thanks,
Mahmoud


Re: Move index directory to another partition

2017-08-10 Thread Mahmoud Almokadem
Thanks all for your comments.

I followed Shawn's steps (rsync), since everything (ZooKeeper, Solr home and
data) is on that volume, and everything went great.
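
For anyone finding this later, the sequence was roughly the following (the
paths are illustrative):

# first pass while Solr is still running
rsync -avH /old-volume/solr-home/ /new-volume/solr-home/
# second pass picks up whatever changed during the first copy
rsync -avH /old-volume/solr-home/ /new-volume/solr-home/
# stop Solr, then a final pass copies only a very small changeset
rsync -avH --delete /old-volume/solr-home/ /new-volume/solr-home/
# repoint Solr (and ZooKeeper) at the new volume and start Solr again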

Thanks again,
Mahmoud


On Sun, Aug 6, 2017 at 12:47 AM, Erick Erickson 
wrote:

> bq: I was envisioning a scenario where the entire solr home is on the old
> volume that's going away.  If I were setting up a Solr install where the
> large/fast storage was a separate filesystem, I would put the solr home
> (or possibly even the entire install) under that mount point.  It would
> be a lot easier than setting dataDir in core.properties for every core,
> especially in a cloud install.
>
> Agreed. Nothing in what I said precludes this. If you don't specify
> dataDir,
> then the index for a new replica goes in the default place, i.e. under
> your install
> directory usually. In your case under your new mount point. I usually don't
> recommend trying to take control of where dataDir points, just let it
> default.
> I only mentioned it so you'd be aware it exists. So if your new install
> is associated with a bigger/better/larger EBS it's all automatic.
>
> bq: If the dataDir property is already in use to relocate index data, then
> ADDREPLICA and DELETEREPLICA would be a great way to go.  I would not
> expect most SolrCloud users to use that method.
>
> I really don't understand this. Each Solr replica has an associated
> dataDir whether you specified it or not (the default is relative to
> the core.properties file). ADDREPLICA creates a new replica in a new
> place, initially the data directory and index are empty. The new
> replica goes into recovery and uses the standard replication process
> to copy the index via HTTP from a healthy replica and write it to its
> data directory. Once that's done, the replica becomes live. There's
> nothing about dataDir already being in use here at all.
>
> When you start Solr there's the default place Solr expects to find the
> replicas. This is not necessarily where Solr is executing from, see
> the "-s" option in bin/solr start -s.
>
> If you're talking about using dataDir to point to an existing index,
> yes that would be a problem and not something I meant to imply at all.
>
> Why wouldn't most SolrCloud users use ADDREPLICA/DELETEREPLICA? It's
> commonly used to move replicas around a cluster.
>
> Best,
> Erick
>
> On Fri, Aug 4, 2017 at 11:15 AM, Shawn Heisey  wrote:
> > On 8/2/2017 9:17 AM, Erick Erickson wrote:
> >> Not entirely sure about AWS intricacies, but getting a new replica to
> >> use a particular index directory in the general case is just
> >> specifying dataDir=some_directory on the ADDREPLICA command. The index
> >> just needs an HTTP connection (uses the old replication process) so
> >> nothing huge there. Then DELETEREPLICA for the old one. There's
> >> nothing that ZK has to know about to make this work, it's all local to
> >> the Solr instance.
> >
> > I was envisioning a scenario where the entire solr home is on the old
> > volume that's going away.  If I were setting up a Solr install where the
> > large/fast storage was a separate filesystem, I would put the solr home
> > (or possibly even the entire install) under that mount point.  It would
> > be a lot easier than setting dataDir in core.properties for every core,
> > especially in a cloud install.
> >
> > If the dataDir property is already in use to relocate index data, then
> > ADDREPLICA and DELETEREPLICA would be a great way to go.  I would not
> > expect most SolrCloud users to use that method.
> >
> > Thanks,
> > Shawn
> >
>


Re: Move index directory to another partition

2017-08-01 Thread Mahmoud Almokadem
Thanks Shawn,

I'm using Ubuntu and I'll try the rsync command. Unfortunately I'm using a
replication factor of one, but I think the downtime will be less than five
minutes after following your steps.

But how can I start Solr back up, and why should I run it again, given that
I've already copied the index and changed the path?

And what do you mean by "Using multiple passes with rsync"?

Thanks,
Mahmoud


On Tuesday, August 1, 2017, Shawn Heisey <apa...@elyograg.org> wrote:

> On 7/31/2017 12:28 PM, Mahmoud Almokadem wrote:
> > I've a SolrCloud of four instances on Amazon and the EBS volumes that
> > contain the data on everynode is going to be full, unfortunately Amazon
> > doesn't support expanding the EBS. So, I'll attach larger EBS volumes to
> > move the index to.
> >
> > I can stop the updates on the index, but I'm afraid to use "cp" command
> to
> > copy the files that are "on merge" operation.
> >
> > The copy operation may take several  hours.
> >
> > How can I move the data directory without stopping the instance?
>
> Use rsync to do the copy.  Do an initial copy while Solr is running,
> then do a second copy, which should be pretty fast because rsync will
> see the data from the first copy.  Then shut Solr down and do a third
> rsync which will only copy a VERY small changeset.  Reconfigure Solr
> and/or the OS to use the new location, and start Solr back up.  Because
> you mentioned "cp" I am assuming that you're NOT on Windows, and that
> the OS will most likely allow you to do anything you need with index
> files while Solr has them open.
>
> If you have set up your replicas with SolrCloud properly, then your
> collections will not go offline when one Solr instance is shut down, and
> that instance will be brought back into sync with the rest of the
> cluster when it starts back up.  Using multiple passes with rsync should
> mean that Solr will not need to be shutdown for very long.
>
> The options I typically use for this kind of copy with rsync are "-avH
> --delete".  I would recommend that you research rsync options so that
> you fully understand what I have suggested.
>
> Thanks,
> Shawn
>
>


Move index directory to another partition

2017-07-31 Thread Mahmoud Almokadem
Hello,

I have a SolrCloud of four instances on Amazon, and the EBS volumes that
contain the data on every node are about to be full; unfortunately Amazon
doesn't support expanding the EBS volumes. So I'll attach larger EBS volumes
and move the index to them.

I can stop the updates on the index, but I'm afraid to use the "cp" command to
copy files that are part of an in-progress merge operation.

The copy operation may take several hours.

How can I move the data directory without stopping the instance?

Thanks,
Mahmoud


Re: Clean checkbox on DIH

2017-05-02 Thread Mahmoud Almokadem
Thanks Shawn,

We already use the admin UI for testing and bulk uploads, and we use curl
scripts for the automation process.

I'll report the issues regarding the new UI on JIRA.

Thanks,
Mahmoud


On Tuesday, May 2, 2017, Shawn Heisey <apa...@elyograg.org> wrote:

> On 5/2/2017 6:53 AM, Mahmoud Almokadem wrote:
> > And for the dataimport I always use the old UI cause the new UI
> > doesn't show the live update and sometimes doesn't show the
> > configuration. I think there are many bugs on the new UI.
>
> Do you know if these problems have been reported in the Jira issue
> tracker?  The old UI is going to disappear in Solr 7.0 when it is
> released.  If there are bugs in the new UI, we need to have them
> reported so they can be fixed.
>
> As I stated earlier, when it comes to DIH, the admin UI is more useful
> for testing and research than actual usage.  The URLs for the admin UI
> cannot be used in automation tools -- the API must be used directly.
>
> Thanks,
> Shawn
>
>


Re: Clean checkbox on DIH

2017-05-02 Thread Mahmoud Almokadem
Thanks Shawn for your clarifications,

I think showing a confirmation message saying "The whole index will be
cleaned" when the clean option is checked would be good.

I always remove the check from the file
/opt/solr/server/solr-webapp/webapp/tpl/dataimport.html after installing
Solr, but when I upgraded this time I forgot to do that, pressed Execute with
the box checked, and the whole index was cleaned.

As for the dataimport, I always use the old UI because the new UI doesn't
show live updates and sometimes doesn't show the configuration. I think there
are many bugs in the new UI.

Thanks,
Mahmoud

On Mon, May 1, 2017 at 4:30 PM, Shawn Heisey <apa...@elyograg.org> wrote:

> On 4/28/2017 9:01 AM, Mahmoud Almokadem wrote:
> > We already using a shell scripts to do our import and using fullimport
> > command to do our delta import and everything is doing well several
> > years ago. But default of the UI is full import with clean and commit.
> > If I press the Execute button by mistake the whole index is cleaned
> > without any notification.
>
> I understand your frustration.  What I'm worried about is the fallout if
> we change the default to be unchecked, from people who didn't verify the
> setting and expected full-import to wipe their index before it started
> importing, just like it has always done for the last few years.
>
> The default value for the clean parameter when NOT using the admin UI is
> true for full-import, and false for delta-import.  That's not going to
> change.  I firmly believe that the admin UI should have the same
> defaults as the API itself.  The very nature of a full-import carries
> the implication that you want to start over with an empty index.
>
> What if there were some bright red text in the UI near the execute
> button that urged you to double-check that the "clean" box has the
> setting you want?  An alternate idea would be to pop up a yes/no
> verification dialog on execute when the clean box is checked.
>
> Thanks,
> Shawn
>
>


Re: Clean checkbox on DIH

2017-04-28 Thread Mahmoud Almokadem
Thanks Shawn,

We are already using shell scripts to do our imports and using the full-import
command to do our delta imports, and everything has worked well for several
years. But the default in the UI is a full import with clean and commit. If I
press the Execute button by mistake, the whole index is cleaned without any
notification.

Thanks,
Mahmoud




On Fri, Apr 28, 2017 at 2:51 PM, Shawn Heisey <apa...@elyograg.org> wrote:

> On 4/28/2017 5:11 AM, Mahmoud Almokadem wrote:
> > I'd like to request to uncheck the "Clean" checkbox by default on DIH
> page,
> > cause it cleaned the whole index about 2TB when I click Execute button by
> > wrong. Or show a confirmation message that the whole index will be
> cleaned!!
>
> When somebody is doing a full-import, clean is what almost all users are
> going to want.  If you're wanting to do full-import without cleaning,
> then you are in the minority.  It is perhaps a fairly large minority,
> but still not the majority.
>
> Also, once you move into production, you should not be using the admin
> UI for this.  You should be calling the DIH handler directly with HTTP
> from another source, which might be a shell script using curl, or a
> full-blown program in another language.
>
> Thanks,
> Shawn
>
>
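For reference, a minimal sketch of the direct DIH call Shawn describes, assuming
SolrJ 6.x, a core named "core1" and a handler registered at /dataimport (all of
these names are illustrative, not taken from this thread):

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.common.params.ModifiableSolrParams;
import org.apache.solr.common.util.NamedList;

public class DihTrigger {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/core1");
        ModifiableSolrParams params = new ModifiableSolrParams();
        params.set("command", "full-import");
        params.set("clean", "false");    // explicit, so a stray default can never wipe the index
        params.set("commit", "true");
        QueryRequest request = new QueryRequest(params);
        request.setPath("/dataimport");  // call the handler directly, bypassing the admin UI
        NamedList<Object> response = client.request(request);
        System.out.println(response);
        client.close();
    }
}

The same parameters can be scripted for delta-import; the point is that the clean
value is always stated explicitly rather than inherited from a UI default.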


Clean checkbox on DIH

2017-04-28 Thread Mahmoud Almokadem
Hello,

I'd like to request that the "Clean" checkbox be unchecked by default on the DIH
page, because it cleaned our whole index of about 2TB when I clicked the Execute
button by mistake. Or show a confirmation message that the whole index will be cleaned!

Sincerely,
Mahmoud


TransactionLog doesn't know how to serialize class java.util.UUID; try implementing ObjectResolver?

2017-04-27 Thread Mahmoud Almokadem
Hello,

When I try to update a document that exists on SolrCloud, I get this message:

TransactionLog doesn't know how to serialize class java.util.UUID; try
implementing ObjectResolver?

With the stack trace:


{"data":{"responseHeader":{"status":500,"QTime":3},"error":{"metadata":["error-class","org.apache.solr.common.SolrException","root-error-class","org.apache.solr.common.SolrException"],"msg":"TransactionLog
doesn't know how to serialize class java.util.UUID; try implementing
ObjectResolver?","trace":"org.apache.solr.common.SolrException:
TransactionLog doesn't know how to serialize class java.util.UUID; try
implementing ObjectResolver?\n\tat
org.apache.solr.update.TransactionLog$1.resolve(TransactionLog.java:100)\n\tat
org.apache.solr.common.util.JavaBinCodec.writeVal(JavaBinCodec.java:234)\n\tat
org.apache.solr.common.util.JavaBinCodec.writeSolrInputDocument(JavaBinCodec.java:589)\n\tat
org.apache.solr.update.TransactionLog.write(TransactionLog.java:395)\n\tat
org.apache.solr.update.UpdateLog.add(UpdateLog.java:524)\n\tat
org.apache.solr.update.UpdateLog.add(UpdateLog.java:508)\n\tat
org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:320)\n\tat
org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:239)\n\tat
org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:194)\n\tat
org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:67)\n\tat
org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)\n\tat
org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:980)\n\tat
org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1193)\n\tat
org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:749)\n\tat
org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:103)\n\tat
org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.handleAdds(JsonLoader.java:502)\n\tat
org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.processUpdate(JsonLoader.java:141)\n\tat
org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.load(JsonLoader.java:117)\n\tat
org.apache.solr.handler.loader.JsonLoader.load(JsonLoader.java:80)\n\tat
org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:97)\n\tat
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)\n\tat
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:173)\n\tat
org.apache.solr.core.SolrCore.execute(SolrCore.java:2440)\n\tat
org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:723)\n\tat
org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:529)\n\tat
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:347)\n\tat
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:298)\n\tat
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1691)\n\tat
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)\n\tat
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)\n\tat
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)\n\tat
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)\n\tat
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)\n\tat
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)\n\tat
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)\n\tat
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)\n\tat
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)\n\tat
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)\n\tat
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)\n\tat
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)\n\tat
org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)\n\tat
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)\n\tat
org.eclipse.jetty.server.Server.handle(Server.java:534)\n\tat
org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)\n\tat
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)\n\tat
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)\n\tat
org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)\n\tat
org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)\n\tat
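No fix is given in this thread. One hedged client-side avoidance, assuming the
java.util.UUID values originate in your own indexing code (for example via SolrJ)
rather than from a stored UUIDField, is to convert them to Strings before the
document is sent, so the transaction log's JavaBinCodec never sees a UUID object;
if the value instead comes from a UUIDField carried through an atomic update, a
schema-side change may be needed instead. Field names below are illustrative.

import java.util.UUID;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class UuidSafeUpdate {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/core1");
        UUID id = UUID.randomUUID();

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", id.toString());       // send a String, not the UUID object
        doc.addField("title_t", "updated title");
        client.add(doc);
        client.commit();
        client.close();
    }
}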

Re: Enable Gzip compression Solr 6.0

2017-04-12 Thread Mahmoud Almokadem
Thanks Rick,

I'm already running Solr on my own infrastructure and behind a web application.

The web application works as a proxy in front of Solr, so I think I can
compress the content on the Solr end. But I have done it on the proxy for now.

Thanks again,
Mahmoud 


> On Apr 12, 2017, at 4:31 PM, Rick Leir <rl...@leirtech.com> wrote:
> 
> Hi Mahmoud
> I assume you are running Solr 'behind' a web application, so Solr is not 
> directly on the net.
> 
> The gzip compression is an Apache thing, and relates to your web application. 
> 
> Connections to Solr are within your infrastructure, so you might not want to 
> gzip them. But maybe your setup​ is different?
> 
> Older versions of Solr used Tomcat which supported gzip. Newer versions use 
> Zookeeper and Jetty and you prolly will find a way.
> Cheers -- Rick
> 
>> On April 12, 2017 8:48:45 AM EDT, Mahmoud Almokadem <prog.mahm...@gmail.com> 
>> wrote:
>> Hello,
>> 
>> How can I enable Gzip compression for Solr 6.0 to save bandwidth
>> between
>> the server and clients?
>> 
>> Thanks,
>> Mahmoud
> 
> -- 
> Sorry for being brief. Alternate email is rickleir at yahoo dot com


Enable Gzip compression Solr 6.0

2017-04-12 Thread Mahmoud Almokadem
Hello,

How can I enable Gzip compression for Solr 6.0 to save bandwidth between
the server and clients?

Thanks,
Mahmoud


Re: Indexing CPU performance

2017-03-14 Thread Mahmoud Almokadem
Thanks Toke,

After sorting by Self Time (CPU), I can see that
FSDirectory$FSIndexOutput$1.write() is taking most of the CPU time, so is the
bottleneck now the I/O of the hard drive?

https://drive.google.com/open?id=0BwLcshoSCVcdb2I4U1RBNnI0OVU

On Tue, Mar 14, 2017 at 4:19 PM, Toke Eskildsen <t...@kb.dk> wrote:

> On Tue, 2017-03-14 at 11:51 +0200, Mahmoud Almokadem wrote:
> > Here is the profiler screenshot from VisualVM after upgrading
> >
> > https://drive.google.com/open?id=0BwLcshoSCVcddldVRTExaDR2dzg
> >
> > the jetty is taking the most time on CPU. Does this mean, the jetty
> > is the bottleneck on indexing?
>
> You need to sort on and look at the "Self Time (CPU)" column in
> VisualVM, not the default "Self Time", to see where the power is used.
> The default is pretty useless for locating hot spots.
>
> - Toke Eskildsen, Royal Danish Library
>


Re: Indexing CPU performance

2017-03-14 Thread Mahmoud Almokadem
After upgrading to 6.4.2 I got 3500+ docs/sec throughput with two uploading
clients to Solr, which is good enough for me for the whole reindexing.

I'll try Shawn's code for posting to Solr using HttpSolrClient instead of
CloudSolrClient.

Thanks to all,
Mahmoud

On Tue, Mar 14, 2017 at 10:23 AM, Mahmoud Almokadem <prog.mahm...@gmail.com>
wrote:

>
> I'm using VisualVM and sematext to monitor my cluster.
>
> Below is screenshots for each of them.
>
> https://drive.google.com/open?id=0BwLcshoSCVcdWHRJeUNyekxWN28
>
> https://drive.google.com/open?id=0BwLcshoSCVcdZzhTRGVjYVJBUzA
>
> https://drive.google.com/open?id=0BwLcshoSCVcdc0dQZGJtMWxDOFk
>
> https://drive.google.com/open?id=0BwLcshoSCVcdR3hJSHRZTjdSZm8
>
> https://drive.google.com/open?id=0BwLcshoSCVcdUzRETDlFeFIxU2M
>
> Thanks,
> Mahmoud
>
> On Tue, Mar 14, 2017 at 10:20 AM, Mahmoud Almokadem <
> prog.mahm...@gmail.com> wrote:
>
>> Thanks Erick,
>>
>> I think there are something missing, the rate I'm talking about is for
>> bulk upload and one time indexing to on-going indexing.
>> My dataset is about 250 million documents and I need to index them to
>> solr.
>>
>> Thanks Shawn for your clarification,
>>
>> I think that I got stuck on this version 6.4.1 I'll upgrade my cluster
>> and test again.
>>
>> Thanks for help
>> Mahmoud
>>
>>
>> On Tue, Mar 14, 2017 at 1:20 AM, Shawn Heisey <apa...@elyograg.org>
>> wrote:
>>
>>> On 3/13/2017 7:58 AM, Mahmoud Almokadem wrote:
>>> > When I start my bulk indexer program the CPU utilization is 100% on
>>> each
>>> > server but the rate of the indexer is about 1500 docs per second.
>>> >
>>> > I know that some solr benchmarks reached 70,000+ doc. per second.
>>>
>>> There are *MANY* factors that affect indexing rate.  When you say that
>>> the CPU utilization is 100 percent, what operating system are you
>>> running and what tool are you using to see CPU percentage?  Within that
>>> tool, where are you looking to see that usage level?
>>>
>>> On some operating systems with some reporting tools, a server with 8 CPU
>>> cores can show up to 800 percent CPU usage, so 100 percent utilization
>>> on the Solr process may not be full utilization of the server's
>>> resources.  It also might be an indicator of the full system usage, if
>>> you are looking in the right place.
>>>
>>> > The question: What is the best way to determine the bottleneck on solr
>>> > indexing rate?
>>>
>>> I have two likely candidates for you.  The first one is a bug that
>>> affects Solr 6.4.0 and 6.4.1, which is fixed by 6.4.2.  If you don't
>>> have one of those two versions, then this is not affecting you:
>>>
>>> https://issues.apache.org/jira/browse/SOLR-10130
>>>
>>> The other likely bottleneck, which could be a problem whether or not the
>>> previous bug is present, is single-threaded indexing, so every batch of
>>> docs must wait for the previous batch to finish before it can begin, and
>>> only one CPU gets utilized on the server side.  Both Solr and SolrJ are
>>> fully capable of handling several indexing threads at once, and that is
>>> really the only way to achieve maximum indexing performance.  If you
>>> want multi-threaded (parallel) indexing, you must create the threads on
>>> the client side, or run multiple indexing processes that each handle
>>> part of the job.  Multi-threaded code is not easy to write correctly.
>>>
>>> The fieldTypes and analysis that you have configured in your schema may
>>> include classes that process very slowly, or may include so many filters
>>> that the end result is slow performance.  I am not familiar with the
>>> performance of the classes that Solr includes, so I would not be able to
>>> look at a schema and tell you which entries are slow.  As Erick
>>> mentioned, processing for 300+ fields could be one reason for slow
>>> indexing.
>>>
>>> If you are doing a commit operation for every batch, that will slow it
>>> down even more.  If you have autoSoftCommit configured with a very low
>>> maxTime or maxDocs value, that can result in extremely frequent commits
>>> that make indexing much slower.  Although frequent autoCommit is very
>>> much desirable for good operation (as long as openSearcher set to
>>> false), commits that open new searchers should be much less frequent.
>>> The best option is to only commit (with a new searcher) *

Re: Indexing CPU performance

2017-03-14 Thread Mahmoud Almokadem
Here is the profiler screenshot from VisualVM after upgrading

https://drive.google.com/open?id=0BwLcshoSCVcddldVRTExaDR2dzg

Jetty is taking the most CPU time. Does this mean that Jetty is the
bottleneck for indexing?

Thanks,
Mahmoud


On Tue, Mar 14, 2017 at 11:41 AM, Mahmoud Almokadem <prog.mahm...@gmail.com>
wrote:

> Thanks Shalin,
>
> I'm posting data to solr with SolrInputDocument using SolrJ.
>
> According to the profiler, the com.codahale.metrics.Meter.mark is take
> much processing than others as mentioned on this issue
> https://issues.apache.org/jira/browse/SOLR-10130.
>
> And I think the profiler of sematext is different than VisualVM.
>
> Thanks for help,
> Mahmoud
>
>
>
> On Tue, Mar 14, 2017 at 11:08 AM, Shalin Shekhar Mangar <
> shalinman...@gmail.com> wrote:
>
>> According to the profiler output, a significant amount of cpu is being
>> spent in JSON parsing but your previous email said that you use SolrJ.
>> SolrJ uses the javabin binary format to send documents to Solr and it
>> never ever uses JSON so there is definitely some other indexing
>> process that you have not accounted for.
>>
>> On Tue, Mar 14, 2017 at 12:31 AM, Mahmoud Almokadem
>> <prog.mahm...@gmail.com> wrote:
>> > Thanks Erick,
>> >
>> > I've commented out the line SolrClient.add(doclist) and get 5500+ docs
>> per
>> > second from single producer.
>> >
>> > Regarding more shards, you mean use 2 nodes with 8 shards per node so we
>> > got 16 shards on the same 2 nodes or spread shards over more nodes?
>> >
>> > I'm using solr 6.4.1 with zookeeper on the same nodes.
>> >
>> > Here's what I got from sematext profiler
>> >
>> > 51%
>> > Thread.java:745java.lang.Thread#run
>> >
>> > 42%
>> > QueuedThreadPool.java:589
>> > org.eclipse.jetty.util.thread.QueuedThreadPool$2#run
>> > Collapsed 29 calls (Expand)
>> >
>> > 43%
>> > UpdateRequestHandler.java:97
>> > org.apache.solr.handler.UpdateRequestHandler$1#load
>> >
>> > 30%
>> > JsonLoader.java:78org.apache.solr.handler.loader.JsonLoader#load
>> >
>> > 30%
>> > JsonLoader.java:115
>> > org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader#load
>> >
>> > 13%
>> > JavabinLoader.java:54org.apache.solr.handler.loader.JavabinLoader#load
>> >
>> > 9%
>> > ThreadPoolExecutor.java:617
>> > java.util.concurrent.ThreadPoolExecutor$Worker#run
>> >
>> > 9%
>> > ThreadPoolExecutor.java:1142
>> > java.util.concurrent.ThreadPoolExecutor#runWorker
>> >
>> > 33%
>> > ConcurrentMergeScheduler.java:626
>> > org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread#run
>> >
>> > 33%
>> > ConcurrentMergeScheduler.java:588
>> > org.apache.lucene.index.ConcurrentMergeScheduler#doMerge
>> >
>> > 33%
>> > SolrIndexWriter.java:233org.apache.solr.update.SolrIndexWriter#merge
>> >
>> > 33%
>> > IndexWriter.java:3920org.apache.lucene.index.IndexWriter#merge
>> >
>> > 33%
>> > IndexWriter.java:4343org.apache.lucene.index.IndexWriter#mergeMiddle
>> >
>> > 20%
>> > SegmentMerger.java:101org.apache.lucene.index.SegmentMerger#merge
>> >
>> > 11%
>> > SegmentMerger.java:89org.apache.lucene.index.SegmentMerger#merge
>> >
>> > 2%
>> > SegmentMerger.java:144org.apache.lucene.index.SegmentMerger#merge
>> >
>> >
>> > On Mon, Mar 13, 2017 at 5:12 PM, Erick Erickson <
>> erickerick...@gmail.com>
>> > wrote:
>> >
>> >> Note that 70,000 docs/second pretty much guarantees that there are
>> >> multiple shards. Lots of shards.
>> >>
>> >> But since you're using SolrJ, the  very first thing I'd try would be
>> >> to comment out the SolrClient.add(doclist) call so you're doing
>> >> everything _except_ send the docs to Solr. That'll tell you whether
>> >> there's any bottleneck on getting the docs from the system of record.
>> >> The fact that you're pegging the CPUs argues that you are feeding Solr
>> >> as fast as Solr can go so this is just a sanity check. But it's
>> >> simple/fast.
>> >>
>> >> As far as what on Solr could be the bottleneck, no real way to know
>> >> without profiling. But 300+ fields per doc probably just means you're
>> >> doin

Re: Indexing CPU performance

2017-03-14 Thread Mahmoud Almokadem
Thanks Shalin,

I'm posting data to solr with SolrInputDocument using SolrJ.

According to the profiler, com.codahale.metrics.Meter.mark takes much more
processing time than anything else, as mentioned in this issue:
https://issues.apache.org/jira/browse/SOLR-10130.

And I think the Sematext profiler is different from VisualVM.

Thanks for help,
Mahmoud



On Tue, Mar 14, 2017 at 11:08 AM, Shalin Shekhar Mangar <
shalinman...@gmail.com> wrote:

> According to the profiler output, a significant amount of cpu is being
> spent in JSON parsing but your previous email said that you use SolrJ.
> SolrJ uses the javabin binary format to send documents to Solr and it
> never ever uses JSON so there is definitely some other indexing
> process that you have not accounted for.
>
> On Tue, Mar 14, 2017 at 12:31 AM, Mahmoud Almokadem
> <prog.mahm...@gmail.com> wrote:
> > Thanks Erick,
> >
> > I've commented out the line SolrClient.add(doclist) and get 5500+ docs
> per
> > second from single producer.
> >
> > Regarding more shards, you mean use 2 nodes with 8 shards per node so we
> > got 16 shards on the same 2 nodes or spread shards over more nodes?
> >
> > I'm using solr 6.4.1 with zookeeper on the same nodes.
> >
> > Here's what I got from sematext profiler
> >
> > 51%
> > Thread.java:745java.lang.Thread#run
> >
> > 42%
> > QueuedThreadPool.java:589
> > org.eclipse.jetty.util.thread.QueuedThreadPool$2#run
> > Collapsed 29 calls (Expand)
> >
> > 43%
> > UpdateRequestHandler.java:97
> > org.apache.solr.handler.UpdateRequestHandler$1#load
> >
> > 30%
> > JsonLoader.java:78org.apache.solr.handler.loader.JsonLoader#load
> >
> > 30%
> > JsonLoader.java:115
> > org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader#load
> >
> > 13%
> > JavabinLoader.java:54org.apache.solr.handler.loader.JavabinLoader#load
> >
> > 9%
> > ThreadPoolExecutor.java:617
> > java.util.concurrent.ThreadPoolExecutor$Worker#run
> >
> > 9%
> > ThreadPoolExecutor.java:1142
> > java.util.concurrent.ThreadPoolExecutor#runWorker
> >
> > 33%
> > ConcurrentMergeScheduler.java:626
> > org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread#run
> >
> > 33%
> > ConcurrentMergeScheduler.java:588
> > org.apache.lucene.index.ConcurrentMergeScheduler#doMerge
> >
> > 33%
> > SolrIndexWriter.java:233org.apache.solr.update.SolrIndexWriter#merge
> >
> > 33%
> > IndexWriter.java:3920org.apache.lucene.index.IndexWriter#merge
> >
> > 33%
> > IndexWriter.java:4343org.apache.lucene.index.IndexWriter#mergeMiddle
> >
> > 20%
> > SegmentMerger.java:101org.apache.lucene.index.SegmentMerger#merge
> >
> > 11%
> > SegmentMerger.java:89org.apache.lucene.index.SegmentMerger#merge
> >
> > 2%
> > SegmentMerger.java:144org.apache.lucene.index.SegmentMerger#merge
> >
> >
> > On Mon, Mar 13, 2017 at 5:12 PM, Erick Erickson <erickerick...@gmail.com
> >
> > wrote:
> >
> >> Note that 70,000 docs/second pretty much guarantees that there are
> >> multiple shards. Lots of shards.
> >>
> >> But since you're using SolrJ, the  very first thing I'd try would be
> >> to comment out the SolrClient.add(doclist) call so you're doing
> >> everything _except_ send the docs to Solr. That'll tell you whether
> >> there's any bottleneck on getting the docs from the system of record.
> >> The fact that you're pegging the CPUs argues that you are feeding Solr
> >> as fast as Solr can go so this is just a sanity check. But it's
> >> simple/fast.
> >>
> >> As far as what on Solr could be the bottleneck, no real way to know
> >> without profiling. But 300+ fields per doc probably just means you're
> >> doing a lot of processing, I'm not particularly hopeful you'll be able
> >> to speed things up without either more shards or simplifying your
> >> schema.
> >>
> >> Best,
> >> Erick
> >>
> >> On Mon, Mar 13, 2017 at 6:58 AM, Mahmoud Almokadem
> >> <prog.mahm...@gmail.com> wrote:
> >> > Hi great community,
> >> >
> >> > I have a SolrCloud with the following configuration:
> >> >
> >> >- 2 nodes (r3.2xlarge 61GB RAM)
> >> >- 4 shards.
> >> >- The producer can produce 13,000+ docs per second
> >> >- The schema contains about 300+ fields and the document size is
> about
> >> >3KB.
> >> >- Using SolrJ and SolrCloudClient, each batch to solr contains 500
> >> docs.
> >> >
> >> > When I start my bulk indexer program the CPU utilization is 100% on
> each
> >> > server but the rate of the indexer is about 1500 docs per second.
> >> >
> >> > I know that some solr benchmarks reached 70,000+ doc. per second.
> >> >
> >> > The question: What is the best way to determine the bottleneck on solr
> >> > indexing rate?
> >> >
> >> > Thanks,
> >> > Mahmoud
> >>
>
>
>
> --
> Regards,
> Shalin Shekhar Mangar.
>


Re: Indexing CPU performance

2017-03-14 Thread Mahmoud Almokadem
Thanks Erick,

I think there is something missing: the rate I'm talking about is for a bulk
upload and one-time indexing, not for on-going indexing.
My dataset is about 250 million documents and I need to index them into Solr.

Thanks Shawn for your clarification,

I think I got stuck on this bug in version 6.4.1. I'll upgrade my cluster and
test again.

Thanks for help
Mahmoud


On Tue, Mar 14, 2017 at 1:20 AM, Shawn Heisey <apa...@elyograg.org> wrote:

> On 3/13/2017 7:58 AM, Mahmoud Almokadem wrote:
> > When I start my bulk indexer program the CPU utilization is 100% on each
> > server but the rate of the indexer is about 1500 docs per second.
> >
> > I know that some solr benchmarks reached 70,000+ doc. per second.
>
> There are *MANY* factors that affect indexing rate.  When you say that
> the CPU utilization is 100 percent, what operating system are you
> running and what tool are you using to see CPU percentage?  Within that
> tool, where are you looking to see that usage level?
>
> On some operating systems with some reporting tools, a server with 8 CPU
> cores can show up to 800 percent CPU usage, so 100 percent utilization
> on the Solr process may not be full utilization of the server's
> resources.  It also might be an indicator of the full system usage, if
> you are looking in the right place.
>
> > The question: What is the best way to determine the bottleneck on solr
> > indexing rate?
>
> I have two likely candidates for you.  The first one is a bug that
> affects Solr 6.4.0 and 6.4.1, which is fixed by 6.4.2.  If you don't
> have one of those two versions, then this is not affecting you:
>
> https://issues.apache.org/jira/browse/SOLR-10130
>
> The other likely bottleneck, which could be a problem whether or not the
> previous bug is present, is single-threaded indexing, so every batch of
> docs must wait for the previous batch to finish before it can begin, and
> only one CPU gets utilized on the server side.  Both Solr and SolrJ are
> fully capable of handling several indexing threads at once, and that is
> really the only way to achieve maximum indexing performance.  If you
> want multi-threaded (parallel) indexing, you must create the threads on
> the client side, or run multiple indexing processes that each handle
> part of the job.  Multi-threaded code is not easy to write correctly.
>
> The fieldTypes and analysis that you have configured in your schema may
> include classes that process very slowly, or may include so many filters
> that the end result is slow performance.  I am not familiar with the
> performance of the classes that Solr includes, so I would not be able to
> look at a schema and tell you which entries are slow.  As Erick
> mentioned, processing for 300+ fields could be one reason for slow
> indexing.
>
> If you are doing a commit operation for every batch, that will slow it
> down even more.  If you have autoSoftCommit configured with a very low
> maxTime or maxDocs value, that can result in extremely frequent commits
> that make indexing much slower.  Although frequent autoCommit is very
> much desirable for good operation (as long as openSearcher set to
> false), commits that open new searchers should be much less frequent.
> The best option is to only commit (with a new searcher) *once* at the
> end of the indexing run.  If automatic soft commits are desired, make
> them happen as infrequently as you can.
>
> https://lucidworks.com/understanding-transaction-
> logs-softcommit-and-commit-in-sorlcloud/
>
> Using CloudSolrClient will make single-threaded indexing fairly
> efficient, by always sending documents to the correct shard leader.  FYI
> -- your 500 document batches are split into smaller batches (which I
> think are only 10 documents) that are directed to correct shard leaders
> by CloudSolrClient.  Indexing with multiple threads becomes even more
> important with these smaller batches.
>
> Note that with SolrJ, you will need to tweak the HttpClient creation, or
> you will likely find that each SolrJ client object can only utilize two
> threads to each Solr server.  The default per-route maximum connection
> limit for HttpClient is 2, with a total connection limit of 20.
>
> This code snippet shows how I create a Solr client that can do many
> threads (300 per route, 5000 total) and also has custom timeout settings:
>
> RequestConfig rc = RequestConfig.custom().setConnectTimeout(15000)
> .setSocketTimeout(Const.SOCKET_TIMEOUT).build();
> httpClient = HttpClients.custom().setDefaultRequestConfig(rc)
> .setMaxConnPerRoute(300).setMaxConnTotal(5000)
> .disableAutomaticRetries().build();
> client = new HttpSolrClient(serverBaseUrl, httpClient);
>
> This is using HttpSolrClient, but CloudSolrClient can be built in a
> similar manner.  I am not yet using the new SolrJ Builder paradigm found
> in 6.x, I should switch my code to that.
>
> Thanks,
> Shawn
>
>
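A minimal sketch of the client-side multi-threading Shawn describes, assuming
SolrJ 6.x (the pre-Builder constructors); the ZooKeeper address, collection name,
batch size and field names are illustrative only:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class ParallelIndexer {
    public static void main(String[] args) throws Exception {
        CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181");
        client.setDefaultCollection("mycollection");

        ExecutorService pool = Executors.newFixedThreadPool(8);   // 8 indexing threads
        for (int batch = 0; batch < 100; batch++) {
            final int batchId = batch;
            pool.submit(() -> {
                List<SolrInputDocument> docs = new ArrayList<>();
                for (int i = 0; i < 500; i++) {                   // 500 docs per batch
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", batchId + "-" + i);
                    doc.addField("title_t", "document " + i);
                    docs.add(doc);
                }
                try {
                    client.add(docs);     // no per-batch commit
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        client.commit();                  // single commit (new searcher) at the end
        client.close();
    }
}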


Re: Indexing CPU performance

2017-03-14 Thread Mahmoud Almokadem
I'm using VisualVM and sematext to monitor my cluster.

Below are screenshots from each of them.

https://drive.google.com/open?id=0BwLcshoSCVcdWHRJeUNyekxWN28

https://drive.google.com/open?id=0BwLcshoSCVcdZzhTRGVjYVJBUzA

https://drive.google.com/open?id=0BwLcshoSCVcdc0dQZGJtMWxDOFk

https://drive.google.com/open?id=0BwLcshoSCVcdR3hJSHRZTjdSZm8

https://drive.google.com/open?id=0BwLcshoSCVcdUzRETDlFeFIxU2M

Thanks,
Mahmoud

On Tue, Mar 14, 2017 at 10:20 AM, Mahmoud Almokadem <prog.mahm...@gmail.com>
wrote:

> Thanks Erick,
>
> I think there are something missing, the rate I'm talking about is for
> bulk upload and one time indexing to on-going indexing.
> My dataset is about 250 million documents and I need to index them to solr.
>
> Thanks Shawn for your clarification,
>
> I think that I got stuck on this version 6.4.1 I'll upgrade my cluster and
> test again.
>
> Thanks for help
> Mahmoud
>
>
> On Tue, Mar 14, 2017 at 1:20 AM, Shawn Heisey <apa...@elyograg.org> wrote:
>
>> On 3/13/2017 7:58 AM, Mahmoud Almokadem wrote:
>> > When I start my bulk indexer program the CPU utilization is 100% on each
>> > server but the rate of the indexer is about 1500 docs per second.
>> >
>> > I know that some solr benchmarks reached 70,000+ doc. per second.
>>
>> There are *MANY* factors that affect indexing rate.  When you say that
>> the CPU utilization is 100 percent, what operating system are you
>> running and what tool are you using to see CPU percentage?  Within that
>> tool, where are you looking to see that usage level?
>>
>> On some operating systems with some reporting tools, a server with 8 CPU
>> cores can show up to 800 percent CPU usage, so 100 percent utilization
>> on the Solr process may not be full utilization of the server's
>> resources.  It also might be an indicator of the full system usage, if
>> you are looking in the right place.
>>
>> > The question: What is the best way to determine the bottleneck on solr
>> > indexing rate?
>>
>> I have two likely candidates for you.  The first one is a bug that
>> affects Solr 6.4.0 and 6.4.1, which is fixed by 6.4.2.  If you don't
>> have one of those two versions, then this is not affecting you:
>>
>> https://issues.apache.org/jira/browse/SOLR-10130
>>
>> The other likely bottleneck, which could be a problem whether or not the
>> previous bug is present, is single-threaded indexing, so every batch of
>> docs must wait for the previous batch to finish before it can begin, and
>> only one CPU gets utilized on the server side.  Both Solr and SolrJ are
>> fully capable of handling several indexing threads at once, and that is
>> really the only way to achieve maximum indexing performance.  If you
>> want multi-threaded (parallel) indexing, you must create the threads on
>> the client side, or run multiple indexing processes that each handle
>> part of the job.  Multi-threaded code is not easy to write correctly.
>>
>> The fieldTypes and analysis that you have configured in your schema may
>> include classes that process very slowly, or may include so many filters
>> that the end result is slow performance.  I am not familiar with the
>> performance of the classes that Solr includes, so I would not be able to
>> look at a schema and tell you which entries are slow.  As Erick
>> mentioned, processing for 300+ fields could be one reason for slow
>> indexing.
>>
>> If you are doing a commit operation for every batch, that will slow it
>> down even more.  If you have autoSoftCommit configured with a very low
>> maxTime or maxDocs value, that can result in extremely frequent commits
>> that make indexing much slower.  Although frequent autoCommit is very
>> much desirable for good operation (as long as openSearcher set to
>> false), commits that open new searchers should be much less frequent.
>> The best option is to only commit (with a new searcher) *once* at the
>> end of the indexing run.  If automatic soft commits are desired, make
>> them happen as infrequently as you can.
>>
>> https://lucidworks.com/understanding-transaction-logs-
>> softcommit-and-commit-in-sorlcloud/
>>
>> Using CloudSolrClient will make single-threaded indexing fairly
>> efficient, by always sending documents to the correct shard leader.  FYI
>> -- your 500 document batches are split into smaller batches (which I
>> think are only 10 documents) that are directed to correct shard leaders
>> by CloudSolrClient.  Indexing with multiple threads becomes even more
>> important with these smaller batches.
>>
>> Note that with

Re: Indexing CPU performance

2017-03-13 Thread Mahmoud Almokadem
Hi Erick,

Thanks for the detailed answer.

The producer can sustain producing at that rate; it's not spikes.

So I can run more clients that write to Solr, even though I already reach maximum
CPU utilization with a single client? Do you think it will increase throughput?

And do you advise me to add more shards on the same two nodes until I get the best
throughput?

 autocommit is 15000 and softcommit is 6

Thanks,
Mahmoud

> On Mar 13, 2017, at 9:28 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> 
> OK, so you can get a 360% speedup by commenting out the solr.add. That
> indicates that, indeed, you're pretty much running Solr flat out, not
> surprising. You _might_ squeeze a little more out of Solr by adding
> more client indexers, but that's not going to drive you to the numbers
> you need. I do have one observation though. You say "...can produce
> 13,000+ docs per second...". Is this sustained or occasional spikes?
> If the latter, can you let Solr fall behind and pick up the extra
> files when producer slows down?
> 
> Second, you'll have to have at least three clients running to even do
> the upstream processing without Solr in the picture at all. IOW, you
> can't gather and generate the Solr documents fast enough with one
> client, much less index them too.
> 
> bq: Regarding more shards, you mean use 2 nodes with 8 shards per node so we
> got 16 shards on the same 2 nodes or spread shards over more nodes?
> 
> Yes ;). Once you have enough shards/replicas on a box that you're
> running all the CPUs flat out, adding more shards won't do you any
> good. And we're just skipping over what that'll do to your ability to
> run queries. Plus, 13,000 docs/second will mount up pretty quickly, so
> you have to do your capacity planning for the projected maximum number
> of docs you'll host on this collection. My bet: If you size your
> cluster appropriately for the eventual total size, your indexing
> throughput will hit your numbers. Unless you have a very short
> retention.
> 
> See: 
> https://lucidworks.com/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
> 
> Best,
> Erick
> 
> On Mon, Mar 13, 2017 at 12:01 PM, Mahmoud Almokadem
> <prog.mahm...@gmail.com> wrote:
>> Thanks Erick,
>> 
>> I've commented out the line SolrClient.add(doclist) and get 5500+ docs per
>> second from single producer.
>> 
>> Regarding more shards, you mean use 2 nodes with 8 shards per node so we
>> got 16 shards on the same 2 nodes or spread shards over more nodes?
>> 
>> I'm using solr 6.4.1 with zookeeper on the same nodes.
>> 
>> Here's what I got from sematext profiler
>> 
>> 51%
>> Thread.java:745java.lang.Thread#run
>> 
>> 42%
>> QueuedThreadPool.java:589
>> org.eclipse.jetty.util.thread.QueuedThreadPool$2#run
>> Collapsed 29 calls (Expand)
>> 
>> 43%
>> UpdateRequestHandler.java:97
>> org.apache.solr.handler.UpdateRequestHandler$1#load
>> 
>> 30%
>> JsonLoader.java:78org.apache.solr.handler.loader.JsonLoader#load
>> 
>> 30%
>> JsonLoader.java:115
>> org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader#load
>> 
>> 13%
>> JavabinLoader.java:54org.apache.solr.handler.loader.JavabinLoader#load
>> 
>> 9%
>> ThreadPoolExecutor.java:617
>> java.util.concurrent.ThreadPoolExecutor$Worker#run
>> 
>> 9%
>> ThreadPoolExecutor.java:1142
>> java.util.concurrent.ThreadPoolExecutor#runWorker
>> 
>> 33%
>> ConcurrentMergeScheduler.java:626
>> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread#run
>> 
>> 33%
>> ConcurrentMergeScheduler.java:588
>> org.apache.lucene.index.ConcurrentMergeScheduler#doMerge
>> 
>> 33%
>> SolrIndexWriter.java:233org.apache.solr.update.SolrIndexWriter#merge
>> 
>> 33%
>> IndexWriter.java:3920org.apache.lucene.index.IndexWriter#merge
>> 
>> 33%
>> IndexWriter.java:4343org.apache.lucene.index.IndexWriter#mergeMiddle
>> 
>> 20%
>> SegmentMerger.java:101org.apache.lucene.index.SegmentMerger#merge
>> 
>> 11%
>> SegmentMerger.java:89org.apache.lucene.index.SegmentMerger#merge
>> 
>> 2%
>> SegmentMerger.java:144org.apache.lucene.index.SegmentMerger#merge
>> 
>> 
>> On Mon, Mar 13, 2017 at 5:12 PM, Erick Erickson <erickerick...@gmail.com>
>> wrote:
>> 
>>> Note that 70,000 docs/second pretty much guarantees that there are
>>> multiple shards. Lots of shards.
>>> 
>>> But since you're using SolrJ, the  very first thing I'd try would be
>>> to comment out the Sol

Re: Indexing CPU performance

2017-03-13 Thread Mahmoud Almokadem
Thanks Erick,

I've commented out the line SolrClient.add(doclist) and got 5500+ docs per
second from a single producer.

Regarding more shards, do you mean using 2 nodes with 8 shards per node, so we
get 16 shards on the same 2 nodes, or spreading shards over more nodes?

I'm using solr 6.4.1 with zookeeper on the same nodes.

Here's what I got from sematext profiler

51%
Thread.java:745java.lang.Thread#run

42%
QueuedThreadPool.java:589
org.eclipse.jetty.util.thread.QueuedThreadPool$2#run
Collapsed 29 calls (Expand)

43%
UpdateRequestHandler.java:97
org.apache.solr.handler.UpdateRequestHandler$1#load

30%
JsonLoader.java:78org.apache.solr.handler.loader.JsonLoader#load

30%
JsonLoader.java:115
org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader#load

13%
JavabinLoader.java:54org.apache.solr.handler.loader.JavabinLoader#load

9%
ThreadPoolExecutor.java:617
java.util.concurrent.ThreadPoolExecutor$Worker#run

9%
ThreadPoolExecutor.java:1142
java.util.concurrent.ThreadPoolExecutor#runWorker

33%
ConcurrentMergeScheduler.java:626
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread#run

33%
ConcurrentMergeScheduler.java:588
org.apache.lucene.index.ConcurrentMergeScheduler#doMerge

33%
SolrIndexWriter.java:233org.apache.solr.update.SolrIndexWriter#merge

33%
IndexWriter.java:3920org.apache.lucene.index.IndexWriter#merge

33%
IndexWriter.java:4343org.apache.lucene.index.IndexWriter#mergeMiddle

20%
SegmentMerger.java:101org.apache.lucene.index.SegmentMerger#merge

11%
SegmentMerger.java:89org.apache.lucene.index.SegmentMerger#merge

2%
SegmentMerger.java:144org.apache.lucene.index.SegmentMerger#merge


On Mon, Mar 13, 2017 at 5:12 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

> Note that 70,000 docs/second pretty much guarantees that there are
> multiple shards. Lots of shards.
>
> But since you're using SolrJ, the  very first thing I'd try would be
> to comment out the SolrClient.add(doclist) call so you're doing
> everything _except_ send the docs to Solr. That'll tell you whether
> there's any bottleneck on getting the docs from the system of record.
> The fact that you're pegging the CPUs argues that you are feeding Solr
> as fast as Solr can go so this is just a sanity check. But it's
> simple/fast.
>
> As far as what on Solr could be the bottleneck, no real way to know
> without profiling. But 300+ fields per doc probably just means you're
> doing a lot of processing, I'm not particularly hopeful you'll be able
> to speed things up without either more shards or simplifying your
> schema.
>
> Best,
> Erick
>
> On Mon, Mar 13, 2017 at 6:58 AM, Mahmoud Almokadem
> <prog.mahm...@gmail.com> wrote:
> > Hi great community,
> >
> > I have a SolrCloud with the following configuration:
> >
> >- 2 nodes (r3.2xlarge 61GB RAM)
> >- 4 shards.
> >- The producer can produce 13,000+ docs per second
> >- The schema contains about 300+ fields and the document size is about
> >3KB.
> >- Using SolrJ and SolrCloudClient, each batch to solr contains 500
> docs.
> >
> > When I start my bulk indexer program the CPU utilization is 100% on each
> > server but the rate of the indexer is about 1500 docs per second.
> >
> > I know that some solr benchmarks reached 70,000+ doc. per second.
> >
> > The question: What is the best way to determine the bottleneck on solr
> > indexing rate?
> >
> > Thanks,
> > Mahmoud
>


Indexing CPU performance

2017-03-13 Thread Mahmoud Almokadem
Hi great community,

I have a SolrCloud with the following configuration:

   - 2 nodes (r3.2xlarge 61GB RAM)
   - 4 shards.
   - The producer can produce 13,000+ docs per second
   - The schema contains about 300+ fields and the document size is about
   3KB.
    - Using SolrJ and CloudSolrClient, each batch sent to Solr contains 500 docs.

When I start my bulk indexer program the CPU utilization is 100% on each
server but the rate of the indexer is about 1500 docs per second.

I know that some solr benchmarks reached 70,000+ doc. per second.

The question: What is the best way to determine the bottleneck on solr
indexing rate?

Thanks,
Mahmoud


Re: Time of insert

2017-02-07 Thread Mahmoud Almokadem
Thanks Alessandro,

I used the DIH as it is, and no atomic updates were invoked by this DIH.

I added this script to my script transformer section and everything worked
properly:

var now = java.time.LocalDateTime.now();

var dtf =
java.time.format.DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss'Z'");

var val = dtf.format(now);

var hash = new java.util.HashMap();

hash.put('add', val);

row.put('time_stamp_log', hash);


The time_stamp_log field now contains the log of the updates on the documents,
and created_date is set only once.

I think hash.put('add', val); is what triggers the atomic update on the documents.

But when I remove this part of the script, the created_date field gets updated
every time.

Thanks for your help.



On Tue, Feb 7, 2017 at 11:30 AM, alessandro.benedetti 
wrote:

> Hi Mahomoud,
> I need to double check but let's assume you use atomic updates and a
> created_data stored with default to NOW.
>
> 1) First time the document is not in the index you will get the default
> NOW.
> 2) second time, using the atomic update you will update only a subset of
> fields you send to Solr.
> Under the hood Solr will fetch the existing Doc, change only few fields and
> send it back to Solr.
> created_date will have the date fetched from the old version of the
> document, so the default will not be used this time.
>
> Have you tried ?
>
> Cheers
>
>
>
> -
> ---
> Alessandro Benedetti
> Search Consultant, R Software Engineer, Director
> Sease Ltd. - www.sease.io
> --
> View this message in context: http://lucene.472066.n3.
> nabble.com/Time-of-insert-tp4319040p4319122.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
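For illustration only (my own sketch, not from the thread): the 'add' modifier the
script in this message relies on, expressed as a SolrJ atomic update. Field names
mirror the example above; any field not mentioned in the update, such as
created_date, keeps its stored value.

import java.util.HashMap;
import java.util.Map;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class AtomicStampExample {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/core1");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");

        Map<String, Object> appendStamp = new HashMap<>();
        appendStamp.put("add", "2017-02-07T10:00:00Z");  // append to the multi-valued log
        doc.addField("time_stamp_log", appendStamp);

        // created_date is deliberately not mentioned, so Solr keeps its original value
        client.add(doc);
        client.commit();
        client.close();
    }
}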


Re: Time of insert

2017-02-06 Thread Mahmoud Almokadem
Thanks Alex for your reply. But the created_date field will be updated
every time the document is inserted into Solr. I want to record the first
time the document was indexed into Solr, and I'm using the DataImport handler.

I tried solr.TimestampUpdateProcessorFactory but I got a
NullPointerException, so I changed it to use a default value for the field in
the schema:

  <field name="created_date" type="date" indexed="true" stored="true" default="NOW"/>

but this field contains the last update time of the document, not the first time
the document was inserted.


Thanks,
Mahmoud

On Tue, Feb 7, 2017 at 12:10 AM, Alexandre Rafalovitch <arafa...@gmail.com>
wrote:

> If you are reindexing full documents, there is no way.
>
> If you are actually doing updates using Solr updates XML/JSON, then
> you can have a created_date field with default value of NOW.
> Similarly, you could probably do something with UpdateRequestProcessor
> chains to get that NOW added somewhere.
>
> Regards,
>Alex.
> 
> http://www.solr-start.com/ - Resources for Solr users, new and experienced
>
>
> On 6 February 2017 at 15:32, Mahmoud Almokadem <prog.mahm...@gmail.com>
> wrote:
> > Hello,
> >
> > I'm using dih on solr 6 for indexing data from sql server. The document
> can
> > br indexed many times according to the updates on it. Is that available
> to
> > get the first time the document inserted to solr?
> >
> > And how to get the dates of the document updated?
> >
> > Thanks for help,
> > Mahmoud
>


Time of insert

2017-02-06 Thread Mahmoud Almokadem
Hello,

I'm using DIH on Solr 6 for indexing data from SQL Server. The document can
be indexed many times depending on the updates made to it. Is there a way to
get the first time the document was inserted into Solr?

And how can I get the dates on which the document was updated?

Thanks for help,
Mahmoud


Solr Kafka DIH

2017-01-30 Thread Mahmoud Almokadem
Hello,

Is there a way to get SolrCloud to pull data from a topic in Kafka periodically
using the DataImport Handler?

Thanks
Mahmoud

Re: Search with the start of field

2016-09-21 Thread Mahmoud Almokadem
Thanks all,

I think SpanFirstQuery will solve my problem. But does Solr 4.8 support
xmlparser so I can use it? And does SpanFirst support phrase search instead of
a single term?

Thanks,
Mahmoud


On Wed, Sep 21, 2016 at 10:45 AM, Markus Jelsma <markus.jel...@openindex.io>
wrote:

> Yes, definately SpanFirstQuery! But i didn't know you can invoke it via
> XMLQueryParser, thank Mikhail for that!
>
> There is a tiny drawback to SpanFirst, there is no gradient boosting
> depending on distance from the beginning.
>
> Markus
>
>
>
> -Original message-
> > From:Mikhail Khludnev <m...@apache.org>
> > Sent: Wednesday 21st September 2016 9:24
> > To: solr-user <solr-user@lucene.apache.org>
> > Subject: Re: Search with the start of field
> >
> > You can experiment with {!xmlparser}.. see
> > https://cwiki.apache.org/confluence/display/solr/Other+
> Parsers#OtherParsers-XMLQueryParser
> >
> > On Wed, Sep 21, 2016 at 9:06 AM, Mahmoud Almokadem <
> prog.mahm...@gmail.com>
> > wrote:
> >
> > > Hello,
> > >
> > > What is the best way to search with the start token of field?
> > >
> > > For example: the field contains these values
> > >
> > > Document1: ABC  DEF GHI
> > > Document2: DEF GHI JKL
> > >
> > > when I search with DEF, I want to get Document2 only. Is that possible?
> > >
> > > Thanks,
> > > Mahmoud
> > >
> > >
> > --
> > Sincerely yours
> > Mikhail Khludnev
>
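A sketch of the {!xmlparser} + SpanFirst idea, assuming SolrJ and a field named
"Title"; the XML Query Parser ships with later Solr releases (5.5+, if I recall
correctly), so Solr 4.8 would need the Lucene XML query parser wired in as a
custom plugin. Wrapping a SpanNear inside SpanFirst covers the phrase case:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class SpanFirstExample {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/core1");
        // Match documents whose Title starts with the phrase "def ghi":
        // SpanNear builds the phrase, SpanFirst keeps it within the first two positions.
        String xml = "<SpanFirst end=\"2\">"
                   +   "<SpanNear slop=\"0\" inOrder=\"true\">"
                   +     "<SpanTerm fieldName=\"Title\">def</SpanTerm>"
                   +     "<SpanTerm fieldName=\"Title\">ghi</SpanTerm>"
                   +   "</SpanNear>"
                   + "</SpanFirst>";
        SolrQuery q = new SolrQuery("{!xmlparser}" + xml);
        QueryResponse rsp = client.query(q);
        System.out.println("hits: " + rsp.getResults().getNumFound());
        client.close();
    }
}

Note that SpanTerm text is not analyzed by this parser, so the terms must match
the indexed tokens (lowercased here on that assumption).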


Search with the start of field

2016-09-21 Thread Mahmoud Almokadem
Hello,

What is the best way to search against the start token of a field?

For example: the field contains these values 

Document1: ABC  DEF GHI
Document2: DEF GHI JKL

when I search with DEF, I want to get Document2 only. Is that possible?

Thanks,
Mahmoud 



insertion time

2016-08-14 Thread Mahmoud Almokadem
Hello, 

We always update the same document many times using DataImportHandler. Can I
add a field for the first time the document was inserted into the index and another
field for the last time the document was updated?


Thanks,
Mahmoud 

Re: Cold replication

2016-07-18 Thread Mahmoud Almokadem
Thanks Erick,

I'll take a look at Solr replication. But I don't know whether it will
support incremental backup or not.

And I want to use SSDs because my index cannot be held in memory. The index
is about 200GB on each instance, the RAM is 61GB, and the update
frequency is high. So I want to use the SSDs that come with the servers instead
of EBS volumes.

Would you explain what you mean by proper warming?

Thanks,
Mahmoud


On Mon, Jul 18, 2016 at 5:46 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

> Have you tried the replication API backup command here?
>
> https://cwiki.apache.org/confluence/display/solr/Index+Replication#IndexReplication-HTTPAPICommandsfortheReplicationHandler
>
> Warning, I haven't worked with this personally in this
> situation so test.
>
> I do have to ask why you think SSDs are required here and
> if you've measured. With proper warming, most of the
> index is held in memory anyway and the source of
> the data (SSD or spinning) is not a huge issue. SSDs
> certainly are better/faster, but have you measured whether
> they are _enough_ faster to be worth the added
> complexity?
>
> Best,
> Erick
>
> Best,
> Erick
>
> On Mon, Jul 18, 2016 at 4:05 AM, Mahmoud Almokadem
> <prog.mahm...@gmail.com> wrote:
> > Hi,
> >
> > We have SolrCloud 6.0 installed on 4 i2.2xlarge instances with 4 shards.
> We store the indices on EBS attached to these instances. Fortunately these
> instances are equipped with TEMPORARY SSDs. We need to the store the
> indices on the SSDs but they are not safe.
> >
> > The index is updated every five minutes.
> >
> > Could we use the SSDs to store the indices and create an incremental
> backup or cold replication on the EBS? So we use EBS only for storing
> indices not serving the data to the solr.
> >
> > Incase of losing the data on SSDs we can restore a backup from the EBS.
> Is it possible?
> >
> > Thanks,
> > Mahmoud
> >
> >
>
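A sketch of the replication handler's backup command Erick points to, assuming
SolrJ, a core named "core1" and an EBS mount point; the command takes a snapshot
of the whole index rather than an incremental copy, so it would need to be
scheduled per core:

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.common.params.ModifiableSolrParams;

public class ReplicationBackup {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/core1");
        ModifiableSolrParams params = new ModifiableSolrParams();
        params.set("command", "backup");
        params.set("location", "/mnt/ebs/solr-backups");  // assumed EBS mount point
        params.set("name", "nightly");                    // assumed snapshot name
        QueryRequest request = new QueryRequest(params);
        request.setPath("/replication");
        System.out.println(client.request(request));
        client.close();
    }
}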


Cold replication

2016-07-18 Thread Mahmoud Almokadem
Hi, 

We have SolrCloud 6.0 installed on 4 i2.2xlarge instances with 4 shards. We
store the indices on EBS volumes attached to these instances. Fortunately these
instances are equipped with TEMPORARY SSDs. We need to store the indices on
the SSDs, but they are not safe.

The index is updated every five minutes.

Could we use the SSDs to store the indices and create an incremental backup or
cold replica on EBS? That way we would use EBS only for storing the indices, not
for serving data to Solr.

In case we lose the data on the SSDs, we could restore a backup from EBS. Is that
possible?

Thanks, 
Mahmoud 




DIH Schedule Solr 6

2016-04-21 Thread Mahmoud Almokadem
Hello, 

We have a cluster of Solr 4.8.1 installed on the Tomcat servlet container, and we're
able to use the DIH Scheduler by adding these lines to web.xml in the installation
directory:

  <listener>
    <listener-class>
      org.apache.solr.handler.dataimport.scheduler.ApplicationListener
    </listener-class>
  </listener>

Now we are planning to migrate to Solr 6, and we have already installed it as a service.
The question is: how do we install the DIH Scheduler on Solr 6 when it runs as a service?

Thanks,
Mahmoud 
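Not an answer from this thread: the DIH scheduler is a third-party add-on that
hooks into the webapp's web.xml, so with the Solr 6 service install it may be
simpler to drive /dataimport from an external scheduler instead. The core name,
command and interval below are assumptions.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.common.params.ModifiableSolrParams;

public class DihScheduler {
    public static void main(String[] args) {
        HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/core1");
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            try {
                ModifiableSolrParams params = new ModifiableSolrParams();
                params.set("command", "delta-import");
                params.set("commit", "true");
                QueryRequest request = new QueryRequest(params);
                request.setPath("/dataimport");
                client.request(request);       // fire-and-forget; DIH reports status itself
            } catch (Exception e) {
                e.printStackTrace();
            }
        }, 0, 5, TimeUnit.MINUTES);            // assumed five-minute interval
    }
}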



Searching and sorting using field aliasing

2015-12-02 Thread Mahmoud Almokadem
Hi all, 

I have two cores (core1, core2). core1 contains the fields (f1, f2, f3, date1) and
core2 contains the fields (f2, f3, f4, date2).
I want to search on the two cores by the date field. Is there an alias that can
query the two fields in a distributed search?

For example, q=dateField:NOW should perform the search on date1 and date2. And I want
to sort on dateField, which should sort by date1 and date2.

Regards,
Mahmoud 

Re: Arabic analyser

2015-11-11 Thread Mahmoud Almokadem
Thanks Alex,

So BasisTech works with the latest version of Solr?

Sincerely,
Mahmoud

On Tue, Nov 10, 2015 at 5:28 PM, Alexandre Rafalovitch <arafa...@gmail.com>
wrote:

> If this is for a significant project and you are ready to pay for it,
> BasisTech has commercial solutions in this area I believe.
>
> Regards,
>Alex.
> 
> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> http://www.solr-start.com/
>
>
> On 10 November 2015 at 08:46, Mahmoud Almokadem <prog.mahm...@gmail.com>
> wrote:
> > Thanks Pual,
> >
> > Arabic analyser applying filters of normalisation and stemming only for
> > single terms out of standard tokenzier.
> > Gathering all synonyms will be hard work. Should I customise my Tokenizer
> > to handle this case?
> >
> > Sincerely,
> > Mahmoud
> >
> >
> > On Tue, Nov 10, 2015 at 3:06 PM, Paul Libbrecht <p...@hoplahup.net>
> wrote:
> >
> >> Mahmoud,
> >>
> >> there is an arabic analyzer:
> >>   https://wiki.apache.org/solr/LanguageAnalysis#Arabic
> >> doesn't it do what you describe?
> >> Synonyms probably work there too.
> >>
> >> Paul
> >>
> >> > Mahmoud Almokadem <mailto:prog.mahm...@gmail.com>
> >> > 9 novembre 2015 17:47
> >> > Thanks Jack,
> >> >
> >> > This is a good solution, but we have more combinations that I think
> >> > can’t be handled as synonyms like every word starts with ‘عبد’ ‘Abd’
> >> > and ‘أبو’ ‘Abo’. When using Standard tokenizer on ‘أبو بكر’ ‘Abo
> >> > Bakr’, It’ll be tokenised to ‘أبو’ and ‘بكر’ and the filters will be
> >> > applied for each separate term.
> >> >
> >> > Is there available tokeniser to tokenise ‘أبو *’ or ‘عبد *' as a
> >> > single term?
> >> >
> >> > Thanks,
> >> > Mahmoud
> >> >
> >> >
> >> >
> >> > Jack Krupansky <mailto:jack.krupan...@gmail.com>
> >> > 9 novembre 2015 16:47
> >> > Use an index-time (but not query time) synonym filter with a rule
> like:
> >> >
> >> > Abd Allah,Abdallah
> >> >
> >> > This will index the combined word in addition to the separate words.
> >> >
> >> > -- Jack Krupansky
> >> >
> >> > On Mon, Nov 9, 2015 at 4:48 AM, Mahmoud Almokadem <
> >> prog.mahm...@gmail.com>
> >> >
> >> > Mahmoud Almokadem <mailto:prog.mahm...@gmail.com>
> >> > 9 novembre 2015 10:48
> >> > Hello,
> >> >
> >> > We are indexing Arabic content and facing a problem for tokenizing
> multi
> >> > terms phrases like 'عبد الله' 'Abd Allah', so users will search for
> >> > 'عبدالله' 'Abdallah' without space and need to get the results of 'عبد
> >> > الله' with space. We are using StandardTokenizer.
> >> >
> >> >
> >> > Is there any configurations to handle this case?
> >> >
> >> > Thank you,
> >> > Mahmoud
> >> >
> >>
> >>
>


Re: Arabic analyser

2015-11-11 Thread Mahmoud Almokadem
Thank you very much David, It's wonderful and I will try it.

On Wed, Nov 11, 2015 at 1:37 PM, David Murgatroyd <dmu...@gmail.com> wrote:

> >So BasisTech works for the latest version of solr?
>
> Yes, our latest Arabic analyzer supports up through 5.3.x. But since the
> examples you give are names, it sounds like you might instead/also want our
> fuzzy name matcher which will find "عبد الله" not only with "عبدالله" but
> also with typos like "عبالله" or even translations into 'English' like
> "abdollah". You can visit http://www.basistech.com/solutions/search/solr/
> and fill out the form there to learn more (mentioning this thread). See
> also http://www.slideshare.net/dmurga/simple-fuzzy-name-matching-in-solr
> for a talk I gave at the San Francisco Solr Meet-up in April on how it
> plugs in to Solr by creating a special field type you can query just like
> any other; this was also presented at Lucene/Solr Revolution last month (
> http://lucenerevolution.org/sessions/simple-fuzzy-name-matching-in-solr/).
>
> Best,
> David Murgatroyd
> (VP, Engineering, Basis Technology)
>
> On Wed, Nov 11, 2015 at 4:31 AM, Mahmoud Almokadem <prog.mahm...@gmail.com
> >
> wrote:
>
> > Thank Alex,
> >
> > So BasisTech works for the latest version of solr?
> >
> > Sincerely,
> > Mahmoud
> >
> > On Tue, Nov 10, 2015 at 5:28 PM, Alexandre Rafalovitch <
> arafa...@gmail.com
> > >
> > wrote:
> >
> > > If this is for a significant project and you are ready to pay for it,
> > > BasisTech has commercial solutions in this area I believe.
> > >
> > > Regards,
> > >Alex.
> > > 
> > > Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> > > http://www.solr-start.com/
> > >
> > >
> > > On 10 November 2015 at 08:46, Mahmoud Almokadem <
> prog.mahm...@gmail.com>
> > > wrote:
> > > > Thanks Pual,
> > > >
> > > > Arabic analyser applying filters of normalisation and stemming only
> for
> > > > single terms out of standard tokenzier.
> > > > Gathering all synonyms will be hard work. Should I customise my
> > Tokenizer
> > > > to handle this case?
> > > >
> > > > Sincerely,
> > > > Mahmoud
> > > >
> > > >
> > > > On Tue, Nov 10, 2015 at 3:06 PM, Paul Libbrecht <p...@hoplahup.net>
> > > wrote:
> > > >
> > > >> Mahmoud,
> > > >>
> > > >> there is an arabic analyzer:
> > > >>   https://wiki.apache.org/solr/LanguageAnalysis#Arabic
> > > >> doesn't it do what you describe?
> > > >> Synonyms probably work there too.
> > > >>
> > > >> Paul
> > > >>
> > > >> > Mahmoud Almokadem <mailto:prog.mahm...@gmail.com>
> > > >> > 9 novembre 2015 17:47
> > > >> > Thanks Jack,
> > > >> >
> > > >> > This is a good solution, but we have more combinations that I
> think
> > > >> > can’t be handled as synonyms like every word starts with ‘عبد’
> ‘Abd’
> > > >> > and ‘أبو’ ‘Abo’. When using Standard tokenizer on ‘أبو بكر’ ‘Abo
> > > >> > Bakr’, It’ll be tokenised to ‘أبو’ and ‘بكر’ and the filters will
> be
> > > >> > applied for each separate term.
> > > >> >
> > > >> > Is there available tokeniser to tokenise ‘أبو *’ or ‘عبد *' as a
> > > >> > single term?
> > > >> >
> > > >> > Thanks,
> > > >> > Mahmoud
> > > >> >
> > > >> >
> > > >> >
> > > >> > Jack Krupansky <mailto:jack.krupan...@gmail.com>
> > > >> > 9 novembre 2015 16:47
> > > >> > Use an index-time (but not query time) synonym filter with a rule
> > > like:
> > > >> >
> > > >> > Abd Allah,Abdallah
> > > >> >
> > > >> > This will index the combined word in addition to the separate
> words.
> > > >> >
> > > >> > -- Jack Krupansky
> > > >> >
> > > >> > On Mon, Nov 9, 2015 at 4:48 AM, Mahmoud Almokadem <
> > > >> prog.mahm...@gmail.com>
> > > >> >
> > > >> > Mahmoud Almokadem <mailto:prog.mahm...@gmail.com>
> > > >> > 9 novembre 2015 10:48
> > > >> > Hello,
> > > >> >
> > > >> > We are indexing Arabic content and facing a problem for tokenizing
> > > multi
> > > >> > terms phrases like 'عبد الله' 'Abd Allah', so users will search
> for
> > > >> > 'عبدالله' 'Abdallah' without space and need to get the results of
> > 'عبد
> > > >> > الله' with space. We are using StandardTokenizer.
> > > >> >
> > > >> >
> > > >> > Is there any configurations to handle this case?
> > > >> >
> > > >> > Thank you,
> > > >> > Mahmoud
> > > >> >
> > > >>
> > > >>
> > >
> >
>


Re: Arabic analyser

2015-11-10 Thread Mahmoud Almokadem
Thanks Paul,

The Arabic analyser applies normalisation and stemming filters only to the
single terms coming out of the standard tokenizer.
Gathering all the synonyms would be hard work. Should I customise my Tokenizer
to handle this case?

Sincerely,
Mahmoud


On Tue, Nov 10, 2015 at 3:06 PM, Paul Libbrecht <p...@hoplahup.net> wrote:

> Mahmoud,
>
> there is an arabic analyzer:
>   https://wiki.apache.org/solr/LanguageAnalysis#Arabic
> doesn't it do what you describe?
> Synonyms probably work there too.
>
> Paul
>
> > Mahmoud Almokadem <mailto:prog.mahm...@gmail.com>
> > 9 novembre 2015 17:47
> > Thanks Jack,
> >
> > This is a good solution, but we have more combinations that I think
> > can’t be handled as synonyms like every word starts with ‘عبد’ ‘Abd’
> > and ‘أبو’ ‘Abo’. When using Standard tokenizer on ‘أبو بكر’ ‘Abo
> > Bakr’, It’ll be tokenised to ‘أبو’ and ‘بكر’ and the filters will be
> > applied for each separate term.
> >
> > Is there available tokeniser to tokenise ‘أبو *’ or ‘عبد *' as a
> > single term?
> >
> > Thanks,
> > Mahmoud
> >
> >
> >
> > Jack Krupansky <mailto:jack.krupan...@gmail.com>
> > 9 novembre 2015 16:47
> > Use an index-time (but not query time) synonym filter with a rule like:
> >
> > Abd Allah,Abdallah
> >
> > This will index the combined word in addition to the separate words.
> >
> > -- Jack Krupansky
> >
> > On Mon, Nov 9, 2015 at 4:48 AM, Mahmoud Almokadem <
> prog.mahm...@gmail.com>
> >
> > Mahmoud Almokadem <mailto:prog.mahm...@gmail.com>
> > 9 novembre 2015 10:48
> > Hello,
> >
> > We are indexing Arabic content and facing a problem for tokenizing multi
> > terms phrases like 'عبد الله' 'Abd Allah', so users will search for
> > 'عبدالله' 'Abdallah' without space and need to get the results of 'عبد
> > الله' with space. We are using StandardTokenizer.
> >
> >
> > Is there any configurations to handle this case?
> >
> > Thank you,
> > Mahmoud
> >
>
>


Re: Arabic analyser

2015-11-09 Thread Mahmoud Almokadem
Thanks Jack, 

This is a good solution, but we have more combinations that I think can't be
handled as synonyms, such as every word that starts with ‘عبد’ ‘Abd’ or ‘أبو’ ‘Abo’.
When using the standard tokenizer on ‘أبو بكر’ ‘Abo Bakr’, it'll be tokenised to
‘أبو’ and ‘بكر’, and the filters will be applied to each term separately.

Is there an available tokeniser that can tokenise ‘أبو *’ or ‘عبد *’ as a single term?

Thanks,
Mahmoud 


> On Nov 9, 2015, at 5:47 PM, Jack Krupansky <jack.krupan...@gmail.com> wrote:
> 
> Use an index-time (but not query time) synonym filter with a rule like:
> 
> Abd Allah,Abdallah
> 
> This will index the combined word in addition to the separate words.
> 
> -- Jack Krupansky
> 
> On Mon, Nov 9, 2015 at 4:48 AM, Mahmoud Almokadem <prog.mahm...@gmail.com>
> wrote:
> 
>> Hello,
>> 
>> We are indexing Arabic content and facing a problem for tokenizing multi
>> terms phrases like 'عبد الله' 'Abd Allah', so users will search for
>> 'عبدالله' 'Abdallah' without space and need to get the results of 'عبد
>> الله' with space. We are using StandardTokenizer.
>> 
>> 
>> Is there any configurations to handle this case?
>> 
>> Thank you,
>> Mahmoud
>> 
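
One way to get this effect without writing a custom tokenizer is a character
filter that joins 'عبد' or 'أبو' to the word that follows it before
tokenization, at both index and query time. A minimal sketch (the field type
name and the exact pattern are illustrative, not taken from this thread):

    <fieldType name="text_ar_joined" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <!-- Remove the space after عبد / أبو so that "عبد الله" and "عبدالله"
             produce the same token stream -->
        <charFilter class="solr.PatternReplaceCharFilterFactory"
                    pattern="(عبد|أبو)\s+" replacement="$1"/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.ArabicNormalizationFilterFactory"/>
        <filter class="solr.ArabicStemFilterFactory"/>
      </analyzer>
    </fieldType>

Because the same analyzer runs at index and query time, a search for 'عبد الله'
and a search for 'عبدالله' should both match documents indexed either way.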



Arabic analyser

2015-11-09 Thread Mahmoud Almokadem
Hello,

We are indexing Arabic content and facing a problem tokenizing multi-term
phrases like 'عبد الله' 'Abd Allah': users will search for 'عبدالله' 'Abdallah'
without the space and need to get the results for 'عبد الله' with the space.
We are using StandardTokenizer.


Is there any configuration to handle this case?

Thank you,
Mahmoud
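
Jack's index-time-only synonym suggestion in the replies above could be wired
into the schema roughly like this (the field type name and the synonyms file
name are illustrative; the file would hold rules such as عبد الله,عبدالله):

    <fieldType name="text_ar_syn" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <!-- index time only: adds the joined form alongside the separate words -->
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms_ar.txt"
                ignoreCase="true" expand="true"/>
        <filter class="solr.ArabicNormalizationFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.ArabicNormalizationFilterFactory"/>
      </analyzer>
    </fieldType>

As noted elsewhere in the thread, this only covers combinations listed
explicitly, which is why the 'عبد' / 'أبو' prefix cases were raised separately.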


Re: Invalid parsing with solr edismax operators

2015-11-05 Thread Mahmoud Almokadem
Thanks Jack. I have reported it as a bug on JIRA 

https://issues.apache.org/jira/browse/SOLR-8237 

Mahmoud 

> On Nov 4, 2015, at 5:30 PM, Jack Krupansky <jack.krupan...@gmail.com> wrote:
> 
> I think you should go ahead and file a Jira ticket for this as a bug since
> either it is an actual bug or some behavior nuance that needs to be
> documented better.
> 
> -- Jack Krupansky
> 
> On Wed, Nov 4, 2015 at 8:24 AM, Mahmoud Almokadem <prog.mahm...@gmail.com>
> wrote:
> 
>> I removed the q.op=“AND” and add the mm=2
>> when searching for (public libraries) I got 19 with
>> "parsedquery_toString": "+(((Title:public^200.0 | TotalField:public^0.1)
>> (Title:libraries^200.0 | TotalField:libraries^0.1))~2)",
>> 
>> and when adding + and searching for +(public libraries) I got 1189 with
>> "parsedquery_toString": "+(+((Title:public^200.0 | TotalField:public^0.1)
>> (Title:libraries^200.0 | TotalField:libraries^0.1)))",
>> 
>> 
>> I think when adding + before parentheses I got all terms mandatory despite
>> the value of mm=2 in the two cases.
>> 
>> Mahmoud
>> 
>> 
>> 
>>> On Nov 4, 2015, at 3:04 PM, Alessandro Benedetti <abenede...@apache.org>
>> wrote:
>>> 
>>> Here we go :
>>> 
>>> Title^200 TotalField^1
>>> 
>>> + Jack explanation and you have the parsed query explained !
>>> 
>>> Cheers
>>> 
>>> On 4 November 2015 at 12:56, Mahmoud Almokadem <prog.mahm...@gmail.com>
>>> wrote:
>>> 
>>>> Thank you Alessandro for your reply.
>>>> 
>>>> Here is the request handler
>>>> 
>>>> 
>>>> 
>>>> 
>>>>explicit
>>>>  10
>>>>  TotalField
>>>> AND
>>>> edismax
>>>> Title^200 TotalField^1
>>>> 
>>>>
>>>> 
>>>> 
>>>> 
>>>> 
>>>> Mahmoud
>>>> 
>>>> 
>>>>> On Nov 4, 2015, at 2:43 PM, Alessandro Benedetti <
>> abenede...@apache.org>
>>>> wrote:
>>>>> 
>>>>> Hi Mahmoud,
>>>>> can you send us the solrconfig.xml snippet of your request handler
>>>> please ?
>>>>> 
>>>>> It's kinda strange you get a boost factor for the Title field and that
>>>>> parsing query, according to your config.
>>>>> 
>>>>> Cheers
>>>>> 
>>>>> On 4 November 2015 at 08:39, Mahmoud Almokadem <prog.mahm...@gmail.com
>>> 
>>>>> wrote:
>>>>> 
>>>>>> Hello,
>>>>>> 
>>>>>> I'm using solr 4.8.1. Using edismax as the parser we got the
>> undesirable
>>>>>> parsed queries and results. The following is two different cases with
>>>>>> strange behavior: Searching with these parameters
>>>>>> 
>>>>>> "mm":"2",
>>>>>> "df":"TotalField",
>>>>>> "debug":"true",
>>>>>> "indent":"true",
>>>>>> "fl":"Title",
>>>>>> "start":"0",
>>>>>> "q.op":"AND",
>>>>>> "fq":"",
>>>>>> "rows":"10",
>>>>>> "wt":"json"
>>>>>> and the query is
>>>>>> 
>>>>>> "q":"+(public libraries)",
>>>>>> Retrieve 502 documents with these parsed query
>>>>>> 
>>>>>> "rawquerystring":"+(public libraries)",
>>>>>> "querystring":"+(public libraries)",
>>>>>> "parsedquery":"(+(+(DisjunctionMaxQuery((Title:public^200.0 |
>>>>>> TotalField:public^0.1)) DisjunctionMaxQuery((Title:libraries^200.0 |
>>>>>> TotalField:libraries^0.1)/no_coord",
>>>>>> "parsedquery_toString":"+(+((Title:public^200.0 |
>> TotalField:public^0.1)
>>>>>> (Title:libraries^200.0 | TotalField:libraries^0.1)))"
>>>>>> and if the query is
>>>>>> 
>>>>>> "q":" (public libraries) "
>>>>>> then it retrieves 8 documents with these parsed query
>>>>>> 
>>>>>> "rawquerystring":" (public libraries) ",
>>>>>> "querystring":" (public libraries) ",
>>>>>> "parsedquery":"(+((DisjunctionMaxQuery((Title:public^200.0 |
>>>>>> TotalField:public^0.1)) DisjunctionMaxQuery((Title:libraries^200.0 |
>>>>>> TotalField:libraries^0.1)))~2))/no_coord",
>>>>>> "parsedquery_toString":"+(((Title:public^200.0 |
>> TotalField:public^0.1)
>>>>>> (Title:libraries^200.0 | TotalField:libraries^0.1))~2)"
>>>>>> So the results of adding "+" to get all tokens before the parenthesis
>>>>>> retrieve more results than removing it.
>>>>>> 
>>>>>> Is this a bug on this version or there are something missing?
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> --
>>>>> 
>>>>> Benedetti Alessandro
>>>>> Visiting card : http://about.me/alessandro_benedetti
>>>>> 
>>>>> "Tyger, tyger burning bright
>>>>> In the forests of the night,
>>>>> What immortal hand or eye
>>>>> Could frame thy fearful symmetry?"
>>>>> 
>>>>> William Blake - Songs of Experience -1794 England
>>>> 
>>>> 
>>> 
>>> 
>>> --
>>> --
>>> 
>>> Benedetti Alessandro
>>> Visiting card : http://about.me/alessandro_benedetti
>>> 
>>> "Tyger, tyger burning bright
>>> In the forests of the night,
>>> What immortal hand or eye
>>> Could frame thy fearful symmetry?"
>>> 
>>> William Blake - Songs of Experience -1794 England
>> 
>> 



Re: Invalid parsing with solr edismax operators

2015-11-04 Thread Mahmoud Almokadem
Thank you Alessandro for your reply. 

Here is the request handler 




  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="df">TotalField</str>
    <str name="q.op">AND</str>
    <str name="defType">edismax</str>
    <str name="qf">Title^200 TotalField^1</str>
  </lst>




Mahmoud


> On Nov 4, 2015, at 2:43 PM, Alessandro Benedetti <abenede...@apache.org> 
> wrote:
> 
> Hi Mahmoud,
> can you send us the solrconfig.xml snippet of your request handler please ?
> 
> It's kinda strange you get a boost factor for the Title field and that
> parsing query, according to your config.
> 
> Cheers
> 
> On 4 November 2015 at 08:39, Mahmoud Almokadem <prog.mahm...@gmail.com>
> wrote:
> 
>> Hello,
>> 
>> I'm using solr 4.8.1. Using edismax as the parser we got the undesirable
>> parsed queries and results. The following is two different cases with
>> strange behavior: Searching with these parameters
>> 
>>  "mm":"2",
>>  "df":"TotalField",
>>  "debug":"true",
>>  "indent":"true",
>>  "fl":"Title",
>>  "start":"0",
>>  "q.op":"AND",
>>  "fq":"",
>>  "rows":"10",
>>  "wt":"json"
>> and the query is
>> 
>> "q":"+(public libraries)",
>> Retrieve 502 documents with these parsed query
>> 
>> "rawquerystring":"+(public libraries)",
>> "querystring":"+(public libraries)",
>> "parsedquery":"(+(+(DisjunctionMaxQuery((Title:public^200.0 |
>> TotalField:public^0.1)) DisjunctionMaxQuery((Title:libraries^200.0 |
>> TotalField:libraries^0.1)/no_coord",
>> "parsedquery_toString":"+(+((Title:public^200.0 | TotalField:public^0.1)
>> (Title:libraries^200.0 | TotalField:libraries^0.1)))"
>> and if the query is
>> 
>> "q":" (public libraries) "
>> then it retrieves 8 documents with these parsed query
>> 
>> "rawquerystring":" (public libraries) ",
>> "querystring":" (public libraries) ",
>> "parsedquery":"(+((DisjunctionMaxQuery((Title:public^200.0 |
>> TotalField:public^0.1)) DisjunctionMaxQuery((Title:libraries^200.0 |
>> TotalField:libraries^0.1)))~2))/no_coord",
>> "parsedquery_toString":"+(((Title:public^200.0 | TotalField:public^0.1)
>> (Title:libraries^200.0 | TotalField:libraries^0.1))~2)"
>> So the results of adding "+" to get all tokens before the parenthesis
>> retrieve more results than removing it.
>> 
>> Is this a bug on this version or there are something missing?
> 
> 
> 
> 
> -- 
> --
> 
> Benedetti Alessandro
> Visiting card : http://about.me/alessandro_benedetti
> 
> "Tyger, tyger burning bright
> In the forests of the night,
> What immortal hand or eye
> Could frame thy fearful symmetry?"
> 
> William Blake - Songs of Experience -1794 England



Re: Invalid parsing with solr edismax operators

2015-11-04 Thread Mahmoud Almokadem
I removed the q.op=“AND” and added mm=2.
When searching for (public libraries) I got 19 results, with "parsedquery_toString": 
"+(((Title:public^200.0 | TotalField:public^0.1) (Title:libraries^200.0 | 
TotalField:libraries^0.1))~2)",

and when adding + and searching for +(public libraries) I got 1189 results, with
"parsedquery_toString": "+(+((Title:public^200.0 | TotalField:public^0.1) 
(Title:libraries^200.0 | TotalField:libraries^0.1)))",


I think that when adding + before the parentheses, I expected all terms to become 
mandatory, but the mm=2 value seems to be ignored in both cases.

Mahmoud



> On Nov 4, 2015, at 3:04 PM, Alessandro Benedetti <abenede...@apache.org> 
> wrote:
> 
> Here we go :
> 
> Title^200 TotalField^1
> 
> + Jack explanation and you have the parsed query explained !
> 
> Cheers
> 
> On 4 November 2015 at 12:56, Mahmoud Almokadem <prog.mahm...@gmail.com>
> wrote:
> 
>> Thank you Alessandro for your reply.
>> 
>> Here is the request handler
>> 
>> 
>> 
>> 
>> explicit
>>   10
>>   TotalField
>>  AND
>>  edismax
>>  Title^200 TotalField^1
>> 
>> 
>> 
>> 
>> 
>> 
>> Mahmoud
>> 
>> 
>>> On Nov 4, 2015, at 2:43 PM, Alessandro Benedetti <abenede...@apache.org>
>> wrote:
>>> 
>>> Hi Mahmoud,
>>> can you send us the solrconfig.xml snippet of your request handler
>> please ?
>>> 
>>> It's kinda strange you get a boost factor for the Title field and that
>>> parsing query, according to your config.
>>> 
>>> Cheers
>>> 
>>> On 4 November 2015 at 08:39, Mahmoud Almokadem <prog.mahm...@gmail.com>
>>> wrote:
>>> 
>>>> Hello,
>>>> 
>>>> I'm using solr 4.8.1. Using edismax as the parser we got the undesirable
>>>> parsed queries and results. The following is two different cases with
>>>> strange behavior: Searching with these parameters
>>>> 
>>>> "mm":"2",
>>>> "df":"TotalField",
>>>> "debug":"true",
>>>> "indent":"true",
>>>> "fl":"Title",
>>>> "start":"0",
>>>> "q.op":"AND",
>>>> "fq":"",
>>>> "rows":"10",
>>>> "wt":"json"
>>>> and the query is
>>>> 
>>>> "q":"+(public libraries)",
>>>> Retrieve 502 documents with these parsed query
>>>> 
>>>> "rawquerystring":"+(public libraries)",
>>>> "querystring":"+(public libraries)",
>>>> "parsedquery":"(+(+(DisjunctionMaxQuery((Title:public^200.0 |
>>>> TotalField:public^0.1)) DisjunctionMaxQuery((Title:libraries^200.0 |
>>>> TotalField:libraries^0.1)/no_coord",
>>>> "parsedquery_toString":"+(+((Title:public^200.0 | TotalField:public^0.1)
>>>> (Title:libraries^200.0 | TotalField:libraries^0.1)))"
>>>> and if the query is
>>>> 
>>>> "q":" (public libraries) "
>>>> then it retrieves 8 documents with these parsed query
>>>> 
>>>> "rawquerystring":" (public libraries) ",
>>>> "querystring":" (public libraries) ",
>>>> "parsedquery":"(+((DisjunctionMaxQuery((Title:public^200.0 |
>>>> TotalField:public^0.1)) DisjunctionMaxQuery((Title:libraries^200.0 |
>>>> TotalField:libraries^0.1)))~2))/no_coord",
>>>> "parsedquery_toString":"+(((Title:public^200.0 | TotalField:public^0.1)
>>>> (Title:libraries^200.0 | TotalField:libraries^0.1))~2)"
>>>> So the results of adding "+" to get all tokens before the parenthesis
>>>> retrieve more results than removing it.
>>>> 
>>>> Is this a bug on this version or there are something missing?
>>> 
>>> 
>>> 
>>> 
>>> --
>>> --
>>> 
>>> Benedetti Alessandro
>>> Visiting card : http://about.me/alessandro_benedetti
>>> 
>>> "Tyger, tyger burning bright
>>> In the forests of the night,
>>> What immortal hand or eye
>>> Could frame thy fearful symmetry?"
>>> 
>>> William Blake - Songs of Experience -1794 England
>> 
>> 
> 
> 
> -- 
> --
> 
> Benedetti Alessandro
> Visiting card : http://about.me/alessandro_benedetti
> 
> "Tyger, tyger burning bright
> In the forests of the night,
> What immortal hand or eye
> Could frame thy fearful symmetry?"
> 
> William Blake - Songs of Experience -1794 England



Invalid parsing with solr edismax operators

2015-11-04 Thread Mahmoud Almokadem
Hello, 

I'm using Solr 4.8.1. Using edismax as the parser, we get undesirable parsed 
queries and results. The following are two different cases with strange 
behavior. Searching with these parameters

  "mm":"2",
  "df":"TotalField",
  "debug":"true",
  "indent":"true",
  "fl":"Title",
  "start":"0",
  "q.op":"AND",
  "fq":"",
  "rows":"10",
  "wt":"json" 
and the query is

"q":"+(public libraries)",
Retrieves 502 documents with this parsed query

"rawquerystring":"+(public libraries)",
"querystring":"+(public libraries)",
"parsedquery":"(+(+(DisjunctionMaxQuery((Title:public^200.0 | 
TotalField:public^0.1)) DisjunctionMaxQuery((Title:libraries^200.0 | 
TotalField:libraries^0.1)/no_coord",
"parsedquery_toString":"+(+((Title:public^200.0 | TotalField:public^0.1) 
(Title:libraries^200.0 | TotalField:libraries^0.1)))"
and if the query is

"q":" (public libraries) "
then it retrieves 8 documents with this parsed query

"rawquerystring":" (public libraries) ",
"querystring":" (public libraries) ",
"parsedquery":"(+((DisjunctionMaxQuery((Title:public^200.0 | 
TotalField:public^0.1)) DisjunctionMaxQuery((Title:libraries^200.0 | 
TotalField:libraries^0.1)))~2))/no_coord",
"parsedquery_toString":"+(((Title:public^200.0 | TotalField:public^0.1) 
(Title:libraries^200.0 | TotalField:libraries^0.1))~2)"
So adding "+" before the parentheses, which I expected to require all tokens, 
retrieves more results than removing it.

Is this a bug in this version, or is there something missing?
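
As the older edismax threads further down in this archive explain, the + on the
group does not distribute to the terms inside the parentheses. A hedged
workaround for making every term mandatory is to mark each term explicitly (or
to drop the group + and rely on mm); for example, with the same qf as in this
thread:

    q=+public +libraries
    defType=edismax
    qf=Title^200 TotalField^1
    debugQuery=true

This should parse with both clauses required, independent of the mm value.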

Invalid parsing with solr edismax operators

2015-11-01 Thread Mahmoud Almokadem
Hello,

I'm using Solr 4.8.1. Using edismax as the parser, we get undesirable
parsed queries and results. The following are two different cases with
strange behavior. Searching with these parameters

  "mm":"2",
  "df":"TotalField",
  "debug":"true",
  "indent":"true",
  "fl":"Title",
  "start":"0",
  "q.op":"AND",
  "fq":"",
  "rows":"10",
  "wt":"json"

and the query is

"q":"+(public libraries)",

Retrieves 502 documents with this parsed query

"rawquerystring":"+(public libraries)",
"querystring":"+(public libraries)",
"parsedquery":"(+(+(DisjunctionMaxQuery((Title:public^200.0 |
TotalField:public^0.1)) DisjunctionMaxQuery((Title:libraries^200.0 |
TotalField:libraries^0.1)/no_coord",
"parsedquery_toString":"+(+((Title:public^200.0 |
TotalField:public^0.1) (Title:libraries^200.0 |
TotalField:libraries^0.1)))"

and if the query is

"q":" (public libraries) "

then it retrieves 8 documents with this parsed query

"rawquerystring":" (public libraries) ",
"querystring":" (public libraries) ",
"parsedquery":"(+((DisjunctionMaxQuery((Title:public^200.0 |
TotalField:public^0.1)) DisjunctionMaxQuery((Title:libraries^200.0 |
TotalField:libraries^0.1)))~2))/no_coord",
"parsedquery_toString":"+(((Title:public^200.0 |
TotalField:public^0.1) (Title:libraries^200.0 |
TotalField:libraries^0.1))~2)"

So adding "+" before the parentheses, which I expected to require all tokens,
retrieves more results than removing it.

Is this a bug in this version, or is there something missing?


edismax operators

2015-04-02 Thread Mahmoud Almokadem
Hello,

I have strange behaviour when using edismax with multiple words. When
passing q=+(word1 word2) I get

"rawquerystring": "+(word1 word2)",
"querystring": "+(word1 word2)",
"parsedquery": "(+(+(DisjunctionMaxQuery((title:word1)) DisjunctionMaxQuery((title:word2)/no_coord",
"parsedquery_toString": "+(+((title:word1) (title:word2)))",

I expected both words to be mandatory, as I added + before the parentheses,
so it should apply to all terms inside the parentheses.

How can I apply the default operator AND to all words?

Thanks,
Mahmoud
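
With edismax, the usual way to get AND-like behaviour for bare terms is the mm
(minimum-should-match) parameter rather than q.op. A sketch of a request
handler default (the handler name is illustrative; the parameters are the
standard edismax ones):

    <requestHandler name="/select" class="solr.SearchHandler">
      <lst name="defaults">
        <str name="defType">edismax</str>
        <str name="qf">title</str>
        <!-- require every clause that carries no explicit operator -->
        <str name="mm">100%</str>
      </lst>
    </requestHandler>

The same parameter can be overridden per request (e.g. &mm=100%25 on the URL).
As the rest of this thread shows, wrapping the terms in +( ... ) changes how mm
is applied.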


Re: edismax operators

2015-04-02 Thread Mahmoud Almokadem
Thank you Jack for your clarifications. I used the regular defType and set
q.op=AND so that all terms without operators are mandatory. How can I get the
same behaviour with edismax?

Thanks,
Mahmoud

On Thu, Apr 2, 2015 at 2:14 PM, Jack Krupansky jack.krupan...@gmail.com
wrote:

 The parentheses signal a nested query. Your plus operator applies to the
 overall nested query - that the nested query must match something. Use the
 plus operator on each of the discrete terms if each of them is mandatory.
 The plus and minus operators apply to the overall nested query - they do
 not distribute to each term within the nested query. They don't magically
 distribute to all nested queries.

 Let's see your full set of query parameters, both on the request and in
 solrconfig.

 -- Jack Krupansky

 On Thu, Apr 2, 2015 at 7:12 AM, Mahmoud Almokadem prog.mahm...@gmail.com
 wrote:

  Hello,
 
  I've a strange behaviour on using edismax with multiwords. When using
  passing q=+(word1 word2) I got
 
  rawquerystring: +(word1 word2), querystring: +(word1 word2), 
  parsedquery: (+(+(DisjunctionMaxQuery((title:word1))
  DisjunctionMaxQuery((title:word2)/no_coord,
  parsedquery_toString: +(+((title:word1)
  (title:word2))),
 
  I expected to get two words as must as I added + before the parentheses
  so It must be applied for all terms in parentheses.
 
  How can I apply default operator AND for all words.
 
  Thanks,
  Mahmoud
 



Re: edismax operators

2015-04-02 Thread Mahmoud Almokadem
Thanks all for your responses,

But the parsed query and the number of results stay the same when I change the mm parameter.

The following are the results for mm=100% and mm=0%:

http://solrserver/solr/collection1/select?q=%2B(word1+word2)&rows=0&fl=Title&wt=json&indent=true&debugQuery=true&defType=edismax&qf=title&mm=100%25&stopwords=true&lowercaseOperators=true

"rawquerystring": "+(word1 word2)",
"querystring": "+(word1 word2)",
"parsedquery": "(+(+(DisjunctionMaxQuery((title:word1)) DisjunctionMaxQuery((title:word2)/no_coord",
"parsedquery_toString": "+(+((title:word1) (title:word2)))",



http://solrserver/solr/collection1/select?q=%2B(word1+word2)&rows=0&fl=Title&wt=json&indent=true&debugQuery=true&defType=edismax&qf=title&mm=0%25&stopwords=true&lowercaseOperators=true

"rawquerystring": "+(word1 word2)",
"querystring": "+(word1 word2)",
"parsedquery": "(+(+(DisjunctionMaxQuery((title:word1)) DisjunctionMaxQuery((title:word2)/no_coord",
"parsedquery_toString": "+(+((title:word1) (title:word2)))",

There aren't any changes between the two queries.

solr version 4.8.1

Thanks,
Mahmoud

On Thu, Apr 2, 2015 at 6:56 PM, Davis, Daniel (NIH/NLM) [C] 
daniel.da...@nih.gov wrote:

 Thanks Shawn,

 This is what I thought, but Solr often has features I don't anticipate.

 -Original Message-
 From: Shawn Heisey [mailto:apa...@elyograg.org]
 Sent: Thursday, April 02, 2015 12:54 PM
 To: solr-user@lucene.apache.org
 Subject: Re: edismax operators

 On 4/2/2015 9:59 AM, Davis, Daniel (NIH/NLM) [C] wrote:
  Can the mm parameter be set per clause?I guess I've ignored it in
 the past aside from setting it once to what seemed like a reasonable value.
  That is probably replicated across every collection, which cannot be
 ideal for relevance.

 It applies to the whole query.  You can have a different value on every
 query you send.  Just like with other parameters, defaults can be
 configured in the solrconfig.xml request handler definition.

 Thanks,
 Shawn
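
A hedged illustration of the per-request override Shawn describes (host,
collection and field names are placeholders):

    http://localhost:8983/solr/collection1/select?q=word1+word2&defType=edismax&qf=title&mm=2
    http://localhost:8983/solr/collection1/select?q=word1+word2&defType=edismax&qf=title&mm=1

A site-wide default would instead go into the handler's <lst name="defaults">
block in solrconfig.xml.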




Re: Create core problem in tomcat

2015-01-01 Thread Mahmoud Almokadem
You may have a field type in your schema that uses the stopwords.txt file,
like this:

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">

  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ArabicNormalizationFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ar.txt"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  </analyzer>

  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ArabicNormalizationFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ar.txt"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  </analyzer>

</fieldType>

So, you must have the files *stopwords_ar.txt* and *stopwords_en.txt* in
INSTANCE_DIR/conf/lang/ and *stopwords.txt* in INSTANCE_DIR/conf/.
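
So the instance directory referenced by the CREATE call needs roughly this
layout (assuming the field type above; the exact file set depends on your
schema):

    my_core/
        conf/
            solrconfig.xml
            schema.xml
            stopwords.txt
            lang/
                stopwords_ar.txt
                stopwords_en.txt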

Sincerely,
Mahmoud

On Thu, Jan 1, 2015 at 9:18 AM, Noora noora.sa...@gmail.com wrote:

 Hi
 I'm using Apache Solr 4.7.2 and Apache Tomcat.
 I can't create a core with a query in my Solr, while I can do it with Jetty with
 the same config.
 The first problem was "you can pass the system property
 -Dsolr.allow.unsafe.resourceloading=true to your JVM", which I solved in my
 Catalina.sh.
 Now my error is :
 Unable to create core: uut8 Caused by: Can't find resource 'stopwords.txt'
 in classpath or conf

 My query is :

 http://10.1.221.210:8983/solr/admin/cores?action=CREATE&name=my_core&instanceDir=my_core&dataDir=data&configSet=myConfig

 Can any one help me?



Re: Solr performance issues

2014-12-29 Thread Mahmoud Almokadem
Thanks all.

I have the same index with a slightly different schema and 200M documents,
installed on 3 r3.xlarge instances (30GB RAM and 600 General Purpose SSD each).
The size of the index is about 1.5TB; it receives many updates every 5 minutes
and serves complex queries and faceting with a response time of 100ms, which is
acceptable for us.

Toke Eskildsen,

Is the index updated while you are searching? *No*
Do you do any faceting or other heavy processing as part of a search? *No*
How many hits does a search typically have and how many documents are
returned? *The test measured QTime only, with no documents returned; the number
of hits varied from 50,000 to 50,000,000.*
How many concurrent searches do you need to support? How fast should the
response time be? *Maybe 100 concurrent searches with a 100ms response time, with facets.*

Would splitting the shard into two shards on the same node, so that every shard
sits on a single EBS volume, be better than using LVM?

Thanks

On Mon, Dec 29, 2014 at 2:00 AM, Toke Eskildsen t...@statsbiblioteket.dk
wrote:

 Mahmoud Almokadem [prog.mahm...@gmail.com] wrote:
  We've installed a cluster of one collection of 350M documents on 3
  r3.2xlarge (60GB RAM) Amazon servers. The size of index on each shard is
  about 1.1TB and maximum storage on Amazon is 1 TB so we add 2 SSD EBS
  General purpose (1x1TB + 1x500GB) on each instance. Then we create
 logical
  volume using LVM of 1.5TB to fit our index.

 Your search speed will be limited by the slowest storage in your group,
 which would be your 500GB EBS. The General Purpose SSD option means (as far
 as I can read at http://aws.amazon.com/ebs/details/#piops) that your
 baseline of 3 IOPS/GB = 1500 IOPS, with bursts of 3000 IOPS. Unfortunately
 they do not say anything about latency.

 For comparison, I checked the system logs from a local test with our 21TB
 / 7 billion documents index. It used ~27,000 IOPS during the test, with
 mean search time a bit below 1 second. That was with ~100GB RAM for disk
 cache, which is about ½% of index size. The test was with simple term
 queries (1-3 terms) and some faceting. Back of the envelope: 27,000 IOPS
 for 21TB is ~1300 IOPS/TB. Your indexes are 1.1TB, so 1.1*1300 IOPS ~= 1400
 IOPS.

 All else being equal (which is never the case), getting 1-3 second
 response times for a 1.1TB index, when one link in the storage chain is
 capped at a few thousand IOPS, you are using networked storage and you have
 little RAM for caching, does not seem unrealistic. If possible, you could
 try temporarily boosting performance of the EBS, to see if raw IO is the
 bottleneck.

  The response time is about 1 and 3 seconds for simple queries (1 token).

 Is the index updated while you are searching?
 Do you do any faceting or other heavy processing as part of a search?
 How many hits does a search typically have and how many documents are
 returned?
 How many concurrent searches do you need to support? How fast should the
 response time be?

 - Toke Eskildsen



Re: Solr performance issues

2014-12-29 Thread Mahmoud Almokadem
Thanks Shawn.

What do you mean by the important parts of the index? And how can I calculate
their size?

Thanks,
Mahmoud

Sent from my iPhone

 On Dec 29, 2014, at 8:19 PM, Shawn Heisey apa...@elyograg.org wrote:
 
 On 12/29/2014 2:36 AM, Mahmoud Almokadem wrote:
 I've the same index with a bit different schema and 200M documents,
 installed on 3 r3.xlarge (30GB RAM, and 600 General Purpose SSD). The size
 of index is about 1.5TB, have many updates every 5 minutes, complex queries
 and faceting with response time of 100ms that is acceptable for us.
 
 Toke Eskildsen,
 
 Is the index updated while you are searching? *No*
 Do you do any faceting or other heavy processing as part of a search? *No*
 How many hits does a search typically have and how many documents are
 returned? *The test for QTime only with no documents returned and No. of
 hits varying from 50,000 to 50,000,000.*
 How many concurrent searches do you need to support? How fast should the
 response time be? *May be 100 concurrent searches with 100ms with facets.*
 
 Does splitting the shard to two shards on the same node so every shard will
 be on a single EBS Volume better than using LVM?
 
 The basic problem is simply that the system has so little memory that it
 must read large amounts of data from the disk when it does a query.
 There is not enough RAM to cache the important parts of the index.  RAM
 is much faster than disk, even SSD.
 
 Typical consumer-grade DDR3-1600 memory has a data transfer rate of
 about 12800 megabytes per second.  If it's ECC memory (which I would say
 is a requirement) then the transfer rate is probably a little bit slower
 than that.  Figuring 9 bits for every byte gets us about 11377 MB/s.
 That's only an estimate, and it could be wrong in either direction, but
 I'll go ahead and use it.
 
 http://en.wikipedia.org/wiki/DDR3_SDRAM#JEDEC_standard_modules
 
 If your SSD is SATA, the transfer rate will be limited to approximately
 600MB/s -- the 6 gigabit per second transfer rate of the newest SATA
 standard.  That makes memory about 18 times as fast as SATA SSD.  I saw
 one PCI express SSD that claimed a transfer rate of 2900 MB/s.  Even
 that is only about one fourth of the estimated speed of DDR3-1600 with
 ECC.  I don't know what interface technology Amazon uses for their SSD
 volumes, but I would bet on it being the cheaper version, which would
 mean SATA.  The networking between the EC2 instance and the EBS storage
 is unknown to me and may be a further bottleneck.
 
 http://ocz.com/enterprise/z-drive-4500/specifications
 
 Bottom line -- you need a lot more memory.  Speeding up the disk may
 *help* ... but it will not replace that simple requirement.  With EC2 as
 the platform, you may need more instances and more shards.
 
 Your 200 million document index that works well with only 90GB of total
 memory ... that's surprising to me.  That means that the important parts
 of that index *do* fit in memory ... but if the index gets much larger,
 performance is likely to drop off sharply.
 
 Thanks,
 Shawn
 


Solr performance issues

2014-12-26 Thread Mahmoud Almokadem
Dears,

We've installed a cluster with one collection of 350M documents on 3
r3.2xlarge (60GB RAM) Amazon servers. The size of the index on each shard is
about 1.1TB, and the maximum volume size on Amazon is 1TB, so we added 2
General Purpose SSD EBS volumes (1x1TB + 1x500GB) to each instance. Then we
created a 1.5TB logical volume using LVM to fit our index.

The response time is about 1 to 3 seconds for simple queries (1 token).

Has the LVM become a bottleneck for our index?

Thanks for the help.
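
One quick way to check whether raw I/O is the limit here, as suggested in the
replies above, is to watch device utilisation on the LVM members while the slow
queries run; a sketch, assuming the sysstat package is installed:

    iostat -xm 5

If the EBS volumes sit near 100% utilisation during a query, the storage (and
in particular the smaller volume's lower IOPS ceiling) is the more likely
bottleneck than LVM itself.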


Re: Solr-Distributed search

2014-06-05 Thread Mahmoud Almokadem
Hi, you can search using this sample URL:

http://localhost:8080/solr/core1/select?q=*:*&shards=localhost:8080/solr/core1,localhost:8080/solr/core2,localhost:8080/solr/core3

Mahmoud Almokadem
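
The same request can also be issued from Java with SolrJ; a minimal sketch
against the SolrJ 4.x API, assuming the three cores from the URL above:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class DistributedSearchExample {
        public static void main(String[] args) throws Exception {
            // Send the query to core1 and let it fan out to the listed cores.
            HttpSolrServer server = new HttpSolrServer("http://localhost:8080/solr/core1");
            SolrQuery query = new SolrQuery("*:*");
            query.set("shards",
                "localhost:8080/solr/core1,localhost:8080/solr/core2,localhost:8080/solr/core3");
            QueryResponse response = server.query(query);
            System.out.println("Found " + response.getResults().getNumFound() + " documents");
            server.shutdown();
        }
    }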


On Thu, Jun 5, 2014 at 8:13 AM, Anurag Verma vermanur...@gmail.com wrote:

 Hi,
  Can you please help me with Solr distributed search in multicore? I would
  be very happy, as I am stuck here.

  In Java code, how do I implement distributed search?
 --
 Thanks  Regards
 Anurag Verma



Shards range error

2014-04-22 Thread Mahmoud Almokadem
      "autoCreated":"true"},
  "news_english":{
    "shards":{
      "shard1":{
        "range":"80000000-ffffffff",
        "state":"active",
        "replicas":{"10.0.1.237:8080_solr_news_english":{
            "state":"active",
            "base_url":"http://10.0.1.237:8080/solr",
            "core":"news_english",
            "node_name":"10.0.1.237:8080_solr",
            "leader":"true"}}},
      "shard2":{
        "range":"0-7fffffff",
        "state":"active",
        "replicas":{"10.0.1.6:8080_solr_news_english":{
            "state":"active",
            "base_url":"http://10.0.1.6:8080/solr",
            "core":"news_english",
            "node_name":"10.0.1.6:8080_solr",
            "leader":"true"}}}},
    "maxShardsPerNode":"1",
    "router":{"name":"compositeId"},
    "replicationFactor":"1",
    "autoCreated":"true"}}

So, what should I do to create new collections with 3 shards?


Thanks to all, and sorry for my poor English.

Mahmoud Almokadem
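
For an explicitly sharded collection, instead of the auto-created two-shard
ones shown above, the Collections API call would look roughly like this (host,
collection name and config name are placeholders):

    http://10.0.1.237:8080/solr/admin/collections?action=CREATE&name=new_collection&numShards=3&replicationFactor=1&collection.configName=myconf

numShards is fixed at creation time, so the shard hash ranges are decided then
and can only be changed later by splitting shards.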