RE: Trailing space issue with indexed data.

2020-08-18 Thread Markus Jelsma
Hello,

You can use TrimFieldUpdateProcessorFactory [1] in your URP chain to remove 
leading or trailing whitespace when indexing.
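
For reference, a minimal chain definition in solrconfig.xml might look like 
this (the chain name "trim-fields" and the extra processors are just an 
example sketch):

<updateRequestProcessorChain name="trim-fields">
  <processor class="solr.TrimFieldUpdateProcessorFactory"/>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

You can then select it per request with update.chain=trim-fields, or make it 
the default chain for your update handler.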

Regards,
Markus

[1] 
https://lucene.apache.org/solr/8_6_0/solr-core/org/apache/solr/update/processor/TrimFieldUpdateProcessorFactory.html

 
 
-Original message-
> From:Fiz N 
> Sent: Tuesday 18th August 2020 19:57
> To: solr-user@lucene.apache.org
> Subject: Trailing space issue with indexed data.
> 
> Hello SOLR Experts,
> 
> I am using SOLR 8.6 and indexing data from MSSQL DB.
> 
> after indexing is done I am seeing
> 
> "Page_number":"1",
> "Doc_name":"  office 770 toll free "
> "Doc_text":" From:  Hyan, gan \nTo:  Delacruz
>   Decruz \n"
> 
> I want to remove these empty spaces.
> 
> How can I achieve this?
> 
> Thanks
> Fiz Nadian.
> 


RE: Drop bad document in update batch

2020-08-18 Thread Markus Jelsma
Ah yes, I should have looked at the list of subclasses of 
UpdateRequestProcessorFactory in the API docs, as it is not mentioned in the 
manual.

Thanks Erick! 
 
-Original message-
> From:Erick Erickson 
> Sent: Tuesday 18th August 2020 19:04
> To: solr-user@lucene.apache.org
> Subject: Re: Drop bad document in update batch
> 
> I think you’re looking for TolerantUpdateProcessor(Factory), added in 
> SOLR-445. It hung around for a LOOONG time and didn’t actually get 
> added until 6.1.
> 
> > On Aug 18, 2020, at 12:51 PM, Markus Jelsma  
> > wrote:
> > 
> > Hello,
> > 
> > Normally, if a single document is bad, the whole indexing batch is dropped. 
> > I think I remember there was a URP(?) that discards bad documents from the 
> > batch, but I cannot find it in the manual [1].
> > 
> > Is it possible, or am I starting to imagine things?
> > 
> > Thanks,
> > Markus
> > 
> > [1] https://lucene.apache.org/solr/guide/8_6/update-request-processors.html
> 
> 


Re: Trailing space issue with indexed data.

2020-08-18 Thread Jörn Franke
During indexing. Do they matter for search, i.e., would the search be different 
with/without them?

> Am 18.08.2020 um 19:57 schrieb Fiz N :
> 
> Hello SOLR Experts,
> 
> I am using SOLR 8.6 and indexing data from MSSQL DB.
> 
> after indexing is done I am seeing
> 
> "Page_number":"1",
> "Doc_name":"  office 770 toll free "
> "Doc_text":" From:  Hyan, gan \nTo:  Delacruz
>  Decruz \n"
> 
> I want to remove these empty spaces.
> 
> How can I achieve this?
> 
> Thanks
> Fiz Nadian.


Trailing space issue with indexed data.

2020-08-18 Thread Fiz N
Hello SOLR Experts,

I am using SOLR 8.6 and indexing data from MSSQL DB.

after indexing is done I am seeing

"Page_number":"1",
"Doc_name":"  office 770 toll free "
"Doc_text":" From:  Hyan, gan \nTo:  Delacruz
  Decruz \n"

I want to remove these empty spaces.

How can I achieve this?

Thanks
Fiz Nadian.


Re: Deleted collection is getting back after restart

2020-08-18 Thread Erick Erickson
What this sounds like is you’re not really connecting to ZooKeeper the way you 
think. First, ensure that you’re really connecting from Solr by going to the 
admin UI and looking at the ZooKeeper information. Are the URLs and ports 
correct?
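
A quick way to check what that ZooKeeper actually holds is the Solr CLI's zk 
subcommand; the ensemble address below is a placeholder for your own:

bin/solr zk ls -r /collections -z zk1:2181,zk2:2181,zk3:2181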

Second, ZooKeeper by default puts its data in /tmp/zookeeper, which will 
disappear upon reboot. Make very sure your ZK data dir points somewhere 
permanent in zoo.cfg.
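
For example (the dataDir path here is just an illustration of a persistent 
location):

# zoo.cfg
dataDir=/var/lib/zookeeper
clientPort=2181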

Third, there should be no reason to restart Zookeeper.

Fourth, if you’re using an older version of Solr, lingering replicas would 
magically reappear. Look for “legacyCloud” in the reference guide for the 
version you’re using. If legacyCloud = true, any remnants of collections found 
on disk will magically reappear. The “smoking gun” here is if you have a 
clusterstate.json in ZooKeeper that has some collection info, and the rest of 
your info in collections/<collection>/state.json.
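
If legacyCloud turns out to be enabled, you can turn it off via the 
Collections API (host and port below are placeholders):

curl "http://localhost:8983/solr/admin/collections?action=CLUSTERPROP&name=legacyCloud&val=false"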

Best,
Erick

> On Aug 18, 2020, at 1:32 PM, yaswanth kumar  wrote:
> 
> I am using Solr with a ZooKeeper ensemble, but sometimes when we delete a
> collection with the Solr API, it disappears from Solr Cloud, but after some
> days, when the machines are rebooted, it comes back on the cloud with a down
> status. Not really sure if it's an issue with ZooKeeper not persisting the
> delete in the clusterstate (even the clusterstate is showing these nodes
> again with a down status).
> 
> A similar issue happens when we update the schema: when we make any
> modifications to the Solr schema and upload it via ZooKeeper, it works fine
> until the next reboot of the Solr boxes; once the reboot is done, for some
> reason the older version of the schema comes back.
> 
> Is it mandatory that we restart ZooKeeper after doing the above two
> operations?
> 
> -- 
> Thanks & Regards,
> Yaswanth Kumar Konathala.
> yaswanth...@gmail.com



Deleted collection is getting back after restart

2020-08-18 Thread yaswanth kumar
I am using Solr with a ZooKeeper ensemble, but sometimes when we delete a
collection with the Solr API, it disappears from Solr Cloud, but after some
days, when the machines are rebooted, it comes back on the cloud with a down
status. Not really sure if it's an issue with ZooKeeper not persisting the
delete in the clusterstate (even the clusterstate is showing these nodes
again with a down status).

A similar issue happens when we update the schema: when we make any
modifications to the Solr schema and upload it via ZooKeeper, it works fine
until the next reboot of the Solr boxes; once the reboot is done, for some
reason the older version of the schema comes back.

Is it mandatory that we restart ZooKeeper after doing the above two
operations?

-- 
Thanks & Regards,
Yaswanth Kumar Konathala.
yaswanth...@gmail.com


Re: Drop bad document in update batch

2020-08-18 Thread Erick Erickson
I think you’re looking for TolerantUpdateProcessor(Factory), added in SOLR-445. 
It hung around for a LOOONG time and didn’t actually get added until 6.1.
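
For the archives, a sketch of what the chain configuration in solrconfig.xml 
could look like (the chain name and maxErrors value are arbitrary examples):

<updateRequestProcessorChain name="tolerant-chain">
  <processor class="solr.TolerantUpdateProcessorFactory">
    <int name="maxErrors">10</int>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

Failing documents are skipped (up to maxErrors) and reported in the response 
header instead of aborting the whole batch.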

> On Aug 18, 2020, at 12:51 PM, Markus Jelsma  
> wrote:
> 
> Hello,
> 
> Normally, if a single document is bad, the whole indexing batch is dropped. I 
> think I remember there was a URP(?) that discards bad documents from the 
> batch, but I cannot find it in the manual [1].
> 
> Is it possible, or am I starting to imagine things?
> 
> Thanks,
> Markus
> 
> [1] https://lucene.apache.org/solr/guide/8_6/update-request-processors.html



Re: Use of NRTCachingDirectoryFactory

2020-08-18 Thread Erick Erickson
In a word, “yes”. NRTCachingDirectory almost
always “does the right thing” without any
modifications based on your environment.

Second, since Lucene uses MMapDirectory,
the relevant portions of your index will 
already be in the OS’s RAM, which 
is why it’s a mistake to try to force it. All
you’ll really do is chew up Java’s RAM
to no good purpose, adding to GC pressure, etc.

See: https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

In your case, with such a small index,
it’s probably already entirely in memory.
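
For reference, the stock solrconfig.xml already declares something along these 
lines, so there is usually nothing to change:

<directoryFactory name="DirectoryFactory"
                  class="${solr.directoryFactory:solr.NRTCachingDirectoryFactory}"/>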

Best,
Erick

> On Aug 18, 2020, at 4:40 AM, Tushar Arora  wrote:
> 
> Hi,
> 
> One of our indexes is around 1 GB in size, and the production server has
> 16 GB of RAM. This is a slave server; data replicates from the master
> server to it every 5 minutes.
> 
> Is it a good practice to keep this index in RAM?
> I checked solr.RAMDirectoryFactory, but it does not work with
> replication. So is solr.NRTCachingDirectoryFactory a good choice?
> 
> Regards,
> Tushar



Drop bad document in update batch

2020-08-18 Thread Markus Jelsma
Hello,

Normally, if a single document is bad, the whole indexing batch is dropped. I 
think i remember there was an URP(?) that discards bad documents from the 
batch, but i cannot find it in the manual [1].

Is it possible or am i starting to imagine things?

Thanks,
Markus

[1] https://lucene.apache.org/solr/guide/8_6/update-request-processors.html


Re: SOLR indexing takes longer time

2020-08-18 Thread Walter Underwood
Instead of writing code, I’d fire up SQL Workbench/J, load the same JDBC driver
that is being used in Solr, and run the query.

https://www.sql-workbench.eu 

If that takes 3.5 hours, you have isolated the problem.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Aug 18, 2020, at 6:50 AM, David Hastings  
> wrote:
> 
> Another thing to mention is to make sure the indexer you build doesn’t send
> commits until it’s actually done. Made that mistake with some early in-house
> indexers.
> 
> On Tue, Aug 18, 2020 at 9:38 AM Charlie Hull  wrote:
> 
>> 1. You could write some code to pull the items out of Mongo and dump
>> them to disk - if this is still slow, then it's Mongo that's the problem.
>> 2. Write a standalone indexer to replace DIH, it's single threaded and
>> deprecated anyway.
>> 3. Minor point - consider whether you need to index everything every
>> time or just the deltas.
>> 4. Upgrade Solr anyway, not for speed reasons but because that's a very
>> old version you're running.
>> 
>> HTH
>> 
>> Charlie
>> 
>> On 17/08/2020 19:22, Abhijit Pawar wrote:
>>> Hello,
>>> 
>>> We are indexing some 200K plus documents in SOLR 5.4.1 with no shards /
>>> replicas and just single core.
>>> It takes almost 3.5 hours to index that data.
>>> I am using a data import handler to import data from the mongo database.
>>> 
>>> Is there something we can do to reduce the time taken to index?
>>> Will upgrade to newer version help?
>>> 
>>> Appreciate your help!
>>> 
>>> Regards,
>>> Abhijit
>>> 
>> 
>> --
>> Charlie Hull
>> OpenSource Connections, previously Flax
>> 
>> tel/fax: +44 (0)8700 118334
>> mobile:  +44 (0)7767 825828
>> web: www.o19s.com
>> 
>> 



Re: SOLR indexing takes longer time

2020-08-18 Thread David Hastings
Another thing to mention is to make sure the indexer you build doesn’t send
commits until it’s actually done. Made that mistake with some early in-house
indexers.
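
A rough SolrJ sketch of that pattern (the URL, core name, field names, and 
batch size are all placeholders; this assumes an 8.x SolrJ client):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

import java.util.ArrayList;
import java.util.List;

public class BatchIndexer {
    public static void main(String[] args) throws Exception {
        // Placeholder URL and core; point this at your own instance.
        try (SolrClient client =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build()) {
            List<SolrInputDocument> batch = new ArrayList<>();
            for (int i = 0; i < 200_000; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", Integer.toString(i));
                doc.addField("title_s", "document " + i);
                batch.add(doc);
                // Send in chunks to keep memory bounded, but do NOT commit here.
                if (batch.size() == 1000) {
                    client.add(batch);
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                client.add(batch);
            }
            client.commit(); // one commit, only when everything is done
        }
    }
}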

On Tue, Aug 18, 2020 at 9:38 AM Charlie Hull  wrote:

> 1. You could write some code to pull the items out of Mongo and dump
> them to disk - if this is still slow, then it's Mongo that's the problem.
> 2. Write a standalone indexer to replace DIH, it's single threaded and
> deprecated anyway.
> 3. Minor point - consider whether you need to index everything every
> time or just the deltas.
> 4. Upgrade Solr anyway, not for speed reasons but because that's a very
> old version you're running.
>
> HTH
>
> Charlie
>
> On 17/08/2020 19:22, Abhijit Pawar wrote:
> > Hello,
> >
> > We are indexing some 200K plus documents in SOLR 5.4.1 with no shards /
> > replicas and just single core.
> > It takes almost 3.5 hours to index that data.
> > I am using a data import handler to import data from the mongo database.
> >
> > Is there something we can do to reduce the time taken to index?
> > Will upgrade to newer version help?
> >
> > Appreciate your help!
> >
> > Regards,
> > Abhijit
> >
>
> --
> Charlie Hull
> OpenSource Connections, previously Flax
>
> tel/fax: +44 (0)8700 118334
> mobile:  +44 (0)7767 825828
> web: www.o19s.com
>
>


Re: SOLR indexing takes longer time

2020-08-18 Thread Charlie Hull
1. You could write some code to pull the items out of Mongo and dump 
them to disk - if this is still slow, then it's Mongo that's the problem.
2. Write a standalone indexer to replace DIH, it's single threaded and 
deprecated anyway.
3. Minor point - consider whether you need to index everything every 
time or just the deltas.
4. Upgrade Solr anyway, not for speed reasons but because that's a very 
old version you're running.


HTH

Charlie

On 17/08/2020 19:22, Abhijit Pawar wrote:

Hello,

We are indexing some 200K plus documents in SOLR 5.4.1 with no shards /
replicas and just single core.
It takes almost 3.5 hours to index that data.
I am using a data import handler to import data from the mongo database.

Is there something we can do to reduce the time taken to index?
Will upgrade to newer version help?

Appreciate your help!

Regards,
Abhijit



--
Charlie Hull
OpenSource Connections, previously Flax

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.o19s.com



Use of NRTCachingDirectoryFactory

2020-08-18 Thread Tushar Arora
Hi,

One of our indexes is around 1 GB in size, and the production server has
16 GB of RAM. This is a slave server; data replicates from the master
server to it every 5 minutes.

Is it a good practice to keep this index in RAM?
I checked solr.RAMDirectoryFactory, but it does not work with
replication. So is solr.NRTCachingDirectoryFactory a good choice?

Regards,
Tushar