Re: HTML sample.html not indexing in Solr 8.8

2021-02-21 Thread Shawn Heisey

On 2/21/2021 3:07 PM, cratervoid wrote:

Thanks Shawn, I copied the solrconfig.xml file from the gettingstarted
example on 7.7.3 installation to the 8.8.0 installation, restarted the
server and it now works. Comparing the two files it looks like as you said
this section was left out of the _default/solrconfig.xml file in version
8.8.0:


 
  <requestHandler name="/update/extract"
                  startup="lazy"
                  class="solr.extraction.ExtractingRequestHandler" >
    <lst name="defaults">
      <str name="lowernames">true</str>
      <str name="fmap.meta">ignored_</str>
      <str name="fmap.content">_text_</str>
    </lst>
  </requestHandler>

So those trying out the tutorial will need to add this section to get it to
work for sample.html.



This line from that config also is involved:

  <lib dir="${solr.install.dir:../../../..}/contrib/extraction/lib"
       regex=".*\.jar" />


That loads the contrib jars needed for the ExtractingRequestHandler to 
work right.  There are a LOT of jars there.  Tika is a very heavyweight 
piece of software.


Thanks,
Shawn


Re: HTML sample.html not indexing in Solr 8.8

2021-02-21 Thread cratervoid
Thanks Shawn, I copied the solrconfig.xml file from the gettingstarted
example on 7.7.3 installation to the 8.8.0 installation, restarted the
server and it now works. Comparing the two files it looks like as you said
this section was left out of the _default/solrconfig.xml file in version
8.8.0:



  <requestHandler name="/update/extract"
                  startup="lazy"
                  class="solr.extraction.ExtractingRequestHandler" >
    <lst name="defaults">
      <str name="lowernames">true</str>
      <str name="fmap.meta">ignored_</str>
      <str name="fmap.content">_text_</str>
    </lst>
  </requestHandler>

So those trying out the tutorial will need to add this section to get it to
work for sample.html.



On Sat, Feb 20, 2021 at 4:21 PM Shawn Heisey  wrote:

> On 2/20/2021 3:58 PM, cratervoid wrote:
> > SimplePostTool: WARNING: Solr returned an error #404 (Not Found) for url:
> >
> http://localhost:8983/solr/gettingstarted/update/extract?resource.name=C%3A%5Csolr-8.8.0%5Cexample%5Cexampledocs%5Csample.html=C%3A%5Csolr-8.8.0%5Cexample%5Cexampledocs%5Csample.html
>
> The problem here is that the solrconfig.xml in use by the index named
> "gettingstarted" does not define a handler at /update/extract.
>
> Typically a handler defined at that URL path will utilize the extracting
> request handler class.  This handler uses Tika (another Apache project)
> to extract usable data from rich text formats like PDF, HTML, etc.
>
>
> <requestHandler name="/update/extract"
>                 startup="lazy"
>                 class="solr.extraction.ExtractingRequestHandler" >
>   <lst name="defaults">
>     <str name="lowernames">true</str>
>     <str name="fmap.meta">ignored_</str>
>     <str name="fmap.content">_text_</str>
>   </lst>
> </requestHandler>
>
> Note that using this handler will require adding some contrib jars to Solr.
>
> Tika can become very unstable because it deals with undocumented file
> formats, so we do not recommend using that handler in production.  If
> the functionality is important, Tika should be included in a program
> that's separate from Solr, so that if it crashes, it does not take Solr
> down with it.
>
> Thanks,
> Shawn
>


Re: HTML sample.html not indexing in Solr 8.8

2021-02-21 Thread cratervoid
Thanks Alex. I copied the solrconfig.xml over from 7.7.3 to the 8.8.0 conf
folder and restarted the server.  Now indexing works without erroring on
sample.html.  There is 1K difference between the 2 files so I'll diff them
to see what was left out of the 8.8 version.

On Sat, Feb 20, 2021 at 4:27 PM Alexandre Rafalovitch 
wrote:

> Most likely issue is that your core configuration (solrconfig.xml)
> does not have the request handler for that. The same config may have
> had that in 7.x, but changed since.
>
> More details:
> https://lucene.apache.org/solr/guide/8_8/uploading-data-with-solr-cell-using-apache-tika.html
>
> Regards,
>Alex.
>
> On Sat, 20 Feb 2021 at 17:59, cratervoid  wrote:
> >
> > I am trying out indexing the exampledocs in the examples folder with the
> > SimplePostTool on windows 10 using solr 8.8.  All the documents index
> > except sample.html. For that file I get the errors below.  I then
> > downloaded solr 7.7.3 and indexed the exampledocs folder with no errors,
> > including sample.html.
> > ```
> > PS C:\solr-8.8.0> java -jar -Dc=gettingstarted -Dauto
> > example\exampledocs\post.jar example\exampledocs\sample.html
> > SimplePostTool version 5.0.0
> > Posting files to [base] url
> > http://localhost:8983/solr/gettingstarted/update...
> > Entering auto mode. File endings considered are
> >
> xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
> > POSTing file sample.html (text/html) to [base]/extract
> > SimplePostTool: WARNING: Solr returned an error #404 (Not Found) for url:
> >
> http://localhost:8983/solr/gettingstarted/update/extract?resource.name=C%3A%5Csolr-8.8.0%5Cexample%5Cexampledocs%5Csample.html=C%3A%5Csolr-8.8.0%5Cexample%5Cexampledocs%5Csample.html
> > SimplePostTool: WARNING: Response: 
> > 
> > 
> > Error 404 Not Found
> > 
> > HTTP ERROR 404 Not Found
> > 
> > URI:/solr/gettingstarted/update/extract
> > STATUS:404
> > MESSAGE:Not Found
> > SERVLET:default
> > 
> >
> > 
> > 
> > SimplePostTool: WARNING: IOException while reading response:
> > java.io.FileNotFoundException:
> >
> http://localhost:8983/solr/gettingstarted/update/extract?resource.name=C%3A%5Csolr-8.8.0%5Cexample%5Cexampledocs%5Csample.html=C%3A%5Csolr-8.8.0%5Cexample%5Cexampledocs%5Csample.html
> > 1 files indexed.
> > COMMITting Solr index changes to
> > http://localhost:8983/solr/gettingstarted/update...
> > Time spent: 0:00:00.086
> > ```
> >
> > However the json and all other file types index with no problem. For
> > example:
> > ```
> > PS C:\solr-8.8.0> java -jar -Dc=gettingstarted -Dauto
> > example\exampledocs\post.jar example\exampledocs\books.json
> > SimplePostTool version 5.0.0
> > Posting files to [base] url
> > http://localhost:8983/solr/gettingstarted/update...
> > Entering auto mode. File endings considered are
> >
> xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
> > POSTing file books.json (application/json) to [base]/json/docs
> > 1 files indexed.
> > COMMITting Solr index changes to
> > http://localhost:8983/solr/gettingstarted/update...
> > ```
> > Just following this tutorial:[
> >
> https://lucene.apache.org/solr/guide/8_8/post-tool.html#post-tool-windows-support][1
> > ]
> >
> >   [1]:
> >
> https://lucene.apache.org/solr/guide/8_8/post-tool.html#post-tool-windows-support
>


Re: HTML sample.html not indexing in Solr 8.8

2021-02-20 Thread Alexandre Rafalovitch
Most likely issue is that your core configuration (solrconfig.xml)
does not have the request handler for that. The same config may have
had that in 7.x, but changed since.

More details: 
https://lucene.apache.org/solr/guide/8_8/uploading-data-with-solr-cell-using-apache-tika.html

Regards,
   Alex.

On Sat, 20 Feb 2021 at 17:59, cratervoid  wrote:
>
> I am trying out indexing the exampledocs in the examples folder with the
> SimplePostTool on windows 10 using solr 8.8.  All the documents index
> except sample.html. For that file I get the errors below.  I then
> downloaded solr 7.7.3 and indexed the exampledocs folder with no errors,
> including sample.html.
> ```
> PS C:\solr-8.8.0> java -jar -Dc=gettingstarted -Dauto
> example\exampledocs\post.jar example\exampledocs\sample.html
> SimplePostTool version 5.0.0
> Posting files to [base] url
> http://localhost:8983/solr/gettingstarted/update...
> Entering auto mode. File endings considered are
> xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
> POSTing file sample.html (text/html) to [base]/extract
> SimplePostTool: WARNING: Solr returned an error #404 (Not Found) for url:
> http://localhost:8983/solr/gettingstarted/update/extract?resource.name=C%3A%5Csolr-8.8.0%5Cexample%5Cexampledocs%5Csample.html=C%3A%5Csolr-8.8.0%5Cexample%5Cexampledocs%5Csample.html
> SimplePostTool: WARNING: Response: 
> 
> 
> Error 404 Not Found
> 
> HTTP ERROR 404 Not Found
> 
> URI:/solr/gettingstarted/update/extract
> STATUS:404
> MESSAGE:Not Found
> SERVLET:default
> 
>
> 
> 
> SimplePostTool: WARNING: IOException while reading response:
> java.io.FileNotFoundException:
> http://localhost:8983/solr/gettingstarted/update/extract?resource.name=C%3A%5Csolr-8.8.0%5Cexample%5Cexampledocs%5Csample.html=C%3A%5Csolr-8.8.0%5Cexample%5Cexampledocs%5Csample.html
> 1 files indexed.
> COMMITting Solr index changes to
> http://localhost:8983/solr/gettingstarted/update...
> Time spent: 0:00:00.086
> ```
>
> However the json and all other file types index with no problem. For
> example:
> ```
> PS C:\solr-8.8.0> java -jar -Dc=gettingstarted -Dauto
> example\exampledocs\post.jar example\exampledocs\books.json
> SimplePostTool version 5.0.0
> Posting files to [base] url
> http://localhost:8983/solr/gettingstarted/update...
> Entering auto mode. File endings considered are
> xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
> POSTing file books.json (application/json) to [base]/json/docs
> 1 files indexed.
> COMMITting Solr index changes to
> http://localhost:8983/solr/gettingstarted/update...
> ```
> Just following this tutorial:[
> https://lucene.apache.org/solr/guide/8_8/post-tool.html#post-tool-windows-support][1
> ]
>
>   [1]:
> https://lucene.apache.org/solr/guide/8_8/post-tool.html#post-tool-windows-support


Re: HTML sample.html not indexing in Solr 8.8

2021-02-20 Thread Shawn Heisey

On 2/20/2021 3:58 PM, cratervoid wrote:

SimplePostTool: WARNING: Solr returned an error #404 (Not Found) for url:
http://localhost:8983/solr/gettingstarted/update/extract?resource.name=C%3A%5Csolr-8.8.0%5Cexample%5Cexampledocs%5Csample.html=C%3A%5Csolr-8.8.0%5Cexample%5Cexampledocs%5Csample.html


The problem here is that the solrconfig.xml in use by the index named 
"gettingstarted" does not define a handler at /update/extract.


Typically a handler defined at that URL path will utilize the extracting 
request handler class.  This handler uses Tika (another Apache project) 
to extract usable data from rich text formats like PDF, HTML, etc.


  
  

  <requestHandler name="/update/extract"
                  startup="lazy"
                  class="solr.extraction.ExtractingRequestHandler" >
    <lst name="defaults">
      <str name="lowernames">true</str>
      <str name="fmap.meta">ignored_</str>
      <str name="fmap.content">_text_</str>
    </lst>
  </requestHandler>

Note that using this handler will require adding some contrib jars to Solr.
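
For reference, the sample configs that ship with Solr pull those jars in with
<lib> directives along these lines (the exact dir values depend on your install
layout, so treat this as a sketch):

```
<lib dir="${solr.install.dir:../../../..}/contrib/extraction/lib" regex=".*\.jar" />
<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-cell-\d.*\.jar" />
```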

Tika can become very unstable because it deals with undocumented file 
formats, so we do not recommend using that handler in production.  If 
the functionality is important, Tika should be included in a program 
that's separate from Solr, so that if it crashes, it does not take Solr 
down with it.
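
A minimal sketch of that separate-process approach with Tika and SolrJ (the core
name, id, and field names below are only examples, not a finished tool):

```
import java.io.File;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.Tika;

public class ExternalTikaIndexer {
    public static void main(String[] args) throws Exception {
        // Tika runs in this separate JVM, so a parser crash cannot take Solr down
        Tika tika = new Tika();
        try (SolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/gettingstarted").build()) {
            for (String path : args) {
                File f = new File(path);
                // extract plain text from PDF/HTML/Office/... with Tika
                String body = tika.parseToString(f);

                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", f.getAbsolutePath());   // id choice is just an example
                doc.addField("_text_", body);              // field name is just an example
                solr.add(doc);
            }
            solr.commit();
        }
    }
}
```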


Thanks,
Shawn


HTML sample.html not indexing in Solr 8.8

2021-02-20 Thread cratervoid
I am trying out indexing the exampledocs in the examples folder with the
SimplePostTool on windows 10 using solr 8.8.  All the documents index
except sample.html. For that file I get the errors below.  I then
downloaded solr 7.7.3 and indexed the exampledocs folder with no errors,
including sample.html.
```
PS C:\solr-8.8.0> java -jar -Dc=gettingstarted -Dauto
example\exampledocs\post.jar example\exampledocs\sample.html
SimplePostTool version 5.0.0
Posting files to [base] url
http://localhost:8983/solr/gettingstarted/update...
Entering auto mode. File endings considered are
xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file sample.html (text/html) to [base]/extract
SimplePostTool: WARNING: Solr returned an error #404 (Not Found) for url:
http://localhost:8983/solr/gettingstarted/update/extract?resource.name=C%3A%5Csolr-8.8.0%5Cexample%5Cexampledocs%5Csample.html=C%3A%5Csolr-8.8.0%5Cexample%5Cexampledocs%5Csample.html
SimplePostTool: WARNING: Response: 


Error 404 Not Found

HTTP ERROR 404 Not Found

URI:/solr/gettingstarted/update/extract
STATUS:404
MESSAGE:Not Found
SERVLET:default




SimplePostTool: WARNING: IOException while reading response:
java.io.FileNotFoundException:
http://localhost:8983/solr/gettingstarted/update/extract?resource.name=C%3A%5Csolr-8.8.0%5Cexample%5Cexampledocs%5Csample.html=C%3A%5Csolr-8.8.0%5Cexample%5Cexampledocs%5Csample.html
1 files indexed.
COMMITting Solr index changes to
http://localhost:8983/solr/gettingstarted/update...
Time spent: 0:00:00.086
```

However the json and all other file types index with no problem. For
example:
```
PS C:\solr-8.8.0> java -jar -Dc=gettingstarted -Dauto
example\exampledocs\post.jar example\exampledocs\books.json
SimplePostTool version 5.0.0
Posting files to [base] url
http://localhost:8983/solr/gettingstarted/update...
Entering auto mode. File endings considered are
xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file books.json (application/json) to [base]/json/docs
1 files indexed.
COMMITting Solr index changes to
http://localhost:8983/solr/gettingstarted/update...
```
Just following this tutorial:[
https://lucene.apache.org/solr/guide/8_8/post-tool.html#post-tool-windows-support][1
]

  [1]:
https://lucene.apache.org/solr/guide/8_8/post-tool.html#post-tool-windows-support


Re: Urgent- General Question about document Indexing frequency in solr

2021-02-04 Thread Scott Stults
Manisha,

The most general recommendation around commits is to not explicitly commit
after every update. There are settings that will let Solr automatically
commit after some threshold is met, and by delegating commits to that
mechanism you can generally ingest faster.

See this blog post that goes into detail about how to set that up for your
situation:

https://lucidworks.com/post/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
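
As a rough illustration, the relevant solrconfig.xml section ends up looking
something like this (the interval values below are only examples; tune them to
your own latency and durability needs):

```
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- hard commit: flushes to disk regularly but does not open a new searcher -->
  <autoCommit>
    <maxTime>60000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <!-- soft commit: controls how quickly new documents become visible to searches -->
  <autoSoftCommit>
    <maxTime>30000</maxTime>
  </autoSoftCommit>
</updateHandler>
```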


Kind regards,
Scott


On Wed, Feb 3, 2021 at 5:44 PM Manisha Rahatadkar <
manisha.rahatad...@anjusoftware.com> wrote:

> Hi All
>
> Looking for some help on document indexing frequency. I am using apache
> solr 7.7 and SolrNet library to commit documents to Solr. Summary for this
> function is:
> // Summary:
> // Commits posted documents, blocking until index changes are flushed
> to disk and
> // blocking until a new searcher is opened and registered as the main
> query searcher,
> // making the changes visible.
>
> I understand that, the document gets reindexed after every commit. I have
> noticed that as the number of documents are increasing, the reindexing
> takes time. and sometimes I am getting solr connection time out error.
> I have following questions:
>
>   1.  Is there any frequency suggested by Solr for document insert/update
> and reindex? Is there any standard recommendation?
>   2.  If I remove the copy fields from managed-schema.xml, do I need to
> delete the existing indexed data from solr core and then insert data and
> reindex it again?
>
> Thanks in advance.
>
> Regards
> Manisha
>
>
>
> Confidentiality Notice
> 
> This email message, including any attachments, is for the sole use of the
> intended recipient and may contain confidential and privileged information.
> Any unauthorized view, use, disclosure or distribution is prohibited. If
> you are not the intended recipient, please contact the sender by reply
> email and destroy all copies of the original message. Anju Software, Inc.
> 4500 S. Lakeshore Drive, Suite 620, Tempe, AZ USA 85282.
>


-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


Urgent- General Question about document Indexing frequency in solr

2021-02-03 Thread Manisha Rahatadkar
Hi All

Looking for some help on document indexing frequency. I am using apache solr 
7.7 and SolrNet library to commit documents to Solr. Summary for this function 
is:
// Summary:
// Commits posted documents, blocking until index changes are flushed to 
disk and
// blocking until a new searcher is opened and registered as the main query 
searcher,
// making the changes visible.

I understand that, the document gets reindexed after every commit. I have 
noticed that as the number of documents are increasing, the reindexing takes 
time. and sometimes I am getting solr connection time out error.
I have following questions:

  1.  Is there any frequency suggested by Solr for document insert/update and 
reindex? Is there any standard recommendation?
  2.  If I remove the copy fields from managed-schema.xml, do I need to delete 
the existing indexed data from solr core and then insert data and reindex it 
again?

Thanks in advance.

Regards
Manisha



Confidentiality Notice

This email message, including any attachments, is for the sole use of the 
intended recipient and may contain confidential and privileged information. Any 
unauthorized view, use, disclosure or distribution is prohibited. If you are 
not the intended recipient, please contact the sender by reply email and 
destroy all copies of the original message. Anju Software, Inc. 4500 S. 
Lakeshore Drive, Suite 620, Tempe, AZ USA 85282.


Re: NRT - Indexing

2021-02-02 Thread Dominique Bejean
Hi,

The issue was buildOnCommit=true on a SuggestComponent.

Dominique

Le mar. 2 févr. 2021 à 00:54, Shawn Heisey  a écrit :

> On 2/1/2021 12:08 AM, haris.k...@vnc.biz wrote:
> > Hope you're doing good. I am trying to configure NRT - Indexing in my
> > project. For this reason, I have configured *autoSoftCommit* to execute
> > every second and *autoCommit* to execute every 5 minutes. Everything
> > works as expected on the dev and test server. But on the production
> > server, there are more than 6 million documents indexed in Solr, so
> > whenever a new document is indexed it takes 2-3 minutes before appearing
> > in the search despite the setting I have described above. Since the
> > target is to develop a real-time system, this delay of 2-3 minutes is
> > not acceptable. How can I reduce this time window?
>
> Setting autoSoftCommit with a max time of 1000 (one second) does not
> mean you will see changes within one second.  It means that one second
> after indexing begins, Solr will start a soft commit operation.  That
> commit operation must fully complete and the new searcher must come
> online before changes are visible.  Those steps may take much longer
> than one second, which seems to be happening on your system.
>
> With the information available, I cannot tell you why your commits are
> taking so long.  One of the most common reasons for poor Solr
> performance is a lack of free memory on the system for caching purposes.
>
> Thanks,
> Shawn
>


Re: NRT - Indexing

2021-02-01 Thread Shawn Heisey

On 2/1/2021 12:08 AM, haris.k...@vnc.biz wrote:
Hope you're doing good. I am trying to configure NRT - Indexing in my 
project. For this reason, I have configured *autoSoftCommit* to execute 
every second and *autoCommit* to execute every 5 minutes. Everything 
works as expected on the dev and test server. But on the production 
server, there are more than 6 million documents indexed in Solr, so 
whenever a new document is indexed it takes 2-3 minutes before appearing 
in the search despite the setting I have described above. Since the 
target is to develop a real-time system, this delay of 2-3 minutes is 
not acceptable. How can I reduce this time window?


Setting autoSoftCommit with a max time of 1000 (one second) does not 
mean you will see changes within one second.  It means that one second 
after indexing begins, Solr will start a soft commit operation.  That 
commit operation must fully complete and the new searcher must come 
online before changes are visible.  Those steps may take much longer 
than one second, which seems to be happening on your system.


With the information available, I cannot tell you why your commits are 
taking so long.  One of the most common reasons for poor Solr 
performance is a lack of free memory on the system for caching purposes.


Thanks,
Shawn


Re: NRT - Indexing

2021-02-01 Thread Dominique Bejean
Hi,

It is not the cause of your issue, but your Solr version is 8.6.0 and your
solrconfig.xml includes <luceneMatchVersion>7.5.0</luceneMatchVersion>.

By "I am using a service that fetches data from the Postgres database and
indexes it to solr. The service runs with a delay of 5 seconds.", you mean
you are using DIH and launching a delta-import every 5 seconds?

Solr logs may help.

Dominique



Le lun. 1 févr. 2021 à 13:00,  a écrit :

> Hello,
>
>
> I am attaching the solrconfig.xml along with this email, also I am
> attaching a text document that has JSON object regarding the system
> information I am using a service that fetches data from the Postgres
> database and indexes it to solr. The service runs with a delay of 5 seconds.
>
>
> Regards
>
>
> Mit freundlichen Grüssen / Kind regards
>
>
> Muhammad Haris Khan
>
>
> *VNC - Virtual Network Consult*
>
>
> *-- Solr Ingenieur --*
>
>
> - On 1 February, 2021 3:50 PM, Dominique Bejean <
> dominique.bej...@eolya.fr> wrote:
>
>
>
> Hi,
>
>
> What is your Solr version ?
>
> Can you share your solrconfig.xml ?
>
> How is your sharding ?
>
> Did you grep your solr logs on with the "commit' pattern in order to see
>
> hard and soft commit occurrences ?
>
> How are you pushing new docs or updates in the collection ?
>
>
> Regards.
>
>
> Dominique
>
>
>
>
>
> Le lun. 1 févr. 2021 à 08:08,  a écrit :
>
>
> > Hello,
>
> >
>
> > Hope you're doing good. I am trying to configure NRT - Indexing in my
>
> > project. For this reason, I have configured *autoSoftCommit* to execute
>
> > every second and *autoCommit* to execute every 5 minutes. Everything
>
> > works as expected on the dev and test server. But on the production
> server,
>
> > there are more than 6 million documents indexed in Solr, so whenever a
> new
>
> > document is indexed it takes 2-3 minutes before appearing in the search
>
> > despite the setting I have described above. Since the target is to
> develop
>
> > a real-time system, this delay of 2-3 minutes is not acceptable. How can
> I
>
> > reduce this time window?
>
> >
>
> > Plus any advice on better scaling the Solr considering more than 6
> million
>
> > records would be very helpful. Thank you in advance.
>
> >
>
> >
>
> >
>
> > Mit freundlichen Grüssen / Kind regards
>
> >
>
> > Muhammad Haris Khan
>
> >
>
> > *VNC - Virtual Network Consult*
>
> >
>
> > *-- Solr Ingenieur --*
>
> >
>


Re: NRT - Indexing

2021-02-01 Thread haris . khan
Hello,

I am attaching the solrconfig.xml along with this email, also I am attaching a
text document that has a JSON object regarding the system information. I am
using a service that fetches data from the Postgres database and indexes it to
solr. The service runs with a delay of 5 seconds.

Regards

Mit freundlichen Grüssen / Kind regards

Muhammad Haris Khan

VNC - Virtual Network Consult

-- Solr Ingenieur --

- On 1 February, 2021 3:50 PM, Dominique Bejean dominique.bej...@eolya.fr wrote:

> Hi,
>
> What is your Solr version ?
> Can you share your solrconfig.xml ?
> How is your sharding ?
> Did you grep your solr logs with the "commit" pattern in order to see
> hard and soft commit occurrences ?
> How are you pushing new docs or updates in the collection ?
>
> Regards.
> Dominique
>
> Le lun. 1 févr. 2021 à 08:08, haris.k...@vnc.biz a écrit :
>
> > Hello,
> >
> > Hope you're doing good. I am trying to configure NRT - Indexing in my
> > project. For this reason, I have configured *autoSoftCommit* to execute
> > every second and *autoCommit* to execute every 5 minutes. Everything works
> > as expected on the dev and test server. But on the production server, there
> > are more than 6 million documents indexed in Solr, so whenever a new
> > document is indexed it takes 2-3 minutes before appearing in the search
> > despite the setting I have described above. Since the target is to develop
> > a real-time system, this delay of 2-3 minutes is not acceptable. How can I
> > reduce this time window?
> >
> > Plus any advice on better scaling the Solr considering more than 6 million
> > records would be very helpful. Thank you in advance.
> >
> > Mit freundlichen Grüssen / Kind regards
> >
> > Muhammad Haris Khan
> >
> > *VNC - Virtual Network Consult*
> >
> > *-- Solr Ingenieur --*

solrconfig.xml
Description: XML document


Re: NRT - Indexing

2021-02-01 Thread Dominique Bejean
Hi,

What is your Solr version ?
Can you share your solrconfig.xml ?
How is your sharding ?
Did you grep your solr logs on with the "commit' pattern in order to see
hard and soft commit occurrences ?
How are you pushing new docs or updates in the collection ?

Regards.

Dominique




Le lun. 1 févr. 2021 à 08:08,  a écrit :

> Hello,
>
> Hope you're doing good. I am trying to configure NRT - Indexing in my
> project. For this reason, I have configured *autoSoftCommit* to execute
> every second and *autoCommit* to execute every 5 minutes. Everything
> works as expected on the dev and test server. But on the production server,
> there are more than 6 million documents indexed in Solr, so whenever a new
> document is indexed it takes 2-3 minutes before appearing in the search
> despite the setting I have described above. Since the target is to develop
> a real-time system, this delay of 2-3 minutes is not acceptable. How can I
> reduce this time window?
>
> Plus any advice on better scaling the Solr considering more than 6 million
> records would be very helpful. Thank you in advance.
>
>
>
> Mit freundlichen Grüssen / Kind regards
>
> Muhammad Haris Khan
>
> *VNC - Virtual Network Consult*
>
> *-- Solr Ingenieur --*
>


Re: NRT - Indexing

2021-02-01 Thread Mr Havercamp
I'm running into the same issue. I've set autoSoftCommit and autoCommit but
the speed at which docs are indexed seems to be inconsistent with the
settings. I have lowered the autoCommit to a minute but it still takes a
few minutes for docs to show after indexing. Soft commit settings also seem
to have no effect (from what I understand of the docs, Soft commit makes
items viewable but I'm not seeing them until well after the autoCommit
period has passed.

On Mon, 1 Feb 2021 at 15:08,  wrote:

> Hello,
>
> Hope you're doing good. I am trying to configure NRT - Indexing in my
> project. For this reason, I have configured *autoSoftCommit* to execute
> every second and *autoCommit* to execute every 5 minutes. Everything
> works as expected on the dev and test server. But on the production server,
> there are more than 6 million documents indexed in Solr, so whenever a new
> document is indexed it takes 2-3 minutes before appearing in the search
> despite the setting I have described above. Since the target is to develop
> a real-time system, this delay of 2-3 minutes is not acceptable. How can I
> reduce this time window?
>
> Plus any advice on better scaling the Solr considering more than 6 million
> records would be very helpful. Thank you in advance.
>
>
>
> Mit freundlichen Grüssen / Kind regards
>
> Muhammad Haris Khan
>
> *VNC - Virtual Network Consult*
>
> *-- Solr Ingenieur --*
>


NRT - Indexing

2021-01-31 Thread haris . khan
Hello,

Hope you're doing good. I am trying to configure NRT - Indexing in my project. 
For this reason, I have configured autoSoftCommit to execute every
second and autoCommit to execute every 5 minutes. Everything works as
expected on the dev and test server. But on the production server, there are 
more than 6 million documents indexed in Solr, so whenever a new document is 
indexed it takes 2-3 minutes before appearing in the search despite the setting 
I have described above. Since the target is to develop a real-time system, this 
delay of 2-3 minutes is not acceptable. How can I reduce this time window?

Plus any advice on better scaling the Solr considering more than 6 million 
records would be very helpful. Thank you in advance.



Mit freundlichen Grüssen / Kind regards

Muhammad Haris Khan

VNC - Virtual Network Consult

-- Solr Ingenieur --


NRT - Indexing

2021-01-29 Thread haris . khan
Hello,

Hope you're doing good. I am trying to configure NRT - Indexing in my project. 
For this reason, I have configured autoSoftCommit to execute every second and 
autoCommit to execute every 5 minutes. Everything works as expected on the dev 
and test server. But on the production server, there are more than 6 million 
documents indexed in Solr, so whenever a new document is indexed it takes 2-3 
minutes before appearing in the search despite the setting I have described 
above. Since the target is to develop a real-time system, this delay of 2-3 
minutes is not acceptable. How can I reduce this time window?

Plus any advice on better scaling the Solr considering more than 6 million 
records would be very helpful. Thank you in advance.


Mit freundlichen Grüssen / Kind regards

Muhammad Haris Khan

VNC - Virtual Network Consult

-- Solr Ingenieur --


Re: Re:Interpreting Solr indexing times

2021-01-13 Thread Alessandro Benedetti
I agree, documents may be gigantic or very small,  with heavy text analysis
or simple strings ...
so it's not possible to give an evaluation here.
But you could make use of the nightly benchmark to give you an idea of
Lucene indexing speed (the engine inside Apache Solr) :

http://home.apache.org/~mikemccand/lucenebench/indexing.html

Not sure we have something similar for Apache Solr officially.
https://cwiki.apache.org/confluence/display/SOLR/SolrPerformanceData -> this
should be a bit outdated

Cheers



-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html


RE: [Solr8.7] Indexing only some language ?

2021-01-10 Thread Bruno Mannina
Perfect ! Thanks !

-Message d'origine-
De : xiefengchang [mailto:fengchang_fi...@163.com]
Envoyé : dimanche 10 janvier 2021 04:50
À : solr-user@lucene.apache.org
Objet : Re:[Solr8.7] Indexing only some language ?

Take a look at the document here:
https://lucene.apache.org/solr/guide/8_7/dynamic-fields.html#dynamic-fields


here's the point: "a field that does not match any explicitly defined fields
can be matched with a dynamic field."


so I guess the priority is quite clear~

















At 2021-01-10 03:38:01, "Bruno Mannina"  wrote:
>Hello,
>
>
>
>I would like to define in my schema.xml some text_xx fields.
>
>I have patent titles in several languages.
>
>Only 6 of them (EN, IT, FR, PT, ES, DE) interest me.
>
>
>
>I know how to define these 6 fields, I use text_en, text_it etc.
>
>
>
>i.e. for English language:
>
>stored="true" termVectors="true" termPositions="true"
>termOffsets="true"/>
>
>
>
>But I have more than 6 languages like: AR, CN, JP, KR etc.
>
>I can't analyze all source files to detect all languages and define
>them in my schema.
>
>
>
>I would like to use a dynamic field to index other languages.
>
>indexed="true" stored="true" omitTermFreqAndPositions="true"
>omitNorms="true"/>
>
>
>
>Is it ok to do that?
>
>Is TIEN field will be indexed twice internally or as tien is already
>defined
>ti* will not process tien?
>
>
>
>Thanks for your kind reply,
>
>
>
>Sincerely
>
>Bruno
>
>
>
>
>
>
>
>
>
>--
>L'absence de virus dans ce courrier électronique a été vérifiée par le
logiciel antivirus Avast.
>https://www.avast.com/antivirus


--
L'absence de virus dans ce courrier électronique a été vérifiée par le logiciel 
antivirus Avast.
https://www.avast.com/antivirus



Re:Interpreting Solr indexing times

2021-01-10 Thread xiefengchang
it's hard to answer your question without your solrconfig.xml, 
managed-schema(or schema.xml), and good to have some log snippet as well~

















At 2021-01-07 21:28:00, "ufuk yılmaz"  wrote:
>Hello all,
>
>I have been looking at our SolrCloud indexing performance statistics and 
>trying to make sense of the numbers. We are using a custom Flume sink and 
>sending updates to Solr (8.4) using SolrJ.
>
>I know these stuff depend on a lot of things but can you tell me if these 
>statistics are horribly bad (which means something is going obviously wrong), 
>or something expectable from a Solr cluster under right circumstances?
>
>We are sending documents in batches of 1000.
>
>{
>  "UPDATE./update.distrib.requestTimes": {
>"count": 7579,
>"meanRate": 0.044953336300254124,
>"1minRate": 0.2855655259375961,
>"5minRate": 0.29214637836736357,
>"15minRate": 0.29510868125823914,
>"min_ms": 5.854106,
>"max_ms": 56854.784017,
>"mean_ms": 3100.877968690649,
>"median_ms": 1084.258683,
>"stddev_ms": 4643.097311691323,
>"p75_ms": 2407.196867,
>"p95_ms": 15509.748909,
>"p99_ms": 16206.134345,
>"p999_ms": 16206.134345
>  },
>  "UPDATE./update.local.totalTime": 0,
>  "UPDATE./update.requestTimes": {
>"count": 7579,
>"meanRate": 0.044953336230621366,
>"1minRate": 0.2855655259375961,
>"5minRate": 0.29214637836736357,
>"15minRate": 0.29510868125823914,
>"min_ms": 5.857796,
>"max_ms": 56854.792298,
>"mean_ms": 3100.885675292589,
>"median_ms": 1084.264825,
>"stddev_ms": 4643.097457508117,
>"p75_ms": 2407.201642,
>"p95_ms": 15509.755934,
>"p99_ms": 16206.141754,
>"p999_ms": 16206.141754
>  },
>  "UPDATE./update.requests": 7580,
>  "UPDATE./update.totalTime": 33520426747162,
>  "UPDATE.update.totalTime": 0,
>  "UPDATE.updateHandler.adds": 854,
>  "UPDATE.updateHandler.autoCommitMaxTime": "15000ms",
>  "UPDATE.updateHandler.autoCommits": 2428,
>  "UPDATE.updateHandler.softAutoCommitMaxTime":"1ms",
>  "UPDATE.updateHandler.softAutoCommits":3380,
>  "UPDATE.updateHandler.commits": {
>"count": 5777,
>"meanRate": 0.034265134931240636,
>"1minRate": 0.13653886429826526,
>"5minRate": 0.12997330621941325,
>"15minRate": 0.12634106125326003
>  },
>  "UPDATE.updateHandler.cumulativeAdds": {
>"count": 2578492,
>"meanRate": 15.293816240408821,
>"1minRate": 90.7054223213904,
>"5minRate": 99.48315440730897,
>"15minRate": 101.77967003607128
>  },
>}
>
>
>Sent from Mail for Windows 10
>


Re:[Solr8.7] Indexing only some language ?

2021-01-09 Thread xiefengchang
Take a look at the document here: 
https://lucene.apache.org/solr/guide/8_7/dynamic-fields.html#dynamic-fields


here's the point: "a field that does not match any explicitly defined fields 
can be matched with a dynamic field."


so I guess the priority is quite clear~
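
As a rough sketch of what that means for Bruno's case (the field and type names
below are only illustrative, not his actual schema):

```
<!-- explicitly defined English title field: matched before any dynamic field -->
<field name="tien" type="text_en" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>

<!-- catch-all dynamic field: used only when no explicit field matches -->
<dynamicField name="ti*" type="text_general" indexed="true" stored="true"
              omitTermFreqAndPositions="true" omitNorms="true"/>
```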

















At 2021-01-10 03:38:01, "Bruno Mannina"  wrote:
>Hello,
>
>
>
>I would like to define in my schema.xml some text_xx fields.
>
>I have patent titles in several languages.
>
>Only 6 of them (EN, IT, FR, PT, ES, DE) interest me.
>
>
>
>I know how to define these 6 fields, I use text_en, text_it etc.
>
>
>
>i.e. for English language:
>
>stored="true" termVectors="true" termPositions="true" termOffsets="true"/>
>
>
>
>But I have more than 6 languages like: AR, CN, JP, KR etc.
>
>I can't analyze all source files to detect all languages and define them in
>my schema.
>
>
>
>I would like to use a dynamic field to index other languages.
>
>indexed="true" stored="true" omitTermFreqAndPositions="true"
>omitNorms="true"/>
>
>
>
>Is it ok to do that?
>
>Is TIEN field will be indexed twice internally or as tien is already defined
>ti* will not process tien?
>
>
>
>Thanks for your kind reply,
>
>
>
>Sincerely
>
>Bruno
>
>
>
>
>
>
>
>
>
>--
>L'absence de virus dans ce courrier électronique a été vérifiée par le 
>logiciel antivirus Avast.
>https://www.avast.com/antivirus


[Solr8.7] Indexing only some language ?

2021-01-09 Thread Bruno Mannina
Hello,



I would like to define in my schema.xml some text_xx fields.

I have patent titles in several languages.

Only 6 of them (EN, IT, FR, PT, ES, DE) interest me.



I know how to define these 6 fields, I use text_en, text_it etc.



i.e. for English language:





But I have more than 6 languages like: AR, CN, JP, KR etc.

I can't analyze all source files to detect all languages and define them in
my schema.



I would like to use a dynamic field to index other languages.





Is it ok to do that?

Is TIEN field will be indexed twice internally or as tien is already defined
ti* will not process tien?



Thanks for your kind reply,



Sincerely

Bruno









--
L'absence de virus dans ce courrier électronique a été vérifiée par le logiciel 
antivirus Avast.
https://www.avast.com/antivirus


Interpreting Solr indexing times

2021-01-07 Thread ufuk yılmaz
Hello all,

I have been looking at our SolrCloud indexing performance statistics and trying 
to make sense of the numbers. We are using a custom Flume sink and sending 
updates to Solr (8.4) using SolrJ.

I know this stuff depends on a lot of things, but can you tell me if these
statistics are horribly bad (which means something is obviously going wrong),
or something expectable from a Solr cluster under the right circumstances?

We are sending documents in batches of 1000.

{
  "UPDATE./update.distrib.requestTimes": {
"count": 7579,
"meanRate": 0.044953336300254124,
"1minRate": 0.2855655259375961,
"5minRate": 0.29214637836736357,
"15minRate": 0.29510868125823914,
"min_ms": 5.854106,
"max_ms": 56854.784017,
"mean_ms": 3100.877968690649,
"median_ms": 1084.258683,
"stddev_ms": 4643.097311691323,
"p75_ms": 2407.196867,
"p95_ms": 15509.748909,
"p99_ms": 16206.134345,
"p999_ms": 16206.134345
  },
  "UPDATE./update.local.totalTime": 0,
  "UPDATE./update.requestTimes": {
"count": 7579,
"meanRate": 0.044953336230621366,
"1minRate": 0.2855655259375961,
"5minRate": 0.29214637836736357,
"15minRate": 0.29510868125823914,
"min_ms": 5.857796,
"max_ms": 56854.792298,
"mean_ms": 3100.885675292589,
"median_ms": 1084.264825,
"stddev_ms": 4643.097457508117,
"p75_ms": 2407.201642,
"p95_ms": 15509.755934,
"p99_ms": 16206.141754,
"p999_ms": 16206.141754
  },
  "UPDATE./update.requests": 7580,
  "UPDATE./update.totalTime": 33520426747162,
  "UPDATE.update.totalTime": 0,
  "UPDATE.updateHandler.adds": 854,
  "UPDATE.updateHandler.autoCommitMaxTime": "15000ms",
  "UPDATE.updateHandler.autoCommits": 2428,
  "UPDATE.updateHandler.softAutoCommitMaxTime":"1ms",
  "UPDATE.updateHandler.softAutoCommits":3380,
  "UPDATE.updateHandler.commits": {
"count": 5777,
"meanRate": 0.034265134931240636,
"1minRate": 0.13653886429826526,
"5minRate": 0.12997330621941325,
"15minRate": 0.12634106125326003
  },
  "UPDATE.updateHandler.cumulativeAdds": {
"count": 2578492,
"meanRate": 15.293816240408821,
"1minRate": 90.7054223213904,
"5minRate": 99.48315440730897,
"15minRate": 101.77967003607128
  },
}


Sent from Mail for Windows 10



Re: Indexing performance 7.3 vs 8.7

2020-12-23 Thread Bram Van Dam
On 23/12/2020 16:00, Ron Buchanan wrote:
>   - both run Java 1.8, but 7.3 is running HotSpot and 8.7 is running
>   OpenJDK (and a bit newer)

If you're using G1GC, you probably want to give Java 11 a go. It's an
easy thing to test, and it's had a positive impact for us. Your mileage
may vary.

 - Bram


Indexing performance 7.3 vs 8.7

2020-12-23 Thread Ron Buchanan
(this is long, just trying to be thorough)

I'm working on upgrading from Solr 7.3 to Solr 8.7 and I am seeing a
significant drop in indexing throughput during a full index reload - from
~1300 documents per second to ~450 documents/sec

Background:

VM hosts (these are configured identically):


   - Our Solr clusters run in a virtualized environment.
  - Each Virtual Machine has 8 CPUs and 64Gb RAM.
  - The hosts are organized into 2 4-host clusters - one for 7.3 and
  one for 8.7.
  - Each cluster has its own 3 VM Zookeeper cluster (running the
  version that was current at the time of install).


JVM:


   - all the JVMs are set-up with -Xms28G and -Xmx28Gb
  - the Solr 8.7 cluster is running with the default JVM settings
  (i.e., as configured by the Solr install script) **other than memory**
  - the Solr 7.3 cluster was configured awhile ago, but I'm fairly sure
  it's running pretty vanilla JVM settings (if not outright
default) **other
  than memory**
  - the most obvious difference between the JVM settings for the
  environments is the garbage collector: ConcurrentMarkSweep for
7.3 and G1GC
  for 8.7
  - both run Java 1.8, but 7.3 is running HotSpot and 8.7 is running
  OpenJDK (and a bit newer)


Solr:


   - 1 shard, 1 replica per host - all NRT (both clusters)
  - Both the Solr 7.3 and 8.7 clusters are running the same schema
  - with one exception, only the most minimal changes were made to the
  default Solr 8.7 solrconfig.xml to keep it in-line with the 7.3
solrconfig
  (mostly around Cache settings)
 - the exception: running with luceneMatchVersion=7.3.0


Data Loading:


   - Data is loaded by a completely separate VM running a custom Java
  process that collects data from source and generates SolrInputDocuments
  from that source and sends it via CloudSolrClient
  - this Java process is multi-threaded with an upper-limit on the
  number of simultaneous threads sending documents and the size of the
  document payload
  - we are loading ~10 million documents during a full-reload - this is
  a product catalog, so the documents actually represent data about SKUs we
  sell (and they aren't particularly large, though the size is variable)
  - the existing Solr 7.3 cluster has a full-reload time of around 2.5
  hours, the Solr 8.7 cluster requires around 6.25 hours


Efforts so far:

   - checked network speed from the VM generating updates (it's the same
   server for both 7.3 and 8.7) and the clusters
  - performance to the 8.7 cluster is actually better
   - as best as possible, controlling for VM topology (i.e., distribution
   of the VMs across hosts within the VM cluster)
   - real-time JVM monitoring with VisualVM during indexing on 8.7 cluster
  - looked nice - same as I've always seen for the 7.3 cluster
   - checked the GC logs with GCEasy
  - reported as healthy


Thoughts/questions/considerations:

   - could running an older LuceneMatchVersion affect indexing performance?
   - still a little concerned that the VM topology is affecting things (our
   VM-crew split the 7.3 cluster across VM clusters in an attempt to improve
   resiliency in case VM cluster failure and that's not something we can or
   want to replicate) - that said, the performance difference is consistent
   with what I've seen in our QA environment and that environment has a less
   even spread of VMs across hosts (e.g., multiple Solr VMs on the same VM
   host)
   - we have a couple of custom tokenizers and tokenFilters - those were
   rebuilt using the 8.7.0 versions of solr-core and apache-core - they're
   pretty simple and I'm not terribly concerned about this, but it is
   non-standard
   - query performance is comparable between 7.3 and 8.7 and documents
   returned are reasonably consistent (few really big differences, mostly just
   scoring differences that affect ordering)
   - after watching the 8.7 JVMs in real-time during indexing, I decided to
   drop the memory to -Xms20g and -Xmx20g - this had no effect on indexing
   speed (or GC impacts) - so, I think it's at least safe to say this is not
   memory-bound


Final question:

is it simply typical to see significantly worse indexing performance on 8.7
than 7.3?

Any suggestions on where to look would be highly appreciated.

Thanks,

Ron


Re: SOLR 8.6.0 date Indexing Issues.

2020-11-20 Thread Jörn Franke
You should format the date according to the ISO standard:

https://lucene.apache.org/solr/guide/6_6/working-with-dates.html

Eg. 2018-07-12T00:00:00Z

You can either transform the date that you have in Solr or in your client 
pushing the doc to Solr. 
All major programming languages have date utilities that allow you to do this
transformation easily.
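
For example, a minimal Java conversion along those lines (assuming the source
values always look like "12-Jul-18") could be:

```
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

public class SolrDateConverter {
    // parses values such as "12-Jul-18" as they come out of MongoDB
    private static final DateTimeFormatter SOURCE =
            DateTimeFormatter.ofPattern("d-MMM-yy", Locale.ENGLISH);

    public static String toSolrDate(String raw) {
        LocalDate date = LocalDate.parse(raw, SOURCE);
        // Solr date fields expect ISO-8601 UTC, e.g. 2018-07-12T00:00:00Z
        return date.atStartOfDay(ZoneOffset.UTC)
                   .format(DateTimeFormatter.ISO_INSTANT);
    }

    public static void main(String[] args) {
        System.out.println(toSolrDate("12-Jul-18")); // prints 2018-07-12T00:00:00Z
    }
}
```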


> Am 20.11.2020 um 21:50 schrieb Fiz N :
> 
> Hello Experts,
> 
> I am having  issues with indexing Date field in SOLR 8.6.0. I am indexing
> from MongoDB. In MongoDB the Format is as follows
> 
> 
> * "R_CREATION_DATE" : "12-Jul-18",  "R_MODIFY_DATE" : "30-Apr-19", *
> 
> In my Managed Schema I have the following entries.
> 
> 
> 
> 
> .
> 
> I am getting an error in the Solr log.
> 
> * org.apache.solr.common.SolrException: ERROR: [doc=mt_100] Error adding
> field 'R_MODIFY_DATE'='15-Jul-19' msg=Couldn't parse date because:
> Improperly formatted datetime: 15-Jul-19*
> 
> Please let me know how to handle this usecase with Date format
> "12-JUL-18". what changes should I do to make it work ?
> 
> Thanks
> Fiz N.


SOLR 8.6.0 date Indexing Issues.

2020-11-20 Thread Fiz N
Hello Experts,

I am having  issues with indexing Date field in SOLR 8.6.0. I am indexing
from MongoDB. In MongoDB the Format is as follows


* "R_CREATION_DATE" : "12-Jul-18",  "R_MODIFY_DATE" : "30-Apr-19", *

 In my Managed Schema I have the following entries.
 
 


 .

 I am getting an error in the Solr log.

* org.apache.solr.common.SolrException: ERROR: [doc=mt_100] Error adding
field 'R_MODIFY_DATE'='15-Jul-19' msg=Couldn't parse date because:
Improperly formatted datetime: 15-Jul-19*

 Please let me know how to handle this usecase with Date format
"12-JUL-18". what changes should I do to make it work ?

 Thanks
 Fiz N.


Solr 7.7 Indexing issue

2020-09-30 Thread Manisha Rahatadkar
Hello all

We are using Apache Solr 7.7 on Windows platform. The data is synced to Solr 
using Solr.Net commit. The data is being synced to SOLR in batches. The 
document size is very huge (~0.5GB average) and solr indexing is taking long 
time. Total document size is ~200GB. As the solr commit is done as a part of 
API, the API calls are failing as document indexing is not completed.

  1.  What is your advice on syncing such a large volume of data to Solr KB.
  2.  Because of the search fields requirements, almost 8 fields are defined as 
Text fields.
  3.  Currently Solr_JAVA_MEM is set to 2gb. Is that enough for such a large 
volume of data? ( IF "%SOLR_JAVA_MEM%"=="" set SOLR_JAVA_MEM=-Xms2g -Xmx2g)
  4.  How to set up Solr in production on Windows? Currently it's set up as a 
standalone engine and client is requested to take the backup of the drive. Is 
there any other better way to do? How to set up for the disaster recovery?

Thanks in advance.

Regards
Manisha Rahatadkar


Confidentiality Notice

This email message, including any attachments, is for the sole use of the 
intended recipient and may contain confidential and privileged information. Any 
unauthorized view, use, disclosure or distribution is prohibited. If you are 
not the intended recipient, please contact the sender by reply email and 
destroy all copies of the original message. Anju Software, Inc. 4500 S. 
Lakeshore Drive, Suite 620, Tempe, AZ USA 85282.


Re: Exclude a folder/directory from indexing

2020-08-28 Thread Walter Underwood
For building a crawler, I’d start with Scrapy (https://scrapy.org 
<https://scrapy.org/>). It is a solid design and
should be easy to use for crawling web pages, files, or an API. 

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Aug 28, 2020, at 4:16 AM, Joe Doupnik  wrote:
> 
> Some time ago I faced a roughly similar challenge. After many trials and 
> tests I ended up creating my own programs to accomplish the tasks of fetching 
> files, selecting which are allowed to be indexed, and feeding them into Solr 
> (POST style). This work is open source, found on https://netlab1.net/, web 
> page section titled Presentations of long term utility, item Solr/Lucene 
> Search Service. This is a set of docs, three small PHP programs, and a Solr 
> schema etc bundle, all within one downloadable zip file.
> On filtering found files, my solution uses a list of regular expressions 
> which are simple to state and to process. The docs discuss the rules. 
> Luckily, the code dealing with rules per se and doing the filtering is very 
> short and simple; see crawler.php for convertfilter() and filterbyname(). 
> Thus you may wish to consider them or equivalents for inclusion in your 
> system, whatever that may be.
> Thanks,
> Joe D.
> 
> On 27/08/2020 20:32, Alexandre Rafalovitch wrote:
>> If you are indexing from Drupal into Solr, that's the question for
>> Drupal's solr module. If you are doing it some other way, which way
>> are you doing it? bin/post command?
>> 
>> Most likely this is not the Solr question, but whatever you have
>> feeding data into Solr.
>> 
>> Regards,
>>   Alex.
>> 
>> On Thu, 27 Aug 2020 at 15:21, Staley, Phil R - DCF
>>  wrote:
>>> Can you or how do you exclude a specific folder/directory from indexing in 
>>> SOLR version 7.x or 8.x?   Also our CMS is Drupal 8
>>> 
>>> Thanks,
>>> 
>>> Phil Staley
>>> DCF Webmaster
>>> 608 422-6569
>>> phil.sta...@wisconsin.gov
>>> 
>>> 
> 



Re: Exclude a folder/directory from indexing

2020-08-28 Thread Joe Doupnik
    Some time ago I faced a roughly similar challenge. After many 
trials and tests I ended up creating my own programs to accomplish the 
tasks of fetching files, selecting which are allowed to be indexed, and 
feeding them into Solr (POST style). This work is open source, found on 
https://netlab1.net/, web page section titled Presentations of long term 
utility, item Solr/Lucene Search Service. This is a set of docs, three 
small PHP programs, and a Solr schema etc bundle, all within one 
downloadable zip file.
    On filtering found files, my solution uses a list of regular 
expressions which are simple to state and to process. The docs discuss 
the rules. Luckily, the code dealing with rules per se and doing the 
filtering is very short and simple; see crawler.php for convertfilter() 
and filterbyname(). Thus you may wish to consider them or equivalents 
for inclusion in your system, whatever that may be.

    Thanks,
    Joe D.

On 27/08/2020 20:32, Alexandre Rafalovitch wrote:

If you are indexing from Drupal into Solr, that's the question for
Drupal's solr module. If you are doing it some other way, which way
are you doing it? bin/post command?

Most likely this is not the Solr question, but whatever you have
feeding data into Solr.

Regards,
   Alex.

On Thu, 27 Aug 2020 at 15:21, Staley, Phil R - DCF
 wrote:

Can you or how do you exclude a specific folder/directory from indexing in SOLR 
version 7.x or 8.x?   Also our CMS is Drupal 8

Thanks,

Phil Staley
DCF Webmaster
608 422-6569
phil.sta...@wisconsin.gov






Re: Exclude a folder/directory from indexing

2020-08-27 Thread Alexandre Rafalovitch
If you are indexing from Drupal into Solr, that's the question for
Drupal's solr module. If you are doing it some other way, which way
are you doing it? bin/post command?

Most likely this is not the Solr question, but whatever you have
feeding data into Solr.

Regards,
  Alex.

On Thu, 27 Aug 2020 at 15:21, Staley, Phil R - DCF
 wrote:
>
> Can you or how do you exclude a specific folder/directory from indexing in 
> SOLR version 7.x or 8.x?   Also our CMS is Drupal 8
>
> Thanks,
>
> Phil Staley
> DCF Webmaster
> 608 422-6569
> phil.sta...@wisconsin.gov
>
>


Exclude a folder/directory from indexing

2020-08-27 Thread Staley, Phil R - DCF
Can you or how do you exclude a specific folder/directory from indexing in SOLR 
version 7.x or 8.x?   Also our CMS is Drupal 8

Thanks,

Phil Staley
DCF Webmaster
608 422-6569
phil.sta...@wisconsin.gov




Re: SOLR indexing takes longer time

2020-08-18 Thread Walter Underwood
Instead of writing code, I’d fire up SQL Workbench/J, load the same JDBC driver
that is being used in Solr, and run the query.

https://www.sql-workbench.eu <https://www.sql-workbench.eu/>

If that takes 3.5 hours, you have isolated the problem.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Aug 18, 2020, at 6:50 AM, David Hastings  
> wrote:
> 
> Another thing to mention is to make sure the indexer you build doesnt send
> commits until its actually done.  Made that mistake with some early in
> house indexers.
> 
> On Tue, Aug 18, 2020 at 9:38 AM Charlie Hull  wrote:
> 
>> 1. You could write some code to pull the items out of Mongo and dump
>> them to disk - if this is still slow, then it's Mongo that's the problem.
>> 2. Write a standalone indexer to replace DIH, it's single threaded and
>> deprecated anyway.
>> 3. Minor point - consider whether you need to index everything every
>> time or just the deltas.
>> 4. Upgrade Solr anyway, not for speed reasons but because that's a very
>> old version you're running.
>> 
>> HTH
>> 
>> Charlie
>> 
>> On 17/08/2020 19:22, Abhijit Pawar wrote:
>>> Hello,
>>> 
>>> We are indexing some 200K plus documents in SOLR 5.4.1 with no shards /
>>> replicas and just single core.
>>> It takes almost 3.5 hours to index that data.
>>> I am using a data import handler to import data from the mongo database.
>>> 
>>> Is there something we can do to reduce the time taken to index?
>>> Will upgrade to newer version help?
>>> 
>>> Appreciate your help!
>>> 
>>> Regards,
>>> Abhijit
>>> 
>> 
>> --
>> Charlie Hull
>> OpenSource Connections, previously Flax
>> 
>> tel/fax: +44 (0)8700 118334
>> mobile:  +44 (0)7767 825828
>> web: www.o19s.com
>> 
>> 



Re: SOLR indexing takes longer time

2020-08-18 Thread David Hastings
Another thing to mention is to make sure the indexer you build doesn't send
commits until it's actually done.  Made that mistake with some early in-house
indexers.

On Tue, Aug 18, 2020 at 9:38 AM Charlie Hull  wrote:

> 1. You could write some code to pull the items out of Mongo and dump
> them to disk - if this is still slow, then it's Mongo that's the problem.
> 2. Write a standalone indexer to replace DIH, it's single threaded and
> deprecated anyway.
> 3. Minor point - consider whether you need to index everything every
> time or just the deltas.
> 4. Upgrade Solr anyway, not for speed reasons but because that's a very
> old version you're running.
>
> HTH
>
> Charlie
>
> On 17/08/2020 19:22, Abhijit Pawar wrote:
> > Hello,
> >
> > We are indexing some 200K plus documents in SOLR 5.4.1 with no shards /
> > replicas and just single core.
> > It takes almost 3.5 hours to index that data.
> > I am using a data import handler to import data from the mongo database.
> >
> > Is there something we can do to reduce the time taken to index?
> > Will upgrade to newer version help?
> >
> > Appreciate your help!
> >
> > Regards,
> > Abhijit
> >
>
> --
> Charlie Hull
> OpenSource Connections, previously Flax
>
> tel/fax: +44 (0)8700 118334
> mobile:  +44 (0)7767 825828
> web: www.o19s.com
>
>


Re: SOLR indexing takes longer time

2020-08-18 Thread Charlie Hull
1. You could write some code to pull the items out of Mongo and dump 
them to disk - if this is still slow, then it's Mongo that's the problem.
2. Write a standalone indexer to replace DIH, it's single threaded and 
deprecated anyway.
3. Minor point - consider whether you need to index everything every 
time or just the deltas.
4. Upgrade Solr anyway, not for speed reasons but because that's a very 
old version you're running.


HTH

Charlie

On 17/08/2020 19:22, Abhijit Pawar wrote:

Hello,

We are indexing some 200K plus documents in SOLR 5.4.1 with no shards /
replicas and just single core.
It takes almost 3.5 hours to index that data.
I am using a data import handler to import data from the mongo database.

Is there something we can do to reduce the time taken to index?
Will upgrade to newer version help?

Appreciate your help!

Regards,
Abhijit



--
Charlie Hull
OpenSource Connections, previously Flax

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.o19s.com



Re: SOLR indexing takes longer time

2020-08-17 Thread Aroop Ganguly
Adding on to what others have said, indexing speed in general is largely 
affected by the parallelism and isolation you can give to each node.
Is there a reason why you cannot have more than 1 shard?
If you have a 5 node cluster, why not have 5 shards? maxShardsPerNode=1 with replica=1
is ok. You should see dramatic gains.
Solr’s power and speed in doing everything comes from using it as a distributed
system. By sharding more you will be using the benefit of that distributed
capability.
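
For example, a 5-shard collection can be created with the Collections API
roughly like this (the collection name here is just an example):

```
http://localhost:8983/solr/admin/collections?action=CREATE&name=products&numShards=5&replicationFactor=1&maxShardsPerNode=1
```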

HTH

Regards
Aroop

> On Aug 17, 2020, at 11:22 AM, Abhijit Pawar  wrote:
> 
> Hello,
> 
> We are indexing some 200K plus documents in SOLR 5.4.1 with no shards /
> replicas and just single core.
> It takes almost 3.5 hours to index that data.
> I am using a data import handler to import data from the mongo database.
> 
> Is there something we can do to reduce the time taken to index?
> Will upgrade to newer version help?
> 
> Appreciate your help!
> 
> Regards,
> Abhijit



Re: SOLR indexing takes longer time

2020-08-17 Thread Shawn Heisey

On 8/17/2020 12:22 PM, Abhijit Pawar wrote:

We are indexing some 200K plus documents in SOLR 5.4.1 with no shards /
replicas and just single core.
It takes almost 3.5 hours to index that data.
I am using a data import handler to import data from the mongo database.

Is there something we can do to reduce the time taken to index?
Will upgrade to newer version help?


There's not enough information here to provide a diagnosis.

Are you running Solr in cloud mode (with zookeeper)?

3.5 hours for 200K documents sounds like slowness with the data 
source, not a problem with Solr, but it's too soon to rule anything out.


Would you be able to write a program that pulls data from your mongo 
database but doesn't send it to Solr?  Ideally it would be a Java 
program using the same JDBC driver you're using with DIH.


Thanks,
Shawn



Re: SOLR indexing takes longer time

2020-08-17 Thread Walter Underwood
I’m seeing multiple red flags for performance here. The top ones are “DIH”,
“MongoDB”, and “SQL on MongoDB”. MongoDB is not a relational database.

Our multi-threaded extractor using the Mongo API was still three times slower
than the same approach on MySQL.

Check the CPU usage on the Solr hosts while you are indexing. If it is under 
50%, the bottleneck is MongoDB and single-threaded indexing.

For another check, run that same query in a regular database client and time it.
The Solr indexing will never be faster than that.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Aug 17, 2020, at 11:58 AM, Abhijit Pawar  wrote:
> 
> Sure Divye,
> 
> *Here's the config.*
> 
> *conf/solr-config.xml:*
> 
> <requestHandler name="..." class="org.apache.solr.handler.dataimport.DataImportHandler">
>   <lst name="defaults">
>     <str name="config">/home/ec2-user/solr/solr-5.4.1/server/solr/test_core/conf/dataimport/data-source-config.xml</str>
>   </lst>
> </requestHandler>
> 
> *schema.xml:*
> has all of the field definitions
> 
> *conf/dataimport/data-source-config.xml*
> 
> <dataConfig>
> <dataSource name="mongod" driver="com.mongodb.jdbc.MongoDriver" url="mongodb://<<ADDRESS>>:27017/<>"/>
> <document>
> <entity name="..."
> dataSource="mongod"
> transformer="<>,TemplateTransformer"
> onError="continue"
> pk="uuid"
> query="SELECT field1,field2,field3,.. FROM products"
> deltaImportQuery="SELECT field1,field2,field3,.. FROM products WHERE
> orgidStr = '${dataimporter.request.orgid}' AND idStr =
> '${dataimporter.delta.idStr}'"
> deltaQuery="SELECT idStr FROM products WHERE orgidStr =
> '${dataimporter.request.orgid}' AND updatedAt >
> '${dataimporter.last_index_time}'"
> >
> .
> .
> . 4-5 more nested entities...
> </entity>
> </document>
> </dataConfig>
> 
> On Mon, Aug 17, 2020 at 1:32 PM Divye Handa 
> wrote:
> 
>> Can you share the dih configuration you are using for same?
>> 
>> On Mon, 17 Aug, 2020, 23:52 Abhijit Pawar,  wrote:
>> 
>>> Hello,
>>> 
>>> We are indexing some 200K plus documents in SOLR 5.4.1 with no shards /
>>> replicas and just single core.
>>> It takes almost 3.5 hours to index that data.
>>> I am using a data import handler to import data from the mongo database.
>>> 
>>> Is there something we can do to reduce the time taken to index?
>>> Will upgrade to newer version help?
>>> 
>>> Appreciate your help!
>>> 
>>> Regards,
>>> Abhijit
>>> 
>> 



Re: SOLR indexing takes longer time

2020-08-17 Thread Abhijit Pawar
Sure Divye,

*Here's the config.*

*conf/solr-config.xml:*

<requestHandler name="..." class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">/home/ec2-user/solr/solr-5.4.1/server/solr/test_core/conf/dataimport/data-source-config.xml</str>
  </lst>
</requestHandler>

*schema.xml:*
has all of the field definitions

*conf/dataimport/data-source-config.xml*

<dataConfig>
<dataSource name="mongod" driver="com.mongodb.jdbc.MongoDriver" url="mongodb://<<ADDRESS>>:27017/<>"/>
<document>
<entity name="..."
dataSource="mongod"
transformer="<>,TemplateTransformer"
onError="continue"
pk="uuid"
query="SELECT field1,field2,field3,.. FROM products"
deltaImportQuery="SELECT field1,field2,field3,.. FROM products WHERE
orgidStr = '${dataimporter.request.orgid}' AND idStr = '${dataimporter.delta.idStr}'"
deltaQuery="SELECT idStr FROM products WHERE orgidStr =
'${dataimporter.request.orgid}' AND updatedAt > '${dataimporter.last_index_time}'">
.
.
. 4-5 more nested entities...
</entity>
</document>
</dataConfig>

On Mon, Aug 17, 2020 at 1:32 PM Divye Handa 
wrote:

> Can you share the dih configuration you are using for same?
>
> On Mon, 17 Aug, 2020, 23:52 Abhijit Pawar,  wrote:
>
> > Hello,
> >
> > We are indexing some 200K plus documents in SOLR 5.4.1 with no shards /
> > replicas and just single core.
> > It takes almost 3.5 hours to index that data.
> > I am using a data import handler to import data from the mongo database.
> >
> > Is there something we can do to reduce the time taken to index?
> > Will upgrade to newer version help?
> >
> > Appreciate your help!
> >
> > Regards,
> > Abhijit
> >
>


Re: SOLR indexing takes longer time

2020-08-17 Thread Jörn Franke
The DIH is single-threaded and deprecated. Your best bet is a script/program that 
extracts data from MongoDB and writes it to Solr in batches using multiple threads. 
You will see significantly higher indexing performance for your data.

> Am 17.08.2020 um 20:23 schrieb Abhijit Pawar :
> 
> Hello,
> 
> We are indexing some 200K plus documents in SOLR 5.4.1 with no shards /
> replicas and just single core.
> It takes almost 3.5 hours to index that data.
> I am using a data import handler to import data from the mongo database.
> 
> Is there something we can do to reduce the time taken to index?
> Will upgrade to newer version help?
> 
> Appreciate your help!
> 
> Regards,
> Abhijit


Re: SOLR indexing takes longer time

2020-08-17 Thread Divye Handa
Can you share the dih configuration you are using for same?

On Mon, 17 Aug, 2020, 23:52 Abhijit Pawar,  wrote:

> Hello,
>
> We are indexing some 200K plus documents in SOLR 5.4.1 with no shards /
> replicas and just single core.
> It takes almost 3.5 hours to index that data.
> I am using a data import handler to import data from the mongo database.
>
> Is there something we can do to reduce the time taken to index?
> Will upgrade to newer version help?
>
> Appreciate your help!
>
> Regards,
> Abhijit
>


SOLR indexing takes longer time

2020-08-17 Thread Abhijit Pawar
Hello,

We are indexing some 200K plus documents in SOLR 5.4.1 with no shards /
replicas and just single core.
It takes almost 3.5 hours to index that data.
I am using a data import handler to import data from the mongo database.

Is there something we can do to reduce the time taken to index?
Will upgrade to newer version help?

Appreciate your help!

Regards,
Abhijit


RE: Time-out errors while indexing (Solr 7.7.1)

2020-07-07 Thread Kommu, Vinodh K.
Hi Eric, Toke,

Can you please look at the details shared in the email trail below & respond with your 
suggestions/feedback?


Thanks & Regards,
Vinodh

From: Kommu, Vinodh K.
Sent: Monday, July 6, 2020 4:58 PM
To: solr-user@lucene.apache.org
Subject: RE: Time-out errors while indexing (Solr 7.7.1)


Thanks Eric & Toke for your response over this.





Just wanted to correct a few things here about the number of docs:



Total number of documents exists in the entire cluster (all collections) = 
6393876826 (6.3B)

Total number of documents exists on 2 bigger collections (3749389864 & 
1780147848) = 5529537712 (5.5B)

Total number of documents exists on remaining collections = 864339114 (864M)



So all collections together do not hold 13B docs. As the numbers above show, the 
biggest collection in the cluster holds close to 3.7B docs and the second biggest 
holds up to 1.7B docs, whereas the remaining 20 collections in the cluster hold only 
864M docs, which puts the total for the cluster at 6.3B docs.



On the hardware side, the cluster sits on 6 Solr VMs; each VM has 170G total memory 
(with 2 Solr instances running per VM) and 16 vCPUs, and each Solr JVM runs with a 31G 
heap. The remaining memory is left for the OS disk cache and other OS operations. 
vm.swappiness on each VM is set to 0, so swap will never be used. Each collection is 
created using the rule-based replica placement API with 6 shards and a replication 
factor of 3.



One other observation on core placement: as mentioned above, we create collections 
using rule-based replica placement, i.e. a rule to ensure that no two replicas of the 
same shard sit on the same VM, using the following command.



curl -s -k -u user:password 
"https://localhost:22010/solr/admin/collections?action=CREATE&name=$SOLR_COLLECTION&numShards=${SHARDS_NO?}&replicationFactor=${REPLICATION_FACTOR?}&maxShardsPerNode=${MAX_SHARDS_PER_NODE?}&collection.configName=$SOLR_COLLECTION&rule=shard:*,replica:<2,host:*"



Variable values in above command:



SOLR_COLLECTION = collection name

SHARDS_NO = 6

REPLICATION_FACTOR = 3

MAX_SHARDS_PER_NODE = (computed from the number of Solr VMs, the number of nodes per 
VM and the total number of replicas, i.e. total number of replicas / number of VMs; in 
this cluster that is 18/6 = 3 max shards per machine)





Ideally rule-based replica placement should create 3 cores per VM for each collection, 
but from the listing below, the VMs ended up with 2, 3 or 4 cores per collection. VM2 
and VM6 apparently have more cores than the other VMs, so I presume this could be one 
reason they see more IO operations than the remaining 4 VMs.





That said, I believe Solr does this replica placement considering other factors, like 
free disk on each VM, while creating a new collection, correct? If so, is this replica 
placement across the VMs fine? If not, what's needed to correct it? Can an additional 
210G core create more disk IO operations? If yes, would moving the additional core from 
these VMs to a VM with fewer cores make any difference (i.e. ensuring each VM has at 
most 3 shards)?
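
(If rebalancing turns out to be needed, one option is the Collections API MOVEREPLICA 
command; the collection, replica and target node below are illustrative, not taken 
from this cluster:)

curl -s -k -u user:password "https://localhost:22010/solr/admin/collections?action=MOVEREPLICA&collection=Collection1&replica=core_node14&targetNode=<host>:<port>_solr"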



We have also been noticing a significant surge in IO operations at the storage level. 
We are wondering whether a storage IOPS limit could be starving Solr of IO, or whether 
it is the other way around: Solr issuing so many read/write operations that storage 
IOPS reaches its upper limit.





VM1:

176G  node1/solr/Collection2_shard5_replica_n30
176G  node2/solr/Collection2_shard2_replica_n24
176G  node2/solr/Collection2_shard3_replica_n2
177G  node1/solr/Collection2_shard6_replica_n10
208G  node1/solr/Collection1_shard5_replica_n18
208G  node2/solr/Collection1_shard2_replica_n1
1.1T  total

VM2:

176G  node2/solr/Collection2_shard4_replica_n16
176G  node2/solr/Collection2_shard6_replica_n34
177G  node1/solr/Collection2_shard5_replica_n6
207G  node2/solr/Collection1_shard6_replica_n10
208G  node1/solr/Collection1_shard1_replica_n32
208G  node2/solr/Collection1_shard5_replica_n30
210G  node1/solr/Collection1_shard3_replica_n14
1.4T  total

VM3:

175G  node2/solr/Collection2_shard2_replica_n12
177G  node1/solr/Collection2_shard1_replica_n20
208G  node1/solr/Collection1_shard1_replica_n8
208G  node2/solr/Collection1_shard2_replica_n12
209G  node1/solr/Collection1_shard4_replica_n28
976G  total

VM4:

176G  node1/solr/Collection2_shard4_replica_n28
177G  node1/solr/Collection2_shard1_replica_n8
207G  node2/solr/Collection1_shard6_replica_n22
208G  node1/solr/Collection1_shard5_replica_n6
210G  node1/solr/Collection1_shard3_replica_n26
975G  total

VM5:

176G  node2/solr/Collection2_shard3_replica_n14
177G  node1/solr/Collection2_shard5_replica_n18
177G  node2/solr/Collection2_shard1_replica_n32
208G  node1/solr/Collection1_shard2_replica_n24
210G  node1/solr/Collection1_shard3_replica_n2
210G  node2/solr/Coll

Re: Out of memory errors with Spatial indexing

2020-07-06 Thread David Smiley
I believe you are experiencing this bug: LUCENE-5056
<https://issues.apache.org/jira/browse/LUCENE-5056>
The fix would probably be adjusting code in here
org.apache.lucene.spatial.query.SpatialArgs#calcDistanceFromErrPct

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Mon, Jul 6, 2020 at 5:18 AM Sunil Varma  wrote:

> Hi David
> Thanks for your response. Yes, I noticed that all the data causing issue
> were at the poles. I tried the "RptWithGeometrySpatialField" field type
> definition but get a "Spatial context does not support S2 spatial
> index"error. Setting "spatialContextFactory="Geo3D" I still see the
> original OOM error .
>
> On Sat, 4 Jul 2020 at 05:49, David Smiley  wrote:
>
> > Hi Sunil,
> >
> > Your shape is at a pole, and I'm aware of a bug causing an exponential
> > explosion of needed grid squares when you have polygons super-close to
> the
> > pole.  Might you try S2PrefixTree instead?  I forget if this would fix it
> > or not by itself.  For indexing non-point data, I recommend
> > class="solr.RptWithGeometrySpatialField" which internally is based off a
> > combination of a course grid and storing the original vector geometry for
> > accurate verification:
> >  > class="solr.RptWithGeometrySpatialField"
> >   prefixTree="s2" />
> > The internally coarser grid will lessen the impact of that pole bug.
> >
> > ~ David Smiley
> > Apache Lucene/Solr Search Developer
> > http://www.linkedin.com/in/davidwsmiley
> >
> >
> > On Fri, Jul 3, 2020 at 7:48 AM Sunil Varma 
> > wrote:
> >
> > > We are seeing OOM errors  when trying to index some spatial data. I
> > believe
> > > the data itself might not be valid but it shouldn't cause the Server to
> > > crash. We see this on both Solr 7.6 and Solr 8. Below is the input that
> > is
> > > causing the error.
> > >
> > > {
> > > "id": "bad_data_1",
> > > "spatialwkt_srpt": "LINESTRING (-126.86037681029909 -90.0
> > > 1.000150474662E30, 73.58164711175415 -90.0 1.000150474662E30,
> > > 74.52836551959528 -90.0 1.000150474662E30, 74.97006811540834 -90.0
> > > 1.000150474662E30)"
> > > }
> > >
> > > Above dynamic field is mapped to field type "location_rpt" (
> > > solr.SpatialRecursivePrefixTreeFieldType).
> > >
> > >   Any pointers to get around this issue would be highly appreciated.
> > >
> > > Thanks!
> > >
> >
>


RE: Time-out errors while indexing (Solr 7.7.1)

2020-07-06 Thread Kommu, Vinodh K.
a_n34

208G  node1/solr/Collection1_shard1_replica_n20
209G  node2/solr/Collection1_shard4_replica_n16
1.3T  total

Thanks & Regards,

Vinodh



-Original Message-
From: Erick Erickson 
Sent: Saturday, July 4, 2020 7:07 PM
To: solr-user@lucene.apache.org
Subject: Re: Time-out errors while indexing (Solr 7.7.1)



ATTENTION: External Email – Be Suspicious of Attachments, Links and Requests 
for Login Information.



You need more shards. And, I’m pretty certain, more hardware.



You say you have 13 billion documents and 6 shards. Solr/Lucene has an absolute 
upper limit of 2B (2^31) docs per shard. I don’t quite know how you’re running 
at all unless that 13B is a round number. If you keep adding documents, your 
installation will shortly, at best, stop accepting new documents for indexing. 
At worst you’ll start seeing weird errors and possibly corrupt indexes and have 
to re-index everything from scratch.



You’ve backed yourself in to a pretty tight corner here. You either have to 
re-index to a properly-sized cluster or use SPLITSHARD. This latter will double 
the index-on-disk size (it creates two child indexes per replica and keeps the 
old one for safety’s sake that you have to clean up later). I strongly 
recommend you stop ingesting more data while you do this.



You say you have 6 VMs with 2 nodes running on each. If those VMs are 
co-located with anything else, the physical hardware is going to be stressed. 
VMs themselves aren’t bad, but somewhere there’s physical hardware that runs it…



In fact, I urge you to stop ingesting data immediately and address this issue. 
You have a cluster that’s mis-configured, and you must address that before Bad 
Things Happen.



Best,

Erick



> On Jul 4, 2020, at 5:09 AM, Mad have 
> mailto:madhava.a.re...@gmail.com>> wrote:

>

> Hi Eric,

>

> There are total 6 VM’s in Solr clusters and 2 nodes are running on each VM. 
> Total number of shards are 6 with 3 replicas. I can see the index size is 
> more than 220GB on each node for the collection where we are facing the 
> performance issue.

>

> The more documents we add to the collection the indexing become slow and I 
> also have same impression that the size of the collection is creating this 
> issue. Appreciate if you can suggests any solution on this.

>

>

> Regards,

> Madhava

> Sent from my iPhone

>

>> On 3 Jul 2020, at 23:30, Erick Erickson 
>> mailto:erickerick...@gmail.com>> wrote:

>>

>> Oops, I transposed that. If your index is a terabyte and your RAM is 128M, 
>> _that’s_ a red flag.

>>

>>> On Jul 3, 2020, at 5:53 PM, Erick Erickson 
>>> mailto:erickerick...@gmail.com>> wrote:

>>>

>>> You haven’t said how many _shards_ are present. Nor how many replicas of 
>>> the collection you’re hosting per physical machine. Nor how large the 
>>> indexes are on disk. Those are the numbers that count. The latter is 
>>> somewhat fuzzy, but if your aggregate index size on a machine with, say, 
>>> 128G of memory is a terabyte, that’s a red flag.

>>>

>>> Short form, though is yes. Subject to the questions above, this is what I’d 
>>> be looking at first.

>>>

>>> And, as I said, if you’ve been steadily increasing the total number of 
>>> documents, you’ll reach a tipping point sometime.

>>>

>>> Best,

>>> Erick

>>>

>>>>> On Jul 3, 2020, at 5:32 PM, Mad have 
>>>>> mailto:madhava.a.re...@gmail.com>> wrote:

>>>>

>>>> Hi Eric,

>>>>

>>>> The collection has almost 13billion documents with each document around 
>>>> 5kb size, all the columns around 150 are the indexed. Do you think that 
>>>> number of documents in the collection causing this issue. Appreciate your 
>>>> response.

>>>>

>>>> Regards,

>>>> Madhava

>>>>

>>>> Sent from my iPhone

>>>>

>>>>> On 3 Jul 2020, at 12:42, Erick Erickson 
>>>>> mailto:erickerick...@gmail.com>> wrote:

>>>>>

>>>>> If you’re seeing low CPU utilization at the same time, you

>>>>> probably just have too much data on too little hardware. Check

>>>>> your swapping, how much of your I/O is just because Lucene can’t

>>>>> hold all the parts of the index it needs in memory at once? Lucene

>>>>> uses MMapDirectory to hold the index and you may well be swapping,

>>>>> see:

>>>>>

>>>>> https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bi

>>>>> t.html

>>>>>

>>>>>

Re: Out of memory errors with Spatial indexing

2020-07-06 Thread Sunil Varma
Hi David
Thanks for your response. Yes, I noticed that all the data causing the issue
were at the poles. I tried the "RptWithGeometrySpatialField" field type
definition but get a "Spatial context does not support S2 spatial
index" error. Setting spatialContextFactory="Geo3D" I still see the
original OOM error.

On Sat, 4 Jul 2020 at 05:49, David Smiley  wrote:

> Hi Sunil,
>
> Your shape is at a pole, and I'm aware of a bug causing an exponential
> explosion of needed grid squares when you have polygons super-close to the
> pole.  Might you try S2PrefixTree instead?  I forget if this would fix it
> or not by itself.  For indexing non-point data, I recommend
> class="solr.RptWithGeometrySpatialField" which internally is based off a
> combination of a course grid and storing the original vector geometry for
> accurate verification:
>  class="solr.RptWithGeometrySpatialField"
>   prefixTree="s2" />
> The internally coarser grid will lessen the impact of that pole bug.
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
>
> On Fri, Jul 3, 2020 at 7:48 AM Sunil Varma 
> wrote:
>
> > We are seeing OOM errors  when trying to index some spatial data. I
> believe
> > the data itself might not be valid but it shouldn't cause the Server to
> > crash. We see this on both Solr 7.6 and Solr 8. Below is the input that
> is
> > causing the error.
> >
> > {
> > "id": "bad_data_1",
> > "spatialwkt_srpt": "LINESTRING (-126.86037681029909 -90.0
> > 1.000150474662E30, 73.58164711175415 -90.0 1.000150474662E30,
> > 74.52836551959528 -90.0 1.000150474662E30, 74.97006811540834 -90.0
> > 1.000150474662E30)"
> > }
> >
> > Above dynamic field is mapped to field type "location_rpt" (
> > solr.SpatialRecursivePrefixTreeFieldType).
> >
> >   Any pointers to get around this issue would be highly appreciated.
> >
> > Thanks!
> >
>


Re: Time-out errors while indexing (Solr 7.7.1)

2020-07-04 Thread Mad have
Thanks a lot for your inputs and suggestions. I was thinking along similar lines: 
creating another collection of the same kind (hot and cold) and moving documents older 
than a certain number of days, e.g. 180 days, from the original (hot) collection to the 
new (cold) collection.

Thanks,
Madhava

Sent from my iPhone

> On 4 Jul 2020, at 14:37, Erick Erickson  wrote:
> 
> You need more shards. And, I’m pretty certain, more hardware.
> 
> You say you have 13 billion documents and 6 shards. Solr/Lucene has an 
> absolute upper limit of 2B (2^31) docs per shard. I don’t quite know how 
> you’re running at all unless that 13B is a round number. If you keep adding 
> documents, your installation will shortly, at best, stop accepting new 
> documents for indexing. At worst you’ll start seeing weird errors and 
> possibly corrupt indexes and have to re-index everything from scratch.
> 
> You’ve backed yourself in to a pretty tight corner here. You either have to 
> re-index to a properly-sized cluster or use SPLITSHARD. This latter will 
> double the index-on-disk size (it creates two child indexes per replica and 
> keeps the old one for safety’s sake that you have to clean up later). I 
> strongly recommend you stop ingesting more data while you do this.
> 
> You say you have 6 VMs with 2 nodes running on each. If those VMs are 
> co-located with anything else, the physical hardware is going to be stressed. 
> VMs themselves aren’t bad, but somewhere there’s physical hardware that runs 
> it…
> 
> In fact, I urge you to stop ingesting data immediately and address this 
> issue. You have a cluster that’s mis-configured, and you must address that 
> before Bad Things Happen.
> 
> Best,
> Erick
> 
>> On Jul 4, 2020, at 5:09 AM, Mad have  wrote:
>> 
>> Hi Eric,
>> 
>> There are total 6 VM’s in Solr clusters and 2 nodes are running on each VM. 
>> Total number of shards are 6 with 3 replicas. I can see the index size is 
>> more than 220GB on each node for the collection where we are facing the 
>> performance issue.
>> 
>> The more documents we add to the collection the indexing become slow and I 
>> also have same impression that the size of the collection is creating this 
>> issue. Appreciate if you can suggests any solution on this.
>> 
>> 
>> Regards,
>> Madhava 
>> Sent from my iPhone
>> 
>>>> On 3 Jul 2020, at 23:30, Erick Erickson  wrote:
>>> 
>>> Oops, I transposed that. If your index is a terabyte and your RAM is 128M, 
>>> _that’s_ a red flag.
>>> 
>>>> On Jul 3, 2020, at 5:53 PM, Erick Erickson  wrote:
>>>> 
>>>> You haven’t said how many _shards_ are present. Nor how many replicas of 
>>>> the collection you’re hosting per physical machine. Nor how large the 
>>>> indexes are on disk. Those are the numbers that count. The latter is 
>>>> somewhat fuzzy, but if your aggregate index size on a machine with, say, 
>>>> 128G of memory is a terabyte, that’s a red flag.
>>>> 
>>>> Short form, though is yes. Subject to the questions above, this is what 
>>>> I’d be looking at first.
>>>> 
>>>> And, as I said, if you’ve been steadily increasing the total number of 
>>>> documents, you’ll reach a tipping point sometime.
>>>> 
>>>> Best,
>>>> Erick
>>>> 
>>>>>> On Jul 3, 2020, at 5:32 PM, Mad have  wrote:
>>>>> 
>>>>> Hi Eric,
>>>>> 
>>>>> The collection has almost 13billion documents with each document around 
>>>>> 5kb size, all the columns around 150 are the indexed. Do you think that 
>>>>> number of documents in the collection causing this issue. Appreciate your 
>>>>> response.
>>>>> 
>>>>> Regards,
>>>>> Madhava 
>>>>> 
>>>>> Sent from my iPhone
>>>>> 
>>>>>> On 3 Jul 2020, at 12:42, Erick Erickson  wrote:
>>>>>> 
>>>>>> If you’re seeing low CPU utilization at the same time, you probably
>>>>>> just have too much data on too little hardware. Check your
>>>>>> swapping, how much of your I/O is just because Lucene can’t
>>>>>> hold all the parts of the index it needs in memory at once? Lucene
>>>>>> uses MMapDirectory to hold the index and you may well be
>>>>>> swapping, see:
>>>>>> 
>>>>>> https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>>>>>> 
>>

Re: Time-out errors while indexing (Solr 7.7.1)

2020-07-04 Thread Erick Erickson
You need more shards. And, I’m pretty certain, more hardware.

You say you have 13 billion documents and 6 shards. Solr/Lucene has an absolute 
upper limit of 2B (2^31) docs per shard. I don’t quite know how you’re running 
at all unless that 13B is a round number. If you keep adding documents, your 
installation will shortly, at best, stop accepting new documents for indexing. 
At worst you’ll start seeing weird errors and possibly corrupt indexes and have 
to re-index everything from scratch.

You’ve backed yourself in to a pretty tight corner here. You either have to 
re-index to a properly-sized cluster or use SPLITSHARD. This latter will double 
the index-on-disk size (it creates two child indexes per replica and keeps the 
old one for safety’s sake that you have to clean up later). I strongly 
recommend you stop ingesting more data while you do this.
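
(For reference, SPLITSHARD is issued per shard through the Collections API; the
collection and shard names below are illustrative, and async mode is advisable for
shards this large:)

curl "http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=Collection1&shard=shard1&async=split-shard1"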

You say you have 6 VMs with 2 nodes running on each. If those VMs are 
co-located with anything else, the physical hardware is going to be stressed. 
VMs themselves aren’t bad, but somewhere there’s physical hardware that runs it…

In fact, I urge you to stop ingesting data immediately and address this issue. 
You have a cluster that’s mis-configured, and you must address that before Bad 
Things Happen.

Best,
Erick

> On Jul 4, 2020, at 5:09 AM, Mad have  wrote:
> 
> Hi Eric,
> 
> There are total 6 VM’s in Solr clusters and 2 nodes are running on each VM. 
> Total number of shards are 6 with 3 replicas. I can see the index size is 
> more than 220GB on each node for the collection where we are facing the 
> performance issue.
> 
> The more documents we add to the collection the indexing become slow and I 
> also have same impression that the size of the collection is creating this 
> issue. Appreciate if you can suggests any solution on this.
> 
> 
> Regards,
> Madhava 
> Sent from my iPhone
> 
>> On 3 Jul 2020, at 23:30, Erick Erickson  wrote:
>> 
>> Oops, I transposed that. If your index is a terabyte and your RAM is 128M, 
>> _that’s_ a red flag.
>> 
>>> On Jul 3, 2020, at 5:53 PM, Erick Erickson  wrote:
>>> 
>>> You haven’t said how many _shards_ are present. Nor how many replicas of 
>>> the collection you’re hosting per physical machine. Nor how large the 
>>> indexes are on disk. Those are the numbers that count. The latter is 
>>> somewhat fuzzy, but if your aggregate index size on a machine with, say, 
>>> 128G of memory is a terabyte, that’s a red flag.
>>> 
>>> Short form, though is yes. Subject to the questions above, this is what I’d 
>>> be looking at first.
>>> 
>>> And, as I said, if you’ve been steadily increasing the total number of 
>>> documents, you’ll reach a tipping point sometime.
>>> 
>>> Best,
>>> Erick
>>> 
>>>>> On Jul 3, 2020, at 5:32 PM, Mad have  wrote:
>>>> 
>>>> Hi Eric,
>>>> 
>>>> The collection has almost 13billion documents with each document around 
>>>> 5kb size, all the columns around 150 are the indexed. Do you think that 
>>>> number of documents in the collection causing this issue. Appreciate your 
>>>> response.
>>>> 
>>>> Regards,
>>>> Madhava 
>>>> 
>>>> Sent from my iPhone
>>>> 
>>>>> On 3 Jul 2020, at 12:42, Erick Erickson  wrote:
>>>>> 
>>>>> If you’re seeing low CPU utilization at the same time, you probably
>>>>> just have too much data on too little hardware. Check your
>>>>> swapping, how much of your I/O is just because Lucene can’t
>>>>> hold all the parts of the index it needs in memory at once? Lucene
>>>>> uses MMapDirectory to hold the index and you may well be
>>>>> swapping, see:
>>>>> 
>>>>> https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>>>>> 
>>>>> But my guess is that you’ve just reached a tipping point. You say:
>>>>> 
>>>>> "From last 2-3 weeks we have been noticing either slow indexing or 
>>>>> timeout errors while indexing”
>>>>> 
>>>>> So have you been continually adding more documents to your
>>>>> collections for more than the 2-3 weeks? If so you may have just
>>>>> put so much data on the same boxes that you’ve gone over
>>>>> the capacity of your hardware. As Toke says, adding physical
>>>>> memory for the OS to use to hold relevant parts of the index may
>>>>> alleviate the problem (again, refer to Uwe’s article for 

Re: Time-out errors while indexing (Solr 7.7.1)

2020-07-04 Thread Mad have
Hi Eric,

There are 6 VMs in total in the Solr cluster and 2 nodes are running on each VM. The 
total number of shards is 6, with 3 replicas. I can see the index size is more than 
220GB on each node for the collection where we are facing the performance issue.

The more documents we add to the collection, the slower indexing becomes, and I also 
have the impression that the size of the collection is causing this issue. I would 
appreciate it if you could suggest a solution.


Regards,
Madhava 
Sent from my iPhone

> On 3 Jul 2020, at 23:30, Erick Erickson  wrote:
> 
> Oops, I transposed that. If your index is a terabyte and your RAM is 128M, 
> _that’s_ a red flag.
> 
>> On Jul 3, 2020, at 5:53 PM, Erick Erickson  wrote:
>> 
>> You haven’t said how many _shards_ are present. Nor how many replicas of the 
>> collection you’re hosting per physical machine. Nor how large the indexes 
>> are on disk. Those are the numbers that count. The latter is somewhat fuzzy, 
>> but if your aggregate index size on a machine with, say, 128G of memory is a 
>> terabyte, that’s a red flag.
>> 
>> Short form, though is yes. Subject to the questions above, this is what I’d 
>> be looking at first.
>> 
>> And, as I said, if you’ve been steadily increasing the total number of 
>> documents, you’ll reach a tipping point sometime.
>> 
>> Best,
>> Erick
>> 
>>>> On Jul 3, 2020, at 5:32 PM, Mad have  wrote:
>>> 
>>> Hi Eric,
>>> 
>>> The collection has almost 13billion documents with each document around 5kb 
>>> size, all the columns around 150 are the indexed. Do you think that number 
>>> of documents in the collection causing this issue. Appreciate your response.
>>> 
>>> Regards,
>>> Madhava 
>>> 
>>> Sent from my iPhone
>>> 
>>>> On 3 Jul 2020, at 12:42, Erick Erickson  wrote:
>>>> 
>>>> If you’re seeing low CPU utilization at the same time, you probably
>>>> just have too much data on too little hardware. Check your
>>>> swapping, how much of your I/O is just because Lucene can’t
>>>> hold all the parts of the index it needs in memory at once? Lucene
>>>> uses MMapDirectory to hold the index and you may well be
>>>> swapping, see:
>>>> 
>>>> https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>>>> 
>>>> But my guess is that you’ve just reached a tipping point. You say:
>>>> 
>>>> "From last 2-3 weeks we have been noticing either slow indexing or timeout 
>>>> errors while indexing”
>>>> 
>>>> So have you been continually adding more documents to your
>>>> collections for more than the 2-3 weeks? If so you may have just
>>>> put so much data on the same boxes that you’ve gone over
>>>> the capacity of your hardware. As Toke says, adding physical
>>>> memory for the OS to use to hold relevant parts of the index may
>>>> alleviate the problem (again, refer to Uwe’s article for why).
>>>> 
>>>> All that said, if you’re going to keep adding document you need to
>>>> seriously think about adding new machines and moving some of
>>>> your replicas to them.
>>>> 
>>>> Best,
>>>> Erick
>>>> 
>>>>> On Jul 3, 2020, at 7:14 AM, Toke Eskildsen  wrote:
>>>>> 
>>>>>> On Thu, 2020-07-02 at 11:16 +, Kommu, Vinodh K. wrote:
>>>>>> We are performing QA performance testing on couple of collections
>>>>>> which holds 2 billion and 3.5 billion docs respectively.
>>>>> 
>>>>> How many shards?
>>>>> 
>>>>>> 1.  Our performance team noticed that read operations are pretty
>>>>>> more than write operations like 100:1 ratio, is this expected during
>>>>>> indexing or solr nodes are doing any other operations like syncing?
>>>>> 
>>>>> Are you saying that there are 100 times more read operations when you
>>>>> are indexing? That does not sound too unrealistic as the disk cache
>>>>> might be filled with the data that the writers are flushing.
>>>>> 
>>>>> In that case, more RAM would help. Okay, more RAM nearly always helps,
>>>>> but such massive difference in IO-utilization does indicate that you
>>>>> are starved for cache.
>>>>> 
>>>>> I noticed you have at least 18 replicas. That's a lot. Just to sanity
>>>&

Re: Out of memory errors with Spatial indexing

2020-07-03 Thread David Smiley
Hi Sunil,

Your shape is at a pole, and I'm aware of a bug causing an exponential
explosion of needed grid squares when you have polygons super-close to the
pole.  Might you try S2PrefixTree instead?  I forget if this would fix it
or not by itself.  For indexing non-point data, I recommend
class="solr.RptWithGeometrySpatialField" which internally is based off a
combination of a coarse grid and storing the original vector geometry for
accurate verification:
 <fieldType name="..." class="solr.RptWithGeometrySpatialField"
   prefixTree="s2" />
The internally coarser grid will lessen the impact of that pole bug.
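
(Putting David's suggestion together with the Geo3D context factory that Sunil mentions
elsewhere in this thread, such a field type might look like the following; the name and
distErrPct are illustrative, not from the thread, and Sunil reports the OOM persisted
even with Geo3D, pointing back to the LUCENE-5056 bug noted above:)

<fieldType name="location_rpt_geom"
           class="solr.RptWithGeometrySpatialField"
           spatialContextFactory="Geo3D"
           prefixTree="s2"
           distErrPct="0.15" />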

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Fri, Jul 3, 2020 at 7:48 AM Sunil Varma  wrote:

> We are seeing OOM errors  when trying to index some spatial data. I believe
> the data itself might not be valid but it shouldn't cause the Server to
> crash. We see this on both Solr 7.6 and Solr 8. Below is the input that is
> causing the error.
>
> {
> "id": "bad_data_1",
> "spatialwkt_srpt": "LINESTRING (-126.86037681029909 -90.0
> 1.000150474662E30, 73.58164711175415 -90.0 1.000150474662E30,
> 74.52836551959528 -90.0 1.000150474662E30, 74.97006811540834 -90.0
> 1.000150474662E30)"
> }
>
> Above dynamic field is mapped to field type "location_rpt" (
> solr.SpatialRecursivePrefixTreeFieldType).
>
>   Any pointers to get around this issue would be highly appreciated.
>
> Thanks!
>


Re: Time-out errors while indexing (Solr 7.7.1)

2020-07-03 Thread Erick Erickson
Oops, I transposed that. If your index is a terabyte and your RAM is 128M, 
_that’s_ a red flag.

> On Jul 3, 2020, at 5:53 PM, Erick Erickson  wrote:
> 
> You haven’t said how many _shards_ are present. Nor how many replicas of the 
> collection you’re hosting per physical machine. Nor how large the indexes are 
> on disk. Those are the numbers that count. The latter is somewhat fuzzy, but 
> if your aggregate index size on a machine with, say, 128G of memory is a 
> terabyte, that’s a red flag.
> 
> Short form, though is yes. Subject to the questions above, this is what I’d 
> be looking at first.
> 
> And, as I said, if you’ve been steadily increasing the total number of 
> documents, you’ll reach a tipping point sometime.
> 
> Best,
> Erick
> 
>> On Jul 3, 2020, at 5:32 PM, Mad have  wrote:
>> 
>> Hi Eric,
>> 
>> The collection has almost 13billion documents with each document around 5kb 
>> size, all the columns around 150 are the indexed. Do you think that number 
>> of documents in the collection causing this issue. Appreciate your response.
>> 
>> Regards,
>> Madhava 
>> 
>> Sent from my iPhone
>> 
>>> On 3 Jul 2020, at 12:42, Erick Erickson  wrote:
>>> 
>>> If you’re seeing low CPU utilization at the same time, you probably
>>> just have too much data on too little hardware. Check your
>>> swapping, how much of your I/O is just because Lucene can’t
>>> hold all the parts of the index it needs in memory at once? Lucene
>>> uses MMapDirectory to hold the index and you may well be
>>> swapping, see:
>>> 
>>> https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>>> 
>>> But my guess is that you’ve just reached a tipping point. You say:
>>> 
>>> "From last 2-3 weeks we have been noticing either slow indexing or timeout 
>>> errors while indexing”
>>> 
>>> So have you been continually adding more documents to your
>>> collections for more than the 2-3 weeks? If so you may have just
>>> put so much data on the same boxes that you’ve gone over
>>> the capacity of your hardware. As Toke says, adding physical
>>> memory for the OS to use to hold relevant parts of the index may
>>> alleviate the problem (again, refer to Uwe’s article for why).
>>> 
>>> All that said, if you’re going to keep adding document you need to
>>> seriously think about adding new machines and moving some of
>>> your replicas to them.
>>> 
>>> Best,
>>> Erick
>>> 
>>>> On Jul 3, 2020, at 7:14 AM, Toke Eskildsen  wrote:
>>>> 
>>>>> On Thu, 2020-07-02 at 11:16 +, Kommu, Vinodh K. wrote:
>>>>> We are performing QA performance testing on couple of collections
>>>>> which holds 2 billion and 3.5 billion docs respectively.
>>>> 
>>>> How many shards?
>>>> 
>>>>> 1.  Our performance team noticed that read operations are pretty
>>>>> more than write operations like 100:1 ratio, is this expected during
>>>>> indexing or solr nodes are doing any other operations like syncing?
>>>> 
>>>> Are you saying that there are 100 times more read operations when you
>>>> are indexing? That does not sound too unrealistic as the disk cache
>>>> might be filled with the data that the writers are flushing.
>>>> 
>>>> In that case, more RAM would help. Okay, more RAM nearly always helps,
>>>> but such massive difference in IO-utilization does indicate that you
>>>> are starved for cache.
>>>> 
>>>> I noticed you have at least 18 replicas. That's a lot. Just to sanity
>>>> check: How many replicas are each physical box handling? If they are
>>>> sharing resources, fewer replicas would probably be better.
>>>> 
>>>>> 3.  Our client timeout is set to 2mins, can they increase further
>>>>> more? Would that help or create any other problems?
>>>> 
>>>> It does not hurt the server to increase the client timeout as the
>>>> initiated query will keep running until it is finished, independent of
>>>> whether or not there is a client to receive the result.
>>>> 
>>>> If you want a better max time for query processing, you should look at 
>>>> 
>>>> https://lucene.apache.org/solr/guide/7_7/common-query-parameters.html#timeallowed-parameter
>>>> but due to its inherent limitations it might not help in your
>

Re: Time-out errors while indexing (Solr 7.7.1)

2020-07-03 Thread Erick Erickson
You haven’t said how many _shards_ are present. Nor how many replicas of the 
collection you’re hosting per physical machine. Nor how large the indexes are 
on disk. Those are the numbers that count. The latter is somewhat fuzzy, but if 
your aggregate index size on a machine with, say, 128G of memory is a terabyte, 
that’s a red flag.

Short form, though is yes. Subject to the questions above, this is what I’d be 
looking at first.

And, as I said, if you’ve been steadily increasing the total number of 
documents, you’ll reach a tipping point sometime.

Best,
Erick

> On Jul 3, 2020, at 5:32 PM, Mad have  wrote:
> 
> Hi Eric,
> 
> The collection has almost 13billion documents with each document around 5kb 
> size, all the columns around 150 are the indexed. Do you think that number of 
> documents in the collection causing this issue. Appreciate your response.
> 
> Regards,
> Madhava 
> 
> Sent from my iPhone
> 
>> On 3 Jul 2020, at 12:42, Erick Erickson  wrote:
>> 
>> If you’re seeing low CPU utilization at the same time, you probably
>> just have too much data on too little hardware. Check your
>> swapping, how much of your I/O is just because Lucene can’t
>> hold all the parts of the index it needs in memory at once? Lucene
>> uses MMapDirectory to hold the index and you may well be
>> swapping, see:
>> 
>> https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>> 
>> But my guess is that you’ve just reached a tipping point. You say:
>> 
>> "From last 2-3 weeks we have been noticing either slow indexing or timeout 
>> errors while indexing”
>> 
>> So have you been continually adding more documents to your
>> collections for more than the 2-3 weeks? If so you may have just
>> put so much data on the same boxes that you’ve gone over
>> the capacity of your hardware. As Toke says, adding physical
>> memory for the OS to use to hold relevant parts of the index may
>> alleviate the problem (again, refer to Uwe’s article for why).
>> 
>> All that said, if you’re going to keep adding document you need to
>> seriously think about adding new machines and moving some of
>> your replicas to them.
>> 
>> Best,
>> Erick
>> 
>>> On Jul 3, 2020, at 7:14 AM, Toke Eskildsen  wrote:
>>> 
>>>> On Thu, 2020-07-02 at 11:16 +, Kommu, Vinodh K. wrote:
>>>> We are performing QA performance testing on couple of collections
>>>> which holds 2 billion and 3.5 billion docs respectively.
>>> 
>>> How many shards?
>>> 
>>>> 1.  Our performance team noticed that read operations are pretty
>>>> more than write operations like 100:1 ratio, is this expected during
>>>> indexing or solr nodes are doing any other operations like syncing?
>>> 
>>> Are you saying that there are 100 times more read operations when you
>>> are indexing? That does not sound too unrealistic as the disk cache
>>> might be filled with the data that the writers are flushing.
>>> 
>>> In that case, more RAM would help. Okay, more RAM nearly always helps,
>>> but such massive difference in IO-utilization does indicate that you
>>> are starved for cache.
>>> 
>>> I noticed you have at least 18 replicas. That's a lot. Just to sanity
>>> check: How many replicas are each physical box handling? If they are
>>> sharing resources, fewer replicas would probably be better.
>>> 
>>>> 3.  Our client timeout is set to 2mins, can they increase further
>>>> more? Would that help or create any other problems?
>>> 
>>> It does not hurt the server to increase the client timeout as the
>>> initiated query will keep running until it is finished, independent of
>>> whether or not there is a client to receive the result.
>>> 
>>> If you want a better max time for query processing, you should look at 
>>> 
>>> https://lucene.apache.org/solr/guide/7_7/common-query-parameters.html#timeallowed-parameter
>>> but due to its inherent limitations it might not help in your
>>> situation.
>>> 
>>>> 4.  When we created an empty collection and loaded same data file,
>>>> it loaded fine without any issues so having more documents in a
>>>> collection would create such problems?
>>> 
>>> Solr 7 does have a problem with sparse DocValues and many documents,
>>> leading to excessive IO-activity, which might be what you are seeing. I
>>> can see from an earlier post that you were using streaming expressions
>>> for another collec

Re: Time-out errors while indexing (Solr 7.7.1)

2020-07-03 Thread Mad have
Hi Eric,

The collection has almost 13 billion documents, each around 5kb in 
size, and all of the roughly 150 columns are indexed. Do you think the number of 
documents in the collection is causing this issue? Appreciate your response.

Regards,
Madhava 

Sent from my iPhone

> On 3 Jul 2020, at 12:42, Erick Erickson  wrote:
> 
> If you’re seeing low CPU utilization at the same time, you probably
> just have too much data on too little hardware. Check your
> swapping, how much of your I/O is just because Lucene can’t
> hold all the parts of the index it needs in memory at once? Lucene
> uses MMapDirectory to hold the index and you may well be
> swapping, see:
> 
> https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
> 
> But my guess is that you’ve just reached a tipping point. You say:
> 
> "From last 2-3 weeks we have been noticing either slow indexing or timeout 
> errors while indexing”
> 
> So have you been continually adding more documents to your
> collections for more than the 2-3 weeks? If so you may have just
> put so much data on the same boxes that you’ve gone over
> the capacity of your hardware. As Toke says, adding physical
> memory for the OS to use to hold relevant parts of the index may
> alleviate the problem (again, refer to Uwe’s article for why).
> 
> All that said, if you’re going to keep adding document you need to
> seriously think about adding new machines and moving some of
> your replicas to them.
> 
> Best,
> Erick
> 
>> On Jul 3, 2020, at 7:14 AM, Toke Eskildsen  wrote:
>> 
>>> On Thu, 2020-07-02 at 11:16 +, Kommu, Vinodh K. wrote:
>>> We are performing QA performance testing on couple of collections
>>> which holds 2 billion and 3.5 billion docs respectively.
>> 
>> How many shards?
>> 
>>> 1.  Our performance team noticed that read operations are pretty
>>> more than write operations like 100:1 ratio, is this expected during
>>> indexing or solr nodes are doing any other operations like syncing?
>> 
>> Are you saying that there are 100 times more read operations when you
>> are indexing? That does not sound too unrealistic as the disk cache
>> might be filled with the data that the writers are flushing.
>> 
>> In that case, more RAM would help. Okay, more RAM nearly always helps,
>> but such massive difference in IO-utilization does indicate that you
>> are starved for cache.
>> 
>> I noticed you have at least 18 replicas. That's a lot. Just to sanity
>> check: How many replicas are each physical box handling? If they are
>> sharing resources, fewer replicas would probably be better.
>> 
>>> 3.  Our client timeout is set to 2mins, can they increase further
>>> more? Would that help or create any other problems?
>> 
>> It does not hurt the server to increase the client timeout as the
>> initiated query will keep running until it is finished, independent of
>> whether or not there is a client to receive the result.
>> 
>> If you want a better max time for query processing, you should look at 
>> 
>> https://lucene.apache.org/solr/guide/7_7/common-query-parameters.html#timeallowed-parameter
>> but due to its inherent limitations it might not help in your
>> situation.
>> 
>>> 4.  When we created an empty collection and loaded same data file,
>>> it loaded fine without any issues so having more documents in a
>>> collection would create such problems?
>> 
>> Solr 7 does have a problem with sparse DocValues and many documents,
>> leading to excessive IO-activity, which might be what you are seeing. I
>> can see from an earlier post that you were using streaming expressions
>> for another collection: This is one of the things that are affected by
>> the Solr 7 DocValues issue.
>> 
>> More info about DocValues and streaming:
>> https://issues.apache.org/jira/browse/SOLR-13013
>> 
>> Fairly in-depth info on the problem with Solr 7 docValues:
>> https://issues.apache.org/jira/browse/LUCENE-8374
>> 
>> If this is your problem, upgrading to Solr 8 and indexing the
>> collection from scratch should fix it. 
>> 
>> Alternatively you can port the LUCENE-8374-patch from Solr 7.3 to 7.7
>> or you can ensure that there are values defined for all DocValues-
>> fields in all your documents.
>> 
>>> java.net.SocketTimeoutException: Read timed out
>>>   at java.net.SocketInputStream.socketRead0(Native Method) 
>> ...
>>> Remote error message: java.util.concurrent.TimeoutException: Idle
>>> timeout expired: 60/60 ms
>> 
>> There is a default timeout of 10 minutes (distribUpdateSoTimeout?). You
>> should be able to change it in solr.xml.
>> https://lucene.apache.org/solr/guide/8_5/format-of-solr-xml.html
>> 
>> BUT if an update takes > 10 minutes to be processed, it indicates that
>> the cluster is overloaded.  Increasing the timeout is just a band-aid.
>> 
>> - Toke Eskildsen, Royal Danish Library
>> 
>> 
> 


Out of memory errors with Spatial indexing

2020-07-03 Thread Sunil Varma
We are seeing OOM errors  when trying to index some spatial data. I believe
the data itself might not be valid but it shouldn't cause the Server to
crash. We see this on both Solr 7.6 and Solr 8. Below is the input that is
causing the error.

{
"id": "bad_data_1",
"spatialwkt_srpt": "LINESTRING (-126.86037681029909 -90.0
1.000150474662E30, 73.58164711175415 -90.0 1.000150474662E30,
74.52836551959528 -90.0 1.000150474662E30, 74.97006811540834 -90.0
1.000150474662E30)"
}

Above dynamic field is mapped to field type "location_rpt" (
solr.SpatialRecursivePrefixTreeFieldType).

  Any pointers to get around this issue would be highly appreciated.

Thanks!


Re: Time-out errors while indexing (Solr 7.7.1)

2020-07-03 Thread Erick Erickson
If you’re seeing low CPU utilization at the same time, you probably
just have too much data on too little hardware. Check your
swapping, how much of your I/O is just because Lucene can’t
hold all the parts of the index it needs in memory at once? Lucene
uses MMapDirectory to hold the index and you may well be
swapping, see:

https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

But my guess is that you’ve just reached a tipping point. You say:

"From last 2-3 weeks we have been noticing either slow indexing or timeout 
errors while indexing”

So have you been continually adding more documents to your
collections for more than the 2-3 weeks? If so you may have just
put so much data on the same boxes that you’ve gone over
the capacity of your hardware. As Toke says, adding physical
memory for the OS to use to hold relevant parts of the index may
alleviate the problem (again, refer to Uwe’s article for why).

All that said, if you’re going to keep adding document you need to
seriously think about adding new machines and moving some of
your replicas to them.

Best,
Erick

> On Jul 3, 2020, at 7:14 AM, Toke Eskildsen  wrote:
> 
> On Thu, 2020-07-02 at 11:16 +, Kommu, Vinodh K. wrote:
>> We are performing QA performance testing on couple of collections
>> which holds 2 billion and 3.5 billion docs respectively.
> 
> How many shards?
> 
>>  1.  Our performance team noticed that read operations are pretty
>> more than write operations like 100:1 ratio, is this expected during
>> indexing or solr nodes are doing any other operations like syncing?
> 
> Are you saying that there are 100 times more read operations when you
> are indexing? That does not sound too unrealistic as the disk cache
> might be filled with the data that the writers are flushing.
> 
> In that case, more RAM would help. Okay, more RAM nearly always helps,
> but such massive difference in IO-utilization does indicate that you
> are starved for cache.
> 
> I noticed you have at least 18 replicas. That's a lot. Just to sanity
> check: How many replicas are each physical box handling? If they are
> sharing resources, fewer replicas would probably be better.
> 
>>  3.  Our client timeout is set to 2mins, can they increase further
>> more? Would that help or create any other problems?
> 
> It does not hurt the server to increase the client timeout as the
> initiated query will keep running until it is finished, independent of
> whether or not there is a client to receive the result.
> 
> If you want a better max time for query processing, you should look at 
> 
> https://lucene.apache.org/solr/guide/7_7/common-query-parameters.html#timeallowed-parameter
> but due to its inherent limitations it might not help in your
> situation.
> 
>>  4.  When we created an empty collection and loaded same data file,
>> it loaded fine without any issues so having more documents in a
>> collection would create such problems?
> 
> Solr 7 does have a problem with sparse DocValues and many documents,
> leading to excessive IO-activity, which might be what you are seeing. I
> can see from an earlier post that you were using streaming expressions
> for another collection: This is one of the things that are affected by
> the Solr 7 DocValues issue.
> 
> More info about DocValues and streaming:
> https://issues.apache.org/jira/browse/SOLR-13013
> 
> Fairly in-depth info on the problem with Solr 7 docValues:
> https://issues.apache.org/jira/browse/LUCENE-8374
> 
> If this is your problem, upgrading to Solr 8 and indexing the
> collection from scratch should fix it. 
> 
> Alternatively you can port the LUCENE-8374-patch from Solr 7.3 to 7.7
> or you can ensure that there are values defined for all DocValues-
> fields in all your documents.
> 
>> java.net.SocketTimeoutException: Read timed out
>>at java.net.SocketInputStream.socketRead0(Native Method) 
> ...
>> Remote error message: java.util.concurrent.TimeoutException: Idle
>> timeout expired: 60/60 ms
> 
> There is a default timeout of 10 minutes (distribUpdateSoTimeout?). You
> should be able to change it in solr.xml.
> https://lucene.apache.org/solr/guide/8_5/format-of-solr-xml.html
> 
> BUT if an update takes > 10 minutes to be processed, it indicates that
> the cluster is overloaded.  Increasing the timeout is just a band-aid.
> 
> - Toke Eskildsen, Royal Danish Library
> 
> 



Re: Time-out errors while indexing (Solr 7.7.1)

2020-07-03 Thread Toke Eskildsen
On Thu, 2020-07-02 at 11:16 +, Kommu, Vinodh K. wrote:
> We are performing QA performance testing on couple of collections
> which holds 2 billion and 3.5 billion docs respectively.

How many shards?

>   1.  Our performance team noticed that read operations are pretty
> more than write operations like 100:1 ratio, is this expected during
> indexing or solr nodes are doing any other operations like syncing?

Are you saying that there are 100 times more read operations when you
are indexing? That does not sound too unrealistic as the disk cache
might be filled with the data that the writers are flushing.

In that case, more RAM would help. Okay, more RAM nearly always helps,
but such massive difference in IO-utilization does indicate that you
are starved for cache.

I noticed you have at least 18 replicas. That's a lot. Just to sanity
check: How many replicas are each physical box handling? If they are
sharing resources, fewer replicas would probably be better.

>   3.  Our client timeout is set to 2mins, can they increase further
> more? Would that help or create any other problems?

It does not hurt the server to increase the client timeout as the
initiated query will keep running until it is finished, independent of
whether or not there is a client to receive the result.

If you want a better max time for query processing, you should look at 

https://lucene.apache.org/solr/guide/7_7/common-query-parameters.html#timeallowed-parameter
 but due to its inherent limitations it might not help in your
situation.
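
(timeAllowed is just a request parameter; the host, collection and value here are
illustrative:

  curl "http://localhost:8983/solr/TestCollection/select?q=*:*&timeAllowed=5000"

It bounds search time per request, with the caveats described in the linked page.)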

>   4.  When we created an empty collection and loaded same data file,
> it loaded fine without any issues so having more documents in a
> collection would create such problems?

Solr 7 does have a problem with sparse DocValues and many documents,
leading to excessive IO-activity, which might be what you are seeing. I
can see from an earlier post that you were using streaming expressions
for another collection: This is one of the things that are affected by
the Solr 7 DocValues issue.

More info about DocValues and streaming:
https://issues.apache.org/jira/browse/SOLR-13013

Fairly in-depth info on the problem with Solr 7 docValues:
https://issues.apache.org/jira/browse/LUCENE-8374

If this is your problem, upgrading to Solr 8 and indexing the
collection from scratch should fix it. 

Alternatively you can port the LUCENE-8374-patch from Solr 7.3 to 7.7
or you can ensure that there are values defined for all DocValues-
fields in all your documents.

> java.net.SocketTimeoutException: Read timed out
> at java.net.SocketInputStream.socketRead0(Native Method) 
...
> Remote error message: java.util.concurrent.TimeoutException: Idle
> timeout expired: 60/60 ms

There is a default timeout of 10 minutes (distribUpdateSoTimeout?). You
should be able to change it in solr.xml.
https://lucene.apache.org/solr/guide/8_5/format-of-solr-xml.html
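
(A sketch of the relevant solr.xml fragment; these are the standard solrcloud
settings, and 600000 ms matches the 10-minute default mentioned above - raising it
is, as noted, only a band-aid:)

  <solr>
    <solrcloud>
      <!-- socket / connect timeouts (ms) for distributed updates between nodes -->
      <int name="distribUpdateSoTimeout">600000</int>
      <int name="distribUpdateConnTimeout">60000</int>
    </solrcloud>
  </solr>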

BUT if an update takes > 10 minutes to be processed, it indicates that
the cluster is overloaded.  Increasing the timeout is just a band-aid.

- Toke Eskildsen, Royal Danish Library




RE: Time-out errors while indexing (Solr 7.7.1)

2020-07-03 Thread Kommu, Vinodh K.
Anyone has any thoughts or suggestions on this issue?

Thanks & Regards,
Vinodh

From: Kommu, Vinodh K.
Sent: Thursday, July 2, 2020 4:46 PM
To: solr-user@lucene.apache.org
Subject: Time-out errors while indexing (Solr 7.7.1)

Hi,

We are performing QA performance testing on a couple of collections which hold 2 
billion and 3.5 billion docs respectively. Indexing happens from a separate client 
using SolrJ, with 10 threads and a batch size of 1000. For the last 2-3 weeks we have 
been noticing either slow indexing or timeout errors while indexing. As part of 
troubleshooting, we noticed that when peak disk IO utilization reaches the higher 
side, indexing happens slowly, and when disk IO is constantly near 100%, timeout 
issues are observed.

Few questions here:


  1.  Our performance team noticed that read operations far outnumber write 
operations, around a 100:1 ratio. Is this expected during indexing, or are the Solr 
nodes doing other operations like syncing?
  2.  Zookeeper has a latency of around (min/avg/max: 0/0/2205). Can this latency 
create instability issues for the ZK or Solr clusters, or impact indexing or 
searching operations?
  3.  Our client timeout is set to 2 mins; can it be increased further? Would 
that help or create any other problems?
  4.  When we created an empty collection and loaded the same data file, it loaded 
fine without any issues, so would having more documents in a collection create 
such problems?

Any suggestions or feedback would be really appreciated.

Solr version - 7.7.1

Time out error snippet:

ERROR 
(updateExecutor-3-thread-30055-processing-x:TestCollection_shard5_replica_n18 
https:localhost:1122//solr//TestCollection_shard6_replica_n22<https://localhost:1122/solr/TestCollection_shard6_replica_n22>
 r:core_node21 n:localhost:1122_solr c:TestCollection s:shard5) 
[c:TestCollection s:shard5 r:core_node21 x:TestCollection_shard5_replica_n18] 
o.a.s.u.ErrorReportingConcurrentUpdateSolrClient error
java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method) ~[?:1.8.0_212]
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) 
~[?:1.8.0_212]
at java.net.SocketInputStream.read(SocketInputStream.java:171) 
~[?:1.8.0_212]
at java.net.SocketInputStream.read(SocketInputStream.java:141) 
~[?:1.8.0_212]
at sun.security.ssl.InputRecord.readFully(InputRecord.java:465) 
~[?:1.8.0_212]
at sun.security.ssl.InputRecord.read(InputRecord.java:503) 
~[?:1.8.0_212]
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:975) 
~[?:1.8.0_212]
at 
sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:933) 
~[?:1.8.0_212]
at sun.security.ssl.AppInputStream.read(AppInputStream.java:105) 
~[?:1.8.0_212]
at 
org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137)
 ~[httpcore-4.4.10.jar:4.4.10]
at 
org.apache.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:153)
 ~[httpcore-4.4.10.jar:4.4.10]
at 
org.apache.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:282)
 ~[httpcore-4.4.10.jar:4.4.10]
at 
org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:138)
 ~[httpclient-4.5.6.jar:4.5.6]
at 
org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:56)
 ~[httpclient-4.5.6.jar:4.5.6]
at 
org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:259)
 ~[httpcore-4.4.10.jar:4.4.10]
at 
org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:163)
 ~[httpcore-4.4.10.jar:4.4.10]
at 
org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:165) 
~[httpclient-4.5.6.jar:4.5.6]
at 
org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:273)
 ~[httpcore-4.4.10.jar:4.4.10]
at 
org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125)
 ~[httpcore-4.4.10.jar:4.4.10]
at 
org.apache.solr.util.stats.InstrumentedHttpRequestExecutor.execute(InstrumentedHttpRequestExecutor.java:120)
 ~[solr-core-7.7.1.jar:7.7.1 5bf96d32f88eb8a2f5e775339885cd6ba84a3b58 - ishan - 
2019-02-23 02:39:07]
at 
org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272) 
~[httpclient-4.5.6.jar:4.5.6]
at 
org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185) 
~[httpclient-4.5.6.jar:4.5.6]
at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89) 
~[httpclient-4.5.6.jar:4.5.6]
at 
org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) 
~[httpclient-4.5.6.jar:4.5.6]
at 
org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
 ~[httpclient-4.5.6

Re: Solr 8.5.2 indexing issue

2020-07-02 Thread gnandre
It seems that the issue is not with the reference_url field itself. There is
one copy field which has the reference_url field as source and another
field called url_path as destination.
This destination field url_path has a field type definition whose XML was not
preserved in the archive; per the next paragraph, its index analyzer included a
SynonymGraphFilterFactory followed by a FlattenGraphFilterFactory.

If I remove SynonymGraphFilterFactory and FlattenGraphFilterFactory from the
above field type definition then it works; otherwise it throws the
same error (IndexOutOfBoundsException).
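
Since the archive stripped the XML above, the following is only a sketch of
what a field type with those two filters commonly looks like; the tokenizer
and the remaining filters in the real definition may differ:

<fieldType name="url_path_text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <!-- graph-producing filters must be flattened at index time -->
    <filter class="solr.FlattenGraphFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>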

On Sun, Jun 28, 2020 at 9:06 AM Erick Erickson 
wrote:

> How are you sending this to Solr? I just tried 8.5, submitting that doc
> through the admin UI and it works fine.
> I defined “asset_id” as the same type as your reference_url field.
>
> And does the log on the Solr node that tries to index this give any more
> info?
>
> Best,
> Erick
>
> > On Jun 27, 2020, at 10:45 PM, gnandre  wrote:
> >
> > {
> >"asset_id":"add-ons:576deefef7453a9189aa039b66500eb2",
> >
> >
> "reference_url":"modeling-a-high-speed-backplane-part-3-4-port-s-parameters-to-differential-tdr-and-tdt.html"}
>
>


Time-out errors while indexing (Solr 7.7.1)

2020-07-02 Thread Kommu, Vinodh K.
Hi,

We are performing QA performance testing on a couple of collections which hold 2 
billion and 3.5 billion docs respectively. Indexing happens from a separate 
client using SolrJ with 10 threads and a batch size of 1000. For the last 2-3 
weeks we have been noticing either slow indexing or timeout errors while 
indexing. As part of troubleshooting, we noticed that when peak disk IO 
utilization is on the higher side, indexing happens slowly, and when 
disk IO is constantly near 100%, timeout issues are observed.

A few questions here:


  1.  Our performance team noticed that read operations far outnumber 
write operations, roughly a 100:1 ratio. Is this expected during indexing, or are the Solr 
nodes doing other operations like syncing?
  2.  Zookeeper latency is around (min/avg/max: 0/0/2205). Can this latency 
cause instability issues for the ZK or Solr clusters, or impact indexing or 
searching operations?
  3.  Our client timeout is set to 2 mins. Can it be increased further? Would 
that help or create any other problems?
  4.  When we created an empty collection and loaded the same data file, it loaded 
fine without any issues, so would having more documents in a collection create 
such problems?

Any suggestions or feedback would be really appreciated.

Solr version - 7.7.1

Time out error snippet:

ERROR 
(updateExecutor-3-thread-30055-processing-x:TestCollection_shard5_replica_n18 
https:localhost:1122//solr//TestCollection_shard6_replica_n22 r:core_node21 
n:localhost:1122_solr c:TestCollection s:shard5) [c:TestCollection s:shard5 
r:core_node21 x:TestCollection_shard5_replica_n18] 
o.a.s.u.ErrorReportingConcurrentUpdateSolrClient error
java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method) ~[?:1.8.0_212]
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) 
~[?:1.8.0_212]
at java.net.SocketInputStream.read(SocketInputStream.java:171) 
~[?:1.8.0_212]
at java.net.SocketInputStream.read(SocketInputStream.java:141) 
~[?:1.8.0_212]
at sun.security.ssl.InputRecord.readFully(InputRecord.java:465) 
~[?:1.8.0_212]
at sun.security.ssl.InputRecord.read(InputRecord.java:503) 
~[?:1.8.0_212]
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:975) 
~[?:1.8.0_212]
at 
sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:933) 
~[?:1.8.0_212]
at sun.security.ssl.AppInputStream.read(AppInputStream.java:105) 
~[?:1.8.0_212]
at 
org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137)
 ~[httpcore-4.4.10.jar:4.4.10]
at 
org.apache.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:153)
 ~[httpcore-4.4.10.jar:4.4.10]
at 
org.apache.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:282)
 ~[httpcore-4.4.10.jar:4.4.10]
at 
org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:138)
 ~[httpclient-4.5.6.jar:4.5.6]
at 
org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:56)
 ~[httpclient-4.5.6.jar:4.5.6]
at 
org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:259)
 ~[httpcore-4.4.10.jar:4.4.10]
at 
org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:163)
 ~[httpcore-4.4.10.jar:4.4.10]
at 
org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:165) 
~[httpclient-4.5.6.jar:4.5.6]
at 
org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:273)
 ~[httpcore-4.4.10.jar:4.4.10]
at 
org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125)
 ~[httpcore-4.4.10.jar:4.4.10]
at 
org.apache.solr.util.stats.InstrumentedHttpRequestExecutor.execute(InstrumentedHttpRequestExecutor.java:120)
 ~[solr-core-7.7.1.jar:7.7.1 5bf96d32f88eb8a2f5e775339885cd6ba84a3b58 - ishan - 
2019-02-23 02:39:07]
at 
org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272) 
~[httpclient-4.5.6.jar:4.5.6]
at 
org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185) 
~[httpclient-4.5.6.jar:4.5.6]
at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89) 
~[httpclient-4.5.6.jar:4.5.6]
at 
org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) 
~[httpclient-4.5.6.jar:4.5.6]
at 
org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
 ~[httpclient-4.5.6.jar:4.5.6]
at 
org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
 ~[httpclient-4.5.6.jar:4.5.6]
at 
org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
 ~[httpclient-4.5.6.jar:4.5.6

Re: Solr 8.5.2 indexing issue

2020-06-28 Thread Erick Erickson
How are you sending this to Solr? I just tried 8.5, submitting that doc through 
the admin UI and it works fine. 
I defined “asset_id” as the same type as your reference_url field.

And does the log on the Solr node that tries to index this give any more info?

Best,
Erick

> On Jun 27, 2020, at 10:45 PM, gnandre  wrote:
> 
> {
>"asset_id":"add-ons:576deefef7453a9189aa039b66500eb2",
> 
> "reference_url":"modeling-a-high-speed-backplane-part-3-4-port-s-parameters-to-differential-tdr-and-tdt.html"}



Solr 8.5.2 indexing issue

2020-06-27 Thread gnandre
Hi,

I have the following document which fails to get indexed.

{
"asset_id":"add-ons:576deefef7453a9189aa039b66500eb2",

"reference_url":"modeling-a-high-speed-backplane-part-3-4-port-s-parameters-to-differential-tdr-and-tdt.html"}

I am not sure what is so special about the content in the reference_url
field.

reference_url field is defined as follows in schema:



It throws the following error.

Status: 
{"data":{"responseHeader":{"status":400,"QTime":18},"error":{"metadata":["error-class","org.apache.solr.common.SolrException","root-error-class","java.lang.IndexOutOfBoundsException"],"msg":"Exception
writing document id add-ons:576deefef7453a9189aa039b66500eb2 to the index;
possible analysis
error.","code":400}},"status":400,"config":{"method":"POST","transformRequest":[null],"transformResponse":[null],"jsonpCallbackParam":"callback","headers":{"Content-type":"application/json","Accept":"application/json,
text/plain, */*","X-Requested-With":"XMLHttpRequest"},"data":"[{\n
\"asset_id\":\"add-ons:576deefef7453a9189aa039b66500eb2\",\n
\"reference_url\":\"modeling-a-high-speed-backplane-part-3-4-port-s-parameters-to-differential-tdr-and-tdt.html\"}]","url":"add-ons/update","params":{"wt":"json","_":1593304427428,"commitWithin":1000,"overwrite":true},"timeout":1},"statusText":"Bad
Request","xhrStatus":"complete","resource":{"0":"[","1":"{","2":"\n","3":"
","4":" ","5":" ","6":" ","7":" ","8":" ","9":" ","10":"
","11":"\"","12":"a","13":"s","14":"s","15":"e","16":"t","17":"_","18":"i","19":"d","20":"\"","21":":","22":"\"","23":"a","24":"d","25":"d","26":"-","27":"o","28":"n","29":"s","30":":","31":"5","32":"7","33":"6","34":"d","35":"e","36":"e","37":"f","38":"e","39":"f","40":"7","41":"4","42":"5","43":"3","44":"a","45":"9","46":"1","47":"8","48":"9","49":"a","50":"a","51":"0","52":"3","53":"9","54":"b","55":"6","56":"6","57":"5","58":"0","59":"0","60":"e","61":"b","62":"2","63":"\"","64":",","65":"\n","66":"
","67":" ","68":" ","69":" ","70":" ","71":" ","72":" ","73":"
","74":"\"","75":"r","76":"e","77":"f","78":"e","79":"r","80":"e","81":"n","82":"c","83":"e","84":"_","85":"u","86":"r","87":"l","88":"\"","89":":","90":"\"","91":"m","92":"o","93":"d","94":"e","95":"l","96":"i","97":"n","98":"g","99":"-","100":"a","101":"-","102":"h","103":"i","104":"g","105":"h","106":"-","107":"s","108":"p","109":"e","110":"e","111":"d","112":"-","113":"b","114":"a","115":"c","116":"k","117":"p","118":"l","119":"a","120":"n","121":"e","122":"-","123":"p","124":"a","125":"r","126":"t","127":"-","128":"3","129":"-","130":"4","131":"-","132":"p","133":"o","134":"r","135":"t","136":"-","137":"s","138":"-","139":"p","140":"a","141":"r","142":"a","143":"m","144":"e","145":"t","146":"e","147":"r","148":"s","149":"-","150":"t","151":"o","152":"-","153":"d","154":"i","155":"f","156":"f","157":"e","158":"r","159":"e","160":"n","161":"t","162":"i","163":"a","164":"l","165":"-","166":"t","167":"d","168":"r","169":"-","170":"a","171":"n","172":"d","173":"-","174":"t","175":"d","176":"t","177":".","178":"h","179":"t","180":"m","181":"l","182":"\"","183":"}","184":"]"}}


Re: Prevent Re-indexing if Doc Fields are Same

2020-06-26 Thread Walter Underwood
If you don’t want to buy disk space for deleted docs, you should not be 
using Solr. That is an essential part of a reliable Solr installation.

To avoid reindexing unchanged documents, use a bookkeeping RDBMS
table. In that table, put the document ID and the most recent successful
update to Solr. You can check if the fields are the same with a checksum
of the data. MD5 is fine for that. Check that database before sending the
document and update it after new documents are indexed.

You may also want to record deletes in the database.
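
A minimal sketch of that check, assuming a hypothetical indexed_docs(doc_id,
checksum) table reachable over JDBC (all names here are invented):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class IndexBookkeeper {

    // MD5 over the concatenated field values that make up the Solr document
    static String checksum(String concatenatedFields) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        byte[] digest = md5.digest(concatenatedFields.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // true if the document is new or its checksum changed, i.e. it should be sent to Solr
    static boolean needsIndexing(Connection db, String docId, String newChecksum) throws Exception {
        try (PreparedStatement ps = db.prepareStatement(
                "SELECT checksum FROM indexed_docs WHERE doc_id = ?")) {
            ps.setString(1, docId);
            try (ResultSet rs = ps.executeQuery()) {
                return !rs.next() || !newChecksum.equals(rs.getString(1));
            }
        }
    }
}

After a successful Solr update, the row would then be inserted or updated with
the new checksum.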

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jun 26, 2020, at 1:12 AM, Anshuman Singh  wrote:
> 
> I was reading about in-place updates
> https://lucene.apache.org/solr/guide/7_4/updating-parts-of-documents.html,
> In my use case I have to update the field "LASTUPDATETIME", all other
> fields are same. Updates are very frequent and I can't bear the cost of
> deleted docs.
> 
> If I provide all the fields, it deletes the document and re-index it. But
> if I just "set" the "LASTUPDATETIME" field (non-indexed, non-stored,
> docValue field), it does an in-place update without deletion. But the
> problem is I don't know if the document is present or I'm indexing it the
> first time.
> 
> Is there a way to prevent re-indexing if other fields are the same?
> 
> *P.S. I'm looking for a solution that doesn't require looking up if doc is
> present in the Collection or not.*



Prevent Re-indexing if Doc Fields are Same

2020-06-26 Thread Anshuman Singh
I was reading about in-place updates
https://lucene.apache.org/solr/guide/7_4/updating-parts-of-documents.html,
In my use case I have to update the field "LASTUPDATETIME", all other
fields are same. Updates are very frequent and I can't bear the cost of
deleted docs.

If I provide all the fields, it deletes the document and re-index it. But
if I just "set" the "LASTUPDATETIME" field (non-indexed, non-stored,
docValue field), it does an in-place update without deletion. But the
problem is I don't know if the document is present or I'm indexing it the
first time.
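
For reference, the request described above -- sending only a "set" for the
docValues-only field so Solr can apply it in place -- looks roughly like this
(collection name, id and value are invented):

curl -X POST -H 'Content-Type: application/json' \
  'http://localhost:8983/solr/mycollection/update?commitWithin=1000' \
  --data-binary '[{"id":"doc-123","LASTUPDATETIME":{"set":1593304427428}}]'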

Is there a way to prevent re-indexing if other fields are the same?

*P.S. I'm looking for a solution that doesn't require looking up if doc is
present in the Collection or not.*


Indexing error when using Category Routed Alias

2020-06-09 Thread Tom Evans
Hi all

1. Setup simple 1 node solrcloud test setup using docker-compose,
solr:8.5.2, zookeeper:3.5.8.
2. Upload a configset
3. Create two collections, one standard collection, one CRA, both
using the same configset

legacy:
action=CREATE=products_old=products=true=1=-1

CRA:

{
  "create-alias": {
"name": "products_20200609",
"router": {
  "name": "category",
  "field": "date_published.year",
  "maxCardinality": 30,
  "mustMatch": "(199[6-9]|20[0,1,2][0-9])"
},
"create-collection": {
  "config": "products",
  "numShards": 1,
  "nrtReplicas": 1,
  "tlogReplicas": 0,
  "maxShardsPerNode": 1,
  "autoAddReplicas": true
}
  }
}

Post a small selection of docs in JSON format using curl to non-CRA
collection -> OK

> $ docker-compose exec -T solr curl -H 'Content-Type: application/json' 
> -d@/resources/product-json/products-12381742.json 
> http://solr:8983/solr/products_old/update/json/docs
  % Total% Received % Xferd  Average Speed   TimeTime Time  Current
 Dload  Upload   Total   SpentLeft  Speed
100 11.6M  10071  100 11.6M  5   950k  0:00:14  0:00:12  0:00:02  687k
{
  "responseHeader":{
"rf":1,
"status":0,
"QTime":12541}}

The same documents, sent to the CRA -> boom

> $ docker-compose exec -T solr curl -H 'Content-Type: application/json' 
> -d@/resources/product-json/products-12381742.json 
> http://solr:8983/solr/products_20200609/update/json/docs
  % Total% Received % Xferd  Average Speed   TimeTime Time  Current
 Dload  Upload   Total   SpentLeft  Speed
100 11.6M  100   888  100 11.6M366  4913k  0:00:02  0:00:02 --:--:-- 4914k
{
  "responseHeader":{
"status":400,
"QTime":2422},
  "error":{
"metadata":[
  "error-class","org.apache.solr.common.SolrException",
  "root-error-class","org.apache.solr.common.SolrException",
  
"error-class","org.apache.solr.update.processor.DistributedUpdateProcessor$DistributedUpdatesAsyncException",
  
"root-error-class","org.apache.solr.update.processor.DistributedUpdateProcessor$DistributedUpdatesAsyncException"],
"msg":"Async exception during distributed update: Error from
server at 
http://10.20.36.130:8983/solr/products_20200609__CRA__2005_shard1_replica_n1/:
null\n\n\n\nrequest:
http://10.20.36.130:8983/solr/products_20200609__CRA__2005_shard1_replica_n1/\nRemote
error message: Cannot parse provided JSON: JSON Parse Error:
char=\u0002,position=0 AFTER='\u0002'
BEFORE='2update.contentType0applicat'",
"code":400}}

Repeating the request again to the CRA -> OK

> $ docker-compose exec -T solr curl -H 'Content-Type: application/json' 
> -d@/resources/product-json/products-12381742.json 
> http://solr:8983/solr/products_20200609/update/json/docs
  % Total% Received % Xferd  Average Speed   TimeTime Time  Current
 Dload  Upload   Total   SpentLeft  Speed
100 11.6M  10071  100 11.6M  6  1041k  0:00:11  0:00:11 --:--:--  706k
{
  "responseHeader":{
"rf":1,
"status":0,
"QTime":11446}}

It seems to be related to when a new collection is needed to be
created by the CRA.

The relevant logs:

2020-06-09 02:12:56.107 INFO
(OverseerThreadFactory-9-thread-3-processing-n:10.20.36.130:8983_solr)
[   ] o.a.s.c.a.c.CreateCollectionCmd Create collection
products_20200609__CRA__2005
2020-06-09 02:12:56.232 INFO
(OverseerStateUpdate-72169202568593409-10.20.36.130:8983_solr-n_00)[
  ] o.a.s.c.o.SliceMutator createReplica() {
  "operation":"ADDREPLICA",
  "collection":"products_20200609__CRA__2005",
  "shard":"shard1",
  "core":"products_20200609__CRA__2005_shard1_replica_n1",
  "state":"down",
  "base_url":"http://10.20.36.130:8983/solr;,
  "node_name":"10.20.36.130:8983_solr",
  "type":"NRT",
  "waitForFinalState":"false"}
2020-06-09 02:12:56.444 INFO  (qtp90045638-25) [
x:products_20200609__CRA__2005_shard1_replica_n1]
o.a.s.h.a.CoreAdminOperation core create command
qt=/admin/cores=core_node2=products=true=products_20200609__CRA__2005_shard1_replica_n1=CREATE=1=products_20200609__CRA__2005=shard1=javabin=2=NRT
2020-06-09 02:12:56.476 INFO  (qtp90045638-25)
[c:products_20200609__CRA__2005 s:shard1 r:core_node2
x:products_20200609__CRA__2005_shard1_replica_n1] o.a.s.c.SolrConfig
Using Lucene MatchVersion: 8.5.1
2020-06-09 02:12:56.512 INFO  (qtp90045638-25)
[c:products_20200609__CRA__2005 s:shard1 r:core_node2
x:products_20200609__CRA__2005_shard1_replica_n1] o.a.s.s.IndexSchema
[products_20200609__CRA__2005_shard1_replica_n1] Schema name=variants
2020-06-09 02:12:56.543 INFO  (qtp90045638-25)
[c:products_20200609__CRA__2005 s:shard1 r:core_node2
x:products_20200609__CRA__2005_shard1_replica_n1] o.a.s.r.RestManager
Registered ManagedResource impl
org.apache.solr.rest.schema.analysis.ManagedSynonymFilterFactory$SynonymManager
for path /schema/analysis/synonyms/default
2020-06-09 02:12:56.543 INFO  

Re: Indexing PDF on SOLR 8.5

2020-06-07 Thread Fiz N
Thanks Erick...

On Sun, Jun 7, 2020 at 1:50 PM Erick Erickson 
wrote:

> https://lucidworks.com/post/indexing-with-solrj/
>
>
> > On Jun 7, 2020, at 3:22 PM, Fiz N  wrote:
> >
> > Thanks Jorn and Erick.
> >
> > Hi Erick, looks like the skeletal SOLRJ program attachment is missing.
> >
> > Thanks
> > Fiz
> >
> > On Sun, Jun 7, 2020 at 12:20 PM Erick Erickson 
> > wrote:
> >
> >> Here’s a skeletal SolrJ program using Tika as another alternative.
> >>
> >> Best,
> >> Erick
> >>
> >>> On Jun 7, 2020, at 2:06 PM, Jörn Franke  wrote:
> >>>
> >>> You have to write an external application that creates multiple
> threads,
> >> parses the PDFs and index them in Solr. Ideally you parse the PDFs once
> and
> >> store the resulting text on some file system and then index it. Reason
> is
> >> that if you upgrade to two major versions of Solr you might need to
> reindex
> >> again. Then you can save time because you don’t need to parse the PDFs
> >> again.
> >>> It can be also useful in case you are not sure yet about the final
> >> schema and need to index several times in different schemas etc
> >>>
> >>> You can also use Apache manifoldCF.
> >>>
> >>>
> >>>
> >>>> Am 07.06.2020 um 19:19 schrieb Fiz N :
> >>>>
> >>>> Hello SOLR Experts,
> >>>>
> >>>> I am working on a POC to Index millions of PDF documents present in
> >>>> Multiple Folder in fileshare.
> >>>>
> >>>> Could you please let me the best practices and step to implement it.
> >>>>
> >>>> Thanks
> >>>> Fiz Nadiyal.
> >>
> >>
>
>


Re: Indexing PDF on SOLR 8.5

2020-06-07 Thread Erick Erickson
https://lucidworks.com/post/indexing-with-solrj/


> On Jun 7, 2020, at 3:22 PM, Fiz N  wrote:
> 
> Thanks Jorn and Erick.
> 
> Hi Erick, looks like the skeletal SOLRJ program attachment is missing.
> 
> Thanks
> Fiz
> 
> On Sun, Jun 7, 2020 at 12:20 PM Erick Erickson 
> wrote:
> 
>> Here’s a skeletal SolrJ program using Tika as another alternative.
>> 
>> Best,
>> Erick
>> 
>>> On Jun 7, 2020, at 2:06 PM, Jörn Franke  wrote:
>>> 
>>> You have to write an external application that creates multiple threads,
>> parses the PDFs and index them in Solr. Ideally you parse the PDFs once and
>> store the resulting text on some file system and then index it. Reason is
>> that if you upgrade to two major versions of Solr you might need to reindex
>> again. Then you can save time because you don’t need to parse the PDFs
>> again.
>>> It can be also useful in case you are not sure yet about the final
>> schema and need to index several times in different schemas etc
>>> 
>>> You can also use Apache manifoldCF.
>>> 
>>> 
>>> 
>>>> Am 07.06.2020 um 19:19 schrieb Fiz N :
>>>> 
>>>> Hello SOLR Experts,
>>>> 
>>>> I am working on a POC to Index millions of PDF documents present in
>>>> Multiple Folder in fileshare.
>>>> 
>>>> Could you please let me the best practices and step to implement it.
>>>> 
>>>> Thanks
>>>> Fiz Nadiyal.
>> 
>> 



Re: Indexing PDF on SOLR 8.5

2020-06-07 Thread Fiz N
Thanks Jorn and Erick.

Hi Erick, looks like the skeletal SOLRJ program attachment is missing.

Thanks
Fiz

On Sun, Jun 7, 2020 at 12:20 PM Erick Erickson 
wrote:

> Here’s a skeletal SolrJ program using Tika as another alternative.
>
> Best,
> Erick
>
> > On Jun 7, 2020, at 2:06 PM, Jörn Franke  wrote:
> >
> > You have to write an external application that creates multiple threads,
> parses the PDFs and index them in Solr. Ideally you parse the PDFs once and
> store the resulting text on some file system and then index it. Reason is
> that if you upgrade to two major versions of Solr you might need to reindex
> again. Then you can save time because you don’t need to parse the PDFs
> again.
> > It can be also useful in case you are not sure yet about the final
> schema and need to index several times in different schemas etc
> >
> > You can also use Apache manifoldCF.
> >
> >
> >
> >> Am 07.06.2020 um 19:19 schrieb Fiz N :
> >>
> >> Hello SOLR Experts,
> >>
> >> I am working on a POC to Index millions of PDF documents present in
> >> Multiple Folder in fileshare.
> >>
> >> Could you please let me the best practices and step to implement it.
> >>
> >> Thanks
> >> Fiz Nadiyal.
>
>


Re: Indexing PDF on SOLR 8.5

2020-06-07 Thread Erick Erickson
Here’s a skeletal SolrJ program using Tika as another alternative.

Best,
Erick

> On Jun 7, 2020, at 2:06 PM, Jörn Franke  wrote:
> 
> You have to write an external application that creates multiple threads, 
> parses the PDFs and index them in Solr. Ideally you parse the PDFs once and 
> store the resulting text on some file system and then index it. Reason is 
> that if you upgrade to two major versions of Solr you might need to reindex 
> again. Then you can save time because you don’t need to parse the PDFs again. 
> It can be also useful in case you are not sure yet about the final schema and 
> need to index several times in different schemas etc
> 
> You can also use Apache manifoldCF.
> 
> 
> 
>> Am 07.06.2020 um 19:19 schrieb Fiz N :
>> 
>> Hello SOLR Experts,
>> 
>> I am working on a POC to Index millions of PDF documents present in
>> Multiple Folder in fileshare.
>> 
>> Could you please let me the best practices and step to implement it.
>> 
>> Thanks
>> Fiz Nadiyal.



Re: Indexing PDF on SOLR 8.5

2020-06-07 Thread Jörn Franke
You have to write an external application that creates multiple threads, parses 
the PDFs and index them in Solr. Ideally you parse the PDFs once and store the 
resulting text on some file system and then index it. Reason is that if you 
upgrade to two major versions of Solr you might need to reindex again. Then you 
can save time because you don’t need to parse the PDFs again. 
It can be also useful in case you are not sure yet about the final schema and 
need to index several times in different schemas etc

You can also use Apache manifoldCF.
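
As a rough single-file sketch of such an external application (paths,
collection and field names are invented, and a real version would add a thread
pool and error handling):

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaCoreProperties;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class PdfIndexer {
    public static void main(String[] args) throws Exception {
        Path pdf = Paths.get("/data/docs/sample.pdf");

        // extract text and metadata with Tika
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler(-1); // no write limit
        Metadata metadata = new Metadata();
        try (InputStream in = Files.newInputStream(pdf)) {
            parser.parse(in, handler, metadata, new ParseContext());
        }

        // build a Solr document from the extracted content
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", pdf.toString());
        String title = metadata.get(TikaCoreProperties.TITLE);
        if (title != null) {
            doc.addField("title_txt", title);
        }
        doc.addField("content_txt", handler.toString());

        // send it to Solr
        try (SolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/pdfs").build()) {
            solr.add(doc);
            solr.commit();
        }
    }
}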



> Am 07.06.2020 um 19:19 schrieb Fiz N :
> 
> Hello SOLR Experts,
> 
> I am working on a POC to Index millions of PDF documents present in
> Multiple Folder in fileshare.
> 
> Could you please let me the best practices and step to implement it.
> 
> Thanks
> Fiz Nadiyal.


Indexing PDF on SOLR 8.5

2020-06-07 Thread Fiz N
Hello SOLR Experts,

I am working on a POC to index millions of PDF documents present in
multiple folders in a fileshare.

Could you please let me know the best practices and steps to implement it.

Thanks
Fiz Nadiyal.


Re: Not all EML files are indexing during indexing

2020-06-03 Thread Charlie Hull
I think the OP is indexing flat files, not web pages (but otherwise, I 
agree with you that Scrapy is great - I know some of the people behind 
it too and they're a good bunch).


Charlie

On 02/06/2020 16:41, Walter Underwood wrote:

On Jun 2, 2020, at 7:40 AM, Charlie Hull  wrote:

If it was me I'd probably build a standalone indexer script in Python that did 
the file handling, called out to a separate Tika service for extraction, posted 
to Solr.

I would do the same thing, and I would base that script on Scrapy (https://scrapy.org). I worked on a Python-based web spider for about ten 
years.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)




--
Charlie Hull
OpenSource Connections, previously Flax

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.o19s.com



Re: Not all EML files are indexing during indexing

2020-06-02 Thread Walter Underwood

> On Jun 2, 2020, at 7:40 AM, Charlie Hull  wrote:
> 
> If it was me I'd probably build a standalone indexer script in Python that 
> did the file handling, called out to a separate Tika service for extraction, 
> posted to Solr.

I would do the same thing, and I would base that script on Scrapy 
(https://scrapy.org). I worked on a Python-based web 
spider for about ten years.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)



Re: Not all EML files are indexing during indexing

2020-06-02 Thread Charlie Hull
Ah OK. I haven't used SimplePostTool myself and I note the docs say 
"View this not as a best-practice code example, but as a standalone 
example built with an explicit purpose of not having external jar 
dependencies."


I'm wondering if it's some kind of synchronisation issue between new 
files arriving in the folder and being picked up by your Powershell 
script. Hard to say really without seeing all the code...perhaps take 
out the Tika & Solr parts for now and verify the rest of your code 
really can spot every new or updated file that arrives?


If it was me I'd probably build a standalone indexer script in Python 
that did the file handling, called out to a separate Tika service for 
extraction, posted to Solr.


Cheers


Charlie





On 02/06/2020 14:48, Zheng Lin Edwin Yeo wrote:

Hi Charlie,

The main code that is doing the indexing is from the Solr's
SimplePostTools, but we have done some modification to it.

The walking through a folder is done by PowerShell script, the extracting
of the content from .eml file is from Tika that comes with Solr, and the
images in the .eml file are done by OCR that comes with Solr.

As we have modified the SimplePostTool code to do the checking if the file
already exists in the index by running a Solr search query of the ID, I'm
thinking if this issue is caused by the PowerShell script or the query in
the SimplePostTool code not being able to keep up with the large number of
files?

Regards,
Edwin


On Mon, 1 Jun 2020 at 17:19, Charlie Hull  wrote:


Hi Edwin,

What code is actually doing the indexing? AFAIK Solr doesn't include any
code for actually walking a folder, extracting the content from .eml
files and pushing this data into its index, so I'm guessing you've built
something external?

Charlie


On 01/06/2020 02:13, Zheng Lin Edwin Yeo wrote:

Hi,

I am running this on Solr 7.6.0

Currently I have a situation whereby there's more than 2 million EML file
in a folder, and the folder is constantly updating the EML files with the
latest information and adding new EML files.

When I do the indexing, it is suppose to index the new EML files, and
update those index in which the EML file content has changed. However, I
found that not all new EML files are updated with each run of the

indexing.

Could it be caused by the large number of files in the folder? Or due to
some other reasons?

Regards,
Edwin


--
Charlie Hull
OpenSource Connections, previously Flax

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.o19s.com




--
Charlie Hull
OpenSource Connections, previously Flax

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.o19s.com



Re: Not all EML files are indexing during indexing

2020-06-02 Thread Zheng Lin Edwin Yeo
Hi Charlie,

The main code that is doing the indexing is from the Solr's
SimplePostTools, but we have done some modification to it.

The walking through a folder is done by PowerShell script, the extracting
of the content from .eml file is from Tika that comes with Solr, and the
images in the .eml file are done by OCR that comes with Solr.

As we have modified the SimplePostTool code to do the checking if the file
already exists in the index by running a Solr search query of the ID, I'm
thinking if this issue is caused by the PowerShell script or the query in
the SimplePostTool code not being able to keep up with the large number of
files?

Regards,
Edwin


On Mon, 1 Jun 2020 at 17:19, Charlie Hull  wrote:

> Hi Edwin,
>
> What code is actually doing the indexing? AFAIK Solr doesn't include any
> code for actually walking a folder, extracting the content from .eml
> files and pushing this data into its index, so I'm guessing you've built
> something external?
>
> Charlie
>
>
> On 01/06/2020 02:13, Zheng Lin Edwin Yeo wrote:
> > Hi,
> >
> > I am running this on Solr 7.6.0
> >
> > Currently I have a situation whereby there's more than 2 million EML file
> > in a folder, and the folder is constantly updating the EML files with the
> > latest information and adding new EML files.
> >
> > When I do the indexing, it is suppose to index the new EML files, and
> > update those index in which the EML file content has changed. However, I
> > found that not all new EML files are updated with each run of the
> indexing.
> >
> > Could it be caused by the large number of files in the folder? Or due to
> > some other reasons?
> >
> > Regards,
> > Edwin
> >
>
> --
> Charlie Hull
> OpenSource Connections, previously Flax
>
> tel/fax: +44 (0)8700 118334
> mobile:  +44 (0)7767 825828
> web: www.o19s.com
>
>


Re: Not all EML files are indexing during indexing

2020-06-01 Thread Charlie Hull

Hi Edwin,

What code is actually doing the indexing? AFAIK Solr doesn't include any 
code for actually walking a folder, extracting the content from .eml 
files and pushing this data into its index, so I'm guessing you've built 
something external?


Charlie


On 01/06/2020 02:13, Zheng Lin Edwin Yeo wrote:

Hi,

I am running this on Solr 7.6.0

Currently I have a situation whereby there's more than 2 million EML file
in a folder, and the folder is constantly updating the EML files with the
latest information and adding new EML files.

When I do the indexing, it is suppose to index the new EML files, and
update those index in which the EML file content has changed. However, I
found that not all new EML files are updated with each run of the indexing.

Could it be caused by the large number of files in the folder? Or due to
some other reasons?

Regards,
Edwin



--
Charlie Hull
OpenSource Connections, previously Flax

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.o19s.com



Not all EML files are indexing during indexing

2020-05-31 Thread Zheng Lin Edwin Yeo
Hi,

I am running this on Solr 7.6.0

Currently I have a situation whereby there are more than 2 million EML files
in a folder, and the folder is constantly updated with the latest information
in existing EML files and with new EML files.

When I do the indexing, it is supposed to index the new EML files, and
update the index entries for EML files whose content has changed. However, I
found that not all new EML files are indexed with each run of the indexing.

Could it be caused by the large number of files in the folder? Or due to
some other reasons?

Regards,
Edwin


Re: Indexing huge data onto solr

2020-05-26 Thread Erick Erickson
It Depends (tm). Often, you can create a single (albeit, perhaps complex)
SQL query that does this for you and just process the response.

I’ve also seen situations where it’s possible to hold one of the tables 
in memory on the client and just use that rather than a separate query.

It depends on the characteristics of your particular database, your DBA
could probably help.

Best,
Erick

> On May 25, 2020, at 11:56 PM, Srinivas Kashyap 
>  wrote:
> 
> Hi Erick,
> 
> Thanks for the below response. The link which you provided holds good if you 
> have single entity where you can join the tables and index it. But in our 
> scenario, we have nested entities joining different tables as shown below:
> 
> db-data-config.xml:
> 
> 
> 
> (table 1 join table 2)
> (table 3 join table 4)
> (table 5 join table 6)
> (table 7 join table 8)
> 
> 
> 
> Do you have any recommendations for it to run multiple sql’s and make it as 
> single solr document that can be sent over solrJ for indexing?
> 
> Say parent entity has 100 documents, should I iterate over each one of parent 
> tuples and execute the child entity sql’s(with where condition of parent) to 
> create one solr document? Won’t it be more load on database by executing more 
> sqls? Is there an optimum solution?
> 
> Thanks,
> Srinivas
> From: Erick Erickson 
> Sent: 22 May 2020 22:52
> To: solr-user@lucene.apache.org
> Subject: Re: Indexing huge data onto solr
> 
> You have a lot more control over the speed and form of importing data if
> you just do the initial load in SolrJ. Here’s an example, taking the Tika
> parts out is easy:
> 
> https://lucidworks.com/post/indexing-with-solrj/
> 
> It’s especially instructive to comment out just the call to 
> CloudSolrClient.add(doclist…); If
> that _still_ takes a long time, then your DB query is the root of the 
> problem. Even with 100M
> records, I’d be really surprised if Solr is the bottleneck, but the above 
> test will tell you
> where to go to try to speed things up.
> 
> Best,
> Erick
> 
>> On May 22, 2020, at 12:39 PM, Srinivas Kashyap 
>> mailto:srini...@bamboorose.com.INVALID>> 
>> wrote:
>> 
>> Hi All,
>> 
>> We are runnnig solr 8.4.1. We have a database table which has more than 100 
>> million of records. Till now we were using DIH to do full-import on the 
>> tables. But for this table, when we do full-import via DIH it is taking more 
>> than 3-4 days to complete and also it consumes fair bit of JVM memory while 
>> running.
>> 
>> Are there any speedier/alternates ways to load data onto this solr core.
>> 
>> P.S: Only initial data import is problem, further updates/additions to this 
>> core is being done through SolrJ.
>> 
>> Thanks,
>> Srinivas
>> 



RE: Indexing huge data onto solr

2020-05-25 Thread Srinivas Kashyap
Hi Erick,

Thanks for the below response. The link which you provided holds good if you 
have a single entity where you can join the tables and index it. But in our 
scenario, we have nested entities joining different tables as shown below:

db-data-config.xml:



 (table 1 join table 2)
 (table 3 join table 4)
 (table 5 join table 6)
 (table 7 join table 8)



Do you have any recommendations for running multiple SQLs and building a 
single Solr document that can be sent over SolrJ for indexing?

Say the parent entity has 100 documents; should I iterate over each of the parent 
tuples and execute the child entity SQLs (with a WHERE condition on the parent) to 
create one Solr document? Won't that put more load on the database by executing more 
SQLs? Is there an optimum solution?

Thanks,
Srinivas
From: Erick Erickson 
Sent: 22 May 2020 22:52
To: solr-user@lucene.apache.org
Subject: Re: Indexing huge data onto solr

You have a lot more control over the speed and form of importing data if
you just do the initial load in SolrJ. Here’s an example, taking the Tika
parts out is easy:

https://lucidworks.com/post/indexing-with-solrj/

It’s especially instructive to comment out just the call to 
CloudSolrClient.add(doclist…); If
that _still_ takes a long time, then your DB query is the root of the problem. 
Even with 100M
records, I’d be really surprised if Solr is the bottleneck, but the above test 
will tell you
where to go to try to speed things up.

Best,
Erick

> On May 22, 2020, at 12:39 PM, Srinivas Kashyap 
> mailto:srini...@bamboorose.com.INVALID>> 
> wrote:
>
> Hi All,
>
> We are runnnig solr 8.4.1. We have a database table which has more than 100 
> million of records. Till now we were using DIH to do full-import on the 
> tables. But for this table, when we do full-import via DIH it is taking more 
> than 3-4 days to complete and also it consumes fair bit of JVM memory while 
> running.
>
> Are there any speedier/alternates ways to load data onto this solr core.
>
> P.S: Only initial data import is problem, further updates/additions to this 
> core is being done through SolrJ.
>
> Thanks,
> Srinivas
> 


Re: Indexing huge data onto solr

2020-05-22 Thread matthew sporleder
I can index (without nested entities ofc ;) ) 100M records in about
6-8 hours on a pretty low-powered machine using vanilla DIH -> mysql
so it is probably worth looking at why it is going slow before writing
your own indexer (which we are finally having to do)

On Fri, May 22, 2020 at 1:22 PM Erick Erickson  wrote:
>
> You have a lot more control over the speed and form of importing data if
> you just do the initial load in SolrJ. Here’s an example, taking the Tika
> parts out is easy:
>
> https://lucidworks.com/post/indexing-with-solrj/
>
> It’s especially instructive to comment out just the call to 
> CloudSolrClient.add(doclist…); If
> that _still_ takes a long time, then your DB query is the root of the 
> problem. Even with 100M
> records, I’d be really surprised if Solr is the bottleneck, but the above 
> test will tell you
> where to go to try to speed things up.
>
> Best,
> Erick
>
> > On May 22, 2020, at 12:39 PM, Srinivas Kashyap 
> >  wrote:
> >
> > Hi All,
> >
> > We are runnnig solr 8.4.1. We have a database table which has more than 100 
> > million of records. Till now we were using DIH to do full-import on the 
> > tables. But for this table, when we do full-import via DIH it is taking 
> > more than 3-4 days to complete and also it consumes fair bit of JVM memory 
> > while running.
> >
> > Are there any speedier/alternates ways to load data onto this solr core.
> >
> > P.S: Only initial data import is problem, further updates/additions to this 
> > core is being done through SolrJ.
> >
> > Thanks,
> > Srinivas
> > 
>


Re: Indexing huge data onto solr

2020-05-22 Thread Erick Erickson
You have a lot more control over the speed and form of importing data if
you just do the initial load in SolrJ. Here’s an example, taking the Tika
parts out is easy:

https://lucidworks.com/post/indexing-with-solrj/

It’s especially instructive to comment out just the call to 
CloudSolrClient.add(doclist…); If
that _still_ takes a long time, then your DB query is the root of the problem. 
Even with 100M
records, I’d be really surprised if Solr is the bottleneck, but the above test 
will tell you
where to go to try to speed things up.
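
For illustration, a stripped-down version of that approach could look like the
following (JDBC URL, SQL and field names are invented; this is not the code
from the post above):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BulkLoader {
    public static void main(String[] args) throws Exception {
        try (CloudSolrClient solr = new CloudSolrClient.Builder(
                     Arrays.asList("zk1:2181", "zk2:2181", "zk3:2181"), Optional.empty()).build();
             Connection db = DriverManager.getConnection("jdbc:mysql://dbhost/mydb", "user", "pass");
             Statement stmt = db.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT id, name, description FROM big_table")) {

            solr.setDefaultCollection("mycollection");

            List<SolrInputDocument> batch = new ArrayList<>(1000);
            while (rs.next()) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", rs.getString("id"));
                doc.addField("name_txt", rs.getString("name"));
                doc.addField("description_txt", rs.getString("description"));
                batch.add(doc);
                if (batch.size() == 1000) {
                    solr.add(batch);   // comment this line out to time the DB side alone
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                solr.add(batch);
            }
            solr.commit();
        }
    }
}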

Best,
Erick

> On May 22, 2020, at 12:39 PM, Srinivas Kashyap 
>  wrote:
> 
> Hi All,
> 
> We are runnnig solr 8.4.1. We have a database table which has more than 100 
> million of records. Till now we were using DIH to do full-import on the 
> tables. But for this table, when we do full-import via DIH it is taking more 
> than 3-4 days to complete and also it consumes fair bit of JVM memory while 
> running.
> 
> Are there any speedier/alternates ways to load data onto this solr core.
> 
> P.S: Only initial data import is problem, further updates/additions to this 
> core is being done through SolrJ.
> 
> Thanks,
> Srinivas
> 



Indexing huge data onto solr

2020-05-22 Thread Srinivas Kashyap
Hi All,

We are running Solr 8.4.1. We have a database table which has more than 100 
million records. Till now we were using DIH to do a full-import on the tables. 
But for this table, when we do a full-import via DIH it takes more than 3-4 
days to complete and also consumes a fair bit of JVM memory while running.

Are there any speedier/alternate ways to load data onto this Solr core?

P.S: Only the initial data import is a problem; further updates/additions to this 
core are being done through SolrJ.

Thanks,
Srinivas



Re: Different indexing times for two different collections with different data sizes

2020-05-20 Thread Erick Erickson
The easy question first. There is an absolute limit of 2B docs per shard. 
Internally, Lucene assigns an integer internal document ID that overflows after 
2B. That includes deleted docs, so your “maxDoc” on the admin page is the 
limit. Practically, as you are finding, you run into performance issues at counts 
significantly lower than 2B. Note that when segments are merged, the internal IDs get 
reassigned...

Indexing scales pretty linearly with the number of shards, _assuming_ you’re 
adding more hardware. To really answer the question you need to look at what 
the bottleneck is on your current system. IOW, “It Depends(tm)”.

Let’s claim your current system is running all your CPUs flat out. Or I/O is 
maxed out. Adding more shards to the existing hardware won’t help. Perhaps you 
don’t even need more shards, you just need to move some of your replicas to new 
hardware.

OTOH, let’s claim that your indexing isn’t straining your current hardware at 
all, then adding more shards to existing hardware should increase throughput.

Probably the issue is merging. When segments are merged, they’re re-written. My 
guess is that your larger collection is doing more merging than your test 
collection, but that’s a guess. See Mike McCandless’ blog, TieredMergePolicy is 
the default you’re probably using: 
http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html

Best,
Erick

> On May 20, 2020, at 7:25 AM, Kommu, Vinodh K.  wrote:
> 
> Hi,
> 
> Recently we had noticed that one of the largest collection (shards = 6 ; 
> replication factor =3) which holds up to 1TB of data & nearly 3.2 billion of 
> docs is taking longer time to index than it used to before. To see the 
> indexing time difference, we created another collection using largest 
> collection configs (schema.xml and solrconfig.xml files) and loaded the 
> collection with up to 100 million docs which is ~60G of data. Later we tried 
> to index exactly same 25 million docs data file on these two collections 
> which clearly showed timing difference. BTW, we are running on Solr 7.7.1 
> version.
> 
> Original largest collection has completed indexing in ~100mins
> Newly created collection (which has 100 million docs) has completed in ~70mins
> 
> This indexing time difference is due to the amount of data that each 
> collection hold? If yes, how to increase indexing performance on larger data 
> collection? adding more shards can help here?
> 
> Also, is there any threshold numbers for a single shard can hold in terms of 
> size and number of docs before adding a new shard?
> 
> Any answers would really help!!
> 
> 
> Thanks & Regards,
> Vinodh
> 



Different indexing times for two different collections with different data sizes

2020-05-20 Thread Kommu, Vinodh K.
Hi,

Recently we noticed that one of our largest collections (shards = 6; 
replication factor = 3), which holds up to 1TB of data and nearly 3.2 billion 
docs, is taking longer to index than it used to. To see the indexing 
time difference, we created another collection using the largest collection's configs 
(schema.xml and solrconfig.xml files) and loaded the collection with up to 100 
million docs, which is ~60G of data. Later we tried to index exactly the same 25 
million docs data file on these two collections, which clearly showed the timing 
difference. BTW, we are running on Solr 7.7.1.

The original largest collection completed indexing in ~100 mins
The newly created collection (which has 100 million docs) completed in ~70 mins

Is this indexing time difference due to the amount of data that each collection 
holds? If yes, how do we increase indexing performance on a larger data collection? 
Would adding more shards help here?

Also, is there any threshold number for what a single shard can hold in terms of 
size and number of docs before adding a new shard?

Any answers would really help!!


Thanks & Regards,
Vinodh



Re: nested entities and DIH indexing time

2020-05-14 Thread Shawn Heisey
On 5/14/2020 3:14 PM, matthew sporleder wrote:
> Can a non-nested entity write into existing docs, or do they always
> have to produce document-per-entity?
This is the only thing I found on this topic, and it is on a third-party 
website, so I can't say much about how accurate it is:


https://stackoverflow.com/questions/21006045/can-solr-dih-do-atomic-updates

I have never used a ScriptTransformer, so I do not know how to actually 
do this.


Your schema would have to be compatible with atomic updates.

Thanks,
Shawn



Re: nested entities and DIH indexing time

2020-05-14 Thread matthew sporleder
On Thu, May 14, 2020 at 4:46 PM Shawn Heisey  wrote:
>
> On 5/14/2020 9:36 AM, matthew sporleder wrote:
> > It appears that adding entities to my entities in my data import
> > config is slowing down my import process by a lot.  Is there a good
> > way to speed this up?  I see the ID's are individually queried instead
> > of using IN() or similar normal techniques to make things faster.
> >
> > Just looking for some tips.  I prefer this architecture to the way we
> > currently do it with complex SQL, inserting weird strings, and then
> > splitting on them (gross but faster).
>
> When you have nested entities, this is how DIH works.  A separate SQL
> query for the inner entity is made for each row returned on the outer
> entity.  Nested entities tend to be extremely slow for this reason.
>
> The best way to work around this is to make the database server do the
> heavy lifting -- using JOIN or other methods so that you only need one
> entity and one SQL query.  Doing this will mean that you'll need to
> split the data after import, using either the DIH config or the analysis
> configuration in the schema.
>
> Thanks,
> Shawn

This is too bad because it is very clean and the JOIN/CONCAT/SPLIT
method is very gross.

I was also hoping to use different delta queries for each nested entity.

Can a non-nested entity write into existing docs, or do they always
have to produce document-per-entity?


Re: nested entities and DIH indexing time

2020-05-14 Thread Shawn Heisey

On 5/14/2020 9:36 AM, matthew sporleder wrote:

It appears that adding entities to my entities in my data import
config is slowing down my import process by a lot.  Is there a good
way to speed this up?  I see the ID's are individually queried instead
of using IN() or similar normal techniques to make things faster.

Just looking for some tips.  I prefer this architecture to the way we
currently do it with complex SQL, inserting weird strings, and then
splitting on them (gross but faster).


When you have nested entities, this is how DIH works.  A separate SQL 
query for the inner entity is made for each row returned on the outer 
entity.  Nested entities tend to be extremely slow for this reason.


The best way to work around this is to make the database server do the 
heavy lifting -- using JOIN or other methods so that you only need one 
entity and one SQL query.  Doing this will mean that you'll need to 
split the data after import, using either the DIH config or the analysis 
configuration in the schema.
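
As an illustration of that pattern (table, column and field names are
invented; GROUP_CONCAT is the MySQL spelling, other databases have equivalents
such as STRING_AGG):

<entity name="item"
        transformer="RegexTransformer"
        query="SELECT i.id, i.title,
                      GROUP_CONCAT(t.tag SEPARATOR '|') AS tags
               FROM item i
               LEFT JOIN item_tag t ON t.item_id = i.id
               GROUP BY i.id, i.title">
  <field column="id" name="id"/>
  <field column="title" name="title"/>
  <!-- split the concatenated string into a multi-valued Solr field -->
  <field column="tags" name="tags" splitBy="\|"/>
</entity>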


Thanks,
Shawn


nested entities and DIH indexing time

2020-05-14 Thread matthew sporleder
It appears that adding entities to my entities in my data import
config is slowing down my import process by a lot.  Is there a good
way to speed this up?  I see the ID's are individually queried instead
of using IN() or similar normal techniques to make things faster.

Just looking for some tips.  I prefer this architecture to the way we
currently do it with complex SQL, inserting weird strings, and then
splitting on them (gross but faster).




RE: Indexing Korean

2020-05-04 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Oh wow, I had no idea this existed. Thank you so much!

Best,
Audrey

On 5/1/20, 12:58 PM, "Markus Jelsma"  wrote:

Hello,

Although it is not mentioned in Solr's language analysis page in the 
manual, Lucene has had support for Korean for quite a while now.


https://lucene.apache.org/core/8_5_0/analyzers-nori/index.html

Regards,
Markus



-Original message-
> From:Audrey Lorberfeld - audrey.lorberf...@ibm.com 

> Sent: Friday 1st May 2020 17:34
> To: solr-user@lucene.apache.org
> Subject: Indexing Korean
> 
>  Hi All,
> 
> My team would like to index Korean, but it looks like Solr OOTB does not 
have explicit support for Korean. If any of you have schema pipelines you could 
share for your Korean documents, I would love to see them! I'm assuming I would 
just use some combination of the OOTB CJK factories
> 
> Best,
> Audrey
> 
> 



RE: Indexing Korean

2020-05-01 Thread Markus Jelsma
Hello,

Although it is not mentioned in Solr's language analysis page in the manual, 
Lucene has had support for Korean for quite a while now.

https://lucene.apache.org/core/8_5_0/analyzers-nori/index.html
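
A schema field type built on it might look roughly like this -- an untested 
sketch, and depending on the Solr version the nori analyzer jar may first need 
to be added to the classpath (e.g. from the analysis-extras contrib):

  <fieldType name="text_ko" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <!-- Nori tokenizer splits Korean text into morphemes -->
      <tokenizer class="solr.KoreanTokenizerFactory" decompoundMode="discard"/>
      <!-- drop particles, endings and other parts of speech that rarely help search -->
      <filter class="solr.KoreanPartOfSpeechStopFilterFactory"/>
      <!-- fold Hanja characters to their Hangul readings -->
      <filter class="solr.KoreanReadingFormFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>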

Regards,
Markus

 
 
-Original message-
> From:Audrey Lorberfeld - audrey.lorberf...@ibm.com 
> Sent: Friday 1st May 2020 17:34
> To: solr-user@lucene.apache.org
> Subject: Indexing Korean
> 
>  Hi All,
> 
> My team would like to index Korean, but it looks like Solr OOTB does not have 
> explicit support for Korean. If any of you have schema pipelines you could 
> share for your Korean documents, I would love to see them! I'm assuming I 
> would just use some combination of the OOTB CJK factories
> 
> Best,
> Audrey
> 
> 


Indexing Korean

2020-05-01 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
 Hi All,

My team would like to index Korean, but it looks like Solr OOTB does not have 
explicit support for Korean. If any of you have schema pipelines you could 
share for your Korean documents, I would love to see them! I'm assuming I would 
just use some combination of the OOTB CJK factories

Best,
Audrey



Re: Solr indexing with Tika DIH - ZeroByteFileException

2020-04-23 Thread Charlie Hull
If users can upload any PDF, including broken or huge ones, and some 
cause a Tika error, you should decouple Tika from Solr and run it as a 
separate process to extract text before indexing with Solr. Otherwise 
some of what is uploaded *will* break Solr.

https://lucidworks.com/post/indexing-with-solrj/ has some good hints.
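
If the Tika-inside-DIH setup has to stay for now, the entity wrapping the 
TikaEntityProcessor can at least be told to skip failing files instead of 
aborting the whole import -- a rough, untested sketch with invented paths and 
field names:

  <dataConfig>
    <dataSource type="BinFileDataSource"/>
    <document>
      <entity name="files" processor="FileListEntityProcessor" dataSource="null"
              baseDir="/data/docs" fileName=".*\.pdf" recursive="true" rootEntity="false">
        <!-- onError="skip" drops a document that throws (such as a zero-byte file) and carries on -->
        <entity name="tika" processor="TikaEntityProcessor" onError="skip"
                url="${files.fileAbsolutePath}" format="text">
          <field column="text" name="content"/>
        </entity>
      </entity>
    </document>
  </dataConfig>

That only covers the "how do I ignore it" half of the earlier question, not how 
to identify which file a given document number refers to.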

Cheers

Charlie

On 11/06/2019 15:27, neilb wrote:

Hi, while going through solr logs, I found data import error for certain
documents. Here are details about the error.

Exception while processing: file document :
null:org.apache.solr.handler.dataimport.DataImportHandlerException: Unable
to read content Processing Document # 7866
at
org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:69)
at
org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:171)
at
org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:267)
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:476)
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:517)
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:415)
at
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:330)
at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:233)
at
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:424)
at
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:483)
at
org.apache.solr.handler.dataimport.DataImporter.lambda$runAsync$0(DataImporter.java:466)
at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.tika.exception.ZeroByteFileException: InputStream must
have > 0 bytes
at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:122)
at
org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:165)


How do I know which document (document name with path) is #7866? And how do I
ignore ZeroByteFileException, since the document network share is not in my control?
Users can upload PDFs of any size to it.

Thanks!






--
Charlie Hull
OpenSource Connections, previously Flax

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.o19s.com



Re: Solr indexing with Tika DIH - ZeroByteFileException

2020-04-22 Thread ravi kumar amaravadi
Hi,
I am also facing the same issue. Does anyone have an update/solution on how to fix
this issue as part of DIH?

Thanks.

Regards,
Ravi kumar





Re: Indexing data from multiple data sources

2020-04-20 Thread Charlie Hull
The link you quote is Sematext's mirror of the Apache solr-user mailing 
list. There are others also providing copies of this list. As the cat is 
very much out of the bag your best course of action is to change all the 
logins and passwords that have been leaked and review your security 
procedures.


Cheers

Charlie

On 18/04/2020 13:27, RaviKiran Moola wrote:

Hi,
Greetings of the day!!!

Unfortunately we have enclosed our database source details in the Solr 
community post while sending our queries to solr support as mentioned 
in the below mail.


We find that it has been posted with this link 
https://sematext.com/opensee/m/Solr/eHNlswSd1vD6AF?subj=RE+Indexing+data+from+multiple+data+sources


As it is open to the world, what we are requesting here is, could you 
please remove that post as soon as possible before it creates any 
security issues for us.


Your help is very much appreciated!!!

FYI.
Here I'm attaching the below screenshot




Thanks & Regards,

Ravikiran Moola



*From:* RaviKiran Moola
*Sent:* Friday, April 17, 2020 9:13 PM
*To:* solr-user@lucene.apache.org 
*Subject:* RE: Indexing data from multiple data sources
Hi,

Greetings!!!

We are working on indexing data from multiple data sources (MySQL & 
MSSQL) in a single collection. We specified data source details like 
connection details along with the required fields for both data 
sources in a single data config file, along with specified required 
fields details in the managed schema and here fetching the same 
columns from both data sources by specifying the common “unique key”.


Unable to index the data from the data sources using solr.

Here I’m attaching the data config file and screenshot.

Data config file:

 url="jdbc:mysql://182.74.133.92:3306/ra_dev" user="devuser" 
password="Welcome_009" batchSize="1" />
 driver="com.microsoft.sqlserver.jdbc.SQLServerDriver" 
url="jdbc:sqlserver://182.74.133.92;databasename=BB_SOLR" 
user="matuser" password="MatDev:07"/>

  
  

   
   
  

   
   
  

 



Thanks & Regards,

Ravikiran Moola

+91-9494924492




--
Charlie Hull
OpenSource Connections, previously Flax

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.o19s.com



Re: Indexing data from multiple data sources

2020-04-18 Thread RaviKiran Moola
Hi,
Greetings of the day!!!

Unfortunately we have enclosed our database source details in the Solr 
community post while sending our queries to solr support as mentioned in the 
below mail.

We find that it has been posted with this link 
https://sematext.com/opensee/m/Solr/eHNlswSd1vD6AF?subj=RE+Indexing+data+from+multiple+data+sources

As it is open to the world, what we are requesting here is, could you please 
remove that post as soon as possible before it creates any security issues for 
us.

Your help is very much appreciated!!!

FYI.
Here I'm attaching the below screenshot

[inline screenshot attachment omitted]



Thanks & Regards,

Ravikiran Moola



From: RaviKiran Moola
Sent: Friday, April 17, 2020 9:13 PM
To: solr-user@lucene.apache.org 
Subject: RE: Indexing data from multiple data sources

Hi,

Greetings!!!

We are working on indexing data from multiple data sources (MySQL & MSSQL) in a 
single collection. We specified data source details like connection details 
along with the required fields for both data sources in a single data config 
file, along with specified required fields details in the managed schema and 
here fetching the same columns from both data sources by specifying the common 
“unique key”.

Unable to index the data from the data sources using solr.

Here I’m attaching the data config file and screenshot.

Data config file:

 
 
  
  
   
   
  
   
   
  

 




Thanks & Regards,

Ravikiran Moola

+91-9494924492



Indexing data from multiple data sources(CSV, RDBMS)

2020-04-18 Thread Shravan Kumar Bolla
Hi,

I am working on indexing data from multiple data sources into a single 
collection. I specified the data source information in the data-config file and 
also updated the managed schema by adding the fields from all the data sources, 
specifying a common unique key across all the sources.

Here is a sample config file.

 
>   url="jdbc:mysql://localhost/aaa" user="***" password="***" batchSize="1" />
>   driver="com.microsoft.sqlserver.jdbc.SQLServerDriver" 
> url="jdbc:sqlserver://localhost;databasename=aaa" user="***" password="**"/>
>   
>   
>
>
>   
>
>
>   
> 
>  
> 

Error Details:
Full Import failed:java.lang.RuntimeException:java.lang.RuntimeException:
org.apache.solr.handler.dataimport.DataImportHandlerException: Invalid type for 
data source: Jdbc-2
Processing Document #1

Thanks,
Shravan
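
That error message usually means DIH could not instantiate the class named in a 
dataSource type attribute -- for example when a name like "Jdbc-2" ends up in 
the type attribute. For comparison, a minimal two-source data-config (with 
placeholder drivers, URLs, credentials and tables) would normally look something 
like this:

  <dataConfig>
    <!-- each source gets type="JdbcDataSource" plus a distinct name -->
    <dataSource name="mysql-ds" type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
                url="jdbc:mysql://localhost/aaa" user="***" password="***"/>
    <dataSource name="mssql-ds" type="JdbcDataSource" driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
                url="jdbc:sqlserver://localhost;databaseName=aaa" user="***" password="***"/>
    <document>
      <!-- entities select their source by name, not by type -->
      <entity name="fromMysql" dataSource="mysql-ds" query="SELECT id, title FROM table_a">
        <field column="id" name="id"/>
        <field column="title" name="title"/>
      </entity>
      <entity name="fromMssql" dataSource="mssql-ds" query="SELECT id, title FROM table_b">
        <field column="id" name="id"/>
        <field column="title" name="title"/>
      </entity>
    </document>
  </dataConfig>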


Re: Indexing data from multiple data sources

2020-04-17 Thread Jörn Franke
What does your solr.log say? Any errors?

> Am 17.04.2020 um 20:22 schrieb RaviKiran Moola 
> :
> 
> 
> Hi,
> 
> Greetings!!!
> 
> We are working on indexing data from multiple data sources (MySQL & MSSQL) in 
> a single collection. We specified data source details like connection details 
> along with the required fields for both data sources in a single data config 
> file, along with specified required fields details in the managed schema and 
> here fetching the same columns from both data sources by specifying the 
> common “unique key”.
> 
> Unable to index the data from the data sources using solr.
> 
> Here I’m attaching the data config file and screenshot.
> 
> Data config file:
>  
>   url="jdbc:mysql://182.74.133.92:3306/ra_dev" user="devuser" 
> password="Welcome_009" batchSize="1" />  
>   driver="com.microsoft.sqlserver.jdbc.SQLServerDriver" 
> url="jdbc:sqlserver://182.74.133.92;databasename=BB_SOLR" user="matuser" 
> password="MatDev:07"/>   
>   
>   
>
> 
>   
>  
>   
>   
>   
>  
> 
> 
> 
> Thanks & Regards,
> Ravikiran Moola
> +91-9494924492
> 


RE: Indexing data from multiple data sources

2020-04-17 Thread RaviKiran Moola
Hi,

Greetings!!!

We are working on indexing data from multiple data sources (MySQL & MSSQL) in a 
single collection. We specified data source details like connection details 
along with the required fields for both data sources in a single data config 
file, along with specified required fields details in the managed schema and 
here fetching the same columns from both data sources by specifying the common 
“unique key”.

Unable to index the data from the data sources using solr.

Here I’m attaching the data config file and screenshot.

Data config file:

 
 
  
  
   
   
  
   
   
  

 




Thanks & Regards,

Ravikiran Moola

+91-9494924492



Re: Inconsistent / confusing documentation on indexing nested documents.

2020-04-03 Thread Chris Hostetter


: Is the documentation wrong or have I misunderstood it?

The documentation is definitely wrong, thanks for pointing this out...

https://issues.apache.org/jira/browse/SOLR-14383


-Hoss
http://www.lucidworks.com/


Inconsistent / confusing documentation on indexing nested documents.

2020-04-03 Thread Peter Pimley
Hi,

The page "Indexing Nested Documents" has an XML example showing two
different ways of adding nested documents:

https://lucene.apache.org/solr/guide/8_5/indexing-nested-documents.html#xml-examples

The text says:

  "It illustrates two styles of adding child documents: the first is
associated via a field "comment" (preferred), and the second is done
in the classic way now referred to as an "anonymous" or "unlabelled"
child document."

However in the XML directly below there is no field named "comment".
There is one named "content" and another named "comments" (plural),
but no field named "comment".  In fact, looking at the Json example
immediately below, I wonder if the XML element currently named
"content" should be named "comments", and what is currently marked
"comments" should be "content"?

Secondly, in the Json example it says:

  "The labelled relationship here is one child document but could have
been wrapped in array brackets."

However in the actual Json, the parent document (ID=1) with a labelled
relationship has two child documents (IDs 2 and 3), and they are
already in array brackets.

Is the documentation wrong or have I misunderstood it?

Thanks,
Peter


Debugging indexing timeouts

2020-03-02 Thread fredsearch157
Hi all,

A couple of months ago, I migrated my solr deployment off of some legacy 
hardware (old spinning disks), and onto much newer hardware (SSD's, newer 
processors). While I am seeing much improved search performance since this 
move, I am also seeing intermittent indexing timeouts for 10-15 min periods 
about once a day or so (both from my indexing code and between replicas), which 
were not happening before. I have been scratching my head trying to figure out 
why, but have thus far been unsuccessful. I was hoping someone on here could 
maybe offer some thoughts as to how to further debug.

Some information about my setup:
-Solr Cloud 8.3, running on linux
-2 nodes, 1 shard (2 replicas) per collection
-Handful of collections, maxing out in the 10s of millions of docs per 
collection. Less than 100 million docs total
-Nodes are 8 CPU cores with SSD storage. 64 GB of RAM on server, heap size of 
26 GB.
-Relatively aggressive NRT tuning (hard commit 60 sec, soft commit 15 sec).
-Multi-threaded indexing process using SolrJ CloudSolrClient, sending updates 
in batches of ~1000 docs
-Indexing and querying is done constantly throughout the day

The indexing process, heap sizes, and soft/hard commit intervals were carefully 
tuned for my original setup, and were working flawlessly until the hardware 
change. It's only since the move to faster hardware/SSDs that I am now seeing 
timeouts during indexing (maybe counter-intuitively).

My first thought was that I was having stop the world GC pauses which were 
causing the timeouts, but when I captured GC logs during one of the timeout 
windows and ran it through a log analyzer, there were no issues detected. 
Largest GC pause was under 1 second. I monitor the heap continuously, and I 
always sit between 15-20 GB of 26 GB used...so I don't think that my heap is 
too small necessarily.

My next thought was that maybe it had to do with segment merges happening in 
the background, causing indexing to block. I am using the dynamic defaults for 
the merge scheduler, which almost certainly changed when I moved hardware 
(since now it is detecting a non-spinning disk, and my understanding is that 
the max concurrent merges is set based on this). I have been unable to confirm 
this though. I do not see any merge warnings or errors in the logs, and I have 
thus far been unable to catch it in action to try and confirm via a thread dump.
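
For reference, the dynamic defaults can be taken out of the equation by 
configuring the scheduler explicitly in the indexConfig section of 
solrconfig.xml -- the numbers below are only illustrative, not a recommendation:

  <indexConfig>
    <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
      <!-- explicit values override Lucene's spinning-disk vs SSD auto-detection -->
      <int name="maxMergeCount">9</int>
      <int name="maxThreadCount">4</int>
    </mergeScheduler>
  </indexConfig>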

Interestingly, when I did take a thread dump during normal execution, I noticed 
that one of my nodes has a huge number of running threads (~1700) compared to 
the other node (~150). Most of the threads are updateExecutor threads that 
appear to be permanently in a waiting state. I'm not sure what causes the node 
to get into this state, or if it is related to the timeouts at all.

I have thus far been unable to replicate the issue in a test environment, so 
it's hard to trial and error possible solutions. Does anyone have any 
suggestions on what could be causing these timeouts all of a sudden, or tips on 
how to debug further?

Thanks!
