Ah OK. I haven't used SimplePostTool myself and I note the docs say "View this not as a best-practice code example, but as a standalone example built with an explicit purpose of not having external jar dependencies."

I'm wondering if it's some kind of synchronisation issue between new files arriving in the folder and being picked up by your PowerShell script. It's hard to say without seeing all the code. Perhaps take out the Tika and Solr parts for now and verify that the rest of your code really can spot every new or updated file that arrives?
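As a rough illustration of that check, here is a minimal sketch in Python rather than PowerShell. It only does the file-spotting part: take a snapshot of the folder, take another later, and compare. The function names `scan` and `diff` are made up for this example, and using modification time plus size as the "has it changed" signal is an assumption; your script may use something else.

```python
import os


def scan(folder, suffix=".eml"):
    """Return {path: (mtime_ns, size)} for every matching file under folder."""
    snapshot = {}
    for root, _dirs, files in os.walk(folder):
        for name in files:
            if not name.lower().endswith(suffix):
                continue
            path = os.path.join(root, name)
            try:
                st = os.stat(path)
            except FileNotFoundError:
                # The file vanished between listing and stat; skip it.
                continue
            snapshot[path] = (st.st_mtime_ns, st.st_size)
    return snapshot


def diff(previous, current):
    """Compare two snapshots; return (new, changed, removed) as sorted path lists."""
    new = sorted(p for p in current if p not in previous)
    changed = sorted(
        p for p in current if p in previous and current[p] != previous[p]
    )
    removed = sorted(p for p in previous if p not in current)
    return new, changed, removed
```

If the counts of new and changed files from a comparison like this don't match what the PowerShell script hands to the indexer on the same run, the problem is probably in the file handling rather than in Tika or Solr.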

If it were me, I'd probably build a standalone indexer script in Python that did the file handling, called out to a separate Tika service for extraction, and posted the results to Solr.
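Something along these lines, very much a sketch and not tested against a real setup. The URLs, the core name `emails`, the `content` field and the use of the file path as the document ID are all assumptions to adapt; it expects a Tika Server listening on its `/tika` endpoint and a Solr core whose uniqueKey is `id`. It sends no commit, so it relies on autoCommit or a separate commit request.

```python
import json
import urllib.request

# Assumed locations; change these to match your own services.
TIKA_URL = "http://localhost:9998/tika"
SOLR_UPDATE_URL = "http://localhost:8983/solr/emails/update"


def extract_text(path, tika_url=TIKA_URL):
    """Send the raw file to a Tika Server and return the extracted plain text."""
    with open(path, "rb") as fh:
        body = fh.read()
    req = urllib.request.Request(
        tika_url, data=body, method="PUT", headers={"Accept": "text/plain"}
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8", errors="replace")


def build_doc(path, text):
    """Build a Solr document; the field names are assumptions about the schema."""
    return {"id": path, "content": text}


def post_docs(docs, solr_url=SOLR_UPDATE_URL):
    """POST a batch of documents to Solr as a JSON array."""
    req = urllib.request.Request(
        solr_url,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status


def index_files(paths, extract=extract_text, post=post_docs, batch_size=100):
    """Extract and post each file in batches; return [(path, error)] for failures."""
    failed = []
    batch = []
    for path in paths:
        try:
            batch.append(build_doc(path, extract(path)))
        except Exception as exc:
            # Record the failure so a skipped file is visible, then carry on.
            failed.append((path, str(exc)))
            continue
        if len(batch) >= batch_size:
            post(batch)
            batch = []
    if batch:
        post(batch)
    return failed
```

Because re-posting a document with the same `id` replaces the old one, a script like this wouldn't need to query Solr first to see whether a file is already indexed, and the list of failures gives you something concrete to check when files seem to be missing.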

Cheers


Charlie





On 02/06/2020 14:48, Zheng Lin Edwin Yeo wrote:
Hi Charlie,

The main code that is doing the indexing is from Solr's
SimplePostTool, but we have made some modifications to it.

Walking through the folder is done by a PowerShell script, the content
of each .eml file is extracted by the Tika that comes with Solr, and the
images in the .eml files are processed by the OCR that comes with Solr.

As we have modified the SimplePostTool code to check whether a file
already exists in the index by running a Solr search query on the ID, I'm
wondering whether this issue is caused by the PowerShell script or the query in
the SimplePostTool code not being able to keep up with the large number of
files?

Regards,
Edwin


On Mon, 1 Jun 2020 at 17:19, Charlie Hull <char...@flax.co.uk> wrote:

Hi Edwin,

What code is actually doing the indexing? AFAIK Solr doesn't include any
code for actually walking a folder, extracting the content from .eml
files and pushing this data into its index, so I'm guessing you've built
something external?

Charlie


On 01/06/2020 02:13, Zheng Lin Edwin Yeo wrote:
Hi,

I am running this on Solr 7.6.0

Currently I have a situation where there are more than 2 million EML files
in a folder, and the EML files in the folder are constantly being updated
with the latest information while new EML files are added.

When I do the indexing, it is supposed to index the new EML files and
update the index entries for EML files whose content has changed. However, I
found that not all new EML files are indexed with each run of the
indexing.
Could it be caused by the large number of files in the folder? Or is it due to
some other reason?

Regards,
Edwin

--
Charlie Hull
OpenSource Connections, previously Flax

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.o19s.com


