Hi Charlie,

The main indexing code comes from Solr's SimplePostTool, with some
modifications of our own.

A PowerShell script walks the folder, the Tika that ships with Solr
extracts the content from each .eml file, and the images in the .eml files
are handled by the OCR that comes with Solr.
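
For illustration only, here is a rough Python sketch of a folder walk that
picks out new or changed .eml files. This is not our PowerShell script. The
function name changed_eml_files and the last_run_epoch parameter are
hypothetical, and it assumes the file modification time is a reliable sign
that the content has changed.

```python
import os


def changed_eml_files(folder, last_run_epoch):
    """Yield paths of .eml files in `folder` modified after `last_run_epoch`.

    `last_run_epoch` is a hypothetical timestamp (seconds since the epoch)
    recorded at the start of the previous indexing run.
    """
    # os.scandir avoids a separate stat call per file on most platforms,
    # which matters when the folder holds millions of entries.
    with os.scandir(folder) as entries:
        for entry in entries:
            if not entry.is_file():
                continue
            if not entry.name.lower().endswith(".eml"):
                continue
            if entry.stat().st_mtime > last_run_epoch:
                yield entry.path
```

Since the folder keeps changing while a run is in progress, it may help to
take a snapshot of the result first (for example list(changed_eml_files(...)))
and index from that snapshot.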

We modified the SimplePostTool code to check whether a file already exists
in the index by running a Solr search query on its ID. I'm wondering
whether this issue is caused by the PowerShell script, or by the query in
the SimplePostTool code, not being able to keep up with the large number of
files?
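
To make the existence check concrete, here is a minimal Python sketch of
that kind of lookup. It is not the actual code in our modified
SimplePostTool (which is Java). The base URL, the collection name "emails",
the unique key field name "id" and both function names are assumptions for
illustration only.

```python
from urllib.parse import urlencode


def build_exists_query_url(solr_base, collection, doc_id):
    """Build a /select URL that asks whether a document with this ID exists.

    The {!term} query parser matches the raw ID value, so characters such
    as backslashes or colons in a file path need no query-syntax escaping.
    The field name "id" is an assumption about the schema.
    """
    params = {
        "q": "{!term f=id}" + doc_id,
        "rows": 0,  # only numFound is needed, not the documents themselves
        "wt": "json",
    }
    return f"{solr_base.rstrip('/')}/{collection}/select?{urlencode(params)}"


def exists_in_index(response):
    """Return True if a parsed Solr JSON response reports at least one hit."""
    return response.get("response", {}).get("numFound", 0) > 0
```

With one such query per file, more than 2 million files means more than
2 million round trips to Solr per run, which is one reason I suspect this
step.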

Regards,
Edwin


On Mon, 1 Jun 2020 at 17:19, Charlie Hull <char...@flax.co.uk> wrote:

> Hi Edwin,
>
> What code is actually doing the indexing? AFAIK Solr doesn't include any
> code for actually walking a folder, extracting the content from .eml
> files and pushing this data into its index, so I'm guessing you've built
> something external?
>
> Charlie
>
>
> On 01/06/2020 02:13, Zheng Lin Edwin Yeo wrote:
> > Hi,
> >
> > I am running this on Solr 7.6.0
> >
> > Currently I have a situation where there are more than 2 million EML files
> > in a folder, and the folder is constantly updating the EML files with the
> > latest information and adding new EML files.
> >
> > When I do the indexing, it is supposed to index the new EML files and
> > update the index entries whose EML file content has changed. However, I
> > found that not all new EML files are updated in the index with each
> > indexing run.
> >
> > Could it be caused by the large number of files in the folder? Or due to
> > some other reasons?
> >
> > Regards,
> > Edwin
> >
>
> --
> Charlie Hull
> OpenSource Connections, previously Flax
>
> tel/fax: +44 (0)8700 118334
> mobile:  +44 (0)7767 825828
> web: www.o19s.com
>
>
