Re: Not all EML files are indexing during indexing

2020-06-03 Thread Charlie Hull
I think the OP is indexing flat files, not web pages (but otherwise, I agree with you that Scrapy is great - I know some of the people behind it too and they're a good bunch).

Charlie

On 02/06/2020 16:41, Walter Underwood wrote:
On Jun 2, 2020, at 7:40 AM, Charlie Hull wrote:
If it was me

Re: Not all EML files are indexing during indexing

2020-06-02 Thread Walter Underwood
> On Jun 2, 2020, at 7:40 AM, Charlie Hull wrote:
>
> If it was me I'd probably build a standalone indexer script in Python that
> did the file handling, called out to a separate Tika service for extraction,
> posted to Solr.

I would do the same thing, and I would base that script on Scrapy
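For readers following the thread, the standalone indexer described above could look roughly like the sketch below. It is only an illustration, not code from any poster: the Tika and Solr URLs, the collection name `eml`, the field names `id` and `content`, and the function names are all assumptions. It uses only the Python standard library (not Scrapy), walks a folder for .eml files, sends each one to a separately running Tika server for text extraction, and posts the results to Solr in batches. Files that fail extraction are recorded instead of being skipped silently, which may help when tracking down files that never reach the index.

```python
import json
import urllib.request
from pathlib import Path

# Assumed endpoints: a Tika server on its default port and a hypothetical
# Solr collection called "eml". Adjust both for a real installation.
TIKA_URL = "http://localhost:9998/tika"
SOLR_UPDATE_URL = "http://localhost:8983/solr/eml/update?commit=true"


def extract_text(path, tika_url=TIKA_URL):
    """Send one file to the Tika server and return the extracted plain text."""
    request = urllib.request.Request(
        tika_url,
        data=path.read_bytes(),
        method="PUT",
        headers={"Accept": "text/plain", "Content-Type": "message/rfc822"},
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")


def post_docs(docs, solr_url=SOLR_UPDATE_URL):
    """Post a list of documents to Solr's JSON update handler."""
    request = urllib.request.Request(
        solr_url,
        data=json.dumps(docs).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        response.read()


def index_folder(folder, extract=extract_text, post=post_docs, batch_size=100):
    """Index every .eml file under `folder`.

    Returns (indexed_ids, failures) so that files which could not be
    extracted are reported rather than dropped without a trace.
    """
    indexed, failures, batch = [], [], []
    for path in sorted(Path(folder).rglob("*")):
        # Compare the suffix case-insensitively so .EML is picked up too.
        if not path.is_file() or path.suffix.lower() != ".eml":
            continue
        try:
            text = extract(path)
        except Exception as exc:  # keep going, but remember the failure
            failures.append((str(path), repr(exc)))
            continue
        # "id" and "content" are assumed field names in the Solr schema.
        batch.append({"id": str(path), "content": text})
        if len(batch) >= batch_size:
            post(batch)
            indexed.extend(doc["id"] for doc in batch)
            batch = []
    if batch:
        post(batch)
        indexed.extend(doc["id"] for doc in batch)
    return indexed, failures
```

Calling `index_folder("/path/to/eml")` returns the ids that were posted and a list of `(path, error)` pairs. Passing `extract` and `post` in as parameters keeps the file handling separate from the two services, so either one can be swapped out or stubbed for a dry run.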

Re: Not all EML files are indexing during indexing

2020-06-02 Thread Charlie Hull
Ah OK. I haven't used SimplePostTool myself and I note the docs say "View this not as a best-practice code example, but as a standalone example built with an explicit purpose of not having external jar dependencies." I'm wondering if it's some kind of synchronisation issue between new files

Re: Not all EML files are indexing during indexing

2020-06-02 Thread Zheng Lin Edwin Yeo
Hi Charlie, The main code that does the indexing is from Solr's SimplePostTool, but we have made some modifications to it. Walking through the folder is done by a PowerShell script, extraction of the content from the .eml files is done by the Tika that comes with Solr, and the images in the .eml

Re: Not all EML files are indexing during indexing

2020-06-01 Thread Charlie Hull
Hi Edwin,

What code is actually doing the indexing? AFAIK Solr doesn't include any code for actually walking a folder, extracting the content from .eml files and pushing this data into its index, so I'm guessing you've built something external?

Charlie

On 01/06/2020 02:13, Zheng Lin Edwin

Not all EML files are indexing during indexing

2020-05-31 Thread Zheng Lin Edwin Yeo
Hi, I am running this on Solr 7.6.0. Currently I have a situation where there are more than 2 million EML files in a folder, and the folder is constantly being updated: existing EML files receive the latest information and new EML files are added. When I do the indexing, it is supposed to index the new EML files,