I'd strongly recommend rolling your own ingest code.  See Erick's
superb post: https://lucidworks.com/post/indexing-with-solrj/

You can easily get attachments via the RecursiveParserWrapper, e.g.
https://github.com/apache/tika/blob/master/tika-parsers/src/test/java/org/apache/tika/parser/RecursiveParserWrapperTest.java#L351

This will return a list of Metadata objects: the first one is the
main/container document, and each subsequent entry is an attachment.
Let us know if you have any questions/surprises.  There are a couple
of todos for .eml...
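
To make the list shape above concrete, here is a minimal sketch of
pulling the attachment count and names out of it before sending them
to Solr.  Assumptions: plain Map<String, String> objects stand in for
Tika's Metadata objects so the sketch runs without Tika on the
classpath, the "resourceName" key is my assumption for where the file
name lives, and AttachmentSummary/attachmentNames are hypothetical
names, not part of Tika or Solr.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper: not part of Tika or Solr.
class AttachmentSummary {
    // Assumed metadata key holding an entry's file name.
    static final String NAME_KEY = "resourceName";

    // Returns the names of all attachments; the count is the list size.
    static List<String> attachmentNames(List<Map<String, String>> metadataList) {
        List<String> names = new ArrayList<>();
        // Index 0 is the main/container document (the .eml itself), so start at 1.
        for (int i = 1; i < metadataList.size(); i++) {
            // Some attachments carry no name; use a placeholder for those.
            names.add(metadataList.get(i).getOrDefault(NAME_KEY, "(unnamed)"));
        }
        return names;
    }

    public static void main(String[] args) {
        // Stand-in for the list returned by the recursive parse.
        List<Map<String, String>> parsed = List.of(
                Map.of(NAME_KEY, "message.eml"),
                Map.of(NAME_KEY, "report.pdf"),
                Map.of("Content-Type", "image/png"));
        List<String> names = attachmentNames(parsed);
        System.out.println(names.size() + " attachment(s): " + names);
    }
}
```

The count and the names could then go into two fields of your own
choosing on the Solr document built by your ingest code.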

On Fri, Aug 2, 2019 at 3:43 AM Jan Høydahl <jan....@cominvent.com> wrote:
>
> Try the Apache Tika mailing list.
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
> > 2. aug. 2019 kl. 05:01 skrev Zheng Lin Edwin Yeo <edwinye...@gmail.com>:
> >
> > Hi,
> >
> > Does anyone know if this can be done on the Solr side?
> > Or does it have to be done on the Tika side?
> >
> > Regards,
> > Edwin
> >
> > On Thu, 1 Aug 2019 at 09:38, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> > wrote:
> >
> >> Hi,
> >>
> >> I would like to check: is there any way we can detect the number of
> >> attachments and their names during indexing of EML files in Solr, and index
> >> that information into Solr?
> >>
> >> Currently, Solr is able to use Tika and Tesseract OCR to extract the
> >> contents of the attachments. However, I could not find any information
> >> about the number of attachments in the EML file or their filenames.
> >>
> >> I am using Solr 7.6.0 in production, and also trying out on the new Solr
> >> 8.2.0.
> >>
> >> Regards,
> >> Edwin
> >>
>
