Re: Indexing PDF on SOLR 8.5

Erick Erickson Sun, 07 Jun 2020 13:51:13 -0700

https://lucidworks.com/post/indexing-with-solrj/
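
For the parse-once approach Jörn describes further down the thread (extract the text of each PDF on several threads, store it on a file system, and index from the stored text later), a minimal sketch might look like the following. It uses only the JDK. The names PdfTextCache and extractAll are hypothetical, and the extractor argument is a stand-in for a real parser such as Tika; the Solr indexing step is deliberately left out.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: caches extracted PDF text on disk so that a later
// reindex does not have to parse the PDFs again.
class PdfTextCache {

    /**
     * Walks sourceRoot, runs the supplied extractor on every *.pdf file using
     * a fixed thread pool, and writes the text to cacheRoot under the same
     * relative path with ".txt" appended. Files already present in the cache
     * are skipped. Returns the number of files newly extracted.
     */
    static int extractAll(Path sourceRoot, Path cacheRoot,
                          Function<Path, String> extractor, int threads)
            throws Exception {
        // Collect the PDF paths first so the directory stream is closed early.
        List<Path> pdfs;
        try (Stream<Path> walk = Files.walk(sourceRoot)) {
            pdfs = walk.filter(Files::isRegularFile)
                       .filter(p -> p.getFileName().toString()
                                     .toLowerCase().endsWith(".pdf"))
                       .collect(Collectors.toList());
        }

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Boolean>> results = new ArrayList<>();
            for (Path pdf : pdfs) {
                Callable<Boolean> task =
                        () -> extractOne(sourceRoot, cacheRoot, pdf, extractor);
                results.add(pool.submit(task));
            }
            int extracted = 0;
            for (Future<Boolean> result : results) {
                if (result.get()) {   // rethrows any failure from the worker
                    extracted++;
                }
            }
            return extracted;
        } finally {
            pool.shutdown();
        }
    }

    // Extracts one file; returns false if its text was already cached.
    private static boolean extractOne(Path sourceRoot, Path cacheRoot, Path pdf,
                                      Function<Path, String> extractor)
            throws IOException {
        Path target = cacheRoot.resolve(
                sourceRoot.relativize(pdf).toString() + ".txt");
        if (Files.exists(target)) {
            return false;
        }
        Files.createDirectories(target.getParent());
        Files.write(target, extractor.apply(pdf).getBytes(StandardCharsets.UTF_8));
        return true;
    }
}
```

A separate indexing pass can then read the .txt files and send them to Solr, and it can be rerun after a schema change or a major upgrade without touching the PDFs.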



> On Jun 7, 2020, at 3:22 PM, Fiz N <fiznewy...@gmail.com> wrote:
> 
> Thanks Jorn and Erick.
> 
> Hi Erick, it looks like the skeletal SolrJ program attachment is missing.
> 
> Thanks
> Fiz
> 
> On Sun, Jun 7, 2020 at 12:20 PM Erick Erickson <erickerick...@gmail.com>
> wrote:
> 
>> Here’s a skeletal SolrJ program using Tika as another alternative.
>> 
>> Best,
>> Erick
>> 
>>> On Jun 7, 2020, at 2:06 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>>> 
>>> You have to write an external application that creates multiple threads,
>>> parses the PDFs, and indexes them in Solr. Ideally you parse the PDFs
>>> once, store the resulting text on some file system, and then index it.
>>> The reason is that if you upgrade across two major versions of Solr, you
>>> might need to reindex. Then you can save time because you don’t need to
>>> parse the PDFs again.
>>> It can also be useful in case you are not sure yet about the final
>>> schema and need to index several times with different schemas, etc.
>>> 
>>> You can also use Apache ManifoldCF.
>>> 
>>> 
>>> 
>>>> On Jun 7, 2020, at 19:19, Fiz N <fiznewy...@gmail.com> wrote:
>>>> 
>>>> Hello SOLR Experts,
>>>> 
>>>> I am working on a POC to index millions of PDF documents stored in
>>>> multiple folders on a file share.
>>>> 
>>>> Could you please let me know the best practices and steps to implement it?
>>>> 
>>>> Thanks
>>>> Fiz Nadiyal.
>> 
>>
