Re: [magnolia-user] Re: Search indexes - magnolia 4.1.1

Denis Demichev Thu, 15 Oct 2009 05:53:08 -0700

Hello All,

Thank you for that update - I deleted the Lucene indexes in
workspace_name/index and they were rebuilt.
As for the PDF extractor, it looks like it never propagates PDF parser
related exceptions. Here's an excerpt from
org.apache.jackrabbit.extractor.PdfTextExtractor:


} catch (Exception e) {
    // it may happen that PDFParser throws a runtime
    // exception when parsing certain pdf documents
    logger.warn("Failed to extract PDF text content", e);
    return new StringReader("");
} finally {
    stream.close();
}
That's what Jackrabbit 1.5 has, at least; 1.6 is the same.
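
To make the point concrete, here is a minimal, self-contained sketch of the same catch/finally shape. It is not Jackrabbit code: the TextParser interface, the class name and the helper methods are hypothetical stand-ins I made up for illustration, and System.err replaces the logger. It only shows that, with this pattern, a parser that throws ends up producing an empty reader instead of an exception.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.io.StringReader;

public class SwallowingExtractorSketch {

    /** Hypothetical stand-in for the real PDF parser; not a Jackrabbit type. */
    interface TextParser {
        String parse(InputStream in) throws Exception;
    }

    /**
     * Mirrors the catch/finally shape of the excerpt: any parser failure
     * becomes an empty reader, and the stream is always closed.
     */
    static Reader extractText(InputStream stream, TextParser parser) throws IOException {
        try {
            return new StringReader(parser.parse(stream));
        } catch (Exception e) {
            // Stand-in for logger.warn(...) in the excerpt.
            System.err.println("Failed to extract PDF text content: " + e);
            return new StringReader("");
        } finally {
            stream.close();
        }
    }

    /** Drains a reader into a String so the result can be inspected. */
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = reader.read()) != -1) {
            sb.append((char) c);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Simulates a corrupted PDF: the parser throws a runtime exception.
        TextParser broken = in -> {
            throw new IllegalStateException("corrupted PDF");
        };
        Reader r = extractText(new ByteArrayInputStream(new byte[] {1, 2, 3}), broken);
        System.out.println("extracted length: " + readAll(r).length());
    }
}
```

If the real extractor behaves like this sketch, a corrupted PDF would simply be indexed with no text rather than aborting the index rebuild.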
Does anyone know when exactly incoming documents are parsed? Right after
upload, or maybe during the activation procedure?

Thanks!



Regards,
Denis


On Thu, Oct 15, 2009 at 7:49 AM, Jan Haderka
<[email protected]> wrote:

>
> That is not what Matteo asked.
> As he correctly pointed out, the presence of a corrupted PDF will cause
> indexing to fail, which in turn would cause Magnolia to fail at startup.
> You then have two options. Either remove the PDF text extractor so that
> PDFs are not indexed, let the indexes get created, remove the corrupted
> PDF file, and redo the indexing with the PDF text extractor enabled
> again. Or, if you have EE, comment out the <SearchIndexer/> section in
> the workspace.xml of the affected workspace, use MagnoliaTools to remove
> the affected node, and uncomment the section again afterwards.
>
> Jan
>
> On Thu, 2009-10-15 at 13:35 +0200, Zdenek Skodik wrote:
> > Hi Matteo,
> >
> > yep, in order to index PDF files you first
> > need to parse them to extract the text that
> > you want to index.
> >
> > -
> > Best regards,
> >
> > Zdenek Skodik
> > Magnolia International Ltd.
> >
> > Magnolia®  - Simple Open-Source Content Management
> >
> >
> > On Thu, 2009-10-15 at 13:12 +0200, Matteo Pelucco wrote:
> > > Zdenek Skodik wrote:
> > > > Hi Denis,
> > > >
> > > > there is nothing new about this issue.
> > > > If you need to rebuild your Lucene indexes:
> > > >
> > > > * stop your application server
> > > > * delete all ../repositories/magnolia/workspaces/*/index folders
> > > > * during startup of your server the indexes will be recreated
> > >
> > > Hi Zdenek,
> > > in this case, I think that if PDFs are stored at the DB level, each
> > > index rebuild will end with an exception, won't it?
> > > Correct me if I am wrong...
> > >
> > > Matteo
> > >
> > >
> > > ----------------------------------------------------------------
> > > For list details see
> > > http://www.magnolia-cms.com/home/community/mailing-lists.html
> > > To unsubscribe, E-mail to: <[email protected]>
> > > ----------------------------------------------------------------
> >
> >
>
>
>
>

