Re: [magnolia-user] Re: Search indexes - magnolia 4.1.1

Denis Demichev Fri, 16 Oct 2009 05:56:12 -0700

Hello All,

Matteo wrote:
>>Sorry, I missed something, how can you say that STK is related to PDF?
STK ships a bunch of sample files in DMS, and the majority of them are PDFs.
I still cannot index PDFs, even after deleting the Lucene indexes.


However, while indexing an RTF file I get an exception:

java.lang.IllegalArgumentException: The document is really a RTF file
    at org.apache.poi.hwpf.HWPFDocument.verifyAndBuildPOIFS(HWPFDocument.java:114)
    at org.apache.poi.hwpf.extractor.WordExtractor.<init>(WordExtractor.java:49)
    at org.apache.jackrabbit.extractor.MsWordTextExtractor.extractText(MsWordTextExtractor.java:64)
    at org.apache.jackrabbit.extractor.CompositeTextExtractor.extractText(CompositeTextExtractor.java:90)
    at org.apache.jackrabbit.core.query.lucene.JackrabbitTextExtractor.extractText(JackrabbitTextExtractor.java:195)
    at org.apache.jackrabbit.core.query.lucene.TextExtractorJob$1.call(TextExtractorJob.java:93)

It looks like org.apache.jackrabbit.extractor.MsWordTextExtractor is chosen
for text extraction instead of
org.apache.jackrabbit.extractor.RTFTextExtractor.
That is, the wrong file type is picked up where the MIME type is read, at
line 402 of org.apache.jackrabbit.core.query.lucene.NodeIndexer:
InternalValue typeValue = getValue(NameConstants.JCR_MIMETYPE);

Here's an implementation of getValue:

    /**
     * Utility method that extracts the first value of the named property
     * of the current node. Returns <code>null</code> if the property does
     * not exist or contains no values.
     *
     * @param name property name
     * @return value of the named property, or <code>null</code>
     * @throws ItemStateException if the property can not be accessed
     */
    protected InternalValue getValue(Name name) throws ItemStateException {
        try {
            PropertyId id = new PropertyId(node.getNodeId(), name);
            PropertyState property =
                (PropertyState) stateProvider.getItemState(id);
            InternalValue[] values = property.getValues();
            if (values.length > 0) {
                return values[0];
            } else {
                return null;
            }
        } catch (NoSuchItemStateException e) {
            return null;
        }
    }

So my assumption is that the JCR node holding the RTF file carries the wrong
MIME type... I am not sure how to check this MIME value in Magnolia, though.
It should be "application/rtf" or "text/rtf", not "application/vnd.ms-word"
or "application/msword".
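
To make the suspicion concrete, here is a minimal sketch in plain Java of how
a MIME-type-driven lookup would send an RTF file to the Word extractor if the
node says "application/msword". This is a toy model, not Jackrabbit's actual
code: the class name, the table, and pickExtractor are made up for
illustration, and only the extractor names and MIME strings come from the
trace and text above (PdfTextExtractor is added as an assumed third entry).

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Toy model of MIME-type-driven extractor selection; NOT Jackrabbit code. */
public class ExtractorDispatchSketch {

    // Hypothetical table: extractor name -> MIME types it claims to handle.
    static final Map<String, List<String>> EXTRACTORS = new LinkedHashMap<>();
    static {
        EXTRACTORS.put("MsWordTextExtractor",
                Arrays.asList("application/vnd.ms-word", "application/msword"));
        EXTRACTORS.put("RTFTextExtractor",
                Arrays.asList("application/rtf", "text/rtf"));
        EXTRACTORS.put("PdfTextExtractor",
                Arrays.asList("application/pdf"));
    }

    /** Returns the first extractor claiming the MIME type, or null if none does. */
    static String pickExtractor(String mimeType) {
        if (mimeType == null) {
            return null;
        }
        String normalized = mimeType.trim().toLowerCase(Locale.ROOT);
        for (Map.Entry<String, List<String>> entry : EXTRACTORS.entrySet()) {
            if (entry.getValue().contains(normalized)) {
                return entry.getKey();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // The stored MIME type decides the extractor; file content is never consulted.
        System.out.println(pickExtractor("application/msword"));
        System.out.println(pickExtractor("application/rtf"));
        System.out.println(pickExtractor("application/octet-stream"));
    }
}
```

The point of the sketch is that the lookup depends only on the stored
property, so an RTF upload tagged as Word would fail in exactly the way the
trace shows.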


I would really appreciate any help with the PDFs: I don't see any exception,
so I cannot investigate what exactly went wrong.
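
With no exception to go on, one cheap check is whether the files really are
what their MIME type claims. Below is a minimal sketch, assuming the files
can be exported from DMS to disk first. It looks only at the leading bytes:
PDF files start with "%PDF-", RTF files with "{\rtf", and legacy Word files
with the OLE2 header. The class and method names are made up for
illustration, and this is not how Jackrabbit or Magnolia detect types.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

/** Illustrative helper: guesses a file's real type from its leading bytes. */
public class MagicByteSniffer {

    // OLE2 compound-document header used by legacy .doc (and .xls/.ppt) files.
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    private static final byte[] PDF = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] RTF = "{\\rtf".getBytes(StandardCharsets.US_ASCII);

    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    /** Returns "pdf", "rtf", "ole2" or "unknown" for the given leading bytes. */
    static String sniff(byte[] head) {
        if (startsWith(head, PDF)) return "pdf";
        if (startsWith(head, RTF)) return "rtf";
        if (startsWith(head, OLE2)) return "ole2"; // Word, but also other Office formats
        return "unknown";
    }

    /** Reads up to max bytes from the start of the file. */
    static byte[] readHead(String file, int max) throws IOException {
        byte[] buf = new byte[max];
        int total = 0;
        try (InputStream in = Files.newInputStream(Paths.get(file))) {
            int n;
            while (total < max && (n = in.read(buf, total, max - total)) > 0) {
                total += n;
            }
        }
        return Arrays.copyOf(buf, total);
    }

    public static void main(String[] args) throws IOException {
        // Usage: java MagicByteSniffer file1 file2 ...
        for (String file : args) {
            System.out.println(file + ": " + sniff(readHead(file, 8)));
        }
    }
}
```

A file that reports "rtf" here while its node says "application/msword"
would confirm the mismatch above. A sample that reports "unknown" is not a
well-formed PDF to begin with.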


Thank you!

Regards,
Denis


On Fri, Oct 16, 2009 at 2:40 AM, Matteo Pelucco <[email protected]> wrote:

>
> Denis Demichev wrote:
>
>> Hello Matteo,
>>
>> Thank you for your quick response.
>>
>
> Magnolia gives me one T-shirt for each message I write.
> I now have a shop :-)
>
>> >> You should be able to use the query manager and successfully execute
>> >> this query:
>> >> SELECT * FROM nt:base
>>
>> I tried to run it against DMS successfully: 244 nodes returned in 734ms
>>
>
> OK, this is proof that DMS is indexed.
> Now try deleting ..workspaces/dms/index/* from the filesystem.
> At the next startup you should see something like:
>
> 'loading DMS workspace'
>
> (if SearchIndexer is configured correctly for that ws in workspace.xml)
>
> and the PDFs will be indexed (again).
> I would like to force a re-index to be sure that no exception was thrown
> during the previous index-building phase.
>
>> Unfortunately no luck with PDF. As STK has majority of PDF documents in
>> DMS that could be the reason why I couldn't search documents.
>>
>
> Sorry, I missed something: how can you say that STK is related to PDF?
> STK, afaik, is a "framework" which helps to build pages, nothing related to
> JCR / Lucene indexes, is it?
> Or do you mean the new asset management shipped with Magnolia?
>
>> Still I'm not sure when exactly Magnolia will index this or that document
>> in DMS.
>>
>
> It should be at save time, but I'm not 100% sure.
>
> Sorry, I don't have much experience with PDF indexing, but are you sure
> that your PDFs are indexable? You can try to wrap PDFIndexer and log
> something, but it is not a quick debugging option...
>
> :-(
>
>
> matteo
>
>
> ----------------------------------------------------------------
> For list details see
> http://www.magnolia-cms.com/home/community/mailing-lists.html
> To unsubscribe, E-mail to: <[email protected]>
> ----------------------------------------------------------------
>
>
