Re: [magnolia-user] Re: Search indexes - magnolia 4.1.1

Denis Demichev Fri, 16 Oct 2009 08:07:26 -0700

Hello All,


Just as I sent this message, I came across the following line:

WARN   org.apache.jackrabbit.core.query.lucene.TextExtractorJob 16.10.2009
08:52:31 -- Exception while indexing binary property:
java.lang.NoClassDefFoundError:
org/bouncycastle/jce/provider/BouncyCastleProvider

This line appeared after a PDF file was added to the system. Unfortunately, I
don't have the full exception stack trace, as it was truncated. It looks like
I'm missing a jar, probably the Bouncy Castle Crypto API (http://bouncycastle.org/).
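
A quick way to confirm this guess is to check whether the class named in the
error can be loaded at all. The following is only a minimal sketch: the class
name ClasspathCheck and the helper isOnClasspath are made up for illustration,
and the result only means something if it is run with the same classpath as
the web application.

```java
class ClasspathCheck {
    /** Returns true if the named class can be found by this class's loader. */
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Class name taken from the NoClassDefFoundError in the log line above
        String provider = "org.bouncycastle.jce.provider.BouncyCastleProvider";
        System.out.println(provider + " available: " + isOnClasspath(provider));
    }
}
```

If this reports that the class is not available, adding the Bouncy Castle
provider jar to the web application's libraries would be the next thing to try.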



Regards,
Denis


On Fri, Oct 16, 2009 at 8:50 AM, Denis Demichev <[email protected]> wrote:

> Hello All,
>
> Matteo wrote:
> >>Sorry, I missed something, how can you say that STK is related to PDF?
> STK has a bunch of sample files in DMS, and the majority of them are PDFs.
> I still cannot index PDFs, even if I delete the Lucene indexes.
>
> However, while indexing an RTF file I get an exception:
>
> java.lang.IllegalArgumentException: The document is really a RTF file
>     at
> org.apache.poi.hwpf.HWPFDocument.verifyAndBuildPOIFS(HWPFDocument.java:114)
>     at
> org.apache.poi.hwpf.extractor.WordExtractor.<init>(WordExtractor.java:49)
>     at
> org.apache.jackrabbit.extractor.MsWordTextExtractor.extractText(MsWordTextExtractor.java:64)
>     at
> org.apache.jackrabbit.extractor.CompositeTextExtractor.extractText(CompositeTextExtractor.java:90)
>     at
> org.apache.jackrabbit.core.query.lucene.JackrabbitTextExtractor.extractText(JackrabbitTextExtractor.java:195)
>     at
> org.apache.jackrabbit.core.query.lucene.TextExtractorJob$1.call(TextExtractorJob.java:93)
>
> It looks like org.apache.jackrabbit.extractor.MsWordTextExtractor is chosen
> for text extraction instead of
> org.apache.jackrabbit.extractor.RTFTextExtractor.
> That is, the wrong file type is picked up here:
> line 402 of org.apache.jackrabbit.core.query.lucene.NodeIndexer.
> InternalValue typeValue = getValue(NameConstants.JCR_MIMETYPE);
>
> Here's an implementation of getValue:
>
>     /**
>      * Utility method that extracts the first value of the named property
>      * of the current node. Returns <code>null</code> if the property does
>      * not exist or contains no values.
>      *
>      * @param name property name
>      * @return value of the named property, or <code>null</code>
>      * @throws ItemStateException if the property can not be accessed
>      */
>     protected InternalValue getValue(Name name) throws ItemStateException {
>         try {
>             PropertyId id = new PropertyId(node.getNodeId(), name);
>             PropertyState property =
>                 (PropertyState) stateProvider.getItemState(id);
>             InternalValue[] values = property.getValues();
>             if (values.length > 0) {
>                 return values[0];
>             } else {
>                 return null;
>             }
>         } catch (NoSuchItemStateException e) {
>             return null;
>         }
>     }
>
> So my assumption is that the JCR node holding the RTF file has the wrong MIME
> type stored with it. I'm not sure how to check this MIME value in Magnolia,
> though.
> It should be "application/rtf" or "text/rtf", not
> "application/vnd.ms-word" or "application/msword".
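
One way to double-check this assumption outside the repository is to look at
the first bytes of the uploaded file and compare the result with the MIME type
stored on the node. This is a minimal sketch that assumes the file is available
on disk; FileTypeSniffer and sniff are hypothetical names, not part of
Jackrabbit or Magnolia. It relies only on the well-known file signatures: RTF
starts with the text {\rtf, binary Office files use the OLE2 container
signature, and PDF starts with %PDF-.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

class FileTypeSniffer {
    // RTF documents begin with the ASCII text "{\rtf"
    private static final byte[] RTF_MAGIC = "{\\rtf".getBytes(StandardCharsets.US_ASCII);
    // PDF documents begin with the ASCII text "%PDF-"
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    // OLE2 compound files (binary .doc, but also .xls and .ppt) begin with this signature
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
            && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    /** Guesses a MIME type from the leading bytes of a file, or returns "unknown". */
    static String sniff(byte[] head) {
        if (startsWith(head, RTF_MAGIC)) return "application/rtf";
        if (startsWith(head, PDF_MAGIC)) return "application/pdf";
        // Assumption: an OLE2 container is reported as Word, although it may be another Office format
        if (startsWith(head, OLE2_MAGIC)) return "application/msword";
        return "unknown";
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java FileTypeSniffer <file>");
            return;
        }
        byte[] head = new byte[8];
        int n;
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            n = in.read(head); // a single read is enough for a short header in this sketch
        }
        System.out.println(sniff(Arrays.copyOf(head, Math.max(n, 0))));
    }
}
```

If the sniffed type is "application/rtf" while the node carries
"application/msword", that would be consistent with the exception above.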
>
>
> I would really appreciate any help with PDF: I don't see any exception and
> so cannot find out what exactly went wrong.
>
>
> Thank you!
>
> Regards,
> Denis
>
>
>
> On Fri, Oct 16, 2009 at 2:40 AM, Matteo Pelucco <[email protected]> wrote:
>
>>
>> Denis Demichev wrote:
>>
>>> Hello Matteo,
>>>
>>> Thank you for your quick response.
>>>
>>
>> Magnolia gives me one T-shirt for each message I write.
>> I now have a shop :-)
>>
>>> >> You should be able to use the query manager and successfully execute
>>> >> this query:
>>> >> SELECT * FROM nt:base
>>>
>>> I tried to run it against DMS successfully: 244 nodes returned in 734ms
>>>
>>
>> OK, this proves that DMS is indexed.
>> Now try deleting ..workspaces/dms/index/* from the filesystem.
>> At the next startup you should see something like:
>>
>> 'loading DMS workspace'
>>
>> (if SearchIndexer is configured correctly for that ws in workspace.xml)
>>
>> and PDFs will be indexed (again).
>> I would like to force a re-index to be sure that no exception was
>> thrown during the previous index build.
>>
>>> Unfortunately no luck with PDF. As STK has a majority of PDF documents in
>>> DMS, that could be the reason why I couldn't search documents.
>>>
>>
>> Sorry, I missed something, how can you say that STK is related to PDF?
>> STK, afaik, is a "framework" which helps to build pages, nothing related to
>> JCR / Lucene indexes, isn't it?
>> Or do you maybe mean the new asset management shipped with Magnolia?
>>
>>> Still I'm not sure when exactly Magnolia will index this or that document
>>> in DMS.
>>>
>>
>> It should be at save time, but I'm not 100% sure.
>>
>> Sorry, I don't have much experience with PDF indexing, but are you sure
>> that your PDFs are indexable? You can try to wrap PDFIndexer and log
>> something, but it is not a quick debugging option...
>>
>> :-(
>>
>>
>> matteo
>>
>>
>> ----------------------------------------------------------------
>> For list details see
>> http://www.magnolia-cms.com/home/community/mailing-lists.html
>> To unsubscribe, E-mail to: <[email protected]>
>> ----------------------------------------------------------------
>>
>>
>

