Hello All,
Matteo wrote:
>>Sorry, I missed something, how can you say that STK is related to PDF?
STK ships with a bunch of sample files in DMS, and the majority of them are PDFs.
I still cannot get PDFs indexed, even after deleting the Lucene indexes.
However, while indexing an RTF file I get an exception:
java.lang.IllegalArgumentException: The document is really a RTF file
    at org.apache.poi.hwpf.HWPFDocument.verifyAndBuildPOIFS(HWPFDocument.java:114)
    at org.apache.poi.hwpf.extractor.WordExtractor.<init>(WordExtractor.java:49)
    at org.apache.jackrabbit.extractor.MsWordTextExtractor.extractText(MsWordTextExtractor.java:64)
    at org.apache.jackrabbit.extractor.CompositeTextExtractor.extractText(CompositeTextExtractor.java:90)
    at org.apache.jackrabbit.core.query.lucene.JackrabbitTextExtractor.extractText(JackrabbitTextExtractor.java:195)
    at org.apache.jackrabbit.core.query.lucene.TextExtractorJob$1.call(TextExtractorJob.java:93)
It looks like org.apache.jackrabbit.extractor.MsWordTextExtractor is chosen
for text extraction instead of
org.apache.jackrabbit.extractor.RTFTextExtractor.
I.e. the wrong file type seems to be picked up at line 402 of
org.apache.jackrabbit.core.query.lucene.NodeIndexer:
    InternalValue typeValue = getValue(NameConstants.JCR_MIMETYPE);
Here's an implementation of getValue:
    /**
     * Utility method that extracts the first value of the named property
     * of the current node. Returns <code>null</code> if the property does
     * not exist or contains no values.
     *
     * @param name property name
     * @return value of the named property, or <code>null</code>
     * @throws ItemStateException if the property can not be accessed
     */
    protected InternalValue getValue(Name name) throws ItemStateException {
        try {
            PropertyId id = new PropertyId(node.getNodeId(), name);
            PropertyState property =
                (PropertyState) stateProvider.getItemState(id);
            InternalValue[] values = property.getValues();
            if (values.length > 0) {
                return values[0];
            } else {
                return null;
            }
        } catch (NoSuchItemStateException e) {
            return null;
        }
    }
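To make the suspected failure mode concrete, here is a minimal sketch of an extractor being picked purely from the declared MIME type. This is not Jackrabbit's actual code: ExtractorDispatch, register, extract and demo are made-up names, and the "return empty text for unknown types" behaviour is an assumption of the sketch only.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Illustration only: ExtractorDispatch and its handlers are hypothetical,
// not Jackrabbit classes.
class ExtractorDispatch {
    private final Map<String, Function<byte[], String>> byMimeType = new HashMap<>();

    void register(String mimeType, Function<byte[], String> extractor) {
        byMimeType.put(mimeType, extractor);
    }

    /** Picks the extractor from the declared MIME type only; the content is never inspected. */
    String extract(String declaredMimeType, byte[] content) {
        Function<byte[], String> extractor = byMimeType.get(declaredMimeType);
        if (extractor == null) {
            return ""; // assumption of this sketch: unknown types yield no text and no exception
        }
        return extractor.apply(content);
    }

    /** Builds a dispatcher with two stand-in extractors. */
    static ExtractorDispatch demo() {
        ExtractorDispatch d = new ExtractorDispatch();
        d.register("application/rtf", c -> "rtf text");
        d.register("application/msword", c -> {
            // Stand-in for a Word parser that rejects non-Word input.
            throw new IllegalArgumentException("The document is really a RTF file");
        });
        return d;
    }
}
```

With this sketch, an RTF file stored under "application/msword" ends up in the Word handler and fails, while the same bytes stored under "application/rtf" would be handled normally.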
So my assumption is that the JCR node holding the RTF file has a wrong MIME
type stored for it. I'm not sure how to check this MIME value in Magnolia
though.
It should be "application/rtf" or "text/rtf", but not "application/vnd.ms-word"
or "application/msword".
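Independently of what is stored in the repository, the real type of a file can be double-checked from its first bytes. Below is a minimal, stdlib-only sketch. MimeSniffer and sniff are made-up names (nothing from Magnolia or Jackrabbit), and it relies only on two well-known signatures: RTF files start with "{\rtf", and binary Word files are OLE2 containers starting with D0 CF 11 E0 A1 B1 1A E1.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper, not part of Magnolia or Jackrabbit.
class MimeSniffer {
    // An RTF document starts with the ASCII characters "{\rtf".
    private static final byte[] RTF_MAGIC = "{\\rtf".getBytes(StandardCharsets.US_ASCII);
    // A binary Word (.doc) file is an OLE2 compound document with this 8-byte header.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    /** Guesses a MIME type from the first bytes of a file, or returns null if unknown. */
    static String sniff(byte[] head) {
        if (startsWith(head, RTF_MAGIC)) {
            return "application/rtf";
        }
        if (startsWith(head, OLE2_MAGIC)) {
            return "application/msword"; // could also be another OLE2-based format
        }
        return null;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    // Usage: java MimeSniffer <path-to-file>
    public static void main(String[] args) throws IOException {
        byte[] head = new byte[8];
        int n = 0;
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            // Read up to 8 bytes; a single read() call may return fewer.
            while (n < head.length) {
                int r = in.read(head, n, head.length - n);
                if (r < 0) {
                    break;
                }
                n += r;
            }
        }
        System.out.println(sniff(Arrays.copyOf(head, n)));
    }
}
```

If this reports "application/rtf" for a file that the repository labels as a Word document, that would support the wrong-MIME-type assumption.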
I would really appreciate any help with the PDFs: I don't see any exception,
so I cannot tell what exactly went wrong.
Thank you!
Regards,
Denis
On Fri, Oct 16, 2009 at 2:40 AM, Matteo Pelucco <[email protected]> wrote:
>
> Denis Demichev wrote:
>
>> Hello Matteo,
>>
>> Thank you for your quick response.
>>
>
> Magnolia gives me one T-shirt for each message I write.
> I now have a shop :-)
>
>> >> You should be able to use query manager and to successfully execute this
>> >> query:
>> >> SELECT * FROM nt:base
>>
>> I tried to run it against DMS successfully: 244 nodes returned in 734ms
>>
>
> Ok, this is the proof that DMS is indexed.
> Try now to delete ..workspaces/dms/index/* from filesystem.
> At the next startup you should see something like:
>
> 'loading DMS workspace'
>
> (if SearchIndexer is configured correctly for that ws in workspace.xml)
>
> and PDFs will be indexed (again).
> I would force a re-index to be sure that no exception was thrown
> during the previous index build.
>
>> Unfortunately no luck with PDF. As STK has majority of PDF documents in
>> DMS that could be the reason why I couldn't search documents.
>>
>
> Sorry, I missed something, how can you say that STK is related to PDF?
> STK, afaik, is a "framework" which helps to build pages, nothing related to
> JCR / Lucene indexes, isn't it?
> Or maybe do you mean the new asset management shipped with Magnolia?
>
>> Still I'm not sure when exactly Magnolia will index this or that
>> document in DMS.
>>
>
> It should be at save time, but I'm not 100% sure.
>
> Sorry, I don't have much experience with PDF indexing, but are you sure
> that your PDFs are indexable? You can try to wrap PDFIndexer and log
> something, but it is not a quick debugging option...
>
> :-(
>
>
> matteo
>
>
> ----------------------------------------------------------------
> For list details see
> http://www.magnolia-cms.com/home/community/mailing-lists.html
> To unsubscribe, E-mail to: <[email protected]>
> ----------------------------------------------------------------
>
>