[jira] [Commented] (TIKA-694) On extraction, get properties AND / OR content extraction

2020-03-24 Thread Radu Gheorghe (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17065390#comment-17065390
 ] 

Radu Gheorghe commented on TIKA-694:


Thanks for following up, Tim! It makes perfect sense.

In my particular use case, I don't mind the performance penalty of parsing the 
whole file to get metadata (though I think it's not needed in the specific 
case of emails). But I do run into trouble if the (potentially massive) body 
ends up in memory, because I can run out of heap. Using a write limit of 0 
characters isn't too bad of a workaround, is it? :D

> On extraction, get properties AND / OR content extraction
> -
>
> Key: TIKA-694
> URL: https://issues.apache.org/jira/browse/TIKA-694
> Project: Tika
> Issue Type: Wish
> Components: parser
> Affects Versions: 1.0
> Environment: All OS
> Reporter: Etienne Jouvin
> Priority: Minor
> Attachments: Tika-1.0.zip
>
>
> I use Tika to extract properties, and only properties, from Office files.
> The parser goes through the document content, which is not necessary and 
> slows down the process.
> It would be nice to have the choice to extract only properties or not.
> What I did was the following:
> I extended AutoDetectParser to override the parse method.
> Then, in the ParseContext instance, I put a boolean flag set to true to say 
> that only the properties should be extracted.
> For example, for Office files, I extended the OfficeParser class. In the parse 
> method, I check the flag and, if it is true, skip all the content 
> extraction.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (TIKA-694) On extraction, get properties AND / OR content extraction

2020-03-23 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17064963#comment-17064963
 ] 

Tim Allison commented on TIKA-694:
--

One of the challenges is that different parsers may need to parse the whole 
file before they have all the metadata. In general, we try to parse the 
metadata, or at least add it, as early as possible because as soon as we hit a 
body element, no more metadata can be written to the XHTML...although the data 
will still be added to the Metadata object.

In short, it is hard.
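
To make the ordering constraint concrete, here is a minimal sketch using only 
the JDK. LateMetadataDemo and its methods are hypothetical names for 
illustration, not Tika classes, and the toy writer does no XML escaping. It 
only shows the behaviour described above: metadata found before the body lands 
in both the XHTML head and the metadata map, while metadata found after the 
body has started lands only in the map.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical toy writer (not a Tika class): the head is closed once the body starts. */
final class LateMetadataDemo {
    private final Map<String, String> metadata = new LinkedHashMap<>();
    private final StringBuilder xhtml = new StringBuilder("<html><head>");
    private boolean bodyStarted = false;

    /** Always records the value; writes a <meta> element only while the head is still open. */
    void addMetadata(String name, String value) {
        metadata.put(name, value);
        if (!bodyStarted) {
            // No escaping here: this is a sketch, not production XHTML output.
            xhtml.append("<meta name=\"").append(name)
                 .append("\" content=\"").append(value).append("\"/>");
        }
    }

    /** Closes the head and opens the body exactly once. */
    private void startBody() {
        if (!bodyStarted) {
            xhtml.append("</head><body>");
            bodyStarted = true;
        }
    }

    /** Writes one paragraph of body text, opening the body first if needed. */
    void text(String s) {
        startBody();
        xhtml.append("<p>").append(s).append("</p>");
    }

    /** Closes the document and returns the XHTML written so far. */
    String finish() {
        startBody();
        return xhtml.append("</body></html>").toString();
    }

    Map<String, String> metadata() {
        return metadata;
    }

    public static void main(String[] args) {
        LateMetadataDemo doc = new LateMetadataDemo();
        doc.addMetadata("title", "Report");   // known up front: goes into the head
        doc.text("Hello");                    // body starts here
        doc.addMetadata("pageCount", "3");    // known only later: map only
        System.out.println(doc.finish());
        System.out.println(doc.metadata());
    }
}
```

A caller that only reads the XHTML would miss pageCount here, which is why a 
metadata-only mode cannot simply stop at the first body element for every 
format.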



[jira] [Commented] (TIKA-694) On extraction, get properties AND / OR content extraction

2020-03-23 Thread Radu Gheorghe (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17064921#comment-17064921
 ] 

Radu Gheorghe commented on TIKA-694:


Unfortunately, the RFC822 parser doesn't seem to be one of them :(

But I've had some success with this ugly workaround:
{code:java}
        writer = new WriteOutContentHandler(0); // we allow 0 characters from body
        handler = new BodyContentHandler(writer);
{code}
Which implies handling a WriteLimitReachedException every time (see 
TIKA-2787). But at least my app doesn't crash on huge files anymore (and I 
only need the headers).
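
For anyone wondering what the "expect the exception every time" pattern looks 
like, here is a rough sketch that uses only the JDK's SAX classes. 
ZeroLimitDemo, LimitedHandler and LimitReached are hypothetical stand-ins, not 
Tika's WriteOutContentHandler or WriteLimitReachedException, and the tiny 
XHTML input is made up. The idea is the same though: headers are collected 
before the body, the first body character trips the limit, and the caller 
treats that one exception as a normal outcome while rethrowing anything else.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical stand-in for a write-limited handler; these are not Tika classes. */
final class ZeroLimitDemo {

    /** Marker exception so callers can tell "limit hit" apart from real parse errors. */
    static final class LimitReached extends SAXException {
        LimitReached(String message) {
            super(message);
        }
    }

    /** Collects <meta name="..." content="..."/> pairs and refuses body text past the limit. */
    static final class LimitedHandler extends DefaultHandler {
        final Map<String, String> meta = new LinkedHashMap<>();
        private final int limit;
        private int written = 0;
        private boolean inBody = false;

        LimitedHandler(int limit) {
            this.limit = limit;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (qName.equals("meta")) {
                meta.put(atts.getValue("name"), atts.getValue("content"));
            } else if (qName.equals("body")) {
                inBody = true;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            if (!inBody) {
                return;
            }
            if (written + length > limit) {
                // Nothing is buffered: the body text is dropped and parsing is aborted.
                throw new LimitReached("write limit of " + limit + " characters reached");
            }
            written += length;
        }
    }

    /** Walks the cause chain, in case the parser wrapped our marker exception. */
    static boolean isLimitReached(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (c instanceof LimitReached) {
                return true;
            }
        }
        return false;
    }

    /** Parses with a limit of 0 body characters and returns whatever headers were seen. */
    static Map<String, String> headersOnly(String xhtml) throws Exception {
        LimitedHandler handler = new LimitedHandler(0);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (SAXException e) {
            if (!isLimitReached(e)) {
                throw e; // a genuine parse error, not our limit
            }
        }
        return handler.meta;
    }

    public static void main(String[] args) throws Exception {
        String doc = "<html><head><meta name=\"subject\" content=\"hello\"/></head>"
                + "<body><p>a potentially massive body</p></body></html>";
        System.out.println(headersOnly(doc));
    }
}
```

The cause-chain check is there because a parser may wrap the exception thrown 
by the handler, so matching only on the outermost exception type could miss 
it.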



[jira] [Commented] (TIKA-694) On extraction, get properties AND / OR content extraction

2020-03-20 Thread Rapster (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17063260#comment-17063260
 ] 

Rapster commented on TIKA-694:
--

Let me comment on this 5 years later :)

Etienne has a strong point: there are plenty of use cases where you only need 
to extract metadata and not content.
So far, only PDFParser can achieve this: pass null as the handler and it 
should work. As a result, extracting metadata from a 1 GB PDF file takes 14 s 
instead of 84 s. That is definitely not negligible, especially if working 
synchronously is your only option.

I don't know about all the parsers, but I know a lot of them don't support 
null handlers. I'm fully aware it'd be a lot of work, but it would be worth 
it ;-)

Please consider reopening this ticket.



[jira] [Commented] (TIKA-694) On extraction, get properties AND / OR content extraction

2011-08-23 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13089389#comment-13089389
 ] 

Nick Burch commented on TIKA-694:
-

For some parsers it may be possible to skip some parts if only metadata or text 
is required, but for many parsers there wouldn't be any savings. My hunch is 
that it's probably only the office-type formats where there would be a big 
change.

If we were to do this, I think the parse context probably is the right place 
for this flag.
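
As a rough illustration of the flag idea, here is a sketch that uses only the 
JDK. MetadataOnlyFlagDemo, Context and ExtractionMode are hypothetical names, 
and Context is a toy type-keyed map standing in for a parse context, not 
Tika's ParseContext. The "parser" is a toy too: it copies the properties and 
only writes the body when the flag does not ask for metadata only.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of a metadata-only flag carried in a type-keyed context. */
final class MetadataOnlyFlagDemo {

    /** Hypothetical flag type; the class itself is the lookup key. */
    enum ExtractionMode { METADATA_ONLY, METADATA_AND_CONTENT }

    /** Toy type-keyed context standing in for a parse context. */
    static final class Context {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key, T fallback) {
            T value = key.cast(entries.get(key));
            return value != null ? value : fallback;
        }
    }

    /**
     * Toy parser: always returns the properties, and appends the body to the
     * output only when the context does not request metadata-only extraction.
     */
    static Map<String, String> parse(Map<String, String> properties, String body,
                                     Context context, StringBuilder output) {
        Map<String, String> metadata = new LinkedHashMap<>(properties);
        ExtractionMode mode =
                context.get(ExtractionMode.class, ExtractionMode.METADATA_AND_CONTENT);
        if (mode == ExtractionMode.METADATA_ONLY) {
            return metadata; // skip the (potentially expensive) content extraction
        }
        output.append(body);
        return metadata;
    }

    public static void main(String[] args) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("author", "Etienne");

        Context metadataOnly = new Context();
        metadataOnly.set(ExtractionMode.class, ExtractionMode.METADATA_ONLY);

        StringBuilder out = new StringBuilder();
        System.out.println(parse(props, "document text", metadataOnly, out));
        System.out.println("content length: " + out.length());
    }
}
```

Defaulting to full extraction when the flag is absent keeps existing callers 
working unchanged, and parsers that cannot save anything are free to ignore 
the flag.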

> On extraction, get properties AND / OR content extraction
> -
>
> Key: TIKA-694
> URL: https://issues.apache.org/jira/browse/TIKA-694
> Project: Tika
> Issue Type: Wish
> Components: parser
> Affects Versions: 0.9
> Environment: All OS
> Reporter: Etienne Jouvin
> Priority: Minor
>
> I use Tika to extract properties, and only properties, from Office files.
> The parser goes through the document content, which is not necessary and 
> slows down the process.
> It would be nice to have the choice to extract only properties or not.
> What I did was the following:
> I extended AutoDetectParser to override the parse method.
> Then, in the ParseContext instance, I put a boolean flag set to true to say 
> that only the properties should be extracted.
> For example, for Office files, I extended the OfficeParser class. In the parse 
> method, I check the flag and, if it is true, skip all the content 
> extraction.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira