[jira] [Commented] (TIKA-694) On extraction, get properties AND / OR content extraction
    [ https://issues.apache.org/jira/browse/TIKA-694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17065390#comment-17065390 ]

Radu Gheorghe commented on TIKA-694:
------------------------------------

Thanks for following up, Tim! It makes perfect sense.

In my particular use-case, I don't mind the performance penalty of parsing the whole file to get metadata (though I think it's not needed in the very particular use-case of emails). But I do run into trouble if the (potentially massive) body ends up in memory, because I can run out of heap.

Using 0 bytes isn't too bad of a workaround, is it? :D

> On extraction, get properties AND / OR content extraction
> ---------------------------------------------------------
>
>                 Key: TIKA-694
>                 URL: https://issues.apache.org/jira/browse/TIKA-694
>             Project: Tika
>          Issue Type: Wish
>          Components: parser
>    Affects Versions: 1.0
>         Environment: All OS
>            Reporter: Etienne Jouvin
>            Priority: Minor
>         Attachments: Tika-1.0.zip
>
>
> I use Tika to extract properties, and only properties, from Office files.
> The parser goes through the document content; this is not necessary and
> slows down the process.
> It would be nice to have the choice to extract only properties or not.
> What I did was the following:
> I extended AutoDetectParser to override the parse method.
> Then, in the ParseContext instance, I put a flag set to boolean true to say
> "only extract the properties".
> For example, for Office files, I extended the OfficeParser class. In the
> parse method, I check the flag, and if it is true, I skip all extraction
> of the content.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Commented] (TIKA-694) On extraction, get properties AND / OR content extraction
    [ https://issues.apache.org/jira/browse/TIKA-694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17064963#comment-17064963 ]

Tim Allison commented on TIKA-694:
----------------------------------

One of the challenges is that different parsers may need to parse the whole file before having all the metadata.

In general, we try to parse the metadata, or at least add the metadata, as early as possible, because as soon as we hit a body element, no more metadata can be written to the xhtml...although the data will be added to the metadata object.

In short, it is hard.
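
The ordering constraint Tim describes can be illustrated with a small JDK-only sketch. This is not Tika code: `SketchXhtmlWriter` and its methods are hypothetical names, and the sketch only assumes that the head is written once, from whatever metadata is known when the body starts.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; SketchXhtmlWriter is not a Tika class.
class HeadThenBodyDemo {

    static final class SketchXhtmlWriter {
        final Map<String, String> metadata = new LinkedHashMap<>();
        final StringBuilder out = new StringBuilder();
        private boolean bodyStarted;

        /** Writes the head from the metadata known right now, then opens the body. */
        private void startBodyIfNeeded() {
            if (bodyStarted) {
                return;
            }
            out.append("<html><head>");
            for (Map.Entry<String, String> entry : metadata.entrySet()) {
                // Escaping is omitted to keep the sketch short.
                out.append("<meta name=\"").append(entry.getKey())
                   .append("\" content=\"").append(entry.getValue()).append("\"/>");
            }
            out.append("</head><body>");
            bodyStarted = true;
        }

        /** Emits body text; the first call freezes the head. */
        void text(String text) {
            startBodyIfNeeded();
            out.append(text);
        }

        /** Closes the document and returns the serialized output. */
        String finish() {
            startBodyIfNeeded();
            return out.append("</body></html>").toString();
        }
    }

    public static void main(String[] args) {
        SketchXhtmlWriter writer = new SketchXhtmlWriter();
        writer.metadata.put("title", "Quarterly report"); // known before the body
        writer.text("First paragraph");
        writer.metadata.put("pageCount", "12");           // discovered late
        System.out.println(writer.finish());              // head lacks pageCount
        System.out.println(writer.metadata);              // map still has it
    }
}
```

A value added after the first body text is still in the metadata map but never reaches the serialized head, which is why a parser that learns some property late may have to read the whole file before its metadata is complete.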
[jira] [Commented] (TIKA-694) On extraction, get properties AND / OR content extraction
    [ https://issues.apache.org/jira/browse/TIKA-694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17064921#comment-17064921 ]

Radu Gheorghe commented on TIKA-694:
------------------------------------

Unfortunately, the RFC822 parser doesn't seem to be one of them :(

But I've had some success with this ugly workaround:
{code:java}
// A write limit of 0 lets no body characters through to the handler.
WriteOutContentHandler writer = new WriteOutContentHandler(0);
ContentHandler handler = new BodyContentHandler(writer);
{code}
Which implies handling WriteLimitReachedException every time (see TIKA-2787). But at least my app doesn't crash on huge files anymore (and I only need the headers).
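
To show the shape of this workaround without depending on Tika, here is a minimal sketch that uses only the JDK's SAX classes. It is a stand-in, not Tika's implementation: `HeadersOnlyHandler`, `LimitReached` and `parseHeaders` are hypothetical names, and `<meta>` elements in a small XHTML string play the role of the email headers.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical JDK-only sketch; none of these types are Tika classes.
class BodyLimitDemo {

    /** Hypothetical signal raised once the body exceeds the character limit. */
    static final class LimitReached extends SAXException {
        LimitReached() {
            super("character limit reached");
        }
    }

    /** Collects meta name/content pairs and keeps at most `limit` body characters. */
    static final class HeadersOnlyHandler extends DefaultHandler {
        final Map<String, String> headers = new LinkedHashMap<>();
        final StringBuilder body = new StringBuilder();
        private final int limit;
        private boolean inBody;

        HeadersOnlyHandler(int limit) {
            this.limit = limit;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("meta".equals(qName)) {
                headers.put(atts.getValue("name"), atts.getValue("content"));
            } else if ("body".equals(qName)) {
                inBody = true;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            if (!inBody) {
                return; // text outside the body is ignored
            }
            int room = limit - body.length();
            if (length > room) {
                body.append(ch, start, Math.max(room, 0));
                throw new LimitReached(); // abort instead of buffering more
            }
            body.append(ch, start, length);
        }
    }

    /** Parses the input, treating the limit signal as a normal way to stop early. */
    static HeadersOnlyHandler parseHeaders(String xhtml, int limit) {
        HeadersOnlyHandler handler = new HeadersOnlyHandler(limit);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (LimitReached expected) {
            // Expected: everything collected before the limit is still usable.
        } catch (Exception e) {
            throw new IllegalStateException("unexpected parse failure", e);
        }
        return handler;
    }

    public static void main(String[] args) {
        String doc = "<html><head><meta name=\"subject\" content=\"Hello\"/></head>"
                + "<body>a potentially massive body</body></html>";
        HeadersOnlyHandler result = parseHeaders(doc, 0);
        System.out.println(result.headers + " body chars kept: " + result.body.length());
    }
}
```

The cost is the same as in the comment above: with a limit of 0 the exception fires on every document that has a body, so the caller has to treat it as a normal outcome rather than a failure.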
[jira] [Commented] (TIKA-694) On extraction, get properties AND / OR content extraction
    [ https://issues.apache.org/jira/browse/TIKA-694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17063260#comment-17063260 ]

Rapster commented on TIKA-694:
------------------------------

Let me comment on this 5 years later :)

Etienne has a strong point: there are plenty of use cases where you only need to extract metadata and not content. So far, only PDFParser can achieve this: pass null as the handler and it should work. As a consequence, extracting metadata from a 1 GB PDF file takes 14 s instead of 84 s, which is definitely not negligible, especially if working synchronously is your only option.

I don't know all the parsers, but I know a lot of them do not support null handlers. I'm fully aware it'd be a lot of work, but worth it ;-)

Please consider reopening this ticket.
[jira] [Commented] (TIKA-694) On extraction, get properties AND / OR content extraction
    [ https://issues.apache.org/jira/browse/TIKA-694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13089389#comment-13089389 ]

Nick Burch commented on TIKA-694:
---------------------------------

For some parsers it may be possible to skip some parts if only metadata or text is required, but for many parsers there wouldn't be any savings. My hunch is that it's probably only the office-type formats where there would be a big change.

If we were to do this, I think the parse context probably is the right place for this flag.

> On extraction, get properties AND / OR content extraction
> ---------------------------------------------------------
>
>                 Key: TIKA-694
>                 URL: https://issues.apache.org/jira/browse/TIKA-694
>             Project: Tika
>          Issue Type: Wish
>          Components: parser
>    Affects Versions: 0.9
>         Environment: All OS
>            Reporter: Etienne Jouvin
>            Priority: Minor
>
> I use Tika to extract properties, and only properties, from Office files.
> The parser goes through the document content; this is not necessary and
> slows down the process.
> It would be nice to have the choice to extract only properties or not.
> What I did was the following:
> I extended AutoDetectParser to override the parse method.
> Then, in the ParseContext instance, I put a flag set to boolean true to say
> "only extract the properties".
> For example, for Office files, I extended the OfficeParser class. In the
> parse method, I check the flag, and if it is true, I skip all extraction
> of the content.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
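
The flag-in-the-context idea from the issue description can be sketched as follows. This is only an illustration under stated assumptions: `SketchContext`, `ExtractionMode` and the toy `parse` method are hypothetical and are not Tika classes; the sketch just assumes a context object that stores values keyed by their type, which each parser may consult.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; none of these types are Tika classes.
class MetadataOnlyFlagDemo {

    /** Hypothetical flag a caller can place in the context. */
    enum ExtractionMode { METADATA_ONLY, METADATA_AND_CONTENT }

    /** Minimal type-keyed context: one value per class. */
    static final class SketchContext {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key, T defaultValue) {
            Object value = entries.get(key);
            return value == null ? defaultValue : key.cast(value);
        }
    }

    /** Parse result: properties always, body text only when requested. */
    static final class Result {
        final Map<String, String> properties = new LinkedHashMap<>();
        String body = "";
    }

    /** Toy parser: copies the properties and skips the body when the flag says so. */
    static Result parse(Map<String, String> docProperties, String docBody, SketchContext context) {
        // No flag in the context means the existing behavior: extract everything.
        ExtractionMode mode =
                context.get(ExtractionMode.class, ExtractionMode.METADATA_AND_CONTENT);
        Result result = new Result();
        result.properties.putAll(docProperties);
        if (mode == ExtractionMode.METADATA_AND_CONTENT) {
            result.body = docBody; // the expensive step a real parser would avoid
        }
        return result;
    }

    public static void main(String[] args) {
        SketchContext context = new SketchContext();
        context.set(ExtractionMode.class, ExtractionMode.METADATA_ONLY);
        Result result = parse(Collections.singletonMap("title", "Report"), "long body", context);
        System.out.println(result.properties + " body chars: " + result.body.length());
    }
}
```

Defaulting to full extraction when the flag is absent keeps existing callers unaffected, and parsers that cannot save anything by skipping the body can simply ignore the flag.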