[jira] [Created] (TIKA-2478) MBOX import includes redundant copies of the text

2017-10-16 Thread Robert Letzler (JIRA)
Robert Letzler created TIKA-2478:


 Summary: MBOX import includes redundant copies of the text
 Key: TIKA-2478
 URL: https://issues.apache.org/jira/browse/TIKA-2478
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.16
Reporter: Robert Letzler
Priority: Minor


MBOX messages often get parsed into four documents:
a.  The mbox file, the outer container: "/"
b.  The actual email: "/embedded-1"
c.  The UTF-8 text content of the email: "/embedded-1/embedded-2"
d.  The UTF-8 HTML content of the email: "/embedded-1/embedded-3"

Entries c and d are redundant and distracting.  The MSG parser parses the first 
non-null email body and then skips the rest.  Please modify the MBOX parser so 
that it does not emit separate "attached" documents for the HTML body and the 
text body.

The attachment to https://issues.apache.org/jira/browse/TIKA-2471 is an example 
of input sufficient to generate this behavior.
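
The "first non-null body, skip the rest" behavior attributed to the MSG parser 
above can be sketched as follows. This is only an illustration of the idea and 
is not Tika's code: the class name FirstBodyDemo, the method firstBody and the 
preference order of the candidates are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.Optional;

/** Hypothetical illustration; none of these names come from Tika. */
final class FirstBodyDemo {

    /** Returns the first non-null body among the candidates, in list order. */
    static Optional<String> firstBody(List<String> candidates) {
        return candidates.stream()
                .filter(Objects::nonNull)
                .findFirst();
    }

    public static void main(String[] args) {
        // Candidate bodies in an assumed preference order; null means
        // "this alternative is not present in the message".
        List<String> bodies = Arrays.asList(null, "<p>Hello</p>", "Hello");

        // Only the first available body is reported; the remaining
        // alternatives are not emitted as separate documents.
        System.out.println(firstBody(bodies).orElse("(no body)"));
    }
}
```

With this approach the text and HTML alternatives of one message would yield a 
single body rather than two embedded documents.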

Thanks!





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (TIKA-1788) message/rfc822 parser doesn't identify attachment filenames from Content-Disposition header

2017-10-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16206460#comment-16206460
 ] 

ASF GitHub Bot commented on TIKA-1788:
--

tballison commented on issue #211: [TIKA-1788] RFC822Parser: provide email 
attachment filenames when available
URL: https://github.com/apache/tika/pull/211#issuecomment-337012343
 
 
   @AarjavP, I very much appreciate this PR.  I regret that I haven't been 
able to review it carefully yet, but I hope to do so over the next week or so.  
Thank you!
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> message/rfc822 parser doesn't identify attachment filenames from 
> Content-Disposition header
> ---
>
> Key: TIKA-1788
> URL: https://issues.apache.org/jira/browse/TIKA-1788
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.11
>Reporter: Sergey Tsalkov
>Assignee: Tim Allison
> Attachments: grep_content_disposition.zip
>
>
> rfc822 email files can contain attachments as subparts, and they'll
> generally specify the filename of the attachment in a manner like
> this:
> Content-Disposition: attachment;
> filename*=utf-8''image001.jpg
> Tika doesn't seem to be grabbing that information at all!
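
The filename*=utf-8''image001.jpg form quoted above is the extended parameter 
syntax from RFC 2231 (charset, optional language, then percent-encoded text, 
separated by single quotes). A minimal sketch of decoding it is shown below. 
It is not the RFC822Parser implementation or the change in the pull request; 
the class ExtValueDemo and its methods are hypothetical, and the fallback to 
US-ASCII for an omitted charset is an assumption.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of Tika. */
final class ExtValueDemo {

    // Matches the extended form of the parameter: filename*=<ext-value>
    private static final Pattern EXT_FILENAME =
            Pattern.compile("filename\\*\\s*=\\s*([^;\\s]+)", Pattern.CASE_INSENSITIVE);

    /** Decodes charset'language'percent-encoded-text into a Java string. */
    static String decodeExtValue(String extValue) {
        int first = extValue.indexOf('\'');
        int second = first < 0 ? -1 : extValue.indexOf('\'', first + 1);
        if (second < 0) {
            throw new IllegalArgumentException("not an extended value: " + extValue);
        }
        String charsetName = extValue.substring(0, first);
        // Assumption: fall back to US-ASCII when the charset part is empty.
        Charset charset = Charset.forName(charsetName.isEmpty() ? "US-ASCII" : charsetName);
        String encoded = extValue.substring(second + 1);

        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        for (int i = 0; i < encoded.length(); i++) {
            char c = encoded.charAt(i);
            if (c == '%' && i + 2 < encoded.length()) {
                // %XX is one raw byte; invalid hex digits raise NumberFormatException.
                bytes.write(Integer.parseInt(encoded.substring(i + 1, i + 3), 16));
                i += 2;
            } else {
                bytes.write(c); // unescaped characters are plain ASCII
            }
        }
        return new String(bytes.toByteArray(), charset);
    }

    /** Returns the decoded filename* parameter, or null if it is absent. */
    static String extractFilename(String contentDisposition) {
        Matcher m = EXT_FILENAME.matcher(contentDisposition);
        return m.find() ? decodeExtValue(m.group(1)) : null;
    }

    public static void main(String[] args) {
        String header = "attachment;\n filename*=utf-8''image001.jpg";
        System.out.println(extractFilename(header));
    }
}
```

The decoder works on bytes rather than using java.net.URLDecoder because 
URLDecoder also turns "+" into a space, which this syntax does not do.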





[jira] [Comment Edited] (TIKA-2471) Tab-prefixed message body lines in Mbox interpreted as headers

2017-10-16 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16206418#comment-16206418
 ] 

Tim Allison edited comment on TIKA-2471 at 10/16/17 7:30 PM:
-

That looks totally hosed.  Thank you for opening this and supplying an example 
triggering file. 

bq. But more to the point, what is the idea behind setting the headers in the 
MboxParser if they're going to be set by the RFC822Parser in any case?

TIKA-1244 brought that behavior in.  Before that, emails weren't treated as 
embedded files if I understand correctly.

bq.  why does the parser force Windows-1252 as the charset?
Again, no idea, -but I suspect that was because of the rfc822 method of 
encoding-.  I simply have no idea.  Are you able to share an example where this 
corrupts the content?


was (Author: talli...@mitre.org):
That looks totally hosed.  Thank you for opening this and supplying an example 
triggering file. 

bq. But more to the point, what is the idea behind setting the headers in the 
MboxParser if they're going to be set by the RFC822Parser in any case?

TIKA-1244 brought that behavior in.  Before that, emails weren't treated as 
embedded files if I understand correctly.

bq.  why does the parser force Windows-1252 as the charset?
Again, no idea, but I suspect that was because of the rfc822 method of 
encoding.  Are you able to share an example where this corrupts the content?

> Tab-prefixed message body lines in Mbox interpreted as headers
> --
>
> Key: TIKA-2471
> URL: https://issues.apache.org/jira/browse/TIKA-2471
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.16
>Reporter: Matthew Caruana Galizia
>  Labels: message, rfc822
> Attachments: mbox
>
>
> The mbox parser code is overly optimistic. It parses the entire message 
> looking for anything that matches a header pattern, wherever it occurs in a 
> line!
> It looks to me like the parsing logic is in desperate need of a refactor. But 
> more to the point, what is the idea behind setting the headers in the 
> MboxParser if they're going to be set by the RFC822Parser in any case?
> Also, out of curiosity, why does the parser force Windows-1252 as the charset?
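
The problem described above (header patterns matched anywhere in the message) 
can be contrasted with a stricter approach: read header fields only up to the 
first blank line, require the field name to start in column 0, and treat lines 
that begin with a space or tab as continuations of the previous field. The 
sketch below illustrates that idea only. It is not the MboxParser code or a 
proposed patch, and HeaderBlockDemo and parseHeaders are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration, not Tika's MboxParser. */
final class HeaderBlockDemo {

    // Field name (printable ASCII except ':') in column 0, a colon, then the value.
    private static final Pattern HEADER = Pattern.compile("([!-9;-~]+):[ \\t]*(.*)");

    /** Parses only the header block; later duplicates overwrite earlier ones. */
    static Map<String, String> parseHeaders(String message) {
        Map<String, String> headers = new LinkedHashMap<>();
        String lastName = null;
        for (String line : message.split("\r?\n")) {
            if (line.isEmpty()) {
                break; // the first blank line ends the header block
            }
            char c = line.charAt(0);
            if (c == ' ' || c == '\t') {
                // Folded continuation line: append to the previous field.
                if (lastName != null) {
                    headers.put(lastName, headers.get(lastName) + " " + line.trim());
                }
                continue;
            }
            Matcher m = HEADER.matcher(line);
            if (m.matches()) {
                lastName = m.group(1);
                headers.put(lastName, m.group(2));
            } else {
                lastName = null; // e.g. the mbox "From " separator line
            }
        }
        return headers;
    }

    public static void main(String[] args) {
        String message = "Subject: test\r\n\tcontinued\r\nFrom: a@example.org\r\n"
                + "\r\n\tX-Fake: tab-prefixed body line\r\nTo: body@example.org\r\n";
        System.out.println(parseHeaders(message));
    }
}
```

In this sketch the tab-prefixed body line and the "To:" line in the body are 
never examined, because parsing stops at the blank line.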





[jira] [Commented] (TIKA-2471) Tab-prefixed message body lines in Mbox interpreted as headers

2017-10-16 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16206418#comment-16206418
 ] 

Tim Allison commented on TIKA-2471:
---

That looks totally hosed.  Thank you for opening this and supplying an example 
triggering file. 

bq. But more to the point, what is the idea behind setting the headers in the 
MboxParser if they're going to be set by the RFC822Parser in any case?

TIKA-1244 brought that behavior in.  Before that, emails weren't treated as 
embedded files if I understand correctly.

bq.  why does the parser force Windows-1252 as the charset?
Again, no idea, but I suspect that was because of the rfc822 method of 
encoding.  Are you able to share an example where this corrupts the content?

> Tab-prefixed message body lines in Mbox interpreted as headers
> --
>
> Key: TIKA-2471
> URL: https://issues.apache.org/jira/browse/TIKA-2471
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.16
>Reporter: Matthew Caruana Galizia
>  Labels: message, rfc822
> Attachments: mbox
>
>
> The mbox parser code is overly optimistic. It parses the entire message 
> looking for anything that matches a header pattern, wherever it occurs in a 
> line!
> It looks to me like the parsing logic is in desperate need of a refactor. But 
> more to the point, what is the idea behind setting the headers in the 
> MboxParser if they're going to be set by the RFC822Parser in any case?
> Also, out of curiosity, why does the parser force Windows-1252 as the charset?





[jira] [Commented] (TIKA-2477) Tika : Content of XLSX file extraction is not working after poi library upgrade

2017-10-16 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16206374#comment-16206374
 ] 

Tim Allison commented on TIKA-2477:
---

Try a more recent version of Tika -- 1.16 -- available here: 
http://www.apache.org/dyn/closer.cgi/tika/tika-app-1.16.jar and let us know if 
you still have the same problem. Thank you!

> Tika :  Content of XLSX file extraction is not working after poi library 
> upgrade
> 
>
> Key: TIKA-2477
> URL: https://issues.apache.org/jira/browse/TIKA-2477
> Project: Tika
>  Issue Type: Bug
>  Components: core
>Reporter: Ramchandran
>
> Hi Team,
> I wrote a program to extract the content of a simple xlsx file. It works 
> fine with the poi-3.11 library, but after I upgraded to poi-3.16 the program 
> still runs and the cell content is no longer displayed (only the sheet name 
> is shown).
> Class File
> ===
> package MSExcelParse;
> import java.io.File;
> import java.io.FileInputStream;
> import java.io.IOException;
> import java.io.InputStream;
> import org.apache.tika.exception.TikaException;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.Parser;
> import org.apache.tika.sax.BodyContentHandler;
> import org.xml.sax.SAXException;
> public class MSExcelParser {
>    public static void main(final String[] args)
>          throws IOException, TikaException, SAXException {
>       // Collects the extracted body text
>       BodyContentHandler handler = new BodyContentHandler();
>       Metadata metadata = new Metadata();
>       ParseContext pcontext = new ParseContext();
>       // AutoDetectParser detects the file type and delegates to a parser
>       Parser parser = new AutoDetectParser();
>       // try-with-resources closes the stream even if parsing fails
>       try (InputStream inputstream = new FileInputStream(
>             new File("C:\\JavaTest\\Student.xlsx"))) {
>          parser.parse(inputstream, handler, metadata, pcontext);
>       }
>       System.out.println("Contents of the document:" + handler.toString());
>    }
> }
> .classpath file
> (XML markup was lost in the archive; the surviving classpath entries are:)
>   org.eclipse.jdt.launching.JRE_CONTAINER/org.eclipse.jdt.internal.debug.ui.launcher.StandardVMType/JavaSE-1.7
>   C:/JavaTest/commons-collections4-4.1.jar
>   C:/JavaTest/commons-compress-1.8.1.jar
>   C:/JavaTest/poi-ooxml-schemas-3.16.jar
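
When comparing library versions as suggested above, it can help to confirm 
which jars are actually being loaded at run time. The sketch below uses only 
the JDK to report the version and location for a few classes. The thread does 
not confirm the cause of the problem, so this is a diagnostic aid under that 
caveat; ClasspathVersionDemo and describe are hypothetical names, and the 
version is only available if the jar manifest declares it.

```java
import java.security.CodeSource;

/** Hypothetical diagnostic helper; uses only the JDK. */
final class ClasspathVersionDemo {

    /** Describes where a class is loaded from, or says that it is missing. */
    static String describe(String className) {
        try {
            // Load without initializing, so no static initializers run.
            Class<?> cls = Class.forName(className, false,
                    ClasspathVersionDemo.class.getClassLoader());
            Package pkg = cls.getPackage();
            // Taken from the jar manifest; may be absent.
            String version = pkg == null ? null : pkg.getImplementationVersion();
            CodeSource src = cls.getProtectionDomain().getCodeSource();
            String location = (src == null || src.getLocation() == null)
                    ? "unknown location" : src.getLocation().toString();
            return className + ": version " + (version == null ? "unknown" : version)
                    + ", loaded from " + location;
        } catch (ClassNotFoundException | LinkageError e) {
            return className + ": not on classpath";
        }
    }

    public static void main(String[] args) {
        // One class each from tika-core, poi and poi-ooxml.
        String[] names = {
            "org.apache.tika.Tika",
            "org.apache.poi.ss.usermodel.Workbook",
            "org.apache.poi.xssf.usermodel.XSSFWorkbook"
        };
        for (String name : names) {
            System.out.println(describe(name));
        }
    }
}
```

Running it with the same classpath as the failing program shows whether the 
Tika and POI jars in use are the ones that were intended.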





[jira] [Created] (TIKA-2477) Tika : Content of XLSX file extraction is not working after poi library upgrade

2017-10-16 Thread Ramchandran (JIRA)
Ramchandran created TIKA-2477:
-

 Summary: Tika :  Content of XLSX file extraction is not working 
after poi library upgrade
 Key: TIKA-2477
 URL: https://issues.apache.org/jira/browse/TIKA-2477
 Project: Tika
  Issue Type: Bug
  Components: core
Reporter: Ramchandran


Hi Team,

I wrote a program to extract the content of a simple xlsx file. It works fine 
with the poi-3.11 library, but after I upgraded to poi-3.16 the program still 
runs and the cell content is no longer displayed (only the sheet name is 
shown).

Class File
===
package MSExcelParse;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class MSExcelParser {

   public static void main(final String[] args)
         throws IOException, TikaException, SAXException {

      // Collects the extracted body text
      BodyContentHandler handler = new BodyContentHandler();
      Metadata metadata = new Metadata();
      ParseContext pcontext = new ParseContext();

      // AutoDetectParser detects the file type and delegates to a parser
      Parser parser = new AutoDetectParser();

      // try-with-resources closes the stream even if parsing fails
      try (InputStream inputstream = new FileInputStream(
            new File("C:\\JavaTest\\Student.xlsx"))) {
         parser.parse(inputstream, handler, metadata, pcontext);
      }

      System.out.println("Contents of the document:" + handler.toString());
   }
}

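.classpath file
(XML markup was lost in the archive; the surviving classpath entries are:)
  org.eclipse.jdt.launching.JRE_CONTAINER/org.eclipse.jdt.internal.debug.ui.launcher.StandardVMType/JavaSE-1.7
  C:/JavaTest/commons-collections4-4.1.jar
  C:/JavaTest/commons-compress-1.8.1.jar
  C:/JavaTest/poi-ooxml-schemas-3.16.jar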



[jira] [Commented] (TIKA-2400) Standardizing current Object Recognition REST parsers

2017-10-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16205529#comment-16205529
 ] 

ASF GitHub Bot commented on TIKA-2400:
--

ThejanW commented on issue #208: Fix for TIKA-2400 Standardizing current Object 
Recognition REST parsers
URL: https://github.com/apache/tika/pull/208#issuecomment-336805333
 
 
   The new URLs are:
   
https://raw.githubusercontent.com/tensorflow/models/master/research/inception/inception/data/imagenet_lsvrc_2015_synsets.txt
   
https://raw.githubusercontent.com/tensorflow/models/master/research/inception/inception/data/imagenet_metadata.txt
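
Compared with the URLs reported elsewhere in this thread as returning 404, the 
two URLs above differ only by an extra "research/" path segment after 
"master/". A small sketch of that rewrite follows. The class SynsetUrlDemo and 
the method relocate are hypothetical and are not part of the pull request; the 
two prefixes are taken from the URLs quoted in this thread.

```java
/** Hypothetical helper illustrating the path change; not part of the PR. */
final class SynsetUrlDemo {

    static final String OLD_PREFIX =
            "https://raw.githubusercontent.com/tensorflow/models/master/inception/";
    static final String NEW_PREFIX =
            "https://raw.githubusercontent.com/tensorflow/models/master/research/inception/";

    /** Rewrites an old-style URL to the new location; other URLs are returned unchanged. */
    static String relocate(String url) {
        return url.startsWith(OLD_PREFIX)
                ? NEW_PREFIX + url.substring(OLD_PREFIX.length())
                : url;
    }

    public static void main(String[] args) {
        System.out.println(relocate(OLD_PREFIX + "inception/data/imagenet_metadata.txt"));
    }
}
```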
 



> Standardizing current Object Recognition REST parsers
> -
>
> Key: TIKA-2400
> URL: https://issues.apache.org/jira/browse/TIKA-2400
> Project: Tika
>  Issue Type: Sub-task
>  Components: parser
>Reporter: Thejan Wijesinghe
>Priority: Minor
> Fix For: 1.17
>
>
> # This involves adding apiBaseUris and refactoring current Object Recognition 
> REST parsers,
> # Refactoring dockerfiles related to those parsers.
> #  Moving the logic related to checking minimum confidence into servers





[jira] [Commented] (TIKA-2400) Standardizing current Object Recognition REST parsers

2017-10-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16205509#comment-16205509
 ] 

ASF GitHub Bot commented on TIKA-2400:
--

ThejanW commented on issue #208: Fix for TIKA-2400 Standardizing current Object 
Recognition REST parsers
URL: https://github.com/apache/tika/pull/208#issuecomment-336800656
 
 
   I was getting the same error. Nothing is wrong with your Docker setup. The 
problem was with the download URLs of **imagenet_lsvrc_2015_synsets.txt** and 
**imagenet_metadata.txt**. Apparently the TensorFlow maintainers have moved 
these meta files and models to another repo, 
https://github.com/tensorflow/serving. 
   If you open 
https://raw.githubusercontent.com/tensorflow/models/master/inception/inception/data/imagenet_lsvrc_2015_synsets.txt
   or 
https://raw.githubusercontent.com/tensorflow/models/master/inception/inception/data/imagenet_metadata.txt
   you will get a 404. I'll update with the new URLs.
 



> Standardizing current Object Recognition REST parsers
> -
>
> Key: TIKA-2400
> URL: https://issues.apache.org/jira/browse/TIKA-2400
> Project: Tika
>  Issue Type: Sub-task
>  Components: parser
>Reporter: Thejan Wijesinghe
>Priority: Minor
> Fix For: 1.17
>
>
> # This involves adding apiBaseUris and refactoring current Object Recognition 
> REST parsers,
> # Refactoring dockerfiles related to those parsers.
> #  Moving the logic related to checking minimum confidence into servers





[jira] [Commented] (TIKA-2400) Standardizing current Object Recognition REST parsers

2017-10-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16205506#comment-16205506
 ] 

ASF GitHub Bot commented on TIKA-2400:
--

ThejanW commented on issue #208: Fix for TIKA-2400 Standardizing current Object 
Recognition REST parsers
URL: https://github.com/apache/tika/pull/208#issuecomment-336800656
 
 
   I was getting the same error. Nothing is wrong with your Docker setup. The 
problem was with the download URLs of **imagenet_lsvrc_2015_synsets.txt** and 
**imagenet_metadata.txt**. Apparently the TensorFlow maintainers have moved 
these files to another location. 
   If you open 
https://raw.githubusercontent.com/tensorflow/models/master/inception/inception/data/imagenet_lsvrc_2015_synsets.txt
   or 
https://raw.githubusercontent.com/tensorflow/models/master/inception/inception/data/imagenet_metadata.txt
   you will get a 404. I'll update with the new URLs.
 



> Standardizing current Object Recognition REST parsers
> -
>
> Key: TIKA-2400
> URL: https://issues.apache.org/jira/browse/TIKA-2400
> Project: Tika
>  Issue Type: Sub-task
>  Components: parser
>Reporter: Thejan Wijesinghe
>Priority: Minor
> Fix For: 1.17
>
>
> # This involves adding apiBaseUris and refactoring current Object Recognition 
> REST parsers,
> # Refactoring dockerfiles related to those parsers.
> #  Moving the logic related to checking minimum confidence into servers





[jira] [Commented] (TIKA-2400) Standardizing current Object Recognition REST parsers

2017-10-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16205504#comment-16205504
 ] 

ASF GitHub Bot commented on TIKA-2400:
--

ThejanW commented on issue #208: Fix for TIKA-2400 Standardizing current Object 
Recognition REST parsers
URL: https://github.com/apache/tika/pull/208#issuecomment-336800656
 
 
   I was getting the same error. Nothing is wrong with your Docker setup. The 
problem was with the download URLs of **imagenet_lsvrc_2015_synsets.txt** and 
imagenet_metadata.txt. Apparently the TensorFlow maintainers have moved these 
files to another location. 
   If you open 
https://raw.githubusercontent.com/tensorflow/models/master/inception/inception/data/imagenet_lsvrc_2015_synsets.txt
   or 
https://raw.githubusercontent.com/tensorflow/models/master/inception/inception/data/imagenet_metadata.txt
   you will get a 404. I'll update with the new URLs.
 



> Standardizing current Object Recognition REST parsers
> -
>
> Key: TIKA-2400
> URL: https://issues.apache.org/jira/browse/TIKA-2400
> Project: Tika
>  Issue Type: Sub-task
>  Components: parser
>Reporter: Thejan Wijesinghe
>Priority: Minor
> Fix For: 1.17
>
>
> # This involves adding apiBaseUris and refactoring current Object Recognition 
> REST parsers,
> # Refactoring dockerfiles related to those parsers.
> #  Moving the logic related to checking minimum confidence into servers


