[jira] [Updated] (TIKA-2750) Update regression corpus

2018-11-05 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2750:
--
Attachment: CC-MAIN-2018-39-charset_lang_by_tld.zip

> Update regression corpus
> 
>
> Key: TIKA-2750
> URL: https://issues.apache.org/jira/browse/TIKA-2750
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
> Attachments: CC-MAIN-2018-39-charset_lang_by_tld.zip, 
> CC-MAIN-2018-39-mimes-charsets-by-tld.zip, 
> CC-MAIN-2018-39-mimes-v-detected.zip
>
>
> I think we've had great success with the current data on our regression 
> corpus.  I'd like to refresh some data from Common Crawl with three primary 
> goals:
> 1) include more interesting documents (e.g. downsample English UTF-8 
> text/html)
> 2) include more recent documents (perhaps newer features in PDFs? definitely 
> more OOXML)
> 3) identify and re-fetch truncated documents from the original site -- 
> Common Crawl truncates docs at 1 MB.  I think some truncated documents have 
> been quite useful, similar to fuzzing, for identifying serious problems with 
> some of our parsers.  However, it would be useful to have more complete 
> files, esp. for PDFs.  In short, we should keep some truncated documents, but 
> I'd also like to get more complete docs.
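
As a rough illustration of goals 1 and 3 in the description, the selection step could be sketched in Python along the following lines. This is not code from the Tika project: the `CrawlRecord` fields, the function names, and the `keep_fraction` parameter are hypothetical, and reading "1 MB" as 1,048,576 bytes is an assumption that may not match the crawl's exact limit.

```python
import random
from dataclasses import dataclass

# Assumption: "1 MB" is interpreted as 1 MiB; adjust if the crawl's limit differs.
TRUNCATION_LIMIT_BYTES = 1024 * 1024


@dataclass(frozen=True)
class CrawlRecord:
    """Hypothetical per-document metadata taken from a crawl index."""
    url: str
    mime: str
    charset: str
    lang: str
    length: int  # stored payload size in bytes


def is_probably_truncated(rec, limit=TRUNCATION_LIMIT_BYTES):
    # A payload at (or above) the limit was likely cut off by the crawler.
    return rec.length >= limit


def is_common_case(rec):
    # The over-represented class named in goal 1: English UTF-8 text/html.
    return (
        rec.mime == "text/html"
        and rec.charset.lower() == "utf-8"
        and rec.lang == "en"
    )


def select_records(records, keep_fraction, seed=0):
    """Return (kept_records, urls_to_refetch).

    Common-case records are kept with probability keep_fraction; all other
    records are kept. Truncated records stay in the corpus, and their URLs
    are also listed for re-fetching from the original site.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    kept, refetch = [], []
    for rec in records:
        if is_probably_truncated(rec):
            refetch.append(rec.url)
        if is_common_case(rec) and rng.random() >= keep_fraction:
            continue  # downsampled away
        kept.append(rec)
    return kept, refetch
```

Keeping truncated records in `kept` while also listing them for re-fetching mirrors the stated intent to retain some truncated documents and additionally obtain complete copies.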



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (TIKA-2750) Update regression corpus

2018-11-01 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2750:
--
Attachment: CC-MAIN-2018-39-mimes-v-detected.zip

> Update regression corpus
> 
>
> Key: TIKA-2750
> URL: https://issues.apache.org/jira/browse/TIKA-2750
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
> Attachments: CC-MAIN-2018-39-mimes-charsets-by-tld.zip, 
> CC-MAIN-2018-39-mimes-v-detected.zip





[jira] [Updated] (TIKA-2750) Update regression corpus

2018-10-26 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2750:
--
Attachment: CC-MAIN-2018-39-mimes-charsets-by-tld.zip

> Update regression corpus
> 
>
> Key: TIKA-2750
> URL: https://issues.apache.org/jira/browse/TIKA-2750
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
> Attachments: CC-MAIN-2018-39-mimes-charsets-by-tld.zip


