[jira] [Updated] (TIKA-2750) Update regression corpus
[ https://issues.apache.org/jira/browse/TIKA-2750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-2750:
------------------------------
    Attachment: CC-MAIN-2018-39-charset_lang_by_tld.zip

> Update regression corpus
> ------------------------
>
> Key: TIKA-2750
> URL: https://issues.apache.org/jira/browse/TIKA-2750
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
> Attachments: CC-MAIN-2018-39-charset_lang_by_tld.zip, CC-MAIN-2018-39-mimes-charsets-by-tld.zip, CC-MAIN-2018-39-mimes-v-detected.zip
>
> I think we've had great success with the current data on our regression corpus. I'd like to refresh some data from Common Crawl with three primary goals:
> 1) include more interesting documents (e.g. downsample English UTF-8 text/html)
> 2) include more recent documents (perhaps newer features in PDFs? definitely more OOXML)
> 3) identify and re-fetch truncated documents from the original site -- Common Crawl truncates docs at 1 MB. I think some truncated documents have been quite useful, similar to fuzzing, for identifying serious problems with some of our parsers. However, it would be useful to have more complete files, esp. for PDFs. In short, we should keep some truncated documents, but I'd also like to get more complete docs.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
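Goals 1 and 3 in the description above could be sketched roughly as follows. This is only an illustration, not the tooling used for TIKA-2750: the record fields (`mime`, `charset`, `lang`, `length`, `truncated`), the helper names, and the choice of 1 MiB (1,048,576 bytes) as the cutoff are all assumptions made for the example.

```python
import random

# Assumption: the "1 MB" limit mentioned in the issue is treated as 1 MiB.
TRUNCATION_LIMIT = 1024 * 1024


def is_probably_truncated(record):
    """Flag a record that carries a truncation marker or whose payload
    length reaches the assumed limit (hypothetical record fields)."""
    return bool(record.get("truncated")) or record["length"] >= TRUNCATION_LIMIT


def is_common(record):
    """The over-represented class named in the issue: English UTF-8 text/html."""
    return (
        record["mime"] == "text/html"
        and record["charset"] == "UTF-8"
        and record["lang"] == "en"
    )


def downsample(records, keep_fraction, seed=0):
    """Keep every uncommon record; keep common ones with probability keep_fraction."""
    rng = random.Random(seed)  # seeded so the sample is reproducible
    return [r for r in records if not is_common(r) or rng.random() < keep_fraction]


if __name__ == "__main__":
    sample = [
        {"mime": "text/html", "charset": "UTF-8", "lang": "en", "length": 20_000},
        {"mime": "application/pdf", "charset": None, "lang": "de", "length": 1_048_576},
    ]
    kept = downsample(sample, keep_fraction=0.1)
    refetch = [r for r in kept if is_probably_truncated(r)]
    print(len(kept), len(refetch))
```

Records flagged by `is_probably_truncated` would be candidates for re-fetching from the original site; a portion of them could still be retained in truncated form, since the issue notes their fuzzing-like value.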
[jira] [Updated] (TIKA-2750) Update regression corpus
[ https://issues.apache.org/jira/browse/TIKA-2750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-2750:
------------------------------
    Attachment: CC-MAIN-2018-39-mimes-v-detected.zip

> Update regression corpus
> ------------------------
>
> Key: TIKA-2750
> URL: https://issues.apache.org/jira/browse/TIKA-2750
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
> Attachments: CC-MAIN-2018-39-mimes-charsets-by-tld.zip, CC-MAIN-2018-39-mimes-v-detected.zip
[jira] [Updated] (TIKA-2750) Update regression corpus
[ https://issues.apache.org/jira/browse/TIKA-2750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-2750:
------------------------------
    Attachment: CC-MAIN-2018-39-mimes-charsets-by-tld.zip

> Update regression corpus
> ------------------------
>
> Key: TIKA-2750
> URL: https://issues.apache.org/jira/browse/TIKA-2750
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
> Attachments: CC-MAIN-2018-39-mimes-charsets-by-tld.zip