[jira] [Commented] (NUTCH-1262) Map `duplicating` content-types to a single type
[ https://issues.apache.org/jira/browse/NUTCH-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293039#comment-13293039 ]

Hudson commented on NUTCH-1262:
---

Integrated in nutch-trunk-maven #306 (See [https://builds.apache.org/job/nutch-trunk-maven/306/])
NUTCH-1262 Map `duplicating` content-types to a single type (Revision 1348785)

Result = SUCCESS
markus :
Files :
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/conf/nutch-default.xml
* /nutch/trunk/src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java

Map `duplicating` content-types to a single type

Key: NUTCH-1262
URL: https://issues.apache.org/jira/browse/NUTCH-1262
Project: Nutch
Issue Type: Improvement
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
Fix For: 1.6
Attachments: NUTCH-1262-1.5-1.patch, NUTCH-1262-1.5-2.patch

Similar or duplicating content-types can end up differently in an index. With, for example, both application/xhtml+xml and text/html it is impossible to use a single filter to select `web pages`.

See also: http://lucene.472066.n3.nabble.com/application-xhtml-xml-gt-text-html-td3699942.html

Content-Type mapping is disabled by default and is enabled via moreIndexingFilter.mapMimeTypes. An example mapping file is provided in conf/.

{code}
# target MIME-type TAB type1 [TAB type2 ...]

# Map XHTML to HTML
text/html	application/xhtml+xml

# Map XHTML and HTML to something else
Web page	text/html	application/xhtml+xml

# Map some office documents to each other
Office document	application/vnd.oasis.opendocument.text	application/x-tika-msoffice
{code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
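For readers who want to see how a file in the `target TAB type1 [TAB type2 ...]` format could be turned into a lookup table, here is a minimal sketch. It is not the code from the committed MoreIndexingFilter patch: the class name `MimeTypeMappingSketch` and the methods `parse` and `map` are hypothetical, and the choice that a later line overrides an earlier one for the same source type is an assumption made for illustration only.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration; not part of Nutch. */
final class MimeTypeMappingSketch {

  /** Turns "target TAB type1 [TAB type2 ...]" lines into a source-type to target map. */
  static Map<String, String> parse(List<String> lines) {
    Map<String, String> mapping = new HashMap<>();
    for (String line : lines) {
      String trimmed = line.trim();
      // Skip blank lines and comment lines.
      if (trimmed.isEmpty() || trimmed.startsWith("#")) {
        continue;
      }
      String[] parts = line.split("\t");
      // A usable line needs a target and at least one source type.
      if (parts.length < 2) {
        continue;
      }
      String target = parts[0].trim();
      for (int i = 1; i < parts.length; i++) {
        String source = parts[i].trim();
        if (!source.isEmpty()) {
          // Assumption: a later line wins if a source type is listed twice.
          mapping.put(source, target);
        }
      }
    }
    return mapping;
  }

  /** Returns the mapped type, or the original type when no mapping exists. */
  static String map(Map<String, String> mapping, String contentType) {
    return mapping.getOrDefault(contentType, contentType);
  }

  public static void main(String[] args) {
    List<String> lines = Arrays.asList(
        "# Map XHTML and HTML to something else",
        "Web page\ttext/html\tapplication/xhtml+xml");
    Map<String, String> mapping = parse(lines);
    System.out.println(map(mapping, "application/xhtml+xml")); // prints "Web page"
    System.out.println(map(mapping, "application/pdf"));       // prints "application/pdf"
  }
}
```

Unmapped types pass through unchanged in this sketch, so indexing would behave as before for any content-type the file does not mention. In practice the lines could come from something like `java.nio.file.Files.readAllLines` on the mapping file.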
[jira] [Commented] (NUTCH-1262) Map `duplicating` content-types to a single type
[ https://issues.apache.org/jira/browse/NUTCH-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293334#comment-13293334 ]

Hudson commented on NUTCH-1262:
---

Integrated in Nutch-trunk #1868 (See [https://builds.apache.org/job/Nutch-trunk/1868/])
NUTCH-1262 Map `duplicating` content-types to a single type (Revision 1348785)

Result = SUCCESS
markus : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1348785
Files :
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/conf/nutch-default.xml
* /nutch/trunk/src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
[jira] [Commented] (NUTCH-1262) Map `duplicating` content-types to a single type
[ https://issues.apache.org/jira/browse/NUTCH-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13291602#comment-13291602 ]

Markus Jelsma commented on NUTCH-1262:
--

I'll commit this one in the next few days unless there are objections or improvements.

Thanks
[jira] [Commented] (NUTCH-1262) Map `duplicating` content-types to a single type
[ https://issues.apache.org/jira/browse/NUTCH-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208461#comment-13208461 ]

Markus Jelsma commented on NUTCH-1262:
--

Is this issue still subject to debate? Opinions?

Map `duplicating` content-types to a single type

Key: NUTCH-1262
URL: https://issues.apache.org/jira/browse/NUTCH-1262
Project: Nutch
Issue Type: Improvement
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
Fix For: 1.5
Attachments: NUTCH-1262-1.5-1.patch

Similar or duplicating content-types can end up differently in an index. With, for example, both application/xhtml+xml and text/html it is impossible to use a single filter to select `web pages`.

See also: http://lucene.472066.n3.nabble.com/application-xhtml-xml-gt-text-html-td3699942.html
[jira] [Commented] (NUTCH-1262) Map `duplicating` content-types to a single type
[ https://issues.apache.org/jira/browse/NUTCH-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13196829#comment-13196829 ]

Julien Nioche commented on NUTCH-1262:
--

Just wondering, doesn't Tika's MIME type registry already provide a canonical form for MIME types? If so, we could get it from there instead of having to maintain a separate mapping file ourselves.

Map `duplicating` content-types to a single type

Key: NUTCH-1262
URL: https://issues.apache.org/jira/browse/NUTCH-1262
Project: Nutch
Issue Type: Improvement
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Fix For: 1.5
Attachments: NUTCH-1262-1.5-1.patch

Similar or duplicating content-types can end up differently in an index. With, for example, both application/xhtml+xml and text/html it is impossible to use a single filter to select `web pages`.

See also: http://lucene.472066.n3.nabble.com/application-xhtml-xml-gt-text-html-td3699942.html