[jira] [Commented] (NUTCH-1262) Map `duplicating` content-types to a single type

2012-06-11 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293039#comment-13293039
 ] 

Hudson commented on NUTCH-1262:
---

Integrated in nutch-trunk-maven #306 (See 
[https://builds.apache.org/job/nutch-trunk-maven/306/])
NUTCH-1262 Map `duplicating` content-types to a single type (Revision 
1348785)

 Result = SUCCESS
markus : 
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/conf/nutch-default.xml
* /nutch/trunk/src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java


 Map `duplicating` content-types to a single type
 

 Key: NUTCH-1262
 URL: https://issues.apache.org/jira/browse/NUTCH-1262
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.6

 Attachments: NUTCH-1262-1.5-1.patch, NUTCH-1262-1.5-2.patch


 Similar or duplicate content types can end up as different values in an index. 
 For example, with both application/xhtml+xml and text/html present, it is impossible 
 to use a single filter to select `web pages`.
 See also: 
 http://lucene.472066.n3.nabble.com/application-xhtml-xml-gt-text-html-td3699942.html
 Content-Type mapping is disabled by default and is enabled via 
 moreIndexingFilter.mapMimeTypes. An example mapping file is provided in conf/.
 {code}
 # target MIME-type TAB type1 [TAB type2 ...]
 # Map XHTML to HTML
 text/html	application/xhtml+xml
 # Map XHTML and HTML to something else
 Web page	text/html	application/xhtml+xml
 # Map some office documents to each other
 Office document	application/vnd.oasis.opendocument.text	application/x-tika-msoffice
 {code}
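
To make the file format above concrete, here is a minimal sketch of how such a mapping could be parsed and applied. It is not the code from MoreIndexingFilter.java; the class name `MimeTypeMapper` and its methods are hypothetical and chosen only for illustration. It assumes that fields are separated by a single TAB, that lines starting with `#` are comments, and that a later rule overrides an earlier one for the same source type.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration; not the actual Nutch implementation. */
final class MimeTypeMapper {

  /**
   * Parses lines of the form "target TAB type1 [TAB type2 ...]" into a
   * lookup table from each source type to its target type.
   */
  static Map<String, String> parse(List<String> lines) {
    Map<String, String> sourceToTarget = new LinkedHashMap<>();
    for (String line : lines) {
      String trimmed = line.trim();
      // Skip blank lines and comment lines.
      if (trimmed.isEmpty() || trimmed.startsWith("#")) {
        continue;
      }
      String[] fields = trimmed.split("\t");
      // A usable rule needs a target plus at least one source type.
      if (fields.length < 2) {
        continue;
      }
      String target = fields[0].trim();
      for (int i = 1; i < fields.length; i++) {
        String source = fields[i].trim();
        if (!source.isEmpty()) {
          // Assumption: a later rule replaces an earlier one for the same source.
          sourceToTarget.put(source, target);
        }
      }
    }
    return sourceToTarget;
  }

  /** Returns the mapped type, or the input unchanged when no rule applies. */
  static String map(Map<String, String> table, String contentType) {
    return table.getOrDefault(contentType, contentType);
  }

  public static void main(String[] args) {
    Map<String, String> table = parse(Arrays.asList(
        "# Map XHTML to HTML",
        "text/html\tapplication/xhtml+xml",
        "# Map some office documents to each other",
        "Office document\tapplication/vnd.oasis.opendocument.text\tapplication/x-tika-msoffice"));

    System.out.println(map(table, "application/xhtml+xml")); // prints "text/html"
    System.out.println(map(table, "application/pdf"));       // prints "application/pdf"
  }
}
```

Keying the table by source type keeps the lookup at index time to a single map access, and unmapped types pass through unchanged, which matches the idea that the mapping is optional.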

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1262) Map `duplicating` content-types to a single type

2012-06-11 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293334#comment-13293334
 ] 

Hudson commented on NUTCH-1262:
---

Integrated in Nutch-trunk #1868 (See 
[https://builds.apache.org/job/Nutch-trunk/1868/])
NUTCH-1262 Map `duplicating` content-types to a single type (Revision 
1348785)

 Result = SUCCESS
markus : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1348785
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/conf/nutch-default.xml
* /nutch/trunk/src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java







[jira] [Commented] (NUTCH-1262) Map `duplicating` content-types to a single type

2012-06-08 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13291602#comment-13291602
 ] 

Markus Jelsma commented on NUTCH-1262:
--

I'll commit this one in the next few days unless there are objections or 
suggested improvements. Thanks.






[jira] [Commented] (NUTCH-1262) Map `duplicating` content-types to a single type

2012-02-15 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208461#comment-13208461
 ] 

Markus Jelsma commented on NUTCH-1262:
--

Is this issue still subject to debate? Opinions?

 Map `duplicating` content-types to a single type
 

 Key: NUTCH-1262
 URL: https://issues.apache.org/jira/browse/NUTCH-1262
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.5

 Attachments: NUTCH-1262-1.5-1.patch


 Similar or duplicate content types can end up as different values in an index. 
 For example, with both application/xhtml+xml and text/html present, it is impossible 
 to use a single filter to select `web pages`.
 See also: 
 http://lucene.472066.n3.nabble.com/application-xhtml-xml-gt-text-html-td3699942.html





[jira] [Commented] (NUTCH-1262) Map `duplicating` content-types to a single type

2012-01-31 Thread Julien Nioche (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13196829#comment-13196829
 ] 

Julien Nioche commented on NUTCH-1262:
--

Just wondering: doesn't Tika's MIME type registry already provide a canonical 
form for MIME types? If so, we could get it from there instead of having to 
maintain a separate mapping file ourselves.

 Map `duplicating` content-types to a single type
 

 Key: NUTCH-1262
 URL: https://issues.apache.org/jira/browse/NUTCH-1262
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.5

 Attachments: NUTCH-1262-1.5-1.patch


 Similar or duplicate content types can end up as different values in an index. 
 For example, with both application/xhtml+xml and text/html present, it is impossible 
 to use a single filter to select `web pages`.
 See also: 
 http://lucene.472066.n3.nabble.com/application-xhtml-xml-gt-text-html-td3699942.html
