[ https://issues.apache.org/jira/browse/TIKA-357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801110#action_12801110 ]
Chris A. Mattmann edited comment on TIKA-357 at 1/16/10 6:00 AM:
-----------------------------------------------------------------

Ken: I applied your patch and ran it against your sample file, but I'm not seeing it fix the issue:

[chipotle:~/src/tika/trunk] mattmann% mvn -Dtest=MimeDetectionTest test
[INFO] Scanning for projects...
[INFO] Reactor build order:
[INFO] Apache Tika parent
[INFO] Apache Tika core
[INFO] Apache Tika parsers
[INFO] Apache Tika application
[INFO] Apache Tika OSGi bundle
[INFO] Apache Tika
[INFO] ------------------------------------------------------------------------
[INFO] Building Apache Tika parent
[INFO] task-segment: [test]
[INFO] ------------------------------------------------------------------------
[INFO] Setting property: classpath.resource.loader.class => 'org.codehaus.plexus.velocity.ContextClassLoaderResourceLoader'.
[INFO] Setting property: velocimacro.messages.on => 'false'.
[INFO] Setting property: resource.loader => 'classpath'.
[INFO] Setting property: resource.manager.logwhenfound => 'false'.
[INFO] [remote-resources:process {execution: default}]
[INFO] ------------------------------------------------------------------------
[INFO] Building Apache Tika core
[INFO] task-segment: [test]
[INFO] ------------------------------------------------------------------------
[INFO] [remote-resources:process {execution: default}]
[INFO] [resources:resources]
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 20 resources
[INFO] Copying 3 resources
[INFO] [compiler:compile]
[INFO] Nothing to compile - all classes are up to date
[INFO] [resources:testResources]
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 26 resources
[INFO] Copying 3 resources
[INFO] [compiler:testCompile]
[INFO] Nothing to compile - all classes are up to date
[INFO] [surefire:test]
[INFO] Surefire report directory: /Users/mattmann/src/tika/trunk/tika-core/target/surefire-reports

-------------------------------------------------------
 T E S T S
-------------------------------------------------------
Running org.apache.tika.mime.MimeDetectionTest
Tests run: 2, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.48 sec <<< FAILURE!

Results :

Failed tests:
  testDetection(org.apache.tika.mime.MimeDetectionTest)

Tests run: 2, Failures: 1, Errors: 0, Skipped: 0

[INFO] ------------------------------------------------------------------------
[ERROR] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] There are test failures.

Please refer to /Users/mattmann/src/tika/trunk/tika-core/target/surefire-reports for the individual test results.
[INFO] ------------------------------------------------------------------------
[INFO] For more information, run Maven with the -e switch
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 5 seconds
[INFO] Finished at: Fri Jan 15 21:54:44 PST 2010
[INFO] Final Memory: 13M/23M
[INFO] ------------------------------------------------------------------------
[chipotle:~/src/tika/trunk] mattmann% more tika-core/target/surefire-reports/org.apache.tika.mime.MimeDetectionTest.txt
-------------------------------------------------------------------------------
Test set: org.apache.tika.mime.MimeDetectionTest
-------------------------------------------------------------------------------
Tests run: 2, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.482 sec <<< FAILURE!
testDetection(org.apache.tika.mime.MimeDetectionTest)  Time elapsed: 0.368 sec  <<< FAILURE!
junit.framework.ComparisonFailure: testlargerbuffer.html is not properly detected: detected. expected:<...html> but was:<...plain>
    at junit.framework.Assert.assertEquals(Assert.java:81)
    at org.apache.tika.mime.MimeDetectionTest.testStream(MimeDetectionTest.java:91)
    at org.apache.tika.mime.MimeDetectionTest.testFile(MimeDetectionTest.java:80)
    at org.apache.tika.mime.MimeDetectionTest.testDetection(MimeDetectionTest.java:61)
[chipotle:~/src/tika/trunk] mattmann%

Did this patch work for you in terms of auto-detection? I would have expected the MimeTypes detector to pick up the file with your patch applied, but the patch updates the HtmlParser rather than the detection code.

Let me look into this more. I'd like to get this into 0.6, for which I've been promising to cut an RC for the past few weeks (but haven't had the time, sorry!) ;)

Cheers,
Chris

> Increase buffer size for meta tag sniffing
> ------------------------------------------
>
> Key: TIKA-357
> URL: https://issues.apache.org/jira/browse/TIKA-357
> Project: Tika
> Issue Type: Improvement
> Affects Versions: 0.5
> Reporter: Ken Krugler
> Assignee: Chris A. Mattmann
> Priority: Minor
> Fix For: 0.6
>
> Attachments: makler.html, TIKA-357.patch
>
>
> Some web pages (such as makler.su, see attached) have lots of script data
> before the body of the HTML.
> When this happens, the sniffing code fails to find the charset info in the
> meta tag, because it currently only sniffs the first 4K.
> Bumping it to 8K would cover all of the cases that I (Ken) have seen during a
> test crawl.
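
To make the 4K versus 8K point in the issue description concrete, here is a minimal sketch of meta tag charset sniffing over a bounded prefix. It is not the code from TIKA-357.patch or from Tika's HtmlParser: the class name MetaCharsetSniffer, the method sniffCharset, and the simplified regular expression are all assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; this is not Tika's HtmlParser code.
final class MetaCharsetSniffer {

    // Simplified pattern for <meta ... charset=NAME ...>; real pages need more lenient handling.
    private static final Pattern META_CHARSET = Pattern.compile(
            "(?is)<meta\\s[^>]*?charset\\s*=\\s*[\"']?([a-z0-9_.:-]+)");

    /** Returns the charset named in a meta tag within the first {@code limit} bytes, or null. */
    static String sniffCharset(byte[] content, int limit) {
        int length = Math.min(content.length, limit);
        // ISO-8859-1 maps every byte to exactly one char, so ASCII markup survives decoding.
        String prefix = new String(content, 0, length, StandardCharsets.ISO_8859_1);
        Matcher matcher = META_CHARSET.matcher(prefix);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        // About 5 KB of script data ahead of the meta tag, mimicking the pages in the report.
        StringBuilder page = new StringBuilder("<html><head><script>");
        while (page.length() < 5000) {
            page.append("var x = 1; ");
        }
        page.append("</script><meta http-equiv=\"Content-Type\" ")
            .append("content=\"text/html; charset=windows-1251\"></head><body></body></html>");
        byte[] bytes = page.toString().getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(sniffCharset(bytes, 4096)); // meta tag lies beyond the 4K window
        System.out.println(sniffCharset(bytes, 8192)); // meta tag falls inside the 8K window
    }
}
```

With the script padding pushing the meta tag past byte 4096, the smaller window finds nothing while the larger one does. Note that this only covers charset sniffing during parsing; as the comment above points out, MIME type detection is a separate code path and may need its own change.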