[jira] [Comment Edited] (TIKA-1599) Switch from TagSoup to JSoup
[ https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16350529#comment-16350529 ]

Tim Allison edited comment on TIKA-1599 at 2/2/18 3:39 PM:
---

>DOM could lead to higher memory usage

Y, that's my concern, especially because our CommonCrawl docs are truncated at 1MB, so we aren't going to see major problems in that corpus.

I added [~markus17]'s attached files to our regression corpus, and I've kicked off a fresh full run of Tika 1.17 against the corpus. I've updated my jsoup code on my personal fork. Once the 1.17 run finishes, I'll kick off the jsoup fork against the html files.

Unrelated topic: does anyone have a shareable example of an html file with a base64 (or other) embedded file inside it? I don't think we're currently handling these, and it would be nice to do that.


was (Author: talli...@mitre.org):
>DOM could lead to higher memory usage

Y, that's my concern, especially because our CommonCrawl docs are truncated at 1MB, so we aren't going to see major problems in that corpus.

I've kicked off a fresh full run of Tika 1.17 against the corpus, and I've updated my jsoup code on my personal fork. Once the 1.17 run finishes, I'll kick off the jsoup fork against the html files.

Unrelated topic: does anyone have a shareable example of an html file with a base64 (or other) embedded file inside it? I don't think we're currently handling these, and it would be nice to do that.

> Switch from TagSoup to JSoup
>
>                 Key: TIKA-1599
>                 URL: https://issues.apache.org/jira/browse/TIKA-1599
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.7, 1.8
>            Reporter: Ken Krugler
>            Assignee: Ken Krugler
>            Priority: Minor
>         Attachments: TIKA-1599-crazy-files.tar.gz, tagsoup_vs_jsoup_reports.zip
>
> There are several Tika issues related to how TagSoup cleans up HTML ([TIKA-381], [TIKA-985], maybe [TIKA-715]), but TagSoup doesn't seem to be under active development.
> On the other hand I know of several projects that are now using [JSoup|https://github.com/jhy/jsoup], which is an active project (albeit only one main contributor) under the MIT license.
> I haven't looked into how hard it would be to switch this dependency.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
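On the unrelated question in the comment above: one common way a file ends up embedded in HTML is a base64 `data:` URI in an attribute such as `src` or `href`. That this is the kind of embedding the comment has in mind is an assumption. The sketch below is not Tika code; the class name `DataUriSketch`, the simplified regular expression, and the sample markup are all hypothetical and use only the Java standard library.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DataUriSketch {
    // Simplified pattern (an assumption): data:<type>/<subtype>;base64,<payload>.
    // Real data URIs may carry extra parameters such as charset, which this ignores.
    private static final Pattern DATA_URI = Pattern.compile(
            "data:([\\w.+-]+/[\\w.+-]+);base64,([A-Za-z0-9+/=]+)");

    /** Returns the decoded bytes of every base64 data URI found in the markup. */
    public static List<byte[]> extract(String html) {
        List<byte[]> payloads = new ArrayList<>();
        Matcher m = DATA_URI.matcher(html);
        while (m.find()) {
            try {
                payloads.add(Base64.getDecoder().decode(m.group(2)));
            } catch (IllegalArgumentException e) {
                // Malformed base64 (for example bad padding): skip this candidate.
            }
        }
        return payloads;
    }

    public static void main(String[] args) {
        // Hypothetical sample: a tiny text file embedded in an <img> tag.
        String html = "<p>report</p><img src=\"data:text/plain;base64,aGVsbG8=\">";
        for (byte[] payload : extract(html)) {
            System.out.println(new String(payload, StandardCharsets.UTF_8));
        }
    }
}
```

A regex scan like this only illustrates the shape of the problem; a parser-based approach would look at attribute values after the HTML has been parsed rather than at raw markup.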
[jira] [Comment Edited] (TIKA-1599) Switch from TagSoup to JSoup
[ https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16350335#comment-16350335 ]

Tim Allison edited comment on TIKA-1599 at 2/2/18 1:42 PM:
---

What say we do a fresh eval on our current corpus and then do a clean cut over to JSoup for Tika 2.0 if the results are promising?

Big question: are we willing to move to DOM for HTML? SAX is not yet available in JSoup (https://github.com/jhy/jsoup/issues/824).


was (Author: talli...@mitre.org):
What say we do a fresh eval on our current corpus and then do a clean cut over to JSoup for Tika 2.0 if the results are promising?

[jira] [Comment Edited] (TIKA-1599) Switch from TagSoup to JSoup
[ https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049810#comment-15049810 ]

Greg Lindahl edited comment on TIKA-1599 at 12/10/15 1:37 AM:
--

I'm an advisor to Common Crawl -- they're currently between engineers. They run Nutch, and use the blekko metadata as the seed list. So yes, https://github.com/commoncrawl/nutch is what you should be looking at.


was (Author: wumpus):
I'm an advisor to Common Crawl -- they're currently between engineers. They currently run Nutch, and use the blekko metadata as the seed list. So yes, https://github.com/commoncrawl/nutch is what you should be looking at.

[jira] [Comment Edited] (TIKA-1599) Switch from TagSoup to JSoup
[ https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049445#comment-15049445 ]

Tim Allison edited comment on TIKA-1599 at 12/9/15 10:37 PM:
-

I ran a comparison on our 80k html docs. It looks like we move from 5.7M common English words with TagSoup to 6.0M common English words with JSoup for those ~3,800 files where there is any content difference. The cost of the move is only 4 exceptions, 3 of which are zip bombs.

I'm currently getting far fewer metadata items with JSoup, but that's almost certain to be the fault of my initial implementation.

If anyone has a chance to look through contents/content_diffs.xlsx, I'd appreciate all feedback.

You can get to the extract root [here|http://162.242.228.174/extracts/]. The "A" run is "tika_1_12_20151208" and the "B" run is "jsoup".


was (Author: talli...@mitre.org):
I ran a comparison on our 80k html docs. It looks like we gain about 5% in common English words if we move to JSoup at the cost of 4 new exceptions.

I'm currently getting far fewer metadata items with JSoup, but that's almost certain to be the fault of my initial implementation.

If anyone has a chance to look through contents/content_diffs.xlsx, I'd appreciate all feedback.

You can get to the extract root [here|http://162.242.228.174/extracts/]. The "A" run is "tika_1_12_20151208" and the "B" run is "jsoup".

[jira] [Comment Edited] (TIKA-1599) Switch from TagSoup to JSoup
[ https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049445#comment-15049445 ]

Tim Allison edited comment on TIKA-1599 at 12/9/15 10:06 PM:
-

I ran a comparison on our 80k html docs. It looks like we gain about 5% in common English words if we move to JSoup at the cost of 4 new exceptions.

I'm currently getting far fewer metadata items with JSoup, but that's almost certain to be the fault of my initial implementation.

If anyone has a chance to look through contents/content_diffs.xlsx, I'd appreciate all feedback.

You can get to the extract root [here|http://162.242.228.174/extracts/]. The "A" run is "tika_1_12_20151208" and the "B" run is "jsoup".


was (Author: talli...@mitre.org):
I ran a comparison on our 80k html docs. It looks like we gain about 5% in common English words if we move to JSoup at the cost of 4 new exceptions.

I'm currently getting far fewer metadata items with JSoup, but that's almost certain to be the fault of my initial implementation.

If anyone has a chance to look through contents/content_diffs.xlsx, I'd appreciate all feedback.

[jira] [Comment Edited] (TIKA-1599) Switch from TagSoup to JSoup
[ https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15048978#comment-15048978 ]

Tim Allison edited comment on TIKA-1599 at 12/9/15 5:06 PM:

bq. So I'm not seeing how the parser that CommonCrawl uses would factor into that.

Right, agree. I was just seeing if they had also picked up JSoup or were using something else.

bq. But I also don't have a good suggestion for how to automatically evaluate the TagSoup vs. JSoup parse-off, other than simple measures like how many failed, or maybe the amount of text extracted?

Y, agreed. For PDFBOX-3058, and as part of TIKA-1332, I have some alpha-level eval code that will do these comparisons. Following [~tilman]'s recommendation, I added counts for "common English" words.


was (Author: talli...@mitre.org):
bq. So I'm not seeing how the parser that CommonCrawl uses would factor into that.

Right, agree. I was just seeing if they had also picked up JSoup or were using something else.

bq. But I also don't have a good suggestion for how to automatically evaluate the TagSoup vs. JSoup parse-off, other than simple measures like how many failed, or maybe the amount of text extracted?

Y, agreed. For PDFBOX-3058, and as part of TIKA-1330, I have some alpha-level eval code that will do these comparisons. Following [~tilman]'s recommendation, I added counts for "common English" words.

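The "common English" word count mentioned in the comment above can be pictured roughly as follows. This is a minimal sketch, not the eval code referred to in the thread: the class name `CommonWordEval`, the ten-word list, and the tokenization rule (split on non-letters, lowercase) are assumptions made for illustration only.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class CommonWordEval {
    // Tiny illustrative list (an assumption); a real evaluation would use a far larger one.
    private static final Set<String> COMMON = new HashSet<>(Arrays.asList(
            "the", "and", "of", "to", "in", "is", "that", "for", "with", "on"));

    /** Counts tokens of the extracted text that appear in the common-word list. */
    public static int countCommonWords(String extractedText) {
        int count = 0;
        // Lowercase, then split on runs of non-letter characters.
        for (String token : extractedText.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (COMMON.contains(token)) {
                count++;
            }
        }
        return count;
    }

    /** A positive result means extract B recovered more common words than extract A. */
    public static int delta(String extractA, String extractB) {
        return countCommonWords(extractB) - countCommonWords(extractA);
    }

    public static void main(String[] args) {
        // Hypothetical extracts of the same file from two parsers ("A" and "B" runs).
        String a = "The title of the page";
        String b = "The title of the page is in the footer";
        System.out.println("A=" + countCommonWords(a)
                + " B=" + countCommonWords(b)
                + " delta=" + delta(a, b));
    }
}
```

The idea behind such a measure is that a parser which garbles or drops text tends to produce fewer recognizable words, so summing the per-file counts over a corpus gives a rough signal without any manual review.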