[jira] [Comment Edited] (TIKA-1599) Switch from TagSoup to JSoup

2018-02-02 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16350529#comment-16350529
 ] 

Tim Allison edited comment on TIKA-1599 at 2/2/18 3:39 PM:
---

>DOM could lead to higher memory usage

Y, that's my concern, esp because our CommonCrawl docs are truncated at 1MB so 
we aren't going to see major problems in that corpus.

 

I added [~markus17] 's attached files to our regression corpus, and I've kicked 
off a fresh full run of Tika 1.17 against the corpus.  I've updated my jsoup 
code on my personal fork.  Once the 1.17 run finishes, I'll kick off the jsoup 
fork against the html files.

 

Unrelated topic: does anyone have a shareable example of an html file with a 
base64-encoded (or otherwise embedded) file inside it?  I don't think we're 
currently handling these, and it would be nice to do that.
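
For reference, one common form of an embedded file is a base64 "data:" URI in 
a src or href attribute.  The following is only a minimal sketch in Python 
(standard library only), not Tika, TagSoup, or JSoup code.  The names 
DataUriCollector, build_sample_html and extract_embedded are hypothetical, and 
the sample page is synthetic rather than a real-world file.

```python
import base64
from html.parser import HTMLParser


class DataUriCollector(HTMLParser):
    """Collects (media_type, decoded_bytes) for each base64 data: URI
    found in a src or href attribute."""

    def __init__(self):
        super().__init__()
        self.embedded = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name not in ("src", "href") or not value:
                continue
            if not value.startswith("data:"):
                continue
            # A data URI looks like data:<media type>;base64,<payload>
            header, _, payload = value.partition(",")
            if header.endswith(";base64"):
                media_type = header[len("data:"):-len(";base64")] or "text/plain"
                self.embedded.append((media_type, base64.b64decode(payload)))


def build_sample_html(payload: bytes, media_type: str = "text/plain") -> str:
    """Builds a synthetic HTML page that embeds `payload` as a data: URI."""
    encoded = base64.b64encode(payload).decode("ascii")
    return (
        "<html><body><p>host page</p>"
        f'<a href="data:{media_type};base64,{encoded}">attachment</a>'
        "</body></html>"
    )


def extract_embedded(html: str):
    """Returns the embedded files found in `html` as a list of tuples."""
    collector = DataUriCollector()
    collector.feed(html)
    collector.close()
    return collector.embedded
```

A page produced by build_sample_html(b"hello tika") could serve as a tiny 
test input.  Other embedding styles (for example MHTML-like multipart 
content) are not covered by this sketch.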


was (Author: talli...@mitre.org):
>DOM could lead to higher memory usage

Y, that's my concern, esp because our CommonCrawl docs are truncated at 1MB so 
we aren't going to see major problems in that corpus.

 

I've kicked off a fresh full run of Tika 1.17 against the corpus, and I've 
updated my jsoup code on my personal fork.  Once the 1.17 run finishes, I'll 
kick off the jsoup fork against the html files.

 

Unrelated topic: does anyone have a shareable example of an html file with a 
base64 (or other) embedded file inside of an html file?  I don't think we're 
currently handling these, and it would be nice to do that.

> Switch from TagSoup to JSoup
> 
>
> Key: TIKA-1599
> URL: https://issues.apache.org/jira/browse/TIKA-1599
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.7, 1.8
> Reporter: Ken Krugler
> Assignee: Ken Krugler
> Priority: Minor
> Attachments: TIKA-1599-crazy-files.tar.gz, 
> tagsoup_vs_jsoup_reports.zip
>
>
> There are several Tika issues related to how TagSoup cleans up HTML 
> ([TIKA-381], [TIKA-985], maybe [TIKA-715]), but TagSoup doesn't seem to be 
> under active development.
> On the other hand I know of several projects that are now using 
> [JSoup|https://github.com/jhy/jsoup], which is an active project (albeit only 
> one main contributor) under the MIT license.
> I haven't looked into how hard it would be to switch this dependency.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (TIKA-1599) Switch from TagSoup to JSoup

2018-02-02 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16350335#comment-16350335
 ] 

Tim Allison edited comment on TIKA-1599 at 2/2/18 1:42 PM:
---

What say we do a fresh eval on our current corpus and then do a clean cut over 
to JSoup for Tika 2.0 if the results are promising?

Big question: are we willing to move to DOM for HTML?  SAX is not yet available 
in JSoup (https://github.com/jhy/jsoup/issues/824).


was (Author: talli...@mitre.org):
What say we do a fresh eval on our current corpus and then do a clean cut over 
to JSoup for Tika 2.0 if the results are promising?



[jira] [Comment Edited] (TIKA-1599) Switch from TagSoup to JSoup

2015-12-09 Thread Greg Lindahl (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049810#comment-15049810
 ] 

Greg Lindahl edited comment on TIKA-1599 at 12/10/15 1:37 AM:
--

I'm an advisor to Common Crawl -- they're currently between engineers. They run 
Nutch, and use the blekko metadata as the seed list. So yes, 
https://github.com/commoncrawl/nutch is what you should be looking at.


was (Author: wumpus):
I'm an advisor to Common Crawl -- they're currently between engineers. They 
currently run Nutch, and use the blekko metadata as the seed list. So yes, 
https://github.com/commoncrawl/nutch is what you should be looking at.

> Switch from TagSoup to JSoup
> 
>
> Key: TIKA-1599
> URL: https://issues.apache.org/jira/browse/TIKA-1599
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.7, 1.8
> Reporter: Ken Krugler
> Assignee: Ken Krugler
> Priority: Minor
> Attachments: tagsoup_vs_jsoup_reports.zip
>
>
> There are several Tika issues related to how TagSoup cleans up HTML 
> ([TIKA-381], [TIKA-985], maybe [TIKA-715]), but TagSoup doesn't seem to be 
> under active development.
> On the other hand I know of several projects that are now using 
> [JSoup|https://github.com/jhy/jsoup], which is an active project (albeit only 
> one main contributor) under the MIT license.
> I haven't looked into how hard it would be to switch this dependency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (TIKA-1599) Switch from TagSoup to JSoup

2015-12-09 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049445#comment-15049445
 ] 

Tim Allison edited comment on TIKA-1599 at 12/9/15 10:37 PM:
-

I ran a comparison on our 80k html docs.  It looks like we move from 5.7M 
common English words with TagSoup to 6.0M common English words with JSoup for 
those ~3,800 files where there is any content difference.  The cost of the 
move is only 4 exceptions, 3 of which are zip bombs.
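
As an illustration of the "common English words" measure, here is a minimal 
sketch in Python.  It is not the actual eval code.  The tiny COMMON_WORDS list 
and the function names are assumptions made for the example, and a real run 
would use a much larger word list.

```python
import re

# Hypothetical, deliberately tiny word list, for illustration only.
COMMON_WORDS = frozenset({
    "the", "and", "of", "to", "in", "is", "that", "for", "with", "this",
})

TOKEN = re.compile(r"[a-z]+")


def count_common_words(text, vocabulary=COMMON_WORDS):
    """Counts tokens of the extracted text that appear in the vocabulary."""
    return sum(1 for token in TOKEN.findall(text.lower()) if token in vocabulary)


def relative_gain(count_a, count_b):
    """Relative change from run A to run B, e.g. 0.05 for a 5% gain."""
    if count_a == 0:
        raise ValueError("run A has no common words to compare against")
    return (count_b - count_a) / count_a
```

Applied to the totals above, relative_gain(5.7e6, 6.0e6) comes to roughly 
0.053, i.e. a gain of about 5%.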

I'm currently getting far fewer metadata items with JSoup, but that's almost 
certain to be the fault of my initial implementation.

If anyone has a chance to look through contents/content_diffs.xlsx, I'd 
appreciate all feedback.

You can get to the extract root [here|http://162.242.228.174/extracts/].  The 
"A" run is "tika_1_12_20151208" and the "B" run is "jsoup".


was (Author: talli...@mitre.org):
I ran a comparison on our 80k html docs.  It looks like we gain about 5% in 
common English words if we move to JSoup at the cost of 4 new exceptions.

I'm currently getting far fewer metadata items with JSoup, but that's almost 
certain to be the fault of my initial implementation.

If anyone has a chance to look through contents/content_diffs.xlsx, I'd 
appreciate all feedback.

You can get to the extract root [here|http://162.242.228.174/extracts/].  The 
"A" run is "tika_1_12_20151208" and the "B" run is "jsoup".



[jira] [Comment Edited] (TIKA-1599) Switch from TagSoup to JSoup

2015-12-09 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049445#comment-15049445
 ] 

Tim Allison edited comment on TIKA-1599 at 12/9/15 10:06 PM:
-

I ran a comparison on our 80k html docs.  It looks like we gain about 5% in 
common English words if we move to JSoup at the cost of 4 new exceptions.

I'm currently getting far fewer metadata items with JSoup, but that's almost 
certain to be the fault of my initial implementation.

If anyone has a chance to look through contents/content_diffs.xlsx, I'd 
appreciate all feedback.

You can get to the extract root [here|http://162.242.228.174/extracts/].  The 
"A" run is "tika_1_12_20151208" and the "B" run is "jsoup".


was (Author: talli...@mitre.org):
I ran a comparison on our 80k html docs.  It looks like we gain about 5% in 
common English words if we move to JSoup at the cost of 4 new exceptions.

I'm currently getting far fewer metadata items with JSoup, but that's almost 
certain to be the fault of my initial implementation.

If anyone has a chance to look through contents/content_diffs.xlsx, I'd 
appreciate all feedback.



[jira] [Comment Edited] (TIKA-1599) Switch from TagSoup to JSoup

2015-12-09 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15048978#comment-15048978
 ] 

Tim Allison edited comment on TIKA-1599 at 12/9/15 5:06 PM:


bq. So I'm not seeing how the parser that CommonCrawl uses would factor into 
that.

Right, agree.  I was just seeing if they had also picked up JSoup or were using 
something else.

bq. But I also don't have a good suggestion for how to automatically evaluate 
the TagSoup vs. JSoup parse-off, other than simple measures like how many 
failed, or maybe the amount of text extracted?

Y, agreed.  For PDFBOX-3058, and as part of TIKA-1332, I have some alpha-level 
eval code that will do these comparisons.  Following [~tilman]'s 
recommendation, I added counts for "common English" words.
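
The simple measures mentioned above (how many files failed, how much text was 
extracted) could be compared along these lines.  This is only a sketch in 
Python, not the alpha-level eval code itself.  The Extract record and 
compare_runs function are hypothetical names, and it assumes each run yields 
one extract per file name.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Extract:
    """Result of parsing one file in one run (hypothetical record)."""
    text: str = ""
    exception: Optional[str] = None


def compare_runs(run_a: Dict[str, Extract], run_b: Dict[str, Extract]) -> dict:
    """Summarizes failures and extracted text for files present in both runs."""
    shared = sorted(run_a.keys() & run_b.keys())
    return {
        "files_compared": len(shared),
        "failed_a": sum(1 for name in shared if run_a[name].exception),
        "failed_b": sum(1 for name in shared if run_b[name].exception),
        # Files that parsed in run A but threw in run B.
        "new_failures_in_b": [
            name for name in shared
            if run_b[name].exception and not run_a[name].exception
        ],
        "text_chars_a": sum(len(run_a[name].text) for name in shared),
        "text_chars_b": sum(len(run_b[name].text) for name in shared),
        # Files whose extracted text is not identical between the runs.
        "content_differs": [
            name for name in shared if run_a[name].text != run_b[name].text
        ],
    }
```

Raw character counts can be inflated by markup or script text leaking into 
the extract, which is one reason a word-based count is a useful complement.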


was (Author: talli...@mitre.org):
bq. So I'm not seeing how the parser that CommonCrawl uses would factor into 
that.

Right, agree.  I was just seeing if they had also picked up JSoup or were using 
something else.

bq. But I also don't have a good suggestion for how to automatically evaluate 
the TagSoup vs. JSoup parse-off, other than simple measures like how many 
failed, or maybe the amount of text extracted?

Y, agreed.  For PDFBOX-3058, and as part of TIKA-1330, I have some alpha-level 
eval code that will do these comparisons.  Following [~tilman]'s 
recommendation, I added counts for "common English" words.

> Switch from TagSoup to JSoup
> 
>
> Key: TIKA-1599
> URL: https://issues.apache.org/jira/browse/TIKA-1599
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.7, 1.8
> Reporter: Ken Krugler
> Assignee: Ken Krugler
> Priority: Minor
>
> There are several Tika issues related to how TagSoup cleans up HTML 
> ([TIKA-381], [TIKA-985], maybe [TIKA-715]), but TagSoup doesn't seem to be 
> under active development.
> On the other hand I know of several projects that are now using 
> [JSoup|https://github.com/jhy/jsoup], which is an active project (albeit only 
> one main contributor) under the MIT license.
> I haven't looked into how hard it would be to switch this dependency.


