[jira] [Commented] (TIKA-2562) tika server parse HTML removes DIVs around hyperlink & adds shape

2023-09-26 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17769215#comment-17769215
 ] 

Tim Allison commented on TIKA-2562:
---

We're now getting this with the migration to JSoup in the 3.x/main branch.  
This looks right?
 
Lorem ipsum dolor sit amet, consectetur adipiscing laborum.
 http://www.google.com";>http://www.google.com 
 https://mail.google.com/mail/?tab=wm";>https://mail.google.com/mail/?tab=wm
 
   


> tika server parse HTML removes DIVs around hyperlink & adds shape
> -
>
> Key: TIKA-2562
> URL: https://issues.apache.org/jira/browse/TIKA-2562
> Project: Tika
>  Issue Type: Bug
>  Components: gui, parser, server
>Affects Versions: 1.17
>Reporter: NW Brad
>Priority: Major
> Attachments: tika_adds_shape_to_hyperlink.html
>
>
> Hyperlinks in an HTML document that are parsed via tika server:
> curl -X PUT --upload-file tika_adds_shape_to_hyperlink.html 
> http://localhost:9998/tika --header "Accept: text/html"
> sent:
> 
>   href="http://www.google.com";>[http://www.google.com|http://www.google.com/]
>  
> received back:
>  href="http://www.google.com";>[http://www.google.com|http://www.google.com/]
>  
> Divs are gone and a shape has been added
>  
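
For anyone who wants to reproduce the report from Java rather than curl, here is a minimal sketch of the same PUT request using the JDK 11+ HttpClient. The class and method names (TikaServerRepro, buildRequest) are hypothetical, and the sketch assumes a tika-server instance is listening on localhost:9998 and that the attachment is in the working directory.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

// Hypothetical helper class; not part of Tika.
class TikaServerRepro {

    /** Builds the same request as the curl command: PUT the file, ask for text/html back. */
    static HttpRequest buildRequest(Path htmlFile) throws IOException {
        return HttpRequest.newBuilder(URI.create("http://localhost:9998/tika"))
                .header("Accept", "text/html")
                // ofFile fails with FileNotFoundException if the file is missing
                .PUT(HttpRequest.BodyPublishers.ofFile(htmlFile))
                .build();
    }

    public static void main(String[] args) throws Exception {
        // Assumes a running tika-server and the attachment from this issue.
        HttpRequest request = buildRequest(Path.of("tika_adds_shape_to_hyperlink.html"));
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        // Print the HTML that the server returned so it can be compared with the input.
        System.out.println(response.body());
    }
}
```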



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-2562) tika server parse HTML removes DIVs around hyperlink & adds shape

2018-02-05 Thread NW Brad (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16352666#comment-16352666
 ] 

NW Brad commented on TIKA-2562:
---

Thanks.  I'll take a look at it.  It definitely looks like the same issue, but 
for the title tag.  It's too bad the SAXTransformer doesn't give you the option 
to prevent the issue.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TIKA-2562) tika server parse HTML removes DIVs around hyperlink & adds shape

2018-02-05 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16352515#comment-16352515
 ] 

Tim Allison commented on TIKA-2562:
---

Thank you for looking into this.  IIUC, [~rgauss] offers a way to get the 
closing tags on empty elements in TIKA-895 via configuration of the 
SAXTransformer.  Fellow Tika-devs, do we want to make this configurable in 
tika-server...or make it the default ...or...?
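
As a rough illustration of the kind of configuration being discussed, the sketch below sets the output method on a JDK TransformerHandler and serializes an empty anchor. It is not the TIKA-895 change itself and not tika-server code; the class name OutputMethodDemo and the hand-written SAX events are assumptions made for the example. With the JDK's built-in serializer, "xml" is expected to collapse the empty anchor into an empty-element tag, while "html" is expected to keep the closing tag.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical demo class; not part of Tika.
class OutputMethodDemo {

    /** Serializes an empty anchor with the given output method ("xml" or "html"). */
    static String serialize(String method) throws Exception {
        SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.getTransformer().setOutputProperty(OutputKeys.METHOD, method);
        handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "no");
        handler.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        StringWriter out = new StringWriter();
        handler.setResult(new StreamResult(out));

        AttributesImpl href = new AttributesImpl();
        href.addAttribute("", "href", "href", "CDATA", "http://www.google.com");

        // Feed the events for <a href="..."></a> by hand.
        handler.startDocument();
        handler.startElement("", "a", "a", href);
        handler.endElement("", "a", "a");
        handler.endDocument();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("xml:  " + serialize("xml"));
        System.out.println("html: " + serialize("html"));
    }
}
```

Whether something like this should be a tika-server option or the default is the open question above.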






[jira] [Commented] (TIKA-2562) tika server parse HTML removes DIVs around hyperlink & adds shape

2018-02-02 Thread NW Brad (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16351224#comment-16351224
 ] 

NW Brad commented on TIKA-2562:
---

I was doing some research on this today and this may not be a function of Tika. 
 I think it is probably the SAXTransformerFactory (javax.xml.transform) that is 
making the change.  At least I couldn't find any code in Tika that did it 
directly.  But anything I ran through the SAXTransformerFactory converted the 
HTML I provided with void (empty) elements and self-closing start tags as shown 
below:

<a href="http://www.google.com"></a> *becomes* <a href="http://www.google.com"/>

and  *becomes* .

From an XML standpoint the converted syntax is correct, but the anchor tag 
code, while correct in XML, does not appear to work correctly as HTML in the 
current versions of both Chrome and Firefox.  So, converting HTML via Tika in 
this situation generates bad HTML for the examples I have.

I believe the SAXTransformerFactory is also deleting the <div> that is around 
the "empty" anchor tag, since a div around nothing may not be considered 
relevant.  At least that is what I speculate...
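
One way to check this outside of Tika is to drive the JDK serializer directly. The following is a minimal sketch under stated assumptions: the class name EmptyAnchorDemo is hypothetical, the SAX events are written by hand rather than produced by a parser, and only javax.xml.transform classes from the JDK are used. With the "xml" output method, the empty anchor is expected to come out as an empty-element tag, while the wrapping div is expected to survive, which would point to the div being dropped somewhere other than the serializer.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical demo class; not part of Tika.
class EmptyAnchorDemo {

    /** Serializes <div><a href="..."></a></div> with the "xml" output method. */
    static String serializeAsXml() throws Exception {
        SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
        handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "no");
        handler.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        StringWriter out = new StringWriter();
        handler.setResult(new StreamResult(out));

        AttributesImpl none = new AttributesImpl();
        AttributesImpl href = new AttributesImpl();
        href.addAttribute("", "href", "href", "CDATA", "http://www.google.com");

        // An anchor with no content, wrapped in a div.
        handler.startDocument();
        handler.startElement("", "div", "div", none);
        handler.startElement("", "a", "a", href);
        handler.endElement("", "a", "a");
        handler.endElement("", "div", "div");
        handler.endDocument();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(serializeAsXml());
    }
}
```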






[jira] [Commented] (TIKA-2562) tika server parse HTML removes DIVs around hyperlink & adds shape

2018-02-02 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16350822#comment-16350822
 ] 

Tim Allison commented on TIKA-2562:
---

I'll take a look.  This will require some digging.






[jira] [Commented] (TIKA-2562) tika server parse HTML removes DIVs around hyperlink & adds shape

2018-02-02 Thread NW Brad (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16350634#comment-16350634
 ] 

NW Brad commented on TIKA-2562:
---

Thanks.  I checked it out: I tried parsing the file using the tagsoup command 
line, and tagsoup is definitely adding the shape.  However, it appears that 
the <div> removal is coming from tika.

Tagsoup parse results:


 http://www.google.com";>http://www.google.com
 

Tika parse results:

http://www.google.com";>http://www.google.com

The div is gone...

I also noted another problem with parsing that is coming from Tika and not 
tagsoup when dealing with hidden anchors/hyperlinks:

original:

http://www.google.com";>

Tagsoup results:

http://www.google.com*";>*

Tika results:

http://www.google.com*"/>*

Tika seems to alter the anchor by removing the end-tag and replacing it with an 
empty-element tag.  This occurs on other tags as well, most common being 
 with .

This may not seem to be a big deal, but with anchors it is causing a problem 
with Chrome and Firefox and the anchor style bleeds into content immediately 
following the anchor.

Is there a way in Tika to turn off this feature?  If not, do you know where in 
the code this occurs?

Thanks.






[jira] [Commented] (TIKA-2562) tika server parse HTML removes DIVs around hyperlink & adds shape

2018-02-02 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16350330#comment-16350330
 ] 

Tim Allison commented on TIKA-2562:
---

This is a "feature" of tagsoup; see, e.g., 
[https://groups.google.com/forum/#!topic/tagsoup-friends/EfB6i12xBLw]. I'm 
hesitant to fix this in Tika because we should probably migrate to jsoup, which 
is actively supported (TIKA-1599).



