[jira] [Commented] (JAMES-4061) Html Text extractor needs to handle blockquote

Benoit Tellier (Jira) Thu, 22 Aug 2024 06:04:30 -0700


    [ 
https://issues.apache.org/jira/browse/JAMES-4061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17875841#comment-17875841
 ]


Benoit Tellier commented on JAMES-4061:
---------------------------------------

Link handling could be revisited too.

E.g., given <a href="https://abc.com">this link</a>, the inner text is kept and
the link target is discarded (suboptimal: the link is thus not usable).

We could display links as


{code:java}
this link (https://abc.com)
{code}

This would apply only when the link target differs from the displayed text.
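
A minimal sketch of that rule, in plain Java with no dependencies. The class and method names (LinkRenderer, render) are hypothetical and are not part of the James code base. How the anchor text and target are obtained from the HTML is left out.

```java
final class LinkRenderer {
    private LinkRenderer() {
    }

    /**
     * Renders an anchor for a text/plain projection.
     * Hypothetical helper: the name and signature are assumptions for illustration.
     */
    static String render(String text, String href) {
        String label = text == null ? "" : text.trim();
        String target = href == null ? "" : href.trim();
        if (target.isEmpty()) {
            // No usable target: keep the inner text only.
            return label;
        }
        if (label.isEmpty() || label.equals(target)) {
            // Text and target are identical (or text is missing): print the URL once.
            return target;
        }
        // Text and target differ: keep both so the link stays usable.
        return label + " (" + target + ")";
    }

    public static void main(String[] args) {
        System.out.println(render("this link", "https://abc.com"));
        System.out.println(render("https://abc.com", "https://abc.com"));
    }
}
```

With this, the first call prints "this link (https://abc.com)" and the second prints the URL only once.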


> Html Text extractor needs to handle blockquote
> ----------------------------------------------
>
>                 Key: JAMES-4061
>                 URL: https://issues.apache.org/jira/browse/JAMES-4061
>             Project: James Server
>          Issue Type: Bug
>          Components: JMAP
>    Affects Versions: master
>            Reporter: Benoit Tellier
>            Assignee: Antoine Duprat
>            Priority: Major
>         Attachments: image-2024-08-22-14-54-37-915.png, 
> image-2024-08-22-14-54-51-684.png, image-2024-08-22-14-55-01-317.png
>
>
> Following recent mailing list exchanges, Wojtek contacted me privately to 
> notify me about the bad indentation of my inlined answers.
> The exchange: 
> https://www.mail-archive.com/server-dev@james.apache.org/msg74362.html
> Setup: I used the Twake Mail client throughout the discussion; it produces 
> HTML and relies on the James server JMAP code to generate the text/plain 
> part. Wojtek prefers reading the text/plain part when available.
> Full diagnostic is taken from a private conversation:
> h3. Diagnostic
> I bet it is the plain text projection of the email that got screwed up. The 
> HTML version looks fine
>  !image-2024-08-22-14-54-37-915.png! 
> Which matched the output I see in my sent mails in Twake mail
>  !image-2024-08-22-14-54-51-684.png! 
> However, the text/plain version is indeed missing one quoting level
>  !image-2024-08-22-14-55-01-317.png! 
> What we have
> >> Your initial concern
> > My initial answer
> Your answer
> My answer to your answer
> What we should have
> >>> Your initial concern
> >> My initial answer
> > Your answer
> My answer to your answer
> Where it gets annoying is that our webmail ( 
> https://github.com/apache/james-project ) generates HTML output (WYSIWYG), 
> the backend then extracts the text from the HTML in order to present a 
> text/plain view of the message, and the <blockquote> tags are currently 
> ignored.
> The component converting HTML to text needs to account for these blockquotes: 
> keep track of the blockquote nesting depth of the current context and 
> prefix each line that follows a line break with the matching number of '>' 
> quote markers
> <blockquote><p>abc</p><p>def<br/>ghi</p><blockquote><p>jkl</p><p>mno<br/></p></blockquote><p>pqr</p></blockquote><p>stu</p>
> should be converted to
> > abc
> > def
> > ghi
> >> jkl
> >> mno
> > pqr
> stu
> The involved component is a JMAP utility of Apache James: 
> org.apache.james.jmap.utils.JsoupHtmlTextExtractor
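
The depth tracking described in the quoted issue could be sketched as follows. This is an illustration only and not the code of JsoupHtmlTextExtractor: to stay self-contained it uses a small regex tokenizer instead of a real HTML parser, it only understands blockquote, p and br, it drops blank lines, and it does not decode entities. The class and method names (BlockquoteTextSketch, toPlainText) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BlockquoteTextSketch {
    // Either a tag (group 1 is "/" for a closing tag, group 2 is the tag name)
    // or a run of text (group 3). Assumption: input is simple, well-behaved markup.
    private static final Pattern TOKEN =
        Pattern.compile("<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>|([^<]+)");

    private BlockquoteTextSketch() {
    }

    static String toPlainText(String html) {
        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int depth = 0; // number of enclosing blockquote elements
        Matcher matcher = TOKEN.matcher(html);
        while (matcher.find()) {
            if (matcher.group(3) != null) {
                current.append(matcher.group(3));
                continue;
            }
            String tag = matcher.group(2).toLowerCase(Locale.ROOT);
            boolean closing = !matcher.group(1).isEmpty();
            switch (tag) {
                case "blockquote":
                    // Finish the pending line at the old depth, then change depth.
                    flush(lines, current, depth);
                    depth = closing ? Math.max(0, depth - 1) : depth + 1;
                    break;
                case "p":
                case "br":
                    // Paragraph boundaries and line breaks both end the current line.
                    flush(lines, current, depth);
                    break;
                default:
                    // Other tags are ignored; their inner text is kept.
                    break;
            }
        }
        flush(lines, current, depth);
        return String.join("\n", lines);
    }

    // Emits the pending line, prefixed with one '>' per blockquote level.
    private static void flush(List<String> lines, StringBuilder current, int depth) {
        String text = current.toString().trim();
        current.setLength(0);
        if (text.isEmpty()) {
            return; // simplification: blank lines are not preserved
        }
        String prefix = depth == 0 ? "" : ">".repeat(depth) + " ";
        lines.add(prefix + text);
    }

    public static void main(String[] args) {
        String html = "<blockquote><p>abc</p><p>def<br/>ghi</p><blockquote><p>jkl</p>"
            + "<p>mno<br/></p></blockquote><p>pqr</p></blockquote><p>stu</p>";
        System.out.println(toPlainText(html));
    }
}
```

Run on the HTML example from the issue, this sketch yields the seven lines listed there ("> abc" through "stu"). An actual fix would presumably do the same bookkeeping inside the existing HTML traversal rather than with a regex.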



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
