Whitespace in HTML parsed incorrectly #27

Closed
opened 2019-07-01 13:20:07 +00:00 by shadowfacts · 1 comment
Owner

Posts authored in Markdown or HTML have extra whitespace at the beginning of paragraphs.

shadowfacts added this to the 1.0.0 milestone 2019-07-01 13:20:07 +00:00
shadowfacts added the bug label 2019-07-01 13:20:07 +00:00
shadowfacts changed title from Extra whitespace in Markdown/HTML formatted posts to Whitespace in HTML parsed incorrectly 2020-01-21 02:08:39 +00:00
Author
Owner

~~SwiftSoup, the HTML parsing library we currently use, parses whitespace in between HTML elements incorrectly, e.g. it will parse <p>a</p>\n<p>b</p> as [paragraph element, TextNode containing a space, paragraph element]. The TextNode containing the space is what's showing up when rendering.~~

This issue only manifests with rich text posts because non-rich text posts don't have their text wrapped in paragraph tags.

The newlines in the raw HTML should be ~~parsed correctly and then~~ collapsed per the CSS whitespace collapsing rules.
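To make the intended behavior concrete, here is a minimal sketch of what collapsing could look like on a parsed fragment. The `Node` enum and both functions are hypothetical stand-ins for illustration, not SwiftSoup's API, and the rules are a simplified reading of the CSS ones: runs of spaces, tabs, and line breaks become one space, and a text node that is only whitespace between block elements is dropped.

```swift
// Hypothetical, simplified model of a parsed HTML fragment (not SwiftSoup's API).
enum Node: Equatable {
    case paragraph(String)   // a block-level <p> element and its text
    case text(String)        // a bare text node sitting between elements
}

/// Collapses each run of spaces, tabs, and line breaks into a single space.
/// Works on Unicode scalars so that "\r\n" is not treated as one character.
func collapseWhitespace(_ s: String) -> String {
    let collapsible: Set<Unicode.Scalar> = [" ", "\t", "\n", "\r"]
    var scalars = String.UnicodeScalarView()
    var previousWasWhitespace = false
    for scalar in s.unicodeScalars {
        if collapsible.contains(scalar) {
            if !previousWasWhitespace { scalars.append(" ") }
            previousWasWhitespace = true
        } else {
            scalars.append(scalar)
            previousWasWhitespace = false
        }
    }
    return String(scalars)
}

/// Collapses whitespace inside every node and drops text nodes that contain
/// nothing but whitespace, such as the line break between two paragraphs.
func dropInterElementWhitespace(_ nodes: [Node]) -> [Node] {
    return nodes.compactMap { node -> Node? in
        switch node {
        case .paragraph(let content):
            return .paragraph(collapseWhitespace(content))
        case .text(let content):
            let collapsed = collapseWhitespace(content)
            if collapsed.isEmpty || collapsed == " " { return nil }
            return .text(collapsed)
        }
    }
}

// The fragment from the description: <p>a</p>\n<p>b</p>
let parsed: [Node] = [.paragraph("a"), .text("\n"), .paragraph("b")]
let cleaned = dropInterElementWhitespace(parsed)
```

With the fragment above, `cleaned` contains only the two paragraph nodes, so nothing is left over to render as a stray space.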

Potential solutions to this are:

  • Use the NSAttributedString HTML initializer

    This has the downside of requiring two HTML parses (first with something else to sanitize the HTML, second to convert into an attributed string), which would be slower than ideal.

  • Fix SwiftSoup

    good luck

  • Switch to a different HTML parsing library

    HTMLReader seems like it could work, but most libraries seem like crap

  • Manually scan through the attributed string after it's generated and collapse whitespace per the CSS rules

    just kind of a pain in the ass
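For the last option, a rough sketch of a post-processing pass over the generated attributed string might look like the following. The function name `collapseSpaces(in:)` is made up for illustration. It assumes that line breaks already present in the attributed string are intentional paragraph separators, so it only touches spaces and tabs, which is a simplification of the full CSS rules (trailing spaces before a line break, for instance, are left alone).

```swift
import Foundation

/// Collapses runs of spaces and tabs into a single space, and removes a run
/// entirely when it sits at the start of the string or right after a line break.
/// Assumption: "\n" characters are deliberate paragraph breaks and are kept.
func collapseSpaces(in attributed: NSMutableAttributedString) {
    // Matches each run of one or more spaces or tabs.
    guard let regex = try? NSRegularExpression(pattern: "[ \\t]+", options: []) else {
        return
    }
    // Snapshot of the text, used to inspect the character before each run.
    let text = NSString(string: attributed.string)
    let fullRange = NSRange(location: 0, length: text.length)
    let matches = regex.matches(in: attributed.string, options: [], range: fullRange)

    // Walk backwards so that earlier ranges stay valid as the string shrinks.
    for match in matches.reversed() {
        let start = match.range.location
        let atLineStart = start == 0 || text.character(at: start - 1) == 10 // 10 is "\n"
        attributed.replaceCharacters(in: match.range, with: atLineStart ? "" : " ")
    }
}

let sample = NSMutableAttributedString(string: " a  b\n \tc")
collapseSpaces(in: sample)
```

Because `replaceCharacters(in:with:)` edits the attributed string in place, attributes on the surrounding text are left as they were; only the whitespace runs change.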

Reference: shadowfacts/Tusker#27