[xep-support] line-breaking on typographic spaces

From: David Tolpin <dvd@davidashen.net>
Date: Thu Oct 20 2005 - 11:17:15 PDT


I am slowly recalling the reasoning that led me to the decision to
abandon the idea of supporting Unicode spaces in XEP.

Here is the picture:

1) According to UAX#14, http://www.unicode.org/reports/tr14/#SP, Zs
characters are breaking spaces, which means that the line can be broken
on either side of the character. On the other hand, whitespace treatment
and collapsing are defined in terms of XML spaces, which are

S ::= (#x20 | #x9 | #xD | #xA)+

As a result, when someone uses typographic spaces, they end up hanging
from the ends of the line if the line-breaking algorithm decides to
break on them, which breaks alignment on both edges.

2) Spaces, except for the THIN SPACE (U+2009), must not change their
width for the sake of alignment. This means that they cannot be output
as normal glyphs and then combined with letter-spacing/word-spacing,
because letter spacing will change the effective widths of the spaces.
Thus, any implementation that merely preserves the spaces from the
range as Unicode code points breaks their intended use.

3) XSL has an idiom that expresses what the spaces are intended for
in traditional typography: inline spaces on inlines (space-start,
space-end -- Sergey, thanks for the hint). A professional typesetting
toolchain should provide markup that automatically translates such
things as figure numbers and quoted strings in French into properly
spaced structures, using inline spaces, not Unicode space characters.

4) In cases where legacy data must be dealt with, the frontend that
processes the document before it is fed to an XSL formatter can and
should transform the spaces into space-filled leaders, which ensure
the desired look and allow the behavior to be adjusted to the
particular tradition of the publishing house.
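
To make the mismatch in point 1 concrete, here is a minimal Python sketch (not anything XEP does internally) that lists the Zs characters which XML whitespace handling does not recognize. The function names are hypothetical, and the result depends on the Unicode version shipped with the Python interpreter.

```python
import sys
import unicodedata

# XML 1.0 whitespace, per the production S ::= (#x20 | #x9 | #xD | #xA)+
XML_SPACES = {"\u0020", "\u0009", "\u000D", "\u000A"}


def zs_characters():
    """Return every character whose Unicode general category is Zs."""
    return [chr(cp) for cp in range(sys.maxunicode + 1)
            if unicodedata.category(chr(cp)) == "Zs"]


def untouched_by_xml_collapsing():
    """Zs characters that are not XML whitespace, so XML-based
    whitespace treatment and collapsing never see them as spaces."""
    return [c for c in zs_characters() if c not in XML_SPACES]


if __name__ == "__main__":
    for c in untouched_by_xml_collapsing():
        print(f"U+{ord(c):04X} {unicodedata.name(c)}")
```

Of the Zs characters, only #x20 is an XML space; everything else in the category falls outside whitespace treatment.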


With these points in mind, the options were to:

   - provide complete implementation of typographic spaces, which
most probably would mean just translating them to leaders in the XSL
compiler (an internal part of XEP);
   - keep them as glyphs;
   - expose the problem and filter out all those spaces, replacing
them with #x20, thus ensuring that the formatter yields readable
documents, albeit with a less than optimal look, when the
problematic approach is used.
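
For illustration, the filter in the third option could be sketched in Python as follows. This is only a sketch under an assumption: the message does not say exactly which code points are filtered, so the range U+2000..U+200A (EN QUAD through HAIR SPACE) and the name filter_spaces are hypothetical choices.

```python
import re

# Assumption: "typographic spaces" means U+2000..U+200A
# (EN QUAD through HAIR SPACE). The exact range is not given
# in the message; adjust it to your own definition.
TYPOGRAPHIC_SPACES = re.compile("[\u2000-\u200A]")


def filter_spaces(text: str) -> str:
    """Replace each typographic space with a plain #x20."""
    return TYPOGRAPHIC_SPACES.sub(" ", text)
```

Under this assumption, NO-BREAK SPACE (U+00A0) is left alone, since it is outside the chosen range.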

After discussing the issue, we chose the third path, because it
still allows a complete implementation along the lines of the first
alternative to be built and used, and does not hardcode
contradictory semantics into the formatting kernel. Given that the
corresponding filter is but a few lines in any modern programming
language, we have thus left room for adjustments and customizations;
while still providing a powerful machinery to handle all cases and

I am thankful to Jirka Kosek for supporting this approach in the DocBook
XSL stylesheets. The XSLT code is necessarily verbose because string
manipulation is not the area where XSLT shines. The example shows
that it is just as easy to implement a spaces-to-leaders transformation
in XSL itself, so that typographic spaces work with any XSL
formatter, and in a way you define, which is important given that
the currently available definitions are vague and contradictory.
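
To show the shape of a spaces-to-leaders transformation outside XSLT, here is a rough Python sketch. It is not the DocBook XSL code. The width table is an assumption (precisely the kind of house-style decision mentioned above), the name spaces_to_leaders is hypothetical, and the fo: prefix is assumed to be bound to the XSL-FO namespace in the enclosing document.

```python
from xml.sax.saxutils import escape

# Assumed widths in em. These are illustrative house-style
# choices, not values taken from any specification or from XEP.
LEADER_WIDTH_EM = {
    "\u2002": "0.5",    # EN SPACE
    "\u2003": "1",      # EM SPACE
    "\u2004": "0.333",  # THREE-PER-EM SPACE
    "\u2005": "0.25",   # FOUR-PER-EM SPACE
    "\u2006": "0.167",  # SIX-PER-EM SPACE
    "\u2009": "0.2",    # THIN SPACE
    "\u200A": "0.1",    # HAIR SPACE
}


def spaces_to_leaders(text: str) -> str:
    """Return FO markup for `text`, with every mapped space turned
    into a fixed-width, space-filled fo:leader. All other characters
    are XML-escaped and passed through unchanged."""
    out = []
    for ch in text:
        width = LEADER_WIDTH_EM.get(ch)
        if width is None:
            out.append(escape(ch))
        else:
            out.append('<fo:leader leader-pattern="space" '
                       f'leader-length="{width}em"/>')
    return "".join(out)
```

Changing the table is all it takes to follow a different tradition; spaces that are not in the table are left as they are.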

I also believe that XSL is an interchange format. When talking about
high-quality typesetting, one is expected to implement a tool chain
where hair-splitting activities like inserting hair spaces in the right
places are unnecessary because a higher-level markup does the job,
both by providing the typesetter with convenient abstractions and
macros and by translating the contents into structured XSL, with ipd
(inline-progression-dimension) spaces, not with Unicode space marks.
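
As a small illustration of what such higher-level markup might expand to, the following Python sketch wraps a French quotation in an fo:inline that carries space-start and space-end instead of Unicode spaces. The function name french_quote and the default gap of 0.25em are hypothetical, and the fo: prefix is assumed to be bound to the XSL-FO namespace; whether a line may break next to the guillemets is left to the formatter and is not addressed here.

```python
from xml.sax.saxutils import escape


def french_quote(text: str, gap: str = "0.25em") -> str:
    """Wrap `text` in guillemets, separating it from them with
    inline spaces (space-start/space-end) on an fo:inline rather
    than with Unicode space characters. `gap` is an assumed value."""
    inner = escape(text)
    return (f'\u00AB<fo:inline space-start="{gap}" space-end="{gap}">'
            f'{inner}</fo:inline>\u00BB')
```

The width of the gap lives in one place, so a publishing house can change it without touching the document content.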

David Tolpin

Received on Thu Oct 20 11:43:46 2005
