[xep-support] Re: Selectable/searchable phrases in PDF in table column content spanning lines

From: Kevin Brown <kevin@renderx.com>
Date: Mon Feb 24 2014 - 11:42:03 PST

David:

I understand the source of this in relation to tables. I am not sure
anything can be done but I am willing to take a look to see if some change
can be made by changing the XEP intermediate format.

The issue revolves around the order of objects in the area tree (XEPOUT
file). A table is rendered approximately right to left, top to bottom, and
hence all the cells, and even parts of cells (lines of text), appear in the
XEPOUT in that order, positioned for visual appearance. When written to PDF,
Adobe cannot interpret these separate parts as part of the same thing. Note
that this is really an Adobe problem. It should be able to, and in fact it
does in other parts of the product: if you were to set tagged PDF on,
generate this PDF and then use Adobe to read it, it reads it properly even
if the parts are separated. However, taking that same example with tagging
on and doing the search gives the same results as you saw ... funny that it
can "read" across the boundary and not "search" across it.

I tested this theory and was able to make a sample (PDF attached) that
works.

I also had to modify a hidden feature in XEP called the line tightness
factor. The issue here is that while these separate objects are all in the
PDF, Adobe cannot tell which ones belong together: objects that are adjacent
in the file are not necessarily related. Basically, out of the box XEP would
do something like this in the area tree:

<xep:text value="And finally cell" x="263583" y="732891" width="70301"/>
<xep:text value="number three with" x="263583" y="719692" width="88033"/>
<xep:text value="This is cell" x="183954" y="732891" width="51337"/>
<xep:text value="number two" x="183954" y="719692" width="57464"/>
<xep:text value="Wrap cell number one" x="72000" y="732891" width="107591"/>
<xep:text value="with some text in it." x="72000" y="719692" width="93533"/>
<xep:text value="additional text" x="263583" y="706493" width="67870"/>
<xep:text value="inside this cell." x="263583" y="693294" width="71522"/>
<xep:text value="with other text" x="183954" y="706493" width="68475"/>
<xep:text value="in it." x="183954" y="693294" width="20174"/>
<xep:text value="Wrap cell number one" x="72000" y="706493" width="107591"/>
<xep:text value="with some text in it." x="72000" y="693294" width="93533"/>

I reordered these by hand to this:

    <xep:text value="Wrap cell number one" x="72000" y="732891" width="107591"/>
    <xep:text value="with some text in it." x="72000" y="719692" width="93533"/>
    <xep:text value="Wrap cell number one" x="72000" y="706493" width="107591"/>
    <xep:text value="with some text in it." x="72000" y="693294" width="93533"/>

    <xep:text value="This is cell" x="183954" y="732891" width="51337"/>
    <xep:text value="number two" x="183954" y="719692" width="57464"/>
    <xep:text value="with other text" x="183954" y="706493" width="68475"/>
    <xep:text value="in it." x="183954" y="693294" width="20174"/>

    <xep:text value="And finally cell" x="263583" y="732891" width="70301"/>
    <xep:text value="number three with" x="263583" y="719692" width="88033"/>
    <xep:text value="additional text" x="263583" y="706493" width="67870"/>
    <xep:text value="inside this cell." x="263583" y="693294" width="71522"/>

Now, this is not that easy (in fact it would not work if I did not set the
line tightness factor) because, when formatting the tables, XEP would
otherwise set custom word-spacing to squeeze the lines to fit. It becomes
even more complex if kerning is on, since kerning separates fragments of
these words into individual pieces of text.

The solution is not clear to me. I suppose it is possible to write code
that modifies only the tabular content in the XEPOUT file by examining the x
and y coordinates and trying to relate them together ... but doing it
generically, so that it would work in all cases (kerning, spanning, etc.),
escapes me at the moment.
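
For the simple case only, the coordinate idea can be sketched in a few lines
of Python. This is a rough illustration, not something XEP does: it assumes
every line of a cell is a single left-aligned xep:text fragment, so that all
fragments of one column share exactly the same x, and it ignores kerning,
custom word-spacing and spanning entirely. The namespace URI, the xep:page
wrapper and the function name reorder_by_column are assumptions made for the
example.

```python
import xml.etree.ElementTree as ET

# Assumption: the namespace URI bound to the xep: prefix. Check your own XEPOUT.
XEP_NS = "http://www.renderx.com/XEP/xep"
TEXT_TAG = f"{{{XEP_NS}}}text"


def reorder_by_column(parent):
    """Reorder the <xep:text> children of parent in place.

    Fragments are grouped by identical x (one group per left-aligned column,
    left to right) and, within a group, ordered top to bottom (descending y).
    Children that are not <xep:text> keep their positions.
    """
    slots = [i for i, child in enumerate(parent) if child.tag == TEXT_TAG]
    ordered = sorted(
        (parent[i] for i in slots),
        key=lambda el: (int(el.get("x")), -int(el.get("y"))),
    )
    for i, el in zip(slots, ordered):
        parent[i] = el


# The twelve fragments from the listing above, in the order XEP wrote them.
# The <xep:page> wrapper is only there to make the sample parseable.
SAMPLE = f"""<xep:page xmlns:xep="{XEP_NS}">
<xep:text value="And finally cell" x="263583" y="732891" width="70301"/>
<xep:text value="number three with" x="263583" y="719692" width="88033"/>
<xep:text value="This is cell" x="183954" y="732891" width="51337"/>
<xep:text value="number two" x="183954" y="719692" width="57464"/>
<xep:text value="Wrap cell number one" x="72000" y="732891" width="107591"/>
<xep:text value="with some text in it." x="72000" y="719692" width="93533"/>
<xep:text value="additional text" x="263583" y="706493" width="67870"/>
<xep:text value="inside this cell." x="263583" y="693294" width="71522"/>
<xep:text value="with other text" x="183954" y="706493" width="68475"/>
<xep:text value="in it." x="183954" y="693294" width="20174"/>
<xep:text value="Wrap cell number one" x="72000" y="706493" width="107591"/>
<xep:text value="with some text in it." x="72000" y="693294" width="93533"/>
</xep:page>"""

page = ET.fromstring(SAMPLE)
reorder_by_column(page)
values = [el.get("value") for el in page]
for value in values:
    print(value)
```

On this sample the result matches the hand-reordered listing. Sorting on the
exact x value is the weak point: centered cells, indented lines or kerned
fragments would each get their own x and so would not be grouped together.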

At this time, I have no easy solution for you as this is not how the
formatting engine works. I suspect Word writes these objects into the PDF
in reading order, so Adobe can interpret them.

Kevin Brown
RenderX

-----Original Message-----
From: xep-support-bounces@renderx.com
[mailto:xep-support-bounces@renderx.com] On Behalf Of David Clunie
Sent: Friday, February 21, 2014 5:33 AM
To: xep-support@renderx.com
Subject: [xep-support] Re: Selectable/searchable phrases in PDF in table
column content spanning lines

Sorry, forgot the attached screenshots.

David

On 2/21/14 8:32 AM, David Clunie wrote:
> Hi
>
> In PDF table cells from the DocBook FO stylesheets rendered with xep,
> if the words in a table cell spread across several lines, then they are
> not kept together from the perspective of selection or searching in
> PDF viewers.
>
> I have not been able to improve this behavior using any of the FO
> "keep-together" options (e.g., applied in "table.cell.block.properties"
> template customizations). I use "keep-together.within-column"
> successfully to prevent cells breaking across pages, but that does not
> affect the described problem.
>
> And using "keep-together.within-line" doesn't solve the problem either
> (phrases still get split up), and also causes some tables to overflow
> the page margins anyway and is hence unusable for this.
>
> In short, I am in need of some way of having the content wrap to fit
> within the page (obviously) but remain together from the PDF encoding
> perspective such that phrases are searchable, which is critical for
> our use case (we have a standard with many thousands of pages and need
> to be able to search for phrases that are (long) names of data
> elements that are present in tables and wrapped to fit on a page
> (e.g., as in the screen shots, "Shared Functional Groups Sequence")).
>
> Since Word can do it, I know PDF can be encoded this way; the question
> is how to get xep to do it.
>
> It isn't split in the FO (the text is contained in one <fo:block/>).
>
> The attached screen shots show selection of a phrase spanning lines
> wrapped within a cell highlighted when displayed using Acrobat, with
> "good" output from Word and "bad" output from XEP.
>
> The DocBook fragment for this row is:
>
> <tr valign="top">
> <td align="left" colspan="1" rowspan="1">
> <para>Shared Functional Groups Sequence</para>
> </td>
> <td align="center" colspan="1" rowspan="1">
> <para>(5200,9229)</para>
> </td>
> <td align="center" colspan="1" rowspan="1">
> <para>1</para>
> </td>
> <td align="left" colspan="1" rowspan="1">
> <para>Sequence that contains the Functional Group Macros that are
> shared for all frames in this SOP Instance and Concatenation.</para>
> <note>
> <para>The contents of this sequence are the same in all SOP
> Instances that comprise a Concatenation.</para>
> </note>
> <para>Only a single Item shall be included in this sequence.</para>
> <para>See <xref linkend="sect_C.7.6.16.1.1" xrefstyle="select:
> label"/> for further explanation.</para>
> </td>
> </tr>
>
> The customization used is:
>
> <xsl:template name="table.cell.block.properties">
> <xsl:attribute name="keep-together.within-column">always</xsl:attribute>
> <xsl:choose>
> <xsl:when test="ancestor::d:thead or ancestor::d:tfoot">
> <xsl:attribute name="font-weight">bold</xsl:attribute>
> </xsl:when>
> <!-- Make row headers bold too -->
> <xsl:when test="ancestor::d:tbody and
> (ancestor::d:table[@rowheader = 'firstcol'] or
> ancestor::d:informaltable[@rowheader = 'firstcol']) and
> ancestor-or-self::d:entry[1][count(preceding-sibling::d:entry) = 0]">
> <xsl:attribute name="font-weight">bold</xsl:attribute>
> </xsl:when>
> </xsl:choose>
> </xsl:template>
>
> and the extract of the FO produced by the DocBook FO stylesheets is:
>
> <fo:table-row>
> <fo:table-cell padding-start="2pt" padding-end="2pt"
> padding-top="2pt" padding-bottom="2pt" text-align="left"
> display-align="before" border-start-style="none" border-top-style="none"
> border-bottom-style="solid" border-bottom-width="0.5pt"
> border-bottom-color="black" border-end-style="solid"
> border-end-width="0.5pt" border-end-color="black"><fo:block
> keep-together.within-column="always">
> <fo:block space-before.optimum="1em"
> space-before.minimum="0.8em" space-before.maximum="1.2em">Shared
> Functional Groups Sequence</fo:block>
> </fo:block></fo:table-cell>
> <fo:table-cell padding-start="2pt" padding-end="2pt"
> padding-top="2pt" padding-bottom="2pt" text-align="center"
> display-align="before" border-start-style="none" border-top-style="none"
> border-bottom-style="solid" border-bottom-width="0.5pt"
> border-bottom-color="black" border-end-style="solid"
> border-end-width="0.5pt" border-end-color="black"><fo:block
> keep-together.within-column="always">
> <fo:block space-before.optimum="1em"
> space-before.minimum="0.8em"
> space-before.maximum="1.2em">(5200,9229)</fo:block>
> </fo:block></fo:table-cell>
> <fo:table-cell padding-start="2pt" padding-end="2pt"
> padding-top="2pt" padding-bottom="2pt" text-align="center"
> display-align="before" border-start-style="none" border-top-style="none"
> border-bottom-style="solid" border-bottom-width="0.5pt"
> border-bottom-color="black" border-end-style="solid"
> border-end-width="0.5pt" border-end-color="black"><fo:block
> keep-together.within-column="always">
> <fo:block space-before.optimum="1em"
> space-before.minimum="0.8em" space-before.maximum="1.2em">1</fo:block>
> </fo:block></fo:table-cell>
> <fo:table-cell padding-start="2pt" padding-end="2pt"
> padding-top="2pt" padding-bottom="2pt" text-align="left"
> display-align="before" border-start-style="none" border-top-style="none"
> border-bottom-style="solid" border-bottom-width="0.5pt"
> border-bottom-color="black"><fo:block
> keep-together.within-column="always">
> <fo:block space-before.optimum="1em"
> space-before.minimum="0.8em" space-before.maximum="1.2em">Sequence
> that contains the Functional Group Macros that are shared for all
> frames in this SOP Instance and Concatenation.</fo:block>
> <fo:block id="idp140215214853168"
> space-before.minimum="0.8em" space-before.optimum="1em"
> space-before.maximum="1.2em" margin-left="0.25in"
> margin-right="0.25in"><fo:block keep-with-next.within-column="always"
> font-size="9pt" font-weight="bold"
> hyphenate="false">Note</fo:block><fo:block><fo:block
> space-before.optimum="1em" space-before.minimum="0.8em"
> space-before.maximum="1.2em">The contents of this sequence are the
> same in all SOP Instances that comprise a
> Concatenation.</fo:block></fo:block></fo:block>
> <fo:block space-before.optimum="1em"
> space-before.minimum="0.8em" space-before.maximum="1.2em">Only a
> single Item shall be included in this sequence.</fo:block>
> <fo:block space-before.optimum="1em"
> space-before.minimum="0.8em" space-before.maximum="1.2em">See
> <fo:basic-link
> internal-destination="sect_C.7.6.16.1.1"><fo:inline>Section
> C.7.6.16.1.1</fo:inline></fo:basic-link> for further
> explanation.</fo:block>
> </fo:block></fo:table-cell>
> </fo:table-row>
>
> Thanks ... David

Received on Mon Feb 24 11:42:09 2014

This archive was generated by hypermail 2.1.8 : Mon Feb 24 2014 - 11:42:14 PST