[xep-support] Re: Selectable/searchable phrases in PDF in table column content spanning lines

From: David Clunie <dclunie@dclunie.com>
Date: Tue Feb 25 2014 - 11:11:27 PST

Hi Kevin

Thanks for doing the experiments.

Since you mentioned tagged PDF, I realized that was probably making
all the difference, and that the PDF generated from Word that I
had used as a comparison was probably generated with that option.

So I turned on ENABLE_ACCESSIBILITY in xep.xml, and interestingly
that now allows selection of a phrase in a table cell that spans
lines.

But it still does not allow searching on that phrase :(

Indeed, it seems to have very strange effects on searching such
phrases that span lines.

It is almost as though the object in the PDF that is tagged is
tagged with a value that is not the complete phrase (that one
can select) but rather just the first line of that phrase, if
that makes sense.

See the attached screenshot for a search for "Shared Functional"
that works and finds what is highlighted (which is more than the
search term).

But if I search for "Shared Functional Groups" it doesn't find it :(

Also, the tags are in a weird order, in that a search for content
will find it in a column on the right before it finds it in a
column on the left. But I can live with that.

David

On 2/24/14 2:42 PM, Kevin Brown wrote:
> David:
>
> I understand the source of this in relation to tables. I am not sure
> anything can be done but I am willing to take a look to see if some change
> can be made by changing the XEP intermediate format.
>
> The issue revolves around the order of objects in the area tree (XEPOUT
> file). A table is rendered approximately right to left, top to bottom, and
> hence all the cells and even parts of cells (lines of text) are in the
> XEPOUT in the order they were rendered for visual appearance. When written
> to PDF, Adobe cannot interpret these separate parts as part of the same
> thing. Note that this is really an Adobe problem. It should be able to, and
> in fact it does in other parts of the product: if you were to turn tagged
> PDF on, generate this PDF, and then use Adobe to read it, it reads it
> properly even if the parts are separated. However, taking that same example
> with tagging on and doing the search gives the same results as you saw ...
> funny that it can "read" across the boundary and not "search" across it.
>
> I tested this theory and was able to make a sample (PDF attached) that
> works.
>
> I also had to modify a hidden feature in XEP called the line tightness
> factor. Basically, out of the box XEP would do something like this in the
> area tree. The issue here is that while these separate objects are in the
> PDF, Adobe cannot recognize that adjacent ones are not related.
>
> <xep:text value="And finally cell" x="263583" y="732891" width="70301"/>
> <xep:text value="number three with" x="263583" y="719692" width="88033"/>
> <xep:text value="This is cell" x="183954" y="732891" width="51337"/>
> <xep:text value="number two" x="183954" y="719692" width="57464"/>
> <xep:text value="Wrap cell number one" x="72000" y="732891" width="107591"/>
> <xep:text value="with some text in it." x="72000" y="719692" width="93533"/>
> <xep:text value="additional text" x="263583" y="706493" width="67870"/>
> <xep:text value="inside this cell." x="263583" y="693294" width="71522"/>
> <xep:text value="with other text" x="183954" y="706493" width="68475"/>
> <xep:text value="in it." x="183954" y="693294" width="20174"/>
> <xep:text value="Wrap cell number one" x="72000" y="706493" width="107591"/>
> <xep:text value="with some text in it." x="72000" y="693294" width="93533"/>
>
> I modified by hand to reorder these to this:
>
> <xep:text value="Wrap cell number one" x="72000" y="732891" width="107591"/>
> <xep:text value="with some text in it." x="72000" y="719692" width="93533"/>
> <xep:text value="Wrap cell number one" x="72000" y="706493" width="107591"/>
> <xep:text value="with some text in it." x="72000" y="693294" width="93533"/>
>
> <xep:text value="This is cell" x="183954" y="732891" width="51337"/>
> <xep:text value="number two" x="183954" y="719692" width="57464"/>
> <xep:text value="with other text" x="183954" y="706493" width="68475"/>
> <xep:text value="in it." x="183954" y="693294" width="20174"/>
>
> <xep:text value="And finally cell" x="263583" y="732891" width="70301"/>
> <xep:text value="number three with" x="263583" y="719692" width="88033"/>
> <xep:text value="additional text" x="263583" y="706493" width="67870"/>
> <xep:text value="inside this cell." x="263583" y="693294" width="71522"/>
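
A toy model of why the order of the text objects might matter for
searching. It assumes (this is an assumption, not documented Acrobat
behaviour) a viewer that can only match a phrase across text objects that
are adjacent in the file. The two lists hold the value attributes from
the two listings above; searchable is a hypothetical helper.

```python
def searchable(fragments, phrase):
    """Return True if phrase occurs in the fragments joined in stream order.

    Assumption: a viewer matches across fragments only when they are
    adjacent in the stream. Real viewers may use other heuristics.
    """
    return phrase in " ".join(fragments)


# Values of the xep:text elements in the order XEP emitted them.
emitted_order = [
    "And finally cell", "number three with",
    "This is cell", "number two",
    "Wrap cell number one", "with some text in it.",
    "additional text", "inside this cell.",
    "with other text", "in it.",
    "Wrap cell number one", "with some text in it.",
]

# The same values after the hand reordering (cell by cell).
reordered = [
    "Wrap cell number one", "with some text in it.",
    "Wrap cell number one", "with some text in it.",
    "This is cell", "number two", "with other text", "in it.",
    "And finally cell", "number three with",
    "additional text", "inside this cell.",
]

# A phrase that spans the second and third lines of the third cell.
phrase = "number three with additional text"
found_emitted = searchable(emitted_order, phrase)
found_reordered = searchable(reordered, phrase)
```

Under this model the phrase is only contiguous in the reordered list,
which would be consistent with a two-line phrase being found while a
three-line one is not.
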
>
> Now, this is not that easy (in fact it would not work if I did not set the
> line tightness factor) because in formatting the tables, there would be
> custom word-spacing set to squeeze the lines to fit. It becomes even more
> complex if kerning is on, separating fragments of these words into
> individual parts of text.
>
> The solution is not clear to me. I would suppose it is possible to create
> code to modify tabular content only in the XEPOUT file by examining the x
> and y coordinates and trying to relate them together ... doing it
> generically, so that it would work in all cases (kerning, spanning, etc.),
> escapes me at the moment.
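
A minimal sketch of the coordinate-based reordering described above, for
the simple case only: it sorts xep:text elements by x (column), then by
descending y (top to bottom, as in the sample where the larger y is the
first line). The namespace URI and the xep:wrapper element are
assumptions made so the fragment parses, and column_order is a
hypothetical helper. It ignores kerning, spanning and adjusted
word-spacing, which are exactly the cases that make a generic solution
hard.

```python
import xml.etree.ElementTree as ET

# Assumption: the namespace URI bound to the xep: prefix. Check your own file.
XEP_NS = "http://www.renderx.com/XEP/xep"

# First eight xep:text elements from the sample, in emitted order.
# The xep:wrapper element is hypothetical; it only makes the fragment parse.
SAMPLE = f"""<xep:wrapper xmlns:xep="{XEP_NS}">
<xep:text value="And finally cell" x="263583" y="732891" width="70301"/>
<xep:text value="number three with" x="263583" y="719692" width="88033"/>
<xep:text value="This is cell" x="183954" y="732891" width="51337"/>
<xep:text value="number two" x="183954" y="719692" width="57464"/>
<xep:text value="Wrap cell number one" x="72000" y="732891" width="107591"/>
<xep:text value="with some text in it." x="72000" y="719692" width="93533"/>
<xep:text value="additional text" x="263583" y="706493" width="67870"/>
<xep:text value="inside this cell." x="263583" y="693294" width="71522"/>
</xep:wrapper>"""


def column_order(xml_text):
    """Return text values sorted column by column, top to bottom."""
    root = ET.fromstring(xml_text)
    texts = list(root.iter(f"{{{XEP_NS}}}text"))
    # Smaller x first (left column first); larger y first (upper line first).
    texts.sort(key=lambda el: (int(el.get("x")), -int(el.get("y"))))
    return [el.get("value") for el in texts]


ordered = column_order(SAMPLE)
```

Grouping on exact x values only works here because every line in a cell
starts at the same x; centred or justified cells would need a tolerance
or the cell boundaries themselves.
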
>
> At this time, I have no easy solution for you as this is not how the
> formatting engine works. I suspect in Word it writes these objects into the
> PDF in order and Adobe can interpret them.
>
> Kevin Brown
> RenderX
>
>
>
>
>
>
> -----Original Message-----
> From: xep-support-bounces@renderx.com
> [mailto:xep-support-bounces@renderx.com] On Behalf Of David Clunie
> Sent: Friday, February 21, 2014 5:33 AM
> To: xep-support@renderx.com
> Subject: [xep-support] Re: Selectable/searchable phrases in PDF in table
> column content spanning lines
>
> Sorry, forgot the attached screenshots.
>
> David
>
> On 2/21/14 8:32 AM, David Clunie wrote:
>> Hi
>>
>> In PDF table cells from the DocBook FO stylesheets rendered with xep,
>> if the words in a table cell spread across several lines, then they are
>> not kept together from the perspective of selection or searching in
>> PDF viewers.
>>
>> I have not been able to improve this behavior using any of the FO
>> "keep-together" options (e.g., applied in "table.cell.block.properties"
>> template customizations). I use "keep-together.within-column"
>> successfully to prevent cells breaking across pages, but that does not
>> affect the described problem.
>>
>> And using "keep-together.within-line" doesn't solve the problem either
>> (phrases still get split up), and also causes some tables to overflow
>> the page margins anyway and is hence unusable for this.
>>
>> In short, I am in need of some way of having the content wrap to fit
>> within the page (obviously) but remain together from the PDF encoding
>> perspective such that phrases are searchable, which is critical for
>> our use case (we have a standard with many thousands of pages and need
>> to be able to search for phrases that are (long) names of data
>> elements that are present in tables and wrapped to fit on a page
>> (e.g., as in the screen shots, "Shared Functional Groups Sequence")).
>>
>> Since Word can do it, I know PDF can be encoded this way; the question
>> is how to get xep to do it.
>>
>> It isn't split in the FO (the text is contained in one <fo:block/>).
>>
>> The attached screen shots show selection of a phrase spanning lines
>> wrapped within a cell highlighted when displayed using Acrobat, with
>> "good" output from Word and "bad" output from XEP.
>>
>> The DocBook fragment for this row is:
>>
>> <tr valign="top">
>> <td align="left" colspan="1" rowspan="1">
>> <para>Shared Functional Groups Sequence</para>
>> </td>
>> <td align="center" colspan="1" rowspan="1">
>> <para>(5200,9229)</para>
>> </td>
>> <td align="center" colspan="1" rowspan="1">
>> <para>1</para>
>> </td>
>> <td align="left" colspan="1" rowspan="1">
>> <para>Sequence that contains the Functional Group Macros that are
>> shared for all frames in this SOP Instance and Concatenation.</para>
>> <note>
>> <para>The contents of this sequence are the same in all SOP
>> Instances that comprise a Concatenation.</para>
>> </note>
>> <para>Only a single Item shall be included in this sequence.</para>
>> <para>See <xref linkend="sect_C.7.6.16.1.1" xrefstyle="select:
>> label"/> for further explanation.</para>
>> </td>
>> </tr>
>>
>> The customization used is:
>>
>> <xsl:template name="table.cell.block.properties">
>> <xsl:attribute name="keep-together.within-column">always</xsl:attribute>
>> <xsl:choose>
>> <xsl:when test="ancestor::d:thead or ancestor::d:tfoot">
>> <xsl:attribute name="font-weight">bold</xsl:attribute>
>> </xsl:when>
>> <!-- Make row headers bold too -->
>> <xsl:when test="ancestor::d:tbody and
>> (ancestor::d:table[@rowheader = 'firstcol'] or
>> ancestor::d:informaltable[@rowheader = 'firstcol']) and
>> ancestor-or-self::d:entry[1][count(preceding-sibling::d:entry) = 0]">
>> <xsl:attribute name="font-weight">bold</xsl:attribute>
>> </xsl:when>
>> </xsl:choose>
>> </xsl:template>
>>
>> and the extract of the FO produced by the DocBook FO stylesheets is:
>>
>> <fo:table-row>
>> <fo:table-cell padding-start="2pt" padding-end="2pt"
>> padding-top="2pt" padding-bottom="2pt" text-align="left"
>> display-align="before" border-start-style="none" border-top-style="none"
>> border-bottom-style="solid" border-bottom-width="0.5pt"
>> border-bottom-color="black" border-end-style="solid"
>> border-end-width="0.5pt" border-end-color="black"><fo:block
>> keep-together.within-column="always">
>> <fo:block space-before.optimum="1em"
>> space-before.minimum="0.8em" space-before.maximum="1.2em">Shared
>> Functional Groups Sequence</fo:block>
>> </fo:block></fo:table-cell>
>> <fo:table-cell padding-start="2pt" padding-end="2pt"
>> padding-top="2pt" padding-bottom="2pt" text-align="center"
>> display-align="before" border-start-style="none" border-top-style="none"
>> border-bottom-style="solid" border-bottom-width="0.5pt"
>> border-bottom-color="black" border-end-style="solid"
>> border-end-width="0.5pt" border-end-color="black"><fo:block
>> keep-together.within-column="always">
>> <fo:block space-before.optimum="1em"
>> space-before.minimum="0.8em"
>> space-before.maximum="1.2em">(5200,9229)</fo:block>
>> </fo:block></fo:table-cell>
>> <fo:table-cell padding-start="2pt" padding-end="2pt"
>> padding-top="2pt" padding-bottom="2pt" text-align="center"
>> display-align="before" border-start-style="none" border-top-style="none"
>> border-bottom-style="solid" border-bottom-width="0.5pt"
>> border-bottom-color="black" border-end-style="solid"
>> border-end-width="0.5pt" border-end-color="black"><fo:block
>> keep-together.within-column="always">
>> <fo:block space-before.optimum="1em"
>> space-before.minimum="0.8em" space-before.maximum="1.2em">1</fo:block>
>> </fo:block></fo:table-cell>
>> <fo:table-cell padding-start="2pt" padding-end="2pt"
>> padding-top="2pt" padding-bottom="2pt" text-align="left"
>> display-align="before" border-start-style="none" border-top-style="none"
>> border-bottom-style="solid" border-bottom-width="0.5pt"
>> border-bottom-color="black"><fo:block
>> keep-together.within-column="always">
>> <fo:block space-before.optimum="1em"
>> space-before.minimum="0.8em" space-before.maximum="1.2em">Sequence
>> that contains the Functional Group Macros that are shared for all
>> frames in this SOP Instance and Concatenation.</fo:block>
>> <fo:block id="idp140215214853168"
>> space-before.minimum="0.8em" space-before.optimum="1em"
>> space-before.maximum="1.2em" margin-left="0.25in"
>> margin-right="0.25in"><fo:block keep-with-next.within-column="always"
>> font-size="9pt" font-weight="bold"
>> hyphenate="false">Note</fo:block><fo:block><fo:block
>> space-before.optimum="1em" space-before.minimum="0.8em"
>> space-before.maximum="1.2em">The contents of this sequence are the
>> same in all SOP Instances that comprise a
>> Concatenation.</fo:block></fo:block></fo:block>
>> <fo:block space-before.optimum="1em"
>> space-before.minimum="0.8em" space-before.maximum="1.2em">Only a
>> single Item shall be included in this sequence.</fo:block>
>> <fo:block space-before.optimum="1em"
>> space-before.minimum="0.8em" space-before.maximum="1.2em">See
>> <fo:basic-link
>> internal-destination="sect_C.7.6.16.1.1"><fo:inline>Section
>> C.7.6.16.1.1</fo:inline></fo:basic-link> for further
>> explanation.</fo:block>
>> </fo:block></fo:table-cell>
>> </fo:table-row>
>>
>> Thanks ... David
>
>
>
>


Screen_Shot_2014-02-25_at_2.07.55_PM.png
Received on Tue Feb 25 11:11:55 2014

This archive was generated by hypermail 2.1.8 : Tue Feb 25 2014 - 11:12:02 PST