[xep-support] Re: Selectable/searchable phrases in PDF in table column content spanning lines

From: Kevin Brown <kevin@renderx.com>
Date: Tue Feb 25 2014 - 11:59:26 PST

David:

Do you have time to discuss this over the phone?
My direct number is (707) 869-8353
I just tested something that works (to my absolute surprise).

Question: What font are you using?

Kevin Brown

-----Original Message-----
From: xep-support-bounces@renderx.com
[mailto:xep-support-bounces@renderx.com] On Behalf Of David Clunie
Sent: Tuesday, February 25, 2014 11:11 AM
To: xep-support@renderx.com
Subject: [xep-support] Re: Selectable/searchable phrases in PDF in table
column content spanning lines

Hi Kevin

Thanks for doing the experiments.

Since you mentioned tagged PDF, I realized that was probably making all the
difference, and that the PDF generated from Word that I had used as a
comparison was probably generated with that option.

So I turned on ENABLE_ACCESSIBILITY in xep.xml, and interestingly that now
allows selection of a phrase in a table cell that spans lines.

But it still does not allow searching on that phrase :(

Indeed, it seems to have very strange effects on searching such phrases that
span lines.

It is almost as though the object in the PDF that is tagged is tagged with a
value that is not the complete phrase (that one can select) but rather just
the first line of that phrase, if that makes sense.

See the attached screenshot for a search for "Shared Functional"
that works and finds what is highlighted (which is more than the search
term).

But if I search for "Shared Functional Groups" it doesn't find it :(

Also, the tags are in a weird order, in that a search for content will find
it in a column on the right before it finds it in a column on the left. But
I can live with that.

David

On 2/24/14 2:42 PM, Kevin Brown wrote:
> David:
>
> I understand the source of this in relation to tables. I am not sure
> anything can be done but I am willing to take a look to see if some
> change can be made by changing the XEP intermediate format.
>
> The issue revolves around the order of objects in the area tree
> (XEPOUT file). A table is rendered approximately right to left, top to
> bottom, and hence all the cells and even parts of cells (lines of text)
> are placed in the XEPOUT for visual appearance. When written to PDF,
> Adobe cannot interpret these separate parts as part of the same thing.
> Note that this is really an Adobe problem. It should be able to, and
> in fact it does in other parts of the product: if you were to turn
> tagged PDF on, generate this PDF, and then use Adobe to read it, it
> reads it properly even if the parts are separated. However, taking
> that same example with tagging on and doing the search gives the same
> results as you saw ... funny that it can "read" across the boundary and
> not "search" across it.
>
> I tested this theory and was able to make a sample (PDF attached) that
> works.
>
> I also had to modify a hidden feature in XEP called the line tightness
> factor. Basically, out of the box XEP would do something like this in
> the area tree. The issue here is that while these separate objects are
> in the PDF, Adobe cannot recognize that adjacent ones are not related.
>
> <xep:text value="And finally cell" x="263583" y="732891" width="70301"/>
> <xep:text value="number three with" x="263583" y="719692" width="88033"/>
> <xep:text value="This is cell" x="183954" y="732891" width="51337"/>
> <xep:text value="number two" x="183954" y="719692" width="57464"/>
> <xep:text value="Wrap cell number one" x="72000" y="732891" width="107591"/>
> <xep:text value="with some text in it." x="72000" y="719692" width="93533"/>
> <xep:text value="additional text" x="263583" y="706493" width="67870"/>
> <xep:text value="inside this cell." x="263583" y="693294" width="71522"/>
> <xep:text value="with other text" x="183954" y="706493" width="68475"/>
> <xep:text value="in it." x="183954" y="693294" width="20174"/>
> <xep:text value="Wrap cell number one" x="72000" y="706493" width="107591"/>
> <xep:text value="with some text in it." x="72000" y="693294" width="93533"/>
>
> I modified by hand to reorder these to this:
>
> <xep:text value="Wrap cell number one" x="72000" y="732891" width="107591"/>
> <xep:text value="with some text in it." x="72000" y="719692" width="93533"/>
> <xep:text value="Wrap cell number one" x="72000" y="706493" width="107591"/>
> <xep:text value="with some text in it." x="72000" y="693294" width="93533"/>
>
> <xep:text value="This is cell" x="183954" y="732891" width="51337"/>
> <xep:text value="number two" x="183954" y="719692" width="57464"/>
> <xep:text value="with other text" x="183954" y="706493" width="68475"/>
> <xep:text value="in it." x="183954" y="693294" width="20174"/>
>
> <xep:text value="And finally cell" x="263583" y="732891" width="70301"/>
> <xep:text value="number three with" x="263583" y="719692" width="88033"/>
> <xep:text value="additional text" x="263583" y="706493" width="67870"/>
> <xep:text value="inside this cell." x="263583" y="693294" width="71522"/>
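
Why the order of these objects matters can be illustrated with a toy model in Python. It assumes that a viewer matches a search phrase against the text fragments joined in the order they appear in the file. That is only an assumption made for illustration, not documented Acrobat behavior, and the helper name `searchable` is hypothetical. The fragment values are the ones from the two listings above.

```python
# Toy model: ASSUMES a search is matched against the fragments joined
# in stream order. This is an illustration, not how any viewer is documented to work.

# Fragment values in the order XEP wrote them out of the box.
ORIGINAL = [
    "And finally cell", "number three with",
    "This is cell", "number two",
    "Wrap cell number one", "with some text in it.",
    "additional text", "inside this cell.",
    "with other text", "in it.",
    "Wrap cell number one", "with some text in it.",
]

# Fragment values after the hand reordering (column by column).
REORDERED = [
    "Wrap cell number one", "with some text in it.",
    "Wrap cell number one", "with some text in it.",
    "This is cell", "number two", "with other text", "in it.",
    "And finally cell", "number three with",
    "additional text", "inside this cell.",
]


def searchable(fragments, phrase):
    """Return True if the phrase occurs in the fragments joined by spaces."""
    return phrase in " ".join(fragments)


phrase = "number three with additional text"
print(searchable(ORIGINAL, phrase))   # phrase is split by other cells' lines
print(searchable(REORDERED, phrase))  # phrase is contiguous
```

Under that assumption, a phrase that wraps across lines is contiguous only in the reordered list, because in the original order the lines of other cells sit between its two halves.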
>
> Now, this is not that easy (in fact it would not work if I did not set
> the line tightness factor) because in formatting the tables, there
> would be custom word-spacing set to squeeze the lines to fit. It
> becomes even more complex if kerning is on, separating fragments of
> these words into individual parts of text.
>
> The solution is not clear to me. I would suppose it is possible to
> create code to modify tabular content only in the XEPOUT file by
> examining the x and y coordinates and trying to relate them together
> ... but doing it generically, so that it would work in all cases
> (kerning, spanning, etc.), escapes me at the moment.
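
A minimal Python sketch of the coordinate-based idea, for the simple case only: group `xep:text` elements by their `x` value and, within each group, sort by `y` from top to bottom. It assumes left-aligned cells (so every line of a cell shares the same `x`), that a larger `y` is higher on the page as in the sample, and that nothing is split by kerning or custom word-spacing. The function name `reorder_text`, the `xep:page` wrapper, and the namespace URI are assumptions made so the fragment can be parsed; this is not RenderX code.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# ASSUMPTION: namespace URI for the xep: prefix; confirm against your XEPOUT file.
XEP_NS = "http://www.renderx.com/XEP/xep"

# Cells two and three from the sample, in the original stream order.
# The xep:page wrapper is only here to make the fragment parseable.
SAMPLE = f"""<xep:page xmlns:xep="{XEP_NS}">
  <xep:text value="And finally cell" x="263583" y="732891" width="70301"/>
  <xep:text value="number three with" x="263583" y="719692" width="88033"/>
  <xep:text value="This is cell" x="183954" y="732891" width="51337"/>
  <xep:text value="number two" x="183954" y="719692" width="57464"/>
  <xep:text value="additional text" x="263583" y="706493" width="67870"/>
  <xep:text value="inside this cell." x="263583" y="693294" width="71522"/>
  <xep:text value="with other text" x="183954" y="706493" width="68475"/>
  <xep:text value="in it." x="183954" y="693294" width="20174"/>
</xep:page>"""


def reorder_text(parent):
    """Return the xep:text children column by column, top line first."""
    columns = defaultdict(list)
    for text in parent.findall(f"{{{XEP_NS}}}text"):
        columns[int(text.get("x"))].append(text)
    ordered = []
    for x in sorted(columns):  # columns left to right
        # Larger y is assumed to be higher on the page, so sort descending.
        ordered.extend(
            sorted(columns[x], key=lambda t: int(t.get("y")), reverse=True)
        )
    return ordered


values = [t.get("value") for t in reorder_text(ET.fromstring(SAMPLE))]
print(values)
```

For this sample the result matches the hand-reordered listing for cells two and three. Centered or justified cells, spanning, and kerned fragments would need extra logic that this sketch does not attempt.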
>
> At this time, I have no easy solution for you as this is not how the
> formatting engine works. I suspect in Word it writes these objects
> into the PDF in order and Adobe can interpret them.
>
> Kevin Brown
> RenderX
>
> -----Original Message-----
> From: xep-support-bounces@renderx.com
> [mailto:xep-support-bounces@renderx.com] On Behalf Of David Clunie
> Sent: Friday, February 21, 2014 5:33 AM
> To: xep-support@renderx.com
> Subject: [xep-support] Re: Selectable/searchable phrases in PDF in
> table column content spanning lines
>
> Sorry, forgot the attached screenshots.
>
> David
>
> On 2/21/14 8:32 AM, David Clunie wrote:
>> Hi
>>
>> In PDF table cells from the DocBook FO stylesheets rendered with xep,
>> if the words in a table cell spread across several lines, then they
>> are not kept together from the perspective of selection or searching
>> in PDF viewers.
>>
>> I have not been able to improve this behavior using any of the FO
>> "keep-together" options (e.g., applied in "table.cell.block.properties"
>> template customizations). I use "keep-together.within-column"
>> successfully to prevent cells breaking across pages, but that does
>> not affect the described problem.
>>
>> And using "keep-together.within-line" doesn't solve the problem
>> either (phrases still get split up), and also causes some tables to
>> overflow the page margins anyway and is hence unusable for this.
>>
>> In short, I am in need of some way of having the content wrap to fit
>> within the page (obviously) but remain together from the PDF encoding
>> perspective such that phrases are searchable, which is critical for
>> our use case: we have a standard with many thousands of pages and
>> need to be able to search for phrases that are (long) names of data
>> elements that are present in tables and wrapped to fit on a page
>> (e.g., as in the screen shots, "Shared Functional Groups Sequence").
>>
>> Since Word can do it, I know PDF can be encoded this way; the
>> question is how to get xep to do it.
>>
>> It isn't split in the FO (the text is contained in one <fo:block/>).
>>
>> The attached screen shots show selection of a phrase spanning lines
>> wrapped within a cell highlighted when displayed using Acrobat, with
>> "good" output from Word and "bad" output from XEP.
>>
>> The DocBook fragment for this row is:
>>
>> <tr valign="top">
>> <td align="left" colspan="1" rowspan="1">
>> <para>Shared Functional Groups Sequence</para>
>> </td>
>> <td align="center" colspan="1" rowspan="1">
>> <para>(5200,9229)</para>
>> </td>
>> <td align="center" colspan="1" rowspan="1">
>> <para>1</para>
>> </td>
>> <td align="left" colspan="1" rowspan="1">
>> <para>Sequence that contains the Functional Group Macros that are
>> shared for all frames in this SOP Instance and Concatenation.</para>
>> <note>
>> <para>The contents of this sequence are the same in all SOP
>> Instances that comprise a Concatenation.</para>
>> </note>
>> <para>Only a single Item shall be included in this sequence.</para>
>> <para>See <xref linkend="sect_C.7.6.16.1.1" xrefstyle="select:
>> label"/> for further explanation.</para>
>> </td>
>> </tr>
>>
>> The customization used is:
>>
>> <xsl:template name="table.cell.block.properties">
>> <xsl:attribute name="keep-together.within-column">always</xsl:attribute>
>> <xsl:choose>
>> <xsl:when test="ancestor::d:thead or ancestor::d:tfoot">
>> <xsl:attribute name="font-weight">bold</xsl:attribute>
>> </xsl:when>
>> <!-- Make row headers bold too -->
>> <xsl:when test="ancestor::d:tbody and
>> (ancestor::d:table[@rowheader = 'firstcol'] or
>> ancestor::d:informaltable[@rowheader = 'firstcol']) and
>> ancestor-or-self::d:entry[1][count(preceding-sibling::d:entry) = 0]">
>> <xsl:attribute name="font-weight">bold</xsl:attribute>
>> </xsl:when>
>> </xsl:choose>
>> </xsl:template>
>>
>> and the extract of the FO produced by the DocBook FO stylesheets is:
>>
>> <fo:table-row>
>> <fo:table-cell padding-start="2pt" padding-end="2pt"
>> padding-top="2pt" padding-bottom="2pt" text-align="left"
>> display-align="before" border-start-style="none" border-top-style="none"
>> border-bottom-style="solid" border-bottom-width="0.5pt"
>> border-bottom-color="black" border-end-style="solid"
>> border-end-width="0.5pt" border-end-color="black"><fo:block
>> keep-together.within-column="always">
>> <fo:block space-before.optimum="1em"
>> space-before.minimum="0.8em" space-before.maximum="1.2em">Shared
>> Functional Groups Sequence</fo:block>
>> </fo:block></fo:table-cell>
>> <fo:table-cell padding-start="2pt" padding-end="2pt"
>> padding-top="2pt" padding-bottom="2pt" text-align="center"
>> display-align="before" border-start-style="none" border-top-style="none"
>> border-bottom-style="solid" border-bottom-width="0.5pt"
>> border-bottom-color="black" border-end-style="solid"
>> border-end-width="0.5pt" border-end-color="black"><fo:block
>> keep-together.within-column="always">
>> <fo:block space-before.optimum="1em"
>> space-before.minimum="0.8em"
>> space-before.maximum="1.2em">(5200,9229)</fo:block>
>> </fo:block></fo:table-cell>
>> <fo:table-cell padding-start="2pt" padding-end="2pt"
>> padding-top="2pt" padding-bottom="2pt" text-align="center"
>> display-align="before" border-start-style="none" border-top-style="none"
>> border-bottom-style="solid" border-bottom-width="0.5pt"
>> border-bottom-color="black" border-end-style="solid"
>> border-end-width="0.5pt" border-end-color="black"><fo:block
>> keep-together.within-column="always">
>> <fo:block space-before.optimum="1em"
>> space-before.minimum="0.8em" space-before.maximum="1.2em">1</fo:block>
>> </fo:block></fo:table-cell>
>> <fo:table-cell padding-start="2pt" padding-end="2pt"
>> padding-top="2pt" padding-bottom="2pt" text-align="left"
>> display-align="before" border-start-style="none" border-top-style="none"
>> border-bottom-style="solid" border-bottom-width="0.5pt"
>> border-bottom-color="black"><fo:block
>> keep-together.within-column="always">
>> <fo:block space-before.optimum="1em"
>> space-before.minimum="0.8em" space-before.maximum="1.2em">Sequence
>> that contains the Functional Group Macros that are shared for all
>> frames in this SOP Instance and Concatenation.</fo:block>
>> <fo:block id="idp140215214853168"
>> space-before.minimum="0.8em" space-before.optimum="1em"
>> space-before.maximum="1.2em" margin-left="0.25in"
>> margin-right="0.25in"><fo:block keep-with-next.within-column="always"
>> font-size="9pt" font-weight="bold"
>> hyphenate="false">Note</fo:block><fo:block><fo:block
>> space-before.optimum="1em" space-before.minimum="0.8em"
>> space-before.maximum="1.2em">The contents of this sequence are the
>> same in all SOP Instances that comprise a
>> Concatenation.</fo:block></fo:block></fo:block>
>> <fo:block space-before.optimum="1em"
>> space-before.minimum="0.8em" space-before.maximum="1.2em">Only a
>> single Item shall be included in this sequence.</fo:block>
>> <fo:block space-before.optimum="1em"
>> space-before.minimum="0.8em" space-before.maximum="1.2em">See
>> <fo:basic-link
>> internal-destination="sect_C.7.6.16.1.1"><fo:inline>Section
>> C.7.6.16.1.1</fo:inline></fo:basic-link> for further
>> explanation.</fo:block>
>> </fo:block></fo:table-cell>
>> </fo:table-row>
>>
>> Thanks ... David
>

_______________________________________________
(*) To unsubscribe, please visit http://lists.renderx.com/mailman/options/xep-support
(*) By using the Service, you expressly agree to these Terms of Service http://www.renderx.com/terms-of-service.html
Received on Tue Feb 25 11:59:32 2014
