Transforming InfoPath Rich Text into WordML

NOTE: This post is actually a repeat of one that was authored last June. Unfortunately, that post resides on another blog server that is no longer public. So, in response to a recent question about transforming InfoPath rich text into Word, I have decided to re-post. Amazingly, the content is still valid, even though Microsoft has since released its XSLT Inference Tool.

Many of our business solutions require an XSLT process for transforming InfoPath form data into Microsoft Word 2003 documents. Overall, development of this type of transformation is pretty straightforward. First, the Word document template is built and then saved as an XML file, giving developers access to the underlying WordprocessingML (WordML) content. The WordML header information—which consists of definitions for styles, fonts, lists, and custom document properties—is then incorporated into the XSL stylesheet that is used to process the InfoPath XML content.
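As a rough sketch of that arrangement, the stylesheet below wraps its output in a w:wordDocument element and carries one style definition of the kind that would be copied from the saved Word XML template. The Paragraph style name matches the sample later in this post; the rest of the layout (a single style, a bare apply-templates call in the body) is an assumption for illustration and not the stylesheet from our solutions. Templates for the individual form fields are omitted.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
    xmlns:xhtml="http://www.w3.org/1999/xhtml"
    exclude-result-prefixes="xhtml">

    <xsl:output method="xml" encoding="UTF-8"/>

    <xsl:template match="/">
        <!-- Processing instruction that associates the output file with Word -->
        <xsl:processing-instruction name="mso-application">progid="Word.Document"</xsl:processing-instruction>
        <w:wordDocument>
            <!-- Header information copied from the saved Word XML template
                 (illustrative: a real template defines many more styles) -->
            <w:styles>
                <w:style w:type="paragraph" w:styleId="Paragraph">
                    <w:name w:val="Paragraph"/>
                </w:style>
            </w:styles>
            <w:body>
                <!-- Templates for the InfoPath form fields produce the body content -->
                <xsl:apply-templates/>
            </w:body>
        </w:wordDocument>
    </xsl:template>

</xsl:stylesheet>
```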

The tricky part of the stylesheet is accommodating character formatting that appears in the XML input. Many of the InfoPath forms that we develop contain rich-text fields, which allow users to add XHTML content, as in the following example:

<div xmlns="http://www.w3.org/1999/xhtml">The <strong>brown</strong> fox jumped over the fence.</div>

Any inline formatting, like the strong element above, creates a challenge in processing paragraph-level elements such as div and p. In WordML, the concept of “runs” is used within paragraph elements. A run defines the formatting properties for a particular string of text. Every time the character formatting within a paragraph changes, a new run is created. So, for the XHTML sample shown above, three separate runs (represented as sibling w:r elements) would appear in the WordML:

<w:p>
    <w:pPr>
        <w:pStyle w:val="Paragraph" />
    </w:pPr>
    <w:r>
        <w:t>The </w:t>
    </w:r>
    <w:r>
        <w:rPr>
            <w:b />
        </w:rPr>
        <w:t>brown</w:t>
    </w:r>
    <w:r>
        <w:t> fox jumped over the fence.</w:t>
    </w:r>
</w:p>

In order to create the separate runs, each character-formatting template in the XSL file is updated with unmatched w:t and w:r start and end tags. These tags close the run for the text prior to the character formatting, create a run for the formatted text, and then open a run for the text that follows it. Because an XSL file must be well-formed XML, a start tag cannot appear without a matching end tag (or vice versa), so the delimiter characters for these tags are escaped and written out with disable-output-escaping. To illustrate, the following template shows how inline bold text is processed:

<xsl:template match="xhtml:b | xhtml:strong">
    <!-- Close the run that holds the text before the bold element -->
    <xsl:text disable-output-escaping="yes">&lt;/w:t&gt;&lt;/w:r&gt;</xsl:text>
    <!-- Run for the bold text itself -->
    <w:r>
        <w:rPr>
            <w:b/>
            <xsl:call-template name="output-character-formatting"/>
        </w:rPr>
        <w:t><xsl:apply-templates/></w:t>
    </w:r>
    <!-- Reopen a run for the text after the bold element -->
    <xsl:text disable-output-escaping="yes">&lt;w:r&gt;</xsl:text>
    <w:rPr>
        <xsl:call-template name="output-character-formatting"/>
    </w:rPr>
    <xsl:text disable-output-escaping="yes">&lt;w:t&gt;</xsl:text>
</xsl:template>

The output-character-formatting named template handles combinations of character-formatting properties (e.g., bold-italic text). It checks the ancestors of the current element and outputs the corresponding WordML formatting properties.

<xsl:template name="output-character-formatting">
    <xsl:if test="ancestor::xhtml:i or ancestor::xhtml:em">
        <w:i/>
    </xsl:if>
    <xsl:if test="ancestor::xhtml:b or ancestor::xhtml:strong">
        <w:b/>
    </xsl:if>
    <xsl:if test="ancestor::xhtml:u">
        <w:u w:val="single"/>
    </xsl:if>
    <xsl:if test="ancestor::xhtml:strike">
        <w:strike/>
    </xsl:if>
    <xsl:if test="ancestor::xhtml:sup">
        <w:vertAlign w:val="superscript"/>
    </xsl:if>
    <xsl:if test="ancestor::xhtml:sub">
        <w:vertAlign w:val="subscript"/>
    </xsl:if>
</xsl:template>
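
The inline templates above close a run that is already open, so they rely on the paragraph-level template to start one. A minimal sketch of such a template follows. It assumes that each matched div or p holds only text and inline formatting; the match pattern and the Paragraph style name are illustrative and are not taken from our production stylesheet.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
    xmlns:xhtml="http://www.w3.org/1999/xhtml"
    exclude-result-prefixes="xhtml">

    <!-- Assumption: only div and p elements without nested div or p children
         become paragraphs, so that w:p elements are never nested -->
    <xsl:template match="xhtml:p | xhtml:div[not(xhtml:div or xhtml:p)]">
        <w:p>
            <w:pPr>
                <w:pStyle w:val="Paragraph"/>
            </w:pPr>
            <!-- Open the run that the inline templates close and reopen -->
            <w:r>
                <w:t><xsl:apply-templates/></w:t>
            </w:r>
        </w:p>
    </xsl:template>

    <!-- The character-formatting templates shown above go here -->

</xsl:stylesheet>
```

With this arrangement, the run reopened at the end of an inline template is closed by the literal end tags in the paragraph template. Note that disable-output-escaping only takes effect when the XSLT processor serializes the result itself, so the transform has to write to a file or stream rather than to an in-memory node tree.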

16 thoughts on “Transforming InfoPath Rich Text into WordML”

  1. You will need templates for the corresponding XHTML table elements. The <table> element is transformed into a <w:tbl> element, the <tr> elements are transformed into <w:tr> elements, and the <td> elements are transformed into <w:tc> elements.

  2. Hi David,

    Thanks for posting this article. I am still trying to get it to work with mine. I do have one other question, though: if a user enters line breaks in the rich text box, how do you transform that? Sorry, I’m a bit new to all this and just trying to catch up right now. Let’s assume the following text is what the user enters:

    Line 1
    Line 2
    Line 3

    How do you make it so that when it’s transformed to Word it still displays like that?

    Thanks
    Howard

  3. Howard, each line in a rich text box will appear as an XHTML <div> element in the InfoPath form’s underlying XML. So, you will need a template in your XSL file that transforms an xhtml:div into a w:p. You will also need processing in your existing template for the rich text field that checks to see if there are nested <div> or other XHTML elements. If these nested elements exist, your rich text field template should not output a w:p, because you do not want recursive w:p elements.

  4. Hi, Sreedhar.

    The short answer to your question is "Yes, I do know how to transform WordML into InfoPath rich text." Content from an InfoPath rich text box control conforms to the W3C XHTML standard. Thus, an .xsl file could be used to help transform WordML content into XHTML content.

    Please excuse me for my brevity, but a longer answer would require an understanding of your specific requirements.

    Regards,
    David

  5. Thanks for your response. I have the following requirements.

    I have several word documents that have summary and detail sections (They might have other sections too, let’s assume there are two sections for simplicity). I have XML tagged these sections and saved them in Word XML format. I have an InfoPath form that has a main view which has links (or buttons) to other views. The other views have Rich Text Boxes that read and display the word content.
    I was able to bind the Word data to the rich text box, but the data is not shown in its original format. Here I am not able to figure out how to apply the XSL. I am basically trying to use InfoPath for presentation.

    I would appreciate any help.

    Thanks,
    Sreedhar

  6. Hi, Sreedhar.

    Given your requirements, I think my original suggestion will still work. You can apply an .xsl file that transforms each WordML section into XHTML, which is the format (by default) for InfoPath rich text boxes. The output from each transform would then need to be added to the node that is bound to the corresponding rich text box.

    I hope that makes sense. If not, perhaps we should chat further outside of this thread.

    Regards,
    David

  7. Hmm… Tried your example as is, but it appears to eat away my spaces after the b (or strong). Tried to use xml:space="preserve"…. but no luck. Any ideas?

  8. Hi, Kirti.

    If you are having issues with whitespace in your transform, you could try using the top-level preserve-space element in your .xsl file. This element allows you to identify a list of elements in your input file for which you want to preserve any spaces.

    Hope this helps…

    Regards,
    David

  9. Hi,

    I have a requirement where a user can paste rich text from Word and later generate a Word document on the fly. I use a Word template and fill the template using data from a database.

    So I am planning to use a rich text box that will store the formatted text as HTML. Now the problem is how to convert the HTML to WordML. Since I paste the text from the database directly into the WordML seed document, it is filled in as plain text. So when the document is generated, I see HTML tags in the document.

    Following link shows how the seed document is created. http://msdn2.microsoft.com/en-us/library/aa212889(office.11).aspx

    The application is in ASP.NET 2.0 and the language is C#.

    Is there any good third-party control or freeware that gives me the option to store both HTML and WordML?

    Regards

    Mac

  10. David,

    Thanks for your response. I have read Oleg Tkachenko’s blog. The tool converts WordML to HTML. I want to go from HTML to WordML.

    Any example or xslt will be of great help.

    mac

  11. I have the same problem as David a year later. I have rich text which I need to convert to WordML. The problem I’m having is that the text is copied and pasted from other documents and some of the tags don’t seem to be valid. Is there a way I can validate all the tags before converting to WordML?

  12. Hi, Mcguire.

    Your best bet is to copy the rich text into a Word document and then use the WordML that gets created from that paste operation in your XSL stylesheet.

    Regards,
    David
