Setting HTML/Text to Clipboard revisited

After getting feedback that my original clipboard code doesn't handle all scenarios, especially with Chrome, I went back to the code to get a better understanding of what's going on and to find the correct way to set plain text and an HTML snippet to the clipboard.

 
Highlights

  • Setting plain text and HTML data.
  • Unicode support for plain text.
  • Clipboard HTML format
    • Version.
    • Parts start/end offsets.
    • StartFragment/EndFragment comments.
    • <html> and <body> elements.
    • Unicode handling.

 

TL;DR

Get the code on Gist: ClipboardHelper.cs, or scroll to the bottom of this post.
 

Setting HTML and Plain text

To set both plain text and rich HTML, create a DataObject instance, add the data to it in both the plain-text and HTML formats, then set the data object to the clipboard. The receiving client reads whichever format suits its capabilities.
 

var dataObject = new DataObject();
dataObject.SetData(DataFormats.Html, htmlFormat);
dataObject.SetData(DataFormats.Text, plainText);
dataObject.SetData(DataFormats.UnicodeText, plainText);
Clipboard.SetDataObject(dataObject);

 
Plain text Unicode support
Note that the plain text is set twice, once in the regular text format and once in the Unicode format. Both matter: without the regular format, some older clients that do not handle Unicode get no text at all; without the Unicode format, non-ASCII text will not paste properly, or nothing will paste at all, because some clients expect proper Unicode support and never read the regular format.
 

HTML format

To set an HTML snippet to the clipboard it must be embedded in the HTML Clipboard Format. The format lets you surround the snippet with context: additional styling elements that apply to the snippet but should not be pasted. The receiving client is responsible for interpreting them properly.
 
For example, to copy only the "Hello World אבג" text from the HTML snippet in figure 1, you need to create the HTML Clipboard Format string shown in figure 2.

  • Only what is between <!--StartFragment--> and <!--EndFragment--> should be pasted.
  • <div style="color: red;"> surrounds the fragment to provide styling (color) context.
  • <!DOCTYPE>, <html> and <body> elements are added to the context.
  • The "Copy me: " text is stripped because it is not part of the fragment.
  • "אבג" is the first three letters of the Hebrew alphabet, used as the Unicode example explained below; it appears as "אבגד" in the format.

 

<div style="color: red">
    Copy me: Hello <b>World</b> <i>אבג</i>
</div>

Figure 1: Sample HTML snippet.
 

Version:0.9
StartHTML:000000149
EndHTML:000000329
StartFragment:000000266
EndFragment:000000298
StartSelection:000000266
EndSelection:000000298
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html><div style="color: red;"><!--StartFragment--><b>Hello</b> World <i>אבג</i><!--EndFragment--></div></html>

Figure 2: The HTML clipboard format required to copy the "Hello World אבג" HTML fragment.
 

Version

The correct format version is 0.9. Unfortunately, there are blog posts [1] and libraries [2] that incorrectly use 1.0 as the clipboard format version. It may not seem very important, but as described here, using version 1.0 may identify your application as one that creates an invalid clipboard format and may cause the receiving client to interpret it incorrectly.
 

Start/End offsets headers

There are 3 pairs of start/end offsets headers: HTML, Fragment, Selection.

  • HTML – byte count from the beginning of the clipboard format to the start/end of the HTML context that surrounds the HTML snippet.
  • Fragment – byte count from the beginning of the clipboard format to the start/end of the HTML snippet that is copied.
  • Selection – optional; it can contain additional information on the selected portion of the copied snippet. I add it with the fragment values just in case some clients incorrectly require it.

 
Note: I zero-pad the offsets to keep the header size constant, which is convenient when generating the format string.
Note: All the offsets are byte counts, not character counts; see the "Unicode handling" section.
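The offset rules above can be checked with a small sketch. This is not the helper class from this post; `CfHtmlOffsetsSketch` and `Build` are hypothetical names, and the sketch assumes the HTML passed in already contains the StartFragment/EndFragment comments and that lines are separated by "\r\n". StartHTML follows the convention of figure 2 (the length of the header without its final line break).

```csharp
using System;
using System.Globalization;
using System.Text;

public static class CfHtmlOffsetsSketch
{
    private const string StartMarker = "<!--StartFragment-->";
    private const string EndMarker = "<!--EndFragment-->";

    // Header template: every offset is zero-padded to 9 digits, so the header length never changes.
    private const string Template =
        "Version:0.9\r\n" +
        "StartHTML:{0:D9}\r\n" +
        "EndHTML:{1:D9}\r\n" +
        "StartFragment:{2:D9}\r\n" +
        "EndFragment:{3:D9}\r\n" +
        "StartSelection:{2:D9}\r\n" +
        "EndSelection:{3:D9}";

    private const string Doctype =
        "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\">";

    // Builds the clipboard string for html that already contains both fragment comments.
    public static string Build(string html)
    {
        int startIdx = html.IndexOf(StartMarker, StringComparison.Ordinal);
        int endIdx = html.IndexOf(EndMarker, StringComparison.Ordinal);
        if (startIdx < 0 || endIdx < startIdx)
            throw new ArgumentException("html must contain the StartFragment/EndFragment comments");

        // First pass with zeros: only the (constant) header length matters here.
        string header = string.Format(CultureInfo.InvariantCulture, Template, 0, 0, 0, 0);
        string prefix = header + "\r\n" + Doctype + "\r\n";

        // All offsets are UTF-8 byte counts from the start of the string, not character counts.
        int startHtml = Encoding.UTF8.GetByteCount(header);
        int startFragment = Encoding.UTF8.GetByteCount(
            prefix + html.Substring(0, startIdx + StartMarker.Length));
        int endFragment = Encoding.UTF8.GetByteCount(prefix + html.Substring(0, endIdx));
        int endHtml = Encoding.UTF8.GetByteCount(prefix + html);

        // Second pass: same template, real values.
        header = string.Format(CultureInfo.InvariantCulture, Template,
            startHtml, endHtml, startFragment, endFragment);
        return header + "\r\n" + Doctype + "\r\n" + html;
    }
}
```

Passing the HTML line of figure 2 (with the Hebrew letters written as "\u05D0\u05D1\u05D2") to `Build` should reproduce the header values of figure 2: 149, 329, 266 and 298.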
 

StartFragment/EndFragment and HTML context

The StartFragment/EndFragment comments mark the start and end of the actual fragment to paste. Everything inside the fragment is pasted as-is, including all the HTML elements present. The HTML surrounding the fragment is the context and can provide the styling for the fragment (useful if the copied snippet was part of a larger HTML document). It is the receiving client's responsibility to parse the HTML in the context and apply its styling to the pasted text.
 
Note: You can also get the fragment substring using StartFragment/EndFragment offsets in the header.
Note: Not all clients handle context correctly; Chrome, for example, ignores styles set in the context.
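As a rough illustration of reading the fragment through the header offsets, here is a minimal sketch of what a receiving client might do. `CfHtmlReaderSketch`, `ExtractFragment` and `ReadOffset` are hypothetical names, and the sketch assumes the input is the complete clipboard format string and that its offsets are valid UTF-8 byte offsets.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class CfHtmlReaderSketch
{
    // Returns the fragment by slicing the UTF-8 bytes at StartFragment/EndFragment.
    public static string ExtractFragment(string cfHtml)
    {
        int start = ReadOffset(cfHtml, "StartFragment:");
        int end = ReadOffset(cfHtml, "EndFragment:");

        // Offsets are byte positions, so slice the encoded bytes, not the string.
        byte[] bytes = Encoding.UTF8.GetBytes(cfHtml);
        if (start < 0 || end < start || end > bytes.Length)
            throw new FormatException("Fragment offsets are out of range");

        return Encoding.UTF8.GetString(bytes, start, end - start);
    }

    // Reads the decimal value that follows the given header key (first occurrence).
    private static int ReadOffset(string cfHtml, string key)
    {
        int keyIdx = cfHtml.IndexOf(key, StringComparison.Ordinal);
        if (keyIdx < 0)
            throw new FormatException("Missing header: " + key);

        int valueStart = keyIdx + key.Length;
        int valueEnd = valueStart;
        while (valueEnd < cfHtml.Length && cfHtml[valueEnd] >= '0' && cfHtml[valueEnd] <= '9')
            valueEnd++;

        return int.Parse(cfHtml.Substring(valueStart, valueEnd - valueStart),
            CultureInfo.InvariantCulture);
    }
}
```

The header keys end with a colon while the HTML comments do not, so looking up "StartFragment:" does not match the `<!--StartFragment-->` comment.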
 
<html> and <body> elements
Although not explicitly required by the format, the context must include the <html> and <body> elements to be interpreted properly by some receiving clients. Providing multiple <html> or <body> elements, or making them part of the fragment, may also cause issues for some clients. Therefore my code parses the snippet and either adds the elements if they are missing or inserts the StartFragment/EndFragment comments inside the snippet, so that the elements are always in the context and never inside the fragment.
 
Note: Chrome, for example, will fail to parse the format if the <html> element is missing.
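The simplest of these cases, a bare snippet with neither element, can be sketched as follows. `HtmlContextSketch` and `WrapBareSnippet` are hypothetical names; the sketch deliberately leaves snippets that already carry their own `<html>` or `<body>` untouched, since those need the more careful handling done in the full code at the end of this post.

```csharp
using System;

public static class HtmlContextSketch
{
    private const string StartMarker = "<!--StartFragment-->";
    private const string EndMarker = "<!--EndFragment-->";

    // Wraps a bare snippet so that <html>/<body> end up in the context, outside the fragment.
    public static string WrapBareSnippet(string snippet)
    {
        snippet = snippet ?? string.Empty;

        bool hasHtml = snippet.IndexOf("<html", StringComparison.OrdinalIgnoreCase) >= 0;
        bool hasBody = snippet.IndexOf("<body", StringComparison.OrdinalIgnoreCase) >= 0;

        // Not handled in this sketch: the markers would have to go inside the existing elements.
        if (hasHtml || hasBody)
            return snippet;

        return "<html><body>" + StartMarker + snippet + EndMarker + "</body></html>";
    }
}
```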
 

Unicode handling

The only character set supported by the HTML clipboard format is Unicode in UTF-8 encoding [3].
The format header uses only ASCII characters, so it needs no special handling, but the text of the context (starting at StartHTML) can contain any characters, including ones that take more than one byte in UTF-8. This has two consequences:

  1. As mentioned earlier, the start/end offsets in the header are byte counts, so they can be higher than the character counts of the text. Invalid offsets can cause clients to trim the end of the HTML snippet (a common mistake is to use string.Length, which returns the character count, not the byte count). In my example each letter of "אבג" takes 2 bytes in UTF-8, so EndFragment-StartFragment=32 and not 29.
  2. When the format string is set to the clipboard object, it is encoded using the default .NET string encoding (UTF-16) rather than UTF-8, so non-ASCII characters are encoded incorrectly and show up as '?' characters in the receiving client. To fix it you can either re-encode the string into UTF-8 (hence the way the "אבג" string appears in the format) or use a UTF-8 encoded stream to set the data to the clipboard (Note: this was fixed in .NET 4.0).
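Both points can be illustrated with a short sketch. `Utf8CountSketch` and its methods are hypothetical names. For the re-encoding part, ISO-8859-1 is used here as a stand-in because it maps every byte to exactly one character; the code at the end of this post uses Encoding.Default instead, so treat this only as an illustration of the idea.

```csharp
using System;
using System.Text;

public static class Utf8CountSketch
{
    // Character count: what string.Length reports (UTF-16 code units).
    public static int CharCount(string text)
    {
        return text.Length;
    }

    // Byte count: what the clipboard format offsets must be based on.
    public static int ByteCount(string text)
    {
        return Encoding.UTF8.GetByteCount(text);
    }

    // Re-encodes text so that every UTF-8 byte becomes one character of the result.
    // Assumption: ISO-8859-1 stands in for the single-byte encoding used when the
    // string is later written out byte-for-byte.
    public static string ToByteString(string text)
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        return latin1.GetString(Encoding.UTF8.GetBytes(text));
    }
}
```

For the fragment of figure 2, `<b>Hello</b> World <i>אבג</i>`, `CharCount` gives 29 while `ByteCount` gives 32, and `ToByteString("אבג")` is 6 characters long, one per UTF-8 byte.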

 

Using the code

ClipboardHelper.CopyToClipboard(myHtmlSnippet, MyText);

 

References

 

Code

/// <summary>      
/// Helper to  encode and set HTML fragment to clipboard.<br/>      
/// See <br/>      
/// <seealso  cref="CreateDataObject"/>.      
///  </summary>      
/// <remarks>      
/// The MIT License  (MIT) Copyright (c) 2014 Arthur Teplitzki.      
///  </remarks>      
public static class  ClipboardHelper      
{      
    #region Fields and Consts      

    /// <summary>      
    /// The string contains index references to  other spots in the string, so we need placeholders so we can compute the  offsets. <br/>      
    /// The  <![CDATA[<<<<<<<]]>_ strings are just placeholders.  We'll back-patch them actual values afterwards. <br/>      
    /// The string layout  (<![CDATA[<<<]]>) also ensures that it can't appear in the body  of the html because the <![CDATA[<]]> <br/>      
    /// character must be escaped. <br/>      
    /// </summary>      
    private const string Header = @"Version:0.9
StartHTML:<<<<<<<<1
EndHTML:<<<<<<<<2
StartFragment:<<<<<<<<3
EndFragment:<<<<<<<<4
StartSelection:<<<<<<<<3
EndSelection:<<<<<<<<4";

    /// <summary>      
    /// html comment to point the beginning of  html fragment      
    /// </summary>      
    public const string StartFragment =  "<!--StartFragment-->";      

    /// <summary>      
    /// html comment to point the end of html  fragment      
    /// </summary>      
    public const string EndFragment =  @"<!--EndFragment-->";      

    /// <summary>      
    /// Used to calculate characters byte count  in UTF-8      
    /// </summary>      
    private static readonly char[] _byteCount =  new char[1];      

    #endregion      


    /// <summary>      
    /// Create <see  cref="DataObject"/> with given html and plain-text ready to be  used for clipboard or drag and drop.<br/>      
    /// Handles missing <![CDATA[<html>]]> tags, specified start/end segments and Unicode characters.
    /// </summary>      
    /// <remarks>      
    /// <para>      
    /// Windows Clipboard works with UTF-8 Unicode encoding while .NET strings use UTF-16, so for the clipboard to correctly
    /// decode a Unicode string added to it from .NET, the string needs to be re-encoded using UTF-8 encoding.
    /// </para>      
    /// <para>      
    /// Builds the CF_HTML header correctly for  all possible HTMLs<br/>      
    /// If given html contains start/end  fragments then it will use them in the header:      
    ///  <code><![CDATA[<html><body><!--StartFragment-->hello  <b>world</b><!--EndFragment--></body></html>]]></code>      
    /// If given html contains html/body tags  then it will inject start/end fragments to exclude html/body tags:      
    ///  <code><![CDATA[<html><body>hello  <b>world</b></body></html>]]></code>      
    /// If given html doesn't contain html/body  tags then it will inject the tags and start/end fragments properly:      
    /// <code><![CDATA[hello  <b>world</b>]]></code>      
    /// In all cases creating a proper CF_HTML  header:<br/>      
    /// <code>      
    /// <![CDATA[      
    /// Version:0.9
    /// StartHTML:000000177      
    /// EndHTML:000000329      
    /// StartFragment:000000277      
    /// EndFragment:000000295      
    /// StartSelection:000000277      
    /// EndSelection:000000295
    /// <!DOCTYPE HTML PUBLIC  "-//W3C//DTD HTML 4.0 Transitional//EN">      
    ///  <html><body><!--StartFragment-->hello  <b>world</b><!--EndFragment--></body></html>      
    /// ]]>      
    /// </code>      
    /// See format specification here: http://msdn.microsoft.com/library/default.asp?url=/workshop/networking/clipboard/htmlclipboard.asp
    /// </para>      
    /// </remarks>      
    /// <param name="html">a  html fragment</param>      
    /// <param  name="plainText">the plain text</param>      
    public static DataObject  CreateDataObject(string html, string plainText)      
    {      
        html = html ?? String.Empty;      
        var htmlFragment =  GetHtmlDataString(html);      

        // re-encode the string so it will work  correctly (fixed in CLR 4.0)      
        if (Environment.Version.Major < 4  && html.Length != Encoding.UTF8.GetByteCount(html))      
            htmlFragment =  Encoding.Default.GetString(Encoding.UTF8.GetBytes(htmlFragment));      

        var dataObject = new DataObject();      
        dataObject.SetData(DataFormats.Html,  htmlFragment);      
        dataObject.SetData(DataFormats.Text,  plainText);      
        dataObject.SetData(DataFormats.UnicodeText, plainText);      
        return dataObject;      
    }      

    /// <summary>      
    /// Clears clipboard and sets the given  HTML and plain text fragment to the clipboard, providing additional  meta-information for HTML.<br/>      
    /// See <see  cref="CreateDataObject"/> for HTML fragment details.<br/>      
    /// </summary>      
    /// <example>      
    ///  ClipboardHelper.CopyToClipboard("Hello <b>World</b>",  "Hello World");      
    /// </example>      
    /// <param name="html">a  html fragment</param>      
    /// <param  name="plainText">the plain text</param>      
    public static void CopyToClipboard(string  html, string plainText)      
    {      
        var dataObject = CreateDataObject(html,  plainText);      
        Clipboard.SetDataObject(dataObject,  true);      
    }      

    /// <summary>      
    /// Generate HTML fragment data string with  header that is required for the clipboard.      
    /// </summary>      
    /// <param name="html">the  html to generate for</param>      
    /// <returns>the resulted  string</returns>      
    private static string  GetHtmlDataString(string html)      
    {      
        var sb = new StringBuilder();      
        sb.AppendLine(Header);      
        sb.AppendLine(@"<!DOCTYPE HTML PUBLIC ""-//W3C//DTD HTML 4.0 Transitional//EN"">");

        // if given html already provided the  fragments we won't add them      
        int fragmentStart, fragmentEnd;      
        int fragmentStartIdx =  html.IndexOf(StartFragment, StringComparison.OrdinalIgnoreCase);      
        int fragmentEndIdx =  html.LastIndexOf(EndFragment, StringComparison.OrdinalIgnoreCase);      

        // if html tag is missing add it  surrounding the given html (critical)      
        int htmlOpenIdx =  html.IndexOf("<html", StringComparison.OrdinalIgnoreCase);      
        int htmlOpenEndIdx = htmlOpenIdx >  -1 ? html.IndexOf('>', htmlOpenIdx) + 1 : -1;      
        int htmlCloseIdx =  html.LastIndexOf("</html", StringComparison.OrdinalIgnoreCase);      

        if (fragmentStartIdx < 0 &&  fragmentEndIdx < 0)      
        {      
            int bodyOpenIdx =  html.IndexOf("<body", StringComparison.OrdinalIgnoreCase);      
            int bodyOpenEndIdx = bodyOpenIdx  > -1 ? html.IndexOf('>', bodyOpenIdx) + 1 : -1;      

            if (htmlOpenEndIdx < 0  && bodyOpenEndIdx < 0)      
            {      
                // the given html doesn't  contain html or body tags so we need to add them and place start/end fragments  around the given html only      
                sb.Append("<html><body>");      
                sb.Append(StartFragment);      
                fragmentStart =  GetByteCount(sb);      
                sb.Append(html);      
                fragmentEnd = GetByteCount(sb);      
                sb.Append(EndFragment);      
                sb.Append("</body></html>");      
            }      
            else      
            {      
                // insert start/end fragments  in the proper place (related to html/body tags if exists) so the paste will  work correctly      
                int bodyCloseIdx =  html.LastIndexOf("</body", StringComparison.OrdinalIgnoreCase);      

                if (htmlOpenEndIdx < 0)      
                    sb.Append("<html>");      
                else      
                    sb.Append(html, 0,  htmlOpenEndIdx);      

                if (bodyOpenEndIdx > -1)      
                    sb.Append(html,  htmlOpenEndIdx > -1 ? htmlOpenEndIdx : 0, bodyOpenEndIdx - (htmlOpenEndIdx  > -1 ? htmlOpenEndIdx : 0));      

                sb.Append(StartFragment);      
                fragmentStart =  GetByteCount(sb);      

                var innerHtmlStart =  bodyOpenEndIdx > -1 ? bodyOpenEndIdx : (htmlOpenEndIdx > -1 ?  htmlOpenEndIdx : 0);      
                var innerHtmlEnd = bodyCloseIdx  > -1 ? bodyCloseIdx : (htmlCloseIdx > -1 ? htmlCloseIdx : html.Length);      
                sb.Append(html, innerHtmlStart,  innerHtmlEnd - innerHtmlStart);      

                fragmentEnd = GetByteCount(sb);      
                sb.Append(EndFragment);      

                if (innerHtmlEnd <  html.Length)      
                    sb.Append(html,  innerHtmlEnd, html.Length - innerHtmlEnd);      

                if (htmlCloseIdx < 0)      
                    sb.Append("</html>");      
            }      
        }      
        else      
        {      
            // handle html with existing start/end fragments: just need to calculate the correct byte offsets (surround with html tag if missing)
            if (htmlOpenEndIdx < 0)      
                sb.Append("<html>");      
            int start = GetByteCount(sb);      
            sb.Append(html);      
            fragmentStart = start +  GetByteCount(sb, start, start + fragmentStartIdx) + StartFragment.Length;      
            fragmentEnd = start +  GetByteCount(sb, start, start + fragmentEndIdx);      
            if (htmlCloseIdx < 0)      
                sb.Append("</html>");      
        }      

        // Back-patch offsets (scan only the  header part for performance)      
        sb.Replace("<<<<<<<<1",  Header.Length.ToString("D9"), 0, Header.Length);      
        sb.Replace("<<<<<<<<2",  GetByteCount(sb).ToString("D9"), 0, Header.Length);      
        sb.Replace("<<<<<<<<3",  fragmentStart.ToString("D9"), 0, Header.Length);      
        sb.Replace("<<<<<<<<4",  fragmentEnd.ToString("D9"), 0, Header.Length);      

        return sb.ToString();      
    }      

    /// <summary>      
    /// Calculates the number of bytes produced  by encoding the string in the string builder in UTF-8 and not .NET default  string encoding.      
    /// </summary>      
    /// <param name="sb">the  string builder to count its string</param>      
    /// <param  name="start">optional: the start index to calculate from (default  - start of string)</param>      
    /// <param  name="end">optional: the end index to calculate to (default - end  of string)</param>      
    /// <returns>the number of bytes  required to encode the string in UTF-8</returns>      
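    private static int GetByteCount(StringBuilder sb, int start = 0, int end = -1)
    {
        int count = 0;
        end = end > -1 ? end : sb.Length;
        for (int i = start; i < end; i++)
        {
            // a surrogate pair is one character that takes 4 bytes in UTF-8; counting its two halves separately would give 6
            if (char.IsHighSurrogate(sb[i]) && i + 1 < end && char.IsLowSurrogate(sb[i + 1]))
            {
                count += 4;
                i++;
                continue;
            }
            _byteCount[0] = sb[i];
            count += Encoding.UTF8.GetByteCount(_byteCount);
        }
        return count;
    }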
}      

 

10 comments on “Setting HTML/Text to Clipboard revisited”

  1. […] basically works, I have made a few mistakes in this code, see: Setting HTML/Text to Clipboard revisited for better clipboard […]

  2. I have just imported this code to my project (https://github.com/michal-czardybon/herring). It works, but for some reason I can’t paste the html to Word/WordPad. It is strange, because I can paste it to Libre Office or to html editor in my email client. Any idea why Word/WordPad does not accept what is in the clipboard? (It only accepts the plain-text version).

    • Arthur says:

      don’t know.. I have tested it with word and it worked fine…
      try simplifying the html, maybe there is an issue with some html element or css style…
      let me know if you found the solution

  3. I can’t figure out now how I finally got it working… but right now it seems to work well in my software. I have just diffed the code of your class and I see I did not significant modifications. Sorry for disturbing apparently without a good reason :-/.

  4. cosmoschtroumpf says:

    Hello! I have used your class, but I have an encoding problem. The french accents (é è à ç ô…) get transformed to Chinese characters on paste, whether it goes through the conversion mechanism (Encoding.Default.GetString…) or not. I had to remove accents using Encoding.ASCII.GetString(Encoding.GetEncoding(1251).GetBytes(htmlContent)). Any idea ?

    • Arthur says:

      I honestly don’t know why it doesn’t work…
      Unfortunately I don’t have the capacity to dig into it,
      please let me know if you find something.

  5. […] some nice fellow has created a helper class to handle all that for you. And it’s even available on […]

  6. Luke Breuer says:

    I could not get this to paste HTML into Evernote until I removed the trailing spaces from your Header constant.

  7. king says:

    Perfect!!!
