After getting feedback that my original clipboard code doesn't handle all scenarios, especially with Chrome, I went back to the code to get a better understanding of what's going on and to find the correct way to set plain text and an HTML snippet to the clipboard.
Highlights
- Setting plain text and HTML data.
- Unicode support for plain text.
- Clipboard HTML format
  - Version.
  - Parts start/end offsets.
  - StartFragment/EndFragment comments.
  - <html> and <body> elements.
  - Unicode handling.
TL;DR
Get the code on Gist: ClipboardHelper.cs, or scroll to the bottom of this post.
Setting HTML and Plain text
To set both plain text and rich HTML, create a DataObject instance, set both the plain text and the HTML format data on it, then set the data object to the clipboard. The receiving client reads whichever format suits its capabilities.
var dataObject = new DataObject();
dataObject.SetData(DataFormats.Html, htmlFormat);
dataObject.SetData(DataFormats.Text, plainText);
dataObject.SetData(DataFormats.UnicodeText, plainText);
Clipboard.SetDataObject(dataObject);
Plain text Unicode support
Note that the plain text is set twice, once in the regular text format and once in the Unicode format. Both matter: without the regular format, some older clients that do not handle Unicode will not get any text; without the Unicode format, non-ASCII text will not paste properly, and some clients that expect proper Unicode support ignore the regular format altogether and paste nothing.
HTML format
To set an HTML snippet to the clipboard it must be embedded in the HTML Clipboard Format. The format lets you surround the HTML snippet with context: additional styling elements that apply to the snippet but should not be pasted. The receiving client is responsible for interpreting them properly.
For example, to copy only the "Hello World אבג" text from the HTML snippet in figure 1, you need to create the HTML Clipboard Format string shown in figure 2.
- Only what is between <!--StartFragment--> and <!--EndFragment--> should be pasted.
- <div style="color: red;"> surrounds the fragment to provide styling (color) context.
- <!DOCTYPE>, <html> and <body> elements are added to context.
- The "Copy me: " text is stripped as it is not part of the fragment.
- "אבג" is the first three letters of the Hebrew alphabet, used as a Unicode example as will be explained shortly; it appears as "×בג" in the format because of the UTF-8 re-encoding.
<div style="color: red"> Copy me: Hello <b>World</b> <i> אבג </i> </div>
Figure 1: Sample HTML snippet.
Version:0.9
StartHTML:000000149
EndHTML:000000329
StartFragment:000000266
EndFragment:000000298
StartSelection:000000266
EndSelection:000000298
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html><div style="color: red;"><!--StartFragment--><b>Hello</b> World <i>×בג</i><!--EndFragment--></div></html>
Figure 2: The HTML clipboard format required to copy the "Hello World אבג" HTML fragment.
Version
The correct format version is 0.9. Unfortunately, there are blog posts [1] and libraries [2] that incorrectly use 1.0 as the clipboard format version. It may not seem very important, but as described in "What is this rogue version 1.0 of the HTML clipboard format?" (see References), using version 1.0 may identify your application as one that creates an invalid clipboard format and may cause the receiving client to interpret it incorrectly.
Start/End offsets headers
There are three pairs of start/end offset headers: HTML, Fragment and Selection.
- HTML – byte count from the beginning of the clipboard format to the start/end of the HTML context that surrounds the HTML snippet.
- Fragment – byte count from the beginning of the clipboard format to the start/end of the HTML snippet that is copied.
- Selection – optional; it can carry additional information about the selected portion of the copied snippet. I set it to the fragment values just in case some clients incorrectly require it.
Note: I zero-pad the offsets to keep the header size constant, which is convenient when generating the format string.
Note: All the offsets are byte counts, not character counts; see the "Unicode handling" section.
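To make the offset arithmetic concrete, here is a minimal sketch of how the headers can be computed. It is a simplified illustration, not the code from my helper: the class name CfHtmlOffsetsSketch and the method Build are made up for this example, the optional Selection headers are left out, and StartHTML is assumed to point right after the header.

```csharp
using System.Globalization;
using System.Text;

// Hypothetical helper, for illustration only.
public static class CfHtmlOffsetsSketch
{
    // {0}..{3} are filled with 9-digit, zero-padded byte offsets.
    private const string HeaderTemplate =
        "Version:0.9\r\n" +
        "StartHTML:{0:D9}\r\n" +
        "EndHTML:{1:D9}\r\n" +
        "StartFragment:{2:D9}\r\n" +
        "EndFragment:{3:D9}\r\n";

    private const string StartMarker = "<!--StartFragment-->";
    private const string EndMarker = "<!--EndFragment-->";

    // prefix/suffix are the context around the fragment, e.g. "<html><body>" and "</body></html>".
    public static string Build(string prefix, string fragment, string suffix)
    {
        // Zero padding keeps the header length constant, so it can be
        // measured with dummy values before the real offsets are known.
        string dummy = string.Format(CultureInfo.InvariantCulture, HeaderTemplate, 0, 0, 0, 0);
        int startHtml = Encoding.UTF8.GetByteCount(dummy);

        // Every offset is a UTF-8 byte count from the start of the whole string.
        int startFragment = startHtml + Encoding.UTF8.GetByteCount(prefix + StartMarker);
        int endFragment = startFragment + Encoding.UTF8.GetByteCount(fragment);
        int endHtml = endFragment + Encoding.UTF8.GetByteCount(EndMarker + suffix);

        string header = string.Format(CultureInfo.InvariantCulture, HeaderTemplate,
            startHtml, endHtml, startFragment, endFragment);
        return header + prefix + StartMarker + fragment + EndMarker + suffix;
    }
}
```

Calling CfHtmlOffsetsSketch.Build("<html><body>", "<i>אבג</i>", "</body></html>") yields a fragment that is 10 characters long but spans 13 bytes, which is why the offsets have to be computed from the UTF-8 byte counts.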
StartFragment/EndFragment and HTML context
The StartFragment/EndFragment comments mark the start and end of the actual fragment to paste. Everything inside the fragment is pasted as-is, including all the HTML elements present. The HTML surrounding the fragment is the context and can provide the styling for the fragment (useful if the copied snippet was part of a larger HTML document); it is the receiving client's responsibility to parse the HTML in the context and apply its styling properly to the pasted text.
Note: You can also get the fragment substring using the StartFragment/EndFragment offsets in the header.
Note: Not all clients handle the context correctly; Chrome, for example, ignores styles set in the context.
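As a rough sketch of the first note, this is how a receiving client could cut the fragment out of the raw clipboard bytes using only the header offsets. The class CfHtmlFragmentReaderSketch and its methods are hypothetical names for this example, and it assumes the payload is available as a UTF-8 byte array with a well-formed header.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical reader, for illustration only.
public static class CfHtmlFragmentReaderSketch
{
    // Reads the numeric value of a header line such as "StartFragment:000000266".
    private static int ReadOffset(string text, string name)
    {
        int idx = text.IndexOf(name + ":", StringComparison.Ordinal);
        if (idx < 0) throw new FormatException("Missing header: " + name);

        int start = idx + name.Length + 1;
        int end = start;
        while (end < text.Length && text[end] >= '0' && text[end] <= '9') end++;

        return int.Parse(text.Substring(start, end - start), CultureInfo.InvariantCulture);
    }

    public static string ExtractFragment(byte[] cfHtmlUtf8)
    {
        // The header is pure ASCII, so decoding the payload to find it is safe.
        string text = Encoding.UTF8.GetString(cfHtmlUtf8);
        int start = ReadOffset(text, "StartFragment");
        int end = ReadOffset(text, "EndFragment");

        // The offsets are byte positions, so slice the bytes, not the decoded string.
        return Encoding.UTF8.GetString(cfHtmlUtf8, start, end - start);
    }
}
```

Slicing the decoded string with these offsets would return the wrong text as soon as anything before the end of the fragment contains a multi-byte character.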
<html> and <body> elements
Although not explicitly required by the format, the context must include <html> and <body> elements to be properly interpreted by some receiving clients. Providing multiple <html> or <body> elements, or placing them inside the fragment, may also cause issues for some clients. Therefore, in my code, I parse the snippet and either add the elements if they are missing or insert the StartFragment/EndFragment comments inside them, so the elements are always in the context and never inside the fragment.
Note: Chrome, for example, will fail to parse the format if the <html> element is missing.
Unicode handling
The only character set supported by the HTML clipboard format is Unicode, encoded as UTF-8 [3].
The format header uses only ASCII characters, so it needs no special handling, but the text of the context (starting at StartHTML) can contain any characters, including ones that take more than one byte in UTF-8. This has two consequences:
- As mentioned earlier, the start/end offsets in the header are byte counts, so they can be higher than the character counts of the text. Invalid offsets can cause clients to trim the end of the HTML snippet (a common mistake is to use string.Length, which returns the character count and not the byte count). In my example each letter of "אבג" takes 2 bytes, therefore EndFragment-StartFragment=32 and not 29.
- When the format string is set on the clipboard object, .NET uses its default string encoding (UTF-16), so non-ASCII characters are encoded incorrectly, resulting in '?' characters appearing in the receiving client. To fix it you can either re-encode the string as UTF-8 (hence the "×בג" string) or set the data to the clipboard using a UTF-8 encoded stream (Note: this was fixed in .NET 4.0).
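Both consequences can be seen in a small sketch. The class Utf8CountSketch and its method names are made up for this example, and it uses ISO-8859-1 as a stand-in for the system default code page (my helper uses Encoding.Default, whose result depends on the machine).

```csharp
using System.Text;

// Hypothetical helper, for illustration only.
public static class Utf8CountSketch
{
    // What string.Length reports: UTF-16 code units, not bytes.
    public static int CharCount(string s)
    {
        return s.Length;
    }

    // What the offset headers need: the size of the text once encoded as UTF-8.
    public static int ByteCount(string s)
    {
        return Encoding.UTF8.GetByteCount(s);
    }

    // The pre-.NET 4.0 workaround in spirit: encode to UTF-8, then decode the bytes
    // with a single-byte code page so each UTF-8 byte becomes exactly one char.
    // ISO-8859-1 is an assumption here, chosen because it maps all 256 byte values.
    public static string ReencodeAsSingleByteChars(string s)
    {
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");
        return latin1.GetString(Encoding.UTF8.GetBytes(s));
    }
}
```

For the fragment from figure 2, Utf8CountSketch.CharCount("<b>Hello</b> World <i>אבג</i>") is 29 while ByteCount is 32, matching the EndFragment-StartFragment difference. The re-encoded form of "אבג" has 6 chars, one per UTF-8 byte, and starts with '×', which is where the odd-looking text in the format comes from.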
Using the code
ClipboardHelper.CopyToClipboard(myHtmlSnippet, MyText);
References
- What is this rogue version 1.0 of the HTML clipboard format?
- Copying an HTML-fragment to the Clipboard.
- The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets.
- Setting HTML and plain text formatting to clipboard.
- Copying HTML on the clipboard.
- ClipboardHelper.cs
Code
using System;
using System.Text;
// Clipboard, DataObject and DataFormats come from the UI framework in use,
// e.g. System.Windows.Forms (WinForms) or System.Windows (WPF).

/// <summary>
/// Helper to encode and set HTML fragment to clipboard.<br/>
/// See <br/>
/// <seealso cref="CreateDataObject"/>.
/// </summary>
/// <remarks>
/// The MIT License (MIT) Copyright (c) 2014 Arthur Teplitzki.
/// </remarks>
public static class ClipboardHelper
{
    #region Fields and Consts

    /// <summary>
    /// The string contains index references to other spots in the string, so we need placeholders so we can compute the offsets. <br/>
    /// The <![CDATA[<<<<<<<]]>_ strings are just placeholders. We'll back-patch them with actual values afterwards. <br/>
    /// The string layout (<![CDATA[<<<]]>) also ensures that it can't appear in the body of the html because the <![CDATA[<]]> <br/>
    /// character must be escaped. <br/>
    /// Note: the lines must not have trailing spaces, some clients fail to parse the header otherwise. <br/>
    /// </summary>
    private const string Header = @"Version:0.9
StartHTML:<<<<<<<<1
EndHTML:<<<<<<<<2
StartFragment:<<<<<<<<3
EndFragment:<<<<<<<<4
StartSelection:<<<<<<<<3
EndSelection:<<<<<<<<4";

    /// <summary>
    /// html comment to point the beginning of html fragment
    /// </summary>
    public const string StartFragment = "<!--StartFragment-->";

    /// <summary>
    /// html comment to point the end of html fragment
    /// </summary>
    public const string EndFragment = @"<!--EndFragment-->";

    #endregion

    /// <summary>
    /// Create <see cref="DataObject"/> with given html and plain-text ready to be used for clipboard or drag and drop.<br/>
    /// Handle missing <![CDATA[<html>]]> tags, specified start/end segments and Unicode characters.
    /// </summary>
    /// <remarks>
    /// <para>
    /// Windows Clipboard works with UTF-8 Unicode encoding while .NET strings use UTF-16, so for the clipboard to correctly
    /// decode a Unicode string added to it from .NET, the string needs to be re-encoded using UTF-8 encoding.
    /// </para>
    /// <para>
    /// Builds the CF_HTML header correctly for all possible HTMLs<br/>
    /// If given html contains start/end fragments then it will use them in the header:
    /// <code><![CDATA[<html><body><!--StartFragment-->hello <b>world</b><!--EndFragment--></body></html>]]></code>
    /// If given html contains html/body tags then it will inject start/end fragments to exclude html/body tags:
    /// <code><![CDATA[<html><body>hello <b>world</b></body></html>]]></code>
    /// If given html doesn't contain html/body tags then it will inject the tags and start/end fragments properly:
    /// <code><![CDATA[hello <b>world</b>]]></code>
    /// In all cases creating a proper CF_HTML header (offsets shown for CRLF line endings):<br/>
    /// <code>
    /// <![CDATA[
    /// Version:0.9
    /// StartHTML:000000149
    /// EndHTML:000000297
    /// StartFragment:000000247
    /// EndFragment:000000265
    /// StartSelection:000000247
    /// EndSelection:000000265
    /// <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
    /// <html><body><!--StartFragment-->hello <b>world</b><!--EndFragment--></body></html>
    /// ]]>
    /// </code>
    /// See format specification here: http://msdn.microsoft.com/library/default.asp?url=/workshop/networking/clipboard/htmlclipboard.asp
    /// </para>
    /// </remarks>
    /// <param name="html">a html fragment</param>
    /// <param name="plainText">the plain text</param>
    public static DataObject CreateDataObject(string html, string plainText)
    {
        html = html ?? String.Empty;
        var htmlFragment = GetHtmlDataString(html);

        // re-encode the string so it will work correctly (fixed in CLR 4.0)
        if (Environment.Version.Major < 4 && html.Length != Encoding.UTF8.GetByteCount(html))
            htmlFragment = Encoding.Default.GetString(Encoding.UTF8.GetBytes(htmlFragment));

        var dataObject = new DataObject();
        dataObject.SetData(DataFormats.Html, htmlFragment);
        dataObject.SetData(DataFormats.Text, plainText);
        dataObject.SetData(DataFormats.UnicodeText, plainText);
        return dataObject;
    }

    /// <summary>
    /// Clears clipboard and sets the given HTML and plain text fragment to the clipboard, providing additional meta-information for HTML.<br/>
    /// See <see cref="CreateDataObject"/> for HTML fragment details.<br/>
    /// </summary>
    /// <example>
    /// ClipboardHelper.CopyToClipboard("Hello <b>World</b>", "Hello World");
    /// </example>
    /// <param name="html">a html fragment</param>
    /// <param name="plainText">the plain text</param>
    public static void CopyToClipboard(string html, string plainText)
    {
        var dataObject = CreateDataObject(html, plainText);
        Clipboard.SetDataObject(dataObject, true);
    }

    /// <summary>
    /// Generate HTML fragment data string with header that is required for the clipboard.
    /// </summary>
    /// <param name="html">the html to generate for</param>
    /// <returns>the resulted string</returns>
    private static string GetHtmlDataString(string html)
    {
        var sb = new StringBuilder();
        sb.AppendLine(Header);
        sb.AppendLine(@"<!DOCTYPE HTML PUBLIC ""-//W3C//DTD HTML 4.0 Transitional//EN"">");

        // if given html already provided the fragments we won't add them
        int fragmentStart, fragmentEnd;
        int fragmentStartIdx = html.IndexOf(StartFragment, StringComparison.OrdinalIgnoreCase);
        int fragmentEndIdx = html.LastIndexOf(EndFragment, StringComparison.OrdinalIgnoreCase);

        // if html tag is missing add it surrounding the given html (critical)
        int htmlOpenIdx = html.IndexOf("<html", StringComparison.OrdinalIgnoreCase);
        int htmlOpenEndIdx = htmlOpenIdx > -1 ? html.IndexOf('>', htmlOpenIdx) + 1 : -1;
        int htmlCloseIdx = html.LastIndexOf("</html", StringComparison.OrdinalIgnoreCase);

        if (fragmentStartIdx < 0 && fragmentEndIdx < 0)
        {
            int bodyOpenIdx = html.IndexOf("<body", StringComparison.OrdinalIgnoreCase);
            int bodyOpenEndIdx = bodyOpenIdx > -1 ? html.IndexOf('>', bodyOpenIdx) + 1 : -1;

            if (htmlOpenEndIdx < 0 && bodyOpenEndIdx < 0)
            {
                // the given html doesn't contain html or body tags so we need to add them and place start/end fragments around the given html only
                sb.Append("<html><body>");
                sb.Append(StartFragment);
                fragmentStart = GetByteCount(sb);
                sb.Append(html);
                fragmentEnd = GetByteCount(sb);
                sb.Append(EndFragment);
                sb.Append("</body></html>");
            }
            else
            {
                // insert start/end fragments in the proper place (related to html/body tags if exists) so the paste will work correctly
                int bodyCloseIdx = html.LastIndexOf("</body", StringComparison.OrdinalIgnoreCase);

                if (htmlOpenEndIdx < 0)
                    sb.Append("<html>");
                else
                    sb.Append(html, 0, htmlOpenEndIdx);

                if (bodyOpenEndIdx > -1)
                    sb.Append(html, htmlOpenEndIdx > -1 ? htmlOpenEndIdx : 0, bodyOpenEndIdx - (htmlOpenEndIdx > -1 ? htmlOpenEndIdx : 0));

                sb.Append(StartFragment);
                fragmentStart = GetByteCount(sb);

                var innerHtmlStart = bodyOpenEndIdx > -1 ? bodyOpenEndIdx : (htmlOpenEndIdx > -1 ? htmlOpenEndIdx : 0);
                var innerHtmlEnd = bodyCloseIdx > -1 ? bodyCloseIdx : (htmlCloseIdx > -1 ? htmlCloseIdx : html.Length);
                sb.Append(html, innerHtmlStart, innerHtmlEnd - innerHtmlStart);

                fragmentEnd = GetByteCount(sb);
                sb.Append(EndFragment);

                if (innerHtmlEnd < html.Length)
                    sb.Append(html, innerHtmlEnd, html.Length - innerHtmlEnd);

                if (htmlCloseIdx < 0)
                    sb.Append("</html>");
            }
        }
        else
        {
            // handle html with existing start/end fragments, just need to calculate the correct bytes offset (surround with html tag if missing)
            if (htmlOpenEndIdx < 0)
                sb.Append("<html>");

            // everything appended so far is ASCII, so this byte count is also a valid char index into sb
            int start = GetByteCount(sb);
            sb.Append(html);
            fragmentStart = start + GetByteCount(sb, start, start + fragmentStartIdx) + StartFragment.Length;
            fragmentEnd = start + GetByteCount(sb, start, start + fragmentEndIdx);

            if (htmlCloseIdx < 0)
                sb.Append("</html>");
        }

        // Back-patch offsets (scan only the header part for performance)
        sb.Replace("<<<<<<<<1", Header.Length.ToString("D9"), 0, Header.Length);
        sb.Replace("<<<<<<<<2", GetByteCount(sb).ToString("D9"), 0, Header.Length);
        sb.Replace("<<<<<<<<3", fragmentStart.ToString("D9"), 0, Header.Length);
        sb.Replace("<<<<<<<<4", fragmentEnd.ToString("D9"), 0, Header.Length);

        return sb.ToString();
    }

    /// <summary>
    /// Calculates the number of bytes produced by encoding the string in the string builder in UTF-8 and not .NET default string encoding.
    /// </summary>
    /// <param name="sb">the string builder to count its string</param>
    /// <param name="start">optional: the start index to calculate from (default - start of string)</param>
    /// <param name="end">optional: the end index to calculate to (default - end of string)</param>
    /// <returns>the number of bytes required to encode the string in UTF-8</returns>
    private static int GetByteCount(StringBuilder sb, int start = 0, int end = -1)
    {
        end = end > -1 ? end : sb.Length;

        // Encode the whole range at once: counting char by char would treat each half of a
        // surrogate pair as an invalid character (3 bytes each) instead of one 4-byte sequence.
        return Encoding.UTF8.GetByteCount(sb.ToString(start, end - start));
    }
}
I have just imported this code to my project (https://github.com/michal-czardybon/herring). It works, but for some reason I can’t paste the html to Word/WordPad. It is strange, because I can paste it to Libre Office or to html editor in my email client. Any idea why Word/WordPad does not accept what is in the clipboard? (It only accepts the plain-text version).
don’t know.. I have tested it with word and it worked fine…
try simplifying the html, maybe there is an issue with some html element or css style…
let me know if you found the solution
I can’t figure out now how I finally got it working… but right now it seems to work well in my software. I have just diffed the code of your class and I see I made no significant modifications. Sorry for disturbing you, apparently without a good reason :-/.
Hello! I have used your class, but I have an encoding problem. The french accents (é è à ç ô…) get transformed to Chinese characters on paste, whether it goes through the conversion mechanism (Encoding.Default.GetString…) or not. I had to remove accents using Encoding.ASCII.GetString(Encoding.GetEncoding(1251).GetBytes(htmlContent)). Any idea ?
I honestly don’t know why it doesn’t work…
Unfortunately I don’t have the capacity to dig into it,
please let me know if you find something.
I could not get this to paste HTML into Evernote until I removed the trailing spaces from your Header constant.
Perfect!!!