Searching and Modifying Output In Tridion Template Building Blocks With HtmlAgilityPack

In your Tridion career, you’ve probably written countless of C# Template Building Blocks. You’ve probably rolled your own custom Link Resolver, your own “Add or Extract Binaries From…”, your own cleanup templates… dozens of Building Blocks where the goal was to search for specific html patterns and attributes or modify the output in some way. And, if you’re like me, you’ve probably had to write numerous regex expressions to accomplish your tasks. I’ll admit that I’m no regex ninja or anything, its usually through trial and error that I get my regular expressions working correctly. Recently however, I attempted to write an expression that handled nested elements that could go any number of levels deep, with possibility to have several different variations of the pattern that I was looking for. My regex skills was just not good enough, and I thought for sure there must be a better way to parse the html output string of these patterns.

The search was short, but I found exactly what I was looking for: Html Agility Pack

Using HtmlAgilityPack in your Tridion Template Building Block is simple… just reference the DLL, and make sure you install the DLL into the GAC on the CMS and Publishing servers.

Now that we are set up and ready to run, let’s go ahead and write some code that will take the output from the package, and load it into an HtmlDocument. ¬†Your code should look something like the following:

using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using Tridion.ContentManager.Templating;
using Tridion.ContentManager.Templating.Assembly;
using HtmlAgilityPack;

namespace CodedWeapon.Samples
{
    [TcmTemplateTitle("HtmlAgilityPack Tester")]
    public class HtmlAgilityPackTester : ITemplate
    {
        private TemplatingLogger _logger = null;

	protected TemplatingLogger Logger
	{
		get
		{
			if (_logger == null) 
                            _logger = TemplatingLogger.GetLogger(this.GetType());

			return _logger;
		}
	}

        public override void Transform(Engine engine, Package package)
        {
            Item outputItem = package.GetByName(Package.OutputName);
            string outputString = outputItem.GetAsString();

            HtmlDocument doc = new HtmlDocument();
            doc.LoadHtml(outputString);
        }
    }
}

If you’re familiar with XmlDocument, then HtmlDocument should not look so strange to you. All we are doing in the above is grabbing the output, and passing the output (as a string) into our HtmlDocument instance. Simple enough right? And now for the fun… seeing how simple it is to use this library to get the exact elements that we are looking for. Since a lot of the time we work with links, I’ll show some examples of grabbing some anchor tags.

Grabbing Every Anchor

Lets say we want to grab every anchor element present on the output:

HtmlNodeCollection nodes = doc.DocumentNode.Descendants("a");

You can loop over each of the nodes and perform any necessary operations that you want:

foreach (var node in nodes)
{
    Logger.Debug("Found node with url: " + anchor.Attributes["href"].Value); 
}

Just like with XmlDocument, you can also use XPath:

HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//a");

Querying for Specific Elements

You can use XPath to search for the specific elements that you are looking for. For example, if we wanted to match all links with a specific url:

HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//a[@href=\"http://www.example.com\"]");

Or what if we wanted to grab every link that actually contains a title attribute:

HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//a[@title]");

Or better yet… what if we want check any node that does not contain the title attribute at all?

HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//a[not(@title)]")

If you’re not a fan of XPath, then you’ll be happy to know that you can also use LINQ like in this following example where we are looking for any link that has an href attribute of “#”:

var nodes = doc.DocumentNode.Descendants()
    .Where(n => n.Attributes["href"] != null && n.Attributes["href"].Value.Equals("#"));

Modifying Output

Normally when we’re searching for elements it’s because we may need to modify the markup in someway, either by adding attributes, removing elements, etc. In the following example, we are adding a custom attribute onto our anchors that contain a value of “#” in the href attribute:

foreach (var node in doc.DocumentNode.SelectNodes("//a[@href=\"#\"]"))
{
    node.Attributes.Add("data-my-custom-attribute", "true");
}

But what if we have a more complicated requirement where we need to strip a <tcdl:ComponentPresentation> tag (but ensure that we keep the inside markup). One may be tempted to try the following:

foreach (var node in doc.DocumentNode.Descendants("tcdl:ComponentPresentation"))
{
    node.ParentNode.RemoveChild(node, true); // this strips the tcdl tag while preserving the children... but its modifying the collection...
}

package.Remove(outputItem);
outputItem.SetAsString(doc.DocumentNode.OuterHtml);
package.PushItem(Package.OutputName, outputItem);

However, while running the above, you’ll get an error that says Collection was modified; enumeration operation may not execute. This is because you cannot modify the collection (add/remove) while enumerating it. But, we can keep a record of the nodes and operate on them after:

List nodesToChange = new List();
foreach (var node in doc.DocumentNode.Descendants("tcdl:ComponentPresentation"))
{
    nodesToChange.Add(node);
}

// Now we strip the wrapping tags...
foreach (var node in nodesToChange)
{
    node.ParentNode.RemoveChild(node, true);
}

package.Remove(outputItem);
outputItem.SetAsString(doc.DocumentNode.OuterHtml);
package.PushItem(Package.OutputName, outputItem);

Hopefully this will end the regex blues out there for anyone who may be struggling!

One thought on “Searching and Modifying Output In Tridion Template Building Blocks With HtmlAgilityPack

  1. Nice post. I just wanted to mention though, that the choice is not per se between regular expressions and using an html parser. If it’s early enough in the pipeline, using an XML document can be a better choice. For example, you can easily load the contents of a RTF field into an XML document. Regexes and HTML parsers come into their own further down the pipeline when you’ve already generated text or html output.

Leave a Reply to Dominic Cronin Cancel reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>