Parsing XML with HTML Agility Pack (instead of XDocument, etc.)

If you’re looking for an easy way to parse some XML in .NET and the usual LINQ to XML classes (XDocument, XElement) aren’t cutting it, HtmlAgilityPack could be your answer.

Below I’ll lay out the simple steps for parsing an XML file with HtmlAgilityPack. The XML file I’m using to test is the WordPress export XML format. It uses lots of XML features, so it seemed like a good test. Let’s start. I’m using Visual Studio, and in my example I’m assuming you have a project already open (I’m using a simple Console app).

  1. Install HtmlAgilityPack. In the “Package Manager Console” in Visual Studio, run this to install the newest HtmlAgilityPack release.
     Install-Package HtmlAgilityPack

    If it went well, you’ll see “Successfully added ‘HtmlAgilityPack x.x.x’ to [your project]”.

  2. Now in your code, add
    using HtmlAgilityPack;

    to the top of your code file.

  3. In a method (I’m in Main), add the below to create a new doc and load your XML file.
     HtmlDocument someDoc = new HtmlDocument();
     someDoc.Load(@"C:\path\yourfile.xml");
  4. Below is some code that loops through the nodes and pulls some content.
    foreach (HtmlNode nodeRss in someDoc.DocumentNode.SelectNodes("rss"))
    {
        //get attribute of a node:
        Console.WriteLine(nodeRss.Attributes["version"].Value);

        //get more nodes:
        foreach (HtmlNode aNode2 in nodeRss.SelectNodes("channel"))
        {
            //get count of child nodes
            Console.WriteLine(aNode2.ChildNodes.Count());

            //get deeper nodes
            foreach (HtmlNode aNode3 in aNode2.SelectNodes("item"))
            {
                //get the content of a node:
                Console.WriteLine(aNode3.SelectSingleNode("title").InnerText);

                //get the content of a node w/ a URL:
                Console.WriteLine(aNode3.SelectSingleNode("link").NextSibling.InnerText);

                //get content of a node w/ ":"
                Console.WriteLine(aNode3.SelectSingleNode(".//*[name()='wp:status']").InnerText);

                //some html content:
                Console.WriteLine(aNode3.SelectSingleNode(".//*[name()='content:encoded']").InnerHtml.Replace("![CDATA[", "").Replace("]]", ""));
            }
        }
    }
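
A note on the `link` line above: HtmlAgilityPack parses its input as HTML, where `link` is a void element with no content. As far as I understand it, that is why the URL text ends up in a sibling text node and `.NextSibling` is needed. Here is a minimal sketch of one possible workaround, assuming the `link` entry is removed from the static `HtmlNode.ElementsFlags` table before the document is loaded. The class name `LinkElementSketch`, the method `ReadFirstItemLink`, and the sample URL in the usage comment are hypothetical, made up for illustration.

```csharp
using System;
using HtmlAgilityPack;

public static class LinkElementSketch
{
    // Hypothetical helper: returns the text inside the first <item>/<link>,
    // or null when no such element exists.
    public static string ReadFirstItemLink(string xml)
    {
        // Assumption: by default HAP flags <link> as an empty (void) HTML
        // element. Removing the flag lets <link>...</link> keep its text as
        // a child node. This table is static, so the change applies to every
        // document parsed afterwards in the same process.
        HtmlNode.ElementsFlags.Remove("link");

        var doc = new HtmlDocument();
        doc.LoadHtml(xml);

        HtmlNode link = doc.DocumentNode.SelectSingleNode("//item/link");
        return link?.InnerText.Trim();
    }
}

// Usage (hypothetical input):
// string url = LinkElementSketch.ReadFirstItemLink(
//     "<rss><channel><item><link>http://example.com/post</link></item></channel></rss>");
```

Because the flag table is shared, this is probably best kept to tools that only ever parse XML, not pages where `link` really is an empty HTML tag.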
    

Below is the entire sample:

using HtmlAgilityPack;
using System;
using System.Linq;

namespace htmlAgilityPack_XML_Parsing
{
    internal class Program
    {
        private static void Main(string[] args)
        {
            HtmlDocument someDoc = new HtmlDocument();
            someDoc.Load(@"C:\path\yourfile.xml");

            foreach (HtmlNode nodeRss in someDoc.DocumentNode.SelectNodes("rss"))
            {
                //get attribute of a node:
                Console.WriteLine(nodeRss.Attributes["version"].Value);

                //get more nodes:
                foreach (HtmlNode aNode2 in nodeRss.SelectNodes("channel"))
                {
                    //get count of child nodes
                    Console.WriteLine(aNode2.ChildNodes.Count());

                    //get deeper nodes
                    foreach (HtmlNode aNode3 in aNode2.SelectNodes("item"))
                    {
                        //get the content of a node:
                        Console.WriteLine(aNode3.SelectSingleNode("title").InnerText);

                        //get the content of a node w/ a URL:
                        Console.WriteLine(aNode3.SelectSingleNode("link").NextSibling.InnerText);

                        //get content of a node w/ ":"
                        Console.WriteLine(aNode3.SelectSingleNode(".//*[name()='wp:status']").InnerText);

                        //some html content:
                        Console.WriteLine(aNode3.SelectSingleNode(".//*[name()='content:encoded']").InnerHtml.Replace("![CDATA[", "").Replace("]]", ""));
                    }
                }
            }

            Console.ReadLine();
        }
    }
}
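
One thing to watch for in the sample: `SelectNodes` returns `null` rather than an empty collection when nothing matches, and `SelectSingleNode` returns `null` too, so an export that is missing an element would throw a `NullReferenceException`. Below is a minimal sketch of a more defensive loop. The class name `SafeSelectSketch`, the method `ReadItemTitles`, and the `"(no title)"` placeholder are hypothetical names chosen for illustration, and the XML is passed in as a string via `LoadHtml` only to keep the example self-contained.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class SafeSelectSketch
{
    // Hypothetical helper: collects the title of every <item>,
    // substituting a placeholder when an item has no <title>.
    public static List<string> ReadItemTitles(string xml)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(xml);

        var titles = new List<string>();

        // SelectNodes yields null (not an empty collection) on no match,
        // so check before looping.
        HtmlNodeCollection items = doc.DocumentNode.SelectNodes("//item");
        if (items == null)
        {
            return titles;
        }

        foreach (HtmlNode item in items)
        {
            // SelectSingleNode also yields null when the child is absent.
            HtmlNode title = item.SelectSingleNode("title");
            titles.Add(title != null ? title.InnerText : "(no title)");
        }

        return titles;
    }
}
```

The same kind of check applies to attributes: `GetAttributeValue("version", "unknown")` on a node returns the supplied default when the attribute is missing, whereas indexing `Attributes["version"]` gives `null` and the `.Value` call fails.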