Parsing XML with HTML Agility Pack (instead of XDocument, etc.)

If you’re looking to easily parse some XML with .NET and finding that the usual XDocument (LINQ to XML) isn’t cutting it, HtmlAgilityPack could be your answer.

Below I’ll lay out the simple steps for parsing an XML file with HtmlAgilityPack. The XML file I’m using to test is a WordPress export file. It uses lots of XML features (namespaced elements, CDATA sections), so it seemed like a good test. Let’s start. I’m using Visual Studio, and in my example I’m assuming you already have a project open (I’m using a simple Console app).

Install HtmlAgilityPack. In the “Package Manager Console” in Visual Studio, run this to install the newest HtmlAgilityPack release.
```
 Install-Package HtmlAgilityPack
```
If it went well, you’ll see “Successfully added ‘HtmlAgilityPack x.x.x’ to [your project]”.
Now in your code, add
```
using HtmlAgilityPack;
```
to the top of your code document.
In a method (I’m in Main), add the below to create a new doc and load your XML file.
```
 HtmlDocument someDoc = new HtmlDocument();
 someDoc.Load(@"C:\path\yourfile.xml");
```
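If you want to try the parser without a file on disk, the same document object can also be fed a string. The sketch below assumes you already have the XML in memory; the `Parse` helper and the inline sample XML are made up for illustration, while `LoadHtml` and `GetAttributeValue` are HtmlAgilityPack calls.

```csharp
using System;
using HtmlAgilityPack;

internal static class LoadFromStringSketch
{
    // Hypothetical helper: parses XML that is already held in a string.
    internal static HtmlDocument Parse(string xml)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(xml); // LoadHtml takes markup text; Load takes a path or stream
        return doc;
    }

    private static void Main()
    {
        // Made-up sample XML, only loosely shaped like a WordPress export.
        HtmlDocument doc = Parse(
            "<rss version=\"2.0\"><channel><item><title>Hello</title></item></channel></rss>");

        HtmlNode rss = doc.DocumentNode.SelectSingleNode("rss");

        // GetAttributeValue returns the fallback if the attribute is missing.
        Console.WriteLine(rss.GetAttributeValue("version", "unknown"));
    }
}
```

This is handy for quick experiments, since you can paste a small fragment of your export into the string and check your XPath against it.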

Below is some code that loops through the nodes and pulls some content.

```
foreach (HtmlNode nodeRss in someDoc.DocumentNode.SelectNodes("rss"))
{
    //get attribute of a node:
    Console.WriteLine(nodeRss.Attributes["version"].Value);

    //get more nodes:
    foreach (HtmlNode aNode2 in nodeRss.SelectNodes("channel"))
    {
        //get count of child nodes
        Console.WriteLine(aNode2.ChildNodes.Count());

        //get deeper nodes
        foreach (HtmlNode aNode3 in aNode2.SelectNodes("item"))
        {
            //get the content of a node:
            Console.WriteLine(aNode3.SelectSingleNode("title").InnerText);

            //get the content of a node w/ a URL
            //(HtmlAgilityPack treats <link> as an empty HTML tag, so the text lands in the next sibling):
            Console.WriteLine(aNode3.SelectSingleNode("link").NextSibling.InnerText);

            //get content of a node w/ ":"
            Console.WriteLine(aNode3.SelectSingleNode(".//*[name()='wp:status']").InnerText);

            //some html content:
            Console.WriteLine(aNode3.SelectSingleNode(".//*[name()='content:encoded']").InnerHtml.Replace("![CDATA[", "").Replace("]]", ""));
        }
    }
}
```
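One thing to watch for in the loop above: `SelectNodes` returns null (not an empty collection) when nothing matches, and `SelectSingleNode` returns null for a missing node, so a file without the expected elements will throw a `NullReferenceException`. Here is a minimal sketch of a more defensive version. The `SelectOrEmpty` helper, the fallback strings, and the inline sample XML are my own illustration, not part of HtmlAgilityPack.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

internal static class NullSafeSketch
{
    // Hypothetical helper: swaps a null result for an empty sequence
    // so that foreach never sees null.
    internal static IEnumerable<HtmlNode> SelectOrEmpty(HtmlNode node, string xpath)
    {
        IEnumerable<HtmlNode> found = node.SelectNodes(xpath);
        return found ?? Enumerable.Empty<HtmlNode>();
    }

    private static void Main()
    {
        var doc = new HtmlDocument();
        // Made-up sample XML; note the item has no wp:status element.
        doc.LoadHtml("<rss version=\"2.0\"><channel><item><title>Hello</title></item></channel></rss>");

        foreach (HtmlNode item in SelectOrEmpty(doc.DocumentNode, "rss/channel/item"))
        {
            // ?. skips InnerText when the node is missing; ?? supplies a fallback.
            string title = item.SelectSingleNode("title")?.InnerText ?? "(no title)";
            string status = item.SelectSingleNode(".//*[name()='wp:status']")?.InnerText ?? "(no status)";
            Console.WriteLine(title + " / " + status);
        }
    }
}
```

Whether you want a fallback string or would rather skip the item entirely depends on what you do with the data afterwards.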

Below is the entire sample:

```
using HtmlAgilityPack;
using System;
using System.Linq;

namespace htmlAgilityPack_XML_Parsing
{
    internal class Program
    {
        private static void Main(string[] args)
        {
            HtmlDocument someDoc = new HtmlDocument();
            someDoc.Load(@"C:\path\yourfile.xml");

            foreach (HtmlNode nodeRss in someDoc.DocumentNode.SelectNodes("rss"))
            {
                //get attribute of a node:
                Console.WriteLine(nodeRss.Attributes["version"].Value);

                //get more nodes:
                foreach (HtmlNode aNode2 in nodeRss.SelectNodes("channel"))
                {
                    //get count of child nodes
                    Console.WriteLine(aNode2.ChildNodes.Count());

                    //get deeper nodes
                    foreach (HtmlNode aNode3 in aNode2.SelectNodes("item"))
                    {
                        //get the content of a node:
                        Console.WriteLine(aNode3.SelectSingleNode("title").InnerText);

                        //get the content of a node w/ a URL:
                        Console.WriteLine(aNode3.SelectSingleNode("link").NextSibling.InnerText);

                        //get content of a node w/ ":"
                        Console.WriteLine(aNode3.SelectSingleNode(".//*[name()='wp:status']").InnerText);

                        //some html content:
                        Console.WriteLine(aNode3.SelectSingleNode(".//*[name()='content:encoded']").InnerHtml.Replace("![CDATA[", "").Replace("]]", ""));
                    }
                }
            }

            Console.ReadLine();
        }
    }
}
```
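A note on the `NextSibling` call used for `link` in the sample: HtmlAgilityPack parses `<link>` the way HTML defines it, as an empty tag, so the URL text ends up as a sibling text node rather than inside the element. A possible alternative, sketched here as an assumption rather than something I tested against a full WordPress export, is to remove the `link` entry from the static `HtmlNode.ElementsFlags` table before loading, so `link` is parsed like any other element. The sample XML and URL are made up.

```csharp
using System;
using HtmlAgilityPack;

internal static class LinkFlagSketch
{
    private static void Main()
    {
        // ElementsFlags is static, so this change applies to every
        // HtmlDocument parsed afterwards in the same process.
        HtmlNode.ElementsFlags.Remove("link");

        var doc = new HtmlDocument();
        // Made-up fragment with a placeholder URL.
        doc.LoadHtml("<item><link>https://example.com/post</link></item>");

        // With the flag removed, the text should sit inside <link> itself,
        // so no NextSibling hop is needed.
        HtmlNode link = doc.DocumentNode.SelectSingleNode("item/link");
        Console.WriteLine(link.InnerText);
    }
}
```

Because the flag table is global, only do this in a program that parses XML exclusively; real HTML pages rely on `link` being treated as empty.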

2 thoughts on “Parsing XML with HTML Agility Pack (instead of XDocument, etc.)”

John Aghadiuno says:

Hi, can you rewrite this to load from a remote URL instead? Example:
someDoc.Load("http://www.website.com/sitemap.xml");

Thanks


April 5, 2017 at 7:42 am
1. chrisbitting says:
  
  This should do it:
  
  HtmlWeb web = new HtmlWeb();
  HtmlAgilityPack.HtmlDocument doc = web.Load("https://chrisbitting.com");
  
  
  April 6, 2017 at 8:34 am

Chris Bitting
