If you need to parse some XML with .NET and the usual LINQ to XML classes (XObject and friends) aren't cutting it, HtmlAgilityPack could be your answer.
Below I'll lay out the simple steps for parsing an XML file with HtmlAgilityPack. The file I'm testing with is a WordPress export, which uses lots of XML features (attributes, namespaced elements, CDATA sections), so it seemed like a good test. Let's start. I'm using Visual Studio, and my example assumes you already have a project open (mine is a simple Console app).
- Install HtmlAgilityPack. In the “Package Manager Console” in Visual Studio, run this to install the newest HtmlAgilityPack release.
Install-Package HtmlAgilityPack
If it went well, you'll see "Successfully added 'HtmlAgilityPack x.x.x' to [your project]".
- Now in your code, add
using HtmlAgilityPack;
to the top of your code document.
- In a method (I'm in Main), add the following to create a new document and load your XML file.
HtmlDocument someDoc = new HtmlDocument();
someDoc.Load(@"C:\path\yourfile.xml");
- Below is some code that loops through the nodes and pulls some content.
foreach (HtmlNode nodeRss in someDoc.DocumentNode.SelectNodes("rss"))
{
    // Get an attribute of a node:
    Console.WriteLine(nodeRss.Attributes["version"].Value);

    // Get more nodes:
    foreach (HtmlNode aNode2 in nodeRss.SelectNodes("channel"))
    {
        // Get the count of child nodes (Count is a property, so no System.Linq is needed):
        Console.WriteLine(aNode2.ChildNodes.Count);

        // Get deeper nodes:
        foreach (HtmlNode aNode3 in aNode2.SelectNodes("item"))
        {
            // Get the content of a node:
            Console.WriteLine(aNode3.SelectSingleNode("title").InnerText);

            // Get the content of a node holding a URL. HtmlAgilityPack treats <link>
            // as an empty HTML element, so the URL lands in the next sibling text node:
            Console.WriteLine(aNode3.SelectSingleNode("link").NextSibling.InnerText);

            // Get the content of a node with ":" in its name:
            Console.WriteLine(aNode3.SelectSingleNode(".//*[name()='wp:status']").InnerText);

            // Some HTML content, with the CDATA markers stripped:
            Console.WriteLine(aNode3.SelectSingleNode(".//*[name()='content:encoded']").InnerHtml.Replace("![CDATA[", "").Replace("]]", ""));
        }
    }
}
Below is the entire sample:
using HtmlAgilityPack;
using System;

namespace htmlAgilityPack_XML_Parsing
{
    internal class Program
    {
        private static void Main(string[] args)
        {
            HtmlDocument someDoc = new HtmlDocument();
            someDoc.Load(@"C:\path\yourfile.xml");

            foreach (HtmlNode nodeRss in someDoc.DocumentNode.SelectNodes("rss"))
            {
                // Get an attribute of a node:
                Console.WriteLine(nodeRss.Attributes["version"].Value);

                // Get more nodes:
                foreach (HtmlNode aNode2 in nodeRss.SelectNodes("channel"))
                {
                    // Get the count of child nodes (Count is a property, so no System.Linq is needed):
                    Console.WriteLine(aNode2.ChildNodes.Count);

                    // Get deeper nodes:
                    foreach (HtmlNode aNode3 in aNode2.SelectNodes("item"))
                    {
                        // Get the content of a node:
                        Console.WriteLine(aNode3.SelectSingleNode("title").InnerText);

                        // Get the content of a node holding a URL. HtmlAgilityPack treats <link>
                        // as an empty HTML element, so the URL lands in the next sibling text node:
                        Console.WriteLine(aNode3.SelectSingleNode("link").NextSibling.InnerText);

                        // Get the content of a node with ":" in its name:
                        Console.WriteLine(aNode3.SelectSingleNode(".//*[name()='wp:status']").InnerText);

                        // Some HTML content, with the CDATA markers stripped:
                        Console.WriteLine(aNode3.SelectSingleNode(".//*[name()='content:encoded']").InnerHtml.Replace("![CDATA[", "").Replace("]]", ""));
                    }
                }
            }

            // Keep the console window open until Enter is pressed.
            Console.ReadLine();
        }
    }
}
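One thing the sample above doesn't guard against: SelectNodes and SelectSingleNode return null when nothing matches, so a file that's missing a node will throw a NullReferenceException. Here's a minimal sketch of the same kind of loop with null checks. The inline XML string, the NullSafeSample class name, and the fallback strings are made up for illustration; they stand in for a real export file.

```csharp
using System;
using HtmlAgilityPack;

internal static class NullSafeSample
{
    private static void Main()
    {
        // Hypothetical inline XML standing in for a WordPress export file.
        // The second item deliberately has no wp:status element.
        const string xml =
            "<rss version=\"2.0\"><channel>" +
            "<item><title>First post</title><wp:status>publish</wp:status></item>" +
            "<item><title>Second post</title></item>" +
            "</channel></rss>";

        var doc = new HtmlDocument();
        doc.LoadHtml(xml);

        // GetAttributeValue returns the supplied default when the node has no such attribute.
        HtmlNode rss = doc.DocumentNode.SelectSingleNode("rss");
        if (rss != null)
        {
            Console.WriteLine(rss.GetAttributeValue("version", "(no version)"));
        }

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection items = doc.DocumentNode.SelectNodes("//item");
        if (items == null)
        {
            Console.WriteLine("No item nodes found.");
            return;
        }

        foreach (HtmlNode item in items)
        {
            // SelectSingleNode also returns null on no match, so fall back to a default.
            string title = item.SelectSingleNode("title")?.InnerText ?? "(no title)";
            string status = item.SelectSingleNode(".//*[name()='wp:status']")?.InnerText ?? "(no status)";
            Console.WriteLine(title + ": " + status);
        }
    }
}
```

The ?. and ?? operators keep each lookup on one line; plain if checks work just as well if you're on an older C# version.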
Hi can you rewrite this to load from a remote URL instead? Example:
someDoc.Load("http://www.website.com/sitemap.xml");
Thanks
This should do it:
HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load("https://chrisbitting.com");
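For a sitemap specifically, here's a rough sketch rather than a tested recipe. I'm not certain HtmlWeb will parse a response that isn't served as HTML, so this version swaps in HttpClient to download the text and hands it to LoadHtml instead. The URL is just the placeholder from the question, and the "//url/loc" path assumes the standard sitemap layout (urlset > url > loc).

```csharp
using System;
using System.Net.Http;
using HtmlAgilityPack;

internal static class RemoteSitemapSample
{
    private static void Main()
    {
        // Placeholder URL from the question; swap in a real sitemap address.
        const string url = "http://www.website.com/sitemap.xml";

        string xml;
        using (var client = new HttpClient())
        {
            // Blocking on the task keeps the sketch short; use await in real code.
            xml = client.GetStringAsync(url).GetAwaiter().GetResult();
        }

        var doc = new HtmlDocument();
        doc.LoadHtml(xml);

        // Assumption: standard sitemap structure, one <loc> per <url> entry.
        HtmlNodeCollection locs = doc.DocumentNode.SelectNodes("//url/loc");
        if (locs == null)
        {
            // SelectNodes returns null when nothing matches.
            Console.WriteLine("No loc nodes found.");
            return;
        }

        foreach (HtmlNode loc in locs)
        {
            Console.WriteLine(loc.InnerText.Trim());
        }
    }
}
```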