Parsing Html content using Html Agility Pack, Cookies & Proxy

If you’ve ever needed to parse (screen-scrape) some remote HTML, you may have wanted to pull information from a page that only renders its content for a browser. The example below grabs a web page with a web request, sends cookies and authenticates to the proxy along the way, and then uses the Html Agility Pack to parse the returned HTML so you can pick out a specific element:

Imports System.Net
Imports System.Web

Public Class Form1

    Public cookies As New CookieContainer

    Private Sub Button1_Click(sender As System.Object, e As System.EventArgs) Handles Button1.Click

        'create a web request (WebRequest.Create returns the base type, so cast it explicitly)
        Dim wreq As HttpWebRequest = DirectCast(WebRequest.Create("http://www.reuters.com/finance/stocks/overview?symbol=AMZN.OQ"), HttpWebRequest)

        'set the agent to mimic a recent browser
        wreq.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5"

        'the HTTP method used to fetch the page
        wreq.Method = "GET"

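        'get the proxy the request will go through (the system default unless changed)
        Dim prox As IWebProxy = wreq.Proxy

        'authenticate to the proxy as the currently logged-in user
        prox.Credentials = CredentialCache.DefaultCredentials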

        'create the html doc that will hold the parsed page
        Dim document As New HtmlAgilityPack.HtmlDocument

        'attach the shared cookie container so cookies set by the server are stored and sent back on later requests
        wreq.CookieContainer = cookies

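        'send the request; the Using block closes the response and its stream when we're done
        Using res As HttpWebResponse = DirectCast(wreq.GetResponse(), HttpWebResponse)

            'load the html from the response stream, detecting the encoding from any byte order mark
            document.Load(res.GetResponseStream(), True)

        End Using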

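        'get some data from the page.
        'in the example below, I'm looking for a div with the class 'sectionQuoteDetail' and getting the content of the second span inside it
        Dim strCurrentQuote = document.DocumentNode.SelectSingleNode("//div[@class='sectionQuoteDetail']//span[2]")

        'SelectSingleNode returns Nothing when nothing matches (for example, if the page layout changes)
        If strCurrentQuote Is Nothing Then
            MsgBox("Quote not found")
        Else
            MsgBox(strCurrentQuote.InnerText)
        End If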

    End Sub

End Class
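
If you want to try the XPath part without hitting the network, something like the following should work. It is only a sketch: the markup is made up to resemble the structure the example expects (it is not the real page), and it assumes a console project with the Html Agility Pack referenced. The module name is just for illustration.

```vbnet
Module XPathSketch

    Sub Main()

        'made-up markup standing in for the downloaded page (an assumption, not the real site)
        Dim html As String = "<div class='sectionQuoteDetail'>" &
                             "<span>Last trade</span><span>123.45</span>" &
                             "</div>"

        'parse the string instead of a response stream
        Dim document As New HtmlAgilityPack.HtmlDocument
        document.LoadHtml(html)

        'same XPath as the example above
        Dim node As HtmlAgilityPack.HtmlNode = document.DocumentNode.SelectSingleNode("//div[@class='sectionQuoteDetail']//span[2]")

        'SelectSingleNode returns Nothing when there is no match
        If node Is Nothing Then
            Console.WriteLine("No match")
        Else
            Console.WriteLine(node.InnerText.Trim())
        End If

    End Sub

End Module
```

Working against a saved or hand-written snippet like this makes it easier to tune the XPath before pointing it at the live page.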

As a side note, the proxy code above also fixes the common HttpWebRequest WebException “The remote server returned an error: (407) Proxy Authentication Required”.
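
If you hit the 407 error in several places, the proxy step can be pulled out on its own. The following is a minimal sketch, assuming the .NET Framework behaviour where a new request picks up the system proxy by default; CreateProxiedRequest and ProxySketch are hypothetical names, not part of any library.

```vbnet
Imports System.Net

Module ProxySketch

    'hypothetical helper: builds a request that authenticates to the proxy as the current user
    Function CreateProxiedRequest(url As String) As HttpWebRequest

        Dim request As HttpWebRequest = DirectCast(WebRequest.Create(url), HttpWebRequest)

        'Proxy can be Nothing if proxying has been switched off, so check before using it
        Dim proxy As IWebProxy = request.Proxy
        If proxy IsNot Nothing Then
            proxy.Credentials = CredentialCache.DefaultCredentials
        End If

        Return request

    End Function

End Module
```

A call such as CreateProxiedRequest("http://www.reuters.com/finance/stocks/overview?symbol=AMZN.OQ") could then replace the request and proxy lines in the button handler.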
