If you’ve ever needed to parse (screen scrape) some remote HTML, you may have run into a page that only serves its content to something that looks like a browser. The example below grabs content from a web page with a web request, sets a browser user agent, cookies and proxy credentials to help the request get through, and uses the Html Agility Pack to parse the returned HTML so you can pull out a specific element:
'requires a reference to the Html Agility Pack (e.g. the HtmlAgilityPack NuGet package)
Imports System.Net

Public Class Form1
    'shared container so cookies persist across requests
    Public cookies As New CookieContainer

    Private Sub Button1_Click(sender As System.Object, e As System.EventArgs) Handles Button1.Click
        'create a web request
        Dim wreq As HttpWebRequest = DirectCast(WebRequest.Create("http://www.reuters.com/finance/stocks/overview?symbol=AMZN.OQ"), HttpWebRequest)
        'set the agent to mimic a browser
        wreq.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5"
        'how you're getting the page
        wreq.Method = "GET"
        'get the proxy the request will use
        Dim prox As IWebProxy = wreq.Proxy
        'set the proxy cred to those of the logged-in user
        If prox IsNot Nothing Then prox.Credentials = CredentialCache.DefaultCredentials
        'set the cookie container for the request
        wreq.CookieContainer = cookies
        'create the html doc
        Dim document As New HtmlAgilityPack.HtmlDocument
        'start a response and load its stream into the doc
        'the Using block closes the response (and its stream) when done
        Using res As HttpWebResponse = DirectCast(wreq.GetResponse(), HttpWebResponse)
            document.Load(res.GetResponseStream(), True)
        End Using
        'get some data from the page.
        'in the below example, i'm looking for a div with the class 'sectionQuoteDetail' and getting the content of the second span inside
        Dim quoteNode As HtmlAgilityPack.HtmlNode = document.DocumentNode.SelectSingleNode("//div[@class='sectionQuoteDetail']//span[2]")
        'SelectSingleNode returns Nothing when no element matches
        If quoteNode Is Nothing Then
            MsgBox("Quote element not found.")
        Else
            MsgBox(quoteNode.InnerText.Trim())
        End If
    End Sub
End Class
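To see what the XPath in the example selects without hitting the network, the lookup can be tried against an inline HTML string. This is only a minimal sketch: the `ExtractText` helper, the `XPathSketch` module and the sample markup are made up for illustration and are not the real page's structure.

```vbnet
Imports System
Imports HtmlAgilityPack

Module XPathSketch
    ' Hypothetical helper: returns the trimmed text of the first node
    ' matching the XPath, or Nothing when no node matches.
    Function ExtractText(html As String, xpath As String) As String
        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)
        Dim node As HtmlNode = doc.DocumentNode.SelectSingleNode(xpath)
        If node Is Nothing Then Return Nothing
        Return node.InnerText.Trim()
    End Function

    Sub Main()
        ' Invented markup, for illustration only.
        Dim sample As String =
            "<div class='sectionQuoteDetail'>" &
            "<span>Price</span><span>123.45</span>" &
            "</div>"
        Dim quote As String = ExtractText(sample, "//div[@class='sectionQuoteDetail']//span[2]")
        ' Fall back to a placeholder when the element is missing.
        Console.WriteLine(If(quote, "(no match)"))
    End Sub
End Module
```

Note that `@class='sectionQuoteDetail'` is an exact match on the whole attribute value, so a div with several classes would not match, and `span[2]` picks the second span among its siblings rather than the second span on the page.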
As a side note, setting the proxy credentials as shown above also fixes the frequent HttpWebRequest WebException “The remote server returned an error: (407) Proxy Authentication Required”, provided the proxy accepts the credentials of the logged-in user.
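If you want to tell a 407 apart from other failures, the proxy setup and the check can be pulled into small helpers. The sketch below is one possible arrangement, not a definitive one: the `CreateProxiedRequest` and `IsProxyAuthRequired` names and the example.com URL are assumptions made for illustration.

```vbnet
Imports System
Imports System.Net

Module ProxySketch
    ' Hypothetical helper: builds a GET request that sends the current
    ' user's credentials to whatever proxy the request will use.
    Function CreateProxiedRequest(url As String, cookies As CookieContainer) As HttpWebRequest
        Dim request As HttpWebRequest = DirectCast(WebRequest.Create(url), HttpWebRequest)
        request.Method = "GET"
        request.CookieContainer = cookies
        Dim proxy As IWebProxy = request.Proxy
        If proxy IsNot Nothing Then
            proxy.Credentials = CredentialCache.DefaultCredentials
        End If
        Return request
    End Function

    ' Hypothetical helper: True when the exception carries an HTTP 407 response.
    Function IsProxyAuthRequired(ex As WebException) As Boolean
        Dim response As HttpWebResponse = TryCast(ex.Response, HttpWebResponse)
        Return response IsNot Nothing AndAlso
               response.StatusCode = HttpStatusCode.ProxyAuthenticationRequired
    End Function

    Sub Main()
        ' Placeholder URL, for illustration only.
        Dim request = CreateProxiedRequest("http://example.com/", New CookieContainer())
        Try
            Using response = DirectCast(request.GetResponse(), HttpWebResponse)
                Console.WriteLine(CInt(response.StatusCode))
            End Using
        Catch ex As WebException When IsProxyAuthRequired(ex)
            Console.WriteLine("The proxy rejected the default credentials.")
        Catch ex As WebException
            Console.WriteLine("Request failed: " & ex.Message)
        End Try
    End Sub
End Module
```

If the 407 branch still fires, the proxy most likely wants different credentials than those of the logged-in user, in which case a `NetworkCredential` with an explicit user name and password can be assigned to `proxy.Credentials` instead.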