Creating a Chrome Plugin to Scrape A Page (using jQuery)

If you’ve played with Chrome extensions at all, you know they’re super powerful. I recently wanted to visit a bunch of pages and extract some info from each one. I could have run some jQuery in each page’s console, but I wanted a quicker, repeatable way to do it. Creating a Chrome extension that includes jQuery and runs locally is pretty simple. Below are the five files you’ll need (put them all in a single folder).

After creating these and adding your code, load the extension into Chrome by going to Extensions > Load unpacked extension and choosing your folder.

1: manifest.json

{
  "name": "Your Extension Name",
  "description": "This was easy",
  "version": "1.1",
  "background": {
    "scripts": ["background.js"]
  },
  "permissions": [
    "tabs", "http://*/*", "https://*/*"
  ],
  "browser_action": {
    "default_title": "My Extension Title",
    "default_icon": "a-cool-logo.png"
  },
  "manifest_version": 2
}

2: jquery-3.1.1.min.js (get a copy from jquery.com)

 

3: a-cool-logo.png (16px x 16px)

 

4: background.js

chrome.browserAction.onClicked.addListener(function(tab) {
 chrome.extension.getBackgroundPage().console.log('your plugin gonna do something');
//maybe see if your plugin should be allowed to run
 if (tab.url.indexOf("/maybeCheckAurl/") !== -1) {
 
chrome.tabs.executeScript(null, { file: "jquery-3.1.1.min.js" }, function() {
 chrome.tabs.executeScript(null, { file: "content.js" });
});
 
 }
 
});

5: content.js (the magic happens here)

//check that jQuery is loaded (window.jQuery avoids a ReferenceError if it isn't)
if (window.jQuery) {

  jQuery(".someclass a").each(function () {

    //log the results
    console.log(jQuery(this).attr("href"));

    //maybe do something with them
    jQuery.get("http://yourapi", function (data) { });

  });

} else {
  alert('no jq');
}
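
If you’d rather collect everything in one place than fire off requests from each page, a minimal sketch (same manifest v2 setup; the message shape here is just an example) could send the scraped links back to the background page instead:

//content.js variant: gather hrefs, then hand them to the background page
var links = [];
jQuery(".someclass a").each(function () {
  links.push(jQuery(this).attr("href"));
});
chrome.runtime.sendMessage({ links: links });

//add to background.js: receive and log each batch
chrome.runtime.onMessage.addListener(function (message, sender) {
  console.log(message.links);
});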

Accessing the Cloudflare API in C#

Cloudflare provides security, a CDN and more for your websites. If you’re using Cloudflare’s caching to speed up your sites (it really is fast), you may want to purge its cache from your application (instead of waiting X days). Cloudflare provides an API that seems to offer everything you’d possibly need. I wanted to do this from C#, but didn’t find any great libraries or code using their newest API (v4).

Below is some quick, simple code that has worked great for me so far. It’s pretty basic and doesn’t require many external libraries (just JSON.NET). Let me know your thoughts and if it helps you.

//required namespaces: System.Net, System.IO, System.Text and Newtonsoft.Json

//define some things we'll need for the api
string apiEndpoint = "https://api.cloudflare.com/client/v4";

//user info here
string userEmail = "youremail@something.com";

//this is your Global api key found in "my account"
string userAPIkey = "xxxxxxxxxxxxxxxxxxxxxxxxx";

//the zone you're working with (the bare domain, not a full url):
string domain = "yourdomain.com";

//let's get our zone ID (we'll need this for other requests)
HttpWebRequest request = WebRequest.CreateHttp(apiEndpoint + "/zones?name=" + domain + "&status=active&page=1&per_page=20&order=status&direction=desc&match=all");
request.Method = "GET";
request.ContentType = "application/json";
request.Headers.Add("X-Auth-Email", userEmail);
request.Headers.Add("X-Auth-Key", userAPIkey);

string srZoneResult = "";
using (WebResponse response = request.GetResponse())
using (var streamReader = new StreamReader(response.GetResponseStream()))
{
	srZoneResult = streamReader.ReadToEnd();
}

dynamic zoneResult = JsonConvert.DeserializeObject(srZoneResult);

if (zoneResult.result != null)
{
	//get our zoneID
	string zoneId = zoneResult.result[0].id;

	//some pages to purge the cache on:
	byte[] data = Encoding.ASCII.GetBytes(@"{""files"":[""http://www.yourdomain.com/about/""]}");

	request = WebRequest.CreateHttp(apiEndpoint + "/zones/" + zoneId + "/purge_cache");
	request.Method = "DELETE";
	request.ContentType = "application/json";
	request.ContentLength = data.Length;

	request.Headers.Add("X-Auth-Email", userEmail);
	request.Headers.Add("X-Auth-Key", userAPIkey);

	using (Stream outStream = request.GetRequestStream())
	{
		outStream.Write(data, 0, data.Length);
		outStream.Flush();
	}

	string srPurgeResult = "";
	using (WebResponse response = request.GetResponse())
	using (var streamReader = new StreamReader(response.GetResponseStream()))
	{
		srPurgeResult = streamReader.ReadToEnd();
	}

	dynamic purgeResult = JsonConvert.DeserializeObject(srPurgeResult);

	//was it a success (hopefully it = true)
	textBox1.Text = purgeResult.success.ToString();
}
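
If you ever need to flush the whole zone instead of specific files, Cloudflare’s v4 docs also list a purge_everything flag for the same purge_cache endpoint; swapping the request body above would look like:

//purge the entire zone instead of specific files
byte[] data = Encoding.ASCII.GetBytes(@"{""purge_everything"":true}");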

Removing Bad (Spam) Traffic From Google Analytics

If you use Google Analytics, you’ve probably noticed a ton of fake traffic in your website analytics over the past few months. Traffic from referrals like social-buttons.com, best-seo-offer.com or 100dollars-seo.com (and other obviously legit sites). You’ll notice this traffic has a zero session view time and only visits the root of your site. Actually, “visits” isn’t even the correct term, because to my knowledge, these sites are just loading the Google Analytics JavaScript with your tracking code. I guess they believe it’s an easy way for people to see their site “giving” you traffic, then you visit them – and who knows what happens next.

Below are the steps I use to create a new segment in Google Analytics. You can use this segment instead of the “All Sessions” default.

 

1. Add a new segment:


2. Give it a name:

(I called mine “Real Traffic”, since I’m attempting to keep only actual user visit data).


3. Go to the “advanced” area to start adding:


4. Let’s add the first rule to remove data with blank hostnames:

Make sure Sessions and Include are set, then select Hostname and Matches Regex. We’re including multiple hostnames: your own domain plus any others you want to keep (like webcache.googleusercontent.com). Be sure to add your domain to this list! Note: in regex, “|” means “or” and “.*” matches anything, which is how several hostnames get chained into one expression.

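For example (hypothetical domain, swap in your own), the include regex could look like:

yourdomain\.com|webcache\.googleusercontent\.com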

5. Second and last rule:

The second rule has two conditions and removes the rest of the bad / spam data using referral sources. This is where the action happens and the fake traffic really gets cleaned out. Make sure Sessions is set to Exclude, and use “AND” between the two parts of the condition.

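As a sample, the exclude condition’s regex could be built from the junk referrer list at the end of this post, something like:

semalt\.com|buttons-for-website\.com|best-seo-offer\.com|100dollars-seo\.com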

 

That should be it. Save your segment and see if it makes a difference. I prefer this method over filters since it doesn’t remove any data.

In some of my test sites, I’m finding 96% of the traffic is fake.


 

For reference, here are the sites I’ve found so far to be junk and I’m excluding:

social-buttons.com
simple-share-buttons.com
free-share-buttons.com
free-social-buttons.com
event-tracking.com
Get-Free-Traffic-Now.com
buttons-for-website.com
semalt.com
best-seo-offer.com
best-seo-solution.com
buttons-for-your-website.com
makemoneyonline.com
100dollars-seo.com
dailyrank.net

Edit 7/15 - two more to add:
success-seo.com
videos-for-your-business.com

Edit 8/3 - another awesome referrer:
yourserverisdown.com


Force WWW & Fix Redundant Hostnames on Google / SEO

I feel the term “SEO” is completely overused; however, there are a few things you want to do besides just having great content. One is making sure your site URL is consistent: chrisbitting.com is different from http://www.chrisbitting.com.

Google Analytics will provide you a suggestion to fix this if you’re experiencing traffic from multiple hostnames. Something like:

Property http://www.yourdomain.com is receiving data from redundant hostnames. Some of the redundant hostnames are:

This is easy to fix using your Global.asax page. Just add the code below, replacing “yourdomain” with your actual domain. Application_BeginRequest will catch requests to the bare domain and redirect to the www version, issuing a 301 so search engines update their index.

void Application_BeginRequest(object sender, EventArgs e)
{
    if (HttpContext.Current.Request.Url.ToString().ToLower().Contains(
        "http://yourdomain.com"))
    {
        HttpContext.Current.Response.Status = "301 Moved Permanently";

        HttpContext.Current.Response.AddHeader("Location",
            HttpContext.Current.Request.Url.AbsoluteUri.ToString().ToLower().Replace(
                "http://yourdomain.com", "http://www.yourdomain.com"));

        HttpContext.Current.Response.End();
    }
}

You could also do this with web.config + the IIS URL Rewrite module, but I enjoy this method more.
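
For reference, a rough web.config equivalent using the URL Rewrite module (assuming it’s installed) might look like:

<system.webServer>
  <rewrite>
    <rules>
      <rule name="Force WWW" stopProcessing="true">
        <match url="(.*)" />
        <conditions>
          <add input="{HTTP_HOST}" pattern="^yourdomain\.com$" />
        </conditions>
        <action type="Redirect" url="http://www.yourdomain.com/{R:1}" redirectType="Permanent" />
      </rule>
    </rules>
  </rewrite>
</system.webServer>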


Creating a copy of your website using GNU Wget for Windows or OS X

There are times when you want to have a copy of your site (the frontend / user side). GNU Wget has been around a long time, but in my opinion, it’s still a great tool to backup / mirror websites.

Wget has many options and parameters, of which I won’t even scratch the surface, but below are the simple steps to get Wget set up and running on Windows and OS X machines. Wget is a command line utility, so it might appear overwhelming, but don’t worry, it’s cake!

Windows steps:

Step 1. Download / install Wget. Visit http://gnuwin32.sourceforge.net/packages/wget.htm and choose to download the Setup labeled “Complete package, except sources.”

Step 2. After installation is finished, open a command prompt (cmd.exe).

Step 3. Go to your GNU application folder (on 64 bit it’s in C:\Program Files (x86)\GnuWin32\bin, on 32 bit, it’s probably in C:\Program Files\GnuWin32\bin).

Step 4. To test if Wget is installed correctly, run "wget -V". It should return the current version and some credits. If not, revisit the previous steps.

Step 5. To download / mirror a site, run wget -e robots=off -r -l 0 -P "c:\temp" http://www.chrisbitting.com – replacing "c:\temp" with the folder you want the site files to download to and "chrisbitting.com" with your site address.
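
A quick rundown of those flags: -e robots=off tells Wget to ignore robots.txt, -r downloads recursively, -l 0 removes the recursion depth limit, and -P sets the folder the files are saved under. If a site blocks unknown clients, Wget’s -U flag can also send a browser user agent (hypothetical example):

wget -e robots=off -r -l 0 -U "Mozilla/5.0" -P "c:\temp" http://www.chrisbitting.com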


 

You should now see the command prompt update with the progress – depending on the size of your site, it may take some time to download everything. After it’s finished, your directory should contain a mirror of your site, including HTML, CSS, images, etc.

 

Apple OS X steps:

Step 1. Open a blank terminal.

Step 2. Install homebrew by running:

ruby -e "$(curl -fsSL https://raw.github.com/Homebrew/homebrew/go/install)"

Step 3. After installing brew (and entering your password), run:

brew doctor

Step 4. Now install Wget using:

brew install wget

Step 5. When installation finishes, run "wget -V" to ensure Wget installed correctly. It should return the current version and some credits. If not, revisit the previous steps.

Step 6. To download / mirror a site, run wget -e robots=off -r -l 0 -P "/temp" http://www.chrisbitting.com – replacing "/temp" with the folder you want the site files to download to and "chrisbitting.com" with your site address.


You should now see the terminal update with the progress – depending on the size of your site, it may take some time to download everything. After it’s finished, your folder should contain a mirror of your site, including HTML, CSS, images, etc.

To see the multitude of options Wget provides, run "wget --help". Happy downloading!

 


Creating a .bash_profile file in OS X and adding PATH directories

If you’re starting out with a fresh install of OS X (10.9 in my example) and are using any development tools, at some point I’m sure you’ll want to add some directories to your system PATH. In short: this allows you to use an application in a specific directory from any other directory – commonly when you’re running commands in Terminal.

To start, we’ll utilize a text editor – in my case I’m using TextMate – but any plain text editor should do. Let’s get to it:

  1. Let’s first make sure you don’t already have a .bash_profile. In TextMate, go to File > Open. Browse to your home folder (with the house icon) and click “Show Hidden Files”. You shouldn’t see a .bash_profile file in your home folder. (If you do, you don’t need to create a new file: just open it, make your changes, and skip to step 5.)
  2. So cancel the open dialog and enter some text into the untitled file currently open. You’ll usually be entering something like: export PATH=${PATH}:/somedirectory/asubdirectory:/anotherdirectory
  3. Now let’s save our new .bash_profile. Go to File > Save As. Browse to your home folder (with the little house icon again). Enter the filename as “.bash_profile” (without quotes).
  4. If you get a message saying “names that begin with a dot are reserved for the system”, choose “Use ‘.’”.
  5. That’s it. If you already have a terminal open, run source ~/.bash_profile (this just gives your current session access to the updated PATH). See the example after this list.
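
As a concrete example (the directory here is hypothetical), a minimal .bash_profile might contain just one export line:

export PATH=${PATH}:/usr/local/mytools

After running source ~/.bash_profile, echo $PATH should show the new directory at the end.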

Local web server for testing / development using Node.js and http-server

If you’re developing HTML / JavaScript applications and want to test locally, you’ll often go beyond what local file access (file:///C:/…) in browsers will allow (like XMLHttpRequests, JSON calls, cross-domain access and Access-Control-Allow-Origin restrictions).

A simple solution, instead of deploying your code to Apache or IIS, is to install a local HTTP server. http-server for Node.js is a fast, easy-to-install app that will let you serve any directory at http://localhost.

Installing this simple http server only takes a few steps:

  1. Install Node.js if you don’t already have it installed (from http://nodejs.org)
  2. In a command prompt / terminal, now run:
    npm install http-server -g
    

    (this installs http-server globally so you can run it from any folder or directory)

  3. Now, using command prompt or terminal, browse to a folder with some HTML you want to serve over HTTP (ie: c:\someproject\).
  4. Run:
    http-server
    
  5. Open your browser and visit http://localhost:8080.

 

You can change port 8080 (the default) to anything using “-p”, so http-server -p 8088 would serve your local site at http://localhost:8088
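
Since blocked cross-origin requests are one of the reasons for running a local server in the first place, it’s also worth noting that http-server offers a --cors option (see its help output) to send an Access-Control-Allow-Origin header, e.g.:

http-server -p 8088 --cors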

Run http-server --help to see the other options available for running.
