Get rid of live.com load-balanced domains with a regular expression: GA Webmail Referral Traffic Source Rollup Filter
The problem? Nasty load-balanced domains in Google Analytics reports like this:
Source | Visits |
36ohk6dgmcd1n-c.c.yom.mail.yahoo.net / referral | 149 |
mail.google.com / referral | 131 |
du114w.dub114.mail.live.com / referral | 43 |
du103w.dub103.mail.live.com / referral | 25 |
sn124w.snt124.mail.live.com / referral | 23 |
The solution? Clean, nicely segmented source lines that "roll up" into one:
Source/Medium | Visits |
Webmail (live.com) / email | 643 |
Webmail (yahoo.com) / email | 258 |
Webmail (google.com) / email | 105 |
Webmail (aol.com) / email | 13 |
Webmail (libero.it) / email | 12 |
Webmail (laposte.net) / email | 23 |
Cleaning these up requires two filters:
- Webmail Source Rollup (search and replace with Webmail (brand))
- Webmail Medium Rollup (swap / referral with / email as the medium for further rolling up!)
Both filters use the core regex I've worked out, which in theory consolidates 99% of the world's webmail systems without pulling in any false positives.
The Magic Regex Code
Here is the filter regex that I've been using in my production Google Analytics advanced filters since 15 Feb 2012 to clean up, or roll up, the load-balanced domains that you often get in referrals:
My starting point:
(messag|courrier|zimbra|imp|mail)(.*)\.(.*)\..{2,4}
This grabs any domain with say "mail" in it, but runs a check on the ending of the domain: it needs to have at least 2 dots after it and a TLD between 2 and 4 chars long. It will miss go.mail.ru and mail.com. It will also miss "Mail Campaign \ email" because this is not a proper domain. So far so good 🙂
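To sanity-check the starting point outside Google Analytics, here is a minimal Python sketch. It assumes Python's re.search with IGNORECASE behaves closely enough to a GA advanced filter for these hostnames, which is an assumption rather than something verified against GA. The helper name matches_start is made up for illustration.

```python
import re

# The starting-point regex from above, unchanged.
# Assumption: GA matches anywhere in the field and ignores case,
# which re.search + IGNORECASE only approximates.
START = re.compile(
    r"(messag|courrier|zimbra|imp|mail)(.*)\.(.*)\..{2,4}",
    re.IGNORECASE,
)


def matches_start(source: str) -> bool:
    """Return True if the starting-point regex matches anywhere in source."""
    return START.search(source) is not None


if __name__ == "__main__":
    samples = [
        "du114w.dub114.mail.live.com",  # load-balanced webmail host
        "go.mail.ru",                   # only one dot after "mail"
        "mail.com",                     # only one dot after "mail"
        "Mail Campaign",                # not a domain at all
    ]
    for source in samples:
        print(source, matches_start(source))
```

Only the first sample should come back True; the other three are the misses described above.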
My improvement with exceptions (This is the one to use!):
(messag|courrier|zimbra|^imp|mail).*\.(.*\..{2,4}|go\.mail\.ru|promail\.co\.nz|service\.mail\.com|3c\.web\.de|outlook\.com)
The domain of the webmail platform ends up in capture group $A2.
My improved regex has a new exception section at the end to allow some special cases (go.mail.ru, 3c.web.de, service.mail.com) through the filter. These are hard-coded, so they skip the safety net and autonomy that the wide-open keyword matching gets from being paired with a domain name restriction of "two more dots and a TLD".
Maybe that is excessive use of regex, but at least you can be sure you can now see your word of mouth / word of email traffic nice and tidy!
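If you want to see what lands in the second capture group before touching a live profile, a small Python sketch like the one below can help. It assumes Python's re engine is a fair stand-in for GA's, which may not hold in every case. rollup_domain is a hypothetical helper, not part of any GA tooling.

```python
import re

# The improved regex from above, split over two lines for readability only.
IMPROVED = re.compile(
    r"(messag|courrier|zimbra|^imp|mail).*\."
    r"(.*\..{2,4}|go\.mail\.ru|promail\.co\.nz|service\.mail\.com|3c\.web\.de|outlook\.com)",
    re.IGNORECASE,
)


def rollup_domain(source: str):
    """Return the text of capture group 2 ($A2 in GA terms), or None if no match."""
    match = IMPROVED.search(source)
    return match.group(2) if match else None


if __name__ == "__main__":
    for source in [
        "du114w.dub114.mail.live.com",
        "mail.google.com",
        "36ohk6dgmcd1n-c.c.yom.mail.yahoo.net",
        "go.mail.ru",
    ]:
        print(source, "->", rollup_domain(source))
```

In this Python sketch a bare go.mail.ru returns None, because the hard-coded alternatives sit inside the second group and still need a keyword and a dot ahead of them. Whether GA's engine treats it the same way is worth checking against your own reports.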
How does it work?
I'll break the formula down in sections:
(messag|courrier|zimbra|^imp|mail)
This looks for the really obvious and common keywords in webmail services, the main one being mail. This will match sn124w.snt124.mail.live.com but it will also grab emailchimp.com and a huge number of others that you really don't want to catch with this filter. If we were looking for live (but we aren't in this case), then a site like www.answeringoLIVEr.com would get picked up in the crossfire. I use a ^ in front of imp so that domains like dimpost.wordpress.com don't get caught.
Basically the first part of the domain name (sn124w.snt124.mail) gets deleted by this filter, so you could stand to lose quite a lot of data with the "mail" and "imp" keywords if these were the only parts of the expression! So the next bit of the filter is designed to pass the domain through another difficult test involving the dots and TLDs...
(.*)\.(.*)\..{2,4}
This makes sure that the domain bits after mail or zimbra or whatever always have two dots and a TLD (top-level domain extension, e.g. .nz, .jp). That matches the end bit of sn124w.snt124.mail.live.com and the end of www.funk.co.nz.
The (.*) part means match anything, including nothing, and the \. means there must be a dot, so (.*)\.(.*)\. means there have to be at least two more dots in the domain name after the keyword. This is how Answering Oliver escapes the hypothetical test for live: only one dot follows "live" in www.answeringoliver.com. The next part, .{2,4}, is all about the top-level domain or TLD. These can be 2, 3, or 4 letters long, like .co, .com, and .mobi. The curly braces specify how many times the previous character, . (which means any single character), can appear, written as {min,max}.
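The two-step test can be sketched in Python by splitting the regex into its keyword part and its "two dots and a TLD" part. This is a simplification for illustration only: it looks at the first keyword hit, whereas the full regex may retry at a later one. The names KEYWORD, TAIL and explain are all made up here.

```python
import re

# First half: the keyword alternation, with ^ anchoring "imp" to the start.
KEYWORD = re.compile(r"(messag|courrier|zimbra|^imp|mail)", re.IGNORECASE)

# Second half: at least two more dots, then 2 to 4 characters of TLD.
TAIL = re.compile(r"(.*)\.(.*)\..{2,4}")


def explain(host: str):
    """Return (keyword, passes_tail_test), or None when no keyword is found.

    Simplification: only the first keyword occurrence is examined.
    """
    hit = KEYWORD.search(host)
    if hit is None:
        return None
    rest = host[hit.end():]  # everything after the keyword
    return hit.group(1).lower(), TAIL.match(rest) is not None


if __name__ == "__main__":
    for host in [
        "sn124w.snt124.mail.live.com",  # keyword found, tail passes
        "emailchimp.com",               # keyword found, tail fails
        "dimpost.wordpress.com",        # "imp" is not at the start, no keyword
    ]:
        print(host, explain(host))
```

The tail test alone also shows why a live keyword would not catch Answering Oliver: the text after "live" is just r.com, which has one dot.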
Then in the middle is a pipe | which cuts the regex open and lets through some really hard-to-match exceptions for smaller webmail systems. The only reason you see the web.de one also appear on the left of the central | is that this is such a wacky domain name that it doesn't match the "mail" keyword, which gets most of the webmail systems on the planet. Germans aye? 🙂
* (Hi Devon! Thanks for sending traffic to Stray Travel I found you researching this post)
Screenshots
Campaign Source Filter
Advanced filter.
Field A -> Extract A: Campaign Source:
(messag|courrier|zimbra|^imp|mail).*\.(.*\..{2,4}|go\.mail\.ru|promail\.co\.nz|service\.mail\.com|3c\.web\.de|outlook\.com)
Field B -> Extract B: [leave both blank]
Output To -> Constructor: Campaign Source: Webmail ($A2)

GA Webmail Rollup Filter
Medium Rollup Filter
Advanced filter.
Field A -> Extract A: Campaign Source:
(messag|courrier|zimbra|^imp|mail).*\.(.*\..{2,4}|go\.mail\.ru|promail\.co\.nz|service\.mail\.com|3c\.web\.de|outlook\.com)
Field B -> Extract B: [leave both blank]
Output To -> Constructor: Campaign Medium: email
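To see how the two filters work together, here is a rough Python simulation run over the rows from the problem table at the top of the post. It assumes both filters are evaluated against the original referral hostname, that the source constructor uses capture group 2 (where the post says the domain ends up), and that Python's re engine is close enough to GA's. apply_filters and roll_up are hypothetical helpers.

```python
import re
from collections import Counter

# Same regex as in both filter definitions above.
WEBMAIL = re.compile(
    r"(messag|courrier|zimbra|^imp|mail).*\."
    r"(.*\..{2,4}|go\.mail\.ru|promail\.co\.nz|service\.mail\.com|3c\.web\.de|outlook\.com)",
    re.IGNORECASE,
)

# Source, medium and visits, copied from the first table in this post.
ROWS = [
    ("36ohk6dgmcd1n-c.c.yom.mail.yahoo.net", "referral", 149),
    ("mail.google.com", "referral", 131),
    ("du114w.dub114.mail.live.com", "referral", 43),
    ("du103w.dub103.mail.live.com", "referral", 25),
    ("sn124w.snt124.mail.live.com", "referral", 23),
]


def apply_filters(source: str, medium: str):
    """Mimic both rollup filters; non-matching rows pass through untouched."""
    match = WEBMAIL.search(source)
    if match is None:
        return source, medium
    # Source filter: Webmail (<group 2>); medium filter: email.
    return f"Webmail ({match.group(2)})", "email"


def roll_up(rows):
    """Sum visits per rewritten 'source / medium' line."""
    totals = Counter()
    for source, medium, visits in rows:
        new_source, new_medium = apply_filters(source, medium)
        totals[f"{new_source} / {new_medium}"] += visits
    return dict(totals)


if __name__ == "__main__":
    for line, visits in roll_up(ROWS).items():
        print(line, "|", visits)
```

The three live.com hosts collapse into a single line, which is the whole point of the rollup.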
Additional reference domains to check:
The domains below are the really rare webmail clients that are hard to extract:
mail175-236.sinamail.sina.com.cn
go.mail.ru
service.mail.com
promail.co.nz
3d.web.de
ch1prd0310.outlook.com
This is why I needed to grab the full domain with (.*\..{2,4}) versus the first version's (.*)\.(.*)\..{2,4}, which would have only grabbed "sinamail". Now we get sinamail.sina.com.cn.
These can be checked with: sina\.com\.cn|go\.mail\.ru|mail\.com|promail\.co\.nz|web\.de|outlook\.com
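A quick way to confirm that the check expression covers every reference domain in the list is a few lines of Python. This is only a sketch: it assumes a plain substring-style search, and unmatched is a made-up helper name.

```python
import re

# The check expression from above, unchanged.
CHECK = re.compile(
    r"sina\.com\.cn|go\.mail\.ru|mail\.com|promail\.co\.nz|web\.de|outlook\.com"
)

# The rare webmail hosts listed in this section.
REFERENCE_DOMAINS = [
    "mail175-236.sinamail.sina.com.cn",
    "go.mail.ru",
    "service.mail.com",
    "promail.co.nz",
    "3d.web.de",
    "ch1prd0310.outlook.com",
]


def unmatched(domains):
    """Return the domains that the check expression does not find."""
    return [d for d in domains if CHECK.search(d) is None]


if __name__ == "__main__":
    # An empty list means every reference domain is covered.
    print(unmatched(REFERENCE_DOMAINS))
```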
Errors and Issues
Currently the filter incorrectly tracks the following referrals as word of email:
dailymail.co.uk
I will update the filter to address this issue at some point.
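Until then, a small offline check can at least flag hosts like this before they pollute a report. The sketch below assumes Python's re engine approximates GA's. The list KNOWN_NON_WEBMAIL and the helper false_positives are hypothetical names for illustration.

```python
import re

# The improved regex from this post, unchanged.
IMPROVED = re.compile(
    r"(messag|courrier|zimbra|^imp|mail).*\."
    r"(.*\..{2,4}|go\.mail\.ru|promail\.co\.nz|service\.mail\.com|3c\.web\.de|outlook\.com)",
    re.IGNORECASE,
)

# Hypothetical list of referrers known NOT to be webmail.
KNOWN_NON_WEBMAIL = ["dailymail.co.uk"]


def false_positives(hosts):
    """Return the non-webmail hosts that the rollup regex would still catch."""
    return [h for h in hosts if IMPROVED.search(h) is not None]


if __name__ == "__main__":
    for host in false_positives(KNOWN_NON_WEBMAIL):
        # Show what the source filter would turn the host into.
        print(host, "->", f"Webmail ({IMPROVED.search(host).group(2)})")
```

Adding each newly spotted false positive to the list gives a cheap regression test for any future tweak to the regex.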
References
Thanks to Olivier Resoneo for the original inspiration (French). His code was:
Group all French-language webmail under the main domain name
Custom filter: Advanced
Field A: Campaign Source: (messag|courrie|zimbra|ima?p|mail|prd[0-9]+)(.*)\.(.*)\..{2,4}
Field B: (none) -
Output To -> Constructor: Campaign Source: Webmail - $A3
You can also adapt this to force the medium to 'email' when Campaign Source matches, AND set Campaign to email-non-taggue (untagged email), for example, to get the medium/source/campaign triplet.
Posted by tomachi on May 6th, 2012 filed in Google Analytics, Online Marketing