After implementing a new reporting backend for html5test.com, I noticed something strange. There seemed to be an unusually high number of visits from browsers that claimed to be Safari but had scores different from those of my own devices. It looked like quite a lot of visits came from browsers that were lying about their identity.
Each visit to html5test.com is logged and is used for generating reports. You can see those reports on the html5test website by going to the “Other browsers” or “Compare” tabs. Creating the reports isn’t fully automatic – I still have to go over the logged data manually, but by using some smart queries I can get the data I want quite easily from the database. The hardest part of writing the new reporting backend wasn’t actually the reporting: it was identifying the name and version of the browser that is used. If you do not accurately recognise the browser, you can’t say anything useful about how well each browser supports HTML5. Properly identifying the source of the recorded score is vital to reliable reports about HTML5 support.
If you have ever needed to identify browsers, you probably know that every browser has a user agent string that basically tells you the name and version of the browser and its rendering engine. Well… kind of. You will run into many problems, but for all intents and purposes it is possible to reliably detect the identity of the browser using the user agent string. The script I wrote was pretty good, and over the course of a couple of months and a couple of million real-world user agent strings it became quite accurate.
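To give an idea of what identification from a user agent string looks like, here is a minimal sketch in Python. It is not the script used for html5test.com. The patterns, the `Browser` type and the `identify` function are made up for illustration and cover only a handful of desktop browsers.

```python
import re
from typing import NamedTuple, Optional


class Browser(NamedTuple):
    name: str
    version: str


# Order matters: Chrome's user agent string also contains "Safari/",
# so the Chrome pattern has to be tried before the Safari one.
# These patterns are illustrative assumptions, not a complete rule set.
_PATTERNS = [
    ("Opera", re.compile(r"Opera/.*Version/(\d+(?:\.\d+)*)")),
    ("Chrome", re.compile(r"Chrome/(\d+(?:\.\d+)*)")),
    ("Firefox", re.compile(r"Firefox/(\d+(?:\.\d+)*)")),
    ("Internet Explorer", re.compile(r"MSIE (\d+(?:\.\d+)*)")),
    ("Safari", re.compile(r"Version/(\d+(?:\.\d+)*).*Safari/")),
]


def identify(user_agent: str) -> Optional[Browser]:
    """Return the first browser whose pattern matches, or None if none does."""
    for name, pattern in _PATTERNS:
        match = pattern.search(user_agent)
        if match:
            return Browser(name, match.group(1))
    return None


ua = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/534.53.11 "
      "(KHTML, like Gecko) Version/5.1.3 Safari/534.53.10")
print(identify(ua))
```

Note that a function like this only reports what the string claims. It has no way of knowing whether the claim is true.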
Spoofed user agent strings
There is one large drawback to using the user agent string: the string itself can be spoofed. For almost every popular browser there are extensions that allow the user to modify the string, and sometimes even choose from a list of strings used by other browsers. That means my script will misidentify these browsers and the score will be recorded for the wrong browser. This is actually what I was expecting, but I figured that only a very small number of users would be using these extensions. I also thought that if you have a sufficiently large number of visitors you can easily spot the fake ones. And I was right. If you look at the raw data for almost every browser you'll see that only about 1% of the visits are from browsers that spoof their user agent string.
There was one exception.
Whenever I looked at iOS and Safari I couldn’t figure out what the correct score was for a particular version solely from looking at the raw data. After using an actual device to confirm the score I found that for some versions of iOS and Safari most of the recorded scores were different from the device I used. Take for example iOS 3.1:
|Number of recorded scores||Score|
As you can see in the table above, it is very difficult to determine the correct score from the data alone. You might suspect the score with the most visits is the correct score, but you would be wrong. I manually verified the correct score and it is 142. That means that over 88% of all visits claiming to be iOS 3.1 are lying and are in reality some other browser on another device.
I did the same test for other versions of iOS and found the following numbers:
- 97% of iOS 3.0 scores are fake
- 88% of iOS 3.1 scores are fake
- 99% of iOS 3.2 scores are fake
- 92% of iOS 4.0 scores are fake
- 53% of iOS 4.1 scores are fake
- 71% of iOS 4.2 scores are fake
- 32% of iOS 4.3 scores are fake
- 2% of iOS 5.0 scores are fake
- 2% of iOS 5.1 scores are fake
Given that more than 75% of all iOS visitors are running iOS 5.0 or higher, the overall impact fake user agent strings have on browser market share will not be as high as it is for some of the individual versions of iOS.
The same problem occurs when you look at the desktop version of Safari.
The reason we are seeing so many fake user agent strings on iOS is simple. The iPhone and iPad are very popular products and many sites have created optimized versions of their websites for them. What many web developers did not realize, or simply do not care about, is that there are many other devices that could benefit from these optimized websites but are served the regular version instead. That is why many browsers can imitate iPhones and iPads: to make sure their users get the same optimized experience.
Similarly many browsers have a desktop mode which makes sure users of smart phones and tablets get the full desktop website instead of a watered down mobile version. And unfortunately many of the Android browsers identify themselves as a desktop version of Safari when using desktop mode.
Camouflage mode wouldn't be a big problem if there were a way to identify a browser in desktop mode. Many browsers actually add a special identifier to the user agent string while in desktop mode, making it pretty trivial to detect.
Others simply copy the user agent string of a desktop browser bit for bit. This makes it impossible to distinguish between the actual desktop browser and the mobile browser lying about its identity. By looking only at the string you cannot tell whether it is fake or not.
Detecting fake user agent strings
The only solution to this problem is to look at more than just the user agent string. To check if a browser is lying you need to look at some of its other properties and compare them to information we already know to be true about each browser. I decided to check for the following things:
- Rendering engine: Most rendering engines use some private properties that are specific to only that engine. By testing for these properties you can easily detect which engine is actually being used. Just compare the actual engine to the one that is reported in the user agent string and if they do not match the user agent string is fake.
- Screen size: We know the screen size of each device that shipped with iOS, so if we encounter a device with some other screen width or height, we know it is lying.
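Here is a minimal sketch of these two checks on the reporting side. It assumes the test page records the engine it actually detected and the screen dimensions alongside the user agent string. The `Visit` fields are hypothetical, and the list of iOS screen sizes is an assumption (320×480 for iPhone and iPod touch, 768×1024 for iPad, in CSS pixels) that would need to be kept up to date as new devices ship.

```python
from dataclasses import dataclass

# Assumed screen sizes (shorter side, longer side) of devices that
# shipped with iOS; illustrative, not an authoritative list.
IOS_SCREENS = {(320, 480), (768, 1024)}


@dataclass
class Visit:
    claimed_engine: str    # engine named in the user agent string
    detected_engine: str   # engine inferred from engine-specific properties
    claims_ios: bool       # user agent string says the device runs iOS
    screen_width: int
    screen_height: int


def engine_mismatch(visit: Visit) -> bool:
    """True when the detected engine contradicts the user agent string."""
    return visit.claimed_engine != visit.detected_engine


def screen_mismatch(visit: Visit) -> bool:
    """True when a visit claims iOS but reports a screen no iOS device has."""
    if not visit.claims_ios:
        return False
    # Normalise orientation so landscape and portrait compare equal.
    size = (min(visit.screen_width, visit.screen_height),
            max(visit.screen_width, visit.screen_height))
    return size not in IOS_SCREENS


def looks_fake(visit: Visit) -> bool:
    return engine_mismatch(visit) or screen_mismatch(visit)
```

A visit that passes both checks is not necessarily genuine. It only means these two properties did not give it away.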
In order to catch the rest we need to look at some specific features and compare the version of the browser with what we know about when that version was actually introduced. If we detect a specific feature and we know it was introduced in version 3, and the browser claims to be version 2, we know the browser is lying about its identity. Similarly we can check if a specific feature that a browser should support is missing.
I decided to look at the following features for iOS:
- Sandboxed iframe (iOS 4.0 or higher)
- WebSockets (iOS 4.2 or higher)
- Web Workers (iOS 5.0 or higher)
- Application Cache (iOS 2.1 or higher)
And the following features for the desktop version of Safari:
- Application Cache (Safari 4.0 or higher)
- History API (Safari 4.1 or higher)
- Fullscreen support (Safari 5.1 or higher)
- FileReader (Safari 5.2 or higher)
By testing these selected features we can detect most of the spoofed browsers. There are still some improvements to make, because despite all these tests we still see some strange scores being recorded. So it is not 100% reliable, but it is good enough to generate reports for html5test.com.
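The version check can be sketched as a simple table lookup. The version numbers below come from the two lists above. The feature keys, the function names and the idea that the detected features arrive as a set of strings are assumptions made for this example.

```python
from typing import Dict, Set, Tuple

Version = Tuple[int, int]

# Feature -> first version that supports it, taken from the lists above.
IOS_FEATURES: Dict[str, Version] = {
    "sandboxed_iframe": (4, 0),
    "websockets": (4, 2),
    "webworkers": (5, 0),
    "application_cache": (2, 1),
}
SAFARI_FEATURES: Dict[str, Version] = {
    "application_cache": (4, 0),
    "history_api": (4, 1),
    "fullscreen": (5, 1),
    "filereader": (5, 2),
}


def parse_version(text: str) -> Version:
    """Turn '4.2.1' into (4, 2) and '5' into (5, 0)."""
    parts = [int(p) for p in text.split(".")[:2]]
    while len(parts) < 2:
        parts.append(0)
    return (parts[0], parts[1])


def is_consistent(claimed_version: str, detected: Set[str],
                  table: Dict[str, Version]) -> bool:
    """Check the detected features against what the claimed version implies.

    A feature that is present but too new for the claimed version, or one
    that should be there but is missing, both count as a contradiction.
    """
    version = parse_version(claimed_version)
    for feature, introduced in table.items():
        expected = version >= introduced
        present = feature in detected
        if expected != present:
            return False
    return True


# Claims iOS 3.1 but supports WebSockets, which arrived in iOS 4.2.
print(is_consistent("3.1", {"application_cache", "websockets"}, IOS_FEATURES))  # False
```

Checking in both directions matters: a browser claiming to be iOS 5.0 without Web Workers is just as suspicious as one claiming to be iOS 3.1 with WebSockets.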
In order to gain some more insight into how prevalent this problem is, I looked at all visits to html5test.com during one single week. During that week there were a total of 172.827 recorded scores. The timespan and number of visits should be enough to provide statistically significant information.
There is one caveat though. Html5test.com is hardly representative of the rest of the internet. To extrapolate these results to global browser market share is premature. I do think the results warrant more investigation.
Whenever I use the term “camouflage” below, I specifically mean a browser that identifies itself as another browser and which cannot be detected by only looking at the user agent string. Also note that the numbers provided are a best-case scenario. The method to detect camouflaged browsers is not 100% accurate and a few are bound to slip through. In reality the number of camouflaged browsers will be slightly higher than reported.
Out of the 172.827 recorded scores only 6.609 were marked as camouflaged by my method:
Out of those 6.609 camouflaged browsers more than 60% identified themselves as a desktop version of Safari and a further 12% identified as Safari on iOS. The remainder is split over desktop and mobile versions of Chrome, Firefox, Internet Explorer and Opera.
If we compare the number of recorded scores that were marked as camouflaged to the total number of recorded scores for each browser we can see that it is only a very small percentage and certainly not statistically significant.
As expected this changes when we look at Safari. Out of 4.868 recorded scores attributed to iOS, over 17% are marked as camouflaged.
- 4.037 were actual Safari on iOS
- 831 were camouflaged as Safari on iOS
And out of 10.551 recorded scores attributed to the desktop version of Safari, over 38% are marked as camouflaged. That means that over one third of html5test.com visitors claiming to be Safari are in reality some other browser.
- 4.052 were camouflaged as Safari on Windows or OS X
- 6.285 were actual Safari on Windows or OS X
In addition to the camouflaged browsers there are also two categories of recorded scores that are not strictly camouflaged, but that will be detected as Safari if you do not look carefully at the user agent string.
- 102 were identifying as Safari, but included an HTC model name
- 112 were identifying as Safari on Linux
That means that of the 10.551 recorded scores identifying as the desktop version of Safari, 40% were fake!
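For anyone who wants to check the arithmetic, the percentages follow directly from the counts listed above. The short script below only redoes that calculation; the variable names are mine.

```python
def share(part: int, total: int) -> float:
    """Percentage of `total` that `part` represents."""
    return 100.0 * part / total


# Safari on iOS: actual + camouflaged
ios_total = 4037 + 831
ios_camouflaged = share(831, ios_total)

# Desktop Safari: camouflaged + actual + HTC model name + Linux
safari_total = 4052 + 6285 + 102 + 112
safari_camouflaged = share(4052, safari_total)
safari_fake = share(4052 + 102 + 112, safari_total)

print(ios_total, f"{ios_camouflaged:.1f}%")        # 4868 17.1%
print(safari_total, f"{safari_camouflaged:.1f}%")  # 10551 38.4%
print(f"{safari_fake:.1f}%")                       # 40.4%
```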
By detecting camouflaged browsers it is fairly easy to generate reliable data for html5test.com, but one does wonder how this affects global browser market share. Personally I do not know which methods companies like StatCounter or NetApplications use to determine which browsers are used, but it is likely they use a similar method and are affected by the same problems.
As I said before, we cannot simply apply these percentages to the global browser market share, but 17% and 38% are large enough to start asking questions about the methodology used by some of the companies that release market share statistics.