Sunday, January 09, 2011

What Web Sites Know About You

Many people more knowledgeable about web services than I am have already written about how much information web sites can glean about you when you simply click on a link. They get information about your operating system, your browser, your display, and your IP address. Many web services are designed and implemented using an architecture known as Representational State Transfer (REST), in which the software on the server is stateless. As a result, much information may be encoded in the URL that forms the link on which you click. This URL is passed to the web service, and it may also be visible to other software along the way.
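To make the point about URLs concrete, here is a minimal Python sketch that pulls apart a link the way any software handling it could. The URL, its host name, and its parameter names are all hypothetical, invented purely for illustration; only the standard library's urllib.parse module is used.

```python
from urllib.parse import urlsplit, parse_qsl

# Hypothetical URL for illustration only; not taken from any real service.
EXAMPLE_URL = "http://shop.example.com/orders/1234/items?sort=price&lang=en&ref=newsletter"


def url_state(url):
    """Return the pieces of a URL that can carry request state."""
    parts = urlsplit(url)
    return {
        "host": parts.hostname,
        # Path segments often name the resource being requested.
        "path_segments": [segment for segment in parts.path.split("/") if segment],
        # Query parameters often carry options, identifiers, or tracking tags.
        "params": dict(parse_qsl(parts.query)),
    }


if __name__ == "__main__":
    for key, value in url_state(EXAMPLE_URL).items():
        print(key, "=", value)
```

Nothing here requires any cooperation from the server: everything a stateless service needs is sitting in the text of the link, so anything that sees the link can read it too.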

What kind of software? I subscribe to a service called Site Meter that, for a few bucks a month, collects statistics about the people who read my blog. How is this done? I just embed some magic HTML code in the template that is used for every blog article that I write. Your browser runs this code every time you read one of my articles. Site Meter collects and summarizes this data and provides me with a way to access it. Other sites, including Google Analytics (which I also use), do similar things. Companies use this information to fine-tune their web sites (Search Engine Optimization or SEO), improve their marketing and advertising, and ultimately to improve their bottom line.

I'm not nearly that ambitious. But I have used results from Site Meter to improve my blog. For example, I wrote an article on Chip's Instant Managed Beans, a way to easily instrument Java code so that you can interrogate and communicate with it using the GUI-based jconsole tool that is part of the JDK. It's not unlike using the Simple Network Management Protocol (SNMP) with a good Management Information Base (MIB) browser, except it's a whole lot simpler. I changed the original title of that article because I had unwittingly used the brand name of a snack food company in India. (No, I'm not kidding.) Most of the traffic reaching that article through search engines came from folks in India looking for a way to order junk food over the web. I couldn't see any reason for my article on a fairly obscure technology like Java managed beans to be the number one search result for Indian snack chips.

The other day while reviewing my Site Meter data, I stumbled across the following result about a visit someone made to my blog. Here is a screen shot right from my Site Meter control panel. The amount of data gleaned from this search is really interesting.

[Screen shot from the Site Meter control panel: What Web Sites Know About You]

What can we tell about this visitor?
  • They used Google to search for the strings "john l sloan" and "asshole" appearing on the same web page.
  • They found a blog article from July 2007. That's because "John L. Sloan" appears in the copyright of all of my blog articles, and that month I wrote an article in which I mentioned the book The No Asshole Rule by Stanford engineering professor Robert Sutton. (Sutton is one of my favorite business authors and bloggers.) You can trivially duplicate this search yourself right now. Go ahead. I'll wait. The quotes are important; otherwise you'll get a lot of unrelated results. In fact, when I do this search, my July 2007 blog page is the only result I get.
  • Just from their IP address we can see they use Road Runner, the internet service provided by Time Warner Cable.
  • Geolocation based on their IP address suggests they are located in my old stomping grounds, Fairborn, Ohio, USA.
  • Their preferred language, for web browsing anyway, is English.
  • They use Microsoft Windows NT, or more likely some Windows variant that self-identifies as Windows NT.
  • They were using Internet Explorer 8.0, along with an alphabet soup of options and add-ons (this alphabet soup will be important later).
  • They were using JavaScript 1.3.
  • They have a display with a resolution of 1344 by 840 pixels and 32-bit color.
  • They performed the search at around 8 PM local time on January 6th.
  • The zero-second visit length is an artifact: they didn't leave my blog by clicking on a link in my blog, which is the only way the Site Meter software can tell when they leave. Had they clicked on a link, I would know not only how long they remained on my blog, but also where they went when they left.

The fact that I, someone you may not even know, or at least only know from this blog, have access to all of this information about you every time you visit my blog might be enough to worry you. But the fact is, all of this information is available about you when you visit any web site. That's right: any web site can, and probably does, collect this information about you. This does not entail using tracking cookies, viruses, spyware, botnets, or any other possibly malignant technology. It's built into the basic software architecture that is the web.
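As a rough illustration of how much arrives with a single page request, the Python sketch below summarizes one HTTP request the way a server-side log might. Everything in it is an assumption made up for the example: the request text, the host names, the search terms in the Referer header, and the IP address (taken from a range reserved for documentation). It is not how Site Meter works internally, which I have no visibility into. Note too that details like display resolution are not in these headers; those are reported by script running in the browser.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Illustrative request only; every value here is made up for the example.
SAMPLE_REQUEST = (
    "GET /2007/07/example-article.html HTTP/1.1\r\n"
    "Host: blog.example.com\r\n"
    "User-Agent: Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; "
    "Trident/4.0; SLCC2; .NET CLR 2.0.50727)\r\n"
    "Accept-Language: en-us\r\n"
    "Referer: http://www.google.com/search?q=%22john+l+sloan%22+jconsole\r\n"
    "\r\n"
)


def parse_headers(raw_request):
    """Split a raw request into its request line and a dict of headers."""
    lines = raw_request.split("\r\n")
    headers = {}
    for line in lines[1:]:
        if not line:  # a blank line ends the header block
            break
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return lines[0], headers


def user_agent_tokens(user_agent):
    """Return the 'alphabet soup' inside the first parenthesized group."""
    match = re.search(r"\(([^)]*)\)", user_agent)
    return [token.strip() for token in match.group(1).split(";")] if match else []


def summarize_visit(raw_request, client_ip):
    """Collect what one request reveals without cookies or scripts."""
    request_line, headers = parse_headers(raw_request)
    referer = urlsplit(headers.get("referer", ""))
    return {
        "ip": client_ip,  # known from the network connection itself
        "page": request_line.split(" ")[1],
        "language": headers.get("accept-language"),
        "came_from": referer.hostname,
        "search_terms": parse_qs(referer.query).get("q", [None])[0],
        "browser_tokens": user_agent_tokens(headers.get("user-agent", "")),
    }


if __name__ == "__main__":
    for key, value in summarize_visit(SAMPLE_REQUEST, "192.0.2.10").items():
        print(key, "=", value)
```

The search terms fall out of the Referer header because the search engine's results page URL carried them as a query parameter, and the browser passed that URL along when the visitor clicked through.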

Now you might think that all this information isn't enough to actually identify you as a particular user. That's where the alphabet soup of software versions reported by your browser comes in. It turns out that, as research studies by Peter Eckersley of the Electronic Frontier Foundation and others have shown, the collection of software installed in, around, and with your browser is almost as unique as your fingerprint. In fact, this technique is known as browser fingerprinting. Using this information, it is entirely possible that sufficiently clever and motivated individuals can identify your specific machine and use that to correlate your movements across all the web sites you touch. Or even, warrant in hand, identify your specific desktop or laptop as having been the one to visit a particular web site.
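The core idea of fingerprinting can be sketched in a few lines of Python: put the reported attributes into a canonical form and hash them, so that the same browser configuration yields the same identifier on every site that computes it. This is only a toy under my own assumptions, not the method used in Eckersley's research; the attribute names are hypothetical, the user agent string is illustrative, and real studies consider many more attributes than these four.

```python
import hashlib
import json


def fingerprint(attributes):
    """Hash a dict of browser attributes into a stable identifier."""
    # Sorting the keys makes the result independent of collection order.
    canonical = json.dumps(attributes, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Attribute names are hypothetical; the user agent string is illustrative.
visit_a = {
    "user_agent": "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; "
                  "Trident/4.0; SLCC2; .NET CLR 2.0.50727)",
    "language": "en-us",
    "screen": "1344x840x32",
    "javascript": "1.3",
}

# The same browser seen on another site, with attributes gathered in a different order.
visit_b = dict(reversed(list(visit_a.items())))

# A different machine that differs only in its display.
visit_c = dict(visit_a, screen="1280x800x32")

if __name__ == "__main__":
    print(fingerprint(visit_a) == fingerprint(visit_b))  # same configuration
    print(fingerprint(visit_a) == fingerprint(visit_c))  # different configuration
```

The IP address is deliberately left out of the hash, since it can change from day to day while the software on the machine stays the same. The more unusual the combination of attributes, the fewer other machines share the resulting identifier.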

Welcome to the Internet.


Craig Ruff said...

I guess it's a good thing I use NoScript to block access to your data collection site. :-)

Anonymous said...

Entertaining as usual. Send me an email when you identify me. It shouldn't take long. I'll wait.

Anonymous said...

I think we should generate a lot more postings that fit the search criteria. :-)