Friday, 8 August 2014

Website stats checking

Today I was checking my web stats. I noticed a lot of 404s for pages with /RK=0 tacked on the end, and wondered where these requests were coming from. (Actually I've been getting these for a long time; I just decided to look into it more today.) I found the answer here: Server Logs with RK=0/RS=2 ....

Apparently they are from bots scraping Yahoo search results. That link gives a bit more detail about why bots scrape Yahoo search results and where the RK=0 bit comes from.

Anyway, it seems these aren't bots probing for a site vulnerability, and they aren't legitimate requests either, so they can just be ignored.
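If you want to keep these out of a 404 report, something like the following rough sketch could work. It assumes the junk always looks like /RK=0, optionally followed by /RS= and a token, at the very end of the path; that shape is my guess from what shows up in the logs, and the function name is made up for illustration.

```python
import re

# Assumed shape of the junk suffix: "/RK=0", optionally followed by
# "/RS=<token>" and an optional trailing slash, at the end of the path.
RK_SUFFIX = re.compile(r"/RK=0(?:/RS=[^/]*)?/?$")


def split_rk_suffix(path):
    """Return (clean_path, had_suffix) for a requested path.

    Hypothetical helper: strips the scraper suffix so the request can be
    counted against the real page, or skipped in the 404 report.
    """
    clean, count = RK_SUFFIX.subn("", path)
    # If the whole path was the suffix, fall back to the site root.
    return (clean or "/", count > 0)
```

Requests where the second value is True can then be dropped from the 404 list, or grouped under the cleaned path to see which pages the scrapers found.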

Something I noticed when checking the browser stats for one site was a visit from someone using IE 999.1. After some research, it seems that this UA string is sent by the stealth mode in Symantec software: What does the Stealth Mode Web Browsing feature in SEP 11.x / 5.x do?.

Going back to 404s, another problem I noticed was URLs with &cfs=1 appended on the end. There seems to be very little information on this (probably partly because search engines are useless when searching for non-human text). I did find this post, which suggests these are malformed requests generated by Facebook: URL Parameter (&cfs=1) Causing .NET Exceptions.
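For working out which real page each of these requests was meant for, a minimal sketch might look like this. It assumes the parameter is always the literal text cfs=1 at the very end of the request, joined with either & or ?; the function name is hypothetical.

```python
def strip_cfs(target):
    """Return (cleaned_target, was_stripped) for a logged request target.

    Assumption: the stray parameter is exactly "cfs=1" at the end of the
    request, joined with "&" or "?".
    """
    for suffix in ("&cfs=1", "?cfs=1"):
        if target.endswith(suffix):
            return target[: -len(suffix)], True
    return target, False
```

The cleaned target could be used to group these 404s by the page they were aimed at, or as the destination for a redirect if you would rather not serve an error.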

It does annoy me that large sites like Facebook and Google send these made-up requests. We have enough problems with dodgy bots without Google and Facebook adding their own to the mix. On one of my sites, probably the majority of dodgy requests come from Googlebot.

Still on the 404s, I noticed that I was getting 404s for URLs that exist, but with a hash / fragment identifier on the end, e.g. /url/#respond. I was perplexed at how these could 404, since the URL was valid. Then, reading this post: Why URI-encoded ('#') anchors cause 404, and how to deal with it in JS?, it became obvious. A browser does not send the hash to the server. So if a client does send a request with a hash in it, it is requesting an actual address (file) with a hash in its name. So the 404s were quite correct. (It appears to be a bot sending these dodgy requests, as in the logs I would see a request for the page followed by requests for the URL with each of the anchors linked to on that page appended as a hash.)
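To illustrate the point, here is a small Python sketch using the standard urllib.parse module. The first function shows what a well-behaved client would actually put on the request line for a URL with a fragment; the second flags logged paths where a fragment has leaked through, either raw or encoded as %23. Both function names and the example.com URL are made up for illustration.

```python
from urllib.parse import urlsplit


def request_target(url):
    """Return the path (plus query) a well-behaved client sends for url.

    The fragment is split off by urlsplit and never becomes part of the
    request target.
    """
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return target


def has_leaked_fragment(logged_path):
    """True if a logged path contains a raw or percent-encoded '#'."""
    return "#" in logged_path or "%23" in logged_path


# A browser asked for /url/#respond only ever requests /url/.
print(request_target("http://example.com/url/#respond"))
```

Any logged 404 where has_leaked_fragment returns True is most likely one of these bot requests rather than a genuinely broken link.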

Checking my sites in Google and Bing Webmaster Tools, I noticed an unusual search impressions graph for one of them. Every weekend the searches drop off dramatically, then pick up again on Monday. I wonder if people only search for info on the subject that this website covers when they are at work?

Amazingly, I did manage to get through checking all my website stats today, despite doing some other stuff as well.
