Tell bots apart from human visitors for stats?

I am looking to roll my own simple web stats script.

The only major obstacle on the road, as far as I can see, is telling human visitors apart from bots. I would like to have a solution for that which I don't need to maintain on a regular basis (i.e. I don't want to update text files with bot-related User-agents).

Is there any open service that does that, like Akismet does for spam? Or is there a PHP project that is dedicated to recognizing spiders and bots and provides frequent updates?

To clarify: I'm not looking to block bots. I do not need 100% watertight results. I just want to exclude as many as I can from my stats. I know that parsing the User-Agent is an option, but maintaining the patterns to parse for is a lot of work. My question is whether there is any project or service that does that already.

Bounty: I thought I'd push this as a reference question on the topic. The best / most original / most technically viable contribution will receive the bounty amount.

Answers:

Answer

Humans and bots will do similar things, but bots will do things that humans don't. Let's try to identify those things. Before we look at behavior, let's accept RayQuang's comment as being useful. If a visitor has a bot's user-agent string, it's probably a bot. I can't imagine anybody going around with "Google Crawler" (or something similar) as a UA unless they're working on breaking something. I know you don't want to update a list manually, but auto-pulling that one should be good, and even if it stays stale for the next 10 years, it will be helpful.

Some have already mentioned Javascript and image loading, but Google will do both. We must assume there are now several bots that will do both, so those are no longer human indicators. What bots will still uniquely do, however, is follow an "invisible" link. Link to a page in a way that I, as a user, can't see. If that gets followed, we've got a bot.
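A minimal sketch of that trap, assuming an arbitrary trap URL (`/trap-page`) and a simple in-memory store; any hidden styling and any path a human would never reach will do:

```javascript
// Markup for the trap: a link no human will see or click.
// rel="nofollow" discourages polite crawlers from indexing it,
// but anything following it is still flagged.
const trapLink = '<a href="/trap-page" style="display:none" rel="nofollow">.</a>';

// Server side: any client that requests the trap URL gets flagged.
const flaggedBots = new Set();

function recordRequest(ip, path) {
  if (path === '/trap-page') flaggedBots.add(ip);
}

function isFlaggedBot(ip) {
  return flaggedBots.has(ip);
}
```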

Bots will often, though not always, respect robots.txt. Users don't care about robots.txt, and we can probably assume that anybody retrieving robots.txt is a bot. We can go one step further, though, and link a dummy CSS page to our pages that is excluded by robots.txt. If our normal CSS is loaded but our dummy CSS isn't, it's definitely a bot. You'll have to build a table (probably in-memory) of loads by IP and do a not-contained-in match, but that should be a really solid tell.

So, to use all this: maintain a database table of bots by IP address, possibly with timestamp limitations. Add anything that follows your invisible link, and add anything that loads the "real" CSS but ignores the robots.txt CSS. Maybe add all the robots.txt downloaders as well. Filter the user-agent string as the last step, and consider using it for a quick stats analysis to see how well those methods are working at identifying things we know are bots.
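A rough sketch of that per-IP evidence table; all field and function names here are illustrative, and a real deployment would use a database table with timestamps rather than an in-memory map:

```javascript
// Per-IP evidence table (in-memory; swap for a DB table in practice).
const visitors = new Map();

function note(ip, event) {
  const v = visitors.get(ip) ||
    { realCss: false, dummyCss: false, trapLink: false, robotsTxt: false };
  v[event] = true;
  visitors.set(ip, v);
}

// Apply the tells described above.
function looksLikeBot(ip) {
  const v = visitors.get(ip);
  if (!v) return false;
  if (v.trapLink) return true;                // followed the invisible link
  if (v.robotsTxt) return true;               // fetched robots.txt
  if (v.realCss && !v.dummyCss) return true;  // loaded real CSS but skipped
                                              // the robots.txt-excluded dummy
  return false;
}
```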

Answer

EDIT (10y later): As Lukas said in the comment box, almost all crawlers today support javascript so I've removed the paragraph that stated that if the site was JS based most bots would be auto-stripped out.

You can follow a bot list and add their user-agents to the filtering list.

Take a look at this bot list.

This user-agent list is also pretty good. Just strip out all the entries of type B (browsers) and you're set.

EDIT: Amazing work done by eSniff, who has made the above list available "in a form that can be queried and parsed more easily: robotstxt.org/db/all.txt. Each new bot is defined by a robot-id: XXX. You should be able to download it once a week and parse it into something your script can use", as you can read in his comment.

Hope it helps!

Answer

Consider a PHP stats script which is camouflaged as a CSS background image (send the right response headers, at least Content-Type and Cache-Control, but write an empty image out).

Some bots parse JS, but hardly any load CSS images. One pitfall, as with JS, is that you will also exclude text-based browsers with this, but that's less than 1% of the world wide web population. Also, there are certainly fewer CSS-disabled clients than JS-disabled clients (mobiles!).

To make it more solid against the (not unlikely) case that the more advanced bots (Google, Yahoo, etc.) crawl it in the future, disallow the path to the CSS image in robots.txt (which the better bots will respect anyway).
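A sketch of such a response, framed here as a plain function rather than a full server; the 1x1 transparent GIF bytes are a commonly used sequence, and the no-cache headers ensure every page view actually reaches the script:

```javascript
// Smallest widely-used transparent 1x1 GIF (a well-known byte sequence).
const EMPTY_GIF = Buffer.from(
  'R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7', 'base64');

// Log the hit, then answer as if we were a static image referenced
// from a stylesheet: correct Content-Type plus no-cache headers.
// logHit is your stats-recording callback (an assumption of this sketch).
function trackingResponse(logHit, ip, referer) {
  logHit(ip, referer);
  return {
    status: 200,
    headers: {
      'Content-Type': 'image/gif',
      'Content-Length': String(EMPTY_GIF.length),
      'Cache-Control': 'no-cache, no-store, must-revalidate',
    },
    body: EMPTY_GIF,
  };
}
```

Referencing the script as `background-image: url(/stats.php)` in a stylesheet then makes every CSS-loading client report in.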

Answer

I currently use AWStats and Webalizer to monitor my log files for Apache2, and so far they have been doing a pretty good job of it. If you would like, you can have a look at their source code, as it is an open source project.

You can get the source at http://awstats.sourceforge.net or alternatively look at the FAQ http://awstats.sourceforge.net/docs/awstats_faq.html

Hope that helps, RayQuang

Answer

Checking the user-agent will alert you to the honest bots, but not the spammers.

To tell which requests are made by dishonest bots, your best bet (based on this guy's interesting study) is to catch a Javascript focus event.

If the focus event fires, the page was almost certainly loaded by a human being.
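A sketch of that check, with the beacon sender injected so the one-shot logic can be shown outside a browser; the `/human-hit.gif` endpoint name is an assumption:

```javascript
// Returns a handler that reports "human" at most once.
// In a browser you would wire it up roughly like:
//   window.addEventListener('focus',
//     makeFocusReporter(u => navigator.sendBeacon(u)));
function makeFocusReporter(send, url = '/human-hit.gif') {
  let reported = false;
  return function onFocus() {
    if (reported) return;  // only the first focus event matters
    reported = true;
    send(url);             // e.g. navigator.sendBeacon or an Image() load
  };
}
```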

Answer

Record mouse movement and scrolling using javascript. You can tell from the recorded data whether it's a human or a bot. Unless the bot is really, really sophisticated and mimics human mouse movements.
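One hedged way to score such recordings: humans produce jittery, variable movement steps, while a scripted linear sweep is perfectly regular. The variance threshold below is a guess that would need tuning against real data:

```javascript
// Given recorded pointer samples [{x, y}, ...], measure how much the
// per-step movement distance varies. Scripted linear sweeps have
// near-zero variance; human movement jitters.
function looksHuman(samples, minVariance = 1.0) {
  if (samples.length < 3) return false;  // not enough data to judge
  const steps = [];
  for (let i = 1; i < samples.length; i++) {
    const dx = samples[i].x - samples[i - 1].x;
    const dy = samples[i].y - samples[i - 1].y;
    steps.push(Math.hypot(dx, dy));
  }
  const mean = steps.reduce((a, b) => a + b, 0) / steps.length;
  const variance =
    steps.reduce((a, d) => a + (d - mean) ** 2, 0) / steps.length;
  return variance > minVariance;
}
```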

Answer

Prerequisite - referrer is set

apache level:

LogFormat "%U %{Referer}i %{%Y-%m-%d %H:%M:%S}t" human_log
RewriteRule ^/human/(.*)   /b.gif [L]
SetEnv human_log_session 0

# using referrer
SetEnvIf Referer "^http://yoursite.com/" human_log_session=1

SetEnvIf Request_URI "^/human/(.*).gif$" human_dolog=1
SetEnvIf human_log_session 0 !human_dolog
CustomLog logs/human-access_log human_log env=human_dolog

In the web page, embed a /human/$hashkey_of_current_url.gif.
If the visitor is a bot, it is unlikely to have the referrer set (this is a grey area).
If the page is hit directly using the browser address bar, it will not be included.

At the end of each day, /human-access_log should contain all the referrers that are actually human page-views.

To play safe, the hash of the referrer from the Apache log should tally with the image name.

Answer

Now we have all kinds of headless browsers: Chrome, Firefox and others that will execute whatever JS you have on your site. So any JS-based detection won't work.

I think the most reliable way would be to track behavior on the site. If I were writing a bot and wanted to bypass checks, I would mimic scroll, mouse move, hover, browser history etc. events just with headless Chrome. To take it to the next level, even if headless Chrome adds some hints about "headless" mode to the request, I could fork the Chrome repo, make changes and build my own binaries that leave no trace.

I think this may be the closest answer to real detection of whether a visitor is human, with no action required from the visitor:

https://developers.google.com/recaptcha/docs/invisible

I'm not sure of the techniques behind this, but I believe Google has done a good job by analyzing billions of requests with their ML algorithms to detect whether the behavior is human-ish or bot-ish.

While it's an extra HTTP request, it won't detect quickly bounced visitors, so that's something to keep in mind.

Answer

Have a 1x1 gif in your pages that you keep track of. If it's loaded, then it's likely a browser. If it's not loaded, it's likely a script.

Answer

Sorry, misunderstood. You may try another option I have set up at my site: create a non-linked webpage with a hard/strange name and log visits to this page separately. Most if not all of the visitors to this page will be bots; that way you'll be able to create your bot list dynamically.

Original answer follows (getting negative ratings!)

The only reliable way to tell bots from humans are [CAPTCHAS][1]. You can use [reCAPTCHA][2] if it suits you.

[1]: http://en.wikipedia.org/wiki/Captcha
[2]: http://recaptcha.net/

Answer

You could exclude all requests that come from a User Agent that also requests robots.txt. All well behaved bots will make such a request, but the bad bots will escape detection.

You'd also have problems with false positives - as a human, it's not very often that I read a robots.txt in my browser, but I certainly can. To avoid these incorrectly showing up as bots, you could whitelist some common browser User Agents, and consider them to always be human. But this would just turn into maintaining a list of User Agents for browsers instead of one for bots.

So, this did-they-request-robots.txt approach certainly won't give 100% watertight results, but it may provide some heuristics to feed into a complete solution.
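A sketch of that heuristic, combining the robots.txt signal with the browser whitelist the answer proposes; the whitelist patterns here are illustrative and carry exactly the maintenance burden the answer warns about:

```javascript
// User-agents observed fetching robots.txt.
const robotsTxtAgents = new Set();

// Illustrative whitelist of browser UA fragments treated as human.
const browserWhitelist = [/Firefox\//, /Chrome\//, /Safari\//];

function observe(userAgent, path) {
  if (path === '/robots.txt') robotsTxtAgents.add(userAgent);
}

// Whitelisted browsers always count as human (avoids the false
// positive of a curious human reading robots.txt); otherwise,
// anyone who fetched robots.txt is excluded from the stats.
function countAsHuman(userAgent) {
  if (browserWhitelist.some(re => re.test(userAgent))) return true;
  return !robotsTxtAgents.has(userAgent);
}
```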

Answer

The easiest way is to check if their user-agent includes 'bot' or 'spider'. Most do.

Answer

I use the following for my stats/counter app:

<?php
    function is_bot($user_agent) {
        return preg_match('/(abot|dbot|ebot|hbot|kbot|lbot|mbot|nbot|obot|pbot|rbot|sbot|tbot|vbot|ybot|zbot|bot\.|bot\/|_bot|\.bot|\/bot|\-bot|\:bot|\(bot|crawl|slurp|spider|seek|accoona|acoon|adressendeutschland|ah\-ha\.com|ahoy|altavista|ananzi|anthill|appie|arachnophilia|arale|araneo|aranha|architext|aretha|arks|asterias|atlocal|atn|atomz|augurfind|backrub|bannana_bot|baypup|bdfetch|big brother|biglotron|bjaaland|blackwidow|blaiz|blog|blo\.|bloodhound|boitho|booch|bradley|butterfly|calif|cassandra|ccubee|cfetch|charlotte|churl|cienciaficcion|cmc|collective|comagent|combine|computingsite|csci|curl|cusco|daumoa|deepindex|delorie|depspid|deweb|die blinde kuh|digger|ditto|dmoz|docomo|download express|dtaagent|dwcp|ebiness|ebingbong|e\-collector|ejupiter|emacs\-w3 search engine|esther|evliya celebi|ezresult|falcon|felix ide|ferret|fetchrover|fido|findlinks|fireball|fish search|fouineur|funnelweb|gazz|gcreep|genieknows|getterroboplus|geturl|glx|goforit|golem|grabber|grapnel|gralon|griffon|gromit|grub|gulliver|hamahakki|harvest|havindex|helix|heritrix|hku www octopus|homerweb|htdig|html index|html_analyzer|htmlgobble|hubater|hyper\-decontextualizer|ia_archiver|ibm_planetwide|ichiro|iconsurf|iltrovatore|image\.kapsi\.net|imagelock|incywincy|indexer|infobee|informant|ingrid|inktomisearch\.com|inspector web|intelliagent|internet shinchakubin|ip3000|iron33|israeli\-search|ivia|jack|jakarta|javabee|jetbot|jumpstation|katipo|kdd\-explorer|kilroy|knowledge|kototoi|kretrieve|labelgrabber|lachesis|larbin|legs|libwww|linkalarm|link validator|linkscan|lockon|lwp|lycos|magpie|mantraagent|mapoftheinternet|marvin\/|mattie|mediafox|mediapartners|mercator|merzscope|microsoft url control|minirank|miva|mj12|mnogosearch|moget|monster|moose|motor|multitext|muncher|muscatferret|mwd\.search|myweb|najdi|nameprotect|nationaldirectory|nazilla|ncsa beta|nec\-meshexplorer|nederland\.zoek|netcarta webmap engine|netmechanic|netresearchserver|netscoop|newscan\-online|nhse|nokia6682\/|nomad|noyona|nutch|nzexplorer|objectssearch|occam|omni|open text|openfind|openintelligencedata|orb search|osis\-project|pack rat|pageboy|pagebull|page_verifier|panscient|parasite|partnersite|patric|pear\.|pegasus|peregrinator|pgp key agent|phantom|phpdig|picosearch|piltdownman|pimptrain|pinpoint|pioneer|piranha|plumtreewebaccessor|pogodak|poirot|pompos|poppelsdorf|poppi|popular iconoclast|psycheclone|publisher|python|rambler|raven search|roach|road runner|roadhouse|robbie|robofox|robozilla|rules|salty|sbider|scooter|scoutjet|scrubby|search\.|searchprocess|semanticdiscovery|senrigan|sg\-scout|shai\'hulud|shark|shopwiki|sidewinder|sift|silk|simmany|site searcher|site valet|sitetech\-rover|skymob\.com|sleek|smartwit|sna\-|snappy|snooper|sohu|speedfind|sphere|sphider|spinner|spyder|steeler\/|suke|suntek|supersnooper|surfnomore|sven|sygol|szukacz|tach black widow|tarantula|templeton|\/teoma|t\-h\-u\-n\-d\-e\-r\-s\-t\-o\-n\-e|theophrastus|titan|titin|tkwww|toutatis|t\-rex|tutorgig|twiceler|twisted|ucsd|udmsearch|url check|updated|vagabondo|valkyrie|verticrawl|victoria|vision\-search|volcano|voyager\/|voyager\-hc|w3c_validator|w3m2|w3mir|walker|wallpaper|wanderer|wauuu|wavefire|web core|web hopper|web wombat|webbandit|webcatcher|webcopy|webfoot|weblayers|weblinker|weblog monitor|webmirror|webmonkey|webquest|webreaper|websitepulse|websnarf|webstolperer|webvac|webwalk|webwatch|webwombat|webzinger|wget|whizbang|whowhere|wild ferret|worldlight|wwwc|wwwster|xenu|xget|xift|xirq|yandex|yanga|yeti|yodao|zao\/|zippp|zyborg|\.\.\.\.)/i', $user_agent);
    }

    //example usage
    if (! is_bot($_SERVER["HTTP_USER_AGENT"])) echo "it's a human hit!";
?>

I removed a link to the original code source, because it now redirects to a food app.

Answer

Rather than trying to maintain an impossibly long list of spider User Agents, we look for things that suggest human behaviour. The principal one is that we split our session count into two figures: the number of single-page sessions and the number of multi-page sessions. We drop a session cookie and use that to determine multi-page sessions. We also drop a persistent "Machine ID" cookie; a returning user (Machine ID cookie found) is treated as a multi-page session even if they only view one page in that session. You may have other characteristics that imply a "human" visitor, such as the referrer being Google (although I believe the MS Search bot masquerades as a standard UserAgent, referred with a realistic keyword, to check that the site doesn't show different content [to that given to their bot], and that behaviour looks a lot like a human!)

Of course this is not infallible. In particular, if you have lots of people who arrive and "click off", it's not going to be a good statistic for you, nor if you have a predominance of people with cookies turned off (in our case they won't be able to use our [shopping cart] site without session cookies enabled).

Taking the data from one of our clients, we find that the daily single-session count is all over the place, an order of magnitude different from day to day; however, if we subtract 1,000 from the multi-page sessions per day, we then have a damn-near-linear rate of 4 multi-page sessions per order placed / two sessions per basket. I have no real idea what the other 1,000 multi-page sessions per day are!
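The session split described above can be sketched as follows; the cookie-derived fields are assumptions about what the log pipeline would record per session:

```javascript
// Classify a finished session. "Multi-page" means: more than one
// page view, OR a returning visitor (the persistent machine-id
// cookie was already present when the session started).
function classifySession({ pageViews, hadMachineIdCookie }) {
  if (pageViews > 1 || hadMachineIdCookie) return 'multi-page';
  return 'single-page';
}

function splitSessions(sessions) {
  const counts = { 'single-page': 0, 'multi-page': 0 };
  for (const s of sessions) counts[classifySession(s)]++;
  return counts;
}
```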

Answer

I'm surprised no one has recommended implementing a Turing test. Just have a chat box with human on the other end.

A programmatic solution just won't do: See what happens when PARRY Encounters the DOCTOR

These two 'characters' are both "chatter" bots that were written in the course of AI research in the '70s, to see how long they could fool a real person into thinking they were also a person. The PARRY character was modeled as a paranoid schizophrenic and THE DOCTOR as a stereotypical psychotherapist.

Here's some more background
