Detecting leeches with ShortStat
Sometimes I see a sudden jump in the hits through ShortStat, and I’m left to wonder whether some busy site has linked to Nomadig.com, a search engine robot has gone wild or the site has been sucked by an offline browser.
Spikes of the first kind are more than welcome: more traffic here means more potential audience to be served. The second kind is usually okay, but there are some search engines that nobody is really interested in, and their robots do not always behave properly.
The third category drives me nuts: the offline browsers are stupid enough to suck down every possible page that is linked. This causes problems when the same page can be reached through different URLs and the site uses these variations. The same page can then be fetched several times, increasing the bandwidth usage that at the end of the day costs me money. And these suckers are fast; they can fetch several pages in a second, taking bandwidth and processing power away from the rest of you.
Until today, I wasn’t able to tell what caused a spike without downloading the raw log file and then trying to analyse it by finding an IP address that showed up again and again within a relatively short period. As the log file contains information about all downloads, including image, JavaScript and CSS files, the task is burdensome.
A week ago I had that a-ha moment when I suddenly realised that all the required information is stored in the ShortStat database. I could fetch the data with a relatively simple SQL statement and then organise it into a suitable structure for rapid analysis and print-out.
The process is very simple:
- Read all requests for the past X seconds from the database, organised by the IP number and the browser string.
- Count the number of requests per distinct IP number and browser string pair. Store the first and the last access time of these requests.
- Calculate the time span between the first and the last access, and use that number to calculate average hits per second.
- Show the user the number of hits, average hits per second, time span, IP address and browser string, sorted by the number of hits. There is a lower limit for the hits to keep the list relatively short.
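The steps above can be sketched roughly as follows. This is only an illustration in Python, not the actual leech.php (which is PHP and reads its rows from the ShortStat database). The function name find_leeches, the (ip, user_agent, timestamp) tuple layout and the min_hits default are assumptions made for the example, as is the choice to treat a burst that falls entirely within one second as spanning one second.

```python
def find_leeches(requests, min_hits=50):
    """Group requests by (IP, browser string) and rank the heaviest hitters.

    `requests` is an iterable of (ip, user_agent, unix_timestamp) tuples,
    assumed to be already limited to the past X seconds.
    `min_hits` is the lower limit that keeps the list short (hypothetical default).
    """
    groups = {}
    for ip, agent, ts in requests:
        key = (ip, agent)
        if key not in groups:
            # [hit count, first access, last access]
            groups[key] = [0, ts, ts]
        entry = groups[key]
        entry[0] += 1
        entry[1] = min(entry[1], ts)
        entry[2] = max(entry[2], ts)

    rows = []
    for (ip, agent), (hits, first, last) in groups.items():
        if hits < min_hits:
            continue  # below the lower limit, not shown
        span = last - first
        # Assumption: a zero-second span counts as one second,
        # which avoids dividing by zero.
        hits_per_sec = hits / max(span, 1)
        rows.append({
            "hits": hits,
            "hits_per_sec": hits_per_sec,
            "span": span,
            "ip": ip,
            "agent": agent,
        })

    # Heaviest hitters first.
    rows.sort(key=lambda row: row["hits"], reverse=True)
    return rows


if __name__ == "__main__":
    # Made-up sample data: one fast fetcher and one ordinary visitor.
    sample = [("10.0.0.1", "Leech/1.0", 1000 + i // 2) for i in range(61)]
    sample += [("10.0.0.2", "Mozilla/5.0", 1000 + 600 * i) for i in range(3)]
    for row in find_leeches(sample):
        print(row["hits"], round(row["hits_per_sec"], 2), row["span"],
              row["ip"], row["agent"])
```

Keeping only the count and the first and last timestamps per pair means the whole result set can be processed in a single pass, whatever order the rows arrive in.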
At first I fancied doing the whole thing in a single SQL statement, but the time span calculation was an impossible feat for MySQL, at least without creating huge unions.
If you are interested in this, go to my ShortStat page at www.nomadig.com/shortstat/ and click the Leeches link at the top of the page to see who has been leeching Nomadig.com during the past 24, 48 or 72 hours.
I can make the code available, if someone is interested in using the system.
1. Dan Wolfgang — Saturday, Aug 20 2005
Janne, I’d love to get a hold of your leeches code. I’ve implemented all of your hacks, as well as a few minor ones of my own (http://www.danandsherree.com/2005/08/06/hacking_shortstat.php). My further-modified installation is running at http://www.danandsherree.com/shortstat/. I have some other ideas in mind I’d like to implement, too.
Maybe we should try getting together to release our hacked version?
2. Janne — Sunday, Aug 21 2005
Dan, the leech.php is now available for downloading at http://www.nomadig.com/downloads/leech_php.txt .
You have to add the following directives to configuration.php:
$SI_display['dateformat'] = 'M j';
$SI_display['timeformat'] = 'H:i';
$SI_display['whoisurl'] = 'http://www.samspade.org/t/lookat?a=%i';
$SI_display['maxleechdays'] = 3;
$SI_display['leechlimit'] = 50;
Some of those may already be there, as you have used some other hacks of mine.
Shaun has indeed been quite slow in releasing the promised new version of ShortStat. As I understand it, Mint (haveamint.com) may be the next ShortStat. I haven’t been that interested in the next version since seeing it at work on hicksdesign.co.uk: too complicated, and no way to see all the information at a glance.
Currently I’m too busy to get anything into releasable shape, and ShortStat would benefit a lot from modularity and configurability. These unsexy features should be implemented, and it takes quite a while to do them properly. The database schema is also burdensome, as the installation slows down rapidly when the amount of data grows.
3. Dan Wolfgang — Monday, Aug 22 2005
Thanks for the code; it works well.
I’ve been excitedly waiting for Mint, but… gee, it’s taking a long while.
4. danandsherree.com — Friday, Aug 26 2005
Goodbye ShortStat, Hello SlimStat!
I’ve only been using ShortStat for a few weeks and I’ve come to really like it. It’s a very simple…
5. Janne — Friday, Aug 26 2005
Dan, thanks for the tip. I’ll have to study SlimStat in more detail when I have enough time (any century now :) for a careful analysis, as I don’t want to lose my stats history.