Aug 17, 2005 Nomadig, Technology, WordPress:

Detecting leeches with ShortStat

Sometimes I see a sudden jump in the hits through ShortStat, and I’m left to wonder whether some busy site has linked to Nomadig.com, a search engine robot has gone wild or the site has been sucked by an offline browser.

The spikes of the first kind are more than welcome. More traffic here, more potential audience to be served. The second is usually okay, but there are some search engines that nobody is really interested in and their robots are not behaving properly.

The third category drives me nuts: offline browsers are stupid enough to suck every possible page that has been linked. This causes problems when the same page can be reached through different URLs and the site uses these variations. The same page can be fetched several times, increasing the bandwidth usage, which at the end of the day costs me money. And these suckers are fast; they can fetch several pages in a second, eating bandwidth and processing power from the rest of you.

Until today, I wasn’t able to tell what caused the spike without downloading the raw log file and then trying to analyse it by finding an IP address that showed up again and again in a relatively short period. As the log file contains information about all downloads, including image, JavaScript and CSS files, the task is burdensome.

A week ago I had an a-ha moment when I suddenly realised that all the required information is stored in the ShortStat database. I could fetch the data with a relatively simple SQL statement and then organise it into a suitable structure for rapid analysis and print-out.

The process is very simple:

  1. Read all requests for the past X seconds from the database, organised by the IP number and the browser string.
  2. Count the number of requests per distinct IP number and browser string pair. Store the first and the last access time of these requests.
  3. Calculate the time span between the first and the last access, and use that number to calculate average hits per second.
  4. Show the user the number of hits, average hits per second, time span, IP address and the browser string, sorted by the number of hits. There is a lower limit for the hits to keep the list relatively short.
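
The four steps above can be sketched in a few lines. This is only an illustration in Python, not the code of leech.php: the function name find_leeches, its parameters and the (ip, browser, timestamp) tuple layout are all assumptions, and a burst that fits inside a single second is treated as spanning one second to avoid dividing by zero.

```python
def find_leeches(requests, now, window_seconds, min_hits):
    """Summarise requests per (IP, browser string) pair.

    requests: iterable of (ip, browser, unix_timestamp) tuples (assumed layout).
    Returns a list of dicts sorted by hit count, highest first.
    """
    cutoff = now - window_seconds

    # Steps 1 and 2: keep only recent requests, count them per pair and
    # remember the first and last access time of each pair.
    stats = {}
    for ip, browser, ts in requests:
        if ts < cutoff:
            continue
        entry = stats.setdefault((ip, browser), [0, ts, ts])
        entry[0] += 1
        entry[1] = min(entry[1], ts)
        entry[2] = max(entry[2], ts)

    # Step 3: time span and average hits per second.
    report = []
    for (ip, browser), (hits, first, last) in stats.items():
        if hits < min_hits:  # lower limit keeps the list short
            continue
        span = last - first
        # Assumption: a zero-length span counts as one second.
        rate = hits / max(span, 1)
        report.append({"hits": hits, "rate": rate, "span": span,
                       "ip": ip, "browser": browser})

    # Step 4: sort by the number of hits, highest first.
    report.sort(key=lambda row: row["hits"], reverse=True)
    return report
```

Calling it with window_seconds set to 24 * 3600 and min_hits set to 50 would roughly correspond to a 24-hour view with a limit of 50 hits, assuming the data has been read from the database into tuples first.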

At first I fancied doing the whole thing in a single SQL statement, but the time span calculation was an impossible feat for MySQL, at least without creating huge unions.
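
For comparison, here is a hedged sketch of how a grouped query with MIN and MAX can express the time span in an engine where timestamps are stored as plain integers. It runs against an in-memory SQLite database through Python, not against MySQL or the real ShortStat schema; the table name stats and the columns remote_ip, user_agent and dt are assumptions made for this example.

```python
import sqlite3

# Assumed schema: one row per request, dt holds a Unix timestamp.
QUERY = """
SELECT remote_ip,
       user_agent,
       COUNT(*)          AS hits,
       MIN(dt)           AS first_hit,
       MAX(dt)           AS last_hit,
       MAX(dt) - MIN(dt) AS span_seconds
FROM stats
WHERE dt >= ?
GROUP BY remote_ip, user_agent
HAVING COUNT(*) >= ?
ORDER BY hits DESC
"""


def leech_rows(conn, cutoff, min_hits):
    """Return one summary row per (IP, browser string) pair since cutoff."""
    return conn.execute(QUERY, (cutoff, min_hits)).fetchall()


def demo():
    """Build a tiny in-memory table and run the query against it."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE stats (remote_ip TEXT, user_agent TEXT, dt INTEGER)"
    )
    conn.executemany(
        "INSERT INTO stats VALUES (?, ?, ?)",
        [
            ("10.0.0.1", "Wget", 100),
            ("10.0.0.1", "Wget", 101),
            ("10.0.0.1", "Wget", 104),
            ("10.0.0.1", "Wget", 102),
            ("10.0.0.2", "Firefox", 100),
            ("10.0.0.2", "Firefox", 150),
            ("10.0.0.3", "Old", 1),
        ],
    )
    return leech_rows(conn, cutoff=50, min_hits=2)
```

The average hits per second is left out of the query on purpose, since dividing by a zero span needs a decision that is easier to make in application code.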

If you are interested in this, go to my ShortStat page at www.nomadig.com/shortstat/ and click the Leeches link at the top of the page to see who has been leeching Nomadig.com during the past 24, 48 or 72 hours.

I can make the code available, if someone is interested in using the system.

5 Comments

1. Dan Wolfgang — Saturday, Aug 20 2005

Janne, I’d love to get a hold of your leeches code. I’ve implemented all of your hacks, as well as a few minor ones of my own (http://www.danandsherree.com/2005/08/06/hacking_shortstat.php). My further-modified installation is running at http://www.danandsherree.com/shortstat/. I have some other ideas in mind I’d like to implement, too.

Maybe we should try getting together to release our hacked version?

2. Janne — Sunday, Aug 21 2005

Dan, the leech.php is now available for downloading at http://www.nomadig.com/downloads/leech_php.txt .

You have to add the following directives to configuration.php:
$SI_display['dateformat'] = 'M j';
$SI_display['timeformat'] = 'H:i';
$SI_display['whoisurl'] = 'http://www.samspade.org/t/lookat?a=%i';
$SI_display['maxleechdays'] = 3;
$SI_display['leechlimit'] = 50;

Some of those may already be there, as you have used some other hacks of mine.

Shaun has indeed been quite slow in releasing the promised new version of ShortStat. I have come to understand that Mint (haveamint.com) may be the next ShortStat. I haven’t been that interested in the next version of ShortStat after seeing it working on hicksdesign.co.uk: too complicated, and no possibility to have all the information at a glance.

Currently I’m too busy to get anything into releasable shape, and ShortStat would benefit a lot from modularity and configurability. These unsexy features should be implemented, and it takes quite a while to do them properly. The database schema is also burdensome, as the installation slows down rapidly when the amount of data grows.

3. Dan Wolfgang — Monday, Aug 22 2005

Thanks for the code; it works well.

I’ve been excitedly waiting for Mint, but… gee, it’s taking a long while.

4. danandsherree.com — Friday, Aug 26 2005

Goodbye ShortStat, Hello SlimStat!

I’ve only been using ShortStat for a few weeks and I’ve come to really like it. It’s a very simple…

5. Janne — Friday, Aug 26 2005

Dan, thanks for the tip. I have to study SlimStat in more detail when I have enough time (any century now :) for a careful analysis, as I don’t want to lose my stats history.
