
Tuesday 29 November 2016

Beating stupid surveillance laws : let's all be villains !

OK, I'm totally learning how to write a web crawler... maybe. Get everyone secretly looking at absurdly depraved and dangerous content and surveillance becomes meaningless. It wouldn't be that hard to do for a simple test. What would be more complex would be something that could fool a "big data" analysis. It would have to do the following, depending on what precisely the ISPs have to store (a rough sketch follows the list) :
- Mimic user's browsing habits, e.g. spending a realistic amount of time on different sites, coming back to some repeatedly and visiting others for mere seconds before never coming back again
- Be randomised, possibly by starting with a very large list of sites and then visiting all the links on every page in a pseudo-random order, plus using search engines to look for random combinations of naughty words; thus over time each individual user would have a unique web history
- The file itself would have to have a random name so that it couldn't be easily identified by anyone snooping on the computer
- It would have to download and store material from the web in a different directory on the user's computer so that anyone snooping the machine would think the user was genuinely interested in specific content
- Vary between users which themes of criminal activity get searched for, e.g. home bomb-making equipment versus outlandish pornography
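
Purely as an illustration of the list above, a minimal Python sketch might look something like this, assuming the third-party requests library. The seed list, the "naughty" search terms and the search endpoint are all placeholders, and it only gestures at the timing and randomisation requirements :

    # Minimal noise-crawler sketch : visit sites from a per-user seed list,
    # dwell for a pseudo-random time, follow links at random, and occasionally
    # throw a random query at a search engine. Placeholders throughout.
    import random
    import re
    import time

    import requests   # third-party; pip install requests

    SEEDS = ["https://www.bbc.com/news",
             "https://en.wikipedia.org/wiki/Special:Random"]    # placeholder seeds
    NAUGHTY_WORDS = ["fertiliser", "detonator", "untraceable"]  # placeholder terms

    def extract_links(html):
        # Crude href extraction - good enough for generating noise
        return re.findall(r'href="(https?://[^"]+)"', html)

    def browse(url, depth=0):
        # Visit a page, linger a realistic-looking time, maybe follow a random link
        try:
            page = requests.get(url, timeout=10)
        except requests.RequestException:
            return
        time.sleep(random.uniform(2, 120))
        links = extract_links(page.text)
        if links and depth < 5 and random.random() < 0.7:
            browse(random.choice(links), depth + 1)

    def random_search():
        # Pseudo-random combination of terms, so each user's history is unique
        query = "+".join(random.sample(NAUGHTY_WORDS, 2))
        browse("https://duckduckgo.com/html/?q=" + query)

    if __name__ == "__main__":
        while True:
            if random.random() < 0.2:
                random_search()
            else:
                browse(random.choice(SEEDS))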

And then there's the very real concern that this would help genuine criminals hide from surveillance.

Anyway, there's the idea now thrown out in the public domain. Or, if we're lucky, the Parliamentary debate will see reason and (somehow) scale back the worst parts of this Act.
http://www.bbc.com/news/technology-38134560

9 comments:

  1. I had wondered about a similar thing. Probably a little box (RPi or similar) which sits on my router and just makes web requests. I suspect you don't even need to load pages; I'd imagine requesting headers alone will do (I know a guy who runs an ISP, I can ask him). Quite a fun little project, I thought. Writing the rules for what's loaded, when, and what search terms are used and so on would be interesting.
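
    A minimal sketch of that header-only idea, assuming the third-party requests library (the site list and the timings are placeholders) :

        # Little-box noise generator : fire off HEAD requests at irregular
        # intervals, day and night, without downloading page bodies.
        import random
        import time

        import requests   # third-party; pip install requests

        SITES = ["https://example.com", "https://example.org"]   # placeholder list

        while True:
            url = random.choice(SITES)
            try:
                requests.head(url, timeout=10, allow_redirects=True)
            except requests.RequestException:
                pass
            time.sleep(random.uniform(5, 600))   # irregular gaps between requests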

    I don't think you need to worry about hiding things locally - if the security services want your local machine, they'll just take it, and such software is easy to spot running. I think what's important is saturating the ADSL (or fibre or whatever) link with noise, seeing as all the monitoring is happening at the other end of it. A small, low-powered computer can do that 24/7 whether my computer is on or not.

    If genuine criminals want to hide it's hardly difficult for them to do that. VPNs, proxies, TOR, etc.

  2. Good points, makes me feel less guilty about proposing the idea. :)

    If the ISPs only store which sites were visited, and not when or for how long, then it becomes trivial - just conduct vast numbers of searches for random offensive content (or not even that - if headers alone are enough, then sheer search volume will do it).

  3. I maintained a University web proxy (under a contractual obligation to keep logs) in a previous life. Access logs will typically identify the originating IP address, time (at millisecond or better accuracy), request method, URL, bytes supplied and time taken for the request to complete. So requesting headers (a HEAD request) is trivially identifiable. Truncated requests (i.e. requesting the page but aborting during the download) would also be fairly easily detected. The time spent on a site or page isn't logged directly, but can be inferred from the pattern of web requests. Similarly, if your crawler doesn't request all page dependencies (images, css resources, json etc), that can also be detected fairly easily from the logs.
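
    To illustrate that last point : a crawler whose requests were meant to look browser-like in such logs would have to pull down the page dependencies as well. A rough sketch, assuming the third-party requests library and a placeholder URL :

        # Fetch a page and then the resources a real browser would also request
        # (images, CSS, scripts), so the request pattern resembles a human visit.
        import re
        from urllib.parse import urljoin

        import requests   # third-party; pip install requests

        def fetch_like_a_browser(url):
            page = requests.get(url, timeout=10)
            # Crude extraction of likely page dependencies from the HTML
            deps = re.findall(
                r'(?:src|href)="([^"]+\.(?:png|jpe?g|gif|css|js|ico))"', page.text)
            for dep in deps:
                try:
                    requests.get(urljoin(url, dep), timeout=10)
                except requests.RequestException:
                    pass

        fetch_like_a_browser("https://example.com")   # placeholder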

    In terms of fooling a snooper who's looking specifically at you, making a convincing fake web crawler would be very hard. As for swamping the snoopers, I suspect that's got more utility, but my guess is that GCHQ would probably just get a big pile more money handed to them while lots of the cost would be externalised to the ISPs and thence to their customers.

    I suspect the best thing you can do without going all TOR on them is to try to ensure you use SSL protected variants of sites as much as possible. That reduces the amount of data that's available for logging - in the extreme case, probably just time/source IP/dest IP/bytes transferred.

    Hmmm... a crawler that predominantly crawled SSL resources would be interesting. If you avoided DNS requests (or did them in bulk, out of sync with the crawling activity) and varied the speed of downloading what you requested, it would be very much harder to determine what was going on.
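
    Something along those lines might look like the sketch below, assuming the third-party requests library and placeholder URLs : resolve the hostnames in one batch, then later fetch only HTTPS resources while varying the download speed at random. (Note that requests will still do its own name lookups at fetch time unless the local resolver has cached them or you connect by IP.)

        # Bulk DNS up front, decoupled from the crawl; then HTTPS-only fetches
        # with the effective download speed varied at random.
        import random
        import socket
        import time
        from urllib.parse import urlparse

        import requests   # third-party; pip install requests

        HTTPS_URLS = ["https://en.wikipedia.org/wiki/Main_Page"]   # placeholder

        for host in {urlparse(u).hostname for u in HTTPS_URLS}:
            socket.getaddrinfo(host, 443)            # one batch of DNS lookups

        time.sleep(random.uniform(60, 3600))         # crawl later, out of sync

        for url in HTTPS_URLS:
            with requests.get(url, stream=True, timeout=10) as resp:
                for chunk in resp.iter_content(chunk_size=4096):
                    time.sleep(random.uniform(0, 0.5))   # uneven, human-ish pacing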

  4. Alun Jones Cong Ma The idea is that it should be used en masse, otherwise it doesn't work. The goal is not to completely prevent the detection of criminal activity (if the full resources of GCHQ are directed against an individual, I would expect and hope that they should be able to determine if that individual is a criminal). Instead, it should be to make the mass collection of data without prior evidence of criminal intent pointless, so that it would become much more efficient to conduct targeted searches of much smaller numbers of people (sure, you could collect mountains of data, but sifting through it would take far more effort than conducting a proper criminal investigation). And of course to protect private data that's non-criminal, but which the government has no right to and which could be potentially harmful if leaked.

  5. Rhys Taylor I appreciate that; I was mainly describing the type of data that are kept and the difficulty of making a program act like a human well enough that another program couldn't discard it with ease.

  6. Alun Jones Oh, that wasn't intended as a rebuttal of your very interesting comment, just a clarification of the goal. :)

  7. It would also have to run JavaScript as if it were a web browser. Most modern site designs consist of a minimal page skeleton, with the rest all created by JS.

    For example, my comment section on my web site (> 250K visitors a year) never gets spammed by a bot, yet it is wide open and registration-free. It's because it is created by JS and the content comes from an Ajax call, and bots typically will not run that code. If they did, I could trivially make them spin endlessly in JS, and fill up their hard disk or RAM until they crash.
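
    For what it's worth, a crawler that actually executed a page's JS (and hence its Ajax calls) would need a real browser engine driving it. A minimal sketch using the third-party Playwright package, assuming it and its bundled Chromium are installed (the URL is a placeholder) :

        # Drive a real headless browser so the page's JS and Ajax calls get run.
        from playwright.sync_api import sync_playwright

        with sync_playwright() as p:
            browser = p.chromium.launch(headless=True)
            page = browser.new_page()
            page.goto("https://example.com")          # placeholder
            page.wait_for_load_state("networkidle")   # wait for Ajax-built content
            html = page.content()                     # the DOM as constructed by JS
            browser.close()

        print(len(html), "characters of rendered HTML")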

  8. It's pretty easy to just pass URLs to a text-mode browser like links, which will render JS happily enough. Interacting with a page is a bit trickier, but that wouldn't be needed in this instance - although if it were, it's not all that hard.

    The thing is, there's no need to write tools which make requests or load pages. All that's needed is to find out what level of logging happens - that determines which tools are needed (wget, curl, links, elinks, etc.) - and then comes the fun bit, which is figuring out what URLs to visit, what searches to make and so on.


