Sunday, December 24, 2023

Can AI replace Captcha click farms?

The code and demo for this blog is available here on my Hugging Face Space

I recently took some great AI training (here and here). One of the courses is introduced with this xkcd joke from 2014:



It made me think: what task was considered impossible for a computer to accomplish in 2014, and can now be easily achieved using AI? 🤔

2014 was the year Google released reCaptcha v2 which became, by far, the most popular captcha solution. We've all seen these prompts:

To illegally promote companies or products, influence elections, or launder money, hackers need to create thousands of fake accounts. Captchas are meant to prevent this by verifying that a real human is behind the screen. If a computer can pass the Captcha test, a hacker can automate the account creation process. I haven't tested these illegal services, but it seems hackers hire Captcha click farms, where humans solve the Captcha tests for a fee ($3 per 1,000 captchas according to this F5 Labs article).

Now that we have self-driving cars on the road, it is no secret that computers can recognize traffic lights. However, these cars (e.g. Tesla and Waymo) are equipped with expensive hardware and backed by top engineering teams. What I was actually interested in was whether I could easily build a cost-efficient AI tool to solve a Captcha test.

The Captcha test splits an image into a 4x4 grid of 16 squares and asks users which squares contain a specific object (e.g. traffic lights). One approach would be to run an image classification model 16 times, once on each square. But this would not be very accurate: a traffic light often spans multiple squares, and a classifier may not recognize a partial traffic light as one.
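For illustration, here is a minimal sketch of that per-square idea. Only the tiling logic is real code; the `contains_traffic_light` classifier in the comments is hypothetical:

```python
def tile_boxes(width, height, grid=4):
    """Split an image into grid x grid crop boxes as (left, top, right, bottom)."""
    tw, th = width // grid, height // grid
    return [(c * tw, r * th, (c + 1) * tw, (r + 1) * th)
            for r in range(grid) for c in range(grid)]

# With a (hypothetical) classifier contains_traffic_light(crop) -> bool,
# the naive approach would run it once per square:
#   answers = [contains_traffic_light(img.crop(box)) for box in tile_boxes(*img.size)]
boxes = tile_boxes(1024, 1024)
print(len(boxes), boxes[0])  # 16 squares; the first box is (0, 0, 256, 256)
```

A traffic light sitting on a tile boundary would be cut into fragments that no single call sees whole, which is exactly the weakness described above.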


It turns out humans are not the only ones struggling with those 😅

My first approach: train my own AI model 

I started by collecting pictures online and training my own model to solve the test. I used the duckduckgo-search python library with keyword searches such as:

  • cars and trucks on a road with a traffic light
  • cars at a traffic junction
  • cars and buses on a road at a stop
  • car traffic taken from dash cam
Here is a code snippet for this:

from duckduckgo_search import DDGS

def search_images(keywords, max_images=1):
    # Query DuckDuckGo image search and return the image URLs
    with DDGS() as ddgs:
        return [r['image'] for r in ddgs.images(
            keywords=keywords,
            type_image='photo',
            max_results=max_images)]

It was time-consuming to find relevant photos and label them. Some were aerial pictures, some had watermarks, some were artistic photos, some were pictures of a dash cam instead of pictures taken by a dash cam, and others were just irrelevant or duplicates. After quite some work, I was left with only about 30 decent pictures.

This model did not perform well, one reason being that the dataset was too small. I tried to find an existing dataset, but it wasn't that easy. I submitted an application to access the Cityscapes dataset, but by the time my access was granted, I already had another working solution.

Second Attempt: Use a pre-trained Image Segmentation Model

I looked on Hugging Face for an existing image segmentation model and found one from Nvidia that is pre-trained on the Cityscapes dataset. I tested it and found that it identifies traffic lights very well. I wrote a script that follows these steps:

  • resize the image to 1024x1024 as expected by the model
  • use the transformers python library to instantiate the model and pass the resized picture as input
  • as the model returns a list of masks for each object it can identify, I take the mask for the traffic lights and ignore the others
  • finally, I calculate which of the 16 squares the traffic lights fall into.

You can see the code here.
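As an illustration, the last step (mapping the traffic-light mask to the 16 squares) can be sketched in plain Python. The commented model-loading lines are an assumption on my part — the exact checkpoint name and the way masks are returned depend on the model you pick:

```python
# Hypothetical model-loading step (requires the transformers library; the
# checkpoint name below is an assumed example of a Cityscapes-trained SegFormer):
#   from transformers import pipeline
#   segmenter = pipeline("image-segmentation",
#                        model="nvidia/segformer-b1-finetuned-cityscapes-1024-1024")
#   masks = segmenter(resized_image)  # one mask per class; keep "traffic light"

def squares_with_hits(mask, grid=4):
    """Given a 2D boolean mask (e.g. the traffic-light mask), return the
    (row, col) grid squares that contain at least one positive pixel."""
    h, w = len(mask), len(mask[0])
    return sorted({(y * grid // h, x * grid // w)
                   for y in range(h) for x in range(w) if mask[y][x]})

# Tiny synthetic 8x8 mask with pixels in the top-left and bottom-right corners:
mask = [[False] * 8 for _ in range(8)]
mask[0][0] = mask[7][7] = True
print(squares_with_hits(mask))  # [(0, 0), (3, 3)]
```

Because the mask covers the whole image, a traffic light that spans several squares is naturally reported in each of them — the weakness of the per-square classifier does not apply here.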

I built this demo using Gradio (which you can find on my HuggingFace Space). Feel free to try it out but expect to wait a minute or so for the app to start, and it will be slow because it is running on a CPU. The white squares indicate where the traffic lights are located.


Conclusion: Can my model beat the click-farm?

I ran some basic performance tests on various hardware from Hugging Face and RunPod, then calculated the cost of solving 1,000 Captcha tests.

| Provider | Device | Specs | Time to Solve (s) / Image | Hardware Cost ($) / Hour | Cost ($) / 1,000 Captchas |
| --- | --- | --- | --- | --- | --- |
| Apple | Macbook | M3 Pro, 18 GB RAM | 9 | – | – |
| Hugging Face | Base CPU | 2 vCPU, 16 GB RAM | 35 | – | – |
| Hugging Face | CPU upgrade | 8 vCPU, 32 GB RAM | 24 | 0.03 | 0.20 |
| Hugging Face | Nvidia T4 medium | 8 vCPU, 30 GB RAM, 16 GB VRAM | 1.2 | 0.90 | 0.30 |
| RunPod | RTX 4000 Ada | 9 vCPU, 50 GB RAM | 1 | 0.15 | 0.05 |
| RunPod | RTX 4090 | 16 vCPU, 46 GB RAM, 24 GB VRAM | 1.0 | 0.39 | 0.11 |
| RunPod | RTX 6000 Ada | 22 vCPU, 50 GB RAM | 0.85 | 0.69 | 0.16 |

As expected, the model runs much faster on GPU than on CPU. The GPUs are more expensive per hour, but cheaper per 1,000 Captcha images. Assuming a hacker would not care about security and reliability (which is debatable), I used spot (i.e. interruptible), community (i.e. run by third parties) GPU instances.

The cheapest option I found was the RTX 4000 Ada, which can process one image per second and costs $0.15 per hour, which translates to a cost of about $0.05 per 1,000 captchas processed.
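The conversion behind these numbers is simple arithmetic; here is a small sketch using the figures from the table:

```python
def cost_per_1000_captchas(seconds_per_image, usd_per_hour):
    """Hours needed to process 1,000 images, times the hourly hardware price."""
    return seconds_per_image * 1000 / 3600 * usd_per_hour

# RTX 4000 Ada: 1 s/image at $0.15/hour -> roughly $0.04-0.05 per 1,000 captchas
print(round(cost_per_1000_captchas(1.0, 0.15), 3))
# Nvidia T4 medium: 1.2 s/image at $0.90/hour -> $0.30 per 1,000 captchas
print(round(cost_per_1000_captchas(1.2, 0.90), 2))
```

The same formula reproduces the other rows of the table, e.g. the 8 vCPU "CPU upgrade" at 24 s/image and $0.03/hour gives $0.20 per 1,000.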

At $0.05 per 1,000, this actually undercuts the click farms' rate of $3 per 1,000 captchas (i.e. $0.003 per captcha), and I haven't even tried hard to optimize my model.

Captcha services like Google reCaptcha or hCaptcha have other security controls and risk assessment methods, which are out of scope for this blog. My focus here is only on the traffic light identification task, as it is really just an excuse for me to practice what I learned in AI courses.

Still, if you think of security as an onion, captchas look like a very thin outer layer. They may stop unskilled hackers, but they won't hold against a motivated hacker with basic AI knowledge.
