Admin on the slrpnk.net Lemmy instance.

He/Him or whatever you feel like.

XMPP: povoq@slrpnk.net

Avatar is an image of a baby octopus.

  • 0 Posts
  • 4 Comments
Joined 2 years ago
Cake day: September 19th, 2022

  • Yeah, Forgejo and Gitea. I think insufficient caching on the side of these git forges is part of what makes it especially bad, but in the end that is victim blaming 🫠

    Mlmym seems to be the target because it is mostly JavaScript-free and therefore easier to scrape, I think. But the other Lemmy frontends are not well protected either. Lemmy-ui doesn’t even let you easily add a custom robots.txt; you have to override it manually in the reverse proxy.


  • It seems they quickly circumvent any solution that is somewhat easy to implement. Some of the bots do respect robots.txt, though, if you explicitly add their self-reported user-agent (but they change it from time to time). This repo has a regularly updated list: https://github.com/ai-robots-txt/ai.robots.txt/

    In my experience, git forges are hit especially hard, and the only real solution I found is to put a login wall in front, which kinda sucks, especially for open-source projects you want to self-host.

    Oh and recently the mlmym (old reddit) frontend for Lemmy seems to have started attracting AI scraping as well. We had to turn it off on our instance because of that.
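
To illustrate the robots.txt point above, here is a minimal Python sketch of listing self-reported user-agents explicitly and checking the result with the standard library's `urllib.robotparser`. The agent names are just a small hand-picked sample (not the actual list from the linked repo), and the helper names `build_robots_txt` and `is_allowed` are made up for this example. It only models well-behaved crawlers; robots.txt is advisory and does nothing against bots that ignore it.

```python
from urllib.robotparser import RobotFileParser

# Assumption: a small illustrative sample of self-reported crawler names.
# The linked repo maintains a much longer, regularly updated list.
BLOCKED_AGENTS = ["GPTBot", "CCBot", "ClaudeBot"]


def build_robots_txt(agents):
    """Build a robots.txt body that disallows each listed agent entirely."""
    lines = []
    for agent in agents:
        lines.append(f"User-agent: {agent}")
        lines.append("Disallow: /")
        lines.append("")  # blank line separates the per-agent groups
    # Everyone else stays allowed.
    lines += ["User-agent: *", "Allow: /"]
    return "\n".join(lines) + "\n"


def is_allowed(robots_txt, user_agent, url="/"):
    """Return what a robots.txt-respecting crawler would conclude."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    robots_txt = build_robots_txt(BLOCKED_AGENTS)
    print(robots_txt)
    for ua in ["GPTBot", "Mozilla/5.0", "NewScraper/1.0"]:
        print(ua, "allowed:", is_allowed(robots_txt, ua))
```

The last user-agent in the loop is a hypothetical renamed bot: since it is not on the list, it falls through to the `*` group and is allowed, which is why a list like this has to be kept up to date.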