Cool project: "Nepenthes" is a tarpit to catch (AI) web crawlers.

"It works by generating an endless sequences of pages, each of which with dozens of links, that simply go back into a the tarpit. Pages are randomly generated, but in a deterministic way, causing them to appear to be flat files that never change. Intentional delay is added to prevent crawlers from bogging down your server, in addition to wasting their time. Lastly, optional Markov-babble can be added to the pages, to give the crawlers something to scrape up and train their LLMs on, hopefully accelerating model collapse."

zadzmo.org/code/nepenthes/

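The mechanism is simple enough to sketch. Below is a minimal, hypothetical Python imitation of the idea; all names, parameters, and the word list are made up for illustration, and the actual Nepenthes code at zadzmo.org differs in detail:

```python
# Minimal tarpit sketch (illustrative only; not the Nepenthes code).
import hashlib
import random
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

LINKS_PER_PAGE = 24   # "dozens of links" leading back into the tarpit
DELAY_SECONDS = 2.0   # intentional delay: wastes crawler time, spares the server

# Tiny stand-in word list; Nepenthes optionally serves Markov babble instead.
WORDS = ["lorem", "ipsum", "dolor", "sit", "amet", "consectetur", "adipiscing"]

def page_rng(path: str) -> random.Random:
    # Seed the RNG from the URL path: each page is randomly generated
    # but deterministic, so it looks like a flat file that never changes.
    seed = int.from_bytes(hashlib.sha256(path.encode()).digest()[:8], "big")
    return random.Random(seed)

class TarpitHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        rng = page_rng(self.path)
        time.sleep(DELAY_SECONDS)
        babble = " ".join(rng.choices(WORDS, k=200))
        links = " ".join(
            f'<a href="/{rng.getrandbits(64):016x}">more</a>'
            for _ in range(LINKS_PER_PAGE)
        )
        body = f"<html><body><p>{babble}</p><p>{links}</p></body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(body.encode())

if __name__ == "__main__":
    HTTPServer(("", 8080), TarpitHandler).serve_forever()
```

Seeding each page from its own path is what makes the maze look static: a crawler revisiting a URL sees identical content, so nothing marks the pages as generated.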

@tante I have mixed feelings.

Crawlers should respect robots.txt… (see the snippet below for what that opt-out looks like).

At the same time, there is clearly an emotionally based bias happening with LLMs.

I feel weird about the idea of actively sabotaging. Considering it only targets bad actors, and considering that robots.txt files are often too restrictive in my opinion, the gray areas overlap a bit.
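For reference, a robots.txt opt-out aimed at AI crawlers looks like the following. GPTBot (OpenAI) and CCBot (Common Crawl) are real crawler user agents; the file is purely advisory, which is the crux of the dispute in this thread:

```
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
```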

Why should we want to actively sabotage AI development? Wouldn't that lead to possibly catastrophic results? Who benefits from dumber AI?

@altruios @tante Because what they're doing is without consent, in violation of law in ways that normal people have had their lives ruined over, but they're backed by asshole billionaires so it's fine when they do it. We all benefit from sabotaging their scam products.

@dalias @altruios @tante They're crawling the web, running code against it and providing a service based on the results. How's that any different to what search engines have been doing for the last 30 years?

@woe2you @altruios @tante No, they're preparing unauthorized derivative works. And they're explicitly and intentionally disregarding opt-out. The ability to access something via the web does not imply the right to incorporate it into other works, republish altered versions of it, etc.

@dalias @woe2you @tante

The difference between a human reading a website and writing an article 'inspired by' what they've read, and an LLM consuming and outputting content the same way, is that we recognize an LLM is a tool that can do the same thing faster.

Reading is training. Reading isn't copying. Output is the issue, not input. It's worrisome to see so many not grasp this.

Looking/copying isn't stealing. It just isn't. No one lost their website.


@altruios @dalias @woe2you @tante Well, if the author attributes their sources and cites properly, we call it academic publishing.

Can an LLM do that, reliably? Can it be held accountable for academic misconduct?

@pkraus @altruios @woe2you @tante No, it cannot. When asked to cite sources, it instead fabricates whatever citations are most likely to be believed. The citations it gives have nothing to do with the process by which it generated the information-shaped slop it outputs.

@pkraus @altruios @woe2you @tante And no, it cannot be held accountable. That's why it must never make a decision.