• Tartas1995@discuss.tchncs.de
    8 hours ago

    AllenAI has datasets based on GitHub, Reddit, Wikipedia and “web pages”.

    I wouldn’t call any of them ethically sourced.

    “Web pages” is vague as fuck and makes me question whether they requested the consent of the creators.

    “Gutenberg project” is the funniest tho.

    Listing GitHub, Reddit and Wikipedia tells me very clearly that they didn’t. They might have asked the providers, but the provider is not the creator. Whether or not the provider has a license for the data is irrelevant on moral grounds unless it was an opt-in for the creator. It also has to be clearly communicated. Giving consent is not “not saying no”; it is a yes. Uninformed consent is not consent.

    When someone posted on Reddit in 2005 and has since forgotten their password, they can’t delete their content. They didn’t post it knowing that it would be used for AI training. They didn’t consent to it.

    Gutenberg project… Dead authors didn’t consent to their work being used to destroy a profession that they clearly loved.

    So I bothered to check out one dataset from the names that you dropped, and it was unethical. I don’t understand why people don’t get it.

    What is wrong? That you think they are ethical when the first dataset that I looked at already isn’t.

    • suy@programming.dev
      5 hours ago

      I don’t know where you got that image from. AllenAI has many models, and the ones I’m looking at are not using those datasets at all.

      Anyway, your comments are quite telling.

      First, you pasted an image without alternative text, which is harmful for accessibility (a topic where this kind of model can help, BTW, and one of the obvious no-brainer uses in which they help society).

      Second, you think that you need consent to use works in the public domain. You are presenting the most dystopian view of copyright that I can think of.

      Even with copyright in full force, there is fair use. I don’t need your consent to feed your comment into a text-to-speech model, an automated translator, a spam classifier, or one of the many models that exist and that serve a legitimate purpose. The very image that you posted has very likely been fed into a classifier to rule out that it’s CSAM.

      And third, the fact that you think that a simple deep learning model can do so much is, ironically, something that you share with the AI bros who think the shit that OpenAI is cooking will do so much. It won’t. The legitimate uses of this stuff, so far, are relevant, but much less impactful than what you claimed. The “all you need is scale” people are scammers, and deserve all the hate and regulation, but you can’t get past them and see that the good stuff exists and doesn’t get the press it deserves.

      • Tartas1995@discuss.tchncs.de
        3 hours ago

        Go to https://allenai.org/dolma, scroll down to “read dolma paper” and click on it. That sends you to this site: https://www.semanticscholar.org/paper/Dolma%3A-an-Open-Corpus-of-Three-Trillion-Tokens-for-Soldaini-Kinney/ad1bb59e3e18a0dd8503c3961d6074f162baf710

        1. Funny how you speak about e.g. text-to-speech AI when I am talking about LLMs and image generation AIs. It is almost as if you didn’t want to critique my point.
        2. It is funny how you use legal terms like copyright when I talk about morality. It is almost as if I am not saying that you shouldn’t be legally allowed to work with public domain material, but that you shouldn’t call it ethical when it is not. It is also funny how you say it is fair use. I invite you to turn the whole of Harry Potter from text to speech and publish it. It is fair use, isn’t it? You know that you wouldn’t be in the right there. But again, this isn’t a legal argument, it is a moral one.
        3. Who said that I think it could replace writers or painters in quality or skill? I said it could ruin the economic viability of the profession. That is a very, very different claim.

        I want to address your statement about my telling behavior. Sorry, you are right. I apologize to the screen reader crowd. You all probably know that alt text can be misleading and that something someone says on the internet isn’t a reliable source. So I hope you can forgive me, since you did your own simple research into AllenAI anyway.

    • merari42@lemmy.world
      7 hours ago

      We generally had the reasonable rule that property ends at death. Intellectual property extending beyond the grave is corporatist 21st century bullshit. In the past, all writing quickly entered the public domain, like it should. Depending on the country, that happened somewhere between 25 years after the publication date and the author’s death. Project Gutenberg reflects the law and the reasonable practice of allowing writing to go into the public domain.

      • Tartas1995@discuss.tchncs.de
        7 hours ago

        Good focus on one point; sadly, a bad point to focus on.

        What is lawful and legal is not necessarily what is moral.

        The Holocaust was legal.

        Try again. Let’s start. Should the invention of AI have an influence on how we treat data? Is there a difference between reproducing a work after the author’s death and using possibly millennia of public domain data to destroy the economic viability of a profession? If there is, should public domain law consider that? Has the general public discussed these points and come to a consensus? Has that consensus been put into law?

        No? Sounds like the law is not up to date with the tech. So not only is legal not moral, legal isn’t even up to date.

        You understand the point of public domain, right? You understand that, even if you were right (you aren’t), it wouldn’t resolve the other issues, right?