Copyright and AI Training Data in Japan
Per current Japanese copyright law (PDF), re-affirmed as policy in April 2023 by Keiko Nagaoka, the Japanese Minister of Education, Culture, Sports, Science, and Technology, all works are permitted to be used for the purposes of AI training.
In March 2024, the Japan Agency for Cultural Affairs (ACA) published its latest draft document on AI and Copyright (see also this summary; METI has its own documents and working group as well). See also the notes of the Japanese AI Strategy Council.
Here’s some more analysis and color on this:
- 2023-07-11 Legal Issues in Generative AI under Japanese Law - three lawyers from the Japanese law firm Nishimura & Asahi give an overview
- 2024-02-24 The US should look at Japan’s unique approach to generative AI copyright law - a policy editorial that also does a good job of covering the state of AI training in Japan (as an argument for the US to adopt a similar policy)
- 2024-03-12 Japan’s New Draft Guidelines on AI and Copyright: Is It Really OK to Train AI Using Pirated Materials? - on the latest guidelines published by the ACA. “The committee essentially embraced Article 30-4 allowing the ingestion and analysis of copyrighted materials for AI learning to promote creative innovations in AI. It removes the need of acquiring consent from copyright holders, as long as it would not have a ‘material impact on the relevant markets’ and that the AI usage does not ‘violate the interests of the copyright holders.’”
- 2024-05-01 Report on AI and Copyright Issues by Japanese Government - a full English summary of the latest ACA report
- 2024-05 General Understanding on AI and Copyright in Japan Overview (PDF) - a new EN presentation published by the Legal Subcommittee under the Copyright Subdivision of the Cultural Council of the Agency for Cultural Affairs that summarizes the current thinking. It re-affirms Article 30-4 but expressly warns against collecting data from piracy distribution sites, and it also covers infringement at the usage stage (which, understandably, is treated more stringently). It additionally touches on the copyrightability of AI-generated material, which largely falls within standard norms (AI-generated works are generally deemed non-creative and, to that extent, are not considered copyrighted works).
Terms of Service and Synthetic Data
Online and elsewhere, I’ve noticed a lot of confusion and worry about using synthetic data generated by models due to Terms of Service violations (e.g., OpenAI’s Terms of Service and the like). It’s important to understand that a Terms of Service (TOS) is a contract that binds two agreeing parties (see privity of contract, or the Japanese term 契約上の関係 (Keiyaku-jō no Kankei)), and a third party cannot be bound to (or break) a TOS they haven’t agreed to. Note that a Terms of Service (as its name implies) specifically regulates “access and use” of the service, not the generated output itself.
While, as a matter of course, everyone should respect the TOS they agreed to with their service provider (or suffer the potential liability and consequences), any data generated by a third party, whether synthetic or not, simply falls under the copyright laws and policies of your jurisdiction and would not have any additional licensing or legal terms applied to it.
Notes:
- There has been a recent trend of using synthetic data generated by completely open models (e.g., Mistral or CALM2-7B). While this allows a developer to train their own models without TOS worries, it doesn’t actually provide any additional legal benefit or protection for third parties.
- There are also developers who appear to be unnecessarily applying licenses to datasets (where the original models make no claims on the output). While anyone is allowed to license their data however they want, in Japan at least, our recommendation is that any third party simply not agree to the terms of the license and instead use the data for training in accordance with the Article 30-4 copyright provisions.
- As mentioned, due to the contractual nature of a TOS, the idea of TOS transitivity or any downstream “data contamination” lacks a legal basis. Even if such a concept did exist, using open models to generate or process data wouldn’t help anyway, as all LLMs contain large amounts of TOS-constrained data and would suffer the same “contamination.”