Open Source AI Definition – Weekly update September 16

Week 37 summary 

Endorse the Open Source AI Definition

Recommended Resources: US Copyright Office Guidance on TDM

  • @mjbommar encourages reviewing the U.S. Copyright Office’s guidance on text and data mining (TDM) exceptions, which clearly explains the exceptions and their limitations, with particular focus on non-commercial, scholarly, and teaching uses. He emphasizes that the TDM guidance operates within narrow parameters that are often misunderstood or overlooked.

Proposal to handle Data Openness in the Open Source AI definition [RFC]

  • @quaid proposes adding nuance to the Open Source AI (OSAI) Definition by introducing two designations: OSAI D+ (with open data) and OSAI D- (without open data, for legitimate reasons beyond the creator’s control). He suggests using a dataset certificate of origin (dataset DCO) for self-verification to ensure compliance; see the illustrative sketch after this list.
  • @kjetilk agrees that verification is key but questions whether data information alone is enough to achieve it. He highlights that verifying rights to the data may not always be possible.
  • @stefano appreciates the clarity of the quadrant system and confirms that, under @quaid’s proposal, OSAI D- would be reserved for those with legitimate reasons for not sharing data.
  • @thesteve0 expresses skepticism about broadening the “Open Source” label. He argues that without access to both data and code, AI models cannot truly be Open Source and suggests labeling such models as “open weights” instead.
  • @shujisado notes the importance of data access in AI, pointing out that OSAID requires detailed information about how data is sourced, including provenance and selection criteria. He also discusses potential legal and ethical reasons for not sharing datasets.
  • @Shamar raises concerns about “open-washing” in AI, where developers might distribute a model with a different dataset, undermining trust. He argues that distinguishing between OSAI D+ and D- risks legal complications for derivative works, suggesting that models without open data should not be considered truly open.
  • @zack supports the idea of a tiered system (D+ and D-) as an improvement over the current situation, as it incentivizes progress from D- to D+. He is skeptical about verifiability but sees potential in the branding aspect of the proposal.
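As a purely illustrative aside on the dataset DCO idea above: the minimal Python sketch below shows one way a self-verification sign-off could be recorded and checked in dataset metadata. The trailer name (“Dataset-Signed-off-by”), the metadata layout, and the check itself are hypothetical assumptions modeled loosely on the familiar Developer Certificate of Origin sign-off, not part of @quaid’s actual proposal.

```python
# Illustrative sketch only: a hypothetical "dataset DCO" sign-off recorded
# in dataset metadata and checked by a trivial script. Field names and
# layout are assumptions, not a specification from the forum discussion.

DATASET_METADATA = """\
name: example-corpus
provenance: crawled 2023-01; filtered per policy v2
Dataset-Signed-off-by: Jane Contributor <jane@example.org>
"""

def has_dataset_signoff(metadata: str) -> bool:
    """Return True if any line carries the hypothetical dataset DCO trailer."""
    return any(
        line.startswith("Dataset-Signed-off-by:") and "@" in line
        for line in metadata.splitlines()
    )

if __name__ == "__main__":
    print(has_dataset_signoff(DATASET_METADATA))  # True for the example above
```

Note that such a sign-off only records a contributor’s assertion about provenance; as @kjetilk points out, it does not by itself verify that the rights to the data actually exist.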

Welcome diverse approaches to training data within a unified Open Source AI Definition

  • @stefano asks @arandal about the suggested edits, which include renaming data as “source data,” allowing Open Source AI developers to require that downstream modifications be made with open data, and permitting downstream developers to use open data to fine-tune models trained on non-public data. He also asks whether @arandal sees training data as being to model weights what source code is to binary code.
  • @shujisado agrees with @stefano and points out that while many interpret OSD-compliant licenses to include CC 4.0 and CC0, OSI has not officially evaluated Creative Commons licenses for compliance. He highlights concerns about the lack of patent defense in CC0, which could be crucial for datasets.
  • @mjbommar echoes the concerns about patent defense, noting it as a critical issue in both software and data licensing.
  • @Shamar supports the first two suggestions but argues that models trained on non-public data cannot meet an “Open Source AI” definition, as they limit the freedom to study and modify, which are core principles of Open Source.

On the current definition of Open Source AI and the state of the data commons

  • @nick shares an article by Nathan Lambert, reviewed by key figures in the Open Source AI space, discussing the challenges of training data and the current Open Source AI definition. @Percy Liang’s view (shared on X) is highlighted: he suggests that releasing an entire dataset is neither sufficient nor necessary for Open Source AI, and emphasizes that transparency requires the detailed code of the data-processing pipeline, beyond just releasing the dataset.
  • @shujisado discusses the legal nuances of using U.S. government documents in AI training, emphasizing that while they may be used in the U.S., legal complications arise in other jurisdictions.
  • @Shamar stresses that Open Source AI should provide all the data and processing information necessary to recreate a system; otherwise, calling it Open Source is “open-washing.”

[RFC] Separating concerns between Source Data and Processing Information

  • @Shamar proposes a clearer distinction between “source data” and “processing information” in the Open Source AI definition to ensure transparency and reproducibility. He suggests source data should be publicly available under the same terms that allowed its original use, while the process used to train the system should be shared under an Open Source license. His formulation aims to prevent loopholes that could lead to open-washing and emphasizes the importance of granting all four freedoms (study, modify, distribute, and use) to qualify as Open Source AI.
  • @nick disagrees, arguing that @Shamar’s proposal misunderstands the difference between the rights to use data for training and the rights to distribute it. He also challenges the claim that exact replication of AI systems can be guaranteed, even with access to the same data.

Open Source AI Definition Town Hall – September 13, 2024