“As with any new technology, it is advisable to approach the use of RAG with caution and be proactive in identifying potential risks and liabilities. While there is some benefit to being an early adopter, such benefits can be offset by the unknown risks of a new technology,” cautioned Rourk.
For Rourk, the simplest step to take to ensure protection from copyright infringement claims and reduce risk is to not use RAG or other generative AI outputs to create a work of authorship without extensive editing and revision. “While it is possible that the output of a RAG or generative AI system could also result in either a direct copy or a derivative work, extensive editing and revision of that output would be much less likely to result in infringing material,” Rourk explained.
A second simple step is to avoid mentioning that any AI tools were used to create the work.
Licensing agreements must be in place. These will ensure that all content sources used in RAG systems and fed to the LLM are properly licensed. The licensing terms may include an obligation to attribute. Consent from copyright holders must be obtained.
Regulatory interventions will also help. “The rationality of the process of extracting information from external sources, akin to crawler technology, can be appropriately expanded through regulatory interventions,” said Zhang. “As long as the data crawled belongs to the public search engines and under the premise of complying with the crawler agreement, it should be regarded as data from legitimate sources. In the case of RAG, where the extracted source includes a third-party database or public search engines, it will be costly to obtain copyright authorization one by one for each generated output, if the individual right holder asserts its rights afterwards, the bona fide tort liability can be determined according to the specific circumstances.”
Another step that may be taken is to impose further obligations on the RAG developer with regard to source tracing. The system should provide the specific external sources and the evidence path that contributed to each response, along with the specifications and review mechanisms for the output of the model. To ensure compliance, RAG developers should establish an effective self-regulatory mechanism and review the output of the model regularly. “This could improve the interpretability and traceability of the RAG output and ensure the auditability of the knowledge sources, making it easier to discover and diagnose the behaviour of the AI system,” said Zhang.
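To make the idea of source tracing concrete, the following is a minimal sketch in Python of how a RAG system might bundle each response with the sources behind it. It is an illustration only and not a description of any particular vendor's system: the names `SourceRecord`, `TracedResponse` and `answer_with_trace` are hypothetical, and the `generate` argument stands in for whatever LLM call a real system would make.

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Callable, List


@dataclass
class SourceRecord:
    """One retrieved passage and where it came from (hypothetical fields)."""
    source_id: str   # internal identifier assigned at ingestion
    location: str    # URL or file path of the external source
    excerpt: str     # the passage handed to the LLM
    license: str     # licensing status recorded at ingestion


@dataclass
class TracedResponse:
    """A generated answer bundled with the evidence path behind it."""
    question: str
    answer: str
    sources: List[SourceRecord] = field(default_factory=list)
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def audit_entry(self) -> dict:
        # Plain dict that can be written to an audit log for later review.
        return asdict(self)


def answer_with_trace(
    question: str,
    retrieved: List[SourceRecord],
    generate: Callable[[str, str], str],
) -> TracedResponse:
    """Generate an answer and keep a record of every passage that fed it."""
    context = "\n".join(record.excerpt for record in retrieved)
    answer = generate(question, context)  # stand-in for the real LLM call
    return TracedResponse(question=question, answer=answer, sources=list(retrieved))
```

Storing the output of `audit_entry()` for every response would give reviewers a record of which external sources contributed to which answer, which is the kind of auditability Zhang describes.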
Collective management of copyright is also a feasible way to address copyright-related challenges. China has five collective copyright management organizations, covering music, audio-visual works, written works, photography and film. Through these organizations, R&D bodies working on AI projects can obtain collective authorization for specific works. Such an approach is also mentioned in Article 4 of China’s Regulations on the Collective Management of Copyright (Draft Revisions for Solicitation of Comments).
When copyright infringement does occur, it may be possible for the aggrieved party to seek indemnification from the provider of the natural language processing technique or any generative AI system for that matter.
Copyright infringement insurance is also commercially available, according to Rourk.
For data privacy and protection, organizations should adopt practices to minimize the amount of personal information collected by the RAG model. “If possible, information should be aggregated and stripped of personal and confidential information where possible. Organizations should also obtain consent from individuals for personal information collection and utilization. This includes transparent privacy policies and easy-to-use consent forms and template notices,” Chien suggested.
Anonymizing the personal data is also an effective mitigation strategy.
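As a rough illustration of what stripping personal information can look like in practice, the sketch below masks email addresses and phone numbers in a document before it is indexed by a RAG system. The patterns and the `redact` function are assumptions made for this example, not a complete solution: pattern matching of this kind misses names, addresses and many other identifiers, so on its own it falls well short of true anonymization.

```python
import re

# Illustrative patterns only; real deployments typically need broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


if __name__ == "__main__":
    sample = "Contact Jane at jane.doe@example.com or +65 6123 4567."
    # The name "Jane" is left untouched, which shows the limits of this approach.
    print(redact(sample))
```

Running the redaction step before documents enter the retrieval index, rather than after generation, keeps the personal data out of both the stored index and the prompts sent to the LLM.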
Once the retrieved data is received, protecting it from unauthorized access and use is paramount. To do this, RAG systems need robust security measures to safeguard this information. “For systems that operate globally, complying with local regulations relating to data/privacy, rights of data subject and cross-border data transfers is necessary. Examples of the key overseas legislations include the GDPR, the EU AI Act and the recently enacted state AI acts of Colorado and Utah in the U.S.,” said Chien.
For both IP and data privacy-related challenges in general, Khatri suggested limiting the data sources of the RAG system to a customized data source whose content is under the organization’s control if possible.
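One simple way to picture this restriction is an allowlist applied before retrieval, so that only documents from sources the organization controls can reach the LLM. The sketch below assumes each document is a dictionary with a `url` key and uses made-up host names; `ALLOWED_HOSTS`, `is_controlled_source` and `filter_documents` are hypothetical names introduced for illustration.

```python
from urllib.parse import urlparse

# Hypothetical hosts under the organization's control.
ALLOWED_HOSTS = {"docs.internal.example", "kb.internal.example"}


def is_controlled_source(url: str) -> bool:
    """Return True only if the URL's host is on the allowlist."""
    host = urlparse(url).hostname or ""
    return host in ALLOWED_HOSTS


def filter_documents(docs: list) -> list:
    """Keep only documents whose source is under the organization's control."""
    return [doc for doc in docs if is_controlled_source(doc["url"])]
```

A filter like this is deliberately conservative: anything not explicitly listed is excluded, which trades coverage for tighter control over IP and privacy exposure.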
The importance of working with reliable vendors cannot be overemphasized. Vendors’ credentials and certifications must be reviewed. Contract terms must be negotiated, particularly with regard to data protection obligations, confidentiality obligations and indemnification for copyright infringement and personal data infringement cases, among others. Enforcement of such obligations and other responsibilities must be ensured.
Organizations must be cautious about sharing certain information as it may be part of another company’s trade secret or confidential information. Sharing it may be prohibited under the contract that both organizations have entered into.
Lastly, technology and legal experts must be consulted. “Talking to technology developers and legal experts early is crucial to understanding and mitigating the legal risks associated with RAG,” said Chien.
RAG development and adoption
According to Rourk, the use of RAG is widespread in the U.S. These are primarily private uses internal to an organization, such as law firms adopting RAG to assist with contract review and analysis, review of discovery materials for litigation and other tasks. “Such private uses are less likely to result in copyright infringement risks because they are harder to discover, but it is still possible that copyright infringement could occur if the RAG is also trained on copyrighted material, depending on the final determination of whether such training or use is copyright infringement,” said Rourk.
In China, major technology companies are already building RAG into their products.
Leading tech industry player Tencent has combined RAG technology with the actual application scenarios of its Hunyuan model. One example is a new function in WeChat Read, based on the Hunyuan large model, which enables a user to ask an AI system about the theme of a certain book without having to read the book in its entirety. Tencent Cloud launched a large model knowledge engine based on RAG technology architecture, integrating OCR document parsing, vector retrieval, LLMs, multimodal large models and other technologies to create a low-threshold model application development platform for enterprises.
Joining Tencent in the RAG revolution is Baidu. Its Ernie Bot was developed on the basis of the Ernie and Plato series models, and its key technologies include retrieval enhancement.
Baichuan Intelligence has also jumped on the RAG bandwagon as it opens the Baichuan2-Turbo series API based on retrieval enhancement.
“We believe that RAG will have wider applications in China,” said Zhang, “and many scholars have also put forward improvement suggestions and potential future directions of RAG in their research on RAG systems.”
Singapore shows promise as far as RAG adoption is concerned.
Strong evidence of this is the country’s proposed Model AI Governance Framework for Generative AI, which recommends RAG as a technique to reduce hallucinations.
The IMDA, Singapore’s regulatory body for the infocomm media sector, followed this up in March 2024 by accrediting a foreign startup specializing in RAG solutions for large enterprise clients. IMDA runs an accreditation programme to nurture Singapore’s infocomm media technology ecosystem.
In Australia, the use of RAG is also growing, particularly in the technology and content creation spaces.
In the tech world, RAG is definitely one of the new kids on the block. But it’s more than just a temporary rage. Modern technologies are redefining the way we live and work. Therefore, RAG and AI have to be explored, and people should learn about the rough spots in terms of IP, data privacy and protection. Knowing these risks and challenges will set you off on the right path in your “RAG journey” – whether you are the developer or end user of this new natural language processing technique, or someone whose creative work or personal data may be among the body of information that RAG indiscriminately extracts and uses without your consent.