The latest rage called RAG

01 July 2024

Retrieval-augmented generation is all the rage in the tech world at the moment. But as modern technologies redefine the way we live and work, rights owners and developers alike must thoroughly understand and plan for their own RAG journeys. Espie Angelica A. de Leon reports.

There is a new technique related to artificial intelligence that is all the rage nowadays. Its objective: make AI tools do a better job by providing more accurate responses to users’ questions.

The technique achieves this by searching data sources other than the data on which the AI tool was trained. These external data sources include public databases, document repositories, websites, APIs, paid subscription services and others. It retrieves data from these sources and adds it to the prompt given to the large language model (LLM) used in generative AI tools, thus allowing the AI system to minimize hallucinations and improve the quality of its responses.

Hallucinations refer to statements that sound plausible but are actually incorrect.

“To ensure accurate and trustworthy responses, the most recent and reliable information will be made accessible. It is considered a cost-saving technique, as it does not require retraining of the LLM but instead relies on sending an ‘enhanced prompt’ that provides additional context,” said Sanil Khatri, a local principal at Baker McKenzie Wong & Leow in Singapore.
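The "enhanced prompt" idea can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the document list, the word-overlap ranking and the function names are hypothetical, and production RAG systems typically use vector search and then send the prompt to an actual LLM, which is omitted here.

```python
import re

# Hypothetical external documents; in practice these would come from a
# database, document repository, website or API.
DOCUMENTS = [
    "The EU AI Act was adopted by the European Parliament on March 13, 2024.",
    "Fair dealing exceptions in Australia cover research, study, parody and satire.",
    "Singapore's IMDA runs an accreditation programme for technology companies.",
]

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, documents, top_k=1):
    """Rank documents by how many words they share with the question."""
    q = tokenize(question)
    ranked = sorted(documents, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:top_k]

def build_enhanced_prompt(question, documents, top_k=1):
    """Combine the retrieved context with the user's question (no retraining)."""
    context = "\n".join(f"- {d}" for d in retrieve(question, documents, top_k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

prompt = build_enhanced_prompt("When was the EU AI Act adopted?", DOCUMENTS)
print(prompt)
```

The model itself is unchanged; only the text sent to it grows, which is why the approach is described as cheaper than retraining.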

This new natural language processing technique, which is the current rage in the AI world, is called retrieval-augmented generation, or RAG.

Intellectual property and data privacy challenges with RAG

The same IP and data privacy-related risks and challenges involving generative AI and data scraping are also associated with RAG.

On the IP side, the foremost of these challenges is whether using copyrighted materials to train an AI without consent constitutes copyright infringement. “Copyright infringement can occur when an original work of authorship is either copied or used to create a new derivative work without authorization, either of which could arguably occur even when the AI system is trained or used,” explained Christopher J. Rourk, a partner at Jackson Walker in Dallas.

“If the RAG system uses documents from external sources, it’s crucial to understand the terms of use and licensing for those resources. There may be some operators who think that things that are openly available on the internet are free to use in their AI systems and aren’t copyrighted. That’s a common misperception that even some of the big AI players are falling into,” Greg Lambert, chief knowledge services officer at Jackson Walker in Dallas, pointed out.

It may take some time before the United States arrives at a definite answer to the question of whether use of RAG models or any other AI will result in liability for copyright infringement. At present, several related cases are in the courts. There’s also proposed legislation to address the issue.

According to Peng Zhang, equity partner at Zhonglun Law Firm in Beijing, it is difficult to use the copyright fair use argument to avoid infringement liability if huge volumes of proprietary and copyrighted information are extracted, retrieved and used for commercial purposes.

“RAG will retrieve information from external sources of documents or databases. This will inevitably involve the acts of copying and using of copyrighted works or proprietary information owned by third parties without consent. At the output stage, the extracted information will be combined with the user prompt to generate the output based on pre-designed LLMs. If the generated content is substantially similar in expression to the original work, the right of reproduction may be infringed. If a new expression was formed on the basis of retaining the expression of the original work, the issue of adaptation right might be involved,” he said.

Dalvin Chien, a partner and head of ICT & digital law at Mills Oakley in Sydney, added:

“Although Australia’s Copyright Act provides some exceptions, like fair dealing for research, study, parody, satire and temporary copying for technical purposes, these defenses require the dealing to be ‘fair.’ This poses difficulties for commercial operators who use copyrighted material without compensating creators. Notably, none of these arguments have yet been tested against AI systems in Australian courts.”

Zhang added that it may also be harder to detect copyright infringement in RAG-generated content, and to obtain prior authorization from the source, for two reasons. First, the information is retrieved from external sources rather than from a defined training dataset. Second, the proprietary content is retrieved and combined with the user’s input prompt to generate the output. From a copyright perspective, this may ultimately produce an infringing derivative while making it harder for the third-party rights holder to become aware of the infringement and to enforce its rights.

On March 13, 2024, the European Parliament adopted the EU AI Act. It is the first comprehensive legal framework in the world governing the development, market deployment and use of AI. Under the act, generative AI providers are required to prepare a sufficiently detailed summary of the content used for training the AI model. According to Khatri, this transparency requirement might “equip the copyright owners with better radars.”

“Singapore seems to be adopting a similar position,” he said. “The discussion paper from the Infocomm Media Development Authority (IMDA) on generative AI emphasizes the importance of transparency regarding the training datasets used as input factors in the model. As RAG would also affect the input of LLMs, the transparency of RAG’s source data would also be important. It remains to be seen whether compulsory laws and regulations will be promulgated to require companies to disclose their training datasets or data sources in Singapore.”

As with AI, a major challenge in terms of data privacy is the inclusion of personal data and confidential information in the material extracted and retrieved by the RAG model and used without consent.

“The data source can be as narrow as a customized source, or as broad as the information from the internet. If the data source of the RAG system happens to be an external search engine, it might be implausible to notify random individuals without contact information and obtain their consent,” said Khatri.

Personal data or confidential information may even already be part of the end user’s input query. Making matters worse is the possibility of the RAG system re-collecting the questions and generated content of the dialogue between the user and the application without the user’s knowledge.

Likewise challenging is data breach management. According to Khatri, it remains debatable from a technical perspective whether RAG systems would pose more threats to LLMs’ training data or mitigate the leakage of such data. RAG systems’ private retrieval database can also be vulnerable. A company’s data access controls might malfunction if RAG is poorly designed.

“Vendors such as RAG solutions providers or third-party cloud services providers are generally considered data intermediaries under the Personal Data Protection Act. If any data breach event occurs, the organizations themselves will be responsible. If a data breach is likely to result in significant harm to individuals and/or is of significant scale, a data breach notification must be made by the organization. Given the organization is relying on the accountability of the vendors, timely notification and data breach management could be a challenge,” Khatri said.

He added that trade secrets may be disclosed when an organization sends confidential business information to its vendors in order to customize the data source of a RAG system with the organization’s own data. Trade secret theft may also occur when a company deploys a third party’s LLM and the company’s data, retrieved by the RAG model, is sent to that LLM as part of the prompt. The worst-case scenario is that such data is used as training data for the third party’s LLM.

“As with any new technology, it is advisable to approach the use of RAG with caution and be proactive in identifying potential risks and liabilities. While there is some benefit to being an early adopter, such benefits can be offset by the unknown risks of a new technology,” cautioned Rourk.

For Rourk, the simplest step to take to ensure protection from copyright infringement claims and reduce risk is to not use RAG or other generative AI outputs to create a work of authorship without extensive editing and revision. “While it is possible that the output of a RAG or generative AI system could also result in either a direct copy or a derivative work, extensive editing and revision of that output would be much less likely to result in infringing material,” Rourk explained.

Simple step number 2 is to avoid mentioning that any AI tools were used to create the work.

Licensing agreements must be in place. These will ensure that all content sources used in RAG systems and fed to the LLM are properly licensed. The licensing terms may include an obligation to attribute. Consent from copyright holders must be obtained.

Regulatory interventions will also help. “The rationality of the process of extracting information from external sources, akin to crawler technology, can be appropriately expanded through regulatory interventions,” said Zhang. “As long as the data crawled belongs to the public search engines and under the premise of complying with the crawler agreement, it should be regarded as data from legitimate sources. In the case of RAG, where the extracted source includes a third-party database or public search engines, it will be costly to obtain copyright authorization one by one for each generated output. If the individual right holder asserts its rights afterwards, the bona fide tort liability can be determined according to the specific circumstances.”

Another step that may be taken is to impose further obligations on the RAG developer with regard to source tracing. The system should disclose the specific external sources and the evidence path that contributed to each response, as well as the specifications and review mechanisms for the output of the model. To ensure compliance, RAG developers should establish an effective self-regulatory mechanism and review the output of the model regularly. “This could improve the interpretability and traceability of the RAG output and ensure the auditability of the knowledge sources, making it easier to discover and diagnose the behaviour of the AI system,” said Zhang.
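One way source tracing might look in practice is sketched below in Python. Everything here is an assumption made for illustration: the `Passage` and `TracedResponse` types, the `doc-001` style identifiers and the stand-in `fake_generate` function are hypothetical and do not describe any particular vendor’s system.

```python
from dataclasses import dataclass, field

@dataclass
class Passage:
    source: str  # hypothetical identifier, e.g. a URL or document ID
    text: str

@dataclass
class TracedResponse:
    answer: str
    sources: list = field(default_factory=list)

def answer_with_sources(question, passages, generate):
    """Generate an answer and record which sources fed into it."""
    context = "\n".join(p.text for p in passages)
    answer = generate(question, context)
    # De-duplicate source identifiers while preserving retrieval order.
    sources = list(dict.fromkeys(p.source for p in passages))
    return TracedResponse(answer=answer, sources=sources)

def fake_generate(question, context):
    """Stand-in for a real LLM call, which is outside this sketch."""
    return f"(model output based on {len(context.splitlines())} passages)"

passages = [
    Passage("doc-001", "First retrieved passage."),
    Passage("doc-002", "Second retrieved passage."),
    Passage("doc-001", "Another passage from the first document."),
]
result = answer_with_sources("example question", passages, fake_generate)
print(result.answer)
print(result.sources)
```

Keeping the source list alongside each answer gives reviewers, and potentially rights holders, something concrete to audit.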

Collective management of copyright is also a feasible way to address copyright-related challenges. China has five collective copyright management organizations, covering music, audio-visual works, writing, photography and film. Through these organizations, R&D bodies working on AI projects can obtain collective authorization for specific works. Such an approach is also mentioned in Article 4 of China’s Regulations on the Collective Management of Copyright (Draft Revisions for Solicitation of Comments).

When copyright infringement does occur, it may be possible for the aggrieved party to seek indemnification from the provider of the natural language processing technique or any generative AI system for that matter.

For data privacy and protection, organizations should adopt practices to minimize the amount of personal information collected by the RAG model. “If possible, information should be aggregated and stripped of personal and confidential information where possible. Organizations should also obtain consent from individuals for personal information collection and utilization. This includes transparent privacy policies and easy-to-use consent forms and template notices,” Chien suggested.

Anonymizing the personal data is also an effective mitigation strategy.
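As a rough illustration of stripping personal data before text reaches a RAG index, the Python sketch below redacts email addresses and phone numbers with regular expressions. The patterns and placeholder labels are simplified assumptions; they would miss names, addresses and many other identifiers, so this falls well short of true anonymization.

```python
import re

# Simplified, illustrative patterns; real-world personal data is far more varied.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s-]{6,}\d")

def redact(text):
    """Replace email addresses and phone numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text

sample = "Contact Jane at jane.doe@example.com or +65 6123 4567."
print(redact(sample))  # Contact Jane at [EMAIL] or [PHONE].
```

Note that the name “Jane” survives redaction, which shows why pattern matching alone is not enough.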

Once the retrieved data is received, protecting it from unauthorized access and use is paramount. To do this, RAG systems need robust security measures to safeguard this information. “For systems that operate globally, complying with local regulations relating to data/privacy, rights of data subject and cross-border data transfers is necessary. Examples of the key overseas legislations include the GDPR, the EU AI Act and the recently enacted state AI acts of Colorado and Utah in the U.S.,” said Chien.

For both IP and data privacy-related challenges in general, Khatri suggested limiting the data sources of the RAG system to a customized data source whose content is under the organization’s control if possible.

Working with reliable vendors can never be overemphasized. Vendors’ credentials and certifications must be reviewed. Contract terms must be negotiated, particularly with regard to data protection obligations, confidentiality obligations, indemnification for copyright infringement and personal data infringement cases, among others. Enforcement of such obligations and other responsibilities must be ensured.

Organizations must be cautious about sharing certain information as it may be part of another company’s trade secret or confidential information. Sharing it may be prohibited under the contract that both organizations have entered into.

Lastly, technology and legal experts must be consulted. “Talking to technology developers and legal experts early is crucial to understanding and mitigating the legal risks associated with RAG,” said Chien.

RAG development and adoption

According to Rourk, the use of RAG is widespread in the U.S. These are primarily private uses internal to an organization, such as law firms adopting RAG to assist with contract review and analysis, review of discovery materials for litigation and other tasks. “Such private uses are less likely to result in copyright infringement risks because they are harder to discover, but it is still possible that copyright infringement could occur if the RAG is also trained on copyrighted material, depending on the final determination of whether such training or use is copyright infringement,” said Rourk.

In China, major technology companies are building RAG into their products.

Leading tech industry player Tencent has combined RAG technology with the actual application scenarios of its Hunyuan model. One example is a new function in WeChat Read, based on the Hunyuan large model, which enables a user to ask an AI system about the theme of a certain book without having to read the book in its entirety. Tencent Cloud launched a large model knowledge engine based on RAG technology architecture, integrating OCR document parsing, vector retrieval, LLMs, multimodal large models and other technologies to create a low-threshold model application development platform for enterprises.

Joining Tencent in the RAG revolution is Baidu. Its Ernie Bot was developed on the basis of the Ernie and Plato series models, and its key technologies include retrieval enhancement.

Baichuan Intelligence has also jumped on the RAG bandwagon by opening its Baichuan2-Turbo series API, which is based on retrieval enhancement.

“We believe that RAG will have wider applications in China,” said Zhang, “and many scholars have also put forward improvement suggestions and potential future directions of RAG in their research on RAG systems.”

Singapore shows promise as far as RAG adoption is concerned.

Strong evidence of this is the country’s proposed Model AI Governance Framework for Generative AI, which recommends RAG as a technique to reduce hallucinations.

The IMDA, Singapore’s regulatory body for the infocomm media sector, followed this up in March 2024 by accrediting a foreign startup specializing in RAG solutions for large enterprise clients. IMDA runs an accreditation programme to nurture Singapore’s infocomm media technology ecosystem.

In Australia, the use of RAG is also growing, particularly in the technology and content creation spaces.

In the tech world, RAG is definitely one of the new kids on the block. But it’s more than just a temporary rage. Modern technologies are redefining the way we live and work. RAG and AI therefore have to be explored, and people should learn about the rough spots in terms of IP, data privacy and protection. Knowing these risks and challenges will set you on the right path in your “RAG journey”, whether you are the developer or end user of this new natural language processing technique, or someone whose creative work or personal data may be among the body of information that RAG indiscriminately extracts and uses without your consent.
