World on Now
Technology

Library Unveils Collections as Training Data for AI Platforms

June 17, 2025

Cambridge, Massachusetts (AP) – Everything ever said on the internet was only the start of teaching artificial intelligence about humanity. Tech companies are now tapping into an older repository of knowledge: the library stacks.

A Harvard University collection of nearly one million books, published as early as the 15th century and in 254 languages, is being released to AI researchers on Thursday. Troves of historic newspapers and government documents held by Boston’s public library are also on the way.

Unlocking centuries-old books could provide a wealth of data for tech companies fighting lawsuits from living novelists, visual artists, and others who say their creative works were used without consent to train AI chatbots.

“Using public domain data is a smart move, as it’s typically less contentious than copyrighted material,” said Burton Davis, a deputy general counsel at Microsoft.

Davis said libraries also hold “a significant amount of intriguing cultural, historical, and linguistic data” that is missing from the past few decades of online commentary on which AI chatbots have mostly been trained. Fears of running out of data have also led AI developers to turn to “synthetic” data produced by the chatbots themselves, which is of lower quality.

Backed by “unrestricted gifts” from Microsoft and OpenAI, Harvard’s Institutional Data Initiative is working with libraries and museums worldwide on how to make their historic collections AI-ready in ways that also benefit the communities they serve.

“Our goal is to shift some of the power of this current AI era back to these institutions,” said Aristana Scourtas, who manages research at Harvard Law School’s Library Innovation Lab. “Librarians have always been the guardians of data and information.”

Harvard’s newly released dataset, Institutional Books 1.0, contains more than 394 million scanned pages. One of the earliest works dates to the 1400s: a Korean painter’s handwritten thoughts on cultivating flowers and trees. The largest concentration of works is from the 19th century, covering topics such as literature, philosophy, law, and agriculture, all meticulously preserved and organized by generations of librarians.

The initiative promises to assist AI developers aiming to enhance the accuracy and reliability of their systems.

“A significant portion of the data used for AI training doesn’t come directly from original sources,” explained Greg Leppert, executive director of the Data Initiative and chief technologist at Harvard’s Berkman Klein Center. “This collection rectifies that by sourcing data from the institutions that actually compiled these materials,” he added.

Before ChatGPT sparked a commercial AI surge, most AI researchers paid little attention to the origins of the text they pulled from Wikipedia, from social media forums such as Reddit, and sometimes from deep caches of pirated books. They simply needed vast quantities of what computer scientists call tokens, units of data that can each represent a piece of a word.

Harvard’s new AI training collection contains an estimated 242 billion tokens, a volume that is hard to fathom yet still only a small fraction of what is fed into the most advanced AI systems. Meta, Facebook’s parent company, for instance, has said its latest AI language model was trained on more than 30 trillion tokens drawn from text, images, and videos.

Meta is also facing a lawsuit from comedian Sarah Silverman and other published authors who accuse it of taking their books from “shadow libraries” of pirated works.

Now, with some reservations, the real libraries are stepping up.

OpenAI, too, is contending with a series of copyright lawsuits. This year, it has contributed $50 million to various research institutions, including the 400-year-old Bodleian Library at Oxford University.

When OpenAI first approached the Boston Public Library, one of the largest in the U.S., the library made clear that anything it digitized would be available to everyone, said Jessica Chapel, its chief of digital and online services.

“OpenAI expressed interest in a vast amount of training data, while we are keen on making numerous digital artifacts accessible. It felt like a perfect alignment,” Chapel remarked.

Digitization is expensive. Boston’s library, for instance, undertook a labor-intensive project to scan and curate numerous French-language newspapers that were widely read in the late 19th and early 20th centuries by Canadian immigrant communities from Quebec. Now that such texts are valuable as training data, they help fund projects that librarians want to undertake anyway.

Harvard’s collection was already digitized, starting in 2006, for another tech giant, Google, as part of its project to create a searchable online library of more than 20 million books.

Google spent years fighting legal challenges from authors over its online book library, which included many newer, copyrighted works. The dispute ended in 2016 when the U.S. Supreme Court let stand lower court rulings that rejected claims of copyright infringement.

Now, for the first time, Google has worked with Harvard University to retrieve public domain volumes from Google Books, clearing the way for their release to AI developers. U.S. copyright protection typically lasts for 95 years, and longer for sound recordings.

This new initiative was applauded by the same authors’ group that sued Google over its book project and has more recently taken AI companies to court. “Many titles exist solely within the stacks of major libraries, and the establishment of this dataset facilitates access, broadening knowledge and understanding. Importantly, creating legally sound, large-scale training datasets is essential for democratizing the development of new AI models,” the group said.

How useful this dataset will be for next-generation AI tools remains to be seen as the data is shared on a platform that hosts datasets and open-source AI models.

The book collection is more linguistically diverse than conventional AI data sources. Fewer than half of the volumes are in English, though European languages still dominate, particularly German, French, Italian, Spanish, and Latin.

A 19th-century book collection may prove “exceptionally valuable” for the tech industry in developing AI agents capable of reasoning and planning like humans, according to Leppert.

“At a university, you have a lot of pedagogy around what it means to reason,” said Leppert. “You have a lot of scientific information about how to run processes and how to run analyses.”

At the same time, the dataset contains plenty of outdated content, from debunked scientific and medical theories to racist and colonialist narratives.

“Navigating large datasets presents challenges regarding harmful content and language,” remarked Kristi Mukk, a coordinator at Harvard’s Library Innovation Lab. She said the initiative is trying to offer guidance on mitigating the risks of using the data, to help users make their own informed decisions and use AI responsibly.

————

The Associated Press and OpenAI have a licensing and technology agreement that gives OpenAI access to part of the AP’s text archives.

Source: apnews.com
