BookCorpus

Updated: 11/12/2023 by Computer Hope

Also called the Toronto Book Corpus, BookCorpus is a collection of about 7,000 books that were scraped from the eBook distribution site Smashwords. The dataset contains roughly 985 million words, and its books span many genres, from science fiction to romance. BookCorpus has been used to train early LLMs (large language models), such as OpenAI's original GPT, a predecessor of the models behind ChatGPT, and Google's BERT (bidirectional encoder representations from transformers).
