Frequently Asked Questions About the Million Book Project
Universal Library portal: 1.5 million books, full-text, free-to-read online, www.ulib.org
PRESS RELEASE:
Online Library Gives Readers Access to 1.5 Million Books: International Project Makes Complete Texts Available Through Single Web Portal (Carnegie Mellon, 11/27/07)
STORIES:
Digital library surpasses initial goal of 1 million books (CNN, 11/29/07)
CMU Million Book Digital Library preserves the world of literature digitally for posterity (Pop City, 11/28/07)
Online Library offers 1.5 million works and counting (CNET, 11/27/07)
What is the current status of the Million Book Project?
Use Internet Explorer to access the Million Book Project/Universal Library sites:
As of April 9, 2007 -
The Million Book Project has exceeded its goal of digitizing one million books by 2007. The Project inspired other large-scale digitization projects, including Google Book Search, by changing worldwide thinking about the presentation of material found in books.
Leveraging the $3,000,000 provided by the National Science Foundation for equipment and travel, the Million Book Project attracted international partners and matching funds exceeding $100 million (U.S.). To date the Project has scanned over 1.4 million books in China, India and Egypt, and made great strides in research areas relevant to large-scale, multi-lingual database storage and retrieval.
Though the initial term of the Million Book Project has ended, much work remains to be done. Project partners plan to continue to work together on the following issues:
- Intellectual property: Copyright remains the biggest barrier to creating the digital library. In the United States, all materials published after 1963 are protected by copyright for the life of the author plus seventy years. Materials published prior to 1923 are out of copyright. For materials published from 1923 through 1963, copyright had to be renewed. Estimates are that 90% of the materials published during this period were not renewed and are therefore out of copyright. However, renewal records must be consulted for each title to determine its copyright status. Copyright renewal records were scanned to enable online consultation, and later re-keyed by Distributed Proofreaders to improve accuracy and facilitate searching. Project partner Michael Lesk developed the search system. Nevertheless, the labor cost of manually searching individual titles is prohibitive for large-scale projects. Partners at the Internet Archive are developing software to automate this process.
Project partners around the world are grappling with innovative ways to overcome the copyright barriers. China, for example, passed a law that allows new material published in China to be scanned and displayed at universities throughout the country. India is currently exploring a public lending law and a law that would enable scanning of books not available in bookstores; a fund would be established to pay authors when their online works were used. The efforts in China and India effectively circumvent the need to acquire permission from the copyright owner to digitize copyrighted works. Nothing comparable has been seriously explored in the United States, though legislation has been proposed that would allow the scanning of works for which the copyright owner cannot be found. The so-called orphan works legislation would be a step forward, but efforts to identify and locate copyright owners will remain cost prohibitive for large-scale digitization projects in the United States.
American partners in the project are, however, considering some innovative approaches to bringing down the copyright barrier. Million Book Project director and intellectual property attorney Dr. Michael Shamos has suggested that computer-generated summaries of uncopyrightable facts contained in books could circumvent the copyright permission process and provide free access to important content. As machine summarization improves, this breakthrough idea will be tested.
- Machine translation and summarization: The vision of the universal digital library includes automatic translation from any language to any language of both queries submitted and content retrieved. Million Book Project director and director of the Language Technologies Institute (LTI) at Carnegie Mellon, Dr. Jaime Carbonell, has been exploring context-based machine translation, a technique that mines the broad resources of the web to find examples to facilitate translation. Project partners in China and India are also working on machine translation. India, a country with eighteen official languages, is heavily invested in this work.
LTI is also developing summarization technology. Automated summaries can help address the dual problems of information overload and lack of time by quickly enabling users to determine relevance if not find exactly what they need. In combination with machine translation, automated summaries can provide people with access to information that might never be translated into their native language. The implications for teaching, learning, research and innovation would be profound.
- Improving and providing centralized access to the metadata: The initial plan of the Million Book Project was to host the entire collection at Carnegie Mellon and to have mirror sites around the world. File transfer, however, turned out to be a significant problem for technical and political reasons. Given these hurdles and developments in distributed computing over the past five years, the current plan is for each country to host the material that it scans, but to provide centralized access to the metadata. Inaccuracies and non-standard cataloging practices must be addressed to make this possible. This will be a primary focus of work over the next year.
- Usability: The books in the Million Book collection are stored as TIFF files, one file per page. The files are large, and fetching each page can be tedious over slow or busy networks. Project director Dr. Raj Reddy is exploring two options: correcting the optical character recognition text to provide HTML versions of the books, or converting the books to Portable Document Format (PDF). HTML and PDF files are much smaller than TIFF files, so transmission would be much faster. Time is a critical factor for students and faculty, and the time between page fetches affects reading comprehension. Work must be done to improve the usability of the collection.
- Growing the collection: In addition to the work and research described above, project partners aim to continue efforts begun in 2005 to create a critical mass of best practices literature in agriculture around the world. In partnership with the Food and Agriculture Organization, the National Agriculture Library and relevant university libraries, additional agricultural materials will be scanned and added to the Million Book collection. Project partners will also continue to add to the collection books and other materials in different languages and disciplines.
- Diversity and education: The Million Book Project has always had goals in support of diversity and education. Our efforts to provide a multi-lingual digital library aim to address the inordinate amount of web content in the English language and the inordinate amount of web content of dubious quality for teaching, learning and scholarship. In conjunction with work on machine translation and summarization, the Project looks to a future where all people can find the quality information they need free-to-read on the web.
Many students rely on the web as their information resource, turning first to Google or another Internet search engine to mine the surface web and only secondarily to library-licensed, restricted-access resources in the deep web. Print is a third, somewhat unpopular choice. The lack of quality information on the surface web and its impact on student learning is a primary driver of the Million Book Project. The problem is particularly acute in the sciences, where little relevant information is out of copyright and therefore readily available for digitization. Public policy and innovative technology, like machine translation and summarization of factual book content, must be explored to meet the needs and expectations of students, scholars, and lifelong learners everywhere.
The best practices and research initiatives that result from the project will continue to be shared with librarians and scientists worldwide through formal and informal channels. Applied research and best practices will enhance the quality of digitized materials, storage and delivery systems, and ultimately the users’ experience. Providing powerful tools and free-to-read access to materials in many disciplines will support education and lifelong learning. Free access to agricultural collections can help reduce hunger and food insecurity. The Million Book collection will be indexed by Google and other popular search engines. The Project will continue to drive research agendas in many areas, from computer science to public policy.
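The U.S. copyright rules summarized under "Intellectual property" above can be read as a simple decision by publication year. The sketch below is illustrative only: it follows the simplified rules as this FAQ states them, and the names Status and us_copyright_status are hypothetical, not part of any project software. The renewed argument stands in for the result of a renewal-record lookup, which the FAQ notes must be done title by title.

```python
from enum import Enum
from typing import Optional


class Status(Enum):
    PUBLIC_DOMAIN = "out of copyright"
    CHECK_RENEWAL = "renewal records must be consulted"
    IN_COPYRIGHT = "protected by copyright"


def us_copyright_status(pub_year: int, renewed: Optional[bool] = None) -> Status:
    """Classify a U.S. title using the simplified rules stated in this FAQ.

    renewed is None when the renewal records have not been checked yet.
    """
    if pub_year < 1923:
        # Published prior to 1923: out of copyright.
        return Status.PUBLIC_DOMAIN
    if pub_year <= 1963:
        # 1923 through 1963: status depends on whether copyright was renewed.
        if renewed is None:
            return Status.CHECK_RENEWAL
        return Status.IN_COPYRIGHT if renewed else Status.PUBLIC_DOMAIN
    # Published after 1963: protected.
    return Status.IN_COPYRIGHT


if __name__ == "__main__":
    for year, renewed in [(1910, None), (1950, None), (1950, False), (1975, None)]:
        print(year, renewed, us_copyright_status(year, renewed).value)
```

Only the middle branch needs outside data, which is why automating the renewal lookup matters for large-scale scanning.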
As of November 2005 -
As of January 2004 -
What Purpose Does the Million Book Project Serve? What Problems Does the Project Address?
Research reveals that students and faculty look online first when they need information
because of the speed and convenience of online access. They prefer remote access
to electronic resources rather than having to go to a physical library facility.
Though faculty and graduate students often turn to a library web site or licensed
electronic resources when they need information, undergraduate students tend
to start with popular Internet search engines like Google because these search
engines are more convenient and easier to use than library databases. Most students
believe the information they find on the open Internet is good enough to use
in their coursework. Unfortunately, only about 6% of the surface web content
indexed by popular search engines is appropriate for student academic work.
Faculty are concerned that the lack of quality resources on the surface web
is having a negative impact on the quality of student learning.
Meanwhile, the increasing availability and use of online bibliographic databases,
the increasing number of scholarly publications, and the increasing cost of
library materials have created a situation wherein libraries are spending more
money but acquiring fewer materials. Interlibrary loan is increasing, but the
turn-around time is often inadequate for both the highly competitive research
conducted by faculty and the shorter deadlines of students. Consequently, user
satisfaction is decreasing. Research recently conducted by Carnegie Mellon University
Libraries to improve our understanding of the graduate student experience exposed
their frustration with the amount of time it takes to get the materials they
need for teaching and coursework because the Libraries' electronic resources
are not easy to use. To save time and aggravation, they often turn to an Internet
search engine first. Among the concerns they expressed was the difficulty of
acquiring old journals and out-of-print books. Collection size, the turn-around
time required for interlibrary loan, and the cost of document delivery constrain
their selection of research topics, the quality of their work, and their grade
point average. Lack of free and speedy access to quality resources has a negative
impact on the timeliness and success of academic work. Research indicates that
most students and faculty perceive a significant gap between their need for
speed and convenience and the service their library is providing.
Beyond the boundaries of these problems, tremendous disparity exists across
the nation and around the world in the size and accessibility of library collections.
Some single institutions, like Harvard and Yale, have more books in their libraries
than some entire states have in all of their libraries combined. In our rapidly
changing world, lifelong learning and access to books have become essential
to employment, health, peace, and prosperity. Greater public access to information
is consistent with the goals of education and deliberative democracy. The expectation
is that greater access to information will enhance respect for diversity and
pluralism, alter the ways in which people work and deliberate together, and
better equip people to understand and challenge the world around them. The Million
Book Project will digitize a large body of published literature and offer it
free-to-read on the surface web - providing students, faculty, and lifelong
learners with rapid, convenient access to quality resources. Equitable, world-wide
access to the Collection will contribute to the democratization of knowledge
and empowerment of a global citizenry. An important byproduct of the Collection
will be the existence of a test bed that stimulates and supports much-needed
research in information storage and management, search engines, image processing,
and machine translation.
For More Information
"How students search: Information seeking and electronic resource use"
(EDNER [Formative Evaluation of the Distributed National Electronic Resource]
Project, Issues Paper 8, 2002). Available: http://www.cerlim.ac.uk/edner/ip/ip08.rtf
What are the research issues in the Million Book Project?
Many believe that knowledge is now doubling every two to three years. Machine summarization, intelligent indexing, and information mining are tools that individuals will need to keep up in their disciplines, in their businesses, and in their personal interests. This large digitization project will support research in these areas. This will be of greater significance for the Indian languages, where new tools for summarization, grammar and spell checking, thesauri, and translation dictionaries need to be developed ab initio.
The data provided by the MBP, with the right research inputs, will facilitate the development
of language- and location-independent intelligence amplifiers for furthering
information creation.
Our initial thinking was to take a staged approach to collection development on a discipline-by-discipline basis. However, discussion with project partners and potential partners in November 2001 at a collection planning meeting funded by NSF resulted in the decision to focus on providing free-to-read access to multiple collections. Copyrighted works will be digitized upon receipt of permission from the copyright holder to include the works in the Million Book Project.
Our partners in India and China are currently digitizing local materials. Our Chinese partners are digitizing unusual and unique rare collections in Chinese libraries. Our Indian partners are digitizing government textbooks published in eleven of the eighteen official languages in India.
Who are the key U.S. participants in the Million Book Project?
Who are the other partners in the Million Book Project?
What university/scholarly presses are participating in the program?
The University of Texas Press, Brookings Institution, the American Meteorological Society, American Institute of Biological Sciences, and Rand McNally are among the presses that have given permission to digitize their out-of-print in-copyright books. National Academy Press has given us permission to digitize all of their books published prior to 1995. As of June 2004, we are in negotiation with many other presses, including Johns Hopkins, Duke, Penn State, and the Russell Sage Foundation.
How is the Million Book Project supported?
To date, two grants totalling $3.6 million have been received from the National Science Foundation to purchase equipment.
The Chinese Ministry of Education, Chinese Academy of the Sciences, Indian Institute of Science, and Carnegie Mellon University Libraries and School of Computer Science are providing personnel and facilities, and participating in collaborative research. Carnegie Mellon University Libraries is training the scanning operators.
University of California, Merced, will be a mirror site for the Million Book Collection. They have also contributed funds and personnel for copyright permissions work. Brewster Kahle (Internet Archive) is providing disk storage.
OCLC is providing project partners with metadata at no charge, will support a registry to track progress and avoid duplicate scanning, and might become a sustaining host of the final Million Book Project collection.
Additional grant proposals are planned to support seeking copyright permissions, further collection development, the management of project logistics, and shipping costs.
Can users of the Million Book Collection print or download the books?
The delivery systems for the Million Book Collection might restrict Print and Save functionality to one page at a time. netLibrary’s experience indicates that this is a sufficient deterrent to prevent users from printing or downloading entire books. This restricted functionality is required for copyrighted books in the Collection. (Note that copyrighted books are included in the Collection only with the permission of the copyright holder.)
To Print or Save a displayed page, move the mouse over the page image. A little toolbar will appear, with icons that enable users to Print, Save, and Email the page. Just click on the appropriate icon to Print, Save, or Email the page.
Will the Million Book Project preserve the fixed format of the initial publications?
Yes, the digitized works will preserve the fixed format of the initial publications.
What metadata is being captured about the digitized works?
MARC records and administrative metadata are being captured following existing standards. Dublin Core is being used for materials that have not previously been catalogued or where MARC is inappropriate, for example, for photographs and three-dimensional cultural artifacts.
Publishers might not give the MBP blanket permission to digitize and make available all of their out-of-print, in-copyright titles, but might entertain requests for permission to digitize specific titles. Is that possible?
The MBP approach is to request permission for a range of years, for example, everything published prior to 1990. A publisher could specify the cut-off year or, alternatively, specify the list of titles for which they grant non-exclusive permission to digitize in the MBP.
The MBP is not developing a for-profit system. All of the content will be available free-to-read on the Internet. Participating publishers will get copies of the digitized books and metadata, and can themselves provide or enable others to provide value-added services to access the digital books. Permission granted to the MBP is NON-exclusive.
Reading the case study of the National Academy Press's experience putting their books online free-to-read could facilitate understanding and appreciation of the benefits of this approach. See: Barbara Kline Pope, "How to Succeed in Online Markets: National Academy Press: A Case Study," Journal of Electronic Publishing 4, 4 (May 1999). Available: http://www.press.umich.edu/jep/04-04/pope.html
What kind of accuracy will the MBP achieve in scanning?
Carnegie Mellon has established a workflow (based on pilot, 100-book and 1000-book projects) that includes steps to ensure capture of high-resolution images and essential metadata, post-processing to correct skewing and crop dark borders surrounding the page images, and OCRing to create searchable ASCII text with 98% accuracy.
Will the TIFFs meet the Print-On-Demand (POD) standards of Replica and Lightning Source?
The MBP follows the standards and best practices supported in "A Framework of Guidance for Building Good Digital Collections," developed by the Institute of Museum and Library Services in 2001 and endorsed by the Digital Library Federation in 2002. See:
http://www.diglib.org/standards/imlsframe.htm
http://www.imls.gov/pubs/forumframework.htm
More specifically, our guidelines for data production (excerpted from the MBP NSF proposal and based on pilot projects) are:
- Bitonal images with a pixel depth of 1 bit per pixel, scanned at a resolution of 600 dots per inch (DPI). Images will be stored as "Intel" TIFF (Tagged Image File Format) files with the header content specified. The compression algorithm used is ITU (formerly CCITT) Group 4.
- TIFF version 5.0 is acceptable. Subject to testing, version 6.0 (or later) may also be acceptable.
- The initial-capture system includes dynamic thresholding or a similar feature to capture variability of darkness in the imprint and possibly darker (e.g., foxed) backgrounds from decay. Images should be as readable as the original pages.
- Typical expected data will be provided for most TIFF tags (normally, the data supplied by software default settings). A specification for the TIFF header will be produced to include scanner technical information, filename, and other data, but to be in no way a burden on the production service.
- Images will be written in sequential order, with corresponding 8.3 filenames, e.g., 00000001.tif as the first image in volume sequence and 00000341.tif as the 341st image in volume sequence.
- Volumes provided to the MBP will be assigned unique identifiers that conform to 8.3 format. The images will be in directories named with the corresponding identifier (e.g., the volume identified as akf3435.001 will have a directory with the same name, and 00000001.tif through 0000000N.tif files within that directory).
- Images and directories (as specified above) will be written to gold CD-ROM according to agreed-upon specifications and using ISO 9660 format.
- Skew will be kept within a specified allowable range of degrees.
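The file and directory naming rules in the list above can be sketched in a few lines. This is a minimal illustration, not the project's production tooling: the function names are hypothetical, and the 8.3 check assumes identifiers use only letters, digits, underscores, and hyphens, which the guidelines do not spell out.

```python
import re

# Assumed shape of an 8.3 name: up to 8 characters, a dot, up to 3 characters.
_EIGHT_DOT_THREE = re.compile(r"[A-Za-z0-9_-]{1,8}\.[A-Za-z0-9]{1,3}")


def is_8_3(name: str) -> bool:
    """Return True if name fits the 8.3 pattern assumed above."""
    return _EIGHT_DOT_THREE.fullmatch(name) is not None


def page_filename(sequence_number: int) -> str:
    """Zero-padded 8.3 filename for the Nth image in a volume (1-based)."""
    if not 1 <= sequence_number <= 99_999_999:
        raise ValueError("sequence number must fit in eight digits")
    return f"{sequence_number:08d}.tif"


def volume_paths(volume_id: str, page_count: int) -> list[str]:
    """Relative paths for every page image in a volume's directory."""
    if not is_8_3(volume_id):
        raise ValueError(f"volume identifier {volume_id!r} is not in 8.3 format")
    return [f"{volume_id}/{page_filename(n)}" for n in range(1, page_count + 1)]


if __name__ == "__main__":
    print(page_filename(341))              # 00000341.tif
    print(volume_paths("akf3435.001", 2))  # two paths under akf3435.001/
```

Zero-padding to eight digits keeps the images in volume order when a directory is sorted by name.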
Once you've scanned a title, how soon will you return TIFFs to the publisher?
The timing depends on many factors, including how long it takes us to locate copies of the books for which permission was granted, how many books are involved, what's already in the queue of books waiting to be scanned, etc.
Who will determine the pricing of value-added components of the MBP?
The publishers or vendors who develop the value-added components will determine the pricing for the services they provide.