CENDI PRINCIPALS AND ALTERNATES MEETING
National Library of Medicine
Bethesda, MD
March 13, 2007
Minutes
Global Web Archives and National Digital Libraries
The Internet Archives: Current Initiatives and Future Plans
The National Science Digital Library (NSDL): Current Status and Future Developments
NLM Showcase
Survey of Interactive Content in Journals Indexed for MEDLINE
Tour of Exhibition Visible Proofs: Forensic Views of the Body
The Internet Archives: Current Initiatives and Future Plans
(Linda Frueh, Regional Director, Internet Archive)
The goal of the Internet Archive (IA) is to provide universal access to human knowledge. It was founded in 1996 to archive web pages. Since that time it has grown to become the largest publicly accessible digital library, containing movies, books, audio (both spoken word and music), and software. IA plays a significant role in both the open source software and standards communities. Along with twelve national libraries, the IA is part of the International Internet Preservation Consortium. Through this group, IA is involved in standards for web archiving and metadata, including the ARC and WARC file formats. WARC is now an ISO work item for standards related to the archiving of web pages.
However, the environment is evolving and new tools are needed for information management. This includes tools for capturing information, tracking changes, providing user-friendly access, sharing distributed resources, searching across locations, and better coordinating scanning efforts. IA sees major trends toward openness, decentralization, new repositories, web access and computing capabilities/services on the grid.
In the beginning IA focused on the collections, but it is turning more to the development of tools and infrastructure. Libraries are good at creating collections but not at developing and maintaining infrastructure. The development of this infrastructure is focused on capture, access and keeping the information safe through a series of interrelated projects with a variety of sponsors and partners.
In order to capture the information, IA has developed Heritrix, an open source web crawler. This is used by IA engineers to create periodic archives of the web. The engineers have also developed Archive-It, a wrapper that allows small organizations to enter their own URLs and store the results on IA servers as part of a subscription service. The originating organization has a branded web site that comes through to IA’s archive-it.org.
In order to provide access, IA has developed search technologies and interfaces. NutchWAX is a search engine for archival web content based on the Lucene open source search engine. Currently, IA engineers are working on scaling this up. IA’s Web Harvesting projects involve broad web snapshots performed by Alexa Internet for the Wayback Machine. Even though the harvesting is not deep, each crawl captures 2.5 billion pages, or 12 terabytes, and a crawl is performed every two months. The current archive includes more than 60 billion pages in more than 21 languages.
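As a rough check on the crawl figures reported above, the per-page size and yearly volume can be worked out directly. This is only a back-of-the-envelope sketch: it assumes decimal units (1 terabyte = 10^12 bytes) and exactly six crawls per year, neither of which is stated in the presentation.

```python
# Figures as reported in the minutes; decimal units (1 TB = 10**12 bytes) are assumed.
PAGES_PER_CRAWL = 2.5e9
BYTES_PER_CRAWL = 12e12
MONTHS_BETWEEN_CRAWLS = 2

# Average size of one captured page.
avg_page_bytes = BYTES_PER_CRAWL / PAGES_PER_CRAWL

# Assumes crawls run back to back all year (12 months / 2 months per crawl).
crawls_per_year = 12 // MONTHS_BETWEEN_CRAWLS
pages_per_year = PAGES_PER_CRAWL * crawls_per_year
terabytes_per_year = BYTES_PER_CRAWL * crawls_per_year / 1e12

print(f"Average captured page size: {avg_page_bytes / 1000:.1f} KB")
print(f"Pages captured per year: {pages_per_year / 1e9:.0f} billion")
print(f"Data captured per year: {terabytes_per_year:.0f} TB")
```

Under these assumptions the average capture is a little under 5 KB per page, which is consistent with the remark that the broad harvesting is not deep.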
IA’s storage and preservation infrastructure for the web archive is based on open specifications which do not tie it to a particular hardware or software vendor. The IA currently has four petabytes of storage spread across San Francisco; Alexandria, Egypt; and a small mirror site in Amsterdam, with a future site planned for Asia. IA is a big user of the Internet2 backbone. The speed is currently two gigabytes but they are looking to increase to five because it is currently maxed out. They can handle massive amounts of web traffic, approximately 10 million hits per day.
The second type of archive is based on topic-specific crawls, which are often performed for special projects or particular organizations. These crawls are deep but not broad. Special collections include the LC Iraq War and Crisis in Darfur collections, the NARA end of Presidential Term collection, a collection for Hurricanes Katrina and Rita, and a collection for the 100th Anniversary of the San Francisco Earthquake. In order to produce a topic-specific crawl, seed URLs must be identified. This approach is important for ephemeral events where key information will be lost after the event is no longer newsworthy.
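One way to picture a crawl that is "deep but not broad" is a scope rule that follows every link on the seed sites but never leaves them. The sketch below is an illustrative assumption, not a description of how Heritrix or Archive-It decide scope; the function names and the example.org URLs are hypothetical.

```python
from urllib.parse import urlsplit


def seed_hosts(seed_urls):
    """Collect the host names of the seed URLs, skipping any without a host."""
    hosts = (urlsplit(url).hostname for url in seed_urls)
    return {host for host in hosts if host}


def in_scope(url, hosts):
    """Assumed scope rule: a link is followed only if it stays on a seed host."""
    host = urlsplit(url).hostname
    return host is not None and host in hosts


# Hypothetical seeds for an event-based collection.
seeds = ["http://relief.example.org/katrina/", "http://news.example.org/storm"]
hosts = seed_hosts(seeds)

print(in_scope("http://relief.example.org/katrina/shelters.html", hosts))
print(in_scope("http://unrelated.example.com/", hosts))
```

A rule like this keeps the crawl on the identified sources for as many link levels as needed, which is why the choice of seed URLs largely determines what the collection contains.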
Another type of crawl is the domain crawl which collects all the web sites in a particular domain. They are currently talking to some federal agencies and governments, including France, Italy and Australia, about this type of crawl. A modified Wayback Machine is used as the interface, and versions are created based on each crawl.
The content specific crawl is looking for specific kinds of content to be assembled from disparate sources. Examples include the University of North Carolina’s video archives and the DOE/OSTI e-prints network crawl. The latter is in the early stages. Currently, a live web index is refreshed every quarter. In the future, OSTI will use Archive-It to perform targeted harvesting. The content will be made text searchable and will be cumulative over time. Other examples include the Open Education Resources Collection and NASA Images (the agreement for this is pending).
Ms. Frueh also highlighted the Biodiversity Heritage Library, which involves ten major biological and cultural institutions. The goal is to create a page for each species on earth, including photos, audio and other links. The project is supported by the Open Content Alliance (OCA), and the Sloan Foundation has awarded several grants. The IA will be hosting the library. Ms. Frueh indicated that other members of the group have connections to the Marine Biological Laboratory and Woods Hole, which are developing taxonomic authority files.
Ms. Frueh then turned to a discussion of the Book Project. This involves R&D to develop a scanning machine and a series of centers. Seven scanning centers are in place with the capacity of scanning three terabytes per day or about 12,000 books per month. IA is looking to cover its cost. The cost of 10 cents per page for books and microfilm covers the whole process. In 2006, IA began its mass digitization project. JPEG 2000, an emerging image standard, is being used because it is an open standard. The image is then OCRed. IA is planning to have a scanning center in DC by this summer, but no decision about its location has been made.
There are several ways to access the Books Project text archive. On-Demand Books is funded by Sloan. They are currently providing this in New Orleans to help replace lost library collections. Scan on Demand is a prototype for OpenLibrary.org. One Laptop Per Child involves books reformatted to work on these machines, which are distributed to developing countries for educational purposes. The FlipBook Interface includes both searching and “flipping,” which mimics a real book. All books scanned by the OCA are in the IA but this isn’t a requirement. There are currently about 180,000 books in the archive.
The scanning procedures guarantee preservation-quality output through file comparisons, checksums and redundant copies. IA has had successful results when scanning from microfilm; they are beginning to explore microfiche scanning but they don’t know the quality issues yet. Ms. Frueh suggested a follow-up session with Robert Miller, who is more directly involved with the technical specifications.
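The checksum and redundant-copy checks mentioned above can be sketched as follows. This is a minimal illustration rather than IA's actual procedure: the choice of SHA-256, the function names, and the idea of comparing every copy against one master file are all assumptions.

```python
import hashlib
from pathlib import Path


def file_checksum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_copies(original: Path, copies: list[Path]) -> dict[Path, bool]:
    """Report, for each redundant copy, whether its checksum matches the original."""
    expected = file_checksum(original)
    return {copy: file_checksum(copy) == expected for copy in copies}
```

In use, `verify_copies(master, [mirror_a, mirror_b])` would return a mapping from each copy to `True` or `False`, so a damaged or incomplete copy can be replaced from one that still matches.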
OpenLibrary.org is geared toward resource sharing and providing universal access to books. They are working toward a better mechanism for exposing a library’s own collections. The goal is one web page for each book ever published. The results will be open for anyone to add to and download. Editing tools will help to build resources that include annotations, community tagging and user-generated content. Scan-On-Demand is expected to be key to these efforts. IA is planning to complement OCLC and Open WorldCat by starting with the smaller collections where the library may not even have an OPAC.
Ms. Frueh ended by highlighting IA’s interest in working with CENDI and the government community. Joint activities might include scanning, tool development and preservation activities.
The National Science Digital Library (NSDL): Current Status and Future Developments
(Dave McArthur, Guardians of Honor, National Science Foundation Consultant)
The NSDL mission is to break down barriers to the discovery and use of resources by digitizing and organizing them. The goal is to save users’ time in locating resources so that they can spend more time using the resources. The NSDL provides tools and services to share and contextualize the material. The specific resources can be used to create larger modules, which can be shared to enhance educational value.
In 2001, the NSDL was a $20 million program. This is down to about $15 million per year. The total through the foreseeable lifespan of the program will be $135 million.
There are several key activities within the NSDL. The Core Integration Team works on standards and architecture development. The Pathways Projects are funded to curate and organize materials, usually at the discipline level. While the front-end interface looks very centralized, the resources themselves are distributed. The digital library is brought together by federated search and archiving, and by virtue of NSF funding mechanisms.
Over 200 collections have been built for targeted research tracks. The result is a digital library of approximately two million resources.
The NSDL catalog is selective and so it is concerned about metadata. All resources and services are web based. iVia crawls the resources, harvests and automatically creates a centralized repository of metadata. The resources remain with the originating organization. Metadata for rights management is a major area of interest to which there is no general solution. Even though the NSDL doesn’t store full text, metadata records have had to be removed because of rights issues.
The subjects in the NSDL vary widely. The sciences, particularly math, earth science and geology, are well represented, but the social sciences are not as well represented. This can be seen in an Inxight Star Tree visualization.
Resource types are tagged using a controlled vocabulary. NSDL does not consider itself in competition with Google or Google Scholar because the majority of the resources are non-traditional digital resources such as video clips, applets, web pages, etc., rather than digitized versions of traditional text materials.
Omniture is used as the search engine. They have seen a growth in users over time with no specific cycle. The usage level is currently 40,000 unique visits a day, where a visit is a series of page views that ends when there is no activity for half an hour or more. The latter part of 2004 saw a large increase with the addition of iVia, which brought the number of metadata records to over 2 million.
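The visit definition above, in which page views are grouped into visits separated by at least half an hour of inactivity, can be sketched as a simple counting routine. This is an illustration of the general technique only; the input format of (visitor, timestamp) pairs and the function name are assumptions, and it does not describe how the NSDL's analytics software actually works.

```python
from datetime import datetime, timedelta

# Inactivity threshold from the minutes: half an hour or more ends a visit.
SESSION_GAP = timedelta(minutes=30)


def count_visits(page_views):
    """Count visits from an iterable of (visitor_id, timestamp) page views.

    A new visit starts when a visitor is seen for the first time, or when
    the gap since that visitor's previous page view is 30 minutes or more.
    """
    last_seen = {}
    visits = 0
    for visitor_id, timestamp in sorted(page_views, key=lambda view: view[1]):
        previous = last_seen.get(visitor_id)
        if previous is None or timestamp - previous >= SESSION_GAP:
            visits += 1
        last_seen[visitor_id] = timestamp
    return visits


start = datetime(2007, 3, 13, 9, 0)
views = [
    ("a", start),
    ("a", start + timedelta(minutes=10)),  # same visit as the first view
    ("a", start + timedelta(minutes=45)),  # 35-minute gap: a new visit
    ("b", start + timedelta(minutes=5)),   # a different visitor
]
print(count_visits(views))
```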
NSDL’s plans include the addition of new pathways. Most of the large subject areas have been covered, so they are looking for different cuts such as data and data literacy and learning in informal settings. Grants are available for pathways through 2010 with the next competition in May 2007.
The NSDL plans to extend the material that is included. There are currently agreements with several traditional publishers, including Elsevier. In most cases, only small pieces of the publishers’ materials are included on the NSDL site. Links would be included in the text and online to supplemental materials in the NSDL and materials would be tagged by educational standards. This approach raises business model issues for the publishers. The NSDL also plans to host NSF program “wings” in the NSDL based on relevant content from within the NSF. For example, the Compadre Project in physics is used as a pathway, with educational context added by the NSDL.
The system architecture continues to be based on a Fedora platform in order to foster creation of innovative products and services. Integrated services that are useful but not exclusive to the NSDL are being sought. There will be a call later this year. One current example is the Strand Map Service from the Digital Library for Earth System Education (DLESE).
The NSDL interfaces with many groups in relevant fields. For example, they interface with the investigators on Project Tomorrow/Net Day, which is promoting the use of the web in the classroom. Paul Berkman’s committee, which is concerned with the sustainability of the NSDL, is one of several committees that monitor NSDL.
In the future, the NSDL might be the higher level platform for resource development and digital libraries, while third parties provide content and services through NSDL’s centralized interface. The NSDL already partners with over 20 publishers, other agencies and not-for-profits. Examples include the Instructional Architect, Sklor, a personal collection management tool, and Content Alignment. The Content Alignment tool helps contributors catalog resources to state standards. It also maps each state standard to the standards of other states, maintaining and updating the mapping into the future. If a user catalogs a resource by a state standard once, it remains accessible in the future as standards and mappings change.
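The idea behind cataloging a resource once and finding it through any equivalent standard can be sketched with a small crosswalk table. Everything here is hypothetical: the standard identifiers, the resource names, and the data layout are invented for illustration and are not taken from the Content Alignment tool.

```python
# Hypothetical crosswalk between state standards; all identifiers are made up.
CROSSWALK = {
    "STATE-A:SCI.5.2": {"STATE-B:S5-LIFE-1"},
    "STATE-B:S5-LIFE-1": {"STATE-A:SCI.5.2"},
}

# Each resource is cataloged once, against a single state standard.
CATALOG = {
    "photosynthesis-applet": "STATE-A:SCI.5.2",
    "plate-tectonics-video": "STATE-A:SCI.8.1",
}


def equivalent_standards(standard_id, crosswalk):
    """Return the standard itself plus every standard mapped to it."""
    return {standard_id} | crosswalk.get(standard_id, set())


def find_resources(standard_id, catalog, crosswalk):
    """Find resources cataloged under the given standard or an equivalent one."""
    wanted = equivalent_standards(standard_id, crosswalk)
    return sorted(name for name, tagged in catalog.items() if tagged in wanted)


# A teacher in "State B" finds a resource that was cataloged against "State A".
print(find_resources("STATE-B:S5-LIFE-1", CATALOG, CROSSWALK))
```

Because lookups go through the crosswalk rather than the catalog entry itself, updating the mapping when a state revises its standards would not require re-cataloging the resources.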
Ultimately, the goal is to move from an era of an NSDL for developers to one of end-users. Investing in new developments will continue but they are investing more time in getting NSDL into the classroom. “Teacher rules” pose big challenges for the NSDL, as evidenced by the complexity brought on by state and local standards. This makes it more important to contextualize the system and make it easy to use. The biggest barrier to inclusion of NSDL resources into the classroom is that the NSDL resources are not specific enough to the teacher’s lesson plans. They do not currently target specific districts or geographic areas. Partnerships and training are needed to connect the NSDL to the classrooms at this level. To achieve this, the NSDL plans to “cherrypick” good resources and services and to provide more training.
Discussion
The NSDL is searched as a part of Science.gov. Ms. Frierson asked if the NSDL has referrer statistics that could be shared. Mr. McArthur will investigate based on the IP addresses for Science.gov.
NLM Showcase
(Betsy Humphreys)
NLM’s new Strategic Plan was recently released. (Several copies were distributed; the Secretariat will provide Dr. Siegel’s office with the CENDI principals and observers mailing lists so that he can distribute additional copies via mail.) The two presentations bookend the strategic plan – the first is involved in content acquisition, at the beginning of the lifecycle, while the second addresses archiving and preservation of materials and the use of them as secondary sources.
Survey of Interactive Content in Journals Indexed for MEDLINE (James Marcetich)
NLM has long been concerned about interactive journals and the challenge they present to describing, managing and preserving the scientific materials. However, until recently they had never done an extensive investigation of the prevalence or nature of interactive content in journals. Ultimately, NLM is interested in whether this content adds value and if it helps people to retain and use the knowledge gained from the information.
Approximately 5,000 journals are currently indexed in MEDLINE, yielding 650,000-700,000 articles. Interactive content is being processed, but it isn’t being tagged with a special publication tag. There is also “supplementary content” which is often information that could have been printed but there wasn’t enough room for it in the printed journal.
A sample of 49 journal titles (214 issues with 6,504 articles) published between January and June 2005 was surveyed. The journal titles were selected because the staff knew that interactive materials had occurred in them. The amount of interactive content was more than expected. Of these articles, 122 or approximately 2 percent included interactive content, and 551 or 8 percent included supplementary non-interactive material. There was an average of 2.2 files per article with a maximum of more than 10 files for some articles. The video clip was the most common type of interactive content, and there were very few links to interactive datasets.
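The rounded percentages above can be checked against the raw counts. The short calculation below uses only the figures reported in the survey; the variable names are illustrative.

```python
# Counts as reported in the survey of January-June 2005 issues.
issues_surveyed = 214
articles_surveyed = 6504
with_interactive = 122
with_supplementary = 551

interactive_share = 100 * with_interactive / articles_surveyed
supplementary_share = 100 * with_supplementary / articles_surveyed
articles_per_issue = articles_surveyed / issues_surveyed

print(f"Interactive content: {interactive_share:.1f}%")       # 1.9%
print(f"Supplementary content: {supplementary_share:.1f}%")   # 8.5%
print(f"Average articles per issue: {articles_per_issue:.1f}")  # 30.4
```

The unrounded shares are about 1.9 percent and 8.5 percent, in line with the approximate figures given in the presentation.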
The interactive materials are presented in different ways. Often the characteristics of the interactive materials could not be identified from the print, requiring each article to be examined, including the interactive material itself. Often there is no legend to the supplementary files; the content appears in one series of files and the captions in another. The materials may be contained on CD-ROM supplements which are only issued periodically. In other cases, the entire article is interactive and only the abstract was printed. Sometimes the content is deposited or made available from other sites or in separate databases with links from several journals to the same repository. These variations present processing challenges for MEDLINE.
Currently, PubMed Central includes some supplementary and interactive content. An overall command allows supplementary and interactive material to be searched from PubMed, but there is no distinction between supplementary and interactive. As of last month, there were approximately 21,600 records in PubMed Central with supplementary or interactive material.
Standards for presenting this material, including publication types, are needed. For example, the term “interactive tutorial” could be used for the totally interactive articles. These changes will require modifications to PubMed Central’s DTD. The PubMed Central Advisory Group will likely advise on these changes. In the meantime, NLM will continue to save all the content, but the migration path for this multimedia remains to be determined.
Tour of Exhibition Visible Proofs: Forensic Views of the Body (Erika Mills)
NLM’s collections are an important component of its outreach program and are designed not only to increase knowledge and use of NLM information services and collections, but to improve science and health literacy and to promote interest in science-related careers. The current exhibition, which has been very popular given the interest in CSI in the popular media, emphasizes the history of forensics and the different scientific disciplines that contribute to modern forensics. The meeting attendees were given a guided tour and invited to explore the interactive parts of the exhibit on their own.