January 9, 2013
DOE OFFICE OF SCIENTIFIC AND TECHNICAL INFORMATION
USPTO Headquarters - Madison Building - 600 Dulany Street, Alexandria, VA 22314 Auditorium – Concourse Level
9:00 am - Welcome and Introductions
Don Hagen, Associate Director, Office of Product & Program Management of NTIS and CENDI Chair
9:15 am - Host Showcase - DOE Office of Scientific and Technical Information
Web Architecture (Warnick/Martin) [presentation]
9:45 am - 11:45 am
12:00 pm - Group Lunch
Task/Working Group Chairs
Donald Hagen, CENDI Chair, welcomed the members. He thanked Dr. Warnick for arranging to use the US Patent and Trademark Office conference space. Dr. Warnick specifically thanked Thomas Phan.
Data and Publications – Access and Interoperability
NSF’s Office of Cyberinfrastructure (OCI) coordinates and supports the acquisition, development and provision of resources, tools and services for 21st Century science and engineering research and education. This includes computational infrastructure, software, networking, data infrastructure and education and workforce development. Data infrastructure is where Dr. Grumbling is focused. Data infrastructure is not just a part of disseminating scientific research results, but a part of the life cycle and the scientific method. Data feeds into conclusions and then into new hypotheses.
In the past 50 years, the use of computational resources to test hypotheses and promote new discoveries has been added to the scientific paradigms. The data-centered world is now apparent. Many sources of information are connected electronically and there is significant automatic collection of data from all kinds of instruments.
NSF’s definition of Big Data includes large, diverse, complex, longitudinal and/or distributed sets generated from instruments, sensors, Internet transactions, email, video, click streams, and/or all other digital sources available today and in the future. Scientific data was not always digital, and a lot of scientific research still seems to be “off the grid”. It is important to remember that not all science generates big datasets; in the long tail of science, there is data that is not in digital form and some research produces datasets that are relatively small.
There are opportunities and challenges in data-driven science, including the fact that data is abundant and there is potential for multiple uses of data beyond the original purpose. The availability of data enhances scientific collaboration and allows for crowdsourcing with citizens. The big challenges are access and formats that are interoperable across software platforms and places. Another barrier is the need for the community to accept the benefits of data-sharing. Mechanisms for attribution and citation of data, provenance, quality control and credit for contributions to the infrastructure are needed.
OCI has a variety of data initiatives underway under the DataNet program, including the Sustainable Digital Data Preservation and Access Network partners, DataONE, the Data Conservancy, and SEAD (Sustainable Environment-Actionable Data). The Data Web Forum was funded as a Research Coordination Network across projects and institutions to advance or create new directions through communication and coordination. The DataWeb Forum has morphed into the Research Data Alliance (RDA). The goal of the RDA is to facilitate the sharing and exchange of research data world-wide and to help advance data-driven scientific discovery through development and adoption of data infrastructure. Activities are aimed at near-term results in an 18-month timeframe.
The US efforts are supported by NSF through an award and by the National Institute of Standards and Technology (NIST) through in-kind support. There are international partners in the European Union and Australia. RDA is a self-managed, non-commercial, non-governmental entity led and driven by the research community and data users.
Dr. Fran Berman described the organization structure of the RDA and its current activities. She indicated that RDA is in start-up mode and she is hoping for CENDI participation. This is an exciting time for people who have been involved in data issues. Data informs literally everything that we do, and it is the hot topic, with activities on every continent. It is really our time as the data community.
The interest stems from the need to accelerate innovation and research. The community needs to build a culture of data sharing and reciprocity so that the interesting data can be used in multi-data, multi-disciplinary ways. Data is often considered to be a competitive advantage and the resulting lack of data sharing is one of the threats to establishing a Global Data Infrastructure.
There is a need for short-term efforts that complement the long-term needs. The goal of the RDA is to create working groups and building blocks that work toward an infrastructure by accelerating the sharing and exchange of research data. The working groups have fairly broad scopes and specific work products and deliverables, such as harmonization of adopted standards, deploying infrastructure, adopting policy, and identifying and implementing best practices.
The working groups are expected to take on efforts that contribute to a coordinated global infrastructure, can be accomplished in 12 to 18 months, have substantive applicability, and create actionable deliverables that are adopted by the working group members and others. These efforts should enable working scientists and researchers to move forward today while long-term and far-reaching solutions are pursued in other venues. The government agencies that began discussions about the organization are forming a separate group. This Government Group can be described in more detail by Chris Greer and Alan Blatecky.
The formation of a working group begins with a Case Statement establishing what the group will do and the value proposition for its work. The Case Statement is reviewed by the RDA Council and, if approved, the group is official. Evaluating the Case Statements involves an assessment of whether there will be adopted deliverables, whether the effort will be helpful to the community, whether the timeframe can be met, and whether the proposed effort is a good fit for the scope of the RDA. The group has 12 to 18 months within which to operate, and then there will be a post-completion communications effort. This is where Dr. Berman would also like to have help from others, such as CENDI. In this way, they go beyond a community of interest.
The RDA leadership Council has three initial members and will grow to about nine members. A work-in-progress organizational framework was described. To join RDA, interested parties can go to the RDA website and check the box to become a part of the RDA membership. A Technical Advisory Group will have responsibility for the technical roadmap. The administrative leadership will fall to the Secretariat and the Organizational Advisory Group of partners. They are looking for the most appropriate partnership model. There might be a financial or in-kind support needed on the part of partners. The Council will be responsible for overarching mission, vision, and strategic planning.
Dr. Berman laid out the timeline from 2011 through 2013. The first gathering was at the October 2012 Global Data Meeting. There were more than 100 people in attendance from the US and internationally. More than 10 discussion groups were formed to work through potential focus areas for working groups. They identified important areas of work and roadblocks. The launch of the organization will take place in Sweden in mid-March. It will be a working meeting and a celebration of what they can accomplish. She would like to see a good US presence at this meeting. In September-October there will be a second plenary meeting here in the US.
The Steering Committee is still conceptualizing what the RDA will look like, how it will work, and what its governance structure will be. To initialize the RDA, they are discussing what the organization is about and what it stands for. Guiding principles have been developed such as being open, transparent, community driven, consensus-based, voluntary/not for profit, balanced representation of stakeholders, and facilitation of harmonization across different boundaries.
Accelerating the momentum for RDA will involve increasing the active membership, developing a pipeline of deliverables, promoting the launch and plenary as a gathering place, developing a functional and effective organization structure, and reaching out to other partners. It will be important to come up with a self-sustaining model for the organization within a five-to-six-year timeframe.
Dr. Berman also spoke about the economics of data stewardship and preservation. How will the free and open access to data create a sustainable environment, especially around data management; what are the best practices; and what are the partners that are needed from other communities?
A key question of interest to CENDI is how RDA wants to interact with the federal government. The Government Group is discussing this question. Financial support is always welcome, and RDA would like to have the federal communities involved and engaged as much as possible. If RDA is successful, Dr. Berman believes that agencies will consider it an important investment. The three-year NSF timeframe for funding is short. Other ways to contribute include in-kind relationships, such as an intern assigned as a go-between with an agency.
Ms. Carroll suggested that CENDI might lead an STI working group as part of the September plenary meeting in the US.
DOE/OSTI – PAGES System
Walt Warnick presented an overview of the new DOE prototype system for open access to DOE-funded journal articles, called the Public Access Gateway to Energy and Science (PAGES). OSTI was asked to develop this system by DOE Senior Management to fill the biggest gap in access to DOE research results, which is the full-text journal literature. It is based on a distributed architecture and provides a single search box for DOE-sponsored journal literature.
The metadata is centralized but the full text articles and manuscripts are decentralized. The centralized metadata ensures access to a comprehensive list of the scholarly output of DOE funding. It also preserves the freedom of researchers to promote and disseminate their research, while recognizing and accommodating the business models of publishers and the value they add, such as peer review. A single version of the document is maintained. For published articles, the single version of record (VoR) resides with the publisher.
Some articles are automatically publicly available because they appear in open access journals. Some articles are accessible after an embargo period. If a publisher will not agree to allow access to the full text from its system, accepted manuscripts hosted by institutional repositories at DOE laboratories are used instead. If an institutional repository is not available to the author, the document could be hosted by OSTI. The prototype currently has 360 journal titles from Proceedings of the National Academy of Sciences (PNAS) and about 70 from the American Physical Society (APS).
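The access rules described above can be read as a fallback chain over the centralized metadata. The following Python sketch is illustrative only: the record fields, the function name, and the example URLs are hypothetical and are not taken from PAGES itself. It also assumes that an article under an agreed embargo stays closed until the embargo ends, which the minutes note is still to be decided.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ArticleRecord:
    """Centralized metadata for one article (all field names are hypothetical)."""
    title: str
    publisher_url: str                         # where the version of record (VoR) lives
    open_access: bool = False                  # article appears in an open access journal
    embargo_ends: Optional[date] = None        # set when the publisher opens the VoR later
    lab_repository_url: Optional[str] = None   # accepted manuscript at a DOE lab repository
    osti_hosted_url: Optional[str] = None      # accepted manuscript hosted by OSTI


def resolve_full_text(record: ArticleRecord, today: date) -> Optional[str]:
    """Return the link a reader would follow, or None if nothing is public yet."""
    # 1. Open access journal: the VoR at the publisher is public immediately.
    if record.open_access:
        return record.publisher_url
    # 2. Publisher opens the VoR after an embargo (assumption: closed until then).
    if record.embargo_ends is not None:
        return record.publisher_url if today >= record.embargo_ends else None
    # 3. No publisher agreement: use the accepted manuscript at a lab repository.
    if record.lab_repository_url:
        return record.lab_repository_url
    # 4. No institutional repository available: fall back to an OSTI-hosted copy.
    return record.osti_hosted_url
```

In this reading, the search index only ever stores the metadata and the resolved link, while the documents themselves stay with the publisher or repository.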
The policy to make accepted manuscripts available is covered by a compulsory DOE order that applies to everyone who receives DOE funding. The researchers will be asked to submit metadata and to post their manuscript at the same time that the accepted article is sent to the publisher. PAGES will need a more explicit policy regarding an embargo period, but this is still to be determined.
Which version to include, and when, has not been an issue with the publishers so far. Publishers, within the last few weeks, have responded positively about PAGES. In particular, the publishers seem to welcome the fact that their VoR will be the single VoR in PAGES.
Ms. Martin gave a demo of PAGES. A search produces a list of hits, each of which links to the article itself as posted at the publisher’s site, which demonstrates the distributed nature of PAGES.
Full text indexing requires the full text to be available for distributed access. PAGES will also have a dark archive of the full text to account for the possibility that a publisher or an institutional repository might become permanently unavailable.
Features of PAGES include sorting by date and relevance ranking. Filters by subject and author are available. The user can also export the metadata from the hit list to Excel. There is also an advanced search, which includes full-text search and searching across the bibliographic citation data. OSTI is working with the STIP community so that authors will know how to submit their manuscripts.
Currently, there is no cross-linking to data, but an internal DOE working group on data management planning is addressing this.
USDA/NAL – National Agricultural Library Digital Collections
Chris Cole presented the overview and demo of the NAL system. NAL has been collecting USDA-authored materials for several years in addition to digitizing historical materials. The focus, when collecting these materials, was on peer-reviewed articles authored by USDA employees. About 50 percent of USDA research dollars are spent internally.
NAL has digitized a number of historical items and they are scaling up to bring the Internet Archive onsite to significantly add to this collection.
The format is almost all PDF since it is the easiest to handle, is ubiquitous, and provides verisimilitude because “what you see is what was published.” The item of record for most publications continues to be the print copy, although this is changing.
Initially, NAL was collecting all kinds of materials, but the focus is now on the hard research. NAL wanted to take advantage of the copyright provision under which documents authored by federal employees as part of their official duties cannot be copyrighted. In the case of almost all these articles, the content is provided by the authors. If there are co-authors and you can’t tell what is done by whom, the presence of the federal author puts the text into the public domain.
Currently, the repository is using Fedora, but there are questions of scalability. They are considering using XML as the format of the articles. In some cases, the publishers have XML available. In other cases, it is provided by a conversion service. NAL will be using the NLM DTD, which is now a National Information Standards Organization (NISO) standard. XML is preferred because it is a curate-able, stable format that is not proprietary. It also removes the last remaining copyright issue, which is the formatting provided by the publisher.
Users can come straight to the article. In addition, every article has a link in the AGRICOLA index with a Handle. It wasn’t necessary to change any of the hyperlinks when NAL made the migration from its legacy system to Fedora because of the persistence of the Handle.
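The reason the hyperlinks survived the migration can be shown with a small sketch. The class below is a toy stand-in for a persistent-identifier resolver, not the real Handle System API; the handle string and both URL prefixes are hypothetical. The point is that catalog records store only the handle, so a repository move changes the resolver's table and nothing else.

```python
from typing import Dict


class HandleRegistry:
    """Toy persistent-identifier resolver (illustrative, not the Handle System API)."""

    def __init__(self) -> None:
        self._targets: Dict[str, str] = {}

    def register(self, handle: str, url: str) -> None:
        """Bind a handle to the URL where the article currently lives."""
        self._targets[handle] = url

    def resolve(self, handle: str) -> str:
        """Return the current URL for a handle."""
        return self._targets[handle]

    def migrate(self, old_prefix: str, new_prefix: str) -> int:
        """Repoint every target under old_prefix; return how many were changed."""
        changed = 0
        for handle, url in self._targets.items():
            if url.startswith(old_prefix):
                self._targets[handle] = new_prefix + url[len(old_prefix):]
                changed += 1
        return changed


# The catalog record keeps the handle, never the repository URL.
registry = HandleRegistry()
registry.register("10113/0001", "https://legacy.example/docs/0001.pdf")
catalog_link = "10113/0001"

# Moving the repository touches only the resolver table.
registry.migrate("https://legacy.example/docs/", "https://fedora.example/objects/")
current_url = registry.resolve(catalog_link)
```

After the migration, `catalog_link` is unchanged while `current_url` points at the new repository.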
It is imperative to have the documents exposed to search engines. In order to make this easier, NAL created an XML version of the catalog records. A number of the search engines are close to finishing the indexing. This is important because it shows the deep part of the web. More than 5 million downloads have been generated, primarily because of the exposure to search engines.
NAL has made it clear that this kind of service and the growth of it cannot be done within NAL’s current budget. Extra funding will be needed.
NLM - PubMed Central
Jerry Sheehan with input from Ed Sequeira presented an overview and demonstration of PubMed Central (PMC). PubMed Central is a free archive of biomedical and life science journal literature that has been peer reviewed. It was launched in February 2000. While many aspects of the system have remained the same, some features and content have expanded.
The repository is now associated with the National Institutes of Health (NIH) Public Access Policy. However, it is not limited to NIH-funded research. PMC currently contains 2.6 million full-text articles, and only a fraction of that content is based on this policy. A number of the journals are submitted voluntarily by the publishers. PMC is tightly integrated with a number of other NLM databases, including PubMed and GenBank. Most of the articles in PMC published since the late 1990s are archived in a standard XML format, the NLM Journal Article DTD. The NLM DTD has been widely adopted by many publishers and is the basis for NISO standard Z39.96-2012, which was issued in fall 2012. PDF versions of most articles are also available.
PMC also includes a large body of articles from the mid-1990s back that were digitized from print.
PMC usage patterns are cyclical and correspond roughly to the academic year, with peaks in the spring and fall. At peak, there are more than 35 million article downloads from PMC per month, and more than 750,000 unique users downloading over 1.5 million articles on a weekday. Approximately 25 percent of the downloads are by universities, 40 percent by the general public, and 17 percent by companies.
The Public Access Policy mandates submission. The policy began as voluntary in 2005 and became mandatory by an Act of Congress in 2008. There is a maximum embargo period of 12 months from publication. They estimate that approximately 100,000 peer-reviewed articles arise from NIH-funded research per year; most of NIH’s budget expenditure is extramural.
Articles get into PMC in two ways. A large number of publishers have formal agreements with NLM to deposit the final published versions of their articles in PMC. They provide full-text XML and PDF copies of each article. There are three types of agreement. Full participation includes all the articles in a journal issue (about 1200 titles). Other journals (NIH portfolio) provide only those articles that fall under the NIH public access policy. Still others deposit just selected open access articles. In addition to these deposits of the final published articles, individual authors (or publishers assisting them) may deposit the peer-reviewed manuscript version of an article that falls under the NIH policy. Those are submitted in manuscript form (e.g., Word file) and converted to XML by NLM. The result goes back to the author to review and approve prior to ingest. Approximately 550,000 articles in PMC are open access (not necessarily public domain), a subset that people have been using for semantic search and text mining research.
NLM has worked with other funders to develop repositories. There are mirror sites in Europe and Canada, and foundations such as the Howard Hughes Medical Institute and Autism Speaks are also using the system.
Linking is a key feature of the system. There are links from within the document and from the references to the articles in PMC. PMC also links to CrossRef with Digital Object Identifiers (DOIs). There is a link to the publisher’s version of the article as well.
NLM has been approached by a number of other agencies about using PMC to support potential public access policies similar to that of NIH. Though NLM normally focuses on the life sciences, it is prepared to accept content into PMC from other disciplines in support of such federal agency public access policies.
Showcase – Department of Energy
OSTI is a program within DOE’s Office of Science. Its strength has been to make results findable via a single search box. Sharing knowledge is what CENDI and OSTI are all about. The OSTI corollary states that accelerating the sharing of scientific knowledge will accelerate the advancement of science.
OSTI’s Architecture for Accelerating Scientific Discovery involves taking DOE-funded research and building a number of separate products that are then integrated into the Science Accelerator. This product provides access to technical reports, more than 5 million scientific e-prints, more than 2 million publicly available citations, more than 24,000 patents, and more than 500 websites and databases. PAGES will be added to this architecture to fill the gap for DOE-sponsored journal literature.
Science.gov integrates DOE with other federal agency R&D results. WorldWideScience.org then adds Science.gov to the results of more than 70 national science portals via a single global science portal. This is OSTI’s largest collection since it runs WorldWideScience.org on behalf of the WorldWideScience Alliance.
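The layering described here, where one search box fans a query out to several independent sources and presents a single merged list, can be sketched in general terms. The Python below is a minimal illustration of that idea under stated assumptions and is not a description of how Science.gov or WorldWideScience.org is implemented: the function name, the hit format, and the assumption that every source returns comparable relevance scores are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Hit = Tuple[str, float]                 # (title, relevance score; assumed comparable)
Source = Callable[[str], List[Hit]]     # a search function for one portal
MergedHit = Tuple[str, str, float]      # (source name, title, relevance score)


def federated_search(query: str, sources: Dict[str, Source], limit: int = 10) -> List[MergedHit]:
    """Send one query to every source and return a single ranked hit list."""
    merged: List[MergedHit] = []
    for name, search in sources.items():
        try:
            hits = search(query)
        except Exception:
            # One unavailable portal should not fail the whole search.
            continue
        merged.extend((name, title, score) for title, score in hits)
    # Rank all hits together, highest relevance first.
    merged.sort(key=lambda hit: hit[2], reverse=True)
    return merged[:limit]
```

Because a source is just a callable, a portal that is itself a federation can be plugged in as one source of a larger federation, which mirrors the way one portal is added to another in the text.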
CENDI Meritorious Service Awards
The award committee, comprised of Mr. Gardner, Ms. Gheen, and Ms. Wilson, met by teleconference on December 18, 2012. Mr. Hagen announced the awards. Annie Simpson, USGS Alternate and Science.gov co-chair, was selected for going above and beyond the expectations for Science.gov, bringing enthusiasm, creativity, and unbelievable persistence and determination to revitalizing the Promotions Group. Mr. Huffine accepted the award on Ms. Simpson’s behalf. Ms. Simpson, who joined the event via teleconference, thanked the group and continued to encourage the CENDI members to provide staff resources for Science.gov. She thanked Ms. Gheen, Science.gov co-chair, and others for their help and support.
Lisa Weber, NARA Principal and former Chair of CENDI, was selected to receive the award in recognition of her tireless and inspiring leadership of CENDI as it grew in strength during two years of dramatic change. Ms. Weber accepted the award and stated that she had learned a lot during her tenure. She thanked the members, her Deputy Chair, Jerry Sheehan, and the Secretariat, for their support.
The public portion of this meeting adjourned.