Agenda
September 8, 2011
National Technical Information Service
5301 Shawnee Boulevard, Alexandria, VA 22312 Visitor’s Center
Conference Room 115/116, First Floor
9:00 am - Welcome and Introductions
Lisa Weber, Director, Information Technology Policy and Administration, NARA, and CENDI Chair
9:15 am - 11:00 am
11:00 am - Host Showcase - National Technical Information Service [presentation]
Don Hagen and Wayne Strickland
» Deepwater Horizon Project Repository Case Study (Gail Hodge/Annette Olson)
11:45 am - Group Lunch
Minutes
Members Speakers
Observers Working Group Chair Secretariat
Welcome
Ms. Lisa Weber, CENDI Chair, opened the meeting at approximately 9:10 am EST. She thanked NTIS for hosting the meeting and NTIS staff for all their help with arrangements.
Metrics and Statistics
“STAR METRICS: Understanding the Impacts of our Federal R&D Investment” [presentation] - Dr. Julia Lane, Program Director, Science of Science & Innovation Policy Program, NSF
STAR METRICS is a volunteer collaboration of five agencies and 78 private sector organizations with the goal of enabling science agencies to better understand their portfolios and to answer stakeholders' questions about federal research investment. Six years ago, Dr. John Marburger stated that there was no science of science policy and little empirical evidence to show that government investment in science was making any difference. Under Dr. Marburger, an interagency group on science policy was formed with 17 science agencies. A two-year roadmap was developed with an investigator-initiated research program at NSF.
Some steps have been taken on the legislative side, such as the American Recovery and Reinvestment Act (ARRA), which requires disclosure of what is funded. However, there was no mandate to report on outcomes, and it is still impossible to answer questions about them. The goal was to develop an approach that was not just a bean-counting exercise. It was important that the research be defined by the science agencies and scientists. Manual reporting takes up 40 percent of scientists’ time, and researchers must often choose between reporting on funding and performing research. Meanwhile, the agencies have the mission to identify and fund the best science. Currently, the science agencies act more like “proposal and award factories”. They are finance-driven but have neither an analytical system of inputs nor links to outputs.
One of the major problems is that there are different identifiers for researchers in different agencies; the issue of name consistency and resolution is a big challenge. In addition, there is no systematic link of the reports to other artifacts such as the data from the research. The paper-based reporting system was simply made electronic. Future consequences are not tracked beyond the period of the award.
The federal science information system matters, but you can’t manage what you can’t measure. The correct unit of measure should be the scientist’s output. Awards are simply interventions. In addition, institutional environments and structures matter significantly, since they provide the data access, equipment, etc.
Step 1 was to determine who is supported by science funding. The goal was to document this through a more electronic and automated approach, without asking for additional manual reporting. The approach uses existing university information systems that support grants administration and payroll. In addition to counting principal investigators, it is possible to track the impact on graduate and undergraduate students, showing a much broader reach of the federal research dollars.
The functional architecture is actually rather simple. The number of data elements required from the native systems at universities is limited. The data can be pulled out and incorporated into a confidential data system with an API (application programming interface) on top. This model is based on one developed by Brazil, which supports both internal and public access. The public can produce relevant applications against the data.
Level 1 is now in the implementation and production stage. The next step is to capture more information from the human resource systems of research institutions. A code is created when a university receives funding, and the dollars must be tracked for auditing and bookkeeping. Any person who charges against the code is counted. This data is routinely aggregated by the university for the Internal Revenue Service and for the Department of Labor. Other direct costs and subcontractor costs are also captured. The full set can be captured every quarter. The number of jobs directly and indirectly associated with an award can then be calculated. The results, including a report from this phase, are being cleared and will be released soon.
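The counting rule described above (any person who charges against an award code is counted) can be illustrated with a small sketch. The record layout, field names, and sample values are hypothetical and are not taken from the STAR METRICS specification; the sketch only shows how payroll charges might be aggregated per award and quarter.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PayrollCharge:
    """One payroll charge against an award account (hypothetical layout)."""
    award_code: str      # account code created when the university receives funding
    person_id: str       # de-identified employee identifier
    quarter: str         # reporting quarter, e.g. "2011Q1"
    fte_fraction: float  # share of a full-time position charged to the award


def summarize(charges):
    """Return {(award_code, quarter): (headcount, total_fte)}."""
    people = defaultdict(set)
    fte = defaultdict(float)
    for charge in charges:
        key = (charge.award_code, charge.quarter)
        people[key].add(charge.person_id)  # a person is counted once per award and quarter
        fte[key] += charge.fte_fraction    # FTE fractions are summed across all charges
    return {key: (len(people[key]), fte[key]) for key in people}


# Invented sample data for illustration only.
SAMPLE = [
    PayrollCharge("AWD-001", "p1", "2011Q1", 0.50),
    PayrollCharge("AWD-001", "p2", "2011Q1", 0.25),
    PayrollCharge("AWD-001", "p2", "2011Q1", 0.25),
    PayrollCharge("AWD-002", "p1", "2011Q1", 0.50),
]

if __name__ == "__main__":
    for key, (headcount, total_fte) in sorted(summarize(SAMPLE).items()):
        print(key, headcount, total_fte)
```

Keeping headcount and FTE as separate figures matters here because a person who works part-time on an award still counts as one person reached by the funding.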
The data from September 2010 to date is covered. Through this system, they are able to document that 80 percent of the funds go to graduate students and post docs and not to faculty. FTE figures generally undercount the people involved, particularly undergraduates; the ratio of students to faculty is often 3 to 1.
Geographic distribution is also available. Subcontractors and vendors touch every state, even with only the 78 institutions that are involved at this phase.
The initial survey took a lot of time for the institutions to set up. However, this has since been reduced to about 45 hours of set-up time and anywhere from zero to 10 hours a quarter to submit the information.
The next set of reports can “topic” model what each award is doing, thereby identifying upcoming gaps in the new workforce. The use of topic models engaged communities in the development of relevant ontologies. They are doing natural language processing of all the abstracts and the topics are automatically generated. This information is going to the NSF Program Managers. However, there hasn’t been an effort to bridge across disciplines, where there could be conflicting ontologies. Dr. Lane believes that more control of the vocabulary is important. A discussion followed and it was evident that CENDI has some good contributions to make to this effort.
The native human resource labor categories differ across institutions. The categories are being submitted and then crosswalked to seven aggregate categories. The institution can adjust the crosswalk if it isn’t correct. This is now done automatically. Ultimately, the work of ORCID, a common persistent researcher ID, will become an important tool.
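A minimal sketch of such a crosswalk, with an institution-level adjustment, might look like the following. The minutes mention seven aggregate categories but do not name them, so the category names, job titles, and function names below are all hypothetical.

```python
# Hypothetical aggregate categories; the source gives the count (seven) but not the names.
AGGREGATE_CATEGORIES = {
    "Faculty", "Postdoc", "Graduate Student", "Undergraduate",
    "Technician", "Research Support", "Other",
}

# Hypothetical default mapping from native HR job titles to aggregate categories.
DEFAULT_CROSSWALK = {
    "Assistant Professor": "Faculty",
    "Postdoctoral Fellow": "Postdoc",
    "Graduate Research Assistant": "Graduate Student",
    "Student Hourly": "Undergraduate",
    "Lab Technician II": "Technician",
}


def build_crosswalk(default, overrides=None):
    """Merge institution-specific corrections over the default mapping."""
    merged = dict(default)
    merged.update(overrides or {})
    unknown = set(merged.values()) - AGGREGATE_CATEGORIES
    if unknown:
        raise ValueError(f"unknown aggregate categories: {sorted(unknown)}")
    return merged


def classify(native_title, crosswalk):
    """Map a native job title to an aggregate category, defaulting to 'Other'."""
    return crosswalk.get(native_title, "Other")


if __name__ == "__main__":
    # An institution corrects one mapping it considers wrong.
    crosswalk = build_crosswalk(DEFAULT_CROSSWALK, {"Student Hourly": "Research Support"})
    print(classify("Student Hourly", crosswalk))
    print(classify("Unlisted Title", crosswalk))
```

Validating the merged mapping against the fixed category set keeps institution adjustments from introducing an eighth category by accident.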
“Interaction of STAR METRICS and the R&D Dashboard” [presentation]
The Level 2 work of STAR METRICS will create a platform to link automatically and leverage existing data under the R&D Dashboard. The Portfolio Reporting tool is now ready. Geographic visualization will be available first and then a drill-down by topic. Existing publicly available datasets will be linked. Representation will be done through topic modeling.
Currently, National Institutes of Health (NIH) and NSF data from 2001 to date are available, with information about the year, amount, and institution where the funding is going. Clicking on a grant number leads to the agency grants database, which shows detailed information. The system was launched as a beta in February 2011.
The system is moving toward output data through links to the Patent Database. Citations are used in the patent applications and linked to the original award. The link is made through the principal investigator in a database that has been disambiguated by person. No causal statement is made; only a relationship is shown.
Issues at this point involve whether it is possible to ingest additional databases that deal with outcomes and the scalability of the system. They will start with publications using NSF’s Research.gov and NIH RePORTER (PubMed numbers that allow direct access to the full text online).
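The person-based link between patents and awards can be sketched as a simple join on a disambiguated researcher identifier. This is an illustration under stated assumptions, not the Dashboard's actual method: the record layouts, identifiers, and function name are hypothetical, and the sketch assumes disambiguation has already assigned one stable ID per person.

```python
from collections import defaultdict

# Invented sample records; "pi_id" and "inventor_ids" are assumed to come
# from a person-disambiguated database.
AWARDS = [
    {"award_id": "A-100", "pi_id": "R1"},
    {"award_id": "A-200", "pi_id": "R2"},
    {"award_id": "A-300", "pi_id": "R1"},
]
PATENTS = [
    {"patent_id": "P-1", "inventor_ids": ["R1", "R9"]},
    {"patent_id": "P-2", "inventor_ids": ["R3"]},
]


def link_patents_to_awards(awards, patents):
    """Return (patent_id, award_id) pairs that share a person.

    The pairs record a relationship only; they make no causal claim.
    """
    awards_by_pi = defaultdict(list)
    for award in awards:
        awards_by_pi[award["pi_id"]].append(award["award_id"])

    links = []
    for patent in patents:
        for inventor in patent["inventor_ids"]:
            for award_id in awards_by_pi.get(inventor, []):
                links.append((patent["patent_id"], award_id))
    return links


if __name__ == "__main__":
    for patent_id, award_id in link_patents_to_awards(AWARDS, PATENTS):
        print(patent_id, "related to", award_id)
```

The quality of such a join depends entirely on the disambiguation step, which is why inconsistent researcher identifiers across agencies are such an obstacle.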
The aim is to create a federal-wide researcher profile that relies on databases already in existence. Most researchers currently have profiles in many federal and non-federal systems. However, different formats are used. They are looking to see how they can create a master database without a lot of additional effort on the part of the researchers.
The next steps are to add additional agencies and institutions in order to cover more of the research terrain. There is a need to focus on people. In addition, visualization and tools are important. Other outstanding issues include how to deal with non-publication results such as software. The group is discussing a wide variety of output types beyond publications and patents. A pilot is planned with the Federal Demonstration Partnership.
The use of a tool such as Science.gov which crosses agency lines was discussed and seen to be a possible resource for future consideration.
“Federal Interagency Council on Statistical Policy and SCOPE: Overview and Interconnections with CENDI” [presentation]
The goal of SCOPE (Statistical Community of Practice and Engagement) is to link the statistical and e-government communities to provide a collaborative community for statistical agencies to produce relevant, accurate, timely, cost effective data and insightful research. Relevant laws and directives include the Confidential Information Protection and Statistical Efficiency Act (within the E-Government Act), the Scientific Integrity Presidential Memorandum, and the Open Government Directive.
There is a dynamic around the need to virtualize and to break down the stovepipes. The major germination point for SCOPE was the argument for horizontal integration of statistical data centers. They didn’t receive the $2 million requested to accomplish this goal, but they plan to pursue intermediate steps with existing resources.
FedStats is working behind the scenes. They aren’t sure how to interact with Data.gov, but they are continuing to look for new visualization tools and revisiting how to leverage their content.
A major area of interest for SCOPE is setting up a pilot cloud environment. Agencies have different security protocols, which need to be reconciled. The Chief Information Officers (CIOs) of the agencies are very involved, supporting a Program Management Office (PMO) and detailing agency staff to work at the project level.
A vision of the future statistical system was developed in 2009. There is a significant overlap of interest across the agencies. Both information technology and statistics people are on the working groups. Efficiencies of scale are expected through sharing. SCOPE believes that efficiencies can be gained by reducing the number of agencies re-creating similar software and tools; by improving security through collaboration and the implementation of best practices; by decreasing procurement costs through aggregation and the use of open source; by supporting standardized metadata models, quality control procedures, etc.; and by developing models for joint data access where statutes permit.
In addition, there is a lot of commonality with statistical agencies abroad. Almost all countries have a single agency and they are trying to balance the benefit of a central environment with the availability of subject matter expertise. In the US, it is the opposite. “Let’s Move”, a collaborative data-sharing effort in support of the First Lady’s childhood obesity initiative, was an early, successful example of data sharing and integration.
CENDI offered to come to SCOPE to give a briefing about CENDI. Mr. Bianchi agreed that science and social science need to be more closely connected.
Action Item: Secretariat will follow up to arrange a CENDI briefing to the SCOPE organization.
Host Showcase – National Technical Information Service
Donald Hagen, NTIS [presentation]
NTIS has laid out a strategic framework, positioning itself for the S&T future. The Federal Science Repository Service (FSRS) is a work in progress and may be only one of many paths forward. It is part of e- or i-science. In the long term, the goal is not just to develop repositories but to make the results of science more productive.
FSRS is aligned with the NTIS mission statement, which calls for a permanent archive, enhanced dissemination, and a service orientation. Services and non-S&T work are bigger revenue generators than the traditional S&T dissemination. The future is in services in S&T and not necessarily sales of S&T.
NTIS has an option to create joint ventures or public/private partnerships. FSRS brings the core competencies of NTIS and Information International Associates, Inc. (IIa) together. There are other joint venture partnerships as well. In addition, NTIS has an agreement with NARA that the NTIS operation is backed up by NARA. If an agency deposits with NTIS, it is fulfilling its archival responsibility. NTIS is not the archivist but serves a facilitating role.
The National Technical Reports Library (NTRL) is the NTIS repository which has many users but limited capabilities in its original form. NTRL has been reengineered as the NTIS repository related to FSRS. The new open repository system was announced in April 2011 in beta version. The production release will be implemented by early 2012.
NTIS has made foundational changes to its processes. It is prepared to integrate with other types of functions, including social media and the incorporation of multimedia. The NTRL was the first implementation of the FSRS infrastructure.
Gail Hodge, Information International Associates, Inc. (IIa)
The Deepwater Horizon repository is being funded by the National Oceanic and Atmospheric Administration (NOAA) with major representation from the NOAA Central Library. The initial contributions to the site will come from NOAA components through the Library. In the future, it is hoped that other agencies involved in monitoring and follow-up of such crises as the oil spill will contribute or build their own collections.
Using the FSRS infrastructure, the IIa/NTIS/NOAA Team has been able to quickly map the NOAA Library’s MARC records to the FSRS Core metadata fields, develop a customized indexing schema, and provide a user interface, along with functionality such as printing and downloading results. Unlike the NTRL, this implementation of the Fedora/Solr repository architecture manages a variety of objects including documents, videos, and images. (Note: Solr is an open source enterprise search platform from the Apache Lucene project.) Fedora object modeling allows for flexible management. Flexible dissemination mechanisms allow changes to the output to be made quickly without changing the underlying data store.
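The kind of mapping from MARC records to a core metadata set described above can be sketched as a lookup table from MARC tag and subfield to a target field. The MARC tags used here (245, 100, 700, 260, 520, 650) are standard bibliographic tags, but the core field names on the right are hypothetical, since the minutes do not list the FSRS Core fields, and the simplified record structure and sample values are invented for illustration.

```python
# (MARC tag, subfield code) -> core field name.
# The target names are hypothetical stand-ins for the FSRS Core fields.
MARC_TO_CORE = {
    ("245", "a"): "title",        # title statement
    ("100", "a"): "creator",      # main entry, personal name
    ("700", "a"): "creator",      # added entry, personal name
    ("260", "c"): "date",         # date of publication
    ("520", "a"): "description",  # summary note
    ("650", "a"): "subject",      # topical subject term
}


def marc_to_core(fields):
    """Convert a simplified MARC record into a core metadata dictionary.

    `fields` is a list of (tag, {subfield_code: value}) tuples. Every core
    field holds a list, because MARC fields such as 650 and 700 repeat.
    """
    core = {}
    for tag, subfields in fields:
        for code, value in subfields.items():
            target = MARC_TO_CORE.get((tag, code))
            if target is not None:  # unmapped subfields are skipped
                core.setdefault(target, []).append(value.strip())
    return core


# Invented sample record for illustration only.
SAMPLE_RECORD = [
    ("245", {"a": "Example oil spill monitoring report ", "c": "Example Office"}),
    ("100", {"a": "Doe, Jane"}),
    ("700", {"a": "Roe, Richard"}),
    ("260", {"c": "2010"}),
    ("650", {"a": "Oil spills"}),
    ("650", {"a": "Water quality"}),
]

if __name__ == "__main__":
    for name, values in marc_to_core(SAMPLE_RECORD).items():
        print(name, values)
```

Holding the mapping in a table rather than in code is one way to let each new collection supply its own mapping without changes to the conversion routine.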
In addition to improving the management of and access to individual repositories, the infrastructure and, in particular, the creation of the core metadata enhances searching across repositories, the creation of virtual topical collections based on objects from multiple physical collections, and common services.
The goal is to provide more repositories for small science agencies or components of agencies that want a specialty collection or that lack the capacity or a major information management arm within their organization. These repositories can provide more content for Science.gov and Data.gov.
The meeting adjourned at 12:00 PM.