CENDI PRINCIPALS AND ALTERNATES MEETING
Defense Technical Information Center
Ft. Belvoir, Virginia
April 27, 2006

Minutes

Report on the AAAS Session on Long-lived Data and Proposed Science Editorial
NSF Draft Strategic Plan for Data, Data Analysis, and Visualization
DTIC Showcase
            The Research and Engineering (R&E) Portal
            The Iraqi Virtual Science Library (IVSL)
            The Defense Virtual Information Architecture (DVIA)
            SF-298 Report Documentation Page Toolkit

Welcome

Dr. Walter Warnick, CENDI Chair, opened the meeting at 9:10 am.  He thanked DTIC for hosting the meeting.  He proposed that the agenda line-up be changed to have Ms. Bonnie Carroll present the AAAS Symposium summary with the presentation from Dr. Chris Greer of NSF to follow.

Paul Ryan, DTIC Administrator, welcomed CENDI and gave highlights on the facility and organization at Ft. Belvoir.

Report on the AAAS Session on Long-lived Data and Proposed Science Editorial

Bonnie C. Carroll, Executive Director of CENDI Secretariat

Bonnie Carroll reported on the CENDI-sponsored symposium on long-lived data that was presented at the AAAS meeting in February, entitled “The Expanding Universe of Digital Data Collections.”  CENDI has been involved with long-lived data for some time, and the symposium was timed well to coincide with the release of the National Science Board report on long-lived data, to which CENDI had the privilege of contributing.  The report needed broader exposure in the scientific community, and the symposium, which was organized with Anita Jones and had a very distinguished panel, was a good way to achieve this.
The outcomes from the symposium can be summarized principally in that Ray Orbach, DOE Science Office Director, committed to officially supporting the NSTC Committee by providing staff support.  Michael Crosby, Executive Officer of the National Science Board (NSB) and Warren Washington, outgoing Chair of the NSB, were enthusiastic about the participants writing a paper for Science based on the symposium.

The panel presentations began with Dr. Warren Washington giving an overview of policy activity in the NSB, the Long-lived data report upon which the symposium was based, and other related key reports released by the NSB in the past year or two. 

Dr. Orbach’s office is essentially a data funder.  His talk focused on how to deal with all the data being generated, emphasizing a key role in finding, connecting, and understanding the dots – essentially, how to organize data so that it is useful for research.  The kinds of data of interest to DOE are in the tera- and petabyte range and come from large experiments.  The Office of Science is initiating a basic research program to address some of the challenges faced in data management and synthesis.

Dr. Jeff Dozier talked from the standpoint of a data author, having accumulated much snow and ice data from Landsat and having faced the challenges of all research universities in collection, analysis, presentation, search, storage, and retrieval.  Data centers play important roles, as does a distributed federation of information/data providers.  Research computing and production computing differ: research computing is heterogeneous, idiosyncratic, and problem-driven (focused on results, not processes), whereas production computing is robust and reliable, standardized, and scalable.  The primary principle behind improving the flow of data is that it should help scientists become information providers in a federated data system.  The prime directive is to have minimal disruption of a working scientist’s computational environment.  The ultimate product will be software, system architectures, and procedures for turning science projects into a federation of providers.

Dr. Bruce Schatz from the CANIS laboratory at the University of Illinois presented from the perspective of a data user, beginning with an overview of information science.  He worked on a major project called BeeSpace, which was a “Frontier in Integrative Biology” project.  Chris Greer is the Program Manager.  Schatz studied how to make all this data interoperable.  The direction is toward the Interspace, with interactive correlations across knowledge sources.  It is more or less a data commons among stakeholders:  data users, data curators, data scientists, and data archivists.

Dr. Helen Berman, Rutgers University, gave the data manager’s perspective from the research realm.  The Protein Data Bank, a reference database on the structure of large biological molecules, has become an international standard.  She described the data bank, challenges, and issues related to a funding model.  There are both scientific and technical challenges that must be addressed as well as key issues related to standards, preservation, and stability.  The bottom line is that we are all dealing with interdependencies that call for a new funding model.  The issues related to this are very complex.  However, it is critical to explore other possible funding models.

Dr. Francine Berman, Director of the San Diego Supercomputer Center, gave the data manager’s perspective from the computer science realm.  Storage of all this generated data is critical, as is collecting and organizing it.  The Supercomputer Center is a national center or community data repository.  A key challenge is digital preservation, giving rise to such questions as what should be preserved, how it should be preserved, who should pay for it, and who should have access.  Dr. Berman also emphasized the need for a different model to meet the needs of the scientific communities.

The question now for CENDI is how to carry the momentum forward.  There remains opportunity for CENDI to have input and meaningful impact.

NSF Draft Strategic Plan for Data, Data Analysis, and Visualization

Dr. Christopher Greer, National Science Foundation

NSF has been heavily involved in strategic planning for a five-year Cyberinfrastructure Plan for the new NSF office.  Chapter 3, which focuses on data and high performance computing, is one of the first chapters completed.  The data portion of that chapter will be discussed here.  Comments made today will be taken back to the Data Strategic Planning Group.

The definition of data is very broad and includes algorithms, video streams, and text in addition to numbers. Collections to keep in mind are reference collections used by a broad range of communities as well as collections made for small single-focused communities.  All are relevant.

NSF favors the term cyberinfrastructure and uses Fran Berman’s definition, which describes cyberinfrastructure as the “. . .organized aggregate of technologies that enable us to access and integrate today’s information technology resources – data and storage, computation, communication, visualization, networking, scientific instruments, expertise – to facilitate science and engineering goals.”  The concept also includes people.

There are four components: 1) collaboratories, observatories, and virtual organizations, 2) learning and workforce development, 3) high performance computing, and 4) data, data analysis, and visualization.

The cyberinfrastructure vision document is a living document that will continue to evolve.  The timeline includes its creation, providing comments, and revisions.  It is expected to be released for publication this summer.

The over-arching vision is one in which science and engineering digital data are routinely deposited in well-documented form, are regularly and easily consulted and analyzed by specialists and non-specialists alike, are openly accessible while suitably protected, and are reliably preserved.  Full implementation of this vision will take considerable time because the culture must change.  There is a need to make the data available for use by people who are not computer scientists.  Data authors and data users are the primary focus for culture change because data managers already understand the vision.

There was some discussion of this being a funding issue, an important consideration in meeting any challenge and making any change.  One issue is that when a contracted researcher runs out of money before the report is written, the data are simply not preserved.  Preservation needs to be invested in.  Dr. Greer said that NSF projects almost always include database people on the review team and that NSF information managers are often included from the beginning.

Dan Atkins, a champion of preservation and data management, and also known to CENDI, is expected to take his new assignment as NSF’s Cyberinfrastructure Head in the early summer and will report directly to the NSF director.  [NOTE:  He has been invited to address CENDI in the near future.]

The goals of the strategic plan can be summarized as 1) catalyzing the development of a system of science and engineering data collections that is open, extensible, and evolvable, and 2) supporting the development of a new generation of tools and services.

The national digital data framework is user-centric, multilevel (global to local, local to global), nimble, sustainable, and reliable.  Cyberinfrastructure always needs to be looking at new tools, new ways of doing things. Many libraries play a critical role in our society as sustainers of data and information. 

The primary principles behind the Plan are that 1) data generated with NSF funding will be accessible and reliably preserved, 2) research and education opportunities need to determine investment priorities, 3) there should be broad community engagement in reviewing and prioritizing data activities, 4) data is only useful if it can be found, understood, and analyzed,  5) privacy, confidentiality, and intellectual property rights must be protected, and  6) international, interagency, and public-private partnerships are essential. The NSF is committed to work with other entities – global, federated, national, and international – and is always looking for partners.

The NSF plan of action involves establishing and maintaining a coherent organization framework to enable communities to do the things they do best.  Communities of practice have a diversity of approaches. Because data managers decide what is to be kept, they must obtain and maintain the trust of the communities.  This makes them accountable.  The federal realm should see that these responsibilities exist.

This is a dynamic and evolving system, which is necessary because collections change over time.  For example, the Protein Data Bank has changed significantly over the past 30 years.  It is a premier collection whose models changed as it grew.  Resource and reference collections need assurance that the data accepted have been curated well.  As recognition of a collection evolves from research to resource to reference, there are critical times at which the quality of the data must be ensured.

The creation of a flexible technological architecture with layered capabilities, putting an emphasis on metadata, including data analysis and visualization tools, and promoting the use and stability of standards, is also within the plan of action. 

Establishing coherent data policies is also embedded in the plan of action.  How are all the data policies met?  NSF collects all the policies and centralizes them for access and applicability across the Foundation.  Data management plans are now required in all proposals; each plan should address which standards will and will not be used throughout the project.  How the data is to be accessed and preserved also needs to be addressed.  Dr. Greer sees this as working the same way as the experiment of a number of years ago when proposers were required to describe broader societal impacts in their proposals.  What NSF saw was that, as proposals started coming in, proposers became more and more creative and focused in their projects.  They saw this as a measure for competition.  Including a data management plan is a third criterion, and it is expected to start bringing in some innovative methods.

Interagency coordination, much like that realized in CENDI activities, is crucial.  The Office of Science and Technology Policy (OSTP) is establishing a working group on digital data to coordinate the activities across the federal government.  The international arena, such as where CODATA and UNESCO operate, is also critical. 

Dr. Greer summarized that the NSF Strategic Plan promotes a change in the current culture, intends to catalyze development of a national digital data framework, and will support new generations of tools, services, and capabilities.  He encouraged the CENDI members to get the latest document for review and provide input.  CENDI expertise and experience are very much desired; it is such partnerships that will make the Plan most effective.

Discussion

Bonnie Carroll asked if there are plans to implement some of these requirements – specifically, the data management plan requirement – into a policy plan since it so clearly affects policy as well as the vision (plan).  Requiring a data management plan is a policy element, so it should be integrated into the general NSF policy, and it should be made clear that this policy applies to digital data.

Dr. Greer stated that practice will make it policy.  Peer review will look at whether the data of a proposal will be accessible and how it will be shared.  Metadata will be a key part of a data management plan.

Dr. Warnick announced that the Committee on Science met with the NSF Director.  A new interagency working group for digital data is being formed, and the Charter is being developed with OSTP leading the draft effort.  Once the draft is ready, the new working group will convene.  Dr. Warnick was asked to represent DOE, though others from DOE may also be asked to serve on this working group.  The Working Group will have a Technical Advisory Group (TAG) structure in place.  CENDI could become a TAG, if it chooses, when the time comes.  This will allow input to be given through a sanctioned pipeline to the WG.

Dr. Greer agreed that CENDI’s contribution could easily be as an interagency TAG.  It is a great idea because working together has potential to be more effective as all the challenges could be overwhelming to one organization. 

Mr. Lannom commented that information management crosses a variety of disciplines and maybe the concentration should be beyond research data. There are fundamental challenges that cross-cut. Dr. Greer acknowledged that there are good implementation models in a number of disciplines with deep thinking going on about these issues. 

The NSF requirement that proposals include a data management plan was generally thought to be a good idea by CENDI members. It was suggested that reviewers be brought in as subject matter experts and data managers be made part of the review team just as a matter of course to effectively evaluate a proposal. 

Dr. Greer noted that panels are finding they need computational experts or an IT component, so these experts are often brought in on temporary bases.  He would like to see community data centers being used as a resource to promote careers of young scientists.  They should be leveraged also to support the local workforces by making use of the data centers’ tools and equipment. This could be an excellent way to train people – young and innovative minds geared to the sciences and technological tools and methods – in the communities. 

Roundtable

Some of the members contributed what their agencies are doing in regard to data management policy.  Dr. Warnick stated that DOE has no formal policy issuance but has established and implements DOE operational orders that must be adhered to.  An order exists that stipulates all reports are to be submitted to OSTI.  There should be an overall policy and vision.  Program managers need to be thinking of data management when they design programs. 

The Environmental Protection Agency has many challenges in records management.  A study of scientists found that they do not know about records retention or even basic records management.  The Project Managers need to be educated and guidelines put in place to require formats and processes.  Success stories need to be shared to help users see how useful it is to them to implement standards and have a policy in establishing collections. 

The Department of Defense has a policy but there is a breakdown in how it is applied in practice.  IT staff and librarians have a good understanding of the policy and how it should be implemented but are often left out of the process.    DoD has a strong policy for data in particular, but the implementation is not consistently done.

The Federal Geographic Data Committee (FGDC) has had years of policy setting but is still not effective in getting the metadata, which is an important part of a data management plan.

Dr. Greer feels that embedded in the metadata should be an identification of the originator of the data so due credit is given. This should help motivate originators to create metadata when they submit their reports or papers.  It is definitely a planning issue – there is a need to plan what data will be collected, how to collect it, and how to get it accomplished as you go.  If the PM discovers at the end what should have been done all along, it likely will never get done; i.e., the right data won’t be collected and submitted as part of the report.

Dr. Neal Kaske from the National Commission on Libraries and Information Science (NCLIS) works with data on libraries and said that comparisons can’t be done over the long term because boundaries change.  It is the long-term changes that are the greatest challenges.  There are many differences between national and international policies and processes and between disciplines. Even the term dataset has different meanings.

DTIC Showcase

The Research and Engineering (R&E) Portal, Bobbie DeLeon 

The Research and Engineering Portal was launched in April 2005.  The initial goal was to answer the need to find out what was being done across the agency; i.e., what should be collected, what should be shared, who is doing the work, etc.

The decision to develop the portal stemmed from the question, if all this work is being done and no one knows about it, what good is it?  This portal was created to be an information transformation tool to answer such questions as what, why, when, and who.   The primary users are the DoD researchers, acquisition professionals, testers, and operators looking to find R&D information.  There are over 2 million DoD technical reports; approximately 300,000 research summaries; 165,000 Independent Research and Development (IR&D) research descriptions; up-to-date budget information; links to over 50 RDT&E websites, portals, and resources; an R&D calendar; over 2,300 news sources; and an R&D point-of-contact list.

The portal was not created in direct response to the E-government Act of 2002, but it was developed with that in mind and complies with all of E-gov’s requirements.  The E-government Act required the creation of a central repository for all federally funded R&D data.  The current primary users are registered DoD employees and their contractors, but there is a plan to open it up to all federal agencies and their contractors.

The portal covers a broad spectrum.  The main page has a listing of tools on the right.  It is always up-to-date with the latest news.  A tutorial and interactive menus are included. A discussion forum is available but has not been used much yet.  There is a push to make the forum more interactive. A profile of each user can be developed, so a modicum of control is possible for differing levels of access.

There are plans to create a search tool to allow categorization by user. The retrieved information is presented in table format.  Another tool that has grown and is useful to program managers is the budget tool because it provides analysis capability.

The reason that the search tool is so effective is that it goes across all of the DoD websites.  One can be kept abreast of all DoD conferences, latest research, etc.

Since the Portal maintains an E-government-compliant database, all DTIC reports are consolidated on the portal.  The Dialog NewsEdge feed is updated at least daily, often more frequently.

The Defense Technology Search (DTS) tool is organized much like a library-card catalog.  Data can be searched by organization as well.  Research results are in table views, which save much time because one can see at a glance whether particular research is being done in a certain area.

The architecture was built with the idea that the portal would eventually share information across the DoD agencies.  XML is the required format.  Tools can be applied that make information more flexible.  Some of the planned enhancements are to improve site navigation and to provide more personalization. Additional news sources (scientific journals, etc.) will be added as will additional links, and there will be a focus on services at the agencies as well as an expanded R&E POC search.  It is intended to become a DoD standard tool, so input on how to make it more useful is welcome.

Other enhancements include establishing an “Ask an Expert” reference desk and more intelligence tools for analysis as well as data visualization tools and analytical business intelligence tools.  For content, DTIC is looking at making it more personalized for the users, adding news sources and links, and expanding the search capability.

John Sykes mentioned that EPA is about to release its R&E portal and would like to see how it aligns with other agencies’ R&D portals.  He’d especially like to hear about collaborative capability and what, if any, data visualization tools were selected.

Ms. DeLeon said that the portal is defining subject areas.  They are working toward collaborative capability, like the Ask the Expert function, though they are still evaluating tools and haven’t selected the software. 

The underlying portal technology is an Oracle portal product, though Plumtree was considered.  Oracle was selected because it was known that it would work with the existing database structure.

The Iraqi Virtual Science Library (IVSL), Carolyn Jones

DTIC has been involved with the U.S. goal of helping Iraq rebuild, particularly in the area of educational infrastructure; since the war, 84 percent of Iraq’s higher education institutions have been severely damaged or destroyed.  Further, under Saddam Hussein’s leadership, Iraqi scientists and engineers were isolated from the international science and engineering communities, with the result that Iraq is 20 to 30 years behind the rest of the world.

One of the biggest challenges is to re-engage the scientists and engineers in the international community.  The project was initiated by the AAAS Science and Technology Policy Fellows at the Departments of State and Defense.  The Defense Threat Reduction Agency (DTRA) provided $360K for the website creation and one year of subscription fees.  DTIC developed the authorization and authentication.  The website will be officially launched at 1:00 pm on May 3, 2006, at the National Academy of Sciences.

The US partners for this endeavor include several agencies within DoD:  DTRA, International Technology Policy Office (ITPO), and DTIC.  The National Academies played a part, and there are civil partners as well, such as the US Civilian Research and Development Foundation.  The Iraqi partners include universities, science institutes, and the Iraqi government.

The Iraqi Virtual Science Library site provides organized access to thousands of journals, free publications, online course materials and books, customized resources from the US Government, and computer resources.  There are over 20,000 journal titles available.  EX-Proxy was developed to provide access to journals without log-ins.  There is open access within IVSL to many resources, such as journals, search tools, educational resources, and international and US resources.  In many cases, the resources are in full text.

A registered journal user can log in directly to journal sites and then perform searches from their databases.  Most articles can be printed from PDF files.  There are some educational resources but there are no registrations for these resources. 

There are no plans to monitor usage except by registrars.  Because the site is open access, no other monitoring is planned and there will be no usage statistics.

Next steps include providing training and transition to the Iraqis, which will involve other entities such as Sun Microsystems, NASP, the National Academies, and HINARI.

DOE had provided links to DOE databases but it was noted that Science.gov was not linked.  Since the CENDI members were in agreement that Science.gov be linked from the Iraqi site, Science.gov staff would see that it was linked by the May 3 launch.

The Defense Virtual Information Architecture (DVIA), Jim Erwin

The goal of the DVIA is to provide access to all the R&D information within DoD.  The design goal was to have hot-linked citation metadata, a multiple citation view, and “anywhere” functionality (e.g., STINET, Science.gov, Google, Worldcat, OPACs).

DVIA is context sensitive, providing different views for different users.  Subset searches were a requirement, along with relevance ranking and a search history that helps users know where they are in the search experience.

It is not a true search system but it looks like one for demonstration purposes.  Since it is a list of citations, it has registry functionality, which means it is essentially a navigation methodology that links out from a citation within the DTIC database.  One navigates out to other sites and other information through links. 

There are multiple registries.  Once results are obtained, one can drill down through links to more detailed information.  It is multi-term with subset terms possible. 

The architecture was developed from both a user perspective and a system perspective.  The user functionality is based on distributed, authenticated, contextualized metadata and data searches and accesses.  Queries and records are expressed as OpenURLs.  The system supports OAI-PMH metadata harvesting and uses Handles.
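
[NOTE:  The following Python sketch is illustrative only and was not part of the presentation.  It shows how a citation might be expressed as an OpenURL query and how an OAI-PMH harvesting request might be formed.  The resolver and registry addresses, the function names, and the sample Handle are hypothetical; the key names follow the published OpenURL (Z39.88-2004) key/value format and the OAI-PMH request parameters, and nothing here is taken from the DVIA implementation.]

```python
from urllib.parse import urlencode

# Hypothetical endpoints, used only for illustration.
RESOLVER_BASE = "https://resolver.example.org/openurl"
OAI_BASE = "https://registry.example.org/oai"


def build_openurl(title, year, handle=None):
    """Express a technical-report citation as an OpenURL query string."""
    params = {
        "url_ver": "Z39.88-2004",                    # OpenURL 1.0 version tag
        "rft_val_fmt": "info:ofi/fmt:kev:mtx:book",  # key/value book format
        "rft.genre": "report",
        "rft.btitle": title,
        "rft.date": str(year),
    }
    if handle:
        # Identify the cited item by its Handle (hypothetical value).
        params["rft_id"] = "info:hdl/" + handle
    return RESOLVER_BASE + "?" + urlencode(params)


def build_oai_list_records(metadata_prefix="oai_dc", from_date=None):
    """Form an OAI-PMH ListRecords request for harvesting metadata."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date  # selective harvesting by date
    return OAI_BASE + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_openurl("Sample Technical Report", 2006, handle="12345/sample-1"))
    print(build_oai_list_records(from_date="2006-04-01"))
```

Because both requests are plain URLs, a citation link of this kind could be placed on any web page, which is consistent with the “anywhere” functionality described earlier.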

The system administrative functionality is performed through distributed, authenticated registration.  The metadata is XML.  Data can be in any binary format. 
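
[NOTE:  The following Python sketch is illustrative only and was not part of the presentation.  It shows one way an XML metadata record could describe a data object held in an arbitrary binary format.  The element names, attribute names, and sample Handle are assumptions made for the example and do not represent the DVIA schema.]

```python
import xml.etree.ElementTree as ET


def make_record(handle, title, media_type, size_bytes):
    """Build an XML metadata record that points at a binary payload.

    Only the metadata is XML; the payload itself is referenced by its
    Handle and described by its media type and size.
    """
    record = ET.Element("record")
    ET.SubElement(record, "identifier").text = "hdl:" + handle
    ET.SubElement(record, "title").text = title
    payload = ET.SubElement(record, "payload")
    payload.set("mediaType", media_type)    # any binary format is allowed
    payload.set("sizeBytes", str(size_bytes))
    return ET.tostring(record, encoding="unicode")


if __name__ == "__main__":
    print(make_record("12345/sample-1", "Sample Technical Report",
                      "application/pdf", 204800))
```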

The system architecture is based on a Digital Object Repository and a Handle system that underlie numerous interfaces.  One of the interfaces, a contextual linking service, interacts with other linking services and so becomes a federated interface.  The linking service acts on behalf of clients to issue searches, retrievals, and other requests against a targeted registry.  It is a data mediator that tries to represent both sides, with it all working together to create proper dissemination of the results.

All URLs are tied to Handles.  Since it has a user context, queries can be customized to a user profile; different users initiating the same request will get different results because DVIA analyzes the user’s profile.  This functionality is useful because of the federated nature of the system.  DVIA registries can harvest metadata from other registries.  Aggregation will make it more functional.  In fact, federations of registries can be created that are uniquely identified by Handles.  The context linking service routes all queries appropriate to a user. Users would not see that it is a federation or federation of federations.  No matter where the user comes in with a query, the search will be directed to the federation that will be most useful. 
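
[NOTE:  The following Python sketch is illustrative only and was not part of the presentation.  It shows how a context linking service might route the same query to different federations depending on the user’s profile.  The class names, the access-level scheme, the routing rule, and the sample Handles are all assumptions made for the example.]

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Federation:
    handle: str            # hypothetical Handle identifying the federation
    subjects: frozenset    # subject areas the federation covers
    min_access: int        # lowest access level allowed to search it


@dataclass
class UserProfile:
    name: str
    access_level: int
    interests: set = field(default_factory=set)


def route_query(profile, subject, federations):
    """Pick the federation judged most useful for this user and subject.

    Federations the user may not access, or that do not cover the
    subject, are skipped.  Among the rest, the one sharing the most
    subjects with the user's interests wins.
    """
    allowed = [
        f for f in federations
        if profile.access_level >= f.min_access and subject in f.subjects
    ]
    if not allowed:
        return None
    return max(allowed, key=lambda f: len(f.subjects & profile.interests))


if __name__ == "__main__":
    feds = [
        Federation("12345/fed-open", frozenset({"materials", "optics"}), 0),
        Federation("12345/fed-restricted", frozenset({"materials", "propulsion"}), 2),
    ]
    guest = UserProfile("guest", 0, {"optics"})
    analyst = UserProfile("analyst", 2, {"propulsion"})
    # The same request is routed differently for the two users.
    print(route_query(guest, "materials", feds).handle)
    print(route_query(analyst, "materials", feds).handle)
```

In this sketch the user never names a federation; the routing decision is made entirely from the profile, which mirrors the point that users would not see that a federation is involved.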

Discussion

Since the URLs can be encrypted, it could be made difficult if not impossible to determine who asked for what data and from where.

The interface may need more work since the focus has been on functionality instead of appearance, so input and/or comments are welcome from CENDI members.

SF-298 Report Documentation Page Toolkit, Marjorie Powell

Reports were being submitted in a variety of formats with many different standards applied.  This necessitated the creation of a form to standardize submittals.  As Internet use has grown and web-based documents increasingly become the norm, it was a natural step to adapt the SF-298 Report Documentation Page to a website.  There is a one-time registration requirement to access and submit documents.  A Technical Reports Submission Toolkit is available for download that has step-by-step help in the form of Frequently Asked Questions.  An avatar is built into the site to give oral instructions.

There is a demo download available, and users can sign up for secure access to other DTIC data services, such as STINET.  In addition, there are options to link to reports that are in progress, and FAQs that link to other sites for backup information, examples, directives, etc.  When complete, the user can turn the document into a PDF and upload it to the site.  Future plans include being able to mark important fields.

A metadata extraction tool was developed to use on the SF-298 document.  NASA and GPO have tried it with good results.  Carlynn Thompson, who could not attend the CENDI meeting, was given credit for her efforts in bringing DTIC a long way in these areas.
