CENDI PRINCIPALS AND ALTERNATES MEETING
US Geological Survey
Reston, VA
November 6, 2007
Minutes
Fourth Mode of Science
On New Ways of Doing Science
USGS/BRD Showcase
Young Government Leader: An Open Discussion
On New Ways of Doing Science: the Interlinked Roles of Articles and Data
Haym Hirsh, Director, Division of Information and Intelligent Systems, National Science Foundation and Professor of Computer Science, Rutgers University
Dr. Hirsh emphasized that he was giving the talk as a member of the Computer Science Department at Rutgers and not as a program officer at NSF. He welcomed the opportunity to articulate his thoughts around these issues more formally. He has been challenged to find the right label for these activities. “Fourth (4th) Mode” was actually suggested by Steve Griffin at NSF. The premise is that science has historically been described as resting on three pillars, and the suggestion is that we are now going beyond them.
The classic pillars are experimentation, theory (including models, which are often mathematical), and computation. There have been tremendous successes because of the advent of computing, which lacks the constraints of the first two pillars. The use of computers allows scientists to simulate phenomena that can’t physically be executed, such as the Big Bang or climate change. While you can try different experiments in a chemistry lab, there are phenomena that require running through a large number of possibilities. It is also possible to simulate phenomena at a greatly reduced cost. For example, the Boeing 777 was designed with minimal wind tunnel testing because of the use of computers. In addition, you can simulate phenomena for which we do not yet have theories because there are complex interactions involved, such as social networking and epidemiology. It is possible to explore spaces beyond the scope of human ability, e.g., genome decoding, the four color theorem, and archeology.
However, computational science is not simply computer simulation. For example, imaging technology is being brought to a traditional microscope. Data technologies, such as supercomputer simulations and sensor networks, generate massive amounts of data, and computers are being used to help make sense of and gain new insights from this data. Computer-based communication technologies allow scientists who are separated spatially and culturally to work together. More than 20 years after its introduction, e-mail is still being used basically as it was at the start. However, this is changing due to additional computer technologies, including those being used for scholarly publication.
The 4th Mode is not necessarily a pillar as much as it is a framing of science. Increasingly, the practice of science is defined by computing. In a sense, “computational science” is redundant since science is being done computationally. Supporting innovation in science requires supporting innovation in computing. Computers are becoming part of the consensus-making that is at the basis of science. However, not all scientists agree with this; there is concern that we do not completely understand or control the errors that might occur in the systems and software, and so validating results becomes more difficult. Also, there are severe issues of reproducibility.
Dr. Hirsh introduced a new NSF initiative in computational science called Cyber-enabled Discovery and Innovation. This is a five-year program seeking science and engineering research outcomes made possible by innovations and advanced computational thinking. They are looking for revolutionary, transformative research made possible by computational thinking. The seven NSF directorates and various offices are working on this as a cross-cutting program; this is where the big frontier of science is going. The fact that the money is going here is going to drive this. The solicitation has three thrusts: Data to Knowledge; Understanding Complexity in Natural, Built and Social Systems; and Virtual Organizations. They are looking for innovative computational concepts, not just the use of computers. The awards will be $750 million over five years.
In the second part of his presentation, Dr. Hirsh focused on the interlinked roles of journal articles and data. The scientific enterprise is about communication. It is about building off of the work of others and scientific publication is the means by which we stand on the shoulders of others. It is how science is validated through reproducibility.
However, we haven’t determined how to publish computational science. Reproducibility in computational science requires data, software, and parameters. There are significant issues such as being able to transfer the file over the network, the lack of clear standards for documenting these types of experiments, and article and citation rules. While scientists are increasingly putting their data on the web, there is a lack of durability of web sites. He suggested a model from the legal community where you are allowed to cite web sites, and there are requirements for downloading and logging the date.
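As a rough illustration of the legal-style practice of downloading a cited web page and logging the date, the following minimal sketch pins a citation to the copy that was actually retrieved. The class and function names, the example URL, and the use of a SHA-256 digest to identify the saved copy are all assumptions made for illustration; none of them were part of the presentation.

```python
import hashlib
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class WebCitation:
    """A citation to a web page, pinned to the copy that was retrieved."""
    url: str
    retrieved_on: date
    sha256: str  # digest of the downloaded bytes (an assumption of this sketch)


def cite_snapshot(url: str, content: bytes, retrieved_on: date) -> WebCitation:
    """Record the URL, the retrieval date, and a digest of the saved copy."""
    return WebCitation(url, retrieved_on, hashlib.sha256(content).hexdigest())


def matches(citation: WebCitation, content: bytes) -> bool:
    """Check whether a later copy is byte-identical to the cited one."""
    return hashlib.sha256(content).hexdigest() == citation.sha256


# Hypothetical example: the URL and the page bytes are placeholders.
saved_copy = b"<html>dataset description, version 1</html>"
citation = cite_snapshot("https://example.org/dataset", saved_copy, date(2007, 11, 6))
```

A reader who later obtains a copy of the page can call matches() to see whether it is still the version that was cited.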
There is an increased complexity of content and data types. How do you couple the articles with the data? How do we deal with privacy, intellectual property rights, legal and ethical restrictions on the data or the code? Lastly, who pays for it?
In some cases, the solutions are available, but there is cultural conservatism on the part of the academy. People are slowly beginning to address these issues, but perhaps not as quickly as one would like. The funding agencies collectively have a lot of power in this area. The universities are trying to address the issues but there is often not enough of a critical mass to be very effective.
Funding agencies could do a lot to foster the development of infrastructure and encourage its use.
Finally, there is the issue of achieving broader access to scientific information. Neither the articles nor the data resulting from federally funded research are broadly available. The information needs to be available for the public good. The interactions between publishers and scientists aren’t leading to the outcomes we want to see. It is important to get every ounce of good out of the limited funds we have.
Is it appropriate and effective to have the publishers take responsibility for the data? Why would they want to do this? How will they make money? Publishers need an effective business model. Even professional societies need money and the publishing arms are often used to fund other member services. Research institutions and funding agencies could also play a role, but there will be an impact on the overhead for research institutions and this would need to be recouped.
Articles are contributing to science in more ways than just being read. Articles themselves are data and subject to innovative computing technologies that advance science as well. It isn’t possible to separate the data from the articles. Articles are being analyzed to do comparisons on the literature that describes the work. Survey work has been done as far back as 2002 (e.g., work by Russ Altman). Increasingly, articles are being used to compare sequences. “Mining the literature” is coming up with a result, not just an article that you didn’t know about. What this means is that the articles themselves are being mined by scientists in support of new and advanced science.
However, given all this, it is obvious that we don’t address data well. More of the burden of reproducibility is on the scientist and not on the broader scientific community. Mark Liberman at the University of Pennsylvania has written about “Executable Articles”. Publication for the third pillar means executing a program just as you would read an article. Scientific and technical papers should include an explicit, executable recipe for generating their numbers, tables, and graphs from the published data. This also applies to the other two pillars of science.
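A minimal sketch of what such an executable recipe might look like follows. The data set, the column names, and the table being regenerated are entirely hypothetical; the point is only that the published numbers are produced by running code against the published data.

```python
import csv
import io
import statistics

# Hypothetical "published data" that would accompany an article.
PUBLISHED_DATA = """site,temperature_c
A,11.5
A,12.5
B,9.0
B,10.0
B,11.0
"""


def load_rows(text: str) -> list[dict[str, str]]:
    """Parse the published CSV text into one dictionary per row."""
    return list(csv.DictReader(io.StringIO(text)))


def table_1(rows: list[dict[str, str]]) -> dict[str, float]:
    """Regenerate the (hypothetical) Table 1: mean temperature per site."""
    by_site: dict[str, list[float]] = {}
    for row in rows:
        by_site.setdefault(row["site"], []).append(float(row["temperature_c"]))
    return {site: statistics.mean(values) for site, values in sorted(by_site.items())}


if __name__ == "__main__":
    # Print the table exactly as it would appear in the article.
    for site, mean in table_1(load_rows(PUBLISHED_DATA)).items():
        print(f"{site}\t{mean:.2f}")
```

Anyone with the data and the script can rerun it and compare the output with the table in the article.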
The more we can couple the articles and the data, the more we can minimize fraud and error. We can speed up science, lower the barriers to entry, foster education through learning by doing, and make data available for re-use independent of the place where the data or techniques are available.
There are many challenges. Because the generator of the data is not necessarily the user of the data, we need to prepare the data for later re-use when we don’t even know what is going to be needed. In addition, computational science may use data that is not publicly available. Authors need to specify access to data in a durable way. We need methods similar to those of the legal community for copying, identifying and archiving web sites. It is easier to find out who I cited than who cited me. “Shepardizing” in the law goes both ways. It is a mechanism for analyzing cites for quantity and quality. Science hasn’t emulated the legal system; it doesn’t look at the value of the source or whether the concepts have been overruled. The law is able to do this.
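The asymmetry between “who I cited” and “who cited me” can be seen in a small sketch: reference lists give the first direction directly, while the second requires inverting them across the whole literature. The paper identifiers below are hypothetical placeholders.

```python
from collections import defaultdict

# Hypothetical citation data: each paper maps to the papers it cites.
cites = {
    "paper_c": ["paper_a", "paper_b"],
    "paper_d": ["paper_a"],
    "paper_e": ["paper_c", "paper_a"],
}


def invert(cites: dict[str, list[str]]) -> dict[str, list[str]]:
    """Build the 'who cited me' index from the 'who I cited' lists."""
    cited_by = defaultdict(list)
    for citing, references in cites.items():
        for cited in references:
            cited_by[cited].append(citing)
    return {paper: sorted(citers) for paper, citers in cited_by.items()}


cited_by = invert(cites)
```

This sketch covers only the direction of the links; judging the quality of a cite, or whether a result has since been overruled, would need information that a bare reference list does not carry.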
Dr. Hirsh recalled the web-based product from Penn State University called CiteSeer. It was developed in the mid-1990s as a web crawler that specifically went to researcher pages and downloaded the articles. It produced more information automatically than ISI did. The crawler captured Word and PDF files. It created citation statistics and an active bibliography. Similar documents could be identified based on co-citation and on text. A database was created automatically. A histogram was available for visualization. This effort is now focused on domains instead of the broad field of science. It is still constantly crawling the Web. Google Scholar was originally based on the CiteSeer concept. This approach had benefits for evaluation for funding and tenure, such as automatic ranking of publications for research institutions and scholars based on the online articles and the citations. CiteSeer could tell you the most important articles of the year based on the data. It was possible to automate the analysis of the literature to look at collaboration trends.
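Co-citation similarity can be illustrated with a small sketch: two documents are treated as related when they appear together in the reference lists of other papers. This is a generic illustration of the idea with hypothetical identifiers, not a description of how CiteSeer itself was implemented.

```python
from collections import Counter
from itertools import combinations

# Hypothetical reference lists: each citing paper maps to the documents it cites.
reference_lists = {
    "paper_x": ["doc_1", "doc_2", "doc_3"],
    "paper_y": ["doc_1", "doc_2"],
    "paper_z": ["doc_2", "doc_3"],
}


def cocitation_counts(reference_lists: dict[str, list[str]]) -> Counter:
    """Count how often each pair of documents is cited by the same paper."""
    counts: Counter = Counter()
    for references in reference_lists.values():
        # Sort so that each pair has one canonical ordering.
        for pair in combinations(sorted(set(references)), 2):
            counts[pair] += 1
    return counts


counts = cocitation_counts(reference_lists)
```

Pairs with higher counts are cited together more often and would be ranked as more similar.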
Looking ahead, Dr. Hirsh sees interconnected data similar to interconnected articles. Because of the continued fragmentation of science, it is difficult to make foundational discoveries across the sciences. It is necessary to harness narrower expertise collectively for the greater good, such as with Wikipedia or Open Source Software.
Science is also about reputation. You believe some people more than others. The notion of reputation is being explored in new ways with eBay, Amazon, and blogs. These approaches will be brought to bear on the scientific enterprise.
Science is also about community. For example, the use of Facebook and Second Life would create relationships, shared experiences, and communities that would provide support for the development of standards and the archiving and quality control of data. This may encourage disciplines to develop data policies that dictate how the data is supposed to be managed.
USGS/BRD Showcase (Tom Lahr)
Integrated Taxonomic Information System (David Nicolson)
ITIS is a database of organism names, their hierarchical classification, and related information. ITIS is also the interagency partnership of nine agencies and other non-governmental partners. The database includes 550,000 names with synonyms and common names. It has regional and global coverage.
ITIS began in the early 1970s with some information for the Chesapeake Bay. The National Oceanic and Atmospheric Administration (NOAA) took over and extended this effort. The database went online in 1996. The development team is now at the Smithsonian.
ITIS is being used in a number of different ways. It serves a thesaurus function for search. A species search function is being developed. It can also be used to link to biological information, machine to machine, by embedding data into URLs for lookup purposes. The TSN (Taxonomic Serial Number) is a unique identifier in the context of the ITIS system. It is permanent, but not “intelligent”. An effort is underway to move this functionality to web services, which have a tremendous possibility for facilitating lookup. Multi-lingual interfaces to the ITIS content have been made available from other sources including Canada, Brazil and Mexico.
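The idea of embedding an identifier such as a TSN into a URL for machine-to-machine lookup can be sketched as follows. The base URL and the parameter name are hypothetical and are not the actual ITIS URL scheme, the TSN value in the usage note is a placeholder, and treating a TSN as a positive integer is an assumption of the sketch.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Hypothetical endpoint and parameter name; not the actual ITIS URL scheme.
BASE_URL = "https://example.org/taxon"


def tsn_url(tsn: int) -> str:
    """Build a lookup URL that carries the identifier as a query parameter."""
    if tsn <= 0:
        # Assumption of this sketch: identifiers are positive integers.
        raise ValueError("expected a positive integer identifier")
    return f"{BASE_URL}?{urlencode({'tsn': tsn})}"


def tsn_from_url(url: str) -> int:
    """Recover the identifier from a URL built by tsn_url()."""
    return int(parse_qs(urlparse(url).query)["tsn"][0])
```

For example, tsn_url(12345) yields a link that another system can store alongside its own records and resolve later. Because the identifier is permanent but carries no meaning of its own, the link stays valid even if the name or classification attached to it changes.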
Mr. Nicolson presented several slides on Animalia to show how complete ITIS is for both North America and globally. Completeness varies depending on the availability of the information and the emphasis at the time. They have been doing a lot of work in Arthropods lately. At least one reference is required in order to enter an organism into ITIS. Geographic information is also included.
The ITIS web site has a simple search function. Counts were recently added. The standard report gives the classification and other information including how valid the entry is. Users can search external resources including BioOne, the UN Environment Programme, the Species2000 Catalog of Life, the National Center for Biotechnology Information (NCBI) for genomic information, and others. ITIS is involved with many other groups working in this area, including the Barcode of Life, which is involved with establishing the DNA sequences for organisms. The Encyclopedia of Life is also using ITIS.
As of October, the 550,000 names included approximately 109,000 vernacular names. About 179,000 are verified and meet the Taxonomic Data Working Group (TDWG) standards. About 24 percent of the names were inherited from the NOAA system. These are being verified as time permits.
There are competing goals in a project such as ITIS. Some of the different approaches have to do with the ways users would like to use the data. Is it better to have names in a single classification or identify multiple classifications? Should the database be global or regional? Should the focus be on quantity or quality? Should the focus be on current names versus all names, including historic names? The ITIS Steering Committee helps to set the path and they try to balance these competing goals.
In terms of the infrastructure, there are plans to improve the online editing environment, which includes entering the data, editing and proofing, accessing the data, comparing the data, exporting the data, and producing e-checklists. Only a limited number of stewards can currently enter data directly. Contributors provide data but they don’t have direct access. Data integrity is important.
The future vision for ITIS is that it will complete and maintain the North American flora and fauna. Global checklists will continue to be built through ITIS and its global partners, especially the Catalog of Life. Completed groups will continue to be reviewed and updated. Some peer review and steward activities will become more explicit parts of the process.
NBII Digital Image Library (Annette Olson)
The NBII Digital Image Library (DIL) began as an in-house image library in 2002, providing access to pictures for use in presentations and publications. It became a public resource online in 2004. At that time, there were 200 public domain items available. By 2004, the partners had begun to view it as a platform to serve, showcase, and store their images. By 2005, the DIL had 1,000 images and had already been offered 100,000 more.
By 2005, the DIL had a defined mission, but it was very broad and very labor intensive. It has been necessary to step back and reexamine the goals, priorities and methods. Once the digital library was offered so many images, it had to reexamine its processes. They recently examined who needs the images, the types of images being sought, what other galleries are doing, and, therefore, what is the NBII DIL’s niche.
The main audience is natural resource managers, researchers, decision makers, educators, students and the general public. The users are primarily looking for images of organisms for identification in the field or elsewhere. They are also interested in images of habitats often related to issues such as fire, avian flu, and invasives. There is some interest in images related to research and management methodologies and in department public relations images.
The goal is a one-stop gateway to biological images. It is also a place for secure storage. A search for comparable sites turned up others, but these generally had a regional or topical focus. The DIL is seen as a collaborative platform to talk about images.
The level of documentation for the images is an issue. There are only a few galleries that provide detailed metadata. While Google is pursuing the indexing of images of the Department, the DIL sees itself as the gateway for serving images with detailed metadata. The mission has now changed a bit to provide a gateway to biological images with an emphasis on good documentation. They are also interested in a diversity and range of images that no other gallery can provide at this point.
Images are acquired in several different ways. They are harvested from many partners by setting up web services. This is probably where most of the images will come from in the future. The actual images may reside on the DIL server or on the partner’s server.
Policies are very important. The first focus is on the policies related to the images themselves. Currently, images are acceptable from web to archival quality, because, often, the web quality is the only one available.
The contributors must allow non-profit use of the images they make available. There is a signed statement of ownership and authenticity, because the DIL has already encountered cases where photographers are changing the images for impact.
In terms of staff priorities, a focus is on detailed metadata, identifying higher resolution images, and North American biological images. In addition, the staff members are working to enhance the automation of the contribution system.
The web site has recently been redesigned. Search is improved in addition to the browse feature. There is now a special collections section. Image management is providing resizing tools, guidance documents, metadata templates, and QA/QC procedures.
The current DIL has over 5,500 images online. There are 33,000 more under agreement to be added soon. An additional 370,000 images have been offered just through random conversations without any advertising of the service. Future work will include improvements to the cataloging interface and the provision of images to other global scientific services such as the Encyclopedia of Life.
Gap Analysis Program “GAPServe” (John Mosesso)
Gap Analysis has been performed since the mid-1980s. A “gap” is the absence or under-representation, within protected land, of sufficient habitat for targeted species. The results are made available to natural resource managers, land use managers and developers. The goal is to provide information about the status of habitat availability and species requirements in order to be proactive and avoid a regulatory process or a situation where there are few options. The aim is to keep common species common.
The data include species counts, land cover, species distributions, and conservation information. The Gap Program uses Landsat data from 2000. Signatures are identified through Landsat and then data is collected from the field. Species distributions are collected from publications state by state and by modeling. This is also being done in larger chunks by ecosystems and ranges.
The Program has developed habitat affinities. What land cover does a particular species need? What conditions does it need? Having this information, one can predict through modeling where this organism should be found. Actual site records are used along with other variables.
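The modeling step can be illustrated with a deliberately simplified sketch: given a land cover grid and the set of cover types a species has an affinity for, mark the cells where the species would be predicted to occur. The grid, the cover classes, and the affinity set are hypothetical, and an actual model would also draw on site records and other variables.

```python
# Hypothetical land cover grid: one cover class per cell.
land_cover = [
    ["forest", "forest", "urban"],
    ["wetland", "forest", "cropland"],
]

# Hypothetical habitat affinities for a single species.
affinities = {"forest", "wetland"}


def predict_habitat(land_cover: list[list[str]], affinities: set[str]) -> list[list[bool]]:
    """Mark each cell True where its land cover matches one of the affinities."""
    return [[cell in affinities for cell in row] for row in land_cover]


predicted = predict_habitat(land_cover, affinities)
```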
The management plans from federal, state and non-government organization (NGO) land owners are reviewed. The level of protection in the area and the degree to which it is being managed for biodiversity is rated on a scale of 1 to 4. A “1” would be a wilderness area and “4” would be an area with no plan. Only those areas with a 1 or 2 count as protected.
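The protection rating lends itself to a small worked example: overlay a predicted habitat mask on a grid of status codes and compute the share of habitat that falls in areas rated 1 or 2. Both grids are hypothetical, and the equal-area cell layout is an assumption of the sketch.

```python
# Per the rating scale above, only status 1 and 2 count as protected.
PROTECTED_STATUSES = {1, 2}

# Hypothetical grids with the same shape: predicted habitat and status codes.
habitat = [
    [True, True, False],
    [True, True, False],
]
status = [
    [1, 3, 2],
    [4, 2, 1],
]


def protected_fraction(habitat: list[list[bool]], status: list[list[int]]) -> float:
    """Return the fraction of habitat cells that lie in status 1 or 2 areas."""
    habitat_cells = 0
    protected_cells = 0
    for habitat_row, status_row in zip(habitat, status):
        for is_habitat, code in zip(habitat_row, status_row):
            if is_habitat:
                habitat_cells += 1
                if code in PROTECTED_STATUSES:
                    protected_cells += 1
    return protected_cells / habitat_cells if habitat_cells else 0.0


fraction = protected_fraction(habitat, status)
```

In this example two of the four habitat cells are protected, so the fraction is 0.5. A low fraction for a species would flag a potential gap.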
Over 1500 species have undergone a Gap Analysis across the United States. GAP products are reviewed by biologists from the area and they analyze what it means in order to predict future problems. Federal Geographic Data Committee (FGDC)-compliant geospatial metadata is created.
Initially, the GAP products were used to plan reserves and species corridors. Now, the use is much broader. Therefore, GAPServe has been developed. It is an electronic portal through which GAP data and maps are made available. It is possible to work at scales from a backyard to much larger geographic ranges. New data is reflected as soon as it comes in. Maps, data and reports are all available via GAPServe. The system describes the data you may be looking for and whether it is available or not. For example, Alaska has barely been touched because it is very complex. However, the GAP Program is closing in on having a completed national data set.
The Program has a new consortium of interested people to broaden the information they have to include land trust areas and more private property information. All GAP projects are done in cooperation with scientists with a variety of expertise. Funding and in-kind services help promote the work as well. There is a rigorous review process.
For the public and students, a number of canned compilation maps have been developed. For example, there is one available for black bears. Alternatively, researchers can go to GAPServe and get all the data or pieces of it. There are options for projection, ESRI Grid or GeoTIFF format, the extent (state, regional or just species distribution), etc. Having the data in one place makes it more manipulable and easier to perform assessments.
Special Interest Topic: Young Government Leader: An Open Discussion (James Hedrick, Presidential Management Fellow, Office of Rural Housing and Economic Development, HUD)
The goal of the Young Government Leaders (YGL) is to educate, inspire and transform. It serves as a coordinated voice and provides networking opportunities for young government professionals. Over the last month, there have been a variety of events for professional development, service, and socializing.
The YGL was founded in 2005. It now has more than 1200 members from over 30 different agencies. There are chapters opening in different areas of the country including Denver, Boston, Central Pennsylvania, Bethesda and Dallas. He presented a breakdown of the members by agency. There are approximately 100 from EPA and 200 from DoD.
Mr. Hedrick believes that YGL is representative of the young federal government workforce. YGL recently conducted a membership survey to get more information. Twenty-five percent of their members have less than a year in government; almost two-thirds have less than three years. The members are generally 30 years of age or younger. (Less than 3 percent of the government workforce overall is under 25 years of age.) Fifty-eight percent have Master’s Degrees, and they are more highly educated than previous workforces. Many of them are Presidential Management Fellows.
The majority fall into the GS9-11/12 government payscale. Time in grade is one of the biggest problems, because they are slowed in their professional advancement in the 12, 13 and 14 levels. The majority are very interested in advancement. One of the best attended panels last year was the leadership development panel. Attendees wanted to know how to go about becoming the next generation of leaders.
Mr. Hedrick then went on to discuss the characteristics of these Gen Y- and X-ers. These generations are more mission-oriented. They want to feel that they are making a difference, but it is often difficult to get into federal government service. The time to hire can be very long and it is easier to get a job and satisfy the need for a mission orientation with an NGO. The Partnership for Public Service found that people really want to “do good” and “serve”, but this can also be done at many non-profits, and there are more opportunities in the NGO sector for training. Government is competing with many other avenues for employment.
These generations have well developed skills and want to maintain them. They like to learn. In the survey, 82 percent wanted professional development opportunities. This response was 20 percent higher than any other response on the survey. Often, they are not getting this from their jobs.
The young government professionals are comfortable with most advanced technologies. They expect to use up-to-date technology at work, but it must be technology with touch. Technology is important but without having someone to show or a mentor to talk to, the technology doesn’t help. There are issues with technology in the workplace. These professionals are often more connected at home than they are at work. Although technology is important, it isn’t necessarily the overriding factor unless all other factors are the same between government and non-government employment.
There are many challenges to government recruiting. Private sector jobs are booming and Gen X, Y and Z individuals are less interested in the job security of the government. Salary is important but not an overriding factor, except for lawyers who carry such a large educational debt that they have to go to the private sector. Hopefully, these potential workers will return to government service in the future.
Information about federal jobs is lacking among college career counselors. The Partnership for Public Service has set up a Call to Serve Network of several colleges where the knowledge level and interest level went up about 20 points after additional information was provided. Summer internships are a good way to encourage younger employees. YGL has received help from 13-L, a group of professionals at 15/SES levels that work on recruitment and retention issues in several agencies. This is another self-organized group of champions. They sponsor get-acquainted “speed dating” sessions for younger professionals.
Most of the young professionals plan to stay three to five years, and some longer. They would tend to move between agencies and departments if they aren’t happy. There are some statistics related to this on the Partnership for Public Service web site. Mr. Hedrick indicated that boredom is more often a problem than burnout. Access to peers is also important for retention. A buddy system for younger professionals within an agency might be helpful.
Mr. Hedrick indicated that job announcements and internships can be sent to him directly or posted to the younggovernmentleaders.org website. YGL is doing a road show among the agency Human Resource departments. Presidential Management Fellows are often looking for rotations in other agencies. Would a rotation among CENDI agencies be possible? They would be willing to serve as a way to recruit young professionals for usability testing, technology discussions, and more. Mr. Hedrick expressed an interest on the part of the YGL group to continue contact with CENDI.