November 3, 2011
NASA GODDARD SPACE FLIGHT CENTER
8800 Greenbelt Road, Greenbelt, MD 20771 Building 28, 2nd Floor, Conference Room E210
9:00 am - Welcome and Introductions
Lisa Weber, Director, Information Technology Policy and Administration, NARA, and CENDI Chair
9:15 am - 11:45 am
11:45 am - Host Showcase - What’s New at NASA STI [presentation]
Gerald Steeman and Ann E. Dixon
» New Content Management System
» Better, Usable Metrics
12:30 pm - Group Lunch
Task/Working Group Chairs
Ms. Lisa Weber, CENDI Chair, opened the meeting at 9:15 am. She thanked Cindy Etkin and her staff from the Government Printing Office for hosting the meeting.
THE SEMANTIC WEB
The web comprises web pages linked together. Links are crucial to what the web is. The pages have information for humans to read. While HTML has hidden metadata, it is basically designed for people to read. By contrast, the semantic web is data for computers to read with semantic searches yielding answers, not just pages that may have answers. In this sense, it is more like a relational database system.
Semantics refers to the meaning as opposed to syntax which refers to the form. A key element of the semantic web is the ability to use inference across the meaning. Semantic web might have been appropriately called the inferred web, the computed web or the atomic web. It stores all your data in atomic format and is a remarkable new way to federate data (combine datasets, merge data, do mash-ups, etc.).
Dr. Strawn went on to explain not only what the semantic web can do but how it does it. He said that the semantic web is an attempt to graft traditional studies of knowledge systems and language understanding onto the web platform to enable “meaningful data”. Traditional computer science built the intelligence into the programming code rather than into the data, resulting in the need to constantly develop and make changes. The semantic web is a step toward putting the intelligence into the data, resulting in simpler code and increased data re-usability.
While the traditional web links pages to pages, the semantic web links data elements to data elements (nouns linked to nouns by links labeled by verbs). Currently, there are unnamed links on the web that require human interpretation. The named links allow the computer to make some decisions.
The semantic web uses the URL (or URI) naming system to create globally unique names for identified nouns and verbs in a text or table. This subject-predicate-object structure is referred to as an RDF triple. A semantic web database is a set of RDF triples in a triple store. Converters and scrapers can be used to create them.
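The subject-predicate-object structure described above can be sketched in a few lines of Python. This is a minimal illustration only, not the way any particular triple store is implemented: the base URI `http://example.org/`, the helper names `uri` and `match`, and the sample statements are all hypothetical.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Hypothetical base URI, used only for illustration.
BASE = "http://example.org/"


def uri(name: str) -> str:
    """Build a globally unique name by prefixing a local name with a base URI."""
    return BASE + name


# A tiny "triple store": a set of subject-predicate-object statements.
# The statements themselves are made-up sample data.
store = {
    Triple(uri("aspirin"), uri("treats"), uri("headache")),
    Triple(uri("aspirin"), uri("isA"), uri("drug")),
    Triple(uri("ibuprofen"), uri("treats"), uri("headache")),
}


def match(store, subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in store
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    )


# Ask for answers (what treats headache?) rather than for pages.
treatments = match(store, predicate=uri("treats"), obj=uri("headache"))
for t in treatments:
    print(t.subject)
```

Because every noun and verb carries a full URI, two stores built independently can in principle be combined with a plain set union, which is one way to read the earlier remark about federating data.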
RDF triple stores have asserted triples, instruction triples and inferred triples. There is an inference and query engine over the triple store which is accessed by an application. Inference engines are then used to create more triples that can be inferred from the explicitly stated relationships. In text, the RDF triples are extracted from key sentences by natural language processing. With tables, the key value of each row becomes the subject, the column names become the predicates, and the table value for each column becomes the object. It should be noted that the RDF for a text is smaller than the text and the RDF for a table is larger than the table. Because data from both text and tables are transformed into triples, it is a way to bring structured and unstructured data together.
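The table-to-triple mapping described above can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name `table_to_triples`, the column names, and the sample rows are hypothetical, and a real converter would also turn subjects and predicates into URIs.

```python
def table_to_triples(rows, key_column):
    """Turn each non-key cell into a (subject, predicate, object) triple.

    The row's key value is the subject, the column name is the predicate,
    and the cell value is the object.
    """
    triples = []
    for row in rows:
        subject = row[key_column]
        for column, value in row.items():
            if column == key_column:
                continue  # the key names the subject; it is not a statement
            triples.append((subject, column, value))
    return triples


# Hypothetical sample table with one key column and two data columns.
rows = [
    {"id": "doc1", "title": "Report A", "agency": "NASA"},
    {"id": "doc2", "title": "Report B", "agency": "GPO"},
]

triples = table_to_triples(rows, key_column="id")
for triple in triples:
    print(triple)
```

Each row with two data columns yields two triples, and each triple repeats both the row key and the column name. That repetition is one way to see why the RDF for a table comes out larger than the table itself.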
Storage is no longer the issue it has been. However, the question is whether you can get the information back out of the system. Part of the process of developing triples is to discard triples that don’t have much meaning. It is also possible to graphically represent triples.
Inferencing is as much about classes and properties as it is about RDF triples. Classes are sets of RDF subjects/objects, and properties are sets of RDF predicates. An ontology is made up of classes and predicates. It can be thought of as a graph where the nodes are classes and the arcs are labeled by properties. This ontology is referred to as the semantics of the domain that is being described.
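How an inference engine derives new triples from asserted triples and class relationships can be sketched with two simple rules. This is a toy illustration, not a description of any particular engine: the predicate names `type` and `subClassOf` are loosely modeled on the RDF Schema vocabulary, and the function name `infer` and the sample facts are hypothetical.

```python
def infer(asserted):
    """Apply two rules repeatedly until no new triples appear.

    Returns only the inferred triples (those not explicitly asserted).
    """
    triples = set(asserted)
    while True:
        new = set()
        for s, p, o in triples:
            for s2, p2, o2 in triples:
                if o != s2 or p2 != "subClassOf":
                    continue
                # Rule 1: subClassOf is transitive.
                if p == "subClassOf":
                    new.add((s, "subClassOf", o2))
                # Rule 2: a member of a class is a member of its superclasses.
                if p == "type":
                    new.add((s, "type", o2))
        if new <= triples:
            return triples - set(asserted)
        triples |= new


# Hypothetical asserted triples: one instance and a small class hierarchy.
asserted = [
    ("aspirin", "type", "analgesic"),
    ("analgesic", "subClassOf", "drug"),
    ("drug", "subClassOf", "substance"),
]

inferred = infer(asserted)
for triple in sorted(inferred):
    print(triple)
```

The loop stops when a full pass adds nothing new, so the result contains every statement that follows from the asserted triples under these two rules and nothing else.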
Linked data is an approach that is based on linked URIs rather than full-blown ontologies. In Germany, professors and students are developing DBpedia, a semantic version of the tables that are already in Wikipedia. The whole field seems to be tending back from the probabilistic aspects of text mining to more traditional linguistic methods or a hybrid approach. The semantic web may be ready to move from the experimental phase to the early adopter phase, but the question is, “What is the motivator?”
The vocabulary for describing classes as well as the relationships, or properties appropriate for the domain is always a bottleneck in the ontology development. Dr. Strawn would like to see UMLS-like work done in other disciplines. The CENDI work in the terminology area could serve as a starting point.
The business world is already doing some of the vocabulary work needed to make linked data and the semantic web a reality. Dr. Strawn would like to see STI catch up in order to make the scientific record more useful and adequately include both the data and the document.
Semantic MEDLINE is a proof of concept to improve access to the wealth of textual resources available through PubMed by adding semantic technologies. Current document retrieval systems, such as Google and PubMed, manipulate textual tokens, including their frequency of occurrence or distribution patterns, but such systems do not actually “know the meaning” of what they access. Queries, as well as the text they operate against, are seen as strings of numbers.
There are emerging applications in the academic world where text mining is being done to extract facts and observe trends, connect text and structured data, perform question answering, and assist in literature-based discovery. These applications require more effective language processing and automatic semantic interpretation.
Automatic semantic interpretation requires mapping of something that is expressed in some kind of representation to something more abstract such as an ontology. Automatic semantic interpretation can augment but not supplant traditional document retrieval systems, manipulating information and not just documents. The goal is to bridge the gap between language in the text and meaning. This is like having a research assistant working for you. In the final analysis, you still have to look at the text. These same principles can be applied to other domains if you have the terminology.
Semantic MEDLINE sits on top of PubMed which is the retrieval engine for MEDLINE. Traditional PubMed searching is used to retrieve citations, including abstracts, which are sent to a natural language processing system. Abstracts are being used now. Without additional knowledge about the information expressed in full text, you would get much more of an information tsunami. Processing full text would require a different level of processing and an understanding of the discourse structure of full text.
This system creates semantic relationships in RDF triples. Automatic summarization techniques are also used to eliminate useless statements. A graphical summary is created which presents a lot of information in a more accessible human format.
The Semantic MEDLINE process is based on the NLM’s Unified Medical Language System (UMLS). It has three key components: a purely linguistic lexicon of more than 430,000 medical and general English terms; a Metathesaurus of more than 2 million biomedical concepts and synonyms (nouns put into semantic types or sets); and a Semantic Network of approximately 135 semantic types and 50 verb relationships, which provide classes of relationships between concepts. The Semantic MEDLINE processes use these UMLS components. Natural language processing using linguistic techniques based on language structure is performed sentence by sentence through the abstracts retrieved by the traditional PubMed searching. The nouns are created using terms in the UMLS Metathesaurus Concepts. The resulting nouns are controlled based on the way they are stated in the UMLS. The UMLS Semantic Network controls the predicates based on the relationships between the concepts in the Semantic Network.
Ontological relationships are the core aspects of a domain ontology. How do we conceptually cut up this area of the world, and what can we say about it? Humans working in a particular domain know them, but we must express them explicitly in order for the computer to use them. Finding new ontologies and terminologies is a new role for libraries and librarians.
Semantic MEDLINE was initially developed for clinical medicine. It has been extended to pharmacogenomics, influenza epidemic preparedness, and the genetic etiology of disease. Dr. Rindflesch showed an example using a clock gene which would allow a researcher to identify, through a search of the literature using PubMed and Semantic MEDLINE, that there is a connection between cancer and obesity.
NLM is currently working on extension of Semantic MEDLINE in the areas of public health and climate and health. (Dr. Donald Lindberg, the Director of the NLM, is on an interagency committee related to climate and health). There are prospects of extending beyond biomedicine.
The system has been run on the last 10 years’ worth of the MEDLINE database, or about 1/3 of the MEDLINE database. The results can be made available as an RDF triple store or as a traditional SQL database. They have performed mid- to large-scale evaluations. What should the system extract from these documents and how should they be represented in the UMLS language? Precision is around 75 percent (lower for molecular biology), while recall is about 60 percent.
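For readers unfamiliar with the two evaluation measures, the sketch below shows how precision and recall are conventionally computed for an extraction task. It is not the NLM evaluation procedure: the function name `precision_recall` and the two sets are hypothetical, and the toy sizes were chosen only so that the result happens to match the reported 75 percent and 60 percent.

```python
def precision_recall(extracted, gold):
    """Compare extracted statements with a hand-built reference set.

    precision = correct extractions / everything the system extracted
    recall    = correct extractions / everything it should have extracted
    """
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical reference set: five statements a human judged to be present.
gold = {"s1", "s2", "s3", "s4", "s5"}

# Hypothetical system output: four statements, three of them correct.
extracted = {"s1", "s2", "s3", "x9"}

precision, recall = precision_recall(extracted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case one of four extracted statements is wrong, which lowers precision, and two of the five expected statements are missed, which lowers recall.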
The interface allows the user to use the graphical visualization and then view the linked text sentence from the PubMed display. The system does not provide the answers for you but makes it easier to identify and sort through the content. It facilitates literature-based discovery, the observation of trends, and decision making, especially portfolio analysis. This is particularly important for researchers who are interested in related or new fields with which they are somewhat unfamiliar.
MetaLib Federated Search Service – Linda Resler
GPO’s federated search tool, MetaLib (www.metalib.gpo.gov), was released October 2010. It searches multiple US Federal government databases, retrieving reports, articles and citations, using the MetaLib software from ExLibris. It is an extension of the GPO Catalog so that users can get to information that isn’t in the catalog itself. There have been approximately 2,755 user sessions so far.
GPO started with 53 databases, including some of the ExLibris free knowledgebases. The databases are organized into topic areas which were originally used by the Depository Libraries. General Resources are those that cross topics. There is also a function to suggest databases for inclusion. Databases are still being added. They are limited to those that are free and where no subscription or licensing is needed.
Functionality includes searching multiple or individual databases, including access to the native interface of the database. Basic, advanced, and expert search options are available. In the expert option, GPO has created an agency list so users can limit by agencies at the high level. Users can save the results to an “e-shelf” for later use, set up their own sets of databases, save and arrange the results in folders, and browse by an A-Z list. The Quick Search interface is currently limited to 10 databases, which are selected by a group of libraries.
Many agencies are developing federated search applications and perhaps this should be a topic for discussion by the Science.gov Alliance or by CENDI as a whole.
Action Item: Consider Federated Search as a special topic for discussion at a future meeting.
GPO’s Federal Digital System Update – Kate Zwaard
FDsys automates the collection and dissemination of electronic information from all three branches for the life of the republic, authenticated and digitally signed, versioned, publicly accessible and downloadable for no fee, and in digital form. FDsys is a Content Management System to support the security and authentication needed for the entire lifecycle, a Preservation Repository that follows archival standards, and an Advanced Search Engine, which takes advantage of extensive metadata and advanced search technologies. The goal is to automate the lifecycle and the “handoffs” required.
Release 1 established the foundational infrastructure and the preservation repository; replaced the GPO public site; performed a large scale data migration; and provided operational continuity for the system. The replacement of GPO’s current public web site is scheduled to be completed in December 2010. The other functionality will be added by the end of December, including failover capabilities. Another major component of the system is internal change management. As components are brought online, people will be trained and made aware of the changes in their tasks.
Additional FDsys projects are focused on content. GPO released a daily compilation of Presidential Documents in February 2009. They converted Federal Register Publications to XML, with the Code of Federal Regulations having been completed in December 2009. A pilot project of accepting digitized statutes by the Library of Congress was completed in April and is awaiting Congressional approval for implementation. Federal Register 2.0 was released in July.
FDsys is based on the concepts of interoperability and reuse in response to open government initiatives. The content is made available to all major search engines in an indexable form. GPO is in communication with Google and Microsoft. Sitemaps and permission statements enable compatibility with LOCKSS. Data is also provided in XML, which allows private citizens and non-profits to create new ways to interact with key public content. Some results include FedThread.org and GovPulse. Other federal sites, such as Science.gov and Regulations.gov, rely on FDsys to enrich their user experiences.
Release 2 of FDsys will include developing an external content submission component. Content must currently come through the service specialists. Access functionality and data usability will be extended, and emphasis will be placed on collection development. A lot of the software for this release is already built, but GPO is developing enhancements around the collections.
Collection development will include yearly Public Papers. GPO would like to get more harvested content and they have a strategic roadmap for harvested content. The roadmap covers the harvester, the metadata, and the packaging of the information in a preservation format. The ILS Metadata Exchange project will seek to automate the exchange of metadata between the catalog and their integrated library system, and vice versa. MODS metadata is used for description and PREMIS metadata for preservation. Most of the data is digitally signed. Each document is also described in context. GPO wants to be able to navigate down to easily searchable chunks; for example, breaking the Federal Register into granular pieces.
The ultimate goal of FDsys is to be comprehensive and to handle any type of media. The system can currently handle PowerPoint. Multimedia is more difficult, primarily because of preservation issues.
Science.gov is a unique collaboration with tangible results. It is an interagency science discovery tool that provides access to more than 200 million pages of authoritative scientific information from 14 U.S. science agencies and serves as the USA.gov science portal. It is also a large scale, voluntary collaboration among those agencies.
The concept of Science.gov was spawned by two workshops. In 2000, a blue-ribbon panel explored the concept of a physical science information infrastructure. This prompted interagency collaboration. The second workshop, “Strengthening the Public Information Infrastructure for Science,” which occurred in 2001, resulted in the actual creation of the Alliance. In 2002, Science.gov was named the “First.gov for Science” portal.
Science.gov development began in 2001 and the first version was launched in December 2002. It was created as a separate entity from CENDI because there were agencies that were not members of CENDI but that had content or contributions to make. Since that time, other Alliance members have been added. CENDI member agencies are automatically members of the Science.gov Alliance.
The members of the Alliance are committed to ongoing collaboration. Underlying the Alliance is a common shared premise that while each agency has vast stores of information, science and user needs aren’t bounded by agency. At the time of the initial development, many organizations were developing web portals, so this was the approach of choice.
Science.gov faced several integration challenges that included funding, politics, the broad scope of federal science information, the wide range of audiences, and issues of sustainability. Guiding principles for the content were developed. Users of Science.gov were identified to include researchers, program managers, librarians, educators, and the public, with the key target audience defined as “the science-attentive citizen.” Technology and content were based on the potluck party approach. All agencies had resources and information specialties to bring to the Internet table.
Different technical teams were created under the Alliance. These groups developed the taxonomy (browse tree topics), designed the web site, and conducted outreach and promotions. For each enhancement, testers from across the agencies have been involved. In 2009, the taxonomy was updated to handle new science topics. More recently, a group helped develop the federated image libraries feature.
In the beginning, Science.gov received two grants from the CIO Council’s E-government grants. However, in-kind contributions have been the basic funding approach, including agency staff time (initial work involved more than 200 agency staff) and use of existing resources to support Science.gov development. NAL and USGS provided the Alliance co-chairs. NTIS created the initial catalog of scientific and technical web sites. DOE/OSTI developed technologies including the deep web search capability, hosted the web site, and provided the technical team. NLM provided usability testing of the initial web site prior to the public launch. USGS initially provided the indexing of the web catalog. In-kind contributions continue for specific development. Science.gov has also taken advantage of DOE SBIR (Small Business Innovation Research) R&D awards for some research that was later implemented by Science.gov. The “pass the hat” approach was used to fund specific technical development opportunities, such as the development of Science.gov 3.0. CENDI support leverages resources. The annual average of in-kind support as well as other direct fees since 2005 has been approximately $180K.
Science.gov is managed as a CENDI working group. Through financial and in-kind commitments from its member agencies, CENDI provides the ongoing infrastructure needed to enable such a large-scale collaboration. CENDI and Science.gov funds are combined in a single pot. Alliance-Only funds go into the treasury with the option to be used for specific Science.gov purchases such as exhibits, or to support the general Science.gov operations. A chart showed the overview of CENDI finances and the approximate portion of the Science.gov funding.
Science.gov 1.0 was launched in 2001-2002. Version 2.0 was launched in May 2004, which added the relevancy ranking of the metasearch results. Versions 3.0 and 4.0 were less aggressively launched. Version 5.0 was a significant redesign of the web site with the goal of providing a much richer user experience. Version 5.1 aggregated news feeds from 11 science agencies and made an existing internships and fellowships section searchable. A federated image library will be coming soon. Forty-two databases are currently searched.
Content management is done in a distributed fashion. The Secretariat now provides the Lead Content Manager. A system was developed so the agencies can input and edit the metadata for the web site catalog. Agency content managers help to select featured sites and searches. The web sites are indexed along with the deep web databases the agencies have identified. Suggestions for the databases are reviewed by a team for technical suitability. Science.gov provides a real-time search across the databases and the web catalog. The search is not against an index that may not be up to date. The Alliance Members Page provides access to guidance documents, the input system, promotional materials, and usage statistics. Usage continues to grow.
DOE/OSTI has done an assessment of the overlap with Google. There is less than 1 percent overlap with Google and about 3.2 percent overlap with Google Scholar. If full text documents are searchable on the target database site, they are searchable via Science.gov. During the query, the most relevant documents are gathered, approximately 100-200 from each source, and are then combined for relevance ranking.
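The gather-then-rank step described above can be sketched as follows. This is a simplified illustration, not the Science.gov implementation: the source functions, the sample titles, the function names, and the term-count relevance score are all hypothetical stand-ins for live database queries and a real ranking algorithm.

```python
def relevance(query, title):
    """Toy relevance score: how often the query terms occur in the title."""
    terms = query.lower().split()
    words = title.lower().split()
    return sum(words.count(term) for term in terms)


def federated_search(sources, query, per_source_limit=100, top_n=3):
    """Gather the leading hits from every source, then rank the combined pool."""
    pool = []
    for name, search in sources.items():
        hits = search(query)[:per_source_limit]  # cap what each source contributes
        pool.extend((name, title) for title in hits)
    # Rank all gathered hits together with one common scoring function.
    ranked = sorted(pool, key=lambda hit: relevance(query, hit[1]), reverse=True)
    return ranked[:top_n]


# Hypothetical sources; real ones would query live databases in real time.
def search_agency_a(query):
    return ["Solar wind data", "Solar flare solar cycle report"]


def search_agency_b(query):
    return ["Wind energy study", "Solar panel efficiency"]


sources = {"agency_a": search_agency_a, "agency_b": search_agency_b}
results = federated_search(sources, "solar")
for source, title in results:
    print(source, title)
```

Scoring the pooled hits with a single function, rather than trusting each source's own ranking, is a design choice made for this sketch because scores from independent databases are generally not comparable.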
The collaboration between Science.gov agencies is often cited as a model for collaboration. Several collaborations have come about because Science.gov had the foundation already built. It serves as the model for ScienceEducation.gov, a portal for federal government science education information, which is now in beta. It was also the model for Worldwidescience.org, for which Science.gov is the U.S. contribution.
Mentions of Science.gov in the news are tracked. The site has received many awards, including a Top 10 Google Choice for Science site.
We are often asked about the difference between Science.gov and Data.gov. Science.gov is a search portal that provides bibliographic as well as full-text (if available) access to S&T material at the record level. Data.gov is a clearinghouse of metadata describing data sites. Ms. Jordan presented an overview of the differences. The level of search granularity in Science.gov was highlighted. Like Data.gov, we need to put information out in formats that other people or machines can consume. Science.gov is a perfect platform on which to launch new technologies. Ms. Jordan posed the question, “What will Science.gov 10.0 look like?”
End of Technical Program