The Case for a Canadian National Social Sciences Data Archive 1

Charles Humphrey, University of Alberta2

ABSTRACT: Institutions of higher education in Canada have found their access to federal statistical data in machine-readable format constricted by cost-recovery policies introduced during the Mulroney government. Two strategies emerged from the academic community in response to the pricing scheme of Statistics Canada: the Canadian Association of Research Libraries (CARL) Data Consortium and the Data Liberation Initiative of the Social Science Federation of Canada (SSFC). While both strategies opened access for some institutions, researchers at other institutions were excluded. This article argues that a national social science data archive is needed to redress this inequity. Other developments that encourage the creation of such an archive include the growing network of data libraries in universities and the push to organize data for the Global Change Program.

RÉSUMÉ: Les institutions canadiennes de haut savoir ont vu leur accès aux données statistiques fédérales présentées sous un format lisible à la machine limité par les politiques de recouvrement des frais du gouvernement Mulroney. Les universités ont répliqué à l'échelle de tarifs de Statistique Canada en élaborant deux stratégies : le Consortium de données de l'Association des bibliothèques de recherche du Canada et l'Initiative de démocratisation des données de la Fédération canadienne des sciences sociales. Bien que ces deux stratégies aient permis à certaines institutions d'accéder aux données en question, les chercheurs d'autres institutions en furent exclus. Le présent article fait valoir la nécessité de corriger cette situation injuste en mettant sur pied des archives nationales de données en sciences sociales. Parmi les autres événements qui favorisent la création de telles archives, mentionnons l'essor du réseau des bibliothèques de données dans les universités et la dynamique d'organisation du Programme de changements à l'échelle du globe.

Several events over the past few years have demonstrated the need for a Canadian national social science data archive. In particular, the Canadian Global Change Program, the Canadian Association of Research Libraries Data Consortium, and the Data Liberation Initiative have each contributed to making the case for a national data archive. A basic lesson from these experiences is that Canadian scholars require an institution that will locate, obtain copies of, and preserve significant Canadian data collections so that these data can be shared on an egalitarian basis with all Canadian researchers irrespective of their institutions' size, means, or location. Below is a discussion of some key reasons why this idea is especially relevant now.

The Canadian Social Science Data Conundrum: Access to Data

An embarrassing reality for Canadian scholars is that they can obtain machine-readable data far more readily about the United States than they can about their own country. U.S. data may be used because comparable data for Canada do not exist. More often, however, United States data are analyzed because access to equivalent Canadian data is subject to some form of restriction or barrier. As described in a position paper by the Data and Information Systems Panel of the Canadian Global Change Program, one of the major barriers to Canadian data is the federal government's information policy and the pricing schemes that result from it.3

The United States government's approach toward ownership of data collected and released by its bureaucracy is quite different from that of its Canadian counterpart. Basically, most government data in the United States are treated as belonging to the public domain. The fundamental practice has been to distribute these data at the marginal cost of reproduction. In the past, this has meant the cost of writing files onto magnetic tape (approximately $100 to $200 per tape). Today, the United States government distributes a tremendous volume of data on CD-ROMs, which cost a fraction of what a magnetic tape costs to produce. (A CD-ROM can be produced for as little as $1.25 per disc after a master copy has been cut.)

Since the Mulroney years, the Canadian government has operated on a cost-recovery policy regarding the secondary distribution of data.4 The implementation of this policy within divisions of Statistics Canada, for example, resulted in dramatic fee increases for data. Files that were purchased for $10,000 from the 1981 Census were sold for over $200,000 in their 1986 Census equivalents. In 1994, all prices for public use microdata in the Statistics Canada catalogue were doubled.

Not only does the Canadian government not view data as inherently part of the public domain, but data are also sold without considering the impact prices have on access. It is my contention that the pricing policies of the past ten years have been instrumental in keeping many data collections out of the hands of academic researchers. This has not been a conspiracy against scholars, but rather the consequence of a policy that treats educational institutions as though they have means equivalent to those of one of the top 500 Canadian businesses.

Strategies to Deal with the Cost Barrier to Access

The CARL Data Consortium

Through the Canadian Association of Research Libraries (CARL), 5 academic institutions approached Statistics Canada as a buyers' consortium and collectively purchased one copy of the 1986 and 1991 Census data files at Statistics Canada's price. The agreement with CARL permitted the distribution of these data among the members of the Data Consortium. A similar arrangement was made through CARL for the acquisition of the 1986 and 1991 Census of Agriculture and for the series known as the General Social Survey of Canada, an annual, topical survey conducted since 1985. The CARL Data Consortium has not altered the pricing practices of Statistics Canada, but rather has provided a way for higher education to purchase data at a more affordable price.

The Data Liberation Initiative

A project known as the Data Liberation Initiative, promulgated by the Social Science Federation of Canada (SSFC), has emerged within the past eighteen months to incorporate the distribution of machine-readable data within the federal Depository Services Program (DSP). The DSP is responsible for distributing information from the federal government to the public through a system of public and educational depository libraries. The Treasury Board authorizes distribution of the material disseminated through the DSP, which to this point has been chiefly information in print formats. Within the last two years, the DSP has experimented with distributing some items in machine-readable format. To date, this has been textual material, not numeric data. The Treasury Board, realizing that tremendous savings exist in distributing information in electronic format, has encouraged the DSP to move beyond the delivery of information in print. Thus, the current climate is supportive of distributing machine-readable information through the DSP. What remains, however, is to extend the scope of machine-readable information to incorporate statistical data. This is what the SSFC initiative addresses.

Negotiations, which are not yet complete, were initiated by the SSFC with Statistics Canada and the DSP. The proposal is that educational institutions and the DSP jointly pay Statistics Canada somewhere between $700,000 and $1,000,000 over a five-year period in return for a large collection of historical data (emanating from projects conducted mostly by Labour and Household Surveys) as well as data from upcoming projects, such as the 1996 Census of Canada. This project has the potential to place a tremendous volume of valuable data into the hands of researchers at participating institutions.

Consequences of These Two Strategies

Have/Have Not Institutions

The strategies of both CARL and the SSFC have been pragmatic in confronting the federal government's information policy. Both approaches involved negotiating special price arrangements for access to data. The alternative strategy of attempting to change the fundamental policy behind the pricing of data simply did not seem politically possible.

I have been a proponent of both the CARL and SSFC initiatives and believe that without the CARL Data Consortium, machine-readable data from the past two censuses would likely not have been available to university researchers. I do not condemn either of these approaches, but inequities have nevertheless arisen from these special arrangements. There are institutions of higher education that simply do not have the resources to buy into the data collective. The consequence is that, at a minimum, a three-tier system has emerged regarding access to these data by Canadian researchers. There are scholars who have access through their institution's affiliation with the CARL consortium; there are researchers who receive privileged access to data from Statistics Canada either through grant funding or contractual arrangements for a special project; and there are those outside both of these categories who simply do not have access to the data.

We face a situation in Canada where researchers currently work on an uneven playing field with respect to access to statistical data. I recently spoke with an economist at one of Canada's smaller universities, with a student population of around 3,000, who was unable to obtain Statistics Canada data while his colleagues at a nearby larger institution had a copy of the data through a cooperative arrangement. Our current approach to data access will divide institutions into haves and have-nots. An ideal way to redress such inequities would be to create a national social science data archive that served as a repository of government data and then provided academic institutions with access to these resources.

Preservation of National Data Treasures

The DSP is not an archiving program but rather a distribution mechanism. A major concern about the success of the Data Liberation Initiative is that a flood of data will be made available without seriously addressing the issue of the long-term preservation of these data. The National Archives has taken the position that its responsibility for the preservation of machine-readable files is limited to its records management policy, which currently does not include the statistical files being discussed. No action is being undertaken by the National Archives to preserve the data identified within the Data Liberation Initiative. Furthermore, a serious question remains whether Statistics Canada sees within its mandate the responsibility of archiving data for other organizations, other than for resale possibilities.

In addition to the large Statistics Canada data collections, academic researchers have generated valuable data in their own right. Through funded projects, especially through grants from the Social Sciences and Humanities Research Council (SSHRC), significant data have been gathered about Canadian society, e.g. the national election studies.6 Trying to locate data from a SSHRC-funded project is currently difficult given the ad hoc and decentralized manner by which these data are archived and made available for secondary research purposes. Since 1989, SSHRC has required every grant recipient to deposit with a recognized institution a copy of all data created through one of its funded projects. However, no central agency serves as the Canadian repository or ensures that data are eventually deposited in a usable form.

Since the demise of the Social Sciences Data Clearing House in the mid-1970s, only limited efforts have been undertaken to catalogue and preserve data systematically at a national level. In the early 1980s the now defunct Machine Readable Archives Division of the National Archives undertook a cataloguing project known as the Canadian Union List of Data (CULDAT). Following a reorganization within the National Archives that resulted in disbanding the Machine Readable Archives Division, the project ceased and the catalogue was passed along to its contributors. The University of Alberta maintains a public copy of the catalogue, but updates occur only on a voluntary basis. Overall, the preservation of data and its cataloguing now exist as a voluntary activity among a few academic data libraries across the country. These efforts, while commendable in the absence of a national social science data archive, are stop-gap at best.

Conclusion

The experiences of the CARL Data Consortium and SSFC Initiative can be viewed as stepping stones toward a Canadian national social science data archive. The effort behind these programs certainly should not be judged as an end in itself. Rather, avenues have been opened to data that heretofore were restricted because of their cost. A national data archive could well benefit from the groundwork accomplished by these two programs.

However, the outcome of these two programs can hardly be considered the final actions that will be necessary in obtaining access to government statistical data. Further advocacy will likely be needed and a national data archive could serve as a focal point for promoting access to data in the future.

Other events have recently occurred upon which a national data archive can build. A network of data libraries has been growing since the CARL Data Consortium began in 1989. Many university libraries, once they participated in the Data Consortium, became aware of the need to provide data services to accompany the collection of census files delivered through the program. Through the Canadian Association of Public Data Users, workshops have been offered to assist libraries in forming these new services. In western Canada, the Council of Prairie and Pacific University Libraries (COPPUL) formed a federation of ten universities to share a joint membership with the Inter-university Consortium for Political and Social Research (ICPSR), which is a major, international social science data archive housed at the University of Michigan. The COPPUL federation is currently examining a proposal to extend its mandate to incorporate other data sources. A similar federation project involving Ontario and Quebec university libraries is underway.

In each of these instances, the network of data libraries provides an ideal framework through which a national social science data archive could disseminate data. The purpose of a central data archive would not be to supplant the growing network of data libraries but rather to work as a partner in supplying data resources to these institutions. The experiences of central data archives in other nations demonstrate that the most successful way of disseminating data to researchers is through local assistance.7 Because of the diversity of computing environments, local data libraries are best placed to troubleshoot for the researchers on their campuses.

Internationally, Canada is one of the few industrialized countries that do not have a central or prominent social science data archive. The case has been made above that distinct national advantages would be created by having a central data archive within Canada. This argument can be extended to include the advantages this country would gain by working together with other national data archives. The Global Change Program has demonstrated the clear need for the international exchange of data when examining worldwide phenomena. However, before data can be shared internationally, standards and conditions of exchange must be established. Without a national data archive, Canada has been on the sidelines in these negotiations. In my opinion, Canada's participation in the Human Dimensions component of the Global Change Program has been hindered by not having a central data archive.

Access to data and a network of data libraries provide a foundation that the Data Clearing House never had. A number of factors come together to make the formation of a national data archive a real possibility. The one ingredient still missing is a political will to carry this proposal to completion.
