Breaking the code: predicting where disease will strike

Environmental Health Perspectives, Sept, 2005 by Tim Lougheed

Health authorities today enjoy an embarrassment of riches when it comes to data on disease incidence, thanks to the rapidly increasing use of electronic record keeping at hospitals, clinics, and related businesses such as pharmacies. Where significant numbers of paper files once required days or weeks to process for survey purposes, much larger volumes of material on patient conditions can today be assembled in a matter of hours. This growing abundance of information has raised expectations about the prospect of identifying local outbreaks of disease at ever earlier stages so that such outbreaks may be addressed and contained as soon as possible--expectations that are perhaps more easily raised than met, given statistical realities.

One of the most recent strategies for harnessing these data is offering some promising results, however. A paper that appeared in the March 2005 edition of PLoS Medicine outlines a new methodology that requires only that researchers know the actual number of disease cases occurring in a given area over a given period of time. The resulting statistical model makes only the most limited number of assumptions surrounding the emergence of a disease, while attempting to compensate for naturally occurring temporal and geographical variations in disease reporting.

This offers greater detail than one-dimensional statistical methods, which track disease outbreaks in either purely temporal or purely geographic terms. Above all, the new model eliminates the need for information about the local population and its relative risk for disease--for example, whether a neighborhood contains a higher-than-average proportion of groups, such as infants or the elderly, who may be prone to specific ailments.

A Window on Disease

Underlying the work presented in PLoS Medicine is the principle of the scan statistic, which gives the probability of an excessive number of case reports appearing within a narrowly defined space and time, as compared with a probability determined from information collected in a larger region or over a longer period. Scan statistics aren't new, but the PLoS Medicine paper adds a twist: the calculation of probabilities for disease outbreak within various samples of space and time. In the case of purely geographic surveillance, sudden highly localized outbreaks may be hidden in data that have been aggregated for a region, whereas such events are more likely to be revealed once a temporal dimension has been incorporated. Space-time permutation scan statistics could therefore become a preferred way of representing the occurrence of diseases such as cancer, where the numbers of actual and expected reports are counted within particular "windows."
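As a rough illustration of how actual and expected reports might be counted within a window using case counts alone, the Python sketch below assumes that each case is recorded as a (zip code, day) pair and that the expected count is the window's share of the area's cases multiplied by its share of the period's cases. The function names and the data layout are hypothetical and are not taken from the paper.

```python
def observed_cases(cases, zips_in_window, days_in_window):
    """Count the cases that fall inside both the area and the time span."""
    return sum(1 for z, d in cases if z in zips_in_window and d in days_in_window)


def expected_cases(cases, zips_in_window, days_in_window):
    """Estimate the expected count from case data alone (no population figures).

    Assumption: with no space-time interaction, the window should hold
    (cases in these zip codes) * (cases on these days) / (all cases).
    """
    total = len(cases)
    if total == 0:
        return 0.0
    in_zips = sum(1 for z, _ in cases if z in zips_in_window)
    in_days = sum(1 for _, d in cases if d in days_in_window)
    return in_zips * in_days / total


# Hypothetical toy data: one (zip code, day) tuple per reported case.
cases = [("10001", 1), ("10001", 2), ("10002", 1), ("10002", 2)]
observed = observed_cases(cases, {"10001"}, {1})
expected = expected_cases(cases, {"10001"}, {1})
```

Because the expected count is built only from the case records themselves, no information about the local population at risk is needed.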

These windows can be visualized as a set of thousands or even millions of overlapping "cylinders" within the geographic area in question, each varying in the amount of territory and length of time it covers. Harvard University Department of Ambulatory Care and Prevention associate professor Martin Kulldorff, one of the paper's authors, uses the cylinder as a way of visualizing the sampling of data in three dimensions, with the x and y axes representing the geographic area being surveyed and a z axis representing time. As time passes, subsequent samples are added atop previous ones, and the cylinder thus grows in height. Mathematically, each cylinder uses a likelihood function to compare its expected and observed numbers of cases, making it possible to single out locations and days in which the latter number was unexpectedly high.
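The comparison of expected and observed counts in a single cylinder can be sketched in Python as follows. The article does not spell out the likelihood function, so this sketch assumes a Poisson-style generalized likelihood ratio of the kind commonly used with scan statistics; the function name and arguments are hypothetical.

```python
import math


def log_likelihood_ratio(observed, expected, total):
    """Score one cylinder; larger values mean a more surprising excess.

    observed: cases counted inside the cylinder
    expected: cases expected inside the cylinder
    total:    all cases in the study area and period
    Assumption: a Poisson-style ratio; cylinders without an excess score 0.
    """
    if expected <= 0 or observed <= expected:
        return 0.0
    inside = observed * math.log(observed / expected)
    remaining = total - observed
    if remaining > 0:
        outside = remaining * math.log(remaining / (total - expected))
    else:
        outside = 0.0
    return inside + outside


# A cylinder with twice the expected cases scores above one with no excess.
no_excess = log_likelihood_ratio(5, 5, 100)
doubled = log_likelihood_ratio(10, 5, 100)
```

Scoring every cylinder this way and ranking the results is one way to single out the locations and days where the observed number was unexpectedly high.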

The new model was tested in conjunction with the New York City Department of Health and Mental Hygiene, which has been among the leading agencies collecting the data necessary to gauge the spread of disease. In the late 1990s, the city launched a dedicated program of syndromic surveillance, tracking ambulance reports, emergency room visits, and pharmacy sales--all with the aim of spotting anomalous clusters of cases that could signal a disease outbreak.

Kulldorff says the researchers took several steps to manage the computational burden of this exercise. The circular base of each cylinder covered one of several combinations of 183 New York City zip codes, with a radius of zero to 5 kilometers. Each of the cylinders the researchers defined was seven "days" high (the team reasoned that if an outbreak has existed for more than a week, it will likely have already been picked up by clinicians or laboratories).

But while the geographic area covered by each cylinder remained the same, the specific days changed. For example, over the course of a month running from day 1 to day 30, the first statistical analysis would take place on a cylinder with a height defined by data from day 1 to day 7, the next would take place on a cylinder with a height defined by data from day 2 to day 8, and so on. This moving window makes it possible to look for changes taking place in a strictly defined time and space.
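The moving window described above can be sketched in a few lines of Python. The function name and the inclusive (start, end) day pairs are assumptions made for illustration, not part of the researchers' software.

```python
def sliding_windows(first_day, last_day, height=7):
    """Return inclusive (start, end) day ranges, each `height` days tall,
    advancing one day at a time across the period."""
    return [
        (start, start + height - 1)
        for start in range(first_day, last_day - height + 2)
    ]


# Over a month running from day 1 to day 30:
windows = sliding_windows(1, 30)
# The first window covers days 1 to 7, the next days 2 to 8, and so on.
```

Each of these day ranges would be paired with the same geographic base, so only the time span changes from one analysis to the next.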

Thus, the team could, for example, catch a disease outbreak that began emerging on day 7, something that health authorities might not otherwise have identified for many more days. This early signal would prompt officials to check out the situation sooner and perhaps contain any outbreaks more successfully. To keep such signals in perspective, the statistical analyses also refer to the previous 30 days, so that any longer-term trends or variations could be compared with what has been seen in the seven-day window.

 

Content provided in partnership with Thomson Gale