The Bristol Observatory

Incorporated in 1997 by John A. Pandiani, Ph.D., Sociologist, and Steven M. Banks, Ph.D., Mathematician


Probabilistic Population Estimation for Children's Mental Health

Probabilistic Population Estimation is a statistical procedure that provides unduplicated counts of the number of children and adolescents who are represented in more than one data set, without reference to personally identifying information (Banks & Pandiani, in press, a). For purposes of measuring treatment outcomes, these data sets usually describe the caseloads of different institutions or service sectors during different periods of time.

Probabilistic Population Estimation has been used extensively to measure treatment outcomes for adult mental health programs. Rates of hospitalization subsequent to community mental health treatment, for instance, have been determined by using Probabilistic Population Estimation to measure the overlap between a community program's caseload during one year and the inpatient population during subsequent years (Banks et al., 1999). When probabilistic estimates of incarceration subsequent to treatment (outcomes) are combined with probabilistic estimates of incarceration before treatment (access), the result is a powerful risk-adjusted measure of program performance. The ratio of subsequent to prior incarceration takes into account differences among programs in caseload composition, which could otherwise account for differences in this treatment outcome. Another important measure of treatment outcome is provided when Probabilistic Population Estimation is used to measure the overlap between vital records mortality data sets and substance abuse treatment data sets. In one application, the elevated risk of death associated with problem drinking was measured for different age groups (Banks & Pandiani, in press, b).

Probabilistic Population Estimation has also been used to evaluate systems of care for children and adolescents. The degree to which child-serving agencies share responsibility for children and adolescents has been recognized for a number of years as an important measure of service system performance. A child-focused measure of this shared responsibility is provided by the caseload segregation/integration ratio. Caseload segregation/integration has been measured on a statewide basis using anonymous records from children’s mental health, child protection, and special education programs (Pandiani, Banks, & Schacht, 1999b). Levels of caseload segregation were found to be related to a number of treatment outcomes on both the individual and the community level (Banks et al., 1999b). Individual-level outcomes include incarceration (for boys), maternity (for girls), and hospitalization for behavioral health care (for both genders). Community-level outcomes include rates of out-of-home placement and community-wide maternity and hospitalization rates. Other children’s treatment outcomes that have been measured using Probabilistic Population Estimation include hospitalization for behavioral health care and maternity (for girls) after children’s services (Pandiani, Banks, & Schacht, in press).

Probabilistic Population Estimation has three important advantages. First, the personal privacy of individuals and the confidentiality of medical records are assured because Probabilistic Population Estimation does not depend on information that identifies specific individuals. Second, because the methodology relies on existing databases, it does not require the commitment of substantial amounts of staff time or financial resources. Finally, Probabilistic Population Estimation can support retrospective evaluation of past changes in systems of care, and it can provide longitudinal baseline data for evaluating current or anticipated changes wherever basic client information resides in electronic databases.

 

Methodology

 Probabilistic Population Estimation allows researchers, policy analysts, and evaluators to answer two basic questions that have frequently remained unanswered because existing data sets lack unique person identifiers across organizations and service sectors:  “How many people have contact with a service system?” and “How many people are served by more than one organization, service sector, or service system?”  Probabilistic Population Estimation provides these estimates by combining information on the distribution of dates of birth in data sets with information on the distribution of dates of birth in the general population to produce valid and reliable estimates of the number of people represented.  The ability of this statistic to provide probabilistic estimates (with known confidence intervals) of these basic parameters of service systems is particularly valuable where issues of confidentiality or organizational complexity limit the availability of unique identifiers, or the lack of adequate financial resources inhibits the development of comprehensive integrated data warehouses.

Probabilistic Population Estimation is a statistical procedure derived from the solution to the classic mathematical “coupon collector” problem (Feller, 1957). The classical problem asks, “How many baseball cards must a collector collect to obtain a complete set of cards, when the probability of every card being in a given bubble gum package is known?” In the current application, the same logic is used to answer the questions “How many unique individuals are represented in a data set that does not include a unique person identifier?” and “How many unique individuals are shared by data sets that do not include common person identifiers?”
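As a point of reference, the classical problem has a simple closed-form answer when every card is equally likely to turn up. That equal-probability condition is an assumption made here for illustration; the text above allows any known probabilities. The sketch below computes the expected number of draws needed to see all n distinct cards, which is n(1 + 1/2 + ... + 1/n). The function name is illustrative.

```python
import math


def expected_draws_for_full_set(n):
    """Expected draws to collect all n equally likely coupons: n * H_n."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # H_n is the n-th harmonic number, 1 + 1/2 + ... + 1/n.
    harmonic = math.fsum(1.0 / k for k in range(1, n + 1))
    return n * harmonic


if __name__ == "__main__":
    # With 365 equally likely "cards" (birth dates), a complete set takes
    # roughly 2,365 draws on average.
    print(round(expected_draws_for_full_set(365)))
```

Population estimation runs the same logic in the other direction: the number of distinct dates observed is known, and the number of draws (people) is what is being estimated.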

 

Determining Population Size

The solution to the coupon collector problem used in Probabilistic Population Estimation begins with a decomposition of the problem that does not involve mathematical approximation. This decomposition breaks the larger question down into a series of smaller questions for which the mathematical solution is known. In this case, a data set is divided into discrete segments that describe individuals who share a year of birth and gender. Using decomposition, the total number of individuals needed to fill a pre-specified number of dates of birth is equivalent to the number of individuals needed to fill one date of birth, plus the number needed to fill a second date of birth once the first is filled, plus the number needed to fill a third once the second is filled, and so on, until the pre-specified number of dates of birth is filled. For birth dates, when a uniform distribution is assumed, the number of individuals is determined by:

                                                                     

 where j represents a distinct gender/year of birth cohort, and l is the number of observed birth dates within that cohort.
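One form of the estimator that reproduces the worked figures given later in this section (231 observed dates corresponding to about 367 people, and 324 dates to about 802) is sketched here. The notation $l_j$ for the number of observed birth dates in cohort $j$, the 365-day year, and the summation index starting at 1 are assumptions inferred from those figures.

```latex
P_j = \sum_{i=1}^{l_j} \frac{365}{365 - i}
```

Each term is the expected number of additional individuals needed to reach one more distinct date when $i$ dates are already filled. Starting the index at 0 and stopping at $l_j - 1$ gives the textbook coupon-collector expectation instead, which yields about 365 rather than 367 for 231 observed dates.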

The variance of the number of people is determined by: 
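One candidate form is sketched here, under the assumption that the estimate for cohort $j$ is a sum over $i = 1, \dots, l_j$ of independent geometric waiting times, each with success probability $(365 - i)/365$, where $l_j$ is the number of observed birth dates in the cohort. The variance of a geometric waiting time with success probability $p$ is $(1 - p)/p^2$, which gives:

```latex
\sigma_j^2 = \sum_{i=1}^{l_j} \frac{365\, i}{(365 - i)^2}
```

The indexing and the independence of the terms are assumptions of this sketch.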

                                                           

The total number of people represented in a data set ($P_{Total}$) is obtained by summing the cohort estimates $P_j$ over all gender/year of birth cohort subsets:

\[ P_{Total} = \sum_{j=1}^{k} P_j \]

 where k is the total number of gender/year of birth cohort subsets.

The construction of the 95% confidence interval for the point estimate derived above involves a two-step process. First, the total variance $\sigma^2_{Total}$ is obtained by summing the variance $\sigma_j^2$ for each gender/year of birth cohort:

\[ \sigma^2_{Total} = \sum_{j=1}^{k} \sigma_j^2 \]

where k is the total number of gender/year of birth cohort subsets. The estimate for the 95% confidence interval is then constructed:

\[ P_{Total} \pm 1.96 \sqrt{\sigma^2_{Total}} \]

          For example, if 231 dates of birth were represented in a data set with all male children’s mental health service recipients for 1990 who were born in 1975, the procedures described above would indicate that 367 unique individuals were represented in that data set.  Similarly, if a data set with information on all men born in the same year who were incarcerated during 1998 included 280 dates of birth, that data set would represent 533 unique individuals.
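The arithmetic behind these figures can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the per-cohort estimate is taken to be the sum of 365/(365 - i) for i from 1 to the number of observed dates (a form chosen because it reproduces the 367 quoted above), the variance is taken to be the sum of geometric waiting-time variances, and the interval uses a normal approximation with z = 1.96. All function names are hypothetical.

```python
import math

DAYS = 365  # assumption: 365 equally likely birth dates, leap days ignored


def estimate_people(observed_dates, days=DAYS):
    """Estimated number of people behind `observed_dates` distinct birth dates."""
    if not 0 <= observed_dates < days:
        raise ValueError("observed_dates must be between 0 and days - 1")
    return math.fsum(days / (days - i) for i in range(1, observed_dates + 1))


def estimate_variance(observed_dates, days=DAYS):
    """Assumed variance: sum of geometric waiting-time variances."""
    if not 0 <= observed_dates < days:
        raise ValueError("observed_dates must be between 0 and days - 1")
    return math.fsum(
        days * i / (days - i) ** 2 for i in range(1, observed_dates + 1)
    )


def total_with_interval(date_counts, z=1.96):
    """Sum cohort estimates and variances, then form estimate +/- z * sd."""
    total = math.fsum(estimate_people(n) for n in date_counts)
    variance = math.fsum(estimate_variance(n) for n in date_counts)
    half_width = z * math.sqrt(variance)
    return total, total - half_width, total + half_width


if __name__ == "__main__":
    print(round(estimate_people(231)))     # 367, as in the example above
    print(round(estimate_people(280), 1))  # 533.5; the text reports 533
```

`total_with_interval` takes one observed-date count per gender/year of birth cohort, for example `total_with_interval([231, 280])` for two cohorts, and returns the point estimate with the lower and upper bounds of the interval.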

Table 1 illustrates the accuracy and the precision of the estimates of population size provided by Probabilistic Population Estimation. The table gives the number of people who were served by each community mental health children’s services program in the state of Vermont during fiscal years 1997 and 1998. These counts are based on identification numbers assigned by the local agency that served each person. The table also includes the probabilistic estimate (with 95% confidence intervals) of the number of people served by each of these programs during each of these years. In this case, the probabilistic estimate fell within ±1% of the true value in 15 out of 20 cases. The 95% confidence interval of the estimate included the true value in 18 out of the 20 cases. In a large demonstration, the 95% confidence interval would be expected to include the true value in about 19 out of 20 cases.
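The two accuracy columns of Table 1 can be recomputed from the other columns. The short sketch below shows one way to do so, using the Chittenden row for Fiscal Year 1997 as input; the function names are illustrative, and the difference is assumed to be expressed as a percentage of the actual count.

```python
def percent_difference(estimate, actual):
    """Signed difference between estimate and actual, as a percent of actual."""
    return 100.0 * (estimate - actual) / actual


def interval_contains(lower, upper, actual):
    """True when the actual count lies inside the confidence interval."""
    return lower <= actual <= upper


if __name__ == "__main__":
    # Chittenden, Fiscal Year 1997: 1,030 served, estimate 1,035.6 (1020.6-1050.6)
    print(f"{percent_difference(1035.6, 1030):+.1f}%")  # +0.5%
    print(interval_contains(1020.6, 1050.6, 1030))      # True
```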

Determining Population Overlap

In order to determine the number of children and adolescents shared across data sets that do not include a common person identifier, the sizes of three populations are determined, and the results are compared.  First, the number of young people represented in  each of the original data sets is determined.  In this case, the original data sets are the file that describes all community mental health clients for the base period, and the data set that describes all individuals in correctional facilities during the follow-up period.  Second, these two data sets are combined and the number of unique individuals represented in the combined data set is determined.

The number of people shared by the two data sets is the number of former clients of children’s services programs who were incarcerated during the follow-up period. Mathematically, the number of people who are shared by the two data sets is the difference between the sum of the numbers of people represented in the two original data sets and the number of people represented in the combined data set. In terms of mathematical set theory (Whitehead and Russell, 1927), the size of the intersection of two sets (A ∩ B) is the difference between the sum of the sizes of the two sets (A + B) and the size of the union of the two sets (A ∪ B):

(A ∩ B) = A + B - (A ∪ B)

The sizes of the two original data sets and the size of the combined data set may be determined using Probabilistic Population Estimation as described above.

The 95% confidence interval for the estimate of caseload overlap is derived from the variance of the estimate. The variance of the estimate is a function of the number of dates of birth represented in the larger of the original data sets (b), and the number of dates of birth in the combined data set (c). The formula for determining the variance of the overlap is:

In the hypothetical example introduced above, there were 231 dates of birth (representing 367 (±) young people) in the mental health data set and 280 dates of birth (representing 533 (±) individuals) in the corrections data set. When the two data sets were joined, the combined data set included 324 unique dates of birth. Probabilistic Population Estimation estimates that 802 individuals are represented in this combined data set. The overlap is the difference between the sum of the numbers of people in the two original data sets (900) and the number of people in the combined data set (802). In this hypothetical example, 98 (±) of the total 367 (±) people who had been served by the children’s mental health programs were incarcerated during the follow-up period. This represents 27% of the children’s services caseload.
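The overlap arithmetic in this example can be checked with a short Python sketch. It assumes a single gender/year of birth cohort and a per-cohort estimate equal to the sum of 365/(365 - i) for i from 1 to the number of observed dates, a form chosen because it reproduces the 367 and 802 quoted above. The function names are hypothetical, and no interval is computed for the overlap.

```python
import math


def estimate_people(observed_dates, days=365):
    """Estimated people behind `observed_dates` distinct birth dates (assumed form)."""
    if not 0 <= observed_dates < days:
        raise ValueError("observed_dates must be between 0 and days - 1")
    return math.fsum(days / (days - i) for i in range(1, observed_dates + 1))


def estimate_overlap(dates_a, dates_b, dates_combined):
    """Intersection size as size(A) + size(B) - size(A union B)."""
    size_a = estimate_people(dates_a)
    size_b = estimate_people(dates_b)
    size_union = estimate_people(dates_combined)
    return size_a + size_b - size_union


if __name__ == "__main__":
    overlap = estimate_overlap(231, 280, 324)
    share = overlap / estimate_people(231)
    print(round(overlap), f"{share:.0%}")  # 98 27%
```

For a data set with several cohorts, the same subtraction would be applied to the totals summed over cohorts.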

Table 2 illustrates the accuracy and the precision of the estimates of population overlap provided by Probabilistic Population Estimation. The table gives the number of people who were served by each community mental health children’s services program in the State of Vermont during both Fiscal Year 1997 and Fiscal Year 1998. These counts are based on identification numbers assigned by the local agency that served each person. The table also includes the probabilistic estimates (with 95% confidence intervals) of the number of people shared by the two annual data sets. In this case, the probabilistic estimate fell within ±2% of the true value in all ten cases. The 95% confidence interval of the estimate included the true value in all ten cases. In a large demonstration, the 95% confidence interval would be expected to include the true value in about 19 of every 20 cases.

Table 1

Actual Counts and Probabilistic Estimates of Population Size

Number of People Served by Children’s Mental Health Programs in Vermont

During Fiscal Year 1997 and Fiscal Year 1998

Fiscal Year 1997

Region        Number Served   Probabilistic Estimate (95% CI)   Difference   Within 95% CI
Chittenden    1,030           1,035.6 (1020.6-1050.6)           +0.5%        Yes
Southeast     1,435           1,406.6 (1386.7-1426.6)           -2.0%        Yes
Northeast     867             876.9 (863.6-890.3)               +1.1%        Yes
Rutland       693             690.0 (680.6-699.5)               -0.4%        Yes
Washington    560             560.3 (553.3-567.3)               +0.1%        Yes
Franklin      520             517.7 (510.3-525.1)               -0.5%        Yes
Addison       683             682.6 (673.7-691.5)               -0.1%        Yes
Bennington    474             475.2 (468.2-482.2)               +0.3%        Yes
Orange        523             524.3 (517.0-531.5)               +0.2%        Yes
Lamoille      203             203.6 (200.4-206.7)               +0.3%        Yes

Fiscal Year 1998

Region        Number Served   Probabilistic Estimate (95% CI)   Difference   Within 95% CI
Chittenden    1,027           1,020.8 (1006-1035.7)             -0.6%        Yes
Southeast     1,360           1,352.5 (1333.3-1371.7)           -0.6%        Yes
Northeast     967             984.9 (970.4-999.5)               +1.9%        No
Rutland       618             618.9 (610.2-627.6)               +0.1%        Yes
Washington    516             502.8 (496.3-509.4)               -2.6%        No
Franklin      484             481.3 (474.3-488.3)               -0.6%        Yes
Addison       641             633.9 (625.4-642.4)               -1.1%        Yes
Bennington    455             455.7 (448.9-462.6)               +0.2%        Yes
Orange        542             542.5 (534.6-550.4)               +0.1%        Yes
Lamoille      224             223.9 (220.5-227.3)               0.0%         Yes

  

Table 2

Actual Counts and Probabilistic Estimates of Population Overlap

Number of People Served by Children’s Mental Health Programs in Vermont

During Both Fiscal Year 1997 and Fiscal Year 1998

Region        Number Served Both Years   Probabilistic Estimate (95% CI)   Difference   Within 95% CI
Chittenden    483                        474.1 (458.9-489.2)               -1.8%        Yes
Southeast     730                        730.0 (710.5-749.5)               0.0%         Yes
Northeast     487                        481.8 (468.8-494.9)               -1.1%        Yes
Rutland       316                        316.9 (307.7-326.1)               +0.3%        Yes
Washington    293                        290.8 (284.7-297)                 -0.7%        Yes
Franklin      207                        209.6 (202.2-217.1)               +1.3%        Yes
Addison       352                        348.4 (340.5-356.2)               -1.0%        Yes
Bennington    225                        224.1 (217.4-230.9)               -0.4%        Yes
Orange        289                        284.3 (277.6-291)                 -1.6%        Yes
Lamoille      104                        104.0 (101.2-106.9)               0.0%         Yes

 

REFERENCES

Banks, S.M., & Pandiani, J.A. (1998). The use of state and general hospitals for inpatient psychiatric care. American Journal of Public Health, 88, 448-451.

Banks, S.M., & Pandiani, J.A. (in press, a). Probabilistic population estimation of the size and overlap of data sets based on date of birth. Statistics in Medicine.

Banks, S.M., Pandiani, J.A., Gauvin, L., Reardon, E., Schacht, L.C., & Zovistoski, A. (1999). A risk adjusted measure of hospitalization rates for evaluating community mental health program performance. Administration and Policy in Mental Health, 26, 269-279.

Banks, S.M., Pandiani, J.A., Schacht, L.C., & Bagdon, B. (1999). Causes and consequences of caseload segregation/integration. In A System of Care for Children’s Mental Health: Proceedings of the 12th Annual Conference.

Banks, S.M., Pandiani, J.A., Schacht, L.C., & Gauvin, L. (1999). A risk adjusted measure of hospitalization rates for evaluating community mental health program performance. Administration and Policy in Mental Health, 26, 269-279.

Feller, W. (1957). An Introduction to Probability Theory and Its Applications (2nd ed.). New York: John Wiley.

Pandiani, J.A., Banks, S.M., & Schacht, L.S. (1998a). Personal privacy vs. public accountability: A technological solution to an ethical dilemma. Journal of Behavioral Health Services and Research, 25, 456-463.

Pandiani, J.A., Banks, S.M., & Schacht, L.S. (1998). Using incarceration rates to measure mental health program performance. Journal of Behavioral Health Services and Research, 25, 300-311.

Whitehead, A.N., & Russell, B. (1927). Principia Mathematica. Vol. 1, 2nd ed. Cambridge: University Press, 211-212.

 

 

The Bristol Observatory
521 Hewitt Road
Bristol, VT 05443

bristob@together.net

(802) 453-7070 / (802)453-5061 Fax

Copyright © 2000 The Bristol Observatory
Web design by Fern Hill (last update 04/28/09)