Effect of confidence interval construction on judgment accuracy

Three experiments (N = 550) examined the effect of an interval construction elicitation method used in several expert elicitation studies on judgment accuracy. Participants made judgments about topics that were either searchable or unsearchable online using one of two order variations of the interval construction procedure. One group of participants provided their best judgment (one step) prior to constructing an interval (i.e., lower bound, upper bound, and a confidence rating that the correct value fell in the range provided), whereas another group of participants provided their best judgment last, after the three-step confidence interval was constructed. The overall effect of this elicitation method was not significant in 8 out of 9 univariate tests. Moreover, the calibration of confidence intervals was not affected by elicitation order. The findings warrant skepticism regarding the benefit of prior confidence interval construction for improving judgment accuracy.


Introduction
Improving the accuracy of our judgments represents a major effort in decision science and, more broadly, cognitive science. This work has important implications for several applied domains where expert judgment is required, such as medicine (Berner & Graber, 2008; Dawson & Arkes, 1987), clinical practice (Dawes, 1979; Oskamp, 1965), law (Goodman-Delahunty, Granhag, Hartwig & Loftus, 2010), finance (Önkal, Yates, Simga-Mugan & Öztin, 2003), and geopolitical forecasting and strategic intelligence analysis (Dhami, Mandel, Mellers & Tetlock, 2015; Mandel & Barnes, 2018; Tetlock, 2005). For example, intelligence analysts often need to make rapid judgments under conditions of uncertainty, and these judgments often inform mission-critical decisions (e.g., Fingar, 2011; Friedman, 2019; Mandel, 2019). One strategy for improving judgment involves designing structured elicitation methods that reduce potential bias or error in judgment (e.g., confirmation bias, overconfidence). That is, how individuals are probed for a given judgment is one potential target for intervention. If effective, such methods could offer a reliable route to improving judgment accuracy.
A popular example of an elicitation method that appears to improve judgment accuracy is the "consider the opposite" approach. Lord, Lepper and Preston (1984) demonstrated that asking individuals to consider the opposite point of view reduced their tendency to interpret evidence in terms of their prior beliefs. Herzog and Hertwig (2009) used a similar exhortation in their dialectical bootstrapping method, in which individuals provide two estimates, the second of which follows instructions to imagine the first estimate was incorrect and to consider the reasons why that might be. Herzog and Hertwig (2009) found that the accuracy of the average estimate was higher in that condition than in a control condition in which individuals provided two estimates without the instruction to "consider the opposite" (see also Herzog & Hertwig, 2014; Müller-Trede, 2011). Herzog and Hertwig (2009) suggested that the consider-the-opposite instruction encourages more divergent estimates based on different sources of knowledge and, hence, greater benefits of aggregation. In a related vein, Williams and Mandel (2007) found that "evaluation frames", which make the complementary hypothesis explicit and hence contrastively evaluable (e.g., "How likely is x rather than ¬x to occur?"), fostered greater judgment accuracy than "economy frames", which explicated only the focal hypothesis (e.g., "How likely is x to occur?").
A similar approach has been attempted in efforts to reduce another pervasive bias: overconfidence in interval estimates, or "overprecision"; namely, the confidence intervals people generate are often too narrow or overly precise (Alpert & Raiffa, 1982; Soll & Klayman, 2004; Moore & Healy, 2008; Pitz, 1974; Teigen & Jørgensen, 2005). For example, Soll and Klayman (2004) demonstrated that individuals were more overconfident when asked to produce a single 80% confidence interval than when providing two separate lower- and upper-bound 90% estimates. Soll and Klayman (2004) suggested that, like the consider-the-opposite approach, having to generate multiple point estimates (i.e., lower and upper bounds) yields a wider evidence base from which to generate the point estimates. Consistent with this idea, Teigen and Jørgensen (2005) asked one group of participants to provide a typical range estimate (i.e., both a lower and upper bound in the same query) and two other groups to each provide only a lower or an upper bound. The intervals implied by the independently elicited bounds were wider than those given when the same individual provided both bounds, suggesting that thinking about each bound independently leads to more disparate evidence being retrieved than thinking about them simultaneously. Teigen and Jørgensen (2005) also found that individuals were less overprecise when they were allowed to assign a confidence level to an interval than when they had to produce an interval of a specified confidence (e.g., 90%). However, the degree of overprecision appears to be related to the degree of confidence referenced in fixed confidence-level elicitations. For instance, Budescu and Du (2007) found that whereas 90% confidence intervals were overprecise, 70% intervals were well-calibrated, and 50% intervals were underprecise. Taken together, it appears that there are many ways in which elicitations can be structured to reduce bias and improve judgment accuracy.

The present investigation
In the present research, we continue this line of inquiry by examining a specific instantiation of interval elicitation, which prescribes that interval construction should precede the elicitation of best estimates. This approach has been adopted in both the three- and four-step methods (Speirs-Bridge et al., 2010), which have been used in several expert elicitation studies (e.g., Adams-Hosking et al., 2016; Burgman et al., 2011) and in the comprehensive structured elicitation IDEA protocol (for Investigate-Discuss-Estimate-Aggregate; e.g., Hanea et al., 2017; Hemming, Walshe, Hanea, Fidler & Burgman, 2018). Other methods such as the SHELF protocol (for Sheffield Elicitation Framework) also prescribe eliciting upper and lower bounds prior to best estimates (O'Hagan, 2019). Such methods are inspired by research showing the beneficial effect of interval estimation (e.g., Soll & Klayman, 2004; Teigen & Jørgensen, 2005). However, they go further by prescribing a fixed order in which estimates should be elicited from assessors. For instance, in the four-step method (Speirs-Bridge et al., 2010), after assessors are given a query (e.g., what is the number of native plant species in a given region?), they are asked to provide, in the following order: (1) the lowest realistic numerical estimate, (2) the highest realistic numerical estimate, (3) a best estimate, and (4) a confidence judgment in the form of an estimate of the likelihood that the true value of the assessed quantity falls in the credible interval defined by the first two estimates provided.
These features of such "IBBE protocols" (for Intervals Before Best Estimates) can potentially support two ameliorative functions. First, because the bounds are estimated before the best estimate, this might prompt consideration of a wider range of relevant information. For instance, citing Morgan and Henrion (1990), Hemming et al. (2018) state, "asking for lowest and highest estimates first [in the three- or four-point methods], encourages consideration of counterfactuals and the evidence for relatively extreme values, and avoids anchoring on best estimates" (p. 174). As the quote suggests, eliciting bounds before best estimates may improve the accuracy of the latter by stimulating consideration of worst- and best-case scenarios or multiple viewpoints. However, to the best of our knowledge, this hypothesized order effect of interval construction on the accuracy of best estimates has not been empirically tested. Therefore, one aim of this research was to test the validity of this hypothesis.
A second ameliorative function of IBBE protocols is to improve the calibration of confidence by allowing assessors to assign a confidence level to their credible interval after the latter has been estimated. As noted earlier, prior research has found evidence to support this method (Soll & Klayman, 2004; Teigen & Jørgensen, 2005) and we do not pursue that issue further here. However, little research has examined whether the elicitation of a best estimate prior to interval construction has any effect on the latter process. We conducted three sets of analyses. First, we examined whether elicitation order influenced the range of participants' credible intervals. One hypothesis suggested by the work of Tversky and Kahneman (1974) and Epley and Gilovich (2006) is that intervals would be narrower if best estimates are elicited first, because participants would anchor on the estimate and adjust until a plausible value is reached. In contrast, participants who generate intervals before their best estimates might anchor on the lower and upper limits of the probability scale and adjust away from those limits until a plausible value is reached. Second, we examined whether elicitation order affected participants' level of confidence in their credible range. If credible intervals are narrower when best estimates are elicited first, one might expect confidence to be lower in the "best-first" case because estimate precision (and informativeness) is greater (Yaniv & Foster, 1995, 1997). That is, it should be easier to be confident that one's interval captures the true value if it is wider. Finally, we examined whether elicitation order affected the calibration of confidence. The anchoring-and-adjustment hypothesis suggests that overprecision might be amplified by generating the best estimate first because confidence intervals will tend to be narrower than when the intervals are constructed before providing a best estimate.
However, contrary to the anchoring-and-adjustment hypothesis, Block and Harper (1991) found that generating an explicit best estimate prior to a confidence interval improved the calibration of confidence by reducing overprecision, whereas Soll and Klayman (2004, Experiment 3) found that hit rates did not depend on whether the best estimate (defined in their study as the median estimate) was elicited before, in the middle of, or after the lower and upper bounds. Therefore, substantial uncertainty remains regarding the effect of generating best estimates on confidence intervals.
We report three experiments that prompt participants to use one of two variants of a modified four-step method to answer general-knowledge and estimation-type questions depending on the condition to which they were randomly assigned. In the "best-first" condition, best estimates were elicited before eliciting lower-bound, upper-bound, and confidence estimates, respectively. In the "best-last" condition, participants provided their best estimates after generating the three estimates required for confidence interval construction. We used three different types of questions. All questions required responses in a percentage format (i.e., between 0% and 100%). A third of the items were general-knowledge questions (e.g., "What percentage of a typical adult human's bones are located in its head?"). However, when research is conducted online, as in the present research, individuals could search for answers on the Internet even if instructed not to do so. To address this concern, the remaining items were unsearchable. Following Gaertig and Simmons (2019), this was achieved by splitting our samples into two groups, each of which had a set of queries that required them to estimate the percentage of respondents who exhibited a certain behavior or that correctly answered a certain general-knowledge question. The correct values for these queries were unsearchable and were determined based on the values in the survey sample.

Sampling strategy and participants
The sample size was set to ensure sufficient power to detect effects of medium size using a multivariate analysis of variance (MANOVA) of judgment accuracy. We further oversampled in anticipation of having to exclude participants for various reasons, as we describe in the Case exclusions subsection of the Results. Accordingly, 350 participants in Experiment 1a and 417 participants in Experiment 1b completed our study online via Qualtrics Panels. The study was available to adults between 18 and 60 years of age who reported English as their first language and Canadian or American citizenship. After case exclusions, we retained a sample of 299 participants in Experiment 1a (mean age = 41.75; 166 females and 133 males; 173 Canadian citizens, 114 US, and 12 dual) and 357 participants in Experiment 1b (mean age = 38.37; 205 females, 149 males, 1 who preferred not to say, and 2 missing responses; 139 Canadian citizens, 209 US, and 9 dual). Participants were compensated by the panel provider (i.e., Qualtrics) for this study. The specific type of compensation varied (e.g., cash, gift card) and had a maximum value of $5 US (or the equivalent in Canadian dollars).

Design
We used a between-groups design wherein the order in which participants provided their confidence intervals and best estimates was manipulated. In the best-last condition, participants estimated a lower bound, an upper bound, and a confidence level that their interval captured the true value before providing their best estimate. In the best-first condition, participants provided their best estimate before providing the three estimates pertinent to confidence interval construction.

Materials and procedures
Supplementary materials including the full experimental protocol, data, and other supporting files are available from the Open Science Framework project page https://osf.io/k3jhq/.
After reviewing the information letter and consent form, participants answered basic demographic questions (i.e., age, sex, education, nationality, citizenship, first language) before completing a series of tasks in a counterbalanced order. Following completion of the tasks, participants were debriefed about the purpose of the study. The other tasks included an eight-item version of the International Cognitive Ability Resource (ICAR; Condon & Revelle, 2014; see Appendix E in the online supplementary materials), an eight-item version of the Actively Open-Minded Thinking scale (Baron, Scott, Fincher & Metz, 2015), and one of two sets of seven items from the 14-item Bias Blind Spot scale (Scopelliti et al., 2015). The focus of the present report is on the estimation task; however, ICAR was used to assess data quality. ICAR is a forced-choice, six-option multiple-choice test of general cognitive ability, with items designed to tap verbal reasoning, numeracy, matrix reasoning, and mental rotation abilities (Condon & Revelle, 2014). Participants completed a shorter, eight-item version of this test and received a score of 0-8, one point for each correct answer, with no partial marks given.
The latter two tasks are the focus of a separate investigation and are not presented here.
Estimation task. Participants in the best-first condition first received the following instructions explaining the elicitation protocol to be used in providing their estimates (see also Appendix A in the online supplementary materials): In the following you will be presented with a series of questions (e.g., what percentage of stuffed animals are teddy bears?) for which we will ask you to provide four different estimates.

Realistically, what is your BEST response?
Realistically, what do you think the LOWEST plausible value could be?
Realistically, what do you think the HIGHEST plausible value could be?
How confident are you that the interval you created, from LOWEST to HIGHEST, could capture the true value? Please enter a number between 50 and 100%?
Each estimate will be in the form of a percentage. Please try to be as accurate as possible. PLEASE DO NOT LOOK UP THE ANSWERS (e.g., on the Internet). In the next few pages we will review how to respond to each of the questions above. It is important that you understand how to respond to each. Please read carefully.
In the best-last condition, the questions were ordered such that the first question presented above was presented last. These instructions were followed by two instruction comprehension questions and feedback (see Appendix B in the online supplementary materials).
Participants then made four judgments regarding each of 18 questions. There were two sets of 18 questions and each participant received one set (328 participants received Set A and the other 328 received Set B across Experiments 1a and 1b). Each set of items was composed of three different question types: (a) searchable knowledge, (b) unsearchable knowledge, and (c) unsearchable behavior. There were six items of each type. A general description of these items is provided in the Introduction and all of the items are presented in Appendix C in the online supplementary materials. For each question, participants provided answers to the following queries: (a) "Realistically, what do you think the LOWEST plausible value could be?", (b) "Realistically, what do you think the HIGHEST plausible value could be?", (c) "How confident are you that the interval you created, from LOWEST to HIGHEST, could capture the true value? Please enter a number between 50 and 100%?", and (d) "Realistically, what is your BEST response?". Responses to (a), (b), and (d) were provided on a 101-point sliding scale ranging from 0% to 100%; responses to (c) used a sliding scale between 50% and 100%. The slider had a default position at the lowest value of the scale for all four questions. In both orders, the interval construction questions (i.e., a, b, c) were always presented on the same page and the best estimate was always elicited on a separate page (an example is shown in Appendix A).

Answers for unsearchable questions.
To generate answers for the unsearchable questions, each participant completed six knowledge and six behavioral questions (see Appendix D in the online supplementary materials for a complete list of items). There was a total of 24 items, and each participant received one set of 12 (i.e., List A or List B). The particular set received determined the estimation set they would receive, such that participants receiving List A would provide estimates for List B items and vice versa. For example, one set of items would include the estimation question "What percentage of survey respondents reported having pet a cat in the last 3 days?" and the other set would include the question "Have you pet a cat in the last 3 days?" Each item required a yes/no (e.g., Have you visited a country in Europe in the last ten years?) or true/false (e.g., Volvo is a Swedish car manufacturer) response.

Data quality
We took several steps to improve data quality. First, we included three attention-check items: one prior to the survey and two within the survey. The pre-survey attention check was employed to assess the degree to which a participant was likely to do the task (and not just quickly speed through the survey randomly clicking responses). Participants responded to the following prompt, where they needed to respond "No" to proceed to the survey: "The survey that you are about to enter is longer than average, and will take about 30 to 60 minutes. There is a right or wrong answer to some of the questions, and your undivided attention will be required. If you are interested in taking this survey and providing your best answers, please select 'No' below. Thank you for your time!" Within the survey, we presented participants with two additional attention checks, used as a means of excluding data during analysis from participants who were not attending to the task: (1) "The value of a quarter is what percentage of a dollar?" and (2) "In the following alphanumeric series, which letter comes next?"
The first attention check was embedded in the estimation task and the second was embedded in one of the individual-difference measures. We also monitored speed of responding, retaining data only from participants who spent more than 500 seconds (the minimum plausible duration) completing the survey. Second, we analyzed only those items for which participants provided a complete and coherent set of estimates. Abstaining from one or more of the lowest, highest, best, or confidence judgments resulted in the exclusion of the item, as did violation of the following constraint:

lower bound ≤ best estimate ≤ upper bound
That is, lower bounds had to be less than or equal to the upper bounds and the best estimate had to fall between (or precisely upon) those bounds.
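The coherence filter described above can be sketched as follows. This is an illustrative reconstruction, not the study's actual code; the field names and helper function are our own.

```python
# Illustrative sketch of the item-level coherence filter: an item is
# retained only if all four estimates are present and the constraint
# low <= best <= high holds.

def is_coherent(low, high, best, confidence):
    estimates = (low, high, best, confidence)
    if any(e is None for e in estimates):  # abstention on any estimate
        return False
    return low <= best <= high

# Hypothetical items: the second violates the ordering constraint and
# the third is missing a best estimate, so both are excluded.
items = [
    {"low": 20, "high": 60, "best": 35, "conf": 70},
    {"low": 40, "high": 50, "best": 70, "conf": 80},
    {"low": 10, "high": 30, "best": None, "conf": 60},
]
kept = [it for it in items
        if is_coherent(it["low"], it["high"], it["best"], it["conf"])]
```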
Finally, we performed two additional data quality checks. First, we tested whether participants performed better than chance for each of the six combinations of question type (SK, UK, UB) and list (A, B). To generate accuracy estimates for chance responses, we simulated 200,000 participants (100,000 for List A, 100,000 for List B) who selected random responses between 0 and 100 for each of the 18 best-estimate questions and scored these against the same truth vector used to compute accuracy for participant data. We then compared participants' accuracy to the random-response accuracy using six one-tailed t tests. Second, we examined the correlation between performance on ICAR and estimation accuracy.
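The chance baseline can be illustrated with a small simulation: simulated participants answer each question uniformly at random on 0-100 and are scored with mean absolute error against the truth vector. The truth values below are hypothetical (not the study's answer key) and the number of simulated participants is reduced from 100,000 for brevity.

```python
import random

def chance_mae(truths, n_sims=10_000, seed=1):
    """Mean absolute error of uniform-random responses (0-100),
    averaged over n_sims simulated participants."""
    rng = random.Random(seed)
    n = len(truths)
    total = 0.0
    for _ in range(n_sims):
        total += sum(abs(rng.uniform(0, 100) - t) for t in truths) / n
    return total / n_sims

truths = [12, 48, 75, 33, 60, 90]  # hypothetical correct answers
baseline = chance_mae(truths)      # analytically this is about 31.8
```

Participants' observed MAE can then be compared against this baseline, as in the one-tailed t tests described above.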
Statistical procedure

Accuracy of best estimates was measured using mean absolute error (MAE; the mean absolute difference between the participant's best estimate and the correct answer over items within each question type). We refer to the grand mean of MAE as GMAE. Question-type GMAE was calculated only where participants provided complete and coherent estimates for at least half of the items. Total GMAE was calculated only for those with a complete set of question-type GMAEs.
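The scoring can be sketched as below. This is an illustrative reading of the procedure, with hypothetical data: MAE is computed within each question type, and GMAE is taken here as the grand mean of a participant's question-type MAEs (consistent with its later per-participant use).

```python
from statistics import mean

def mae(estimates, answers):
    """Mean absolute difference between best estimates and answers."""
    return mean(abs(e - a) for e, a in zip(estimates, answers))

# One hypothetical participant: (best estimates, correct answers) per
# question type (six items each in the study; three shown here).
by_type = {
    "searchable_knowledge":   ([30, 55, 10], [25, 60, 20]),
    "unsearchable_knowledge": ([45, 80, 20], [50, 70, 30]),
    "unsearchable_behavior":  ([40, 70, 15], [50, 65, 10]),
}
type_mae = {qt: mae(est, ans) for qt, (est, ans) in by_type.items()}
gmae = mean(type_mae.values())  # grand mean over question types
```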

Data quality
Participants had to meet the following inclusion criteria to be included in the analyses: (1) pass the initial pre-study attention check, (2) pass the letter-sequence attention check, (3) report English as their first language in the demographics, and (4) report Canadian and/or American citizenship. We did not use one of the attention checks (i.e., the value of a quarter) as an exclusion criterion because performance on it was very poor, possibly indicating that a significant portion of the participants misunderstood the question. In addition, participants with a large number of missing responses, inappropriate responses (e.g., to open-text questions), and/or overly systematic response patterns were removed. In Experiment 1a, 37, 12, and 2 participants were excluded for demographic reasons, failure of the attention check, and missing or inappropriate responses, respectively. In Experiment 1b, 30, 18, and 17 participants were excluded for these same reasons, respectively. Because some participants were excluded for multiple reasons, the filtering procedure resulted in the exclusion of 51 participants in Experiment 1a and 60 in Experiment 1b.
In Experiment 1a, 74.21% (SD = 34.02%) of participants' sets of estimates were complete and coherent, whereas that figure was 75.12% (SD = 31.92%) in Experiment 1b. The complete and coherent requirement excluded an additional 120 participants in Experiment 1a and 140 in Experiment 1b, leaving final sample sizes of 179 and 217, respectively. Participants excluded on the basis of incoherence scored significantly lower on ICAR than the coherent participants retained in our samples in both Experiment 1a (r pb[299] = .36, p < .001) and Experiment 1b (r pb[357] = .53, p < .001).
Participants' accuracy was significantly above chance for each question type in Lists A and B in both Experiment 1a and Experiment 1b (all p < .001). The full analysis is reported in Appendix F in the online supplementary materials.
A final data quality check revealed that ICAR (M = 3.59, SD = 1.99, after exclusion of the incoherent participants) correlated with GMAE in both Experiment 1a (r[177] = −.197, p = .008) and Experiment 1b (r[215] = −.202, p = .003), indicating that participants who performed well on ICAR also had more accurate best estimates.

Accuracy of best estimates

Figure 1 shows the distributions of GMAE by experiment, question type, and elicitation order, whereas Table 1 shows the corresponding means and standard deviations. We conducted a MANOVA in each experiment with elicitation order as a fixed factor and the three MAE measures corresponding to question type as dependent measures. Table 2 summarizes the multivariate results and Table 3 shows the parameter estimates for the univariate results. As can be seen in Table 2, the effect of elicitation order was not significant in Experiment 1a but was significant in Experiment 1b. The univariate parameter estimates in Table 3, however, show that there was only a significant effect of elicitation order for unsearchable knowledge questions. No other univariate parameters were significant in either experiment.

Credible intervals
We next analyzed the width of participants' credible intervals by subtracting the lower-bound estimate from the upper-bound estimate. We averaged the ranges within each question type. The three repeated measures were subjected to a MANOVA with elicitation order as a fixed factor. Table 4 shows the corresponding means and standard deviations, whereas Tables 5 and 6 summarize the multivariate effects and univariate parameter estimates from the models in each experiment, respectively. Neither the multivariate analysis nor the univariate parameter estimates were significant in either experiment.

Confidence judgments
We examined the effect of elicitation order on the confidence participants had that their credible intervals captured the correct value. We averaged participants' confidence ratings for each question type and subjected the three repeated measures to a MANOVA with elicitation order as a fixed factor. Table 7 shows the corresponding means and standard deviations. There was no effect of elicitation order in either experiment (both p ≥ .80).

Calibrated confidence intervals
Our final analysis examined the accuracy of participants' confidence intervals when all were calibrated to a fixed confidence level of 80%. We constructed the lower and upper bounds of the calibrated intervals using the following formulas (Hemming et al., 2018):

L′ = B − (B − L) × (80 / C)

U′ = B + (U − B) × (80 / C)

where B is the best estimate, L is the lower bound, U is the upper bound, C is the reported confidence (in %), and [L′, U′] is the calibrated interval. For each of the three question types, we computed for each participant the proportion of standardized confidence intervals that captured the correct answer. If the correct answer fell outside the interval, then it was scored as incorrect. Figure 2 shows the distributions of the proportion of correct judgments by experiment, question type, and elicitation order, whereas Table 8 shows the corresponding means and standard deviations. The resulting three repeated measures were analyzed in a MANOVA with elicitation order as a fixed factor. Neither the multivariate analysis nor the univariate parameter estimates were significant in either experiment (p ≥ .25). It is evident from the descriptive results in Figure 2 and Table 8 that participants were overprecise in both experiments, with their accuracy rates falling far short of 80%. We confirmed this by running one-sample t tests against a test value of .8: in Experiment 1a, the grand mean accuracy rate was .65 (SD = .21, t(178) = −9.54, p < .001, d = 0.71); in Experiment 1b, the rate was .67 (SD = .21, t(216) = −8.96, p < .001, d = 0.62).
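The calibration and hit-rate scoring can be sketched as follows. The rescaling reflects our reading of the linear extrapolation in Hemming et al. (2018), which anchors the interval on the best estimate and scales each arm by the ratio of the target confidence (80%) to the reported confidence; all data values are hypothetical.

```python
def calibrate_80(best, low, high, conf):
    """Rescale an interval reported with confidence `conf` (50-100, in %)
    to a nominal 80% interval anchored on the best estimate."""
    factor = 80.0 / conf
    return best - (best - low) * factor, best + (high - best) * factor

def hit_rate(judgments, truths):
    """Proportion of calibrated intervals that capture the true value."""
    hits = 0
    for (best, low, high, conf), t in zip(judgments, truths):
        lo, hi = calibrate_80(best, low, high, conf)
        hits += lo <= t <= hi
    return hits / len(judgments)

# A 100%-confidence interval is narrowed (factor 0.8), whereas a
# 50%-confidence interval is widened (factor 1.6).
judgments = [(50, 40, 70, 100), (30, 20, 40, 50)]
truths = [45, 5]
rate = hit_rate(judgments, truths)  # first interval hits, second misses
```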

Discussion
The results of Experiments 1a and 1b were quite consistent. In Experiment 1a, none of the univariate tests of order were significant, and in Experiment 1b, only one (unsearchable knowledge) was significant. Moreover, consistent with Soll and Klayman (2004, Experiment 3), in each experiment elicitation order had no significant effect on the accuracy of participants' calibrated confidence intervals. Nor did elicitation order have an effect on the range of credible intervals or on confidence judgments in the range. The hypotheses we tested based on insights from work on anchoring-and-adjustment processes (Epley & Gilovich, 2006; Tversky & Kahneman, 1974) were unsupported in the present experiments. As well, our findings failed to generalize Block and Harper's (1991) result that generating and writing down best estimates first improves calibration. It was evident that the inaccuracy of participants' confidence intervals was expressed in both experiments in the form of overprecision, with deviations from perfect calibration of a medium to large effect size. Therefore, it is not simply the case that there is little or no miscalibration to correct in the first place.
While Experiments 1a and 1b appear to provide clear evidence that the effect of the prior elicitation of confidence intervals on the accuracy of best estimates is minimal, we wanted to provide an additional test to put these results on even firmer footing. Thus, in Experiment 2 we set out to replicate Experiments 1a and 1b. In an attempt to improve participants' use of confidence interval construction, we moved the critical estimation task to the beginning of the battery of tasks that the participants completed. In addition, we added instructions reminding participants how to construct the intervals coherently. Lastly, we adopted a more stringent attention check (Oppenheimer, Meyvis & Davidenko, 2009) and placed it immediately after the estimation task. Each of these modifications was geared toward increasing participants' willingness and/or ability to use confidence interval construction and providing us with an additional, independent means of selecting those individuals who were more likely doing so.

Participants
Five hundred and forty-five participants completed our study online via Qualtrics Panels. The eligibility criteria were identical to those of Experiments 1a and 1b (i.e., between 18 and 60 years of age, English first language, US and/or Canadian citizen). After exclusions based on the same criteria used in Experiments 1a and 1b, we retained a sample of 198 participants (mean age = 43.71; 124 females, 73 males, and 1 missing response; 112 Canadian citizens, 81 US, and 5 dual). In Experiment 2, 64, 286, and 74 participants were excluded for demographic reasons, failure of the attention check, and missing or inappropriate responses, respectively. Participants were compensated in the same manner as in the prior experiments.

Materials and procedures
The materials and procedures were identical to Experiments 1a and 1b, with the following exceptions. As noted previously, we moved the estimation task to the beginning of the survey rather than having it randomly intermixed with the other tasks. As well, we added the following reminders on the pages on which individuals constructed their intervals: "Remember your LOWEST plausible value should be LOWER than your BEST response and HIGHEST plausible value" when providing the lowest plausible value, and "Remember your HIGHEST plausible value should be HIGHER than your BEST response and your LOWEST plausible value" when providing the highest plausible value. We also removed the questions used to calculate the true response for unsearchable items; instead, baseline behaviors/judgments were determined based on the responses provided by participants in Experiments 1a and 1b. Lastly, we replaced the second attention check item (i.e., the value of a quarter is what percentage of a dollar?) based on low performance in Experiments 1a and 1b. In its place, we included an "Instructional Manipulation Check" (adapted from Oppenheimer et al., 2009) whereby, under the cover of a question about sports participation, participants were simply instructed to ignore the main question and click
a button to proceed to the next screen. As in Experiments 1a and 1b, participants also completed a number of other tasks/scales (now all of which were administered after the estimation task). There were also minor changes to these other tasks/scales. & Stanovich, 2012), plus six general heuristics and biases problems, adapted from a variety of sources (assessing Anchoring, Base-rate neglect, Conjunction Fallacy, and Outcome Bias). We also removed the Actively Open-Minded Thinking Scale. These latter tasks are the focus of another investigation and are not presented here.

Data quality
In Experiment 2, 89.05% (SD = 31.92%) of participants' sets of estimates were complete and coherent. This was significantly greater than the pooled coherence rate from Experiments 1a and 1b (M = 74.7%, SD = 31.92%), t(350.22) = −5.42, p < .001, d = 0.43. Therefore, it appears the modifications in Experiment 2 had the intended effect of improving the coherence of participants' judgments. The complete-and-coherent requirement excluded an additional 44 participants, leaving a final sample of 154. As in the prior experiments, coherence was significantly related to ICAR (r_pb[198] = .34, p < .001).
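The comparison of coherence rates above is a Welch (unequal-variances) t test computed from summary statistics, which is why the degrees of freedom are fractional. A minimal sketch of that computation follows; note that the group sizes used here are illustrative placeholders, not the study's exact ns:

```python
from math import sqrt

def welch_t_from_summary(m1, s1, n1, m2, s2, n2):
    """Welch's t and Welch-Satterthwaite df from summary statistics."""
    v1, v2 = s1**2 / n1, s2**2 / n2  # squared standard errors
    t = (m1 - m2) / sqrt(v1 + v2)
    df = (v1 + v2)**2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t, df

# Pooled Experiments 1a/1b coherence (M = 74.7, SD = 31.92) vs.
# Experiment 2 (M = 89.05, SD = 31.92); n1 and n2 are hypothetical.
t, df = welch_t_from_summary(74.7, 31.92, 291, 89.05, 31.92, 198)
```

With the actual per-experiment sample sizes substituted for the placeholder ns, this yields the test statistic reported above.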
As in the earlier experiments, participants' accuracy was significantly above chance for each question type in Lists A and B, all p < .001. The full analysis is reported in Appendix F in the online supplementary materials.
The final data quality check revealed that ICAR (M = 3.92, SD = 1.92) correlated with GMAE (r[152] = −.279, p < .001); participants who performed well on ICAR tended to be more accurate.

Figure 1 and Table 1 show the descriptive results. We conducted a MANOVA with elicitation order as a fixed factor and the three MAE measures corresponding to question type as dependent measures. Consistent with Experiment 1a, the effect of elicitation order was not significant in either the multivariate analysis or any of the univariate parameter estimates (see Tables 2 and 3).

Credible intervals
We calculated the range of the credible intervals and averaged them within question type (see Table 4 for descriptive results). We then computed a MANOVA on the repeated measures with elicitation order as a fixed factor. Consistent with the earlier experiments, there was no effect of elicitation order in either the multivariate analysis or any of the univariate parameter estimates (see Tables 5 and 6).

Confidence judgments
As in the prior experiments, we averaged participants' confidence ratings for each question type and subjected the three repeated measures to a MANOVA with elicitation order as a fixed factor (see Table 7 for descriptive results). Consistent with the earlier experiments, there was no significant effect of elicitation order (p = .76).

Calibrated confidence intervals
We examined the accuracy of participants' confidence intervals when all were calibrated to a fixed confidence level of 80% (Hemming et al., 2018), using the same proportion-correct metric used in Experiments 1a and 1b (see Figure 2 and Table 8 for descriptive results). The three repeated measures were first analyzed in a MANOVA with elicitation order as a fixed factor. The effect of elicitation order was not significant (p = .66). Finally, as in the earlier experiments, participants were overprecise: their grand mean accuracy rate was .62 (SD = .24), significantly lower than the .8 criterion required for perfect calibration (t[153] = −9.09, p < .001, d = 0.75).
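The overprecision analysis above rests on a simple proportion-correct logic: score each calibrated 80% interval as a hit if it contains the true value, average the hits, and compare the result against the nominal .8 level. A minimal sketch on toy data (the intervals below are invented for illustration, not the study's data):

```python
# Hit rate of fixed-confidence (80%) intervals vs. the nominal level.
# Toy data: each tuple is (lower bound, upper bound, true value).
intervals = [
    (10, 30, 25), (40, 60, 35), (55, 80, 70), (5, 20, 22), (30, 50, 45),
    (60, 90, 85), (15, 35, 40), (20, 45, 30), (70, 95, 75), (50, 65, 65),
]
hits = [lo <= truth <= hi for lo, hi, truth in intervals]
hit_rate = sum(hits) / len(hits)  # proportion of intervals capturing the truth
# A hit rate below the nominal .8 level indicates overprecision
# (intervals too narrow), as observed in the experiments.
overprecise = hit_rate < 0.8
```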

Discussion
In spite of the procedural changes intended to improve participants' focus on the central task and to encourage coherent estimation, the results of Experiment 2 are highly consistent with those of Experiments 1a and 1b. Elicitation order did not have a significant effect on the accuracy of best estimates or the calibration of confidence intervals. As well, consistent with the earlier experiments, participants showed a substantial degree of overprecision across the judgment tasks.

General discussion
In the present investigation, we examined the effect of prior confidence interval construction on the accuracy of best estimates, as well as the effect of best estimate construction on the calibration of confidence intervals. Previous research has provided support for the potential effectiveness of interval construction where separate lower and upper bounds are constructed and where confidence levels are assigned to credible intervals by assessors rather than by experimental fiat (Soll & Klayman, 2004; Teigen & Jørgensen, 2005). Various IBBE protocols, such as the four-step method (Speirs-Bridge et al., 2010), implement these attributes but further specify that elicitation order is relevant, prescribing that credible intervals be constructed prior to the elicitation of best estimates. The potential benefits of confidence interval construction are often explained in terms of the beneficial influence of taking multiple samples from memory and/or taking multiple perspectives on a given judgment (i.e., similar to various consider-the-opposite approaches; Herzog & Hertwig, 2009; Hirt & Markman, 1995; Koriat, Lichtenstein & Fischhoff, 1980; Lord et al., 1984; Williams & Mandel, 2007). That is, by generating intervals before best estimates, assessors might be prompted to consider a wider range of relevant evidence that might, in turn, improve the accuracy of the best estimates. As well, by generating intervals before best estimates, assessors might escape the biasing effect that best estimates can have if assessors are prone to anchor on them and then adjust insufficiently (Epley & Gilovich, 2006; Tversky & Kahneman, 1974). Therefore, IBBE protocols can be viewed as a debiasing method for judgment, one that addresses Lilienfeld et al.'s (2009) call for research on correcting errors in judgment.
Overall, our results painted an uninspiring picture of the effectiveness of prior confidence interval construction on the accuracy of assessors' best estimates. In two of the three experiments, elicitation order had no significant effect on accuracy, and in the one experiment (1b) where there was a multivariate effect, that effect was due to only one significant univariate effect. Fully 8 out of 9 univariate tests of the effect of order on best estimate accuracy failed to find a significant effect. The null effect of elicitation order was even more stable for accuracy of calibrated confidence intervals, where not one univariate parameter estimate (out of 9 tests across the 3 experiments) was significant. As well, in Experiments 1a and 1b, we included participants who failed one of our attention checks because we thought that item might have been misunderstood by many, given the high error rate we observed. In Experiment 2, however, we also observed a high error rate on an attention check that has been used in other studies (Oppenheimer et al., 2009). The probable net effect is that we were much more liberal in our inclusion criteria in Experiments 1a and 1b than we were in Experiment 2, and yet we obtained highly consistent results.
Furthermore, recall that confidence interval construction is hypothesized to benefit the accuracy of best estimates by improving the recruitment of relevant evidence pertinent to testing the equivalent of best- and worst-case scenarios or multiple viewpoints (Hemming et al., 2018). Presumably, the benefit afforded to best-estimate accuracy depends on how accurately the preceding intervals are constructed. Following this line of reasoning, one might also expect the correlation between best-estimate accuracy and (calibrated) confidence interval accuracy to be stronger if interval construction preceded best-estimate construction than if the best estimates were constructed first. However, we did not find support for that prediction either. Across experiments and question types, the correlation between GMAE for the best estimates and the proportion of correct responses in the calibrated confidence interval was r(289) = −.40 (p < .001) when best estimates were elicited first and r(257) = −.34 (p < .001) when they were elicited after the confidence intervals were constructed. The difference is not significant (z = −0.79, p = .21).
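The comparison of these two independent correlations is a standard Fisher r-to-z test. A sketch reconstructing it from the reported values (with sample sizes recovered as df + 2):

```python
from math import atanh, sqrt

def fisher_z_compare(r1, n1, r2, n2):
    """z statistic for the difference between two independent correlations,
    using Fisher's r-to-z transformation."""
    se = sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return (atanh(r1) - atanh(r2)) / se

# r(289) = -.40 (best estimate first) vs. r(257) = -.34 (interval first);
# n = df + 2 for each bivariate correlation.
z = fisher_z_compare(-0.40, 291, -0.34, 259)  # ≈ -0.8, not significant
```

Small rounding differences aside, this reproduces the reported z of about −0.8.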
From a practical perspective, the present results do not support the utility of prior confidence interval construction for improving the accuracy of best estimates. As we noted in the Introduction, improving the accuracy of judgments represents a major effort that has important implications for several domains. An important consideration in these efforts is cost. IBBE protocols require significant additional time (e.g., in the case of the four-step method, three additional judgments). Thus, even a small benefit may not justify the added effort, given that other methods that require a similar number of elicitations have yielded large improvements in probability judgment accuracy. Notwithstanding the risks associated with internal meta-analyses (Ueno, Fastrich & Murayama, 2016; Vosgerau, Simonsohn, Nelson & Simmons, 2019), it is useful to estimate the overall effect size of the elicitation order manipulation we conducted across three experiments. There is a small positive effect (Cohen's d = 0.285, 95% CI [0.117, 0.453]) of the modified four-step method we tested on the accuracy of best estimates. Moreover, in some elicitation contexts, such as decision analysis (Clemen, 1996; von Winterfeldt & Edwards, 1986), it may be highly desirable, if not necessary, to collect lower- and upper-bound estimates, in which case there may be a small benefit to following the ordering prescribed by IBBE protocols. However, if the aim of the method is to improve best estimates, then query-intensive IBBE protocols do not compare favorably with alternative methods for improving judgment accuracy that require similar increases in elicitation.
For instance, several studies have found that by eliciting a small set of logically related judgments (typically 3-4 items per topic), accuracy can be substantially improved by recalibrating them using coherentization methods that constrain the estimates to respect certain logical criteria, such as the additivity and unitarity properties of the probability calculus (e.g., Fan, Budescu, Mandel & Himmelstein, 2019; Karvetski et al., 2013). For example, one such study found a large (i.e., d = 0.96) improvement in accuracy on a probability judgment task after four related probability judgments were coherentized. Moreover, individual differences in the degree of incoherence have been effectively used in these studies and others to improve aggregation through performance weighting, namely, by giving more weight to assessors who are more coherent (e.g., Karvetski, Mandel & Irwin, 2020; Predd, Osherson, Kulkarni & Poor, 2008; Wang, Kulkarni, Poor & Osherson, 2011). Other techniques, such as the use of conditional rather than direct probability assessments (Kleinmuntz, Fennema & Peecher, 1996), ratio rather than direct probability assessments (Por & Budescu, 2017), and contrastive evaluation frames that make complements explicit (Williams & Mandel, 2007), have been shown to improve judgment accuracy, whereas still other methods, such as eliciting probability estimates for ranges over entire distributions (Haran, Moore & Morewedge, 2010; Moore, 2019) or iteratively adjusting interval sizes until a pre-specified confidence level is matched by an assessor's subjective probability that the interval captures the true value (Winman, Hansson & Juslin, 2004), have shown promise for reducing overprecision.
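To make the coherentization idea concrete: in the simplest case, judgments about mutually exclusive and exhaustive events are constrained to sum to 1, and the least-squares solution spreads the excess equally across the judgments. This is only a minimal sketch of the general principle, not the specific algorithms used in the cited studies:

```python
def coherentize(probs):
    """Project judged probabilities of mutually exclusive, exhaustive events
    onto the additivity constraint (sum = 1) by equal adjustment, which is
    the least-squares solution to that equality constraint.
    Note: extreme inputs may require clipping to [0, 1] afterward."""
    excess = (sum(probs) - 1) / len(probs)
    return [p - excess for p in probs]

# A judge says P(rain) = .7 and P(no rain) = .5 (incoherent: sums to 1.2).
coherent = coherentize([0.7, 0.5])  # ≈ [0.6, 0.4]
```

The magnitude of the adjustment each assessor requires also serves as the incoherence measure used for performance weighting in aggregation.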
The confidence interval on d was estimated using Wuensch's (2012) implementation of the procedures described by Smithson (2001).
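Smithson's noncentral-t approach finds the two noncentrality parameters under which the observed t statistic sits at the upper and lower tail percentiles, then rescales them to the d metric. A sketch of that logic under assumed equal group sizes (n = 275 per group is a placeholder roughly matching the pooled N, not the actual cell sizes):

```python
from math import sqrt

from scipy.optimize import brentq
from scipy.stats import nct

def d_confidence_interval(d, n1, n2, level=0.95):
    """Noncentral-t confidence interval for Cohen's d (two independent groups)."""
    df = n1 + n2 - 2
    scale = sqrt(1 / n1 + 1 / n2)
    t_obs = d / scale
    alpha = 1 - level
    # Lower-bound ncp: observed t sits at the upper alpha/2 tail.
    ncp_lo = brentq(lambda ncp: nct.cdf(t_obs, df, ncp) - (1 - alpha / 2),
                    t_obs - 10, t_obs)
    # Upper-bound ncp: observed t sits at the lower alpha/2 tail.
    ncp_hi = brentq(lambda ncp: nct.cdf(t_obs, df, ncp) - alpha / 2,
                    t_obs, t_obs + 10)
    # Rescale noncentrality parameters to the d metric.
    return ncp_lo * scale, ncp_hi * scale

lo, hi = d_confidence_interval(0.285, 275, 275)  # ≈ (0.12, 0.45)
```

With these placeholder ns the interval closely approximates the reported 95% CI of [0.117, 0.453].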
That said, the present research solicited estimates to general-knowledge and behavior-related questions in a percentage format, and thus the possibility remains that there are contexts wherein this particular form of elicitation generates larger (and more justifiable) gains. For instance, our problems might not have been ideal for recruiting "for vs. against" evidence that would bear on the best estimate. Moreover, one might question whether the unsearchable items were effective for our research purposes. We believe they were for at least two reasons. First, we did not observe much difference between accuracy levels for the three types of questions (see Table 1 and Figure 1) and all three question types were answered with accuracy levels significantly above chance levels. Second, we did not find that order had an effect on the commonly employed general-knowledge items. Indeed, the only significant effect of order we observed was on the unsearchable knowledge items.
Thus, an important contribution of the present work is to introduce measured skepticism about expecting a general gain in judgment accuracy from prior confidence interval construction. Future work aimed at locating contexts wherein or conditions under which such an elicitation method is beneficial would be valuable. For example, unlike the present research, in which participants were compensated equally regardless of performance level, researchers could investigate whether incentivized conditions moderate the effect of confidence interval construction. It is possible that with performance-based incentives, the beneficial effect of confidence interval construction would be more pronounced. As well, research could examine the effect of instructions accompanying the elicitation of estimates. Perhaps order would have more of an effect if the instructions more strongly encouraged dialectical thinking (Herzog & Hertwig, 2009). Finally, experts and novices display different patterns of response in tasks involving interval construction (e.g., McKenzie, Liersch & Yaniv, 2008). The performance benefits that Speirs-Bridge et al. (2010) reported for the four-step method over a three-step variant that omitted the judgment of confidence level were observed in expert samples. Although Speirs-Bridge et al. did not compare these elicitations to a control condition in which neither intervals nor confidence levels were elicited, it may be that the medium effect size they observed across their studies is attributable in part to the expert samples employed. Although we did not use an expert sample, we took care to rule out (at a substantial cost to our sample sizes across the three experiments) participants who made blatantly incoherent responses, and we observed that performance in the resulting samples was correlated with intelligence. Nevertheless, it would be useful in future research to conduct similar tests with expert samples.
Finally, it is worth noting that the average confidence level that participants assigned to their credible intervals was remarkably stable across question types and experiments, ranging from 71% to 77%. Recall that Budescu and Du (2007) found that participants directed to construct 70% confidence intervals were better calibrated than those required to construct either 50% or 90% confidence intervals. Although participants may have chosen confidence levels that offer relatively good prospects for calibration, evidently in the present research this tendency did not buffer them from overprecision, which they exhibited in moderate to large degree.

Conclusion
The present investigation provided a strong test of the effect of IBBE methods that require the prior construction of credible intervals on judgment accuracy. There was weak evidence that eliciting confidence intervals prior to best estimates increases the accuracy of those estimates, at least with respect to the types of judgment we evaluated and the type of sample we recruited. Taken together, these findings call for greater skepticism regarding the effectiveness of interval construction as an elicitation method for improving judgment accuracy. By the same token, we found no evidence that generating best estimates before confidence intervals improves the calibration of the intervals, as Block and Harper (1991) reported. Nor did we find support for the contrary anchoring-and-adjustment hypothesis (Epley & Gilovich, 2006; Tversky & Kahneman, 1974), which predicted that generating prior best estimates would, if anything, aggravate overprecision. Rather, in line with Soll and Klayman (2004), we found meager evidence that the order in which best estimates and confidence intervals are elicited matters much to accuracy and calibration. However, the generalizability of this finding should be tested in future research.