Two contradictory conclusions from the same dataset: the multiverse in data science.


When companies implement advanced analytics, decision making is at significant risk if two different conclusions are drawn from the same data set. Can two teams of data scientists arrive at different results from the same database? And is there any way to identify the analytical choices that create these variations?

A crowdsourcing study published in the journal Organizational Behavior and Human Decision Processes documented an experiment in which a group of analysts independently evaluated the same data set and attempted to test the same hypotheses.

Each analyst decided both the analytical approach and how to integrate the different variables. The results were not homogeneous: some teams falsified the hypothesis, others were unable to do so, and the rest could not reach a conclusion.

All analysts used the DataExplained platform, which provided real-time access to both the analytical paths they chose and those they rejected, and thus to their evaluation and decision-making processes.

In addition, this tool provided a graphical representation of each analyst's workflow, which was very useful for communicating their work and, in turn, allowed a qualitative analysis of the quantitative decisions made during this iterative research.

For this purpose, a multiverse analysis was performed, which helped show that the results depend more on how an analysis is carried out and how its variables are handled than on the statistics.

What analyses were performed?

The aforementioned crowdsourcing research analyzed a complex data set on gender and professional status in group meetings. There was no restriction on how analysts conceptualized and operationalized the different variables (such as social status).

The data set for this project included more than three million words of dialogue extracted from an online forum for scientific discussion. After a pilot program, the experts agreed on two hypotheses:

  • Hypothesis 1. A woman’s tendency to actively participate in a conversation correlates positively with the number of women in the discussion.
  • Hypothesis 2. Higher-status participants speak more than lower-status participants.

For the next phase of the study, a team was recruited to work with Boba multiverse analysis, which makes it possible to:

  • Examine all reasonable paths implicit in the analysts’ choices.
  • Quantitatively identify which choice points disperse interpretations.
  • Create visualizations to illustrate the key steps of these bifurcations.
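As a rough illustration of the idea behind these three points, the Python sketch below enumerates every combination of a few analytical choices, runs the same test of Hypothesis 2 along each path, and then measures how much each choice point moves the result. It does not use Boba or the study's data: the records are synthetic, and the choice points (`status`, `participation`, `trim_outliers`) and function names are hypothetical, chosen only for this example.

```python
import itertools
import random
import statistics

# Synthetic records for illustration only; this is NOT the study's data.
random.seed(0)
rows = []
for _ in range(200):
    seniority = random.randint(1, 30)
    citations = max(0.0, 40 * seniority + random.gauss(0, 400))
    words = max(1.0, 50 + 3 * seniority + random.gauss(0, 60))
    comments = max(1, round(words / 40 + random.gauss(0, 2)))
    rows.append({"seniority": seniority, "citations": citations,
                 "words": words, "comments": comments})

# Hypothetical choice points: each key is a decision, each list its options.
CHOICES = {
    "status": ["seniority", "citations"],
    "participation": ["words", "comments"],
    "trim_outliers": [False, True],
}


def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def run_universe(data, spec):
    """Run one analysis path and return its status/participation correlation."""
    outcome = spec["participation"]
    if spec["trim_outliers"]:
        # Drop records above the (approximate) 95th percentile of the outcome.
        ordered = sorted(row[outcome] for row in data)
        cutoff = ordered[int(0.95 * len(ordered))]
        data = [row for row in data if row[outcome] <= cutoff]
    xs = [row[spec["status"]] for row in data]
    ys = [row[outcome] for row in data]
    return pearson(xs, ys)


def dispersion_by_choice(results):
    """For each choice point, the gap between the mean estimates of its options."""
    spread = {}
    for name, options in CHOICES.items():
        means = [
            statistics.fmean([r for spec, r in results if spec[name] == option])
            for option in options
        ]
        spread[name] = max(means) - min(means)
    return spread


# 1. Examine every path implied by the choices (2 x 2 x 2 = 8 "universes").
universes = [dict(zip(CHOICES, combo))
             for combo in itertools.product(*CHOICES.values())]
results = [(spec, run_universe(rows, spec)) for spec in universes]

# 2. Quantify which choice point disperses the estimates the most.
dispersion = dispersion_by_choice(results)

for spec, r in results:
    print(spec, round(r, 3))
print(dispersion)
```

The larger a value in `dispersion`, the more that single decision changes the conclusion. In a real multiverse analysis this ranking would presumably guide which bifurcations deserve a visualization and a written justification.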

The objective was to examine whether independent analysts would arrive at the same results and hypotheses from the same data set.


What were the results?

The analysts who participated in this study reached the following results:

  • 64.3% agreed with hypothesis 1.
  • 28.6% agreed with hypothesis 2.
  • The remaining 7.1% could neither confirm nor refute either hypothesis.

To better understand the process that guided the analytical decisions, a sub-team of qualitative researchers from the project evaluated the descriptive text explaining each step of the data analysis, as well as the source code corresponding to each step.

Asking analysts to explain their decisions yielded a large data set that captured their various workflows. From this, it was concluded that many analyses were iterative, as participants made sense of the data over time.

For their part, the scientists who organized the research pointed out that, thanks to the crowdsourcing approach and the multiverse analysis, the possibility (of almost 50%) of the results being null was drastically reduced.

Moreover, without these analytical elements, the role of the subjective decisions of each researcher would have remained unknown, instead of becoming transparent.

Likewise, the DataExplained platform made the difference when observing in detail the roadmap of the different analysis alternatives and their respective justifications.

Final considerations

In any application of advanced analytics for enterprises, some choices are subjective because they depend on the analyst in charge, so each organization needs robust, well-ordered databases. The analysis should not lack rigor and should provide useful conclusions to streamline the production processes of the business.

Finally, it is recommended that CEOs base their business decisions on these three factors:

  • Scientific findings
  • Consulting reports
  • Internal analysis

Has your company performed any type of predictive analysis? Are the conclusions contradictory over time or depending on the team performing it? Which production processes do you think could benefit the most from this technology? Do you think an external analyst, as in the crowdsourcing model, could guide you in this process?

Comment in the space below and subscribe to my blog to learn more about other topics of innovation and scientific technology applied to business.

Originally published in Jorge Pérez Colin Blog
