
I was reading this article in Nature, in which some fallacies are explained in the context of data analysis. I noticed that the Texas sharpshooter fallacy was particularly difficult to avoid:

    A cognitive trap that awaits during data analysis is illustrated by the fable of the Texas sharpshooter: an inept marksman who fires a random pattern of bullets at the side of a barn, draws a target around the biggest clump of bullet holes, and points proudly at his success. His bullseye is obviously laughable - but the fallacy is not so obvious to gamblers who believe in a 'hot hand' when they have a streak of wins, or to people who see supernatural significance when a lottery draw comes up as all odd numbers. "You just get some encouragement from the data and then think, well, this is the path to [...] options and you picked the one that gave you the most agreeable or interesting results, and now you're engaged in something that's not at all an unbiased representation of the data."

I think that kind of exploration work is commonplace, and often hypotheses are constructed based on that part of the analysis. There is a whole approach (EDA) dedicated to this process:

    Exploratory data analysis was promoted by John Tukey to encourage statisticians to explore the data, and possibly formulate hypotheses that could lead to new data collection and experiments.

It looks like any exploratory process performed without having a hypothesis beforehand is prone to generate spurious hypotheses.

Notice that the description of EDA above actually talks about new data collection and experiments. I understand that after new data have been collected, a confirmatory data analysis (CDA) is appropriate. However, I don't think this distinction is made very clearly, and although a separation of EDA and CDA would be ideal, surely there are some circumstances in which this is not feasible. I would go so far as to say that following this separation strictly is uncommon, and that most practitioners don't subscribe to the EDA paradigm at all.

So my question is: Does EDA (or any informal process of exploring data) make it more likely to fall for the Texas sharpshooter fallacy?

Answer:

This paints a very negative view of exploratory data analysis. While the argument is not wrong, it's really saying "what can go wrong when I use a very important tool in the wrong manner?" Accepting unadjusted p-values from EDA methods will lead to vastly inflated type I error rates. But I think Tukey would not be happy with anyone doing this. The point of EDA is not to make definitive conclusions about relations in the data, but rather to look for potentially novel relations in the data to follow up on.

Leaving out this step in the larger scientific process is essentially hamstringing science, leaving it never able to find new and interesting aspects of our data outside of pure logical deduction. Ever try to logically deduce how overexpression of a set of genes will affect the survival of a cell? Hint: it's not very easy. (One of our favorite jokes among the bioinformatics staff at my work was when a physicist asked, "Why don't you just simulate the physical properties of different gene interactions? It's a finite parameter space.")

Personally, I think confusion about this can lead to a great slowdown in scientific progress. I know too many non-statistical researchers who state that they do not want to do EDA procedures on preliminary data, because they "know that EDA can be bad."
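The type-I-error inflation from unchecked exploration is easy to demonstrate. Below is a minimal simulation sketch (my own illustration, not from the post or the Nature article; the feature count, group sizes, and seed are arbitrary assumptions): every "feature" is pure noise, yet screening many of them at alpha = 0.05 reliably yields spurious "discoveries," while a standard Bonferroni adjustment removes essentially all of them.

```python
# Illustrative sketch: screening many noise-only features produces
# spurious "significant" results when p-values are left unadjusted.
import math
import random

random.seed(42)

N_FEATURES = 200   # hypothetical number of hypotheses examined while exploring
N_PER_GROUP = 20   # hypothetical samples per group
ALPHA = 0.05

def two_sample_z_pvalue(a, b):
    """Two-sided z-test for equal means; exact here because data are N(0, 1)."""
    diff = sum(a) / len(a) - sum(b) / len(b)
    se = math.sqrt(1 / len(a) + 1 / len(b))
    z = abs(diff) / se
    # Phi(z) = 0.5 * (1 + erf(z / sqrt(2))) is the standard normal CDF
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

# Every feature is pure noise: no real group difference exists anywhere.
pvals = []
for _ in range(N_FEATURES):
    group_a = [random.gauss(0, 1) for _ in range(N_PER_GROUP)]
    group_b = [random.gauss(0, 1) for _ in range(N_PER_GROUP)]
    pvals.append(two_sample_z_pvalue(group_a, group_b))

unadjusted_hits = sum(p < ALPHA for p in pvals)
bonferroni_hits = sum(p < ALPHA / N_FEATURES for p in pvals)

print(f"Spurious 'discoveries' at alpha = 0.05: {unadjusted_hits} of {N_FEATURES}")
print(f"After Bonferroni correction:            {bonferroni_hits} of {N_FEATURES}")
```

With 200 noise-only features we expect roughly 5% of them (about 10) to clear the unadjusted threshold, which is exactly the sharpshooter's barn: the "bullseyes" are drawn after the fact. Treating those hits as leads for new data collection, rather than as conclusions, is the distinction the answer above draws.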
