Proverbs IX:1 “Wisdom has built her house; she has hewn out its seven pillars.”
Drawing on Proverbs IX:1, Stephen Stigler, a statistician and historian, wrote a book entitled “The Seven Pillars of Statistical Wisdom,” published by Harvard University Press in March 2016. The seven pillars of statistical wisdom are Aggregation, Information, Likelihood, Intercomparison, Regression, Design, and Residuals.
In the age of big data, it seems to many that statistics has lost out in the new gold rush: data is information, and hence money. From the famous 4 Vs (volume, velocity, variety, and veracity) to the single V (value), everyone is rushing to dig out the big V using conventional data-driven statistical analysis. What is neglected is the validity of the conventional model selection procedure in the big data setting. To draw valid conclusions from a selected model, what we produce must be reproducible. For this purpose, we need to stand firm on inference rather than pick whatever looks good and fool ourselves.
A decade ago, Ioannidis argued that most published scientific findings are false, and his paper “Why Most Published Research Findings Are False” warned readers to use statistical inference correctly. It did not have much effect. Numerous scientific results are still published on the basis of empirical case studies alone, with no assurance of reproducibility. Furthermore, the data used for publication are kept as private assets, even though most come from federally funded projects, so there is no way to reproduce the results as reported.
Recently, the ASA issued six principles on the use of p-values to prevent their misuse in statistical inference. The U.S. National Science Foundation has adopted the recommendation from “The Mathematical Sciences in 2025,” published by the National Academies Press: for big data analysis, correct inference after massive data snooping is required. Berk et al. (2016) pointed out that common practice in big data analysis is data-driven and that conventional statistical inference based on the selected model is generally invalid. Thus, post-selection inference is required.
In their paper, the authors propose simultaneous inference, which suitably widens conventional confidence and retention intervals so that they are provably valid under all possible model selection procedures. The recent book Statistical Learning with Sparsity, co-authored by Tibshirani, includes a chapter on “Statistical Inference” that collects the most recent developments in post-selection inference.
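The invalidity of naive inference after selection can be illustrated with a small simulation. The sketch below is only a simplified illustration under stated assumptions, not the procedure of Berk et al.: the response is pure noise, the "model selection" step simply picks the predictor with the largest marginal |t| statistic, and the widened threshold uses a Bonferroni correction as a conservative stand-in for a simultaneous critical value. The function name selection_error_rates and all parameter values are hypothetical choices made for this example.

```python
import numpy as np
from statistics import NormalDist


def selection_error_rates(n=200, p=20, reps=2000, alpha=0.05, seed=0):
    """Estimate false-positive rates after picking the 'best' of p null predictors."""
    rng = np.random.default_rng(seed)
    # Nominal two-sided critical value that ignores the selection step.
    z_naive = NormalDist().inv_cdf(1 - alpha / 2)
    # Bonferroni-widened critical value covering all p candidate predictors
    # (an assumed, conservative stand-in for a simultaneous constant).
    z_simul = NormalDist().inv_cdf(1 - alpha / (2 * p))

    naive_hits = 0
    simul_hits = 0
    for _ in range(reps):
        X = rng.standard_normal((n, p))
        y = rng.standard_normal(n)  # y is independent of every column of X

        # Marginal correlation of each predictor with y, converted to a t statistic.
        Xc = X - X.mean(axis=0)
        yc = y - y.mean()
        r = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
        t = r * np.sqrt((n - 2) / (1 - r**2))

        # Data-driven selection: keep only the most significant-looking predictor.
        t_best = np.max(np.abs(t))
        naive_hits += int(t_best > z_naive)
        simul_hits += int(t_best > z_simul)

    return naive_hits / reps, simul_hits / reps


naive_rate, simul_rate = selection_error_rates()
print(f"naive false-positive rate after selection: {naive_rate:.3f}")
print(f"rate with the widened threshold:           {simul_rate:.3f}")
```

With these settings, the naive rate should land far above the nominal 5% level, while the widened threshold should bring it back to roughly the nominal level. This is the sense in which widening intervals buys validity regardless of how the predictor was chosen.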