
Overfitting and underfitting

[Figure: green curve — an overfitted model; black curve — a regularized model, fit to the same training points]
The green line represents an overfitted model and the black line a regularized model. While the green line follows the training data most closely, it is too dependent on that data and is likely to have a higher error rate on new, unseen data than the black line.

In statistics, overfitting is "the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably". An overfitted model is a statistical model that contains more parameters than can be justified by the data. The essence of overfitting is to have unknowingly extracted some of the residual variation (i.e. the noise) as if that variation represented underlying model structure.

Underfitting occurs when a statistical model cannot adequately capture the underlying structure of the data. An under-fitted model is a model in which some parameters or terms that would appear in a correctly specified model are missing.
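The effect is easy to reproduce. The sketch below (a minimal illustration using NumPy, with a made-up data-generating process y = 2x plus Gaussian noise) fits two polynomials to the same small training set: a degree-1 model that matches the underlying trend, and a degree-9 model with more parameters than the data can justify. The flexible model achieves a lower training error by absorbing the noise, but typically does worse on held-out data:

```python
import numpy as np

rng = np.random.default_rng(42)

def make_data(n):
    # Noisy samples of a simple underlying trend: y = 2x + noise.
    x = rng.uniform(0.0, 1.0, n)
    return x, 2.0 * x + rng.normal(scale=0.2, size=n)

x_train, y_train = make_data(12)   # small training set
x_test, y_test = make_data(200)    # held-out data from the same process

def train_test_mse(degree):
    # Least-squares polynomial fit on the training set only.
    coefs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    return train_mse, test_mse

results = {d: train_test_mse(d) for d in (1, 9)}
for d, (tr, te) in results.items():
    print(f"degree {d}: train MSE {tr:.4f}, test MSE {te:.4f}")
```

The degree-9 fit is guaranteed to have a training error no worse than the degree-1 fit (it can represent every degree-1 model and more), which is exactly why training error alone cannot detect overfitting; the gap between training and test error is the tell-tale sign.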

What is the Data Science Relationship to Statistics?

Many statisticians, including Nate Silver, have argued that data science is not a new field but simply another name for statistics. Others argue that data science is distinct from statistics because it focuses on problems and techniques unique to digital data. Vasant Dhar writes that statistics emphasizes quantitative data and description, whereas data science deals with both quantitative and qualitative data (e.g. images) and emphasizes prediction and action. Andrew Gelman of Columbia University and data scientist Vincent Granville have described statistics as a nonessential part of data science.

Stanford professor David Donoho writes that data science is not distinguished from statistics by the size of datasets or the use of computing, and that many graduate programs misleadingly advertise their analytics and statistics training as the essence of a data science program. He describes data science as an applied field growing out of traditional statistics.