The 10 Statistical Techniques Data Scientists Need to Master
May 03, 2019
Regardless of where you stand on the matter
of Data Science sexiness, it’s simply impossible to ignore the continuing
importance of data, and our ability to analyze, organize, and contextualize it.
Data scientists live at the intersection of coding, statistics, and critical
thinking. As Josh Wills put it, “data scientist is a person who is better at
statistics than any programmer and better at programming than any
statistician.”
1. Linear Regression:
In statistics, linear regression is a method for predicting a target
variable by fitting the best linear relationship between the dependent and
independent variables. The best fit is found by minimizing the sum of squared
distances between the fitted line (or hyperplane) and the actual observations;
the fit is “best” in the sense that no other position of that line would
produce a smaller total error. The two major types of linear regression are
Simple Linear Regression and Multiple Linear Regression.
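As a minimal sketch of what this looks like in practice, the snippet below fits a simple linear regression with scikit-learn; the synthetic one-predictor data (true slope 3, intercept 2) is purely illustrative and not from the article.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: one predictor with a noisy linear relationship (illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 1.0, size=100)

# Simple linear regression: fit by ordinary least squares
model = LinearRegression().fit(X, y)
print("slope:", model.coef_[0], "intercept:", model.intercept_)
```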
2. Classification:
Classification is a data mining technique that assigns categories to a
collection of data in order to support more accurate prediction and analysis.
Decision trees are one well-known example, but classification covers a broader
family of methods intended to make the analysis of very large datasets
effective. Two major classification techniques stand out: Logistic Regression
and Discriminant Analysis.
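A rough sketch of both techniques, assuming scikit-learn and using its built-in breast cancer dataset as a stand-in for real data (features are standardized for the logistic regression solver):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Logistic regression: models the probability of class membership
logit = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_train, y_train)

# Linear discriminant analysis: assumes Gaussian classes with a shared covariance
lda = LinearDiscriminantAnalysis().fit(X_train, y_train)

print("logistic regression accuracy:", logit.score(X_test, y_test))
print("LDA accuracy:", lda.score(X_test, y_test))
```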
3. Resampling Methods:
Resampling consists of drawing repeated samples from the original data.
It is a non-parametric method of statistical inference: instead of relying on
generic distribution tables to compute approximate probability (p) values, it
estimates the sampling distribution directly from the data itself.
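One common resampling procedure is the bootstrap; the sketch below, assuming numpy and a synthetic exponential sample, builds a confidence interval for the mean without any distribution table.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=200)   # illustrative sample

# Bootstrap: repeatedly resample with replacement and recompute the statistic
boot_means = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(5000)
])

# Percentile confidence interval for the mean, derived from the resamples alone
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"sample mean: {data.mean():.3f}, 95% bootstrap CI: ({lo:.3f}, {hi:.3f})")
```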
4. Subset Selection:
This approach identifies a subset of the p predictors that we believe
to be related to the response. We then fit a model by least squares on the
reduced set of variables.
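As a sketch of one variant of this idea, forward stepwise selection, the snippet below uses scikit-learn's SequentialFeatureSelector on its built-in diabetes dataset; the dataset and the choice of four features are illustrative assumptions.

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)

# Forward stepwise selection: greedily add the predictor that most improves the fit
selector = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=4, direction="forward"
).fit(X, y)

# Least squares on the chosen subset of predictors
X_subset = selector.transform(X)
model = LinearRegression().fit(X_subset, y)
print("selected feature indices:", selector.get_support(indices=True))
```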
5. Shrinkage:
This approach fits a model involving all p predictors; however, the
estimated coefficients are shrunken towards zero relative to the least squares
estimates. This shrinkage, also known as regularization, has the effect of
reducing variance. Depending on what type of shrinkage is performed, some of
the coefficients may be estimated to be exactly zero, so this method can also
perform variable selection. The two best-known techniques for shrinking the
coefficient estimates towards zero are ridge regression and the lasso.
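A minimal sketch of both, assuming scikit-learn and its diabetes dataset; the penalty strength alpha=1.0 is an arbitrary illustrative choice, not a recommendation.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge, Lasso

X, y = load_diabetes(return_X_y=True)

# Ridge: L2 penalty shrinks coefficients toward zero but rarely to exactly zero
ridge = Ridge(alpha=1.0).fit(X, y)

# Lasso: L1 penalty can set some coefficients exactly to zero (variable selection)
lasso = Lasso(alpha=1.0).fit(X, y)

print("ridge coefficients:", np.round(ridge.coef_, 2))
print("lasso coefficients:", np.round(lasso.coef_, 2))
print("features dropped by the lasso:", int((lasso.coef_ == 0).sum()))
```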
6. Dimension Reduction:
Dimension reduction reduces the problem of estimating p + 1
coefficients to the simpler problem of estimating M + 1 coefficients, where
M < p. This is attained by computing M different linear combinations, or
projections, of the variables; these M projections are then used as predictors
to fit a linear regression model by least squares. Two approaches for this task
are principal component regression and partial least squares.
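A rough sketch of both approaches, assuming scikit-learn and its diabetes dataset; M = 3 components is an arbitrary illustrative choice.

```python
from sklearn.cross_decomposition import PLSRegression
from sklearn.datasets import load_diabetes
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

X, y = load_diabetes(return_X_y=True)

# Principal component regression: project onto M components, then fit by least squares
pcr = make_pipeline(PCA(n_components=3), LinearRegression()).fit(X, y)

# Partial least squares: the components are chosen using the response as well
pls = PLSRegression(n_components=3).fit(X, y)

print("PCR R^2:", pcr.score(X, y))
print("PLS R^2:", pls.score(X, y))
```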
7. Nonlinear Models:
In statistics, nonlinear regression is a form of regression analysis in
which observational data are modeled by a function which is a nonlinear
combination of the model parameters and depends on one or more independent
variables. The data are fitted by a method of successive approximations.
8. Tree-Based Methods:
Tree-based methods can be used for both regression and classification
problems. These involve stratifying or segmenting the predictor space into a
number of simple regions. Since the set of splitting rules used to segment the
predictor space can be summarized in a tree, these types of approaches are
known as decision-tree methods.
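A minimal sketch of a regression tree and a classification tree, assuming scikit-learn and its built-in diabetes and breast cancer datasets; max_depth=3 is an illustrative choice to keep the trees small.

```python
from sklearn.datasets import load_breast_cancer, load_diabetes
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Regression tree: splits the predictor space into regions, predicts the region mean
X_reg, y_reg = load_diabetes(return_X_y=True)
reg_tree = DecisionTreeRegressor(max_depth=3).fit(X_reg, y_reg)

# Classification tree: predicts the majority class within each region
X_clf, y_clf = load_breast_cancer(return_X_y=True)
clf_tree = DecisionTreeClassifier(max_depth=3).fit(X_clf, y_clf)

print("regression tree R^2:", reg_tree.score(X_reg, y_reg))
print("classification tree accuracy:", clf_tree.score(X_clf, y_clf))
```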
9. Support Vector Machines:
SVM is a classification technique that belongs to the family of supervised
learning models in machine learning. In layman’s terms, it involves finding the
hyperplane (a line in 2D, a plane in 3D, and, more formally, an
(n-1)-dimensional subspace of an n-dimensional space) that best separates two
classes of points with the maximum margin. Essentially, it is a constrained
optimization problem in which the margin is maximized subject to the constraint
that the data are perfectly classified (the hard-margin case).
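As a sketch, assuming scikit-learn and its breast cancer dataset, a linear SVM with a large C approximates the hard-margin setting described above (real data are rarely perfectly separable, so a soft margin is used in practice).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Linear SVM: a large C heavily penalizes margin violations (near hard-margin)
svm = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1e3)).fit(X_train, y_train)
print("test accuracy:", svm.score(X_test, y_test))
```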
10. Unsupervised Learning:
So far, we have only discussed supervised learning techniques, in which
the groups are known and the experience provided to the algorithm is the
relationship between actual entities and the groups they belong to. Another set
of techniques can be used when the groups (categories) of the data are not
known. They are called unsupervised because it is left to the learning
algorithm to discover patterns in the data provided. Clustering is an example
of unsupervised learning, in which data points are grouped into clusters of
closely related items.
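A minimal clustering sketch, assuming scikit-learn; the synthetic three-group blob data stands in for any unlabeled dataset.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic, unlabeled data drawn from three groups (illustrative only)
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# k-means: no labels are given; the algorithm discovers the groups itself
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster sizes:", [int((kmeans.labels_ == k).sum()) for k in range(3)])
print("cluster centers:\n", kmeans.cluster_centers_)
```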