To gain access you must receive an invitation from the professor. If you do not receive it, please request access by email, clearly identifying the course in question.
The Machine Learning course is aimed at engineering students who want to acquire knowledge of multivariate models for statistical inference and for predicting a dependent variable from independent (predictor) variables.
These analyses will use machine learning techniques such as:
The processing tools will be:
The Machine Learning course focuses on building models based on spatial data that help to understand the distribution and behavior of physical processes in the earth sciences.
This is not a Python, R, or GIS course, so it does not require deep knowledge of those tools. Without going into their basics in detail, the course assumes the student knows the fundamentals of these tools; being an expert is not required. The course is a joint construction of knowledge by both the students and the professor.
The focus of the course is the use of spatial and/or temporal data analysis to solve problems in the geosciences, giving priority to data models that can be interpreted and that provide insight into the physical phenomenon.
“In God we trust. All others must bring data.”
W. Edwards Deming (1900–1993)
Kilobytes were stored on floppy disks, megabytes were stored on hard disks, terabytes were stored in disk arrays, and petabytes are stored in the cloud.
(Anderson, 2008)
A methodology and set of techniques for extracting information from data in a knowledge domain.
The field of data mining involves processes, methodologies, tools and techniques to discover and extract patterns, knowledge, insights and valuable information from non-trivial datasets.
“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”
Python code is fast to develop: as the code does not need to be compiled and built, it can be changed and executed much more readily. This makes for a fast development cycle.
Python code is not as fast in execution: since the code is not compiled directly and an additional layer, the Python virtual machine, is responsible for execution, Python code runs somewhat slower than conventional languages like C, C++, etc.
It is interpreted: Python is an interpreted language, which means it does not need compilation to binary code before it can be run. You simply run the program directly from the source code.
It is object oriented: Python is an object-oriented programming language. An object-oriented program involves a collection of interacting objects, as opposed to the conventional list of tasks. Many modern programming languages support object-oriented programming. ArcGIS and QGIS are designed to work with object-oriented languages, and Python qualifies in this respect.
A model is an idealized representation of a system.
In this situation we wish to estimate $f$, but our goal is not necessarily to make predictions for $Y$. We instead want to understand the relationship between $X$ and $Y$, or more specifically, to understand how $Y$ changes as a function of $X_1, \ldots, X_p$.
$\widehat{f}$ cannot be treated as a black box, because we need to know its exact form. In this setting, one may be interested in answering the following questions:
Explanatory models refer to the application of statistical models to data in order to test causal hypotheses about theoretical constructs.
Scientists are trained to recognize that correlation is not causation; it is not advisable to draw conclusions based solely on the correlation between $X$ and $Y$ (it may be mere coincidence). Therefore, one must understand the underlying mechanism connecting $X$ and $Y$.
In explanatory modeling, the function $\widehat{f}$ is carefully constructed based on $f$, in a way that supports interpreting the estimated relationship between $X$ and $Y$ and testing the causal hypothesis.
Association (dependence): indicates a general relationship between two variables, where one of them provides some information about another.
Correlation: refers to a specific kind of association and captures information about the increasing or decreasing trends (whether linear or non-linear) of associated variables.
Causation: refers to a stronger relationship between two associated variables, where the cause variable “is partly responsible for the effect, and the effect is partly dependent on the cause”.
“Data is useful to illuminate the path, but keep following the path to find the full story...”
Who said this?
There are two basic types of data and one hybrid:
Cross-sectional data: a sample of observations on individual units taken at a single point in time. Individual observations have no natural ordering and are statistically independent.
Time series data: consists of a sample of observations on one or more variables over successive periods of time. They have a chronological ordering.
Hybrid data structures: combine the inherent characteristics of cross-sectional and time-series data sets. They could be Pooled Cross-Section or Panel (longitudinal).
A property, attribute, characteristic, aspect, or dimension of an object, fact, or phenomenon that can vary and whose variation is measurable.
Categorical variables: express a quality, characteristic, or attribute that can only be classified or categorized by counting. They are defined as a set of categories and can only take values that belong to that set.
Continuous variables: numeric variables that cannot be counted and have an infinite number of possible values within a given interval. They can never be measured exactly; the precision depends on the measuring equipment.
Exploratory Data Analysis (EDA) is the very first step before you can perform any changes to the dataset or develop a statistical model to answer business problems. In other words, the process of EDA contains summarizing, visualizing and getting deeply acquainted with the important traits of a data set.
Data Preprocessing is usually about data engineers getting large volumes of data from the sources (databases, object stores, data lakes, etc.) and performing basic data cleaning and data wrangling to prepare it for the next stage, which is essential before modelling: feature engineering!
Feature Engineering is known as the process of transforming raw data into features that better represent the underlying problem to predictive models, resulting in improved model accuracy on unseen data.
Specifically, the data scientist will begin building the models and testing to see if the features achieve the desired results. This is a repetitive process that includes running experiments with various features, as well as adding, removing, and changing features multiple times!!!
Web Scraping refers to the process of extracting data from a website or a specific webpage. Once web scrapers extract the user's desired data, they often also restructure the data into a more convenient format such as a CSV file.
An API (Application Programming Interface) is a set of procedures and communication protocols that provide access to the data of an application, operating system or other services. Generally, this is done to allow the development of other applications that use the same data.
Linear models and logistic regression are especially sensitive to this problem. Models based on decision trees can work adequately without scaling the variables.
A machine learning method is 'scale invariant' if rescaling any (or all) of the features--i.e. multiplying each column by a different nonzero number--does not change its predictions.
OLS is scale invariant. If you have a model $y=w_0+w_1x_1+w_2x_2$ and you replace $x_1$ with $x'_1=x_1/2$ and re-estimate the model $y=w_0+2w_1x'_1+w_2x_2$, you will get a new model which gives exactly the same predictions. The new $x'_1$ is half as big, so its coefficient is now twice as big.
When using a non-scale-invariant method, if the features are in different units (e.g. dollars, miles, kilograms, and numbers of products), people often standardize the data (subtract the mean and divide by the standard deviation of each column of X).
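As an illustrative sketch (not part of the course materials), standardizing a small feature matrix with scikit-learn could look like this; the values are invented:

# Hypothetical example: standardizing features before a scale-sensitive model
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1200.0, 3.5],    # e.g. dollars, miles (made-up values)
              [ 800.0, 1.2],
              [1500.0, 4.8]])

scaler = StandardScaler()            # subtracts each column's mean, divides by its std
X_scaled = scaler.fit_transform(X)
print(X_scaled.mean(axis=0))         # ~0 for each column
print(X_scaled.std(axis=0))          # ~1 for each column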
A frequency distribution table is a table that stores the categories (also called “bins”), the frequency, the relative frequency, and the cumulative relative frequency of a single continuous interval variable.
The frequency for a particular category or value (also called “observation”) of a variable is the number of times the category or the value appears in the dataset.
Relative frequency is the proportion (%) of the observations that belong to a category. It is used to understand how a sample or population is distributed across bins (calculated as relative frequency = frequency / n).
The cumulative relative frequency of each row is the addition of the relative frequency of this row and above. It tells us what percent of a population (observations) ranges up to this bin. The final row should be 100%.
A probability density histogram is defined so that (i) The area of each box equals the relative frequency (probability) of the corresponding bin, (ii) The total area of the histogram equals 1
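A minimal sketch of how such a table (and the density normalization) might be built with numpy and pandas; the sample data and bin count are assumptions:

import numpy as np
import pandas as pd

data = np.random.normal(loc=50, scale=10, size=200)   # made-up continuous variable

counts, bin_edges = np.histogram(data, bins=8)
table = pd.DataFrame({
    "bin_left": bin_edges[:-1],
    "bin_right": bin_edges[1:],
    "frequency": counts,
})
table["relative_frequency"] = table["frequency"] / len(data)
table["cumulative_relative_frequency"] = table["relative_frequency"].cumsum()
print(table)   # the last cumulative value should be 1.0 (100%)

# For a probability density histogram, each bar height = relative_frequency / bin_width,
# so the bar areas sum to 1 (e.g. plt.hist(data, bins=8, density=True) with matplotlib).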
When we collect sufficiently large samples from a population, the means of the samples will have a normal distribution, even if the population itself is not normally distributed.
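A quick simulation of this idea (illustrative only), drawing samples from a clearly non-normal, skewed population:

import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=2.0, size=100_000)   # skewed, non-normal population

# Means of many samples of size 50
sample_means = [rng.choice(population, size=50).mean() for _ in range(2_000)]
print(np.mean(sample_means), np.std(sample_means))
# A histogram of sample_means would look approximately normal (Central Limit Theorem).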
A boxplot is a graphical representation of the key descriptive statistics of a distribution.
The characteristics of a boxplot are
The normal QQ plot is a graphical technique that plots data against a theoretical normal distribution that forms a straight line
A normal QQ plot is used to identify if the data are normally distributed
If data points deviate from the straight line and curves appear (especially in the beginning or at the end of the line), the normality assumption is violated.
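For reference, a normal QQ plot can be produced with scipy and matplotlib; the data below are synthetic:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

data = np.random.normal(loc=0, scale=1, size=300)   # synthetic, roughly normal data

stats.probplot(data, dist="norm", plot=plt)   # points should fall close to the straight line
plt.show()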
A scatter plot displays the values of two variables as a set of point coordinates
A scatter plot is used to identify the relations between two variables and trace potential outliers.
Inspecting a scatter plot allows one to identify linear or other types of associations
If points tend to form a linear pattern, a linear relationship between variables is evident. If data points are scattered, the linear correlation is close to zero, and no association is observed between the two variables. Data points that lie further away on the x or y direction (or both) are potential outliers
Before making modeling decisions, you need to know the underlying data distribution.
Returns the probability that a discrete random variable X is equal to a value of x. The sum of all values is equal to 1. PMF can only be used with discrete variables.
It is the continuous-variable analogue of the PMF. The probability that a continuous random variable X falls within a certain range is obtained by integrating the PDF over that range.
Returns the probability that a random variable X takes values less than or equal to x.
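A short sketch with scipy.stats illustrating the three functions; the distributions and parameter values are arbitrary examples:

from scipy import stats

# PMF: discrete variable, e.g. Binomial(n=10, p=0.3)
print(stats.binom.pmf(3, 10, 0.3))    # P(X = 3)

# PDF: continuous variable, e.g. standard normal (a density, not a probability)
print(stats.norm.pdf(0.0))

# CDF: P(X <= x)
print(stats.norm.cdf(1.96))           # ~0.975 for the standard normal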
Covariance is a measure of the extent to which two variables vary together (i.e., change in the same linear direction). Covariance Cov(X, Y) is calculated as:
$\mathrm{cov}_{x,y}=\frac{\sum_{i=1}^{n}(x_{i}-\bar{x})(y_{i}-\bar{y})}{n-1}$, where $x_i$ is the score of variable X of the i-th object, $y_i$ is the score of variable Y of the i-th object, $\bar{x}$ is the mean value of variable X, and $\bar{y}$ is the mean value of variable Y.
For positive covariance, if variable X increases, then variable Y increases as well. If the covariance is negative, then the variables change in opposite ways (one increases, the other decreases). Zero covariance indicates no linear relationship between the variables.
The correlation coefficient $r_{(x, y)}$ analyzes how two variables (X, Y) are linearly related. Among the correlation coefficient metrics available, the most widely used is Pearson's correlation coefficient (also called the Pearson product-moment correlation).
$r_{(x, y)} = \frac{\text{cov}(X,Y)}{s_x s_y}$, where $s_x$ and $s_y$ are the standard deviations of X and Y. Correlation is a measure of association and not of causation.
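These quantities can be computed directly with numpy and scipy; the two arrays below are made up:

import numpy as np
from scipy import stats

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([1.5, 3.9, 6.1, 7.8, 10.2])

cov_xy = np.cov(x, y, ddof=1)[0, 1]     # sample covariance (denominator n-1)
r, p_value = stats.pearsonr(x, y)       # Pearson correlation coefficient and its p-value
print(cov_xy, r, p_value)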
Often, in a high-dimensional dataset, there remain some entirely irrelevant, insignificant, and unimportant features. The contribution of these features to predictive modeling is typically much smaller than that of the critical features, and it may even be zero. Such features cause a number of problems that in turn prevent efficient predictive modeling.
Feature selection is sometimes confused with dimensionality reduction, but they are different. Both methods tend to reduce the number of attributes in the dataset, but a dimensionality reduction method does so by creating new combinations of attributes (sometimes known as feature transformation), whereas feature selection methods include and exclude attributes present in the data without changing them.
Filter methods rely on general characteristics of the data to evaluate and pick a feature subset, without involving any mining algorithm. They use assessment criteria such as distance, information, dependency, and consistency.
A wrapper method uses a machine learning algorithm and its performance as the evaluation criterion. This method searches for the feature subset best suited to the machine learning algorithm and aims to improve mining performance.
Embedded methods are iterative in the sense that they take part in each iteration of the model training process and carefully extract the features that contribute the most to the training in that iteration.
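As an illustration of the three families with scikit-learn (the dataset and parameter choices are placeholders):

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=20, n_informative=5, random_state=0)

# Filter: score each feature independently of any model
X_filter = SelectKBest(score_func=f_classif, k=5).fit_transform(X, y)

# Wrapper: recursive feature elimination driven by a model's performance
X_wrapper = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit_transform(X, y)

# Embedded: selection happens inside model training (e.g. an L1 penalty)
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
X_embedded = SelectFromModel(l1_model).fit_transform(X, y)

print(X_filter.shape, X_wrapper.shape, X_embedded.shape)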
The difference between the mapped value or class and the true value or class.
Bias: is the amount of error introduced by approximating real-world phenomena with a simplified model.
Variance: is how much your model's test error changes based on variation in the training data. It reflects the model's sensitivity to the idiosyncrasies of the data set it was trained on.
Model evaluation considers two aspects of a model's performance:
Validation of explanatory models consists of two parts:
Validation of predictive models focuses on generalization, which is the ability of the function $f$ to predict on new data (X, y).
Acceptance of a model should meet at least three criteria:
The evaluation should be checked:
In general it is easier to obtain high levels of model fit than to achieve similar levels of predictive performance, and yet the latter is more important for practical purposes.
If the absolute value is not taken (the signs of the errors are not removed), the average error becomes the Mean Bias Error (MBE) and is usually intended to measure average model bias. MBE can convey useful information, but should be interpreted cautiously because positive and negative errors will cancel out.
R-squared ($R^2$) is a statistical measure that represents the proportion of the variance for a dependent variable that's explained by an independent variable or variables in a regression model.
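A small sketch computing MAE, MBE, and $R^2$ for a set of predictions (the arrays are invented):

import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

y_true = np.array([3.0, 5.0, 7.5, 9.0])
y_pred = np.array([2.8, 5.4, 7.0, 9.3])

mae = mean_absolute_error(y_true, y_pred)   # average absolute error
mbe = np.mean(y_pred - y_true)              # mean bias error: positive and negative errors cancel
r2  = r2_score(y_true, y_pred)              # proportion of variance explained
print(mae, mbe, r2)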
The process of analyzing sample data and drawing conclusions about the parameters of a population is called Statistical Inference.
There are three common forms of Statistical Inference. Each one has a different way of using sample data to make conclusions about the population. They are:
For Point Estimate, we infer an unknown population parameter using a single value based on the sample data.
If we took another set of sample data (with the same sample size) and do a point estimate again, it is very likely we would end up making a different conclusion about the population. That means there is uncertainty when we draw conclusions about the population based on the sample data. Point estimation doesn’t give us any idea as to how good the estimation is.
For Interval Estimation, we use an interval of values (aka Confidence Interval) to estimate an unknown population parameter, and state how confident we are that this interval would include the true population parameter.
To construct the confidence interval, we would need two metrics:
In statistics, when we have collected data from experiments, surveys, or observations, we often want to determine if the observed differences or effects (i.e., x̄ - µ) are statistically significant or just the result of random chance.
Once the test statistic (Z statistic or T statistic) is calculated, we determine the critical region — the region of extreme values that would lead us to reject the null hypothesis. This critical region is established based on the chosen significance level (alpha), which represents the probability of making a Type I error (rejecting the null hypothesis when it is true). Common significance levels include 0.05 (5%) or 0.01 (1%).
If the calculated test statistic falls within the critical region, we reject the null hypothesis in favor of the alternative hypothesis, suggesting that the observed sample results are statistically significant.
Unlike point estimate and interval estimate which are used to infer population parameters based on sample data, the purpose of hypothesis testing is to evaluate the strength of evidence from the sample data for making conclusions about the population
In a hypothesis test, we evaluate two mutually exclusive statements about the population. They are
Note: A hypothesis test is NOT designed to prove the null or alternative hypothesis. Instead, it evaluates the strength of evidence AGAINST the null hypothesis using sample data. If a p-value is less than the significance level, that means the evidence (against the null hypothesis) we found in the sample data would rarely occur by chance. Therefore, we have enough evidence to reject the null hypothesis. It doesn't mean we've proved the alternative hypothesis is correct; it only means we accept the alternative hypothesis, or we're more confident that the alternative hypothesis is correct.
We describe a finding as statistically significant by interpreting the p-value.
A statistical hypothesis test may return a value called p or the p-value. This is a quantity that we can use to interpret or quantify the result of the test and either reject or fail to reject the null hypothesis. This is done by comparing the p-value to a threshold value chosen beforehand called the significance level.
A common value used for alpha is 5% or 0.05. A smaller alpha value suggests a more robust interpretation of the null hypothesis, such as 1% or 0.1%.
On a very broad level, if we have prior knowledge about the underlying data distribution (mainly a normal distribution), then parametric tests such as the T-Test, Z-Test, ANOVA test, etc. are used. If we don't have prior knowledge about the underlying data distribution, then non-parametric tests such as the Mann-Whitney U Test are used.
It is used for hypothesis testing mainly when the sample size is small (less than 30) and the population standard deviation is not available; the underlying distribution, however, is assumed to be normal. A T-Test is of two types: the One-Sample T-Test, used for comparing a sample mean with a population mean, and the Two-Sample T-Test, used for comparing the means of two samples. When the observations across the samples are paired, it is called a Paired T-Test.
It is used for hypothesis testing mainly when the sample size is large (greater than 30), and the underlying distribution is assumed to be normal. A Z-Test is of two types: the One-Sample Z-Test, used for comparing a sample mean with a population mean, and the Two-Sample Z-Test, used for comparing the means of two samples.
ANOVA stands for Analysis of Variance. It is a generalisation or extension of the T-Test/Z-Test to more than two samples: the test tells us whether the means of two or more samples are significantly different. Similar to the Paired T-Test, there is also the Repeated Measures ANOVA Test, which tests whether the means of two or more paired samples are significantly different.
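These tests are available in scipy.stats; a sketch with synthetic samples (the means, scales, and alpha are arbitrary choices):

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sample_a = rng.normal(loc=10.0, scale=2.0, size=25)
sample_b = rng.normal(loc=11.0, scale=2.0, size=25)
sample_c = rng.normal(loc=10.5, scale=2.0, size=25)

t1, p1 = stats.ttest_1samp(sample_a, popmean=10.0)      # one-sample t-test
t2, p2 = stats.ttest_ind(sample_a, sample_b)            # two-sample (independent) t-test
f,  p3 = stats.f_oneway(sample_a, sample_b, sample_c)   # one-way ANOVA
print(p1, p2, p3)   # compare each p-value against the chosen alpha (e.g. 0.05)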
When we have prior knowledge of the underlying data distribution (Gaussian distribution), parametric tests are carried out; otherwise, non-parametric tests are used. Some of the non-parametric tests are as follows:
Gradually, with the help of some optimization algorithm, the model learns to reduce the error in its predictions. An optimization algorithm is a procedure executed iteratively, comparing various solutions until an optimum or satisfactory solution is found. ... These algorithms minimize or maximize a loss function using its gradient values with respect to the parameters.
$θ$ → The parameters that minimize the loss function
$x$ → The input feature values
$y$ → The vector of output values
$\hat{y}$ → The vector of estimated output values
$h(\theta)$ → the hypothesis function
In a linear regression problem, optimizing a model can be done by using a mathematical formula called the Normal Equation.
$\theta=(X^T X)^{-1} X^T y$ It is a one-step algorithm used to analytically find the coefficients that minimize the loss function ($\theta$) without having to iterate.
However, this method becomes an issue when you have a lot of data: either too many features (computation time issue) or too many observations (memory issue).
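A minimal numpy sketch of the normal equation on made-up data (the true coefficients are chosen only so the result can be checked):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 4.0 + 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

X_b = np.c_[np.ones((100, 1)), X]                 # add the intercept column
theta = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y    # theta = (X^T X)^-1 X^T y
print(theta)                                      # ~[4, 3, -2]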
It is an optimization algorithm to find the minimum of a function. We start with a random point on the function and move in the negative direction of the gradient of the function to reach the local/global minima.
Alpha (α) is called the learning rate and specifies the magnitude of the steps. The higher α is, the bigger the steps are, and vice versa.
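As an illustrative sketch (not from the course material), batch gradient descent for linear regression with a squared-error cost, implementing the update rule $\theta \leftarrow \theta - \alpha \nabla_\theta \mathrm{cost}(\theta)$ shown next; the synthetic data, learning rate, and iteration count are assumptions:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 4.0 + 3.0 * X[:, 0] + rng.normal(scale=0.1, size=100)
X_b = np.c_[np.ones((100, 1)), X]            # intercept column

alpha, n_iters = 0.1, 1000                   # learning rate and number of steps (assumed values)
theta = np.zeros(2)
for _ in range(n_iters):
    gradient = (2 / len(y)) * X_b.T @ (X_b @ theta - y)   # gradient of the MSE cost
    theta = theta - alpha * gradient                      # theta_next = theta - alpha * gradient
print(theta)                                 # should approach [4, 3]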
$\theta^{(\text{next step})}=\theta - \alpha \nabla_\theta \mathrm{cost}(\theta)$

The objective is to identify subgroups in the data such that the data points within each subgroup (cluster) are very similar, while data points in different subgroups are very different.
As the dimensions increase the volume of the space increases so fast that the available data becomes sparse
One more weird problem that arises with high dimensional data is that distance-based algorithms tend to perform very poorly. This happens because distances mean nothing in high dimensional space. As the dimensions increase, all the points become equidistant from each other such that the difference between the minimum and maximum distance between two points tends to zero.
Dimensionality reduction techniques can be classified in two major approaches as follows.
A supervised method that works with data already classified into groups in order to find rules for classifying new, unclassified individual items. The best-known and most widely used technique is Fisher's Linear Discriminant Function Analysis (Fisher, 1936).
The aim is to find the linear combinations of the continuous variables that best discriminate between groups. To do this, we:
The following are linear models:
The models are still linear, since the coefficients/weights associated with each variable remain linear. One can say the model is non-linear in terms of the variables, but linear in terms of the coefficients. In the second model, $y$ is a function of both $x$ and $x^2$. In the third model, $y$ is a function of $x_1$, $x_2$, and the interaction between $x_1$ and $x_2$.
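For instance, such polynomial and interaction terms can be generated and then fitted with an ordinary linear regression (a sketch with invented data):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(200, 2))
y = 1.0 + 2.0 * X[:, 0] + 0.5 * X[:, 0] ** 2 + 1.5 * X[:, 0] * X[:, 1] + rng.normal(scale=0.1, size=200)

poly = PolynomialFeatures(degree=2, include_bias=False)   # adds x1^2, x2^2, and x1*x2 columns
X_poly = poly.fit_transform(X)

model = LinearRegression().fit(X_poly, y)   # still linear in the coefficients
print(model.intercept_, model.coef_)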
Paradoxically, while the value is generally meaningless, it is crucial to include the constant term in most regression models!...If all of the predictors can’t be zero, it is impossible to interpret the value of the constant. Don't even try!...
The constant term is in part estimated by the omission of predictors from a regression analysis. In essence, it serves as a garbage bin for any bias that is not accounted for by the terms in the model.
You should standardize the variables when your regression model contains polynomial terms or interaction terms. While these types of terms can provide extremely important information about the relationship between the response and predictor variables, they also produce excessive amounts of multicollinearity. Multicollinearity is a problem because it can hide statistically significant terms, cause the coefficients to switch signs, and make it more difficult to specify the correct model.
Collinearity is a linear association between two explanatory variables:
Multicollinearity is a linear association among more than two explanatory variables:
The effect of multicollinearity (redundancy) is that it makes the estimation of model parameters impossible.
To derive the ordinary least squares (OLS) estimators, r must be invertible, which means the matrix must have full rank. If the matrix is rank deficient, then the OLS estimator does not exist because the matrix cannot be inverted. If your design matrix contains k independent explanatory variables, then rank(r) should equal k. The effect of perfect multicollinearity is to reduce the rank of r below the maximum number of columns.
The VIF quantifies the severity of multicollinearity in a multiple regression model. Functionally, it measures the increase in the variance of an estimated coefficient when collinearity is present.
The coefficient of determination is derived from the regression of covariate j against the remaining k - 1 variables. If the VIF is close to one, this implies that covariate j is linearly independent of all other variables. Values greater than one are indicative of some degree of multicollinearity, whereas values between one and five are considered to exhibit mild to moderate multicollinearity; these variables may be left in the model, though some caution is required when interpreting their coefficients. If the VIF is greater than five, this suggests moderate to high multicollinearity, and you might want to consider leaving these variables out.
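VIFs can be computed with statsmodels; the design matrix here is synthetic, with x2 deliberately built to be correlated with x1:

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + rng.normal(scale=0.3, size=200)   # correlated with x1
x3 = rng.normal(size=200)
X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

for i, col in enumerate(X.columns):
    print(col, variance_inflation_factor(X.values, i))
# x1 and x2 should show clearly higher VIFs than x3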
Logarithmic transformations are useful in various situations, including:
Before applying a logarithmic transformation, it’s essential to consider the following:
Note: When interpreting results after applying a logarithmic transformation, keep in mind that the transformation has changed the scale of the data. To make meaningful interpretations, you may need to back-transform the results to the original scale.
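For example, with numpy (log1p handles zeros gracefully; expm1 back-transforms to the original scale; the values are made up):

import numpy as np

values = np.array([0.0, 1.0, 10.0, 100.0, 1000.0])   # right-skewed, made-up data

logged = np.log1p(values)        # log(1 + x), defined at zero
restored = np.expm1(logged)      # back-transform to the original scale
print(logged, restored)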
The importance of a feature in a linear regression model can be measured by the absolute value of its t-statistic. The t-statistic is the estimated weight scaled with its standard error.
Let us examine what this formula tells us: The importance of a feature increases with increasing weight. This makes sense. The more variance the estimated weight has (= the less certain we are about the correct value), the less important the feature is. This also makes sense.
A standard least squares model tends to have some variance in it, i.e. this model won’t generalize well for a data set different than its training data. Regularization, significantly reduces the variance of the model, without substantial increase in its bias
So the tuning parameter λ used in regularization techniques controls the impact on bias and variance. As the value of λ rises, it reduces the value of the coefficients and thus reduces the variance. Up to a point, this increase in λ is beneficial, as it only reduces the variance (hence avoiding overfitting) without losing any important properties of the data.
But after a certain value, the model starts losing important properties, giving rise to bias in the model and thus underfitting. Therefore, the value of λ should be carefully selected.
Regularization helps select a midpoint between the first scenario of high bias and the latter scenario of high variance.
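Ridge (L2) and Lasso (L1) in scikit-learn expose this tuning parameter as alpha; a sketch on synthetic data (the alpha values are arbitrary):

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso

X, y = make_regression(n_samples=100, n_features=10, noise=10.0, random_state=0)

ols   = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)   # L2 penalty: shrinks coefficients toward zero
lasso = Lasso(alpha=1.0).fit(X, y)    # L1 penalty: can set some coefficients exactly to zero

print(ols.coef_[:3], ridge.coef_[:3], lasso.coef_[:3])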
Logistic regression is a linear combination of independent variables (explanatory factors) used to explain the variance in a dummy [0 - 1] dependent variable (a landslide inventory).
Let p(x) be the probability of success when the value of the predictor variable is x; then:
$p(x) = \frac{e^{a+\sum bx}}{1+e^{a+\sum bx}} = \frac{1}{1+e^{-(a+\sum bx)}}$, $\frac{p(x)}{1-p(x)} = e^{a+\sum bx}$, where $a$ is the model intercept, $b$ are the coefficients of the logistic regression model, and $x$ are the independent (predictor) variables.
$P(y=1) = \frac{1}{1+e^{-(a+\sum bx)}}$, where P is the Bernoulli probability that a terrain unit belongs to the non-landslide group or to the landslide group. P varies from 0 to 1 following an “S”-shaped (logistic) curve.
In logistic regression the OLS estimator cannot be used: in this case it has no analytical solution, and applying gradient descent to a squared-error loss is problematic because that loss is not convex for the logistic model. Instead, MLE and the binary cross-entropy loss are used.
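For reference, the binary cross-entropy (negative log-likelihood) loss that MLE minimizes, written with the notation of the formulas above, is:

$J(a, b) = -\frac{1}{n}\sum_{i=1}^{n}\left[ y_i \log p(x_i) + (1 - y_i)\log\big(1 - p(x_i)\big) \right]$

This loss is convex in the parameters, so it is typically minimized iteratively with gradient-based solvers.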
LogisticRegression(X, y,
                   pos_class=None, Cs=10,
                   fit_intercept=True, max_iter=100, tol=1e-4,
                   verbose=0, solver='lbfgs', coef=None,
                   class_weight=None, dual=False, penalty='l2',
                   intercept_scaling=1., multi_class='auto',
                   random_state=None, check_input=True,
                   max_squared_sum=None, sample_weight=None,
                   l1_ratio=None)
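In practice, the scikit-learn estimator is typically used as follows (a sketch; the data are synthetic and the hyperparameters shown are the defaults):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(penalty="l2", solver="lbfgs", max_iter=100)
clf.fit(X_train, y_train)

print(clf.intercept_, clf.coef_)        # the intercept a and coefficients b of the model
print(clf.predict_proba(X_test)[:5])    # class probabilities [P(y=0), P(y=1)] per test sample
print(clf.score(X_test, y_test))        # accuracy on held-out data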
KNN uses the entire dataset to evaluate each point, which is why it requires a lot of memory and processing resources (CPU). For these reasons, KNN tends to work best on small datasets without a huge number of features (columns).
Lazy learning: KNN does not build a model from the training data; instead, the learning happens at the very moment the test data are evaluated.
Instance-based: the algorithm does not explicitly learn a model (as, for example, logistic regression or decision trees do). Instead, it memorizes the training instances, which are used as the knowledge base for the prediction phase.
Supervised model: unlike K-means, which is an unsupervised algorithm in which “K” means the number of groups (clusters) we want to form, in K-Nearest Neighbors “K” means the number of neighboring points considered in order to classify among the “n” groups, which are known in advance, since it is a supervised algorithm.
Nonparametric: KNN makes no assumptions about the functional form of the problem being solved. As such KNN is referred to as a nonparametric machine learning algorithm.
KNeighborsClassifier(n_neighbors=5, *, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=None, **kwargs)
KDTree(X, leaf_size=40, metric='minkowski', **kwargs)
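A brief usage sketch (synthetic data; the distance metric and K shown are the defaults):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5, weights="uniform", metric="minkowski", p=2)
knn.fit(X_train, y_train)             # "lazy": essentially stores the training instances
print(knn.score(X_test, y_test))      # distances are computed at prediction time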
A margin is the separation between the decision line and the closest points of each class. A good margin is one where this separation is large for both classes. The images below give visual examples of good and bad margins. A good margin allows the points to stay in their respective classes without crossing into the other class.
The parameter C defines how strongly errors are penalized.
The gamma parameter defines how far the influence of a single training example reaches, with low values meaning ‘far’ and high values meaning ‘close’. In other words, with low gamma, points far away from the plausible separation line are considered in the calculation of the separation line, whereas with high gamma only the points close to the plausible line are considered in the calculation.
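A sketch of how C and gamma are set in scikit-learn's SVC; the values and synthetic data are arbitrary, and scaling is included because SVMs are scale-sensitive:

from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=4, random_state=0)

# C: penalty on margin violations; gamma: reach of each training point (RBF kernel)
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma=0.1))
model.fit(X, y)
print(model.score(X, y))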
A decision tree is a tree-like structure where internal nodes represent a test on an attribute, each branch represents an outcome of the test, and each leaf node represents a class label; the decision is made after computing all attributes. A path from root to leaf represents a classification rule. Thus, a decision tree consists of three types of nodes.
Gini impurity is a metric for measuring the mix of a set. The value of Gini impurity lies between 0 and 1, and it quantifies the uncertainty at a node in a tree. So Gini impurity tells us how mixed up or impure a set is. Our goal with classification is to split or partition the data into sets that are as pure or unmixed as possible. If we reach a Gini impurity of 0, we stop splitting the tree further.
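Gini impurity for a node can be computed directly from the class proportions; a small sketch (the class counts are invented):

import numpy as np

def gini_impurity(class_counts):
    """Gini = 1 - sum(p_k^2) over the classes present at a node."""
    p = np.asarray(class_counts, dtype=float)
    p = p / p.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_impurity([10, 0]))   # 0.0  -> pure node
print(gini_impurity([5, 5]))    # 0.5  -> maximally mixed for two classes
print(gini_impurity([8, 2]))    # 0.32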
The goal of ensemble methods is to combine the predictions of several base estimators built with a given learning algorithm in order to improve generalizability / robustness over a single estimator. By combining individual models, the ensemble model tends to be more flexible (less bias) and less data-sensitive (less variance).
Base models that are often considered for bagging are models with high variance but low bias
Random forest is an ensemble model using bagging as the ensemble method and decision tree as the individual model.
Bagging can be performed in terms of observations and/or in terms of predictors.
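A minimal Random Forest sketch in scikit-learn; max_features controls the sampling of predictors and bootstrap the sampling of observations (the specific values are illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

rf = RandomForestClassifier(
    n_estimators=200,       # number of bagged decision trees
    max_features="sqrt",    # random subset of predictors at each split
    bootstrap=True,         # random subset of observations (with replacement) per tree
    random_state=0,
)
rf.fit(X, y)
print(rf.feature_importances_)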
We see that the vertical width of the red tube, formed by the Random Forests is smaller than the Decision Trees’ black tube. So, Random Forests have a lower variance than Decision Trees, as expected. Furthermore, it seems that the averages (the middle) of the two tubes are the same which means that the process of averaging did not change the bias. We still hit the underlying true function 3sin(x)+x quite well.
https://towardsdatascience.com/understanding-the-effect-of-bagging-on-variance-and-bias-visually-6131e6ff1385

Boosting is a method of converting weak learners into strong learners (low variance but high bias) --> shallow decision trees (stumps)
AdaBoost (Adaptive Boosting) is a boosting ensemble model that works especially well with decision trees. The key to a boosting model is learning from previous mistakes, e.g. misclassified data points. AdaBoost learns from the mistakes by increasing the weight of misclassified data points.
As you can see, the 3 points I marked in yellow are on the wrong side. For this reason, we need to increase their weight for the 2nd iteration. But how? In the 1st iteration, we have 7 correctly and 3 incorrectly classified points. Let's assume we want to bring our solution into a 50/50 balance. Then we need to multiply the weight of the incorrectly classified points by (correct/incorrect), which is 7/3 ≈ 2.33. If we increase the weight of the incorrectly classified 3 points to 2.33, our model is balanced 50-50. We keep the results of the 1st classification in mind and go to the 2nd iteration.
In the 2nd iteration, the best solution is as on the left. Correctly classified points have a weight of 11, whereas incorrectly classified points’ weight is 3. To bring the model back to a 50/50 balance, we need to multiply the incorrectly classified points’ weight by (11/3 ≈ 3.66). With new weights, we can take our model to the 3rd iteration.
The best solution for the 3rd iteration is as on the left. The weight of the correctly classified points is 19, while the weight of the incorrectly classified points is 3 (once again). We can continue the iterations, but let’s assume that we end it here. We have now reached the stage of combining the 3 weak learners. But how do we do that?
ln(correct/incorrect) seems to give us the coefficients we want.
If we consider the blue region as positive and the red region as negative; we can combine the result of 3 iterations like the picture on the left.
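In scikit-learn this idea is available as AdaBoostClassifier, whose default weak learner is a depth-1 decision tree (a stump); the data and parameters below are illustrative:

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Default base learner is a depth-1 decision tree (stump); each iteration reweights the
# misclassified points and the weak learners are combined with log(correct/incorrect)-style weights.
ada = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=0)
ada.fit(X, y)
print(ada.score(X, y))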
A perceptron is a binary classification algorithm modeled after the functioning of the human brain—it was intended to emulate the neuron. The perceptron, while it has a simple structure, has the ability to learn and solve very complex problems.
A multilayer perceptron (MLP) is a group of perceptrons, organized in multiple layers, that can accurately answer complex questions. Each perceptron in the first layer (on the left) sends signals to all the perceptrons in the second layer, and so on. An MLP contains an input layer, at least one hidden layer, and an output layer.
A feedforward network with a single layer is sufficient to represent any function, but the layer may be infeasibly large and may fail to learn and generalize correctly. — Ian Goodfellow.
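A multilayer perceptron classifier sketch with scikit-learn; the architecture and synthetic data are arbitrary choices:

from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# input layer -> two hidden layers (32 and 16 perceptrons) -> output layer
mlp = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(32, 16), activation="relu",
                                  max_iter=500, random_state=0))
mlp.fit(X, y)
print(mlp.score(X, y))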
The objective of the Convolution Operation is to extract the high-level features such as edges, from the input image.
We can observe that the size of the output is smaller than the input. To maintain the dimension of the output as in the input, we use padding. Padding is a process of adding zeros to the input matrix symmetrically. In the following example, the extra grey blocks denote the padding. It is used to make the dimension of the output the same as the input.
There are two types of Pooling: Max Pooling and Average Pooling. Max Pooling returns the maximum value from the portion of the image covered by the Kernel. On the other hand, Average Pooling returns the average of all the values from the portion of the image covered by the Kernel.
Now that we have converted our input image into a suitable form for our Multilayer Perceptron, we flatten the image into a column vector. The flattened output is fed to a feed-forward neural network, and backpropagation is applied at every iteration of training. Over a series of epochs, the model is able to distinguish between dominating and certain low-level features in images and classify them using the Softmax classification technique.
Adding a Fully-Connected layer is a (usually) cheap way of learning non-linear combinations of the high-level features as represented by the output of the convolutional layer. The Fully-Connected layer is learning a possibly non-linear function in that space.
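A compact Keras sketch of this pipeline (convolution + pooling + flatten + fully connected with softmax); the input shape and layer sizes are assumptions:

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(28, 28, 1)),                                # e.g. a grayscale image
    layers.Conv2D(32, (3, 3), activation="relu", padding="same"),   # convolution (with padding)
    layers.MaxPooling2D((2, 2)),                                    # max pooling
    layers.Flatten(),                                               # flatten to a column vector
    layers.Dense(64, activation="relu"),                            # fully connected layer
    layers.Dense(10, activation="softmax"),                         # softmax classification
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()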
The generalized linear model (GLM) is a flexible generalization of the general linear model that allows for response variables with error distributions other than the normal distribution. It allows the linear predictor (e.g. $b_0+b_1 X$) to be related to the response via a function called the link function.
Note: General linear models are specific GLMs in which the errors are independent and follow a normal distribution, and the link function is the identity, because in general linear models we model the mean directly.
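A statsmodels sketch of a GLM with a non-normal error distribution and a non-identity link (Poisson counts with a log link); the data are simulated:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
lam = np.exp(0.5 + 0.8 * X[:, 0] - 0.3 * X[:, 1])   # log link: log(E[Y]) = b0 + b1*X1 + b2*X2
y = rng.poisson(lam)

model = sm.GLM(y, sm.add_constant(X), family=sm.families.Poisson())   # default link: log
result = model.fit()
print(result.summary())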
Where:
F1(X1), F2(X2), …, Fn(Xn) are non-parametric functions (smoothing functions)
G(Y) is a link function connecting the expected value to the input features X1, X2, …, Xn
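If the pyGAM library is available, a GAM with one smoothing function per feature might look like the sketch below; this assumes pyGAM is installed, and the data are simulated:

import numpy as np
from pygam import LinearGAM, s   # assumes the pyGAM package is installed

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.2, size=300)

# g(E[Y]) = F1(X1) + F2(X2): one smoothing function per input feature
gam = LinearGAM(s(0) + s(1)).fit(X, y)
gam.summary()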