The problem
What if you want to run a regression to estimate the coefficients and recover the data generating process of some data, BUT you are in a scenario with censored data? How does that affect your estimation? Can you do better?
Censored data
In short, you have censored data when some observations are unobserved or constrained for a specific reason that is natural and unavoidable.
Unobserved
For instance, in survival analysis, if the subject is not yet dead (luckily) you cannot observe the time of death yet, and hence you don't know whether they will die tomorrow or in two years.
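To get a feel for how censoring can affect a regression, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the linear data generating process, the noise level, and the `ceiling` above which values are recorded only as the ceiling itself. It is not a survival-analysis model, just the simplest censoring setup.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000

# Hypothetical data generating process: y = 1 + 2x + noise
x = rng.uniform(0, 10, size=n)
y_true = 1.0 + 2.0 * x + rng.normal(0, 2.0, size=n)

# Assumed censoring rule: values above the ceiling are recorded as the ceiling
ceiling = 15.0
y_obs = np.minimum(y_true, ceiling)
share_censored = np.mean(y_true > ceiling)

# np.polyfit returns coefficients from highest degree to lowest
slope_true, intercept_true = np.polyfit(x, y_true, deg=1)
slope_cens, intercept_cens = np.polyfit(x, y_obs, deg=1)

print(f"share of censored observations: {share_censored:.2f}")
print(f"slope on uncensored data:       {slope_true:.2f}")
print(f"slope on censored data:         {slope_cens:.2f}")
```

In this setup the slope fitted on the censored data comes out smaller than the slope fitted on the uncensored data, because the largest responses have been pulled down to the ceiling.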
Weighted regression consists of assigning a different weight to each observation, giving it more or less importance when fitting the regression.
One way to look at it is as solving the regression problem by minimizing the Weighted Mean Squared Error (WMSE) instead of the Mean Squared Error (MSE):
\[WMSE(\beta, w) = \frac{1}{n} \sum_{i=1}^n w_i(y_i - \overrightarrow {x_i} \beta)^2\]
Intuitively, we are looking for the coefficients that minimize the MSE, but with a different weight on each observation.
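As a minimal sketch of the WMSE idea above, the following NumPy code fits a weighted regression by rescaling each row by the square root of its weight and then running ordinary least squares, which minimizes the same objective. The helper names `wmse` and `fit_weighted`, the simulated data, and the weights are all assumptions made for illustration.

```python
import numpy as np

def wmse(beta, X, y, w):
    """Weighted mean squared error for coefficients beta."""
    residuals = y - X @ beta
    return np.mean(w * residuals ** 2)

def fit_weighted(X, y, w):
    """Minimize WMSE: ordinary least squares on rows scaled by sqrt(w_i)."""
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 10, size=n)
X = np.column_stack([np.ones(n), x])       # intercept column plus predictor
y = 1.0 + 2.0 * x + rng.normal(0, 1, size=n)
w = rng.uniform(0.5, 2.0, size=n)          # arbitrary illustrative weights

beta_w = fit_weighted(X, y, w)             # weighted fit
beta_ols = fit_weighted(X, y, np.ones(n))  # all weights equal: plain OLS

print("weighted coefficients:", beta_w)
print("OLS coefficients:     ", beta_ols)
```

With all weights equal the fit reduces to plain OLS, so the two calls can be compared directly.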
Let’s say we have a dataset and we want to fit a model to it and do some inference, such as obtaining the coefficients and their confidence intervals.
For such a task we would first need to find a model that we think approximates the real data generating process behind the phenomenon.
This will be the model selection step.
Then we would look at the output of our model and get the standard error of the coefficients or calculate the confidence interval or any other similar task.
R2 depends on the variance of the predictors
Quoting from Shalizi [1]:
Assuming a true linear model
\[ Y = aX + \epsilon\]
and assuming we know \(a\) exactly.
The variance of Y will be \(a^2\mathbb{V}[X] + \mathbb{V}[\epsilon]\).
So \(R^2 = \frac{a^2\mathbb{V}[X]}{a^2\mathbb{V}[X] + \mathbb{V}[\epsilon]}\)
This goes to 0 as \(\mathbb{V}[X] \rightarrow 0\) and it goes to 1 as \(\mathbb{V}[X] \rightarrow \infty\). “It thus has little to do with the quality of the fit, and a lot to do with how spread out the predictor variable is.”
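The dependence of \(R^2\) on \(\mathbb{V}[X]\) can be checked numerically. The sketch below evaluates the formula above and compares it with a simulation in which \(a\) is known exactly. The function names, the normal distributions, and the chosen values of \(a\) and the variances are assumptions for illustration only.

```python
import numpy as np

def theoretical_r2(a, var_x, var_eps):
    """R^2 = a^2 V[X] / (a^2 V[X] + V[eps])."""
    signal = a ** 2 * var_x
    return signal / (signal + var_eps)

def simulated_r2(a, var_x, var_eps, n=100_000, seed=0):
    """R^2 from simulated data, using the true coefficient a."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0, np.sqrt(var_x), size=n)
    y = a * x + rng.normal(0, np.sqrt(var_eps), size=n)
    # Share of the variance of y that is not left in the residuals
    return 1 - np.var(y - a * x) / np.var(y)

# Same model and same noise; only the spread of the predictor changes
for var_x in (0.01, 1.0, 100.0):
    print(var_x, theoretical_r2(2.0, var_x, 1.0), simulated_r2(2.0, var_x, 1.0))
```

The coefficient and the noise variance are identical in the three runs, yet \(R^2\) moves from close to 0 to close to 1 as the predictor becomes more spread out.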
Linear regression as smoothing
Let’s assume the DGP (data generating process) is:
\[ Y = \mu(x) + \epsilon\]
where \(\mu(x)\) is the mean Y value for that particular x and \(\epsilon\) is an error with mean 0.
When running OLS we are trying to approximate \(\mu(x)\) with a linear function of the form \(\alpha + \beta x\) and trying to retrieve the best \(\alpha\) and \(\beta\) minimizing the mean-squared error.
Mean squared error (MSE) is a measure of how far our prediction is from the true values of the dependent variable. It’s the expectation of the squared error.
The squared error being:
\[(Y - \hat \mu(x))^2\]
where Y is the true value and \(\hat \mu(x)\) is the prediction for a given x.
We can decompose it into:
\[ \begin{aligned} (Y - \hat \mu(x))^2 &= (Y - \mu(x) + \mu(x) - \hat \mu(x))^2 \\ &= (Y - \mu(x))^2 + 2(Y - \mu(x))(\mu(x) - \hat \mu(x)) + (\mu(x) - \hat \mu(x))^2 \end{aligned} \]
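The decomposition can be verified numerically. The sketch below assumes a hypothetical true regression function \(\mu(x) = \sin(x)\), normal noise, and a straight line fitted by least squares as \(\hat \mu(x)\). None of these choices come from the text; they only serve to check the algebra.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

x = rng.uniform(-2, 2, size=n)
mu = np.sin(x)                          # hypothetical true mean function mu(x)
y = mu + rng.normal(0, 0.5, size=n)     # Y = mu(x) + epsilon, with mean-zero noise

# Linear approximation alpha + beta * x fitted by least squares
beta, alpha = np.polyfit(x, y, deg=1)   # polyfit returns highest degree first
mu_hat = alpha + beta * x

squared_error = (y - mu_hat) ** 2
noise_term = (y - mu) ** 2
cross_term = 2 * (y - mu) * (mu - mu_hat)
approx_term = (mu - mu_hat) ** 2

# The identity holds observation by observation
assert np.allclose(squared_error, noise_term + cross_term + approx_term)

print("mean squared error:", squared_error.mean())
print("noise term:        ", noise_term.mean())
print("cross term:        ", cross_term.mean())
print("approximation term:", approx_term.mean())
```

In this simulation the cross term averages out to almost zero, so the mean squared error is roughly the noise term plus the approximation term.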
What’s Spark?
The definition says:
“Spark is a fast and general processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or Spark’s standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat. It is designed to perform both batch processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning.”
Basically, it is a framework for working with large amounts of data stored in distributed systems instead of on a single machine.
When using neural nets for a multiclass classification problem, it’s standard to have a softmax layer at the end to normalize the probabilities for each class. This means that the output of our net is a vector of probabilities (one for each class) that sums to 1. If there isn’t a softmax layer at the end, the net will still output one value per class in its last layer, but those values are not restricted to any range.
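The normalization described above can be sketched in a few lines of NumPy. The `logits` vector is a made-up example of the unbounded values a net might output for three classes, and subtracting the maximum before exponentiating is a common numerical-stability trick that does not change the result.

```python
import numpy as np

def softmax(logits):
    """Turn a vector of unbounded scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # stability trick; the output is unchanged
    exps = np.exp(shifted)
    return exps / exps.sum()

# Hypothetical raw outputs of the last layer for three classes
logits = np.array([2.0, -1.0, 0.5])
probs = softmax(logits)

print("probabilities:", probs)
print("sum:", probs.sum())
```

The ordering of the classes is preserved: the class with the largest raw score also gets the largest probability.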
This CS GO Kaggle link has data about several competitive CS GO matches.
In a few words:
Those are 5 vs 5 matches where each team tries to eliminate the other or complete an objective (planting or defusing the bomb, depending on the role you are playing) before the time expires.
The goal is to win 16 rounds before the other team.
After 15 rounds, both teams switch sides/roles.
Please follow this link
It was made with Flexboard (a package for building dashboards in R), so I think it only displays correctly on laptops/PCs because of the layout.