Wil Blouin's Quantitative Methods Portfolio: Assignment 6: Regression Analysis

Introduction:

Figure 1

Regression, like correlation, is a method of showing the relationship, or lack of relationship, between two variables. Unlike correlation, however, regression treats one variable as dependent and the other as independent, so it describes how the dependent variable responds to the independent variable; on its own it does not prove causation. One can make a prediction about the dependent variable from a hypothetical value of the independent variable, as long as that value falls between the minimum and maximum values of the data the regression analysis was run on. Linear regression analysis uses the ordinary least squares (OLS) method to find a best fit line for the data, along with an equation for this line in the form y = A + Bx. A is the constant, or y-intercept. B is the slope of the line, or the regression coefficient. It shows how much the dependent variable changes for a one-unit increase in the independent variable, and in which direction (positive or negative). A and B are found using the equations shown in Figure 1.
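
Since Figure 1 is not reproduced in the text, the following is a minimal sketch of the standard textbook OLS formulas for A and B, which are assumed to match the ones in the figure. The function name `ols_fit` and the sample data are made up for illustration and are not from the assignment.

```python
def ols_fit(x, y):
    """Return (A, B) for the least-squares line y = A + B*x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations and sum of squared x-deviations
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x has no variation, so the slope is undefined")
    b = sxy / sxx            # slope (regression coefficient)
    a = mean_y - b * mean_x  # constant (y-intercept)
    return a, b


# Made-up illustrative data lying exactly on y = 1 + 2x
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]
a, b = ols_fit(x, y)
print(a, b)  # 1.0 2.0
```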

Regression also results in a coefficient of determination. This is the r² value, which represents the strength of the relationship being studied as the proportion of variation in the dependent variable that the model explains. This value ranges from zero to one. The closer this value is to one, the stronger the relationship.
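
As a rough sketch of how r² can be computed from observed and fitted values, assuming the usual definition of one minus the ratio of residual to total sum of squares. The function name `r_squared` and the numbers are hypothetical, not the assignment's data.

```python
def r_squared(y, y_hat):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_y = sum(y) / len(y)
    # Variation left unexplained by the model
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    # Total variation of the observed values around their mean
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Hypothetical observed values and model predictions
observed = [1.0, 2.0, 3.0, 4.0]
fitted = [1.5, 1.5, 3.5, 3.5]
print(r_squared(observed, fitted))
```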

Deviation from a regression model is shown by residuals, the differences between the observed values and the values the model predicts. In this assignment, residuals are mapped as standard deviations away from the predictive model's value.
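
A minimal sketch of expressing residuals in standard deviation units is shown here. It simply divides each residual by the sample standard deviation of all residuals, which is a simplification: the standardized residuals produced by the mapping software may be computed somewhat differently. The function name and data are hypothetical.

```python
import statistics


def standardized_residuals(y, y_hat):
    """Residuals (observed minus predicted) in standard deviation units."""
    residuals = [yi - fi for yi, fi in zip(y, y_hat)]
    # OLS residuals from a model with a constant average to zero,
    # so dividing by their standard deviation gives a z-score-like value.
    sd = statistics.stdev(residuals)
    return [r / sd for r in residuals]


# Hypothetical observed and predicted values for four tracts
observed = [10.0, 12.0, 9.0, 15.0]
predicted = [11.0, 11.0, 11.0, 13.0]
for z in standardized_residuals(observed, predicted):
    print(round(z, 2))
```

Positive values mark places where the observed value is higher than the model predicts, and negative values mark places where it is lower.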

Regression can also be performed with multiple variables to create one equation that describes a data set using all of the variables. This is called multiple regression.
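
To illustrate the idea, here is a minimal sketch of multiple regression using NumPy's least-squares solver. This is not how SPSS is used in the assignment; the function name `fit_multiple` and the data are made up for illustration.

```python
import numpy as np


def fit_multiple(X, y):
    """Return [constant, b1, b2, ...] for y = constant + b1*x1 + b2*x2 + ..."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so the model includes a constant term
    design = np.column_stack([np.ones(len(X)), X])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefs


# Made-up data generated exactly from y = 1 + 2*x1 + 3*x2
X = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]]
y = [9, 8, 19, 18, 26]
print(np.round(fit_multiple(X, y), 3))
```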

This assignment deals with single regression in part one and multiple regression in part two. In part one, the relationship between the percent of children receiving free lunches and the crime rate is examined, and in part two the relationships between 911 calls and other variables are examined.

Part 1:

Methods, Results, and Discussion:

The supplied Excel file, containing crime statistics and the percent of children receiving free lunch at schools, was opened in SPSS, and a linear regression analysis was performed to examine a possible relationship. This is shown in Figure 2. Percent of children receiving free lunch is used as the dependent variable, and the crime rate (per 100,000 people) is used as the independent variable, because the assignment prompt asks for verification of a news station's claim that as free lunches increase, so does crime. The results are shown in Figure 3.

Figure 2

Figure 3

The r² value is 0.173. This is close to zero, so the relationship is very weak, but it is still a relationship. A t-test shows it to be significant at the 0.005 significance level, so the null hypothesis can be rejected in favor of the alternative hypothesis that there is a relationship here. The news station is right in saying that as free lunch increases so does crime rate; however, a linear relationship does not by itself mean that there is causation.

From the statistics, a best fit line can be found: y = 40.380 + 0.102x. This can be used to predict crime values based on a given lunch percentage, or the other way around. This equation would predict a crime rate of 42.777 for a lunch percentage of 32.5. Because of the very low r² value, however, this prediction cannot be made with much confidence. More variables would need to be considered to predict the result confidently.
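
As a quick arithmetic check, the quoted line can be evaluated directly. This is a minimal sketch in which the function name `predict` is hypothetical and the coefficients are the ones quoted above. With these coefficients an input of 32.5 evaluates to 43.695, while the value 42.777 corresponds to an input of 23.5.

```python
def predict(x, a=40.380, b=0.102):
    """Evaluate the best fit line y = a + b*x with the quoted coefficients."""
    return a + b * x


print(round(predict(32.5), 3))  # 43.695
print(round(predict(23.5), 3))  # 42.777
```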

Part 2:

Introduction:

This part of the assignment compares 911 calls with other variables at the census tract level. This analysis is useful for finding out where a new hospital ER should be built. Three variables are first compared with calls individually, with residuals mapped; then multiple regression is performed, and a map is made of the residuals from the equation built on the three most important variables found.

Single Regression Methods:

The supplied data was opened in IBM SPSS, and three different linear regressions were run using calls as the dependent variable. The three independent variables were alcohol sales, number of people with no high school degree, and median income.

A choropleth map showing the number of calls per tract was made in ArcMap; it is shown and described at the top of the results and discussion. Because the number of people with no high school degree had the strongest relationship with the number of 911 calls, a residual map was made for this variable.

Single Regression Results and Discussion:

Figure 4 shows the number of calls per tract. This map was classified by equal interval because this method best shows the range among the highest call areas and distinguishes them from one another. The highest call areas are in the north-central area, with another area of particularly high call rates in the south-east.

Figure 4

Alcohol Sales:

Figure 5

Figure 5 shows the relationship between alcohol sales and calls. The r² value is relatively small here, so alcohol sales do not explain the number of 911 calls well; however, they are a factor. The significance level of 0.000 shows that the relationship is significant, allowing the null hypothesis to be rejected in favor of the alternative, that there is a relationship here. For a one-unit increase in alcohol sales, calls rise by 3.069E-5.

Number of People with No High School Degree:

Figure 6

Figure 6 shows the relationship between the number of people with no high school degree and 911 calls in a census tract. The r² value is larger here, so the number of people with low education explains the number of 911 calls comparatively well. The significance level of 0.000 shows that the relationship is significant, allowing the null hypothesis to be rejected in favor of the alternative, that there is a relationship here. For a one-unit increase in people with low education, calls rise by 0.166.

Figure 7

Because this was the strongest predictor of 911 calls, geoprocessing and mapping were used to show its residuals. Residuals, although smaller than with the other variables tested, still exist, as the r² value is still far from one. The residuals show where the actual values of the tracts deviate from the model created through the regression process. The geoprocessing used is shown in Figure 7, and the resulting map showing the residuals in standard deviations is shown in Figure 8. This figure shows that there are areas in the north-central and eastern parts of the map with particularly high rates of 911 calls that deviate positively from the predictive model's value. The presence of high residuals on this map shows that the number of people with low education does not, on its own, predict the number of 911 calls well. Where to put a new ER can be determined from Figures 4 and 8: the areas with more calls, and with more calls than can be explained by the equation, should be served by the new ER.

Figure 8

Median Income:

Figure 9

Figure 9 shows the relationship between median income and calls. The r² value is small here, so median income does not explain the number of 911 calls well; however, it is a factor. The significance level of 0.000 shows that the relationship is significant, allowing the null hypothesis to be rejected in favor of the alternative, that there is a relationship here. For a one-unit increase in median income, calls fall by 0.404.

Multiple Regression Methods:

Multiple regression was run by going into SPSS with the same data, then clicking Analyze, then Linear Regression, and making the number of 911 calls the dependent variable again. Then, instead of just adding one independent variable, all independent variables were added: Jobs, Renters, LowEduc (Number of people with no HS Degree), AlcoholX (alcohol sales), Unemployed, ForgnBorn (Foreign Born Pop), Med Income, and CollGrads (Number of College Grads). The multicollinearity diagnostic tool was then turned on. The results and discussion of the results can be seen below.

After the multiple regression was run with all of the variables selected, a stepwise approach was run. The results of this are shown below.

Multiple Regression Results and Discussion:

Figure 10

Figure 10 shows all of the aforementioned variables when a multiple regression is run with all of them entered. It also shows the r², a value that is moderately high, reflecting the number of variables that were used.

Figure 11

Figure 11 shows the same multiple regression with all of the variables used. The Beta coefficients show that low education is the most influential variable in the equation, and that the least influential variable is unemployed people. The collinearity diagnostics show no sign of collinearity, because no condition index is larger than 30. If one above 30 had been found, the Variance Proportions closer to 1 would indicate where the collinearity was. That variable would then be eliminated, and multiple regression would be run again. Multicollinearity is undesirable because it can make variables that are significant seem insignificant (smaller Beta values) and makes the resulting equation less precise.
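
For readers who want to see where condition indices come from, the following is a rough sketch of one common way to compute them: scale each column of the design matrix (including the constant) to unit length, take the singular values, and divide the largest by each one. It is assumed, not confirmed, that this resembles what SPSS reports. The function name and data are made up, and the threshold of 30 is the one used above.

```python
import numpy as np


def condition_indices(X):
    """Condition indices of a design matrix with a constant column added."""
    X = np.asarray(X, dtype=float)
    design = np.column_stack([np.ones(len(X)), X])
    # Scale every column to unit length so units do not drive the result
    scaled = design / np.linalg.norm(design, axis=0)
    singular_values = np.linalg.svd(scaled, compute_uv=False)
    return singular_values.max() / singular_values


# Made-up example: the second column is almost a copy of the first
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = x1 + np.array([0.001, -0.001, 0.002, -0.002, 0.001, -0.001])
indices = condition_indices(np.column_stack([x1, x2]))
print(indices.max() > 30)  # a very large index flags the near-duplicate columns
```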

Figure 12

Figure 13

Figure 14

The above figures show the result of the stepwise multiple regression. The stepwise approach, as expected, chose the three variables with the highest Beta values to build an equation with: Renters, Low Education, and Jobs. With these variables building the equation, the r² value is 0.771, higher than that of every single-variable regression model, and also than that of all of the variables together. The equation that can be built with the three variables is: 911 calls = 0.024*Renters + 0.103*LowEduc + 0.004*Jobs. In this regression equation, Low Education is the most important variable with a Beta value of 0.464, then Jobs with 0.343, then Renters with 0.282. All variables are statistically significant, the largest significance value being 0.002.
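
The Beta values differ from the coefficients in the equation because they are standardized. A minimal sketch of the usual conversion follows, assuming Beta is the unstandardized coefficient multiplied by the ratio of the standard deviations of the independent and dependent variables. The function name and data are hypothetical.

```python
import statistics


def standardized_beta(b, x, y):
    """Convert an unstandardized coefficient b into a standardized Beta."""
    # Rescale by the spread of x relative to the spread of y, so that
    # coefficients of variables in different units can be compared.
    return b * statistics.stdev(x) / statistics.stdev(y)


# Made-up data where y = 2x exactly, so the standardized Beta is close to 1
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]
print(round(standardized_beta(2.0, x, y), 3))
```

This is why a variable with a small coefficient, such as Jobs, can still have a fairly large Beta: its values vary over a much wider range.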

Finally, with this equation, a residual map can be made similar to the one made above, again using the Ordinary Least Squares tool, this time with multiple independent (explanatory) variables. The resulting map is shown in Figure 15. The pattern of residuals has changed here, as the new equation now describes different areas better. The area with the highest standard deviations above what the equation would predict would be best served by a new ER.

Figure 15

Conclusion:

This assignment helped to build familiarity and practice with the concepts of regression, multiple regression, multicollinearity, and residuals, and to apply them in spatial analysis.

Wednesday, May 10, 2017
