Monday, June 1, 2009

You're the worst thing that's ever happened to me.

Hoping that our readers might enjoy a sneak preview of what we were working on, we decided to post our preliminary results on the Chrysler dealer data we crunched all weekend.

It will be a cold day in hell before we do that again, you can be sure.

The reaction was complicated by my typing "Why would there be an significant and highly positive correlation between dealer survival and Clinton donors?"

I didn't mean "statistically significant," but that was all it took. A number of blogs fixated on this sentence and ignored the next two, "Granted, that P-Value (0.125) isn't enough to reject the null hypothesis at 95% confidence intervals (our null hypothesis being that the effect is due to random chance), but a 12.5% chance of a Type I error in rejecting a null hypothesis (false rejection of a true hypothesis) is at least eyebrow raising. Most statisticians would not call this a "find" as 95% confidence intervals are the gold standard for this sort of work." You can imagine the result.

As a measure of how partisan this issue is, both sides of the aisle had it out over sentence structure all day. We were severely chastised (in my case rightly so), and matters were further complicated by my posting, without explanation, the results of six regressions absent any narrative about our experimental design, our order of testing and the like. This subjected us to "data fishing" criticisms (probably warranted to some small degree, though some readers went so far as to accuse us of outright fraud). What began as an attempt to give our readers some transparency devolved into a mess of sound bites. The fault is entirely mine for expecting civil peer review or anything like it in such a forum, a mistake I shall not repeat with early findings.

When we pointed out that we found no support for a Republican "enemies list," the response was that we had shifted our theory to match the new data. (This is odd, since we never thought there was an enemies list.) This should have been SOME news at least: the present data didn't seem to support a theory that's been all over the MSM. Somehow this little bit got lost, even on the people you would have expected to pick it up. Talk about focus!

The slew of email I received ranged from "thanks" to "you are the spawn of the devil." The latter is obviously closest to the truth, so we are sending that reader the Marla Singer Stolen Jeans prize. His name was Robert Paulson.

We planned to release the data publicly today, but we are having second thoughts about that particular plan. Nothing makes a researcher feel less like sharing than being told what a moron they are after 48 hours of straight dataset preparation.

We have come up with a few ideas instead.

1. We are considering creating a members-only section for research and data like this. This would be the place we would release preliminary findings and datasets we have put a great deal of (uncompensated) (wo)manhours into. We may or may not then release final results to the public. Or we may after a substantial delay.

2. We may send datasets like this one to qualified and interested parties who request them by email. (This is where we are leaning for this particular set at the moment.)

This means that everyone who isn't a certified space monkey is going to have to wait for the integration of the GM dealer closing data.

As our dataset is a derivative of the CRP data, we will be distributing it (to the extent that we do) under the Creative Commons Attribution-Noncommercial-Share Alike license (though we may ask that you respect the embargo time before we release it publicly).

I'll decide at the end of this note.

About the Data - Dataset creation:

Individual donor records from the Center for Responsive Politics for the 2008 election cycle (~800 megs, n=3542585) were converted to a .csv file. (Please consider donating to the Center here.) Chrysler dealer records were compiled from dealer closing and dealer survival records (n=3129 after the elimination of mangled or unusable entries). These were obtained from bankruptcy documents and converted to Excel. Both datasets were imported into an SQL database (after abortive attempts to run matching with less effective but more amusing methods; at this size Excel and Access were useless, so we had to rely on custom SQL queries for the initial extraction).

The "majority owner" field in the dealer dataset was matched against the donor name field in the CRP dataset on full last name and the first 2 characters of the first name. (Single-initial first names were also matched.) Records that matched were merged into a new file, producing a new data subset (n=~10000). This subset was then filtered by the first 2 characters of the zip code. The resulting initial match dataset yielded n=~6200 records, including all political donations made by potential name matches and preserving multiple donations from one majority owner name. In the case where a single majority owner owned several dealerships, records for each dealership/donation pair were created. In this way we could link dealership fates with their owners' political acts.
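The matching rule just described can be sketched as a single SQL join. This is only an illustration, not the queries we actually ran: the table names, column names and sample rows are all made up, the name match and zip filter are folded into one step instead of two, and the handling of single-initial first names (compare on the first character only) is one plausible reading of the rule.

```python
import sqlite3

# Hypothetical schema and toy rows; the real tables are not shown in this post.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dealers (dealer_id INTEGER, owner_last TEXT, owner_first TEXT, zip TEXT);
    CREATE TABLE donors  (donor_last TEXT, donor_first TEXT, zip TEXT, amount INTEGER);
""")
conn.executemany("INSERT INTO dealers VALUES (?, ?, ?, ?)", [
    (1, "SMITH", "JOHN", "90210"),
    (2, "SMITH", "JOHN", "10001"),
    (3, "DOE", "R", "60601"),
])
conn.executemany("INSERT INTO donors VALUES (?, ?, ?, ?)", [
    ("SMITH", "JONATHAN", "90401", 500),
    ("SMITH", "JANE", "90210", 250),
    ("DOE", "ROBERT", "60614", 1000),
    ("DOE", "SAM", "60601", 100),
])

# Full last name, first 2 characters of first name (or the first character
# when either side is a single initial), and first 2 characters of zip code.
MATCH_SQL = """
    SELECT d.dealer_id, c.donor_first, c.amount
    FROM dealers AS d
    JOIN donors AS c
      ON d.owner_last = c.donor_last
     AND substr(d.zip, 1, 2) = substr(c.zip, 1, 2)
     AND (
           substr(d.owner_first, 1, 2) = substr(c.donor_first, 1, 2)
           OR ((length(d.owner_first) = 1 OR length(c.donor_first) = 1)
               AND substr(d.owner_first, 1, 1) = substr(c.donor_first, 1, 1))
         )
    ORDER BY d.dealer_id, c.amount
"""
candidates = conn.execute(MATCH_SQL).fetchall()
print(candidates)
```

A rule this loose over-matches on purpose (JOHN pairs with JONATHAN), which is why the candidate list then needs a hand edit.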

As the CRP data includes self-identified profession and employer data, we edited the automated match list by hand, first looking for a match between the CRP employer or profession and the name of the dealership in the dealership dataset. We also cross-checked full name entries: in cases where small details like middle initial disagreed but the two datasets agreed on employer as a dealership, we preserved the record. If no clear match was found with employer / profession, we then looked to other connections, for instance common CRP donor codes with other records where the donor listed a known dealer as employer / profession. Some professions (self-employed, business owner, etc.) automatically caused us to look for other evidence, which often led us to find donors we might otherwise have passed by.

We had finished the first pass of this edit when we ran our first regressions and published the preliminary results.

We've since made a second pass and caught some more bad matches.

The newest resulting hand-edited dataset (n=5117) includes a single entry for each dealership with no matched political donations and at least one entry for each dealership where majority owners donated in a fashion tracked by CRP individual donor data.

Each individual dealership was assigned a numeric ID code. After filtering for double entries and subject to our removal of mangled or unreadable records, the dataset contained 2923 individual dealer codes. This is somewhat less than the ~3200 total dealers reported as current by Chrysler.

Using a pivot table keyed on individual dealer ID code we tabulated several variables from the CRP/Dealer data including:

Source/Field: Description

Dealer/Majority Owner (full name)
CRP/FullName
CRP/Profession
CRP/Employer
Dealer/CompanyName
Dealer/CompanyAddress
Dealer/Zip Code
Dealer/DealerTerminationStatus: (1=terminated)
Created/ID: Dealer Unique ID
CRP/DonorCode: Individual Donor Code
CRP/Recipient: CRP Unique Recipient ID
CRP/DateDonation
CRP/AmountDonation
CRP/Party - Codes:

First Character:
D=Democratic
R=Republican
3=3rd Party
U=Unknown
P=PAC

Second Character:

Party:
W=Winner
L=Loser
I=Incumbent
C=Challenger
O=Open Seat
N=Non-incumbent

PAC:
B=Business
L=Labor
I=Ideological
O=Other
U=Unknown

We further teased these party codes into boolean categoricals (1/0) for each party / PAC code.

We then added calculated fields for:

Clinton (0 or 1 where Recipient ID = N00000019)
Obama (0 or 1 where Recipient ID = N00009638)
McCain (0 or 1 where Recipient ID = N00006424)

D (Any democratic donation)
R (Any republican donation)

None: No donation match found
Some: Any donation match found
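For anyone following along at home, the 0/1 fields above can be derived from a single donation row roughly as follows. This is a sketch, not our actual spreadsheet formulas: the function name and flag names are invented for illustration, and only the recipient IDs and first-character party codes come from the lists above.

```python
# Recipient IDs as listed above.
CANDIDATE_IDS = {
    "Clinton": "N00000019",
    "Obama": "N00009638",
    "McCain": "N00006424",
}

def donation_flags(party_code, recipient_id):
    """Return 0/1 indicators for one donation row (hypothetical helper)."""
    first = party_code[:1].upper()  # first character carries party / PAC
    flags = {
        "D": int(first == "D"),
        "R": int(first == "R"),
        "ThirdParty": int(first == "3"),
        "PAC": int(first == "P"),
        "Unknown": int(first == "U"),
    }
    for name, rid in CANDIDATE_IDS.items():
        flags[name] = int(recipient_id == rid)
    return flags

# Example: a donation to a winning Democrat whose recipient ID is Obama's.
print(donation_flags("DW", "N00009638"))
```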

Using pivot tables in Excel sorted by dealer code, we then aggregated all donation data by Dealer ID. In this way we captured the entire donation profile of a given Majority Owner and assigned it to each of his/her dealers. We used this dataset in our regression analysis.
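The pivot step amounts to "a dealer gets a 1 for a flag if any of its donation rows has a 1." A rough equivalent outside Excel, with hypothetical row and column names, and with Some/None simplified to depend only on the flags passed in:

```python
def aggregate_by_dealer(rows, flag_names):
    """Collapse donation rows into one 0/1 profile per dealer ID."""
    profiles = {}
    for row in rows:
        profile = profiles.setdefault(
            row["dealer_id"], {name: 0 for name in flag_names}
        )
        for name in flag_names:
            # A flag is 1 for the dealer if any of its rows has it set.
            profile[name] = max(profile[name], row.get(name, 0))
    for profile in profiles.values():
        # Simplification: "Some" here means any of the listed flags is set.
        profile["Some"] = int(any(profile[name] for name in flag_names))
        profile["None"] = 1 - profile["Some"]
    return profiles

# Toy rows: dealer 101 has two matched donations, dealer 102 has none.
rows = [
    {"dealer_id": 101, "D": 1, "R": 0, "PAC": 0},
    {"dealer_id": 101, "D": 0, "R": 1, "PAC": 0},
    {"dealer_id": 102},
]
profiles = aggregate_by_dealer(rows, ["D", "R", "PAC"])
print(profiles)
```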

Update: 3:30 am:

Some limitations to the data:

1. We are only counting majority owners in our donor matches. It's entirely possible that other influential donors (minority owners, former owners, or people otherwise connected in ways we cannot see) are not being counted. Perhaps a son or relative has taken over the business but "Dad" or "Grandpa" still holds a majority stake. We caught several "retired" donors who were obviously still owners of dealers. We might be undercounting for this reason.

2. While we have PAC data, it is not sorted by party (for obvious reasons). To the extent PAC money is flowing someplace that might make a difference we cannot see it. (It would be fun to look at PAC (Labor) donations v. saved dealers, no?)

3. We don't have any data on dealer profitability. This is a pretty important factor we'd like to account for. Since this is generally claimed or understood to be a major factor, we might get stronger results out of the data (or weaker results for that matter) if we had the ability to add it as a variable.

4. There are certainly fund raising activities (speaking engagements, benefit dinners, etc) that we are not capturing with individual donor data. Certainly this makes our data weaker.

5. We haven't added older (2006, 2004) data yet. We have no idea if it will make a difference, but it might.

6. We don't have a good way of capturing organizational giving at this point. ABC Chrysler, Golden Hills may have donated gobs to Republicans and we'd never know about it because we are stuck with individual donor data.

Some general interesting facts:

Dealers: 2923

Dealers terminated: 702
Dealers saved: 2221
Dealers with Majority Owners who donated to republicans: 429
Dealers with Majority Owners who donated to republicans also terminated: 100
Dealers with Majority Owners who donated to democrats: 174
Dealers with Majority Owners who donated to democrats also terminated: 43
Dealers with Majority Owners who donated to both: 80
Dealers with Majority Owners who donated to both also terminated: 19
Dealers with Majority Owners who donated to no one (not even PACs): 2111
Dealers with Majority Owners who donated to no one (not even PACs) also terminated: 517

etc. etc. etc.
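The counts above are easier to compare as termination rates. A quick sketch (the group labels are ours; the numbers are copied straight from the list):

```python
# (dealers in group, of which terminated), copied from the counts above.
groups = {
    "All dealers": (2923, 702),
    "Republican donors": (429, 100),
    "Democratic donors": (174, 43),
    "Donated to both": (80, 19),
    "No donations": (2111, 517),
}

rates = {name: terminated / total for name, (total, terminated) in groups.items()}

for name, rate in rates.items():
    print(f"{name:<20} {rate:6.1%}")
```

Every group lands between roughly 23% and 25% terminated, which is the raw picture the regressions below are testing.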

Initially we were interested in testing the hypothesis that donating to Obama in the 2008 election cycle might result in higher than average survival rates among dealers.

The data wouldn't come close to rejecting the null hypothesis there. (p-value ~0.7)

Here's that regression with the final version data:

Binary Logistic Regression: safe versus Obama 

Step Log-Likelihood
0 -1611.36
1 -1611.34
2 -1611.34
3 -1611.34


Link Function: Logit


Response Information

Variable  Value  Count
safe      1       2221  (Event)
          0        702
          Total   2923


Logistic Regression Table

                                                   Odds    95% CI
Predictor         Coef    SE Coef       Z      P  Ratio  Lower  Upper
Constant       1.15086  0.0434894   26.46  0.000
Obama
1             0.101900   0.464948    0.22  0.827   1.11   0.45   2.75


Log-Likelihood = -1611.335
Test that all slopes are zero: G = 0.049, DF = 1, P-Value = 0.825

* NOTE * No goodness of fit test performed.
* NOTE * The model uses all degrees of freedom.


Measures of Association:
(Between the Response Variable and Predicted Probabilities)

Pairs        Number  Percent  Summary Measures
Concordant    14616      0.9  Somers' D              0.00
Discordant    13200      0.8  Goodman-Kruskal Gamma  0.05
Ties        1531326     98.2  Kendall's Tau-a        0.00
Total       1559142    100.0

* NOTE * 1 time(s) the standardized Pearson residuals, delta chi-square, delta
deviance, delta beta (standardized) and delta beta could not be
computed because leverage (Hi) is equal to 1.


We ran Clinton next. Here are the new results (note a drop in p-value).

Binary Logistic Regression: safe versus Clinton 

Step Log-Likelihood
0 -1611.36
1 -1610.29
2 -1610.26
3 -1610.26
4 -1610.26


Link Function: Logit


Response Information

Variable  Value  Count
safe      1       2221  (Event)
          0        702
          Total   2923


Logistic Regression Table

                                                   Odds    95% CI
Predictor         Coef    SE Coef       Z      P  Ratio  Lower  Upper
Constant       1.14409  0.0435562   26.27  0.000
Clinton
1             0.573566   0.412789    1.39  0.165   1.77   0.79   3.99


Log-Likelihood = -1610.265
Test that all slopes are zero: G = 2.190, DF = 1, P-Value = 0.139

* NOTE * No goodness of fit test performed.
* NOTE * The model uses all degrees of freedom.


Measures of Association:
(Between the Response Variable and Predicted Probabilities)

Pairs        Number  Percent  Summary Measures
Concordant    27105      1.7  Somers' D              0.01
Discordant    15274      1.0  Goodman-Kruskal Gamma  0.28
Ties        1516763     97.3  Kendall's Tau-a        0.00
Total       1559142    100.0

* NOTE * 1 time(s) the standardized Pearson residuals, delta chi-square, delta
deviance, delta beta (standardized) and delta beta could not be
computed because leverage (Hi) is equal to 1.


We still think there is enough to be curious about here, but clearly our model is insufficient to understand what's going on in any statistically significant way. The reader will have to answer for themselves whether such a p-value is eyebrow raising or not. (Perhaps someone will conduct a Bayesian analysis for this p-value and the Maureen White connection for us.)

And for the "Anti-Republican" conspiracy buffs:


Binary Logistic Regression: term versus R 

Step Log-Likelihood
0 -1611.36
1 -1611.29
2 -1611.29
3 -1611.29


Link Function: Logit


Response Information

Variable  Value  Count
term      1        702  (Event)
          0       2221
          Total   2923


Logistic Regression Table

                                                   Odds    95% CI
Predictor         Coef    SE Coef       Z      P  Ratio  Lower  Upper
Constant      -1.14513  0.0467939  -24.47  0.000
R
1           -0.0457553   0.123407   -0.37  0.711   0.96   0.75   1.22


Log-Likelihood = -1611.291
Test that all slopes are zero: G = 0.138, DF = 1, P-Value = 0.710

* NOTE * No goodness of fit test performed.
* NOTE * The model uses all degrees of freedom.


Measures of Association:
(Between the Response Variable and Predicted Probabilities)

Pairs        Number  Percent  Summary Measures
Concordant   198058     12.7  Somers' D              0.01
Discordant   189200     12.1  Goodman-Kruskal Gamma  0.02
Ties        1171884     75.2  Kendall's Tau-a        0.00
Total       1559142    100.0
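For readers who want to check the "term versus R" table without Minitab: with a single 0/1 predictor, the logistic fit can be computed directly from the 2x2 table of counts. The sketch below assumes the R indicator corresponds to the 429 dealers counted earlier as having Republican-donating majority owners (100 of them terminated), out of 2923 dealers and 702 terminations overall. The function name is made up, and the interval uses the usual 1.96 normal approximation.

```python
import math

def logit_from_2x2(events_exposed, n_exposed, events_unexposed, n_unexposed):
    """Closed-form logistic fit for one binary predictor (hypothetical helper)."""
    a = events_exposed                     # terminated, R = 1
    b = n_exposed - events_exposed         # saved, R = 1
    c = events_unexposed                   # terminated, R = 0
    d = n_unexposed - events_unexposed     # saved, R = 0
    constant = math.log(c / d)             # log odds of termination when R = 0
    coef = math.log(a / b) - constant      # log odds ratio for R = 1 vs R = 0
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # Wald standard error
    return {
        "constant": constant,
        "coef": coef,
        "se": se,
        "z": coef / se,
        "odds_ratio": math.exp(coef),
        "ci_low": math.exp(coef - 1.96 * se),
        "ci_high": math.exp(coef + 1.96 * se),
    }

# Counts taken from the "general interesting facts" list in this post.
fit = logit_from_2x2(100, 429, 702 - 100, 2923 - 429)
for key, value in fit.items():
    print(f"{key:<11} {value: .6f}")
```

If the assumption about the R indicator holds, this should land on the Constant, Coef and SE Coef values in the table above to the printed precision, with an odds ratio just under 1 and a confidence interval that comfortably straddles 1.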



I don't plan to be so silly as to offer more commentary this time, and I'm only putting up the new regressions to quell the flames.

Alright, I've decided.

If you'd like the raw, merged dataset, drop me an email (marla @ zerohedge d o t com) with your stats qualifications and the like and we'll work something out. You'll probably get at least a week with the stuff on your own before we open the floodgates.

If you'd just like us to run some testing against what we have, just let us know and we'll see what we can do for you.

Likewise, if you think we've blown something, why not write us a nice note telling us how it should be done to meet with your satisfaction? We're happy to give it a shot, if you're polite about the whole thing.

Thanks also to ep and jb!