
Data Analyst Skills Job Search
Let's analyze the desired skills obtained from a Google search and see where you stand. Using Tableau, filter by company or job board and get an idea of what the your salary would be for your skill set.
Learn moreSQL - Data Cleansing and Exploratory Analysis
Tableau - Interactive Dashboard
I’ve spent the bulk of my career within the healthcare industry. And when COVID hit the United States like a firestorm, I witnessed firsthand the impact it had on the hospital system and patient care. Emergency rooms and clinics struggled to match the demand for treatment. Supply chain departments ordered all they could from any available vendor to keep up with the influx of new patients.
And no matter where you lived, there’s a good chance that COVID affected you or someone you know over the past several years.
So out of professional curiosity, I got my hands on a dataset collected by the US Department of Health and Human Services (via healthdata.gov). This dataset is updated regularly and contains almost 77,000 rows and 135 fields of COVID related information sent by local hospitals across the country.
With metrics such as patient age, geographic location, staffing shortages, and reported deaths (to name a few!), there is plenty to dig deep with this time-series data dating back to January 1, 2020.
I’m going in with an open mind! My only real goal is to explore the files, see what stands out, and report back my findings to you.
Let’s go!
To start, I should mention that the initial data was retrieved as a single file. But I didn’t want to make this too easy; so, I split the CSV into 4 separate tables and gave each row a primary key, or row ID. I mean, where’s the fun if you can’t query a few JOINs now and again?
One of the split files contains geographic fields and date of occurrence. The next file grouped together any column where hospital staffing was reported. A third file holds all fields made up of COVID-specific values (suspected and confirmed cases, deaths, age reported, and more). The final file includes any reference of bed metrics reported. This includes inpatient vs ICU beds (available and in use) between adults and children.
With that said, I create the database and load my four new tables to begin cleaning my data. And as it often happens, the first thing I spot are some of my fields have been imported with invalid syntax.
With a few SQL statements, I prep my tables for exploration.
My tables are clean and now it’s time to get down to business!
To me, a clear starting point is to take an overview look at the total confirmed or suspected cases of COVID-19, number of deaths reported, and the overall mortality rate.
SELECT SUM(inpatient_beds_used_covid) AS covid_cases,
SUM(deaths_covid) AS covid_deaths,
ROUND((SUM(deaths_covid)/SUM(inpatient_beds_used_covid) * 100),2) AS fatal_percentage
FROM covid_influenza_data;
covid_cases | covid_deaths | fatal_percentage |
---|---|---|
60,887,178 | 737,476 | 1.2 % |
Obviously, the reported cases and deaths is striking, but the percentage doesn’t represent the numbers truthfully. If I break the same metrics by state and year, I think we’ll get a different perspective.
SELECT state, YEAR(record_date) as year,
SUM(inpatient_beds_used_covid) AS covid_cases,
SUM(deaths_covid) AS covid_deaths,
ROUND((SUM(deaths_covid)/SUM(inpatient_beds_used_covid) * 100),2) AS fatal_percentage
FROM covid_influenza_data cid
JOIN state_data sd
ON cid.record_id = sd.record_id
GROUP BY state, year
ORDER BY fatal_percentage DESC;
With this particular view, there are clear jumps in the percentage of fatalities by state for an individual year. Specifically, Oregon ranks the highest in deaths during 2020 with 15.4%. Being familiar with the spread of COVID, this isn’t too surprising given that COVID first took hold in the US on the along the northwestern coast.
Despite the origin of COVID-19 in the United States, I run a similar analysis to look at the top five states impacted over the entire time-series. According to the output, the highest reported death toll were located in Michigan, District of Columbia, Montana, Massachusetts, and Alabama.
state | covid_cases_reported | covid_deaths_reported | fatal_percentage |
---|---|---|---|
MI | 6,508,599 | 197,121 | 38.76 % |
DC | 40,916 | 15,074 | 36.84 % |
MT | 203,978 | 72,441 | 35.51 % |
MA | 285,997 | 101,473 | 35.48 % |
AL | 365,590 | 129,322 | 35.37 % |
The state data is interesting for sure. But now I want to see the numbers for confirmed cases between each age group. After setting up the aggregations and ALIAS for each age bracket, I find that the totals in adults increases with age, with a noticeable jump after the age of 50.
For pediatric patients, it shouldn’t come as a shock that the effect was the opposite. Newborns and up to the age of 4, experienced the most impact due to COVID. As the ages grew, the numbers began to drop dramatically.
COVID cases and mortality rates are noteworthy but that’s just one facet of the database. The rest of my analysis and fact-finding mission is spent exploring the other tables containing bed and staff data. With queries focused on bed availability by QUARTER, staff shortages by WEEK, and other utilization statistics through various aggregations and JOINs, I’ve been able to identify relevant and unexpected trends.
Unexpected, you ask? Well, for instance, Texas and California both lead the ranks in daily staff shortages and weekly anticipated staff shortages for COVID care. Given the size of these states, it’s not hard to believe this being the case.
I won’t drone on about every significant statistic I found. There’s so much data available, we could spend hours brainstorming and answering questions from several vantage points. But if you’re the curious type, like myself, I’d recommend checking out my github, where you can see the queries I used and even download the datasets yourself.
However, since the various utilizations of reported cases and deaths (by age, state, and year) were the highlights of my deep dive, I created a simple Tableau dashboard for you to explore my findings.
Utilizing a dual axis time-series chart, color-coded bar charts and map, with a filter on the year, I’ve made the dashboard available to view most of the KPI’s I’ve already discussed.
Like I said, if you've got that itch to explore the data yourself, or want to know how I did it, check it out!
My GITHUBLet's analyze the desired skills obtained from a Google search and see where you stand. Using Tableau, filter by company or job board and get an idea of what the your salary would be for your skill set.
Learn moreUsing a combination of SQL and Tableau, I take a dive into supermarket sales and inventory data over a four year period. Let's see who has risen to the top!
Learn more