Data With Dorrie

A gentle introduction to interactive data visualization

Shri Khalpada

Data

There are a few things we should always look at ahead of working with the data.

The Data Structure

Before doing any meaningful analysis or visualization, it's important to make sure that the data and its structure make sense. If the dataset is publicly available and widely used, there's a good chance you can find documentation on how it's structured. Otherwise, like in this case, we can turn to the data itself to get a picture of what it can tell us.

Here's the data again, for reference:

Time | Shri | Dorrie
04/02/2023 12:00 AM | Sleep | Sleep
04/02/2023 1:00 AM | Sleep | Sleep
04/02/2023 2:00 AM | Sleep | Sleep
04/02/2023 3:00 AM | Sleep | Sleep
04/02/2023 4:00 AM | Sleep | Lounge
04/02/2023 5:00 AM | Sleep | Sleep
04/02/2023 6:00 AM | Sleep | Sleep
04/02/2023 7:00 AM | Breakfast | Breakfast
04/02/2023 8:00 AM | Walk | Lounge
04/02/2023 9:00 AM | Work | Lounge
04/02/2023 10:00 AM | Work | Sleep
04/02/2023 11:00 AM | Work | Lounge
04/02/2023 12:00 PM | Work | Snack
04/02/2023 1:00 PM | Lunch | Sleep
04/02/2023 2:00 PM | Work | Sleep
04/02/2023 3:00 PM | Work | Solo Play
04/02/2023 4:00 PM | Work | Sleep
04/02/2023 5:00 PM | Video Games | Sleep
04/02/2023 6:00 PM | Cook Dinner | Lounge
04/02/2023 7:00 PM | Relax | Dinner
04/02/2023 8:00 PM | TV | Lounge
04/02/2023 9:00 PM | Play Together | Play Together
04/02/2023 10:00 PM | Guitar | Zoom
04/02/2023 11:00 PM | Sleep | Sleep
04/03/2023 12:00 AM | Sleep | Sleep
04/03/2023 1:00 AM | Sleep | Sleep
04/03/2023 2:00 AM | Sleep | Lounge
04/03/2023 3:00 AM | Sleep | Sleep
04/03/2023 4:00 AM | Sleep | Lounge
04/03/2023 5:00 AM | Sleep | Sleep
04/03/2023 6:00 AM | Sleep | Sleep
04/03/2023 7:00 AM | Breakfast | Breakfast
04/03/2023 8:00 AM | Walk | Sleep
04/03/2023 9:00 AM | Work | Sleep
04/03/2023 10:00 AM | Work | Lounge
04/03/2023 11:00 AM | Appointment | Solo Play
04/03/2023 12:00 PM | Work | Snack
04/03/2023 1:00 PM | Work | Sleep
04/03/2023 2:00 PM | Lunch | Lounge
04/03/2023 3:00 PM | Work | Sleep
04/03/2023 4:00 PM | Work | Sleep
04/03/2023 5:00 PM | Work | Sleep
04/03/2023 6:00 PM | Exercise | Lounge
04/03/2023 7:00 PM | TV | Dinner
04/03/2023 8:00 PM | Side Projects | Lounge
04/03/2023 9:00 PM | Play Together | Play Together
04/03/2023 10:00 PM | Side Projects | Sleep
04/03/2023 11:00 PM | Read | Zoom
04/04/2023 12:00 AM | Sleep | Sleep
04/04/2023 1:00 AM | Sleep | Sleep
04/04/2023 2:00 AM | Sleep | Sleep
04/04/2023 3:00 AM | Sleep | Sleep
04/04/2023 4:00 AM | Sleep | Sleep
04/04/2023 5:00 AM | Sleep | Sleep
04/04/2023 6:00 AM | Sleep | Lounge
04/04/2023 7:00 AM | Sleep | Breakfast
04/04/2023 8:00 AM | Breakfast | Lounge
04/04/2023 9:00 AM | Walk | Sleep
04/04/2023 10:00 AM | Errands | Solo Play
04/04/2023 11:00 AM | Work | Lounge
04/04/2023 12:00 PM | Work | Snack
04/04/2023 1:00 PM | Work | Sleep
04/04/2023 2:00 PM | Work | Sleep
04/04/2023 3:00 PM | Work | Sleep
04/04/2023 4:00 PM | Work | Sleep
04/04/2023 5:00 PM | Work | Sleep
04/04/2023 6:00 PM | Dinner with Friends | Lounge
04/04/2023 7:00 PM | Dinner with Friends | Dinner
04/04/2023 8:00 PM | Play Together | Play Together
04/04/2023 9:00 PM | Side Projects | Lounge
04/04/2023 10:00 PM | Side Projects | Solo Play
04/04/2023 11:00 PM | TV | Sleep
04/05/2023 12:00 AM | TV | Sleep
04/05/2023 1:00 AM | Sleep | Sleep
04/05/2023 2:00 AM | Sleep | Sleep
04/05/2023 3:00 AM | Sleep | Sleep
04/05/2023 4:00 AM | Sleep | Lounge
04/05/2023 5:00 AM | Sleep | Sleep
04/05/2023 6:00 AM | Sleep | Sleep
04/05/2023 7:00 AM | Sleep | Breakfast
04/05/2023 8:00 AM | Sleep | Sleep
04/05/2023 9:00 AM | Walk | Lounge
04/05/2023 10:00 AM | Work | Sleep
04/05/2023 11:00 AM | Work | Lounge
04/05/2023 12:00 PM | Work | Snack
04/05/2023 1:00 PM | Lunch | Sleep
04/05/2023 2:00 PM | Work | Sleep
04/05/2023 3:00 PM | Work | Sleep
04/05/2023 4:00 PM | Work | Sleep
04/05/2023 5:00 PM | Work | Sleep
04/05/2023 6:00 PM | Exercise | Lounge
04/05/2023 7:00 PM | Cook Dinner | Dinner
04/05/2023 8:00 PM | Video Games | Solo Play
04/05/2023 9:00 PM | Play Together | Play Together
04/05/2023 10:00 PM | TV | Sleep
04/05/2023 11:00 PM | Sleep | Sleep
04/06/2023 12:00 AM | Sleep | Sleep
04/06/2023 1:00 AM | Sleep | Zoom
04/06/2023 2:00 AM | Sleep | Sleep
04/06/2023 3:00 AM | Sleep | Sleep
04/06/2023 4:00 AM | Sleep | Sleep
04/06/2023 5:00 AM | Sleep | Sleep
04/06/2023 6:00 AM | Sleep | Lounge
04/06/2023 7:00 AM | Breakfast | Breakfast
04/06/2023 8:00 AM | Walk | Sleep
04/06/2023 9:00 AM | Work | Sleep
04/06/2023 10:00 AM | Work | Solo Play
04/06/2023 11:00 AM | Work | Lounge
04/06/2023 12:00 PM | Lunch with Friends | Snack
04/06/2023 1:00 PM | Work | Sleep
04/06/2023 2:00 PM | Work | Solo Play
04/06/2023 3:00 PM | Work | Sleep
04/06/2023 4:00 PM | Work | Sleep
04/06/2023 5:00 PM | Work | Sleep
04/06/2023 6:00 PM | Side Projects | Lounge
04/06/2023 7:00 PM | Side Projects | Dinner
04/06/2023 8:00 PM | TV | Lounge
04/06/2023 9:00 PM | Play Together | Play Together
04/06/2023 10:00 PM | Guitar | Lounge
04/06/2023 11:00 PM | Read | Sleep
04/07/2023 12:00 AM | Sleep | Sleep
04/07/2023 1:00 AM | Sleep | Sleep
04/07/2023 2:00 AM | Sleep | Sleep
04/07/2023 3:00 AM | Sleep | Sleep
04/07/2023 4:00 AM | Sleep | Sleep
04/07/2023 5:00 AM | Sleep | Sleep
04/07/2023 6:00 AM | Sleep | Sleep
04/07/2023 7:00 AM | Sleep | Breakfast
04/07/2023 8:00 AM | Walk | Sleep
04/07/2023 9:00 AM | Work | Zoom
04/07/2023 10:00 AM | Work | Sleep
04/07/2023 11:00 AM | Work | Sleep
04/07/2023 12:00 PM | Lunch | Snack
04/07/2023 1:00 PM | Work | Sleep
04/07/2023 2:00 PM | Work | Lounge
04/07/2023 3:00 PM | Work | Sleep
04/07/2023 4:00 PM | Video Games | Lounge
04/07/2023 5:00 PM | Video Games | Sleep
04/07/2023 6:00 PM | Play Together | Play Together
04/07/2023 7:00 PM | Basketball with Friends | Dinner
04/07/2023 8:00 PM | Basketball with Friends | Lounge
04/07/2023 9:00 PM | Basketball with Friends | Lounge
04/07/2023 10:00 PM | Dinner | Solo Play
04/07/2023 11:00 PM | Play Together | Play Together
04/08/2023 12:00 AM | Side Projects | Sleep
04/08/2023 1:00 AM | Side Projects | Sleep
04/08/2023 2:00 AM | Sleep | Sleep
04/08/2023 3:00 AM | Sleep | Zoom
04/08/2023 4:00 AM | Sleep | Sleep
04/08/2023 5:00 AM | Sleep | Sleep
04/08/2023 6:00 AM | Sleep | Sleep
04/08/2023 7:00 AM | Sleep | Breakfast
04/08/2023 8:00 AM | Sleep | Sleep
04/08/2023 9:00 AM | Sleep | Sleep
04/08/2023 10:00 AM | Side Projects | Sleep
04/08/2023 11:00 AM | Side Projects | Sleep
04/08/2023 12:00 PM | Side Projects | Snack
04/08/2023 1:00 PM | Movie with Friends | Sleep
04/08/2023 2:00 PM | Movie with Friends | Sleep
04/08/2023 3:00 PM | Lunch with Friends | Sleep
04/08/2023 4:00 PM | Lunch with Friends | Sleep
04/08/2023 5:00 PM | Video Games | Lounge
04/08/2023 6:00 PM | Video Games | Sleep
04/08/2023 7:00 PM | Dinner | Dinner
04/08/2023 8:00 PM | Video Games | Solo Play
04/08/2023 9:00 PM | Play Together | Play Together
04/08/2023 10:00 PM | TV | Lounge
04/08/2023 11:00 PM | TV | Lounge

Looking at the data, we can see that there's a row for each hour of the week (giving us 168 rows total), each with 3 columns:

  • Time: When the activity happened, in MM/DD/YYYY HH:MM AM/PM format
  • Shri: The main activity that I did during the hour
  • Dorrie: The main activity that Dorrie did during the hour
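
If you'd rather check this structure in code than by eye, here is a minimal Python sketch. It assumes the table has been exported as a CSV with the same three column headers; the export step, the `SAMPLE_CSV` text, and the `parse_rows` helper are illustrative assumptions rather than part of the original workflow.

```python
import csv
import io
from datetime import datetime

# A few rows from the dataset, written as CSV for illustration.
# (Assumption: the real data would be exported from a spreadsheet like this.)
SAMPLE_CSV = """Time,Shri,Dorrie
04/02/2023 12:00 AM,Sleep,Sleep
04/02/2023 7:00 AM,Breakfast,Breakfast
04/02/2023 3:00 PM,Work,Solo Play
"""

# MM/DD/YYYY, then a 12-hour clock time with AM/PM.
TIME_FORMAT = "%m/%d/%Y %I:%M %p"


def parse_rows(csv_text):
    """Turn CSV text into a list of dicts with a real datetime for each row."""
    rows = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        rows.append({
            "time": datetime.strptime(record["Time"], TIME_FORMAT),
            "shri": record["Shri"],
            "dorrie": record["Dorrie"],
        })
    return rows


rows = parse_rows(SAMPLE_CSV)
for row in rows:
    print(row["time"].isoformat(), row["shri"], row["dorrie"])
```

Parsing the Time column into real datetime values early is a design choice: if any row doesn't match the expected format, `strptime` raises an error right away instead of letting a malformed timestamp slip through.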

Assessing Data Quality

After getting an initial sense of how the data is structured, a good next step can be checking that the data is high quality. For this dataset, that might mean making sure that we have:

  • No Missing Data: There should be one row for each hour, and each hour should have an activity for both Shri and Dorrie.
  • No Duplicated Data: There should not be more than one row for each hour, and each row should have only one activity each for Shri and Dorrie.
  • Consistent Data: Data that represents the same activity should have the exact same label. For example, ideally we can avoid having one label for "Eating" and another for "Eat", since they represent the same thing.
  • Clarity: If a data point looks strange or like an outlier, we should investigate it. For example, if Dorrie usually sleeps around 16 hours a day and we notice a day where she didn't sleep, then either something extraordinary happened or there's a data error.
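
The first two checks above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `check_hourly_coverage` helper and the deliberately flawed `sample` rows are hypothetical, and rows are assumed to be `(time, shri, dorrie)` tuples with the time already parsed.

```python
from collections import Counter
from datetime import datetime, timedelta


def check_hourly_coverage(rows, start, hours):
    """Return (missing hours, duplicated hours, hours with a blank activity)."""
    expected = [start + timedelta(hours=i) for i in range(hours)]
    counts = Counter(time for time, _, _ in rows)
    missing = [t for t in expected if counts[t] == 0]
    duplicated = sorted(t for t, n in counts.items() if n > 1)
    blank = [
        time for time, shri, dorrie in rows
        if not shri.strip() or not dorrie.strip()
    ]
    return missing, duplicated, blank


start = datetime(2023, 4, 2, 0, 0)

# Hypothetical, deliberately flawed sample (not the real dataset):
# 1:00 AM appears twice, 2:00 AM is missing, and 3:00 AM has a blank activity.
sample = [
    (start, "Sleep", "Sleep"),
    (start + timedelta(hours=1), "Sleep", "Sleep"),
    (start + timedelta(hours=1), "Sleep", "Lounge"),
    (start + timedelta(hours=3), "Sleep", ""),
]

missing, duplicated, blank = check_hourly_coverage(sample, start, hours=4)
print("Missing:", missing)
print("Duplicated:", duplicated)
print("Blank activity:", blank)
```

For the full week, the same call would use `hours=168`. The consistency and clarity checks are harder to automate and still benefit from a human looking at the labels.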

The complexity of this step and the tooling involved will vary significantly based on the scope of the dataset and project. Large corporations have entire teams dedicated to this step, and it's not uncommon for this to take up the majority of a data scientist's time. For our purposes, the dataset is relatively simple and its quality can be validated using a combination of manual inspection and functions in a tool like Excel or Google Sheets.

Exploring The Categories

Whenever we're dealing with categorical data (data that uses labels instead of numbers to describe things), it's helpful to get a sense of all of the distinct categories that exist. This should be relatively straightforward in most data processing tools.
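
As a rough sketch of what that looks like in Python, the snippet below pulls the distinct labels from a column while keeping the order in which they first appear. The `distinct_in_order` helper is a hypothetical name, and the two lists are just the 5:00 AM to 10:00 AM entries from 04/02/2023.

```python
def distinct_in_order(values):
    """Return each distinct value once, in order of first appearance."""
    # dict keys are unique and keep insertion order, so this removes
    # repeats without shuffling the labels.
    return list(dict.fromkeys(values))


# 04/02/2023, 5:00 AM through 10:00 AM, copied from the dataset.
shri_morning = ["Sleep", "Sleep", "Breakfast", "Walk", "Work", "Work"]
dorrie_morning = ["Sleep", "Sleep", "Breakfast", "Lounge", "Lounge", "Sleep"]

print(distinct_in_order(shri_morning))
print(distinct_in_order(dorrie_morning))
```

Keeping first-appearance order (rather than sorting) makes it easier to trace a surprising label back to roughly where it shows up in the data.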

In this dataset, here are my distinct activity categories:

Sleep
Breakfast
Walk
Work
Lunch
Video Games
Cook Dinner
Relax
TV
Play Together
Guitar
Appointment
Exercise
Side Projects
Read
Errands
Dinner with Friends
Lunch with Friends
Basketball with Friends
Dinner
Movie with Friends

And Dorrie's distinct activity categories:

Sleep
Lounge
Breakfast
Snack
Solo Play
Dinner
Play Together
Zoom

Listing out the distinct categories helps us start to understand some of the constraints around the data.

For example, we can get a sense of the specificity of the data. With activities like "Exercise" and "TV", there is no way from the given data to figure out what type of exercise was done or what show was watched. Any dataset will have a level of fidelity (or precision) that we need to be aware of.

On the flip side, we can also see that some activities are more specific, like "Basketball with Friends". These activities can be grouped together with other activities to form broader categories, such as "Leisure" or "Productivity". Annotating the data with these broader groups is an example of adding metadata, which is basically data that describes our data (you can see why it's meta!). It's a lot easier to make an existing data point more general than it is to make it more specific.
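
Here is one way that kind of annotation might look in Python. The group names "Leisure" and "Productivity" come from the paragraph above, but which activity belongs to which group is purely my assumption for illustration, as are the `ACTIVITY_GROUPS` mapping, the `annotate` helper, and the "Other" fallback.

```python
from collections import Counter

# Hypothetical mapping from specific activities to broader groups.
# The assignments are assumptions; a real project would decide them deliberately.
ACTIVITY_GROUPS = {
    "Work": "Productivity",
    "Side Projects": "Productivity",
    "Basketball with Friends": "Leisure",
    "Video Games": "Leisure",
    "TV": "Leisure",
}


def annotate(activities, groups, default="Other"):
    """Pair each activity with its broader group (metadata about the data)."""
    return [(activity, groups.get(activity, default)) for activity in activities]


# Shri's activities on 04/07/2023 from 3:00 PM through 9:00 PM.
friday_evening = [
    "Work",
    "Video Games",
    "Video Games",
    "Play Together",
    "Basketball with Friends",
    "Basketball with Friends",
    "Basketball with Friends",
]

annotated = annotate(friday_evening, ACTIVITY_GROUPS)
hours_per_group = Counter(group for _, group in annotated)
print(hours_per_group)
```

Note that the original labels are kept alongside the groups. Since generalizing is easy and recovering detail is not, it's safer to add the broader category as an extra field than to overwrite the specific one.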

Having this type of understanding of the data is important for the next step, which is thinking about how to explore the data.

Helpful References

Here's a list of some of the techniques I'd use to help with this step:

  • Conditional Formatting in Google Sheets to make sure the times were unique.
  • The UNIQUE function in Google Sheets to grab the different categories.
  • Manual inspection to make sure the data is consistent and clear.

If the dataset were very large or if I needed to do more robust analysis on it, I'd probably load it into a SQL database. Postgres has great documentation on how to get set up, and you can find tons of tutorials on how to load CSV data into a Postgres database. In practice, this dataset is tiny, so Sheets or Excel is more than enough.
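
To give a feel for the SQL route without any setup, here is a small sketch that uses SQLite (via Python's built-in `sqlite3` module) as a stand-in for Postgres. The table name `activity_log`, the column names, and the ISO-style timestamps are all assumptions for illustration; the four rows are the 12:00 PM to 3:00 PM entries from 04/02/2023.

```python
import sqlite3

# In-memory SQLite database as a lightweight stand-in for Postgres.
conn = sqlite3.connect(":memory:")

# Making the time the primary key means the database itself rejects
# duplicate hours, and NOT NULL rejects missing activities.
conn.execute(
    """
    CREATE TABLE activity_log (
        time   TEXT PRIMARY KEY,
        shri   TEXT NOT NULL,
        dorrie TEXT NOT NULL
    )
    """
)

# 04/02/2023, 12:00 PM through 3:00 PM, with times rewritten in a
# sortable YYYY-MM-DD HH:MM form (an assumption, not the source format).
rows = [
    ("2023-04-02 12:00", "Work", "Snack"),
    ("2023-04-02 13:00", "Lunch", "Sleep"),
    ("2023-04-02 14:00", "Work", "Sleep"),
    ("2023-04-02 15:00", "Work", "Solo Play"),
]
conn.executemany("INSERT INTO activity_log VALUES (?, ?, ?)", rows)

# Count how many hours Dorrie spent on each activity.
dorrie_hours = conn.execute(
    """
    SELECT dorrie, COUNT(*) AS hours
    FROM activity_log
    GROUP BY dorrie
    ORDER BY hours DESC, dorrie
    """
).fetchall()
print(dorrie_hours)
```

The SQL itself should carry over to Postgres with little change, though the connection code and column types would differ.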
