Data
There are a few things we should always check before working with the data.
The Data Structure
Before doing any meaningful analysis or visualization, it's important to make sure that the data and its structure make sense. If the dataset is publicly available and widely used, there's a good chance you can find documentation on how it's structured. Otherwise, like in this case, we can turn to the data itself to get a picture of what it can tell us.
Here's the data again, for reference:
Time | Shri | Dorrie |
---|---|---|
04/02/2023 12:00 AM | Sleep | Sleep |
04/02/2023 1:00 AM | Sleep | Sleep |
04/02/2023 2:00 AM | Sleep | Sleep |
04/02/2023 3:00 AM | Sleep | Sleep |
04/02/2023 4:00 AM | Sleep | Lounge |
04/02/2023 5:00 AM | Sleep | Sleep |
04/02/2023 6:00 AM | Sleep | Sleep |
04/02/2023 7:00 AM | Breakfast | Breakfast |
04/02/2023 8:00 AM | Walk | Lounge |
04/02/2023 9:00 AM | Work | Lounge |
04/02/2023 10:00 AM | Work | Sleep |
04/02/2023 11:00 AM | Work | Lounge |
04/02/2023 12:00 PM | Work | Snack |
04/02/2023 1:00 PM | Lunch | Sleep |
04/02/2023 2:00 PM | Work | Sleep |
04/02/2023 3:00 PM | Work | Solo Play |
04/02/2023 4:00 PM | Work | Sleep |
04/02/2023 5:00 PM | Video Games | Sleep |
04/02/2023 6:00 PM | Cook Dinner | Lounge |
04/02/2023 7:00 PM | Relax | Dinner |
04/02/2023 8:00 PM | TV | Lounge |
04/02/2023 9:00 PM | Play Together | Play Together |
04/02/2023 10:00 PM | Guitar | Zoom |
04/02/2023 11:00 PM | Sleep | Sleep |
04/03/2023 12:00 AM | Sleep | Sleep |
04/03/2023 1:00 AM | Sleep | Sleep |
04/03/2023 2:00 AM | Sleep | Lounge |
04/03/2023 3:00 AM | Sleep | Sleep |
04/03/2023 4:00 AM | Sleep | Lounge |
04/03/2023 5:00 AM | Sleep | Sleep |
04/03/2023 6:00 AM | Sleep | Sleep |
04/03/2023 7:00 AM | Breakfast | Breakfast |
04/03/2023 8:00 AM | Walk | Sleep |
04/03/2023 9:00 AM | Work | Sleep |
04/03/2023 10:00 AM | Work | Lounge |
04/03/2023 11:00 AM | Appointment | Solo Play |
04/03/2023 12:00 PM | Work | Snack |
04/03/2023 1:00 PM | Work | Sleep |
04/03/2023 2:00 PM | Lunch | Lounge |
04/03/2023 3:00 PM | Work | Sleep |
04/03/2023 4:00 PM | Work | Sleep |
04/03/2023 5:00 PM | Work | Sleep |
04/03/2023 6:00 PM | Exercise | Lounge |
04/03/2023 7:00 PM | TV | Dinner |
04/03/2023 8:00 PM | Side Projects | Lounge |
04/03/2023 9:00 PM | Play Together | Play Together |
04/03/2023 10:00 PM | Side Projects | Sleep |
04/03/2023 11:00 PM | Read | Zoom |
04/04/2023 12:00 AM | Sleep | Sleep |
04/04/2023 1:00 AM | Sleep | Sleep |
04/04/2023 2:00 AM | Sleep | Sleep |
04/04/2023 3:00 AM | Sleep | Sleep |
04/04/2023 4:00 AM | Sleep | Sleep |
04/04/2023 5:00 AM | Sleep | Sleep |
04/04/2023 6:00 AM | Sleep | Lounge |
04/04/2023 7:00 AM | Sleep | Breakfast |
04/04/2023 8:00 AM | Breakfast | Lounge |
04/04/2023 9:00 AM | Walk | Sleep |
04/04/2023 10:00 AM | Errands | Solo Play |
04/04/2023 11:00 AM | Work | Lounge |
04/04/2023 12:00 PM | Work | Snack |
04/04/2023 1:00 PM | Work | Sleep |
04/04/2023 2:00 PM | Work | Sleep |
04/04/2023 3:00 PM | Work | Sleep |
04/04/2023 4:00 PM | Work | Sleep |
04/04/2023 5:00 PM | Work | Sleep |
04/04/2023 6:00 PM | Dinner with Friends | Lounge |
04/04/2023 7:00 PM | Dinner with Friends | Dinner |
04/04/2023 8:00 PM | Play Together | Play Together |
04/04/2023 9:00 PM | Side Projects | Lounge |
04/04/2023 10:00 PM | Side Projects | Solo Play |
04/04/2023 11:00 PM | TV | Sleep |
04/05/2023 12:00 AM | TV | Sleep |
04/05/2023 1:00 AM | Sleep | Sleep |
04/05/2023 2:00 AM | Sleep | Sleep |
04/05/2023 3:00 AM | Sleep | Sleep |
04/05/2023 4:00 AM | Sleep | Lounge |
04/05/2023 5:00 AM | Sleep | Sleep |
04/05/2023 6:00 AM | Sleep | Sleep |
04/05/2023 7:00 AM | Sleep | Breakfast |
04/05/2023 8:00 AM | Sleep | Sleep |
04/05/2023 9:00 AM | Walk | Lounge |
04/05/2023 10:00 AM | Work | Sleep |
04/05/2023 11:00 AM | Work | Lounge |
04/05/2023 12:00 PM | Work | Snack |
04/05/2023 1:00 PM | Lunch | Sleep |
04/05/2023 2:00 PM | Work | Sleep |
04/05/2023 3:00 PM | Work | Sleep |
04/05/2023 4:00 PM | Work | Sleep |
04/05/2023 5:00 PM | Work | Sleep |
04/05/2023 6:00 PM | Exercise | Lounge |
04/05/2023 7:00 PM | Cook Dinner | Dinner |
04/05/2023 8:00 PM | Video Games | Solo Play |
04/05/2023 9:00 PM | Play Together | Play Together |
04/05/2023 10:00 PM | TV | Sleep |
04/05/2023 11:00 PM | Sleep | Sleep |
04/06/2023 12:00 AM | Sleep | Sleep |
04/06/2023 1:00 AM | Sleep | Zoom |
04/06/2023 2:00 AM | Sleep | Sleep |
04/06/2023 3:00 AM | Sleep | Sleep |
04/06/2023 4:00 AM | Sleep | Sleep |
04/06/2023 5:00 AM | Sleep | Sleep |
04/06/2023 6:00 AM | Sleep | Lounge |
04/06/2023 7:00 AM | Breakfast | Breakfast |
04/06/2023 8:00 AM | Walk | Sleep |
04/06/2023 9:00 AM | Work | Sleep |
04/06/2023 10:00 AM | Work | Solo Play |
04/06/2023 11:00 AM | Work | Lounge |
04/06/2023 12:00 PM | Lunch with Friends | Snack |
04/06/2023 1:00 PM | Work | Sleep |
04/06/2023 2:00 PM | Work | Solo Play |
04/06/2023 3:00 PM | Work | Sleep |
04/06/2023 4:00 PM | Work | Sleep |
04/06/2023 5:00 PM | Work | Sleep |
04/06/2023 6:00 PM | Side Projects | Lounge |
04/06/2023 7:00 PM | Side Projects | Dinner |
04/06/2023 8:00 PM | TV | Lounge |
04/06/2023 9:00 PM | Play Together | Play Together |
04/06/2023 10:00 PM | Guitar | Lounge |
04/06/2023 11:00 PM | Read | Sleep |
04/07/2023 12:00 AM | Sleep | Sleep |
04/07/2023 1:00 AM | Sleep | Sleep |
04/07/2023 2:00 AM | Sleep | Sleep |
04/07/2023 3:00 AM | Sleep | Sleep |
04/07/2023 4:00 AM | Sleep | Sleep |
04/07/2023 5:00 AM | Sleep | Sleep |
04/07/2023 6:00 AM | Sleep | Sleep |
04/07/2023 7:00 AM | Sleep | Breakfast |
04/07/2023 8:00 AM | Walk | Sleep |
04/07/2023 9:00 AM | Work | Zoom |
04/07/2023 10:00 AM | Work | Sleep |
04/07/2023 11:00 AM | Work | Sleep |
04/07/2023 12:00 PM | Lunch | Snack |
04/07/2023 1:00 PM | Work | Sleep |
04/07/2023 2:00 PM | Work | Lounge |
04/07/2023 3:00 PM | Work | Sleep |
04/07/2023 4:00 PM | Video Games | Lounge |
04/07/2023 5:00 PM | Video Games | Sleep |
04/07/2023 6:00 PM | Play Together | Play Together |
04/07/2023 7:00 PM | Basketball with Friends | Dinner |
04/07/2023 8:00 PM | Basketball with Friends | Lounge |
04/07/2023 9:00 PM | Basketball with Friends | Lounge |
04/07/2023 10:00 PM | Dinner | Solo Play |
04/07/2023 11:00 PM | Play Together | Play Together |
04/08/2023 12:00 AM | Side Projects | Sleep |
04/08/2023 1:00 AM | Side Projects | Sleep |
04/08/2023 2:00 AM | Sleep | Sleep |
04/08/2023 3:00 AM | Sleep | Zoom |
04/08/2023 4:00 AM | Sleep | Sleep |
04/08/2023 5:00 AM | Sleep | Sleep |
04/08/2023 6:00 AM | Sleep | Sleep |
04/08/2023 7:00 AM | Sleep | Breakfast |
04/08/2023 8:00 AM | Sleep | Sleep |
04/08/2023 9:00 AM | Sleep | Sleep |
04/08/2023 10:00 AM | Side Projects | Sleep |
04/08/2023 11:00 AM | Side Projects | Sleep |
04/08/2023 12:00 PM | Side Projects | Snack |
04/08/2023 1:00 PM | Movie with Friends | Sleep |
04/08/2023 2:00 PM | Movie with Friends | Sleep |
04/08/2023 3:00 PM | Lunch with Friends | Sleep |
04/08/2023 4:00 PM | Lunch with Friends | Sleep |
04/08/2023 5:00 PM | Video Games | Lounge |
04/08/2023 6:00 PM | Video Games | Sleep |
04/08/2023 7:00 PM | Dinner | Dinner |
04/08/2023 8:00 PM | Video Games | Solo Play |
04/08/2023 9:00 PM | Play Together | Play Together |
04/08/2023 10:00 PM | TV | Lounge |
04/08/2023 11:00 PM | TV | Lounge |
Looking at the data, we can see that there's a row for each hour of the week (giving us 168 rows total), each with 3 columns:
- Time: When the activity happened, in MM/DD/YYYY HH:MM AM/PM format
- Shri: The main activity that I did during the hour
- Dorrie: The main activity that Dorrie did during the hour
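If you'd rather check this structure in code than by eye, here's a minimal Python sketch. It assumes the rows are available as plain text in the same pipe-separated layout as the table above; `parse_row` and `TIME_FORMAT` are names I made up for illustration.

```python
from datetime import datetime

# Format string matching "MM/DD/YYYY HH:MM AM/PM", e.g. "04/02/2023 1:00 PM".
TIME_FORMAT = "%m/%d/%Y %I:%M %p"


def parse_row(line):
    """Split one 'Time | Shri | Dorrie |' row into (datetime, shri, dorrie)."""
    # Drop the trailing pipe, then split on the remaining separators.
    cells = [cell.strip() for cell in line.strip().rstrip("|").split("|")]
    if len(cells) != 3:
        raise ValueError(f"expected 3 columns, got {len(cells)}: {line!r}")
    time_text, shri, dorrie = cells
    return datetime.strptime(time_text, TIME_FORMAT), shri, dorrie


# Three rows copied from the table above.
sample_rows = [
    "04/02/2023 12:00 AM | Sleep | Sleep |",
    "04/02/2023 7:00 AM | Breakfast | Breakfast |",
    "04/02/2023 3:00 PM | Work | Solo Play |",
]

parsed = [parse_row(line) for line in sample_rows]
for when, shri, dorrie in parsed:
    print(when.isoformat(), shri, dorrie)
```

A row with the wrong number of columns or a malformed time raises a `ValueError`, which is a cheap way to confirm that every row really has the three columns described above.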
Assessing Data Quality
After getting an initial sense of how the data is structured, a good next step can be checking that the data is high quality. For this dataset, that might mean making sure that we have:
- No Missing Data: There should be one row for each hour, and each hour should have an activity for both Shri and Dorrie.
- No Duplicated Data: There should not be more than one row for each hour, and each row should have only one activity each for Shri and Dorrie.
- Consistent Data: Data that represents the same activity should have the same exact label. For example, ideally we can avoid having one label for "Eating" and another for "Eat", since they represent the same thing.
- Clarity: If a data point looks strange or like an outlier, we should investigate it. For example, if Dorrie usually sleeps around 16 hours a day and we notice a day where she didn't sleep, then either something extraordinary happened or there's a data error.
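Some of the checks above are easy to sketch in Python. This is only one possible approach, assuming the timestamps have already been parsed into `datetime` objects; the function names and the small list of timestamps are hypothetical, made up to show a gap and a duplicate.

```python
from collections import Counter
from datetime import datetime, timedelta


def find_missing_hours(times):
    """Return every hour between the first and last timestamp that has no row."""
    seen = set(times)
    current, end = min(times), max(times)
    missing = []
    while current <= end:
        if current not in seen:
            missing.append(current)
        current += timedelta(hours=1)
    return missing


def find_duplicate_hours(times):
    """Return the timestamps that appear on more than one row."""
    counts = Counter(times)
    return sorted(t for t, n in counts.items() if n > 1)


def hours_per_day(rows, activity):
    """Count, per calendar day, the hourly rows labeled with `activity`.

    `rows` is an iterable of (datetime, label) pairs. Days with no matching
    rows are simply absent, and a Counter reports 0 for absent keys.
    """
    counts = Counter()
    for when, label in rows:
        if label == activity:
            counts[when.date()] += 1
    return counts


# Hypothetical timestamps: hour 1 is repeated and hour 2 is missing.
start = datetime(2023, 4, 2, 0, 0)
times = [start + timedelta(hours=h) for h in (0, 1, 1, 3)]

print("missing:", find_missing_hours(times))
print("duplicated:", find_duplicate_hours(times))
```

`hours_per_day` is a starting point for the clarity check: a day whose "Sleep" count is far from the others is worth a closer look.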
The complexity of this step and the tooling involved will vary significantly based on the scope of the dataset and project. Large corporations have entire teams dedicated to this step, and it's not uncommon for this to take up the majority of a data scientist's time. For our purposes, the dataset is relatively simple and its quality can be validated using a combination of manual inspection and functions in a tool like Excel or Google Sheets.
Exploring The Categories
Whenever we're dealing with categorical data (data that uses labels instead of numbers to describe things), it's helpful to get a sense of all of the distinct categories that exist. This should be relatively straightforward in most data processing tools.
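As a rough illustration in Python (the helper names are my own, not from any library), pulling out the distinct labels takes only a few lines. The second helper is a basic consistency check: it only catches labels that differ by case or stray whitespace, so it would not flag "Eat" versus "Eating".

```python
def distinct_in_order(labels):
    """Return each distinct label once, in order of first appearance."""
    # dict keys are unique and keep insertion order.
    return list(dict.fromkeys(labels))


def inconsistent_labels(labels):
    """Group labels that differ only by case or surrounding whitespace."""
    groups = {}
    for label in labels:
        groups.setdefault(label.strip().lower(), set()).add(label)
    return {key: sorted(variants) for key, variants in groups.items() if len(variants) > 1}


# Dorrie's first day (04/02/2023), copied from the table above.
dorrie_day_one = [
    "Sleep", "Sleep", "Sleep", "Sleep", "Lounge", "Sleep", "Sleep", "Breakfast",
    "Lounge", "Lounge", "Sleep", "Lounge", "Snack", "Sleep", "Sleep", "Solo Play",
    "Sleep", "Sleep", "Lounge", "Dinner", "Lounge", "Play Together", "Zoom", "Sleep",
]

print(distinct_in_order(dorrie_day_one))

# Hypothetical messy labels, to show what the consistency check reports.
print(inconsistent_labels(["TV", "tv ", "Sleep"]))
```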
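In this dataset, here are my distinct activity categories, in order of first appearance: Sleep, Breakfast, Walk, Work, Lunch, Video Games, Cook Dinner, Relax, TV, Play Together, Guitar, Appointment, Exercise, Side Projects, Read, Errands, Dinner with Friends, Lunch with Friends, Basketball with Friends, Dinner, Movie with Friends.
And Dorrie's distinct activity categories: Sleep, Lounge, Breakfast, Snack, Solo Play, Dinner, Play Together, Zoom.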
Listing out the distinct categories helps us start to understand some of the constraints around the data.
For example, we can get a sense of the specificity of the data. With activities like "Exercise" and "TV", there is no way from the given data to figure out what type of exercise was done or what show was watched. Any dataset will have a level of fidelity (or precision) that we need to be aware of.
On the flip side, we can also see that there are some activities that are more specific, like "Basketball with Friends". These activities can be grouped together with other activities to form broader categories, such as "Leisure" or "Productivity". Annotating the data with these broader groups is an example of adding metadata, which is basically data that describes our data (you can see why it's meta!). It's a lot easier to make an existing data point more general than it is to make it more specific.
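Here's a small sketch of what that annotation could look like in Python. The group names "Leisure" and "Productivity" come from the paragraph above, but which activity goes in which group is my own illustrative assumption, and `ACTIVITY_GROUPS` and `annotate` are hypothetical names.

```python
# Hypothetical mapping from a specific activity to a broader group.
# The assignments are illustrative assumptions, not part of the dataset.
ACTIVITY_GROUPS = {
    "Basketball with Friends": "Leisure",
    "Video Games": "Leisure",
    "TV": "Leisure",
    "Work": "Productivity",
    "Side Projects": "Productivity",
}


def annotate(activity, groups=ACTIVITY_GROUPS, default="Ungrouped"):
    """Return the broader group for an activity, or `default` if unmapped."""
    return groups.get(activity, default)


activities = ["Work", "Basketball with Friends", "Sleep"]
annotated = [(activity, annotate(activity)) for activity in activities]
print(annotated)
```

Keeping the mapping separate from the data means the original labels stay untouched, so the grouping can be revised later without losing the more specific information.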
Having this type of understanding of the data is important for the next step, which is thinking about how to explore the data.
Helpful References
Here's a list of some of the techniques I'd use to help with this step:
- Conditional Formatting in Google Sheets to make sure the times were unique.
- The UNIQUE function in Google Sheets to grab the different categories.
- Manual inspection to make sure the data is consistent and clear.
If the dataset were very large or if I needed to do more robust analysis on it, I'd probably load it into a SQL database. Postgres has great documentation on how to get set up, and you can find tons of tutorials on how to load CSV data into a Postgres database. In practice, this dataset is tiny, so Sheets or Excel is more than enough.
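To give a feel for the SQL route without setting up a server, here's a minimal sketch that uses Python's built-in `sqlite3` module as a stand-in for Postgres. The table name `activity_log`, the column names, and storing the time as ISO-style text are all assumptions for illustration; the three rows are taken from the table above.

```python
import sqlite3

# Three rows from the table above, with the time rewritten as ISO-style text.
rows = [
    ("2023-04-02 07:00", "Breakfast", "Breakfast"),
    ("2023-04-02 08:00", "Walk", "Lounge"),
    ("2023-04-02 09:00", "Work", "Lounge"),
]

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE activity_log ("
    "time TEXT PRIMARY KEY, "  # the primary key rejects duplicate hours
    "shri TEXT NOT NULL, "     # NOT NULL rejects missing activities
    "dorrie TEXT NOT NULL)"
)
conn.executemany("INSERT INTO activity_log VALUES (?, ?, ?)", rows)

# Count how many hours Dorrie spent on each activity.
dorrie_counts = conn.execute(
    "SELECT dorrie, COUNT(*) FROM activity_log GROUP BY dorrie ORDER BY dorrie"
).fetchall()
print(dorrie_counts)
conn.close()
```

The constraints on the table do some of the data quality work for free: inserting a second row for the same hour, or a row with a missing activity, fails instead of silently slipping in.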