Dataset Details

Overview

Three StudentLife datasets come from studies of undergraduates at Dartmouth College.

First StudentLife Study: The first dataset is from the 2013 StudentLife study, with results published in a landmark ACM UbiComp 2014 paper presented by Rui Wang (now at Meta). This dataset captures 48 undergraduate students over a single 10-week term using Android phones. For the first time, it reveals the day-to-day and week-by-week impact of workload on stress, sleep, activity, mood, sociability, mental well-being, and academic performance. A second paper presented at ACM UbiComp 2015 by Rui Wang on academic performance from the same study predicted GPA within ±0.179 of the reported grades. We talk about this dataset in detail below.

Second StudentLife Study: The second study followed 83 undergraduate students over two consecutive terms in 2016, utilizing smartphones and wearables to investigate the dynamics of depression and anxiety. Andrew Campbell presented the results in an ACM UbiComp 2018 paper, demonstrating that symptom features derived from phone and wearable sensors predict whether a student is depressed on a week-by-week basis with 81.5% recall and 69.1% precision. This dataset will be released this academic year.

Third StudentLife Study: The third study tracked over 200 undergraduate students from high school to graduation, providing invaluable insights into changing behaviors, resilience, and mental health in college life. Passive sensor data, surveys, and interviews were used to capture changing behaviors before, during, and after the COVID-19 pandemic subsided. The paper from this study will be presented at ACM UbiComp 2024. Other papers have been published using the dataset, including 1) an ACM UbiComp 2022 paper on the mental health of first-generation undergraduate students presented by Weichen Wang; 2) an ACM CHI 2022 paper comparing students' behavioral changes from the year before COVID-19 against the first year of COVID-19 presented by Subigya Nepal; 3) an ACM UbiComp 2020 paper on predicting brain functional connectivity from fMRI brain-imaging data and mobile behavioral sensing data presented by Mikio Obuchi; and 4) a Journal of Medical Internet Research paper on the early months of the COVID-19 pandemic. This dataset comprises fMRI brain-imaging, mobile sensing, EMA, and surveys. It is available for download on Kaggle.


Introduction

The StudentLife dataset is a large, longitudinal dataset that is rich in information and deep. Importantly, the dataset is anonymized, protecting the privacy of the participants in the study.

The dataset is from 48 undergrads and grad students at Dartmouth over the 10-week spring term. It includes over 53 GB of continuous data, 32,000 self-reports, and pre-post surveys; specifically it comprises:

  • objective sensing data: sleep (bedtime, duration, wake up); conversation duration, conversation frequency; physical activity (stationary, walk, run);
  • location-based data: location, co-location, indoor and outdoor mobility;
  • other phone data: light, Bluetooth, audio, Wi-Fi, screen lock/unlock, phone charge, app usage.
  • self-reports: affect (PAM), stress, behavior, Boston bombing reaction, cancelled classes, class opinion, comment, Dartmouth now, Dimension incident, Dimension protest, dining halls, events, exercise, Green Key, lab, mood, loneliness, social and study spaces.
  • pre-post surveys: PHQ9 depression scale, UCLA loneliness scale, positive and negative affect schedule (PANAS), perceived stress scale (PSS), big five personality, flourishing scale, Pittsburgh sleep quality index, veterans RAND 12 item health (VR12)
  • academic performance data: class information, deadlines, grades (grades, term GPA, cumulative GPA), piazza data
  • dining data: meals data, location and time
  • seating data: seating position of students in Android programming
  • entry and exit surveys: to be added once anonymized

  • The whole StudentLife dataset is in one big file: full dataset, which contains all the sensor data, EMA data, survey responses and educational data.

    For privacy considerations, we removed data that may reveal participants' identities. For example, Bluetooth device names may contain participants' real names because people use their names to name their computers. Browser logs are also removed from the dataset. The SSIDs of WiFi APs have been removed from the dataset because Dartmouth College Network Services does not allow us to disclose any information on campus WiFi AP deployment.

    We recommend importing the whole dataset into a centralized datastore (e.g. MongoDB, Apache Cassandra) first. It will make the data processing much easier.
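    As a minimal sketch of the idea, the snippet below loads a few rows into a single store and runs one query across them. It uses SQLite from the Python standard library as a lightweight stand-in for MongoDB or Cassandra; the table name, column names, and the uid value are assumptions made for illustration, and the rows are the sample values from the Physical Activity Inferences section.

```python
import sqlite3

# Sample rows (uid, timestamp, activity inference id). The timestamps and
# inference ids come from the physical activity sample on this page; the
# table layout and the uid "u01" are illustrative assumptions.
rows = [
    ("u01", 1364356853, 0),
    ("u01", 1364356856, 0),
    ("u01", 1364356858, 0),
]

# An in-memory database keeps the sketch self-contained; pass a file path
# to sqlite3.connect() instead to persist the data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE activity (uid TEXT, timestamp INTEGER, inference INTEGER)"
)
conn.executemany("INSERT INTO activity VALUES (?, ?, ?)", rows)
conn.commit()

# With everything in one store, a per-participant summary is one statement.
counts = conn.execute(
    "SELECT uid, COUNT(*) FROM activity GROUP BY uid"
).fetchall()
print(counts)
```

    The same pattern extends to the other data types by adding one table (or collection) per sensing directory.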

    Download the Dataset

    Citation

    Please cite the following paper if the dataset is used in a publication:

    Wang, Rui, Fanglin Chen, Zhenyu Chen, Tianxing Li, Gabriella Harari, Stefanie Tignor, Xia Zhou, Dror Ben-Zeev, and Andrew T. Campbell. "StudentLife: Assessing Mental Health, Academic Performance and Behavioral Trends of College Students using Smartphones." In Proceedings of the ACM Conference on Ubiquitous Computing. 2014.

    R package

    Tidy handling and navigation of the valuable Student-Life mHealth dataset: https://github.com/frycast/studentlife

    Fryer, Daniel, Hien Nguyen, and Pierre Orban. "studentlife: Tidy Handling and Navigation of a Valuable Mobile-Health Dataset." [pdf]

    Data Directory Organization

    The dataset directories are organized by data type. The StudentLife dataset contains four types of data: sensor data, EMA data, pre and post survey responses, and educational data. The top-level directory is shown below. The following subsections introduce the structure of each directory; the data formats are described in later sections.

       dataset
       |-user_info.csv
       |-sensing
       |-EMA
       |-education
       |-survey
    

    Sensor Data

    There are 10 subdirectories in dataset/sensing that correspond to 10 types of sensor data: physical activity, audio inferences, conversation inferences, Bluetooth scan, light sensor, GPS, phone charge, phone lock, WiFi, WiFi location. All sensor data is stored in csv files.

    The data files under each data type subdirectory are organized by participants. For example, you can find all physical activity inferences for u01 in sensing/activity/activity_u01.csv. Similarly, you can find u01's conversation inferences in sensing/conversation/conversation_u01.csv.

       sensing
       |-activity
       |-audio
       |-conversation
       |-bluetooth
       |-dark
       |-gps
       |-phonecharge
       |-phonelock
       |-wifi
       |-wifi_location
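
    The per-participant layout can be captured in a small path helper. This is a sketch only: the <type>/<type>_<uid>.csv pattern is confirmed above for activity and conversation, and it is an assumption that the other subdirectories follow the same naming. The function name sensing_file is hypothetical.

```python
from pathlib import Path


def sensing_file(root, data_type, uid):
    """Return the expected path of one participant's file for one data type.

    Assumes the <type>/<type>_<uid>.csv pattern shown for activity and
    conversation also holds for the other sensing subdirectories.
    """
    return Path(root) / "sensing" / data_type / f"{data_type}_{uid}.csv"


path = sensing_file("dataset", "activity", "u01")
print(path.as_posix())
```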

    EMA Data

    You can find EMA question definitions in EMA/EMA_definition.json. Participants' responses are stored in EMA/response. The names of the subdirectories under EMA/response correspond to the EMA questions' names. For example, EMA/response/Stress contains all participants' responses to the Stress EMA. Similar to sensor data, each EMA's responses are organized by participants' uid. You can find the detailed EMA file format in the EMA section.

       EMA
       |-EMA_definition.json
       |-response
       |---Activity
       |---Administration's response
       |---Behavior
       |---Boston Bombing
       |---Cancelled Classes
       |---Class
       |---Class 2
       |---Comment
       |---Dartmouth now
       |---Dimensions
       |---Dimensions protestors
       |---Dining Halls
       |---Do Campbell's jokes suck?
       |---Events
       |---Exercise
       |---Green Key 1
       |---Green Key 2
       |---Lab
       |---Mood
       |---Mood 1
       |---Mood 2
       |---PAM
       |---Sleep
       |---Social
       |---Stress
       |---Study Spaces

    Pre and Post Surveys

    All pre and post survey responses are stored in corresponding files under dataset/survey. The directory is organized by survey names. For example, you can find participants' pre and post responses to PHQ-9 depression scale in survey/PHQ-9.csv. All files are in csv format, which is defined in Survey section.

       survey
       |---BigFive.csv
       |---FlourishingScale.csv
       |---LonelinessScale.csv
       |---panas.csv
       |---PerceivedStressScale.csv
       |---PHQ-9.csv
       |---psqi.csv
       |---vr_12.csv

    Educational Data

    Educational data, which includes classes taken during the 2013 Spring term, deadlines for each participant, grades, and Piazza usage for CS65, is stored under dataset/education. A detailed description is in the Educational Data section.

       education
       |---class_info.json
       |---class.csv
       |---deadlines.csv
       |---grades.csv
       |---piazza.csv

    Automatic Sensing

    This section introduces the data format of automatic sensor data that resides under dataset/sensing.

    Physical Activity Inferences

    The first few lines of a participant's physical activity inferences file look like this:

    timestamp activity inference
    1364356853 0
    1364356856 0
    1364356858 0

    The first row is the header row, which defines that there are two fields in activity data files: timestamp and activity inference id. The timestamp is the Unix time when the inference was collected. The timezone is Eastern Time Zone.

    The activity classifier runs 24/7 with duty cycling. To avoid draining the battery, it makes activity inferences continuously for 1 minute, then pauses for 3 minutes before it resumes collecting activity inferences. It generates one activity inference every 2 to 3 seconds, depending on the smartphone's accelerometer sampling rate. The meaning of each activity inference is described in the following table.

    Inference ID Description
    0 Stationary
    1 Walking
    2 Running
    3 Unknown
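
    A short sketch of decoding an activity file: it maps inference IDs to the labels in the table above and converts the first timestamp to local time. Two assumptions are made here: the csv files are comma-delimited (this page shows the columns as a table), and a fixed UTC-4 offset (Eastern Daylight Time) is adequate because the sample falls in spring.

```python
import csv
import io
from collections import Counter
from datetime import datetime, timedelta, timezone

# Labels from the inference table on this page.
ACTIVITY_LABELS = {0: "Stationary", 1: "Walking", 2: "Running", 3: "Unknown"}

# Assumption: a fixed UTC-4 offset (Eastern Daylight Time). Use a time zone
# database instead if the timestamps may fall outside daylight saving time.
EDT = timezone(timedelta(hours=-4))

# Sample rows from this page; the comma delimiter is an assumption.
sample = """timestamp,activity inference
1364356853,0
1364356856,0
1364356858,0
"""

reader = csv.reader(io.StringIO(sample))
next(reader)  # skip the header row
records = [(int(ts), int(label)) for ts, label in reader]

# Count how often each activity label occurs.
label_counts = Counter(ACTIVITY_LABELS[label] for _, label in records)

# Convert the first Unix timestamp to a timezone-aware local datetime.
first_seen = datetime.fromtimestamp(records[0][0], tz=EDT)

print(label_counts)
print(first_seen.isoformat())
```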

    Audio

    The first few lines of a participant's audio inferences file look like this:

    timestamp audio inference
    1364356875 0
    1364356876 0
    1364356877 0

    The first row is the header row, which defines that there are two fields in audio data files: timestamp and audio inference type id. The timestamp is the Unix time when the inference was collected. The timezone is Eastern Time Zone.

    The audio classifier runs 24/7 with duty cycling. It makes audio inferences for 1 minute, then pauses for 3 minutes before restarting. If the conversation classifier detects that a conversation is going on, it keeps running until the conversation is finished. It generates one audio inference every 2 to 3 seconds. The meaning of each audio inference is described in the following table.

    Inference ID Description
    0 Silence
    1 Voice
    2 Noise
    3 Unknown
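
    Because of duty cycling, the inferences arrive in bursts separated by pauses. The sketch below groups timestamps into bursts by splitting wherever the gap exceeds a threshold. The 60-second threshold and the function name are assumptions; the first three timestamps are the sample values above, and the last two are hypothetical values added to show a second burst.

```python
def split_into_bursts(timestamps, max_gap_seconds=60):
    """Group timestamps into bursts, starting a new burst after a long gap."""
    bursts = []
    for ts in sorted(timestamps):
        if bursts and ts - bursts[-1][-1] <= max_gap_seconds:
            bursts[-1].append(ts)  # still inside the current sensing window
        else:
            bursts.append([ts])  # a long pause ended the previous window
    return bursts


timestamps = [
    1364356875, 1364356876, 1364356877,  # sample values from this page
    1364357115, 1364357117,              # hypothetical later readings
]
bursts = split_into_bursts(timestamps)
print([len(burst) for burst in bursts])
```

    A burst that lasts much longer than one minute would be consistent with the classifier staying on during a conversation.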

    Conversation

    The first few lines of a participant's conversation inferences file look like this:

    start_timestamp end_timestamp
    1364425656 1364425727
    1364427639 1364427780
    1364428051 1364428485

    There are two fields in conversation data files: conversation start timestamp and conversation end timestamp. For example, the first row shown above records that the participant was around a conversation from Unix timestamp 1364425656 to Unix timestamp 1364425727. The timezone is Eastern Time Zone.
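
    Conversation duration follows directly from the two fields. A minimal sketch using the three sample rows above:

```python
# (start_timestamp, end_timestamp) pairs from the sample rows on this page.
conversations = [
    (1364425656, 1364425727),
    (1364427639, 1364427780),
    (1364428051, 1364428485),
]

# Duration of each conversation in seconds.
durations = [end - start for start, end in conversations]
total_seconds = sum(durations)

print(durations)
print(total_seconds)
```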

    GPS Location

    The first few lines of a participant's GPS location file look like this:

    time provider network_type accuracy latitude longitude altitude bearing speed travelstate
    1364357009 network wifi 67.993 43.7066671 -72.2890974 0.0 0.0 0.0 stationary
    1364358209 network wifi 23.0 43.706637 -72.2890664 0.0 0.0 0.0 moving
    1364359405 gps 16.0 43.70667831 -72.28901794 136.300003052 96.2 0.25

    GPS coordinates were collected every 10 minutes. Important data fields are shown as follows:

    Field Name Description
    time The Unix time of when it was collected (EST)
    provider The source of GPS coordinates: GPS or network
    network_type Which network was used to obtain GPS fix when the provider is network
    latitude Latitude
    longitude Longitude
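
    One common use of consecutive fixes is estimating how far a participant moved. The sketch below applies the standard haversine great-circle formula to the first two sample rows; the formula is a general technique, not something prescribed by the dataset, and the function name is hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# Latitude and longitude of the first two sample rows on this page.
distance = haversine_m(43.7066671, -72.2890974, 43.706637, -72.2890664)
print(f"{distance:.1f} m")
```

    Distances this small are well below the reported accuracy values of the fixes, so they should not be read as real movement.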

    Bluetooth

    The first few lines of a participant's Bluetooth scan log file look like this:

    time MAC class_id level
    1364359421 00:26:08:C9:80:E2 3670284 -79
    1364359421 68:A8:6D:24:D9:8F 3801356 -92
    1364360622 68:A8:6D:24:D9:8F 3801356 -94
    1364388221 00:26:08:D2:B5:E9 3670284 -80
    1364393027 00:26:08:B8:D2:CF 3801356 -86
    1364393027 44:2A:60:FB:B7:59 3801356 -93

    Bluetooth scans every 10 minutes. We removed device names for privacy concerns. Important data fields are shown as follows:

    Field Name Description
    time The Unix time of when it was collected
    MAC The MAC address of surrounding Bluetooth device
    class_id Describes general characteristics and capabilities of a device, see android.bluetooth.BluetoothClass
    level Signal strength

    Note: rows that share the same timestamp belong to a single Bluetooth scan.
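
    Grouping rows by timestamp recovers the individual scans. A minimal sketch using the sample rows above:

```python
from collections import defaultdict

# (time, MAC, class_id, level) rows from the sample on this page.
rows = [
    (1364359421, "00:26:08:C9:80:E2", 3670284, -79),
    (1364359421, "68:A8:6D:24:D9:8F", 3801356, -92),
    (1364360622, "68:A8:6D:24:D9:8F", 3801356, -94),
    (1364388221, "00:26:08:D2:B5:E9", 3670284, -80),
    (1364393027, "00:26:08:B8:D2:CF", 3801356, -86),
    (1364393027, "44:2A:60:FB:B7:59", 3801356, -93),
]

# Rows with the same timestamp belong to one scan.
scans = defaultdict(list)
for time, mac, class_id, level in rows:
    scans[time].append(mac)

# Number of nearby devices seen in each scan.
devices_per_scan = {time: len(macs) for time, macs in scans.items()}
print(devices_per_scan)
```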

    WiFi

    The first few lines of a participant's WiFi AP scan log file look like this:

    time BSSID freq level
    1364356944 d0:57:4c:57:58:00 2437 -68
    1364356944 dc:7b:94:87:29:b0 2462 -87
    1364357187 d0:57:4c:57:58:00 2437 -68
    1364357187 dc:7b:94:87:29:b0 2462 -87
    1364357514 d0:57:4c:57:58:00 2437 -68
    1364357514 dc:7b:94:87:46:f2 2412 -89

    WiFi scans frequently. We removed SSID for privacy concerns. Important data fields are shown as follows:

    Field Name Description
    time The Unix time of when it was collected
    BSSID AP's MAC address
    freq AP's working channel frequency
    level Signal strength

    Note: rows that share the same timestamp belong to a single WiFi scan.
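
    The sketch below groups the sample rows into scans and picks the access point with the strongest signal in each one. It assumes that level is a signal strength in dBm, so that a value closer to zero means a stronger signal.

```python
from itertools import groupby
from operator import itemgetter

# (time, BSSID, freq, level) rows from the sample on this page.
rows = [
    (1364356944, "d0:57:4c:57:58:00", 2437, -68),
    (1364356944, "dc:7b:94:87:29:b0", 2462, -87),
    (1364357187, "d0:57:4c:57:58:00", 2437, -68),
    (1364357187, "dc:7b:94:87:29:b0", 2462, -87),
    (1364357514, "d0:57:4c:57:58:00", 2437, -68),
    (1364357514, "dc:7b:94:87:46:f2", 2412, -89),
]

# groupby needs its input sorted by the grouping key (the scan time).
rows.sort(key=itemgetter(0))

# For each scan, keep the BSSID of the row with the highest level.
strongest_ap = {
    time: max(scan, key=itemgetter(3))[1]
    for time, scan in groupby(rows, key=itemgetter(0))
}
print(strongest_ap)
```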

    WiFi Location

    We acquired Dartmouth College's WiFi AP deployment information from Dartmouth Network Services which allows us to calculate a participant's on-campus rough location. However, we are not allowed to release Dartmouth WiFi AP deployment information to the public, so we release the location inference we calculated based on participants' WiFi scan log. You can use location inferred from WiFi scan and GPS Location data to infer the GPS coordinates of each Dartmouth building.

    The first few lines of a participant's WiFi location file look like this:

    time location
    1364357009 near[north-main; cutter-north; kemeny; ]
    1364358209 in[kemeny]
    1364359102 in[kemeny]
    1364359163 in[kemeny]
    1364359223 in[kemeny]
    1364359409 in[kemeny]
    1364359508 near[kemeny; cutter-north; north-main; ]
    1364359793 near[kemeny; cutter-north; north-main; ]
    1364360078 near[kemeny; cutter-north; north-main; ]

    Each field is defined as follows:

    Field Name Description
    time The Unix time of when it was collected
    location On-campus location inferred from WiFi scans.

    There are two kinds of location inferences: in a building (e.g. in[kemeny]) and near some buildings (near[kemeny; cutter-north; north-main;]).
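
    The location strings can be split into the kind of inference and a list of building names. This is a sketch based only on the two formats shown above; the function name is hypothetical.

```python
import re

# Matches "in[...]" or "near[...]" and captures the bracket contents.
LOCATION_PATTERN = re.compile(r"^(in|near)\[(.*)\]$")


def parse_location(value):
    """Split a location inference into its kind and its building names."""
    match = LOCATION_PATTERN.match(value.strip())
    if match is None:
        raise ValueError(f"unrecognized location inference: {value!r}")
    kind, inner = match.groups()
    # Building names are separated by ";" and the list may end with "; ".
    buildings = [name.strip() for name in inner.split(";") if name.strip()]
    return kind, buildings


print(parse_location("in[kemeny]"))
print(parse_location("near[north-main; cutter-north; kemeny; ]"))
```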

    Light

    The light data files record when the phone was in a dark environment for a significantly long time (>=1 hour). There are two fields in each data file: start timestamp and end timestamp.

    The first few lines of a participant's light sensor file look like this:

    start end
    1364359112 1364387807
    1364397153 1364400889
    1364402955 1364418088
    1364423980 1364432230

    Phone Lock

    The phone lock data files record when the phone was locked for a significantly long time (>=1 hour). There are two fields in each data file: start timestamp and end timestamp.

    The first few lines of a participant's phone lock file look like this:

    start end
    1364359161 1364387080
    1364395185 1364402754
    1364402806 1364409439
    1364427062 1364432230

    Phone Charge

    The phone charge data files record when the phone was plugged in and charging for a significantly long time (>=1 hour). There are two fields in each data file: start timestamp and end timestamp.

    The first few lines of a participant's phone charge file look like this:

    start end
    1364359041 1364387080
    1364531150 1364560331
    1364622533 1364657458
    1364703563 1364739262
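
    Because the light, phone lock, and phone charge files all store start and end intervals, they can be compared directly. The sketch below measures how long the phone was both locked and charging, using the sample rows from the Phone Lock and Phone Charge sections. It assumes the intervals within one file do not overlap each other; the function name is hypothetical.

```python
def overlap_seconds(intervals_a, intervals_b):
    """Total time in seconds covered by both interval lists.

    Assumes the intervals within each list do not overlap one another.
    """
    total = 0
    for a_start, a_end in intervals_a:
        for b_start, b_end in intervals_b:
            # The shared part runs from the later start to the earlier end.
            total += max(0, min(a_end, b_end) - max(a_start, b_start))
    return total


# Sample rows from the Phone Lock section.
lock = [
    (1364359161, 1364387080),
    (1364395185, 1364402754),
    (1364402806, 1364409439),
    (1364427062, 1364432230),
]
# Sample rows from the Phone Charge section.
charge = [
    (1364359041, 1364387080),
    (1364531150, 1364560331),
    (1364622533, 1364657458),
    (1364703563, 1364739262),
]

locked_and_charging = overlap_seconds(lock, charge)
print(locked_and_charging)
```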

    EMA

    EMA data has two parts: EMA question definitions and participants' responses.

    EMA Definitions

    EMA question definition is defined in dataset/EMA/EMA_definition.json. It defines a JSON array that stores all EMA questions' definitions. For example, the Sleep EMA question is defined as follows:

    [Image: Sleep EMA question definition]

    The name field defines the EMA question's name (i.e., Sleep in the above example). The questions field defines the questions that participants need to answer for this EMA. Each item in the questions array has three fields: question_text, question_id and options. question_text is the text of the question, question_id is the id of the question, and options defines the candidate responses. A response record stores the index of the chosen option rather than the option's text. For example, if a participant answered 6.5 for the first Sleep EMA question "How many hours did you sleep last night?", you will find hour:8 in their corresponding response record.

    EMA Responses

    EMA responses are in JSON array format. Each item in the JSON array is one response. As mentioned in EMA Definitions, the keys of each response are the question ids defined in the EMA definitions. The value is the participant's response to the question, given as the index of the chosen option among the options defined in the EMA definitions.

    One Sleep EMA response looks like below:

    [Image: sample Sleep EMA response]

    We can learn from this response that the participant responded at Unix time 1364359545 (EST), and that the participant's GPS coordinates were 43.70705013,-72.28730277 when answering the EMA question. The participant slept 6 hours according to the hour field. According to the rate and social fields respectively, their sleep quality was "Fairly good", and they had trouble staying awake "Three or more times" yesterday while in class, eating meals or engaging in social activity.
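
    Turning a stored index back into the option text can be sketched as follows. Everything about the data here is hypothetical: the field names follow the description in EMA Definitions, but the option list, its representation as a JSON array of strings, and the 1-based indexing are assumptions chosen so that hour:8 maps to 6.5, as in the example in EMA Definitions. Check EMA_definition.json for the real values before relying on this.

```python
# Hypothetical definition. The field names (name, questions, question_id,
# question_text, options) follow this page; the option list itself is an
# assumption made for illustration.
sleep_definition = {
    "name": "Sleep",
    "questions": [
        {
            "question_id": "hour",
            "question_text": "How many hours did you sleep last night?",
            "options": ["<3", "3.5", "4", "4.5", "5", "5.5", "6", "6.5"],
        }
    ],
}


def decode_response(definition, response, index_base=1):
    """Replace option indexes in a response with the option text.

    index_base is an assumption: set it to 0 if the indexes are 0-based.
    Keys that are not question ids are copied through unchanged.
    """
    options_by_id = {
        q["question_id"]: q["options"] for q in definition["questions"]
    }
    decoded = {}
    for key, value in response.items():
        if key in options_by_id:
            decoded[key] = options_by_id[key][int(value) - index_base]
        else:
            decoded[key] = value
    return decoded


decoded = decode_response(sleep_definition, {"hour": "8"})
print(decoded)
```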

    Seating Position

    You can find seating position data files under the folder dataset/EMA/response/QR_Code. There are two fields in each seating position data file: timestamp and a QR code corresponding to a seating position.

    The mapping between the seating position and the QR code is as follows:

    [Image: Seating Position Mapping]


    Survey Responses

    Each survey responses file contains participants' responses to both pre and post mental health measures. The following shows u01's pre and post responses to the Flourishing Scale. The first column shows which participant answered the survey and the second column indicates whether the response is from the pre or post measurement. The remaining columns correspond to the survey questions.

    uid type I lead a purposeful and meaningful life My social relationships are supportive and rewarding I am engaged and interested in my daily activities I actively contribute to the happiness and well-being of others I am competent and capable in the activities that are important to me I am a good person and live a good life
    u01 pre 4 6 6 6 7 6
    u01 post 5 5 6 5 7 6
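
    A simple way to compare pre and post measurements is to aggregate the item scores per response. The sketch below sums the values shown above for u01. Summing is used only as an illustrative aggregate over the columns shown on this page; it is not the official scoring of the Flourishing Scale.

```python
# Pre and post item scores for u01, copied from the rows on this page.
rows = [
    {"uid": "u01", "type": "pre", "scores": [4, 6, 6, 6, 7, 6]},
    {"uid": "u01", "type": "post", "scores": [5, 5, 6, 5, 7, 6]},
]

# Total of the shown item scores for each (participant, measurement) pair.
totals = {(row["uid"], row["type"]): sum(row["scores"]) for row in rows}

# Positive values mean the total increased from pre to post.
change = totals[("u01", "post")] - totals[("u01", "pre")]

print(totals)
print(change)
```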

    You can find detailed information about the mental health surveys from the following references:

    • Spitzer R., Kroenke, K., Williams, J. (1999). Validation and utility of a self-report Version of PRIME-MD: the PHQ Primary Care Study. Journal of the American Medical Association, 282, 1737-1744.
    • Kroenke K, Spitzer R L, Williams J B (2001). The PHQ-9: validity of a brief depression severity measure. Journal of General Internal Medicine, 16(9): 606-613.
    • Russell, Daniel W. "UCLA Loneliness Scale (Version 3): Reliability, validity, and factor structure." Journal of personality assessment 66.1 (1996): 20-40.
    • Mount, Michael K., and Murray R. Barrick. "The Big Five personality dimensions: Implications for research and practice in human resources management." Research in personnel and human resources management 13.3 (1995): 153-200.
    • Most, Robert, and Theresa Muñoz. "Perceived Stress Scale."
    • Diener, E., Wirtz, D., Tov, W., Kim-Prieto, C., Choi, D., Oishi, S., & Biswas-Diener, R. (2010). New measures of well-being: Flourishing and positive and negative feelings. Social Indicators Research, 39, 247-266.
    • The Veterans RAND 12 Item Health Survey (VR12): What is it and How it is Used (2009) Iqbal SU, Rogers W, Selim A, Qian S, Lee A, Ren XS, Rothendler J, Miller D, Kazis LE Center for Health Quality, Outcomes, and Economic Research, A Health Services Research and Development Center of Excellence, VA Medical Center, Bedford, MA, USA
    • Buysse, Daniel J., et al. "Quantification of subjective sleep quality in healthy elderly men and women using the Pittsburgh Sleep Quality Index (PSQI)." Sleep: Journal of Sleep Research & Sleep Medicine (1991).
    • Watson, David, Lee A. Clark, and Auke Tellegen. "Development and validation of brief measures of positive and negative affect: the PANAS scales." Journal of personality and social psychology 54.6 (1988): 1063.

    Education

    There are four types of educational data: classes which participants took during the 2013 Spring term, number of class deadlines per day, GPA and Piazza usage.

    Class

    class.csv records classes which participants took during the 2013 Spring term.

    You can find the lecture time period and location in class_info.json. All classes are stored in a JSON array. The following shows the location and class periods for COSC 065. The class location corresponds to the WiFi location. The periods field defines all class meeting periods in a JSON array. day is the weekday on which the lecture takes place, where Monday is 1 and Friday is 5.

    [Image: class_info.json entry for COSC 065]

    GPA

    You can find participants' cumulative GPA, 2013 Spring term GPA and grades for COSC 065 in grades.csv.

    Class Deadlines

    deadlines.csv records the number of class deadlines for each participant from March 27, 2013 to June 5, 2013. Class deadlines include homework deadlines, projects, quizzes, mid-terms and finals.

    Piazza Usage

    piazza.csv contains participants' Piazza usage data. The definition of each column is as follows.

    Field Name Description
    days online number of days the student logged in CS65 Piazza class page
    views number of posts the student has viewed
    contribution number of posts, responses, edits, followups, and comments to followups (i.e., everything)
    questions number of questions the student has asked
    notes number of notes the student has posted
    answers number of questions the student has answered

    Please refer to Piazza.com for more detailed information.


Other Datasets

Other than the StudentLife dataset, we have also openly released datasets for other studies that are built on top of StudentLife or make use of the same sensing system:


Get in touch

If you have any questions regarding the study or dataset, contact andrew.t.p.campbell [at] gmail [dot] com.