cosinorage.datahandlers Module
Module Contents
Classes
- class DataHandler
Bases:
object
A base class for data handlers that process and store ENMO data at the minute level.
This class provides a common interface for data handlers with methods to load data, retrieve processed ENMO values, and save data. The load_data and save_data methods are intended to be overridden by subclasses.
- raw_data
Raw accelerometer data loaded from the source.
- Type:
pd.DataFrame or None
- sf_data
Filtered and processed accelerometer data.
- Type:
pd.DataFrame or None
- ml_data
Minute-level ENMO data calculated from processed data.
- Type:
pd.DataFrame or None
- meta_dict
Dictionary storing metadata about the data processing.
- Type:
dict
- __init__()
Initializes an empty DataHandler instance with an empty DataFrame for storing minute-level ENMO values.
Notes
This is a base class constructor. Subclasses should override this method to accept specific parameters for their data sources.
- save_data(output_path)
Save minute-level ENMO data to a specified output path.
This method is intended to be implemented by subclasses, specifying the format and structure for saving data.
- Parameters:
output_path (str) – The file path where the minute-level ENMO data will be saved.
- get_raw_data()
Retrieve the raw data.
- Returns:
A DataFrame containing the raw data.
- Return type:
pd.DataFrame
- get_sf_data()
Retrieve the filtered data.
- Returns:
A DataFrame containing the filtered data.
- Return type:
pd.DataFrame
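The subclassing pattern described for DataHandler can be sketched roughly as follows. This is a minimal illustration, not the library's code: BaseHandler is a stand-in that only mirrors the documented attributes, and InMemoryHandler, its frame argument, and the n_minutes metadata key are hypothetical names introduced for the example.

```python
import numpy as np
import pandas as pd


class BaseHandler:
    """Stand-in mirroring the documented attributes; NOT the library class."""

    def __init__(self):
        self.raw_data = None
        self.sf_data = None
        self.ml_data = None
        self.meta_dict = {}

    def load_data(self):
        raise NotImplementedError("subclasses provide the data source")

    def save_data(self, output_path):
        raise NotImplementedError("subclasses choose the output format")

    def get_raw_data(self):
        return self.raw_data

    def get_sf_data(self):
        return self.sf_data


class InMemoryHandler(BaseHandler):
    """Hypothetical subclass wrapping an already-loaded ENMO DataFrame."""

    def __init__(self, frame):
        super().__init__()
        self.frame = frame
        self.load_data()

    def load_data(self):
        self.raw_data = self.frame
        self.sf_data = self.frame  # no filtering is applied in this sketch
        # Aggregate to one row per minute, as the base class description implies.
        self.ml_data = self.frame[["ENMO"]].resample("min").mean()
        self.meta_dict["n_minutes"] = len(self.ml_data)  # assumed metadata key

    def save_data(self, output_path):
        self.ml_data.to_csv(output_path)


index = pd.date_range("2023-01-01", periods=120, freq="s")
handler = InMemoryHandler(pd.DataFrame({"ENMO": np.full(120, 0.05)}, index=index))
print(handler.meta_dict)
```

The constructor calls load_data itself so that the getters return data immediately after construction; a real subclass may defer loading instead.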
Utility Functions
Generic Data Functions
Galaxy Smartwatch Data Functions
UK Biobank Data Functions
NHANES Data Functions
General Utility Functions
- filter_incomplete_days(df, data_freq, expected_points_per_day=None)
Filter out data from incomplete days to ensure 24-hour data periods.
This function removes data from days that don’t have the expected number of data points to ensure that only complete 24-hour data is retained for analysis.
- Parameters:
df (pd.DataFrame) – DataFrame with datetime index, which is used to determine the day. The index should contain datetime objects.
data_freq (float) – Frequency of data collection in Hz (e.g., 1/60 for minute-level data).
expected_points_per_day (int, optional) – Expected number of data points per day. If None, calculated using data_freq * 86400.
- Returns:
Filtered DataFrame containing only complete days. Returns empty DataFrame if an error occurs during processing.
- Return type:
pd.DataFrame
Notes
Calculates expected points per day as data_freq * 60 * 60 * 24 if not provided
Groups data by date and counts points per day
Retains only days with sufficient data points
Removes the temporary ‘DATE’ column before returning
Handles errors gracefully by returning empty DataFrame
Examples
>>> import numpy as np
>>> import pandas as pd
>>>
>>> # Create sample data with some incomplete days
>>> dates = pd.date_range('2023-01-01', periods=5000, freq='min')
>>> data = pd.DataFrame({'value': np.random.randn(5000)}, index=dates)
>>>
>>> # Filter incomplete days (expecting 1440 points per day for minute data)
>>> filtered_data = filter_incomplete_days(data, data_freq=1/60, expected_points_per_day=1440)
>>> print(f"Original days: {data.index.normalize().nunique()}")
>>> print(f"Complete days: {filtered_data.index.normalize().nunique()}")
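The steps listed in the Notes can be sketched as below. This is an illustration under stated assumptions rather than the library's implementation: keep_complete_days is a hypothetical name, "sufficient" is read as "at least the expected count", and the day is derived from the normalized index instead of a temporary 'DATE' column.

```python
import numpy as np
import pandas as pd


def keep_complete_days(df, data_freq, expected_points_per_day=None):
    """Hypothetical helper: keep only days with the expected number of rows."""
    if expected_points_per_day is None:
        # data_freq is in Hz, so one day holds data_freq * 86400 samples.
        expected_points_per_day = int(round(data_freq * 60 * 60 * 24))
    day = df.index.normalize()  # midnight timestamp of each row's day
    points_per_day = df.groupby(day).size()
    complete_days = points_per_day[points_per_day >= expected_points_per_day].index
    return df[day.isin(complete_days)]


dates = pd.date_range("2023-01-01", periods=5000, freq="min")
data = pd.DataFrame({"value": np.arange(5000.0)}, index=dates)
complete = keep_complete_days(data, data_freq=1 / 60)
print(len(data), len(complete))
```

Rounding the expected count guards against floating-point error in data_freq * 86400 when the frequency is a fraction such as 1/60.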
- filter_consecutive_days(df)
Filter DataFrame to retain only the longest sequence of consecutive days.
This function identifies the longest sequence of consecutive days in the data and filters the DataFrame to include only those days. This is important for circadian rhythm analysis which requires continuous data.
- Parameters:
df (pd.DataFrame) – DataFrame with datetime index containing the data to filter.
- Returns:
Filtered DataFrame containing only the longest sequence of consecutive days.
- Return type:
pd.DataFrame
- Raises:
ValueError – If fewer than 2 consecutive days are found in the data.
Notes
Extracts unique dates from the datetime index
Finds the longest consecutive sequence using largest_consecutive_sequence
Requires at least 2 consecutive days for valid analysis
Filters the DataFrame to include only data from consecutive days
Important for circadian rhythm analysis which requires continuous data
Examples
>>> import numpy as np
>>> import pandas as pd
>>>
>>> # Create sample data with gaps
>>> dates = pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03',
...                         '2023-01-05', '2023-01-06', '2023-01-07'])
>>> data = pd.DataFrame({'value': np.random.randn(len(dates))}, index=dates)
>>>
>>> # Filter to longest consecutive sequence
>>> filtered_data = filter_consecutive_days(data)
>>> print(f"Original dates: {data.index.date.tolist()}")
>>> print(f"Consecutive dates: {filtered_data.index.date.tolist()}")
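One way to picture the filtering is a pandas-only sketch that labels runs of consecutive days and keeps the longest. It is an assumption-laden illustration, not the library's code: keep_longest_consecutive_days is a hypothetical name, the run labelling uses diff and cumsum instead of largest_consecutive_sequence, and a tie is resolved in favour of the earliest run.

```python
import pandas as pd


def keep_longest_consecutive_days(df):
    """Hypothetical helper: keep rows from the longest run of consecutive days."""
    days = pd.Series(df.index.normalize().unique()).sort_values(ignore_index=True)
    # A new run starts wherever the gap to the previous day is not exactly one day.
    run_id = (days.diff() != pd.Timedelta(days=1)).cumsum()
    run_sizes = run_id.groupby(run_id).size()
    longest = run_sizes.idxmax()  # earliest run wins a tie (assumption)
    keep = days[run_id == longest]
    if len(keep) < 2:
        raise ValueError("fewer than 2 consecutive days found")
    return df[df.index.normalize().isin(keep)]


index = pd.to_datetime(
    ["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-05", "2023-01-06"]
)
frame = pd.DataFrame({"value": range(len(index))}, index=index)
kept = keep_longest_consecutive_days(frame)
print(kept.index.date.tolist())
```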
- largest_consecutive_sequence(dates)
Find the longest sequence of consecutive dates in a list.
This function analyzes a list of dates and returns the longest subsequence of consecutive dates. It’s used to identify continuous periods of data for circadian rhythm analysis.
- Parameters:
dates (List[datetime]) – List of dates to analyze for consecutive sequences.
- Returns:
Longest sequence of consecutive dates found. Returns empty list if input is empty.
- Return type:
List[datetime]
Notes
Sorts and removes duplicate dates before processing
Compares dates using timedelta(days=1) for consecutive day detection
Maintains the original order within consecutive sequences
Handles edge cases like empty lists and single dates
Used internally by filter_consecutive_days
Examples
>>> from datetime import datetime
>>>
>>> # Example with gaps in dates
>>> dates = [
...     datetime(2023, 1, 1), datetime(2023, 1, 2), datetime(2023, 1, 3),
...     datetime(2023, 1, 5), datetime(2023, 1, 6), datetime(2023, 1, 7)
... ]
>>> consecutive = largest_consecutive_sequence(dates)
>>> print(f"Longest consecutive sequence: {consecutive}")
>>> # Output: [datetime(2023, 1, 5), datetime(2023, 1, 6), datetime(2023, 1, 7)]
>>>
>>> # Example with single date
>>> single_date = [datetime(2023, 1, 1)]
>>> result = largest_consecutive_sequence(single_date)
>>> print(f"Single date result: {result}")
>>> # Output: [datetime(2023, 1, 1)]
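The scan described in the Notes can be sketched in plain Python. This is an illustration, not the library's source: longest_run is a hypothetical name, and the choice to let a later run win a tie (the `>=` comparison) is an assumption made only so the result agrees with the example output shown above.

```python
from datetime import datetime, timedelta


def longest_run(dates):
    """Hypothetical helper: longest run of dates spaced exactly one day apart."""
    unique_sorted = sorted(set(dates))  # sort and drop duplicates first
    if not unique_sorted:
        return []
    best = current = [unique_sorted[0]]
    for d in unique_sorted[1:]:
        if d - current[-1] == timedelta(days=1):
            current = current + [d]  # extend the current run
        else:
            current = [d]  # gap found: start a new run
        if len(current) >= len(best):  # ">=" lets a later run win a tie
            best = current
    return best


gapped = [
    datetime(2023, 1, 1), datetime(2023, 1, 2),
    datetime(2023, 1, 4), datetime(2023, 1, 5), datetime(2023, 1, 6),
]
print(longest_run(gapped))
```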
- calculate_enmo(data, verbose=False)
Calculate the Euclidean Norm Minus One (ENMO) metric from accelerometer data.
This function computes the ENMO metric, which is a widely used measure in physical activity research for quantifying acceleration while accounting for gravity.
- Parameters:
data (pd.DataFrame) – DataFrame containing accelerometer data with columns ‘x’, ‘y’, and ‘z’ (acceleration values along the X, Y, and Z axes). All values should be in g units (1 g ≈ 9.81 m/s²).
verbose (bool, default=False) – If True, prints processing information.
- Returns:
Array of ENMO values. Values are truncated at 0, meaning negative values are set to 0. Returns np.nan if calculation fails.
- Return type:
numpy.ndarray
Notes
ENMO = sqrt(x² + y² + z²) - 1
Values are truncated at 0 (negative values become 0)
ENMO represents acceleration in excess of 1g (gravity)
Commonly used in physical activity and sleep research
Handles errors gracefully by returning np.nan
Examples
>>> import pandas as pd
>>> import numpy as np
>>>
>>> # Create sample accelerometer data
>>> data = pd.DataFrame({
...     'x': [0.1, 0.2, 0.3],
...     'y': [0.1, 0.2, 0.3],
...     'z': [1.0, 1.1, 1.2]  # Close to 1g (gravity)
... })
>>>
>>> # Calculate ENMO
>>> enmo_values = calculate_enmo(data, verbose=True)
>>> print(f"ENMO values: {enmo_values}")
>>> # Output: [0.010, 0.136, 0.273] (approximately)
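The formula in the Notes, ENMO = max(sqrt(x² + y² + z²) - 1, 0), can be checked by hand with a few lines of NumPy. This is a sketch under the documented definition only; enmo_from_xyz is a hypothetical name, and the library's error handling (returning np.nan on failure) is not reproduced.

```python
import numpy as np
import pandas as pd


def enmo_from_xyz(data):
    """Hypothetical helper: Euclidean norm minus one, truncated at zero."""
    magnitude = np.linalg.norm(data[["x", "y", "z"]].to_numpy(dtype=float), axis=1)
    return np.maximum(magnitude - 1.0, 0.0)  # negative values become 0


sample = pd.DataFrame({
    "x": [0.1, 0.0, 0.0],
    "y": [0.1, 0.0, 0.0],
    "z": [1.0, 0.5, 2.0],  # in g units
})
enmo = enmo_from_xyz(sample)
print(enmo)
```

The second row has a magnitude of 0.5 g, so its raw value of -0.5 is truncated to 0, while the third row (2 g) yields an ENMO of 1.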
- calculate_minute_level_enmo(data, meta_dict={}, verbose=False)
Resample high-frequency ENMO data to minute-level by averaging over each minute.
This function aggregates high-frequency ENMO data to minute-level resolution using mean aggregation, which is the standard approach for circadian rhythm analysis.
- Parameters:
data (pd.DataFrame) – DataFrame with datetime index and ‘ENMO’ column containing high-frequency ENMO data. Optional ‘wear’ column for wear time information.
meta_dict (dict, default={}) – Dictionary containing metadata. Should include ‘sf’, the sampling frequency in Hz (defaults to 25 Hz if not specified).
verbose (bool, default=False) – If True, prints processing information.
- Returns:
DataFrame containing minute-level aggregated data with columns ‘ENMO’ (mean ENMO value for each minute) and ‘wear’ (mean wear time for each minute, present only if the input has a wear column). The index is a datetime index at minute resolution.
- Return type:
pd.DataFrame
- Raises:
ValueError – If sampling frequency is less than 1/60 Hz (less than one sample per minute).
Notes
Uses pandas resample(‘min’).mean() for aggregation
Handles both ENMO and wear columns if present
Converts index to datetime format
Standard preprocessing step for circadian rhythm analysis
Handles errors gracefully by returning empty DataFrame
Examples
>>> import numpy as np
>>> import pandas as pd
>>>
>>> # Create sample high-frequency ENMO data (1 hour of second-level data)
>>> dates = pd.date_range('2023-01-01 00:00:00', periods=3600, freq='s')
>>> data = pd.DataFrame({
...     'ENMO': np.random.uniform(0, 0.1, 3600),
...     'wear': np.random.choice([0, 1], 3600)
... }, index=dates)
>>>
>>> # Resample to minute level
>>> meta_dict = {'sf': 1}  # 1 Hz sampling frequency
>>> minute_data = calculate_minute_level_enmo(data, meta_dict=meta_dict, verbose=True)
>>> print(f"Original records: {len(data)}")
>>> print(f"Minute-level records: {len(minute_data)}")
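The aggregation named in the Notes, resample('min').mean(), behaves as in the sketch below. It is a simplified illustration, not the library function: to_minute_level is a hypothetical name, and the sampling-frequency check and error handling described above are left out.

```python
import numpy as np
import pandas as pd


def to_minute_level(data):
    """Hypothetical helper: average ENMO (and wear, if present) per minute."""
    data = data.copy()
    data.index = pd.to_datetime(data.index)  # make sure the index is datetime
    columns = [c for c in ("ENMO", "wear") if c in data.columns]
    return data[columns].resample("min").mean()


# Three minutes of 1 Hz data with a constant ENMO value in each minute.
seconds = pd.date_range("2023-01-01 00:00:00", periods=180, freq="s")
high_freq = pd.DataFrame(
    {"ENMO": np.repeat([0.0, 0.1, 0.2], 60), "wear": np.ones(180)},
    index=seconds,
)
minute_level = to_minute_level(high_freq)
print(minute_level)
```

Because each minute holds a constant value, the per-minute means reproduce 0.0, 0.1, and 0.2, which makes the aggregation easy to verify by eye.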
- detect_frequency_from_timestamps(timestamps)
Detect sampling frequency by finding the most common time delta.
This function analyzes a series of timestamps to determine the sampling frequency of the data by calculating the time differences between consecutive samples and finding the most frequently occurring interval.
- Parameters:
timestamps (pd.Series) – Series or array of datetime objects representing the timestamps of data points. Can be pandas datetime objects, numpy datetime64, or string timestamps that can be converted to datetime.
- Returns:
Sampling frequency in Hz (samples per second).
- Return type:
float
- Raises:
ValueError – If fewer than two timestamps are provided, if no time deltas can be calculated, if the most common time delta is zero, or if the mode cannot be determined.
Notes
The function converts all timestamps to pandas datetime format
Time deltas are calculated in seconds
The most common (mode) time delta is used to determine frequency
Frequency is calculated as 1.0 / most_common_delta
Examples
>>> import pandas as pd
>>>
>>> # Regular 25 Hz sampling
>>> timestamps = pd.date_range('2023-01-01', periods=100, freq='40ms')
>>> freq = detect_frequency_from_timestamps(timestamps)
>>> print(f"Detected frequency: {freq:.1f} Hz")
Detected frequency: 25.0 Hz
>>>
>>> # Irregular sampling with some missing points
>>> irregular_times = pd.to_datetime([
...     '2023-01-01 00:00:00',
...     '2023-01-01 00:00:00.040',
...     '2023-01-01 00:00:00.080',
...     '2023-01-01 00:00:00.120',
...     '2023-01-01 00:00:00.200',  # Gap here
...     '2023-01-01 00:00:00.240'
... ])
>>> freq = detect_frequency_from_timestamps(irregular_times)
>>> print(f"Detected frequency: {freq:.1f} Hz")
Detected frequency: 25.0 Hz
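The mode-of-deltas approach described in the Notes can be sketched as follows. This is an illustration of the documented steps, not the library's source: detect_frequency is a hypothetical name and the error messages are invented for the example.

```python
import pandas as pd


def detect_frequency(timestamps):
    """Hypothetical helper: sampling frequency (Hz) from the most common time delta."""
    ts = pd.Series(pd.to_datetime(timestamps)).reset_index(drop=True)
    if len(ts) < 2:
        raise ValueError("at least two timestamps are required")
    deltas = ts.diff().dropna().dt.total_seconds()  # gaps in seconds
    modes = deltas.mode()
    if modes.empty:
        raise ValueError("could not determine the most common time delta")
    most_common_delta = modes.iloc[0]
    if most_common_delta == 0:
        raise ValueError("most common time delta is zero")
    return 1.0 / most_common_delta


gappy = pd.to_datetime([
    "2023-01-01 00:00:00.000",
    "2023-01-01 00:00:00.040",
    "2023-01-01 00:00:00.080",
    "2023-01-01 00:00:00.120",
    "2023-01-01 00:00:00.200",  # one sample missing before this point
    "2023-01-01 00:00:00.240",
])
print(f"{detect_frequency(gappy):.1f} Hz")
```

Using the mode rather than the mean keeps occasional dropped samples from biasing the estimate: the single 80 ms gap above does not change the result.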