cosinorage.datahandlers Module
Module Contents
This module provides functionality to load accelerometer data or minute-level ENMO data from CSV files and to process it into a DataFrame of minute-level ENMO values.
Classes
- class DataHandler
Bases:
object
A base class for data handlers that process and store ENMO data at the minute level.
This class provides a common interface for data handlers with methods to load data, retrieve processed ENMO values, and save data. The load_data and save_data methods are intended to be overridden by subclasses.
- raw_data¶
Raw accelerometer data loaded from the source.
- Type:
pd.DataFrame or None
- sf_data¶
Filtered and processed accelerometer data.
- Type:
pd.DataFrame or None
- ml_data¶
Minute-level ENMO data calculated from processed data.
- Type:
pd.DataFrame or None
- meta_dict¶
Dictionary storing metadata about the data processing.
- Type:
dict
- __init__()[source]¶
Initializes an empty DataHandler instance with an empty DataFrame for storing minute-level ENMO values.
Notes
This is a base class constructor. Subclasses should override this method to accept specific parameters for their data sources.
- save_data(output_path)[source]¶
Save minute-level ENMO data to a specified output path.
This method is intended to be implemented by subclasses, specifying the format and structure for saving data.
- Parameters:
output_path (str) – The file path where the minute-level ENMO data will be saved.
- get_raw_data()[source]¶
Retrieve the raw data.
- Returns:
A DataFrame containing the raw data.
- Return type:
pd.DataFrame
- get_sf_data()[source]¶
Retrieve the filtered data.
- Returns:
A DataFrame containing the filtered data.
- Return type:
pd.DataFrame
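The override pattern described for load_data and save_data could look roughly like the sketch below. HandlerSketch is a stand-in that only mirrors the attributes documented above, and InMemoryHandler is hypothetical; neither is the cosinorage implementation.

```python
import io

import pandas as pd


class HandlerSketch:
    """Stand-in base class mirroring the documented attributes (assumption)."""

    def __init__(self):
        self.raw_data = None
        self.sf_data = None
        self.ml_data = None
        self.meta_dict = {}

    def load_data(self):
        raise NotImplementedError("subclasses provide the loading logic")

    def save_data(self, output_path):
        raise NotImplementedError("subclasses provide the saving logic")

    def get_raw_data(self):
        return self.raw_data


class InMemoryHandler(HandlerSketch):
    """Hypothetical subclass that treats an existing frame as minute-level ENMO."""

    def __init__(self, frame):
        super().__init__()
        self._frame = frame
        self.load_data()

    def load_data(self):
        self.raw_data = self._frame.sort_index()
        self.ml_data = self.raw_data.copy()
        self.meta_dict["raw_n_datapoints"] = len(self.raw_data)

    def save_data(self, output_path):
        # output_path may be a file path or any writable text buffer
        self.ml_data.to_csv(output_path)


index = pd.date_range("2023-01-01", periods=3, freq="min")
handler = InMemoryHandler(pd.DataFrame({"ENMO": [0.0, 12.5, 3.0]}, index=index))
buffer = io.StringIO()
handler.save_data(buffer)
```

Keeping the source-specific work inside load_data and save_data leaves the getters identical across handlers.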
- class NHANESDataHandler(nhanes_file_dir, seqn=None, verbose=False)
Bases:
DataHandler
Data handler for NHANES accelerometer data.
This class handles loading, filtering, and processing of NHANES accelerometer data.
- Parameters:
nhanes_file_dir (str)
seqn (int | None)
verbose (bool)
- nhanes_file_dir¶
Directory containing NHANES data files.
- Type:
str
- seqn¶
ID of the person whose data is being loaded.
- Type:
int or None
- __init__(nhanes_file_dir, seqn=None, verbose=False)
Initializes an NHANESDataHandler for the given NHANES data directory and, optionally, the ID of the person whose data is loaded.
- Parameters:
nhanes_file_dir (str)
seqn (int | None)
verbose (bool)
- class GalaxyDataHandler(galaxy_file_path, data_format='binary', data_type=None, time_column=None, data_columns=None, preprocess_args={}, verbose=False)
Bases:
DataHandler
Unified data handler for Samsung Galaxy Watch accelerometer data.
This class handles loading, filtering, and processing of Galaxy Watch accelerometer data in both binary and CSV formats. Currently supports:
    - Binary format with accelerometer data type
    - CSV format with ENMO data type
- Parameters:
galaxy_file_path (str)
data_format (str)
data_type (str | None)
time_column (str | None)
data_columns (list | None)
preprocess_args (dict)
verbose (bool)
- galaxy_file_path¶
Path to the Galaxy Watch data file (for CSV) or directory (for binary).
- Type:
str
- data_format¶
Format of the data (‘csv’ or ‘binary’).
- Type:
str
- data_type¶
Type of the data (‘enmo’ or ‘accelerometer’).
- Type:
str
- time_column¶
Name of the timestamp column.
- Type:
str
- data_columns¶
Names of the data columns.
- Type:
list
- preprocess_args¶
Arguments for preprocessing.
- Type:
dict
- __init__(galaxy_file_path, data_format='binary', data_type=None, time_column=None, data_columns=None, preprocess_args={}, verbose=False)
Initializes a GalaxyDataHandler for the given Galaxy Watch file path, data format, and data type.
- Parameters:
galaxy_file_path (str)
data_format (str)
data_type (str | None)
time_column (str | None)
data_columns (list | None)
preprocess_args (dict)
verbose (bool)
- class UKBDataHandler(qa_file_path, ukb_file_dir, eid, verbose=False)
Bases:
DataHandler
Data handler for UK Biobank accelerometer data.
This class handles loading, filtering, and processing of UK Biobank accelerometer data.
- Parameters:
qa_file_path (str)
ukb_file_dir (str)
eid (int)
verbose (bool)
- qa_file_path¶
Path to quality assessment file.
- Type:
str
- ukb_file_dir¶
Directory containing UK Biobank data files.
- Type:
str
- eid¶
Participant ID.
- Type:
int
- __init__(qa_file_path, ukb_file_dir, eid, verbose=False)
Initializes a UKBDataHandler for the given quality assessment file, UK Biobank data directory, and participant ID.
- Parameters:
qa_file_path (str)
ukb_file_dir (str)
eid (int)
verbose (bool)
- class GenericDataHandler(file_path, data_format='csv', data_type='accelerometer-mg', time_format='unix-ms', time_column='timestamp', time_zone=None, data_columns=None, preprocess_args={}, verbose=False)
Bases:
DataHandler
Generic data handler for processing accelerometer and ENMO data from CSV files.
This class provides a flexible interface for loading and processing various types of accelerometer data, including ENMO (Euclidean Norm Minus One), raw accelerometer data (x, y, z), and alternative count data. It supports automatic data filtering, resampling, preprocessing, and ENMO calculation.
- Parameters:
file_path (str)
data_format (str)
data_type (str)
time_format (str)
time_column (str)
time_zone (str | None)
data_columns (list | None)
preprocess_args (dict)
verbose (bool)
- file_path¶
Path to the CSV file containing the data.
- Type:
str
- data_format¶
Format of the data file.
- Type:
str
- data_type¶
Type of data in the file.
- Type:
str
- time_format¶
Format of timestamps.
- Type:
str
- time_column¶
Name of the timestamp column.
- Type:
str
- time_zone¶
Timezone for datetime conversion.
- Type:
str or None
- data_columns¶
Names of the data columns.
- Type:
list
- preprocess_args¶
Preprocessing arguments.
- Type:
dict
- raw_data¶
Raw data loaded from the file with timestamp index.
- Type:
pd.DataFrame or None
- sf_data¶
Data after filtering and resampling (sensor fusion data).
- Type:
pd.DataFrame or None
- ml_data¶
Minute-level ENMO data calculated from the processed data.
- Type:
pd.DataFrame or None
- meta_dict¶
Metadata dictionary containing information about the data processing.
- Type:
dict
Examples
Load ENMO data from a CSV file:
>>> handler = GenericDataHandler(
...     file_path='data/enmo_data.csv',
...     data_type='enmo-mg',
...     time_column='timestamp',
...     data_columns=['enmo']
... )
>>> raw_data = handler.get_raw_data()
>>> ml_data = handler.get_ml_data()
Load accelerometer data from a CSV file:
>>> handler = GenericDataHandler(
...     file_path='data/accel_data.csv',
...     data_type='accelerometer-mg',
...     time_column='time',
...     data_columns=['x', 'y', 'z']
... )
>>> raw_data = handler.get_raw_data()
>>> ml_data = handler.get_ml_data()
Notes
The data processing pipeline includes:
1. Loading raw data from CSV file
2. Filtering incomplete days and selecting longest consecutive sequence
3. Resampling to minute-level data
4. Preprocessing (wear detection, noise removal, etc.)
5. Calculating minute-level ENMO values
The class automatically handles column mapping and timestamp processing.
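The final pipeline step can be illustrated with the usual definition of ENMO: the Euclidean norm of the three axes minus 1 g, with negative values clipped to zero. The sketch below assumes input in g and reports mg; the function name, the clipping, and the unit handling are assumptions for illustration, not a description of how the library computes it.

```python
import numpy as np
import pandas as pd


def minute_level_enmo(acc_g):
    """Compute ENMO in mg from x/y/z acceleration in g, averaged per minute."""
    norm = np.sqrt(acc_g["x"] ** 2 + acc_g["y"] ** 2 + acc_g["z"] ** 2)
    # Subtract gravity (1 g), clip negatives to zero, convert g to mg
    enmo_mg = (norm - 1.0).clip(lower=0) * 1000
    return enmo_mg.resample("min").mean().to_frame("ENMO")


# Two minutes of 1 Hz data: at rest in the first minute, 1.2 g on z in the second
index = pd.date_range("2023-01-01", periods=120, freq="s")
acc = pd.DataFrame({"x": 0.0, "y": 0.0, "z": 1.0}, index=index)
acc.loc[index[60:], "z"] = 1.2
enmo = minute_level_enmo(acc)
```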
- __init__(file_path, data_format='csv', data_type='accelerometer-mg', time_format='unix-ms', time_column='timestamp', time_zone=None, data_columns=None, preprocess_args={}, verbose=False)[source]¶
Initialize GenericDataHandler with CSV data file.
- Parameters:
file_path (str) – Path to the CSV file containing the data.
data_format (str, default='csv') – Format of the data file. Currently only ‘csv’ is supported.
data_type (str, default='accelerometer-mg') – Type of data in the file. Must be one of:
    - ‘enmo-mg’, ‘enmo-g’: ENMO (Euclidean Norm Minus One) data
    - ‘accelerometer-mg’, ‘accelerometer-g’, ‘accelerometer-ms2’: Raw accelerometer data (x, y, z)
    - ‘alternative_count’: Alternative count data
time_format (str, default='unix-ms') – Format of timestamps. Must be one of: ‘unix-ms’, ‘unix-s’, ‘datetime’.
time_column (str, default='timestamp') – Name of the timestamp column in the CSV file.
time_zone (str, optional) – Timezone for datetime conversion. If None, uses local timezone.
data_columns (list, optional) – Names of the data columns in the CSV file. If not provided, defaults are:
    - [‘enmo’] for data_type=‘enmo-mg’ or ‘enmo-g’
    - [‘x’, ‘y’, ‘z’] for data_type=‘accelerometer-mg’, ‘accelerometer-g’, or ‘accelerometer-ms2’
    - [‘counts’] for data_type=‘alternative_count’
preprocess_args (dict, default={}) – Additional preprocessing arguments to pass to the filtering and preprocessing functions.
verbose (bool, default=False) – Whether to print progress information during data loading and processing.
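As a rough illustration of the three documented time_format values, the sketch below converts a timestamp column with pandas. The helper name parse_timestamps is hypothetical, and time zone handling (the time_zone parameter) is deliberately left out.

```python
import pandas as pd


def parse_timestamps(values, time_format="unix-ms"):
    """Convert a column of timestamps to pandas datetimes (illustrative only)."""
    if time_format == "unix-ms":
        return pd.to_datetime(values, unit="ms")
    if time_format == "unix-s":
        return pd.to_datetime(values, unit="s")
    if time_format == "datetime":
        return pd.to_datetime(values)
    raise ValueError(f"unsupported time_format: {time_format!r}")


from_ms = parse_timestamps(pd.Series([1672531200000, 1672531260000]), "unix-ms")
from_s = parse_timestamps(pd.Series([1672531200, 1672531260]), "unix-s")
from_text = parse_timestamps(
    pd.Series(["2023-01-01 00:00:00", "2023-01-01 00:01:00"]), "datetime"
)
```

All three calls describe the same two instants, one minute apart, so a mismatch between the file and the chosen time_format shows up as implausible dates.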
Utility Functions¶
Generic Data Functions¶
- read_generic_xD_data(file_path, data_type, meta_dict, n_dimensions, time_format='unix-ms', time_column='timestamp', time_zone=None, data_columns=None, verbose=False)[source]¶
Read generic accelerometer or count data from a CSV file.
This function loads data from a CSV file and standardizes the column names for further processing. It supports both 1-dimensional (counts/ENMO) and 3-dimensional (accelerometer) data formats.
- Parameters:
file_path (str) – Path to the CSV file containing the data.
meta_dict (dict) – Dictionary to store metadata about the loaded data. Will be populated with:
    - raw_n_datapoints: Number of data points
    - raw_start_datetime: Start timestamp
    - raw_end_datetime: End timestamp
    - sf: Sampling frequency in Hz
    - raw_data_frequency: Sampling frequency as string
    - raw_data_type: Type of data (‘Counts’ or ‘Accelerometer’)
    - raw_data_unit: Unit of data (‘counts’ or ‘mg’)
n_dimensions (int) – Number of dimensions in the data. Must be either 1 (for counts/ENMO) or 3 (for accelerometer).
time_column (str, default='timestamp') – Name of the timestamp column in the CSV file.
data_columns (list, optional) – Names of the data columns in the CSV file. If not provided, defaults are:
    - [‘counts’] for n_dimensions=1
    - [‘x’, ‘y’, ‘z’] for n_dimensions=3
verbose (bool, default=False) – Whether to print progress information.
data_type (str)
time_format (str)
time_zone (str | None)
- Returns:
DataFrame containing the loaded data with standardized column names:
    - For n_dimensions=1: [‘ENMO’] (single column)
    - For n_dimensions=3: [‘x’, ‘y’, ‘z’] (three columns)
The DataFrame has a datetime index from the timestamp column.
- Return type:
pd.DataFrame
- Raises:
ValueError – If n_dimensions is not 1 or 3, or if the number of data_columns doesn’t match n_dimensions.
Examples
Load 1-dimensional count data:
>>> meta_dict = {}
>>> data = read_generic_xD_data(
...     file_path='data/counts.csv',
...     data_type='alternative_count',
...     meta_dict=meta_dict,
...     n_dimensions=1,
...     time_column='time',
...     data_columns=['counts']
... )
>>> print(data.columns)
Index(['ENMO'], dtype='object')
Load 3-dimensional accelerometer data:
>>> meta_dict = {}
>>> data = read_generic_xD_data(
...     file_path='data/accel.csv',
...     data_type='accelerometer-mg',
...     meta_dict=meta_dict,
...     n_dimensions=3,
...     time_column='timestamp',
...     data_columns=['accel_x', 'accel_y', 'accel_z']
... )
>>> print(data.columns)
Index(['x', 'y', 'z'], dtype='object')
Notes
The function automatically:
    - Converts timestamps to datetime objects
    - Removes timezone information
    - Fills missing values with 0
    - Sorts data by timestamp
    - Detects sampling frequency from timestamps
    - Populates metadata dictionary with data information
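One way to detect a sampling frequency from timestamps is to take the median spacing between consecutive samples. This is a minimal sketch under that assumption; the helper name is hypothetical and the library may use a different estimator.

```python
import pandas as pd


def estimate_sampling_frequency(index):
    """Estimate the sampling frequency in Hz from the median timestamp spacing."""
    spacing = index.to_series().diff().dropna().median()
    return 1.0 / spacing.total_seconds()


# 100 samples spaced 10 seconds apart correspond to 0.1 Hz
index = pd.date_range("2023-01-01", periods=100, freq="10s")
sf = estimate_sampling_frequency(index)
```

The median is used here because it is not thrown off by a few long gaps in the recording.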
- filter_generic_data(data, data_type, meta_dict={}, verbose=False, preprocess_args={})[source]¶
Filter generic data by removing incomplete days and selecting longest consecutive sequence.
This function applies data quality filters to ensure only complete and consecutive days of data are retained for analysis. It removes incomplete days and selects the longest sequence of consecutive days.
- Parameters:
data (pd.DataFrame) – Input DataFrame with datetime index containing accelerometer or count data.
data_type (str) – Type of data being processed. Must be one of:
    - ‘enmo’: ENMO (Euclidean Norm Minus One) data
    - ‘accelerometer’: Raw accelerometer data (x, y, z)
    - ‘alternative_count’: Alternative count data
meta_dict (dict, default={}) – Dictionary to store metadata about the filtering process. Will be updated with:
    - filtered_n_datapoints: Number of data points after filtering
    - filtered_start_datetime: Start timestamp after filtering
    - filtered_end_datetime: End timestamp after filtering
verbose (bool, default=False) – Whether to print progress information during filtering.
preprocess_args (dict, default={}) – Additional preprocessing arguments that may affect filtering behavior.
- Returns:
Filtered DataFrame containing only complete and consecutive days of data. The DataFrame maintains the same structure as the input.
- Return type:
pd.DataFrame
Notes
Removes days that don’t have the expected number of data points
Selects the longest sequence of consecutive days (minimum 4 days required)
Updates metadata with information about the filtered data
The function assumes 24-hour periods for day-based filtering
Examples
>>> import numpy as np
>>> import pandas as pd
>>>
>>> # Create sample data with some incomplete days
>>> dates = pd.date_range('2023-01-01', periods=10000, freq='min')
>>> data = pd.DataFrame({'ENMO': np.random.randn(10000)}, index=dates)
>>>
>>> # Filter the data
>>> meta_dict = {}
>>> filtered_data = filter_generic_data(
...     data, data_type='enmo', meta_dict=meta_dict, verbose=True
... )
>>> print(f"Original data points: {len(data)}")
>>> print(f"Filtered data points: {len(filtered_data)}")
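The two filtering ideas described above, dropping incomplete days and keeping the longest run of consecutive days, can be sketched with plain pandas. Both helper names are hypothetical, minute-level data (1440 points per day) is assumed, and the documented minimum of 4 days is not enforced here.

```python
import pandas as pd


def keep_complete_days(data, points_per_day=1440):
    """Drop calendar days that do not contain exactly points_per_day rows."""
    days = data.index.normalize()
    counts = data.groupby(days).size()
    complete = counts[counts == points_per_day].index
    return data[days.isin(complete)]


def longest_consecutive_run(data):
    """Keep the longest stretch of back-to-back calendar days."""
    days = data.index.normalize()
    best, current = [], []
    for day in days.unique().sort_values():
        if current and (day - current[-1]).days == 1:
            current.append(day)
        else:
            current = [day]
        if len(current) > len(best):
            best = list(current)
    return data[days.isin(best)]


# Ten days of minute-level data with a 100-minute hole on 2023-01-04
index = pd.date_range("2023-01-01", periods=10 * 1440, freq="min")
data = pd.DataFrame({"ENMO": 0.0}, index=index)
data = data.drop(index[3 * 1440 : 3 * 1440 + 100])
filtered = longest_consecutive_run(keep_complete_days(data))
```

In this example the incomplete fourth day splits the recording into a three-day and a six-day run, and only the six-day run is kept.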
- resample_generic_data(data, data_type, meta_dict={}, verbose=False)[source]¶
Resample generic data to minute-level resolution.
This function resamples high-frequency data to minute-level resolution using mean aggregation. This is a standard preprocessing step for circadian rhythm analysis.
- Parameters:
data (pd.DataFrame) – Input DataFrame with datetime index containing high-frequency data.
data_type (str) – Type of data being processed. Must be one of:
    - ‘enmo’: ENMO (Euclidean Norm Minus One) data
    - ‘accelerometer’: Raw accelerometer data (x, y, z)
    - ‘alternative_count’: Alternative count data
meta_dict (dict, default={}) – Dictionary to store metadata about the resampling process. Will be updated with:
    - resampled_n_datapoints: Number of data points after resampling
    - resampled_start_datetime: Start timestamp after resampling
    - resampled_end_datetime: End timestamp after resampling
verbose (bool, default=False) – Whether to print progress information during resampling.
- Returns:
Resampled DataFrame with minute-level resolution. The DataFrame maintains the same column structure as the input but with reduced temporal resolution.
- Return type:
pd.DataFrame
Notes
Uses pandas resample(‘min’).mean() for minute-level aggregation
The function assumes the input data has a datetime index
All columns are resampled using mean aggregation
Updates metadata with information about the resampled data
Examples
>>> import pandas as pd
>>> import numpy as np
>>>
>>> # Create sample high-frequency data (every 10 seconds)
>>> dates = pd.date_range('2023-01-01', periods=8640, freq='10s')  # 24 hours
>>> data = pd.DataFrame({
...     'ENMO': np.random.randn(8640),
...     'wear': np.ones(8640)
... }, index=dates)
>>>
>>> # Resample to minute level
>>> meta_dict = {}
>>> resampled_data = resample_generic_data(
...     data, data_type='enmo', meta_dict=meta_dict, verbose=True
... )
>>> print(f"Original number of points: {len(data)}")
>>> print(f"Resampled number of points: {len(resampled_data)}")
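The mean aggregation named in the notes can be checked on a small frame without the library. This sketch only uses pandas and is not the function's implementation.

```python
import numpy as np
import pandas as pd

# Two minutes of data sampled every 10 seconds (6 samples per minute)
index = pd.date_range("2023-01-01", periods=12, freq="10s")
data = pd.DataFrame({"ENMO": np.arange(12, dtype=float)}, index=index)

# Each output row is the mean of the samples falling into that minute
per_minute = data.resample("min").mean()
```

The first minute averages the values 0 to 5 and the second the values 6 to 11.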
- preprocess_generic_data(data, data_type, preprocess_args={}, meta_dict={}, verbose=False)[source]¶
Preprocess generic accelerometer data with calibration, noise removal, and wear detection.
This function applies a comprehensive preprocessing pipeline to accelerometer data, including calibration, noise filtering, and wear detection. The preprocessing steps are applied based on the data type and preprocessing arguments.
- Parameters:
data (pd.DataFrame) – Input DataFrame with datetime index containing accelerometer data. For accelerometer data, must have columns [‘x’, ‘y’, ‘z’].
data_type (str) – Type of data being processed. Must be one of:
    - ‘enmo’: ENMO (Euclidean Norm Minus One) data
    - ‘accelerometer’: Raw accelerometer data (x, y, z)
    - ‘alternative_count’: Alternative count data
preprocess_args (dict, default={}) – Dictionary containing preprocessing parameters:
    - ‘calibrate’: Whether to apply accelerometer calibration (default: False)
    - ‘sphere_crit’: Sphere fitting criterion for calibration (default: 0.3)
    - ‘sd_criteria’: Standard deviation criterion for calibration (default: 0.1)
    - ‘remove_noise’: Whether to apply noise filtering (default: False)
    - ‘filter_cutoff’: Cutoff frequency for noise filter in Hz (default: 2)
    - ‘detect_wear’: Whether to apply wear detection (default: False)
    - ‘sd_crit’: Standard deviation criterion for wear detection (default: 0.013)
    - ‘range_crit’: Range criterion for wear detection (default: 0.05)
    - ‘window_length’: Window length for wear detection in seconds (default: 60)
    - ‘window_skip’: Window skip for wear detection in seconds (default: 30)
meta_dict (dict, default={}) – Dictionary to store metadata about the preprocessing process.
verbose (bool, default=False) – Whether to print progress information during preprocessing.
- Returns:
Preprocessed DataFrame with the same structure as input but with applied preprocessing steps. May include additional columns like ‘wear’ if wear detection is enabled.
- Return type:
pd.DataFrame
Notes
Calibration is only applied to accelerometer data (data_type=‘accelerometer-mg’, ‘accelerometer-g’, ‘accelerometer-ms2’)
Noise removal uses a Butterworth low-pass filter
Wear detection adds a binary ‘wear’ column (1=worn, 0=not worn)
The function skips preprocessing steps that are not enabled in preprocess_args
All preprocessing steps are applied in sequence: calibration → noise removal → wear detection
Examples
>>> import pandas as pd
>>> import numpy as np
>>>
>>> # Create sample accelerometer data
>>> dates = pd.date_range('2023-01-01', periods=1440, freq='min')
>>> data = pd.DataFrame({
...     'x': np.random.randn(1440),
...     'y': np.random.randn(1440),
...     'z': np.random.randn(1440) + 1  # Add gravity component
... }, index=dates)
>>>
>>> # Apply preprocessing with wear detection
>>> preprocess_args = {
...     'calibrate': True,
...     'remove_noise': True,
...     'detect_wear': True
... }
>>> meta_dict = {}
>>> processed_data = preprocess_generic_data(
...     data, data_type='accelerometer',
...     preprocess_args=preprocess_args, meta_dict=meta_dict, verbose=True
... )
>>> print(f"Processed data shape: {processed_data.shape}")
>>> print(f"Wear column present: {'wear' in processed_data.columns}")
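To give a feel for how sd_crit and range_crit might drive a wear flag, the sketch below marks a window as not worn when at least two axes are nearly still. This rule, the non-overlapping windows, and the name flag_wear are assumptions patterned on a commonly used heuristic; window_skip is ignored, and the library's actual algorithm may differ.

```python
import numpy as np
import pandas as pd


def flag_wear(acc, window="60s", sd_crit=0.013, range_crit=0.05):
    """Return one flag per window: 1 = worn, 0 = not worn (illustrative rule)."""
    windows = acc[["x", "y", "z"]].resample(window)
    sd = windows.std()
    value_range = windows.max() - windows.min()
    # An axis counts as still when both its SD and its range are below threshold
    still_axes = ((sd < sd_crit) & (value_range < range_crit)).sum(axis=1)
    return (still_axes < 2).astype(int).rename("wear")


# Four minutes at 1 Hz: two minutes lying still, then two minutes of movement
index = pd.date_range("2023-01-01", periods=240, freq="s")
t = np.arange(240)
acc = pd.DataFrame({"x": 0.0, "y": 0.0, "z": 1.0}, index=index)
acc.loc[index[120:], "x"] = 0.5 * np.sin(t[120:])
acc.loc[index[120:], "y"] = 0.5 * np.cos(t[120:])
wear = flag_wear(acc)
```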
Galaxy Smartwatch Data Functions
- read_galaxy_binary_data(galaxy_file_dir, meta_dict, time_column='unix_timestamp_in_ms', data_columns=None, verbose=False)[source]¶
Read accelerometer data from Galaxy Watch binary files.
- Parameters:
galaxy_file_dir (str) – Directory containing Galaxy Watch data files
meta_dict (dict) – Dictionary to store metadata about the loaded data
time_column (str) – Name of the timestamp column in the binary data
data_columns (list) – Names of the data columns in the binary data
verbose (bool) – Whether to print progress information
- Returns:
DataFrame containing accelerometer data with columns [‘x’, ‘y’, ‘z’]
- Return type:
pd.DataFrame
- filter_galaxy_binary_data(data, meta_dict={}, verbose=False, preprocess_args={})[source]¶
Filter Galaxy Watch accelerometer data by removing incomplete days and selecting longest consecutive sequence.
- Parameters:
data (pd.DataFrame) – Raw accelerometer data
meta_dict (dict) – Dictionary to store metadata about the filtering process
verbose (bool) – Whether to print progress information
preprocess_args (dict)
- Returns:
Filtered accelerometer data
- Return type:
pd.DataFrame
- resample_galaxy_binary_data(data, meta_dict={}, verbose=False)[source]¶
Resample Galaxy Watch accelerometer data to a regular interval.
- Parameters:
data (pd.DataFrame) – Filtered accelerometer data
meta_dict (dict) – Dictionary to store metadata about the resampling process
verbose (bool) – Whether to print progress information
- Returns:
Resampled accelerometer data at regular frequency.
- Return type:
pd.DataFrame
- preprocess_galaxy_binary_data(data, preprocess_args={}, meta_dict={}, verbose=False)[source]¶
Preprocess Galaxy Watch accelerometer data including rescaling, calibration, noise removal, and wear detection.
- Parameters:
data (pd.DataFrame) – Resampled accelerometer data
preprocess_args (dict) – Dictionary containing preprocessing parameters
meta_dict (dict) – Dictionary to store metadata about the preprocessing
verbose (bool) – Whether to print progress information
- Returns:
Preprocessed accelerometer data with additional columns for raw values and wear detection
- Return type:
pd.DataFrame
- acceleration_data_to_dataframe(data)[source]¶
Convert binary acceleration data to pandas DataFrame.
This function converts raw binary acceleration data from Samsung Galaxy Watch into a structured pandas DataFrame format for further processing.
- Parameters:
data (object) – Binary acceleration data object containing samples with the following attributes:
    - acceleration_x: X-axis acceleration value
    - acceleration_y: Y-axis acceleration value
    - acceleration_z: Z-axis acceleration value
    - sensor_body_location: Location of the sensor on the body
    - unix_timestamp_in_ms: Timestamp in milliseconds since Unix epoch
    - effective_time_frame: Effective time frame for the sample
- Returns:
DataFrame containing accelerometer data with columns:
    - ‘acceleration_x’: X-axis acceleration values
    - ‘acceleration_y’: Y-axis acceleration values
    - ‘acceleration_z’: Z-axis acceleration values
    - ‘sensor_body_location’: Sensor location information
    - ‘unix_timestamp_in_ms’: Timestamps in milliseconds
    - ‘effective_time_frame’: Effective time frame information
- Return type:
pd.DataFrame
Notes
This function is used internally by read_galaxy_binary_data
The function iterates through all samples in the binary data object
Each sample is converted to a dictionary and added to the DataFrame
The resulting DataFrame maintains the original data structure from the binary file
Examples
>>> # This function is typically called internally by read_galaxy_binary_data
>>> # but can be used directly if you have binary data objects:
>>>
>>> # Load binary data (example)
>>> binary_data = load_acceleration_data("path/to/binary/file")
>>>
>>> # Convert to DataFrame
>>> df = acceleration_data_to_dataframe(binary_data)
>>> print(f"Converted {len(df)} acceleration samples")
>>> print(f"Columns: {df.columns.tolist()}")
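The sample-to-row conversion described in the notes can be sketched with a stand-in sample type. SampleSketch and samples_to_dataframe are hypothetical; the real binary data object and the way its samples are accessed are not shown in this reference, so the sketch simply takes an iterable of samples.

```python
from dataclasses import asdict, dataclass

import pandas as pd


@dataclass
class SampleSketch:
    """Stand-in for one decoded sample, using the documented attribute names."""

    acceleration_x: float
    acceleration_y: float
    acceleration_z: float
    sensor_body_location: str
    unix_timestamp_in_ms: int
    effective_time_frame: str


def samples_to_dataframe(samples):
    """Turn each sample into a dictionary and collect the rows in a DataFrame."""
    return pd.DataFrame([asdict(sample) for sample in samples])


samples = [
    SampleSketch(0.01, -0.02, 0.98, "wrist", 1672531200000, "n/a"),
    SampleSketch(0.02, -0.01, 0.99, "wrist", 1672531200040, "n/a"),
]
df = samples_to_dataframe(samples)
```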
- read_galaxy_csv_data(galaxy_file_path, meta_dict, time_column='timestamp', data_columns=None, verbose=False)[source]¶
Read ENMO data from Galaxy Watch CSV file.
This function loads ENMO (Euclidean Norm Minus One) data from Samsung Galaxy Watch CSV files and standardizes the format for further processing.
- Parameters:
galaxy_file_path (str) – Path to the Galaxy Watch CSV data file containing ENMO values.
meta_dict (dict) – Dictionary to store metadata about the loaded data. Will be populated with:
    - raw_n_datapoints: Number of data points
    - raw_start_datetime: Start timestamp
    - raw_end_datetime: End timestamp
    - sf: Sampling frequency in Hz
    - raw_data_frequency: Sampling frequency as string
    - raw_data_type: Type of data (‘ENMO’)
    - raw_data_unit: Unit of data (‘mg’)
time_column (str, default='timestamp') – Name of the timestamp column in the CSV file.
data_columns (list, optional) – Names of the data columns in the CSV file. If not provided, defaults to [‘enmo’].
verbose (bool, default=False) – Whether to print progress information during loading.
- Returns:
DataFrame containing ENMO data with standardized column names:
    - ‘ENMO’: ENMO values in mg units
The DataFrame has a datetime index from the timestamp column.
- Return type:
pd.DataFrame
Notes
The function automatically converts UTC timestamps to local time
Missing values are filled with 0
Data is sorted by timestamp
Sampling frequency is automatically detected from timestamps
Column names are standardized to ‘ENMO’ for consistency
Examples
>>> import pandas as pd
>>>
>>> # Load ENMO data from Galaxy Watch CSV file
>>> meta_dict = {}
>>> data = read_galaxy_csv_data(
...     galaxy_file_path='data/galaxy_enmo.csv',
...     meta_dict=meta_dict,
...     time_column='time',
...     data_columns=['enmo_mg'],
...     verbose=True
... )
>>> print(f"Loaded {len(data)} ENMO records")
>>> print(f"Sampling frequency: {meta_dict['sf']:.1f} Hz")
- filter_galaxy_csv_data(data, meta_dict={}, verbose=False, preprocess_args={})[source]¶
Filter Galaxy Watch ENMO data by removing incomplete days and selecting longest consecutive sequence.
This function applies data quality filters to Galaxy Watch ENMO data, including removal of incomplete days and selection of the longest consecutive sequence of days.
- Parameters:
data (pd.DataFrame) – Raw ENMO data with datetime index and ‘ENMO’ column.
meta_dict (dict, default={}) – Dictionary to store metadata about the filtering process. Should contain: - sf: Sampling frequency in Hz
verbose (bool, default=False) – Whether to print progress information during filtering.
preprocess_args (dict, default={}) – Dictionary containing filtering parameters: - required_daily_coverage: Minimum fraction of daily data required (default: 0.5)
- Returns:
Filtered ENMO data containing only complete and consecutive days. The DataFrame maintains the same structure as the input.
- Return type:
pd.DataFrame
Notes
Removes days that don’t meet the required daily coverage threshold
Selects the longest sequence of consecutive days (minimum 4 days required)
Resamples data to minute-level resolution
Removes incomplete first and last days
Updates metadata with information about the filtering process
Examples
>>> import numpy as np
>>> import pandas as pd
>>>
>>> # Create sample ENMO data
>>> dates = pd.date_range('2023-01-01', periods=10000, freq='min')
>>> data = pd.DataFrame({'ENMO': np.random.randn(10000)}, index=dates)
>>>
>>> # Filter the data
>>> meta_dict = {'sf': 1/60}  # 1 sample per minute
>>> preprocess_args = {'required_daily_coverage': 0.8}
>>> filtered_data = filter_galaxy_csv_data(
...     data, meta_dict=meta_dict, preprocess_args=preprocess_args, verbose=True
... )
>>> print(f"Original data points: {len(data)}")
>>> print(f"Filtered data points: {len(filtered_data)}")
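The required_daily_coverage threshold can be read as the fraction of the samples expected per day, given the sampling frequency sf, that are actually present. The sketch below implements that reading; the helper name and the exact definition of coverage are assumptions.

```python
import pandas as pd


def days_meeting_coverage(data, sf, required_daily_coverage=0.5):
    """Keep only calendar days whose share of expected samples meets the threshold."""
    expected_per_day = sf * 24 * 60 * 60
    days = data.index.normalize()
    coverage = data.groupby(days).size() / expected_per_day
    kept = coverage[coverage >= required_daily_coverage].index
    return data[days.isin(kept)]


# Two full days plus 600 minutes (about 42 % coverage) of a third day
index = pd.date_range("2023-01-01", periods=2 * 1440 + 600, freq="min")
data = pd.DataFrame({"ENMO": 0.0}, index=index)
kept = days_meeting_coverage(data, sf=1 / 60, required_daily_coverage=0.5)
```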
- resample_galaxy_csv_data(data, meta_dict={}, verbose=False)[source]¶
Ensure we have minute-level data across the whole timeseries.
This function resamples Galaxy Watch ENMO data to ensure consistent minute-level resolution across the entire time series.
- Parameters:
data (pd.DataFrame) – Filtered ENMO data with datetime index and ‘ENMO’ column.
meta_dict (dict, default={}) – Dictionary to store metadata about the resampling process.
verbose (bool, default=False) – Whether to print progress information during resampling.
- Returns:
Resampled ENMO data with consistent minute-level resolution. The DataFrame maintains the same structure as the input.
- Return type:
pd.DataFrame
Notes
Uses pandas resample(‘1min’) with linear interpolation
Backward-fills any remaining gaps with bfill()
Ensures consistent temporal resolution for analysis
Updates metadata with information about the resampling process
Examples
>>> import pandas as pd
>>>
>>> # Create sample ENMO data with irregular intervals
>>> dates = pd.to_datetime(['2023-01-01 00:00:00', '2023-01-01 00:01:30',
...                         '2023-01-01 00:03:00', '2023-01-01 00:04:30'])
>>> data = pd.DataFrame({'ENMO': [0.1, 0.2, 0.3, 0.4]}, index=dates)
>>>
>>> # Resample to minute level
>>> meta_dict = {}
>>> resampled_data = resample_galaxy_csv_data(data, meta_dict=meta_dict, verbose=True)
>>> print(f"Original data points: {len(data)}")
>>> print(f"Resampled data points: {len(resampled_data)}")
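The resample, interpolate, and backward-fill sequence from the notes might look like the following in plain pandas. Aggregating each minute with mean() before interpolating is an assumption made here for predictability; the library's exact chain of calls is not shown in this reference.

```python
import pandas as pd

index = pd.to_datetime([
    "2023-01-01 00:00:00", "2023-01-01 00:01:30",
    "2023-01-01 00:03:00", "2023-01-01 00:04:30",
])
data = pd.DataFrame({"ENMO": [0.1, 0.2, 0.3, 0.4]}, index=index)

# Bin to 1-minute rows, fill interior gaps linearly, then back-fill any leading gap
regular = data.resample("1min").mean().interpolate(method="linear").bfill()
```

Here the minute starting at 00:02 has no sample, so it receives the value halfway between its neighbours.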
- preprocess_galaxy_csv_data(data, preprocess_args={}, meta_dict={}, verbose=False)[source]¶
Preprocess Galaxy Watch ENMO data. The ENMO values are kept unchanged and a placeholder ‘wear’ column is added.
This function applies preprocessing steps to Galaxy Watch ENMO data. Currently, wear detection is not implemented for ENMO data as the algorithm relies on raw accelerometer data.
- Parameters:
data (pd.DataFrame) – Resampled ENMO data with datetime index and ‘ENMO’ column.
preprocess_args (dict, default={}) – Dictionary containing preprocessing parameters (currently not used for ENMO data).
meta_dict (dict, default={}) – Dictionary to store metadata about the preprocessing process.
verbose (bool, default=False) – Whether to print progress information during preprocessing.
- Returns:
Preprocessed ENMO data with additional columns:
    - ‘ENMO’: Original ENMO values
    - ‘wear’: Wear detection column (set to -1 for ENMO data)
- Return type:
pd.DataFrame
Notes
Wear detection is not implemented for ENMO data
The ‘wear’ column is set to -1 to indicate no wear detection
Future implementations may add wear detection for ENMO data
The function maintains the original ENMO values
Examples
>>> import numpy as np
>>> import pandas as pd
>>>
>>> # Create sample ENMO data
>>> dates = pd.date_range('2023-01-01', periods=1440, freq='min')
>>> data = pd.DataFrame({'ENMO': np.random.uniform(0, 0.1, 1440)}, index=dates)
>>>
>>> # Preprocess the data
>>> meta_dict = {}
>>> preprocess_args = {}
>>> processed_data = preprocess_galaxy_csv_data(
...     data, preprocess_args=preprocess_args, meta_dict=meta_dict, verbose=True
... )
>>> print(f"Processed data shape: {processed_data.shape}")
>>> print(f"Wear column present: {'wear' in processed_data.columns}")
UK Biobank Data Functions
- read_ukb_data(qc_file_path, enmo_file_dir, eid, meta_dict={}, verbose=False)[source]¶
Read and process UK Biobank accelerometer data for a specific participant.
This function loads and processes UK Biobank accelerometer data for a specific participant, applying quality control checks and converting the data to a standardized format.
- Parameters:
qc_file_path (str) – Path to the quality control CSV file containing participant metadata. Must contain columns: eid, acc_data_problem, acc_weartime, acc_calibration, acc_owndata, acc_interrupt_period.
enmo_file_dir (str) – Directory containing the ENMO data files (OUT_*.csv format).
eid (int) – Participant ID to process.
meta_dict (dict, default={}) – Dictionary to store metadata about the loaded data. Will be populated with:
    - raw_n_datapoints: Number of data points
    - raw_start_datetime: Start timestamp
    - raw_end_datetime: End timestamp
    - raw_data_frequency: Sampling frequency (‘minute-level’)
    - raw_data_type: Type of data (‘ENMO’)
    - raw_data_unit: Unit of data (‘mg’)
verbose (bool, default=False) – Whether to print processing information and progress.
- Returns:
DataFrame containing processed ENMO data with columns:
    - ‘ENMO’: Euclidean Norm Minus One values in milligravity units
The DataFrame has a datetime index.
- Return type:
pd.DataFrame
- Raises:
FileNotFoundError – If QC file or ENMO directory doesn’t exist.
ValueError – If participant data is invalid or fails quality control checks.
Notes
Applies multiple quality control filters from the QC file
Processes ENMO data from CSV files with acceleration headers
Converts timestamps to proper datetime format
Keeps ENMO values >= 0.1 and sets all other values to 0
Sorts data by timestamp for consistency
Examples
>>> # Load UK Biobank data for a specific participant
>>> qc_file_path = '/path/to/ukb_qc.csv'
>>> enmo_file_dir = '/path/to/enmo/files'
>>> eid = 12345  # Participant ID
>>> meta_dict = {}
>>> data = read_ukb_data(
...     qc_file_path=qc_file_path,
...     enmo_file_dir=enmo_file_dir,
...     eid=eid,
...     meta_dict=meta_dict,
...     verbose=True
... )
>>> print(f"Loaded {len(data)} ENMO records for participant {eid}")
>>> print(f"Data range: {data.index.min()} to {data.index.max()}")
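The thresholding rule mentioned in the notes (values of at least 0.1 are kept, everything else becomes 0) can be expressed with Series.where. This is only an illustration of the rule on made-up numbers, not the function's code.

```python
import pandas as pd

enmo = pd.Series([0.02, 0.1, 0.35, 0.09], name="ENMO")

# where() keeps entries satisfying the condition and replaces the rest with 0.0
thresholded = enmo.where(enmo >= 0.1, 0.0)
```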
- filter_ukb_data(data, meta_dict={}, verbose=False)[source]¶
Filter UK Biobank accelerometer data to ensure data quality.
This function applies data quality filters to UK Biobank ENMO data, including removal of incomplete days and selection of the longest consecutive sequence.
- Parameters:
data (pd.DataFrame) – Input DataFrame containing ENMO data with datetime index and ‘ENMO’ column.
meta_dict (dict, default={}) – Dictionary to store metadata about the filtering process.
verbose (bool, default=False) – Whether to print filtering information and progress.
- Returns:
Filtered DataFrame containing only complete and consecutive days of data. Maintains same structure as input DataFrame.
- Return type:
pd.DataFrame
Notes
Removes incomplete days using filter_incomplete_days (requires 1440 points per day)
Selects longest consecutive sequence using filter_consecutive_days
Assumes minute-level data (1/60 Hz sampling frequency)
Updates metadata with information about the filtering process
Examples
>>> import numpy as np
>>> import pandas as pd
>>>
>>> # Create sample UK Biobank data
>>> dates = pd.date_range('2023-01-01', periods=10000, freq='min')
>>> data = pd.DataFrame({'ENMO': np.random.uniform(0, 0.1, 10000)}, index=dates)
>>>
>>> # Filter the data
>>> meta_dict = {}
>>> filtered_data = filter_ukb_data(data, meta_dict=meta_dict, verbose=True)
>>> print(f"Original data points: {len(data)}")
>>> print(f"Filtered data points: {len(filtered_data)}")
- resample_ukb_data(data, meta_dict={}, verbose=False)[source]¶
Resample UK Biobank accelerometer data to ensure consistent 1-minute intervals.
This function ensures consistent minute-level resolution for UK Biobank ENMO data by resampling to 1-minute intervals and handling any gaps in the data.
- Parameters:
data (pd.DataFrame) – Input DataFrame containing ENMO data with datetime index and ‘ENMO’ column.
meta_dict (dict, default={}) – Dictionary to store metadata about the resampling process.
verbose (bool, default=False) – Whether to print resampling information and progress.
- Returns:
Resampled DataFrame with consistent 1-minute intervals. Missing values are interpolated linearly and any remaining gaps are filled using backward fill.
- Return type:
pd.DataFrame
Notes
Uses pandas resample(‘1min’) with linear interpolation
Applies backward fill (bfill) to handle any remaining gaps
Ensures consistent temporal resolution for analysis
Maintains data integrity and structure
Examples
>>> import pandas as pd
>>>
>>> # Create sample UK Biobank data with irregular intervals
>>> dates = pd.to_datetime(['2023-01-01 00:00:00', '2023-01-01 00:01:30',
...                         '2023-01-01 00:03:00', '2023-01-01 00:04:30'])
>>> data = pd.DataFrame({'ENMO': [0.1, 0.2, 0.3, 0.4]}, index=dates)
>>>
>>> # Resample to minute level
>>> meta_dict = {}
>>> resampled_data = resample_ukb_data(data, meta_dict=meta_dict, verbose=True)
>>> print(f"Original data points: {len(data)}")
>>> print(f"Resampled data points: {len(resampled_data)}")
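The resampling steps listed in the Notes for resample_ukb_data can be approximated with plain pandas. The following is a minimal sketch and not the library's implementation: resample_to_minutes is a hypothetical helper, and both the mean aggregation per minute and the order of interpolation and back-fill are assumptions.

```python
import pandas as pd


def resample_to_minutes(data: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for the documented resampling step."""
    # Put the data on a regular 1-minute grid; minutes without samples become NaN.
    on_grid = data.resample("1min").mean()
    # Fill interior gaps linearly, then back-fill anything still missing
    # (for example a leading NaN).
    return on_grid.interpolate(method="linear").bfill()


dates = pd.to_datetime([
    "2023-01-01 00:00:00", "2023-01-01 00:01:30",
    "2023-01-01 00:03:00", "2023-01-01 00:04:30",
])
data = pd.DataFrame({"ENMO": [0.1, 0.2, 0.3, 0.4]}, index=dates)
resampled = resample_to_minutes(data)
print(resampled)
```

In this sketch the minute starting at 00:02 contains no sample, so its value is interpolated from the neighbouring minutes.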
NHANES Data Functions¶
- read_nhanes_data(file_dir, seqn=None, meta_dict={}, verbose=False)[source]¶
Read and process NHANES accelerometer data files for a specific person.
This function loads and processes National Health and Nutrition Examination Survey (NHANES) accelerometer data for a specific participant. It handles the complex NHANES data structure including day-level, minute-level, and header files.
- Parameters:
file_dir (str) – Directory containing NHANES data files. Must contain: - PAXDAY_*.xpt: Day-level data files - PAXHD_*.xpt: Header data files - PAXMIN_*.xpt: Minute-level data files
seqn (str, optional) – Unique identifier for the participant. Required for data extraction.
meta_dict (dict, default={}) – Dictionary to store metadata about the loaded data. Will be populated with: - raw_n_datapoints: Number of data points - raw_start_datetime: Start timestamp - raw_end_datetime: End timestamp - raw_data_frequency: Sampling frequency (‘minute-level’) - raw_data_type: Type of data (‘accelerometer’) - raw_data_unit: Unit of data (‘MIMS’)
verbose (bool, default=False) – Whether to print processing status and progress information.
- Returns:
Processed accelerometer data with columns: - ‘x’, ‘y’, ‘z’: Accelerometer values in MIMS units - ‘wear’: Binary wear detection (1=worn, 0=not worn) - ‘sleep’: Binary sleep detection (1=sleep, 0=wake) - ‘paxpredm’: Original NHANES prediction values The DataFrame is indexed by timestamp.
- Return type:
pd.DataFrame
- Raises:
ValueError – If seqn is None or if no valid NHANES data is found for the participant.
Notes
Automatically detects and processes multiple NHANES data versions
Applies data quality filters (paxqfd < 1, valid_hours > 16)
Requires at least 4 days of valid data per participant
Filters for complete days (288 epochs per day)
Converts column names to lowercase for consistency
Removes byte-encoded data using remove_bytes function
Examples
>>> import os
>>>
>>> # Load NHANES data for a specific participant
>>> file_dir = '/path/to/nhanes/data'
>>> seqn = '12345'  # Participant ID
>>> meta_dict = {}
>>> data = read_nhanes_data(
...     file_dir=file_dir,
...     seqn=seqn,
...     meta_dict=meta_dict,
...     verbose=True
... )
>>> print(f"Loaded {len(data)} records for participant {seqn}")
>>> print(f"Data columns: {data.columns.tolist()}")
- filter_and_preprocess_nhanes_data(data, meta_dict={}, verbose=False)[source]¶
Filter NHANES accelerometer data for incomplete days and non-consecutive sequences.
This function applies data quality filters to NHANES accelerometer data and converts the data to the standard format used by the CosinorAge pipeline.
- Parameters:
data (pd.DataFrame) – Raw NHANES accelerometer data with columns [‘x’, ‘y’, ‘z’, ‘wear’, ‘sleep’, ‘paxpredm’] and datetime index.
meta_dict (dict, default={}) – Dictionary to store metadata about the filtering process. Will be populated with: - n_days: Number of valid days after filtering
verbose (bool, default=False) – Whether to print processing status and progress information.
- Returns:
Filtered and preprocessed accelerometer data with columns: - ‘x’, ‘y’, ‘z’: Accelerometer values converted from MIMS to mg units - ‘x_raw’, ‘y_raw’, ‘z_raw’: Original accelerometer values - ‘wear’: Binary wear detection - ‘sleep’: Binary sleep detection - ‘paxpredm’: Original NHANES prediction values - ‘ENMO’: Calculated ENMO values (scaled by factor of 257)
- Return type:
pd.DataFrame
Notes
Removes incomplete days using filter_incomplete_days
Selects longest consecutive sequence using filter_consecutive_days
Converts accelerometer values from MIMS to mg units (division by 9.81)
Calculates ENMO values with a scaling factor of 257 for parameter tuning
Stores original values in *_raw columns for reference
Examples
>>> import numpy as np
>>> import pandas as pd
>>>
>>> # Create sample NHANES data
>>> dates = pd.date_range('2023-01-01', periods=10000, freq='min')
>>> data = pd.DataFrame({
...     'x': np.random.randn(10000),
...     'y': np.random.randn(10000),
...     'z': np.random.randn(10000),
...     'wear': np.random.choice([0, 1], 10000),
...     'sleep': np.random.choice([0, 1], 10000),
...     'paxpredm': np.random.choice([0, 1, 2], 10000)
... }, index=dates)
>>>
>>> # Filter and preprocess the data
>>> meta_dict = {}
>>> processed_data = filter_and_preprocess_nhanes_data(
...     data, meta_dict=meta_dict, verbose=True
... )
>>> print(f"Processed data shape: {processed_data.shape}")
>>> print(f"Number of days: {meta_dict.get('n_days', 'N/A')}")
- resample_nhanes_data(data, meta_dict={}, verbose=False)[source]¶
Resample NHANES accelerometer data to 1-minute intervals using linear interpolation.
This function ensures consistent minute-level resolution for NHANES accelerometer data by resampling to 1-minute intervals and handling categorical variables appropriately.
- Parameters:
data (pd.DataFrame) – NHANES accelerometer data with datetime index and columns including ‘x’, ‘y’, ‘z’, ‘sleep’, ‘wear’.
meta_dict (dict, default={}) – Dictionary to store metadata about the resampling process.
verbose (bool, default=False) – Whether to print processing status and progress information.
- Returns:
Resampled accelerometer data with consistent 1-minute intervals. Categorical variables (‘sleep’, ‘wear’) are rounded to nearest integer.
- Return type:
pd.DataFrame
Notes
Uses pandas resample(‘1min’) with linear interpolation for continuous variables
Applies backward fill (bfill) to handle any remaining gaps
Rounds categorical variables (‘sleep’, ‘wear’) to nearest integer
Maintains data integrity for binary classification variables
Examples
>>> import pandas as pd
>>>
>>> # Create sample NHANES data with irregular intervals
>>> dates = pd.to_datetime(['2023-01-01 00:00:00', '2023-01-01 00:01:30',
...                         '2023-01-01 00:03:00', '2023-01-01 00:04:30'])
>>> data = pd.DataFrame({
...     'x': [0.1, 0.2, 0.3, 0.4],
...     'y': [0.1, 0.2, 0.3, 0.4],
...     'z': [0.1, 0.2, 0.3, 0.4],
...     'sleep': [0, 1, 0, 1],
...     'wear': [1, 1, 0, 1]
... }, index=dates)
>>>
>>> # Resample to minute level
>>> meta_dict = {}
>>> resampled_data = resample_nhanes_data(data, meta_dict=meta_dict, verbose=True)
>>> print(f"Original data points: {len(data)}")
>>> print(f"Resampled data points: {len(resampled_data)}")
- remove_bytes(df)[source]¶
Convert byte string columns to regular strings in a DataFrame.
This function handles byte-encoded string columns that are common in NHANES data files, converting them to UTF-8 encoded strings for proper processing.
- Parameters:
df (pd.DataFrame) – Input DataFrame containing potential byte string columns.
- Returns:
DataFrame with byte strings converted to UTF-8 strings. Non-byte string columns remain unchanged.
- Return type:
pd.DataFrame
Notes
Only processes columns with object dtype (likely to contain byte strings)
Uses UTF-8 encoding for conversion
Leaves non-byte string values unchanged
Common in NHANES data due to SAS file format
Examples
>>> import pandas as pd
>>>
>>> # Create sample DataFrame with byte strings
>>> data = {
...     'col1': [b'hello', b'world', 'normal_string'],
...     'col2': [1, 2, 3],
...     'col3': [b'byte1', b'byte2', b'byte3']
... }
>>> df = pd.DataFrame(data)
>>>
>>> # Convert byte strings
>>> cleaned_df = remove_bytes(df)
>>> print(cleaned_df['col1'].iloc[0])  # 'hello' instead of b'hello'
- clean_data(df, days)[source]¶
Clean NHANES minute-level data by applying quality filters.
This function applies multiple quality filters to NHANES minute-level data to ensure only valid measurements are included in the analysis.
- Parameters:
df (pd.DataFrame) – Raw minute-level NHANES data containing columns: - ‘SEQN’: Participant identifier - ‘PAXMTSM’: Minute-level timestamp - ‘PAXPREDM’: Prediction values - ‘PAXQFM’: Quality flag
days (pd.DataFrame) – Day-level NHANES data containing valid participant identifiers in ‘seqn’ column.
- Returns:
Cleaned minute-level data with invalid measurements and participants removed.
- Return type:
pd.DataFrame
Notes
Filters for participants present in day-level data
Removes measurements with PAXMTSM = -0.01 (invalid timestamp)
Excludes PAXPREDM values of 3 or 4 (invalid predictions)
Removes measurements with PAXQFM >= 1 (poor quality)
Examples
>>> import pandas as pd
>>>
>>> # Create sample NHANES data
>>> minute_data = pd.DataFrame({
...     'SEQN': ['12345', '12345', '12346', '12345'],
...     'PAXMTSM': [0, -0.01, 60, 120],
...     'PAXPREDM': [1, 2, 3, 1],
...     'PAXQFM': [0, 0, 1, 0]
... })
>>>
>>> day_data = pd.DataFrame({'seqn': ['12345']})
>>>
>>> # Clean the data
>>> cleaned_data = clean_data(minute_data, day_data)
>>> print(f"Original records: {len(minute_data)}")
>>> print(f"Cleaned records: {len(cleaned_data)}")
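The four filters listed in the Notes for clean_data can be written as one boolean mask. This is a sketch under assumptions, not the library's code: clean_minutes is a hypothetical name, and the column names and thresholds are taken directly from the description above.

```python
import pandas as pd


def clean_minutes(df: pd.DataFrame, days: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical restatement of the four documented filters."""
    keep = (
        df["SEQN"].isin(days["seqn"])    # participant present in day-level data
        & (df["PAXMTSM"] != -0.01)       # drop the invalid-timestamp marker
        & ~df["PAXPREDM"].isin([3, 4])   # drop invalid prediction codes
        & (df["PAXQFM"] < 1)             # keep rows without a quality flag
    )
    return df[keep]


minute_data = pd.DataFrame({
    "SEQN": ["12345", "12345", "12346", "12345"],
    "PAXMTSM": [0, -0.01, 60, 120],
    "PAXPREDM": [1, 2, 3, 1],
    "PAXQFM": [0, 0, 1, 0],
})
day_data = pd.DataFrame({"seqn": ["12345"]})
cleaned = clean_minutes(minute_data, day_data)
print(len(minute_data), len(cleaned))
```

With this sample input, the second row is dropped for its timestamp marker and the third row fails the participant, prediction and quality checks, so two rows remain.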
- calculate_measure_time(row)[source]¶
Calculate the measurement timestamp for a row of NHANES data.
This function converts NHANES timing information into actual datetime timestamps by combining the day start time with the seconds since midnight.
- Parameters:
row (pd.Series) – Row containing timing information: - ‘day1_start_time’: Start time of the first day in format “HH:MM:SS” - ‘paxssnmp’: Seconds since midnight (scaled by 80)
- Returns:
Calculated measurement timestamp combining base time and offset.
- Return type:
datetime
Notes
Converts day1_start_time string to datetime object
Divides paxssnmp by 80 to get actual seconds (NHANES scaling factor)
Adds the offset to the base time to get measurement timestamp
Used for creating proper datetime index for NHANES data
Examples
>>> import pandas as pd
>>>
>>> # Create sample row with timing information
>>> row = pd.Series({
...     'day1_start_time': '08:30:00',
...     'paxssnmp': 8000  # 100 seconds * 80
... })
>>>
>>> # Calculate measurement time
>>> measure_time = calculate_measure_time(row)
>>> print(f"Measurement time: {measure_time}")
>>> # Output: 1900-01-01 08:31:40 (base time + 100 seconds)
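The arithmetic described in the Notes for calculate_measure_time needs only the standard library. The helper name measure_time below is hypothetical and the sketch assumes the "HH:MM:SS" format stated in the parameter description.

```python
from datetime import datetime, timedelta


def measure_time(day1_start_time: str, paxssnmp: float) -> datetime:
    """Hypothetical helper mirroring the documented steps."""
    # strptime without a date component defaults to 1900-01-01.
    base = datetime.strptime(day1_start_time, "%H:%M:%S")
    # Divide by 80 to turn the scaled counter into seconds.
    return base + timedelta(seconds=paxssnmp / 80)


result = measure_time("08:30:00", 8000)
print(result)  # 1900-01-01 08:31:40
```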
General Utility Functions¶
- filter_incomplete_days(df, data_freq, expected_points_per_day=None)[source]¶
Filter out data from incomplete days to ensure 24-hour data periods.
This function removes data from days that don’t have the expected number of data points to ensure that only complete 24-hour data is retained for analysis.
- Parameters:
df (pd.DataFrame) – DataFrame with datetime index, which is used to determine the day. The index should contain datetime objects.
data_freq (float) – Frequency of data collection in Hz (e.g., 1/60 for minute-level data).
expected_points_per_day (int, optional) – Expected number of data points per day. If None, calculated using data_freq * 86400.
- Returns:
Filtered DataFrame containing only complete days. Returns empty DataFrame if an error occurs during processing.
- Return type:
pd.DataFrame
Notes
Calculates expected points per day as data_freq * 60 * 60 * 24 if not provided
Groups data by date and counts points per day
Retains only days with sufficient data points
Removes the temporary ‘DATE’ column before returning
Handles errors gracefully by returning empty DataFrame
Examples
>>> import numpy as np
>>> import pandas as pd
>>>
>>> # Create sample data with some incomplete days
>>> dates = pd.date_range('2023-01-01', periods=5000, freq='min')
>>> data = pd.DataFrame({'value': np.random.randn(5000)}, index=dates)
>>>
>>> # Filter incomplete days (expecting 1440 points per day for minute data)
>>> filtered_data = filter_incomplete_days(data, data_freq=1/60, expected_points_per_day=1440)
>>> print(f"Original days: {data.index.normalize().nunique()}")
>>> print(f"Complete days: {filtered_data.index.normalize().nunique()}")
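The day-completeness check described for filter_incomplete_days can be sketched as a group-and-count over calendar days. keep_complete_days is a hypothetical name, and treating "sufficient" as "at least the expected number of points" is an assumption.

```python
import pandas as pd


def keep_complete_days(df, data_freq, expected_points_per_day=None):
    """Hypothetical sketch: keep only days with the expected number of rows."""
    if expected_points_per_day is None:
        # Samples per second times seconds per day.
        expected_points_per_day = round(data_freq * 60 * 60 * 24)
    day = df.index.normalize()            # midnight of each row's calendar day
    counts = df.groupby(day).size()       # number of rows per day
    complete = counts[counts >= expected_points_per_day].index
    return df[day.isin(complete)]


dates = pd.date_range("2023-01-01", periods=5000, freq="min")
data = pd.DataFrame({"value": range(5000)}, index=dates)
filtered = keep_complete_days(data, data_freq=1 / 60)
print(data.index.normalize().nunique(), filtered.index.normalize().nunique())
```

Here 5000 minutes span three full days plus 680 minutes of a fourth day, so the fourth day is removed.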
- filter_consecutive_days(df)[source]¶
Filter DataFrame to retain only the longest sequence of consecutive days.
This function identifies the longest sequence of consecutive days in the data and filters the DataFrame to include only those days. This is important for circadian rhythm analysis which requires continuous data.
- Parameters:
df (pd.DataFrame) – DataFrame with datetime index containing the data to filter.
- Returns:
Filtered DataFrame containing only the longest sequence of consecutive days.
- Return type:
pd.DataFrame
- Raises:
ValueError – If fewer than 2 consecutive days are found in the data.
Notes
Extracts unique dates from the datetime index
Finds the longest consecutive sequence using largest_consecutive_sequence
Requires at least 2 consecutive days for valid analysis
Filters the DataFrame to include only data from consecutive days
Important for circadian rhythm analysis which requires continuous data
Examples
>>> import numpy as np
>>> import pandas as pd
>>>
>>> # Create sample data with gaps
>>> dates = pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03',
...                         '2023-01-05', '2023-01-06', '2023-01-07'])
>>> data = pd.DataFrame({'value': np.random.randn(len(dates))}, index=dates)
>>>
>>> # Filter to longest consecutive sequence
>>> filtered_data = filter_consecutive_days(data)
>>> print(f"Original dates: {data.index.date.tolist()}")
>>> print(f"Consecutive dates: {filtered_data.index.date.tolist()}")
- largest_consecutive_sequence(dates)[source]¶
Find the longest sequence of consecutive dates in a list.
This function analyzes a list of dates and returns the longest subsequence of consecutive dates. It’s used to identify continuous periods of data for circadian rhythm analysis.
- Parameters:
dates (List[datetime]) – List of dates to analyze for consecutive sequences.
- Returns:
Longest sequence of consecutive dates found. Returns empty list if input is empty.
- Return type:
List[datetime]
Notes
Sorts and removes duplicate dates before processing
Compares dates using timedelta(days=1) for consecutive day detection
Maintains the original order within consecutive sequences
Handles edge cases like empty lists and single dates
Used internally by filter_consecutive_days
Examples
>>> from datetime import datetime
>>>
>>> # Example with gaps in dates
>>> dates = [
...     datetime(2023, 1, 1), datetime(2023, 1, 2), datetime(2023, 1, 3),
...     datetime(2023, 1, 5), datetime(2023, 1, 6), datetime(2023, 1, 7)
... ]
>>> consecutive = largest_consecutive_sequence(dates)
>>> print(f"Longest consecutive sequence: {consecutive}")
>>> # Output: [datetime(2023, 1, 5), datetime(2023, 1, 6), datetime(2023, 1, 7)]
>>>
>>> # Example with single date
>>> single_date = [datetime(2023, 1, 1)]
>>> result = largest_consecutive_sequence(single_date)
>>> print(f"Single date result: {result}")
>>> # Output: [datetime(2023, 1, 1)]
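One way to implement the search described for largest_consecutive_sequence is a single pass over the sorted, de-duplicated dates. The function longest_run below is a hypothetical sketch rather than the library's code. Resolving ties in favour of the later run is an assumption, chosen only because it agrees with the documented example output.

```python
from datetime import datetime, timedelta


def longest_run(dates):
    """Hypothetical sketch: longest run of dates spaced exactly one day apart."""
    best, current = [], []
    for d in sorted(set(dates)):          # sort and drop duplicates first
        if current and d - current[-1] == timedelta(days=1):
            current.append(d)             # extends the current run
        else:
            current = [d]                 # gap found, start a new run
        if len(current) >= len(best):     # ">=" lets a later run win a tie
            best = list(current)
    return best


dates = [
    datetime(2023, 1, 1), datetime(2023, 1, 2),
    datetime(2023, 1, 4), datetime(2023, 1, 5), datetime(2023, 1, 6),
    datetime(2023, 1, 6),  # duplicate, ignored
]
run = longest_run(dates)
print([d.day for d in run])  # [4, 5, 6]
```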
- calculate_enmo(data, verbose=False)[source]¶
Calculate the Euclidean Norm Minus One (ENMO) metric from accelerometer data.
This function computes the ENMO metric, which is a widely used measure in physical activity research for quantifying acceleration while accounting for gravity.
- Parameters:
data (pd.DataFrame) – DataFrame containing accelerometer data with columns: - ‘x’: X-axis acceleration values - ‘y’: Y-axis acceleration values - ‘z’: Z-axis acceleration values All values should be in g units (1g = 9.81 m/s²).
verbose (bool, default=False) – If True, prints processing information.
- Returns:
Array of ENMO values. Values are truncated at 0, meaning negative values are set to 0. Returns np.nan if calculation fails.
- Return type:
numpy.ndarray
Notes
ENMO = sqrt(x² + y² + z²) - 1
Values are truncated at 0 (negative values become 0)
ENMO represents acceleration in excess of 1g (gravity)
Commonly used in physical activity and sleep research
Handles errors gracefully by returning np.nan
Examples
>>> import pandas as pd
>>> import numpy as np
>>>
>>> # Create sample accelerometer data
>>> data = pd.DataFrame({
...     'x': [0.1, 0.2, 0.3],
...     'y': [0.1, 0.2, 0.3],
...     'z': [1.0, 1.1, 1.2]  # Close to 1g (gravity)
... })
>>>
>>> # Calculate ENMO
>>> enmo_values = calculate_enmo(data, verbose=True)
>>> print(f"ENMO values: {enmo_values}")
>>> # Output: [0.010, 0.136, 0.273] (approximately)
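The formula in the Notes for calculate_enmo can be checked by hand with the standard library. The helper enmo below is a hypothetical scalar version written for illustration; the documented function works on a DataFrame and returns an array.

```python
import math


def enmo(x: float, y: float, z: float) -> float:
    """ENMO = sqrt(x^2 + y^2 + z^2) - 1, truncated at 0 (inputs in g)."""
    return max(math.sqrt(x * x + y * y + z * z) - 1.0, 0.0)


# The first three rows match the sample data used in the example above;
# the last row has a norm below 1 g and is therefore truncated to 0.
samples = [(0.1, 0.1, 1.0), (0.2, 0.2, 1.1), (0.3, 0.3, 1.2), (0.0, 0.0, 0.9)]
values = [enmo(*s) for s in samples]
print([round(v, 3) for v in values])
```

For the first row, sqrt(0.01 + 0.01 + 1.0) is about 1.00995, which gives an ENMO of roughly 0.010.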
- calculate_minute_level_enmo(data, meta_dict={}, verbose=False)[source]¶
Resample high-frequency ENMO data to minute-level by averaging over each minute.
This function aggregates high-frequency ENMO data to minute-level resolution using mean aggregation, which is the standard approach for circadian rhythm analysis.
- Parameters:
data (pd.DataFrame) – DataFrame with datetime index and ‘ENMO’ column containing high-frequency ENMO data. Optional ‘wear’ column for wear time information.
meta_dict (dict, default={}) – Dictionary containing metadata. Should include: - ‘sf’: Sampling frequency in Hz (defaults to 25Hz if not specified)
verbose (bool, default=False) – If True, prints processing information.
- Returns:
DataFrame containing minute-level aggregated data with: - ‘ENMO’: Mean ENMO value for each minute - ‘wear’: Mean wear time for each minute (if wear column exists in input) Index is datetime at minute resolution.
- Return type:
pd.DataFrame
- Raises:
ValueError – If sampling frequency is less than 1/60 Hz (less than one sample per minute).
Notes
Uses pandas resample(‘min’).mean() for aggregation
Handles both ENMO and wear columns if present
Converts index to datetime format
Standard preprocessing step for circadian rhythm analysis
Handles errors gracefully by returning empty DataFrame
Examples
>>> import numpy as np
>>> import pandas as pd
>>>
>>> # Create sample high-frequency ENMO data (1 hour of second-level data)
>>> dates = pd.date_range('2023-01-01 00:00:00', periods=3600, freq='s')
>>> data = pd.DataFrame({
...     'ENMO': np.random.uniform(0, 0.1, 3600),
...     'wear': np.random.choice([0, 1], 3600)
... }, index=dates)
>>>
>>> # Resample to minute level
>>> meta_dict = {'sf': 1}  # 1 Hz sampling frequency
>>> minute_data = calculate_minute_level_enmo(data, meta_dict=meta_dict, verbose=True)
>>> print(f"Original records: {len(data)}")
>>> print(f"Minute-level records: {len(minute_data)}")
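The aggregation named in the Notes for calculate_minute_level_enmo, resample('min').mean(), can be tried on a small deterministic input. This sketch covers only that one pandas call and leaves out the sampling-frequency check and error handling of the documented function.

```python
import pandas as pd

# Three minutes of 1 Hz data with a constant ENMO value per minute.
idx = pd.date_range("2023-01-01 00:00:00", periods=180, freq="s")
high_freq = pd.DataFrame(
    {
        "ENMO": [0.01] * 60 + [0.02] * 60 + [0.03] * 60,
        "wear": [1] * 120 + [0] * 60,
    },
    index=idx,
)

# Average every column over each calendar minute.
minute_level = high_freq.resample("min").mean()
print(minute_level)
```

Because wear is averaged like any other column, a minute that mixes worn and unworn samples yields a fractional wear value.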
- calibrate_accelerometer(data, sphere_crit, sd_criteria, meta_dict=None, verbose=False)[source]¶
Calibrate accelerometer data using sphere fitting method.
This function applies accelerometer calibration using the sphere fitting approach to correct for sensor bias and scaling errors. The calibration process fits the accelerometer data to a unit sphere and applies correction factors.
- Parameters:
data (pd.DataFrame) – Raw accelerometer data with datetime index and columns [‘x’, ‘y’, ‘z’]. Data should be in g units (1g = 9.81 m/s²).
sphere_crit (float) – Sphere fitting criterion threshold. Controls the tolerance for sphere fitting. Lower values result in stricter calibration requirements.
sd_criteria (float) – Standard deviation criterion threshold. Controls the tolerance for standard deviation of the calibrated data.
meta_dict (dict, optional) – Dictionary to store calibration parameters and metadata. If None, an empty dict will be created. Updated with calibration results including: - ‘calibration_offset’: Offset correction factors - ‘calibration_scale’: Scale correction factors
verbose (bool, default=False) – Whether to print progress information during calibration.
- Returns:
Calibrated accelerometer data with the same structure as input data. The calibrated data has corrected bias and scaling errors.
- Return type:
pd.DataFrame
Notes
The function uses the skdh.preprocessing.CalibrateAccelerometer class
Calibration parameters are stored in meta_dict for future reference
The function assumes data is sampled at the frequency specified in meta_dict[‘sf’]
If no sampling frequency is found in meta_dict, defaults to 25 Hz
Examples
>>> import pandas as pd
>>> import numpy as np
>>>
>>> # Create sample accelerometer data
>>> timestamps = pd.date_range('2023-01-01', periods=1000, freq='40ms')
>>> data = pd.DataFrame({
...     'x': np.random.normal(0, 0.1, 1000),
...     'y': np.random.normal(0, 0.1, 1000),
...     'z': np.random.normal(1, 0.1, 1000)  # Gravity component
... }, index=timestamps)
>>>
>>> # Calibrate the data
>>> meta_dict = {'sf': 25}
>>> calibrated_data = calibrate_accelerometer(
...     data, sphere_crit=0.3, sd_criteria=0.1,
...     meta_dict=meta_dict, verbose=True
... )
>>> print(f"Calibration offset: {meta_dict.get('calibration_offset')}")
- detect_frequency_from_timestamps(timestamps)[source]¶
Detect sampling frequency by finding the most common time delta.
This function analyzes a series of timestamps to determine the sampling frequency of the data by calculating the time differences between consecutive samples and finding the most frequently occurring interval.
- Parameters:
timestamps (pd.Series) – Series or array of datetime objects representing the timestamps of data points. Can be pandas datetime objects, numpy datetime64, or string timestamps that can be converted to datetime.
- Returns:
Sampling frequency in Hz (samples per second).
- Return type:
float
- Raises:
ValueError – If fewer than two timestamps are provided, if no time deltas can be calculated, if the most common time delta is zero, or if the mode cannot be determined.
Notes
The function converts all timestamps to pandas datetime format
Time deltas are calculated in seconds
The most common (mode) time delta is used to determine frequency
Frequency is calculated as 1.0 / most_common_delta
Examples
>>> import pandas as pd
>>>
>>> # Regular 25 Hz sampling
>>> timestamps = pd.date_range('2023-01-01', periods=100, freq='40ms')
>>> freq = detect_frequency_from_timestamps(timestamps)
>>> print(f"Detected frequency: {freq:.1f} Hz")
Detected frequency: 25.0 Hz
>>>
>>> # Irregular sampling with some missing points
>>> irregular_times = pd.to_datetime([
...     '2023-01-01 00:00:00',
...     '2023-01-01 00:00:00.040',
...     '2023-01-01 00:00:00.080',
...     '2023-01-01 00:00:00.120',
...     '2023-01-01 00:00:00.200',  # Gap here
...     '2023-01-01 00:00:00.240'
... ])
>>> freq = detect_frequency_from_timestamps(irregular_times)
>>> print(f"Detected frequency: {freq:.1f} Hz")
Detected frequency: 25.0 Hz
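The mode-of-deltas idea behind detect_frequency_from_timestamps can be sketched with the standard library alone. detect_frequency is a hypothetical name; it takes a plain list of datetime objects and does not perform the timestamp conversion of the documented function.

```python
from collections import Counter
from datetime import datetime, timedelta


def detect_frequency(timestamps):
    """Hypothetical sketch: frequency in Hz from the most common time delta."""
    if len(timestamps) < 2:
        raise ValueError("at least two timestamps are required")
    # Time difference in seconds between each pair of consecutive samples.
    deltas = [(b - a).total_seconds() for a, b in zip(timestamps, timestamps[1:])]
    most_common_delta, _ = Counter(deltas).most_common(1)[0]
    if most_common_delta == 0:
        raise ValueError("the most common time delta is zero")
    return 1.0 / most_common_delta


start = datetime(2023, 1, 1)
# 40 ms spacing with one missing sample between steps 3 and 5.
irregular = [start + timedelta(milliseconds=40 * i) for i in (0, 1, 2, 3, 5, 6)]
freq = detect_frequency(irregular)
print(f"Detected frequency: {freq:.1f} Hz")
```

The single 80 ms gap is outvoted by the four 40 ms deltas, which is why using the mode is robust to occasional dropped samples.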
- remove_noise(data, sf, filter_type='lowpass', filter_cutoff=2, verbose=False)[source]¶
Remove noise from accelerometer data using a Butterworth filter.
This function applies a digital Butterworth filter to remove noise from accelerometer data. The filter can be configured as lowpass, highpass, bandpass, or bandstop depending on the noise characteristics.
- Parameters:
data (pd.DataFrame) – DataFrame containing accelerometer data with columns [‘x’, ‘y’, ‘z’]. Data should have a datetime index and contain acceleration values in g units.
sf (float) – Sampling frequency of the accelerometer data in Hz.
filter_type (str, default='lowpass') – Type of filter to apply. Must be one of: - ‘lowpass’: Removes high-frequency noise above cutoff - ‘highpass’: Removes low-frequency noise below cutoff - ‘bandpass’: Keeps frequencies between two cutoff values - ‘bandstop’: Removes frequencies between two cutoff values
filter_cutoff (float or list, default=2) – Cutoff frequency(ies) for the filter in Hz. - For lowpass/highpass: single float value - For bandpass/bandstop: list of two values [low_cutoff, high_cutoff]
verbose (bool, default=False) – Whether to print progress information during filtering.
- Returns:
DataFrame with noise removed from the [‘x’, ‘y’, ‘z’] columns. The filtered data maintains the same structure as the input.
- Return type:
pd.DataFrame
- Raises:
ValueError – If filter_type is ‘bandpass’ or ‘bandstop’ but filter_cutoff is not a list of two values. If filter_type is ‘lowpass’ or ‘highpass’ but filter_cutoff is not a single numeric value. If the input DataFrame is empty.
KeyError – If the DataFrame does not contain required columns [‘x’, ‘y’, ‘z’].
Notes
Uses scipy.signal.butter and scipy.signal.filtfilt for zero-phase filtering
The filter order is fixed at 2 (second-order Butterworth filter)
The function applies the same filter to all three axes (x, y, z)
Zero-phase filtering is used to avoid phase distortion
Examples
>>> import pandas as pd
>>> import numpy as np
>>>
>>> # Create sample accelerometer data with noise
>>> timestamps = pd.date_range('2023-01-01', periods=1000, freq='40ms')
>>> data = pd.DataFrame({
...     'x': np.random.normal(0, 0.1, 1000) + 0.5*np.sin(2*np.pi*0.1*np.arange(1000)),
...     'y': np.random.normal(0, 0.1, 1000) + 0.3*np.cos(2*np.pi*0.05*np.arange(1000)),
...     'z': np.random.normal(1, 0.1, 1000)  # Gravity component
... }, index=timestamps)
>>>
>>> # Remove high-frequency noise with lowpass filter
>>> filtered_data = remove_noise(data, sf=25, filter_type='lowpass',
...                              filter_cutoff=2, verbose=True)
>>>
>>> # Remove low-frequency drift with highpass filter
>>> filtered_data = remove_noise(data, sf=25, filter_type='highpass',
...                              filter_cutoff=0.1, verbose=True)
- detect_wear_periods(data, sf, sd_crit, range_crit, window_length, window_skip, meta_dict={}, verbose=False)[source]¶
Detect periods of device wear using acceleration thresholds.
This function identifies when the accelerometer device is being worn by analyzing the standard deviation and range of acceleration data within sliding windows. The algorithm is based on the assumption that worn devices show more variable acceleration patterns than unworn devices.
- Parameters:
data (pd.DataFrame) – Preprocessed accelerometer data with datetime index and columns [‘x’, ‘y’, ‘z’]. Data should be in g units and cleaned of major artifacts.
sf (float) – Sampling frequency of the accelerometer data in Hz.
sd_crit (float) – Standard deviation criterion for wear detection. Threshold for the minimum standard deviation required to classify a window as “worn”.
range_crit (float) – Range criterion for wear detection. Threshold for the minimum range of acceleration values required to classify a window as “worn”.
window_length (int) – Length of the sliding window in seconds. Longer windows provide more stable wear detection but may miss brief wear periods.
window_skip (int) – Number of seconds to skip between consecutive windows. Controls the temporal resolution of wear detection.
meta_dict (dict, default={}) – Dictionary to store wear detection metadata and parameters.
verbose (bool, default=False) – Whether to print progress information during wear detection.
- Returns:
DataFrame with binary wear detection column [‘wear’] where: - 1 indicates the device is being worn - 0 indicates the device is not being worn The DataFrame has the same index as the input data.
- Return type:
pd.DataFrame
Notes
Uses skdh.preprocessing.AccelThresholdWearDetection for the core algorithm
The function converts acceleration data from g to mg units for processing
Wear periods are determined by analyzing both standard deviation and range
The algorithm is sensitive to the choice of sd_crit and range_crit parameters
Examples
>>> import pandas as pd
>>> import numpy as np
>>>
>>> # Create sample accelerometer data
>>> timestamps = pd.date_range('2023-01-01', periods=1000, freq='40ms')
>>> data = pd.DataFrame({
...     'x': np.random.normal(0, 0.1, 1000),
...     'y': np.random.normal(0, 0.1, 1000),
...     'z': np.random.normal(1, 0.1, 1000)  # Gravity component
... }, index=timestamps)
>>>
>>> # Detect wear periods
>>> wear_data = detect_wear_periods(
...     data, sf=25, sd_crit=0.013, range_crit=0.05,
...     window_length=60, window_skip=30, verbose=True
... )
>>> print(f"Wear time: {wear_data['wear'].sum() / 25:.1f} seconds")
- calc_weartime(data, sf, meta_dict, verbose)[source]¶
Calculate total, wear, and non-wear time from accelerometer data.
This function computes summary statistics about device wear time based on wear detection results. It calculates the total recording duration, time the device was worn, and time the device was not worn.
- Parameters:
data (pd.DataFrame) – DataFrame containing accelerometer data with a ‘wear’ column indicating wear status (1 for worn, 0 for not worn). Should have a datetime index.
sf (float) – Sampling frequency of the accelerometer data in Hz.
meta_dict (dict) – Dictionary to store wear time metadata. Will be updated with the following keys: - ‘total_time’: Total recording time in seconds - ‘wear_time’: Time device was worn in seconds - ‘non-wear_time’: Time device was not worn in seconds
verbose (bool) – Whether to print progress information during calculation.
- Returns:
Nothing is returned; meta_dict is updated in place with the wear time statistics.
- Return type:
None
Notes
Total time is calculated from the first to last timestamp
Wear time is calculated by summing the ‘wear’ column and converting to seconds
Non-wear time is calculated as total_time - wear_time
All times are stored in seconds in the meta_dict
Examples
>>> import pandas as pd
>>> import numpy as np
>>>
>>> # Create sample data with wear detection
>>> timestamps = pd.date_range('2023-01-01', periods=1000, freq='40ms')
>>> data = pd.DataFrame({
...     'wear': np.random.choice([0, 1], 1000, p=[0.3, 0.7])  # 70% wear time
... }, index=timestamps)
>>>
>>> # Calculate wear time statistics
>>> meta_dict = {}
>>> calc_weartime(data, sf=25, meta_dict=meta_dict, verbose=True)
>>> print(f"Total time: {meta_dict['total_time']:.1f} seconds")
>>> print(f"Wear time: {meta_dict['wear_time']:.1f} seconds")
>>> print(f"Non-wear time: {meta_dict['non-wear_time']:.1f} seconds")
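The three quantities described in the Notes for calc_weartime follow from simple arithmetic. The sketch below uses a hypothetical wear_summary helper on plain lists, and it assumes that "converting to seconds" means dividing the number of worn samples by the sampling frequency.

```python
from datetime import datetime, timedelta


def wear_summary(timestamps, wear, sf):
    """Hypothetical sketch of the documented wear time statistics (seconds)."""
    total_time = (timestamps[-1] - timestamps[0]).total_seconds()
    wear_time = sum(wear) / sf            # worn samples divided by samples per second
    return {
        "total_time": total_time,
        "wear_time": wear_time,
        "non-wear_time": total_time - wear_time,
    }


start = datetime(2023, 1, 1)
timestamps = [start + timedelta(milliseconds=40 * i) for i in range(1000)]  # 25 Hz
wear = [1] * 700 + [0] * 300
summary = wear_summary(timestamps, wear, sf=25)
print(summary)
```

Because the total is measured from the first to the last timestamp, 1000 samples at 25 Hz span 39.96 seconds rather than 40, so the non-wear time is slightly smaller than the count of unworn samples alone would suggest.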
Visualization Functions¶
- plot_orig_enmo(acc_handler, resample='15min', wear=True)[source]¶
Plot the original ENMO values resampled at a specified interval.
This function creates a time series plot of ENMO (Euclidean Norm Minus One) values with optional highlighting of wear and non-wear periods. The data is resampled to reduce noise and improve visualization clarity.
- Parameters:
acc_handler (DataHandler) – Accelerometer data handler object containing the raw data. Must have:
- get_sf_data(): Method returning DataFrame with ‘ENMO’ and ‘wear’ columns
resample (str, default='15min') – The resampling interval for the plot. Can be any pandas time frequency string (e.g., ‘5min’, ‘1H’, ‘1D’).
wear (bool, default=True) – Whether to add color bands for wear and non-wear periods.
- True: Shows red bands for non-wear periods
- False: Shows only the ENMO time series
- Returns:
Displays a matplotlib plot.
- Return type:
None
Notes
The function resamples the data using mean aggregation
Non-wear periods are highlighted with red bands when wear=True
A progress bar (tqdm) is displayed while the wear data is processed
The figure size is set to 12x6 inches
Examples
>>> from cosinorage.datahandlers import GenericDataHandler
>>>
>>> # Load data
>>> handler = GenericDataHandler('data.csv')
>>>
>>> # Plot with wear periods highlighted
>>> plot_orig_enmo(handler, resample='30min', wear=True)
>>>
>>> # Plot without wear highlighting
>>> plot_orig_enmo(handler, resample='1H', wear=False)
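The resampling and non-wear shading described above can be approximated with plain pandas and matplotlib. The sketch below uses a synthetic frame as a stand-in for the output of `get_sf_data()`; the 0.5 threshold for calling a bin "non-wear" and the `alpha` value are assumptions made for illustration, not documented behavior.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for get_sf_data(): 6 hours of 1 Hz data.
rng = np.random.default_rng(0)
idx = pd.date_range("2023-01-01", periods=6 * 60 * 60, freq="1s")
sf_data = pd.DataFrame({"ENMO": rng.random(len(idx)) * 0.1, "wear": 1}, index=idx)
# Mark one hour (02:00 to 02:59:59) as non-wear.
sf_data.loc["2023-01-01 02:00":"2023-01-01 02:59:59", "wear"] = 0

# Mean aggregation over 15-minute bins, as stated in the Notes.
bin_width = "15min"
resampled = sf_data.resample(bin_width).mean()

fig, ax = plt.subplots(figsize=(12, 6))
ax.plot(resampled.index, resampled["ENMO"])

# ASSUMPTION: a bin counts as non-wear when less than half its samples are worn.
nonwear_bins = resampled.index[resampled["wear"] < 0.5]
for start in nonwear_bins:
    ax.axvspan(start, start + pd.Timedelta(bin_width), color="red", alpha=0.3)

ax.set_ylabel("ENMO")
```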
- plot_enmo(handler)[source]¶
Plot minute-level ENMO values with optional wear/non-wear period highlighting.
This function creates a time series plot of minute-level ENMO values with automatic highlighting of wear and non-wear periods using colored bands.
- Parameters:
handler (DataHandler) – Data handler object containing the minute-level ENMO data. Must have:
- get_ml_data(): Method returning DataFrame with ‘ENMO’ column
- Optional ‘wear’ column for wear/non-wear periods
- Returns:
Displays a matplotlib plot showing ENMO values over time with optional wear/non-wear period highlighting in green/red.
- Return type:
None
Notes
Wear periods are highlighted in green
Non-wear periods are highlighted in red
The plot automatically adjusts Y-axis limits to show the full range
If no ‘wear’ column is present, only the ENMO time series is shown
The figure size is set to 12x6 inches
Examples
>>> from cosinorage.datahandlers import GenericDataHandler
>>>
>>> # Load data
>>> handler = GenericDataHandler('data.csv')
>>>
>>> # Plot minute-level ENMO with wear highlighting
>>> plot_enmo(handler)
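One way to derive green and red bands from a minute-level ‘wear’ column is to group consecutive equal values into runs. The sketch below shows that idea on a small synthetic frame standing in for `get_ml_data()`; how the library actually builds its bands is not stated here, so treat this as an illustration only.

```python
import pandas as pd

# Synthetic stand-in for get_ml_data(): ten minutes of data.
idx = pd.date_range("2023-01-01", periods=10, freq="1min")
ml = pd.DataFrame(
    {"ENMO": 0.02, "wear": [1, 1, 1, 0, 0, 1, 1, 0, 0, 0]},
    index=idx,
)

# A new run starts whenever the wear flag differs from the previous minute.
run_id = (ml["wear"] != ml["wear"].shift()).cumsum()

# Each span is (first timestamp, last timestamp, band color).
spans = [
    (grp.index[0], grp.index[-1], "green" if grp["wear"].iloc[0] == 1 else "red")
    for _, grp in ml.groupby(run_id)
]

for start, end, color in spans:
    print(f"{start} -> {end}: {color}")
```

Each tuple could then be passed to a shading call such as matplotlib's `Axes.axvspan` to draw the band.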
- plot_orig_enmo_freq(acc_handler)[source]¶
Plot the frequency domain representation of the original ENMO signal using Welch’s method.
This function computes and displays the power spectral density (PSD) of the ENMO signal using Welch’s method, which provides a smoothed estimate of the signal’s frequency content.
- Parameters:
acc_handler (DataHandler) – Accelerometer data handler object containing the raw ENMO data. Must have:
- get_sf_data(): Method returning DataFrame with ‘ENMO’ column
- Returns:
Displays a matplotlib plot showing the power spectral density of the ENMO signal computed using Welch’s method.
- Return type:
None
Notes
Uses scipy.signal.welch for power spectral density estimation
Sampling frequency is fixed at 80 Hz, so the frequency axis is only correct for data sampled at that rate
Segment length is set to 1024 samples for frequency resolution
The plot shows frequency (Hz) on the x-axis and power spectral density on the y-axis
The figure size is set to 20x5 inches
Examples
>>> from cosinorage.datahandlers import GenericDataHandler
>>>
>>> # Load data
>>> handler = GenericDataHandler('data.csv')
>>>
>>> # Plot frequency domain representation
>>> plot_orig_enmo_freq(handler)
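The PSD estimate described in the Notes can be reproduced directly with `scipy.signal.welch`, using the stated sampling frequency (80 Hz) and segment length (1024 samples). The signal below is synthetic, a 5 Hz oscillation plus light noise, chosen only so the spectral peak is easy to check; it stands in for the ‘ENMO’ column of `get_sf_data()`.

```python
import numpy as np
from scipy.signal import welch

fs = 80          # sampling frequency in Hz, as stated in the Notes
nperseg = 1024   # segment length, as stated in the Notes

# Synthetic ENMO-like signal: 60 s with a 5 Hz component plus small noise.
rng = np.random.default_rng(0)
t = np.arange(60 * fs) / fs
enmo = 0.05 + 0.02 * np.sin(2 * np.pi * 5 * t) + 0.001 * rng.standard_normal(t.size)

# Welch's method averages periodograms of overlapping segments.
freqs, psd = welch(enmo, fs=fs, nperseg=nperseg)

peak_freq = freqs[np.argmax(psd)]
print(f"Frequency resolution: {fs / nperseg:.4f} Hz")
print(f"Peak frequency: {peak_freq:.2f} Hz")
```

With these settings the frequency resolution is fs / nperseg, about 0.078 Hz, and the spectrum extends to the Nyquist frequency of 40 Hz.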