cosinorage.datahandlers Module¶

Module Contents¶

Classes¶

class DataHandler[source]¶

Bases: object

A base class for data handlers that process and store ENMO data at the minute level.

This class provides a common interface for data handlers with methods to load data, retrieve processed ENMO values, and save data. The load_data and save_data methods are intended to be overridden by subclasses.

raw_data¶

Raw accelerometer data loaded from the source.

Type:: pd.DataFrame or None

sf_data¶

Filtered and processed accelerometer data.

Type:: pd.DataFrame or None

ml_data¶

Minute-level ENMO data calculated from processed data.

Type:: pd.DataFrame or None

meta_dict¶

Dictionary storing metadata about the data processing.

Type:: dict

__init__()[source]¶

Initializes an empty DataHandler instance with an empty DataFrame for storing minute-level ENMO values.

Notes

This is a base class constructor. Subclasses should override this method to accept specific parameters for their data sources.

save_data(output_path)[source]¶

Save minute-level ENMO data to a specified output path.

This method is intended to be implemented by subclasses, specifying the format and structure for saving data.

Parameters:: output_path (str) – The file path where the minute-level ENMO data will be saved.

get_raw_data()[source]¶

Retrieve the raw data.

Returns:: A DataFrame containing the raw data.
Return type:: pd.DataFrame

get_sf_data()[source]¶

Retrieve the filtered data.

Returns:: A DataFrame containing the filtered data.
Return type:: pd.DataFrame

get_ml_data()[source]¶

Retrieve the minute-level ENMO values.

Returns:: A DataFrame containing the minute-level ENMO values.
Return type:: pd.DataFrame

get_meta_data()[source]¶

Retrieve the metadata.

Returns:: A dictionary containing the metadata.
Return type:: dict
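
The base class leaves load_data and save_data to subclasses. A minimal sketch of that override pattern follows. SketchDataHandler is a stand-in that mirrors only the documented attributes and getters, and CsvEnmoHandler, its CSV layout, and its load_data signature are hypothetical; none of this is the library's own code.

```python
import io

import pandas as pd


class SketchDataHandler:
    """Stand-in mirroring the documented attributes (not the library's code)."""

    def __init__(self):
        self.raw_data = None   # raw data as loaded
        self.sf_data = None    # filtered and processed data
        self.ml_data = None    # minute-level ENMO data
        self.meta_dict = {}    # processing metadata

    def load_data(self, source):
        raise NotImplementedError("subclasses are expected to override load_data")

    def save_data(self, output_path):
        raise NotImplementedError("subclasses are expected to override save_data")

    def get_raw_data(self):
        return self.raw_data

    def get_sf_data(self):
        return self.sf_data

    def get_ml_data(self):
        return self.ml_data

    def get_meta_data(self):
        return self.meta_dict


class CsvEnmoHandler(SketchDataHandler):
    """Hypothetical subclass that reads minute-level ENMO from a CSV source."""

    def __init__(self, source):
        super().__init__()
        self.load_data(source)

    def load_data(self, source):
        # Assumed layout: a 'timestamp' column and an 'ENMO' column.
        self.raw_data = pd.read_csv(
            source, parse_dates=["timestamp"], index_col="timestamp"
        )
        self.sf_data = self.raw_data
        self.ml_data = self.raw_data[["ENMO"]]
        self.meta_dict["n_rows"] = len(self.raw_data)

    def save_data(self, output_path):
        self.ml_data.to_csv(output_path)


csv_text = "timestamp,ENMO\n2023-01-01 00:00:00,0.01\n2023-01-01 00:01:00,0.02\n"
handler = CsvEnmoHandler(io.StringIO(csv_text))
print(handler.get_ml_data())
print(handler.get_meta_data())
```

Keeping the getters in the base class means downstream code can treat every handler the same way, whatever the data source.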

class GalaxyDataHandler(galaxy_file_path, data_format='binary', data_type=None, time_column=None, data_columns=None, preprocess_args={}, verbose=False)[source]¶

Bases: DataHandler

Unified data handler for Samsung Galaxy Watch accelerometer data.

This class handles loading, filtering, and processing of Galaxy Watch accelerometer data in both binary and CSV formats. Currently supports:
- Binary format with accelerometer data type
- CSV format with ENMO data type

Parameters:

galaxy_file_path (str)
data_format (str)
data_type (str | None)
time_column (str | None)
data_columns (list | None)
preprocess_args (dict)
verbose (bool)

galaxy_file_path¶

Path to the Galaxy Watch data file (for CSV) or directory (for binary).

Type:: str

data_format¶

Format of the data (‘csv’ or ‘binary’).

Type:: str

data_type¶

Type of the data (‘enmo’ or ‘accelerometer’).

Type:: str

time_column¶

Name of the timestamp column.

Type:: str

data_columns¶

Names of the data columns.

Type:: list

preprocess_args¶

Arguments for preprocessing.

Type:: dict

__init__(galaxy_file_path, data_format='binary', data_type=None, time_column=None, data_columns=None, preprocess_args={}, verbose=False)[source]¶

Initializes a GalaxyDataHandler for the Galaxy Watch data at galaxy_file_path, using the given data format, data type, column names, and preprocessing arguments.

Parameters:

galaxy_file_path (str)
data_format (str)
data_type (str | None)
time_column (str | None)
data_columns (list | None)
preprocess_args (dict)
verbose (bool)

Utility Functions¶

Generic Data Functions¶

Galaxy Smartwatch Data Functions¶

read_galaxy_binary_data(galaxy_file_dir, meta_dict, time_column='unix_timestamp_in_ms', data_columns=None, verbose=False)[source]¶

Read accelerometer data from Galaxy Watch binary files.

Parameters:

galaxy_file_dir (str) – Directory containing Galaxy Watch data files
meta_dict (dict) – Dictionary to store metadata about the loaded data
time_column (str) – Name of the timestamp column in the binary data
data_columns (list) – Names of the data columns in the binary data
verbose (bool) – Whether to print progress information

Returns:

DataFrame containing accelerometer data with columns [‘x’, ‘y’, ‘z’]

Return type:

pd.DataFrame

filter_galaxy_binary_data(data, meta_dict={}, verbose=False, preprocess_args={})[source]¶

Filter Galaxy Watch accelerometer data by removing incomplete days and selecting the longest sequence of consecutive days.

Parameters:

data (pd.DataFrame) – Raw accelerometer data
meta_dict (dict) – Dictionary to store metadata about the filtering process
verbose (bool) – Whether to print progress information
preprocess_args (dict)

Returns:

Filtered accelerometer data

Return type:

pd.DataFrame

resample_galaxy_binary_data(data, meta_dict={}, verbose=False)[source]¶

Resample Galaxy Watch accelerometer data to a regular interval.

Parameters:

data (pd.DataFrame) – Filtered accelerometer data
meta_dict (dict) – Dictionary to store metadata about the resampling process
verbose (bool) – Whether to print progress information

Returns:

Resampled accelerometer data at regular frequency.

Return type:

pd.DataFrame

preprocess_galaxy_binary_data(data, preprocess_args={}, meta_dict={}, verbose=False)[source]¶

Preprocess Galaxy Watch accelerometer data including rescaling, calibration, noise removal, and wear detection.

Parameters:

data (pd.DataFrame) – Resampled accelerometer data
preprocess_args (dict) – Dictionary containing preprocessing parameters
meta_dict (dict) – Dictionary to store metadata about the preprocessing
verbose (bool) – Whether to print progress information

Returns:

Preprocessed accelerometer data with additional columns for raw values and wear detection

Return type:

pd.DataFrame

acceleration_data_to_dataframe(data)[source]¶

Convert binary acceleration data to pandas DataFrame.

This function converts raw binary acceleration data from Samsung Galaxy Watch into a structured pandas DataFrame format for further processing.

Parameters:: data (object) – Binary acceleration data object containing samples with the following attributes:
- acceleration_x: X-axis acceleration value
- acceleration_y: Y-axis acceleration value
- acceleration_z: Z-axis acceleration value
- sensor_body_location: Location of the sensor on the body
- unix_timestamp_in_ms: Timestamp in milliseconds since Unix epoch
- effective_time_frame: Effective time frame for the sample
Returns:: DataFrame containing accelerometer data with columns:
- ‘acceleration_x’: X-axis acceleration values
- ‘acceleration_y’: Y-axis acceleration values
- ‘acceleration_z’: Z-axis acceleration values
- ‘sensor_body_location’: Sensor location information
- ‘unix_timestamp_in_ms’: Timestamps in milliseconds
- ‘effective_time_frame’: Effective time frame information
Return type:: pd.DataFrame

Notes

This function is used internally by read_galaxy_binary_data
The function iterates through all samples in the binary data object
Each sample is converted to a dictionary and added to the DataFrame
The resulting DataFrame maintains the original data structure from the binary file

Examples

>>> # This function is typically called internally by read_galaxy_binary_data
>>> # but can be used directly if you have binary data objects:
>>>
>>> # Load binary data (example)
>>> binary_data = load_acceleration_data("path/to/binary/file")
>>>
>>> # Convert to DataFrame
>>> df = acceleration_data_to_dataframe(binary_data)
>>> print(f"Converted {len(df)} acceleration samples")
>>> print(f"Columns: {df.columns.tolist()}")

read_galaxy_csv_data(galaxy_file_path, meta_dict, time_column='timestamp', data_columns=None, verbose=False)[source]¶

Read ENMO data from Galaxy Watch CSV file.

This function loads ENMO (Euclidean Norm Minus One) data from Samsung Galaxy Watch CSV files and standardizes the format for further processing.

Parameters:

galaxy_file_path (str) – Path to the Galaxy Watch CSV data file containing ENMO values.
meta_dict (dict) – Dictionary to store metadata about the loaded data. Will be populated with:
- raw_n_datapoints: Number of data points
- raw_start_datetime: Start timestamp
- raw_end_datetime: End timestamp
- sf: Sampling frequency in Hz
- raw_data_frequency: Sampling frequency as string
- raw_data_type: Type of data (‘ENMO’)
- raw_data_unit: Unit of data (‘mg’)
time_column (str, default='timestamp') – Name of the timestamp column in the CSV file.
data_columns (list, optional) – Names of the data columns in the CSV file. If not provided, defaults to [‘enmo’].
verbose (bool, default=False) – Whether to print progress information during loading.

Returns:

DataFrame containing ENMO data with a standardized column name, ‘ENMO’ (ENMO values in mg units). The DataFrame has a datetime index built from the timestamp column.

Return type:

pd.DataFrame

Notes

The function automatically converts UTC timestamps to local time
Missing values are filled with 0
Data is sorted by timestamp
Sampling frequency is automatically detected from timestamps
Column names are standardized to ‘ENMO’ for consistency

Examples

>>> import pandas as pd
>>>
>>> # Load ENMO data from Galaxy Watch CSV file
>>> meta_dict = {}
>>> data = read_galaxy_csv_data(
...     galaxy_file_path='data/galaxy_enmo.csv',
...     meta_dict=meta_dict,
...     time_column='time',
...     data_columns=['enmo_mg'],
...     verbose=True
... )
>>> print(f"Loaded {len(data)} ENMO records")
>>> print(f"Sampling frequency: {meta_dict['sf']:.1f} Hz")

filter_galaxy_csv_data(data, meta_dict={}, verbose=False, preprocess_args={})[source]¶

Filter Galaxy Watch ENMO data by removing incomplete days and selecting the longest sequence of consecutive days.

This function applies data quality filters to Galaxy Watch ENMO data, including removal of incomplete days and selection of the longest consecutive sequence of days.

Parameters:

data (pd.DataFrame) – Raw ENMO data with datetime index and ‘ENMO’ column.
meta_dict (dict, default={}) – Dictionary to store metadata about the filtering process. Should contain: - sf: Sampling frequency in Hz
verbose (bool, default=False) – Whether to print progress information during filtering.
preprocess_args (dict, default={}) – Dictionary containing filtering parameters: - required_daily_coverage: Minimum fraction of daily data required (default: 0.5)

Returns:

Filtered ENMO data containing only complete and consecutive days. The DataFrame maintains the same structure as the input.

Return type:

pd.DataFrame

Notes

Removes days that don’t meet the required daily coverage threshold
Selects the longest sequence of consecutive days (minimum 4 days required)
Resamples data to minute-level resolution
Removes incomplete first and last days
Updates metadata with information about the filtering process

Examples

>>> import numpy as np
>>> import pandas as pd
>>>
>>> # Create sample ENMO data
>>> dates = pd.date_range('2023-01-01', periods=10000, freq='min')
>>> data = pd.DataFrame({'ENMO': np.random.randn(10000)}, index=dates)
>>>
>>> # Filter the data
>>> meta_dict = {'sf': 1/60}  # 1 sample per minute
>>> preprocess_args = {'required_daily_coverage': 0.8}
>>> filtered_data = filter_galaxy_csv_data(
...     data, meta_dict=meta_dict, preprocess_args=preprocess_args, verbose=True
... )
>>> print(f"Original data points: {len(data)}")
>>> print(f"Filtered data points: {len(filtered_data)}")

resample_galaxy_csv_data(data, meta_dict={}, verbose=False)[source]¶

Ensure we have minute-level data across the whole timeseries.

This function resamples Galaxy Watch ENMO data to ensure consistent minute-level resolution across the entire time series.

Parameters:

data (pd.DataFrame) – Filtered ENMO data with datetime index and ‘ENMO’ column.
meta_dict (dict, default={}) – Dictionary to store metadata about the resampling process.
verbose (bool, default=False) – Whether to print progress information during resampling.

Returns:

Resampled ENMO data with consistent minute-level resolution. The DataFrame maintains the same structure as the input.

Return type:

pd.DataFrame

Notes

Uses pandas resample(‘1min’) with linear interpolation
Backward fills any remaining gaps with bfill()
Ensures consistent temporal resolution for analysis
Updates metadata with information about the resampling process

Examples

>>> import pandas as pd
>>>
>>> # Create sample ENMO data with irregular intervals
>>> dates = pd.to_datetime(['2023-01-01 00:00:00', '2023-01-01 00:01:30',
...                         '2023-01-01 00:03:00', '2023-01-01 00:04:30'])
>>> data = pd.DataFrame({'ENMO': [0.1, 0.2, 0.3, 0.4]}, index=dates)
>>>
>>> # Resample to minute level
>>> meta_dict = {}
>>> resampled_data = resample_galaxy_csv_data(data, meta_dict=meta_dict, verbose=True)
>>> print(f"Original data points: {len(data)}")
>>> print(f"Resampled data points: {len(resampled_data)}")
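
The interpolate-then-bfill behaviour listed in the notes can be sketched with plain pandas. This is a sketch under assumptions, not the library's implementation: in particular, using mean() to place samples on the one-minute grid is an assumption made here.

```python
import numpy as np
import pandas as pd

# Minute-aligned ENMO samples with a missing first value and a two-minute gap.
index = pd.to_datetime(
    ["2023-01-01 00:00:00", "2023-01-01 00:01:00", "2023-01-01 00:04:00"]
)
data = pd.DataFrame({"ENMO": [np.nan, 0.2, 0.5]}, index=index)

# Put the samples on a regular one-minute grid; empty minutes become NaN.
minute_grid = data.resample("1min").mean()

# Linear interpolation fills the interior gap; bfill() then fills the leading
# NaN, which interpolation cannot reach, from the next valid value.
filled = minute_grid.interpolate(method="linear").bfill()
print(filled)
```

Here the interior gap at 00:02 and 00:03 is filled by interpolation, while the first minute takes the value of the next valid sample.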

preprocess_galaxy_csv_data(data, preprocess_args={}, meta_dict={}, verbose=False)[source]¶

Preprocess Galaxy Watch ENMO data and add a placeholder wear column.

This function applies preprocessing steps to Galaxy Watch ENMO data. Currently, wear detection is not implemented for ENMO data as the algorithm relies on raw accelerometer data.

Parameters:

data (pd.DataFrame) – Resampled ENMO data with datetime index and ‘ENMO’ column.
preprocess_args (dict, default={}) – Dictionary containing preprocessing parameters (currently not used for ENMO data).
meta_dict (dict, default={}) – Dictionary to store metadata about the preprocessing step.
verbose (bool, default=False) – Whether to print progress information during preprocessing.

Returns:

Preprocessed ENMO data with columns:
- ‘ENMO’: Original ENMO values
- ‘wear’: Wear detection column (set to -1 for ENMO data)

Return type:

pd.DataFrame

Notes

Wear detection is not implemented for ENMO data
The ‘wear’ column is set to -1 to indicate no wear detection
Future implementations may add wear detection for ENMO data
The function maintains the original ENMO values

Examples

>>> import numpy as np
>>> import pandas as pd
>>>
>>> # Create sample ENMO data
>>> dates = pd.date_range('2023-01-01', periods=1440, freq='min')
>>> data = pd.DataFrame({'ENMO': np.random.uniform(0, 0.1, 1440)}, index=dates)
>>>
>>> # Preprocess the data
>>> meta_dict = {}
>>> preprocess_args = {}
>>> processed_data = preprocess_galaxy_csv_data(
...     data, preprocess_args=preprocess_args, meta_dict=meta_dict, verbose=True
... )
>>> print(f"Processed data shape: {processed_data.shape}")
>>> print(f"Wear column present: {'wear' in processed_data.columns}")

UK Biobank Data Functions¶

NHANES Data Functions¶

General Utility Functions¶

filter_incomplete_days(df, data_freq, expected_points_per_day=None)[source]¶

Filter out data from incomplete days to ensure 24-hour data periods.

This function removes data from days that don’t have the expected number of data points to ensure that only complete 24-hour data is retained for analysis.

Parameters:

df (pd.DataFrame) – DataFrame with datetime index, which is used to determine the day. The index should contain datetime objects.
data_freq (float) – Frequency of data collection in Hz (e.g., 1/60 for minute-level data).
expected_points_per_day (int, optional) – Expected number of data points per day. If None, calculated using data_freq * 86400.

Returns:

Filtered DataFrame containing only complete days. Returns empty DataFrame if an error occurs during processing.

Return type:

pd.DataFrame

Notes

Calculates expected points per day as data_freq * 60 * 60 * 24 if not provided
Groups data by date and counts points per day
Retains only days with sufficient data points
Removes the temporary ‘DATE’ column before returning
Handles errors gracefully by returning empty DataFrame

Examples

>>> import numpy as np
>>> import pandas as pd
>>>
>>> # Create sample data with some incomplete days
>>> dates = pd.date_range('2023-01-01', periods=5000, freq='min')
>>> data = pd.DataFrame({'value': np.random.randn(5000)}, index=dates)
>>>
>>> # Filter incomplete days (expecting 1440 points per day for minute data)
>>> filtered_data = filter_incomplete_days(data, data_freq=1/60, expected_points_per_day=1440)
>>> print(f"Original days: {data.index.normalize().nunique()}")
>>> print(f"Complete days: {filtered_data.index.normalize().nunique()}")
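
The grouping step described in the notes can be sketched as follows. keep_complete_days is a hypothetical helper, not the library function, and treating a day as complete when it has at least the expected number of rows is an assumption.

```python
import numpy as np
import pandas as pd


def keep_complete_days(df, data_freq, expected_points_per_day=None):
    """Sketch: keep only calendar days that have the expected number of rows."""
    if expected_points_per_day is None:
        # Samples per second times seconds per day.
        expected_points_per_day = int(round(data_freq * 60 * 60 * 24))
    day = df.index.normalize()                 # midnight of each row's day
    points_per_day = df.groupby(day).size()    # row count per calendar day
    complete_days = points_per_day[points_per_day >= expected_points_per_day].index
    return df[day.isin(complete_days)]


# 5000 minutes of data: three full days plus a partial fourth day.
index = pd.date_range("2023-01-01", periods=5000, freq="min")
df = pd.DataFrame({"value": np.arange(5000)}, index=index)

complete = keep_complete_days(df, data_freq=1 / 60)
print(complete.index.normalize().nunique())
```

With minute-level data the threshold works out to 1440 rows per day, so the partial fourth day is dropped.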

filter_consecutive_days(df)[source]¶

Filter DataFrame to retain only the longest sequence of consecutive days.

This function identifies the longest sequence of consecutive days in the data and filters the DataFrame to include only those days. This is important for circadian rhythm analysis which requires continuous data.

Parameters:: df (pd.DataFrame) – DataFrame with datetime index containing the data to filter.
Returns:: Filtered DataFrame containing only the longest sequence of consecutive days.
Return type:: pd.DataFrame
Raises:: ValueError – If fewer than 2 consecutive days are found in the data.

Notes

Extracts unique dates from the datetime index
Finds the longest consecutive sequence using largest_consecutive_sequence
Requires at least 2 consecutive days for valid analysis
Filters the DataFrame to include only data from consecutive days
Important for circadian rhythm analysis which requires continuous data

Examples

>>> import numpy as np
>>> import pandas as pd
>>>
>>> # Create sample data with gaps
>>> dates = pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03',
...                         '2023-01-05', '2023-01-06', '2023-01-07'])
>>> data = pd.DataFrame({'value': np.random.randn(len(dates))}, index=dates)
>>>
>>> # Filter to longest consecutive sequence
>>> filtered_data = filter_consecutive_days(data)
>>> print(f"Original dates: {data.index.date.tolist()}")
>>> print(f"Consecutive dates: {filtered_data.index.date.tolist()}")

largest_consecutive_sequence(dates)[source]¶

Find the longest sequence of consecutive dates in a list.

This function analyzes a list of dates and returns the longest subsequence of consecutive dates. It’s used to identify continuous periods of data for circadian rhythm analysis.

Parameters:: dates (List[datetime]) – List of dates to analyze for consecutive sequences.
Returns:: Longest sequence of consecutive dates found. Returns empty list if input is empty.
Return type:: List[datetime]

Notes

Sorts and removes duplicate dates before processing
Compares dates using timedelta(days=1) for consecutive day detection
Maintains the original order within consecutive sequences
Handles edge cases like empty lists and single dates
Used internally by filter_consecutive_days

Examples

>>> from datetime import datetime
>>>
>>> # Example with gaps in dates
>>> dates = [
...     datetime(2023, 1, 1), datetime(2023, 1, 2), datetime(2023, 1, 3),
...     datetime(2023, 1, 5), datetime(2023, 1, 6), datetime(2023, 1, 7)
... ]
>>> consecutive = largest_consecutive_sequence(dates)
>>> print(f"Longest consecutive sequence: {consecutive}")
>>> # Output: [datetime(2023, 1, 5), datetime(2023, 1, 6), datetime(2023, 1, 7)]
>>>
>>> # Example with single date
>>> single_date = [datetime(2023, 1, 1)]
>>> result = largest_consecutive_sequence(single_date)
>>> print(f"Single date result: {result}")
>>> # Output: [datetime(2023, 1, 1)]

calculate_enmo(data, verbose=False)[source]¶

Calculate the Euclidean Norm Minus One (ENMO) metric from accelerometer data.

This function computes the ENMO metric, which is a widely used measure in physical activity research for quantifying acceleration while accounting for gravity.

Parameters:

data (pd.DataFrame) – DataFrame containing accelerometer data with columns:
- ‘x’: X-axis acceleration values
- ‘y’: Y-axis acceleration values
- ‘z’: Z-axis acceleration values
All values should be in g units (1 g = 9.81 m/s²).
verbose (bool, default=False) – If True, prints processing information.

Returns:

Array of ENMO values. Values are truncated at 0, meaning negative values are set to 0. Returns np.nan if calculation fails.

Return type:

numpy.ndarray

Notes

ENMO = sqrt(x² + y² + z²) - 1
Values are truncated at 0 (negative values become 0)
ENMO represents acceleration in excess of 1g (gravity)
Commonly used in physical activity and sleep research
Handles errors gracefully by returning np.nan

Examples

>>> import pandas as pd
>>> import numpy as np
>>>
>>> # Create sample accelerometer data
>>> data = pd.DataFrame({
...     'x': [0.1, 0.2, 0.3],
...     'y': [0.1, 0.2, 0.3],
...     'z': [1.0, 1.1, 1.2]  # Close to 1g (gravity)
... })
>>>
>>> # Calculate ENMO
>>> enmo_values = calculate_enmo(data, verbose=True)
>>> print(f"ENMO values: {enmo_values}")
>>> # Output: [0.010, 0.136, 0.273] (approximately)
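
The formula in the notes can be written directly in NumPy. enmo_from_xyz is a hypothetical helper for illustration and omits the error handling the library function is documented to have.

```python
import numpy as np
import pandas as pd


def enmo_from_xyz(data):
    """Sketch: ENMO = max(sqrt(x^2 + y^2 + z^2) - 1, 0), with inputs in g."""
    norm = np.sqrt(data["x"] ** 2 + data["y"] ** 2 + data["z"] ** 2)
    return np.maximum(norm.to_numpy() - 1.0, 0.0)


data = pd.DataFrame({
    "x": [0.0, 1.2, 0.0],
    "y": [0.0, 0.0, 0.0],
    "z": [1.0, 0.5, 0.5],   # at rest, moving, and below 1 g
})
enmo = enmo_from_xyz(data)
print(enmo)  # approximately [0.0, 0.3, 0.0]
```

The third row has a vector norm of 0.5 g, so the raw difference is negative and is truncated to 0.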

calculate_minute_level_enmo(data, meta_dict={}, verbose=False)[source]¶

Resample high-frequency ENMO data to minute-level by averaging over each minute.

This function aggregates high-frequency ENMO data to minute-level resolution using mean aggregation, which is the standard approach for circadian rhythm analysis.

Parameters:

data (pd.DataFrame) – DataFrame with datetime index and ‘ENMO’ column containing high-frequency ENMO data. Optional ‘wear’ column for wear time information.
meta_dict (dict, default={}) – Dictionary containing metadata. Should include ‘sf’, the sampling frequency in Hz (defaults to 25 Hz if not specified).
verbose (bool, default=False) – If True, prints processing information.

Returns:

DataFrame containing minute-level aggregated data with:
- ‘ENMO’: Mean ENMO value for each minute
- ‘wear’: Mean wear time for each minute (if wear column exists in input)
Index is datetime at minute resolution.

Return type:

pd.DataFrame

Raises:

ValueError – If sampling frequency is less than 1/60 Hz (less than one sample per minute).

Notes

Uses pandas resample(‘min’).mean() for aggregation
Handles both ENMO and wear columns if present
Converts index to datetime format
Standard preprocessing step for circadian rhythm analysis
Handles errors gracefully by returning empty DataFrame

Examples

>>> import numpy as np
>>> import pandas as pd
>>>
>>> # Create sample high-frequency ENMO data
>>> dates = pd.date_range('2023-01-01 00:00:00', periods=3600, freq='s')  # 1 hour of second-level data
>>> data = pd.DataFrame({
...     'ENMO': np.random.uniform(0, 0.1, 3600),
...     'wear': np.random.choice([0, 1], 3600)
... }, index=dates)
>>>
>>> # Resample to minute level
>>> meta_dict = {'sf': 1}  # 1 Hz sampling frequency
>>> minute_data = calculate_minute_level_enmo(data, meta_dict=meta_dict, verbose=True)
>>> print(f"Original records: {len(data)}")
>>> print(f"Minute-level records: {len(minute_data)}")
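
The aggregation named in the notes, resample(‘min’).mean(), can be checked on a small deterministic input. The data below is made up for illustration, and the sketch leaves out the sampling-frequency check and error handling of the library function.

```python
import numpy as np
import pandas as pd

# Three minutes of 1 Hz data with a constant ENMO value within each minute.
index = pd.date_range("2023-01-01 00:00:00", periods=180, freq="s")
data = pd.DataFrame(
    {"ENMO": np.repeat([0.01, 0.02, 0.03], 60), "wear": 1},
    index=index,
)

# One row per minute; each column is averaged over that minute's samples.
minute_level = data.resample("min").mean()
print(minute_level)
```

Because every column is averaged, a 0/1 wear flag becomes the fraction of each minute flagged as worn.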

calibrate_accelerometer(data, sphere_crit, sd_criteria, meta_dict=None, verbose=False)[source]¶

Calibrate accelerometer data using sphere fitting method.

This function applies accelerometer calibration using the sphere fitting approach to correct for sensor bias and scaling errors. The calibration process fits the accelerometer data to a unit sphere and applies correction factors.

Parameters:

data (pd.DataFrame) – Raw accelerometer data with datetime index and columns [‘x’, ‘y’, ‘z’]. Data should be in g units (1g = 9.81 m/s²).
sphere_crit (float) – Sphere fitting criterion threshold. Controls the tolerance for sphere fitting. Lower values result in stricter calibration requirements.
sd_criteria (float) – Standard deviation criterion threshold. Controls the tolerance for standard deviation of the calibrated data.
meta_dict (dict, optional) – Dictionary to store calibration parameters and metadata. If None, an empty dict will be created. Updated with calibration results including:
- ‘calibration_offset’: Offset correction factors
- ‘calibration_scale’: Scale correction factors
verbose (bool, default=False) – Whether to print progress information during calibration.

Returns:

Calibrated accelerometer data with the same structure as input data. The calibrated data has corrected bias and scaling errors.

Return type:

pd.DataFrame

Notes

The function uses the skdh.preprocessing.CalibrateAccelerometer class
Calibration parameters are stored in meta_dict for future reference
The function assumes data is sampled at the frequency specified in meta_dict[‘sf’]
If no sampling frequency is found in meta_dict, defaults to 25 Hz

Examples

>>> import pandas as pd
>>> import numpy as np
>>>
>>> # Create sample accelerometer data
>>> timestamps = pd.date_range('2023-01-01', periods=1000, freq='40ms')
>>> data = pd.DataFrame({
...     'x': np.random.normal(0, 0.1, 1000),
...     'y': np.random.normal(0, 0.1, 1000),
...     'z': np.random.normal(1, 0.1, 1000)  # Gravity component
... }, index=timestamps)
>>>
>>> # Calibrate the data
>>> meta_dict = {'sf': 25}
>>> calibrated_data = calibrate_accelerometer(
...     data, sphere_crit=0.3, sd_criteria=0.1,
...     meta_dict=meta_dict, verbose=True
... )
>>> print(f"Calibration offset: {meta_dict.get('calibration_offset')}")

detect_frequency_from_timestamps(timestamps)[source]¶

Detect sampling frequency by finding the most common time delta.

This function analyzes a series of timestamps to determine the sampling frequency of the data by calculating the time differences between consecutive samples and finding the most frequently occurring interval.

Parameters:: timestamps (pd.Series) – Series or array of datetime objects representing the timestamps of data points. Can be pandas datetime objects, numpy datetime64, or string timestamps that can be converted to datetime.
Returns:: Sampling frequency in Hz (samples per second).
Return type:: float
Raises:: ValueError – If fewer than two timestamps are provided, if no time deltas can be calculated, if the most common time delta is zero, or if the mode cannot be determined.

Notes

The function converts all timestamps to pandas datetime format
Time deltas are calculated in seconds
The most common (mode) time delta is used to determine frequency
Frequency is calculated as 1.0 / most_common_delta

Examples

>>> import pandas as pd
>>>
>>> # Regular 25 Hz sampling
>>> timestamps = pd.date_range('2023-01-01', periods=100, freq='40ms')
>>> freq = detect_frequency_from_timestamps(timestamps)
>>> print(f"Detected frequency: {freq:.1f} Hz")
Detected frequency: 25.0 Hz
>>>
>>> # Irregular sampling with some missing points
>>> irregular_times = pd.to_datetime([
...     '2023-01-01 00:00:00',
...     '2023-01-01 00:00:00.040',
...     '2023-01-01 00:00:00.080',
...     '2023-01-01 00:00:00.120',
...     '2023-01-01 00:00:00.200',  # Gap here
...     '2023-01-01 00:00:00.240'
... ])
>>> freq = detect_frequency_from_timestamps(irregular_times)
>>> print(f"Detected frequency: {freq:.1f} Hz")
Detected frequency: 25.0 Hz
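
The mode-of-deltas approach in the notes can be sketched as below. most_common_sampling_frequency is a hypothetical helper, and rounding the deltas to microseconds before taking the mode is an assumption added here to avoid floating-point near-duplicates.

```python
import pandas as pd


def most_common_sampling_frequency(timestamps):
    """Sketch: sampling frequency in Hz from the most common time delta."""
    times = pd.Series(pd.to_datetime(timestamps)).reset_index(drop=True)
    if len(times) in (0, 1):
        raise ValueError("at least two timestamps are required")
    # Differences between consecutive timestamps, in seconds.
    deltas = times.diff().dropna().dt.total_seconds().round(6)
    most_common_delta = deltas.mode().iloc[0]
    if most_common_delta == 0:
        raise ValueError("most common time delta is zero")
    return 1.0 / most_common_delta


# 1 Hz sampling with a seven-second gap in the middle.
gappy = pd.to_datetime([
    "2023-01-01 00:00:00", "2023-01-01 00:00:01", "2023-01-01 00:00:02",
    "2023-01-01 00:00:03", "2023-01-01 00:00:10", "2023-01-01 00:00:11",
])
gappy_freq = most_common_sampling_frequency(gappy)
regular_freq = most_common_sampling_frequency(
    pd.date_range("2023-01-01", periods=100, freq="40ms")
)
print(gappy_freq, regular_freq)
```

Using the mode rather than the mean keeps a single long gap from skewing the estimate.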

remove_noise(data, sf, filter_type='lowpass', filter_cutoff=2, verbose=False)[source]¶

Remove noise from accelerometer data using a Butterworth filter.

This function applies a digital Butterworth filter to remove noise from accelerometer data. The filter can be configured as lowpass, highpass, bandpass, or bandstop depending on the noise characteristics.

Parameters:

data (pd.DataFrame) – DataFrame containing accelerometer data with columns [‘x’, ‘y’, ‘z’]. Data should have a datetime index and contain acceleration values in g units.
sf (float) – Sampling frequency of the accelerometer data in Hz.
filter_type (str, default='lowpass') – Type of filter to apply. Must be one of:
- ‘lowpass’: Removes high-frequency noise above cutoff
- ‘highpass’: Removes low-frequency noise below cutoff
- ‘bandpass’: Keeps frequencies between two cutoff values
- ‘bandstop’: Removes frequencies between two cutoff values
filter_cutoff (float or list, default=2) – Cutoff frequency(ies) for the filter in Hz.
- For lowpass/highpass: single float value
- For bandpass/bandstop: list of two values [low_cutoff, high_cutoff]
verbose (bool, default=False) – Whether to print progress information during filtering.

Returns:

DataFrame with noise removed from the [‘x’, ‘y’, ‘z’] columns. The filtered data maintains the same structure as the input.

Return type:

pd.DataFrame

Raises:

ValueError – Raised in any of these cases:
- filter_type is ‘bandpass’ or ‘bandstop’ but filter_cutoff is not a list of two values
- filter_type is ‘lowpass’ or ‘highpass’ but filter_cutoff is not a single numeric value
- the input DataFrame is empty
KeyError – If the DataFrame does not contain required columns [‘x’, ‘y’, ‘z’].

Notes

Uses scipy.signal.butter and scipy.signal.filtfilt for zero-phase filtering
The filter order is fixed at 2 (second-order Butterworth filter)
The function applies the same filter to all three axes (x, y, z)
Zero-phase filtering is used to avoid phase distortion

Examples

>>> import pandas as pd
>>> import numpy as np
>>>
>>> # Create sample accelerometer data with noise
>>> timestamps = pd.date_range('2023-01-01', periods=1000, freq='40ms')
>>> data = pd.DataFrame({
...     'x': np.random.normal(0, 0.1, 1000) + 0.5*np.sin(2*np.pi*0.1*np.arange(1000)),
...     'y': np.random.normal(0, 0.1, 1000) + 0.3*np.cos(2*np.pi*0.05*np.arange(1000)),
...     'z': np.random.normal(1, 0.1, 1000)  # Gravity component
... }, index=timestamps)
>>>
>>> # Remove high-frequency noise with lowpass filter
>>> filtered_data = remove_noise(data, sf=25, filter_type='lowpass',
...                              filter_cutoff=2, verbose=True)
>>>
>>> # Remove low-frequency drift with highpass filter
>>> filtered_data = remove_noise(data, sf=25, filter_type='highpass',
...                              filter_cutoff=0.1, verbose=True)
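
The zero-phase filtering named in the notes (scipy.signal.butter followed by scipy.signal.filtfilt, order 2) can be sketched for the lowpass case. lowpass_xyz is a hypothetical helper covering only that case, without the input validation documented above.

```python
import numpy as np
import pandas as pd
from scipy.signal import butter, filtfilt


def lowpass_xyz(data, sf, cutoff_hz=2.0, order=2):
    """Sketch: zero-phase Butterworth lowpass applied to the x, y, z columns."""
    # Passing fs lets the cutoff be given in Hz rather than normalized units.
    b, a = butter(order, cutoff_hz, btype="lowpass", fs=sf)
    filtered = data.copy()
    for axis in ("x", "y", "z"):
        # filtfilt runs the filter forward and backward, so no phase shift.
        filtered[axis] = filtfilt(b, a, data[axis].to_numpy())
    return filtered


sf = 25.0
t = np.arange(1000) / sf
slow = np.sin(2 * np.pi * 0.2 * t)         # 0.2 Hz movement to keep
fast = 0.5 * np.sin(2 * np.pi * 8.0 * t)   # 8 Hz component to remove
data = pd.DataFrame(
    {"x": slow + fast, "y": slow + fast, "z": 1.0 + fast},
    index=pd.date_range("2023-01-01", periods=1000, freq="40ms"),
)

filtered = lowpass_xyz(data, sf)
# Largest deviation from the slow component, ignoring the edges.
residual = np.abs(filtered["x"].to_numpy() - slow)[100:900].max()
print(residual)
```

Away from the edges the filtered x axis should track the 0.2 Hz component closely, since the 8 Hz component lies well above the 2 Hz cutoff.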

detect_wear_periods(data, sf, sd_crit, range_crit, window_length, window_skip, meta_dict={}, verbose=False)[source]¶

Detect periods of device wear using acceleration thresholds.

This function identifies when the accelerometer device is being worn by analyzing the standard deviation and range of acceleration data within sliding windows. The algorithm is based on the assumption that worn devices show more variable acceleration patterns than unworn devices.

Parameters:

data (pd.DataFrame) – Preprocessed accelerometer data with datetime index and columns [‘x’, ‘y’, ‘z’]. Data should be in g units and cleaned of major artifacts.
sf (float) – Sampling frequency of the accelerometer data in Hz.
sd_crit (float) – Standard deviation criterion for wear detection. Threshold for the minimum standard deviation required to classify a window as “worn”.
range_crit (float) – Range criterion for wear detection. Threshold for the minimum range of acceleration values required to classify a window as “worn”.
window_length (int) – Length of the sliding window in seconds. Longer windows provide more stable wear detection but may miss brief wear periods.
window_skip (int) – Number of seconds to skip between consecutive windows. Controls the temporal resolution of wear detection.
meta_dict (dict, default={}) – Dictionary to store wear detection metadata and parameters.
verbose (bool, default=False) – Whether to print progress information during wear detection.

Returns:

DataFrame with binary wear detection column [‘wear’] where:
- 1 indicates the device is being worn
- 0 indicates the device is not being worn
The DataFrame has the same index as the input data.

Return type:

pd.DataFrame

Notes

Uses skdh.preprocessing.AccelThresholdWearDetection for the core algorithm
The function converts acceleration data from g to mg units for processing
Wear periods are determined by analyzing both standard deviation and range
The algorithm is sensitive to the choice of sd_crit and range_crit parameters

Examples

>>> import pandas as pd
>>> import numpy as np
>>>
>>> # Create sample accelerometer data
>>> timestamps = pd.date_range('2023-01-01', periods=1000, freq='40ms')
>>> data = pd.DataFrame({
...     'x': np.random.normal(0, 0.1, 1000),
...     'y': np.random.normal(0, 0.1, 1000),
...     'z': np.random.normal(1, 0.1, 1000)  # Gravity component
... }, index=timestamps)
>>>
>>> # Detect wear periods
>>> wear_data = detect_wear_periods(
...     data, sf=25, sd_crit=0.013, range_crit=0.05,
...     window_length=60, window_skip=30, verbose=True
... )
>>> print(f"Wear time: {wear_data['wear'].sum() / 25:.1f} seconds")

calc_weartime(data, sf, meta_dict, verbose)[source]¶

Calculate total, wear, and non-wear time from accelerometer data.

This function computes summary statistics about device wear time based on wear detection results. It calculates the total recording duration, time the device was worn, and time the device was not worn.

Parameters:

data (pd.DataFrame) – DataFrame containing accelerometer data with a ‘wear’ column indicating wear status (1 for worn, 0 for not worn). Should have a datetime index.
sf (float) – Sampling frequency of the accelerometer data in Hz.
meta_dict (dict) – Dictionary to store wear time metadata. Will be updated with the following keys:
- ‘total_time’: Total recording time in seconds
- ‘wear_time’: Time device was worn in seconds
- ‘non-wear_time’: Time device was not worn in seconds
verbose (bool) – Whether to print progress information during calculation.

Returns:

Updates meta_dict with wear time statistics.

Return type:

None

Notes

Total time is calculated from the first to last timestamp
Wear time is calculated by summing the ‘wear’ column and converting to seconds
Non-wear time is calculated as total_time - wear_time
All times are stored in seconds in the meta_dict

Examples

>>> import pandas as pd
>>> import numpy as np
>>>
>>> # Create sample data with wear detection
>>> timestamps = pd.date_range('2023-01-01', periods=1000, freq='40ms')
>>> data = pd.DataFrame({
...     'wear': np.random.choice([0, 1], 1000, p=[0.3, 0.7])  # 70% wear time
... }, index=timestamps)
>>>
>>> # Calculate wear time statistics
>>> meta_dict = {}
>>> calc_weartime(data, sf=25, meta_dict=meta_dict, verbose=True)
>>> print(f"Total time: {meta_dict['total_time']:.1f} seconds")
>>> print(f"Wear time: {meta_dict['wear_time']:.1f} seconds")
>>> print(f"Non-wear time: {meta_dict['non-wear_time']:.1f} seconds")
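
The three quantities in the notes can be computed in a few lines. summarize_wear_time is a hypothetical helper that returns a new dict instead of updating meta_dict, and it follows the notes literally; it is a sketch, not the library's code.

```python
import numpy as np
import pandas as pd


def summarize_wear_time(data, sf):
    """Sketch: total, wear, and non-wear time in seconds."""
    # Total time spans the first to the last timestamp.
    total_time = (data.index[-1] - data.index[0]).total_seconds()
    # Each sample flagged as worn contributes 1 / sf seconds.
    wear_time = float(data["wear"].sum()) / sf
    return {
        "total_time": total_time,
        "wear_time": wear_time,
        "non-wear_time": total_time - wear_time,
    }


# 1001 samples at 25 Hz span exactly 40 seconds; the first 500 are worn.
index = pd.date_range("2023-01-01", periods=1001, freq="40ms")
wear = np.zeros(1001, dtype=int)
wear[:500] = 1
summary = summarize_wear_time(pd.DataFrame({"wear": wear}, index=index), sf=25)
print(summary)
```

In this sketch total time is measured between timestamps, so n samples span (n - 1) / sf seconds while wear time counts samples; the two can differ by one sample period.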
