Privacy-Sensitive Conversations between Care Workers and Care Home Residents in a Residential Care Home

doi:10.70124/hbtq5-ykv92

Published December 4, 2024 | Version 1.0.0

Dataset Open

1. TU Wien

Data collectors:

Data manager:

Grabler, Reinhard¹

Editor:

Starzinger, Michael

1. TU Wien

Dataset Card for "privacy-care-interactions"

Dataset Description

Purpose and Features

🔒 Collection of Privacy-Sensitive Conversations between Care Workers and Care Home Residents in a Residential Care Home 🔒

The dataset is useful to train and evaluate models to identify and classify privacy-sensitive parts of conversations from text, especially in the context of AI assistants and LLMs.

Dataset Overview

Total entries: 95
Number of distinct taxonomy categories in the public dataset: 4
Number of distinct conversational categories in the public dataset: 7
Papers:
- Continues the work of: Privacy Agents: Utilizing Large Language Models to Safeguard Contextual Integrity in Elderly Care
- Continues the work of: Prototype of a care documentation support system using audio recordings of care actions and large language models

Language Distribution 🌍

English (en): 95

Locale Distribution 🌎

United States (US) 🇺🇸: 95

Key Facts 🔑

This is synthetic data! Generated using proprietary algorithms - no privacy violations!
Conversations are classified following the taxonomy for privacy-sensitive robotics by Rueben et al. (2017).
The data was manually labeled by an expert.

Dataset Structure

Data Instances

The provided data format is .jsonl, the JSON Lines text format, also called newline-delimited JSON. An example entry looks as follows.

{ "text": "CW: Have you ever been to Italy? CR: Oh, yes... many years ago.", "taxonomy": 0, "category": 0, "affected_speaker": 1, "language": "en", "locale": "US", "data_type": 1, "uid": 16, "split": "train" }
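Because every line of a .jsonl file is one complete JSON object, the files can be read with the Python standard library alone. The following is a minimal sketch; the function name `parse_jsonl` is illustrative and not part of the dataset.

```python
import json


def parse_jsonl(text):
    """Parse JSON Lines text: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.split("\n") if line.strip()]


# The example entry from this section, written as a single line of a file.
sample = (
    '{"text": "CW: Have you ever been to Italy? CR: Oh, yes... many years ago.", '
    '"taxonomy": 0, "category": 0, "affected_speaker": 1, "language": "en", '
    '"locale": "US", "data_type": 1, "uid": 16, "split": "train"}'
)

entries = parse_jsonl(sample + "\n")
print(entries[0]["uid"], entries[0]["split"])
```

To read one of the dataset files, pass its contents to the same function, for example `parse_jsonl(open(path, encoding="utf-8").read())`.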

Data Fields

The data fields are:

text: a string feature. The speaker abbreviations refer to the care worker (CW) and the care recipient (CR).
taxonomy: a classification label, with possible values including informational (0), invasion (1), collection (2), processing (3), dissemination (4), physical (5), personal-space (6), territoriality (7), intrusion (8), obtrusion (9), contamination (10), modesty (11), psychological (12), interrogation (13), psychological-distance (14), social (15), association (16), crowding-isolation (17), public-gaze (18), solitude (19), intimacy (20), anonymity (21), reserve (22). The taxonomy is derived from Rueben et al. (2017). The classifications were manually labeled by an expert.
category: a classification label, with possible values including personal-information (0), family (1), health (2), thoughts (3), values (4), acquaintance (5), appointment (6). The privacy category affected in the conversation. The classifications were manually labeled by an expert.
affected_speaker: a classification label, with possible values including care-worker (0), care-recipient (1), other (2), both (3). The speaker whose privacy is impacted during the conversation. The classifications were manually labeled by an expert.
language: a string feature. Language code as defined by ISO 639.
locale: a string feature. Regional code as defined by ISO 3166-1 alpha-2.
data_type: a classification label, with possible values including real (0), synthetic (1).
uid: an int64 feature. A unique identifier within the dataset.
split: a string feature. Either train, validation or test.
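The integer labels can be mapped back to their names with plain lookup lists. The sketch below copies the label names from the field descriptions above; the helper names `decode` and `split_turns` are illustrative, and `split_turns` assumes that every utterance is introduced by `CW:` or `CR:`, as in the example entry.

```python
import re

# Label names in index order, copied from the field descriptions.
TAXONOMY = [
    "informational", "invasion", "collection", "processing", "dissemination",
    "physical", "personal-space", "territoriality", "intrusion", "obtrusion",
    "contamination", "modesty", "psychological", "interrogation",
    "psychological-distance", "social", "association", "crowding-isolation",
    "public-gaze", "solitude", "intimacy", "anonymity", "reserve",
]
CATEGORY = [
    "personal-information", "family", "health", "thoughts", "values",
    "acquaintance", "appointment",
]
AFFECTED_SPEAKER = ["care-worker", "care-recipient", "other", "both"]
DATA_TYPE = ["real", "synthetic"]


def decode(entry):
    """Return a copy of the entry with integer labels replaced by names."""
    decoded = dict(entry)
    decoded["taxonomy"] = TAXONOMY[entry["taxonomy"]]
    decoded["category"] = CATEGORY[entry["category"]]
    decoded["affected_speaker"] = AFFECTED_SPEAKER[entry["affected_speaker"]]
    decoded["data_type"] = DATA_TYPE[entry["data_type"]]
    return decoded


def split_turns(text):
    """Split 'CW: ... CR: ...' into (speaker, utterance) pairs."""
    parts = re.split(r"\b(CW|CR):\s*", text)
    # parts[0] is any text before the first speaker tag; it is skipped.
    return [(parts[i], parts[i + 1].strip()) for i in range(1, len(parts) - 1, 2)]


entry = {
    "text": "CW: Have you ever been to Italy? CR: Oh, yes... many years ago.",
    "taxonomy": 0, "category": 0, "affected_speaker": 1,
    "language": "en", "locale": "US", "data_type": 1, "uid": 16, "split": "train",
}
print(decode(entry)["affected_speaker"])
print(split_turns(entry["text"]))
```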

Dataset Splits

The dataset has 2 subsets:

split: with a total of 95 examples split into train, validation and test (70%-15%-15%)
unsplit: with a total of 95 examples in a single train split

name	train	validation	test
split	66	14	15
unsplit	95	n/a	n/a

The files follow the naming convention subset-split-language.jsonl. The following files are contained in the dataset:

split-train-en.jsonl
split-validation-en.jsonl
split-test-en.jsonl
unsplit-train-en.jsonl
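The naming convention can be expressed as a small helper that rebuilds the file list. This is a sketch only: `dataset_filename`, `SPLITS` and `EXPECTED_ROWS` are illustrative names, and the row counts are copied from the table above.

```python
def dataset_filename(subset, split, language="en"):
    """Build a file name following the subset-split-language.jsonl convention."""
    return f"{subset}-{split}-{language}.jsonl"


# Splits available in each subset, as listed in the table above.
SPLITS = {"split": ["train", "validation", "test"], "unsplit": ["train"]}

# Number of examples per file, copied from the table above.
EXPECTED_ROWS = {
    "split-train-en.jsonl": 66,
    "split-validation-en.jsonl": 14,
    "split-test-en.jsonl": 15,
    "unsplit-train-en.jsonl": 95,
}

files = [
    dataset_filename(subset, split)
    for subset, splits in SPLITS.items()
    for split in splits
]
print(files)
```

Comparing the number of parsed entries in each file against `EXPECTED_ROWS` is a quick way to confirm that a download is complete.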

Dataset Creation

Curation Rationale

Recording audio of care workers and residents during care interactions, which include partial and full body washing, giving of medication, and wound care, is a highly privacy-sensitive use case. This dataset was therefore created from privacy-sensitive parts of conversations, synthesized from real-world data. It serves as a basis for fine-tuning a local LLM to highlight and classify privacy-sensitive sections of transcripts created in care interactions, so that these sections can then be masked to protect privacy.

Source Data

Initial Data Collection

The initial data was collected in the project Caring Robots of TU Wien in cooperation with Caritas Wien. One project track aims to use Large Language Models (LLMs) to support care workers' documentation with LLM-generated summaries of audio recordings of interactions between care workers and care home residents. The initial data are the transcriptions of those care interactions.

Data Processing

The transcriptions were thoroughly reviewed, and two experts identified and marked sections containing privacy-sensitive information using qualitative data analysis software. Subsequently, the accessible portions of the interviews were translated from German to US English using the locally executed LLM icky/translate. In the next step, a second locally run model, llama3.1:70b, was used to synthesize the conversation segments. This process involved generating similar, yet distinct and new, conversations that are not linked to the original data. The dataset was split using the train_test_split function from the Scikit-learn library.
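The parameters of the split are not documented here. As a worked example of the arithmetic, the sketch below shows one two-stage procedure that yields the published 66/14/15 sizes for 95 entries: hold out 30% first, then divide the held-out part in half. It uses standard-library shuffling as a stand-in for train_test_split; the seed, the function name `holdout_split` and the choice to round the held-out size up are assumptions made for illustration.

```python
import math
import random


def holdout_split(items, holdout_fraction, rng):
    """Shuffle items and hold out ceil(fraction * n) of them.

    Returns (kept, held_out). Rounding the held-out size up is an
    assumption chosen because it reproduces the published counts.
    """
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_holdout = math.ceil(holdout_fraction * len(shuffled))
    return shuffled[n_holdout:], shuffled[:n_holdout]


rng = random.Random(0)  # hypothetical seed, not taken from the dataset
uids = list(range(95))  # stand-in identifiers for the 95 entries

# Stage 1: 70% train, 30% held out (ceil(28.5) = 29 held out, 66 kept).
train, rest = holdout_split(uids, 0.30, rng)
# Stage 2: halve the held-out part (ceil(14.5) = 15 test, 14 validation).
validation, test = holdout_split(rest, 0.50, rng)

print(len(train), len(validation), len(test))  # 66 14 15
```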

Who are the source language producers?

Care workers and care recipients in a residential care home in a German-speaking country.

Annotations

Annotation process

Each entry in the synthetic data was read by an expert and labeled according to the various classifications.

Who are the annotators?

Reinhard Grabler and Michael Starzinger

Personal and Sensitive Information

As this is completely synthetic data, there is no personal or sensitive information in this data set.

Considerations for Using the Data

Social Impact of Dataset

The ability to automatically detect privacy-sensitive parts of a conversation is a versatile use case that extends beyond care interactions. This technology has broad applications, such as in human-robot interactions and other scenarios involving active speech processing. Its implementation can significantly enhance privacy across these diverse applications.

Discussion of Biases

The source data comes from a residential care home in a German-speaking country, and the classification was also carried out by an expert in a German-speaking country. Therefore, this data set does not reflect cultural differences in the perception of privacy.

Other Known Limitations

The collection of original data for synthesis is a highly time-consuming process. Recordings are conducted by care workers amidst their already demanding day-to-day care responsibilities. Additionally, complete transcripts must be manually reviewed to identify and redact privacy-sensitive information. Consequently, the current volume of data is limited. Ongoing efforts aim to expand the master dataset as additional phases of the study are conducted. Furthermore, research is underway to explore and develop improved methods for generating high-quality synthetic data tailored to this use case.

Additional Information

Dataset Curators

Reinhard Grabler

Licensing Information

Creative Commons Attribution 4.0 International (CC BY 4.0)

The Creative Commons Attribution license allows re-distribution and re-use of a licensed work on the condition that the creator is appropriately credited.

Citation Information

If you use this dataset, please cite it as indicated.

Contributions

Thanks to Michael Starzinger for supporting the review and annotation of the transcriptions.

Files

README.md

Files (107.9 KiB)

Name	MD5	Size
README.md	0830152e11a9f4073ad81717c5b75afe	13.3 KiB
split-test-en.jsonl	22ac2d2b13f4a6c704a9fbc151b642bb	8.1 KiB
split-train-en.jsonl	c223c6bbadaa0864e54b36a4a61ef14b	32.3 KiB
split-validation-en.jsonl	7dcdc66c4b97e021c99c4f27c96fa208	6.9 KiB
unsplit-train-en.jsonl	2a25d0816df0cf71bf377c1294042e52	47.3 KiB
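The MD5 checksums listed above can be used to check that downloaded files are intact. The following is a minimal sketch using Python's hashlib; the function names `md5_of_file` and `verify` and the default directory are illustrative, and the checksum values are copied from the file list.

```python
import hashlib
import os

# Checksums copied from the file list above.
EXPECTED_MD5 = {
    "README.md": "0830152e11a9f4073ad81717c5b75afe",
    "split-test-en.jsonl": "22ac2d2b13f4a6c704a9fbc151b642bb",
    "split-train-en.jsonl": "c223c6bbadaa0864e54b36a4a61ef14b",
    "split-validation-en.jsonl": "7dcdc66c4b97e021c99c4f27c96fa208",
    "unsplit-train-en.jsonl": "2a25d0816df0cf71bf377c1294042e52",
}


def md5_of_file(path, chunk_size=65536):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory="."):
    """Map each listed file found in the directory to True if its MD5 matches."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            results[name] = md5_of_file(path) == expected
    return results
```

Calling `verify("path/to/download")` returns a dictionary such as `{"README.md": True, ...}`; files that are not present in the directory are simply omitted.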

Additional details

Continues: Publication: 10.34726/5960 (DOI); Publication: 10.34726/6399 (DOI)
Is documented by: Data Management Plan: 10.5281/zenodo.14276211 (DOI)

Created: 2024-12 (synthesized data)
Collected: 2024-04/2024-08 (initial data)
