# Components
This page describes the pipeline components that `clinlp` offers, along with how to configure and use them effectively. It assumes you are familiar with the foundations of the `clinlp` and `spaCy` frameworks. If not, consider reading the [Getting Started](getting_started.md) page first.
## Basic components
### `clinlp` (language)
| property | value |
| --- | --- |
| name | `clinlp` |
| class | [clinlp.language.Clinlp](clinlp.language.Clinlp) |
| example | `nlp = spacy.blank("clinlp")` |
| requires | `-` |
| assigns | `-` |
| config options | `-` |
The `clinlp` language class is a subclass of the `spaCy` `Language` class, with some customizations for clinical text. It contains the default settings for Dutch clinical text, such as rules for tokenizing, abbreviations and units. Creating an instance of the `clinlp` language class is usually the first step in setting up a pipeline for clinical text processing.
```{admonition} Note
:class: tip
Note that `clinlp` does not start from a pre-trained `spaCy` model, but from a blank model. This is because `spaCy` only provides models and components pre-trained on general Dutch text, which typically perform poorly on the domain-specific language of clinical text. You are always free to add pre-trained components from a general Dutch model to the pipeline if needed.
```
The included tokenizer employs some custom rule-based logic, including:
- Clinical text-specific logic for splitting punctuation, units, dosages (e.g. `20mg/dag` :arrow_right: `20` `mg` `/` `dag`)
- Custom lists of abbreviations, units (e.g. `pt.`, `zn.`, `mmHg`)
- Custom tokenizing rules (e.g. `xdd` :arrow_right: `x` `dd`)
- Treating [DEDUCE](https://github.com/vmenger/deduce) tags as a single token (e.g. `[DATUM-1]`). De-identification is not built into `clinlp` and should be done as a preprocessing step.
### `clinlp_normalizer`
| property | value |
| --- | --- |
| name | `clinlp_normalizer` |
| class | [clinlp.normalizer.Normalizer](clinlp.normalizer.Normalizer) |
| example | `nlp.add_pipe("clinlp_normalizer")` |
| requires | `-` |
| assigns | `token.norm` |
| config options | `lowercase = True`, `map_non_ascii = True` |
The normalizer sets the `Token.norm` attribute, which can be used by further components (entity matching, qualification). It currently has two options (enabled by default):
- Lowercasing
- Mapping non-ASCII characters to ASCII characters where possible, for instance by removing diacritics. It will map `ë` :arrow_right: `e`, but keeps most other non-ASCII characters intact (e.g. `µ`, `²`).
Note that this component only has an effect when successor components are explicitly configured to match on the `Token.norm` attribute.
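The two options can be approximated with the standard library alone. The sketch below is not the `clinlp` implementation: `normalize_sketch` is a hypothetical helper, and it assumes that the non-ASCII mapping amounts to stripping combining diacritics after canonical decomposition, which leaves characters such as `µ` and `²` untouched.

```python
import unicodedata


def normalize_sketch(text: str, lowercase: bool = True, map_non_ascii: bool = True) -> str:
    """Hypothetical stand-in for the normalizer, for illustration only."""
    if lowercase:
        text = text.lower()
    if map_non_ascii:
        # Canonical decomposition splits "ë" into "e" plus a combining diaeresis.
        decomposed = unicodedata.normalize("NFD", text)
        # Drop the combining marks; characters without a canonical
        # decomposition (such as "µ" and "²") pass through unchanged.
        stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
        text = unicodedata.normalize("NFC", stripped)
    return text


print(normalize_sketch("Patiënt"))
print(normalize_sketch("5 µg/m²"))
```

In a real pipeline the normalized form is read from `Token.norm`, so this helper is only useful for reasoning about which strings a `NORM`-based term would match.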
### `clinlp_sentencizer`
| property | value |
| --- | --- |
| name | `clinlp_sentencizer` |
| class | [clinlp.sentencizer.Sentencizer](clinlp.sentencizer.Sentencizer) |
| example | `nlp.add_pipe("clinlp_sentencizer")` |
| requires | `-` |
| assigns | `token.is_sent_start`, `doc.sents` |
| config options | `sent_end_chars = [".", "!", "?", "\n", "\r"]`, `sent_start_punct = ["-", "*", "[", "("]` |
The sentencizer is a rule-based sentence boundary detector designed for clinical text. A sentence ends whenever a character that marks a sentence ending is matched (e.g. newline, period, question mark). The next sentence starts whenever an alpha character or a character in `sent_start_punct` is encountered. This prevents, for example, a sentence ending in `...` from being classified as three separate sentences. The sentencizer also correctly detects items in enumerations (e.g. starting with `-` or `*`).
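The boundary rule can be illustrated on a plain list of token strings. This is a minimal sketch under stated assumptions, not the component's code: `sentence_starts_sketch` is a hypothetical name, tokens are simple strings rather than `spaCy` tokens, and the first token is assumed to always open a sentence.

```python
SENT_END_CHARS = [".", "!", "?", "\n", "\r"]
SENT_START_PUNCT = ["-", "*", "[", "("]


def sentence_starts_sketch(tokens):
    """Return one boolean per token: True if the token starts a sentence."""
    starts = []
    pending = True  # a boundary is pending before the first token
    for i, token in enumerate(tokens):
        can_start = token[:1].isalpha() or token in SENT_START_PUNCT
        if pending and (can_start or i == 0):
            starts.append(True)
            pending = False
        else:
            starts.append(False)
        if token in SENT_END_CHARS:
            # Mark a boundary, but wait for an eligible token
            # before opening the next sentence.
            pending = True
    return starts


tokens = ["Geen", "koorts", ".", ".", ".", "Wel", "hoesten", "\n", "-", "paracetamol"]
starts = sentence_starts_sketch(tokens)
print([token for token, is_start in zip(tokens, starts) if is_start])
```

Because a new sentence only opens at an eligible token, the three consecutive periods stay attached to the first sentence, and the `-` after the newline opens an enumeration item.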
## Entity Matching
### `clinlp_rule_based_entity_matcher`
| property | value |
| --- | --- |
| name | `clinlp_rule_based_entity_matcher` |
| class | [clinlp.ie.entity.RuleBasedEntityMatcher](clinlp.ie.entity.RuleBasedEntityMatcher) |
| example | `nlp.add_pipe("clinlp_rule_based_entity_matcher")` |
| requires | `-` |
| assigns | `doc.spans['ents']` |
| config options | `attr = "TEXT"`, `proximity = 0`, `fuzzy = 0`, `fuzzy_min_len = 0`, `pseudo = False`, `resolve_overlap = False`, `spans_key = 'ents'` |
The `clinlp_rule_based_entity_matcher` component can be used for matching entities in text, based on a dictionary of known concepts and their terms/synonyms. It includes options for matching on different token attributes, proximity matching, fuzzy matching and non-matching pseudo/negative terms.
The most basic example would be the following, with further options described below:
```python
terms = {
    "sepsis": [
        "sepsis",
        "lijnsepsis",
        "systemische infectie",
        "bacteriemie",
    ],
    "veneus_infarct": [
        "veneus infarct",
        "VI",
    ]
}

entity_matcher = nlp.add_pipe("clinlp_rule_based_entity_matcher")
entity_matcher.add_terms_from_dict(terms)
```
```{admonition} Spans vs ents
:class: tip
`clinlp` stores entities in `doc.spans`, specifically in `doc.spans["ents"]`. The reason for this is that spans can overlap, while the entities in `doc.ents` cannot. If you use other/custom components, make sure they read/write entities from/to the same span key if interoperability is needed.
```
```{admonition} Using spaCy components directly
:class: tip
The `clinlp_rule_based_entity_matcher` component wraps the `spaCy` `Matcher` and `PhraseMatcher` components, adding some convenience and configurability. However, the `Matcher`, `PhraseMatcher` or `SpanRuler` can also be used directly with `clinlp` for those who prefer it. You can configure the `SpanRuler` to write to the same `SpanGroup` as follows:

    from clinlp.ie import SPAN_KEY
    ruler = nlp.add_pipe("span_ruler", config={"spans_key": SPAN_KEY})

```
#### Attribute
Specify the token attribute the entity matcher should use as follows (by default `TEXT`):
```python
entity_matcher = nlp.add_pipe("clinlp_rule_based_entity_matcher", config={"attr": "NORM"})
```
Any [Token attribute](https://spacy.io/api/token#attributes) can be used, but in the above example the `clinlp_normalizer` should be added before the entity matcher; otherwise the `NORM` attribute is simply the literal text. `clinlp` does not include part-of-speech tags and dependency trees, at least not until a reliable model for Dutch clinical text is created, though it is always possible to add a relevant component from a trained (general) Dutch model if needed.
#### Proximity matching
The proximity setting defines how many tokens can optionally be skipped between the tokens of a pattern. With `proximity` set to `1`, the pattern `slaapt slecht` will also match `slaapt vaak slecht`, but not `slaapt al weken slecht`.
```python
entity_matcher = nlp.add_pipe("clinlp_rule_based_entity_matcher", config={"proximity": 1})
```
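To make the skipping rule concrete, the following sketch checks a pattern against a token list in plain Python. It is an illustration rather than the matcher's implementation: `proximity_match` is a hypothetical helper, tokens are compared as literal strings, and it assumes the limit applies to each gap between consecutive pattern tokens.

```python
def _match_from(pattern, tokens, pos, proximity):
    """tokens[pos] already matched pattern[0]; try to match the remainder."""
    if len(pattern) == 1:
        return True
    for skipped in range(proximity + 1):
        nxt = pos + 1 + skipped
        if nxt < len(tokens) and tokens[nxt] == pattern[1]:
            if _match_from(pattern[1:], tokens, nxt, proximity):
                return True
    return False


def proximity_match(pattern, tokens, proximity=0):
    """True if the pattern occurs with at most `proximity` skipped tokens per gap."""
    if not pattern:
        return False
    return any(
        token == pattern[0] and _match_from(pattern, tokens, i, proximity)
        for i, token in enumerate(tokens)
    )


pattern = ["slaapt", "slecht"]
print(proximity_match(pattern, "pt slaapt vaak slecht".split(), proximity=1))
print(proximity_match(pattern, "pt slaapt al weken slecht".split(), proximity=1))
```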
#### Fuzzy matching
Fuzzy matching enables finding misspelled variants of terms. For instance, with `fuzzy` set to `1`, the pattern `diabetes` will also match `diabets`, `ddiabetes`, or `diabetis`, but not `diabetse` or `ddiabetess`. The threshold is based on Levenshtein distance with insertions, deletions and replacements (but not swaps).
```python
entity_matcher = nlp.add_pipe("clinlp_rule_based_entity_matcher", config={"fuzzy": 1})
```
Additionally, the `fuzzy_min_len` argument can be used to specify the minimum length of a phrase for fuzzy matching. This also works for multi-token phrases. For example, with `fuzzy` set to `1` and `fuzzy_min_len` set to `5`, the pattern `bloeding graad ii` would also match `bloedin graad ii`, but not `bloeding graad iii`.
```python
entity_matcher = nlp.add_pipe("clinlp_rule_based_entity_matcher", config={"fuzzy": 1, "fuzzy_min_len": 5})
```
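The examples above can be reproduced with a small standalone sketch. It is not the matcher's implementation: `levenshtein` and `fuzzy_phrase_match` are hypothetical helpers, phrases are split on whitespace, and the sketch assumes that `fuzzy_min_len` is checked per token, which is the reading consistent with the `bloeding graad ii` example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with insertions, deletions and replacements (no swaps)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + (char_a != char_b), # replacement
            ))
        previous = current
    return previous[-1]


def fuzzy_phrase_match(pattern: str, text: str, fuzzy: int = 0, fuzzy_min_len: int = 0) -> bool:
    """Compare token by token; short pattern tokens must match exactly."""
    pattern_tokens, text_tokens = pattern.split(), text.split()
    if len(pattern_tokens) != len(text_tokens):
        return False
    for pattern_token, text_token in zip(pattern_tokens, text_tokens):
        # Assumption: tokens shorter than fuzzy_min_len get no edit budget.
        allowed = fuzzy if len(pattern_token) >= fuzzy_min_len else 0
        if levenshtein(pattern_token, text_token) > allowed:
            return False
    return True


print(levenshtein("diabetes", "diabetse"))
print(fuzzy_phrase_match("bloeding graad ii", "bloedin graad ii", fuzzy=1, fuzzy_min_len=5))
print(fuzzy_phrase_match("bloeding graad ii", "bloeding graad iii", fuzzy=1, fuzzy_min_len=5))
```

A swap such as `diabetse` counts as two replacements, which is why it falls outside a threshold of `1`.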
#### Terms
The settings above are described at the matcher level, but can all be overridden at the term level by adding a `Term` to a concept, rather than a literal phrase:
```python
from clinlp.ie import Term

terms = {
    "sepsis": [
        "sepsis",
        "lijnsepsis",
        Term("early onset", proximity=1),
        Term("late onset", proximity=1),
        Term("EOS", attr="TEXT", fuzzy=0),
        Term("LOS", attr="TEXT", fuzzy=0)
    ]
}

entity_matcher = nlp.add_pipe("clinlp_rule_based_entity_matcher", config={"attr": "NORM", "fuzzy": 1})
entity_matcher.add_terms_from_dict(terms)
```
In the above example, the matcher-level configuration matches on the `NORM` attribute with `fuzzy` set to `1`. The terms `early onset` and `late onset` inherit these settings and additionally set proximity matching to `1`. For the `EOS` and `LOS` abbreviations, the `TEXT` attribute is used instead (so the matching is case sensitive), and fuzzy matching is disabled.
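The override behaviour can be pictured as a simple merge of term-level values over matcher-level defaults. The sketch below is illustrative only: `TermSketch` and `effective_settings` are hypothetical names, not part of `clinlp`, and unset term options are modelled as `None`.

```python
from dataclasses import dataclass
from typing import Optional

MATCHER_DEFAULTS = {"attr": "NORM", "proximity": 0, "fuzzy": 1}


@dataclass
class TermSketch:
    """Hypothetical stand-in for a term with optional overrides."""
    phrase: str
    attr: Optional[str] = None
    proximity: Optional[int] = None
    fuzzy: Optional[int] = None


def effective_settings(term: TermSketch, defaults: dict) -> dict:
    """Term-level values win; unset values fall back to the matcher level."""
    resolved = {}
    for name, default in defaults.items():
        override = getattr(term, name)
        # Compare against None so explicit overrides such as fuzzy=0 are kept.
        resolved[name] = default if override is None else override
    return resolved


print(effective_settings(TermSketch("early onset", proximity=1), MATCHER_DEFAULTS))
print(effective_settings(TermSketch("EOS", attr="TEXT", fuzzy=0), MATCHER_DEFAULTS))
```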
#### Pseudo/negative phrases
On the term level, it is possible to add pseudo or negative patterns, for those phrases that need to be excluded. For example:
```python
from clinlp.ie import Term

terms = {
    "prematuriteit": [
        "prematuur",
        Term("prematuur ademhalingspatroon", pseudo=True),
    ]
}
```
In this case `prematuur` will be matched, but not in the context of `prematuur ademhalingspatroon` (which may indicate prematurity, but is not a definitive diagnosis).
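One way to think about pseudo terms is as a filter: regular matches that overlap a pseudo match are dropped. The following sketch shows that idea on token offsets. It is an assumption about the general mechanism, not the component's code, and `drop_pseudo_overlaps` is a hypothetical helper.

```python
def drop_pseudo_overlaps(matches, pseudo_matches):
    """Keep matches that overlap no pseudo match.

    Spans are (start, end) token offsets, with the end exclusive.
    """
    def overlaps(a, b):
        return a[0] < b[1] and b[0] < a[1]

    return [m for m in matches if not any(overlaps(m, p) for p in pseudo_matches)]


tokens = "prematuur geboren , prematuur ademhalingspatroon".split()

# Every occurrence of the regular term "prematuur".
matches = [(i, i + 1) for i, token in enumerate(tokens) if token == "prematuur"]

# Every occurrence of the pseudo term "prematuur ademhalingspatroon".
pseudo = ["prematuur", "ademhalingspatroon"]
pseudo_matches = [
    (i, i + 2) for i in range(len(tokens) - 1) if tokens[i:i + 2] == pseudo
]

print(drop_pseudo_overlaps(matches, pseudo_matches))
```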
#### `spaCy` patterns
Finally, if you need more control than literal phrases and terms as explained above, the entity matcher also accepts [`spaCy` patterns](https://spacy.io/usage/rule-based-matching#adding-patterns). These patterns do not respect any other configurations (like attribute, fuzzy, proximity, etc.):
```python
from clinlp.ie import Term

terms = {
    "delier": [
        Term("delier", attr="NORM"),
        Term("DOS", attr="TEXT"),
        [
            {"NORM": {"IN": ["zag", "ziet", "hoort", "hoorde", "ruikt", "rook"]}},
            {"OP": "?"},
            {"OP": "?"},
            {"OP": "?"},
            {"NORM": {"FUZZY1": "dingen"}},
            {"OP": "?"},
            {"NORM": "die"},
            {"NORM": "er"},
            {"OP": "?"},
            {"NORM": "niet"},
            {"OP": "?"},
            {"NORM": {"IN": ["zijn", "waren"]}}
        ],
    ]
}
```
#### Adding concept sets
External lists of concepts (e.g. from a medical thesaurus such as `UMLS`) can also be loaded directly from `JSON` or `csv`.
##### Adding terms from json
Terms from `JSON` can be added by using `add_terms_from_json`. Your `JSON` file should have the following format:
```json
{
    "terms": {
        "concept_identifier": [
            "term",
            {
                "phrase": "term",
                "attr": "some_attr"
            },
            [
                {
                    "NORM": "term"
                }
            ]
        ],
        "next_concept_identifier": [
            "other_term"
        ]
    }
}
```
Each term can be presented as a `str` (direct phrase), `dict` (arguments directly passed to `clinlp.ie.Term`), or `list` (a `spaCy` pattern). Top-level keys other than `terms` are ignored, so metadata can be added (e.g. a description, authors, etc.).
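A file in this format can be produced with the standard `json` module. The sketch below builds a small concept set with one term of each kind plus a metadata key; the concept, phrases and `description` value are made up for illustration.

```python
import json

concept_file = {
    # Metadata key: anything other than "terms" is ignored when loading.
    "description": "Hypothetical example concept set",
    "terms": {
        "sepsis": [
            "sepsis",                                   # str: direct phrase
            {"phrase": "early onset", "proximity": 1},  # dict: Term arguments
            [{"NORM": "lijnsepsis"}],                   # list: spaCy pattern
        ],
    },
}

serialized = json.dumps(concept_file, indent=4)

# Round-trip to check that each term survives as str, dict or list.
loaded = json.loads(serialized)
kinds = {str: "phrase", dict: "Term arguments", list: "spaCy pattern"}
for concept, terms in loaded["terms"].items():
    for term in terms:
        print(concept, "|", kinds[type(term)], "|", term)
```

Writing `serialized` to disk gives a file in the layout described above, which can then be loaded with `add_terms_from_json`.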
##### Adding terms from csv
Terms from `csv` can be added through the `add_terms_from_csv` function. Your `csv` file should contain one combination of concept and phrase per line, with optional columns to configure the `Term` options described above (e.g. `attr`, `proximity`, `fuzzy`). You may present the columns in any order, but make sure the names match the `Term` attributes. Any other columns are ignored. For example:
| **concept** | **phrase** | **attr** | **proximity** | **fuzzy** | **fuzzy_min_len** | **pseudo** | **comment** |
|--|--|--|--|--|--|--|--|
| prematuriteit | prematuriteit | | | | | | some comment |
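| prematuriteit | … | | | | | | |

### `clinlp_context_algorithm`

| property | value |
| --- | --- |
| name | `clinlp_context_algorithm` |
| config options | `load_rules = True`, `rules = "src/clinlp/resources/context_rules.json"` |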
The rule-based [Context Algorithm](https://doi.org/10.1016%2Fj.jbi.2009.05.002) is fairly accurate, and quite transparent and fast. A set of rules that checks for `Presence`, `Temporality`, and `Experiencer` is loaded by default:
```python
nlp.add_pipe("clinlp_context_algorithm", config={"phrase_matcher_attr": "NORM"})
```
A custom set of rules, including different types of qualifiers, can easily be defined. See [`src/clinlp/resources/context_rules.json`](../../src/clinlp/resources/context_rules.json) for an example, and load it as follows:
```python
cm = nlp.add_pipe("clinlp_context_algorithm", config={"rules": "/path/to/my_own_ruleset.json"})
```
```{admonition} Definitions of qualifiers
:class: tip
For more extensive documentation on the definitions of the qualifiers we use in `clinlp`, see the [Qualifiers](qualifiers.md) page.
```
### `clinlp_negation_transformer`
| property | value |
| --- | --- |
| name | `clinlp_negation_transformer` |
| class | [clinlp.ie.qualifier.transformer.NegationTransformer](clinlp.ie.qualifier.transformer.NegationTransformer) |
| example | `nlp.add_pipe('clinlp_negation_transformer')` |
| requires | `doc.spans['ents']` |
| assigns | `span._.qualifiers` |
| config options | `token_window = 32`, `strip_entities = True`, `placeholder = None`, `prob_aggregator = statistics.mean`, `absence_threshold = 0.1`, `presence_threshold = 0.9` |
The `clinlp_negation_transformer` wraps the negation detector described in [van Es et al, 2022](https://doi.org/10.48550/arxiv.2209.00470). The underlying transformer can be found on [HuggingFace](https://huggingface.co/UMCU/). The negation detector is reported as more accurate than the rule-based version (see paper for details), at the cost of less transparency and additional computational cost.
This component requires the following optional dependencies:
```bash
pip install "clinlp[transformers]"
```
The component can be configured to consider a maximum number of tokens as context when determining whether a term is negated. There is an option to strip the entity, removing any potential whitespace or punctuation before passing it to the transformer. The `placeholder` option can be used to replace the entity with a placeholder token, which has a small impact on the output probability. The `prob_aggregator` option can be used to aggregate the probabilities of the transformer, which is only used for multi-token entities.
The thresholds define where the cutoffs for absence and presence lie. If the predicted probability of presence < `absence_threshold`, entities will be qualified as `Presence.Absent`. If the predicted probability of presence > `presence_threshold`, entities will be qualified as `Presence.Present`. If the predicted probability is between these thresholds, the entity will be qualified as `Presence.Uncertain`.
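The threshold logic can be written out as a small function. This is a sketch of the decision rule as described above, not the component's code: `qualify_presence_sketch` is a hypothetical name, the qualifier values are represented as plain strings, and the input is assumed to be one presence probability per entity token.

```python
import statistics


def qualify_presence_sketch(
    token_probs,
    absence_threshold=0.1,
    presence_threshold=0.9,
    prob_aggregator=statistics.mean,
):
    """Map per-token presence probabilities to a qualifier label."""
    # Multi-token entities yield several probabilities; combine them first.
    prob = prob_aggregator(token_probs)
    if prob < absence_threshold:
        return "Presence.Absent"
    if prob > presence_threshold:
        return "Presence.Present"
    return "Presence.Uncertain"


print(qualify_presence_sketch([0.02, 0.06]))
print(qualify_presence_sketch([0.95]))
print(qualify_presence_sketch([0.2, 0.95]))
print(qualify_presence_sketch([0.2, 0.95], prob_aggregator=max))
```

Swapping the aggregator changes how strict the qualification is for multi-token entities: with `max`, a single confident token is enough to pass the presence threshold.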
```{admonition} Definitions of qualifiers
:class: tip
For more extensive documentation on the definitions of the qualifiers we use in `clinlp`, see the [Qualifiers](qualifiers.md) page.
```
### `clinlp_experiencer_transformer`
| property | value |
| --- | --- |
| name | `clinlp_experiencer_transformer` |
| class | [clinlp.ie.qualifier.transformer.ExperiencerTransformer](clinlp.ie.qualifier.transformer.ExperiencerTransformer) |
| example | `nlp.add_pipe('clinlp_experiencer_transformer')` |
| requires | `doc.spans['ents']` |
| assigns | `span._.qualifiers` |
| config options | `token_window = 32`, `strip_entities = True`, `placeholder = None`, `prob_aggregator = statistics.mean`, `family_threshold = 0.5` |
The `clinlp_experiencer_transformer` wraps a model very similar to the one used by the [`clinlp_negation_transformer`](#clinlp_negation_transformer) component, with which it shares most of its configuration.
Additionally, it has a threshold for determining whether an entity is experienced by the patient or by a family member. If the predicted probability < `family_threshold`, the entity will be qualified as `Experiencer.Patient`. If the predicted probability > `family_threshold`, the entity will be qualified as `Experiencer.Family`. The `Experiencer.Other` qualifier is currently not implemented in this component.
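The single-threshold rule can be sketched the same way. `qualify_experiencer_sketch` is a hypothetical helper and the labels are plain strings. The text above does not say what happens when the probability equals the threshold exactly, so this sketch arbitrarily falls back to `Experiencer.Patient` in that case.

```python
def qualify_experiencer_sketch(prob, family_threshold=0.5):
    """Map a predicted probability to an experiencer label."""
    if prob > family_threshold:
        return "Experiencer.Family"
    # Below the threshold, or exactly equal to it (an assumption of this sketch).
    return "Experiencer.Patient"


print(qualify_experiencer_sketch(0.8))
print(qualify_experiencer_sketch(0.1))
```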
```{admonition} Definitions of qualifiers
:class: tip
For more extensive documentation on the definitions of the qualifiers we use in `clinlp`, see the [Qualifiers](qualifiers.md) page.
```