Getting started¶
This guide contains some code examples to get you started with clinlp. Since clinlp is built on top of the spaCy framework, it’s highly recommended to read spaCy 101: Everything you need to know (~15 minutes) before getting started with clinlp. Understanding the basic spaCy framework will make working with clinlp much easier.
Creating a blank model¶
You can create a blank clinlp model using the following code:
import spacy
import clinlp
nlp = spacy.blank('clinlp')
This instantiates a Language object, which is the central object in spaCy. It contains all default settings for a language, in our case Dutch clinical text, such as the tokenizer, abbreviations, stop words, and so on. Calling it on a piece of text creates a Doc object:
text = "De patient krijgt 2x daags 500 mg paracetamol."
doc = nlp(text)
In the Doc object, you can find the tokenized text:
print(list(token.text for token in doc))
> ['De', 'patient', 'krijgt', '2', 'x', 'daags', '500', 'mg', 'paracetamol', '.']
Each token in the document is a Token object, which contains the text and some additional information. You can also access the tokens directly from the Doc object:
print(doc[8])
> 'paracetamol'
A span of multiple tokens, essentially a slice of the document, is called a Span object. This can be a sentence, a named entity, or any other contiguous part of the text. You can create a Span by slicing the Doc:
print(doc[6:8])
> '500 mg'
Even when using a blank model, the Doc, Token and Span objects already contain some information about the text and tokens, such as the token’s text and its position in the document. In the next section, we will add more components to the model, that will add more interesting information. Then we can start using our model for more interesting things.
Adding components¶
The above model is a blank model, which means it does not contain any additional components yet. It’s essentially an almost empty pipeline. Adding a component is done using:
nlp.add_pipe('component_name')
For example, let’s add the clinlp normalizer and clinlp sentencizer to the model. They respectively normalize the text and detect sentence boundaries:
nlp.add_pipe('clinlp_normalizer')
nlp.add_pipe('clinlp_sentencizer')
If we now again process a piece of sample text, we can see that clinlp has added some additional information to the Doc and Span objects:
doc = nlp(
"Patiënt krijgt 2x daags 500 mg "
"paracetamol. De patiënt is allergisch "
"voor penicilline."
)
print(token.norm_ for token in doc)
> ['patient', 'krijgt', '2', 'x', 'daags', '500', 'mg', 'paracetamol', '.', 'de', 'patient', 'is', 'allergisch', 'voor', 'penicilline', '.']
print(str(sent) for sent in doc.sents)
> ['Patiënt krijgt 2x daags 500 mg paracetamol.', 'De patiënt is allergisch voor penicilline.']
Other components can use these newly set properties Token.norm_ and Doc.sents. For example, an entity recognizer can use the normalized text to recognize entities, and a negation detector can use the sentence boundaries to determine the range of a negation.
You can always inspect the current model’s pipeline using:
print(nlp.pipe_names)
> ['clinlp_normalizer', 'clinlp_sentencizer']
This shows the current components in the pipeline, in the order they are executed. The order of the components is important, as the output of one component is the input of the next component. The order of the components can be changed by using the nlp.add_pipe method with the before or after parameter. For example, to add a component before the clinlp_sentencizer:
nlp.add_pipe('component_name', before='clinlp_sentencizer')
This will add the component before the clinlp_sentencizer in the pipeline.
Information extraction example¶
Now that we understand the basics of a blank model and adding components, let’s add two more components to create a basic information extraction pipeline.
First, we will add the clinlp_rule_based_entity_matcher, along with some sample concepts to match:
from clinlp.ie import Term
terms = {
"prematuriteit": [
"preterm", "<p3", "prematuriteit", "partus praematurus"
],
"hypotensie": [
"hypotensie", Term("bd verlaagd", proximity=1)
],
"veneus_infarct": [
"veneus infarct", Term("VI", attr="TEXT")
]
}
entity_matcher = nlp.add_pipe(
"clinlp_rule_based_entity_matcher",
config={"attr": "NORM", "fuzzy": 1}
)
entity_matcher.add_terms_from_dict(terms)
The above code adds three concepts to be matched (prematuriteit, hypotensie, and veneus_infarct), along with synonyms to match. Additionally, it configures the entity matcher on how to perform the matching. We have here configured the entity matcher to match against the NORM attribute by default, which it finds in the Token.norm_ property the clinlp_normalizer set earlier. The fuzzy parameter specifies how much the concept text and the real text can differ (based on the edit distance). Some settings are overruled at the Term level. For instance, the proximity=1 parameter for bd verlaagd specifies that at most one token may skipped between the words bd and verlaagd.
If we now process a piece of text, we can see that the entity recognizer has recognized some entities:
text = (
"Preterme neonaat (<p3) opgenomen, bd enigszins verlaagd, "
"familieanamnese vermeldt eveneens hypotensie bij moeder. "
"Thans geen aanwijzingen voor veneus infarkt wat ook geen "
"verklaring voor de partus prematurus is. Risico op VI "
"blijft aanwezig."
)
doc = nlp(text)
for ent in doc.spans['ents']:
print(ent.text, ent.label_)
> 'Preterme' 'prematuriteit'
> '<p3' 'prematuriteit'
> 'bd enigszins verlaagd' 'hypotensie'
> 'hypotensie' 'hypotensie'
> 'veneus infarkt' 'veneus_infarct'
> 'partus prematurus' 'prematuriteit'
> 'VI' 'veneus_infarct'
As you can see, the doc.spans['ents'] property now contains seven Span objects, each with the matched text, along with the concept label.
Now, as a final step, let’s add the clinlp_context_algorithm component to the pipeline, which implements the Context Algorithm. For each matched entity, it can detect qualifiers, such as Presence, Temporality and Experiencer, based on triggers like ‘geen’, ‘uitgesloten’, etc.
nlp.add_pipe("clinlp_context_algorithm", config={"phrase_matcher_attr": "NORM"})
We again configure it to match on the NORM attribute, set by the clinlp_normalizer.
If we now process the same text, we can see that the Context Algorithm has added some additional information to the entities:
doc = nlp(text)
for ent in doc.spans['ents']:
print(ent.text, ent._.qualifiers)
> 'Preterme' set()
> '<p3' set()
> 'bd enigszins verlaagd' set()
> 'hypotensie' {'Experiencer.Family'}
> 'veneus infarkt' {'Presence.Absent'}
> 'partus prematurus' set()
> 'VI' {'Temporality.Future'}
In the above example, for readability all default qualifier values (Presence.Present, Temporality.Current, Experiencer.Patient) have been omitted. You can see that three out of seven entities have correctly been qualified, either as Absent, related to Family, or potentially occurring in the Future. Of course, your specific use case determines how the output of this pipeline will further be handled.
Conclusion¶
In this guide, we have shown how to create a blank model, add components to it, and process a piece of text. It also shows how to configure individual components and organize them in a specific information extraction pipeline. Note that there are more components available than shown in this example, you can find them on the Components page. By now you understand the basics, and are ready to further explore everything clinlp can offer!