clinlp.ie
Information Extraction (IE) module for clinlp.
clinlp.ie.entity
Component for rule based entity matching.
- class clinlp.ie.entity.RuleBasedEntityMatcher(nlp: Language, *, attr: str = 'TEXT', proximity: int = 0, fuzzy: int = 0, fuzzy_min_len: int = 0, pseudo: bool = False, resolve_overlap: bool = False, spans_key: str = 'ents')
Bases: Pipe
spaCy component for rule-based entity matching.
This component can be used to match entities based on known concepts, along with terms/synonyms to match (per concept). It can do literal string matching, but also has some additional configuration options like fuzzy matching and proximity matching. Note that configuration (e.g. attr, proximity, …) set at the entity matcher level is overridden by the configuration at the term level.
- __init__(nlp: Language, *, attr: str = 'TEXT', proximity: int = 0, fuzzy: int = 0, fuzzy_min_len: int = 0, pseudo: bool = False, resolve_overlap: bool = False, spans_key: str = 'ents') → None
Create a rule-based entity matcher.
- Parameters:
nlp – The spaCy language model.
attr – The attribute to match on.
proximity – The number of tokens to allow between each token in the phrase.
fuzzy – The threshold for fuzzy matching.
fuzzy_min_len – The minimum length for fuzzy matching.
pseudo – Whether this term is a pseudo-term, which is excluded from matches.
resolve_overlap – Whether to resolve overlapping entities.
spans_key – The key to store the entities in the document.
- add_term(concept: str, term: str | dict | list | Term) → None
Add a term for matching, along with a concept identifier.
Note that concepts do not need to be added separately. It is also possible to call add_term multiple times with the same concept identifier (terms will be appended, not overwritten).
- Parameters:
concept – The concept identifier.
term – The term that should be matched. Can be a string (i.e. a phrase), a dict (that is passed directly to the clinlp.ie.Term constructor), a list comprising a spaCy pattern, or a clinlp.ie.Term object.
- Raises:
TypeError – If the term type is not supported.
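The accepted term forms can be sketched as a small type dispatch. This is a simplified stand-in, not clinlp's implementation: `describe_term` is a hypothetical helper, and the clinlp.ie.Term object case is left out so that the sketch needs no third-party imports.

```python
def describe_term(term):
    """Hypothetical helper: classify a term by the forms described above."""
    if isinstance(term, str):
        return "phrase"
    if isinstance(term, dict):
        return "Term keyword arguments"
    if isinstance(term, list):
        return "spaCy pattern"
    # Mirrors the documented behaviour: unsupported types raise TypeError.
    raise TypeError(f"Unsupported term type: {type(term).__name__}")


examples = [
    "diabetes",                                        # a plain phrase
    {"phrase": "diabetes mellitus", "proximity": 1},   # arguments for a Term
    [{"TEXT": "diabetes"}, {"TEXT": "mellitus"}],      # a spaCy token pattern
]
kinds = [describe_term(term) for term in examples]
```

With an actual matcher, each of these values could be passed as the term argument of add_term together with a concept identifier.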
- add_terms(concept: str, terms: Iterable[str | dict | list | Term]) → None
Add multiple terms with the same concept identifier.
- Parameters:
concept – A concept identifier, applicable to all terms.
terms – An iterable of terms to add.
- add_terms_from_dict(terms: dict[str, Iterable[str | dict | list | Term]]) → None
Add terms from a dictionary.
The dictionary should have the concept identifier as the key, and a list of terms as values.
- Parameters:
terms – The concepts and terms in dictionary form.
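A minimal sketch of the expected dictionary shape. The concept identifiers and phrases are made up for illustration, and `matcher` in the commented-out call is assumed to be an existing RuleBasedEntityMatcher instance.

```python
# Concept identifiers are the keys; each value is an iterable of terms.
terms = {
    "diabetes": [
        "diabetes",                                   # plain phrase
        {"phrase": "diabetes mellitus", "fuzzy": 1},  # dict of Term arguments
    ],
    "hypertension": ["hypertension", "high blood pressure"],
}

# Assumption: `matcher` is an existing RuleBasedEntityMatcher instance.
# matcher.add_terms_from_dict(terms)

n_terms = sum(len(concept_terms) for concept_terms in terms.values())
```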
- add_terms_from_json(path: str) → None
Add terms from a JSON file.
The JSON file should have a “terms” key containing the terms and concepts. This dictionary should have the concept identifier as the key, and a list of terms as values.
- Parameters:
path – The path to the JSON file.
- Raises:
ValueError – If a ‘terms’ key is not found in the JSON file.
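The file layout described above can be sketched with the standard library. The concepts and phrases are illustrative, the file name `terms.json` is arbitrary, and the key check only mimics the documented ValueError; it is not clinlp's own code.

```python
import json
import tempfile
from pathlib import Path

# The top-level "terms" key maps concept identifiers to lists of terms.
payload = {
    "terms": {
        "diabetes": ["diabetes", {"phrase": "diabetes mellitus", "proximity": 1}],
        "hypertension": ["hypertension"],
    }
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "terms.json"
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")

    # Read the file back and check for the required key.
    loaded = json.loads(path.read_text(encoding="utf-8"))
    if "terms" not in loaded:
        raise ValueError("No 'terms' key found in the JSON file.")
    concepts = sorted(loaded["terms"])

    # Assumption: `matcher` is an existing RuleBasedEntityMatcher instance.
    # matcher.add_terms_from_json(str(path))
```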
- add_terms_from_csv(path: str, concept_col: str = 'concept', **kwargs) → None
Add concepts from a csv file.
The csv file should contain the concept identifier in the column named by concept_col, and the term arguments as further columns. It must include at least a column for the phrase, and may include other columns for the clinlp.ie.Term arguments. Any other columns are ignored.
- Parameters:
path – A path to the csv file.
concept_col – The column name for the concept identifier.
**kwargs – Any additional keyword arguments to pass to the pandas.read_csv function.
- Raises:
RuntimeError – If a value in the csv file cannot be parsed.
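A sketch of a csv layout that fits the description above, assuming the term columns are named after the clinlp.ie.Term fields (`phrase`, `fuzzy`). The `comment` column is a made-up extra column of the kind that would be ignored. The standard csv module is used here only to show the layout; the method itself is documented as reading the file through pandas.read_csv.

```python
import csv
import io

csv_text = (
    "concept,phrase,fuzzy,comment\n"
    "diabetes,diabetes mellitus,1,reviewed\n"
    "hypertension,high blood pressure,0,draft\n"
)

# One row per term; the concept column says which concept the term belongs to.
rows = list(csv.DictReader(io.StringIO(csv_text)))
concepts = [row["concept"] for row in rows]
phrases = [row["phrase"] for row in rows]

# Assumption: `matcher` is an existing RuleBasedEntityMatcher instance and
# "terms.csv" is a file on disk containing csv_text.
# matcher.add_terms_from_csv("terms.csv", concept_col="concept")
```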
- match_entities(doc: Doc) → list[Span]
Match entities in a document.
- Parameters:
doc – The document.
- Returns:
list[Span] – The entities.
- __call__(doc: Doc) → Doc
Match entities in the document text and add them to the document.
The entities that are found will be stored in doc.spans['ents'] by default (the key is set by spans_key). Make sure any subsequent components expect the entities to be stored there.
- Parameters:
doc – The document.
- Returns:
Doc – The document with entities.
clinlp.ie.term
Term class, which is used for rule based entity matching.
- class clinlp.ie.term.Term(phrase: str, *, attr: str | None = 'TEXT', proximity: int | None = 0, fuzzy: int | None = 0, fuzzy_min_len: int | None = 0, pseudo: bool | None = False)
Bases: BaseModel
A single term used for rule based entity matching.
- phrase: str
The literal phrase to match.
- attr: str | None
The attribute to match on.
- proximity: int | None
The number of tokens to allow between each token in the phrase.
- fuzzy: int | None
The threshold for fuzzy matching.
- fuzzy_min_len: int | None
The minimum length for fuzzy matching.
- pseudo: bool | None
Whether this term is a pseudo-term, which is excluded from matches.
- model_config: ClassVar[ConfigDict] = {'extra': 'ignore'}
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- classmethod defaults() → dict
Get the default values for each term attribute, if any.
- Returns:
dict – The default values for each attribute, if any.
- property fields_set: set[str]
Get the fields set for this term.
- Returns:
set[str] – The fields set for this term.
- override_non_set_fields(override_args: dict) → Term
Override the non-set fields in this term.
- Parameters:
override_args – The arguments to override.
- Returns:
Term – The term with the overridden fields.
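The override rule (fields set explicitly on a term win, and the remaining fields are filled in from the override arguments) can be sketched with plain dictionaries. This is a simplified stand-in and not the Term class itself: `override_non_set` is a hypothetical helper, and `term_fields` is assumed to hold only the fields that were set explicitly.

```python
def override_non_set(term_fields: dict, override_args: dict) -> dict:
    """Hypothetical helper: fill in non-set fields, keep explicitly set ones."""
    # Later entries win in a dict merge, so explicitly set term fields
    # take precedence over the override arguments.
    return {**override_args, **term_fields}


# Matcher-level configuration, used as the override arguments.
matcher_config = {"attr": "NORM", "proximity": 1, "fuzzy": 0}

# Only the fields that were explicitly set on the term.
term_fields = {"phrase": "diabetes mellitus", "fuzzy": 1}

merged = override_non_set(term_fields, matcher_config)
```

Here `fuzzy` keeps the term-level value 1, while `attr` and `proximity` are taken from the matcher-level configuration.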
- to_spacy_pattern(nlp: Language) → list[dict]
Convert the term to a spaCy pattern.
- Parameters:
nlp – The spaCy language model. This is used for tokenizing patterns.
- Returns:
list[dict] – The spaCy pattern.
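To give a feel for what such a pattern might look like, here is a rough sketch under several assumptions that this reference does not confirm: the phrase is split on whitespace instead of using the spaCy tokenizer, `fuzzy` is treated as a maximum edit distance expressed through spaCy's `FUZZYn` matcher operators, and `proximity` is expressed as optional wildcard tokens between phrase tokens. `sketch_pattern` is a hypothetical function, not part of clinlp.

```python
def sketch_pattern(phrase, attr="TEXT", proximity=0, fuzzy=0, fuzzy_min_len=0):
    """Hypothetical sketch: build a spaCy Matcher token pattern for a phrase."""
    tokens = phrase.split()  # stand-in for tokenizing with the spaCy model
    pattern = []
    for i, token in enumerate(tokens):
        if fuzzy > 0 and len(token) >= fuzzy_min_len:
            # Assumed: fuzzy matching applies only to long enough tokens.
            pattern.append({attr: {f"FUZZY{fuzzy}": token}})
        else:
            pattern.append({attr: token})
        if i < len(tokens) - 1:
            # Assumed: allow up to `proximity` arbitrary tokens in between.
            pattern.extend({"OP": "?"} for _ in range(proximity))
    return pattern


pattern = sketch_pattern("type diabetes", proximity=1, fuzzy=1, fuzzy_min_len=5)
# pattern == [{"TEXT": "type"}, {"OP": "?"}, {"TEXT": {"FUZZY1": "diabetes"}}]
```

The actual output of to_spacy_pattern may differ; the sketch only illustrates how the term attributes could map onto spaCy's token pattern syntax.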