clinlp.ie

Information Extraction (IE) module for clinlp.

clinlp.ie.entity

Component for rule-based entity matching.

class clinlp.ie.entity.RuleBasedEntityMatcher(nlp: Language, *, attr: str = 'TEXT', proximity: int = 0, fuzzy: int = 0, fuzzy_min_len: int = 0, pseudo: bool = False, resolve_overlap: bool = False, spans_key: str = 'ents')

Bases: Pipe

spaCy component for rule-based entity matching.

This component matches entities based on known concepts, each with its own terms/synonyms to match. Besides literal string matching, it supports additional configuration options such as fuzzy matching and proximity matching. Configuration (e.g. attr, proximity, …) set at the entity matcher level acts as a default: it is overridden by any configuration set at the term level.

__init__(nlp: Language, *, attr: str = 'TEXT', proximity: int = 0, fuzzy: int = 0, fuzzy_min_len: int = 0, pseudo: bool = False, resolve_overlap: bool = False, spans_key: str = 'ents') → None

Create a rule-based entity matcher.

Parameters:
  • nlp – The spaCy language model.

  • attr – The attribute to match on.

  • proximity – The number of tokens to allow between each token in the phrase.

  • fuzzy – The threshold for fuzzy matching.

  • fuzzy_min_len – The minimum length for fuzzy matching.

  • pseudo – Whether this term is a pseudo-term, which is excluded from matches.

  • resolve_overlap – Whether to resolve overlapping entities.

  • spans_key – The key to store the entities in the document.
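A minimal sketch of constructing the matcher with the keyword arguments listed above. The import path follows the class path shown in this reference. The blank Dutch pipeline (spacy.blank("nl")), the chosen values, and the helper name build_matcher are assumptions for illustration. The imports sit inside the function so the configuration can be inspected without spaCy or clinlp installed.

```python
# Matcher-level settings; these act as defaults for terms that do not set them.
MATCHER_CONFIG = {
    "attr": "NORM",
    "proximity": 1,
    "fuzzy": 1,
    "fuzzy_min_len": 5,
    "resolve_overlap": True,
    "spans_key": "ents",
}


def build_matcher(config=MATCHER_CONFIG):
    """Create a RuleBasedEntityMatcher (hypothetical helper; needs spaCy and clinlp)."""
    import spacy
    from clinlp.ie.entity import RuleBasedEntityMatcher

    nlp = spacy.blank("nl")  # assumption: any spaCy Language object can be passed
    return RuleBasedEntityMatcher(nlp, **config)
```

Constructing the class directly does not add it to a pipeline. Registering it through nlp.add_pipe requires the component's factory name, which is not given in this section.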

add_term(concept: str, term: str | dict | list | Term) → None

Add a term for matching, along with a concept identifier.

Note that concepts do not need to be added separately. It is also possible to call add_term multiple times with the same concept identifier (terms will be appended, not overwritten).

Parameters:
  • concept – The concept identifier.

  • term – The term that should be matched. Can be a string (i.e. a phrase), a dict (that is passed directly to the clinlp.ie.Term constructor), a list comprising a spaCy pattern, or a clinlp.ie.Term object.

Raises:

TypeError – If the term type is not supported.
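The accepted term types can be illustrated with a short sketch. The helper name register_diabetes_terms and the example phrases are hypothetical. The function only relies on the add_term signature documented above, so any object with that method can be passed in.

```python
def register_diabetes_terms(matcher):
    """Add one term of each plain-Python type to `matcher` (hypothetical helper)."""
    # str: a literal phrase.
    matcher.add_term("diabetes", "diabetes")
    # dict: keyword arguments that are passed to the Term constructor.
    matcher.add_term("diabetes", {"phrase": "suikerziekte", "fuzzy": 1})
    # list: a spaCy token pattern.
    matcher.add_term("diabetes", [{"LOWER": "diabetes"}, {"LOWER": "mellitus"}])
    # The fourth accepted type, a Term object, is omitted here so that the
    # sketch does not need clinlp to be importable.
```

All three calls use the same concept identifier, so the terms accumulate under "diabetes".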

add_terms(concept: str, terms: Iterable[str | dict | list | Term]) → None

Add multiple terms with the same concept identifier.

Parameters:
  • concept – A concept identifier, applicable to all terms.

  • terms – An iterable of terms to add.

add_terms_from_dict(terms: dict[str, Iterable[str | dict | list | Term]]) → None

Add terms from a dictionary.

The dictionary should have the concept identifier as the key, and a list of terms as values.

Parameters:

terms – The concepts and terms in dictionary form.
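As an illustration of the expected shape, the dictionary below maps each concept identifier to a list of terms. The concepts and phrases are made up, and count_terms is a hypothetical helper that only inspects the structure.

```python
# Concept identifier -> terms; each term may be a str, dict, list, or Term.
TERMS_BY_CONCEPT = {
    "diabetes": ["diabetes", {"phrase": "suikerziekte", "fuzzy": 1}],
    "hypertensie": ["hypertensie", "hoge bloeddruk"],
}


def count_terms(terms):
    """Return the number of terms per concept (hypothetical helper)."""
    return {concept: len(list(values)) for concept, values in terms.items()}


term_counts = count_terms(TERMS_BY_CONCEPT)
```

With a matcher instance at hand, the same dictionary would be passed as matcher.add_terms_from_dict(TERMS_BY_CONCEPT).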

add_terms_from_json(path: str) → None

Add terms from a JSON file.

The JSON file should have a “terms” key containing the terms and concepts. This dictionary should have the concept identifier as the key, and a list of terms as values.

Parameters:

path – The path to the JSON file.

Raises:

ValueError – If a ‘terms’ key is not found in the JSON file.
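The file layout described above can be sketched with the standard library. The functions write_terms_json and read_terms_json are hypothetical. The reader is a stand-in that mirrors the documented ValueError; it is not clinlp's implementation.

```python
import json
import tempfile
from pathlib import Path


def write_terms_json(path, terms_by_concept):
    """Write a terms file with the required top-level "terms" key."""
    payload = {"terms": terms_by_concept}
    Path(path).write_text(json.dumps(payload, indent=2), encoding="utf-8")


def read_terms_json(path):
    """Stand-in reader that raises ValueError when the "terms" key is missing."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if "terms" not in data:
        raise ValueError("no 'terms' key found in JSON file")
    return data["terms"]


with tempfile.TemporaryDirectory() as tmp:
    terms_path = Path(tmp) / "terms.json"
    write_terms_json(
        terms_path,
        {"diabetes": ["diabetes", {"phrase": "suikerziekte", "fuzzy": 1}]},
    )
    loaded = read_terms_json(terms_path)
```

A file written this way would then be loaded with matcher.add_terms_from_json(str(terms_path)).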

add_terms_from_csv(path: str, concept_col: str = 'concept', **kwargs) → None

Add concepts and terms from a CSV file.

The CSV file should contain the concept identifier in the “concept_col” column, and the term arguments as further columns. It must include at least a column for the phrase; columns for the other clinlp.ie.Term arguments are optional. Any other columns are ignored.

Parameters:
  • path – A path to the csv file.

  • concept_col – The column name for the concept identifier (optional, defaults to 'concept').

  • **kwargs – Any additional keyword arguments to pass to the pandas.read_csv method.

Raises:

RuntimeError – If a value in the csv file cannot be parsed.
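A small sketch of a CSV in the layout described above, grouped per concept with the standard library. The column values, the extra comment column, and the helper rows_to_term_kwargs are illustrative. Dropping empty cells is an assumption about how missing values should be read, and values are left as strings here, whereas clinlp reads the file through pandas.

```python
import csv
import io

# "comment" is not a Term argument, so it falls under "any other columns".
CSV_TEXT = """concept,phrase,attr,fuzzy,comment
diabetes,diabetes,NORM,1,common spelling
diabetes,suikerziekte,,,lay term
hypertensie,hoge bloeddruk,,,
"""

TERM_FIELDS = ("phrase", "attr", "proximity", "fuzzy", "fuzzy_min_len", "pseudo")


def rows_to_term_kwargs(csv_text, concept_col="concept"):
    """Group rows by concept, keeping only non-empty Term columns."""
    grouped = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        kwargs = {k: v for k, v in row.items() if k in TERM_FIELDS and v != ""}
        grouped.setdefault(row[concept_col], []).append(kwargs)
    return grouped


grouped = rows_to_term_kwargs(CSV_TEXT)
```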

match_entities(doc: Doc) → list[Span]

Match entities in a document.

Parameters:

doc – The document.

Returns:

list[Span] – The entities.

__call__(doc: Doc) → Doc

Match entities in a document and add them to the document.

The entities that are found will be stored in doc.spans[spans_key], which is doc.spans['ents'] with the default spans_key. Make sure any subsequent components expect the entities to be stored there.

Parameters:

doc – The document.

Returns:

Doc – The document with entities.
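Reading the stored entities back could look like the sketch below. The helper entity_rows is hypothetical. It uses only standard spaCy Span attributes (text, label_, start, end); that label_ carries the concept identifier is an assumption not stated in this section.

```python
def entity_rows(doc, spans_key="ents"):
    """Return (text, label, start, end) for each span stored under spans_key."""
    return [
        (span.text, span.label_, span.start, span.end)
        for span in doc.spans[spans_key]
    ]
```

If the matcher was created with a different spans_key, the same key has to be passed here.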

clinlp.ie.term

Term class, which is used for rule-based entity matching.

class clinlp.ie.term.Term(phrase: str, *, attr: str | None = 'TEXT', proximity: int | None = 0, fuzzy: int | None = 0, fuzzy_min_len: int | None = 0, pseudo: bool | None = False)

Bases: BaseModel

A single term used for rule-based entity matching.

phrase: str

The literal phrase to match.

attr: str | None

The attribute to match on.

proximity: int | None

The number of tokens to allow between each token in the phrase.

fuzzy: int | None

The threshold for fuzzy matching.

fuzzy_min_len: int | None

The minimum length for fuzzy matching.

pseudo: bool | None

Whether this term is a pseudo-term, which is excluded from matches.

model_config: ClassVar[ConfigDict] = {'extra': 'ignore'}

Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.

classmethod defaults() → dict

Get the default values for each term attribute, if any.

Returns:

dict – The default values for each attribute, if any.

property fields_set: set[str]

Get the fields set for this term.

Returns:

set[str] – The fields set for this term.

override_non_set_fields(override_args: dict) → Term

Override the non-set fields in this term.

Parameters:

override_args – The arguments to override.

Returns:

Term – The term with the overridden fields.
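The precedence this implies can be sketched in plain Python: fields set explicitly on the term are kept, the override arguments fill the remaining fields, and anything left falls back to the defaults from the Term signature. The function resolve_term_settings is a hypothetical stand-in that works on dictionaries and is not the library's implementation.

```python
# Defaults as listed in the Term signature.
TERM_DEFAULTS = {
    "attr": "TEXT",
    "proximity": 0,
    "fuzzy": 0,
    "fuzzy_min_len": 0,
    "pseudo": False,
}


def resolve_term_settings(term_set_fields, override_args):
    """Combine settings: explicitly set term fields win over override_args."""
    resolved = dict(TERM_DEFAULTS)
    # Overrides (e.g. matcher-level configuration) replace the defaults.
    resolved.update({k: v for k, v in override_args.items() if k in TERM_DEFAULTS})
    # Fields the term set itself are applied last, so they are never overridden.
    resolved.update(term_set_fields)
    return resolved


resolved = resolve_term_settings(
    {"phrase": "suikerziekte", "fuzzy": 2},
    {"attr": "NORM", "fuzzy": 1},
)
```

Here fuzzy stays at 2 because the term set it, while attr becomes "NORM" because the term left it unset.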

to_spacy_pattern(nlp: Language) → list[dict]

Convert the term to a spaCy pattern.

Parameters:

nlp – The spaCy language model. This is used for tokenizing patterns.

Returns:

list[dict] – The spaCy pattern.
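How the term attributes might translate into a token pattern is sketched below. This is an assumption about the output shape, not the library's code: sketch_spacy_pattern is hypothetical, splits on whitespace instead of using nlp's tokenizer, represents fuzzy matching with spaCy's FUZZYn Matcher operators, and represents proximity with optional wildcard tokens.

```python
def sketch_spacy_pattern(phrase, attr="TEXT", proximity=0, fuzzy=0, fuzzy_min_len=0):
    """Build a spaCy-style token pattern for a phrase (illustrative only)."""
    tokens = phrase.split()  # assumption: whitespace split, not nlp.tokenizer
    pattern = []
    for i, token in enumerate(tokens):
        if fuzzy > 0 and len(token) >= fuzzy_min_len:
            # Fuzzy match allowing up to `fuzzy` edits on this token.
            pattern.append({attr: {f"FUZZY{fuzzy}": token}})
        else:
            # Literal match on the chosen attribute.
            pattern.append({attr: token})
        if i < len(tokens) - 1:
            # Allow up to `proximity` arbitrary tokens between phrase tokens.
            pattern.extend({"OP": "?"} for _ in range(proximity))
    return pattern


example_pattern = sketch_spacy_pattern(
    "dm mellitus", attr="NORM", proximity=1, fuzzy=1, fuzzy_min_len=5
)
```

In this example "dm" is shorter than fuzzy_min_len and is matched literally, while "mellitus" is matched fuzzily.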