Latest AI Tools for AI Media Processing NEWS

[ad_1]

AllenAI’s SciSpacy project provides a language model trained on biomedical text that can be used for named entity recognition (NER) of biomedical entities using the standard SpaCy API. Unlike entities found using SpaCy’s language models (at least English), where entities have types such as PER, GEO, ORG, etc., SciSpacy entities have a single type of ENTITY. For their further classification, SciSpacy provides Entity Linking (NEL) functionality through its integration with various ontology providers such as Unified Medical Language System (UMLS), Medical Subject Headings (MeSH), RxNorm, Gene Ontology (GO) and Human Phenotype Ontology (HPO) .

NER and NEL processes are separated. The NER process finds candidate entity boundaries and matches these ranges to corresponding ontologies, which may result in zero or more ontology entries corresponding to the range. All candidate ranges are then matched against all relevant entities.

I attempted to annotate the COVID-19 Open Research Dataset (CORD-19) against UMLS using the SciSpacy integration described above and noticed significant uncertainty in the linkage results. Specifically, annotation of approximately 22 million sentences of the CORD-19 dataset results in 113 million candidate entities associated with 166 million UMLS concepts, i.e., on average, each candidate length is resolved to 1.5 UMLS concepts. However, the distribution is Zipfian, with approximately 46.87% of units covering a single concept, and the long tail of units being associated with up to 67 UMLS concepts.

In this post, I’ll describe a strategy related to the uncertainty of entities. Based on limited testing, this selects the correct concept about 73% of the time.

The strategy is based on the intuition that an ambiguously related unit range is more likely to resolve a concept that is closely related to the concepts of other unambiguous units in the sentence. In other words, the best target label for an ambiguous entity is the one that is semantically closest to the labels of the other entities in the sentence. Or even more succinctly, and with apologies to John Firth, a unit is known by the company it keeps.

The NER and NEL processes provided by the SciSpacy library allow us to reduce a sentence to collections of entities, each of which represents zero or more UMLS concepts. Each UMLS concept represents one or more semantic types that represent high-level subject categories. So, essentially, a sentence can be reduced to a semantic type graph using the following steps.

Consider the sentence below, the NER step determines the candidate boundaries indicated by the underline.

The fact that viral antigens cannot be demonstrated is used coloring There is no result Antibodies is present cat that is already attached to these antigens and prevents It is mandatory of others Antibodies.

The NEL step will attempt to link these spans to the UMLS ontology. The matching results are shown below. As mentioned earlier, each UMLS concept maps to one or more schematic types, and these are also shown here.

Subject-ID	unit span	Concept-ID	The basic name of the concept	Semantic type code	Semantic type name
1	coloring	C0487602	coloring method	T059	laboratory procedure
2	Antibodies	C0003241	Antibodies	T116	amino acid, peptide or protein
				T129	immunological factor
3	cat	C0007450	Felis catus	T015	A mammal
		C0008169	Chloramphenicol O-acetyltransferase	T116	amino acid, peptide or protein
				T126	Enzyme
		C0325089	Felidae family	T015	A mammal
		C1366498	chloramphenicol acetyl transferase gene	T028	A gene or genome
4	antigens	C0003320	antigens	T129	immunological factor
5	Mandatory	C1145667	Mandatory action	T052	activity
		C1167622	binding (molecular function)	T044	molecular function
6	Antibodies	C0003241	Antibodies	T116	amino acid, peptide or protein
				T129	immunological factor

A sequence of entity spans, each mapped to one or more semantic type codes, can be represented by a graph of semantic type nodes, as shown below. Here, each vertical grouping corresponds to a unit position. A BOS node is a special node that represents the beginning of a sequence. From our intuition above, unit uncertainty is now just finding the most probable path in the diagram.

Of course, “probably” implies that we need to know the probability of transitions between semantic types. We can think of the graph as a Markov chain and assume that the probability of each node in the graph is determined only by its previous node. Fortunately, this information is already available as a result of the NER + NEL process for the entire CORD-19 data set, where about half of the units are uniquely mapped to a single UMLS concept. Most concepts reflect one semantic type, but in cases where they are multiple, we consider them as separate entries. We calculate pairwise transition probabilities across semantic types for these uniquely related pairs in the CORD-19 database and construct our transition matrix. In addition, we also create a matrix of emission probabilities, which determines the probability of solving a semantic type concept.

Using transition probabilities, we can look at each path in the graph from the start to the end position, computing the path probability as the product of the transition probabilities of the edges (or, for computational reasons, the sum of the log-probabilities). However, there are better methods, such as the Viterbi algorithm, which allows us to save the iterative computation of common edge sequences over multiple paths. This is what we used to calculate the most probable path for our semantic type graph.

Viterbi algorithm consists of two phases – forward and backward. In the previous phase, we move from left to right, calculating the log-probability of each transition at each step, as shown in the figure by the vectors below each position. When calculating a transition from multiple nodes to a single node (such as from node [T129, T116] that [T126]We calculate both ways and choose the maximum value.

In the backtracking phase, we move from right to left, selecting the maximum likelihood node at each step. These are shown in the figure as boxed entries. We can then search for the corresponding semantic type and return the most likely sequence of semantic types (shown in bold at the bottom of the figure).

However, our goal is to return ambiguous concept associations for entities. Given the uncertain semantic type and multiple possibilities indicated by SciSpacy’s linking process, we use emission probabilities to select the most likely concept that applies to the position. The result of our example is shown in the table below.

Subject-ID	unit span	Concept-ID	The basic name of the concept	Semantic type code	Semantic type name	it is right?
1	coloring	C0487602	coloring method	T059	laboratory procedure	N/A^*
2	Antibodies	C0003241	Antibodies	T116	amino acid, peptide or protein	Yes
3	cat	C0008169	Chloramphenicol O-acetyltransferase	T116	amino acid, peptide or protein	No
4	antigens	C0003320	antigens	T129	immunological factor	N/A^*
5	Mandatory	C1145667	Mandatory action	T052	activity	Yes
6	Antibodies	C0003241	Antibodies	T116	amino acid, peptide or protein	Yes

(No: unambiguous mappings)

I thought this might be an interesting technique to share, so I wrote about it. Additionally, in the spirit of repeatability, I’ve also provided the following artifacts for your convenience.

Code: This github gist contains code that performs NER + NEL on an input sentence using SciSpacy and its UMLS integration, and then uses my adaptation of the Viterbi method (as described in this post ) to resolve ambiguous entity connections.
Data: I have also provided the transition and emission matrices and their associated lookup tables for convenience, as these can be time consuming to create from scratch from the CORD-19 data set.

As always, I appreciate your feedback. Please let me know if you find any flaws in my approach and/or know of a better approach for unit misunderstandings

[ad_2]

Source link

The RedPajama Project: An Open Source Initiative to Democratize LLMs

Mastering Data Science with Microsoft Fabric: A Tutorial for Beginners

Will AI kill your job?

Leave A Reply Cancel Reply

SciSpacy + UMLS entity disambiguation using the Viterbi algorithm

Related Posts

The RedPajama Project: An Open Source Initiative to Democratize LLMs

Mastering Data Science with Microsoft Fabric: A Tutorial for Beginners

Will AI kill your job?

Leave A Reply Cancel Reply