We use semantic information to identify sentences that share content even when their vocabulary and structure have little in common. For example, a scientific paper might use the formal name for a particular species (e.g. Zootermopsis angusticollis), while the same work presented to a general audience may use a common name (termite) where that level of precision matters less. Documents may also be structured differently depending on the intended audience, so sentences that carry the same information may share no terms at all.
We are developing approximate, or “rough”, semantic representations that can be matched with more lightweight algorithms than those required to recognise full semantic equivalence. These representations allow us to recognise when different terms appear in similar linguistic contexts, and so may have similar denotations. We are applying this work to identify which parts of popular science articles discuss their source academic articles.
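
To make the idea of terms appearing in similar linguistic contexts concrete, here is a minimal sketch of one common lightweight technique: window-based co-occurrence vectors compared by cosine similarity. It is an illustration only and not necessarily the representation used in this project. The function names, the window size, and the three-sentence toy corpus are all assumptions invented for the example.

```python
from collections import Counter, defaultdict
import math


def context_vectors(sentences, window=2):
    """Map each word to a bag of the words seen within `window` tokens of it.

    Hypothetical helper: a crude stand-in for a "rough" semantic representation.
    """
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy corpus, invented for illustration: a common name and a formal name
# used in the same context, plus an unrelated sentence.
corpus = [
    "the termite eats damp wood",
    "the zootermopsis eats damp wood",
    "the reader likes short articles",
]

vecs = context_vectors(corpus)
sim_related = cosine(vecs["termite"], vecs["zootermopsis"])
sim_unrelated = cosine(vecs["termite"], vecs["articles"])
print(sim_related > sim_unrelated)
```

In this toy setting the two names never co-occur in a sentence, yet they score as more similar to each other than to an unrelated word because they share the same neighbours. A real system would need a much larger corpus and richer context features than raw neighbouring tokens.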