Background

  • As part of an Innovate UK-backed innovation project, DMRC was tasked with finding a better way to identify similarities between text documents.
  • Our problem statement: “How can we identify parts of documents that are related to each other beyond looking for matching keywords?”
  • For example, matching on keywords alone would suggest that ‘creating a fire strategy’ and ‘creating a cleaning strategy’ are related, but in reality they mean two different things.
  • A solution was needed that could better capture the relationships between paragraphs.

Approach

  • DMRC developed a proof of concept using a dataset based on 1 million news articles, with topics ranging from clinical trials for disease-related treatments to people unfairly taxed on their inheritance.
  • The news articles were split into paragraphs, and a pre-trained BERT-based model was used to generate sentence embeddings.
  • The aim was to understand which paragraphs are closely related to each other.
  • K-nearest neighbours (KNN) was used as the search algorithm, and different similarity metrics, including L2 distance and cosine distance, were trialled as the method for finding the most semantically similar paragraphs.

Benefits

  • State-of-the-art technology
  • Connect information through AI
  • Automated search process
  • Quickly and accurately find similar paragraphs/documents
  • Better on-demand insights