Case study: Understanding semantic similarity better than ever before using Google BERT
- For an Innovate UK-backed innovation project, DMRC was tasked with finding a more effective way to identify similarities between text documents.
- Our problem statement: “How can we identify parts of documents that are related to each other beyond looking for matching keywords?”
- For example, keyword matching alone would suggest that ‘creating a fire strategy’ and ‘creating a cleaning strategy’ are closely related items, when in reality they mean two very different things
- A solution was needed that could better capture the relationships between paragraphs
- DMRC developed a proof of concept using a dataset of 1 million news articles, with topics ranging from clinical trials for disease-related treatments to people unfairly taxed on their inheritance.
- The news articles were split into paragraphs and a pre-trained BERT-based model was used to generate sentence embeddings for each paragraph (see the first sketch after this list).
- The aim was to understand which paragraphs are closely related to each other.
- K-nearest neighbours (KNN) was used as the search algorithm, and different similarity metrics, including L2 distance and cosine distance, were trialled as the method for finding the most semantically similar paragraphs (see the second sketch below).
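
The first step of the pipeline described above is turning each paragraph into a fixed-length vector. A minimal sketch follows, assuming the `sentence-transformers` library and the `all-MiniLM-L6-v2` model; both are illustrative choices, as the case study does not name the exact BERT-based model used.

```python
# Minimal sketch: embedding paragraphs with a pre-trained BERT-based model.
# The library (sentence-transformers) and model name (all-MiniLM-L6-v2) are
# illustrative assumptions; the case study does not name the exact model used.
from sentence_transformers import SentenceTransformer

paragraphs = [
    "Creating a fire strategy for the new office building.",
    "Creating a cleaning strategy for the new office building.",
    "Results of a clinical trial for a new asthma treatment.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")

# Each paragraph becomes a fixed-length dense vector (embedding).
embeddings = model.encode(paragraphs, convert_to_numpy=True)
print(embeddings.shape)  # (3, 384) for this particular model
```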
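Given the embeddings, the most semantically similar paragraphs can be found with a KNN search. The sketch below trials both L2 (Euclidean) and cosine distance; scikit-learn's `NearestNeighbors` is an assumed implementation, since the case study does not say which KNN library was used, and a random matrix stands in for the real embeddings so the example runs on its own.

```python
# Minimal sketch: k-nearest-neighbour (KNN) search over paragraph embeddings,
# trialling both L2 (Euclidean) and cosine distance. scikit-learn is an
# illustrative choice; the case study does not name the KNN implementation used.
import numpy as np
from sklearn.neighbors import NearestNeighbors

# A small random matrix stands in for the real paragraph embeddings
# (e.g. the output of the previous sketch) so this example is self-contained.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 384)).astype(np.float32)

def most_similar(embeddings, query_idx, k=5, metric="cosine"):
    """Return indices and distances of the k paragraphs closest to the query."""
    knn = NearestNeighbors(n_neighbors=k + 1, metric=metric)  # +1: the query matches itself
    knn.fit(embeddings)
    distances, indices = knn.kneighbors(embeddings[query_idx : query_idx + 1])
    return indices[0][1:], distances[0][1:]  # drop the query paragraph itself

for metric in ("euclidean", "cosine"):  # L2 distance vs cosine distance
    idx, dist = most_similar(embeddings, query_idx=0, k=3, metric=metric)
    print(metric, idx, np.round(dist, 3))
```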
- State-of-the-art technology
- Connect information through AI
- Automated search process
- Quickly and accurately find similar paragraphs/documents
- Better on-demand insights