Optical Character Recognition (OCR) tools are software that detects and extracts text from images. They are used in the early steps of scanned-document analysis to recognize and automatically process the information that the documents contain. Depending on the complexity of the documents to be analyzed, OCR tools can be used both to detect and to extract the text or, in some pipelines, only to extract the text from previously identified regions of interest, e.g. paragraphs, tables, titles, … The latter case is of particular interest to me. …

credit: https://www.data-science-hub.com/

As the name suggests, Constrained Clustering is the study and development of clustering algorithms that try to integrate prior knowledge about the data into the clustering process.

I decided to write this article because I think the field of Constrained Clustering is not well known: it is not as mainstream as many other fields of Machine Learning, but it can be useful in some situations. So I think it is important to at least know that it exists.


Generally speaking, in standard clustering algorithms we try to divide the original dataset into non-intersecting groups (clusters) such that…

In this article we will show you a very simple method to build a baseline for the Entity Linking task. Entity Linking is a very challenging task, and it is important in both academia and industry.

In Entity Linking, the input is a text given as a sequence of words from a dictionary. The output of an Entity Linking model is a list of mention-entity pairs, where each mention is a subsequence of words of the input text and each entity is an entry in a Knowledge Base (e.g. Wikipedia).
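To make the input and output format concrete, here is a minimal sketch in Python. It is not the method from the article: the toy knowledge base `TOY_KB`, the `Link` record and the `link_entities` function are hypothetical names introduced only for illustration, and the matching strategy (a greedy longest-match dictionary lookup) is an assumption chosen for simplicity.

```python
from dataclasses import dataclass

# Hypothetical toy knowledge base: lowercased surface form -> entity identifier.
# A real system would use entries from a Knowledge Base such as Wikipedia.
TOY_KB = {
    "paris": "Paris",
    "france": "France",
    "eiffel tower": "Eiffel_Tower",
}


@dataclass(frozen=True)
class Link:
    start: int    # index of the first word of the mention
    end: int      # index one past the last word of the mention
    mention: str  # the word subsequence taken from the input text
    entity: str   # the knowledge-base entry it is linked to


def link_entities(words, kb=TOY_KB, max_len=3):
    """Return mention-entity pairs via greedy longest-match lookup in kb."""
    links = []
    i = 0
    while i < len(words):
        matched = False
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_len, len(words) - i), 0, -1):
            surface = " ".join(words[i:i + n])
            entity = kb.get(surface.lower())
            if entity is not None:
                links.append(Link(i, i + n, surface, entity))
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return links


words = "The Eiffel Tower is in Paris".split()
links = link_entities(words)
for link in links:
    print(link.mention, "->", link.entity)
```

A lookup like this ignores ambiguity (the same mention can refer to different entities depending on context), which is a large part of what makes the task challenging.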

End-to-End Neural Entity Linking…

Ricciuti Federico

Data Scientist

