In our fourth TechXperience, Javier González, Data Scientist, took the reins and led a session on NLP (natural language processing), reviewing some of the language models available today.
Who is Javier González Peñalosa?
First of all, let us introduce Javier González Peñalosa. Javier studied economics at the University of Zaragoza, which allowed him to enter the world of commercial banking. He nevertheless wanted to specialise in IT, so he completed a master's degree in Big Data and Business Intelligence. This subsequently enabled him to join Bosonit as a Data Scientist.
What is NLP (Natural Language Processing)?
NLP focuses on how machines understand, interpret and process human language. It is not only about translating words, but also about interpreting the different meanings that words and phrases take on depending on the context.
Natural language processing combines two different areas: linguistics and Machine Learning. It is not enough to translate text word for word into a form the models can handle; the models also need to capture the relationships between those words.
Language models
Models need to understand the whole context in order to relate one word to another. This area encompasses different tasks, one of the most common of which is text classification.
Nowadays, on sites such as FilmAffinity or IMDb, we can find many film reviews, and by processing them with this type of model it is possible to determine whether each review is positive or negative. This is a classification that measures the sentiment of the text.
Another very common classification task is spam detection. By processing the entire email, a model can categorise whether it is spam or not, adding value and automating the process for the user.
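To make the idea of text classification more concrete, here is a minimal sketch of a spam classifier. It uses multinomial naive Bayes over word counts, a classic baseline that is our own choice for illustration and not necessarily the approach shown in the session. The class name `NaiveBayesTextClassifier`, the `tokenize` helper and the tiny training set are all hypothetical. The same code would work for sentiment analysis by swapping the labels for "positive" and "negative".

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesTextClassifier:
    """Multinomial naive Bayes over word counts, with add-one smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.label_counts = Counter(labels)      # label -> number of documents
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocabulary = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.label_counts.items():
            counts = self.word_counts[label]
            denominator = sum(counts.values()) + len(self.vocabulary)
            # Log prior of the label plus smoothed log likelihood of each token.
            score = math.log(doc_count / total_docs)
            for token in tokens:
                score += math.log((counts[token] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy, made-up training data for illustration only.
train_texts = [
    "Win a free prize now, click here",
    "Cheap loans, limited offer, click now",
    "Free money, claim your prize today",
    "Meeting moved to Monday at ten",
    "Please review the attached project report",
    "Lunch tomorrow with the data team",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

classifier = NaiveBayesTextClassifier().fit(train_texts, train_labels)
print(classifier.predict("Claim your free prize now"))           # spam
print(classifier.predict("Can we review the report on Monday"))  # ham
```

A real system would be trained on thousands of labelled emails or reviews, and today would more likely rely on a pretrained language model, but the workflow is the same: labelled text goes in, and a model that assigns a category to new text comes out.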
Text generation is another NLP task. After the release of GPT, generative text models came to the forefront. Given a small amount of input text, such a model is able to pick up the style of the input and continue generating text in that style. An example would be to train one of these models on the writings of Gustavo Adolfo Bécquer so that it ends up replicating Bécquer's style.
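The principle of "continue the text in the same style" can be illustrated with a deliberately simple stand-in. The sketch below is a bigram Markov chain, not GPT: it only remembers which word followed which in the training text, whereas GPT-style models are neural networks that take much longer contexts into account. The function names `build_bigram_model` and `generate`, as well as the placeholder corpus, are assumptions made for this example.

```python
import random
from collections import defaultdict


def build_bigram_model(text):
    """Map each word to the list of words that follow it in the text."""
    words = text.split()
    followers = defaultdict(list)
    for current_word, next_word in zip(words, words[1:]):
        followers[current_word].append(next_word)
    return followers


def generate(followers, prompt, max_new_words=12, seed=0):
    """Continue the prompt by repeatedly sampling a follower of the last word."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    output = prompt.split()
    for _ in range(max_new_words):
        candidates = followers.get(output[-1])
        if not candidates:  # the last word was never followed by anything
            break
        output.append(rng.choice(candidates))
    return " ".join(output)


# Placeholder corpus; in practice this would be a large body of text
# written in the style to be imitated.
corpus = "the model reads the text and the model writes the text in the same style"
model = build_bigram_model(corpus)
print(generate(model, "the model"))
```

Every word the sketch produces follows a word it was seen to follow in the corpus, which is why the output echoes the vocabulary and rhythm of the training text. Large generative models achieve a far richer version of the same effect.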
But NLP does not only focus on text processing. Over the years it has advanced and covered new domains, among which we can highlight:
- Audio to text: With models like wav2vec, we are able to transcribe speech into text in different languages. An example of this could be the assistants we all have on our mobile phones (e.g. Alexa or Siri).
- Image generation: OpenAI created a project called DALL·E which, by combining its text generation models (GPT) with image generation models, is able to recreate as a digital image the text written by the user.
- Copilot: built on a GPT-3 model and developed in collaboration with Microsoft, it was trained on the public repositories hosted on GitHub. This allows a generative model to write code in different programming languages from a text description alone. It is an excellent tool when programming, making progress much faster.