To achieve the best conversational experience in Wingbot.ai, we're building our own natural language processing (NLP) pipeline, one that best suits our chatbot engine. I'll try to explain how it works.
To understand a language, you have to know the meaning of every word and the relations between those words. In a pre-trained model, each word has a numeric representation and is placed in a multidimensional space according to its meaning, so words with similar meanings end up close to each other. This technique is known as word2vec. To be able to understand another language, we just need to train a language model for it.
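To make the idea of "placing words in a space" more concrete, here is a minimal sketch. The three-dimensional vectors are made up for illustration (real pre-trained models use hundreds of dimensions), and cosine similarity is just one common way to compare such vectors, not necessarily the one our pipeline uses.

```python
import math

# Hypothetical toy vectors; in a real model these come from pre-training.
TOY_VECTORS = {
    "flight": [0.9, 0.1, 0.0],
    "plane": [0.8, 0.2, 0.1],
    "pizza": [0.0, 0.1, 0.9],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Words with related meanings should score higher than unrelated ones.
sim_close = cosine_similarity(TOY_VECTORS["flight"], TOY_VECTORS["plane"])
sim_far = cosine_similarity(TOY_VECTORS["flight"], TOY_VECTORS["pizza"])
print(f"flight vs plane: {sim_close:.2f}, flight vs pizza: {sim_far:.2f}")
```

With these toy numbers, "flight" and "plane" point in almost the same direction, while "flight" and "pizza" are nearly perpendicular.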
When training a custom model, the training data are "mapped" onto the pre-trained model's space and a subset of related words is extracted from its dictionary. This lets us determine a numeric representation of each training sample and use that data to train a neural network. Training the neural network is like drawing imaginary borders between words according to the intent they belong to.
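The training step can be sketched roughly as follows. This is a simplified illustration, not our production code: the two-dimensional word vectors, the sample utterances and the intent names are invented, a sentence is represented by averaging the vectors of its known words (an assumption), and a single softmax layer stands in for the neural network.

```python
import math

# Hypothetical pre-trained vectors (2 dimensions to keep the example readable).
TOY_VECTORS = {
    "book": [0.7, 0.1], "flight": [0.9, 0.0], "find": [0.5, 0.2], "plane": [0.8, 0.1],
    "order": [0.2, 0.6], "pizza": [0.0, 0.9], "want": [0.3, 0.4], "pasta": [0.1, 0.8],
}
DIM = 2
INTENTS = ["flight", "food"]
SAMPLES = [
    ("book flight", "flight"), ("find plane", "flight"),
    ("order pizza", "food"), ("want pasta", "food"),
]


def embed(sentence):
    """Average the vectors of known words; unknown words are skipped."""
    vectors = [TOY_VECTORS[w] for w in sentence.lower().split() if w in TOY_VECTORS]
    if not vectors:
        return [0.0] * DIM
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def probabilities(x, weights, biases):
    return softmax([sum(w * xi for w, xi in zip(row, x)) + b
                    for row, b in zip(weights, biases)])


def train(samples, intents, epochs=500, learning_rate=0.5):
    """Fit a single softmax layer with plain stochastic gradient descent."""
    weights = [[0.0] * DIM for _ in intents]
    biases = [0.0 for _ in intents]
    for _ in range(epochs):
        for text, intent in samples:
            x = embed(text)
            probs = probabilities(x, weights, biases)
            for k, name in enumerate(intents):
                error = probs[k] - (1.0 if name == intent else 0.0)
                for d in range(DIM):
                    weights[k][d] -= learning_rate * error * x[d]
                biases[k] -= learning_rate * error
    return weights, biases


def predict(text, intents, weights, biases):
    """Return the winning intent and its confidence."""
    probs = probabilities(embed(text), weights, biases)
    best = max(range(len(intents)), key=lambda k: probs[k])
    return intents[best], probs[best]


weights, biases = train(SAMPLES, INTENTS)
print(predict("book plane", INTENTS, weights, biases))
```

The learned weights play the role of the "imaginary borders": a sentence that lands on the flight side of the border is assigned the flight intent, even if that exact sentence was never in the training data.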
For a chatbot to recognize your intention, we need to provide sample utterances and their meanings, called intents. Sometimes we also need to detect a specific type of information, for example a city; we call this an entity. Intent samples with marked entities together make up the training data.
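One possible way to picture such training data is shown below. The class names, field names and intent names are hypothetical and chosen only for this illustration; they do not describe the actual Wingbot.ai data format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class EntitySpan:
    name: str   # entity type, e.g. "city"
    start: int  # index of the first character of the entity
    end: int    # index one past the last character


@dataclass
class Sample:
    text: str
    intent: str
    entities: List[EntitySpan] = field(default_factory=list)

    def entity_values(self):
        """Map each marked entity name to the text it covers."""
        return {span.name: self.text[span.start:span.end] for span in self.entities}


def mark(text, value, name):
    """Helper: build a span for the first occurrence of `value` in `text`."""
    start = text.index(value)
    return EntitySpan(name, start, start + len(value))


# Hypothetical training data: utterance, intent, and marked entities.
utterance = "find a flight to Prague"
TRAINING_DATA = [
    Sample(utterance, "search_flight", [mark(utterance, "Prague", "city")]),
    Sample("what is the weather like", "weather"),
]

print(TRAINING_DATA[0].entity_values())
```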
When you ask our chatbot to find a flight, for example, we first detect all entities: words that match pre-defined lists or rule sets, such as numbers, dates or cities. Then the NLP tries to find a numeric representation of your sentence by determining its coordinates in the NLP model. With the sentence translated to numbers, we can use the previously trained neural network to recognize the winning intent and the confidence of the recognition.
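The first step, entity detection, might look something like this sketch. The city list, the regular expressions and the output shape are assumptions made for the example; a real deployment would use much richer lists and rules.

```python
import re

# Hypothetical pre-defined list of known cities.
CITIES = ["Prague", "London", "New York"]

# Hypothetical rule sets expressed as regular expressions.
RULES = {
    # ISO-style dates such as 2024-05-01
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    # standalone numbers; the lookarounds skip digits that belong to a date
    "number": re.compile(r"(?<![\d-])\d+(?![\d-])"),
}


def detect_entities(text):
    """Return all list and rule matches, ordered by position in the text."""
    found = []
    for city in CITIES:
        for match in re.finditer(re.escape(city), text, flags=re.IGNORECASE):
            found.append({"entity": "city", "value": city,
                          "start": match.start(), "end": match.end()})
    for name, pattern in RULES.items():
        for match in pattern.finditer(text):
            found.append({"entity": name, "value": match.group(),
                          "start": match.start(), "end": match.end()})
    return sorted(found, key=lambda item: item["start"])


for item in detect_entities("2 tickets to prague on 2024-05-01"):
    print(item["entity"], item["value"])
```

Only after this step is the sentence turned into numbers and handed to the trained network, which returns the winning intent together with a confidence score.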
It often happens that you ask the bot about something it does not understand. The most important part of training is collecting and analyzing those misunderstandings. To provide the best experience, we use human curation as a filter for new training data: chatbot trainers carefully select new utterances or whole topics to enhance the training data. Only with regular and continuous training can the NLP achieve high understanding rates.
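As a rough illustration of that loop, the sketch below collects low-confidence utterances into a review queue and adds only the ones a trainer has approved. The threshold value, the log format and the function names are all hypothetical.

```python
# Assumed cut-off below which a recognition counts as a misunderstanding.
CONFIDENCE_THRESHOLD = 0.7


def collect_for_review(log, threshold=CONFIDENCE_THRESHOLD):
    """Pick unique utterances whose recognition confidence was too low."""
    seen = set()
    queue = []
    for text, _intent, confidence in log:
        key = text.strip().lower()
        if confidence < threshold and key not in seen:
            seen.add(key)
            queue.append(text)
    return queue


def apply_curation(training_data, decisions):
    """Add only utterances a human trainer approved.

    `decisions` maps an utterance to the intent chosen by the trainer,
    or to None when the trainer rejected it.
    """
    approved = [(text, intent) for text, intent in decisions.items() if intent is not None]
    return training_data + approved


# Hypothetical conversation log: (utterance, recognized intent, confidence).
log = [
    ("book a flight", "search_flight", 0.95),
    ("can I bring my dog", "search_flight", 0.41),
    ("Can I bring my dog", "search_flight", 0.39),
    ("asdf", "weather", 0.22),
]
queue = collect_for_review(log)

# A trainer reviews the queue: one utterance becomes training data, one is rejected.
decisions = {"can I bring my dog": "pets_policy", "asdf": None}
training_data = apply_curation([("book a flight", "search_flight")], decisions)
print(queue)
print(training_data)
```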