Text is a common form of unstructured data.
Before we train a model, we have to clean up the training data to improve the model's quality and shorten the training time.
Because the data is unstructured, we also have to convert the text into structured data that the model can work with.
# 1. Text-Processing
# Cleaning
Use regular expressions (regex) to clean the data.
This step removes things like markup tags (for example HTML tags) and stray characters such as "/" or "}".
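A minimal sketch of such a cleaning step, using only Python's built-in `re` module. The function name `clean_text` and the exact set of characters that are kept are assumptions made for illustration; real data usually needs its own rules.

```python
import re

def clean_text(raw: str) -> str:
    """Strip markup tags and stray special characters from raw text."""
    # Replace anything that looks like a tag, e.g. <p> or </div>, with a space.
    no_tags = re.sub(r"<[^>]+>", " ", raw)
    # Assumption: English text. Keep ASCII letters, digits, whitespace and
    # basic punctuation; replace everything else (such as / { }) with a space.
    no_special = re.sub(r"[^A-Za-z0-9\s.,!?'-]", " ", no_tags)
    # Collapse the runs of whitespace left behind by the replacements.
    return re.sub(r"\s+", " ", no_special).strip()

print(clean_text("<p>Hello {world}/ this is NLP!</p>"))
# prints: Hello world this is NLP!
```

Tags are replaced with a space rather than an empty string so that words on either side of a tag do not get glued together.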
# Normalizing
Use the .lower() method to convert the text to lowercase.
Remove punctuation from the text.
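Both normalizing steps can be sketched with the standard library. The helper name `normalize` is hypothetical, and `string.punctuation` only covers ASCII punctuation, so typographic quotes and similar characters would need extra handling.

```python
import string

def normalize(text: str) -> str:
    """Lowercase the text and delete ASCII punctuation."""
    lowered = text.lower()
    # A translation table whose third argument lists the characters to delete.
    table = str.maketrans("", "", string.punctuation)
    return lowered.translate(table)

print(normalize("I LOVE Python, don't you?"))
# prints: i love python dont you
```

Note that the apostrophe counts as punctuation here, so "don't" becomes "dont".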
# Tokenize
Separate the text into "tokens" (for example, one sentence split word by word).
Stopwords are words that appear very often in a language and carry little meaning on their own.
Examples would be "a", "the", …
Imagine typing "How to develop chatbot using Python" into your search engine. Here, the search engine will remove the stopwords "how" and "to".
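A very small sketch of word-level tokenization, assuming the text has already been cleaned and normalized so that splitting on whitespace is enough. The function name `tokenize` is made up for this example; NLP libraries ship tokenizers that also handle punctuation and contractions.

```python
def tokenize(text: str) -> list[str]:
    """Split a sentence into word tokens on any run of whitespace."""
    return text.split()

print(tokenize("i love python"))
# prints: ['i', 'love', 'python']
```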
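The search query example can be sketched as a simple filter. The stopword set below is a tiny hand-picked list for illustration only; libraries such as NLTK and spaCy provide much larger lists.

```python
# Hypothetical, deliberately short stopword list.
STOPWORDS = {"a", "an", "the", "how", "to", "is", "in", "of", "and"}

def remove_stopwords(tokens: list[str]) -> list[str]:
    """Drop every token whose lowercase form is in STOPWORDS."""
    return [tok for tok in tokens if tok.lower() not in STOPWORDS]

query = "How to develop chatbot using Python".split()
print(remove_stopwords(query))
# prints: ['develop', 'chatbot', 'using', 'Python']
```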
# Stemming
# Lemmatization
# POS
"Part-of-speech tagging": each token is labeled with its word category (for example pronoun or verb).
POS using NLTK:
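A sketch of what this could look like, assuming NLTK is installed and its tokenizer and tagger resources have been downloaded once with `nltk.download(...)` (the resource names differ between NLTK versions). The function names `pos_tag_sentence` and `format_tagged` are made up for this example, and the tagged pairs at the bottom are hand-written to show the shape of the result, not captured tagger output.

```python
def pos_tag_sentence(text: str) -> list[tuple[str, str]]:
    """Return (word, tag) pairs for the given text using NLTK."""
    # Imported here so the rest of the file also loads without NLTK installed.
    import nltk

    tokens = nltk.word_tokenize(text)   # split the text into word tokens
    return nltk.pos_tag(tokens)         # attach a Penn Treebank tag to each token

def format_tagged(pairs: list[tuple[str, str]]) -> str:
    """Render (word, tag) pairs in the common word/TAG notation."""
    return " ".join(f"{word}/{tag}" for word, tag in pairs)

# Hand-written pairs in the same shape that nltk.pos_tag returns.
example = [("I", "PRP"), ("love", "VBP"), ("Python", "NNP")]
print(format_tagged(example))
# prints: I/PRP love/VBP Python/NNP

# With NLTK and its resources available:
# print(format_tagged(pos_tag_sentence("I love Python")))
```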
In this example you can see that "I" is a personal pronoun and "love" is a verb. …
List the word categories (tags):
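NLTK's default English tagger uses the Penn Treebank tag set. The lookup table below is a small hand-picked excerpt of that tag set, and `describe` is a hypothetical helper, so this is only a sketch of how tags could be mapped to readable category names. NLTK can also print the full list itself via `nltk.help.upenn_tagset()`, which needs an additional resource download.

```python
# Excerpt of the Penn Treebank tag set (not the complete list).
TAG_DESCRIPTIONS = {
    "PRP": "personal pronoun",
    "VB": "verb, base form",
    "VBP": "verb, non-3rd person singular present",
    "NN": "noun, singular or mass",
    "NNP": "proper noun, singular",
    "JJ": "adjective",
    "DT": "determiner",
}

def describe(tagged: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Replace each tag with its description, or 'unknown tag' if not listed."""
    return [(word, TAG_DESCRIPTIONS.get(tag, "unknown tag")) for word, tag in tagged]

# Hand-written (word, tag) pairs used as sample input.
for word, meaning in describe([("I", "PRP"), ("love", "VBP")]):
    print(f"{word}: {meaning}")
# prints:
# I: personal pronoun
# love: verb, non-3rd person singular present
```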
POS using spaCy:
Note: "I" is also identified as a stopword, so it gets removed.
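The spaCy variant, including the stopword filtering mentioned in the note, might look like the sketch below. It assumes spaCy is installed and that the small English model `en_core_web_sm` has been downloaded (`python -m spacy download en_core_web_sm`). The function names are hypothetical.

```python
def tag_without_stopwords(doc) -> list[tuple[str, str]]:
    """Return (text, coarse POS tag) pairs, skipping tokens flagged as stopwords."""
    # Each spaCy token exposes .text, .pos_ (coarse tag) and .is_stop.
    return [(tok.text, tok.pos_) for tok in doc if not tok.is_stop]

def analyze(text: str, model: str = "en_core_web_sm") -> list[tuple[str, str]]:
    """Run the spaCy pipeline on the text and tag the remaining tokens."""
    # Imported here so the file also loads without spaCy installed.
    import spacy

    nlp = spacy.load(model)   # load the trained pipeline
    return tag_without_stopwords(nlp(text))

# With spaCy and the model available:
# print(analyze("I love Python"))
```

Filtering on `is_stop` after the pipeline has run means the tagger still sees the full sentence, so the stopwords only disappear from the returned list.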