Text is a typical example of unstructured data.
Before we train a model, we have to clean the training data, which improves both the quality of the model and its training time.
Because the text is unstructured, we also have to convert it into structured data that our model can work with.
# 1. Text-Processing
# Cleaning
Use regular expressions (regex) to clean the data.
This step removes things like HTML tags and characters like ”/}“…
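
A minimal sketch of this cleaning step, using only Python's built-in `re` module. The function name `clean_text` and the exact set of symbols that get removed are assumptions made for illustration; adjust the patterns to whatever noise your own data contains.

```python
import re

def clean_text(raw: str) -> str:
    """Strip HTML-style tags and a few stray symbols, then tidy whitespace."""
    # Replace anything that looks like <tag ...> or </tag> with a space
    no_tags = re.sub(r"<[^>]+>", " ", raw)
    # Replace a few stray symbols with a space (assumed set: / { } \ |)
    no_symbols = re.sub(r"[/{}\\|]", " ", no_tags)
    # Collapse runs of whitespace into a single space
    return re.sub(r"\s+", " ", no_symbols).strip()

print(clean_text("<p>I love <b>Python</b>/}</p>"))  # I love Python
```

Tags are replaced with a space rather than deleted so that words on either side of a tag do not get glued together.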
# Normalizing
Use the .lower() function to convert the text to lowercase.
Remove punctuation from the text.
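
Both normalizing steps can be sketched with the standard library. The function name `normalize` is made up for this example, and `string.punctuation` only covers ASCII punctuation, so typographic quotes or dashes would need extra handling.

```python
import string

def normalize(text: str) -> str:
    """Lowercase the text and drop ASCII punctuation."""
    lowered = text.lower()
    # A translation table with a "delete" argument removes every listed character
    return lowered.translate(str.maketrans("", "", string.punctuation))

print(normalize("Hello, World!"))  # hello world
```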
# Tokenize
Separate the text into “tokens” (for example, one sentence split word by word).
NLTK
spaCy
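
To show the idea without installing anything, here is a simplified, regex-based stand-in for a tokenizer. It is not how NLTK or spaCy tokenize internally; their tokenizers handle contractions, punctuation, and special cases much more carefully. The function name `tokenize` is an assumption for this sketch.

```python
import re

def tokenize(sentence: str) -> list:
    """Split a sentence into word tokens, dropping punctuation."""
    # \w+ matches runs of letters, digits and underscores
    return re.findall(r"\w+", sentence)

print(tokenize("I love Python."))  # ['I', 'love', 'Python']
```

In practice you would more likely call `nltk.word_tokenize(text)` (which needs NLTK's tokenizer data to be downloaded first) or iterate over the tokens of a spaCy `Doc`.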
# Stopwords
Stopwords are words that appear very often in a language and carry little meaning on their own.
Examples would be “a, the, …”
Imagine searching for “How to develop chatbot using Python” in your search engine. Here, the search engine will remove the stopwords “how, to”.
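
The search example can be sketched like this. The stopword set below is a tiny hand-picked list for illustration only, and `remove_stopwords` is a made-up helper name; real projects usually take the list from a library such as NLTK (`nltk.corpus.stopwords`) or use spaCy's `token.is_stop` flag.

```python
# Tiny hand-picked stopword set, for illustration only
STOPWORDS = {"a", "an", "the", "how", "to", "is", "in", "of"}

def remove_stopwords(tokens):
    """Keep only the tokens that are not in the stopword set."""
    # Compare in lowercase so that "How" matches "how"
    return [t for t in tokens if t.lower() not in STOPWORDS]

query = "How to develop chatbot using Python".split()
print(remove_stopwords(query))  # ['develop', 'chatbot', 'using', 'Python']
```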
# Stemming
# Lemmatization
# POS
POS stands for “part-of-speech tagging”.
POS using NLTK:
In this example you can see that “I” is a personal pronoun and “love” is a verb. …
List the word categories:
POS using spaCy:
Note: “I” is also identified as a stopword, so it gets removed.