Different Text-Cleaning Methods in NLP

In this post we take a hands-on look at most of the common methods for cleaning text data.

Introduction

In NLP, text cleaning is the tedious part. It requires careful thought about which information to keep and which parts to remove. It is especially challenging because data comes from different domains and we do not want to lose important information. Improper cleaning can hurt both the analysis and the final results. Here we look at some popular text-cleaning methods, then combine them into one function that can be applied to a dataset.

1. Removing URLs
2. Removing HTML tags
3. Removing accented characters
4. Expanding contractions
5. Converting number words to numerals
6. Removing special characters and extra whitespace
7. Removing emoji and other pictographs
8. Spelling correction
9. Text lemmatization
10. Removing stop words

Removing URLs

import re

def remove_URL(text):
    """Remove http(s):// and www. URLs from text."""
    url = re.compile(r'https?://\S+|www\.\S+')
    return url.sub('', text)

example = "New competition launched :https://www.kaggle.com/c/nlp-getting-started"
print("URL removed : {}".format(remove_URL(example)))
Output >>> URL removed : New competition launched :

Removing HTML tags

# Requires Beautiful Soup (pip install beautifulsoup4)
from bs4 import BeautifulSoup

def strip_html_tags(text):
    """Remove HTML tags from text, separating text nodes with a space."""
    soup = BeautifulSoup(text, "html.parser")
    stripped_text = soup.get_text(separator=" ")
    return stripped_text
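
The section above has no example call, so here is a small illustration of what tag stripping produces. It is a dependency-free sketch built on the standard library's html.parser, not the Beautiful Soup function from the post; the names _TextExtractor and strip_html_tags_stdlib are made up for this example.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for the text found between tags
        self.chunks.append(data)


def strip_html_tags_stdlib(text):
    """Return the text content of an HTML string, joined by single spaces."""
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()
    return " ".join(chunk.strip() for chunk in parser.chunks if chunk.strip())


example = "<div><h1>Title</h1><p>Some <b>bold</b> text</p></div>"
print(strip_html_tags_stdlib(example))  # Title Some bold text
```

Note that this sketch also keeps the contents of script and style elements, so a real parser such as Beautiful Soup is usually the safer choice for full web pages.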

Removing accented characters

Strip the diacritics from accented characters in the text, for example: café to cafe.

import unicodedata

def remove_accented_chars(text):
    """Decompose accented characters (NFKD), then drop anything that is not ASCII."""
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return text

example = "I love to café"
print("Non-accented text : ", remove_accented_chars(example))
Output >>> Non-accented text :  I love to cafe
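
Why this works: NFKD normalization splits an accented letter into a base letter plus a combining mark, and the ASCII encoding step with 'ignore' then drops the mark. The short sketch below makes that visible. The helper name to_ascii is made up for this example.

```python
import unicodedata

word = "caf\u00e9"  # "café" written with a precomposed é (one code point)
decomposed = unicodedata.normalize("NFKD", word)

# The é becomes two code points: the base letter and a combining accent
print(len(word), len(decomposed))  # 4 5
print([unicodedata.name(ch) for ch in decomposed[-2:]])
# ['LATIN SMALL LETTER E', 'COMBINING ACUTE ACCENT']


def to_ascii(text):
    """Same technique as above: decompose, then drop every non-ASCII code point."""
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")


print(to_ascii(word))      # cafe
print(to_ascii("Straße"))  # Strae
```

The last line shows the catch: a character with no decomposition, such as ß, is not converted but silently dropped, so this step can lose letters in non-English text.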

Expanding contractions

This expands contracted forms, for example: don't to do not and wanna to want to.

import contractions  # pip install contractions

def expand_contractions(text):
    """Expand contractions such as "don't" into their full forms."""
    text = contractions.fix(text)
    return text

example = "I don't wanna go home"
print("Contractions free text : ", expand_contractions(example))
Output >>> Contractions free text :  I do not want to go home
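
To show the idea behind contraction expansion, here is a minimal lookup-and-replace sketch. It is only an illustration under stated assumptions: the four-entry CONTRACTION_MAP and the name expand_contractions_simple are invented for this example, and the contractions library covers far more forms and edge cases.

```python
import re

# Hypothetical, deliberately tiny mapping used only for illustration
CONTRACTION_MAP = {
    "don't": "do not",
    "can't": "cannot",
    "wanna": "want to",
    "it's": "it is",
}

# One alternation over all keys, matched on word boundaries, case-insensitively
_pattern = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTION_MAP) + r")\b",
    flags=re.IGNORECASE,
)


def expand_contractions_simple(text):
    """Replace each known contraction with its expanded form."""
    return _pattern.sub(lambda m: CONTRACTION_MAP[m.group(0).lower()], text)


print(expand_contractions_simple("I don't wanna go home"))
# I do not want to go home
```

This sketch does not preserve capitalization ("Don't" becomes "do not") and does not handle typographic apostrophes, which is part of why a maintained library is preferable.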

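Converting number words to numerals (optional step)

import spacy
from word2number import w2n  # pip install word2number

# Assumption: the post does not say which spaCy model is loaded; a small English model is used here
nlp = spacy.load("en_core_web_sm")

def word_to_num(text):
    """Replace tokens tagged as numbers (e.g. "three") with their numeric value."""
    doc = nlp(text)
    tokens = [str(w2n.word_to_num(token.text)) if token.pos_ == 'NUM' else token.text for token in doc]
    return " ".join(tokens)

example = """three cups of coffee to 3 cups of coffee"""
print("Words to numbers : {}".format(word_to_num(example)))
Output >>> Words to numbers : 3 cups of coffee to 3 cups of coffee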

Removing special characters and extra whitespace

import re

def remove_special_characters(text):
    """Remove everything except ASCII letters, digits, and whitespace."""
    # A-Z, not A-z: the range A-z would also keep [ \ ] ^ _ `
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    return text

# Extra white spaces
def remove_whitespace(text):
    """Remove leading, trailing, and repeated whitespace from text."""
    return " ".join(text.split())

example = "   Hi,    how are you ?   ."
print("Extra spaces removed : ", remove_whitespace(example))
Output >>> Extra spaces removed :  Hi, how are you ? .
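
The example above exercises only remove_whitespace. The sketch below illustrates the special-character step and why the uppercase range should be written A-Z: the range A-z also spans the ASCII characters that sit between Z and a, so brackets, backslash, caret, underscore, and backtick survive. The helper strip_with and the sample string are made up for this example.

```python
import re


def strip_with(pattern, text):
    """Delete every character matched by the given regex pattern."""
    return re.sub(pattern, "", text)


example = "Hi_there [team]! Price: 20$ ^_^"

# A-z (typo): keeps [ ] ^ _ because they lie between 'Z' and 'a' in ASCII
print(repr(strip_with(r"[^a-zA-z0-9\s]", example)))
# 'Hi_there [team] Price 20 ^_^'

# A-Z (intended): keeps only letters, digits, and whitespace
print(repr(strip_with(r"[^a-zA-Z0-9\s]", example)))
# 'Hithere team Price 20 '
```

The trailing space in the second result is the kind of leftover that remove_whitespace is meant to clean up afterwards.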

Removing emoji and other pictographs

import re

def remove_emoji(text):
    """Remove emoji and pictographs by matching their Unicode ranges."""
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"  # dingbats
                           u"\U000024C2-\U0001F251"  # broad catch-all range (also covers many non-emoji characters)
                           "]+", flags=re.UNICODE)
    text = emoji_pattern.sub(r'', text)
    return text

example = "Hey how are you 😁😂👍🙌💕😜👀✔🎁"
print(remove_emoji(example))
Output >>> Hey how are you
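
One caveat with the pattern above: the range U+24C2 to U+1F251 is very wide and includes CJK ideographs, among other scripts, so it is only safe for text that should end up ASCII anyway. The sketch below demonstrates the effect and shows a narrower pattern as one possible alternative. The narrower block list is an assumption chosen for illustration and is not exhaustive.

```python
import re

# The broad range from the pattern above, on its own
broad = re.compile("[\U000024C2-\U0001F251]+")
print(repr(broad.sub("", "你好 world")))  # ' world'  (the Chinese text is removed too)

# A narrower sketch limited to a few well-known emoji and symbol blocks
narrow = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # regional indicators (flags)
    "\U0001F900-\U0001F9FF"  # supplemental symbols & pictographs
    "\U00002600-\U000027BF"  # miscellaneous symbols and dingbats
    "]+"
)
print(repr(narrow.sub("", "你好 😁✔")))  # '你好 '
```

Which ranges to include depends on the data, so it is worth testing the pattern on a sample of the real corpus.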

Spelling correction

# Spelling correction library (pip install pyspellchecker)
from spellchecker import SpellChecker

spell = SpellChecker()

def correct_spellings(text):
    """Replace each word the spell checker does not know with its most likely correction."""
    words = text.split()
    misspelled_words = spell.unknown(words)
    corrected_text = []
    for word in words:
        if word in misspelled_words:
            # correction() can return None when it finds no candidate; keep the original word then
            corrected_text.append(spell.correction(word) or word)
        else:
            corrected_text.append(word)
    return " ".join(corrected_text)

example = "corect me please"
print(correct_spellings(example))
Output >>> correct me please

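Text lemmatization

# `nlp` is the loaded spaCy pipeline also used by word_to_num, e.g. spacy.load("en_core_web_sm")
def lemmatize_text(text):
    """Replace each token with its lemma."""
    text = nlp(text)
    # '-PRON-' is the pronoun placeholder lemma of spaCy 2.x; in that case keep the original token
    text = ' '.join([word.lemma_ if word.lemma_ != '-PRON-' else word.text for word in text])
    return text

example = "He was trying to get into the house"
print("Lemmatized text : ", lemmatize_text(example))
Output >>> Lemmatized text :  he be try to get into the house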

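Removing stop words

import nltk
from nltk.corpus import stopwords

# One-time setup: nltk.download('punkt') and nltk.download('stopwords')
# Assumption: the post does not show how stpwrds is defined; NLTK's English list is used here
stpwrds = set(stopwords.words('english'))

def remove_stopwords(text, is_lower_case=False):
    """Drop stop words; set is_lower_case=True if text is already lowercased."""
    tokens = nltk.word_tokenize(text)
    tokens = [token.strip() for token in tokens]
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stpwrds]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stpwrds]
    filtered_text = ' '.join(filtered_tokens)
    return filtered_text

example = "Steve is an amazing photographer"
print("Text without stopwords : ", remove_stopwords(example))
Output >>> Text without stopwords :  Steve amazing photographer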

Now that we have seen the individual text-preprocessing functions, instead of applying them one at a time we will create one main function, normalize_doc(), that applies all of these sub-functions. You can set the flag for any sub-function to True or False.

def normalize_doc(doc, URL_stripping=True, html_stripping=True, contraction_expansion=True,
                  accented_char_removal=True, text_lower_case=True,
                  text_lemmatization=True, special_char_removal=True,
                  stopword_removal=True, emoji_removal=True,
                  spelling_correction=True, word_to_num_conversion=False):
    """Main function that applies all of the cleaning functions above, each toggled by a flag."""
    # strip URLs
    if URL_stripping:
        doc = remove_URL(doc)
    # strip HTML
    if html_stripping:
        doc = strip_html_tags(doc)
    # remove accented characters
    if accented_char_removal:
        doc = remove_accented_chars(doc)
    # expand contractions
    if contraction_expansion:
        doc = expand_contractions(doc)
    # lowercase the text
    if text_lower_case:
        doc = doc.lower()
    # convert number words to numerals
    # (the flag must not be named word_to_num, or it would shadow the function of that name)
    if word_to_num_conversion:
        doc = word_to_num(doc)
    # replace newlines with spaces
    doc = re.sub(r'[\r\n]+', ' ', doc)
    # remove special characters
    if special_char_removal:
        doc = remove_special_characters(doc)
    # collapse repeated spaces
    doc = re.sub(' +', ' ', doc)
    # remove emojis
    if emoji_removal:
        doc = remove_emoji(doc)
    # correct spelling
    if spelling_correction:
        doc = correct_spellings(doc)
    # lemmatize text
    if text_lemmatization:
        doc = lemmatize_text(doc)
    # remove stopwords
    if stopword_removal:
        doc = remove_stopwords(doc, is_lower_case=text_lower_case)
    return doc
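
To show how a flag-driven function like this is applied to a collection of texts, here is a reduced, standard-library-only sketch. It is not the full normalize_doc(): the function clean_basic, its flag names, and the two sample strings are made up for this example, and it covers only the steps that need no third-party packages.

```python
import re
import unicodedata


def clean_basic(text, strip_urls=True, strip_accents=True, lower=True, strip_special=True):
    """Apply a few dependency-free cleaning steps, each toggled by a flag."""
    if strip_urls:
        text = re.sub(r"https?://\S+|www\.\S+", "", text)
    if strip_accents:
        text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    if lower:
        text = text.lower()
    if strip_special:
        text = re.sub(r"[^a-zA-Z0-9\s]", "", text)
    # Always collapse leftover whitespace at the end
    return " ".join(text.split())


# Illustrative mini "dataset"
corpus = [
    "New competition launched: https://www.kaggle.com/c/nlp-getting-started",
    "I   love this CAF\u00c9!!!",
]

cleaned = [clean_basic(doc) for doc in corpus]
print(cleaned)
# ['new competition launched', 'i love this cafe']

# Flags can be switched off per call
print(clean_basic(corpus[1], lower=False))
# I love this CAFE
```

The same list comprehension works with normalize_doc() once its dependencies are installed. The order of the steps matters: for instance, URLs have to be removed before special characters, or the URL pattern no longer matches.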

There are many other custom preprocessing methods for specific tasks. Every real-world text dataset has its own requirements that the cleaning must respect. It is always best to understand the domain of the data first, so that cleaning preserves both the quality and the quantity of the data you analyze. See the GitHub repository for the full code:

Have a nice day :)