Stemmer
Use the package manager pip to install nltk.
pip install nltk
Usage
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("romanian")
print(stemmer.stem("alergare"))
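The same stemmer can be applied across a list of inflected forms. A minimal sketch (the word list is illustrative; Snowball stems are algorithmic truncations, not dictionary lemmas):

```python
from nltk.stem.snowball import SnowballStemmer

# the Romanian Snowball stemmer ships with nltk; no extra data downloads needed
stemmer = SnowballStemmer("romanian")

# illustrative inflected forms of the same verb
words = ["alergare", "alergând", "alergau"]
stems = [stemmer.stem(w) for w in words]
print(stems)
```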
Tokeniser, Lemmatiser and POS (Part-Of-Speech)
Use the package manager pip to install spacy and spacy-stanza.
pip install spacy spacy-stanza
Usage
import stanza
import spacy_stanza
# download the Romanian stanza model once, then build the spaCy-compatible pipeline
stanza.download("ro")
nlp = spacy_stanza.load_pipeline("ro")
doc = nlp("Această propoziție este în limba română.")
for token in doc:
    print(token.text, token.lemma_, token.pos_)
For more info visit https://spacy.io/universe/project/spacy-stanza.
SpaCy
Create a Doc object and inspect its tokens:
from spacy.lang.ro import Romanian
nlp = Romanian()
doc = nlp("Aceasta este propoziția mea: eu am 7 mere, ce să fac cu ele?")
print("Index: ", [token.i for token in doc])
print("Text: ", [token.text for token in doc])
print("is alpha: ", [token.is_alpha for token in doc])
print("is punctuation: ", [token.is_punct for token in doc])
print("is like_num: ", [token.like_num for token in doc])
Output:
Index: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
Text: ['Aceasta', 'este', 'propoziția', 'mea', ':', 'eu', 'am', '7', 'mere', ',', 'ce', 'să', 'fac', 'cu', 'ele', '?']
is alpha: [True, True, True, True, False, True, True, False, True, False, True, True, True, True, True, False]
is punctuation: [False, False, False, False, True, False, False, False, False, True, False, False, False, False, False, True]
is like_num: [False, False, False, False, False, False, False, True, False, False, False, False, False, False, False, False]
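The blank Romanian pipeline above only tokenizes, but a rule-based sentencizer can still be added without downloading a trained model. A sketch, assuming spaCy v3 (the sentences are illustrative):

```python
from spacy.lang.ro import Romanian

nlp = Romanian()
# rule-based sentence splitter; needs no trained model
nlp.add_pipe("sentencizer")

doc = nlp("Prima propoziție este scurtă. A doua propoziție este și mai scurtă.")
sentences = [sent.text for sent in doc.sents]
print(sentences)
```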
Search for POS and dependencies:
import spacy
# load the pre-trained Romanian model (install it first with: python -m spacy download ro_core_news_sm)
nlp = spacy.load("ro_core_news_sm")
doc = nlp("Ea a mâncat pizza")
for token in doc:
    print('{:<12}{:<10}{:<10}{:<10}'.format(token.text, token.pos_, token.dep_, token.head.text))
Output:
Ea PRON nsubj mâncat
a AUX aux mâncat
mâncat VERB ROOT mâncat
pizza ADV obj mâncat
Predict Named Entities:
import spacy
nlp = spacy.load("ro_core_news_sm")
doc = nlp("Iulia Popescu, cea din Constanta, s-a dus la Lidl să cumpere pâine. Pe drum și-a dat seama că are nevoie de 50 de lei așa că a trecut și pe la bancomat înainte.")
for ent in doc.ents:
    print(ent.text, ent.label_)
Output:
Iulia Popescu PERSON
Constanta GPE
Lidl LOC
50 de lei MONEY
Rule-based Matching
Matching can be done on token attributes: LEMMA, POS, TEXT, IS_DIGIT, IS_PUNCT, LOWER, UPPER, plus the OP key for quantifiers.
The OP key can take the following values:
- '!' = match exactly zero times (negation)
- '?' = match zero or one time (optional)
- '+' = match one or more times
- '*' = match zero or more times
import spacy
from spacy.matcher import Matcher
# load the pre-trained Romanian model
nlp = spacy.load('ro_core_news_sm')
# create the matcher
matcher = Matcher(nlp.vocab)
# create a doc object
doc = nlp("Caracteristicile aplicației includ un design frumos, căutare inteligentă, etichete automate și răspunsuri vocale opționale.")
# create a pattern for a noun followed by one or two adjectives
pattern = [{'POS': 'NOUN'}, {'POS': 'ADJ'}, {'POS': 'ADJ', 'OP': '?'}]
# add the pattern to the matcher
matcher.add('QUALITIES', [pattern])
# apply the matcher on the doc
matches = matcher(doc)
for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.text)
Output:
design frumos
căutare inteligentă
etichete automate
răspunsuri vocale opționale
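Handlers that look only at the token text (LOWER, IS_DIGIT) work even on the blank pipeline, so a pattern can be tried without the trained model. A sketch matching "digits + lei" amounts (the sentence and the AMOUNT pattern name are illustrative):

```python
from spacy.lang.ro import Romanian
from spacy.matcher import Matcher

nlp = Romanian()  # blank pipeline: tokenization only, no model download
matcher = Matcher(nlp.vocab)

# one or more digit tokens followed by the word "lei"
pattern = [{"IS_DIGIT": True, "OP": "+"}, {"LOWER": "lei"}]
matcher.add("AMOUNT", [pattern])

doc = nlp("Pâinea costă 5 lei, iar laptele costă 8 lei.")
matches = [doc[start:end].text for match_id, start, end in matcher(doc)]
print(matches)
```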
RoWordnet
Use the package manager pip to install rowordnet.
pip install rowordnet
Usage
import rowordnet
wordnet = rowordnet.RoWordNet()
word = 'arbore'
synset_ids = wordnet.synsets(literal=word)
wordnet.print_synset(synset_ids[0])
For more info visit https://github.com/dumitrescustefan/RoWordNet.
BERT for Romanian
import torch
from transformers import AutoTokenizer, AutoModel
# load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")
model = AutoModel.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")
# tokenize a sentence and run through the model
input_ids = torch.tensor(tokenizer.encode("Acesta este un test.", add_special_tokens=True)).unsqueeze(0) # Batch size 1
outputs = model(input_ids)
# get encoding
last_hidden_states = outputs[0] # The last hidden-state is the first element of the output tuple
For more info visit https://huggingface.co/dumitrescustefan/bert-base-romanian-cased-v1.
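The token-level hidden states above can be pooled into a single sentence vector; mean pooling is one common choice (a sketch using the same model; the pooling strategy is an assumption, not part of the original example):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")
model = AutoModel.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")

inputs = tokenizer("Acesta este un test.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# average the token embeddings into one fixed-size sentence vector
sentence_embedding = outputs.last_hidden_state.mean(dim=1).squeeze(0)
print(sentence_embedding.shape)
```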
Word Vectors
fastText
Use the package manager pip to install fasttext.
pip install fasttext
import fasttext.util
fasttext.util.download_model('ro', if_exists='ignore')
ft = fasttext.load_model('path/to/cc.ro.300.bin')
or download from here.
More info on usage here: https://fasttext.cc/docs/en/crawl-vectors.html.
word2vec
from here: https://github.com/senisioi/ro_resources.
Other Linguistic resources
- List of (all, I hope) Romanian words - from here
- List of prefixes - from here
- List of suffixes - from here
- RoSentiWordNet - download from here
RoSentiWordNet is a lexical resource in which each RoWordNet synset is associated with three numerical scores, Obj(s), Pos(s) and Neg(s), describing how objective, positive, and negative the terms contained in the synset are. It was created by translating SentiWordNet into Romanian using the googletrans Python library.
Source: https://github.com/Alegzandra/Romanian-NLP-tools