Title: | Snowball Stemmers Based on the C 'libstemmer' UTF-8 Library |
---|---|
Description: | An R interface to the C 'libstemmer' library that implements Porter's word stemming algorithm for collapsing words to a common root to aid comparison of vocabulary. Currently supported languages are Arabic, Basque, Catalan, Danish, Dutch, English, Finnish, French, German, Greek, Hindi, Hungarian, Indonesian, Irish, Italian, Lithuanian, Nepali, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Tamil and Turkish. |
Authors: | Milan Bouchet-Valat [aut, cre] |
Maintainer: | Milan Bouchet-Valat <[email protected]> |
License: | BSD_3_clause + file LICENSE |
Version: | 0.7.1 |
Built: | 2024-10-26 05:41:24 UTC |
Source: | https://github.com/nalimilan/r.temis |
This dynamically determines the names of the languages for which stemming is currently supported by this package.
getStemLanguages()
The language names are returned in lower case. Note that two- and three-letter ISO-639 codes are also accepted by wordStem (see references for the list of codes).
This queries the C code for the list of languages that were compiled when the package was installed, which in turn is determined by the code included in the distributed package itself.
A character vector giving the names of the languages.
Milan Bouchet-Valat
http://www.loc.gov/standards/iso639-2/php/code_list.php for a list of ISO-639 language codes.
getStemLanguages()
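Because the supported languages are determined at run time, it can be useful to check a language name before stemming. The following is a minimal sketch; `isStemLanguage` is a hypothetical helper written for this illustration, not a function provided by the package. It only compares against the full names returned by getStemLanguages(), so an ISO-639 code would be reported as unsupported here even though wordStem accepts it.

```r
library(SnowballC)

# Hypothetical helper (not part of SnowballC): TRUE if `lang` matches one of
# the full, lower-case language names reported by getStemLanguages().
isStemLanguage <- function(lang) {
  tolower(lang) %in% getStemLanguages()
}

isStemLanguage("english")   # a language listed in the package description
isStemLanguage("klingon")   # a name that is not a stemmer language
```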
This function extracts the stems of each of the given words in the vector.
wordStem(words, language = "porter")
words |
a character vector of words whose stems are to be extracted. |
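language |
the name of a recognized language, as returned by getStemLanguages, or a two- or three-letter ISO-639 code for a recognized language (see references for the list of codes). |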
This uses Dr. Martin Porter's stemming algorithm and the C libstemmer library generated by Snowball.
A character vector with as many elements as the input vector, each element being the stem of the corresponding input word. Elements of the vector are converted to UTF-8 encoding before the stemming is performed, and the returned elements are marked as such when they contain non-ASCII characters.
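The properties of the return value described above can be inspected directly. This is a small sketch under the assumption that the package is installed; the sample words are arbitrary and chosen only for illustration, and the actual stems are not shown because they depend on the stemmer.

```r
library(SnowballC)

# Arbitrary sample input: plain ASCII words, a missing value, and one word
# containing a non-ASCII character (written with an escape for portability).
words <- c("running", "runs", NA, "caf\u00e9s")
stems <- wordStem(words, language = "english")

# One result per input element.
length(stems) == length(words)

# A missing input yields a missing stem.
is.na(stems[3])

# Encoding() shows which returned elements are marked as UTF-8.
Encoding(stems)
```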
Milan Bouchet-Valat
http://www.loc.gov/standards/iso639-2/php/code_list.php for a list of ISO-639 language codes.
# Simple example
wordStem(c("win", "winning", "winner"))

# Test some of the vocabulary supplied at https://github.com/snowballstem/snowball-data
for(lang in getStemLanguages()) {
    load(system.file("words", paste0(lang, ".RData"), package="SnowballC"))
    stopifnot(all(wordStem(dat$words, lang) == dat$stem))
}

stopifnot(is.na(wordStem(NA)))
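Since wordStem also accepts ISO-639 codes, a language can be given either by name or by code. The sketch below assumes that "en" and "eng" are the codes the bundled libstemmer recognizes for English; if that assumption holds, all three calls should produce the same stems.

```r
library(SnowballC)

words <- c("win", "winning", "winner")

# Same stemmer selected three ways: full name, two-letter and three-letter
# ISO-639 code (the codes are assumed to map to English here).
by_name <- wordStem(words, language = "english")
by_iso2 <- wordStem(words, language = "en")
by_iso3 <- wordStem(words, language = "eng")

identical(by_name, by_iso2)
identical(by_name, by_iso3)
```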