Title: | Import Texts from Files in the 'Alceste' Format Using the 'tm' Text Mining Framework |
---|---|
Description: | Provides a 'tm' Source to create corpora from a corpus prepared in the format used by the 'Alceste' application (i.e. a single text file with inline meta-data). It is able to import both text contents and meta-data (starred) variables. |
Authors: | Milan Bouchet-Valat [aut, cre] |
Maintainer: | Milan Bouchet-Valat <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.1.1 |
Built: | 2024-11-02 02:40:46 UTC |
Source: | https://github.com/nalimilan/r.temis |
This package provides a tm Source to create corpora from files in the format used by the Alceste application.
Typical usage is to create a corpus from a manually prepared Alceste file (here called myAlcesteCorpus.txt).
Frequently, it is necessary to specify the encoding of the texts via the encoding argument of AlcesteSource.
library(tm)
library(tm.plugin.alceste)

# Import corpus
source <- AlcesteSource("myAlcesteCorpus.txt")
corpus <- Corpus(source)

# See how many articles were imported
corpus

# See the contents of the first article and its meta-data
inspect(corpus[1])
meta(corpus[[1]])
See AlcesteSource for more details and real examples.
Milan Bouchet-Valat <[email protected]>
https://image-zafar.com/Logicieluk.html
Construct a source for an input containing a set of texts saved in the Alceste format in a single text file.
AlcesteSource(x, encoding = "auto")
x |
Either a character identifying the file or a connection. |
encoding |
A character string: if non-empty declares the encoding
used when reading the file, so the character data can be
re-encoded. See the ‘Encoding’ section of the help for
|
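The following base R sketch illustrates why declaring the encoding matters. It does not call the package and is not its implementation: the temporary file, the sample word and the choice of Latin-1 are assumptions made for the illustration.

```r
# Write the Latin-1 bytes for "café" (0xE9 is "é" in Latin-1) to a temporary file.
path <- tempfile(fileext = ".txt")
writeBin(as.raw(c(0x63, 0x61, 0x66, 0xe9, 0x0a)), path)

# Read without declaring an encoding: the bytes are kept as they are,
# and they do not form valid UTF-8.
undeclared <- readLines(path)
valid_before <- validUTF8(undeclared)

# Read again, declaring the encoding, then re-encode to UTF-8.
declared <- readLines(path, encoding = "latin1")
reencoded <- enc2utf8(declared)
valid_after <- validUTF8(reencoded)

unlink(path)
```

With the package, the corresponding declaration would be a call such as AlcesteSource("myAlcesteCorpus.txt", encoding = "ISO-8859-1"), assuming the file is actually stored in that encoding.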
Several texts are saved in a single Alceste-formatted file, separated
by lines starting with “***” or digits, followed by starred
variables (see links below). These variables are set as document
meta-data that can be accessed via the meta function.
Currently, “theme” lines starting with “-*” are ignored.
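The layout described above can be sketched in base R as follows. This is an illustration only, not the parser used by the package: the sample lines, the helper parse_header and the assumption that starred variables are written as *name_value are all hypothetical.

```r
# Hypothetical sample: two texts in a single Alceste-style character vector.
alceste_lines <- c(
  "**** *author_smith *year_2010",
  "First text, line one.",
  "-*theme_intro",
  "First text, line two.",
  "0002 *author_jones *year_2012",
  "Second text."
)

# Header lines start with "***" or a digit; theme lines start with "-*".
is_header <- grepl("^(\\*{3}|[0-9])", alceste_lines)
is_theme <- grepl("^-\\*", alceste_lines)

# Each line belongs to the text opened by the most recent header.
text_id <- cumsum(is_header)

# Turn the starred variables of a header line into a named list
# (assumes the *name_value convention).
parse_header <- function(header) {
  tokens <- regmatches(header, gregexpr("\\*[^ *]+", header))[[1]]
  tokens <- sub("^\\*", "", tokens)
  values <- sub("^[^_]*_", "", tokens)
  names(values) <- sub("_.*$", "", tokens)
  as.list(values)
}

# Build one entry per text: meta-data from the header, content from the
# remaining lines, with theme lines dropped.
texts <- lapply(split(seq_along(alceste_lines), text_id), function(idx) {
  body_idx <- idx[-1]
  list(
    meta = parse_header(alceste_lines[idx[1]]),
    content = alceste_lines[body_idx[!is_theme[body_idx]]]
  )
})
```

In this sketch, texts[[1]]$meta plays the role that meta(corpus[[1]]) plays for a corpus built with AlcesteSource.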
An object of class AlcesteSource, which extends the class Source, representing a set of articles from Alceste.
Milan Bouchet-Valat
https://image-zafar.com/sites/default/files/telechargements/formatage_alceste.pdf (in French) about the Alceste format
readAlceste for the function actually parsing individual articles.
getSources to list available sources.
library(tm)
library(tm.plugin.alceste)

file <- system.file("texts", "alceste_test.txt", package = "tm.plugin.alceste")
corpus <- Corpus(AlcesteSource(file))

# See the contents of the documents
inspect(corpus)

# See meta-data associated with first article
meta(corpus[[1]])
Read in a text in the Alceste format using starred variables.
readAlceste(elem, language, id)
elem |
A |
language |
A |
id |
A |
A PlainTextDocument with the contents of the article and the available meta-data set.
Milan Bouchet-Valat
getReaders to list available reader functions.