public class DumpProcessingController extends Object
The methods for registering listeners to process dump files that contain revisions are registerMwRevisionProcessor(MwRevisionProcessor, String, boolean) and registerEntityDocumentProcessor(EntityDocumentProcessor, String, boolean).
For processing the content of wiki pages, there are two modes of operation: revision-based and entity-document-based. The former is used when processing dump files that contain revisions. These hold detailed information about each revision (revision number, author, time, etc.) that could be used by revision processors.
The entity-document-based operation is used when processing simplified dumps that contain only the content of the current (entity) pages of a wiki. In this case, no additional information is available and only the entity document processors are called (since we have no revisions). Both modes use the same entity document processors. In revision-based runs, it is possible to restrict some entity document processors to certain content models only (e.g., to process only properties). In entity-document-based runs, this is ignored and all entity document processors get to see all the data.
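The difference between the two modes can be illustrated with a small, self-contained sketch. It does not use Wikidata Toolkit: the class ModeDispatchSketch, the nested Registration type, and the model strings "wikibase-item" and "wikibase-property" are assumptions made for illustration, and the real controller may dispatch differently internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative stand-ins only; none of these types belong to Wikidata Toolkit. */
class ModeDispatchSketch {

    /** A registered processor name together with its (possibly null) content-model filter. */
    static final class Registration {
        final String name;
        final String model; // null means "all content models"

        Registration(String name, String model) {
            this.name = name;
            this.model = model;
        }
    }

    /**
     * Revision-based run: a processor is notified only if its model filter
     * is null or equals the content model of the revision.
     */
    static List<String> notifiedInRevisionRun(List<Registration> registrations, String revisionModel) {
        List<String> notified = new ArrayList<>();
        for (Registration r : registrations) {
            if (r.model == null || r.model.equals(revisionModel)) {
                notified.add(r.name);
            }
        }
        return notified;
    }

    /** Entity-document-based run: model filters are ignored, so every processor is notified. */
    static List<String> notifiedInEntityDocumentRun(List<Registration> registrations) {
        List<String> notified = new ArrayList<>();
        for (Registration r : registrations) {
            notified.add(r.name);
        }
        return notified;
    }

    public static void main(String[] args) {
        List<Registration> registrations = Arrays.asList(
                new Registration("allEntities", null),
                new Registration("propertiesOnly", "wikibase-property"));
        // An item revision reaches only the unfiltered processor in a revision-based run.
        System.out.println(notifiedInRevisionRun(registrations, "wikibase-item"));
        // In an entity-document-based run, both processors see the document.
        System.out.println(notifiedInEntityDocumentRun(registrations));
    }
}
```

The point of the sketch is that a processor registered for properties only can still receive items when a simplified dump is processed, so such a processor should be prepared to handle, or skip, entity types it did not ask for.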
The methods for revision-based processing of selected dump files (which includes finding out which files are relevant and downloading them first) are processAllRecentRevisionDumps(), processMostRecentMainDump(), and processMostRecentDailyDump().
To extract the most recent sitelinks information, use getSitesInformation(). To get information about the revision dump files that the main methods will process, use getWmfDumpFileManager() to obtain the underlying dump file manager, which provides access to dump file data.
The controller will also catch exceptions that may occur when trying to download and read dump files. They will be turned into logged errors.
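The following self-contained sketch shows the general pattern of turning exceptions into logged errors. It is not the controller's actual code: LoggedErrorSketch and DumpTask are hypothetical names, and continuing with the next file after a failure is an assumption of the sketch, not something this page states.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Illustrative only; not part of Wikidata Toolkit. */
class LoggedErrorSketch {

    private static final Logger LOGGER = Logger.getLogger(LoggedErrorSketch.class.getName());

    /** Hypothetical stand-in for processing one dump file, which may fail on download or read. */
    interface DumpTask {
        void run() throws IOException;
    }

    /**
     * Runs every task in order. An IOException is logged instead of being
     * propagated, and the loop continues with the next task (an assumption).
     * Returns the number of tasks that completed without an exception.
     */
    static int processAll(List<DumpTask> tasks) {
        int completed = 0;
        for (DumpTask task : tasks) {
            try {
                task.run();
                completed++;
            } catch (IOException e) {
                LOGGER.log(Level.SEVERE, "Could not process dump file: " + e.getMessage(), e);
            }
        }
        return completed;
    }

    public static void main(String[] args) {
        DumpTask ok = () -> { };
        DumpTask failing = () -> {
            throw new IOException("simulated download failure");
        };
        System.out.println(processAll(Arrays.asList(ok, failing, ok)) + " task(s) completed");
    }
}
```

A consequence for callers is that a run can finish without throwing even though some files were not processed, so the log output is the place to check whether a run was complete.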
Constructor and Description |
---|
DumpProcessingController(String projectName): Creates a new DumpProcessingController for the project of the given name. |
Modifier and Type | Method and Description |
---|---|
Sites | getSitesInformation(): Processes the most recent dump of the sites table to extract information about registered sites. |
WmfDumpFileManager | getWmfDumpFileManager(): Returns a WmfDumpFileManager based on the current settings. |
void | processAllRecentRevisionDumps(): Processes all relevant page revision dumps in order. |
void | processMostRecentDailyDump(): Processes the most recent incremental (daily) dump that is available. |
void | processMostRecentDump(DumpContentType dumpContentType, MwDumpFileProcessor dumpFileProcessor): Processes the most recent dump of the given type using the given dump processor. |
void | processMostRecentJsonDump(): Processes the most recent main (complete) dump in JSON form that is available. |
void | processMostRecentMainDump(): Processes the most recent main (complete) dump that is available. |
void | registerEntityDocumentProcessor(EntityDocumentProcessor entityDocumentProcessor, String model, boolean onlyCurrentRevisions): Registers an EntityDocumentProcessor, which will henceforth be notified of all entity documents that are encountered in the dump. |
void | registerMwRevisionProcessor(MwRevisionProcessor mwRevisionProcessor, String model, boolean onlyCurrentRevisions): Registers an MwRevisionProcessor, which will henceforth be notified of all revisions that are encountered in the dump. |
void | setDownloadDirectory(String downloadDirectory): Sets the directory where dump files are stored locally. |
void | setOfflineMode(boolean offlineModeEnabled): Disables or enables Web access. |
public DumpProcessingController(String projectName)
projectName
- Wikimedia project name, e.g., "wikidatawiki" or "enwiki"

public void setDownloadDirectory(String downloadDirectory) throws IOException
downloadDirectory
- the download base directory
IOException
- if the existence of the directory could not be checked, or if it did not exist and could not be created

public void setOfflineMode(boolean offlineModeEnabled)
offlineModeEnabled
- if true, all Web access is disabled and only local files will be processed

public void registerMwRevisionProcessor(MwRevisionProcessor mwRevisionProcessor, String model, boolean onlyCurrentRevisions)
This is only used when processing dumps that contain revisions. In particular, plain JSON dumps contain no revision information.
Importantly, the MwRevision that the registered processors receive is valid only during the execution of MwRevisionProcessor.processRevision(MwRevision); it is not permanent. If the data is to be retained beyond that call, the revision processor needs to make its own copy.
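Why a copy is needed can be shown with a self-contained sketch. It assumes, purely for illustration, that the caller reuses one mutable object for every revision; this page does not say how MwRevision is implemented. ReusedRevision and RevisionSnapshot are hypothetical types, not Wikidata Toolkit classes.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only; assumes the caller reuses one mutable revision object. */
class RevisionCopySketch {

    /** Hypothetical mutable revision that is overwritten for each new revision. */
    static final class ReusedRevision {
        long revisionId;
        String title;
    }

    /** Immutable copy of the fields a processor wants to keep. */
    static final class RevisionSnapshot {
        final long revisionId;
        final String title;

        RevisionSnapshot(ReusedRevision revision) {
            this.revisionId = revision.revisionId;
            this.title = revision.title;
        }
    }

    /** Correct: copies the data while the revision object is still valid. */
    static List<RevisionSnapshot> collectSnapshots(String[] titles) {
        ReusedRevision shared = new ReusedRevision();
        List<RevisionSnapshot> kept = new ArrayList<>();
        for (int i = 0; i < titles.length; i++) {
            shared.revisionId = i + 1;
            shared.title = titles[i];
            kept.add(new RevisionSnapshot(shared)); // copy made during the "callback"
        }
        return kept;
    }

    /** Incorrect: stores the reference, so every entry ends up showing the last revision. */
    static List<ReusedRevision> collectReferences(String[] titles) {
        ReusedRevision shared = new ReusedRevision();
        List<ReusedRevision> kept = new ArrayList<>();
        for (int i = 0; i < titles.length; i++) {
            shared.revisionId = i + 1;
            shared.title = titles[i];
            kept.add(shared); // same object added repeatedly
        }
        return kept;
    }

    public static void main(String[] args) {
        String[] titles = {"Q1", "Q2"};
        System.out.println(collectSnapshots(titles).get(0).title);
        System.out.println(collectReferences(titles).get(0).title);
    }
}
```

Copying only the fields that are actually needed, as RevisionSnapshot does, also keeps memory use low when a dump contains many revisions.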
mwRevisionProcessor
- the revision processor to register
model
- the content model that the processor is registered for; it will only be notified of revisions in that model; if null is given, all revisions will be processed whatever their model
onlyCurrentRevisions
- if true, then the subscriber is only notified of the most current revisions; if false, then it will receive all revisions, current or not

public void registerEntityDocumentProcessor(EntityDocumentProcessor entityDocumentProcessor, String model, boolean onlyCurrentRevisions)
It is possible to register processors for specific content types and to use either all revisions or only the most current ones. This functionality is only available when processing dumps that contain this information. In particular, plain JSON dumps do not specify content models at all and have only one (current) revision of each entity.
entityDocumentProcessor
- the entity document processor to register
model
- the content model that the processor is registered for; it will only be notified of revisions in that model; if null is given, all revisions will be processed whatever their model
onlyCurrentRevisions
- if true, then the subscriber is only notified of the most current revisions; if false, then it will receive all revisions, current or not

public Sites getSitesInformation() throws IOException
IOException
- if there was a problem accessing the sites table dump or the dump download directory

public void processAllRecentRevisionDumps()

public void processMostRecentDailyDump()

public void processMostRecentMainDump()
This method is useful to obtain reliable results given that single incremental dump files are sometimes missing, even if earlier and later incremental dumps are available. In such a case, processing all recent dumps will miss some (random) revisions, thus reflecting a state that the wiki has never really been in. If this is considered a problem, then it is better to use this method instead.
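The effect of a missing incremental dump can be seen in a toy model where each page maps to a revision number. The sketch is self-contained and does not use Wikidata Toolkit; MissingDailyDumpSketch, replay and dump are hypothetical names, and the page names and revision numbers are made-up illustration data.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model only: pages map to revision numbers; no Wikidata Toolkit types are used. */
class MissingDailyDumpSketch {

    /**
     * Starts from the state in the main dump and applies each daily dump in order.
     * A null entry stands for a daily dump file that is missing; it is skipped.
     */
    static Map<String, Integer> replay(Map<String, Integer> mainDump,
            List<Map<String, Integer>> dailyDumps) {
        Map<String, Integer> state = new TreeMap<>(mainDump);
        for (Map<String, Integer> daily : dailyDumps) {
            if (daily != null) {
                state.putAll(daily);
            }
        }
        return state;
    }

    /** Helper: builds a dump from alternating page names and revision numbers. */
    static Map<String, Integer> dump(Object... pageThenRevision) {
        Map<String, Integer> result = new TreeMap<>();
        for (int i = 0; i < pageThenRevision.length; i += 2) {
            result.put((String) pageThenRevision[i], (Integer) pageThenRevision[i + 1]);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> mainDump = dump("A", 1, "B", 1);
        Map<String, Integer> day1 = dump("A", 2);         // wiki is now A=2, B=1
        Map<String, Integer> day2 = dump("A", 3, "B", 2); // wiki is now A=3, B=2
        Map<String, Integer> day3 = dump("B", 3);         // wiki is now A=3, B=3

        // All daily dumps present: the result equals the wiki's real final state.
        System.out.println(replay(mainDump, Arrays.asList(day1, day2, day3)));
        // Day 2 missing: A stays at 2 while B reaches 3, a combination the wiki never had.
        System.out.println(replay(mainDump, Arrays.asList(day1, null, day3)));
    }
}
```

Processing only the main dump yields an older state, but one that the wiki was actually in, which is the trade-off described above.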
public void processMostRecentJsonDump()
This method is useful to obtain reliable results given that single incremental dump files are sometimes missing, even if earlier and later incremental dumps are available. In such a case, processing all recent dumps will miss some (random) revisions, thus reflecting a state that the wiki has never really been in. If this is considered a problem, then it is better to use this method instead.
public void processMostRecentDump(DumpContentType dumpContentType, MwDumpFileProcessor dumpFileProcessor)
dumpContentType
- the type of dump to process
dumpFileProcessor
- the processor to use
See also: processMostRecentMainDump(), processMostRecentDailyDump(), processAllRecentRevisionDumps()

public WmfDumpFileManager getWmfDumpFileManager() throws IOException
Returns a WmfDumpFileManager based on the current settings. Most basic operations can also be performed through the DumpProcessingController, and this is often preferable.
This dump file manager will not be updated if the settings change later.
IOException
- if there was a problem, usually when accessing the dump file directory

Copyright © 2014 Wikidata Toolkit Developers. Generated from source code published under the Apache License 2.0. For more information, see the Wikidata Toolkit homepage.