Last updated: 28 January 2008

***What is this?

This is a small collection of scripts that will go get pages changed, added, or deleted on a wikiproject and update the full XML dump accordingly, to produce a current snapshot.

***System requirements:

linux (untested on other unix variants)
bash, awk, sed, grep and all the usual goodies
perl
curl

***How to install:

See the file INSTALL for details.

**How to run:

cd into the directory where this package was unpacked.

If this is the first time running it, you will want a full copy of the XML dump for your project to start with. See http://download.wikimedia.org/backup-index.html for these dumps. You will want the one that has "All pages, current versions only", including discussion pages (pages-meta-current.*.xml). *Copy* this file into last_full.xml (the file will be overwritten later with the new snapshot). If you edited config.txt to change the values for snapshot or snapshotdir, move the file accordingly.

Look at the date of the XML dump; you are going to want to get everything from that date through today. So for example if your dump is dated Jan 13 and today is Jan 18, you will want to get 6 days of data. There may be some overlap with existing content; this is ok. Overlap ensures that you don't miss any changes.

For the full run, type

./getrcs.sh today today-numdays

In our example, we would have

./getrcs.sh today today-6

Now wait for a while. The script will update you as it goes along. It fetches the (relevant part of the) rc logs, the move logs, the import logs, the upload logs, and the delete logs. It then retrieves all pages called for by those logs (except for the deletes :-) ). Finally, it merges these pages into the last full dump and produces a new current snapshot, which will be found in last_full.xml.

The next time you run this script, look at the date you generated the last_full.xml file, and make the same calculation for the number of days.
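The day-count calculation above can be sketched in shell (GNU date assumed; the dates are the ones from the example):

```shell
#!/bin/bash
# Dump dated Jan 13, today Jan 18: 5 elapsed days plus 1 day of overlap = 6.
dump_date="2008-01-13"
today="2008-01-18"
elapsed=$(( ( $(date -d "$today" +%s) - $(date -d "$dump_date" +%s) ) / 86400 ))
numdays=$(( elapsed + 1 ))   # the extra day of overlap ensures no change is missed
echo "./getrcs.sh today today-$numdays"
```

Adding one day of overlap is the illustrative choice here; any overlap is harmless, as noted above, since duplicate changes are reconciled by timestamp during the merge.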
If you are running this script on more than one wiki, for example, you can use multiple configuration files and give the command

./getrcs.sh today today-6 my-config-file

You can also specify absolute timestamps, a base date that depends on the last run time, or hour increments instead of days. Type ./getrcs.sh with no arguments for more information.

**Other info

Temporary files live in ./tmp, and you can remove them when the run is finished. They will be removed by the script itself on the next run, all except for the files with the extension raw or raw.save; these you will want to clear out yourself once in a while, as they are kept in case debugging is needed.

This script is meant to be run no more than once a day, so the timestamps it puts in filenames are just the date. If you need to run it more often because you are on a very active project, change the file naming convention by editing the variable "ext" near the top of the file.

**Still more info (What are all these scripts?)

sort.pl and uniq.pl: tiny scripts that replace sort and uniq on linux, because they are busted for some utf8 characters in my locale, as I found out the hard way.

merge-pages-main-and-export.pl: grabs the pages we exported from the live db and folds them into the last full XML file.

merge-deletes.pl: deletes selected pages based on the retrieved portion of the delete log.

do-links.sh: makes symlinks to the getrcs.sh script in case you want to run certain phases of it in isolation.

symlinks (getchanges.sh etc.): allow you to run each phase of the script separately; see the INSTALL file for more info.

**A teeny bit more info

By default the script sleeps 5 seconds between requests for page exports, which are done in batches of 500 pages each, and 2 seconds between requests for log portions, which are 500 lines each.

See TODO for things that probably should be... done.

**Copyright

This little mess is released for use under the GPL v3 or later, as well as under the GFDL 1.2 or later; the reader may choose which one to use.
Copyright (C) Ariel T. Glenn 2008 (and all other editors of this page; please see the history page for details). Please improve it and share!
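The config.txt file referred to above is a small shell-style fragment of variable assignments; a sketch using the default values quoted in the INSTALL notes (the comments are mine, and you would change at least wiki for your own project):

```shell
# Sketch of config.txt with the default values from the INSTALL notes.
wiki="en.wiktionary.org"    # project hostname
expurl='Special:Export'     # name of the Special:Export page on that wiki
logsecs=2                   # seconds between requests for log pieces (500 lines each)
pagesecs=5                  # seconds between page-export requests (500 pages each)
tmp="tmp"                   # temporary work directory
snapshot="last_full.xml"    # filename of the current snapshot
snapshotdir="./"            # directory where the snapshot lives
lastrun="lastrun"           # file recording the start date of this run
```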
Last updated: 20 January 2008

***System requirements:

linux (untested on other unix variants)
bash, awk, sed, grep and all the usual goodies
perl
curl

***How to install:

Untar the file into a convenient location.

Edit the file config.txt. Change the line wiki="en.wiktionary.org" to contain the name of your project, and the line expurl='Special:Export' to contain the name of your Special:Export page. You can change the number of seconds between requests for pieces of the various logs by editing the line logsecs=2, and the number of seconds between requests for page exports (500 pages each export) by editing the line pagesecs=5. You can also change the temporary work directory by editing the line tmp="tmp". You can change the directory and the name where the last snapshot will be found by editing the lines snapshot="last_full.xml" and snapshotdir="./". You can change the name of the file where we put the start date for this run by editing the line lastrun="lastrun".

If you want to run the script on more than one wiki, for example, you can create multiple configuration files, one for each.

After that, run the command

./do-links.sh

This will create symlinks to other names you can use for invoking the script one piece at a time.

**Note

You can run pieces of this script at a time. If you do a directory listing, you can see that there are several symlinks; each of these is a name that, if used to invoke the script, will run just that phase. Generally, if you are going to do that, you should pass the same number of days to each phase that you run separately, with the exception of the getpages and domerge phases, where the number of days isn't actually used :-)

The phases are:

(1) getchanges.sh getmoves.sh getimports.sh getuploads.sh getdeletes.sh
(2) getpages.sh
(3) domerges.sh

The scripts in phase 1 should be run before phase 2, which should be run before phase 3. The scripts in phase 1 retrieve the appropriate part of the specified log.
Titles of pages to retrieve or to delete are generated from these lists. The script in phase 2 retrieves all pages (except for the deletes) that we put together in phase 1. The script in phase 3 merges these new pages into the old full dump and deletes any that need to be removed, checking timestamps to see which version is most current or whether the deletion came before a recreation of the page (for example).

Why would you want to do this? Maybe you are debugging :-/ But, more likely, you may want to get all the log updates once a day but only build a snapshot once a week. (Depending on how large your project is, building the whole snapshot could take a long time.) You would have to do some manual catting of files together for this, but it might be useful.
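The daily-logs/weekly-snapshot idea above might be wrapped in a small driver like this hypothetical sketch. run() only echoes the per-phase commands so the plan is visible; the Sunday check and the one-day window are illustrative choices, not part of the package, and remember that (as noted above) you would still need to cat the accumulated daily files together before the merge phase:

```shell
#!/bin/bash
# Hypothetical daily driver: phases 1 and 2 every day, phase 3 (the merge)
# only on Sundays. Drop the echo in run() to invoke the symlinks for real.
run() { echo "./$1 today today-1"; }

for phase in getchanges.sh getmoves.sh getimports.sh getuploads.sh getdeletes.sh; do
    run "$phase"                 # phase 1: fetch each log
done
run getpages.sh                  # phase 2: fetch the pages (day count ignored here)
if [ "$(date +%u)" -eq 7 ]; then
    run domerges.sh              # phase 3: build the weekly snapshot (day count ignored)
fi
```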