Set up English Wikipedia
English Wikipedia has a lot of data. There are 15.0+ million pages with 20.0+ GB of text, as well as 4.0+ million thumbnails.
Setting all this up on your computer will not be a quick process. As a general estimate, you will need about 30 GB of disk space and 5 hours of processing time. If you want images as well, the numbers increase to 100 GB of disk space and 30+ hours of processing time. However, when you are done, you will have a complete, recent copy of English Wikipedia with images that can fit on a 128 GB SD card.
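Before starting, it may help to confirm that the target drive has enough free space. The following is a minimal Python sketch, not part of XOWA: the 30 GB and 100 GB figures are the rough estimates given above, and the name has_enough_space is hypothetical.

```python
import shutil

# Rough disk-space estimates from the text above, in GB (not exact figures).
REQUIRED_GB = {"text": 30, "text_and_images": 100}


def has_enough_space(free_bytes: int, mode: str) -> bool:
    """Return True if free_bytes covers the estimate for the given mode."""
    return free_bytes >= REQUIRED_GB[mode] * 10**9


if __name__ == "__main__":
    # Check the drive that holds the current directory; pass the path of
    # your XOWA folder instead if it lives on another drive.
    free = shutil.disk_usage(".").free
    for mode in REQUIRED_GB:
        status = "ok" if has_enough_space(free, mode) else "not enough space"
        print(f"{mode}: {status}")
```

The check only compares free space against the final estimates; the downloaded dump and archives need room of their own until you delete them.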
Although the process itself is not hard, I strongly recommend that you try Simple Wikipedia first. Simple Wikipedia has 184,000 pages and 90,000 images. The text version uses 200 MB and sets up in 5 minutes. With images, this expands to 2 GB and 30 minutes of downloading time. Simple Wikipedia is a reasonably accurate simulation of English Wikipedia -- just much smaller. It'll also give you a pretty good idea of what XOWA can do.
Part 1: Set up the wiki
The first part is to set up the wiki. You have two approaches for this part: import the wiki with XOWA or download a copy from the internet.
Option 1: Import the wiki with XOWA
- XOWA will download the database dump from the Wikimedia backup servers
- The database dump will be 10+ GB and take about 3 hours to download
- XOWA will take about 2.5 hours to build the wiki. The final wiki will use about 20 GB of disk space.
- Launch XOWA
- Use the menu bar and select Tools -> Import From List. Alternatively, you can enter home/wiki/Help:Import/List into the address bar
- Find en.wikipedia.org
- Click on the "download" link to the left.
That's it. The import process has now started. This part takes at least 5 hours so you may want to let it run for a while. When it's done, it will automatically load the Main Page.
Option 2: Download the wiki from archive.org
- The download will be approximately 20 GB.
- When the download is completed, extract the files to J:\gplx\xowa\wiki\en.wikipedia.org
- Download the file from here
- After the download completes, unzip the archive file in J:\gplx\xowa\. When you are done you should have a file like J:\gplx\xowa\wiki\en.wikipedia.org\en.wikipedia.org-core.xowa
- Launch XOWA
- Enter "w:" in the address bar. The Main_Page should load.
Part 2: Download the images
This part takes much longer to complete. It will require at least 70 GB of disk space and 24+ hours of download time. You'll be downloading compressed files from archive.org.
- Go to https://archive.org/details/Xowa_enwiki_latest
- Download each of the listed links marked
- Extract the files to J:\gplx\xowa\. When you are done, you will have files from J:\gplx\xowa\wiki\en.wikipedia.org\en.wikipedia.org-file-ns.000-db.001.xowa to J:\gplx\xowa\wiki\en.wikipedia.org\en.wikipedia.org-file-ns.000-db.023.xowa as well as several others.
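To confirm that the extraction produced the numbered image databases, you could list any that are missing. This is a sketch only: it assumes the files are numbered consecutively from 001 to 023 as described above, it does not check the "several others", and the function names are hypothetical.

```python
from pathlib import Path


def expected_image_dbs(first: int = 1, last: int = 23) -> list[str]:
    """Build the numbered image database names described in the text."""
    return [
        f"en.wikipedia.org-file-ns.000-db.{n:03d}.xowa"
        for n in range(first, last + 1)
    ]


def missing_image_dbs(wiki_dir: Path) -> list[str]:
    """Return the expected database names that are not present in wiki_dir."""
    return [
        name for name in expected_image_dbs()
        if not (wiki_dir / name).is_file()
    ]
```

For example, missing_image_dbs(Path(r"J:\gplx\xowa\wiki\en.wikipedia.org")) returns an empty list when all 23 numbered files are present.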
Updating the wiki
Wikipedia is constantly updating. New pages are added, and existing pages are changed to include different images. The above steps will give you a complete set of images for 2015-04-03. However, if you want to stay up to date with Wikipedia, then you may also want to download the monthly updates.
Monthly updates will be posted at the same URL (https://archive.org/details/Xowa_enwiki_latest). Each update will appear as a new link named after the wiki dump, for example 2015-05-02, and will contain the new images introduced in the Wikipedia dump for that month. Note that these updates should be downloaded and unzipped in order (i.e., first 2015-05-02, then 2015-06-02, etc.). Some files appear in multiple sets; the most recent copy of the file should always replace the earlier version.
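The ordering rule can be sketched in Python as follows. This is an illustration rather than anything XOWA provides: it assumes each update is named with a YYYY-MM-DD date as in the example, and the file names in the usage note are made up.

```python
from datetime import date


def in_apply_order(update_names: list[str]) -> list[str]:
    """Sort update names of the form YYYY-MM-DD, oldest first."""
    return sorted(update_names, key=date.fromisoformat)


def resolve_files(updates: dict[str, list[str]]) -> dict[str, str]:
    """Map each file name to the most recent update that contains it."""
    latest: dict[str, str] = {}
    for update in in_apply_order(list(updates)):
        for file_name in updates[update]:
            # A later update overwrites the entry left by an earlier one.
            latest[file_name] = update
    return latest
```

For instance, if a hypothetical file a.xowa appears in both 2015-05-02 and 2015-06-02, resolve_files maps it to 2015-06-02, which mirrors unzipping the sets in date order and letting the newer copy replace the older one.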
If you update your wiki, you do not have to update the images; the two are independent of each other. In other words, you can use the 2017-01-01 English Wikipedia XML dump with the 2015-04-03 English Wikipedia images. However, new images in the 2017-01-01 dump will not show up until you download the appropriate monthly updates.
Disk space usage
Some may wonder why XOWA needs so much disk space, especially when compared to other apps. The following is a brief list of reasons:
- XOWA is complete. It includes all articles across all namespaces, including the Wikipedia namespace, the Portal namespace, the Help namespace, and several others. It also includes redirect stubs. Other apps will only provide articles in the Main namespace.
- XOWA includes Categories as well. Other apps will skip Categories altogether.
- XOWA shows all content on the page. Other apps will omit sections, such as Table of Contents or Navigation boxes at the bottom of the page.
- XOWA includes all images for the Main namespace, the Portal namespace and the Wikipedia namespace. Other apps will only provide images for the Main namespace.
- XOWA provides an accurate sized thumbnail for an article. If an article shows an 800 pixel wide image, XOWA shows an 800 pixel wide image. Other apps will actually show a smaller 220 pixel wide image.
- XOWA includes the latest content. Other apps may be many months (if not years) behind.
- Note that when the import completes, it will move the 10 GB file to /xowa/wiki/#dump/done. This file can be deleted safely. XOWA doesn't delete the file itself, as some users may want to keep it for archival purposes, and redownloading 10 GB would be time-consuming.
- Note that these images are thumbnails, and are not the originals. They will show correctly in the context of the article, but if you want the original file, you will need to download the tarballs. See Help:Offline images