OurBigBook
This bot imports the Wikipedia article category tree into OurBigBook. Only titles are currently imported, not the actual article content.
This is just an exploratory step toward future exports or generative AI applications.
We don't yet have the automation setup we should, but the steps are:
Now let's look at the shape of the data. Total pages:
sqlite3 enwiki.sqlite 'select count(*) from page'
gives ~59M.
Total articles:
sqlite3 enwiki.sqlite 'select count(*) from page where page_namespace = 0'
gives ~17M.
Total non-redirect articles:
sqlite3 enwiki.sqlite 'select count(*) from page where page_namespace = 0 and page_is_redirect = 0'
gives ~6.7M.
Categories:
sqlite3 enwiki.sqlite 'select count(*) from page where page_namespace = 14'
gives ~2.3M.
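These one-liners can also be folded into a single per-namespace breakdown. A minimal sketch in Python, assuming the dump preserves the standard MediaWiki page table with its page_namespace column:

import sqlite3

# Per-namespace page counts in one pass. Namespace 0 is articles,
# 14 is categories; other namespaces (talk, user, ...) make up the
# rest of the ~59M rows.
con = sqlite3.connect('enwiki.sqlite')
for namespace, count in con.execute(
        'select page_namespace, count(*) from page '
        'group by page_namespace order by count(*) desc'):
    print(namespace, count)
con.close()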
Allowing depth 6 for all of STEM:
./sqlite_preorder.py -D3 -d6 -Obigb -m -N enwiki.sqlite Mathematics Physics Chemistry Biology Technology
leads to ~980k articles.
Depth 6 on Mathematics only:
./sqlite_preorder.py -D3 -d6 -Obigb -m -N enwiki.sqlite Mathematics
leads to ~150k articles. Some useless hogs and how they were reached (a sketch for measuring such subtrees follows this list):
  • Actuarial science via Applied Mathematics: 4k
  • Molecular biology via Applied geometry: 4k
  • Ship identification numbers via Numbers: 5k
  • Galaxies via Dynamical systems: 7k
  • Video game gameplay via Game design via Game theory: 17k
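One way to quantify such hogs is to count how many new pages each suspect subtree pulls in. This is only a rough sketch, not the actual sqlite_preorder.py logic, and it assumes the classic MediaWiki categorylinks table (cl_from = member page_id, cl_to = target category title, with underscores instead of spaces) made it into enwiki.sqlite:

import sqlite3

def subtree_size(con, root, max_depth):
    # Breadth-first walk over subcategories, deduplicating on
    # (namespace, title) because the category "tree" is really a
    # cyclic graph.
    seen = {(14, root)}
    frontier = [root]
    new_pages = 0
    for _ in range(max_depth):
        nxt = []
        for cat in frontier:
            rows = con.execute(
                'select page_title, page_namespace from page '
                'join categorylinks on cl_from = page_id '
                'where cl_to = ?', (cat,)).fetchall()
            for title, ns in rows:
                if (ns, title) in seen:
                    continue
                seen.add((ns, title))
                new_pages += 1
                if ns == 14:  # subcategory: descend on the next round
                    nxt.append(title)
        frontier = nxt
    return new_pages

con = sqlite3.connect('enwiki.sqlite')
for cat in ('Actuarial_science', 'Game_theory'):
    print(cat, subtree_size(con, cat, 6))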
Depth 5 on Mathematics + Physics:
./sqlite_preorder.py -D3 -d5 -Obigb -m -N enwiki.sqlite Mathematics Physics
leads to 104k articles.
Allowing for unlimited depth on Mathematics:
./sqlite_preorder.py -D3 -Obigb -m -N enwiki.sqlite Mathematics
seems to reach all ~9M articles + categories, or most of them. We gave up at around 8.6M, when things got really slow, possibly due to heavy duplicate removal. We didn't log it properly, but depths of 3k+ were seen... so not setting a depth is just pointless unless you want the entire Wiki.
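For the record, here is a hedged sketch of an unbounded preorder walk that also tracks the maximum depth reached; same schema assumptions as the sketch above, and only an approximation of what sqlite_preorder.py does:

import sqlite3

# Explicit stack instead of recursion: depths of 3k+ would blow the
# Python call stack. The seen set grows toward ~9M (namespace, title)
# pairs, which is presumably the duplicate-removal cost mentioned above.
con = sqlite3.connect('enwiki.sqlite')
stack = [('Mathematics', 0)]
seen = {(14, 'Mathematics')}
max_depth = 0
while stack:
    cat, depth = stack.pop()
    max_depth = max(max_depth, depth)
    rows = con.execute(
        'select page_title, page_namespace from page '
        'join categorylinks on cl_from = page_id '
        'where cl_to = ?', (cat,)).fetchall()
    for title, ns in rows:
        if (ns, title) in seen:
            continue
        seen.add((ns, title))
        if ns == 14:
            stack.append((title, depth + 1))
print(len(seen), 'nodes; max depth seen:', max_depth)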
