- OurBigBook.com user: ourbigbook.com/wikibot
- Static website render: wikibot.ourbigbook.com
- Static website source code: github.com/ourbigbook/wikibot
This bot imports the Wikipedia article category tree into OurBigBook. Only titles are currently imported, not the actual article content.
This is just an exploratory step towards future exports or generative AI.
We don't have as good an automation setup as we should, but the steps are:
- obtain enwiki.sqlite containing the tables page and categorylinks: stackoverflow.com/questions/17432254/wikipedia-category-hierarchy-from-dumps/77313490#77313490 (a quick sanity check of those tables is sketched just after this list)
- Run cirosantilli.com/_raw/wikipedia/sqlite_preorder.py, potentially with different parameters, as:
rm -rf out
./sqlite_preorder.py -D3 -d6 -Obigb -m -N enwiki.sqlite Mathematics Physics Chemistry Biology Technology
cd out
ls . | grep -E '\.bigb$' | xargs sed -i -r '${/^$/d}'
echo '{}' > ourbigbook.json
echo '*.tmp' > .gitignore
ourbigbook .
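Before running the script, it can help to sanity check that the category tree is actually queryable in the database. This is a minimal sketch, assuming the dump conversion preserved the standard MediaWiki schema for those tables (cl_from holds the page ID of the member page, cl_to the parent category title, cl_type distinguishes page/subcat/file); it should list a few subcategories of Category:Mathematics:
sqlite3 enwiki.sqlite "
  -- list 10 direct subcategories of Category:Mathematics
  select p.page_title
  from categorylinks
  join page as p on cl_from = p.page_id
  where cl_to = 'Mathematics' and cl_type = 'subcat'
  limit 10
"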
To publish as a Static website we do:
(
git init
git add .
export GIT_COMMITTER_EMAIL='bot@mail.com'
export GIT_COMMITTER_NAME='Mr. Bot'
export GIT_COMMITTER_DATE="2000-01-01T00:00:00+0000"
export GIT_AUTHOR_EMAIL="$GIT_COMMITTER_EMAIL"
export GIT_AUTHOR_NAME="$GIT_COMMITTER_NAME"
export GIT_AUTHOR_DATE="$GIT_COMMITTER_DATE"
git config --add user.email "$GIT_COMMITTER_EMAIL"
git config --add user.name "$GIT_COMMITTER_NAME"
git commit --author "${GIT_COMMITTER_NAME} <${GIT_COMMITTER_EMAIL}>" -m 'Autogenerated commit'
)
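Presumably the commit is then pushed to the static website source repository listed at the top; assuming a remote named origin pointing at github.com/ourbigbook/wikibot and a master branch (both hypothetical, not stated above), that could look like:
# hypothetical remote and branch names; the exact publishing setup may differ
git remote add origin git@github.com:ourbigbook/wikibot.git
git push -f origin master  # force push, since the repository is regenerated from scratch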
For OurBigBook Web, it is important to use the --web-nested-set-bulk option to speed things up:
ourbigbook --web --web-nested-set-bulk
The current limiting factor on the number of articles per user is the memory usage of the nested set generation. We've managed to review this and reduce it with attribute selection, but we have not yet been able to scale it indefinitely, e.g. we would not be able to handle 1M articles per user. The root problems are:
- lack of depth-first (preorder) ordering in recursive queries in SQLite due to its lack of an array type, as opposed to PostgreSQL: stackoverflow.com/questions/65247873/preorder-tree-traversal-using-recursive-ctes-in-sql/77276675#77276675 (see the sketch after this list)
- lack of proper streaming in Sequelize: stackoverflow.com/questions/28787889/how-can-i-set-up-sequelize-js-to-stream-data-instead-of-a-promise-callback
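For context, here is a minimal sketch of the PostgreSQL-style workaround: a recursive CTE that accumulates the path of IDs into an array and orders by it, which is exactly what SQLite cannot do for lack of an array type. The node(id, parent_id) table is a hypothetical stand-in, not the actual OurBigBook schema:
psql -c "
with recursive tree as (
  select id, parent_id, array[id] as path
  from node
  where parent_id is null
  union all
  select c.id, c.parent_id, t.path || c.id
  from node as c
  join tree as t on c.parent_id = t.id
)
-- lexicographic ordering of the ID paths yields a depth-first (preorder) listing
select id, path from tree order by path;
"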
Now let's look at the shape of the data. Total pages:
sqlite3 enwiki.sqlite 'select count(*) from page'
gives ~59M.
Total articles (namespace 0):
sqlite3 enwiki.sqlite 'select count(*) from page where page_namespace = 0'
gives ~17M.
Total non-redirect articles:
sqlite3 enwiki.sqlite 'select count(*) from page where page_namespace = 0 and page_is_redirect = 0'
gives ~6.7M.
Categories (namespace 14):
sqlite3 enwiki.sqlite 'select count(*) from page where page_namespace = 14'
gives ~2.3M.
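For a fuller breakdown of where those ~59M pages live, the same table can simply be grouped by namespace (in MediaWiki, 0 is the main/article namespace and 14 is categories):
sqlite3 enwiki.sqlite 'select page_namespace, count(*) from page group by page_namespace order by count(*) desc'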
Allowing for depth 6 of all of STEM:
./sqlite_preorder.py -D3 -d6 -Obigb -m -N enwiki.sqlite Mathematics Physics Chemistry Biology Technology
leads to ~980k articles.
Depth 6 on Mathematics only:
./sqlite_preorder.py -D3 -d6 -Obigb -m -N enwiki.sqlite Mathematics
leads to 150k articles. Some useless hogs and how they were reached (a rough way to size such subtrees is sketched after this list):
- Actuarial science via Applied mathematics: 4k
- Molecular biology via Applied geometry: 4k
- Ship identification numbers via Numbers: 5k
- Galaxies via Dynamical systems: 7k
- Video game gameplay via Game design via Game theory: 17k
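To get a rough feel for how big one of these subtrees is, a recursive CTE over categorylinks can count the articles under a category, here Game_theory, using the same assumed MediaWiki schema as the sanity check above. This ignores the depth caps and duplicate handling that sqlite_preorder.py applies, so the numbers won't match exactly, and it may be slow without an index on cl_to:
sqlite3 enwiki.sqlite "
with recursive subcats(cat) as (
  -- seed category, then walk subcat links downwards; union deduplicates and so terminates on cycles
  values ('Game_theory')
  union
  select p.page_title
  from categorylinks
  join page as p on cl_from = p.page_id
  join subcats on cl_to = subcats.cat
  where cl_type = 'subcat'
)
-- count distinct articles (namespace 0) directly contained in any category of the subtree
select count(distinct cl_from)
from categorylinks
join page as p on cl_from = p.page_id
join subcats on cl_to = subcats.cat
where cl_type = 'page' and p.page_namespace = 0;
"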
Depth 5 on Mathematics + Physics:
./sqlite_preorder.py -D3 -d5 -Obigb -m -N enwiki.sqlite Mathematics Physics
leads to 104k articles.
Allowing for unlimited depth on Mathematics:
./sqlite_preorder.py -D3 -Obigb -m -N enwiki.sqlite Mathematics
seems to reach all ~9M articles + categories, or most of them. We gave up at around 8.6M, when things got really, really slow, possibly due to heavy duplicate removal. We didn't log it properly, but depths of 3k+ were seen... so not setting a depth limit is just pointless unless you want the entire Wiki.