- OurBigBook.com user: ourbigbook.com/wikibot
- Static website render: wikibot.ourbigbook.com
- Static website source code: github.com/ourbigbook/wikibot
This bot imports the Wikipedia article category tree into OurBigBook. Only titles are currently imported, not the actual article content.
This is just an exploratory step towards future exports or generative AI.
We don't have as good an automation setup as we should, but the steps are:
- obtain enwiki.sqlite containing the tables page and categorylinks: stackoverflow.com/questions/17432254/wikipedia-category-hierarchy-from-dumps/77313490#77313490 (a quick sanity check of those tables is sketched just after this list)
- Run cirosantilli.com/_raw/wikipedia/sqlite_preorder.py, potentially with different parameters, as:
rm -rf out
./sqlite_preorder.py -D3 -d6 -Obigb -m -N enwiki.sqlite Mathematics Physics Chemistry Biology Technology
cd out
ls . | grep -E '\.bigb$' | xargs sed -i -r '${/^$/d}'
echo '{}' > ourbigbook.json
echo '*.tmp' > .gitignore
ourbigbook .
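Before running the script, it can help to sanity check that the category tree is actually queryable in the database. This is a minimal sketch, assuming the dump conversion preserved the standard MediaWiki schema for those tables (cl_from holds the page ID of the member page, cl_to the parent category title, cl_type distinguishes page/subcat/file); it should list a few subcategories of Category:Mathematics:
sqlite3 enwiki.sqlite "
  -- list 10 direct subcategories of Category:Mathematics
  select p.page_title
  from categorylinks
  join page as p on cl_from = p.page_id
  where cl_to = 'Mathematics' and cl_type = 'subcat'
  limit 10
"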
To publish as a Static website we do:
(
git init
git add .
export GIT_COMMITTER_EMAIL='bot@mail.com'
export GIT_COMMITTER_NAME='Mr. Bot'
export GIT_COMMITTER_DATE="2000-01-01T00:00:00+0000"
export GIT_AUTHOR_EMAIL="$GIT_COMMITTER_EMAIL"
export GIT_AUTHOR_NAME="$GIT_COMMITTER_NAME"
export GIT_AUTHOR_DATE="$GIT_COMMITTER_DATE"
git config --add user.email "$GIT_COMMITTER_EMAIL"
git config --add user.name "$GIT_COMMITTER_NAME"
git commit --author "${GIT_COMMITTER_NAME} <${GIT_COMMITTER_EMAIL}>" -m 'Autogenerated commit'
)
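Presumably the commit is then pushed to the static website source repository listed at the top; assuming a remote named origin pointing at github.com/ourbigbook/wikibot and a master branch (both hypothetical, not stated above), that could look like:
# hypothetical remote and branch names; the exact publishing setup may differ
git remote add origin git@github.com:ourbigbook/wikibot.git
git push -f origin master  # force push, since the repository is regenerated from scratch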
For OurBigBook Web, it is important to use the --web-nested-set-bulk option to speed things up:
ourbigbook --web --web-nested-set-bulk
The current limiting factor on the number of articles per user is the memory usage of the nested set generation. We've managed to review this and reduce it with attribute selection, but we have not yet been able to scale it indefinitely, e.g. we would not be able to handle 1M articles per user. The root problems are:
- lack of depth-first (preorder) ordering in recursive queries in SQLite due to its lack of an array type, as opposed to PostgreSQL: stackoverflow.com/questions/65247873/preorder-tree-traversal-using-recursive-ctes-in-sql/77276675#77276675 (see the sketch after this list)
- lack of proper streaming in Sequelize: stackoverflow.com/questions/28787889/how-can-i-set-up-sequelize-js-to-stream-data-instead-of-a-promise-callback
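For context, here is a minimal sketch of the PostgreSQL-style workaround: a recursive CTE that accumulates the path of IDs into an array and orders by it, which is exactly what SQLite cannot do for lack of an array type. The node(id, parent_id) table is a hypothetical stand-in, not the actual OurBigBook schema:
psql -c "
with recursive tree as (
  select id, parent_id, array[id] as path
  from node
  where parent_id is null
  union all
  select c.id, c.parent_id, t.path || c.id
  from node as c
  join tree as t on c.parent_id = t.id
)
-- lexicographic ordering of the ID paths yields a depth-first (preorder) listing
select id, path from tree order by path;
"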
Now let's look at the shape of the data. Total pages:
sqlite3 enwiki.sqlite 'select count(*) from page'
gives ~59M.
Total articles (namespace 0):
sqlite3 enwiki.sqlite 'select count(*) from page where page_namespace = 0'
gives ~17M.
Total non-redirect articles:
sqlite3 enwiki.sqlite 'select count(*) from page where page_namespace = 0 and page_is_redirect = 0'
gives ~6.7M.
Categories (namespace 14):
sqlite3 enwiki.sqlite 'select count(*) from page where page_namespace = 14'
gives ~2.3M.
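For a fuller breakdown of where those ~59M pages live, the same table can simply be grouped by namespace (in MediaWiki, 0 is the main/article namespace and 14 is categories):
sqlite3 enwiki.sqlite 'select page_namespace, count(*) from page group by page_namespace order by count(*) desc'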
Allowing for depth 6 of all of STEM:
./sqlite_preorder.py -D3 -d6 -Obigb -m -N enwiki.sqlite Mathematics Physics Chemistry Biology Technology
leads to ~980k articles.
Depth 6 on Mathematics only:
./sqlite_preorder.py -D3 -d6 -Obigb -m -N enwiki.sqlite Mathematics
leads to 150k articles. Some useless hogs and how they were reached (a rough way to size such subtrees is sketched after this list):
- Actuarial science via Applied mathematics: 4k
- Molecular biology via Applied geometry: 4k
- Ship identification numbers via Numbers: 5k
- Galaxies via Dynamical systems: 7k
- Video game gameplay via Game design via Game theory: 17k
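To get a rough feel for how big one of these subtrees is, a recursive CTE over categorylinks can count the articles under a category, here Game_theory, using the same assumed MediaWiki schema as the sanity check above. This ignores the depth caps and duplicate handling that sqlite_preorder.py applies, so the numbers won't match exactly, and it may be slow without an index on cl_to:
sqlite3 enwiki.sqlite "
with recursive subcats(cat) as (
  -- seed category, then walk subcat links downwards; union deduplicates and so terminates on cycles
  values ('Game_theory')
  union
  select p.page_title
  from categorylinks
  join page as p on cl_from = p.page_id
  join subcats on cl_to = subcats.cat
  where cl_type = 'subcat'
)
-- count distinct articles (namespace 0) directly contained in any category of the subtree
select count(distinct cl_from)
from categorylinks
join page as p on cl_from = p.page_id
join subcats on cl_to = subcats.cat
where cl_type = 'page' and p.page_namespace = 0;
"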
Depth 5 on Mathematics + Physics:
./sqlite_preorder.py -D3 -d5 -Obigb -m -N enwiki.sqlite Mathematics Physics
leads to 104k articles.
Allowing for unlimited depth on Mathematics:
./sqlite_preorder.py -D3 -Obigb -m -N enwiki.sqlite Mathematics
seems to reach all ~9M articles + categories, or most of them. We gave up at around 8.6M, when things got really, really slow, possibly due to heavy duplicate removal. We didn't log it properly, but depths of 3k+ were seen... so not setting a depth limit is just pointless unless you want the entire Wiki.