Mostly for fun we've populated the body of wikibot-generated sections on OurBigBook.com with some LLM-generated content.
At first we considered using Ollama with something like llama3.2, but because each prompt took 5 seconds on a laptop it would take about 5 days to do all 100k articles and so we decided to try existing API providers.
We finally ended up going with the cheapest model that seemed easy to use and chose OpenAI gpt-4o-mini. We managed to complete the generation for 100 output tokens per section with only 3 dollars. To help reduce costs we also:
- used batch jobs, which can take up to 24 hours to complete and sometimes just fail after 24 hours
- enabled data collection to get some extra credits.
We just took the Wikipedia titles and prompted directly:
What is TITLE?
It is a bit of shame that many of the replies end in crap like "Let me know if you need more help on this subject"-type output. We could have prevented those with a role=system prompt, but there seems to be no way to factor out a single system prompt for multiple queries, so the input token could would have increased.
Our pipeline is as follows. First clone the wikibot repo:
cd ..
git clone https://github.com/ourbigbook/wikibot
cd -
Then:The generated repository with bodies added should now be present under:
export OPENAI_API_KEY=...
./wikibot-static-llm-submit
# Wait 24 hours and pray.
# Check completion with:
./wikibot-static-llm-list-batches
./wikibot-static-jsonl-to-sqlite
./wikibot-static-add-body
_out/wikibot-llm/repo/
A more elegant option would have been to have used Wikipedia article extracts for the job instead, but unfortunately there doesn't seem to be an immediate way to extract them from the database dumps, there is only an API endpoint that gets generated on the fly:
Another really cool thing would have been to add an image to the wikibot article whenever the main article has one. This would likely be feasible via the imagelinks table: www.mediawiki.org/wiki/Manual:Imagelinks_table but of course, more work :-)
Announced at: LLM-generated wikibot abstracts.