According to Paul Masurel's blog post about Tantivy's indexing (which I just submitted as a QuickPeep seed :^)), it should be possible to make the index itself smaller than the document, if term positions are disabled.
...but open! QuickPeep is open-source: quickpeep. My collection of seeds and weeds is open data: quickpeep_seeds.
If you'd like to give it a try, you can visit https://quickpeep.net, but don't get your hopes up :-)!
/, ignoring the protocol scheme (https://) and ignoring a www. subdomain if present.
#blah) should be stripped when raking.
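To make those rules a bit more concrete, here is a minimal Python sketch of how a URL could be reduced to a comparison key. It is not QuickPeep's actual code: the function name `normalise_url` is made up for illustration, and folding the scheme, `www.` and fragment rules into one helper is my own assumption about how they fit together.

```python
from urllib.parse import urlsplit


def normalise_url(url: str) -> str:
    """Hypothetical helper: reduce a URL to a key for comparison.

    Assumptions (not taken from QuickPeep itself):
    - the scheme (http:// or https://) is ignored,
    - a leading "www." on the host is ignored,
    - the fragment (#blah) is dropped,
    - path and query string are kept exactly as written.
    """
    parts = urlsplit(url)

    # .hostname is already lower-cased and excludes any port or credentials.
    host = parts.hostname or ""
    if host.startswith("www."):
        host = host[len("www."):]

    key = host + parts.path
    if parts.query:
        key += "?" + parts.query
    # parts.fragment is deliberately never appended.
    return key


print(normalise_url("https://www.example.org/blog/#blah"))
print(normalise_url("http://example.org/blog/"))
```

Both calls give the same key, so the two URLs would be treated as the same page under these assumptions.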
When a lot of sites have been crawled, it'd be fun to load the 'references' rake packs into some kind of graph database and be able to answer queries like 'which pages refer to my page'.
Reverse references like this could be quite interesting for finding extensions of, or different points of view on, an article you are already aware of.
(This would be gamed for SEO purposes in a heartbeat on a conventional search engine!)
Because these rake packs will be available for download, it should be quite possible for someone to do this if they like.
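As a rough idea of what that could look like without a full graph database, here is a small Python sketch that builds a reverse index from reference pairs. The input format is an assumption: I am pretending the 'references' rake packs have already been decoded into plain `(source_url, target_url)` tuples, which is not necessarily how they are stored, and `build_backlinks` is a hypothetical name.

```python
from collections import defaultdict


def build_backlinks(references):
    """Map each target URL to the set of pages that refer to it.

    `references` is assumed to be an iterable of (source_url, target_url)
    pairs; the real rake pack format may well differ.
    """
    backlinks = defaultdict(set)
    for source, target in references:
        backlinks[target].add(source)
    return backlinks


# Made-up example data, purely for illustration.
refs = [
    ("https://a.example/post", "https://me.example/article"),
    ("https://b.example/reply", "https://me.example/article"),
    ("https://a.example/post", "https://c.example/other"),
]

backlinks = build_backlinks(refs)

# 'Which pages refer to my page?'
for page in sorted(backlinks["https://me.example/article"]):
    print(page)
```

An in-memory dictionary like this would only work for a modest number of pages; anything bigger is where a proper graph database (or at least an on-disk key-value store) would start to earn its keep.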
QuickPeep is built on top of many high-quality, open-source libraries.
Many thanks to all their authors for putting their work out into the open for fools like me to enjoy!