Midnight Pub

WIP search engine

~moonsheep

It's been a while since I posted here! How is everyone doing? Lately I've been working on a gemini search engine. You can check out my current progress at:

https://repo.or.cz/sheepsearch.git

Might look to host it at some point.

I'm well aware that the geminispace doesn't need more search engines; this is just a fun side project. I'd like to hear what you people think.


tetris

That looks really promising. How do you compute the backlinkscore for your pages? I briefly looked at the tree, but was on mobile.


moonsheep

I simply perform a text search (using Postgres' text search functions) on the backlinks, rank them, and sum their scores. This is meant as a very temporary ranking method before I implement something like an actual PageRank.
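
To make the "rank the backlinks, then sum their scores" idea concrete, here is a rough sketch in plain Python. It is not the code from the repository: `text_score` is a made-up word-match score standing in for Postgres' text search ranking, and the `(target_url, link_text)` shape of the backlink data is an assumption for illustration.

```python
from collections import defaultdict


def text_score(query, text):
    """Toy relevance score (hypothetical stand-in for a real text-search rank):
    the fraction of words in `text` that match a query term."""
    terms = set(query.lower().split())
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(1 for word in words if word in terms) / len(words)


def backlink_scores(query, backlinks):
    """Sum the per-backlink scores for each target page.

    `backlinks` is an iterable of (target_url, link_text) pairs; this layout
    is assumed for the sketch, not taken from the actual schema.
    """
    scores = defaultdict(float)
    for target, link_text in backlinks:
        scores[target] += text_score(query, link_text)
    return dict(scores)


backlinks = [
    ("gemini://a.example/", "cat pictures"),
    ("gemini://a.example/", "cat"),
    ("gemini://b.example/", "dog blog"),
]
scores = backlink_scores("cat", backlinks)
```

With this data, `scores["gemini://a.example/"]` is 1.5 and `scores["gemini://b.example/"]` is 0.0. Every extra matching backlink adds to the total regardless of who wrote it, which is why a sum like this is easy to inflate.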


tetris

I say leave pagerank out of it, otherwise desperate people will link across these small communities like crazy to inflate their discoverability.

I think the most fair[1] thing you could do is a random ranking, and print the seed at the top in case anyone wants to reproduce their results.

1: https://en.wikipedia.org/wiki/Random_ballot
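
A seeded random ranking of this kind could look roughly like the sketch below. The function name `random_ranking` and the 32-bit seed range are assumptions for illustration, not anything from an existing engine.

```python
import random


def random_ranking(results, seed=None):
    """Shuffle `results` with a known seed and return (seed, shuffled_list).

    If no seed is given, one is drawn at random so it can be shown to the
    user, who can pass it back in to reproduce the same ordering.
    """
    if seed is None:
        seed = random.randrange(2**32)
    rng = random.Random(seed)  # private generator, leaves global state alone
    shuffled = list(results)
    rng.shuffle(shuffled)
    return seed, shuffled


pages = ["gemini://a.example/", "gemini://b.example/", "gemini://c.example/"]
seed, ranking = random_ranking(pages)
_, replay = random_ranking(pages, seed=seed)
```

Here `replay` equals `ranking`, as long as the same result set is passed in the same order on the same Python version. Printing `seed` at the top of the results page is what makes the ordering reproducible.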


moonsheep

Well to be fair PageRank was specifically designed to prevent people from doing that. Ranking algorithms before that (like the one I'm currently using) simply looked at backlinks, making it very easy to boost your capsule by writing a page chock-full of links pointing to it.


tetris

Wait, was it? I was under the impression that PageRank had no defense at all against link farming since it derives the weighting/eigenvalues for *every* page by the contribution of others in the previous iteration. Discriminating against inflated contributions came later I think, and isn't published by them anywhere.


moonsheep

Well there were certainly improved versions later, but PageRank was one of the first attempts to counteract this issue. The key point is that not only is the backlink count used for ranking, but the quality of each of those backlinks is taken into consideration--that is, getting a single good-quality backlink is often much better than many poor backlinks (rendering at least small-scale link farming methods ineffective).

http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf
Web pages vary greatly in terms of the number of backlinks they have. For example, the Netscape home page has 62,804 backlinks in our current database compared to most pages which have just a few backlinks. Generally, highly linked pages are more important than pages with few links. Simple citation counting has been used to speculate on the future winners of the Nobel Prize [San95]. PageRank provides a more sophisticated method for doing citation counting. The reason that PageRank is interesting is that there are many cases where simple citation counting does not correspond to our common sense notion of importance. For example, if a web page has a link to the Yahoo home page, it may be just one link but it is a very important one. This page should be ranked higher than many pages with more links but from obscure places. PageRank is an attempt to see how good an approximation to importance can be obtained just from the link structure.
These types of personalized PageRanks are virtually immune to manipulation by commercial interests. For a page to get a high PageRank, it must convince an important page, or a lot of non-important pages to link to it. At worst, you can have manipulation in the form of buying advertisements (links) on important sites. But, this seems well under control since it costs money. This immunity to manipulation is an extremely important property. This kind of commercial manipulation is causing search engines a great deal of trouble, and making features that would be great to have very difficult to implement.
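
To illustrate the "one good backlink beats several obscure ones" point, here is a small power-iteration sketch of a simplified, textbook-style PageRank. The damping value of 0.85, the even redistribution of rank from pages without outgoing links, and the tiny link graph are all assumptions for illustration; this is neither the paper's exact formulation nor anything from sheepsearch.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    `links` maps each page to the list of pages it links to. Pages that only
    appear as link targets are included automatically.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with the same small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its current rank among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # No outgoing links: spread this page's rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical graph: "good" has a single backlink from a well-linked hub,
# while "farmed" has three backlinks from pages nobody links to.
links = {
    "hub": ["good"],
    "fan1": ["hub"], "fan2": ["hub"], "fan3": ["hub"],
    "fan4": ["hub"], "fan5": ["hub"],
    "farm1": ["farmed"], "farm2": ["farmed"], "farm3": ["farmed"],
}
rank = pagerank(links)
```

In this toy graph `rank["good"]` comes out higher than `rank["farmed"]` even though "farmed" has three times as many backlinks, because the farm pages have almost no rank of their own to pass on. A plain backlink count would order them the other way round.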

I'm not sure how much that applies to gemini, since its smaller size may make exploitation easier, but on the other hand it also makes manual blacklisting significantly more manageable.


tetris

I never thanked you for this added context. Cheers both for the info and the correction -- and I hope your engine takes off!


tatterdemalion

I think fun side projects are the most important thing geminispace needs, whether they're search engines or not.
