It turns out, after all, that Google Reader was shut down only for news aggregation to become Google Search’s primary function.
I have been having substantial trouble these days finding meaningful and relevant information, especially from smaller blogs and old university websites. Instead, Google tends to show only the top results from what seem to be the 1,000 or so most prominent websites, even if the results are not relevant.
I looked up the phrase “moral bankruptcy.” Google boasted millions of results, but it presented only about 5 pages (232 results) to me, most of which, interestingly enough, were strongly worded articles from The New Yorker, the New York Post, and The Atlantic relating to current events. The algorithm is working to some degree, because one of the results is an old-looking university page and another is a PDF, but where are the rest of the results? What if I want to search deeper?
What if I want to show only results from small blogs (without naming any particular domain names)? What if I want to show results from websites that don’t exist anymore, such as GeoCities? What if I also want to connect to a well-known library database, such as WorldCat? What if I want to search and visualize the plethora of social media results as well?
The point is that on an ever more complicated Internet, Google Search is an insultingly poor tool for finding the world’s information. Searching the Internet deserves a better experience.
It may well be that over 50% of the pages the Internet has ever hosted have been deleted, though an exact number would require further research. Google Search does not reflect this fact. Instead, it suffers from an acute case of survivorship bias: it ranks the longest-living, most content-rich websites at the top, and these are usually the largest, most financially secure websites.
Then there’s also geospatial data: businesses, historic landmarks, municipal zones, festivals, traffic, relative popularity/density (in other words, “where is everyone at/what is everyone doing right now”), geocaches, and other importable GIS data that could probably be aggregated and organized.
My conclusion is that Google gave up on Search a very long time ago; it’s now just a facade for its massive advertising and data-mining operations.
If someone wants to make a better search engine for the modern Internet, the market is open for that. Many scholars and researchers are thirsty for an elegant tool that allows them to deep-search all archived and printed content that has ever existed, not just what is recent, trendy, or popular.
There are some problems to be overcome, of course:
- Physical (library) media cannot be ranked the same way as online media. The PageRank algorithm, Google’s winning formula for a meaningful search engine, depends on hyperlinks between pages to judge how important each page is, and books on a shelf have no hyperlinks. Instead, physical media has to be ranked based on popularity (how many patrons have checked it out, how many clicks its catalog entry has received) or by having a natural language processor that is robust enough to understand cross-references and citations to other works and people, even outside the context of scholarly articles. Fortunately, advanced natural language processing is coming within reach.
- It’s difficult to come up with the resources to index today’s massive Internet. Twenty years ago, it could be done with a rack-mounted server in the basement, crawling to one’s heart’s content; these days, the activity that web crawling creates looks an awful lot like malicious probing, and a comprehensive crawl could take many months to complete. It would require the cooperation of other organizations and existing search engines to seed content while an original index is constructed.
- Most non-hypertext content is not conducive to crawling and indexing. While there are APIs for social media and library databases, they are not open to outright data dumps. Crawling is once again frowned upon as abusive activity – you’re stealing the social media website’s business, which is data. Many of these platforms require the user to be on the platform to browse all of its content.
Now, if you can give it an academic or humanitarian twist (“oh, I am trying to make it easier for scholars to research social media trends and information in a centralized page”), perhaps you can woo a large organization into providing a data set.
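To make the first problem above a little more concrete, here is a minimal sketch of the PageRank idea applied to citations instead of hyperlinks. It assumes that something (a natural language processor, or a librarian) has already extracted "work X cites work Y" pairs; the titles below are made up for illustration, and the 0.85 damping factor is just the conventional default, not anything Google has confirmed it uses today.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {work: [works it cites or links to]}."""
    # Collect every work that appears as a source or as a target.
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    n = len(nodes)

    # Start with rank spread evenly across all works.
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Every work gets a small baseline share (the "random jump").
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Pass this work's rank along to the works it cites.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A work that cites nothing spreads its rank over everything.
                share = damping * rank[node] / n
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical citation data: two books and a blog post all cite book_c.
citations = {
    "book_a": ["book_c"],
    "book_b": ["book_c"],
    "book_c": [],
    "blog_post": ["book_a", "book_c"],
}

ranks = pagerank(citations)
for title, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{title}: {score:.3f}")
```

The most-cited work floats to the top, which is the same effect hyperlinks produce on the web. The hard part is not this loop; it is producing a trustworthy `citations` table from scanned books and catalog records in the first place.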