Do not crawl list?
The web is becoming a messier place by the day, with lots of personal information exposed to everyone. We have seen enough evidence lately of such exposure causing trouble. Search engines capture and store most of this information, and once captured, it is effectively available forever. When it comes to filtering captured content, the problem is that only the search engine providers decide what can be filtered. They do not make these filters freely available so that people can set their own and block their personal information from showing up in other users’ search results.
I hope to see some standard in place that’ll allow users to make sure that certain personal information is not stored in, or available through, search engines. This could be a ‘Do not crawl list’ (similar to the ‘Do not call list’, where the information owner has control).
As an end user, I should be able to make sure that my phone number, email address, etc. are never displayed in any search results, even if they appear on some web pages. This is similar to the do not call list: we don’t get calls (or at least are not supposed to) once we put our phone numbers on it, even when those numbers are listed in the telephone directory, right?
Instead of users visiting each search engine to submit what should be blocked, all search engines should take that information from a central ‘Do not crawl list’ DB.
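To make the idea a bit more concrete, here is a rough sketch of how a search engine might check page text against such a list before storing or displaying it. Everything in it is an assumption for illustration: the in-memory `DO_NOT_CRAWL` set stands in for the central DB, and the `redact` function, the simple regular expressions, and the `[removed]` placeholder are all made up here, not part of any real search engine.

```python
import re

# Hypothetical stand-in for the central 'Do not crawl list' DB:
# normalized phone numbers (digits only) and lowercase email addresses.
DO_NOT_CRAWL = {"5551234567", "jane@example.com"}

# Deliberately simple patterns; real phone/email detection is much harder.
PHONE_RE = re.compile(r"\+?\d[\d\-\s().]{6,}\d")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def normalize_phone(raw):
    """Keep only the digits so '555-123-4567' and '555 123 4567' compare equal."""
    return re.sub(r"\D", "", raw)


def redact(text, registry=DO_NOT_CRAWL):
    """Replace any phone number or email found in the registry with a placeholder."""

    def email_sub(match):
        return "[removed]" if match.group().lower() in registry else match.group()

    def phone_sub(match):
        return "[removed]" if normalize_phone(match.group()) in registry else match.group()

    # Emails are handled first so digits inside an address are not mistaken for a phone.
    text = EMAIL_RE.sub(email_sub, text)
    return PHONE_RE.sub(phone_sub, text)


if __name__ == "__main__":
    print(redact("Call Jane at 555-123-4567 or jane@example.com"))
    print(redact("Office: 555-000-1111"))  # not on the list, left untouched
```

The point of the sketch is only that the check works on the content itself, wherever it appears, and that every search engine could run the same check against the same list.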
Search engines don’t own any information. They just use information that belongs to us. So it should be us who decide which information is filtered or displayed in search engines.
Raju
January 23rd, 2007 at 6:08 am
Why maintain a central list when a robots.txt file maintained on each website can tell search engines what to crawl and what not to?
January 23rd, 2007 at 6:08 am
Badri:
robots.txt specifies which pages not to crawl. I don’t think it defines which content not to crawl. Most of a page’s content might be fine to crawl, but there could be some sensitive information on it that the user may not want search engines to make easily accessible.
Also, robots.txt has to be defined for each website. Let’s say I post your phone number on a discussion forum on some random website; you won’t have any way to change that website’s robots.txt. While you can ask the webmaster to remove it, the data stays in the search engines’ DBs forever, and you’ll keep getting spam calls. With the approach I mentioned, you’d be able to go to a single central website and say ‘Do not let any search engines store my phone number’.
The concept is similar to the Do Not Call Registry (https://www.donotcall.gov/), but applied to search engines.
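To show what I mean about robots.txt working at the page level, here is a small sketch using Python's standard `urllib.robotparser` module. The rules and the `forum.example.com` URLs are made up for illustration, and `allowed` is just a helper name I picked.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt for an imaginary forum: it can only name paths.
RULES = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def allowed(url):
    """Return True if a generic crawler may fetch this URL under RULES."""
    return parser.can_fetch("*", url)


if __name__ == "__main__":
    print(allowed("https://forum.example.com/private/profile"))
    print(allowed("https://forum.example.com/threads/42"))
```

The answer depends only on the URL path. There is no way in this format to say "crawl this thread, but skip the phone number in the third post", and only the site owner can edit the rules.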
Raju