Advanced Search Features

Be sure to review the tutorial for searching with blekko before you read this page.

Creating your own slashtags

Logged-in users can create their own topical slashtags by clicking “create a slashtag” on any results page. At creation time, you can choose whether the slashtag will be public or private. Other users can use your public slashtags under the name /yourusername/yourtagname. For example, /greg/mc is a slashtag that I (user “greg”) created to search the websites of the members of the San Francisco-based Media Consortium. A shortcut to view or edit this slashtag is to type /view /greg/mc into the search box. You can invite other users to help edit your slashtags. Topical slashtags can be included within each other; you can include blekko slashtags in your personal slashtags if you wish to extend a blekko slashtag with additional websites.

The topical slashtags such as /sports that we’ve been talking about previously in this tutorial series are actually slashtags created by the user named blekko. The full name of /sports is /blekko/sports. Whenever you use the shortcut /sports, we will first look for /yourusername/sports, and if it does not exist, /blekko/sports.
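The lookup order described above can be sketched in a few lines of Python. This is only an illustration of the rule, not blekko's actual code: the function name resolve_slashtag and the set of existing tags are hypothetical.

```python
def resolve_slashtag(shortcut, username, existing_tags):
    """Return the full slashtag name a shortcut resolves to, or None.

    existing_tags is a hypothetical set of full names such as "/blekko/sports".
    """
    name = shortcut.strip("/")
    if "/" in name:
        # Already a full name such as /greg/mc: no lookup order is needed.
        full = "/" + name
        return full if full in existing_tags else None
    # Short name: the user's own slashtag wins, then the one owned by blekko.
    for owner in (username, "blekko"):
        candidate = f"/{owner}/{name}"
        if candidate in existing_tags:
            return candidate
    return None


tags = {"/blekko/sports", "/greg/sports", "/greg/mc"}
print(resolve_slashtag("/sports", "greg", tags))   # /greg/sports
print(resolve_slashtag("/sports", "alice", tags))  # /blekko/sports
```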

The individual elements in a slashtag can be narrower than an entire website. The possibilities are:

  • an entire website: quora.com
  • a prefixed subset of a website: espn.go.com/nfl
  • an individual URL: foo.com/bar.html
  • a wildcarded subset of a website: stackoverflow.com/*javascript

Subsets of a website are useful on websites with a hierarchy of content; espn.go.com has separate prefixes for football (/nfl), baseball (/mlb), and basketball (/nba). Wildcards are useful for websites where the URL contains the topic of the webpage, so that a pattern in the URL picks out the pages you’re looking for. These four possibilities can’t be successfully used with every topic and website, but they work for most.
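To make the four kinds of entries concrete, here is a rough Python sketch of how a URL could be tested against a single entry. The matching rules are our assumptions for illustration (for instance, that a website entry also covers its subdomains and that a wildcard entry may be followed by anything); the functions normalize and entry_matches are hypothetical and not part of blekko.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit


def normalize(url):
    """Drop the scheme, query string, and a leading 'www.' from a URL."""
    parts = urlsplit(url if "//" in url else "//" + url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path


def entry_matches(entry, url):
    """Return True if url is covered by one slashtag entry (assumed semantics)."""
    target = normalize(url)
    if "*" in entry:
        # Wildcarded subset: '*' stands for any run of characters.
        return fnmatchcase(target, entry + "*")
    if "/" not in entry:
        # Entire website: the host itself or any of its subdomains.
        host = target.split("/", 1)[0]
        return host == entry or host.endswith("." + entry)
    # Individual URL (exact match) or prefixed subset (everything below it).
    return target == entry or target.startswith(entry.rstrip("/") + "/")


print(entry_matches("espn.go.com/nfl", "http://espn.go.com/nfl/scores"))  # True
print(entry_matches("espn.go.com/nfl", "http://espn.go.com/mlb/scores"))  # False
```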

Individual slashtags are limited to 25,000 entries; the maximum count for included slashtags is 125,000.

Sort order and date ranges

In addition to the usual relevance-sorted results, blekko allows you to sort by date by adding /date. Date-sorted results only include webpages which appear to be explicitly dated, such as newspaper articles, blog postings, and press releases. Sorting by date may hurt relevance. For example, webspam can easily appear in obama /date, because “obama” is a fairly common word in spam blogs (splogs). Using a topical slashtag can help improve relevance: try obama /topnews /date or obama /politics /date to avoid spam and improve quality.

Another way to adjust relevance is to replace /date with /fastdate or /date /more, or to add quotes around phrases. /fastdate increases relevance by searching only webpage titles and returns only very recent results; it is useful when you have too many low-quality results in a date-sorted query. /date /more returns more results when there are many, and is useful when you wish to see every mention of a search term. Plain /date will behave like either /fastdate or /date /more, depending on how recently popular the search terms are.

Compare:

/fastdate and /date /more are more interesting in searches that always have a lot of recent content: try

If you’d like dates limited to a date range, but still want a relevance sort, use /daterange like this:

Date ranges can also be relative, such as

Using “last week” is especially useful if you’d like to monitor search terms using an RSS feed.
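If you generate these searches from a script, the date-related variants are just the search terms followed by one or more slashtags. The helper below is a small sketch; build_query is a hypothetical name, and it only produces the text you would type into the search box.

```python
def build_query(terms, *slashtags, phrase=False):
    """Join search terms and slashtags into the text typed in the search box."""
    text = f'"{terms}"' if phrase else terms
    tags = " ".join("/" + tag.lstrip("/") for tag in slashtags)
    return f"{text} {tags}".strip()


print(build_query("obama", "date"))                   # obama /date
print(build_query("obama", "topnews", "date"))        # obama /topnews /date
print(build_query("obama", "fastdate"))               # obama /fastdate
print(build_query("obama", "date", "more"))           # obama /date /more
print(build_query("obama", 'daterange="last week"'))  # obama /daterange="last week"
```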

RSS feeds and email alerts

Adding /rss to any search returns the results as an RSS feed, which can be read with your favorite RSS reader, such as NewsBlur.com or Google Reader. RSS feeds can also be turned into emails by services such as https://www.feedmyinbox.com/. An RSS feed turned into email is roughly equivalent to a Google Alert.

/rss applied to a non-date search is fairly boring; new items will appear only if something new appears in the top 20 results for the search. Adding /date makes sure that new dated material appears, but might bring in too much webspam. Using /daterange="last week" instead of /date helps exclude spam from dated material. Note that material without a date will never be returned by /date or /daterange. If you would like to see everything new about a topic, we recommend a combination of two RSS feeds: one non-date with /ps=100, and one with /date or /daterange.

The number of results examined in RSS feeds can be increased by adding /ps=100 to the search. You will find that the optimal number of results depends upon the amount of spam present in the search.
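If your reader cannot combine feeds for you, the two-feed approach can be approximated with a short script. The sketch below assumes the feeds are ordinary RSS 2.0 documents with item, title, and link elements, and it uses made-up sample feeds instead of fetching anything; feed_items and merge_feeds are hypothetical helper names.

```python
import xml.etree.ElementTree as ET


def feed_items(rss_text):
    """Return (link, title) pairs for every <item> in an RSS document."""
    root = ET.fromstring(rss_text)
    return [
        (item.findtext("link", default="").strip(),
         item.findtext("title", default="").strip())
        for item in root.iter("item")
    ]


def merge_feeds(*feeds):
    """Combine several feeds, keeping the first occurrence of each link."""
    seen = set()
    merged = []
    for rss_text in feeds:
        for link, title in feed_items(rss_text):
            if link and link not in seen:
                seen.add(link)
                merged.append((link, title))
    return merged


# Made-up sample data standing in for a relevance-sorted and a dated feed.
RELEVANCE_FEED = """<rss version="2.0"><channel><title>sample</title>
<item><title>A</title><link>http://example.com/a</link></item>
<item><title>B</title><link>http://example.com/b</link></item>
</channel></rss>"""

DATED_FEED = """<rss version="2.0"><channel><title>sample</title>
<item><title>B</title><link>http://example.com/b</link></item>
<item><title>C</title><link>http://example.com/c</link></item>
</channel></rss>"""

for link, title in merge_feeds(RELEVANCE_FEED, DATED_FEED):
    print(title, link)
```

Deduplicating by link keeps an item that shows up in both feeds from being reported twice.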

Troubleshooting

Not getting what you want? Try adding quotes around words to make them exact, or add /web to turn off auto-boost. Try adding /noblend, which will reduce the number of results, but may make exclusions and quoted words/phrases much more exact.

You can always contact us at support@blekko.com for help. Contacting us not only gets you an answer for your question, but it also helps us improve blekko’s behavior and our documentation.

Advanced Search Tips

blekko does not currently support Boolean searches, which use AND, OR, NOT, and parentheses. AND is implicit in all searches, and NOT is expressed using a minus sign: tiger woods -affair. There is no way to express an OR with blekko; use multiple searches instead. We also don’t support using * as a wildcard in the search box.
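Running multiple searches in place of an OR is easy to automate: run one search per alternative and merge the results. In this sketch the run_search argument is a hypothetical stand-in for however you fetch results (by hand, from an RSS feed, or through the API), and the fake index holds made-up data.

```python
def search_any(alternatives, run_search, common=""):
    """Emulate an OR by running one search per alternative and merging results.

    run_search is a caller-supplied function that takes a query string and
    returns a list of result URLs; it is an assumption of this sketch.
    """
    seen = set()
    merged = []
    for alternative in alternatives:
        query = f"{alternative} {common}".strip()
        for url in run_search(query):
            if url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


# Made-up results standing in for a real search backend.
FAKE_INDEX = {
    "tiger woods -affair": ["http://example.com/golf-news", "http://example.com/masters"],
    "phil mickelson -affair": ["http://example.com/masters", "http://example.com/open"],
}

results = search_any(
    ["tiger woods", "phil mickelson"],
    lambda query: FAKE_INDEX.get(query, []),
    common="-affair",
)
print(results)
```

Note that the merged list is ordered by search, not by overall relevance; the searches are independent, so there is no combined ranking to preserve.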

The number of results claimed for a query is just an estimate and shouldn’t be taken too seriously. This is true of all search engines; for an overview, please see the blog posting Why Google Can’t Count Results Properly by Danny Sullivan of Search Engine Land.

Due to the way search queries are evaluated, it’s impossible to get more than 1,000 results for a query out of the search engine, even if there are supposedly millions of results. Often the maximum is only 500-600 results.

The next section, Webgrep, shows how you can get accurate results for searches over our entire database.

Webgrep

Search engines are fast because we’ve built inverted indices for all the words we have found in webpages. We then look things up in these indices in ways that prevent us from returning all of the results, especially if there are a large number of results.

It’s possible for us to instead search by looking at every webpage in our crawl individually. This takes about 10 hours for our current crawl, which is 1 petabyte of data (1 million billion characters) in 4 billion webpages. This kind of computing is called MapReduce. There could be 10s to 100s of millions of results for this kind of query, and we can return them all.
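To put those numbers in perspective, a little arithmetic gives the average page size and the rate at which the crawl has to be read. The only assumption here is that one character takes one byte.

```python
CRAWL_BYTES = 10**15        # 1 petabyte, i.e. 1 million billion characters
PAGES = 4 * 10**9           # 4 billion webpages
SCAN_SECONDS = 10 * 3600    # about 10 hours

avg_page_bytes = CRAWL_BYTES / PAGES
bytes_per_second = CRAWL_BYTES / SCAN_SECONDS
pages_per_second = PAGES / SCAN_SECONDS

print(f"average page size: {avg_page_bytes / 1e3:.0f} KB")    # 250 KB
print(f"read rate: {bytes_per_second / 1e9:.1f} GB/s")        # 27.8 GB/s
print(f"pages scanned per second: {pages_per_second:,.0f}")   # 111,111
```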

We can also do special searches this way, such as searches involving punctuation, and searches within the HTML text of webpages. This last kind of search is handy for asking questions like: What RDF microformats are in common use? What jQuery libraries are popular?
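A question like “What jQuery libraries are popular?” fits the map and reduce pattern naturally: the map step looks at one webpage and emits the script files it loads, and the reduce step adds up the counts. The toy version below runs in a single process over three made-up pages; it illustrates the pattern only and is not how webgrep itself is implemented.

```python
import re
from collections import Counter

# Matches the src attribute of a <script> tag in raw HTML.
SCRIPT_SRC = re.compile(r'<script[^>]+src=["\']([^"\']+)["\']', re.IGNORECASE)


def map_page(url, html):
    """Map step: emit (script file name, 1) once per jQuery-related script on a page."""
    names = {src.rsplit("/", 1)[-1].lower() for src in SCRIPT_SRC.findall(html)}
    return [(name, 1) for name in names if "jquery" in name]


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each script file name."""
    counts = Counter()
    for name, n in pairs:
        counts[name] += n
    return counts


# Made-up pages standing in for the crawl.
pages = {
    "http://example.com/1": '<script src="/js/jquery.min.js"></script>'
                            '<script src="/js/jquery.cookie.js"></script>',
    "http://example.com/2": '<script type="text/javascript" '
                            'src="http://cdn.example.com/jquery.min.js"></script>',
    "http://example.com/3": "<p>no scripts here</p>",
}

pairs = [pair for url, html in pages.items() for pair in map_page(url, html)]
print(reduce_counts(pairs).most_common())
# [('jquery.min.js', 2), ('jquery.cookie.js', 1)]
```

Because the map step only ever looks at one page, it can be run on many machines at once, which is what makes a scan of the whole crawl practical.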

For more details, see https://blekko.com/webgrep.

API

If you would like programmatic access to blekko search results, we offer an API. Results can be fetched either as XML or JSON. Please contact apiauth@blekko.com for more details. The API can also be used to manipulate slashtags from a program.

About greg

I'm the CTO at blekko