This Blog is intended to give budding MapReduce developers a start off in developing hadoop based applications. It involves some development tips and tricks on hadoop MapReduce programming, tools that use map reduce under the hood and some practical applications of hadoop using these tools. Most of the code samples provided here is tested on hadoop environment but still do post me if you find any not working.
Sunday, June 12, 2011
How does your search provider get you some common search terms as you type? OR How do the auto suggest feature work well with search providers?
It isn’t a core hadoop technological area, but still being map reduce programmers dealing on large data volumes we should be aware on search engines as these scenarios involve very large volume of data, to be more specific the whole of data in web. It’d seem kind of easy but if you think more deeply over, it is not as simple as you think. If you have already thought about this, there would be at least a few questions you missed asking yourself like, what is the data structure that auto suggest uses to retrieve data from? Does it involve client side processing? Let us answer the whole scenario with a series of related questions. A basic knowledge on search engines is always great before we go ahead, so let us start with basic working of web searches.
How does a basic search work?
When we do a search we are not actually searching the web but instead we are searching the index table created by the search provider. So now we should know how the data is sourced into these index tables. It is an outcome of the pre process that the search provider does on web pages to deliver you most relevant details on your searches. There is web crawler or rather the spider that runs across the web getting pages linked to a particular page and there by following links on the new page and so on, with this process the search provider gets all the related web pages then indexes the made on the pages and stores them on data base or rather a huge distributed data base. It is from this index data that you get your search results on the fly.
Before answering this question, you have noticed this feature in google search. What is it called in google? Google Instant. Now lets us just drill down on the working possibilities such an auto suggest. (What mentioned here are the possibilities and is not related to the working of any search provider). It is hard to fetch as you type from a database that holds the whole web data, but it is proved possible. First, apart from the indexed data the searches from every region/county are also stored, and you get instant suggestions based on this as well as if you are logged in, then based on your previous searches as well. The instant suggestions are more related to these data than the data indexed from spider.
What would be the data structure used to store this data?
If I’m storing the search key words one by one as a whole, then fetching them in micro second is a challenge because we may have to go in for sub string operations pattern matching operations on strings etc. With all these operations under the hood the results can't be instant.So here we need a faster data structure which would aid retrievals and inserts in range of micro or nano seconds. What would you go for the same in java? We may think of some data structure that could store elements as individual characters(not as a whole string) like an array, a list or a tree. Isn’t a linked data structure like a tree beneficial to do the job? Let us look into a sample tree now
From this tree correlate the fact that you are searching for the word benz, if there is no lower limit to the number of characters from where the auto suggest has to work, then when you type ‘b’ you would have the words like
Then when you go for ‘be’, you would get
What do you infer from this? When you type the letters you retrieve the the sub elements of the same are retrieved from the tree till its leaf. Remember, it is not just leaf nodes that are being displayed intermediate nodes are also chosen as well. Not all of the sub nodes are chosen, there are pretty much prerequisites, filtering and ranking used while such a tree is constructed and results/nodes are fetched.
What is the most suitable tree implementation here?
It is a Binary search tree? Can be, but there could be better implementations of tree that is faster than a BSE in terms of insertions and retrievals. Such an implementation is a Trie or a pefix tree.
What is a trie or a prefix tree?
Now that is a whole topic as such. Wiki could better explain you on this. Refer
Where does the request processing happens in auto suggest, on client or server ?
Can the processing be on client side? To process the request we need data. It would be pointless to fetch a relatively large volume of data to client and do the processing. So in auto suggest there is an ajax GET request passed to sever to fetch the most relevant data on every key stroke. We can call this as ‘search before you type’ ie when you are googling for ‘apple‘ and as soon as you type ’ap’ itself, it fetches you apple, apple ipod etc .So in short request processing takes place on the server by means of ajax calls.
NOTE: This post defines just one suitable implementation of auto suggest feature in search engines. There would be more sophisticated implications and implementations of the same which would vary from provider to provider.