Each of us has faced the problem of searching for information more than once. Regardless of the data source we are using (the Internet, the file system on a hard drive, a database or the global information system of a big company), the problems can be many: the sheer volume of the data being searched, its lack of structure, the variety of file types and the difficulty of wording the search query accurately. We have already reached the stage where the amount of data on a single PC is comparable to the amount of text stored in a proper library. And the flows of unstructured data are only going to grow, and at a very rapid pace. For an average user this might be just a minor misfortune, but for a big company the absence of control over information can mean significant problems. So the need for search systems and technologies that simplify and accelerate access to the necessary information arose long ago. Such systems are numerous, and not every one of them is based on a unique technology. Choosing the right one depends directly on the specific tasks to be solved. While the demand for perfect data searching and processing tools is steadily growing, let’s consider the state of affairs on the supply side.
Without going deeply into the peculiarities of each technology, all search programs and systems can be divided into three groups: global Internet systems, turnkey business solutions (corporate data searching and processing technologies) and simple phrase or file search on a local computer. Different purposes presumably call for different solutions.
Search on a local PC is the simplest case. It offers no particular functionality except for the choice of file type (media, text etc.) and the search location. Just enter the name of the file you are looking for (or a fragment of text, for example from a Word document) and that’s it. The speed and the result depend entirely on the text entered into the query line. There is no intelligence in this: the program simply looks through the available files to determine their relevance. This is understandable: what’s the use of creating a sophisticated system for such uncomplicated needs?
Global search technologies
Matters stand quite differently with the search systems operating on the global network. One can’t rely simply on looking through the available data. The huge volume of unstructured information in this global chaos (Yandex, for instance, boasts an index of more than 11 terabytes of data) makes simple search not only ineffective but also slow and labor-intensive. That’s why the focus has lately shifted towards optimizing search and improving its quality. But the scheme is still very simple (except for the secret innovations of each separate system): phrase search through an indexed database, with proper consideration for morphology and synonyms. Undoubtedly, such an approach works, but it doesn’t solve the problem completely. After reading dozens of articles dedicated to improving search with Google or Yandex, one arrives at the conclusion that, without knowing the hidden capabilities of these systems, finding a relevant document takes more than a minute, and sometimes more than an hour. The problem is that this kind of search depends heavily on the query word or phrase entered by the user. The vaguer the query, the worse the search. This has become an axiom, or dogma, whichever you prefer.
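The scheme described above (phrase search through an index, with morphology and synonyms taken into account) can be sketched in a few lines. This is only a toy illustration under stated assumptions, not how Google or Yandex actually work: the `stem` function is a crude suffix stripper standing in for real morphological analysis, and the `SYNONYMS` table, function names and sample documents are all hypothetical.

```python
from collections import defaultdict

# Hypothetical synonym table; a real system would use a full thesaurus.
SYNONYMS = {"car": {"automobile"}, "automobile": {"car"}}

def stem(word):
    """Crude stand-in for morphology: strip a few common English suffixes."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokenize(text):
    """Lowercase, split on whitespace and reduce every word to its stem."""
    return [stem(w) for w in text.lower().split()]

def build_index(docs):
    """Build an inverted index: stem -> set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents that contain every query term or one of its synonyms."""
    result = None
    for term in tokenize(query):
        variants = {term} | {stem(s) for s in SYNONYMS.get(term, ())}
        matches = set()
        for variant in variants:
            matches |= index.get(variant, set())
        result = matches if result is None else result & matches
    return result or set()

docs = {
    1: "red cars for sale",
    2: "used automobile parts",
    3: "searching the web",
}
index = build_index(docs)
print(search(index, "car"))        # documents mentioning "car" or its synonym
print(search(index, "car parts"))  # documents matching both terms
```

Even in this toy form the weakness discussed in the text is visible: the result depends entirely on which words the user happens to type into the query.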
Of course, by using the key functions of the search systems intelligently and choosing the search phrase properly, it is possible to get acceptable results. But this would be the result of painstaking mental work and of time wasted looking through irrelevant information in the hope of finding at least some clues on how to improve the query. In general, the scheme is the following: enter a phrase, look through several results, realize that the query was not the right one, enter a new phrase, and repeat these stages until the relevance of the results reaches the highest possible level. But even then the chances of finding the right document are slim. No average user will voluntarily resort to the sophistication of “advanced search” (although it is equipped with a number of very useful functions, such as the choice of language, file format etc.). The best would be to simply enter a word or phrase and get a ready answer, without particular concern for the means of getting it. Let the horse think: it has a big head. Maybe this is not exactly to the point, but the name of one Google search function, “I’m Feeling Lucky”, characterizes the existing search technologies very well. Nevertheless, the technology works. It is not ideal and does not always justify the hopes placed in it, but if you allow for the complexity of searching through the chaos of Internet data, it can be considered acceptable.
The remaining group is the turnkey solutions built on search technologies. They are meant for serious companies and corporations that possess really large databases and run all sorts of information systems and document stores. In principle, the technologies themselves can also be used for home needs. For example, a programmer working remotely from the office will make good use of search to find program source code scattered across his hard drive. But these are particulars. The main application of the technology is still quick and accurate search through large volumes of data and work with various information sources. Such systems usually operate by a very simple scheme (although there are undoubtedly numerous unique methods of indexing and query processing beneath the surface): phrase search with proper consideration for all the stem forms, synonyms etc., which once again leads us to the problem of the human factor. When using such a technology, the user must first word the query phrases that will serve as the search criteria and that presumably occur in the documents to be retrieved. But there is no guarantee that the user will be able to choose or remember the correct phrase unaided, or that the search by this phrase will be satisfactory.
One more key point is the speed of processing a query. Of course, when the whole document is used as the query instead of a couple of words, the accuracy of the search increases manifold. But to date this opportunity has not been used, because such a process is so resource-intensive. The point is that search by words or phrases will not give us results that are highly similar to what we need, while search by a phrase equal in length to a whole document consumes much time and computing power. Here is an example: when a one-word query is processed, there is no considerable difference in speed; whether it takes 0.1 or 0.001 second is not of crucial importance to the user. But take an average-size document containing about 2000 unique words: a keyword search that accounts for morphology (stem forms) and the thesaurus (synonyms) and generates a relevant list of results will take several dozen minutes, which is unacceptable for a user.
The interim summary
As we can see, the currently existing search systems and technologies, although they function properly, don’t solve the problem of search completely. Where the speed is acceptable, the relevance leaves much to be desired. Where the search is accurate and adequate, it consumes lots of time and resources. It is of course possible to solve the problem in a very obvious manner: by increasing the computing capacity. But equipping the office with dozens of ultra-fast computers that continuously process phrase queries consisting of thousands of unique words, struggling through gigabytes of incoming correspondence, technical literature, final reports and other information, is irrational and uneconomical. There is a better way.
The unique similar content search
At present many companies are working intensively on developing full text search. Available computing speeds make it possible to create technologies that handle queries of very different sizes with a wide array of supplementary conditions. Their experience in creating phrase search gives these companies the expertise to develop and perfect the search technology further. One of the most popular search engines, Google, has a function called “similar pages”. It lets the user view the pages whose content is most similar to that of a sample page. Although it works in principle, this function does not yet deliver relevant results: they are mostly vague and of low relevance, and sometimes the function reports that there are no similar pages at all. Most probably, this is a result of the chaotic and unstructured nature of information on the Internet. But once the precedent has been created, the advent of perfect search without a hitch is just a matter of time.
As for corporate data processing and knowledge retrieval systems, here matters stand much worse. Functioning technologies (as opposed to those existing only on paper) are very few. And no giant or so-called search technology guru has so far succeeded in creating a real similar content search. Maybe the reason is that it’s not desperately needed; maybe it is too hard to implement. But there is a functioning one.
SoftInform Search Technology, developed by SoftInform, is a technology for finding documents similar in content to a sample. It enables fast and accurate search for documents of similar content in any volume of data. The technology is based on a mathematical model that analyzes the document structure and selects words, word combinations and text arrays; the result is a list of the documents most similar to the sample text, each with a relevance percentage. In contrast to standard phrase search, similar content search does not require the key words to be determined beforehand: the search is conducted using the whole document. The technology works with several kinds of information sources: text files in txt, doc, rtf, pdf, htm and html formats, and information systems built on the most popular databases (Access, MS SQL, Oracle, as well as any SQL-supporting database). It additionally supports synonyms and important-words functions that make it possible to carry out a more specific search.
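To make the idea of searching by a whole sample document more concrete, here is a minimal sketch. It does not reproduce SoftInform’s mathematical model, which the article does not disclose; it swaps in a well-known textbook technique, cosine similarity over word-count vectors, and reports the score as a percentage. All function names and sample texts are hypothetical.

```python
import math
from collections import Counter

def term_vector(text):
    """Represent a document as a bag of lowercase words with their counts."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def find_similar(sample, docs):
    """Rank documents by similarity to the sample; returns (doc_id, percent) pairs, best first."""
    sample_vec = term_vector(sample)
    scored = [
        (doc_id, round(100 * cosine_similarity(sample_vec, term_vector(text)), 1))
        for doc_id, text in docs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

docs = {
    "a": "contract law for tenants",
    "b": "football results today",
}
ranked = find_similar("contract law for tenants", docs)
for doc_id, percent in ranked:
    print(doc_id, percent)
```

The user supplies no key words here: the whole sample text is the query, and every document receives a relevance percentage.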
The similar search technology makes it possible to significantly cut the time wasted on searching for and reviewing the same or very similar documents, and to shorten processing at the stage of entering data into the archive by avoiding duplicate documents and by forming sets of data on a given subject. Another advantage of the SoftInform technology is that it is not very demanding of computing capacity and allows data to be processed at a very high speed even on ordinary office computers.
This technology is not just a theoretical development. It has been tested and successfully implemented in a project providing legal advice by phone, where the speed of information retrieval is of crucial importance. And it will undoubtedly be more than useful in any knowledge base, analytical service or support department of a large firm. The universality and effectiveness of the SoftInform Search Technology allow it to solve a wide spectrum of problems that arise while processing information. These include the fuzziness of information (at the document entry stage it is possible to determine immediately whether such a document is already in the database), the similarity analysis of documents that have already been entered into the database, and the search for semantically similar documents, which saves the time spent on selecting appropriate key words and viewing irrelevant documents.
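The duplicate check at the document entry stage can be illustrated with a short sketch. Again, this is not SoftInform’s algorithm: it assumes a common textbook approach, comparing sets of overlapping three-word sequences (shingles) with the Jaccard coefficient. The function names, the shingle size and the 0.8 threshold are hypothetical values chosen for the example.

```python
def shingles(text, k=3):
    """Return the set of overlapping k-word sequences in the text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Share of shingles common to both sets (0.0 to 1.0)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def is_near_duplicate(new_text, archive, threshold=0.8):
    """True if the new document is close enough to any document already archived."""
    new_shingles = shingles(new_text)
    return any(jaccard(new_shingles, shingles(old)) >= threshold for old in archive)

archive = ["the tenant must pay rent on the first day of each month"]

# Same clause with one word changed: flagged as a near-duplicate.
print(is_near_duplicate("the tenant must pay rent on the first day of each week", archive))
# Unrelated text: accepted as a new document.
print(is_near_duplicate("football results for the weekend are in", archive))
```

A check of this kind runs once per incoming document, so the archive can reject or flag repeats before anyone spends time reviewing them.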
Besides its primary purpose (fast and high quality search for information in huge volumes of texts, archives and databases), the technology could also be applied to the Internet. For example, it is possible to build an expert system for processing incoming correspondence and news, which would become an important tool for analysts in different companies. This would be possible mainly thanks to the unique similar content search technology, so far absent from every existing system except SearchInform. The problem of spamming search engines with so-called doorways (hidden pages stuffed with key words that redirect to the site’s main pages and are used to raise the page rating with the search engines) and the e-mail spam problem (a more intelligent analysis would ensure a higher level of security) could also be addressed with the help of this technology. But the most interesting prospect for the SoftInform Search technology is a new Internet search engine whose main competitive advantage would be the ability to search not just by key words but also for similar web pages, which would add to the flexibility of search and make it more comfortable and efficient.
To draw a conclusion, it can be stated with confidence that the future belongs to full text search technologies, both on the Internet and in corporate search systems. Unlimited development potential, adequate results and fast processing of queries of any size make this technology much more comfortable to use and put it in high demand. SoftInform Search technology might not be the pioneer, but it is a functioning, stable and unique one with no existing analogues (which is confirmed by an active Eurasian patent). To my mind, even with the help of “similar search” it will be difficult to find a similar technology.