Press Release
Summary = Each of us has faced the problem of searching
for information more than once. Regardless of the data source
we are using, the problems can be numerous: the sheer
volume of the database searched, unstructured information,
different file types, and the difficulty of
wording the search query accurately.
Press Release
Body = Each of us has faced the problem of searching for
information more than once. Regardless of the data source we are
using (the Internet, the file system on our hard drive, a database, or
the global information system of a big company), the problems can be
numerous: the sheer volume of the database searched, unstructured
information, different file types, and the difficulty
of wording the search query accurately. We have already reached
the stage where the amount of data on a single PC is comparable
to the amount of text stored in a real library. As for
unstructured data flows, they are only going to increase,
and very rapidly. For an average user this
might be just a minor nuisance, but for a big company a lack of control
over information can mean significant problems. So the need
for search systems and technologies that simplify and accelerate
access to the necessary information arose long ago. Such systems
are numerous, and not every one of them is
based on a unique technology. Choosing the right
one depends directly on the specific tasks to be solved.
While the demand for better data searching and processing tools
is steadily growing, let’s consider the state of affairs on
the supply side.
Without going deeply into the peculiarities of the technology, all
search programs and systems can be divided into three groups:
global Internet systems, turnkey business solutions (corporate data
searching and processing technologies), and simple phrase or file
search on a local computer. Different directions presumably mean
different solutions.
Local search
Everything is clear about search on a local PC. It offers no
notable functionality except for the choice
of file type (media, text etc.) and the search location. Just
enter the name of the file you are looking for (or part of its text,
for example in a Word document) and that’s it. The speed and the
result depend entirely on the text entered in the query line. There
is no intelligence in this: the search simply looks through the
available files to determine their
relevance. This is understandable: what’s the use
of creating a sophisticated system for such uncomplicated needs?
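The local search described above can be sketched in a few lines of
Python. This is only an illustration under stated assumptions: the
function name local_search and its parameters are hypothetical, and
only plain text files are read, whereas a real desktop search would
also need parsers for formats such as Word.

```python
import os

def local_search(root, query, extensions=(".txt",)):
    """Return paths under root whose name or text contains the query."""
    query = query.lower()
    matches = []
    for folder, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(folder, name)
            # A hit in the file name is enough; no need to open the file.
            if query in name.lower():
                matches.append(path)
                continue
            # Only read the file types we know how to read as text.
            if not name.lower().endswith(extensions):
                continue
            try:
                with open(path, encoding="utf-8", errors="ignore") as handle:
                    if query in handle.read().lower():
                        matches.append(path)
            except OSError:
                continue  # unreadable file: skip it
    return sorted(matches)
```

Every file is examined on every query, and a file either contains
the text or it does not. There is no index and no ranking.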
Global search technologies
Matters are entirely different for search systems operating
on the global network. Here one cannot rely on simply looking through
the available data. The huge volume of the global
chaos of unstructured information (Yandex, for instance, boasts an
index of more than 11 terabytes of data) makes simple search
not only ineffective but also slow and labor-intensive. That’s
why the focus has lately shifted towards optimizing and improving
the quality of search. But the scheme is still very
simple (apart from the secret innovations of each individual system):
phrase search through an indexed database with proper consideration
for morphology and synonyms. Undoubtedly, such an approach works,
but it doesn’t solve the problem completely. After reading dozens of
articles on improving search with
Google or Yandex, one comes to the conclusion that without knowing
the hidden capabilities of these systems, finding a relevant document
for a query takes more than a minute, and sometimes
more than an hour. The problem is that such an implementation of search
depends heavily on the word or phrase entered by the user.
The vaguer the query, the worse the results. This has
become an axiom, or dogma, whichever you prefer. Of course, by
intelligently using the key functions of the search systems and
carefully choosing the phrase by which documents and sites are
searched, it is possible to get acceptable results. But this comes at
the cost of painstaking mental work and time wasted looking through
irrelevant information in the hope of finding at least some clues on
how to improve the search query. In general, the scheme is as follows:
enter a phrase, look through several results, realize that the query
was not the right one, enter a new phrase, and repeat
until the relevance of the results is as high as it can get.
But even then the chances of finding the right document are
slim. No average user will voluntarily go for the sophistication
of “advanced search” (although it is equipped with a
number of very useful functions such as the choice of language,
file format etc.). The ideal
would be to simply enter a word or phrase and get a ready answer,
without any concern for how it was obtained. Let the
horse do the thinking: it has a big head. Maybe this is not exactly
to the point, but the name of one of the Google search functions,
“I’m Feeling Lucky”, characterizes the existing
search technologies very well. Nevertheless, the technology works.
It is not ideal and does not always live up to expectations, but
given the complexity of searching
through the chaos of Internet data, it can be considered acceptable.
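The scheme described above (search through an indexed database with
consideration for morphology and synonyms) can be illustrated with a
toy inverted index. This is a minimal sketch, not how Google or Yandex
work: the suffix-stripping stem function and the SYNONYMS table are
crude, hypothetical stand-ins for real morphology and thesaurus
modules.

```python
from collections import defaultdict

# Toy stand-ins for real morphology and thesaurus data (assumptions).
SUFFIXES = ("ing", "es", "ed", "s")
SYNONYMS = {"car": {"automobile"}, "automobile": {"car"}}

def stem(word):
    """Strip one common English suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokenize(text):
    """Lowercase, split on whitespace, keep alphabetic words, stem them."""
    return [stem(w) for w in text.lower().split() if w.isalpha()]

def build_index(documents):
    """Map each stem to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents matching every query word or a synonym of it."""
    result = None
    for token in tokenize(query):
        variants = {token} | {stem(s) for s in SYNONYMS.get(token, ())}
        hits = set()
        for variant in variants:
            hits |= index.get(variant, set())
        result = hits if result is None else result & hits
    return result or set()

documents = {
    1: "Used cars for sale",
    2: "Automobile repair manual",
    3: "Cooking recipes",
}
index = build_index(documents)
print(search(index, "car"))          # documents 1 and 2
print(search(index, "cars repair"))  # document 2 only
```

Even in this toy form the weakness discussed above is visible: the
result depends entirely on which words the user happens to type.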
Corporate systems
Third on the list are the turnkey solutions based on search
technologies. They are meant for serious companies and corporations
that possess really large databases and are equipped with all sorts of
information systems and documents. In principle, the technologies
themselves can also be used for home needs. For example, a
programmer working remotely from the office will make good use of
such search to find program source code scattered across the hard
drive. But these are particulars. The main application of
the technology is still solving the problem of quickly and accurately
searching through large data volumes and working with various
information sources. Such systems usually operate by a very simple
scheme (although there are undoubtedly numerous unique methods of
indexing and processing queries beneath the surface): phrase
search with proper consideration for all the stem forms, synonyms
etc., which once again brings us to the problem of the human factor.
When using such technology, the user must first word the query
phrases that will serve as the search criteria and that presumably
occur in the documents to be retrieved. But there is no
guarantee that the user will be able to choose or
remember the correct phrase unaided, or that the search by
this phrase will be satisfactory. One more key point is the speed
of processing a query. Of course, when a whole document is used
instead of a couple of words, the accuracy of search increases
manifold. But to date this opportunity has not been used because of
the heavy resource demands of such a process. The point is that search
by words or phrases will not give us highly relevant, similar
results, while search by a phrase as long as the whole
document consumes much time and computing power. Here is an example:
when a query of one word is processed, there is no considerable
difference in speed: whether it takes 0.1 or 0.001 seconds is not
of crucial importance to the user. But take an average-sized
document containing about 2000 unique words: the search
with consideration for morphology (stem forms) and thesaurus (synonyms),
together with generating a relevance-ranked list of results as in a
keyword search, will take several dozen minutes (which is unacceptable
for a user).
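A rough way to see why a whole-document query is so much heavier than
a one-word query is to count how many index lookups it produces once
every word is expanded by its forms and synonyms. The sketch below is
only an illustration: the figure of 2000 unique words comes from the
example above, while 4 forms per word and 3 synonyms per word are
hypothetical values chosen for the arithmetic, and the function name
expanded_lookups is an assumption.

```python
def expanded_lookups(unique_words, forms_per_word, synonyms_per_word):
    """Count index lookups when each word is expanded by forms and synonyms."""
    # Each word and each of its synonyms is looked up in every form.
    return unique_words * (1 + synonyms_per_word) * forms_per_word

single_word = expanded_lookups(1, forms_per_word=4, synonyms_per_word=3)
whole_document = expanded_lookups(2000, forms_per_word=4, synonyms_per_word=3)
print(single_word, whole_document)  # 16 32000
```

The count does not reproduce the timings quoted above. It only shows
that the work grows in proportion to the number of unique words, and
every lookup result must still be merged and ranked.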
The interim summary
As we can see, the existing systems and search technologies,
although they function properly, don’t solve the problem of
search completely. Where the speed is acceptable, the relevance leaves
much to be desired. Where the search is accurate and adequate, it
consumes lots of time and resources. It is of course possible to
solve the problem in a very obvious manner: by increasing the
computing power. But equipping the office with dozens of ultra-fast
computers that will continuously process phrase queries consisting of
thousands of unique words, struggling through gigabytes of incoming
correspondence, technical literature, final reports and other
information, is more than irrational and uneconomical. There is a
better way.
The unique similar content search
At present many companies are working intensively on developing
full text search. Current computing speeds make it possible to create
technologies that support queries in various forms with a wide array
of supplementary conditions. Their experience in creating phrase
search gives these companies the expertise to further develop and
perfect the search technology. In particular, one of the most popular
search engines is Google, and one of its functions is called “similar
pages”. This function lets the user view the pages
whose content is most similar to that of a sample page. Although it
works in principle, this function does not yet deliver relevant
results: they are mostly vague and of low relevance, and
sometimes the function finds no similar
pages at all. Most probably, this is the result of the chaotic
and unstructured nature of information on the Internet. But once
the precedent has been set, the advent of a flawless search
is just a matter of time. As for corporate
data processing and knowledge retrieval systems, matters
stand much worse there. Technologies that actually function (rather
than exist only on paper) are very few. And no giant or so-called
search technology guru has so far succeeded in creating a real
similar content search. Maybe the reason is that it is not
desperately needed, or maybe it is too hard to implement. But a
functioning one does exist.
SoftInform Search Technology, developed by SoftInform, is a technology
for finding documents similar in content to a sample. It enables
fast and accurate search for documents of similar content in any
volume of data. The technology is based on a mathematical model
that analyzes the document structure and selects words, word
combinations and text arrays, which results in a list of
the documents most similar to the sample text, each with its
relevance percentage. In contrast to standard phrase search,
similar content search does not require the
key words to be determined beforehand: the search is conducted on the
whole document. The technology works with several sources of information,
which can be stored both in text files in txt, doc, rtf, pdf, htm and
html formats and in information systems built on the most popular
databases (Access, MS SQL, Oracle, as well as any SQL-supporting
database). It additionally supports synonyms
and important words functions that make it possible to carry out a more
specific search. The similar search technology significantly
cuts the time wasted on searching for and reviewing identical or very
similar documents, and reduces the processing time at the stage of
entering data into the archive by avoiding duplicate documents and
forming sets of data on a certain subject. Another advantage of the
SoftInform technology is that it is not very demanding of computing
power and allows data to be processed at a very high speed even on
ordinary office computers. This technology is not just a theoretical
development. It has been tested and successfully implemented in a
project providing legal advice by phone, where the speed of
information retrieval is of crucial importance. And it will
undoubtedly be more than useful in any knowledge base, analytical
service or support department of any large firm. The universality and
effectiveness of the SoftInform
Search Technology make it possible to solve a wide spectrum of
problems arising while processing information. These include the
fuzziness of information (when a document is entered, it is possible
to determine immediately whether it already exists in the database),
similarity analysis of the documents already entered
into the database, and the search for semantically similar documents,
which saves the time spent on selecting appropriate key words and
viewing irrelevant documents.
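The general idea of searching by a whole sample document and getting
back a ranked list with a relevance percentage can be illustrated
with a well-known textbook technique: TF-IDF weighting with cosine
similarity. This is explicitly not the SoftInform model, whose details
are not given here; it is a minimal sketch of a substituted technique,
and all function names in it are hypothetical.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase, split on whitespace, keep alphabetic words."""
    return [w for w in text.lower().split() if w.isalpha()]

def tfidf_vectors(texts):
    """Build one TF-IDF weight dictionary per text."""
    counts = [Counter(tokenize(t)) for t in texts]
    doc_freq = Counter(word for c in counts for word in c)
    total = len(texts)
    return [
        {w: tf * (math.log((1 + total) / (1 + doc_freq[w])) + 1)
         for w, tf in c.items()}
        for c in counts
    ]

def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dictionaries."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def similar_documents(sample, archive):
    """Rank archive documents by similarity to the sample, in percent."""
    names = list(archive)
    vectors = tfidf_vectors([sample] + [archive[n] for n in names])
    scored = [(name, round(100 * cosine(vectors[0], vec), 1))
              for name, vec in zip(names, vectors[1:])]
    return sorted(scored, key=lambda item: item[1], reverse=True)

archive = {
    "contract_copy": "the contract must be signed by both parties",
    "lease": "the lease must be signed by the tenant",
    "recipe": "boil pasta with salt",
}
sample = "the contract must be signed by both parties"
for name, percent in similar_documents(sample, archive):
    print(name, percent)
```

The same ranking could serve the duplicate check at the document
entering stage: a new document whose best match scores above some
chosen threshold is probably already in the archive.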
Prospects
Besides its primary purpose (fast, high-quality search for
information in huge volumes of texts, archives and databases),
an Internet application can also be identified. For example, it is
possible to develop an expert system for processing incoming
correspondence and news, which would become an important tool for
analysts at different companies. This will be possible mainly thanks
to the unique similar content search technology, which so far is
absent from all existing systems except SearchInform. The problem of
spamming search engines with so-called doorways (hidden pages with
key words that redirect to the site’s main pages and are used to
raise the page rating with the search engines) and the e-mail spam
problem (a more intelligent analysis would ensure a higher level of
security) could also be addressed with the help of this technology.
But the most interesting prospect for the SoftInform Search
technology is creating a new Internet search engine whose main
competitive advantage would be the ability to search not just by key
words but also for similar web pages, which would add to the
flexibility of search, making it more comfortable and efficient.
To conclude, it can be stated with confidence that the future belongs
to full text search technologies, both on the Internet and in
corporate search systems. Unlimited development potential, adequacy
of the results and the speed of processing a query of any size make
this technology much more comfortable and highly sought after.
SoftInform Search technology might not be the pioneer, but
it is a functioning, stable and unique one with no existing
analogues (as confirmed by the active Eurasian patent). To
my mind, even with the help of the “similar search”
it will be difficult to find a similar technology.