Press Release
Summary = Each of us has struggled more than once to find the
information we need. Regardless of the data source we use, the
problems are many: the sheer volume of the database being searched,
unstructured information, differing file types, and the difficulty
of wording the search query accurately.
Body = Each of us has struggled more than once to find the
information we need. Regardless of the data source we use (the
Internet, the file system on our hard drive, a database or the global
information system of a big company), the problems are many: the
sheer volume of the database being searched, unstructured
information, differing file types, and the difficulty of wording the
search query accurately. We have already reached the stage where the
amount of data on a single PC is comparable to the amount of text
stored in a sizeable library. As for unstructured data flows, they
are only going to grow, and at a very rapid pace. For an average
user this may be a minor inconvenience, but for a big company a lack
of control over information can mean significant problems. The need
for search systems and technologies that simplify and speed up
access to information therefore arose long ago. Such systems are
numerous, and not every one of them is based on a unique technology.
Choosing the right one depends directly on the specific tasks to be
solved. While demand for the perfect data searching and processing
tool is growing steadily, let’s consider the state of affairs on
the supply side.
Without going deep into the peculiarities of each technology, all
search programs and systems can be divided into three groups: global
Internet systems, turnkey business solutions (corporate data
searching and processing technologies) and simple phrase or file
search on a local computer. Different purposes presumably call for
different solutions.
Local search
Search on a local PC is straightforward. It is not remarkable for
any particular functionality except for the choice of file type
(media, text etc.) and the search location. Just enter the name of
the file you are looking for (or part of its text, for example in a
Word document) and that’s it. The speed and the result depend
entirely on the text entered into the query line. There is no
intelligence in this: the program simply looks through the available
files to decide whether they match. This is understandable: what’s
the use of creating a sophisticated system for such uncomplicated
needs?
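The plain scan described above can be sketched in a few lines of
Python. This is only an illustration, not the code of any particular
desktop search tool: the function name local_search and its
parameters are hypothetical, and it handles plain text files only
(formats such as Word would need a parser).

```python
import os


def local_search(root, query, extensions=(".txt",)):
    """Return sorted paths under `root` whose file name or text contains `query`.

    Hypothetical helper for illustration: a case-insensitive substring
    match with no index, no morphology and no ranking.
    """
    query = query.lower()
    hits = []
    for folder, _dirs, files in os.walk(root):
        for name in files:
            # The "choice of file type": skip files with other extensions.
            if not name.lower().endswith(tuple(extensions)):
                continue
            path = os.path.join(folder, name)
            if query in name.lower():
                hits.append(path)
                continue
            # Otherwise read the whole file and look for the query text.
            with open(path, encoding="utf-8", errors="ignore") as handle:
                if query in handle.read().lower():
                    hits.append(path)
    return sorted(hits)
```

Every file is read in full for every query, which is tolerable on one
hard drive and is exactly what stops working at Internet scale.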
Global search technologies
Matters stand quite differently with the search systems operating
on the global network. One can’t rely on simply looking through
the available data. The huge volume of unstructured information
(Yandex, for instance, boasts an index of more than 11 terabytes of
data) makes simple search not only ineffective but also slow and
labor-intensive. That’s why the focus has lately shifted towards
optimizing search and improving its quality. But the scheme is still
very simple (apart from the secret innovations of each individual
system): phrase search through an indexed database, with proper
consideration for morphology and synonyms. Such an approach
undoubtedly works, but it doesn’t solve the problem completely.
After reading dozens of articles dedicated to improving search with
Google or Yandex, one can arrive at the conclusion that, without
knowing the hidden capabilities of these systems, finding a relevant
document for a query takes more than a minute, and sometimes more
than an hour. The problem is that this kind of search depends heavily
on the word or phrase the user enters. The vaguer the query, the
worse the search.
This has become an axiom, or dogma, whichever you prefer. Of course,
by using the key functions of the search systems intelligently and
choosing the search phrase carefully, it is possible to get
acceptable results. But this takes painstaking mental work, and time
is wasted looking through irrelevant information in the hope of
finding at least some clues on how to improve the query. In general,
the scheme is as follows: enter a phrase, look through several
results, realize that the query was not the right one, enter a new
phrase, and repeat until the relevancy of the results reaches the
highest possible level. But even then the chances of finding the
right document are still slim.
No average user will voluntarily take on the complexity of “advanced
search” (although it offers a number of very useful functions, such
as the choice of language, file format etc.). The ideal would be to
simply enter a word or phrase and get a ready answer, without
worrying about how it was obtained. Let the horse think, it has a
big head. Maybe this is not exactly to the point, but the name of
one of Google’s search functions, “I’m Feeling Lucky”,
characterizes the existing search technologies very well.
Nevertheless, the technology works. Not ideally, and not always
living up to expectations, but given the complexity of searching
through the chaos of Internet data, it can be considered acceptable.
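The scheme named in this section (keyword search through an indexed
database with allowance for morphology and synonyms) can be
illustrated with a toy inverted index. This is a minimal sketch under
loud assumptions, not how Google or Yandex work: normalize is a crude
stand-in for real morphological analysis (it only lowercases and
strips a trailing "s"), the synonym table is supplied by the caller,
and all names are hypothetical. It matches documents that contain
every query word, not exact phrases.

```python
from collections import defaultdict


def normalize(word):
    """Crude stand-in for morphology: lowercase, trim punctuation, drop a final 's'."""
    word = word.lower().strip(".,;:!?\"'()")
    if len(word) > 3 and word.endswith("s"):
        word = word[:-1]
    return word


def build_index(docs):
    """Map each normalized word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[normalize(word)].add(doc_id)
    return index


def search(index, query, synonyms=None):
    """Return sorted ids of documents that contain every query word or a synonym of it."""
    synonyms = synonyms or {}
    result = None
    for word in query.split():
        term = normalize(word)
        variants = {term} | {normalize(s) for s in synonyms.get(term, ())}
        matches = set()
        for variant in variants:
            matches |= index.get(variant, set())
        # Every query word must match, so intersect with earlier matches.
        result = matches if result is None else result & matches
    return sorted(result or [])
```

The lookup is fast because only the index is consulted, but the
outcome still depends entirely on which words the user happens to
type, which is the weakness discussed above.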
Corporate systems
Third on the list are turnkey solutions based on search
technologies. They are meant for serious companies and corporations
that possess really large databases and run all sorts of information
systems and document stores. In principle, the technologies
themselves can also be used for home needs. For example, a
programmer working remotely from the office will make good use of
search to reach program source code scattered across his hard drive.
But these are particulars. The main application of the technology is
still quick and accurate search through large data volumes and work
with various information sources. Such systems usually operate by a
very simple scheme (although there are undoubtedly numerous unique
methods of indexing and processing queries beneath the surface):
phrase search, with proper consideration for all stem forms, synonyms
etc., which once again brings us to the human factor. When using such
a technology, the user must first word the query phrases that will
serve as search criteria and that presumably occur in the documents
to be retrieved. But there is no guarantee that the user will be able
to choose or remember the correct phrase unaided, nor that a search
by this phrase will be satisfactory. Another key issue is the speed
of processing a query. Of course, when the whole document is used as
the query instead of a couple of words, the accuracy of search
increases many times over. But to date this opportunity has gone
unused because such a process is so demanding of computing capacity.
The point is that search by words or phrases will not give us
results that are highly similar to what we need, while search by a
phrase as long as a whole document consumes a great deal of time and
computing resources. Here is an example: when processing a one-word
query, there is no considerable difference in speed; whether it
takes 0.1 or 0.001 second is not of crucial importance to the user.
But take an average-size document containing about 2000 unique
words: a keyword search that accounts for morphology (stem forms)
and a thesaurus (synonyms) and generates a relevance-ranked list of
results will take several dozen minutes, which is unacceptable for
a user.
The interim summary
As we can see, the existing search systems and technologies,
although they function properly, don’t solve the search problem
completely. Where the speed is acceptable, the relevancy leaves much
to be desired. Where the search is accurate and adequate, it consumes
a lot of time and resources. It is of course possible to solve the
problem in a very obvious manner: by increasing computing capacity.
But equipping the office with dozens of ultra-fast computers that
continuously process phrase queries consisting of thousands of
unique words, struggling through gigabytes of incoming
correspondence, technical literature, final reports and other
information, is irrational and uneconomical. There is a better way.
The unique similar content search
At present many companies are working intensively on full text
search. Today’s computing speeds make it possible to create
technologies that support queries of different kinds with a wide
array of supplementary conditions. The experience gained in creating
phrase search gives these companies the expertise to develop and
perfect the search technology further. In particular, one of the
most popular search engines is Google, and one of its functions is
called “similar pages”. This function lets the user view the pages
whose content is most similar to a sample page. Although it works in
principle, the function does not yet return relevant results: they
are mostly vague and of low relevancy, and sometimes it finds no
similar pages at all. Most probably this is a result of the chaotic
and unstructured nature of information on the Internet. But once the
precedent has been set, the advent of a perfect search without a
hitch is just a matter of time. As for corporate data processing and
knowledge retrieval systems, matters stand much worse. Functioning
technologies (as opposed to those that exist only on paper) are very
few. And no giant or so-called search technology guru has so far
succeeded in creating a real similar content search. Maybe the
reason is that it is not desperately needed; maybe it is too hard to
implement. But there is a functioning one.
SoftInform Search Technology, developed by SoftInform, is a
technology for finding documents similar in content to a sample. It
enables fast and accurate search for documents of similar content in
any volume of data. The technology is based on a mathematical model
that analyzes the document structure and selects words, word
combinations and text arrays, producing a list of the documents most
similar to the sample text fragment, each with its relevancy
percentage. In contrast to standard phrase search, similar content
search does not require key words to be chosen beforehand: the
search is conducted using the whole document. The technology works
with several sources of information, which can be stored both in
text files in txt, doc, rtf, pdf, htm and html formats and in
information systems built on the most popular databases (Access, MS
SQL, Oracle, as well as any SQL-supporting database). It
additionally supports synonym and important-word functions that make
a more specific search possible. The similar search technology
significantly cuts the time wasted on searching for and reviewing
identical or very similar documents, and shortens processing time at
the stage of entering data into the archive by avoiding duplicate
documents and by forming sets of data on a given subject. Another
advantage of the SoftInform technology is that it is not very
demanding of computing capacity and processes data at a very high
speed even on ordinary office computers.
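To make the idea of querying with a whole document more concrete,
here is a minimal sketch that ranks documents by similarity to a
sample and reports a relevancy percentage. It is not SoftInform's
model, whose details are not given here; it swaps in a textbook
technique, cosine similarity over word counts, and every name in it
(term_vector, cosine_percent, similar_documents, min_percent) is
hypothetical.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Count the lowercase alphanumeric words of a text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_percent(a, b):
    """Cosine similarity of two word-count vectors, scaled to 0..100."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return 100.0 * dot / (norm_a * norm_b)


def similar_documents(sample, corpus, min_percent=0.0):
    """Return (name, percent) pairs for `corpus`, most similar to `sample` first."""
    sample_vector = term_vector(sample)
    scored = [
        (name, cosine_percent(sample_vector, term_vector(text)))
        for name, text in corpus.items()
    ]
    kept = [pair for pair in scored if pair[1] >= min_percent]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)
```

The user supplies the sample document itself and never has to pick
key words. A production system would add an index so that the sample
is not compared against every stored document in turn.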
This technology is not just a theoretical development. It has been
tested and successfully implemented in a project providing legal
advice by phone, where the speed of information retrieval is of
crucial importance. And it will undoubtedly be more than useful in
any knowledge base, analytical service or support department of any
large firm. The universality and effectiveness of the SoftInform
Search Technology allow it to solve a wide spectrum of problems that
arise while processing information. These include the fuzziness of
information (at the document entry stage it is possible to determine
immediately whether such a document already exists in the database),
the similarity analysis of documents already entered into the
database, and the search for semantically similar documents, which
saves the time spent on selecting appropriate key words and viewing
irrelevant documents.
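The check at the document entry stage can be pictured as a
near-duplicate test run before a document is stored. The sketch below
is an assumption-laden illustration, not the SoftInform
implementation: it uses Jaccard similarity over three-word shingles,
a common near-duplicate measure, and the names shingles, jaccard and
find_duplicate as well as the 0.8 threshold are hypothetical.

```python
def shingles(text, size=3):
    """Return the set of overlapping `size`-word sequences in a text."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Share of shingles common to both sets, from 0.0 to 1.0."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def find_duplicate(new_text, archive, threshold=0.8):
    """Return the name of the closest archived document if it is similar enough, else None."""
    new_shingles = shingles(new_text)
    best_name, best_score = None, 0.0
    for name, text in archive.items():
        score = jaccard(new_shingles, shingles(text))
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None
```

A document that differs from an archived one by a word or two is
flagged, while an unrelated text returns None and can be stored as
new.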
Prospects
Besides its primary purpose (fast, high-quality search for
information in huge volumes of texts, archives and databases), an
Internet direction can also be outlined. For example, it is possible
to develop an expert system for processing incoming correspondence
and news, which would become an important tool for analysts at
different companies. This will be possible mainly thanks to the
unique similar content search technology, so far absent from every
existing system except SearchInform. The problem of spamming search
engines with so-called doorways (hidden pages stuffed with key words
that redirect to the site’s main pages and are used to raise the
page’s rating with search engines) and the e-mail spam problem (a
more intelligent analysis would ensure a higher level of security)
could also be solved with the help of this technology. But the most
interesting prospect for the SoftInform Search technology is the
creation of a new Internet search engine whose main competitive
advantage would be the ability to search not just by key words but
also for similar web pages, which would make search more flexible,
comfortable and efficient.
To conclude, it can be stated with confidence that the future
belongs to full text search technologies, both on the Internet and
in corporate search systems. Unlimited development potential,
adequate results and fast processing of queries of any size make
this technology much more comfortable to use and put it in high
demand. SoftInform Search technology might not be the pioneer, but
it is a functioning, stable and unique one with no existing
analogues (which is confirmed by an active Eurasian patent). To my
mind, even with the help of “similar search” it will be difficult
to find a similar technology.