Entering the era of private space travel, one would think that even though the rest of us may not yet be able to launch into orbit, at the very least we could make our lives easier through instant search across terabytes of up-to-date corporate data, from any place we can visit on this planet. The good news is that with corporate search software and a search index, we can make Earth even more habitable while waiting for our own spacecraft.
Now you can look at the stars and ask yourself: what is a search index? A search index is not like the type of index you find at the end of a big book. Instead, it’s just an internal tool to store every unique word and every unique number in a business dataset, along with the locations of all words and numbers in the data. The sole purpose of the search index is for the search engine to perform an exhaustive search on everything, processing as many search requests as there are at any given time.
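To make the idea concrete, here is a minimal sketch of such a structure, usually called an inverted index. It is only an illustration under simple assumptions (plain-text input, lowercase tokens made of letters and digits); the function names build_index and search are hypothetical, and this is not how dtSearch or any particular product stores its index.

```python
import re
from collections import defaultdict

def build_index(documents):
    """Map each unique word or number to {document id: [word positions]}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        # Lowercase the text and split it into runs of letters or digits.
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        for position, token in enumerate(tokens):
            index[token].setdefault(doc_id, []).append(position)
    return index

def search(index, word):
    """Return the sorted ids of every document that contains the word."""
    return sorted(index.get(word.lower(), {}))

# Two tiny stand-ins for business documents.
docs = {
    "memo.txt": "Launch budget approved for 2024",
    "mail.eml": "The launch slipped; budget review on 14 May",
}
index = build_index(docs)
print(search(index, "launch"))
print(index["2024"])
```

Because every word and number is recorded with its locations up front, a lookup is a dictionary access rather than a scan through the original files.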
For this specific example, each search index can contain up to one terabyte, and there is no limit on the number of terabyte-sized indexes that the search engine can create and then search at the same time across concurrent search requests. Supported data types include Microsoft Office files, PDFs, compressed archives like ZIP or RAR, emails and attachments, databases, web formats, etc. And putting it all into the index couldn’t be easier.
To index, simply point to the folders, emails, etc. that you want to cover, and the search engine will do the rest. You don’t even have to tell the software what kind of data it’s indexing. In fact, the search engine can determine for itself whether an item is a PowerPoint file versus a OneNote file versus an email by looking inside each binary file. (In this article I am using specific examples from dtSearch, although concepts like a search index have general applicability.)
Because the search engine looks inside each binary file to determine the file type, it doesn’t matter if a file has the wrong file extension. You can save an Access database with an Excel file extension and a PDF with a Word extension, and the search engine will sort it all out. If you have files nested within other files, the search engine will figure that out as well. An email with a ZIP attachment containing a Word document with an embedded Excel spreadsheet is no problem.
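One common way to recognize a file by its content rather than its name is to compare its first bytes against known signatures ("magic numbers"). The sketch below covers only four widely documented signatures and is an assumption about the general technique, not a description of dtSearch's document filters; sniff_type and sniff_file are hypothetical names.

```python
# Leading-byte signatures for a few common container formats.
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"PK\x03\x04", "zip"),  # also .docx/.xlsx/.pptx, which are ZIP containers
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),  # legacy .doc/.xls/.ppt
    (b"Rar!\x1a\x07", "rar"),
]

def sniff_type(header: bytes) -> str:
    """Guess a file type from its leading bytes, ignoring the file name."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return "unknown"

def sniff_file(path: str) -> str:
    """Read just enough of a file to compare against the signatures."""
    with open(path, "rb") as handle:
        return sniff_type(handle.read(8))

# A PDF that was saved with a Word extension is still recognized as a PDF,
# because only the content is inspected.
print(sniff_type(b"%PDF-1.7\n"))
```

A real detector would go further, for example by opening a ZIP container to tell a Word document from a spreadsheet, and by recursing into attachments and embedded files.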
After indexing, the search engine can run from a secure online environment such as a Windows IIS web server. The server can be “on-premises” or in the cloud, such as on Azure or AWS, allowing search from any device with an Internet connection and a web browser. For business search, the site will require appropriate security credentials. Once users are logged in, search requests can proceed statelessly, supporting an unlimited number of concurrent instant search threads.
Beyond searching for individual words, the search engine has over 25 different search options for full text or metadata, such as Boolean (and / or / not), phrase, proximity (before or after), directed proximity (forward only), concept / synonym, date or date range, number or numeric range, wildcard, or any combination thereof. Fuzzy search, adjustable from 1 to 10, can sift through typographical errors such as often occur in the text of emails or in OCR-processed PDFs. The search engine can even identify any credit card number in the text.
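As an illustration of how credit card numbers can be picked out of free text, here is a short sketch that combines a pattern match with the Luhn checksum, the standard check-digit test for payment card numbers. This is my assumption about a reasonable general approach, not dtSearch's implementation; luhn_valid and find_card_numbers are hypothetical names.

```python
import re

def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

# 13 to 19 digits, optionally separated by single spaces or hyphens.
CANDIDATE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")

def find_card_numbers(text: str) -> list:
    """Return digit strings that look like card numbers and pass Luhn."""
    hits = []
    for match in CANDIDATE.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            hits.append(digits)
    return hits

# 4111 1111 1111 1111 is a well-known test number; changing the last
# digit to 2 breaks the checksum, so only the first number is reported.
text = "Paid with 4111 1111 1111 1111, ref 4111 1111 1111 1112."
print(find_card_numbers(text))
```

The checksum step matters because long digit runs such as order or tracking numbers would otherwise be reported as cards.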
Because it’s a big planet, these search options work not only for English, but also for hundreds of international languages via Unicode. After a search request, the search engine can rank the retrieved data by relevance based on the density and scarcity of search terms. This means that if you search for planets, asteroids or comets, and planets and asteroids are all over the data but comets are much rarer, comets will get a higher relevance score. Denser mentions of comets inside a single document or email will rank even higher.
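The interplay of scarcity and density can be sketched with a simple TF-IDF-style score: each term is weighted by how rare it is across the collection, then multiplied by how densely it appears in a given document. The formula and the name rank are my own illustrative assumptions, not the relevance formula dtSearch actually uses.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def rank(documents, query_terms):
    """Order document ids by a scarcity-times-density score, best first."""
    tokens = {doc_id: tokenize(text) for doc_id, text in documents.items()}
    n_docs = len(documents)
    scores = {}
    for doc_id, words in tokens.items():
        counts = Counter(words)
        score = 0.0
        for term in query_terms:
            # Number of documents that mention the term at all.
            doc_freq = sum(1 for other in tokens.values() if term in other)
            if doc_freq == 0 or not words:
                continue
            scarcity = math.log(1 + n_docs / doc_freq)  # rarer term, larger weight
            density = counts[term] / len(words)         # share of the document
            score += scarcity * density
        scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)

docs = {
    "a.txt": "planets and asteroids orbit the sun",
    "b.txt": "planets and asteroids everywhere in the data",
    "c.txt": "comets pass planets and asteroids comets return",
}
print(rank(docs, ["planets", "asteroids", "comets"]))
```

In this toy collection every document mentions planets and asteroids, but only c.txt mentions comets, and it does so twice, so c.txt comes out on top.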
The search engine offers several sorting options, not only by relevance, but also by file name, file location, and other metadata. Each sort option is like a different window into the data, and you can instantly re-sort with just one click. After a search request, the search engine can display the full text of retrieved files, emails and other items with highlighted hits.
But the Earth is spinning and business data is changing. Fortunately, the search engine can engage in constant automatic updates of the index. Rather than reindexing everything from scratch, index updates need only take into account new files, deleted files, or files with new changes. Importantly, automatic updates can be performed without affecting simultaneous search.
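A simple way to picture an incremental update is to compare a snapshot of the files taken at the last indexing run with a snapshot taken now, and to reprocess only the differences. The sketch below assumes that file size and modification time are good enough change signals; snapshot and plan_update are hypothetical names, and a production indexer may track changes quite differently.

```python
import os

def snapshot(root):
    """Record (size, modification time) for every file under root."""
    state = {}
    for folder, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(folder, name)
            info = os.stat(path)
            state[path] = (info.st_size, info.st_mtime_ns)
    return state

def plan_update(previous, current):
    """Split the work into new, changed and deleted paths."""
    added = sorted(current.keys() - previous.keys())
    deleted = sorted(previous.keys() - current.keys())
    changed = sorted(
        path for path in current.keys() & previous.keys()
        if current[path] != previous[path]
    )
    return added, changed, deleted

# Hand-written snapshots: b.docx was edited, c.pdf removed, d.eml is new.
previous = {"a.xlsx": (10, 1), "b.docx": (20, 2), "c.pdf": (30, 3)}
current = {"a.xlsx": (10, 1), "b.docx": (25, 9), "d.eml": (40, 4)}
print(plan_update(previous, current))
```

Only the paths in the three returned lists need any work; the unchanged a.xlsx is never reread.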
A caching option further allows the indexer to store a full copy of the original files, emails, etc. as part of the index itself. With caching, the search engine can still display the retrieved files even if the originals are not available. This way, even if some data is disconnected, searching with highlighted hits can continue without interruption.
Life is a little sweeter for the rest of us stuck here on Earth.
Elizabeth Thede is Director of Sales at dtSearch Corp. The company offers enterprise and developer products that work “on premises” or in the cloud to instantly search terabytes with more than 25 search options. dtSearch’s own document filters support files, email, databases, and web data. dtSearch features a beta multithreaded indexer to dramatically increase indexing speed on 64-bit multicore Windows systems. The 64-bit multithreaded indexer speed increase works for new index builds as well as for incremental index updates. (For existing dtSearch users, the multithreaded indexer does not affect the format of the index itself, maintaining backward compatibility.)