Archive for January 22, 2001

T h e Q u e s t

rush T h e   Q u e s t

My quest is to develop the fastest, most flexible disk-based relational search algorithm for matching patterns on 10^8 words, plain and simple. This is perhaps an area considered to most people as the definition of absolute boredom, but it keeps me awake at night. The idea of finding a needle in a modern day hay stack is intriguing. Imagine being able to specify everything you want, sharpness, color, length, etc… everything down to the atomic level without having a significant cost in retrieval time. In fact, imagine if all of these attributes could be found in near constant time. Meaning, that the dataset could be infinite (or rather, really, really big) and your questions detailed, with only a search time degradation of approximately O(1), where O is the cost of finding one attribute. 

Some say, oh no, one more search engine, that’s all I need. Well, it depends on what you need, doesn’t it? My problem with every large-scale search engine in existence today, is that they use some kind of heuristic to guess where the information you want is located in a huge database. A heuristic is a method to short-cut what would be a very exhaustive route, and isolate the most probable locations. The best way to do this is by doing educated guesses. Often times these guesses work and produce varied results. My experience, however, with these guesses are that they aren’t as educated as they’re made out to be. In fact, I always run into the problem using search engines that they don’t find what I want, but they have exactly what I want (in the way that I asked for it) in their database. This means that they aren’t guessing correctly. The algorithm that I’m working on has a guaranteed 100% retrieval rate assuming the information exists in the dataset. Not only this, you can do regular expression queries on this dataset. Regular expressions are methods for explicitly stating variations in a search. You can say that you want the needle’s pin hole to be anywhere from 1.5mm to 2.5m and to have a length of 3cm or 3.5cm. And of course, most searches (regular expressions included) take just a fraction of a second.

T h e   D e t a i l sErik Osterman UCLA, Applied Mathematics with a Specialization in Computation (Fall 2002).Erik Osterman UCLA, Applied Mathematics with a Specialization in Computation (Fall 2002).

Erik Osterman UCLA, Applied Mathematics with a Specialization in Computation (Fall 2002).

Proposed Thesis’s for Math 199 – Independent Study:

(1) An efficient, dynamic, external memory hashing implementation supporting deletion and insertion of phrases for dictionary style substring matching against a dataset whose size is on the order of 10^8 items of each (< ) n-bytes. Emphasis will be placed on ensuring O(n) dataset generation time with O(1) retrieval time, while maintaining data externally on an existing filesystem (FFS).

Remarks: BerkleyDB by Sleepy Cat Software provides a nice suite of external memory database libraries and interfaces, including one for Hashing. While they work wonderfully for the average database application, the hashing implementation in particular cannot handle datasets of the magnitude being considered. Tests that I have independently conducted using the C++ BerkleyDB Libraries are far from linear during set generation.

(2) Efficient joins of presorted lists for pattern matching using substring clusters and the extendibility to file clusters for facilitating maintenance and scalability.

Remarks: The classic concept of inverted indices remains one of the most comprehensive methods of isolating explicit information in huge datasets. In order to exploit its simplicity one must however, be able to join sorted lists extremely quickly. The research will of course focus on methods to maximize this performance.

(3) Efficient parallel sorting/filtering of gigabyte scale raw datas over high-speed LAN environments of heterogenous server clusters.

Remarks: One of the inherent problems of maintaining inverted indices is growing, sorting and updating dynamic data. Sorting is an algorithm that can directly exploit the advantages of networked environments if done correctly. And with the tremendous supply of powerful, off-the-shelf PCs available for about $500, there’s no reason not to.
Statistical analysis will be based on partial datasets and extended to complete distributions.
Problems that need to be considered are network saturation, scaling, and k-way merging. Also investigated will be the way to implement a parallel sorting algorithm to maximize the overlap between the disk, network, and CPU subsystems of a processing node.

(4) A scaleable technique of maintaining accurate, up-to-date indexes of ftp and http based file transfer services, by exercising a cross between GNUtella and Napster innovations.

Remarks: As a developer who is crazed with the inner workings of a search engine, I must admit I was caught off-guard with the introduction of the GNUtella system. It’s by all means beautiful. But with the growth of it’s popularity, its major weaknesses have also come forth. For example, unintentional network segmentations occur, where you get two or more broken networks of GNUtella clients (like a disconnected graph). The second, perhaps the most serious, is it’s high bandwidth requirements that can incapacitate all but the most fortunate internet users (i.e. ones on nice college LANs).


     

Issues that will be directly addressed:
(i) How efficiently can a given task be performed under reasonable assumptions/circumstances on available resources (e.g., time, storage, type of processor), and type of system (e.g. sequential, parallel or distributed)
(ii) How efficiently can a proposed system perform a collection of tasks (i) in terms of resources used and type of system.
(iii) What are the limits on how efficiently a system and each task can be performed.

what T h e   Q u e s t