Advanced Data Structure Project

Title of Project

Succinct data structure in top-k documents retrieval

The objective of Research

The main aim of this project is to discover how to efficiently find the k documents in which a given pattern occurs most frequently. Although the problem has been discussed in many papers and solved in various ways, our research surveys the recent literature for novel algorithms and (succinct) data structures and looks for the one that dominates almost all of the space/time tradeoff.

Background/History of the Study

Before we look for such a succinct data structure, we review a number of fundamental works that our approach builds on. Among the many ideas in classic information retrieval, two are central: the inverted index and term frequency (Angelos, Giannis, Epimeneidis, Euripides, & Evangelos, 2005). The inverted index, also referred to as the postings file, is an index data structure that stores a mapping from content, such as words, to its locations in a document or a set of documents. It is the most widely used data structure in the information retrieval domain, employed on a large scale in, for example, search engines. Term frequency is a measure of how often a term occurs in a document of the collection.
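The two ideas can be combined in a few lines. The following is only a minimal sketch, assuming whitespace-separated, lower-cased words; the function names build_inverted_index and top_k are hypothetical and do not come from the cited works.

```python
from collections import Counter, defaultdict
import heapq


def build_inverted_index(docs):
    """Map each word to a postings dict {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        # Assumption: naive whitespace tokenization of lower-cased text.
        for word, tf in Counter(text.lower().split()).items():
            index[word][doc_id] = tf
    return index


def top_k(index, word, k):
    """Return up to k (doc_id, tf) pairs with the highest term frequency."""
    postings = index.get(word.lower(), {})
    # Higher frequency first; ties are broken by the smaller doc_id.
    return heapq.nsmallest(k, postings.items(), key=lambda p: (-p[1], p[0]))


docs = [
    "the cat sat on the mat",
    "the dog chased the cat and the other cat",
    "a bird sang",
]
index = build_inverted_index(docs)
print(top_k(index, "cat", 2))  # [(1, 2), (0, 1)]
```

The sketch answers a query only for a whole word that the tokenizer produced, which is exactly the kind of assumption discussed next.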

However, these ideas are efficient only under restrictive assumptions: the text must be easily tokenized into words, there must not be too many distinct words, and queries must be whole words or phrases. These restrictions cause considerable difficulty for document retrieval in many languages. On the other hand, one attractive property of an inverted file is that it is easily compressible while still supporting fast queries; in practice, an inverted file occupies space close to that of a compressed document collection (Niko & Veli, 2007). In further development, full-text indexes such as suffix arrays and suffix trees were found to offer good space/time efficiency as an alternative to inverted files.

Recently, several compressed full-text indexes have been proposed and shown to be effective in practice as well. A generalized suffix tree is a suffix tree for a set of strings. Given the set of strings D = S(1), S(2), …, S(d) of total length n, it is a Patricia tree containing all n suffixes of the strings. It can be built in O(n) time and space and can be used to find all k occurrences of a string P of length m in O(m + k) time. However, it requires O(n log n) bits, which is significantly more than the collection size. Later on, Niko V. and Veli M. presented an alternative, space-efficient variant of Muthukrishnan's structure that takes fewer bits while retaining optimal query time (Niko & Veli, 2007). Based on this background study, we now turn to our main topic: a succinct data structure in top-k documents retrieval.
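To make the idea of indexing every suffix of every document concrete, here is a small illustrative sketch. It deliberately swaps the suffix tree for a sorted list of suffixes (a naive generalized suffix array), stores each suffix as a full string, and builds the list by plain sorting, so it does not reach the time or space bounds of a real suffix tree. The names build_generalized_suffix_array and find_occurrences are hypothetical.

```python
def build_generalized_suffix_array(docs):
    """Return sorted (suffix, doc_id, offset) entries for all suffixes of all documents."""
    entries = []
    for doc_id, text in enumerate(docs):
        for offset in range(len(text)):
            # Assumption: suffixes are stored explicitly, for illustration only.
            entries.append((text[offset:], doc_id, offset))
    entries.sort()
    return entries


def find_occurrences(entries, pattern):
    """Return sorted (doc_id, offset) pairs where pattern occurs."""
    m = len(pattern)

    def boundary(past_matches):
        # Binary search on the length-m prefixes, which are sorted
        # because the full suffixes are sorted.
        lo, hi = 0, len(entries)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = entries[mid][0][:m]
            if prefix < pattern or (past_matches and prefix == pattern):
                lo = mid + 1
            else:
                hi = mid
        return lo

    start, end = boundary(False), boundary(True)
    return sorted((doc_id, offset) for _, doc_id, offset in entries[start:end])


docs = ["banana", "bandana", "cabana"]
entries = build_generalized_suffix_array(docs)
print(find_occurrences(entries, "ana"))  # [(0, 1), (0, 3), (1, 4), (2, 3)]
```

All suffixes that start with the pattern form one contiguous range of the sorted list, which is the same property a suffix tree exposes as a single subtree.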

Research to the Study

As the background study above shows, suffix-tree-based indexes are the starting point for reducing space consumption. In the suffix tree document model, a document is considered as a string consisting of words, not characters. While the suffix tree is being constructed, each suffix of a document is compared with the suffixes already in the tree to find the position at which to insert it. Hon W. K., Shah R., and Wu S. B. introduced the first efficient solution for top-k document retrieval (Hon, Shah, & Wu, 2009). To filter out noise in a large collection, the algorithm takes a minimum term frequency as one of its parameters, so that only documents highly relevant to the pattern P are considered (Hon, Shah, & Wu, 2009).

Furthermore, to capture high relevance they also formulated the f-mine problem, in which only documents that contain more than f occurrences of the pattern need to be retrieved. The notion of relevance here is simply the term frequency. In “Efficient Index for Retrieving Top-k Most Frequent Documents”, Hon W. K., Shah R., and Wu S. B. derived their solution from related problems studied by Muthukrishnan, with bounds on both query time and index space. The approach is based on a new use of the suffix tree called the induced generalized suffix tree (IGST) (Hon, Shah, & Wu, 2009). The practicality of the proposed index is validated by experimental results.
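The two query types described above can be pinned down with a brute-force baseline. This sketch is not the IGST index and has none of its efficiency; it scans every document for every query and only illustrates what a correct answer looks like. It assumes a non-empty pattern and counts overlapping occurrences; all function names are hypothetical.

```python
import heapq


def count_occurrences(text, pattern):
    """Count occurrences of a non-empty pattern in text, overlaps included."""
    count, start = 0, text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def pattern_frequencies(docs, pattern):
    """Return {doc_id: term frequency} for documents containing the pattern."""
    freqs = {}
    for doc_id, text in enumerate(docs):
        tf = count_occurrences(text, pattern)
        if tf > 0:
            freqs[doc_id] = tf
    return freqs


def top_k_documents(docs, pattern, k):
    """Top-k query: up to k (doc_id, tf) pairs, most frequent first."""
    freqs = pattern_frequencies(docs, pattern)
    return heapq.nsmallest(k, freqs.items(), key=lambda p: (-p[1], p[0]))


def documents_with_more_than(docs, pattern, f):
    """Threshold query: ids of documents with more than f occurrences."""
    freqs = pattern_frequencies(docs, pattern)
    return sorted(doc_id for doc_id, tf in freqs.items() if tf > f)


docs = ["banana", "bandana", "cabana", "ananas and bananas"]
print(top_k_documents(docs, "ana", 2))           # [(3, 4), (0, 2)]
print(documents_with_more_than(docs, "ana", 1))  # [0, 3]
```

An index-based solution is expected to return the same answers without touching documents that do not contain the pattern.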

Future Works

With the fundamental works reviewed, our future analysis of “Succinct data structure in top-k documents retrieval” will be based mainly on the most recent result by Gonzalo N. and Daniel V. (Gonzalo & Daniel, 2012), a new top-k algorithm that dominates almost all of the space/time tradeoff.

References

  1. Angelos, H., Giannis, V., Epimeneidis, V., Euripides, P. G., & Evangelos, M. (2005). Information Retrieval by Semantic Similarity. Dalhousie University, Faculty of Computer Science. Halifax.
  2. Bieganski, P. (1994). Generalized suffix trees for biological sequence data: applications and implementation. Minnesota University, Dept. of Comput. Sci. Minneapolis.
  3. Gonzalo, N., & Daniel, V. (2012). Space-Efficient Top-k Document Retrieval. Univ. of Chile, Dept. of Computer Science. Valdivia.
  4. Hon, W. K., Shah, R., & Wu, S. B. (2009). Efficient Index for Retrieving Top-k Most Frequent Documents. Springer, Heidelberg.
  5. Niko, V., & Veli, M. (2007). Space-efficient Algorithms for Document Retrieval. The University of Helsinki, Department of Computer Science. Finland.
  6. (1998). Augmenting suffix trees with applications. 6th Annual European Symposium on Algorithms (ESA 1998).
