CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • harvard1
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf
Finding delta difference in large data sets
Luleå University of Technology, Department of Computer Science, Electrical and Space Engineering, Computer Science.
2019 (English)Independent thesis Basic level (professional degree), 10 credits / 15 HE creditsStudent thesis
Abstract [en]

To find out what differs between two versions of a file can be done with several different techniques and programs. These techniques and programs are often focusd on finding differences in text files, in documents, or in class files for programming. An example of a program is the popular git tool which focuses on displaying the difference between versions of files in a project. A common way to find these differences is to utilize an algorithm called Longest common subsequence, which focuses on finding the longest common subsequence in each file to find similarity between the files. By excluding all similarities in a file, all remaining text will be the differences between the files. The Longest Common Subsequence is often used to find the differences in an acceptable time. When two lines in a file is compared to see if they differ from each other hashing is used. The hash values for each correspondent line in both files will be compared. Hashing a line will give the content on that line a unique value. If as little as one character on a line is different between the version, the hash values for those lines will be different as well. These techniques are very useful when comparing two versions of a file with text content. With data from a database some, but not all, of these techniques can be useful. A big difference between data in a database and text in a file will be that content is not just added and delete but also updated. This thesis studies the problem on how to make use of these techniques when finding differences between large datasets, and doing this in a reasonable time, instead of finding differences in documents and files.  Three different methods are going to be studied in theory. These results will be provided in both time and space complexities. Finally, a selected one of these methods is further studied with implementation and testing. The reason only one of these three is implemented is because of time constraint. The one that got chosen had easy maintainability, an easy implementation, and maintains a good execution time.

Place, publisher, year, edition, pages
2019. , p. 34
Keywords [en]
Delta Difference Large Data Sets
National Category
Computer Sciences
Identifiers
URN: urn:nbn:se:ltu:diva-74943OAI: oai:DiVA.org:ltu-74943DiVA, id: diva2:1329764
External cooperation
Arctic Group AB
Subject / course
Student thesis, at least 15 credits
Educational program
Computer Engineering, bachelor's level
Supervisors
Examiners
Available from: 2019-07-04 Created: 2019-06-24 Last updated: 2019-07-04Bibliographically approved

Open Access in DiVA

fulltext(1229 kB)29 downloads
File information
File name FULLTEXT01.pdfFile size 1229 kBChecksum SHA-512
2d6eaf833a7445f0db186ed9fbe6bf6decb198e3d715a601acbf3e05e90a829262f25268949c61ec593504aa41ead7aaae7553909550784630fe4387ee57bb5f
Type fulltextMimetype application/pdf

By organisation
Computer Science
Computer Sciences

Search outside of DiVA

GoogleGoogle Scholar
Total: 29 downloads
The number of downloads is the sum of all downloads of full texts. It may include eg previous versions that are now no longer available

urn-nbn

Altmetric score

urn-nbn
Total: 84 hits
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • harvard1
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf