Search

Searching the cache archives with the MM3WebAssistant requires that the archives have been indexed first.
The search is available:

First, select a cache archive. Note that only indexed archives are available for selection. You can search for words, domains and URLs. Multiple search criteria are combined with a logical AND.

Search Terms

Search for words

  • Equal
    Input: Search-Word
    Output: pages with words which are equal to the Search-Word.
  • Word beginning
    Input: Search-Word*
    Output: pages with words which start with the Search-Word.
  • Word ending
    Input: *Search-Word
    Output: pages with words which end with the Search-Word.
  • Include
    Input: *Search-Word*
    Output: pages with words which include the Search-Word.
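
The four word patterns can be read as simple string tests. The following is a minimal sketch in Python, not the actual implementation of the MM3WebAssistant. It assumes that matching ignores case (the Indexer stores lowercase words) and that * occurs only at the start or end of the input. The function name word_matches is hypothetical.

```python
def word_matches(pattern, word):
    """Return True if `word` satisfies the search `pattern`.

    Hypothetical helper that illustrates the four pattern forms;
    the real matching logic of the MM3WebAssistant may differ.
    """
    pattern, word = pattern.lower(), word.lower()
    starts_with_star = pattern.startswith("*")
    ends_with_star = pattern.endswith("*")
    core = pattern.strip("*")
    if starts_with_star and ends_with_star:
        return core in word            # *Search-Word*  -> Include
    if ends_with_star:
        return word.startswith(core)   # Search-Word*   -> Word beginning
    if starts_with_star:
        return word.endswith(core)     # *Search-Word   -> Word ending
    return word == core                # Search-Word    -> Equal
```

For example, word_matches("cach*", "caches") is true, while word_matches("cache", "caches") is false because the plain form requires equality.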

Search for a domain

  • Equal
    Input: site:Search-Domain
    Output: pages from the Search-Domain.
  • Domain beginning
    Input: site:Search-Domain*
    Output: pages from domains which start with the Search-Domain.
  • Domain ending
    Input: site:*Search-Domain
    Output: pages from domains which end with the Search-Domain.
  • Include
    Input: site:*Search-Domain*
    Output: pages from domains which include the Search-Domain.

Search in part of a URL

  • Input: url:Search-URL
    Output: pages which include the Search-URL as a part of their URL.
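
The way word, site: and url: criteria combine with AND can be sketched as follows. This is only an illustration under stated assumptions: criteria are assumed to be separated by whitespace, matching is assumed to ignore case, and the names Query, parse_query and page_matches are hypothetical. The wildcard test uses fnmatchcase from the Python standard library, which also treats ? and [ ] as special characters, unlike the search syntax described here.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase


@dataclass
class Query:
    """Search criteria split by kind (hypothetical structure)."""
    words: list = field(default_factory=list)
    sites: list = field(default_factory=list)
    urls: list = field(default_factory=list)


def parse_query(text):
    """Split the input on whitespace and sort each term by its prefix."""
    query = Query()
    for term in text.split():
        if term.startswith("site:"):
            query.sites.append(term[len("site:"):])
        elif term.startswith("url:"):
            query.urls.append(term[len("url:"):])
        else:
            query.words.append(term)
    return query


def page_matches(query, page_words, domain, url):
    """Return True only if every criterion is satisfied (logical AND)."""
    words = [w.lower() for w in page_words]
    return (
        # each word pattern must match at least one word of the page
        all(any(fnmatchcase(w, p.lower()) for w in words) for p in query.words)
        # each site pattern must match the domain of the page
        and all(fnmatchcase(domain.lower(), p.lower()) for p in query.sites)
        # each url term must occur somewhere in the URL of the page
        and all(u.lower() in url.lower() for u in query.urls)
    )
```

With this sketch, the input archiv* site:*example.org url:docs matches a page from www.example.org whose URL contains docs and whose text contains a word starting with archiv.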

Output of a Search

The result of a search is displayed as a hit list. Each file (page) is listed with its URL, size, archiving date and 200 characters of text.
Text files are additionally marked with [TXT].
For HTML files, the title and the description are also shown.
The files are sorted alphabetically by URL. Multiple files from the same domain are shown indented. Files with a red archiving date were updated after their index was built. For each hit, the Marker link displays the file with the search words highlighted. Highlighting is not possible for all files.

Information about the Index

Word Histogram

The histogram lists the words in sorted order, together with the number of files in which each word occurs.

For alphabetical order, use the keyword wordAlphabetical with one of the following inputs.

  • All
    wordAlphabetical:*
  • Equal
    wordAlphabetical:Search-Word
  • Word beginning
    wordAlphabetical:Search-Word*
  • Word ending
    wordAlphabetical:*Search-Word
  • Include
    wordAlphabetical:*Search-Word*

To sort by frequency, use the keyword wordFrequency.

To sort by word length, use the keyword wordLength.
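
What the word histogram counts can be illustrated with a short sketch. It assumes an index that maps each file to the set of words it contains. The function names are hypothetical, and the sort directions (most frequent first, shortest first) are assumptions, since the direction is not stated here.

```python
from collections import Counter


def word_histogram(files):
    """Count, for each word, the number of files in which it occurs.

    `files` maps a file name to the words found in that file.
    """
    histogram = Counter()
    for words in files.values():
        histogram.update(set(words))  # count each word once per file
    return histogram


def sort_histogram(histogram, keyword):
    """Return (word, file count) pairs ordered as the keyword requests."""
    items = histogram.items()
    if keyword == "wordAlphabetical":
        return sorted(items)
    if keyword == "wordFrequency":
        # assumption: most frequent first, ties in alphabetical order
        return sorted(items, key=lambda kv: (-kv[1], kv[0]))
    if keyword == "wordLength":
        # assumption: shortest first, ties in alphabetical order
        return sorted(items, key=lambda kv: (len(kv[0]), kv[0]))
    raise ValueError("unknown keyword: " + keyword)
```

Note that the count is the number of files containing the word, not the total number of occurrences.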

Domain Histogram

The histogram lists the domains in alphabetical order, together with the number of files belonging to each domain. To display it, use the keyword siteAlphabetical.

  • All
    siteAlphabetical:*
  • Equal
    siteAlphabetical:Search-Domain
  • Domain beginning
    siteAlphabetical:Search-Domain*
  • Domain ending
    siteAlphabetical:*Search-Domain
  • Include
    siteAlphabetical:*Search-Domain*

To sort by frequency, use the keyword siteFrequency.

Indexing

Searching the cache archives with the MM3WebAssistant requires an index. Text and HTML files (pages) are indexed. The algorithm of the Indexer is largely language-independent. Capital letters are always converted to the corresponding lowercase letters, and only Latin characters and some special characters of European languages are supported.
Please contact MM3Tools if you need support for another language.
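
The normalization described above can be approximated with a few lines of Python. This is only a sketch under stated assumptions: the exact set of supported special characters is not listed here, so the pattern below assumes the basic Latin letters plus the accented letters of the Latin-1 range, and it assumes that digits and punctuation separate words. The names WORD_PATTERN and tokenize are hypothetical.

```python
import re

# Assumption: a-z plus the Latin-1 letters from ß to ÿ (without the ÷ sign).
WORD_PATTERN = re.compile(r"[a-zß-öø-ÿ]+")


def tokenize(text):
    """Lowercase the text and return the words made of supported letters."""
    return WORD_PATTERN.findall(text.lower())
```

For example, the words Cache and cache are reduced to the same index entry, so a search does not have to distinguish between them.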

Script file

You start the indexing with one of the following script files:

  • For Microsoft operating systems, the BAT file:
    MM3-Indexer.bat
  • For Linux and UNIX operating systems, the script:
    MM3-Indexer.sh
  • For Apple's Mac OS X operating system:
    MM3-Indexer.sh

Configuration of the Indexer

You can configure the following settings for the indexing:

  • Select the cache archive to be indexed
  • Specify the minimum word length.
    Only words that have at least the minimum word length are included in the index. Put simply, the word length is the number of characters in the word.
  • Display the positive and negative word lists

    • Negative word list
      These words are not included in the index.
    • Positive word list
      These words are included even if they are shorter than the minimum word length.

    The word lists are stored in the files positive.*.txt and negative.*.txt in the folder MM3-WebAssistantProfessional/config/search/. You can adapt the word lists to your needs. The character * stands for a language-specific word list, e.g. en for English and de for German. All files whose names follow this pattern are used. We recommend identifying the language by its ISO language code (ISO-639).
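
How the minimum word length and the two word lists interact can be sketched as follows. This is an illustration, not the code of the MM3Indexer. It assumes that each word list file contains one word per line in UTF-8 encoding, and that the negative list takes precedence if a word appears in both lists. The function names load_word_lists and is_indexed are hypothetical.

```python
from pathlib import Path


def load_word_lists(folder, kind):
    """Merge all files named `<kind>.*.txt` in `folder` into one set.

    `kind` is "positive" or "negative"; the * part is the language code.
    """
    words = set()
    for path in sorted(Path(folder).glob(kind + ".*.txt")):
        for line in path.read_text(encoding="utf-8").splitlines():
            word = line.strip().lower()
            if word:
                words.add(word)
    return words


def is_indexed(word, min_length, positive, negative):
    """Decide whether `word` is included in the index."""
    word = word.lower()
    if word in negative:
        return False   # negative list: never indexed
    if word in positive:
        return True    # positive list: indexed despite being too short
    return len(word) >= min_length
```

With a minimum word length of 4, a word such as tv is only indexed if it appears in one of the positive.*.txt files.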

Start the indexing once you have made the settings. The time required depends on the size of the archives, and the indexing can take some time. During this time, the MM3WebAssistant should not be used and the cache archives should not be changed.

Log output

The output of the MM3Indexer shows:

  • Indexed cache archives
  • Number of files still to be indexed
  • Domain currently being indexed
  • Time elapsed so far
  • Progress bar
  • Summary statistics about the indexing

Out of Memory

The memory required depends on the size of the archives and the chosen minimum word length. If the MM3Indexer needs more memory, you can increase the memory available to the program in the script file. Alternatively, you can split the cache archive into several archives or increase the minimum word length.