-
Sparck Jones, K.: Automatic keyword classification for information retrieval (1971)
0.15
0.14769812 = product of:
0.5907925 = sum of:
0.5907925 = weight(_text_:jones in 5175) [ClassicSimilarity], result of:
0.5907925 = score(doc=5175,freq=2.0), product of:
0.43290398 = queryWeight, product of:
6.176015 = idf(docFreq=250, maxDocs=44421)
0.070094384 = queryNorm
1.3647194 = fieldWeight in 5175, product of:
1.4142135 = tf(freq=2.0), with freq of:
2.0 = termFreq=2.0
6.176015 = idf(docFreq=250, maxDocs=44421)
0.15625 = fieldNorm(doc=5175)
0.25 = coord(1/4)
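The explanation trees in this listing are Lucene `explain()` output under ClassicSimilarity TF-IDF. A minimal sketch (function and variable names are ours) that reproduces the numbers of the first entry from its leaf values:

```python
import math

def idf(doc_freq: int, max_docs: int) -> float:
    # ClassicSimilarity: idf = 1 + ln(maxDocs / (docFreq + 1))
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def explain_score(query_norm, doc_freq, max_docs, freq, field_norm, coord):
    i = idf(doc_freq, max_docs)                 # 6.176015 in the tree above
    query_weight = i * query_norm               # queryWeight = idf * queryNorm
    tf = math.sqrt(freq)                        # tf = sqrt(termFreq)
    field_weight = tf * i * field_norm          # fieldWeight = tf * idf * fieldNorm
    return query_weight * field_weight * coord  # coord(1/4) scales by matched clauses

# Leaf values from the first entry (doc 5175, query term "jones"):
score = explain_score(query_norm=0.070094384, doc_freq=250, max_docs=44421,
                      freq=2.0, field_norm=0.15625, coord=0.25)
```

Only one of four query clauses matched, hence coord(1/4) = 0.25; the entries further down differ only in freq and fieldNorm.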
-
Jones, K.P.: Natural-language processing and automatic indexing : a reply (1990)
0.12
0.11815849 = product of:
0.47263396 = sum of:
0.47263396 = weight(_text_:jones in 393) [ClassicSimilarity], result of:
0.47263396 = score(doc=393,freq=2.0), product of:
0.43290398 = queryWeight, product of:
6.176015 = idf(docFreq=250, maxDocs=44421)
0.070094384 = queryNorm
1.0917755 = fieldWeight in 393, product of:
1.4142135 = tf(freq=2.0), with freq of:
2.0 = termFreq=2.0
6.176015 = idf(docFreq=250, maxDocs=44421)
0.125 = fieldNorm(doc=393)
0.25 = coord(1/4)
-
Sparck Jones, K.; Tait, J.I.: Automatic search term variant generation (1984)
0.12
0.11815849 = product of:
0.47263396 = sum of:
0.47263396 = weight(_text_:jones in 2917) [ClassicSimilarity], result of:
0.47263396 = score(doc=2917,freq=2.0), product of:
0.43290398 = queryWeight, product of:
6.176015 = idf(docFreq=250, maxDocs=44421)
0.070094384 = queryNorm
1.0917755 = fieldWeight in 2917, product of:
1.4142135 = tf(freq=2.0), with freq of:
2.0 = termFreq=2.0
6.176015 = idf(docFreq=250, maxDocs=44421)
0.125 = fieldNorm(doc=2917)
0.25 = coord(1/4)
-
Sparck Jones, K.; Jackson, D.M.: The use of automatically obtained keyword classification for information retrieval (1970)
0.12
0.11815849 = product of:
0.47263396 = sum of:
0.47263396 = weight(_text_:jones in 5176) [ClassicSimilarity], result of:
0.47263396 = score(doc=5176,freq=2.0), product of:
0.43290398 = queryWeight, product of:
6.176015 = idf(docFreq=250, maxDocs=44421)
0.070094384 = queryNorm
1.0917755 = fieldWeight in 5176, product of:
1.4142135 = tf(freq=2.0), with freq of:
2.0 = termFreq=2.0
6.176015 = idf(docFreq=250, maxDocs=44421)
0.125 = fieldNorm(doc=5176)
0.25 = coord(1/4)
-
Sparck Jones, K.: Index term weighting (1973)
0.12
0.11815849 = product of:
0.47263396 = sum of:
0.47263396 = weight(_text_:jones in 5490) [ClassicSimilarity], result of:
0.47263396 = score(doc=5490,freq=2.0), product of:
0.43290398 = queryWeight, product of:
6.176015 = idf(docFreq=250, maxDocs=44421)
0.070094384 = queryNorm
1.0917755 = fieldWeight in 5490, product of:
1.4142135 = tf(freq=2.0), with freq of:
2.0 = termFreq=2.0
6.176015 = idf(docFreq=250, maxDocs=44421)
0.125 = fieldNorm(doc=5490)
0.25 = coord(1/4)
-
Dow Jones unveils knowledge indexing system (1997)
0.12
0.11815849 = product of:
0.47263396 = sum of:
0.47263396 = weight(_text_:jones in 751) [ClassicSimilarity], result of:
0.47263396 = score(doc=751,freq=8.0), product of:
0.43290398 = queryWeight, product of:
6.176015 = idf(docFreq=250, maxDocs=44421)
0.070094384 = queryNorm
1.0917755 = fieldWeight in 751, product of:
2.828427 = tf(freq=8.0), with freq of:
8.0 = termFreq=8.0
6.176015 = idf(docFreq=250, maxDocs=44421)
0.0625 = fieldNorm(doc=751)
0.25 = coord(1/4)
- Abstract
- Dow Jones Interactive Publishing has developed a sophisticated automatic knowledge indexing system that will allow searchers of the Dow Jones News/Retrieval service to get highly targeted results from a search in the service's Publications Library. Instead of relying on a thesaurus of company names, the new system uses a combination of a basic algorithm plus unique rules based on the editorial styles of individual publications in the Library. Dow Jones has also announced its acceptance of the definitions of 'selected full text' and 'full text' from Bibliodata's Fulltext Sources Online directory.
-
Porter, M.F.: An algorithm for suffix stripping (1980)
0.09
0.088618875 = product of:
0.3544755 = sum of:
0.3544755 = weight(_text_:jones in 4122) [ClassicSimilarity], result of:
0.3544755 = score(doc=4122,freq=2.0), product of:
0.43290398 = queryWeight, product of:
6.176015 = idf(docFreq=250, maxDocs=44421)
0.070094384 = queryNorm
0.8188317 = fieldWeight in 4122, product of:
1.4142135 = tf(freq=2.0), with freq of:
2.0 = termFreq=2.0
6.176015 = idf(docFreq=250, maxDocs=44421)
0.09375 = fieldNorm(doc=4122)
0.25 = coord(1/4)
- Footnote
- Reprinted in: Readings in information retrieval. Ed.: K. Sparck Jones and P. Willett. San Francisco: Morgan Kaufmann 1997, pp. 313-316.
-
Jones, R.L.: Automatic document content analysis : the AIDA project (1992)
0.07
0.07384906 = product of:
0.29539624 = sum of:
0.29539624 = weight(_text_:jones in 2606) [ClassicSimilarity], result of:
0.29539624 = score(doc=2606,freq=2.0), product of:
0.43290398 = queryWeight, product of:
6.176015 = idf(docFreq=250, maxDocs=44421)
0.070094384 = queryNorm
0.6823597 = fieldWeight in 2606, product of:
1.4142135 = tf(freq=2.0), with freq of:
2.0 = termFreq=2.0
6.176015 = idf(docFreq=250, maxDocs=44421)
0.078125 = fieldNorm(doc=2606)
0.25 = coord(1/4)
-
Salton, G.; Wong, A.; Yang, C.S.: A vector space model for automatic indexing (1975)
0.07
0.07384906 = product of:
0.29539624 = sum of:
0.29539624 = weight(_text_:jones in 2934) [ClassicSimilarity], result of:
0.29539624 = score(doc=2934,freq=2.0), product of:
0.43290398 = queryWeight, product of:
6.176015 = idf(docFreq=250, maxDocs=44421)
0.070094384 = queryNorm
0.6823597 = fieldWeight in 2934, product of:
1.4142135 = tf(freq=2.0), with freq of:
2.0 = termFreq=2.0
6.176015 = idf(docFreq=250, maxDocs=44421)
0.078125 = fieldNorm(doc=2934)
0.25 = coord(1/4)
- Footnote
- Reprinted in: Readings in information retrieval. Ed.: K. Sparck Jones and P. Willett. San Francisco: Morgan Kaufmann 1997, pp. 273-280.
-
Salton, G.; Allan, J.; Buckley, C.; Singhal, A.: Automatic analysis, theme generation, and summarization of machine readable texts (1994)
0.07
0.07384906 = product of:
0.29539624 = sum of:
0.29539624 = weight(_text_:jones in 2949) [ClassicSimilarity], result of:
0.29539624 = score(doc=2949,freq=2.0), product of:
0.43290398 = queryWeight, product of:
6.176015 = idf(docFreq=250, maxDocs=44421)
0.070094384 = queryNorm
0.6823597 = fieldWeight in 2949, product of:
1.4142135 = tf(freq=2.0), with freq of:
2.0 = termFreq=2.0
6.176015 = idf(docFreq=250, maxDocs=44421)
0.078125 = fieldNorm(doc=2949)
0.25 = coord(1/4)
- Footnote
- Reprinted in: Readings in information retrieval. Ed.: K. Sparck Jones and P. Willett. San Francisco: Morgan Kaufmann 1997, pp. 478-483.
-
Biebricher, N.; Fuhr, N.; Lustig, G.; Schwantner, M.; Knorz, G.: The automatic indexing system AIR/PHYS : from research to application (1988)
0.07
0.07384906 = product of:
0.29539624 = sum of:
0.29539624 = weight(_text_:jones in 2952) [ClassicSimilarity], result of:
0.29539624 = score(doc=2952,freq=2.0), product of:
0.43290398 = queryWeight, product of:
6.176015 = idf(docFreq=250, maxDocs=44421)
0.070094384 = queryNorm
0.6823597 = fieldWeight in 2952, product of:
1.4142135 = tf(freq=2.0), with freq of:
2.0 = termFreq=2.0
6.176015 = idf(docFreq=250, maxDocs=44421)
0.078125 = fieldNorm(doc=2952)
0.25 = coord(1/4)
- Footnote
- Reprinted in: Readings in information retrieval. Ed.: K. Sparck Jones and P. Willett. San Francisco: Morgan Kaufmann 1997, pp. 513-517.
-
Tavakolizadeh-Ravari, M.: Analysis of the long term dynamics in thesaurus developments and its consequences (2017)
0.06
0.06056326 = product of:
0.12112652 = sum of:
0.04818721 = weight(_text_:und in 4081) [ClassicSimilarity], result of:
0.04818721 = score(doc=4081,freq=20.0), product of:
0.15546227 = queryWeight, product of:
2.217899 = idf(docFreq=13141, maxDocs=44421)
0.070094384 = queryNorm
0.3099608 = fieldWeight in 4081, product of:
4.472136 = tf(freq=20.0), with freq of:
20.0 = termFreq=20.0
2.217899 = idf(docFreq=13141, maxDocs=44421)
0.03125 = fieldNorm(doc=4081)
0.07293931 = weight(_text_:headings in 4081) [ClassicSimilarity], result of:
0.07293931 = score(doc=4081,freq=2.0), product of:
0.34012607 = queryWeight, product of:
4.8524013 = idf(docFreq=942, maxDocs=44421)
0.070094384 = queryNorm
0.21444786 = fieldWeight in 4081, product of:
1.4142135 = tf(freq=2.0), with freq of:
2.0 = termFreq=2.0
4.8524013 = idf(docFreq=942, maxDocs=44421)
0.03125 = fieldNorm(doc=4081)
0.5 = coord(2/4)
- Abstract
- The thesis analyses the dynamic development and use of thesaurus terms. In addition, it focuses on the factors that influence the number of index terms per document or journal. MeSH and the corresponding database MEDLINE served as the objects of study. The main findings are: 1. The MeSH thesaurus has grown logarithmically through three distinct phases. Such a thesaurus should follow the equation "T = 3,076.6 ln(d) - 22,695 + 0.0039d" (T = terms, ln = natural logarithm, d = documents). To construct such a thesaurus, one therefore needs about 1,600 documents covering the different topics of the thesaurus's domain. The dynamic development of thesauri such as MeSH requires the introduction of one new term per 256 newly indexed documents. 2. The distribution of thesaurus terms yielded three categories: heavily used, normally used, and rarely used headings. The last group is in a test phase, while in the first and second categories the newly added descriptors drive thesaurus growth. 3. There is a logarithmic relationship between the number of index terms per article and its page count, for articles between one and twenty-one pages. 4. Journal articles that appear in MEDLINE with abstracts receive almost two more descriptors. 5. The findability of non-English-language documents in MEDLINE is lower than that of English-language documents. 6. Articles from journals with an impact factor between zero and fifteen do not receive more index terms than those of the other journals covered by MEDLINE. 7. Within an indexing system, different journals carry more or less weight in their findability. The distribution of index terms per page showed that MEDLINE contains three categories of publications. In addition, there are a few strongly favoured journals.
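The growth model in point 1 can be checked numerically. A short sketch (the function name is ours) confirming the two figures quoted in the abstract, the roughly 1,600-document threshold and the one-term-per-256-documents rate:

```python
import math

def mesh_terms(d: float) -> float:
    # Growth model from the abstract: T = 3,076.6 ln(d) - 22,695 + 0.0039 d
    return 3076.6 * math.log(d) - 22695 + 0.0039 * d

# The model crosses zero near d = 1,600: roughly the corpus size at which
# such a thesaurus starts to accumulate terms.
threshold_ok = mesh_terms(1500) < 0 < mesh_terms(1600)

# For large d the marginal growth approaches 0.0039 = 1/256 terms per
# document, i.e. about one new term per 256 newly indexed documents.
marginal = mesh_terms(10**8 + 1) - mesh_terms(10**8)
```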
- Footnote
- Dissertation, Humboldt-Universität zu Berlin - Institut für Bibliotheks- und Informationswissenschaft.
- Imprint
- Berlin : Humboldt-Universität zu Berlin / Institut für Bibliotheks- und Informationswissenschaft
- Theme
Conception and application of the thesaurus principle
-
Koryconski, C.; Newell, A.F.: Natural-language processing and automatic indexing (1990)
0.06
0.059079245 = product of:
0.23631698 = sum of:
0.23631698 = weight(_text_:jones in 2312) [ClassicSimilarity], result of:
0.23631698 = score(doc=2312,freq=2.0), product of:
0.43290398 = queryWeight, product of:
6.176015 = idf(docFreq=250, maxDocs=44421)
0.070094384 = queryNorm
0.54588777 = fieldWeight in 2312, product of:
1.4142135 = tf(freq=2.0), with freq of:
2.0 = termFreq=2.0
6.176015 = idf(docFreq=250, maxDocs=44421)
0.0625 = fieldNorm(doc=2312)
0.25 = coord(1/4)
- Abstract
- The task of producing satisfactory indexes by automatic means has been tackled on two fronts: by statistical analysis of text and by attempting content analysis of the text in much the same way as a human indexer does. Though statistical techniques have a lot to offer for free-text database systems, neither method has had much success with back-of-the-book indexing. This review examines some problems associated with the application of natural-language processing techniques to book texts. See also the reply by K.P. Jones.
-
Pritchard-Schoch, T.: Natural language comes of age (1993)
0.06
0.059079245 = product of:
0.23631698 = sum of:
0.23631698 = weight(_text_:jones in 3570) [ClassicSimilarity], result of:
0.23631698 = score(doc=3570,freq=2.0), product of:
0.43290398 = queryWeight, product of:
6.176015 = idf(docFreq=250, maxDocs=44421)
0.070094384 = queryNorm
0.54588777 = fieldWeight in 3570, product of:
1.4142135 = tf(freq=2.0), with freq of:
2.0 = termFreq=2.0
6.176015 = idf(docFreq=250, maxDocs=44421)
0.0625 = fieldNorm(doc=3570)
0.25 = coord(1/4)
- Abstract
- Discusses natural language and WIN (Westlaw Is Natural), the natural-language search implementation for Westlaw's full-text legal documents. Natural language is not artificial intelligence but a hybrid of linguistics, mathematics and statistics. Outlines three classes of retrieval models. Explains how Westlaw processes an English query. Assesses WIN and covers WIN enhancements; the natural language features of Congressional Quarterly's Washington Alert, using a document as a query; the Personal Librarian front-end search software; and DowQuest from Dow Jones News/Retrieval. Considers whether natural language encourages fuzzy thinking and whether Boolean logic will still be needed.
-
Needham, R.M.; Sparck Jones, K.: Keywords and clumps (1985)
0.06
0.057796028 = product of:
0.23118411 = sum of:
0.23118411 = weight(_text_:jones in 4645) [ClassicSimilarity], result of:
0.23118411 = score(doc=4645,freq=10.0), product of:
0.43290398 = queryWeight, product of:
6.176015 = idf(docFreq=250, maxDocs=44421)
0.070094384 = queryNorm
0.5340309 = fieldWeight in 4645, product of:
3.1622777 = tf(freq=10.0), with freq of:
10.0 = termFreq=10.0
6.176015 = idf(docFreq=250, maxDocs=44421)
0.02734375 = fieldNorm(doc=4645)
0.25 = coord(1/4)
- Abstract
- The selection that follows was chosen as it represents "a very early paper on the possibilities allowed by computers in documentation." In the early 1960s computers were being used to provide simple automatic indexing systems wherein keywords were extracted from documents. The problem with such systems was that they lacked vocabulary control, thus documents related in subject matter were not always collocated in retrieval. To improve retrieval by improving recall is the raison d'être of vocabulary control tools such as classifications and thesauri. The question arose whether it was possible by automatic means to construct classes of terms which, when substituted one for another, could be used to improve retrieval performance. One of the first theoretical approaches to this question was initiated by R.M. Needham and Karen Sparck Jones at the Cambridge Language Research Institute in England. The question was later pursued using experimental methodologies by Sparck Jones, who, as a Senior Research Associate in the Computer Laboratory at the University of Cambridge, has devoted her life's work to research in information retrieval and automatic natural language processing. Based on the principles of numerical taxonomy, automatic classification techniques start from the premise that two objects are similar to the degree that they share attributes in common. When these two objects are keywords, their similarity is measured in terms of the number of documents they index in common. Step 1 in automatic classification is to compute mathematically the degree to which two terms are similar. Step 2 is to group together those terms that are "most similar" to each other, forming equivalence classes of intersubstitutable terms. The technique for forming such classes varies and is the factor that characteristically distinguishes different approaches to automatic classification.
The technique used by Needham and Sparck Jones, that of clumping, is described in the selection that follows. Questions that must be asked are whether the use of automatically generated classes really does improve retrieval performance and whether there is a true economic advantage in substituting mechanical for manual labor. Several years after her work with clumping, Sparck Jones was to observe that while it was not wholly satisfactory in itself, it was valuable in that it stimulated research into automatic classification. To this it might be added that it was valuable in that it introduced to library/information science the methods of numerical taxonomy, thus stimulating us to think again about the fundamental nature and purpose of classification. In this connection it might be useful to review how automatically derived classes differ from those of manually constructed classifications: 1) the manner of their derivation is purely a posteriori, the ultimate operationalization of the principle of literary warrant; 2) the relationship between members forming such classes is essentially statistical; the members of a given class are similar to each other not because they possess the class-defining characteristic but by virtue of sharing a family resemblance; and finally, 3) automatically derived classes are not related meaningfully one to another, that is, they are not ordered in traditional hierarchical and precedence relationships.
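Step 1 described above, measuring how similar two keywords are by the documents they index in common, can be sketched as follows. The Dice coefficient used here is one of several association measures of the period, not necessarily the exact one Needham and Sparck Jones adopted:

```python
def cooccurrence_similarity(docs_a: set, docs_b: set) -> float:
    # Dice coefficient over the sets of documents each keyword indexes:
    # 2 * |intersection| / (|A| + |B|)
    if not docs_a or not docs_b:
        return 0.0
    return 2 * len(docs_a & docs_b) / (len(docs_a) + len(docs_b))

# Toy postings: keyword -> documents it was extracted from
postings = {
    "retrieval": {1, 2, 3, 5},
    "search":    {2, 3, 5, 8},
    "botany":    {9},
}
sim = cooccurrence_similarity(postings["retrieval"], postings["search"])
sim_zero = cooccurrence_similarity(postings["retrieval"], postings["botany"])
```

Step 2, the clump-forming step, is what varied between approaches and is not reproduced here.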
-
Plaunt, C.; Norgard, B.A.: An association-based method for automatic indexing with a controlled vocabulary (1998)
0.06
0.055832524 = product of:
0.2233301 = sum of:
0.2233301 = weight(_text_:headings in 2794) [ClassicSimilarity], result of:
0.2233301 = score(doc=2794,freq=12.0), product of:
0.34012607 = queryWeight, product of:
4.8524013 = idf(docFreq=942, maxDocs=44421)
0.070094384 = queryNorm
0.6566098 = fieldWeight in 2794, product of:
3.4641016 = tf(freq=12.0), with freq of:
12.0 = termFreq=12.0
4.8524013 = idf(docFreq=942, maxDocs=44421)
0.0390625 = fieldNorm(doc=2794)
0.25 = coord(1/4)
- Abstract
- In this article, we describe and test a two-stage algorithm based on a lexical collocation technique which maps from the lexical clues contained in a document representation into a controlled vocabulary list of subject headings. Using a collection of 4,626 INSPEC documents, we create a 'dictionary' of associations between the lexical items contained in the titles, authors, and abstracts, and the controlled vocabulary subject headings assigned to those records by human indexers, using a likelihood ratio statistic as the measure of association. In the deployment stage, we use the dictionary to predict which of the controlled vocabulary subject headings best describe new documents when they are presented to the system. Our evaluation of this algorithm, in which we compare the automatically assigned subject headings to the subject headings assigned to the test documents by human catalogers, shows that we can obtain results comparable to, and consistent with, human cataloging. In effect we have cast this as a classic partial-match information retrieval problem. We consider the problem to be one of 'retrieving' (or assigning) the most probably 'relevant' (or correct) controlled vocabulary subject headings to a document, based on the clues contained in that document.
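The abstract does not spell out which likelihood ratio statistic is meant; Dunning's log-likelihood ratio (G²) is the standard such measure for lexical association and is sketched here as a plausible stand-in, computed over the 2x2 contingency table of a lexical item co-occurring (or not) with a subject heading:

```python
import math

def llr(k11: int, k12: int, k21: int, k22: int) -> float:
    # Dunning's G^2 over a 2x2 contingency table:
    # k11 = records with both the lexical item and the heading,
    # k12 = item without the heading, k21 = heading without the item,
    # k22 = neither. Large values indicate strong association.
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, rows[0], cols[0]), (k12, rows[0], cols[1]),
             (k21, rows[1], cols[0]), (k22, rows[1], cols[1]))
    # Sum k * ln(observed / expected) over non-empty cells
    return 2 * sum(k * math.log(k * n / (r * c)) for k, r, c in cells if k)

# An item that co-occurs with a heading far more often than chance:
strong = llr(10, 90, 100, 9800)
# Perfect independence (observed equals expected in every cell) scores 0:
independent = llr(10, 90, 990, 8910)
```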
-
Olsgaard, J.N.; Evans, E.J.: Improving keyword indexing (1981)
0.05
0.045587067 = product of:
0.18234827 = sum of:
0.18234827 = weight(_text_:headings in 5064) [ClassicSimilarity], result of:
0.18234827 = score(doc=5064,freq=2.0), product of:
0.34012607 = queryWeight, product of:
4.8524013 = idf(docFreq=942, maxDocs=44421)
0.070094384 = queryNorm
0.53611964 = fieldWeight in 5064, product of:
1.4142135 = tf(freq=2.0), with freq of:
2.0 = termFreq=2.0
4.8524013 = idf(docFreq=942, maxDocs=44421)
0.078125 = fieldNorm(doc=5064)
0.25 = coord(1/4)
- Abstract
- This communication examines some of the most frequently cited criticisms of keyword indexing. These criticisms include (1) absence of general subject headings, (2) limited entry points, and (3) irrelevant indexing. Some solutions are suggested to address these criticisms.
-
Junger, U.: Can indexing be automated? : the example of the Deutsche Nationalbibliothek (2012)
0.05
0.045128893 = product of:
0.18051557 = sum of:
0.18051557 = weight(_text_:headings in 2717) [ClassicSimilarity], result of:
0.18051557 = score(doc=2717,freq=4.0), product of:
0.34012607 = queryWeight, product of:
4.8524013 = idf(docFreq=942, maxDocs=44421)
0.070094384 = queryNorm
0.5307314 = fieldWeight in 2717, product of:
2.0 = tf(freq=4.0), with freq of:
4.0 = termFreq=4.0
4.8524013 = idf(docFreq=942, maxDocs=44421)
0.0546875 = fieldNorm(doc=2717)
0.25 = coord(1/4)
- Abstract
- The German subject headings authority file (Schlagwortnormdatei/SWD) provides a broad controlled vocabulary for indexing documents on all subjects. While the vocabulary has traditionally been used for intellectual subject cataloguing, primarily of books, the Deutsche Nationalbibliothek (DNB, German National Library) has been working on developing and implementing procedures for the automated assignment of subject headings to online publications. This project, its results, and its problems are sketched in the paper.
-
Short, M.: Text mining and subject analysis for fiction; or, using machine learning and information extraction to assign subject headings to dime novels (2019)
0.05
0.045128893 = product of:
0.18051557 = sum of:
0.18051557 = weight(_text_:headings in 481) [ClassicSimilarity], result of:
0.18051557 = score(doc=481,freq=4.0), product of:
0.34012607 = queryWeight, product of:
4.8524013 = idf(docFreq=942, maxDocs=44421)
0.070094384 = queryNorm
0.5307314 = fieldWeight in 481, product of:
2.0 = tf(freq=4.0), with freq of:
4.0 = termFreq=4.0
4.8524013 = idf(docFreq=942, maxDocs=44421)
0.0546875 = fieldNorm(doc=481)
0.25 = coord(1/4)
- Abstract
- This article describes multiple experiments in text mining at Northern Illinois University that were undertaken to improve the efficiency and accuracy of cataloging. It focuses narrowly on subject analysis of dime novels, a format of inexpensive fiction that was popular in the United States between 1860 and 1915. NIU holds more than 55,000 dime novels in its collections, which it is in the process of comprehensively digitizing. Classification, keyword extraction, named-entity recognition, clustering, and topic modeling are discussed as means of assigning subject headings to improve their discoverability by researchers and to increase the productivity of digitization workflows.
-
Willis, C.; Losee, R.M.: A random walk on an ontology : using thesaurus structure for automatic subject indexing (2013)
0.04
0.04408872 = product of:
0.08817744 = sum of:
0.015238133 = weight(_text_:und in 2016) [ClassicSimilarity], result of:
0.015238133 = score(doc=2016,freq=2.0), product of:
0.15546227 = queryWeight, product of:
2.217899 = idf(docFreq=13141, maxDocs=44421)
0.070094384 = queryNorm
0.098018214 = fieldWeight in 2016, product of:
1.4142135 = tf(freq=2.0), with freq of:
2.0 = termFreq=2.0
2.217899 = idf(docFreq=13141, maxDocs=44421)
0.03125 = fieldNorm(doc=2016)
0.07293931 = weight(_text_:headings in 2016) [ClassicSimilarity], result of:
0.07293931 = score(doc=2016,freq=2.0), product of:
0.34012607 = queryWeight, product of:
4.8524013 = idf(docFreq=942, maxDocs=44421)
0.070094384 = queryNorm
0.21444786 = fieldWeight in 2016, product of:
1.4142135 = tf(freq=2.0), with freq of:
2.0 = termFreq=2.0
4.8524013 = idf(docFreq=942, maxDocs=44421)
0.03125 = fieldNorm(doc=2016)
0.5 = coord(2/4)
- Abstract
- Relationships between terms and features are an essential component of thesauri, ontologies, and a range of controlled vocabularies. In this article, we describe ways to identify important concepts in documents using the relationships in a thesaurus or other vocabulary structures. We introduce a methodology for the analysis and modeling of the indexing process based on a weighted random walk algorithm. The primary goal of this research is the analysis of the contribution of thesaurus structure to the indexing process. The resulting models are evaluated in the context of automatic subject indexing using four collections of documents pre-indexed with 4 different thesauri (AGROVOC [UN Food and Agriculture Organization], high-energy physics taxonomy [HEP], National Agricultural Library Thesaurus [NALT], and medical subject headings [MeSH]). We also introduce a thesaurus-centric matching algorithm intended to improve the quality of candidate concepts. In all cases, the weighted random walk improves automatic indexing performance over matching alone with an increase in average precision (AP) of 9% for HEP, 11% for MeSH, 35% for NALT, and 37% for AGROVOC. The results of the analysis support our hypothesis that subject indexing is in part a browsing process, and that using the vocabulary and its structure in a thesaurus contributes to the indexing process. The amount that the vocabulary structure contributes was found to differ among the 4 thesauri, possibly due to the vocabulary used in the corresponding thesauri and the structural relationships between the terms. Each of the thesauri and the manual indexing associated with it is characterized using the methods developed here.
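The weighted-random-walk idea can be illustrated with a toy sketch. The graph, weights, and the restart-style formulation below are our illustrative assumptions, not the paper's exact algorithm: mass starts on the concepts matched directly in a document and spreads along weighted thesaurus relationships, so strongly connected related concepts rise as indexing candidates:

```python
def random_walk_weights(graph, seeds, restart=0.15, steps=50):
    # graph: concept -> list of (neighbor, edge_weight) thesaurus relations
    # seeds: concept -> initial weight from direct term matching
    total = sum(seeds.values())
    p = {c: w / total for c, w in seeds.items()}
    for _ in range(steps):
        # with probability `restart`, jump back to a seed concept;
        # otherwise follow an outgoing relation, proportionally to its weight
        nxt = {c: restart * seeds.get(c, 0.0) / total for c in graph}
        for c, mass in p.items():
            edges = graph[c]
            z = sum(w for _, w in edges)
            for nb, w in edges:
                nxt[nb] += (1 - restart) * mass * w / z
        p = nxt
    return p

# Tiny hypothetical thesaurus fragment; "crops" was matched in the document.
thesaurus = {
    "agriculture": [("crops", 2.0), ("soil", 1.0)],
    "crops": [("agriculture", 2.0), ("wheat", 1.0)],
    "soil": [("agriculture", 1.0)],
    "wheat": [("crops", 1.0)],
}
weights = random_walk_weights(thesaurus, {"crops": 1.0})
```

In the article the walk is combined with a thesaurus-centric matching step for candidate concepts; only the diffusion idea is shown here.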
- Theme
- Conception and application of the thesaurus principle