Data, Methodology and the idea of the Main Path
Clinicians regularly publish papers reporting on more localised trials, and this information base is crucial to understanding how knowledge has grown within the community of PTCA practitioners. This literature gives us an effective and innovative way to trace the emergent problem sequence. The data we used were retrieved by searching the Institute for Scientific Information (ISI) database using a number of search terms determined after extensive discussions about the key developments in the field with medical practitioners and scholars at the University of Manchester. This search procedure yielded a database of 11,240 article titles published between 1979 and 2003, which together contained over 300,000 references.
A number of scripts written in Perl were used to extract information to help us understand the dimensions of the data. Profiles are given in Table 1. The data were also used to create a citation network; this is discussed further below. Starting with two articles published in 1979, the number of articles that delineate the field increased annually, reaching a peak in 1996 and declining slightly thereafter.
Some 29,883 authors contributed to this body of scholarship over the period and, as the table shows, the cumulative author count increased from 4 in 1979 to 29,883 by 2003. It is also interesting to note that single authors accounted for only 6 percent of publications, while 46 percent of all publications involved teams of more than 5 co-authors. In coronary angioplasty, as in other areas of medical research, there is clearly a great deal of teamwork: these are essentially communities of practice in the Brown and Duguid (1991) mould, where formal and informal institutional bases become important loci for the development of new knowledge.
Further analysis of the data enabled us to uncover other attributes. For example, one might assume a priori that publications in this area would be dominated by public institutional authors from universities, hospitals and research organisations. While this is the case, we nevertheless find that co-authorships with firm affiliations appear in the database from 1983 and by 2003 account for 7.6 percent of all institutional authors. We explored the dimensions of these firm-related collaborations using patents in a related paper (Mina et al., 2004). Further insights relate to the geographical distribution of the research activities. The ISI data do not easily allow a one-to-one mapping of author to address, as these are listed in separate, unconnected fields. However, from the address field we can identify and extract those papers that were written collaboratively across institutions or geographical domains. Annual publications are shown in the top left panel of Figure 1 and trace a sigmoid curve indicative of a near-complete life cycle of knowledge growth in the field. Note also that the top right panel shows the cumulative number of authors contributing to the field increasing exponentially. This is indicative of the growing pool of codified ideas and clinical study as researchers formulate and test hypotheses, and record and share the evidence within their community.
As mentioned above, our principal use of the data was the construction of a large citation network in which papers are nodes linked through their citations. Network analysts have long suggested that such networks are a fruitful avenue for exploring ideas related to the sociology of science. Indeed, citations have been acknowledged as explicit linkages between papers that have some important content in common since Garfield’s pathbreaking analysis in the 1950s and 1960s (Garfield, 1955; Garfield et al., 1964). De Solla Price (1965) proposed a model of networks of scientific papers by which scientific advances could be traced by analysing citation patterns in published journal articles.
The greater the number of citations to an earlier work, the greater the likelihood that this paper is a milestone or key event in that subject field (Garfield, 1970). Studying citation patterns between articles, journals and other publications can therefore provide new insights about the interaction between disciplines and individuals in relation to the growth of understanding. There are of course obvious limitations in taking bibliographic citations at face value. Differences exist in the propensity to cite across countries, cultures and disciplines (MacRoberts and MacRoberts, 1989), as well as in authors’ use of self-citation. Inappropriate, indirect and negative citation, window dressing and politically motivated flattery can, if widely practised, severely undermine such modes of analysis (Hummon and Doreian, 1989). Other problems recognised by users of bibliometric data include typographical errors, incorrect spelling of authors’ names and unsystematic citation, for example citing Smith, T. W. as Smith T. in some instances and as Smith T.W. in others. Moreover, the citation provided by the ISI credits only the first named author of any multi-authored paper, and this could bias the allocation of credit in any analysis.
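The name-variant problem mentioned above can be illustrated with a minimal sketch. It assumes that cited author strings put the surname first and that the surname is a single token; the function name normalise_author is hypothetical and is not taken from the scripts used in this study. Reducing names to surname plus first initial is only one possible convention, and it risks merging distinct authors who share a surname and initial.

```python
import re


def normalise_author(raw: str) -> str:
    """Reduce a cited author string to 'SURNAME F' (surname plus first initial).

    Assumption: the surname comes first and is a single token.
    """
    # Replace commas and full stops with spaces, upper-case, and split on whitespace.
    tokens = re.sub(r"[.,]", " ", raw).upper().split()
    if not tokens:
        return ""
    surname, rest = tokens[0], tokens[1:]
    # Concatenate the remaining tokens and keep only the first initial.
    initials = "".join(rest)
    return f"{surname} {initials[:1]}".strip()


variants = ["Smith, T. W.", "Smith T.", "Smith T.W."]
print({v: normalise_author(v) for v in variants})
```

Under this convention the three variants of the example collapse to a single key, at the cost of some loss of precision.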
With the above caveats in mind, we assume that if a publication is taken as an event reporting (new) knowledge, then its citation by subsequent scientific publications can be taken as a follow-up event that has in some way been affected by the original publication. Following this approach, we conceived the corpus of codified knowledge on coronary angioplasty, i.e. its scientific publications, as a large directed acyclic graph (DAG). Keeping this duality of event-publication and effect-citation in mind, the traditional citation network, made up of publications linked by their citations, can be read as such a graph. It is directed because any publication can only be cited by a subsequent publication; in other words, the graph is weakly ordered in time, or has a direction parallel with time. In certain cases a publication can be cited contemporaneously, but this does not change the property of weak ordering. The graph is also acyclic, in that an earlier publication cannot cite a later publication to form a cycle. In our case the network comprises around 94,400 nodes, including the lead authors of the 12,400 primary references plus each unique cited article, connected by 300,000 arcs.
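The directed and acyclic properties described above can be checked mechanically. The following is a minimal sketch, assuming the citation data are available as (citing, cited) pairs; the function names and the toy labels are hypothetical. Arcs are oriented from the cited paper to the citing paper, so that they run parallel with time, and Kahn's algorithm is used to test whether a time-consistent ordering exists.

```python
from collections import defaultdict, deque


def build_graph(citations):
    """Turn (citing, cited) pairs into arcs running from cited to citing paper."""
    successors = defaultdict(set)
    nodes = set()
    for citing, cited in citations:
        nodes.update((citing, cited))
        if citing != cited:  # ignore a paper listed as citing itself
            successors[cited].add(citing)
    return nodes, successors


def topological_order(nodes, successors):
    """Kahn's algorithm: return a time-consistent ordering, or None if a cycle exists."""
    indegree = {n: 0 for n in nodes}
    for src in successors:
        for dst in successors[src]:
            indegree[dst] += 1
    queue = deque(sorted(n for n in nodes if indegree[n] == 0))
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in sorted(successors.get(node, ())):
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)
    # If some node was never released, the graph contains a cycle.
    return order if len(order) == len(nodes) else None


# Toy example: B and C cite A, C also cites B, D cites C.
nodes, succ = build_graph([("B", "A"), ("C", "A"), ("C", "B"), ("D", "C")])
print(topological_order(nodes, succ))
```

A return value of None would flag a cycle, which in real bibliographic data could arise from data errors or from contemporaneous papers that cite each other; such cases would need to be resolved before any path-based analysis.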
In the area of social network analysis, the idea of the main path was first proposed by Hummon and Doreian (1989) in their analysis of the development of DNA theory. In this research and in a subsequent study of the literature on measures of centrality in social networks research (Hummon and Carley, 1993) distinctive pathways through the respective citation networks were found to be related to the key intellectual developments that defined the respective fields (see also Carley et al., 1993).
The main path captures a structural feature of a network in a way that contrasts with orthodox approaches to studying structure, such as bibliographic coupling or co-citation, which focus on the clustering of nodes. The novelty of Hummon and Doreian’s proposal is to focus on the links of the network rather than the nodes, that is, on the network’s connectivity. Recall from above that our citation network is a DAG; even though it has a temporal ordering, we are not yet in a position to say much about its structure. For all intents and purposes it is still a set of nodes connected by links of equivalent value. With this in mind, it is relatively easy to see that one can start at any early article (position) in this network and attempt to find a route (or routes) linking this node (article) to another published later in time. Hummon and Doreian used this basic idea, called a traversal path, to propose a way of valuing the network so that its most important parts, and especially its main path, can be extracted for further analysis.
The main path of the network refers to the ‘structurally determined most-used path’ in a network; it is the path with the highest traversal counts (Batagelj and Mrvar, 1998). A traversal count measures the number of times that a tie or link between articles is involved in connecting other articles in a citation network (Hummon and Doreian, 1989). Main path analysis determines all possible search paths through the network, from origin articles through to endpoint articles, and calculates the traversal count of each link in the network. It thus provides a longitudinal examination of how a citation network or research field has evolved through its citation patterns.
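To make the idea of a traversal count concrete, the following is a minimal sketch of one simple weighting: for each arc, the number of paths from any origin article (no predecessors) to any endpoint article (no successors) that pass through it. This is an illustrative assumption and is not necessarily identical to any of the indices proposed by Hummon and Doreian or to the implementation in Pajek. The main path is then extracted greedily by always following the outgoing arc with the highest count; ties are broken by label order, an arbitrary choice made only for reproducibility. All function names are hypothetical.

```python
from collections import defaultdict


def search_path_counts(arcs):
    """Count, for each arc (earlier, later), the origin-to-endpoint paths through it."""
    arcs = sorted(set(arcs))
    succ, pred, nodes = defaultdict(list), defaultdict(list), set()
    for u, v in arcs:
        succ[u].append(v)
        pred[v].append(u)
        nodes.update((u, v))

    # Topological order (Kahn's algorithm); fails if the graph has a cycle.
    indegree = {n: len(pred[n]) for n in nodes}
    ready = sorted(n for n in nodes if indegree[n] == 0)
    order = []
    while ready:
        n = ready.pop()
        order.append(n)
        for m in succ[n]:
            indegree[m] -= 1
            if indegree[m] == 0:
                ready.append(m)
    if len(order) != len(nodes):
        raise ValueError("citation graph contains a cycle")

    # Number of paths reaching each node from any origin article.
    paths_in = {}
    for n in order:
        paths_in[n] = sum(paths_in[p] for p in pred[n]) if pred[n] else 1
    # Number of paths leaving each node towards any endpoint article.
    paths_out = {}
    for n in reversed(order):
        paths_out[n] = sum(paths_out[s] for s in succ[n]) if succ[n] else 1

    counts = {(u, v): paths_in[u] * paths_out[v] for u, v in arcs}
    return counts, succ, pred


def main_path(arcs):
    """Greedy main path: start at the heaviest origin arc, then follow the heaviest arc."""
    counts, succ, pred = search_path_counts(arcs)
    if not counts:
        return []
    start_arcs = [a for a in sorted(counts) if not pred.get(a[0])]
    first = max(start_arcs, key=counts.get)  # max keeps the first of any tied arcs
    path = [first[0], first[1]]
    while succ.get(path[-1]):
        current = path[-1]
        path.append(max(sorted(succ[current]), key=lambda v: counts[(current, v)]))
    return path


toy = [("A", "C"), ("B", "C"), ("C", "D"), ("C", "E"),
       ("D", "F"), ("E", "F"), ("E", "G")]
print(search_path_counts(toy)[0])
print(main_path(toy))
```

In the toy network the arc from C to E carries four of the six origin-to-endpoint paths, so the greedy search passes through it. Because tied arcs are common, a fuller analysis would report all tied paths rather than a single one.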
The algorithm for extracting a main path is embedded in the software Pajek, a tool for visualising and analysing large networks (Batagelj and Mrvar, 1998). Batagelj (2002) implemented algorithms in Pajek to compute efficiently a number of indices suggested by Hummon and Doreian (1989), so that they can be used with networks of very large dimensions (up to several thousands of nodes or vertices). These three indices (NPPC, SPLC, SPNP), or weights of edges, provide us with a way to identify computationally the most important part of the citation network: the main path.
It is rather unfortunate that the algorithm is called the main-path algorithm. This label unnecessarily prejudices our expectations about the dynamics of the network of knowledge accumulation as captured by citations. The main-path algorithm in fact presents not only the main path but also the other paths that are explored, and it compares those paths. In this sense, the algorithm presents both the paths of exploration and the paths of exploitation evident in any process of knowledge accumulation or learning (March, 1994). We nevertheless retain the label because it is the well-known designation.