File diff 20d1d0419125 → 2d694dafc2c9
www/conservancy/static/copyleft-compliance/vmware-code-similarity.html
Show inline comments
...
 
@@ -16,100 +16,100 @@
 
<p>Indeed, when the same test is run to compare Linux to the LLVM+Clang system, the &quot;ratio of similarity&quot; was 0.075%.</p>
 
<h1 id="general-comparison-of-linux-kernel-to-vmware-sources">General Comparison of Linux Kernel to VMware sources</h1>
 
<p>With the baseline established, we now begin relevant comparisons. First, we compare the Linux kernel version 2.6.34 to the sources released by VMware in their (partial) source release. the &quot;ratio of similarity&quot; between Linux 2.6.34 and VMware's partial source release is 20.72%. There is little question that much of VMware's kernel has come from Linux.</p>
 
<h1 id="methodology-of-showing-hellwigs-contributions-in-vmware-esxi-5.5-sources">Methodology Of Showing Hellwig's Contributions in VMware ESXi 5.5 Sources</h1>
 
<p>The following describes a methodology to show Hellwig's contributions to Linux, and how they compare to code found in VMware ESXi 5.5.</p>
 
<h2 id="extracting-hellwigs-contributions-from-linux-historical-repository">Extracting Hellwig's Contributions From Linux Historical Repository</h2>
 
<p>Excellent records exist of contributions made to Linux from 2002-02-04 through present date. From 2002-02-04 through 2005-04-03, Bitkeeper was used to store revision control history of Linux. Each improvement contributed to Linux has information regarding who placed the contribution in Linux, and a comment field in which the contributor can credit others, such as by noting that the contribution actually came from someone else.</p>
 
<p>I extracted from the historical Linux tree the identifying number of all commits that are either made with Hellwig in the official Author field, or where the person in the Author field left notes clearly indicating that the contribution was done by Hellwig. For the latter, the following regular expression search against the log file was used:</p>
 
<pre><code>(Submitted\s+by|original\s+patch|patch\s+(from|by)|originally\s+(from|by)).*Hellwig</code></pre>
 
<p>Specifically, I used <a href="https://github.com/conservancy/gpl-compliance-tools/blob/master/commit-id-list-matching-regex.plx">a script</a> to extract a list of commit ids from the <a href="git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git">historical Linux repository</a>. This method found 1,012 separate occasions of contribution by Hellwig from 2002-02-04 through 2005-04-03.</p>
 
<p>After finding these separate occasions of contribution, I then extracted the source code lines that Hellwig added or changed in each contribution in this repository. I did so by carefully cross-referencing the commits that Hellwig performed with the output of <code>git blame</code>. I specifically <a href="https://github.com/conservancy/gpl-compliance-tools/blob/master/extract-code-added-in-commits.plx">wrote a script</a> to carefully extracted only lines that Hellwig changed or added in that repository, and placed only those contributions identifiable as Hellwig's into new files whose named matched the original filenames. This created a corpus of code that can be verifiable as added or changed by Hellwig and no one else.</p>
 
<p>Here are the specific commands I ran:</p>
 
<pre><code>$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git linux-historical
 
$ ./commit-id-list-matching-regex.plx `pwd`/linux-historical/.git Hellwig &#39;(Submitted\s+by|originals+patch|patch\s+from|originally\s+by).*&#39; &gt; hellwig-historical.ids
 
$ ./extract-code-added-in-commits.plx --repository=`pwd`/linux-historical --output-dir=`pwd`/hellwig-historical --central-commit e7e173af42dbf37b1d946f9ee00219cb3b2bea6a --progress --blame-opts=-M --blame-opts=-C &lt; ./hellwig-historical.ids
 
$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git linux-current
 
$ ./commit-id-list-matching-regex.plx `pwd`/linux-current/.git Hellwig &#39;(Submitted\s+by|original\s+patch|patch\s+(from|by)|originally\s+(from|by)).*&#39; &gt; ./hellwig-current.ids
 
$ ./extract-code-added-in-commits.plx --progress --repository=`pwd`/linux-current --output-dir=`pwd`/hellwig-through-2.6.34 --fork-limit=14 --blame-opts=-M  --blame-opts=-M --blame-opts=-C --blame-opts=-C --central-commit e40152ee1e1c7a63f4777791863215e3faa37a86   &lt; hellwig-current.ids </code></pre>
 
<p>Note: e40152ee1e1c7a63f4777791863215e3faa37a86 is the 2.6.34 version created by Linus Torvalds <script type="text/javascript">
 
<!--
 
h='&#108;&#x69;&#110;&#x75;&#120;&#x2d;&#102;&#x6f;&#x75;&#110;&#100;&#x61;&#116;&#x69;&#x6f;&#110;&#46;&#x6f;&#114;&#x67;';a='&#64;';n='&#116;&#x6f;&#114;&#118;&#x61;&#108;&#100;&#x73;';e=n+a+h;
 
document.write('<a h'+'ref'+'="ma'+'ilto'+':'+e+'">'+e+'<\/'+'a'+'>');
 
// -->
 
</script><noscript>&#116;&#x6f;&#114;&#118;&#x61;&#108;&#100;&#x73;&#32;&#x61;&#116;&#32;&#108;&#x69;&#110;&#x75;&#120;&#x2d;&#102;&#x6f;&#x75;&#110;&#100;&#x61;&#116;&#x69;&#x6f;&#110;&#32;&#100;&#x6f;&#116;&#32;&#x6f;&#114;&#x67;</noscript> on 2010-05-16 14:17:36 -0700, with Git commit comment: &quot;Linus 2.6.34&quot;.</p>
 
<h2 id="comparing-hellwigs-contributions-from-linux-historical-repository-to-vmware-sources">Comparing Hellwig's Contributions From Linux Historical Repository to VMware Sources</h2>
 
<p>I then used this corpus as input to CCFinderX (similar to the other CCFinderX comparisons explained earlier). Specifically, this CCFinderX comparison compared all known Hellwig-contributed material from the historical Linux repository to the partial VMware source release. CCFinderX found a ratio of similarity of 0.0900% between Hellwig's code and the source code in VMware's (partial) source release. CCFinderX specifically identified 12 distinct locations where substantial sections of code contributed by Hellwig appeared in VMware's code.</p>
 
<p>Most notably, substantial portions of the the following core SCSI functions were found by this search technique: <code>__scsi_device_lookup</code> and <code>__scsi_get_command</code>, <code>mpt_get_product_name</code>, <code>scsi_proc_host_rm</code>, <code>mega_enum_raid_scsi</code>, <code>mega_m_to_n</code>, <code>mega_prepare_passthru</code>, <code>proc_scsi_show</code>, and <code>__down_read_trylock</code>.</p>
 
<h2 id="extracting-hellwigs-contributions-from-modern-linux-repository">Extracting Hellwig's Contributions From Modern Linux Repository</h2>
 
<p>Beginning on 2005-04-16, Linux began using the new Git system to store revision history. This history can be analyzed in a similar fashion as was done for the historical repository.</p>
 
<p>In this case, I picked a specific revision to center the analysis, the Linux 2.6.34 release from 2010-05-16. For the period from 2005-04-16 through 2010-05-16, I extracted from the modern Linux tree the identifying number of all commits that are either made with Hellwig in the official Author field, or where the person in the Author field left notes clearly indicating that the contribution was done by Hellwig. For the latter, the following regular expression search against the log file was used (as before):</p>
 
<pre><code>(Submitted\s+by|original\s+patch|patch\s+(from|by)|originally\s+(from|by)).*Hellwig</code></pre>
 
<p>Specifically, I used the <a href="https://github.com/conservancy/gpl-compliance-tools/blob/master/commit-id-list-matching-regex.plx">same script as before</a> to now extract a list of commit ids from the <a href="git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git">modern Linux repository</a>. This method found 2,595 separate occasions of contribution by Hellwig from 2005-04-16 through 2010-05-16.</p>
 
<p>As before, after finding these separate occasions of contribution, I then extracted the source code lines that Hellwig added or changed in each contribution in this repository. I did so by carefully cross-referencing the commits that Hellwig performed with the output of <code>git blame</code>. I specifically <a href="https://github.com/conservancy/gpl-compliance-tools/blob/master/extract-code-added-in-commits.plx">used the same script as before</a> to carefully extracted only lines that Hellwig changed or added in that repository, and placed only those contributions identifiable as Hellwig's into new files whose named matched the original filenames. This created a corpus of code that can be verifiable as added or changed by Hellwig and no one else.</p>
 
<h2 id="comparing-hellwigs-contributions-from-modern-linux-repository-to-vmware-sources">Comparing Hellwig's Contributions From Modern Linux Repository to VMware Sources</h2>
 
<p>I then used this corpus as input to CCFinderX again. Specifically, this CCFinderX comparison compared all known Hellwig-contributed material from the modern Linux repository to the partial VMware source release. CCFinderX found a ratio of similarity of 0.1615% between Hellwig's code and the source code in VMware's (partial) source release was contributed by Hellwig. CCFinderX specifically identified 23 distinct locations where substantial sections of code contributed by Hellwig appeared. These 23 locations are found in the following 19 functions: <code>mptsas_init</code>, <code>mptsas_get_linkerrors</code>, <code>megasas_build_and_issue_cmd</code>, <code>cciss_getgeo</code>, <code>mptsas_get_bay_identifier</code>, <code>phy_to_ioc</code>, <code>mptsas_sas_enclosure_pg0</code>, <code>SendIocInit</code>, <code>mptsas_parse_device_info</code>, <code>csmisas_sas_device_pg0</code>, <code>mptsas_sas_io_unit_pg0</code>, <code>mptsas_sas_io_unit_pg1</code>, <code>mptsas_sas_expander_pg1</code>, <code>mptsas_sas_enclosure_pg0</code>, <code>aac_handle_aif</code>, <code>mptsas_get_bay_identifier</code>, <code>mpt_host_page_alloc</code>, <code>mptsas_probe_one_phy</code>.</p>
 
<h2 id="changed-and-added-lines-create-an-impartial-picture">Changed And Added Lines Create an Impartial Picture</h2>
 
<p>In <a href="https://www.linuxfoundation.org/sites/main/files/publications/estimatinglinux.html"><em>Estimating the Total Cost of a Linux Distribution</em></a>, McPherson, Proffitt, and Hale-Evans write:</p>
 
<blockquote>
 
<p>Anyone who is familiar with kernel development, for instance, realizes that the highest man-power cost in its development is when code is deleted and modified. The amount of effort that goes into deleting and changing code, not just adding to it, is not reflected in the values associated with this estimate. Because in a collaborative development model, code is developed and then changed and deleted, the true value is far greater than the existing code base. Just think about the process: when a few lines of code are added to the kernel, for instance, many more have to be modified to be compatible with that change. The work that goes into understanding the dependencies and outcomes and then changing that code is not well represented in this study.</p>
 
</blockquote>
 
<p>Therefore, the process described herein, which ignores lines that are <em>deleted</em> (thus streamlining and improving code), also ignores a fundamental tenant of software development. Namely, making code smaller, more expressive, and more concise yields better-designed and more easily maintainable software. While the process herein <em>can</em> produce a clear list of code whose known introduction is directly attributable to Hellwig, the analysis produced by this process does not do justice to the full weight of the contributions made by Hellwig, since removed code is outright ignored in this process.</p>
 
<p>However, we can consider this process above to have found a bare minimum of Hellwig's contributions that appear in VMware's partial source release.</p>
 
<h1 id="further-analysis-of-additional-examples">Further Analysis of Additional Examples</h1>
 
<p>Separately from the analysis above, Hellwig identified a specific list of eight critical C functions to which he specifically recalls contributing, and near-equivalents were found in in VMware's ESXi 5.5 product.</p>
 
<p>In this additional analysis, we used CCFinderX in a different way <a href="#fn6" class="footnoteRef" id="fnref6"><sup>6</sup></a>. In these tests, I confine the code tests to specific small sections of code that were previously identified by human analysis as similar. In this way, I used CCFinderX to confirm with computational analysis what was already obvious to the human eye.</p>
 
<p>As expected, the ratio of similarity between the Hellwig's implementation and the corresponding implementation found in VMware's ESXi 5.5 product are quite high. As show in the table below, these particular functions show a incredibly high degree of similarity.</p>
 
<table>
 
<thead>
 
<tr class="header">
 
<th style="text-align: left;">Function</th>
 
<th style="text-align: center;">Ratio of Similarity</th>
 
</tr>
 
</thead>
 
<tbody>
 
<tr class="odd">
 
<td style="text-align: left;"><code>scsi_destroy_command_freelist</code></td>
 
<td style="text-align: center;">82.9545%</td>
 
</tr>
 
<tr class="even">
 
<td style="text-align: left;"><code>__scsi_device_lookup</code></td>
 
<td style="text-align: center;">98.4375%</td>
 
</tr>
 
<tr class="odd">
 
<td style="text-align: left;"><code>scsi_device_lookup</code></td>
 
<td style="text-align: center;">58.4785%</td>
 
</tr>
 
<tr class="even">
 
<td style="text-align: left;"><code>__scsi_get_command</code></td>
 
<td style="text-align: center;">69.2308%</td>
 
</tr>
 
<tr class="odd">
 
<td style="text-align: left;"><code>__scsi_iterate_devices</code></td>
 
<td style="text-align: center;">47.6190%</td>
 
</tr>
 
<tr class="even">
 
<td style="text-align: left;"><code>scsi_put_command</code></td>
 
<td style="text-align: center;">49.0347%</td>
 
</tr>
 
<tr class="odd">
 
<td style="text-align: left;"><code>scsi_remove_single_device</code></td>
 
<td style="text-align: center;">99.0566%</td>
 
</tr>
 
<tr class="even">
 
<td style="text-align: left;"><code>scsi_setup_command_freelist</code></td>
 
<td style="text-align: center;">14.8148%</td>
 
</tr>
 
</tbody>
 
</table>
 
<section class="footnotes">
 
<hr />
 
<ol>
 
<li id="fn1"><p>B. S. Baker. <a href="http://ieeexplore.ieee.org/xpl/login.jsp?reload=true&amp;tp=&amp;arnumber=514697&amp;url=http%3A%2F%2Fieeexplore.ieee.org%2Fiel3%2F3936%2F11405%2F00514697.pdf%3Farnumber%3D514697">On finding duplication and near-duplication in large software systems</a>. In Second Working Conference on Reverse Engineering, pages 86–95, Los Alamitos, California, 1995.<a href="#fnref1">↩</a></p></li>
 
<li id="fn2"><p>T. Kamiya, S. Kusumoto, and K. Inoue. <a href="https://www.cs.drexel.edu/~spiros/teaching/CS675/papers/clone-kamiya.pdf">CCFinder: A multi-linguistic token-based code clone detection system for large scale source code.</a> IEEE Transactions on Software Engi- neering, 28(7):654–670, July 2002.<a href="#fnref2">↩</a></p></li>
 
<li id="fn3"><p><a href="http://www.ccfinder.net/ccfinderinarticle.html">Online list of citations to CCFinder and CCFinderX</a><a href="#fnref3">↩</a></p></li>
 
<li id="fn4"><p>Yu Kashima, Yasuhiro Hayase, Norihiro Yoshida, Yuki Manabe, Katsuro Inoue. <a href="http://ieeexplore.ieee.org/xpl/login.jsp?reload=true&amp;tp=&amp;arnumber=6079772&amp;url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D6079772">An Investigation into the Impact of Software Licenses on Copy-and-paste Reuse among OSS Projects</a>. WCRE 2011: 28-32<a href="#fnref4">↩</a></p></li>
 
<li id="fn5"><p>T. Kamiya, S. Kusumoto, and K. Inoue. <a href="https://www.cs.drexel.edu/~spiros/teaching/CS675/papers/clone-kamiya.pdf">CCFinder: A multi-linguistic token-based code clone detection system for large scale source code.</a> IEEE Transactions on Software Engi- neering, 28(7):654–670, July 2002.<a href="#fnref5">↩</a></p></li>
 
<li id="fn6"><p>For this analysis, we set the configuration parameters on CCFinderX to a minimum clone length of 15 and minimum TKS of 10. Research papers on CCFinder [^2] and <a href="http://www.ccfinder.net/doc/10.2/en/tutorial-gemx.html">its documentation</a> discuss these settings further.<a href="#fnref6">↩</a></p></li>
 
<li id="fn6"><p>For this analysis, we set the configuration parameters on CCFinderX to a minimum clone length of 15 and minimum TKS of 10. Research papers on CCFinder and <a href="http://www.ccfinder.net/doc/10.2/en/tutorial-gemx.html">its documentation</a> discuss these settings further.<a href="#fnref6">↩</a></p></li>
 
</ol>
 
</section>
 
{% endblock %}