August 1, 2011

Off the fence at last? Conducting a PDF metadata extraction experiment

As a PhD student entering my third year of studies, I think it's time for me to get off the fence about reference management software. I've muddled through the last two years, cobbling together reference pages in APA style and exploring, but never fully committing to, Zotero.

When I started my PhD program in Fall 2009, I chose to do my first literature review with open-source Zotero instead of the proprietary EndNote software that my institution supports.

I learned to love open-source applications when I took a course from Dr. Jay Pfaffman as an Instructional Technology master's student. With Zotero's Firefox plugin, I could create and own my own bibliographic database that synced with my Zotero web account, meaning it was accessible from any computer with an Internet connection.

I finished the literature review using Zotero to organize and tag my files and then automatically generate an APA-formatted bibliography. But under the crush of my course load and the multitude of distractions and obligations that go with doctoral-level work, I never gave myself the time to explore Zotero's interface and documentation. I never used Zotero to take notes on my resources, nor did I take advantage of its word processor plugins for cite-and-write functionality. It all seemed so complicated.

I was printing, photocopying, underlining, annotating, and sticky-noting mounds of literature and basically using Zotero as the digital equivalent of 3x5 bib cards, which most people of a certain age can remember from their high school and college English classes. In and of itself, the ability to automatically generate a list of works cited is a nice thing, but is that enough added value to be "worth investing money and time in" (Hensley, 2011, p. 205)? I'm pretty sure I could alphabetize my bib cards and word-process my bibliography the "old-fashioned" way in about the same amount of time it takes to fool with Zotero.

Now, in EP604 we are learning about the next generation of citation management software. These programs, including Zotero, aspire to do more and may alter the entire academic research experience. Zotero, for instance, has released an alpha version of a free-standing desktop application as part of the larger Zotero Everywhere project. Meanwhile, Mendeley, a commercial, cross-platform application, is already just about everywhere, with desktop, web, and mobile apps.

Both Zotero and Mendeley draw on social media functionality to provide a collaborative platform for public and private research groups. But Mendeley has upped the ante with what it calls "Knowledge Discovery," which draws on readership statistics to predict research trends and to push new content out to users. Another step toward the Googlification of everything.

Mendeley claims to be the "world's largest crowd-sourced library," and I am interested in the impact of collaboration and social networking on the research experience. But for the moment, I have more pressing needs. I want a citation management tool that will integrate seamlessly with my conversion to a paperless workflow and that will help me rein in and organize two years' worth of scattered resources.

Ideally, I would like a tool that functions both as a document reader and a citation manager, but I am not at all impressed with either Mendeley's or Zotero's annotating capabilities. To make matters worse, the Mendeley iPad app repeatedly crashes even after being uninstalled and re-installed. (Zotero doesn't even have an iPad app, although one appears to be in the works.) I've worked around this issue by using a different iPad app to annotate and export "flattened" PDFs to Dropbox, a process I will describe in more detail in a future post.

If I outsource reading and annotating to an iOS app and put networking and collaboration on the back burner for the time being, that leaves me with the same basic question about reference managers that Aaron Tay posed last year on his Musings about Librarianship blog: "How good are they at figuring out citations from PDFs?" Tay ran a series of "non-scientific tests" to see how well EndNote, Mendeley, WizFolio, and Zotero ingested a collection of ten PDFs he downloaded from the Internet.

Using five bibliographic fields (article title, author, publication year, journal volume and issue number, and page numbers), Tay evaluated the results for each of the ten articles. A "pass" meant the software extracted correct information for all five fields, a "partial" indicated at least one field was satisfied, and a "fail" meant no bibliographic information was found. EndNote and WizFolio each had five "fails," and Mendeley and Zotero had a respectable combination of "passes" and "partials," with Zotero having the most passes of all.
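For readers who like to see a rubric spelled out, here is a minimal Python sketch of the pass/partial/fail scoring as I understand it from Tay's description. Tay judged his results by eye, so the function name, the field names, and the exact-string comparison are all my own assumptions, not anything taken from his post or from either application. The example values are the Brown, Collins, and Duguid (1989) fields as Zotero returned them in my own test.

```python
# Hypothetical field names for the five bibliographic fields in the rubric.
FIELDS = ("title", "author", "year", "volume_issue", "pages")


def score_extraction(extracted, expected):
    """Return "pass", "partial", or "fail" for one article.

    Assumption: a field counts as correct only if it is non-empty and
    matches the expected value exactly (a human grader would be more lenient).
    """
    correct = sum(
        1
        for field in FIELDS
        if extracted.get(field) and extracted.get(field) == expected.get(field)
    )
    if correct == len(FIELDS):
        return "pass"      # all five fields correct
    if correct > 0:
        return "partial"   # at least one field correct
    return "fail"          # no usable bibliographic information


# The manually formatted citation serves as the reference values.
expected = {
    "title": "Situated cognition and the culture of learning",
    "author": "Brown, J. S., Collins, A., & Duguid, P.",
    "year": "1989",
    "volume_issue": "18(1)",
    "pages": "32-42",
}

# Zotero's result for the same article: only the first page was captured.
zotero_result = dict(expected, pages="32")

print(score_extraction(zotero_result, expected))
```

Under these assumptions the Zotero result scores as a "partial," because four of the five fields match and the page range does not. A stricter or looser matching rule (ignoring capitalization, for example) would change some scores, which is one reason to treat this kind of test as non-scientific.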

Since a year has gone by with many upgrades and fixes along the way, I thought it would be interesting to try a scaled-down version of Tay's metadata experiment. I used four articles from my desktop representing a range of years (from 1989 to 2011) and what I hoped would be a range of PDF versions, with and without DOIs, etc. I focused exclusively on Mendeley and Zotero; other than that, I followed all the steps that Tay described in his original post.

To make it more interesting, I used both applications' ability to insert formatted citations by dragging and dropping them straight into the text editor of this blog. I had never tried this before with either Mendeley or Zotero and was eager to see how it works. (BTW, the drag-and-drop piece was super easy with both applications, but I prefer Zotero's split-screen format, which integrates with the Firefox browser, to resizing the Mendeley Desktop window.)

For comparison purposes, I copied and pasted my manually formatted APA citations first, with the Mendeley and Zotero citations following. I changed the text color within citations to indicate deviations from APA or missing information.

Manually formatted APA citations:  
Brown, J. S., Collins, A., & Duguid, P. (1989). Situated cognition and the culture of learning. Educational Researcher, 18(1), 32-42.
Mishra, P., & Koehler, M. J. (2006). Technological pedagogical content knowledge: A framework for teacher knowledge. Teachers College Record, 108(6), 1017-1054.
O’Bannon, B. W., Lubke, J. K., Beard, J. L., & Britt, V. G. (2011). Using podcasts to replace lecture: Effects on student achievement. Computers & Education, 57, 1885-1892. doi:10.1016/j.compedu.2011.04.001
The New London Group. (1996). A pedagogy of multiliteracies: Designing social futures. Harvard Educational Review, 66(1), 60-92.

Mendeley citations (note automatic insertion of some DOIs):
Brown, J. S., Collins, A., & Duguid, P. (1989). Situated Cognition and the Culture of Learning. Educational Researcher, 18(1), 32 (missing page range). doi:10.2307/1176008
MISHRA, P., & KOEHLER, M. J. (2006). Technological Pedagogical Content Knowledge: A Framework for Teacher Knowledge. Teachers College Record, 108(6), 1017-1054. doi:10.1111/j.1467-9620.2006.00684.x
O’Bannon, B. W., Lubke, J. K., Beard, J. L., & Britt, V. G. (2011). Using podcasts to replace lecture: Effects on student achievement. Computers & Education, 57(3), 1885-1892. doi:10.1016/j.compedu.2011.04.001
No author. A pedagogy of multiliteracies : Designing social futures. (1996). Library. (No journal, volume or issue number, or page numbers)

Zotero citations (note automatic double-spacing):
Brown, J. S., Collins, A., & Duguid, P. (1989). Situated cognition and the culture of learning. Educational researcher, 18(1), 32 (missing page range).
Mishra, P., & Koehler, M. J. (2006). Technological pedagogical content knowledge: A framework for teacher knowledge. Teachers college record, 108(6), 1017 (missing page range).
O’Bannon, B. W., Lubke, J. K., Beard, J. L., & Britt, V. G. (2011). Using Podcasts to Replace Lecture: Effects on Student Achievement. Computers & Education. (No volume or page numbers)
Pedagogy+of+Multiliteracies_New+London+Group.pdf. (n.d.). . Retrieved from (Hmmm. Just really messed up!)

So, should I stick it out with Zotero or make the leap to Mendeley?

I like that Mendeley located and added the DOIs for three out of the four documents. I like how Zotero automatically double-spaces. The 1996 article by The New London Group is problematic all the way around and perhaps should not have been included in my little "experiment," but it is a seminal work in the field of literacy and I will need it in my web-based library at some point. Another limitation is that I did not include at least one conference proceeding to see how the two reference managers handle that document type. I probably should run another test with a proceedings paper before I choose a tool.

Or, do I even have to choose? According to Julie Meloni at the ProfHacker blog, it is easy to import Zotero resources to Mendeley, and, "Given the syncing abilities, it would be possible (and not terribly difficult or time consuming) to, say, work with Zotero as your primary tool yet sync with Mendeley so as to increase the content in your field and just add to the community in general."

What would you do?

