Re: [CODE4LIB] Plagiarism checker

2015-01-23 Thread Andreas Orphanides
My first thought was something like programmatically doing a pairwise diff
of the files, about 4,950 times for 100 files. I was surprised I couldn't
find a utility that just does this.
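
A minimal sketch of that pairwise idea, assuming the text has already been extracted from each file. For 100 files there are 100 × 99 / 2 = 4,950 unordered pairs. The function name `rank_pairs` and the sample file names are hypothetical, and Python's standard `difflib` stands in here for a command-line diff; it is one possible approach, not an existing utility.

```python
import difflib
from itertools import combinations


def rank_pairs(texts):
    """Score every unordered pair of documents, most similar first.

    `texts` maps a (hypothetical) file name to its already-extracted text.
    """
    scores = []
    for (name_a, text_a), (name_b, text_b) in combinations(texts.items(), 2):
        # ratio() is 1.0 for identical sequences and close to 0.0 for
        # unrelated ones. Word lists are compared instead of characters
        # to keep the comparison reasonably fast.
        matcher = difflib.SequenceMatcher(None, text_a.split(), text_b.split())
        scores.append((matcher.ratio(), name_a, name_b))
    return sorted(scores, reverse=True)


if __name__ == "__main__":
    sample = {
        "alice.txt": "the court held that the contract was void for lack of consideration",
        "bob.txt": "the court held that the contract was void for want of consideration",
        "carol.txt": "a completely different essay about maritime salvage law",
    }
    for ratio, a, b in rank_pairs(sample):
        print(f"{ratio:.2f}  {a}  {b}")
```

The pairs at the top of the list are the ones worth reading side by side; a high score is only a prompt for closer scrutiny, not proof of copying.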

But I did find something called diffuse [1], which allows you to graphically
compare any number of text files in a diff-like fashion. This would
probably at least be able to tell you which files need closer scrutiny.

I think you'd presumably have to be able to extract the text from each
file; I doubt it would work on raw Word docs or PDFs, so that might be a
stopper.

It seems like the realm of source control has a lot of software designed to
help with this problem, so there might be other similar things out there.
But probably not anything designed to natively handle print-ready files.

-dre.


[1] http://diffuse.sourceforge.net/about.html

On Fri, Jan 23, 2015 at 7:26 AM, Judy Meirose jmeir...@fcsl.edu wrote:

 Can anyone recommend plagiarism-checking software besides Turnitin and
 SafeAssign?  I need to compare about 100 student assignments against each
 other to make sure the students haven't copied each other's work.

 Thanks.

 Judy K. Meirose
 Systems Librarian
 Florida Coastal School of Law
 8787 Baypine Rd
 Jacksonville, FL
 (904)680-7603

 This email transmission, and any documents, files or previous e-mail
 messages attached to it, may contain confidential, privileged and/or
 proprietary information for the sole use of the intended recipient(s). If
 you are not an intended recipient or a person responsible for delivering it
 to an intended recipient, any disclosure, copying, distribution or use of
 any of the information contained in or attached to this transmission is
 strictly prohibited. If you have received this transmission in error,
 please: (1) immediately notify me by reply e-mail; and (2) destroy the
 original (and any copies) of this transmission and its attachments without
 reading or saving in any manner.



Re: [CODE4LIB] Plagiarism checker

2015-01-23 Thread Adam Traub
Just thought I'd pop my head in:

TurnItIn does compare against previous submissions (both at your own
institution and at others) unless the submitter chooses not to include them
in the repository for future checks.

Cheers,
Adam Traub
Electronic Resources Librarian
The Wallace Center
Rochester Institute of Technology
adam.tr...@rit.edu



-Original Message-
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Mark A. 
Matienzo
Sent: Friday, January 23, 2015 9:45 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Plagiarism checker

I believe Turnitin and SafeAssign both compare the text of submissions
against external sources (e.g., SafeAssign uses ABI/INFORM, among others).
I am not certain if they compare submissions against each other.

However, if you're looking for something along the lines of what Dre suggests,
you could use ssdeep, which is an implementation of a piecewise hashing
algorithm [0]. The issue with that is that you would have to assume that all
students are using the same file format.

You could also use something like Tika [1] to extract the text content from
all the submissions, and then compare them against each other.

[0] http://ssdeep.sourceforge.net/
[1] http://tika.apache.org/

Mark

--
Mark A. Matienzo m...@matienzo.org
Director of Technology, Digital Public Library of America




Re: [CODE4LIB] Plagiarism checker

2015-01-23 Thread Joe Hourcle
On Jan 23, 2015, at 9:44 AM, Mark A. Matienzo wrote:

 I believe Turnitin and SafeAssign both compare the text of submissions
 against external sources (e.g., SafeAssign uses ABI/INFORM, among others).
 I am not certain if they compare submissions against each other.

My understanding of TurnItIn, at least initially, was that they
built their corpus on existing submissions.  

(When they started up, they had deals that let some universities use
their service for free or cheap, so that they could build up their
corpus.)


 However, if you're looking for something along the lines of what Dre
 suggests, you could use ssdeep, which is an implementation of a piecewise
 hashing algorithm [0]. The issue with that is that you would have to assume
 that all students are using the same file format.
 
 You could also use something like Tika to extract the text content from
 all the submissions, and then compare them against each other.

I'd agree on extracting the text.  MS Word used to store documents
as strings of edits, making it difficult to compare two
documents for similarity without parsing the format.

(I don't know if they still do this in .docx)
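
For what it's worth, a .docx file is a ZIP archive whose main body is XML stored in `word/document.xml`, so the text can be pulled out with the Python standard library alone. The sketch below is an illustration under that assumption, not a replacement for a full extractor: the function name `docx_text` is hypothetical, and it only reads the main document body, ignoring headers, footnotes and other parts of the package.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml inside a .docx file.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_text(source):
    """Return the body text of a .docx, one line per paragraph.

    `source` is a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # Each run of text sits in a <w:t> element below the paragraph.
        paragraphs.append("".join(t.text or "" for t in para.iter(W_NS + "t")))
    return "\n".join(paragraphs)
```

Once every submission is reduced to plain text this way, the comparison step no longer depends on how the word processor laid out the file.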

-Joe


Re: [CODE4LIB] Plagiarism checker

2015-01-23 Thread Mark A. Matienzo
I believe Turnitin and SafeAssign both compare the text of submissions
against external sources (e.g., SafeAssign uses ABI/INFORM, among others).
I am not certain if they compare submissions against each other.

However, if you're looking for something along the lines of what Dre
suggests, you could use ssdeep, which is an implementation of a piecewise
hashing algorithm [0]. The issue with that is that you would have to assume
that all students are using the same file format.

You could also use something like Tika [1] to extract the text content from
all the submissions, and then compare them against each other.
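
As a rough illustration of the "extract, then compare" step, here is a sketch that scores two already-extracted texts by the overlap of their word shingles (Jaccard similarity). This is a deliberately simpler, swapped-in technique: it is neither ssdeep's piecewise hashing nor anything Tika provides, and the function names `shingles` and `jaccard` and the default shingle size of 5 are assumptions made for the example.

```python
import re


def shingles(text, size=5):
    """Return the set of overlapping `size`-word sequences in `text`."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(text_a, text_b, size=5):
    """Jaccard similarity of the two shingle sets, from 0.0 to 1.0."""
    a, b = shingles(text_a, size), shingles(text_b, size)
    if not a or not b:
        # Texts shorter than `size` words have no shingles to compare.
        return 0.0
    return len(a & b) / len(a | b)
```

Because it works on lowercased words rather than raw bytes, the score does not change with the original file format, fonts or punctuation, which sidesteps the same-format assumption at the cost of needing clean extracted text first.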

[0] http://ssdeep.sourceforge.net/
[1] http://tika.apache.org/

Mark

--
Mark A. Matienzo m...@matienzo.org
Director of Technology, Digital Public Library of America




[CODE4LIB] Plagiarism checker

2015-01-23 Thread Judy Meirose
Can anyone recommend plagiarism-checking software besides Turnitin and
SafeAssign?  I need to compare about 100 student assignments against each other
to make sure the students haven't copied each other's work.

Thanks.

Judy K. Meirose
Systems Librarian
Florida Coastal School of Law
8787 Baypine Rd
Jacksonville, FL
(904)680-7603
