On Tue, Jun 23, 2009 at 03:15, Platonides platoni...@gmail.com wrote:
Although not trivial, downloading all images is in fact quite easy. You
can find scripts to do that already made. You can also ask Brion to
rsync them.
But do you have enough space to dedicate?
How many wikis do you want to
Yes, but my understanding is that while google provided part of the mbp data
and scans, its continued updates to ocr since then are not being shared. I
would be glad to learn this was not the case...
samuel klein. s...@laptop.org. +1 617 529 4266
On Jun 21, 2009 3:14 AM, Nikola Smolenski
On Mon, Jun 22, 2009 at 9:15 PM, Platonides platoni...@gmail.com wrote:
Anthony wrote:
(although I still haven't seen the WMF step up
to the plate and make it easy for people to make a full history fork, or
even to download all the images)
You'll find full history dumps of almost all
2009/6/23 Samuel Klein meta...@gmail.com
Yes, but my understanding is that while google provided part of the mbp
data
and scans, its continued updates to ocr since then are not being shared. I
would be glad to learn this was not the case...
The dataset you need to train an OCR system to be
Brian wrote:
2009/6/23 Samuel Klein meta...@gmail.com
Yes, but my understanding is that while google provided part of the mbp
data
and scans, its continued updates to ocr since then are not being shared. I
would be glad to learn this was not the case...
The dataset you need to
On Tue, Jun 23, 2009 at 11:44 AM, Michael Snow wikipe...@verizon.net wrote:
The dataset you need to train an OCR system to be as good as theirs is
the
raw images and the plain text. They aren't making it easy to get either
of
those things :( They have presumably improved the software in
Brian wrote:
On Tue, Jun 23, 2009 at 11:44 AM, Michael Snow wikipe...@verizon.net wrote:
The dataset you need to train an OCR system to be as good as theirs is
the
raw images and the plain text. They aren't making it easy to get either
of
those things :( They
Ok Shakespeare. But in plain english you appear to be saying that
corporations are inherently greedy and have a tendency to be evil. Sure, but
we expect more out of GOOG. This is not MSFT we are talking about.
On Tue, Jun 23, 2009 at 12:13 PM, Michael Snow wikipe...@verizon.net wrote:
Brian
On Tue, Jun 23, 2009 at 1:09 PM, Brian brian.min...@colorado.edu wrote:
2009/6/23 Samuel Klein meta...@gmail.com
Yes, but my understanding is that while google provided part of the mbp
data
and scans, its continued updates to ocr since then are not being shared.
I
would be glad to
On Tue, Jun 23, 2009 at 2:24 PM, Brian brian.min...@colorado.edu wrote:
Ok Shakespeare. But in plain english you appear to be saying that
corporations are inherently greedy and have a tendency to be evil. Sure,
but
we expect more out of GOOG. This is not MSFT we are talking about.
Of course
On Tue, Jun 23, 2009 at 3:58 PM, Anthony wikim...@inbox.org wrote:
On Tue, Jun 23, 2009 at 2:24 PM, Brian brian.min...@colorado.edu wrote:
Ok Shakespeare. But in plain english you appear to be saying that
corporations are inherently greedy and have a tendency to be evil. Sure,
but
we expect
On Wed, Jun 24, 2009 at 6:10 AM, Anthony wikim...@inbox.org wrote:
On Tue, Jun 23, 2009 at 3:58 PM, Anthony wikim...@inbox.org wrote:
On Tue, Jun 23, 2009 at 2:24 PM, Brian brian.min...@colorado.edu wrote:
Ok Shakespeare. But in plain english you appear to be saying that
corporations
Anthony wrote:
On Sun, Jun 21, 2009 at 7:54 AM, John Vandenberg jay...@gmail.com wrote:
Whether Google is good or evil is off-topic, and irrelevant to boot.
Whether or not they have a right to exclude bots isn't.
Also worth noting, Project Gutenberg has digitised less than 30,000
books
On Sat, Jun 20, 2009 at 14:35, Ray Saintonge sainto...@telus.net wrote:
Brian wrote:
That is against the law. It violates Google's ToS.
I'm mostly complaining that Google is being Very Evil. There is nothing we
can do about it except complain to them. Which I don't know how to do - they
The statute supports that as well, providing a private right of action
and civil remedy. It's not entirely that cut and dry (there are
certain restrictions that must be met) but yeah, it appears that in
some cases TOS violations can be illegal.
-Dan
On Jun 22, 2009, at 7:49 PM, Mark Wagner
Anthony wrote:
(although I still haven't seen the WMF step up
to the plate and make it easy for people to make a full history fork, or
even to download all the images)
You'll find full history dumps of almost all wikis at
http://download.wikimedia.org/
Although not trivial, downloading all
Samuel Klein wrote:
There is a wealth of work done all the time by primary source
researchers and publishers, which could be improved on by having
wikisource entries, translations, &c.
Related question : how appropriate would large numbers of public
domain texts, with page scans and the best
On Saturday 20 June 2009 18:29:24 Brian wrote:
This has reminded me to complain about Google Books. Google has the world's
best OCR (in virtue of having the largest OCR'able dataset) and also has a
mission to scan in all the public domain books they can get their hands on.
They recently
On Sun, Jun 21, 2009 at 1:41 AM, David Gerard dger...@gmail.com wrote:
http://blogs.law.harvard.edu/infolaw/2009/06/19/using-wikisource-as-an-alternative-open-access-repository-for-legal-scholarship/
Interesting. How well does this fit with what Wikisource does?
Tim Armstrong is a sysop on
On Sun, Jun 21, 2009 at 1:51 AM, Ray Saintonge sainto...@telus.net wrote:
Stephen Bain wrote:
On Sun, Jun 21, 2009 at 5:27 AM, Parker Higgins parkerhigg...@gmail.com
wrote:
Except google isn't asserting any kind of copyright control over these
books, they're just not making it convenient
On Sun, Jun 21, 2009 at 7:17 AM, Anthony wikim...@inbox.org wrote:
(*) Personally, I'm of the opinion that merely accessing a website is not
sufficient to bind a websurfer to a TOS, and that at most a TOS which you do
not have to even click agree to is a unilateral contract which can only
On Sun, Jun 21, 2009 at 9:17 PM, Anthony wikim...@inbox.org wrote:
On Sun, Jun 21, 2009 at 1:51 AM, Ray Saintonge sainto...@telus.net wrote:
Stephen Bain wrote:
On Sun, Jun 21, 2009 at 5:27 AM, Parker Higgins parkerhigg...@gmail.com
wrote:
Except google isn't asserting any kind of
On Sun, Jun 21, 2009 at 7:54 AM, John Vandenberg jay...@gmail.com wrote:
Whether Google is good or evil is off-topic, and irrelevant to boot.
Whether or not they have a right to exclude bots isn't.
Also worth noting, Project Gutenberg has digitised less than 30,000
books since 1971.
On Sun, Jun 21, 2009 at 10:07 PM, Anthony wikim...@inbox.org wrote:
On Sun, Jun 21, 2009 at 7:54 AM, John Vandenberg jay...@gmail.com wrote:
Whether Google is good or evil is off-topic, and irrelevant to boot.
Whether or not they have a right to exclude bots isn't.
Actually, it is. This
On Sun, Jun 21, 2009 at 8:35 AM, John Vandenberg jay...@gmail.com wrote:
I suggest you take a look at a few of the DJVU files provided by
Internet Archive. Then you can point out real faults that you see.
I will. My apologies for misunderstanding your email.
On Sun, Jun 21, 2009 at 10:23 AM, Anthony wikim...@inbox.org wrote:
On Sun, Jun 21, 2009 at 8:35 AM, John Vandenberg jay...@gmail.com wrote:
I suggest you take a look at a few of the DJVU files provided by
Internet Archive. Then you can point out real faults that you see.
I will. My
On Sun, Jun 21, 2009 at 10:55 AM, Anthony wikim...@inbox.org wrote:
On Sun, Jun 21, 2009 at 10:23 AM, Anthony wikim...@inbox.org wrote:
On Sun, Jun 21, 2009 at 8:35 AM, John Vandenberg jay...@gmail.com wrote:
I suggest you take a look at a few of the DJVU files provided by
Internet Archive.
On Sun, Jun 21, 2009 at 1:41 AM, David Gerard dger...@gmail.com wrote:
http://blogs.law.harvard.edu/infolaw/2009/06/19/using-wikisource-as-an-alternative-open-access-repository-for-legal-scholarship/
Interesting. How well does this fit with what Wikisource does?
Here are seven articles from
Anthony wrote:
On Sun, Jun 21, 2009 at 10:55 AM, Anthony wrote:
Okay, http://www.archive.org/details/catholicencyclo16herbgoog happened to
be the first book I randomly picked from Google Book Search. There's no
text version.
And the text version I find of other editions seems to be much
http://blogs.law.harvard.edu/infolaw/2009/06/19/using-wikisource-as-an-alternative-open-access-repository-for-legal-scholarship/
Interesting. How well does this fit with what Wikisource does?
- d.
There is a wealth of work done all the time by primary source
researchers and publishers, which could be improved on by having
wikisource entries, translations, &c.
Related question : how appropriate would large numbers of public
domain texts, with page scans and the best available OCR [and
This has reminded me to complain about Google Books. Google has the world's
best OCR (in virtue of having the largest OCR'able dataset) and also has a
mission to scan in all the public domain books they can get their hands on.
They recently updated their interface to, as they put it, make it easier
Brian wrote:
Unfortunately the only way I've found to download the full text of a public
domain book from Google is to flip through the book a page at a time,
copying the text to your clipboard.
There are roughly 2-3 million public domain books in Google Books.
That's easy to fix :)
Not likely. I've been banned from Google's regular search at least a dozen
times during semi-frenetic search sprees in which I was identified as a bot.
There is no doubt that if you try to automate it you will be quickly shot
down.
On Sat, Jun 20, 2009 at 12:02 PM, Platonides platoni...@gmail.com
Easier than scanning, though :)
On Sat, Jun 20, 2009 at 2:04 PM, Brian brian.min...@colorado.edu wrote:
Not likely. I've been banned from Google's regular search at least a dozen
times during semi-frenetic search sprees in which I was identified as a
bot.
There is no doubt that if you try to
So the bot just has to run at human speeds so it does not get banned, it
still won't get tired or make unpredictable mistakes. And you can run it
from different IPs to parallelize.
--Falcorian
On Sat, Jun 20, 2009 at 11:04 AM, Brian brian.min...@colorado.edu wrote:
Not likely. I've been banned
That is against the law. It violates Google's ToS.
I'm mostly complaining that Google is being Very Evil. There is nothing we
can do about it except complain to them. Which I don't know how to do - they
apparently believe that the plain text versions of their books are akin to
their intellectual
Mailing List foundation-l@lists.wikimedia.org
Sent: Saturday, June 20, 2009 11:47:28 AM
Subject: Re: [Foundation-l] Info/Law blog: Using Wikisource as an Alternative
Open Access Repository for Legal Scholarship
That is against the law. It violates Google's ToS.
I'm mostly complaining that Google
, 2009 8:41:45 AM
Subject: [Foundation-l] Info/Law blog: Using Wikisource as an Alternative Open
Access Repository for Legal Scholarship
http://blogs.law.harvard.edu/infolaw/2009/06/19/using-wikisource-as-an-alternative-open-access-repository-for-legal-scholarship/
Interesting. How well does this fit
domain material under copyright.
From: Brian brian.min...@colorado.edu
To: Wikimedia Foundation Mailing List foundation-l@lists.wikimedia.org
Sent: Saturday, June 20, 2009 11:47:28 AM
Subject: Re: [Foundation-l] Info/Law blog: Using Wikisource as an
Alternative
Wow, what's Wikipedia's policy about using a bot to scrape everything?
On Sat, Jun 20, 2009 at 2:47 PM, Brian brian.min...@colorado.edu wrote:
That is against the law. It violates Google's ToS.
I'm mostly complaining that Google is being Very Evil. There is nothing we
can do about it except
On Sat, Jun 20, 2009 at 1:29 PM, Platonides platoni...@gmail.com wrote:
Where does it forbid them?
5.3 You agree not to access (or attempt to access) any of the Services by
any means other than through the interface that is provided by Google,
unless you have been specifically allowed to do so
Sent: Saturday, June 20, 2009 2:35:52 PM
Subject: Re: [Foundation-l] Info/Law blog: Using Wikisource as an Alternative
Open Access Repository for Legal Scholarship
Brian wrote:
That is against the law. It violates Google's ToS.
I'm mostly complaining that Google is being Very Evil
Anthony wrote:
Wow, what's Wikipedia's policy about using a bot to scrape everything?
I don't know about any policy, but I think it should still be
discouraged. For me this has less to do with predation on other sites
than with our inability to keep up with the volume of data that would
Geoffrey Plourde wrote:
If a bot has a meaningful effect on server load (i.e. page requests), it
falls under the category of malicious software, which is highly illegal.
Malicious software or overloading servers goes well beyond ignoring a
ToS. Why should downloading whole books from
On Sun, Jun 21, 2009 at 5:27 AM, Parker Higgins parkerhigg...@gmail.com wrote:
Except google isn't asserting any kind of copyright control over these
books, they're just not making it convenient to download them in your
preferred format. Maybe not The Right Thing, but not as boneheaded as suing
to unpleasant
consequences.
From: Ray Saintonge sainto...@telus.net
To: Wikimedia Foundation Mailing List foundation-l@lists.wikimedia.org
Sent: Saturday, June 20, 2009 5:07:44 PM
Subject: Re: [Foundation-l] Info/Law blog: Using Wikisource as an Alternative
Open
Geoffrey Plourde wrote:
A bot or bots calling up massive amounts of data at high speed can have a
negative effect on a server. While I doubt the bot we use would have the
power to take down a Google server, the speed of the requests and the
constant number of requests will definitely be
Stephen Bain wrote:
On Sun, Jun 21, 2009 at 5:27 AM, Parker Higgins parkerhigg...@gmail.com
wrote:
Except google isn't asserting any kind of copyright control over these
books, they're just not making it convenient to download them in your
preferred format. Maybe not The Right Thing, but