Re: [CODE4LIB] Processing Circ data

2015-08-06 Thread Kyle Banerjee
On Wed, Aug 5, 2015 at 1:07 PM, Harper, Cynthia char...@vts.edu wrote:

 Hi all. What are you using to process circ data for ad-hoc queries?  I
 usually extract csv or tab-delimited files - one row per item record, with
 identifying bib record data, then total checkouts over the given time
 period(s).  I have been importing these into Access then grouping them by
 bib record. I think that I've reached the limits of scalability for Access
 for this project now, with 250,000 item records.  Does anyone do this in
 R?  My other go-to software for data processing is RapidMiner free
 version.  Or do you just use MySQL or other SQL database?  I was looking
 into doing it in R with RSQLite (just read about this and sqldf
 http://www.r-bloggers.com/make-r-speak-sql-with-sqldf/ ) because I'm sure
 my IT department will be skeptical of letting me have MySQL on my desktop.
 (I've moved into a much more users-don't-do-real-computing kind of
 environment).  I'm rusty enough in R that if anyone will give me some
 start-off data import code, that would be great.
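
A minimal R sketch of the kind of start-off import code asked about above, assuming a tab-delimited extract with one row per item and columns named bib_record_number and total_checkouts (the file name and column names here are placeholders, not the actual export):

  # Read the tab-delimited item extract and total checkouts per bib record.
  items  <- read.delim("circ_extract.txt", stringsAsFactors = FALSE)
  by_bib <- aggregate(total_checkouts ~ bib_record_number, data = items, FUN = sum)
  # Titles with zero checkouts in the period covered by the extract.
  zero_circ <- subset(by_bib, total_checkouts == 0)
  head(zero_circ)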


As has been mentioned already, it's worth investigating whether OpenRefine
or SQLite are options for you. If not, I'd be inclined to explore
solutions that don't rely on your local IT dept.

It's so easy to spend far more time going through approval, procurement,
and then negotiating local IT security/policies than actually working that
it pays to do a lot of things on the cloud. There are many services out
there, but I like Amazon for occasional-need things because you can
provision anything you want in minutes and they're stupid cheap. If all you
need is mysql for a few minutes now and then, just pay for Relational
Database Services. If you'd rather have a server and run mysql off it, get
an EBS backed EC2 instance (the reason to go this route rather than
instance store is improved IO and your data is all retained if you shut off
the server without taking a snapshot). Depending on your usage, bills of
less than a buck a month are very doable. If you need something that runs
24x7, other routes will probably be more attractive. Another option is to
try the mysql built into cheapo web hosting accounts like bluehost, though
you might find that your disk IO gets you throttled. But it might be worth
a shot.
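
For the RDS route, here is a rough sketch of what that could look like from R with the DBI and RMySQL packages; the endpoint, credentials, and table/column names below are placeholders, not real values:

  library(DBI)
  # Connect to a small MySQL instance (e.g. provisioned on RDS); values are placeholders.
  con <- dbConnect(RMySQL::MySQL(),
                   host     = "your-instance.example.us-east-1.rds.amazonaws.com",
                   user     = "circuser",
                   password = "not-a-real-password",
                   dbname   = "circ")
  dbWriteTable(con, "items", items, overwrite = TRUE)  # 'items' = data frame read from the extract
  per_bib <- dbGetQuery(con, "SELECT bib_record_number, SUM(total_checkouts) AS circs
                              FROM items GROUP BY bib_record_number")
  dbDisconnect(con)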

If doing this work on your desktop is acceptable (i.e. other people don't
need access to this service), you might seriously consider just doing it on
a personal laptop that you can install anything you want on. In addition to
mysql, you can also install VirtualBox which is a great environment for
provisioning servers that you can export to other environments or even
carry around on your cell phone.

With regards to some of the specific issues you bring up, 40 minutes for a
query on a database that size is insane, which indicates the tool you have
is not up to the job. Because of the way databases store info, performance
degrades on a logarithmic (rather than linear) basis on indexed data. In
plain English, this means even queries on millions of records take
surprisingly little power. Based on what you've described, changing a field
from variable to fixed might not save you any space and could even increase
it depending on what you have. In any case, the difference won't be worth
worrying about.

Whatever solution you go with, I'd recommend learning to provision yourself
resources when you can find some time. Work is hard enough when you can't
get the resources you need. When you can simply assign them to yourself,
the tools you need are always at hand so life gets much easier and more fun.

kyle


Re: [CODE4LIB] Processing Circ data

2015-08-06 Thread Harper, Cynthia
I have compacted the database, and I'm using the Group By SQL query. I think I 
actually am hitting the 2GB limit, because of all the data I have for each row. 
I'm wondering whether the field I added for reserves history notes is being 
treated as a fixed-length field for every record, rather than as a variable-length 
field that appears only for the small number of records that have been put on 
reserve.  I suppose if I exported my data in two tables - bib and item data, 
the database would be much more efficient than the flat-file approach I've been 
using.  Time to turn the mind back on, rather than just taking the lazy 
approach every time...
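
A rough sketch, in R for illustration, of the bib/item split described above; the column names are assumptions about the flat extract:

  # Split the flat one-row-per-item extract into separate bib and item tables.
  flat <- read.delim("circ_extract.txt", stringsAsFactors = FALSE)
  bib  <- unique(flat[, c("bib_record_number", "title")])
  item <- flat[, c("item_record_number", "bib_record_number", "total_checkouts")]
  # The two can be rejoined on bib_record_number when needed:
  rejoined <- merge(item, bib, by = "bib_record_number")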

Cindy

-Original Message-
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Kevin 
Ford
Sent: Wednesday, August 05, 2015 5:16 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Processing Circ data

On the surface, your difficulties suggest you may need to look at a few 
optimization tactics. Apologies if these are things you've already considered 
and addressed - just offering a suggestion.

This page [1] is for Access 2003, but the items under "Improve query 
performance" should apply - I think - to newer versions also.  I'll draw 
specific attention to 1) compacting the database; 2) making sure you have an 
index set up on the bib record number and number-of-circs fields; and 3) 
making sure you are using the GROUP BY SQL syntax [2].

Now, I'm not terribly familiar with Access so I can't actually help you with 
point/click instructions, but the above are common 'gotchas' that could be a 
problem regardless of RDBMS.

Yours,
Kevin

[1] https://support.microsoft.com/en-us/kb/209126
[2] http://www.w3schools.com/sql/sql_groupby.asp
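
The same index-plus-GROUP BY pattern, sketched outside Access with R's DBI and RSQLite packages (the table and column names are assumptions, and the zero-circ filter mirrors the query described later in the thread):

  library(DBI)
  con <- dbConnect(RSQLite::SQLite(), "circ.sqlite")
  dbWriteTable(con, "items", items, overwrite = TRUE)  # 'items' = data frame read from the extract
  # Index the grouping column, then aggregate circs per bib record.
  dbExecute(con, "CREATE INDEX IF NOT EXISTS idx_items_bib ON items (bib_record_number)")
  zero_circ <- dbGetQuery(con, "SELECT bib_record_number, SUM(total_checkouts) AS circs
                                FROM items
                                GROUP BY bib_record_number
                                HAVING SUM(total_checkouts) = 0")
  dbDisconnect(con)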



On 8/5/15 4:01 PM, Harper, Cynthia wrote:
 Well, I guess it could be bad data, but I don't know how to tell. I think 
 I've done more than this before.

 I have a "Find duplicates" query that groups by bib record number.  That
 query seemed to take about 40 minutes to process. Then I added a criterion to
 limit it to only records that had 0 circs this year. That query displays the
 rotating cursor, then says "Not Responding", then the cursor, and loops
 through that for hours.  Maybe I can find the bad data in Access, but I'd be
 glad to find more modern data analysis software.  My db is 136,256 KB.  But
 adding that extra query will probably put it over the 2GB mark.  I've tried
 extracting to a CSV, and that didn't work. Maybe I'll try a "Make Table" query
 to a separate db.

 Or the OpenRefine suggestion sounds good too.

 Cindy Harper

 -Original Message-
 From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf 
 Of Kevin Ford
 Sent: Wednesday, August 05, 2015 4:23 PM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] Processing Circ data

 Hi Cindy,

 This doesn't quite address your issue, but, unless you've hit the 2 GB Access 
size limit [1], Access can handle a good deal more than the 250,000 item records 
 (rows, yes?) you cited.

 What makes you think you've hit the limit?  Slowness, something else?

 All the best,
 Kevin

 [1]
 https://support.office.com/en-us/article/Access-2010-specifications-1e
 521481-7f9a-46f7-8ed9-ea9dff1fa854





 On 8/5/15 3:07 PM, Harper, Cynthia wrote:
 Hi all. What are you using to process circ data for ad-hoc queries?  I 
 usually extract csv or tab-delimited files - one row per item record, with 
 identifying bib record data, then total checkouts over the given time 
 period(s).  I have been importing these into Access then grouping them by 
 bib record. I think that I've reached the limits of scalability for Access 
 for this project now, with 250,000 item records.  Does anyone do this in R?  
 My other go-to software for data processing is RapidMiner free version.  Or 
 do you just use MySQL or other SQL database?  I was looking into doing it in 
 R with RSQLite (just read about this and sqldf  
 http://www.r-bloggers.com/make-r-speak-sql-with-sqldf/ ) because ...  I'm 
 rusty enough in R that if anyone will give me some start-off data import 
 code, that would be great.

 Cindy Harper
 E-services and periodicals librarian
 Virginia Theological Seminary
 Bishop Payne Library
 3737 Seminary Road
 Alexandria VA 22304
 char...@vts.edu
 703-461-1794



Re: [CODE4LIB] Processing Circ data

2015-08-06 Thread Harper, Cynthia
I did just bring in my own laptop to see if my problem is unique to my work 
computer.  I actually have used Amazon AWS, and yes, that might be the best 
option.  I've been looking into why my MSAccess job is limited to 25% of my CPU 
time - maybe Access just can't use multiple processors.  I'm going to investigate 
SQLite and OpenRefine on my personal laptop.

Thanks all!
Cindy Harper

-Original Message-
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Kyle 
Banerjee
Sent: Thursday, August 06, 2015 12:34 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Processing Circ data

On Wed, Aug 5, 2015 at 1:07 PM, Harper, Cynthia char...@vts.edu wrote:

  Hi all. What are you using to process circ data for ad-hoc queries?  I 
 usually extract csv or tab-delimited files - one row per item record, 
 with identifying bib record data, then total checkouts over the given 
 time period(s).  I have been importing these into Access then grouping 
 them by bib record. I think that I've reached the limits of 
 scalability for Access for this project now, with 250,000 item 
  records.  Does anyone do this in R?  My other go-to software for data 
 processing is RapidMiner free version.  Or do you just use MySQL or 
 other SQL database?  I was looking into doing it in R with RSQLite 
 (just read about this and sqldf 
 http://www.r-bloggers.com/make-r-speak-sql-with-sqldf/ ) because I'm sure my 
 IT department will be skeptical of letting me have MySQL on my desktop.
 (I've moved into a much more users-don't-do-real-computing kind of 
 environment).  I'm rusty enough in R that if anyone will give me some 
 start-off data import code, that would be great.


As has been mentioned already, it's worth investigating whether OpenRefine or 
SQLite are options for you. If not, I'd be inclined to explore solutions that 
don't rely on your local IT dept.

It's so easy to spend far more time going through approval, procurement, and 
then negotiating local IT security/policies than actually working that it pays 
to do a lot of things on the cloud. There are many services out there, but I 
like Amazon for occasional-need things because you can provision anything you 
want in minutes and they're stupid cheap. If all you need is mysql for a few 
minutes now and then, just pay for Relational Database Services. If you'd 
rather have a server and run mysql off it, get an EBS backed EC2 instance (the 
reason to go this route rather than instance store is improved IO and your data 
is all retained if you shut off the server without taking a snapshot). 
Depending on your usage, bills of less than a buck a month are very doable. If 
you need something that runs 24x7, other routes will probably be more 
attractive. Another option is to try the mysql built into cheapo web hosting 
accounts like bluehost, though you might find that your disk IO gets you 
throttled. But it might be worth a shot.

If doing this work on your desktop is acceptable (i.e. other people don't need 
access to this service), you might seriously consider just doing it on a 
personal laptop that you can install anything you want on. In addition to 
mysql, you can also install VirtualBox which is a great environment for 
provisioning servers that you can export to other environments or even carry 
around on your cell phone.

With regards to some of the specific issues you bring up, 40 minutes for a 
query on a database that size is insane, which indicates the tool you have is 
not up to the job. Because of the way databases store info, performance 
degrades on a logarithmic (rather than linear) basis on indexed data. In plain 
English, this means even queries on millions of records take surprisingly 
little power. Based on what you've described, changing a field from variable to 
fixed might not save you any space and could even increase it depending on what 
you have. In any case, the difference won't be worth worrying about.

Whatever solution you go with, I'd recommend learning to provision yourself 
resources when you can find some time. Work is hard enough when you can't get 
the resources you need. When you can simply assign them to yourself, the tools 
you need are always at hand so life gets much easier and more fun.

kyle


Re: [CODE4LIB] Processing Circ data

2015-08-06 Thread Cornel Darden Jr.
Hello,

I'd go with a full database solution over sqlite. No limitations. 

Thanks,

Sent from my iPhone

 On Aug 6, 2015, at 11:49 AM, Harper, Cynthia char...@vts.edu wrote:
 
 I did just bring in my own laptop to see if my problem is unique to my work 
 computer.  I actually have used Amazon AWS, and yes, that might be the best 
 option.  I've been looking into why my MSAccess job is limited to 25% of my 
 CPU time - maybe Access just can't use multiple processors.  I'm going to 
 investigate SQLite and OpenRefine on my personal laptop.
 
 Thanks all!
 Cindy Harper
 
 -Original Message-
 From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Kyle 
 Banerjee
 Sent: Thursday, August 06, 2015 12:34 PM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] Processing Circ data
 
 On Wed, Aug 5, 2015 at 1:07 PM, Harper, Cynthia char...@vts.edu wrote:
 
  Hi all. What are you using to process circ data for ad-hoc queries?  I 
 usually extract csv or tab-delimited files - one row per item record, 
 with identifying bib record data, then total checkouts over the given 
 time period(s).  I have been importing these into Access then grouping 
 them by bib record. I think that I've reached the limits of 
 scalability for Access for this project now, with 250,000 item 
  records.  Does anyone do this in R?  My other go-to software for data 
 processing is RapidMiner free version.  Or do you just use MySQL or 
 other SQL database?  I was looking into doing it in R with RSQLite 
 (just read about this and sqldf 
 http://www.r-bloggers.com/make-r-speak-sql-with-sqldf/ ) because I'm sure my 
 IT department will be skeptical of letting me have MySQL on my desktop.
 (I've moved into a much more users-don't-do-real-computing kind of 
 environment).  I'm rusty enough in R that if anyone will give me some 
 start-off data import code, that would be great.
 
 As has been mentioned already, it's worth investigating whether OpenRefine or 
 SQLite are options for you. If not, I'd be inclined to explore solutions 
 that don't rely on your local IT dept.
 
 It's so easy to spend far more time going through approval, procurement, and 
 then negotiating local IT security/policies than actually working that it 
 pays to do a lot of things on the cloud. There are many services out there, 
 but I like Amazon for occasional-need things because you can provision 
 anything you want in minutes and they're stupid cheap. If all you need is 
 mysql for a few minutes now and then, just pay for Relational Database 
 Services. If you'd rather have a server and run mysql off it, get an EBS 
 backed EC2 instance (the reason to go this route rather than instance store 
 is improved IO and your data is all retained if you shut off the server 
 without taking a snapshot). Depending on your usage, bills of less than a 
 buck a month are very doable. If you need something that runs 24x7, other 
 routes will probably be more attractive. Another option is to try the mysql 
 built into cheapo web hosting accounts like bluehost, though you might find 
 that your disk IO gets you throttled. But it might be worth a shot.
 
 If doing this work on your desktop is acceptable (i.e. other people don't 
 need access to this service), you might seriously consider just doing it on a 
 personal laptop that you can install anything you want on. In addition to 
 mysql, you can also install VirtualBox which is a great environment for 
 provisioning servers that you can export to other environments or even carry 
 around on your cell phone.
 
 With regards to some of the specific issues you bring up, 40 minutes for a 
 query on a database that size is insane, which indicates the tool you have is 
 not up to the job. Because of the way databases store info, performance 
 degrades on a logarithmic (rather than linear) basis on indexed data. In plain 
 English, this means even queries on millions of records take surprisingly 
 little power. Based on what you've described, changing a field from variable 
 to fixed might not save you any space and could even increase it depending on 
 what you have. In any case, the difference won't be worth worrying about.
 
 Whatever solution you go with, I'd recommend learning to provision yourself 
 resources when you can find some time. Work is hard enough when you can't get 
 the resources you need. When you can simply assign them to yourself, the 
 tools you need are always at hand so life gets much easier and more fun.
 
 kyle


Re: [CODE4LIB] Processing Circ data

2015-08-06 Thread Dunham, Denise
You can use SQLite within R via the sqldf package.  It allows you to perform 
SQL SELECT statements on your data in memory.  I've used it with datasets in 
this size range and it was fairly fast. 
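
A short example of the sqldf pattern Denise describes, run against a data frame already loaded in memory (the data frame and column names are assumptions):

  library(sqldf)
  # 'items' is a data frame already read in from the circ extract.
  per_bib <- sqldf("SELECT bib_record_number, SUM(total_checkouts) AS circs
                    FROM items
                    GROUP BY bib_record_number")
  head(per_bib)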



Denise Dunham
Information Discovery Services Team
River Campus Libraries
University of Rochester
Rochester, NY 14627
585.275.1059
ddun...@library.rochester.edu



-Original Message-
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of 
Harrison G. Dekker
Sent: Thursday, August 06, 2015 12:08 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Processing Circ data

Hi Cynthia,

R would be ideal for the types of data manipulation you describe and would allow 
you to automate the entire process. If you can share a sample of your data and 
examples of the types of queries you're running, I'd be glad to help you get 
started.

If you'd like to keep a relational database in your workflow, check out SQLite. 
It's a file format rather than a database server, so it won't be an issue for your 
IT staff. There's a Firefox plug-in that provides basic client functionality, 
and you can also easily access the tables from R
(directly) or for that matter with Access or Excel via odbc (not what I'd 
recommend, but it's possible!).

Harrison Dekker
Head, Library Data Lab
UC Berkeley Libraries

On Thu, Aug 6, 2015 at 6:05 AM, Harper, Cynthia char...@vts.edu wrote:

 I have compacted the database, and I'm using the Group By SQL query. I 
 think I actually am hitting the 2GB limit, because of all the data I 
 have for each row. I'm wondering whether the field I added for reserves 
 history notes is being treated as a fixed-length field for every record, 
 rather than as a variable-length field that appears only for the small 
 number of records that have been put on reserve.  I suppose if I 
 exported my data in two tables - bib and item data, the database would 
 be much more efficient than the flat-file approach I've been using.  
 Time to turn the mind back on, rather than just taking the lazy approach 
 every time...

 Cindy

 -Original Message-
 From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf 
 Of Kevin Ford
 Sent: Wednesday, August 05, 2015 5:16 PM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] Processing Circ data

 On the surface, your difficulties suggest you may need to look at a few 
 optimization tactics. Apologies if these are things you've already 
 considered and addressed - just offering a suggestion.

 This page [1] is for Access 2003, but the items under "Improve query 
 performance" should apply - I think - to newer versions also.  I'll 
 draw specific attention to 1) compacting the database; 2) making sure 
 you have an index set up on the bib record number and number-of-circs 
 fields; and 3) making sure you are using the GROUP BY SQL syntax [2].

 Now, I'm not terribly familiar with Access so I can't actually help 
 you with point/click instructions, but the above are common 'gotchas' 
 that could be a problem regardless of RDBMS.

 Yours,
 Kevin

 [1] https://support.microsoft.com/en-us/kb/209126
  [2] http://www.w3schools.com/sql/sql_groupby.asp



 On 8/5/15 4:01 PM, Harper, Cynthia wrote:
  Well, I guess it could be bad data, but I don't know how to tell. I
 think I've done more than this before.
 
  I have a "Find duplicates" query that groups by bib record number.  That 
 query seemed to take about 40 minutes to process. Then I added a 
 criterion to limit it to only records that had 0 circs this year. That 
 query displays the rotating cursor, then says "Not Responding", then 
 the cursor, and loops through that for hours.  Maybe I can find the 
 bad data in Access, but I'd be glad to find more modern data analysis 
 software.  My db is 136,256 KB. 
 But adding that extra query will probably put it over the 2GB mark.  
 I've tried extracting to a CSV, and that didn't work. Maybe I'll try a 
 "Make Table" query to a separate db.
 
  Or the OpenRefine suggestion sounds good too.
 
  Cindy Harper
 
  -Original Message-
  From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf 
  Of Kevin Ford
  Sent: Wednesday, August 05, 2015 4:23 PM
  To: CODE4LIB@LISTSERV.ND.EDU
  Subject: Re: [CODE4LIB] Processing Circ data
 
  Hi Cindy,
 
  This doesn't quite address your issue, but, unless you've hit the 2 
  GB
 Access size limit [1], Access can handle a good deal more than the 250,000 
 item records (rows, yes?) you cited.
 
  What makes you think you've hit the limit?  Slowness, something else?
 
  All the best,
  Kevin
 
  [1]
   https://support.office.com/en-us/article/Access-2010-specifications-1e521481-7f9a-46f7-8ed9-ea9dff1fa854

Re: [CODE4LIB] Processing Circ data

2015-08-06 Thread Harrison G. Dekker
Hi Cynthia,

R would be ideal for the types of data manipulation you describe and would
allow you to automate the entire process. If you can share a sample of your
data and examples of the types of queries you're running, I'd be glad to
help you get started.

If you'd like to keep a relational database in your workflow, check out
SQLite. It's a file format rather than a database server, so it won't be an
issue for your IT staff. There's a Firefox plug-in that provides basic
client functionality, and you can also easily access the tables from R
(directly) or for that matter with Access or Excel via odbc (not what I'd
recommend, but it's possible!).

Harrison Dekker
Head, Library Data Lab
UC Berkeley Libraries

On Thu, Aug 6, 2015 at 6:05 AM, Harper, Cynthia char...@vts.edu wrote:

 I have compacted the database, and I'm using the Group By SQL query. I
 think I actually am hitting the 2GB limit, because of all the data I have
 for each row. I'm wondering whether the field I added for reserves history
 notes is being treated as a fixed-length field for every record, rather
 than as a variable-length field that appears only for the small number of
 records that have been put on reserve.  I suppose if I exported my data in two
 tables - bib and item data, the database would be much more efficient than
 the flat-file approach I've been using.  Time to turn the mind back on,
 rather than just taking the lazy approach every time...

 Cindy

 -Original Message-
 From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of
 Kevin Ford
 Sent: Wednesday, August 05, 2015 5:16 PM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] Processing Circ data

 On the surface, your difficulties suggest you may need to look at a few
 optimization tactics. Apologies if these are things you've already
 considered and addressed - just offering a suggestion.

 This page [1] is for Access 2003, but the items under "Improve query
 performance" should apply - I think - to newer versions also.  I'll draw
 specific attention to 1) compacting the database; 2) making sure you have
 an index set up on the bib record number and number-of-circs fields;
 and 3) making sure you are using the GROUP BY SQL syntax [2].

 Now, I'm not terribly familiar with Access so I can't actually help you
 with point/click instructions, but the above are common 'gotchas' that
 could be a problem regardless of RDBMS.

 Yours,
 Kevin

 [1] https://support.microsoft.com/en-us/kb/209126
 [2] http://www.w3schools.com/sql/sql_groupby.asp



 On 8/5/15 4:01 PM, Harper, Cynthia wrote:
  Well, I guess it could be bad data, but I don't know how to tell. I
 think I've done more than this before.
 
  I have a "Find duplicates" query that groups by bib record number.  That
 query seemed to take about 40 minutes to process. Then I added a criterion
 to limit it to only records that had 0 circs this year. That query displays
 the rotating cursor, then says "Not Responding", then the cursor, and loops
 through that for hours.  Maybe I can find the bad data in Access, but I'd be
 glad to find more modern data analysis software.  My db is 136,256 KB.
 But adding that extra query will probably put it over the 2GB mark.  I've
 tried extracting to a CSV, and that didn't work. Maybe I'll try a "Make
 Table" query to a separate db.
 
  Or the OpenRefine suggestion sounds good too.
 
  Cindy Harper
 
  -Original Message-
  From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf
  Of Kevin Ford
  Sent: Wednesday, August 05, 2015 4:23 PM
  To: CODE4LIB@LISTSERV.ND.EDU
  Subject: Re: [CODE4LIB] Processing Circ data
 
  Hi Cindy,
 
  This doesn't quite address your issue, but, unless you've hit the 2 GB
 Access size limit [1], Access can handle a good deal more than the 250,000 item
 records (rows, yes?) you cited.
 
  What makes you think you've hit the limit?  Slowness, something else?
 
  All the best,
  Kevin
 
  [1]
  https://support.office.com/en-us/article/Access-2010-specifications-1e
  521481-7f9a-46f7-8ed9-ea9dff1fa854
 
 
 
 
 
  On 8/5/15 3:07 PM, Harper, Cynthia wrote:
   Hi all. What are you using to process circ data for ad-hoc queries?  I
 usually extract csv or tab-delimited files - one row per item record, with
 identifying bib record data, then total checkouts over the given time
 period(s).  I have been importing these into Access then grouping them by
 bib record. I think that I've reached the limits of scalability for Access
 for this project now, with 250,000 item records.  Does anyone do this in
  R?  My other go-to software for data processing is RapidMiner free
 version.  Or do you just use MySQL or other SQL database?  I was looking
 into doing it in R with RSQLite (just read about this and sqldf
 http://www.r-bloggers.com/make-r-speak-sql-with-sqldf/ ) because ...  I'm
 rusty enough in R that if anyone will give me some start-off data import
 code, that would be great.
 
  Cindy Harper
  E-services and periodicals librarian
  Virginia Theological Seminary

Re: [CODE4LIB] Processing Circ data

2015-08-05 Thread Owen Stephens
Another option might be to use OpenRefine http://openrefine.org - this should 
easily handle 250,000 rows. I find it good for basic data analysis, and there 
are extensions which offer some visualisations (e.g. the VIB BITs extension 
which will plot simple data using d3: 
https://www.bits.vib.be/index.php/software-overview/openrefine)

I’ve written an introduction to OpenRefine available at 
http://www.meanboyfriend.com/overdue_ideas/2014/11/working-with-data-using-openrefine/

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

 On 5 Aug 2015, at 21:07, Harper, Cynthia char...@vts.edu wrote:
 
  Hi all. What are you using to process circ data for ad-hoc queries?  I 
 usually extract csv or tab-delimited files - one row per item record, with 
 identifying bib record data, then total checkouts over the given time 
 period(s).  I have been importing these into Access then grouping them by bib 
 record. I think that I've reached the limits of scalability for Access for 
 this project now, with 250,000 item records.  Does anyone do this in R?  My 
  other go-to software for data processing is RapidMiner free version.  Or do 
 you just use MySQL or other SQL database?  I was looking into doing it in R 
 with RSQLite (just read about this and sqldf  
 http://www.r-bloggers.com/make-r-speak-sql-with-sqldf/ ) because I'm sure my 
 IT department will be skeptical of letting me have MySQL on my desktop.  
 (I've moved into a much more users-don't-do-real-computing kind of 
 environment).  I'm rusty enough in R that if anyone will give me some 
 start-off data import code, that would be great.
 
 Cindy Harper
 E-services and periodicals librarian
 Virginia Theological Seminary
 Bishop Payne Library
 3737 Seminary Road
 Alexandria VA 22304
  char...@vts.edu
 703-461-1794


Re: [CODE4LIB] Processing Circ data

2015-08-05 Thread Kevin Ford

Hi Cindy,

This doesn't quite address your issue, but, unless you've hit the 2 GB 
Access size limit [1], Access can handle a good deal more than the 250,000 
item records (rows, yes?) you cited.


What makes you think you've hit the limit?  Slowness, something else?

All the best,
Kevin

[1] 
https://support.office.com/en-us/article/Access-2010-specifications-1e521481-7f9a-46f7-8ed9-ea9dff1fa854






On 8/5/15 3:07 PM, Harper, Cynthia wrote:

Hi all. What are you using to process circ data for ad-hoc queries?  I usually 
extract csv or tab-delimited files - one row per item record, with identifying 
bib record data, then total checkouts over the given time period(s).  I have 
been importing these into Access then grouping them by bib record. I think that 
I've reached the limits of scalability for Access for this project now, with 
250,000 item records.  Does anyone do this in R?  My other go-to software for 
data processing is RapidMiner free version.  Or do you just use MySQL or other 
SQL database?  I was looking into doing it in R with RSQLite (just read about 
this and sqldf  http://www.r-bloggers.com/make-r-speak-sql-with-sqldf/ ) 
because I'm sure my IT department will be skeptical of letting me have MySQL on 
my desktop.  (I've moved into a much more users-don't-do-real-computing kind of 
environment).  I'm rusty enough in R that if anyone will give me some start-off 
data import code, that would be great.

Cindy Harper
E-services and periodicals librarian
Virginia Theological Seminary
Bishop Payne Library
3737 Seminary Road
Alexandria VA 22304
char...@vts.edu
703-461-1794



Re: [CODE4LIB] Processing Circ data

2015-08-05 Thread Harper, Cynthia
Well, I guess it could be bad data, but I don't know how to tell. I think I've 
done more than this before.

I have a "Find duplicates" query that groups by bib record number.  That query 
seemed to take about 40 minutes to process. Then I added a criterion to limit it 
to only records that had 0 circs this year. That query displays the rotating 
cursor, then says "Not Responding", then the cursor, and loops through that for 
hours.  Maybe I can find the bad data in Access, but I'd be glad to find more 
modern data analysis software.  My db is 136,256 KB.  But adding that extra 
query will probably put it over the 2GB mark.  I've tried extracting to a CSV, 
and that didn't work. Maybe I'll try a "Make Table" query to a separate db.

Or the OpenRefine suggestion sounds good too.

Cindy Harper

-Original Message-
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Kevin 
Ford
Sent: Wednesday, August 05, 2015 4:23 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Processing Circ data

Hi Cindy,

This doesn't quite address your issue, but, unless you've hit the 2 GB Access 
size limit [1], Access can handle a good deal more than the 250,000 item records 
(rows, yes?) you cited.

What makes you think you've hit the limit?  Slowness, something else?

All the best,
Kevin

[1]
https://support.office.com/en-us/article/Access-2010-specifications-1e521481-7f9a-46f7-8ed9-ea9dff1fa854





On 8/5/15 3:07 PM, Harper, Cynthia wrote:
 Hi all. What are you using to process circ data for ad-hoc queries?  I 
 usually extract csv or tab-delimited files - one row per item record, with 
 identifying bib record data, then total checkouts over the given time 
 period(s).  I have been importing these into Access then grouping them by bib 
 record. I think that I've reached the limits of scalability for Access for 
 this project now, with 250,000 item records.  Does anyone do this in R?  My 
 other go-to software for data processing is RapidMiner free version.  Or do 
 you just use MySQL or other SQL database?  I was looking into doing it in R 
 with RSQLite (just read about this and sqldf  
 http://www.r-bloggers.com/make-r-speak-sql-with-sqldf/ ) because ...  I'm 
 rusty enough in R that if anyone will give me some start-off data import 
 code, that would be great.

 Cindy Harper
 E-services and periodicals librarian
 Virginia Theological Seminary
 Bishop Payne Library
 3737 Seminary Road
 Alexandria VA 22304
 char...@vts.edu
 703-461-1794



Re: [CODE4LIB] Processing Circ data

2015-08-05 Thread Kevin Ford
On the surface, your difficulties suggest you may need to look at a few 
optimization tactics. Apologies if these are things you've already 
considered and addressed - just offering a suggestion.


This page [1] is for Access 2003, but the items under "Improve query 
performance" should apply - I think - to newer versions also.  I'll draw 
specific attention to 1) compacting the database; 2) making sure you 
have an index set up on the bib record number and number-of-circs 
fields; and 3) making sure you are using the GROUP BY SQL syntax [2].


Now, I'm not terribly familiar with Access so I can't actually help you 
with point/click instructions, but the above are common 'gotchas' that 
could be a problem regardless of RDBMS.


Yours,
Kevin

[1] https://support.microsoft.com/en-us/kb/209126
[2] http://www.w3schools.com/sql/sql_groupby.asp



On 8/5/15 4:01 PM, Harper, Cynthia wrote:

Well, I guess it could be bad data, but I don't know how to tell. I think I've 
done more than this before.

I have a "Find duplicates" query that groups by bib record number.  That query seemed to 
take about 40 minutes to process. Then I added a criterion to limit it to only records that had 0 
circs this year. That query displays the rotating cursor, then says "Not Responding", then 
the cursor, and loops through that for hours.  Maybe I can find the bad data in Access, but I'd be glad 
to find more modern data analysis software.  My db is 136,256 KB.  But adding that extra query will 
probably put it over the 2GB mark.  I've tried extracting to a CSV, and that didn't work. Maybe I'll 
try a "Make Table" query to a separate db.

Or the OpenRefine suggestion sounds good too.

Cindy Harper

-Original Message-
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Kevin 
Ford
Sent: Wednesday, August 05, 2015 4:23 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Processing Circ data

Hi Cindy,

This doesn't quite address your issue, but, unless you've hit the 2 GB Access size limit 
[1], Access can handle a good deal more than the 250,000 item records (rows, 
yes?) you cited.

What makes you think you've hit the limit?  Slowness, something else?

All the best,
Kevin

[1]
https://support.office.com/en-us/article/Access-2010-specifications-1e521481-7f9a-46f7-8ed9-ea9dff1fa854





On 8/5/15 3:07 PM, Harper, Cynthia wrote:

Hi all. What are you using to process circ data for ad-hoc queries?  I usually 
extract csv or tab-delimited files - one row per item record, with identifying 
bib record data, then total checkouts over the given time period(s).  I have 
been importing these into Access then grouping them by bib record. I think that 
I've reached the limits of scalability for Access for this project now, with 
250,000 item records.  Does anyone do this in R?  My other go-to software for 
data processing is RapidMiner free version.  Or do you just use MySQL or other 
SQL database?  I was looking into doing it in R with RSQLite (just read about 
this and sqldf  http://www.r-bloggers.com/make-r-speak-sql-with-sqldf/ ) 
because ...  I'm rusty enough in R that if anyone will give me some start-off 
data import code, that would be great.

Cindy Harper
E-services and periodicals librarian
Virginia Theological Seminary
Bishop Payne Library
3737 Seminary Road
Alexandria VA 22304
char...@vts.edu
703-461-1794