Re: [CODE4LIB] Processing Circ data
On Wed, Aug 5, 2015 at 1:07 PM, Harper, Cynthia char...@vts.edu wrote:

Hi all. What are you using to process circ data for ad-hoc queries? I usually extract csv or tab-delimited files - one row per item record, with identifying bib record data, then total checkouts over the given time period(s). I have been importing these into Access then grouping them by bib record. I think that I've reached the limits of scalability for Access for this project now, with 250,000 item records. Does anyone do this in R? My other go-to software for data processing is the free version of RapidMiner. Or do you just use MySQL or another SQL database? I was looking into doing it in R with RSQLite (just read about this and sqldf: http://www.r-bloggers.com/make-r-speak-sql-with-sqldf/) because I'm sure my IT department will be skeptical of letting me have MySQL on my desktop. (I've moved into a much more users-don't-do-real-computing kind of environment.) I'm rusty enough in R that if anyone will give me some start-off data import code, that would be great.

As has been mentioned already, it's worth investigating whether OpenRefine or SQLite are options for you. If not, I'd be inclined to explore solutions that don't rely on your local IT dept. It's so easy to spend far more time going through approval, procurement, and then negotiating local IT security/policies than actually working that it pays to do a lot of things in the cloud.

There are many services out there, but I like Amazon for occasional-need things because you can provision anything you want in minutes and they're stupid cheap. If all you need is MySQL for a few minutes now and then, just pay for the Relational Database Service (RDS). If you'd rather have a server and run MySQL off it, get an EBS-backed EC2 instance (the reason to go this route rather than instance store is improved IO, and your data is all retained if you shut off the server without taking a snapshot). Depending on your usage, bills of less than a buck a month are very doable. If you need something that runs 24x7, other routes will probably be more attractive. Another option is to try the MySQL built into cheapo web hosting accounts like Bluehost, though you might find that your disk IO gets you throttled. But it might be worth a shot.

If doing this work on your desktop is acceptable (i.e. other people don't need access to this service), you might seriously consider just doing it on a personal laptop that you can install anything you want on. In addition to MySQL, you can also install VirtualBox, which is a great environment for provisioning servers that you can export to other environments or even carry around on your cell phone.

With regard to some of the specific issues you bring up, 40 minutes for a query on a database that size is insane, which indicates the tool you have is not up to the job. Because of the way databases store info, performance degrades on a logarithmic (rather than linear) basis on indexed data. In plain English, this means even queries on millions of records take surprisingly little power. Based on what you've described, changing a field from variable to fixed length might not save you any space and could even increase it, depending on what you have. In any case, the difference won't be worth worrying about.

Whatever solution you go with, I'd recommend learning to provision yourself resources when you can find some time. Work is hard enough when you can't get the resources you need. When you can simply assign them to yourself, the tools you need are always at hand, so life gets much easier and more fun.

kyle
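A minimal sketch of the kind of start-off R import code asked about above - assuming a tab-delimited export with hypothetical columns bib_num, title, and checkouts (adjust the names to match the real export):

    # Load a tab-delimited circ export and total checkouts per bib record.
    # File name and column names are hypothetical.
    library(DBI)
    library(RSQLite)

    items <- read.delim("items.txt", stringsAsFactors = FALSE)

    # Write into an on-disk SQLite file so results outlive the R session
    con <- dbConnect(SQLite(), "circ.sqlite")
    dbWriteTable(con, "items", items, overwrite = TRUE)

    totals <- dbGetQuery(con, "
      SELECT bib_num, COUNT(*) AS item_count, SUM(checkouts) AS total_checkouts
      FROM items
      GROUP BY bib_num")

    dbDisconnect(con)

Nothing here needs IT approval: RSQLite bundles SQLite inside the R package, so there is no server to install.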
Re: [CODE4LIB] Processing Circ data
I have compacted the database, and I'm using the GROUP BY SQL query. I think I actually am hitting the 2GB limit, because of all the data I have for each row. I'm wondering whether the field I added for reserves history notes is treated as a fixed-length field for every record, rather than variable length, even though it only has content for the small number of records that have been put on reserve. I suppose if I exported my data as two tables - bib and item data - the database would be much more efficient than the flat-file approach I've been using. Time to turn the mind back on, rather than just taking the lazy approach every time...

Cindy

-Original Message-
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Kevin Ford
Sent: Wednesday, August 05, 2015 5:16 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Processing Circ data

On the surface, your difficulties suggest you may need to look at a few optimization tactics. Apologies if these are things you've already considered and addressed - just offering a suggestion. ...
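A sketch of the two-table bib/item split Cindy describes, using SQLite from R - table and column names are hypothetical:

    # Hypothetical bib/item split: bib-level data (including the long
    # reserves-notes field) stored once per bib, item rows kept narrow.
    library(DBI)
    library(RSQLite)

    con <- dbConnect(SQLite(), "circ.sqlite")

    dbExecute(con, "CREATE TABLE bib (
      bib_num       TEXT PRIMARY KEY,
      title         TEXT,
      reserve_notes TEXT)")

    dbExecute(con, "CREATE TABLE item (
      item_num  TEXT PRIMARY KEY,
      bib_num   TEXT REFERENCES bib(bib_num),
      checkouts INTEGER)")

    # Totals now come from a join instead of a flat file
    totals <- dbGetQuery(con, "
      SELECT b.bib_num, b.title, SUM(i.checkouts) AS total_checkouts
      FROM bib b JOIN item i ON i.bib_num = b.bib_num
      GROUP BY b.bib_num, b.title")

    dbDisconnect(con)

The payoff is that bib-level fields are stored once per bib record instead of being repeated on every item row.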
Re: [CODE4LIB] Processing Circ data
I did just bring in my own laptop to see if my problem is unique to my work computer. I actually have used Amazon AWS, and yes, that might be the best option. I've been looking into why my MS Access job is limited to 25% of my CPU time - maybe Access just can't use multiple processors (25% of total CPU would be one fully loaded core on a four-core machine). I'm going to investigate SQLite and OpenRefine on my personal laptop. Thanks all!

Cindy Harper

-Original Message-
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Kyle Banerjee
Sent: Thursday, August 06, 2015 12:34 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Processing Circ data

As has been mentioned already, it's worth investigating whether OpenRefine or SQLite are options for you. If not, I'd be inclined to explore solutions that don't rely on your local IT dept. ...
Re: [CODE4LIB] Processing Circ data
Hello,

I'd go with a full database solution over SQLite. No limitations.

Thanks,

Sent from my iPhone

On Aug 6, 2015, at 11:49 AM, Harper, Cynthia char...@vts.edu wrote: I did just bring in my own laptop to see if my problem is unique to my work computer. I actually have used Amazon AWS, and yes, that might be the best option. ...
Re: [CODE4LIB] Processing Circ data
You can use SQLite within R using the sqldf package. It allows you to perform SQL SELECT statements on your data in memory. I've used it with datasets in this size range and it was fairly fast.

Denise Dunham
Information Discovery Services Team
River Campus Libraries
University of Rochester
Rochester, NY 14627
585.275.1059
ddun...@library.rochester.edu

-Original Message-
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Harrison G. Dekker
Sent: Thursday, August 06, 2015 12:08 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Processing Circ data

Hi Cynthia, R would be ideal for the types of data manipulation you describe and would allow you to automate the entire process. ...
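A minimal sqldf sketch along the lines Denise describes - sqldf runs the query against an in-memory SQLite copy of the data frame (file and column names are hypothetical):

    # Group a flat item-level export by bib record entirely in memory.
    library(sqldf)

    items <- read.delim("items.txt", stringsAsFactors = FALSE)

    zero_circ <- sqldf("
      SELECT bib_num, SUM(checkouts) AS total_checkouts
      FROM items
      GROUP BY bib_num
      HAVING SUM(checkouts) = 0")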
Re: [CODE4LIB] Processing Circ data
Hi Cynthia,

R would be ideal for the types of data manipulation you describe and would allow you to automate the entire process. If you can share a sample of your data and examples of the types of queries you're running, I'd be glad to help you get started.

If you'd like to keep a relational database in your workflow, check out SQLite. It's a file format rather than a database server, so it won't be an issue for your IT staff. There's a Firefox plug-in that provides basic client functionality, and you can also easily access the tables from R (directly) or for that matter from Access or Excel via ODBC (not what I'd recommend, but it's possible!).

Harrison Dekker
Head, Library Data Lab
UC Berkeley Libraries

On Thu, Aug 6, 2015 at 6:05 AM, Harper, Cynthia char...@vts.edu wrote: I have compacted the database, and I'm using the GROUP BY SQL query. I think I actually am hitting the 2GB limit, because of all the data I have for each row. ...
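A short sketch of what "access the tables from R directly" can look like against an existing SQLite file (file and table names are hypothetical):

    # Open an existing SQLite file and pull a table into a data frame.
    library(DBI)
    library(RSQLite)

    con <- dbConnect(SQLite(), "circ.sqlite")
    dbListTables(con)                    # list the tables the file contains
    items <- dbReadTable(con, "items")   # whole table as a data frame
    dbDisconnect(con)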
Re: [CODE4LIB] Processing Circ data
Another option might be to use OpenRefine (http://openrefine.org) - this should easily handle 250,000 rows. I find it good for basic data analysis, and there are extensions which offer some visualisations (e.g. the VIB BITs extension, which will plot simple data using d3: https://www.bits.vib.be/index.php/software-overview/openrefine).

I’ve written an introduction to OpenRefine, available at http://www.meanboyfriend.com/overdue_ideas/2014/11/working-with-data-using-openrefine/

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

On 5 Aug 2015, at 21:07, Harper, Cynthia char...@vts.edu wrote: Hi all. What are you using to process circ data for ad-hoc queries? ...
Re: [CODE4LIB] Processing Circ data
Hi Cindy,

This doesn't quite address your issue, but, unless you've hit the 2 GB Access size limit [1], Access can handle a good deal more than the 250,000 item records (rows, yes?) you cited. What makes you think you've hit the limit? Slowness, something else?

All the best,
Kevin

[1] https://support.office.com/en-us/article/Access-2010-specifications-1e521481-7f9a-46f7-8ed9-ea9dff1fa854

On 8/5/15 3:07 PM, Harper, Cynthia wrote: Hi all. What are you using to process circ data for ad-hoc queries? ...
Re: [CODE4LIB] Processing Circ data
Well, I guess it could be bad data, but I don't know how to tell. I think I've done more than this before. I have a Find Duplicates query that groups by bib record number. That query seemed to take about 40 minutes to process. Then I added a criterion to limit it to only records that had 0 circs this year. That query displays the rotating cursor, then says Not Responding, then the cursor, and loops through that for hours. Maybe I can find the Access bad data, but I'd be glad to find more modern data-analysis software.

My db is 136,256 kb. But adding that extra query will probably put it over the 2GB mark. I've tried extracting to a csv, and that didn't work. Maybe I'll try a Make Table query to a separate db. Or the OpenRefine suggestion sounds good too.

Cindy Harper

-Original Message-
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Kevin Ford
Sent: Wednesday, August 05, 2015 4:23 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Processing Circ data

Hi Cindy, This doesn't quite address your issue, but, unless you've hit the 2 GB Access size limit, Access can handle a good deal more than the 250,000 item records (rows, yes?) you cited. ...
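For reference, the "find duplicates, then keep only zero-circ records" query Cindy describes might look like this in SQL - shown here via sqldf in R, with hypothetical column names:

    # Bib records with more than one item row and no circs this year.
    library(sqldf)

    dups_no_circ <- sqldf("
      SELECT bib_num, COUNT(*) AS copies
      FROM items
      WHERE circs_this_year = 0
      GROUP BY bib_num
      HAVING COUNT(*) > 1")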
Re: [CODE4LIB] Processing Circ data
On the surface, your difficulties suggest you may need to look at a few optimization tactics. Apologies if these are things you've already considered and addressed - just offering a suggestion.

This page [1] is for Access 2003, but the items under "Improve query performance" should apply - I think - to newer versions also. I'll draw specific attention to 1) compacting the database; 2) making sure you have an index set up on the bib record number field and the number-of-circs field; and 3) making sure you are using the GROUP BY SQL syntax [2].

Now, I'm not terribly familiar with Access, so I can't actually help you with point/click instructions, but the above are common 'gotchas' that could be a problem regardless of RDBMS.

Yours,
Kevin

[1] https://support.microsoft.com/en-us/kb/209126
[2] http://www.w3schools.com/sql/sql_groupby.asp

On 8/5/15 4:01 PM, Harper, Cynthia wrote: Well, I guess it could be bad data, but I don't know how to tell. I think I've done more than this before. I have a Find Duplicates query that groups by bib record number. ...
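The indexes Kevin suggests, expressed in SQLite via R's DBI (table and field names are hypothetical). An index on the grouped field lets a GROUP BY read presorted data instead of scanning and sorting the whole table:

    # Index the fields used for grouping and filtering.
    library(DBI)
    library(RSQLite)

    con <- dbConnect(SQLite(), "circ.sqlite")
    dbExecute(con, "CREATE INDEX IF NOT EXISTS idx_items_bib ON items(bib_num)")
    dbExecute(con, "CREATE INDEX IF NOT EXISTS idx_items_circs ON items(checkouts)")
    dbDisconnect(con)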