Peter, Google Refine is not an ETL tool; it's more the TL part, though it draws inspiration from the many ETL tools I've worked with. David Huynh is the primary programmer. It can handle millions, but you'll need quite a bit of memory, depending on how long the strings are, how many columns there are, etc. Typical Java object overhead is where the memory gets eaten up, and if you don't have enough memory for the job you'll simply see long waits while garbage collection is going on. I do some enterprisey stuff on an Amazon EC2 instance with 16 GB, letting Refine have most of it, when I'm doing some serious web scraping, splitting, partitioning, and exporting.

On Wed, Oct 19, 2011 at 8:37 AM, Peter Neubauer <[email protected]> wrote:
> Very cool,
> just installed it. Anyone knows how well it works for bigger
> documents, like a couple of million, to massage them into good batch
> insert formats? Can it compete with AWK?
>
> Cheers,
>
> /peter neubauer
>
> GTalk: neubauer.peter
> Skype peter.neubauer
> Phone +46 704 106975
> LinkedIn http://www.linkedin.com/in/neubauer
> Twitter http://twitter.com/peterneubauer
>
> http://www.neo4j.org - NOSQL for the Enterprise.
> http://startupbootcamp.org/ - Öresund - Innovation happens HERE.
>
> On Wed, Oct 19, 2011 at 2:40 AM, Thad Guidry <[email protected]> wrote:
> > The author states in that write up that he had trouble separating the
> > fixed-width data file (like so many governments produce).
> >
> > I saw the need and had the Google Refine community add a great visual aid
> > feature to help with aligning fixed-width data files (along with working
> > with XML / RDFa / JSON etc and easily exporting to TSV, CSV, or whatever
> > formats you need.
> > (Much easier than wiring up jobs within Talend or creating load scripts in
> > an unfamiliar language for non-programmers)
> >
> > My comment to the author:
> > " Google Refine 2.5 release has a Great visual aid unlike some other tools
> > (Excel) for importing and exporting fixed width data text files. I tried it
> > on those same files and it allowed me to easily separate and created a
> > delimited tabbed or comma separated value file. The new visual aid for this
> > is only in the 2.5 RC release that can be downloaded here:
> > http://code.google.com/p/google-refine/downloads/list "
> >
> > On Tue, Oct 18, 2011 at 10:47 AM, Marko Rodriguez <[email protected]> wrote:
> >
> >> Digging it up was the easy part :D. Writing it was the hard part. This is
> >> the fellow who wrote it:
> >>
> >> http://twitter.com/#!/davefauth
> >>
> >> Marko.
> >>
> >> http://markorodriguez.com
> >>
> >> On Oct 18, 2011, at 7:33 AM, Peter Neubauer wrote:
> >>
> >> > Hi all,
> >> > the other day Marko dug this up - using Neo4j with Campaign data. Pretty cool!
> >> >
> >> > http://s113319.gridserver.com/?p=48
> >> >
> >> > Cheers,
> >> >
> >> > /peter neubauer
> >> > _______________________________________________
> >> > Neo4j mailing list
> >> > [email protected]
> >> > https://lists.neo4j.org/mailman/listinfo/user
> >
> > --
> > -Thad
> > http://www.freebase.com/view/en/thad_guidry

--
-Thad
http://www.freebase.com/view/en/thad_guidry

