Peter,

Google Refine is not a full ETL tool; it covers more of the T and L parts,
but it takes inspiration from the many ETL tools I've worked with.  David
Huynh is the primary programmer.  It can handle millions of records, but
you'll need quite a bit of memory, depending on how long the strings are,
how many columns there are, etc.  Typical Java object overhead is where the
memory gets eaten up, and you'll simply see long waits during garbage
collection if you don't have enough memory for the job.  I do some
enterprisey stuff on an Amazon EC2 instance with 16 GB, letting Refine have
most of it, when I'm doing some serious web scraping, splitting,
partitioning, and exporting.
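
For comparison with the AWK-style route, the fixed-width splitting step
discussed in the quoted thread below can be roughed out in a few lines of
Python. This is only a minimal sketch and not how Refine does it internally:
the column widths, the sample records, and the function names are all made up
for illustration.

```python
import csv
import io


def split_fixed_width(line, widths):
    """Cut one fixed-width record into fields, trimming the padding."""
    fields = []
    pos = 0
    for width in widths:
        fields.append(line[pos:pos + width].strip())
        pos += width
    return fields


def fixed_width_to_tsv(lines, widths):
    """Convert an iterable of fixed-width lines into one TSV string."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for line in lines:
        writer.writerow(split_fixed_width(line.rstrip("\n"), widths))
    return out.getvalue()


# Hypothetical layout: name (10 chars), state (5 chars), amount (8 chars).
WIDTHS = [10, 5, 8]
sample = [
    "Smith     NY   00012500\n",
    "Jones     CA   00000750\n",
]

print(fixed_width_to_tsv(sample, WIDTHS), end="")
```

Because it processes one line at a time, a script like this keeps memory use
flat no matter how many records there are, which is the trade-off against
loading everything into Refine for interactive cleanup.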

On Wed, Oct 19, 2011 at 8:37 AM, Peter Neubauer <
[email protected]> wrote:

> Very cool,
> just installed it. Anyone knows how well it works for bigger
> documents, like a couple of million, to massage them into good batch
> insert formats? Can it compete with AWK?
>
> Cheers,
>
> /peter neubauer
>
> GTalk:      neubauer.peter
> Skype       peter.neubauer
> Phone       +46 704 106975
> LinkedIn   http://www.linkedin.com/in/neubauer
> Twitter      http://twitter.com/peterneubauer
>
> http://www.neo4j.org               - NOSQL for the Enterprise.
> http://startupbootcamp.org/    - Öresund - Innovation happens HERE.
>
>
>
> On Wed, Oct 19, 2011 at 2:40 AM, Thad Guidry <[email protected]> wrote:
> > The author states in that write up that he had trouble separating the
> > fixed-width data file (like so many governments produce).
> >
> > I saw the need and had the Google Refine community add a great visual aid
> > feature to help with aligning fixed-width data files (along with working
> > with XML / RDFa / JSON etc.) and easily exporting to TSV, CSV, or whatever
> > formats you need.
> > (Much easier than wiring up jobs within Talend or creating load scripts
> > in an unfamiliar language for non-programmers.)
> >
> > My comment to the author:
> > " Google Refine 2.5 release has a great visual aid, unlike some other
> > tools (Excel), for importing and exporting fixed-width data text files.
> > I tried it on those same files and it allowed me to easily separate them
> > and create a tab- or comma-separated value file. The new visual aid for
> > this is only in the 2.5 RC release that can be downloaded here:
> > http://code.google.com/p/google-refine/downloads/list "
> >
> > On Tue, Oct 18, 2011 at 10:47 AM, Marko Rodriguez <[email protected]
> >wrote:
> >
> >> Digging it up was the easy part :D. Writing it was the hard part. This is
> >> the fellow who wrote it:
> >>
> >>        http://twitter.com/#!/davefauth
> >>
> >> Marko.
> >>
> >> http://markorodriguez.com
> >>
> >> On Oct 18, 2011, at 7:33 AM, Peter Neubauer wrote:
> >>
> >> > Hi all,
> >> > the other day Marko dug this up - using Neo4j with Campaign data.
> >> > Pretty cool!
> >> >
> >> > http://s113319.gridserver.com/?p=48
> >> >
> >> > Cheers,
> >> >
> >> > /peter neubauer
> >> >
> >> > GTalk:      neubauer.peter
> >> > Skype       peter.neubauer
> >> > Phone       +46 704 106975
> >> > LinkedIn   http://www.linkedin.com/in/neubauer
> >> > Twitter      http://twitter.com/peterneubauer
> >> >
> >> > http://www.neo4j.org               - Your high performance graph database.
> >> > http://startupbootcamp.org/    - Öresund - Innovation happens HERE.
> >> > http://www.thoughtmade.com - Scandinavia's coolest Bring-a-Thing party.
> >> > _______________________________________________
> >> > Neo4j mailing list
> >> > [email protected]
> >> > https://lists.neo4j.org/mailman/listinfo/user
> >>
> >>
> >
> >
> >
> > --
> > -Thad
> > http://www.freebase.com/view/en/thad_guidry
> >
>



-- 
-Thad
http://www.freebase.com/view/en/thad_guidry
_______________________________________________
Neo4j mailing list
[email protected]
https://lists.neo4j.org/mailman/listinfo/user
