[Neo4j] google n grams data set in neo4j

2011-11-27 Thread René Pickhardt
Hey Everyone,

I am currently advising two high school students on a programming project
for a German student competition.

They have inserted the German Google n-gram data set (several GB of natural
language) into a Neo4j database and used it for sentence prediction to
improve typing speed.

The entire project is far from complete, but some code is available that
shows how we modelled n-grams in Neo4j and what we used for prediction.

Both approaches are very basic and work as you would expect. Still, they
already work decently, showing again the power of Neo4j.
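
To give a rough picture of what such a basic model can look like, here is a
minimal sketch in Python. It is an assumption for illustration, not the
project's actual schema or code: words are nodes, each edge from a word to
its successor carries an occurrence count, and prediction returns the
successors with the highest counts. The class NGramGraph, its methods, and
the counts are all hypothetical, and a plain dictionary stands in for the
graph database.

```python
from collections import defaultdict


class NGramGraph:
    """Toy in-memory stand-in for a graph store (hypothetical, not Neo4j).

    Words are nodes; an edge word -> next_word carries an occurrence count.
    """

    def __init__(self):
        # edges[word][next_word] = how often next_word followed word
        self.edges = defaultdict(lambda: defaultdict(int))

    def add_bigram(self, left, right, count):
        # Accumulate counts, since the same bigram may appear on several rows.
        self.edges[left][right] += count

    def predict(self, word, k=3):
        # Rank successors by count; break ties alphabetically for determinism.
        successors = self.edges.get(word, {})
        ranked = sorted(successors.items(), key=lambda item: (-item[1], item[0]))
        return [next_word for next_word, _ in ranked[:k]]


graph = NGramGraph()
# Made-up counts for illustration only, not real corpus values.
graph.add_bigram("guten", "Morgen", 120)
graph.add_bigram("guten", "Tag", 300)
graph.add_bigram("guten", "Abend", 90)
graph.add_bigram("guten", "Tag", 50)
print(graph.predict("guten", k=2))  # ['Tag', 'Morgen']
```

In a real graph database the ranking step would be a traversal over the
outgoing relationships of the word node, ordered by the count property.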

We would be happy to get feedback, thoughts, and suggestions for further
improvement. Find more info in my blog post:
http://www.rene-pickhardt.de/download-google-n-gram-data-set-and-neo4j-source-code-for-storing-it/

or in the source code:
http://code.google.com/p/complet/source/browse/trunk/Completion_DataCollector/src/completion_datacollector/Main.java?spec=svn64&r=64

By the way, even though the code is just hacked together, it uses hash maps
to keep nodes in memory and speed up insertion, and it builds the Lucene
index afterwards. Of course, it would be even better to use the batch inserter.
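
The hash map idea can be sketched as follows. This is a minimal Python
illustration under stated assumptions, not the code from the repository:
ToyStore is a hypothetical stand-in for the database, and import_words is an
invented helper. The point is only that an in-memory map from word to node id
avoids an index lookup per inserted word, and that the index is built once
after the import has finished.

```python
class ToyStore:
    """Hypothetical stand-in for a graph database; it only hands out node ids."""

    def __init__(self):
        self.nodes = []   # node id -> word
        self.index = {}   # word -> node id, filled only after the import

    def create_node(self, word):
        self.nodes.append(word)
        return len(self.nodes) - 1

    def build_index(self):
        # One pass over all nodes instead of one index update per insert.
        self.index = {word: node_id for node_id, word in enumerate(self.nodes)}


def import_words(store, words):
    cache = {}  # word -> node id, kept in memory for the whole import
    for word in words:
        if word not in cache:
            # Only unseen words reach the store; repeats are served from memory.
            cache[word] = store.create_node(word)
    store.build_index()
    return cache


store = ToyStore()
cache = import_words(store, ["der", "die", "der", "das"])
print(len(store.nodes))  # 3
```

The trade-off is memory: the map has to hold every distinct word for the
duration of the import.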

Best regards, René
-- 
--
mobile: +49 (0)176 6433 2481

Skype: +49 (0)6131 / 4958926

Skype: rene.pickhardt

www.rene-pickhardt.de
 
___
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user


Re: [Neo4j] google n grams data set in neo4j

2011-11-28 Thread Peter Neubauer
Seriously cool stuff René!

I would love to hear more as the project progresses! Also, maybe the
dataset could be added to the example dataset collection for playing around
with neo4j? WDYT?

Cheers,

/peter neubauer

GTalk:  neubauer.peter
Skype   peter.neubauer
Phone   +46 704 106975
LinkedIn   http://www.linkedin.com/in/neubauer
Twitter  http://twitter.com/peterneubauer

http://www.neo4j.org  - NOSQL for the Enterprise.
http://startupbootcamp.org/- Öresund - Innovation happens HERE.




Re: [Neo4j] google n grams data set in neo4j

2011-11-28 Thread Jacopo Farina
That's AMAZING!
I was just thinking about using Neo4j to store some extracted n-grams. I
previously did it with a SQLite database, but with a graph an application
could perhaps traverse between nodes more efficiently.
One question: is it possible to download the Google n-gram corpus release
(or at least part of it) for free (and legally, of course)? I've only
found this page (
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13), but
it seems I would have to pay.
Cheers,
Jacopo Farina




Re: [Neo4j] google n grams data set in neo4j

2011-11-28 Thread Avi Shai
We're doing something similar, but I am afraid I can't release the code quite
yet. Great to have a free example out there though. One problem I found with
using n-grams and almost any database, neo4j included, is that speed is very
important if you want to use this in auto-complete. Therefore, I would
highly recommend doing one or more of the following.

1. Cache the entire dataset if possible.
2. If relying solely on Neo4j, as a corollary to #1, write a warm-up script.
3. Use a very fast caching layer such as memcached or Redis in addition to,
or instead of, Neo4j. For instance, a script can load Neo4j's data into
Redis, which then serves as an external index.

The gist is that if your auto-complete cannot do lookups in fractions of a
millisecond, it will just "feel" wrong even if it is below one second. For
that reason, we are going with #3 for web form auto-complete. For anything
like a spell-checker where speed is important, but not the only thing that
matters, a pure neo4j solution gives more sophisticated levels of checking
and algorithms to leverage.
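
As a rough illustration of option #3, here is a minimal Python sketch. It is
not our code, and every name in it is made up for the example: a plain
dictionary stands in for the external cache (Redis or memcached in practice),
build_prefix_cache plays the role of the loading script, and the word counts
are invented. The idea is to precompute the top completions per prefix once,
so that a query is a single key lookup rather than a graph traversal.

```python
from collections import defaultdict


def build_prefix_cache(word_counts, max_prefix_len=3, k=3):
    """Precompute the top-k completions for every prefix up to max_prefix_len."""
    buckets = defaultdict(list)
    for word, count in word_counts.items():
        for end in range(1, min(len(word), max_prefix_len) + 1):
            buckets[word[:end]].append((count, word))
    cache = {}
    for prefix, entries in buckets.items():
        # Highest count first; ties broken alphabetically.
        entries.sort(key=lambda entry: (-entry[0], entry[1]))
        cache[prefix] = [word for _, word in entries[:k]]
    return cache


def complete(cache, prefix):
    # One dictionary lookup at query time. Prefixes longer than
    # max_prefix_len are not cached and return an empty list here.
    return cache.get(prefix, [])


# Made-up counts for illustration only.
word_counts = {"Haus": 50, "Hand": 80, "Hallo": 20, "Hund": 70}
cache = build_prefix_cache(word_counts, k=2)
print(complete(cache, "Ha"))  # ['Hand', 'Haus']
```

Longer prefixes could fall back to the graph database, where the slower but
richer queries live.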
