Ruby crashes when I request all the pages with parsed = false.
Using the REST interface directly with curl, the result is a huge
JSON document with ~10,000 nodes.
Is there a way to limit the result size, like SQL's "SELECT * FROM
node WHERE parsed = 'true' LIMIT 100;"?
I tried using a traverser instead of querying the index:
node_to_parse = @neo.traverse(ob_root_node, "nodes", {
  "relationships" => [{"type" => "link", "direction" => "out"}],
  "prune evaluator" => {"language" => "javascript",
    "body" => "position.endNode().getProperty('parsed') == 'false';"},
  "return filter" => {"language" => "builtin", "name" => "all but start node"}})
/home/ker2x/.rvm/rubies/ruby-1.9.2-p180/lib/ruby/1.9.1/net/protocol.rb:140:in
`rescue in rbuf_fill': Timeout::Error (Timeout::Error)
from
/home/ker2x/.rvm/rubies/ruby-1.9.2-p180/lib/ruby/1.9.1/net/protocol.rb:134:in
`rbuf_fill'
from
/home/ker2x/.rvm/rubies/ruby-1.9.2-p180/lib/ruby/1.9.1/net/protocol.rb:116:in
`readuntil'
from
/home/ker2x/.rvm/rubies/ruby-1.9.2-p180/lib/ruby/1.9.1/net/protocol.rb:126:in
`readline'
from
/home/ker2x/.rvm/rubies/ruby-1.9.2-p180/lib/ruby/1.9.1/net/http.rb:2219:in
`read_status_line'
from
/home/ker2x/.rvm/rubies/ruby-1.9.2-p180/lib/ruby/1.9.1/net/http.rb:2208:in
`read_new'
from
/home/ker2x/.rvm/rubies/ruby-1.9.2-p180/lib/ruby/1.9.1/net/http.rb:1191:in
`transport_request'
from
/home/ker2x/.rvm/rubies/ruby-1.9.2-p180/lib/ruby/1.9.1/net/http.rb:1177:in
`request'
from
/home/ker2x/.rvm/rubies/ruby-1.9.2-p180/lib/ruby/1.9.1/net/http.rb:1170:in
`block in request'
from
/home/ker2x/.rvm/rubies/ruby-1.9.2-p180/lib/ruby/1.9.1/net/http.rb:627:in
`start'
from
/home/ker2x/.rvm/rubies/ruby-1.9.2-p180/lib/ruby/1.9.1/net/http.rb:1168:in
`request'
from
/home/ker2x/.rvm/gems/ruby-1.9.2-p180/gems/httparty-0.7.8/lib/httparty/request.rb:69:in
`perform'
from
/home/ker2x/.rvm/gems/ruby-1.9.2-p180/gems/httparty-0.7.8/lib/httparty.rb:390:in
`perform_request'
from
/home/ker2x/.rvm/gems/ruby-1.9.2-p180/gems/httparty-0.7.8/lib/httparty.rb:358:in
`post'
from
/home/ker2x/.rvm/gems/ruby-1.9.2-p180/gems/httparty-0.7.8/lib/httparty.rb:426:in
`post'
from
/home/ker2x/.rvm/gems/ruby-1.9.2-p180/gems/neography-0.0.13/lib/neography/rest.rb:363:in
`post'
from
/home/ker2x/.rvm/gems/ruby-1.9.2-p180/gems/neography-0.0.13/lib/neography/rest.rb:317:in
`traverse'
from nokogiri-test.rb:26:in `<main>'
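
The trace ends in rbuf_fill while reading the status line, which suggests the client gave up waiting for the server (Net::HTTP's default read timeout is 60 seconds). A minimal sketch of the same traversal request with a longer read timeout, using Net::HTTP directly instead of neography. The helper name build_traverse_request, the host/port, the node id and the 300-second value are assumptions for illustration, and a longer timeout may not be enough if the traversal itself is too large:

```ruby
require 'net/http'
require 'json'

# Hypothetical helper: builds (but does not send) a traversal request with a
# longer read timeout than Net::HTTP's 60-second default.
def build_traverse_request(node_id, description, read_timeout = 300)
  http = Net::HTTP.new('localhost', 7474) # assumed host and port
  http.read_timeout = read_timeout        # seconds to wait for the response
  request = Net::HTTP::Post.new("/db/data/node/#{node_id}/traverse/node")
  request['Content-Type'] = 'application/json'
  request['Accept'] = 'application/json'
  request.body = JSON.generate(description)
  [http, request]
end

description = { 'relationships' => [{ 'type' => 'link', 'direction' => 'out' }] }
http, request = build_traverse_request(1, description)
# With a server running, the request would be sent like this:
# response = http.request(request)
# nodes = JSON.parse(response.body)
```

Nothing is sent until http.request is called, so the timeout can be tuned separately from the traversal description.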
--
Keru
On Fri, Jul 1, 2011 at 1:15 PM, Michael Hunger
<[email protected]> wrote:
> Can you call the index REST request manually and see what it returns?
>
> see here
> http://components.neo4j.org/neo4j-server/snapshot/rest.html#Index_search_-_Exact_keyvalue_lookup
>
> curl -H Accept:application/json
> http://localhost:7474/db/data/index/node/my_nodes/the_key/the_value%20with%20space
>
> see here:
> http://stackoverflow.com/questions/547127/in-ruby-how-do-i-replace-the-question-mark-character-in-a-string
>
> require "addressable/uri"
> Addressable::URI.encode_component("http://test.com?test/test%test",Addressable::URI::CharacterClasses::PATH)
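
For the index lookup the whole value has to survive as one path segment, so ":" and "/" need percent-encoding too. A minimal sketch using only the standard library (ERB::Util.url_encode). The helper name encode_index_value is made up for illustration, and I have not checked how the server treats an encoded slash in the path:

```ruby
require 'erb'

# Hypothetical helper: percent-encode a value so it can be used as a single
# URL path segment (spaces become %20, ":" becomes %3A, "/" becomes %2F).
def encode_index_value(value)
  ERB::Util.url_encode(value)
end

puts encode_index_value('http://www.over-blog.com/')
# prints http%3A%2F%2Fwww.over-blog.com%2F
```

Unlike URI.escape, this also encodes reserved characters, which is what an exact key/value lookup URL needs.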
>
> Cheers
>
> Michael
>
>
> Am 01.07.2011 um 11:23 schrieb Laurent Laborde:
>
>> After a few runs (with more and more pages to crawl) it seems
>> that the result returned by the index is too big:
>>
>> /home/ker2x/.rvm/gems/ruby-1.9.2-p180/gems/crack-0.1.8/lib/crack/json.rb:54:
>> stack level too deep (SystemStackError)
>>
>> Any idea ? workaround ?
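
The error is raised inside crack's JSON parser (crack/json.rb), not in the server. One possible workaround, as a sketch only: fetch the response body yourself and parse it with the standard library's json, then keep just the first N entries. The helper name first_nodes and the sample data are made up, and this does not make the server's response any smaller:

```ruby
require 'json'

# Hypothetical helper: parse an index lookup response with the stdlib JSON
# parser and keep only the first `limit` node representations.
def first_nodes(json_text, limit = 100)
  JSON.parse(json_text).first(limit)
end

# Stand-in for a real response body: an array of node representations,
# each carrying its "self" URL.
sample = JSON.generate(
  (1..5).map { |i| { 'self' => "http://localhost:7474/db/data/node/#{i}" } }
)

first_nodes(sample, 2).each { |node| puts node['self'] }
```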
>>
>> thank you
>>
>> --
>> Ker2x
>>
>> On Fri, Jul 1, 2011 at 10:44 AM, Laurent Laborde <[email protected]> wrote:
>>> I used Base64.encode64 instead! It still didn't work.
>>> So I used Base64.encode64 with get_node_index instead of
>>> find_node_index, and it worked \o/
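
One caveat with Base64.encode64: it inserts a newline every 60 encoded characters (and one at the end), and its alphabet contains "+" and "/", which are awkward inside a URL. A sketch with Base64.urlsafe_encode64, which uses "-" and "_" and adds no line breaks. The helper names url_to_key and key_to_url are made up for illustration, and the same encoding has to be used both when adding to the index and when looking up:

```ruby
require 'base64'

# Hypothetical helper: turn a URL into a key that is safe inside a URL path.
def url_to_key(url)
  Base64.urlsafe_encode64(url)
end

# Hypothetical helper: recover the original URL from the key.
def key_to_url(key)
  Base64.urlsafe_decode64(key)
end

key = url_to_key('http://www.over-blog.com/')
puts key
puts key_to_url(key)
```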
>>>
>>> --
>>> Ker2x
>>>
>>> On Fri, Jul 1, 2011 at 10:25 AM, Laurent Laborde <[email protected]>
>>> wrote:
>>>> Thank you for your help.
>>>> As you probably noticed, I'm not good with Ruby (I'm a sysadmin ^^).
>>>>
>>>> I tried using URI.encode but it doesn't work as expected.
>>>>
>>>> irb(main):001:0> require 'uri'
>>>> => true
>>>> irb(main):002:0> puts URI.escape("http://www.over.blog.com/")
>>>> http://www.over.blog.com/
>>>> => nil
>>>> irb(main):003:0> puts URI.encode("http://www.over.blog.com/")
>>>> http://www.over.blog.com/
>>>> => nil
>>>>
>>>> I guess the output should look more like
>>>> "http%3A%2F%2Fwww.over-blog.com%2F", shouldn't it?
>>>>
>>>> --
>>>> Ker2x
>>>>
>>>> On Thu, Jun 30, 2011 at 6:40 PM, Michael Hunger
>>>> <[email protected]> wrote:
>>>>> You have to escape the URL index value,
>>>>> otherwise the Jersey REST framework consumes it silently. I had this
>>>>> problem when working on the birdies demo app. It took me a while to work
>>>>> that out.
>>>>>
>>>>> see http://github.com/jexp/birdies
>>>>> and http://birdies.heroku.com
>>>>>
>>>>> Michael
>>>>>
>>>>> Sent from my iBrick4
>>>>>
>>>>>
>>>>> Am 30.06.2011 um 17:43 schrieb Laurent Laborde <[email protected]>:
>>>>>
>>>>>> Friendly greetings!
>>>>>> I've been stuck on the same problem for many days (an hour per day) and I
>>>>>> can't find a solution.
>>>>>> I have 2 indexes (see source code below).
>>>>>> No problem with the "parsed" index, but the "url" index never returns any
>>>>>> result.
>>>>>> I don't know if it's because the URL isn't indexed or because the query on
>>>>>> the index is wrong.
>>>>>> Or something else?
>>>>>>
>>>>>> Could you please take a look and see what's wrong?
>>>>>> Thank you.
>>>>>>
>>>>>> (You can try to run the script, it works.)
>>>>>>
>>>>>> require 'nokogiri'
>>>>>> require 'open-uri'
>>>>>> require 'neography'
>>>>>>
>>>>>> #init neography
>>>>>> @neo = Neography::Rest.new
>>>>>> neo_root = @neo.get_root
>>>>>>
>>>>>> domaine = 'http://www.over-blog.com/'
>>>>>> parsed_idx = "ob_parsed_idx"
>>>>>> url_idx = "ob_url_idx"
>>>>>>
>>>>>> #FIRST RUN
>>>>>> #ob_root_node = @neo.create_node("domaine" => domaine, "parsed" => "false", "url" => domaine)
>>>>>> #@neo.create_relationship("obgraph", neo_root, ob_root_node)
>>>>>> #pidx = @neo.create_node_index(parsed_idx)
>>>>>> #uidx = @neo.create_node_index(url_idx)
>>>>>> #@neo.add_node_to_index(parsed_idx, "parsed", "false", ob_root_node)
>>>>>> ##@neo.add_node_to_index(url_idx, "url", domaine, ob_root_node)
>>>>>> #node_to_parse = @neo.get_node_index(parsed_idx, "parsed", "false")
>>>>>>
>>>>>> ob_root_node = @neo.traverse(neo_root, "nodes", { "relationships" => [{"type"=> "obgraph", "direction" => "out" }], "depth" => 1})
>>>>>> #node_to_parse = @neo.traverse(ob_root_node, "nodes", { "relationships" => [{"type"=> "link", "direction" => "out" }] })
>>>>>> node_to_parse = @neo.get_node_index(parsed_idx, "parsed", "false")
>>>>>>
>>>>>> #print @neo.list_node_indexes
>>>>>>
>>>>>> node_to_parse.each do |node|
>>>>>>
>>>>>> url_to_parse = @neo.get_node_properties(node)["url"]
>>>>>> printf("exploring : %s\n", url_to_parse)
>>>>>>
>>>>>> doc = Nokogiri::HTML(open(url_to_parse))
>>>>>> @neo.set_node_properties(node, {"parsed" => "true"})
>>>>>> @neo.remove_node_from_index(parsed_idx, node)
>>>>>> @neo.add_node_to_index(parsed_idx, "parsed", "true", node)
>>>>>>
>>>>>> doc.xpath('//a').each do |link|
>>>>>>
>>>>>> link_text = link.content.strip()
>>>>>> link_url = link['href'].to_s().strip()
>>>>>> link_title = link['title'].to_s().strip()
>>>>>>
>>>>>> link_url = link_url.sub(/#.*$/, "")
>>>>>>
>>>>>> if(link_url =~ /^\/.*/)
>>>>>> link_url = link_url.sub(/^\//, '')
>>>>>> link_url = domaine + link_url
>>>>>> end
>>>>>>
>>>>>> if(link_text == '')
>>>>>> link_text = link_title
>>>>>> end
>>>>>>
>>>>>>
>>>>>> #skiping empty stuff
>>>>>> next if link_url.empty?
>>>>>> next if link_text.empty?
>>>>>>
>>>>>> node_found = @neo.find_node_index(url_idx, "url", link_url)
>>>>>> #node_found = @neo.traverse(ob_root_node, "nodes", { "relationships" => [{"direction" => "out" }], "prune evaluator" => {"language" => "javascript", "body" => "position.endNode().getProperty(url) == #{link_url};"}, "return filter" => {"language" => "builtin", "name" => "all but start node"}})
>>>>>> print "\nsearching url #{link_url}\n"
>>>>>> printf("node_found : %s \n", node_found)
>>>>>> if(node_found.nil?)
>>>>>> printf("create node %s\n", link_url)
>>>>>> nnode = @neo.create_node("parsed" => "false", "url" => link_url)
>>>>>> @neo.add_node_to_index(url_idx, "url", link_url, nnode)
>>>>>> @neo.add_node_to_index(parsed_idx, "parsed", "false", nnode)
>>>>>> else
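>>>>>> # reuse the existing node so nnode is defined for the relationship below
>>>>>> # (assumes the index lookup returns an array of nodes)
>>>>>> nnode = node_found.first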
>>>>>> end
>>>>>>
>>>>>>
>>>>>> nrel = @neo.create_relationship("link", node, nnode)
>>>>>> @neo.set_relationship_properties(nrel, {"text" => link_text})
>>>>>>
>>>>>> #printf("%s => %s\n", link_text, link_url)
>>>>>>
>>>>>> end
>>>>>>
>>>>>> sleep(1.0)
>>>>>>
>>>>>>
>>>>>> end
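
The link clean-up in the loop above (strip the fragment, prefix the domain to paths starting with "/") can also be written with the standard library's URI.join, which additionally handles relative paths such as "page.html". A sketch only, with a made-up helper name normalize_link:

```ruby
require 'uri'

# Hypothetical helper: resolve href against base and drop the "#fragment".
# Returns nil for empty, fragment-only or unparsable links.
def normalize_link(base, href)
  href = href.to_s.strip.sub(/#.*\z/, '') # remove the fragment first
  return nil if href.empty?
  URI.join(base, href).to_s
rescue URI::Error
  nil
end

domaine = 'http://www.over-blog.com/'
puts normalize_link(domaine, '/a/b#top')
puts normalize_link(domaine, 'page.html')
```

Links for which the helper returns nil would be skipped, like the "next if link_url.empty?" line in the script.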
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Laurent "ker2x" Laborde
>>>>>> Sysadmin & DBA at http://www.over-blog.com/
>>>>>> _______________________________________________
>>>>>> Neo4j mailing list
>>>>>> [email protected]
>>>>>> https://lists.neo4j.org/mailman/listinfo/user