Re: retrieve ids of all indexed docs efficiently

2017-01-18 Thread Erick Erickson
Added a tip on the CursorMark CWiki page, thanks for the suggestion!

On Wed, Jan 18, 2017 at 5:21 PM, Pushkar Raste  wrote:
> I think we should add the suggestion about docValues to the cursorMark wiki
> (documentation); we ran into the same problem too.


Re: retrieve ids of all indexed docs efficiently

2017-01-18 Thread Pushkar Raste
I think we should add the suggestion about docValues to the cursorMark wiki
(documentation); we ran into the same problem too.
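
For anyone landing here from the archive, a minimal sketch of what "making the id field docValues" could look like as a Schema API command. It assumes a managed schema, a field literally named id, and a field type named string; none of these is confirmed in this thread, and docvalues_field_command is a hypothetical helper name. Changing docValues on an existing field also requires a full reindex before it takes effect.

```python
import json


def docvalues_field_command(name: str = "id", field_type: str = "string") -> str:
    """Build a Schema API 'replace-field' body that turns docValues on.

    The field name and type are assumptions; adjust them to the real schema.
    """
    command = {
        "replace-field": {
            "name": name,
            "type": field_type,
            "indexed": True,
            "stored": True,
            "docValues": True,  # the setting suggested in this thread
        }
    }
    return json.dumps(command, indent=2)


if __name__ == "__main__":
    # The printed JSON would be POSTed to the collection's /schema endpoint.
    print(docvalues_field_command())
```

With a classic schema.xml the equivalent would be setting docValues="true" on the field definition by hand.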

On Jan 18, 2017 5:52 PM, "Erick Erickson"  wrote:

> Is your ID field docValues? Making it a docValues field should reduce
> the amount of JVM heap you need.
>
>
> But the export is _much_ preferred, it'll be lots faster as well. Of
> course to export you need the values you're returning to be
> docValues...
>
> Erick


Re: retrieve ids of all indexed docs efficiently

2017-01-18 Thread Erick Erickson
Is your ID field docValues? Making it a docValues field should reduce
the amount of JVM heap you need.


But the export is _much_ preferred, it'll be lots faster as well. Of
course to export you need the values you're returning to be
docValues...

Erick
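
A rough sketch of what an export request for just the ids might look like, assuming the /export handler is enabled on the collection and that id is a docValues field. The host and collection name are placeholders, and export_url is a hypothetical helper. The export handler needs both sort and fl to be given.

```python
from urllib.parse import urlencode


def export_url(base: str, collection: str, field: str = "id") -> str:
    """Build an /export URL that streams one docValues field for all docs.

    'base' and 'collection' are placeholders supplied by the caller.
    """
    params = {
        "q": "*:*",             # match every document
        "sort": f"{field} asc",  # /export requires an explicit sort
        "fl": field,             # /export requires an explicit field list
    }
    return f"{base}/{collection}/export?{urlencode(params)}"


if __name__ == "__main__":
    print(export_url("http://localhost:8983/solr", "mycollection"))
```

Unlike /select with cursorMark, this is a single streaming request, so there is no paging loop on the client side.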

On Wed, Jan 18, 2017 at 1:12 PM, Slomin, David  wrote:
> The export feature sounds promising, although I'll have to talk with our 
> deployment folks here about enabling it.
>
> The query I'm issuing is:
>
> http://:8983/solr/_shard1_replica1/select?q=*:*&sort=id+asc&rows=1000&cursorMark=&fl=id&omitHeader=true&distrib=false&wt=json
>
> Thanks,
> Div.


Re: retrieve ids of all indexed docs efficiently

2017-01-18 Thread Slomin, David
The export feature sounds promising, although I'll have to talk with our 
deployment folks here about enabling it.

The query I'm issuing is:

http://:8983/solr/_shard1_replica1/select?q=*:*&sort=id+asc&rows=1000&cursorMark=&fl=id&omitHeader=true&distrib=false&wt=json

Thanks,
Div.


On 1/18/17, 3:54 PM, "Jan Høydahl"  wrote:

> Don't know why you have mem problems. Can you paste in examples of full
> query strings during cursor mark querying? Sounds like you may be using it
> wrong.
>
> Or try exporting
>
> https://cwiki.apache.org/confluence/display/solr/Exporting+Result+Sets
>
> --
> Jan Høydahl





Re: retrieve ids of all indexed docs efficiently

2017-01-18 Thread Jan Høydahl
Don't know why you have mem problems. Can you paste in examples of full query 
strings during cursor mark querying? Sounds like you may be using it wrong.

Or try exporting

https://cwiki.apache.org/confluence/display/solr/Exporting+Result+Sets

--
Jan Høydahl

> On 18 Jan 2017, at 21:44, Slomin, David wrote:
> 
> Hi --
> 
> I'd like to retrieve the ids of all the docs in my Solr 5.3.1 index.  In my 
> query, I've set rows=1000, fl=id, and am using the cursorMark mechanism to 
> split the overall traversal into multiple requests.  Not because I care about 
> the order, but because the documentation implies that it's necessary to make 
> cursorMark work reliably, I've also set sort=id asc.  While this does give me 
> the data I need on a smaller index, it causes the heap memory utilization to 
> go through the roof; for our large indices, the Solr JVM throws an out of 
> memory exception, and we've already configured it as large as is practical 
> given the physical memory of the machine.
> 
> For what it's worth, we do use Solr cloud to split each of our indices into 
> multiple shards.  However for this query, I'm addressing a single shard 
> directly (connecting to the correct Solr server instance for one replica of 
> that shard and setting distrib=false in my query) rather than relying on Solr 
> to route and assemble the results.
> Thanks in advance,
> Div Slomin.
> 


retrieve ids of all indexed docs efficiently

2017-01-18 Thread Slomin, David
Hi --

I'd like to retrieve the ids of all the docs in my Solr 5.3.1 index.  In my 
query, I've set rows=1000, fl=id, and am using the cursorMark mechanism to 
split the overall traversal into multiple requests.  Not because I care about 
the order, but because the documentation implies that it's necessary to make 
cursorMark work reliably, I've also set sort=id asc.  While this does give me 
the data I need on a smaller index, it causes the heap memory utilization to go 
through the roof; for our large indices, the Solr JVM throws an out of memory 
exception, and we've already configured it as large as is practical given the 
physical memory of the machine.

For what it's worth, we do use SolrCloud to split each of our indices into 
multiple shards.  However, for this query, I'm addressing a single shard 
directly (connecting to the correct Solr server instance for one replica of 
that shard and setting distrib=false in my query) rather than relying on Solr 
to route and assemble the results.

Thanks in advance,
Div Slomin.
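
The traversal described in this message can be sketched roughly as follows. This is a minimal sketch, not the poster's actual client: iter_ids and the fetch callable are hypothetical names, and fetch is assumed to send the parameters to the shard's /select handler and return the parsed JSON response. The loop starts with cursorMark=* and stops when the returned nextCursorMark equals the cursorMark that was sent.

```python
from typing import Callable, Dict, Iterator


def iter_ids(fetch: Callable[[Dict[str, str]], dict],
             rows: int = 1000) -> Iterator[str]:
    """Yield every document id by following cursorMark page by page.

    'fetch' is a caller-supplied function (an assumption of this sketch)
    that performs the HTTP request and returns the decoded JSON body.
    """
    cursor = "*"  # initial cursor value for the first request
    while True:
        params = {
            "q": "*:*",
            "sort": "id asc",      # cursorMark needs a sort on the uniqueKey
            "rows": str(rows),
            "fl": "id",
            "cursorMark": cursor,
            "distrib": "false",    # single shard, as in the message above
            "omitHeader": "true",
            "wt": "json",
        }
        page = fetch(params)
        for doc in page["response"]["docs"]:
            yield doc["id"]
        next_cursor = page["nextCursorMark"]
        if next_cursor == cursor:  # unchanged cursor means no more results
            break
        cursor = next_cursor
```

Passing the request function in keeps the paging logic separate from the HTTP client, so it can be exercised against canned responses before pointing it at a large index.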