Re: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

2017-06-20 Thread ZiYuan
No intention of spamming but I also want to mention tika-python
<https://github.com/chrismattmann/tika-python> in the toolchain.

Ziyuan
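
For the client-side route discussed in this thread, a minimal Python sketch could pair tika-python with a plain JSON update. It assumes the tika package is installed, that the schema has a stored field named content, and that the resulting body is POSTed to the core's /update handler with Content-Type application/json. The function names are hypothetical and only illustrate the idea.

```python
import json


def extract_text(path):
    """Extract plain text from a file with tika-python.

    The import is done lazily so the rest of this sketch runs even when
    the third-party package (pip install tika) is not installed.
    """
    from tika import parser

    parsed = parser.from_file(path)
    # "content" can be None for files Tika cannot read, so fall back to "".
    return (parsed.get("content") or "").strip()


def build_update_body(doc_id, text):
    """Build the JSON body for a POST to /solr/<core>/update.

    Assumption: the schema has "id" and a stored "content" field.
    """
    return json.dumps([{"id": doc_id, "content": text}])
```

Parsing on the client this way keeps Tika's memory and CPU use off the Solr box, which is the reason given in the thread for preferring it in production.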

On Tue, Jun 20, 2017 at 2:29 PM, ZiYuan <ziyu...@gmail.com> wrote:

> Dear Erick and Timothy,
>
> I also took a look at the Python clients (say, SolrClient and pysolr)
> because Python is my main programming language. I have an impression that
> 1. they send HTTP requests to the server according to the server APIs; 2.
> they are not official and thus possibly not up to date. Does SolrJ talk to
> the server via HTTP or some other more native ways? Is the main benefit of
> SolrJ over other clients the official shipment with Solr? Thank you.
>
> Best regards,
> Ziyuan
>
> On Jun 19, 2017 18:43, "ZiYuan" <ziyu...@gmail.com> wrote:
>
>> Dear Erick and Timothy,
>>
>> yes I will parse from the client for all the benefits. I am just trying
>> to figure out what is going on by indexing one or two PDF files first.
>> Thank you both.
>>
>> Best regards,
>> Ziyuan
>>
>> On Mon, Jun 19, 2017 at 6:17 PM, Erick Erickson <erickerick...@gmail.com>
>> wrote:
>>
>>> bq: Hope that there is no side effect of not mapping the PDF
>>>
>>> Well, yes it will have that side effect. You can cure that with a
>>> copyField directive from content to _text_.
>>>
>>> But do really consider running this as a SolrJ program on the client.
>>> Tim knows in far more painful detail than I do what kinds of problems
>>> there are when parsing all the different formats so I'd _really_
>>> follow his advice.
>>>
>>> Tika pretty much has an impossible job. "Here, try to parse all these
>>> different formats, implemented by different vendors with different
>>> versions that more or less follow a spec which really isn't a spec in
>>> many cases just recommendations using packages that may or may not be
>>> actively maintained. And by the way, we'll try to handle that 1G
>>> document that someone sends us, but don't blame us if we hit an
>>> OOM.". When Tika is run on the same box as Solr any problems in
>>> that entire chain can adversely affect your search.
>>>
>>> Not to mention that Tika has to do some heavy lifting, using CPU
>>> cycles that are unavailable for Solr.
>>>
>>> Extracting Request Handler is a fine way to get started, but for
>>> production seriously consider a separate client.
>>>
>>> Best,
>>> Erick
>>>
>>> On Mon, Jun 19, 2017 at 6:24 AM, ZiYuan <ziyu...@gmail.com> wrote:
>>> > Hi Erick,
>>> >
>>> > Now it is clear. I have to update the request handler of
>>> > /update/extract/ from
>>> > "defaults":{"fmap.content":"_text_"}
>>> > to
>>> > "defaults":{"fmap.content":"content"}
>>> > to fill the field.
>>> >
>>> > Hope that there is no side effect of not mapping the PDF content to
>>> > _text_.
>>> > Thank you for the hint.
>>> >
>>> > Best regards,
>>> > Ziyuan
>>> >
>>> > On Mon, Jun 19, 2017 at 1:55 PM, Erik Hatcher <erik.hatc...@gmail.com>
>>> > wrote:
>>> >
>>> >> Ziyuan -
>>> >>
>>> >> You may be interested in the example/files that ships with Solr too.
>>> >> It’s got schema and config and even UI for file indexing and searching.
>>> >> Check out README.txt under example/files in your Solr install.
>>> >>
>>> >> Erik

Re: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

2017-06-20 Thread ZiYuan
Dear Erick and Timothy,

I also took a look at the Python clients (say, SolrClient and pysolr)
because Python is my main programming language. My impression is that
(1) they send HTTP requests to the server according to the server APIs, and
(2) they are not official and thus possibly not up to date. Does SolrJ talk
to the server via HTTP or in some more native way? Is the main benefit of
SolrJ over other clients that it ships officially with Solr? Thank you.

Best regards,
Ziyuan

On Jun 19, 2017 18:43, "ZiYuan" <ziyu...@gmail.com> wrote:

> Dear Erick and Timothy,
>
> yes I will parse from the client for all the benefits. I am just trying to
> figure out what is going on by indexing one or two PDF files first. Thank
> you both.
>
> Best regards,
> Ziyuan
>
> On Mon, Jun 19, 2017 at 6:17 PM, Erick Erickson <erickerick...@gmail.com>
> wrote:
>
>> bq: Hope that there is no side effect of not mapping the PDF
>>
>> Well, yes it will have that side effect. You can cure that with a
>> copyField directive from content to _text_.
>>
>> But do really consider running this as a SolrJ program on the client.
>> Tim knows in far more painful detail than I do what kinds of problems
>> there are when parsing all the different formats so I'd _really_
>> follow his advice.
>>
>> Tika pretty much has an impossible job. "Here, try to parse all these
>> different formats, implemented by different vendors with different
>> versions that more or less follow a spec which really isn't a spec in
>> many cases just recommendations using packages that may or may not be
>> actively maintained. And by the way, we'll try to handle that 1G
>> document that someone sends us, but don't blame us if we hit an
>> OOM.". When Tika is run on the same box as Solr any problems in
>> that entire chain can adversely affect your search.
>>
>> Not to mention that Tika has to do some heavy lifting, using CPU
>> cycles that are unavailable for Solr.
>>
>> Extracting Request Handler is a fine way to get started, but for
>> production seriously consider a separate client.
>>
>> Best,
>> Erick
>>
>> On Mon, Jun 19, 2017 at 6:24 AM, ZiYuan <ziyu...@gmail.com> wrote:
>> > Hi Erick,
>> >
>> > Now it is clear. I have to update the request handler of
>> > /update/extract/ from
>> > "defaults":{"fmap.content":"_text_"}
>> > to
>> > "defaults":{"fmap.content":"content"}
>> > to fill the field.
>> >
>> > Hope that there is no side effect of not mapping the PDF content to
>> > _text_.
>> > Thank you for the hint.
>> >
>> > Best regards,
>> > Ziyuan
>> >
>> > On Mon, Jun 19, 2017 at 1:55 PM, Erik Hatcher <erik.hatc...@gmail.com>
>> > wrote:
>> >
>> >> Ziyuan -
>> >>
>> >> You may be interested in the example/files that ships with Solr too.
>> >> It’s got schema and config and even UI for file indexing and searching.
>> >> Check out README.txt under example/files in your Solr install.
>> >>
>> >> Erik

Re: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

2017-06-19 Thread ZiYuan
Dear Erick and Timothy,

Yes, I will parse on the client for all the benefits. I am just trying to
figure out what is going on by indexing one or two PDF files first. Thank
you both.

Best regards,
Ziyuan

On Mon, Jun 19, 2017 at 6:17 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

> bq: Hope that there is no side effect of not mapping the PDF
>
> Well, yes it will have that side effect. You can cure that with a
> copyField directive from content to _text_.
>
> But do really consider running this as a SolrJ program on the client.
> Tim knows in far more painful detail than I do what kinds of problems
> there are when parsing all the different formats so I'd _really_
> follow his advice.
>
> Tika pretty much has an impossible job. "Here, try to parse all these
> different formats, implemented by different vendors with different
> versions that more or less follow a spec which really isn't a spec in
> many cases just recommendations using packages that may or may not be
> actively maintained. And by the way, we'll try to handle that 1G
> document that someone sends us, but don't blame us if we hit an
> OOM.". When Tika is run on the same box as Solr any problems in
> that entire chain can adversely affect your search.
>
> Not to mention that Tika has to do some heavy lifting, using CPU
> cycles that are unavailable for Solr.
>
> Extracting Request Handler is a fine way to get started, but for
> production seriously consider a separate client.
>
> Best,
> Erick
>
> On Mon, Jun 19, 2017 at 6:24 AM, ZiYuan <ziyu...@gmail.com> wrote:
> > Hi Erick,
> >
> > Now it is clear. I have to update the request handler of /update/extract/
> > from
> > "defaults":{"fmap.content":"_text_"}
> > to
> > "defaults":{"fmap.content":"content"}
> > to fill the field.
> >
> > Hope that there is no side effect of not mapping the PDF content to
> > _text_.
> > Thank you for the hint.
> >
> > Best regards,
> > Ziyuan
> >
> > On Mon, Jun 19, 2017 at 1:55 PM, Erik Hatcher <erik.hatc...@gmail.com>
> > wrote:
> >
> >> Ziyuan -
> >>
> >> You may be interested in the example/files that ships with Solr too.
> >> It’s got schema and config and even UI for file indexing and searching.
> >> Check out README.txt under example/files in your Solr install.
> >>
> >> Erik

Re: how to leave the mailing list? eof

2017-06-19 Thread ZiYuan
You can check this page: http://lucene.apache.org/solr/community.html

On Mon, Jun 19, 2017 at 5:22 PM, david fernandes 
wrote:



Re: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

2017-06-19 Thread ZiYuan
Hi Erick,

Now it is clear. I have to update the request handler of /update/extract/
from
"defaults":{"fmap.content":"_text_"}
to
"defaults":{"fmap.content":"content"}
to fill the field.

Hope that there is no side effect of not mapping the PDF content to _text_.
Thank you for the hint.

Best regards,
Ziyuan
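
As a rough illustration of the mapping described above, the sketch below builds two requests in Python: an /update/extract URL that passes fmap.content as a per-request parameter (a swapped-in alternative to editing the handler defaults), and the Schema API command for the copyField from content to _text_ that Erick suggests elsewhere in this thread. The base URL, the core name pdfs, and the function names are assumptions for illustration; nothing is sent over the network here.

```python
import json
from urllib.parse import urlencode


def extract_url(base, core, doc_id, content_field="content"):
    """Build an /update/extract URL that maps Tika's body text to a field.

    fmap.content, literal.id and commit are Solr Cell request parameters;
    the field name "content" is the one used in this thread.
    """
    params = {
        "literal.id": doc_id,           # unique key for the uploaded file
        "fmap.content": content_field,  # where the extracted text goes
        "commit": "true",
    }
    return f"{base}/{core}/update/extract?{urlencode(params)}"


def copy_field_command(source="content", dest="_text_"):
    """Build the Schema API body (POST to /solr/<core>/schema) for a copyField."""
    return json.dumps({"add-copy-field": {"source": source, "dest": dest}})
```

With the copyField in place, the text stays searchable through the default _text_ field while the stored content field remains available for display and highlighting.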

On Mon, Jun 19, 2017 at 1:55 PM, Erik Hatcher <erik.hatc...@gmail.com>
wrote:

> Ziyuan -
>
> You may be interested in the example/files that ships with Solr too.  It’s
> got schema and config and even UI for file indexing and searching.  Check
> out README.txt under example/files in your Solr install.
>
> Erik
>
> > On Jun 19, 2017, at 6:52 AM, ZiYuan <ziyu...@gmail.com> wrote:
> >
> > Hi Erick,
> >
> > thanks very much for the explanations! Clarification for question 2: more
> > specifically I cannot see the field content in the returned JSON, with
> > the same definitions as in the post
> > <http://www.codewrecks.com/blog/index.php/2013/05/27/hilight-matched-text-inside-documents-indexed-with-solr-plus-tika/>
> > :
> >
> > <field ... stored="true"/>
> > <field ... indexed="true" stored="false"/>
> >
> > Is it so that Tika does not fill these two fields automatically and I
> > have to write some client code to fill them?
> >
> > Best regards,
> > Ziyuan
> >
> >
> > On Sun, Jun 18, 2017 at 8:07 PM, Erick Erickson <erickerick...@gmail.com>
> > wrote:
> >
> >> 1> Yes, you can use your single definition. The author identifies the
> >> "text" field as a catch-all. Somewhere in the schema there'll be a
> >> copyField directive copying (perhaps) many different fields to the
> >> "text" field. That permits simple searches against a single field
> >> rather than, say, using edismax to search across multiple separate
> >> fields.
> >>
> >> 2> The link you referenced is for Data Import Handler, which is much
> >> different than just posting files to Solr. See
> >> ExtractingRequestHandler:
> >> https://cwiki.apache.org/confluence/display/solr/
> >> Uploading+Data+with+Solr+Cell+using+Apache+Tika.
> >> There are ways to map meta-data fields from the doc into specific
> >> fields matching your schema. Be a little careful here. There is no
> >> standard across different types of docs as to what meta-data field is
> >> included. PDF might have a "last_edited" field. Word might have a
> >> "last_modified" field where the two mean the same thing. Here's a link
> >> to a SolrJ program that'll dump all the fields:
> >> https://lucidworks.com/2012/02/14/indexing-with-solrj/. You can easily
> >> hack out the DB bits.
> >>
> >> BTW, once you get more familiar with processing, I strongly recommend
> >> you do the document processing on the client, the reasons are outlined
> >> in that article.
> >>
> >> bq: even I define the fields as he said I cannot see them in the
> >> search results as keys in JSON
> >> are the fields set as stored="true"? They must be to be returned in
> >> requests (skipping the docValues discussion here).
> >>
> >> 3> Yes, the text field is a concatenation of all the other ones.
> >> Because it has stored=false, you can only search it, you cannot
> >> highlight or view. Fields you highlight must have stored=true BTW.
> >>
> >> Whether or not you can highlight "Trevor Hastie" depends on a lot of
> >> things, most particularly whether that text is ever actually in a
> >> field in your index. There's no guarantee that the name
> >> of the file is indexed in a searchable/highlightable way.
> >>
> >> And the query q=id:Trevor Hastie won't do what you think. It'll be
> >> parsed as
> >> id:Trevor _text_:Hastie
> >> _text_ is the default field, look for a "df" parameter in your request
> >> handler in solrconfig.xml (usually "/select" or "/query").

Re: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

2017-06-19 Thread ZiYuan
Hi Erick,

thanks very much for the explanations! Clarification for question 2: more
specifically I cannot see the field content in the returned JSON, with the
same definitions as in the post
<http://www.codewrecks.com/blog/index.php/2013/05/27/hilight-matched-text-inside-documents-indexed-with-solr-plus-tika/>
:

<field ... stored="true"/>
<field ... indexed="true" stored="false"/>

Is it so that Tika does not fill these two fields automatically and I have
to write some client code to fill them?

Best regards,
Ziyuan


On Sun, Jun 18, 2017 at 8:07 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

> 1> Yes, you can use your single definition. The author identifies the
> "text" field as a catch-all. Somewhere in the schema there'll be a
> copyField directive copying (perhaps) many different fields to the
> "text" field. That permits simple searches against a single field
> rather than, say, using edismax to search across multiple separate
> fields.
>
> 2> The link you referenced is for Data Import Handler, which is much
> different than just posting files to Solr. See
> ExtractingRequestHandler:
> https://cwiki.apache.org/confluence/display/solr/
> Uploading+Data+with+Solr+Cell+using+Apache+Tika.
> There are ways to map meta-data fields from the doc into specific
> fields matching your schema. Be a little careful here. There is no
> standard across different types of docs as to what meta-data field is
> included. PDF might have a "last_edited" field. Word might have a
> "last_modified" field where the two mean the same thing. Here's a link
> to a SolrJ program that'll dump all the fields:
> https://lucidworks.com/2012/02/14/indexing-with-solrj/. You can easily
> hack out the DB bits.
>
> BTW, once you get more familiar with processing, I strongly recommend
> you do the document processing on the client, the reasons are outlined
> in that article.
>
> bq: even I define the fields as he said I cannot see them in the
> search results as keys in JSON
> are the fields set as stored="true"? They must be to be returned in
> requests (skipping the docValues discussion here).
>
> 3> Yes, the text field is a concatenation of all the other ones.
> Because it has stored=false, you can only search it, you cannot
> highlight or view. Fields you highlight must have stored=true BTW.
>
> Whether or not you can highlight "Trevor Hastie" depends on a lot of
> things, most particularly whether that text is ever actually in a
> field in your index. There's no guarantee that the name
> of the file is indexed in a searchable/highlightable way.
>
> And the query q=id:Trevor Hastie won't do what you think. It'll be parsed
> as
> id:Trevor _text_:Hastie
> _text_ is the default field, look for a "df" parameter in your request
> handler in solrconfig.xml (usually "/select" or "/query").
>
> On Sat, Jun 17, 2017 at 3:04 PM, ZiYuan <ziyu...@gmail.com> wrote:
> > Hi,
> >
> > I am new to Solr and I need to implement a full-text search of some PDF
> > files. The indexing part works out of the box by using bin/post. I can
> > see search results in the admin UI given some queries, though without the
> > matched texts and the context.
> >
> > Now I am reading this post
> > <http://www.codewrecks.com/blog/index.php/2013/05/27/hilight-matched-text-inside-documents-indexed-with-solr-plus-tika/>
> > for the highlighting part. It is for an older version of Solr when
> > managed schema was not available. Before fully understanding what it is
> > doing I have some questions:
> >
> > 1. He defined two fields:
> >
> > <field ... multiValued="false"/>
> > <field ... multiValued="true"/>
> >
> > But why are there two fields needed? Can I define a field
> >
> > <field ... multiValued="true"/>
> >
> > to capture the full text?
> >
> > 2. How are the fields filled? I don't see relevant information in
> > TikaEntityProcessor's documentation
> > <https://lucene.apache.org/solr/6_6_0/solr-dataimporthandler-extras/org/apache/solr/handler/dataimport/TikaEntityProcessor.html#fields.inherited.from.class.org.apache.solr.handler.dataimport.EntityProcessorBase>.
> > The current text extractor should already be Tika (I can see
> >
> > "x_parsed_by":
> > ["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.pdf.PDFParser"]
> >
> > in the returned JSON of some query). But even if I define the fields as he
> > said I cannot see them in the search results as keys in JSON.
> >
> > 3. The _text_ field seems a concatenation of other fields, does it
> > contain the full text? Though it does not seem to be accessible by default.
> >
> > To be brief, using The Elements of Statistical Learning
> > <http://statweb.stanford.edu/~tibs/ElemStatLearn/printings/ESLII_print10.pdf>
> > as an example, how to highlight the relevant texts for the query "SVM"?
> > And if changing the file name into "The Elements of Statistical Learning -
> > Trevor Hastie.pdf" and post it, how to highlight "Trevor Hastie" for the
> > query "id:Trevor Hastie"?
> >
> > Thank you.
> >
> > Best regards,
> > Ziyuan
>


Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

2017-06-17 Thread ZiYuan
Hi,

I am new to Solr and I need to implement a full-text search of some PDF
files. The indexing part works out of the box by using bin/post. I can see
search results in the admin UI given some queries, though without the
matched texts and the context.

Now I am reading this post
<http://www.codewrecks.com/blog/index.php/2013/05/27/hilight-matched-text-inside-documents-indexed-with-solr-plus-tika/>
for the highlighting part. It is for an older version of Solr when managed
schema was not available. Before fully understanding what it is doing I have
some questions:

1. He defined two fields:

<field ... multiValued="false"/>
<field ... multiValued="true"/>

But why are there two fields needed? Can I define a field

<field ... multiValued="true"/>

to capture the full text?

2. How are the fields filled? I don't see relevant information in
TikaEntityProcessor's documentation
<https://lucene.apache.org/solr/6_6_0/solr-dataimporthandler-extras/org/apache/solr/handler/dataimport/TikaEntityProcessor.html#fields.inherited.from.class.org.apache.solr.handler.dataimport.EntityProcessorBase>.
The current text extractor should already be Tika (I can see

"x_parsed_by":
["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.pdf.PDFParser"]

in the returned JSON of some query). But even if I define the fields as he
said I cannot see them in the search results as keys in JSON.

3. The _text_ field seems a concatenation of other fields, does it contain
the full text? Though it does not seem to be accessible by default.

To be brief, using The Elements of Statistical Learning
<http://statweb.stanford.edu/~tibs/ElemStatLearn/printings/ESLII_print10.pdf>
as an example, how to highlight the relevant texts for the query "SVM"? And
if changing the file name into "The Elements of Statistical Learning -
Trevor Hastie.pdf" and post it, how to highlight "Trevor Hastie" for the
query "id:Trevor Hastie"?

Thank you.

Best regards,
Ziyuan
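
One way to picture the highlighting request asked about above is the following Python sketch, which only builds the /select URL. hl, hl.fl and hl.snippets are standard Solr highlighting parameters; the base URL, the core name pdfs, the stored field named content, and the helper names are assumptions for illustration. As noted in the replies in this thread, the highlighted field has to be stored="true" for snippets to come back.

```python
from urllib.parse import urlencode


def highlight_url(base, core, q, hl_field="content", snippets=3):
    """Build a /select URL that asks Solr for highlighted snippets."""
    params = {
        "q": q,
        "hl": "true",            # turn highlighting on
        "hl.fl": hl_field,       # field to take snippets from (must be stored)
        "hl.snippets": snippets, # maximum snippets per document
    }
    return f"{base}/{core}/select?{urlencode(params)}"


def phrase(text):
    """Quote a multi-word value so every word is applied to the same field."""
    return '"' + text.replace('"', '\\"') + '"'


# Example: search the assumed "content" field for SVM with highlighting.
svm_url = highlight_url("http://localhost:8983/solr", "pdfs", "content:SVM")
# Example: keep both words of the name on the id field instead of letting
# the second word fall through to the default field.
name_query = "id:" + phrase("Trevor Hastie")
```

The snippets are returned in a separate highlighting section of the response, keyed by document id, rather than inside each document.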


Will Solr flush docs to disk when ram buffer is full (time of auto commit is not reached yet)?

2017-01-19 Thread Ziyuan Qin
Hi All,

I'm trying to understand how Solr works with disk IO during and between hard
commits. I hope you can help me.

Let's assume soft commit is turned off and autoCommit is turned on.

Then during a hard commit:

1. The tlog is truncated: a new tlog is started. (Disk IO involved)
2. The current index segment is closed and flushed. (Disk IO involved)

Between two hard commits:
1. Newly added documents are held in the RAM buffer first (defined in
solrconfig as ramBufferSizeMB; no disk IO involved).
I assume this buffer actually hosts an open segment in RAM, am I right?
2. What will happen when the RAM buffer is full and the autocommit time has
not been reached yet? Will Solr flush the segment in RAM to disk anyway?

Thank you very much,
Atom