Re: BigQueryIO - Why is CREATE_NEVER not supported when using a tablespec?

2017-04-07 Thread Josh
Hi Dan,

OK, great, thanks for confirming. I will create a JIRA and submit a PR to
remove this check then.

Thanks,
Josh

On Fri, Apr 7, 2017 at 6:09 PM, Dan Halperin <dhalp...@apache.org> wrote:

> Hi Josh,
> You raise a good point. I think we had put this check in (long before
> partition tables existed) because we need schema to create a table and we
> assumed the number of tables would be unbounded. But now it's an outdated
> check, overly conservative, and probably should be removed.
>
> Would you like to send a PR to fix this?
>
> Thanks,
> Dan
>
>
> On Fri, Apr 7, 2017 at 10:03 AM, Josh <jof...@gmail.com> wrote:
>
>> Hi all,
>>
>> I have a use case where I want to stream into BigQuery, using a tablespec
>> but with CreateDisposition.CREATE_NEVER. I want to partition/shard my
>> data by date, and use BigQuery's date partitioning feature within a single
>> table (rather than creating a new BigQuery table for every day). In this
>> case writes would be made to a partition in a single table, e.g.
>> `my-project:dataset.my_table$20170407`, and in my tablespec I would just
>> be choosing the partition decorator using the window.
>>
>> Unfortunately this doesn't seem possible with BigQueryIO at the moment,
>> because it requires me to use CreateDisposition.CREATE_IF_NEEDED. I
>> can't use CreateDisposition.CREATE_IF_NEEDED because it requires me to
>> provide a table schema and my BigQuery schema isn't available at compile
>> time.
>>
>> Is there any good reason why CREATE_NEVER is not allowed when using a
>> tablespec?
>>
>> Thanks,
>> Josh
>>
>
>


Re: BigQueryIO - Why is CREATE_NEVER not supported when using a tablespec?

2017-04-07 Thread Dan Halperin
Hi Josh,
You raise a good point. I think we had put this check in (long before
partition tables existed) because we need a schema to create a table and we
assumed the number of tables would be unbounded. But now it's an outdated
check, overly conservative, and probably should be removed.

Would you like to send a PR to fix this?

Thanks,
Dan


On Fri, Apr 7, 2017 at 10:03 AM, Josh <jof...@gmail.com> wrote:

> Hi all,
>
> I have a use case where I want to stream into BigQuery, using a tablespec
> but with CreateDisposition.CREATE_NEVER. I want to partition/shard my data
> by date, and use BigQuery's date partitioning feature within a single table
> (rather than creating a new BigQuery table for every day). In this case
> writes would be made to a partition in a single table, e.g.
> `my-project:dataset.my_table$20170407`, and in my tablespec I would just
> be choosing the partition decorator using the window.
>
> Unfortunately this doesn't seem possible with BigQueryIO at the moment,
> because it requires me to use CreateDisposition.CREATE_IF_NEEDED. I can't
> use CreateDisposition.CREATE_IF_NEEDED because it requires me to provide
> a table schema and my BigQuery schema isn't available at compile time.
>
> Is there any good reason why CREATE_NEVER is not allowed when using a
> tablespec?
>
> Thanks,
> Josh
>


BigQueryIO - Why is CREATE_NEVER not supported when using a tablespec?

2017-04-07 Thread Josh
Hi all,

I have a use case where I want to stream into BigQuery, using a tablespec
but with CreateDisposition.CREATE_NEVER. I want to partition/shard my data
by date, and use BigQuery's date partitioning feature within a single table
(rather than creating a new BigQuery table for every day). In this case
writes would be made to a partition in a single table, e.g.
`my-project:dataset.my_table$20170407`, and in my tablespec I would just be
choosing the partition decorator using the window.
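
For concreteness, here is a minimal sketch of the write I have in mind. It
assumes the window-based tablespec overload,
to(SerializableFunction[BoundedWindow, String]), that BigQueryIO inherited
from the Dataflow SDK (the exact 0.6.0 signature may differ), plus a
hypothetical PCollection[TableRow] named `rows` that is already windowed
into daily windows:

import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.{CreateDisposition, WriteDisposition}
import org.apache.beam.sdk.transforms.SerializableFunction
import org.apache.beam.sdk.transforms.windowing.{BoundedWindow, IntervalWindow}
import org.joda.time.format.DateTimeFormat

// Derive the partition decorator from the window, so each daily window
// writes to the matching partition of one date-partitioned table.
val tableSpecFn = new SerializableFunction[BoundedWindow, String] {
  override def apply(w: BoundedWindow): String = {
    // FixedWindows produce IntervalWindows; the window start picks the day.
    val day = DateTimeFormat.forPattern("yyyyMMdd").withZoneUTC()
      .print(w.asInstanceOf[IntervalWindow].start)
    s"my-project:dataset.my_table$$$day"
  }
}

// This is exactly the combination the current validation rejects.
rows.apply(BigQueryIO.Write
  .to(tableSpecFn)
  .withCreateDisposition(CreateDisposition.CREATE_NEVER)
  .withWriteDisposition(WriteDisposition.WRITE_APPEND))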

Unfortunately this doesn't seem possible with BigQueryIO at the moment,
because it requires me to use CreateDisposition.CREATE_IF_NEEDED. I can't
use CreateDisposition.CREATE_IF_NEEDED because it requires me to provide a
table schema and my BigQuery schema isn't available at compile time.

Is there any good reason why CREATE_NEVER is not allowed when using a
tablespec?

Thanks,
Josh


Re: Reading/ writing xml file hangs indefinitely

2017-04-07 Thread Dan Halperin
Hi Richard,

I wonder if you're being hit by
https://issues.apache.org/jira/browse/BEAM-1309 -- namely, that the entire
/tmp directory might be being traversed.

As a sanity check, can you try moving your test file into a more specific
folder, like /tmp/beam/test_input/input.xml
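
Something like this, swapping only the path into the source from your code
below (a sketch, untested):

val source = XmlSource.from[Customer](
  new File("/tmp/beam/test_input/input.xml").toPath.toString
).withRootElement("customers").withRecordElement("customer").
  withRecordClass(classOf[Customer])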

If this resolves your issue, it's a good argument for prioritizing fixing
that issue ;)

Dan

On Fri, Apr 7, 2017 at 5:37 AM, Richard Hanson <rhan...@mailbox.org> wrote:

>
> On 06 April 2017 at 19:53 Dan Halperin <dhalp...@apache.org> wrote:
>
> Hi Richard,
>
> Can you share a little more info about your environment? Here's a
> smattering of questions for which answers may be useful.
>
> * What runner are you using?
>
> I don't specify any runner, so I believe it uses the direct runner.
>
> * What version of the SDK?
>
> Apache Beam version is 0.6.0 (beam-sdks-java-core
> and beam-runners-direct-java)
>
> * Does this reproduce in the DirectRunner?
>
> I believe this problem happens when running the DirectRunner.
>
> * Can you share a full reproduction? (e.g., in a github gist)?
>
> JDK: 1.8.0_121
>
> Scala: 2.12.1
>
> sbt: 0.13.13
>
>
> Below is the sample XML file:
>
> <?xml version="1.0"?>
> <customers>
>   <customer>
>     <age>33</age>
>     <name>John Smith</name>
>   </customer>
> </customers>
>
>
> The sample record object.
>
> @XmlRootElement
> class Customer {
>
> private var name: String = ""
>
> private var age: Int = 0
>
> private var id: Int = -1
>
> def getName():String = name
>
> @XmlElement
> def setName(name: String) = this.name = name
>
> def getAge(): Int = age
>
> @XmlElement
> def setAge(age: Int) = this.age = age
>
> def getId(): Int = id
>
> @XmlAttribute
> def setId(id: Int) = this.id = id
>
> }
>
> The pipeline portion of the code:
>
>
> val options = PipelineOptionsFactory.create
> val p = Pipeline.create(options)
>
> val source = XmlSource.from[Customer](
> new File("customers.xml").toPath.toString
> ).withRootElement("customers").withRecordElement("customer").
> withRecordClass(classOf[Customer])
>
> val sink = XmlSink.write().toFilenamePrefix("xmlout").
> ofRecordClass(classOf[Customer]).
> withRootElement("customers")
>
> p.apply(Read.from(source)).apply(Write.to(sink))
>
> p.run.waitUntilFinish
>
>
> * What is happening on the machine(s) executing the job? Is there high
> CPU? Is the disk active? Etc.
>
> There is high CPU usage, which stays at 99.x% while the Java process is
> executing (checked with the top command).
>
> 7624 user  20   0 2837.6m 582.5m  23.5m S 99.3 11.4   2:42.11 java
>
> Monitoring with iotop shows disk I/O is mostly performed by system
> processes, e.g. kworker. Only once or twice did I see the Java process (the
> only user process running on the machine) doing disk I/O.
>
> Total DISK READ : 0.00 B/s | Total DISK WRITE : 0.00 B/s
> Actual DISK READ: 0.00 B/s | Actual DISK WRITE: 0.00 B/s
>
>
> TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
> 7720 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.01 % [kworker/0:2]
>
>
> Total DISK READ : 0.00 B/s | Total DISK WRITE : 15.62 K/s
> Actual DISK READ: 0.00 B/s | Actual DISK WRITE: 0.00 B/s
> TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
>
> 7626 be/4 user 0.00 B/s 11.72 K/s 0.00 % 0.00 % java -Xms~h.jar test
>
> 7633 be/4 user 0.00 B/s 3.91 K/s 0.00 % 0.00 % java -Xms~h.jar test
>
>
> Thanks,
> Dan
>
> On Tue, Apr 4, 2017 at 9:33 AM, Richard Hanson <rhan...@mailbox.org>
> wrote:
>
> I am testing Apache Beam to read/write XML files. But I have encountered a
> problem: even when the code just reads a single XML file and writes it out
> without doing any transformation, the process seems to hang indefinitely.
> The output looks like this:
>
> [pool-2-thread-5-ScalaTest-running-XmlSpec] INFO org.apache.beam.sdk.io.FileBasedSource - Matched 1 files for pattern /tmp/input/my.xml
> [pool-6-thread-1] INFO org.apache.beam.sdk.io.Write - Initializing write operation org.apache.beam.sdk.io.XmlSink$XmlWriteOperation@1c72df2c
>
>
> The code basically does the following:
>
> val options = PipelineOptionsFactory.create
> val p = Pipeline.create(options)
>
>
> val xml = XmlSource.from[Record](new File("/tmp/input/my.xml").toPath.toString)
>   .withRootElement("RootE").withRecordElement("Record")
>   .withRecordClass(classOf[Record])
>
> p.apply(Read.from(xml)).apply(Write.to(XmlSink.write()
>   .toFilenamePrefix("xml").ofRecordClass(classOf[Record])
>   .withRootElement("RootE")))
>
> p.run.waitUntilFinish
>
>
> What part may be missing in my program?
>
> Thanks
>
>
>


Re: Reading/ writing xml file hangs indefinitely

2017-04-07 Thread Richard Hanson

> On 06 April 2017 at 19:53 Dan Halperin <dhalp...@apache.org> wrote:
> 
> Hi Richard,
> 
> Can you share a little more info about your environment? Here's a 
> smattering of questions for which answers may be useful.
> 
> * What runner are you using? 
> 

I don't specify any runner, so I believe it uses the direct runner.

> * What version of the SDK?
> 

Apache Beam version is 0.6.0 (beam-sdks-java-core and beam-runners-direct-java)

> * Does this reproduce in the DirectRunner?
> 

I believe this problem happens when running the DirectRunner.

> * Can you share a full reproduction? (e.g., in a github gist)?
> 

JDK: 1.8.0_121

Scala: 2.12.1

sbt: 0.13.13


Below is the sample XML file:

<?xml version="1.0"?>
<customers>
  <customer>
    <age>33</age>
    <name>John Smith</name>
  </customer>
</customers>


The sample record object.

@XmlRootElement
class Customer {

private var name: String = ""

private var age: Int = 0

private var id: Int = -1

def getName():String = name

@XmlElement
def setName(name: String) = this.name = name

def getAge(): Int = age

@XmlElement
def setAge(age: Int) = this.age = age

def getId(): Int = id

@XmlAttribute
def setId(id: Int) = this.id = id

}
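
As a side check (a sketch with arbitrary test values, separate from the
pipeline), marshalling one Customer with plain JAXB shows whether the
annotations above produce the <customer> element layout, with id as an
attribute, that XmlSource expects:

import java.io.StringWriter
import javax.xml.bind.{JAXBContext, Marshaller}

val c = new Customer
c.setId(1)
c.setAge(33)
c.setName("John Smith")

// Marshal to a string and inspect the element and attribute names.
val m = JAXBContext.newInstance(classOf[Customer]).createMarshaller()
m.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, true)
val out = new StringWriter
m.marshal(c, out)
println(out)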

The pipeline portion of the code:


val options = PipelineOptionsFactory.create
val p = Pipeline.create(options)

val source = XmlSource.from[Customer](
new File("customers.xml").toPath.toString
).withRootElement("customers").withRecordElement("customer").
withRecordClass(classOf[Customer])

val sink = XmlSink.write().toFilenamePrefix("xmlout").
ofRecordClass(classOf[Customer]).
withRootElement("customers")

p.apply(Read.from(source)).apply(Write.to(sink))

p.run.waitUntilFinish


> * What is happening on the machine(s) executing the job? Is there high 
> CPU? Is the disk active? Etc.
> 

There is high CPU usage, which stays at 99.x% while the Java process is
executing (checked with the top command).

7624 user  20   0 2837.6m 582.5m  23.5m S 99.3 11.4   2:42.11 java

Monitoring with iotop shows disk I/O is mostly performed by system
processes, e.g. kworker. Only once or twice did I see the Java process (the
only user process running on the machine) doing disk I/O.

Total DISK READ : 0.00 B/s | Total DISK WRITE : 0.00 B/s
Actual DISK READ: 0.00 B/s | Actual DISK WRITE: 0.00 B/s


TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
7720 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.01 % [kworker/0:2]


Total DISK READ : 0.00 B/s | Total DISK WRITE : 15.62 K/s
Actual DISK READ: 0.00 B/s | Actual DISK WRITE: 0.00 B/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND

7626 be/4 user 0.00 B/s 11.72 K/s 0.00 % 0.00 % java -Xms~h.jar test

7633 be/4 user 0.00 B/s 3.91 K/s 0.00 % 0.00 % java -Xms~h.jar test


> Thanks,
> Dan
> 
> On Tue, Apr 4, 2017 at 9:33 AM, Richard Hanson <rhan...@mailbox.org> wrote:
> 
> > I am testing Apache Beam to read/write XML files. But I have encountered
> > a problem: even when the code just reads a single XML file and writes it
> > out without doing any transformation, the process seems to hang
> > indefinitely. The output looks like this:
> > 
> > [pool-2-thread-5-ScalaTest-running-XmlSpec] INFO org.apache.beam.sdk.io.FileBasedSource - Matched 1 files for pattern /tmp/input/my.xml
> > [pool-6-thread-1] INFO org.apache.beam.sdk.io.Write - Initializing write operation org.apache.beam.sdk.io.XmlSink$XmlWriteOperation@1c72df2c
> > 
> > 
> > The code basically does the following:
> > 
> > val options = PipelineOptionsFactory.create
> > val p = Pipeline.create(options)
> > 
> > 
> > val xml = XmlSource.from[Record](new File("/tmp/input/my.xml").toPath.toString)
> >   .withRootElement("RootE").withRecordElement("Record")
> >   .withRecordClass(classOf[Record])
> >
> > p.apply(Read.from(xml)).apply(Write.to(XmlSink.write()
> >   .toFilenamePrefix("xml").ofRecordClass(classOf[Record])
> >   .withRootElement("RootE")))
> > 
> > p.run.waitUntilFinish
> > 
> > 
> > What part may be missing in my program?
> > 
> > Thanks
> > 
>